Genome evolution: a sequence-centric approach

Lecture 8-9: Concepts in population genetics

Probabilistic models: simple tree models; HMMs and variants; PhyloHMM, DBN; context-aware MM; factor graphs
Inference: DP; sampling; variational approximation; LBP
Parameter estimation: EM; generalized EM (optimize free energy)
Genome structure: tree of life; genome size; elements of genome structure; elements of genomic information
Mutations
Population
Inferring selection

(Probability, Calculus/Matrix theory, some graph theory, some statistics)

Today's refs: Hartl and Clark, topics from Chapters 3-7. See Graur/Li chapter 2 (easy to read overview) and Lynch chapter 4 (more advanced)

Studying Populations

Models:
A set of individuals, genomes
Ancestry relations or hierarchies

Experiments:
Field studies, diversity/genotyping
Experimental evolution

Åland Islands, Glanville fritillary population

mtDNA human migration patterns

http://www.helsinki.fi/science/metapop/english/Species/Cinxia.htm

Species and populations

What is a species?

Multiple definitions; most of them rely on free flow of genetic information within the species and weak flow of information across its boundary

[Figure: Species 1, Species 2]

Species can emerge through the formation of reproductive barriers

Allopatric speciation – occurs through geographical separation
Parapatric speciation – occurs without geographical separation but with weak flow of genetic information
Sympatric speciation – occurs while information is flowing (controversial)

Barriers can be genetic, physical, or behavioral

Population dynamics

We think of a species genome as representing the population “average” genomic information

Individuals have genomes that are closely related to the “species genome”, but differ from it at certain loci (alleles)

As the population evolves there are continuous changes in allele frequencies, which may ultimately result in changes in the genome (fixation)

We can measure and quantify just a few aspects of these evolutionary dynamics:
Size of populations
Allele frequencies
The average homozygosity/heterozygosity of an allele
How many alleles exist at a locus

Population genetics deals with theories that predict the behavior of these quantities using simple assumptions on the evolutionary dynamics

In haploid populations (bacteria), genotypes are determined by one haplotype and ancestral relations are simple trees

In diploid populations things are a bit more complex, as genotypes can be homozygous or heterozygous at each locus.

Frequency estimates

We will be dealing with estimation of allele frequencies.

As a reminder, when sampling n times from a population with an allele of frequency p, the number of copies V of the allele that we observe is distributed as a binomial variable. This can be further approximated using a normal distribution:

$$V \sim B(n, p) \approx N(np,\ np(1-p))$$

When estimating the frequency $\hat{p}$ from the number of successes, we therefore have a standard error of:

$$s = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
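A minimal sketch of the estimate and its standard error, $s = \sqrt{\hat{p}(1-\hat{p})/n}$. The function name and the example counts (30 copies in 100 samples) are illustrative assumptions, not data from the lecture.

```python
import math

def allele_frequency_estimate(successes, n):
    """Return the estimated allele frequency and its standard error.

    successes: number of sampled alleles that carry the allele of interest
    n: total number of sampled alleles
    """
    p_hat = successes / n
    # Normal approximation to the binomial: s = sqrt(p_hat * (1 - p_hat) / n)
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, std_err

# Hypothetical sample: the allele was seen 30 times among 100 sampled alleles
p_hat, std_err = allele_frequency_estimate(30, 100)
print(f"p_hat = {p_hat:.3f} +/- {std_err:.3f}")
```

Because the error shrinks like $1/\sqrt{n}$, quadrupling the sample size only halves the uncertainty.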

Simplest model: Hardy-Weinberg

Studying dynamics of the frequencies of two alleles A/a of a gene

Assume:
Diploid organisms
Sexual reproduction
Non-overlapping generations
Random mating
Males and females have the same allele frequencies
Large population, no migration
No mutations, no selection on the alleles under study

Hardy-Weinberg equilibrium: with allele frequencies $P(A) = p$ and $P(a) = q$, the genotype frequencies are

$$P(AA) = p^2, \qquad P(Aa) = 2pq, \qquad P(aa) = q^2$$

[Figure: the genotypes AA, Aa, aa of one generation contribute A and a alleles that form the genotypes AA, Aa, aa of the next generation, under random mating and non-overlapping generations]

Under the model assumptions, equilibrium is reached within one generation

Testing Hardy-Weinberg using chi-square statistics

HW oversimplifies everything, but it can be used as a baseline to test whether interesting evolution is going on for some allele

A classical example is the blood group genotypes M/N (Sanger 1975). This genotype determines the expression of a polysaccharide on red blood cell surfaces, so it was quantifiable before the genomic era:

Genotype   Observed   HW expected
MM         298        294.3
MN         489        496
NN         213        209.3

The expected counts follow $P(AA) = p^2$, $P(Aa) = 2pq$, $P(aa) = q^2$, with p estimated from the observed counts.

$$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 0.22$$

Chi-square significance can be computed from the chi-square distribution with df degrees of freedom.

Here: df = #classes – #parameters – 1 = 3 (MM/MN/NN) – 1 (p) – 1 = 1
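The test on the M/N counts can be sketched in a few lines. This is an illustration under the assumptions stated in the comments: the function name is ours, and the p-value uses the closed form for one degree of freedom, $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$, instead of a statistics library.

```python
import math

def hardy_weinberg_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic of observed genotype counts against HW expectations."""
    n = n_AA + n_Aa + n_aa
    # Allele frequency of A estimated from the genotype counts
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Tail probability of a chi-square variable with exactly 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value

# M/N blood group counts from the table above (MM, MN, NN)
p, expected, chi2, p_value = hardy_weinberg_chi_square(298, 489, 213)
print(f"p = {p:.4f}, chi2 = {chi2:.2f}, p-value = {p_value:.2f}")
```

A large p-value here means the observed counts give no reason to reject Hardy-Weinberg proportions for this locus.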

Recombination and linkage

Assume two loci with alleles A1, A2 and B1, B2, with allele frequencies p1, p2 and q1, q2

Linkage equilibrium:

$$P(A_1B_1) = p_1q_1, \quad P(A_1B_2) = p_1q_2, \quad P(A_2B_1) = p_2q_1, \quad P(A_2B_2) = p_2q_2$$

Only double heterozygotes allow recombination to change haplotype frequencies:

A1B1 / A2B2
A1B2 / A2B1

[Figure: recombination in a double heterozygote converts the haplotypes A1 B1 and A2 B2 into A1 B2 and A2 B1]

The recombination fraction r: the proportion of recombinant gametes generated from a double heterozygote

For loci on different chromosomes: r = 0.5
For loci on the same chromosome, r is a function of the distance and possibly other factors

Linkage disequilibrium (LD)

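Denote the haplotype frequencies by

$$P_{11} = P(A_1B_1), \quad P_{12} = P(A_1B_2), \quad P_{21} = P(A_2B_1), \quad P_{22} = P(A_2B_2)$$

Define the linkage disequilibrium parameter D as:

$$D = P_{11} - p_1q_1 = P_{11}P_{22} - P_{12}P_{21}$$

Next generation:

$$P'_{11} = (1-r)P_{11} + r\,p_1q_1$$

The first term covers gametes with no recombination; the second covers recombination on any A1- / -B1 pair.

$$P'_{11} - p_1q_1 = (1-r)(P_{11} - p_1q_1)$$

$$D_n = (1-r)D_{n-1} = (1-r)^nD_0$$

[Figure: D as a function of generation for r = 0.05, r = 0.2 and r = 0.5. A double heterozygote A1 B1 / A2 B2 transmits the parental haplotypes with probability 1-r and the recombinant haplotypes A1 B2, A2 B1 with probability r]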
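The decay of D can be checked numerically by iterating the recursion $P'_{11} = (1-r)P_{11} + r\,p_1q_1$ and comparing with the closed form $D_n = (1-r)^nD_0$. A small sketch; the function names and the starting values are illustrative assumptions.

```python
def next_p11(p11, p1, q1, r):
    """One generation of random mating with recombination fraction r."""
    return (1 - r) * p11 + r * p1 * q1

def ld_trajectory(p11, p1, q1, r, generations):
    """Return [D_0, D_1, ..., D_generations] with D = P11 - p1*q1."""
    trajectory = [p11 - p1 * q1]
    for _ in range(generations):
        p11 = next_p11(p11, p1, q1, r)
        trajectory.append(p11 - p1 * q1)
    return trajectory

# Hypothetical start: only A1B1 and A2B2 haplotypes, each at frequency 0.5
for r in (0.05, 0.2, 0.5):
    d = ld_trajectory(p11=0.5, p1=0.5, q1=0.5, r=r, generations=10)
    print(f"r = {r}: D_0 = {d[0]:.3f}, D_10 = {d[10]:.4f}")
```

Allele frequencies p1 and q1 stay fixed throughout; only the association between the loci decays, and it decays slowly when r is small.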

Linkage disequilibrium (LD) - example

Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium

For M/N: p1 = 0.5425, p2 = 0.4575
For S/s: q1 = 0.3080, q2 = 0.6920

Haplotype   Observed   Expected if unlinked
MS          484        334.2
Ms          611        750.8
NS          142        281.8
Ns          773        633.2

$$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 184.7$$

Linkage equilibrium highly unlikely!

$$D = P_{11}P_{22} - P_{12}P_{21} = 0.07$$

Sources of Linkage disequilibrium

LD that was present in the original population and has not yet decayed because r is low

Genetic coadaptation: regions of the genome that are not subject to recombination (for example, inverted chromosomal fragments)

Admixture of populations with different allele frequencies:

               P11      P12      P21      P22      D
Population 1   0.0025   0.0475   0.0475   0.9025   0
Population 2   0.9025   0.0475   0.0475   0.0025   0
Mixture        0.4525   0.0475   0.0475   0.4525   0.2025
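The admixture numbers can be reproduced directly from $D = P_{11}P_{22} - P_{12}P_{21}$. In this sketch the allele frequencies 0.05 and 0.95 and the equal mixing proportion are inferred from the haplotype frequencies shown on the slide; the function names are our own.

```python
def equilibrium_haplotypes(p1, q1):
    """Haplotype frequencies (P11, P12, P21, P22) under linkage equilibrium."""
    p2, q2 = 1 - p1, 1 - q1
    return (p1 * q1, p1 * q2, p2 * q1, p2 * q2)

def linkage_disequilibrium(p11, p12, p21, p22):
    return p11 * p22 - p12 * p21

# Two populations, each in linkage equilibrium, with opposite allele frequencies
pop1 = equilibrium_haplotypes(0.05, 0.05)
pop2 = equilibrium_haplotypes(0.95, 0.95)
# Equal-proportion mixture of the two populations
mixture = tuple(0.5 * a + 0.5 * b for a, b in zip(pop1, pop2))

for name, pop in (("pop1", pop1), ("pop2", pop2), ("mixture", mixture)):
    print(name, [round(x, 4) for x in pop], "D =", round(linkage_disequilibrium(*pop), 4))
```

The mixture shows strong LD even though neither source population has any, which is why admixture is listed as a source of LD.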

Population substructure

The HW theory assumes that populations mate randomly
We mentioned that species are supposed to be genetically isolated, but even inside a species, the flow of information is never uniform

Subpopulation structure results in low heterozygosity
This is because different alleles become fixed in different subpopulations
We can compute the average heterozygosity predicted by HWE from allele frequencies: H = 2pq

HS – in each subpopulation use the allele frequency to compute HWE heterozygosity, and average
HR – in each region use the allele frequency to compute HWE heterozygosity, and take a weighted average
HT – for the entire population use the allele frequency to compute HWE heterozygosity

Wright's fixation index F
Compares one level in the hierarchy to another
Provides an indication of the level of genetic differentiation in the population

0<F<1; F<0.05 is considered quite low, F>0.25 is considered very high

$$F_{SR} = \frac{H_R - H_S}{H_R}, \qquad F_{RT} = \frac{H_T - H_R}{H_T}$$

Population substructure – (Dobzhansky and Epling 1942)

[Figure: map of the frequency of the recessive allele (blue flower color) in “desert snow” flowers (Linanthus parryae), with local frequencies ranging from 0.000 to 0.717. Each point represents ~4000 plants over 30 square miles of the Mohave desert]

$$F_{SR} = 0.1589, \qquad F_{RT} = 0.3299$$

Heterozygosities (subscripts not recoverable): H = 0.4995, H = 0.0272, H = 0.3062

More significant difference among regions than inside them

Inbreeding

A population with inbreeding will undergo a reduction in heterozygosity

For example, self-fertilization in plants

The inbreeding coefficient:

$$F = \frac{H_0 - H_I}{H_0}$$

H0 – the random mating heterozygosity
HI – observed (inbreeding) heterozygosity

In fact F is identical to the fixation index F and can be interpreted as the probability that two alleles are identical by descent (autozygous)

The increase in homozygosity of rare alleles in an inbred population is frequently detrimental

Regular mating schemes in the lab and field: selfing, sib-mating, backcrossing to a single individual from a random-bred strain

Assortative mating:
positive (height in humans)
negative (cases in plants)

The HapMap project

1 million SNPs (single nucleotide polymorphisms)

4 populations:
30 trios (parents/child) from Nigeria (Yoruba - YRI)
30 trios (parents/child) from Utah (CEU)
45 Han Chinese (Beijing)
44 Japanese (Tokyo)

Haplotyping – each SNP/individual
Not just determining heterozygosity/homozygosity – haplotyping completely resolves the genotypes (phasing)

Because of linkage, a partial SNP map largely determines all other SNPs!

The idea is that a group of “tag SNPs” can be used to represent all genetic variation in the human population.

This is extremely important in association studies that look for the genetic cause of disease.

Correlation on SNPs between populations

Recombination rates in the human population: LD blocks

Recombination rates in the human population

Recombination rates are highly non uniform – with major effects on genome structure!

Mutations

Simplest model: assume two alleles, and mutation probabilities:

$$\Pr(A \to a) = \mu, \qquad \Pr(a \to A) = \nu$$

If the process runs long enough, we will converge to a stationary distribution:

$$\Pr(A) = \frac{\nu}{\mu + \nu}$$

Populations are however finite, and this creates random genetic drift
A random allele has a significant chance of being eliminated by sampling, even in one generation:

$$\left(1 - \frac{1}{2N}\right)^{2N} \approx \frac{1}{e}$$
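Both statements on this slide are easy to check numerically. The sketch below assumes forward and backward mutation probabilities named mu and nu (the symbols were lost from the slide text), iterates $p' = p(1-\mu) + (1-p)\nu$, and evaluates the one-generation loss probability $(1 - 1/2N)^{2N}$ for a single-copy allele. The numerical rates are arbitrary illustrations.

```python
import math

def stationary_frequency_of_A(mu, nu):
    """Stationary frequency of A when A->a has probability mu and a->A has probability nu."""
    return nu / (mu + nu)

def iterate_mutation(p, mu, nu, generations):
    """Deterministic change of the frequency p of allele A under two-way mutation."""
    for _ in range(generations):
        p = p * (1 - mu) + (1 - p) * nu
    return p

def one_generation_loss_probability(N):
    """Probability that a single-copy allele is missed by all 2N gamete draws."""
    return (1 - 1 / (2 * N)) ** (2 * N)

# Unrealistically large rates so that convergence is fast
print(iterate_mutation(1.0, 1e-2, 5e-3, 5000), stationary_frequency_of_A(1e-2, 5e-3))
print(one_generation_loss_probability(10_000), math.exp(-1))
```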

Figure 7.4

Drift

Experiments with drifting fly populations: 107 Drosophila melanogaster populations. Each originally consisted of 16 brown eyes (bw) heterozygotes. At each generation, 8 males and 8 females were selected at random from the progeny of the previous generation. The bars show the distribution of allele frequencies in the 107 populations

Drift, fixation, and the neutral theory

If sampling is random, the chance of ultimate fixation of a new allele is

$$\frac{1}{2N}$$

simply because one allele must become fixed (and there are 2N to begin with).

According to the neutral theory, fixation of neutral alleles plays a major role in driving divergence of populations.

This is in contrast to the selectionist view, which stresses adaptive evolution as the major force for fixation of new alleles.

The controversy around the neutral theory seems like something that belongs to the past, since it was heated around questions of evolution in protein-coding loci and densely coded genomes. Today we realize that genomic information is distributed in a way that should certainly allow neutral or almost neutral mutations considerable freedom in large parts of the genome.

There are still critically important questions on how well the neutrality assumption holds in different parts of the genome – we’ll look at this question later.

Wright-Fisher model for genetic drift

[Figure: N individuals → ∞ gametes → N individuals → ∞ gametes]

We follow the number of copies f of an allele in the population, until fixation (f=2N) or loss (f=0)

We can model the allele count as a Markov process with transition probabilities:

$$T_{ij} = \binom{2N}{j}\left(\frac{i}{2N}\right)^{j}\left(1 - \frac{i}{2N}\right)^{2N-j}$$

This is the probability of sampling j copies of the allele from a population of 2N alleles that contains i copies.

In larger populations the frequency changes more slowly (the variance of the sampled frequency is pq/2N, so sampling changes it less)
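The binomial transition probabilities $T_{ij} = \binom{2N}{j}(i/2N)^j(1-i/2N)^{2N-j}$ can be written down directly for a small population. This is a sketch for illustration (the function name and N = 5 are our choices); for realistic N one would simulate rather than build the full matrix.

```python
from math import comb

def wright_fisher_matrix(N):
    """Transition matrix T[i][j]: i copies now, j copies in the next generation."""
    size = 2 * N
    matrix = []
    for i in range(size + 1):
        x = i / size  # current allele frequency
        row = [comb(size, j) * x ** j * (1 - x) ** (size - j) for j in range(size + 1)]
        matrix.append(row)
    return matrix

T = wright_fisher_matrix(5)
# Loss (0 copies) and fixation (2N copies) are absorbing states
print(T[0][0], T[10][10])
# Drift has no direction: the expected count next generation equals the current count
print(sum(j * T[3][j] for j in range(11)))
```

The rows for i = 0 and i = 2N put all their mass on the same state, which is why every trajectory eventually ends in loss or fixation.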

Diffusion approximation and Kimura’s solution

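Fisher, and then Kimura, approximated the drift process using a diffusion equation.

$\phi(x,t)$: the density of populations with allele frequency in x..x+dx at time t

$J(x,t)$: the flux of probability at time t and frequency x

The change in the density equals the difference between the fluxes J(x,t) and J(x+dx,t); taking dx to the limit we have:

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial J(x,t)}{\partial x}$$

If M(x) is the mean change in allele frequency when the frequency is x, and V(x) is the variance of that change, then the probability flux equals:

$$J(x,t) = M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right]$$

Substituting gives the Fokker-Planck (Kolmogorov forward) equation, which has the same form as heat diffusion:

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$$

For pure drift:

$$M(x) = 0, \qquad V(x) = \frac{x(1-x)}{2N}$$

$$\frac{\partial \phi(x,t)}{\partial t} = \frac{1}{4N}\frac{\partial^2}{\partial x^2}\left[x(1-x)\phi(x,t)\right]$$

Changes in allele frequencies, Wright-Fisher model

After about 4N generations, just 10% of the cases are not fixed and the distribution becomes flat.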

Absorption time and Time to fixation

According to Kimura’s solution, the mean time to fixation of an allele, given initial frequency p and given that it was not lost, is:

$$\bar{t}_1(p) = -\frac{4N(1-p)\log(1-p)}{p}$$

The mean time for allele loss is (the fixation time of the complementary event):

$$\bar{t}_0(p) = -\frac{4Np\log(p)}{1-p}$$
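A quick numerical look at the two conditional times, $\bar{t}_1(p) = -4N(1-p)\log(1-p)/p$ for fixation and $\bar{t}_0(p) = -4Np\log(p)/(1-p)$ for loss, using natural logarithms. The function names and N = 1000 are illustrative assumptions.

```python
import math

def mean_fixation_time(p, N):
    """Mean generations to fixation, given initial frequency p and eventual fixation."""
    return -4 * N * (1 - p) * math.log(1 - p) / p

def mean_loss_time(p, N):
    """Mean generations to loss, given initial frequency p and eventual loss."""
    return -4 * N * p * math.log(p) / (1 - p)

N = 1000
p_new = 1 / (2 * N)  # a new mutation present as a single copy
print(mean_fixation_time(p_new, N))  # close to 4N
print(mean_loss_time(p_new, N))      # much shorter: most new alleles are lost quickly
```

For a new mutation the fixation time approaches 4N generations, while the loss time is only of order log(2N) generations.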

Effective population size

4N generations looks like a huge number (in a population of billions!)

But in fact, the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating – any two individuals can mate

The effective population size is defined as the size of an idealized population for which the predicted dynamics of changes in allele frequency are similar to the observed ones

For each measurable statistic of population dynamics, a different effective population size can be computed

For example, the expected variance in allele frequency is expressed as:

$$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N}$$

But we can use the same formula to define the effective population size given the variance:

$$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N_e}$$

Effective population size: changing populations

If the population size is changing over time, the dynamics will be affected by the harmonic mean of the sizes:

$$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_0} + \frac{1}{N_1} + \dots + \frac{1}{N_{t-1}}\right)$$

So the effective population size is dominated by the size of the smallest bottleneck

Bottlenecks can occur during migration, environmental stress, isolation

Such effects greatly decrease heterozygosity (founder effect – for example Tay-Sachs in “ashkenazim”)

Bottlenecks can accelerate fixation of neutral or even deleterious mutations as we shall see later.

Human effective population size over the last 2 My is estimated at around 10,000 (due to bottlenecks).
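To see how strongly a bottleneck dominates the harmonic mean $1/N_e = \frac{1}{t}\sum_i 1/N_i$, here is a minimal sketch with a made-up size history (five generations, one of them a bottleneck of 10 individuals).

```python
def effective_size_over_time(sizes):
    """Harmonic mean of per-generation population sizes N_0 .. N_{t-1}."""
    t = len(sizes)
    return t / sum(1 / n for n in sizes)

# Hypothetical history: a population of 1000 with a single-generation bottleneck of 10
history = [1000, 1000, 10, 1000, 1000]
ne = effective_size_over_time(history)
print(f"arithmetic mean = {sum(history) / len(history):.0f}, N_e = {ne:.1f}")
```

The arithmetic mean is about 800, but the effective size is below 50 because the single small term dominates the sum of reciprocals.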

Effective population size: unequal sex ratio, and sex chromosomes

If there are more females than males, or fewer males participate in reproduction, then the effective population size will be smaller than the actual size $N_a = N_m + N_f$:

$$N_e = \frac{4N_mN_f}{N_m + N_f}$$

(any combination of alleles from a male and a female)

So if there are 10 times more females than males in the population ($N_m = x$, $N_f = 10x$), the effective population size is 4·x·10x/(11x) ≈ 3.6x, much less than the size of the population (11x).

Another example is the X chromosome, which is present in only one copy in males.

$$p = \frac{1}{3}p_m + \frac{2}{3}p_f, \qquad Var(p) = \frac{1}{9}\frac{p_mq_m}{N_m} + \frac{4}{9}\frac{p_fq_f}{2N_f}$$

Assuming $p_m = p_f = p$:

$$Var(p) = \frac{pq}{18}\cdot\frac{2N_f + 4N_m}{N_mN_f} = \frac{pq}{2N_e}, \qquad N_e = \frac{9N_mN_f}{4N_m + 2N_f}$$
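The two formulas, $N_e = 4N_mN_f/(N_m+N_f)$ for autosomes and $N_e = 9N_mN_f/(4N_m+2N_f)$ for the X chromosome, are sketched below. The function names and the example counts are illustrative assumptions.

```python
def ne_autosomal(n_males, n_females):
    """Effective size for an autosomal locus with unequal numbers of breeding males and females."""
    return 4 * n_males * n_females / (n_males + n_females)

def ne_x_chromosome(n_males, n_females):
    """Effective size for an X-linked locus (one copy in males, two in females)."""
    return 9 * n_males * n_females / (4 * n_males + 2 * n_females)

# Ten times more females than males: N_m = 100, N_f = 1000, census size 1100
print(ne_autosomal(100, 1000))      # about 364, i.e. roughly 3.6 * N_m
# Equal sex ratio: the autosomal N_e equals the census size, the X-linked N_e is 3/4 of it
print(ne_autosomal(500, 500), ne_x_chromosome(500, 500))
```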

Testing neutrality

The drift process has clear dynamics. We are usually interested in these dynamics as a baseline for testing hypotheses on non-neutral evolution

Such tests require predictions on the behavior of concrete statistics that we can measure from a population

For example, we can sequence alleles and count how many polymorphic sites exist in a gene and what are their frequencies.

We can also perform evolutionary comparisons among different sites – we will focus on these later in the course.

Slow evolution

sp1 sp2 sp3 sp4 sp5

Non neutral population dynamics

Infinite alleles model

Assuming a gene with multiple loci, we can think of the number of possible alleles as much larger than the population

In this model, the probability of generating the same mutation twice is considered 0
One can then ask how many distinct alleles we should observe given a neutral process and a certain mutation probability
Alternatively, one can ask what the probability of autozygosity F (identity by descent) will be

$$F_t = \frac{1}{2N}(1-\mu)^2 + \left(1 - \frac{1}{2N}\right)(1-\mu)^2F_{t-1}$$

(picking up the same allele twice, or picking up two autozygous alleles, in both cases without mutating them)

Looking for the steady state and neglecting factors that depend on $\mu^2$:

$$\hat{F} = \frac{1}{1 + 4N\mu} = \frac{1}{1+\theta}, \qquad \theta = 4N\mu$$

Because of our model, F is also the fraction of homozygous individuals:

$$\hat{F} = \sum_i p_i^2 = \frac{1}{1 + 4N\mu}$$

Testing the infinite alleles model

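$$\hat{F} = \frac{1}{1 + 4N\mu}, \qquad \hat{H} = \frac{4N\mu}{1 + 4N\mu}$$

The Ewens formula enables us to predict the number of alleles (k) we should observe when sampling n times from a population with $\theta = 4N\mu$, assuming the infinite alleles model:

$$E(k) = 1 + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \dots + \frac{\theta}{\theta+n-1}$$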
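The expected number of alleles, $E(k) = \sum_{i=0}^{n-1} \theta/(\theta+i)$, is a one-line sum (the i = 0 term is the leading 1). A sketch with arbitrary example values of θ; the function name is ours.

```python
def expected_number_of_alleles(theta, n):
    """Ewens expectation of the number of distinct alleles in a sample of size n."""
    return sum(theta / (theta + i) for i in range(n))

# Illustrative values only: sample of 100 alleles under different theta = 4*N*mu
for theta in (0.1, 1.0, 10.0):
    print(theta, round(expected_number_of_alleles(theta, 100), 2))
```

E(k) grows only logarithmically with the sample size n, so doubling the sample adds few new alleles unless θ is large.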

The Chinese restaurant process

Figure 7.16,7.17

Testing the infinite alleles model

We can estimate F from k (by finding $\theta$ from the E(k) formula):

$$\hat{F} = \frac{1}{1 + 4N\mu} = \frac{1}{1+\theta}$$

We use this statistic to test whether a given gene behaves neutrally (or at least according to the model):

Not quite neutral Highly non neutral

F computed from the number of Xdh alleles in 89 D. pseudoobscura lines: 52 had a common allele, 8 were singletons.

Compared to a simulation assuming the infinite allele model.

VNTR locus in humans: observed (open columns) and Ewens predicted allele counts.

Infinite sites model

Instead of looking at an entire gene with many alleles, consider the many loci that make up the gene and assume that these are changing slowly: most loci are monomorphic or dimorphic.

Probability of i mismatches between two random sequences:

$$\Pr(S_2 = i) = \frac{1}{1+\theta}\left(\frac{\theta}{1+\theta}\right)^i, \qquad \theta = 4N\mu$$

In particular, autozygosity:

$$\Pr(S_2 = 0) = \frac{1}{1+\theta} = F$$

Just like we had for the infinite alleles model.

If we sample n alleles, the number of segregating sites S has the following mean and variance (assuming no intragenic recombination):

$$E(S) = \theta\sum_{i=1}^{n-1}\frac{1}{i}$$

$$V(S) = \theta\sum_{i=1}^{n-1}\frac{1}{i} + \theta^2\sum_{i=1}^{n-1}\frac{1}{i^2}$$

So we can test neutrality by looking at the number of segregating sites in a certain sample.
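The mean and variance of the number of segregating sites, $E(S) = \theta\sum_{i=1}^{n-1} 1/i$ and $V(S) = \theta\sum 1/i + \theta^2\sum 1/i^2$, can be computed as below. Inverting E(S) to estimate θ from an observed S (often called Watterson's estimator) is an addition of ours that the slides do not name; the example numbers are made up.

```python
def harmonic_sums(n):
    """Return (sum of 1/i, sum of 1/i^2) for i = 1 .. n-1."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    return a1, a2

def segregating_sites_moments(theta, n):
    """Mean and variance of S in a sample of n alleles, without intragenic recombination."""
    a1, a2 = harmonic_sums(n)
    return theta * a1, theta * a1 + theta ** 2 * a2

def theta_from_segregating_sites(s_observed, n):
    """Moment estimate of theta obtained by solving E(S) = S_observed."""
    a1, _ = harmonic_sums(n)
    return s_observed / a1

mean_s, var_s = segregating_sites_moments(theta=2.0, n=3)
print(mean_s, var_s)
print(theta_from_segregating_sites(s_observed=12, n=20))
```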

Coalescent theory

Any set of individuals in a population is the outcome of a coalescence process: a common ancestor giving rise to multiple alleles through mutation, duplication and recombination.

Such models are in wide use for simulating populations

Applications for inferring selection/neutrality or other population dynamics are becoming feasible as more data becomes available.

A simple coalescent model looks at the gene tree of the k observed alleles; $T_k$ is the time during which the tree has k lineages:

[Figure: gene tree of five alleles, drawn from the present to the past]

$$E(T_5) = \frac{2N}{10}, \quad E(T_4) = \frac{2N}{6}, \quad E(T_3) = \frac{2N}{3}, \quad E(T_2) = 2N$$
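The expected waiting times on this slide all follow one pattern, $E(T_k) = 2N/\binom{k}{2} = 4N/(k(k-1))$. A small sketch, assuming that general form (it reproduces the four values shown) and an arbitrary N:

```python
def expected_coalescence_time(k, N):
    """Expected generations during which the gene tree has exactly k lineages."""
    return 4 * N / (k * (k - 1))

def expected_tree_height(n, N):
    """Expected time back to the common ancestor of n sampled alleles."""
    return sum(expected_coalescence_time(k, N) for k in range(2, n + 1))

N = 1000
for k in (5, 4, 3, 2):
    print(k, expected_coalescence_time(k, N))
print("height for n = 5:", expected_tree_height(5, N))
```

The last interval, with only two lineages left, accounts for more than half of the expected tree height.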

Selection

Fitness: the relative reproductive success of an individual (or genome)

Fitness is only defined with respect to the current population.

Fitness is unlikely to remain constant in all conditions and environments

Mutations can change fitness

A deleterious mutation decreases fitness. It would therefore be selected against. This process is called negative or purifying selection.

An advantageous or beneficial mutation increases fitness. It would therefore be subject to positive selection.

A neutral mutation is one that does not change the fitness.

For mono-allelic (haploid) populations, selection acts directly on the fitness of an allele
For diploid organisms, we should define how the combination of alleles affects fitness.

Sampling probability is multiplied by a selection factor 1+s

Selection in haploid populations

Allele                        A                                      B
Frequency (generation t-1)    $p_{t-1}$                              $q_{t-1}$
Relative fitness              $w$                                    $1$
Gamete after selection        $\frac{wp_{t-1}}{wp_{t-1}+q_{t-1}}$    $\frac{q_{t-1}}{wp_{t-1}+q_{t-1}}$

Generation t: ratio as a function of time:

$$\frac{p_t}{q_t} = w^t\frac{p_0}{q_0}$$

Consider a continuous time model:

$$A'(t) = aA(t), \qquad B'(t) = bB(t)$$

$$\frac{A(t)}{B(t)} = \frac{A(0)}{B(0)}e^{(a-b)t}$$

The change in allele frequency:

$$\Delta p = \frac{wp}{wp+q} - p = \frac{pq(w-1)}{wp+q}$$

Example (Hartl Dykhuizen 81): E. coli with two gnd alleles. One allele is beneficial for growth on gluconate.

A population of E. coli was tracked for 35 generations, evolving on two media; the observed frequencies were:

Gluconate: 0.455 → 0.898
Ribose: 0.594 → 0.587

For gluconate: log(0.898/0.102) - log(0.455/0.545) = 35 log w, so log(w) = 0.0292 (base 10) and w = 1.0696

Compare to w = 0.999 in ribose.
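The fitness estimate follows from $p_t/q_t = w^t\,p_0/q_0$, which gives $\log w = [\log(p_t/q_t) - \log(p_0/q_0)]/t$. A short sketch that redoes the arithmetic for both media; the function name is ours, and the starting gluconate frequency is taken as 0.455, the value used in the calculation above.

```python
import math

def relative_fitness(p_start, p_end, generations):
    """Per-generation relative fitness w implied by a change in allele frequency."""
    log_ratio_start = math.log(p_start / (1 - p_start))
    log_ratio_end = math.log(p_end / (1 - p_end))
    return math.exp((log_ratio_end - log_ratio_start) / generations)

w_gluconate = relative_fitness(0.455, 0.898, 35)
w_ribose = relative_fitness(0.594, 0.587, 35)
print(f"gluconate: w = {w_gluconate:.4f}, log10(w) = {math.log10(w_gluconate):.4f}")
print(f"ribose:    w = {w_ribose:.4f}")
```

The base of the logarithm does not matter for w itself, as long as the same base is used on both sides of the equation.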

Selection and allele frequency dynamics

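Assume:

Genotype                       AA         Aa         aa
Fitness                        $w_{11}$   $w_{12}$   $w_{22}$
Frequency (Hardy-Weinberg!)    $p^2$      $2pq$      $q^2$

The frequency of a in the next generation:

$$q' = \frac{pqw_{12} + q^2w_{22}}{p^2w_{11} + 2pqw_{12} + q^2w_{22}}$$

Change in frequency is given by:

$$\Delta q = \frac{pq\left[p(w_{12}-w_{11}) + q(w_{22}-w_{12})\right]}{p^2w_{11} + 2pqw_{12} + q^2w_{22}}$$

In the case of codominance, $w_{11} = 1,\ w_{12} = 1+s,\ w_{22} = 1+2s$:

$$\Delta q = \frac{spq}{1 + 2spq + 2sq^2} \approx spq = sq(1-q) \approx \frac{dq}{dt} \qquad (s \approx 0)$$

$$q_t = \frac{1}{1 + \frac{1-q_0}{q_0}e^{-st}}$$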

Selection and fixation

An allele with a beneficial mutation will have an increased frequency in the gamete pool:

$$p_0 = \frac{1+s}{2N}$$

Its chance of nevertheless being lost immediately (not being sampled into the next generation) is:

$$\left(1 - \frac{1+s}{2N}\right)^{2N} \approx e^{-(1+s)} = e^{-1}e^{-s}$$

Selection gives only a modest reduction in this risk, so even beneficial alleles are likely to be eliminated. For example, s=0.1 would have a loss probability of 0.333 compared to 0.368 for a neutral allele.

For a diploid population, if we assume the fitness of a heterozygote is 1+s and of a homozygote is 1+2s, it can be computed from the diffusion approximation that the overall fixation probability of a new mutation will be:

$$f = \frac{1 - e^{-2N_es/N}}{1 - e^{-4N_es}}$$

$$f \approx \frac{2sN_e/N}{1 - e^{-4N_es}} \quad (s \ll 1), \qquad f \approx \frac{2sN_e}{N} \quad (4N_es \gg 1)$$
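A sketch of the fixation probability of a new mutation, $f = (1 - e^{-2N_es/N})/(1 - e^{-4N_es})$, with the neutral limit 1/(2N) handled separately. The function name, the default N_e = N and the example values are assumptions for illustration; `math.expm1` is used only to keep the small-s case numerically stable.

```python
import math

def fixation_probability(s, N, Ne=None):
    """Fixation probability of a single new copy with heterozygote advantage s."""
    if Ne is None:
        Ne = N  # idealized population: effective size equals census size
    if s == 0:
        return 1 / (2 * N)  # neutral case
    # expm1(x) = e^x - 1, so -expm1(-a) = 1 - e^{-a} without cancellation error
    return math.expm1(-2 * Ne * s / N) / math.expm1(-4 * Ne * s)

N = 10_000
print(fixation_probability(0.0, N))    # neutral: 1/(2N)
print(fixation_probability(0.01, N))   # close to 2s when 4*Ne*s >> 1
print(fixation_probability(1e-9, N))   # nearly neutral: close to 1/(2N)
```

Even with a 1% advantage, about 98% of new beneficial mutations are still lost by drift.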

Selection and fixation

The fixation time for a neutral allele (assuming fixation was achieved), as we said before, averages:

$$\bar{t} = 4N$$

With a selective advantage, the fixation time is approximated by:

$$\bar{t} = (2/s)\ln(2N)$$

Substitutions

Considering now the entire population, the rate of substitution K at a locus equals the number of new mutations per generation times their fixation probability. In the neutral case, with mutation rate $\mu$, this is very simple:

$$K = 2N\mu \cdot \frac{1}{2N} = \mu$$

So neutral evolution is unaffected by the size of the population.

With a selective advantage, the substitution rate is approximated by:

$$K = 2N\mu \cdot \frac{2sN_e}{N} = 4N_es\mu$$

So evolution will be more efficient when the population is larger, the mutation rate is faster and selection is stronger. The parameter $4N_es$ describes the speed-up.

Other types of selection

Over-dominance: heterozygotes are fitter, so there is a possibility for equilibrium in allele frequencies. There are few examples, but a famous one is resistance to malaria and sickle cell anemia in Africa

Frequency- and density-dependent selection: when the fitness depends on the frequency of the allele or the population size.

Fecundity selection: different reproductive potential for mating pairs.

Effects of heterogeneous environment: (overdominance?)

Different effects in males and females

Effects that apply directly to the haplotype: gametic selection/meiotic drive (e.g., killing your homologous chromosome reproductive potential)

Kin selection: origin of altruism?

Recombination and selection

Linkage and selection

[Figure: pairs of linked mutations: Beneficial with Weakly deleterious; Beneficial with Beneficial]

Linkage interferes with the purging of deleterious mutations and reduces the efficiency of positive selection!

Selective sweep/Hitchhiking effect/“genetic draft”

Hill-Robertson effect

Linkage and selection

The variance in allele frequency is used to define the effective population size:

$$V(\Delta p) = \frac{p(1-p)}{2N_e}$$

Simplistically, assume a neutral locus is evolving such that selective sweeps affect a fully linked locus at rate $\delta$. A sweep will fix the neutral allele with probability p, and we further assume that the sweep happens instantly:

$$V(\Delta p) = p(1-p)\left(\frac{1}{2N_e} + \delta C\right), \qquad N_{el} = \frac{N_e}{1 + 2N_e\delta C}$$

C – the average frequency of the neutral allele after the sweep

This is very rough, but it demonstrates the basic intuition here: sweeps reduce the effectiveness of selection in a way that can be quantified through a reduction in the effective population size.
