
March 2006 Vineet Bafna

CSE280b: Population Genetics

Vineet Bafna/Pavel Pevzner

www.cse.ucsd.edu/classes/sp05/cse291

Population Genetics

• Individuals in a species (population) are phenotypically different.

• Often these differences are inherited (genetic).

• Studying these differences is important!

• Q: How predictive are these differences?

EX: Population Structure

• 377 locations (loci) were sampled in 1000 people from 52 populations.

• 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003)

• Genetic differences can predict ethnicity.

[Figure: inferred genetic clusters, labeled by region: Africa, Eurasia, East Asia, America]

Scope of these lectures

• Basic terminology

• Key principles

– Sources of variation

– HW equilibrium

– Linkage

– Coalescent theory

– Recombination/Ancestral Recombination Graph

– Haplotypes/Haplotype phasing

– Population sub-structure

– Structural polymorphisms

– Medical genetics basis: Association mapping/pedigree analysis

Alleles

• Genotype: genetic makeup of an individual

• Allele: A specific variant at a location

– The notion of alleles predates the concept of gene, and DNA.

– Initially, alleles referred to variants that described a measurable phenotype (round/wrinkled seed)

– Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.

• Humans are diploid: they have 2 copies of each chromosome.

– They may have heterozygosity/homozygosity at a location

– Other organisms (plants) have higher forms of ploidy.

– Additionally, some sites might have 2 allelic forms, or even many allelic forms.

What causes variation in a population?

• Mutations (may lead to SNPs)

• Recombinations

• Other genetic events (gene conversion)

• Structural Polymorphisms

Single Nucleotide Polymorphisms

000001010111000110100101000101010010000000110001111000000101100110

Infinite Sites Assumption: Each site mutates at most once

Short Tandem Repeats

Sequence (repeat unit CAT)    Repeats
GCTAGATCATCATCATCATTGCTAG     4
GCTAGATCATCATCATTGCTAGTTA     3
GCTAGATCATCATCATCATCATTGC     5
GCTAGATCATCATCATTGCTAGTTA     3
GCTAGATCATCATCATTGCTAGTTA     3
GCTAGATCATCATCATCATCATTGC     5

STR can be used as a DNA fingerprint

• Consider a collection of regions with variable length repeats.

• Variable length repeats will lead to variable length DNA

• Vector of lengths is a finger-print

4 23 35 13 23 15 3
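As a rough illustration of how repeat counts could be read off sequence data, the sketch below counts the longest run of a repeat unit in each of the six example sequences from the Short Tandem Repeats slide. The function name `longest_tandem_run` and the choice of `CAT` as the repeat unit are assumptions made for this example; a real fingerprint would collect such counts across several loci for one individual.

```python
import re

def longest_tandem_run(seq: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

# The six example sequences from the Short Tandem Repeats slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

# One repeat count per sequence (assumed repeat unit: CAT).
repeat_counts = [longest_tandem_run(s, "CAT") for s in sequences]
print(repeat_counts)  # [4, 3, 5, 3, 3, 5]
```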

Recombination

00000000   11111111   (parental chromosomes)

00011111   (recombinant)

Gene Conversion

• Gene Conversion versus crossover

– Hard to distinguish in a population

Structural polymorphisms

• Large scale structural changes (deletions/insertions/inversions) may occur in a population.

Topic 1: Basic Principles

• In a ‘stable’ population, the distribution of alleles obeys certain laws

– Not really, and the deviations are interesting

• HW Equilibrium

– (due to mixing in a population)

• Linkage (dis)-equilibrium

– Due to recombination

Hardy Weinberg equilibrium

• Consider a locus with 2 alleles, A, a

• p (respectively, q) is the frequency of A (resp. a) in the population

• 3 Genotypes: AA, Aa, aa

• Q: What is the frequency of each genotype?

If various assumptions are satisfied (such as random mating, no natural selection), then

• $P_{AA} = p^2$

• $P_{Aa} = 2pq$

• $P_{aa} = q^2$

Hardy Weinberg: why?

• Assumptions:

– Diploid

– Sexual reproduction

– Random mating

– Bi-allelic sites

– Large population size, …

• Why? Each individual randomly picks his two chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.

Hardy Weinberg: Generalizations

• Multiple alleles with frequencies $\theta_1, \theta_2, \ldots, \theta_H$

– By HW,

$\Pr[\text{homozygous genotype } i] = \theta_i^2$

$\Pr[\text{heterozygous genotype } i, j] = 2\theta_i\theta_j$

• Multiple loci?
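The multi-allele formulas can be checked numerically. The sketch below is only an illustration: the function name `hw_genotype_frequencies` and the example frequencies are made up for this purpose, and alleles are indexed from 0.

```python
from itertools import combinations_with_replacement

def hw_genotype_frequencies(allele_freqs):
    """Map each unordered genotype (i, j), i <= j, to its Hardy-Weinberg frequency."""
    freqs = {}
    for i, j in combinations_with_replacement(range(len(allele_freqs)), 2):
        if i == j:
            freqs[(i, j)] = allele_freqs[i] ** 2                   # homozygote: theta_i^2
        else:
            freqs[(i, j)] = 2 * allele_freqs[i] * allele_freqs[j]  # heterozygote: 2 theta_i theta_j
    return freqs

# Hypothetical allele frequencies for a locus with H = 3 alleles.
theta = [0.5, 0.3, 0.2]
genotypes = hw_genotype_frequencies(theta)
for genotype, freq in genotypes.items():
    print(genotype, round(freq, 4))
```

The genotype frequencies sum to 1 because they are the terms of the expansion of $(\theta_1 + \cdots + \theta_H)^2$.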

Hardy Weinberg: Implications

• The allele frequency does not change from generation to generation. Why?

• It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the disease?

• Males are 100 times more likely to have the “red” type of color blindness than females. Why?

• Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
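One way to work through the three questions above, as a sketch under stated assumptions: HW proportions hold exactly, all disease alleles are pooled into a single recessive class a with frequency q, "carries" is read as "is a heterozygous carrier", and (for the last item) the color-blindness allele is X-linked recessive, which the slide does not state.

```latex
% Allele frequency is unchanged after one round of random mating:
p' = P_{AA} + \tfrac{1}{2} P_{Aa} = p^2 + pq = p(p + q) = p

% Phenylketonuria: affected individuals have genotype aa
q^2 = \frac{1}{10000} \;\Rightarrow\; q = 0.01, \quad p = 0.99
P_{Aa} = 2pq = 2(0.99)(0.01) = 0.0198 \approx 2\%

% X-linked recessive trait: males carry one X chromosome, females two
\frac{\Pr[\text{affected male}]}{\Pr[\text{affected female}]}
  = \frac{q}{q^2} = \frac{1}{q} = 100 \;\Rightarrow\; q = 0.01
```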

Recombination

00000000   11111111   (parental chromosomes)

00011111   (recombinant)

What if there were no recombinations?

• Life would be simpler

• Each individual sequence would have a single parent (even for higher ploidy)

• The relationship is expressed as a tree.

The Infinite Sites Assumption

0 0 0 0 0 0 0 0   (ancestral sequence)

0 0 1 0 0 0 0 0   (mutation at site 3)

0 0 1 0 0 0 0 1   (then site 8)        0 0 1 0 1 0 0 0   (then site 5)

• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.

• Some phenotypes could be linked to the polymorphisms

• Some of the linkage is “destroyed” by recombination

Infinite sites assumption and Perfect Phylogeny

• Each site is mutated at most once in the history.

• All descendants must carry the mutated value, and all others must carry the ancestral value

[Figure: tree with mutation i on one edge; sequences below that edge have 1 in position i, all others have 0 in position i]

Perfect Phylogeny

• Assume an evolutionary model in which no recombination takes place, only mutation.

• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.

The 4-gamete condition

• A column i partitions the set of species into two sets $i_0$ and $i_1$ (the species with value 0 and value 1 in column i)

• A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.

• EX: i is heterogeneous w.r.t. {A,D,E}

   i
A  0
B  0
C  0
D  1
E  1
F  1

4 Gamete Condition

• 4 Gamete Condition

– There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is not heterogeneous w.r.t. both $i_0$ and $i_1$.

– Equivalent to

– There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur
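The pairwise test is simple to state in code. The following is a minimal sketch, assuming the data are given as a list of rows of 0/1 values; the function names are hypothetical.

```python
from itertools import combinations

def violates_four_gametes(matrix, i, j):
    """True if columns i and j together show all of (0,0), (0,1), (1,0), (1,1)."""
    return len({(row[i], row[j]) for row in matrix}) == 4

def admits_perfect_phylogeny(matrix):
    """Apply the 4-gamete condition to every pair of columns."""
    n_cols = len(matrix[0]) if matrix else 0
    return not any(
        violates_four_gametes(matrix, i, j)
        for i, j in combinations(range(n_cols), 2)
    )

# A pair of columns containing all four gametes rules out a perfect phylogeny.
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(admits_perfect_phylogeny(conflicting))  # False
```

The check examines every pair of columns over every row, so it takes time proportional to (number of rows) x (number of columns)^2.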

4-gamete condition: proof (only if)

• Depending on the edge on which mutation j occurs, column j must be homogeneous w.r.t. either $i_0$ or $i_1$.

• (only if) Every perfect phylogeny satisfies the 4-gamete condition

• (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

Handling recombination

• A tree is not sufficient as a sequence may have 2 parents

• Recombination leads to loss of correlation between columns

Linkage (Dis)-equilibrium (LD)

• Consider sites A & B

• Case 1: No recombination

• Each new individual chromosome chooses a parent from the existing ‘haplotype’

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0

• Consider sites A & B

• Case 2: diploidy and recombination

• Each new individual chooses a parent from the existing alleles

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0

• Consider sites A & B

• Case 1: No recombination

• Each new individual chooses a parent from the existing ‘haplotype’

– Pr[A,B=(0,1)] = 0.25

• Linkage disequilibrium

• Case 2: Extensive recombination

• Each new individual simply chooses an allele from either site

– Pr[A,B=(0,1)] = 0.125

• Linkage equilibrium

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0

• In the absence of recombination,

– Correlation between columns

– The joint probability Pr[A=a,B=b] is different from P(a)P(b)

• With extensive recombination

– Pr(a,b)=P(a)P(b)

Measures of LD

• Consider two bi-allelic sites with alleles marked with 0 and 1

• Define

– $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]

– $P_{0*}$ = Pr[Allele 0 in locus 1]

• Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$

• $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}| = \ldots$
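As a concrete check of this definition, the sketch below computes D from the eight two-site haplotypes (A, B) used in the earlier linkage slides. The function name `ld_coefficient` is an assumption for this example.

```python
# The eight (A, B) haplotypes from the linkage example table.
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

def ld_coefficient(haps, a=0, b=0):
    """Return D = |P_ab - P_a* P_*b| for allele a at site 1 and allele b at site 2."""
    n = len(haps)
    p_ab = sum(1 for x, y in haps if x == a and y == b) / n
    p_a = sum(1 for x, _ in haps if x == a) / n
    p_b = sum(1 for _, y in haps if y == b) / n
    return abs(p_ab - p_a * p_b)

# D is the same whichever pair of alleles is used.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, ld_coefficient(haplotypes, a, b))  # 0.125 each time
```

Here Pr[A,B=(0,1)] = 0.25 while the product of the marginals is 0.5 x 0.25 = 0.125, which matches the two cases in the table slides.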

LD over time

• With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear

– Let $D^{(t)}$ = LD at time t

– $P^{(t)}_{00} = (1-r)\,P^{(t-1)}_{00} + r\,P^{(t-1)}_{0*} P^{(t-1)}_{*0}$

– $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ (HW)

– $D^{(t)} = (1-r)\,D^{(t-1)} = (1-r)^t D^{(0)}$
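The recurrence can be iterated directly and compared with the closed form. This is a minimal sketch, assuming the allele frequencies at both sites stay constant (HW); the function and parameter names are hypothetical, and D is kept signed here rather than as an absolute value.

```python
def simulate_ld_decay(p00, p0x, px0, r, generations):
    """Iterate P00(t) = (1-r) P00(t-1) + r P0* P*0 and return [D(0), ..., D(generations)]."""
    d_values = [p00 - p0x * px0]
    for _ in range(generations):
        p00 = (1 - r) * p00 + r * p0x * px0
        d_values.append(p00 - p0x * px0)
    return d_values

# Starting point taken from the two-site example: P00 = 0.25, P0* = 0.5, P*0 = 0.75.
decay = simulate_ld_decay(0.25, 0.5, 0.75, r=0.1, generations=20)
for t in (0, 1, 5, 20):
    print(t, decay[t], (1 - 0.1) ** t * decay[0])
```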

LD over distance

• Assumption

– Recombination rate increases linearly with distance

– LD decays exponentially with distance.

• The assumption is reasonable, but recombination rates vary from region to region, adding to complexity

• This simple fact is the basis of disease association mapping.

LD and disease mapping

• Consider a mutation that is causal for a disease.

• The goal of disease gene mapping is to discover which gene (locus) carries the mutation.

• Consider every polymorphism, and check:

– There might be too many polymorphisms

– Multiple mutations (even at a single locus) that lead to the same disease

• Instead, consider a dense sample of polymorphisms that span the genome

LD can be used to map disease genes

• LD decays with distance from the disease allele.

• By plotting LD, one can short list the region containing the disease gene.

Marker allele: 0 1 1 0 0 1

Phenotype (D = disease, N = normal): D N N D D N

LD and disease gene mapping problems

• Marker density?

• Complex diseases

• Population sub-structure

Population Genetics

• Often we look at these equilibria (Linkage/HW) and their deviations in specific populations

• These deviations offer insight into evolution.

• However, what is Normal?

• A combination of empirical (simulation) and theoretical insight helps distinguish between expected and unexpected.

Topic 2: Simulating population data

• We described various population genetic concepts (HW, LD), and their applicability

• The values of these parameters depend critically upon the population assumptions.

– What if we do not have infinite populations?

– No random mating (Ex: geographic isolation)

– Sudden growth

– Bottlenecks

– Ad-mixture

• It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?

Wright Fisher Model of Evolution

• Fixed population size from generation to generation

• Random mating
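A forward simulation under these two assumptions takes only a few lines. The sketch below assumes haploid individuals, a single bi-allelic locus, and no mutation or selection; the function name `wright_fisher` and its parameters are hypothetical.

```python
import random

def wright_fisher(n_individuals, p0, generations, seed=None):
    """Track the frequency of allele A in a fixed-size, randomly mating population."""
    rng = random.Random(seed)
    count = round(p0 * n_individuals)
    trajectory = [count / n_individuals]
    for _ in range(generations):
        p = count / n_individuals
        # Each of the N offspring picks its parent uniformly at random,
        # so it inherits allele A with probability p.
        count = sum(1 for _ in range(n_individuals) if rng.random() < p)
        trajectory.append(count / n_individuals)
    return trajectory

# Example: N = 100, starting frequency 0.5, 50 generations.
print(wright_fisher(100, 0.5, 50, seed=1)[-1])
```

In a finite population the frequency drifts and is eventually absorbed at 0 or 1, which is one way the infinite-population HW baseline can fail.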

Coalescent model

• Insight 1:

– Separate the genealogy from allelic states (mutations)

– First generate the genealogy (who begat whom)

– Assign an allelic state (0) to the ancestor. Drop mutations on the branches.

Coalescent theory

• Insight 2:

– Much of the genealogy is irrelevant, because it disappears.

– Better to go backwards

Coalescent theory (Kingman)

• Input

– (Fixed population (N individuals), random mating)

• Consider 2 individuals.

– Probability that they coalesce in the previous generation (have the same parent) = $\frac{1}{N}$

• Probability that they do not coalesce after t generations = $\left(1 - \frac{1}{N}\right)^t \approx e^{-t/N}$

Coalescent theory

• Consider k individuals.

– Probability that no pair coalesces after 1 generation $\approx 1 - \binom{k}{2}\frac{1}{N}$

– Probability that no pair coalesces after t generations $\approx \left(1 - \binom{k}{2}\frac{1}{N}\right)^t \approx e^{-\binom{k}{2} t/N} = e^{-\binom{k}{2}\tau}$

• $\tau = t/N$ is time in units of N generations
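To see how good the approximation is, the exact one-generation probability that k lineages all pick distinct parents can be compared with the exponential form. This is a small numerical sketch; the function names are hypothetical, and the exact product formula assumes each lineage picks its parent uniformly and independently among N individuals.

```python
import math

def prob_no_coalescence_exact(k, n_pop):
    """Probability that k lineages choose k distinct parents among n_pop in one generation."""
    prob = 1.0
    for i in range(1, k):
        prob *= 1 - i / n_pop
    return prob

def prob_no_coalescence_approx(k, n_pop, t=1):
    """Exponential approximation exp(-C(k,2) t / N) for no coalescence in t generations."""
    return math.exp(-math.comb(k, 2) * t / n_pop)

for k in (2, 5, 10):
    print(k, prob_no_coalescence_exact(k, 10_000), prob_no_coalescence_approx(k, 10_000))
```

The two values agree closely when k is much smaller than N, which is the regime the approximation is meant for.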

Coalescent approximation

• Insight 3:

– Topology is independent of coalescent times

– If you have n individuals, generate a random binary topology

• Iterate (until one individual)

– Pick a pair at random, and coalesce

• Insight 4:

– To generate coalescent times, there is no need to go back generation by generation

Coalescent approximation

• At any step, there are 1 <= k <= n individuals

• To generate time to coalesce (k to k-1 individuals)

– Pick a number from exponential distribution with rate k(k-1)/2

– Mean time to coalescence = 2/(k(k-1))
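Insights 3 and 4 together give a short simulation: draw an exponential waiting time with rate k(k-1)/2, then merge a random pair. The sketch below is one possible rendering, with time measured in units of N generations; the function name and the event format are assumptions made for this example.

```python
import random

def simulate_coalescent(n, seed=None):
    """Return a list of (time, (child_a, child_b), new_node) coalescence events."""
    rng = random.Random(seed)
    lineages = list(range(n))   # leaves are labeled 0..n-1
    next_label = n              # internal nodes get labels n, n+1, ...
    time = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        time += rng.expovariate(k * (k - 1) / 2)   # waiting time while k lineages remain
        a, b = rng.sample(lineages, 2)             # pick a pair at random and coalesce
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)
        events.append((time, (a, b), next_label))
        next_label += 1
    return events

for event in simulate_coalescent(6, seed=0):
    print(event)
```

The population size N never appears because time is already scaled by N, so the same genealogy can be rescaled for any population size.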

Typical coalescents

• 4 random examples with n=6 (Note that we do not need to specify N. Why?)

• Expected time to coalesce?

Coalescent properties

• Expected time for the last step

• The last step is half of the total time to coalesce

• Studying larger number of individuals does not change numbers tremendously

• EX: Number of mutations in a population is proportional to the total branch length of the tree

– E(Ttot)
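These expectations follow from the exponential waiting times with rate k(k-1)/2, with time in units of N generations. The symbols $T_k$ (time spent with k lineages) and $T_{MRCA}$ are notation introduced here for the derivation.

```latex
E[T_k] = \frac{2}{k(k-1)}, \qquad E[T_2] = 1

E[T_{MRCA}] = \sum_{k=2}^{n} \frac{2}{k(k-1)}
            = 2 \sum_{k=2}^{n} \left( \frac{1}{k-1} - \frac{1}{k} \right)
            = 2 \left( 1 - \frac{1}{n} \right)

E[T_{tot}] = \sum_{k=2}^{n} k \, E[T_k]
           = 2 \sum_{k=2}^{n} \frac{1}{k-1}
           = 2 \sum_{i=1}^{n-1} \frac{1}{i}
```

The last step has expectation 1, against a total of just under 2, and the total branch length grows only logarithmically in n, which is why larger samples change the numbers so little.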

Variants (exponentially growing populations)

• If the population is growing exponentially, the branch lengths become similar, or even star-like. Why?

• With appropriate scaling of time, the same process can be extended to various scenarios: male-female, hermaphrodite, segregation, migration, etc.

Simulating population data

• Generate a coalescent (Topology + Branch lengths)

• For each branch, drop mutations along its length at the mutation rate

• Generate sequence data

• Note that the resulting sequence data form a perfect phylogeny.

• Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths)

• Also, note that all pairs of positions are correlated (should have high LD).
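The whole pipeline can be sketched in one function: build the genealogy backwards, and drop mutations on each branch as it ends. This is a hedged illustration, not a reference implementation. It assumes infinite sites (every mutation creates a new column), ancestral state 0, and a mutation rate of theta/2 per unit of branch length, which is a common scaling but is not specified in the slides; `simulate_sequences` and `theta` are hypothetical names.

```python
import random

def simulate_sequences(n, theta, seed=None):
    """Simulate n binary sequences on a random coalescent tree (theta must be > 0)."""
    rng = random.Random(seed)
    # Each active lineage is (time at which it was created, leaves below it).
    lineages = [(0.0, frozenset([leaf])) for leaf in range(n)]
    time = 0.0
    sites = []  # one entry per mutation: the set of leaves carrying allele 1

    def drop_mutations(length, leaves):
        # Poisson process with rate theta/2 along a branch of the given length.
        pos = rng.expovariate(theta / 2)
        while pos < length:
            sites.append(leaves)
            pos += rng.expovariate(theta / 2)

    while len(lineages) > 1:
        k = len(lineages)
        time += rng.expovariate(k * (k - 1) / 2)
        i, j = sorted(rng.sample(range(k), 2))
        born_b, leaves_b = lineages.pop(j)   # pop the larger index first
        born_a, leaves_a = lineages.pop(i)
        drop_mutations(time - born_a, leaves_a)
        drop_mutations(time - born_b, leaves_b)
        lineages.append((time, leaves_a | leaves_b))

    return [[1 if leaf in s else 0 for s in sites] for leaf in range(n)]

for row in simulate_sequences(6, theta=5.0, seed=1):
    print("".join(map(str, row)))
```

Because every column marks the leaves below a single branch, any two columns are nested or disjoint, so the output never contains all four gametes for a pair of sites.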

Coalescent with Recombination

• An individual may have one parent, or 2 parents

ARG: Coalescent with recombination

• Given: mutation rate, recombination rate, population size 2N (diploid), sample size n.

• How can you generate the ARG (topology+branch lengths) efficiently?

• How will you generate sequences for n individuals?

• Given sequence data, can you reconstruct the ARG (topology)?

Recombination

• Define r as the probability of recombining.

– Note that the parameter is a scaled value which will be defined later

• Assume k individuals in a generation. The following might happen:

1. An individual arises because of a recombination event between two individuals (It will have 2 parents).

2. Two individuals coalesce

3. Neither (Each individual has a distinct parent)

4. Multiple events (low probability)

Recombination

• We ignore the case of multiple (> 1) events in one generation

• Pr (No recombination) = 1-kr

• Pr (No coalescence) $\approx 1 - \binom{k}{2}\frac{1}{2N}$

• Consider scaled time in units of 2N generations. Thus the number of individuals increases with rate kr2N, and decreases with rate $\binom{k}{2}$

• The value 2rN is usually small, and therefore the process will ultimately coalesce to a single individual (MRCA)

• Let k = n

• Define $\rho = 2rN$

• Iterate until k = 1

– Choose time from an exponential distribution with rate $\binom{k}{2} + k\rho$

– Pick event as recombination with probability $\frac{k\rho}{\binom{k}{2} + k\rho} = \frac{2\rho}{2\rho + (k-1)}$

– If event is recombination, choose an individual to recombine, and a position, else choose a pair to coalesce.

– Update k, and continue

What is the flaw in this procedure?
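The iteration can be written down literally as a birth-death process on the number of lineages. The sketch below does exactly that and nothing more: it tracks only k, not which individuals or positions are involved, so it reproduces the procedure as stated rather than a complete ARG simulator. The function name and the parameter `rho` (the scaled recombination rate, 2rN) are assumptions for this example.

```python
import random

def simulate_arg_events(n, rho, seed=None):
    """Return a list of (time, kind) events, time in units of 2N generations."""
    rng = random.Random(seed)
    k, time, events = n, 0.0, []
    while k > 1:
        coal_rate = k * (k - 1) / 2   # lineages decrease at rate C(k, 2)
        rec_rate = k * rho            # lineages increase at rate k * rho
        time += rng.expovariate(coal_rate + rec_rate)
        if rng.random() < rec_rate / (coal_rate + rec_rate):
            events.append((time, "recombination"))
            k += 1                    # one lineage splits into two parents
        else:
            events.append((time, "coalescence"))
            k -= 1                    # two lineages merge into one
    return events

events = simulate_arg_events(5, rho=0.5, seed=3)
print(sum(kind == "recombination" for _, kind in events), "recombination events")
```

Whatever the random path, the number of coalescences exceeds the number of recombinations by exactly n - 1 when the loop stops.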

Simulating sequences on the ARG

• Generate topology and branch lengths as before

• For each recombination, generate a position.

• Next generate mutations at random on branch lengths

– For a mutation, select a position as well.

Recombination events and $\rho$

• Given the scaled recombination rate $\rho$ and n, can you compute the expected number of recombination events?

• It can be shown that $E(n, \rho) = \rho \log(n)$

• The questions that people are really interested in

– Given a set of sequences from a population, compute the recombination rate $\rho$

– Given a population, reconstruct the most likely history (as an ancestral recombination graph)

• We will address these questions in subsequent lectures

An algorithm for constructing a perfect phylogeny

• We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.

• In any tree, each node (except the root) has a single parent.

– It is sufficient to construct a parent for every node.

• In each step, we add a column and refine some of the nodes containing multiple children.

• Stop if all columns have been considered.

Inclusion Property

• For any pair of columns i, j

– i < j if and only if $i_1 \supseteq j_1$

• Note that if i < j then the edge containing i is an ancestor of the edge containing j

Example

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

Leaves: A B C D E

Initially, there is a single clade r, and each node has r as its parent

Sort columns

• Sort columns according to the inclusion property (note that the columns are already sorted here).

• This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
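The sorting step can be sketched as follows, reading each column top to bottom as a binary number. The function name `sort_columns` is hypothetical, and column indices are 0-based in the code (column 1 of the slide is index 0).

```python
def sort_columns(matrix):
    """Return column indices ordered by decreasing binary value (row 1 = most significant bit)."""
    n_cols = len(matrix[0])

    def column_value(c):
        return int("".join(str(row[c]) for row in matrix), 2)

    return sorted(range(n_cols), key=column_value, reverse=True)

# The example matrix: rows A..E, columns 1..5.
example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
print(sort_columns(example))  # [0, 1, 2, 3, 4]: the columns are already sorted
```

The column values here are 21, 20, 10, 4 and 2. A column whose set of 1s contains another column's set always has the larger value, so it is placed first.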

Add first column

• In adding column i

– Check each edge and decide which side the column belongs on.

– Finally add a node if you can resolve a clade

[Figure: partially resolved tree on leaves A, B, C, D, E]

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

Adding other columns

• Add other columns on edges using the ordering property

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
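One way to turn the column-by-column procedure into code is to record a parent for every node. The sketch below is an assumption-laden illustration rather than the exact procedure on the slides: it handles the rooted case only (0 ancestral), names the node created by column c as `c<number>` and the root as `r`, and walks each row through the sorted columns instead of refining clades one column at a time.

```python
def build_perfect_phylogeny(matrix, leaf_names):
    """Return {node: parent} for a rooted perfect phylogeny, or None if none exists."""
    n_cols = len(matrix[0])
    # Sort columns in decreasing order (equivalent to the binary-number ordering).
    order = sorted(range(n_cols), key=lambda c: [row[c] for row in matrix], reverse=True)
    parent = {}
    for row, leaf in zip(matrix, leaf_names):
        current = "r"
        for c in order:
            if row[c] == 1:
                node = f"c{c + 1}"
                # The node for column c must hang below the same node in every row.
                if parent.setdefault(node, current) != current:
                    return None
                current = node
        parent[leaf] = current   # the leaf attaches below its last mutation
    return parent

example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
tree = build_perfect_phylogeny(example, ["A", "B", "C", "D", "E"])
for node, par in tree.items():
    print(node, "->", par)
```

For the example this places A below c2, C below c4, E below c1, B below c3 and D below c5, with c1 and c3 attached to the root.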

Unrooted case

• Switch the values in each column, so that 0 is the majority element.

• Apply the algorithm for the rooted case
