Advanced Algorithms and Models for Computational Biology -- a machine learning approach
Population Genetics: More on haplotypes --- coalescence, multi-population phasing, blocks, etc. Eric Xing, Lecture 19, March 27, 2006.
![Page 1: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/1.jpg)
Advanced Algorithms and Models for Computational Biology -- a machine learning approach
Population Genetics: More on haplotypes --- coalescence, multi-population phasing, blocks, etc.
Eric Xing
Lecture 19, March 27, 2006
![Page 2: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/2.jpg)
Clustering
How to label them ? inference
How many clusters ??? model selection ? or inference ?
![Page 3: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/3.jpg)
Genetic Demography
Are there genetic prototypes among them ? What are they ? How many ? (how many ancestors do we have ?)
![Page 4: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/4.jpg)
Inference done separately, or jointly?
Multi-population Genetic Demography
![Page 5: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/5.jpg)
Clustering as Mixture Modeling
![Page 6: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/6.jpg)
Model Selection vs. Posterior Inference
Model selection:
• "intelligent" guess: ???
• cross validation: data-hungry
• information theoretic (AIC, TIC, MDL): $\arg\min_K \mathrm{KL}\big(f(\cdot) \,\|\, g(\cdot \mid \hat{\theta}_{ML}, K)\big)$
• Parsimony, Occam's Razor
Posterior inference: $p(M \mid D) \propto p(D \mid M)\, p(M)$, with $M \equiv \{\theta, K\}$
• we want to handle uncertainty of model complexity explicitly
• we favor a distribution that does not constrain M in a "closed" space!
![Page 7: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/7.jpg)
Haplotype Ambiguity
Heterozygous diploid individual: the two chromosomes Cp and Cm carry C T A and T G A.
Genotype g: pairs of alleles, whose associations to chromosomes are unknown. ATGC sequencing reads TC TG AA.
Haplotype h ≡ (h1, h2): possible associations of alleles to chromosomes. Is the pair (C T A, T G A) or (C G A, T T A)? ??????
This is a mixture modeling problem!
![Page 8: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/8.jpg)
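A Finite (Mixture of) Allele Model
The probability of a genotype g:
$$p(g) = \sum_{h_1, h_2 \in H} p(h_1, h_2)\, p(g \mid h_1, h_2)$$
where $H$ is the population haplotype pool, $p(h_1, h_2)$ is the haplotype model, and $p(g \mid h_1, h_2)$ is the genotyping model.
Standard settings:
• $|H| = K \ll 2^J$: fixed-sized population haplotype pool
• $p(h_1, h_2) = p(h_1)\, p(h_2) = f_1 f_2$: Hardy-Weinberg equilibrium
Problem: K ? H ?
Graphical model: Hn1, Hn2 → Gn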
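The sum over haplotype pairs on this slide can be made concrete with a small sketch. It assumes a noiseless genotyping model (p(g | h1, h2) is 1 when the pair is consistent with g, else 0), and the pool frequencies below are hypothetical values chosen only for illustration; the four haplotypes are the ones from the haplotype ambiguity example.

```python
from itertools import product

def genotype_of(h1, h2):
    """Unordered per-site allele pairs implied by two haplotypes."""
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))

def genotype_probability(g, pool):
    """p(g) = sum over (h1, h2) in H x H of p(h1, h2) * p(g | h1, h2).

    pool maps each population haplotype to its frequency f.
    Assumption: the genotyping model is noiseless (0 or 1).
    """
    total = 0.0
    for (h1, f1), (h2, f2) in product(pool.items(), repeat=2):
        if genotype_of(h1, h2) == g:
            total += f1 * f2  # Hardy-Weinberg: p(h1, h2) = f1 * f2
    return total

# Hypothetical pool of K = 4 population haplotypes over J = 3 sites.
pool = {"CTA": 0.4, "TGA": 0.3, "CGA": 0.2, "TTA": 0.1}
g = genotype_of("CTA", "TGA")  # per-site pairs {C,T}, {G,T}, {A,A}
p_g = genotype_probability(g, pool)
```

Both (CTA, TGA) and (CGA, TTA) explain the same genotype, so both orderings of both pairs contribute to the sum.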
![Page 9: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/9.jpg)
The coalescent process (figure; time axis ending at the present): 22 individuals
![Page 10: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/10.jpg)
The coalescent process (figure; time axis ending at the present): 22 individuals, 18 ancestors
![Page 11: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/11.jpg)
The coalescent process (figure; time axis ending at the present): 22 individuals, then 18 and 16 ancestors
![Page 12: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/12.jpg)
The coalescent process (figure; time axis ending at the present): 22 individuals, then 18, 16 and 14 ancestors
![Page 13: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/13.jpg)
The coalescent process (figure; time axis ending at the present): 22 individuals, then 18, 16, 14, 12, 9, 8, 8, 7, 7, 5, 5, 3, 3, 3, 2 and 2 ancestors, and finally 1 ancestor
![Page 14: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/14.jpg)
Figure (time axis ending at the present): the genealogy traced back to the most recent common ancestor (MRCA)
![Page 15: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/15.jpg)
Mutational events can now be added to the genealogical tree, resulting in polymorphic sites. If these sites are typed in the modern sample, they can be used to split the sample into sub-clades (represented by different colours)
![Page 16: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/16.jpg)
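Figure (time axis ending at the present): the genealogy from the most recent common ancestor (MRCA) to six sampled sequences, with polymorphic sites marked by *:
TCGAGGTATTAAC
TCTAGGTATTAAC
TCGAGGCATTAAC
TCTAGGTGTTAAC
TCGAGGTATTAGC
TCTAGGTATCAAC
Polymorphic sites: positions 3, 7, 8, 10 and 12.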
![Page 17: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/17.jpg)
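Figure (time axis ending at the present): the six sampled sequences TCGAGGTATTAAC, TCTAGGTATTAAC, TCGAGGCATTAAC, TCTAGGTGTTAAC, TCGAGGTATTAGC and TCTAGGTATCAAC, with polymorphic sites marked by *.
Assuming there are presently $k$ active lineages:
The probability of coalescence: $\dfrac{k-1}{k-1+\alpha}$
The probability of mutation (killing a lineage): $\dfrac{\alpha}{k-1+\alpha}$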
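A minimal simulation sketch of this backward process: with k active lineages, the next event is a mutation with probability α/(k-1+α) and a coalescence otherwise, and either event removes one active lineage. Each killed lineage founds one allele class, so the number of mutations equals the number of distinct alleles in the sample. The function names and the value of `alpha` are illustrative assumptions, not part of the lecture.

```python
import random

def count_alleles(n, alpha, rng):
    """Trace n lineages back in time; return how many are killed by mutation."""
    alleles = 0
    for k in range(n, 0, -1):  # k = number of currently active lineages
        p_mutation = alpha / (k - 1 + alpha)
        if rng.random() < p_mutation:
            alleles += 1  # mutation: this lineage ends and defines an allele
        # otherwise two lineages coalesce; either way k drops by one
    return alleles

def expected_alleles(n, alpha):
    """Sum of the per-step mutation probabilities for k = 1..n."""
    return sum(alpha / (k - 1 + alpha) for k in range(1, n + 1))

rng = random.Random(0)
sample_counts = [count_alleles(22, 1.0, rng) for _ in range(5)]
```

When only one lineage remains (k = 1), the mutation probability is 1, so every sample contains at least one allele.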
![Page 18: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/18.jpg)
Population Genetic Basis of an Infinite Allele model
Figure: natural genealogy ≈ coalescent with mutation; infinite mixtures (N)
• Kingman coalescent process with binary lineage merging
• New population haplotype alleles emerge along all branches of the coalescence tree at rate α/2 per unit length
• The Ewens Sampling Formula: an exchangeable random partition of individuals
• Dirichlet process mixture
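For reference, the Ewens Sampling Formula named here can be evaluated directly. The sketch below uses its standard form, $P(a_1, \dots, a_n) = \frac{n!}{\alpha(\alpha+1)\cdots(\alpha+n-1)} \prod_j \frac{\alpha^{a_j}}{j^{a_j}\, a_j!}$, where $a_j$ is the number of alleles seen exactly $j$ times in a sample of size $n$. The function name and the example values of α are assumptions for illustration.

```python
from math import factorial

def ewens_probability(a, alpha):
    """Probability of an allelic configuration under the Ewens Sampling Formula.

    a[j-1] is the number of alleles represented exactly j times in the sample.
    """
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = 1.0
    for i in range(n):
        rising *= alpha + i  # alpha * (alpha + 1) * ... * (alpha + n - 1)
    prob = factorial(n) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= alpha ** a_j / (j ** a_j * factorial(a_j))
    return prob

# All allelic configurations of a sample of size 4.
configs = [(4, 0, 0, 0), (2, 1, 0, 0), (0, 2, 0, 0), (1, 0, 1, 0), (0, 0, 0, 1)]
total = sum(ewens_probability(a, 1.0) for a in configs)
```

Summing over every configuration of a fixed sample size gives 1, which is a quick sanity check on the formula.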
![Page 19: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/19.jpg)
A Hierarchical Bayesian Infinite Allele model
• Assume an individual haplotype $h$ is stochastically derived from a population haplotype $a_k$ with nucleotide-substitution frequency $\theta_k$: $h \sim p(h \mid \{a, \theta\}_k)$.
• Not knowing the correspondences between individual and population haplotypes, each individual haplotype is a mixture of population haplotypes.
• The number and identity of the population haplotypes are unknown: use a Dirichlet Process to construct a prior distribution $G$ on $H \times R^J$.
Graphical model: G0 → G → (Ak, θk) → Hn1, Hn2 → Gn
![Page 20: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/20.jpg)
Dirichlet Process Priors
A c.d.f. G on Θ follows a Dirichlet Process if, for any measurable finite partition (B1, B2, …, Bm) of Θ, the joint distribution of the random variables
(G(B1), G(B2), …, G(Bm)) ~ Dirichlet(αG0(B1), …, αG0(Bm)),
where G0 is the base distribution and α is the precision parameter.
![Page 21: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/21.jpg)
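Stick-breaking Process
$$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}$$
Location: $\theta_k \sim G_0$
Mass: $\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j)$ with $\beta_k \sim \mathrm{Beta}(1, \alpha)$, so that $\sum_{k=1}^{\infty} \pi_k = 1$
Example (stick remaining, β, π): (1, 0.4, 0.4), (0.6, 0.5, 0.3), (0.3, 0.8, 0.24)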
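The stick-breaking construction can be sketched in a few lines. The first function turns a sequence of break fractions β into weights π and reproduces the slide's numeric example; the second draws β from Beta(1, α) and truncates at a finite number of sticks, which is an approximation introduced here for illustration (the true process is infinite).

```python
import random

def stick_breaking_weights(betas):
    """pi_k = beta_k * prod_{j<k} (1 - beta_j)."""
    weights, remaining = [], 1.0
    for beta in betas:
        weights.append(remaining * beta)  # break off a fraction of what is left
        remaining *= 1.0 - beta
    return weights

def sample_truncated_dp_weights(alpha, num_sticks, rng):
    """Draw beta_k ~ Beta(1, alpha) and return the first num_sticks weights."""
    betas = [rng.betavariate(1.0, alpha) for _ in range(num_sticks)]
    return stick_breaking_weights(betas)

example = stick_breaking_weights([0.4, 0.5, 0.8])  # the slide's three breaks
sampled = sample_truncated_dp_weights(1.0, 50, random.Random(1))
```

The truncated weights sum to slightly less than 1; the missing mass is the part of the stick that has not yet been broken.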
![Page 22: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/22.jpg)
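Chinese Restaurant Process
CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers.
$P(c_i = k \mid \mathbf{c}_{-i})$ for tables 1, 2, 3, …, where $m_k$ is the number of customers already seated at table $k$:
Customer 1: 1, 0, 0
Customer 2: 1/(1+α), α/(1+α), 0
Customer 3: 1/(2+α), 1/(2+α), α/(2+α)
Customer 4: 1/(3+α), 2/(3+α), α/(3+α)
Customer i: m1/(i-1+α), m2/(i-1+α), …, α/(i-1+α)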
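The seating rule can be sketched as follows: customer i joins existing table k with probability m_k/(i-1+α) and opens a new table with probability α/(i-1+α). Function names and the value of `alpha` are illustrative assumptions.

```python
import random

def crp_probabilities(counts, alpha):
    """Seating probabilities for the next customer: existing tables, then a new one."""
    total = sum(counts) + alpha  # equals i - 1 + alpha for customer i
    return [m / total for m in counts] + [alpha / total]

def sample_crp(num_customers, alpha, rng):
    """Seat customers one by one; return table assignments and table sizes."""
    counts, assignments = [], []
    for _ in range(num_customers):
        probs = crp_probabilities(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

assignments, counts = sample_crp(9, 1.0, random.Random(0))
```

With tables of sizes 1 and 2 and α = 1, `crp_probabilities([1, 2], 1.0)` gives the fourth-customer row of the slide's table evaluated at α = 1.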
![Page 23: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/23.jpg)
The DP Mixture of Ancestral Haplotypes
Figure: customers 1 to 9 seated around a sequence of tables, each table labeled with its own {A, θ}.
• The customers around a table form a cluster: associate a mixture component (i.e., a population haplotype) with a table.
• Sample {a, θ} at each table from a base measure G0 to obtain the population haplotype and nucleotide substitution frequency for that component.
• With p(h | {A, θ}) and p(g | h1, h2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies).
![Page 24: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/24.jpg)
DP-haplotyper
Inference: Markov Chain Monte Carlo (MCMC), via Gibbs sampling or Metropolis-Hastings
Graphical model (nodes): G0 → G (DP) → A (K) → Hn1, Hn2 → Gn (N)
• G0, G, A: infinite mixture components (for population haplotypes)
• Hn1, Hn2, Gn: likelihood model (for individual haplotypes and genotypes)
![Page 25: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/25.jpg)
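Model components
Choice of base measure:
$$G_0 \sim \prod_j \mathrm{Unif}(a_j)\, \mathrm{Beta}(\theta_j)$$
Nucleotide-substitution model:
$$p(h_i \mid \{a, \theta\}_k) = \prod_j p(h_{i,j} \mid a_{k,j}, \theta_{k,j}), \quad \text{where } p(h_{i,j} \mid a_{k,j}, \theta_{k,j}) = \begin{cases} \theta_{k,j} & \text{if } h_{i,j} = a_{k,j} \\ 1 - \theta_{k,j} & \text{if } h_{i,j} \neq a_{k,j} \end{cases}$$
Noisy genotyping model:
$$p(g_i \mid h_{i_1}, h_{i_2}) = \prod_j p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j}), \quad \text{where } p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j}) = \begin{cases} \gamma & \text{if } h_{i_1,j} \oplus h_{i_2,j} = g_{i,j} \\ (1 - \gamma)/2 & \text{if } h_{i_1,j} \oplus h_{i_2,j} \neq g_{i,j} \end{cases}$$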
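A sketch of the two likelihood terms, under an encoding chosen here for illustration: alleles are 0 or 1, and a genotype site takes a value in {0, 1, 2} equal to the number of copies of allele 1, so that combining two haplotype sites is simply their sum. The symbol names `theta` (per-site probability that an individual haplotype matches its ancestor) and `gamma` (probability that the genotype is read correctly) follow the usual notation for this model and are assumptions where the slide text is garbled.

```python
def haplotype_likelihood(h, a, theta):
    """p(h | a, theta) = prod_j (theta_j if h_j == a_j else 1 - theta_j)."""
    prob = 1.0
    for h_j, a_j, theta_j in zip(h, a, theta):
        prob *= theta_j if h_j == a_j else 1.0 - theta_j
    return prob

def genotype_likelihood(g, h1, h2, gamma):
    """p(g | h1, h2) = prod_j (gamma if consistent else (1 - gamma) / 2).

    Assumed encoding: g_j in {0, 1, 2} counts copies of allele 1, so the
    genotype implied by the two haplotypes at site j is h1_j + h2_j.
    """
    prob = 1.0
    for g_j, h1_j, h2_j in zip(g, h1, h2):
        prob *= gamma if h1_j + h2_j == g_j else (1.0 - gamma) / 2.0
    return prob

p_h = haplotype_likelihood((0, 1, 1), (0, 1, 0), (0.9, 0.9, 0.9))
p_g = genotype_likelihood((1, 2), (0, 1), (1, 1), 0.98)
```

With three possible genotype values per site, the error mass 1 - γ is split evenly over the two inconsistent values, so each site's distribution sums to 1.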
![Page 26: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/26.jpg)
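Gibbs sampling
Starting from some initial haplotype reconstruction $H^{(0)}$, pick a first table with an arbitrary $a_1^{(0)}$, and form the initial population-hap pool $A^{(0)} = \{a_1^{(0)}\}$. Then iterate (a subscript $t$ indexes the chosen haplotype; a superscript $(t)$ indexes the sampler iteration):
i) Choose an individual $i$ and one of his/her two haplotypes $t$, uniformly at random, from all ambiguous individuals;
ii) Sample $c_{i_t}^{(t+1)}$ from $p(c_{i_t}^{(t+1)} \mid \mathbf{c}_{-i_t}^{(t)}, H^{(t)}, A^{(t)})$, update $\mathbf{c}^{(t+1)}$;
iii) Sample $a_k^{(t+1)}$, where $k = c_{i_t}^{(t+1)}$, from $p(a_k^{(t+1)} \mid \{h_{i'_{t'}}^{(t)} \text{ s.t. } c_{i'_{t'}}^{(t+1)} = k\})$; update $A^{(t+1)}$;
iv) Sample $h_{i_t}^{(t+1)}$ from $p(h_{i_t}^{(t+1)} \mid c_{i_t}^{(t+1)}, H_{-i_t}^{(t)}, A^{(t+1)})$, update $H^{(t+1)}$.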
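Step (ii) can be illustrated with a simplified sketch, not the lecture's exact sampler. It assumes binary alleles, a single fixed substitution frequency `theta` shared by all tables, and a uniform prior on a new ancestor, so that the marginal likelihood of a haplotype of length J at a new table is 2^(-J). The weight of an existing table k is its occupancy m_k (excluding the haplotype being resampled) times p(h | a_k, θ); the weight of a new table is α times that marginal.

```python
import random

def hap_likelihood(h, a, theta):
    """p(h | a, theta) with one shared theta (simplifying assumption)."""
    prob = 1.0
    for h_j, a_j in zip(h, a):
        prob *= theta if h_j == a_j else 1.0 - theta
    return prob

def assignment_probabilities(h, ancestors, counts, alpha, theta):
    """Normalized p(c = k | rest) over existing tables, then a new table.

    counts must exclude the haplotype h that is being reassigned.
    """
    weights = [m * hap_likelihood(h, a, theta) for a, m in zip(ancestors, counts)]
    weights.append(alpha * 0.5 ** len(h))  # new table: alpha * marginal of h
    total = sum(weights)
    return [w / total for w in weights]

def resample_assignment(h, ancestors, counts, alpha, theta, rng):
    probs = assignment_probabilities(h, ancestors, counts, alpha, theta)
    return rng.choices(range(len(probs)), weights=probs)[0]

probs = assignment_probabilities((0, 1), [(0, 1), (1, 0)], [3, 1], 1.0, 0.9)
new_table = resample_assignment((0, 1), [(0, 1), (1, 0)], [3, 1], 1.0, 0.9,
                                random.Random(0))
```

The common denominator of the CRP prior cancels in the normalization, so only the table occupancies and α are needed.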
![Page 27: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/27.jpg)
Convergence of Ancestral Inference
![Page 28: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/28.jpg)
Haplotyping Error
The Gabriel data
![Page 29: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/29.jpg)
Multi-population Genetic Demography
Pool everything together and solve 1 hap problem? --- ignore population structures
Solve 4 hap problems separately? --- data fragmentation
Co-clustering … solve 4 coupled hap problems jointly
![Page 30: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/30.jpg)
Why humans are so similar and polymorphic patterns are regional
Population bottleneck: a small population that interbred reduced the genetic variation
Out of Africa ~ 100,000 years ago
Out of Africa
![Page 31: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/31.jpg)
Population Specific DPs
• Each population can be associated with a unique DP capturing population-specific genetic demography
• Different populations may have unique haplotypes
• Different populations may share common haplotypes
• Thus population-specific DPs are marginally dependent
Figure: four population-specific DPs G1, G2, G3, G4
![Page 32: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/32.jpg)
Hierarchical DP Mixture
Figure: population-specific DPs G1, G2, G3, G4 (populations 1 to 4, ....) tied together through a shared H
![Page 33: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/33.jpg)
Simulated Data
As 1 HapTyping problem
As 5 HapTyping problems
![Page 34: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/34.jpg)
HapMap Data
![Page 35: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/35.jpg)
The block structure of haplotypes
The Daly et al. (2001) data set: this consists of 103 common SNPs (>5% minor allele frequency) in a 500 kb region implicated in Crohn disease, genotyped in 129 trios (mom, pop, kid) from a European-derived population, giving 258 transmitted and 258 untransmitted chromosomes.
The haplotype blocks span up to 100kb and contain 5 or more common SNPs. For example, one 84 kb block of 8 SNPs shows just two distinct haplotypes accounting for 95% of the observed chromosomes.
![Page 36: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/36.jpg)
Another study: the Patil et al data
The haplotype patterns for 20 independent globally diverse chromosomes defined by 147 common human chr 21 SNPs spanning 106 kb of genomic sequence. Each row represents an SNP. Blue box = major, yellow = minor allele. Each column represents a single chromosome.
The 147 SNPs are divided into 18 blocks defined by black lines. The expanded box on the right is an SNP block of 26 SNPs over 19 kb of genomic DNA. The 4 most common of 7 different haplotypes include 80% of the chromosomes, and can be distinguished with 2 SNPs.
Figure 2 of Patil et al
![Page 37: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/37.jpg)
Haplotype Blocks
![Page 38: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/38.jpg)
Markov Dependence Between Blocks
On any chromosome, the haplotype carried at the kth block is drawn from A(k) according to unknown probabilities that depend on the haplotype at block k − 1.
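The first-order Markov dependence between blocks can be sketched as a chain over haplotype indices. Here `initial[i]` is the probability of haplotype i in the first block and `transitions[b][i][j]` is the probability of haplotype j in block b+1 given haplotype i in block b; both names and all numbers are hypothetical, since the slide treats these probabilities as unknown quantities to be estimated.

```python
import random

def sample_chromosome(initial, transitions, rng):
    """Draw one haplotype index per block, left to right along the chromosome."""
    path = [rng.choices(range(len(initial)), weights=initial)[0]]
    for matrix in transitions:
        row = matrix[path[-1]]  # distribution given the previous block's haplotype
        path.append(rng.choices(range(len(row)), weights=row)[0])
    return path

def chromosome_probability(path, initial, transitions):
    """Probability of a full sequence of block haplotypes under the chain."""
    prob = initial[path[0]]
    for matrix, prev, cur in zip(transitions, path, path[1:]):
        prob *= matrix[prev][cur]
    return prob

# Two blocks with two haplotypes each (hypothetical probabilities).
initial = [0.6, 0.4]
transitions = [[[0.9, 0.1], [0.2, 0.8]]]
path = sample_chromosome(initial, transitions, random.Random(0))
```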
![Page 39: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/39.jpg)
Block Partitioning Algorithms
Greedy algorithm (Patil et al 2001):
• Begin by considering all possible blocks of ≥1 consecutive SNPs.
• Next, exclude all blocks in which < 80% of the chromosomes in the data are defined by haplotypes represented more than once in the block (80% coverage).
• Considering the remaining overlapping blocks simultaneously, select the one which maximizes the ratio of total SNPs in the block to the number required to uniquely discriminate haplotypes represented more than once in the block.
• Any of the remaining blocks that physically overlap with the selected block are discarded, and the process is repeated until we have selected a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chr 21 with no gaps and with every SNP assigned to a block.
Hidden Markov Models:
• Maximum a posteriori inference (i.e., Viterbi) (Daly et al. 2001)
• Minimum description length (Anderson et al. 2003)
Dynamic programming (Sun et al. 2002)
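The greedy procedure described above can be sketched on toy data. This is an illustration under simplifying assumptions, not the published implementation: chromosomes are fully typed strings of equal length with no missing data, the number of discriminating SNPs is found by brute force (feasible only for short blocks), and ties between equally scored blocks are broken arbitrarily by tuple ordering.

```python
from collections import Counter
from itertools import combinations

def common_haplotypes(chromosomes, start, end):
    """Haplotypes seen more than once in the block [start, end)."""
    counts = Counter(c[start:end] for c in chromosomes)
    return {h: m for h, m in counts.items() if m > 1}

def coverage(chromosomes, start, end):
    """Fraction of chromosomes carrying a haplotype seen more than once."""
    common = common_haplotypes(chromosomes, start, end)
    return sum(common.values()) / len(chromosomes)

def tag_snps_needed(common):
    """Smallest number of sites that tells the common haplotypes apart."""
    haps = list(common)
    length = len(haps[0])
    for r in range(1, length + 1):
        for sites in combinations(range(length), r):
            if len({tuple(h[s] for s in sites) for h in haps}) == len(haps):
                return r
    return length

def greedy_blocks(chromosomes, min_coverage=0.8):
    num_snps = len(chromosomes[0])
    candidates = []
    for start in range(num_snps):
        for end in range(start + 1, num_snps + 1):
            if coverage(chromosomes, start, end) >= min_coverage:
                common = common_haplotypes(chromosomes, start, end)
                ratio = (end - start) / tag_snps_needed(common)
                candidates.append((ratio, start, end))
    blocks = []
    while candidates:
        _, start, end = max(candidates)  # best SNPs-per-tag ratio
        blocks.append((start, end))
        # discard every candidate that overlaps the selected block
        candidates = [c for c in candidates if c[2] <= start or c[1] >= end]
    return sorted(blocks)

# Toy data: SNPs 0-1 are perfectly correlated, as are SNPs 2-3.
data = ["0000", "0011", "1100", "1111", "0000", "0011", "1100", "1111"]
blocks = greedy_blocks(data)
```

On this toy input the procedure returns two blocks, each of which needs only one SNP to distinguish its two haplotypes.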
![Page 40: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/40.jpg)
Haplotypes are Shaped by Recombination
Inheritance of unknown generation
....
![Page 41: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/41.jpg)
Hidden Markov DP for Recombination
Figure: graphical model with a shared H above a chain of nodes y1, y2, y3, …, yN (....)
![Page 42: Advanced Algorithms and Models for Computational Biology -- a machine learning approach](https://reader036.vdocuments.mx/reader036/viewer/2022070405/56813f47550346895da9fc8a/html5/thumbnails/42.jpg)
References
E.P. Xing, R. Sharan and M.I. Jordan. Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the 21st International Conference on Machine Learning (ICML 2004).
N. Patil et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294 (2001): 1719-1723.
M.J. Daly et al. High-resolution haplotype structure in the human genome. Nat. Genet. 29 (2001): 229-232.
E.C. Anderson and J. Novembre. Finding haplotype block boundaries using the minimum description length principle. American Journal of Human Genetics 73(2) (2003): 336-354.