Introduction to Haplotype Estimation
Stat/Biostat 550
The Haplotype Problem
• Suppose we genotype individuals at a number of tightly linked SNPs.
A C G C C T T T G C G C
G A A C C C C C A G G C
The Haplotype Problem
• What do the types on the two chromosomes look like?
Haplotypes: who cares?
• LD mapping: increase power?
• LD mapping: decrease genotyping?
• Evolutionary studies: selection, recombination, gene conversion, population structure,…
Many people, for many different reasons…
The Haplotype Problem – potential solutions
• Molecular methods
• Collect family data
• Statistical methods for population data
The Simplest Case
• What do the types on the two chromosomes look like?
The Next Simplest Case
• What do the types on the two chromosomes look like?
The first difficult case…
• What do the types on the two chromosomes look like?
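To make the question concrete, the sketch below lists every haplotype pair consistent with an unphased genotype. The 0/1/2 genotype coding (number of copies of allele 1 at each SNP) and the function name `compatible_pairs` are assumptions made for illustration; they do not come from the slides.

```python
from itertools import product

def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with a genotype.

    genotype: tuple of 0/1/2 (assumed coding: copies of allele 1 per SNP).
    Haplotypes are tuples of 0/1.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed (0 -> 0, 2 -> 1); het sites start at 0.
    base = [g // 2 for g in genotype]
    pairs = []
    # Fix the phase of the first het site so each unordered pair appears once.
    for bits in product((0, 1), repeat=max(len(het_sites) - 1, 0)):
        phases = (0,) + bits if het_sites else ()
        h1, h2 = list(base), list(base)
        for site, b in zip(het_sites, phases):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

# Two heterozygous sites: the first case with more than one resolution.
for pair in compatible_pairs((1, 1)):
    print(pair)
```

Under this coding, a genotype with k heterozygous sites (k at least 1) has 2^(k-1) compatible pairs, so individuals with zero or one heterozygous site are unambiguous and two heterozygous sites give the first ambiguous case.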
Clark’s Method (1990)
• Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.
Is it this configuration?
…or this one?
This one is more probable.
[Figure: candidate haplotype configurations, rows labelled 1, 2, 3]
Clark’s Method (Clark, 1990)
• Identify the unambiguous individuals.
• Make a list of “known” haplotypes.
• Go through the list and see whether each ambiguous individual can be made up from a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
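The three steps above can be sketched as follows. This is a minimal illustration rather than Clark's original implementation: the 0/1/2 genotype coding (copies of allele 1 per SNP) and the names `complement` and `clark` are assumptions, and ties are broken simply by list order.

```python
def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if impossible."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(o in (0, 1) for o in other) else None

def clark(genotypes):
    """genotypes: list of tuples of 0/1/2 (assumed coding).

    Returns (resolved, known): resolved maps individual index -> haplotype pair,
    known is the list of "known" haplotypes in the order they were found.
    """
    known, resolved = [], {}
    # Steps 1 and 2: unambiguous individuals have at most one heterozygous site.
    for i, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(x // 2 for x in g)  # het site (if any) set to 0
            h2 = complement(g, h1)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 3: repeatedly try to explain ambiguous individuals as known + complement.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:  # exact match only
                    resolved[i] = (h, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    return resolved, known

resolved, known = clark([(0, 0, 0), (1, 1, 0), (1, 2, 1)])
print(resolved)
print(known)
```

Because the first matching known haplotype wins, reordering `genotypes` can change the result, and an individual with no exact match in `known` is simply left out of `resolved`.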
Clark’s Method
[Figure: list of known haps., rows labelled 1, 2, 3]
Clark’s Method: Problem 1
[Figure: list of known haps., rows labelled 1, 2, 3]
Answer depends on the order in which the list is considered…
… and frequency information is ignored
Clark’s Method: Problem 2
[Figure: list of known haps., rows labelled 1, 2, 3]
The algorithm can fail to resolve all haplotypes…
… because it looks only for exact matches
Clark’s Algorithm: Summary
• Results may depend on the order in which individuals are considered.
• Frequency information is ignored.
• May fail to resolve all haplotypes.
• Fails to assess uncertainty.
• Looks only for exact matches.
• Fast and intuitive(?).
Maximum Likelihood (EM Algorithm)
• Idea: find haplotype frequencies (f1,…,fN) to maximise the probability of the observed genotype data (g1,…,gn).
$$\Pr(g_i \mid f_1,\dots,f_N) = \sum_{\{h_1,h_2 \,:\, h_1 \oplus h_2 = g_i\}} f_{h_1} f_{h_2}$$
$$\Pr(g_1,\dots,g_n \mid f_1,\dots,f_N) = \prod_{i} \Pr(g_i \mid f_1,\dots,f_N)$$
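A minimal EM sketch for this likelihood is given below. It assumes a 0/1/2 genotype coding (copies of allele 1 per SNP), random mating so that an ordered pair (h1, h2) has probability f_h1 f_h2 (hence the factor 2 for distinct haplotypes when pairs are listed unordered), and a uniform starting point. The function names and the fixed iteration count are illustrative choices, not part of the slides.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]
    pairs = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        phases = (0,) + bits if het else ()
        h1, h2 = list(base), list(base)
        for site, b in zip(het, phases):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM (illustrative sketch)."""
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    freqs = {h: 1.0 / len(haps) for h in haps}  # uniform start (assumption)
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # E-step: posterior weight of each compatible pair.
            w = [(1 if h1 == h2 else 2) * freqs[h1] * freqs[h2] for h1, h2 in ps]
            total = sum(w)
            for (h1, h2), wi in zip(ps, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        freqs = {h: counts[h] / n_chrom for h in haps}
    return freqs

freqs = em_haplotype_freqs([(0, 0), (0, 0), (2, 2), (1, 1)])
for hap, f in freqs.items():
    print(hap, round(f, 4))
```

The number of candidate haplotypes grows exponentially with the number of heterozygous sites, so full enumeration like this is only practical for a handful of SNPs.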
Bayesian version
Modify Clark’s algorithm:
• Replace the single pass through the data with an iterative scheme.
• Allow for uncertainty in resolution.
• Use frequency information.
Resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).
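[Figure: list of known haps., rows labelled 1, 2, 3, with a candidate pair of haplotypes for an ambiguous individual]
Example
• First haplotype: matches 1 known
• Second haplotype: does not match any
• Assigned moderate probability
Example
• First haplotype: matches 3 known
• Second haplotype: does not match any
• Assigned higher probability
Example
• First haplotype: does not match any
• Second haplotype: does not match any
• Assigned low probability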
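The ranking in these examples can be reproduced with a small calculation. The sketch assumes one plausible form of the update: a symmetric Dirichlet prior with pseudo-count `prior` on haplotype frequencies, so that a candidate pair of distinct haplotypes gets weight (n_h1 + prior)(n_h2 + prior), where n_h is the number of copies of h in the current list. The function name, the prior value and the example data are hypothetical.

```python
from collections import Counter

def resolution_probs(candidate_pairs, known_haps, prior=0.1):
    """Probability of each candidate resolution given the current list.

    Assumed weighting: Dirichlet-multinomial with pseudo-count `prior`.
    """
    counts = Counter(known_haps)
    weights = []
    for h1, h2 in candidate_pairs:
        w = counts[h1] + prior
        # A second copy of the same haplotype sees one extra count.
        w *= counts[h2] + prior + (1 if h1 == h2 else 0)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical list: haplotype 000 seen three times, 001 seen once.
known = [(0, 0, 0)] * 3 + [(0, 0, 1)]
# The four resolutions of a genotype heterozygous at all three sites.
candidates = [
    ((0, 0, 0), (1, 1, 1)),  # matches 3 known + no match
    ((0, 0, 1), (1, 1, 0)),  # matches 1 known + no match
    ((0, 1, 0), (1, 0, 1)),  # no match + no match
    ((0, 1, 1), (1, 0, 0)),  # no match + no match
]
for pair, p in zip(candidates, resolution_probs(candidates, known)):
    print(pair, round(p, 3))
```

In a Gibbs sampler, a resolution would be drawn at random with these probabilities rather than always taking the largest, which is how uncertainty in the resolution is carried through.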
Problems with EM/naïve Gibbs
• Potentially (very) large number of parameters to estimate, leading to inaccurate estimates.
• Can be time-consuming for large problems.
• Can “converge” to poor local optima (alleviated by multiple runs).
Further modification
• Take into account “near misses”, as well as exact matches.
(PHASE v1.0: Stephens, Smith and Donnelly 2001)
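[Figure: list of known haps., rows labelled 1, 2, 3, with a candidate pair of haplotypes for an ambiguous individual]
Example
• First haplotype: matches 1 known
• Second haplotype: differs by 2 from 3 known
Example
• First haplotype: matches 3 known
• Second haplotype: differs by 2 from 1 known
Example
• First haplotype: differs by 1 from 3 known
• Second haplotype: differs by 1 from 1 known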
How to balance these possibilities?
The key question
• What is the conditional distribution of the next haplotype, given a set of known haplotypes?
Example
[Figure: a set of known haplotypes, rows labelled 1, 2]
Given the above haplotypes, what would you expect the next haplotype to look like?
Qualitative answer
• The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype.
• Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
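The qualitative answer can be turned into a toy sampler. The sketch assumes binary SNP haplotypes, picks an existing haplotype uniformly from the list (so in proportion to its count), and applies a geometric number of mutations with P(s) = (θ/(n+θ))^s · n/(n+θ), where n is the number of known haplotypes. Flipping a uniformly chosen site as the mutation model, the parameter name `theta` and the function name are illustrative assumptions, not the exact PHASE implementation.

```python
import random

def sample_next_haplotype(known_haps, theta, rng):
    """Draw a plausible "next" haplotype given the known ones (toy sketch)."""
    n = len(known_haps)
    # Copy a randomly chosen existing haplotype.
    hap = list(rng.choice(known_haps))
    # Keep adding mutations with probability theta / (n + theta) each time,
    # which gives a geometric number of mutations (possibly 0).
    p_mut = theta / (n + theta)
    while rng.random() < p_mut:
        site = rng.randrange(len(hap))
        hap[site] = 1 - hap[site]  # assumed mutation model: flip one site
    return tuple(hap)

rng = random.Random(0)
known = [(0, 0, 0), (0, 0, 0), (1, 1, 0)]
draws = [sample_next_haplotype(known, theta=0.5, rng=rng) for _ in range(10)]
print(draws)
```

As the list of known haplotypes grows, the mutation probability θ/(n+θ) shrinks, so new haplotypes are expected to look more and more like ones already seen.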
Comparisons on simulated data
Problems
• Time-consuming for large problems.
• Can “converge” to poor local optima.
• Ignores recombination (decay of LD with distance).
• How should uncertainty in haplotype estimates be treated?
… to be continued.