Introduction to Haplotype Estimation

Stat/Biostat 550

The Haplotype Problem

• Suppose we genotype individuals at a number of tightly linked SNPs.

A C G C C T T T G C G C

G A A C C C C C A G G C



• What do the types on the two chromosomes look like?
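To make the question concrete, the following is a minimal sketch that lists every haplotype pair consistent with an unphased genotype. The function name `compatible_pairs`, the representation of a genotype as a list of allele pairs, and the four-SNP example genotype are all assumptions made for illustration; they are not taken from the slides.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of 2-tuples giving the two alleles observed at each SNP.
    """
    # Indices of heterozygous sites (the only sites where phase is unknown).
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = []
    # Fix the phase of the first heterozygous site so that each unordered
    # pair is listed once; try both phases at the remaining het sites.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

# Hypothetical genotype: heterozygous at SNPs 1, 3 and 4, homozygous at SNP 2.
genotype = [("A", "G"), ("C", "C"), ("G", "A"), ("T", "C")]
for h1, h2 in compatible_pairs(genotype):
    print(h1, h2)
```

With k heterozygous sites there are 2^(k-1) candidate pairs, so the ambiguity grows quickly with the number of SNPs.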

Haplotypes: who cares?

• LD mapping: increase power?

• LD mapping: decrease genotyping?

• Evolutionary studies: selection, recombination, gene conversion, population structure,…

Many people, for many different reasons…

The Haplotype Problem – potential solutions

• Molecular methods

• Collect family data

• Statistical methods for population data

The Simplest Case


The Next Simplest Case


The first difficult case…


Clark’s Method (1990)

• Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.

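Is it this configuration?

[Figure: diagram with rows labelled 1, 2, 3]

…or this one?

[Figure: diagram with rows labelled 1, 2, 3]

This one is more probable.

[Figure: diagram with rows labelled 1, 2, 3]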

Clark’s Method (Clark, 1990)

• Identify the unambiguous individuals.

• Make a list of “known” haplotypes.

• Go through list, and see whether ambiguous individuals can be made up from a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
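The three steps above can be sketched as follows. This is an illustrative sketch, not Clark's original implementation: the 0/1/2 genotype coding (number of copies of one allele at each SNP), the 0/1 haplotype coding, and the function names `resolve_unambiguous`, `complement` and `clark` are assumptions made here.

```python
def resolve_unambiguous(g):
    """Return the haplotype pair if g has at most one heterozygous site, else None."""
    if sum(1 for x in g if x == 1) > 1:
        return None
    h1 = tuple(1 if x >= 1 else 0 for x in g)  # carries allele 1 at the het site
    h2 = tuple(1 if x == 2 else 0 for x in g)  # carries allele 0 at the het site
    return h1, h2

def complement(g, h):
    """Return g - h if it is a valid 0/1 haplotype, else None."""
    c = tuple(x - y for x, y in zip(g, h))
    return c if all(v in (0, 1) for v in c) else None

def clark(genotypes):
    known = []      # ordered list of "known" haplotypes
    resolved = {}   # individual index -> (haplotype, haplotype)
    # Steps 1 and 2: unambiguous individuals seed the list of known haplotypes.
    for i, g in enumerate(genotypes):
        pair = resolve_unambiguous(g)
        if pair:
            resolved[i] = pair
            for h in pair:
                if h not in known:
                    known.append(h)
    # Step 3: repeatedly try to explain ambiguous individuals as
    # known haplotype + complementary haplotype (exact matches only).
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                c = complement(g, h)
                if c is not None:
                    resolved[i] = (h, c)
                    if c not in known:
                        known.append(c)
                    progress = True
                    break
    return resolved, known

# Hypothetical data: one homozygote and two ambiguous individuals.
resolved, known = clark([(2, 0, 0), (1, 1, 0), (1, 2, 1)])
print(resolved)
print(known)
```

Because the first matching known haplotype is accepted, the result can change if the individuals or the list are visited in a different order, and an individual with no exact match stays unresolved.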

Clark’s Method

List of known haps.

[Figure: diagram with rows labelled 1, 2, 3]

Clark’s Method: Problem 1

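[Figure: diagrams with rows labelled 1, 2, 3, shown in different orders]

The answer depends on the order in which the list is considered…

… and frequency information is ignored.

[Figure: diagrams with rows labelled 3, 1, 2]

List of known haps.

The algorithm can fail to resolve all haplotypes…

… because it looks only for exact matches.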

Clark’s Algorithm: Summary

• Results may depend on the order in which individuals are considered.

• Frequency information is ignored.

• May fail to resolve all haplotypes.

• Fails to assess uncertainty.

• Looks only for exact matches.

• Fast and intuitive(?).

Maximum Likelihood (EM Algorithm)

• Idea: find haplotype frequencies (f1,…,fN) to maximise the probability of the observed genotype data (g1,…,gn).

$$\Pr(g_i \mid f_1,\dots,f_N) = \sum_{\{h_1,h_2 \,:\, h_1 \oplus h_2 = g_i\}} f_{h_1} f_{h_2}$$

$$\Pr(g_1,\dots,g_n \mid f_1,\dots,f_N) = \prod_{i} \Pr(g_i \mid f_1,\dots,f_N)$$
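The maximisation described above can be sketched with a small EM loop. This is a minimal sketch under stated assumptions: genotypes are coded 0/1/2 per SNP, haplotypes are 0/1 tuples, the sum for each individual runs over ordered pairs of compatible haplotypes, the starting frequencies are uniform, and the names `ordered_pairs` and `em_haplotype_freqs` and the example data are hypothetical.

```python
from itertools import product
from collections import defaultdict

def ordered_pairs(g):
    """All ordered haplotype pairs (h1, h2) with h1 + h2 == g (0/1/2 coding)."""
    options = [[(0, 0)] if x == 0 else [(1, 1)] if x == 2 else [(0, 1), (1, 0)]
               for x in g]
    return [(tuple(a for a, _ in combo), tuple(b for _, b in combo))
            for combo in product(*options)]

def em_haplotype_freqs(genotypes, n_iter=50):
    pairs = [ordered_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    f = {h: 1.0 / len(haps) for h in haps}  # uniform start (an arbitrary choice)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # E-step: weight each compatible pair by f[h1] * f[h2].
            w = [f[h1] * f[h2] for h1, h2 in ps]
            total = sum(w)  # Pr(g_i | f) under the current frequencies
            for (h1, h2), wi in zip(ps, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        f = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return f

# Hypothetical data: three homozygotes and one double heterozygote.
freqs = em_haplotype_freqs([(2, 0), (2, 0), (0, 2), (1, 1)])
for h, p in freqs.items():
    print(h, round(p, 4))
```

The number of candidate haplotypes, and hence of parameters, doubles with every additional heterozygous site, which is why this direct enumeration is only practical for a small number of SNPs.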

Bayesian version

Modify Clark’s algorithm:

• Replace single pass through data with iterative scheme.

• Allow for uncertainty in resolution.

• Use frequency information.

Resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).

Example


[Figure: candidate haplotype pair for an ambiguous individual]

Matches 1 known

Does not match any

Assigned moderate probability

Example


[Figure: candidate haplotype pair for an ambiguous individual]

Matches 3 known

Does not match any

Assigned higher probability

Example


[Figure: candidate haplotype pair for an ambiguous individual]

Does not match any

Does not match any

Assigned low probability
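The three examples above can be mimicked by weighting each candidate resolution according to how often its haplotypes already appear among the known haplotypes. The sketch below is one way to do this, not the exact scheme of Stephens, Smith and Donnelly (2001): the pseudo-count `alpha`, the function name `resolution_weights`, and the example haplotypes are assumptions made for illustration.

```python
from collections import Counter

def resolution_weights(candidates, known, alpha=0.1):
    """Normalised weights for candidate haplotype pairs of one individual.

    Each pair is weighted by (count of h1 + alpha) * (count of h2 + alpha),
    where counts are taken over the current list of known haplotypes and
    alpha is a small pseudo-count (an assumption of this sketch).
    """
    counts = Counter(known)
    raw = [(counts[h1] + alpha) * (counts[h2] + alpha) for h1, h2 in candidates]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical known haplotypes: "000" seen three times, "001" and "100" once.
known = ["000", "000", "000", "001", "100"]
# The four resolutions of an individual heterozygous at all three SNPs.
candidates = [("000", "111"), ("001", "110"), ("010", "101"), ("011", "100")]
for pair, w in zip(candidates, resolution_weights(candidates, known)):
    print(pair, round(w, 3))
```

An iterative scheme would draw one resolution per ambiguous individual in proportion to these weights, update the list of haplotypes, and repeat, so that uncertainty in the resolution is carried through rather than fixed on the first pass.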

Problems with EM/naïve Gibbs

• Potentially (very) large number of parameters to estimate, leading to inaccurate estimates.

• Can be time-consuming for large problems.

• Can “converge” to poor local optima (alleviated by multiple runs).

Further modification

• Take into account “near misses”, as well as exact matches.

(PHASE v1.0: Stephens, Smith and Donnelly 2001)

Example


[Figure: candidate haplotype pair for an ambiguous individual]

Matches 1 known

Differs by 2 from 3 known

Example


[Figure: candidate haplotype pair for an ambiguous individual]

Matches 3 known

Example


[Figure: candidate haplotype pair for an ambiguous individual]

Differs by 1 from 3 known

How to balance these possibilities?

The key question

• What is the conditional distribution of the next haplotype, given a set of known haplotypes?

Example

[Figure: two haplotypes, labelled 1 and 2]

Given the above haplotypes, what would you expect the next haplotype to look like?

Qualitative answer

• The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype.

• Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
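The qualitative answer above can be turned into a simple sampling sketch: copy a randomly chosen existing haplotype, then apply a small random number of mutations. Everything specific here is an assumption made for illustration: haplotypes are 0/1 tuples, a mutation flips one randomly chosen SNP, and the number of mutations is geometric with continuation probability θ/(n+θ), a form motivated by the Ewens sampling formula. The name `sample_next_haplotype` and the value of `theta` are hypothetical.

```python
import random

def sample_next_haplotype(known, theta=1.0, rng=random):
    """Draw a plausible 'next' haplotype given a list of known 0/1 haplotypes."""
    n = len(known)
    # Start from a copy of a randomly chosen existing haplotype.
    h = list(rng.choice(known))
    # Add mutations one at a time; each further mutation occurs with
    # probability theta / (n + theta), so 0 mutations is the most likely outcome
    # whenever theta is small relative to n.
    while rng.random() < theta / (n + theta):
        site = rng.randrange(len(h))
        h[site] = 1 - h[site]  # flip a biallelic SNP allele (0 <-> 1)
    return tuple(h)

# Hypothetical known haplotypes.
known = [(0, 0, 1, 0), (0, 0, 1, 0), (1, 0, 1, 1)]
rng = random.Random(1)
print([sample_next_haplotype(known, theta=0.5, rng=rng) for _ in range(5)])
```

Under this sketch, exact matches to common haplotypes are most likely, near misses are somewhat less likely, and haplotypes far from everything seen so far are rare, which is the balance the preceding examples ask for.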

Comparisons on simulated data

Problems

• Time-consuming for large problems.

• Can “converge” to poor local optima.

• Ignores recombination (decay of LD with distance).

• How should uncertainty in haplotype estimates be treated?

… to be continued.
