A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Zhihong Ding, Vladimir Filkov, Dan Gusfield
Department of Computer Science, University of California, Davis
RECOMB 2005


Page 1: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Zhihong Ding, Vladimir Filkov, Dan Gusfield

Department of Computer Science

University of California, Davis

RECOMB 2005

Page 2: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Haplotypes to Genotypes

Each individual has two “copies” of each chromosome.

At each site, each chromosome has one of two states, denoted by 0 and 1.

From haplotypes to genotypes: for each site of an individual, if both haplotypes have state 0, then the genotype has state 0. The same rule applies for state 1. If the two haplotypes have states 0 and 1, or 1 and 0, then the state of the genotype is 2.

Page 3: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Haplotypes to Genotypes

Sites:                          1 2 3 4 5 6 7 8 9

Two haplotypes per individual:  0 1 1 1 0 0 1 1 0
                                1 1 0 1 0 0 1 0 0

Merge the haplotypes

Genotype for the individual:    2 1 2 1 0 0 1 2 0
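The merge rule on this slide can be sketched in a few lines of Python. The function name `merge_haplotypes` is a hypothetical helper, not something from the talk; the input is the pair of haplotypes shown above.

```python
def merge_haplotypes(h1, h2):
    """Merge two 0/1 haplotypes into a genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states are copied; differing states (0/1 or 1/0) become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```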

Page 4: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Genotypes to Haplotypes

Genotype for the individual:    2 1 2 1 0 0 1 2 0

Two haplotypes per individual:  0 1 1 1 0 0 1 1 0
                                1 1 0 1 0 0 1 0 0

For each site, if the genotype has state 0 or 1, then the two haplotypes must have states 0, 0 or 1, 1. If the genotype has state 2, the two haplotypes can have either states 0, 1 or states 1, 0.

Page 5: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Haplotype Inference Problem

For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is harder and more expensive to collect than genotype data.

Haplotype Inference Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes.

The NIH leads the HAPMAP project to find common haplotypes in the human population.

Page 6: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Haplotype Inference Problem

If the genotype has state 2 at k sites, there are 2^(k-1) possible explaining haplotype pairs.

How do we determine which haplotype pair is the original one that generated the genotype?

We need a model of haplotype evolution to help solve the haplotype inference problem.
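To make the count of explaining pairs concrete, here is a minimal sketch that enumerates every unordered haplotype pair for one genotype. The helper name `explaining_pairs` is an assumption for illustration; the genotype is the nine-site example from the earlier slides, which has state 2 at k = 3 sites.

```python
from itertools import product


def explaining_pairs(genotype):
    """List the unordered haplotype pairs that merge into the given genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = []
    # Fixing the first state-2 site to 0 in h1 lists each unordered pair once.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


pairs = explaining_pairs([2, 1, 2, 1, 0, 0, 1, 2, 0])
print(len(pairs))  # 4, which is 2^(3-1)
```

The exponential growth in k is why a model of haplotype evolution is needed to pick among the candidates.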

Page 7: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

The Perfect Phylogeny Model of Haplotype Evolution

[Figure: a perfect phylogeny over sites 1-5. The ancestral haplotype 00000 is at the root, site mutations 1-5 label the edges, and the extant haplotypes are at the leaves. Haplotypes shown in the figure: 10100, 10000, 01011, 00010, 01010.]

Page 8: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Assumptions of Perfect Phylogeny Model

No recombination, only mutation.

Infinite-site assumption: one mutation per site.

Page 9: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

The Perfect Phylogeny Haplotyping (PPH) Problem

Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix

Site:  1 2
a      2 2
b      0 2
c      1 0

Haplotype matrix

Site:  1 2
a      1 0
a      0 1
b      0 0
b      0 1
c      1 0
c      1 0

Perfect phylogeny

[Figure: a tree whose edges are labeled with sites 1 and 2. The leaves carry haplotypes 10 (c, c, a), 01 (a, b) and 00 (b).]
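The two requirements of a PPH solution, "explains the genotypes" and "fits a perfect phylogeny", can be checked mechanically on this small example. The sketch below is not from the talk: it assumes the ancestral haplotype is all zeros and relies on the standard pairwise characterization for that case (no two sites may show all three of the gametes 01, 10 and 11). The function names are hypothetical.

```python
from itertools import combinations


def merge(h1, h2):
    """Genotype obtained from two 0/1 haplotypes."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def fits_perfect_phylogeny(haplotypes):
    """Pairwise gamete test, assuming an all-zero ancestral haplotype."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


genotypes = {"a": [2, 2], "b": [0, 2], "c": [1, 0]}
# Haplotype pairs from the slide's haplotype matrix.
solution = {"a": ([1, 0], [0, 1]), "b": ([0, 0], [0, 1]), "c": ([1, 0], [1, 0])}
# Same pairs, except that genotype a is explained by 11 and 00 instead.
alternative = dict(solution, a=([1, 1], [0, 0]))


def is_pph_solution(pairs):
    explains = all(merge(*pairs[name]) == g for name, g in genotypes.items())
    haplotypes = [h for pair in pairs.values() for h in pair]
    return explains and fits_perfect_phylogeny(haplotypes)


print(is_pph_solution(solution))     # True
print(is_pph_solution(alternative))  # False
```

Both candidates explain the genotypes, but only the first one fits a perfect phylogeny, so the model really does narrow down the choice.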

Page 10: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Prior Work

Several existing algorithms solve the PPH problem, but none of them runs in linear time.

Our contribution: a linear-time algorithm. Our implementation is about 250 times faster than the fastest previous algorithm on large data sets.

Page 11: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

A P-Class of PPH Solutions

Genotype Matrix

Sites:  1 2 3 4 5
a       2 2 2 0 0
b       2 0 0 2 2
c       2 2 2 0 2
d       2 0 0 2 0

One PPH Solution

[Figure: a tree rooted at "root" with edges labeled 1-5 and leaves labeled a,d; a,c; b,d; b,c.]

P-Class: maximum common subgraph in all PPH solutions.

Each P-Class consists of two subtrees.

Page 12: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

P-Class Property of PPH Solutions

All PPH solutions can be obtained by choosing how to flip each P-Class.

[Figure: one PPH solution and a second PPH solution drawn side by side. Each is a tree rooted at "root" with edges labeled 1-5 and leaves labeled a,d; a,c; b,c; b,d. The switching points are marked in both trees.]

Page 13: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

The Key Theorem

Every PPH solution can be obtained by choosing a flip for each P-Class.

Conversely, after fixing one P-Class, every distinct choice of flips of the P-Classes leads to a distinct PPH solution.

If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
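As a sanity check on the count, the slides show exactly two PPH solutions for the running four-genotype example, which is what 2^(k-1) gives for k = 2. The exhaustive search below reproduces that number. It is an exponential-time illustration only, not the linear-time algorithm of the talk. The matrix rows (a: 22200, b: 20022, c: 22202, d: 20020) are read from the slides, and the pairwise gamete test with an all-zero ancestral haplotype is an assumption of this sketch.

```python
from itertools import combinations, product

GENOTYPES = [
    [2, 2, 2, 0, 0],  # a
    [2, 0, 0, 2, 2],  # b
    [2, 2, 2, 0, 2],  # c
    [2, 0, 0, 2, 0],  # d
]


def explaining_pairs(genotype):
    """All unordered haplotype pairs that merge into the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def fits_perfect_phylogeny(haplotypes):
    """Pairwise gamete test, assuming an all-zero ancestral haplotype."""
    for i, j in combinations(range(len(haplotypes[0])), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


def count_pph_solutions(genotypes):
    """Try every combination of explaining pairs and count the valid ones."""
    count = 0
    for choice in product(*(explaining_pairs(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if fits_perfect_phylogeny(haplotypes):
            count += 1
    return count


n_solutions = count_pph_solutions(GENOTYPES)
print(n_solutions)  # 2
```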

Page 14: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Shadow Tree

Contains classes.

Each class in the shadow tree is a subgraph of a P-Class.

Merging classes results in larger classes; classes are never split.

Contains tree edges and shadow edges.

Page 15: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

The Algorithm

Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree.

The genotype matrix contains only entries of value 0 and 2.

Page 16: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Overview of the Algorithm for One Row

Procedure FirstPath

Procedure SecondPath

Procedure FixTree

Procedure NewEntries

Page 17: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

OldEntryList

Genotype Matrix (row 3 is highlighted on the slide)

Columns:  1 2 3 4 5
Row 1     2 2 2 0 0
Row 2     2 0 0 2 2
Row 3     2 2 2 0 2
Row 4     2 0 0 2 0

OldEntryList for row 3: 1, 2, 3, 5

OldEntryList: the column indices that have an entry of value 2 in this row and also have an entry of value 2 in some previous row.
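The definition of OldEntryList translates directly into a single pass over the matrix. The sketch below is an illustration under the assumption that the matrix rows are 22200, 20022, 22202 and 20020 (as read from the slides); the function name `old_entry_lists` is hypothetical, and columns are numbered from 1 as on the slide.

```python
def old_entry_lists(matrix):
    """For each row, list the columns holding a 2 now and in some earlier row."""
    seen = set()  # columns that had a 2 in an earlier row
    result = []
    for row in matrix:
        twos = [j for j, value in enumerate(row, start=1) if value == 2]
        result.append([j for j in twos if j in seen])
        seen.update(twos)
    return result


matrix = [
    [2, 2, 2, 0, 0],
    [2, 0, 0, 2, 2],
    [2, 2, 2, 0, 2],
    [2, 0, 0, 2, 0],
]
lists = old_entry_lists(matrix)
print(lists[2])  # row 3: [1, 2, 3, 5]
```

Each entry of the matrix is visited once, so this bookkeeping costs O(nm) over all rows.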

Page 18: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Procedures FirstPath and SecondPath

FirstPath: Construct a first path towards the root of the shadow tree which passes through tree edges of as many columns in OldEntryList as possible.

SecondPath: Construct a second path towards the root of the shadow tree which passes through tree edges of columns that are in OldEntryList and not on the first path.

Page 19: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Shadow Tree After Processing the First Two Rows

[Figure: the shadow tree after the first two rows, rooted at "root", with edges labeled by columns 1-5. Row 3 of the genotype matrix is marked.]

Genotype Matrix

Columns:  1 2 3 4 5
Row 1     2 2 2 0 0
Row 2     2 0 0 2 2
Row 3     2 2 2 0 2
Row 4     2 0 0 2 0

OldEntryList for row 3: 1, 2, 3, 5

Page 20: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Algorithm – FirstPath

[Figure: the shadow tree, rooted at "root", with edges labeled by columns 1-5.]

OldEntryList: 1, 2, 3, 5

CheckList: 3, 2

Edges 4 and 5 cannot be on the same path to the root in any PPH solution.

Page 21: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Algorithm – SecondPath

[Figure: the shadow tree, rooted at "root", with edges labeled by columns 1-5.]

OldEntryList: 1, 2, 3, 5

CheckList: 3, 2

Page 22: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Shadow Tree to PPH Solutions

Genotype Matrix

Sites:  1 2 3 4 5
a       2 2 2 0 0
b       2 0 0 2 2
c       2 2 2 0 2
d       2 0 0 2 0

[Figure: the final shadow tree, rooted at "root" with edges labeled by columns 1-5, next to one PPH solution drawn as a tree with edges 1-5.]

Page 23: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Shadow Tree to PPH Solutions

[Figure: the final shadow tree, rooted at "root" with edges labeled by columns 1-5, next to a second PPH solution, a tree with edges 1-5 and leaves labeled a,d; b,c; b,d; a,c.]

Page 24: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Implementation – Leaf Count

Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i.

L[i] is the number of leaves below mutation i, in every perfect phylogeny for the genotype matrix.

Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.

Columns:     1 2 3 4
a            1 1 0 0
b            0 2 2 0
c            2 0 2 0
d            2 0 0 2
Leaf Count:  4 3 2 1
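The leaf-count definition is simple enough to check against the slide's table. This is a minimal sketch; the function name `leaf_counts` is an assumption, and the matrix is the four-genotype example shown above.

```python
def leaf_counts(matrix):
    """L[i] = (number of 2's) + 2 * (number of 1's) in column i."""
    weight = {0: 0, 1: 2, 2: 1}  # contribution of each genotype state
    n_cols = len(matrix[0])
    return [sum(weight[row[j]] for row in matrix) for j in range(n_cols)]


matrix = [
    [1, 1, 0, 0],  # a
    [0, 2, 2, 0],  # b
    [2, 0, 2, 0],  # c
    [2, 0, 0, 2],  # d
]
counts = leaf_counts(matrix)
print(counts)  # [4, 3, 2, 1]
```

A state 1 contributes two leaves because both haplotypes of that individual carry the mutation, while a state 2 contributes one.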

Page 25: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Time Complexity

Constant number of simple operations on each edge per row.

Each traversal in the shadow tree goes through O(m) edges.

The algorithm does a constant number of traversals in the shadow tree for each row.

Total time: O(nm), where n and m are the numbers of rows and columns in the genotype matrix.

Page 26: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Results

Average Running Times (seconds)

Sites (m)   Individuals (n)   Datasets   DPPH, O(nm^2)   Our Alg., O(nm)
300         150               30         1.07            0.05
500         250               30         5.72            0.13
1000        500               30         45.85           0.48
2000        1000              10         467.18          1.89
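The speedup implied by these timings can be computed directly from the table. The sketch below only divides the reported averages. It assumes the columns are read in the order of the headers (sites, individuals, then the two running times).

```python
# (sites m, individuals n, DPPH seconds, linear-time algorithm seconds)
results = [
    (300, 150, 1.07, 0.05),
    (500, 250, 5.72, 0.13),
    (1000, 500, 45.85, 0.48),
    (2000, 1000, 467.18, 1.89),
]

speedups = [dpph / ours for _, _, dpph, ours in results]
for (sites, individuals, _, _), ratio in zip(results, speedups):
    print(f"m={sites}, n={individuals}: about {ratio:.0f}x faster")
```

The ratio grows with the problem size and reaches roughly 247 on the largest instances, in line with the "about 250 times faster" figure quoted earlier in the talk.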

Page 27: A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Thank you!

Paper and program can be downloaded at:

http://wwwcsif.cs.ucdavis.edu/~gusfield/lpph/