Inferring Local Tree Topologies for SNP Sequences Under Recombination in a Population Yufeng Wu Dept. of Computer Science and Engineering University of Connecticut, USA MIEP 2008



Genetic Variations

• Single-nucleotide polymorphism (SNP): a site (genomic location) where two types of nucleotides occur frequently in the population.
– Haplotype: a binary vector of SNPs (encoded as 0/1).

• Haplotypes: offer hints on genealogy.

DNA sequences (rows) and sites (columns):

AATGTAGCCGA
AATATAACCTA
AATGTAGCCGT
AATGTAACCTA
CATATAGCCGT

Haplotypes (one 0/1 column per SNP site):

00100
01010
00101
00010
11101

Each SNP induces a split
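The step from aligned sequences to SNP sites and their splits can be sketched in Python. This is a minimal illustration, not the author's code: the helper names `snp_columns` and `split_of` are hypothetical, and sequences are identified by their row index. A split is an unordered bipartition, so it does not depend on which allele is encoded as 0.

```python
seqs = [
    "AATGTAGCCGA",
    "AATATAACCTA",
    "AATGTAGCCGT",
    "AATGTAACCTA",
    "CATATAGCCGT",
]

def snp_columns(seqs):
    """Return the 0-based columns where exactly two nucleotides occur."""
    return [j for j in range(len(seqs[0])) if len({s[j] for s in seqs}) == 2]

def split_of(seqs, j):
    """Bipartition of row indices induced by column j (unordered pair of sets)."""
    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(s[j], set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

sites = snp_columns(seqs)
splits = [split_of(seqs, j) for j in sites]
print(sites)
print(splits[0])
```

In this data the third and fourth SNP columns induce the same split, which is why comparing splits (rather than raw 0/1 columns) is convenient.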


Genealogy: Evolutionary History of Genomic Sequences

• Tells how individuals in a population are related

• Helps to explain diseases: disease mutations occur on branches, and all descendants carry the mutations

• Problem: How to determine the genealogy for “unrelated” individuals?

• Complicated by recombination

[Figure: genealogy of individuals in the current population, marked as diseased (case) or healthy (control), with the disease mutation on one branch.]


Recombination

• One of the principal genetic forces shaping sequence variation within species

• Two equal-length sequences generate a third, new sequence of the same length in the genealogy

• Spatial order is important: different parts of the genome are inherited from different ancestors.

110001111111001 (contributes the prefix 11000)
000110000001111 (contributes the suffix 0000001111)
11000 0000001111 (recombinant; the breakpoint is at the gap)
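The prefix/suffix picture above amounts to a single string operation. The sketch below assumes a single-crossover recombination; the function name `recombine` and the parameter `cut` are illustrative choices, not taken from the slides.

```python
def recombine(prefix_parent: str, suffix_parent: str, cut: int) -> str:
    """Take the first `cut` sites from one parent and the remaining sites from the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    return prefix_parent[:cut] + suffix_parent[cut:]

child = recombine("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```

The recombinant has the same length as both parents, and each site is inherited from exactly one of them, depending on which side of the breakpoint it lies.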


Ancestral Recombination Graph (ARG)

[Figure: two genealogies over the sequences 10, 01, 00. Mutations only: S1 = 00, S2 = 01, S3 = 10, S4 = 10. With recombination: S1 = 00, S2 = 01, S3 = 10, S4 = 11, where 11 is derived by recombination.]

Assumption: at most one mutation per site


Local Trees

• ARG represents a set of local trees.

• Each tree covers a contiguous genomic region.

• No recombination between two sites implies the same local tree for both sites.

• Local tree topology: informative and useful

[Figure: an ARG and three of its local trees: near sites 1 and 2, near site 2, and to the right of site 3.]


Inference of Local Tree Topologies


• Question: given SNP haplotypes, infer local tree topologies (one tree for each SNP site, ignoring branch lengths)
– Hein (1990, 1993)
• Enumerate all possible tree topologies at each site
– Song and Hein (2003, 2005)
– Parsimony-based

• Local tree reconstruction can be formulated as inference on a hidden Markov model.


Local Tree Topologies


• Key technical difficulty
– Brute-force enumeration of local tree topologies: not feasible when the number of sequences > 9

• Cannot enumerate all tree topologies

• Trivial solution: create a tree for a SNP containing the single split induced by the SNP.
– Always correct (assuming one mutation per site)
– But not very informative: need more refined trees!

Example: the SNP column A: 0, B: 0, C: 1, D: 0, E: 1, F: 0, G: 1, H: 0 gives a tree with the single split {C, E, G} | {A, B, D, F, H}.
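To see why brute-force enumeration breaks down so quickly, one can count topologies. The sketch below uses the standard count of rooted binary trees on n labeled leaves, (2n-3)!!. Whether the enumeration on the slide refers to rooted or unrooted trees is not stated, so the rooted count is an assumption made here for illustration.

```python
def rooted_binary_topologies(n: int) -> int:
    """Number of rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count

for n in (4, 9, 10):
    print(n, rooted_binary_topologies(n))
```

With 9 sequences there are already about two million topologies per site, and each additional sequence multiplies the count by the next odd number.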


How to do better? Neighboring Local Trees are Similar!

• Nearby SNP sites provide hints!
– Nearby local trees are often topologically similar
– Recombination often alters only small parts of the trees

• Key idea: reconstructing local trees by combining information from multiple nearby SNPs


RENT: REfining Neighboring Trees

• Maintain for each SNP site a (possibly non-binary) tree topology
– Initialize to a tree containing the split induced by the SNP

• Gradually refine the trees by adding new splits
– Splits found by a set of rules (later)
– Splits added early may be more reliable

• Stop when the trees are binary or enough information is recovered
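The overall refinement loop can be sketched as follows. This is an illustrative skeleton under several assumptions, not the RENT implementation: a tree is stored as a set of splits, a split is stored as the side that excludes the smallest taxon, `neighbor_rule` is a hypothetical stand-in for the real rule set, and the loop simply stops when no rule adds a split. The data is a small 7-sequence, 5-site example matrix.

```python
def canonical(side, taxa):
    """Represent a split by the side that excludes the smallest taxon."""
    side = frozenset(side)
    return taxa - side if min(taxa) in side else side

def splits_compatible(a, b, taxa):
    """Two splits are compatible if one of the four side intersections is empty."""
    return any(not (x & y) for x in (a, taxa - a) for y in (b, taxa - b))

def init_trees(M, names, taxa):
    """One tree per site, holding only the split induced by that SNP."""
    return [{canonical((t for t, row in zip(names, M) if row[j] == 1), taxa)}
            for j in range(len(M[0]))]

def neighbor_rule(trees, taxa):
    """Hypothetical rule: offer every split of a tree to its two adjacent sites."""
    return [(j, s) for i, tree in enumerate(trees) for j in (i - 1, i + 1)
            if 0 <= j < len(trees) for s in tree]

def refine(trees, rules, taxa, max_rounds=100):
    """Apply the rules repeatedly; accept a proposed split only if the tree stays compatible."""
    for _ in range(max_rounds):
        changed = False
        for rule in rules:
            for site, split in rule(trees, taxa):
                tree = trees[site]
                if split not in tree and all(splits_compatible(split, t, taxa) for t in tree):
                    tree.add(split)
                    changed = True
        if not changed:
            break
    return trees

names = "abcdefg"
taxa = frozenset(names)
M = [[0, 0, 0, 1, 0], [1, 0, 0, 1, 0], [0, 0, 1, 0, 0], [1, 0, 1, 0, 0],
     [0, 1, 1, 0, 0], [0, 1, 1, 0, 1], [0, 0, 1, 0, 1]]
trees = refine(init_trees(M, names, taxa), [neighbor_rule], taxa)
```

Because a split is accepted only if it is compatible with everything already in the tree, the order in which splits arrive matters, which is one way to read the remark that splits added early may be more reliable.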


A Little Background: Compatibility

• Two sites (columns) p, q are incompatible if columns p and q together contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.

• Easily extended to splits.

• A split s is incompatible with tree T if s is incompatible with any one split in T. Two trees are compatible if their splits are pairwise compatible.

Matrix M (rows: sequences a to g; columns: sites 1 to 5):

   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

Sites 1 and 2 are compatible, but 1 and 3 are incompatible.
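The compatibility definition is the four-gamete test, which is short enough to write out directly. The sketch below uses the matrix M from this slide with 0-based column indices; the function name `sites_compatible` is an illustrative choice.

```python
M = [
    [0, 0, 0, 1, 0],  # a
    [1, 0, 0, 1, 0],  # b
    [0, 0, 1, 0, 0],  # c
    [1, 0, 1, 0, 0],  # d
    [0, 1, 1, 0, 0],  # e
    [0, 1, 1, 0, 1],  # f
    [0, 0, 1, 0, 1],  # g
]

def sites_compatible(M, p, q):
    """Four-gamete test: p and q are compatible unless 00, 01, 10 and 11 all occur."""
    return len({(row[p], row[q]) for row in M}) < 4

print(sites_compatible(M, 0, 1))  # sites 1 and 2: True
print(sites_compatible(M, 0, 2))  # sites 1 and 3: False
```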


Fully-Compatible Region: Simple Case

• A region of consecutive SNP sites where these SNPs are pairwise compatible.
– May indicate that no topology-altering recombination occurred within the region

• Rule: for site s, add the split of any SNP in such a region containing s to the tree at s.
– Compatibility is a very strong property and unlikely to arise by chance.
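One way to find a fully-compatible region around a site is to grow a window greedily. The sketch below is an assumption about how such a region might be computed, not the procedure used in RENT: it extends left first and then right, so the result can depend on that order. The matrix is a small 7-sequence, 5-site example, and `compatible_region` is a hypothetical helper name.

```python
M = [[0, 0, 0, 1, 0], [1, 0, 0, 1, 0], [0, 0, 1, 0, 0], [1, 0, 1, 0, 0],
     [0, 1, 1, 0, 0], [0, 1, 1, 0, 1], [0, 0, 1, 0, 1]]

def sites_compatible(M, p, q):
    """Four-gamete test on columns p and q."""
    return len({(row[p], row[q]) for row in M}) < 4

def compatible_region(M, s):
    """Grow [lo, hi] around site s while every site inside stays pairwise compatible."""
    lo = hi = s
    last = len(M[0]) - 1
    # Extend to the left while the new site is compatible with the whole window.
    while lo > 0 and all(sites_compatible(M, lo - 1, j) for j in range(lo, hi + 1)):
        lo -= 1
    # Then extend to the right under the same condition.
    while hi < last and all(sites_compatible(M, hi + 1, j) for j in range(lo, hi + 1)):
        hi += 1
    return lo, hi

for s in range(len(M[0])):
    print(s, compatible_region(M, s))
```

Under the rule above, the splits of all sites in the returned window would then be added to the tree at s.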


Split Propagation: More General Rule

• Three consecutive sites 1, 2 and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
– Trees at sites 1 and 2 are different.
– Suppose site 3 is compatible with sites 1 and 2. Then?
– Site 3 may indicate a shared subtree in both trees at sites 1 and 2.

• Rule: a split propagates in both directions until it reaches an incompatible tree.
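The propagation rule can be sketched as a walk outward from the originating site. This is a minimal illustration with assumed data structures: each tree is a set of splits, each split is one side of the bipartition as a frozenset, and the function name `propagate` is hypothetical. The starting trees are the single-split trees for a 5-site example over taxa a to g.

```python
def splits_compatible(a, b, taxa):
    """Two splits are compatible if one of the four side intersections is empty."""
    return any(not (x & y) for x in (a, taxa - a) for y in (b, taxa - b))

def propagate(trees, i, split, taxa):
    """Add `split` to trees left and right of site i until an incompatible tree is met."""
    for step in (-1, 1):
        j = i + step
        while 0 <= j < len(trees) and all(splits_compatible(split, t, taxa) for t in trees[j]):
            trees[j].add(split)
            j += step

taxa = frozenset("abcdefg")
trees = [{frozenset("bd")}, {frozenset("ef")}, {frozenset("cdefg")},
         {frozenset("cdefg")}, {frozenset("fg")}]

# Propagate the split of the second site (index 1) in both directions.
propagate(trees, 1, frozenset("ef"), taxa)
print(trees)
```

Here the split {e, f} reaches the first, third and fourth trees but stops at the fifth, whose split {f, g} is incompatible with it.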


Unique Refinement

• Consider the subtree with leaves 1, 2 and 3.
– Which refinement is more likely?
– Add the split of 1 and 2: the only split that is compatible with neighboring T2.

• Rule: refine a non-binary node by the only split that is compatible with the neighboring trees

[Figure: an unresolved node with leaves 1, 3 and 2, marked "?".]
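The unique-refinement rule can be sketched by listing the candidate groupings of a non-binary node's children and keeping the one that is compatible with the neighboring tree, provided it is the only one. The representation is an assumption for illustration: children are frozensets of leaves, the neighboring tree is a set of splits, and `unique_refinement` is a hypothetical name.

```python
from itertools import combinations

def splits_compatible(a, b, taxa):
    """Two splits are compatible if one of the four side intersections is empty."""
    return any(not (x & y) for x in (a, taxa - a) for y in (b, taxa - b))

def unique_refinement(children, neighbor_tree, taxa):
    """Return the single compatible grouping of children, or None if there is no unique one."""
    candidates = []
    for k in range(2, len(children)):            # proper groupings of 2..(k-1) children
        for group in combinations(children, k):
            split = frozenset().union(*group)
            if all(splits_compatible(split, t, taxa) for t in neighbor_tree):
                candidates.append(split)
    return candidates[0] if len(candidates) == 1 else None

taxa = frozenset({1, 2, 3, 4, 5})
children = [frozenset({1}), frozenset({2}), frozenset({3})]
T2 = {frozenset({1, 2})}                         # neighboring tree groups 1 and 2

print(unique_refinement(children, T2, taxa))
```

With T2 containing the split of 1 and 2, the groupings {1, 3} and {2, 3} are rejected and only {1, 2} remains. If the neighboring tree gave no constraint, all three groupings would be compatible and the function would return None.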


One Subtree-Prune-Regraft (SPR) Event

• Recombination: simulated by SPR.
– The rest of the two trees (without the pruned subtrees) remains the same

• Rule: find an identical subtree Ts in neighboring trees T1 and T2 such that the rest of T1 and T2 (with Ts removed) are compatible. Then jointly refine T1 - Ts and T2 - Ts before adding back Ts.

[Figure: neighboring trees with the subtree to prune marked.]

More complex rules possible.
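The first two steps of the SPR rule can be sketched with rooted trees stored as sets of clusters (a cluster is the leaf set below an internal node). This is a partial illustration under assumptions, not the author's method: it only considers shared subtrees with at least two leaves, all function names are hypothetical, and the final step of re-attaching Ts at its position in each tree is left out.

```python
def clusters_compatible(a, b):
    """Rooted clusters are compatible if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def prune(tree, ts):
    """Remove the leaves of ts from every cluster; drop clusters left with fewer than 2 leaves."""
    return {c - ts for c in tree if len(c - ts) >= 2}

def shared_subtrees(t1, t2):
    """Clusters present in both trees whose internal structure is also identical."""
    return [ts for ts in t1 & t2
            if {c for c in t1 if c < ts} == {c for c in t2 if c < ts}]

def joint_refine_without(t1, t2, ts):
    """If the remainders are compatible, return their union (the jointly refined rest)."""
    r1, r2 = prune(t1, ts), prune(t2, ts)
    if all(clusters_compatible(a, b) for a in r1 for b in r2):
        return r1 | r2
    return None

# Leaves 1..6; the subtree (1,2) sits next to 3 in T1 and next to 4 in T2.
T1 = {frozenset({1, 2}), frozenset({1, 2, 3}), frozenset({5, 6})}
T2 = {frozenset({1, 2}), frozenset({1, 2, 4}), frozenset({3, 5, 6})}

candidates = shared_subtrees(T1, T2)
rest = joint_refine_without(T1, T2, candidates[0])
print(candidates, rest)
```

In this example the only shared subtree is (1,2). Once it is removed, the remaining clusters {5, 6} and {3, 5, 6} are nested, so both remainders can be refined to contain both.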


Simulation

• Hudson’s program MS (with known coalescent local tree topologies): 100 datasets for each setting.
– Handles much larger data than Song and Hein’s method, and performs better or similarly on small data.

• Test local tree topology recovery, scored by Song and Hein’s shared-split measure

[Figure: results for the two settings labeled "= 15" and "= 50".]


Acknowledgement

• Software available upon request.

• More information available at: http://www.engr.uconn.edu/~ywu

• I want to thank
– Yun S. Song
– Dan Gusfield