
Algorithms and bioinformatics: Comparative genomics

Anthony Labarre

September 26, 2016

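Outline
- Strings: Duplications in evolution; Balanced strings; General strings
- Other models: Posets; Set systems
- Alternative approaches: SAT solvers; Linear programming
- Beyond pairwise comparisons: From comparisons to phylogenies; Bounds; Selected results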

Motivation

- We saw a model for representing genomes without directionality;
- We saw another model that takes directionality into account;
- Both lack realism in a crucial way: they do not allow duplications;
- Yet duplications, insertions and deletions account for a very large part of what happens in evolution [Ohno, 1970].




Two examples of duplications

Example (tandem duplications)

(source: K. Aainsqatsi on Wikimedia)

Example (whole genome duplication)

(source: Eric Lyons on CoGePedia)




Today

- Models that take duplications into account;
- Other approaches to solving the corresponding problems;
- Other models for those cases where only partial information is available or relevant.




Strings

- Since duplications pervade genomes, we should take them into account;
- We now see genomes as strings on an alphabet Σ;
- Be careful: similar segments have been identified, so Σ = {segments} and not {A, C, G, T};
- Our goal is still to explain evolution using most parsimonious scenarios made of fixed transformations.




Strings

- Note: the restriction to sorting problems does not work anymore;
  - if you have two A’s, which one should be “number one”?
- So we really are interested in transforming one string into another, which is not equivalent to sorting another string;
- Sorting problems have been considered in that model, but they’re just a special case of a more general problem.




Strings

- We can distinguish between several approaches based on gene contents;
- Either we have exactly the same contents in both genomes (and duplications are of course allowed);
- Or we have duplications but with different amounts of repetitions (e.g. three 1’s in genome A but only two in genome B);
- This time the breakpoint graph cannot save us anymore, since we would not know how to connect elements or decompose the graph.




Balanced strings

- The number of occurrences of a character c in a string S is denoted by occ(c, S);

Definition (balanced strings)

Two strings S and T on an alphabet Σ are balanced if:

∀ c ∈ Σ : occ(c, S) = occ(c, T).

- Basically, S and T are anagrams;
- Straightforward generalisation of permutations: we have duplications, but we actually still have the same content in both genomes.
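The definition above translates directly into a comparison of character counts. A minimal sketch, assuming genomes are given as Python strings (or any sequences of hashable segment labels); the name `are_balanced` is ours:

```python
from collections import Counter


def are_balanced(s, t):
    """Return True if every character occurs equally often in s and t."""
    # Counter(s)[c] plays the role of occ(c, S) in the definition.
    return Counter(s) == Counter(t)
```

For instance, `are_balanced("dictionary", "indicatory")` holds, since the two words are anagrams.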




Comparing balanced strings

- One way of relating genomes’ contents is to identify common segments;
- In other words, we want to partition genomes into the same set of segments;
  - this is how we obtained (signed) permutations;
  - but now we want to partition the resulting sequences.




Generalising breakpoints

- Recall that, for permutations:
  - adjacencies are pairs of adjacent elements in π that are also adjacent in ι = 〈1 2 ⋯ n〉 (or χ = 〈n n−1 ⋯ 1〉 for reversals);
  - breakpoints are pairs that are not adjacencies;
- Recall that, for signed permutations:
  - adjacencies are pairs of adjacent elements in π that are also adjacent in ι = 〈1 2 ⋯ n〉 (or χ = 〈−n −(n−1) ⋯ −1〉 for signed reversals);
  - breakpoints are pairs that are not adjacencies;
- Those notions can be generalised to any pair of permutations;
- And we can do the same thing for strings.




Minimum common string partition

- A partition of a string S is a set of strings (blocks) that can be concatenated to obtain S;
- A common partition of two strings S and T is a set of strings that can be concatenated to obtain both S and T;

Example (common string partitions)

Here’s a common partition of “dictionary” and “indicatory”:

dictionary = dic | t  | i   | o  | n  | a  | ry
             S1    S2   S3    S4   S5   S6   S7

indicatory = i  | n  | dic | a  | t  | o  | ry
             S3   S5   S1    S6   S2   S4   S7
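The example can be checked mechanically. A minimal sketch, assuming the blocks are listed once and each string is described by the order in which it uses them (0-based indices, so S1 is block 0); the function name and argument layout are our own choices:

```python
def is_common_partition(blocks, order_s, order_t, s, t):
    """Check that concatenating the blocks in each order yields s and t."""
    indices = list(range(len(blocks)))
    # Each order must use every block exactly once.
    if sorted(order_s) != indices or sorted(order_t) != indices:
        return False
    built_s = "".join(blocks[i] for i in order_s)
    built_t = "".join(blocks[i] for i in order_t)
    return built_s == s and built_t == t


# Blocks S1..S7 of the example above.
blocks = ["dic", "t", "i", "o", "n", "a", "ry"]
order_s = [0, 1, 2, 3, 4, 5, 6]   # S1 S2 S3 S4 S5 S6 S7
order_t = [2, 4, 0, 5, 1, 3, 6]   # S3 S5 S1 S6 S2 S4 S7
```

With these values, `is_common_partition(blocks, order_s, order_t, "dictionary", "indicatory")` holds.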




Minimum common string partition

- A common string partition is minimum if there is no smaller common string partition for the two strings under consideration;
- This leads to the following decision problem:

Problem (minimum common string partition (mcsp))

Instance: balanced strings S and T, a bound k ∈ ℕ;
Question: is there a common partition of S and T with at most k blocks?




Relation(s) to rearrangement problems

- Recall that breakpoints were pairs of elements adjacent in one genome but not in the other;
- Common string partitions generalise that point of view to an arbitrary number of elements in each part;
- So if we have a minimum common string partition for S and T, we get the number of breakpoints between strings S and T.




About mcsp

- Bad news about mcsp:
  - NP-hard, even if only one gene family is nontrivial [Blin et al., 2004];
  - APX-hard, even if no character appears more than twice [Goldstein et al., 2005];
- Good news about mcsp:
  - fixed-parameter tractable: a solution of size k can be found in time f(k) · poly(n), where n = |S| = |T| [Bulteau and Komusiewicz, 2014];
- Greedy approach [Goldstein and Lewenstein, 2011]: repeatedly select a longest common substring (LCS) without any marked letter, and mark its letters;
  - ✓ simple and fast (runs in O(n) time);
  - ✗ approximation ratio between Ω(n^0.43) and O(n^0.69) [Kaplan and Shafrir, 2006].
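The greedy idea can be sketched in a few lines. This is a naive cubic-time illustration, not the linear-time algorithm cited above; it assumes the inputs are balanced, and the name `greedy_mcsp` and the `(i, j, length)` block encoding are our own:

```python
def greedy_mcsp(s, t):
    """Greedy common partition: repeatedly take a longest common
    substring made of unmarked letters, then mark those letters.

    Returns a list of blocks (i, j, length), meaning that
    s[i:i+length] == t[j:j+length] is one block of the partition.
    """
    if sorted(s) != sorted(t):
        raise ValueError("strings must be balanced")
    marked_s = [False] * len(s)
    marked_t = [False] * len(t)
    blocks = []
    while not all(marked_s):
        best = (0, 0, 0)  # (length, start in s, start in t)
        for i in range(len(s)):
            for j in range(len(t)):
                k = 0
                # Extend the match while letters agree and are unmarked.
                while (i + k < len(s) and j + k < len(t)
                       and not marked_s[i + k] and not marked_t[j + k]
                       and s[i + k] == t[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        for k in range(length):
            marked_s[i + k] = True
            marked_t[j + k] = True
        blocks.append((i, j, length))
    return blocks
```

On “dictionary” and “indicatory” this picks “dic”, then “ry”, then five single letters, for a total of 7 blocks.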




Minimum common string partition: variants

- One can also consider signed strings: each segment is then equivalent up to a reversal;
- Or equivalence under full reversals: a partition of S is also a partition of T if one can concatenate its elements to obtain T or its reverse;
- Those variants are still hard, but the positive results do not straightforwardly generalise [Bulteau and Komusiewicz, 2014].




Unbalanced strings

- Of course, we are not always so lucky that our genomes are just anagrams;
- Most of the time, duplications are not balanced;
- So, what do we do?




Arbitrary strings

- One idea is to try and match different copies of the same gene across two genomes;
- Three general approaches have been proposed:
  1. the exemplar model;
  2. the intermediate model;
  3. the full model;
- All three are based on a notion of matching.




Matching and pruning

Definition (gene matching)

A gene matching between two strings S and T is a set of disjoint pairs (S_i, T_j) such that S_i = T_j for every such pair (1 ≤ i ≤ |S|, 1 ≤ j ≤ |T|).

Definition (pruning)

Given two strings S and T and a gene matching M, the M-pruning is the pair (S′, T′) obtained by removing all unmatched characters from S and T and relabelling the remaining characters according to M.

(examples to appear shortly)




Matching(s) and pruning(s)

- Matchings will depend on the model we use;
- Since prunings are derived from matchings, they will also vary depending on the underlying model;
- Let us review them on examples.




Exemplar matching / pruning

- In the exemplar model, we match exactly one copy of each gene:

Example (exemplar matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S ′ = 1 2 3 4

T ′ = 1 −3 −2 4




Intermediate matching/pruning

- In the intermediate model, we match at least one copy of each gene:

Example (intermediate matching/pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S ′ = 1 2 3 1′ 4

T ′ = 1 −3 −2 1′ 4




Full matching / pruning

- In the full model, we match as many copies of each gene as possible:

Example (full matching / pruning)

S = 1 2 −4 −2 3 1 4 −3 4

T = 4 1 −3 −2 2 1 2 4

S ′ = 1 2 −4′ −2′ 3 1′ 4

T ′ = 4′ 1′ −3 2′ 1 2 4




Using matchings and prunings

- Once we’ve pruned our input strings, we can compare them as if they were permutations;
- This gives rise to many variations on the following theme:

Problem (“(M, d)-comparison”)

Input: two strings S and T.
Goal: find an “M matching” such that the resulting “M pruning” (S′, T′) minimises d(S′, T′).

- Here M ∈ {exemplar, intermediate, full}, and d is any distance on S_n or S_n^± (with n = |S′| = |T′|).
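To make the problem concrete, here is a brute-force sketch for one instance of the theme: M = exemplar and d = a simplified breakpoint count. It assumes unsigned genes, counts ordered adjacencies without framing elements, and enumerates all exemplar prunings, so it is exponential and only meant for tiny inputs. All function names are ours.

```python
from itertools import product


def exemplar_prunings(s):
    """Yield every pruning of s that keeps exactly one copy of each gene."""
    positions = {}
    for i, gene in enumerate(s):
        positions.setdefault(gene, []).append(i)
    # Choose one position per gene, then read the kept genes left to right.
    for choice in product(*positions.values()):
        yield tuple(s[i] for i in sorted(choice))


def breakpoints(p, q):
    """Count consecutive pairs of p that are not consecutive in q."""
    adjacent_in_q = set(zip(q, q[1:]))
    return sum(1 for pair in zip(p, p[1:]) if pair not in adjacent_in_q)


def exemplar_breakpoint_distance(s, t):
    """Minimum breakpoint count over all pairs of exemplar prunings."""
    common = set(s) & set(t)
    s = [g for g in s if g in common]   # genes absent from one genome
    t = [g for g in t if g in common]   # can never be matched
    return min(breakpoints(p, q)
               for p in exemplar_prunings(s)
               for q in exemplar_prunings(t))
```

For S = 1 2 1 3 and T = 1 3 2, keeping the second copy of 1 in S gives S′ = 2 1 3, which shares the adjacency (1, 3) with T′ = 1 3 2, so the sketch reports a distance of 1.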




Strings

- This is not “just a matching problem”:
  - in matching problems, every edge is given a weight, and we have to optimise a function that takes all weights into account;
  - here, we look for a matching that optimises a quantity, but the edge weights are not fixed to begin with;
- In other words: in matching problems we can compute the cost of a partial solution, while here we must have a full matching before we can even begin to compute the cost.




Strings: extensions

- Strings can of course be signed to take directionality into account;
- They can also be circular;
- And of course we could have a mix of both to represent different chromosomes.


Other models

- We’ve mostly seen (signed) permutations and strings so far;
- Other models may be more suitable, according to:
  - the data we have;
  - the relations we want to take into account;
- We briefly mention the following structures:
  - posets;
  - set systems.


The need for other models

- Most genomes consist of several chromosomes.


Posets

- Recall that genomes are not directly copied from a long string of DNA to a drive:
  1. “small” subsequences called reads are identified;
  2. then those reads are assembled to form the target genome;
- We still want to be able to compare genomes even if only partial gene order information is available;
- This naturally leads us to compare posets instead of permutations or strings.


Posets

- Informally, although we may not know the complete ordering, we may know parts of it;
- So segments are partially ordered, and genomes may be represented by directed acyclic graphs, where:
  - vertices stand for segments;
  - arc (u, v) means “segment u precedes segment v”;
- In this regard, permutations are paths of maximal length;

Example (a genome as a poset)

[Figure: a directed acyclic graph on the segments 1, −2, 3, −5, 6, 10, 9, 12]


Comparing genomes as posets

- Comparing genomes G1 and G2 represented as posets is based on permutations:
  - find linear extensions L1 and L2 that minimise d(L1, L2);
- Another way of trying to aggregate their contents is by:
  - merging them into a conflict-free graph;
  - finding a linear extension of that graph.
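The second approach (merge, then linearise) can be sketched with the standard library. We assume each poset is given as a list of arcs (u, v) meaning “u precedes v”, and we take “conflict-free” to mean that the union of the arcs is acyclic; the function name is ours.

```python
from graphlib import TopologicalSorter, CycleError


def merge_and_linearise(arcs1, arcs2):
    """Merge two posets given as arc lists and return one linear
    extension of the union, or None if the union contains a cycle."""
    predecessors = {}
    for u, v in list(arcs1) + list(arcs2):
        predecessors.setdefault(u, set())
        predecessors.setdefault(v, set()).add(u)
    try:
        # static_order() lists every vertex after all its predecessors.
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None
```

For example, merging the arcs [(1, 2), (2, 3)] and [(1, 3), (3, 4)] yields the single linear extension 1 2 3 4, whereas merging [(1, 2)] with [(2, 1)] is reported as a conflict.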


Finding an “agreement” for posets

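[Figure: poset G1 on the segments 1, −2, 3, −5, 6, 10, 8, 12; poset G2 on the segments 1, −2, −4, −5, 7, 9, 11, 12; and their merged graph G1 ∪ G2 on the segments 1, −2, −4, −5, 3, 6, 10, 8, 12, 7, 9, 11]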


Set systems and the syntenic distance

- Recall that chromosomes are ordered sets of genes;
- Sometimes we’re not interested in order, but in the fact that two segments belong to the same chromosome;
- So we view a genome as a family of (unordered) sets of genes.



- Three operations are taken into account in that setting: fission, fusion and translocation;

[Figure: a genome on the genes a, b, c, p, q, r, x, y transformed by a fission, a fusion and a translocation]

- The syntenic distance between two genomes is then the minimum number of such operations that are needed to transform one genome into the other.



- There is a compact representation that allows us to assume that:
  1. our input is S1, S2, …, Sk (subsets of {1, 2, …, n});
  2. our target is {1}, {2}, …, {n};
- So we want to obtain that genome using as few fissions, fusions and translocations as possible;
- Syntenic genes are simply genes that belong to the same chromosome.


Synteny graph

- A graph-theoretic approach for attacking the problem was proposed:

Definition ([DasGupta et al., 1998])

The synteny graph of an instance S(n, k) is defined by:

- V = {S1, S2, …, Sk};
- E = {{Si, Sj} | Si ∩ Sj ≠ ∅, 1 ≤ i ≠ j ≤ k};

- The synteny graph of our target {1}, {2}, …, {n} has n components.


Mutations and the synteny graph

- Translocations, fusions and fissions affect the graph in different ways:
  - translocations (may) disconnect adjacent vertices;
  - fissions split vertices into two nonadjacent vertices;
  - fusions: opposite of fissions;
- Our goal is to obtain n components;
- It can be proved that the distance is at least n − p (where p is the number of components in our instance’s graph).
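The lower bound n − p is easy to evaluate. A minimal sketch, assuming the instance is a list of Python sets over {1, …, n}; it counts the components of the synteny graph with a union-find structure instead of building the edges explicitly (both function names are ours):

```python
def synteny_components(sets):
    """Number of connected components of the synteny graph, whose
    vertices are the sets and whose edges join intersecting sets."""
    parent = list(range(len(sets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_owner = {}  # gene -> index of the first set containing it
    for i, genes in enumerate(sets):
        for gene in genes:
            if gene in first_owner:
                # Two sets sharing a gene are adjacent: merge them.
                parent[find(i)] = find(first_owner[gene])
            else:
                first_owner[gene] = i
    return len({find(i) for i in range(len(sets))})


def syntenic_lower_bound(sets, n):
    """The bound n - p quoted above, with p the number of components."""
    return n - synteny_components(sets)
```

For the instance {1, 2}, {2, 3}, {4} with n = 4, the graph has p = 2 components, so at least 2 operations are needed.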


About the syntenic distance

- The synteny graph dictates that we want to increase the number of connected components;
- In that regard, restricting oneself to “intra-component moves” seems optimal;
- But any approach that does this is a 2-approximation [Liben-Nowell, 2001];
- No better approximation is known;
- And computing the distance or an optimal scenario is NP-hard [DasGupta et al., 1998].


Today’s models: wrap-up

- As soon as we have duplications, most problems become hard (to solve exactly, or even to approximate within a reasonable factor);
- As soon as we forget about order (partially or completely), we also end up with difficult problems;
- Yet the problems still have to be solved.




Alternative approach: SAT solvers

- SAT solvers are highly optimised programs for solving the well-known NP-complete satisfiability problem [Cook, 1971]:

Problem (satisfiability (SAT))

Input: a Boolean formula φ in conjunctive normal form.
Question: is there a satisfying assignment for φ?

- Idea: take advantage of these solvers.
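To fix ideas about the problem itself, here is a brute-force sketch that tries every assignment. Real SAT solvers are far more sophisticated; this only illustrates the input and output. We assume the common convention of writing a clause as a list of nonzero integers, where k stands for variable k and −k for its negation; the function name is ours.

```python
from itertools import product


def brute_force_sat(clauses):
    """Return a satisfying assignment {variable: bool} for a CNF
    formula, or None if the formula is unsatisfiable."""
    variables = sorted({abs(literal) for clause in clauses for literal in clause})
    for values in product((False, True), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A clause holds if at least one of its literals is true.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None
```

For the formula (x1 ∨ x2) ∧ ¬x1 ∧ (¬x2 ∨ x3), encoded as `[[1, 2], [-1], [-2, 3]]`, the only satisfying assignment sets x1 to false and x2, x3 to true.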




Alternative approach: SAT solvers

- The workflow is as follows:

problem instance → (translation) → Boolean formula → SAT solver → satisfying assignment → solution




Alternative approach: linear and pseudo-boolean programming

- Linear programs are of the form:

  maximise c^T x
  subject to Ax ≤ b
  and x ≥ 0

- Pseudo-boolean programs: same form, but the function to optimise maps {0, 1}^n to ℝ (versus {0, 1} for Boolean functions);
- Specialised solvers also exist for those and were used to solve rearrangement problems on strings [Angibaud et al., 2007] and posets [Angibaud et al., 2009].
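A pseudo-boolean program with a linear objective can be illustrated by exhaustive search over {0, 1}^n. This is only a sketch of what is being optimised, under the assumption that the objective is c^T x and the constraints are Ax ≤ b; dedicated solvers do not enumerate all assignments, and the function name is ours.

```python
from itertools import product


def solve_pseudo_boolean(c, A, b):
    """Maximise c^T x subject to A x <= b over x in {0, 1}^n.

    Returns (best_x, best_value), or (None, None) if infeasible."""
    best_x, best_value = None, None
    for x in product((0, 1), repeat=len(c)):
        # Check every constraint row: sum_j A[i][j] * x[j] <= b[i].
        feasible = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
            for row, b_i in zip(A, b)
        )
        if not feasible:
            continue
        value = sum(c_j * x_j for c_j, x_j in zip(c, x))
        if best_value is None or value > best_value:
            best_x, best_value = x, value
    return best_x, best_value
```

With c = (3, 2, 4), a single constraint 2·x1 + x2 + 3·x3 ≤ 4, the optimum is x = (0, 1, 1) with value 6.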




Comparative genomics wrap-up

- Here we talked mostly about computing “edit distances” between genomes;
- Other measures of similarity exist that are not associated with mutations;
- Many hard problems;
- Much remains to be done in order to satisfy biologists:
  - realistic models;
  - software;
  - ...


Beyond pairwise comparisons

- The genome rearrangement problems we’ve seen were formulated in a pairwise fashion;
- But actually, more than two genomes can be taken into account;
- Unsurprisingly, most problems become hard in that setting.




Why more than two genomes?

- A sequence does not yield enough information for ancestral genome reconstruction:

[Figure: genomes G1 and G2]

- Taking an additional genome into account restricts our choices:

[Figure: genomes G1, G2 and G3]

- What’s more, it’s ultimately one of our goals.




Median problems

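- Measures of similarity between genomes are useful in reconstructing phylogenies;

Example (phylogeny from distance matrix)

    a  b  c  d  e
a   0  2  3  6  6
b   2  0  3  6  6
c   3  3  0  5  5
d   6  6  5  0  4
e   6  6  5  4  0

[Figure: the corresponding tree. Leaves a and b each hang from an internal node u by an edge of weight 1; leaf c hangs from an internal node v by an edge of weight 1; leaves d and e each hang from an internal node w by an edge of weight 2; the internal edges u–v and v–w have weights 1 and 2.]

- (The matrix must satisfy some conditions [Buneman, 1971]).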
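One classical condition of this kind is the four-point condition: for every four taxa, among the three ways of pairing them up, the two largest sums of distances must be equal. We assume here that this is the condition the slide alludes to. A minimal sketch that checks it on the example matrix (the function name is ours):

```python
from itertools import combinations


def satisfies_four_point(d):
    """Check the four-point condition on a symmetric distance matrix."""
    for a, b, c, e in combinations(range(len(d)), 4):
        sums = sorted([
            d[a][b] + d[c][e],
            d[a][c] + d[b][e],
            d[a][e] + d[b][c],
        ])
        # The two largest pairing sums must coincide.
        if sums[1] != sums[2]:
            return False
    return True


# Distance matrix of the example, rows and columns in the order a..e.
example = [
    [0, 2, 3, 6, 6],
    [2, 0, 3, 6, 6],
    [3, 3, 0, 5, 5],
    [6, 6, 5, 0, 4],
    [6, 6, 5, 4, 0],
]
```

The example matrix passes the check, which is consistent with the fact that a weighted tree realises it.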




Median problems

- Parsimony again: search for a tree that minimises the total number of evolutionary events (i.e. the sum of all edge weights);
- In its simplest form, the problem we want to solve is:

Problem (median of three)

Given: π, σ, τ in S_n^±; a distance d : S_n^± × S_n^± → ℕ.
Find: a permutation µ in S_n^± that minimises

w(µ) = d(π, µ) + d(σ, µ) + d(τ, µ).

- Can be generalised to more than three input permutations.
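For very small n, a median can be found by exhaustive search. The sketch below simplifies the setting to unsigned permutations and uses the exchange distance (n minus the number of cycles of the permutation mapping one input to the other) as an illustrative choice of d; the search itself works with any distance function. Names are ours.

```python
from itertools import permutations


def exchange_distance(p, q):
    """Minimum number of exchanges (swaps of two elements) turning p into q."""
    position_in_q = {value: i for i, value in enumerate(q)}
    target = [position_in_q[value] for value in p]
    seen = [False] * len(p)
    cycles = 0
    for start in range(len(p)):
        if not seen[start]:
            cycles += 1
            i = start
            while not seen[i]:
                seen[i] = True
                i = target[i]
    return len(p) - cycles


def median_of_three(pi, sigma, tau, d):
    """Brute-force median: try all n! permutations (tiny n only).

    Returns (mu, w(mu)) for a permutation mu of minimum weight."""
    def weight(mu):
        return d(pi, mu) + d(sigma, mu) + d(tau, mu)

    mu = min(permutations(range(1, len(pi) + 1)), key=weight)
    return mu, weight(mu)
```

For π = σ = 1 2 3 and τ = 2 1 3, the median is µ = 1 2 3 with w(µ) = 1.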




Generic bounds [Siepel and Moret, 2001]

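- Generic lower and upper bounds for any distance:

[Figure: a triangle on π, σ, τ with edges labelled d(π, σ), d(π, τ) and d(σ, τ), and a median µ joined to the three corners by edges labelled d(π, µ), d(µ, σ) and d(µ, τ)]

- Upper bound, obtained by choosing µ among the inputs:

\[
w(\mu) \le \min\Bigl\{
\overbrace{d(\pi,\sigma) + d(\pi,\tau)}^{\text{if } \mu = \pi},\;
\overbrace{d(\pi,\sigma) + d(\sigma,\tau)}^{\text{if } \mu = \sigma},\;
\overbrace{d(\pi,\tau) + d(\sigma,\tau)}^{\text{if } \mu = \tau}
\Bigr\}
\]

- Lower bound:

\[
2w(\mu) = d(\pi,\mu) + d(\pi,\mu) + d(\sigma,\mu) + d(\sigma,\mu) + d(\tau,\mu) + d(\tau,\mu)
\ge d(\pi,\sigma) + d(\pi,\tau) + d(\sigma,\tau) \quad \text{(triangle inequalities)}
\]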
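Both bounds only depend on the three pairwise distances, so they can be computed before any search for µ. A minimal sketch, assuming integer-valued distances (which lets us round the lower bound up); the function name is ours:

```python
def median_bounds(d_pi_sigma, d_pi_tau, d_sigma_tau):
    """Return (lower, upper) bounds on the weight w(mu) of a median."""
    total = d_pi_sigma + d_pi_tau + d_sigma_tau
    # 2 w(mu) >= total, and w(mu) is an integer, so round up.
    lower = (total + 1) // 2
    # Choosing mu among the three inputs drops the largest distance.
    upper = min(d_pi_sigma + d_pi_tau,
                d_pi_sigma + d_sigma_tau,
                d_pi_tau + d_sigma_tau)
    return lower, upper
```

For pairwise distances 2, 3 and 3, the weight of a median lies between 4 and 5.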




Results on median problems

- What has been done:

Operation or measure          Median of three            Best approximation
breakpoint                    NP-hard [Bryant, 1998]     5/3 [Caprara, 2002]
signed breakpoint             NP-hard [Bryant, 1998]     7/6 [Pe’er and Shamir, 2000]
exchange                      ?                          ?
signed reversal               NP-hard [Caprara, 2003]    4/3 [Caprara, 1999]
signed double-cut-and-join    NP-hard [Caprara, 2003]    4/3 [Caprara, 1999]
transposition                 NP-hard [Bader, 2011]      ?

- What could be done:
  1. complexity of the exchange median problem? (trivial for 2 permutations, NP-hard for ≥ 4; what about 3?)
  2. better approximations;
  3. “median clouds” [Eriksen, 2009].




Further topics

- Other topics could have been discussed:
  - what to do in the presence of multiple optimal sequences?
  - what can be said about the distribution of those distances?
  - how else can we assess the quality of the solutions?
  - how do we modify them if they’re unsatisfactory?
  - what other biological constraints can we add?
  - ...




References

Angibaud, S., Fertin, G., Rusu, I., and Vialette, S. (2007). A pseudo-boolean framework for computing rearrangement distances between genomes with duplicates. Journal of Computational Biology, 14(4):379–393.

Angibaud, S., Fertin, G., Thevenin, A., and Vialette, S. (2009). Pseudo boolean programming for partially ordered genomes. In Ciccarelli, F. and Miklos, I., editors, RECOMB-CG, volume 5817 of Lecture Notes in Computer Science, pages 126–137. Springer.

Bader, M. (2011). The transposition median problem is NP-complete. Theoretical Computer Science, 412(12-14):1099–1110.

Blin, G., Fertin, G., Chauve, C., et al. (2004). The breakpoint distance for signed sequences. In 1st Conference on Algorithms and Computational Methods for biochemical and Evolutionary Networks (CompBioNets’04), volume 3, pages 3–16.

Bryant, D. (1998). The complexity of the breakpoint median problem. Technical report, Universite De Montreal.

Bulteau, L. and Komusiewicz, C. (2014). Minimum common string partition parameterized by partition size is fixed-parameter tractable. In Proc. 25th SODA, pages 102–121.




References (continued)

Buneman, P. (1971). The recovery of trees from measures of dissimilarity. Mathematics in the Archaeological and Historical Sciences, pages 387–395.

Caprara, A. (1999). Formulations and hardness of multiple sorting by reversals. In RECOMB’99, pages 84–93, New York, NY, USA. ACM.

Caprara, A. (2002). Additive bounding, worst-case analysis, and the breakpoint median problem. SIAM Journal on Optimization, 13:508–519.

Caprara, A. (2003). The reversal median problem. INFORMS Journal on Computing, 15:93–113.

Cook, S. A. (1971). The complexity of theorem-proving procedures. In Proc. 3rd STOC, pages 151–158, Shaker Heights, Ohio, USA. ACM.

DasGupta, B., Jiang, T., Kannan, S., Li, M., and Sweedyk, E. (1998). On the complexity and approximation of syntenic distance. Discrete Applied Mathematics, 88(1-3):59–82.




References (continued)

Eriksen, N. (2009). Median clouds and a fast transposition median solver. In FPSAC’09, Discrete Math. Theor. Comput. Sci. Proc., AK, pages 373–384. Assoc. Discrete Math. Theor. Comput. Sci., Nancy.

Goldstein, A., Kolman, P., and Zheng, J. (2005). Minimum common string partition problem: Hardness and approximations. Electronic Journal of Combinatorics, 12(1).

Goldstein, I. and Lewenstein, M. (2011). Quick greedy computation for minimum common string partitions. In Giancarlo, R. and Manzini, G., editors, CPM, volume 6661 of Lecture Notes in Computer Science, pages 273–284. Springer.

Kaplan, H. and Shafrir, N. (2006). The greedy algorithm for edit distance with moves. Information Processing Letters, 97(1):23–27.

Liben-Nowell, D. (2001). On the structure of syntenic distance. Journal of Computational Biology, 8(1):53–67.

Ohno, S. (1970). Evolution by gene duplication. Springer-Verlag.




References (continued)

Pe’er, I. and Shamir, R. (2000). Approximation algorithms for the median problem in the breakpoint model. In Sankoff, D. and Nadeau, J. H., editors, Comparative Genomics, pages 225–241. Kluwer, Dordrecht.

Siepel, A. C. and Moret, B. M. E. (2001). Finding an optimal inversion median: Experimental results. In WABI’01, volume 2149 of LNCS, pages 189–203. Springer-Verlag.
