comparison of large sequences

Post on 12-Jan-2016



Comparison of large sequences

First part:

Alignment of large sequences

Dynamic programming

What about genomes?

• Quadratic cost of space and time.

accaccacaccacaacgagcata … acctgagcgatat

acc..t

• Short sequences (up to 10,000 bp) can be aligned using dynamic programming.

acc.................................agt
|||.................................|xx
acc.................................a--
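To make the quadratic cost concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch scheme). The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not taken from the slides.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    # The table has (n + 1) * (m + 1) cells: quadratic space and time.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_alignment("accagt", "acca")
print(score)
print(top)
print(bottom)
```

Both nested loops run over every cell of the table, which is why this approach stops being practical once the sequences reach genome size.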

Genomic sequences

When can dynamic programming be applied?

• Genomic sequences have millions of base pairs.

• The sequences are about 1,000 times longer.

• With quadratic cost, the running time is 1,000,000 times higher!

(1 second becomes about 11 days; 1 minute becomes about 2 years.)
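The figures above follow directly from the quadratic cost. A small sketch of the arithmetic, assuming a length factor of exactly 1,000 and a 365-day year (the helper name is hypothetical):

```python
LENGTH_FACTOR = 1_000                  # sequences assumed 1,000 times longer
TIME_FACTOR = LENGTH_FACTOR ** 2       # quadratic cost: 1,000,000 times slower

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scaled_seconds(seconds):
    """Running time after the sequences grow by LENGTH_FACTOR."""
    return seconds * TIME_FACTOR


days = scaled_seconds(1) / SECONDS_PER_DAY       # 1 second scaled up, in days
years = scaled_seconds(60) / SECONDS_PER_YEAR    # 1 minute scaled up, in years
print(round(days, 1), round(years, 1))
```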

First assumption

[Figure: Genome A and Genome B, drawn as two parallel sequences and as a two-axis plot with Genome B horizontal and Genome A vertical.]

Realistic assumption?

Unrealistic assumption!

More realistic assumption

[Figure: Genome A and Genome B, drawn as two parallel sequences and as a two-axis plot with Genome B horizontal and Genome A vertical.]

Realistic assumptions?

But is this now a real case?

Unrealistic assumption!

More realistic assumption

[Figure: Genome A and Genome B, drawn as two parallel sequences and as a two-axis plot with Genome B horizontal and Genome A vertical.]

Preview in a real case

Chlamydia muridarum: 1,084,689 bp; Chlamydia trachomatis: 1,057,413 bp

Preview in a real case

Pyrococcus abyssi: 1,790,334 bp; Pyrococcus horikoshii: 1,763,341 bp

Methodology of an alignment

1st: Make a preview. (Linear cost)

2nd: Identify the portions that can be aligned. (Linear cost)

3rd: Make the alignment.

Preview-Revisited

… a a t g….c t g...

… c g t g….c c c ...

Maximal Unique Matching (MUM)

Connect to MALGEN

Methodology of an alignment

1st: Make a preview.

2nd: Identify the portions that can be aligned.

3rd: Make the alignment: with CLUSTALW, TCOFFEE, …

How can MUMs be found? How can these portions be determined?

Linear cost with suffix trees

Bioinformatics PhD. Course

Second part:

Introducing Suffix trees

Suffix trees

Given string ababaas:

Suffixes:

1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s

[Figure: suffix tree of ababaas, with each leaf labelled by its remaining edge text and its suffix number.]

What kind of queries?

Applications of Suffix trees

[Figure: suffix tree of ababaas.]

1. Exact string matching

• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
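The query above can be answered by walking down from the root: a pattern occurs in the text exactly when it is a prefix of some suffix. The sketch below uses an uncompressed suffix trie built from nested dictionaries rather than a compact suffix tree, which keeps the code short at the price of quadratic space; the function names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (one char per edge)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern can be spelt from the root, i.e. it occurs in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, contains(trie, pattern))
```

Each query costs time proportional to the pattern length, independent of the length of the text.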

Quadratic insertion algorithm

Given the string … and the suffix tree …

Invariant properties:

P1: the leaves of the suffixes from … have been inserted.

Quadratic insertion algorithm

Given the string ababaabbs, the suffixes 1 to 9 are inserted one at a time. The successive slides show the tree after each insertion; the final suffix tree is (each leaf shows its edge label and suffix number):

root
  a
    abbs,5
    b
      a
        abbs,3
        baabbs,1
      bs,6
  b
    a
      baabbs,2
      abbs,4
    bs,7
    s,8
  s,9
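The insertion procedure walked through above can be sketched as follows. This is a minimal illustration, not the course's implementation: the `Node` class and function names are hypothetical, and the code assumes the text ends with a terminator that occurs nowhere else (the final s in ababaabbs), so that no suffix is a prefix of another.

```python
class Node:
    def __init__(self):
        self.children = {}   # first char of edge label -> (edge label, child)
        self.leaf = None     # 1-based suffix number, set only on leaves


def build_suffix_tree(text):
    """Insert suffixes 1..n one by one, restarting from the root each time.

    Every insertion may scan O(n) characters, hence the quadratic total cost.
    Assumes text ends with a unique terminator character.
    """
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = start + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of the edge label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Split the edge: the new internal node keeps the old subtree.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaves(node, prefix=""):
    """Map each suffix number to the string spelt from the root to its leaf."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, prefix + label))
    return found


tree = build_suffix_tree("ababaabbs")
print(sorted(tree.children))   # first characters of the edges leaving the root
```

Storing the edge labels as substrings keeps the sketch readable; a practical implementation would store index pairs into the text instead.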

Generalized suffix tree

The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings.

For instance, the generalized suffix tree of ababaabb and aabaa is the suffix tree of ababaabbαaabaaβ, where α and β are terminator symbols.

Given the suffix tree of ababaabbα, the suffix tree of ababaabbαaabaaβ is constructed as follows:

[Figure: step-by-step construction. Starting from the suffix tree of ababaabbα (leaves 1 to 9), the suffixes of aabaaβ are inserted one at a time (leaves 1 to 6), giving the generalized suffix tree of ababaabbαaabaaβ.]


Applications of Suffix trees

2. The substring problem for a database of strings DB

• Does the DB contain any occurrence of the patterns abab, aab, and ab?

[Figure: generalized suffix tree of ababaabbαaabaaβ.]

Applications of Suffix trees

3. The longest common substring of two strings

[Figure: generalized suffix tree of ababaabbαaabaaβ.]
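In a generalized suffix tree, the longest common substring is spelt by the deepest node that has leaves from both strings below it. The sketch below illustrates that idea on an uncompressed trie: instead of the terminators α and β, each node records which string contributed it. That substitution, and the function name, are assumptions made to keep the example short.

```python
def longest_common_substring(s1, s2):
    """Longest common substring via a generalized (uncompressed) suffix trie."""
    root = {"kids": {}, "owners": set()}
    for owner, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)   # this path occurs in string `owner`

    # Depth-first search restricted to nodes reached by both strings.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            if child["owners"] == {1, 2}:
                word = path + ch
                if len(word) > len(best):
                    best = word
                stack.append((child, word))
    return best


print(longest_common_substring("ababaabb", "aabaa"))
```

The trie takes quadratic space; with a compact generalized suffix tree the same traversal runs in time linear in the total length of the two strings.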

Applications of Suffix trees

5. Finding MUMs.

[Figure: generalized suffix tree of ababaabbαaabaaβ.]

Bioinformatics PhD. Course

Third part:

Suffix links

Suffix links

[Figure: suffix tree of ababaabbα, repeated over several slides that mark the suffix links of its internal nodes one at a time.]

Traversal using Suffix links

[Figure: suffix tree of ababaabbα, traversed with S2 over successive slides.]

Given S2 = a a b a a b b a

First steps: aa in S2[1]; aab in S2[1] = S1[5..6-7] in S2[1]

Unique matchings at the end of the traversal:

S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]

From UMs to MUMs

Given S2 = a a b a a b b a and S1 = a b a b a a b b α

Unique matchings:

S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]

Array of UMs (indexed by position in S1):

Position: 1 2 3   4   5 6 7 8 9
Entry:    - - 6-8 6-8 8 8 8 - -

MUM: S1[3..6-8] in S2[2]
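As a cross-check of the result above, the definition of a MUM can be applied by brute force: a match that cannot be extended to the left or right and that occurs exactly once in each sequence. This sketch is deliberately naive and far from linear cost; it does not use suffix trees or suffix links, and its function names are hypothetical. Positions are 1-based, as in the slides, and the terminator α is left out.

```python
def count_occurrences(text, word):
    """Count (possibly overlapping) occurrences of word in text."""
    return sum(1 for i in range(len(text) - len(word) + 1)
               if text.startswith(word, i))


def find_mums(s1, s2):
    """Return all MUMs as (substring, start in s1, start in s2), 1-based."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # Skip pairs that could be extended to the left.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            word = s1[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(s1, word) == 1 and count_occurrences(s2, word) == 1:
                mums.append((word, i + 1, j + 1))
    return mums


print(find_mums("ababaabb", "aabaabba"))
```

For the example strings this reports a single MUM, abaabb, starting at position 3 of S1 and position 2 of S2.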

Bioinformatics PhD. Course

Third part:

Linear insertion algorithm

Quadratic insertion algorithm

Given the string … and the suffix tree …

Invariant properties:

P1: the leaves of the suffixes from … have been inserted.

Linear insertion algorithm

Given the string … and the suffix tree …

Invariant properties:

P1: the leaves of the suffixes from … have been inserted.

P2: the string … is the longest string that can be spelt through the tree.

Linear insertion algorithm: example

Given the string ababaababb...

[Figure: step-by-step construction of the suffix tree. The first slide shows the tree with leaves 1 to 5; successive slides insert the leaves b...,6, b...,7, b...,8 and b...,9.]

Index

Suffix arrays: "Suffix arrays: a new method for on-line string searches", G. Myers, U. Manber

Suffix arrays

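Given string ababaa#:

Suffixes:

1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #

… but lexicographically sorted (taking # as smaller than any letter):

Position 1: # (suffix 7)
Position 2: a# (suffix 6)
Position 3: aa# (suffix 5)
Position 4: abaa# (suffix 3)
Position 5: ababaa# (suffix 1)
Position 6: baa# (suffix 4)
Position 7: babaa# (suffix 2)

What is the cost? O(n log(n))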
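A suffix array can be sketched in one line by sorting the suffix start positions. Note the assumption: this naive version compares whole suffixes, so its worst-case cost is higher than the O(n log(n)) quoted above, which needs a dedicated construction algorithm. The ordering of # before the letters comes from Python's character codes.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


text = "ababaa#"
for position, start in enumerate(suffix_array(text), 1):
    print(position, start, text[start - 1:])
```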

Applications of suffix arrays

1. Exact string matching

• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?

[Figure: the suffixes of ababaa# with array positions 1 to 7.]

Binary search: what is the cost? O(log(n) |P|)

Can it be improved to O(log(n)+|P|)?
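The plain binary search can be sketched as below: each of the O(log(n)) steps compares the pattern against one suffix, costing up to |P| character comparisons. The function names are illustrative, and the naive `suffix_array` helper is included only to keep the example self-contained.

```python
def suffix_array(text):
    """Naive construction: sort the 1-based suffix start positions."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def occurs(text, sa, pattern):
    """Binary search for the first suffix whose |P|-prefix is >= pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        # Each comparison looks at up to len(pattern) characters.
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # The pattern occurs iff that suffix actually starts with it.
    return lo < len(sa) and text.startswith(pattern, sa[lo] - 1)


text = "ababaa#"
sa = suffix_array(text)
for pattern in ("abab", "aab", "ab"):
    print(pattern, occurs(text, sa, pattern))
```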

Fast search with cost O(log(n)+|P|)

Query: …

[Figure: suffix array with positions 1, 2, …, n; α and β mark the ends of the current search interval and γ a position between them.]

Invariant properties:

P1: α < query ≤ β

P2: … matches pref(query)

Algorithm:

If suff(γ) < suff(query) then α = γ

else β = γ
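The update rule above can be sketched with one common refinement: remember how many characters of the query already match the suffixes at α and β, and skip those characters when comparing at γ. This is an assumption about what P2 refers to, since the slide text is incomplete. The sketch saves comparisons in practice but does not by itself guarantee O(log(n)+|P|) in the worst case; that bound needs precomputed longest-common-prefix information, which is not shown here. All names are illustrative.

```python
def suffix_array(text):
    """Naive construction: sort the 1-based suffix start positions."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def extend_match(text, start, pattern, k):
    """Extend a known common prefix of length k of text[start:] and pattern."""
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k


def fast_occurs(text, sa, pattern):
    """Binary search keeping suffix(alpha) < pattern <= suffix(beta)."""
    alpha, beta = -1, len(sa)   # sentinels just outside the array
    l = r = 0                   # matched lengths at alpha and at beta
    while beta - alpha > 1:
        gamma = (alpha + beta) // 2
        start = sa[gamma] - 1
        # Suffixes between alpha and beta share min(l, r) chars with pattern.
        k = extend_match(text, start, pattern, min(l, r))
        if k == len(pattern):
            beta, r = gamma, k              # suffix starts with the pattern
        elif start + k == len(text) or text[start + k] < pattern[k]:
            alpha, l = gamma, k             # suffix is smaller than the pattern
        else:
            beta, r = gamma, k              # suffix is larger than the pattern
    return beta < len(sa) and r == len(pattern)


text = "ababaa#"
sa = suffix_array(text)
for pattern in ("abab", "aab", "ab"):
    print(pattern, fast_occurs(text, sa, pattern))
```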
