cs 466 introduction to bioinformatics - el-kebir · introduction to bioinformatics lecture 5...

42
CS 466 Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 9, 2020

Upload: others

Post on 05-Aug-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

CS 466Introduction to Bioinformatics

Lecture 5

Mohammed El-KebirSeptember 9, 2020

Page 2: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Course AnnouncementsInstructor:• Mohammed El-Kebir (melkebir)• Office hours: Wednesdays, 3:15-4:15pm

TA:• Aswhin Ramesh (aramesh7)• Office hours: Fridays, 11:00-11:59am in SC 3405

2

Homework 1: Due on Sept. 18 (11:59pm)

Midterm: 10/4, 11-1pm @Transportation Building 103(conflict: 10/7, 7-9pm @Siebel 1302 -- to sign up email me)

Page 3: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Global, Fitting and Local Alignment

3

Local Alignment problem: Given strings 𝐯 ∈ Σ!and 𝐰 ∈ Σ" and scoring function 𝛿, find a substring

of 𝐯 and a substring of 𝐰 whose alignment has maximum global alignment score 𝑠∗ among allglobal alignments of all substrings of 𝐯 and 𝐰

[Smith-Waterman algorithm]

Fitting Alignment problem: Given strings 𝐯 ∈ Σ!and 𝐰 ∈ Σ" and scoring function 𝛿, find an

alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global

alignments of 𝐯 and all substrings of 𝐰

Global Alignment problem: Given strings 𝐯 ∈ Σ!and 𝐰 ∈ Σ" and scoring function 𝛿, find alignment

of 𝐯 and 𝐰 with maximum score.[Needleman-Wunsch algorithm]

A

G

G

0T A C G G C0𝐯\𝐰

Question: How to assess resulting algorithms?

Page 4: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Time Complexity

4

0 1 2 3 n = 4

0 O O O O O

1 O O O O O

2 O O O O O

3 O O O O O

m = 4 O O O O O

WA T C G

A

T

G

T

V

Alignment is a path from source (0, 0)to target (𝑚, 𝑛) in edit graph

Edit graph is a weighed, directed grid graph 𝐺 = (𝑉, 𝐸) with source vertex (0, 0) and target vertex (𝑚, 𝑛). Each

edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight depending on direction.

Running time is 𝑂(𝑚𝑛)[quadratic time]

Page 5: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Time Complexity

5

0 1 2 3 n = 4

0 O O O O O

1 O O O O O

2 O O O O O

3 O O O O O

m = 4 O O O O O

WA T C G

A

T

G

T

V

Alignment is a path from source (0, 0)to target (𝑚, 𝑛) in edit graph

Edit graph is a weighed, directed grid graph 𝐺 = (𝑉, 𝐸) with source vertex (0, 0) and target vertex (𝑚, 𝑛). Each

edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight depending on direction.

Running time is 𝑂(𝑚𝑛)[quadratic time]

Question: Compute alignment faster than 𝑂(𝑚𝑛) time? [subquadratic time]

Page 6: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Complexity

6

0 1 2 3 n = 4

0

1

2

3

m = 4

WA T C G

A

T

G

T

VThus, space complexity is 𝑂(𝑚𝑛)

[quadratic space]

Size of DP table is 𝑚 + 1 × (𝑛 + 1)

Example: To align a short read (𝑚 = 100) to

human genome (𝑛 = 3 4 10!), we need300 GB memory.

Page 7: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Complexity

7

0 1 2 3 n = 4

0

1

2

3

m = 4

WA T C G

A

T

G

T

VThus, space complexity is 𝑂(𝑚𝑛)

[quadratic space]

Size of DP table is 𝑚 + 1 × (𝑛 + 1)

Example: To align a short read (𝑚 = 100) to

human genome (𝑛 = 3 4 10!), we need300 GB memory.

Question: How long is an alignment?

Page 8: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Complexity

8

0 1 2 3 n = 4

0

1

2

3

m = 4

WA T C G

A

T

G

T

VThus, space complexity is 𝑂(𝑚𝑛)

[quadratic space]

Size of DP table is 𝑚 + 1 × (𝑛 + 1)

Example: To align a short read (𝑚 = 100) to

human genome (𝑛 = 3 4 10!), we need300 GB memory.

Question: How long is an alignment?

Question: Compute alignment in 𝑂(𝑚) space? [linear space]

Page 9: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Outline1. Recap of global, fitting, local and gapped alignment2. Space-efficient alignment3. Subquadratic time alignment

Reading:• Jones and Pevzner. Chapters 7.1-7.4• Lecture notes

9

Page 10: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

2/2/12 1:56 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=c12cdb…e1417e8453f71b12fffb812a0765e61cabb13181e78b7cf61aa3092b81458876e

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 232.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=251Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

𝑚

00 𝑛

Space Efficient Alignment

10

Computing 𝑠[𝑖, 𝑗] requires access to: 𝑠 𝑖 − 1, 𝑗 ,𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]

𝑖

𝑗

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j] + �(vi,�), if i > 0,

s[i, j � 1] + �(�, wj), if j > 0,

s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.

Page 11: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

2/2/12 1:56 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=c12cdb…e1417e8453f71b12fffb812a0765e61cabb13181e78b7cf61aa3092b81458876e

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 232.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=251Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

𝑚

00 𝑛

Space Efficient Alignment

11

Computing 𝑠[𝑖, 𝑗] requires access to: 𝑠 𝑖 − 1, 𝑗 ,𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]

𝑖

𝑗

Thus it suffices to store only two columns to compute optimal alignment score 𝑠 𝑚, 𝑛 ,

i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j] + �(vi,�), if i > 0,

s[i, j � 1] + �(�, wj), if j > 0,

s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.

Page 12: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

2/2/12 1:56 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=c12cdb…e1417e8453f71b12fffb812a0765e61cabb13181e78b7cf61aa3092b81458876e

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 232.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=251Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

𝑚

00 𝑛

Space Efficient Alignment

12

Question: What if we want alignment itself?

Computing 𝑠[𝑖, 𝑗] requires access to: 𝑠 𝑖 − 1, 𝑗 ,𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]

𝑖

𝑗

Thus it suffices to store only two columns to compute optimal alignment score 𝑠 𝑚, 𝑛 ,

i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j] + �(vi,�), if i > 0,

s[i, j � 1] + �(�, wj), if j > 0,

s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.

Page 13: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Efficient Alignment – First Attempt

• What if also want optimal alignment?• Easy: keep best pointers as fill in table.• No! Do not know which path to keep until computing

recurrence at each step.

w

vw

v

Page 14: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Efficient Alignment – First Attempt

• What if also want optimal alignment?• Easy: keep best pointers as fill in table.• No! Do not know which path to keep until computing

recurrence at each step.

w

vw

v

Page 15: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Efficient Alignment – First Attempt

• What if also want optimal alignment?• Easy: keep best pointers as fill in table.• No! Do not know which path to keep until computing

recurrence at each step.

w

vw

v

Best score for column might not be part of best alignment!

Page 16: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Efficient Alignment – Second Attempt

16

𝑛/2

𝑚

𝑖∗

𝑖

Maximum weight path from (0,0) to (𝑚, 𝑛) passes

through (𝑖∗, 𝑛/2)

Question: What is 𝑖∗?

Alignment is a path from source (0, 0) to target (𝑚, 𝑛)

in edit graph

Page 17: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Efficient Alignment – Second Attempt

17

𝑛/2

𝑚

𝑖∗

𝑖

Page 18: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Space Efficient Alignment – Second Attempt

18

𝑛/2

𝑚

𝑖∗

𝑖

Page 19: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

19

2/2/12 1:53 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=516d6f…41115138e47718afb5f0e14342e57210a6977481c8e3ce3915fa0a11e19f36678

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 235.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=254Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

Hirschberg(𝑖, 𝑗, 𝑖", 𝑗′)1. if 𝑗" − 𝑗 > 12. 𝑖∗ ß arg max

$%$!!%$!wt(𝑖′′)

3. Report (𝑖∗, 𝑗 + &!'&( )

4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + &!'&( )

5. Hirschberg(𝑖∗, 𝑗 + &!'&(, 𝑖", 𝑗′)

Page 20: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

20

2/2/12 1:53 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=516d6f…41115138e47718afb5f0e14342e57210a6977481c8e3ce3915fa0a11e19f36678

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 235.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=254Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = O(mn)

Space: O(m)

Hirschberg(𝑖, 𝑗, 𝑖", 𝑗′)1. if 𝑗" − 𝑗 > 12. 𝑖∗ ß arg max

$%$!!%$!wt(𝑖′′)

3. Report (𝑖∗, 𝑗 + &!'&( )

4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + &!'&( )

5. Hirschberg(𝑖∗, 𝑗 + &!'&(, 𝑖", 𝑗′)

Page 21: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

21

2/2/12 1:53 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=516d6f…41115138e47718afb5f0e14342e57210a6977481c8e3ce3915fa0a11e19f36678

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 235.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=254Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

Hirschberg(𝑖, 𝑗, 𝑖", 𝑗′)1. if 𝑗" − 𝑗 > 12. 𝑖∗ ß arg max

$%$!!%$!wt(𝑖′′)

3. Report (𝑖∗, 𝑗 + &!'&( )

4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + &!'&( )

5. Hirschberg(𝑖∗, 𝑗 + &!'&(, 𝑖", 𝑗′)

Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = O(mn)

Space: O(m)

Question: How to reconstruct alignment from reported vertices?

Page 22: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Hirschberg Algorithm: Reversing Edges Necessary?

22

𝑖∗ = arg max{$%&%!

pre2ix 𝑖 + suf2ix 𝑖 }

Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖∗, 𝑛/2)

Compute pre2ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)space, by starting from (0,0) to 𝑚, 𝑗 keeping only two

columns in memory. [single-source multiple destinations]

𝑗

𝑚

𝑖∗

𝑖

Page 23: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Hirschberg Algorithm: Reversing Edges Necessary?

23

𝑖∗ = arg max{$%&%!

pre2ix 𝑖 + suf2ix 𝑖 }

Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖∗, 𝑛/2)

Want: Compute suf2ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚) space

Doing a longest path from each 𝑖, 𝑗 to 𝑚, 𝑛 (for all 0 ≤ 𝑖 ≤ 𝑚) will not achieve desired running time!

Reversing edges enables single-source multiple destination computation in desired time and space bound!

𝑗

𝑚

𝑖∗

𝑖

Compute pre2ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)space, by starting from (0,0) to 𝑚, 𝑗 keeping only two

columns in memory. [single-source multiple destinations]

Page 24: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Hirschberg Algorithm: Reconstructing Alignment

24

Problem: Given reported vertices and scores { 𝑖$, 0, 𝑠$ , … , 𝑖", 𝑛, 𝑠" },

find intermediary vertices.

Hirschberg(𝑖, 𝑗, 𝑖', 𝑗′)1. if 𝑗' − 𝑗 > 12. 𝑖∗ ß arg max

$%&%!wt(𝑖)

3. Report (𝑖∗, 𝑗 + (!)(*)

4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + (!)(*

)

5. Hirschberg(𝑖∗, 𝑗 + (!)(*, 𝑖', 𝑗′)

0 1 2 3 4 5

0 0 -1 -2 -3 -4 -5

1 -1 1 0 -1 -2 -3

2 -2 0 2 1 0 -1

3 -3 -1 1 1 2 1

4 -4 -2 0 0 1 1

5 -5 -3 -1 1 0 2

WA T C G

A

T

G

T

V

A T - G T C

A T C G - C

C

CTransposing matrix does not help, because gaps could occur in both

input sequences (and there might be multiple opt. alignments)

Page 25: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Linear Space Alignment – The Hirschberg Algorithm

25

Page 26: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Outline1. Recap of global, fitting, local and gapped alignment2. Space-efficient alignment3. Subquadratic time alignment

Reading:• Jones and Pevzner. Chapters 7.1-7.4• Lecture notes

26

Page 27: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Banded Alignment

27

3. A�ne Gap Costs (More biologically realistic): We can extend this problem further byintroducing an a�ne gap penalty function

� + (n� 1)✏

where � is the gap open penalty, n is the gap length and ✏ is the gap extension penalty.A gap extension penalty which is smaller than a gap open penalty more directly modelsthe underlying biology, as a few longer gaps (corresponding to, say, double strandedDNA breaks and repair) are more likely than several small gaps. Gotoh’s algorithmallows local alignment with a�ne gap costs in quadratic time.

4. Linear Space: For large pairs of sequences, the O(MN) memory taken up by the tableof scores can be substantial. Hirschberg’s algorithm is a way of computing only thenecessary entries of F which allows O(N) memory usage (hence linear space) at theexpense of roughly doubly the running time.

5. Banded Alignment : Banded alignment is best illustrated with a figure. It reduces thetime complexity of alignment to roughly linear time by being semi-greedy, as it disallowsthe possibility of a high scoring alignment that wanders significantly o↵ the diagonal(corresponding to a big gap in either sequence).

x1

x2

x3

.

.

.

.

.

.

xM

y1 y2 y3 ... ... ... ... yN

Constrain traceback to band of DP matrix (penalize big gaps)

Figure 7 | By constraining the traceback to a band of the matrix, we can make the running timeroughly linear (if N = M , then the running time would be O(Nk(N)), where k(N) is the char-acteristic band width. Biologically, this corresponds to disallowing long gaps. Computationally,cells which are outside of the banded region are not considered when taking the max; this meanssetting F (i, j) = �1 for |i� j| > k(N).

10

Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf

Constraint path to band of width 𝑘around diagonal

Running time: O(𝑛𝑘)

Gives a good approximation of highly identical sequences

Page 28: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Banded Alignment

28

3. A�ne Gap Costs (More biologically realistic): We can extend this problem further byintroducing an a�ne gap penalty function

� + (n� 1)✏

where � is the gap open penalty, n is the gap length and ✏ is the gap extension penalty.A gap extension penalty which is smaller than a gap open penalty more directly modelsthe underlying biology, as a few longer gaps (corresponding to, say, double strandedDNA breaks and repair) are more likely than several small gaps. Gotoh’s algorithmallows local alignment with a�ne gap costs in quadratic time.

4. Linear Space: For large pairs of sequences, the O(MN) memory taken up by the tableof scores can be substantial. Hirschberg’s algorithm is a way of computing only thenecessary entries of F which allows O(N) memory usage (hence linear space) at theexpense of roughly doubly the running time.

5. Banded Alignment : Banded alignment is best illustrated with a figure. It reduces thetime complexity of alignment to roughly linear time by being semi-greedy, as it disallowsthe possibility of a high scoring alignment that wanders significantly o↵ the diagonal(corresponding to a big gap in either sequence).

x1

x2

x3

.

.

.

.

.

.

xM

y1 y2 y3 ... ... ... ... yN

Constrain traceback to band of DP matrix (penalize big gaps)

Figure 7 | By constraining the traceback to a band of the matrix, we can make the running timeroughly linear (if N = M , then the running time would be O(Nk(N)), where k(N) is the char-acteristic band width. Biologically, this corresponds to disallowing long gaps. Computationally,cells which are outside of the banded region are not considered when taking the max; this meanssetting F (i, j) = �1 for |i� j| > k(N).

10

Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf

Constraint path to band of width 𝑘around diagonal

Running time: O(𝑛𝑘)

Gives a good approximation of highly identical sequences

Question: How to change recurrence to accomplish this?

Page 29: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Block Alignment

29

Divide input sequences into blocks of length 𝑡

v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Page 30: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

Block Alignment

30

Divide input sequences into blocks of length 𝑡

v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass through corners of blocks

Page 31: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

Block Alignment

31

Divide input sequences into blocks of length 𝑡

v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass through corners of blocks

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j]� �, if i > 0,

s[i, j � 1]� �, if j > 0,

s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.

0 ≤ 𝑖, 𝑗 ≤ 𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰

Page 32: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗

32

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j]� �, if i > 0,

s[i, j � 1]� �, if j > 0,

s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.

0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰

2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

t

2t

nt

t 2t nt

Page 33: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗

33

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j]� �, if i > 0,

s[i, j � 1]� �, if j > 0,

s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.

0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰

Question: How much time to compute all 𝛽(𝑖, 𝑗)?

2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

t

2t

nt

t 2t nt

Page 34: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗

34

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j]� �, if i > 0,

s[i, j � 1]� �, if j > 0,

s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.

0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰

Question: How much time to compute all 𝛽(𝑖, 𝑗)?

2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

t

2t

nt

t 2t nt

Computing 𝛽 𝑖, 𝑗 takes 𝑂 𝑡( time

There are 𝑛/𝑡 × 𝑛/𝑡 values 𝛽 𝑖, 𝑗

Total: 𝑂 )*× )*× 𝑡( = 𝑂 𝑛( time

Page 35: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Block Alignment – Four Russians Technique

35

2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

t

2t

nt

t 2t nt

Pre-compute and store all βij

Pre-compute and store all max weight alignments 𝑆[𝐯′, 𝐰′] of allpairs (𝐯", 𝐰") of length t strings

Algorithm:1. Precompute 𝑆[𝐯′, 𝐰′] where

𝐯", 𝐰" ∈ Σ*2. Compute block alignment

between 𝐯 and 𝐰 using 𝑆

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j]� �, if i > 0,

s[i, j � 1]� �, if j > 0,

s[i� 1, j � 1] + S[v(i), w(j)], if i > 0 and j > 0.

Page 36: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Block Alignment – Four Russians Technique

36

2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms

Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69

Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

t

2t

nt

t 2t nt

Pre-compute and store all βij

Pre-compute and store all max weight alignments 𝑆[𝐯′, 𝐰′] of allpairs (𝐯", 𝐰") of length t strings

Algorithm:1. Precompute 𝑆[𝐯′, 𝐰′] where

𝐯", 𝐰" ∈ Σ*2. Compute block alignment

between 𝐯 and 𝐰 using 𝑆

s[i, j] = max

8>>><

>>>:

0, if i = 0 and j = 0,

s[i� 1, j]� �, if i > 0,

s[i, j � 1]� �, if j > 0,

s[i� 1, j � 1] + S[v(i), w(j)], if i > 0 and j > 0.

Question: How to choose 𝑡 for DNA?

Page 37: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Block Alignment – Four Russians Technique

37

Question: How to choose 𝑡 for DNA?

Page 38: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Fastest Subquadratic Alignment* Algorithm

38*for edit distance

Edit distance in O(𝑛M/ log 𝑛) time

Barely subquadratic!

Want: O(𝑛MNO) time where 𝜀 > 0

Page 39: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Fastest Subquadratic Alignment* Algorithm

39*for edit distance

Edit distance in O(𝑛M/ log 𝑛) time

Barely subquadratic!

Want: O(𝑛MNO) time where 𝜀 > 0

Question: Is 𝑛MNO in O(𝑛M/ log 𝑛) for any 𝜀 > 0?

Page 40: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Hardness Result for Edit Distance [STOC 2015]

40

Page 41: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

41

Page 42: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir

Take Home Messages1. Global alignment in O(𝑚𝑛) time and O(𝑚) space• Hirschberg algorithm

2. Block alignment can be done in subquadratic time• Four Russians Technique: O(𝑛M/ log 𝑛) time

3. Global alignment cannot be done in O(𝑛#$%) time under SETH

Reading:• Jones and Pevzner. Chapters 7.1-7.4• Lecture notes

42