methods to chain local alignments sparse dynamic programming o(n log n)
Post on 19-Dec-2015
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
Saving cells in DP
1. Find local alignments
2. Chain: O(N log N), via L.I.S.
3. Restricted DP
The Problem: Find a Chain of Local Alignments
(x, y) → (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
A similar problem: LCS
[Figure: grid of x = 15 3 24 16 20 4 24 3 11 18 against y = 4 20 24 3 11 15 11 4 18 20]
• Given two strings x and y, find the longest common subsequence
• Imagine a sparse scenario, where x and y have few matches
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
w = w1, …, wm
• Find the longest increasing subsequence
s = s1, …, sk
s1 < s2 < … < sk
Sparse Dynamic Programming – L.I.S.
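Let input be w: w1, …, wn
INITIALIZATION:
  L: 1-indexed array, L[1] ← w1
  B: 0-indexed array of backpointers; B[0] = 0, B[1] ← 1
  P: array used for traceback; P[1] ← 0
  // L[j]: smallest last element wi of a j-long increasing subsequence seen so far
  // Treat L[0] as −∞ and L[j+1] as +∞ when no (j+1)-long subsequence exists yet
ALGORITHM:
  for i = 2 to n {
    Find j such that L[j] < w[i] ≤ L[j+1]
    L[j+1] ← w[i]
    B[j+1] ← i
    P[i] ← B[j]
  }
That’s it!!!
• Running time? O(n log n): L stays sorted, so each j is found by binary search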
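The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the names `tails`, `tail_idx` and `prev` are assumed stand-ins for the slide's L, B and P, indices are 0-based, and the standard `bisect_left` performs the search for j.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []              # tails[j]: smallest last element of an increasing run of length j+1 (slide's L)
    tail_idx = []           # tail_idx[j]: index in w of tails[j] (slide's B)
    prev = [-1] * len(w)    # prev[i]: index of the element preceding w[i] (slide's P)
    for i, value in enumerate(w):
        j = bisect_left(tails, value)       # first position with tails[j] >= value
        if j == len(tails):                 # value extends the longest run found so far
            tails.append(value)
            tail_idx.append(i)
        else:                               # value is a smaller tail for runs of length j+1
            tails[j] = value
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else -1
    # Trace back from the tail of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]
```

Each element costs one binary search, so the loop runs in O(n log n) overall.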
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point (i, j), is inserted into w as follows:
• For each column j, from smallest to largest, insert in w the points (i, j), in decreasing row i order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y’, x’) can be chained iff
a is before b in w, and y < y’
[Figure: the match points, numbered 1 to 11, on the grid of x = 15 3 24 16 20 4 24 3 11 18 against y = 4 20 24 3 11 15 11 4 18 20]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
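A possible sketch of the reduction, assuming 1-based positions and the (y, x) point convention used above. The function name `sparse_lcs` and its helper variables are hypothetical. It finds the match points directly from the two sequences, orders them column by column with rows decreasing, and runs the LIS step on the row coordinate only.

```python
from bisect import bisect_left
from collections import defaultdict

def sparse_lcs(x, y):
    """Return one longest common subsequence of x and y via LIS over match points."""
    rows_of = defaultdict(list)             # symbol -> rows i (1-based) with y[i] == symbol
    for i, symbol in enumerate(y, start=1):
        rows_of[symbol].append(i)
    # Build w: columns j ascending; within a column, rows i descending,
    # so two points in the same column can never be chained.
    w = [(i, j)
         for j, symbol in enumerate(x, start=1)
         for i in reversed(rows_of.get(symbol, []))]
    tails, tail_idx, prev = [], [], [-1] * len(w)
    for k, (i, _j) in enumerate(w):
        pos = bisect_left(tails, i)         # compare on the row coordinate only
        if pos == len(tails):
            tails.append(i)
            tail_idx.append(k)
        else:
            tails[pos] = i
            tail_idx[pos] = k
        prev[k] = tail_idx[pos - 1] if pos > 0 else -1
    chain = []
    k = tail_idx[-1] if tail_idx else -1
    while k != -1:                          # trace back through the chain of points
        chain.append(w[k])
        k = prev[k]
    chain.reverse()
    return [y[i - 1] for i, _j in chain]
```

The work is proportional to the number of match points times a logarithmic factor, which is what makes the approach attractive when matches are sparse.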
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L =
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence:
s = 4, 24, 3, 11, 18
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj: top to bottom
L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked on the y-axis]
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• Let a and b be rectangles with lb < la. To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless”: remove it
• In L, keep rectangles j sorted by increasing lj; after pruning, L is also sorted by increasing V(j)
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of i:
  a. j: rectangle in L, with largest lj < hi
  b. V(i) = w(i) + V(j) (take V(j) = 0 if no such j exists)
2. When on the rightmost end of i:
  a. k: rectangle in L, with largest lk ≤ li
  b. If V(i) > V(k):
    i. INSERT (li, V(i), i) in L
    ii. REMOVE all other (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
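The sweep can be sketched in Python as below, under several assumptions not stated on the slides: each rectangle is a tuple (x_left, x_right, h, l, weight) with h ≤ l, weights are positive, left ends are processed before right ends at equal x, and the comparisons are the strict and non-strict forms shown in the code. The name `chain_rectangles` is hypothetical. L is kept as parallel sorted Python lists rather than a balanced binary tree, so searches are O(log N) but insertions and deletions are not; a tree-based container would be needed for the full O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight), with h <= l and weight > 0.
    Returns (best total weight, indices of the rectangles in the best chain)."""
    if not rects:
        return 0, []
    events = []
    for i, (x_left, x_right, _h, _l, _weight) in enumerate(rects):
        events.append((x_left, 0, i))    # kind 0: left end, sorts before right ends at equal x
        events.append((x_right, 1, i))   # kind 1: right end
    events.sort()

    ls, vs, ids = [], [], []             # L as parallel lists, sorted by l (and by V)
    V = [0] * len(rects)
    pred = [-1] * len(rects)
    for _x, kind, i in events:
        _x_left, _x_right, h, l, weight = rects[i]
        if kind == 0:
            pos = bisect_left(ls, h)             # entries before pos have l_j < h_i
            V[i] = weight + (vs[pos - 1] if pos else 0)
            pred[i] = ids[pos - 1] if pos else -1
        else:
            pos = bisect_right(ls, l)            # entries before pos have l_k <= l_i
            if pos == 0 or V[i] > vs[pos - 1]:
                start = bisect_left(ls, l)       # first entry with l_j >= l_i
                end = start
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1                     # dominated entries are consecutive
                ls[start:end] = [l]              # remove them and insert i in one step
                vs[start:end] = [V[i]]
                ids[start:end] = [i]

    best = max(range(len(rects)), key=lambda i: V[i])
    chain = []
    while best != -1:                            # follow predecessors back to the chain start
        chain.append(best)
        best = pred[best]
    chain.reverse()
    return V[chain[-1]], chain
```

For example, `chain_rectangles([(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3), (6, 8, 1, 2, 4), (6, 9, 9, 10, 2)])` chains rectangles 0, 1 and 4 for a total weight of 13.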
Example
[Figure: five rectangles in the x-y plane, labeled “1: 5”, “2: 6”, “3: 3”, “4: 4”, “5: 2”; tick values 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Multiple Sequence Alignments
The Global Alignment problem
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
[Figure: axes labeled x, y, z]
Definition
• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that
• All sequences have the same length L
• Score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments can help improve the pairwise alignments
Scoring Function
• Ideally: Find alignment that maximizes probability that sequences evolved
from common ancestor, according to some phylogenetic model
• More on phylogenetic models later
[Figure: tree relating sequences x, y, z, w, v, with an unknown node marked “?”]
Scoring Function
• A comprehensive model would have too many parameters, too inefficient to optimize
• Possible simplifications
Ignore phylogenetic tree
Statistically independent columns:
S(m) = Σi S(mi)
m: multiple alignment, mi are columns
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
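Inducing a pairwise alignment amounts to keeping two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch, with `induced_pairwise` as an assumed helper name and "-" as the gap symbol:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment, dropping columns where both have a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
xy = induced_pairwise(x, y)   # column 3 is a gap in both rows and is dropped
xz = induced_pairwise(x, z)   # no shared gap columns, so nothing is dropped
yz = induced_pairwise(y, z)   # column 6 is a gap in both rows and is dropped
```

For x and y this gives ACGCGG-C over ACGC-GAG.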
Sum Of Pairs (cont’d)
• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
S(m) = Σk<l s(mk, ml)
s(mk, ml): score of induced alignment (k,l)
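As an illustration, the sum-of-pairs score can be computed by scoring every pair of rows column by column and skipping columns where both rows have a gap. The scoring values below (match +1, mismatch −1, gap −2, no affine gap penalty) are hypothetical and chosen only for the example; the slides do not specify a pairwise scoring scheme.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2   # hypothetical scores, not taken from the slides

def pair_score(row_a, row_b, gap="-"):
    """Score the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue                  # column absent from the induced alignment
        if a == gap or b == gap:
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total

def sum_of_pairs(rows):
    """S(m): sum of induced pairwise scores over all pairs k < l."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

alignment = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(alignment))   # 0 + (-4) + 3 = -1 under the assumed scores
```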
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
[Figure: tree with leaves Human, Mouse, Chicken, Duck]
• Weighted SOP:
S(m) = Σk<l wkl s(mk, ml)
wkl: weight decreasing with distance
Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Find optimal consensus string m* to maximize
S(m) = Σi s(m*, mi)
s(m*, mi): score of the pairwise alignment of m* with mi
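One simple special case can be sketched directly. Assume the rows are already aligned and s(m*, mi) just counts the columns where m* and mi carry the same symbol (an assumed scoring, not the slides'). Under that assumption the sum is maximized column by column by the most frequent symbol. The function names are hypothetical, and ties are broken by first occurrence.

```python
from collections import Counter

def plurality_consensus(rows):
    """Pick the most frequent symbol (gaps included) in every column."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))

def consensus_score(consensus, rows):
    """Sum over rows of the number of columns agreeing with the consensus."""
    return sum(c == r for row in rows for c, r in zip(consensus, row))

rows = ["ACGT", "ACGA", "TCGA"]
m_star = plurality_consensus(rows)       # "ACGA"
score = consensus_score(m_star, rows)    # 3 + 4 + 3 = 10
```

With a richer substitution or gap model the per-column choice would instead maximize the summed substitution scores, and the plurality rule would no longer be guaranteed optimal.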
A Profile Representation
• Given a multiple alignment M = m1…mn
Replace each column mi with profile entry pi
• Frequency of each letter
• # gaps
• Optional: # gap openings, extensions, closings
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G
Profile (nonzero entries of each row, left to right):
A (frequency)        1 1 .8
C (frequency)        .6 1 .4 1 .6 .2
G (frequency)        1 .2 .2 .4 1
T (frequency)        .2 1 .6 .2
O (gap openings)     .2 .8 .4 .4
E (gap extensions)   .4
C (gap closings)     .2 .8 .4 .2
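The letter frequencies and gap counts of a profile can be computed column by column. The sketch below is illustrative only: `build_profile` is a hypothetical name, the alphabet is assumed to be A, C, G, T with "-" as the gap symbol, and the optional gap opening, extension and closing statistics are left out.

```python
from collections import Counter

def build_profile(rows, alphabet="ACGT", gap="-"):
    """Return one dict per column: letter frequencies plus the raw gap count."""
    n = len(rows)
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        entry = {letter: counts[letter] / n for letter in alphabet}
        entry["gaps"] = counts[gap]
        profile.append(entry)
    return profile

alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
profile = build_profile(alignment)
# First column: C occurs in 3 of 5 rows, T in 1 of 5, and one row has a gap.
```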