optimal alignments in linear space

Optimal alignments in linear space: Eugene W. Myers and Webb Miller

Upload: gauri

Post on 15-Feb-2016


DESCRIPTION

Optimal alignments in linear space. Eugene W. Myers and Webb Miller. Outline: Introduction; Gotoh's algorithm; O(N) space Gotoh's algorithm; Main algorithm; Implementation; Conclusion. Introduction: Space, not time; Hirschberg's Algorithm.

TRANSCRIPT

Introduction

Optimal alignments in linear space
Eugene W. Myers and Webb Miller

Outline
Introduction
Gotoh's algorithm
O(N) space Gotoh's algorithm
Main algorithm
Implementation
Conclusion

Introduction
Space, not time
Hirschberg's Algorithm: maximizing the similarity score of an alignment
Gotoh's Algorithm: minimizing the difference score of a conversion
Linear space version for affine gap penalties.
For a megabyte of memory:
Myers and Miller: sequences of length 62500
Altschul and Erickson: sequences of length < 1070

Transformation (1/2)
Hirschberg's algorithm scores aligned pairs (match, mismatch) with #(a, b); Gotoh's algorithm uses replacement costs w(a, b) together with gap penalties. Gap penalties: g = -Q.

Transformation (2/2)
Match = 8, Mismatch = -5, Gap Symbol = -3, Gap-open = -4

Example (1/2)
A match (score 8) has conversion cost 0.

Example (2/2)
Hirschberg's Algorithm: maximum score
Gotoh's Algorithm: minimum conversion cost C

Gotoh's algorithm (R99922005)

Some notations
A_i: the i-symbol prefix of A
B_j: the j-symbol prefix of B
C(i, j): minimum cost of a conversion of A_i to B_j

Simple gap (1/4)
gap(k) = h*k
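As a minimal sketch of what the simple gap model implies, the function below computes the minimum conversion cost when a gap of length k costs h*k, keeping only one previous row at a time. The function name, the default h = 0.5 and the match/mismatch costs (0 and 1.0) are assumptions chosen for illustration, not values stated on this slide.

```python
def linear_gap_cost(a, b, h=0.5, mismatch=1.0):
    """Minimum cost of converting a into b when gap(k) = h*k.

    Only the previous row is kept, so the extra space is O(len(b)).
    Match cost 0 and mismatch cost 1.0 are illustrative assumptions.
    """
    # Row 0: converting the empty prefix of a into b[:j] takes j insertions.
    prev = [j * h for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: converting a[:i] into the empty string takes i deletions.
        cur = [i * h]
        for j in range(1, len(b) + 1):
            w = 0.0 if a[i - 1] == b[j - 1] else mismatch
            cur.append(min(prev[j - 1] + w,   # replace a_i by b_j
                           prev[j] + h,       # delete a_i
                           cur[j - 1] + h))   # insert b_j
        prev = cur
    return prev[-1]


print(linear_gap_cost("agtac", "aag"))
```

This returns only the cost; recovering the conversion itself in linear space needs the divide-and-conquer step that the later slides describe.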

Simple gap (2/4)
[Figure: cost matrix for A = AGTAC (rows) and B = AAG (columns)]

          A    A    G
     0.0  2.5  3.0  3.5
A    2.5  0.0  2.5  3.0
G    3.0  2.5  1.0  2.5
T    3.5  3.0  3.5  2.0
A    4.0  3.5  3.0  4.5
C    4.5  4.0  4.5  4.0

Space = O(n^2)

Simple gap (3/4)
[Figure: matrix split at row m/2]

Simple gap (4/4)
Forward score and backward score
Space: O(m+n)

Affine gap (1/8)
A gap of length k: cost = g + k*h
A - - - T A A C T
C G A A T C - - T

Affine gap (2/8)
C(i, j): minimum cost of a conversion of A_i to B_j
D(i, j): minimum cost of a conversion of A_i to B_j that deletes a_i
I(i, j): minimum cost of a conversion of A_i to B_j that inserts b_j

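Affine gap (3/8)
C(i, j) = min{ D(i, j), I(i, j), C(i-1, j-1) + w(a_i, b_j) }   if i > 0 and j > 0
C(i, j) = gap(j)                                               if i = 0 and j > 0
C(i, j) = gap(i)                                               if i > 0 and j = 0
C(i, j) = 0                                                    if i = 0 and j = 0
where gap(k) = g + k*h

Affine gap (4/8)
D(i, j) = min{ D(i-1, j), C(i-1, j) + g } + h   if i > 0 and j > 0
D(i, j) = C(0, j) + g                           if i = 0 and j > 0

Affine gap (5/8)
I(i, j) = min{ I(i, j-1), C(i, j-1) + g } + h   if i > 0 and j > 0
I(i, j) = C(i, 0) + g                           if i > 0 and j = 0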
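The affine-gap matrices C, D and I can be filled in with a short quadratic-space sketch such as the one below. The function name is hypothetical, and the replacement costs (0 for a match, 1.0 for a mismatch) are an assumption inferred from the worked example with g = 2.0 and h = 0.5 rather than something the slides state. Undefined entries, shown as * on the slides, are represented by infinity.

```python
INF = float("inf")


def gotoh_matrices(a, b, g=2.0, h=0.5, mismatch=1.0):
    """Fill C, D, I for gap(k) = g + k*h using O(MN) space.

    C[i][j]: min cost of converting a[:i] to b[:j]
    D[i][j]: same, restricted to conversions that delete a_i
    I[i][j]: same, restricted to conversions that insert b_j
    """
    m, n = len(a), len(b)
    C = [[0.0] * (n + 1) for _ in range(m + 1)]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    I = [[INF] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        C[0][j] = g + j * h          # one insertion gap of length j
        D[0][j] = C[0][j] + g        # boundary value used by row 1
    for i in range(1, m + 1):
        C[i][0] = g + i * h          # one deletion gap of length i
        I[i][0] = C[i][0] + g        # boundary value used by column 1
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j], C[i - 1][j] + g) + h
            I[i][j] = min(I[i][j - 1], C[i][j - 1] + g) + h
            w = 0.0 if a[i - 1] == b[j - 1] else mismatch
            C[i][j] = min(D[i][j], I[i][j], C[i - 1][j - 1] + w)
    return C, D, I


C, D, I = gotoh_matrices("agtac", "aag")
for row in C:
    print(row)
```

With these assumed costs, the last entry of C for A = agtac and B = aag is the optimal conversion cost of the running example.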

Affine gap(6/8)

Affine gap (7/8)
[Figure: the C, D and I matrices for A = AGTAC (rows) and B = AAG (columns); * marks an undefined entry]

C         A    A    G
     0.0  2.5  3.0  3.5
A    2.5  0.0  2.5  3.0
G    3.0  2.5  1.0  2.5
T    3.5  3.0  3.5  2.0
A    4.0  3.5  3.0  4.5
C    4.5  4.0  4.5  4.0

D         A    A    G
     *    4.5  5.0  5.5
A    *    5.0  5.5  6.0
G    *    2.5  5.0  5.5
T    *    3.0  3.5  5.0
A    *    3.5  4.0  4.5
C    *    4.0  4.5  5.0

I         A    A    G
     *    *    *    *
A    4.5  5.0  2.5  3.0
G    5.0  5.5  5.0  3.5
T    5.5  6.0  5.5  6.0
A    6.0  6.5  6.0  5.5
C    6.5  7.0  6.5  7.0

Affine gap (8/8)
[Figure: the same C, D and I matrices]

O(N) space Gotoh's algorithm (R99922041)

Observation
The i-th row of C and D depends only on rows i and i-1.
The i-th row of I depends only on row i.

Linear Space
Use two one-dimensional arrays (CC and DD) and three variables.

Linear Space
Algorithm
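A cost-only pass with two arrays and a few scalars might look like the sketch below. The names CC, DD, s, c, e and t follow the labels visible on the slides; the function name and the match/mismatch costs (0 and 1.0) are assumptions for illustration.

```python
def gotoh_cost_linear_space(a, b, g=2.0, h=0.5, mismatch=1.0):
    """Minimum conversion cost of a into b with gap(k) = g + k*h.

    Uses two arrays of length len(b) + 1 (CC and DD) plus scalars.
    """
    m, n = len(a), len(b)
    CC = [0.0] * (n + 1)   # CC[j] holds C(i, j) for the current row
    DD = [0.0] * (n + 1)   # DD[j] holds D(i, j) for the current row
    t = g
    for j in range(1, n + 1):
        t += h
        CC[j] = t          # C(0, j) = gap(j)
        DD[j] = t + g      # D(0, j) = C(0, j) + g
    t = g
    for i in range(1, m + 1):
        s = CC[0]          # s = C(i-1, j-1), the diagonal neighbour
        t += h
        c = t              # c = C(i, 0) = gap(i)
        CC[0] = c
        e = t + g          # e = I(i, 0) = C(i, 0) + g
        for j in range(1, n + 1):
            e = min(e, c + g) + h                 # I(i, j)
            DD[j] = min(DD[j], CC[j] + g) + h     # D(i, j) from row i-1
            w = 0.0 if a[i - 1] == b[j - 1] else mismatch
            c = min(DD[j], e, s + w)              # C(i, j)
            s = CC[j]                             # save C(i-1, j) as next diagonal
            CC[j] = c
    return CC[n]


print(gotoh_cost_linear_space("agtac", "aag"))
```

The order of the updates inside the inner loop matters: DD[j] and s must read the old CC[j] (row i-1) before it is overwritten with the value for row i.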

[Figure: step-by-step trace of the linear-space algorithm on the C, D and I matrices for A = AGTAC and B = AAG, with g = 2.0 and h = 0.5, showing the arrays CC and DD and the variables s, c, e and t. Captured states: t = 2.0; then i = 5, t = 4.5; then i = 5, j = 1, t = 4.5.]

Optimal conversion cost.
What is the conversion of AGTAC and AAG?

Main algorithm part 1 (B95902077)

Midpoint
Hirschberg (1975): recursive divide-and-conquer
Forward Computing
Backward Computing

Gap Penalty
[Figure: cell (i, j) and its neighbours (i-1, j-1), (i, j-1) and (i-1, j)]

Gap Penalty
CC(j) = minimum cost of a conversion of A_i* to B_j
DD(j) = minimum cost of a conversion of A_i* to B_j that ends with a delete

Gap Penalty
RR(N - j) = minimum cost of a conversion of A_i*^T to B_j^T
SS(N - j) = minimum cost of a conversion of A_i*^T to B_j^T that begins with a delete
(A_i*^T and B_j^T denote the suffixes of A and B that follow the prefixes A_i* and B_j.)

Find Midpoint with Gap Penalty
Forward Computing
Backward Computing
How to compute the midpoint?

Main algorithm part 2 (R99922035)

Midpoint
The problem with calculating the midpoint is that when we concatenate the conversions of the two halves into one, two gaps may coalesce into one, and the gap-open penalty g would then be charged twice.

This means that we may consider min { CC + RR, DD + SS - g, II + JJ - g }.

Midpoint
Recall that the algorithm above saves the space of II and JJ.

We can reduce this to min { CC + RR, DD + SS - g }.

Midpoint
Remember that we should find min over j in [0, N] of { min { CC + RR, DD + SS - g, II + JJ - g } }.
[Figure: midpoint on row i*, between columns j and j+1]

Midpoint
Type 1 recurrence
Type 2 recurrence

[Figure: midpoint (i*, j*) for the two recurrence types]

Example
A = agtac, B = aag, i* = 2
agtac
a__ag

Recursive call on (a, a) and (ac, ag)
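The midpoint search for this example can be sketched as follows. It covers only the top-level call, where both ends of the problem pay the full gap-open penalty g; the function names, the match/mismatch costs (0 and 1.0) and the values g = 2.0, h = 0.5 are assumptions carried over from the worked example. The backward vectors RR and SS are obtained by running the same forward pass on the reversed suffix of A and the reversed B.

```python
def forward_vectors(a, b, g, h, mismatch):
    """Return (CC, DD) for the last row of a cost-only pass of a against b."""
    n = len(b)
    CC = [0.0] * (n + 1)
    DD = [0.0] * (n + 1)
    t = g
    for j in range(1, n + 1):
        t += h
        CC[j] = t
        DD[j] = t + g
    t = g
    for i in range(1, len(a) + 1):
        s = CC[0]
        t += h
        c = t
        CC[0] = c
        e = t + g
        for j in range(1, n + 1):
            e = min(e, c + g) + h
            DD[j] = min(DD[j], CC[j] + g) + h
            w = 0.0 if a[i - 1] == b[j - 1] else mismatch
            c = min(DD[j], e, s + w)
            s = CC[j]
            CC[j] = c
    DD[0] = CC[0]   # converting to the empty string is all deletes
    return CC, DD


def find_midpoint(a, b, g=2.0, h=0.5, mismatch=1.0):
    """Return (i_star, j_star, kind, cost); kind is 1 or 2."""
    i_star, n = len(a) // 2, len(b)
    CC, DD = forward_vectors(a[:i_star], b, g, h, mismatch)
    # Reversing both strings turns "begins with a delete" into "ends with a delete".
    RR, SS = forward_vectors(a[i_star:][::-1], b[::-1], g, h, mismatch)
    best = None
    for j in range(n + 1):
        candidates = ((1, CC[j] + RR[n - j]),        # type 1: no gap spans row i*
                      (2, DD[j] + SS[n - j] - g))    # type 2: one deletion gap spans it
        for kind, cost in candidates:
            if best is None or cost < best[3]:
                best = (i_star, j, kind, cost)
    return best


print(find_midpoint("agtac", "aag"))
```

For A = agtac and B = aag the search selects a type 2 midpoint, which corresponds to the deletion gap that spans row i* in the alignment shown in the example.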

Implementation (R99922062)

Implementation
Storage Requirement
Memory vs. Sequence length
Compared with classic dynamic programming algorithm
Linear space algorithm -> space not time

Storage Requirement (1/4)
Vectors: CC, DD, RR, and SS
Space: 4N words
M + N words for an optimal conversion

M = N = 38

Storage Requirement (2/4)
16384 words for the table (w): replacement costs, 128*128
[Table: w indexed by ASCII[1] ... ASCII[128] on both axes, with entries W1,1 ... W128,128]

Storage Requirement (3/4)
16 words for the table (w): replacement costs, 4*4

     A        T        C        G
A    W(A,A)   W(A,T)   W(A,C)   W(A,G)
T    W(T,A)   W(T,T)   W(T,C)   W(T,G)
C    W(C,A)   W(C,T)   W(C,C)   W(C,G)
G    W(G,A)   W(G,T)   W(G,C)   W(G,G)

Storage Requirement (4/4)
M + N bytes for the sequences A and B.
A and B could be compressed.
DNA sequences: only 2(M + N) bits are necessary.
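The 2(M + N)-bit figure follows from giving each of the four bases a 2-bit code. The sketch below shows one way to pack and unpack a DNA string; the particular code assignment and the function names are arbitrary choices for illustration.

```python
# Arbitrary 2-bit code for the four bases (an illustrative assumption).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_dna(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for k, base in enumerate(seq):
        packed[k // 4] |= CODE[base] << (2 * (k % 4))
    return bytes(packed)


def unpack_dna(packed, length):
    """Recover the first `length` bases from a packed buffer."""
    return "".join(BASES[(packed[k // 4] >> (2 * (k % 4))) & 3]
                   for k in range(length))


packed = pack_dna("AGTAC")
print(len(packed), unpack_dna(packed, 5))
```

The sequence length has to be stored separately, because the last byte may be only partly used.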

Compress -> Huffman code

Memory vs. Sequence length
Maximum length of sequences that can be aligned in a given amount of memory
Altschul and Erickson: 7MN-bit approach

Memory (bytes)   Linear Space (w/o op.)   Linear Space (with op.)   Altschul and Erickson
64K              4000                     2666                      270
128K             8000                     5333                      382
256K             16000                    10666                     540
1000K            62500                    41666                     1069
Formula          N = Memory / (4*4)       N = Memory / (6*4)        N = sqrt(Memory * 8 / 7)
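The table entries can be reproduced from the three formulas with a few lines of arithmetic. The sketch assumes 4-byte words, M = N, and K = 1000 bytes, which is what the tabulated numbers imply; the function name is hypothetical.

```python
import math


def max_lengths(memory_bytes, word_bytes=4):
    """Largest N (with M = N) that fits in memory_bytes under each scheme."""
    without_op = memory_bytes // (4 * word_bytes)   # 4N words: CC, DD, RR, SS
    with_op = memory_bytes // (6 * word_bytes)      # plus M + N = 2N words
    # 7MN bits with M = N: 7 * N^2 / 8 bytes <= memory_bytes
    altschul_erickson = math.isqrt(memory_bytes * 8 // 7)
    return without_op, with_op, altschul_erickson


for kilobytes in (64, 128, 256, 1000):
    print(kilobytes, max_lengths(kilobytes * 1000))
```

The linear-space columns grow in proportion to memory, while the 7MN-bit column grows only with its square root.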

Compared with classic dynamic programming algorithm
Classic dynamic programming algorithm (Wagner and Fischer, 1974).

Compared with classic dynamic programming algorithm
Space:
classic dynamic programming algorithm: O(MN)
linear-space algorithm: O(N + lg M)
Time: both O(MN)
But in practice, the linear-space algorithm is slower than the classic dynamic programming algorithm.
linear-space : classic DP = 2.84 : 1

Conclusion (R99945020)

Reduce problem
[Figure: score matrix for the sequences CGGATCAT and CTTAACT]

Reduce problem (cont.)

Reduce problem (cont.)
[Figure: matrix split at row m/2 by the partition line]