Introduction to a suffix array construction method for data larger than memory

Introduction to a Suffix Tree Construction Method for Data Larger than Memory, 2014-08-29, 星野 喬 (Takashi Hoshino)

Upload: takashi-hoshino

Post on 22-Jun-2015


Category:

Software



DESCRIPTION

Introduction to the paper "Suffix trees for inputs larger than main memory", Marina Barsky, Ulrike Stege and Alex Thomo, Information Systems, Volume 36, Issue 3, May 2011

TRANSCRIPT

  • 1. Introduction to a Suffix Tree construction method for data larger than memory, 2014-08-29

2. Suffix trees for inputs larger than main memory
  Marina Barsky, Ulrike Stege and Alex Thomo
  Information Systems, Volume 36, Issue 3, May 2011
  CIKM 2009 short paper

3. [...] suffix tree [...]

4. Suffix tree
  [...] suffix [...] suffixes [...] prefix [...]
  Paper excerpt (cropped): [...] most N - 1 internal nodes in the tree. Hence, the maximum number [...] linear in N. The tree's total space is linear in N in the case that each [...] stored in a constant space. Fortunately, this is the case for an implicit [...] substrings by their positions.
  [Figure: suffix tree for X = ababc (positions 0 to 4); edge labels are shown both as substrings and as position ranges such as 0-1, 2-4 and 4-4.]
  Caption (cropped): [...] = ababc. For clarity, the explicit edge labels are shown, which are represented as [...] in the actual suffix tree. Each suffix Si can be found by concatenating substrings [...]

5. B2ST: Big tree, Big string
  Suffix Tree construction algorithm in 3 steps:
  (1) Input partitioning
  (2) Pair-wise sorting
  (3) Merging

6. Notation
  X: input string (length N)
  Xij: substring X[i] ... X[j-1]
  Si: suffix starting at position i (0 <= i < N)
  LCPij: longest common prefix of Si and Sj
  |LCPij|: length of LCPij
  SA: suffix array
  ST: suffix tree

7. Input partitioning
  X [...] k [...] Xu (0 <= u < k), k = 2N/M (M: memory size)
  Xu [...] tail [...] Xu [...] suffix sorting [...] tail [...]
  Paper excerpt (cropped): [...] partition, and its positions are not included into the suffix array of the partition. This ensures [...] suffix Si of a partition Xu (0 <= i < |Xu|, 0 <= u < k) will be sorted as representative of a suffix S(up+i) of X. In practice, for real-life DNA sequences, the tail is negligibly small compared to the size of the partition itself (it never exceeded 1000 characters in our experiments with DNA databases).
  Consider the example in Figure 2. It shows the partitioning of input string ababaaabbabbbabaabab into four partitions. The main memory can hold up to 16 [...] the input at a time. To facilitate the example, we represent our input as [...] (a stands for 0-bit, b stands for 1-bit). In this illustration we also refer to the [...] as A, B, C, and D in order to distinguish them from the numbers representing positions.
  Note that the tail of partition XB is substring bbb, which never [...] = aabba.

  Partition A   Partition B   Partition C   Partition D
  a b a b a     a a b b a     b b b a b     a a b a b
  1 2 3 4 5     1 2 3 4 5     1 2 3 4 5     1 2 3 4 5
  [Fig. 2.] Input string X = ababaaabbabbbabaabab and its four partitions. The tail of partition B is [...]

8. Pair-wise sorting
  XuXv [...] SA [...] LCP [...] (SAuv)
  XuXv: Xu and Xv concatenated (u < v)
  [...] k [...] SA [...] suffix [...] LCP [...]
  SAu: SA of Xu
  Ruv: [...] SAuv [...] LCP [...] u/v bit

9. [...]
  SAu size: 4N/k + [...] (N/k < 2^32)
  Ruv size: 8N/k + [...]
  SAu: k [...], Ruv: k(k-1)/2 [...]
  [...] 4kN + [...]: O(kN)
  Paper excerpt (cropped): In this step we generate suffix arrays for each pair of partitions. The pseudocode for this step is shown in Figure 3. We concatenate every possible pair u, v of partitions with their tails (0 <= u < k - 1, u + 1 <= v < k, u < v) into string XuXv. We load this input into the main memory and build the suffix array SAuv with attached LCP length information for each suffix. The suffixes which start in tail positions are excluded from the output suffix array, they serve only for determining the relative order of suffixes starting at the end of each partition. An LCP entry of SAuv is the length of the longest common prefix of each suffix in SAuv with its immediate predecessor. Figure 4 shows such an array for the pair A, B of partitions for the same input string as in Figure 2. From each SAuv, we extract [...]

  SAAB (in memory)
  suffix start    5 1 3 1 2 5 4 2 4 3
  LCP             0 2 1 3 2 3 0 2 3 1
  partition bit   A B A A B B A A B B

  SAA (written to disk)
  suffix start    5 3 1 4 2

  RAB (written to disk)
  LCP             0 2 1 3 2 3 0 2 3 1
  partition bit   A B A A B B A A B B
  Fig. 4. Suffix array SAAB with LCP information for a pair of partitions A and B. Two structures are extracted from SAAB: (1) the suffix array of partition A and (2) the order array RAB storing the relative order of suffixes in A and B. These two structures are written to disk.

10. Pair-wise sorting pseudo code
  Step 2: Suffix sorting in partition pairs

  Algorithm pairwiseSorting
  input: k partitions of string X
    for (u=0; u<k-1; u++)
      for (v=u+1; v<k; v++)
        concatenate XuXv and load into RAM
        build suffix array with LCP SAuv
        during sequential scan of SAuv
          if v==k-1 //last chunk
            output to disk SAv
          output to disk SAu
          output to disk Ruv //order array
  Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs.

11. Merging
  [...] 2 [...] SAu [...] Ruv [...] SA [...] SA [...] ST [...]
  SA_buf_u: buffer for SAu (size m, k buffers)
  R_buf_uv: buffer for Ruv (size 2m/k, k(k-1)/2 buffers)
  ST_buf: buffer for ST (size M-km)

12. [...] SA [...]
  [Figure: the buffers SAu ... and Ruv ... feed a Heap, which outputs the SA with LCP.]
  Heap: [...] SAu (0 <= u < k) [...] suffix [...] u [...]

13. Merge algorithm
  Paper excerpt (cropped): What happens with each suffix in the output buffer is the subject [...] pseudocode of merge is shown in Figure 8.

  Algorithm merge
    lastTransferred = null
    while heap is not empty
      remove smallest suffix Si of partition u from the top of the heap
      rebalance heap
      lcp = 0
      if lastTransferred is not null
        v = lastTransferred.partitionID
        lcp = LCP from R_buf_uv[current_pointer]
      create leaf for Si using lcp in ST_buf
      advancePointers (u)
      lastTransferred = Si
      if ST_buf is full
        store Si (max suffix) as a pointer to the current tree
        write ST_buf to disk
        lastTransferred = null
      Sj = get next suffix from SA_buf_u
      if Sj is not null
        insert Sj into heap
  Fig. 8. The general pseudocode for merge.

14. InitialMerge algorithm
  Paper excerpt (cropped): [...] information. Since each Ruv contains an information only about two partitions, we need to use one bit to represent the partition ID in Ruv. Specifically, we use 0 for u and 1 for v (u < v). Figure 4 shows SAA and RAB extracted from SAAB for partition pair [...]
  At the end of this step we have on disk suffix arrays for partitions [...] (of total size [...]) plus k(k-1)/2 order arrays for each possible pair of partitions (of total size kN). This is all the information we need to efficiently perform the merge. As a result of merge we produce the suffix tree for the entire input string X. We are doing this without loading the entire input string into main memory. In fact, we never access X anymore.
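The partitioning and pair-wise sorting steps can be replayed on the 20-character example. The following is only a minimal in-memory sketch, not the authors' implementation: the function names are made up for illustration, the tail rule (the shortest prefix of the following text that does not occur inside the partition) is an assumption read off the "bbb" remark on slide 7, and plain string sorting stands in for a real suffix array construction.

```python
def split_with_tails(X, p):
    """Split X into partitions of length p and attach a tail to each one."""
    parts = []
    for start in range(0, len(X), p):
        part, rest = X[start:start + p], X[start + p:]
        n = 1
        # Assumed tail rule: grow the prefix of the following text until it
        # no longer occurs inside the partition (or the text runs out).
        while n <= len(rest) and rest[:n] in part:
            n += 1
        parts.append((part, rest[:n]))
    return parts


def lcp_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def pairwise_sort(parts, u, v):
    """Return SAuv as (start, LCP, partition bit) triples; bit 0 = u, 1 = v."""
    suffixes = []
    for bit, w in enumerate((u, v)):
        part, tail = parts[w]
        text = part + tail
        # Suffixes that start inside the tail are left out on purpose.
        # Start positions are 1-based inside the partition, as in Fig. 4.
        suffixes += [(text[i:], i + 1, bit) for i in range(len(part))]
    suffixes.sort()  # naive stand-in for suffix array construction
    sa_uv, prev = [], None
    for suffix, start, bit in suffixes:
        lcp = 0 if prev is None else lcp_len(prev, suffix)
        sa_uv.append((start, lcp, bit))
        prev = suffix
    return sa_uv


X = "ababaaabbabbbabaabab"
parts = split_with_tails(X, 5)                          # partitions A, B, C, D
sa_ab = pairwise_sort(parts, 0, 1)                      # pair (A, B)
sa_a = [start for start, _, bit in sa_ab if bit == 0]   # SAA
r_ab = [(lcp, bit) for _, lcp, bit in sa_ab]            # RAB
print(parts[1])
print(sa_ab)
```

Only `sa_a` and `r_ab` would be written to disk in the scheme described on the slides; the triples in `sa_ab` exist only while the pair is in memory.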
  Algorithm initializeMerge
    for each SA_buf_u
      read first m start positions from disk suffix array SAu
    for each R_buf_uv
      read first m/k LCP+partitionBit from Ruv
    for each SA_buf_u
      insert SA_buf_u[0] into heap
  Fig. 5. The pseudocode for buffer allocation as the initial step for merge.

15. CompareSuffix algorithm

  Algorithm compareSuffix (Si from partition u, Sj from partition v)
    if (u == v)
      return -1 //Si < Sj, since they are sorted in increasing order inside each partition
    if (u < v)
      if (partitionBit in R_buf_uv[current pointer] == 0)
        return -1 //Si < Sj
      else
        return 1 //Si > Sj
    if (u > v)
      if (partitionBit in R_buf_vu[current pointer] == 0)
        return 1 //Sj < Si
      else
        return -1 //Sj > Si
  Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays created during pairwise suffix sorting.
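The comparison rule above is enough to drive a k-way merge once every order-array pointer is kept in step with the suffixes already output. The sketch below is a simplified illustration under stated assumptions, not the paper's code: `build_inputs` and `merge` are hypothetical names, the order arrays are produced by sorting whole suffixes of X in memory (which the real algorithm avoids), buffers and disk I/O are left out, and a linear scan over the k heads replaces the heap.

```python
from itertools import combinations


def lcp_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def build_inputs(X, p):
    """Stand-in for pair-wise sorting: one SA per partition, one R per pair."""
    k = (len(X) + p - 1) // p
    key = lambda i: X[i:]  # assumption: whole suffixes can be compared in memory
    SA = [sorted(range(u * p, min((u + 1) * p, len(X))), key=key) for u in range(k)]
    R = {}
    for u, v in combinations(range(k), 2):
        order = sorted(SA[u] + SA[v], key=key)
        R[u, v] = [(lcp_len(X[order[n - 1]:], X[i:]) if n else 0,
                    0 if i // p == u else 1)
                   for n, i in enumerate(order)]
    return SA, R


def merge(SA, R):
    """Merge all partitions into one suffix array with LCP values."""
    k = len(SA)
    sa_ptr = [0] * k
    r_ptr = dict.fromkeys(R, 0)

    def smaller(u, v):
        """compareSuffix: is the current head of partition u below that of v?"""
        if u < v:
            return R[u, v][r_ptr[u, v]][1] == 0
        return R[v, u][r_ptr[v, u]][1] == 1

    out, last = [], None
    while True:
        heads = [u for u in range(k) if sa_ptr[u] < len(SA[u])]
        if not heads:
            return out
        u = heads[0]
        for v in heads[1:]:  # linear scan instead of a heap, for brevity
            if smaller(v, u):
                u = v
        lcp = 0
        if last is not None and k > 1:
            # Use the order array shared with the previous partition; if the
            # previous suffix came from u itself, any other partition works.
            w = last if last != u else (u + 1) % k
            pair = (min(u, w), max(u, w))
            lcp = R[pair][r_ptr[pair]][0]
        out.append((SA[u][sa_ptr[u]], lcp))
        sa_ptr[u] += 1  # advance the suffix array pointer of partition u
        for pair in r_ptr:  # and every order array that involves u
            if u in pair:
                r_ptr[pair] += 1
        last = u


X = "ababaaabbabbbabaabab"
SA, R = build_inputs(X, 5)
merged = merge(SA, R)
print(merged[:5])
```

The merge loop itself never looks at X: every comparison and every LCP value comes from the per-partition suffix arrays and the pairwise order arrays, which is the property the slides emphasize.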
16. AdvancePointers algorithm
  Paper excerpt (cropped): [...] pointers for all the order array buffers containing information about partition u are also advanced by 1, as shown in pseudocode of Figure 7. This means we have determined the order of the current suffix of partition u, and we need to consider the next element both in SA_buf_u and in all relevant order buffers. We insert into the heap the next suffix of partition u, and we continue in a similar way until all suffixes are merged.

  Algorithm advancePointers (partition ID u)
    SA_buf_u.current_pointer++
    if reached the end of SA_buf_u
      refill SA_buf_u from disk-based SAu
    for (i=0; i<u; i++)
      R_buf_iu.current_pointer++
      if reached the end of R_buf_iu
        refill R_buf_iu from disk-based Riu
    for (i=u+1; i<k; i++)
      R_buf_ui.current_pointer++
      if reached the end of R_buf_ui
        refill R_buf_ui from disk-based Rui
  Fig. 7. Pseudocode of current pointer advancing in suffix array buffer and the order buffers.

17. ST [...]
  X: ababc
  SA: 0 2 1 3 4
  binary alphabet: a: 00, b: 01, c: 10
  ST [...] binary alphabet [...] 2 [...]

18. ST with binary alphabet
  [Figure: ST with alphabet (edge labels over a, b, c) next to ST with binary alphabet (edge labels as bit strings); leaves S0, S2, S1, S3, S4.]
  index [...] LCP [...] bit [...] byte ([...])

19. [...] SA [...] internal node [...] LCP [...]
  [Figure: three insertion cases with LCP label L1, marking the inserted internal node and the inserted leaf node.]
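The binary alphabet of slide 17 turns every character into two bits, so suffix indexes and LCP values can be counted in bits. A small sketch of that encoding follows; the helper names are illustrative only, and the 2-bit code table is the one given on the slide.

```python
CODE = {"a": "00", "b": "01", "c": "10"}  # binary alphabet from slide 17
BITS = 2                                   # bits per character


def to_bits(text):
    """Encode a string over {a, b, c} as a bit string."""
    return "".join(CODE[ch] for ch in text)


def bit_level_entries(X, SA):
    """For each suffix in SA order: (suffix number, bit index, bit-level LCP)."""
    Xb = to_bits(X)
    entries, prev = [], None
    for i in SA:
        cur = Xb[i * BITS:]
        lcp = 0
        if prev is not None:
            # Count matching leading bits against the previous suffix.
            while lcp < min(len(prev), len(cur)) and prev[lcp] == cur[lcp]:
                lcp += 1
        entries.append((i, i * BITS, lcp))
        prev = cur
    return entries


X = "ababc"
SA = sorted(range(len(X)), key=lambda i: X[i:])  # naive suffix array
entries = bit_level_entries(X, SA)
for suffix, index, lcp in entries:
    print(f"S{suffix}: I{index}, L{lcp}")
```

Each printed line uses the Ix / Lx notation of the slides: x is a bit offset into Xb for I, and a number of shared leading bits with the previously listed suffix for L.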
20. Insert s0 then s2
  X: ababc
  Xb: 0001000110
  SA: 0 2 1 3 4
  Ix: index is x (in unit of bit)
  Lx: LCP is x (in unit of bit)
  Insert s0 (0001000110, I0, L0)
  Insert s2 (000110, I4, L4)
  [Figure: tree after inserting S0, then after inserting S2; node labels I0, L4, I0+4, I4+4.]

21. Insert s1
  Insert s2 (000110, I4, L4)
  Insert s1 (01000110, I2, L1)
  [Figure: tree before and after inserting S1; node labels I0, L4, I4, I8, then I0+1, L4-1, I0, L1, I2+1.]

22. Insert s3
  Insert s1 (01000110, I2, L1)
  Insert s3 (0110, I6, L2)
  [Figure: tree before and after inserting S3; node labels I1, L3, I4, I8, I0, L1, I3+1, then I3, L2-1, I6+2.]

23. Insert s4
  Insert s3 (0110, I6, L2)
  Insert s4 (10, I8, L0)
  [Figure: tree before and after inserting S4; node labels I1, L3, I4, I8, I0, L1, I4, I3, L1, I8, then I0+0, L1-0, I0, L0, I8+0.]

24. [...] Index [...] LCP [...]
  [Figure: final tree with leaves S0, S2, S1, S3, S4; labels I0+1+3, I4+1+3, I2+1+1, I6+1+1, I8.]
  Xb:  0001000110
  idx: 0123456789

25. ST [...] 3 [...]
  Internal node array
  Leaf node array
  [...] suffix [...] prefix ([...]): divider [...] divider
  Internal node structure: Left child (4 bytes), Right child (4 bytes), Index (4 bytes), partition id (1 byte)
  Leaf node structure: Index (4 bytes), partition id (1 byte)

26. [...]
  TDD and ST-merge (2005): [...] 6MB [...] 20MB [...] 8 [...]
  Trellis+SB (2008): [...] 3GB [...] 512MB [...] 11 [...] 95% [...]
  DiGeST (2008): [...] 98% [...] merge [...]

27. [...]
  Program       Time, hours
  TDD           125
  Trellis+SB    11
  B2ST          3
  3GB DNA (1 [...] 1 byte), 600MB [...]
  Paper excerpt (cropped): [...] TDD which works with uncompressed inputs. The results [...] of different suffix tree construction algorithms for approximately [...] which is larger than the total allocated main memory. [...] suffix tree for the above 3GB input in 125 hours. [...] of Trellis+SB reported in [20]. The value in Figure [...] similar settings on a comparable machine. [...] divided the 3GB into partitions of 1GB each and built [...]

28. [...]
  Paper excerpt (cropped): [...] produced an intermediate on-disk output of size 234GB.
  Despite this, the merge [...] in only 59 minutes, scanning all this on-disk data in sequential manner [...] 2514 suffix tree files of total 215GB.

  Input size, GB   Number of partitions   Number of partition pairs   Pairwise sorting, min   Merge, min   Total time, min
  6                3                      3                           441                     27           468
  8                4                      6                           730                     34           764
  10               5                      10                          1100                    42           1142
  12               6                      15                          1462                    59           1521
  [...] Running time (min) of B2ST for different sets (approximately 6, 8, 10 and 12GB) of genomic [...] 2GB of main memory.

  DNA [...] 2GB [...]
  Pairwise sorting [...] k^2 [...]
  Merge [...]

  Paper excerpt (cropped): [...] example shows that we need a large temporary disk space for scaling up the [...] Specifically, we need D = k^2 p = kN disk space to store the order arrays [...] partition pairs. Since the number of partitions is k = N/M, from D = N^2/M [...] the size of the largest input that we can process with M bytes of internal [...] bytes of disk space. If we substitute the common values for modern computers, [...]

29. [...]
  k [...]
  partition id 1 byte, suffix index 4 byte [...] 5 bytes (N < 2^40)
  Induce sort [...]
  O(kN) [...] O(N) [...]
  tail [...]
  dividers [...] LCP [...]
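The disk-space estimate quoted on slide 28 (D = k^2 p = kN with k = N/M, hence D = N^2/M) can be turned into a few lines of arithmetic. This is a back-of-the-envelope sketch only: the function names are invented here, sizes are in the same unit as N and M, and the constant number of bytes per order-array entry is ignored, so the result is not expected to match the on-disk sizes reported in the excerpt.

```python
import math


def num_partitions(N, M):
    """k = N/M, rounded up, as in the excerpt on slide 28."""
    return math.ceil(N / M)


def num_pairs(k):
    """Number of partition pairs that pair-wise sorting has to process."""
    return k * (k - 1) // 2


def order_array_space(N, M):
    """D = k * N with k = N / M, i.e. D = N^2 / M."""
    return (N / M) * N


def largest_input(M, D):
    """Solve D = N^2 / M for N: the largest input for memory M and disk D."""
    return math.sqrt(M * D)


M = 2  # GB of main memory, as in the running-time table
for N in (6, 8, 10, 12):
    k = num_partitions(N, M)
    print(N, k, num_pairs(k), order_array_space(N, M))
```

For the four input sizes in the table, the partition and pair counts produced this way agree with the "Number of partitions" and "Number of partition pairs" columns.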