
A Multi-Core Processor Approach for Mining Frequent Sequential Patterns

Abstract. The problem of mining frequent sequential patterns (FSPs) has attracted a great deal of research attention. Although there are many efficient algorithms for mining FSPs, their mining time is still high, especially for large or dense datasets. Parallel processing has been widely applied to improve processing speed for various problems. Based on a multi-core processor architecture, this paper proposes a parallel approach called PIB-PRISM for mining FSPs from large datasets. Branches of the search tree are treated as tasks in PIB-PRISM and are processed in parallel. Prime-encoding theory is also used to determine support values quickly. Experiments are conducted to verify the effectiveness of PIB-PRISM. Experimental results show that the proposed algorithm outperforms PRISM for mining FSPs in terms of mining time.

1. Introduction

Mining sequential patterns (SPs) is a fundamental problem in data mining and is applied in many domains, such as customer shopping cart analysis (Agrawal and Srikant, 1995), weblog analysis (Weichbroth et al., 2012; Zubi et al., 2014), and DNA sequence analysis (Raza, 2013; Zaki et al., 2002). The input data, called a sequence dataset, consist of a set of sequences. Each sequence is a list of transactions, each of which contains a set of items. An SP is a list of itemsets whose support is above a user-specified minimum support, where the support of an SP is the percentage of sequences that contain the pattern. The goal of SP mining is to find all SPs in a sequence dataset.

An example of a sequence dataset is customer purchase behavior in a store. The dataset contains the itemsets purchased in sequence by each customer, so the purchase behavior of each customer can be represented in the form [Customer ID, Ordered Sequence of Events], where each event is a set of store items (e.g., bread, sugar, milk, and tea). Below is an example of the purchase behavior of two customers, C1 and C2: [C1, (bread, milk), (bread, milk, sugar), (milk), (tea, sugar)]; [C2, (bread), (sugar, tea)]. The first customer, C1, purchases (bread, milk), (bread, milk, sugar), (milk), and (tea, sugar) in sequence; the second customer, C2, purchases (bread) and then (sugar, tea).

The general idea of existing methods is to start with general (short) sequences and then extend them towards specific (long) ones. Existing methods can be divided into the following three main types.

(1) Horizontal methods. Datasets are organized in the horizontal format, where each line is a tuple of the form (sid, itemset), in which sid is a unique sequence (customer) ID and itemset is a set of items. GSP (Agrawal and Srikant, 1996; Zhang et al., 2002), AprioriAll (Agrawal and Srikant, 1995), and PSP (Masseglia et al., 1998), which are extensions of the Apriori approach, are some examples. The main disadvantage of this type of method is that the dataset is scanned many times to determine the frequent sequential patterns (FSPs); the runtime and memory usage of these algorithms are therefore high.

(2) Vertical methods. Datasets are organized in the vertical format, in which each row is a tuple (item, sids), where sids is the set of IDs of the sequences containing the item. In this layout, the support of a sequence is the size of the set of sids. In the SPADE algorithm (Zaki, 2001a), FSPs are determined by combining sids based on lattice theory, which reduces the number of dataset scans. The SPAM (Ayres et al., 2002) and PRISM (Gouda et al., 2010) algorithms use bit vectors to store sids in order to reduce memory usage.

(3) Projection methods. Projection methods are hybrid methods that combine the horizontal and vertical approaches. Given a prefix sequence P, the main idea is to project the dataset onto P, keeping only the parts of the sequences that can extend P.
Hence, the projected (or conditional) dataset contains only those sequences that have prefix P, and the frequency of extensions of P can be counted directly in the projected dataset. PrefixSpan (Pei et al., 2004), an extension of FreeSpan (Han et al., 2000; Pei et al., 2004), is an example. Its general idea is to examine only the prefix subsequences and project only their corresponding postfix subsequences into projected datasets; in each projected dataset, FSPs are grown by exploring only locally frequent items. Projections are first performed on the dataset to reduce the cost of data storage. Beginning from FSPs with only one item, PrefixSpan generates a projected dataset for each frequent prefix; in that projected dataset, each sequence retains only its suffix with respect to the projection prefix. FSPs are then generated from the frequent items found in the projected datasets. IMSR_PreTree (Van et al., 2014), for mining sequential rules, and MNSR-Pretree (Pham et al., 2014), for mining non-redundant sequential rules, are implemented based on the prefix-tree structure.

In addition, a number of extended problems related to sequence mining have been proposed, including mining frequent closed SPs (Tran et al., 2015; Yan et al., 2003), frequent closed sequences without candidate generation (Wang and Han, 2004), sequence generators (Lo et al., 2008; Vijayarani and Deepa, 2014), and inter-sequence patterns (Wang and Lee, 2009).

The above FSP mining algorithms are implemented with a sequential, single-task strategy: a task must be completed before the next one can be started. Hence, these methods are time-consuming for large datasets, especially dense ones. To improve performance, some researchers have applied parallel computing to speed up processing. For example, Zaki (2001b) proposed a parallel algorithm named pSPADE, based on SPADE (Zaki, 2001a), for fast discovery of sequential patterns on shared-memory machines. Cong et al. (2005) proposed the Par-CSP (parallel closed sequential pattern mining) algorithm for computers with distributed memory. In addition, multi-core processors (Andrew, 2008) allow multiple tasks to be executed in parallel to enhance performance, and some data mining researchers have developed parallel algorithms for multi-core processor architectures. Liu et al. (2007) proposed a cache-conscious FP-array and a lock-free dataset-tiling mechanism for mining frequent itemsets on multi-core computers. Yu and Wu (2011) proposed an effective load-balancing strategy to reduce the number of duplicate candidates generated. Parallel mining has also been applied to frequent and closed itemset mining (Liu et al., 2007; Negrevergne et al., 2010; Schlegel et al., 2013; Yu and Wu, 2011), correlated pattern mining (Casali and Ernst, 2013), and generic pattern mining (Negrevergne et al., 2014).

The present study proposes an approach for mining SPs in parallel based on the PRISM algorithm for multi-core processor architectures, called parallel independent branch PRISM (PIB-PRISM).
PIB-PRISM uses prime block encoding, based on prime factorization, to quickly determine the support values associated with SPs. Experiments were performed to compare the efficiency of PIB-PRISM with that of PRISM.

The rest of this paper is organized as follows. Section 2 reviews the basic concepts of FSP mining, multi-core processor architectures, prime block encoding, and PRISM. The PIB-PRISM algorithm is proposed in Section 3. Experimental results are discussed in Section 4. Finally, conclusions and ideas for future work are given in Section 5.

2. Related work

2.1. Preliminaries

Let I = {i1, i2, ..., im} be a set of m distinct attributes, also called items. An itemset X = {i1, i2, ..., ik} is a non-empty, unordered collection of items, denoted a k-itemset. Without loss of generality, it is assumed that the items of an itemset are sorted in increasing order. A sequence S is an ordered list of itemsets, denoted (S1 S2 ... Sn), where each element Si is an itemset and n, the size of the sequence, is the number of itemsets (elements) in the sequence. The length of a sequence is the total number of items in it; a sequence of length k is called a k-sequence. For example, the sequence p = (B)(AC) is a 3-sequence of size 2. A sequence β = (b1 b2 ... bm) is called a subsequence of a sequence α = (a1 a2 ... an), and α a supersequence of β, denoted β ⊆ α, if there exist integers 1 ≤ j1 < j2 < ... < jm ≤ n such that bk ⊆ a_{jk} for all 1 ≤ k ≤ m. For example, the sequence (B)(AC) is a subsequence of (AB)(E)(ACD), but (AB)(E) is not a subsequence of (ABE).

A sequence dataset D is a set of tuples (sid, S), where sid is a unique sequence identifier and S = (S1 S2 ... Sn) is a sequence of itemsets. A pattern is a subsequence of a data sequence, and each itemset in a pattern is called an element. The absolute support of a sequence p in a dataset D is the total number of sequences in D that contain p, denoted sup(p) = |{Si ∈ D | p ⊆ Si}|. The relative support of p is the fraction of the sequences in D that contain p; absolute and relative supports are used interchangeably here. Given a user-specified threshold called the minimum support (denoted minSup), a sequence p is said to be frequent if sup(p) ≥ minSup. A frequent sequence is maximal if it is not a subsequence of any other frequent sequence, and closed if it is not a subsequence of any other frequent sequence with the same support. Given a sequence dataset D and minSup, the problem of mining SPs is to find all frequent sequences in the dataset.

Table 1. Example of a sequence dataset

SID | Sequence
S1 | (AB)(B)(B)(AB)(B)(AC)
S2 | (AB)(BC)(BC)
S3 | (B)(AB)
S4 | (B)(B)(BC)
S5 | (AB)(AB)(AB)(A)(BC)

Consider the sequence dataset D in Table 1, which is used throughout the paper. The set of items in the dataset is {A, B, C}. Assume minSup = 2 in this example. Sequence S1 = (AB)(B)(B)(AB)(B)(AC) has six itemsets: (AB), (B), (B), (AB), (B), and (AC). The size and length of S1 are thus six and nine, respectively. Sequence p1 = (AB)(C) is a subsequence of S1. In Table 1, only the three sequences S1, S2, and S5 contain p1; therefore, sup(p1) = 3, and p1 is an FSP since sup(p1) > minSup.
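The containment test and support count defined above translate directly into code. The following C# helper is a minimal sketch (the names IsSubsequence and Support are ours, not part of PRISM); it matches the pattern's itemsets greedily against a strictly increasing selection of the data sequence's itemsets:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class SeqUtils
    {
        // beta is a subsequence of alpha if its itemsets are contained,
        // in order, in a strictly increasing selection of alpha's itemsets.
        public static bool IsSubsequence(List<HashSet<char>> beta, List<HashSet<char>> alpha)
        {
            int j = 0;
            foreach (var b in beta)
            {
                while (j < alpha.Count && !b.IsSubsetOf(alpha[j])) j++;
                if (j == alpha.Count) return false;   // no itemset left to match b
                j++;                                  // next element must come later
            }
            return true;
        }

        // Absolute support of pattern p in dataset db.
        public static int Support(List<HashSet<char>> p, List<List<HashSet<char>>> db)
            => db.Count(s => IsSubsequence(p, s));
    }

For p1 = (AB)(C) and the five sequences of Table 1, Support returns 3, matching the computation above.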

2.2. Multi-core processor architecture

Figure 1. Quad-core processor: four cores, each with individual memory, connected to a shared memory and a bus interface to off-chip components.

In a multi-core architecture, a processor includes two or more independent cores in the same physical package (Andrew, 2008), with each core having its own private memory and access to a shared main memory. An example of a quad-core processor is shown in Figure 1. Multi-core processors allow multiple tasks to be executed in parallel to increase performance. Given these benefits, this paper proposes a method for mining FSPs in parallel on a multi-core architecture to reduce running time, thereby improving the efficiency of intelligent systems.

2.3. Prime block encoding

The PRISM algorithm (Gouda et al., 2010) was proposed as an effective approach for frequent sequence mining via prime block encoding. It is briefly summarized as follows.

Let G be a set of prime numbers sorted in increasing order, and let B be a bit vector of length n. B can be partitioned into m = ⌈n/|G|⌉ contiguous blocks, where each block Bi is the segment from B[(i-1)|G| + 1] to B[i|G|], for 1 ≤ i ≤ m. Each Bi represents a subset S ⊆ G, namely the primes at the positions where Bi has a 1 bit. Let Bi[j] be the j-th bit in block Bi and G[j] the j-th prime number in G. The value of Bi with respect to G, denoted v(Bi, G), is defined as v(Bi, G) = ∏_{j=1}^{|G|} G[j]^{Bi[j]}. For example, let Bi = 1001 and G = {2, 3, 5, 7}; then v(Bi, G) = 2^1 × 3^0 × 5^0 × 7^1 = 2 × 7 = 14. If Bi = 0000, then v(Bi, G) = 1.

The prime block encoding of a bit vector B with respect to a base prime set G, denoted v(B, G), is the set of all block values: v(B, G) = {v(Bi, G) : 1 ≤ i ≤ m}. We write v(Bi) for v(Bi, G) and v(B) for v(B, G) for simplicity. With the bit vector partitioned into blocks, each block has length |G|. For example, given G = {2, 3, 5, 7} and B = 100111100100, B is partitioned into 12/4 = 3 blocks, B1 = 1001, B2 = 1110, and B3 = 0100, with v(B1) = {2, 7} → 14, v(B2) = {2, 3, 5} → 30, and v(B3) = {3} → 3. Thus, the prime block encoding of B with respect to G is v(B) = {14, 30, 3}. The inverse operation is v^{-1}({14, 30, 3}) = v^{-1}(14) v^{-1}(30) v^{-1}(3) = 100111100100 = B. Note that an all-zero block is encoded as 1.

Given a bit vector A = A1A2...Am, where each Ai contains |G| bits, let fA be the position of the first 1 bit in A. The mask operator, denoted μ(A), is defined bit-wise as follows:

μ(A)[j] = 0 if j ≤ fA, and μ(A)[j] = 1 if j > fA. That is, μ(A) is the bit vector obtained by clearing every position up to and including the first 1 bit of A and setting every later position. For example, let A = 001001100100. Then fA = 3 and μ(A) = 000111111111. Similarly, the mask operator for a prime block encoding is defined as μ(v(A)) = v(μ(A)). For example, μ(v(A)) = v(μ(A)) = v(000111111111) = v(0001) v(1111) v(1111) = {7, 210, 210}, whereas v(A) = v(001001100100) = v(0010) v(0110) v(0100) = {5, 15, 3}.

Consider the set of items I = {i1, i2, ..., in}. Each item can appear in different sequences and at different positions within a sequence. Therefore, two encodings are computed for each item: an encoding of the IDs of the sequences containing the item, and an encoding of the positions at which the item appears in each such sequence. Let P(SX, PX) denote the prime encoding of item X, where SX is the sequence encoding (the IDs of the sequences that contain X) and PX is the position encoding (the positions of X in each of those sequences). Prime block encoding involves the two steps below.

Step 1. Position block encoding

The prime encoding of the positions at which an item X appears in a sequence S is computed as follows. First, a bit vector, named BIT, is built to represent the positions of X in S. The length of the bit vector equals the size of the sequence: if the i-th itemset of S contains X, then BIT[i] = 1; otherwise, BIT[i] = 0. Extra 0 bits are appended so that the vector length is a multiple of |G|. Figure 2(a) shows an example of five sequences with I = {A, B, C}.

Let G = {2, 3, 5, 7}, a square-free prime generator set (Gouda et al., 2010). In the first sequence, item A appears at positions 1, 4, and 6, so the bit encoding of A is 100101. Because |G| = 4, two extra 0 bits are added and the bit vector becomes 10010100. Then v(1001) and v(0100) are computed as 14 and 3, respectively, so the position encoding of A in the first sequence is {14, 3}. Figure 2(b) shows the encoded positions of item A in all five sequences. The same procedure is applied to the other two items, B and C.
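As a concrete illustration of Step 1, the encoding can be sketched as follows in C#. This is a minimal sketch of the scheme described above with G fixed to {2, 3, 5, 7}; it is not PRISM's actual implementation:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class PrimeBlocks
    {
        static readonly int[] G = { 2, 3, 5, 7 };   // square-free generator set

        // Encode a 0/1 string, padded to a multiple of |G|, into v(B).
        public static List<long> Encode(string bits)
        {
            int pad = (G.Length - bits.Length % G.Length) % G.Length;
            bits += new string('0', pad);
            var blocks = new List<long>();
            for (int i = 0; i < bits.Length; i += G.Length)
            {
                long v = 1;                          // an empty block encodes as 1
                for (int j = 0; j < G.Length; j++)
                    if (bits[i + j] == '1') v *= G[j];
                blocks.Add(v);
            }
            return blocks;
        }

        // Factor cardinality ||v||_G: how many primes of G divide v.
        public static int Cardinality(long v) => G.Count(g => v % g == 0);
    }

Encode("10010100") yields {14, 3}, the position encoding of item A in sequence S1, and Cardinality(30) = 3, the factor cardinality used for support counting below.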

Step 2. Sequence block encoding

Another bit vector is used to represent the sequences in which an item appears. In the example above, item A appears in all sequences except the 4th, so the bit vector 11101 represents its occurrences across the dataset. This bit vector is then encoded into prime blocks based on the given prime set G; again, extra 0 bits are appended so that the vector length is a multiple of |G|. After three 0 bits are added to the right, the bit vector for item A becomes 11101000, and the prime encoding generated from it is v(A) = v(1110) v(1000) = {30, 2}. The results for all three items are shown in Figure 2(c). Figure 2(d) shows the final prime encoding, including both the sequence and the position encodings.

Compression of prime block encoding

An SP is present in only some sequences of a dataset and, within each such sequence, occurs at only some positions. The bit vectors therefore usually contain many zero bits. A block whose bits are all zero is called an empty block; for example, Ai = 0000 is an empty block, with v(Ai) = v(0000) = 1. Empty blocks are removed during compression of the prime encoding blocks to reduce size: PRISM retains only non-empty blocks after encoding and keeps an index with each sequence block indicating which non-empty position blocks correspond to that sequence block. Figure 2(e) shows the compact prime block encoding for item A. The first sequence block is 30, with factor cardinality ||30||_G = 3, meaning that this block covers three non-empty sequences containing item A. For each such sequence, an offset into the list of position blocks is stored.

Figure 2. Prime block encoding.

For example, the offset for sequence 1 is 1, with the first two position blocks corresponding to this sequence. The offset for sequence 2 is 3, and that for sequence 3 is 4. The sequences represented by sequence block 30 can be recovered directly from the corresponding bit vector v^{-1}(30) = 1110, which indicates that sequence 4 is invalid because its blocks are empty. The second sequence block for item A is 2, and v^{-1}(2) = 1000 indicates that only sequence 5 is valid; its position blocks begin at offset 5. The benefit of this sparse representation becomes clear for the prime encoding of item C: the full prime blocks in Figure 2(d) contain much redundant information, which is removed in the compact prime block encoding in Figure 2(e).

2.4. Support counting via prime block joins

Consider a sequence s with prime encoding P(Ss, Ps), where Ss is the set of encoded sequence blocks of s and Ps is the set of encoded position blocks of s. The support of s can be determined directly from the prime block encoding as sup(s) = Σ_{v ∈ Ss} ||v||_G. For example, for s = (A) with v(A) = {30, 2}, sup(s) = ||30||_G + ||2||_G = 3 + 1 = 4.

2.5. PRISM algorithm

PRISM (Gouda et al., 2010) is used as the base serial algorithm for our parallel FSP mining algorithm, for two reasons. First, PRISM is an effective algorithm for mining SPs. Second, PRISM searches the space without maintaining a set of candidates, which facilitates parallel processing. PRISM uses the vertical data format, based on prime block encoding, to represent candidate sequences, and uses join operations over the prime blocks to determine the frequency of each candidate. PRISM scans the dataset only once to find the set of SPs of size 1, together with the block encodings corresponding to those patterns. A new pattern is then identified from the block encoding of an existing pattern and the added item. PRISM consists of two main steps:

Step 1. Construct the search tree from the set of frequent 1-sequences.

Step 2. Extend the search tree and traverse it to find larger SPs using itemset extension and sequence extension.

Search space

For a sequence dataset, the subsequence relation is typically represented as a search tree, defined recursively as follows. The root of the tree is at level zero and is labeled with the null sequence ∅. A node labeled with a sequence S at level k (a k-sequence) is repeatedly extended by adding one item to generate a child node at level k+1 (a (k+1)-sequence). A k-sequence can be extended by sequence extension or itemset extension. In sequence extension, the item is appended to the SP as a new itemset. In itemset extension, the item is added to the last itemset of the pattern, so it must be lexicographically greater than all items in that itemset. For example, for the node S = (A)(A), adding item B gives (A)(A)(B) by sequence extension and (A)(AB) by itemset extension. Figure 3 shows the process of extending the search tree in depth-first order.

Figure 3. Search space for mining SPs with itemset extension and sequence extension.
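The enumeration shown in Figure 3 can be sketched as a small generator. In the sketch below (illustrative only; representing patterns as lists of sorted strings is our choice, not PRISM's), each node yields one sequence extension per item and an itemset extension only for items lexicographically greater than the last item:

    using System;
    using System.Collections.Generic;

    static class SearchSpace
    {
        // Children of pattern p (a list of itemsets, each a sorted string).
        public static IEnumerable<List<string>> Children(List<string> p, char[] items)
        {
            foreach (char i in items)
            {
                // Sequence extension: append item i as a new itemset.
                yield return new List<string>(p) { i.ToString() };

                // Itemset extension: extend the last itemset only when i is
                // lexicographically greater than all of its items.
                string last = p[p.Count - 1];
                if (last[last.Length - 1] < i)
                {
                    var ext = new List<string>(p);
                    ext[ext.Count - 1] = last + i;
                    yield return ext;
                }
            }
        }
    }

For p = (A)(A) and items {A, B, C}, Children yields (A)(A)(A), (A)(A)(B), (A)(AB), (A)(A)(C), and (A)(AC), matching the example above.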

Method for extending patterns and computing supports

In the initialization step, the prime block encoding of each single item in the sequence dataset is computed. The FSP mining process then starts from the root of the search tree. For each node, the pattern at that node is extended by adding an item to create a new pattern, and the support of the new pattern is computed by joining the prime blocks of the pattern and of the item used to extend it. For a node S, all of its extensions are evaluated before the depth-first recursive call. The search stops when no new frequent extensions are found.

Itemset extension

Consider a pattern (A) with prime block encoding P(S(A), P(A)) and an item B with prime block encoding P(SB, PB). Itemset extension applied to pattern (A) with item B creates the new pattern (AB). The sequence blocks S(A) and SB in Figure 4(a) contain all information about the relevant sequence IDs in which A and B occur.

The sequence block encoding for sequences containing both pattern (A) and item B is computed via the greatest common divisor (gcd) of each pair of corresponding elements in the two block sets S(A) and SB; from it, the bit vector of the sequences in which both occur can be recovered. For example, as shown in Figure 4(a), S(A) = {30, 2} and SB = {210, 2}, so gcd(S(A), SB) = {gcd(30, 210), gcd(2, 2)} = {30, 2}. The inverse operation gives v^{-1}({30, 2}) = 11101000: pattern (A) and item B occur together in sequences 1, 2, 3, and 5. The position blocks are joined in the same way. For example, for sequence S1 in Table 1, P1A = {14, 3} and P1B = {210, 2}; then gcd(P1A, P1B) = {gcd(14, 210), gcd(3, 2)} = {14, 1}, which indicates that A and B occur together at positions 1 and 4 of sequence 1.

Sequence extension

Sequence extension applied to pattern (A) with item B creates the new pattern (A)(B). As with itemset extension, all sequences that contain both pattern (A) and item B are found first. Then, for each such sequence, it is checked whether item B appears at some position after an occurrence of A; if so, the sequence contains (A)(B). This check is performed by applying the mask operator to the position blocks of (A) before joining them with the position blocks of B.

Figure 4. Sequence extension via prime block joins.

For example, as shown in Figure 4(b), consider sequence 1, with P1(A) = {14, 3} and P1B = {210, 2}. Then gcd(μ(P1(A)), P1B) = gcd({105, 210}, {210, 2}) = {gcd(105, 210), gcd(210, 2)} = {105, 2}. Since v^{-1}({105, 2}) = 01111000, this precisely indicates the positions in sequence 1 of Table 1 at which B occurs after the first occurrence of A.
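The joins used by both extension types reduce to pairwise gcds plus the mask operator. The following is a compact sketch in C# (illustrative only; PRISM's real structures also carry the compact-block offsets of Figure 2(e)):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class BlockJoins
    {
        static readonly int[] G = { 2, 3, 5, 7 };

        static long Gcd(long a, long b) => b == 0 ? a : Gcd(b, a % b);

        // Pairwise gcd join of two equal-length block lists.
        public static List<long> Join(List<long> x, List<long> y)
            => x.Zip(y, Gcd).ToList();

        // Mask: zero out every position up to and including the first 1 bit,
        // set every later position, as required for sequence extension.
        public static List<long> Mask(List<long> blocks)
        {
            var result = new List<long>();
            bool seen = false;                        // first 1 bit passed yet?
            foreach (long v in blocks)
            {
                long m = 1;
                for (int j = 0; j < G.Length; j++)
                {
                    if (seen) m *= G[j];              // positions after fA become 1
                    else if (v % G[j] == 0) seen = true;
                }
                result.Add(m);
            }
            return result;
        }

        // Support from sequence blocks: the sum of factor cardinalities.
        public static int Support(List<long> seqBlocks)
            => seqBlocks.Sum(v => G.Count(g => v % g == 0));
    }

For the running example, Mask({14, 3}) = {105, 210}, Join({105, 210}, {210, 2}) = {105, 2}, and Support({30, 2}) = 4, matching the hand computations above.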

3. Parallel independent branch PRISM for mining frequent sequential patterns

Figure 5. Parallel model: tasks extend the frequent k-sequences (Fk) into (k+1)-sequences (Fk+1), and the partial results are collected.

To improve efficiency and reduce the processing time of mining FSPs, this paper presents a parallel algorithm based on PRISM. The parallel model is shown in Figure 5.


Figure 6. Task tree.

In parallel mining of FSPs, each branch of the search tree is regarded as a single task, which mines its assigned branch independently of all other tasks to generate FSPs. An example is given in Figure 6: there are two tasks at level 1 of the task tree, with task 1 and task 2 processing branches (A) and (B), respectively, in parallel and independently.

In general, this strategy is similar to the independent-class search strategy proposed by Schlegel et al. (2013). The proposed parallel independent branch PRISM (PIB-PRISM) uses a tree structure and a parallel implementation based on tasks instead of threads. The key advantage of this strategy is that each task is assigned a branch of the tree and processes it independently. The algorithm was implemented in .NET Framework 4.0, whose task abstraction has advantages over raw threads. First, tasks require less memory than threads. Second, a thread runs on only one core, whereas a pool of tasks can be scheduled across multiple cores. Finally, threads incur more overhead than tasks because the operating system must allocate data structures for threads, initialize and destroy them, and perform context switches between them.

The parallel mining process for finding FSPs from a task tree is as follows.

Step 1. Construct the search tree from the set of frequent 1-sequences.

Step 2. Assign each branch of the tree to a processor core and mine FSPs in each branch independently.

In PIB-PRISM, the dataset must first be preprocessed into a vertical format; this step is performed only once. After transformation, the dataset, called the transformed dataset D, has the structure shown in Figure 7. The first line gives the number of sequences in the dataset; the next line contains an item i and its support count in the dataset; and the following lines (called support lines) have the form sid - positions, indicating the positions at which item i appears in sequence sid. For example, the third line in Figure 7 represents that item a appears at positions 1, 4, and 6 of the first sequence; similarly, the fourth line indicates that item a appears at position 1 of the second sequence.
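Since Figure 7 itself is not reproduced here, the following is a plausible rendering of its first lines for item a, derived from the description above and the occurrences of A in Table 1 (the exact layout is assumed):

    5
    a 4
    1 - 1 4 6
    2 - 1
    3 - 2
    5 - 1 2 3 4

The first line is the number of sequences, the second gives item a and its support count, and each support line lists a sequence ID and the positions at which a occurs in it.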

Figure 7. An example of a transformed dataset from Table 1.
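Before the pseudocode, the branch-per-task idea can be expressed directly with the task support in .NET, the implementation platform stated above. In this minimal sketch, MineBranch is a hypothetical stand-in for the extendTree procedure of Figure 10:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    static class ParallelMiner
    {
        // Hypothetical per-branch depth-first mining (see Figure 10).
        static List<string> MineBranch(string frequent1Seq)
            => new List<string> { "(" + frequent1Seq + ")" };

        public static List<string> Mine(IEnumerable<string> frequent1Seqs)
        {
            var totalPs = new ConcurrentBag<string>();   // thread-safe result bag
            var tasks = new List<Task>();
            foreach (string f1 in frequent1Seqs)
            {
                string item = f1;                        // avoid shared-capture bug in C# 4
                // One task per level-1 branch; the scheduler spreads
                // the tasks over the available cores.
                tasks.Add(Task.Factory.StartNew(() =>
                {
                    foreach (string p in MineBranch(item)) totalPs.Add(p);
                }));
            }
            Task.WaitAll(tasks.ToArray());               // collect the partial results
            return new List<string>(totalPs);
        }
    }

Note that a task created with new Task(...) must also be started; the sketch uses Task.Factory.StartNew, which creates and schedules the task in one step.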

The pseudocode of the PIB-PRISM strategy is shown in Figure 8.

Input: Dataset D, minSup
Output: All FSPs satisfying minSup
Procedure PIB-PRISM(D, minSup)
1  Begin
2    dbpat = pGenerate_SPs(minSup, D)        // find the frequent 1-sequences
3    List listpatstring = empty list          // patterns found so far
4    For i = 0 to |dbpat| - 1 do
5    Begin
6      Add dbpat.Pats[i] to listpatstring
7      Task ti = new Task(() =>               // one task per branch
8        extendTree(dbpat.Pats[i], dbpat, listpatstring)); start ti
9    End for
10   For each task ti in the list of created tasks do
11     collect the set of patterns Ps returned by ti
12     totalPs = totalPs ∪ Ps
13 End

Figure 8. The PIB-PRISM strategy

In the PIB-PRISM strategy, the pGenerate_SPs procedure is called to identify all frequent 1-sequences in the dataset D and store them in dbpat (line 2). Next, PIB-PRISM creates a task for each pattern in dbpat (line 7) and executes the extendTree procedure (line 8) to perform itemset extensions and sequence extensions. Each task is performed independently on a processor core and yields a partial set of FSPs; the final set of FSPs is the union of the partial results. The pGenerate_SPs procedure is given in Figure 9.

Procedure pGenerate_SPs(minSup, D)
1  Begin
2    Create n tasks, with n = ⌈|I| / P⌉, where |I| is the number of single items in D and P is the number of processor cores
3    For each item i in I do
4      Assign item i to tasks[i mod n]        // round-robin assignment
5      Generate_SP(i)                          // encode i and test it against minSup
6    End for
7  End

Procedure Generate_SP(item)
1  Begin
2    If support(item) >= minSup then
3      Encode the information of item based on PRISM   // prime block encoding
4      Store the information of item in dbpat
5  End

Figure 9. Procedures pGenerate_SPs and Generate_SP.

In Figure 9, the pGenerate_SPs procedure creates n tasks (line 2) and assigns them to processor cores to execute the Generate_SP procedure (line 5), which is based on PRISM, to find the frequent 1-sequences. The extendTree procedure and its subroutines are shown in Figure 10.

Procedure extendTree(Pattern p, DB_Pattern dbpat, int level, List listpatstring)
1  Begin
2    extendItemset(p, dbpat, listpatstring)
3    extendSequence(p, dbpat, listpatstring)
4    extendTreeCollection(p.Itemset_ext_pattern, dbpat, level + 1, listpatstring)
5    extendTreeCollection(p.Sequence_ext_pattern, dbpat, level + 1, listpatstring)
6  End

Procedure extendItemset(Pattern p, DB_Pattern dbpat, List listpatstring)
1  Begin   // let item_p be the last item of p; let i be its position in dbpat
2    While i <= |dbpat| do
3    Begin
4      Pattern pnew = createNewPattern(p, dbpat.Pats[i], true)   // true: itemset extension
5      Add pnew to listpatstring
6      Add pnew to p.Itemset_ext_pattern
7      i = i + 1
8    End while
9  End

Procedure extendSequence(Pattern p, DB_Pattern dbpat, List listpatstring)
1  Begin
2    For each Pattern pi in dbpat.Pats do
3      Pattern pnew = createNewPattern(p, pi, false)   // false: sequence extension
4      Add pnew to listpatstring
5      Add pnew to p.Sequence_ext_pattern
6    End for
7  End

Procedure extendTreeCollection(List listpats, DB_Pattern dbpat, int level, List listpatstring)
1  Begin
2    For each Pattern pi in listpats do
3      extendTree(pi, dbpat, level, listpatstring)
4    End for
5  End

Figure 10. Procedures extendTree, extendItemset, extendSequence, and extendTreeCollection.

For each node in the search tree, the pattern at that node is extended by calling extendItemset and extendSequence (Gouda et al., 2010) to create new patterns, as shown in Figure 10. An example of the itemset extensions and sequence extensions executed on the data in Table 1 is shown in Table 2, and the resulting set of frequent sequential patterns is shown in Table 3.

Table 2. Itemset extensions and sequence extensions from Table 1 with minSup = 50%

Prefix | Itemset extension | Support >= minSup | Sequence extension | Support >= minSup
(A): 4 | (AB): 4 | Yes | (A)(A): 2 | No
 | (AC): 1 | No | (A)(B): 3 | Yes
 | | | (A)(C): 3 | Yes
(AB): 4 | (ABC): 0 | No | (AB)(A): 2 | No
 | | | (AB)(B): 3 | Yes
 | | | (AB)(C): 3 | Yes
(AB)(B): 3 | (AB)(BC): 2 | No | (AB)(B)(A): 2 | No
 | | | (AB)(B)(B): 3 | Yes
 | | | (AB)(B)(C): 3 | Yes
(AB)(B)(B): 3 | (AB)(B)(BC): 2 | No | (AB)(B)(B)(A): 2 | No
 | | | (AB)(B)(B)(B): 2 | No
 | | | (AB)(B)(B)(C): 2 | No
(AB)(B)(C): 3 | | | (AB)(B)(C)(A): 0 | No
 | | | (AB)(B)(C)(B): 0 | No
 | | | (AB)(B)(C)(C): 0 | No
(AB)(C): 3 | | | (AB)(C)(A): 0 | No
 | | | (AB)(C)(B): 1 | No
 | | | (AB)(C)(C): 1 | No
(A)(B): 3 | (A)(BC): 2 | No | (A)(B)(A): 2 | No
 | | | (A)(B)(B): 3 | Yes
 | | | (A)(B)(C): 3 | Yes
(A)(B)(B): 3 | (A)(B)(BC): 2 | No | (A)(B)(B)(A): 2 | No
 | | | (A)(B)(B)(B): 2 | No
 | | | (A)(B)(B)(C): 2 | No
(A)(B)(C): 3 | | | (A)(B)(C)(A): 0 | No
 | | | (A)(B)(C)(B): 0 | No
 | | | (A)(B)(C)(C): 0 | No
(A)(C): 3 | | | (A)(C)(A): 0 | No
 | | | (A)(C)(B): 1 | No
 | | | (A)(C)(C): 1 | No
(B): 5 | (BC): 3 | Yes | (B)(A): 3 | Yes
 | | | (B)(B): 5 | Yes
 | | | (B)(C): 4 | Yes
(BC): 3 | | | (BC)(A): 0 | No
 | | | (BC)(B): 1 | No
 | | | (BC)(C): 1 | No
(B)(A): 3 | (B)(AB): 3 | Yes | (B)(A)(A): 2 | No
 | (B)(AC): 1 | No | (B)(A)(B): 2 | No
 | | | (B)(A)(C): 2 | No
(B)(AB): 3 | (B)(ABC): 0 | No | (B)(AB)(A): 2 | No
 | | | (B)(AB)(B): 2 | No
 | | | (B)(AB)(C): 2 | No
(B)(B): 5 | (B)(BC): 3 | Yes | (B)(B)(A): 2 | No
 | | | (B)(B)(B): 4 | Yes
 | | | (B)(B)(C): 4 | Yes
(B)(BC): 3 | | | (B)(BC)(A): 0 | No
 | | | (B)(BC)(B): 1 | No
 | | | (B)(BC)(C): 1 | No
(B)(B)(B): 4 | (B)(B)(BC): 3 | Yes | (B)(B)(B)(A): 2 | No
 | | | (B)(B)(B)(B): 2 | No
 | | | (B)(B)(B)(C): 2 | No
(B)(B)(BC): 3 | | | (B)(B)(BC)(A): 0 | No
 | | | (B)(B)(BC)(B): 0 | No
 | | | (B)(B)(BC)(C): 0 | No
(B)(B)(C): 4 | | | (B)(B)(C)(A): 0 | No
 | | | (B)(B)(C)(B): 0 | No
 | | | (B)(B)(C)(C): 0 | No
(B)(C): 4 | | | (B)(C)(A): 0 | No
 | | | (B)(C)(B): 1 | No
 | | | (B)(C)(C): 1 | No
(C): 4 | | | (C)(A): 0 | No
 | | | (C)(B): 1 | No
 | | | (C)(C): 1 | No

Table 3. The set of frequent sequential patterns from Table 1 with minSup = 50%

No. | FSP | Support
1 | (A) | 4
2 | (AB) | 4
3 | (AB)(B) | 3
4 | (AB)(B)(B) | 3
5 | (AB)(B)(C) | 3
6 | (AB)(C) | 3
7 | (A)(B) | 3
8 | (A)(B)(B) | 3
9 | (A)(B)(C) | 3
10 | (A)(C) | 3
11 | (B) | 5
12 | (BC) | 3
13 | (B)(A) | 3
14 | (B)(AB) | 3
15 | (B)(B) | 5
16 | (B)(BC) | 3
17 | (B)(B)(B) | 4
18 | (B)(B)(BC) | 3
19 | (B)(B)(C) | 4
20 | (B)(C) | 4
21 | (C) | 4

4. Experimental results

Experiments were conducted to evaluate the proposed parallel algorithm. They were performed on a personal computer with an Intel Core i7-4790 3.6-GHz CPU with 8 cores (3 MB of L3 cache) and 8 GB of RAM, running Windows 7. The algorithm was implemented using .NET Framework 4.0.

Runtime comparisons were performed using three standard datasets; their characteristics are shown in Table 4. The C6T5N1kD1k and C6T5N1kD10k datasets were produced with the IBM data generator, where C is the average number of itemsets per sequence, T the average number of items per itemset, N the number of distinct items, and D the number of sequences.

Table 4. Datasets used in the experiments

Dataset | No. of sequences | No. of items

C6T5N1kD1k | 1,000 | 1,000
C6T5N1kD10k | 10,000 | 1,000
Kosarak25k | 25,000 | 41,270

The mining times of PRISM (sequential) and the proposed PIB-PRISM (parallel) for various minSup values are shown in Table 5.

Table 5. Mining times of PRISM and PIB-PRISM

Dataset | minSup (%) | No. of patterns | Parallel mining time (sec.) | Sequential mining time (sec.)

C6T5N1kD1k | 0.8 | 11209 | 69.803 | 145.454
 | 0.7 | 14802 | 98.048 | 197.684
 | 0.6 | 20644 | 136.079 | 280.067
 | 0.5 | 31189 | 216.58 | 431.34
C6T5N1kD10k | 0.8 | 8430 | 780.735 | 1108.835
 | 0.7 | 10480 | 995.656 | 1423.003
 | 0.6 | 13627 | 1349.231 | 1905.902
 | 0.5 | 18461 | 1807.326 | 2632.395
Kosarak25k | 0.8 | 198 | 5.975 | 7.629
 | 0.7 | 244 | 9.453 | 11.263
 | 0.6 | 315 | 14.992 | 18.018
 | 0.5 | 419 | 26.13 | 31.824

Experiments were then conducted to compare the execution times of the two algorithms in Table 5. The runtimes for various minSup values (0.5-0.8%) on the three datasets are shown in Figures 11-13, respectively. As minSup decreases, more FSPs are obtained, and thus the runtime increases; the gap between the two algorithms also tends to grow, since lower thresholds produce more branches, and hence more work that can be distributed across cores.

Figure 11. Comparison of runtimes for various minSup values on C6T5N1kD1k.

Figure 12. Comparison of runtimes for various minSup values on C6T5N1kD10k.

Figure 13. Comparison of runtimes for various minSup values on Kosarak25k.

The experimental results show that PIB-PRISM is faster than PRISM on all three datasets. With minSup = 0.8%, the runtimes of PRISM and PIB-PRISM are 145.454 and 69.803 seconds, respectively, on the C6T5N1kD1k dataset. Similar results were obtained for the other two datasets. PIB-PRISM has lower runtimes because parallel processing divides the work into independent branches. Overall, the proposed algorithm significantly improves execution time compared with the original algorithm on the three datasets.

The search tree of PIB-PRISM might, however, be unbalanced: some branches of the tree have more nodes than others, a problem caused by depth-first search. One solution would be to sort items in ascending order of their support values. With this ordering, most nodes in the leftmost branches would be infrequent and would be pruned during the search, whereas nodes in the rightmost branches would be frequent and would not be pruned. This strategy would help balance the search tree for parallel processing.

5. Conclusions and future work

This study has proposed a strategy for mining SPs in parallel on a multi-core architecture. The proposed algorithm distributes the search for SPs across tasks on a multi-core computer and uses an efficient data structure for fast SP mining. Experimental results show that the proposed algorithm outperforms the PRISM algorithm. This paper addresses only the problem of mining SPs with multi-core processors. In future work, we will study parallel strategies for mining closed patterns and maximal patterns in sequence datasets on multi-core processors, and we will investigate other architectures for solving the mining problem efficiently.

References

Agrawal R, Srikant R. Mining sequential patterns. In: ICDE'95, pp. 3-14 (1995)
Agrawal R, Srikant R. Mining sequential patterns: generalizations and performance improvements. In: EDBT'96, pp. 3-17 (1996)
Andrew B. Multi-core processor architecture explained. Intel, http://software.intel.com/en-us/articles/multi-core-processor-architecture-explained (2008)
Ayres J, Gehrke J, Yiu T, Flannick J. Sequential pattern mining using a bitmap representation. In: SIGKDD'02 (2002)
Burdick D, Calimlim M, Gehrke J. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: ICDE'01, pp. 443-452 (2001)
Casali A, Ernst C. Extracting correlated patterns on multicore architectures. In: CD-ARES'13, pp. 118-133 (2013)
Cong S, Han J, Padua D. Parallel mining of closed sequential patterns. In: ACM SIGKDD'05, pp. 562-567 (2005)
Gouda K, Hassaan M, Zaki MJ. PRISM: an effective approach for frequent sequence mining via prime-block encoding. Journal of Computer and System Sciences, 76(1), 88-102 (2010)
Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann (2011)
Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu MC. FreeSpan: frequent pattern-projected sequential pattern mining. In: KDD'00, pp. 355-359 (2000)
Liu L, Li E, Zhang Y, Tang Z. Optimization of frequent itemset mining on multiple-core processors. In: VLDB'07, pp. 1275-1285 (2007)
Lo D, Khoo SC, Liu C. Mining and ranking generators of sequential patterns. In: SDM'08, pp. 553-564 (2008)
Masseglia F, Cathala F, Poncelet P. The PSP approach for mining sequential patterns. In: PKDD'98, pp. 176-184 (1998)
Mannila H, Toivonen H, Verkamo AI. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), 259-289 (1997)
Negrevergne B, Termier A, Mehaut JF, Uno T. Discovering closed frequent itemsets on multicore: parallelizing computations and optimizing memory accesses. In: HPCS'10, pp. 521-528 (2010)
Negrevergne B, Termier A, Rousset MC, Mehaut JF. ParaMiner: a generic pattern mining algorithm for multi-core architectures. Data Mining and Knowledge Discovery, 28(3) (2014)
Nguyen D, Vo B, Le B. Efficient strategies for parallel mining class association rules. Expert Systems with Applications, 41(10), 4716-4729 (2014)
Slimani T, Lazzez A. Sequential mining: patterns and algorithms analysis. International Journal of Computer and Electronics Research, 2(5), 639-647 (2013)
Tran T, Le B, Vo B. Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently. Engineering Applications of Artificial Intelligence, 183-189 (2015)
Pham T, Luo J, Hong TP, Vo B. An efficient method for mining non-redundant sequential rules using attributed prefix trees. Engineering Applications of Artificial Intelligence, 32, 88-99 (2014)
Van T, Vo B, Le B. IMSR_PreTree: an improved algorithm for mining sequential rules based on the prefix-tree. Vietnam Journal of Computer Science, 1(2), 97-105 (2014)
Pham T, Luo J, Vo B. An effective algorithm for mining closed sequential patterns and their minimal generators based on prefix trees. International Journal of Intelligent Information and Database Systems, 7(4), 324-339 (2013)
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC. Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16(11), 1424-1440 (2004)
Schlegel B, Karnagel T, Kiefer T, Lehner W. Scalable frequent itemset mining on many-core processors. In: The 9th International Workshop on Data Management on New Hardware, ACM, Article No. 3 (2013)
Raza K. Application of data mining in bioinformatics. Indian Journal of Computer Science and Engineering, 1(2), 114-118 (2013)
Vijayarani S, Deepa S. An efficient algorithm for sequence generation in data mining. IJCI, 3(1) (2014)
Wang CS, Lee AJT. Mining inter-sequence patterns. Expert Systems with Applications, 36(4), 8649-8658 (2009)
Wang W, Yang J. Mining Sequential Patterns from Large Data Sets. Advances in Database Systems, vol. 28, pp. 1-161 (2005)
Wang J, Han J. BIDE: efficient mining of frequent closed sequences. In: ICDE'04, pp. 79-90 (2004)
Weichbroth P, Owoc M, Pleszkun M. Web user navigation patterns discovery from WWW server log files. In: FedCSIS'12 (2012)
Yan X, Han J, Afshar R. CloSpan: mining closed sequential patterns in large datasets. In: SDM'03, pp. 166-177 (2003)
Yu KM, Wu SH. An efficient load balancing multi-core frequent patterns mining algorithm. In: TrustCom'11, pp. 1408-1412 (2011)
Zaki MJ, Wang JTL, Toivonen H. BIOKDD01: workshop on data mining in bioinformatics. ACM SIGKDD Explorations, 3(2), 71-73 (2002)
Zaki MJ. SPADE: an efficient algorithm for mining frequent sequences. Machine Learning, 42, 31-60 (2001a)
Zaki MJ. Parallel sequence mining on shared-memory machines. Journal of Parallel and Distributed Computing, 61(3), 401-426 (2001b)
Zubi ZS, Raiani MSE. Using web logs dataset via web mining for user behavior understanding. International Journal of Computers and Communications, 8, 103-111 (2014)