

Expert Systems with Applications 41 (2014) 1763–1768

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Fast mining Top-Rank-k frequent patterns by using Node-lists

Zhi-Hong Deng ⇑

Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

⇑ Tel.: +86 10 62755592; fax: +86 10 62754911. E-mail address: [email protected]

0957-4174/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.eswa.2013.08.075

Article info

Keywords: Data mining; Pattern mining; Top-Rank-k frequent patterns; Node-list; Algorithm

Abstract

Mining Top-Rank-k frequent patterns has emerged as a topic in frequent pattern mining in recent years. In this paper, we propose a new algorithm, NTK, for mining Top-Rank-k frequent patterns. The NTK algorithm employs a data structure, Node-list, to represent patterns. The Node-list structure makes the mining process much more efficient. We have experimentally evaluated our algorithm against two representative algorithms on four real datasets. The experimental results show that the NTK algorithm is efficient: it is at least two orders of magnitude faster than the FAE algorithm and also remarkably faster than the VTK algorithm, the recently reported state-of-the-art algorithm for mining Top-Rank-k frequent patterns.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The task of mining frequent patterns is to discover sets of items shared among a large number of transactions in a given database. Since frequent pattern mining was first proposed by Agrawal et al. (Agrawal, Imielinski, & Swami, 1993), it has emerged as a fundamental problem in data mining and plays an essential role in many important data mining tasks. After more than a decade of substantial and fruitful research, there have been hundreds of follow-up research papers on various kinds of work, ranging from scalable mining algorithms (Agrawal & Srikant, 1994; Bayardo, 1998; Han, Pei, & Yin, 2000; Burdick, Calimlim, & Gehrke, 2001; Zaki & Hsiao, 2002; Wang, Han, & Pei, 2003; Zaki & Gouda, 2003; Jin & Agrawal, 2005; Grahne & Zhu, 2005; Deng & Wang, 2010; Deng, Wang, & Jiang, 2012), to a great diversity of extended mining tasks (Tzvetkov, Yan, & Han, 2005; Hu & Mojsilovic, 2007; Jin, Xiang, & Liu, 2009; Aggarwal, Li, Wang, & Wang, 2009), and plentiful applications (Li, Lu, Myagmar, & Zhou, 2004; Cao, Mamoulis, & Cheung, 2005; Jin & Agrawal, 2005; Zaki & Hsiao, 2005; Li & Deng, 2010).

The common framework for mining frequent patterns uses a minimal support threshold to ensure the generation of the correct and complete set of frequent patterns. However, this framework leads to the following two problems (Han, Wang, Lu, & Tzvetkov, 2002) that may hamper its popular use. First, setting the minimal support threshold is quite tricky. Users cannot know the exact threshold in advance. A threshold that is too small may generate tens of thousands of patterns, while one that is too large may generate very few. Both cases are undesirable for users. Second, frequent pattern mining often generates a large number of patterns, which may be much larger than the number of interesting rules.

Based on the above observation, Han et al. proposed a new mining task (Han et al., 2002): mining Top-k frequent closed patterns of length no less than min_l, where k is the desired number of frequent closed patterns to be mined, and min_l is the minimal length of each pattern. For handling this mining task, an efficient algorithm, TFP (Wang, Han, Lu, & Tzvetkov, 2005), was proposed.

However, the above framework for mining Top-k frequent closed patterns requires the parameter min_l, the minimal length of each pattern. Setting min_l is still quite subtle: it is not easy for users to know min_l before mining patterns. What users are most likely to know is how many frequent patterns should be generated. For this reason, Deng and Fang proposed another new mining task (Deng & Fang, 2007): mining Top-Rank-k frequent patterns. Unlike mining Top-k frequent closed patterns, mining Top-Rank-k frequent patterns does not require min_l to be specified. In addition, the task of mining Top-k frequent closed patterns evaluates patterns in terms of support, while the task of mining Top-Rank-k frequent patterns evaluates patterns in terms of rank. Two algorithms, FAE (Deng & Fang, 2007) and VTK (Fang & Deng, 2008), were proposed to handle this mining task.

In this paper, we propose a new algorithm, NTK, for mining Top-Rank-k frequent patterns efficiently. The efficiency of mining is achieved by representing patterns with a data structure, Node-list, which has proved to be very useful in mining frequent patterns (Deng & Wang, 2010).

A thorough performance study has been conducted to compare the performance of NTK with two representative Top-Rank-k frequent pattern mining algorithms, FAE and VTK. Our study shows that NTK is at least two orders of magnitude faster than the FAE algorithm; it also distinctly outperforms the VTK algorithm, the state-of-the-art algorithm for mining Top-Rank-k frequent patterns.

The remainder of the paper is organized as follows. Section 2 presents a detailed problem description of mining Top-Rank-k frequent patterns and some definitions. Section 3 introduces the Node-list structure, its construction method, and some important properties. Section 4 develops a Node-list-based algorithm, NTK, for mining Top-Rank-k frequent patterns. Section 5 studies the performance of the NTK algorithm. Section 6 summarizes our study and points out some future research issues.

2. Problem definition

This section describes the problem of mining Top-Rank-k frequent patterns.

Let $I = \{i_1, i_2, \ldots, i_M\}$ be the universal item set, and let $DB = \{T_1, T_2, \ldots, T_N\}$ be a transaction database, where each $T_i$ ($i \in [1..N]$) is a transaction that has a unique identifier and contains a set of items in $I$. A set of items $A$ is called a pattern. Given a pattern $A$ and a transaction $T$, we say that $T$ contains $A$ if and only if $A \subseteq T$.

Definition 1 (The Support of a Pattern). Given a transaction database $DB$ and a pattern $A$ ($A \subseteq I$), the support of $A$ is the number of transactions in $DB$ that contain $A$. For simplicity, we denote the support of $A$ in $DB$ as $Sup_A$.

Definition 2 (The Rank of a Pattern). Given a transaction database $DB$ and a pattern $A$ ($A \subseteq I$), the rank of $A$, $R_A$, is defined as $R_A = |\{Sup_X \mid X \subseteq I \text{ and } Sup_X \ge Sup_A\}|$, where $|Y|$ is the number of elements in $Y$.

Definition 3 (Top-Rank-k frequent patterns). Given a transaction database $DB$ and a threshold $k$, a pattern $A$ ($A \subseteq I$) is called a Top-Rank-k frequent pattern if and only if $R_A$ is not larger than $k$, that is, $R_A \le k$.

Given a transaction database $DB$ and a threshold $k$, Top-Rank-k frequent pattern mining is the task of finding the complete set of frequent patterns whose ranks are no greater than $k$. That is, the set of Top-Rank-k frequent patterns is equal to $\{X \mid X \subseteq I \text{ and } R_X \le k\}$.

To better understand the above concepts, let us examine the following example.

Example 1. Let the transaction database DB shown in Table 1 be the running example.

According to Definition 1, the support of pattern {b} is 5 because five transactions, namely TID 1, TID 2, TID 3, TID 4, and TID 5, contain it. By computing the support of each pattern, we find that the support of pattern {b} is the largest one. Therefore, the rank of {b} is 1. Table 2 shows the rank and support of each pattern.

If we set k = 3, the set of Top-Rank-3 frequent patterns is {{a}, {b}, {ab}, {d}, {e}, {ad}, {ae}, {bd}, {be}, {abd}, {abe}}. Note that the set of Top-k frequent closed patterns is different from the set of Top-Rank-k frequent patterns with the same value of k. Given k = 3 and min_l = 1, the set of Top-3 frequent closed patterns is {{a}, {b}, {d}, {e}}. Given k = 3 and min_l = 2, the set of Top-3 frequent closed patterns is {{ab}, {abd}, {abe}}. Given k = 3 and min_l = 3, the set of Top-3 frequent closed patterns is {{abd}, {abe}, {abde}}. That is, no matter what the value of min_l is, the set of Top-3 frequent closed patterns is different from the set of Top-Rank-3 frequent patterns.

Table 1
A transaction database as running example.

TID   Items
1     b, c
2     a, b, c, d
3     a, b, d, e
4     a, b, e
5     a, b, d, e

In fact, Top-k frequent closed patterns and Top-Rank-k frequent patterns are different types of patterns. The task of mining Top-k frequent closed patterns evaluates patterns in terms of support and min_l, while the task of mining Top-Rank-k frequent patterns evaluates patterns in terms of rank.
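Definitions 1 to 3 can be checked on the running example with a brute-force enumeration. The sketch below is only an illustration of the definitions, not the NTK algorithm: the names `supports`, `ranks`, and `top_rank_k` are hypothetical, patterns are written as alphabetically sorted tuples, and it is assumed that only non-empty patterns occurring in at least one transaction are of interest (as in Table 2).

```python
from itertools import combinations

# Table 1: the running example (TID -> items).
DB = {
    1: {"b", "c"},
    2: {"a", "b", "c", "d"},
    3: {"a", "b", "d", "e"},
    4: {"a", "b", "e"},
    5: {"a", "b", "d", "e"},
}

def supports(db):
    """Definition 1 for every non-empty pattern with support > 0 (assumption)."""
    items = sorted(set().union(*db.values()))
    sup = {}
    for size in range(1, len(items) + 1):
        for pattern in combinations(items, size):
            count = sum(1 for t in db.values() if set(pattern) <= t)
            if count > 0:
                sup[pattern] = count
    return sup

def ranks(sup):
    """Definition 2: number of distinct support values >= the pattern's support."""
    distinct = sorted(set(sup.values()), reverse=True)
    rank_of_support = {s: r for r, s in enumerate(distinct, start=1)}
    return {p: rank_of_support[s] for p, s in sup.items()}

def top_rank_k(db, k):
    """Definition 3: all patterns whose rank is at most k."""
    rk = ranks(supports(db))
    return {p for p, r in rk.items() if r <= k}

if __name__ == "__main__":
    for p in sorted(top_rank_k(DB, 3), key=lambda p: (len(p), p)):
        print("".join(p))
```

With k = 3 this enumeration yields the eleven patterns listed in Example 1. It visits every subset of the item set, so it is only usable on toy data; avoiding exactly this cost is what the Node-list structure is for.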

3. Node-lists

In this section, we describe the Node-list structure and some of its properties.

3.1. PPC-tree

Before introducing the definition of Node-list, we first describe the PPC-tree, a structure similar to the FP-tree (Han et al., 2000).

Definition 4. A PPC-tree is a tree structure satisfying the following constraints:

(1) It consists of one root labeled as "root" and a set of item prefix subtrees as the children of the root.

(2) Each node in an item prefix subtree consists of five fields: item-name, count, children-list, pre-order, and post-order. item-name registers which item this node represents. count registers the number of transactions represented by the portion of the path reaching this node. children-list registers all children of the node. pre-order is the sequence number of the node when scanning the tree by pre-order traversal. post-order is the sequence number of the node when scanning the tree by post-order traversal.

Based on Definition 4, we have the following PPC-tree construction algorithm.

Algorithm 1. PPC-tree Construction.
Input: A transaction database DB
Output: A PPC-tree
Method: Construct-PPC-tree(DB)
1. Scan DB to find all items and their supports. Sort all items in support-descending order as Iorder. Note that if the supports of some items are equal, the order among them can be assigned arbitrarily.
2. Create the root of a PPC-tree, Tr, and label it as "root".
3. For each transaction Trans in DB do:
   Select the items in Trans and sort them according to the order of Iorder. Let the sorted item list in Trans be [p|P], where p is the first element and P is the remaining list. Call Insert_Tree([p|P], Tr).
4. Scan the PPC-tree to generate the pre-order and the post-order of each node by pre-order traversal and post-order traversal.

Function Insert_Tree([p|P], Tr)
1. If Tr has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, with its count initialized to 1, and add it to Tr's children-list.
2. If P is nonempty, call Insert_Tree(P, N) recursively.

To better understand the concept and the construction algorithm of the PPC-tree, let us examine Example 1 again.

By scanning the database, we have Iorder = {b(5), a(4), d(3), e(3), c(2)}. Note that the number in each bracket is the support of the corresponding item. Therefore, the original database can be rewritten as Table 3. Fig. 1 shows the PPC-tree generated from the database.

Table 2
The rank and support of each pattern in Table 1.

Rank  Support  Patterns
1     5        {b}
2     4        {a}, {ab}
3     3        {d}, {e}, {ad}, {ae}, {bd}, {be}, {abd}, {abe}
4     2        {c}, {bc}, {de}, {ade}, {bde}, {abde}
5     1        {ac}, {cd}, {abc}, {acd}, {bcd}, {abcd}

root (0,7)
    b:5 (1,6)
        c:1 (2,0)
        a:4 (3,5)
            d:3 (4,3)
                c:1 (5,1)
                e:2 (6,2)
            e:1 (7,4)

Fig. 1. The PPC-tree resulting from Example 1 after running Algorithm 1.

In Fig. 1, each rectangle stands for a node, and the pair of numbers in each bracket gives the pre-order and post-order of the corresponding node. The letter in a node is the item that the node registers, and the number is the count of the node. For example, the node with (3, 5) has pre-order 3, post-order 5, item-name a, and count 4.
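Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: the class `Node` and the functions `build_ppc_tree`, `assign_orders`, and `walk` are hypothetical names, ties in support are broken alphabetically (which happens to give the Iorder used in Example 1), and children are kept in insertion order so that the traversal numbers match Fig. 1.

```python
from collections import Counter

class Node:
    """One PPC-tree node with the five fields of Definition 4."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = []  # kept in insertion order (assumption)
        self.pre = None
        self.post = None

def assign_orders(root):
    """Step 4: number the nodes by pre-order and post-order traversal."""
    pre = post = 0
    def visit(node):
        nonlocal pre, post
        node.pre = pre
        pre += 1
        for child in node.children:
            visit(child)
        node.post = post
        post += 1
    visit(root)

def build_ppc_tree(transactions):
    # Step 1: supports and I_order (ties broken alphabetically, an assumption).
    support = Counter(item for t in transactions for item in set(t))
    order = sorted(support, key=lambda it: (-support[it], it))
    position = {it: i for i, it in enumerate(order)}
    # Step 2: the root.
    root = Node("root")
    # Step 3: insert each transaction, written as a loop instead of recursion.
    for t in transactions:
        node = root
        for item in sorted(set(t), key=position.__getitem__):
            child = next((c for c in node.children if c.item == item), None)
            if child is None:
                child = Node(item)
                node.children.append(child)
            child.count += 1
            node = child
    assign_orders(root)
    return root, order

def walk(node):
    """Yield all nodes in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)

if __name__ == "__main__":
    db = [["b", "c"], ["a", "b", "c", "d"], ["a", "b", "d", "e"],
          ["a", "b", "e"], ["a", "b", "d", "e"]]
    root, order = build_ppc_tree(db)
    for n in walk(root):
        print(n.item, n.count, (n.pre, n.post))
```

Run on Table 1, the sketch reproduces the (pre-order, post-order) pairs and counts of Fig. 1. A different tie-breaking rule or child order would give a different but equally valid numbering.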

3.2. Node-list: definition and property

Based on PPC-trees, we introduce Node-lists by the following definitions.

Definition 5 (PP-code). For each node N in a PPC-tree, we call <(N.pre-order, N.post-order): N.count> the PP-code of N.

Property 1 (Grust, 2002). Given any two different nodes N1 and N2 in a PPC-tree, N1 is an ancestor of N2 if and only if N1.pre-order < N2.pre-order and N1.post-order > N2.post-order.

In the rest of the paper, a pattern with l items is called an l-pattern. In addition, it should be noted that each l-pattern is represented by a sequence of items sorted according to Iorder. For example, given Iorder = {b(5), a(4), d(3), e(3), c(2)}, a 1-pattern consisting of a is denoted by {a}, and a 4-pattern consisting of a, b, c, and d is denoted by {badc}.

Definition 6 (the Node-list of a 1-pattern). Given a PPC-tree, the Node-list of an item (1-pattern) is the sequence of all PP-codes of the nodes registering the item in the PPC-tree. The PP-codes are sorted in ascending pre-order.

For example, the Node-list of {b} includes one node, whose pre-order is 1, post-order is 6, and count is 5. Fig. 2 shows the Node-lists of all 1-patterns in Example 1.

{b}   <(1,6): 5>
{a}   <(3,5): 4>

Definition 7 (the Node-list of a k-pattern). Let $P = \{i_1 i_2 \ldots i_{k-2} i_x i_y\}$ be a pattern ($k \ge 2$). Let the Node-list of $P_1 = \{i_1 i_2 \ldots i_{k-2} i_x\}$ be $\{\langle(x_{P_1,1}, y_{P_1,1}): z_{P_1,1}\rangle, \langle(x_{P_1,2}, y_{P_1,2}): z_{P_1,2}\rangle, \ldots, \langle(x_{P_1,m}, y_{P_1,m}): z_{P_1,m}\rangle\}$, and let the Node-list of $P_2 = \{i_1 i_2 \ldots i_{k-2} i_y\}$ be $\{\langle(x_{P_2,1}, y_{P_2,1}): z_{P_2,1}\rangle, \langle(x_{P_2,2}, y_{P_2,2}): z_{P_2,2}\rangle, \ldots, \langle(x_{P_2,n}, y_{P_2,n}): z_{P_2,n}\rangle\}$. The Node-list of $P$ is a sequence of PP-codes sorted in ascending pre-order and generated by the following rule: for any $\langle(x_{P_1,r}, y_{P_1,r}): z_{P_1,r}\rangle$ in the Node-list of $P_1$ ($1 \le r \le m$) and any $\langle(x_{P_2,s}, y_{P_2,s}): z_{P_2,s}\rangle$ in the Node-list of $P_2$ ($1 \le s \le n$), if $\langle(x_{P_1,r}, y_{P_1,r}): z_{P_1,r}\rangle$ is an ancestor of $\langle(x_{P_2,s}, y_{P_2,s}): z_{P_2,s}\rangle$, then $\langle(x_{P_2,s}, y_{P_2,s}): z_{P_2,s}\rangle$ belongs to the Node-list of $P$.

As shown in Fig. 2, the Node-list of {b} is {<(1,6): 5>} and the Node-list of {c} is {<(2,0): 1>, <(5,1): 1>}. According to Property 1, <(1,6): 5> is an ancestor of both <(2,0): 1> and <(5,1): 1>. Therefore, the Node-list of {bc} is {<(2,0): 1>, <(5,1): 1>} according to Definition 7.
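Read literally, Definition 7 and Property 1 translate into a quadratic pairwise check. The following sketch is only meant to make the rule concrete; `is_ancestor` and `join_node_lists` are hypothetical names, a PP-code is modeled as a `(pre_order, post_order, count)` tuple, and the Node-lists are the ones of the running example.

```python
# Node-lists of the 1-patterns in the running example,
# each PP-code written as (pre_order, post_order, count).
NODE_LISTS = {
    "b": [(1, 6, 5)],
    "a": [(3, 5, 4)],
    "d": [(4, 3, 3)],
    "e": [(6, 2, 2), (7, 4, 1)],
    "c": [(2, 0, 1), (5, 1, 1)],
}

def is_ancestor(code1, code2):
    """Property 1: smaller pre-order and larger post-order means ancestor."""
    return code1[0] < code2[0] and code1[1] > code2[1]

def join_node_lists(nl1, nl2):
    """Definition 7, read literally: keep each PP-code of nl2 that has an
    ancestor in nl1. nl2 is already sorted by pre-order, so the result is too."""
    return [c2 for c2 in nl2 if any(is_ancestor(c1, c2) for c1 in nl1)]

if __name__ == "__main__":
    print(join_node_lists(NODE_LISTS["b"], NODE_LISTS["c"]))  # Node-list of {bc}
    print(join_node_lists(NODE_LISTS["d"], NODE_LISTS["e"]))  # Node-list of {de}
```

This version compares every pair of PP-codes, so it costs O(mn) for lists of lengths m and n; it is useful mainly as a reference against which a faster merge can be checked.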

Table 3
The transaction database with sorted items.

TID   Sorted items
1     b, c
2     b, a, d, c
3     b, a, d, e
4     b, a, e
5     b, a, d, e

Node-lists have the following important property (Deng & Wang, 2010).

Property 2. Given the Node-list of an l-pattern $P = \{i_1 i_2 \ldots i_l\}$, denoted by $\{\langle(x_1, y_1): z_1\rangle, \langle(x_2, y_2): z_2\rangle, \ldots, \langle(x_m, y_m): z_m\rangle\}$, the support of $P$ is $z_1 + z_2 + \ldots + z_m$.

For example, the Node-list of {bc} is {<(2,0): 1>, <(5,1): 1>}. By summing the count of each PP-code in the Node-list, we get 2 (1 + 1). By scanning the database shown in Table 1, we find that two transactions contain {bc}. Therefore, the support of {bc} is 2.

Property 2 means that we can obtain the support of a pattern by simply scanning its Node-list without scanning the whole transaction database. This greatly reduces the time for computing the supports of patterns.
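Property 2 can be cross-checked against a direct database scan on the running example. In the sketch below the function names are hypothetical, PP-codes are `(pre_order, post_order, count)` tuples, and the Node-list of {ae} is not given in the text but derived here from Fig. 2 by applying Definition 7.

```python
# Table 1 as a list of transactions.
DB = [{"b", "c"}, {"a", "b", "c", "d"}, {"a", "b", "d", "e"},
      {"a", "b", "e"}, {"a", "b", "d", "e"}]

# Node-lists from the running example; {ae} is derived via Definition 7.
NODE_LISTS = {
    ("b",): [(1, 6, 5)],
    ("b", "c"): [(2, 0, 1), (5, 1, 1)],
    ("a", "e"): [(6, 2, 2), (7, 4, 1)],
}

def support_from_node_list(node_list):
    """Property 2: the support is the sum of the counts in the Node-list."""
    return sum(count for _pre, _post, count in node_list)

def support_by_scan(db, pattern):
    """Definition 1: count the transactions that contain the pattern."""
    return sum(1 for t in db if set(pattern) <= t)

if __name__ == "__main__":
    for pattern, node_list in NODE_LISTS.items():
        print(pattern, support_from_node_list(node_list), support_by_scan(DB, pattern))
```

The first function touches only the PP-codes of one pattern, while the second touches every transaction, which is the saving that Property 2 describes.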

4. Proposed algorithm

Before presenting our algorithm, let us first explore one important lemma relevant to Top-Rank-k frequent pattern mining.

Lemma 1. If $A$ is not a Top-Rank-k frequent pattern, then no pattern $B$ containing $A$ (that is, $A \subseteq B$) can be a Top-Rank-k frequent pattern.

Proof. Let $B$ be a superset of $A$. For any transaction $T$, if $B \subseteq T$, then $A \subseteq T$. According to Definition 1, the support of $A$ is therefore not less than the support of $B$. According to Definition 2, $R_A$ is then not larger than $R_B$. Therefore, if $A$ is not a Top-Rank-k frequent pattern, that is, $R_A > k$, then $R_B \ge R_A > k$ and $B$ cannot be a Top-Rank-k frequent pattern. □

Lemma 1 shows the anti-monotone property of Top-Rank-k frequent patterns. We will employ this property to discover all Top-Rank-k frequent patterns, from short patterns to long patterns, by using an iterative approach known as level-wise search (Agrawal & Srikant, 1994).

{d}   <(4,3): 3>
{e}   <(6,2): 2> <(7,4): 1>
{c}   <(2,0): 1> <(5,1): 1>

Fig. 2. The Node-lists of all 1-patterns in Example 1.

NTK employs l-patterns to explore (l + 1)-patterns. By using Node-lists, we do not need to scan the database repeatedly to get the supports of (l + 1)-patterns. Instead, we just need to intersect the Node-lists of l-patterns.

The processing procedure is as follows.

(1) Scan the PPC-tree and generate the Node-lists of all 1-patterns. Find the Top-Rank-k frequent 1-patterns and insert them into the Top-Rank-k table. The Top-Rank-k table contains patterns and their supports. All patterns with the same support are stored in the same entry. The number of entries in the Top-Rank-k table is not more than the threshold k.

(2) Use the 1-patterns in the Top-Rank-k table to generate candidate 2-patterns. If the support of a candidate 2-pattern is not less than the smallest support in the Top-Rank-k table, the candidate 2-pattern is inserted into the Top-Rank-k table. After each insertion, the Top-Rank-k table is checked to ensure that the number of entries is not more than k. If the number is larger than k, the entries whose support is less than the k-th largest support are deleted from the Top-Rank-k table.

(3) Repeat procedure (2), using the l-patterns in the Top-Rank-k table to generate candidate (l + 1)-patterns, until no new candidate patterns can be generated.
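The bookkeeping of the Top-Rank-k table in steps (1) and (2) can be sketched as below. The class `TopRankKTable` and its methods are hypothetical names, and the sketch makes two assumptions that the text does not spell out: k is at least 1, and while the table holds fewer than k entries every pattern is accepted regardless of its support.

```python
class TopRankKTable:
    """Maps each support value (one entry) to the patterns having that support."""

    def __init__(self, k):
        self.k = k          # assumed to be >= 1
        self.entries = {}   # support -> list of patterns

    def min_support(self):
        # Assumption: while fewer than k entries exist, any support qualifies.
        if len(self.entries) < self.k:
            return 0
        return min(self.entries)

    def insert(self, pattern, support):
        """Insert the pattern if its support is not less than the smallest
        support in the table, then prune back to at most k entries."""
        if support < self.min_support():
            return False
        self.entries.setdefault(support, []).append(pattern)
        while len(self.entries) > self.k:
            del self.entries[min(self.entries)]
        return True

if __name__ == "__main__":
    table = TopRankKTable(3)
    for pattern, support in [("c", 2), ("d", 3), ("a", 4), ("b", 5),
                             ("e", 3), ("ab", 4), ("de", 2)]:
        table.insert(pattern, support)
    for support in sorted(table.entries, reverse=True):
        print(support, table.entries[support])
```

In this usage example the entry for support 2 is evicted when {b} arrives with support 5, and {de} is later rejected because its support is below the smallest support kept in the table.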

Based on the above procedures, we establish the following algorithm, NTK, for mining Top-Rank-k frequent patterns.

Algorithm 2. NTK.
Input: the Node-lists of all 1-patterns in database DB
Output: a Top-Rank-k table, Tab_k, which has a fixed number of entries. Each entry contains patterns with the same rank. Note that, during the processing of NTK, Tab_k always contains the Top-Rank-k frequent patterns found so far.
Method:
Find the 1-patterns with rank not larger than k among all 1-patterns, denote the set of these 1-patterns by TR_1, and insert them into Tab_k with their supports;
For (j = 2; TR_{j-1} ≠ ∅; j++) {
    CR_j ← Candidate_gen(TR_{j-1});
    For each C ∈ CR_j, where C is generated by P1 (∈ TR_{j-1}) and P2 (∈ TR_{j-1}), and the Node-lists of C, P1, and P2 are denoted by C.Nodelist, P1.Nodelist, and P2.Nodelist respectively, do {
        C.Nodelist ← NL_intersection(P1.Nodelist, P2.Nodelist);
    }
    Temp ← {P | P is a pattern in an entry of Tab_k};
    Candidate ← CR_j ∪ Temp;
    Find the patterns with rank not larger than k in Candidate and insert these patterns with their supports into Tab_k;
    TR_j ← {P | P is a pattern in an entry of Tab_k} − Temp;
}

Procedure Candidate_gen(TR_{j-1})
CR_j ← ∅;
For each C_u ∈ TR_{j-1} {
    For each C_v ∈ TR_{j-1} (C_v ≠ C_u) {
        // C_k[i] means the i-th item of C_k.
        If (C_u[1] = C_v[1] ∧ C_u[2] = C_v[2] ∧ … ∧ C_u[j−2] = C_v[j−2] ∧ C_u[j−1] ≠ C_v[j−1]) then {
            C ← C_u[1] C_u[2] … C_u[j−2] C_u[j−1] C_v[j−1];
            If each (j−1)-subset of C belongs to TR_{j-1} then {
                CR_j ← CR_j ∪ {C};
            }
        }
    }
}
Return CR_j;

Procedure NL_intersection(NL1, NL2)
NL ← ∅; // The resulting Node-list.
i ← 0; // Point to the start of NL1.
j ← 0; // Point to the start of NL2.
while (i < NL1.size() && j < NL2.size()) {
    if (NL1[i].pre_code < NL2[j].pre_code) {
        if (NL1[i].post_code > NL2[j].post_code) {
            Insert NL2[j] into NL;
            j++;
        }
        else i++;
    }
    else j++; // NL1[i].pre_code > NL2[j].pre_code
}
return NL;

Note that, to obtain the Node-list of an (l + 1)-pattern from two l-patterns more efficiently, we adopt NL_intersection, proposed by Deng and Wang (2010). NL_intersection generates a Node-list with time complexity O(m + n), where m and n are the cardinalities of the two input Node-lists.
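The two procedures of Algorithm 2 can be sketched in Python on the running example. This is an illustrative sketch, not the author's code: patterns are tuples of items sorted by Iorder, PP-codes are `(pre_order, post_order, count)` tuples, and `candidate_gen` only joins C_u with C_v when the last item of C_u precedes the last item of C_v in Iorder. That last restriction is an assumption made here so that the Node-list of the first parent always holds the potential ancestors.

```python
from itertools import combinations

I_ORDER = ["b", "a", "d", "e", "c"]          # from Example 1
POS = {item: i for i, item in enumerate(I_ORDER)}

NODE_LISTS = {                                # Fig. 2, as (pre, post, count)
    ("b",): [(1, 6, 5)],
    ("a",): [(3, 5, 4)],
    ("d",): [(4, 3, 3)],
    ("e",): [(6, 2, 2), (7, 4, 1)],
}

def candidate_gen(tr_prev):
    """Return {candidate: (P1, P2)} for all joinable pairs in tr_prev."""
    prev = set(tr_prev)
    candidates = {}
    for cu in prev:
        for cv in prev:
            # Same prefix, and cu's last item comes first in I_ORDER (assumption).
            if cu[:-1] == cv[:-1] and POS[cu[-1]] < POS[cv[-1]]:
                c = cu + (cv[-1],)
                # Keep c only if every sub-pattern one item shorter is in tr_prev.
                if all(sub in prev for sub in combinations(c, len(c) - 1)):
                    candidates[c] = (cu, cv)
    return candidates

def nl_intersection(nl1, nl2):
    """Linear merge: keep the PP-codes of nl2 that have an ancestor in nl1."""
    result = []
    i = j = 0
    while i < len(nl1) and j < len(nl2):
        pre1, post1, _ = nl1[i]
        pre2, post2, _ = nl2[j]
        if pre1 < pre2:
            if post1 > post2:        # nl1[i] is an ancestor of nl2[j]
                result.append(nl2[j])
                j += 1
            else:                    # nl1[i] lies entirely before nl2[j]
                i += 1
        else:                        # nl2[j] lies before nl1[i]
            j += 1
    return result

def candidate_supports(tr_prev, node_lists):
    """Support of every candidate, obtained from its intersected Node-list."""
    sup = {}
    for c, (p1, p2) in candidate_gen(tr_prev).items():
        nl = nl_intersection(node_lists[p1], node_lists[p2])
        sup[c] = sum(count for _pre, _post, count in nl)
    return sup

if __name__ == "__main__":
    for c, s in sorted(candidate_supports(NODE_LISTS.keys(), NODE_LISTS).items()):
        print("".join(c), s)
```

Starting from the Top-Rank-3 1-patterns {b}, {a}, {d}, and {e}, the sketch produces six candidate 2-patterns whose supports agree with Table 2; {de} has support 2 and would then be rejected by the Top-Rank-3 table. Each PP-code is visited at most once, which is where the O(m + n) bound comes from.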

5. Experiments

In this section, we report the experiments in which the running time of NTK was compared with that of FAE and VTK.

5.1. Experiment setup

We used four real datasets, which have often been used in previous studies of frequent pattern mining, in our experiments. These datasets are mushroom, pumsb, retail, and kosarak. We downloaded them from http://fimi.ua.ac.be/data/. Table 4 shows the characteristics of these real datasets: the average transaction length (denoted as Avg. Length), the number of items (denoted as #Items), and the number of transactions (denoted as #Trans) in each dataset.

All the experiments were performed on a Dell PC with a 2.66 GHz Intel Core2 Duo CPU and 2 GB of memory. The operating system was Microsoft Windows XP. All the programs were coded in MS/Visual C++.

Because the three algorithms have different requirements, their input data differ. For FAE, the input data are the original datasets. For VTK, the input data are Tid-lists (Fang & Deng, 2008) converted from the original datasets. For NTK, the input data are Node-lists converted from the original datasets. Table 5 shows the size comparison of the original datasets, the corresponding Tid-lists, and the corresponding Node-lists.

From Table 5, we find that the original datasets are smaller than both their Tid-lists and their Node-lists, except for mushroom, whose Node-lists are smaller than the original dataset. The reason is that Tid-lists and Node-lists use techniques such as pointers, which need additional storage. However, mushroom is very dense. Node-lists are generated from a PPC-tree, which by definition is a prefix tree, and a prefix tree can achieve very good compression when the dataset is dense. Therefore, the Node-lists of mushroom are smaller than mushroom itself.

Table 6 shows the time for converting. Compared with the running time shown in the next subsection, the time for converting is negligible.

Table 4
The summary of the datasets used in our experiments.

Database   Avg. Length  #Items   #Trans
Mushroom   23           120      8124
Pumsb      74           2113     49,046
Retail     10.3         16,470   88,162
Kosarak    8.1          41,270   990,002

Table 5
Size comparison of the original datasets, Tid-lists, and Node-lists.

Database   Original dataset size  Tid-lists size  Node-lists size
Mushroom   557 KB                 889 KB          383 KB
Pumsb      15.9 MB                20.4 MB         18.5 MB
Retail     4 MB                   5.4 MB          11 MB
Kosarak    30.5 MB                47.7 MB         83.8 MB

Table 6
Converting time for Tid-lists and Node-lists.

Database   Converting time for Tid-lists (sec.)  Converting time for Node-lists (sec.)
Mushroom   0.015                                 0.031
Pumsb      0.391                                 0.812
Retail     0.00                                  0.015
Kosarak    0.891                                 31.39

Table 7
Running time on mushroom.

k            100    200    500     1,000
#Patterns    469    1723   57,607  1,504,515
FAE (sec.)   1.52   4.98   67.6    N/A
VTK (sec.)   0.11   0.281  2.05    51.9
NTK (sec.)   0      0.015  0.22    14.3

Table 8
Running time on pumsb.

k            100    200    500    1,000
#Patterns    169    359    976    2299
FAE (sec.)   8.3    20.6   68.8   276.6
VTK (sec.)   0.17   0.45   1.5    4.3
NTK (sec.)   0      0.016  0.09   0.3

Table 9
Running time on retail.

k            100    200    500    700
#Patterns    102    220    945    4379
FAE (sec.)   7.1    31.4   658.8  N/A
VTK (sec.)   0.078  0.17   1.41   14.9
NTK (sec.)   0.016  0.047  0.95   13.9

Table 10
Running time on kosarak.

k            100    200    500    1,000
#Patterns    100    201    512    1,106
FAE (sec.)   33.5   112.7  598.7  N/A
VTK (sec.)   0.6    1.3    3.8    9.4
NTK (sec.)   0.03   0.20   1.2    4.2

5.2. Running time

Tables 7–10 show the running time of the compared algorithms on mushroom, pumsb, retail, and kosarak for different values of k. The first row of each table gives the value of k. The second row gives the number of Top-Rank-k frequent patterns (denoted by #Patterns) discovered with the value of k in the same column. The remaining rows give the running time of each algorithm with the value of k in the same column.

Table 7 shows the running time of the compared algorithms on mushroom. NTK runs fastest among the three algorithms for all values of k. NTK is about two orders of magnitude faster than FAE. For small values of k, such as 100, 200, and 500, NTK is about one order of magnitude faster than VTK. When k is equal to 1000, the running time of NTK is between one quarter and one third of that of VTK (14.3 s versus 51.9 s). We also find that FAE cannot discover all Top-Rank-k frequent patterns within 1000 s when k = 1000. An interesting phenomenon is that the number of patterns increases exponentially as k goes from 100 to 1000. This confirms the previous study (Zaki & Hsiao, 2005), which points out that mushroom contains millions of frequent patterns even if the support threshold is low.

Table 8 shows the running time of the compared algorithms on pumsb. NTK also runs fastest. On average, NTK is nearly three orders of magnitude faster than FAE and about one order of magnitude faster than VTK.
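The "orders of magnitude" statements can be recomputed from the reported running times as base-10 logarithms of the time ratios. The sketch below does this for Table 8; the names `PUMSB`, `orders_of_magnitude`, and `speedups` are hypothetical, and the k = 100 column is left out because NTK's time is reported as 0 there, which would make the ratio undefined.

```python
import math

# Running times in seconds copied from Table 8 (pumsb): k -> (FAE, VTK, NTK).
PUMSB = {
    200: (20.6, 0.45, 0.016),
    500: (68.8, 1.5, 0.09),
    1000: (276.6, 4.3, 0.3),
}

def orders_of_magnitude(slower, faster):
    """log10 of the running-time ratio; 2.0 means a factor of 100."""
    return math.log10(slower / faster)

def speedups(table):
    """k -> (orders of magnitude over FAE, orders of magnitude over VTK)."""
    return {
        k: (orders_of_magnitude(fae, ntk), orders_of_magnitude(vtk, ntk))
        for k, (fae, vtk, ntk) in table.items()
    }

if __name__ == "__main__":
    for k, (vs_fae, vs_vtk) in speedups(PUMSB).items():
        print(f"k={k}: {vs_fae:.2f} orders vs FAE, {vs_vtk:.2f} orders vs VTK")
```

For the three columns used, the values come out close to 3 against FAE and somewhat above 1 against VTK, in line with the statement above.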

Table 9 shows the running time of the compared algorithms on retail. NTK still runs fastest. On average, NTK is again nearly three orders of magnitude faster than FAE. However, compared with VTK, the advantage of NTK is not as large as on the previous two datasets.

Table 10 shows the running time of the compared algorithms on kosarak. NTK runs fastest once again. On average, NTK is still nearly three orders of magnitude faster than FAE. As on retail, NTK does not show a very distinct advantage over VTK.

In summary, no matter which dataset is chosen, the experimental results suggest that NTK is efficient: it is at least two orders of magnitude faster than the FAE algorithm and also remarkably faster than the VTK algorithm, the state-of-the-art algorithm for mining Top-Rank-k frequent patterns.

6. Conclusions and future work

In this paper, we have presented a new algorithm, NTK, which uses the Node-list structure to represent patterns. Through extensive experiments, the NTK algorithm has been shown to gain a significant performance improvement over FAE and VTK on various datasets.

As indicated by our experiments, Top-Rank-k frequent pattern mining may still generate a huge number of patterns. Therefore, the extension of NTK to mine Top-Rank-k compressed frequent patterns, such as maximal frequent patterns and closed frequent patterns, is an interesting topic for future research.

Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61170091).

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between set of items in large databases. In SIGMOD (pp. 207–216).

Agrawal, R., & Srikant, R. (1994). Fast algorithm for mining association rules. In VLDB (pp. 487–499).

Aggarwal, C., Li, Y., Wang, J., & Wang, J. (2009). Frequent pattern mining with uncertain data. In SIGKDD (pp. 29–37).

Bayardo, R. J. (1998). Efficiently mining long patterns from databases. In SIGMOD (pp. 85–93).

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent pattern algorithm for transactional databases. In ICDE (pp. 443–452).

Cao, H., Mamoulis, N., & Cheung, D. W. (2005). Mining frequent spatio-temporal sequential patterns. In ICDM (pp. 82–89).

Deng, Z., & Fang, G. (2007). Mining top-rank-k frequent patterns. In ICMLC (pp. 851–856).

Deng, Z., & Wang, Z. (2010). A new fast vertical method for mining frequent patterns. International Journal of Computational Intelligence Systems, 3(6), 733–744.

Deng, Z., Wang, Z., & Jiang, J. (2012). A new algorithm for fast mining frequent itemsets using N-lists. SCIENCE CHINA Information Sciences, 55(9), 2008–2030.

Fang, G., & Deng, Z. (2008). VTK: Vertical mining of top-rank-k frequent patterns. In FSKD (pp. 620–624).

Grahne, G., & Zhu, J. (2005). Fast algorithms for frequent pattern mining using FP-trees. IEEE TKDE Journal, 17(10), 1347–1362.

Grust, T. (2002). Accelerating xpath location steps. In SIGMOD (pp. 109–120).

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In SIGMOD (pp. 1–12).

Han, J., Wang, J., Lu, Y., & Tzvetkov, P. (2002). Mining top-k frequent closed patterns without minimum support. In ICDM (pp. 211–218).

Hu, J., & Mojsilovic, A. (2007). High-utility pattern mining: A method for discovery of high-utility item sets. Pattern Recognition, 40(11), 3317–3324.

Jin, R., & Agrawal, G. (2005). An algorithm for in-core frequent pattern mining on streaming data. In ICDM (pp. 210–217).

Jin, R., Xiang, Y., & Liu, L. (2009). Cartesian contour: A concise representation for a collection of frequent sets. In SIGKDD (pp. 417–425).

Li, X., & Deng, Z. (2010). Mining frequent patterns from network flows for monitoring network. Expert Systems with Applications, 37(12), 8850–8860.

Li, Z., Lu, S., Myagmar, S., & Zhou, Y. (2004). CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In OSDI (pp. 289–302).

Tzvetkov, P., Yan, X., & Han, J. (2005). TSP: Mining top-k closed sequential patterns. Knowledge Information System, 7(4), 438–457.

Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching for the best strategies for mining frequent closed patterns. In KDD (pp. 236–245).

Wang, J., Han, J., Lu, Y., & Tzvetkov, P. (2005). TFP: An efficient algorithm for mining top-k frequent closed patterns. IEEE Transactions on Knowledge and Data Engineering, 17(5), 652–664.

Zaki, M. J., & Gouda, K. (2003). Fast vertical mining using diffsets. In SIGKDD (pp. 326–335).

Zaki, M. J., & Hsiao, C. J. (2002). CHARM: An efficient algorithm for closed pattern mining. In SDM (pp. 457–473).

Zaki, M. J., & Hsiao, C. J. (2005). Efficient algorithm for mining closed patterns and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462–478.