

Expert Systems with Applications 40 (2013) 2305–2311


CAR-Miner: An efficient algorithm for mining class-association rules

Loan T.T. Nguyen a, Bay Vo b,*, Tzung-Pei Hong c,d, Hoang Chi Thanh e

a Faculty of Information Technology, VOV College, Ho Chi Minh, Viet Nam
b Information Technology College, Ho Chi Minh, Viet Nam
c Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, ROC
d Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan, ROC
e Department of Informatics, Ha Noi University of Science, Ha Noi, Viet Nam


Keywords: Accuracy; Classification; Class-association rules; Data mining; Tree structure

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2012.10.035

* Corresponding author.
E-mail addresses: [email protected] (L.T.T. Nguyen), vdbay@itc.edu.vn (B. Vo), [email protected] (T.-P. Hong), [email protected] (H.C. Thanh).

Abstract

Building a high-accuracy classifier is a problem in real applications. One high-accuracy type of classifier is based on association rules. In the past, some studies showed that classification based on association rules (class-association rules, or CARs) achieves higher accuracy than other rule-based methods such as ILA and C4.5. However, mining CARs consumes more time because it mines a complete rule set. Improving the execution time for mining CARs is therefore one of the main problems with this method that needs to be solved. In this paper, we propose a new method for mining class-association rules. Firstly, we design a tree structure for storing the frequent itemsets of a dataset. Some theorems for pruning nodes and computing information in the tree are then developed, and based on these theorems, we propose an efficient algorithm for mining CARs. Experimental results show that our approach is more efficient than those used previously.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Motivation

Classification plays an important role in decision support systems. Many methods for mining classification rules have been developed, including C4.5 (Quinlan, 1992) and ILA (Tolun & Abu-Soud, 1998; Tolun, Sever, Uludag, & Abu-Soud, 1999). Recently, a new classification method from data mining, called classification based on associations (CBA), has been proposed for mining class-association rules (CARs). This method has advantages over heuristic and greedy methods in that it can easily remove noise, and its accuracy is thus higher. It also generates a more complete rule set than C4.5 and ILA. In association rule mining, the target attribute (or class attribute) is not pre-determined; however, the target attribute must be pre-determined in classification problems. Thus, some algorithms for mining classification rules based on association rule mining have been proposed. Examples include classification based on predictive association rules (Yin & Han, 2003), classification based on multiple association rules (Li, Han, & Pei, 2001), classification based on associations (CBA; Liu, Hsu, & Ma, 1998), multi-class, multi-label associative classification (Thabtah, Cowling, & Peng, 2004), multi-class classification based on association rules (Thabtah, Cowling, & Peng, 2005), an associative classifier based on maximum entropy (Thonangi & Pudi, 2005), Noah (Giuffrida, Chu, & Hanssens, 2000), and the use of the equivalence class rule tree (Vo & Le, 2008). Some studies have also reported that classifiers based on class-association rules are more accurate than traditional methods such as C4.5 and ILA, both theoretically (Veloso, Meira, & Zaki, 2006) and experimentally (Liu et al., 1998). Veloso et al. proposed lazy associative classification (Veloso et al., 2006; Veloso, Meira, Goncalves, & Zaki, 2007; Veloso, Meira, Goncalves, Almeida, & Zaki, 2011), which differs from CAR-based methods in that it predicts the class using rules mined from the projected dataset of an unknown object instead of rules mined from the whole dataset. Genetic algorithms have also been applied recently for mining CARs, and several approaches have been proposed. For example, Chien and Chen (2010) proposed a GA-based approach to build classifiers for numeric datasets and applied it to stock trading data. Kaya (2010) proposed a Pareto-optimal genetic approach for building autonomous classifiers. Qodmanan, Nasiri, and Minaei-Bidgoli (2011) proposed a GA-based method that does not require minimum support or minimum confidence thresholds. Yang, Mabu, Shimada, and Hirasawa (2011) proposed an evolutionary approach to rank rules. These algorithms were mainly based on heuristics for building classifiers.

All the above methods focused on designing algorithms for mining CARs or for building classifiers, but paid little attention to the mining time. Therefore, in this paper, we aim to propose an efficient algorithm for mining CARs based on a tree structure. Section 1.2 presents our contributions.


1.2. Our contributions

In the past, Vo and Le (2008) proposed a method for mining CARs using the equivalence class rule tree (ECR-tree), together with an efficient mining algorithm named ECR-CARM. ECR-CARM scans the dataset only once and uses object identifiers to quickly determine the supports of itemsets. However, generating and testing candidates was quite time-consuming because all itemsets with the same attributes are grouped into one node of the tree. Therefore, when joining two nodes li and lj to create a new node, ECR-CARM had to compare each element of li with each element of lj to check whether they had the same prefix. In this paper, we design a MECR-tree in which each node contains a single itemset instead of all itemsets with the same attributes. Based on this tree, some theorems are derived, and an algorithm using them is proposed for mining CARs.

1.3. Organization of our paper

The rest of this paper is organized as follows. Section 2 introduces works related to mining CARs. Section 3 presents preliminary concepts. The main contributions are presented in Section 4, in which a tree structure named the MECR-tree is developed and some theorems for quickly pruning candidates are derived. Based on the tree and these theorems, we propose an algorithm for mining CARs efficiently. Section 5 shows and discusses the experimental results. Conclusions and future work are presented in Section 6.

Table 1. An example of a training dataset.

OID  A   B   C   Class
1    a1  b1  c1  y
2    a1  b2  c1  n
3    a2  b2  c1  n
4    a3  b3  c1  y
5    a3  b1  c2  n
6    a3  b3  c1  y
7    a1  b3  c2  y
8    a2  b2  c2  n

2. Related work

Mining CARs is the discovery of all classification rules that satisfy the minimum support (minSup) and minimum confidence (minConf) thresholds. The first method for mining CARs was proposed by Liu et al. (1998). It generates all candidate 1-ruleitems and calculates their supports to find the ruleitems that satisfy minSup. It then generates all candidate 2-ruleitems from the 1-ruleitems and checks them in the same way. The authors also proposed a heuristic for building the classifier. The weak point of this method is that it generates a lot of candidates and scans the dataset many times, so it is time-consuming. The proposed algorithm therefore uses a threshold K and only generates k-ruleitems with k ≤ K. In 2000, an improved algorithm for solving the problem of imbalanced datasets was proposed (Liu, Ma, & Wong, 2000). The latter has higher accuracy than the former because it uses a hybrid approach for prediction.

Li et al. proposed a method based on the FP-tree (Li et al., 2001). The advantage of this method is that it scans the dataset only twice and uses an FP-tree to compress the dataset. It also uses the tree-projection technique to find frequent itemsets. To predict unseen data, this method finds all rules that match the data and uses a weighted χ² measure to determine the class.

Vo and Le proposed another approach based on the ECR-tree (Vo & Le, 2008). This approach develops a tree structure called the equivalence class rule tree (ECR-tree) and an algorithm called ECR-CARM for mining CARs. The algorithm scans the dataset only once and uses the intersection of object identifiers to quickly compute the supports of itemsets. Nguyen, Vo, Hong, and Thanh (2012) then proposed a method for pruning redundant rules based on a lattice.

Thabtah et al. (2004) proposed a multi-class, multi-label associative classification approach for mining CARs. This method used the rule form {(Ai1, ai1), (Ai2, ai2), ..., (Aim, aim)} → ci1 ∨ ci2 ∨ ... ∨ cil, where aij is a value of attribute Aij and cij is a class label.

Some other class-association rule mining approaches have been presented in the work of Coenen, Leng, and Zhang (2007), Giuffrida et al. (2000), Lim and Lee (2010), Liu, Jiang, Liu, and Yang (2008), Priss (2002), Sun, Wang, and Wong (2006), Thabtah et al. (2005), Thabtah, Cowling, and Hammoud (2006), Thonangi and Pudi (2005), Yin and Han (2003), Zhang, Chen, and Wei (2011), and Zhao, Tsang, Chen, and Wang (2010).

3. Preliminary concepts

Let D be the set of training data with n attributes A1, A2, ..., An and |D| objects (cases). Let C = {c1, c2, ..., ck} be a list of class labels. A specific value of an attribute Ai and a specific class are denoted by the lower-case letters a and c, respectively.

Definition 1. An itemset is a set of pairs, each consisting of an attribute and a specific value, denoted {(Ai1, ai1), (Ai2, ai2), ..., (Aim, aim)}.

Definition 2. A class-association rule r has the form {(Ai1, ai1), ..., (Aim, aim)} → c, where {(Ai1, ai1), ..., (Aim, aim)} is an itemset and c ∈ C is a class label.

Definition 3. The actual occurrence ActOcc(r) of a rule r in D is the number of rows of D that match r's condition.

Definition 4. The support of a rule r, denoted Sup(r), is the number of rows that match r's condition and belong to r's class.

For example, consider r: {(A, a1)} → y for the dataset in Table 1. We have ActOcc(r) = 3 and Sup(r) = 2, because there are three objects with A = a1, two of which have class y.
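To make Definitions 3 and 4 concrete, the following sketch computes ActOcc(r), Sup(r), and the rule confidence Sup(r)/ActOcc(r) over the dataset in Table 1. This is a hypothetical Python illustration of the definitions, not the authors' implementation; the names DATASET, act_occ, and support are our own.

```python
# Table 1: each row is (OID, {attribute: value}, class label).
DATASET = [
    (1, {"A": "a1", "B": "b1", "C": "c1"}, "y"),
    (2, {"A": "a1", "B": "b2", "C": "c1"}, "n"),
    (3, {"A": "a2", "B": "b2", "C": "c1"}, "n"),
    (4, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (5, {"A": "a3", "B": "b1", "C": "c2"}, "n"),
    (6, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (7, {"A": "a1", "B": "b3", "C": "c2"}, "y"),
    (8, {"A": "a2", "B": "b2", "C": "c2"}, "n"),
]

def act_occ(itemset):
    """ActOcc(r): the number of rows matching the rule's condition."""
    return sum(1 for _, row, _ in DATASET
               if all(row[a] == v for a, v in itemset))

def support(itemset, cls):
    """Sup(r): the number of rows matching the condition and the class."""
    return sum(1 for _, row, c in DATASET
               if c == cls and all(row[a] == v for a, v in itemset))

itemset, cls = [("A", "a1")], "y"      # the rule {(A, a1)} -> y
occ, sup = act_occ(itemset), support(itemset, cls)
print(occ, sup, sup / occ)             # 3 2 0.666..., as in the example above
```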

4. Mining class-association rules

4.1. Tree structure

In this paper, we modify the ECR-tree structure (Vo & Le, 2008) into the MECR-tree structure (M stands for Modification) as follows. In the ECR-tree, all itemsets with the same attributes are arranged into one group and placed in one node. Itemsets in different groups are then joined together to form itemsets with more items, which consumes much time in generating and testing itemsets. In our work, each node in the tree contains only one itemset, along with the following information:

(a) Obidset: a set of object identifiers that contain the itemset;
(b) (#c1, #c2, ..., #ck): a list of integers, where #ci is the number of records in Obidset that belong to class ci; and
(c) pos: a positive integer storing the position of the class with the maximum count, i.e., pos = argmax_{i ∈ [1, k]} {#ci}.

In the ECR-tree, the authors did not store the class counts and pos, and thus needed to compute them for all nodes. With the MECR-tree, by using the theorems presented in Section 4.2, some of these values need not be calculated.

L.T.T. Nguyen et al. / Expert Systems with Applications 40 (2013) 2305–2311 2307

For example, consider a node containing the itemset X = {(A, a3), (B, b3)} from Table 1. X is contained in objects 4 and 6, both of which belong to class y. Therefore, the node ⟨{(A, a3), (B, b3)}, 46, (2, 0)⟩, written more simply as ⟨3 × a3b3, 46, (2, 0)⟩, is generated in the tree. Its pos is 1 (underlined at position 1 of this node) because the count of class y is the maximum (2 as compared to 0). The simpler form saves memory when the tree structure is used to store itemsets: we use a bit representation for the itemset's attributes. For example, AB can be represented as 11 in binary, so the value encoding these attributes is 3. With this representation, we can use bitwise operations to join itemsets faster.
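The node layout and the bit representation can be sketched as follows. This is a hypothetical Python rendering of the structure described above (the class name MECRNode and the ATTR_BIT map are our own); attributes A, B, C are assigned the bits 1, 2, 4, so the attribute set of a joined itemset is obtained with a single bitwise OR, and two nodes have the same attributes exactly when their att fields are equal. Note that pos is 0-based here, while the paper counts positions from 1.

```python
from dataclasses import dataclass, field

# One bit per attribute: A = 001, B = 010, C = 100, so AB = 1 | 2 = 3 ("11").
ATTR_BIT = {"A": 1, "B": 2, "C": 4}

@dataclass
class MECRNode:
    att: int             # bitwise union of the itemset's attributes
    values: tuple        # the attribute values, e.g. ("a3", "b3")
    obidset: frozenset   # object identifiers containing the itemset
    count: list          # (#c1, ..., #ck): per-class counts over obidset
    pos: int = field(init=False)  # index of the class with the maximum count

    def __post_init__(self):
        self.pos = max(range(len(self.count)), key=lambda i: self.count[i])

# Node for X = {(A, a3), (B, b3)}: contained in objects 4 and 6, both class y.
node = MECRNode(att=ATTR_BIT["A"] | ATTR_BIT["B"],  # 3, i.e. "11" in binary
                values=("a3", "b3"),
                obidset=frozenset({4, 6}),
                count=[2, 0])
print(node.att, node.obidset, node.pos)  # 3 frozenset({4, 6}) 0
```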

4.2. Proposed algorithm

In this section, some theorems for fast mining of CARs are designed. Based on these theorems, we propose an efficient algorithm for mining CARs.

Theorem 1. Given two nodes ⟨att1 × values1, Obidset1, (c11, ..., c1k)⟩ and ⟨att2 × values2, Obidset2, (c21, ..., c2k)⟩, if att1 = att2 and values1 ≠ values2, then Obidset1 ∩ Obidset2 = ∅.

Proof. Since att1 = att2 and values1 ≠ values2, there exist val1 ∈ values1 and val2 ∈ values2 such that val1 and val2 have the same attribute but different values. Thus, if a record with identifier OID contains val1, it cannot contain val2. Therefore, for every OID ∈ Obidset1, it can be inferred that OID ∉ Obidset2. Thus, Obidset1 ∩ Obidset2 = ∅. □

In this theorem, we write the itemset in the form att × values for ease of use. Theorem 1 implies that if two itemsets X and Y have the same attributes but different values, they need not be combined into the itemset XY, because Sup(XY) = 0. For example, consider the two nodes ⟨1 × a1, 127, (2, 1)⟩ and ⟨1 × a2, 38, (0, 2)⟩, in which Obidset({(A, a1)}) = 127 and Obidset({(A, a2)}) = 38. Obidset({(A, a1), (A, a2)}) = Obidset({(A, a1)}) ∩ Obidset({(A, a2)}) = ∅. Similarly, Obidset({(A, a1), (B, b1)}) = 1 and Obidset({(A, a1), (B, b2)}) = 2, and it can be inferred that Obidset({(A, a1), (B, b1)}) ∩ Obidset({(A, a1), (B, b2)}) = ∅ because these two itemsets have the same attributes AB but different values.

Theorem 2. Given two nodes ⟨itemset1, Obidset1, (c11, ..., c1k)⟩ and ⟨itemset2, Obidset2, (c21, ..., c2k)⟩, if itemset1 ⊆ itemset2 and |Obidset1| = |Obidset2|, then ∀i ∈ [1, k]: c1i = c2i.

Proof. Since itemset1 ⊆ itemset2, all records containing itemset2 also contain itemset1, and therefore Obidset2 ⊆ Obidset1. Additionally, by assumption, |Obidset1| = |Obidset2|, so Obidset2 = Obidset1, and hence ∀i ∈ [1, k]: c1i = c2i. □

From Theorem 2, when we join two parent nodes into a child node, the itemset of the child node is always a superset of the itemset of each parent node. Therefore, we check their cardinalities, and if they are the same, we need not compute the count for each class or the pos of the child node, because they are the same as in the parent node.

Using these theorems, we develop an algorithm for mining CARsefficiently. By Theorem 1, we need not join two nodes with thesame attributes, and by Theorem 2, we need not compute theinformation for some child nodes.

First of all, the root node of the tree (Lr) is given child nodes, each containing a single frequent itemset. After that, the procedure CAR-Miner is called with the parameter Lr to mine all CARs from the dataset D.

The CAR-Miner procedure (Fig. 1) considers each node li with every other node lj in Lr, where j > i (Lines 2 and 5), to generate a candidate child node. For each pair (li, lj), the algorithm checks whether li.att ≠ lj.att (Line 6, using Theorem 1). If the attributes differ, it computes the three elements att, values, and Obidset for the new node O (Lines 7–9). Line 10 checks whether the number of object identifiers of li equals that of O (by Theorem 2). If so, the algorithm can copy all information from node li to node O (Lines 11–12). Otherwise, the algorithm similarly checks lj against O, and if the numbers of their object identifiers are the same (Line 13), it copies all information from node lj to node O (Lines 14–15). Otherwise, the algorithm computes O.count using O.Obidset, and then O.pos (Lines 17–18). After computing all the information for node O, the algorithm adds it to Pi (initialized empty in Line 4) if O.count[O.pos] ≥ minSup (Lines 19–20). Finally, CAR-Miner is called recursively with the new set Pi as its input parameter (Line 21).

The procedure ENUMERATE-CAR(l, minConf) generates a rule from node l. It first computes the confidence of the rule (Line 22); if the confidence satisfies minConf (Line 23), the rule is added to the set of CARs (Line 24).
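Because Fig. 1 is not reproduced here, the following sketch reconstructs CAR-Miner and ENUMERATE-CAR from the description above. It is a hypothetical Python rendering, not the authors' code; the Node fields mirror Section 4.1, the line numbers in the comments refer to Fig. 1, and minSup is taken as a support count. It applies Theorem 1 to skip same-attribute pairs and Theorem 2 to copy class counts from a parent node.

```python
from dataclasses import dataclass

@dataclass
class Node:
    att: int            # bitwise union of attributes (Section 4.1)
    values: str         # concatenated values, e.g. "a2b2"
    obidset: frozenset  # object identifiers containing the itemset
    count: list         # per-class counts (#c1, ..., #ck)
    pos: int            # index of the class with the maximum count

CARs = []

def enumerate_car(l, min_conf):
    """Lines 22-24: emit node l's rule if its confidence reaches minConf."""
    conf = l.count[l.pos] / len(l.obidset)
    if conf >= min_conf:
        CARs.append((l.values, l.pos, l.count[l.pos], conf))

def car_miner(Lr, min_sup_count, min_conf, class_of):
    """Mine CARs from the sibling nodes in Lr; class_of maps OID -> class index."""
    for i, li in enumerate(Lr):
        enumerate_car(li, min_conf)                # Line 3
        Pi = []                                    # Line 4
        for lj in Lr[i + 1:]:                      # Line 5
            if li.att == lj.att:                   # Line 6, Theorem 1: skip
                continue
            obidset = li.obidset & lj.obidset      # Lines 7-9
            if len(obidset) == len(li.obidset):    # Lines 10-12, Theorem 2
                count, pos = li.count, li.pos      #   copy li's information
            elif len(obidset) == len(lj.obidset):  # Lines 13-15, Theorem 2
                count, pos = lj.count, lj.pos      #   copy lj's information
            else:                                  # Lines 17-18: recount
                count = [0] * len(li.count)
                for oid in obidset:
                    count[class_of[oid]] += 1
                pos = max(range(len(count)), key=lambda c: count[c])
            if count[pos] >= min_sup_count:        # Lines 19-20
                Pi.append(Node(li.att | lj.att, li.values + lj.values,
                               obidset, count, pos))
        car_miner(Pi, min_sup_count, min_conf, class_of)  # Line 21
```

Seeding Lr with the eight frequent 1-itemset nodes listed at the start of Section 4.3, with a support count of 1 (10% of eight objects, rounded up) and min_conf = 0.6, should reproduce the walkthrough in the next subsection.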

4.3. An example

In this section, we use the example in Table 1 to describe the CAR-Miner process with minSup = 10% and minConf = 60%. Fig. 2 shows the results of this process.

The MECR-tree was built from the dataset in Table 1 as follows. First, the root node Lr contains all the frequent 1-itemsets:

{⟨1 × a1, 127, (2, 1)⟩, ⟨1 × a2, 38, (0, 2)⟩, ⟨1 × a3, 456, (2, 1)⟩, ⟨2 × b1, 15, (1, 1)⟩, ⟨2 × b2, 238, (0, 3)⟩, ⟨2 × b3, 467, (3, 0)⟩, ⟨4 × c1, 12346, (3, 2)⟩, ⟨4 × c2, 578, (1, 2)⟩}.

After that, procedure CAR-Miner is called with the parameter Lr.

We use the node li = ⟨1 × a2, 38, (0, 2)⟩ as an example to illustrate the CAR-Miner process. li is joined with each node following it in Lr:

- With node lj = ⟨1 × a3, 456, (2, 1)⟩: li and lj have the same attribute and different values, so nothing is generated from this pair.
- With node lj = ⟨2 × b1, 15, (1, 1)⟩: Because their attributes are different, three elements are computed: O.att = li.att | lj.att = 1 | 2 = 3, or 11 in binary; O.values = li.values ∪ lj.values = a2 ∪ b1 = a2b1; and O.Obidset = li.Obidset ∩ lj.Obidset = {3, 8} ∩ {1, 5} = ∅. Because O.count[O.pos] = 0 < minSup, O is not added to Pi.
- With node lj = ⟨2 × b2, 238, (0, 3)⟩: Because their attributes are different, three elements are computed: O.att = 1 | 2 = 3, or 11 in binary; O.values = a2 ∪ b2 = a2b2; and O.Obidset = {3, 8} ∩ {2, 3, 8} = {3, 8}. Because |li.Obidset| = |O.Obidset|, the algorithm copies all the information of li to O, i.e., O.count = li.count = (0, 2) and O.pos = 2. Because O.count[O.pos] = 2 ≥ minSup, O is added to Pi ⇒ Pi = {⟨3 × a2b2, 38, (0, 2)⟩}.
- With node lj = ⟨2 × b3, 467, (3, 0)⟩: Because their attributes are different, three elements are computed: O.att = 1 | 2 = 3, or 11 in binary; O.values = a2 ∪ b3 = a2b3; and O.Obidset = {3, 8} ∩ {4, 6, 7} = ∅. Because O.count[O.pos] = 0 < minSup, O is not added to Pi.

Fig. 1. The proposed algorithm for mining CARs.

Fig. 2. MECR-tree for the dataset in Table 1.


- With node lj = ⟨4 × c1, 12346, (3, 2)⟩: Because their attributes are different, three elements are computed: O.att = li.att | lj.att = 1 | 4 = 5, or 101 in binary; O.values = a2 ∪ c1 = a2c1; and O.Obidset = {3, 8} ∩ {1, 2, 3, 4, 6} = {3}. The algorithm computes the additional information O.count = (0, 1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = {⟨3 × a2b2, 38, (0, 2)⟩, ⟨5 × a2c1, 3, (0, 1)⟩}.

- With node lj = ⟨4 × c2, 578, (1, 2)⟩: Because their attributes are different, three elements are computed: O.att = 1 | 4 = 5, or 101 in binary; O.values = a2 ∪ c2 = a2c2; and O.Obidset = {3, 8} ∩ {5, 7, 8} = {8}. The algorithm computes the additional information O.count = (0, 1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = {⟨3 × a2b2, 38, (0, 2)⟩, ⟨5 × a2c1, 3, (0, 1)⟩, ⟨5 × a2c2, 8, (0, 1)⟩}.

Fig. 3. Numbers of CARs in the breast dataset for various minSup values.


- After Pi is created, CAR-Miner is called recursively with the parameters Pi, minSup, and minConf to create all child nodes of Pi. Consider the process of making the child nodes of node li = ⟨3 × a2b2, 38, (0, 2)⟩:
- With node lj = ⟨5 × a2c1, 3, (0, 1)⟩: Because their attributes are different, three elements are computed: O.att = li.att | lj.att = 3 | 5 = 7, or 111 in binary; O.values = a2b2 ∪ a2c1 = a2b2c1; and O.Obidset = li.Obidset ∩ lj.Obidset = {3, 8} ∩ {3} = {3} = lj.Obidset. The algorithm copies all the information of lj to O, i.e., O.count = lj.count = (0, 1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = {⟨7 × a2b2c1, 3, (0, 1)⟩}.
- Using the same process for node lj = ⟨5 × a2c2, 8, (0, 1)⟩, we obtain Pi = {⟨7 × a2b2c1, 3, (0, 1)⟩, ⟨7 × a2b2c2, 8, (0, 1)⟩}.

Fig. 4. Numbers of CARs in the German dataset for various minSup values.

Rules are easily generated in the same step as traversing li (Line 3) by calling the procedure ENUMERATE-CAR(li, minConf). For example, when traversing the node li = ⟨1 × a2, 38, (0, 2)⟩, the procedure computes the confidence of the candidate rule: conf = li.count[li.pos]/|li.Obidset| = 2/2 = 1. Because conf ≥ minConf (60%), the rule {(A, a2)} → n is added into the rule set CARs. The meaning of this rule is "If A = a2 then class = n" (support = 2 and confidence = 100%).

To show the efficiency of Theorem 2, observe that the algorithm need not compute the information of some itemsets, namely {3 × a2b2, 7 × a1b1c1, 7 × a1b2c1, 7 × a1b3c2, 7 × a2b2c1, 7 × a2b2c2, 7 × a3b1c2, 7 × a3b3c1}.

Fig. 5. Numbers of CARs in the Lymph dataset for various minSup values.

Fig. 6. Numbers of CARs in the Led7 dataset for various minSup values.

5. Experimental results

5.1. Characteristics of experimental datasets

The algorithms used in the experiments were coded in C# 2008 on a personal computer with Windows 7, a Centrino 2 × 2.53 GHz CPU, and 4 GB of RAM. The experiments were performed on datasets obtained from the UCI Machine Learning Repository (http://mlearn.ics.uci.edu). Table 4 shows the characteristics of the experimental datasets.

The experimental datasets had different features. The Breast, German, and Vehicle datasets had many attributes and distinct values but relatively few objects (records). The Led7 dataset had only a few attributes and distinct values, but a large number of objects.

5.2. Numbers of rules of the experimental datasets

Figs. 3–7 show the numbers of rules of the datasets in Table 4 for different minimum support thresholds. We used minConf = 50% for all experiments.

The results in Figs. 3–7 show that some datasets had a lot of rules. For example, the Lymph dataset had 4,039,186 rules with minSup = 1%, and the German dataset had 752,643 rules with minSup = 1%.

Fig. 7. Numbers of CARs in the vehicle dataset for various minSup values.

Table 4. The characteristics of the experimental datasets.

Dataset  #attrs  #classes  #distinct values  #Objs
Breast   12      2         737               699
German   21      2         1077              1000
Lymph    18      4         63                148
Led7     8       10        24                3200
Vehicle  19      4         1434              846

Fig. 8. The execution time for CAR-Miner and ECR-CARM in the Breast dataset.

Fig. 9. The execution time for CAR-Miner and ECR-CARM in the German dataset.

Fig. 10. The execution time for CAR-Miner and ECR-CARM in the Lymph dataset.

Fig. 11. The execution time for CAR-Miner and ECR-CARM in the Led7 dataset.

Fig. 12. The execution time for CAR-Miner and ECR-CARM in the Vehicle dataset.



5.3. Execution time

Experiments were then conducted to compare the execution times of CAR-Miner and ECR-CARM (Vo & Le, 2008). The results are shown in Figs. 8–12.

The results in Figs. 8–12 show that CAR-Miner is more efficient than ECR-CARM in all of the experiments. For example, consider the Breast dataset with minSup = 0.1%: the mining time for CAR-Miner was 1.517 s, while that for ECR-CARM was 17.136 s. The ratio is 1.517/17.136 × 100% = 8.85%.

6. Conclusions and future work

This paper proposed a new algorithm for mining CARs using a tree structure. Each node in the tree contains information for fast computation of the support of the candidate rule. In addition, using Obidsets, we can compute the supports of itemsets quickly. Some theorems were also developed; based on them, we need not compute the information for many nodes in the tree. With these improvements, the proposed algorithm outperformed the previous algorithm in all experiments.

Mining itemsets from incremental databases has been developed in recent years (Gharib, Nassar, Taha, & Abraham, 2010; Hong & Wang, 2010; Hong, Lin, & Wu, 2009; Hong, Wang, & Tseng, 2011; Lin, Hong, & Lu, 2009). It saves a lot of time and memory compared with re-mining the whole integrated database. Therefore, in the future, we will study how to apply this approach to mining CARs.

Acknowledgements

This work was supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED) under Grant No. 102.01-2012.47.

This paper was completed while the second author was visiting the Vietnam Institute for Advanced Study in Mathematics (VIASM), Ha Noi, Viet Nam.

References

Chien, Y. W. C., & Chen, Y. L. (2010). Mining associative classification rules with stock trading data – A GA-based method. Knowledge-Based Systems, 23(6), 605–614.

Coenen, F., Leng, P., & Zhang, L. (2007). The effect of threshold values on association rule based classification accuracy. Data and Knowledge Engineering, 60(2), 345–360.

Gharib, T. F., Nassar, H., Taha, M., & Abraham, A. (2010). An efficient algorithm for incremental mining of temporal association rules. Data and Knowledge Engineering, 69(8), 800–815.

Giuffrida, G., Chu, W. W., & Hanssens, D. M. (2000). Mining classification rules from datasets with large number of many-valued attributes. In 7th International conference on extending database technology: Advances in database technology (EDBT'00) (pp. 335–349). Munich, Germany.

Hong, T. P., & Wang, C. J. (2010). An efficient and effective association-rule maintenance algorithm for record modification. Expert Systems with Applications, 37(1), 618–626.

Hong, T. P., Lin, C. W., & Wu, Y. L. (2009). Maintenance of fast updated frequent pattern trees for record deletion. Computational Statistics and Data Analysis, 53(7), 2485–2499.

Hong, T. P., Wang, C. Y., & Tseng, S. S. (2011). An incremental mining algorithm for maintaining sequential patterns using pre-large sequences. Expert Systems with Applications, 38(6), 7051–7058.

Kaya, M. (2010). Autonomous classifiers with understandable rule using multi-objective genetic algorithms. Expert Systems with Applications, 37(4), 3489–3494.

Li, W., Han, J., & Pei, J. (2001). CMAR: Accurate and efficient classification based on multiple class-association rules. In 1st IEEE international conference on data mining (pp. 369–376). San Jose, CA, USA.

Lim, A. H. L., & Lee, C. S. (2010). Processing online analytics with classification and association rule mining. Knowledge-Based Systems, 23(3), 248–255.

Lin, C. W., Hong, T. P., & Lu, W. H. (2009). The pre-FUFP algorithm for incremental mining. Expert Systems with Applications, 36(5), 9498–9505.

Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. In 4th International conference on knowledge discovery and data mining (pp. 80–86). New York, USA.

Liu, B., Ma, Y., & Wong, C. K. (2000). Improving an association rule based classifier. In 4th European conference on principles of data mining and knowledge discovery (pp. 80–86). Lyon, France.

Liu, Y. Z., Jiang, Y. C., Liu, X., & Yang, S. L. (2008). CSMC: A combination strategy for multiclass classification based on multiple association rules. Knowledge-Based Systems, 21(8), 786–793.

Nguyen, T. T. L., Vo, B., Hong, T. P., & Thanh, H. C. (2012). Classification based on association rules: A lattice-based approach. Expert Systems with Applications, 39(13), 11357–11366.

Priss, U. (2002). A classification of associative and formal concepts. In The Chicago linguistic society's 38th annual meeting (pp. 273–284). Chicago, USA.

Qodmanan, H. R., Nasiri, M., & Minaei-Bidgoli, B. (2011). Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence. Expert Systems with Applications, 38(1), 288–298.

Quinlan, J. R. (1992). C4.5: Programs for machine learning. Morgan Kaufmann.

Sun, Y., Wang, Y., & Wong, A. K. C. (2006). Boosting an associative classifier. IEEE Transactions on Knowledge and Data Engineering, 18(7), 988–992.

Thabtah, F., Cowling, P., & Hammoud, S. (2006). Improving rule sorting, predictive accuracy and training time in associative classification. Expert Systems with Applications, 31(2), 414–426.

Thabtah, F., Cowling, P., & Peng, Y. (2004). MMAC: A new multi-class, multi-label associative classification approach. In 4th IEEE international conference on data mining (pp. 217–224). Brighton, UK.

Thabtah, F., Cowling, P., & Peng, Y. (2005). MCAR: Multi-class classification based on association rule. In 3rd ACS/IEEE international conference on computer systems and applications (pp. 33–39). Tunis, Tunisia.

Thonangi, R., & Pudi, V. (2005). ACME: An associative classifier based on maximum entropy principle. In 16th International conference on algorithmic learning theory (pp. 122–134). LNAI 3734, Singapore.

Tolun, M. R., & Abu-Soud, S. M. (1998). ILA: An inductive learning algorithm for production rule discovery. Expert Systems with Applications, 14(3), 361–370.

Tolun, M. R., Sever, H., Uludag, M., & Abu-Soud, S. M. (1999). ILA-2: An inductive learning algorithm for knowledge discovery. Cybernetics and Systems, 30(7), 609–628.

Veloso, A., Meira, W., Jr., & Zaki, M. J. (2006). Lazy associative classification. In 2006 IEEE international conference on data mining (ICDM'06) (pp. 645–654). Hong Kong, China.

Veloso, A., Meira, W., Jr., Goncalves, M., & Zaki, M. J. (2007). Multi-label lazy associative classification. In 11th European conference on principles of data mining and knowledge discovery (pp. 605–612). Warsaw, Poland.

Veloso, A., Meira, W., Jr., Goncalves, M., Almeida, H. M., & Zaki, M. J. (2011). Calibrated lazy associative classification. Information Sciences, 181(13), 2656–2670.

Vo, B., & Le, B. (2008). A novel classification algorithm based on association rule mining. In The 2008 Pacific Rim knowledge acquisition workshop (held with PRICAI'08) (pp. 61–75). LNAI 5465, Ha Noi, Viet Nam.

Yang, G., Mabu, S., Shimada, K., & Hirasawa, K. (2011). An evolutionary approach to rank class association rules with feedback mechanism. Expert Systems with Applications, 38(12), 15040–15048.

Yin, X., & Han, J. (2003). CPAR: Classification based on predictive association rules. In SIAM international conference on data mining (SDM'03) (pp. 331–335). San Francisco, CA, USA.

Zhang, X., Chen, G., & Wei, Q. (2011). Building a highly-compact and accurate associative classifier. Applied Intelligence, 34(1), 74–86.

Zhao, S., Tsang, E. C. C., Chen, D., & Wang, X. Z. (2010). Building a rule-based classifier – A fuzzy-rough set approach. IEEE Transactions on Knowledge and Data Engineering, 22(5), 624–638.