
P. Perner (Ed.): MLDM 2014, LNAI 8556, pp. 31–42, 2014. © Springer International Publishing Switzerland 2014

A Cost-Sensitive Based Approach for Improving Associative Classification on Imbalanced Datasets

Kitsana Waiyamai and Phoonperm Suwannarattaphoom

Data Analysis and Knowledge Discovery Laboratory (DAKDL), Computer Engineering Department, Engineering Faculty, Kasetsart University, Bangkok, Thailand

{g521455020,kitsana.w}@ku.ac.th

Abstract. Associative classification is a rule-based classification approach that has been applied in many real-world applications. An associative classifier is easily interpretable in terms of classification rules. However, there is room for improvement when associative classification is applied to imbalanced classification tasks. Existing associative classification algorithms can be limited in their performance on highly imbalanced datasets in which the class of interest is the minority class. Our objective is to improve the accuracy of the associative classifier on highly imbalanced datasets. In this paper, an effective cost-sensitive rule ranking method, named SSCR (Statistically Significant Cost-sensitive Rules), is proposed to estimate the risk of a rule in classifying unseen data. The risk of a statistically significant association rule is estimated from its classification cost induced on the training data. SSCR combines statistically significant association rules with cost-sensitive learning to build an associative classifier. Experimental results show that SSCR achieves the best performance in terms of true positive rate and recall on real-world imbalanced datasets, compared with CBA and C4.5.

Keywords: Associative classification, Imbalanced class datasets, Cost-sensitive learning.

1 Introduction

In general, the problem of class imbalance arises when there is a large difference in class distribution between the minority and majority classes: the minority class is highly skewed, with a very small proportion compared to the majority class. Absolute rarity and relative rarity are among the main class imbalance problems [1]. Absolute rarity occurs when the number of examples associated with the minority class is small in an absolute sense, whereas relative rarity occurs when objects are not rare in an absolute sense, but are rare relative to the other objects. A large number of solutions to these problems have been proposed. For absolute rarity, over-sampling can be used; however, this technique can lead to over-fitting [15,16]. For relative rarity, appropriate evaluation metrics, cost-sensitive learning, boosting and sampling are among the solutions that have been investigated [12,13,14]. With inappropriate rule evaluation metrics, many potentially interesting rules of the minority class are eliminated and classifier performance decreases. Effective rule evaluation metrics are therefore needed to estimate the risk of classification rules in classifying unseen data on imbalanced datasets.

Many efficient associative classification algorithms have been proposed in the literature [2-4]. Based on the support-confidence paradigm, these algorithms are biased toward the majority class on imbalanced datasets: they yield good accuracy for the majority class but do not perform well for the minority class. Indeed, a large number of potentially interesting rules of the minority class are eliminated by the minimum support threshold. In [5], the authors introduced the Class Correlation Ratio (CCR) measure to evaluate how a rule is correlated with the class it predicts. They combine statistically significant association rules with CCR to build an associative classifier. A higher CCR means that the rule is more positively correlated with the class it predicts than with the other classes. However, performance on the minority class is not guaranteed, since the false negative error is neglected: rules with a high CCR may produce more false negative errors than rules with a lower CCR.

In this paper, we propose a cost-sensitive approach for improving associative classification on imbalanced datasets. Statistically Significant Cost-sensitive Rules (SSCR) are generated to build an associative classifier. An effective cost-sensitive rule ranking method is proposed to estimate the risk of a rule in classifying unseen data. The risk of a statistically significant association rule is estimated from its classification cost induced on the training data. Rules with the lowest risk are preferred to rules with high risk, because these rules have a better chance of predicting the true classes. SSCR combines statistically significant association rules with cost-sensitive learning to build an associative classifier. Experimental results show that SSCR achieves the best performance in terms of true positive rate and recall on real-world imbalanced datasets where the minority class is highly skewed, compared with CBA [2] and C4.5 [9].

The rest of this paper is organized as follows. Section 2 describes the background knowledge related to associative classification, statistically significant association rules and cost-sensitive learning. Section 3 describes our SSCR technique. Section 4 compares the experimental results with related methods. Conclusion is given in Section 5.

2 Background Knowledge

This section gives background knowledge about associative classification, statistically significant rules selection and cost-sensitive learning.

2.1 Associative Classification

Associative classification is a data mining technique that combines association rule mining (ARM) and classification by using association rules in the form of class association rules (CARs) to classify new instances.

Association rule mining discovers rules from a training dataset of transactions T = {t1, t2, …, tN}, where N is the number of transactions. Let I = {i1, i2, …, im} be the set of all items, where m is the number of items. Each transaction t ⊆ I represents an itemset.


An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets. In general, the strength of an association rule is measured in terms of its support and confidence. The support of a rule is the fraction of all transactions that contain both X and Y (Equation 1). The confidence of a rule is the fraction of transactions containing X that also contain Y (Equation 2).

support(X → Y) = σ(X ∪ Y) / N    (1)

confidence(X → Y) = σ(X ∪ Y) / σ(X)    (2)

where σ(·) denotes the number of transactions containing an itemset and N is the total number of transactions.
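The following short Python sketch illustrates how Equations 1 and 2 can be computed over a toy transaction list; the transactions and function names here are illustrative only, not part of the paper.

```python
# Toy illustration of support (Equation 1) and confidence (Equation 2).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def sigma(itemset, transactions):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    """Equation 1: fraction of all transactions containing both X and Y."""
    return sigma(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Equation 2: fraction of transactions containing X that also contain Y."""
    return sigma(X | Y, transactions) / sigma(X, transactions)

print(support({"bread"}, {"milk"}, transactions))     # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3 = 0.666...
```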

A class association rule (CAR) is an association rule of the form X → c whose consequent contains only a class label. Let C = {c1, c2, …, ck} be the set of class labels, where k is the number of classes.

An associative classifier (AC) consists of three main processes: rule generation, rule selection, and classification. In the rule generation process, association rules in the form of CARs are discovered using an association rule generation algorithm such as Apriori [6] or one of its variations. In the rule selection process, pruning and ranking techniques are performed to select a small subset of CARs that yields the highest accuracy. In the classification process, unseen instances are classified using the CARs from the previous step.

2.2 Statistically Significant Rules

Webb [7] introduced the concept of statistically significant rules. A traditional rule generation process outputs both productive and unproductive rules. An association rule X → y is productive if its confidence is higher than that of every immediate generalization X\{z} → y, where z ∈ X. If confidence(X → y) = confidence(X\{z} → y), then z and y are conditionally independent given X\{z}. Therefore, unproductive and redundant rules should be eliminated.

To eliminate unproductive rules, Fisher's Exact Test (FET) is used to test the null hypothesis that, for the instances covered by X\{z}, the item z and the class y are conditionally independent. The test compares X → y with each of its immediate generalizations X\{z} → y, where z ∈ X, and computes the probability p of observing a split at least as extreme under the null hypothesis. The instances covered by X\{z} are divided into two groups: those that also contain z (i.e., satisfy X) and those that do not. Each group is further divided into the classes y and ¬y. The null hypothesis H0: p1 = p2 states that both groups have the same probability of belonging to class y; the alternative hypothesis H1 states that the two groups divide into the classes y and ¬y differently.

To test the hypothesis, H0 is rejected if p is less than the critical value α; in general α = 0.05 is used. If p > α, the null hypothesis is accepted, which indicates that the rule is unproductive. Conversely, if p ≤ α, the null hypothesis is rejected, which indicates that the rule is productive and cannot be eliminated. It follows that statistical hypothesis testing is able to discover all the significant association rules.
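As a concrete illustration (not from the paper), the test for a single rule against one of its immediate generalizations can be carried out with SciPy's implementation of Fisher's Exact Test; the counts below are invented for the example.

```python
# Hypothetical counts for rule X -> y versus its immediate generalization X\{z} -> y.
from scipy.stats import fisher_exact

a, b = 30, 10   # instances covering X:                class y / class not-y
c, d = 25, 35   # instances covering X\{z} but not z:  class y / class not-y

_, p = fisher_exact([[a, b], [c, d]], alternative="greater")
alpha = 0.05
print("productive rule" if p <= alpha else "unproductive rule", p)
```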

The process of mining significant rules is time-consuming. In order to reduce the number of rules, we consider only positively correlated rules, using the interest factor measure [10]. Although the interest factor has limitations on imbalanced datasets where the class of interest is the majority class, it can still be used to eliminate negatively correlated rules and thereby reduce the number of rules to be tested with the FET.

2.3 Cost-Sensitive Learning

Cost-sensitive classification considers the varying costs of different misclassification types. Cost-sensitive learning may be used to enhance the performance of rare class detection on imbalanced class distribution [10]. It uses a cost matrix to encode the penalty of classifying instances from one class to another class, as shown in Table 1.

Table 1. Cost matrix penalty

                            Predicted Class
                            Class = +      Class = −
Actual Class   Class = +       −1             100
               Class = −        1               0

Let C(i, j) denote the cost of predicting an instance of class i as class j. With this notation, C(+, −) is the cost of misclassifying a positive (rare class) instance as a negative (prevalent class) instance, and C(−, +) is the cost of the contrary case. A cost-sensitive classifier uses the cost matrix to evaluate the model with the lowest total cost. In highly imbalanced datasets, the cost of the positive class is much higher than that of the negative class. To cover instances of the positive class, cost-sensitive learning reduces the number of false negative errors while increasing the number of false positive errors.

The total cost of a model can be evaluated using Equation 4.

Total cost = TP × C(+, +) + FP × C(−, +) + FN × C(+, −) + TN × C(−, −)    (4)

where TP, FP, FN and TN correspond to the numbers of true positives, false positives, false negatives and true negatives, respectively.
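A minimal sketch of Equation 4 in Python, using the cost matrix of Table 1 (the confusion counts below are invented for illustration):

```python
# Cost matrix from Table 1: C[actual][predicted], classes "+" and "-".
C = {"+": {"+": -1, "-": 100},
     "-": {"+": 1,  "-": 0}}

def total_cost(tp, fp, fn, tn, C):
    """Equation 4: total cost of a model from its confusion counts."""
    return tp * C["+"]["+"] + fp * C["-"]["+"] + fn * C["+"]["-"] + tn * C["-"]["-"]

# Example: a model that misses many rare positives is heavily penalized.
print(total_cost(tp=10, fp=40, fn=25, tn=925, C))  # 10*(-1) + 40*1 + 25*100 + 0 = 2530
```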

In this paper, cost-sensitive learning is used to estimate the risk of a significant association rule in classifying unseen data. Rules with the lowest risk should be preferred to rules with high risk, because these rules have a better chance of predicting the true classes. The risk of a significant association rule is measured by its classification cost induced on the training data.

3 SSCR: Statistically Significant Cost-Sensitive Rules for Associative Classification

SSCR is composed of four components:

• Rule generation step generates all the association rules from the training data. Explanation of this step is given in Section 3.1.

• Rule pruning step determines all the potentially interesting rules using a statistical significance test. Explanation of this step is given in Section 3.2.


• Rule ranking step determines risk of rules and assigns each rule a cost to be used for classification. Explanation of this step is given in Section 3.3.

• Classification step classifies an unseen instance by using the selected cost-sensitive association rules. Explanation of this step is given in Section 3.4.

3.1 Rules Generation Step

In this step, class association rules (CARs) are generated based on Weka's Apriori implementation [8]. The minimum support and minimum confidence thresholds are set to zero in order to obtain the complete set of class association rules.

3.2 Rules Pruning Step

In this step, two pruning strategies are performed. The first pruning strategy is to keep only positively correlated rules, those with interest factor I > 1, and to prune negatively correlated rules. The correlation is calculated using Equation 5. The idea is to reduce the number of FET statistical tests in the next step.

I(X → c) = (N × σ(X ∪ c)) / (σ(X) × σ(c))    (5)
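A small sketch of this first pruning strategy (the interest-factor test of Equation 5); the counts are assumed to be pre-computed from the training data and the numbers are invented:

```python
def interest_factor(sigma_xc, sigma_x, sigma_c, n):
    """Equation 5: interest factor of rule X -> c; a value > 1 means positive correlation."""
    return (n * sigma_xc) / (sigma_x * sigma_c)

# Hypothetical counts: the rule covers 40 instances, 30 of which have class c.
n, sigma_x, sigma_c, sigma_xc = 1000, 40, 50, 30
rule_is_kept = interest_factor(sigma_xc, sigma_x, sigma_c, n) > 1
print(rule_is_kept)  # 1000*30 / (40*50) = 15.0 > 1 -> True
```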

The second pruning strategy is to keep only the statistically significant rules, using statistical hypothesis testing with the FET. Table 2 shows the 2-way contingency table used by the Fisher Exact Test (FET). For testing, we compare a rule X → y against each of its immediate generalizations X\{z} → y, where z ∈ X, and calculate the probability for every such comparison through Equation 6. Let a = σ(X ∪ y), b = σ(X ∪ ¬y), c = σ((X\{z}) ∪ ¬z ∪ y), and d = σ((X\{z}) ∪ ¬z ∪ ¬y).

p(a, b; c, d) = Σ_{i=0}^{min(b,c)} [ (a+b)! (c+d)! (a+c)! (b+d)! ] / [ n! (a+i)! (b−i)! (c−i)! (d+i)! ],  where n = a + b + c + d    (6)

Table 2. A 2-way contingency table for rule X → y, represented with the notation p(a, b; c, d)

                  class y    class ¬y
first group          a           b
second group         c           d

From Equation 6, we represent a rule in the form of the contingency tables shown in Tables 3 and 4. For rules X → y with |X| > 1, the p-value is calculated using Table 3; for rules with |X| = 1, it is calculated using Table 4. For rule pruning, we apply the commonly used critical value α = 0.05 and select only the rules that are statistically significant, i.e. those with p ≤ α. Otherwise (p > α), the rules are pruned.

Table 3. The contingency table p(a, b; c, d) for the significance test of rule X → y when |X| > 1

                      class y                   class ¬y
X                     σ(X ∪ y)                  σ(X ∪ ¬y)
(X\{z}) without z     σ((X\{z}) ∪ ¬z ∪ y)       σ((X\{z}) ∪ ¬z ∪ ¬y)


Table 4. The contingency table p(a, b; c, d) for the significance test of rule X → y when |X| = 1

          class y       class ¬y
X         σ(X ∪ y)      σ(X ∪ ¬y)
¬X        σ(¬X ∪ y)     σ(¬X ∪ ¬y)
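The p-value of Equation 6 can also be computed directly. A minimal Python sketch (assuming the four counts a, b, c, d have already been read off Table 3 or Table 4) is:

```python
from math import comb

def fet_p_value(a, b, c, d):
    """Equation 6: one-sided Fisher Exact Test p-value p(a, b; c, d)."""
    n = a + b + c + d
    p = 0.0
    for i in range(min(b, c) + 1):
        # Each term equals (a+b)!(c+d)!(a+c)!(b+d)! / (n!(a+i)!(b-i)!(c-i)!(d+i)!)
        p += comb(a + b, a + i) * comb(c + d, c - i) / comb(n, a + c)
    return p

# Example: keep the rule only if the test is significant at level alpha.
alpha = 0.05
print(fet_p_value(30, 10, 25, 35) <= alpha)
```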

In the experiments, two variants of the pruning step are considered. The first variant also prunes rules X → y with |X| = 1 (denoted Prune R1), while the second variant keeps rules with |X| = 1 without testing them. The result of this step is the set of productive, interesting rules that are positively correlated and statistically significant. Details of the pruning algorithm are given in Figure 1.

pruneRules(T, R, α, m)
    Input:  T — the set of training instances
            R — the set of rules found
            α — the significance level
            m — the pruning method (whether rules with |X| = 1 are tested)
    Output: R′ — the set of kept rules

    R′ ← ∅
    for each rule r : X → y in R do
        compute the interest factor I(r) from T                          // Equation 5
        if I(r) > 1 then
            if |X| > 1 then
                p ← p(a, b; c, d) from Table 3, computed against the
                    immediate generalizations X\{z} → y, z ∈ X           // Equation 6
                if p ≤ α then R′ ← R′ ∪ {r}
            else if m prunes rules with |X| = 1 then
                p ← p(a, b; c, d) from Table 4                           // Equation 6
                if p ≤ α then R′ ← R′ ∪ {r}
            else
                R′ ← R′ ∪ {r}
            end if
        end if
    end for
    return R′

Fig. 1. Pruning algorithm


3.3 Rule Ranking

To reduce the chance of misclassification, the rules generated in the previous step are ranked according to their degree of interestingness. We propose two ranking methods: cost-based ranking and a combination of rule size and cost-based ranking. The first ranking method is based on the concept of rule risk. Rules with the lowest risk should be preferred to rules with high risk, because these rules have a better chance of predicting the true classes. The risk of a significant association rule is measured by its classification cost induced on the training data. The cost of a rule r can be obtained from Equation 7.

cost(r) = TP_r × C(+, +) + FP_r × C(−, +) + FN_r × C(+, −) + TN_r × C(−, −)    (7)

where TP_r, FP_r, FN_r and TN_r are the confusion counts of the rule on the training data (see Figure 2).

Figure 2 shows the algorithm that determines the cost of each rule and assigns it to the rule. Note that we use the cost matrix of Table 1 in our experiments.

The second ranking method is based on the combination of rule size and cost-based ranking. Rules are ordered in descending order of rule size |X|; rules of the same size are then ordered in ascending order of their cost.

riskEstimate(T, R, cost, pc)
    Input:  T — the set of training instances
            R — the set of rules found
            cost — the cost matrix
            pc — the positive class (classes are mapped to + and − with respect to pc)

    for each rule r : X → y in R do
        tp ← fp ← fn ← tn ← 0
        for each instance t in T do
            if X ⊆ t.items and t.class = y then tp ← tp + 1
            else if X ⊆ t.items and t.class ≠ y then fp ← fp + 1
            else if X ⊄ t.items and t.class = y then fn ← fn + 1
            else tn ← tn + 1
        end for
        r.cost ← tp·cost(+, +) + fp·cost(−, +) + fn·cost(+, −) + tn·cost(−, −)   // Equation 7
    end for

Fig. 2. Risk estimate algorithm
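Below is a hedged Python sketch of this ranking step (not the authors' code). It assumes a rule is a pair (antecedent itemset, predicted class label), instances are pairs (itemset, class label), and the cost matrix C is indexed as C[actual][predicted] with classes "+" and "-", as in Table 1.

```python
def rule_cost(rule, instances, C, pc="+", nc="-"):
    """Classification cost of one rule on the training data, in the spirit of
    Equation 7: the rule predicts its own class on the instances it covers and
    the opposite class elsewhere; each prediction is charged C[actual][predicted]."""
    X, y = rule
    other = nc if y == pc else pc
    return sum(C[actual][y if X <= items else other] for items, actual in instances)

def rank_rules(rules, instances, C, by_size=True):
    """Cost-based ranking; if by_size, order by rule size (descending) first,
    breaking ties by cost (ascending), as in the second ranking method."""
    costs = {r: rule_cost(r, instances, C) for r in rules}
    key = (lambda r: (-len(r[0]), costs[r])) if by_size else (lambda r: costs[r])
    return sorted(rules, key=key), costs
```

Sorting larger rules first and breaking ties by lower cost mirrors the |X|-then-cost ordering described above.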

3.4 Classification

Once all the class association rules have been ranked, together they constitute a classifier which can be used to predict unseen instances. Figure 3 shows how the SSCR algorithm performs rule selection to predict an unseen instance. SSCR makes a decision by selecting the rules with the lowest misclassification cost among all the matching rules.

If all the matching rules have equal cost and predict the same class, then SSCR will assign that class to the unseen instance.


If all the matching rules have equal cost but predict different classes, then SSCR assigns the class with the majority vote to the unseen instance. If there is no majority class, SSCR proceeds to the matching rules with the next higher cost and classifies the unseen instance by the method described above.

If SSCR cannot choose any class from the previous steps, it assigns the class with the majority vote over all the matching rules. If there is still no majority class, SSCR assigns the default class (the positive class) to the unseen instance. In the case where no rule matches the unseen instance, SSCR assigns the default class as well.

classifyInstance(R, t, pc)
    Input:  R — the set of rules found, with their costs
            t — the instance to classify
            pc — the positive class (default class)
    Output: c — the predicted class

    c ← pc                                               // default: positive class
    M ← { r ∈ R | r.X ⊆ t.items }                        // matching rules
    if M = ∅ then return c
    repeat
        L ← the rules in M with the lowest remaining cost
        if all rules in L predict the same class, or a majority class exists in L then
            c ← that class; return c
        end if
        M ← M \ L                                        // try the next higher cost
    until M = ∅
    c ← the majority class over all matching rules, if one exists; otherwise pc
    return c

Fig. 3. SSCR classifier algorithm
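A matching Python sketch of the classification step (an illustrative reading of Figure 3, using the same rule and cost representations as the previous sketch) could look like this:

```python
from collections import Counter

def classify(rules, costs, instance_items, default_class):
    """Predict a class for one instance: take the cheapest matching rules first,
    fall back to majority voting and finally to the default (positive) class."""
    matching = [r for r in rules if r[0] <= instance_items]
    if not matching:
        return default_class
    for level in sorted({costs[r] for r in matching}):        # lowest cost first
        votes = Counter(r[1] for r in matching if costs[r] == level)
        top = votes.most_common(2)
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]                                   # unanimous or clear majority
    votes = Counter(r[1] for r in matching)                    # no level decided: vote over all
    top = votes.most_common(2)
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]
    return default_class
```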


4 Experimental Results

In this section, we present the experimental results of the SSCR method. We select four datasets from the UCI machine learning repository [11]: breast-cancer, yeast3, yeast6, and abalone19. The four datasets have imbalance ratios (IR) varying from low to high. Table 5 summarizes the properties of the selected datasets. For each dataset, we give the number of instances (#Ins), the number of attributes (#Atts), the number of instances in the positive class (#Pcs), the number of instances in the negative class (#Ncs) and the number of classes (#Cls).

Table 5. Imbalanced datasets information

Datasets        #Ins   #Atts   #Pcs   #Ncs   #Cls
Breast-Cancer    286     10      85    201     2
Yeast3          1484      9     163   1321     2
Yeast6          1484      9      35   1449     2
Abalone19       4174      9      32   4142     2

Algorithm   Threshold                        Measure   Ranking   Breast Cancer   Yeast3    Yeast6    Abalone19
CBA         1%, 25%                          TP Rate             36.5%           23.3%      0.0%       0.0%
                                             FN Rate             63.5%           76.7%    100.0%     100.0%
C4.5        100%                             TP Rate             37.6%           69.9%     42.9%       0.0%
                                             FN Rate             62.4%           30.1%     57.1%     100.0%
SSCR        0.1, (-1, 100; 1, 0), Prune R1   TP Rate   c         85.9%           90.8%     80.0%      50.0%
                                                       rc        72.9%           90.8%     80.0%      56.3%
                                             FN Rate   c         14.1%            9.2%     20.0%      50.0%
                                                       rc        27.1%            9.2%     20.0%      43.8%
            0.1, (-1, 100; 1, 0)             TP Rate   c         85.9%           90.8%     80.0%      50.0%
                                                       rc        72.9%           90.8%     80.0%      56.3%
                                             FN Rate   c         14.1%            9.2%     20.0%      50.0%
                                                       rc        27.1%            9.2%     20.0%      43.8%
            0.05, (-1, 100; 1, 0), Prune R1  TP Rate   c         82.4%           91.4%     82.9%      62.5%
                                                       rc        71.8%           92.6%     82.9%      65.6%
                                             FN Rate   c         17.6%            8.6%     17.1%      37.5%
                                                       rc        28.2%            7.4%     17.1%      34.4%
            0.05, (-1, 100; 1, 0)            TP Rate   c         84.7%           91.4%     82.9%      56.3%
                                                       rc        71.8%           92.6%     82.9%      65.6%
                                             FN Rate   c         15.3%            8.6%     17.1%      43.8%
                                                       rc        28.2%            7.4%     17.1%      34.4%
            0.01, (-1, 100; 1, 0), Prune R1  TP Rate   c         72.9%           92.6%     85.7%      68.8%
                                                       rc        61.2%           92.6%     85.7%      75.0%
                                             FN Rate   c         27.1%            7.4%     14.3%      31.3%
                                                       rc        38.8%            7.4%     14.3%      25.0%
            0.01, (-1, 100; 1, 0)            TP Rate   c         88.2%           92.0%     82.9%      59.4%
                                                       rc        61.2%           92.6%     85.7%      68.8%
                                             FN Rate   c         11.8%            8.0%     17.1%      40.6%
                                                       rc        38.8%            7.4%     14.3%      31.3%

Fig. 4. TPR, FNR of minority class on imbalanced datasets


To test the performance of the SSCR classifier on imbalanced datasets, we compare the experimental results with two well-known classifiers, CBA and the decision tree classifier C4.5, using the WEKA software. CBA requires minimum support and confidence thresholds; we set them to 1% and 25%, respectively, to obtain a sufficient number of association rules. C4.5 requires a confidence factor for leaf node pruning; we set it to 100% so that all leaf nodes are pure. SSCR requires a significance level for the statistical test and a cost matrix penalty as parameters. We use three significance levels α ∈ {0.1, 0.05, 0.01} and the cost matrix (-1, 100; 1, 0) of Table 1. We use 10-fold cross validation to measure the performance of all the classifiers. Figure 4 shows the true positive rate (TPR) and the false negative rate (FNR) of the minority class.
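For reference, the minority-class measures reported in Figures 4 and 5 can be computed from a confusion matrix as sketched below (a generic sketch with invented counts, not the authors' evaluation code):

```python
def minority_class_metrics(tp, fp, fn, tn):
    """TPR/recall, FNR, precision and F-measure of the positive (minority) class."""
    tpr = tp / (tp + fn) if tp + fn else 0.0       # recall
    fnr = fn / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if precision + tpr else 0.0
    return {"TPR": tpr, "FNR": fnr, "precision": precision, "f-measure": f1}

print(minority_class_metrics(tp=28, fp=50, fn=7, tn=1399))
```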

Compared with CBA and C4.5, the experimental results demonstrate that SSCR provides a higher true positive rate (TPR) and a lower false negative rate (FNR). For a highly imbalanced dataset such as Abalone19, SSCR also yields better results than CBA and C4.5, which predict toward the majority class with a false negative rate (FNR) of 100%. Further improvement can be obtained by using higher-quality statistically significant association rules: we obtained a higher TPR when the significance level is set to 0.01.

We can also evaluate SSCR with respect to its two ranking methods, cost-based (c) and rule size plus cost-based (rc). The experimental results show that the rc ranking gives the better results, with or without pruning of rules of size one. The best results are obtained when rule-size-one pruning is performed.

Algorithm   Threshold                        Measure     Ranking   Breast Cancer   Yeast3   Yeast6   Abalone19
CBA         1%, 25%                          precision             47.7%           92.7%     0.0%      0.0%
                                             recall                36.5%           23.3%     0.0%      0.0%
                                             f-measure             41.3%           37.3%     0.0%      0.0%
                                             ROC                   59.8%           61.5%    50.0%     50.0%
C4.5        100%                             precision             44.4%           67.5%    55.6%      0.0%
                                             recall                37.6%           69.9%    42.9%      0.0%
                                             f-measure             40.8%           68.7%    48.4%      0.0%
                                             ROC                   58.1%           88.9%    72.8%     69.0%
SSCR        0.1, (-1, 100; 1, 0), Prune R1   precision   rc        35.8%           36.7%    13.1%      2.3%
                                             recall      rc        72.9%           90.8%    80.0%     56.3%
                                             f-measure   rc        48.1%           52.3%    22.5%      4.4%
                                             ROC         rc        58.9%           85.7%    83.6%     68.9%
            0.05, (-1, 100; 1, 0), Prune R1  precision   rc        37.7%           36.7%    12.6%      2.4%
                                             recall      rc        71.8%           92.6%    82.9%     65.6%
                                             f-measure   rc        49.4%           52.5%    21.8%      4.6%
                                             ROC         rc        60.8%           86.4%    84.5%     72.4%
            0.01, (-1, 100; 1, 0), Prune R1  precision   rc        40.9%           38.7%     9.5%      2.2%
                                             recall      rc        61.2%           92.6%    85.7%     75.0%
                                             f-measure   rc        49.1%           54.6%    17.1%      4.2%
                                             ROC         rc        61.9%           87.3%    83.0%     74.5%

Fig. 5. Precision, recall, f-measure and ROC of minority class on imbalanced datasets


This can be explained by the observation that the positive (minority) class occurs rarely, so rules of size one have a strong relationship with the majority class.

Figure 5 shows the experimental results of CBA, C4.5 and SSCR in terms of precision, recall, f-measure and ROC. For SSCR, we report only the rule ranking method that gives the best average results. CBA and C4.5 give better results in terms of precision: both classifiers handle the majority class better than the minority class, with a higher TNR and FNR. SSCR, in contrast, handles the minority class better, at the price of a higher FPR compared to CBA and C4.5. Note that the error of predicting an actual positive instance as negative incurs the higher cost.

C4.5 gives the best f-measure, which is the harmonic mean of precision and recall. However, on a highly imbalanced dataset such as Abalone19, SSCR yields higher accuracy than CBA and C4.5. Further, SSCR gives a better result in terms of ROC. Figure 6 shows the performance of the three algorithms in terms of accuracy. For all the experimental datasets, SSCR obtains the best accuracy.

Algorithm   Threshold          Measure    Ranking   Breast Cancer   Yeast3   Yeast6   Abalone19
CBA         1%, 25%            Accuracy             59.8%           61.6%    50.0%     50.0%
C4.5        100%               Accuracy             58.9%           82.9%    71.1%     50.0%
SSCR        0.1, Prune R1      Accuracy   rc        58.9%           85.8%    83.6%     68.9%
            0.05, Prune R1     Accuracy   rc        60.8%           86.4%    84.5%     72.4%
            0.01, Prune R1     Accuracy   rc        62.0%           87.3%    83.0%     74.5%

Fig. 6. Accuracy of minority class on imbalanced datasets

5 Conclusion

In this paper, we presented SSCR, a technique for improving associative classification on imbalanced datasets. SSCR combines statistically significant association rules with cost-sensitive learning to build an associative classifier. We show that statistically significant association rules are effective on highly imbalanced datasets. Cost-sensitive learning is used to estimate the risk of a significant association rule in classifying unseen data. Rules with the lowest risk are preferred to rules with high risk, because these rules have a better chance of predicting the true classes.

SSCR yields good accuracy, a higher true positive rate (TPR), a lower false negative rate (FNR) and higher recall on imbalanced classes. SSCR has lower precision and f-measure because it loses some association rules of the majority class, so there is room for improvement. Optimization of the rule pruning, rule ranking and cost matrix should be further investigated.


References

1. Weiss, G.M.: Mining with Rarity: A Unifying Framework. SIGKDD Explorations 6(1), 7–19 (2004)

2. Liu, B., Hsu, W., Ma, Y.: Integrating Classification and Association Rule Mining. In: KDD, pp. 80–86 (1998)

3. Li, W., Han, J., Pei, J.: CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. In: ICDM, pp. 369–376 (2001)

4. Yin, X., Han, J.: CPAR: Classification based on Predictive Association Rules. In: SDM (2003)

5. Verhein, F., Chawla, S.: Using Significant, Positively Associated and Relatively Class Correlated Rules for Associative Classification of Imbalanced Datasets. In: ICDM, pp. 679–684 (2007)

6. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: VLDB, pp. 487–499 (1994)

7. Webb, G.I.: Discovering significant rules. In: KDD, pp. 434–443 (2006)

8. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009)

9. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)

10. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley

11. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007)

12. Chai, X., Deng, L., Yang, Q., Ling, C.X.: Test-Cost Sensitive Naïve Bayesian Classification. In: Proceedings of the Fourth IEEE International Conference on Data Mining. IEEE Computer Society Press, Brighton (2004)

13. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155–164. ACM Press (1999)

14. Sheng, V.S., Ling, C.X.: Thresholding for Making Classifiers Cost-sensitive. In: Proceedings of the 21st National Conference on Artificial Intelligence, Boston, Massachusetts, July 16-20, pp. 476–481 (2006)

15. Japkowicz, N.: The Class Imbalance Problem: Significance and Strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI 2000): Special Track on Inductive Learning, Las Vegas, Nevada (2000)

16. Solberg, A., Solberg, R.: A Large-Scale Evaluation of Features for Automatic Detection of Oil Spills in ERS SAR Images. In: International Geoscience and Remote Sensing Symposium, Lincoln, NE, pp. 1484–1486 (1996)