[IEEE 2009 International Conference on Artificial Intelligence and Computational Intelligence - Shanghai, China (2009.11.7-2009.11.8)] 2009 International Conference on Artificial Intelligence and Computational Intelligence - Study on Comparison of Discretization Methods

<ul><li><p>Study on Comparison of Discretization Methods </p><p>Liu Peng, Wang Qing, Gu Yujia </p><p>School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China </p><p>liupeng@mail.shufe.edu.cn, wq_811@yahoo.com.cn, shufegyj@hotmail.com </p><p>Abstract: Discrete features play an important role in data mining. Finding the best discretization of continuous features is an NP-hard problem. This paper introduces the diverse taxonomies used in the existing literature to classify discretization methods, as well as the ideas and drawbacks of some typical methods. Furthermore, these methods are compared. It is essential to select a proper method depending on the learning environment. Finally, choosing the best discretization methods for association analysis is proposed as future research. </p><p>Keywords: continuous features; discrete features; discretization </p><p>I. INTRODUCTION </p><p>In terms of values, features of data sets can be classified as continuous or discrete. Continuous features are also called quantitative features, e.g. people's height, age and so on. Discrete features, often referred to as qualitative features, such as sex and degree of education, can take only a limited number of values [1, 2]. Continuous features can be ranked in order and admit meaningful arithmetic operations. Discrete features can sometimes be arranged in a meaningful order, but no arithmetic operations can be applied to them [3]. </p><p>In the field of Machine Learning and Data Mining, many learning algorithms are primarily oriented to handling discrete features. Even for algorithms that can deal with continuous features directly, learning is often less efficient and effective. However, data in the real world are often continuous. Hence discretization has been an active topic in data mining for a long time. 
Discretization is a data-preprocessing procedure that transforms continuous features into discrete features, which helps improve learning performance and makes the results easier to understand [3, 4]. </p><p>Discrete values have many advantages over continuous ones: (1) Discretization reduces the number of values a continuous feature takes, which lowers the demand on system storage. (2) Discrete features are closer to a knowledge-level representation than continuous ones. (3) Data can also be reduced and simplified through discretization. For both users and experts, discrete features are easier to understand, use, and explain. (4) Discretization makes learning more accurate and faster [5]. (5) A suite of classification learning algorithms can deal only with discrete data. Successful discretization can significantly extend the application range of many learning algorithms [4, 6, 7]. However, optimal discretization has been proved to be an NP-hard problem. </p><p>Many discretization methods exist. Section 2 reviews the diverse taxonomies used in the existing literature to classify them. The remainder of the paper is organized as follows. Section 3 introduces the ideas and drawbacks of some typical methods. Section 4 compares various discretization methods. Section 5 shows that it is essential to select a proper method for the learning environment and, taking discretization for association rules as an example, points out the problems that deserve attention. Section 6 concludes the paper, with discretization for association rules as further work. </p><p>II. TAXONOMY </p><p>Discretization methods have been developed along different lines due to different needs. Many different taxonomies of discretization methods now exist [3]. The main classifications are as follows: supervised vs. unsupervised, dynamic vs. 
static, global vs. local, splitting (top-down) vs. merging (bottom-up), and direct vs. incremental. Some of these are briefly introduced below. </p><p>Discretization methods can be supervised or unsupervised, depending on whether they use the class information of the data set. Supervised discretization can be further characterized as error-based, entropy-based or statistics-based [3, 5, 8]. Unsupervised discretization is seen in earlier methods such as equal-width and equal-frequency, in which continuous ranges are divided into subranges by a user-specified width or frequency. This approach is vulnerable to outliers, which affect the ranges significantly [9]. To overcome this shortcoming, supervised discretization methods that use class information were introduced. </p><p>Discretization methods can also be classified as dynamic or static. A dynamic method discretizes continuous values while a classifier is being built, as in C4.5 [10], whereas in the static approach discretization is done prior to the classification task. A detailed comparison between dynamic and static methods is given in Dougherty [5]. </p><p>Liu and Hussain proposed a hierarchical framework of discretization methods, shown in Fig. 1. This framework is mainly based on the existence of class information. For more details, see Liu and Hussain (2002) [6].</p><p>978-0-7695-3816-7/09 $26.00 © 2009 IEEE. DOI 10.1109/AICI.2009.385</p></li><li><p>III. ADVANTAGES AND DRAWBACKS OF TYPICAL DISCRETIZATION METHODS </p><p>Discretization is mainly used to transform a continuous feature into a categorical feature in classification and association analysis. Usually, the result of discretization depends on the algorithm adopted and on the other features that need to be considered. Most current discretization algorithms discretize one feature at a time [13]. 
</p><p>The main ideas of typical discretization methods, together with their advantages and drawbacks, are introduced below for unsupervised and supervised methods in turn. </p><p>A. Unsupervised Methods </p><p>1) Binning: Unsupervised discretization is seen in earlier methods such as equal-width and equal-frequency, which belong to the binning methods [7]. Binning is the simplest way to discretize a continuous-valued attribute: it creates a specified number of bins. </p><p>The two methods are simple and easy to implement, but their results are sensitive to the given number of bins k. Besides, the equal-width approach can be badly affected by outliers [13] and may not give good results when the distribution of the continuous values is not uniform, which damages the feature's ability to build a good decision structure. Although the equal-frequency method overcomes this problem, it tends to put instances with the same value and the same class label into different bins in order to satisfy the equal-frequency condition [8]. </p><p>For equal-width, one solution to the problem of outliers that take extreme values is to remove them using a threshold. One improvement to the traditional equal-frequency approach is to adjust, after the values have been assigned to bins, the boundaries of every pair of neighboring bins so that duplicate values belong to one bin only [6]. </p><p>Unsupervised methods also include discretization according to intuition, which is more applicable to simple and practically meaningful real data sets and is less vulnerable to outliers. </p><p>B. Supervised Methods </p><p>1) 1R: 1R [14] is a supervised discretization method that uses binning. It is also very easy to operate. The number of bins k no longer needs to be specified beforehand. Besides, 1R overcomes the unsupervised binning approaches' problem of ignoring class information, which helps avoid putting two instances with the same class label into two different intervals. 
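</p><p>The equal-width and equal-frequency binning described under Unsupervised Methods above can be sketched as follows. This is a minimal illustration in Python rather than code from the paper; the function names equal_width_bins and equal_frequency_bins, the sample data and the choice k = 3 are all assumptions made for the example.</p>

```python
from typing import List


def equal_width_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k bins holding roughly equal numbers of instances."""
    n = len(values)
    # Rank the instances by value; ties keep their original order.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // n, k - 1)
    return bins


# Hypothetical sample: one extreme outlier (100) and a duplicated value (4).
data = [1, 2, 4, 4, 5, 6, 7, 8, 100]
width_bins = equal_width_bins(data, 3)
freq_bins = equal_frequency_bins(data, 3)
print(width_bins)
print(freq_bins)
```

<p>In this sample, the outlier 100 stretches the equal-width range so that all other values fall into the first bin and the middle bin stays empty, while equal-frequency binning keeps three instances per bin but places the two instances with value 4 into different bins.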
</p><p>2) Entropy measure: Using entropy to split continuous features when constructing decision trees works well in practice, so it is possible to extend it to more general discretization. Discretization methods based on the entropy measure use class information to compute and decide the splitting point, so they are supervised, top-down splitting methods. </p><p>ID3 [16] and C4.5 [10] are two popular algorithms for decision tree induction that use the entropy measure. One problem with ID3 is that the cut-points obtained are usually more applicable to classification algorithms. However, it is the basis for many successors such as D2 [9], MDLP (minimum description length principle) [17] and so on. </p><p>Figure 1. A hierarchical framework of discretization methods [6]</p></li><li><p>Unlike ID3, which binarizes a range of values while building a decision tree, D2 is a static method that discretizes the whole instance space. Instead of finding only one cut-point, it recursively binarizes ranges or sub-ranges until a stopping criterion is met. A stopping criterion is essential to avoid over-splitting. MDLP uses the minimum description length principle to determine when to stop discretization. It also suggests that potential cut-points are those that separate different class values [6]. </p><p>3) χ² measure: The methods mentioned above are all top-down splitting discretization methods, whereas the χ² measure is a bottom-up merging approach [7]. </p><p>The most commonly used χ²-based discretization method is ChiMerge [18], an automatic discretization algorithm [1]. When the χ² statistic is applied to test the dependence of two intervals, the user must predefine a significance level and then, using knowledge of statistics, calculate the threshold value against which the computed χ² values are compared. 
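</p><p>The χ² comparison of two adjacent intervals described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the authors' or the original ChiMerge implementation: the function name chi_square, the class counts and the constant THRESHOLD are hypothetical, the data are assumed to have two classes (one degree of freedom), and cells with an expected count of zero are simply skipped.</p>

```python
from typing import Sequence


def chi_square(interval_a: Sequence[int], interval_b: Sequence[int]) -> float:
    """Chi-square statistic for two adjacent intervals.

    Each argument lists the number of instances of every class in that interval.
    """
    rows = [interval_a, interval_b]
    row_totals = [sum(row) for row in rows]
    col_totals = [a + b for a, b in zip(interval_a, interval_b)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                # Assumption of this sketch: a class absent from both
                # intervals contributes nothing to the statistic.
                continue
            stat += (observed - expected) ** 2 / expected
    return stat


# Critical value of the chi-square distribution with 1 degree of freedom
# (two classes) at significance level 0.05.
THRESHOLD = 3.841

similar = chi_square([4, 1], [8, 2])    # same class proportions in both intervals
different = chi_square([5, 0], [0, 5])  # completely different class distributions
print(similar, similar < THRESHOLD)
print(different, different < THRESHOLD)
```

<p>Here the two intervals with identical class proportions score 0, below the threshold, and would be merged, whereas the pair with completely different class distributions scores 10, above the threshold, and would be kept apart.</p><p>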
A higher significance level for the χ² test causes over-discretization, while a lower one causes under-discretization. Liu and Setiono [19] therefore improved ChiMerge in 1997 and formed a new algorithm called Chi2. In Chi2, the statistical significance level keeps declining step by step, merging more and more adjacent intervals as long as the inconsistency criterion is satisfied. The algorithm has two phases of discretization and takes the inconsistency measure as the stopping criterion. </p><p>The Chi2 algorithm, which uses the inconsistency measure to check the results of discretization, is much better than ChiMerge with a fixed significance level. However, Chi2 has drawbacks as well. A lower limit on the inconsistency must be specified, and selecting the best limit requires numerous experiments and a tremendous amount of work. </p><p>Besides the discretization methods mentioned above, there are other methods, such as algorithms based on distance or other measures and approaches that unite several single discretization methods into a comprehensive one. Because of the space limit, this paper does not introduce every kind of discretization in detail. </p><p>IV. COMPARISON OF DISCRETIZATION METHODS </p><p>Many discretization methods exist, but no two algorithms are the same. Every algorithm has its own characteristics and does well in different situations. In 2007, Ying Yang and Xindong Wu [19] studied the differences among 15 discretization methods from the point of view of the taxonomy of the algorithms; Liu and Hussain evaluated the results of many discretization algorithms through experiments in terms of accuracy, time for discretization, time for learning and understandability [6]. </p><p>This paper aims to provide an explicit comparison of ten discretization methods according to the following aspects: </p><p>Merging or splitting methods. 
Merging methods start with the complete list of all the continuous values of the feature as cut-points and remove some of them by 'merging' intervals as the discretization progresses. Splitting methods start with an empty list of cut-points (or split-points) and keep adding new ones to the list by splitting intervals as the discretization progresses. </p><p>Discretization with class information or not. </p><p>Stopping criterion. Different discretization methods have different stopping criteria. </p><p>Sensitivity to outliers. </p><p>Simplicity of operation. Some of the methods are quite easy. </p><p>Uniformity of the results among intervals. For example, equal-width methods will probably lead to some bins containing no instances at all while others have too many. </p><p>Consideration of values of the feature or not. Most methods discretize according to the values of features; however, other methods ignore the values and discretize only with class labels. </p><p>Grouping instances with the same value and the same class label in the same interval or not. </p><p>Every discretization method has its own strengths and weaknesses and can be applied to different situations. </p><p>A comparative analysis of the common discretization methods mentioned above is displayed in Table I. In the table, Y denotes yes and N denotes no. Further studies of these methods from other aspects are certainly possible. </p><p>V. DISCRETIZATION AND LEARNING CONTEXT </p><p>Although various discretization methods are available, no single discretization method can ensure an improvement for all data sets and all algorithms. Optimal discretization is NP-hard by nature, and the methods are tuned to different types of learning context. As a result, we should choose an appropriate discretization method based on the characteristics of the data set, the learning context, and the user's preferences and understanding of the problem. 
For example, decision tree learners can suffer from the fragmentation problem, and hence they may benefit more than other learners from discretization that results in few intervals; decision rule learners require pure intervals (containing instances dominated by a single class); association rule learners value the relations between attributes, and thus they desire multivariate discretization that can capture the inter-dependencies among attributes [3]. </p><p>Take the analysis of association rules as an example. When discretizing for association rule analysis, users should focus on the relations between features instead of discretizing every single feature in isolation. Isolated discretization of single features tends to induce imperfect association rules, which affects the correctness and effectiveness of the results. Association rule analysis therefore prefers multivariate discretization. </p></li><li><p>TABLE I. COMPARATIVE ANALYSIS OF DISCRETIZATION METHODS </p>
<table>
<tr><th>Algorithm</th><th>Merge or Split</th><th>Use class information</th><th>Stopping criterion</th><th>Sensitive to outlier</th><th>Simplicity of operation</th><th>Uniformity of the results</th><th>Use of values</th><th>Same values into different intervals</th></tr>
<tr><td>Equal-width</td><td>Split</td><td>N</td><td>Fixed bin number</td><td>Y</td><td>N</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>Equal-frequency</td><td>Split</td><td>N</td><td>Fixed bin number</td><td>N</td><td>Y</td><td>N</td><td>Y</td><td>Y</td></tr>
<tr><td>Direct-viewing</td><td>Split</td><td>N</td><td>3-4-5 criterion</td><td>N</td><td>Y</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>Cluster</td><td>Both</td><td>N</td><td>Fixed cluster number</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>1R</td><td>Split</td><td>Y</td><td>Minimum instances in each interval</td><td>N</td><td>Y</td><td>N</td><td>Y</td><td>Y</td></tr>
<tr><td>ID3</td><td>Split</td><td>Y</td><td>One class label in one interval</td><td>N</td><td>Y</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>D2</td><td>Split</td><td>Y</td><td>Special criterion [6]</td><td>N</td><td>Y</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>MDLP</td><td>Split</td><td>Y</td><td>MDLP</td><td>N</td><td>Y</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>ChiMerge</td><td>Merge</td><td>Y</td><td>Threshold</td><td>N</td><td>Y</td><td>Y</td><td>Y</td><td>N</td></tr>
<tr><td>Chi2</td><td>Merge</td><td>Y</td><td>Inconsistency</td><td>N</td><td>Y</td><td>Y</td><td>Y</td><td>N</td></tr>
</table>
<p>Furthermore, data sets used in association analy...</p></li></ul>