[ieee 2009 international conference on artificial intelligence and computational intelligence -...
Study on Comparison of Discretization Methods
Liu Peng, Wang Qing, Gu Yujia
School of Information Management and Engineering, Shanghai University of Finance and Economics
Shanghai, China
firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Abstract-Discrete features play an important role in data mining. Finding the best discretization of continuous features is an NP-hard problem. This paper introduces the taxonomies used in the existing literature to classify discretization methods, as well as the ideas and drawbacks of some typical methods. A comparison of these methods is then presented. It is essential to select proper methods depending on the learning environment. Finally, choosing the best discretization methods for association analysis is proposed as future research.
Keywords-continuous features; discrete features; discretization
I. INTRODUCTION
In terms of values, features of data sets can be classified as continuous and discrete. Continuous features are also called quantitative features, e.g. people's height and age. Discrete features, often referred to as qualitative features, such as sex and degree of education, can take only a limited number of values [1, 2]. Continuous features can be ranked in order and admit meaningful arithmetic operations. Discrete features can sometimes be arranged in a meaningful order, but no arithmetic operations can be applied to them.
In the field of Machine Learning and Data Mining, many learning algorithms are primarily oriented to handling discrete features. Even for algorithms that can deal with continuous features directly, learning is often less efficient and less effective. However, data in the real world are often continuous. Hence discretization has long been an active topic in data mining. Discretization is a data-preprocessing procedure that transforms continuous features into discrete features, helping to improve learning performance and the understandability of the results [3, 4].
There are many advantages of using discrete values over continuous ones: (1) Discretization reduces the number of values of a continuous feature, which lowers the demand on system storage. (2) Discrete features are closer to a knowledge-level representation than continuous ones. (3) Data can also be reduced and simplified through discretization. For both users and experts, discrete features are easier to understand, use, and explain. (4) Discretization makes learning more accurate and faster. (5) Beyond these advantages, a suite of classification learning algorithms can only deal with discrete data. Successful discretization can significantly extend the application range of many learning algorithms [4, 6, 7]. However, optimal discretization has been proved to be an NP-hard problem.
Many discretization methods exist. Section 2 of this paper covers the taxonomies used in the existing literature to classify discretization methods. The remainder of the paper is organized as follows. In Section 3, the ideas and drawbacks of some typical methods are introduced. Section 4 provides a comparison among various discretization methods. Section 5 shows that it is essential to select proper methods according to the learning environment and, taking discretization for association rules as an example, points out the problems that need attention. The paper concludes in Section 6 with discretization for association rules as further work.
II. TAXONOMY
Discretization methods have been developed along different lines due to different needs. Several taxonomies of discretization methods exist at present. The main classifications are: supervised vs. unsupervised, dynamic vs. static, global vs. local, splitting (top-down) vs. merging (bottom-up), and direct vs. incremental. Some are briefly introduced below.
Discretization methods can be supervised or unsupervised depending on whether they use the class information of the data set. Supervised discretization can be further characterized as error-based, entropy-based or statistics-based [3, 5, 8]. Unsupervised discretization is seen in earlier methods like equal-width and equal-frequency, in which continuous ranges are divided into subranges by a user-specified width or frequency. Such discretization is vulnerable to outliers, as they affect the ranges significantly. To overcome this shortcoming, supervised discretization methods using class information were introduced.
Discretization methods can also be viewed as dynamic or static. A dynamic method discretizes continuous values while a classifier is being built, as in C4.5, whereas in the static approach discretization is done prior to the classification task. A detailed comparison between dynamic and static methods is given in Dougherty.
Liu and Hussain proposed a hierarchical framework of discretization methods, shown in Fig. 1. This framework is mainly based on the existence of class information. For more details, see Liu and Hussain (2002).
2009 International Conference on Artificial Intelligence and Computational Intelligence
978-0-7695-3816-7/09 $26.00 © 2009 IEEE
DOI 10.1109/AICI.2009.385
III. ADVANTAGES AND DRAWBACKS OF TYPICAL DISCRETIZATION METHODS
Discretization is mainly used to transform a continuous feature into a categorical feature for classification and association analysis. Usually, the result of discretization depends on the algorithm adopted and on the other features that need to be considered. Most current discretization algorithms consider a single feature at a time.
The main ideas, advantages, and drawbacks of typical discretization methods are introduced below, first for unsupervised and then for supervised methods.
A. Unsupervised Methods
1) Binning: Unsupervised discretization is seen in earlier methods like equal-width and equal-frequency, which belong to the binning methods. Binning is the simplest way to discretize a continuous-valued attribute: it creates a specified number of bins.
The two methods are very simple and easy to implement, but they are sensitive to the chosen number of bins k. Besides, the equal-width approach can be badly affected by outliers and may not give good results when the distribution of the continuous values is not uniform, which damages the feature's ability to support a good decision structure. Although the equal-frequency method overcomes this problem, it tends to put instances with the same feature value and the same class label in different bins in order to satisfy the equal-frequency condition.
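The two binning schemes can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function names (`equal_width_cuts`, `equal_frequency_cuts`, `assign_bins`) and the sample data are assumptions made for this example. It also shows the outlier sensitivity noted above: a single extreme value stretches the equal-width range so that almost all instances land in the first bin.

```python
from bisect import bisect_right

def equal_width_cuts(values, k):
    """Return the k-1 interior cut points splitting [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Return k-1 cut points so each bin gets about len(values)/k of the sorted values.

    Assumes 1 <= k <= len(values).
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        pos = i * n // k                                    # rank of the first value in bin i
        cuts.append((ordered[pos - 1] + ordered[pos]) / 2)  # midpoint between neighbors
    return cuts

def assign_bins(values, cuts):
    """Map each value to a bin index in 0..len(cuts)."""
    return [bisect_right(cuts, v) for v in values]

# Hypothetical sample: ages with one outlier (95).
ages = [21, 22, 23, 25, 27, 30, 31, 35, 95]
ew_bins = assign_bins(ages, equal_width_cuts(ages, 3))
ef_bins = assign_bins(ages, equal_frequency_cuts(ages, 3))
```

With this sample, equal-width binning puts eight of the nine values in bin 0 and leaves bin 1 empty, while equal-frequency binning yields three bins of three values each.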
For equal-width, one solution to the problem of outliers that take extreme values is to remove the outliers using a threshold. One improvement to the traditional equal-frequency approach is that, after values are assigned to bins, the boundaries of every pair of neighboring bins are adjusted so that duplicate values belong to one bin only.
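Both adjustments can be sketched as follows. This is only one plausible reading of the two ideas: the thresholds `low` and `high`, the function names, and the rule of shifting a rank-based boundary to the right until it falls between two distinct values are assumptions of this sketch, not details given in the text.

```python
def trimmed_equal_width_cuts(values, k, low, high):
    """Equal-width cut points computed only from values inside [low, high]."""
    kept = [v for v in values if low <= v <= high]   # drop outliers by threshold
    lo, hi = min(kept), max(kept)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts_no_split(values, k):
    """Equal-frequency cut points, shifted so identical values never straddle a boundary.

    Assumes 1 <= k <= len(values). May return fewer than k-1 cuts when
    duplicates force neighboring boundaries to coincide.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        pos = i * n // k
        # Move the rank-based boundary right until it separates two distinct values.
        while pos < n and ordered[pos] == ordered[pos - 1]:
            pos += 1
        if pos < n:
            cut = (ordered[pos - 1] + ordered[pos]) / 2
            if not cuts or cut > cuts[-1]:           # skip boundaries that collapsed together
                cuts.append(cut)
    return cuts

# Hypothetical samples.
ages = [21, 22, 23, 25, 27, 30, 31, 35, 95]
trimmed = trimmed_equal_width_cuts(ages, 3, low=0, high=60)

scores = [1, 2, 2, 2, 2, 3, 4, 5, 6]
dup_safe = equal_frequency_cuts_no_split(scores, 3)
```

In the second sample, a purely rank-based split into three groups of three would separate the four 2s; the shifted boundary keeps them together at the cost of unequal bin sizes.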
Unsupervised methods also include discretization according to intuition, which is better suited to simple and practically meaningful real data sets and is less vulnerable to outliers.
B. Supervised Methods
1) 1R: 1R is a supervised discretization method using binning. It is also very easy to operate. The number of bins k no longer needs to be specified beforehand. Besides, 1R overcomes the unsupervised binning approaches' problem of ignoring class information, which helps avoid putting two instances with the same class label in two different intervals.
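The text does not spell out the 1R procedure, so the following is a simplified sketch of the idea as it is commonly described, not the original algorithm: instances are sorted by value and an interval is closed once its majority class has enough members and the next instance belongs to a different class. The function name `one_r_cuts` and the parameter `min_majority` (with its default) are assumptions of this sketch.

```python
from collections import Counter

def one_r_cuts(values, labels, min_majority=3):
    """Simplified 1R-style discretization of one feature; returns the cut points."""
    pairs = sorted(zip(values, labels))
    cuts = []
    counts = Counter()                       # class counts of the interval being built
    for i, (value, label) in enumerate(pairs):
        counts[label] += 1
        if i + 1 == len(pairs):
            break                            # the last instance closes the final interval
        next_value, next_label = pairs[i + 1]
        majority_label, majority_count = counts.most_common(1)[0]
        # Close the interval when the majority class is well represented,
        # the next instance has another class, and the value actually changes.
        if (majority_count >= min_majority
                and next_label != majority_label
                and next_value != value):
            cuts.append((value + next_value) / 2)
            counts = Counter()
    return cuts

# Hypothetical sample: twelve values whose class alternates in runs of three.
values = list(range(1, 13))
labels = list("aaabbbaaabbb")
cuts = one_r_cuts(values, labels, min_majority=3)
```

Note that the number of intervals falls out of the data and the class labels; no bin count k is passed in.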
2) Entropy measure: Using entropy to split continuous features when constructing decision trees works well in practice, so it is possible to extend it to more general discretization. Discretization methods based on the entropy measure employ class information to compute and decide the splitting point, so they are supervised, top-down splitting methods.
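As a concrete illustration, the splitting point can be chosen as the boundary that minimizes the size-weighted class entropy of the two resulting intervals. The sketch below assumes Shannon entropy in bits and midpoints between distinct sorted values as candidate cuts; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Return (cut, weighted_entropy) for the best binary split, or None if no cut exists."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # no cut between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best

# Hypothetical sample: two well-separated classes.
cut, score = best_entropy_cut([1, 2, 3, 10, 11, 12], list("aaabbb"))
```

Here the chosen cut lies between 3 and 10 and both resulting intervals are pure, so the weighted entropy is zero.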
ID3 and C4.5 are two popular algorithms for decision tree induction that use the entropy measure. One problem with the ID3 method is that the cut-points it obtains are usually more applicable to classification algorithms. However, it is the basis for many successors such as D2, MDLP (minimum description length principle) and so on.
Figure 1. A hierarchical framework of discretization methods
Unlike ID3, which binarizes a range of values while building a decision tree, D2 is a static method that discretizes the whole instance space. Instead of finding only one cut-point, it recursively binarizes ranges or sub-ranges until a stopping criterion is met. A stopping criterion is essential to avoid over-splitting. MDLP uses the minimum description length principle to determine when to stop discretization. It also suggests that potential cut-points are those that separate different class values.
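Recursive binarization with an MDL-based stopping rule can be sketched as below. The acceptance test is the criterion as it is usually stated for entropy-based MDLP discretization, which the text does not write out: a cut on N instances is accepted only if its information gain exceeds (log2(N - 1) + delta) / N, where delta = log2(3^k - 2) - (k·Ent(S) - k1·Ent(S1) - k2·Ent(S2)) and k, k1, k2 are the numbers of classes in the range and its two halves. Function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mdlp_cuts(pairs):
    """Recursively split sorted (value, label) pairs; return the accepted cut points."""
    n = len(pairs)
    labels = [lab for _, lab in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                               # no cut between identical values
        score = (i * entropy(labels[:i]) + (n - i) * entropy(labels[i:])) / n
        if best is None or score < best[0]:
            best = (score, i)
    if best is None:
        return []                                  # nothing left to split
    score, i = best
    left, right = labels[:i], labels[i:]
    gain = base - score
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * base - k1 * entropy(left) - k2 * entropy(right))
    if gain <= (log2(n - 1) + delta) / n:
        return []                                  # stopping criterion met
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return mdlp_cuts(pairs[:i]) + [cut] + mdlp_cuts(pairs[i:])

def discretize(values, labels):
    """Static, top-down discretization of one feature over the whole data set."""
    return mdlp_cuts(sorted(zip(values, labels)))

# Hypothetical samples: one clean class boundary, and one noisy range.
clean = discretize(list(range(1, 21)), ["a"] * 10 + ["b"] * 10)
noisy = discretize([1, 2, 3, 4], list("abab"))
```

In the clean sample a single cut is accepted and the recursion stops inside each pure half; in the noisy sample the gain of the best cut does not justify its description cost, so no cut is made.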
3) χ² measure: The methods mentioned above are all top-down splitting discretization methods, while the χ² measure is a bottom-up merging approach.
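A bottom-up merge driven by χ² can be sketched as follows: start with one interval per distinct value, compute the Pearson χ² statistic for every pair of adjacent intervals from their class counts, and repeatedly merge the pair with the smallest statistic while it stays below a threshold. This is a minimal sketch under those assumptions; the function names are hypothetical, and the threshold 2.706 used in the example is the χ² critical value for one degree of freedom at the 0.10 significance level.

```python
from collections import Counter

def chi2_pair(a, b):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    classes = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            stat += (row[c] - expected) ** 2 / expected
    return stat

def chi_merge(values, labels, threshold):
    """Merge adjacent intervals bottom-up; return the lower bound of each final interval."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(v, Counter())[lab] += 1
    lows = sorted(groups)                          # one initial interval per distinct value
    counts = [groups[v] for v in lows]
    while len(lows) > 1:
        stats = [chi2_pair(counts[i], counts[i + 1]) for i in range(len(lows) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break                                  # every adjacent pair differs enough
        counts[i] = counts[i] + counts.pop(i + 1)  # merge interval i+1 into interval i
        lows.pop(i + 1)
    return lows

# Hypothetical sample with two classes (one degree of freedom).
intervals = chi_merge([1, 2, 3, 4, 5, 6], list("aaabbb"), threshold=2.706)
```

For this sample the six initial intervals collapse into two, starting at 1 and 4, because the final pair has a statistic well above the threshold.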
The most commonly used χ²-based discretization method is ChiMerge. It is a kind of automatic discretization algorithm. When applying the statistical measure χ² to test the dependence of two intervals, it is necessary for the user to predefine the significance level and then calculate the threshold values, using statistical knowledge, to compare with the computed values. A higher significance level for the χ² test causes over-discretization, while a lower value causes under-discretization. So Liu and Setiono made an improvement on ChiMerge in 1997 and formed a new algorithm called Chi2. In Chi2, the statistical significa