
Study on Comparison of Discretization Methods

Liu Peng, Wang Qing, Gu Yujia School of Information Management and Engineering

Shanghai University of Finance and Economics Shanghai, China

[email protected], [email protected], [email protected]

Abstract—Discrete features play an important role in data mining. Finding the best discretization of continuous features has always been an NP-hard problem. This paper introduces the diverse taxonomies used in the existing literature to classify discretization methods, as well as the ideas and drawbacks of some typical methods, and then compares these methods. It is essential to select a proper method depending on the learning context. Finally, the problem of choosing the best discretization method for association analysis is proposed as future research.

Keywords-continuous features; discrete features; discretization

I. INTRODUCTION

In terms of values, the features of data sets can be classified as continuous and discrete. Continuous features are also called quantitative features, e.g. people's height and age. Discrete features, also often referred to as qualitative features, such as sex and degree of education, can only take a limited number of values [1, 2]. Continuous features can be ranked in order and admit meaningful arithmetic operations. Discrete features can sometimes be arranged in a meaningful order, but no arithmetic operations can be applied to them [3].

In the fields of Machine Learning and Data Mining, many learning algorithms are primarily oriented toward handling discrete features. Even for algorithms that can deal directly with continuous features, learning is often less efficient and less effective. However, data in the real world are often continuous. Hence discretization has long been an active topic in data mining. Discretization is a data-preprocessing procedure that transforms continuous features into discrete features, helping to improve the performance of learning and the understandability of the results [3, 4].

There are many advantages of using discrete values over continuous ones: (1) Discretization reduces the number of a continuous feature's values, which places smaller demands on system storage. (2) Discrete features are closer to a knowledge-level representation than continuous ones. (3) Data can also be reduced and simplified through discretization; for both users and experts, discrete features are easier to understand, use, and explain. (4) Discretization makes learning more accurate and faster [5]. (5) In addition to these advantages, a suite of classification learning algorithms can only deal with discrete data. Successful discretization can significantly extend the application range of many learning algorithms [4, 6, 7]. However, optimal discretization has been proved to be an NP-hard problem.

There are many discretization methods. The remainder of the paper is organized as follows. Section 2 reviews the diverse taxonomies used in the existing literature to classify discretization methods. Section 3 introduces the ideas and drawbacks of some typical methods. Section 4 provides a comparison among various discretization methods. Section 5 shows that it is essential to select a proper method considering the learning context and, taking discretization for association rules as an example, points out the problems that should be noticed. The paper concludes in Section 6 with discretization for association rules as further work.

II. TAXONOMY

Discretization methods have been developed along different lines due to different needs, and several different taxonomies of discretization methods exist at present [3]. The main classifications are as follows: supervised vs. unsupervised, dynamic vs. static, global vs. local, splitting (top-down) vs. merging (bottom-up), and direct vs. incremental. Some of them are briefly introduced below.

Discretization methods can be supervised or unsupervised depending on whether they use the class information of the data set. Supervised discretization can be further characterized as error-based, entropy-based, or statistics-based [3, 5, 8]. Unsupervised discretization is seen in earlier methods like equal-width and equal-frequency, in which continuous ranges are divided into sub-ranges by a user-specified width or frequency. It is vulnerable to outliers, as they affect the ranges significantly [9]. To overcome this shortcoming, supervised discretization methods using class information were introduced.

Discretization methods can also be viewed as dynamic or static. A dynamic method discretizes continuous values while a classifier is being built, as in C4.5 [10], while in the static approach discretization is done prior to the classification task. A detailed comparison between dynamic and static methods is given by Dougherty et al. [5].

Liu Huan and Hussain proposed the hierarchical framework of discretization methods shown in Fig. 1. This framework is mainly based on the existence of class information. For more details, see Liu and Hussain (2002) [6].



III. ADVANTAGES AND DRAWBACKS OF TYPICAL DISCRETIZATION METHODS

Discretization is mainly used to transform a continuous feature into a categorical feature for classification and association analysis. Usually, the result of discretization depends on the algorithm we adopt and on the other features that need to be considered. Most present discretization algorithms are based on a single feature at a time when discretizing [13].

The main ideas, advantages, and drawbacks of typical discretization methods are introduced below, grouped into unsupervised and supervised methods.

A. Unsupervised Methods

1) Binning: Unsupervised discretization is seen in earlier methods like equal-width and equal-frequency, which belong to the binning methods [7]. Binning is the simplest way to discretize a continuous-valued attribute, by creating a specified number of bins.

The two methods are very simple and easy to implement but are sensitive to the choice of the number of bins k. Besides, the equal-width approach can be badly affected by outliers [13] and may not give good results when the distribution of the continuous values is not uniform, which damages the feature's power to build a good decision structure. Although the equal-frequency method overcomes this problem, it tends to put instances with the same value and the same class label into different bins in order to satisfy the equal-frequency condition [8].

For equal-width, one solution to the problem of outliers taking extreme values is to remove the outliers using a threshold. One improvement of the traditional equal-frequency approach is that, after the values are assigned to bins, the boundaries of every pair of neighboring bins are adjusted so that duplicate values belong to one bin only [6].
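The following minimal Python sketch illustrates the two binning schemes for a single continuous feature (the function names are ours and purely illustrative; NumPy is assumed to be available):

```python
import numpy as np

def equal_width_bins(values, k):
    """Split the value range into k bins of equal width; return interior cut-points."""
    lo, hi = float(np.min(values)), float(np.max(values))
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]

def equal_frequency_bins(values, k):
    """Place roughly the same number of values into each bin; return cut-points."""
    sorted_vals = np.sort(values)
    n = len(sorted_vals)
    cuts = [float(sorted_vals[min(int(round(i * n / k)), n - 1)]) for i in range(1, k)]
    # Keep unique cut-points so duplicate values end up in one bin only,
    # the boundary adjustment suggested in [6].
    return sorted(set(cuts))

def discretize(values, cuts):
    """Map each continuous value to the index of its bin."""
    return np.searchsorted(cuts, values, side="right")

# Small example: the outlier 90 stretches the equal-width bins,
# while equal-frequency keeps the bin counts balanced.
ages = np.array([18, 21, 22, 23, 25, 26, 30, 31, 35, 90])
print(discretize(ages, equal_width_bins(ages, 4)))
print(discretize(ages, equal_frequency_bins(ages, 4)))
```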

Besides, unsupervised methods also include discretization by intuition (direct viewing), which is more applicable to simple, practically meaningful real data sets and is less vulnerable to outliers.

B. Supervised Methods

1) 1R: 1R [14] is a supervised discretization method using binning. It is also very easy to operate, and the number of bins k no longer needs to be specified beforehand. Besides, 1R overcomes the unsupervised binning approaches' problem of ignoring class information, which helps avoid putting instances with the same class label into different intervals.
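A condensed, 1R-style sketch in Python is given below; the minimum count per interval (min_size, set to 6 as in Holte's experiments) and the helper's name are our own illustrative choices, not a reference implementation:

```python
from collections import Counter

def one_r_discretize(values, labels, min_size=6):
    """1R-style supervised binning: grow each interval until its majority class
    has at least `min_size` instances, never split equal values across bins,
    and merge adjacent intervals that predict the same class.
    Returns the interior cut-points."""
    pairs = sorted(zip(values, labels))
    intervals = []                       # (closing value, majority class)
    counts = Counter()
    for i, (v, y) in enumerate(pairs):
        counts[y] += 1
        majority, majority_n = counts.most_common(1)[0]
        boundary_ok = (i + 1 == len(pairs)) or (pairs[i + 1][0] != v)
        if majority_n >= min_size and boundary_ok:
            intervals.append((v, majority))
            counts = Counter()
    if counts:                           # leftover tail forms a final interval
        intervals.append((None, counts.most_common(1)[0][0]))
    # Keep a cut-point only where the predicted class actually changes.
    return [intervals[j][0] for j in range(len(intervals) - 1)
            if intervals[j][1] != intervals[j + 1][1]]
```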

2) Entropy measure: Using entropy to choose split points for continuous features when constructing decision trees works well in practice, so it is natural to extend it to more general discretization. Discretization methods based on the entropy measure employ class information to compute and decide the splitting points; they are supervised, top-down splitting methods.
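For reference, the quantities involved are the standard ones (in the notation of Fayyad and Irani [17]): for a set of instances S with classes C_1, ..., C_k, a candidate cut-point T on attribute A is scored by

```latex
\mathrm{Ent}(S) = -\sum_{i=1}^{k} p(C_i, S)\,\log_2 p(C_i, S), \qquad
E(A, T; S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2),
```

where p(C_i, S) is the proportion of instances of class C_i in S, and S_1 and S_2 are the instances whose value of A lies below and above T. The cut-point chosen at each step is the one minimizing E(A, T; S), i.e. maximizing the information gain Ent(S) - E(A, T; S).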

ID3 [16] and C4.5 [10] are two popular decision-tree induction algorithms that use the entropy measure. One problem of the ID3 method is that the cut-points obtained are mainly suited to the classification algorithm itself. Nevertheless, it is the basis for many successors such as D2 [9], MDLP (the minimum description length principle) [17], and so on.

Figure 1. A hierarchical framework of discretization methods [6]

Unlike ID3 which binarizes a range of values while building a decision tree, D2 is a static method that discretizes the whole instance space. Instead of finding only one cut-point, it recursively binarizes ranges or sub-ranges until a stopping criterion is met. A stopping criterion is essential to avoid over-splitting. MDLP uses the minimum description length principle to determine when to stop discretization. It also suggests that potential cut-points are those that separate different class values [6].
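The recursive binarization shared by D2 and MDLP-based discretization can be sketched as follows (a hedged illustration: to keep it short, the MDLP stopping test of [17] is replaced by a simple minimum-gain threshold, and all names are ours):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Class entropy Ent(S) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut, gain) for the boundary minimizing the weighted class entropy."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    base, best = entropy(list(y)), (None, 0.0)
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                      # candidate cuts lie between distinct values
        left, right = list(y[:i]), list(y[i:])
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best[1]:
            best = ((v[i - 1] + v[i]) / 2.0, gain)
    return best

def entropy_discretize(values, labels, min_gain=0.05):
    """Recursively binarize the range until no cut improves entropy by min_gain."""
    cut, gain = best_cut(values, labels)
    if cut is None or gain < min_gain:
        return []
    values, labels = np.asarray(values), np.asarray(labels)
    mask = values <= cut
    return sorted(entropy_discretize(values[mask], labels[mask], min_gain)
                  + [cut]
                  + entropy_discretize(values[~mask], labels[~mask], min_gain))
```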

3) χ2 measure: The methods mentioned above are all top-down splitting discretization methods, while the χ2 measure leads to a bottom-up merging approach [7].

The most commonly used χ2-based discretization method is ChiMerge [18], a kind of automatic discretization algorithm [1]. When applying the statistical measure χ2 to test the dependence of two adjacent intervals, the user must predefine a significance level and then, using statistical knowledge, obtain the corresponding threshold value to compare with the computed values. A higher significance level for the χ2 test causes over-discretization, while a lower value causes under-discretization. Liu and Setiono [19] therefore improved ChiMerge in 1997 and formed a new algorithm called Chi2. In Chi2, the statistical significance level keeps declining step by step to merge more and more adjacent intervals as long as the inconsistency criterion is satisfied. The algorithm consists of two phases of discretization and takes the inconsistency measure as the stopping criterion.

The Chi2 algorithm, which uses the inconsistency measure to check the results of discretization, is much better than ChiMerge with its fixed significance level. However, Chi2 has drawbacks as well: a lower limit on the inconsistency rate is needed, and selecting the best limit requires numerous experiments and a considerable amount of work.
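The core of the ChiMerge loop can be sketched as below (hedged: a fixed χ2 threshold stands in for the significance-level lookup, and in Chi2 this threshold would instead be relaxed step by step while an inconsistency check holds; all names are illustrative):

```python
from collections import Counter

def chi2_pair(a, b, classes):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_n = sum(row.values())
        for c in classes:
            expected = max(row_n * (a[c] + b[c]) / total, 1e-6)  # guard empty classes
            stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold=4.6):
    """Bottom-up merging: start with one interval per distinct value and keep
    merging the adjacent pair with the smallest chi-square statistic until every
    pair exceeds `threshold` (4.6 ~ the 0.90 critical value for 2 degrees of
    freedom, i.e. three classes; purely illustrative)."""
    classes = sorted(set(labels))
    intervals = []                                   # (lower value, class counts)
    for v, y in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][y] += 1
        else:
            intervals.append((v, Counter({y: 1})))
    while len(intervals) > 1:
        stats = [chi2_pair(intervals[i][1], intervals[i + 1][1], classes)
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break                                    # all adjacent pairs differ enough
        intervals[i:i + 2] = [(intervals[i][0], intervals[i][1] + intervals[i + 1][1])]
    return [iv[0] for iv in intervals[1:]]           # cut-points: lower bounds of bins 2..m
```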

Besides the discretization methods mentioned above, there are also other methods, such as algorithms based on distance or other measures, and approaches that unite several single discretization methods into one comprehensive method. Due to the space limit of the paper, not every kind of discretization is introduced in detail.

IV. COMPARISON OF DISCRETIZATION METHODS

There are many discretization methods, but no two algorithms are alike: every algorithm has its own characteristics and performs well in different situations. Ying Yang and Xindong Wu [3] compared 15 discretization methods from the viewpoint of algorithm taxonomy in 2007; Liu and Hussain evaluated the results of many discretization algorithms experimentally in terms of accuracy, time for discretization, time for learning, and understandability [6].

This paper provides an explicit comparison of ten discretization methods according to the following aspects:

• Merging or splitting. Merging methods start with the complete list of all the continuous values of the feature as cut-points and remove some of them by 'merging' intervals as the discretization progresses. Splitting methods start with an empty list of cut-points (or split-points) and keep adding new ones to the list by 'splitting' intervals as the discretization progresses.

• Use of class information or not.

• Stopping criterion. Different discretization methods have diverse stopping criteria.

• Sensitivity to outliers.

• Simplicity of operation. Some of the methods are quite easy to apply.

• Uniformity of the results among intervals. For example, equal-width methods will probably leave some bins containing no instances at all while others have too many.

• Consideration of the values of the feature or not. The majority of methods discretize according to the values of the feature; other methods ignore the values and discretize only with the class labels.

• Grouping instances with the same value and the same class label into the same interval or not.

Every discretization method has its own strengths and weaknesses and is suited to different situations.

A comparative analysis of the common discretization methods mentioned above is displayed in Table I. In the table, Y denotes yes and N denotes no. Further studies of these methods from other aspects are certainly possible.

V. DISCRETIZATION AND LEARNING CONTEXT

Although various discretization methods are available, no single discretization method can guarantee the best result for all data sets and all algorithms; optimal discretization is an NP-hard problem by nature, and different methods are tuned to different types of learning context. As a result, we should choose an appropriate discretization method based on the characteristics of the data set, the learning context, and the user's preferences and understanding of the problem. For example, decision tree learners can suffer from the fragmentation problem, and hence they may benefit more than other learners from discretization that results in few intervals; decision rule learners require pure intervals (containing instances dominated by a single class); association rule learners value the relations between attributes, and thus they desire multivariate discretization that can capture the inter-dependencies among attributes [3].

Take association rule analysis for example. When discretizing for association rule analysis, users should focus on the relations between features instead of discretizing each feature in isolation. Isolated discretization of single features tends to induce imperfect association rules, which affects the correctness and effectiveness of the results. Therefore, association rule analysis prefers multivariate discretization.



TABLE I. COMPARATIVE ANALYSIS OF DISCRETIZATION METHODS

Algorithm       | Merge or split | Use class information | Stopping criterion                 | Sensitive to outliers | Simplicity of operation | Uniformity of the results | Use of values | Same values into different intervals
Equal-width     | Split          | N                     | Fixed bin number                   | Y                     | N                       | Y                         | Y             | N
Equal-frequency | Split          | N                     | Fixed bin number                   | N                     | Y                       | N                         | Y             | Y
Direct-viewing  | Split          | N                     | 3-4-5 criterion                    | N                     | Y                       | Y                         | Y             | N
Cluster         | Both           | N                     | Fixed cluster number               | Y                     | Y                       | Y                         | Y             | N
1R              | Split          | Y                     | Minimum instances in each interval | N                     | Y                       | N                         | Y             | Y
ID3             | Split          | Y                     | One class label in one interval    | N                     | Y                       | Y                         | Y             | N
D2              | Split          | Y                     | Special criterion [6]              | N                     | Y                       | Y                         | Y             | N
MDLP            | Split          | Y                     | MDLP                               | N                     | Y                       | Y                         | Y             | N
ChiMerge        | Merge          | Y                     | Threshold                          | N                     | Y                       | Y                         | Y             | N
Chi2            | Merge          | Y                     | Inconsistency                      | N                     | Y                       | Y                         | Y             | N

Furthermore, the data sets used in association analysis sometimes contain no class information, so unsupervised discretization methods must be employed. However, research on unsupervised discretization is at present still relatively limited. Therefore, the discretization problem in association rule analysis is an area worthy of further study, and progress on it will contribute greatly to association rule analysis.

Besides, other algorithms and learning contexts likewise require the user to analyze the demands of the task and choose an appropriate discretization method.

VI. SUMMARY

Discretization of continuous features plays an important role in data pre-processing. This paper briefly introduces the discretization problem and the many benefits it brings, including improving the efficiency of algorithms and expanding their application scope. There have been diverse taxonomies in the existing literature to classify discretization methods. The ideas and drawbacks of some typical methods are described in detail under the supervised and unsupervised categories, and these methods are then compared. Not all discretization methods are covered, due to the paper's limited space.

No single discretization method can guarantee the best result for all data sets and all algorithms, so it is of vital importance in practice to select a proper method depending on the data set and the learning context. Discretization for association rule analysis needs to take into account both the inter-dependence among features and the frequent absence of class information, which makes it a complicated problem. The choice of the most appropriate discretization method for association analysis is proposed as future research.

REFERENCES

[1] Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press, 2003.
[2] Zhang Yong, Ding Hongchang, "The Methods of Max Diff Discretization of Continuous Features", Engineering and Application of Computer, Vol. 43(19), 2007, pp. 43-47.
[3] Ying Yang and Xindong Wu, Discretization Methods, 2nd edn. Cambridge, MA: MIT Press, 2007.



[4] Liu Yezheng, Jiao Ning, "Study on Discretization Methods", Research of Computer Application, Vol. 24(9), Sept. 2007.
[5] Dougherty, J., Kohavi, R., and Sahami, M., "Supervised and unsupervised discretization of continuous features", In Proc. Twelfth International Conference on Machine Learning. Los Altos, CA: Morgan Kaufmann, 1995, pp. 194-202.
[6] Huan Liu, Farhad Hussain, Chew Lim Tan, Manoranjan Dash, "Discretization: An Enabling Technique", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 2002(6), pp. 393-423.
[7] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques (2nd edition). Machinery Industry Press, 2004.
[8] Ian H. Witten, Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Machinery Industry Press, 2005.
[9] Catlett, J., "On changing continuous attributes into ordered discrete attributes", In Proc. Fifth European Working Session on Learning. Berlin: Springer-Verlag, 1991, pp. 164-177.
[10] Quinlan, J.R., C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[11] Bay, S. D., "UCI KDD Archive". Department of Information and Computer Sciences, University of California, Irvine, 2000.
[12] Robert, S., "Analyzing discretization of continuous attributes given a monotonic discrimination function", Intelligent Data Analysis, 1997, 1:157-179.
[13] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining. Posts & Telecom Press, 2006.
[14] Holte, R.C., "Very simple classification rules perform well on most commonly used datasets", Machine Learning, 1993, pp. 63-90.
[15] Shannon, C. and Weaver, W., The Mathematical Theory of Information. Urbana: University of Illinois Press, 1949.
[16] Quinlan, J.R., "Induction of decision trees", Machine Learning, 1986, 1, pp. 81-106.
[17] Fayyad, U. and Irani, K., "Multi-interval discretization of continuous-valued attributes for classification learning", In Proc. Thirteenth International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 1993, pp. 1022-1027.
[18] Kerber, R., "ChiMerge: Discretization of numeric attributes", In Proc. AAAI-92, Ninth National Conference on Artificial Intelligence. AAAI Press/The MIT Press, 1992, pp. 123-128.
[19] Huan Liu, Rudy Setiono, "Feature selection via discretization", IEEE Transactions on Knowledge and Data Engineering, 1997, 9(4), pp. 642-645.


