

Effective Data Mining Using Neural Networks

Hongjun Lu, Member, IEEE Computer Society, Rudy Setiono, and Huan Liu, Member, IEEE

IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996

The authors are with the Department of Information Systems and Computer Science, National University of Singapore, Lower Kent Ridge Rd., Singapore 119260. E-mail: {luhj, rudys, liuh}@iscs.nus.sg. Manuscript received Aug. 28, 1996.

Abstract-Classification is one of the data mining problems receiving great attention recently in the database community. This paper presents an approach to discover symbolic classification rules using neural networks. Neural networks have not been thought suited for data mining because how the classifications were made is not explicitly stated as symbolic rules that are suitable for verification or interpretation by humans. With the proposed approach, concise symbolic rules with high accuracy can be extracted from a neural network. The network is first trained to achieve the required accuracy rate. Redundant connections of the network are then removed by a network pruning algorithm. The activation values of the hidden units in the network are analyzed, and classification rules are generated using the result of this analysis. The effectiveness of the proposed approach is clearly demonstrated by the experimental results on a set of standard data mining test problems.

Index Terms-Data mining, neural networks, rule extraction, network pruning, classification.

1 INTRODUCTION

One of the data mining problems is classification. Various classification algorithms have been designed to tackle the problem by researchers in different fields such as mathematical programming, machine learning, and statistics. Recently, there has been a surge of data mining research in the database community, and the classification problem has been re-examined in the context of large databases. Unlike researchers in other fields, database researchers pay more attention to the issues related to the volume of data. They are also concerned with the effective use of available database techniques, such as efficient data retrieval mechanisms. With such concerns, most of the proposed algorithms are based on decision trees. The general impression is that neural networks are not well suited for data mining. The major criticisms include the following:

1) Neural networks learn the classification rules by many passes over the training data set, so the learning time of a neural network is usually long.

2) A neural network is usually a layered graph with the output of one node feeding into one or many other nodes in the next layer. The classification process is buried in both the structure of the graph and the weights assigned to the links between the nodes. Articulating the classification rules becomes a difficult problem.

3) For the same reason, it is rather difficult to incorporate available domain knowledge into a neural network.

On the other hand, the use of neural networks in classification is not uncommon in the machine learning community [5]. In some cases, neural networks give a lower classification error rate than decision trees but require longer learning time [7], [8]. In this paper, we present our results from applying neural networks to mine classification rules for large databases [4], with the focus on articulating the classification rules represented by neural networks. The contributions of our study include the following:

1) Different from previous research work that excludes neural network based approaches entirely, we argue that these approaches have their place in data mining because of their merits, such as low classification error rates and robustness to noise.

2) With our rule extraction algorithms, symbolic classification rules can be extracted from a neural network. The rules usually have a classification error rate comparable to that of rules generated by decision tree based methods. For a data set with a strong relationship among attributes, the rules extracted are generally more concise.

3) A data mining system based on neural networks has been developed. The system successfully solves a number of classification problems in the literature.

    Our neural network based data mining approach consists ofthree major phases:

Network construction and training. This phase constructs and trains a three-layer neural network based on the number of attributes, the number of classes, and the chosen input coding method.

Network pruning. The pruning phase aims at removing redundant links and units without increasing the classification error rate of the network. The small number of units and links left in the network after pruning enables us to extract concise and comprehensible rules.

Rule extraction. This phase extracts the classification rules from the pruned network. The rules generated are in the form of "if (a1 θ v1) and (a2 θ v2) and ... and (an θ vn) then Cj," where the ai are the attributes of an input tuple, the vi are constants, the θ are relational operators (=, ≤, ≥, <>), and Cj is one of the class labels; a sketch of this rule form in code follows.
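Since every rule the approach produces has this fixed if-then shape, it can be represented directly as data. The sketch below is ours rather than the paper's (the paper gives no code); Condition, Rule, and classify are hypothetical names, and Python is used throughout for illustration:

import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Condition:
    # One condition (attribute OP value), e.g., age >= 60 or elevel = 1.
    attribute: str
    op: Callable[[float, float], bool]   # one of =, <=, >=, <>
    value: float

    def holds(self, tup: Dict[str, float]) -> bool:
        return self.op(tup[self.attribute], self.value)

@dataclass
class Rule:
    # if cond_1 and cond_2 and ... and cond_n then class_label
    conditions: List[Condition]
    class_label: str

    def fires(self, tup: Dict[str, float]) -> bool:
        return all(c.holds(tup) for c in self.conditions)

def classify(rules: List[Rule], default: str, tup: Dict[str, float]) -> str:
    # Try the rules in order; the default rule covers everything else.
    for r in rules:
        if r.fires(tup):
            return r.class_label
    return default

# Example: one of the Function 3 rules derived later in the paper.
r1 = Rule([Condition("age", operator.ge, 60),
           Condition("elevel", operator.ge, 2)], "Group A")
print(classify([r1], "Group B", {"age": 65, "elevel": 3}))  # -> Group A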


Rule extraction algorithm (RX):

1) Apply a clustering algorithm to find clusters of hidden node activation values.

2) Enumerate the discretized activation values and compute the network outputs. Generate rules that describe the network outputs in terms of the discretized hidden unit activation values.

3) For each hidden unit, enumerate the input values that lead to each discretized activation value, and generate a set of rules to describe the hidden units' discretized values in terms of the inputs.

4) Merge the two sets of rules obtained in the previous two steps to obtain rules that relate the inputs and outputs (a sketch of this substitution step follows the list).
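Step 4 is essentially a substitution: each condition on a hidden unit's discretized activation value is replaced by the input conditions that produce that value. A minimal sketch of this merge, under our reading of the step (the function name and data layout are assumptions, not the paper's):

from itertools import product

def merge_rules(output_rules, hidden_rules):
    # output_rules: list of ({hidden_unit: discretized_value}, class_label).
    # hidden_rules: {(hidden_unit, discretized_value): [{input: 0 or 1}, ...]},
    # where each dict is one alternative input condition for that value.
    merged = []
    for unit_conds, label in output_rules:
        # Combine the alternatives for every referenced hidden unit.
        alternatives = [hidden_rules[(u, v)] for u, v in unit_conds.items()]
        for combo in product(*alternatives):
            inputs, consistent = {}, True
            for cond in combo:
                for name, bit in cond.items():
                    if inputs.setdefault(name, bit) != bit:
                        consistent = False   # contradictory input requirement
            if consistent:
                merged.append((inputs, label))
    return merged

# Toy usage: the output is class 1 when hidden unit h1 falls in cluster 1,
# and h1 falls in cluster 1 when I1 = 1 and I2 = 0.
print(merge_rules([({"h1": 1}, 1)], {("h1", 1): [{"I1": 1, "I2": 0}]}))
# -> [({'I1': 1, 'I2': 0}, 1)]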


The first step of RX clusters the activation values of hidden units into a manageable number of discrete values without sacrificing the classification accuracy of the network. After clustering, we obtain a set of activation values at each hidden node. The second step is to relate these discretized activation values with the output layer activation values, i.e., the class labels. The third step is to relate them with the attribute values at the nodes connected to the hidden node. A general purpose algorithm, X2R, was developed and implemented to automate the rule generation process. It takes as input a set of discrete patterns with their class labels and produces the rules describing the relationship between the patterns and their class labels. The details of this rule generation algorithm can be found in our earlier work [3].

To cluster the activation values, we used a simple clustering algorithm which consists of the following steps (a rendering in code follows):

1) Find the smallest integer d such that, if all the network activation values are rounded to d decimal places, the network still retains its accuracy rate.

2) Represent each activation value a by the integer nearest to a x 10^d. Let Hi = {h1, h2, ..., hk} be the set of these discrete values at hidden unit i. Set i = 1.

3) Sort the set Hi so that its values are in increasing order.

4) Find a pair of distinct adjacent values hj and hj+1 in Hi such that, if hj+1 is replaced by hj, no conflicting data will be generated.

5) If such a pair exists, replace hj+1 by hj and repeat Step 4. Otherwise, set i = i + 1; if i does not exceed the number of hidden units, go to Step 3.
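Read this way, the procedure is a greedy merge over the sorted discrete values, one hidden unit at a time. The Python sketch below follows our reconstruction of the garbled steps; the conflict test checks whether a merge would give two tuples of different classes the same discretized activation vector, and all names are ours:

def cluster_activations(activations, labels, keeps_accuracy, max_d=6):
    # activations: one list of hidden-unit activation values per tuple.
    # labels: the class label of each tuple.
    # keeps_accuracy(vectors) -> True if the network still classifies
    # with the required accuracy when activations are replaced.
    # Step 1: smallest d such that d-decimal rounding keeps the accuracy.
    for d in range(max_d + 1):
        if keeps_accuracy([[round(a, d) for a in v] for v in activations]):
            break
    # Step 2: represent each rounded value by an integer.
    vecs = [[int(round(a * 10 ** d)) for a in v] for v in activations]

    def conflict(vs):
        seen = {}
        for v, y in zip(vs, labels):
            if seen.setdefault(tuple(v), y) != y:
                return True   # same discrete vector, different classes
        return False

    for i in range(len(vecs[0])):                  # one hidden unit at a time
        values = sorted(set(v[i] for v in vecs))   # Step 3
        j = 0
        while j + 1 < len(values):
            lo, hi = values[j], values[j + 1]      # Step 4: adjacent pair
            trial = [v[:i] + [lo if v[i] == hi else v[i]] + v[i + 1:]
                     for v in vecs]
            if not conflict(trial):                # Step 5: merge or advance
                vecs = trial
                values.pop(j + 1)                  # hi merged into lo
            else:
                j += 1
    return vecs

# Toy usage: two tuples, one hidden unit, permissive accuracy check.
print(cluster_activations([[0.91], [0.12]], ["A", "B"], lambda vs: True))
# -> [[1], [0]]  (merging would conflate classes A and B)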

This algorithm tries to merge as many values as possible into one interval in every step, as long as the merge does not generate conflicting data, i.e., two sets of activation values that fall into the same interval but represent tuples belonging to different classes. The algorithm is order sensitive, that is, the resulting clusters depend on the order in which the hidden units are clustered.

2.2 An Illustrative Example

In [1], 10 classification problems are defined on datasets having nine attributes: salary, commission, age, elevel, car, zipcode, hvalue, hyears, and loan. We use one of them, Function 3, as an example to illustrate how classification rules can be generated from a network. It classifies tuples into two groups according to the following definition: tuples with ((age < 40) and (elevel ∈ [0, 1])), or ((40 ≤ age < 60) and (elevel ∈ [1, 3])), or ((age ≥ 60) and (elevel ∈ [2, 4])) belong to Group A; all other tuples are labeled Group B.
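Stated in code, the reconstructed Function 3 definition is just a two-attribute decision (a sketch; the actual generator in [1] draws all nine attributes and adds perturbation):

def function3_label(age, elevel):
    # Group A per Function 3 of Agrawal et al. [1]; all others are Group B.
    if age < 40:
        return "A" if elevel in (0, 1) else "B"
    if age < 60:
        return "A" if elevel in (1, 2, 3) else "B"
    return "A" if elevel in (2, 3, 4) else "B"

print(function3_label(35, 1), function3_label(65, 0))  # -> A B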


To solve the problem, a neural network was first constructed. Each attribute value was coded as a binary string for use as input to the network. The binary strings used to represent the attribute values are shown in Table 1. The thermometer coding scheme was used for the unary representations of the continuous attributes: each bit of a string was either 0 or 1, depending on which subinterval the original value was located in. For example, a salary value of 140k would be coded as {1, ..., 1, 1, 1, 1} and a value of 100k as {0, ..., 1, 1, 1, 1}. Discrete attributes were coded similarly; for the attribute elevel, for example, an elevel of 0 would be coded as {0, 0, 0, 0}, an elevel of 1 as {0, 0, 0, 1}, etc.

TABLE 1
CODING OF THE ATTRIBUTES FOR NEURAL NETWORK INPUT
(The body of this table did not survive the scan; it lists, for each of the nine attributes, the value ranges and the subintervals used for the binary coding.)


Fig. 1. A pruned network for Function 3. (The figure shows the remaining inputs I11, I13, I16, I18, and I19, two hidden units with their positive- and negative-weight links, and a bias unit.)

For hidden unit 1, X2R took as its input 4-bit binary strings (I11, I13, I16, I18). Because of the coding scheme used (cf. Table 2), there were not 16, but only nine possible combinations of 0-1 values for these strings. The class label for each of these input strings was either 1 or 2, depending on the activation value. Three strings labeled 1 corresponded to original input tuples that had activation values in the interval [-1, 0.46). The remaining six strings were labeled 2, since they represented inputs that had activation values in the second subinterval, [0.46, 1]. The rules generated were as follows:

Rule 1. If (I11 = I13 = I18 = 1), then α1 = 1;
Rule 2. If (I13 = I18 = 1 and I16 = 0), then α1 = 1.
Default rule. α1 = 2.

TABLE 2
THE CODING SCHEME FOR ATTRIBUTES AGE AND ELEVEL

age      | I10 I11 I12 I13 I14 I15
[20, 30) |   0   0   0   0   0   1
[30, 40) |   0   0   0   0   1   1
[40, 50) |   0   0   0   1   1   1
[50, 60) |   0   0   1   1   1   1
[60, 70) |   0   1   1   1   1   1
[70, 80] |   1   1   1   1   1   1

elevel   | I16 I17 I18 I19
0        |   0   0   0   0
1        |   0   0   0   1
2        |   0   0   1   1
3        |   0   1   1   1
4        |   1   1   1   1
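The codes in Table 2 can be generated mechanically: in thermometer coding a bit is 1 exactly when the value has reached that bit's threshold, so larger values light up longer runs of 1s from the right. A small sketch (the thresholds are read off Table 2; the function name is ours):

def thermometer(value, thresholds):
    # One bit per threshold, highest first; a bit is 1 when the value
    # has reached that threshold (cf. Table 2).
    return [1 if value >= t else 0 for t in thresholds]

age_bits = thermometer(65, [70, 60, 50, 40, 30, 20])   # I10..I15
elevel_bits = thermometer(3, [4, 3, 2, 1])             # I16..I19
print(age_bits, elevel_bits)   # -> [0, 1, 1, 1, 1, 1] [0, 1, 1, 1]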

Similarly, for hidden unit 2, the input for X2R was the binary string (I11, I13, I18, I19). There were also nine possible combinations of 0-1 values that could be taken by the four inputs. Three input strings were labeled 1, while the rest were labeled 2. X2R generated the following rules:

Rule 1. If (I11 = I18 = 0 and I19 = 1), then α2 = 1;
Rule 2. If (I13 = I19 = 0), then α2 = 1.
Default rule. α2 = 2.

The conditions of the rules that determine the output in terms of the activation values can now be rewritten in terms of the inputs. After removing some redundant conditions, the following set of rules for Function 3 was obtained:


R1. If I11 = I18 = 1, then Group A.
R2. If I13 = I18 = 1 and I16 = 0, then Group A.
R3. If I11 = I18 = 0 and I19 = 1, then Group A.
R4. If I13 = I19 = 0, then Group A.
Default rule. Group B.

Hence, from the pruned network, we extracted five rules with a total of 10 conditions. To express the rules in terms of the original attribute values, we refer to Table 2, which shows the binary representations of the attributes age and elevel. Referring to Table 2, the above set of rules is equivalent to:

R1. If age ≥ 60 and elevel ∈ [2, 3, 4], then Group A.
R2. If age ≥ 40 and elevel ∈ [2, 3], then Group A.
R3. If age < 60 and elevel = 1, then Group A.
R4. If age < 40 and elevel = 0, then Group A.
Default rule. Group B.
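Because only age and elevel survive pruning, the equivalence of the two rule sets can be checked by brute force over one representative age per interval and all five elevel values. A sketch combining the Table 2 codes with the Function 3 definition from Section 2.2:

def rules_label(I11, I13, I16, I18, I19):
    # R1-R4 extracted from the pruned network; the default is Group B.
    if I11 == 1 and I18 == 1:              return "A"   # R1
    if I13 == 1 and I18 == 1 and I16 == 0: return "A"   # R2
    if I11 == 0 and I18 == 0 and I19 == 1: return "A"   # R3
    if I13 == 0 and I19 == 0:              return "A"   # R4
    return "B"

def function3(age, elevel):
    # Function 3 of Agrawal et al. [1], as in Section 2.2.
    if age < 40: return "A" if elevel <= 1 else "B"
    if age < 60: return "A" if 1 <= elevel <= 3 else "B"
    return "A" if elevel >= 2 else "B"

for age in (25, 35, 45, 55, 65, 75):       # one value per age interval
    for elevel in range(5):
        bits = (int(age >= 60), int(age >= 40),                    # I11, I13
                int(elevel >= 4), int(elevel >= 2), int(elevel >= 1))
        assert rules_label(*bits) == function3(age, elevel)
print("R1-R4 plus the default rule reproduce Function 3 on all 30 cases")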

It is worth highlighting the significant difference between the decision tree based algorithms and the neural network approach for classification. Fig. 2 depicts the decision boundaries formed by the network's hidden units and output unit. Decision tree algorithms split the input space by generating two new nodes in the context of binary data. As the path between the root node and a new node gets longer, the number of tuples becomes smaller. In contrast, the decision boundaries of the network are formed by considering all the available input tuples as a whole. The consequence of this fundamental difference between the two learning approaches is that a neural network can be expected to generate fewer rules than decision trees do; however, the number of conditions per rule will generally be higher than in the decision trees' rules.

3 EXPERIMENTAL RESULTS

To test the neural network approach more rigorously, a series of experiments was conducted to solve the problems defined by Agrawal et al. [1].


Fig. 2. Decision boundaries formed by the units of the pruned network for Function 3. (Two panels, one per hidden unit, plot the boundaries over elevel and age.)

Among the 10 functions described in [1], we found that Functions 8 and 10 produced highly skewed data that made classification less meaningful. We will only discuss the functions other than these two.

The values of the attributes of each tuple were generated randomly according to the distributions given in [1]. Following Agrawal et al. [1], we also included a perturbation factor as one of the parameters of the random data generator. This perturbation factor was set at 5 percent. For each tuple, a class label was determined according to the rules that define the function. For each problem, 3,000 tuples were generated. We used three-fold cross validation to obtain an estimate of the classification accuracy of the rules generated by the algorithm.

Table 3 summarizes the results of the experiments. In this table, we list the average accuracy of the extracted rules on the test set, the average number of rules extracted, and the average number of conditions in a rule. These averages are obtained from 3 x 10 neural networks. Error due to perturbation in the test data set was subtracted from the total error, and the average accuracy rates shown in the table reflect this correction.

For comparison, we have also run C4.5 [6] on the same data sets. Classification rules were generated from the trees by C4.5rules. The same binary coded data used for the neural networks were used for C4.5 and C4.5rules. Figs. 3-5 show the accuracy, the number of rules, and the average number of conditions in the rules generated by the two approaches. We can see that the two approaches are comparable in accuracy for most functions except for Function 4. On the other hand, while the average number of conditions in both rule sets is almost the same for all the functions, the number of rules generated by the neural network approach is smaller than that of C4.5rules. For Functions 1 and 9, C4.5 generated as many as five times the rules. For all other functions except for Function 5, C4.5 also generated twice as many rules compared to the neural network based approach.

TABLE 3
AVERAGES OF ACCURACY RATES ON THE TEST SET, THE NUMBER OF RULES, AND THE AVERAGE CONDITIONS PER RULE, OBTAINED FROM 30 NETWORKS

Func. | Accuracy      | No. of rules  | No. of conditions
1     | 99.91 (0.36)  | 2.03 (0.18)   | 4.37 (0.66)
2     | 98.13 (0.78)  | 7.13 (1.22)   | 3.18 (0.28)
3     | 98.18 (1.56)  | 6.70 (1.15)   | 4.17 (0.88)
4     | 95.45 (0.94)  | 13.37 (2.39)  | 4.68 (0.87)
5     | 97.16 (0.86)  | 24.40 (10.18) | 4.61 (1.02)
6     | 90.78 (0.43)  | 13.13 (3.72)  | 2.94 (0.32)
7     | 90.50 (0.92)  | 7.43 (1.76)   |
9     | 90.86 (0.60)  | 9.03 (1.65)   | 3.46 (0.36)

Standard deviations appear in parentheses.
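The evaluation protocol is easy to restate in code. A sketch of the three-fold split over 3,000 generated tuples (the uniform age range and the rest of the setup are illustrative assumptions; the actual attribute distributions and the 5 percent perturbation are specified in [1]):

import random

def generate_data(n=3000, seed=0):
    # Draw age and elevel and label each tuple with Function 3.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        age, elevel = rng.uniform(20, 80), rng.randrange(5)
        group_a = ((age < 40 and elevel <= 1) or
                   (40 <= age < 60 and 1 <= elevel <= 3) or
                   (age >= 60 and elevel >= 2))
        data.append((age, elevel, "A" if group_a else "B"))
    return data

def three_fold_accuracy(data, train_and_test):
    # train_and_test(train, test) -> accuracy on `test` of the rules
    # extracted from a network trained on `train`.
    folds = [data[i::3] for i in range(3)]
    scores = []
    for i in range(3):
        test = folds[i]
        train = [t for j in range(3) if j != i for t in folds[j]]
        scores.append(train_and_test(train, test))
    return sum(scores) / 3.0

# Toy usage with a stand-in "model" that always predicts Group B:
always_b = lambda train, test: sum(y == "B" for _, _, y in test) / len(test)
print(round(three_fold_accuracy(generate_data(), always_b), 3))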

4 CONCLUSIONS

In this paper, we present a neural network based approach to mining classification rules from given databases. The approach consists of three phases:

1) constructing and training a network to correctly classify tuples in the given training data set to the required accuracy,

2) pruning the network while maintaining the classification accuracy, and

3) extracting symbolic rules from the pruned network.

A set of experiments was conducted to test the proposed approach using a well-defined set of data mining problems. The results indicate that, using the proposed approach, high quality rules can be discovered from the given data sets.


Fig. 3. Accuracy of the rules extracted from neural networks (NN) and by C4.5rule (DT). (Bar chart comparing the two approaches for Functions 1-7 and 9.)

The work reported here is our attempt to apply the connectionist approach to data mining to generate rules similar to those of decision trees. A number of related issues remain to be studied further. One of the issues is how to reduce the training time of neural networks. Although we have been improving the speed of network training by developing fast algorithms, the time required to extract rules by our neural network approach is still longer than the time needed by a decision tree based approach such as C4.5. As the long initial training time of a network may be tolerable in some cases, it is desirable to have incremental training and rule extraction during the lifetime of an application database. With incremental training that requires less time, the accuracy of the extracted rules can be improved along with changes of the database contents. Another possibility to reduce the training time and improve the classification accuracy is to reduce the number of input units of the networks by feature selection.

ACKNOWLEDGMENTS

This work is partly supported by the National University Research Grant RP950660. An early version of this paper appeared in the Proceedings of VLDB '95.

Fig. 4. The number of the rules extracted from neural networks (NN) and by C4.5rule (DT). (Bar chart comparing the two approaches for Functions 1-7 and 9.)

Fig. 5. The number of conditions per neural network rule (NN) and C4.5rule (DT). (Bar chart comparing the two approaches for Functions 1-7 and 9.)

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Database Mining: A Performance Perspective," IEEE Trans. Knowledge and Data Eng., vol. 5, no. 6, Dec. 1993.
[2] J.E. Dennis Jr. and R.B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Englewood Cliffs, N.J.: Prentice Hall, 1983.
[3] H. Liu and S.T. Tan, "X2R: A Fast Rule Generator," Proc. IEEE Int'l Conf. Systems, Man, and Cybernetics, IEEE, 1995.
[4] H. Lu, R. Setiono, and H. Liu, "NeuroRule: A Connectionist Approach to Data Mining," Proc. VLDB '95, pp. 478-489, 1995.
[5] D. Michie, D.J. Spiegelhalter, and C.C. Taylor, Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence, 1994.
[6] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[7] J.R. Quinlan, "Comparing Connectionist and Symbolic Learning Methods," Computational Learning Theory and Natural Learning Systems, S.J. Hanson, G.A. Drastal, and R.L. Rivest, eds., vol. 1, pp. 445-456. A Bradford Book, MIT Press, 1994.
[8] J.W. Shavlik, R.J. Mooney, and G.G. Towell, "Symbolic and Neural Learning Algorithms: An Experimental Comparison," Machine Learning, vol. 6, no. 2, pp. 111-143, 1991.
[9] R. Setiono, "A Neural Network Construction Algorithm which Maximizes the Likelihood Function," Connection Science, vol. 7, no. 2, pp. 147-166, 1995.
[10] R. Setiono, "A Penalty Function Approach for Pruning Feedforward Neural Networks," Neural Computation, vol. 9, no. 1, pp. 301-320, 1997.