
A Mass Assignment Based ID3 Algorithm for Decision Tree Induction

J. F. Baldwin,* J. Lawry, and T. P. Martin
A. I. Group, Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, United Kingdom

*Author to whom correspondence should be addressed.

A mass assignment based ID3 algorithm for learning probabilistic fuzzy decision trees is introduced. Fuzzy partitions are used to discretize continuous feature universes and to reduce complexity when universes are discrete but with large cardinalities. Furthermore, the fuzzy partitioning of classification universes facilitates the use of these decision trees in function approximation problems. Generally the incorporation of fuzzy sets into this paradigm overcomes many of the problems associated with the application of decision trees to real-world problems. The probabilities required for the trees are calculated according to mass assignment theory applied to fuzzy labels. The latter concept is introduced to overcome computational complexity problems associated with higher dimensional mass assignment evaluations on databases. © 1997 John Wiley & Sons, Inc.

I. INTRODUCTION

The ID3 algorithm introduced by Quinlan (Ref. 1) has proved to be an effective and popular method for finding decision tree rules to express information contained implicitly in discrete valued data sets. There are, however, a number of well-known difficulties associated with the application of this method to real-world problems. For instance, the decision tree generated is equivalent to a set of first-order logic conditionals each of which is true for every element of the data set. In other words, if we generate classification rules relating a set of classes to values of some set of attributes then correct classification is guaranteed for each element in the training set. A natural consequence of this property is that classical ID3 is inappropriate for databases containing significant noise, since the generated rules will then fit the noise and this may lead to a high error rate when classifying unseen cases. Furthermore, classification problems in practice often have continuous attribute values associated with them, necessitating the partitioning of the relevant universes if ID3 is to be applied. This is essentially the approach adopted in the C4.5 algorithm (Ref. 2), a successor to ID3, where the universe of a continuous attribute A is partitioned by the two sets A > a and A ≤ a for some parameter a. The use of crisp partitions in this case can be problematic since sudden and inappropriate changes to the assigned class may result from small changes in attribute values. Clearly such behavior will reduce the generalization capabilities of the system. A further limitation is the inability to utilize or classify data points where some of the attribute values have not been specified, although in the C4.5 algorithm this problem is partially overcome by exploring all possible branches of the tree consistent with such a point and then combining the results. Finally, the requirement that the set of classification values be finite and mutually exclusive means that classical ID3 and C4.5 cannot be applied to more general problems such as function approximation or the generation of rules to summarize information stored in large databases.

The use of fuzzy sets to partition universes can have significant advantages over more traditional approaches and, when combined with classical decision tree induction methods, can help to address many of the difficulties discussed above. In particular, fuzzy decision rules tend to be more robust and less sensitive to small changes in attribute values near partition boundaries. Also such rules will tend to have greater generalization capabilities than their crisp counterparts since the requirement of 100% correct classification of the training set has been relaxed. The concept of a fuzzy partition (see Ref. 3) allows us to incorporate both overlapping classification and attribute classes into our induction model in a coherent way. This can have advantages in terms of tree complexity since empirical evidence suggests that this less restrictive notion of partition enables fewer attribute classes to be used. In addition, many problems can best be expressed using concepts most naturally corresponding to overlapping classes and hence in this sense fuzzy partitions can facilitate the generation of rules more easily understood by humans. Of course, there is another way in which the incorporation of fuzzy sets can produce more "human" decision rules since they are able to model vague concepts such as those found in natural language. This is particularly useful in the case of continuous variables where it can be helpful to give linguistic labels to the fuzzy sets such as, for example, high, medium, and low. A further advantage of fuzzy partitions is that the inherent interpolation properties of smooth fuzzy sets enable the decision tree to be used, in conjunction with a defuzzification method, for function approximation.

In the sequel we describe a method for generating decision trees with fuzzy attribute and classification values. The decision trees generated are probabilistic classifiers analogous to those suggested by Quinlan in Ref. 4, where the probabilities are calculated according to the mass assignment semantics for fuzzy sets developed by Baldwin (see Refs. 5 and 6). This algorithm has been implemented in Fril (Ref. 7), which is a logic programming style language with built-in capabilities for processing both probabilistic and fuzzy uncertainty. The decision trees can be represented in terms of Fril extended rules, the syntax and semantics of which will be described in a later section.


II. FUZZY PARTITIONING

The need to partition universes introduces a new problem into decision tree induction; namely, how to decide on the exact form of a partition for any given variable. Often in fuzzy systems the choice of fuzzy sets is left to the user and this has mainly been the case for other fuzzy ID3 algorithms, for example Umano et al. (Ref. 8), although Zeidler and Schlosser (Ref. 9) have proposed a method for automatically generating trapezoidal fuzzy sets. However, given that one of the original motivations for decision tree induction methods was to enable unsupervised knowledge acquisition from a set of concrete examples, this state of affairs seems somewhat unsatisfactory. In this section we shall discuss a number of heuristics for generating fuzzy partitions, although much work still remains to be done in this area. Initially, however, we introduce the formal concept of a fuzzy partition.

The notion of a partition of a universe has been extended to fuzzy sets by Ruspini (Ref. 3) as follows:

DEFINITION 2.0.1. The set of fuzzy sets {f_1, ..., f_n} form a fuzzy partition of the universe Ω iff

$$\forall x \in \Omega \qquad \sum_{i=1}^{n} \chi_{f_i}(x) = 1$$

It can easily be seen that if f_1, ..., f_n are restricted to crisp sets then this corresponds to the standard definition of a partition of Ω. Fuzzy partitions have a number of desirable properties but the following result for probabilities of fuzzy events is of particular importance to this work.

If f_1, ..., f_n is a fuzzy partition then we have that $\sum_{i=1}^{n} \mathrm{Prob}(f_i) = 1$, where for a fuzzy set f, Prob(f) is taken to be the expected value of χ_f, the membership function of f, with respect to a given distribution over the universe (see Ref. 10 for details).

In the current context the problem of fuzzy partitioning universes is naturally divided into two distinct subproblems. These are the partitioning of universes of classification values and the partitioning of attribute universes. For classification universes we use an algorithm originally developed for the Fril data browser (Ref. 11) based on the heuristic that the classification values generated by the data set should be evenly distributed across the partition sets. A number of partition points are selected on this basis, each of which forms the apex of a triangular fuzzy set constructed so that together they form a fuzzy partition in accordance with Definition 2.0.1. Notice that some user involvement is still required, however, since the number of fuzzy sets in the partition must be specified. With regard to partitioning attribute universes we have, for the test cases presented in the following sections, adopted a fairly simple-minded approach and used fuzzy partitions consisting of evenly spaced triangular fuzzy sets, again where the actual number of sets required is specified by the user. There are of course many more sophisticated partitioning techniques that could be considered here; in particular, fuzzy versions of the partitioning methods proposed for crisp decision trees would seem worthy of further investigation (see, for example, Ref. 12). We prefer to ignore such extensions in this article since the aim is merely to demonstrate the potential of fuzzy probabilistic decision tree classifiers. We do, however, propose a possible method for generating attribute partitions for simple binary classification problems.
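
As an illustration of the kind of partition used throughout this article, the following is a minimal sketch (not the authors' Fril implementation) of how evenly spaced or quantile-based apex points can be turned into triangular fuzzy sets satisfying Definition 2.0.1. The function names and the use of NumPy are our own choices for illustration.

import numpy as np

def triangular_partition(apexes):
    """Triangular membership functions with the given sorted apex points. Each set
    peaks at its apex and falls linearly to zero at the neighbouring apexes; the
    first and last sets keep membership 1 beyond their apexes, so memberships sum
    to 1 everywhere on the universe (the Ruspini condition of Definition 2.0.1)."""
    apexes = np.asarray(apexes, dtype=float)

    def make(i):
        left = apexes[i - 1] if i > 0 else None
        centre = apexes[i]
        right = apexes[i + 1] if i < len(apexes) - 1 else None

        def mu(x):
            x = np.atleast_1d(np.asarray(x, dtype=float))
            m = np.zeros_like(x)
            if left is None:
                m[x <= centre] = 1.0                     # leftmost set
            else:
                up = (x >= left) & (x <= centre)
                m[up] = (x[up] - left) / (centre - left)
            if right is None:
                m[x >= centre] = 1.0                     # rightmost set
            else:
                down = (x > centre) & (x <= right)
                m[down] = (right - x[down]) / (right - centre)
            return m
        return mu

    return [make(i) for i in range(len(apexes))]

def quantile_apexes(values, n_sets):
    """Heuristic apex placement for a classification universe: spread the observed
    values evenly across the partition sets via equally spaced quantiles (the
    number of sets is still chosen by the user)."""
    return np.quantile(np.asarray(values, dtype=float), np.linspace(0.0, 1.0, n_sets))

# Five evenly spaced sets on [-1.5, 1.5], as used for the ellipse problem of Section V.
fsets = triangular_partition(np.linspace(-1.5, 1.5, 5))
x = np.linspace(-1.5, 1.5, 13)
assert np.allclose(sum(f(x) for f in fsets), 1.0)        # memberships sum to 1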

A. Forming Fuzzy Partitions for Binary Classification Problems

Consider the problem of generating a decision tree to classify objects as either 0 or 1, given values for some set of measurable attributes {A_j}_{j=1}^{n-1}, from a database D of training examples. Here it is assumed that D is a set of n-tuples of the form D = {⟨a_{i,1}, ..., a_{i,n-1}, c_i⟩ | i = 1, ..., N} where a_{i,j} is a value for A_j and c_i ∈ {0, 1} is a classification value. Now for each attribute and classification value we determine the lowest and greatest value of that attribute among the data points with the corresponding classification. Let l_j^{(i)} and u_j^{(i)} denote, respectively, the lowest and greatest value of attribute A_j with classification i. Clearly then the cross product ×_{j=1}^{n-1} [l_j^{(i)}, u_j^{(i)}] contains all points in D with classification i. We now consider the intersection of these regions for classifications 1 and 0 given by

$$IR = \prod_{j=1}^{n-1} [l_j^{(1)}, u_j^{(1)}] \;\cap\; \prod_{j=1}^{n-1} [l_j^{(0)}, u_j^{(0)}]$$

IR then corresponds to the region of attribute space for which there is uncertainty regarding the classification value (see Fig. 1). In other words, the classification value for any point in D outside this region can be immediately determined. This means that we need only generate a decision tree for points in IR and augment this tree with rules of the form

((class is 0) if (A_j ∈ [l_j^{(1)}, u_j^{(1)}]^c))
((class is 1) if (A_j ∈ [l_j^{(0)}, u_j^{(0)}]^c))

Figure 1. Intersecting classification regions.

We now use a Kohonen self-organizing map (Ref. 13) to generate a fuzzy partition on the universes [l_j^{(1)}, u_j^{(1)}] ∩ [l_j^{(0)}, u_j^{(0)}] for j = 1, ..., n − 1. The algorithm is run on data points from D with classification value 1 (or 0) that lie in IR to generate a number of cluster centers for IR. For each of these centers x̄_i = ⟨x_{i,1}, ..., x_{i,n-1}⟩ we take x_{i,j} as the center for a triangular fuzzy set on [l_j^{(1)}, u_j^{(1)}] ∩ [l_j^{(0)}, u_j^{(0)}] (see Fig. 2). It can easily be seen then that there is a unique fuzzy partition of this universe consisting of triangular fuzzy sets with the given centers.

Figure 2. Projecting cluster centers to form fuzzy partitions.
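
The bounding-box intersection above reduces to a few array operations. The sketch below, under the assumption of a two-class data set held as NumPy arrays (the names and the use of NumPy are ours, and the cluster-centre step via the Kohonen map is not reproduced here), computes the per-class boxes and the uncertain region IR.

import numpy as np

def class_bounds(X, y, label):
    """Lowest and greatest value of each attribute among points of one class,
    i.e. the vectors (l_j^(label), u_j^(label))."""
    Xc = X[y == label]
    return Xc.min(axis=0), Xc.max(axis=0)

def uncertain_region(X, y):
    """Intersection IR of the two class boxes: the part of attribute space in
    which the class cannot be read off from the boxes alone."""
    (l0, u0), (l1, u1) = class_bounds(X, y, 0), class_bounds(X, y, 1)
    low, high = np.maximum(l0, l1), np.minimum(u0, u1)
    return low, high          # empty along axis j whenever low[j] > high[j]

def needs_tree(x, low, high):
    """True if the point falls inside IR and must be passed to the decision tree;
    points outside IR are classified immediately by the augmenting rules."""
    return bool(np.all((x >= low) & (x <= high)))

# Demo with a toy 2-D data set
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [0.5, 1.5], [1.5, 0.5], [3.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
low, high = uncertain_region(X, y)
print(low, high, needs_tree(np.array([1.0, 1.0]), low, high))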

III. GENERATING PROBABILITY DISTRIBUTIONS ON FUZZY LABEL SPACES FROM DATA

Having formed the necessary partitions of attribute and classification universes we can now view the continuous attribute problem as concerning discrete variables taking fuzzy set labels as values. There are, however, several difficulties that must be overcome in order to make such a transformation. For instance, we must now find a representation of any relevant database in terms of these labels and, having done so, a method is required for calculating probabilities on label spaces from such data. In this section we describe how the notion of second-order fuzzy sets can be used to transform databases concerning attribute values into databases on the cross product of fuzzy label spaces. Given such discrete valued databases it is then demonstrated how the mass assignment theory of the probability of fuzzy events can be used in order to obtain the required probability distributions.

Here and in the sequel we consider databases of the form D = {⟨a_{i,1}, ..., a_{i,n}⟩ | i = 1, ..., N} where either a_{i,j} is a value of the attribute A_j (i.e., a_{i,j} ∈ Ω_j where Ω_j is the universe of A_j) or a_{i,j} is a fuzzy value of the attribute A_j (i.e., a_{i,j} is a fuzzy subset of Ω_j). Note that by allowing fuzzy values for attributes we are able to represent examples where some of the attribute values are unspecified, imprecisely specified, or vaguely specified.

A. The Mass Assignment Theory of the Probability of Fuzzy Sets

In Ref. 10 Zadeh outlines a theory of the probability of fuzzy events extending classical probability theory to fuzzy sets. The mass assignment theory as discussed in Ref. 6 differs from that of Zadeh primarily in the definition of conditional probability which, unlike the standard t-norm definition proposed in Ref. 10, is probability/possibility consistent (see Ref. 14). This theory forms the basis for semantic unification, which is the mechanism for matching fuzzy sets in Fril. The mass assignment definition for the case of discrete random variables is given below and a more general definition can be found in Ref. 15.

DEFINITION 3.1.1. Let X be a random variable into some finite universe Ω, let f and g be normalized fuzzy subsets of Ω and let W be the prior distribution of X. Then the conditional probability of X is f given X is g is defined as follows:

$$\mathrm{Prob}_W(X \text{ is } f \mid X \text{ is } g) = \sum_{F_i}\sum_{G_j} \mathrm{Prob}_W(X \in F_i \mid X \in G_j)\, m_f(F_i)\, m_g(G_j) = \sum_{F_i}\sum_{G_j} \frac{W(F_i \cap G_j)}{W(G_j)}\, m_f(F_i)\, m_g(G_j)$$

where m_f, {F_i}_i and m_g, {G_j}_j are the mass assignments and sets of focal elements for f and g, respectively. (See Ref. 5 for an introduction to mass assignment theory.)

In the case where W is the uniform distribution the above conditional probability is equivalent to the point semantic unification value of f given g. The interval semantic unification corresponds to the interval of possible values for the conditional probability as W varies.


Now it is possible (see Ref. 6 for details) to express this conditional probability in terms of a posterior distribution, W_g, of X obtained by conditioning the prior W on the knowledge that X is g. That is:

$$\mathrm{Prob}_W(X \text{ is } f \mid X \text{ is } g) = \sum_{x \in \Omega} \chi_f(x)\, W_g(x)$$

where

$$W_g(x) = \mathrm{Prob}_W(X = x \mid X \text{ is } g) = W(x) \sum_{G_j : x \in G_j} \frac{m_g(G_j)}{W(G_j)}$$

This conditional distribution is referred to as the least prejudiced distribution of g and plays an important role in the methods expounded in the sequel.

Example 3.1.2. Consider a biased dice so that the probability distribution of outcomes is W(1) = 0.1, W(2) = 0.1, W(3) = 0.5, W(4) = 0.1, W(5) = 0.1, W(6) = 0.1. Now suppose we know that the outcome of a throw of the dice is a small_value, where small_value = {1:1, 2:0.7, 3:0.3}, and we want to know the probability that the outcome is about_two, where about_two = {1:0.5, 2:1, 3:0.5}. Clearly then we must calculate Prob(about_two | small_value). Now m_{small_value} = {1, 2, 3}:0.3, {1, 2}:0.4, {1}:0.3, so that

$$W_{small\_value}(3) = W(3)\,\frac{0.3}{W(1)+W(2)+W(3)} = 0.5\,\frac{0.3}{0.7} = 0.21428,$$

$$W_{small\_value}(2) = W(2)\left(\frac{0.3}{W(1)+W(2)+W(3)} + \frac{0.4}{W(1)+W(2)}\right) = 0.1\left(\frac{0.3}{0.7} + \frac{0.4}{0.2}\right) = 0.24286,$$

$$W_{small\_value}(1) = 0.1\left(\frac{0.3}{0.7} + \frac{0.4}{0.2} + \frac{0.3}{0.1}\right) = 0.54286$$

Hence, we have that

$$\mathrm{Prob}(about\_two \mid small\_value) = 0.5(0.54286) + 1(0.24286) + 0.5(0.21428) = 0.62143$$
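
The calculation in Example 3.1.2 is mechanical enough to code directly. The following minimal sketch (our own illustration, in Python rather than Fril) computes a mass assignment, the least prejudiced distribution and the resulting conditional probability, and reproduces the value 0.62143 above.

def mass_assignment(fuzzy):
    """Mass assignment of a normalized fuzzy set {element: membership}: the focal
    elements are the level cuts and the mass of a cut is the drop in membership
    down to the next level."""
    levels = sorted(set(fuzzy.values()), reverse=True) + [0.0]
    return [({x for x, m in fuzzy.items() if m >= levels[i]}, levels[i] - levels[i + 1])
            for i in range(len(levels) - 1)]

def least_prejudiced(g, prior):
    """Least prejudiced distribution W_g(x) = W(x) * sum over focal elements G
    containing x of m_g(G) / W(G)."""
    ma = mass_assignment(g)
    return {x: wx * sum(m / sum(prior[y] for y in G) for G, m in ma if x in G)
            for x, wx in prior.items()}

def prob_is(f, g, prior):
    """Prob_W(X is f | X is g) = sum_x chi_f(x) W_g(x)."""
    return sum(f.get(x, 0.0) * w for x, w in least_prejudiced(g, prior).items())

# Example 3.1.2: biased die prior, small_value and about_two
prior = {1: 0.1, 2: 0.1, 3: 0.5, 4: 0.1, 5: 0.1, 6: 0.1}
small_value = {1: 1.0, 2: 0.7, 3: 0.3}
about_two = {1: 0.5, 2: 1.0, 3: 0.5}
print(round(prob_is(about_two, small_value, prior), 5))   # 0.62143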

Suppose X is a random variable into some finite universe Ω distributed according to the uniform distribution on Ω, and further suppose that we are given a probability distribution W on Ω which results from conditioning X on the knowledge that X is f, where f is an unknown fuzzy subset of Ω. In other words, we know that W is the least prejudiced distribution of f with respect to the uniform prior on Ω. A natural question then, and one of considerable importance with respect to the theories that we shall describe in the following sections, is whether or not we can find a fuzzy set consistent with this information. In fact it can be shown (see Ref. 16) that a unique solution for f may be determined accordingly: Let the range of values of W be given by {p_1, ..., p_n} where 0 ≤ p_{i+1} < p_i ≤ 1 and $\sum_{i=1}^{n} p_i = 1$.

Then the mass assignment of f is given by

$$m_f = F_n : y_n, \;\ldots,\; F_i : y_i - y_{i+1}, \;\ldots,\; F_1 : 1 - y_2$$

where F_i = {x ∈ Ω | W(x) ≥ p_i} and

$$y_j = |F_j|\, p_j + \sum_{i=j+1}^{n} \left(|F_i| - |F_{i-1}|\right) p_i$$

f can then be inferred from its mass assignment according to the formula

$$\forall x \in \Omega \qquad \chi_f(x) = \sum_{F_i : x \in F_i} m_f(F_i)$$

Example 3.1.3. Find the fuzzy set f with least prejudiced distribution W = a:0.15, b:0.6, c:0.2, d:0.05, so that p_1 = 0.6, p_2 = 0.2, p_3 = 0.15, p_4 = 0.05.

Given the ordering constraint imposed by W we have that m_f = F_4:y_4, F_3:y_3 − y_4, F_2:y_2 − y_3, F_1:1 − y_2, where F_4 = {x | W(x) ≥ p_4 = 0.05} = {a, b, c, d}, F_3 = {x | W(x) ≥ p_3 = 0.15} = {a, b, c}, F_2 = {x | W(x) ≥ p_2 = 0.2} = {b, c}, and F_1 = {x | W(x) ≥ p_1 = 0.6} = {b}. This implies that f = b/1 + c/y_2 + a/y_3 + d/y_4 and using the above formula we obtain

y_4 = 4p_4 = 4(0.05) = 0.2
y_3 = 3p_3 + p_4 = 3(0.15) + 0.05 = 0.5
y_2 = 2p_2 + p_3 + p_4 = 2(0.2) + 0.15 + 0.05 = 0.6

Therefore, f = b/1 + c/0.6 + a/0.5 + d/0.2.
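
A direct transcription of this inverse construction, again only a sketch in Python with names of our own choosing, recovers the fuzzy set of Example 3.1.3 from its least prejudiced distribution.

def fuzzy_from_lpd(lpd):
    """Unique fuzzy set f whose least prejudiced distribution with respect to a
    uniform prior is lpd = {element: probability} (Section III-A construction)."""
    p = sorted(set(lpd.values()), reverse=True)           # p_1 > p_2 > ... > p_n
    cuts = [[x for x, w in lpd.items() if w >= pj] for pj in p]
    sizes = [len(c) for c in cuts]
    n = len(p)
    # y_j = |F_j| p_j + sum_{i>j} (|F_i| - |F_{i-1}|) p_i
    y = [sizes[j] * p[j] + sum((sizes[i] - sizes[i - 1]) * p[i] for i in range(j + 1, n))
         for j in range(n)]
    fuzzy = {}
    for j in range(n):
        for x in cuts[j]:
            fuzzy.setdefault(x, y[j])   # membership = y at the first cut containing x
    return fuzzy

# Example 3.1.3:  W = a:0.15, b:0.6, c:0.2, d:0.05  ->  f = b/1 + c/0.6 + a/0.5 + d/0.2
print(fuzzy_from_lpd({'a': 0.15, 'b': 0.6, 'c': 0.2, 'd': 0.05}))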

B. Finding Second-Order Representations of Attribute Values

A second-order fuzzy set is simply defined to be a fuzzy subset of some universe the elements of which are labels of fuzzy sets. Here we consider the particular case when the labels concerned relate to a fuzzy partition of a universe of attribute values.

Let {f_1, ..., f_n} denote a fuzzy partition of the universe Ω of an attribute A. Further let LABEL be a random variable into {f_1, ..., f_n} distributed according to the uniform distribution on {f_1, ..., f_n}. Now given any value a of A we define the second-order representation of a with respect to {f_1, ..., f_n} to be a fuzzy set h(a) such that

Prob(A is f_i | A = a) = Prob(LABEL = f_i | LABEL is h(a))   for i = 1, ..., n

Notice that the left-hand side of this expression is equal to χ_{f_i}(a) irrespective of the prior distribution of A. Clearly then this expression defines a probability distribution for LABEL conditional on the knowledge that LABEL is h(a), and the algorithm described in Section III-A for finding a fuzzy set from its conditional (least prejudiced) distribution can be used to derive a unique solution for h(a).

In order to find second-order representations of fuzzy values of A we need to make the additional assumption that a priori A is distributed according to the uniform distribution on Ω. Thus as above we require that h(a) satisfies

Prob(A is f_i | A is a) = Prob(LABEL = f_i | LABEL is h(a))   for i = 1, ..., n

where the left-hand side is evaluated as described in Section III-A assuming a uniform prior distribution. Again this defines a conditional distribution for LABEL so that h(a) can then be determined as before.

Example 3.2.1. Consider an attribute SCORE giving the value obtained from throwing a 6-sided dice with face values in {1, 2, 3, 4, 5, 6}. Also suppose we have a fuzzy partition {high_score, ¬high_score} of {1, 2, 3, 4, 5, 6} where high_score = 6/1 + 5/0.8 + 4/0.3 (so that its complement is ¬high_score = 1/1 + 2/1 + 3/1 + 4/0.7 + 5/0.2). Using the above method we can find a second-order fuzzy set corresponding to each of the values of SCORE. In order to illustrate this consider the case when SCORE = 5.

Conditioning on the knowledge that SCORE = 5 generates a probability distribution on {high_score, ¬high_score} where Prob(LABEL = high_score) = Prob(SCORE is high_score | SCORE = 5) = χ_high_score(5).

In other words, we have the following distribution:

high_score:0.8, ¬high_score:0.2

We now determine the unique fuzzy set for which this forms the least prejudiced distribution, so that

h(5) = high_score/1 + ¬high_score/0.4

Repeating this procedure we obtain that

h(6) = high_score/1
h(5) = high_score/1 + ¬high_score/0.4
h(4) = ¬high_score/1 + high_score/0.6
h(3) = ¬high_score/1
h(2) = ¬high_score/1
h(1) = ¬high_score/1

Now suppose we want to find a second-order representation of the fuzzy value about_4 = 3/0.5 + 4/1 + 5/0.5 with respect to {high_score, ¬high_score}. Here we assume a uniform prior on the universe {1, 2, 3, 4, 5, 6} so that conditioning on the knowledge that SCORE is about_4 generates the following probability distribution on the label space:

high_score:1/3, ¬high_score:2/3

Transforming this into a fuzzy set we obtain

h(about_4) = ¬high_score/1 + high_score/(2/3)
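
For a two-element label space the transformation from χ_high_score(a) to h(a) can be coded compactly. The sketch below is our own illustration (not taken from the paper): it uses the closed form χ_f(x) = Σ_y min(W(y), W(x)), which can be checked to agree with the y_j construction of Section III-A, and it reproduces h(5), h(4) and h(6) above.

def fuzzy_from_uniform_lpd(dist):
    """Unique fuzzy set whose least prejudiced distribution (uniform prior) is
    dist; equivalent closed form of the Section III-A construction:
    chi(x) = sum_y min(dist[y], dist[x])."""
    return {x: sum(min(wy, wx) for wy in dist.values())
            for x, wx in dist.items() if wx > 0}

high_score = {6: 1.0, 5: 0.8, 4: 0.3}          # fuzzy partition of {1,...,6}

def second_order(score):
    """Second-order representation h(score) of a crisp SCORE value: the label
    distribution (chi_high(score), 1 - chi_high(score)) is inverted back to a
    fuzzy set of labels."""
    mu = high_score.get(score, 0.0)
    return fuzzy_from_uniform_lpd({'high_score': mu, 'not_high_score': 1.0 - mu})

for s in (5, 4, 6):
    print(s, {lab: round(m, 6) for lab, m in second_order(s).items()})
# 5 {'high_score': 1.0, 'not_high_score': 0.4}   = h(5)
# 4 {'high_score': 0.6, 'not_high_score': 1.0}   = h(4)
# 6 {'high_score': 1.0}                          = h(6)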

Now suppose that we are given a database D = {⟨a_{i,1}, ..., a_{i,n}⟩ | i = 1, ..., N} together with fuzzy partitions {f_{i,j}}_{i=1}^{m_j} of Ω_j for j = 1, ..., n. We can find a second-order representation of D, denoted D*, by replacing every tuple ⟨a_{i,1}, ..., a_{i,n}⟩ with ⟨h(a_{i,1}), ..., h(a_{i,n})⟩, where h(a_{i,j}) is the second-order representation of a_{i,j} with respect to {f_{i,j}}_{i=1}^{m_j} (i.e., the fuzzy partition of Ω_j). D* is then a fuzzy label representation of the original database.

Example 3.2.2. Consider a game played by a number of people which simply involves each participant throwing a dice, the winner being the individual with the highest score. In the case where more than one person has the highest score then each of them records a joint win. Suppose the game has been played repeatedly over an evening and the results of a single individual have been recorded in the following database, where the attributes are, from left to right, Outcome and Score.

D = {⟨lose, 1/1 + 2/0.3⟩,
⟨joint_win, 4⟩,
⟨win, 6⟩,
⟨win, 6/1 + 5/0.6 + 4/0.2⟩,
⟨lose, 4⟩,
⟨lose, 3/1 + 4/0.6⟩,
⟨joint_win, 5⟩,
⟨joint_win, 6⟩}

We now partition the classification universe {win, joint_win, lose} by success = win/1 + joint_win/0.6 and failure = lose/1 + joint_win/0.4. This means that a second-order representation of D relative to {success, failure} × {high_score, ¬high_score}, where high_score is defined as in Example 3.2.1, is given by

D* = {⟨failure/1, ¬high_score/1⟩,
⟨success/1 + failure/0.8, ¬high_score/1 + high_score/0.6⟩,
⟨success/1, high_score/1⟩,
⟨success/1, high_score/1 + ¬high_score/0.2⟩,
⟨failure/1, ¬high_score/1 + high_score/0.6⟩,
⟨failure/1, ¬high_score/1 + high_score/0.18⟩,
⟨success/1 + failure/0.8, high_score/1 + ¬high_score/0.4⟩,
⟨success/1 + failure/0.8, high_score/1⟩}


C. Obtaining Probability Distributions of Discrete-Valued Features From Databases

Once a second-order representation of the database has been found, the problem of generating probability distributions on the label spaces becomes a special case of the problem of generating probability distributions from databases where all attribute universes are finite. In the latter case mass assignment theory can be used to infer a joint distribution on the product space of attribute universes. For a database D = {ā_i = ⟨a_{i,1}, ..., a_{i,n}⟩ | i = 1, ..., N} as above, but where all the universes Ω_j have finite cardinality, the joint probability distribution of A_1, ..., A_n conditional on D is given by

$$\forall \bar{x} \in \Omega_1 \times \cdots \times \Omega_n \qquad \mathrm{Prob}(\bar{x} \mid D) = \sum_{\bar{a} \in D} \mathrm{Prob}(A_1 = x_1, \ldots, A_n = x_n \mid \bar{a})\, \mathrm{Prob}(\bar{a} \mid D)$$

Assuming that all objects in the database are equally likely we obtain that

$$\mathrm{Prob}(\bar{x} \mid D) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Prob}(A_1 = x_1, \ldots, A_n = x_n \mid \bar{a}_i)$$

Furthermore, given that we have no prior knowledge of the distribution of the variables A_1, ..., A_n or of any dependencies existing between them, we assume a priori that these features are independent and distributed according to the uniform distribution on their respective domains. Making these assumptions we obtain that

$$\mathrm{Prob}(A_1 = x_1, \ldots, A_n = x_n \mid \bar{a}_i) = \mathrm{Prob}(A_1 = x_1, \ldots, A_n = x_n \mid A_1 \text{ is } a_{i,1}, \ldots, A_n \text{ is } a_{i,n}) = \prod_{j=1}^{n} \mathrm{Prob}(A_j = x_j \mid A_j \text{ is } a_{i,j}) = \prod_{j=1}^{n} W_{a_{i,j}}(x_j)$$

where W_{a_{i,j}} is the least prejudiced distribution of a_{i,j} assuming a uniform prior distribution on Ω_j. (Note that when a_{i,j} is a crisp value, A_j is a_{i,j} is used as an abbreviation for A_j is a_{i,j}/1.) Hence, we obtain

$$\forall \bar{x} \in \Omega_1 \times \cdots \times \Omega_n \qquad \mathrm{Prob}(\bar{x} \mid D) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{n} W_{a_{i,j}}(x_j)$$

Now suppose we have a second-order representation D* = {⟨h(a_{i,1}), ..., h(a_{i,n})⟩ | i = 1, ..., N} of D = {ā_i = ⟨a_{i,1}, ..., a_{i,n}⟩ | i = 1, ..., N} with respect to ×_{j=1}^{n} {f_{i,j}}_{i=1}^{m_j}, as described in Section III-B. Here, of course, the universes of the attributes A_j may have infinite cardinality. The second-order representation D*, however, consists of values for discrete label features. Let these features be denoted LABEL_1, ..., LABEL_n, where LABEL_j has the universe {f_{i,j}}_{i=1}^{m_j}; then the above method can be used to infer a joint distribution of LABEL_1, ..., LABEL_n given D*. More specifically we obtain

$$\forall \bar{l} \in \prod_{j=1}^{n} \{f_{i,j}\}_{i=1}^{m_j} \qquad \mathrm{Prob}(LABEL_1 = l_1, \ldots, LABEL_n = l_n \mid D^*) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{n} W_{h(a_{i,j})}(l_j)$$

Now given the definition of h(a_{i,j}) this last term can be written as

$$= \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{n} \mathrm{Prob}(A_j \text{ is } l_j \mid A_j \text{ is } a_{i,j})$$

where Prob(A_j is l_j | A_j is a_{i,j}) is calculated assuming a uniform prior on Ω_j. Notice that this expression enables us to calculate probability distributions on labels directly from the original database D.

It is important to mention that a major advantage in using this second-order approach lies in the treatment of conditional probabilities. Since the attributes LABEL_i take fuzzy labels as values, events of the form LABEL_i = f_{i,j} are crisp rather than fuzzy events and hence classical probability theory can be used to calculate conditional probabilities, rather than the mass assignment theory described in Section III-A, the latter being very computationally expensive in this context. Thus, for example,

$$\mathrm{Prob}(LABEL_1 = l_1 \mid LABEL_2 = l_2, D^*) = \frac{\mathrm{Prob}(LABEL_1 = l_1, LABEL_2 = l_2 \mid D^*)}{\mathrm{Prob}(LABEL_2 = l_2 \mid D^*)}$$

where the relevant marginal probabilities are calculated in the standard way from the above joint distribution. The relative simplicity of this expression should be compared with the corresponding direct mass assignment evaluation which, assuming D contains only crisp values, is given by

$$\mathrm{Prob}(A_1 \text{ is } l_1 \mid A_2 \text{ is } l_2, D) = \int_0^1\!\!\int_0^1 \frac{\left|\{\bar{a}_i \in D \mid a_{i,1} \in l_{1s},\; a_{i,2} \in l_{2t}\}\right|}{\left|\{\bar{a}_i \in D \mid a_{i,2} \in l_{2t}\}\right|}\, ds\, dt$$

where for a fuzzy set f and a value y ∈ [0, 1], f_y denotes the y-th α-cut of f. (See Refs. 14 and 15 for more details regarding this representation of mass assignment based conditional probability.)

Example 3.3.1. Consider the database D described in Example 3.2.2. Let LABEL_Outcome be an attribute with universe {success, failure} and LABEL_Score be an attribute with universe {high_score, ¬high_score}. We can now use the method described above to obtain a distribution on LABEL_Outcome × LABEL_Score from D. For instance,

Prob(LABEL_Outcome = success, LABEL_Score = high_score)
= (1/8)(Prob(success | lose) Prob(high_score | 1/1 + 2/0.3)
+ Prob(success | joint_win) Prob(high_score | 4)
+ Prob(success | win) Prob(high_score | 6)
+ Prob(success | win) Prob(high_score | 6/1 + 5/0.6 + 4/0.2)
+ Prob(success | lose) Prob(high_score | 4)
+ Prob(success | lose) Prob(high_score | 3/1 + 4/0.6)
+ Prob(success | joint_win) Prob(high_score | 5)
+ Prob(success | joint_win) Prob(high_score | 6))
= (1/8)(0(0) + 0.6(0.3) + 1(1) + 1(0.9) + 0(0.3) + 0(0.09) + 0.6(0.8) + 0.6(1)) = 0.395

Repeating this calculation for all combinations we obtain

                                 LABEL_Score
                          high_score    ¬high_score
LABEL_Outcome   success     0.395          0.08
                failure     0.15375        0.37125

Notice that the above calculation can be expressed more simply as the average over the database of the product of the point semantic unification values for the label fuzzy sets given the corresponding attribute values.
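
That remark translates directly into code. The following sketch (ours, with illustrative names; it repeats the mass assignment helper so that the block is self-contained) computes point semantic unification values under a uniform prior and averages their products over the database, reproducing the table of Example 3.3.1.

import math
from itertools import product

def mass_assignment(fuzzy):
    """Level-cut mass assignment of a normalized fuzzy set {element: membership}."""
    levels = sorted(set(fuzzy.values()), reverse=True) + [0.0]
    return [([x for x, m in fuzzy.items() if m >= levels[i]], levels[i] - levels[i + 1])
            for i in range(len(levels) - 1)]

def psu(f, a):
    """Point semantic unification Prob(A is f | A is a) with a uniform prior;
    a is either a crisp value or a fuzzy value {value: membership}."""
    if not isinstance(a, dict):
        return f.get(a, 0.0)                              # crisp value: chi_f(a)
    # uniform prior: sum over focal elements of mass * mean membership in f
    return sum(m * sum(f.get(x, 0.0) for x in cut) / len(cut)
               for cut, m in mass_assignment(a))

def joint_label_dist(db, partitions):
    """Joint distribution on label tuples: the average over the database of the
    products of point semantic unification values (Section III-C)."""
    dist = {}
    for labels in product(*(p.keys() for p in partitions)):
        total = sum(math.prod(psu(part[lab], a)
                              for a, part, lab in zip(row, partitions, labels))
                    for row in db)
        dist[labels] = total / len(db)
    return dist

# Example 3.3.1: the dice game database of Example 3.2.2
success = {'win': 1.0, 'joint_win': 0.6}
failure = {'lose': 1.0, 'joint_win': 0.4}
high = {6: 1.0, 5: 0.8, 4: 0.3}
not_high = {x: 1.0 - high.get(x, 0.0) for x in range(1, 7)}

D = [('lose', {1: 1.0, 2: 0.3}), ('joint_win', 4), ('win', 6),
     ('win', {6: 1.0, 5: 0.6, 4: 0.2}), ('lose', 4),
     ('lose', {3: 1.0, 4: 0.6}), ('joint_win', 5), ('joint_win', 6)]

partitions = [{'success': success, 'failure': failure},
              {'high_score': high, 'not_high_score': not_high}]

for labels, p in joint_label_dist(D, partitions).items():
    print(labels, round(p, 5))
# ('success', 'high_score') 0.395      ('success', 'not_high_score') 0.08
# ('failure', 'high_score') 0.15375    ('failure', 'not_high_score') 0.37125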

IV. LEARNING AND USING FUZZY DECISION TREES

We are now able to utilize the above ideas for inferring probability distributions on fuzzy labels in order to develop a method to generate fuzzy decision trees from data. The trees induced will be probabilistic classifiers similar to those discussed in Ref. 4, although the method for obtaining the necessary probabilities is clearly quite different. In addition, the possibility of fuzzy classifications necessitates the incorporation of some form of defuzzification procedure into our system. In this section we shall describe the induction algorithm together with methods for classifying unseen cases in some detail. Furthermore, it will be shown how such classifiers can be naturally represented in Fril by the extended Fril rule.

A. An Algorithm for Fuzzy Decision Tree Induction

Initially fuzzy partitions of all attribute universes with infinite or large cardinality are formed. For the attribute representing classification values the method described in Section II is used to form a partition of triangular fuzzy sets over which there is a uniform spread of data classification values. Again as stated in Section II the independent attributes are partitioned using evenly spaced triangular fuzzy sets, although more sophisticated methods could be used here. In both cases the user is required to specify the number of fuzzy sets in the partitions.


The decision tree is formed using the discrete attributes LABEL_i in place of A_i for i = 1, ..., n, where the notation is as defined in the previous section. As with classical ID3, attributes are selected based on entropy calculations to construct a tree with attributes corresponding to the nodes and attribute values to the edges. Also, in this case, we have associated with each branch a set of conditional probabilities specifying the probability of each classification given the outcomes encoded by that branch. The tree is propagated in the following manner. Suppose the classification attribute is A_j and consider a branch B consisting of the nodes LABEL_1, ..., LABEL_{k-1} with corresponding edges LABEL_1 = l_1, ..., LABEL_{k-1} = l_{k-1}, where l_r is a member of the fuzzy partition of Ω_r (i.e., the universe of A_r). The entropy of B is then given by

$$I(B) = -\sum_{t=1}^{m_j} p_t \log p_t$$

where

$$p_t = \mathrm{Prob}(LABEL_j = f_{t,j} \mid LABEL_1 = l_1, \ldots, LABEL_{k-1} = l_{k-1})$$

The conditional probabilities are calculated from the database D by summing the product $\mathrm{Prob}(A_j \text{ is } f_{t,j} \mid A_j \text{ is } a_{i,j}) \prod_{r=1}^{k-1} \mathrm{Prob}(A_r \text{ is } l_r \mid A_r \text{ is } a_{i,r})$ for each tuple ⟨a_{i,1}, ..., a_{i,n}⟩ in D and then normalizing over the classification classes. More specifically, for D = {ā_i = ⟨a_{i,1}, ..., a_{i,n}⟩ | i = 1, ..., N},

$$p_t = \frac{w_t}{\sum_{r=1}^{m_j} w_r} \qquad \text{for } t = 1, \ldots, m_j$$

where

$$w_t = \sum_{i=1}^{N} \mathrm{Prob}(A_j \text{ is } f_{t,j} \mid A_j \text{ is } a_{i,j}) \prod_{r=1}^{k-1} \mathrm{Prob}(A_r \text{ is } l_r \mid A_r \text{ is } a_{i,r})$$

It can easily be seen that this method of calculating the conditional probabilities is in accordance with the theory set out in Section III-C (see Ref. 16 for details). Now if I(B) is smaller than a predefined threshold then the branch terminates. Otherwise, each remaining independent attribute is assessed in terms of expected information gain, with the most informative being selected as the next node of the tree. For branch B we consider attributes {LABEL_r | r ≥ k, r ≠ j}, so that for any such attribute the expected entropy is given by:

$$I(LABEL) = \sum_{l} \mathrm{Prob}(LABEL = l \mid LABEL_1 = l_1, \ldots, LABEL_{k-1} = l_{k-1})\, I(B_l)$$

where B_l is the branch formed by adding the node LABEL and the edge LABEL = l to B, and where the conditional probabilities are evaluated as above. The expected information gain can now be determined by

$$G(LABEL) = I(B) - I(LABEL)$$

G is calculated for every member of {LABEL_r | r ≥ k, r ≠ j} and the attribute with the maximal value is selected to form the next node.

The process described here is repeated for each branch until either the entropy stopping criterion is met or all attributes have been exhausted.
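
To make the control flow concrete, here is a minimal recursive sketch of the tree-growing loop. It is our own illustration, not the authors' Fril code: it assumes that the point semantic unification values Prob(A_r is label | A_r is a_{i,r}) have already been computed for every tuple and attribute (for example with the psu sketch above) and stored in rows[i][r] as a dict from label to probability, and that the labels of each attribute form a fuzzy partition, so the values in each dict sum to 1.

import math

def branch_weights(rows, class_idx, class_labels, conditions):
    """w_t for each classification label f_t: sum over tuples of
    Prob(A_j is f_t | a_ij) * product over branch edges of Prob(A_r is l_r | a_ir)."""
    w = {c: 0.0 for c in class_labels}
    for row in rows:
        edge = math.prod(row[attr][lab] for attr, lab in conditions)
        for c in class_labels:
            w[c] += row[class_idx][c] * edge
    return w

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def grow(rows, class_idx, class_labels, attrs, conditions=(), threshold=0.5):
    """Fuzzy ID3: terminate a branch when its entropy I(B) falls below the
    threshold or no attributes remain; otherwise split on the attribute with
    maximal expected information gain G. Leaves hold class probabilities."""
    w = branch_weights(rows, class_idx, class_labels, conditions)
    total = sum(w.values()) or 1.0
    p = {c: wc / total for c, wc in w.items()}
    if entropy(p.values()) < threshold or not attrs:
        return p
    best, best_gain = None, -1.0
    for attr, labels in attrs.items():
        expected = 0.0                          # I(LABEL) for this candidate attribute
        for lab in labels:
            wl = branch_weights(rows, class_idx, class_labels, conditions + ((attr, lab),))
            tl = sum(wl.values())
            if tl > 0:
                # tl/total = Prob(LABEL = lab | branch), since label values sum to 1 per row
                expected += (tl / total) * entropy(v / tl for v in wl.values())
        gain = entropy(p.values()) - expected
        if gain > best_gain:
            best, best_gain = attr, gain
    rest = {a: ls for a, ls in attrs.items() if a != best}
    return {best: {lab: grow(rows, class_idx, class_labels, rest,
                             conditions + ((best, lab),), threshold)
                   for lab in attrs[best]}}

# Toy run: attribute 0 is the classification label space, attribute 1 is a feature.
rows = [[{'pos': 1.0, 'neg': 0.0}, {'low': 0.8, 'high': 0.2}],
        [{'pos': 0.0, 'neg': 1.0}, {'low': 0.1, 'high': 0.9}]]
print(grow(rows, class_idx=0, class_labels=['pos', 'neg'], attrs={1: ['low', 'high']}))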

B. Classification of Unseen Examples

For any decision tree of the form described above the branches correspond to a set of mutually exclusive and exhaustive events. This observation enables us to use probabilistic updating methods to determine the probability that a previously unseen example belongs to a particular class. More specifically, given a decision tree with branches B_1, ..., B_T and test example ā = ⟨a_1, ..., a_n⟩, an updated value for the probability of each classification can be found using Jeffrey's rule (see Ref. 17) as follows:

$$\mathrm{Prob}(LABEL_j = f_{t,j} \mid \bar{a}) = \sum_{i=1}^{T} \mathrm{Prob}(LABEL_j = f_{t,j} \mid B_i)\, \mathrm{Prob}(B_i \mid \bar{a})$$

Here the conditional probabilities Prob(LABEL_j = f_{t,j} | B_i) are specified in the decision tree and

$$\mathrm{Prob}(B_r \mid \bar{a}) = \prod_{t=1}^{k_r} \mathrm{Prob}(LABEL_{r_t} = l_{r_t} \mid LABEL_{r_t} \text{ is } h(a_{r_t})) = \prod_{t=1}^{k_r} \mathrm{Prob}(A_{r_t} \text{ is } l_{r_t} \mid A_{r_t} \text{ is } a_{r_t})$$

where B_r corresponds to the event $\bigwedge_{t=1}^{k_r} (LABEL_{r_t} = l_{r_t})$.

In this way we find a support for each class and classify the example as having the class with highest support.

Notice that if each attribute universe Ω_j is partitioned using m_j fuzzy sets then there is an upper bound of $\prod_{j=1}^{n-1} m_j$ branches to any decision tree, suggesting that the above calculation could be extremely computationally expensive. This is partially avoided, however, because only triangular fuzzy sets are used and so a value has nonzero membership in at most two adjacent fuzzy sets. This means that for any attribute tuple ā = ⟨a_1, ..., a_n⟩, Prob(B | ā) is nonzero for at most 2^{n-1} branches B. Precisely which branches these are can easily be determined, so that unnecessary calculation may be avoided.

Example 4.2.1. Consider the dice game of Example 3.2.2, but where an extra attribute recording the number of players (between 1 and 8) participating in each game is given. The database is now

D = {⟨lose, 1/1 + 2/0.3, 8⟩,
⟨joint_win, 4, 5⟩,
⟨win, 6, 3⟩,
⟨win, 6/1 + 5/0.6 + 4/0.2, 4⟩,
⟨lose, 4, 6⟩,
⟨lose, 3/1 + 4/0.6, 2⟩,
⟨joint_win, 5, 6⟩,
⟨joint_win, 6, 4⟩}

where the attributes are, from left to right, Outcome, Score, Players.

Now let the universe {1, 2, 3, 4, 5, 6, 7, 8} be partitioned by the fuzzy sets many = 8/1 + 7/1 + 6/1 + 5/0.5 + 4/0.2 and few = 1/1 + 2/1 + 3/1 + 4/0.8 + 5/0.5, and let LABEL_Outcome with universe {success, failure}, LABEL_Score with universe {high_score, ¬high_score} and LABEL_Players with universe {many, few} be the second-order label attributes corresponding to Outcome, Score, and Players, respectively. Suppose then we want to generate a decision tree to classify Outcome in terms of Score and Players. To make the initial choice of attributes we must first calculate the conditional distributions Prob(LABEL_Outcome | LABEL_Score) and Prob(LABEL_Outcome | LABEL_Players). The latter, for example, can be determined by summing the product Prob(LABEL_Outcome | Outcome) Prob(LABEL_Players | Players) over the data points and then normalizing across LABEL_Outcome.

(Outcome, Players)        LABEL_Outcome | many                LABEL_Outcome | few
                          success          failure            success          failure
⟨lose, 8⟩                 (0)(1) = 0       (1)(1) = 1         (0)(0) = 0       (1)(0) = 0
⟨joint_win, 5⟩            (0.6)(0.5) = 0.3 (0.4)(0.5) = 0.2   (0.6)(0.5) = 0.3 (0.4)(0.5) = 0.2
⟨win, 3⟩                  (1)(0) = 0       (0)(0) = 0         (1)(1) = 1       (0)(1) = 0
⟨win, 4⟩                  (1)(0.2) = 0.2   (0)(0.2) = 0       (1)(0.8) = 0.8   (0)(0.8) = 0
⟨lose, 6⟩                 (0)(1) = 0       (1)(1) = 1         (0)(0) = 0       (1)(0) = 0
⟨lose, 2⟩                 (0)(0) = 0       (1)(0) = 0         (0)(1) = 0       (1)(1) = 1
⟨joint_win, 6⟩            (0.6)(1) = 0.6   (0.4)(1) = 0.4     (0.6)(0) = 0     (0.4)(0) = 0
⟨joint_win, 4⟩            (0.6)(0.2) = 0.12 (0.4)(0.2) = 0.08 (0.6)(0.8) = 0.48 (0.4)(0.8) = 0.32

                          w_1 = 1.22       w_2 = 2.68         w_1 = 2.58       w_2 = 1.52

Prob(success | many) = 0.3128          Prob(success | few) = 0.6293
Prob(failure | many) = 0.6872          Prob(failure | few) = 0.3707

Also it is found that Prob(few) = 0.5125 and Prob(many) = 0.4875.


Hence we obtain

I(LABEL_Players) = 0.5125(−0.6293 log2 0.6293 − 0.3707 log2 0.3707)
                 + 0.4875(−0.3128 log2 0.3128 − 0.6872 log2 0.6872) = 0.924849

Similarly we find the relevant probabilities for LABEL_Score to be Prob(success | high_score) = 0.7198, Prob(failure | high_score) = 0.2802, Prob(success | ¬high_score) = 0.1773, Prob(failure | ¬high_score) = 0.8227, Prob(high_score) = 0.54875 and Prob(¬high_score) = 0.45125, giving I(LABEL_Score) = 0.773548.

In addition, the prior probabilities for LABEL_Outcome are Prob(success) = 0.475 and Prob(failure) = 0.525, giving a prior entropy value of I = −0.475 log2 0.475 − 0.525 log2 0.525 = 0.998196. From this we obtain

G(LABEL_Players) = 0.998196 − 0.924849 = 0.073347
G(LABEL_Score) = 0.998196 − 0.773548 = 0.224648

Hence the attribute LABEL_Score is selected to generate the following subtree:

LABEL_Score
    high_score:   success:0.7198, failure:0.2802   (I = 0.855296)
    ¬high_score:  success:0.1773, failure:0.8227   (I = 0.674136)

Now setting an entropy threshold of 0.5 as the stopping criterion, both branches fail to satisfy this criterion and hence we evaluate the remaining attribute LABEL_Players to give:

LABEL_Score
    high_score  → LABEL_Players
        few:    success:0.8297, failure:0.1703
        many:   success:0.5033, failure:0.4967
    ¬high_score → LABEL_Players
        few:    success:0.2309, failure:0.7691
        many:   success:0.1542, failure:0.8458


Suppose now we are presented with a test example E with attribute values Score = 3, Players = 8. Then according to Jeffrey's rule

Prob(success | E) = Prob(success | high_score, few) Prob(high_score | 3) Prob(few | 8)
+ Prob(success | high_score, many) Prob(high_score | 3) Prob(many | 8)
+ Prob(success | ¬high_score, few) Prob(¬high_score | 3) Prob(few | 8)
+ Prob(success | ¬high_score, many) Prob(¬high_score | 3) Prob(many | 8)
= (0.8297)(0)(0) + (0.5033)(0)(1) + (0.2309)(1)(0) + (0.1542)(1)(1) = 0.1542

Hence, we have supports success:0.1542 and failure:0.8458, so that the outcome is classified as failure.
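
The classification step of this example is easy to check in code. The sketch below (illustrative only; the names are ours) scores a test example against the four branches using Jeffrey's rule, with Prob(branch | example) obtained as the product of the memberships of the crisp test values in the branch's label fuzzy sets, and it reproduces the supports 0.1542 and 0.8458.

def jeffreys_classify(branches, example):
    """Support for each class = sum over branches of
    Prob(class | branch) * Prob(branch | example), with Prob(branch | example)
    the product of the example's memberships in the branch's edge fuzzy sets."""
    support = {}
    for conditions, class_probs in branches:
        p_branch = 1.0
        for attr, label_fset in conditions.items():
            p_branch *= label_fset.get(example[attr], 0.0)
        for c, p in class_probs.items():
            support[c] = support.get(c, 0.0) + p * p_branch
    return support

# The decision tree of Example 4.2.1 (four branches on LABEL_Score, LABEL_Players)
high = {6: 1.0, 5: 0.8, 4: 0.3}
not_high = {x: 1.0 - high.get(x, 0.0) for x in range(1, 7)}
many = {8: 1.0, 7: 1.0, 6: 1.0, 5: 0.5, 4: 0.2}
few = {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.8, 5: 0.5}

tree = [
    ({'Score': high,     'Players': few},  {'success': 0.8297, 'failure': 0.1703}),
    ({'Score': high,     'Players': many}, {'success': 0.5033, 'failure': 0.4967}),
    ({'Score': not_high, 'Players': few},  {'success': 0.2309, 'failure': 0.7691}),
    ({'Score': not_high, 'Players': many}, {'success': 0.1542, 'failure': 0.8458}),
]

print(jeffreys_classify(tree, {'Score': 3, 'Players': 8}))
# {'success': 0.1542, 'failure': 0.8458} -> classified as failure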

C. Fril Extended Rule Representations of Fuzzy Decision Trees

In some contexts it is desirable to have rule representations of decision trees. For classical discrete decision trees first-order logic conditionals will suffice, but for probabilistic classifiers these are clearly inappropriate. The extended Fril rule provides an ideal way of representing decision trees with associated probability values within the unified uncertainty framework of Fril. The syntax of the extended rule is as follows:

(h if (b_1, ..., b_n)) : ((u_1, v_1) ... (u_n, v_n))

where h represents a head of the form (⟨pred⟩ arguments) and b_i represents a list or conjunction of goals (c_1, ..., c_m), where each c_i is of the form (⟨pred⟩ arguments). In addition, [u_i, v_i] is an interval containing Prob(h | b_i), the list of bodies b_1, ..., b_n being interpreted as a disjunction. In the case where the b_i correspond to crisp events it is assumed that the set of events {b_i | i = 1, ..., n} is mutually exclusive and exhaustive. If, on the other hand, b_i corresponds to a fuzzy event of the form $\bigwedge_{j=1}^{m} (A_j \text{ is } f_{i,j})$ for i = 1, ..., n, then it is required that $\bigcup_{i=1}^{n} \prod_{j=1}^{m} f_{i,j}$ is a fuzzy partition of $\prod_{j=1}^{m} \Omega_j$, where Ω_j is the universe of A_j and the fuzzy cross product is defined using the product conjunction. Given supports for the body terms b_i for i = 1, ..., n, the support for h is evaluated using an interval version of Jeffrey's rule, corresponding to Jeffrey's rule in the case of point supports. (See Ref. 5 for details.)

Now consider a fuzzy decision tree with branches B_1, ..., B_n. Viewed as a set of crisp events on fuzzy label spaces these branches are mutually exclusive and exhaustive. Clearly then we may represent the tree as a set of extended rules where each rule corresponds to a particular classification. For instance, the decision tree from Example 4.2.1 can be represented by the rule:


((LABEL_Outcome = success) if (
    ((LABEL_Score = high_score) and (LABEL_Players = few))
    or ((LABEL_Score = high_score) and (LABEL_Players = many))
    or ((LABEL_Score = ¬high_score) and (LABEL_Players = few))
    or ((LABEL_Score = ¬high_score) and (LABEL_Players = many))
)) : ((0.8297 0.8297)(0.5033 0.5033)(0.2309 0.2309)(0.1542 0.1542))

Often in such rules we replace the crisp statements (LABEL_i = l) by the corresponding fuzzy restriction on the attribute A_i, (A_i is l). The resulting rule must then be viewed as approximate, since the decision tree was generated on the fuzzy label space and not on attribute space; however, transformations of this type are necessary when the classifications are fuzzy and a defuzzified value is required. The above rule, for example, can be replaced by:

((Outcome is success) if (
    ((Score is high_score) and (Players is few))
    or ((Score is high_score) and (Players is many))
    or ((Score is ¬high_score) and (Players is few))
    or ((Score is ¬high_score) and (Players is many))
)) : ((0.8297 0.8297)(0.5033 0.5033)(0.2309 0.2309)(0.1542 0.1542))

D. Defuzzification From Fuzzy Classifications To Precise Values

In many cases where we have formed a fuzzy partition of the classification space a method is required for defuzzifying from fuzzy labels to precise values. This is especially important for function approximation problems. More precisely, we need a method by which, when given a knowledge base of the form

Prob(LABEL_A = l_i) = α_i   for i = 1, ..., n

we can infer a value for A, where it is supposed here that the universe of A is some interval of the real numbers. As mentioned in the above section, we initially replace the crisp statements on labels with their equivalent fuzzy statements, giving us the transformed knowledge base

Prob(A is l_i) = α_i   for i = 1, ..., n

Now given a fuzzy restriction of the form (A is l_i), a standard defuzzification procedure is to take the average, assuming a uniform prior, of the values with membership 1 in l_i. We adopt this method here (see Ref. 5) to obtain a set of n defuzzified values, each with associated probability α_i. A single defuzzification value is then obtained simply by taking the expected value of these relative to the given probability distribution. In other words, if (A is l_i) is defuzzified to v_i, the final output value is given by

$$v = \sum_{i=1}^{n} \alpha_i v_i$$
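
A sketch of this defuzzification step, under the assumption that each classification label is a triangular fuzzy set whose core is a single apex value. The label probabilities in the call are made up for illustration; the apex values are those of class_1 to class_3 in Example 5.2.

def defuzzify(label_probs, label_cores):
    """Replace each label by the mean of the values with membership 1 in its
    fuzzy set (its core) and return the weighted average v = sum_i alpha_i v_i."""
    return sum(p * sum(label_cores[lab]) / len(label_cores[lab])
               for lab, p in label_probs.items())

# Hypothetical output distribution over three of the Example 5.2 classes
cores = {'class_1': [0.0], 'class_2': [30.78], 'class_3': [65.4959]}
print(defuzzify({'class_1': 0.2, 'class_2': 0.5, 'class_3': 0.3}, cores))   # approx. 35.03877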

V. EXAMPLES OF CLASSIFICATION AND FUNCTION APPROXIMATION WITH FUZZY ID3

We shall now discuss the performance of the above fuzzy ID3 algorithm with respect to three test problems. The first is a binary classification problem while the remaining two are function approximation problems.

Example 5.1. Consider the problem of classifying points in [−1.5, 1.5]² as legal if they lie within the ellipse y² + 2x² = 1 and illegal otherwise, given a database of triples ⟨CLASS, X, Y⟩ (see Fig. 3).

Here the database D consists of 126 triples generated by selecting random points from [−1.5, 1.5]² and labeling them with their classification value.

The X and Y universes are partitioned into 5 evenly spaced triangular fuzzy sets:

about_−1.5 = [−1.5:1 −0.75:0]
about_−0.75 = [−1.5:0 −0.75:1 0:0]
about_0 = [−0.75:0 0:1 0.75:0]
about_0.75 = [0:0 0.75:1 1.5:0]
about_1.5 = [0.75:0 1.5:1]

Using the fuzzy ID3 algorithm we obtain the following decision tree

Figure 3. An ellipse classification problem.


(L = legal, I = illegal)

X = about_−1.5:  L:0, I:1
X = about_−0.75:
    Y = about_−1.5:   L:0.0092, I:0.9908
    Y = about_−0.75:  L:0.3506, I:0.6494
    Y = about_0:      L:0.5090, I:0.4910
    Y = about_0.75:   L:0.3455, I:0.6545
    Y = about_1.5:    L:0.0131, I:0.9869
X = about_0:
    Y = about_−1.5:   L:0.1352, I:0.8648
    Y = about_−0.75:  L:0.8131, I:0.1869
    Y = about_0:      L:1, I:0
    Y = about_0.75:   L:0.8178, I:0.1822
    Y = about_1.5:    L:0.1327, I:0.8673
X = about_0.75:
    Y = about_−1.5:   L:0.0109, I:0.9891
    Y = about_−0.75:  L:0.3629, I:0.6371
    Y = about_0:      L:0.5090, I:0.4910
    Y = about_0.75:   L:0.3455, I:0.6545
    Y = about_1.5:    L:0.0131, I:0.9869
X = about_1.5:  L:0, I:1

The corresponding extended rule representation is:

((Classification = legal) if (
    ((X is about_−1.5)) or
    ((X is about_−0.75) & (Y is about_−1.5)) or
    ((X is about_−0.75) & (Y is about_−0.75)) or
    ((X is about_−0.75) & (Y is about_0)) or
    ((X is about_−0.75) & (Y is about_0.75)) or
    ((X is about_−0.75) & (Y is about_1.5)) or
    ((X is about_0) & (Y is about_−1.5)) or
    ((X is about_0) & (Y is about_−0.75)) or
    ((X is about_0) & (Y is about_0)) or
    ((X is about_0) & (Y is about_0.75)) or
    ((X is about_0) & (Y is about_1.5)) or
    ((X is about_0.75) & (Y is about_−1.5)) or
    ((X is about_0.75) & (Y is about_−0.75)) or
    ((X is about_0.75) & (Y is about_0)) or
    ((X is about_0.75) & (Y is about_0.75)) or
    ((X is about_0.75) & (Y is about_1.5)) or
    ((X is about_1.5))
)) : ((0 0)(0.0092 0.0092)(0.3506 0.3506)(0.5090 0.5090)(0.3455 0.3455)(0.0131 0.0131)(0.1352 0.1352)(0.8131 0.8131)(1 1)(0.8178 0.8178)(0.1327 0.1327)(0.0109 0.0109)(0.3629 0.3629)(0.5090 0.5090)(0.3455 0.3455)(0.0131 0.0131)(0 0))


Figure 4. Ellipse decision surface.

Note that since this is a binary problem only one rule is given; the probabilities for illegal can be calculated trivially.

These decision rules correctly classified 100% of the training data set D and 99.168% of a test database consisting of 960 points forming a regular grid on [−1.5, 1.5]². The control surface for the positive quadrant is given in Figure 4.

As mentioned before, the incorporation of fuzzy sets into decision rules facilitates their use in function approximation problems. The following two examples demonstrate their potential in this area.

Example 5.2. In this problem we are given a database, D, of triples ⟨X, Y, F(X, Y)⟩ where F(X, Y) = X² + Y². More specifically, D consists of 528 such triples where the pairs ⟨X, Y⟩ form a regular grid on [0, 10]². The independent variable universe [0, 10] is partitioned into 5 evenly spaced triangular fuzzy sets:

about_0 = [0:1 2.5:0]
about_2.5 = [0:0 2.5:1 5:0]
about_5 = [2.5:0 5:1 7.5:0]
about_7.5 = [5:0 7.5:1 10:0]
about_10 = [7.5:0 10:1]


and the dependent variable universe [0, 200] was partitioned as described in Section II to give the following 5 fuzzy sets:

class_1 = [0:1 30.78:0]
class_2 = [0:0 30.78:1 65.4959:0]
class_3 = [30.78:0 65.4959:1 99.5868:0]
class_4 = [65.4959:0 99.5868:1 200:0]
class_5 = [99.5868:0 200:1]

The fuzzy decision tree generated consists of 25 branches and is given in detail in the appendix. The tree was evaluated in terms of its performance on a test set of 1024 points on a regular grid. This gave an average percentage error of 2.2287% (i.e., the average error is given as a percentage of the Lebesgue measure of the dependent variable universe, which in this case is 200). The control surface is given below in Figure 5, together with the true values of X² + Y² for comparison.

Figure 5. x² + y² control surface.

Example 5.3. In this example we consider another function approximation problem involving a significantly more complex continuous function. Here the database consists of 528 triples ⟨X, Y, sin XY⟩ where the pairs ⟨X, Y⟩ form a regular grid on [0, 3]². Due to the complexity of the function, on this occasion 10 equally spaced triangular fuzzy sets are used to partition the independent variable domain [0, 3]. These are:

about_0 = [0:1 0.333333:0]
about_0.3333 = [0:0 0.333333:1 0.666667:0]
about_0.6667 = [0.333333:0 0.666667:1 1:0]
about_1 = [0.666667:0 1:1 1.33333:0]
about_1.333 = [1:0 1.33333:1 1.66667:0]
about_1.667 = [1.33333:0 1.66667:1 2:0]
about_2 = [1.66667:0 2:1 2.33333:0]
about_2.333 = [2:0 2.33333:1 2.66667:0]
about_2.6667 = [2.33333:0 2.66667:1 3:0]
about_3 = [2.66667:0 3:1]

As in the previous example, the dependent variable domain [−1, 1] is partitioned according to the algorithm described in Section II into 5 fuzzy classes:

class_1 = [−1:1 0:0]
class_2 = [−1:0 0:1 0.380647:0]
class_3 = [0:0 0.380647:1 0.822602:0]
class_4 = [0.380647:0 0.822602:1 1:0]
class_5 = [0.822602:0 1:1]


The fuzzy ID3 algorithm is used to generate a decision tree with 100 branches, the details of which can be found in the appendix. The percentage error on a regular test database of 1024 points was 4.22427%, and the decision surface together with the true values is given below in Figure 6.

Figure 6. sin xy control surface.

VI. CONCLUSIONS

The extension of ID3 to allow fuzzy sets as attribute values and classification classes has been shown to resolve many of the traditional difficulties associated with applying decision tree methods to real-world problems. In particular, this approach allows for a more natural and robust treatment of continuous valued attributes. Furthermore, the use of fuzzy classification values together with a suitable defuzzification procedure means that fuzzy ID3 can successfully be applied to function approximation problems.

The probabilities required for the decision tree are inferred from the database according to mass assignment theory applied to second-order representations of the data points on label spaces. Such an approach overcomes the computational difficulties associated with calculating the probabilities directly.

Finally, notice that although in this work standard entropy techniques were used for attribute selection, an alternative approach is to base such choices on discrimination analysis as described in Ref. 5. An investigation into this possibility is currently being carried out.

Supported by an E.P.S.R.C. research assistantship.

References

1. J.R. Quinlan, "Induction of decision trees," Mach. Learn., 1, 81–106 (1986).
2. J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
3. E.H. Ruspini, "A new approach to clustering," Inform. Contr., 15, 22–32 (1969).
4. J.R. Quinlan, "Decision trees as probabilistic classifiers," Proceedings of the Fourth International Workshop on Machine Learning, 1987.
5. J.F. Baldwin, T.P. Martin, and B.W. Pilsworth, FRIL—Fuzzy and Evidential Reasoning in A.I., Research Studies Press, Wiley, New York, 1995.
6. J.F. Baldwin, J. Lawry, and T.P. Martin, "A mass assignment theory of the probability of fuzzy events," Fuzzy Sets Syst., to appear.
7. J.F. Baldwin, T.P. Martin, and B.W. Pilsworth, FRIL Manual (Version 4.0), FRIL Systems Ltd, Bristol Business Centre, Maggs House, Queens Road, Bristol BS8 1QX, UK, 1988.
8. M. Umano, H. Okamoto, I. Hatono, H. Tamura, F. Kawachi, S. Umedzu, and J. Kinoshita, "Fuzzy decision trees by a fuzzy ID3 algorithm and its application to diagnosis systems," Proceedings of the Third IEEE International Conference on Fuzzy Systems, 1994, pp. 2113–2118.
9. J. Zeidler and M. Schlosser, "Continuous-valued attributes in fuzzy decision trees," Proceedings of IPMU 96, Vol. 1, 1996, pp. 395–400.
10. L.A. Zadeh, "Probability measures of fuzzy events," J. Math. Anal. Appl., 23, 421–427 (1968).
11. J.F. Baldwin and T.P. Martin, "A fuzzy data browser in Fril," in Fuzzy Logic, J.F. Baldwin, Ed., Wiley, New York, 1996.
12. U. Fayyad and K.B. Irani, "On the handling of continuous-valued attributes in decision tree generation," Mach. Learn., 8, 87–102 (1992).
13. T. Kohonen, "Self-organizing formation of topologically correct feature maps," Biol. Cybernet., 43(1), 59–69 (1982).
14. J.F. Baldwin, J. Lawry, and T.P. Martin, "A note on probability/possibility consistency for fuzzy events," Proceedings of IPMU 96, Vol. 1, 1996, pp. 521–526.
15. J.F. Baldwin, J. Lawry, and T.P. Martin, "A note on the conditional probability of fuzzy subsets of a continuous domain," Fuzzy Sets Syst., to appear.
16. J.F. Baldwin, J. Lawry, and T.P. Martin, A Mass Assignment Theory Approach to Fuzzy Rule Generation, ITRC Report, 1996.
17. R.C. Jeffrey, The Logic of Decision, Gordon & Breach, New York, 1965.

APPENDIX

A. Decision Tree for Example 5.2

X = about_0:
    Y = about_0:    C1:0.973, C2:0.027, C3:0, C4:0, C5:0
    Y = about_2.5:  C1:0.796, C2:0.203, C3:0, C4:0, C5:0
    Y = about_5:    C1:0.18, C2:0.769, C3:0.049, C4:0, C5:0
    Y = about_7.5:  C1:0, C2:0.292, C3:0.635, C4:0.072, C5:0
    Y = about_10:   C1:0, C2:0.001, C3:0.312, C4:0.685, C5:0.02
X = about_2.5:
    Y = about_0:    C1:0.796, C2:0.204, C3:0, C4:0, C5:0
    Y = about_2.5:  C1:0.555, C2:0.444, C3:0, C4:0, C5:0
    Y = about_5:    C1:0.082, C2:0.783, C3:0.0134, C4:0.146, C5:0
    Y = about_7.5:  C1:0, C2:0.178, C3:0.674, C4:0.146, C5:0
    Y = about_10:   C1:0, C2:0, C3:0.03, C4:0.864, C5:0.105
X = about_5:
    Y = about_0:    C1:0.18, C2:0.769, C3:0.049, C4:0, C5:0
    Y = about_2.5:  C1:0.08, C2:0.783, C3:0.13, C4:0, C5:0
    Y = about_5:    C1:0.004, C2:0.418, C3:0.548, C4:0.028, C5:0
    Y = about_7.5:  C1:0, C2:0.024, C3:0.477, C4:0.487, C5:0.009
    Y = about_10:   C1:0, C2:0, C3:0, C4:0.56, C5:0.439
X = about_7.5:
    Y = about_0:    C1:0, C2:0.296, C3:0.635, C4:0.0725, C5:0
    Y = about_2.5:  C1:0, C2:0.178, C3:0.674, C4:0.146, C5:0
    Y = about_5:    C1:0, C2:0.024, C3:0.477, C4:0.485, C5:0.009
    Y = about_7.5:  C1:0, C2:0, C3:0.065, C4:0.8, C5:0.131
    Y = about_10:   C1:0, C2:0, C3:0, C4:0.56, C5:0.439
X = about_10:
    Y = about_0:    C1:0, C2:0.001, C3:0.311, C4:0.685, C5:0.002
    Y = about_2.5:  C1:0, C2:0, C3:0.193, C4:0.79, C5:0.015
    Y = about_5:    C1:0, C2:0, C3:0.03, C4:0.864, C5:0.105
    Y = about_7.5:  C1:0, C2:0, C3:0, C4:0.56, C5:0.439
    Y = about_10:   C1:0, C2:0, C3:0, C4:0.204, C5:0.795


B. Extended Rules for Example 5.3

Let b denote the body term:

((X is ab_0)&(Y is ab_0)) or ((X is ab_0)&(Y is ab_0.333)) or ((X is ab_0)&(Y is ab_0.667)) or ((X is ab_0)&(Y is ab_1)) or ((X is ab_0)&(Y is ab_1.333)) or ((X is ab_0)&(Y is ab_1.667)) or ((X is ab_0)&(Y is ab_2)) or ((X is ab_0)&(Y is ab_2.333)) or ((X is ab_0)&(Y is ab_2.667)) or ((X is ab_0)&(Y is ab_3)) or
((X is ab_0.333)&(Y is ab_0)) or ((X is ab_0.333)&(Y is ab_0.333)) or ((X is ab_0.333)&(Y is ab_0.667)) or ((X is ab_0.333)&(Y is ab_1)) or ((X is ab_0.333)&(Y is ab_1.333)) or ((X is ab_0.333)&(Y is ab_1.667)) or ((X is ab_0.333)&(Y is ab_2)) or ((X is ab_0.333)&(Y is ab_2.333)) or ((X is ab_0.333)&(Y is ab_2.667)) or ((X is ab_0.333)&(Y is ab_3)) or
((X is ab_0.667)&(Y is ab_0)) or ((X is ab_0.667)&(Y is ab_0.333)) or ((X is ab_0.667)&(Y is ab_0.667)) or ((X is ab_0.667)&(Y is ab_1)) or ((X is ab_0.667)&(Y is ab_1.333)) or ((X is ab_0.667)&(Y is ab_1.667)) or ((X is ab_0.667)&(Y is ab_2)) or ((X is ab_0.667)&(Y is ab_2.333)) or ((X is ab_0.667)&(Y is ab_2.667)) or ((X is ab_0.667)&(Y is ab_3)) or
((X is ab_1)&(Y is ab_0)) or ((X is ab_1)&(Y is ab_0.333)) or ((X is ab_1)&(Y is ab_0.667)) or ((X is ab_1)&(Y is ab_1)) or ((X is ab_1)&(Y is ab_1.333)) or ((X is ab_1)&(Y is ab_1.667)) or ((X is ab_1)&(Y is ab_2)) or ((X is ab_1)&(Y is ab_2.333)) or ((X is ab_1)&(Y is ab_2.667)) or ((X is ab_1)&(Y is ab_3)) or
((X is ab_1.333)&(Y is ab_0)) or ((X is ab_1.333)&(Y is ab_0.333)) or ((X is ab_1.333)&(Y is ab_0.667)) or ((X is ab_1.333)&(Y is ab_1)) or ((X is ab_1.333)&(Y is ab_1.333)) or ((X is ab_1.333)&(Y is ab_1.667)) or ((X is ab_1.333)&(Y is ab_2)) or ((X is ab_1.333)&(Y is ab_2.333)) or ((X is ab_1.333)&(Y is ab_2.667)) or ((X is ab_1.333)&(Y is ab_3)) or
((X is ab_1.667)&(Y is ab_0)) or ((X is ab_1.667)&(Y is ab_0.333)) or ((X is ab_1.667)&(Y is ab_0.667)) or ((X is ab_1.667)&(Y is ab_1)) or ((X is ab_1.667)&(Y is ab_1.333)) or ((X is ab_1.667)&(Y is ab_1.667)) or ((X is ab_1.667)&(Y is ab_2)) or ((X is ab_1.667)&(Y is ab_2.333)) or ((X is ab_1.667)&(Y is ab_2.667)) or ((X is ab_1.667)&(Y is ab_3)) or
((X is ab_2)&(Y is ab_0)) or ((X is ab_2)&(Y is ab_0.333)) or ((X is ab_2)&(Y is ab_0.667)) or ((X is ab_2)&(Y is ab_1)) or ((X is ab_2)&(Y is ab_1.333)) or ((X is ab_2)&(Y is ab_1.667)) or ((X is ab_2)&(Y is ab_2)) or ((X is ab_2)&(Y is ab_2.333)) or ((X is ab_2)&(Y is ab_2.667)) or ((X is ab_2)&(Y is ab_3)) or
((X is ab_2.333)&(Y is ab_0)) or ((X is ab_2.333)&(Y is ab_0.333)) or ((X is ab_2.333)&(Y is ab_0.667)) or ((X is ab_2.333)&(Y is ab_1)) or ((X is ab_2.333)&(Y is ab_1.333)) or ((X is ab_2.333)&(Y is ab_1.667)) or ((X is ab_2.333)&(Y is ab_2)) or ((X is ab_2.333)&(Y is ab_2.333)) or ((X is ab_2.333)&(Y is ab_2.667)) or ((X is ab_2.333)&(Y is ab_3)) or
((X is ab_2.667)&(Y is ab_0)) or ((X is ab_2.667)&(Y is ab_0.333)) or ((X is ab_2.667)&(Y is ab_0.667)) or ((X is ab_2.667)&(Y is ab_1)) or ((X is ab_2.667)&(Y is ab_1.333)) or ((X is ab_2.667)&(Y is ab_1.667)) or ((X is ab_2.667)&(Y is ab_2)) or ((X is ab_2.667)&(Y is ab_2.333)) or ((X is ab_2.667)&(Y is ab_2.667)) or ((X is ab_2.667)&(Y is ab_3)) or
((X is ab_3)&(Y is ab_0)) or ((X is ab_3)&(Y is ab_0.333)) or ((X is ab_3)&(Y is ab_0.667)) or ((X is ab_3)&(Y is ab_1)) or ((X is ab_3)&(Y is ab_1.333)) or ((X is ab_3)&(Y is ab_1.667)) or ((X is ab_3)&(Y is ab_2)) or ((X is ab_3)&(Y is ab_2.333)) or ((X is ab_3)&(Y is ab_2.667)) or ((X is ab_3)&(Y is ab_3))
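The body term b is thus the disjunction of all one hundred conjunctions (X is ab_i)&(Y is ab_j), with i and j ranging over the ten labels 0, 0.333, 0.667, ..., 3, and each extended rule below appears to attach one support pair to each of these disjuncts in the same order. A minimal sketch for regenerating the enumeration (plain Python, for illustration only):

# Regenerate the body term b as the cross product of the ten X and Y labels.
labels = ["0", "0.333", "0.667", "1", "1.333", "1.667", "2", "2.333", "2.667", "3"]
disjuncts = ["((X is ab_%s)&(Y is ab_%s))" % (i, j) for i in labels for j in labels]
b = " or ".join(disjuncts)

print(len(disjuncts))   # 100 disjuncts
print(b[:74] + " ...")  # ((X is ab_0)&(Y is ab_0)) or ((X is ab_0)&(Y is ab_0.333)) or ...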

((Classification is Class1) if (b)):
((0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)
(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)
(0.0002 0.0002)(0.019)(0.019)(0.091 0.091)
(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0.027 0.027)(0.20 0.20)(0.46 0.46)(0.65 0.65)
(0 0)(0 0)(0 0)(0 0)(0 0)(0.019 0.19)(0.23 0.23)(0.57 0.57)(0.83 0.83)(0.88 0.88)
(0 0)(0 0)(0 0)(0 0)(0.019 0.019)(0.29 0.29)(0.75 0.75)(0.91 0.91)(0.65 0.65)(0.37 0.37)
(0 0)(0 0)(0 0)(0.0004 0.0004)(0.11 0.11)(0.62 0.62)(0.92 0.92)(0.69 0.69)(0.27 0.27)(0.049 0.049)
(0 0)(0 0)(0 0)(0.011 0.011)(0.386 0.386)(0.911 0.911)(0.772 0.772)(0.217 0.217)(0.0051 0.0051)
(0 0)(0 0)(0 0)(0 0)(0.039 0.039)(0.6 0.6)(0.51 0.51)(0.032 0.32)(0 0)(0 0))

((Classification is Class2) if (b)):
((0.99 0.99)(0.92 0.92)(0.79 0.79)(0.65 0.65)(0.53 0.53)(0.45 0.45)(0.4 0.4)(0.37 0.37)(0.35 0.35)(0.33 0.33)
(0.96 0.96)(0.74 0.74)(0.42 0.42)(0.24 0.24)(0.15 0.15)(0.087 0.087)(0.052 0.052)(0.028 0.028)(0.009 0.009)(0 0)
(0.91 0.91)(0.42 0.42)(0.066 0.66)(0.003 0.003)(0 0)(0 0)(0 0)(0 0)(0 0)(0.022 0.022)
(0.855 0.855)(0.228 0.228)(0.0045 0.0045)(0 0)(0 0)(0 0)(0 0)(0.067 0.067)(0.26 0.26)(0.32 0.32)
(0.76 0.76)(0.095 0.095)(0 0)(0 0)(0 0)(0.036 0.036)(0.3 0.3)(0.44 0.44)(0.41 0.41)(0.34 0.34)
(0.7 0.7)(0.056 0.056)(0 0)(0 0)(0.013 0.013)(0.28 0.28)(0.5 0.5)(0.4 0.4)(0.17 0.17)(0.12 0.12)
(0.64 0.64)(0.029 0.029)(0 0)(0.003 0.003)(0.26 0.26)(0.56 0.56)(0.25 0.25)(0.094 0.094)(0.31 0.31)(0.36 0.36)
(0.61 0.61)(0.017 0.017)(0 0)(0.044 0.044)(0.51 0.51)(0.37 0.37)(0.079 0.079)(0.29 0.29)(0.4 0.4)(0.28 0.28)
(0.58 0.58)(0.005 0.005)(0 0)(0.22 0.22)(0.55 0.55)(0.089 0.089)(0.23 0.23)(0.486 0.486)(0.13 0.13)(0 0)
(0.57 0.57)(0.001 0.001)(0.002 0.002)(0.37 0.37)(0.38 0.38)(0.053 0.053)(0.43 0.43)(0.3 0.3)(0 0)(0 0))

((Classification is Class3) if (b)):
((0.01 0.01)(0.078 0.078)(0.21 0.21)(0.35 0.35)(0.47 0.47)(0.53 0.53)(0.52 0.52)(0.47 0.47)(0.42 0.42)(0.38 0.38)
(0.037 0.037)(0.25 0.25)(0.57 0.57)(0.69 0.69)(0.62 0.62)(0.51 0.51)(0.42 0.42)(0.36 0.36)(0.31 0.31)(0.286 0.286)
(0.092 0.092)(0.57 0.57)(0.76 0.76)(0.47 0.47)(0.23 0.23)(0.091 0.091)(0.028 0.028)(0.02 0.02)(0.13 0.13)(0.23 0.23)
(0.14 0.14)(0.73 0.73)(0.54 0.54)(0.13 0.13)(0.013 0.013)(0.0025 0.0025)(0.097 0.097)(0.26 0.26)(0.25 0.25)(0.23 0.23)
(0.24 0.24)(0.71 0.71)(0.16 0.16)(0.002 0.002)(0.026 0.026)(0.26 0.26)(0.34 0.34)(0.27 0.27)(0.12 0.12)(0.007 0.007)
(0.29 0.29)(0.61 0.61)(0.05 0.05)(0.003 0.003)(0.19 0.19)(0.39 0.39)(0.24 0.24)(0.025 0.025)(0 0)(0 0)
(0.34 0.34)(0.44 0.44)(0.008 0.008)(0.11 0.11)(0.43 0.43)(0.14 0.14)(0 0)(0 0)(0.03 0.03)(0.21 0.21)
(0.36 0.36)(0.34 0.34)(0.006 0.006)(0.28 0.28)(0.31 0.31)(0.01 0.01)(0 0)(0.019 0.019)(0.21 0.21)(0.24 0.24)
(0.37 0.37)(0.25 0.25)(0.03 0.03)(0.43 0.43)(0.062 0.062)(0 0)(0.002 0.002)(0.21 0.21)(0.27 0.27)(0.057 0.057)
(0.38 0.38)(0.21 0.21)(0.07 0.07)(0.41 0.41)(0.01 0.01)(0 0)(0.048 0.048)(0.36 0.36)(0.076 0.076)(0.044 0.044))

((Classification is Class4) if (b)):
((0 0)(0 0)(0 0)(0 0)(0.0018 0.0018)(0.027 0.027)(0.078 0.078)(0.16 0.16)(0.24 0.24)(0.28 0.28)
(0 0)(0 0)(0.005 0.005)(0.07 0.07)(0.24 0.24)(0.4 0.4)(0.44 0.44)(0.4 0.4)(0.36 0.36)(0.3 0.3)
(0 0)(0.006 0.006)(0.18 0.18)(0.51 0.51)(0.55 0.55)(0.44 0.44)(0.4 0.4)(0.46 0.46)(0.42 0.42)(0.32 0.32)
(0 0)(0.043 0.043)(0.45 0.45)(0.64 0.64)(0.38 0.38)(0.29 0.29)(0.36 0.36)(0.32 0.32)(0.28 0.28)(0.26 0.26)
(0.001 0.001)(0.19 0.19)(0.68 0.68)(0.26 0.26)(0.31 0.31)(0.46 0.46)(0.31 0.31)(0.085 0.085)(0.002 0.002)(0 0)
(0.005 0.005)(0.33 0.33)(0.56 0.56)(0.19 0.19)(0.49 0.49)(0.29 0.29)(0.03 0.03)(0 0)(0 0)(0 0)
(0.017 0.017)(0.49 0.49)(0.29 0.29)(0.45 0.45)(0.27 0.27)(0.009 0.009)(0 0)(0 0)(0 0)(0.066 0.066)
(0.029 0.029)(0.53 0.53)(0.24 0.24)(0.49 0.49)(0.068 0.068)(0 0)(0 0)(0 0)(0.1 0.1)(0.26 0.26)
(0.044 0.044)(0.52 0.52)(0.31 0.31)(0.31 0.31)(0.002 0.002)(0 0)(0 0)(0.076 0.076)(0.33 0.33)(0.43 0.43)
(0.054 0.054)(0.46 0.46)(0.37 0.37)(0.17 0.17)(0 0)(0 0)(0.002 0.002)(0.25 0.25)(0.37 0.37)(0.45 0.45))

((Classification is Class5) if (b)):
((0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)(0 0)
(0 0)(0 0)(0 0)(0 0)(0 0)(0.003 0.003)(0.079 0.079)(0.21 0.21)(0.32 0.32)(0.42 0.42)
(0 0)(0 0)(0 0)(0.023 0.023)(0.22 0.22)(0.47 0.47)(0.58 0.58)(0.52 0.52)(0.45 0.45)(0.43 0.43)
(0 0)(0 0)(0.007 0.007)(0.23 0.23)(0.6 0.6)(0.7 0.7)(0.54 0.54)(0.35 0.35)(0.19 0.19)(0.09 0.09)
(0 0)(0 0)(0.16 0.16)(0.73 0.73)(0.67 0.67)(0.25 0.25)(0.027 0.027)(0 0)(0 0)(0 0)
(0 0)(0.004 0.004)(0.39 0.39)(0.81 0.81)(0.3 0.3)(0.017 0.017)(0 0)(0 0)(0 0)(0 0)
(0 0)(0.048 0.048)(0.703 0.703)(0.45 0.45)(0.014 0.014)(0 0)(0 0)(0 0)(0 0)(0 0)
(0 0)(0.11 0.11)(0.75 0.75)(0.19 0.19)(0 0)(0 0)(0 0)(0 0)(0.013 0.013)(0.17 0.17)
(0 0)(0.23 0.23)(0.66 0.66)(0.032 0.032)(0 0)(0 0)(0 0)(0.007 0.007)(0.26)(0.52)
(0 0)(0.33)(0.55 0.55)(0.006 0.006)(0 0)(0 0)(0 0)(0.052 0.052)(0.56 0.56)(0.51 0.51))