
Machine Learning and Data Mining

Yves Kodratoff, CNRS, LRI Bât. 490, Université Paris-Sud

91405 Orsay, [email protected]

WORKING PAPER: ENGLISH VERSION OF "Apprentissage et Fouille de Données", accepted for publication

Summary

Deep differences explain why Data Mining has been enthusiastically accepted by Industry, while Machine Learning and Exploratory Statistics still have problems being accepted by it. This paper points at all the epistemological, scientific, and industrial differences between the two, and explains why Data Mining is better accepted in Industry.

1. Introduction

Many techniques for building models from data have been developed since the 1960s. This work amounts to building automatic methods for performing inductive reasoning, but it is not always acknowledged to be of this nature. Since Data Mining (DM) is the latest manifestation of this effort, we shall briefly recall the various domains that took part in it, in order to obtain a definition of what DM might be.

Machine Learning (ML) was developed at the end of the 1970s, while Data Mining started at the beginning of the 1990s. In parallel, and since the 1960s, several techniques that all belong to Statistical Learning have been developed, together with their applications to Pattern Recognition. Statistical Learning includes the latest improvements to regression, in particular regression trees (Breiman et al., 1984), the domain called Data Analysis, the perceptron (Rosenblatt, 1958) and its extension, neural networks (around 1985, see Le Cun et al., 1989), and Support Vector Machines (Vapnik, 1995). Independently, Bayesian statistics developed its own inductive tools. The so-called “naive Bayes” technique has been used since the beginning of the 1960s (Maron, 1961). It postulates conditional independence between the features given the class variable. More recently, techniques for the automatic generation of Bayesian networks, including the structure of the network, have been developed.

I propose to call Automated Learning (AL) the domain, still in creation, that unifies ML and Statistical Learning (including Data Analysis, Pattern Recognition, Neural Network and Bayesian approaches). This leads us to propose the following definition:

Definition: From the point of view of its origins as well as from that of daily work, Data Mining is the merging of Data Base and Automated Learning research.


This shows clearly that, in spite of the still strong influence of Carnap's ideas (1959; see footnote 1) in Science, researchers in inductive reasoning have not tried to extend the capacities of models of uncertainty (this work has been done by researchers specialized in the creation of deductive reasoning models) but to improve our ways of dealing with the phenomenon of model emergence from data. Bayesian networks illustrate the ambiguity of this phenomenon particularly well. Some researchers (specialized in deduction) try to improve the inference capacities of Bayesian networks, given a structure and the conditional probabilities. They improve the reasoning capacities of the given model, and it happens that this model is capable of deductive and abductive probabilistic reasoning. Conversely, other researchers (specialized in induction) develop methods for constructing from data the conditional probability tables and the structure of the Bayesian network. The latter try to improve the fit between the data and the network, whereas the former try to improve the capacities of a given network.

Because of the relations that we have just underlined between DM and AL, it comes somewhat as a surprise that Industry enthusiastically adopted the former, while it has always looked down on the latter. Even more surprising, the analysis presented in this paper will reveal that the deep reason for the success of DM seems to be that it did not hesitate to innovate relative to traditional data processing, whereas AL preserves the majority of its epistemological choices. In reality, because of this industrial success and of the relationships between DM and AL already underlined, much of the software sold as DM is nothing but AL systems with a friendly interface. In this paper, without insisting further on this camouflage, we will point at the most important differences between DM and AL, those that can explain the “fashion effect” of DM.

2. The components of DM

Before being able to describe the differences between the two domains, we have to recall what these domains are made of. The review below presents the most widespread methods developed by each component of DM. Each method will be described by its essential aspects, namely the inputs and outputs of each system, without going into detail. In spite of our lack of exhaustiveness, the wealth of methods (often unexpected to the non-specialist) worked out in order to help humans build models from data is quite striking.

In order to explain the difference between DM and AL, it is also necessary to separate Supervised Learning, in which the inductive steps are controlled by the expert before the inductive phase, from Unsupervised Learning, where the expert's opinion is taken into account after the model has been built automatically.

2.1 Supervised and Unsupervised Learning

2.1.1. Supervised Learning

Footnote 1: Carnap states: « all inductive reasoning, ..., is reasoning in terms of probability; hence inductive logic, the theory of inductive reasoning, is the same as probability logic ... ».

Supervised Learning essentially consists in transforming a description in extension of a class into a description in intension of the same class.

Inputs are in a data table, one field of which is called the class, or the variable of interest. The other fields are called features. All the fields can be continuous or discrete.

Outputs are uncertain theorems whose premise is a combination of feature values and whose conclusion is one of the class values. This goes beyond the obvious condition that the description in intension uses fewer bits than the description in extension, which is itself a means of rating the interest of a description in intension. The validity of such a procedure is most often measured by a so-called "cross-validation", i.e., data are repeatedly divided into a learning set and a test set, upon which the precision is measured. In general, 9/10 of the initial set is used for learning, and 1/10 for the test. This procedure is repeated 10 times, and the precision is the average of the precisions obtained on each test. The value of 10 has no theoretical justification, but seems to give quite satisfactory results.
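The cross-validation procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names (`k_fold_indices`, `cross_validate`) and the toy majority-class learner are assumptions introduced here, and the data set is assumed to contain at least k examples.

```python
import random

def k_fold_indices(n_examples, k=10, seed=0):
    """Shuffle the example indices and deal them into k disjoint test folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(examples, labels, train, predict, k=10, seed=0):
    """Average precision over k learning/test splits.

    train(xs, ys) returns a model and predict(model, x) returns a class;
    both are hypothetical callables supplied by the caller.
    """
    scores = []
    for test_idx in k_fold_indices(len(examples), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        hits = sum(predict(model, examples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Toy learner used only for illustration: always predict the majority class.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model
```

With k = 10, each fold holds 1/10 of the data for testing while the remaining 9/10 are used for learning, as in the text.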

2.1.2. Unsupervised Learning

Unsupervised Learning tries to extract structures (or patterns) existing within the data, without knowing beforehand what a ‘good’ structure is.

Let us first speak of Data Analysis, which developed fundamentally deductive methods that are nevertheless often used in an inductive way, such as correspondence analysis and principal component analysis. Generally, when a matrix is diagonalized and the ‘strongest’ values are kept while the ‘weakest’ ones are ignored, an inductive use is made of a deductive method. In fact, the induction is made by the human who decides what is strong and what is weak. In principle, it would be enough to add an optimization operation for the system to become perfectly inductive. It happens that this addition is far from easy, which is why so many directly inductive methods have been developed.

Depending on the kind of structure looked for, Unsupervised Learning takes a particular name.

When the system must build classes grouping the individuals judged the most similar relative to a certain criterion, it is most often called clustering, but also classification in Data Analysis, categorization in Cognitive Sciences, and segmentation in industry.

When the system looks for logical relations confirmed by the data, that is to say theorems, in general uncertain ones, it is referred to as the detection of associations, of relations, or of patterns within the data.

When a valid functional relation among the variables is looked for, Statistics talks about logistic regression, and ML about ‘scientific discovery’. The main scientific laws, such as the law of gas compression, PV = nRT, express a functional relation holding among data.

When the relations searched for concern spatial or temporal organization, one speaks of the discovery of spatial sequences (typically in Genomics) or of temporal sequences (typically in the analysis of the stock market).

It is necessary to realize that Unsupervised Learning does not start with enough information to steer the induction steps towards a solution more or less expected by the user. Therefore, its results are extremely difficult to validate. In fact, these results are of three kinds: they can be trivial (for example, the systematic discovery of tautologies) when they are relative to a very large population; false (due to noise, for example) when they are relative to a tiny population; or very interesting, since they are unexpected.

The results of Unsupervised Learning can always be validated a posteriori, in two ways.

The first consists in a direct confirmation by a domain expert. Like any other scientific discovery, validation takes place when the discovery raises interest among the experts and brings progress to a domain. For example, a system of automatic association detection can be coupled to an Expert System, and Unsupervised Learning is a success when the induced rules, once introduced into it, improve the Expert System.

The second is a kind of cross-validation obtained by several independent optimizations. Its principle is as follows: use an optimization method, as in 2.2.3 below, to perform the Unsupervised Learning, then test the obtained results with a supervised method using another optimization criterion. For instance, our team is developing a system that detects associations using a measure called Non-contradiction, which combines the confidence in the validity of the implication A ⇒ B, P(B | A), and the confidence in the non-validity of this implication, P(¬B | A). Each rule thus found defines two classes among the examples, those that follow the rule and those that do not. This creates a clustering relative to the set of found rules, provided we accept that these classes cover the examples without partitioning them. These classes are considered, in turn, as data by a supervised induction method, here a decision tree, optimized according to a measure of entropy. The two types of rules thus created should be equivalent, and they are compared. Our experience is that the rules found by the two methods are very different, but that their comparison is easier for an expert to rate than the rules obtained in an unsupervised way.

2.2 The induction criteria

Almost all programs achieving an inductive step perform an optimized search in a space of hypotheses. Inductive Learning consists in the following 4 steps.

2.2.1. Definition of the hypothesis space

A few starting examples are chosen, and a space of hypotheses is generated as a subset of the set of all possible generalizations of these examples. To learn grammatical tagging, i.e., labeling each word of a text by its grammatical category, for example, one particular tagging is observed, usually in a very large set of tagged sentences, and all the generalizations describing the context of this tagging are generated. The system then learns the generalizations that are the most confirmed in the tagged text. In general, this very important step is described very briefly or even poorly by the authors: they blend together domain knowledge, arbitrary choices of knowledge representation, and hidden heuristics. A precise definition of the hypothesis space is indeed difficult to give correctly. For example, Brill's tagger (1994) learns tags in context, and it describes the context of a word by the labels or the words preceding or (exclusive or) following it within a distance of three words. This defines a limited space of possible generalizations; the exclusion of generalizations that include words or labels both preceding and following the word to be tagged does not correspond to a theory about the environment of a word. Its role is simply to limit the size of the search space. One contribution of DM is to have insisted on the importance of this step, which must be explicit so that one can understand that the results can only be a combination of allowed generalizations. For example, association detection in DM is done under conditions of coverage and precision that are of questionable interest, but that are perfectly explicit.

2.2.2. Choice of a search strategy within the hypothesis space

The most common strategies are the so-called greedy and exhaustive strategies. Brill's tagger, cited above, uses an exhaustive search in a very limited hypothesis space.

DM, in association detection, also uses an exhaustive search. In the greedy strategy, the possible choices are ranked according to an optimization measure (see 2.2.3 below), and the path passing through the first point optimizing this measure is chosen.

The ML approach put a lot of effort into studying these search strategies, whereas other approaches tend to adopt the exhaustive search.

Random choices, a variant of the exhaustive strategy, seem very efficient. Genetic Algorithms are also acknowledged as one of the most efficient search strategies.

2.2.3. Choice of an optimization criterion

The number of optimization criteria is impressive; consequently, we will discuss them in detail.

The two most often used criteria are precision (number of successes / total number of tries) and recall (number of successes / number of objects to be recognized). In Supervised Learning, the number of successes is given by the number of cases where the class is correctly recognized by the hypothesis under test. In Unsupervised Learning, precision is an evaluation given by the expert examining a subset of the obtained results. Usually, the expert examines only a subset because, if the automatic method is efficient, then it deals with too much data and too many results to be entirely checked by a human. Note then that in Unsupervised Learning the number of objects to recognize is unknown. Therefore, recall is not computable unless the expert makes an exhaustive analysis, and we have just underlined that this is unrealistic on real problems. It is also “well known” (at least in the ML community) that the precision measure expresses a particular hypothesis on the nature of data, and that the Laplace estimator, (number of successes + 1) / (total number of tests + number of classes to be recognized), is to be preferred (see http://www.lri.fr/~yk/ for explanations relative to this phenomenon). It is also interesting to consider the number of times the class is falsely recognized (the so-called “false positive” recognition), that is to say the cases where the hypothesis under test is mistaken while recognizing a class. The ROC curves (Receiver (or Relative) Operating Characteristic), used especially in DM, represent the variation of the correctly recognized against the falsely recognized classes. Precision is also used in DM by drawing lift charts, which give the variation of the precision according to the number of examples examined. A good lift chart rises very fast, which is very important in the unsupervised approach: a high precision is then reached with a small number of examples validated by the expert, and this is worth consideration since expert work is always expensive.

Another classical criterion is entropy variation, systematically used by decision trees, or the quadratic entropy (also called Gini index) used in ML and in Data Analysis. When a numerical distance between objects is computable, which is a usual hypothesis in Data Analysis and in regression techniques, multiple transformations of the data representation are possible, based, in general, on minimizing the squares of a distance. The statistical approach very often uses the hypothesis that a distribution of small variance is better than one of large variance, and it uses the minimization of the variance as an optimization criterion, for instance in the case of regression trees.
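The criteria named above can be written down in a few lines. The sketch below is only illustrative: the function names are assumptions, the entropy is taken in bits (base-2 logarithm), and the Gini index is taken in its usual form, one minus the sum of squared class frequencies.

```python
import math
from collections import Counter

def precision(successes, tries):
    """Number of successes / total number of tries."""
    return successes / tries

def recall(successes, to_recognize):
    """Number of successes / number of objects to be recognized."""
    return successes / to_recognize

def laplace_estimate(successes, tries, n_classes):
    """(successes + 1) / (tries + number of classes), as given in the text."""
    return (successes + 1) / (tries + n_classes)

def entropy(labels):
    """Shannon entropy, in bits, of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Quadratic entropy (Gini index) of the class distribution in `labels`."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```

Unlike plain precision, the Laplace estimator stays defined when no test has been run yet: with two classes and zero tries it returns 1/2 instead of dividing by zero.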

The Bayesian approach uses the fact that data tables give the probability of the data knowing that the studied phenomenon, Ph, took place, P(D | Ph). In Supervised Learning, Ph is, for example, belonging to a class, and in the unsupervised case it can be the validity of a pattern. One can therefore deduce the probability of the phenomenon given the data by computing P(Ph | D) = P(D | Ph) * P(Ph) / P(D). This process uses the least possible induction: it simply induces that the values of P(Ph | D) computed from observed data stay valid for new data.

Finally, it is also classical to use the principle of Minimum Description Length (MDL). For this, both the length of the description of the model and the length of the description of the examples it fails to classify correctly are encoded (in the supervised case). According to this principle, the minimum of the sum of these two values is an optimum. The software C4.5 transforms trees into rules according to this principle. The present methods of Bayesian network construction from data also use the MDL principle systematically. In this unsupervised case, the encoding includes the network and the examples, given this network.
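As a worked illustration of the Bayes computation P(Ph | D) = P(D | Ph) * P(Ph) / P(D), the following sketch computes the posterior for a finite set of candidate phenomena. The function name `posterior` and the numbers in the usage note are assumptions made for the example; P(D) is obtained by summing over the candidates, which presumes that they are exhaustive and mutually exclusive.

```python
def posterior(likelihoods, priors):
    """Return P(Ph | D) for every candidate phenomenon Ph.

    likelihoods[ph] is P(D | Ph) and priors[ph] is P(Ph).
    P(D) is the sum of P(D | Ph) * P(Ph) over all candidates.
    """
    joint = {ph: likelihoods[ph] * priors[ph] for ph in priors}
    evidence = sum(joint.values())
    return {ph: joint[ph] / evidence for ph in joint}
```

For instance, with two hypothetical classes such that P(D | c1) = 0.8, P(c1) = 0.2, P(D | c2) = 0.1 and P(c2) = 0.8, the joint values are 0.16 and 0.08, so P(c1 | D) = 0.16 / 0.24 = 2/3.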

DM also introduced other measures of optimization, more linked to applications, such as the optimization of operation cost, or the return on investment, etc.

2.2.4. Validation

AL was in general satisfied with validations associated with the chosen optimization criteria. As the criterion of precision is the most often used, validation reduces to showing that the most precise hypotheses have been used, as in the cross-validation described above.

DM insisted on the importance of a further phase of validation. Either the induced results are directly examined by an expert who confirms their validity (the comprehensibility of results is then a primary condition), or, and this is the best validation, the results of the induction are used for a task whose efficiency is measurable. Validation then takes place when efficiency is increased by the introduction of the induced knowledge.

2.3 Data Mining

One can consider that DM was born in 1989, when Gregory Piatetsky-Shapiro organized the first workshop on "Knowledge Discovery in Data Bases". However, its first spectacular demonstration dates from 1995, when he organized the first KDD conference in Montreal. Among multiple applications and original points of view, DM gave birth to three main types of methods, all of which are included in commercial DM systems.

The first is association detection, in particular the discovery of uncertain theorems confirmed by the data, and the multiple measures of interest used to choose among all valid theorems. The DM approach focuses on the problems raised by applications to very large data bases.

It happens that these methods can be very easily extended to the discovery of temporal series, which became the second noticeable success of DM as a scientific discipline. The defect of classical methods of association detection, that is to say their exhaustiveness limited by the cover only (if, for example, A & B ⇒ C, then the cover of this implication is the probability that A, B and C are true together), becomes an advantage when discovering temporal series, since they can be considered valid only when the series is repeated often enough.

Finally, and under industrial influence, DM developed multiple methods for data cleaning and segmentation.
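The notion of cover used above for association detection can be illustrated on a small transaction table. This is a sketch under stated assumptions: transactions are represented as Python sets of items, the function names are hypothetical, and `confidence` is the usual estimate of P(conclusion | premise), which the text does not define explicitly.

```python
def cover(transactions, items):
    """Fraction of transactions in which every item of `items` is present."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, premise, conclusion):
    """Estimate of P(conclusion | premise) from the transactions."""
    both = set(premise) | set(conclusion)
    return cover(transactions, both) / cover(transactions, premise)

# Illustrative data: four transactions over the items A, B and C.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
rule_cover = cover(transactions, {"A", "B", "C"})             # A & B => C
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
```

Here A, B and C are true together in one transaction out of four, so the cover of A & B ⇒ C is 0.25, while C holds in one of the two transactions containing both A and B.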

2.4 Machine Learning (ML)

The first programs that learn rules from data are due to Michalski (Michalski and Chilausky, 1980), and this dates the beginning of ML, although the first ML workshop (which became the International Machine Learning Conference) took place in 1982. The work of Dietterich and Michalski (1982) shows that learning structures was a very early concern for ML. This research field accomplished most of its work in Supervised Learning and generated a number of systems, some of which are included in industrial software. In particular, decision trees require as input discrete or continuous features (which will be made optimally discrete with regard to the classes) and imperatively discrete classes. They produce classification trees that are a description in intension of the classes. The most famous of these systems, C4.5 (now sold as C5 or See5), generates rules built from the decision tree.

Other systems, such as AQ and CN2, generate classification rules directly from the data, generally discrete or previously discretized.

One of the basic procedures of Learning is generalization. The space of possible generalizations has been called the version space, and many methods propose their own way to move in the version space.

Inductive Logic Programming (ILP) is precisely one way to move in a relational version space. All other methods suppose implicitly that a feature is in relation with only one record, i.e., features are postulated to be unary. Otherwise stated, the i-th feature takes a value for the j-th record. In ILP, a feature can be n-ary, i.e., it describes a relation between n objects. For example, one can describe the properties of objects A and B with features taking unary values (such as: A is red, B is blue), or with a binary feature, such as the distance between A and B (for instance, distance(A, B) = 27). From inputs of this type, ILP will learn some general laws about the distance, for example that there are no objects more than 50 units away from B: [For all x, distance(x, B) < 50]. However, the space of possible hypotheses becomes huge, and the algorithms checking the validity of the hypothesized relations are NP-complete. It follows that the descriptive power of ILP is balanced by the complexity of the computation necessary to verify the hypotheses allowing the program to build a model explaining the data. This is why the domain now seems to move towards the so-called propositionalisation methods, in which n-ary descriptions are trivially replaced by unary relations: one creates, in principle, as many descriptions as there are possible variable matchings. The combinatorial explosion in time is replaced by a combinatorial explosion in space. The gain comes from the fact that only a few (that is, thousands of) “carefully chosen” descriptions are preserved. The heuristics defining the way to choose the descriptions to keep (including the trivial heuristic of random choices) constitute the main topic of research for this new approach.

In clustering, the main contribution of ML is the Unsupervised Learning system COBWEB. COBWEB uses yet another optimization criterion, called utility. The utility of a class C, for a feature A taking the value v, is computed as the product of the probabilities P(A = v) P(A = v | C) P(C | A = v). P(A = v) is the probability that feature A takes value v; P(A = v | C) is the probability that feature A takes value v in class C; P(C | A = v) is the probability of meeting the concept C when A = v. Of course, Bayes' law is used to rewrite this expression in order to compute the sum of the utility gains brought by each class, so that the formula giving the utility of a clustering is:

$$U = \frac{1}{n} \sum_{C} P(C) \sum_{A} \sum_{v} \left[ P(A = v \mid C)^2 - P(A = v)^2 \right]$$

where n is the number of classes, the first sum is over all classes, and the two following ones are over all features and all their values. U is computed for every possible configuration, which would be impossible if one did not compute the utility gain incrementally. COBWEB is therefore very slow, but incremental, so that it is very well adapted to problems requiring regular updating. Besides, the sums are replaced by integrals when dealing with continuous values, so that COBWEB adapts well to mixed, continuous and discrete, data. In spite of all these qualities, COBWEB, still written in LISP, is not part of any commercial software.
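The utility of a clustering, as defined above, can be computed directly for discrete features. The sketch below is not COBWEB itself (it is neither incremental nor hierarchical); it only evaluates U for a given partition. It assumes that each object is a dictionary mapping feature names to discrete values, that all objects share the same features, and that no class is empty. The name `category_utility` is an assumption.

```python
from collections import Counter

def category_utility(clusters):
    """Utility U of a clustering given as a list of non-empty classes.

    Each class is a list of objects; each object is a dict {feature: value}.
    """
    objects = [obj for cluster in clusters for obj in cluster]
    total = len(objects)

    def sum_squared_probs(objs):
        # Sum over all features A and values v of P(A = v)^2 within `objs`.
        counts = Counter((a, v) for obj in objs for a, v in obj.items())
        return sum((c / len(objs)) ** 2 for c in counts.values())

    baseline = sum_squared_probs(objects)          # uses P(A = v)
    gain = sum(len(c) / total * (sum_squared_probs(c) - baseline)
               for c in clusters)                  # uses P(C) and P(A = v | C)
    return gain / len(clusters)
```

A partition that separates the values perfectly gets a positive utility, while a partition whose classes all reproduce the global distribution gets a utility of zero.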

2.5 Pattern Recognition

Before ML started, researchers in Pattern Recognition developed learning programs, of which the most used is a linear separator called the perceptron (Rosenblatt, 1958). One can prove that a perceptron is able to separate two linearly separable sets of examples, indexed by 0 and 1, in a finite number k of calculation steps, where k is bounded by Novikoff's important theorem, k < (R/γ)^2. R is the radius of the data (that is to say the radius of the volume they fill in the space of n features), and γ is the maximum, over all separating hyperplanes, of the minimum, over all examples, of the distance between the example and the hyperplane, which is now called the functional margin of the separating hyperplane.
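The perceptron learning rule referred to above can be sketched as follows. This is a minimal version under stated assumptions: the classes 0 and 1 are mapped to -1 and +1, a bias term is learned alongside the weights, and the function names and the `max_epochs` safeguard are introduced here for illustration. The returned count of corrections is the quantity that Novikoff's theorem bounds when the data are linearly separable.

```python
def train_perceptron(examples, labels, max_epochs=1000):
    """Learn a separating hyperplane; returns (weights, bias, corrections)."""
    w = [0.0] * len(examples[0])
    b = 0.0
    corrections = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(examples, labels):
            y = 1 if label == 1 else -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:
                # Misclassified example: move the hyperplane towards it.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        corrections += mistakes
        if mistakes == 0:
            return w, b, corrections
    raise ValueError("no separating hyperplane found within max_epochs")

def predict(w, b, x):
    """Class (0 or 1) assigned to x by the hyperplane (w, b)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Illustrative data: the logical AND of two inputs is linearly separable.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]
w, b, corrections = train_perceptron(and_inputs, and_labels)
```

On data that are not linearly separable (the XOR function, for example), the loop never reaches an epoch without mistakes, which is why the sketch stops after `max_epochs`.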

Neural Networks (NN) were born from the need to go beyond linear separators, but their success is rather due to the fact that they deal with inputs and outputs, possibly multiple outputs, that can be continuous, discrete, or mixed. These two properties (mixed variables and multiple outputs), inherent to the way a NN is built, correspond to a real industrial need. NN are now used within the framework of Vapnik's theory, that is to say as support vector machines (SVM), in order to be able to compute their generalization capacity. NN led to an unsupervised version, self-organizing maps (Kohonen, 1990). Kohonen's maps implement a particular kind of NN, the so-called competitive NN. The success of an output neuron (belonging to what is called, in this context, the competition layer) in recognizing an input reinforces the neuron winning the competition and inhibits the other neurons, so that the winner for an example tends to specialize in the recognition of this example.

A self-organizing map is a NN whose outputs are equal in number to the number of classes. Two examples belong to the same class if they activate the same output. Outputs are represented as arranged in a plane, as the nodes of a grid, which is where the name “maps” comes from.

2.6 Exploratory Statistics

Statistics are extensively taught in University curricula, but their exploratory aspect is much less taught; we will therefore give a few details on this aspect. The hypothesis underlying all inductive statistics, in spite of the diversity of the proposed methods, is that the smaller the variance, the better the model.

For example, the k-means method minimizes intra-class variance, so that if N is the number of objects to classify, x_i the coordinates of the i-th object, and μ_m the coordinates of the center of gravity of the m-th class, then the quantity to minimize is

$$\frac{1}{N} \sum_{m} \sum_{i} (x_i - \mu_m)^2$$

where the inner sum runs over the objects i assigned to class m.

The difference between k-means and other approaches comes from the various methods for choosing the seeds (i.e., astutely choosing the first μ_m), and from the subsequent allocation-recentering technique (i.e., astutely computing the next μ_m).
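The allocation-recentering loop can be sketched as follows. This is a plain version of k-means under stated assumptions: the seeds are supplied by the caller (no astute seed selection is attempted), the distance is the squared Euclidean distance, and all function names are introduced here for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Center of gravity of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, seeds, iterations=100):
    """Alternate allocation and recentering until the centers stop moving."""
    centers = [tuple(s) for s in seeds]
    for _ in range(iterations):
        # Allocation: each object joins the class of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda m: sq_dist(p, centers[m]))
            groups[nearest].append(p)
        # Recentering: each center moves to the center of gravity of its class.
        new_centers = [mean(g) if g else centers[m]
                       for m, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

def intra_class_variance(centers, groups):
    """The quantity minimized: (1/N) * sum over classes and objects."""
    n = sum(len(g) for g in groups)
    return sum(sq_dist(p, c) for c, g in zip(centers, groups) for p in g) / n

# Illustrative one-dimensional data with two obvious classes.
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers, groups = kmeans(data, seeds=[(0.0,), (10.0,)])
```

Each allocation and each recentering step can only decrease the intra-class variance or leave it unchanged, which is why the loop ends at a local minimum that depends on the seeds.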

Regression, be it logistic or not, looks for a solution that minimizes the variance of a distance, most usually given by the sum of the squares of the distances of the objects to the solution. When the model to be discovered is not given in advance, the way logistic regression discovers the model is truly inductive work.

Regression Trees (Breiman et al., 1984) use exactly the same technique, except that they first divide the space of solutions into tiles, each of which is a leaf of the regression tree. Thus, the building of the regression tree itself brings no new concept to the foreground. Conversely, the notion of an optimal path for pruning the tree built in this way, introduced by Breiman, is indeed a new concept added to Exploratory Statistics.

Finally, support vector machines (SVM, Vapnik, 1995), in their simplest linear form, are nothing but perceptrons that minimize the variance of the distances between the objects and the separating hyperplane, which amounts to what is called the maximization of the functional margin. The notion of kernel, which makes it possible to simulate nonlinear separations, and the notion of Vapnik-Chervonenkis dimension are, on the contrary, completely original. Through these aspects, SVM introduce exploratory statistics of a completely new type.

2.7 Data Analysis

Data Analysis (DA) is taught very extensively in university courses; for this reason, we will not give any details on this approach.

The basic method of DA (clustering excepted) consists in studying the points in R^n formed by the studied objects. After centering (i.e., expressing coordinates as distances to the mean) and reducing (i.e., dividing by the standard deviation), a family of ellipsoids centered on the mean is studied, and the one closest (i.e., most often, the one minimizing variance) to the largest number of objects is deemed the best representation of the data. The axes of this best ellipsoid reflect the main tendency of the data. As we pointed out above, induction takes place by choosing the ‘relevant’ number of axes of the ellipsoid. DA also developed methods for the Unsupervised Learning of classes by grouping the individuals that are nearest in the sense of a numerical distance.

2.8 Bayesian statistics

The main effort of Bayesian statistics is relative to the development of deductive reasoning methods taking into account the conditional independence of discrete variables. From the point of view of induction, they developed two techniques.

The first one is a Supervised Learning method called "Naive Bayes", in which all features are assumed to be conditionally independent of each other once the class to be recognized is known. Learning, in this case, reduces to taking into account the probabilities of occurrence of the observed events, but it is one of the most efficient methods in precision, and it can deal with many features. Note however that the generated model is absolutely incomprehensible.
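A Naive Bayes learner of this kind can be sketched in a few lines for discrete features. The sketch rests on assumptions that go beyond the text: the conditional probabilities are smoothed with the Laplace estimator, scores are accumulated as logarithms to avoid numerical underflow, and the function names and toy data are purely illustrative.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, classes):
    """Count class frequencies and, per class, the values of each feature."""
    class_counts = Counter(classes)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, c in zip(rows, classes):
        for j, v in enumerate(row):
            value_counts[(c, j)][v] += 1
            values_seen[j].add(v)
    return class_counts, value_counts, values_seen

def classify(model, row):
    """Return the class maximizing P(class) * product of P(feature | class)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())

    def log_score(c):
        score = math.log(class_counts[c] / total)
        for j, v in enumerate(row):
            # Laplace-smoothed estimate of P(feature j = v | class c).
            numerator = value_counts[(c, j)][v] + 1
            denominator = class_counts[c] + len(values_seen[j])
            score += math.log(numerator / denominator)
        return score

    return max(class_counts, key=log_score)

# Illustrative data: two discrete features and two classes.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
classes = ["out", "out", "in", "in"]
model = train_naive_bayes(rows, classes)
```

The learned model is nothing more than tables of counts, which illustrates why learning is so cheap and also why the result is hard for an expert to read.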

The second one is unsupervised. Two different things can be learned. For a given network structure, it is possible to learn the conditional probability tables from data. The comprehensibility is then entirely due to the network. It is also possible to learn the structures themselves. In this case, the automatic generation of large Bayesian networks (Heckerman et al., 1995) constitutes essential progress in the domain of inductive reasoning. The optimization criterion used is MDL, the principle of minimum description length. This approach induces comprehensible structures from examples. However, recent work (Bendou and Munteanu, 2003) devised experiments showing that a very small amount of noise, of the order of 1%, changes many structures of the network. The authors also proved the generality of their experimental results by using the properties of d-separation. Only the V-substructures, which express a conditional dependence of two nodes relative to a variation of knowledge about a third node or its descendants (in other words: two variables are the common ‘cause’ of a third one), resist noise well and might possibly be considered explanatory by the domain expert. The other structures are not stable, and arise for the sole purpose of optimizing the behavior of the network in precision. This approach has also produced a classification method, AUTOCLASS, that builds classes using a search, exhaustive at least in principle, for classes that are conditionally optimal relative to the data. To our knowledge, this approach has not yielded any industrial software, even though an American company tried to sell it.

3. DM/AL differences from the point of view of epistemology

These differences are summarized below in Table 1, which shows that DM and AL, though both automate the generation of a model from data, otherwise differ in many epistemological choices.

Differences in the scientific approach

| Classic data processing | Automated Learning (ML and Statistics) | DM |
|---|---|---|
| Simulates a deductive reasoning (= applies an existing model) | Simulates an inductive reasoning (= invents a model) | Simulates an inductive reasoning ("even more inductive") |
| Validation according to precision | Validation according to precision | Validation according to utility and comprehensibility |
| Results as universal as possible | Results as universal as possible | Results relative to particular cases |
| Elegance = conciseness | Elegance = conciseness | Elegance = adequacy to the user's model |

Position relative to Artificial Intelligence

| Classic data processing | Automated Learning (ML and Statistics) | DM |
|---|---|---|
| Tends to reject AI | Either tends to reject AI (Statistics) or claims belonging to AI (ML) | Naturally integrates AI, DB, Stat., and MMI |

Table 1: Differences of epistemological nature among Computer Science, AL and DM.

3.1 Induction

Classic Computer Science applies existing models and, as we already pointed out, these models can be of a probabilistic nature. In the same way, methods of fuzzy inference propose a model, the fuzzy model, and study how this model can be applied to real data. Some approaches inductively produce fuzzy models, such as fuzzy decision trees (see http://www.lri.fr/~yk/ for a particularly simple presentation of fuzzy decision trees) or fuzzy rules. This addition of fuzziness makes the induction more complex (and requires fuzzy data) but does not modify its nature. In the same way, Rough Sets are a knowledge representation and propose a model. They are therefore deductive by nature. This does not prevent them from introducing induction within their representation, but the induction methods then introduced are the same as those of the other inductive approaches. AL obviously works on the automatic generation of models, but while the majority of the systems stemming from AL perform Supervised Learning, the majority of systems stemming from DM perform Unsupervised Learning. This is why table 1, above, states that DM is “even more inductive” than AL.

3.2 Validation

In classical Computer Science, because of the weight of the deductive approach, a result is definitively validated once it has been integrated into a formal model, so that it appears deducible from this model. Here, we will not deal with this final phase, but only with the initial phase during which the first experimental results are obtained. AL, as well as classical data processing, uses a criterion of precision to choose the most meaningful experimental results. In fact, symbolic Learning has sometimes also introduced criteria of comprehensibility. For example, the decision tree induction software C4.5 introduces a final procedure during which decision trees are transformed into rules, which are often more comprehensible than trees. Besides, the creation of short trees or short rules is preferred, even at the cost of a small loss of precision, in order to favor comprehensibility. The concern for comprehensibility therefore did not appear ex nihilo in DM. It is necessary to admit, however, that most research efforts, even in the field of symbolic ML, have been judged on criteria of precision rather than on comprehensibility.

DM considers that precision is only one of the possible criteria, and substitutes for it the concept of utility. Utility is obviously not universal, and DM therefore introduces at this point a definition of validation that depends on each problem, which is both very new and very interesting for each application. Some criteria of utility, such as the patient’s pain in Medicine, depend closely on the application and are completely incompatible with precision. DM therefore does not hesitate to introduce some social considerations in the criteria for algorithm validation, which is classically considered a "scientific heresy." Comprehensibility is also a criterion of a social nature; more precisely, it means expressing the induced model in the language of the concerned field, while using the concepts of that field's experts.

In fact, DM supposes that a society of experts exists, and that it shares the same concepts and speaks the same language (which is quite sensible), and DM explicitly addresses one of these societies in each of its applications. Validation happens within this society of experts and not in an absolute sense.

3.3 Universality

In fact, choosing utility, as opposed to precision, as a criterion of optimization is already an example of choosing the particular over the universal. AL is essentially about the general methods of induction and their properties. Conversely, DM is essentially about the application of induction methods to particular problems. For example, AL assumes that data are not spoiled by unverifiable mistakes that would prevent induction from taking place correctly, whereas DM considers that each data set requires a particular cleaning treatment. At the other end of the chain of knowledge acquisition, AL considers its work accomplished once knowledge that satisfies a given criterion is acquired. DM considers that this knowledge must be useful in the relevant specialty domain, and that it must be validated by an improvement of the existing methods of this specialty domain. In addition, it is quite characteristic that the DM conferences, even the academic ones, systematically organize competitions among systems, and that domain experts are called to judge the excellence of the results obtained. In a similar way, in Text Mining, methods are adapted to a particular corpus, whereas the more classical Linguistics approach analyzes the general laws of the language. This difference might appear trivial but it is fundamental. It is quite obvious that it is impossible to rewrite all programs for each application. This is why DM develops tools allowing experts themselves to develop their application, for every particular case. This requirement forces user-friendliness in setting the program parameters, and that leads to methods that adapt to different applications. Thus, by a kind of epistemological sleight of hand, DM, which is less interested in the general, builds systems that have more potential applications (and are therefore, in a sense, more general) than AL.

3.4 Elegant conciseness

There have been many debates about the criterion called "Occam's razor", which prefers the simplest solution. It remains the rule for most approaches (in the DM community, see a discussion in Domingos, 1998). Of course, nothing really scientific justifies it, except the aesthetic pleasure scientists take in using it. This systematic conciseness, when it results in a lack of clarity in the exposition of the induced model, conflicts with DM's principle of comprehensibility.

3.5 Relations with Artificial Intelligence

It is relatively surprising to note that DM integrates AI perfectly, and apparently without problem, with approaches that traditionally rejected AI. Of the two AL components, the symbolic component declared that it belonged to AI, whereas Statistics, and even Pattern Recognition, kept their distance from AI. It is possible that the academic quarrels for or against AI do not really concern the industrial world, and that this integration of AI in DM is not a reason for industrial acceptance, but a consequence of an industrial concern.

4. An industry view of the differences DM/AL

4.1 The twelve tips for successful Data Mining, according to the Oracle Data Mining Suite

These tips can still be found on the web, in .pdf form, at: http://technet.oracle.com/products/datamining/listing.htm

We use these tips as interesting witnesses of what an industrial user might ask of a DM method. We shall see that very interesting truths are hidden under their humorous formulation.

4.1.1 - Mine significantly more data.

AL has a tendency to look deeply into small databases, whereas DM concentrates its efforts on the very large ones.

4.1.2 - Create new variables to tease more information out of your data

AL, and especially ML, developed methods called “constructive induction” and “feature selection” (Liu and Motoda, 1998), that is to say, ways to create or eliminate features. However, this effort essentially concentrated on justifying beforehand the modifications made to the features, while DM is content with a posteriori justifications, observed through the improvement of the obtained model.
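The a posteriori attitude described above can be sketched as follows: a new variable is created, and it is kept only if the model obtained with it is observed to be better. The records, the derived ratio and the one-threshold rule used as a model are all hypothetical choices of this sketch, and the accuracy is measured on the training records only, for brevity.

```python
def best_stump_accuracy(rows, labels):
    """Best training accuracy reachable by any one-feature threshold rule."""
    best = 0.0
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above in (1, 0):  # class predicted when the value exceeds t
                predicted = [above if r[f] > t else 1 - above for r in rows]
                accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(rows)
                best = max(best, accuracy)
    return best


# Hypothetical records (debt, income); the hidden law depends on their ratio.
rows = [(10, 100), (60, 100), (20, 30), (10, 40),
        (90, 200), (120, 200), (30, 50), (15, 60)]
labels = [1 if debt / income > 0.5 else 0 for debt, income in rows]

raw_accuracy = best_stump_accuracy(rows, labels)

# Constructed variable: debt-to-income ratio, appended to each record.
enriched = [(debt, income, debt / income) for debt, income in rows]
enriched_accuracy = best_stump_accuracy(enriched, labels)

# A posteriori justification: keep the new variable only if the model improves.
keep_new_feature = enriched_accuracy > raw_accuracy
print(raw_accuracy, enriched_accuracy, keep_new_feature)
```

No argument is given beforehand for why the ratio should matter; the only justification is the observed improvement, which is the point of the tip.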

4.1.3 - Take a shallow dive into the data first

A superficial approach is never advisable in an academic context. However, a quick, superficial examination of the data avoids many crude mistakes.

4.1.4 - Rapidly build many exploratory predictive models

AL tries to build the single ‘best’ explanatory model, whereas DM does not hesitate to produce several explanatory models. Even in the case of new techniques (actually born after DM started to exist) such as boosting and bagging, the main effort consists in devising a kind of voting procedure that provides one best result, usually the most precise one. The DM approach would be to keep the different models generated and to help the domain expert choose among them, or combine them in an optimal way.

4.1.5 - Cluster your customers first, and then build multiple targeted predictive models.

As we saw already, in AL, supervised approaches are distinctly dominant, while the unsupervised ones lead DM. We also saw that one of the goals of Unsupervised Learning is clustering, therefore a segmentation of the records of the DB. Once this segmentation is done, methods of rule generation, for example, can be applied to each segment.

This advice may appear innocent and somewhat superficial. Yet, it is very important. When pattern detection methods are applied to the entire database, general laws valid for all individuals are sought, and this often leads to detecting only trivial laws, valid for all the records. Conversely, a prior segmentation allows us to detect patterns valid on some sub-populations. If these sub-populations are meaningful, that is to say if the segmentation makes sense, then the laws thus found have a good chance of being interesting, being either unknown to the expert or merely suspected by him.

We see that this advice is an illustration of the difference about universality, commented above in 3.3.
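A minimal sketch of this advice, under purely hypothetical assumptions (invented customer records, a one-dimensional k-means on age with two fixed initial centers, and a single candidate rule "uses_app -> buys"), shows how a rule that looks uninformative on the whole database can be sharp on each segment.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain 1-D k-means with given initial centers (deterministic)."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Keep the old center when a group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def assign(value, centers):
    """Index of the closest center."""
    return min(range(len(centers)), key=lambda k: abs(value - centers[k]))


def confidence(records):
    """Confidence of the rule 'uses_app -> buys' on the given records."""
    matching = [r for r in records if r["uses_app"]]
    if not matching:
        return None
    return sum(r["buys"] for r in matching) / len(matching)


# Hypothetical customer records.
customers = [
    {"age": 22, "uses_app": True, "buys": True},
    {"age": 25, "uses_app": True, "buys": True},
    {"age": 27, "uses_app": True, "buys": True},
    {"age": 24, "uses_app": False, "buys": False},
    {"age": 61, "uses_app": True, "buys": False},
    {"age": 65, "uses_app": True, "buys": False},
    {"age": 68, "uses_app": True, "buys": False},
    {"age": 63, "uses_app": False, "buys": True},
]

centers = kmeans_1d([c["age"] for c in customers], centers=[20.0, 70.0])
segments = {}
for c in customers:
    segments.setdefault(assign(c["age"], centers), []).append(c)

global_confidence = confidence(customers)
segment_confidence = {k: confidence(rs) for k, rs in segments.items()}
print(global_confidence, segment_confidence)
```

On the whole set, the rule holds for half of the records it covers and would be discarded; within each age segment it becomes a clear-cut law, in opposite directions.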

4.1.6 - Automated model building

This advises the use of induction; it makes no difference between AL and DM. It nevertheless illustrates that automatic building of models, i.e. the automation of inductive reasoning, is not a fancy of academics but an industrial need.

4.1.7 - Demystify neural networks and clusters by reverse engineering them using C&RT models

Neural Network classification techniques, together with many other approaches, could be "demystified", since they are not the only ones to provide incomprehensible results. DM does not recommend the exclusive use of techniques giving comprehensible results, and all techniques of data mining are acceptable. It is, however, unacceptable in DM to provide crude outputs without interpreting them in a language comprehensible to the user. It follows that the concept of reverse engineering should become central in DM.
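Reverse engineering of this kind can be sketched as follows: an opaque model is queried on a set of inputs, and a comprehensible model is fitted to its answers rather than to the original labels. In this sketch the opaque model is a hypothetical stand-in function, and the comprehensible model is a one-threshold rule, that is, a decision tree of depth one, which is far simpler than a full C&RT tree.

```python
def black_box(x):
    """Stand-in for an opaque model (hypothetical scoring formula)."""
    income, age = x
    return 1 if 8 * income - 2 * age > 300 else 0


def fit_stump(samples, outputs):
    """Find the one-feature threshold rule that best mimics the given outputs."""
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            predicted = [1 if s[feature] > threshold else 0 for s in samples]
            agreement = sum(p == o for p, o in zip(predicted, outputs)) / len(samples)
            if best is None or agreement > best[2]:
                best = (feature, threshold, agreement)
    return best


# Query the opaque model on a grid of hypothetical (income, age) inputs.
samples = [(income, age) for income in range(20, 81, 10) for age in range(20, 71, 10)]
outputs = [black_box(s) for s in samples]

feature, threshold, fidelity = fit_stump(samples, outputs)
names = ["income", "age"]
print(f"IF {names[feature]} > {threshold} THEN 1 ELSE 0 (fidelity {fidelity:.2f})")
```

The fidelity, i.e. the rate of agreement between the readable rule and the opaque model, tells the user how far the explanation can be trusted.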

4.1.8 - Use predictive modeling to impute missing values

The missing value problem is obviously well-known in AL. The methods used in AL are of three types.

The data may be absent in a natural way (for example, the illnesses specific to one sex will be missing from the records of the other sex). Then, the missing values are replaced by "non meaningful" and a specific treatment of the non-meaningful value is introduced in the algorithm. This solution is definitely the best in this case.

When the missing data are due to a lack of documentation, two solutions are used. The first one consists in introducing a coefficient that weakens the variable whose data are missing, as in C4.5, in order to decrease its contribution to the decision. The second one consists in filling in the missing values with the mean of the observed values. The mean can be taken over the whole set of examples, or over the examples of the same class.

The DM approach to this second case follows from the fact that DM does not suppose that the learning takes place in one step. The domain specialist and the programmer work together to optimize the results. Models created during the previous iteration, or existing models known to the experts, are used to compute the missing values. It is necessary to note, however, that the case of large amounts of missing data is not dealt with. When, for example, more than 80% of the values of a variable are not documented, there is no really efficient method to deal with such shortcomings.
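The difference between completing by a mean and completing by a model can be sketched on a toy example. The records, the field names and the choice of a least-squares line as the "model from a previous iteration" are assumptions of this sketch.

```python
from statistics import mean

# Hypothetical records; the last one has an undocumented weight.
records = [
    {"cls": "a", "height": 150.0, "weight": 50.0},
    {"cls": "a", "height": 160.0, "weight": 60.0},
    {"cls": "b", "height": 180.0, "weight": 80.0},
    {"cls": "b", "height": 190.0, "weight": 90.0},
    {"cls": "b", "height": 175.0, "weight": None},
]
known = [r for r in records if r["weight"] is not None]


def impute_global_mean(record):
    """Mean of the observed values over the whole set of examples."""
    return mean(r["weight"] for r in known)


def impute_class_mean(record):
    """Mean of the observed values over the examples of the same class."""
    return mean(r["weight"] for r in known if r["cls"] == record["cls"])


def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def impute_with_model(record):
    """Use a predictive model (here a line fitted on complete records)."""
    slope, intercept = fit_line([r["height"] for r in known],
                                [r["weight"] for r in known])
    return slope * record["height"] + intercept


incomplete = records[-1]
print(impute_global_mean(incomplete),
      impute_class_mean(incomplete),
      impute_with_model(incomplete))
```

The three strategies give three different values for the same record, which is why the choice is best made together with the domain specialist.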

4.1.9 - Build multiple models and form a ‘panel of experts’ predictive models

AL developed numerous approaches for simultaneously generating several models, in particular those including a vote among models. Eventually, one of the models wins. The notion of cooperation between experts is never used. Although this has not been studied much, agent models could play an important role in DM.

4.1.10 - Forget about traditional dated hygiene practices

I prefer not to comment on this assertion.

4.1.11 - Enrich your data with external data

AL, in principle, takes data as they are given: they are not obtained through a process with which interaction is possible. DM supposes that acquiring necessary new data is possible. Doing so can solve a problem or yield a solution otherwise impossible to find.

4.1.12 - Feed the models a better ‘balanced fuel mixture’ of data

This advice is similar enough to the one before, except that the model obtained at the previous iteration is also used to search for data that is better adapted to a future induction.

4.2 What Data Mining techniques do you use regularly?

When consulting Gregory Piatetsky-Shapiro's site, http://www.kdnuggets.com, it is noticeable that the tools really used in DM are not exactly those that had the greatest success among the AL community. In particular, categorization (clustering) tools are used as much as the entire set of the statistical tools (not including regression and nearest neighbors). Moreover, when categorization is considered no longer as a tool but as a problem, 22% of the respondents declare a need for these tools.

| Technique | Aug. 2001 | Oct. 2002 |
|---|---|---|
| Clustering | - | 12% (if ‘type of analysis’, then 22%) |
| Neural Networks | 13% | 9% |
| Decision Trees/Rules | 19% | 16% |
| Logistic Regression | 14% | 9% |
| Statistics | 17% | 12% |
| Bayesian nets | 6% | 3% |
| Visualization | 8% | 6% |
| Nearest Neighbor | - | 5% |
| Association Rules | 7% | 8% |
| Hybrid methods | 4% | 3% |
| Text Mining | 2% | 4% |
| Sequence Analysis | - | 3% |
| Genetic Algorithms | - | 3% |
| Naive Bayes | - | 2% |
| Web mining | 5% | 2% |
| Agents | 1% | - |
| Other | 4% | 3% |

Table 2. The DM tools in 2001 and 2002.

New methods have been adopted during the last year, such as sequence analysis, genetic algorithms, and text and web mining, which together account for 12% of the 2002 answers. The sudden appearance, at 5%, of such a nearly ancient method as Nearest Neighbors is all the more striking. In fact, it is extremely simple to implement, and its efficiency in precision has been noticed for years in the academic world. Nevertheless, no really clever changes can be made in its use, and it is thus not interesting to academics. Taking these figures into account, a decrease of 17% for the techniques listed in 2001 is to be expected. The slight increase of association detection is therefore even more noticeable. This method of automatic detection of uncertain patterns in data certainly answers an industrial need. To the best of my knowledge, it was never studied by AL before DM started identifying its interest. Bayesian networks lose some points, but apparently only because of the distinction now made between naive Bayes and the other Bayesian methods. Similarly, the category "statistics" loses 5%, which is probably the 5% of the Nearest Neighbors. Decision Trees decrease slightly, but not significantly.

Finally, one can say that the 2002 losers are:

- Logistic Regression, very much taught at University, and therefore probably over-valued by students who have gone to work in industry

- Neural Networks, probably because of the complete lack of understandability of their results, and their tendency to learn procedures that are not general enough.

- Support Vector Machines, not even cited by industry, in spite of their huge academic success. It will be interesting to check if this tendency is confirmed or not in the coming years.
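The percentages quoted in this discussion can be checked against Table 2 with a few lines of Python. The dictionary below is a transcription of the table made for this sketch, with None marking a technique absent from a poll, and the variable names are hypothetical.

```python
# (Aug. 2001 share, Oct. 2002 share) in percent, transcribed from Table 2.
table2 = {
    "Clustering": (None, 12),
    "Neural Networks": (13, 9),
    "Decision Trees/Rules": (19, 16),
    "Logistic Regression": (14, 9),
    "Statistics": (17, 12),
    "Bayesian nets": (6, 3),
    "Visualization": (8, 6),
    "Nearest Neighbor": (None, 5),
    "Association Rules": (7, 8),
    "Hybrid methods": (4, 3),
    "Text Mining": (2, 4),
    "Sequence Analysis": (None, 3),
    "Genetic Algorithms": (None, 3),
    "Naive Bayes": (None, 2),
    "Web mining": (5, 2),
    "Agents": (1, None),
    "Other": (4, 3),
}

# Each poll should add up to 100%.
total_2001 = sum(a for a, _ in table2.values() if a is not None)
total_2002 = sum(b for _, b in table2.values() if b is not None)

# The four methods named in the text as newly adopted, and their 2002 share.
newly_adopted = ["Sequence Analysis", "Genetic Algorithms", "Text Mining", "Web mining"]
new_share = sum(table2[name][1] for name in newly_adopted)

# Adding Nearest Neighbors gives the decrease expected for the 2001 techniques.
expected_decrease = new_share + table2["Nearest Neighbor"][1]

# Loss of the "Statistics" category between the two polls.
statistics_drop = table2["Statistics"][0] - table2["Statistics"][1]

print(total_2001, total_2002, new_share, expected_decrease, statistics_drop)
```

Under this reading of the table, both columns sum to 100, and the 12%, 17% and 5% figures of the text are recovered.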

5. Conclusion

The cause of the industrial acceptance of DM is easy to understand: the creators of this research topic took the problems of industry into account, while AL researchers are centered on scientific issues. AL researchers are certainly happy when they find an application, but they are not motivated by the application. As a testimony of the isolation of AL research from industrial applications, consider the thousands of academic AL papers that report a progress of a few tenths of a percent in precision, obtained by improving an already known method and applying it to ungrounded data.

An unexpected consequence of taking applications into account is that DM dares to attack problems known for being impossible to solve with certainty, that is to say, all unsupervised problems: categorization and segmentation, discovery of associations, time series, and construction of a Bayesian network structure from data. Even in the supervised case, DM also deals with badly defined problems: large quantities of missing values, very noisy data, and data with few examples (i.e., few records) and a large number of features (i.e., many fields). A striking example of this last problem, which currently attracts much attention from the DM community, is DNA chips. It is obvious that many models will fit this special kind of data. It is therefore hopeless to try to find the one true solution. The real goal is to decrease the failure rate in order to further ease the work of the human specialists.

Thus, DM is characterized by its audacity in challenging the problems as they are, not as they can be neatly solved.

References

Bendou M., Munteanu P. "Analyse de l'effet du bruit dans les algorithmes d'apprentissage des réseaux Bayésiens," Revue des sciences et technologies de l'information 17, (EGC-2003), pp. 411-422, 2003.

Benzecri J. P. L'analyse des données, Dunod, Paris, 1973.

Breiman L., Friedman J., Olshen R., Stone C. Classification and Regression Trees, Wadsworth International Group, 1984.

Brill E. "Some Advances in Transformation-Based Part of Speech Tagging," AAAI, 1:722-727, 1994.

Cornuéjols A., Miclet L. Apprentissage Artificiel, Eyrolles, Paris, 2002.

Dietterich T. G., Michalski R. S. "Inductive Learning of Structural Descriptions: Evaluation Criteria and Comparative Review of Selected Methods," Artificial Intelligence Journal 16, pp. 257-294, 1981.

Domingos P. "Occam's Two Razors: The Sharp and the Blunt," Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 37-43, 1998.

Fisher D. "Knowledge acquisition via incremental conceptual clustering," Machine Learning Journal 2, 139-172, 1987.

Heckerman D., Geiger D., Chickering D. "Learning Bayesian networks: The combination of knowledge and statistical data," Machine Learning Journal 20, 197-243, 1995.

Kohonen T. "The self-organizing map," Proc. IEEE 78, 1464-1480, 1990.

Liu H., Motoda H. Feature Selection, Kluwer Academic Publishers, Norwell, MA, 1998.

LeCun Y., Boser B., Denker J. S., Henderson D., Howard R. E., Hubbard W., Jackel L. D. "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.

Maron M. E. "Automatic indexing: An experimental inquiry," Journal of the Association for Computing Machinery, 8:404-417, 1961.

Michalski R. S., Chilausky R. L. "Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis," International Journal of Policy Analysis and Information Systems 4:125-160, 1980.

Piatetsky-Shapiro G., Frawley W. J. (Eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Menlo Park, CA, 1991.

Popper K. R. The Logic of Scientific Discovery, Harper and Row, NY, 1959.

Quinlan J. R. "Learning Efficient Classification Procedures and their Application to Chess End Games," in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, T. M. Mitchell (Eds.), Morgan Kaufmann, Los Altos, pp. 463-482, 1983.

Quinlan J. R. C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.

Rosenblatt F. "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review 65:386-408, 1958.

Vapnik V. The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
