
Choosing Data-Mining Methods for Multiple Classification: Representational and Performance Measurement Implications for Decision Support

WILLIAM E. SPANGLER, JERROLD H. MAY, AND LUIS G. VARGAS

WILLIAM E. SPANGLER is an Assistant Professor in the College of Business and Economics at West Virginia University. After several years in private industry, he earned his Ph.D. in 1995 from the Katz Graduate School of Business, University of Pittsburgh, specializing in artificial intelligence. His current research interests focus on data mining and computational modeling for decision support. His work has been published in various journals, including Information and Management, Interfaces, Expert Systems with Applications, and IEEE Transactions on Knowledge and Data Engineering.

JERROLD H. MAY is a Professor of Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and is also the Director of the Artificial Intelligence in Management (AIM) Laboratory there. He has more than sixty refereed publications in a variety of outlets, ranging from management journals such as Operations Research and Information Systems Research to medical ones such as Anesthesiology and Journal of the American Medical Informatics Association. Professor May's current work focuses on modeling, planning, and control problems, the solutions to which combine management science, statistical analysis, and artificial intelligence, particularly for operational tasks in health-related applications.

LUIS G. VARGAS is a Professor of Decision Sciences and Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and Co-Director of the AIM Laboratory. He has published over forty publications in refereed journals such as Management Science, Operations Research, Anesthesiology, and Journal of the American Medical Informatics Association, and three books on applications of the Analytic Hierarchy Process with Thomas L. Saaty. Professor Vargas’s current work focuses on the use of operations research and artificial intelligence methods in health care environments.

Journal of Management Information Systems / Summer 1999, Vol. 16, No. 1, pp. 31-62. © 1999 M.E. Sharpe, Inc. 0742-1222 / 1999 $9.50 + 0.00.

ABSTRACT: Data-mining techniques are designed for classification problems in which each observation is a member of one and only one category. We formulate ten data representations that could be used to extend those methods to problems in which observations may be full members of multiple categories. We propose an audit matrix methodology for evaluating the performance of three popular data-mining techniques (linear discriminant analysis, neural networks, and decision tree induction)


using the representations that each technique can accommodate. We then empirically test our approach on an actual surgical data set. Tree induction gives the lowest rate of false positive predictions, and a version of discriminant analysis yields the lowest rate of false negatives for multiple category problems, but neural networks give the best overall results for the largest multiple classification cases. There is substantial room for improvement in overall performance for all techniques.

KEY WORDS AND PHRASES: data mining, decision support systems, decision tree induction, neural networks, statistical classification.

DATA MINING IS THE SEARCH THROUGH REAL-WORLD DATA for general patterns that are useful in classifying individual observations and in making reasoned predictions about outcomes [11]. That generally entails the use of statistical methods to link a series of independent variables that collectively describe a case or observation to the dependent variable(s) that classifies the case. The set of classification problems typically includes patterns containing a single, well-defined dependent variable or category; that is, an observation is assigned to one and only one category. This research explores the less tractable problems of multiple classification in data mining, wherein a single observation may be classified into more than one category. Because multiple classification is a significant aspect of numerous managerial tasks (including, among others, diagnosis, auditing, and scheduling), understanding the effectiveness of data-mining methods in multiple classification situations has important implications for the use of information systems for knowledge-based decision support.

By multiple classification, we mean that the categories are well defined and mutually exclusive, but the observations themselves transcend categorical boundaries. This contrasts with fuzzy clustering, in which the categories themselves are not necessarily either well defined or mutually exclusive. Consider, for example, the universe of categories that includes men, researchers, and jazz singers. In this situation, a single person can belong to any combination of these categories simultaneously, in effect having multiple membership across categories. Classifying someone in such a context requires recognizing the potential for multiple membership, while also identifying the correct categories themselves. Figures 1 and 2 show two alternative ways of pictorially representing multiple classification problems. In figure 1, the categories are shown as distinct and mutually exclusive, with individual cases or observations transcending categories. Figure 2 shows the categories in a type of Venn diagram, with multiply classified cases appearing in the intersections between categories. In contrast to the situation we consider, when the categories themselves are poorly defined or understood, classification of observations within those categories may be correspondingly uncertain. Fuzzy clustering assigns observations with likelihoods to multiple categories, where the likelihoods sum to one.

Multiple Classification Problems: An Example

DATA MINING IN A DIAGNOSTIC SETTING IS THE SEARCH for patterns linking state descriptions with associated outcomes, with the objective of predicting an outcome


Figure 1. Observations Classified into Multiple Categories: Observation O1 is in Category C1, O2 is in C1 and C2, while O3 is in C1, C2, and C3.

Figure 2. A Multiple Classification Domain Pictured as a Venn Diagram, with Multiply Classified Observations Appearing in the Intersections

given new data showing a similar pattern. It can be characterized as an attempt to extract knowledge from data. In the medical domain, numerical codes are used to describe both what is wrong with a patient and what was done to treat the patient. The dominant patient state taxonomy is the International Classification of Diseases (ICD-9) coding system, which indicates a patient's disease or condition within a hierarchical, numerical classification scheme. The corresponding procedural taxonomy is the Current Procedural Terminology (CPT) system, which indicates the procedure(s) performed on a patient, partly in response to the ICD-9 code(s) assigned to the patient. Our empirical results are derived from 59,864 patient records, each of which includes the patient's diagnoses (ICD-9s), the procedures performed on the patient (CPTs), and patient demographic and case-specific information.


The data-mining task is to find patterns linking patient ICD-9 and demographic information to CPT outcomes. The task is important for surgical scheduling because patient diagnoses and demographics are known before the patient enters the operating room, but the procedures that will be performed there often are not known with certainty. The surgery performed is a function of the information discovered by the physicians during the course of the operation. Data mining in this situation is a multiple-classification task because several procedures may be performed on a patient during a single operation. The patient records to which we had access contain as many as three CPTs each. The ICD-9(s) and patient demographics are the independent variables in the analysis. The surgical procedures performed on the patient (CPTs) are the dependent variables. Each set of independent variables is linked to one, two, or three CPTs.

The identification of the procedures to be performed on a patient is important for a number of managerial tasks in this domain, including medical auditing, operating room scheduling, and materials management. If a surgical scheduler knew in advance the most likely sets of procedures and had models for estimating times for sets of procedures, the scheduler could plan the operating room schedule more effectively. The problem for a data-mining method in this multiple classification domain is twofold: First, the method must be able to identify the proper number of CPTs associated with a specific pattern. That is, if a set of patient factors is normally associated with two CPTs, the method should construct a pattern linking the factors to two and only two CPTs. Second, the method should identify the specific CPTs. Scoring the performance of a data-mining tool requires consideration of both of these aspects of the problem.

Comparison of Data-Mining Methods

THE EXAMPLE ABOVE HINTS AT THE CHARACTERISTICS of the multiple classification problem that make it interesting as well as difficult. Our goal is to find and to propose solutions to the following associated problems of multiple classification:

Problem representation: How should a decision maker structure a model both to recognize and to identify multiple classes? Should the dependent variables be treated individually, as a series of yes/no questions related to the presence or absence of each variable, even when they occur as a group in a single record? Alternatively, should a group of dependent variables be treated as a separate entity, distinct from the individual variables that comprise the group?

Performance measurement: In multiple classification, there is potential for both false positives (i.e., assigning an observation to an incorrect class) and false negatives (i.e., not assigning an observation to a correct class). A decision maker requires a strategy for scoring the performance of various data-mining methods, based on the relative number of both types of errors and their associated costs.

Those research issues could be investigated using mathematical arguments or numerically. While a mathematical comparison would be the most definitive one, we are not aware of a methodology that would permit it to be done. Numerical research


Table 1. Comparison of Data-Mining Methods

                          Decision tree induction        Neural networks                  Discriminant analysis
Type of method            Logic-based                    Math-based                       Math-based
Learning approach         Supervised                     Supervised                       Supervised
Linearity                 Linear                         Nonlinear                        Linear
Representational scheme   Set of decision nodes and      Functional relationship          Functional relationship
                          branches; production system    between attributes and classes   between attributes and classes

can be done with artificially generated data or with real data. Conclusions based on empirical research are useful if the characteristics of the samples on which they are based are sufficiently similar to those of problem instances others are likely to encounter in practice. We preferred using a real data set to the use of generated data because it is representative of an important class of managerial problems, and because the noise in the data set provides us with information on the sensitivity of the approaches to "dirt" in a data set. Artificial data would have allowed us to carefully control the population parameters from which the data are drawn, but would have required that we first define and estimate values for all such critical population parameters.

Using our real data set, we empirically compare the performance of tree and rule induction (TRI), artificial neural networks (ANN), and linear discriminant analysis (LDA) in modeling the multiple classification patterns in our data set. We chose the three methods because each is a popular data-mining method sharing a number of common characteristics while also exhibiting some notable differences (see Table 1). Weiss and Indurkhya divide data-mining algorithms into three groups: math-based methods, distance-based methods, and logic-based methods [29]. LDA is the most common math-based method, as well as the most common classification technique in general, while TRI is the most common logic-based method. Neural networks are an increasingly popular nonlinear math-based method. Tools employing these methods are commonly available in commercial computer-based applications.

All three methods are supervised learning techniques. That is, they induce rules for assigning observations to predefined classes from a set of examples, as opposed to unsupervised techniques, which both define classes and determine classification rules [20, 25]. Supervised learning techniques are appropriate for our decision problems because the classes (CPT codes) are defined exogenously and cannot be modified by the decision maker. Each of the methods we compare engages in discrete classification through a process of selection and combination of case attributes, and each employs similar validation techniques, described below. Cluster analysis and knowledge discovery methods are examples of unsupervised learning algorithms.

The methods also differ, particularly in the way they model the relationships among


attributes and classes. The classification structures of LDA and ANN are expressed mathematically as a functional relationship between weighted attributes and resulting classes. TRI represents relationships as a set of decision nodes and branches, which, in turn, can be represented as a production system, or set of rules. LDA and TRI are linear approaches; ANN is nonlinear. The effectiveness of the representation generally depends on the orientation and mathematical sophistication of the user.

Liang argues that, because the choice of a learning method for an application is an important problem, research into the comparison of alternative (or perhaps even complementary) methods is likewise important [19]. That is especially true for data mining, where the costs and potential benefits involved strongly motivate the proper choice of tool and method as well as the proper analysis of the results.

Tree and Rule Induction

TRI is attractive because its explicit representation of classification as a series of binary splits makes the induced knowledge structure easy to understand and validate. TRI constructs a tree, but the tree can be translated into an equivalent set of rules. We used Quinlan’s See5 package, the most recent version of his ID3 algorithm [22]. ID3 induces a decision tree from a table of individual cases, each of which describes identified attributes as well as the class to which the case belongs. At each node, the algorithm builds the tree by assessing the conditional probabilities linking attributes and outcomes, and divides the subset of cases under consideration into two further subsets so as to minimize entropy, a measure of the information content of the data. The user specifies parameters that control the stopping behavior of the method.
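The sketch below illustrates, in Python, the entropy criterion described above. It is a minimal, illustrative reconstruction (an ID3-style multiway split over hypothetical dictionary-valued cases), not the See5 implementation, which uses binary splits and additional pruning machinery.

    import math
    from collections import Counter

    def entropy(labels):
        # Information content, in bits, of a list of class labels.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def split_entropy(cases, attribute, labels):
        # Weighted average entropy of the subsets produced by splitting on one attribute.
        n = len(cases)
        subsets = {}
        for case, label in zip(cases, labels):
            subsets.setdefault(case[attribute], []).append(label)
        return sum(len(sub) / n * entropy(sub) for sub in subsets.values())

    def best_attribute(cases, attributes, labels):
        # The attribute chosen at a node is the one whose split minimizes entropy
        # (equivalently, maximizes information gain).
        return min(attributes, key=lambda a: split_entropy(cases, a, labels))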

If the training set contains no contradictory cases (that is, cases with identical attributes that are members of different classes), a fully grown tree will produce an error rate of zero on the training set. Weiss and Kulikowski [31] show that as a tree becomes more complex, measured by the number of decision nodes it contains, the danger of overfitting the data increases, and the predictive power of the tree declines commensurately. That is, the true predictive error rate, measured by the performance of the tree on test cases, becomes much higher than the apparent error rate reflected in the performance of the tree against the training cases alone. To minimize the true error rate, See5 first grows the tree completely, and then prunes it based on a prespecified certainty factor at each node.

Performance evaluation of classification methodologies is discussed in the statistical literature (for example, see [15], ch. 11). The two most common approaches are dividing the data set into training and holdout subsets before estimating the model, and jackknifing. The former avoids the bias of using the same information for both creating and judging the model. However, it requires large data sets, and there is no simple way to provide a definitive rule for determining either the size or composition of the two subsets. Worse, separating a holdout sample results in the creation of a model that is not the desired one, because the removal of the holdout sample reduces the information content of the training set and may exclude cases that are critical to the estimation of an accurate or robust model. The alternative to separating the data into two groups is


jackknifing, a one-at-a-time holdout procedure due to Lachenbruch [18]. Jackknifing temporarily ignores the first observation, estimates the model using observations two through n, classifies the first, held-out observation, and notes whether it was correctly or incorrectly classified. It then puts the first observation back into the data set, ignores the second, and repeats the process. Repeating the procedure n times, jackknifing creates n different models and tallies up overall modeling performance based on the behavior of each of those models on a single omitted observation. Omitting only a single observation minimizes the loss of information to which the modeling process is exposed, but it can require a lot of computer time and does not produce a single model as its result. How do you combine n potentially very different models, and how do you interpret the evaluation when each holdout sample was tested on a different model? The TRI software package See5 includes k-fold cross-validation, an approach less extreme than either a fixed holdout sample or jackknifing. K-fold cross-validation divides the data set into k equal-sized partitions, ignores them one at a time, estimates a model, and computes its error rate on the ignored partition.
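A minimal sketch of k-fold cross-validation as just described; `fit` and `error_rate` are hypothetical stand-ins for whatever method is being evaluated:

    def k_fold_error(cases, k, fit, error_rate):
        n = len(cases)
        fold_errors = []
        for i in range(k):
            lo, hi = i * n // k, (i + 1) * n // k
            holdout = cases[lo:hi]              # the partition ignored on this pass
            training = cases[:lo] + cases[hi:]  # estimate a model without it
            model = fit(training)
            fold_errors.append(error_rate(model, holdout))
        return sum(fold_errors) / k             # average error over the k folds

Jackknifing is the special case k = n, in which each ignored partition is a single observation.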

Neural Networks

Artificial neural networks simulate human cognition by modeling the inherent parallelism of neural circuits found in the brain using mathematical models of how the circuits function. The models typically are composed of a layer of input nodes (independent variables), one or more layers of intermediate (or hidden) nodes, and a layer of output nodes (dependent variables). Nodes in a layer are each connected by one-way arcs to nodes in subsequent layers, and signals are sent over those arcs. "Behavior" propagates from values set in the input nodes, sent over arcs through the hidden layer(s), and results in the establishment of values in the output layer. The value of a node is a nonlinear, usually logistic, function of the weighted sum of the values sent to it by nodes that are connected to it. A node forwards a signal to a subsequent node only if it exceeds a threshold value. An ANN model is specified by defining the number of layers it has, the number of nodes in each layer, the way in which the nodes are connected, and the nonlinear function used to compute node values. Estimation of the specified model involves determining the best set of weights for the arcs and threshold values for the nodes.

An ANN is trained (that is, its parameters are estimated) using nonlinear optimization. In the backpropagation algorithm, the first-order gradient descent method used in the BrainMaker software we employed, the network propagates inputs through the network, derives a set of output values, compares the computed output to the provided (corresponding) output, and calculates the difference between the two numbers (i.e., the error). If a difference exists, the algorithm proceeds backward through the hidden layer(s) to the input layer, adjusting the weights between connections based on their gradients to reduce the sum of squared output errors. The algorithm stops when the total error is acceptably small.
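The following is a minimal numpy sketch of one backpropagation update for a single-hidden-layer network with logistic activations; it illustrates the gradient-descent step described above and is not the BrainMaker implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, target, W1, W2, lr=0.1):
        # Forward pass: propagate the inputs through the network.
        h = sigmoid(W1 @ x)            # hidden-layer activations
        y = sigmoid(W2 @ h)            # output activations (degrees of membership)
        err = y - target               # computed output minus provided output
        # Backward pass: adjust weights along their gradients to reduce
        # the sum of squared output errors.
        delta_out = err * y * (1.0 - y)
        delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, x)
        return 0.5 * float(err @ err)  # training stops when this is acceptably small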

Neural networks are frequently used in data mining because, in adjusting the number of layers, nodes, and connections, the user can make an ANN model almost any smooth


mathematical function. While inputs to an ANN might be integer or discrete, the weighted nonlinear transformations of the inputs as they are fed forward through the network result in continuous output level values. Continuous output levels result in a more tractable error measure for the backpropagation algorithm to optimize and also permit the interpretation of outputs as partial group membership. Partial group membership means an ANN is capable of representing inexact matching, if that is the way to find a best fit for some set of input data. It also can model classification tasks that are inherently "fuzzy," that is, tasks that generally are simple for humans but traditionally difficult for computers.

Because of their flexibility, ANNs may be difficult to specify. Adding too much structure to an ANN makes it prone to overfitting, but too little structure may prevent it from capturing the patterns in the data set. Those patterns are represented in the arc (connection) weights and the node thresholds, a form that is not transparent to humans. Computationally, if the training set is large, backpropagation and related algorithms may require a lot of time. ANNs may be good classifiers and predictors as compared with linear methods, but the mathematical representations of the various nodes, and the relative importance of the independent variables, tend to be somewhat less accessible to the end user than induced decision trees, rules, and even classification functions. Neural networks often are treated as a black box, with only the inputs and outputs visible to the decision maker. The classification chosen by the ANN is easily visible to the user, but the decision process that led to that classification is not.

Linear Discriminant Analysis

Linear discriminant analysis (LDA) is the most common classification method in use, and also one of the oldest [13], having been developed by Fisher in the 1930s. Because of its popularity and long history, we provide only a brief overview of the method here.

Like TRI and ANN, LDA partitions a data set into two or more classes, within which new observations or cases can then be assigned. Because it uses linear functions of the independent variables to define those partitions, LDA is similar to multiple regression. The primary distinction between LDA and multiple regression lies in the form of the dependent variable. Multiple regression uses a continuous dependent variable. The dependent variable in LDA is ordinal or nominal.

For a data set of cases, each with m attributes and n categories, LDA constructs classification functions of the form

$c_1 a_1 + c_2 a_2 + \cdots + c_m a_m + c_0$,

where $c_i$ is the coefficient for the case attribute $a_i$ and $c_0$ is a constant, for each of the n categories. An observation is assigned to the class for which it has the highest classification function value.
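As a concrete sketch of that assignment rule (with hypothetical coefficient values; in this study they were estimated by Statgraphics):

    def classify(attributes, functions):
        # functions maps each class to (coefficients [c1..cm], constant c0);
        # the observation goes to the class with the highest function value.
        def score(cls):
            coeffs, const = functions[cls]
            return sum(c * a for c, a in zip(coeffs, attributes)) + const
        return max(functions, key=score)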


Table 2. Comparative Data-Mining Method Studies Across Domains (methods compared: neural networks, tree/rule induction, and regression; domains studied: asset writedowns, bankruptcy, bank failure, inventory accounting, lending/credit risk, corporate acquisitions, corporate earnings, management fraud, and mortgage choice)

Studies of Supervised Learning Approaches

PREVIOUS RESEARCH HAS INVESTIGATED AND COMPARED SUPERVISED, inductive learning techniques in a number of domains (see Table 2), with mixed results. Some comparative studies suggest the superiority of neural networks over other techniques. For example, in bankruptcy prediction, Tam and Kiang [27] found that neural networks performed better than discriminant analysis, logit analysis, k-nearest neighbor, and tree induction (ID3). Fanning and Cogger [8] were somewhat more tentative in comparing neural network models with both logistic regression and existing bankruptcy models. Although they found no particular technique to be superior across all comparisons, they argued that neural nets were "competitive with, and often superior to" the logit and bankruptcy model results. In a subsequent study of fraudulent financial statements, Fanning and Cogger [9] reported that a neural network was better able than the traditional statistical methods to identify management fraud.

By contrast, previous comparative studies had shown decision tree/rule induction to be superior to other methods. Messier and Hansen [21], for example, compared decision tree/rule induction with discriminant analysis, as well as with individual and group judgments. On the basis of the attributes selected and the percentage of correct predictions, they concluded that the induction technique outperformed the other approaches in the prediction of bankruptcies. Weiss and Kapouleas [30] compared statistical pattern recognition (linear and quadratic discriminant analysis, nearest neighbor, and Bayesian classification), neural networks, and machine learning methods (rule/decision tree induction methods: ID3/C4.5 and Predictive Value Maximization). They concluded that the rule induction methods were superior to the other methods with respect to accuracy of classification, training time, and compatibility with human reasoning.

Other multiple method studies have been less conclusive and suggest that performance is dependent on other factors such as the type of task and the nature of the data


set. Chung and Tam [6], for example, compared three inductive-learning models across five managerial tasks (in construction project assessment and bankruptcy prediction). They concluded that model performance generally was task-dependent, although neural networks tended to produce relatively consistent predictions across task domains. In assessing LIFO/FIFO classification methods, Liang et al. [20] reported that neural networks tended to perform best overall in holdout tests, and when the data contained dominant nominal variables. However, when nominal variables were not dominant, probit provided better performance. Sen and Gibbs [24] studied corporate takeover models, comparing six neural network models and logistic regression. They found little difference in predictive performance among them, indicating that they all performed poorly. Boritz et al. [2] tested the performance of neural networks with several regression techniques, as well as with well-known bankruptcy models. No approach was clearly superior, and the ability of an induced model to distinguish between bankrupt and nonbankrupt firms was dependent on the number of bankrupt firms in the training set.

Bases for Judging Performance

JUDGING THE PERFORMANCE OF ONE DATA-MINING METHOD over another requires consideration of several modeling objectives.

Predictive Accuracy

Most of the comparative studies we cited above measured the predictive accuracy and error rate of each method. Messier and Hansen [21], for example, compared the percentage of correct classifications produced by their induced rule system to the percentage drawn from discriminant analysis, as well as individual and group judgments. As suggested by the review above, it is difficult to make general claims about the relative predictive accuracy of the various methods. Performance is highly dependent on the domain and setting, the size and nature of the data set, the presence of noise and outliers in the data, and the validation technique(s) used. Predictive accuracy tends to be an important and prevalent indication of a method's performance, but others also are important.

Comprehensibility

Henery [13] uses this term to indicate the need for a classification method to provide clearly understood and justifiable decision support to a human manager or operator. TRI systems, because they explicitly structure the reasoning underlying the classification process, tend to have an inherent advantage over both traditional statistical classification models and ANN. Tessmer et al. [28] argue that, while the traditional statistical methods provide efficient predictive accuracy, "they do not provide an explicit description of the classification process." Weiss and Kulikowski [31] suggest that any explanation resident in mathematical inferencing techniques is buried in


computations that are inaccessible to the “mathematically uninclined.” The results of such techniques might be misunderstood and misused. Rules and decision trees, on the other hand, are more compatible with human reasoning and explanations.

Speed of Training and Classification

Speed can be an important consideration in some situations [31]. Henery [13] suggests that a number of real-time applications, for example, must sacrifice some accuracy in order to classify and process items in a timely fashion. Again, because of situational dependencies, it is difficult to make generalizations about the computational expense of each method. ANNs estimated using backpropagation may require an unacceptably large amount of time [31].

Modeling and Simulation of Human Decision Behavior

Using case descriptions and human judgments as input, data-mining methods also can be used for the automated modeling and acquisition of expert knowledge. Kim et al. [16] determined that the performance of a particular method in constructing an inductive model of a human decision strategy is dependent, in part, on conformance of the model with the strategy. Linear models tend to simulate linear (or compensatory) decision strategies more accurately, while nonlinear models are more appropriate for nonlinear (or noncompensatory) strategies. Kim et al. found ANN to be superior to decision tree induction (ID3), logistic regression, and discriminant analysis, even in simulations of linear decision processes. They note that the flexibility of neural networks in forming both linear and nonlinear decision models contributes to their superior performance relative to the other methods.

Selection of Attributes

The attributes selected for consideration and their relative influence on the outcome are an indication of the performance of a method. The concept of diagnostic validity of induction methods was proposed by Currim et al. [7], and was used by Messier and Hansen [21] to compare the attributes selected by each of their induction methods.

Data Set and Tools

OUR DATA SET IS A CENSUS OF 59,864 CASES OF SURGERY PERFORMED between 1989 and 1995 at a large university teaching hospital (see [1] for a description of the computerized collection system). Each case is represented as a single record containing twenty-three attributes describing patient demographic and case-specific information. Of the twenty-three factors available, the following were chosen for analysis: diagnoses (one, two, or three ICD-9 codes), procedures (one, two, or three CPT codes), type of anesthesia (general, monitor, regional), patient demographic information (age, sex), the patient's overall condition (based on the six-value ASA ordinal coding


scheme), emergency/nonemergency status, inpatient/outpatient status, and the surgeon who performed the procedure (identified by number). The remaining fields are the time durations of surgical events.

We chose 819 records dealing with ICD-9 code 180.9, "malignant neoplasm of the cervix, unspecified," because there is a fairly large fanout from it to the associated CPTs across the records, presenting a challenge to any classification method. ICD-9 180.9 is associated with 139 different CPTs, although 107 of the CPTs appear in four or fewer records. Because the presence of outliers impedes the detection of general patterns, we followed the standard data-mining approach of removing them. Of the 819 records containing ICD-9 180.9, 160 records contained one of the 107 CPTs. Those records were judged to be outliers and were removed, leaving 659 records linked to a total of 32 CPTs remaining in the data set. Table 3 provides a detailed description of each of the 32 remaining CPTs.

We used commercial software instead of programming the methods ourselves, to eliminate possible bias caused by our own computer skills. We used Statgraphics version 3.1 for LDA, BrainMaker version 3.1 for ANN, and See5 version 1.05 for TRI.

Methodology

WE IDENTIFIED TEN DISTINCT WAYS OF REPRESENTING THE MULTIPLE CLASSIFICATION problem. As shown in Table 4, not all methods are capable of estimating parameters for each of the representations. Our strategy was to evaluate each method from a decision support perspective. That is, how does a method fundamentally constrain the types of representation that can (and should) be employed by a person using the method?

Discriminant Analysis

Six LDA models were constructed, three basic representations with two variations on each. The three basic representations (multiple, replicated, and binary) differ in their treatment of the dependent variables; recall that each case can be a member of one, two, or three classes. For each basic representation, two variations on the treatment of prior probabilities were included: (1) prior probabilities for each group are assumed to be equal, and (2) prior probabilities for each group are assumed to be proportional to the number of observations in each group. The basic representations and variations are described below.

Dependent Variables Represented as Multiple Values (LDAMult)

The dependent variable is a string with all CPTs, space delimited. For example, the dependent variable in a record containing only CPT 58210 is represented as "58210." A record containing CPTs 58210, 77760, and 77770 has "58210 77760 77770" as its dependent variable. Because one dependent variable is used for all CPT codes present, a single linear discriminant analysis could be performed for each of the two variations:


Table 3. Top 32 CPTs and Their Descriptions

CPT     Frequency   Description
36489       5       Placement of central venous catheter (subclavian, jugular, or other vein), e.g., for central venous pressure, hyperalimentation, hemodialysis, or chemotherapy; percutaneous, over age 2
38500       8       Biopsy or excision of lymph nodes, superficial (separate procedure)
38562      17       Limited lymphadenectomy for staging (separate procedure), pelvic and para-aortic
38564       8       Limited lymphadenectomy for staging (separate procedure), retroperitoneal (aortic and/or splenic)
38780       8       Retroperitoneal transabdominal lymphadenectomy, extensive, including pelvic, aortic, and renal nodes (separate procedure)
44120       6       Enterectomy, resection of small intestine, single resection and anastomosis
45300      16       Proctosigmoidoscopy, rigid, diagnostic, with or without collection of specimens by brushing or washing (separate procedure)
47600       7       Cholecystectomy
49000      34       Exploratory laparotomy, exploratory celiotomy with or without biopsy(s) (separate procedure)
49010       7       Exploration, retroperitoneal area with or without biopsy(s) (separate procedure)
52000      34       Cystourethroscopy (separate procedure)
52204       9       Cystourethroscopy, with biopsy
52332       8       Cystourethroscopy, with insertion of indwelling ureteral stent (e.g., Gibbons or double-J type)
56311       5       Laparoscopy, surgical, with retroperitoneal lymph node sampling (biopsy), single or multiple
57100      13       Biopsy of vaginal mucosa, simple (separate procedure)
57410      27       Pelvic examination under anesthesia
57500      34       Biopsy, single or multiple, or local excision of lesion, with or without fulguration (separate procedure)
57513       5       Cauterization of cervix, laser ablation
57520      32       Conization of cervix, with or without fulguration, with or without dilation and curettage, with or without repair; cold knife or laser
58150      41       Total abdominal hysterectomy (corpus and cervix), with or without removal of tubes, with or without removal of ovaries
58200      47       Total abdominal hysterectomy, including partial vaginectomy, with para-aortic and pelvic lymph node sampling, with or without removal of tubes, with or without removal of ovaries

equal prior probabilities (LDAMultE) and prior probabilities proportional to the sample (LDAMultP). The advantage of LDAMult is that each observation is represented exactly once. The disadvantage is its inability to represent class intersections. An observation that is a member of both categories a and b (i.e., dependent variable = "a b") is considered to be completely separate from observations that are members of only either a or b.


Table 3. Continued

CPT     Frequency   Description
58210     168       Radical abdominal hysterectomy, with bilateral total pelvic lymphadenectomy and para-aortic lymph node sampling (biopsy), with or without removal of tubes, with or without removal of ovaries
58240      10       Pelvic exenteration for gynecologic malignancy, with total abdominal hysterectomy or cervicectomy, with or without removal of tubes, with or without removal of ovaries, with removal of bladder and ureteral transplantations, and/or abdominoperineal resection of rectum and colon and colostomy, or any combination thereof
58260       6       Vaginal hysterectomy
58720       6       Salpingo-oophorectomy, complete or partial, unilateral or bilateral (separate procedure)
58960       6       Laparotomy, for staging or restaging of ovarian malignancy (second look), with or without omentectomy, peritoneal washing, biopsy of abdominal and pelvic peritoneum, diaphragmatic assessment with pelvic and limited para-aortic lymphadenectomy
58999      12       Unlisted procedure, female genital system (nonobstetrical)
77761      46       Intracavitary radioelement application, simple
77762     106       Intracavitary radioelement application, intermediate
77763      12       Intracavitary radioelement application, complex
77777       6       Interstitial radioelement application, intermediate
77778      11       Interstitial radioelement application, complex

Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

A single record containing multiple values for the dependent variable is decomposed into multiple records, one for each value of the dependent variable. A record containing CPTs 58210, 77760, and 77770 is represented three times in the data set, once with "58210" as the dependent variable, once with "77760," and the third with "77770"; all three records have the same independent variable values (see figure 3).
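A minimal sketch of that decomposition (field names are illustrative):

    def replicate(record):
        # One record per CPT; the independent variables are repeated verbatim.
        return [dict(record, cpt=c) for c in record["cpt"].split()]

    replicate({"age": 45, "asa": 2, "cpt": "58210 77760 77770"})
    # -> three records, identical except cpt = "58210", "77760", "77770"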

In this representation, because only the relative sizes of the classification function values are meaningful, a single-step estimation process provides a rank order for membership in each of the categories but does not provide any insight regarding the number of categories into which the observation is to be classified (unlike neural nets or logistic regression, for which 0.5 is a commonly accepted threshold). Therefore, a two-step process is required: First, use an LDA model to estimate the number of categories to which the observation belongs, and then use a separate LDA to determine what those categories are. LDARepE is that two-stage process with equal prior probabilities for both parts of the process, and LDARepP is the corresponding technique using proportional probabilities.

The advantage of LDARep is that it recognizes an observation that is a member of


Table 4. Model Representations Across Methods

Representation   Tree/rule induction   Neural networks                          Discriminant analysis
Multiple         See5                  --                                       Equal; Proportional
Replicated       --                    --                                       Equal; Proportional
Binary           --                    No hidden layer; one hidden layer,       Equal; Proportional
                                       57 nodes; one hidden layer, 114 nodes

Figure 3. Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

Figure 4. Dependent Variables Represented as Single Binary Values (LDABin)

more than one class as being in each of those classes individually. The disadvantage is that the representation does not differentiate a single observation that is simultaneously in multiple classes, for which the replication of its independent variable values is a representational necessity, from multiple observations with identical independent variables that are members of different classes. That is, a set intersection and a contradiction have the same representation.

Dependent Variables Represented as Single Binary Values (LDABin)

The dependent variable is represented as a series of binary values, one for each possible value of the dependent variable (see figure 4). An observation is considered a member of a category if its classification function value for assigning membership is larger than its classification function value for not assigning membership. This representation requires a separate LDA model for each class, thirty-two in the case of our data set. LDABinE is this approach with equal probabilities, and LDABinP has proportional probabilities.


The advantages of this approach are that subset relationships are preserved and that each observation occurs only once in the data set, so that intersections are represented differently from contradictions. Its disadvantage is that an individual observation might be a member of no classes or too many classes (for our data, more than three).
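A minimal sketch of the LDABin encoding (here `all_cpts` would be the thirty-two codes of Table 3):

    def to_binary(record, all_cpts):
        # One 0/1 dependent variable per possible CPT, so each observation
        # appears once and intersections differ from contradictions.
        present = set(record["cpt"].split())
        return {c: int(c in present) for c in all_cpts}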

Neural Networks

The ANN representation of the dependent variable is the same as in LDABin, because of the way values are propagated through the network to the output nodes. The variations in the model are functions of the structure of the hidden layer(s). One hidden layer is all that needs to be considered, if structure between the input and output layers is desired, but the number of nodes in it is a matter of choice [31]. With a hidden layer and a sigmoid function for combining activations, the ANN performs logistic regression. Deferring to commercial software, we allowed BrainMaker to suggest the size of a hidden layer. With 25 input nodes and 32 output nodes, it recommended 57, their sum. We also considered a network with twice that many nodes in the hidden layer, a structure that might tend to overfit the data. The modeling alternatives are: (1) a neural network with no hidden layer (NN0), (2) a network with 57 nodes in the hidden layer (NN57), and (3) a network with 114 nodes in the hidden layer (NN114). Activation at an output node, interpreted as degree of membership, ranges from zero to one. An observation is considered a member of any group for which it generates an activation value above 0.5, and is considered not a member of any group for which it generates an activation level below 0.5.
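The membership rule just described amounts to thresholding the output activations; a minimal sketch:

    def predicted_classes(activations, classes, threshold=0.5):
        # An observation is assigned to every class whose activation exceeds 0.5.
        return {c for c, a in zip(classes, activations) if a > threshold}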

All three ANN models have the same advantages and disadvantages as LDABin. The ANN model representation is preferable to that of LDABin because LDABin requires, in our case, thirty-two separate binary models, and ANN simultaneously models all thirty-two binary alternatives in a single model.

Decision Tree/Rule Induction

We used a single TRI representation, analogous to LDAMult. Multiple dependent variables in each record were represented collectively within a single string ("a b c"). A representation similar to that of LDARep is not possible because See5 trees do not rank-order the classification alternatives. See5 does allow for differential misclassification costs, but they are not capable of representing equal and proportional prior probabilities in a way equivalent to that in LDA. The advantages and disadvantages of LDAMult apply to our See5 representation. The model has the advantage of constraining possible categories to between 1 and 3, inclusive, and the disadvantage of not recognizing the intersections of classes.

Results and Analysis

COMPARISON OF THE METHODS REQUIRES THE CONSIDERATION of a number of issues related to the measurement of performance in multiple classification problems. For


example, consider a single test case having two values for the dependent variable ("1 2"). If the method predicts the value "1 2," there is little doubt that the method has performed without error. If the method predicts the value "3," it is incorrect for three reasons. First, it failed to recognize the case as having multiple classes. Second, it failed to include either "1" or "2" in its predicted value for the dependent variable. Third, it included an incorrect value (i.e., "3") in its prediction. If the method predicts "1 3," it has identified the correct number of classes (2), while also correctly identifying one of the values but incorrectly identifying the other. If the method predicts "1 2 3," it has identified both of the correct classes, but it also has predicted the wrong number of classes, and in doing so has included a class that is incorrect. If it predicts "1," it has predicted the wrong number of classes. However, the class it has predicted is correct, and it refrained from predicting any incorrect classes.

The above list of error alternatives argues for what we call an audit matrix within which multiple classification results can be judged. We use the term "audit" to denote a situation assessment, performed by an auditor or decision maker, which attempts to reconcile the observed characteristics of a situation with a priori expectations of that situation (i.e., "actual" versus "predicted" characteristics; see figure 5). That is, an auditor, when initially encountering a situation, will expect to observe certain characteristics while also expecting not to observe others. Subsequently, if expected characteristics are observed and unexpected characteristics are not observed, the situation matches expectations and the classification is judged correct. However, predictions can vary from the observed situation in two important ways: (1) the auditor might expect (or predict) a characteristic that is not present (i.e., a false positive), or (2) the auditor might observe a characteristic that had not been predicted (i.e., a false negative).

An audit matrix has four cells, two of which indicate correct behavior of a method and two of which indicate incorrect behavior. The two correct cells are the number of classes predicted to be present and actually observed and the number predicted to be absent and actually absent. The incorrect cells are the number of classes predicted to be present but actually absent, and the number of classes predicted to be absent that actually were present. For example, figure 6 shows an audit matrix for an observation that was predicted to be a member of class 3, but that actually is a member of classes 1 and 2.

Reduction of an audit matrix to a single number could be done with a weighted linear function of the cell values, such as

classification score = $W_m M - (W_{fp} FP + W_{fn} FN)$,

where $W_m$, $W_{fp}$, and $W_{fn}$ are weights assigned to the number of matches, false positives, and false negatives, respectively, and $M$ is the number of matches, $FP$ is the number of false positives, and $FN$ is the number of false negatives. The weights would be application-specific and would relate to the relative costs of the two types of misclassification.
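A minimal sketch of the audit matrix cells and the weighted score for one observation, with actual and predicted class memberships given as sets of labels (the weights are application-specific):

    def audit(actual, predicted, all_classes):
        # Matches count both cells of agreement: present-and-predicted,
        # plus absent-and-not-predicted.
        matches = len(actual & predicted) + len(all_classes - actual - predicted)
        fp = len(predicted - actual)   # predicted present, actually absent
        fn = len(actual - predicted)   # actually present, not predicted
        return matches, fp, fn

    def classification_score(matches, fp, fn, w_m=1.0, w_fp=1.0, w_fn=1.0):
        return w_m * matches - (w_fp * fp + w_fn * fn)

For the example of figure 6 (actual = {"1", "2"}, predicted = {"3"}), audit() yields one false positive (class 3) and two false negatives (classes 1 and 2).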

Table 5 illustrates our approach using the results from the ten models for a single observation that was a member of CPT classes 38500, 38562, and 58720 (shown as value = 1 in the column labeled "actual"). The columns labeled 1 through 10 under the header "representation" contain the output of each model given the values of the


Figure 5. An Audit Matrix for Structuring and Evaluating Multiple Classification Results

                      Actual
Predicted       Present           Not present
Present         Match             False positive
Not present     False negative    Match

Figure 6. An Audit Matrix Example for Actual = "1 2" and Predicted = "3"

                      Actual
Predicted       Present           Not present
Present         --                3
Not present     1, 2              --

independent variables of the observation (1 = CPT is predicted; 0 = CPT is not predicted). For example, LDABinE predicts that seven CPTs would be associated with this particular observation, but is incorrect on all seven (7 false positives, 3 false negatives, and 0 matches). NN57 predicts 5 CPTs and is correct on two (3 false positives; 1 false negative; 2 matches). Table 6 summarizes the results for all 659 cases in the data set. In the absence of a value function for relative misclassification costs, the performance of each of the representations can be measured by (1) the number of correct predictions relative to the number of misclassifications and (2) the relative number of false positives and false negatives.

Table 7 compares the correct predictions (represented as a percentage: "observed & predicted" divided by the total number of misclassifications, x 100) across each of the representations, for each type of case (i.e., those cases containing 1, 2, and 3 CPTs) and for all cases. As shown, 5 of the 10 representations are able to classify accurately over 50 percent of the single CPT cases. Although NN57 is the most accurate model overall (92.41 percent), all of the neural network models as a group outperform the other representations. However, the performance of all representations deteriorates dramatically when classifying the multiple dependent variable cases. In the 2-CPT cases, LDARepP has the highest accuracy, but is still exceptionally poor (6.25 percent). In the 3-CPT cases, LDABinP is highest with 7.62 percent. Notably, the neural network models were among the poorest performers.
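The Table 7 measure can be recomputed from the Table 6 audit counts; a quick check in Python (values from the "All CPTs" section of Table 6):

    def percent_correct(observed_and_predicted, false_neg, false_pos):
        # Matches ("observed & predicted") over total misclassifications, x 100.
        return 100.0 * observed_and_predicted / (false_neg + false_pos)

    percent_correct(355, 319, 298)   # NN57    -> 57.54
    percent_correct(286, 438, 393)   # LDARepP -> 34.42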

The relative performance of the representations is also reflected in the number and nature of the classification errors. Those include the misclassification rate (the number of misclassifications divided by the number of observations in each group), and the proportion of false


Table 5. Classification and Misclassification Results for an Observation Containing Three Dependent Variables*

Actual classes: CPTs 38500, 38562, and 58720 (value = 1 in the "actual" column; the body of the table gives a 0/1 prediction for each of the 32 CPTs under each of the ten representations).

Total predicted (actual = 3):   7   5   28   3   13   3   1   1   5   2
Total correct:                  0   3    3   0    2   0   0   1   2   2

* Each column corresponds to one of the ten representations: 1 = LDABinE; 2 = LDABinP; 3 = LDAMultE; 4 = LDAMultP; 5 = LDARepE; 6 = LDARepP; 7 = See5; 8 = NN0; 9 = NN57; 10 = NN114.

negatives and false positives, which are important in considering the relative costs of misclassifying observations. Figures 7 through 10 graphically show the relative number of false negatives and positives for all cases, and for cases with 3 CPTs, 2 CPTs, and 1 CPT, respectively. Figure 11 shows results for all multiple CPT cases combined.


Table 6 Audit Matrix of Results from Each Method and Representation

                 Misclassifications                Matches
                 Observed &      Not observed &   Observed &   Not observed &
Method           not predicted   predicted        predicted    not predicted

All CPTs
LDARepE          248             4935             102          15803
LDARepP          438             393              286          19971
LDABinE          280             2263             13           18532
LDABinP          486             291              189          20122
LDAMultE         297             2411             3            18377
LDAMultP         522             1904             0            18662
NN0              407             275              278          20128
NN57             319             298              355          20116
NN114            318             296              339          20135
See5             533             86               221          20248

3 CPTs
LDARepE          54              381              0            20653
LDARepP          81              56               1            20950
LDABinE          45              166              2            20875
LDABinP          70              35               8            20975
LDAMultE         32              178              0            20878
LDAMultP         77              138              0            20873
NN0              52              10               0            21026
NN57             44              25               0            21019
NN114            42              16               0            21030
See5             96              7                0            20985

2 CPTs
LDARepE          61              782              1            20244
LDARepP          111             81               12           20884
LDABinE          58              372              1            20657
LDABinP          108             61               6            20913
LDAMultE         49              360              1            20678
LDAMultP         94              276              0            20718
NN0              113             47               1            20927
NN57             102             64               2            20920
NN114            97              60               2            20929
See5             128             12               8            20940

1 CPT
LDARepE          133             3772             101          17082
LDARepP          246             256              273          20313
LDABinE          177             1725             10           19176
LDABinP          308             195              175          20410
LDAMultE         216             1873             2            18997
LDAMultP         351             1490             0            19247
NN0              242             218              276          20352
NN57             173             209              353          20353
NN114            179             220              337          20352
See5             309             67               213          20499
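The four cells of the audit matrix follow directly from comparing binary actual and predicted indicator matrices. A minimal sketch (our illustration, not the authors' software), assuming rows are observations and columns are CPT categories:

    import numpy as np

    def audit_matrix(actual, predicted):
        # actual, predicted: 0/1 arrays of shape (n_observations, n_categories)
        actual = actual.astype(bool)
        predicted = predicted.astype(bool)
        return {
            "observed_not_predicted": int((actual & ~predicted).sum()),   # false negatives
            "not_observed_predicted": int((~actual & predicted).sum()),   # false positives
            "observed_predicted": int((actual & predicted).sum()),        # matches
            "not_observed_not_predicted": int((~actual & ~predicted).sum()),
        }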


Table 7 Correct Classifications as a Percentage of Total Errors

Method       All CPTs   3 CPTs   2 CPTs   1 CPT
LDARepE      1.97       0.00     0.12     2.59
LDARepP      34.42      0.73     6.25     54.38
LDABinE      0.51       0.95     0.23     0.53
LDABinP      24.32      7.62     3.55     34.79
LDAMultE     0.11       0.00     0.24     0.10
LDAMultP     0.00       0.00     0.00     0.00
NN0          40.76      0.00     0.63     60.00
NN57         57.54      0.00     1.20     92.41
NN114        55.21      0.00     1.27     84.46
See5         35.70      0.00     5.71     56.65

[Figure: horizontal bars of FalseNeg and FalsePos counts for each of the ten representations; x-axis 0 to 6,000 misclassifications.]

Figure 7. Misclassification Results for All Observations

Overall, NN57, NN114, and See5 have the fewest total misclassifications, but the neural net models more evenly balance false positives and false negatives than does See5, which appears to give more weight to false positives than to false negatives. The same holds true for observations in multiple categories, although the See5 classifier does marginally better than the neural network models for observations in exactly two categories, and somewhat worse for observations in exactly three categories. The See5 classifier is consistently more prone to generating false negatives than false positives. All three neural net models show a similar tendency, which is more pronounced for the three-category observations than for the two-category observations.

For the LDA approaches, the classifiers that used prior probabilities proportional to sample frequencies consistently give a smaller number of total errors and more closely balance false positives against false negatives than do those with equal prior probabilities. The classifiers derived with equal prior probabilities do give a smaller number of false positives than do those with probabilities proportional to sample frequencies.
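To make the two prior settings concrete, here is a minimal modern sketch (scikit-learn, not the software used in the study; the data are invented stand-ins). Equal priors correspond to passing a uniform prior vector; priors proportional to the sample class frequencies are scikit-learn's default when the priors argument is omitted:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    X = rng.random((300, 5))                # stand-in predictor values
    y = rng.integers(0, 3, size=300)        # stand-in labels for 3 classes

    lda_equal = LinearDiscriminantAnalysis(priors=np.full(3, 1 / 3)).fit(X, y)
    lda_prop = LinearDiscriminantAnalysis().fit(X, y)  # priors from class frequencies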


[Figure: horizontal bars of FalseNeg and FalsePos counts for each of the ten representations; x-axis 0 to 500 misclassifications.]

Figure 8. Misclassification Results for Observations with Three Dependent Variables

[Figure: horizontal bars of FalseNeg and FalsePos counts for each of the ten representations; x-axis 0 to 1,000 misclassifications.]

Figure 9. Misclassification Results for Observations with Two Dependent Variables

Finally, as expected, the discriminant analysis-based methods and See5 all show an increase in the rate of false positives, false negatives, and total misclassifications (with the exception of LDABinE, for false positives) as the number of categories increases from one, to two, and to three (see Table 8). The behavior of the neural nets was not expected, however. Although all of the neural net models likewise increase their misclassification rates when the number of categories increases from one to two, the rates actually decrease when the number of categories increases from two to three. In one case, the false positives for NN0, the rate for three categories is actually lower than the rate for a single category. It is surprising to see neural nets do better, in that sense, as the problem becomes more difficult.
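The Table 8 rates can be reproduced from the Table 6 counts by dividing by the number of observations in each group (44, 93, and 522 observations in the 3-, 2-, and 1-category groups, respectively, as reported in the Conclusions below). A quick check (our illustration) for the LDARepE false-negative rates:

    # False-negative counts for LDARepE (Table 6) divided by group sizes.
    for fn, n in [(54, 44), (61, 93), (133, 522)]:
        print(round(fn / n, 3))   # 1.227, 0.656, 0.255 -- matching Table 8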


[Figure: horizontal bars of FalseNeg and FalsePos counts for each of the ten representations; x-axis 0 to 4,000 misclassifications.]

Figure 10. Misclassification Results for Observations with One Dependent Variable

[Figure: horizontal bars of FalseNeg and FalsePos counts for each of the ten representations; x-axis 0 to 1,500 misclassifications.]

Figure 11. Misclassification Results for Observations with Multiple (2 or 3) Dependent Variables

Conclusions

OUR NUMERICAL RESULTS INDICATE THAT CURRENT DATA-MINING TECHNIQUES do not adequately model data sets in which observations may simultaneously be members of multiple classifications. Such data sets commonly appear in medical and other applications, and extraction of knowledge from data sets such as ours is important for the design of intelligent decision support systems for auditing, planning, and control tasks. Three primary conclusions can be drawn from our experimentation with alternative representations and solution approaches.

First, the performance of the various models is indicated by the number and characteristics of the classification errors they produce. The data-mining methods showed clear differences in the rate and type of classification errors in both single- and


Table 8 Misclassification Rates for 3, 2, and 1 Categories Across Representations

             Observations in 3 categories    Observations in 2 categories    Observations in 1 category
Method       False neg.  False pos.  Total   False neg.  False pos.  Total   False neg.  False pos.  Total
LDARepE      1.227       8.659       9.886   0.656       8.409       9.065   0.255       7.226       7.481
LDARepP      1.841       1.273       3.114   1.194       0.871       2.065   0.471       0.490       0.962
LDABinE      1.023       3.773       4.795   0.624       4.000       4.624   0.339       3.305       3.644
LDABinP      1.591       0.795       2.386   1.161       0.656       1.817   0.590       0.374       0.964
LDAMultE     0.727       4.045       4.773   0.527       3.871       4.398   0.414       3.588       4.002
LDAMultP     1.750       3.136       4.886   1.011       2.968       3.978   0.672       2.854       3.527
NN0          1.182       0.227       1.409   1.215       0.505       1.720   0.464       0.418       0.881
NN57         1.000       0.568       1.568   1.097       0.688       1.785   0.331       0.400       0.732
NN114        0.955       0.364       1.318   1.043       0.645       1.688   0.343       0.421       0.764
See5         2.182       0.159       2.341   1.376       0.129       1.505   0.592       0.128       0.720

multiple-class observations. The neural net and See5 models, for example, all tend to show a higher proportion of false negatives in comparison to the LDA models, but an overall lower misclassification rate. This suggests that the magnitude of the Type I and Type II costs incurred by an organization, and the relative differences between those costs, can affect a decision maker's choice of method and representation.
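A hypothetical illustration of that point, weighing the all-CPT error counts from Table 6 with invented cost figures (the 5:1 ratio below is an assumption for illustration only, not a cost estimate from the study):

    # False-negative and false-positive totals for two methods (Table 6, All CPTs).
    totals = {"NN57": (319, 298), "See5": (533, 86)}

    cost_fn, cost_fp = 5.0, 1.0   # assumed: a missed code costs 5x a spurious one
    for method, (fn, fp) in totals.items():
        print(method, cost_fn * fn + cost_fp * fp)   # NN57 1893.0, See5 2751.0

    # Reversing the ratio (cost_fn, cost_fp = 1.0, 5.0) yields See5 963.0 versus
    # NN57 1809.0, so the cost structure alone can reverse the preferred method.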

Second, the performance of a data-mining method in multiple classification problems is measured by the extent to which it allows a representation that is appropriate to the problem at hand. Neural networks, despite their poor performance in identifying multiple classes, arguably allow the most natural modeling, a simultaneous binary representation of dependent variables (sketched below). Formulations that can be accommodated by the other methods, particularly LDA, are somewhat clumsy approximations of what neural networks can model naturally. The choice of representation can have a significant bearing on the time and cost involved in the analysis of multiple category data, and a decision maker needs to consider the tradeoff of a better mathematical representation and better output performance from an ANN against the difficulty in interpreting the model's reasoning process.
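As a hedged modern illustration of that simultaneous binary representation (scikit-learn, not the software used in the study; the data, layer size, and dimensions are invented stand-ins):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Each observation's dependent variables form a 0/1 vector with one slot per
    # CPT category, so full membership in several categories is expressed directly.
    rng = np.random.default_rng(0)
    X = rng.random((200, 11))                 # stand-in case attributes
    Y = rng.integers(0, 2, size=(200, 3))     # stand-in multi-label 0/1 targets

    # With a 2-D indicator target, scikit-learn fits one sigmoid output per
    # category and thresholds each output at 0.5 when predicting.
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, Y)
    print(net.predict(X[:2]))                 # rows of 0/1 category indicators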

Third, the choice of representation also is dependent on the inherent capabilities of the available methods, and on the compatibility of the theory underlying each method with the data and objectives of the decision maker. For example, classical discriminant analysis is based on all the independent variables being continuous; therefore, qualitative independent variables, which are present in this study, might cause problems. Krzanowski [17] notes that discriminant analysis can perform poorly or satisfactorily in the presence of qualitative variables, depending on the correlation between the continuous and the qualitative variables. Also, for discriminant functions, the rules based on minimizing the expected cost of misclassification are a function of prior probabilities, misclassification


costs, and density functions of the data, so the incorporation of prior probabilities within See5, which would construct a representation similar to the LDA approaches, may not be replicable by See5's use of misclassification costs alone [15]. The CART approach [3] includes a statistical alternative to the procedure used by See5. Although CART was not included in this study, it is a possible item for future research.
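For reference, the expected-cost rule mentioned above can be stated in its standard two-population form (following Johnson and Wichern [15]; the notation is ours): allocate an observation x to population pi_1 when

    \frac{f_1(x)}{f_2(x)} \;\ge\; \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1}

and to pi_2 otherwise, where f_i is the density of population i, p_i its prior probability, and c(i | j) the cost of assigning a member of population j to population i. The rule makes explicit how priors, costs, and densities jointly determine the classification.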

The poor predictive performance shown by each of the representations in identifying multiple classes suggests continued exploration of this area. Further study is required in order to determine the root causes of the limitations, whether they can be circumvented, and whether other classification methods might be more appropriate. For example, the limitations shown here could be due in part to the nature of the domain and the data set. Because only 137 of the 659 observations in the data set contain more than one category (44 with 3 categories, 93 with 2 categories), some of the difficulty in identifying multiple classes might be attributed to an insufficient number of multiple-class observations. As noted, five of the ten representations performed satisfactorily (i.e., over 50 percent) on the 80 percent of the observations (i.e., 522/659) that contained a single category.

Additional exploration of the prediction task in this environment might also reveal other factors affecting the performance of the decision models. For example, modeling the decision strategies of human experts, such as schedulers and surgeons, could provide an indication of the degree of linearity of this task and the associated impact of the linearity on the performance of the decision models [16]. Further research also is required to determine how and whether additional case information, in the form of additional independent variables, might improve predictive accuracy. Beyond that, the context-independent classification methods we used in this paper could be supplemented by domain-specific knowledge of case factors and their relationships, as well as causal and/or heuristic knowledge of the task environment. More knowledge-rich models of diagnosis and situation assessment are possible enhancements to the traditional induction approaches described in this paper.

REFERENCES

1. Bashein, G., and Barna, C. A comprehensive computer system for anesthetic record retrieval. Anesthesia and Analgesia, 64, 4 (1985), 425-431.

2. Boritz, J.E.; Kennedy, D.B.; and Albuquerque, A. Predicting corporate failure using a neural network approach. Intelligent Systems in Accounting, Finance and Management, 4, 2 (1995), 95-111.

3. Breiman, L.; Friedman, J.H.; Olshen, R.A.; and Stone, C.J. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks, 1984.

4. Carter, C., and Catlett, J. Assessing credit card applications using machine learning. IEEE Expert, 2, 3 (1987), 71-79.

5. Charitou, A., and Charalambous, C. The prediction of earnings using financial statement information: empirical evidence with logit models and artificial neural networks. Intelligent Systems in Accounting, Finance and Management, 5, 4 (1996), 199-215.

6. Chung, H.M., and Tam, K.Y. A comparative analysis of inductive learning algorithms. Intelligent Systems in Accounting, Finance and Management, 2, 1 (1993), 3-18.

7. Currim, I.S.; Meyer, R.J.; and Le, N. A concept-learning system for the inference of production models of consumer choice. UCLA Working Paper, 1986.

8. Fanning, K., and Cogger, K.O. A comparative analysis of artificial neural networks using financial distress prediction. Intelligent Systems in Accounting, Finance and Management, 3, 4 (1994), 241-252.

9. Fanning, K.M., and Cogger, K.O. Neural network detection of management fraud using published financial data. Intelligent Systems in Accounting, Finance and Management, 7 (1998), 21-41.

10. Fanning, K.; Cogger, K.O.; and Srivastava, R. Detection of management fraud: a neural network approach. Intelligent Systems in Accounting, Finance and Management, 4, 2 (1995), 113-126.

11. Fayyad, U.; Piatetsky-Shapiro, G.; and Smyth, P. From data mining to knowledge discovery in databases. AI Magazine, 17, 3 (1996), 37-54.

12. Grudnitski, G.; Do, A.Q.; and Shilling, J.D. A neural network analysis of mortgage choice. Intelligent Systems in Accounting, Finance and Management, 4, 2 (1995), 127-135.

13. Henery, R.J. Classification. In D. Michie, D.J. Spiegelhalter, and C.C. Taylor (eds.), Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, 1994, pp. 6-16.

14. Jo, H.; Han, I.; and Lee, H. Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Systems with Applications, 13, 2 (1997), 97-108.

15. Johnson, R.A., and Wichern, D.W. Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice-Hall, 1998, pp. 629-725.

16. Kim, C.N.; Chung, H.M.; and Paradice, D.B. Inductive modeling of expert decision making in loan evaluation: a decision strategy perspective. Decision Support Systems, 21, 2 (1997), 83-98.

17. Krzanowski, W.J. The performance of Fisher's linear discriminant function under non-optimal conditions. Technometrics, 19, 2 (1977), 191-200.

18. Lachenbruch, P.A., and Mickey, M.R. Estimation of error rates in discriminant analysis. Technometrics, 10, 1 (1968), 1-11.

19. Liang, T.P. A composite approach to inducing knowledge for expert system design. Management Science, 38, 1 (1992), 1-17.

20. Liang, T.P.; Chandler, J.S.; Han, I.; and Roan, J. An empirical investigation of some data effects on the classification accuracy of probit, ID3, and neural networks. Contemporary Accounting Research, 9, 1 (1992).

21. Messier, W.F., and Hansen, J.V. Inducing rules for expert system development: an example using default and bankruptcy rules. Management Science, 34, 12 (1988), 1403-1415.

22. Quinlan, J.R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.

23. Ragothaman, S., and Naik, B. Using rule induction for expert system development: the case of asset writedowns. Intelligent Systems in Accounting, Finance and Management, 3, 3 (1994), 187-203.

24. Sen, T.K., and Gibbs, A.M. An evaluation of the corporate takeover model using neural networks. Intelligent Systems in Accounting, Finance and Management, 3, 4 (1994), 279-292.

25. Shavlik, J.W., and Dietterich, T.G. General aspects of machine learning. In J.W. Shavlik and T.G. Dietterich (eds.), Readings in Machine Learning. San Mateo, CA: Morgan Kaufmann, 1990, pp. 1-10.

26. Shaw, M.J., and Gentry, J.A. Using an expert system with inductive learning to evaluate business loans. Financial Management, 17, 3 (1988), 45-56.

27. Tam, K., and Kiang, M. Managerial applications of neural networks: the case of bank failure prediction. Management Science, 38, 7 (1992), 926-947.

28. Tessmer, A.C.; Shaw, M.J.; and Gentry, J.A. Inductive learning for international financial analysis: a layered approach. Journal of Management Information Systems, 9, 4 (1993), 17-36.

29. Weiss, S.M., and Indurkhya, N. Predictive Data Mining. San Francisco: Morgan Kaufmann, 1998.

30. Weiss, S.M., and Kapouleas, I. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann, 1989, pp. 781-787.

31. Weiss, S.M., and Kulikowski, C.A. Computer Systems That Learn. San Francisco: Morgan Kaufmann, 1991.