

Advances in predictive models for data mining

Se June Hong, Sholom M. Weiss *

IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA

Abstract

Expanding application demand for data mining of massive data warehouses has fueled advances in automated predictive methods. We examine a few successful application areas and their technical challenges. We review the key theoretical developments in PAC and statistical learning theory that have led to the development of support vector machines and to the use of multiple models for increased predictive accuracy. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Data mining; Text mining; Machine learning; Boosting

1. Introduction

Predictive modeling, which is perhaps the most-used subfield of data mining, draws from statistics, machine learning, database techniques, pattern recognition, and optimization techniques. The proven utility of industrial applications has led to advances in predictive modeling. We will review a few application areas that have demonstrated the importance of predictive modeling and have also fueled advances. Some important advances in learning theory will be considered in Section 3. The idea of using multiple models to enhance the performance of a model has spawned several useful approaches with a theoretical understanding for their success (Section 4). Alternative ways of measuring predictive performance will be discussed in Section 5.

2. Challenging applications

It has been well established that a substantial competitive advantage can be obtained by data mining in general and predictive modeling in particular. For some applications, maximizing accuracy or a utility measure is of paramount importance, even at the expense of weaker explanatory capabilities. We will briefly examine three challenging application areas: insurance, fraud detection, and text categorization.

2.1. Insurance

Risk assessment is at the core of the insurance business, where actuarial statistics have been the traditional tools to model various aspects of risk, such as accident, health claim, or disaster rates, and the severity of these claims. Claim frequency is rare and stochastic in nature. For instance, the auto accident rate of an insured driver is never a clear no-accident class vs. accident class problem and is instead modeled as a Poisson distribution. The claim amounts usually follow a log-normal distribution, which captures the phenomenon of rare but very high damage amounts. Neither of these distributions is well modeled by conventional modeling tools such as CHAID, CART, C4.5, SPRINT, or classical statistical regression techniques that optimize for traditional normal distributions. In general, different kinds of insurance can use different statistical models depending on the fundamental nature of the claims process, requiring a predictive model that can be optimized for different underlying distributions.

www.elsevier.nl/locate/patrec

Pattern Recognition Letters 22 (2001) 55–61

* Corresponding author. Tel.: +1-914-945-2330; fax: +1-914-945-3434.
E-mail addresses: [email protected] (S.J. Hong), [email protected] (S.M. Weiss).

0167-8655/00/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(00)00099-4

Other factors in insurance applications complicate the modeling process. Often, the desired target to be modeled is the expected claims amount for each individual policy holder, which is produced by a joint distribution of the claim rate and claim amount. Stored records usually have a significant proportion of missing or back-filled data updated only at the time of accidents. The claims data usually has hundreds of fields, and demographic data must also be included. Insurance actuaries demand that the model be actuarially credible, i.e., that the parameters of the model be within 5% of the expected true value with 90% confidence.

The Underwriting Profitability Analysis (UPA) application (Apte et al., 1999) embodies a new approach to generating predictive models for insurance risks. Groups of equal risk are identified by a top-down recursive splitting method similar to tree-generation algorithms. A key difference from traditional tree splitting is that the splits are selected by statistical models of insurance risk, e.g., a joint Poisson and log-normal distribution for auto accident claim amount, otherwise known as pure premium. The method tries to optimize the likelihood (in practice, the negative log-likelihood) of all the examples given the assumed distribution. This application has yielded demonstrably superior results for a major insurance firm, and many of the extracted rules, the leaf nodes of the trees, have replaced existing actuarial rules. They also report that in this application the model improves as more data are used in the training set, contrary to many applications, which reach a plateau of performance after tens of thousands of examples.
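The frequency-severity structure behind pure premium can be sketched numerically. The snippet below is a minimal illustration, not the UPA algorithm itself: it simulates per-policy annual cost as a Poisson number of claims with log-normal severities, and compares the simulated mean against the closed-form expectation λ·exp(μ + σ²/2). All parameter values are invented for the example.

```python
import math
import random

def simulate_pure_premium(lam, mu, sigma, n_policies, seed=0):
    """Simulate annual claims: Poisson(lam) claim counts per policy,
    log-normal(mu, sigma) severity per claim; return mean cost per policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_policies):
        # Poisson draw by Knuth's inversion method (fine for small lam)
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                break
            k += 1
        for _ in range(k):
            total += rng.lognormvariate(mu, sigma)
    return total / n_policies

# Expected pure premium in closed form: lam * exp(mu + sigma^2 / 2)
lam, mu, sigma = 0.05, 8.0, 1.2
analytic = lam * math.exp(mu + sigma**2 / 2)
simulated = simulate_pure_premium(lam, mu, sigma, 200_000)
```

With enough policies the simulated mean tracks the analytic value closely, which is the kind of distributional fit (rather than a normal-error fit) that the splitting criterion above is optimizing for.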

2.2. Fraud detection

Fraud detection is an important problem because fraudulent insurance claims and credit card transactions alone cost tens of billions of dollars a year. In the case of credit card fraud, artificial neural networks have been widely used by many banks. Frauds are relatively rare, i.e., a skewed distribution that baffles many traditional data mining algorithms unless stratified samples are used in the training set. Some large banks add to the transaction data volume by millions of transactions per day. The cost of processing a fraud case, once detected, is a significant factor against false positive errors, while undetected fraud adds the transaction cost to the loss column. This not only influences the decision whether to process a transaction as a fraud or not, but also calls for a more realistic performance measure than traditional accuracy. The pattern of fraudulent transactions varies with time, requiring relatively frequent and rapid generation of new models.

The JAM system (Java Agents for Meta-Learning) (Stolfo et al., 1997) is a recent approach for credit card fraud detection. The massive set of data with binary labels of fraud or legitimate transactions is divided into smaller subsets, one for each participating bank unit and for multiple samples, to gain better performance. Models are produced by fast existing methods in a distributed fashion. These multiple base models are then combined to form a meta-learner (see Sections 4 and 6). Using data from Chase and First Union banks, the induced models produced substantial cost savings over existing methods.
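The meta-learning idea, training base models on separate data subsets and then combining them, can be sketched as below. This is a simplified stand-in, not the actual JAM system: the data, the threshold-stump base learners, and the accuracy-weighted voting meta-combiner are all invented for illustration.

```python
import random

def train_stump(data, feature):
    """Base learner: pick the threshold on one feature (possibly with an
    inverted rule) that is most accurate on this learner's data subset."""
    best = None
    for x, _ in data:
        t = x[feature]
        acc = sum((xi[feature] >= t) == yi for xi, yi in data) / len(data)
        for flip in (False, True):
            a = 1 - acc if flip else acc
            if best is None or a > best[0]:
                best = (a, t, flip)
    _, t, flip = best
    return lambda x: (x[feature] >= t) != flip

def meta_combine(models, valid):
    """Meta-learner stand-in: weight each base model by its accuracy on a
    held-out validation set, then take a weighted vote."""
    ws = [sum(m(x) == y for x, y in valid) / len(valid) for m in models]
    def predict(x):
        vote = sum(w if m(x) else -w for w, m in zip(ws, models))
        return vote > 0
    return predict

# Invented stand-in data: "fraud" when the two features sum above 1.0
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(600)]
data = [(x, x[0] + x[1] > 1.0) for x in points]
train, valid = data[:400], data[400:]
# Two base models, each trained on a disjoint subset (one per "bank unit")
base = [train_stump(train[:200], 0), train_stump(train[200:], 1)]
meta = meta_combine(base, valid)
meta_acc = sum(meta(x) == y for x, y in valid) / len(valid)
```

The design point is that the subsets never need to be pooled: only the trained base models and their validation statistics are shared, which is what makes the distributed setting workable.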

2.3. Text mining

Electronic documents or text fields in databases are a large percentage of the data stored in centralized data warehouses. Text mining is the search for valuable patterns in stored text. When stored documents have correct labels, such as the topics of the documents, then that form of text mining is called text categorization. In many text storage and retrieval systems, documents are classified with one or more codes chosen from a classification system. For example, news services like Reuters carefully assign topics to news stories. Similarly, a bank may route incoming e-mail to one of dozens of potential response sites.

Originally, human-engineered knowledge-based systems were developed to assign topics to newswires. Such an approach to classification may have seemed reasonable, but the cost of the manual analysis needed to build a set of rules is no longer reasonable, given the overwhelming increase in the number of digital documents. Instead, automatic procedures are a realistic alternative, and researchers have proposed a plethora of techniques to solve this problem.

The use of a standardized collection of documents for analysis and testing, such as the Reuters collection of newswires for the year 1987, has allowed researchers to measure progress in this field. Substantial improvements in automated performance have been made since then.

Many automated prediction methods exist for extracting patterns from sample cases (Weiss and Indurkhya, 1998). In text mining, specifically text categorization, the raw cases are individual documents. The documents are encoded in terms of features in some numerical form, requiring a transformation from text to numbers. For each case, a uniform set of measurements on the features is taken by compiling a dictionary from the collection of training documents. Prediction methods look at samples of documents with known topics, and attempt to find patterns for generalized rules that can be applied to new unclassified documents. Once the data is in a standard encoding for classification, any standard data mining method, such as decision trees or nearest neighbors, can be applied.
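The dictionary-compilation and encoding steps above can be sketched as follows. This is a minimal bag-of-words illustration with invented toy documents; real systems typically add stemming, stop-word removal, and weighting schemes such as tf-idf.

```python
import math
from collections import Counter

def build_dictionary(docs, min_count=1):
    """Compile the feature dictionary from the training collection."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)

def encode(doc, vocab):
    """Bag-of-words vector: one word count per dictionary entry."""
    c = Counter(doc.lower().split())
    return [c[w] for w in vocab]

def nearest_neighbor(train_vecs, labels, vec):
    """1-NN by cosine similarity over the encoded documents."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return num / den if den else 0.0
    return max(zip(train_vecs, labels), key=lambda p: cos(p[0], vec))[1]

train_docs = ["wheat prices rise", "wheat harvest up",
              "bank rates fall", "bank lending slows"]
topics = ["grain", "grain", "interest", "interest"]
vocab = build_dictionary(train_docs)
vecs = [encode(d, vocab) for d in train_docs]
label = nearest_neighbor(vecs, topics, encode("wheat futures rise", vocab))
```

Note that the unseen word "futures" simply falls outside the dictionary; the classification is carried entirely by the words the training collection already knows.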

One of the interesting challenges text mining poses is the problem of minimal labeling. Most text collections are not tagged with category labels, and human tagging is usually costly. Starting from some tagged examples, one wishes to develop a text categorization model by asking for certain selected examples to be tagged, and one naturally wishes to minimize the number of such requests. Many approaches to this problem are being pursued by theorists as well as practical algorithm developers. Another interesting application of text mining technology is web mining, where a variety of features beyond the original text present a special challenge.
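One common family of approaches to selecting which examples to have tagged is uncertainty sampling, used here purely as an illustrative example (the text does not prescribe a specific method): ask humans to label the documents the current model is least sure about. The scoring model below is an invented toy.

```python
def select_for_tagging(unlabeled, prob_model, budget):
    """Uncertainty sampling: request tags for the documents whose predicted
    class probability is nearest 0.5, i.e., where the model is least sure."""
    ranked = sorted(unlabeled, key=lambda d: abs(prob_model(d) - 0.5))
    return ranked[:budget]

# Hypothetical scorer: fraction of words drawn from a tiny seed topic list
seed = {"wheat", "grain", "harvest"}
def prob_model(doc):
    words = doc.lower().split()
    return sum(w in seed for w in words) / len(words)

pool = ["wheat harvest report", "stock market news", "grain and oil exports"]
queries = select_for_tagging(pool, prob_model, budget=1)
```

Each round of tagging then retrains the model, so the labeling budget is spent where it is expected to change the decision boundary most.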

3. Theoretical advances

The theory of predictive modeling sheds light on what kinds of functions, i.e., mappings of feature vectors to the target values, can be learned efficiently with a given set of models. These results give an understanding of model complexity and how it can be used to assess the future performance of models on unseen data. These new concepts are beginning to guide the model search process as well as the model evaluation process in practical cases, complementing traditional techniques from statistics. For further details on these theoretical advances, see Hosking et al. (1997) and Kearns and Vazirani (1994).

3.1. Computational and statistical learning theory

A model generation process can be viewed as selecting a "best" possible model from a given family of models, i.e., functions that map the input feature space to the target variable. A model is best if it optimizes the error rate or, more generally, a loss function defined over the example space and the predicted output. Computational learning theory is concerned with the complexity of such a process, but is more focused on finding when the process can be efficient. One of the key areas in computational learning theory is the Probably Approximately Correct (PAC) learning model. Informally speaking, a concept from a given concept class is PAC learnable if a model can be found from a given model class such that the "error" of the model on the examples of the concept is bounded by some given ε within a given confidence bound of δ. The learning algorithm is said to be efficient if the complexity is polynomial in the number of examples needed to learn the concept, 1/ε and 1/δ.
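The polynomial dependence on 1/ε and 1/δ can be made concrete with the classical sample-size bound for a consistent learner over a finite hypothesis class H: m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice. The worked numbers below are illustrative only.

```python
import math

def pac_sample_bound(hypotheses, eps, delta):
    """Classical PAC bound for a consistent learner over a finite hypothesis
    class: m >= (1/eps) * (ln|H| + ln(1/delta)) examples suffice for the
    learned model to have error <= eps with probability >= 1 - delta."""
    return math.ceil((math.log(hypotheses) + math.log(1 / delta)) / eps)

# E.g., one million hypotheses, 5% error tolerance, 99% confidence
m = pac_sample_bound(hypotheses=10**6, eps=0.05, delta=0.01)
```

The logarithmic dependence on |H| and 1/δ is what keeps the example count modest even for very large hypothesis classes; halving ε, by contrast, doubles the requirement.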

An important PAC learning result shows that a weak PAC learner with error ε less than 1/2 can be turned into a strong learner with "error" close to 0 by combining multiple models, giving rise to a theoretical understanding of boosting (see Section 4). Another interesting direction of computational learning theory is to allow the learning algorithm to make certain queries to an oracle about the data vectors. By asking for the probability of an example event within some given "noise" tolerance, this approach makes it possible to analyze learning problems in the presence of noise. This line of research has also been applied to the minimal labeling problem of text mining with some promising results.

Statistical learning theory has its origin in the work of Vapnik and Chervonenkis in the late 1960s, who developed a mathematical basis for comparing models of different forms. The theory focuses on finite-sample statistics (classical statistics usually relies on asymptotic statistics) in the process of determining which is the best among a given set of models in fitting the data. In the classical statistics approach, it is assumed that the correct model is known and the focus is on parameter estimation. Statistical learning theory focuses on estimating the relative performance of competing models so that the best can be selected.

For predictive models, consider an example vector z given as the input feature vector x and the target value y. A model a operates on x and predicts f(x, a). If we are modeling the conditional probability distribution of y as a function of x, an appropriate loss function Q(z, a) of the model on z is the same negative log-likelihood often used in classical statistical modeling: Q(z, a) = −log p(y | x, a). For classification error, the loss function Q(z, a) = 0 if y = f(x, a) and 1 otherwise. If it is only known that the data vector z is generated according to some given probability measure F(z), the best model would be the a that minimizes the expected loss

R(a) = ∫ Q(z, a) dF(z).

In practice, the probability measure F(z) is not known, and one can only compute an empirical expected loss for the given example set z_i, i = 1, ..., ℓ, assuming that they are generated i.i.d.:

R_emp(a, ℓ) = (1/ℓ) Σ_{i=1..ℓ} Q(z_i, a).

Under what conditions does minimizing the empirical loss R_emp(a, ℓ) also minimize the expected loss R(a) without knowing the distribution F(z)? Statistical learning theory answers this key question by offering confidence regions for R(a) given R_emp(a, ℓ) for a given lower bound of probability 1 − η. These bounds are based on a measure known as the VC-dimension of the set of models being considered. It suffices to say here that the VC-dimension might be considered a more reliable measure of the complexity of the model family than the degree-of-freedom concept used in classical statistics. It directly implies that the number of examples should be much larger than the VC-dimension to obtain a reliable model.
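One commonly quoted form of this confidence region (the version below is a standard textbook statement, not taken from this paper) bounds R(a) by R_emp(a, ℓ) plus a term that grows with the VC-dimension h and shrinks with the sample size n:

```python
import math

def vc_confidence_term(h, n, eta):
    """One standard form of the Vapnik confidence interval: with probability
    at least 1 - eta, R(a) <= R_emp(a) + this term, where h is the
    VC-dimension of the model family and n the number of examples."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / eta)) / n)

# The bound is vacuous unless n is much larger than h
loose = vc_confidence_term(h=50, n=200, eta=0.05)
tight = vc_confidence_term(h=50, n=100_000, eta=0.05)
```

With n only a few times h, the added term is close to 1 and the bound says nothing; with n several orders of magnitude larger, it tightens to a few percent, which is the quantitative content of "the number of examples should be much larger than the VC-dimension."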

While the VC-dimension of a set of models is often very difficult to compute, the theory does offer a practical method for selecting the best model when there is "enough" data, by use of a random split of the data into training and validation sets: the search for the best-fitting model proceeds using the training set, and the loss is estimated from the validation set using the confidence bound. The confidence bound can be expressed without explicit dependence on the VC-dimension for the loss in the validation set. This is similar to the cross-validation method in classical statistics, where the technique is often used in searching for the parameter values of a fixed model family.

3.2. Support vector machine

The support vector machine has the baseline form of a linear discriminator. Here we give a brief sketch of the support vector machine model; for a detailed introduction to the subject, see Vapnik (1998) and Cristianini and Shawe-Taylor (2000). Let D be the smallest radius of the sphere that contains the data (example vectors). The points on either side of the separating hyperplane have distances to the hyperplane, and the smallest such distance is called the margin of separation. The hyperplane is called optimal if the margin is maximized. Let q be the margin of the optimal hyperplane. The points that are a distance q away from the optimal hyperplane are called the support vectors. It has been shown that the VC-dimension depends only on the number of support vectors. This implies that one can generate arbitrarily many derived features, e.g., all pairwise products of the original features, as long as the number of support vectors for the optimal hyperplane (in the expanded dimensions) does not increase much. One can see that, although the final form of the model is a linear function, the decision boundary can be of almost arbitrary shape in the original feature space because of the nonlinearity introduced by the derived variables.

This understanding leads to a new strategy: search for the support vectors and the coefficients of the optimal hyperplane simultaneously, as an optimization problem in the expanded dimensional space. In fact, one need not explicitly compute the values of the derived features if they are chosen in a judicious way. The feature expansion makes use of several popular families of kernel functions, such as polynomial, radial basis functions, or sigmoid functions as in the two-layer neural network. Traditionally, such models with a linear outer function were constructed to minimize the error. To bound the error, support vector machines are optimized for the margin. Efficient search techniques for the optimal hyperplane and for selecting the right basis function are active areas of research. The support vector machine is a significant new addition to predictive modeling.
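The claim that the derived features never need to be computed explicitly can be verified directly for the degree-2 polynomial kernel: K(x, z) = (x·z + 1)² equals the ordinary inner product of an explicit six-dimensional feature map that includes the pairwise products. A small check, with arbitrary numbers:

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: K(x, z) = (x . z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** 2

def phi(x):
    """Explicit feature map for 2-D input whose inner product reproduces the
    degree-2 polynomial kernel; note the pairwise product term x1*x2."""
    x1, x2 = x
    s = math.sqrt(2)
    return [1, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2]

x, z = (0.3, -1.2), (2.0, 0.5)
implicit = poly_kernel(x, z)                              # kernel evaluation
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))     # expanded space
```

The kernel evaluation costs the same as one inner product in the original space, while the explicit map grows quadratically with the input dimension, and higher-degree kernels grow faster still; this gap is the practical point of the kernel trick.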

4. Use of multiple models

Recent research results in learning demonstrate the effectiveness of combining multiple models of the same or different types for improving modeling accuracy. These methods, such as bagging (Breiman, 1996) and boosting (Freund and Schapire, 1996), have taken different approaches to achieve maximized modeling performance.

Let us look at an example where diverse predictive methods can be applied to obtain a solution: for example, a classification problem that can be solved by either a neural net, a decision tree, or a linear method. Until recently, the typical approach would be to try the candidate methods on the same data, and then select the method with the strongest predictive performance. Researchers have observed that predictive performance often can be improved by inducing multiple solutions of the same type, for example multiple decision trees. These models are generated by sampling from the same data. The final answer on a new case is determined by giving a vote to each of the multiple decision trees, and picking the answer with the most votes.

Although the techniques for generating multiple models from the same data are independent of the type of model, the decision tree is the most commonly used. How are the trees generated? No change is made to the standard tree induction method; the difference is in the sampling method. In the simplest approach, called bagging (Breiman, 1996), a sample of size n is taken with replacement from the original set of n examples. (For very large data, a smaller sample can be taken.) Some examples will be repeated in the sample, others may not occur. The expected proportion of unique examples in any given sample is 63.2%. Thus, it is possible to generate many samples, induce a decision tree from each sample, and then vote the results. An alternate sampling technique randomly selects a subset of features for each base model (Ho, 1998).
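The 63.2% figure is the limit of 1 − (1 − 1/n)ⁿ, which approaches 1 − 1/e. It is easy to check empirically with a simulated bootstrap sample:

```python
import random

def bootstrap_sample(examples, rng):
    """Bagging sample: n draws with replacement from n examples."""
    n = len(examples)
    return [examples[rng.randrange(n)] for _ in range(n)]

rng = random.Random(7)
examples = list(range(10_000))
sample = bootstrap_sample(examples, rng)
# Expected fraction of distinct examples: 1 - (1 - 1/n)^n -> 1 - 1/e ~ 0.632
unique_fraction = len(set(sample)) / len(examples)
```

The roughly 37% of examples left out of each sample are what make the induced trees differ from one another, which is the source of the ensemble's diversity.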

Adaptive resampling usually performs better than bagging. Instead of sampling all cases randomly, so that each case has a 1/n chance of being drawn in the sample, an incremental approach is used in random selection. The objective is to increase the odds of sampling cases that have been erroneously classified by the trees that have previously been induced. Some algorithms use weighted voting, where some trees may be given more weight than others. Boosting algorithms such as AdaBoost (Freund and Schapire, 1996) use weighted voting and an explicit formula for updating the likelihood of sampling or weighting each case in the training sample. While an ensemble of models does not necessarily increase the complexity of the solution in the sense of statistical learning theory, such a combined model diminishes the understandability that might have existed in a single model.
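The explicit update formula can be made concrete with a minimal AdaBoost sketch on invented 1-D data, using threshold stumps as base models. This is an illustration of the reweighting and weighted-voting mechanics, not the full algorithm as published: after each round, misclassified cases are up-weighted by exp(α) and correctly classified ones down-weighted by exp(−α), where α is the stump's voting weight.

```python
import math

def adaboost(points, labels, rounds):
    """Minimal AdaBoost sketch with 1-D threshold stumps: each round fits
    the weighted-error-minimizing stump, gives it a log-odds voting weight,
    and reweights the examples so mistakes count more next round."""
    n = len(points)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for t in points:                      # candidate thresholds
            for sign in (1, -1):              # rule or its inversion
                preds = [sign if x >= t else -sign for x in points]
                err = sum(wi for wi, p, y in zip(w, preds, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, sign, preds)
        err, t, sign, preds = best
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # The update formula: up-weight mistakes, down-weight hits, renormalize
        w = [wi * math.exp(-alpha * p * y) for wi, p, y in zip(w, preds, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((alpha, t, sign))
    def predict(x):
        score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1
    return predict

# A pattern no single stump can fit: positive only in the middle interval
points = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [-1, -1, -1, 1, 1, 1, -1, -1, -1]
model = adaboost(points, labels, rounds=3)
train_acc = sum(model(x) == y for x, y in zip(points, labels)) / len(points)
```

No single stump exceeds 6/9 accuracy on this pattern, yet the three weighted stumps together classify it perfectly, which is the weak-to-strong conversion the PAC result in Section 3 describes.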


5. Practical performance measures

Although a measure of accuracy is generally useful for evaluating predictive models, the utility of the model's output is a more direct goal. Models that optimize utility are widely available; for example, users can enter cost factors into CART. When the utility of a prediction can be computed in terms of the prediction statistics, it can be used in the model generation process in many different ways, e.g., in determining a split in a tree, in the pruning process, etc. When the utility is not easily expressed in terms of computable measures within the model generation process, the negative log-loss function is generally more useful than the accuracy measure.

Two alternative ways of evaluating a model's performance, Receiver Operating Characteristic (ROC) curves and lift curves, are of interest. These are not new, but they can offer insight into how different models will perform in many application situations. Many classification models can be modified so that the output is a probability of the given class; hence, depending on the threshold or some other decision-making parameter value, one can obtain a family of models from one, for example a tree or a rule set. The ROC curve originated from signal detection theory. It plots the true positive rate (y-axis) against the false positive rate (x-axis). If there are two models in this space, one can obtain any performance on the connecting line of the two just by randomly using the models with probability in proportion to the desired position on the line. This curve allows one to select the optimal model depending on the assumed class distribution at the time of prediction. Any model that falls below the convex hull of other models can be ignored (Provost et al., 1998). Fig. 1 shows an example of a ROC curve and a lift curve.
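The threshold sweep that turns one probabilistic model into a family of operating points can be sketched as follows; the scores and labels are invented toy values:

```python
def roc_points(scores, labels):
    """Sweep the decision threshold down the ranked scores, collecting one
    (false positive rate, true positive rate) point per example."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    pts, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in ranked:
        tp += y            # label 1: one more true positive at this threshold
        fp += 1 - y        # label 0: one more false positive
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]   # predicted class probabilities
labels = [1, 1, 0, 1, 0, 0]               # true classes
pts = roc_points(scores, labels)
```

Each point is one achievable model; plotting them and taking the upper convex hull gives exactly the set of non-dominated operating points the text describes.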

For many applications, the aim of prediction is to identify some desired class members (e.g., customers) on whom some action (e.g., mailing advertisement circulars) is to be performed. Rather than classification, it is more flexible if the prediction is in the form of a ranking based on the predicted class probability. The lift curve then plots cumulative true positive coverage (y-axis) against the rank-ordered examples (x-axis). A random ranking will result in a straight diagonal line on this plot. A lift curve of a model is usually above this line; the higher the better, for any given coverage of the examples in the preferential order.
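The lift construction, rank by predicted probability and accumulate the true positives found, can be sketched as below; the probabilities and labels are invented toy values:

```python
def lift_curve(probs, labels):
    """Rank examples by predicted class probability and record cumulative
    true-positive coverage against the fraction of examples contacted."""
    ranked = sorted(zip(probs, labels), reverse=True)
    total_pos = sum(labels)
    curve, covered = [], 0
    for i, (_, y) in enumerate(ranked, 1):
        covered += y
        curve.append((i / len(ranked), covered / total_pos))
    return curve

# A good ranking finds most responders early; random ranking tracks the diagonal
probs = [0.95, 0.9, 0.8, 0.6, 0.5, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
curve = lift_curve(probs, labels)
```

Here contacting the top half of the ranked list already covers all of the responders, which is precisely the kind of coverage-versus-effort trade-off a mailing campaign would read off the curve.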

6. Conclusion

We have presented an overview of some notable advances in predictive modeling. Clearly, this is a major area of interest to many research communities. Readers may think of other advances, not cited here, that are better suited to other types of pattern recognition. Our emphasis has been on techniques for data mining of data warehouses, massive collections of data such as the historical record of sales transactions. These data may require "clean up", but the need for pre-processing is far less than in many other pattern recognition tasks like image processing. It is no accident that gains in predictive performance have arrived in parallel with major enhancements of computing, storage, and networking capabilities. These rapid developments in computing and networking, along with e-commerce, can only increase interest in the theory and application of predictive modeling.

Fig. 1. Lift and ROC curves.

References

Apte, C., Grossman, E., Pednault, E., Rosen, B., Tipu, F., White, B., 1999. Probabilistic estimation based data mining for discovering insurance risks. IEEE Intelligent Syst. 14 (6), 49–58.

Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123–140.

Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.

Freund, Y., Schapire, R., 1996. Experiments with a new boosting algorithm. In: Proc. of the International Machine Learning Conference. Morgan Kaufmann, Los Altos, CA, pp. 148–156.

Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Machine Intell. 20 (8), 832–842.

Hosking, J.R.M., Pednault, E.P.D., Sudan, M., 1997. A statistical perspective on data mining. Future Generation Computer Systems: Special Issue on Data Mining 13 (2–3), 117–134.

Kearns, M.J., Vazirani, U.V., 1994. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA.

Provost, F., Fawcett, T., Kohavi, R., 1998. The case against accuracy estimation for comparing induction algorithms. In: Proc. of KDDM98.

Stolfo, S.J., Prodromidis, A., Tselepis, S., Lee, W., Fan, W., Chan, P., 1997. JAM: Java agents for meta-learning over distributed databases. In: Proc. of KDDM97, pp. 74–81.

Vapnik, V.N., 1998. Statistical Learning Theory. Wiley, New York.

Weiss, S., Indurkhya, N., 1998. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, Los Altos, CA.
