J Syst Sci Syst Eng(Dec 2006) 15(4): 419-435 ISSN: 1004-3756 (Paper) 1861-9576 (Online) DOI: 10.1007/s11518-006-5023-5 CN11-2983/N
© Systems Engineering Society of China & Springer-Verlag 2006
A COMPARATIVE STUDY OF DATA MINING METHODS IN CONSUMER LOANS CREDIT SCORING MANAGEMENT∗
Wenbing XIAO 1   Qian ZHAO 2   Qi FEI 3
1 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
  [email protected]
2 School of Economics, Renmin University of China, Beijing 100872, China
  [email protected]
3 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
Abstract
Credit scoring has become a critical and challenging management science issue as the credit
industry has been facing stiffer competition in recent years. Many classification methods have been
suggested to tackle this problem in the literature. In this paper, we investigate the performance of
various credit scoring models and the corresponding credit risk cost for three real-life credit scoring
data sets. Besides the well-known classification algorithms (e.g. linear discriminant analysis, logistic
regression, neural networks and k-nearest neighbor), we also investigate the suitability and
performance of some recently proposed, advanced data mining techniques such as support vector
machines (SVMs), classification and regression tree (CART), and multivariate adaptive regression
splines (MARS). The performance is assessed by using the classification accuracy and cost of credit
scoring errors. The experimental results show that SVM, MARS, logistic regression and neural networks
yield very good performance. However, the explanatory capability of CART and MARS
outperforms that of the other methods.
Keywords: Data mining, credit scoring, classification and regression tree, support vector machines,
multivariate adaptive regression splines, credit-risk evaluation
∗ This work was supported in part by National Science Foundation of China under Grant No. 70171015
1. Introduction

Data mining (DM), sometimes referred to as
knowledge discovery in databases (KDD), is a
systematic approach to find underlying patterns,
trends, and relationships buried in data. Data
mining has drawn much attention from both
researchers and practitioners due to its wide
applications in crucial business decisions.
Basically, the research on DM can be classified
into two categories: methodologies and
technologies. According to Curt (1995), the
technology part of DM consists of techniques
such as statistical methods, neural networks,
decision trees, genetic algorithms, and
non-parametric methods. Among the
above-mentioned applications, classification
problems, in which observations are assigned to
one of several disjoint groups, have played
important roles in business decision making due
to their wide applications in decision support,
financial forecasting, fraud detection, marketing
strategy, and other related fields (Chen et al.
1996, Lee and Chen 2005, Tam and Kiang
1992).
Credit risk evaluation decisions are crucial
for financial institutions due to the severe impact
of loan default. It is an even more important task
today as the credit industry has been
experiencing serious competition during the past
few years. Credit scoring has gained more and
more attention as the credit industry has realized
the benefits of improving cash flow, ensuring
credit collections and reducing possible risks.
Hence, many different useful techniques, known
as the credit scoring models, have been
developed by banks and researchers in order to
solve the problems involved during the
evaluation process (Mester 1997). The objective
of credit scoring models is to assign credit
applicants to either a “good credit” group who
are likely to repay financial obligation, or a “bad
credit” group who are more likely to default on
the financial obligation. The applications of the
latter should be denied. Therefore, credit scoring
problems basically fall within the scope of the more
general and widely discussed classification
problems.
Usually, credit scoring is employed to rank
credit information based on the application form
details and other relevant information held by a
credit reference agency. As a result, accounts
with high probability of default can be
monitored and necessary actions can be taken in
order to prevent the account from entering
default. In response, statistical methods,
non-parametric methods, and artificial
intelligence approaches have been proposed to
support the credit approval decision process
(Desai et al., 1996, West, 2000).
Generally, linear discriminant analysis and
logistic regression are the two most commonly
used data mining techniques to construct credit
scoring models. However, linear discriminant
analysis (LDA) has often been criticized because
of the categorical nature of the credit data and
the fact that the covariance matrices of the good
and bad credit classes are not likely to be equal.
In addition to the LDA approach, logistic
regression is an alternative to conduct credit
scoring. A number of logistic regression models
for credit scoring applications have been
reported in the literature (Henley 1995).
However, logistic regression has also been
criticized for its strong model assumptions,
such as variance homogeneity, which have
limited its application in handling credit scoring
problems. Recently, neural networks have
provided an alternative to LDA and logistic
regression, particularly in situations where the
dependent and independent variables exhibit
complex nonlinear relationships. Even though it
has been reported that neural networks have
better credit scoring capability than LDA and
logistic regression (Desai et al. 1996), neural
networks have also been criticized for the long
training process involved in designing the optimal
network topology, difficulty in identifying the
relative importance of potential input variables,
and certain interpretive difficulties which have
limited their applicability in handling credit
scoring problems. Hence, the issue of which
classification technique to be used for credit
scoring remains a very difficult and challenging
problem. In this paper, we conduct a
benchmarking study of various classification
techniques on three real-life credit data sets.
Techniques that will be implemented are logistic
regression, linear discriminant analysis, SVMs,
neural networks, KNN, CART and MARS. All
techniques will be evaluated in terms of the
percentage of correctly classified observations
and misclassification cost.
This paper is organized as follows. We begin
with a short overview of the classification
techniques used in Section 2. Data sets and
experimental design are presented in Section 3.
Section 4 gives the empirical results and
discussion for the three real credit scoring data sets,
including classification performance, the costs
of credit scoring errors and explanatory ability
of credit scoring models. Section 5 addresses the
conclusion and discusses possible future
research areas.
2. Literature Review
2.1 Linear Discriminant Analysis and Logistic Regression Models
Linear discriminant analysis involves a
linear combination of the two (or more)
independent variables that best differentiates
between the a priori defined groups. This is
achieved by the statistical decision rule of
maximizing the between-group variance relative
to the within-group variance; this relationship is
expressed as the ratio of between-group to
within-group variance. The linear combination
for a discriminant analysis is derived from an
equation of the form

Z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n    (1)

where Z is the discriminant score, w_i (i = 1, 2, \ldots, n)
are the discriminant weights,
and x_i (i = 1, 2, \ldots, n) are the independent variables
(Altman 1968, Jo, Han and Lee 1997).
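As a minimal sketch of how the discriminant score in Equation (1) separates two groups, the following computes the classical Fisher weights w = S_W^{-1}(m_1 - m_2) on invented two-variable data; the data points and the midpoint cutoff are illustrative assumptions, not part of the paper's method:

```python
import numpy as np

# Illustrative "good" and "bad" credit groups (made-up two-variable records).
good = np.array([[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
bad  = np.array([[6.0, 1.0], [7.0, 2.0], [8.0, 1.5]])

m1, m2 = good.mean(axis=0), bad.mean(axis=0)
# Pooled within-group scatter matrix S_W (sum of the group scatter matrices).
S_w = (np.cov(good, rowvar=False) * (len(good) - 1)
       + np.cov(bad, rowvar=False) * (len(bad) - 1))
w = np.linalg.solve(S_w, m1 - m2)   # discriminant weights w_1, ..., w_n

z_good = good @ w                    # discriminant scores Z for each applicant
z_bad = bad @ w
# A simple midpoint cutoff between the two group means of Z.
cutoff = 0.5 * (z_good.mean() + z_bad.mean())
```

On this toy data every "good" score falls on one side of the cutoff and every "bad" score on the other, which is exactly the separation the between-group/within-group variance ratio maximizes.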
Logistic regression (Logistic) analysis has
also been used to investigate the relationship
between binary or ordinal response probabilities
and explanatory variables. The method fits a linear
logistic regression model to binary or ordinal
response data by the method of maximum
likelihood. The advantage of this method is that
it does not assume multivariate normality and
equal covariance matrices as LDA does. The
logistic regression approach to classification
(Logistic) tries to estimate the probability
P(y = 1 | x) as follows:

P(y = 1 | x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \cdots + w_n x_n)}}    (2)

where x \in \mathbb{R}^n is the n-dimensional input
vector, w_i is the parameter vector, and the scalar
w_0 is the intercept. The parameters
w_0 and w_i are then typically estimated using the
maximum likelihood procedure (Hosmer 2000,
Thomas 2000).
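Equation (2) can be sketched directly; the intercept and weights below are invented for illustration rather than fitted to any of the three credit data sets:

```python
import math

# P(y = 1 | x) = 1 / (1 + exp(-(w0 + w . x))), as in Equation (2).
def p_good(x, w0, w):
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

w0, w = -1.0, [0.8, -0.5]        # hypothetical intercept and weight vector
p = p_good([2.0, 1.0], w0, w)    # scored probability of "good credit" for one applicant
```

With z = -1.0 + 0.8*2.0 - 0.5*1.0 = 0.1, the applicant scores just above 0.5, i.e. marginally on the "good credit" side.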
2.2 Support Vector Machines Models

A simple description of the SVM algorithm
is provided as follows. Given a training set
D = \{x_i, y_i\}_{i=1}^{N} with input vectors
x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T \in \mathbb{R}^n and target labels
y_i \in \{-1, +1\}, the support vector machine (SVM)
classifier, according to Vapnik's original
formulation, satisfies the following conditions:

\begin{cases} w^T \phi(x_i) + b \ge +1, & \text{if } y_i = +1 \\ w^T \phi(x_i) + b \le -1, & \text{if } y_i = -1 \end{cases}    (3)

which is equivalent to

y_i [w^T \phi(x_i) + b] \ge 1, \quad i = 1, \ldots, N    (4)

where w represents the weight vector and b the
bias. The nonlinear function \phi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_k} maps the
input or measurement space to a high-
dimensional, and possibly infinite-dimensional,
feature space. Equation (4) then comes down to
constructing two parallel bounding hyperplanes
at opposite sides of a separating
hyperplane w^T \phi(x) + b = 0 in the feature space,
with the margin width between both hyperplanes
equal to 2 / \|w\|_2. In the primal weight space, the
classifier then takes the decision function form

\text{sgn}(w^T \phi(x) + b)    (5)
Most classification problems are, however,
not linearly separable. Therefore, one generally
finds the weight vector using slack variables \xi_i
to permit misclassification. One defines the
primal optimization problem as

\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i    (6)

subject to

\begin{cases} y_i [w^T \phi(x_i) + b] \ge 1 - \xi_i, & i = 1, \ldots, N \\ \xi_i \ge 0, & i = 1, \ldots, N \end{cases}    (7)

where the \xi_i are slack variables needed to allow
misclassifications in the set of inequalities, and
C \in \mathbb{R}^+ is a tuning hyperparameter weighting
the importance of classification errors against the
margin width. The solution of the primal
problem is obtained after constructing the
Lagrangian. From the conditions of optimality,
one obtains a quadratic programming (QP)
problem with Lagrange multipliers \alpha_i. A
multiplier \alpha_i exists for each training data instance.
Data instances corresponding to non-zero \alpha_i
are called support vectors.
On the other hand, the above primal problem
can be converted into the following dual
problem with objective function (8) and
constraints (9). Since the decision variables are
the Lagrange multipliers of the support vectors, it is
easier to interpret the results of this dual
problem than those of the primal one.

\max_{\alpha} \; e^T \alpha - \frac{1}{2} \alpha^T Q \alpha    (8)

subject to

\begin{cases} 0 \le \alpha_i \le C, & i = 1, \ldots, N \\ y^T \alpha = 0 \end{cases}    (9)

In the dual problem above,
e = (1, 1, \ldots, 1)^T \in \mathbb{R}^N, Q is an N \times N positive
semi-definite matrix with Q_{ij} = y_i y_j K(x_i, x_j), and
K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j) is the kernel. Here, the
training vectors x_i are mapped into a higher
(maybe infinite) dimensional space by the
function \phi. As is typical for SVMs, we never
calculate w or \phi(x). This is made possible by
Mercer's condition, which relates the mapping
function \phi(x) to the kernel function K(\cdot, \cdot) as
follows:

K(x_i, x_j) = \phi(x_i)^T \phi(x_j)    (10)

For the kernel function K(\cdot, \cdot), one typically has
several design choices, such as the linear kernel
K(x_i, x_j) = x_i^T x_j; the polynomial kernel of
degree d, K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \gamma > 0;
the radial basis function (RBF) kernel
K(x_i, x_j) = \exp\{-\gamma \|x_i - x_j\|^2\}, \gamma > 0; and the
sigmoid kernel K(x_i, x_j) = \tanh\{\gamma x_i^T x_j + r\},
where d, r \in \mathbb{N} and \gamma \in \mathbb{R}^+ are constants. Then
one constructs the final SVM classifier as

\text{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right)    (11)
The details of the optimization are discussed in
(Vapnik 1999, Gunn 1998, Cristianini 2000).
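The kernel design choices listed above can be sketched directly; the hyperparameter values for γ, r and d are illustrative, and the eigenvalue check merely demonstrates the positive semi-definiteness that Mercer's condition guarantees for a Gram matrix built like Q in Equation (8):

```python
import numpy as np

# The four kernel choices named in the text; gamma, r, d are illustrative values.
def k_lin(xi, xj):
    return xi @ xj
def k_poly(xi, xj, g=1.0, r=1.0, d=2):
    return (g * (xi @ xj) + r) ** d
def k_rbf(xi, xj, g=0.5):
    return np.exp(-g * np.sum((xi - xj) ** 2))
def k_sig(xi, xj, g=0.1, r=0.0):
    return np.tanh(g * (xi @ xj) + r)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
# Gram matrix for the RBF kernel; by Mercer's condition it is positive semi-definite.
G = np.array([[k_rbf(a, b) for b in X] for a in X])
eigs = np.linalg.eigvalsh(G)   # all eigenvalues should be >= 0 (up to round-off)
```

The RBF kernel is the one whose SVM variant (Rbf-SVM) appears in the experiments of Section 4.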
2.3 Neural Networks Models (BPN, RBF and FAR)

A neural network model involves
constructing computers with architectures and
processing capabilities that mimic certain
processing capabilities of the human brain. A
neural network model is composed of neurons,
the processing elements. These elements are
inspired by biological nervous systems. Each of
the neurons receives inputs, and delivers a single
output. Thus, a neural network model is a
collection of neurons that are grouped in layers
such as the input layer, the hidden layer, and the
output layer. Several hidden layers can be placed
between the input and the output layers. We will
discuss the BPN in more detail because it is the
most popular NN for classification.
A simple back-propagation network (BPN)
model consists of three layers: the input layer,
the hidden layer, and the output layer. The
input-layer processes the input variables, and
provides the processed values to the hidden layer.
The hidden layer further processes the
intermediate values, and transmits the processed
values to the output layer. The output layer
corresponds to the output variables of the
back-propagation neural network model. A
three-layer back-propagation neural network
(BPN) is shown in Figure 1. For the details of
neural networks, readers are referred to
(West 2000, Bishop 1995).
Figure 1  A three-layer back-propagation neural network (input, hidden, and output layers)
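A minimal sketch of the forward pass through the three layers of Figure 1; the weights here are fixed invented values, whereas a real BPN would learn them by back-propagation:

```python
import numpy as np

def sigmoid(z):
    # sigmoid transfer function used in the hidden and output layers
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])           # input layer: three applicant features
W1 = np.array([[0.1, 0.4, -0.2],         # input -> hidden weights (two hidden nodes)
               [-0.3, 0.2, 0.5]])
b1 = np.array([0.0, 0.1])
W2 = np.array([[0.7, -0.6]])             # hidden -> output weights (one output node)
b2 = np.array([0.2])

h = sigmoid(W1 @ x + b1)                 # hidden-layer activations
out = sigmoid(W2 @ h + b2)[0]            # output node: a score in (0, 1)
```

The input layer passes the processed variables to the hidden layer, and the hidden layer's activations are combined again at the output node, exactly the layered flow the figure depicts.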
Radial Basis Function (RBF) networks
(Moody and Darken 1989) have a static
Gaussian function as the non-linearity for the
hidden layer processing elements. The Gaussian
function responds only to a small region of the
input space where the Gaussian is centered. The
key to a successful implementation of these
networks is to find suitable centers for the
Gaussian functions. This can be done with
supervised learning, but an unsupervised
approach usually produces better results. The
advantage of radial basis function networks
is that they find the input-to-output map using
local approximators. Usually the supervised
segment is simply a linear combination of the
approximators. Since linear combiners have few
weights, these networks train extremely fast and
require fewer training samples.
The fuzzy ART (FAR) network (West 2000) is
a dynamic network that incorporates
computations from fuzzy set theory into the
adaptive resonance theory (ART). The typical
FAR network consists of two totally
interconnected layers of neurons, identified as
the complement layer and the category layer, in
addition to the input and output layers. When an
input vector is applied to the network, it creates
a short-term activation of the neurons in the
complement layer. This activity is transmitted
through the weight vector to neurons in the
category layer. Each neuron in the category layer
then calculates the inner product of the
respective weights and input values. These
calculated values are then resonated back to the
complement layer.
2.4 Multivariate Adaptive Regression Splines

MARS was first proposed by Friedman (1991,
1995) as a flexible procedure which models
relationships that are nearly additive or involve
interactions with fewer variables. The modeling
procedure is inspired by the recursive
partitioning technique governing classification
and regression tree (CART) (Breiman et al. 1984)
and generalized additive modeling, resulting in a
model that is continuous with continuous
derivatives. It excels at finding optimal variable
transformations and interactions, and at handling
the complex data structures that often hide in
high-dimensional data; hence it can
effectively uncover important data patterns and
relationships that are difficult, if not impossible,
for other methods to reveal.
MARS essentially builds flexible models by
fitting piecewise linear regressions; that is, the
nonlinearity of a model is approximated through
the use of separate regression slopes in distinct
intervals of the predictor variable space.
Therefore the slope of the regression line is
allowed to change from one interval to the other
as the ‘knot’ points are crossed. The variable
to use and the end points of the intervals for
each variable are found via a fast but intensive
search procedure. In addition to searching
variables one by one, MARS also searches for
interactions between variables, allowing any
degree of interaction to be considered.
The general MARS function can be
represented using the following equation:

\hat{f}(x) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} [s_{km}(x_{v(k,m)} - t_{km})]_+    (12)

where a_0 and a_m are parameters, M is the
number of basis functions, K_m is the number of
knots, s_{km} takes the value +1 or -1 and
indicates the right/left sense of the associated
step function, v(k, m) is the label of the
independent variable, and t_{km} indicates the knot
location.
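The hinge factors [s(x - t)]_+ of Equation (12) can be sketched for a single predictor with one knot; the knot location and the coefficients are invented for illustration:

```python
import numpy as np

def hinge(x, t, s):
    # one MARS basis factor: [s * (x - t)]_+ with sign s and knot t
    return np.maximum(s * (x - t), 0.0)

x = np.linspace(0.0, 10.0, 5)    # a single predictor variable
t = 4.0                          # knot location
# Piecewise-linear fit whose slope changes as the knot is crossed:
# slope -2 to the left of t, slope +0.5 to the right, intercept 1 at the knot.
f = 1.0 + 0.5 * hinge(x, t, +1) + 2.0 * hinge(x, t, -1)
```

The pair of mirrored hinges gives exactly the separate regression slopes in distinct intervals described above; MARS builds its model from products of such factors.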
The optimal MARS model is selected in a
two-stage process. Firstly, MARS constructs a
very large number of basis functions to overfit
the data initially, where variables are allowed to
enter as continuous, categorical, or ordinal (the
formal mechanism by which variable intervals
are defined), and they can interact with each
other or be restricted to enter only as additive
components. In the second stage, basis functions
are deleted in order of least contribution using
the generalized cross-validation (GCV) criterion.
A measure of variable importance can be
assessed by observing the decrease in the
calculated GCV values when a variable is
removed from the model. The GCV can be
expressed as follows:

LOF(\hat{f}_M) = GCV(M) = \frac{\frac{1}{N} \sum_{i=1}^{N} [y_i - \hat{f}_M(x_i)]^2}{[1 - C(M)/N]^2}    (13)

where there are N observations, and C(M) is
the cost-penalty measure of a model containing
M basis functions (therefore the numerator
measures the lack of fit on the M basis function
model \hat{f}_M(x_i) and the denominator denotes the
penalty for model complexity C(M)). Missing
values can also be handled in MARS by using
dummy variables indicating the presence of the
missing values. By allowing for any arbitrary
shape for the function and interactions, and by
using the above-mentioned two-stage model
building procedure, MARS is capable of reliably
tracking the very complex data structures that
often hide in high-dimensional data. Please refer
to Friedman (1991, 1995) for more details
regarding the model building process.
2.5 k-Nearest-Neighbor-Classifiers and CART Model
k-Nearest-neighbor classifiers (KNN) (Henley
and Hand 1996) classify a data instance by
considering only the k-most similar data
instances in the training set. The class label is
then assigned according to the class of the
majority of the k nearest neighbors. Ties can be
avoided by choosing k odd. One commonly
opts for the Euclidean distance as the similarity
measure:

d(x_i, x_j) = \|x_i - x_j\| = [(x_i - x_j)^T (x_i - x_j)]^{1/2}    (14)
where x_i, x_j \in \mathbb{R}^n are the input vectors of data
instances i and j, respectively. Note that
more advanced distance measures have also been
proposed in the literature.
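The k-NN rule with the Euclidean distance of Equation (14) can be sketched as follows; the two clusters and the query points are invented, and k = 3 is chosen odd to avoid ties, as noted above:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, query, k=3):
    d = np.sqrt(np.sum((X - query) ** 2, axis=1))  # Euclidean distances, Eq. (14)
    nearest = np.argsort(d)[:k]                    # indices of the k closest instances
    # majority vote among the k nearest neighbors
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # a "good credit" cluster
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.1]])   # a "bad credit" cluster
y = ['good', 'good', 'good', 'bad', 'bad', 'bad']

label = knn_predict(X, y, np.array([0.05, 0.1]))
```

A query near the first cluster is voted "good" by its three nearest neighbors; one near the second cluster is voted "bad".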
Classification and regression tree (CART), a
statistical procedure introduced by Breiman et al.
(1984), is primarily used as a classification tool,
where the objective is to classify an object into
two or more populations. As the name suggests,
CART is a single procedure that can be used to
analyze either categorical or continuous data
using the same technology. The methodology
outlined in Breiman et al. can be summarized
into three stages. The first stage involves
growing the tree using a recursive partitioning
technique to select variables and split points
using a splitting criterion. Several criteria are
available for determining the splits, including
Gini, twoing and ordered twoing. For a detailed
description of these criteria, readers can
refer to Breiman et al. In addition to selecting
the primary variables, surrogate variables, which
are closely related to the original splits and may
be used in classifying observations having
missing values for the primary variables, can be
identified and selected.
After a large tree is identified, the second
stage of the CART methodology uses a pruning
procedure that incorporates a minimal cost
complexity measure. The result of the pruning
procedure is a nested subset of trees starting
from the largest tree grown and continuing the
process until only one node of the tree remains.
Cross-validation or a testing sample will be used
to provide estimates of future classification
errors for each subtree. The last stage of the
methodology is to select the optimal tree, which
corresponds to the tree yielding the lowest error
rate on the cross-validation or testing set. Please refer
to Breiman et al. (1984) and Steinberg and Colla
(1997) for more details regarding the model
building process of CART.
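The Gini splitting criterion used in the tree-growing stage can be sketched as an impurity-decrease computation; the candidate split and labels are invented for illustration (CART's actual implementation evaluates many such candidate splits and keeps the best):

```python
def gini(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions
    n = len(labels)
    if n == 0:
        return 0.0
    p_good = labels.count('good') / n
    return 1.0 - p_good ** 2 - (1.0 - p_good) ** 2

parent = ['good'] * 4 + ['bad'] * 4
left, right = ['good'] * 4, ['bad'] * 4    # a perfectly separating candidate split
n = len(parent)
# Impurity decrease of the split; the split maximizing this is selected.
gain = (gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right))
```

A perfect split of a balanced node yields the maximum possible decrease of 0.5; an uninformative split would yield a gain near zero.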
3. Data Sets and Experimental Design

The German and Australian credit data sets
are publicly available at the UCI repository
(http://kdd.ics.uci.edu). Dr. Hans Hofmann of
the University of Hamburg contributed the
German credit scoring data. It consists of 700
examples of creditworthy applicants and 300
examples where credit should not be extended.
For each applicant, 24 variables described credit
history, account balances, loan purpose, loan
amount, employment status, personal
information, age, housing, and job. The
Australian credit scoring data set is similar but
more balanced, with 307 and 383 examples of
each outcome. The data set contains a mixture of
six continuous and eight categorical variables.
The third credit data set is from major financial
institutions in the US, comprising 1225
applications: 902 examples of
creditworthy applicants and 323 examples of
non-creditworthy applicants. This data set also
includes 14 attributes. To protect the
confidentiality of these data, attribute names and
values of data sets have been changed to
symbolic data.
To minimize the impact of data dependency
and improve the reliability of the resultant
estimates, 10-fold cross validation is used to
create random partitions of the raw data sets.
Each of the 10 random partitions serves as an
independent holdout test set for the credit
scoring model trained with the remaining nine
partitions. The training set is used to establish
the credit scoring model's parameters, while the
independent test sample is used to test the
generalization capability of the model. The
overall scoring accuracy reported is an average
across all ten test set partitions.
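The 10-fold protocol described above can be sketched as follows; the index array and the per-fold "accuracy" placeholder are stand-ins, since the actual credit records and scoring models are not reproduced here:

```python
import numpy as np

# 10 random partitions of the record indices; each serves once as the holdout set.
rng = np.random.default_rng(0)
n_records, k = 1000, 10            # e.g. the German set has 1000 applicants
idx = rng.permutation(n_records)
folds = np.array_split(idx, k)

accuracies = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # ... fit the scoring model on train_idx, score it on test_idx ...
    accuracies.append(len(test_idx) / n_records)   # placeholder per-fold figure

overall = sum(accuracies) / k      # reported result: average over the ten test folds
```

Every record appears in exactly one holdout fold, which is what reduces the data-dependency of the resulting accuracy estimate.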
The topic of choosing the appropriate class
distribution for classifier learning has received
much attention in the literature. In this study, we
dealt with this problem by using a variety of
class distributions, ranging from 55.5/44.5 for the
Australian credit data set to 73.6/26.4 for the
American credit data set. The LDA, Logistic,
CART, KNN and MARS classifiers require no
parameter tuning. For the SVM classifiers, we
used the LIBSVM toolbox 2.8 and adopted a grid
search mechanism to tune the parameters. For the
BPN classifiers, we adopted a standard
three-layer architecture. The numbers of input and
output nodes equaled the numbers of input and output
variables, respectively. Nodes in the hidden and
output layers use the sigmoid transfer
function. Since it is difficult to determine the optimal
network for the test data so as to guarantee
generalization performance, the number of
hidden nodes for the three data sets was varied
between 8 and 30 and the network with the best
training set performance was selected for test set
evaluation. The NN analyses were conducted
using the Neural Networks toolbox 4.0
(http://www.mathworks.com). The CART 4.0 and
MARS 2.0 evaluation versions
(http://www.salford-systems.com), provided by
Salford Systems, were used to build the CART and
MARS credit scoring models. The SVM analyses
were conducted using the LIBSVM toolbox 2.8
(Chang and Lin 2001).
4. Results and Discussion

The results for each credit-scoring model are
reported in Table 1 for the German,
Australian and American credit data. These
results are averages of accuracy determined for
each of the 10 independent test data set
partitions used in the cross validation
methodology. Since the training of any neural
networks model is a stochastic process, the
network accuracy determined for each data set
partition is itself an average of 10 repetitions.
Table 1  10-fold cross validation test set classification accuracy on credit scoring data sets

          German credit data (%)     Australian credit data (%)  American credit data (%)
          Goods   Bads   Overall     Goods   Bads   Overall      Goods   Bads   Overall
RBF       86.5    48.0   74.6        86.8    87.2   87.1         88.5    24.2   71.3
BPN       86.4    42.5   73.3        84.6    86.7   85.8         88.1    22.9   70.9
FAR       60.0    51.2   57.3        74.4    76.2   75.4         N/A     N/A    N/A
LDA       72.3    73.3   72.6        81.0    92.2   85.9         65.4    56.0   62.9
LOGIT     88.1    48.7   76.3        85.9    89.0   87.2         95.9    11.2   73.5
KNN       77.5    44.7   67.6        84.7    86.7   85.8         78.4    30.1   66.1
Kernel    84.5    37.0   70.2        81.4    84.8   84.4         N/A     N/A    N/A
CART      71.2    69.4   70.5        79.9    92.5   85.5         59.3    59.4   59.3
MARS      89.0    66.0   74.9        86.3    88.3   87.4         89.7    20.2   71.4
Lin-SVM   88.9    49.1   77.0        79.9    92.5   85.5         88.9    22.0   71.3
Pol-SVM   88.5    48.6   76.5        83.8    88.6   85.5         89.9    18.3   71.0
Rbf-SVM   88.7    49.7   77.1        80.5    93.0   85.8         89.4    22.6   71.8
Sig-SVM   89.0    50.0   77.2        80.5    92.0   85.6         89.6    21.1   71.5

Neural networks results are averages of 10 repetitions. N/A: not tested.
It is evident from Table 1 that Sig-SVM has
the highest overall credit scoring accuracy of
77.2% for German credit data, while the
Lin-SVM, Pol-SVM and Rbf-SVM have credit
scoring accuracies of 76.5% to 77.1%. Closely
following SVM is Logistic regression with an
overall accuracy of 76.3%, and MARS with
74.9%. Linear discriminant analysis has
accuracy of 72.6%, which is 3.7% less accurate
than logistic regression. A strength of the linear
discriminant model for this data, however, is its
significantly higher accuracy than any other
model at identifying bad credit risks. This is likely
due to the assumption of equal prior
probabilities used to develop the linear
discriminant model. It is also interesting to note
that the most commonly used neural network
architecture, BPN with accuracy 73.3%, is
comparable to linear discriminant analysis with
an accuracy of 72.6%. The K-NN, kernel density
and CART models have overall accuracy levels of
67.6%, 70.2% and 70.5%, respectively. The least
accurate method for the German credit scoring
data is the FAR neural networks model at
57.3%.
For the Australian credit data, MARS has the
top overall credit scoring accuracy of 87.4%,
followed closely by the Logistic regression
(87.2%) and NN (RBF) (87.1%). The BPN (85.8
%) and LDA (85.9 %) are again comparable
from an overall accuracy consideration. The
KNN, CART, BPN, LDA and SVM models have
overall credit scoring errors that are more than
0.01 greater than those of the MARS, logistic regression, and
RBF neural models. The FAR neural networks
and kernel density model overall accuracy are
75.4% and 84.4%, respectively.
For the American credit data, Logistic
regression has the top overall credit scoring
accuracy of 73.5%, followed closely by the
Rbf-SVM (71.8%) and Sig-SVM (71.5%). The
Lin-SVM and Pol-SVM are grouped at
accuracy levels from 71.0% to 71.3%. The BPN,
K-NN and MARS models have overall accuracy levels
of 70.9%, 66.1% and 71.4%, respectively. The least
accurate method for the American credit scoring
data is the CART model (kernel density and
FAR weren't tested on the American credit data).
However, we note that CART has the lowest
error rate (40.6%) in all models identifying bad
credit risks, followed closely by the LDA
(44.0%).
To further support these conclusions, we test
for statistically significant differences between
the credit scoring models. We have used a special
notational convention whereby the best three of
the overall accuracy is underlined and denoted
in bold face for each data. For cross validation
studies of supervised learning algorithms,
Dietterich (1998) recommends McNemar's test,
which is used in this paper to establish
statistically significant differences between
credit scoring models. McNemar's test is a
chi-square statistic calculated from a 2 × 2
contingency table. The diagonal elements of the
contingency table are counts of the number of
credit applications misclassified by both
models, n_{00}, and the number correctly classified
by both models, n_{11}. The off-diagonal elements
are counts of the applications classified incorrectly by
Model A and correctly by Model B, n_{01}, and
conversely the applications classified incorrectly by
Model B and correctly by Model A, n_{10}.
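A sketch of the continuity-corrected McNemar statistic built from the two off-diagonal counts; the disagreement counts below are invented, and 3.84 is the chi-square critical value for p = 0.05 with one degree of freedom:

```python
def mcnemar_chi2(n01, n10):
    # McNemar's statistic with continuity correction, computed from the
    # off-diagonal disagreement counts of the 2x2 contingency table.
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

chi2 = mcnemar_chi2(n01=40, n10=20)   # hypothetical disagreement counts
significant = chi2 > 3.84             # reject "equal accuracy" at p = 0.05
```

Only the cases where the two models disagree carry information about which is more accurate, which is why the diagonal counts n_{00} and n_{11} do not enter the statistic.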
Results of McNemar's test with p = 0.05 are
given in Table 2. All credit scoring models are
tested for significant differences with the most
accurate model in the data set. A model whose
overall credit scoring accuracy is not significantly
different from the most accurate model is labeled a
superior model; those that are significantly less
accurate are labeled inferior models. It is
evident from Table 2 that the SVM, Logistic
regression, NN (RBF) and MARS models are
superior ones for three credit scoring data sets
and the LDA, KNN and CART models are
superior for only the Australian credit data.
4.1 Cost of Credit Scoring Errors

This subsection considers the costs of credit
scoring errors and their impact on model
selection. It is evident that the individual group
(bad or good) accuracy of the credit scoring
model can vary widely. For the German credit
data, all models except LDA are much less
accurate at classifying bad credit risks than good
credit risks. Most pronounced is the accuracy of
logistic regression with an error of 0.1186 for
good credit and 0.5113 for bad credit. In credit
Table 2  Statistically significant differences, credit scoring models

                  German credit data      Australian credit data   American credit data
Superior models   RBF, MARS,              RBF, SVM,                RBF, BPN,
                  Logistic regression,    Logistic regression,     Logistic regression,
                  SVM                     MARS, LDA,               SVM, MARS
                                          KNN, CART
Inferior models   FAR, BPN,               FAR,                     LDA, CART,
                  LDA, KNN,               Kernel density           KNN
                  Kernel density,
                  CART

Statistical significance established with McNemar's test, p = 0.05; kernel density and FAR weren't tested for the American credit data.
scoring applications, it is generally believed that
the cost of granting credit to a bad risk
candidate, denoted by C_{12}, is significantly
greater than the cost of denying credit to a good
risk candidate, denoted by C_{21}. In this situation
it is important to rate the credit scoring models
with the cost function defined in Equation (15)
rather than relying on the overall classification
accuracy. To illustrate the cost function, the relative
costs of misclassification suggested by Dr.
Hofmann when he compiled the German credit
data are used: C_{12} is 5 and C_{21} is 1. Evaluation
of the cost function also requires estimates of the
prior probabilities of good credit, \pi_1, and bad
credit, \pi_2, in the application pool of the credit scoring
model. These prior probabilities are estimated
from reported default rates. For the year 1997,
6.48% of a total credit debt of $ 560 billion was
charged off (West 2000), while Jensen reports a
charge off rate of 11.2% fro credit applications
he investigated (Frydman et al. 1985). The error
rate for the bad credit group of the German
credit data (which averages about 0.45) is used
to establish a low value for 2π of 0.144
(0.0648/0.45) and a high value of 0.249
(0.112/0.45). The ratio 2 2/n N , in Equation (15)
measures the false positive rate, the proportion
of bad credit risks that are granted credit, while
the ration 1 1/n N measures the false negative
rate, or good credit risks denied credit by the
model.
Cost = C12·π2·(n2/N2) + C21·π1·(n1/N1)    (15)
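As a sanity check on Equation (15), the cost computation can be sketched in a few lines of Python. This is illustrative only; the error rates plugged in below are the logistic-regression figures for the German data quoted above:

```python
def expected_cost(fp_rate, fn_rate, pi_bad, c12=5.0, c21=1.0):
    """Expected misclassification cost per applicant, Equation (15).

    fp_rate: n2/N2, fraction of bad risks granted credit (false positives)
    fn_rate: n1/N1, fraction of good risks denied credit (false negatives)
    pi_bad:  prior probability of a bad applicant (pi_good = 1 - pi_bad)
    c12/c21: cost of accepting a bad risk / of rejecting a good risk
    """
    pi_good = 1.0 - pi_bad
    return c12 * pi_bad * fp_rate + c21 * pi_good * fn_rate

# Logistic regression on the German data: 0.5113 bad-group error and
# 0.1186 good-group error, the values quoted in the text.
low = expected_cost(fp_rate=0.5113, fn_rate=0.1186, pi_bad=0.144)
high = expected_cost(fp_rate=0.5113, fn_rate=0.1186, pi_bad=0.249)
print(round(low, 3), round(high, 3))  # close to the 0.471 / 0.728 entries of Table 3
```

The small residual gap to the table entries comes from rounding in the reported group error rates.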
Under these assumptions, the credit scoring
cost is reported for each model in Table 3. For
the German credit data, the MARS (0.413)
model is now slightly better than the LDA
(0.429) at the prior probability level of 14.4%
bad credit. At the higher level of 24.9% bad
credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.540. Closely following LDA are MARS, with an overall cost of 0.571, and CART, with 0.597. For the Australian credit data, the costs of all models are nearly identical at both levels of π2. The Logistic (0.200) model is now slightly better than MARS (0.202) at the prior probability level of 14.4% bad credit. At the higher level of 24.9% bad credit, the Rbf-SVM is clearly the best model from an overall cost perspective with a score of 0.234. Closely following Rbf-SVM are LDA and Logistic, with overall costs of 0.239 and 0.244, respectively. For the American credit data, the LDA (0.613) model is now slightly better than
Table 3 Credit scoring models misclassification cost

          German credit data     Australian credit data   American credit data
          π2=0.144   π2=0.249    π2=0.144   π2=0.249      π2=0.144   π2=0.249
BPN       0.530      0.818       0.228      0.281         0.657      1.049
RBF       0.497      0.761       0.205      0.258         0.644      1.030
FAR       0.694      0.908       0.391      0.490         N/A        N/A
LDA       0.429      0.540       0.219      0.239         0.613      0.808
Logist    0.471      0.728       0.200      0.243         0.673      1.140
KNN       0.592      0.858       0.227      0.281         0.688      1.033
Kernel    0.587      0.901       0.268      0.329         N/A        N/A
CART      0.467      0.597       0.226      0.244         0.641      0.811
Lin-SVM   0.462      0.717       0.226      0.244         0.657      1.055
Pol-SVM   0.469      0.726       0.221      0.264         0.675      1.093
Rbf-SVM   0.459      0.711       0.217      0.234         0.648      1.043
Sig-SVM   0.454      0.705       0.225      0.246         0.657      1.060
MARS      0.413      0.571       0.202      0.249         0.663      1.071

N/A: not tested
Table 4 5-fold cross-validation test-set classification accuracy on the balanced credit scoring data sets under the new strategy

          German credit data (%)   Australian credit data (%)   American credit data (%)
          Goods  Bads  Overall     Goods  Bads  Overall         Goods  Bads  Overall
RBF       67.2   73.7  70.4        85.7   89.3  87.5            65.2   57.3  61.3
BPN       67.0   70.3  68.7        85.2   87.6  86.4            64.7   55.7  60.2
LDA       69.0   73.0  71.0        80.3   92.5  86.4            59.5   55.5  57.5
LOGIT     74.3   74.0  74.2        84.0   92.3  88.2            64.3   62.7  63.5
CART      68.0   69.7  68.8        80.7   93.3  87.0            66.7   54.7  61.3
MARS      66.0   79.0  72.5        84.0   91.0  87.5            66.3   50.7  58.5
Rbf-SVM   69.1   73.5  71.3        81.0   93.3  87.2            63.3   59.0  61.2

Neural networks results are averages of 10 repetitions.
the CART (0.641) model at the prior probability level of 14.4% bad credit, followed by RBF with a score of 0.644. At the higher level of 24.9% bad credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.808. Closely following LDA are CART, with an overall cost of 0.811, and RBF, with 1.030.
As Table 2 shows, the relative group classification accuracies of the neural network, SVM, logistic regression and MARS models are influenced by the imbalanced design of the training data. To improve their accuracy on bad credit risks, a new strategy is tested for these models' training sets. The strategy is to form new data sets from a balanced group of 300 good credit examples and 300 bad credit
examples for the different models. Each of these models is tested with 5-fold cross-validation. The accuracy results for the new strategy are summarized in Table 4. The new strategy yields the greatest improvement in the error for bad credit identification, with a reduction of approximately 20% for the German credit data and 30% for the American credit data. The overall error rate under the new strategy increases by 5% and 10% for these two data sets, respectively. For the Australian credit data, however, the overall error rate under the new strategy decreases by 0.5% to 2%.
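The balancing strategy above can be sketched as follows. The 300/300 split mirrors the text, but the paper does not describe the sampling mechanics, so the undersampling routine below is an assumption:

```python
import random

def balanced_sample(examples, labels, n_per_class=300, seed=42):
    """Undersample to n_per_class 'good' and n_per_class 'bad' examples."""
    rng = random.Random(seed)
    good = [x for x, y in zip(examples, labels) if y == "good"]
    bad = [x for x, y in zip(examples, labels) if y == "bad"]
    sample = rng.sample(good, n_per_class) + rng.sample(bad, n_per_class)
    sample_labels = ["good"] * n_per_class + ["bad"] * n_per_class
    return sample, sample_labels

def five_fold_indices(n, k=5, seed=42):
    """Shuffled index folds for k-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

# 700 good / 300 bad applicants, as in the German data set
X = list(range(1000))
y = ["good"] * 700 + ["bad"] * 300
Xb, yb = balanced_sample(X, y)        # 600 examples, 300 per class
folds = five_fold_indices(len(Xb))    # 5 disjoint folds of 120
```

Undersampling the majority class this way trades some overall accuracy for better bad-risk detection, which is exactly the effect Table 4 reports.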
4.2 A Comparison of the Explanatory Ability of Credit Scoring Models

This subsection considers the explanatory ability of the credit scoring models. Good explanatory ability is very important in credit scoring applications for explaining the rationale behind a decision to deny credit. Neural network and SVM models cannot explain how or why they identified a potential "bad" loan application. LDA and logistic regression models are better in this respect than SVM and neural networks. KNN and kernel density are inferior models with regard to explanatory ability. CART and MARS have the best explanatory ability. More detailed analyses of the explanatory ability of three of the models (neural networks, CART and MARS) on the German credit data follow.
4.2.1 Explanatory Ability of the Neural Networks Model for the German Credit Data
A key deficiency of any neural networks
model for credit scoring applications is the
difficulty in explaining the rationale for the
decision to deny credit. Neural networks are
usually thought of as black-box technology
devoid of any logic or rule-based explanations
for the output mapping. This is a particularly
sensitive issue in light of recent federal
legislation regarding discrimination in lending
practices. To address this problem, West (2000)
developed explanatory ability insights for the
neural network trained on the German credit
data. It is accomplished by clamping 23 of the
24 input values, varying the remaining input by
± 5%, and measuring the magnitude of the
impact on the two output neurons. The clamping
process is repeated until all network inputs have
been varied. A weight can now be determined
for each input that estimates its relative power in
determining the resultant credit decision. Please
refer to West (2000) for more details and results
regarding the model building process.
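West's clamping procedure can be sketched generically as follows. This is an illustrative reconstruction, not West's actual network: `model` stands in for any trained scoring function mapping a feature vector to a score.

```python
def input_sensitivity(model, x, delta=0.05):
    """Rank inputs by how much a +/-5% perturbation moves the model output.

    model: callable taking a list of feature values, returning a score
    x:     a representative input vector; while one input is varied,
           all the others stay clamped at their values in x
    Returns normalized relative importance weights, one per input.
    """
    base = model(x)
    impacts = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] = x[i] * (1 + delta)
        down[i] = x[i] * (1 - delta)
        # total magnitude of the output shift over both perturbations
        impacts.append(abs(model(up) - base) + abs(model(down) - base))
    total = sum(impacts) or 1.0
    return [w / total for w in impacts]

# Toy check: a linear "network" whose second input has 10x the weight
w = input_sensitivity(lambda v: 1.0 * v[0] + 10.0 * v[1], [1.0, 1.0])
```

For the toy linear model the recovered weights are in the ratio 1:10, matching the coefficients, which is the intuition behind using clamping as a post-hoc importance measure.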
4.2.2 Explanatory Ability of the CART Model for the German Credit Data
Figure 2 depicts the CART tree obtained on the testing sample with the popular 1-SE rule in the tree pruning procedure. It is observed from Figure 2 that A1, A3, A5 and A2 play important roles in the rule induction (Ai denotes the ith attribute, for i = 1, …, n; the same notation is used hereafter). It can also be observed from Figure 2 that if an observation's A1 is between 1.5 and 2.5, its A2 ≥ 22.5 and its A5 > 3.5, it falls into terminal node 11, whose classified class is class 1 (good customer). Unlike those of other classification techniques, the rules and terminal nodes read off the built tree are very easy to interpret, and hence marketing professionals can use them in designing proper managerial decisions. Furthermore, we
conclude by saying that CART is an effective and powerful management tool that allows us to build advanced and user-friendly decision-support systems for credit scoring management.
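The terminal-node-11 rule quoted above translates directly into code. A minimal sketch follows; the attribute names and the good/bad encoding come from the text, while the helper's name is ours:

```python
def classify_node11_path(a1, a2, a5):
    """Apply the Figure 2 path that reaches terminal node 11.

    Returns "good" when the applicant satisfies the quoted rule:
    1.5 < A1 <= 2.5, A2 >= 22.5 and A5 > 3.5. Otherwise this single
    rule does not fire and a full walk of the tree would be needed.
    """
    if 1.5 < a1 <= 2.5 and a2 >= 22.5 and a5 > 3.5:
        return "good"          # terminal node 11, class 1
    return "rule not matched"  # some other branch of the tree applies

print(classify_node11_path(2, 30, 4))  # -> good
```

This if-statement transparency is exactly the explanatory advantage CART holds over black-box models such as neural networks and SVMs.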
4.2.3 Explanatory Ability of the MARS Model for Credit Scoring
In order to demonstrate the explanatory
ability of MARS scoring models, the German
data will be used as an illustrative example. The
obtained basis functions and variable selection
results of the illustrative example are
summarized in Table 5. It is observed that A1, A2, A3, A4, A5, A8, A9, A15, A16, A17 and A20 play important roles in deciding the MARS
Figure 2 The tree of the CART credit scoring model (root split: A1 ≤ 2.5, N = 1000; further splits on A2, A3, A4, A5, A10 and A18; 12 terminal nodes, each labeled class 1 (good) or class 2 (bad))
Table 5 Variable selection results and basis functions of the MARS credit scoring model

Variable name   Relative importance (%)   Equation name   Equation
A1              100.00                    BF1             max(0, A1 − 1.000)
A2              57.31                     BF2             max(0, A2 − 4.000)
A3              51.82                     BF3             max(0, A3 − 0.180272E−06)
A5              40.44                     BF4             max(0, A5 − 1.000)
A4              36.67                     BF5             max(0, A4 − 36.000)
A16             32.33                     BF6             max(0, 36.000 − A4)
A9              30.43                     BF7             max(0, A16 + 0.180632E−07)
A17             28.86                     BF8             max(0, A15 − 1.000)
A15             27.52                     BF9             max(0, A20 + 0.182414E−07)
A20             27.34                     BF10            max(0, A17 − 0.376854E−08)
A6              29.1                      BF12            max(0, 4.000 − A6)
A8              16.95                     BF13            max(0, A9 − 1.000)
                                          BF14            max(0, A8 − 2.000)
                                          BF15            max(0, 2.000 − A8)

MARS prediction function:
Y = 1.358 − 0.096·BF1 + 0.007·BF2 − 0.058·BF3 − 0.032·BF4 + 0.002·BF5 + 0.005·BF6 + 0.098·BF7 − 0.192·BF8 + 0.094·BF9 − 0.129·BF10 + 0.040·BF12 + 0.040·BF13 − 0.026·BF14 − 0.095·BF15

In the MARS credit scoring model, Y = 0 (1) is defined to be a good (bad) credit customer.
credit scoring models. Moreover, according to the obtained basis functions and the MARS prediction function, it can be observed that high values of A2, A9, A16 and A20 tend to indicate a bad credit customer, while high values of A1, A3, A5, A15 and A17 tend to indicate a good credit customer. These conclusions from the basis functions and the MARS prediction function have important managerial implications, since they can help managers and professionals design appropriate loan policies for acquiring good credit customers.
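The prediction function in Table 5 can be evaluated directly. The sketch below hard-codes the published basis functions and coefficients; the sample applicant's attribute values are hypothetical:

```python
def mars_score(a):
    """Evaluate the Table 5 MARS prediction function.

    a: dict mapping attribute names ("A1", "A2", ...) to values.
    Returns Y, where Y near 0 suggests a good and Y near 1 a bad customer.
    """
    bf = {  # hinge basis functions from Table 5
        1: max(0, a["A1"] - 1.0),           2: max(0, a["A2"] - 4.0),
        3: max(0, a["A3"] - 0.180272e-6),   4: max(0, a["A5"] - 1.0),
        5: max(0, a["A4"] - 36.0),          6: max(0, 36.0 - a["A4"]),
        7: max(0, a["A16"] + 0.180632e-7),  8: max(0, a["A15"] - 1.0),
        9: max(0, a["A20"] + 0.182414e-7), 10: max(0, a["A17"] - 0.376854e-8),
        12: max(0, 4.0 - a["A6"]),          13: max(0, a["A9"] - 1.0),
        14: max(0, a["A8"] - 2.0),          15: max(0, 2.0 - a["A8"]),
    }
    coef = {1: -0.096, 2: 0.007, 3: -0.058, 4: -0.032, 5: 0.002, 6: 0.005,
            7: 0.098, 8: -0.192, 9: 0.094, 10: -0.129, 12: 0.040, 13: 0.040,
            14: -0.026, 15: -0.095}
    return 1.358 + sum(coef[k] * bf[k] for k in coef)

# Hypothetical applicant chosen so every basis function is (essentially)
# zero: Y stays at the intercept, 1.358, i.e. leaning "bad".
applicant = {"A1": 1, "A2": 4, "A3": 0, "A4": 36, "A5": 1, "A6": 4,
             "A8": 2, "A9": 1, "A15": 1, "A16": 0, "A17": 0, "A20": 0}
```

Raising an attribute with a positive net coefficient, such as A2, increases Y, which is how the "high A2 indicates a bad customer" reading above falls out of the function.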
5. Conclusions and Areas of Future Research

Credit scoring has become more and more important as the competition between financial institutions has intensified, and more and more companies are seeking better strategies with the help of credit scoring models. Hence, many modeling alternatives, such as traditional statistical methods, non-parametric methods and artificial intelligence techniques, have been developed in order to handle credit scoring tasks successfully. In this paper, we have studied the
performance of various classification techniques
for credit scoring. The experiments were
conducted on 3 real-life credit scoring data sets.
The classification performance was assessed by the percentage of correctly classified cases and by the misclassification cost.
It is found that each technique exhibits characteristics that may be interesting in the context of different data sets. Firstly, Logistic,
MARS, SVM and ANN (BPN and RBF)
classifiers yield very good performances in
terms of the classification ratio. However, it has
to be noted that LDA and CART were
significantly more accurate than any other model
in identifying bad credit risks for German and
American credit scoring data sets. Secondly, the
experiments clearly indicated that many
classification techniques yield performances
which are quite competitive with each other.
Only a few classification techniques (e.g. FAR
and kernel density) were clearly inferior to the
others. Besides, CART and MARS not only have lower Type II errors, which are associated with high misclassification costs, but also offer better evaluation reasoning and can help to structure the understanding of the prediction.
Starting from the findings of this study,
several interesting topics for future research can
be identified. One interesting topic may aim at
collecting more important variables in
improving the credit scoring accuracy. Another
promising avenue for future research is to
investigate the power of classifier ensembles
where multiple classification algorithms are
combined.
References

[1] Altman, E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23: 589-609
[2] Bishop, C.M. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press
[3] Breiman, L., Friedman, J.H., Olshen, R.A.
& Stone, C.J. (1984). Classification and
Regression Trees, Pacific Grove, CA:
Wadsworth
[4] Chen, M.S., Han, J. & Yu, P.S. (1996). Data
mining: an overview from a database
perspective. IEEE Transactions on
Knowledge and Data Engineering, 8(6):
866-883
[5] Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[6] Curt, H. (1995). The devil's in the detail: techniques, tools, and applications for database mining and knowledge discovery – Part 1. Intelligent Software Strategies, 6: 1-15
[7] Cristianini, N. & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge: Cambridge University Press
[8] Desai, V.S., Crook, J.N. & Overstreet, G.A.
(1996). A comparison of neural networks
and linear scoring models in the credit union
environment. European Journal of
Operational Research, 95(1): 24-37
[9] Dietterich, T.G. (1998). Approximate
statistical tests for comparing supervised
classification learning algorithms. Neural
Computation, 10: 1895-1923
[10] Friedman, J.H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19: 1-141
[11] Friedman, J.H. & Roosen, C.B. (1995). An introduction to multivariate adaptive regression splines. Statistical Methods in Medical Research, 4: 197-217
[12] Frydman, H.E., Altman, EI. & Kao, D.
(1985). Introducing recursive partitioning
for financial classification: the case of
financial distress. Journal of Finance, 40(1):
53-65
[13] Gunn, S.R. (ed.) (1998). Support Vector
Machines for Classification and Regression.
Technical Report, University of
Southampton
[14] Henley, W.E. (1995). Statistical aspects of
credit scoring. Dissertation, The Open
University, Milton Keynes, UK
[15] Henley, W.E. & Hand, D.J. (1996).
K-nearest neighbor classifier for assessing
consumer credit risk. Statistician, 44: 77-95
[16] Hosmer, D.W. & Lemeshow, S. (2000). Applied Logistic Regression. New York: John Wiley & Sons
[17] Jo, H., Han, I. & Lee, H. (1997).
Bankruptcy prediction using case-based
reasoning, neural networks, and
discriminant analysis. Expert Systems
Application, 13: 97-108
[18] Lee, T.S. & Chen, I.F. (2005). A two-stage
hybrid credit scoring model using artificial
neural networks and multivariate adaptive
regression splines. Expert Systems with
Applications, 28: 743-752
[19] Mester, L.J. (1997). What’s the point of
credit scoring? Business Review - Federal
Reserve Bank of Philadelphia. Sept/Oct:
3-16
[20] Moody, J. & Darken, C.J. (1989). Fast
learning in networks of locally tuned
processing units. Neural Computation, 3:
213-25
[21] Steinberg, D. & Colla, P. (ed.) (1997). Classification and Regression Trees. San Diego, CA: Salford Systems
[22] Thomas, L.C. (2000). A survey of credit and
behavioral scoring: Forecasting financial
risks of lending to customers. International
Journal of Forecasting, 16: 149-172
[23] Tam, K.Y. & Kiang, M.Y. (1992). Managerial applications of neural networks: the case of bank failure predictions. Management Science, 38(7): 926-947
[24] Vapnik, V.N. (1999). Statistical Learning Theory. New York: Springer-Verlag
[25] West, D. (2000). Neural network credit
scoring models. Computers & Operations
Research, 27: 1131-1152
Wenbing Xiao is a doctoral student of Institute
of Control Science & System Engineering at
Huazhong University of Science and Technology,
China. His research interests include financial
forecasting and modeling, decision support
system, data mining and machine learning. He
received the M.S. degree in Mathematics &
Computer from Hunan Normal University (2004).
Qian Zhao is a doctoral student in School of
Economics at Renmin University of China. She
received her M.S. in mathematics from Hunan
Normal University in 2004. Her current research
interests include financial forecasting and
modeling, data mining and energy economics.
She has published in Advances in Mathematics,
Chinese Journal of Management Science.
Qi Fei is a professor of Institute of Control
Science & Systems Engineering at Huazhong
University of Science and Technology, China.
His research interests include complexity theory, decision support systems and decision analysis.
He received the B.S. degree in Control Science
and Engineering at Harbin Institute of
Technology (1961).