J Syst Sci Syst Eng(Dec 2006) 15(4): 419-435 ISSN: 1004-3756 (Paper) 1861-9576 (Online) DOI: 10.1007/s11518-006-5023-5 CN11-2983/N
© Systems Engineering Society of China & Springer-Verlag 2006
A COMPARATIVE STUDY OF DATA MINING METHODS IN CONSUMER LOANS CREDIT SCORING MANAGEMENT∗
Wenbing XIAO 1   Qian ZHAO 2   Qi FEI 3
1 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
  [email protected]
2 School of Economics, Renmin University of China, Beijing 100872, China
  [email protected]
3 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
Abstract
Credit scoring has become a critical and challenging management science issue as the credit
industry has been facing stiffer competition in recent years. Many classification methods have been
suggested to tackle this problem in the literature. In this paper, we investigate the performance of
various credit scoring models and the corresponding credit risk cost for three real-life credit scoring
data sets. Besides the well-known classification algorithms (e.g. linear discriminant analysis, logistic
regression, neural networks and k-nearest neighbor), we also investigate the suitability and
performance of some recently proposed, advanced data mining techniques such as support vector
machines (SVMs), classification and regression tree (CART), and multivariate adaptive regression
splines (MARS). The performance is assessed by using the classification accuracy and cost of credit
scoring errors. The experimental results show that SVM, MARS, logistic regression and neural networks
yield very good performance. However, the explanatory capability of CART and MARS
outperforms that of the other methods.
Keywords: Data mining, credit scoring, classification and regression tree, support vector machines,
multivariate adaptive regression splines, credit-risk evaluation
∗ This work was supported in part by National Science Foundation of China under Grant No. 70171015
1. Introduction

Data mining (DM), sometimes referred to as
knowledge discovery in databases (KDD), is a
systematic approach to find underlying patterns,
trends, and relationships buried in data. Data
mining has drawn much attention from both
researchers and practitioners due to its wide
applications in crucial business decisions.
Basically, the research on DM can be classified
into two categories: methodologies and
technologies. According to Curt (1995), the
technology part of DM consists of techniques
such as statistical methods, neural networks,
decision trees, genetic algorithms, and
non-parametric methods. Among the
above-mentioned applications, classification
problems, in which observations are assigned to
one of several disjoint groups, have played
important roles in business decision making due
to their wide applications in decision support,
financial forecasting, fraud detection, marketing
strategy, and other related fields (Chen et al.
1996, Lee and Chen 2005, Tam and Kiang
1992).
Credit risk evaluation decisions are crucial
for financial institutions due to the severe impact
of loan default. It is an even more important task
today as the credit industry has been
experiencing serious competition during the past
few years. Credit scoring has gained more and
more attention as the credit industry has realized
the benefits of improving cash flow, ensuring
credit collections and reducing possible risks.
Hence, many different useful techniques, known
as the credit scoring models, have been
developed by banks and researchers in order to
solve the problems involved during the
evaluation process (Mester 1997). The objective
of credit scoring models is to assign credit
applicants to either a “good credit” group who
are likely to repay financial obligation, or a “bad
credit” group who are more likely to default on
the financial obligation. The applications of the
latter should be denied. Therefore, credit scoring
problems basically fall within the scope of the more
general and widely discussed classification
problems.
Usually, credit scoring is employed to rank
credit information based on the application form
details and other relevant information held by a
credit reference agency. As a result, accounts
with high probability of default can be
monitored and necessary actions can be taken in
order to prevent the account from entering
default. In response, statistical methods,
non-parametric methods, and artificial
intelligence approaches have been proposed to
support the credit approval decision process
(Desai et al., 1996, West, 2000).
Generally, linear discriminant analysis and
logistic regression are the two most commonly
used data mining techniques to construct credit
scoring models. However, linear discriminant
analysis (LDA) has often been criticized because
of the categorical nature of the credit data and
the fact that the covariance matrices of the good
and bad credit classes are not likely to be equal.
In addition to the LDA approach, logistic
regression is an alternative to conduct credit
scoring. A number of logistic regression models
for credit scoring applications have been
reported in the literature (Henley 1995).
However, logistic regression has also been
criticized for its strong model assumptions,
such as variance homogeneity, which have
limited its application in handling credit scoring
problems. Recently, neural networks have
provided an alternative to LDA and logistic
regression, particularly in situations where the
dependent and independent variables exhibit
complex nonlinear relationships. Even though it
has been reported that neural networks have
better credit scoring capability than LDA and
logistic regression (Desai et al. 1996), neural
networks have also been criticized for the long
training process involved in designing the optimal
network topology, difficulty in identifying the
relative importance of potential input variables,
and certain interpretive difficulties which have
limited their applicability in handling credit
scoring problems. Hence, the issue of which
classification technique to be used for credit
scoring remains a very difficult and challenging
problem. In this paper, we conduct a
benchmarking study of various classification
techniques on three real-life credit data sets.
Techniques that will be implemented are logistic
regression, linear discriminant analysis, SVMs,
neural networks, KNN, CART and MARS. All
techniques will be evaluated in terms of the
percentage of correctly classified observations
and misclassification cost.
This paper is organized as follows. We begin
with a short overview of the classification
techniques used in Section 2. Data sets and
experimental design are presented in Section 3.
Section 4 gives the empirical results and
discussion for the three real credit scoring data sets,
including classification performance, the costs
of credit scoring errors and explanatory ability
of credit scoring models. Section 5 addresses the
conclusion and discusses possible future
research areas.
2. Literature Review
2.1 Linear Discriminant Analysis and Logistic Regression Models
Linear discriminant analysis involves a
linear combination of the two (or more)
independent variables that best differentiates
between the a priori defined groups. This is
achieved by the statistical decision rule of
maximizing the between-group variance relative
to the within-group variance; this relationship is
expressed as the ratio of between-group to
within-group variance. The linear combination
for a discriminant analysis is derived from an
equation of the form

Z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n    (1)

where Z is the discriminant score, w_i (i = 1, 2, \ldots, n)
are the discriminant weights,
and x_i (i = 1, 2, \ldots, n) are the independent variables
(Altman 1968, Jo, Han and Lee 1997).
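As a minimal sketch of how the discriminant score in Equation (1) separates two groups, the following computes the classical Fisher weights w = S_W^{-1}(m_1 - m_2) on invented two-variable data; the data points and the midpoint cutoff are illustrative assumptions, not part of the paper's method:

```python
import numpy as np

# Illustrative "good" and "bad" credit groups (made-up two-variable records).
good = np.array([[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
bad  = np.array([[6.0, 1.0], [7.0, 2.0], [8.0, 1.5]])

m1, m2 = good.mean(axis=0), bad.mean(axis=0)
# Pooled within-group scatter matrix S_W (sum of the group scatter matrices).
S_w = (np.cov(good, rowvar=False) * (len(good) - 1)
       + np.cov(bad, rowvar=False) * (len(bad) - 1))
w = np.linalg.solve(S_w, m1 - m2)   # discriminant weights w_1, ..., w_n

z_good = good @ w                    # discriminant scores Z for each applicant
z_bad = bad @ w
# A simple midpoint cutoff between the two group means of Z.
cutoff = 0.5 * (z_good.mean() + z_bad.mean())
```

On this toy data every "good" score falls on one side of the cutoff and every "bad" score on the other, which is exactly the separation the between-group/within-group variance ratio maximizes.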
Logistic regression (Logistic) analysis has
also been used to investigate the relationship
between binary or ordinal response probabilities
and explanatory variables. The method fits a linear
logistic regression model to binary or ordinal
response data by the method of maximum
likelihood. The advantage of this method is that
it does not assume multivariate normality and
equal covariance matrices as LDA does. The
logistic regression approach to classification
(Logistic) tries to estimate the probability
P(y = 1 | x) as follows:

P(y = 1 | x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \cdots + w_n x_n)}}    (2)

where x \in \mathbb{R}^n is the n-dimensional input
vector, w_i is the parameter vector, and the scalar
w_0 is the intercept. The parameters
w_0 and w_i are then typically estimated using the
maximum likelihood procedure (Hosmer 2000,
Thomas 2000).
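Equation (2) can be sketched directly; the intercept and weights below are invented for illustration rather than fitted to any of the three credit data sets:

```python
import math

# P(y = 1 | x) = 1 / (1 + exp(-(w0 + w . x))), as in Equation (2).
def p_good(x, w0, w):
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

w0, w = -1.0, [0.8, -0.5]        # hypothetical intercept and weight vector
p = p_good([2.0, 1.0], w0, w)    # scored probability of "good credit" for one applicant
```

With z = -1.0 + 0.8*2.0 - 0.5*1.0 = 0.1, the applicant scores just above 0.5, i.e. marginally on the "good credit" side.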
2.2 Support Vector Machines Models

A simple description of the SVM algorithm
is provided as follows. Given a training set
D = \{x_i, y_i\}_{i=1}^{N} with input vectors
x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T \in \mathbb{R}^n and target labels
y_i \in \{-1, +1\}, the support vector machine (SVM)
classifier, according to Vapnik's original
formulation, satisfies the following conditions:

\begin{cases} w^T \phi(x_i) + b \ge +1, & \text{if } y_i = +1 \\ w^T \phi(x_i) + b \le -1, & \text{if } y_i = -1 \end{cases}    (3)

which is equivalent to

y_i [w^T \phi(x_i) + b] \ge 1, \quad i = 1, \ldots, N    (4)

where w represents the weight vector and b the
bias. The nonlinear function \phi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_k} maps the
input or measurement space to a high-
dimensional, and possibly infinite-dimensional,
feature space. Equation (4) then comes down to
constructing two parallel bounding hyperplanes
at opposite sides of a separating
hyperplane w^T \phi(x) + b = 0 in the feature space,
with the margin width between both hyperplanes
equal to 2 / \|w\|_2. In the primal weight space, the
classifier then takes the decision function form

\text{sgn}(w^T \phi(x) + b)    (5)
Most classification problems are, however,
not linearly separable. Therefore, one generally
finds the weight vector using slack variables \xi_i
to permit misclassification. One defines the
primal optimization problem as

\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i    (6)

subject to

\begin{cases} y_i [w^T \phi(x_i) + b] \ge 1 - \xi_i, & i = 1, \ldots, N \\ \xi_i \ge 0, & i = 1, \ldots, N \end{cases}    (7)

where the \xi_i are slack variables needed to allow
misclassifications in the set of inequalities, and
C \in \mathbb{R}^+ is a tuning hyperparameter weighting
the importance of classification errors against the
margin width. The solution of the primal
problem is obtained after constructing the
Lagrangian. From the conditions of optimality,
one obtains a quadratic programming (QP)
problem with Lagrange multipliers \alpha_i. A
multiplier \alpha_i exists for each training data instance.
Data instances corresponding to non-zero \alpha_i
are called support vectors.
On the other hand, the above primal problem
can be converted into the following dual
problem with objective function (8) and
constraints (9). Since the decision variables are
the Lagrange multipliers of the support vectors, it is
easier to interpret the results of this dual
problem than those of the primal one.

\max_{\alpha} \; e^T \alpha - \frac{1}{2} \alpha^T Q \alpha    (8)

subject to

\begin{cases} 0 \le \alpha_i \le C, & i = 1, \ldots, N \\ y^T \alpha = 0 \end{cases}    (9)

In the dual problem above,
e = (1, 1, \ldots, 1)^T \in \mathbb{R}^N, Q is an N \times N positive
semi-definite matrix with Q_{ij} = y_i y_j K(x_i, x_j), and
K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j) is the kernel. Here, the
training vectors x_i are mapped into a higher
(maybe infinite) dimensional space by the
function \phi. As is typical for SVMs, we never
calculate w or \phi(x). This is made possible by
Mercer's condition, which relates the mapping
function \phi(x) to the kernel function K(\cdot, \cdot) as
follows:

K(x_i, x_j) = \phi(x_i)^T \phi(x_j)    (10)

For the kernel function K(\cdot, \cdot), one typically has
several design choices, such as the linear kernel
K(x_i, x_j) = x_i^T x_j; the polynomial kernel of
degree d, K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \gamma > 0;
the radial basis function (RBF) kernel
K(x_i, x_j) = \exp\{-\gamma \|x_i - x_j\|^2\}, \gamma > 0; and the
sigmoid kernel K(x_i, x_j) = \tanh\{\gamma x_i^T x_j + r\},
where d, r \in \mathbb{N} and \gamma \in \mathbb{R}^+ are constants. Then
one constructs the final SVM classifier as

\text{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right)    (11)
The details of the optimization are discussed in
(Vapnik 1999, Gunn 1998, Cristianini 2000).
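The kernel design choices listed above can be sketched directly; the hyperparameter values for γ, r and d are illustrative, and the eigenvalue check merely demonstrates the positive semi-definiteness that Mercer's condition guarantees for a Gram matrix built like Q in Equation (8):

```python
import numpy as np

# The four kernel choices named in the text; gamma, r, d are illustrative values.
def k_lin(xi, xj):
    return xi @ xj
def k_poly(xi, xj, g=1.0, r=1.0, d=2):
    return (g * (xi @ xj) + r) ** d
def k_rbf(xi, xj, g=0.5):
    return np.exp(-g * np.sum((xi - xj) ** 2))
def k_sig(xi, xj, g=0.1, r=0.0):
    return np.tanh(g * (xi @ xj) + r)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
# Gram matrix for the RBF kernel; by Mercer's condition it is positive semi-definite.
G = np.array([[k_rbf(a, b) for b in X] for a in X])
eigs = np.linalg.eigvalsh(G)   # all eigenvalues should be >= 0 (up to round-off)
```

The RBF kernel is the one whose SVM variant (Rbf-SVM) appears in the experiments of Section 4.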
2.3 Neural Networks Models (BPN, RBF and FAR)

A neural network model involves
constructing computers with architectures and
processing capabilities that mimic certain
processing capabilities of the human brain. A
neural network model is composed of neurons,
the processing elements. These elements are
inspired by biological nervous systems. Each of
the neurons receives inputs, and delivers a single
output. Thus, a neural network model is a
collection of neurons that are grouped in layers
such as the input layer, the hidden layer, and the
output layer. Several hidden layers can be placed
between the input and the output layers. We will
discuss the BPN in more detail because it is the
most popular NN for classification.
A simple back-propagation network (BPN)
model consists of three layers: the input layer,
the hidden layer, and the output layer. The
input-layer processes the input variables, and
provides the processed values to the hidden layer.
The hidden layer further processes the
intermediate values, and transmits the processed
values to the output layer. The output layer
corresponds to the output variables of the
back-propagation neural network model. A
three-layer back-propagation neural network
(BPN) is shown in Figure 1. For the details of
neural networks, readers are referred to
(West 2000, Bishop 1995).
Figure 1  A three-layer back-propagation neural network (input, hidden, and output layers)
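A minimal sketch of the forward pass through the three layers of Figure 1; the weights here are fixed invented values, whereas a real BPN would learn them by back-propagation:

```python
import numpy as np

def sigmoid(z):
    # sigmoid transfer function used in the hidden and output layers
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])           # input layer: three applicant features
W1 = np.array([[0.1, 0.4, -0.2],         # input -> hidden weights (two hidden nodes)
               [-0.3, 0.2, 0.5]])
b1 = np.array([0.0, 0.1])
W2 = np.array([[0.7, -0.6]])             # hidden -> output weights (one output node)
b2 = np.array([0.2])

h = sigmoid(W1 @ x + b1)                 # hidden-layer activations
out = sigmoid(W2 @ h + b2)[0]            # output node: a score in (0, 1)
```

The input layer passes the processed variables to the hidden layer, and the hidden layer's activations are combined again at the output node, exactly the layered flow the figure depicts.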
Radial Basis Function (RBF) networks
(Moody and Darken 1989) have a static
Gaussian function as the non-linearity for the
hidden layer processing elements. The Gaussian
function responds only to a small region of the
input space where the Gaussian is centered. The
key to a successful implementation of these
networks is to find suitable centers for the
Gaussian functions. This can be done with
supervised learning, but an unsupervised
approach usually produces better results. The
advantage of radial basis function networks
is that they find the input-to-output map using
local approximators. Usually the supervised
segment is simply a linear combination of the
approximators. Since linear combiners have few
weights, these networks train extremely fast and
require fewer training samples.
The fuzzy ART (FAR) network (West 2000) is
a dynamic network that incorporates
computations from fuzzy set theory into the
adaptive resonance theory (ART). The typical
FAR network consists of two totally
interconnected layers of neurons, identified as
the complement layer and the category layer, in
addition to the input and output layers. When an
input vector is applied to the network, it creates
a short-term activation of the neurons in the
complement layer. This activity is transmitted
through the weight vector to neurons in the
category layer. Each neuron in the category layer
then calculates the inner product of the
respective weights and input values. These
calculated values are then resonated back to the
complement layer.
2.4 Multivariate Adaptive Regression Splines

MARS was first proposed by Friedman (1991,
1995) as a flexible procedure which models
relationships that are nearly additive or involve
interactions with fewer variables. The modeling
procedure is inspired by the recursive
partitioning technique governing classification
and regression tree (CART) (Breiman et al. 1984)
and generalized additive modeling, resulting in a
model that is continuous with continuous
derivatives. It excels at finding optimal variable
transformations and interactions, and at handling
the complex data structures that often hide in
high-dimensional data; hence it can
effectively uncover important data patterns and
relationships that are difficult, if not impossible,
for other methods to reveal.
MARS essentially builds flexible models by
fitting piecewise linear regressions; that is, the
nonlinearity of a model is approximated through
the use of separate regression slopes in distinct
intervals of the predictor variable space.
Therefore the slope of the regression line is
allowed to change from one interval to the other
as the ‘knot’ points are crossed. The variable
to use and the end points of the intervals for
each variable are found via a fast but intensive
search procedure. In addition to searching
variables one by one, MARS also searches for
interactions between variables, allowing any
degree of interaction to be considered.
The general MARS function can be
represented using the following equation:

\hat{f}(x) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} [s_{km}(x_{v(k,m)} - t_{km})]_+    (12)

where a_0 and a_m are parameters, M is the
number of basis functions, K_m is the number of
knots, s_{km} takes the value +1 or -1 and
indicates the right/left sense of the associated
step function, v(k, m) is the label of the
independent variable, and t_{km} indicates the knot
location.
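The hinge factors [s(x - t)]_+ of Equation (12) can be sketched for a single predictor with one knot; the knot location and the coefficients are invented for illustration:

```python
import numpy as np

def hinge(x, t, s):
    # one MARS basis factor: [s * (x - t)]_+ with sign s and knot t
    return np.maximum(s * (x - t), 0.0)

x = np.linspace(0.0, 10.0, 5)    # a single predictor variable
t = 4.0                          # knot location
# Piecewise-linear fit whose slope changes as the knot is crossed:
# slope -2 to the left of t, slope +0.5 to the right, intercept 1 at the knot.
f = 1.0 + 0.5 * hinge(x, t, +1) + 2.0 * hinge(x, t, -1)
```

The pair of mirrored hinges gives exactly the separate regression slopes in distinct intervals described above; MARS builds its model from products of such factors.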
The optimal MARS model is selected in a
two-stage process. Firstly, MARS constructs a
very large number of basis functions to overfit
the data initially, where variables are allowed to
enter as continuous, categorical, or ordinal (the
formal mechanism by which variable intervals
are defined), and they can interact with each
other or be restricted to enter only as additive
components. In the second stage, basis functions
are deleted in order of least contribution using
the generalized cross-validation (GCV) criterion.
A measure of variable importance can be
assessed by observing the decrease in the
calculated GCV values when a variable is
removed from the model. The GCV can be
expressed as follows:

LOF(\hat{f}_M) = GCV(M) = \frac{\frac{1}{N} \sum_{i=1}^{N} [y_i - \hat{f}_M(x_i)]^2}{[1 - C(M)/N]^2}    (13)

where there are N observations, and C(M) is
the cost-penalty measure of a model containing
M basis functions (therefore the numerator
measures the lack of fit on the M basis function
model \hat{f}_M(x_i) and the denominator denotes the
penalty for model complexity C(M)). Missing
values can also be handled in MARS by using
dummy variables indicating the presence of the
missing values. By allowing for any arbitrary
shape for the function and interactions, and by
using the above-mentioned two-stage model
building procedure, MARS is capable of reliably
tracking the very complex data structures that
often hide in high-dimensional data. Please refer
to Friedman (1991, 1995) for more details
regarding the model building process.
2.5 k-Nearest-Neighbor-Classifiers and CART Model
k-Nearest-neighbor classifiers (KNN) (Henley
and Hand 1996) classify a data instance by
considering only the k-most similar data
instances in the training set. The class label is
then assigned according to the class of the
majority of the k nearest neighbors. Ties can be
avoided by choosing k odd. One commonly
opts for the Euclidean distance as the similarity
measure:

d(x_i, x_j) = \|x_i - x_j\| = [(x_i - x_j)^T (x_i - x_j)]^{1/2}    (14)
where x_i, x_j \in \mathbb{R}^n are the input vectors of data
instances i and j, respectively. Note that
more advanced distance measures have also been
proposed in the literature.
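The k-NN rule with the Euclidean distance of Equation (14) can be sketched as follows; the two clusters and the query points are invented, and k = 3 is chosen odd to avoid ties, as noted above:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, query, k=3):
    d = np.sqrt(np.sum((X - query) ** 2, axis=1))  # Euclidean distances, Eq. (14)
    nearest = np.argsort(d)[:k]                    # indices of the k closest instances
    # majority vote among the k nearest neighbors
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # a "good credit" cluster
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.1]])   # a "bad credit" cluster
y = ['good', 'good', 'good', 'bad', 'bad', 'bad']

label = knn_predict(X, y, np.array([0.05, 0.1]))
```

A query near the first cluster is voted "good" by its three nearest neighbors; one near the second cluster is voted "bad".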
Classification and regression tree (CART), a
statistical procedure introduced by Breiman et al.
(1984), is primarily used as a classification tool,
where the objective is to classify an object into
two or more populations. As the name suggests,
CART is a single procedure that can be used to
analyze either categorical or continuous data
using the same technology. The methodology
outlined in Breiman et al. can be summarized
into three stages. The first stage involves
growing the tree using a recursive partitioning
technique to select variables and split points
using a splitting criterion. Several criteria are
available for determining the splits, including
Gini, twoing and ordered twoing. For a detailed
description of these criteria, readers can
refer to Breiman et al. In addition to selecting
the primary variables, surrogate variables, which
are closely related to the original splits and may
be used in classifying observations having
missing values for the primary variables, can be
identified and selected.
After a large tree is identified, the second
stage of the CART methodology uses a pruning
procedure that incorporates a minimal cost
complexity measure. The result of the pruning
procedure is a nested subset of trees starting
from the largest tree grown and continuing the
process until only one node of the tree remains.
Cross-validation or a testing sample will be used
to provide estimates of future classification
errors for each subtree. The last stage of the
methodology is to select the optimal tree, which
corresponds to the tree yielding the lowest error
rate on the cross-validation or testing set. Please refer
to Breiman et al. (1984) and Steinberg and Colla
(1997) for more details regarding the model
building process of CART.
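The Gini splitting criterion used in the tree-growing stage can be sketched as an impurity-decrease computation; the candidate split and labels are invented for illustration (CART's actual implementation evaluates many such candidate splits and keeps the best):

```python
def gini(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions
    n = len(labels)
    if n == 0:
        return 0.0
    p_good = labels.count('good') / n
    return 1.0 - p_good ** 2 - (1.0 - p_good) ** 2

parent = ['good'] * 4 + ['bad'] * 4
left, right = ['good'] * 4, ['bad'] * 4    # a perfectly separating candidate split
n = len(parent)
# Impurity decrease of the split; the split maximizing this is selected.
gain = (gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right))
```

A perfect split of a balanced node yields the maximum possible decrease of 0.5; an uninformative split would yield a gain near zero.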
3. Data Sets and Experimental Design

The German and Australian credit data sets
are publicly available at the UCI repository
(http://kdd.ics.uci.edu). Dr. Hans Hofmann of
the University of Hamburg contributed the
German credit scoring data. It consists of 700
examples of creditworthy applicants and 300
examples where credit should not be extended.
For each applicant, 24 variables described credit
history, account balances, loan purpose, loan
amount, employment status, personal
information, age, housing, and job. The
Australian credit scoring data set is similar but
more balanced, with 307 and 383 examples of
each outcome. The data set contains a mixture of
six continuous and eight categorical variables.
The third credit data set is from major financial
institutions in the US, comprising 1225
applications: 902 examples of
creditworthy applicants and 323 examples of
non-creditworthy applicants. This data set also
includes 14 attributes. To protect the
confidentiality of these data, attribute names and
values of data sets have been changed to
symbolic data.
To minimize the impact of data dependency
and improve the reliability of the resultant
estimates, 10-fold cross validation is used to
create random partitions of the raw data sets.
Each of the 10 random partitions serves as an
independent holdout test set for the credit
scoring model trained with the remaining nine
partitions. The training set is used to establish
the credit scoring model's parameters, while the
independent test sample is used to test the
generalization capability of the model. The
overall scoring accuracy reported is an average
across all ten test set partitions.
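The 10-fold protocol described above can be sketched as follows; the index array and the per-fold "accuracy" placeholder are stand-ins, since the actual credit records and scoring models are not reproduced here:

```python
import numpy as np

# 10 random partitions of the record indices; each serves once as the holdout set.
rng = np.random.default_rng(0)
n_records, k = 1000, 10            # e.g. the German set has 1000 applicants
idx = rng.permutation(n_records)
folds = np.array_split(idx, k)

accuracies = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # ... fit the scoring model on train_idx, score it on test_idx ...
    accuracies.append(len(test_idx) / n_records)   # placeholder per-fold figure

overall = sum(accuracies) / k      # reported result: average over the ten test folds
```

Every record appears in exactly one holdout fold, which is what reduces the data-dependency of the resulting accuracy estimate.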
The topic of choosing the appropriate class
distribution for classifier learning has received
much attention in the literature. In this study, we
dealt with this problem by using a variety of
class distributions, ranging from 55.5/44.5 for the
Australian credit data set to 73.6/26.4 for the
American credit data set. The LDA, Logistic,
CART, KNN and MARS classifiers require no
parameter tuning. For the SVM classifiers, we
used the LIBSVM toolbox 2.8 and adopted a grid
search mechanism to tune the parameters. For the
BPN classifiers, we adopted a standard
three-layer architecture. The numbers of input and
output nodes equaled the numbers of input and output
variables, respectively. Nodes in the hidden and
output layers use the sigmoid transfer
function. Since it is difficult to determine the optimal
network for the test data so as to guarantee
generalization performance, the number of
hidden nodes for the three data sets was varied
between 8 and 30 and the network with the best
training set performance was selected for test set
evaluation. The NN analyses were conducted
using the Neural Networks toolbox 4.0
(http://www.mathworks.com). The CART 4.0 and
MARS 2.0 evaluation versions
(http://www.salford-systems.com), provided by
Salford Systems, were used to build the CART and
MARS credit scoring models. The SVM analyses
were conducted using the LIBSVM toolbox 2.8
(Chang and Lin 2001).
4. Results and Discussion

The results for each credit-scoring model are
reported in Table 1 for the German,
Australian and American credit data. These
results are averages of accuracy determined for
each of the 10 independent test data set
partitions used in the cross validation
methodology. Since the training of any neural
networks model is a stochastic process, the
network accuracy determined for each data set
partition is itself an average of 10 repetitions.
Table 1  10-fold cross validation test set classification accuracy on credit scoring data sets

          German credit data (%)     Australian credit data (%)  American credit data (%)
          Goods   Bads   Overall     Goods   Bads   Overall      Goods   Bads   Overall
RBF       86.5    48.0   74.6        86.8    87.2   87.1         88.5    24.2   71.3
BPN       86.4    42.5   73.3        84.6    86.7   85.8         88.1    22.9   70.9
FAR       60.0    51.2   57.3        74.4    76.2   75.4         N/A     N/A    N/A
LDA       72.3    73.3   72.6        81.0    92.2   85.9         65.4    56.0   62.9
LOGIT     88.1    48.7   76.3        85.9    89.0   87.2         95.9    11.2   73.5
KNN       77.5    44.7   67.6        84.7    86.7   85.8         78.4    30.1   66.1
Kernel    84.5    37.0   70.2        81.4    84.8   84.4         N/A     N/A    N/A
CART      71.2    69.4   70.5        79.9    92.5   85.5         59.3    59.4   59.3
MARS      89.0    66.0   74.9        86.3    88.3   87.4         89.7    20.2   71.4
Lin-SVM   88.9    49.1   77.0        79.9    92.5   85.5         88.9    22.0   71.3
Pol-SVM   88.5    48.6   76.5        83.8    88.6   85.5         89.9    18.3   71.0
Rbf-SVM   88.7    49.7   77.1        80.5    93.0   85.8         89.4    22.6   71.8
Sig-SVM   89.0    50.0   77.2        80.5    92.0   85.6         89.6    21.1   71.5

Neural networks results are averages of 10 repetitions. N/A: not tested.
It is evident from Table 1 that Sig-SVM has
the highest overall credit scoring accuracy of
77.2% for German credit data, while the
Lin-SVM, Pol-SVM and Rbf-SVM have credit
scoring accuracies of 76.5% to 77.1%. Closely
following SVM is Logistic regression with an
overall accuracy of 76.3%, and MARS with
74.9%. Linear discriminant analysis has
accuracy of 72.6%, which is 3.7% less accurate
than logistic regression. A strength of the linear
discriminant model for this data, however, is its
significantly higher accuracy than any other
model at identifying bad credit risks. This is likely
due to the assumption of equal prior
probabilities used to develop the linear
discriminant model. It is also interesting to note
that the most commonly used neural network
architecture, BPN with accuracy 73.3%, is
comparable to linear discriminant analysis with
an accuracy of 72.6%. The K-NN, kernel density
and CART models have overall accuracy levels of
67.6%, 70.2% and 70.5%, respectively. The least
accurate method for the German credit scoring
data is the FAR neural networks model at
57.3%.
For the Australian credit data, MARS has the
top overall credit scoring accuracy of 87.4%,
followed closely by the Logistic regression
(87.2%) and NN (RBF) (87.1%). The BPN (85.8
%) and LDA (85.9 %) are again comparable
from an overall accuracy consideration. The
KNN, CART, BPN, LDA and SVM models have
overall credit scoring errors that are more than
0.01 greater than those of the MARS, logistic regression, and
RBF neural models. The FAR neural networks
and kernel density model overall accuracy are
75.4% and 84.4%, respectively.
For the American credit data, Logistic
regression has the top overall credit scoring
accuracy of 73.5%, followed closely by the
Rbf-SVM (71.8%) and Sig-SVM (71.5%). The
Lin-SVM and Pol-SVM are grouped at
accuracy levels from 71.0% to 71.3%. The BPN,
K-NN and MARS models have overall accuracy levels
of 70.9%, 66.1% and 71.4%, respectively. The least
accurate method for the American credit scoring
data is the CART model (kernel density and
FAR weren't tested on the American credit data).
However, we note that CART has the lowest
error rate (40.6%) in all models identifying bad
credit risks, followed closely by the LDA
(44.0%).
To further support these conclusions, we test
for statistically significant differences between
the credit scoring models. We have used a special
notational convention whereby the best three of
the overall accuracy is underlined and denoted
in bold face for each data. For cross validation
studies of supervised learning algorithms,
Dietterich (1998) recommends McNemar's test,
which is used in this paper to establish
statistically significant differences between
credit scoring models. McNemar's test is a
chi-square statistic calculated from a 2 × 2
contingency table. The diagonal elements of the
contingency table are counts of the number of
credit applications misclassified by both
models, n_{00}, and the number correctly classified
by both models, n_{11}. The off-diagonal elements
are counts of the applications classified incorrectly by
Model A and correctly by Model B, n_{01}, and
conversely the applications classified incorrectly by
Model B and correctly by Model A, n_{10}.
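A sketch of the continuity-corrected McNemar statistic built from the two off-diagonal counts; the disagreement counts below are invented, and 3.84 is the chi-square critical value for p = 0.05 with one degree of freedom:

```python
def mcnemar_chi2(n01, n10):
    # McNemar's statistic with continuity correction, computed from the
    # off-diagonal disagreement counts of the 2x2 contingency table.
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

chi2 = mcnemar_chi2(n01=40, n10=20)   # hypothetical disagreement counts
significant = chi2 > 3.84             # reject "equal accuracy" at p = 0.05
```

Only the cases where the two models disagree carry information about which is more accurate, which is why the diagonal counts n_{00} and n_{11} do not enter the statistic.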
Results of McNemar's test with p = 0.05 are
given in Table 2. All credit scoring models are
tested for significant differences with the most
accurate model in the data set. A model whose
overall credit scoring accuracy is not significantly
different from the most accurate model is labeled a
superior model; those that are significantly less
accurate are labeled inferior models. It is
evident from Table 2 that the SVM, Logistic
regression, NN (RBF) and MARS models are
superior ones for three credit scoring data sets
and the LDA, KNN and CART models are
superior for only the Australian credit data.
4.1 Cost of Credit Scoring Errors

This subsection considers the costs of credit
scoring errors and their impact on model
selection. It is evident that the individual group
(bad or good) accuracy of the credit scoring
model can vary widely. For the German credit
data, all models except LDA are much less
accurate at classifying bad credit risks than good
credit risks. Most pronounced is the accuracy of
logistic regression with an error of 0.1186 for
good credit and 0.5113 for bad credit. In credit
Table 2  Statistically significant differences, credit scoring models

                  German credit data      Australian credit data   American credit data
Superior models   RBF, MARS,              RBF, SVM,                RBF, BPN,
                  Logistic regression,    Logistic regression,     Logistic regression,
                  SVM                     MARS, LDA,               SVM, MARS
                                          KNN, CART
Inferior models   FAR, BPN,               FAR,                     LDA, CART,
                  LDA, KNN,               Kernel density           KNN
                  Kernel density,
                  CART

Statistical significance established with McNemar's test, p = 0.05; kernel density and FAR weren't tested for the American credit data.
scoring applications, it is generally believed that
the cost of granting credit to a bad risk
candidate, denoted by C_{12}, is significantly
greater than the cost of denying credit to a good
risk candidate, denoted by C_{21}. In this situation
it is important to rate the credit scoring models
with the cost function defined in Equation (15)
rather than relying on the overall classification
accuracy. To illustrate the cost function, the relative
costs of misclassification suggested by Dr.
Hofmann when he compiled the German credit
data are used: C_{12} is 5 and C_{21} is 1. Evaluation
of the cost function also requires estimates of the
prior probabilities of good credit, \pi_1, and bad
credit, \pi_2, in the application pool of the credit scoring
model. These prior probabilities are estimated
from reported default rates. For the year 1997,
6.48% of a total credit debt of $ 560 billion was
charged off (West 2000), while Jensen reports a
charge off rate of 11.2% fro credit applications
he investigated (Frydman et al. 1985). The error
rate for the bad credit group of the German
credit data (which averages about 0.45) is used
to establish a low value for 2π of 0.144
(0.0648/0.45) and a high value of 0.249
(0.112/0.45). The ratio 2 2/n N , in Equation (15)
measures the false positive rate, the proportion
of bad credit risks that are granted credit, while
the ration 1 1/n N measures the false negative
rate, or good credit risks denied credit by the
model.
Cost = C12·π2·(n2/N2) + C21·π1·(n1/N1)    (15)
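As a sanity check on Equation (15), the cost computation can be sketched in a few lines of Python. This is illustrative only; the error rates plugged in below are the logistic-regression figures for the German data quoted above:

```python
def expected_cost(fp_rate, fn_rate, pi_bad, c12=5.0, c21=1.0):
    """Expected misclassification cost per applicant, Equation (15).

    fp_rate: n2/N2, fraction of bad risks granted credit (false positives)
    fn_rate: n1/N1, fraction of good risks denied credit (false negatives)
    pi_bad:  prior probability of a bad applicant (pi_good = 1 - pi_bad)
    c12/c21: cost of accepting a bad risk / of rejecting a good risk
    """
    pi_good = 1.0 - pi_bad
    return c12 * pi_bad * fp_rate + c21 * pi_good * fn_rate

# Logistic regression on the German data: 0.5113 bad-group error and
# 0.1186 good-group error, the values quoted in the text.
low = expected_cost(fp_rate=0.5113, fn_rate=0.1186, pi_bad=0.144)
high = expected_cost(fp_rate=0.5113, fn_rate=0.1186, pi_bad=0.249)
print(round(low, 3), round(high, 3))  # close to the 0.471 / 0.728 entries of Table 3
```

The small residual gap to the table entries comes from rounding in the reported group error rates.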
Under these assumptions, the credit scoring
cost is reported for each model in Table 3. For
the German credit data, the MARS (0.413)
model is now slightly better than the LDA
(0.429) at the prior probability level of 14.4%
bad credit. At the higher level of 24.9% bad
credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.540. Closely following LDA are MARS, with an overall cost of 0.571, and CART, with 0.597. For the Australian credit data, the costs of all models are nearly identical at both levels of π2. The Logistic (0.200) model is now slightly better than MARS (0.202) at the prior probability level of 14.4% bad credit. At the higher level of 24.9% bad credit, the Rbf-SVM is clearly the best model from an overall cost perspective with a score of 0.234. Closely following Rbf-SVM are LDA and Logistic, with overall costs of 0.239 and 0.244, respectively. For the American credit data, the LDA (0.613) model is now slightly better than
Table 3 Credit scoring models misclassification cost

          German credit data     Australian credit data   American credit data
          π2=0.144   π2=0.249    π2=0.144   π2=0.249      π2=0.144   π2=0.249
BPN       0.530      0.818       0.228      0.281         0.657      1.049
RBF       0.497      0.761       0.205      0.258         0.644      1.030
FAR       0.694      0.908       0.391      0.490         N/A        N/A
LDA       0.429      0.540       0.219      0.239         0.613      0.808
Logist    0.471      0.728       0.200      0.243         0.673      1.140
KNN       0.592      0.858       0.227      0.281         0.688      1.033
Kernel    0.587      0.901       0.268      0.329         N/A        N/A
CART      0.467      0.597       0.226      0.244         0.641      0.811
Lin-SVM   0.462      0.717       0.226      0.244         0.657      1.055
Pol-SVM   0.469      0.726       0.221      0.264         0.675      1.093
Rbf-SVM   0.459      0.711       0.217      0.234         0.648      1.043
Sig-SVM   0.454      0.705       0.225      0.246         0.657      1.060
MARS      0.413      0.571       0.202      0.249         0.663      1.071

N/A: not tested
Table 4 5-fold cross-validation test-set classification accuracy on the balanced credit scoring data sets under the new strategy

          German credit data (%)   Australian credit data (%)   American credit data (%)
          Goods  Bads  Overall     Goods  Bads  Overall         Goods  Bads  Overall
RBF       67.2   73.7  70.4        85.7   89.3  87.5            65.2   57.3  61.3
BPN       67.0   70.3  68.7        85.2   87.6  86.4            64.7   55.7  60.2
LDA       69.0   73.0  71.0        80.3   92.5  86.4            59.5   55.5  57.5
LOGIT     74.3   74.0  74.2        84.0   92.3  88.2            64.3   62.7  63.5
CART      68.0   69.7  68.8        80.7   93.3  87.0            66.7   54.7  61.3
MARS      66.0   79.0  72.5        84.0   91.0  87.5            66.3   50.7  58.5
Rbf-SVM   69.1   73.5  71.3        81.0   93.3  87.2            63.3   59.0  61.2

Neural networks results are averages of 10 repetitions.
the CART (0.641) model at the prior probability level of 14.4% bad credit, followed by RBF with a score of 0.644. At the higher level of 24.9% bad credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.808. Closely following LDA are CART, with an overall cost of 0.811, and RBF, with 1.030.
As Table 2 shows, the relative group classification accuracies of the neural network, SVM, logistic regression and MARS models are influenced by the imbalanced design of the training data. To improve their accuracy on bad credit risks, a new strategy is tested for these models' training sets. The strategy is to form new data sets from a balanced group of 300 good credit examples and 300 bad credit
examples for the different models. Each of these models is tested with 5-fold cross-validation. The accuracy results for the new strategy are summarized in Table 4. The new strategy yields the greatest improvement in the error for bad credit identification, with a reduction of approximately 20% for the German credit data and 30% for the American credit data. The overall error rate under the new strategy increases by 5% and 10% for these two data sets, respectively. For the Australian credit data, however, the overall error rate under the new strategy decreases by 0.5% to 2%.
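The balancing strategy above can be sketched as follows. The 300/300 split mirrors the text, but the paper does not describe the sampling mechanics, so the undersampling routine below is an assumption:

```python
import random

def balanced_sample(examples, labels, n_per_class=300, seed=42):
    """Undersample to n_per_class 'good' and n_per_class 'bad' examples."""
    rng = random.Random(seed)
    good = [x for x, y in zip(examples, labels) if y == "good"]
    bad = [x for x, y in zip(examples, labels) if y == "bad"]
    sample = rng.sample(good, n_per_class) + rng.sample(bad, n_per_class)
    sample_labels = ["good"] * n_per_class + ["bad"] * n_per_class
    return sample, sample_labels

def five_fold_indices(n, k=5, seed=42):
    """Shuffled index folds for k-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

# 700 good / 300 bad applicants, as in the German data set
X = list(range(1000))
y = ["good"] * 700 + ["bad"] * 300
Xb, yb = balanced_sample(X, y)        # 600 examples, 300 per class
folds = five_fold_indices(len(Xb))    # 5 disjoint folds of 120
```

Undersampling the majority class this way trades some overall accuracy for better bad-risk detection, which is exactly the effect Table 4 reports.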
4.2 A Comparison of the Explanatory Ability of Credit Scoring Models

This subsection considers the explanatory ability of the credit scoring models. Good explanatory ability is very important in credit scoring applications for explaining the rationale behind a decision to deny credit. Neural network and SVM models cannot explain how or why they identified a potential "bad" loan application. LDA and logistic regression models are better in this respect than SVM and neural networks. KNN and kernel density are inferior models with regard to explanatory ability. CART and MARS have the best explanatory ability. More detailed analyses of the explanatory ability of three of the models (neural networks, CART and MARS) on the German credit data follow.
4.2.1 Explanatory Ability of the Neural Networks Model for the German Credit Data
A key deficiency of any neural networks
model for credit scoring applications is the
difficulty in explaining the rationale for the
decision to deny credit. Neural networks are
usually thought of as black-box technology
devoid of any logic or rule-based explanations
for the output mapping. This is a particularly
sensitive issue in light of recent federal
legislation regarding discrimination in lending
practices. To address this problem, West (2000)
developed explanatory ability insights for the
neural network trained on the German credit
data. It is accomplished by clamping 23 of the
24 input values, varying the remaining input by
± 5%, and measuring the magnitude of the
impact on the two output neurons. The clamping
process is repeated until all network inputs have
been varied. A weight can now be determined
for each input that estimates its relative power in
determining the resultant credit decision. Please
refer to West (2000) for more details and results
regarding the model building process.
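West's clamping procedure can be sketched generically as follows. This is an illustrative reconstruction, not West's actual network: `model` stands in for any trained scoring function mapping a feature vector to a score.

```python
def input_sensitivity(model, x, delta=0.05):
    """Rank inputs by how much a +/-5% perturbation moves the model output.

    model: callable taking a list of feature values, returning a score
    x:     a representative input vector; while one input is varied,
           all the others stay clamped at their values in x
    Returns normalized relative importance weights, one per input.
    """
    base = model(x)
    impacts = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] = x[i] * (1 + delta)
        down[i] = x[i] * (1 - delta)
        # total magnitude of the output shift over both perturbations
        impacts.append(abs(model(up) - base) + abs(model(down) - base))
    total = sum(impacts) or 1.0
    return [w / total for w in impacts]

# Toy check: a linear "network" whose second input has 10x the weight
w = input_sensitivity(lambda v: 1.0 * v[0] + 10.0 * v[1], [1.0, 1.0])
```

For the toy linear model the recovered weights are in the ratio 1:10, matching the coefficients, which is the intuition behind using clamping as a post-hoc importance measure.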
4.2.2 Explanatory Ability of the CART Model for the German Credit Data
Figure 2 depicts the CART tree obtained on the testing sample with the popular 1-SE rule in the tree pruning procedure. It is observed from Figure 2 that A1, A3, A5 and A2 play important roles in the rule induction (Ai denotes the ith attribute, for i = 1, …, n; the same notation is used hereafter). It can also be observed from Figure 2 that if an observation's A1 is between 1.5 and 2.5, its A2 ≥ 22.5 and its A5 > 3.5, it falls into terminal node 11, whose classified class is class 1 (good customer). Unlike those of other classification techniques, the rules and terminal nodes read off the built tree are very easy to interpret, and hence marketing professionals can use them in designing proper managerial decisions. Furthermore, we
conclude by saying that CART is an effective and powerful management tool that allows us to build advanced and user-friendly decision-support systems for credit scoring management.
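The terminal-node-11 rule quoted above translates directly into code. A minimal sketch follows; the attribute names and the good/bad encoding come from the text, while the helper's name is ours:

```python
def classify_node11_path(a1, a2, a5):
    """Apply the Figure 2 path that reaches terminal node 11.

    Returns "good" when the applicant satisfies the quoted rule:
    1.5 < A1 <= 2.5, A2 >= 22.5 and A5 > 3.5. Otherwise this single
    rule does not fire and a full walk of the tree would be needed.
    """
    if 1.5 < a1 <= 2.5 and a2 >= 22.5 and a5 > 3.5:
        return "good"          # terminal node 11, class 1
    return "rule not matched"  # some other branch of the tree applies

print(classify_node11_path(2, 30, 4))  # -> good
```

This if-statement transparency is exactly the explanatory advantage CART holds over black-box models such as neural networks and SVMs.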
4.2.3 Explanatory Ability of the MARS Model for Credit Scoring
In order to demonstrate the explanatory
ability of MARS scoring models, the German
data will be used as an illustrative example. The
obtained basis functions and variable selection
results of the illustrative example are
summarized in Table 5. It is observed that A1, A2, A3, A4, A5, A8, A9, A15, A16, A17 and A20 play important roles in deciding the MARS
Figure 2 The tree of the CART credit scoring model (root split: A1 ≤ 2.5, N = 1000; further splits on A2, A3, A4, A5, A10 and A18; 12 terminal nodes, each labeled class 1 (good) or class 2 (bad))
Table 5 Variable selection results and basis functions of the MARS credit scoring model

Variable name   Relative importance (%)   Equation name   Equation
A1              100.00                    BF1             max(0, A1 − 1.000)
A2              57.31                     BF2             max(0, A2 − 4.000)
A3              51.82                     BF3             max(0, A3 − 0.180272E−06)
A5              40.44                     BF4             max(0, A5 − 1.000)
A4              36.67                     BF5             max(0, A4 − 36.000)
A16             32.33                     BF6             max(0, 36.000 − A4)
A9              30.43                     BF7             max(0, A16 + 0.180632E−07)
A17             28.86                     BF8             max(0, A15 − 1.000)
A15             27.52                     BF9             max(0, A20 + 0.182414E−07)
A20             27.34                     BF10            max(0, A17 − 0.376854E−08)
A6              29.1                      BF12            max(0, 4.000 − A6)
A8              16.95                     BF13            max(0, A9 − 1.000)
                                          BF14            max(0, A8 − 2.000)
                                          BF15            max(0, 2.000 − A8)

MARS prediction function:
Y = 1.358 − 0.096·BF1 + 0.007·BF2 − 0.058·BF3 − 0.032·BF4 + 0.002·BF5 + 0.005·BF6 + 0.098·BF7 − 0.192·BF8 + 0.094·BF9 − 0.129·BF10 + 0.040·BF12 + 0.040·BF13 − 0.026·BF14 − 0.095·BF15

In the MARS credit scoring model, Y = 0 (1) is defined to be a good (bad) credit customer.
credit scoring models. Moreover, according to the obtained basis functions and the MARS prediction function, it can be observed that high values of A2, A9, A16 and A20 tend to indicate a bad credit customer, while high values of A1, A3, A5, A15 and A17 tend to indicate a good credit customer. These conclusions from the basis functions and the MARS prediction function have important managerial implications, since they can help managers and professionals design appropriate loan policies for acquiring good credit customers.
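The prediction function in Table 5 can be evaluated directly. The sketch below hard-codes the published basis functions and coefficients; the sample applicant's attribute values are hypothetical:

```python
def mars_score(a):
    """Evaluate the Table 5 MARS prediction function.

    a: dict mapping attribute names ("A1", "A2", ...) to values.
    Returns Y, where Y near 0 suggests a good and Y near 1 a bad customer.
    """
    bf = {  # hinge basis functions from Table 5
        1: max(0, a["A1"] - 1.0),           2: max(0, a["A2"] - 4.0),
        3: max(0, a["A3"] - 0.180272e-6),   4: max(0, a["A5"] - 1.0),
        5: max(0, a["A4"] - 36.0),          6: max(0, 36.0 - a["A4"]),
        7: max(0, a["A16"] + 0.180632e-7),  8: max(0, a["A15"] - 1.0),
        9: max(0, a["A20"] + 0.182414e-7), 10: max(0, a["A17"] - 0.376854e-8),
        12: max(0, 4.0 - a["A6"]),          13: max(0, a["A9"] - 1.0),
        14: max(0, a["A8"] - 2.0),          15: max(0, 2.0 - a["A8"]),
    }
    coef = {1: -0.096, 2: 0.007, 3: -0.058, 4: -0.032, 5: 0.002, 6: 0.005,
            7: 0.098, 8: -0.192, 9: 0.094, 10: -0.129, 12: 0.040, 13: 0.040,
            14: -0.026, 15: -0.095}
    return 1.358 + sum(coef[k] * bf[k] for k in coef)

# Hypothetical applicant chosen so every basis function is (essentially)
# zero: Y stays at the intercept, 1.358, i.e. leaning "bad".
applicant = {"A1": 1, "A2": 4, "A3": 0, "A4": 36, "A5": 1, "A6": 4,
             "A8": 2, "A9": 1, "A15": 1, "A16": 0, "A17": 0, "A20": 0}
```

Raising an attribute with a positive net coefficient, such as A2, increases Y, which is how the "high A2 indicates a bad customer" reading above falls out of the function.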
5. Conclusions and Areas of Future Research

Credit scoring has become more and more important as the competition between financial institutions has intensified, and more and more companies are seeking better strategies with the help of credit scoring models. Hence, many modeling alternatives, such as traditional statistical methods, non-parametric methods and artificial intelligence techniques, have been developed in order to handle credit scoring tasks successfully. In this paper, we have studied the
performance of various classification techniques
for credit scoring. The experiments were
conducted on 3 real-life credit scoring data sets.
The classification performance was assessed by the percentage of correctly classified cases and by the misclassification cost.
It is found that each technique exhibits characteristics that may be interesting in the context of different data sets. Firstly, Logistic,
MARS, SVM and ANN (BPN and RBF)
classifiers yield very good performances in
terms of the classification ratio. However, it has
to be noted that LDA and CART were
significantly more accurate than any other model
in identifying bad credit risks for German and
American credit scoring data sets. Secondly, the
experiments clearly indicated that many
classification techniques yield performances
which are quite competitive with each other.
Only a few classification techniques (e.g. FAR
and kernel density) were clearly inferior to the
others. Besides, CART and MARS not only have lower Type II errors, which are associated with high misclassification costs, but also offer better evaluation reasoning and can help to structure the understanding of the prediction.
Starting from the findings of this study,
several interesting topics for future research can
be identified. One interesting topic may aim at
collecting more important variables in
improving the credit scoring accuracy. Another
promising avenue for future research is to
investigate the power of classifier ensembles
where multiple classification algorithms are
combined.
References

[1] Altman, E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23: 589-609
[2] Bishop, C.M. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press
[3] Breiman, L., Friedman, J.H., Olshen, R.A.
& Stone, C.J. (1984). Classification and
Regression Trees, Pacific Grove, CA:
Wadsworth
[4] Chen, M.S., Han, J. & Yu, P.S. (1996). Data
mining: an overview from a database
perspective. IEEE Transactions on
Knowledge and Data Engineering, 8(6):
866-883
[5] Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[6] Curt, H. (1995). The devil's in the detail: techniques, tools, and applications for database mining and knowledge discovery – Part 1. Intelligent Software Strategies, 6: 1-15
[7] Cristianini, N. & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge: Cambridge University Press
[8] Desai, V.S., Crook, J.N. & Overstreet, G.A.
(1996). A comparison of neural networks
and linear scoring models in the credit union
environment. European Journal of
Operational Research, 95(1): 24-37
[9] Dietterich, T.G. (1998). Approximate
statistical tests for comparing supervised
classification learning algorithms. Neural
Computation, 10: 1895-1923
[10] Friedman, J.H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19: 1-141
[11] Friedman, J.H. & Roosen, C.B. (1995). An introduction to multivariate adaptive regression splines. Statistical Methods in Medical Research, 4: 197-217
[12] Frydman, H.E., Altman, EI. & Kao, D.
(1985). Introducing recursive partitioning
for financial classification: the case of
financial distress. Journal of Finance, 40(1):
53-65
[13] Gunn, S.R. (ed.) (1998). Support Vector
Machines for Classification and Regression.
Technical Report, University of
Southampton
[14] Henley, W.E. (1995). Statistical aspects of
credit scoring. Dissertation, The Open
University, Milton Keynes, UK
[15] Henley, W.E. & Hand, D.J. (1996).
K-nearest neighbor classifier for assessing
consumer credit risk. Statistician, 44: 77-95
[16] Hosmer, D.W. & Lemeshow, S. (2000). Applied Logistic Regression. New York: John Wiley & Sons
[17] Jo, H., Han, I. & Lee, H. (1997).
Bankruptcy prediction using case-based
reasoning, neural networks, and
discriminant analysis. Expert Systems
Application, 13: 97-108
[18] Lee, T.S. & Chen, I.F. (2005). A two-stage
hybrid credit scoring model using artificial
neural networks and multivariate adaptive
regression splines. Expert Systems with
Applications, 28: 743-752
[19] Mester, L.J. (1997). What’s the point of
credit scoring? Business Review - Federal
Reserve Bank of Philadelphia. Sept/Oct:
3-16
[20] Moody, J. & Darken, C.J. (1989). Fast
learning in networks of locally tuned
processing units. Neural Computation, 3:
213-25
[21] Steinberg, D. & Colla, P. (ed.) (1997). Classification and Regression Trees. San Diego, CA: Salford Systems
[22] Thomas, L.C. (2000). A survey of credit and
behavioral scoring: Forecasting financial
risks of lending to customers. International
Journal of Forecasting, 16: 149-172
[23] Tam, K.Y. & Kiang, M.Y. (1992). Managerial applications of neural networks: the case of bank failure predictions. Management Science, 38(7): 926-947
[24] Vapnik, V.N. (1999). Statistical Learning Theory. New York: Springer-Verlag
[25] West, D. (2000). Neural network credit
scoring models. Computers & Operations
Research, 27: 1131-1152
Wenbing Xiao is a doctoral student of Institute
of Control Science & System Engineering at
Huazhong University of Science and Technology,
China. His research interests include financial
forecasting and modeling, decision support
system, data mining and machine learning. He
received the M.S. degree in Mathematics &
Computer from Hunan Normal University (2004).
Qian Zhao is a doctoral student in School of
Economics at Renmin University of China. She
received her M.S. in mathematics from Hunan
Normal University in 2004. Her current research
interests include financial forecasting and
modeling, data mining and energy economics.
She has published in Advances in Mathematics,
Chinese Journal of Management Science.
Qi Fei is a professor of Institute of Control
Science & Systems Engineering at Huazhong
University of Science and Technology, China.
His research interests include complexity theory, decision support systems and decision analysis.
He received the B.S. degree in Control Science
and Engineering at Harbin Institute of
Technology (1961).