comparisonofdataminingclassificationalgorithms...

Research ArticleComparison of Data Mining Classification AlgorithmsDetermining the Default Risk

Begum Ccedilıgsar and Deniz Unal

Cukurova University Faculty of Arts and Sciences Department of Statistics Adana Turkey

Correspondence should be addressed to Deniz Unal dunalcuedutr

Received 5 November 2018 Accepted 27 December 2018 Published 3 February 2019

Academic Editor Antonio J Pentildea

Copyright copy 2019 Begum Ccedilıgsar and Deniz Unal 1is is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in anymedium provided the original work isproperly cited

Big data and its analysis have become a widespread practice in recent times applicable to multiple industries Data mining is atechnique that is based on statistical applications 1is method extracts previously undetermined data items from large quantitiesof data 1e banking and insurance industries use data mining analysis to detect fraud offer the appropriate credit or insurancesolutions to customers and better understand customer demands 1is study aims to identify data mining classification al-gorithms and use them to predict default risks avoid possible payment difficulties and reduce potential problems in extendingcredit 1e data for this study which contains demographic and socioeconomic characteristics of individuals were obtained fromthe Turkish Statistical Institute 2015 survey Six classification algorithmsmdashNaive Bayes Bayesian networks J48 random forestmultilayer perceptron and logistic regressionmdashwere applied to the dataset using WEKA 39 data mining software 1ese al-gorithms were compared considering the root mean error squares receiver operating characteristic area accuracy precisionF-measure and recall statistical criteria 1e best algorithmmdashlogistic regressionmdashwas obtained and applied to the real dataset todetermine the attributes causing the default risk by using odds ratios 1e socioeconomic and demographic characteristics of theindividuals were examined and based on the odds ratio values the results of which individuals and characteristics were morelikely to default were reached 1ese results are not only beneficial to the literature but also have a significant influence in thefinancial industry in terms of the ability to predict customersrsquo default risk

1 Introduction

1e rapid and inevitable development of technology is causinga substantial global increase in the volume of data Such datamean better information and information is wealth 1is isbecause information makes it possible for mankind to have asafer and better future which is the primary goal of scientistsand researchers Due to this incredible amount of informationthat can be obtained from Big Data humanity is able to makeconsiderable progress in diverse fields ranging from healthand safety to education and economy

Obtaining information from big data utilizing the ap-propriate methods is similar to extracting the maximumpossible ore from a newly discovered mine 1e necessity ofcoming to scientifically accurate conclusions highlights theneed for big data analysis Big data analysis can reduceinformation loss and save time giving rise to the term data

mining (DM) [1 2] DM is a data analysis technique basedon statistical application it aims to extract information thatcould previously not be determined frommassive quantitiesof data [3]

Big data is not only a subject of interest for researchersbut has also become an essential tool in business Processingbig data effectively is crucial for companies aiming for aleading role in their field 1e need for big data analysis hasespecially increased in the banking and insurance industriesEven the small amount of information that has broughtcompanies to the forefront of the competitivemarket thanksto DM analysis has increased the importance of DM

1e analysis and modeling of big data are not newsubjects for actuaries bankers and insurers DM helps themovercome many difficulties in their aim to manage moneymore effectively control the system reduce or transferpotential risks understand client requirements improve

HindawiScientific ProgrammingVolume 2019 Article ID 8706505 8 pageshttpsdoiorg10115520198706505

funds management increase market share and reduce ortransfer potential risks [4] Specifically DM can be used inthe banking and insurance industries to determine defaultrisks and risk groups specify the correct insurance optionsfor individual customers increase customer satisfaction andidentify credit card fraud

1ere are many DM methods to detect problems facedby bankers and insurers for example clustering classifi-cation and association Classification is a widely used DMmethod that is applied in various fields [5] Hence theclassification algorithms are widely used and successfulresults obtained from the algorithms are also used for de-termining credit risks Which classification algorithm tochoose is a very important decision 1ere is no specificclassification algorithm to solve the current problem Inother words the best algorithm does not solve every problemin the best way 1ere are classification algorithms that givedifferent results for different datasets or different problemsA classification algorithm that is considered to be the bestsolution to solve a problem may not work in anotherproblem or dataset For this reason different classificationalgorithms for the given dataset must be compared beforeproblem solving 1e algorithm that best solves the problemis the algorithm obtained by comparison with the specificstatistical criterions 1us the algorithm to be used inproblem solving is determined In this study for de-termining best algorithms for current dataset all datamining classification algorithms were compared with respectto the suitability of data and accuracy rates (accuracythreshold was taken as 80) With this comparison allalgorithms were reduced to six classification algorithms(Naive Bayes Bayes network J48 random forest multilayerperceptron and logistic regression) that have almost thesame accuracy threshold rate Reduced algorithmsmdashNaiveBayes Bayesian networks J48 random forest multilayerperceptron and logistic regressionmdashare frequently seenalgorithms in the literature 1e six classification algorithmshave almost the same accuracy rates and data availability Soin order to determine the algorithm that will operate at themaximum level with the data the comparison under variouscriteria was repeated using WEKA (Waikato Environmentfor Knowledge Analysis) 39 data-mining software

1e algorithms that have similar accuracy rates werecompared again with different statistical criteria such as ROC(receiver operating characteristic) precision recall F-mea-sure and the root mean squared error (RMSE) to achieve thebest results As a result the most appropriate algorithm forthis dataset is found as the logistic regression algorithm

1e aim of this study is to use DM classification algo-rithms to investigate the effects of certain demographic andsocioeconomic characteristics on the probability of in-dividualsrsquo default risk as well as to predict their futurepayment challenges by determining individual attributesusing a logistic regression classification algorithm

1e data for this study which contain the demographicand socioeconomic characteristics of individuals were ob-tained from the 2015 TUIK (Turkish Statistical Institution)survey 1e raw data contained 59663 observations ofwhich only 20275 observations and 12 attributes belonging

to heads of households were used 1e data include thedemographic features of households from various regions aswell as their total income and debts paid on a regular basisover the last 12 months

When choosing an algorithm using the algorithm whichis known as giving good results without comparing itsperformance to different algorithms or comparing differentalgorithms by adhering to a single statistical criterion maygive misleading results Using an algorithm that is consid-ered to only give good results without considering itssuitability in data may lead to inaccurate results Likewisewhen determining the best algorithm comparing under asingle criterion may cause a wrong selection 1erefore inthis study Naive Bayes Bayes network J48 logistic re-gression multilayer perceptron random forest and classi-fication algorithms were determined with pre-eliminationand then determined algorithms were compared again withaccuracy F-measure Roc area recall precision and RMSEcriterions As a result of the analysis the logistic regressionclassification algorithm was determined as the best algo-rithm In conclusion the logistic regression algorithm wasused in the analysis of default risk

Moreover by applying the best algorithm (logistic re-gression) to the dataset we determined which characteristicsincrease the default risk most

2 Materials and Methods

1e concept of extending credit goes back 5000 years and isstill a primary research topic in the finance sector [6] 1erole of banks and banking activities are expanding dailywhich increases the necessity to manage loan issues ap-propriately Currently credit score models and credit ratingsare used to determine an individualrsquos default risk Creditscores are mathematical models that determine the likeli-hood of a default risk by observing the characteristics of thecustomers applying for the loan [7]

In a loan society or company crediting refers to the riskof balance at a certain time [8] Credit institutions facemany risks such as delays in payment or client defaults thevolatility of interest rates and the depreciation of in-vestments and securities Credit risk management providesthe opportunity to determine and measure these potentialrisks [9 10]

Financial institutions design models using certain cus-tomer characteristics (age gender area of residence incomemarital status previous credit payments etc) to predict andidentify possible credit risks [11 12] Fisher proposed dis-criminant and classification analysis in 1936 as the basis ofcredit scoring models Lately decision trees logistic re-gression K-nearest neighbor neural networks and supportvector machine algorithms are frequently used for creditscoring [13]

Financial institutions and analysts are always aiming toincrease credit volume while reducing default risks1erefore credit scoring analyses are crucial to aid fasterdecision making reduce the costs of loan analysis monitorexisting accounts more closely predict default risks andensure that institutions can detect possible risks while

2 Scientific Programming

developing their competitiveness [14] 1erefore DMtechniques using big data should be applied for creditscoring [11]

In the last decade there has been a significant increase inloan applicants and credit card users which in turn hasincreased the risks for credit institutions It is thereforenecessary for banks and financial institutions to determinethe probability of default risk by using the demographic andsocioeconomic characteristics of customers 1is allows fi-nancial institutions to take precautions against client defaultand identify risk groups Identifying risk groups can alsoprevent potential customer losses and aid banks in avoidingpotential risks For this reason DM usingWEKA software isimplemented to identify risk groups and ensure that fi-nancial institutions extend credit to clients not at risk ofdefault

1e first part of our study compares the classificationalgorithms to select the most suitable algorithm according toselected criteria In the second part the best algorithmmdashthelogistic regressionmdashis used to research the attributes thatmay cause default risk For this analysis odds ratios wereused as a criterion

21 WEKA 1e WEKA data-mining implementationsoftware was developed by the University of New Zealand Itis an open source software program written in Java underGeneral Public License It contains several supervised andunsupervised methods such as classification clusteringassociation and data visualization For this study theWEKA 39 implementation and its experimenter user in-terface were used for the classification of the algorithms aswell as to specify risk attributes using the logistic regressionalgorithm

22 Dataset 1e data for this study were obtained from theTUIK survey for 2015 Only heads of households over theage of 15 were selected from the 59663 units of survey data1e data used to support the findings of this study are re-stricted by TUIK in order to protect privacy 1e author(s)can supply the data upon request to researchers who meetthe criteria for access to confidential data 1e data containsthe demographic and socioeconomic characteristics of in-dividuals A WEKA preprocessing application was used toobtain 20275 units of data containing 12 variables one ofwhich is a class variable Incidents of payment and non-payment of past credit card debts are treated as class vari-ables 1e dataset structure is shown in Table 1

23 Classification of Algorithms Each object in the dataset isclassified according to its similarities Classification is thebest-known and most used method of DM 1e aim of theclassification method is to predict accurately the target classof objects of which the class label is unknown [15] In theWEKA implementation classification algorithms are pro-vided in nine groups For the purposes of this study thefollowing algorithms were chosen under the Bayesian fileBayes networks (BayesNet) and Naive Bayes algorithms

under the Functions file logistic regression (Logistic) andmultilayer perceptron under the Trees file J48 and randomforest algorithms

231 Bayesian Classifiers Bayesian network the Bayesiannetworkmdashalso known as the belief networkmdashis a probabi-listic graphical model that represents knowledge concerninga set of random variables [16] In this model each node inthe graph represents a random variable and the edges be-tween the variables represent the conditional dependencies[17] 1e conditional dependencies are calculated by sta-tistical probabilistic theories and computational methods InWEKA software the BayesNet algorithm is part of theBayesian file In order to implement the BayesNet algorithmthe dataset being studied should not have any missing dataand all variables must be discrete In cases where the datasetbeing studied contains continuous variables as well as dis-crete variables discretization can be applied by using thepreprocessing tab in the WEKA program After the dis-cretization is applied to the continuous variables (incomeand age) the dataset is ready to be studied

Naive Bayes the Naive Bayes algorithm is based on theBayesian theorem and operates on conditional probabilityDespite its simplicity it is a powerful algorithm for pre-dictive modeling Additionally the Naive Bayes classifierworks quite well concerning real-world situations An ex-ample is spam filtering which is a well-known problem forwhich the Naive Bayes classifier is suitable As with theBayesNet algorithm there should be no missing data in thisalgorithm and the variables must be discrete Since there areno missing data in this dataset the Naive Bayes algorithmcan be applied after discretization of the continuous vari-ables (income and age)

Table 1 Dataset structure

Attributename Description

Age AgeGender Gender (1male 2 female)Maritalstatus Marital status (1married 2 other)

Education Education level (1 illiterate 2 primary school3 secondary school 4 high school 5 higher degree)

Work Working status (1working 2 looking for a job3 retired 4 other (nonactive))

Health Health (1 good 2medium 3 poor)

RegionRegion (1mediterranean 2 aegean 3marmara4 black sea 5 central anatolia 6 eastern anatolia

7 southeastern anatolia)Housing Housing status (1 paying rent 2 not paying rent)

Revenue Individual revenue (1 low income 2mediumincome 3 higher income)

Homeloan

Nonpayment of house rent interest-bearing debtrepayment or home loan payment within the last 12

months (1 no 2 yes)

Bills Nonpayment of electricity water and gas bills withinthe last 12months (1no 2 yes)

Class Nonpayment of credit card installments and other debtpayments within the last 12months (1 no 2 yes)

Scientific Programming 3

232 Functions Logistic regression logistic regression mea-sures the relationship between a response variable and in-dependent variables like linear regression and belongs to thefamily of exponential classifiers [18] Logistic regressionclassifies an observation into one of two classes [19] and thisalgorithm analysis can be used when the variables are nominalor binary1e data are analyzed after the discretization processfor the continuous variables similar to the Bayesian group

Multilayer perceptron the multilayer perceptron algo-rithm is an artificial neural network algorithm Artificialneural networks gather information from a training set byminimizing the error iteratively and then applying thisinformation to a new dataset

233 Decision Trees J48 algorithm this algorithmrsquos name isderived from its tree-like structure and is based on super-vised learning techniques It is a frequently used algorithmdue to its ease of implementation low cost and reliabilityDecision treesrsquo roots consist of decision nodes branches andleaves [20] In the WEKA software the J48 algorithm usesthe rules of the C45 algorithm1erefore in WEKA the J48algorithm is considered a C45 algorithm 1e C45 algo-rithm can manage numerical values large data quantitiesand datasets with missing values 1e C45 algorithm uses athreshold value to divide the data into two ranges 1ethreshold value is selected to provide the most informationfrom the raw data and is determined by sorting the attributesand selecting the average value of the attributes

Random forest in this algorithm the classificationprocess uses more than one ldquotreerdquo [21] Each tree produces aclassifier and these classifiers vote to determine the algo-rithm that gets the most votes [22] 1is classification al-gorithm is then used to classify the dataset

24 Analysis of ClassificationAlgorithms In DM it is crucialto use a comparison to determine the best classifier [23] 1eclassifierrsquos performance is evaluated according to the fol-lowing criteria [24]

(i) Classification accuracy the ability of the model tocorrectly predict the label of class which is expressedas a percentage

(ii) Speed the speed refers to the time taken to set upthe model

(iii) Robustness the ability to predict the model cor-rectly even though the data has noisy observationsand missing values

(iv) Scalability the ability of a model to be accurate andproductive while handling an increasing amount ofdata

(v) Interpretability the level of understanding providedby the model

(vi) Rule Structure the understandability of the algo-rithmsrsquo rule structure

1e WEKA 39 software utilizes approximately 100 clas-sification algorithms and a pre-elimination was carried out bytesting the planned data for suitability under various conditions

(the algorithms chosen are those that can function in cate-gorical numerical binary or mixed systems) Next a secondscreening was carried out that considered criteria such as kappastatistics the speed of generating the model usage frequency ofthe algorithm in the literature and intelligibility 1ese elim-inations determined that the Bayesian networks Naive BayesJ48 random forest logistic regression and multilayer per-ceptron classification algorithms are the most suitable algo-rithms for our dataset 1e next step was comparing the sixclassification algorithms according to the six statistical criteria(accuracy rate RMSE precision recall ROC area andF-measure) 1e six statistical criteria are explained as follows

1e values of the statistical criteria that are compared toclassification algorithms are calculated by using a confusionmatrix 1e confusion matrix is shown in Table 2

Accuracy rate (AC) the percentage of correct pre-dictions According to the confusion matrix it can be cal-culated as

AC TN + TP

TP + FP + FN + TN (1)

where TN is the true negative TP is the true positive FP isthe false positive and FN false negative

Precision (P) the fraction of correctly predicted pos-itive observations among the total predicted positiveobservations

P TP

TP + FP (2)

where TP is the true positive and FP false positiveRecall (R) the fraction of correctly predicted positive

observations among all the observations in the class

R TP

TP + FN (3)

where TP is the true positive and FN false negativeF-measure 1e Precision and Recall criteria can be inter-

preted together rather than individually To accomplish this weconsider the F-Measure values generated by the harmonicmeanof the Precision and Recall columns as the harmonic meanprovides the average of two separate factors produced per unit1erefore F provides both the level of accuracy of the classi-fication and how robust (less data loss) it is

Fminusmeasure 2 times P times R

P + R (4)

where P is the precision and R is the recallROC area the ROC field curve determines the predictive

performance of the different classification algorithms 1earea under the ROC curve is one of the essential evaluationcriteria used to select the best classification algorithmWhenthe area under the curve is approaching 1 it indicates thatthe classification was carried out correctly

RMSE the root mean squared error deviation is obtainedby determining the square root of the mean squared error(MSE) Normally the RMSE is used as a measure of thedifference between the actual values and the estimated valuesof a model or estimator In other words the RMSE shows thestandard deviation of the difference between the estimated


values and the observed values It is preferable that theRMSE value is small

RMSE MSE

radic

1113936ni1 yi minus 1113954yi( 1113857

2

n

1113971

(5)

After the analysis using the WEKA data-miningimplementation the classification algorithms were sum-marized under the selected statistical criteria as shown inTable 3

Observing the accuracy percentage in the table it is clearthat the logistic regression algorithm has the highest ac-curacy rate while the multilayer perceptron algorithm hasthe lowest accuracy rate Similarly comparing the RMSEslogistic regression is the best algorithm and multilayerperceptron is the worst with a value of 0383 1e ROC areashows that the logistic regression algorithmrsquos value is thehighest making it the best algorithm Comparing the valuesin F-measure according to the recall criterion the logisticregression algorithm along with Naive Bayes and BayesNetall show a good value of 0824 1e algorithms that yieldedthe best results for the precision criterion are Naive Bayesand BayesNet In summary logistic regression is the bestalgorithm referring to the columns RMSE ROC area ac-curacy F-measure and recall statistical criterions 1e onlyexception is the precision value

Considering the value of precision it is clear that thelogistic regression algorithm has the closest precision valuewith the best result of 0002 whichmdashin the case of thisstudymdashdoes not dramatically affect the results

25 Determination of the Variables Causing Default Risk1e above analysis determined that logistic regression is thebest classification algorithm According to this result weapplied the logistic regression algorithm again to determinethe variables that could cause default risk

1e chi-square analysis was applied via theWEKA SelectAttribute panel to determine the variables that explain themodel the best Chi-square is an analysis that shows thevalue of the selected variable depending on the class variableused when the variables are nominal 1e analysis recom-mends subtracting the lowest rank variant from the model

However since WEKA cannot apply the chi-squareanalysis based on the algorithm it is preferable to excludevariables from the model 1e rank results of applying thechi-square analysis to the variables are shown in Table 4

Studying this analysis we observe that the age variable isnumerically ranked the lowest 1erefore this variable musteither be omitted or converted to a nominal value if it is to beincluded in the analysis We therefore transformed the agevariable into a categorical variable (one of 3 categories) and

the logistic regression analysis and the chi-square analysisfor variable selection were repeated

1e results of the logistic regression analysis with thenominal age variable are shown in Table 5

1e chi-square analysis with the nominal age variable isshown in Table 6

1e results show that even though the age variable wasmade nominal it still ranked the lowest 1e logistic re-gression analysis results with the age variable excluded isshown in Table 7

Since it is clear from these results that the classification ofthe data is improved when the age variable is removed thevariable will be removed and the analysis continued

3 Results and Discussion (AgeVariable Excluded)

A logistic regression analysis was performed using 20275observations and 12 variables (gender work marital statuseducation health region house home loan bills individualrevenue class) as a dataset

1e class subcategory was interpreted according to thenon-default status and a 10-fold cross validation was ap-plied in analysis 1e lowest and highest odds ratios areshown in Table 8

1ese ratios were interpreted and attributed to differentclasses without going into default sub groups According toTable 8 the model predicts that the odds of not going intodefault risk are 11174 times higher for women than they are

Table 3 Comparison of classification algorithms

AlgorithmsProperties Accuracy RMSE ROC

area Precision Recall F-measure

BayesNet 82528 0357 0836 0824 0825 0824NaiveBayes 82532 0357 0836 0824 0825 0824

Logistic 83108 0342 0843 0822 0831 0824J48 82470 0367 0768 0818 0825 0821Randomforest 82110 0352 0828 0809 0821 0814

Multilayerperceptron 81416 0383 0799 0810 0814 0810

Table 4 Chi-square results

Rank Attributes5532421 10 bills1846604 9 house_loan37045 2 work257218 8 house30438 1 gender10663 6 health9499 7 region7173 11 individual_revenue126 4 marriage0891 5 education0 3 ageSelected variables 10 9 2 8 1 6 7 11 4 5 3 11

Table 2 Confusion matrix

Precision classA b

Actual class a TP FNb FP TN


for men 1is result therefore predicts that women pay theirdebts on time and tend to default on debt less frequentlythan men [25]

Concerning the marital status attribute the model pre-dicts that the odds of not going into default risk are 10274times higher for unmarried individuals than for marriedindividuals 1is result therefore predicts that married

individuals tend to default on paying their loans morefrequently

1e model predicts that the odds of not going intodefault risk are 12381 times higher for a retired person thanfor people of other work groups Additionally people fromthe ldquolooking for a jobrdquo group are 07632 times more at risk ofgoing into default than other working groups

1e model further illustrates that the odds of illiteratepersons not going into default risk are 10453 time higherthan that of other education levels Individuals that havea secondary school degree are most at risk of going intodefault

1e odds of not going into the default risk are 10533times higher for people in good health than individuals notof good health

1e odds of not going into default risk for people livingin the Mediterranean Region are 11063 times greater thanthat of people from other regions Additionally those fromthe Southeastern Anatolia Region are at greater risk of goinginto default

1e odds for people that are not renting a house to not gointo default risk is 11008 times higher than that of peoplerenting a house

1e model predicts that the odds of not going into thedefault risk are 03449 for individuals not paying house rent1is means that individuals who do not pay their house rentare more likely to not go into default

1e model predicts that the odds of not going into thedefault risk are 00838 for the nonpayment of bills 1ismeans that individuals who do not pay their bills fully andon time are likely to not pay their other creditors either

Lastly the model predicts that the odds of not going intodefault risk are 10316 times higher for people with a lowerincome level than for others 1is indicates that banks shouldextend lower credit amounts to persons of lower income

4 Conclusion

Big data analysis cannot be carried out by traditionalmethods so data mining is used to extract information frommassive amounts of data DM applications are used in-tensively in the financial industry for predicting the likeli-hood of customer default risk

Our study used an analysis to discover the most suitableclassification algorithm to identify credit risks and estimatethe likelihood of default 1is analysis was carried out usingWEKA software and by applying 12 variables such as de-mographic characteristics of heads of household total in-come debt payment status and regional information Sixclassification algorithms were used (Bayes network NaiveBayes J48 random forest multilayer perceptron and lo-gistic regression) 1e performances of the algorithms werecompared according to accuracy root mean squared errorROC area F-measure precision and recall criteria and thelogistic regression classification algorithm was found to bethe best algorithm

Logistic regression was then applied again to raw datafrom TUIK by using WEKA 39 software to investigatethe factors affecting the default risk of individuals using

Table 6 Nominal age attribute chi-squared analysis results


Table 7 Nominal age attribute omitted logistic regression results

Accuracy 822984Root mean squared error 03455ROC area 0826Precision 0819Recall 0823F-measure 0821

Table 8 Odds ratios per attribute

Attribute name Subgroup Odds ratiosGender Women 11174

Work Retired 12381Looking for a job 07632

Marital status Other 10274

Education Illiterate 10453Secondary school 09666

Health Good 10533Medium 09184

Region Mediterranean 11063Southeastern Anatolia 091

House status Not paying rent 11008Home loan Yes 03449Bills Yes 00838

Individual revenue Low income 10316Medium income 09682

Table 5 Nominal age attribute logistic regression results



socioeconomic and demographic variables Next a chi-squared analysis was used for attribute selection whichdemonstrated that the age attribute needed to be omittedfrom data After both these analyses were run the odds ratiovalues were used to determine the probability with whichindividuals with certain characteristics may default onpaying loans 1ese results are not only beneficial to theliterature but they could also have a significant influence infinancial institutions to predict the customer default risks

According to the results it is observed that women aremore likely to honor their payments thanmen In the light ofthe results was obtained Cigsar and Unal [25] examined thegender variable and investigated the reasons why womenwere more likely to honor their payments thanmen Besidesfurther research of these findings will be made in the futurestudy

Regarding the marital status attribute the results showedthat unmarried people have a lower risk of default 1ereason for this may be related to the responsibility of marriedindividuals and their possible difficulties in paying theirdebts on time

1e results for working status show that retirees regu-larly pay their debts compared to the job-seeking and otherinactive groups which makes the likelihood of defaultinglower Furthermore the results for noneducated individualsindicate that their risk of default is lower than that of othereducation levels 1is shows that an increase in the level ofeducation also increases the likelihood of default

1e health status of the individual is also influential inthe case of a default with the results confirming that peoplein good health are less likely to default than other groupsWhen considering the region attribute we see that peoplefrom the Mediterranean region are less likely to default Itwas also observed that individuals who do not pay house rentwere more successful in paying their debts than the ones whodo In this case we can surmise that the extra responsibilityof the rent that the individuals pay makes debt paymentsdifficult Finally it was also observed that individuals who donot pay their other debt are likely to default on their loans

Considering these results and the fact that credit in-stitutions should consider the characteristics of their cus-tomers and their circumstances which will in turn affecttheir defaults the risk of default could be reduced by usingDM 1e risks that could be detected using this method canbe eliminated by taking precautionary measures whichcould indirectly increase the national income

1is study contributes to existing literature by suggestingclassification algorithms that can be used to determine creditrisks Additionally we identified the variables that can beused in determining the default risk which will assist futureresearchers in this field1e study is also valuable in terms ofillustrating that DM can be used in the determination ofcredit risk within the framework of the development ofacademic studies both in Turkey and globally Because thereare a limited number of studies on the subject of default riskbeing analyzed using DM applications and WEKA softwarethis study will contribute to filling the gap in the field

Lastly this study proposes a solution for financial in-stitutions and companies in related fields who want to

determine credit risks Future technological developmentmayproduce new software and new algorithms for the process1ese new algorithms will eliminate the imperfections ofexisting algorithms and introduce new approaches

Data Availability

1e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

1e authors declare that they have no conflicts of interest

Acknowledgments

1is research was supported by Cukurova University Sci-entific Research Fund (Project Number FYL-2017-8454)1e first stages of this study are presented in the In-ternational Conference on Applied Analysis and Mathe-matical Modeling at Istanbul in July 2017

References

[1] R Arora and S Suman ldquoComparative analysis of classificationalgorithms on different datasets using WEKArdquo InternationalJournal of Computer Applications vol 54 no 13 pp 21ndash252012

[2] J Xia F Xie Y Zhang and C Caulfield ldquoArtificial in-telligence and data mining algorithms and applicationsrdquoAbstract and Applied Analysis vol 2013 Article ID 5247202 pages 2013

[3] W H Inmon Building Data Warehouse QEDWileyHoboken NJ USA 2005

[4] I O A In Belgium Big Data An actuarial perspective In-stitute of Actuaries Belgium 2015

[5] J Han M Kamber and J Pei Data Mining Concepts andTechniques Morgan Kaufmann Publishers Burlington MAUSA 2012

[6] D Donko and A Dzelihodzic ldquoData mining techniques forcredit risk assessment taskrdquo Recent Advances in ComputerScience and Applications pp 105ndash110 2013

[7] S Belkıs Credit Rating Capital Market Licensing Registrationand Training Organization Turkey 2016

[8] H Selimler ldquoAnalysis of problem loans assessment of theeffect on bank financial statements and ratesrdquo Journal ofFinancial Research and Studies vol 7 no 12 2015

[9] T V Gestel and B Baesens Credit Risk Management BasicConcepts Financial Risk Components Rating AnalysisModels Economic and Regulatory Capital Oxford UniversityPress Oxford UK 2008

[10] A Wang L Yong W Zeng and Y Wang ldquo1e optimalanalysis of default probability for a credit risk modelrdquo Ab-stract and Applied Analysis vol 2013 Article ID 8783069 pages 2014

[11] C Huang M Chen and C Wang Credit Scoring a DataMining approach based on support vector machines Vol 33New York City NY USA 2016

[12] G Kou and W Wu ldquoAn analytic hierarchy model for clas-sification algorithms selection in credit risk analysisrdquoMathematical Problems in Engineering vol 2014 Article ID297563 7 pages 2014


[13] M D M Sousa and R S Figueiredo ldquoCredit analysis usingdata mining application in the case of a credit unionrdquo Journalof Information Systems and Technology Management vol 11no 2 pp 379ndash396 2014

[14] I Yeh and C Lien ldquo1e comparisons of data mining tech-niques for the predictive accuracy of probability of default ofcredit card clientsrdquo Expert Systems with Applications vol 36no 2 pp 2473ndash2480 2009

[15] G Kesavaraj and S Sukumaran A Study on ClassificationTechniques in Data Mining IEEE 2013

[16] Y Yan and B Suo ldquoRisks analysis of logistics financialbusiness based on evidential bayesian networkrdquo Mathemat-ical Problems in Engineering vol 2013 Article ID 7852188 pages 2013

[17] I B Gal F Ruggeri F Faltin and R Kenett ldquoBayesiannetworksrdquo Encyclopedia of Statistics in Quality and ReliabilityWiley and Sons Hoboken NJ USA 1st edition 2007

[18] S Chen Y-J J Goo and Z-D Shen ldquoA hybrid approach ofstepwise regression logistic regression support vector ma-chine and decision tree for forecasting fraudulent financialstatementsrdquo9e Scientific World Journal vol 2014 Article ID968712 9 pages 2014

[19] Y Kumar and G Sahoo ldquoAnalysis of parametric and nonparametric classifiers for classification technique usingWEKArdquo International Journal of Information Technology andComputer Science vol 4 pp 43ndash49 2012

[20] M Kantardzic Data Mining Concepts Models Methods andAlgorithms John Wiley amp Sons Danvers MA USA 2003

[21] L Breiman ldquoRandom forestsrdquo Machine Learning vol 45no 1 pp 5ndash32 2001

[22] T-T Nguyen and J Z Huang ldquoUnbiased feature selection inlearning random forests for high-dimensional datardquo ScientificWorld Journal vol 2015 Article ID 471371 18 pages 2015

[23] I Nguyen E Frank and M Hall Data Mining PracticalMachine Learning Tools and techniques Morgan KaufmannBurlington MA USA 2011

[24] J Stefanowski Data Mining- Evaluation of ClassifiersInstitute of Computing Sciences Poznan University ofTechnology Poland 2010

[25] B Cigsar and D Unal ldquo1e effect of gender and gender-dependent factors on the default riskrdquo Revista de Cercetare siInterventie Sociala vol 63 pp 28ndash418 2018


Computer Games Technology

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

Advances in

FuzzySystems


Volume 2018


ReconfigurableComputing



Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

thinspArtificial Intelligence

Hindawiwwwhindawicom Volumethinsp2018


Civil EngineeringAdvances in


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia


Biomedical Imaging



Engineering Mathematics


RoboticsJournal of



Computational Intelligence and Neuroscience


Mathematical Problems in Engineering

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018


Human-ComputerInteraction

Advances in


Scientic Programming

Submit your manuscripts atwwwhindawicom

funds management increase market share and reduce ortransfer potential risks [4] Specifically DM can be used inthe banking and insurance industries to determine defaultrisks and risk groups specify the correct insurance optionsfor individual customers increase customer satisfaction andidentify credit card fraud

1ere are many DM methods to detect problems facedby bankers and insurers for example clustering classifi-cation and association Classification is a widely used DMmethod that is applied in various fields [5] Hence theclassification algorithms are widely used and successfulresults obtained from the algorithms are also used for de-termining credit risks Which classification algorithm tochoose is a very important decision 1ere is no specificclassification algorithm to solve the current problem Inother words the best algorithm does not solve every problemin the best way 1ere are classification algorithms that givedifferent results for different datasets or different problemsA classification algorithm that is considered to be the bestsolution to solve a problem may not work in anotherproblem or dataset For this reason different classificationalgorithms for the given dataset must be compared beforeproblem solving 1e algorithm that best solves the problemis the algorithm obtained by comparison with the specificstatistical criterions 1us the algorithm to be used inproblem solving is determined In this study for de-termining best algorithms for current dataset all datamining classification algorithms were compared with respectto the suitability of data and accuracy rates (accuracythreshold was taken as 80) With this comparison allalgorithms were reduced to six classification algorithms(Naive Bayes Bayes network J48 random forest multilayerperceptron and logistic regression) that have almost thesame accuracy threshold rate Reduced algorithmsmdashNaiveBayes Bayesian networks J48 random forest multilayerperceptron and logistic regressionmdashare frequently seenalgorithms in the literature 1e six classification algorithmshave almost the same accuracy rates and data availability Soin order to determine the algorithm that will operate at themaximum level with the data the comparison under variouscriteria was repeated using WEKA (Waikato Environmentfor Knowledge Analysis) 39 data-mining software

1e algorithms that have similar accuracy rates werecompared again with different statistical criteria such as ROC(receiver operating characteristic) precision recall F-mea-sure and the root mean squared error (RMSE) to achieve thebest results As a result the most appropriate algorithm forthis dataset is found as the logistic regression algorithm

1e aim of this study is to use DM classification algo-rithms to investigate the effects of certain demographic andsocioeconomic characteristics on the probability of in-dividualsrsquo default risk as well as to predict their futurepayment challenges by determining individual attributesusing a logistic regression classification algorithm

1e data for this study which contain the demographicand socioeconomic characteristics of individuals were ob-tained from the 2015 TUIK (Turkish Statistical Institution)survey 1e raw data contained 59663 observations ofwhich only 20275 observations and 12 attributes belonging

to heads of households were used 1e data include thedemographic features of households from various regions aswell as their total income and debts paid on a regular basisover the last 12 months

When choosing an algorithm using the algorithm whichis known as giving good results without comparing itsperformance to different algorithms or comparing differentalgorithms by adhering to a single statistical criterion maygive misleading results Using an algorithm that is consid-ered to only give good results without considering itssuitability in data may lead to inaccurate results Likewisewhen determining the best algorithm comparing under asingle criterion may cause a wrong selection 1erefore inthis study Naive Bayes Bayes network J48 logistic re-gression multilayer perceptron random forest and classi-fication algorithms were determined with pre-eliminationand then determined algorithms were compared again withaccuracy F-measure Roc area recall precision and RMSEcriterions As a result of the analysis the logistic regressionclassification algorithm was determined as the best algo-rithm In conclusion the logistic regression algorithm wasused in the analysis of default risk

Moreover by applying the best algorithm (logistic re-gression) to the dataset we determined which characteristicsincrease the default risk most

2 Materials and Methods

1e concept of extending credit goes back 5000 years and isstill a primary research topic in the finance sector [6] 1erole of banks and banking activities are expanding dailywhich increases the necessity to manage loan issues ap-propriately Currently credit score models and credit ratingsare used to determine an individualrsquos default risk Creditscores are mathematical models that determine the likeli-hood of a default risk by observing the characteristics of thecustomers applying for the loan [7]

In a loan society or company crediting refers to the riskof balance at a certain time [8] Credit institutions facemany risks such as delays in payment or client defaults thevolatility of interest rates and the depreciation of in-vestments and securities Credit risk management providesthe opportunity to determine and measure these potentialrisks [9 10]

Financial institutions design models using certain cus-tomer characteristics (age gender area of residence incomemarital status previous credit payments etc) to predict andidentify possible credit risks [11 12] Fisher proposed dis-criminant and classification analysis in 1936 as the basis ofcredit scoring models Lately decision trees logistic re-gression K-nearest neighbor neural networks and supportvector machine algorithms are frequently used for creditscoring [13]

Financial institutions and analysts are always aiming toincrease credit volume while reducing default risks1erefore credit scoring analyses are crucial to aid fasterdecision making reduce the costs of loan analysis monitorexisting accounts more closely predict default risks andensure that institutions can detect possible risks while




















Homeloan


months (1 no 2 yes)



















AC TN + TP

TP + FP + FN + TN (1)



P TP

TP + FP (2)



R TP

TP + FN (3)




P + R (4)






RMSE MSE

radic

1113936ni1 yi minus 1113954yi( 1113857

2

n

1113971

(5)


























Precision classA b














4 Conclusion





























Data Availability




Acknowledgments


References

































Advances in

FuzzySystems


Volume 2018













Journal of

Journal of



Hindawi


Advances in

Multimedia


Biomedical Imaging





RoboticsJournal of









Volume 2018



Advances in






















Homeloan


months (1 no 2 yes)



















AC TN + TP

TP + FP + FN + TN (1)



P TP

TP + FP (2)



R TP

TP + FN (3)




P + R (4)






RMSE MSE

radic

1113936ni1 yi minus 1113954yi( 1113857

2

n

1113971

(5)


























Precision classA b














4 Conclusion





























Data Availability




Acknowledgments


References

































Advances in

FuzzySystems


Volume 2018













Journal of

Journal of



Hindawi


Advances in

Multimedia


Biomedical Imaging





RoboticsJournal of









Volume 2018



Advances in



















AC TN + TP

TP + FP + FN + TN (1)



P TP

TP + FP (2)



R TP

TP + FN (3)




P + R (4)






RMSE MSE

radic

1113936ni1 yi minus 1113954yi( 1113857

2

n

1113971

(5)


























Precision classA b














4 Conclusion





























Data Availability




Acknowledgments


References

































Advances in

FuzzySystems


Volume 2018













Journal of

Journal of



Hindawi


Advances in

Multimedia


Biomedical Imaging





RoboticsJournal of









Volume 2018



Advances in





RMSE MSE

radic

1113936ni1 yi minus 1113954yi( 1113857

2

n

1113971

(5)


























Precision classA b














4 Conclusion





























Data Availability




Acknowledgments


References

































Advances in

FuzzySystems


Volume 2018













Journal of

Journal of



Hindawi


Advances in

Multimedia


Biomedical Imaging





RoboticsJournal of









Volume 2018



Advances in















4 Conclusion





























Data Availability




Acknowledgments


References

































Advances in

FuzzySystems


Volume 2018













Journal of

Journal of



Hindawi


Advances in

Multimedia


Biomedical Imaging





RoboticsJournal of









Volume 2018



Advances in













Data Availability




Acknowledgments


References

































Advances in

FuzzySystems


Volume 2018













Journal of

Journal of



Hindawi


Advances in

Multimedia


Biomedical Imaging





RoboticsJournal of









Volume 2018



Advances in























Advances in

FuzzySystems


Volume 2018













Journal of

Journal of



Hindawi


Advances in

Multimedia


Biomedical Imaging





RoboticsJournal of









Volume 2018



Advances in




comparisonofdataminingclassificationalgorithms...

Documents