research article a hybrid approach of stepwise...

Research ArticleA Hybrid Approach of Stepwise RegressionLogistic Regression Support Vector Machine and DecisionTree for Forecasting Fraudulent Financial Statements

Suduan Chen1 Yeong-Jia James Goo2 and Zone-De Shen2

1 Department of Accounting Information National Taipei University of Business 321 Jinan Road Section 1 Taipei 10051 Taiwan2Department of Business Administration National Taipei University No 67 Section 3 Ming-shen East Road Taipei 10478 Taiwan

Correspondence should be addressed to Suduan Chen suduanchenyahoocomtw

Received 13 May 2014 Revised 22 August 2014 Accepted 23 August 2014 Published 11 September 2014

Academic Editor Shifei Ding

Copyright copy 2014 Suduan Chen et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

As the fraudulent financial statement of an enterprise is increasingly serious with each passing day establishing a valid forecastingfraudulent financial statement model of an enterprise has become an important question for academic research and financialpractice After screening the important variables using the stepwise regression the study also matches the logistic regressionsupport vector machine and decision tree to construct the classification models to make a comparison The study adopts financialand nonfinancial variables to assist in establishment of the forecasting fraudulent financial statement model Research objects arethe companies to which the fraudulent and nonfraudulent financial statement happened between years 1998 to 2012 The findingsare that financial and nonfinancial information are effectively used to distinguish the fraudulent financial statement and decisiontree C50 has the best classification effect 8571

1 Introduction

The financial statement is the main basis of decision-makingby investors creditors and other accounting informationdemanders and concurrently also the concrete expressionof management performance financial condition and pos-sessing social responsibility of the listed and OTC com-panies but the fraudulent financial statement (FFS) hasthe trend of becoming increasingly serious in recent years[1ndash8]

This behavior not only makes the investing public subjectto vast amount of loss but also more seriously influencesthe capital market order Because the fraudulent case isincreasingly serious with each passing day the United StatesCongress passed Sarbanes-Oxley Act in 2002 and mainlyhope by which to improve the accuracy and reliability ofthe financial statement of a company and disclosure to makethe auditors able to forecast the omen of the FFS before theFFS of an enterprise occurs When one checks corporationsrsquofinancial statements due to fraud which led to a significant

misstatement there are fairly strict norms for audit staff inTaiwan [9]

The FFS can be regarded as a typical classification prob-lem [10]The classification problem carries out a computationmainly in light of the variable attribute numerical value ofsome given classification data to acquire the relevant classi-fication rule of every classification and bring the unknownclassification data into the rule to acquire the final classifi-cation result Many authors apply the logistic regression tomake a fraudulent classification and acquire the result in theFFS issue in the past [3 6 7 11ndash13]

Data mining is an analytical tool used to handle acomplicated data analysis It discovers previously unknowninformation from mass data and aims for data to make aninduction from the structured model as reference amountin making a decision with many different functions such asclassification association clustering and forecasting [4 5 814] ldquoClassificationrdquo function is used the most often thereinand its result can serve as the decision basis and predictionHowever whether every application of data mining in the

Hindawi Publishing Corporatione Scientific World JournalVolume 2014 Article ID 968712 9 pageshttpdxdoiorg1011552014968712

2 The Scientific World Journal

FFS is superior to the traditional classification model iscontroversial

The purpose of this study is to expect that a bettermethod of forecasting fraudulent financial statement can bepresented to forecast the omen of the fraudulent financialstatement and to reduce damage to the investors and auditorsThe study will adopt the logistic regression and the supportvector machine (SVM) as well as the decision tree (DT)C50 in data mining as the basis and match the stepwiseregression to separately establish classificationmodel tomakea comparison In conclusion the study first aims at theldquofraudulent financial statementrdquo issue to make an arrange-ment for and carry out an exploration of relevant literatureto ensure the research variable and sample adopted by thestudy We then take the logistic regression SVM and DTC50 as the bases to establish the FFS classification modelFinally we present the conclusions and suggestions of thestudy

2 Literature Review

21 Fraudulent Definition The FFS is a kind of intentional orillegal behavior the result of which directly causes the seri-ously misleading financial statement or financial disclosure[2 15] Pursuant to the provision of SAS NO99 a kind offraudulent pattern is dishonest financial report and it meansa kind of intentional erroneous narration neglecting amountor disclosure which makes the misunderstood financialstatement [6]

22 Research Method The classification problem carries outa computation mainly in light of the variable attributenumerical value of some given classification data to acquirethe relevant classification rule of every classification andbringthe unknown classification data into the rule to acquire thefinal classification result Many authors apply the logisticregression to make a fraudulent classification in the FFS issuein the past [3 11 12 15ndash17] However the traditional statisticmethod has limitation of having to accord with specificassumption in data

As a result the machine learning way which does notrequire any statistic assumption about data portfolio risesabruptly Many scholars recently try to adopt the machinelearning way as the classification machine to conduct aresearchThe empirical result also points out that it possessesan excellent classification effect Chen et al [13] appliedthe neural network and SVM to forecast network invasionand the research result indicates that the SVM has excellentclassification ability Huang et al [18] applied the neuralnetwork and SVM to explore the classificationmodel of creditevaluation Shin et al [19] conducted a relevant research ofbankruptcy prediction Yeh et al [4] apply it in predictionof enterprise failure On the other hand Kotsiantis et al[3] and Kirkos et al [10] apply DT C50 in the relevantresearch to acquire the excellent classification result Thusthe study will adopt the foresaid logistic regression SVMand DT C50 as the classifier construction classificationmodel

23 Variable Selection As for variable selection via rele-vant literature exploration some authors adopt the financialvariable as the research variable [3 10] others adopt thenonfinancial variable as the research variable [12 16 17] andstill others adopt both the financial variable and nonfinancialvariable as the research variable [15 20]

Because financial statement data often have cheatingsuspicion if we purely consider the financial variables thepossibility of erroneous classification may increase There-fore the study not only adopts the financial variable as theresearch variable but also adds the nonfinancial variable toconstruct the fraudulent financial prediction model

3 Methodology

The purpose of this study is to present a two-stage researchmodel which integrates the financial variable and nonfi-nancial variable to establish the fraudulent early warningmodel of an enterprise The procedure of the study is toaim at the data to make a stepwise regression analysis toacquire the result of the important variable of the TTFafter screening and then to take such variable as the inputvariable of the logistic regression and SVM Finally the studymakes a comparison and an analysis to acquire a better FFSclassification result

31 Stepwise Regression The study selects a variable of themaximum classification ability in accordance with forwardselection and incorporates the predictor into the model bystepwise increase During each process119875 value of the statistictest is used to screen the variables If 119875 value is less than orequal to 005 then the variable enters the regression modeland the selected variable is the independent variable of theregression model

32 Logistic Regression The logistic regression resembles thelinear regression while the response variable and explanatoryvariable of the general linear regression are usually thecontinuous variable but the response variable explored by thelogistic regression is the discrete variable that is it handlesthe qualitative variable of the two-dimensional independentvariable problem (eg yes or no and success or failure)The model utilizes cumulative probability density functionto convert real number value of the explanatory variable toprobability value between 0 and 1 The elementary assump-tion is different from the analytic assumption of anothermul-tivariate analysisThe influence of the explanatory variable onthe response variable is to fluctuate in the index form whichmeans that the logistic regression does not need to conformto the normal distribution assumption In other words itcan handle the population of the nonnormal distribution andthe problem of the nonlinear model and the nonmeasuringvariable

The general logistic regression model is as follows

119884lowast

= 120573119909 + 120576

119884 = (1 119884lowast

gt 0

0 119884lowast

le 0)

(1)

The Scientific World Journal 3

where 119884 response variable of actual observation 119884 = 1 afinancial crisis event occurs 119884 = 0 no financial crisis eventoccurs 119884lowast latent variable without observation 119909 matrixof explanatory variable 120573 matrix of explanatory variableparameter and 120576 error of explanatory variable

33 Support Vector Machine (SVM) The operation model ofthe SVM projects the initial input vector to eigenspace of thehigh dimension with linear and nonlinear core function andutilizes the separating hyperplane to distinguish two or manymaterials of different classesThe SVMutilizes the hyperplaneclassifier to classify the materials

331 Linear Divisibility When the plain formed by thetraining sample data is linear which consider the trainingvector 119909

119894= (119909

(1)

119894 119909

119899

119894) isin 119877

119899 belongs to two classes119910119894isin minus1 +1 In order to definitely distinguish the training

vector class it is necessary to find out the optimal partitionhyperplane able to separate the materials

If the hyperplane119908sdot119909+119887 can separate the training sampleit is shown as

119908 sdot 119909119894+ 119887 gt 0 if 119910

119894= 1 (2)

119908 sdot 119909119894+ 119887 lt 0 if 119910

119894= minus1 (3)

Adjust 119908 and 119887 properly (2) and (3) can be rewritten as

119908 sdot 119909119894+ 119887 ge 1 if 119910

119894= 1

119908 sdot 119909119894+ 119887 le minus1 if 119910

119894= minus1

(4)

or as

119910119894(119908 sdot 119909

119894+ 119887) ge 1 forall119894 isin 1 119899 (5)

Pursuant to the statistics theory the best interface not onlyseparates two classes of samples correctly but alsomaximizesthe classification marginThe class margin of the interface119908sdot119909 + 119887 is shown as

119889 (119908 119887) = min119909119894|119910119894=1

119908 sdot 119909119894+ 119887

|119908|minus max119909119894|119910119894=minus1

119908 sdot 119909119894+ 119887

|119908| (6)

Equation (7) can be acquired from (4)

119889 (119908 119887) =1

|119908|minusminus1

|119908|=2

|119908| (7)

So the problem of the maximization class margin 119889(119908 119887)transforms to minimization |119908|22 under constraint con-dition (5) Pursuant to Lagrange relaxation the foresaidproblem must accord with the hypothesis of (8) and (9) Inthe foresaid condition the minimization is shown as (10)

120572119894ge 0 (8)

sum

119894

120572119894119910119894= 0 (9)

sum

119894

120572119894minus1

2sum

119894119895

120572119894120572119895119910119894119910119895119909119894sdot 119909119895 119894 = 1 119899 (10)

Every 120572119894corresponds to a training sample 119909

119894 and the training

sample of its corresponding 120572119894gt 0 is called the support

vector Classification function acquired finally is shown as

119891 (119909) = sgn (119908 sdot 119909119894+ 119887)

= sgn(119873119904

sum

119894=1

120572119894119910119894119909119894sdot 119909 + 119887)

(11)

where119873119904is the number of the support vector

332 Linear Indivisibility If the training sample is linearlyindivisible (4) can be rewritten as

119908 sdot 119909119894+ 119887 ge 1 minus 120585

119894 if 119910

119894= 1

119908 sdot 119909119894+ 119887 ge 120585

119894minus 1 if 119910

119894= minus1

(12)

where 120585119894ge 0 119894 = 1 119899

If 119909119894is classified mistakenly then 120585

119894gt 1 Thus the

mistaken classification is less thansum119894120585119894 Add a given parame-

ter value in the objective function Consider reasonably themaximum class margin and the minimum mistaken classsample that is seeking the minimum of |1199082|2 + 119862(sum

119894120585119894)

can acquire the SVM under linear indivisibility Pursuant toLagrange relaxation the foresaid problem must accord withthe hypothesis of (13) and (14) In the foresaid condition theminimization is shown as (15)

0 le 120572119894le 119862 (13)

sum

119894

120572119894119910119894= 0 (14)

sum

119894

120572119894minus1

2sum

119894119895

120572119894120572119895119910119894119910119895119909119894sdot 119909119895 119894 = 1 2 3 119899 (15)

34 Decision Tree (DT) The Decision Tree (DT) is thesimplest in the inductive learning method [21] It belongsto the data mining tool and can handle the continuousand noncontinuous variable It establishes the tree structurediagram mainly by the given classification fact and inducessome principles therein The principles are mutually exclu-sive and the DT generated can also make an out-of-sampleprediction The DT algorithms used most frequently includeCART CHAID and C50 [22] C50 [23] improves fromID3 [23] Thanks to ID3 use limitation it cannot handlethe continuous numerical value materials thus Quinlanconducts a research for improvement and C50 is developedto handle the continuous and the noncontinuous numericalvalue

The DT C50 is mainly separated into two parts The firstpart is classification criterion which is calculated pursuantto the gain ratio Construct the DT completely as shownin (2) Information gained in (16) is used to calculate thepretest and posttest gain of the data set and is defined asldquopretest informationrdquo minus ldquopostinformationrdquo from (17)The entropy in (16) is used to calculate impurity which iscalled randomness In other words it is used to calculate


Table 1 Results of stepwise regression variable screening

Variable code Variable classification Variable description Pr gt ChiSq1198831 Financial Accounts receivablestotal assets 024011198833 Financial Inventorycurrent assets 0033911988310 Financial Interest protection multiples 0069411988313 Financial Debt ratio 0029411988315 Financial Cash flow ratio 0002511988317 Financial Accounts payable turnover 0029511988324 Financial Operation profitlast year operation profit gt11 0026711988329 Nonfinancial Pledge ratio of shares of the directors and supervisors 00473

Table 2 Hit ratio of three models using the train datasets

Research model C50 Logistic SVMHit ratio 9394 8333 7879

randomness of the data set When randomness in the dataset reaches the most disorderly state the value will be 1

Therefore the less random the posttest data set is thelarger the information gain is calculated and the morefavorable it is for DT construction

Gain Ratio (119878 119860) = Information 119899 Gain (119878 119860)Entropy (119878 119860) (16)

Gain (119878 119860) = Entropy (119878) minus sum

V isinvalues(119860)

1003816100381610038161003816119878V1003816100381610038161003816

|119878|Entropy (119878V)

(17)

The second part is pruning criterion Pursuant to the errorbased pruning (EBP) the DT is properly pruned to enhancethe correct ratio of classification EBP is evolved from thepessimistic error pruning (PEP) and such two pruningmethods are presented by Quinlan The main concept of theEBP is to make a judgment using the error ratio calculate theerror ratio of every node and further judge the node whichresults in rise of the error ratio of the overall DT Finally thisnode is pruned properly to further enhance the correct ratioof the DT

35 Definition of Type I Error and Type II Error In order toestablish the valid forecasting fraudulent financial statementit is considerably important to measure type I type II errorsof the study Type I error is to mistakenly judge the normalfinancial statement company as the FFS company Thisjudgment does not cause investorsrsquo damage but it carriesout an erroneous audit opinion for being too conservativeand further influences credit of the company audited TypeII error is that the FFS enterprise is mistaken for the normalenterprise This classification error leads to auditing failureauditorsrsquo investment loss or investorsrsquo erroneous judgment

4 Empirical Analysis

41 Data Collection and Variables The research samples arethe FFS enterprises from the years 1998 to 2012 66 enterprisesare selected from the listed andOTCcompanies of the TaiwanEconomic Journal Data Bank (TEJ) The 1 by 1 pair way isadopted to match 66 normal enterprises so there are 132enterprises in total as research samples

As for selection of the research variables the studyaltogether selects 29 variables including 24 financial variablesand 5 nonfinancial variables (see appendix)

For consideration of the number of samples to avoidhaving too few samples of the test group and to improve testaccuracy we propose to utilize 50 of the sample materialsas the train sample to establish the regression classificationmodel The remaining 50 of the sample materials serve asthe test sample to test validity of the classification modelestablished

In addition to test the stability of the proposed researchmodel this study randomly selects three groups at a ratio of80 from the test data as the test sample for cross-validationThe compartment and sampling of data in this research areshown in Figure 1

42 Model Development To begin with the study aims forthe financial and nonfinancial variable to screen using thestepwise regression screeningmethodThe variables screenedserve as the input variable of the logistic regression andSVM Next the study aims at every method to carry outthe model training and test Finally the study compares themerit and demerit of the classification correct ratio and givesthe relevant suggestions for the analytic result The modelconstruction is divided into three parts The first part is thevariable screening way the second part is the classificationway the third part compares the test results of two kindsof classification models The research process of the study isshown as Figure 2

43 Important Variable Screening While constructing theclassification model there may be quite many variables butnot every variable is important Therefore the variables ofno account need to be eliminated to construct a simplerclassification model There are quite many variable screening


Table 3 C50 cross-validation results

C50 model Predict value Hit ratio Type I error Type II errorNon-FFS FFS

Actual value

CV1 Non-FFS 25 3 8393 1071 2142FFS 6 22

CV2 Non-FFS 25 3 8750 1071 1428FFS 4 24

CV3 Non-FFS 25 3 8571 1071 1785FFS 5 23

Average 25 3 8571 1071 17855 23

Table 4 Logistic regression cross-validation results

Logistic regression model Predict value Hit ratio Type I error Type II errorNon-FFS FFS

Actual value

CV1 Non-FFS 25 3 8036 1071 2857FFS 8 20

CV2 Non-FFS 26 2 8214 714 2857FFS 8 20

CV3 Non-FFS 25 3 8036 1071 2857FFS 8 20

Average Non-FFS 25 3 8095 952 2857FFS 8 20

ways amongwhich the stepwise regression variable screeningmethod is used most frequently [24]

Therefore the study adopts the suggestions of Pudil et al[24] to screen the variables using the stepwise regression bywhich to retain the research variables with more influenceThe input variables of the study are screened via the step-wise regression to acquire the results as shown in Table 1including 7 financial variables and 1 nonfinancial variableSubsequently the study takes these 8 variables as the newinput variables to construct the classification model

44 Classification Model The prediction accuracy of thethree types of models using the train datasets is displayed inTable 2

As shown in Table 2 C50 has the best performance in theestablishment of the predictionmodel and its accuracy rate is9394The traditional logistic model is the second best Theaccuracy rate of the SVM model at 7879 is the lowest ofthe three The cross-validation results of the proposed threeprediction models are shown in Tables 3 to 5

441 Decision Tree (DT) The study constructs the DT C50model sets EBP at 120572 = 5 and adopt the binary partitionprinciple to obtain the optimal spanning tree The predictionresults of the DT C50 classification model are shown asTable 3

On average 25 of the 28 non-FFS materials are correctlyclassified in the non-FFS and three of them are incorrectlyclassified in the FFS The type I error is 1071 On the otherhand 23 of the 28 FFS materials are correctly classified and

the remaining five FFS materials are incorrectly classified inthe non-FFS The type II error is 1785

442 Logistic Regression Table 4 is the empirical results ofthe logistic classification model which shows that 25 of 28non-FFS materials are correctly classified and that three ofthem are incorrectly classified in the FFS The overall typeI error is 952 In addition 20 of the 28 FFS materials arecorrectly classified and the remaining eight FFS materialsare incorrectly classified in the non-FFS The type II error is2857

443 Support Vector Machine (SVM) The operation core isset at RBF when the study constructs the SVMmodel As forthe parameter the C search scope is set at 2minus10 to 210 and 120574 isset at 01 The SVM classification results are shown as Table 5

In this part 26 of the 28 non-FFS materials are correctlyclassified and two of them are incorrectly classified in theFFS The type I error is 714 In addition 14 of the 28 FFSmaterials are correctly classified and the remaining 14 FFSmaterials are incorrectly classified in the non-FFS The typeII error is 4881

444 Comprehensive Comparison and Analysis Kirkos et al[10] pointed out that the merit and demerit of the evaluationmodel must also consider the type I error and type IIerror The type I error means to classify the nonfraudulentcompanies into the fraudulent companies Occurrence ofthese two type errors results from the auditing failure ofthe auditors Type II error means that the auditors classify


Table 5 SVM cross-validation results

SVMmodel Predict value Hit ratio Type I error Type II errorNon-FFS FFS

Actual value

CV1 Non-FFS 26 2 7321 714 4642FFS 13 15

CV2 Non-FFS 26 2 7143 714 5000FFS 14 14

CV3 Non-FFS 26 2 7143 714 5000FFS 14 14


Table 6 Summary of classification results

Model Type I error Type II error Hit ratio RankingLogistic regression 952 2857 8095 2SVM 714 4881 7202 3DT C50 1071 1785 8571 1

Table 7 Paired-samples 119905 test

Model 119905-value DF Significant (two-tailed)C50mdashlogistic minus5201 2 035LogisticmdashSVM minus16958 2 003SVMmdashC50 9823 2 010

the fraudulent companies into the nonfraudulent companiesBoth types of error would cause different loss costs andthe auditors must avoid occurrence of these two errorsComparing the results of these three models we concludethat the classification ability of the DT C50 is the bestthe next is the logistic regression and the last is the SVMThe classification correct ratios of three kinds of model aresummarized as shown in Table 6

The comparison shows that although the logistic classifi-cation model performs the best for type I errors the DT C50possesses the best classification effect both for type II errorsand the hit ratio The correct classification ratio is 8571followed by 8095 for the logistic model and 7202 for theSVMmodel

Unlike general studies using type I errors to judge theperformance of prediction models FFS studies use type IIerrors to determine the performance of prediction modelsFor the sake of prudence we conduct the statistical test oftype II errors in the abovementioned cross-validation resultsto confirm whether the differences in between models aresignificantly other than 0 The analysis results are shown inTable 7 which shows that the 119905-values of the predictionmodeltype II error differences are minus5201 (C50mdashLogistic) minus16958(LogisticmdashSVM) and 9823 (SVMmdashC50) respectively andall of them reach the significance level

5 Conclusion and Suggestion

As the fraudulent financial statement (FFS) increases onthe trot in recent years the auditing failure risk of theauditors also rises thereby Therefore many researches focuson developing a good classification model to reduce therelevant risk In the past the accuracy of forecasting FFSpurely using regression analysis has been relatively lowManyscholars have pointed out that prediction by data mining canimprove the accuracy rate Thus this study adopts stepwiseregression to screen the important factors of financial andnonfinancial variables Meanwhile it combines the abovewith data mining techniques to establish a more accurate FFSforecast model

A total of eight critical variables are screened via thestepwise regression analysis including two parts financialvariables (accounts receivablestotal assets inventorycurrentassets interest protection multiples cash flow ratio accountspayable turnover operation profitlast year operation profitgt 11) and nonfinancial variables (pledge ratio of shares of thedirectors and supervisors)

The financial variables include operating capabilitiesprofitability index debt solvency ability index and financialstructure The nonfinancial variables include relevant vari-ables of stock rights and scale of an enterprisersquos directorsand supervisors The results indicate that when auditorsinvestigate FFS they must focus on the alert provided by thenonfinancial information as well as the financial information

In the classification model the study adopts the logisticregression of the traditional classificationmethod and theDTC50 and SVM of data mining to construct the classificationmodel The empirical result indicates that the SVM modelperforms the best in the type I error after comparison and theDT C50 has the best classification performance in the type IIerror and overall classification correct ratio


Table 8 Selection of the research variables

Variable classification Variable code Variable description and computation

Financial variables

1198831 Accounts receivablestotal assets1198832 Gross profittotal assets1198833 Inventorycurrent assets1198834 Inventorytotal assets1198835 Net profit after taxtotal assets1198836 Net profit after taxfixed assets1198837 Cashtotal assets1198838 Log total assets1198839 Log total liabilities11988310 Interest protection multiples (debt service coverage ratio times interest earned)11988311 Gross profit margin11988312 Operating expense ratio11988313 Debt ratio11988314 Inventory turnover11988315 Cash flow ratio11988316 Net profit ratio before tax11988317 Accounts payable turnover11988318 Revenue growth rate11988319 Debtequity ratio11988320 Earnings before interest taxes depreciation and amortization11988321 Current liabilitiestotal assets11988322 Total assets turnover11988323 Account receivablelast year accounts receivable gt1111988324 Operation profitlast year operation profit gt11

Nonfinancial variables

11988325 Shareholding ratio of the major shareholders11988326 Shareholding ratio of directors and supervisors11988327 Whether the chairman concurrently holds the position of CEO11988328 Board size11988329 Pledge ratio of shares of the directors and supervisors

Train subset

Test subset

FFSand

non-FFS sample

50 without replacement 80 with replacement

Cross-validation

Cross-validationsubset 2


50

50

subset 1

Figure 1 Train and test subsets design


Financial variable

regression

SVM

Logistic

regression Stepwise Empirical result and

analysis comparisonNonfinancialvariable DT C50

Figure 2 Research model

One of the research purposes is to anticipate accom-modating the auditors with another assistant auditing toolbesides the traditional analysis method but the researchabout the forecasting FFS is not sufficient Therefore thesubsequent researchers can also adopt other methods toforecast the FFS to provide a better reference In additionfuture researchers can also try to adopt different variablescreening methods to enhance the classification correctratio of the method As for the variable some nonfinancialvariables are difficult to measure and material acquisitionis difficult so the study does not incorporate them Finallyas for the sample the study focuses on the FFS scoperesearch and a certain number of the FFSs may not be foundTherefore the pair companies can also be the FFS companiesin the coming year which can influence the accuracy of thestudy The findings of this study can provide a referenceto auditors certified public accountants (CPAs) securitiesanalysts company managers and future academic studies

Appendix

See Table 8

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] C Spathis M Doumpos and C Zopounidis ldquoDetecting fal-sified financial statements a comparative study using multi-criteria analysis and multivariate statistical techniquesrdquo TheEuropean Accounting Review vol 11 pp 509ndash535 2002

[2] Z Rezaee ldquoCauses consequences and deterence of financialstatement fraudrdquo Critical Perspectives on Accounting vol 16 no3 pp 277ndash298 2005

[3] S Kotsiantis E Koumanakos D Tzelepis and V TampakasldquoForecasting fraudulent financial statements using data min-ingrdquo Transactions on Engineering Computing and Technologyvol 12 pp 283ndash288 2006

[4] C-C Yeh D-J Chi andM-F Hsu ldquoA hybrid approach of DEArough set and support vector machines for business failurepredictionrdquo Expert Systems with Applications vol 37 no 2 pp1535ndash1541 2010

[5] W Zhou and G Kapoor ldquoDetecting evolutionary financialstatement fraudrdquo Decision Support Systems vol 50 no 3 pp570ndash575 2011

[6] S L Humpherys K C Moffitt M B Burns J K Burgoon andW F Felix ldquoIdentification of fraudulent financial statementsusing linguistic credibility analysisrdquo Decision Support Systemsvol 50 no 3 pp 585ndash594 2011

[7] K A Kamarudin W A W Ismail and W A H W MustaphaldquoAggressive financial reporting and corporate fraudrdquo Procedia-Social Behavioral Sciences vol 65 pp 638ndash643 2012

[8] P-F Pai M-F Hsu and M-C Wang ldquoA support vectormachine-based model for detecting top management fraudrdquoKnowledge-Based Systems vol 24 no 2 pp 314ndash321 2011

[9] Accounting Research and Development Foundation Audit theFinancial Statements of the Considerations for Fraud Account-ing Research and Development Foundation Taipei Taiwan2013

[10] S Kirkos C Spathis and Y Manolopoulos ldquoData miningtechniques for the detection of fraudulent financial statementsrdquoExpert Systems with Applications vol 32 no 4 pp 995ndash10032007

[11] T B Bell and J V Carcello ldquoA decision aid for assessing thelikelihood of fraudulent financial reportingrdquo Auditing vol 19pp 169ndash178 2000

[12] V D Sharma ldquoBoard of director characteristics institutionalownership and fraud evidence from Australiardquo Auditing vol23 no 2 pp 105ndash117 2004

[13] W-H Chen S-H Hsu and H-P Shen ldquoApplication of SVMand ANN for intrusion detectionrdquo Computers and OperationsResearch vol 32 no 10 pp 2617ndash2634 2005

[14] J W Seifert ldquoData mining and the search for security chal-lenges for connecting the dots and databasesrdquo GovernmentInformation Quarterly vol 21 no 4 pp 461ndash480 2004

[15] M S Beasley ldquoAn empirical analysis of the relation between theboard of director composition and financial statement fraudrdquoAccounting Review vol 71 no 4 pp 443ndash465 1996

[16] P Dunn ldquoThe impact of insider power on fraudulent financialreportingrdquo Journal of Management vol 30 no 3 pp 397ndash4122004

[17] G Chen ldquoPositive research on the financial statement fraudfactors of listed companies in Chinardquo Journal of ModernAccounting and Auditing vol 2 pp 25ndash34 2006

[18] Z Huang H Chen C-J Hsu W-H Chen and S Wu ldquoCreditrating analysis with support vector machines and neural net-works a market comparative studyrdquo Decision Support Systemsvol 37 no 4 pp 543ndash558 2004


[19] K-S Shin T S Lee and H-J Kim ldquoAn application of supportvector machines in bankruptcy prediction modelrdquo Expert Sys-tems with Applications vol 28 no 1 pp 127ndash135 2005

[20] S L Summers and J T Sweeney ldquoFraudulently misstatedfinancial statements and insider trading an empirical analysisrdquoThe Accounting Review vol 73 no 1 pp 131ndash146 1998

[21] G Arminger D Enache and T Bonne ldquoAnalyzing credit riskdata a comparison of logistic discrimination classification treeanalysis and feedforward networksrdquo Computational Statisticsvol 12 no 2 pp 293ndash310 1997

[22] S Viaene G Dedene and R A Derrig ldquoAuto claim frauddetection using Bayesian learning neural networksrdquo ExpertSystems with Applications vol 29 no 3 pp 653ndash666 2005

[23] J R Quinlan C45 Programs for Machine Learning MorganKaufmann Publishers 1993

[24] P Pudil K Fuka K Beranek and P Dvorak ldquoPotential of arti-ficial intelligence based feature selection methods in regressionmodelsrdquo in Proceedings of the IEEE 3rd International Conferenceon Computational Intelligence and Multimedia Application pp159ndash163 1999

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



FFS is superior to the traditional classification model iscontroversial

The purpose of this study is to expect that a bettermethod of forecasting fraudulent financial statement can bepresented to forecast the omen of the fraudulent financialstatement and to reduce damage to the investors and auditorsThe study will adopt the logistic regression and the supportvector machine (SVM) as well as the decision tree (DT)C50 in data mining as the basis and match the stepwiseregression to separately establish classificationmodel tomakea comparison In conclusion the study first aims at theldquofraudulent financial statementrdquo issue to make an arrange-ment for and carry out an exploration of relevant literatureto ensure the research variable and sample adopted by thestudy We then take the logistic regression SVM and DTC50 as the bases to establish the FFS classification modelFinally we present the conclusions and suggestions of thestudy

2 Literature Review

21 Fraudulent Definition The FFS is a kind of intentional orillegal behavior the result of which directly causes the seri-ously misleading financial statement or financial disclosure[2 15] Pursuant to the provision of SAS NO99 a kind offraudulent pattern is dishonest financial report and it meansa kind of intentional erroneous narration neglecting amountor disclosure which makes the misunderstood financialstatement [6]

22 Research Method The classification problem carries outa computation mainly in light of the variable attributenumerical value of some given classification data to acquirethe relevant classification rule of every classification andbringthe unknown classification data into the rule to acquire thefinal classification result Many authors apply the logisticregression to make a fraudulent classification in the FFS issuein the past [3 11 12 15ndash17] However the traditional statisticmethod has limitation of having to accord with specificassumption in data

As a result the machine learning way which does notrequire any statistic assumption about data portfolio risesabruptly Many scholars recently try to adopt the machinelearning way as the classification machine to conduct aresearchThe empirical result also points out that it possessesan excellent classification effect Chen et al [13] appliedthe neural network and SVM to forecast network invasionand the research result indicates that the SVM has excellentclassification ability Huang et al [18] applied the neuralnetwork and SVM to explore the classificationmodel of creditevaluation Shin et al [19] conducted a relevant research ofbankruptcy prediction Yeh et al [4] apply it in predictionof enterprise failure On the other hand Kotsiantis et al[3] and Kirkos et al [10] apply DT C50 in the relevantresearch to acquire the excellent classification result Thusthe study will adopt the foresaid logistic regression SVMand DT C50 as the classifier construction classificationmodel

23 Variable Selection As for variable selection via rele-vant literature exploration some authors adopt the financialvariable as the research variable [3 10] others adopt thenonfinancial variable as the research variable [12 16 17] andstill others adopt both the financial variable and nonfinancialvariable as the research variable [15 20]

Because financial statement data often have cheatingsuspicion if we purely consider the financial variables thepossibility of erroneous classification may increase There-fore the study not only adopts the financial variable as theresearch variable but also adds the nonfinancial variable toconstruct the fraudulent financial prediction model

3 Methodology

The purpose of this study is to present a two-stage researchmodel which integrates the financial variable and nonfi-nancial variable to establish the fraudulent early warningmodel of an enterprise The procedure of the study is toaim at the data to make a stepwise regression analysis toacquire the result of the important variable of the TTFafter screening and then to take such variable as the inputvariable of the logistic regression and SVM Finally the studymakes a comparison and an analysis to acquire a better FFSclassification result

31 Stepwise Regression The study selects a variable of themaximum classification ability in accordance with forwardselection and incorporates the predictor into the model bystepwise increase During each process119875 value of the statistictest is used to screen the variables If 119875 value is less than orequal to 005 then the variable enters the regression modeland the selected variable is the independent variable of theregression model

32 Logistic Regression The logistic regression resembles thelinear regression while the response variable and explanatoryvariable of the general linear regression are usually thecontinuous variable but the response variable explored by thelogistic regression is the discrete variable that is it handlesthe qualitative variable of the two-dimensional independentvariable problem (eg yes or no and success or failure)The model utilizes cumulative probability density functionto convert real number value of the explanatory variable toprobability value between 0 and 1 The elementary assump-tion is different from the analytic assumption of anothermul-tivariate analysisThe influence of the explanatory variable onthe response variable is to fluctuate in the index form whichmeans that the logistic regression does not need to conformto the normal distribution assumption In other words itcan handle the population of the nonnormal distribution andthe problem of the nonlinear model and the nonmeasuringvariable

The general logistic regression model is as follows

119884lowast

= 120573119909 + 120576

119884 = (1 119884lowast

gt 0

0 119884lowast

le 0)

(1)





119894= (119909

(1)

119894 119909

119899

119894) isin 119877




119908 sdot 119909119894+ 119887 gt 0 if 119910

119894= 1 (2)

119908 sdot 119909119894+ 119887 lt 0 if 119910

119894= minus1 (3)


119908 sdot 119909119894+ 119887 ge 1 if 119910

119894= 1

119908 sdot 119909119894+ 119887 le minus1 if 119910

119894= minus1

(4)

or as

119910119894(119908 sdot 119909

119894+ 119887) ge 1 forall119894 isin 1 119899 (5)


119889 (119908 119887) = min119909119894|119910119894=1

119908 sdot 119909119894+ 119887

|119908|minus max119909119894|119910119894=minus1

119908 sdot 119909119894+ 119887

|119908| (6)


119889 (119908 119887) =1

|119908|minusminus1

|119908|=2

|119908| (7)


120572119894ge 0 (8)

sum

119894

120572119894119910119894= 0 (9)

sum

119894

120572119894minus1

2sum

119894119895

120572119894120572119895119910119894119910119895119909119894sdot 119909119895 119894 = 1 119899 (10)





119891 (119909) = sgn (119908 sdot 119909119894+ 119887)

= sgn(119873119904

sum

119894=1

120572119894119910119894119909119894sdot 119909 + 119887)

(11)



119908 sdot 119909119894+ 119887 ge 1 minus 120585

119894 if 119910

119894= 1

119908 sdot 119909119894+ 119887 ge 120585

119894minus 1 if 119910

119894= minus1

(12)

where 120585119894ge 0 119894 = 1 119899


119894gt 1 Thus the



119894120585119894)


0 le 120572119894le 119862 (13)

sum

119894

120572119894119910119894= 0 (14)

sum

119894

120572119894minus1

2sum

119894119895

120572119894120572119895119910119894119910119895119909119894sdot 119909119895 119894 = 1 2 3 119899 (15)













1003816100381610038161003816119878V1003816100381610038161003816

|119878|Entropy (119878V)

(17)













Actual value

CV1 Non-FFS 25 3 8393 1071 2142FFS 6 22

CV2 Non-FFS 25 3 8750 1071 1428FFS 4 24

CV3 Non-FFS 25 3 8571 1071 1785FFS 5 23

Average 25 3 8571 1071 17855 23



Actual value

CV1 Non-FFS 25 3 8036 1071 2857FFS 8 20

CV2 Non-FFS 26 2 8214 714 2857FFS 8 20

CV3 Non-FFS 25 3 8036 1071 2857FFS 8 20
















Actual value

CV1 Non-FFS 26 2 7321 714 4642FFS 13 15

CV2 Non-FFS 26 2 7143 714 5000FFS 14 14

CV3 Non-FFS 26 2 7143 714 5000FFS 14 14

















Financial variables




Train subset

Test subset

FFSand

non-FFS sample


Cross-validation



50

50

subset 1



Financial variable

regression

SVM

Logistic





Appendix

See Table 8



References

































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in







119894= (119909

(1)

119894 119909

119899

119894) isin 119877




119908 sdot 119909119894+ 119887 gt 0 if 119910

119894= 1 (2)

119908 sdot 119909119894+ 119887 lt 0 if 119910

119894= minus1 (3)


119908 sdot 119909119894+ 119887 ge 1 if 119910

119894= 1

119908 sdot 119909119894+ 119887 le minus1 if 119910

119894= minus1

(4)

or as

119910119894(119908 sdot 119909

119894+ 119887) ge 1 forall119894 isin 1 119899 (5)


119889 (119908 119887) = min119909119894|119910119894=1

119908 sdot 119909119894+ 119887

|119908|minus max119909119894|119910119894=minus1

119908 sdot 119909119894+ 119887

|119908| (6)


119889 (119908 119887) =1

|119908|minusminus1

|119908|=2

|119908| (7)


120572119894ge 0 (8)

sum

119894

120572119894119910119894= 0 (9)

sum

119894

120572119894minus1

2sum

119894119895

120572119894120572119895119910119894119910119895119909119894sdot 119909119895 119894 = 1 119899 (10)





119891 (119909) = sgn (119908 sdot 119909119894+ 119887)

= sgn(119873119904

sum

119894=1

120572119894119910119894119909119894sdot 119909 + 119887)

(11)



119908 sdot 119909119894+ 119887 ge 1 minus 120585

119894 if 119910

119894= 1

119908 sdot 119909119894+ 119887 ge 120585

119894minus 1 if 119910

119894= minus1

(12)

where 120585119894ge 0 119894 = 1 119899


119894gt 1 Thus the



119894120585119894)


0 le 120572119894le 119862 (13)

sum

119894

120572119894119910119894= 0 (14)

sum

119894

120572119894minus1

2sum

119894119895

120572119894120572119895119910119894119910119895119909119894sdot 119909119895 119894 = 1 2 3 119899 (15)













1003816100381610038161003816119878V1003816100381610038161003816

|119878|Entropy (119878V)

(17)













Actual value

CV1 Non-FFS 25 3 8393 1071 2142FFS 6 22

CV2 Non-FFS 25 3 8750 1071 1428FFS 4 24

CV3 Non-FFS 25 3 8571 1071 1785FFS 5 23

Average 25 3 8571 1071 17855 23



Actual value

CV1 Non-FFS 25 3 8036 1071 2857FFS 8 20

CV2 Non-FFS 26 2 8214 714 2857FFS 8 20

CV3 Non-FFS 25 3 8036 1071 2857FFS 8 20
















Actual value

CV1 Non-FFS 26 2 7321 714 4642FFS 13 15

CV2 Non-FFS 26 2 7143 714 5000FFS 14 14

CV3 Non-FFS 26 2 7143 714 5000FFS 14 14

















Financial variables




Train subset

Test subset

FFSand

non-FFS sample


Cross-validation



50

50

subset 1



Financial variable

regression

SVM

Logistic





Appendix

See Table 8



References

































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in













1003816100381610038161003816119878V1003816100381610038161003816

|119878|Entropy (119878V)

(17)













Actual value

CV1 Non-FFS 25 3 8393 1071 2142FFS 6 22

CV2 Non-FFS 25 3 8750 1071 1428FFS 4 24

CV3 Non-FFS 25 3 8571 1071 1785FFS 5 23

Average 25 3 8571 1071 17855 23



Actual value

CV1 Non-FFS 25 3 8036 1071 2857FFS 8 20

CV2 Non-FFS 26 2 8214 714 2857FFS 8 20

CV3 Non-FFS 25 3 8036 1071 2857FFS 8 20
















Actual value

CV1 Non-FFS 26 2 7321 714 4642FFS 13 15

CV2 Non-FFS 26 2 7143 714 5000FFS 14 14

CV3 Non-FFS 26 2 7143 714 5000FFS 14 14

















Financial variables




Train subset

Test subset

FFSand

non-FFS sample


Cross-validation



50

50

subset 1



Financial variable

regression

SVM

Logistic





Appendix

See Table 8



References

































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






Actual value

CV1 Non-FFS 25 3 8393 1071 2142FFS 6 22

CV2 Non-FFS 25 3 8750 1071 1428FFS 4 24

CV3 Non-FFS 25 3 8571 1071 1785FFS 5 23

Average 25 3 8571 1071 17855 23



Actual value

CV1 Non-FFS 25 3 8036 1071 2857FFS 8 20

CV2 Non-FFS 26 2 8214 714 2857FFS 8 20

CV3 Non-FFS 25 3 8036 1071 2857FFS 8 20
















Actual value

CV1 Non-FFS 26 2 7321 714 4642FFS 13 15

CV2 Non-FFS 26 2 7143 714 5000FFS 14 14

CV3 Non-FFS 26 2 7143 714 5000FFS 14 14

















Financial variables




Train subset

Test subset

FFSand

non-FFS sample


Cross-validation



50

50

subset 1



Financial variable

regression

SVM

Logistic





Appendix

See Table 8



References

































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






Actual value

CV1 Non-FFS 26 2 7321 714 4642FFS 13 15

CV2 Non-FFS 26 2 7143 714 5000FFS 14 14

CV3 Non-FFS 26 2 7143 714 5000FFS 14 14

















Financial variables




Train subset

Test subset

FFSand

non-FFS sample


Cross-validation



50

50

subset 1



Financial variable

regression

SVM

Logistic





Appendix

See Table 8



References

































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






Financial variables




Train subset

Test subset

FFSand

non-FFS sample


Cross-validation



50

50

subset 1



Financial variable

regression

SVM

Logistic





Appendix

See Table 8



References

































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




Financial variable

regression

SVM

Logistic





Appendix

See Table 8



References

































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in

















Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



research article a hybrid approach of stepwise...

Documents