
An Empirical Method for Discovering Tax Fraudsters: A Real Case Study of Brazilian Fiscal Evasion

Tales Matos

José Macedo

José Maria

ABSTRACT This work presents a new method for classifying tax fraudsters based on fraud indicators. The work was developed in conjunction with a Brazilian fiscal agency with the aim of curbing fiscal evasion. The main contribution of this paper is a method for classifying and ranking taxpayers by analyzing fraud indicators obtained from several fiscal applications. In particular, we developed a method for identifying frequent fraud patterns using association rules, and we then applied two dimensionality reduction methods (i.e., PCA and SVD) in order to create a fraud scale that ranks taxpayers according to their potential to commit fraud. Experiments were conducted using real taxpayer data, and tax auditors specialized in fraud detection validated the results. Preliminary results show that our method can indicate fraudsters with an F-measure of 80%, which is a very promising result.

Categories and Subject Descriptors H.2.8 [Database Applications]: Data mining.

General Terms Algorithms, Measurement, Experimentation, Verification.


1. INTRODUCTION

Brazil is currently the seventh largest economy in the world according to the national wealth ranking (see Wikipedia1). Due to the size of the Brazilian economy, fiscal evasion has become a key problem for states and municipalities. In order to cope with this problem, the Brazilian government has implemented two systems: the Electronic Invoice and the Digital Tax Bookkeeping. These systems allow tracking and cross-checking financial information among contributors, states and municipalities. Although these systems sped up the process of gathering taxpayer information, the level of fraud is still high, and 25 per cent of potential income tax is lost due to recurrent frauds. However, taxpayer information brings new opportunities for detecting fraudulent activities in order to mitigate tax fraud. In this sense, we are interested in devising an

1 http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

unsupervised method capable of forecasting potential fraudsters from taxpayers' data by analyzing their fraud indicator traces.

In this direction, we resort to a real case scenario of the Finance Department of the State of Ceará (SEFAZ), which is responsible for inspecting over 142,000 active contributors. Although SEFAZ has a large dataset about taxpayer frauds, its enforcement team struggles to perform a complete inspection of taxpayers' accounting because each inspection process needs to evaluate countless fraud indicators, which is very time consuming and error prone. Motivated by this problem, we collected four analytical questions to be answered by this work: (1) What is the behavioral pattern of fraud indicators? (2) Is there a correlation among fraud indicators? (3) Which are the most relevant fraud indicators? and (4) How can we measure the risk of a taxpayer committing a fraud?

The main contribution of this paper is a method for classifying taxpayers using fraud indicators. This method is composed of four steps, which are implemented using data mining, statistical analysis and dimensionality reduction techniques. The steps are: (1) Analyzing Fraud Frequencies, (2) Discovering the Most Relevant Fraud Indicators, (3) Correlating Fraud Indicators and (4) Classifying Taxpayers from Relevant Fraud Indicators. This is an empirical method oriented towards answering the four questions raised in the previous paragraph.

Experiments were conducted and showed that only a small subset of fraud indicators is representative and should be used during the analytical process. In addition, we succeeded in creating a scale that identifies taxpayers' propensity for fraud, which may help tax agents direct their analysis towards potential fraudsters. Preliminary results show that our method indicates fraudsters with an F-measure of 80%, which demonstrates its good accuracy. The validation of the results was done by tax auditors with extensive experience in fraud detection.

This paper is structured as follows. Section 2 presents a description of our real case study. In Section 3, we present the proposed method, describing each step of the analytical process. Next, in Section 4, we describe the experimental setting and discuss the results. Finally, we conclude this work in Section 5.

2. CASE STUDY One of the objectives of this study is to analyze taxpayer fraud indicators in order to identify the key indicators that may characterize a potential tax fraud. Hereinafter we call such indicators fraud indicators. Indeed, fraud indicators could guide tax auditors in the process of identifying irregular behavior of taxpayers. As mentioned before, tax auditors should perform a

thorough analysis of taxpayer data in order to discover possible frauds the taxpayer may have committed. However, this process is complex, time consuming and error prone due to the excessive volume of data, which encompasses different kinds of information, such as accounting, goods stock, sales and legal data. 2.1 Tax Fraud Indicator Datasets In our case study, we resort to historical audit data provided by the Treasury of the State of Ceará (SEFAZ-CE, Brazil). These data were extracted from 8 applications, summing up 72 million records. We selected taxpayers' data from 2009 and 2010, which correspond to the current audit period. We used fourteen fraud indicators identified by tax auditors. From a financial point of view, these fourteen indicators are of key importance since they correspond to the largest amount of money that can be recovered from fraudulent transactions. Due to confidentiality reasons, we anonymized these fraud indicators. Each fraud indicator is determined by a tax auditor after analyzing information issued by taxpayers, such as tax documents, records of the movement of goods at the border of the State, and taxpayer sales data through credit and debit cards. Figure 1 presents two histograms detailing fraud indicator frequencies for the 2009 and 2010 datasets. In 2009, 10,789 (95%) of 11,386 taxpayers had at least one type of fraud evidence and 597 (5%) did not present any evidence of fraud. In 2010, 11,989 (96%) of 12,424 taxpayers had at least one type of fraud evidence and 435 (4%) did not present any evidence. We also verified that the fraud indicators C, E, F, G and H appeared with higher frequencies in both datasets. While the H, N and K fraud indicators had a significant increase in frequency in 2010, the J fraud indicator was greatly reduced in 2010. The majority of fraud indicator frequencies increased from 2009 to 2010, which leads us to conclude that a method is needed to mitigate fraud evolution and thus tax evasion.

Figure 1. Tax fraud indicators' frequencies

We have also analyzed the correlation among fraud indicators in order to understand which fraud indicators occur together and with what frequencies. Table 1 presents the result of this analysis, showing in each line a set of fraud indicators that occur together, with the corresponding percentage for 2009 and 2010. From these observations we identified 14 fraud indicator sets with high frequency in 2009. We also observed that the fraud indicator sets present in 2009 repeat in 2010, but with a higher frequency (colored in red). In 2010 we noted that 26 fraud indicator sets are frequent. Thus, we could perceive that frauds are not only increasing in number but also that new combinations of frauds are appearing.

Table 1. Frequency of fraud indicator sets

2.2 Tax Fraud Matrix Two matrices were constructed for the years 2009 and 2010, respectively. Each matrix relates the taxpayers (in rows) and their corresponding fraud indicators (in columns). The matrix for the 2009 data has 11,386 rows and 14 columns. The matrix for the 2010 data has 12,424 rows and 14 columns. Taxpayers who did not present any evidence of fraud were excluded because they were considered outliers. This representation is necessary to work with the algorithms used in Sections 3.1, 3.2, and 3.3.

After data selection, the data were aggregated by summing the total of each indicator for each taxpayer. Thus, the data were grouped into a taxpayers versus fraud indicators matrix. For some techniques, the matrix is adapted to encode the existence/absence of fraud evidence for each taxpayer, where 1 (one) indicates the presence and 0 (zero) the absence of fraud evidence. Aiming to anonymize taxpayers, a sequential ID was assigned to each one. Each of the fourteen fraud indicators was renamed A, B, C, D, E, F, G, H, I, J, K, L, M and N.
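To make this construction concrete, the following sketch (in Python with pandas) shows how a taxpayer-versus-indicator matrix of this kind can be built and then binarized into presence/absence form. The input file and column names are hypothetical, since the real SEFAZ data is confidential.

import pandas as pd

# Hypothetical input: one row per detected fraud evidence, with an anonymized
# taxpayer ID and the fraud indicator label (A..N).
records = pd.read_csv("fraud_evidences_2010.csv")   # columns: taxpayer_id, indicator

# Aggregate: count how many times each indicator appears for each taxpayer.
counts = (records
          .groupby(["taxpayer_id", "indicator"])
          .size()
          .unstack(fill_value=0)                     # taxpayers in rows, indicators in columns
          .reindex(columns=list("ABCDEFGHIJKLMN"), fill_value=0))

# Drop taxpayers with no evidence at all (treated as outliers in the paper).
counts = counts[counts.sum(axis=1) > 0]

# Binary version used by the association rule and similarity analyses:
# 1 = at least one evidence of that indicator, 0 = none.
binary = (counts > 0).astype(int)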

3. BEHAVIOURAL ANALYSIS OF TAXPAYERS In this section, we present a detailed description of the proposed method. The method overview figure presents each step with its corresponding input and output. The first step of the method is to compute the frequency of tax indicators (this was explained in Section 2.1). The second step aims to find the most relevant fraud indicators. In the third step, we correlate fraud indicators and compare their evolution in both years. Finally, in the fourth step we classify taxpayers using a fraud scale, which allows measuring the tendency of a taxpayer to commit fraud. Each step is explained in detail in the following sections.

3.1 Discovering The Most Relevant Fraud Indicators The second step of the method aims to discover which fraud indicators occur together and which determine others. For this task we use the association rule technique, viewing each fraud indicator as an item and each taxpayer's set of indicators as a transaction. The mining algorithm used in this study is the Apriori algorithm (Agrawal and Srikant, 1994), considered to be the most widely used in the literature for this purpose.

We executed the Apriori algorithm using support = 60% and confidence = 60%. The values of support and confidence were defined as the minimum acceptable by the tax auditors. In addition to these parameters, we use a measure called "lift". Lift is a way of measuring how "interesting" a rule is: a more interesting rule may be a more useful rule because it is more novel or unexpected. Lift takes the support of a rule into account, but also favors situations where the left-hand side and right-hand side items are not abundant in isolation, yet their relatively few occurrences usually happen together. The larger the value of lift, the more "interesting" the rule may be.
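For reference, these measures have standard definitions (not specific to this paper); for a rule A => B over the set of taxpayer transactions:

support(A => B) = P(A and B)
confidence(A => B) = support(A and B) / support(A)
lift(A => B) = confidence(A => B) / support(B) = support(A and B) / (support(A) x support(B))

A lift above 1 means that A and B occur together more often than would be expected if they were independent.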

Figure 3. Confidence, support and lift: (a) 2009 dataset and (b) 2010 dataset


Figure 3 shows a plot that relates support and confidence. Even though we see a two-dimensional plot, three variables are actually represented: support is on the X-axis, confidence is on the Y-axis, and lift, which serves as our measure of interestingness, is shown by the color of each dot. We are particularly interested in the rules with the highest lift. The darker the dot, the closer the lift of that rule is to 1.1, which appears to be the highest lift value among these rules.

Regarding this plot, it is worth noting that all of the rules with high lift have support below 80%, in both 2009 and 2010. On the other hand, there are rules with both high lift and high confidence, which is encouraging.

Based on this evidence, we focus on a smaller set of rules, here called "good rules", that have the highest lift. For the 2009 data we used lift > 1.05; for the 2010 data we used lift > 1.10. An example of these good rules can be seen in Figure 4.
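A minimal sketch of this step follows, assuming the binary taxpayer-versus-indicator matrix built in Section 2.2. It brute-forces small itemsets instead of running a full Apriori implementation (which is sufficient for 14 binary columns); the thresholds follow the paper.

from itertools import combinations
import pandas as pd

def mine_rules(binary, min_support=0.60, min_confidence=0.60, max_len=3):
    """Enumerate rules X -> y over the binary indicator matrix and keep
    those above the support and confidence thresholds."""
    cols = list(binary.columns)
    support = {}                                   # frozenset of items -> support
    for k in range(1, max_len + 1):
        for items in combinations(cols, k):
            s = binary[list(items)].all(axis=1).mean()
            if s >= min_support:
                support[frozenset(items)] = s
    rules = []
    for itemset, s_all in support.items():
        if len(itemset) < 2:
            continue
        for rhs in itemset:
            lhs = itemset - {rhs}
            # Subsets of a frequent itemset are also frequent, so both are present.
            confidence = s_all / support[lhs]
            lift = confidence / support[frozenset([rhs])]
            if confidence >= min_confidence:
                rules.append({"lhs": tuple(sorted(lhs)), "rhs": rhs,
                              "support": s_all, "confidence": confidence, "lift": lift})
    return pd.DataFrame(rules)

rules_2010 = mine_rules(binary)                           # 'binary' from the Section 2.2 sketch
good_rules_2010 = rules_2010[rules_2010["lift"] > 1.10]   # 2009 used lift > 1.05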


Figure 3 - Proposed Method Overview. [The figure depicts the pipeline with the input and output of each step: Section 2.1 Tax Fraud Indicator Datasets (input: fraud indicators; output: together occurrences) -> Section 3.1 Discovering the Most Relevant Fraud Indicators (input: fraud indicators; output: reduced set of unique determinants) -> Section 3.2 Correlating Fraud Indicators (input: fraud indicators; output: similar clusters in both analyzed years) -> Section 3.3 Classifying Taxpayers from Relevant Fraud Indicators (input: relevant fraud indicators; output: fraud risk scale, F-measure 80%).]

Figure 4. A sample set of good rules: (a) 2009 dataset and (b) 2010 dataset


By analyzing the rules listed as "good rules", we identified that 72% of the rules existing in 2009 are repeated in 2010. This analysis shows that the behavior pattern of taxpayers is repeating in subsequent years, which requires urgent intervention to prevent tax evasion.

The main result achieved in this step is the reduced set of unique determinants appearing on the left-hand side of the rules, both in 2009 and in 2010. They are: C, E, F, H and L. After executing the Apriori algorithm, we obtained 2,563 rules in 2009 and 1,527 rules in 2010. From those rules, we selected 176 good rules (lift > 1.05) in 2009 and 150 good rules (lift > 1.10) in 2010.

3.2 Correlating Fraud Indicators In addition to the association rules, and in order to answer question (2), Is there a correlation among the fraud indicators used in this research?, an important tool in this analysis of taxpayer behavior is the similarity between the fraud indicators. For this analysis, we used the Manhattan similarity function.

The Manhattan similarity function was chosen for this study because it is the most suitable technique for the type of variables used: binary variables. The similarity was calculated for the fraud indicators of 2009 and 2010. In order to better visualize the correlations between the indicators, we constructed the dendrograms shown in Figure 5. Indicators joined at heights closer to zero are more similar to each other.
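A minimal sketch of this step, assuming the binary indicator matrix from Section 2.2 and SciPy's hierarchical clustering; the linkage method is an assumption, since the paper does not state which one was used.

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Rows are taxpayers, columns are the fourteen indicators A..N.
X = binary.values                          # binary matrix from the Section 2.2 sketch

# Manhattan (city block) distance between indicators, i.e. between columns.
dist = pdist(X.T, metric="cityblock")

# Hierarchical clustering; 'complete' linkage is an assumption here.
Z = linkage(dist, method="complete")

# Dendrogram of indicator similarity (repeat per year to compare 2009 and 2010).
dendrogram(Z, labels=list(binary.columns))
plt.show()

# Cutting the tree at height 400 reproduces the kind of clusters discussed below.
clusters = fcluster(Z, t=400, criterion="distance")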

Figure 5. Manhattan dendrograms

These dendrograms are useful for comparing the similarity of the 2009 and 2010 fraud indicators. For example, the 2009 indicators A, I, B and N form a group, and E and F form a pair at the same similarity level. Both exhibit the highest degrees of similarity (the closer to zero, the more similar). In 2010, these groups change and new groups are formed: (I, B, A, J) and (F, L).

In Figure 5 we can observe the formation of distinct clusters for the years 2009 and 2010. If we make a horizontal cut in the two dendrograms at height 400 (four hundred), in 2009 we have four clusters: (J), (CLEF), (DMGAIBN) and (HK). In 2010, we have two clusters: (HCLEF) and (NKGDMIBAJ). The selection of the horizontal cutoff point on the graph depends on the similarity required for forming the clusters. In this study, we used a similarity criterion of 60% (sixty percent), corresponding to height 400.

3.3 Classifying Taxpayers from Relevant Fraud Indicators So far, the analyses were used to understand the past behavior of taxpayers. However, we aim to rate each taxpayer on a risk scale, such that the scale indicates the risk of the taxpayer committing a fraud. For this, we resort to dimensionality reduction techniques to reduce the 14 fraud indicators to a single dimension. The results of the dimensionality reduction techniques are presented below.

This step addresses the question: can we reduce the number of existing indicators, without losing analysis quality, in order to optimize the inspection process? To this end, we use Principal Component Analysis (PCA) (Ullman, 2010) and Singular Value Decomposition (SVD) (Ullman, 2010).

3.3.1 Principal Component Analysis (PCA) PCA is a mathematical procedure which uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is commonly used as a tool for exploratory data analysis and for building predictive models. PCA can be performed by eigenvalue decomposition of a covariance (or correlation) matrix, or by singular value decomposition of a data matrix, usually after centering (and normalizing) the data matrix for each attribute (Abdi and Williams, 2010). PCA is the simplest of the true eigenvector-based multivariate analyses. It is defined mathematically (Jolliffe, 2002) as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data lies along the first coordinate (called the first component), the second greatest variance along the second coordinate, and so on.

The idea is to treat the set of tuples as a matrix M and find the eigenvectors of MM^T or M^T M. The matrix of these eigenvectors can be thought of as a rigid rotation in a high-dimensional space (Ullman, 2010).

At first, this study used PCA to try to reduce the universe of 14 existing indicators. However, this strategy was not very effective for this data set as can be observed in the sample correlation matrix in Table 2.

Each principal component (PC1, PC2, PC3, ...) accounts for part of the total variance of the standardized data. The first principal component accounts for about 21% of the total variance of the standardized data, whereas the first two components together reach about 35% of the total variance. To achieve at least 50% of the variance we would need the first four components (a reduction from 14 to four indicators). The variance of the first principal component is 2.95 (the squared standard deviation), much higher than the average of the variances (equal to 1). In addition, Figure 6 does not show a clear separation among the principal components. In this example, it would be possible to retain the first two or three components, indicating a reduction to two or three dimensions.
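A minimal sketch of this analysis, assuming the binary indicator matrix from Section 2.2 and scikit-learn; the standardization step and the "squared standard deviation > 1" check follow the text, but the exact preprocessing used by the authors is not stated, so this is an assumption.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = binary.values                                  # taxpayers x 14 indicators
X_std = StandardScaler().fit_transform(X)          # standardize each indicator

pca = PCA()
pca.fit(X_std)

# Proportion and cumulative proportion of variance, as in Tables 2 and 3.
print("proportion of variance:", np.round(pca.explained_variance_ratio_, 3))
print("cumulative proportion:", np.round(np.cumsum(pca.explained_variance_ratio_), 3))

# Components worth keeping under the 'squared standard deviation > 1' rule (Table 4).
keep = np.where(pca.explained_variance_ > 1)[0]
print("components with variance > 1:", [f"PC{i + 1}" for i in keep])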

Figure 6. Variance of the principal components

Thus, it would not be possible to preserve much quality when reducing to a single dimension, i.e. a single indicator.

In order to achieve the reduction to a single dimension, we envisage an alternative: using the determinants found by the Apriori algorithm (Section 3.1), namely C, E, F, H and L. That is, the Apriori algorithm found a subset of indicators as key determinants, and this subset is interpreted here as a dimensionality reduction: instead of 14, we now have five dimensions. Table 3 shows the correlation matrix for the five sample indicators (C, E, F, H and L) found by the Apriori algorithm.

Table 3. Sample correlation matrix for indicators C, E, F, H and L (2010)

                          PC1     PC2     PC3     PC4     PC5
Standard deviation        1.436   1.019   0.981   0.888   0.383
Proportion of variance    0.412   0.207   0.192   0.157   0.029
Cumulative proportion     0.412   0.620   0.812   0.970   1.000

In this case, the first principal component accounts for approximately 41% of the total variance of the standardized data, whereas the first two components together reach about 62% of the total variance. This strategy gives a better result for reducing the matrix.

To define which components to use, we calculate the squared standard deviation and choose the components whose values are greater than 1, as shown in Table 4.

Table 4. Square of the standard deviation of each principal component (2010)

                            PC1     PC2     PC3     PC4     PC5
Standard deviation ^ 2      2.063   1.038   0.962   0.788   0.147

From the values obtained, the components to be used are PC1 and PC2. Figure 7 reinforces this choice by showing these two values as the highest.

Figure 7. Calculating the standard deviation squared of each principal component

The goal is to reduce the matrix to a single dimension. In this case, each variable should be related to the chosen principal component: PC1 or PC2. This relationship is made using the eigenvectors of each component (Figure 8).

Table 2. Sample correlation matrix for all indicators

                          PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9    PC10   PC11   PC12   PC13   PC14
Standard deviation        1.717  1.401  1.100  1.017  0.994  0.976  0.944  0.929  0.926  0.877  0.800  0.718  0.596  0.000
Proportion of variance    0.210  0.140  0.086  0.074  0.070  0.068  0.063  0.061  0.061  0.055  0.045  0.036  0.025  0.000
Cumulative proportion     0.210  0.351  0.437  0.511  0.582  0.650  0.713  0.775  0.836  0.891  0.937  0.974  1.000  1.000

Figure 8. Relationship of each variable with PC1 and PC2

By analyzing Figure 8, we can observe that the plot has the characteristic of an S-curve, not a straight line. However, as discussed next, the Singular-Value Decomposition (SVD) technique yielded better results than PCA for the data analyzed in our study.

3.3.2 Singular-Value Decomposition (SVD) Following the same principle as in the PCA analysis, the SVD technique was also applied to the determinants found by the Apriori algorithm: C, E, F, H and L. Figure 9 shows how the matrix of this study was reduced to a single dimension. This figure shows the SVD data reduction for 2010, as follows: (a) the fifty taxpayers with the lowest values of fraud indicators; and (b) the one thousand taxpayers with the lowest values of fraud indicators. This technique proved more feasible for reducing the dimensionality of the data used in this study, because it is simpler and yields a straight line, exactly as expected. However, it must be applied to all the data again whenever a new row is inserted into the matrix, since the SVD changes according to the values present in the sample.

Figure 9. SVD reduction applied to the 2010 dataset: (a) fifty taxpayers with the lowest values of fraud indicators; and (b) one thousand taxpayers with the lowest values of fraud indicators

The results point to the possibility of creating a fraud risk scale for taxpayers using SVD. Thus, we answer the final question: can we set a scale to indicate the risk of a taxpayer committing fraud?
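A minimal sketch of this reduction, assuming the aggregated counts matrix from Section 2.2 restricted to the determinants C, E, F, H and L; using the leading singular direction as the one-dimensional fraud score is our reading of the text, not code from the paper.

import numpy as np

# Restrict the taxpayer matrix to the determinants found by Apriori.
M = counts[["C", "E", "F", "H", "L"]].to_numpy(dtype=float)   # 'counts' from the Section 2.2 sketch

# Thin SVD: M = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Project each taxpayer onto the first right singular vector to obtain
# a single score per taxpayer (the rank-1 reduction).
score = M @ Vt[0]            # equivalently U[:, 0] * s[0]

# Rank taxpayers on the resulting fraud risk scale (highest score first).
ranking = counts.index.to_numpy()[np.argsort(score)[::-1]]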

4. RESULTS EVALUATION In order to evaluate our classification method, we resorted to experienced tax auditors who manually verified its results. We used the F-measure formula to analyze the precision and recall of our approach. In fact, the F-measure enforces a better balance between performance on the minority and majority classes, and is more suitable in the case of imbalanced data, which arises quite frequently in real-world applications. In this evaluation we applied the following methodology. We selected the 120 contributors with the highest values in the ranking list computed by the SVD method, which represent a group of potential fraudsters. Then we selected the 50 contributors with the lowest SVD values, which in turn correspond to potential non-fraudsters. Subsequently, we submitted these two groups (i.e. 120 potential fraudsters and 50 potential non-fraudsters) to be analyzed by 2 experienced tax auditors. After their analysis, the auditors concluded that, from the group of 120 classified as fraudsters, 85 were correct, and from the group of 50 selected as non-fraudsters, 23 actually were not fraudsters.

We organized the auditors' evaluation into a confusion matrix (shown in Table 5), which presents the positive and negative classification of our results. From this table, we obtain the values of Accuracy (A), Precision (P) and Recall (R). A True Positive (TP) occurs when a fraudster contributor was correctly selected by our method (a correct decision). A False Positive (FP) occurs when a non-fraudster contributor was wrongly selected by our method (an incorrect decision). A False Negative (FN) occurs when a fraudster contributor was wrongly not selected (an incorrect decision). A True Negative (TN) occurs when a non-fraudster contributor was correctly not selected (a correct decision).

Table 5. Confusion matrix with positive and negative classification

                                         Is a real fraudster?
                                         Yes                       No
Selected (positive classification)       85  True Positive (TP)    35  False Positive (FP)
Not selected (negative classification)    7  False Negative (FN)   23  True Negative (TN)

Below we present the computed values of Accuracy (A), Precision (P) and Recall (R). These values demonstrate the effectiveness of our classification method. Since we achieved 80% in F-measure, we may conclude that this method correctly captures potential fraudsters.

Accuracy = (TP + TN) / (TP + FP + FN + TN) = (85 + 23) / (85 + 35 + 7 + 23) = 72.00%

Precision (P) = TP / (TP + FP) = 85 / (85 + 35) = 70.83%

Recall (R) = TP / (TP + FN) = 85 / (85 + 7) = 92.39%

F-measure = 2 P R / (P + R) = 80.19%
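The same computation, as a small self-contained check with the counts taken directly from Table 5:

# Counts from the auditors' confusion matrix (Table 5).
tp, fp, fn, tn = 85, 35, 7, 23

accuracy = (tp + tn) / (tp + fp + fn + tn)                   # 0.72
precision = tp / (tp + fp)                                   # ~0.7083
recall = tp / (tp + fn)                                      # ~0.9239
f_measure = 2 * precision * recall / (precision + recall)    # ~0.8019

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F-measure: {f_measure:.2%}")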

5. RELATED WORK Glancy and Yadav [7] proposed a quantitative model for detecting fraudulent financial reporting. The model detects attempts to conceal information and/or present incorrect information in annual filings with the US Securities and Exchange Commission (SEC), and it uses essentially all of the information contained in a text document for fraud detection. Ngai et al. [8] presented a review of, and classification scheme for, the literature on the application of data mining techniques to the detection of financial fraud. The findings of this review clearly show that data mining techniques have been applied most extensively to the detection of insurance fraud, although corporate fraud and credit card fraud have also attracted a great deal of attention in recent years. The main data mining techniques used for financial fraud detection are logistic models, neural networks, the Bayesian belief network, and decision trees, all of which provide primary solutions to the problems inherent in the detection and classification of fraudulent data. Bhattacharyya et al. [9] evaluated two advanced data mining approaches, support vector machines and random forests, together with the well-known logistic regression, as part of an attempt to better detect (and thus control and prosecute) credit card fraud. The study was based on real-life data of transactions from an international credit card operation.

Ravisankar et al. [10] used data mining techniques such as Multilayer Feed Forward Neural Network (MLFF), Support Vector Machines (SVM), Genetic Programming (GP), Group Method of Data Handling (GMDH), Logistic Regression (LR), and Probabilistic Neural Network (PNN) to identify companies that resort to financial statement fraud. Each of these techniques was tested on a dataset involving 202 Chinese companies and compared with and without feature selection. PNN outperformed all the techniques without feature selection, while GP and PNN outperformed the others with feature selection, with marginally equal accuracies.

Kirkos et al. [11] explored the effectiveness of Data Mining (DM) classification techniques in detecting firms that issue fraudulent financial statements (FFS) and dealt with the identification of factors associated with FFS. The study investigated the usefulness of Decision Trees, Neural Networks and Bayesian Belief Networks in the identification of fraudulent financial statements, comparing the three models in terms of their performance.

Serrano et al. [12] proposed the use of association rules to extract knowledge so that normal behavior patterns may be obtained from unlawful transactions in transactional credit card databases, in order to detect and prevent fraud. The proposed methodology was applied to credit card fraud data from some of the most important retail companies in Chile. Li et al. [13] applied Bayesian Classification and Association Rules to identify the signs of fraudulent accounts and the patterns of fraudulent transactions. Detection rules were developed based on the identified signs and applied to the design of a fraudulent account detection system. Empirical verification showed that this fraudulent account detection system can successfully identify fraudulent accounts at early stages and is able to provide a reference for financial institutions.

Phua et al. [14] presented a survey that categorizes, compares, and summarizes almost all published technical and review articles on automated fraud detection published between 2000 and 2010. This survey discussed the main methods and techniques used to detect fraud automatically, together with their problems.

6. CONCLUSION This paper proposes a method for classifying taxpayers in order to help detect potential fraudsters. In our experiments, we discovered key patterns. Through statistical techniques we observed that: (1) the taxpayers analyzed show a high frequency of tax evasion indicators, (2) the indicators studied showed an increase in frequency from 2009 to 2010, (3) indicators C, E, F, G and H have the highest frequency in both periods, and (4) there are many fraud indicator sets that occur with great frequency in both periods. These analyses now enable decisions about the indicators presented, in an attempt to reduce them in subsequent tax years.

We used the association rule method to verify the existence of indicators that determine others. This perspective reveals that taxpayers tend to commit different types of fraud together. Moreover, analysis of the rules listed as "good rules" revealed that 72% of the rules existing in 2009 are repeated in 2010.

Another technique used in this study was the grouping of fraud indicators by the similarity between them. With this technique, it was possible to identify groups of fraud indicators according to their similarity (Figure 5).

We have also analyzed the feasibility of dimensionality reduction techniques as a way to create a fraud risk scale. Thus, we investigated two dimensionality reduction techniques to reduce the fourteen fraud indicators to a single dimension. For this purpose, the Singular Value Decomposition (SVD) technique proved more feasible than Principal Component Analysis (PCA), indicating that it is possible to create a scale that identifies taxpayers' propensity for fraud.

Last but not least, our method proved to be accurate in detecting fraudsters, achieving an F-measure of 80%. Indeed, this method has two important advantages: it is an unsupervised method, and it needs to evaluate only a few fraud indicators to infer potential fraudsters. Clearly, this method is of great importance to Brazilian fiscal agencies, since it enables immediate actions that may mitigate fraud.

As future work, we intend to investigate other dimensionality reduction techniques, as well as outlier detection techniques, to find new evidence of fraud that is not so trivially perceived and thus improve fraud detection.

7. REFERENCES

[1] Abdi, H., Williams, L.J., 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2: 433-459.

[2] Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. Intl. Conf. on Very Large Databases, pp. 487-499.

[3] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. The KDD Process for Extracting Useful Knowledge from Volumes of Data. ACM.

[4] Jolliffe, I.T., 2002. Principal Component Analysis, Springer Series in Statistics, 2nd ed., Springer, NY. ISBN 978-0-387-95442-4.

[5] Santos, F.F., 2010. Selecionando Candidatos a Descritores para Agrupamentos Hierárquicos de Documentos utilizando Regras de Associação [Selecting Candidate Descriptors for Hierarchical Document Clustering using Association Rules], Master's dissertation, USP.

[6] Ullman, J.D., 2010. Mining of Massive Datasets.

[7] Glancy, F.H., Yadav, S.B., 2011. A computational model for financial reporting fraud detection. Decision Support Systems, vol. 50, no. 3, pp. 595-601.

[8] Ngai, E., Hu, Y., Wong, Y., Chen, Y., Sun, X., 2011. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, vol. 50, no. 3, pp. 559-569.

[9] Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J.C., 2011. Data mining for credit card fraud: A comparative study. Decision Support Systems, vol. 50, no. 3, pp. 602-613.

[10] Ravisankar, P., Ravi, V., Raghava Rao, G., Bose, I., 2011. Detection of financial statement fraud and feature selection using data mining techniques. Decision Support Systems, vol. 50, no. 2, pp. 491-500.

[11] Kirkos, E., Spathis, C., Manolopoulos, Y., 2007. Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4), pp. 995-1003.

[12] Sánchez, D., Vila, M.A., Cerda, L., Serrano, J.M., 2009. Association rules applied to credit card fraud detection. Expert Systems with Applications, 36(2), pp. 3630-3640.

[13] Li, S.-H., Yen, D.C., Lu, W.-H., Wang, C. Identifying the signs of fraudulent accounts using data mining techniques.

[14] Phua, C., Lee, V., Smith, K., Gayler, R. A Comprehensive Survey of Data Mining-based Fraud Detection Research.