
Improving Credit Card Fraud Detection using a

Meta-Learning Strategy

by

Joseph King-Fung Pun

A thesis submitted in conformity with the requirements

for the degree of Master of Applied Science

Graduate Department of Chemical Engineering and Applied Chemistry

University of Toronto

© Copyright by Joseph King-Fung Pun 2011


Improving Credit Card Fraud Detection using a Meta-Learning Strategy

Joseph King-Fung Pun M.A.Sc. Chemical Engineering and Applied Chemistry University of Toronto 2011

Abstract

One of the issues facing credit card fraud detection systems is that a significant percentage of

transactions labeled as fraudulent are in fact legitimate. These “false alarms” delay the detection

of fraudulent transactions. Analysis of 11 months of credit card transaction data from a major

Canadian bank was conducted to determine savings improvements that can be achieved by

identifying truly fraudulent transactions. A meta-classifier model was used in this research. This

model consists of 3 base classifiers constructed using the k-nearest neighbour, decision tree, and

naïve Bayesian algorithms. The naïve Bayesian algorithm was also used as the meta-level

algorithm to combine the base classifier predictions to produce the final classifier. Results from

this research show that when a meta-classifier was deployed in series with the Bank’s existing

fraud detection algorithm a 24% to 34% performance improvement was achieved resulting in

$1.8 to $2.6 million cost savings per year.


Acknowledgements

I would like to express my sincerest gratitude to my supervisor Professor Yuri Lawryshyn for his

constant support, encouragement, and guidance. Throughout my thesis-writing period he

provided helpful advice, cherished teachings, and lots of good ideas. I would have been lost

without him.

I am grateful to Professor Joseph Paradi for his valuable input for my research and for providing

such a wonderful environment at CMTE. I am grateful to Dr. Judy Farvolden for her continual

support during my stay at CMTE and for her many encouraging words of advice.

I would like to thank my colleagues at CMTE for providing a stimulating and fun environment in

which to learn and grow. I am especially grateful to Kelsey Barton, Steve Frensch, Pulkit Gupta,

Leili Javanmardi, Erin Kim, Laleh Kobari, Alex LaPlante, Elizabeth Min, Susan

Mohammadzadeh, Colin Powell, Muhammad Saeed, Sanaz Sigaroudi, Justin Toupin, Angela

Tran, Marinos Tryphonas, D’Andre Wilson, and Haiyan Zhu.

I wish to thank Sau Yan Lee and Dan Tomchyshyn for providing networking and computer

assistance and many thanks to the Chemical Engineering administrative staff for their support

especially to Joan Chen, Leticia Gutierrez, Pauline Martini, Phil Milczarek, and Gorette Silva.

I am extremely blessed to have so many friends and family that have supported me throughout

my study at the University of Toronto. I thank you all from the bottom of my heart.

Lastly, I would like to thank my parents, Angela Pun and Stewart Pun, for their unending love

and encouragement. I thank God for having them in my life.


Table of Contents

Abstract
Acknowledgements
Table of Contents
Executive Summary
1 Introduction
1.1 Problem Statement
1.2 Credit Card Fraud in Canada
1.3 Organization of Thesis
2 Fraud Solution Approaches
2.1 Supervised and Unsupervised Learning
2.2 Base Classifiers
2.2.1 Naïve Bayesian
2.2.2 Bayesian Network
2.2.3 Decision Tree – C4.5
2.2.4 K-Nearest Neighbours
2.2.5 Support Vector Machines
2.2.6 Neural Networks
2.2.7 Logistic Regression
2.3 Introduction to Combination Strategies in Data Mining
2.3.1 Examples using Meta-learning: Applying the bagging, boosting, and stacking methodologies
2.3.1.1 Bagging Example
2.3.1.2 Boosting Example
2.3.1.3 Stacking Example
3 Literature on Credit Card Fraud Detection
3.1 Single and Multi-Algorithm Techniques for Fraud Detection used in Literature
3.2 Meta-Learning in Credit Card Fraud Detection
3.3 Meta-Learning and the Combiner Strategy
3.3.1 The Combiner Strategy in Detail
4 Methodology
4.1 Software Used
4.2 Data preparation
4.3 Diversity – Selecting base classifiers
4.4 Selecting the Training, Validation, and Testing Dataset Sizes
4.5 Constructing the Meta-classifier
4.5.1 Meta-Learning Stage 1
4.5.2 Meta-Learning Stage 2 & 3
4.5.3 Meta-Learning Stage 4
4.6 Performance Evaluation of the Meta-Classifier
4.6.1 Ranking
4.6.2 Performance Evaluations
5 Results & Discussion
5.1 Falcon Score Distribution
5.2 Base Algorithm Selection
5.3 Training, Validation, and Testing Dataset Selection
5.4 Meta-Classifier Performance Evaluation
5.4.1 Evaluating the Meta-Classifier: True Positive and False Negative Evaluation
5.4.2 Evaluating the Meta-Classifier: Correctly Classified TP Evaluation
6 Conclusion and Future Work
6.1 Meta-Classifier Probabilities and Falcon Scores
6.2 Improving the Meta-Classifier
6.3 Implementing the Meta-Classifier
7 Glossary of Terms
8 References
Appendix A: Implementation of Base Algorithms on Simple Datasets
Appendix B: Pre-processing and Data Cleansing of Raw Dataset
Appendix C: Example of how Weka calculates the Root Mean Squared Error


Executive Summary

Currently, major Canadian banks rely heavily on a neural network based engine called the

Falcon Fraud Manager in the detection of fraudulent credit card transactions. The Falcon Fraud

Manager generates a Falcon score for each credit card transaction. This Falcon score ranges from

1 to 999, where 1 represents the lowest and 999 represents the highest chance of a fraudulent

transaction. Analysis of credit card transaction data from a collaborating bank showed that

transactions with Falcon scores from 991 to 999 had four times more fraud than transactions with

Falcon scores from 900 to 910. This suggests that the Falcon scoring metric is able to identify

transactions that are more likely to be fraudulent. However, the data also show that the majority

of transactions with Falcon scores greater than or equal to 900 are actually legitimate and on

average only 10% of transactions with Falcon scores greater than or equal to 900 are fraudulent.

Since the Bank relies heavily on Falcon scores to determine fraudulent activity, many fraud

analysts are investigating transactions that are in fact legitimate. This creates scenarios in which

resources are used to investigate legitimate transactions that are considered to be fraudulent, the

investigation of fraudulent transactions is delayed, and unnecessary concerns for customers are

produced.
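The investigation workload implied by these figures can be illustrated with simple arithmetic. The sketch below applies the roughly 10% fraud rate reported above for Falcon scores of 900 and above to a hypothetical batch of 1,000 flagged transactions; the batch size and the function name are assumptions made for illustration and are not taken from the Bank's data.

```python
def flagged_breakdown(n_flagged, fraud_rate):
    """Split a batch of flagged transactions into expected fraud and false alarms."""
    expected_fraud = round(n_flagged * fraud_rate)
    false_alarms = n_flagged - expected_fraud
    # Average number of investigations needed to reach one fraudulent transaction.
    investigations_per_fraud = n_flagged / expected_fraud
    return expected_fraud, false_alarms, investigations_per_fraud


# Hypothetical batch size; the 10% rate is the average reported for scores >= 900.
fraud, alarms, per_fraud = flagged_breakdown(n_flagged=1000, fraud_rate=0.10)
print(fraud, alarms, per_fraud)
```

Under these assumptions an analyst would, on average, work through ten flagged transactions for every fraudulent one, which is the inefficiency the meta-classifier is intended to reduce.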

This work proposes the use of a meta-classifier to act as a filter for the Falcon data. The

meta-classifier uses the predictions of different base classifiers to determine the final prediction

of a transaction. The objective of the meta-classifier is to filter out the fraudulent transactions

from the legitimate transactions. The meta-classifier was chosen because this methodology uses

the combination of multiple algorithms to detect credit card fraud. Past research has shown that

learning algorithms have their own set of assumptions, and by using multiple algorithms the


strength of one algorithm can complement the weakness of another. Furthermore, past studies

have shown that probability based models can outperform neural network models. Analysis of

11 months of credit card transaction data from a major Canadian bank was used to construct the

meta-classifier model. The results from this research showed that the best number of base

classifiers to use was a combination of 3 classifiers, and the best algorithms to train the 3 base

classifiers were found to be the k-nearest neighbour, decision tree, and naïve Bayesian

algorithms. A meta-level algorithm was then used to combine the predictions of the 3 base

classifiers to produce the meta-classifier. The final predictions for transactions were produced

using the meta-classifier. The naïve Bayesian algorithm was used as the meta-level algorithm

because past research has shown that the naïve Bayesian algorithm provides the best prediction

accuracy in meta-learning.
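The meta-level step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation (which was built with other software): the base classifier outputs are supplied as hypothetical, pre-computed labels in the order (k-nearest neighbour, decision tree, naïve Bayesian), the validation labels are invented, and Laplace smoothing is an added assumption. Only the idea of training a naïve Bayesian model on the base predictions comes from the text.

```python
from collections import Counter, defaultdict


def train_meta_naive_bayes(base_predictions, labels):
    """Fit a naive Bayes model whose features are the base classifiers' predictions."""
    class_counts = Counter(labels)
    # feature_counts[label][i][value]: how often base classifier i predicted
    # `value` on instances whose true class is `label`.
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for preds, label in zip(base_predictions, labels):
        for i, value in enumerate(preds):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts


def predict_meta(model, preds, values=("fraud", "legit")):
    """Return the class with the highest naive Bayes score for one prediction vector."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # class prior
        for i, value in enumerate(preds):
            # Laplace smoothing (an assumption) avoids zero probabilities.
            score *= (feature_counts[label][i][value] + 1) / (count + len(values))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical validation set: (kNN, decision tree, naive Bayes) predictions
# paired with the true class of each transaction.
validation_preds = [
    ("fraud", "fraud", "fraud"), ("fraud", "fraud", "legit"),
    ("fraud", "legit", "fraud"), ("legit", "fraud", "fraud"),
    ("legit", "legit", "legit"), ("legit", "legit", "legit"),
    ("fraud", "legit", "legit"), ("legit", "legit", "fraud"),
]
validation_labels = ["fraud"] * 4 + ["legit"] * 4

model = train_meta_naive_bayes(validation_preds, validation_labels)
print(predict_meta(model, ("fraud", "fraud", "legit")))
print(predict_meta(model, ("legit", "legit", "fraud")))
```

Because the meta-level model learns how reliable each base classifier is for each class, it can overrule a single dissenting base prediction instead of treating all three votes equally.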

By implementing a meta-classifier in series with the Bank’s existing fraud detection

algorithm a 24% to 34% performance improvement was achieved resulting in $1.8 to $2.6

million cost savings per year. The meta-classifier investigation method caught more fraudulent accounts and missed fewer fraudulent accounts than the Bank’s Falcon-based investigation methods, and it avoided investigating legitimate transactions, which frees up resources to investigate other transactions. The meta-classifier method also investigated fraudulent transactions earlier, thereby reducing fraud losses.


1 Introduction

In today’s increasingly internet-dependent society, the use of credit cards has become convenient and necessary. Credit card transactions have become the de facto standard for internet e-commerce. Statistics Canada reports that approximately $15 billion was spent on online orders for goods and services alone in 2009, and 84% of all online consumers paid directly over the internet rather than paying in-store (Statistics Canada 2010). Consumers’ demand for electronic transactions, driven by their convenience and ease of use, and the rise in e-commerce have opened up new opportunities for criminals to steal credit card numbers and consequently commit fraud (Royal Canadian Mounted Police 2010).

The volume of credit card transactions continues to grow, leading to a higher risk of stolen account numbers and to fraud losses for financial institutions (FIs) (The Nilson Report 2010). Fraud detection has become an essential tool for maintaining the viability of the payment system and for ensuring that losses are reduced to a minimum. A secure and trusted banking

network for electronic commerce requires high speed verification and authentication mechanisms

that allow legitimate users easy access to conduct their business, while preventing fraudulent

transaction attempts by others. Currently, FIs use a third party neural network based fraud

detection system called the Falcon Fraud Manager (FFM) to detect fraudulent credit card

transactions (Tavan 2011). Fraud is a serious problem faced by credit card issuers and can cause

large financial losses.

According to the Basel Committee on Banking Supervision, fraud can be divided into 2

types: internal fraud and external fraud (Basel Committee on Banking Supervision 2006).

Businesses are always susceptible to internal fraud or corruption from their management or employees, while external fraud mainly involves using a stolen, fake, or counterfeit credit card to make purchases or obtain cash in disguised forms. This thesis focuses on external card fraud, which accounts for the majority of credit card fraud in Canada (Royal Canadian Mounted Police 2010). Credit card fraud can be either offline or online. Offline fraud is committed by using a stolen physical card at a storefront or call center. The institution issuing the card can lock the account before it is used in a fraudulent manner. Online fraud is committed via the web, phone shopping, or other cardholder-not-present situations. The main objective in fraud detection is to identify fraud as quickly as possible once it is committed (Bolton and Hand 2002).

The purpose of this work is to apply data mining strategies to a unique and updated

Canadian dataset (a neural network filtered dataset), and to investigate whether a meta-learning

strategy (a combination methodology) has the potential to save money and improve fraud

detection. This work primarily aims to improve current fraud detection processes by improving

the prediction of fraudulent accounts.

1. Modeling techniques. Neural network (NN) models are heavily studied in current literature and these models are the main tools used in current commercial systems. However, research has shown that simpler algorithms can outperform neural networks in the credit card fraud domain (Maes, et al. 2002). Furthermore, the aim of this thesis is not to replace the main Falcon score fraud detection system but to supplement it by applying combinations of algorithms, using a ‘meta-learning’ strategy, as a post-processing step.

2. Updated dataset. The most recent studies on credit card fraud using data mining techniques were conducted in 2006 (Ngai, et al. 2011), while the meta-learning strategy (combining multiple algorithms to create a new classifier, the ‘meta-classifier’) was last studied on datasets in 1999 (Ngai, et al. 2011), (Bolton and Hand 2002). Fraud patterns change constantly because new, uncaught fraudulent transactions occur frequently, which suggests that criminals continually change their techniques to get around patterns that have already been caught. This constant change makes it essential to re-evaluate the fraud detection performance of the meta-classifier.

Based on the findings of Ehramikar (2000) and on the analysis in Section 4.2 and Section 5.1, the

motivation for applying meta-classification stems from the fact that in the current fraud detection

systems that utilize neural network models for classification, approximately 90% of transactions

flagged as potentially fraudulent are false positives, that is, the transactions are flagged as

fraudulent even though they are legitimate. It would be beneficial to apply alternative algorithms

to the output of a neural network model to help improve prediction accuracy. A comparative

study between the Bayesian Belief Network (BBN) and Artificial Neural Network method shows

that BBNs were more accurate and much faster to train using real world credit card data (Maes,

et al. 2002). This suggests that the neural network algorithm might not be the best method for

credit card fraud prediction and that there is potential for further improvements in a neural

network system by utilizing alternative algorithms. There have been few reported studies of

credit card fraud detection using data mining techniques in the literature in recent years. Among

the reported credit card fraud studies most have focused on using neural networks (Ngai, et al.

2011), (Bhattacharyya, et al. 2011). Since the fraud detection system currently used by FIs is

already based on the neural network algorithm the meta-learning strategy should use alternative

types of algorithms. Therefore the focus of this thesis is to investigate credit card fraud detection algorithms that were popular and successful in the literature during the 1990s and early 2000s, such as decision trees, logistic models, k-nearest neighbours, and Bayesian networks.

1.1 Problem Statement

It is claimed by FIs that fraudulent credit card transactions increase exponentially with time for a

cardholder’s account (Trepanier 2009). Therefore, the faster a fraudulent account is deactivated,

the less money is lost. To address this problem, FIs are employing preventive measures such as

fraud detection systems, one of which is the "Falcon Fraud Manager" (FFM), a neural network system offered by Fair Isaac Corporation¹. This fraud detection system (FDS)

scores transactions for the likelihood of fraud in real time. When these “Falcon” scores hit a

threshold set by the FIs, a case is created and those accounts are passed to the fraud analysts for

further follow up. Fraud analysts are security officers trained to examine a cardholder’s credit

card transaction behaviours and they can determine the potential risk associated with the flagged

accounts. Very often an ‘unusual’ transaction is legitimate and credit card issuers are anxious not

to inadvertently offend a cardholder by acting too hastily and blocking his or her account,

especially in cases where the fraud analyst is unable to reach the cardholder to verify the transactions (Trepanier 2009).
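The case-creation step described above can be sketched as a simple threshold rule. This is an illustrative sketch only: the threshold of 900, the account identifiers, the scores, and the ordering of cases by score are all assumptions, since the thresholds actually set by the FIs and their case-handling rules are not stated here.

```python
def create_cases(scored_transactions, threshold):
    """Return accounts whose highest Falcon score reaches the threshold."""
    highest = {}
    for account, score in scored_transactions:
        if not 1 <= score <= 999:
            raise ValueError(f"Falcon scores range from 1 to 999, got {score}")
        highest[account] = max(score, highest.get(account, 0))
    flagged = [account for account, score in highest.items() if score >= threshold]
    # Assumption: cases are handed to analysts with the highest score first.
    return sorted(flagged, key=lambda account: highest[account], reverse=True)


# Hypothetical (account, Falcon score) pairs.
transactions = [("A", 120), ("B", 905), ("A", 310), ("C", 998), ("B", 640)]
cases = create_cases(transactions, threshold=900)
print(cases)
```

A rule of this kind flags an account as soon as any one of its transactions scores at or above the threshold, which is why a single unusual but legitimate purchase can send an account to an analyst.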

Although the FFM has shown good results in reducing fraud, the majority of cases flagged by this system are legitimate accounts (approximately 90% false positives for transactions with Falcon scores of 900 and above), resulting in a substantial loss of resources and time for the investigation of truly fraudulent accounts. As discussed in Chapter 5, the credit card data received from the collaborating FI show that there is an exponential increase in fraudulent transactions as Falcon scores increase from 900 to 999. However, the same data show a large disparity between the percentages of legitimate and fraudulent transactions among transactions with Falcon scores greater than or equal to 900. On average, only 10% of the transactions with Falcon scores greater than or equal to 900 are fraudulent, while the other 90% are legitimate. A similar problem was also present in the work conducted by Ehramikar (2000), where it was found that 90% of cases flagged by the neural network based FDS were false positives. Although a fraud analyst might come to the conclusion that the activity of the flagged account is legitimate, FIs’ policy requires them to call every individual cardholder for the verification of transactions (Ehramikar 2000). This results in three major problems:

1. The costs associated with investigating a large number of False Positives (FPs – transactions that are flagged as fraudulent but are actually legitimate) can become very high.

2. Inefficient use of resources. A substantial amount of time is spent investigating FPs (Ehramikar 2000). If the number of FP investigations can be lowered, then fraud analysts can spend more time investigating truly fraudulent cases (TPs – True Positives), preventing more losses to the financial institution. By identifying more TPs, fraud is caught earlier and more transactions can be investigated by analysts.

3. Not all of the suspicious transactions are necessarily fraudulent. The process of confirming every transaction that deviates from the cardholder’s usual behavior results in potential customer dissatisfaction.

¹ FICO Falcon Fraud Manager - http://www.fico.com/en/Products/DMApps/Pages/FICO-Falcon-Fraud-Manager.aspx


1.2 Credit Card Fraud in Canada

There were approximately 72 million credit cards in circulation across Canada in 2009, with a

retail sales volume exceeding $267 billion (Schulz 2010). Payment card counterfeiters are now

using the latest computer devices (embossers, encoders, and decoders often supported by

computers) to read, modify, and implant magnetic stripe information on counterfeit payment

cards. Fraudulent identification has been used to obtain government assistance, personal loans,

unemployment insurance benefits and for other schemes victimizing governments, individuals,

and corporate bodies (Royal Canadian Mounted Police 2010).

According to the Royal Canadian Mounted Police, the criminal use of credit cards can be

divided into 4 categories: counterfeit credit cards, no-card fraud, cards lost or stolen, and

impersonation fraud. Counterfeit credit cards represent the largest category of credit card fraud involving Canadian-issued cards. As shown in Table 1-1, counterfeit credit cards and e-commerce fraud represented 44% and 39% of all credit card losses in 2009 respectively. This is a decrease of 19% for counterfeit credit cards but an increase of 9% for e-commerce fraud relative to 2008 (Royal Canadian Mounted Police 2010). Organized criminals have acquired the technology that allows them to "skim" the data contained on magnetic stripes, manufacture counterfeit cards, and overcome protective features such as holograms.

Fraud committed without the actual use of a card (no-card fraud) accounts for 32% of all the

losses (Royal Canadian Mounted Police 2010). Deceptive telemarketers and fraudulent internet

websites obtain specific card details from their victims, while promoting the sale of exaggerated

or non-existent goods and services. This, in turn, results in fraudulent charges against victims'

accounts.


Fraud committed on cards not received by the legitimate cardholder (non-receipt fraud)

occurs when cards are intercepted prior to delivery to the cardholder. Losses attributable to mail

theft have declined as a result of "card activation" programs, where cardholders must call their

financial institution to confirm their identity before the card is activated. In 1992 the non-receipt fraud category accounted for 16% of total losses, but by 2008 this number had dropped to just 3%

(Royal Canadian Mounted Police 2010).

Fraud on cards obtained through false applications involves criminals impersonating a creditworthy individual in order to acquire credit cards. A technique that is often used in this type of fraud is called “phishing”. Fraudsters impersonate a financial institution (FI) or other institution in emails that entice users to divulge sensitive information such as usernames, passwords, and credit card information.

Table 1-1: Credit Card Fraud Statistics in Canada for 2008-2009 (Royal Canadian Mounted Police 2010)

Payment Card Partner Losses by Type for 2008-2009

Category                                               Loss in $CAD in 2008   Loss in $CAD in 2009   Change
Lost                                                   $16,505,213            $13,599,382            -18%
Stolen                                                 $32,293,078            $27,208,823            -16%
Non-receipt                                            $13,239,049            $6,088,948             -54%
Fraudulent applications                                $11,013,923            $4,707,088             -57%
Counterfeit                                            $196,653,970           $158,809,947           -19%
Fraudulent e-commerce, telephone and mail purchases    $128,362,477           $140,443,893           +9%
Miscellaneous, not defined                             $9,662,029             $7,503,210             -22%
Total                                                  $407,729,739           $358,361,292           -12%
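As a check on the table, the Change column and the 2009 loss shares quoted in the text can be recomputed from the dollar figures. The short sketch below copies the figures from Table 1-1; the variable and function names are illustrative, and rounding to the nearest whole percent is an assumption about how the published percentages were produced.

```python
# Losses in $CAD as (2008, 2009), copied from Table 1-1.
losses = {
    "Lost": (16_505_213, 13_599_382),
    "Stolen": (32_293_078, 27_208_823),
    "Non-receipt": (13_239_049, 6_088_948),
    "Fraudulent applications": (11_013_923, 4_707_088),
    "Counterfeit": (196_653_970, 158_809_947),
    "E-commerce, telephone and mail": (128_362_477, 140_443_893),
    "Miscellaneous": (9_662_029, 7_503_210),
}
total_2008, total_2009 = 407_729_739, 358_361_292  # "Total" row of the table


def percent_change(old, new):
    """Year-over-year change, rounded to the nearest whole percent."""
    return round(100 * (new - old) / old)


def share_of_2009(category):
    """Share of total 2009 losses, rounded to the nearest whole percent."""
    return round(100 * losses[category][1] / total_2009)


changes = {name: percent_change(old, new) for name, (old, new) in losses.items()}
total_change = percent_change(total_2008, total_2009)

for name, change in changes.items():
    print(f"{name}: {change:+d}% change, {share_of_2009(name)}% of 2009 losses")
print(f"Total: {total_change:+d}%")
```

With whole-percent rounding, the computed values agree with the Change column and with the 44% (counterfeit) and 39% (e-commerce) shares cited for 2009.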


1.3 Organization of Thesis

The thesis is organized as follows. Chapter 2 describes different algorithm techniques used in

fraud detection. The base algorithms used in the meta-classification process are introduced, and

the combination strategies of multiple algorithms are explained in detail. Chapter 3 presents the

literature on credit card fraud detection techniques and outlines the strategy used in combining

multiple algorithms. The algorithms that are considered for the combination strategy are

discussed in detail. Chapter 4 is the methodology chapter of the thesis and outlines the four

major stages in the meta-classification process. The performance evaluation of the meta-

classifier is discussed and the ranking and evaluation models are presented. Chapter 5 presents

the results and discussion of the performance evaluation models comparing the FI investigation

method and the meta-classifier investigation method. Chapter 6 concludes the thesis by

presenting the key results of the evaluations and discusses future work that can be done to

improve the meta-classification system.


2 Fraud Solution Approaches

Credit card fraud prevention is the first line of defense in reducing costs associated with credit

card fraud. Once fraud prevention fails, it is essential for fraud detection methods to identify

fraud as soon as possible. Data mining techniques are relevant to fraud detection because there is

a need for fast and efficient algorithms to search for patterns in large databases. In this chapter

detailed descriptions of data mining techniques are outlined, beginning with the introduction of

the two categories of machine learning – supervised and unsupervised learning. The algorithms

for different fraud detection techniques are discussed, and the three main techniques in

combining multiple algorithms are described.

Older fraud detection software tools have their roots in statistics (cluster analysis),

whereas the more recent tools are based in data mining (due to increased power of modern

computers and massive datasets) (Witten and Frank 2005). Data mining is a process of extracting

patterns from data, and a process of analyzing data from different perspectives and summarizing

it into useful information - information that can be used to increase revenue, cut costs, or both.

Data mining allows users to analyze data from many different dimensions or angles, categorize

it, and summarize the relationships identified. Technically, data mining is the process of finding

correlations or patterns among dozens of fields in large databases (Witten and Frank 2005).

Machine learning in general falls into two main categories, supervised learning and unsupervised

learning (Kotsiantis 2007).

2.1 Supervised and Unsupervised Learning

Fraud detection methods can be categorized into either supervised or unsupervised

learning. Supervised machine learning in credit card fraud detection is a technique that applies

algorithms on both fraudulent and legitimate instances to construct models that assign new


observations into one of the two classes – the classes being either fraudulent or legitimate. The

goal of supervised learning is to build a concise model of the distribution of class labels in terms

of predictor features (Witten and Frank 2005). The resulting classifier is then used to assign class

labels to the testing instances where the values of the predictor features are known, but the value

of the class label is unknown. In unsupervised learning the classifications of the instances are

unknown. This learning method simply determines which observations are most dissimilar from

the norm. Unsupervised algorithms look for similarity in the training data to determine whether

instances can be characterized as forming a group. Therefore unsupervised learning is often

called “cluster analysis” and aims to group the data to develop classification labels automatically

(Jain, Murty and Flynn 1999).
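The contrast between the two categories can be sketched on a toy, one-dimensional example. The code below is an illustration under stated assumptions and does not use any algorithm from this thesis: the supervised part is a nearest-centroid rule trained on labelled values, the unsupervised part simply returns the observation farthest from the mean in standard-deviation units, and the transaction amounts and labels are hypothetical.

```python
from statistics import mean, pstdev


def train_nearest_centroid(values, labels):
    """Supervised: learn one centroid (mean value) per known class label."""
    centroids = {}
    for label in set(labels):
        centroids[label] = mean(v for v, l in zip(values, labels) if l == label)
    return centroids


def classify(centroids, value):
    """Assign a new observation to the class with the closest centroid."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def most_unusual(values):
    """Unsupervised: no labels; return the value most dissimilar from the norm."""
    mu, sigma = mean(values), pstdev(values)
    return max(values, key=lambda v: abs(v - mu) / sigma)


# Hypothetical transaction amounts with known classes (supervised setting).
amounts = [20, 35, 25, 30, 900, 950]
labels = ["legit", "legit", "legit", "legit", "fraud", "fraud"]
centroids = train_nearest_centroid(amounts, labels)
print(classify(centroids, 700), classify(centroids, 40))

# The same kind of data without labels (unsupervised setting).
print(most_unusual([20, 35, 25, 30, 28, 900]))
```

The supervised rule needs the class labels to build its model, whereas the unsupervised rule only measures how far each observation lies from the rest and leaves the interpretation of the outlier to the analyst.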

Inductive learning, or classification, takes place when a learner or classifier (e.g.,

decision tree, neural network, rule-learners, support vector machine (SVM)) is applied to some

data to produce a hypothesis explaining a target concept; the search for a good hypothesis

depends on the fixed bias embedded by the learner (Mitchell 1980). The algorithm is said to be

able to learn because the quality of the hypothesis normally improves with an increasing number

of examples. Nevertheless, since the bias of the learner is fixed, successive applications of the algorithm over the same data always produce the same hypothesis, independently of performance; no knowledge is commonly extracted across domains or tasks (Pratt and Thrun 1997).

2.2 Base Classifiers

This thesis utilizes the method of supervised learning. As discussed above, this is a machine

learning method that uses a training dataset with known target classes to produce an inferred

function that pairs an input to a desired output value (Witten and Frank 2005). This inferred


function, called a “classifier”, should approximate the correct output even for examples that have

not been shown during training. There are five main supervised data mining techniques:

statistical techniques (Bayesian/Regression), logic based techniques (decision trees), perceptron-

based techniques (neural networks), instance based learners (kNN), and support vector machines

(SVM). For multi-dimensional and continuous features, SVMs and neural networks are the data

mining techniques of choice, while logic-based systems are preferred when dealing with discrete

or categorical attributes. Neural network models and SVMs require large training datasets in

order to achieve their maximum prediction accuracy, whereas the Bayesian algorithm requires

only a relatively small dataset (Kotsiantis 2007). Irrelevant attributes have a large

negative impact on the training process of the kNN and neural network algorithms, and because

of these irrelevant attributes the training of classifiers based on these algorithms can often be

inefficient and sometimes impractical (Kotsiantis 2007).

Since there are weaknesses and strengths for each algorithm, a strategy is required to

determine the best base classifiers to use in the credit card domain. In the following sections

detailed descriptions of seven data mining algorithms that were used in experimentation are

presented.

2.2.1 Naïve Bayesian

The Naïve Bayesian classifier is a powerful probabilistic method that utilizes class information

from training instances to predict the class of future instances. The variant discussed here was

described by John and Langley (1995) and learns quickly while retaining

accurate predictive power. Experiments on real-world data have repeatedly shown that Naïve

Bayesian classifiers perform comparably to more sophisticated induction algorithms. Clark &

Niblett (1989) show that Bayesian classifiers achieve similar accuracy levels compared to rule-

induction methods such as CN2 and ID3 algorithms in medical domains. John & Langley (1995)

show that by using kernel density estimation instead of a Gaussian distribution, the Naïve

Bayesian classifier performs as well as, and in some cases better than, the decision tree

algorithm C4.5. However, this method goes by the name “Naïve” because it naively assumes

independence of the attributes given the class. Classification is then done by applying Bayes’

rule to compute the probability of the correct class given the particular attributes of the credit

card transaction,

$$P(\text{fraud} \mid \text{Evidences}) = \frac{P(\text{Evidences} \mid \text{fraud})\,P(\text{fraud})}{P(\text{Evidences})}$$ (2.1)

Where P(fraud|Evidences) is the posterior probability; the probability of the hypothesis (the

transaction being fraudulent) after considering the effect of the evidences (the attribute values

based on training examples). P(fraud) is the a-priori probability; the probability of the hypothesis

given only past experiences while ignoring any of the attribute values. P(Evidences|fraud) is

called the likelihood. This is the probability of the evidences given that the hypothesis is truly

fraudulent and that past experiences are true. The likelihood, P(Evidences|fraud), is calculated as

follows:

P(Evidences|fraud) = P(E1|fraud) x P(E2|fraud) x P(E3|fraud)…P(En|fraud) (2.2)

Where n is the number of attributes in the dataset.
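
As a rough illustration of Equations 2.1 and 2.2, the sketch below estimates the posterior for a transaction from simple frequency counts. It is a minimal Python sketch, not the implementation used in this research: the function name `naive_bayes_posterior` is hypothetical, the rows reuse only the two nominal attributes of Table 2-1, and no smoothing is applied, so an attribute value never seen with a class drives that class's score to zero.

```python
from collections import Counter

# (location, card type, class) for the nine transactions of Table 2-1.
# Only the nominal attributes are used; the transaction amount is left out.
TRAINING = [
    ("USA", "Gold", "fraud"),
    ("Canada", "Gold", "legit"),
    ("Canada", "Gold", "legit"),
    ("Canada", "Platinum", "fraud"),
    ("Canada", "Platinum", "legit"),
    ("USA", "Platinum", "legit"),
    ("Canada", "Gold", "legit"),
    ("Canada", "Platinum", "fraud"),
    ("USA", "Platinum", "fraud"),
]


def naive_bayes_posterior(rows, evidence):
    """Return P(class | evidence) for every class (hypothetical helper)."""
    class_counts = Counter(row[-1] for row in rows)
    total = len(rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for index, value in enumerate(evidence):
            matches = sum(
                1 for row in rows if row[-1] == label and row[index] == value
            )
            score *= matches / count  # likelihood term P(E_i | class), Eq. 2.2
        scores[label] = score
    # P(Evidences) is obtained by summing the numerators over all classes.
    evidence_probability = sum(scores.values())
    return {label: s / evidence_probability for label, s in scores.items()}


posterior = naive_bayes_posterior(TRAINING, ("USA", "Platinum"))
print(posterior)
```

For a Platinum card used in the USA the fraud class dominates, because both attribute values occur more often among the fraudulent rows than among the legitimate ones.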

The goal of classification is to correctly predict the value of a designated discrete class

variable given a vector of predictors or attributes (Grossman and Domingos 2004). In particular,

the Naïve Bayesian classifier is a Bayesian network where the class has no parents and each

attribute has the class as its sole parent (Othman and Yau 2007).

2.2.2 Bayesian Network

Bayesian belief networks are powerful modeling tools for condensing what is known about

causes and effects into a compact network of probabilities. A Bayesian network is a graphical

model for probabilistic relationships among a set of variables. The Bayesian network has become

a popular representation for encoding uncertain expert knowledge in expert systems (Heckerman,

Geiger and Chickering 1995). Bayesian networks can readily handle incomplete data sets and

can learn about causal relationships. Bayesian belief networks are very effective for modeling

situations where information about the past and/or the current situation is vague, incomplete,

conflicting, and uncertain, whereas rule-based models result in ineffective or inaccurate

predictions when the data is uncertain or unavailable. The Bayesian belief network used in this

thesis was first introduced by Cooper and Herskovits (1992).

In a Bayesian Network graphical model each node represents a random variable, and the

directed edges of the graph represent conditional dependence assumptions. Hence they provide a

compact representation of joint probability distributions.

The probability of joint events can be defined as:

$$P(E_1, E_2) = P(E_1)\,P(E_2 \mid E_1)$$ (2.3)

Where P(E1) is the probability of event 1 being true, P(E2|E1) is the conditional probability of event

2 being true given that event 1 is also true, and P(E1,E2) is the probability that

both events occur. The Bayesian Network diagram is constructed to show the marginal and joint

probabilities of events.

2.2.3 Decision Tree – C4.5

Decision trees are rule based classifiers that utilize a “divide and conquer” method to construct a

prediction rule. The divide and conquer method works by recursively breaking down a problem

into two or more sub-problems until it is simple enough to be solved directly. Decision trees are

graphical representations of “if, then statements” (decision rules). The decision tree algorithm

used in this thesis – C4.5 – was first introduced by J.R. Quinlan (1993).

A decision tree consists of nodes and branches. The starting node is usually referred to as

the root node. Each node is labeled with a feature name and each branch leading out of it is

labeled with one or more possible values for that feature. Each node has just one incoming

branch, except for the root, which is designated as the starting point. Each internal node in the

tree corresponds to a test of the value of one of the features. Branches from the node are labeled

with the possible values of the test. Leaves are labeled with the values of the classification

features and specify the value to be returned if that leaf is reached. By taking a set of features and

their associated values as input, a decision tree is able to classify a case by traversing the

decision tree. Depending on whether the result of a test is true or false, the tree branches to one

node or another. The feature of the instance corresponding to the label of the root of the tree is

compared to the values on the root’s outgoing branches, and the matching branch is selected.

This node label matching and branch selection process continues until a terminal node, referred

to as leaf, is reached, at which point the case is classified according to the label of the leaf and a

decision is made on the class assignment of the case (J. R. Quinlan 1993).

The C4.5 algorithm is one of the most commonly used methods for building decision trees. This

algorithm uses the concept of information entropy to determine the best node for the tree to

branch to. At each node of the tree, C4.5 chooses one attribute of the data that most effectively

splits its set of samples into subsets enriched in one class or the other. Its criterion is the

normalized information gain (difference in entropy) that results from choosing an attribute for

splitting the data. The attribute with the highest normalized information gain is chosen to make

the decision.

Entropy for a set of examples, S, for one variable can be calculated as follows:

$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$ (2.4)

Where i is the outcome state, pi is the probability of outcome state i, and c is the number of

outcome states.

Entropy for two variables can be calculated as follows:

$$E(S, A) = \sum_{v \in A} \frac{|S_v|}{|S|}\,E(S_v)$$ (2.5)

Where v is a state of the second variable, A is the set of states of the second variable, |Sv| is

the size of the subset in state v, and |S| is the size of the entire set.

Finally the information gain is defined as:

$$Gain(S, A) = E(S) - \sum_{v \in A} \frac{|S_v|}{|S|}\,E(S_v)$$ (2.6)

The entropy of an attribute represents the expected amount of information that would be needed

to specify the classification of a new instance. Therefore the attribute with the largest amount of

information gained would be selected as the splitting attribute. The decision tree is stopped when

the data cannot be split any further. Ideally, the process is repeated until all leaf nodes are pure,

that is, when they contain instances that have the same classification (See Appendix A for

calculations in the construction of a decision tree).
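
Equations 2.4 to 2.6 can be checked numerically. The following Python sketch is illustrative only: the helper names `entropy` and `information_gain` are assumptions, the data are the nominal attributes of Table 2-1, and the plain gain of Equation 2.6 is computed rather than the normalized gain that C4.5 uses to choose a split.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, as in Eq. 2.4."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Information gain of splitting `labels` on one attribute, as in Eq. 2.6."""
    total = len(labels)
    remainder = 0.0
    for state in set(attribute_values):
        subset = [lab for val, lab in zip(attribute_values, labels) if val == state]
        remainder += len(subset) / total * entropy(subset)  # one term of Eq. 2.5
    return entropy(labels) - remainder


# Nominal attributes and classes of the nine transactions in Table 2-1.
card = ["Gold", "Gold", "Gold", "Platinum", "Platinum",
        "Platinum", "Gold", "Platinum", "Platinum"]
location = ["USA", "Canada", "Canada", "Canada", "Canada",
            "USA", "Canada", "Canada", "USA"]
label = ["Fraud", "Legit", "Legit", "Fraud", "Legit",
         "Legit", "Legit", "Fraud", "Fraud"]

print(round(entropy(label), 4))
print(round(information_gain(card, label), 4))
print(round(information_gain(location, label), 4))
```

On this small dataset the card type yields a slightly larger gain than the location, so it would be preferred as the splitting attribute under Equation 2.6.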

2.2.4 K-Nearest Neighbours

The k-Nearest Neighbour (kNN) method is a simple algorithm that stores all available instances

and classifies new cases based on a similarity measure. The kNN algorithm is an example of an

instance-based learner. In a sense, all of the other learning methods are “instance-based,” as well,

because they start with a set of instances as the initial training information. However, for

instance-based learners the instances themselves are used to represent what is learned, rather than

using the instances to infer a rule set or decision tree. In nearest-neighbour classification,

each new instance is compared with existing ones using a distance metric, and

the closest existing instance is used to assign the class to the new one. Sometimes more than one

nearest neighbour is used, and the majority class of the closest k neighbours (or the distance-weighted

average, if the class is numeric) is assigned to the new instance. The concept of the

instance-based nearest-neighbour algorithm was first introduced by Aha, Kibler, and Albert

(1991).

Generally, the standard Euclidean distance is used when computing the distance between

instances described by several numerical attributes. However, this assumes that the attributes are normalized and are of

equal importance (one of the main problems in learning is to determine which are the important

features). For cases when nominal attributes are present, such as comparing the attribute values

of the types of credit cards: Classic, Gold and Platinum, a distance of zero is assigned if the

values are identical; otherwise, the distance is one. Thus the distance between gold and gold is

zero but that between gold and platinum is one.
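
The distance rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the names `mixed_distance` and `knn_predict` are hypothetical, the amounts are made-up values already scaled to the range 0 to 1, and every attribute carries equal weight.

```python
from collections import Counter
from math import sqrt


def mixed_distance(a, b):
    """Euclidean distance where nominal (string) attributes contribute 0 or 1."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str):
            total += 0.0 if x == y else 1.0  # identical values: 0, otherwise 1
        else:
            total += (x - y) ** 2
    return sqrt(total)


def knn_predict(training, query, k=3):
    """Assign the majority class of the k closest training instances."""
    nearest = sorted(training, key=lambda row: mixed_distance(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training instances: ((scaled amount, card type), class).
TRAINING = [
    ((0.02, "Gold"), "Fraud"),
    ((0.10, "Gold"), "Legit"),
    ((0.15, "Gold"), "Legit"),
    ((0.90, "Platinum"), "Fraud"),
    ((0.85, "Platinum"), "Fraud"),
]

print(mixed_distance((0.0, "Gold"), (0.0, "Platinum")))
print(knn_predict(TRAINING, (0.12, "Gold"), k=3))
```

Because the nominal mismatch contributes a full unit of distance, scaling the numeric attributes matters: an unscaled dollar amount would swamp the card-type term entirely.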

Some attributes are more important than others, and this is usually reflected in the

distance metric by some kind of attribute weighting. Deriving suitable attribute weights from the

training set is a key problem in instance-based learning. In this technique the instances do not

really “describe” the patterns in data. However, the instances combine with the distance metric to

carve out boundaries in instance space that distinguish one class from another, and this is a kind

of explicit representation of knowledge.

2.2.5 Support Vector Machines

The Support Vector Machines (SVM) algorithm was first introduced by Cortes and Vapnik

(1995). This algorithm finds a special kind of linear model, the maximum margin hyperplane.

When the classes are linearly separable, this hyperplane classifies all training instances correctly

by placing each class on its own side. The maximum margin hyperplane is the one that gives the greatest

separation between the classes – it comes no closer to any of the classes than it has to. The

instances that are closest to the maximum margin hyperplane – the ones with minimum distance

to it – are called support vectors. There is always at least one support vector for each class, and

often there are more (Witten and Frank 2005).

The optimal hyperplane is found by maximizing the width of the margin. As shown in

Figure 2-1, the margin is the distance between the separating hyperplane and the closest instances

of the positive class and the negative class.

Figure 2-1: Separating two classes using a hyperplane (Leopold and Kindermann 2006)

In situations where the classes are not perfectly separable, the SVM algorithm finds the hyperplane

that maximizes the margin while minimizing the misclassified instances using a slack variable.

As shown in Figure 2-1, the slack variable, ξ, represents the distance of the misclassified instance

from its margin hyperplane. The SVM algorithm minimizes the sum of the slack

variables while maximizing the margin width. This is done by

solving the following optimization problem using Quadratic Programming:

Minimize: $$\frac{1}{2}\lVert w \rVert^{2} + C\sum_{i}\xi_{i}$$ (2.7)

Subject to: $$y_{i}\,(w \cdot x_{i} + b) \ge 1 - \xi_{i}, \quad \xi_{i} \ge 0$$

Where w and b are parameters that are learned using the training data, ξ is the slack variable that

represents the outliers, and C is a parameter that allows for selecting the complexity of the

model. The larger the value of C, the fewer training errors are tolerated and the more complex the

predictive model becomes.

There are situations where a nonlinear region can separate the classes more effectively.

Rather than fitting nonlinear curves to the data, SVM determines a dividing line by using a

kernel function to map the data into a different space where a hyperplane can be used to do a

linear separation. The concept of a kernel mapping function is very powerful because it allows

SVM models to perform separations even with very complex boundaries. An infinite number of

kernel mapping functions can be used, but the Radial Basis Function has been found to work

well for a wide variety of applications including credit card fraud (Hanagandi, Dhar and

Buescher 1996). The transformation to a high-dimensional space is done by replacing every dot

product in the SVM algorithm with the Gaussian radial basis function kernel as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0$$ (2.8)

$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$$ (2.9)

Where K(xi,xj) is the kernel function and φ(x) is the transformation function.
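
The Gaussian radial basis function kernel is straightforward to evaluate directly. The Python sketch below assumes the usual form exp(-γ‖xi - xj‖²) with γ > 0; the function name `rbf_kernel` and the default value of `gamma` are illustrative choices, not settings taken from this research.

```python
from math import exp


def rbf_kernel(x_i, x_j, gamma=0.5):
    """Gaussian RBF kernel between two equal-length numeric vectors."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return exp(-gamma * squared_distance)


print(rbf_kernel((1.0, 2.0), (1.0, 2.0)))  # identical points
print(rbf_kernel((0.0, 0.0), (1.0, 1.0)))  # squared distance of 2
```

The kernel equals 1 for identical points and decays toward 0 as the points move apart; a larger γ makes that decay faster and therefore allows a more intricate decision boundary.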

2.2.6 Neural Networks

Artificial Neural Networks (ANN) are computational models that try to mimic our body’s

biological neural networks and can easily adapt to change. This mathematical model consists of

interconnected artificial neurons (nodes), each of which receives one or more inputs and sums them to

produce a prediction (output). A neuron has two modes of operation: training mode and usage

mode. In training mode, the neuron can be taught to associate a certain prediction with an input

pattern. In usage mode, when the neuron detects a taught input pattern, it outputs the associated

prediction.

The effect of each input’s contribution to the final prediction is dependent on the weight

of the particular input. To determine a neural network that is an accurate predictor, appropriate

weights for the connections must be determined. The most widely used method to determine the

optimal connection weights is called backpropagation. This method was introduced by

Rumelhart, Hinton, and Williams (1986) and through their work artificial neural network

research gained recognition in machine learning. Backpropagation utilizes a mathematical

algorithm called gradient descent which iteratively adjusts a function’s parameters to minimize

the squared error function of the network’s output. If the function has several minima the

gradient descent method might not find the best one.

The sigmoid function is used to calculate the output of each network layer and is defined

as follows:

$$f(x) = \frac{1}{1 + e^{-x}}$$ (2.10)

The squared error function is defined as follows:

$$E = \frac{1}{2}\left(y - f(x)\right)^{2}$$ (2.11)

Where f(x) is the network’s prediction obtained from the output unit and y is the instance’s class

label. An example of a neural network is shown in Figure 2-2.

To find the weights of a neural network, the derivative of the squared error function must be

determined. The derivative of the error function with respect to a particular weight is defined as:

$$\frac{dE}{dw_i} = -\left(y - f(x)\right) f'(x)\, a_i$$ (2.12)

Where wi are the weights for the ith input variable, x is the weighted sum of the inputs, and ai are

the inputs to the neural network. This computation is repeated for each training instance, and the

changes associated with a particular weight wi are added up, multiplied by the learning rate (a

small constant), and subtracted from the wi’s current value. This is repeated until the changes in

the weights become very small.

Figure 2-2: Example of a neural network with one hidden layer
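
The weight-update procedure can be illustrated for the simplest possible case, a single sigmoid unit. The Python sketch below is an assumption-laden toy rather than the network used in this research: it has no hidden layer, the first input is fixed at 1.0 to act as a bias, the data encode a logical OR, and the learning rate and epoch count are arbitrary. The gradient is taken as dE/dw_i = -(y - f(x)) f'(x) a_i, which follows from differentiating E = ½(y - f(x))².

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def train_unit(instances, learning_rate=0.5, epochs=5000):
    """Batch gradient descent on the squared error of one sigmoid unit."""
    n_inputs = len(instances[0][0])
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        gradient = [0.0] * n_inputs
        for inputs, y in instances:
            x = sum(w * a for w, a in zip(weights, inputs))  # weighted sum
            f = sigmoid(x)
            f_prime = f * (1.0 - f)  # derivative of the sigmoid
            for i, a in enumerate(inputs):
                gradient[i] += -(y - f) * f_prime * a  # summed over instances
        # Multiply by the learning rate and subtract from the current weights.
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Hypothetical data: (bias input, a1, a2) -> class label for a logical OR.
OR_DATA = [
    ((1.0, 0.0, 0.0), 0),
    ((1.0, 0.0, 1.0), 1),
    ((1.0, 1.0, 0.0), 1),
    ((1.0, 1.0, 1.0), 1),
]

weights = train_unit(OR_DATA)
outputs = [sigmoid(sum(w * a for w, a in zip(weights, inputs))) for inputs, _ in OR_DATA]
print([round(o, 2) for o in outputs])
```

A fixed number of epochs is used here for brevity; stopping once the weight changes become very small, as described above, would be the more faithful criterion.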

2.2.7 Logistic Regression

Logistic regression is often used when the dependent variable takes only two values and the

independent variables are continuous, categorical, or both. The logistic regression method is well suited

to classifying outcomes that have only two values because the logistic curve is limited to

values between 0 and 1. The method utilized in this thesis is based on the work done by le Cessie

and van Houwelingen (1997).

In credit card fraud detection the dependent variable would take on a value of 0

(legitimate transaction) or 1 (fraudulent transaction). Unlike ordinary linear regression however,

logistic regression does not assume a linear relationship between the independent variables and

the dependent variable, nor does it assume that the dependent variable or the error terms are

distributed normally.

The logistic regression model is defined as follows:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$ (2.13)

Where X1,X2,…,Xk are the independent variables and p is the probability that the dependent

variable has a value of 1. β0 is a constant and β1,…,βk are coefficients of the independent

variables. The logistic regression model looks similar to the multi-linear regression equation;

however, logistic regression regresses against the logit, log(p/(1−p)), and not against the

dependent variable (See Figure 2-3). The Maximum Likelihood Estimation (MLE) is then used

to compute the beta coefficients in the logistic regression formula. The aim of MLE is to find the

parameter values that make the observed data most likely to be predicted. Likelihood and

probability are closely related because the likelihood of the parameters given the data is equal to

the probability of the data given the parameters (Montgomery and Runger 2003).

Likelihood: Estimating model parameters given the observed data

Probability: Predicting an outcome given model parameters

The likelihood function is defined as follows:

$$L(a) = f(x_1; a) \cdot f(x_2; a) \cdots f(x_n; a)$$ (2.14)

Where x1, x2, …, xn are the observed values of a dataset, a is a single unknown parameter, and

f(x;a) is the probability distribution function. The MLE algorithm initially chooses arbitrary

numbers for the parameters, and through an iterative process the parameters are slowly changed

until the likelihood function is maximized.

By using the beta parameters calculated by the MLE method and the corresponding

values of the independent variables, the expected probability for a fraudulent transaction can be

calculated.

Figure 2-3: Comparison of the Linear Probability model with the Logit Model
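
The iterative search for the beta coefficients can be sketched as follows. This Python example is a simplified stand-in, not the estimation routine used in this research: plain gradient ascent on the log-likelihood is swapped in as the iterative process, the function names are hypothetical, and the six data points are invented so that the two classes overlap and the maximum is finite.

```python
from math import exp, log


def predict_probability(betas, x):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = betas[0] + sum(b * v for b, v in zip(betas[1:], x))
    return 1.0 / (1.0 + exp(-z))


def log_likelihood(betas, data):
    """Logarithm of the likelihood of the observed 0/1 outcomes."""
    total = 0.0
    for x, y in data:
        p = predict_probability(betas, x)
        total += log(p) if y == 1 else log(1.0 - p)
    return total


def fit(data, learning_rate=0.1, iterations=5000):
    """Start from arbitrary (zero) betas and climb the log-likelihood."""
    betas = [0.0] * (len(data[0][0]) + 1)
    for _ in range(iterations):
        gradient = [0.0] * len(betas)
        for x, y in data:
            error = y - predict_probability(betas, x)
            gradient[0] += error
            for j, v in enumerate(x, start=1):
                gradient[j] += error * v
        betas = [b + learning_rate * g for b, g in zip(betas, gradient)]
    return betas


# Hypothetical data: ((scaled amount,), 1 = fraudulent, 0 = legitimate).
DATA = [((0.5,), 0), ((1.0,), 0), ((1.5,), 1), ((2.0,), 0), ((2.5,), 1), ((3.0,), 1)]

betas = fit(DATA)
print([round(b, 3) for b in betas])
print(round(predict_probability(betas, (3.0,)), 3))
```

Working with the logarithm of the likelihood turns the product over observations into a sum, which is numerically safer and has the same maximizer.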

There are advantages and disadvantages with applying certain algorithms to fraud

detection. Therefore a metric is needed to determine the ideal algorithms to use in the credit card

fraud domain. A “diversity” value was selected as the metric for choosing the algorithms

because it is easily calculated and its numerical score can be used to rank the candidate

algorithms for use as base classifiers. The diversity calculation method is discussed in detail in

Chapter 4.

2.3 Introduction to Combination Strategies in Data Mining

The three main combination techniques used in data mining – bagging, boosting, and stacking –

are presented in detail in this section. The bagging and boosting methods both use voting to

combine the output of individual models of the same type (the same algorithm is used in

constructing the models). However, boosting is an iterative process that weights the instances so

that each new model focuses on a particular set of instances. Stacking differs from

both the boosting and bagging methods because it combines the output of different types of

algorithms to generate a final prediction. In the following section the bagging, boosting, and

stacking techniques are discussed in detail and step-by-step examples are presented for these

three combination techniques.

All learning systems work by adapting to a specific environment. Given instances that

have not been encountered, learning algorithms use their own set of assumptions to generate

predictions. These assumptions are referred to as inductive bias (Mitchell 1980). Different

algorithms have different representations and search heuristics; therefore, by using multiple

algorithms, different search spaces can be explored and potentially diverse results can be

obtained. No single algorithm works best on all kinds of datasets, so it is beneficial to use

combinations of learning algorithms to evaluate complex databases.

There are several machine learning techniques that have been developed to combine the

output of different learning models. The three main techniques are: bagging, boosting, and

stacking. “Bagging” stands for bootstrap aggregating and was introduced by Breiman in 1996

(Breiman 1996). This method uses training datasets of the same size to produce multiple

classifiers. These classifiers generate predictions that are used to determine the final prediction

through the process of voting. For each test instance each classifier “votes” on which class it

believes to be correct (the classifier’s prediction), and the class that receives the most “votes” is

considered to be the correct class. The bagging algorithm is summarized in Figure 2-4. The

uniqueness of bagging is in the way the training datasets are generated. It is often difficult or

expensive to extract training data from a complex domain. Instead of obtaining independent

datasets, bagging resamples the original training data with some of its instances deleted and other

instances replicated. This method performs effectively for unstable learning algorithms where a

small change in the data results in a large change in predictions.

Figure 2-4: Bagging algorithm as described by (Witten and Frank 2005)

Freund and Schapire (1996) introduced an algorithm called AdaBoost which is

considered a “boosting” algorithm. Boosting is an iterative technique that uses voting to combine

models of the same type (i.e. combining multiple decision trees). The learning algorithm in this

method is taught to concentrate on instances that are misclassified by the previous model by

placing a larger emphasis (weight) on the misclassified instances while decreasing learning

emphasis on the correctly classified instances. A new classifier is built by learning from the

reweighted instances, which focuses on correctly classifying the previously misclassified

instances. The boosting algorithm is summarized in Figure 2-5. As mentioned previously,

boosting uses a voting system to determine a final prediction from the multiple models of the

same type. To make this final prediction the weights of all classifiers that vote for a particular

class are summed, and the class with the greatest total weight is chosen.

Figure 2-5: Boosting algorithm as described by (Witten and Frank 2005).

Wolpert (1992) presented a novel technique in combining multiple models built by

different learning algorithms termed stacked generalization, or stacking for short. Whereas

bagging and boosting are used to combine models of the same type through the process of

voting, stacking introduces the concept of a meta-learner that uses the predictions of different

base models (the models that are to be combined) as input into its learning algorithm. The danger

with using unweighted voting is the possibility of having multiple classifiers that are grossly

incorrect which would lead to extremely inaccurate predictions. The meta-learner in the stacking

technique is a separate learning algorithm that tries to learn which base classifiers are reliable.

Meta-learning studies how to choose the right bias dynamically, as opposed to base-learning

(single algorithm learning) where the bias is fixed or user parameterized. Meta-learning is a

general technique to combine the results of multiple learning algorithms, each applied to a set of

training data.
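
To make the role of the meta-learner concrete, the Python sketch below trains a small naive Bayesian meta-classifier on the predictions of three base classifiers. Everything in it is hypothetical: the function names, the eight meta-level training rows, and the use of Laplace smoothing are assumptions made for illustration. In practice the meta-level training rows would be built from base-classifier predictions on instances held out from base-level training.

```python
from collections import Counter, defaultdict

CLASSES = ("Fraud", "Legit")

# Hypothetical meta-level data: (predictions of base classifiers A, B, C), true class.
# A is always right, B is usually right, and C is usually wrong.
META_TRAIN = [
    (("Fraud", "Fraud", "Legit"), "Fraud"),
    (("Fraud", "Fraud", "Fraud"), "Fraud"),
    (("Fraud", "Legit", "Legit"), "Fraud"),
    (("Legit", "Legit", "Fraud"), "Legit"),
    (("Legit", "Legit", "Legit"), "Legit"),
    (("Legit", "Fraud", "Fraud"), "Legit"),
    (("Legit", "Legit", "Fraud"), "Legit"),
    (("Fraud", "Fraud", "Legit"), "Fraud"),
]


def train_meta(meta_rows):
    """Count how often each base prediction co-occurs with each true class."""
    class_counts = Counter(label for _, label in meta_rows)
    value_counts = defaultdict(Counter)  # (classifier index, true class) -> counts
    for predictions, label in meta_rows:
        for index, predicted in enumerate(predictions):
            value_counts[(index, label)][predicted] += 1
    return class_counts, value_counts


def meta_predict(model, predictions):
    """Naive Bayes over base predictions, with Laplace smoothing."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for index, predicted in enumerate(predictions):
            seen = value_counts[(index, label)][predicted]
            score *= (seen + 1) / (count + len(CLASSES))
        scores[label] = score
    return max(scores, key=scores.get)


def majority_vote(predictions):
    return Counter(predictions).most_common(1)[0][0]


model = train_meta(META_TRAIN)
new_case = ("Fraud", "Legit", "Legit")
print(majority_vote(new_case), meta_predict(model, new_case))
```

For the new case an unweighted vote follows the two "Legit" predictions, whereas the meta-learner sides with classifier A because it has learned that A is the reliable one.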

2.3.1 Examples using Meta-learning: Applying the bagging, boosting, and stacking methodologies

The dataset in Table 2-1 is used to show how the bagging, boosting, and stacking methods are

applied to produce final predictions. The dataset contains three attributes: transaction amount,

transaction location, and type of credit card used for the transaction. The dataset also contains the

correct class label for each transaction: either fraudulent or legitimate.

Table 2-1: Arbitrary training dataset consisting of three attributes with correct class

Inst. #  Transaction Amount ($)  Transaction Location  Type of Credit Card  Correct Class
1        2                       USA                   Gold                 Fraud
2        16                      Canada                Gold                 Legit
3        24                      Canada                Gold                 Legit
4        108                     Canada                Platinum             Fraud
5        427                     Canada                Platinum             Legit
6        28                      USA                   Platinum             Legit
7        59                      Canada                Gold                 Legit
8        107                     Canada                Platinum             Fraud
9        97                      USA                   Platinum             Fraud

2.3.1.1 Bagging Example

Let us initially choose five random instances from the training dataset in Table 2-1, and

randomly choose to replace two old instances with two new instances for each iteration. These

new datasets – Table 2-2, Table 2-3, and Table 2-4 – are generated by re-sampling the original

training data.

Table 2-2: Bagging Dataset #1

Instance #  Transaction Amount ($)  Transaction Location  Type of Credit Card  Correct Class
1           2                       USA                   Gold                 Fraud
4           108                     Canada                Platinum             Fraud
5           427                     Canada                Platinum             Legit
8           107                     Canada                Platinum             Fraud
3           24                      Canada                Gold                 Legit


Table 2-3: Bagging Dataset #2

Instance #  Transaction Amount ($)  Transaction Location  Type of Credit Card  Correct Class
2*          16                      Canada                Gold                 Legit
9*          97                      USA                   Platinum             Fraud
5           427                     Canada                Platinum             Legit
8           107                     Canada                Platinum             Fraud
3           24                      Canada                Gold                 Legit

* marks new instances taken from the original training dataset (shown in red in the original)


Table 2-4: Bagging Dataset #3

Instance #  Transaction Amount ($)  Transaction Location  Type of Credit Card  Correct Class
2           16                      Canada                Gold                 Legit
9           97                      USA                   Platinum             Fraud
6*          28                      USA                   Platinum             Legit
7*          59                      Canada                Gold                 Legit
3           24                      Canada                Gold                 Legit

* marks new instances taken from the original training dataset (shown in red in the original)

For this example a decision tree algorithm is selected as the training algorithm and is applied to

bagging dataset #1, #2, and #3 to generate three different classification models – prediction

models 1, 2, and 3. These models are applied to a testing dataset (Table 2-5). The predictions for

each instance for each model are outputted and the majority prediction from the three models is

then used as the final prediction as shown in Table 2-6.

Table 2-5: Testing dataset consisting of new unclassified instances

Instance #  Transaction Amount ($)  Transaction Location  Type of Credit Card
100         251                     USA                   Platinum
101         12                      USA                   Gold
102         59                      Canada                Gold
103         1005                    Canada                Gold
104         432                     Canada                Gold
105         29                      Canada                Platinum
106         65                      USA                   Gold
107         803                     Canada                Gold
108         25                      USA                   Platinum

Table 2-6: Applying the three bagging models to a testing dataset

Inst. #  Txn Amt ($)  Txn Location  Type of Credit Card  Prediction from Model #1  Prediction from Model #2  Prediction from Model #3  Final Class Prediction
100      251          USA           Platinum             Fraud                     Fraud                     Fraud                     Fraud
101      12           USA           Gold                 Fraud                     Legitimate                Legitimate                Legitimate
102      59           Canada        Gold                 Legitimate                Legitimate                Fraud                     Legitimate
103      1005         Canada        Gold                 Legitimate                Legitimate                Legitimate                Legitimate
104      432          Canada        Gold                 Fraud                     Legitimate                Fraud                     Fraud
105      29           Canada        Platinum             Legitimate                Fraud                     Legitimate                Legitimate
106      65           USA           Gold                 Fraud                     Fraud                     Legitimate                Fraud
107      803          Canada        Gold                 Fraud                     Fraud                     Fraud                     Fraud
108      25           USA           Platinum             Legitimate                Legitimate                Legitimate                Legitimate

As can be seen from Table 2-6, the final prediction in the bagging method is based on a

majority vote from the predictions of models 1, 2, and 3.
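
The voting step can be reproduced with a short Python sketch. The per-model predictions are copied from Table 2-6; the helper names `majority_vote` and `bootstrap_sample` are hypothetical, and the bootstrap helper shows the conventional sampling with replacement rather than the five-instance replacement scheme used in this worked example.

```python
import random
from collections import Counter

# Predictions of models 1, 2 and 3 for test instances 100-108 (Table 2-6).
MODEL_PREDICTIONS = {
    100: ("Fraud", "Fraud", "Fraud"),
    101: ("Fraud", "Legitimate", "Legitimate"),
    102: ("Legitimate", "Legitimate", "Fraud"),
    103: ("Legitimate", "Legitimate", "Legitimate"),
    104: ("Fraud", "Legitimate", "Fraud"),
    105: ("Legitimate", "Fraud", "Legitimate"),
    106: ("Fraud", "Fraud", "Legitimate"),
    107: ("Fraud", "Fraud", "Fraud"),
    108: ("Legitimate", "Legitimate", "Legitimate"),
}


def majority_vote(votes):
    """Return the class predicted by most of the models."""
    return Counter(votes).most_common(1)[0][0]


def bootstrap_sample(rows, rng):
    """Draw len(rows) instances with replacement (some repeat, some drop out)."""
    return [rng.choice(rows) for _ in rows]


final = {inst: majority_vote(votes) for inst, votes in MODEL_PREDICTIONS.items()}
for inst, prediction in final.items():
    print(inst, prediction)

sample = bootstrap_sample(list(range(1, 10)), random.Random(0))
print(sample)
```

Using an odd number of models avoids ties in a two-class vote, which is one practical reason for building three models here.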

2.3.1.2 Boosting Example

Once again we use the data presented in Table 2-1; however, for the boosting methodology there

are weights associated with each instance. Before the iterative procedure begins, each training

instance is assigned the same arbitrarily chosen weight, as shown below in Table 2-7 (any positive

number can be picked as the starting weight).

Table 2-7: Boosting Dataset Example

Inst. #  Txn Amt ($)  Txn Location  Type of Credit Card  Correct Class  Weights
1        2            USA           Gold                 Fraud          0.6
2        16           Canada        Gold                 Legit          0.6
3        24           Canada        Gold                 Legit          0.6
4        108          Canada        Platinum             Fraud          0.6
5        427          Canada        Platinum             Legit          0.6
6        28           USA           Platinum             Legit          0.6
7        59           Canada        Gold                 Legit          0.6
8        107          Canada        Platinum             Fraud          0.6
9        97           USA           Platinum             Fraud          0.6

Boosting Iteration #1:

Let us assume that we apply a decision tree algorithm to Table 2-7 in which all instances

have equal weights, to generate a classification model that outputs a prediction for each instance.

The root mean squared error (RMSE) term for the classifier, e (a fraction between 0 and 1), is

then calculated using the following formula:

$$e = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^{2}}{n}}$$ (2.15)

Where xi is the predicted probability outcome of instance i, yi is the actual outcome of instance i,

and n is the number of instances under investigation.

For correctly classified instances, the weight of each instance is adjusted by the

following formula:

$$\text{Adjusted Weight} = \text{Weight} \times \frac{e}{1 - e}$$ (2.16)

Where e is the error term for the classifier. Weights remain unchanged for misclassified

instances. All weights are then normalized by dividing each instance’s weight by the sum of the

new weights and multiplying by the sum of the old weights.

Assuming the error term for this classifier is 0.35, the adjusted weights for each instance

can be calculated using Equation 2.16. Table 2-8 shows the adjusted and normalized weights for

each instance.

Table 2-8: Boosting iteration #1

Instance #  Txn Amt ($)  Txn Location  Type of Credit Card  Correct Class  Weights #1  Classified   Adjusted Weights  Normalized Weights
1           2            USA           Gold                 Fraud          0.6         Correctly    0.323             0.434
2           16           Canada        Gold                 Legit          0.6         Correctly    0.323             0.434
3           24           Canada        Gold                 Legit          0.6         Incorrectly  0.6               0.807
4           108          Canada        Platinum             Fraud          0.6         Correctly    0.323             0.434
5           427          Canada        Platinum             Legit          0.6         Incorrectly  0.6               0.807
6           28           USA           Platinum             Legit          0.6         Correctly    0.323             0.434
7           59           Canada        Gold                 Legit          0.6         Incorrectly  0.6               0.807
8           107          Canada        Platinum             Fraud          0.6         Incorrectly  0.6               0.807
9           97           USA           Platinum             Fraud          0.6         Correctly    0.323             0.434

As shown in Table 2-8, the weights of correctly classified instances are decreased while

weights are increased for incorrectly classified instances.
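
The arithmetic behind Table 2-8 can be reproduced in a few lines. The Python sketch below is illustrative: the function name `boosting_update` is hypothetical, the starting weight of 0.6, the error term of 0.35 and the correct/incorrect pattern are taken from this worked example, and the normalization rescales the adjusted weights so that they sum to the same total as the old weights.

```python
def boosting_update(weights, correct, error):
    """Apply the weight adjustment of Eq. 2.16, then renormalize."""
    factor = error / (1.0 - error)
    # Only correctly classified instances are scaled down.
    adjusted = [w * factor if ok else w for w, ok in zip(weights, correct)]
    # Divide by the sum of the new weights, multiply by the sum of the old ones.
    scale = sum(weights) / sum(adjusted)
    normalized = [w * scale for w in adjusted]
    return adjusted, normalized


weights = [0.6] * 9
# True where iteration #1 classified the instance correctly (instances 1 to 9).
correct = [True, True, False, True, False, True, False, False, True]

adjusted, normalized = boosting_update(weights, correct, error=0.35)
print([round(w, 3) for w in adjusted])
print([round(w, 3) for w in normalized])
```

Feeding the normalized weights back in with the next iteration's error term and classification pattern yields the weights used in the following iteration.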


Boosting Iteration #2:

For the second iteration in the boosting methodology, the decision tree algorithm is

applied to the original dataset from Table 2-7 but with the normalized weights calculated in

iteration #1 (see the Normalized Weights column in Table 2-8) instead of the original weights. This produces a

second classification model with its own set of predictions. The weights are once again adjusted

and normalized to put more emphasis on incorrectly classified instances. Assuming the error

term for the classifier for this iteration is 0.30, the weights from iteration #1 are adjusted using

Equation 2.16 to calculate new adjusted weights for each instance (See Table 2-9).



Table 2-9: Boosting iteration #2

Inst. #  Txn Amt ($)  Txn Location  Type of Credit Card  Correct Class  Weights #2 (from #1)  Classified   Adjusted Weights  Normalized Weights
1        2            USA           Gold                 Fraud          0.434                 Correctly    0.186             0.255
2        16           Canada        Gold                 Legit          0.434                 Incorrectly  0.434             0.594
3        24           Canada        Gold                 Legit          0.807                 Incorrectly  0.807             1.104
4        108          Canada        Platinum             Fraud          0.434                 Correctly    0.186             0.255
5        427          Canada        Platinum             Legit          0.807                 Correctly    0.346             0.473
6        28           USA           Platinum             Legit          0.434                 Correctly    0.186             0.255
7        59           Canada        Gold                 Legit          0.807                 Incorrectly  0.807             1.104
8        107          Canada        Platinum             Fraud          0.807                 Incorrectly  0.807             1.104
9        97           USA           Platinum             Fraud          0.434                 Correctly    0.186             0.255


Once again the decision tree algorithm is applied to the original dataset (Table 2-7), but

the weights are the adjusted weights from the previous iteration (iteration #2; see the normalized

weights in Table 2-9). This produces a third classification model with its own set of

predictions. The weights are once again adjusted using Equation 2.16 and then normalized.

Assuming the error term for the classifier for this iteration is 0.15, the data from Table 2-10 can

then be constructed.



Table 2-10: Boosting iteration #3

Inst. #  Txn Amt ($)  Txn Location  Type of Credit Card  Correct Class  Weights #3 (from #2)  Classified   Adjusted Weights  Normalized Weights
1        2            USA           Gold                 Fraud          0.255                 Correctly    0.045             0.082
2        16           Canada        Gold                 Legit          0.594                 Incorrectly  0.594             1.087
3        24           Canada        Gold                 Legit          1.104                 Incorrectly  1.104             2.020
4        108          Canada        Platinum             Fraud          0.255                 Correctly    0.045             0.082
5        427          Canada        Platinum             Legit          0.473                 Incorrectly  0.473             0.865
6        28           USA           Platinum             Legit          0.255                 Correctly    0.045             0.082
7        59           Canada        Gold                 Legit          1.104                 Correctly    0.195             0.357
8        107          Canada        Platinum             Fraud          1.104                 Correctly    0.195             0.357
9        97           USA           Platinum             Fraud          0.255                 Incorrectly  0.255             0.467


The decision tree algorithm is applied to the dataset with the new weights calculated from

the previous iteration (iteration #3) to generate a fourth prediction model. However, let us

assume that the overall error for this model is zero, so the fourth prediction model is not

created. In the boosting methodology, whenever the error term is zero, or when it is greater than or

equal to 0.5, the iterative process stops and no more models are constructed. Therefore, in this

example the boosting method stops after three iterations.
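
The stopping rule described above can be sketched as a short loop. The function name `count_models_kept` is a hypothetical helper introduced for illustration; the error terms are the ones assumed in this example (0.35, 0.30, 0.15, and a fourth model with zero error).

```python
def count_models_kept(error_terms):
    """Count the boosting models retained before the stopping rule fires."""
    kept = 0
    for e in error_terms:
        # Stop as soon as a model has zero error or an error of 0.5 or more.
        if e == 0 or e >= 0.5:
            break
        kept += 1
    return kept

print(count_models_kept([0.35, 0.30, 0.15, 0.0]))  # 3
```

With these error terms the loop stops at the fourth model, leaving three models for the final vote.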

Final Prediction:

For the final classification of an instance in the boosting method, the weights of all

classifiers that vote for a particular class are summed and the class with the greatest total is

chosen to be the final prediction. The process begins by assigning a weight of zero to all classes

(fraud or legit). For each instance, a term of $-\log\left(\frac{e}{1-e}\right)$ is added to the weight of the class predicted by

a model. The class with the highest weight is chosen as the final prediction. The three boosting

models from this example are applied to a new testing dataset (dataset originally introduced in

Table 2-5) and the predictions for each instance using these models are determined. The weights

associated with each model for each instance, the sum of the weights for each class, and the final

prediction for each instance using the boosting method are shown in Table 2-11.

Table 2-11: Using the three boosting models to determine the final predictions

         Predictions                   Class Weights, -log(e / (1 - e))    Weighted Vote
Inst. #  Model #1  Model #2  Model #3  Model #1  Model #2  Model #3        Fraud  Legit  Final Predict
                                       e=0.35    e=0.30    e=0.15
100      Fraud     Fraud     Fraud     0.269     0.368     0.753           1.39   0      Fraud
101      Legit     Fraud     Fraud     0.269     0.368     0.753           1.121  0.269  Fraud
102      Legit     Legit     Legit     0.269     0.368     0.753           0      1.39   Legit
103      Legit     Legit     Legit     0.269     0.368     0.753           0      1.39   Legit
104      Fraud     Legit     Legit     0.269     0.368     0.753           0.269  1.121  Legit
105      Fraud     Fraud     Fraud     0.269     0.368     0.753           1.39   0      Fraud
106      Legit     Legit     Fraud     0.269     0.368     0.753           0.753  0.637  Fraud
107      Fraud     Fraud     Legit     0.269     0.368     0.753           0.637  0.753  Legit
108      Legit     Legit     Fraud     0.269     0.368     0.753           0.753  0.637  Fraud
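
The weighted vote can be sketched as below. The helper names `model_weight` and `weighted_vote` are assumptions made for illustration. The base-10 logarithm is also an assumption: it is inferred from the class weights in Table 2-11 (0.269, 0.368, 0.753), which it reproduces for the three error terms.

```python
import math

def model_weight(e):
    """Vote weight of a model with error term e: -log10(e / (1 - e))."""
    return -math.log10(e / (1 - e))

def weighted_vote(predictions, error_terms):
    """Sum the model weights per predicted class and return the heaviest class."""
    totals = {}
    for label, e in zip(predictions, error_terms):
        totals[label] = totals.get(label, 0.0) + model_weight(e)
    winner = max(totals, key=totals.get)
    return winner, totals

errors = [0.35, 0.30, 0.15]
# Instance 106 in Table 2-11: models #1 and #2 predict Legit, model #3 predicts Fraud.
winner, totals = weighted_vote(["Legit", "Legit", "Fraud"], errors)
```

For instance 106 the single vote of the most accurate model (about 0.753) outweighs the two Legit votes combined (about 0.637), so the final prediction is Fraud even though the majority of models predicted Legit.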

In summary, the boosting method is an iterative process in which the weights of correctly

classified instances are decreased and the weights of misclassified instances are increased. This

produces classifiers that focus on classifying instances that were previously misclassified. The

final prediction is determined by a weighted vote in which the predictions from well performing

classifiers have greater influence in the voting process.


2.3.1.3 Stacking Example

Bagging and boosting combine models of the same type to produce a final prediction. Stacking, on the

other hand, applies models built by different learning algorithms. Instead of voting,

stacking introduces a meta-classifier which tries to learn which classifiers (base classifiers) are

the reliable ones using another learning algorithm. This meta-classifier tries to determine the best

way to combine the outputs of the base classifiers.

The k-nearest neighbour algorithm (kNN), rule-based algorithm, and Bayesian algorithm

are used to construct the base classifiers. The decision tree algorithm is used to construct the

meta-classifier (classifier generated using the meta-learner algorithm). Table 2-12 consists of the

same data from Table 2-1 but with a separation of the instances into either training data for the

base classifiers or training data for the meta-classifier. Two-thirds of the data in Table 2-12 is

used for training the base classifiers, while the remaining one-third is used for training the meta-

classifier.

Table 2-12: Base classifier predictions on the example dataset

Instance #  Txn Amt ($)  Txn Location  Type of Credit Card  Correct Class  Training data for base or meta classifier
1           2            USA           Gold                 Fraud          Base classifiers
2           16           Canada        Gold                 Legit          Base classifiers
3           24           Canada        Gold                 Legit          Base classifiers
4           108          Canada        Platinum             Fraud          Base classifiers
5           427          Canada        Platinum             Legit          Base classifiers
6           28           USA           Platinum             Legit          Base classifiers
7           59           Canada        Gold                 Legit          Meta-classifier
8           107          Canada        Platinum             Fraud          Meta-classifier
9           97           USA           Platinum             Fraud          Meta-classifier


The chosen base algorithms are applied to the base classifiers’ training data to generate

base classification models. These models are applied to the meta-classifier training data to output

predictions that are used as new attributes to the meta-classifier’s training data (See Table 2-13).

Table 2-13 shows the modification of the meta-classifier training dataset by using the predictions

of the base classifiers as new attributes and combining them with the data that was originally set

aside to train the meta-classifier.

Table 2-13: Base classifier predictions as new attributes in the meta-classifier training data

Txn Amt ($)  Txn Location  Type of Credit Card  kNN    Rule-based  Bayesian
59           Canada        Gold                 Fraud  Fraud       Fraud
107          Canada        Platinum             Legit  Fraud       Fraud
97           USA           Platinum             Legit  Legit       Fraud

The meta-classifier algorithm is applied to the data in Table 2-13 (the decision tree

algorithm is chosen as the meta-classifier algorithm in this example) to construct the meta-

classifier model that produces the final predictions. Table 2-14 shows the final meta-classifier

prediction for a new testing dataset originally introduced in Table 2-5.

Table 2-14: Applying the meta-classifier model to a new testing dataset

Instance #  Txn Amt ($)  Txn Location  Type of Credit Card  Meta-classifier prediction
100         251          USA           Platinum             Fraud
101         12           USA           Gold                 Fraud
102         59           Canada        Gold                 Legit
103         1005         Canada        Gold                 Legit
104         432          Canada        Gold                 Legit
105         29           Canada        Platinum             Legit
106         65           USA           Gold                 Fraud
107         803          Canada        Gold                 Legit
108         25           USA           Platinum             Fraud


The meta-classifier prediction does not select the majority prediction from the base

classifiers. The meta-classifier uses its own algorithm to select the best prediction based on the

predictions of the base classifiers.
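
The construction of the meta-classifier training data can be sketched as follows. This is only an illustration of the data manipulation: the base classifiers are not trained here, their predictions are copied from Table 2-13 (reading its three prediction columns as kNN, rule-based, and Bayesian), and the correct classes come from instances 7 to 9 of Table 2-12. The field names and the helper `build_meta_training_set` are assumptions.

```python
# Instances 7 to 9 of Table 2-12, held out for training the meta-classifier.
meta_rows = [
    {"amount": 59, "location": "Canada", "card": "Gold", "correct": "Legit"},
    {"amount": 107, "location": "Canada", "card": "Platinum", "correct": "Fraud"},
    {"amount": 97, "location": "USA", "card": "Platinum", "correct": "Fraud"},
]

# Base classifier predictions for the same three instances (Table 2-13).
base_predictions = {
    "kNN": ["Fraud", "Legit", "Legit"],
    "rule_based": ["Fraud", "Fraud", "Legit"],
    "bayesian": ["Fraud", "Fraud", "Fraud"],
}

def build_meta_training_set(rows, predictions):
    """Append each base classifier's prediction to every row as a new attribute."""
    combined = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the original rows stay untouched
        for name, preds in predictions.items():
            new_row[name] = preds[i]
        combined.append(new_row)
    return combined

meta_training_set = build_meta_training_set(meta_rows, base_predictions)
```

Each row of `meta_training_set` now carries the original attributes, the three base predictions, and the correct class, which is the input a meta-level learning algorithm would be trained on.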

In summary, there are many different algorithms that are capable of detecting credit card

fraud. These algorithms can use either supervised or unsupervised learning and can range from

statistical methods such as the Bayesian algorithm, to perceptron-based algorithms such as a

neural network. The method of combining models aims to combine the strengths of different

algorithms to improve the accuracy of fraud detection and is just one of the many techniques that

have been used in literature for credit card fraud detection. The next chapter outlines the wide

range of techniques that have been used in the past to detect different types of fraud.


3 Literature on Credit Card Fraud Detection

In this chapter, a detailed literature study of the different techniques in fraud detection is

presented. Section 3.1 outlines the methodologies used in the literature for the detection of fraud

using single or multi-algorithm based prediction models. The literature review in Section 3.1 is

presented in chronological order. The following section, Section 3.2, introduces meta-learning (a

multi-algorithm technique) and discusses the latest work in credit card fraud that has used this

technique. Finally, Section 3.3 describes in detail the specific process of meta-learning, the

Combiner Strategy, that is implemented in this thesis for the construction of the meta-classifier.

3.1 Single and Multi-Algorithm Techniques for Fraud Detection used in Literature

Many techniques have been applied to the field of fraud detection ranging from supervised

learning and unsupervised learning to hybrid models. Bolton and Hand (2001), and Kim, Ong

and Overill (2003) both used outlier detection methods to detect abnormality in credit card

transactions. Outlier detection techniques are unsupervised learning approaches that do not

require prior knowledge of fraudulent and non-fraudulent transactions in historical databases.

These techniques look for observations that deviate from other observations enough to arouse

suspicion. The advantage of unsupervised methods is that previously undiscovered types of fraud

may be detected. Supervised methods require accurate identification of fraudulent transactions

and are only trained to discriminate between legitimate transactions and previously known fraud.

However, outlier detection can cause legitimate erratic behavior to be classified as an anomaly,

thus causing inconveniences to the customer. A more sophisticated method that is used often in

literature and industry is neural networks. Neural networks are made up of interconnected nodes

that try to imitate the functioning of the human brain. Each node has a weighted connection to

several other nodes in adjacent layers. Individual nodes take the input received from connected


nodes and use the weights together with a simple function to compute output values. The neural

network method can be either supervised or unsupervised and the output layer may contain one

or several nodes.
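
A single node of such a network can be sketched as below. The sigmoid activation and the bias term are assumptions chosen for illustration; the text above only states that a node combines its weighted inputs through a simple function.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical node with two incoming connections.
value = node_output([0.5, 1.0], [0.8, -0.4])
```

The output of one node becomes an input to the nodes it connects to in the next layer.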

Other methods seen in literature that have been used to detect fraud include rule-based

systems, decision trees, support vector machines, meta-classifier systems, and other data mining

methods, as discussed below.

Ghosh and Reilly (1994) used a neural network system which consists of a three-layered

feed-forward network with only two training passes to achieve a reduction of 20% to 40% in

total credit card fraud losses. This system also significantly reduced the investigation workload of

the fraud analysts.

Aleskerov, Freisleben, and Rao (1997) developed a fraud detection system called

Cardwatch that is built upon the neural network learning algorithm. This system is aimed

towards commercial implementation and therefore can handle large datasets, and parameters of

an analysis can be easily adjusted within a graphical user interface. Cardwatch uses three main

neural network learning techniques: conjugate gradient, backpropagation, and batch

backpropagation. This system is a useful product for large financial institutions due to its ease of

implementation with commercial databases. Unfortunately, the disadvantage of this system is the

need to build a separate neural network for each customer. This results in a very large overall

network that requires relatively higher amounts of resources to maintain.

Dorronsoro and others (1997) developed a neural network based fraud detection system

called Minerva. This system’s main focus is to embed itself deep in credit card transaction

servers to detect fraud in real time. It uses a novel nonlinear discriminant analysis technique that

combines the multi-layer perceptron architecture of a neural network with Fisher’s discriminant

analysis method. Minerva does not require a large set of historical data because it acts solely on

immediate previous history, and is able to classify a transaction in 60 ms. The disadvantage of

this system is the difficulty in determining a meaningful set of detection variables and the

difficulty in obtaining effective datasets to train with.

Kokkinaki (1997) suggested creating a user profile for each credit card account and

testing incoming transactions against the corresponding user’s profile. The attributes that were used

to construct these profiles are: credit card numbers, transaction dates, type of business, place,

amount spent, credit limit and expiration time. Kokkinaki proposed a Similarity Tree algorithm,

a variation of Decision Trees, to capture a user’s habits. The analyses found that the method has

a very small probability for false negative errors. However, in this approach the user profiles are

not dynamically adaptive and therefore continual updates are needed when user habits and fraud

patterns change.

Chan and Stolfo (1998) studied the class distribution of a training set and its effects on

the performance of multi-classifiers on the credit card fraud domain. It was found that increasing

the number of minority instances in the training process results in fewer losses due to fraudulent

transactions. Furthermore, the fraud distribution for training was varied from 10% to 90% and it

was found that maximum savings were achieved when the fraud percentage used in training was

50%.
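
One way to construct a training set with a chosen fraud percentage is to keep every fraud case and undersample the legitimate cases, as sketched below. This is an illustrative assumption and is not claimed to be the sampling procedure used by Chan and Stolfo; the function name, the `label` field, and the synthetic transactions are hypothetical.

```python
import random

def undersample_legit(transactions, fraud_fraction=0.5, seed=0):
    """Keep all fraud cases and sample legit cases to reach the target fraud fraction."""
    if not 0 < fraud_fraction < 1:
        raise ValueError("fraud_fraction must lie strictly between 0 and 1")
    fraud = [t for t in transactions if t["label"] == "Fraud"]
    legit = [t for t in transactions if t["label"] == "Legit"]
    # Number of legit cases needed so that fraud makes up fraud_fraction of the set.
    n_legit = round(len(fraud) * (1 - fraud_fraction) / fraud_fraction)
    n_legit = min(n_legit, len(legit))
    rng = random.Random(seed)
    sample = fraud + rng.sample(legit, n_legit)
    rng.shuffle(sample)
    return sample

# Hypothetical skewed dataset: 5 fraud cases among 100 transactions.
transactions = [{"id": i, "label": "Fraud" if i < 5 else "Legit"} for i in range(100)]
balanced = undersample_legit(transactions, fraud_fraction=0.5)
```

With `fraud_fraction=0.5` the resulting training set holds as many legitimate transactions as fraudulent ones.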

Brause and others (1999) looked specifically at credit card payment fraud and identified

fraud cases by combining a rule-based classification approach with a neural network algorithm.

In this approach the rule-based classifier first checked to see if a transaction was fraudulent, and

then the transaction classification was verified by a neural network. This technique increases the

probability for the diagnosis of “fraud” to be correct and therefore it is able to decrease the

number of false alarms while increasing the confidence level.

Ehramikar (2000) showed that the most predictive Boosted Decision Tree classifier is one

that is trained on a 50:50 class distribution of fraudulent and legitimate credit card transactions. It

was also reported that training decision tree classifiers on datasets with a high proportion of

legitimate transactions leads to many fraudulent cases being classified as legitimate (a high false

negative rate). This suggests that predictive model overfitting occurs when the training dataset

has a majority of legitimate transactions.

Wheeler and Aitken (2000) developed a case-based reasoning system that consists of two

parts, a retrieval component and a decision component, to reduce the number of fraud

investigations in the credit approval process. The retrieval component uses a weighting matrix

and nearest neighbor strategy to identify and extract appropriate cases to be used in the final

diagnosis for fraud, while the decision component utilizes a multi-algorithm strategy to analyze

the retrieved cases and attempts to reach a final diagnosis. The nearest-neighbour and Bayesian

algorithms were used in the multi-algorithm strategy. Initial results of 80% non-fraud and 52%

fraud recognition from Wheeler and Aitken suggest that their multi-algorithmic case-based

reasoning system is capable of high accuracy rates.

Bolton and Hand (2001) proposed an unsupervised credit card fraud detection method by

observing abnormal spending behaviour and frequency of transactions. The mean amount spent

over a specified time window was used as the comparison statistic. Bolton and Hand proposed

the Peer Group Analysis (PGA) and the Break Point Analysis (BPA) techniques as unsupervised


outlier detection tools. The paper showed that the PGA technique is able to successfully detect

local anomalies in the data, and the BPA technique is successful in determining fraudulent

behaviour by comparing transactions at the beginning and end of a time window.

Kim (2002) proposed a fraud density map technique to improve the learning efficiency of

a neural network. Because fraudulent transactions are overemphasized in training data sets,

the fraud density map (FDM) tries to address the issue of the inconsistent distributions

of legitimate and fraudulent transactions between the training data and real data. FDM adjusts

the bias found in the training data by reflecting the distribution of the real data onto the training

data through the changing of a weighted fraud score.

Maes (2002) applied artificial neural networks (ANN) and Bayesian belief networks

(BBN) to a real world dataset provided by Europay International. The best prediction rate was

obtained for the experiment in which the features were pre-processed. It was found that by

performing a correlation analysis on the features and removing the feature that was strongly

correlated with many of the other features, clear improvements to the results were obtained.

Furthermore, the experiments showed that BBNs yield better fraud detection results and that their

training period is shorter; however, ANNs were found to compute fraud predictions faster

in the testing stage.

Chen and others (2004) presented a new method to address the credit card fraud problem.

A questionnaire-responded transaction (QRT) dataset of users was developed by using an online

questionnaire. The support vector machine algorithm was then applied to the data to develop the

QRT models, which were then used to decide if new transactions were fraudulent or legitimate.


It was found that even with very little transaction data the QRT model has a high accuracy in

detecting fraud.

Chiu and Tsai (2004) identified the problem of credit card transaction data having a

natural skewness towards legitimate transactions. The ratio of fraud transactions to normal

transactions is extremely low for an individual FI, and this makes it difficult for FIs to maintain

updated fraud patterns. The authors of this paper proposed web service techniques for FIs to

share their individual fraud transactions to a centralized data centre and a rule-based data mining

algorithm was then applied to the combined dataset to detect credit card fraud.

Fan (2004) proposed an efficient algorithm based on decision trees. The decision tree

“sifts through” old data and combines it with new data to construct the optimal model. The basic

idea is to train a number of random and uncorrelated decision trees, and each decision tree is

constructed by randomly selecting available features. The structures of the trees are uncorrelated;

the only correlation is in the training data itself.

Foster and Stine (2004) attempted to predict personal bankruptcy using a fully automated

stepwise regression model. Neural network models used in fraud detection modeling are often

regarded as black-boxes, and it is difficult to follow the process from input to the output

prediction. On the other hand, the benefit of a statistical model is the ability to easily understand

the procedures in the prediction process. The results from this paper indicate that standard

statistical models are competitive with decision trees.

Abdelhalim and Traore (2009) tackled the application fraud problem where a fraudster

applies for an identity certificate using someone else’s identity. Identity certificates were

extracted from the web and cross-referenced with the information from application forms and

identity claims (i.e. passport application, credit card application, etc.) to detect anomalies. The


paper introduced a rule-based decision tree technique to design their fraud detector. This

technique was able to correctly identify 92% of the application fraud cases.

The single algorithm techniques presented above are summarized in Table 3-1, while the

multi-algorithm techniques used in literature are summarized in Table 3-2. These experiments

show that in the study of fraud activities, neural networks, Bayesian algorithms, decision trees,

and nearest-neighbour methods are extremely effective in fraud detection. The neural network

methodology has been found to be the most popular method in recent credit card fraud detection

studies (Ngai, et al. 2011); however, there has also been successful work in the literature on fraud

identification using different algorithms such as decision trees, statistical models, and nearest-

neighbour strategies (See Table 3-1). The effectiveness of these algorithms led to the selection of

a multi-algorithm strategy to detect credit card fraud for this thesis. By applying multiple

algorithms to a neural network filtered dataset, we hope to further improve the accuracy of

fraud detection.

Studies have shown that Bayesian networks and regression models are able to outperform

neural networks in fraud detection accuracy. Studies that combine different data mining

algorithms have also become more common in the literature and have been shown to outperform single algorithm

methods. Since the dataset in this thesis consists of transactions that have already been filtered by

a neural network model, a multi-algorithmic approach that consists of algorithms other than a

neural network has the greatest potential in improving fraud detection and therefore a multi-

algorithm method is used in this thesis.


Table 3-1: Summary of single algorithm techniques in literature for the prediction of fraud

Reference: Ghosh and Reilly (1994)
Method: Neural network (restricted coulomb energy algorithm)
Method Applied to: Credit card transactions
Advantages: Increased accuracy and timeliness of fraud detection
Disadvantages: Compared to other data mining techniques this method requires a longer training period

Reference: Aleskerov, Freisleben, and Rao (1997)
Method: Neural network (gradient descent algorithm)
Method Applied to: Credit card transactions
Advantages: Can handle large commercial size databases
Disadvantages: Non-convergence in training

Reference: Dorronsoro (1997)
Method: Neural network
Method Applied to: Credit card transactions
Advantages: Real-time fraud detection
Disadvantages: Difficulty in determining the optimal size of the hidden layers

Reference: Kokkinaki (1997)
Method: Decision tree
Method Applied to: Credit card transactions
Advantages: Simple and easy to implement; reduced misclassifications
Disadvantages: Not dynamically adaptive

Reference: Ehramikar (2000)
Method: Decision tree
Method Applied to: Credit card transactions
Advantages: Predictive performance was improved by increasing the number of minority instances
Disadvantages: Only the decision tree algorithm is experimented upon

Reference: Wheeler and Aitken (2000)
Method: Case-based reasoning (Nearest neighbor and probabilistic algorithms)
Method Applied to: Credit applications
Advantages: Model can be easily updated and maintained; robust to missing or irrelevant data
Disadvantages: Requires two separate experiments; one to determine the instances to experiment upon, and another to determine the final prediction.

Reference: Bolton and Hand (2001)
Method: Outlier detection (unsupervised)
Method Applied to: Credit card transactions
Advantages: Successful in detecting local anomalies and can detect fraudulent behavior in a continuous manner
Disadvantages: Treats all accounts equally; does not differentiate between different accounts

Reference: Kim (2002)
Method: Neural network with weighted fraud scores (unsupervised)
Method Applied to: Credit card transactions
Advantages: Increased number of detected frauds compared to a neural network only classifier
Disadvantages: Backpropagation is used to train the neural networks; this method is only able to find local minima in the error function, therefore an optimal model may not always be reached

Reference: Maes (2002)
Method: Neural & Bayesian belief networks
Method Applied to: Credit card transactions
Advantages: By removing highly correlated attributes, fraud detection was improved
Disadvantages: Bayesian algorithm performs better than neural networks in fraud detection

Reference: Fan (2004)
Method: Decision tree
Method Applied to: Synthetic data and credit card transaction data
Advantages: The use of a cross-validation decision tree ensemble decreases error rate in fraud prediction
Disadvantages: Prediction performance with this method decreases as the percentage of recent transactions increase in the training data

Reference: Foster and Stine (2004)
Method: Regression model
Method Applied to: Personal bankruptcy
Advantages: Easy to understand the procedure in the prediction process; competitive to neural networks and decision tree methods
Disadvantages: Linear models cannot easily adapt to changes in fraud patterns


Table 3-2: Summary of multi-algorithm techniques in literature for the prediction of fraud

Reference: Chan and Stolfo (1998)
Method: Multi-classifier meta-learning
Method Applied to: Credit card transactions
Advantages: A 46% improvement over the no fraud detection scenario was achieved
Disadvantages: Required to determine the best distribution to use for each training experiment

Reference: Brause (1999)
Method: Combination of rule-based and neural network algorithms
Method Applied to: Credit card transactions
Advantages: Increased the number of correct classifications and decreased the number of false alarms
Disadvantages: Batch process; requires two separate experiments (one for a rule-based algorithm and another for a neural network)

Reference: Chen (2004)
Method: Support vector machine applied to questionnaire-responded transaction data
Method Applied to: Credit card transactions
Advantages: Able to achieve high accuracy in fraud detection with very little transaction data
Disadvantages: New questionnaires need to be conducted whenever user behaviour changes

Reference: Chiu and Tsai (2004)
Method: Rule-based algorithm applied to a web-based knowledge sharing scheme
Method Applied to: Credit card transactions
Advantages: Able to centralize fraudulent transactions from different FIs, thereby increasing the prediction accuracy of models by training on a higher fraud distributed dataset
Disadvantages: Subject to the willingness of FIs to share credit card transaction data

Reference: Abdelhalim and Traore (2009)
Method: Decision tree algorithm applied to online identity application data
Method Applied to: Identity application fraud
Advantages: Able to correctly classify 92% of the identity application fraud cases
Disadvantages: The data used was a mix of real data collected online and synthetic data; a more accurate experiment would be to use 100% real data


3.2 Meta-Learning in Credit Card Fraud Detection

This section outlines the development of the multi-algorithm strategy. The strategy was first

introduced in speech recognition and was eventually adapted for use in credit card fraud detection.

Meta-learning is a general technique to coalesce the results of multiple learners. The idea

of applying multiple algorithms to achieve an overall accuracy higher than a single learning

algorithm was first proposed in speech recognition by Stolfo, Galil, McKeown, and Mills in

1989 (Stolfo, et al. 1989). The first foray into the combiner strategy was made by Wolpert

(Wolpert 1992), who proposed a strategy, termed stacked generalization, to improve on the

cross-validation method by estimating and correcting for the error of a base classifier.

The first foray into the arbiter strategy was conducted by Schapire (1990) and was termed

“hypothesis boosting”. This scheme consists of three different classifiers. The first classifier

learns from the given training data and generates its predictions. The second classifier learns

from instances that are equally likely to be correctly or incorrectly classified by the first learned

classifier. Finally, the last classifier is the arbiter classifier, which learns from examples

on which the first two classifiers disagree. The final prediction is chosen by analyzing the

predictions of all three classifiers with the arbiter classifier breaking a tie in situations where the

first two classifiers disagree. Schapire’s hypothesis boosting is essentially a boosting technique

that requires the generation of two additional distributions of examples and utilizes only a single

learning algorithm.
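
The arbiter's role can be sketched as follows. This is a simplified illustration of the tie-breaking rule only: the three classifiers are hypothetical threshold rules on the transaction amount rather than learned models, and the function names are assumptions.

```python
def disagreement_set(examples, first_clf, second_clf):
    """Examples on which the first two classifiers disagree (arbiter training data)."""
    return [x for x in examples if first_clf(x) != second_clf(x)]

def final_prediction(x, first_clf, second_clf, arbiter_clf):
    """Use the agreed label, or defer to the arbiter when the first two disagree."""
    a, b = first_clf(x), second_clf(x)
    return a if a == b else arbiter_clf(x)

# Hypothetical stand-in classifiers: simple thresholds on the transaction amount.
first = lambda amount: "Fraud" if amount > 100 else "Legit"
second = lambda amount: "Fraud" if amount > 50 else "Legit"
arbiter = lambda amount: "Fraud" if amount > 75 else "Legit"

amounts = [20, 60, 80, 150]
confusing = disagreement_set(amounts, first, second)
labels = [final_prediction(a, first, second, arbiter) for a in amounts]
```

Only the amounts on which the first two rules disagree reach the arbiter; for the others the shared prediction is used directly.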

Chan and Stolfo (1993) expanded Wolpert’s and Schapire’s initial works by developing a

multistrategy hypothesis boosting technique that uses ideas from hypothesis boosting and stacked

generalization. Three strategies are introduced: combiner strategy, arbiter strategy, and a hybrid


strategy. Each strategy has a different technique for combining the predictions of the base

learners. The combiner strategy joins the predictions from the base classifiers by learning the

relationship between base predictions and the correct prediction. The arbiter strategy learns from

examples that are confusing to the base classifiers. Finally, the hybrid strategy picks examples as

in the arbiter strategy (predictions that do not agree) and then joins the predicted classifications

of data in disagreement by the base classifiers as in the combiner strategy. From the Chan and

Stolfo experiments (Chan and Stolfo 1993) it was found that the combiner strategy performed

more effectively than the arbiter or hybrid strategies.

Credit card fraud detection using meta-learning strategies was first extensively studied by

Stolfo and others (Stolfo, et al. 1997). Their initial results show that a meta-classifier generated

using the Bayesian algorithm achieves the highest True Positive rates (correctly classified

fraudulent transactions), while the best base classifiers are the ones generated using the CART

and RIPPER algorithms. In a 1999 paper by Chan, a cost model was developed to evaluate the

effectiveness of the meta-learning strategy proposed by Chan in 1993 (Chan and Stolfo 1993).

The technique of combining multiple base models to produce meta-classifiers was used to offset

the loss of predictive performance that usually occurs when mining from data subsets or

sampling.

The results from the experiments by Stolfo and Chan showed great success in the

implementation of a meta-learning classifier in the detection of credit card fraud. The meta-

learning approach was shown to be significantly more effective than the methods used by the FIs

at that time. Due to these findings, the meta-learning strategy was selected to be implemented

onto the neural network filtered dataset.


3.3 Meta-Learning and the Combiner Strategy

The methodology applied in the thesis work closely follows the “meta-learning”

techniques introduced by Chan and Stolfo (Chan and Stolfo 1993). No single learning algorithm

can uniformly outperform other algorithms over all datasets. Furthermore, previous studies have

found that by modifying the distribution of examples in such a way as to force a learning

algorithm to focus on the harder-to-learn parts of the distribution, the accuracy of this learner can

be greatly improved (Schapire 1990). Thus, the meta-learning technique aims to coalesce the

results of multiple learners to improve prediction accuracy and to utilize the strengths of one

method to complement the weakness of another. In this approach, rather than using weights to

train a model, the predictions of a set of base classifiers are used as training data to “meta-learn”

a set of new classifiers. It involves applying multiple algorithms on the same dataset and

combining the results by meta-learning. There are two methods of combining algorithms that were

introduced by Chan and Stolfo: the arbiter and the combiner strategies. Through experimentation

conducted in previous papers it was found that the combiner strategy performs more effectively

than the arbiter strategy, therefore only the combiner strategy is used in this thesis.

The next section provides an overview of the combination method used in the meta-

learning method (the “combiner strategy”).


3.3.1 The Combiner Strategy in Detail

In the combiner strategy, as shown in Figure 3-1, the attributes and correct classifications

of credit card transaction instances are used to train multiple base classifiers. The predictions of

the base classifiers are used as new attributes for the meta-level classifier. By combining the

original attributes, the base classifier predictions, and the correct classification for each instance

(the composition rule), a new “combined” dataset is created which is used as the training data to

generate the meta-level classifier. The predictions from the meta-level classifier are then used as

the final predictions in the combiner strategy.

Figure 3-1: Classification of a credit card transaction by the combiner strategy
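The composition rule described above can be illustrated with a minimal Python sketch. This is not the Weka workflow used in the thesis: the function name `build_meta_dataset` and the two toy base classifiers are hypothetical stand-ins, assumed only to show how the original attributes, the base predictions, and the correct classification are combined into one meta-level training row.

```python
def build_meta_dataset(instances, labels, base_classifiers):
    """Composition rule: original attributes + base predictions + correct label."""
    meta_rows = []
    for attributes, label in zip(instances, labels):
        # Each base classifier contributes one new attribute: its prediction.
        predictions = [classify(attributes) for classify in base_classifiers]
        meta_rows.append((list(attributes) + predictions, label))
    return meta_rows


# Hypothetical stand-ins for trained base classifiers (1 = fraud, 0 = legitimate).
def high_amount(attributes):
    return 1 if attributes[0] > 500 else 0


def foreign_merchant(attributes):
    return 1 if attributes[1] == "foreign" else 0


instances = [(900.0, "foreign"), (20.0, "domestic")]
labels = [1, 0]
meta = build_meta_dataset(instances, labels, [high_amount, foreign_merchant])
```

Each row of `meta` is what a meta-level learner would be trained on; any learning algorithm could then be applied to these rows.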


In summary, the non-neural network techniques that have been used in the literature for fraud detection were studied extensively in the late 1990s and early 2000s. Furthermore, the meta-learning strategy was last implemented on credit card data in 1999. Therefore, it is valuable to

determine the effectiveness of these techniques on recent credit card data. Results from the Chan

and Stolfo studies have shown that the ‘Combiner Strategy’ is the best performing meta-learning

method and this strategy is used exclusively in this thesis. In the next chapter the metrics used for

the selection of the base classifiers, and the training, validation, and testing dataset sizes are

discussed. The application of the combiner strategy and the performance evaluation using

different ranking and evaluation methods are also described.


4 Methodology

In this Chapter, the methodologies used in the construction of the meta-classifier are discussed in

detail. In Section 4.1, the software used in the construction of the meta-classifier is presented.

Section 4.2 discusses in detail the filtering and pre-processing of the initial dataset, and presents the reasoning behind the construction of datasets with a 50:50 fraudulent to legitimate transaction ratio. In Section 4.3 a metric is introduced to determine the optimal number of base classifiers, and the reasoning behind the selection of the types of algorithms used for the base classifiers is discussed. Section 4.4 discusses the selection of the training,

validation, and testing dataset sizes. Section 4.5 presents the four stages involved in the

construction of the meta-classifier. Finally Section 4.6 describes the ranking methods and the

evaluation techniques used to determine the performance of the meta-classifier.

4.1 Software Used

All the meta-classification models and outputs were obtained using the open source “Weka” data

mining software (Hall, et al. 2009). Weka contains a large set of data mining algorithms and is widely used in academia. Weka was chosen primarily for the abundance of algorithms it offers and because the implementation of each algorithm is thoroughly documented in the software. Weka was developed and is

maintained by the University of Waikato, in New Zealand. In addition to Weka, Microsoft Excel

and SPSS Clementine software were used extensively for analyses throughout the experiments.

4.2 Data preparation

The dataset received from the FI contained transactions with Falcon scores ranging from 0 to

999. However, all transactions with a Falcon score lower than 900 were removed. By analyzing

transactions only with high Falcon scores, the dataset is limited to transactions that are most


likely to be fraudulent. As a result, the percentage of minority instances is increased which is

beneficial in the training process. The dataset for the testing month used in this thesis is from

October 2009 and it contains 106,934 credit card transactions with Falcon scores greater than or

equal to 900. For this testing month, 11,317 transactions have been verified by the FI as fraudulent, accounting for 10.6% of the transactions in this month. It is also important to point

out the types of transactions that were present in this dataset. All credit card transactions go

through a neural network system (Falcon fraud manager) as described in Section 1.1. Based on

the Falcon score values, the FI’s in-house methods are then used to predict whether a transaction

is legitimate or fraudulent. The dataset that is used throughout this thesis is assumed to consist entirely of transactions that were deemed to be fraudulent by the FI classification methods (Falcon scores greater than or equal to 900). The correct classification labels are assumed to be determined

through the investigation of these transactions.

The dataset that was initially received from the FI contained 11 months of data from

December 2008 to October 2009 with one data file per month. Each of these 11 files contained

41 attributes (See Appendix B: Table B1). After pre-processing and data cleansing, 29 attributes

remained in the dataset (See Appendix B: Table B2). The “Time” and “Date” attributes themselves do not provide valuable classifier training information; however, the time and day differences between subsequent transactions can be quite informative. The “Time” and “Date” attributes were therefore converted to more useful attributes by computing the difference in time and days between subsequent credit card transactions using the SPSS Clementine software. The resulting “Time Difference” and “Date Difference” attributes replaced the “Time” and “Date” attributes, respectively. The final major modification to the dataset was done to the merchant state

attribute. Originally “Merchant State” consisted of the abbreviations for the 50 states of the


United States and the 10 provinces and 3 territories of Canada. This resulted in too many unique values for this attribute, which could weaken the predictive accuracy of the meta-classifier. Therefore, the 50 states were converted and reduced to just 4 labels depending on the

region the state resided in. The 4 labels are: NEUS (North Eastern United States), MWUS (Mid-

Western United States), WUS (Western United States), and SUS (Southern United States)

(Reasons for the removal and changes of attributes are listed in Appendix B: Table B3). The next

step in the data preparation process was to match the credit card transactions to the FI’s database

of verified fraudulent transactions. This was done by comparing the cleansed dataset that has 29

attributes with a new dataset that contained only fraudulent transactions. Using a C-program, the

credit card number, time stamp, and date stamp were compared between the two datasets. If any

matches were found, the program would add a ‘Y’ label to the transaction in the cleansed dataset

to represent a fraudulent transaction. The final step in data preparation involved the removal of

characters that were unacceptable for the Weka program. This was done by another C-program

that would scan through the dataset and replace the unacceptable characters with dashes. The

final 29 attributes used in the analysis for this thesis are listed in Appendix B: Table B4. The

formatting of the attributes, whether the attributes were categorical or numerical, the possible

values, and a brief description of each attribute are also listed in this table.
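The thesis derived the “Time Difference” and “Date Difference” attributes in SPSS Clementine; the exact procedure is not given. As a minimal Python sketch of the idea, assuming that differences are taken between consecutive transactions on the same card, that the day difference is counted in calendar days, and that the time difference is measured in seconds (all three are assumptions, as is the function name `add_differences`):

```python
from datetime import datetime


def add_differences(transactions):
    """transactions: iterable of (card_number, timestamp) pairs.

    Returns (card_number, timestamp, day_diff, second_diff) tuples, where both
    differences are measured from the previous transaction on the same card.
    The first transaction of a card has no predecessor, so both are None.
    """
    previous = {}
    rows = []
    # Sorting by (card, timestamp) places each card's transactions in time order.
    for card, ts in sorted(transactions):
        prev = previous.get(card)
        if prev is None:
            rows.append((card, ts, None, None))
        else:
            day_diff = (ts.date() - prev.date()).days
            second_diff = (ts - prev).total_seconds()
            rows.append((card, ts, day_diff, second_diff))
        previous[card] = ts
    return rows


sample = [
    ("A", datetime(2009, 10, 3, 9, 0)),
    ("A", datetime(2009, 10, 1, 10, 0)),
    ("A", datetime(2009, 10, 1, 10, 5)),
]
rows = add_differences(sample)
```

A short gap between two transactions on the same card (here five minutes) is the kind of signal the raw “Time” and “Date” values cannot express on their own.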

In credit card fraud detection, it has been shown that the desired fraud to legitimate

distribution is 50:50 for the training process (Ehramikar 2000), (Stolfo, et al. 1997). Therefore,

the training datasets were divided into subsets such that 50% of the instances were fraudulent

transactions and the other 50% were legitimate transactions (See Figure 4-1). The distribution of

the original credit card datasets contained a 10:90 ratio of fraudulent to legitimate transactions.

To achieve the desired 50:50 distribution, the minority instances were replicated across the


majority instances by dividing the dataset into partitions. The technique used to determine the

number of partitions is as follows:

$$\text{Number of Partitions} = \frac{u}{v} \times \frac{y}{x} \qquad (4.1)$$

$$\text{Number of Minority Instances in each Partition} = nx \qquad (4.2)$$

$$\text{Number of Majority Instances in each Partition} = \frac{nxv}{u} \qquad (4.3)$$

Where n is the size of the dataset with a distribution of x:y, x is the percentage of the minority

instances, y is the percentage of the majority instances, u:v is the desired distribution where u is

the desired percentage of minority instances, and v is the desired percentage for the majority

instances.

Since the original dataset contained approximately 10% fraudulent transactions and 90% legitimate transactions and the desired training dataset distribution is 50:50, the desired number of partitions was calculated to be nine ($\frac{50}{50} \times \frac{90}{10} = 9$). The data subsets were formed by merging the replicated minority instances (fraud transactions) with each of the 9 partitions containing majority instances (legitimate transactions) (Chan and Stolfo 1998).


Figure 4-1: Constructing a 50:50 distribution for the training datasets
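The partitioning arithmetic of Equations 4.1 to 4.3 can be checked with a short Python sketch. The function names `partition_plan` and `build_balanced_subsets` are illustrative, and the sketch assumes that x, y, u, and v are given as fractions (0.10 rather than 10) and that any majority instances left over after an even split are simply dropped.

```python
def partition_plan(n, x, y, u, v):
    """Equations 4.1 to 4.3, with x, y, u, v given as fractions (e.g. 0.10)."""
    num_partitions = round((u / v) * (y / x))
    minority_per_partition = round(n * x)
    majority_per_partition = round(n * x * v / u)
    return num_partitions, minority_per_partition, majority_per_partition


def build_balanced_subsets(minority, majority, num_partitions):
    """Pair every majority partition with a full copy of the minority set."""
    size = len(majority) // num_partitions  # leftover majority rows are dropped
    return [
        list(minority) + list(majority[p * size:(p + 1) * size])
        for p in range(num_partitions)
    ]


# A 10:90 dataset of 1,000 transactions rebalanced to 50:50.
plan = partition_plan(n=1000, x=0.10, y=0.90, u=0.50, v=0.50)
fraud = [("fraud", i) for i in range(100)]
legit = [("legit", i) for i in range(900)]
subsets = build_balanced_subsets(fraud, legit, plan[0])
```

With these inputs the plan yields nine subsets, each holding the same 100 fraudulent instances alongside a different block of 100 legitimate instances.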

4.3 Diversity – Selecting base classifiers

The number of base classifiers used for the training stage and the type of algorithms used for

each classifier were chosen based on a diversity metric as introduced by Chan (Chan 1996). This

entropy-based metric measures the “randomness” of the predictions and how “different” the base

classifiers are based on their predictions. It measures the average amount of information required

to represent each event. The larger the diversity value, the more evenly distributed the

predictions are for the base classifiers, while a smaller diversity value represents base classifiers

that have predictions that have more bias (some predictions are more likely to occur) (Chan

1996). For each instance, yi, the fraction of base classifiers predicting class, classk, (pik) is

calculated as follows:

$$p_{ik} = \frac{1}{b} \sum_{j=1}^{b} \text{OneIfTrue}\big(C_j(y_i) = class_k\big) \qquad (4.4)$$


Where Cj is the prediction of base classifier j, yi is the instance i, classk is the kth class of the

target variable, and b is the number of base classifiers. Using pik, the entropy in the predictions

for each instance is calculated. Diversity is defined as:

$$diversity = \frac{1}{n} \sum_{i=1}^{n} \left[ -\frac{1}{\log c} \sum_{k=1}^{c} p_{ik} \log(p_{ik}) \right] \qquad (4.5)$$

The entropy of the pik values for each instance is normalized by log c, where c is the number of classes in the target variable. The normalized entropy is then averaged over the number of instances, n, to determine the diversity value for the specified base classifiers. Since there are

only 2 classes for the target variable in credit card fraud detection (legitimate or fraudulent),

Equation 4.4 can be expressed as:

$$p_{i0} = \frac{1}{b} \sum_{j=1}^{b} \text{OneIfTrue}\big(C_j(y_i) = class_0\big) \qquad (4.6)$$

$$p_{i1} = \frac{1}{b} \sum_{j=1}^{b} \text{OneIfTrue}\big(C_j(y_i) = class_1\big) \qquad (4.7)$$

$$p_{i0} + p_{i1} = 1 \qquad (4.8)$$

Where $p_{i0}$ represents the fraction of base classifiers predicting class 0 (legitimate class) for instance i, and $p_{i1}$ represents the fraction of base classifiers predicting class 1 (fraudulent class) for instance i. Similarly Equation 4.5 can be expressed as:

$$diversity = \frac{1}{n} \sum_{i=1}^{n} \left\{ -\frac{1}{\log 2} \left[ p_{i0} \log(p_{i0}) + p_{i1} \log(p_{i1}) \right] \right\} \qquad (4.9)$$
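As a minimal sketch of the entropy-based diversity metric (Equation 4.5, which reduces to Equation 4.9 for two classes), the following Python function computes the value from a table of base classifier predictions. The function name and the input layout are assumptions made for illustration, and the sketch treats 0 log 0 as 0, the usual convention for entropy.

```python
import math


def diversity(predictions, num_classes=2):
    """predictions[i][j] is the class predicted by base classifier j for instance i."""
    total = 0.0
    for row in predictions:
        b = len(row)  # number of base classifiers
        entropy = 0.0
        for k in range(num_classes):
            p_ik = sum(1 for pred in row if pred == k) / b
            if p_ik > 0:  # 0 * log(0) is treated as 0
                entropy -= p_ik * math.log(p_ik)
        total += entropy / math.log(num_classes)  # normalize by log c
    return total / len(predictions)  # average over the n instances


unanimous = diversity([[0, 0], [1, 1]])   # classifiers always agree
split = diversity([[0, 1], [1, 0]])       # classifiers always disagree
```

Classifiers that always agree give the minimum diversity, while an even split of predictions on every instance gives the maximum.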


When comparing different combinations of base classifiers, the predictions from the base

classifiers that are more evenly distributed result in a larger diversity value. The diversity

calculations were done for different classifier combinations. According to Schapire (Schapire

1990), “A model of learnability in which the learner is only required to perform slightly better

than guessing is as strong as a model in which the learner’s error can be made arbitrarily small”.

This suggests that even simple algorithms can be excellent candidates for constructing base

classifiers. Therefore, the Naïve Bayesian (NB) classifier, and the k-Nearest Neighbour (kNN)

classifier were used as the starting combination for testing. Different classifiers were added to

the initial two to determine the effect multiple classifiers have on the diversity value.

Using multiple base classifiers is beneficial because each classifier has an inductive bias

towards a certain learning space. Inductive biases are assumptions that the base classifiers use to

predict outputs given inputs that have not been encountered. With multiple classifiers, a wider

range of learning spaces are available, leading to a higher chance that the target pattern is

covered within the base classifiers’ learning space (Vilalta and Drissi 2001), (Mitchell 1980).

The results from the diversity calculations are presented in Section 5.2. Based on these

diversity calculations the best performing algorithms are selected as base algorithms for the construction of the meta-classifier.

4.4 Selecting the Training, Validation, and Testing Dataset Sizes

The optimal number of months for training, validating, and testing was determined by comparing Receiver Operating Characteristic (ROC) areas. The ROC curve is a plot of the true positive rate against the false positive rate for a binary classification system, and the ROC area is the area under this curve. The True Positive Rate (TPR) is equivalent to sensitivity and is defined as:

$$TPR = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \qquad (4.10)$$

Where True Positives are fraudulent transactions predicted to be fraudulent, and False Negatives

are fraudulent transactions predicted to be legitimate.

The False Positive Rate (FPR) is equivalent to (1 – Specificity) and can be defined as:

$$FPR = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} \qquad (4.11)$$

Where False Positives refer to legitimate transactions predicted to be fraudulent, and True

Negatives are legitimate transactions predicted to be legitimate.
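Equations 4.10 and 4.11 can be computed directly from actual and predicted labels. The Python sketch below is illustrative only; the function name `rates` and the encoding (1 = fraudulent, 0 = legitimate) are assumptions.

```python
def rates(actual, predicted):
    """Return (TPR, FPR) for binary labels where 1 = fraudulent, 0 = legitimate."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    tpr = tp / (tp + fn)  # Equation 4.10
    fpr = fp / (fp + tn)  # Equation 4.11
    return tpr, fpr


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr = rates(actual, predicted)
```

In this toy example two of the three fraudulent transactions are caught and one of the five legitimate transactions raises a false alarm.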

The upper left corner of a ROC plot represents the best possible prediction method since

that region presents the highest TPR and the lowest FPR. Figure 4-2 shows the performance of

prediction models based on ROC curves. The red curve represents a highly accurate model, the

blue curve represents a less accurate model, and the green curve represents a model that has a 50-

50 chance of providing the right prediction. The coordinate (0, 1) represents the best case

scenario with no False Positives and no False Negatives, i.e. a 100% prediction accuracy. ROC

curves that are closest to the (0, 1) coordinate have a larger area under the curve. Therefore the optimal training, validating, and testing month sizes were determined by selecting the dataset sizes that generate prediction models with the largest ROC areas.


Figure 4-2: Example of different ROC curves

Weka uses the Mann–Whitney U statistic to calculate the area under a ROC curve. In order to

calculate the U-statistic, the dataset must first be arranged in ascending order (based on the meta-

classifier’s probability score for the ‘Fraud’ class) with tied scores receiving a rank equal to the

average position of those scores in the ordered sequence. The U-statistic can then be defined as

follows:

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2} \qquad (4.12)$$

Where n1 is the sample size of sample 1 (choose instances in which transactions have high

probability scores), and R1 is the sum of the ranks in sample 1. The Mann–Whitney U is closely

related to the area under the receiver operating characteristic curve (Mason and Graham 2002).

The area under the curve (AUC) is defined as follows:

$$AUC = \frac{U_1}{n_1 n_2} \qquad (4.13)$$

Where U1 is the U value calculated using sample 1, n1 is the size of sample 1, and n2 is the size

of sample 2 (sample 2 are the instances which are not chosen to be in sample 1).
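The rank-based calculation in Equations 4.12 and 4.13 can be sketched in Python as follows. This is an illustration of the formulas, not Weka's implementation. It assumes that sample 1 consists of the actually fraudulent instances (the standard choice when relating U to the ROC area), and the function name `auc_mann_whitney` is hypothetical.

```python
def auc_mann_whitney(scores, labels):
    """AUC from the Mann-Whitney U statistic; labels use 1 = fraud, 0 = legitimate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending scores
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average position of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n1 = sum(1 for y in labels if y == 1)  # sample 1: fraudulent instances
    n2 = len(labels) - n1                  # sample 2: all remaining instances
    r1 = sum(rank for rank, y in zip(ranks, labels) if y == 1)
    u1 = r1 - n1 * (n1 + 1) / 2            # Equation 4.12
    return u1 / (n1 * n2)                  # Equation 4.13


auc = auc_mann_whitney([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The result equals the fraction of (fraudulent, legitimate) pairs in which the fraudulent instance receives the higher score, with ties counted as one half.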

The areas under the ROC curve for varying training, validation, and testing dataset sizes are

shown in Section 5.3. The dataset sizes that result in prediction models with the largest area

under their ROC curves are selected in the construction of the meta-classifier.

4.5 Constructing the Meta-classifier

There are four main stages in the meta-learning process. Stage 1 creates the base classifiers using

a training dataset that consists of 50% fraudulent transactions and 50% legitimate transactions. In

stage 2, the base classifiers are applied to a validation dataset to generate base predictions. These

predictions are then combined with the validation dataset in stage 3 and a meta-algorithm is

applied to this combined dataset to produce a meta-classifier. Finally, in stage 4 the base

classifiers from stage 1 are applied to the testing dataset to produce new base predictions. These

predictions along with the testing dataset attributes are used as input to the meta-classifier to

output the final predictions for each credit card transaction.

4.5.1 Meta-Learning Stage 1

The 1st stage of the meta-learning method consists of training the “base” classifiers. In meta-

learning, base classifiers are constructed by applying an algorithm to a 50:50 legitimate to

fraudulent distributed training dataset to produce base classifier predictions (see Figure 4-3).


Figure 4-3: Training stage in meta-learning consists of training base classifiers using a 50:50 distribution training set

As mentioned in Section 4.2, the training dataset is divided into 9 subsets which have a 50:50

legitimate to fraud distribution. The base algorithms are then applied to the 9 training data

subsets to generate 27 different base classifiers (3 algorithms applied to 9 different subsets).
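The bookkeeping behind the 27 base classifiers (3 algorithms applied to 9 subsets) can be sketched in Python. The sketch does not reproduce the Weka learners: `train_majority_class` is a deliberately trivial, hypothetical learner that stands in for all three algorithms, and the tiny subsets are invented so that the loop can run.

```python
def train_majority_class(subset):
    """Toy learner (an assumption): always predicts the subset's most common label."""
    labels = [label for _, label in subset]
    majority = max(set(labels), key=labels.count)
    return lambda attributes: majority


def train_base_classifiers(algorithms, subsets):
    """Train every algorithm on every subset; keys are (algorithm name, subset index)."""
    classifiers = {}
    for name, train in algorithms.items():
        for index, subset in enumerate(subsets, start=1):
            classifiers[(name, index)] = train(subset)
    return classifiers


# Placeholder training functions; real learners would differ per algorithm.
algorithms = {
    "naive_bayes": train_majority_class,
    "knn": train_majority_class,
    "decision_tree": train_majority_class,
}
# Nine invented subsets of (attributes, label) pairs.
subsets = [[((0.0,), 0), ((1.0,), 1), ((2.0,), 1)] for _ in range(9)]
base_classifiers = train_base_classifiers(algorithms, subsets)
```

Keying each classifier by algorithm and subset index makes it straightforward to apply all 27 to the validation and testing datasets in later stages.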

4.5.2 Meta-Learning Stage 2 & 3

The 2nd and 3rd stages of the meta-learning process utilize the validation dataset to generate both

the base classifier predictions as well as the meta-classifier (see Figure 4-4). The validation

dataset is a separate dataset from the training dataset. While the training dataset was created such

that there is a 50:50 ratio of fraudulent to legitimate transactions, the validation dataset has an

unaltered distribution that has an approximate ratio of 10:90 fraudulent to legitimate transactions.

In stage 2, the validation dataset is used as input for the 27 base classifiers to produce 27 unique

sets of predictions. These predictions are then combined with the original validation dataset in


stage 3. This new validation dataset now contains the original 29 attributes as mentioned in

Section 4.2, the correct classification for each instance, and the 27 base classifier predictions.

The Naïve Bayesian algorithm was used as the meta-level algorithm because it has been shown in the literature to be the most effective for meta-learning in the credit card domain (Chan and Stolfo 1998). It is applied to the new validation dataset to produce the meta-classifier.

Figure 4-4: Generating the base classifier predictions and Meta-Learning classifier

The kNN algorithm and the decision tree algorithm in Weka were also used in separate trials for

generating the meta-classifier. Results show that when the Naïve Bayesian algorithm was used as

the meta-algorithm, there was a 10% and a 30% increase in ROC area compared to the results


obtained when using the decision tree and kNN algorithms, respectively. These results agree

with other findings in the literature that suggest that the Naïve Bayesian algorithm is the most

effective and efficient algorithm for training the meta-classifier (Chan and Stolfo 1993).

4.5.3 Meta-Learning Stage 4

The final step in the meta-learning process is to use the meta-classifier created in stage 3 to

compute a “meta-learned” prediction for the credit card transactions. The 27 base classifiers,

created in stage 1, were re-evaluated on the testing dataset and predictions were generated.

Similar to stage 3, the predictions were merged into the testing dataset as new attributes. The

meta-classifier was then applied to this new testing dataset to produce the final predictions for

the transactions (see Figure 4-5).

Figure 4-5: Generating the final predictions for a dataset


4.6 Performance Evaluation of the Meta-Classifier

As mentioned in Section 4.2, the dataset of the analysis consists of transactions with Falcon

scores greater than or equal to 900. In the meta-classifier method, the meta-classifier assigns a

fraudulent or legitimate classification to each transaction based on a probability output. If the

calculated probability is greater than or equal to 0.5 the transaction is considered fraudulent and

is flagged accordingly, while if the probability is less than 0.5 the transaction is considered

legitimate. In this thesis, the meta-classifier has three different ranking methods. The first

method is to rank transactions with meta-classifier probabilities greater than or equal to 0.5 by

Falcon scores, the second method is to rank transactions with meta-classifier probabilities greater

than or equal to 0.5 by transaction amounts, and the third method is to rank transactions by meta-

classifier probabilities then by Falcon scores (Falcon scores are used to break ties when instances

have the same meta-classifier probability).

To the best of our understanding, the FI investigation method uses two ranking methods: the

rank by Falcon method and the rank by transaction amount method. In the rank by Falcon

method it is assumed that the FI investigates transactions with the highest Falcon scores (all

transactions analyzed already have Falcon scores greater than or equal to 900), while for the rank

by transaction amount method it is assumed that the FI investigates transactions with the highest

transaction amounts given that the Falcon scores for those transactions are greater than or equal

to 900.


Two different evaluation techniques were used to analyze the performance of the meta-

classifier. The two evaluation methods are:

1. True Positive and False Negative Evaluation (TP and FN Evaluation)

2. Correctly Classified True Positive Evaluation (Correctly Classified TP Evaluation).

For each of the evaluations different FI and MC (Meta-Classifier) ranking methods were used.

The five ranking methods are:

1. FI: Rank by Falcon

2. FI: Rank by Transaction Amount

3. MC: Rank by Falcon with P > 0.5

4. MC: Rank by Transaction Amount with P > 0.5

5. MC: Rank by Meta-Classifier Probability then by Falcon

The purpose of ranking is to give priority to transactions that have the highest risk of being

fraudulent transactions. It is assumed that the FI investigates transactions in one of two ways:

either by investigating transactions with the highest Falcon scores first (FI: Rank by Falcon), or

by investigating transactions with high Falcon scores (greater than or equal to 900) that have the

highest transaction amounts first (FI: Rank by Transaction Amount). The meta-classifier method

proposes three ways of prioritizing investigations. The first way is to rank transactions by highest

Falcon scores and investigate transactions that have meta-classifier probabilities of 0.5 or greater

(MC: Rank by Falcon with P>0.5). The second way is to rank transactions with high Falcon

scores (greater than or equal to 900) by highest transaction amounts and investigate transactions

that have meta-classifier probabilities of 0.5 or greater (MC: Rank by Transaction Amount with

P>0.5). The third way is to rank transactions by highest meta-classifier probabilities and then by

Falcon scores and investigate transactions that are highest on this list (MC: Rank by Meta-Classifier Probability then by Falcon).

The ‘FI: Rank by Falcon’ is used by the FI method for both of the evaluations, while the ‘FI: Rank by Transaction Amount’ is used by the FI method for the TP and FN Evaluation. The ‘MC: Rank by Falcon with P>0.5’ and ‘MC: Rank by Transaction Amount with P>0.5’ are used by the meta-classifier method for the TP and FN Evaluation. The ‘MC: Rank by Meta-Classifier Probability then by Falcon’ is used by the meta-classifier method for the Correctly Classified TP Evaluation. Table 4-1 summarizes the pairing of the ranking and evaluation methods.

Table 4-1: Pairing of the ranking and evaluation methods

| Evaluations | FI Investigation: FI: Rank by Falcon | FI Investigation: FI: Rank by Transaction Amount | Meta-Classifier Investigation: MC: Rank by Falcon with P>0.5 | Meta-Classifier Investigation: MC: Rank by Transaction Amount with P>0.5 | Meta-Classifier Investigation: MC: Rank by Meta-Classifier P then by Falcon |
|---|---|---|---|---|---|
| TP and FN | ✓ | ✓ | ✓ | ✓ | ? |
| Correctly Classified TP | ✓ | ? | ? | ? | ✓ |

Due to the limited amount of time allowed with the credit card dataset not all ranking and evaluation combinations were conducted. The question marks in Table 4-1 represent ranking methods that should be looked at in the future for the respective evaluation methods. The following subsections discuss the ranking and evaluation methods used in this thesis in greater detail.


4.6.1 Ranking

FI Ranking without Meta-Classifier

For the ‘FI: Rank by Falcon’ method, the transactions are sorted from highest Falcon scores to

lowest Falcon scores with highest being 999 and lowest being 900. The ‘FI: Rank by Transaction

Amount’ method sorts transactions from highest transaction amount to lowest transaction

amount (these transactions also have Falcon scores greater than or equal to 900).

Meta-Classifier Ranking

Similar to the FI ranking methods, the meta-classifier’s ranking methods also sort transactions

from highest Falcon scores to lowest Falcon scores, and from highest transaction amounts to

lowest transaction amounts. However, the meta-classifier provides a probability score that is

used to further prioritize the transactions. For the ‘MC: Rank by Falcon with P>0.5’ method, the

meta-classifier prioritizes transactions that have meta-classifier probabilities of 0.5 or greater and

have the highest Falcon scores. For the ‘MC: Rank by Transaction Amount with P>0.5’ method,

the meta-classifier prioritizes transactions that have meta-classifier probabilities of 0.5 or greater

and have the highest transaction amounts (Falcon scores are 900 and above). A third ranking

method, the ‘MC: Rank by Meta-Classifier Probability then by Falcon’ method, is also

investigated. This method ranks transactions by their meta-classifier probability scores first, and then by the highest Falcon scores second.
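The five ranking methods can be expressed as simple sort orders. The Python sketch below is one reading of the descriptions above: it assumes that the two ‘with P>0.5’ methods drop transactions whose meta-classifier probability is below 0.5 before sorting, and the dictionary keys (`falcon`, `amount`, `prob`) and function names are hypothetical.

```python
def fi_rank_by_falcon(txns):
    return sorted(txns, key=lambda t: t["falcon"], reverse=True)


def fi_rank_by_amount(txns):
    return sorted(txns, key=lambda t: t["amount"], reverse=True)


def mc_rank_by_falcon(txns, threshold=0.5):
    # Keep only transactions flagged by the meta-classifier, then sort by Falcon score.
    flagged = [t for t in txns if t["prob"] >= threshold]
    return sorted(flagged, key=lambda t: t["falcon"], reverse=True)


def mc_rank_by_amount(txns, threshold=0.5):
    flagged = [t for t in txns if t["prob"] >= threshold]
    return sorted(flagged, key=lambda t: t["amount"], reverse=True)


def mc_rank_by_prob_then_falcon(txns):
    # Falcon score breaks ties between equal meta-classifier probabilities.
    return sorted(txns, key=lambda t: (t["prob"], t["falcon"]), reverse=True)


txns = [
    {"id": 1, "falcon": 990, "amount": 50.0, "prob": 0.3},
    {"id": 2, "falcon": 950, "amount": 1200.0, "prob": 0.9},
    {"id": 3, "falcon": 910, "amount": 300.0, "prob": 0.9},
    {"id": 4, "falcon": 970, "amount": 80.0, "prob": 0.6},
]
order = {
    "fi_falcon": [t["id"] for t in fi_rank_by_falcon(txns)],
    "fi_amount": [t["id"] for t in fi_rank_by_amount(txns)],
    "mc_falcon": [t["id"] for t in mc_rank_by_falcon(txns)],
    "mc_amount": [t["id"] for t in mc_rank_by_amount(txns)],
    "mc_prob": [t["id"] for t in mc_rank_by_prob_then_falcon(txns)],
}
```

In this toy data, transaction 1 has the highest Falcon score but a low meta-classifier probability, so it leads the FI ranking and is excluded from the two filtered meta-classifier rankings.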


4.6.2 Performance Evaluations

As mentioned at the beginning of Section 4.6, there are two evaluation methods: the TP and FN

Evaluation, and the Correctly Classified TP Evaluation. These evaluations are used to determine

how well the meta-classifier method performs in comparison to the FI method. In the True

Positive (TP) and False Negative (FN) Evaluation, the number of TP accounts, the number of FN

accounts, and the number of missed fraudulent accounts due to non-investigation were counted

for both the meta-classifier and the FI methods. A savings amount was given to each “caught”

fraudulent transaction (TP) and a cost was incurred for each “missed” fraudulent transaction (FN

+ non-investigated fraud accounts). By comparing the number of “caught” and “missed” for the

same number of investigated accounts per day for the meta-classifier and FI methods, it is

possible to determine which method can catch more fraudulent transactions.

The second performance evaluation, the Correctly Classified TP Evaluation, counts the

number of correctly classified transactions for the meta-classifier and FI methods. For this

evaluation, the FI method ranks transactions by Falcon scores and counts the number of correctly

classified fraudulent transactions. The meta-classifier method ranks transactions first by meta-

classifier probability then by highest Falcon scores and counts the number of correctly classified

fraudulent transactions. This evaluation method focuses solely on counting the number of caught

fraudulent transactions and also utilizes the meta-classifier probability as a ranking criterion to

improve prediction accuracy for the meta-classifier.

The two evaluation methods are discussed in greater detail in the following paragraphs.


True Positive (TP) and False Negative (FN) Evaluation

The number of caught fraudulent accounts and number of missed fraudulent accounts were

compared between the meta-classification and the FI methods in the TP and FN Evaluation.

Table 4-2 shows a confusion matrix and explains what a True Positive, False Positive, False

Negative, and True Negative represent in the credit card domain.

Table 4-2: Confusion matrix for the credit card domain

| | Actual Positive (fraudulent) | Actual Negative (legitimate) |
|---|---|---|
| Predicted Positive (fraudulent) | True Positive (Hit) | False Positive (False Alarm) |
| Predicted Negative (legitimate) | False Negative (Miss) | True Negative (Normal) |

In credit card fraud detection, a True Positive (TP) is when an account is predicted to be

fraudulent and the account is actually fraudulent. A TP represents a situation where fraud losses

can be prevented through investigation. A False Negative (FN) is when an account is predicted to

be legitimate but the account is actually fraudulent. FNs represent money lost due to fraud. A

False Positive (FP) is when an account is predicted to be fraudulent but the account is actually

legitimate. FPs require the use of investigation resources but incur no fraud losses. Finally, a

True Negative (TN) is when an account is predicted to be legitimate and the account is actually

legitimate. TNs incur no fraud losses and do not require investigation resources.

In the TP and FN evaluation method, FP accounts only require investigations and do not

result in monetary losses. For TN occurrences there is no need for investigations and no savings

or losses occur because the transactions are correctly labeled as legitimate. However, for TP and

FN accounts savings and losses do occur. For TPs the fraudulent account is considered “caught”


and therefore receives a savings value. All subsequent fraudulent transactions that are associated

with that credit card account on that day and on all following days are also considered caught

regardless of the classifier label and are removed from the dataset. If a FN occurs, the fraudulent

account is considered “missed” and therefore receives a loss value. All subsequent fraudulent

transactions associated with that account for that day are removed. On following days, if the

account is still labeled as a false negative, the account continues to incur a loss value. However,

if on the following days the classifier suggests that the account is to be investigated (TP or FP

label), the transaction would receive a savings value if the transaction is indeed fraudulent.

“Following days” refers to the 14-day period after the first day of investigation. This establishes

a fair testing scenario where each test day has 14 trailing days. This 14-day period is shifted from

October 1st to October 17th creating 17 unique test cases for the testing month. The first testing

period is October 1st to October 15th and the final testing period is from October 17th to October

31st. The rolling test scenario is summarized in Figure 4-6.

Figure 4-6: Rolling test scenarios for fraud prediction on data from the test month
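The rolling test scenario can be enumerated with a few lines of Python. The function name `rolling_test_cases` is illustrative; the dates and the 14 trailing days come from the description above.

```python
from datetime import date, timedelta


def rolling_test_cases(first_start, last_start, trailing_days=14):
    """Return (first day, last trailing day) pairs, shifting the start one day at a time."""
    cases = []
    start = first_start
    while start <= last_start:
        cases.append((start, start + timedelta(days=trailing_days)))
        start += timedelta(days=1)
    return cases


cases = rolling_test_cases(date(2009, 10, 1), date(2009, 10, 17))
```

October 17th is the last possible start date because its 14 trailing days end on October 31st, the final day of the testing month.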


The TP and FN Evaluation method uses the following ranking methods: ‘FI: Rank by

Falcon’, ‘FI: Rank by Transaction Amount’, ‘MC: Rank by Falcon with P>0.5’, and ‘MC: Rank

by Transaction Amount with P>0.5’. It is assumed that only a limited number of investigations can be conducted per day; daily limits of 200, 500, and 800 investigated accounts were chosen to show the effects on savings and losses when more accounts are investigated. For each test case the

transactions were sorted either by Falcon score (i.e. methods ‘FI: Rank by Falcon’ and ‘MC:

Rank by Falcon with P>0.5’) or by transaction amounts (i.e. methods ‘FI: Rank by Transaction

Amount’ and ‘MC: Rank by Transaction Amount with P>0.5’), the fraudulent transactions that

were caught previously were removed, and the number of TP accounts, FN accounts, and missed

fraudulent accounts due to non-investigation for each day were counted. It is hypothesized that

by utilizing the meta-classification ranking method, namely methods ‘MC: Rank by Falcon with

P>0.5’ and ‘MC: Rank by Transaction Amount with P>0.5’, fraud accounts are caught earlier

compared to the FI method, namely methods ‘FI: Rank by Falcon’ and ‘FI: Rank by Transaction

Amount’. By comparing the number of caught fraud accounts and the number of missed fraud

accounts, while varying the number of investigated accounts, a comparison in savings between

the meta-classifier method and the FI method was determined.

The purpose of applying a meta-learning strategy to transactions with high Falcon scores is to quickly

and accurately identify fraudulent accounts while minimizing the number of

fraudulent accounts that are missed. This evaluation model ranks accounts either by Falcon score

or by transaction amount and also examines the effect of gradually increasing the number of

investigated accounts.
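The four ranking methods differ only in the sort key and in whether the meta-classifier filter (P>0.5) is applied first. The following Python is a minimal sketch of that selection step under stated assumptions: the `Txn` record, its field names (in particular `mc_prob`), and the function `select_for_investigation` are hypothetical and do not come from the FI's system or from the software used in this work.

```python
from dataclasses import dataclass

@dataclass
class Txn:
    account: str
    falcon: int      # Falcon score (900 to 999 in this study)
    amount: float    # transaction amount in dollars
    mc_prob: float   # meta-classifier fraud probability (hypothetical field)

def select_for_investigation(txns, rank_by, limit, use_meta=False):
    """Return up to `limit` distinct accounts, highest-ranked first.

    rank_by  -- "falcon" or "amount"
    use_meta -- if True, keep only transactions with mc_prob > 0.5 (the MC methods)
    """
    pool = [t for t in txns if t.mc_prob > 0.5] if use_meta else list(txns)
    key = (lambda t: t.falcon) if rank_by == "falcon" else (lambda t: t.amount)
    chosen = []
    for t in sorted(pool, key=key, reverse=True):
        if len(chosen) >= limit:
            break
        if t.account not in chosen:
            chosen.append(t.account)
    return chosen

txns = [
    Txn("A", 990, 20.0, 0.30),
    Txn("B", 950, 900.0, 0.80),
    Txn("C", 930, 50.0, 0.70),
    Txn("D", 910, 400.0, 0.20),
]
fi_falcon = select_for_investigation(txns, "falcon", 2)                  # ['A', 'B']
mc_falcon = select_for_investigation(txns, "falcon", 2, use_meta=True)   # ['B', 'C']
fi_amount = select_for_investigation(txns, "amount", 2)                  # ['B', 'D']
mc_amount = select_for_investigation(txns, "amount", 2, use_meta=True)   # ['B', 'C']
```

In this toy data the filter removes accounts A and D, so the same investigation budget is spent on accounts the meta-classifier considers more likely to be fraudulent.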


Correctly Classified True Positive Evaluation

Rather than ranking by Falcon scores or transaction amounts, another method is to count the

number of correctly classified instances (True Positives) based on the ranking of the meta-

classifier probability scores. This evaluation focuses on improving the ‘MC: Rank by Falcon with

P>0.5’ method by adding a second ranking criterion based on the meta-classifier’s prediction

probability scores. The number of correctly classified transactions for the FI and meta-classifier

methods is then counted and the performance improvement of the meta-classifier is evaluated.

To compare the performance of the FI method versus the meta-classifier method based on

the Correctly Classified True Positive Evaluation, the testing month was divided into 31 subsets

each containing a day’s worth of transactions and with all previously caught fraudulent

transactions removed. In the ‘FI: Rank by Falcon’ method, the first 50, 100, 200, 300, 400, 500,

600, 700, and 800 highest Falcon ranked transactions were investigated for each day. For

the ‘MC: Rank by Meta-Classifier Probability then by Falcon’ method, the first 50, 100, 200,

300, 400, 500, 600, 700, and 800 transactions ranked by highest meta-classifier probability

and then by highest Falcon score were investigated. The final step involved averaging the

correctly classified transactions for each of the 31 days in the testing month to determine the

overall prediction accuracy for both the FI method and the meta-classifier method.
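The difference between the two rankings in this evaluation is a one-key sort versus a two-key sort. The sketch below shows the counting and averaging steps in Python on toy data; the dictionary keys (`falcon`, `mc_prob`, `fraud`) and function names are assumptions made for illustration, not the data schema used in this work.

```python
def top_n_true_positives(day_txns, n, key):
    """Count fraudulent transactions among the n highest-ranked transactions of one day."""
    ranked = sorted(day_txns, key=key, reverse=True)
    return sum(1 for t in ranked[:n] if t["fraud"])

def fi_key(t):
    # 'FI: Rank by Falcon': Falcon score only.
    return t["falcon"]

def mc_key(t):
    # 'MC: Rank by Meta-Classifier Probability then by Falcon':
    # probability first, Falcon score as the tie-breaker.
    return (t["mc_prob"], t["falcon"])

def average_true_positives(days, n, key):
    """Average the per-day true positive counts over all days in the test month."""
    return sum(top_n_true_positives(d, n, key) for d in days) / len(days)

day1 = [
    {"falcon": 990, "mc_prob": 0.2, "fraud": False},
    {"falcon": 960, "mc_prob": 0.9, "fraud": True},
    {"falcon": 940, "mc_prob": 0.9, "fraud": True},
    {"falcon": 920, "mc_prob": 0.1, "fraud": False},
]
day2 = [
    {"falcon": 980, "mc_prob": 0.7, "fraud": True},
    {"falcon": 970, "mc_prob": 0.4, "fraud": False},
    {"falcon": 905, "mc_prob": 0.8, "fraud": True},
]
fi_avg = average_true_positives([day1, day2], 2, fi_key)   # 1.0
mc_avg = average_true_positives([day1, day2], 2, mc_key)   # 2.0
```

In the thesis the same procedure is applied to 31 daily subsets with n ranging from 50 to 800; the toy data above uses two days and n = 2 only to keep the example readable.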

In summary, by calculating a diversity value, the optimal base classifiers can be determined.

Comparing the ROC areas of models that utilize different sets of training, validation, and testing

dataset sizes, the best dataset sizes to use can be selected. By identifying the number of caught

and missed fraudulent accounts for the FI method and meta-classifier method, the ability to catch


fraud earlier can be evaluated and a comparison of the performance of each method can be made.

Finally, by investigating accounts that have both a high Falcon score and a high meta-classifier

probability, a larger number of fraudulent accounts can be correctly identified. The

next chapter, Chapter 5, presents the results from these evaluations and discusses the findings

from the observed data. The diversity calculations to determine the best base algorithms for the

meta-classifier are presented. The selection of the training, validation, and testing dataset sizes

based on ROC areas is shown. Finally, comparisons between the Falcon and the meta-classifier

methods using the results from the two main analyses (the True Positive and False Negative

Evaluation, and the Correctly Classified True Positive Evaluation) are presented.


5 Results & Discussion

The main results of this work are presented in this chapter. In Section 5.1, the Falcon score

distribution in the credit card data is presented. The next section presents the diversity values for

different combinations of algorithms. Then, the best training, validation, and testing dataset sizes

are established based on ROC areas. Finally, the meta-classifier predictions are evaluated for

performance improvements using the following two methods:

1. True Positive and False Negative Evaluation

2. Correctly Classified True Positive Evaluation (ranking by meta-classifier prediction probability).

In the analysis of this thesis it is assumed that the FI method prioritizes transactions that have

a Falcon score of 900 or above either by highest Falcon scores or by highest transaction amounts.

It is understood that, after transactions are given a Falcon score, the FI’s in-house fraud

classification method is applied to the Falcon scored dataset. However, we do not know the

details of how this system operates nor do we know what methods this system uses to rank

transactions; therefore, an assumption about how transactions are ranked was made. This thesis

presents the comparisons made between the assumed FI method and the meta-classifier method.

The motivation for implementing the meta-classifier system is that the majority of

transactions with a high Falcon score are in fact legitimate. By using a meta-classifier we hope to

further classify high Falcon scored transactions as being either legitimate or fraudulent. The first

experiment involved calculating a diversity value for different combinations of algorithms to

determine the optimal base classifiers. The C4.5 algorithm, Naïve Bayesian algorithm, and the

kNN algorithm were chosen as the three base classifiers based on their diversity values. The next

set of experiments determined the dataset sizes for training, validation, and testing by comparing


ROC areas. It was found that the optimal dataset sizes for training, validating, and testing are 8,

2, and 1 month(s) respectively. In the final experiment, the meta-classifier produced a prediction

for each transaction in the test dataset. Two evaluation methods were applied to both the meta-

classifier’s predictions and the FI’s predictions to determine the best fraud detection method. In

the first evaluation method, the TP and FN Evaluation, the meta-classifier method was able to

catch fraudulent accounts more quickly and more accurately than the FI method. Finally, the

second evaluation method, the Correctly Classified TP Evaluation, showed that ranking by the

meta-classifier probability first results in the greatest fraud detection improvement over the FI

method.

5.1 Falcon Score Distribution

As briefly mentioned in Section 1.1, there is an exponential increase in the number of fraudulent

transactions as Falcon scores increase. Analysis of the credit card data obtained for this thesis

shows that there are 4 times more fraudulent transactions in the Falcon score range of 991-999

than in the 900-910 range (see Figure 5-1). This suggests that the Falcon score works well at

identifying fraudulent transactions, and confirms that the higher a Falcon score is, the higher the

probability a transaction is fraudulent. Furthermore, this suggests that fraud investigations should

give priority to transactions with the highest Falcon scores and investigate transactions based on

a Falcon score ranking.

[Figure 5-1 chart, "Falcon Score Distribution ‐ % Fraud": x-axis Falcon Score; y-axis Percentage Fraud (%), 0 to 25]

Figure 5-1: Falcon score distribution for fraudulent credit card transactions

However, as shown in Figure 5-2, the Falcon scoring metric also gives high Falcon scores to
legitimate transactions. The percentage of legitimate transactions with high Falcon scores are
95% and 80% for the Falcon score ranges of 900 to 910 and 991 to 999 respectively. On average,
for transactions with Falcon scores greater than or equal to 900, only 10% of transactions are
fraudulent and 90% are actually legitimate. Even though more fraudulent transactions are
identified as the Falcon score increases, the vast majority of transactions with Falcon scores
greater than or equal to 900 are legitimate.

[Figure 5-2 chart, "Falcon Score Distribution ‐ % Fraud vs % Legit": x-axis Falcon Score; y-axis Percentage (%), 0 to 100; series % Fraud and % Legit]

Figure 5-2: Falcon score distribution for legitimate and fraudulent credit card transactions

By applying a meta-classifier to high Falcon scored transactions we aim to determine whether to
flag these transactions as fraudulent and investigate, or consider them to be legitimate and do
nothing.

5.2 Base Algorithm Selection

As described in Section 4.5, the meta-classifier is constructed using the predictions from
different base classifiers. Algorithms perform differently depending on the data involved in the
analysis, therefore it is necessary to determine the best algorithms to use for credit card fraud
detection. In this section the calculated diversity values for different combinations of algorithms
are presented and the optimal base algorithms are selected.

The base algorithms with the highest diversity values were selected as the algorithms to
construct the base classifiers in this experiment. Table 5-1 shows the number of classifiers and
the diversity values for different combinations of classifiers.

Table 5-1: Diversity Values for different classifier combinations

# of Classifiers   Classifiers                                           Diversity Value
2                  k-nearest neighbour (kNN) & Naïve Bayesian (NB)       0.368051
2                  Decision Tree (DT) & NB                               0.400208
2                  DT & kNN                                              0.091721
3                  DT, kNN, NB                                           0.394858
3                  DT, kNN, Bayesian Belief Network (BBN)                0.281256
4                  DT, NB, kNN & Support Vector Machines (SVM)           0.389205
4                  DT, NB, kNN & Neural network (NN)                     0.370881
5                  DT, NB, kNN, SVM & NN                                 0.33016
6                  DT, NB, kNN, SVM, NN & Logistic Regression            0.308171
7                  DT, NB, kNN, SVM, NN, Logistic Regression, & BBN      0.348375

The diversity value does not necessarily increase as the number of classifiers increases, as seen in

Table 5-1. The combination with three classifiers – Decision Tree (DT), Naïve Bayesian (NB),

and k-Nearest Neighbour (kNN) – was chosen as the base classifier combination for this thesis.

This combination was chosen because it maintained a high diversity value while utilizing more

classifiers.

The two combinations with the highest diversity values were Decision Tree with Naïve

Bayesian, and Decision Tree with Naïve Bayesian and k-Nearest Neighbour. However, the

combination with more classifiers was chosen because each learning algorithm covers a region

of tasks favoured by its bias (Vilalta and Drissi 2002), therefore by choosing 3 classifiers, more

of the region under study can be covered. Of interest in Table 5-1 are the diversity values for the


cases with three base classifiers, where the only difference was the utilization of the Bayesian

method. It was found that the diversity value was considerably higher when the Naïve Bayesian

classifier was used (0.394858) instead of the Bayesian Belief Network (0.281256). This further supports the hypothesis

that weak algorithms can become powerful when they are combined.

5.3 Training, Validation, and Testing Dataset Selection

As mentioned in Section 4.4, the ROC area was used to select the dataset sizes for training,

validation, and testing. As can be seen in Figure 5-3, the training dataset size was tested using 5,

6, 7, and 8 months of data, while keeping the validation time at 2 months and the testing at 1

month for each test instance.

Training Dataset Size   Validation Size   Testing Size   ROC Area
5 months                2 months          1 month        0.836




Figure 5-3: ROC Areas for different training dataset sizes

As summarized in Figure 5-4, the validation dataset size was tested using 1, 2, and 3

months of data, while holding the training dataset and testing dataset constant at 7 months and 1

month respectively.

Training Dataset Size   Validation Size   Testing Size   ROC Area
7 months                1 month           1 month        0.838
7 months                2 months          1 month        0.841
7 months                3 months          1 month        0.841

Figure 5-4: ROC Areas for different validation dataset sizes


Finally, the testing dataset size was tested using 1, 2, and 3 months of data, while holding

the training and validation datasets constant at 5 months and 2 months respectively as seen in

Figure 5-5.

Training Dataset Size   Validation Size   Testing Size   ROC Area
5 months                2 months          1 month        0.836
5 months                2 months          2 months       0.828
5 months                2 months          3 months       0.819

Figure 5-5: ROC Areas for different testing dataset sizes

The results show that the model with the highest prediction accuracy, the largest ROC

area, is the model where 8 months of data were used for training (DEC08 – JUL09), 2 months for

validating (AUG09-SEP09), and 1 month for testing (OCT09). This arrangement was used to

compute the final meta-classifier model and predictions. The results from the ROC analysis also

suggest that the meta-classification method is a very robust method that is accurate under

different dataset sizes. Since the ROC areas are not significantly different, the smallest dataset

sizes of 5 months of training, 1 month of validation, and 1 month of testing can also be used to

reduce the training time of the meta-classifier.
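For readers unfamiliar with the ROC area, it can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The Python below is a minimal sketch of that pairwise definition, followed by the selection of the configuration with the largest area using the values reported in Figure 5-5. It is not the tool used to produce the reported values; the function name `roc_area` and the toy labels and scores are assumptions.

```python
def roc_area(labels, scores):
    """ROC area as the fraction of (fraud, legitimate) pairs ranked correctly.

    labels -- 1 for fraud, 0 for legitimate
    scores -- classifier scores, higher meaning more likely fraud
    Ties between a fraud and a legitimate score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (values are illustrative, not thesis data).
area = roc_area([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.7])   # 0.75

# Choosing the dataset sizes with the largest ROC area, using Figure 5-5.
# Keys are (training, validation, testing) months.
figure_5_5 = {(5, 2, 1): 0.836, (5, 2, 2): 0.828, (5, 2, 3): 0.819}
best = max(figure_5_5, key=figure_5_5.get)   # (5, 2, 1)
```

The pairwise loop is quadratic in the number of transactions, so it is suitable only for small illustrations; rank-based formulations give the same value more efficiently.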

5.4 Meta-Classifier Performance Evaluation

Three algorithms – Decision Tree, Naïve Bayesian, and k-Nearest Neighbour algorithms (as per

Section 5.2) – were selected to train the three base classifiers using the first 8 months of data

(DEC08 to JUL09, the optimal training months identified in Section 5.3).

The three base classifiers were then applied to the 2 months of validation data to produce base

classifier predictions. As mentioned in Section 4.5.2, the Naïve Bayesian algorithm was selected

to train the meta-classifier. The Naïve Bayesian algorithm was applied to the validation data and

the base classifier predictions to produce the meta-classifier. The final month in the dataset was

used to test the meta-classifier on data that it did not train on. This was accomplished by first

applying the three base classifiers on the testing data to produce a new set of base classifier

predictions. These predictions along with the testing data were used as inputs to the meta-

classifier to output a prediction for each transaction.
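The flow described above (base classifiers produce predictions on validation data, and a naïve Bayesian meta-level model is trained on those predictions) can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used in this work: the three base classifiers are fixed stand-in rules rather than trained decision tree, naïve Bayesian, and kNN models, the attribute `foreign` is hypothetical, and the meta-level model here uses only the base predictions, whereas the thesis also supplies the original attributes.

```python
import math

# Stand-in base classifiers: each maps a transaction to 0 (legitimate) or 1 (fraud).
BASE_CLASSIFIERS = [
    lambda t: int(t["falcon"] >= 950),
    lambda t: int(t["amount"] >= 500),
    lambda t: int(t["foreign"]),
]

def base_predictions(txn):
    return tuple(clf(txn) for clf in BASE_CLASSIFIERS)

def train_meta(validation, labels):
    """Fit a naive Bayes model over base predictions, with Laplace smoothing."""
    model = {}
    for c in (0, 1):
        rows = [base_predictions(t) for t, y in zip(validation, labels) if y == c]
        prior = (len(rows) + 1) / (len(labels) + 2)
        # likes[j] estimates P(base classifier j predicts fraud | true class c)
        likes = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(BASE_CLASSIFIERS))]
        model[c] = (prior, likes)
    return model

def meta_probability(model, txn):
    """Return the meta-level estimate of the probability that txn is fraudulent."""
    preds = base_predictions(txn)
    log_post = {}
    for c, (prior, likes) in model.items():
        lp = math.log(prior)
        for p, like in zip(preds, likes):
            lp += math.log(like if p == 1 else 1.0 - like)
        log_post[c] = lp
    top = max(log_post.values())
    weights = {c: math.exp(v - top) for c, v in log_post.items()}
    return weights[1] / (weights[0] + weights[1])

# Toy validation set (illustrative values only).
validation = [
    {"falcon": 990, "amount": 800, "foreign": True},
    {"falcon": 960, "amount": 50, "foreign": True},
    {"falcon": 955, "amount": 20, "foreign": False},
    {"falcon": 910, "amount": 600, "foreign": False},
    {"falcon": 905, "amount": 30, "foreign": False},
]
labels = [1, 1, 0, 0, 0]
model = train_meta(validation, labels)

# "Testing" transactions the meta-level model has not seen.
p_high = meta_probability(model, {"falcon": 970, "amount": 700, "foreign": True})
p_low = meta_probability(model, {"falcon": 902, "amount": 10, "foreign": False})
```

With this toy data, `p_high` is above 0.5 and `p_low` is below 0.5, so only the first transaction would pass the P>0.5 filter used in the ranking methods.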

5.4.1 Evaluating the Meta-Classifier: True Positive and False Negative Evaluation

There is potential in the meta-classifier method to catch fraudulent accounts earlier than the FI

method. For example, the FI method might identify a fraudulent account after 5

transactions, while the meta-classifier method might identify the same fraudulent account after

only 2 transactions. To quantify this difference in performance, an evaluation method was applied

to determine whether the meta-classifier could catch fraudulent transactions earlier than the FI

method. This evaluation analyzed the number of “caught” fraudulent accounts (True Positives,

TPs) and the number of “missed” fraudulent accounts (False Negatives, FNs, plus fraud accounts

that were not investigated) on a per-day basis. A “savings” amount of $356 was given to each

caught account and a “loss” amount of $356 was given to each missed account ($356 is an

estimate of the value for a fraudulent account in the testing month of October 2009). The savings

per day was calculated by taking the difference between the amounts saved through caught

accounts and the amounts lost through missed accounts.
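The per-day accounting can be illustrated with a short Python sketch. It is a simplification under stated assumptions: accounts are represented as sets of identifiers, an account caught on one day is excluded from all later days, and an account that is fraudulent but not caught counts as missed on each day it appears. The function names are hypothetical; the final lines reproduce the 500-investigation row of Table 5-2 from its caught and missed counts.

```python
VALUE_PER_ACCOUNT = 356   # dollars, estimated value of a fraudulent account (October 2009)

def evaluate_window(fraud_by_day, investigated_by_day):
    """Return a (caught, missed) count for each day of a test window.

    fraud_by_day[d]        -- set of truly fraudulent accounts with activity on day d
    investigated_by_day[d] -- set of accounts selected for investigation on day d
    """
    caught_so_far = set()
    results = []
    for fraud, investigated in zip(fraud_by_day, investigated_by_day):
        active = fraud - caught_so_far    # accounts caught earlier are removed
        caught = active & investigated    # true positives
        missed = active - investigated    # false negatives and non-investigated fraud
        caught_so_far |= caught
        results.append((len(caught), len(missed)))
    return results

def savings_per_day(caught, missed, value=VALUE_PER_ACCOUNT):
    """Savings = amount saved through caught accounts minus amount lost through missed ones."""
    return (caught - missed) * value

# Toy three-day window (illustrative account identifiers).
daily = evaluate_window(
    [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}],
    [{"A", "X"}, {"C"}, {"B"}],
)
# daily == [(1, 1), (1, 1), (1, 0)]

# 500 investigated accounts, ranked by Falcon score (Table 5-2).
fi_savings = savings_per_day(73, 56)     # 6052
mc_savings = savings_per_day(87, 42)     # 16020
improvement = mc_savings - fi_savings    # 9968
```

Account B in the toy window shows why catching an account earlier matters: it is counted as missed on the first two days before it is finally caught on the third.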

Table 5-2 compares the number of caught and missed accounts for the FI and meta-classifier using

the ‘FI: Rank by Falcon’ and ‘MC: Rank by Falcon with P>0.5’ ranking methods for varying

numbers of investigated accounts. Table 5-3 compares the number of caught and missed accounts for the FI

and meta-classifier using the ‘FI: Rank by Transaction Amount’ and ‘MC: Rank by Transaction

Amount with P>0.5’ ranking methods for varying numbers of investigated accounts.


Table 5-2: Comparison between the meta-learner and FI method based on the number of TP and FN with the dataset ranked by Falcon score

Caught = Avg # of caught fraud accts (TP). Missed = Avg # of missed fraud accounts (fraud accts not investigated + FN).

# of Accts      FI: Rank by Falcon    MC: Rank by Falcon with P>0.5    Savings Per Day ($)              Meta-Classifier Improvement
Investigated    Caught     Missed     Caught     Missed                FI Method    Meta-classifier     (%)      ($)
in a day                                                                            Method
200             47         82         55         74                    -12,460      -6,764              46%      $5,696
500             73         56         87         42                    6,052        16,020              165%     $9,968
800             96         32         115        13                    22,784       36,312              59%      $13,528

Table 5-3: Comparison between the meta-learner and FI method based on the number of TPs and FNs with the dataset ranked by transaction amount

Caught = Avg # of caught fraud accts (TP). Missed = Avg # of missed fraud accts (fraud accts not investigated + FN).

# of Accts      FI: Rank by           MC: Rank by Transaction          Savings Per Day ($)              Meta-Classifier Improvement
Investigated    Transaction Amount    Amount with P>0.5
in a day        Caught     Missed     Caught     Missed                FI Method    Meta-classifier     (%)      ($)
                                                                                    Method
200             7          126        29         104                   -42,364      -26,700             37%      $15,664
500             25         108        81         53                    -29,548      9,968               134%     $39,516
800             57         73         114        19                    -5,696       33,820              694%     $39,516


As shown in Table 5-2 and Table 5-3, it was found that the meta-classifier significantly

outperformed the FI method by having more caught fraudulent accounts while maintaining a

lower number of missed fraudulent accounts for the same number of investigations. This implies

that the meta-classifier is able to catch fraudulent accounts earlier and is able to catch more

fraudulent accounts. No more than 800 accounts are investigated because, on average, the meta-classifier

assigns a probability greater than 0.5 to only 700 to 800 accounts. These are the accounts that the

meta-classifier method investigates; the remaining accounts are not investigated because their

meta-classifier probabilities are below 0.5.

By comparing Table 5-2 and Table 5-3 it can be seen that for the same number of

investigations both the ‘FI: Rank by Falcon’ and ‘MC: Rank by Falcon with P>0.5’ methods

outperform the ‘FI: Rank by Transaction Amount’ and ‘MC: Rank by Transaction Amount with

P>0.5’ respectively, resulting in larger savings for the ‘Rank by Falcon’ methods. This suggests

that the neural network classifier used in computing the Falcon scores provides valuable fraud

prediction information for credit card transactions. However, the results also show that by

implementing a meta-learning strategy on top of a neural network filtered dataset, larger savings

are obtained. These results show that the meta-classifier is able to catch fraudulent accounts

earlier and outperforms the FI method in all scenarios. When only 200 accounts are investigated

both methods result in monetary losses (negative savings) for a given day, but the meta-classifier

method still outperforms the FI method. It should be noted that the ‘FI: Rank by Transaction

Amount’ method for 500 and 800 investigations (Table 5-3) resulted in negative savings, while

for the same number of investigations the ‘MC: Rank by Transaction Amount with P>0.5’ method

resulted in positive savings. This suggests that the meta-classifier method is more

robust and is able to accurately predict fraudulent transactions over a wider range of scenarios.


As shown in Table 5-2, for 500 investigated accounts (see footnote 3), the implementation of the meta-classifier

method can result in an additional $9,968 per day in savings compared to the FI method.

Assuming there are 260 working days in a year, the meta-classifier method has the potential to

save an additional $2.59 million per year.

5.4.2 Evaluating the Meta-Classifier: Correctly Classified TP Evaluation

As mentioned in Section 4.6, the meta-classifier provides a probability score for each transaction.

The Correctly Classified TP Evaluation uses this probability score to determine which

transactions should be investigated first. By giving priority to transactions that have the highest

meta-classifier probability scores, there is the potential to catch more fraudulent transactions and

at an earlier time compared to using only a Falcon ranked method for investigations (i.e. the FI

method). Table 5-4 compares the average number of correctly classified fraudulent transactions

for the ‘FI: Rank by Falcon’ versus the meta-classifier’s ‘MC: Rank by Meta-Classifier

Probability then by Falcon’ ranking methods for the testing month of October 2009.

3 It is assumed that the bank can only investigate 500 accounts per day


Table 5-4: Correctly classified fraudulent transactions for the meta-classifier and Falcon methods

Average # of Correctly Classified Fraudulent Accounts

# of Investigated    FI: Rank by Falcon    MC: Rank by Meta-Classifier    Difference in Correctly Classified
Transactions                               Probability then by Falcon     Fraudulent Accounts between FI and MC
50                   17                    21                             4
100                  27                    33                             6
200                  40                    50                             10
300                  52                    63                             11
400                  59                    75                             16
500                  66                    85                             19
600                  72                    93                             21
700                  77                    103                            25
800                  83                    111                            28

Table 5-4 shows that the meta-classifier method consistently identifies more

fraudulent transactions correctly in all cases. The greatest improvement achieved by

the meta-classifier in this evaluation method was when 800 transactions were investigated, and

this suggests that the meta-classifier method continues to provide value as the number of

investigated accounts increases. The differences in correctly classified fraudulent

transactions between the meta-classifier method and Falcon method are also greater in this

evaluation method than in the TP and FN Evaluation method. This suggests that by prioritizing

transactions with a meta-classifier probability score, more fraudulent transactions can be

identified. Figure 5-6 shows the number of correctly classified fraudulent transactions for both

the meta-classifier and the FI methods and the percentage improvement the meta-classifier

provides to the FI method.


Figure 5-6: The number and percentage improvement of correctly classified fraudulent transactions the meta-classifier provides to the FI method

Figure 5-6 shows that the meta-classifier is able to provide a 20% to 34% improvement upon the

currently implemented FI investigation method. The total transaction amount for the 11,317

fraudulent transactions in the testing month of October 2009 was estimated to be $4,035,556

based on the original dataset provided by the FI. By dividing the total fraudulent transaction

amounts by the number of fraudulent transactions, the average fraud cost was calculated to be

approximately $356 ($4,035,556 divided by 11,317 fraud transactions). Utilizing this average

cost and assuming that for each day only 500 accounts can be investigated and that there are 260

working days in a year, $7,000 per day (see footnote 4) or about $1.82 million per year can be additionally saved

compared to the FI method by implementing the meta-classifier method.

4 $7000 is from multiplying the difference in correctly classified fraudulent accounts between the FI and MC methods for 500 investigated accounts with the average transaction amount of a fraud account ($356 multiplied by 19 is approximately $7000)
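The arithmetic behind these figures can be checked directly. The Python below is a worked calculation using only numbers stated in the text; the variable names are illustrative. It also shows the effect of rounding the daily figure to $7,000 before annualizing.

```python
TOTAL_FRAUD_AMOUNT = 4_035_556   # dollars, fraudulent transactions in October 2009
FRAUD_TRANSACTIONS = 11_317      # number of fraudulent transactions in October 2009
WORKING_DAYS = 260               # working days per year, as assumed in the text

average_cost = TOTAL_FRAUD_AMOUNT / FRAUD_TRANSACTIONS

# 19 additional correctly classified accounts at 500 investigations (Table 5-4),
# valued at the $356 average used throughout the evaluation.
daily_extra = 19 * 356
annual_from_rounded = 7000 * WORKING_DAYS           # daily figure rounded to $7,000 first
annual_from_unrounded = daily_extra * WORKING_DAYS  # no intermediate rounding

print(round(average_cost, 2))    # 356.59
print(daily_extra)               # 6764
print(annual_from_rounded)       # 1820000
print(annual_from_unrounded)     # 1758640
```

The quotient is about $356.59, which the text reports as approximately $356. The reported $1.82 million per year follows from the rounded daily figure; without the intermediate rounding the same inputs give about $1.76 million.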

[Figure 5-6 chart, "Number and Percentage Improvement of Correctly Classified Fraud Transactions": x-axis # of Investigated Transactions (50, 100, 200, 300, 400, 500, 600, 700, 800); y-axis # of Correctly Classified fraud transactions (0 to 120); series Meta-Classifier and FI; percentage improvement labels 23%, 20%, 23%, 22%, 27%, 28%, 29%, 32%, 34%]

Both performance evaluation methods show that the meta-classifier provided quantifiable

improvements to the assumed FI method. The True Positive (TP) and False Negative (FN)

Evaluation successfully showed that the meta-classifier is able to catch more fraudulent accounts

while maintaining a lower number of missed fraudulent accounts compared to the FI method of

investigation. This method indicated that the meta-classifier can catch more fraudulent accounts

and at an earlier time. For 500 investigated accounts, the meta-classifier provided approximately

$9,968 in savings per day or $2.6 million per year. This evaluation method also found that the

prediction performance is slightly lower when accounts are investigated based on the ‘Rank by

Transaction Amounts’ methods.

The Correctly Classified TP Evaluation was conducted to further investigate the differences

in caught fraudulent accounts (TP rates) between the meta-classifier method and the FI method.

By looking at only the Falcon scores and meta-classifier probabilities the optimal investigation

scenario was determined. The meta-classifier’s ranking method in this evaluation gives priority

to transactions with the highest meta-classifier probability first then by highest Falcon score. The

FI’s ranking method in this evaluation gives priority to transactions with the highest Falcon

scores. This evaluation method resulted in the largest improvements in the number of correctly

classified fraudulent accounts for the meta-classifier. For 500 investigated accounts, the meta-

classifier provided 19 more correctly classified fraudulent accounts which equates to $7000 of

savings per day or $1.8 million per year.


6 Conclusion and Future Work

The findings in this work highlight the fraud detection improvement that a meta-learning strategy

can provide when it is used in conjunction with an established neural network fraud detection

system. The meta-classifier constructed from the meta-learning strategy outperformed the FI

method by providing approximately $2.6 million in additional savings per year when 500

accounts are investigated. Furthermore, for the same number of investigated accounts, the meta-

classifier can correctly identify a larger number of fraudulent accounts compared to the FI

method, and as a result the meta-classifier method has the ability to identify fraudulent accounts

at an earlier time.

6.1 Meta-Classifier Probabilities and Falcon Scores

This thesis was successful in identifying the savings improvement a meta-classifier can

achieve when implemented sequentially following a neural-network based system (the financial

institution’s Falcon score based fraud detection system). It was found that the Falcon score

attribute is an essential credit card fraud scoring metric. By utilizing the Falcon score with the

meta-classifier’s probability score, large improvements in the identification of fraudulent

transactions were observed. As shown in the results from the TP and FN Evaluation, the ‘MC:

Rank by Falcon with Probability > 0.5’ method consistently outperformed the ‘MC: Rank by

Transaction Amount with Probability > 0.5’ method. These results show that the Falcon score

attribute should take precedence over the amount of a transaction when training a credit card

fraud classifier with instances that have Falcon scores greater than or equal to 900. Furthermore, when

the meta-classifier probability is used in conjunction with the Falcon score as a ranking method

(‘MC: Rank by Meta-Classifier Probability then by Falcon’), large improvements in the average

number of correctly classified fraudulent transactions were observed as shown in the Correctly


Classified TP Evaluation. Out of the two evaluations presented in this thesis, the meta-classifier

method showed the largest improvement over the FI’s method in the Correctly Classified TP

Evaluation.

6.2 Improving the Meta-Classifier

There were attributes in the data preparation and pre-processing stages that were discarded

due to the sheer number of unique instances for the attributes. To address this problem, more

insight into the meaning of the attributes needs to be found to be able to categorize the attribute

into reasonable numbers of classes. There were also attributes extracted from the FI’s database

that had no significant value due to large numbers of repeated or null values. Each attribute

removed from the dataset means less historical data for the base classifiers and meta-

classifier to train upon, and therefore may decrease the performance of the meta-classifier’s

predictions. The attributes selected for training were chosen based on preference and intuition. A

worthwhile experiment would be to use attribute selection metrics to determine the best

attributes to train with.

This work found that the best base classifier algorithms (selected from 7 commonly used

fraud detection algorithms) to use in the credit card data domain are the C4.5 decision tree

algorithm, the Naïve Bayesian algorithm, and the k-nearest neighbour algorithm. Many choices

in learning algorithms are available for selection as base classifier algorithms, and by choosing

alternative combinations of algorithms an even stronger meta-classifier could be developed.

Case-based Reasoning (CBR) is an excellent technique that should be looked at as an alternative

base classifier algorithm in future studies. CBR is an instance based reasoning technique that is

computationally intensive, which may be a reason why it has not commonly been used in the

past for credit card fraud detection. It has been reported in the literature that CBR has many


advantages over rule-based reasoning methods such as the decision tree algorithm. Furthermore,

instead of using an entropy-based metric such as the diversity calculation, different selection

metrics can be experimented with in the determination of the optimal base algorithms. In terms

of the meta-classifier algorithm, the literature reports that the Naïve Bayesian algorithm provides the best

performance in credit card fraud detection. However, it would be beneficial to test different

algorithms for use as the meta-classifier algorithm on newer datasets.

To further improve the training process of the meta-classifier, the training dataset should

be enlarged to further increase the number of unique fraudulent transactions in the training

process. However, with an increased number of transactions, new ROC calculations will be

needed to determine new optimal training, validation, and testing dataset sizes. With a larger

dataset size the optimal distribution of the transactions (based on ROC calculations) for the

training, validation, and testing datasets may be completely different than the one used in this

thesis, and similarly, the base algorithms not selected in this thesis may perform differently under

these conditions. Finally, to evaluate the potential savings of the meta-classifier on a less biased

basis a new dataset should be collected. This dataset should consist of transactions that have a

Falcon score but have not gone through the FI’s in-house fraud classification method. This way

the meta-classifier can choose transactions it believes are fraudulent while the FI’s method can

choose its own set of fraudulent transactions, and a comparison can be accurately made to

determine the investigation method that provides the best performance in identifying correctly

classified fraudulent transactions. To improve the evaluation methods, the window size for the

“following days” scenario should be increased to allow transactions to be tracked for a longer

period of time; this would result in a better representation of the potential savings a caught

transaction can provide.


One major obstacle encountered in this work was the need to combine the credit card

transaction data with the fraud classification data of each transaction. Two data files were

obtained from the FI: one contained the 11 months of unclassified credit card transaction data,

while the other contained transactions that were classified as fraudulent. For any supervised

learning method to work, the correct classification of each instance in the training dataset

must be known. Therefore, a C program was written to match the known fraudulent

transactions in one data file to the credit card transaction data from the other data file. We

believe that it is valuable to continuously combine the correct classification of each historical

transaction with its corresponding instance in the credit card transaction database. This not only

gives fraud analysts insight into the detection of fraudulent transactions but also makes the

data easier to access as training data for fraud detection algorithms.
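
The matching step described above can be sketched as follows. The C program written for this thesis is not reproduced here; this is a minimal Python illustration under stated assumptions. The matching key (card number, transaction date, transaction time) is an assumption, the field names are borrowed from Table B1, and the rows are made up.

```python
import csv
import io

def label_transactions(transactions, fraud_records, key_fields):
    """Attach a 'fraud' label ('Y' or 'N') to each transaction row.

    A transaction is labelled 'Y' when its key (the tuple of key_fields)
    also appears in the fraud file; otherwise it is labelled 'N'.
    """
    fraud_keys = {tuple(row[f] for f in key_fields) for row in fraud_records}
    labelled = []
    for row in transactions:
        key = tuple(row[f] for f in key_fields)
        labelled.append({**row, "fraud": "Y" if key in fraud_keys else "N"})
    return labelled

# Made-up inputs standing in for the two data files received from the FI.
transactions_csv = """card_no,txn_date,txn_time,txn_amt
1111,6/30/2009,21:31:56,64
1111,7/01/2009,09:15:00,120
2222,7/01/2009,10:02:10,15
"""
fraud_csv = """card_no,txn_date,txn_time
1111,7/01/2009,09:15:00
"""

transactions = list(csv.DictReader(io.StringIO(transactions_csv)))
fraud_records = list(csv.DictReader(io.StringIO(fraud_csv)))

# Hypothetical matching key; the key actually used in the thesis is not stated.
labelled = label_transactions(
    transactions, fraud_records, key_fields=("card_no", "txn_date", "txn_time")
)
for row in labelled:
    print(row["card_no"], row["txn_date"], row["txn_time"], row["fraud"])
```

Building the set of fraud keys once keeps each lookup constant-time, which matters when the transaction file holds many months of data.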

6.3 Implementing the Meta-Classifier

To implement the meta-classifier in a real-world scenario, the training, validation, and

testing datasets need to be continually updated. In this thesis 8 months were used for training

(DEC08 – JUL09), 2 months were used for validation (AUG09 – SEP09), and 1 month was used

for testing (OCT09). Experiments should be conducted to determine whether the meta-classifier

should be updated with new data on a weekly, bi-monthly, or monthly schedule.
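
One way to picture the continual update is a sliding window over the monthly data. The sketch below is only an illustration: it assumes a monthly update step and keeps the 8/2/1 split used in this thesis, and the month NOV09 is hypothetical (it is not part of the dataset analyzed).

```python
def rolling_split(months, n_train=8, n_valid=2, n_test=1):
    """Yield (train, valid, test) month lists, sliding forward one month at a time."""
    window = n_train + n_valid + n_test
    for start in range(len(months) - window + 1):
        block = months[start:start + window]
        yield (
            block[:n_train],
            block[n_train:n_train + n_valid],
            block[n_train + n_valid:],
        )

months = ["DEC08", "JAN09", "FEB09", "MAR09", "APR09", "MAY09", "JUN09",
          "JUL09", "AUG09", "SEP09", "OCT09",
          "NOV09"]  # NOV09 is a hypothetical extra month for illustration

splits = list(rolling_split(months))
for train, valid, test in splits:
    print(train[0], "-", train[-1], "|", valid[0], "-", valid[-1], "|", test[0])
```

The first window reproduces the split used in this thesis; each later window drops the oldest month and adds the newest one. A weekly or bi-monthly schedule would follow the same pattern with a different step size.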

The benefit of the current meta-classifier’s design is its ability to operate in parallel to the

FI’s method. Only historical attribute data and the correct classification of each transaction

determined by a fraud analyst are required to produce the meta-classifier’s fraud predictions. To

compare the performance of the meta-classifier and FI’s fraud detection method, two groups of

fraud analysts should be used for investigations. The first group would investigate transactions

that the meta-classifier labels as fraudulent while the second group would investigate


transactions that the FI believes are fraudulent. By tracking the number of correctly classified

fraudulent transactions in each group for each day, a realistic real-world performance

comparison can be made.

In summary, the work in this thesis provides an update on the effectiveness of the

meta-learning strategy on credit card fraud detection. The main thrust of this work was to use a

multi-algorithm classifier to improve the prediction performance of a Falcon-based fraud

detection system. In particular, this work looked at the performance improvements a

meta-classifier can provide to a neural-network filtered dataset. Based on the ROC area

calculations it was found that the optimal training, validation, and testing dataset sizes for the

11 months of data analyzed were 8 months, 2 months, and 1 month respectively. The diversity

metric showed that the optimal base algorithms to train the meta-classifier were the naïve

Bayesian, decision tree, and k-nearest neighbour algorithms.

The two performance evaluation methods were able to show that the meta-classifier method

does indeed outperform the FI’s method. The True Positive and False Negative Evaluation

successfully showed that the meta-classifier method can consistently identify more correctly

classified fraudulent transactions (TPs) and incur fewer missed fraudulent transactions (FNs +

non-investigated fraud accounts) than the FI method. Lastly, the Correctly Classified TP

Evaluation showed an even larger improvement in the identification of TPs when priority is

given to the meta-classifier’s probability when ranking transactions. The meta-classifier method

has the potential to provide up to $2.6 million in savings per year to the FI, and is able to efficiently allocate

investigation resources by correctly identifying more fraudulent transactions compared to the FI

method.


7 Glossary of Terms

Abbreviation  Definition
ANN           Artificial Neural Network
AUC           Area Under the Curve
BBN           Bayesian Belief Network
BPA           Break Point Analysis
CART          Classification and Regression Trees
DT            Decision Tree
FDM           Fraud Density Map
FDS           Fraud Detection System
FFM           Falcon Fraud Manager
FI            Financial Institution
FN            False Negative
FP            False Positive
FPR           False Positive Rate
ID3           Iterative Dichotomiser 3
kNN           k-Nearest Neighbour
MC            Meta-Classifier
MLE           Maximum Likelihood Estimation
NB            Naïve Bayesian
NN            Neural Network
PGA           Peer Group Analysis
QRT           Questionnaire-Responded Transaction
RIPPER        Repeated Incremental Pruning to Produce Error Reduction
RMSE          Root Mean Squared Error
ROC           Receiver Operating Characteristic
SVM           Support Vector Machine
TN            True Negative
TP            True Positive
TPR           True Positive Rate



Appendix A: Implementation of Base Algorithms on Simple Datasets

Naïve Bayesian example:

Table A1: Credit card dataset with three attributes and correct classifications for each instance

Instance #  Country  POS Entry  Card Type  Legit or Fraud
1           Canada   Swiped     Gold       Fraud
2           USA      Keyed      Platinum   Fraud
3           USA      Swiped     Classic    Legit
4           Canada   Swiped     Gold       Legit
5           USA      Swiped     Platinum   Fraud
6           Canada   Keyed      Gold       Legit
7           Canada   Swiped     Classic    Legit
8           Canada   Swiped     Classic    Legit
9           USA      Swiped     Platinum   Legit

Table A2: Counts for the credit card dataset

Country  Fraud  Legit | POS Entry  Fraud  Legit | Card Type  Fraud  Legit | Fraud/Legit  Fraud  Legit
Canada   1      4     | Swiped     2      5     | Classic    0      3     |              3      6
USA      2      2     | Keyed      1      1     | Gold       1      2     |
                      |                         | Platinum   2      1     |

Table A3: Probabilities for the credit card dataset

Country  Fraud  Legit | POS Entry  Fraud  Legit | Card Type  Fraud  Legit | Fraud/Legit  Fraud  Legit
Canada   1/3    4/6   | Swiped     2/3    5/6   | Classic    0/3    3/6   |              3/9    6/9
USA      2/3    2/6   | Keyed      1/3    1/6   | Gold       1/3    2/6   |
                      |                         | Platinum   2/3    1/6   |


Table A4: New instance to be predicted

Instance #  Country  POS Entry  Card Type  Legit or Fraud
10          Canada   Keyed      Platinum   ?

The three attributes in Table A4 – Country, POS Entry, and Card Type – are treated as equally

important and independent pieces of evidence. Therefore by multiplying the likelihood of fraud

for each attribute the overall likelihood of fraud for instance #10 can be calculated.

The probability of instance #10 being fraudulent using the Naïve Bayesian method is calculated

as follows:

$$\Pr[\text{fraud} \mid E] = \frac{\Pr[E_1 \mid \text{fraud}] \times \Pr[E_2 \mid \text{fraud}] \times \Pr[E_3 \mid \text{fraud}] \times \Pr[\text{fraud}]}{\Pr[E]}$$

The evidence, E, is the particular combination of attribute values for the new instance.

Country=Canada, POS Entry=Keyed, Card Type=Platinum are the three pieces of evidence E1,

E2, and E3 respectively. The probability of fraud, Pr[fraud], is the probability that an instance is

fraudulent without considering any of the evidence.

$$\Pr[\text{fraud} \mid E] = \frac{1/3 \times 1/3 \times 2/3 \times 3/9}{\Pr[E]}$$

Similarly the probability of instance #10 being legitimate can be calculated as follows:

$$\Pr[\text{legit} \mid E] = \frac{4/6 \times 1/6 \times 1/6 \times 6/9}{\Pr[E]}$$


Normalizing to calculate the probabilities yields:

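$$\text{Probability of fraud} = \frac{\Pr[\text{fraud} \mid E]}{\Pr[\text{fraud} \mid E] + \Pr[\text{legit} \mid E]} = 0.6667$$

$$\text{Probability of legit} = \frac{\Pr[\text{legit} \mid E]}{\Pr[\text{fraud} \mid E] + \Pr[\text{legit} \mid E]} = 0.3333$$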

Therefore, using the Naïve Bayesian method, instance #10 in Table A4 has a higher probability

of being a fraudulent transaction based on the training dataset from Table A1.
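
The arithmetic of this example can be checked with a short Python sketch. It is an illustration of the calculation above, not the implementation used in the thesis. It uses plain frequency counts with no smoothing, so an attribute value never seen with a class gives a zero probability; the function name is hypothetical.

```python
from fractions import Fraction

# Training data from Table A1: (Country, POS Entry, Card Type, class)
data = [
    ("Canada", "Swiped", "Gold", "Fraud"),
    ("USA", "Keyed", "Platinum", "Fraud"),
    ("USA", "Swiped", "Classic", "Legit"),
    ("Canada", "Swiped", "Gold", "Legit"),
    ("USA", "Swiped", "Platinum", "Fraud"),
    ("Canada", "Keyed", "Gold", "Legit"),
    ("Canada", "Swiped", "Classic", "Legit"),
    ("Canada", "Swiped", "Classic", "Legit"),
    ("USA", "Swiped", "Platinum", "Legit"),
]

def naive_bayes_posteriors(data, evidence):
    """Return normalized class probabilities for the given attribute values."""
    classes = sorted({row[-1] for row in data})
    scores = {}
    for c in classes:
        rows = [row for row in data if row[-1] == c]
        score = Fraction(len(rows), len(data))        # prior Pr[class]
        for i, value in enumerate(evidence):          # Pr[E_i | class]
            matches = sum(1 for r in rows if r[i] == value)
            score *= Fraction(matches, len(rows))
        scores[c] = score
    total = sum(scores.values())                      # plays the role of Pr[E]
    return {c: s / total for c, s in scores.items()}

# Instance #10 from Table A4
posteriors = naive_bayes_posteriors(data, ("Canada", "Keyed", "Platinum"))
print({c: round(float(p), 4) for c, p in posteriors.items()})
```

Using exact fractions avoids rounding noise, so the result can be compared directly with the normalized probabilities above.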

Bayesian network example:

Let P(E1) be the probability that a credit card transaction is fraudulent, and P(E2) be the

probability that a credit card is swiped. It is known that P(E1) is 0.2, the probability that a credit

card is swiped given that the transaction is fraudulent, P(E2|E1), is 0.7, and the probability that a

credit card is swiped given that the transaction is legitimate, P(E2|!E1), is 0.2. Using this

information the diagram below can be constructed.

Figure A1: Example of a Bayesian Network diagram showing the probabilities of events 1 and 2

In the above figure, the probabilities of a transaction being fraudulent given that the card is

swiped can easily be determined for four scenarios. P(E1,E2) and P(E1,!E2) represent the

probability that a transaction is fraudulent and the card is either swiped or not swiped. P(!E1,E2)

and P(!E1,!E2) represent the probability that a transaction is legitimate and the card is either

swiped or not swiped.
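
The four scenario probabilities of the Bayesian network example follow from the product rule applied to P(E1) = 0.2, P(E2|E1) = 0.7, and P(E2|!E1) = 0.2. The Python sketch below is only an illustration of that arithmetic; the values are computed from the three stated inputs rather than read from the figure, and the final Bayes' rule step is an added illustration.

```python
p_e1 = 0.2                # P(E1): transaction is fraudulent
p_e2_given_e1 = 0.7       # P(E2|E1): card swiped given a fraudulent transaction
p_e2_given_not_e1 = 0.2   # P(E2|!E1): card swiped given a legitimate transaction

# Joint probabilities for the four scenarios (product rule).
joint = {
    ("E1", "E2"): p_e1 * p_e2_given_e1,
    ("E1", "!E2"): p_e1 * (1 - p_e2_given_e1),
    ("!E1", "E2"): (1 - p_e1) * p_e2_given_not_e1,
    ("!E1", "!E2"): (1 - p_e1) * (1 - p_e2_given_not_e1),
}

# Added illustration: probability of fraud given a swiped card (Bayes' rule).
p_e2 = joint[("E1", "E2")] + joint[("!E1", "E2")]
p_e1_given_e2 = joint[("E1", "E2")] / p_e2

for scenario, p in joint.items():
    print(scenario, round(p, 2))
print("P(E1|E2) =", round(p_e1_given_e2, 4))
```

The four joint probabilities sum to 1, which is a quick sanity check on the inputs.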


Decision tree example:

The dataset in Table A5 will be used for the construction of a decision tree.

Table A5: Credit card dataset with four attributes and correct classifications for each instance

Instance #  Country  Card Type  POS Entry  Security Code  Legit or Fraud
1           Canada   Gold       Swiped     False          Legit
2           Canada   Gold       Swiped     True           Legit
3           Mexico   Gold       Swiped     False          Fraud
4           USA      Classic    Swiped     False          Fraud
5           USA      Platinum   Keyed      False          Fraud
6           USA      Platinum   Keyed      True           Legit
7           Mexico   Platinum   Keyed      True           Fraud
8           Canada   Classic    Swiped     False          Legit
9           Canada   Platinum   Keyed      False          Fraud
10          USA      Classic    Keyed      False          Fraud
11          Canada   Classic    Keyed      True           Fraud
12          Mexico   Classic    Swiped     True           Fraud
13          Mexico   Gold       Keyed      False          Fraud
14          USA      Classic    Swiped     True           Legit

Table A6: Counts for the credit card dataset

Country  Fraud  Legit | Card Type  Fraud  Legit | POS Entry  Fraud  Legit | Security Code  Fraud  Legit | Overall  Fraud  Legit
Canada   2      3     | Gold       2      2     | Swiped     3      4     | True           3      3     |          9      5
USA      3      2     | Classic    4      2     | Keyed      6      1     | False          6      2     |
Mexico   4      0     | Platinum   3      1     |                         |                            |


The first step is to determine the root node of the decision tree. The entropy associated with the

transaction being fraudulent or legitimate is to be determined, and can be calculated as follows:

Entropy(Fraud) = Entropy (9,5)

= -p log2 p – q log2 q

= - [(9/14) log2 (9/14)] – [(5/14) log2 (5/14)]

= -[-0.4098] – [-0.5305]

= 0.94

Next we calculate the entropy of fraud versus each of the four attributes.

Table A7: Count data for “Country” attribute

Country  Fraud  Legit  Total
Canada   2      3      5
USA      3      2      5
Mexico   4      0      4
Total                  14

Entropy(Fraud,Country) = [(5/14) x Entropy(2,3)] + [(5/14) x Entropy(3,2)]

+ [(4/14) x Entropy(4,0)]

= [(5/14) x 0.97] + [(5/14) x 0.97] + [0]

= 0.6929


Table A8: Count data for “Card Type”, “POS Entry”, and “Security Code” attributes

Card Type  Fraud  Legit  Total | POS Entry  Fraud  Legit  Total | Security Code  Fraud  Legit  Total
Gold       2      2      4     | Swiped     3      4      7     | True           3      3      6
Classic    4      2      6     | Keyed      6      1      7     | False          6      2      8
Platinum   3      1      4     |                                |
Total                    14    | Total                    14    | Total                        14

Similarly,

Entropy(Fraud,Card Type) = [(4/14) x Entropy(2,2)] + [(6/14) x Entropy(4,2)]

+ [(4/14) x Entropy(3,1)]

= [(4/14) x 1] + [(6/14) x 0.91] + [(4/14) x 0.81]

= 0.9071

Entropy(Fraud,POS Entry) = [(7/14) x Entropy(3,4)] + [(7/14) x Entropy(6,1)]

= [(7/14) x 0.98] + [(7/14) x 0.59]

= 0.785

Entropy(Fraud,Security Code) = [(6/14) x Entropy(3,3)] + [(8/14) x Entropy(6,2)]

= [(6/14) x 1] + [(8/14) x 0.81]

= 0.8914

To select the root node we pick the attribute that generates the largest gain value.

Gain(Fraud,Country) = 0.94 – 0.6929 = 0.2471

Gain(Fraud,Card Type) = 0.94 – 0.9071 = 0.0329

Gain(Fraud,POS Entry) = 0.94 – 0.785 = 0.155

Gain(Fraud,Security Code) = 0.94 – 0.8914 = 0.0486


Therefore the attribute “Country” will be chosen as the root node since it has the largest gain

value. Next we split the node when “Country=Canada”.

Table A9: Count data for “Country=Canada” and “Card Type” entropy calculations

Fraud (Country=Canada)

Card Type  Fraud  Legit  Total
Gold       0      2      2
Classic    1      1      2
Platinum   1      0      1
Total                    5

Entropy(Country=Canada) = Entropy (2,3) = 0.97

Entropy(Country=Canada, Card Type) = [(2/5) x Entropy(0,2)]

+ [(2/5) x Entropy(1,1)]

+ [(1/5) x Entropy(1,0)]

= 0 + (2/5) + 0

= 0.4

Similarly, the entropy calculations when the Country node is equal to Canada can be calculated

using the information in Table A10 as shown below.

Table A10: Count data for “Country=Canada” and “POS Entry” entropy calculations


POS Entry  Fraud  Legit  Total
Swiped     0      3      3
Keyed      2      0      2
Total                    5


Entropy(Country=Canada, POS Entry) = [(3/5) x Entropy(0,3)]

+ [(2/5) x Entropy(2,0)]

= 0

Lastly, the entropy when the Country node is equal to Canada and the splitting attribute is

Security Code can be calculated using the information in Table A11.

Table A11: Count data for “Country=Canada” and “Security Code” entropy calculations


Security Code  Fraud  Legit  Total
False          1      2      3
True           1      1      2
Total                        5

Entropy(Country=Canada, Security Code)= [(3/5) x Entropy(1,2)]

+ [(2/5) x Entropy(1,1)]

= [(3/5) x 0.91] + (2/5) = 0.946

Therefore the gains for the three attributes when split with the node “Country=Canada” are as

follows:

Gain(Country=Canada, Card Type) = 0.97 – 0.4 = 0.57

Gain(Country=Canada, POS Entry) = 0.97 – 0 = 0.97

Gain(Country=Canada, Security Code) = 0.97 – 0.946 = 0.024

From these results the decision tree method would choose the attribute “POS Entry” to the path

when Country is Canada.

Following the same procedures as above, the following decision tree can be constructed for the

dataset from Table A5 (see Figure A1 below).

Figure A1: Decision tree for the credit card transaction data
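
The entropy and gain calculations of the decision tree example can be reproduced with the Python sketch below. It is a minimal illustration of the selection rule (pick the attribute with the largest gain), not a full tree-building algorithm and not the implementation used in the thesis. It uses unrounded entropies, so the gains differ slightly from the hand-rounded values in the text (for example, about 0.2467 rather than 0.2471 for Country), but the chosen attributes are the same.

```python
from collections import Counter
from math import log2

ATTRS = ("Country", "Card Type", "POS Entry", "Security Code")

# Rows from Table A5: the four attributes followed by the class label.
ROWS = [
    ("Canada", "Gold", "Swiped", "False", "Legit"),
    ("Canada", "Gold", "Swiped", "True", "Legit"),
    ("Mexico", "Gold", "Swiped", "False", "Fraud"),
    ("USA", "Classic", "Swiped", "False", "Fraud"),
    ("USA", "Platinum", "Keyed", "False", "Fraud"),
    ("USA", "Platinum", "Keyed", "True", "Legit"),
    ("Mexico", "Platinum", "Keyed", "True", "Fraud"),
    ("Canada", "Classic", "Swiped", "False", "Legit"),
    ("Canada", "Platinum", "Keyed", "False", "Fraud"),
    ("USA", "Classic", "Keyed", "False", "Fraud"),
    ("Canada", "Classic", "Keyed", "True", "Fraud"),
    ("Mexico", "Classic", "Swiped", "True", "Fraud"),
    ("Mexico", "Gold", "Keyed", "False", "Fraud"),
    ("USA", "Classic", "Swiped", "True", "Legit"),
]

def entropy(rows):
    """Entropy of the class label (last column) over the given rows."""
    counts = Counter(row[-1] for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(rows, attr_index):
    """Entropy of the rows minus the weighted entropy after splitting on one attribute."""
    n = len(rows)
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row for row in rows if row[attr_index] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(rows) - remainder

# Root node: compare all four attributes on the full dataset.
root_gains = {a: gain(ROWS, i) for i, a in enumerate(ATTRS)}
root = max(root_gains, key=root_gains.get)

# Next split on the Country=Canada branch, using the remaining attributes.
canada = [row for row in ROWS if row[0] == "Canada"]
canada_gains = {a: gain(canada, i) for i, a in enumerate(ATTRS) if a != "Country"}
canada_split = max(canada_gains, key=canada_gains.get)

print(root, {a: round(g, 4) for a, g in root_gains.items()})
print(canada_split, {a: round(g, 4) for a, g in canada_gains.items()})
```

Applying the same two steps recursively to each remaining branch would yield the full tree.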


K-Nearest Neighbour example:

In the event of a tie, the kNN test is rerun with K minus 1 neighbours (one fewer neighbour) for

the data point in question.

Suppose we want to predict a transaction that has a transaction amount equal to $12 and has a

timestamp of 25 minutes. Using the training data from Table A12 the kNN method uses a

distance measure to determine the “closest” match for classification.

Table A12: Training dataset for the kNN example

Instance #  Transaction Amount ($)  Timestamp (minutes)  Classification
1           25                      25                   Fraud
2           25                      15                   Fraud
3           12                      15                   Legit
4           7                       15                   Legit

The K-value is an adjustable parameter. A K-value of 3 will be used for this example.

The first step is to calculate the distance between the training data (Table A12) and the new data

we want to classify ($12, 25minutes):

Table A13: Square distances between training data and new instance


Instance #  Transaction Amount ($)  Timestamp (minutes)  Square Distance
1           25                      25                   (25-12)^2 + (25-25)^2 = 169
2           25                      15                   (25-12)^2 + (15-25)^2 = 269
3           12                      15                   (12-12)^2 + (15-25)^2 = 100
4           7                       15                   (7-12)^2 + (15-25)^2 = 125

Next we sort the distances from smallest to largest and determine which instances lie

within the 3 nearest neighbours.


Table A14: Classification of the nearest-neighbours

Inst. #  Transaction Amount ($)  Timestamp (minutes)  Square Distance              Lies within K-nearest neighbours? (k=3)  Classification of nearest-neighbour
3        12                      15                   (12-12)^2 + (15-25)^2 = 100  Yes                                      Legit
4        7                       15                   (7-12)^2 + (15-25)^2 = 125   Yes                                      Legit
1        25                      25                   (25-12)^2 + (25-25)^2 = 169  Yes                                      Fraud
2        25                      15                   (25-12)^2 + (15-25)^2 = 269  No                                       ---

Using a majority vote the kNN algorithm would therefore classify the new instance ($12,

25 minutes) as a legitimate transaction.
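
The kNN example can be expressed as a short Python sketch. It is an illustration under stated assumptions, not the implementation used in the thesis: it uses the squared Euclidean distance from the tables above without scaling the attributes, and it resolves ties by retrying with one fewer neighbour, as described at the start of this example. The function name is hypothetical.

```python
from collections import Counter

# Table A12: (transaction amount in $, timestamp in minutes, classification)
TRAINING = [
    (25, 25, "Fraud"),
    (25, 15, "Fraud"),
    (12, 15, "Legit"),
    (7, 15, "Legit"),
]

def knn_predict(training, query, k):
    """Classify query by majority vote among its k nearest training instances."""
    by_distance = sorted(
        training,
        key=lambda row: (row[0] - query[0]) ** 2 + (row[1] - query[1]) ** 2,
    )
    while k > 0:
        votes = Counter(row[2] for row in by_distance[:k]).most_common()
        if len(votes) == 1 or votes[0][1] > votes[1][1]:
            return votes[0][0]
        k -= 1  # tie: retry with one fewer neighbour
    raise ValueError("k must be at least 1")

prediction = knn_predict(TRAINING, (12, 25), k=3)
print(prediction)
```

With k=4 the vote on this dataset is tied at two each, so the sketch falls back to k=3 and returns the same answer.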


Neural-network example:

The dataset in Table A15 will be used to construct the neural-network model.

Table A15: Credit card transaction data for neural-network algorithm

Instance #  Transaction amount (thousands $)  Timestamp (minutes)  Classification (0.5=legit; 1=fraud)
1           0.35                              0.9                  0.5
2           0.12                              0.3                  1
3           0.47                              0.6                  1

We choose a simple network as shown below and set the initial weights to be random

numbers. The neurons in this network have a Sigmoid activation function.

Figure A2: Simple neural network with randomly initiated weights

To train the neural network with the first data instance we use the data from instance #1

in Table A15 as inputs A and B to the neural network. The outputs from each of the units

(neurons) can be calculated as follows:

Input A = 0.35, Input B = 0.9 (values from instance #1)

Input to “Hidden Unit 1” = (0.35x0.1) + (0.9x0.8) = 0.755


Output of “Hidden Unit 1” = $\frac{1}{1 + e^{-0.755}}$ = 0.68

Input to “Hidden Unit 2” = (0.9x0.6) + (0.35x0.4) = 0.68

Output of “Hidden Unit 2” = 0.6637

Input to “Output Unit” = (0.3x0.68) + (0.9x0.6637) = 0.8013

Output from “Output Unit” = 0.69

Next we calculate the error term from the output unit. This is done by calculating the

difference between the target value, 0.5 (the correct classification for instance #1), and the

output value, 0.69 (the calculated output value from the Output Unit).

Output error (δ) = (target – output) x (1 – output) x output

= (0.5 – 0.69) x (1 – 0.69) x 0.69

= -0.0406

The ‘(1 – output) x output’ term is needed because the units (neurons) use a Sigmoid function.

The weights for the connections between the hidden layer and the output unit are updated as

follows:

w1’ = w1 + (δ x input from “hidden unit 1” to “output unit”)

= 0.3 + (-0.0406 x 0.68)

= 0.272392

w2’ = w2 + (δ x input from “hidden unit 2” to “output unit”)

= 0.9 + (-0.0406 x 0.6637)

= 0.87305


Unlike the output layer, the errors for the hidden layer units cannot be calculated directly

since there is no target value for the hidden layer. Therefore errors are back propagated from the

output layer. This is done by taking the errors from the output unit and running them back

through the weights to get the hidden layer errors.

Therefore the errors for the hidden layer can be calculated as follows:

δ1 = δ x w1’ x [(1 – output of hidden unit #1) x output of hidden unit #1]

= -0.0406 x 0.272392 x [(1 – 0.68) x 0.68]

= -2.406 x 10^-3

δ2 = δ x w2’ x [(1 – output of hidden unit #2) x output of hidden unit #2]

= -0.0406 x 0.87305 x [(1 – 0.6637) x 0.6637]

= -7.916 x 10^-3

Using the hidden layer errors, the new hidden layer weights can be calculated as follows:

w3’ = w3 + (δ1 x input A)

= 0.1 + (-2.406x10^-3 x 0.35) = 0.0992

w4’ = w4 + (δ1 x input B)

= 0.8 + (-2.406x10^-3 x 0.9) = 0.7978

w5’ = w5 + (δ2 x input A)

= 0.4 + (-7.916x10^-3 x 0.35) = 0.3972

w6’ = w6 + (δ2 x input B)

= 0.6 + (-7.916x10^-3 x 0.9) = 0.5928


This ends the first iteration in which all the weights in the neural network model are updated

using training instance #1. By working through the network with the updated weights, the new

final output is calculated to be 0.683. This results in a new reduced error of -0.183.

The same processes as described above are conducted for all the instances in the training

dataset. A neural network model is completely trained when all the weights are optimized

according to the training instances. Once the model is trained, new instances can be used as input

to the network to produce new instance predictions.
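
The first training iteration can be traced with the Python sketch below. It is an illustration that follows the order of operations in the worked example, under these assumptions read from the calculations above: w1 and w2 connect hidden units 1 and 2 to the output unit, w3 and w4 connect inputs A and B to hidden unit 1, w5 and w6 connect inputs A and B to hidden unit 2, and the learning rate is implicitly 1. As in the text, the hidden layer errors are computed with the already-updated output weights. Because the sketch does not round intermediate values, its numbers differ from the hand calculation in the third or fourth decimal place.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def forward(a, b, w):
    """Return the outputs of hidden unit 1, hidden unit 2, and the output unit."""
    h1 = sigmoid(a * w["w3"] + b * w["w4"])
    h2 = sigmoid(a * w["w5"] + b * w["w6"])
    out = sigmoid(h1 * w["w1"] + h2 * w["w2"])
    return h1, h2, out

def train_step(a, b, target, w):
    """One weight update for a single training instance."""
    h1, h2, out = forward(a, b, w)
    delta = (target - out) * (1 - out) * out        # output error
    new = dict(w)
    new["w1"] = w["w1"] + delta * h1
    new["w2"] = w["w2"] + delta * h2
    # Hidden layer errors, back-propagated through the updated output weights.
    d1 = delta * new["w1"] * (1 - h1) * h1
    d2 = delta * new["w2"] * (1 - h2) * h2
    new["w3"] = w["w3"] + d1 * a
    new["w4"] = w["w4"] + d1 * b
    new["w5"] = w["w5"] + d2 * a
    new["w6"] = w["w6"] + d2 * b
    return new

weights = {"w1": 0.3, "w2": 0.9, "w3": 0.1, "w4": 0.8, "w5": 0.4, "w6": 0.6}

# Instance #1 from Table A15: inputs (0.35, 0.9), target 0.5
_, _, out_before = forward(0.35, 0.9, weights)
updated = train_step(0.35, 0.9, 0.5, weights)
_, _, out_after = forward(0.35, 0.9, updated)
print(round(out_before, 4), round(out_after, 4))
```

Repeating `train_step` over all instances, for many passes, is what drives the weights toward values that fit the training data.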


Logistic Regression example:

The dataset in Table A16 will be used to construct the logistic regression model.

Table A16: Credit card transaction data for logistic regression

Instance #  Transaction amount = x1 (thousands $)  Timestamp = x2 (minutes)  Fraud Classification
1           0.35                                   0.9                       No
2           0.12                                   0.3                       Yes
3           0.47                                   0.6                       Yes

The logistic regression equation is set up as follows:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2)}},$$

where p is the probability that the fraud classification for an instance is “Yes”, b0 is a constant, b1

is the coefficient associated with variable x1, and b2 is the coefficient associated with variable x2.

Using the SPSS Clementine software, the Maximum Likelihood Estimation (MLE)

algorithm was used to determine the constant and the coefficients for the logistic regression

equation. It was found that for the training data from Table A16, b0 is equal to 34.402, b1 is equal

to 74.085, and b2 is equal to -86.433. The following equation can be constructed from the results:

$$p = \frac{1}{1 + e^{-(34.402 + 74.085 x_1 - 86.433 x_2)}}$$

This new equation can be used to predict whether a future instance is fraudulent or not. For

example, let us assume that instance # 4 is a new credit card transaction that we want to predict.

The transaction amount is $500 and the timestamp is 1 minute.


To determine the probability of fraud for this transaction we plug the values into the

logistic regression as follows:

$$p = \frac{1}{1 + e^{-(34.402 + 74.085(0.5) - 86.433(1))}} = 3.094 \times 10^{-7}$$

Therefore, the logistic regression predicts that instance #4 is a legitimate transaction since the

probability of fraud is below 0.5 and close to zero.
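
The prediction for instance #4 can be checked with a few lines of Python. The sketch only evaluates the fitted equation with the coefficients reported above; it does not re-estimate them, and the function name is hypothetical.

```python
from math import exp

# Coefficients reported in the text for the training data in Table A16.
B0, B1, B2 = 34.402, 74.085, -86.433

def fraud_probability(x1, x2):
    """x1: transaction amount in thousands of dollars; x2: timestamp in minutes."""
    z = B0 + B1 * x1 + B2 * x2
    return 1.0 / (1.0 + exp(-z))

# Instance #4: transaction amount $500 (x1 = 0.5), timestamp 1 minute (x2 = 1).
p = fraud_probability(0.5, 1)
print(f"{p:.3e}")
```

With only three training instances the fitted coefficients are very large, which is why the predicted probability is so close to zero.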


Appendix B: Pre-processing and Data Cleansing of Raw Dataset

Table B1: Sample of the unaltered dataset received from the Financial Institution

card_no type_modifier txn_date txn_time expiry_date txn_code txn_amt mess_type appr_code resp_code

1234567890123456 0 6/30/2009 21:31:56 1012 0 64 100 46851 0

card_verfy_flag card_verfy_digts cv12_prsnt_indctr acq_bin cond_code pos_mode pin_ind e_comm_flag avalbl_crdt mrchnt_ID

M MMM 1 402954 8 812 N 7 3106.43 09367562 X

mrchnt_SIC_code term_ID mrchnt_name mrchnt_city mrchnt_state mrchnt_cntry mrchnt_pstcd user_cntry user_pstcd card_type

5200 2310010 CDN TIRE VIMONT LAVAL QC CA A2A3B3 CAN A1A2B2 VWIAV

decl_rsn_code cris_score cris_type fico_score fico_reason falc_score falc_reason crd_expr_date trml_cpbty chip_rslt_code trml_type

0 3 300 0 934 8 1211 5 12 0


Table B2: Sample of the cleansed dataset

card_no type_modifier expiry_date txn_code txn_amt mess_type appr_code3 resp_code2 card_verfy_flag card_verfy_digts2

1234567890123456 0 902 0 5 100 A 5 M MMM

cv12_prsnt_indctr2 acq_bin cond_code pos_mode2 pin_ind e_comm_flag2 avalbl_crdt mrchnt_state3 mrchnt_cntry2 user_cntry2

9999 450001 0 902 N 9999 5107.45 AB CA CA

card_type2 cris_type fico_score falc_score falc_reason trml_cpbty date_diff_days time_diff_mins fraud

VGGPR 1 180 960 2 2 ‐5 272 Y


Table B3: Removal and Simplification of attributes

Attribute | Removed/Changed/Added | Reason
Transaction Date | Changed to Date Difference | To measure the difference in days between subsequent transactions
Transaction Time | Changed to Time Difference | To measure the difference in minutes between subsequent transactions
Merchant ID | Removed | Uninformative attribute: each transaction had its own unique number, so no pattern was recognizable
Merchant SIC Code | Removed | Uninformative attribute: each transaction had its own unique number, so no pattern was recognizable
Terminal ID | Removed | Uninformative attribute: some instances were numeric while others were alphanumeric (inconsistent formatting)
Merchant Name | Removed | Categorical attribute with a large number of unique values, which would degrade the prediction of the meta-classifier; inconsistent formatting
Merchant City | Removed | Categorical attribute with a large number of unique values, which would degrade the prediction of the meta-classifier; inconsistent formatting
Merchant Postal Code | Removed | Contains many missing/blank values
User Postal Code | Removed | Categorical attribute with a large number of unique values, which would degrade the prediction of the meta-classifier; inconsistent formatting
Decline Reason Code | Removed | Categorical attribute with a large number of unique values, which would degrade the prediction of the meta-classifier
CRIS Score | Removed | All instances were blank
FICO Reason | Removed | 90% of instances were labeled as ‘0’
Credit Expiry Date | Removed | This attribute had the same values as the ‘Expiry Date’ attribute
Chip Result Code | Removed | Only 4% of instances had a value in this attribute
Terminal Type | Removed | Contains many missing/blank values
Fraud Label | Added | To classify whether a transaction was considered fraudulent or legitimate by the FI after investigations
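The replacement of Transaction Date and Transaction Time by difference attributes can be illustrated with a minimal sketch. It rests on assumptions the source does not confirm: differences are taken between consecutive transactions on the same card, the date difference compares calendar dates, and the time difference compares times of day. The function name `add_differences` and the second transaction in the usage example are hypothetical.

```python
from datetime import datetime


def add_differences(transactions):
    """Return new rows with txn_date/txn_time replaced by difference attributes.

    Assumptions (not confirmed by the source): differences are computed per
    card, in the order the rows are given; days compare calendar dates and
    minutes compare times of day. The first row of a card gets None.
    """
    previous = {}  # card_no -> (date, minutes since midnight) of the last row seen
    result = []
    for txn in transactions:
        stamp = datetime.strptime(
            f"{txn['txn_date']} {txn['txn_time']}", "%m/%d/%Y %H:%M:%S"
        )
        minutes = stamp.hour * 60 + stamp.minute
        # Copy every attribute except the two that are being replaced.
        row = {k: v for k, v in txn.items() if k not in ("txn_date", "txn_time")}
        if txn["card_no"] in previous:
            prev_date, prev_minutes = previous[txn["card_no"]]
            row["date_diff_days"] = (stamp.date() - prev_date).days
            row["time_diff_mins"] = minutes - prev_minutes
        else:
            row["date_diff_days"] = None
            row["time_diff_mins"] = None
        previous[txn["card_no"]] = (stamp.date(), minutes)
        result.append(row)
    return result


if __name__ == "__main__":
    sample = [
        {"card_no": "1234567890123456", "txn_date": "6/30/2009", "txn_time": "21:31:56"},
        # Hypothetical follow-up transaction, for illustration only.
        {"card_no": "1234567890123456", "txn_date": "7/2/2009", "txn_time": "22:01:10"},
    ]
    for row in add_differences(sample):
        print(row)
```

Under these assumptions a negative value can appear whenever the later transaction falls earlier in the day, or when rows are not in chronological order.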

Table B4: Numerical and Categorical attributes in cleansed dataset

Name | Format | Numerical/Categorical | Possible Values | Description
Acquiring BIN | 1, 4, 5 or 6-character numeric (1, 1111, 11111 or 111111) | Numerical | Positive whole numbers | Bank Identification Number
Approval Code | 1-character alphabetic (A) | Categorical | A, B, C, D | Code displayed when authorization is approved
Available Credit | 7-character numeric (1111111) | Numerical | Positive whole numbers | Amount of available credit
Card Number | 16-character numeric (1111111111111111) | Numerical | Positive whole numbers | Credit card number associated with the transaction
Card Type | 5-character alphanumeric (1A1A1) | Categorical | VBBNX, VBBRX, VBCSX, VBGSX, VGCPX, VGGCM, VGGCP, VGGLD, VGGLX, VGGPR, VGGST, VGGTS, VGGUS, VGGXP, VGRVG, VPBAP, VPPCA, VPPEL, VPPLP, VPPLT, VPPNX, VPPRX, VPPST, VPPTS, VSC2S, VSCCL, VSCCM, VSCL2, VSCLO, VSCLR, VSCLS, VSCLX, VSCMW, VSCSB, VSCSL, VSCST, VSCTS, VSESO, VSRVC, VWIAV, VWPBI, OTHER | Category of the credit card (e.g. rewards card, travel card)
Card Verify Digits | 1 or 3-character alphabetic (A) | Categorical | A, B, C, D, E, MMM | Indicates whether the card verification digits on the card match the digits on the account
Card Verify Flag | 1-character alphabetic (A) | Categorical | M, N, X, ‘ ’ (blank) | Indicates whether the card verification flag on the card matches the flag on the account
Condition Code | 1 or 2-character numeric (1 or 11) | Categorical | 0-2, 5, 8, 71 | Gives suspicious transactions a higher priority based on a characteristic of the transaction
CRIS Type | 1 or 2-character numeric (1 or 11) | Categorical | 0-16, 18, 20, 22 | Different categories of risks
CVI2 Present Indicator | 1 or 4-character numeric (1 or 1111) | Categorical | 0-2, 9, 9999 | Indicates whether the Card Verification Indicator matches, mismatches, or is not evaluated
Date Difference (Days) | 2-character numeric (±11) | Numerical | Integer values | The number of days between the current transaction and the previous transaction
E-commerce Flag | 1 or 4-character numeric (1111) | Categorical | 1-9, 9999 | Electronic Commerce Indicator
Expiry Date | 1, 3 or 4-character numeric (1111) | Numerical | Positive whole numbers | The date the credit card expires
Falcon Reason | 1, 2 or 3-character numeric (111) | Categorical | 1-8, 10-14, 17, 18, 20-22, 26, 502-504, 508, 510, 512, 513, 518, 520, 526 | Reason why a particular Falcon score was given to a transaction
Falcon Score | 1, 2 or 3-character numeric (111) | Numerical | Whole numbers from 0 to 999 | Risk prediction and neural score calculated by the Falcon system
FICO Score | 1, 2 or 3-character numeric (111) | Numerical | Whole numbers from 0 to 999 | Statistical behaviour score based on variances from normal behaviour, calculated by Fair, Isaac and Company
Fraud | 1-character alphabetic (A) | Categorical | Y or N | Fraud label for the transaction; Yes or No values
Merchant Country | 2 or 5-character alphabetic (AA or AAAAA) | Categorical | CA, US, OTHER | Country in which the merchant is located
Merchant State | 2, 3, 4, or 5-character alphabetic (AA, AAA, AAAA or AAAAA) | Categorical | AB, BC, MB, MWUS, NB, NEUS, NL, NS, ON, PE, QC, SK, SUS, WUS, OTHER | State/Province in which the merchant is located
Message Type | 3 or 4-character numeric (111 or 1111) | Categorical | 100, 101, 120, 7000 | Code for the type of authorization request
PIN Indicator | 1-character alphabetic (A) | Categorical | Y or N | Indicates whether the transaction was initiated by a chip PIN CVM or a chip signature card
POS Mode | 1, 2, 3 or 4-character numeric (1, 11, 111 or 1111) | Categorical | 0-2, 10-12, 50-52, 810-812, 900-902, 9999 | Describes the authorization request entered at the Point of Sale (POS)
Response Code | 1-character numeric (1) | Categorical | 0, 1, 4, 5, 9 | Response to the authorization request (e.g. decline for invalid PIN, non-authorized transaction)
Terminal Capability | 1-character numeric (1) | Categorical | 0-9 | The processing and acceptance capability of the terminal (e.g. magnetic stripe only, magnetic stripe + chip card)
Time Difference (Minutes) | 1 to 8-character numeric (±1111.1111) | Numerical | Integer values | The time in minutes between the current transaction and the previous transaction
Transaction Amount | 1 to 8-character numeric (±1111.1111) | Numerical | Positive integers | The dollar amount associated with each transaction
Transaction Code | 1 or 2-character numeric (1 or 11) | Categorical | 0, 11, 17, 50 | Code for the type of authorization request
Type Modifier | 1-character numeric (1) | Categorical | 0, 1, 4, 5 | Type of transaction entered at the point of sale for non-monetary transactions
User Country | 2 or 5-character alphabetic (AA or AAAAA) | Categorical | CA, US, OTHER | Country in which the user (card holder) is located

Appendix C: Example of how Weka calculates the Root Mean Squared Error

For this example the decision tree classifier (modeled using the J48 algorithm in Weka) is used to output predictions for 5 instances. The actual class of each instance is known, and the decision tree algorithm’s predicted probability distribution is calculated. For each class label, the difference between the actual class value and the predicted value is squared and divided by the number of class labels (difference^2 / 2). These terms are summed over the class labels for each instance (Squared Error), and the per-instance sums are then summed over all instances (Sum of Squared Errors). Table C1 shows an example of calculating the Root Mean Squared Error using the J48 algorithm for 5 instances.

Table C1: Calculation of the ‘Sum of Squared Errors’ for a decision tree classifier example with 5 instances

Inst. # | Class 1: Predicted Value by J48 | Class 1: Actual Value | Class 1: Diff^2 / 2 | Class 2: Predicted Value by J48 | Class 2: Actual Value | Class 2: Diff^2 / 2 | SqrErr (sum of both diffs)
1 | 0.621 | 1.0 | 0.07182 | 0.379 | 0.0 | 0.07182 | 0.14364
2 | 0.921 | 1.0 | 0.00312 | 0.079 | 0.0 | 0.00312 | 0.00624
3 | 0.012 | 0.0 | 0.00007 | 0.988 | 1.0 | 0.00007 | 0.00014
4 | 0.012 | 0.0 | 0.00007 | 0.988 | 1.0 | 0.00007 | 0.00014
5 | 0.921 | 1.0 | 0.00312 | 0.079 | 0.0 | 0.00312 | 0.00624

SumSqrErr = 0.1564

Therefore the Sum of Squared Errors for the data shown in Table C1 is 0.1564. This value can then be substituted into the Root Mean Squared Error (RMSE) equation to determine the error term for the classifier generated from the J48 algorithm, as shown in equation C.1.

$$e = \sqrt{\frac{\sum_{i} (x_i - y_i)^2}{n}} = \sqrt{\frac{0.1564}{5}} = 0.1758 \qquad \text{(C.1)}$$

The error term for the decision tree classifier in this example is 0.1758.
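The procedure described in this appendix can be sketched in a few lines of Python. This is an illustration of the steps as described here, not Weka's own code; the function name `rmse_from_distributions` is hypothetical, and the inputs are the rounded probabilities printed in Table C1.

```python
import math


def rmse_from_distributions(predicted, actual):
    """Return (sum of squared errors, RMSE) for per-instance class distributions.

    For each instance, the squared differences between predicted and actual
    class values are averaged over the class labels; these per-instance
    values are summed, divided by the number of instances, and square-rooted.
    """
    sum_sq_err = 0.0
    for pred_row, actual_row in zip(predicted, actual):
        n_classes = len(pred_row)
        sum_sq_err += sum(
            (p - a) ** 2 for p, a in zip(pred_row, actual_row)
        ) / n_classes
    return sum_sq_err, math.sqrt(sum_sq_err / len(predicted))


if __name__ == "__main__":
    # Rounded values from Table C1: [Class 1, Class 2] for each instance.
    predicted = [[0.621, 0.379], [0.921, 0.079], [0.012, 0.988],
                 [0.012, 0.988], [0.921, 0.079]]
    actual = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    sse, rmse = rmse_from_distributions(predicted, actual)
    print(f"SumSqrErr = {sse:.4f}, RMSE = {rmse:.4f}")
```

Evaluated on the Table C1 values, the sketch gives a Sum of Squared Errors of about 0.1564 and a root of about 0.177, close to the 0.1758 reported in the text.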