genetic algorithms for credit card fraud detection - inase · characteristics of credit card...
TRANSCRIPT
Genetic algorithms for credit card fraud
detection
SATVIK VATS*, SURYA KANT DUBEY, NAVEEN KUMAR PANDEY
Institute of Technology and Management
AL-1, Sector-7 GIDA, Gorakhpur, Uttar Pradesh, INDIA
E-mail address- [email protected]
Abstract: - Due to the rise and rapid growth of E-Commerce, use of credit cards for online purchases has
dramatically increased and it caused an explosion in the credit card fraud. Fraud is one of the major ethical
issues in the credit card industry. As credit card becomes the most popular mode of payment for both online as
well as regular purchase, cases of fraud associated with it are also rising. In real life, fraudulent transactions are
scattered with genuine transactions and simple pattern matching techniques are not often sufficient to detect
those frauds accurately. Implementation of efficient fraud detection systems has thus become imperative for all
credit card issuing banks to minimize their losses. Many modern techniques based on Artificial Intelligence,
Data mining, Fuzzy logic, Machine learning, Sequence Alignment, Genetic Programming etc., has evolved in
detecting various credit card fraudulent transactions. A genetic algorithm is an evolutionary search and
optimisation technique that Mimics natural evolution to find the best solution to a problem. Here the
characteristics of credit card transactions undergo evolution to allow a modelled credit card fraud detection
system to be tested.
Key- Words:- Electronic commerce, fraud, credit card, genetic algorithms, detection
1 Introduction
In recent history information technology has
become far more pervasive in everyone’s lives. As reliance on software products increases, so does the
pressure to ensure that they work reliably and as
expected. This is why software testing has risen to
the forefront of public attention, with notable
instances such as the iPhone alarm bug [3]. In
1998, the Data Protection Act changed the way
data can be used [13]. Until this time, developers in
the UK working in industry have simply made
copies of customer data and used it in an often less
secure development environment. The Act
introduces legislation intended to give more rights
to individuals whose data is being held, and
restricts uses to which it can be put. For example a
company wishing to outsource some of its
development may not have the right to pass on its
customers data, hence doing so would be a
Proceedings of the 2013 International Conference on Education and Educational Technologies
42
violation of the Act. The company Grid-Tools
Limited is our industrial partner for this project.
They offer professional solutions for automatic test
data generation in the form of a tool called Data
Maker. This tool was originally written to address
the increasing size of data sets required by industry. As the set size increases it becomes impractical to
create the data by hand, so automatic methods had
to be found. Data Maker is capable of generating
synthetic data that conforms to the requirements of a test engineer’s specification. The data created is
not regulated by the Data Protection Act, because it
has been generated rather than gathered. Thus it is
not related to any individual and hence not covered
in legislation. The use of synthetic data has
advantages in its own right: large sets of data can
be created, with their composition tailored to meet
test coverage criteria. A set of real-world data may
not do this, as it is likely to be of relatively constant
composition, so not testing all aspects of the
program. A limitation of Data Maker is that it can
only produce linear sets of data from its built-in functions. The systematic testing of some software,
however, requires data sets with trends. A typical
example of such a system is a credit card fraud
detection system. To thoroughly test such a system
one would require a large set of realistic
transactions, both legitimate and fraudulent. For
reasons discussed above real data should not be
used, so instead a way to generate such data must
be found.
2 Related work
Genetic algorithms are a heuristic used to solve
high-complexity computational problems. Apart
from modeling the phenomena occurring in nature,
they help in optimization, simulation, modeling,
design and prediction purposes in science,
medicine, technology, and everyday life [14]. A
recent survey of the state of the art was carried out
for the “Materials and Manufacturing Processes”
journal in 2009, by Paszkowicz [14]. As the name
of the journal suggests they were only concerned
with the application of genetic algorithms to
problems in chemistry and physics, but nonetheless
they highlighted some innovative uses. One cited
example was to help the design process of new
materials, in particular with regards to a reverse
heat transfer problem. The problem consists of
finding a material with desirable thermal properties
that give rise to a good temperature field profile.
For a particular material well known equations can
be used to calculate the temperature profile, but
because of their complex nature the process cannot
easily be reversed to find optimal parameters. This
is an area where evolutionary search often excels,
as we will see in the next example where the search
is applied to an NP complete problem. The
algorithm used in this case modeled a liquid
material that was being heated linearly on its
surface. The input to the algorithm, its initial
population, was properties of already known
similar liquids. The output computed for each
liquid was the temperature field and the cooling
rate. Good results were returned by the algorithm,
which were later confirmed to be correct
experimentally. Still in the same materials
engineering survey, evolutionary search has been
applied to the mechanical process of welding. To
produce a strong weld several parameters have to
be optimized, such as current, voltage, torch speed,
arc gap, shielding gas and its flow rate, type and
geometry of the electrode. It can already be seen
how this optimization process could lend itself to
the application of a genetic algorithm, and once
again good results were found for what would have
been an expensive experimental process. Not only
did the results of the optimization provide a better
set of welding parameters, they also shed light on
Proceedings of the 2013 International Conference on Education and Educational Technologies
43
the transformation of the metal during the weld.
This had already been described theoretically, but
the results from the algorithm helped to bring
calculations and experimental results closer
together. In a purely theoretical area, genetic
algorithms have been applied to find approximate
solutions to the travelling salesman problem.
Scaling became an issue as the number of cities the
salesman had to visit increased. Braun [4] reported
that the algorithm could generate very good but not
optimal solutions for travelling salesman problems
with 442 to 531 cities. Using a standard SUN
workstation they could optimally solve problems
with up to 442 cities in under thirty minutes. The
biggest problem examined was 666 cities, which
could be solved approximately with a journey
0.04% longer than the optimum route. Potvin also
analyzed Travelling Salesman with genetic
algorithms [6]. The biggest problem reported in his
survey was one million cities, solved to within 4%
of an optimal route. This took four hours on a
powerful computer. He identified the role played by
the crossover operator on the outcome, with
performance being significantly affected by the
reordering of the tour. Perhaps the most well
known application of machine learning is robotic
movement. Schultz [1] applied the algorithm so
that autonomous robots could navigate and perform
collision avoidance of obstacles in their path. An
innovative part of his work was once again aimed
at cost and time saving, similar to the previously
detailed welding example. The task set for the
autonomous robot was to navigate from a start to
end point down no pre planned route, avoiding
randomly placed obstacles on its way.
3. Fraud Detection using Genetic Algorithm
Genetic algorithms are evolutionary algorithms
which aim at obtaining better solutions as time
progresses. Since their first introduction by Holland
[5], they have been successfully applied to many
problem domains from astronomy to sports [2],
from optimization [8] to computer science [7], etc.
They have also been used in data mining mainly for
variable selection [10] and are mostly coupled with
other data mining algorithms. In this study, we try
to solve our classification problem by using only a
genetic algorithm solution. In this module the
system must detect whether any fraud has been
occurred in the transaction or not. It must also
display the user about the result.
In the following we make clear the
concept of genetic algorithms by using an own
example over boundary value testing. We
implemented this algorithm in Java and can
successfully generate inputs for the test. A genetic
algorithm is a paradigm often used to search vast
and poorly understood search spaces. With well
defined functions the algorithm will converge into
one area of the search space which holds the
optimal solution. This example is a very simple
instance of the algorithm that searches for a set of
optimum inputs for black box testing. The function
being tested checks whether or not a value x is
within the range 0 ≤ x ≤ 8. Boundary value testing
is concerned with selecting the following input
Values:
• Maximum.
Proceedings of the 2013 International Conference on Education and Educational Technologies
44
• Maximum minus one.
• Nominal middle value.
• Minimum plus one.
• Minimum.
Fig.1 Selection of inputs for 0 ≤ x ≤ 8
These test cases will exercise the program to detect
any errors, particularly those that are “off by one”.
For simplicity we will assume that the correct input
values are known.
A generic genetic algorithm [11]
SimpleGeneticAlgorithm ( )
initialise population ;
evaluate population ;
while ( termination criteria not met )
select solutions for nextgeneration ;
perform crossover and mutation ;
evaluate population ;
In the same way as a chromosome is the basic
building block of nature, so it is of a genetic
algorithm. The chromosome is an encoded
statement of the data which one wishes to optimise.
In our example the chromosome would represent a
tuple of all of the input values, and it is encoded as
a binary string. The reason for this choice will
become clear when further genetic operators are
considered. In our example, the inputs 8, 7, 3, 1 and
0 would be encoded as their binary equivalent, and
concatenated: 1000 0111 0011 0001 0000. The
second task is to write a function to compare the
relative merit of chromosomes... The fallowing
pseudo code shows, pseudo-code for fitness
calculation over an encoding of five bytes, each
representing an input integer. A set of
chromosomes goes to make up the population of
the algorithm. Our algorithm is started with a
randomly generated population of chromosomes.
Evaluation of the fitness of a chromosome
Int fitness ( Chromosome input )
int fitness = 0 ;
int [ ] ideal = new int [A : E ] ;// Array of ideal inputs A to E .
Proceedings of the 2013 International Conference on Education and Educational Technologies
45
int [ ] actual = input . to Array ( ) ; // Retrieved at a from chromosome .
for ( int i = A : E ; i++)
fitness = absolute ( actual − ideal ) ;
return fitness ;
Fig.2 Diagram of crossover.
Crossover is the operator used to reproduce
chromosomes. This works by taking a pair of
encoded chromosomes - the parents - and
combining them to produce two different
chromosomes - the progeny. When applied across
two fit chromosomes this method aims to produce
progeny that have inherited the best attributes of its
parents, though this is not always the case. To
illustrate the principal, let’s consider two
chromosomes and assume a central crossover: If
the parents are 0011 and 1100 the two progeny will
be 0000 and 1111 respectively - see Figure 4. This
should make it clear that the bit string is simply
crossed, as the name suggests. Depending on the
encoding, crossing in the middle of the
chromosome may not be likely to give rise to fit
progeny, where this is the case other points may be
chosen, or indeed more than one point. Another
common solution is to select a random point up to
the length of the chromosome, and cross there.
Crossover pseudo-code
Chromosome crossover ( Chromosome parentX , Chromosome parentY )
int c r o s s P o i n t = 8 ;
String x First Half = parentX . substring ( 0 , cross Point ) ;
String x Second Half = parentX . sub string ( cross Point , parentX . length ) ;
String y First Half = parentY . sub string ( 0 , cross Point ) ;
String y Second Half = parentY . sub string ( cross Point , parentY . length ) ;
Chromosome crossed X = x First Half + y Second Half ;
Chromosome crossed Y = y First Half + x Second Half ;
Mutation is essential to a true genetic algorithm. In
popular culture mutation is often viewed in a
negative light - simply consider how many horror
films are based around some kind of mutant! In fact
without mutation neither the world as we know it
Proceedings of the 2013 International Conference on Education and Educational Technologies
46
nor our algorithms would evolve efficiently.
Mutation is defined as a minimal change to a
chromosome, so when one is using a binary string
representation often a single bit is flipped. These
changes are usually applied at the end of each
generation before the breeding pool and population
are combined again, but only with a very small
probability of each chromosome being affected. If
this was not done then no new genetic information
would be produced after the initial population -
note that crossover doesn’t create anything, rather
just recombine existing chromosomes. Without
new chromosomes the algorithm is likely to cease
with a suboptimal population, or run infinitely
never converging on a solution. If, on the other
hand, mutation levels are set too high the stream of
new chromosomes could be too large, disrupting
any convergent progress. If mutation was set to
affect every chromosome in each generation and
crossover removed, then the search has become
completely stochastic.
Mutation pseudo-code
Chromosome mutate (Chromosome)
int randomValue = new Random( Chromosome . length ) ;
i f ( Chromosome . valueAt ( randomValue ) == 0 )
Chromosome . valueAt ( randomValue ) == 1 ;
else
Chromosome . valueAt ( randomValue ) == 0 ;
return Chromosome ;
3.1 Mathematical model
Chromosome is the logical unit of information
transmission to the next generation [12]. The
definition of a chromosome can be taken a little
deeper. Usually the chromosome holds a binary
encoding of the optimization subject. Where this is
the case the genetic algorithm is considered
discrete, as clearly only a set number of values can
be assumed. In some cases the encoding involves
the real numbers instead, creating a continuous
genetic algorithm. In other cases, such as modeling
temperature, the use of a continuous chromosome
is more appropriate. For natural selection to take
place, some way of comparing one chromosome to
the other must be available. In the algorithm this is
modeled as a fitness or cost function, where a lower cost chromosome is favored over a higher cost.
Cost function is mapping such that: chromosome
→ R, where a value closer to zero shows a better
optimized chromosome. The formalization that
follows has been drawn from work by B¨ck [1] and
Vose [9]. Wea begin by considering the algorithm
at the highest level. It can be considered a finite
state machine, where each state represents an
arbitrary generation of the population at a time t.
Between these states there is a transition, τ , to the
next generation. The algorithm can be considered
as a function with parameters, as shown in
Equation (1).
Proceedings of the 2013 International Conference on Education and Educational Technologies
47
Genetic Algorithm = (I, Φ, Ω, s, µ, λ, τ, ι) (1)
In this representation, the following notation is used:
• I is the space of chromosomes, or the underlying search space. Each chromosome is of
length l.
• Φ is a cost function I → R.
• Ω represents a set of probabilistic genetic operators. We will specify these shortly.
• s represents a deterministic selection operator. A side affect of this operator is ensuring
population size remains constant.
• µ is the number of parent individuals to include in reproduction.
• λ is the number of offspring individuals from reproduction.
• τ represents the complete process of transitioning from one generation to the next. This
will be expanded shortly.
• ι represents an arbitrary termination condition.
Initialization of I is carried out by a function randomly sampling the range Z (0,2). This is
done l · µ times. To relate this model to the finite state machine outlined above, we will clarify the operation of τ . Consider the population P at a generation t, P (t):
∀t ≥ 1 : P (t + 1) = τ (P (t)) (2)
The termination condition, ι, can be as simple or
complex as required. For this analysis we will
assume it is simply a maximum generation count,
and that functionality to maintain this count is
provided. In implementation this can be combined
with an average cost value, a relative improvement
or a threshold standard deviation of the population.
We now return our analysis to the genetic
operators, Ω. In the set of reproductive functions
we have recombination (crossover), and mutation.
These can be considered as sexual and asexual
operators respectively, characterized by the number
of input chromosomes used. Because of its simpler
nature we will first consider mutation. It can be
modeled as a function ω: I p → I q, where the
chromosome I is shown as a binary vector. This
means an arbitrary I can be shown as (a1 , ..., al ),
where l is the length of the binary string.
Mutation is the smallest unique change that can be
made to a chromosome. By Definition of the
mutation over a binary chromosome should be the
random change of one bit. To model this, a random
bit should be selected in the chromosome, 0 ≤ k ≤ l,
k ∈ N, and that bit flipped. The function then looks
like this:
(a1 , ..., al ) → (a1 , ..., ak ..., al )
(3)
Crossover is recombination of two chromosomes
without loss of information. Crossover works by
taking two chromosomes and swapping over the
values after a random cross point. To do this, once
again a random is selected, 0 ≤ k ≤ l, k ∈ N, and the
function looks like this:
(a1 , ..., al ), (b1 , ..., bl ) → (a1 , ..., ak , bk+1 , ...bl ), (b1 , ..., bk , ak+1 , ...al )
(4)
Elitism is a property preventing current best chromosomes participating in mutation. Many algorithms implement elitism, as it prevents the fitness of a population decaying. If the population is in a suboptimal area of the search space the best solution is retained until mutation makes a
Proceedings of the 2013 International Conference on Education and Educational Technologies
48
selection closer to the global optimum. From here normal evolution can continue.
3.2 Example run
To illustrate better the operation of a genetic algorithm, we shall dry run an own example. For simplicity we will use a five by five grid, as shown in Figure 3. The “optimal” square is shaded in the centre. Chromosomes. Four chromosome will be defined, each of which is a coordinate x,y on the grid. Cost function. The number of squares the chromosome is away from the optimum square is used.
Fig.3 Search space.
Crossover. Two most optimal chromosomes go to the next generation unchanged, two new ones are created as:
(x1 , y1 ), (x2 , y2 ) → (y2 , x1 ), (y1 , x2 ) (5)
Mutation. Not implemented.
Initial population. Randomly instantiated.
Elitism. Implemented.
(a) Chromosomes.
(b) Search space.
Fig.4 End of generation one.
In generation one the initial population is shown in Figure 4. To progress to generation two, the two best chromosomes are unchanged, and two new ones created by crossover. This is shown in Figure 5. Clearly the chromosome of cost one is selected, and as the remaining three have the same cost, we select the first, 5, and 4. The same process is iterated again, creating generation three, and giving rise to one optimal chromosome. This is in Figure 6.
(a) Chromosomes.
(b) Search space.
Fig.5 End of generation two.
(a) Chromosomes.
(b) Search space.
Proceedings of the 2013 International Conference on Education and Educational Technologies
49
Fig.6 End of generation three.
One can see from this that the search space is systematically sampled, the best chromosomes selected, and their traits passed on into the next generation. The algorithms do work without mutation being implemented; in this case it was left out in the interests of minimizing generation count. In the case of a larger example it would become necessary to prevent the evolution stagnating in a suboptimal area, as without this convergence cannot be proven.
4. Flow of Genetic algorithm
• Initially the initial population is selected
randomly from the sample space which
has many populations.
• The fitness value is calculated for each
chromosome in each population and is
sorted out.
• In selection process two parent
chromosomes are selected through
tournament method.
• The Crossover forms new offspring
(children) from the parent chromosomes
using single point probability.
• Mutation mutates the new offspring using
uniform probability measure.
• In elitism selection the best solution are
passed to the further generation.
• The new population is generated and
undergoes the same process it maximum
number of generation is reached.
4.1 Selection process
Selection is used for choosing the best
individuals, that is, for selecting those
chromosomes with higher fitness values. The
selection operation takes the current population and
produces a ‘mating pool’ which contains the
individuals which are going to reproduce. There
are several selection methods, like biased selection,
random selection, roulette wheel selection,
tournament selection. In this work the following
selection mechanisms are used.
4.2 Tournament Selection
Tournament selection has been used in
this as it selects optimal individuals from diverse
groups. It selects t individuals from the current
population uniformly at random, forms a
tournament and the best individual of a group wins
the tournament and is put into the mating pool for
recombination. This process is repeated the
number of times necessary to achieve the desired
size of intermediate population. The tournament
size controls the selection strength. The larger the
tournament size, the stronger is the selection
process.
4.3 Elitist Selection
Proceedings of the 2013 International Conference on Education and Educational Technologies
50
In order to make sure that the best individuals of
the solution are passed to further generations, and
should not be lost in random selection, this
selection operator is used. So we used a few best
chromosomes from each generation, based on the
higher fitness value and are passed to the next
generation of population.
4.4 Reproduction
To generate a second generation
population of solutions from those selected through
genetic operators: crossover (also called
recombination), and/or mutation. For each new
solution to be produced, a pair of "parent" solutions
is selected for breeding from the pool selected
previously. By producing a "child" solution using
the above methods of crossover and mutation, a
new solution is created which typically shares
many of the characteristics of its "parents". New
parents are selected for each new child, and the
process continues until a new population of
solutions of appropriate size is generated. Although
reproduction methods that are based on the use of
two parents are more "biology inspired", some
research suggests more than two "parents" are
better to be used to reproduce a good quality
chromosome. These processes ultimately result in
the next generation population of chromosomes
that is different from the initial generation.
Generally the average fitness will have increased
by this procedure for the population, since only the
best organisms from the first generation are
selected for breeding, along with a small proportion
of less fit solutions, for reasons already mentioned
above. Although Crossover and Mutation are
known as the main genetic operators, it is possible
to use other operators such as regrouping,
colonization-extinction, or migration in genetic
algorithms.
4.5 Termination
This generational process is repeated until
a termination condition has been reached. Common
terminating conditions are:
• A solution is found that satisfies minimum
criteria
• Fixed number of generations reached
• Allocated budget (computation
time/money) reached
• The highest ranking solution's fitness is
reaching or has reached a plateau such that
successive iterations no longer produce
better results
• Manual inspection
• Combinations of the above
5. Conclusion
This method proves accurate in deducting
fraudulent transaction and minimizing the number
Proceedings of the 2013 International Conference on Education and Educational Technologies
51
of false alert. Genetic algorithm is a novel one in
this literature in terms of application domain. If this
algorithm is applied into bank credit card fraud
detection system, the probability of fraud
transactions can be predicted soon after credit card
transactions. And a series of anti-fraud strategies
can be adopted to prevent banks from great losses
and reduce risks. The objective of the study was
taken differently than the typical classification
problems in that we had a variable misclassification
cost. As the standard data mining algorithms does
not fit well with this situation we decided to use
multi population genetic algorithm to obtain an
optimized parameter.
Future Enhancements
The findings obtained here may not be generalized
to the global fraud detection problem. As future
work, some effective algorithm which can perform
well for the classification problem with variable
misclassification costs could be developed.
REFERENCES
[1] Alan C Schultz. Learning robot behaviours
using genetic algorithms. Navy Center for Applied
Research in Artificial Intelligence, Naval Research
Laboratory, Washington, 1994.
[2] Charbonneau P., Genetic Algorithms in
Astronomy and Astrophysics High Altitude
Observatory. National Center for Athmospheric
Research, pp. 309-334, 1995.
[3] Dr Markus Roggenbach. CS364 Software
testing slides. Swansea University, 2011.
[4] Heinrich Braun. On solving travelling salesman
problems by genetic algorithms. Springer
Berlin / Heidelberg, 1991.
[5] Holland J., Adaptation in Natural and Artificial
Systems. Ann Harbor, MI: University of Michigan
Press. 1975.
[6] Jean-Yves Potvin. Genetic Algorithms for the
Travelling Salesman Problem. Centre de
Recherche sur les Transports, 1996.
[7] Kaya M., Autonomous Classifiers with
Understanable Rule Using Multi-objective Genetic
Algorithms. Expert Systems with Applications.
Vol. 37, no. 4, pp.3489-3494, 2009.
[8] Levi M., Burrows J., Fleming M., Hopkins M.,
The Nature, Extent and Economic Impact of Fraud
in the UK. Report for the Association of Chief
Police Officers' Economic Crime Portfolio. 2007.
[9] Michael Vose. The Simple Genetic Algorithm.
Massachusetts Institute of Technology, 1999.
[10] Minaei-Bidgoli B., Kashy D., Kortemeyer G.,
Punch W., Predicting Student Performance: An
Application of Data Mining Methods with the
Educational Web-based System LON CAPA.
Proceedings of ASEEIIEEE Frontiers in Education
Conference. 2003.
[11] Srinivas M., Patnaik L., Genetic algorithms - a
survey. IEEE Computer Society, 1994.
[12] Thomas Back. Evolutionary Algorithms in
Theory and Practice. Oxford University Press,1996.
[13] UK Statute Law. Data Protection Act 1998.
Office of Public Sector Information, 1998.
[23] Wojciech Paszkowicz. Genetic Algorithms, a
Nature-Inspired Tool: Survey of Applications in
Proceedings of the 2013 International Conference on Education and Educational Technologies
52