
COMP 7570 – Neural Networks Project Report

Neural Network Classification and its Applications in Insurance Industry

Inderjeet Singh

7667292

Department of Computer Science

University of Manitoba

December 8, 2011


Abstract

Neural networks used for classification, also known as neural classifiers, have many advantages, but extracting rules from the trained networks is a hard task and has been an active area of research. [Lu] developed a method for extracting rules from neural networks and advocated the use of neural networks for classification and data mining in general. [Smith] presented a case study on the use of neural networks for customer retention in the insurance industry, discussing the importance of predicting patterns of customer terminations for remaining profitable in this highly competitive industry. [Viaene] deployed neural networks for predicting claim fraud in the automobile insurance industry, where the relevance of the inputs (fraud indicators) is important for detection; they used Bayesian-learning neural networks (MLP-ARD) to produce importance rankings of the fraud indicators.

1. Introduction

Neural networks [Scuse] are models of intelligence consisting of large numbers of simple processing units, also known as neurons or nodes, that collectively perform very complex pattern-matching tasks. These models perform stimulus-response (input-output) mapping. Classification, a branch of data mining [Wiki], is the process of learning rules or models from training data that generalize the known structure, and then classifying new data with these rules.


In data mining, classification is traditionally performed with decision tree algorithms and logistic regression. Neural networks are now also used as an approach to classification. Classification with neural networks is a popular area of research and has gained a lot of attention, particularly in data mining, where the volumes of data involved are very large.

Neural networks used for classification have many advantages. They are data driven and self-adaptive, they can approximate complex functions with high accuracy, they can build non-linear models of real-world applications, and they are tolerant of noisy data. Neural classifiers have problems as well. They usually lack transparency and behave as black boxes; their training time is long, requiring many repeated epochs over the training data; and extracting classification rules [Lu] from them is difficult because of their complex, hard-to-interpret structure with many links between input, hidden and output units.

Neural networks have already been used in real-world applications such as bankruptcy prediction, credit scoring, quality control, handwriting recognition and many more. In this report I focus specifically on their applications in the insurance industry.

The insurance industry is highly competitive, and the success of an insurance company depends on profit and growth. Profit depends on various factors: predicting the average claim cost and the frequency of claims, and examining the effect of changes in policy prices or premium costs on customer retention [Smith], are critical. Neural classification has


been applied in this regard to learn and predict whether a customer will terminate or renew a policy. Claim fraud is another important issue in this industry. Companies face huge monetary losses from fraudulent claims made by the insured, and they are looking for solutions for fraud prediction and diagnosis. Neural classification [Viaene] helps identify which fraud indicators or inputs are most crucial for predicting fraudulent claims. Both of these applications used different variants of multilayer feedforward neural networks.

2. Extracting Symbolic Classification Rules from Neural Networks

In this work, [Lu] focus on mining classification rules from large databases with the help of neural networks. The neural network approach has advantages such as a low classification error rate and robustness to noise.

The neural network based classification approach they describe consists of three phases. The first phase is network construction, in which a three-layer feedforward neural network is built. The method for creating the network is inspired by the [Setiono 1995] method of dynamically constructing the network: creation starts from a single hidden unit, and hidden units are added dynamically until the network classifies all the input patterns correctly. Rather than minimizing the sum of squared errors, [Setiono 1995] maximizes a likelihood function. Also, unlike the back-propagation method, this method does not get stuck in local minima.

The second phase is network pruning. A penalty function [Setiono] is added to the error function, which helps prune the network by weight removal; the penalty function used in this approach is the sum of squared weights. While pruning the network, the classification error rate should not increase. The first objective of pruning is to discourage non-essential connections, and the second is to prevent the connection weights from attaining large values. Removing unnecessary weights reduces the network's complexity.
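A minimal sketch of the idea of penalty-based pruning follows, assuming a generic error term, a sum-of-squared-weights penalty and a caller-supplied classify_error function; the penalty coefficient and the magnitude threshold are illustrative choices, not the values or procedure used by [Setiono].

```python
import numpy as np

def penalized_error(errors, weights, lam=1e-3):
    """Error function plus a sum-of-squared-weights penalty
    (lam is an assumed penalty coefficient)."""
    return np.sum(errors ** 2) + lam * sum(np.sum(w ** 2) for w in weights)

def prune_small_weights(weights, classify_error, threshold=0.1):
    """Zero out small-magnitude weights one at a time, keeping a removal
    only if the classification error rate does not increase."""
    baseline = classify_error(weights)
    for w in weights:
        for idx in np.argwhere(np.abs(w) < threshold):
            saved = w[tuple(idx)]
            w[tuple(idx)] = 0.0
            if classify_error(weights) > baseline:   # undo a harmful removal
                w[tuple(idx)] = saved
    return weights
```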

The last phase is rule extraction from the pruned network. Extracting rules is not easy, as the number of links in the pruned network is still too large to define explicit relationships in terms of if-then-else rules, and it is difficult to derive a clear relationship between the continuous activation values of the hidden units and the output units. Rule extraction from a pruned network consists of four steps (a simplified sketch of the first two steps follows the list):

1. Use a clustering algorithm to find clusters of the hidden units' activation values.
2. Enumerate the discretized hidden unit activation values, compute the corresponding outputs, and generate rules that describe the network output in terms of the hidden unit activation values.
3. For every hidden unit, enumerate the input values that lead to each discretized activation value and generate rules that describe the hidden unit activation values in terms of the input units.
4. Merge the rules obtained in the previous two steps to obtain rules that map inputs directly to outputs.
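As a rough illustration of the first two steps only, the sketch below clusters each hidden unit's activation values and then tabulates the network output for every combination of discretized activations; the clustering choice (k-means with two clusters) and the output_unit callable are assumptions for the sketch, not [Lu]'s actual algorithms.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def discretize_activations(activations, n_clusters=2):
    """Step 1: cluster each hidden unit's activation values and keep the
    sorted cluster centres (n_clusters is an assumption)."""
    centres = []
    for j in range(activations.shape[1]):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(activations[:, [j]])
        centres.append(np.sort(km.cluster_centers_.ravel()))
    return centres

def output_rules(centres, output_unit):
    """Step 2: enumerate the discretized hidden activations, feed each
    combination through the output unit and record the predicted class."""
    rules = []
    for combo in itertools.product(*centres):
        label = output_unit(np.array(combo))   # hypothetical output-layer function
        rules.append((combo, label))
    return rules
```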

They illustrated their rule extraction approach on one of ten classification problems (functions) used in earlier research, choosing function 3 to demonstrate the approach. Function 3 is shown in Figure 1.


To solve the classification problem represented by function 3, they created a neural network as described in the network construction phase above. They used the people database, which consists of nine inputs (salary, commission, age, elevel, car, zip code, house-value, house-years and loan) and one output representing the class; an input tuple can belong to group A or group B. The inputs were represented as binary strings of 0s and 1s, with each bit set to 0 or 1 depending on which subinterval the input value falls into (a small sketch of this style of subinterval coding is given after Figure 1). With this binary coding scheme there were a total of 37 binary input units for the values of the 9 inputs (shown in Fig 11), plus one input unit for the bias, making 38 units in total. The unpruned network consisted of six hidden units and one output unit, giving 234 links. The training dataset they used had 2000 tuples of these inputs. Network pruning, performed as described above, yields a much simpler network, shown in Fig 2; the pruned network consists of only two hidden units and six input units.

Figure 1: Function 3 [Lu]
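The exact coding in [Lu] (Fig 11) assigns different numbers of bits to different attributes; the sketch below only illustrates the general idea of subinterval coding for a single continuous attribute, with the range and number of subintervals chosen arbitrarily.

```python
import numpy as np

def subinterval_bits(value, low, high, n_bits):
    """Encode a continuous attribute as n_bits binary inputs: the bit of the
    subinterval containing the value is set to 1, all others to 0. The number
    of subintervals per attribute is an assumption, not [Lu]'s exact coding."""
    edges = np.linspace(low, high, n_bits + 1)
    bits = np.zeros(n_bits, dtype=int)
    idx = np.clip(np.searchsorted(edges, value, side="right") - 1, 0, n_bits - 1)
    bits[idx] = 1
    return bits

# Example: an age of 47 in the range [20, 80) with 4 subintervals
# falls into [35, 50), so subinterval_bits(47, 20, 80, 4) -> [0, 1, 0, 0].
```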


Before extracting the rules from the pruned network shown in Fig 2, the four steps described above are executed. The activation values of its two hidden units are clustered, with clusters centred at 0.46 and 0.81, resulting in two discretized activation values per hidden unit. For the first hidden unit, input tuples are split into two groups: one with activation values in [-1, 0.46) and the other with values in [0.46, 1). For the second hidden unit, input tuples are split in the same way into the groups [-1, 0.81) and [0.81, 1). The discretized activation value of hidden unit j (j = 1, 2) for a pattern is denoted α_j, taking the value 1 or 2: α_j = 1 indicates that the input tuple belongs to group A, and α_j = 2 that it belongs to group B. For an input to be classified into group A, either α_1 or α_2 must equal 1; otherwise the input is classified into group B.

To generate rules for each hidden unit that do not involve the weights, [Lu] used the X2R algorithm they developed earlier. The rules obtained for the two hidden units are combined to give rules for the final output in terms of the input units. For function 3, they extracted a total of 5 rules with a total of 10 conditions from the pruned network; these rules can then be expressed in terms of the actual input attributes age and elevel, with a default rule assigning all remaining tuples to group B.

Figure 11: Coding of the attributes of the neural network inputs [Lu]

For evaluation and analysis, they compared their approach of extracting rules from neural networks with the decision tree classifier C4.5. Testing of the neural network classification was done on eight functions similar to function 3 described above, with random number generation used to build the datasets for testing the rules generated for the different functions or classification problems. They used three-fold cross validation to estimate the classification accuracy of the generated rules. Fig 3 shows the results of their evaluation of the quality of the rules generated by the neural networks for the different functions. They found that neural classifiers generate far fewer rules than the decision tree algorithm C4.5, as shown in Fig 4, while the accuracy and the number of conditions per rule were comparable for both approaches.
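A minimal sketch of estimating classification accuracy with three-fold cross validation is shown below; the randomly generated data and the decision tree stand-in are placeholders, not the functions or rule sets evaluated by [Lu].

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for tuples generated for one of the functions.
rng = np.random.default_rng(0)
X = rng.random((2000, 9))
y = (X[:, 2] < 0.4).astype(int)            # toy class rule on one attribute

# Three-fold cross-validated accuracy of a classifier (a C4.5-style tree here).
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=3)
print("Mean accuracy over 3 folds:", scores.mean())
```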

They concluded that further effort could be made to speed up the training of neural classifiers, and in this regard they suggested incremental training and rule extraction from the database.


Figure 2: Pruned network for Function 3 [Lu]

Figure 3: Averages of accuracy rates, the number of rules and the average conditions per rule obtained [Lu]


Figure 4: The number of rules extracted from neural networks (NN) and C4.5 algorithm (DT) [Lu]

3. Neural Network Applications in Insurance Industry

3.1. An Analysis and Prediction of Customer Retention Patterns and Pricing

The problem of concern in the insurance industry is to set prices that match the claim costs while still retaining existing customers and acquiring new ones. There has been a lot of research in this area, but due to the competitiveness of the industry, hardly any results or methods for solving this problem get published.


In this case study, [Smith] work on the structured problem of customer retention modelling using regression, decision trees and neural networks, which are supervised learning methods used to learn the relationships between variables (inputs) and decisions (outputs). They also study the unstructured problem of analysing claim patterns using clustering, an unsupervised learning method. In this report I discuss mainly the first problem, customer retention using neural classification, which is the main focus of this project.

The growth of an insurance company depends on attracting new customers and retaining existing ones. Whether a customer renews or terminates a policy depends on the premium price, service, personal preference, insured amount, convenience and many other factors. The analysis of customer retention in this case study has two goals: first, to understand the reasons for policy termination, and second, to develop a tool (based on a neural classifier) for predicting likely policy terminations. This tool helps analyze the impact of changes in premium costs on likely customer terminations, and identifying customers likely to terminate their policies can aid direct marketing campaigns.

To analyze customer retention patterns, [Smith] obtained data on 20914 auto policy holders whose policies were due to expire in April 1998. The dataset included demographic information (age group, postcode, etc.), policy details (premium, sum insured, etc.) and policy holder history (rating, years on rating, claim history, etc.), as shown in Fig 5 below. In this dataset, 7.1% of policy holders did not renew, and their policies terminated. Through meetings with the insurance company, [Smith] found that premium price and sum insured were major factors in likely policy terminations.


They used SAS Enterprise Miner for the evaluation, a widely known GUI-based commercial software package for applying data mining techniques. The setup for this experiment involves several levels: the first level is data processing (variable selection, data transformation and data partitioning), the second is the application of data mining techniques (clustering, regression, decision trees and neural networks), and the last is the analysis (assessment, bar charts). The process flow diagram is shown in Fig 6. In the data transformation step they normalized and log-transformed the variables; after transformation they had a total of 29 independent inputs and one output (the dependent variable, a yes/no termination decision), shown in Fig 5.
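Since the exact SAS transformations are not specified, the following is a minimal sketch of log-transforming and normalizing skewed numeric inputs; the column names and values are placeholders, not fields from the [Smith] dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder frame standing in for the policy-holder dataset.
df = pd.DataFrame({
    "premium": [420.0, 510.0, 980.0, 315.0],
    "sum_insured": [15000.0, 22000.0, 54000.0, 9000.0],
    "terminated": [0, 1, 0, 0],
})

skewed = ["premium", "sum_insured"]                        # assumed skewed monetary inputs
df[skewed] = np.log1p(df[skewed])                          # log transform
df[skewed] = StandardScaler().fit_transform(df[skewed])    # normalize to zero mean, unit variance

X, y = df.drop(columns="terminated"), df["terminated"]
```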

Regression, decision tree and neural network methods (available in the SAS software) were used to build three separate classification models, each predicting likely terminations or renewals of policies. A three-layer multilayer feedforward neural network with 29 input units, 25 hidden units and a single output unit was used. The units used the hyperbolic tangent activation function, and the default learning rule, based on a multiple Bernoulli error function, was applied. The error is minimized by adjusting the weights with a conjugate gradient technique.
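A roughly comparable network can be sketched with scikit-learn's MLPClassifier. Note that MLPClassifier does not offer a conjugate gradient optimizer or the SAS defaults, so 'lbfgs' is used here as a stand-in, and X_train / y_train are assumed to come from a transformed dataset like the one above.

```python
from sklearn.neural_network import MLPClassifier

# 29 inputs -> 25 tanh hidden units -> 1 logistic output, cross-entropy loss.
# 'lbfgs' is a stand-in for the conjugate gradient optimizer used in SAS.
retention_net = MLPClassifier(
    hidden_layer_sizes=(25,),
    activation="tanh",
    solver="lbfgs",
    max_iter=1000,
)

# retention_net.fit(X_train, y_train)
# termination_prob = retention_net.predict_proba(X_test)[:, 1]
```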

All three methods are executed on the test set to classify the likely terminating policies. The test set consists of 20% of the entire dataset and is ranked in descending order of the likelihood that a policy holder terminates the policy. Fig 7 shows the lift chart comparing the performance of the three methods in classifying policy holders as terminating. A lift chart measures the effectiveness of a predictive model, and the area under the lift curve indicates how accurate the model is. The X-axis depicts the percentage of policy holders selected from the ranked test set, and the Y-axis depicts the percentage of likely terminating customers captured within that selection. As can be seen in Fig 7, the white line (the lift curve for the neural network) has the largest area, which means it identifies the most terminating policies. Reading the chart at 10%: if only the top 10% of policy holders from the ranked list are selected, the neural network model captures 50% of the likely terminations, compared with only 40% for regression and 28% for the decision tree.
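A lift (cumulative gains) curve of this kind can be computed from predicted termination probabilities as sketched below; y_test and termination_prob are assumed to exist from a model such as the one above.

```python
import numpy as np

def lift_curve(y_true, scores, fractions=np.linspace(0.05, 1.0, 20)):
    """Cumulative-gains style lift curve: for each fraction of the test set
    taken from the top of the ranked list, the share of all actual
    terminations that fall inside that fraction."""
    order = np.argsort(scores)[::-1]           # rank by predicted termination likelihood
    y_sorted = np.asarray(y_true)[order]
    total_pos = y_sorted.sum()
    captured = []
    for f in fractions:
        k = max(1, int(round(f * len(y_sorted))))
        captured.append(y_sorted[:k].sum() / total_pos)
    return fractions, np.array(captured)

# Example: fractions, gain = lift_curve(y_test, termination_prob)
```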

The effect of the decision threshold on the number of policies classified as terminated by the network is also examined. With a decision threshold of 0.5, a policy is classified as terminated if the likelihood predicted by the neural network is above 0.5. It is observed that a low threshold of 0.1 helps capture all likely terminations, so marketing mail can be sent to these customers to encourage them to renew; however, a low threshold reduces the accuracy of the termination predictions. It is better to keep the decision threshold high (and the accuracy high) if premiums are being changed for the policy holders most likely to terminate, which ensures that premium changes are made only for likely terminating customers.

Misclassification costs can be defined to build a profit/loss matrix. For example, if a policy holder is classified as a likely termination but renews the policy, the misclassification cost is the discount offered as an incentive to renew; on the other hand, if a customer is not predicted to terminate but actually does, the misclassification cost is the loss of that customer's premium for the next year. The optimal decision threshold then needs to be determined to minimize misclassification costs and maximize profit.
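A minimal sketch of choosing the decision threshold by sweeping candidate values and minimizing the total misclassification cost is given below; the two cost figures are purely illustrative, not taken from [Smith].

```python
import numpy as np

def best_threshold(y_true, scores, cost_fp=50.0, cost_fn=400.0):
    """Sweep decision thresholds and pick the one with the lowest expected
    misclassification cost. cost_fp ~ discount offered to a renewer wrongly
    flagged as terminating; cost_fn ~ lost premium for a missed termination.
    Both figures are illustrative, not from [Smith]."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))      # flagged but actually renewed
        fn = np.sum(~pred & (y_true == 1))     # missed actual termination
        costs.append(fp * cost_fp + fn * cost_fn)
    return thresholds[int(np.argmin(costs))]
```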

Pricing the policies is the tricky part. Pricing occurs in four steps: predicting claim costs, identifying the right premium price for profitability, analysing customer retention patterns given the difference between old and new premiums, and finally adjusting the premiums to retain customers while still making a profit. These four steps are executed each time before marketing mail is sent to the customers likely to terminate. The new policy price may not suit a customer, who may then decide to terminate the policy. The data with the new policy price, together with the price difference, is fed into the neural network model to predict the likely terminations under the new prices, and the prices can then be adjusted to balance profitability and customer retention. Optimal pricing is thus an iterative process aimed at finding this balance.

Figure 5: The 29 input attributes [Smith]


Figure 6: Process flow diagram for customer retention classification [Smith]

Figure 7: Lift Chart showing percentage of policy holders classified for likely termination vs. percentage of policy holders selected from the test dataset. It shows the performance comparison for

classification techniques such as regression, decision tree and neural networks [Smith]


3.2. Auto Claim Fraud Detection using Bayesian Learning Neural Networks

Companies face huge monetary losses from fraudulent claims made by the insured, and insurance companies are looking for solutions for fraud prediction and diagnosis. These days they use tools that rely on neural networks and artificial intelligence to address this problem. Neural networks provide general, scalable, parameterized non-linear mappings from inputs to outputs, but they also pose practical problems, such as how to set the weights before training starts and how to avoid fitting the noise in the training data, which make them difficult to implement. These issues are usually addressed in ad hoc ways.

In this paper, [Viaene] use Bayesian learning to deal with these issues while training neural networks, replacing ad hoc choices with a principled, step-by-step treatment. They explore the predictive power of Multi-Layer Perceptron (MLP) neural network classifiers trained with the [MacKay] evidence framework approach to Bayesian learning, which is used to optimize an automatic relevance determination (ARD) objective function. The ARD objective function is useful for determining the relative importance of the inputs to the model. ARD and the evidence framework approach are described in more detail below.

They used an MLP back-propagation neural network, as shown in Fig 8. The hidden nodes have a hyperbolic tangent transfer function and the output layer has a logistic sigmoid activation function. In Fig 8, x represents the input vector, z the outputs of the hidden units and y the final output. The continuous output y(x) of this MLP classifier can be interpreted as the posterior probability P(t = 1 | x), i.e. the probability of class t = 1 given the input vector x. The Bayesian posterior probability estimates produced by the MLP are used to classify an input vector into the predefined classes by choosing a threshold on the scoring interval. While training the network, the weight vector w is adjusted so that the objective function, the sum of squared errors, is minimized.

They measured the accuracy of the predictions with two metrics: the percentage correctly classified (PCC) and the area under the receiver operating characteristic curve (AUROC).
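A minimal numpy sketch of such a forward pass, with the output interpreted as P(t = 1 | x), and of the two metrics follows; the weight matrices, test data and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def mlp_posterior(X, W1, b1, w2, b2):
    """Forward pass of a three-layer MLP: tanh hidden units and a logistic
    sigmoid output, interpreted as P(t = 1 | x)."""
    z = np.tanh(X @ W1 + b1)                    # hidden-unit outputs
    return 1.0 / (1.0 + np.exp(-(z @ w2 + b2)))

# Illustrative evaluation, assuming trained parameters and a test set exist:
# p = mlp_posterior(X_test, W1, b1, w2, b2)
# pcc = accuracy_score(y_test, p >= 0.5)        # percentage correctly classified
# auroc = roc_auc_score(y_test, p)              # area under the ROC curve
```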

Figure 8: Example of a three-layer neural network [Viaene]

While optimizing the neural classifier for the best generalization, it must be prevented from learning the noise in the training data, known as overfitting. To avoid overfitting, a validation dataset is usually used; a better approach is to add a regularization or penalty term to the objective function. The unit-based regularization term is also known as ARD. The final objective function then becomes E(w) = E_D(w) + Σ_c α_c E_Wc(w), where E_D(w) is the data error term, each weight group c collects the weights leaving one input unit, E_Wc(w) is the sum of squared weights in that group, and α_c is the regularization parameter of the group.


They discussed how critical input selection is to the overall classification process. The regularization parameter α_c in the ARD objective function helps suppress the weights leaving an input: the larger α_c, the more irrelevant the corresponding input, and vice versa. The regularization parameters allow MLP-ARD to include a large number of potentially relevant input variables, eliminating the effort needed to delete irrelevant ones beforehand. This amounts to adjusting the degree of importance of the input variables in the classification process, known as soft input selection.
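The evidence-framework updates themselves are beyond a short sketch, but the soft-selection idea can be illustrated by scoring each input by the inverse of its regularization parameter, or, as a rough proxy when only trained weights are available, by the norm of the weights leaving the input; both scores below are simplifications, not [Viaene]'s procedure.

```python
import numpy as np

def ard_relevance(alphas):
    """Relevance score per input under ARD: a large regularization parameter
    alpha_c drives the input's weights to zero, so relevance is taken here
    as 1 / alpha_c (a common, but simplified, reading)."""
    return 1.0 / np.asarray(alphas)

def weight_norm_relevance(W1):
    """Rough proxy when only trained first-layer weights are available:
    the L2 norm of the weights leaving each input unit."""
    return np.linalg.norm(W1, axis=1)

# ranking = np.argsort(ard_relevance(alphas))[::-1]   # most relevant input first
```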

Bayesian learning is used to build probabilistic models of the dataset, which are then used for prediction. Bayesian models are described in terms of a posterior probability density over the weight space, and predictions are made by integrating over this posterior. The evidence framework approach to Bayesian learning for MLP classifiers that they discuss requires a local Gaussian approximation to the posterior probability density. They introduce the concept of input relevance, or ARD, into the evidence framework with the help of these Gaussian assumptions. The main objective of all this is to obtain appropriate values for the weight vector w and the regularization parameters α_c.

They used a Personal Injury Protection (PIP) automobile insurance claim fraud detection dataset for their evaluation. The PIP claims dataset consists of 1399 closed automobile insurance claims from accidents that occurred in Massachusetts, USA in 1993, investigated for fraud suspicion by domain experts. The dataset includes 25 binary fraud indicators (red flags), shown in Fig 9, and 12 non-indicator inputs (non-flags) that investigators find valuable in assessing a fraudulent claim. In this dataset, ACC refers to the accident, CLT to the claimant, INJ to the injury, INS to the insured driver, and so on. The input selection was done after discussions with the domain experts.

These closed claims are reviewed by a claim manager for suspected fraud on the basis of these indicators or inputs. Each claim is rated on a 10-point suspicion scale and is also reviewed on the basis of a verbal assessment by the claim manager. A claim is suspected of fraud if its suspicion score is >= 4, in which case it is investigated further; otherwise no investigation is done.

Figure 9: PIP binary fraud indicators with values (0 = No, 1 = Yes) [Viaene]

In the empirical evaluation, they perform input selection using MLP-ARD on the PIP insurance claims data. The input importance ranking obtained from MLP-ARD is then compared with the input importance rankings from logistic regression and decision tree learning. Logistic regression was used as the first reference for comparison, taking the relative importance of an input from its regression coefficient. Decision trees were used as the second reference; implementation-wise they used an m-estimation smoothed and curtailed C4.5 variant, an improved version of the C4.5 algorithm, in which the importance of an input is determined by its role in splitting the tree so as to achieve the maximum entropy difference. The relative performance of the decision tree implementation was not as good as that of logistic regression and MLP-ARD.

Ten-fold cross validation can also be performed for input evaluation with the three classification approaches. This leads to ensemble-based input assessment, in which the input assessments are aggregated and averaged over the 10 models of the cross validation; a small sketch of this aggregation is given below. Fig 10 shows the input rankings derived from the three methods: the rank 1 input is the most important, and the number in brackets is the input's importance relative to rank 1. From the rankings it is observed that six of the MLP-ARD top ten inputs are also in the logistic regression top ten, and seven are in the C4.5 top ten, so MLP-ARD and logistic regression give comparable input rankings. All three classifiers can also be used at the same time to give an ensemble classifier.
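A minimal sketch of ensemble-based input assessment, using absolute logistic regression coefficients as the per-fold importance measure and averaging them over ten folds; the choice of importance measure is an assumption, and X, y are assumed to be numpy arrays of inputs and fraud labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def ensemble_input_importance(X, y, n_splits=10):
    """Fit one model per fold and average the per-input importances
    (absolute logistic regression coefficients) over the folds."""
    importances = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        importances.append(np.abs(model.coef_).ravel())
    avg = np.mean(importances, axis=0)
    ranking = np.argsort(avg)[::-1]                  # rank 1 input first
    return ranking, avg[ranking] / avg[ranking[0]]   # importance relative to rank 1
```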


Figure 10: Input Importance Ranking [Viaene]

4. Discussion

I found some of the reasoning for the results missing from the [Lu] paper on rule extraction; they did not explain many of the results in their analysis. For example, they did not explain why for function 4 the accuracy with the neural network is lower than with C4.5, or why for function 5 the number of conditions per neural network rule is lower than per C4.5 rule while for all the other functions the opposite holds. The paper was written in 1996, and at that time research in this area was at a nascent stage. They also did not explain exactly how they arrived at the pruned network (refer to Fig 2) with only four inputs. Their advocacy for the use of neural networks in classification is justified for scenarios of data classification where training time is not a constraint.

While searching for papers on the use of neural networks in the insurance industry, I found that not much research is publicly available. There must surely be credible work on the use of neural networks within insurance companies, but due to the competition it is not disclosed.

[Smith] used the SAS Enterprise Miner software for their analysis. While processing the data they used the variable selection node of the SAS tool, but they did not explain how this functionality would work without the tool. In their results they gave classification accuracies for the 0.1 and 0.5 decision thresholds, but the way they presented the numbers for actually renewed, actually terminated, classified-as-renewed and classified-as-terminated policies is not quite clear to me. Their evaluation is also not very strong, as they present only the lift chart for their comparisons with the other classification approaches.

The paper by [Viaene] does not quite live up to its title, "Auto claim fraud detection using Bayesian learning neural networks". The researchers talk more about developing the MLP-ARD approach and incorporating it into the evidence framework method than about its use in detecting claim fraud. The focus is on the theoretical side, with many equations, and the background information for the various methods used in the paper is sparse, making the paper difficult to understand. Many assumptions and approximations are used to make their method work for soft input selection.


5. Conclusions and Future Work

Using the [Lu] method of rule extraction, high quality rules can be obtained from datasets. Their work acts as a bridge for using neural networks for classification in data mining, although the time required for extracting rules is still large compared with the decision tree approach. As directions for future work they suggested incremental training and rule extraction from the database; another way of reducing the training time and increasing the accuracy is to reduce the number of input units of the network.

[Smith] tried to find ways of pricing policies optimally while retaining growth and profitability; their case study used neural networks to learn and predict customer retention patterns. They discussed open issues such as identifying the misclassification costs for customer retention analysis, and implementing and incorporating their method in the insurance industry at a larger scale and in real time. They would like to work on these issues in collaboration with the industry.

[Viaene] took a step towards understanding the underlying semantics of a neural network's output predictions, an understanding that is important for the use of neural networks in everyday decision making for predicting claim fraud. The impact of input selection on the claim fraud detection process was their main concern, and they demonstrated the soft input selection capabilities of their proposed MLP-ARD method on a real-life insurance dataset.

I think that neural networks, with their capacity for building complex models, can be used more effectively in insurance and other industries, and there is still scope for a lot of work.


References

1. [Lu] Hongjun Lu, Rudy Setiono and Huan Liu, Effective Data Mining Using Neural Networks, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, 1996, pp. 957-961.

2. [MacKay] D. J. C. MacKay, The evidence framework applied to classification networks, Neural Computation, Vol. 4, No. 5, 1992, pp. 720-736.

3. [Scuse] David Scuse, Chapter 1 Intro, class slides, University of Manitoba.

4. [Setiono 1995] R. Setiono, A neural network construction algorithm which maximizes the likelihood function, Connection Science, Vol. 7, No. 2, 1995, pp. 147-166.

5. [Setiono] R. Setiono, A penalty-function approach for pruning feed forward neural networks, Neural Computation, Vol. 9, No. 1, January 1997, pp. 185-204.

6. [Smith] K.A. Smith, R.J. Willis and M. Brooks, An Analysis of Customer Retention and Insurance Claim Patterns Using Data Mining: A Case Study, The Journal of the Operational Research Society, Vol. 51, May 2000, pp. 532-541.

7. [Viaene] S. Viaene, G. Dedene and R.A. Derrig, Auto claim fraud detection using Bayesian learning neural networks, Expert Systems with Applications, Vol. 29, 2005, pp. 653-666.

8. [Wiki] Data Mining, http://en.wikipedia.org/wiki/Data_mining