De Jong and Heller SAS miner preliminary analysis
October 25, 2007
In this document we go through the data exploration that one should always undertake before embarking on the construction of a statistical model. The occurrence of a claim (claim flag) and the claim amount (clm amt) are the response variables. The effects of the other variables on these responses are examined here, graphically and numerically.
Kidsdriv is the number of kids in the car when driving. The left panel in Figure 1 displays a histogram of kidsdriv. It shows that driving without any kids in the car is the most common case and that the maximum number of kids in the car is 4. The right panel displays box plots of log claim amount by the number of kids in the car. These indicate that, given that a claim occurs, the claim amount does not vary with the number of kids. Therefore, kidsdriv is not a promising predictor for claim amount.
Figure 1: Kids in the car. Left panel: histogram of number of kids (frequency). Right panel: box plots of log claim amount by number of kids.
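The box plot comparison in the right panel can also be made numerically by comparing quartiles of log claim amount across groups. The following is a minimal sketch, assuming each claim is stored as a (kidsdriv, claim amount) pair; the records are made-up illustrative values, not the data analysed here, and `log_amount_summary` is a hypothetical helper name.

```python
import math
import statistics
from collections import defaultdict

def log_amount_summary(claims):
    """Group positive claim amounts by level and summarise log(amount).

    claims: iterable of (level, amount) pairs, amount > 0 (claims only).
    Returns {level: (q1, median, q3)} on the log scale.
    """
    groups = defaultdict(list)
    for level, amount in claims:
        groups[level].append(math.log(amount))
    summary = {}
    for level, logs in sorted(groups.items()):
        # quantiles() with n=4 returns the three quartile cut points
        q1, med, q3 = statistics.quantiles(logs, n=4, method="inclusive")
        summary[level] = (q1, med, q3)
    return summary

# Hypothetical records: (number of kids in the car, claim amount)
toy_claims = [
    (0, 1200.0), (0, 3500.0), (0, 5200.0), (0, 9800.0), (0, 2500.0),
    (1, 1500.0), (1, 4100.0), (1, 6000.0), (1, 2900.0), (1, 8700.0),
]
summary = log_amount_summary(toy_claims)
for level, (q1, med, q3) in summary.items():
    print(f"kids={level}: Q1={q1:.2f} median={med:.2f} Q3={q3:.2f}")
```

Similar medians and interquartile ranges across levels are what a box plot with overlapping boxes displays.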
In Table 1, the probability of a claim tends to increase as the number of kids increases (the slight drop at 4 kids rests on only 0.04% of policies). This indicates that kidsdriv is a potential candidate to predict claim occurrence.
Table 1: Kids in the car
No. of kids   Frequency   Percent claims
0             88.0%       25%
1             7.8%        39%
2             3.4%        40%
3             0.7%        53%
4             0.04%       50%
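A table of this form can be tabulated directly from policy-level records. The following is a minimal sketch, assuming each policy is a (kidsdriv, claim flag) pair with the flag coded 0 or 1; the ten records are invented for illustration and `claim_rate_table` is a hypothetical helper.

```python
from collections import Counter

def claim_rate_table(policies):
    """Return {level: (share_of_policies, claim_rate)} for one categorical variable.

    policies: iterable of (level, flag) pairs, flag 1 if a claim occurred else 0.
    """
    policies = list(policies)
    counts = Counter(level for level, _ in policies)
    claims = Counter(level for level, flag in policies if flag == 1)
    total = len(policies)
    return {
        level: (counts[level] / total, claims[level] / counts[level])
        for level in sorted(counts)
    }

# Hypothetical records: (number of kids in the car, claim flag)
toy_policies = [
    (0, 0), (0, 0), (0, 0), (0, 1), (0, 0), (0, 0), (0, 1), (0, 0),
    (1, 1), (1, 0),
]
table = claim_rate_table(toy_policies)
for level, (share, rate) in table.items():
    print(f"{level}  {share:6.1%}  {rate:6.1%}")
```

The first number per level plays the role of the Frequency column and the second the Percent claims column.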
Plcydate is the date on which the policy starts. Policy starting dates are spread evenly from March 1993 to June 1998. The average starting dates are similar for policies with and without claims, which means that old and new policy holders do not differ in terms of claim occurrence. There is no pattern between claim amount and policy starting date. Therefore, this variable is not used for predicting either clm flag or clm amt.
Travtime The top left panel in Figure 2 displays the histogram of travel time between home and work. The mean travel time is 33.4 minutes. The top right panel displays boxplots of travel time without a claim (left) and with a claim (right). The medians and spreads are similar, which means that travtime does not affect claim occurrence. In the bottom left panel, claim amount (given a claim occurred) is plotted against travel time. In the bottom right panel, log claim amount (given a claim occurred) is plotted against travel time. The horizontal smooth lines in both plots indicate that claim amount is unrelated to travel time.
Figure 2: Travel time from home to work. Panels: density of travel time; travel time by claim (No, Yes); claim amount against travel time; log claim amount against travel time.
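A flat smooth line corresponds to a fitted slope close to zero. The smoother used for the figures is not stated, so as a stand-in the sketch below fits an ordinary least-squares line to log claim amount against travel time. The data points are invented for illustration and `ols_slope` is a hypothetical helper.

```python
import math

def ols_slope(xs, ys):
    """Least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (travel time in minutes, claim amount) pairs, claims only
toy = [(10, 4200.0), (20, 3900.0), (30, 4500.0),
       (40, 4000.0), (50, 4300.0), (60, 4100.0)]
times = [t for t, _ in toy]
log_amounts = [math.log(a) for _, a in toy]

slope, intercept = ols_slope(times, log_amounts)
print(f"slope per minute: {slope:.5f}, intercept: {intercept:.3f}")
```

A slope that is small relative to the spread of log claim amount is the numerical counterpart of a horizontal smooth line.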
Car use There are two types of car usage: commercial and private. Commercial cars account for 36.8% of policies and private cars for 63.2%. As shown in Table 2, the probability of a claim is higher for commercial cars. Thus, car usage is a potential explanatory variable for claim occurrence.
Figure 3 displays boxplots of claim amount (left) and log claim amount (right) by car usage. Car usage does not look promising as an explanatory variable for claim amount.
Table 2: Car use
Car use      Frequency   Percent claims
Commercial   36.8%       35%
Private      63.2%       22%
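Because every policy falls in exactly one level of each variable, the frequency-weighted average of the Percent claims column should recover the same overall claim rate from any of these tables. The sketch below applies this as a consistency check to the rounded figures printed in Tables 1 and 2; the function name is hypothetical, and agreement is only expected up to rounding.

```python
def overall_claim_rate(rows):
    """Frequency-weighted mean claim rate: sum of P(level) * P(claim | level).

    rows: list of (frequency_percent, percent_claims) pairs from a table.
    Returns the implied overall claim rate in percent.
    """
    total_weight = sum(freq for freq, _ in rows)
    return sum(freq * rate for freq, rate in rows) / total_weight

# (Frequency %, Percent claims %) as printed in Table 1 and Table 2
table1_kidsdriv = [(88.0, 25), (7.8, 39), (3.4, 40), (0.7, 53), (0.04, 50)]
table2_car_use = [(36.8, 35), (63.2, 22)]

rate1 = overall_claim_rate(table1_kidsdriv)
rate2 = overall_claim_rate(table2_car_use)
print(f"Table 1 implies {rate1:.1f}% of policies with a claim")
print(f"Table 2 implies {rate2:.1f}% of policies with a claim")
```

Both tables imply that roughly 27% of policies have a claim, so the two breakdowns are mutually consistent.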
Figure 3: Car use. Left panel: claim amount by car use (Commercial, Private). Right panel: log claim amount by car use.
Bluebook is the value of the car. The top left panel of Figure 4 displays the histogram of bluebook. The boxplots indicate that the average bluebook of policies without a claim is lower than that of policies with a claim. The smooth lines show an upward linear relationship between bluebook and claim amount.
Figure 4: Bluebook. Panels: density of bluebook; bluebook by claim (No, Yes); claim amount against bluebook; log claim amount against bluebook.
Retained measures the number of years the customer has been with the company. The histogram in Figure 5 shows that 15% of customers have held policies for less than one year. The boxplots show that the average number of years retained is higher for policies without a claim. Thus, retained is a potential explanatory variable for the occurrence of a claim. The flat lines in the bottom plots indicate that claim amount does not depend on customer loyalty.
Figure 5: Retained: number of years with the company. Panels: density of retained; retained by claim (No, Yes); claim amount against retained; log claim amount against retained.
Npolicy is the number of policies the customer holds. From Table 3, about 53% of customers hold one policy and about 31% hold two. The proportion that make a claim does not increase with the number of policies. The splines of claim amount against npolicy are flat (bottom panels of Figure 6).
Table 3: Number of policies
Number of policies   Frequency   Percent claims
1                    53.4%       26.8%
2                    30.7%       27.2%
3                    10.9%       27.1%
4                    3.6%        24.7%
5                    1.0%        13.0%
6                    0.1%        0.0%
7                    0.2%        0.0%
8                    0.0%        -
9                    0.05%       0.0%
Car type SUV and Sedan are the two most popular car types, as shown in Table 4. The probability of a claim varies across car types, from 17% for sedans to 35% for sports cars (Table 4). The claim amount does not vary much across car types (Figure 7).
Red car indicates whether the car’s color is red. 29% of insured cars are red. The probability of making a claim is similar for red and non-red cars, as shown in Table 5. Figure 8 shows that the claim amount is also similar for red and non-red cars.
Clm freq measures the number of claims in the past 5 years. The top left panel in Figure 9 indicates that the majority of policies (61.1%) have had no claim in the past 5 years. Customers who incurred a claim this year have a higher number of past claims, on average.
Figure 6: Number of policies. Panels: density of number of policies; number of policies by claim (No, Yes); claim amount against number of policies; log claim amount against number of policies.
Table 4: Car Type
Car type      Frequency   Percent claims
Panel Truck   8.3%        26%
Pickup        17.2%       31%
Sedan         26.1%       17%
Sports Car    11.4%       35%
SUV           28.0%       29%
Van           8.9%        27%
Figure 7: Car type. Left panel: claim amount by car type. Right panel: log claim amount by car type.
Table 5: Red car
Red car   Frequency   Percent claims
Not Red   71.1%       27%
Red       28.9%       26%
Figure 8: Red car. Left panel: claim amount by red car (no, yes). Right panel: log claim amount by red car.
The higher past claim count among current claimants means that clm freq is potentially a good predictor of claim occurrence. The bottom two panels of Figure 9 indicate that claim amount is not related to clm freq.
Oldclaim records the claim amounts incurred in the past 5 years. The top right panel of Figure 10 indicates that a claim is more likely to occur with a higher past claim amount. The bottom two panels indicate that, given a claim occurred, there is no relationship between claim amount and old claim amount.
Revoked indicates whether the policy holder’s license has been suspended in the last 7 years. Around 12.2% of licenses have been suspended in that period. The occurrence of a claim is related to revoked: a policy holder whose license has been suspended in the last 7 years has a 45% chance of incurring a claim, compared with 24% if the license has not been suspended. The claim amount is not related to license suspension, as shown in Figure 11.
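The strength of the association between revoked and claim occurrence can be summarised as a relative risk or an odds ratio. The sketch below uses only the two claim probabilities quoted above (45% and 24%); the function names are hypothetical.

```python
def relative_risk(p_exposed, p_unexposed):
    """Ratio of claim probabilities between two groups."""
    return p_exposed / p_unexposed

def odds_ratio(p_exposed, p_unexposed):
    """Ratio of claim odds, where odds = p / (1 - p)."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed

# Claim probabilities quoted in the text for the revoked variable
p_revoked, p_not_revoked = 0.45, 0.24

rr = relative_risk(p_revoked, p_not_revoked)
oratio = odds_ratio(p_revoked, p_not_revoked)
print(f"relative risk: {rr:.2f}")    # about 1.9
print(f"odds ratio:    {oratio:.2f}")  # about 2.6
```

A policy holder with a suspended license is thus almost twice as likely to claim; the same calculation applies to any binary variable for which the two claim probabilities are reported.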
Mvr pts is motor vehicle points. As we do not have any information on the data definition, an educated guess is that a low mvr pts value is good. The top right panel in Figure 12 indicates that the average number of motor vehicle points is higher for policies with claims. The bottom two panels indicate that claim amount is not related to motor vehicle points.
Age The average age of drivers making a claim is lower than that of drivers without a claim. Age looks promising as a predictor of the occurrence of a claim. The claim amount appears to be independent of age.
Homekids is the number of kids at home. This variable is related to neither claim occurrence nor claim amount, as shown in Figure 14.
Yoj is the number of years the customer has been working. The top left panel of Figure 15 shows that yoj is approximately normally distributed with a mean of about 10 years, apart from a hump at 0 that corresponds to student policy holders. The average number of years of working is higher for policy holders without a claim. This indicates that yoj may be a good predictor for claim occurrence. Years of working is not related to claim size.
Figure 9: Claim Frequency. Panels: density of claim frequency; claim frequency by claim (No, Yes); claim amount against claim frequency; log claim amount against claim frequency.
Figure 10: Old claim amount. Panels: density of old claim; old claim by claim (No, Yes); claim amount against old claim; log claim amount against old claim.
Figure 11: Revoked. Left panel: claim amount by revoked (No, Yes). Right panel: log claim amount by revoked.
Figure 12: Motor vehicle points. Panels: density of motor vehicle points; motor vehicle points by claim (No, Yes); claim amount against motor vehicle points; log claim amount against motor vehicle points.
Figure 13: Age. Panels: density of age; age by claim (No, Yes); claim amount against age; log claim amount against age.
Figure 14: Home kids. Panels: density of home kids; home kids by claim (No, Yes); claim amount against home kids; log claim amount against home kids.
Figure 15: Years of working. Panels: density of years of working; years of working by claim (No, Yes); claim amount against years of working; log claim amount against years of working.
Income The distribution of income is right skewed with a hump at $0 income, which corresponds to the student policy holders. The average income of policy holders without a claim is higher. Claim amount is not related to income.
Gender 54% of policy holders are female. 27.5% of female drivers have incurred a claim, compared with 25.5% of male drivers. The average claim amounts, given that a claim was incurred, are similar, as shown in Figure 17.
Married 60% of policy holders are married. 34% of unmarried policy holders have incurred a claim, compared with 22% of married policy holders. Therefore, claim occurrence appears to be associated with marital status. Figure 18 shows that claim amount is not related to marital status.
Parent1 denotes a single parent. Around 13% of policy holders are single parents. A policy holder who is a single parent has a 45% probability of making a claim, compared with 24% for one who is not. The average claim amount does not differ.
Jobclass Table 6 shows that students and blue collar workers have a higher probability of making a claim than managers and doctors. However, this variable might be correlated with years of working (yoj), as managers are expected to have a longer working history. Caution should be taken when including both yoj and jobclass in a model.
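One simple way to check the suspected overlap between jobclass and yoj is the correlation ratio (eta squared), the share of the variance in years of working that is explained by job class. The following is a minimal sketch on invented records; `eta_squared` is a hypothetical helper and the values are not taken from the data set.

```python
from collections import defaultdict

def eta_squared(pairs):
    """Correlation ratio: between-group sum of squares / total sum of squares.

    pairs: iterable of (category, value). Returns a number in [0, 1];
    values near 1 mean the category largely determines the value.
    """
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    values = [v for vs in groups.values() for v in vs]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    return ss_between / ss_total

# Hypothetical (job class, years of working) records
toy = [
    ("Student", 0), ("Student", 1), ("Student", 0),
    ("Manager", 14), ("Manager", 12), ("Manager", 16),
    ("Clerical", 9), ("Clerical", 11), ("Clerical", 10),
]
eta2 = eta_squared(toy)
print(f"eta squared: {eta2:.3f}")
```

A value close to 1, as in this deliberately extreme toy example, would suggest that the two variables carry largely the same information.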
Max educ From Table 7, the probability of claim occurrence is lower for policy holders with higher education. Therefore, it is potentially a useful predictor for claim flag. From Figure 21, the claim amount is approximately constant across education categories. Therefore, it is not a useful predictor for claim amount.
Home value From the top right panel of Figure 22, the average home value is lower for policy holders without a claim. Therefore, it is a useful predictor for claim flag. The bottom two panels indicate that claim amount is constant over home values.
Figure 16: Income. Panels: density of income; income by claim (No, Yes); claim amount against income; log claim amount against income.
Figure 17: Gender. Left panel: claim amount by gender (F, M). Right panel: log claim amount by gender.
Figure 18: Married. Left panel: claim amount by married (No, Yes). Right panel: log claim amount by married.
Figure 19: Single Parent. Left panel: claim amount by single parent (No, Yes). Right panel: log claim amount by single parent.
Table 6: Job Class
Job class      Frequency   Percent claims
Blue Collar    23.7%       35.1%
Clerical       16.5%       30.3%
Doctor         3.3%        11.5%
Home maker     8.7%        27.2%
Lawyer         10.7%       18.0%
Manager        13.0%       13.5%
Professional   14.6%       23.1%
Student        9.3%        37.9%
Figure 20: Job Class. Left panel: claim amount by job class. Right panel: log claim amount by job class.
Table 7: Maximum education level
Max education level   Frequency   Percent claims
< High School         15%         32%
High School           29%         35%
Bachelors             27%         24%
Masters               20%         19%
PhD                   9%          16%
Figure 21: Max Education Level. Left panel: claim amount by maximum education. Right panel: log claim amount by maximum education.
Figure 22: Home Value. Panels: density of home value; home value by claim (No, Yes); claim amount against home value; log claim amount against home value.
Table 8: Density
Density        Frequency   Percent claims
Urban          45%         20%
Highly Urban   35%         46%
Rural          15%         7%
Highly Rural   5%          6%
Density is the population density of the policy holder’s living area. There are 4 categories: highly rural, rural, urban and highly urban. In Table 8, the probability of making a claim is far higher in urban areas. Therefore, density might be a good predictor for claim occurrence. As shown in Figure 23, the claim amount does not vary between areas.
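To put a number on "far higher in urban areas", the rows of Table 8 can be pooled into an urban and a rural group by frequency-weighting. This is a sketch under the assumption that the printed percentages are exact; the two-group pooling is illustrative and is not a step taken in the analysis, and `pooled_rate` is a hypothetical helper.

```python
def pooled_rate(rows):
    """Frequency-weighted claim rate (in percent) for a set of table rows.

    rows: list of (frequency_percent, percent_claims) pairs.
    """
    weight = sum(freq for freq, _ in rows)
    return sum(freq * rate for freq, rate in rows) / weight

# (Frequency %, Percent claims %) as printed in Table 8
table8 = {
    "Urban": (45, 20),
    "Highly Urban": (35, 46),
    "Rural": (15, 7),
    "Highly Rural": (5, 6),
}

urban = pooled_rate([table8["Urban"], table8["Highly Urban"]])
rural = pooled_rate([table8["Rural"], table8["Highly Rural"]])
print(f"urban areas: {urban:.3f}%")
print(f"rural areas: {rural:.3f}%")
print(f"ratio: {urban / rural:.1f}")
```

On these figures the pooled claim rate is about 31% in urban areas against about 7% in rural areas, a ratio of more than four to one.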
Figure 23: Density. Left panel: claim amount by density. Right panel: log claim amount by density.