DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018
Predicting Large Claims within Non-Life Insurance
JACOB BARNHOLDT
JOSEFIN GRAFFORD
KTH SCHOOL OF ENGINEERING SCIENCES
Degree Projects in Applied Mathematics and Industrial Economics
Degree Programme in Industrial Engineering and Management
KTH Royal Institute of Technology year 2018
Supervisor at If P&C Insurance: Hjalmar Heimbürger
Supervisors at KTH: Anja Janssen, Hans Lööf
Examiner at KTH: Henrik Hult
TRITA-SCI-GRU 2018:187
MAT-K 2018:06
Royal Institute of Technology, School of Engineering Sciences, KTH SCI, SE-100 44 Stockholm, Sweden. URL: www.kth.se/sci
Predicting Large Claims within Non-Life Insurance
Abstract
This bachelor thesis within the field of mathematical statistics aims to study the possibility of predicting particularly large claims from non-life insurance policies with commercial policyholders. This is done through regression analysis, where we seek to develop and evaluate a generalized linear model, GLM. The project is carried out in collaboration with the insurance company If P&C Insurance and most of the research is conducted at their headquarters in Stockholm. The explanatory variables of interest are characteristics associated with the policyholders. Due to the scarcity of large claims in the data set, the prediction is done in two steps. Firstly, logistic regression is used to model the probability of a large claim occurring. Secondly, the magnitude of the large claims is modelled using a generalized linear model with a gamma distribution. Two full models with all characteristics included are constructed and then reduced with computer-intensive algorithms. This results in two reduced models, one with two characteristics excluded and one with one characteristic excluded.
Keywords: Mathematical Statistics, Regression Analysis, Generalized Linear Model, Logistic Regression, Data Analysis, Non-Life Insurance, Insurance Pricing, Large Claims
Prediktion av storskador inom sakförsäkring
Sammanfattning
Det här kandidatexamensarbetet inom matematisk statistik avser att studera möjligheten att predicera särskilt stora skador från sakförsäkringspolicys med företag som försäkringstagare. Detta görs med regressionsanalys, där vi ämnar att utveckla och bedöma en generaliserad linjär modell, GLM. Projektet utförs i samarbete med försäkringsbolaget If Skadeförsäkring och merparten av undersökningen sker på deras huvudkontor i Stockholm. Förklaringsvariablerna som är av intresse att undersöka är egenskaper associerade med försäkringstagarna. På grund av sällsyntheten av storskador i datamängden görs prediktionen i två steg. Först används logistisk regression för att modellera sannolikheten för en storskada att inträffa. Sedan modelleras storskadornas omfattning genom en generaliserad linjär modell med en gammafördelning. Två grundmodeller med alla förklaringsvariabler konstrueras för att sedan reduceras med datorintensiva algoritmer. Det resulterar i två reducerade modeller, med två respektive en kundegenskap utesluten.
Nyckelord: Matematisk statistik, Regressionsanalys, Generaliserad linjär modell, Logistisk regression, Dataanalys, Sakförsäkring, Försäkringsprissättning, Storskador
Acknowledgements
We want to thank the analysts of the Product & Price department at If P&C Insurance for giving us the opportunity to write our bachelor thesis with them and for making us feel welcome. A special thanks goes out to Hjalmar Heimbürger, with whom we have worked most closely and who has been our advisor, mentoring us throughout the project and sharing his knowledge. We would also like to thank our thesis supervisor Anja Janssen from the Department of Mathematical Statistics at KTH Royal Institute of Technology. Anja has been of great support throughout this project, giving us advice and feedback which has been helpful and very appreciated.
Contents
1 Introduction
  1.1 Background
  1.2 Project Formulation
2 Theory
  2.1 Literature Review
  2.2 Insurance Theory
    2.2.1 Key Terms
    2.2.2 Performance Measures
    2.2.3 The Insurance Business Model
    2.2.4 Commercial Insurance Policies
    2.2.5 Insurance Pricing
    2.2.6 Accounting for Large Claims
  2.3 Mathematical Theory
    2.3.1 Regression Analysis
    2.3.2 Linear Regression Modelling
    2.3.3 Generalized Linear Models
    2.3.4 Logistic Regression
    2.3.5 Modelling Claim Severity
    2.3.6 Multicollinearity
    2.3.7 Model Validation
3 Methodology
  3.1 Data
    3.1.1 Characteristics
    3.1.2 Grouping
    3.1.3 Aggregation
    3.1.4 Response Variable
  3.2 Model Development
    3.2.1 Modelling Probability of a Large Claim
    3.2.2 Modelling Large Claim Severity
4 Results
  4.1 Multicollinearity Diagnostics
  4.2 Logistic Regression Model
    4.2.1 Full Model Goodness of Fit Diagnostics
    4.2.2 Significance of Variables in Full Model
    4.2.3 Reduced Model Goodness of Fit Diagnostics
    4.2.4 Significance of Variables in Reduced Model
    4.2.5 Final Model Coefficients
    4.2.6 Final Model
    4.2.7 Final Model Residuals and ROC
  4.3 Claim Severity Regression Model
    4.3.1 Full Model Goodness of Fit Diagnostics
    4.3.2 Results From Reducing Algorithm
    4.3.3 Reduced Model Goodness of Fit Diagnostics
    4.3.4 Reduced Model, Significance of Variables
    4.3.5 Final Model Coefficients
    4.3.6 Final Model
    4.3.7 Final Model Residuals
5 Discussion
  5.1 Model Validation and Adequacy
    5.1.1 Sources of Error or Uncertainty
    5.1.2 Assessing the Model Reductions
    5.1.3 Statistical Hypothesis Testing
    5.1.4 Prediction Accuracy
  5.2 Interpretation of Final Models
  5.3 Impact of Risk-Dependent Insurance Pricing
    5.3.1 For If
    5.3.2 For Commercial Policyholders
6 Conclusions
7 Recommendations
References
1 Introduction
1.1 Background
An insurance policy is a contract between an insurer and a customer, where the customer buys protection against financial loss from the insurer by paying a price known as a premium. The economic risk is thereby transferred from the customer, usually referred to as the policyholder, to the insurance company. Insurance contracts can differ with regard to what they cover, under what circumstances they are valid and how much is paid out by the insurer in case of an incident. This creates a wide insurance market, with insurance companies offering different types of insurance under different terms and agreements. Non-life insurance policies, also called property and casualty insurance policies, are all insurance policies that are not classified as life insurance policies. They may for example cover damage to cars, houses or other property, third party liability and costs for business interruptions.
One of the largest companies on the Nordic market offering non-life insurance policies is If P&C Insurance. They operate in Norway, Sweden, Finland, Denmark and the Baltics, with a wide customer base ranging from private individuals to large enterprises. Fully owned by the financial company Sampo plc, If operates for profit and needs to strive for business efficiency in order to stay competitive and maintain a strong market position. An important aspect of this is to develop and use sophisticated pricing models that set the optimal premiums for the customers with respect to the risk that If has undertaken by insuring them. The total amount that If collects in premiums from its customers needs to cover the costs of customer claims and administration, and also generate a return on invested capital. The best way to achieve this is to predict each customer's claims as accurately as possible.
The cost for claims can be divided into two subcategories: small claims and large claims. Small claims are more frequent and of lower individual cost than large claims, which in general are very rare. Since small and large claims are of such different character, large claims are usually handled separately in the pricing of policies.
1.2 Project Formulation
If's current model for accounting for their customers' risk of large claims in the pricing of their premiums uses a few characteristics as explanatory variables. This project aims to investigate whether this model can be made more sophisticated by taking additional customer characteristics into account, and thereby accomplish a more differentiated and risk-correct pricing.

This will be done using a generalized linear model, GLM. If will provide a large data set containing information about, for example, their customers, insurance premiums and claim costs. The response variable to predict is the cost of large claims as a percentage of the insurance premium. The explanatory variables used to model it will be customer characteristics. That is, this project aims to identify characteristics of the commercial policyholder that might affect the risk of large claims, and to what extent.

In this thesis, the distinction between small and large claims is made at 500 000 SEK. This means that the project aims to predict claims with individual costs of 500 000 SEK or more.
2 Theory
2.1 Literature Review
The insurance theory was partly acquired from introductory courses held by current analysts at the Product & Price department at If P&C Insurance and partly from Esbjörn Ohlsson's and Björn Johansson's book Non-Life Insurance Pricing with Generalized Linear Models. The book also provides insight into the mathematics behind insurance theory. The second main source for mathematical theory was Introduction to Linear Regression Analysis by Douglas Montgomery, Elizabeth Peck and Geoffrey Vining. It provided a general understanding of regression analysis and statistical model building, and also served as a practical go-to source when further knowledge on certain topics was needed to move forward during the project work.

In addition to the literature mentioned above, previous bachelor theses carried out in collaboration with If P&C Insurance were studied. These were Modelling Non-Life Insurance Policyholder Price Sensitivity by Patrik Hardin and Sam Tabari, and Förnyelsegrad och priskänslighet inom företagsförsäkringar by Karin Knobel and Lovisa Laestadius. They provided a good orientation of relevant mathematical aspects related to generalized linear models, as well as an introduction to the use of regression analysis in an insurance business setting. Their work also served as inspiration for how to approach the problem addressed in this thesis.
2.2 Insurance Theory
2.2.1 Key Terms
Here follows an introduction to important terms used in the insurance industry.

Claim: A claim arises when a customer of an insurance company suffers an accident or damage and wants to use their insurance. They report it to their insurance company, asking for reimbursement.

Policyholder: A customer of an insurance company. It can be an individual or a company.

Claim cost: Refers to the costs associated with a claim. It is often divided into subcategories:

Paid: The amount that has been paid out to the customer.

Case: The future costs for claims that have been reported. Some claims do not lead to only one direct payment, but cause additional costs after some time. This amount can be uncertain.

IBNR: Abbreviation for "Incurred But Not Reported"; refers to the costs for claims that have occurred but that the customer has not yet reported to the insurer. There is a high degree of uncertainty in estimates of these costs.

Administrative costs: Regular business expenses. Costs for staff, material, office space etc.

Premium: The price that a customer pays for their insurance. It is usually paid on a yearly basis[1].
2.2.2 Performance Measures
From the terms above, important performance and profitability measures are constructed.

Gross Written Premium (GWP): The total amount that the insurance company collects in premiums from its customers for a year.

Gross Earned Premium (GEP): The Gross Written Premium linearized with time. For example, 10 days into an insurance year, the GEP is (10/365) · GWP.

Paid Ratio: Paid / GEP

Reported Ratio: (Paid + Case) / GEP

Risk Ratio: Claim cost / GEP = (Paid + Case + IBNR) / GEP

Cost Ratio: Administrative cost / GEP

Combined Ratio: (Claim cost + Administrative cost) / GEP = (Paid + Case + IBNR + Administrative cost) / GEP [1]
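As a hedged illustration, the ratios above can be computed directly; all figures below are invented for the example (Python is used here and in the later sketches, although the computations in the thesis were made in SAS).

```python
# Illustrative computation of the measures above, with invented figures
# (all amounts in SEK).
paid, case, ibnr = 60e6, 15e6, 10e6   # claim cost components
admin = 20e6                          # administrative costs
gwp = 120e6                           # gross written premium
days_elapsed = 365                    # a fully earned insurance year

gep = (days_elapsed / 365) * gwp      # gross earned premium

risk_ratio = (paid + case + ibnr) / gep
cost_ratio = admin / gep
combined_ratio = risk_ratio + cost_ratio
print(f"risk ratio {risk_ratio:.1%}, combined ratio {combined_ratio:.1%}")
```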
2.2.3 The Insurance Business Model
The business idea of an insurance company is to provide its customers with protection against financial risk in exchange for a fee, the premium. This means that the risk is transferred from the policyholder to the insurance company. By insuring many customers, the loss of the insurance company becomes the sum of many small, approximately independent losses. The law of large numbers therefore makes the loss of an insurance company much more predictable than the loss of an individual. Since an insurance company can predict its losses to some extent, it has the possibility of making a profit by charging premiums that cover its losses and other costs, and leave room for a certain return[2, p. 1].
An important aspect of the business idea is the cash flows of an insurance company. In general, policyholders pay their premiums up front, at the beginning of the period they are buying insurance for, usually a year. Possible claims from the policyholders are therefore incurred some time after they have paid for their product. This means that the insurance company receives its revenues before it has its costs, and can therefore invest the collected premiums in the meantime and receive a return on them[1].
By Swedish law, insurance companies need to hold a capital reserve in order to always be able to help their customers, by having enough coverage for large claims and other unexpected incidents. For If P&C Insurance, this capital reserve is provided by their owner, the Finnish financial company Sampo plc. Currently, Sampo expects a return of 17.5 percent on its invested capital. If needs to do its best to meet this requirement to avoid the risk of eventually losing its financing. The return on equity, ROE, is calculated as follows:

\[ \mathrm{ROE} = \frac{\bigl(\text{Revenue} \cdot (1 - \text{Combined ratio}) + \text{Investment return}\bigr) \cdot (1 - \text{Tax rate})}{\text{Capital reserve}} \]

The combined ratio is defined in section 2.2.2 and is the percentage of the collected premiums needed to cover the expenses of the company; 1 − CR is thus the percentage of the collected premiums that the insurance company can keep as a profit. As seen in the expression for the ROE, a smaller combined ratio produces a larger ROE [1].
The investment return comes from investing the collected premiums and the capital reserve [1]. Higher investment returns benefit the ROE. Their level depends on the current return on the securities If invests in and is thus highly sensitive to interest rates and the economy as a whole.
2.2.4 Commercial Insurance Policies
If's insurance business is divided into four sections: private customers, commercial customers, industry and the Baltics. Commercial insurance policies can have a complicated structure. A company can decide to take out full insurance or to insure only some parts; this is called the insurance level. The objects that the company wants to insure are called exposures, and insurances of different sizes cover different exposures. An exposure can for example be a car or a property, and the set of exposures an insurance covers is called the exposure level. The exposures can have different product levels. A product is a collection of what the insurance will cover for an exposure; the parts to include are called product modules[1].
2.2.5 Insurance Pricing
It has been shown to be most advantageous for insurance companies to charge risk-correct, or fair, prices. This means that the premium each customer pays depends on the individual risk they pose to the insurer. Simplified, one can say that a customer with a higher expected claim cost should pay more for their policy premium than a customer with a lower expected claim cost[2, p. 2].
The reason is that this has proven to be the best alternative from a business perspective for an insurer acting in a competitive market. In a situation where the policyholders are charged the same prices independent of their respective risks, some customers, who are of high risk, get an unfairly low price at the expense of other customers of lower risk. The customers charged a too high premium are then likely to be lost to a competitor that offers a fairer premium. At the same time, the insurer with the uniform prices will attract more high-risk customers, who benefit from a uniform pricing structure. The result is an undesirable economic situation where the insurer loses profitable deals and gains unprofitable ones[2, p. 2].
To avoid these problems and accomplish fair prices, the insurer needs to predict, as accurately as possible, the expected losses from each of its customers. This is typically done by creating a so-called tariff, a set of tables that calculates the premium for any given customer by taking into account the values of a number of variables, so-called rating factors, for that particular customer. The rating factors are often properties of the policyholder or of the insured object. For a driver looking to insure their vehicle, the tariff could for example account for the driver's age and the weight of the vehicle, two factors that are likely to affect the risk. Creating a tariff is typically done by using regression analysis to model the relationship between possible rating factors, acting as explanatory variables, and a response variable that says something about the risk of a customer[2, p. 2].
The details of how the premium is calculated can differ between insurance companies, but it is common to predict the expected losses from, and hence the risk of, a customer by modelling the response variables claim severity and claim frequency for claims of small or average cost. Claims with costs above a certain threshold, large claims, are often modelled separately.
After the risk of the customer is predicted, the final premium is calculated by adding a certain amount to cover administrative costs. Finally, the price is multiplied by an adjustment term that consists of appropriate coefficients adjusting the price according to, for example, discounts from alliance agreements[1].

Premium = (Risk + Administrative costs + Return) · Adjustment term
2.2.6 Accounting for Large Claims
As mentioned, there is a point in separating the modelling of small and large claims. One reason is that dominating large claims can make the estimates of the total claim cost very volatile. A common approach is to truncate the claims at a certain point and leave out the part of the claim cost above this threshold from the modelling of the small and average-sized claims. The cost above the truncation point also needs to be accounted for in the premium. One way to do this is to simply assume that the differences in risk for large claims are the same as those for the other claims, and therefore distribute the cost for large claims among the policyholders accordingly. This means only adjusting the base premium so that the overall premium level is adequate.

There is no obvious way to account for differences in contribution to large claims between different groups of policyholders. One approach is to estimate the proportion of large claims using a generalized linear model with a binomial distribution and a logit link. One then models the number of large claims divided by the total number of claims. An estimate of the large claim frequency is then given by multiplying the original claim frequency by the predicted proportion of large claims[2, p. 63-64]. The severity of a large claim, being a rather extreme event, can be assumed to have a heavy-tailed distribution[5, p. 2]. It is often modelled with, for example, the log-normal or a Pareto distribution, which better capture the right tail of the distribution arising from high-impact events of low probability[6, p. 2].
2.3 Mathematical Theory
2.3.1 Regression Analysis
Regression analysis comprises a set of statistical techniques for finding and estimating the relationships between different variables, with numerous applications in many fields[7, p. 2]. The typical idea is to find a relation for how the values of one or more independent variables impact the outcome of another, dependent variable of interest. This is done by attempting to fit a function to a data set of observations of the dependent and independent variables[3]. The data is of high importance since it constitutes the foundation of the model that is fitted; it can be retrieved through, for instance, an observational study or an experiment. One could also perform regression analysis on a set of historical data that has been saved, with or without the original intention of investigating the relationships of interest.
2.3.2 Linear regression modelling
The simplest form of regression model is the one where the dependent variable, often called the response variable, is assumed to have a linear relationship to one independent variable, often called the predictor variable, regressor variable or explanatory variable. This is called simple linear regression and implies a straight-line relationship of the form

\[ y = \beta_0 + \beta_1 x + \varepsilon \tag{1} \]

where β0 denotes the intercept, β1 the slope and ε is an error term.

The coefficients β0 and β1 are unknown parameters, called regression coefficients. Fitting a regression model means estimating these coefficients using sample data consisting of a number of observation pairs (y1, x1), ..., (yn, xn)[7, p. 13]. With the coefficient estimates, a model is created that can be used to predict the value of the response variable y at a point x. Since the number of data points is typically greater than two, it is unlikely that all data points lie on a perfectly straight line. Instead, they create an overdetermined system of linear equations.

The simple linear regression concepts can be generalized to the case where the response variable may be related to several explanatory variables, yielding a multiple linear regression model of the form

\[ y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon \tag{2} \]
For both simple and multiple linear regression, estimating the model coefficients is most commonly done through the method of least squares. Another commonly used method for coefficient estimation in regression analysis is maximum likelihood estimation.

In linear regression modelling, the errors are assumed to have mean zero and unknown constant variance σ², and to be uncorrelated. Furthermore, a common assumption is that the errors are normally distributed, which is required for procedures for evaluating model parameters, such as hypothesis testing and the construction of confidence intervals.
2.3.3 Generalized Linear Models
Generalized linear models are a special form of regression models that can be used when the usual assumptions of normality and constant variance are not satisfied. In generalized linear models, the distribution of the response variable is not required to be normal. Instead, it needs to belong to the exponential family. The exponential family of distributions includes the normal distribution, the Poisson distribution and the gamma distribution, among others[7, p. 421]. Members of the exponential family have probability density functions (or probability mass functions) that can be expressed in the following form:

\[ f(y_i; \theta_i, \varphi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{\varphi / w_i} + h(y_i, \varphi, w_i) \right\} \tag{3} \]

where θi is a so-called natural location parameter that varies with i, φ is a positive scale or dispersion parameter and wi is a weight greater than zero. b(θi) is the cumulant function. This function is twice continuously differentiable, and every choice of such a function yields a family of probability distributions, such as the normal or the Poisson distributions. The function h(yi, φ, wi) is of little interest in GLM theory, but is required in order for the total probability to equal one[4][2, p. 17].
The idea behind generalized linear models is to obtain a linear model for a function of the expected value of the response variable. Define the linear predictor ηi by

\[ \eta_i = g(E[y_i]) = g(\mu_i) = x_i^T \beta \tag{4} \]

where the function g is an appropriately chosen link function, which relates the mean of the response to the linear predictor through

\[ E[y_i] = g^{-1}(x_i^T \beta) \tag{5} \]

There are several possible choices of link function, but it is common to choose ηi = θi, where θi is the natural location parameter of the distribution assumed for the response variable. This link function is called the canonical link of that distribution[7, p. 451].

The parameter estimates in a generalized linear model are calculated as the maximum likelihood estimates[7, p. 452]. This means maximizing the likelihood function with the chosen link function inserted, which is equivalent to finding the parameter estimates β̂ that maximize the log-likelihood function. With the estimated parameters β̂, the model becomes

\[ \hat{y}_i = g^{-1}(x_i^T \hat{\beta}) \tag{6} \]

where g is the link function. This gives estimates of the mean response at points x of interest.
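To make the fitting procedure of equations (4)-(6) concrete, the following minimal sketch simulates data and fits a GLM with the Poisson distribution and its canonical log link using Python's statsmodels package; the thesis itself used SAS, and the design matrix, coefficients and seed here are all invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))   # design matrix with intercept
beta_true = np.array([0.2, 0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))           # E[y_i] = g^{-1}(x_i^T beta), log link

result = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # ML estimates
print(result.params)                             # estimates of beta
```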
2.3.4 Logistic Regression
Logistic regression is a special type of generalized linear modelling where the response variable is binary. Typically, the response takes one of the values 0 and 1, often representing a non-event and an event respectively. The model aims to predict the probability of the response variable taking the value 1. The response variable is thus a Bernoulli random variable, taking only binary values[7, p. 428]. Consider the linear model

\[ y_i = x_i^T \beta + \varepsilon_i \]

and assume that the response variable is a Bernoulli random variable with distribution

\[ P(y_i = 1) = \pi_i, \qquad P(y_i = 0) = 1 - \pi_i \]

The expected value is then

\[ E[y_i] = 1 \cdot \pi_i + 0 \cdot (1 - \pi_i) = \pi_i \]

which implies that

\[ E[y_i] = x_i^T \beta = \pi_i \]

Since the response is binary and hence restricted to the values 0 and 1, the errors can only take one of two possible values:

\[ \varepsilon_i = 1 - x_i^T \beta \quad \text{or} \quad \varepsilon_i = -x_i^T \beta \]

The errors are thus not normally distributed. Nor is the error variance constant:

\[ \operatorname{Var}(y_i) = E\left[(y_i - E[y_i])^2\right] = (1 - \pi_i)^2 \pi_i + (0 - \pi_i)^2 (1 - \pi_i) = \pi_i(1 - \pi_i) \]

Thus Var(yi) = E[yi](1 − E[yi]), which means that the variance is a function of the mean. Since 0 ≤ πi ≤ 1, we have 0 ≤ E[yi] ≤ 1. This constraint makes the linear response function shown earlier an infeasible choice for predicting the binary response. A strictly increasing or decreasing S- or reversed-S-shaped function is better suited, which is why the so-called logistic response function is used:

\[ E[y] = \frac{\exp(x^T \beta)}{1 + \exp(x^T \beta)} = \frac{1}{1 + \exp(-x^T \beta)} \]

This can easily be transformed into a linear model by introducing the linear predictor η = x^T β, where η = ln(π / (1 − π)). This transformation of the probability π is called the logit transformation, and the ratio π / (1 − π) is called the odds.
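A corresponding sketch for logistic regression, where exponentiating the fitted coefficients gives odds ratios; the data and coefficients are again invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(1000, 3)))
eta = X @ np.array([-2.0, 0.8, -0.5, 0.3])     # linear predictor eta = x^T beta
pi = 1.0 / (1.0 + np.exp(-eta))                # logistic response function
y = rng.binomial(1, pi)                        # Bernoulli response

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit is canonical
print(np.exp(fit.params))                      # e^{beta_j}: estimated odds ratios
```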
2.3.5 Modelling Claim Severity
We now turn to claim severity. Here, a measure of the claim size is of interest: Y = X/w, where X is the total claim cost in the cell, w is the exposure and Y is the claim severity weighted by the exposure.

X is a random variable, and it is not clear which distribution to assume for it. However, the gamma distribution has become a standard in GLM analysis of claim severity[10, p. 10]. The gamma assumption implies that the standard deviation is proportional to E[Y], which means that we have a constant coefficient of variation[2, p. 20].

To derive the density of Y, first consider the case w = 1:

\[ f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad x > 0 \tag{7} \]

We denote this distribution G(α, β) for short. The expectation is α/β and the variance α/β² [2, p. 20]. A sum of independent gamma-distributed random variables with the same scale parameter β is gamma distributed with the same scale parameter and with index parameter equal to the sum of the individual α's. With X being the sum of w independent G(α, β) random variables, we have X ∼ G(wα, β).

The density of the claim severity Y = X/w is then

\[ f_Y(y) = w f_X(wy) = \frac{(w\beta)^{w\alpha}}{\Gamma(w\alpha)} y^{w\alpha - 1} e^{-w\beta y}, \quad y > 0 \tag{8} \]

Thus Y ∼ G(wα, wβ), with expectation α/β. This distribution can be transformed to the exponential family form shown above. Before doing so, a re-parameterization is made with µ = α/β > 0 and φ = 1/α > 0. Now,

\[ f_Y(y; \mu, \varphi) = \frac{1}{\Gamma(w/\varphi)} \left( \frac{w}{\mu\varphi} \right)^{w/\varphi} y^{w/\varphi - 1} e^{-wy/(\mu\varphi)} = \exp\left( \frac{-y/\mu - \log(\mu)}{\varphi / w} + c(y, \varphi, w) \right), \quad y > 0 \tag{9} \]

where

\[ c(y, \varphi, w) = \frac{w}{\varphi} \log\left( \frac{wy}{\varphi} \right) - \log(y) - \log\Gamma\left( \frac{w}{\varphi} \right) \]

and

\[ E(Y) = \frac{w\alpha}{w\beta} = \mu, \qquad \operatorname{Var}(Y) = \frac{w\alpha}{(w\beta)^2} = \frac{\varphi\mu^2}{w} \]

To show that this gamma distribution is a member of the exponential family, we change parameter to θ = −1/µ < 0. By attaching the index i and setting b(θi) = −log(−θi), we can write

\[ f_{Y_i}(y_i; \theta_i, \varphi) = \exp\left( \frac{y_i\theta_i - b(\theta_i)}{\varphi / w_i} + c(y_i, \varphi, w_i) \right) \tag{10} \]

which is of exponential family form. Hence, the gamma distribution can be used in generalized linear models.
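A hedged sketch of fitting the exposure-weighted gamma model (10) in statsmodels, here with a log link (as in the severity model used later in the thesis) rather than the canonical link; the exposure weights enter through the var_weights argument, and all names and numbers are assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
X = sm.add_constant(rng.normal(size=(n, 2)))
w = rng.integers(1, 10, size=n).astype(float)     # exposure weights w_i
mu = np.exp(X @ np.array([1.0, 0.4, -0.6]))       # mean via log link
phi = 0.5                                         # dispersion parameter
y = rng.gamma(shape=w / phi, scale=mu * phi / w)  # E(Y)=mu, Var(Y)=phi*mu^2/w

fit = sm.GLM(y, X,
             family=sm.families.Gamma(link=sm.families.links.Log()),
             var_weights=w).fit()
print(fit.params)
```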
2.3.6 Multicollinearity
In regression models with multiple regressors, the significance of the model can be disturbed if there is correlation between the explanatory variables; this is called multicollinearity. There is said to be multicollinearity if the regressors are nearly linearly dependent. The optimal situation for avoiding multicollinearity is orthogonal regressors; regressors are normally not orthogonal, but the lack of total orthogonality need not be serious[7, p. 285]. If the regressors are nearly perfectly linearly dependent, however, the results will most likely be misleading or erroneous, since the variances of the coefficient estimates become large, which needs to be avoided as far as possible.
2.3.6.1 Correlation Matrix One method for detecting multicollinearity between variables is to examine the correlation matrix. Given a regression model in matrix form, with the regressors centered and scaled to unit length, the correlation matrix is X^T X, with elements denoted r_{i,j}, where i and j represent the indices of the regressor variables. All diagonal elements r_{i,i} of this matrix are equal to one, and the degree of collinearity between two different variables x_i and x_j is assessed by inspecting the absolute value of their corresponding off-diagonal element r_{i,j}. If the regressors are nearly linearly dependent, r_{i,j} will be close to unity. This is a simple way of detecting dependencies between pairs of regressor variables. If more than two variables are involved in a near-linear dependence, it may not be captured by the correlation matrix, which is why other methods are needed for a more thorough analysis[7, p. 293-294].
2.3.6.2 Variance Inflation Factors Another method for detecting multicollinearity is to examine the variance inflation factors, VIFs. These are found as the diagonal elements C_{j,j} of the matrix C = (X^T X)^{-1} and can be written as

\[ \mathrm{VIF}_j = C_{j,j} = (1 - R_j^2)^{-1} \]

where R_j² is the coefficient of determination obtained when the variable x_j is regressed on the remaining regressors.
The coefficient R_j² can be viewed as the proportion of the variability in x_j that is explained by the remaining regressor variables. Values of R_j² close to 1 imply that most of the variability in x_j is explained by the other independent variables. Hence, if R_j² is near unity, VIF_j is large and x_j is nearly linearly dependent on some subset of the remaining regressors. The VIF of each variable in the model measures the combined effect of the dependencies among the regressor variables on the variance of that particular coefficient estimate. If one or more variables have a large VIF, it indicates a problem with multicollinearity. There is no formal rule for when a VIF is to be considered large, but practical experience indicates that regression coefficients may be poorly estimated if some VIF exceeds 10[7, p. 296-297].
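A sketch of computing VIFs with the helper in statsmodels, on simulated data where one regressor is deliberately made nearly collinear with another:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)   # nearly collinear with x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

for j, name in enumerate(["x1", "x2", "x3"]):
    # regresses column j on the others and returns (1 - R_j^2)^(-1)
    print(name, variance_inflation_factor(X, j))
```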
2.3.7 Model Validation
2.3.7.1 Hypothesis Testing In regression it is important to test whether the coefficient estimates of the model are significant; a coefficient is meaningful only if its explanatory variable has a genuine relationship to the response variable. The general way to perform such tests in regression is to formulate hypotheses[7, p. 84]:

\[ H_0: \beta_j = 0, \qquad H_1: \beta_j \neq 0 \]

If the null hypothesis is rejected, the characteristic variable associated with βj contributes significantly to the model; if not, the variable should be excluded. To decide whether the null hypothesis is to be rejected, a p-value is calculated. The p-value is the probability of obtaining results similar to or more extreme than those observed, given that the null hypothesis is true. If the p-value is larger than the selected significance level, the null hypothesis is not rejected. The p-value must be equal to or less than the significance level in order to reject the null hypothesis and declare that the variable is significant and should not be excluded from the model. The p-value is derived through comparison between the test statistic and its distribution under the null hypothesis.
2.3.7.2 Deviance Goodness of Fit Test Deviance is a measure used to assess goodness of fit for generalized linear models. It compares the current model with a saturated model, quantifying how far the likelihood of the current model falls below the perfect fit. The saturated model is a trivial model of no individual interest, often used as a benchmark when assessing the goodness of fit of other models, since it fits the data perfectly[2, p. 39]. Deviance is defined as

\[ D = 2 \ln \frac{L(\text{saturated model})}{L(\text{current model})} \tag{11} \]

where L is the likelihood function. The deviance approximately follows a χ²-distribution with n − k degrees of freedom, where n is the number of observations and k the number of parameters in the current model. The adequacy of the model can be evaluated using the deviance, since a low deviance with a large p-value suggests that the current model is a satisfactory fit. Another way to use the measure is to divide the deviance by its number of degrees of freedom; a ratio larger than 1 indicates that the current model is not a good fit[7, p. 433].
2.3.7.3 Pearson Chi-Squared (χ²) The Pearson chi-squared test, also called the χ²-test, is a goodness of fit measure for logistic regression models which compares the observed and expected numbers of successes and failures for each covariate class in the observations[7, p. 432]. If the expected number of successes is n_i·π̂_i and the expected number of failures is n_i·(1 − π̂_i), then

\[ \chi^2 = \sum_{i=1}^{n} \left( \frac{(y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i} + \frac{\left[(n_i - y_i) - n_i(1 - \hat{\pi}_i)\right]^2}{n_i(1 - \hat{\pi}_i)} \right) = \sum_{i=1}^{n} \frac{(y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i(1 - \hat{\pi}_i)} \tag{12} \]

The Pearson chi-squared statistic can then be compared to a chi-squared distribution with n − k degrees of freedom. Goodness of fit is indicated by small values of the statistic and/or a large p-value.
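A direct transcription of equation (12), with invented observed counts and fitted probabilities (here with k = 2 fitted parameters assumed):

```python
import numpy as np
from scipy import stats

def pearson_chi2(y, n, pi_hat):
    """Equation (12): y = observed successes, n = group sizes, pi_hat =
    fitted probabilities, all arrays over the covariate classes."""
    return np.sum((y - n * pi_hat) ** 2 / (n * pi_hat * (1.0 - pi_hat)))

y = np.array([3.0, 1.0, 7.0, 2.0])
n = np.array([40.0, 25.0, 60.0, 30.0])
pi_hat = np.array([0.08, 0.05, 0.11, 0.07])
stat = pearson_chi2(y, n, pi_hat)
print(stat, stats.chi2.sf(stat, df=len(y) - 2))   # small stat / large p = good fit
```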
2.3.7.4 Receiver Operating Characteristic (ROC) The receiver operating characteristic is used to assess the predictive power of a logistic regression model. The ROC curve is a plot of sensitivity as a function of (1 − specificity), where the sensitivity measures the model's ability to predict events correctly and the specificity measures its ability to predict non-events correctly:

\[ \text{sensitivity} = P(\hat{y} = 1 \mid y = 1) \qquad \text{and} \qquad \text{specificity} = P(\hat{y} = 0 \mid y = 0) \]

where ŷ represents the model's predicted value of the response variable y. The curve summarizes the predictive power over all possible threshold probabilities π0, which act as cut-off points: an observation with a predicted probability higher than π0 is classified as 1, and an observation with a lower probability is classified as 0. The ROC curve is usually evaluated through the area under it, AUC; the larger the area under the curve, the better the predictions. The AUC measures the probability that the predictions and the outcomes are concordant, i.e. that an observation with y = 1 also gets a larger predicted response ŷ than an observation with y = 0. Thus an AUC of 0.9 is a good result, while an AUC of 0.5 means that the predictive power of the model is no better than random guessing[8, p. 228-229].
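A sketch of computing the ROC curve and AUC with scikit-learn, on synthetic rare-event data meant only to illustrate the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(4)
y_true = rng.binomial(1, 0.1, size=1000)       # rare events, as for large claims
# crude synthetic scores: events tend to get higher predicted probabilities
p_hat = np.clip(0.10 + 0.25 * y_true + rng.normal(0.0, 0.15, 1000), 0.0, 1.0)

fpr, tpr, thresholds = roc_curve(y_true, p_hat)  # (1 - specificity, sensitivity)
print("AUC:", roc_auc_score(y_true, p_hat))      # 0.5 = random, 1.0 = perfect
```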
2.3.7.5 Wald Chi-Squared The Wald chi-squared test, also known as the Wald test, is a significance test based on large-sample properties of maximum likelihood estimators, and can be seen as a rough approximation of the likelihood ratio test. A likelihood ratio test, however, requires at least two models, while the Wald test can be run with only one. The Wald test assesses the significance of an explanatory variable through a null hypothesis test: the squared deviation of the maximum likelihood estimate β̂ from its hypothesized value β0, divided by the estimated variance of β̂, asymptotically follows a χ²-distribution. The statistic is compared to a χ²-distribution with 1 degree of freedom, and the null hypothesis is rejected if the Wald statistic exceeds the corresponding critical value[7, p. 437]. The Wald statistic is

\[ W = \frac{(\hat{\beta} - \beta_0)^2}{\widehat{\operatorname{Var}}(\hat{\beta})} \tag{13} \]
2.3.7.6 AIC and BIC The Akaike Information Criterion, AIC, is an estimate of the information expected to be lost by a model and is calculated as

\[ \mathrm{AIC} = -2\ln(L) + 2p \]

where L is the maximized value of the likelihood function for the model and p is the number of parameters in the model[7, p. 336]. The AIC rewards goodness of fit, as seen from the likelihood term that lowers the value, but also puts a penalty on adding explanatory variables, to discourage overfitting. Since adding variables to a model almost always improves the goodness of fit, the AIC is a trade-off between getting a good fit from including many variables and avoiding the risk of an overfitted model that predicts poorly on unseen data[9]. The Akaike Information Criterion is used for variable selection by comparing subset models against one another to determine which one is better[7, p. 332]. A lower value of AIC is desired, since it is an estimate of the information loss of a model. However, since the AIC is a relative measure of model fit, it says nothing about the absolute quality of a single model; the AIC of a model can only be assessed in relation to the AIC of other models.

The Bayesian Information Criterion, BIC, is an extension of the AIC that puts a greater penalty on adding explanatory variables as the sample size increases. There are several BIC measures, one of the more commonly used being the one defined by Schwarz (1978):

\[ \mathrm{BIC} = -2\ln(L) + p \cdot \ln(n) \]

where n is the number of observations in the model. The BIC is interpreted and used in essentially the same way as the AIC, and the two measures are often used together to complement each other. The Akaike and Bayesian information criteria are both commonly used for more complex modelling situations such as GLMs[7, p. 337].
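The two criteria are easy to compute from a maximized log-likelihood; the values below are invented for a toy comparison of a full and a reduced model.

```python
import numpy as np

def aic(loglike: float, p: int) -> float:
    return -2.0 * loglike + 2.0 * p

def bic(loglike: float, p: int, n: int) -> float:
    return -2.0 * loglike + p * np.log(n)

# Hypothetical full model (10 parameters) vs reduced model (8 parameters);
# lower values are better for both criteria.
print(aic(-1000.0, 10), aic(-1003.5, 8))
print(bic(-1000.0, 10, 500), bic(-1003.5, 8, 500))
```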
2.3.7.7 Residual Analysis Residuals are one of the most appropriate tools for conducting model adequacy checks in regression analysis. A residual is the difference between the observed value of the response variable and the value of the response predicted by the model; the residuals therefore give an indication of how accurately the model predicts responses. An ordinary residual is defined as

\[ e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n \]

where i is the index of a specific observation[7, p. 130]. In generalized linear models, one of the most common choices of residuals is the Pearson residual[2, p. 53]. Pearson residuals are based on the idea of subtracting off the predicted mean and dividing by an estimate of the standard deviation of the observed value, and are defined as

\[ r_{P,i} = \frac{y_i - \hat{y}_i}{\sqrt{\widehat{\operatorname{Var}}(y_i)}}, \quad i = 1, 2, \ldots, n \]

If most of the Pearson residuals of a model lie within a band between −3 and +3, it is an indication that the model has high predictive power[11].
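A sketch of extracting Pearson residuals from a fitted GLM and checking the band between −3 and +3; the simulated model is illustrative only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(400, 2)))
y = rng.poisson(np.exp(X @ np.array([0.3, 0.5, -0.4])))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

resid = fit.resid_pearson                 # (y_i - mu_hat_i)/sqrt(Var_hat(y_i))
print(f"{np.mean(np.abs(resid) <= 3):.1%} of Pearson residuals in [-3, 3]")
```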
3 Methodology
3.1 Data
The first part of the project was to manage the data before any analyses could
be done. Due to the sheer amount of raw data provided by If, structuring,
aggregating and choosing relevant data was a major part of this project.
3.1.1 Characteristics
Since If was primarily interested in customer-related causes of large claims, we were handed a set of nine possible rating factors that all represented customer characteristics associated with commercial policyholders. These nine rating factors were different characteristics that in different ways described the customers' financial situations. Some characteristics were continuous, either on some closed interval or on the real line, and some were categorical. The nine characteristics were chosen as explanatory variables for the initial model to predict large claims. A tenth rating factor, the product code, was added to keep track of any effects arising from the type of insurance. Since the data quality of the first nine characteristics varied between observations associated with different countries, the analysis was restricted to the country with the best quality of data with regard to those variables.
3.1.2 Grouping
The characteristics were divided into discrete groups, where each group was represented by one explanatory variable in the regression, each receiving its own coefficient estimate. Each observation belonged to one group per characteristic, where each group acted as a dummy variable. The main reason behind this approach was to separate any missing or extreme values in the data from the more reasonable ones, without having to exclude those observations altogether. By dividing the valid values of continuous characteristics into, in most cases, one group with higher and one with lower values, it was easier to distinguish the effects of the level of the characteristic in question. Built-in procedures in SAS were used to get an overview of the spread of the values of the analyzed characteristics. This was to ensure that there was a sufficient amount of data in each group, which is important to avoid erroneous output. All explanatory variables, consisting of the groups of the 10 characteristics A-J, are presented in table 1. Groups denoted by 'H' represent the higher values of the corresponding characteristic and groups denoted by 'L' the lower values. Groups denoted by 'X' or 'Missing' represent missing or invalid values of the characteristic.
Table 1: Grouping of Characteristics
Variable Grouping
Characteristic A H/L/X
Characteristic B H/L/X
Characteristic C H/L/X
Characteristic D H/L/X
Characteristic E H/L/X
Characteristic F H/L/X
Characteristic G H/L/X
Characteristic H 1-3
Characteristic I H/L/X
Characteristic J 1-23
H = High values
L = Low values
X = Extreme/missing values
Numbers = Groups in categories
3.1.3 Aggregation
After the characteristics were grouped and attached to the initial data, a SAS procedure was used to aggregate the data to a less granular level than the initial data table. The data was aggregated on the basis of year, product code and the other characteristics of interest, and at the same time the observed values of the response variable were summed. This resulted in a table where each row represented a unique combination of the variables mentioned, together with the resulting sums of the response variable. Many customers could thus together make up a single row, reducing the number of rows in the table significantly. Each row was then to act as one observation in the regression modelling.
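A minimal pandas sketch of this aggregation step; the column names and records are hypothetical stand-ins for the real policy data.

```python
import pandas as pd

# One policy-level row per record; all column names are hypothetical.
policies = pd.DataFrame({
    "year":             [2016, 2016, 2017],
    "product_code":     ["P1", "P1", "P2"],
    "char_A":           ["H", "H", "L"],
    "large_claim_cost": [0.0, 600_000.0, 0.0],
    "premium":          [50_000.0, 80_000.0, 30_000.0],
})

# Aggregate to one row per unique combination, summing the response parts.
aggregated = (policies
              .groupby(["year", "product_code", "char_A"], as_index=False)
              .agg(large_claim_cost=("large_claim_cost", "sum"),
                   premium=("premium", "sum")))
print(aggregated)
```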
3.1.4 Response Variable
The variable that If asked us to model in this project was the cost of large claims as a percentage of the premium. This is not a common response variable in generalized linear modelling, and there is therefore no obvious way to model it presented in the mathematical or insurance literature. When modelling claim severity and claim frequency for smaller claims, it is practically a standard to use the gamma and the Poisson distributions, respectively. As the response variable to predict in this project is not as well studied, we chose to decide on a suitable distribution by inspecting a histogram of all values of the response variable in our data set. This showed that the majority of the observations had a response value of zero. This was not surprising, since large claims are to be viewed as more or less extreme events, which is why few insurance contracts have any costs for large claims associated with them.

Since no distribution in the exponential family has such a large mass at zero, we chose to divide the analysis into two parts. In the first part, all observations were used to predict the probability of a large claim occurring, using a binary response variable and logistic regression. In the second part, only the observations with a large claim were extracted, with the intent to model the requested response variable, cost of large claims as a percentage of premium. Only the part of the large claim exceeding the large claim truncation point of 500 000 SEK was included in the response variable. A histogram showed that the data set contained observed values of the response variable ranging from almost zero to very large values, with the bulk at small values and a long tail. Due to the strong resemblance to a gamma distribution, this distribution was chosen for the response variable.
3.2 Model Development
Since the analysis was divided into two parts, there were two different model developments: one for large claim probability and one for large claim severity. The two models need not retain the same characteristics; a characteristic can be excluded from the probability model but kept in the severity model. The models were built in parallel from the same initial model and reduced independently of each other.
3.2.1 Modelling Probability of a Large Claim
The initial development of the model for the probability of a large claim required a binary response variable. The large claims in the initial data were recorded either as missing, meaning no claims; as zero, meaning that there could have been a claim but below the 500 000 SEK cost that classifies it as large; or as some number corresponding to the value above the large claim threshold. All missing values were set to zero, and all values above zero were set to one.

With a SAS procedure calculating variance inflation factors and the correlation matrix of the explanatory variables, multicollinearity could be investigated before any logistic regression was initiated. These multicollinearity diagnostics are presented in section 4.1. With the variance inflation factors and the correlation between certain variables in mind, the logistic regression model was constructed with a built-in SAS procedure. The SAS procedure calculates the estimates with maximum likelihood and evaluates the odds ratios. The procedure also has a selection function which, by removing and adding variables, evaluates different combinations of variables and performs selection based on the significance levels of the variables.

The reduced model eliminated two variables. The estimates were analyzed and examined with respect to their plausibility. To be able to conclude that the reduced model was an improvement, it was compared to the full model using AIC, BIC and AUC (area under the ROC curve). With no multicollinearity and improved goodness of fit, the reduction of the model stopped here. Section 4.2 presents the goodness of fit diagnostics for both the full model and the reduced model, as well as the variables included and their corresponding significance levels for the two models.

The logistic regression thus resulted in an equation of the following form:

\[ \ln\left( \frac{\pi_i}{1 - \pi_i} \right) = \sum_{j=0}^{38} x_{i,j} \hat{\beta}_j \tag{14} \]
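A rough sketch of significance-based backward elimination in the spirit of the SAS selection routine described above; the threshold, model family and data are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X: pd.DataFrame, alpha: float = 0.05):
    """Drop the least significant variable until all p-values <= alpha."""
    cols = list(X.columns)
    while cols:
        fit = sm.GLM(y, sm.add_constant(X[cols]),
                     family=sm.families.Binomial()).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()              # least significant remaining variable
        if pvals[worst] <= alpha:
            return fit, cols                # everything left is significant
        cols.remove(worst)                  # eliminate it and refit
    return None, []

# Tiny illustration with one informative and one pure-noise regressor.
rng = np.random.default_rng(9)
X = pd.DataFrame({"x1": rng.normal(size=800), "noise": rng.normal(size=800)})
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * X["x1"]))))
fit, kept = backward_eliminate(y, X)
print("kept:", kept)
```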
3.2.2 Modelling Large Claim Severity
Since the initial variables of the severity model and the probability model are the same, the VIF and correlation matrix results could be reused for this model. The response variable was constructed by dividing the large claim by the premium, and a table was made containing only the extracted observations with a response variable above zero. Once again, all missing claim values were set to zero.

A GLM with a log link was fitted in SAS, and again a built-in function for variable selection was used, which selected and kept variables with respect to their significance. This time only one variable was eliminated. The estimates were analyzed and examined with respect to their plausibility. To be able to conclude that the reduced model was an improvement, it was compared to the full model using AIC and BIC. With no multicollinearity and improved goodness of fit, the reduction of the model stopped here. The results and diagnostics are presented in section 4.3.

This log-gamma GLM thus resulted in a model of the following form:

\[ \ln(\hat{y}_i) = \sum_{j=0}^{49} x_{i,j} \hat{\beta}_j \tag{15} \]
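For completeness, the two parts can be fitted as sketched below. Combining them multiplicatively into one expected cost share is a natural reading of the two-step approach, but note that the thesis itself reports the two models separately; everything in the sketch is simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 2)))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-2.5, 0.6, -0.4]))))
occurred = rng.binomial(1, p_true)           # did a large claim occur?
cost_share = rng.gamma(2.0, 0.5, size=n)     # severity where a claim occurred

# Part 1: probability of a large claim, fitted on all observations.
logit_fit = sm.GLM(occurred, X, family=sm.families.Binomial()).fit()

# Part 2: cost as a share of premium, fitted only on observations with a claim.
mask = occurred == 1
gamma_fit = sm.GLM(cost_share[mask], X[mask],
                   family=sm.families.Gamma(sm.families.links.Log())).fit()

# One natural way to combine the parts into an overall prediction:
expected_share = logit_fit.predict(X) * gamma_fit.predict(X)
print(expected_share[:5])
```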
4 Results
4.1 Multicollinearity Diagnostics
Table 2: Variance Inflation Factors
Variable VIF
Characteristic A 1,48491
Characteristic B 1,1308
Characteristic C 3,21279
Characteristic D 1,19488
Characteristic E 5,11038
Characteristic F 4,83526
Characteristic G 4,62401
Characteristic I 1
Table 3: Correlation Matrix
Characteristics G A D C F B E I
G 1 0,0892 0,1421 -0,6726 0,0094 -0,1987 -0,4262 -0,0003
A 0,0892 1 -0,0823 0,1571 -0,4242 -0,1299 0,0525 -0,0003
D 0,1421 -0,0823 1 -0,0174 0,0599 -0,1103 -0,2729 -0,0011
C -0,6726 0,1571 -0,0174 1 -0,253 0,1208 0,1921 0,0002
F 0,0094 -0,4242 0,0599 -0,253 1 0,131 -0,6635 -0,0011
B -0,1987 -0,1299 -0,1103 0,1208 0,131 1 -0,0819 -0,0002
E -0,4262 0,0525 -0,2729 0,1921 -0,6635 -0,0819 1 0,0011
I -0,0003 -0,0003 -0,0011 0,0002 -0,0011 -0,0002 0,0011 1
4.2 Logistic Regression Model
4.2.1 Full Model Goodness of Fit Diagnostics
Table 4: Goodness of Fit
Full Model
LogLike -7059,91
AUC 0,757988
AIC 14205,82
BIC 14580,74
4.2.2 Significance of Variables in Full Model
Table 5: Significance
Full Model WaldChiSq ProbChiSq
Characteristic A 3,2873 0,1933
Characteristic B 31,5545 0,0000
Characteristic C 116,4300 0,0000
Characteristic D 0,4495 0,7987
Characteristic E 124,2040 0,0000
Characteristic F 60,2795 0,0000
Characteristic G 47,0024 0,0000
Characteristic H 23,8441 0,0000
Characteristic I 72,6220 0,0000
Characteristic J 811,7597 0,0000
4.2.3 Reduced Model Goodness of Fit Diagnostics
Table 6: Goodness of Fit
Reduced Model
LogLike -7061,81
AUC 0,756928
AIC 14201,61
BIC 14541,65
4.2.4 Significance of Variables in Reduced Model
Table 7: Significance
Reduced Model WaldChiSq ProbChiSq
Characteristic B 30,8980 0,0000
Characteristic C 116,1273 0,0000
Characteristic E 199,7650 0,0000
Characteristic F 67,0019 0,0000
Characteristic G 43,8051 0,0000
Characteristic H 24,3994 0,0000
Characteristic I 97,6561 0,0000
Characteristic J 809,6167 0,0000
4.2.5 Final Model Coefficients
Table 8: Coefficients
Reduced Model Group Estimate (β̂)
Intercept - -2,9765
Characteristic B H - 0,258563
Characteristic B L -0,056793
Characteristic C H 0,44482
Characteristic C L -0,123971
Characteristic E H -0,104184
Characteristic E L -0,715658
Characteristic F H -0,403137
Characteristic F L 0,08744
Characteristic G H 0,07087
Characteristic G L -0,256218
Characteristic H 1 -0,277118
Characteristic H 2 -0,125842
Characteristic H 3 -0,101876
Characteristic I H -0,131207
Characteristic I L 0,36145
Characteristic J 1 -0,195867
Characteristic J 2 0,20147
Characteristic J 3 -3,561178
Characteristic J 4 -0,444718
Characteristic J 5 0,37214
Characteristic J 6 1,22243
Characteristic J 7 -1,075213
Characteristic J 8 -1,095859
Characteristic J 9 0,92637
Characteristic J 10 0,03483
Characteristic J 11 1,00475
Characteristic J 12 -0,086167
Characteristic J 13 0,06773
Characteristic J 14 -0,09741
Characteristic J 15 -0,132908
Characteristic J 16 -0,404597
Characteristic J 17 -0,467246
Characteristic J 18 -0,312258
Characteristic J 19 -1,174643
Characteristic J 20 0,77836
Characteristic J 21 0,76851
Characteristic J 22 1,61931
Characteristic J 23 1,57733
4.2.6 Final Model
From the coefficients in table 8, the following equation could be constructed:

\[ \frac{\hat{\pi}_i}{1 - \hat{\pi}_i} = e^{-2,9765} \cdot \prod_{j=1}^{38} e^{x_{i,j}\hat{\beta}_j} = e^{-2,9765} \cdot e^{x_{i,1}\hat{\beta}_1} \cdot \ldots \cdot e^{x_{i,38}\hat{\beta}_{38}} \tag{16} \]

which can be written as

\[ e^{-2,9765} \cdot \underbrace{\begin{cases} e^{\hat{\beta}_{B,H}}, & \text{if Group} = \text{H} \\ e^{\hat{\beta}_{B,L}}, & \text{if Group} = \text{L} \\ 1, & \text{otherwise} \end{cases}}_{\text{for characteristic B}} \cdot \; \ldots \; \cdot \underbrace{\begin{cases} e^{\hat{\beta}_{H,1}}, & \text{if Group} = 1 \\ e^{\hat{\beta}_{H,2}}, & \text{if Group} = 2 \\ e^{\hat{\beta}_{H,3}}, & \text{if Group} = 3 \\ 1, & \text{otherwise} \end{cases}}_{\text{for characteristic H}} \cdot \; \ldots \]
4.2.7 Final Model Residuals and ROC
Figure 1: Pearson Residuals. The residuals plotted versus case number. Events are shown in red and non-events in blue.
Figure 2: ROC curve and AUC measure
4.3 Claim Severity Regression Model
4.3.1 Full Model Goodness of Fit Diagnostics
Table 9: Goodness of Fit
Full Model Value Value/DF
Deviance 7445,7034 3,9521
Scaled Deviance 2593,5880 1,3766
Pearson Chi-Square 20967,8539 11,1294
Log Likelihood -2482,7098
Full Log Likelihood -2482,7098
AIC 5053,4197
BIC 5298,2233
4.3.2 Results From Reducing Algorithm
Table 10: Reduction of Variables
Variable Status p-value
Characteristic B Included 0,0000
Characteristic C Included 0,0001
Characteristic D Included 0,0000
Characteristic E Included 0,0000
Characteristic F Included 0,0000
Characteristic G Included 0,0000
Characteristic H Included 0,0000
Characteristic I Included 0,0000
Characteristic J Included 0,0000
Characteristic A Removed 0,0528
4.3.3 Reduced Model Goodness of Fit Diagnostics
Table 11: Goodness of Fit
Reduced Model Value Value/DF
Deviance 7487,7109 3,9702
Scaled Deviance 2595,6330 1,3763
Pearson Chi-Square 21797,5034 11,5575
Log Likelihood -2490,0085
Full Log Likelihood -2490,0085
AIC 5064,0169
BIC 5297,6931
4.3.4 Reduced Model, Significance of Variables
Table 12: Significance
Reduced Model ChiSq ProbChiSq
Characteristic B 150,02 0,0000
Characteristic C 13,59 0,0011
Characteristic D 15,80 0,0004
Characteristic E 15,78 0,0004
Characteristic F 75,29 0,0000
Characteristic G 78,92 0,0000
Characteristic H 13,93 0,0030
Characteristic I 126,82 0,0000
Characteristic J 1695,72 0,0000
4.3.5 Final Model Coefficients
Table 13: Coefficients
Reduced Model Group Estimate
Intercept - -2,5222
Characteristic B H 2,6932
Characteristic B L 1,8412
Characteristic B X 0,0000
Characteristic C H 0,0347
Characteristic C L 0,4542
Characteristic C X 0,0000
Characteristic D H -1,4678
Characteristic D L -0,7142
Characteristic D X 0,0000
Characteristic E H -1,7617
Characteristic E L -2,1307
Characteristic E X 0,0000
Characteristic F H 0,0463
Characteristic F L -1,0185
Characteristic F X 0,0000
Characteristic G H 2,4602
Characteristic G L 2,2324
Characteristic G X 0,0000
Characteristic H 1 0,1487
Characteristic H 2 -0,2592
Characteristic H 3 0,0102
Characteristic H Missing 0,0000
Characteristic I H 1,9625
Characteristic I L 0,6583
Characteristic I X 0,0000
Characteristic J 1 1,6589
Characteristic J 2 3,1279
Characteristic J 3 1,4446
Characteristic J 4 2,3687
Characteristic J 5 1,0927
Characteristic J 6 1,5759
Characteristic J 7 0,3709
Characteristic J 8 2,4712
Characteristic J 9 0,0327
Characteristic J 10 0,5404
Characteristic J 11 0,7100
Characteristic J 12 1,7923
Characteristic J 13 2,9658
Characteristic J 14 4,5324
Characteristic J 15 2,2599
Characteristic J 16 4,3868
Characteristic J 17 3,5561
Characteristic J 18 1,5794
Characteristic J 19 4,4475
Characteristic J 20 2,4725
Characteristic J 21 -0,2355
Characteristic J 22 2,9615
Characteristic J 23 6,6656
Characteristic J 24 0,0000
4.3.6 Final Model
From table 13, the following equation could be constructed:
\[
\hat{y}_i = e^{-2,5222}\cdot\prod_{j=1}^{49} e^{x_{j,i}\hat{\beta}_j} = e^{-2,5222}\cdot e^{x_{1,i}\hat{\beta}_1}\cdot\ldots\cdot e^{x_{25,i}\hat{\beta}_{25}}\cdot\ldots\cdot e^{x_{49,i}\hat{\beta}_{49}}
= e^{-2,5222}\cdot
\underbrace{\begin{cases}
e^{\hat{\beta}_{B,H}}, & \text{if Group} = H\\
e^{\hat{\beta}_{B,L}}, & \text{if Group} = L\\
1, & \text{otherwise}
\end{cases}}_{\text{for characteristic } B}
\cdot\;\ldots\;\cdot
\underbrace{\begin{cases}
e^{\hat{\beta}_{H,1}}, & \text{if Group} = 1\\
e^{\hat{\beta}_{H,2}}, & \text{if Group} = 2\\
e^{\hat{\beta}_{H,3}}, & \text{if Group} = 3\\
1, & \text{otherwise}
\end{cases}}_{\text{for characteristic } H}
\cdot\;\ldots
\]
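For concreteness, a severity model of this multiplicative form can be fitted as a gamma GLM with a log link. The minimal sketch below uses Python's statsmodels as a stand-in for the software actually used in the project; the column names and data are hypothetical. Exponentiating the fitted coefficients yields the per-group factors appearing in the equation above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: large claim cost as a percentage of premium, with two
# categorical policyholder characteristics whose reference group is "X".
data = pd.DataFrame({
    "severity": [120.0, 35.5, 410.2, 88.1, 250.0, 60.3, 150.7, 95.0],
    "char_B":   ["H", "L", "H", "X", "L", "X", "H", "L"],
    "char_C":   ["L", "H", "X", "L", "H", "X", "H", "L"],
})

# Gamma GLM with log link: E[y] = exp(x'beta), i.e. a multiplicative model.
model = smf.glm(
    "severity ~ C(char_B, Treatment('X')) + C(char_C, Treatment('X'))",
    data=data,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()

# The exponentiated coefficients are the per-group multipliers on expected severity.
print(np.exp(model.params).round(3))
```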
4.3.7 Final Model Residuals
Figure 3: Pearson Residuals plotted against the linear predictor
5 Discussion
5.1 Model Validation and Adequacy
5.1.1 Sources of error or uncertainty
Due to the large amount of initial data, there was a risk of hidden errors.
With millions of observations it is difficult to detect such errors, and even
more difficult without knowing what they look like. Some parts of the data
obviously contained errors, for example when a variable that should only contain
positive numbers, or percentages between 0 and 100%, held values outside that
range. In addition to these obvious errors there were probably further inaccuracies
that we could not detect. Furthermore, there is no guarantee that the constructed
groups are risk-homogeneous, or that our interpretation of what should be considered
extreme values was sophisticated enough. Having more, and narrower, groups for each
characteristic could have been one way to minimize the risk of misleading results
from such problems. However, other problems could then have arisen from having a
model with a very large number of variables; for example, some groups of a
characteristic might get significant coefficients while others do not.
A common approach when building this kind of insurance model is to aggregate
the data to the requested policy level and then weight the observations to reduce
the risk of distorted significance levels. However, the model failed to converge
when weighted, so this was not an option. The ideal alternative would have been to
not aggregate the data at all, but that resulted in a data set too large for the
software to handle when fitting the logistic regression. This may make the results
misleading, since the model does not take into account the number of original
observations behind each aggregated observation. It would have been possible to
aggregate the data for the GLM, but we decided to use the same policy level for
both sets of data.
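To illustrate the weighting idea described above: some GLM implementations accept frequency weights, letting each aggregated row count as the number of underlying policies it represents. A minimal sketch with statsmodels follows; the column names and numbers are hypothetical, and this is not the setup used in the project.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical aggregated data: each row summarizes n_policies original policies.
agg = pd.DataFrame({
    "large_claim_rate": [0.01, 0.04, 0.02, 0.03],  # share of policies with a large claim
    "n_policies":       [1200, 300, 750, 500],     # policies behind each aggregated row
    "high_group":       [0, 1, 1, 0],              # indicator for one characteristic group
})

X = sm.add_constant(agg[["high_group"]])

# freq_weights makes each aggregated observation count as n_policies repeated
# observations, so the likelihood reflects the original data volume.
model = sm.GLM(
    agg["large_claim_rate"], X,
    family=sm.families.Binomial(),
    freq_weights=agg["n_policies"].to_numpy(),
).fit()
print(model.params)
```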
The VIFs and the correlation matrix both indicate low multicollinearity between
the characteristics. However, the characteristics were almost exclusively financial
measurements, and intuition says that such measurements should be correlated. This
raises the further question of whether something is wrong with the data set.
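For reference, VIFs of the kind referred to here can be computed directly from the design matrix. The sketch below uses statsmodels; the column names and data are hypothetical, with one column deliberately built to correlate with another.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical financial measurements; measure_3 is built to correlate with measure_1.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "measure_1": rng.normal(size=100),
    "measure_2": rng.normal(size=100),
})
X["measure_3"] = 0.5 * X["measure_1"] + rng.normal(scale=0.9, size=100)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing column j on the others.
exog = sm.add_constant(X).to_numpy()
for j, name in enumerate(X.columns, start=1):  # skip the constant at index 0
    print(f"{name}: VIF = {variance_inflation_factor(exog, j):.2f}")
```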
5.1.2 Assessing the model reductions
The reduced model obtained with logistic regression gave smaller AIC and BIC
values than the full model, which indicates a better model fit. The reduced GLM,
on the other hand, gave a smaller BIC value but not a smaller AIC value than the
full model. AIC is more tolerant of additional variables, while BIC puts a stricter
penalty on adding them. We are nonetheless convinced that the reduced model is
preferable, since its AIC differs only slightly from that of the full model and
the excluded variable did not meet the desired significance level.
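The different behaviour of the two criteria follows from their standard definitions,

\[
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},
\]

where $k$ is the number of estimated parameters, $n$ the number of observations and $\hat{L}$ the maximized likelihood. Since $\ln n > 2$ as soon as $n > e^{2} \approx 7.4$, BIC charges more per added parameter than AIC on any realistically sized data set, which is why a reduction can improve BIC without improving AIC.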
5.1.3 Statistical hypothesis testing
The deviance, Wald chi-square and Pearson chi-square tests indicate significant
p-values for both models and a satisfactory goodness of fit. However, the scaled
deviance divided by the degrees of freedom has a value larger than one, which
indicates a problem with the fit. Furthermore, in both the reduced logistic model
and the reduced GLM there are significant p-values for the characteristics, while
within some characteristics only one of the groups shows significance. This is not
necessarily serious, since significance is a binary verdict for each group taken
separately. It could have been more appropriate to merge the two other groups
(most characteristics are categorized into three groups), but since one group
represents missing or extreme values, and we are interested in seeing trends,
that option was not preferred.
5.1.4 Prediction accuracy
The prediction accuracy of the logistic regression model for the large claim
probability can be assessed by the concordance index, also called the area under
the curve (AUC) of the ROC. As seen in section 4.2.7, the model had an AUC of
approximately 76 percent. Since this is well over 50 percent, the model is shown
to have a certain ability to accurately predict the occurrence of large claims.
At the same time, the AUC is far from the optimal level of 100 percent, showing
that the model to some extent lacks the ability to make accurate predictions. One
reason for this could be the distribution of the response variable, which consists
of very few events (large claims) compared to the number of non-events (no large
claims). This could have made it difficult to find strong patterns for the
occurrence of large claims, an issue that is hard to overcome due to the nature
of the data.
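To indicate how such an AUC figure is computed, a minimal sketch follows using scikit-learn; the labels and fitted probabilities are hypothetical.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical data: 1 = large claim (event), 0 = no large claim (non-event).
y_true  = [0, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.02, 0.10, 0.05, 0.40, 0.08, 0.15, 0.03, 0.60]  # fitted probabilities

# AUC = probability that a randomly chosen event is ranked above a
# randomly chosen non-event; 0.5 is no better than chance.
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```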
For both models, the prediction accuracy can also be evaluated by analyzing the
residuals. In section 4.2.7, residuals for the large claim probability model are
presented. In figure 1, which shows Pearson residuals plotted for each observation,
one can note that the residuals are positive for practically all observations with
a large claim, shown in red, and negative for the observations without a large
claim, shown in blue. Since a positive residual means that the predicted response
is smaller than the observed response, and a negative residual means the opposite,
this result is expected: observed events have been coded as ones and non-events as
zeros. One also notes that the positive residuals are further from zero than the
negative ones. This indicates that the model predicts events less accurately than
non-events. As stated, this is likely caused by the scarcity of large claims in
the data. Many of the positive residuals also appear to be larger than 3, which
indicates a problem with prediction accuracy as well as outliers in the data. The
negative residuals are all close to zero, indicating that the model successfully
predicts non-events.
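For reference, the Pearson residual plotted in figure 1 is, for a binary response,

\[
r_i = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i\,(1-\hat{\pi}_i)}},
\]

where $\hat{\pi}_i$ is the fitted probability of a large claim. An event ($y_i = 1$) with a small $\hat{\pi}_i$ therefore produces a large positive residual, while a non-event with a small $\hat{\pi}_i$ produces a residual only slightly below zero, which is exactly the asymmetry seen in the plot.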
In figure 3 of section 4.3.7, Pearson residuals plotted against the linear
predictors of the large claim severity model are presented. One can see in the
scatter that most residuals are slightly negative, which indicates that the
predicted response is greater than the observed response in many cases. This
means that the model tends to assign higher costs for large claims, as a
percentage of premium, than they actually had. There is also an apparent presence
of some very large positive residuals; for those observations the model predicts
a much lower response than what is observed. The overall pattern indicates a
possible problem with outliers, which may be what pushes most other residuals
slightly below zero. However, since most residuals do not deviate much beyond
the limit of +/- 3, the overall predictive power is acceptable.
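For the gamma severity model, the corresponding Pearson residual uses the gamma variance function $V(\mu) = \mu^2$,

\[
r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}} = \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i},
\]

which (up to the estimated dispersion) is bounded below by $-1$ but unbounded above. This is consistent with the mildly negative bulk of residuals and the few very large positive values described above.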
5.2 Interpretation of final models
The model development and the procedures for model reduction resulted in two
final models with 8 and 9 included characteristics, respectively. It was found
that the models did not need to be heavily reduced; rather, they differed from
their corresponding full models by just two and one characteristics, respectively.
Since the variance inflation factors did not point towards any issues with
multicollinearity, there was no mathematical support for reducing the models
further. It is interesting that so many of the characteristics were shown to be
significant and to add explanatory power to the models, since it indicates that
there are several aspects of commercial policyholders that affect their risk of
large claims. However, this is not necessarily the most desirable result in
practice, since more complex models make it more difficult to implement the
results in the pricing of insurance policies. It is not as simple as to just
start pricing on the basis of all the significant characteristics found, which
is why a model using one or a few characteristics in isolation would in some
sense have been preferable.
The characteristic that was excluded from the claim size model was also one of
the characteristics excluded from the large claim probability model. This is an
interesting result, since it indicates that this characteristic does in fact not
have much effect on the large claim risk associated with commercial policyholders.
However, given the overall uncertainty about the quality of the data used for the
regression model building, it can be questioned to what extent this conclusion is
safe to draw and to generalize. As one more characteristic was excluded from the
large claim probability model than from the claim size model, there appears to be
a certain difference between what causes large claims to occur at all and what
affects their severity: that characteristic does not seem to explain anything
about the occurrence of large claims, but adds some explanatory value about their
size. Differences in what drives risk in the two models can also be observed in
the coefficient estimates of the characteristics and their indicator variables.
If, for example, the higher-valued group of a certain characteristic has a larger
coefficient estimate than the lower-valued group in one model, while the
relationship is the opposite in the other model, this shows a difference in the
effect of that characteristic, even if it is significant in both models.
The coefficient estimates for the large claim probability model are presented
in table 8 and analyzed in the form of odds ratios. These show that for some of
the characteristics there is not a large difference in odds between the groups.
An example is characteristic H, where the three groups not representing missing
or invalid values have the odds 0.458, 0.532 and 0.545. These are all close to
each other, indicating no apparent difference in the risk of having a large claim
depending on which of these groups the customer belongs to. For characteristics I
and C there is a larger difference in odds ratio between the higher and the lower
groups, and their coefficient estimates are oppositely signed. This indicates that
having a lower value of characteristic I and a higher value of characteristic C
both increase the risk of a policyholder having a large claim according to this
model. For one of these characteristics the higher-risk group corresponds to the
policyholder being in a better financial situation, and for the other the opposite
holds. For the remaining characteristics (B, E, F and G) there is a certain, but
not very large, difference in risk depending on whether the customer has a higher
or a lower value of those variables. Whether the better financial situation
corresponded to a higher risk varied between these characteristics.
Since the model for the large claim size was constructed using the log link
function, the risks of different variable groups can be evaluated by exponentiating
the coefficient estimates, yielding a multiplicative model similar to the odds
ratios of the logistic model. An exponentiated coefficient gives a multiplier on
the expected response when the corresponding variable changes by one. Hence, a
group with a larger exponentiated coefficient estimate contributes more to the
predicted large claim for that variable than a group with a smaller multiplier.
This reveals large differences in risk, depending on which group the customer
belongs to, for characteristics B, C, F, G and I. For most of these characteristics
(B, F, G and I) the higher risk, i.e. more severe large claims in relation to
premium, corresponds to the higher-valued group; for characteristic C the risk is
higher for the lower-valued group. This is the one exception where the lower risk
corresponds to a worse financial situation. For some characteristics the trend is
opposite to that of the logistic model. Characteristic E, which showed a certain
risk difference in the logistic model, showed little difference in the severity
model. This indicates that there are differences between what causes large claims
to occur at all and what causes them to be severe.
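As a concrete example of this reading, take characteristic B in table 13:

\[
e^{\hat{\beta}_{B,H}} = e^{2.6932} \approx 14.8, \qquad e^{\hat{\beta}_{B,L}} = e^{1.8412} \approx 6.3,
\]

so, all else equal, a policyholder in group H is predicted to have a large claim cost, as a percentage of premium, roughly 14.8 times that of the reference group X, while group L lands at roughly 6.3 times.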
By analyzing the differences in estimates between groups in this way, one
realizes that although many characteristics were significant, and therefore kept
through the model reduction steps, not all of them would necessarily be candidates
for actual implementation in the pricing models. When, for example, there is no
apparent difference in risk between the higher and the lower group of a variable,
it may not be meaningful to include it in pricing, even if the coefficients are
significant. There can also be a problem if the risk differences between the
groups do not follow the insurer's intuition about how risk should differ between
customers. Furthermore, differences in the behaviour of the variables between the
two models, large claim probability and large claim severity, might cause
implementation problems. These are all aspects to consider when choosing which
characteristics to keep investigating, or to try to include in the actual pricing
of premiums.
5.3 Impact of risk-dependent insurance pricing
5.3.1 For If
As explained in the introduction to insurance pricing, section 2.2.5,
sophisticated pricing models that account for the customers' risks as accurately
as possible are of great interest for an insurance company such as If. An
important part of this is not only to continuously improve predictions of the
frequently occurring regular-sized claims, but also to find efficient ways to
account for the more rarely occurring large claims. Even if these are less likely
to occur than other claims, the costs for If are extensive when they do. As was
seen when analyzing the data set of commercial policyholders in this project, the
costs for large claims corresponding to some groups of policyholders amount to
several thousand percent of their paid premiums. For other groups of
policyholders, costs for large claims were non-existent or only a small fraction
of the amount those policyholders had paid. A pricing structure that is completely
fair should not disregard these variations in large claim risk between commercial
policyholders.
By charging premiums optimally risk-corrected for large claims, the premiums are
fairer than they would otherwise be. This means that If is likely to achieve a
better ratio between claim costs and collected premiums, since low-risk customers
are more prone to choose, and to stay with, If. Recall the expression for return
on equity stated in section 2.2.3: it shows that the ROE grows as the combined
ratio shrinks. A smaller combined ratio is partly acquired through fairer pricing,
since that produces a higher GEP for a lower claim cost than more uniform pricing
structures. Thus, working to achieve prices optimally adjusted for large claim
risk can produce a higher return on equity, an important goal for If.
5.3.2 For commercial policyholders
The customers may not be as keen on the idea of risk-dependent prices as the
insurance company. This project aimed to investigate the possibility of predicting
risk, and hence of pricing, with regard to characteristics of the policyholder
which in some cases did not have an obvious connection to the insured object and
its usage. Rather, they were general characteristics of the companies. Discussions
about the ethical aspects of risk-dependent insurance pricing are therefore of
interest in this case. There are already regulations in the Swedish insurance
industry forbidding insurers from using, for example, gender as an explanatory
variable in the pricing of policies. One can question at which other
characteristics one starts to approach a situation where the pricing is
discriminating.
A phenomenon somewhat related to this is the concept of price optimization, in
the literature often referred to as price discrimination. This describes the
situation where a company charges different customers different prices for the
same product, and it is related to the concept of price elasticity of demand in
microeconomics. Typically, a company wants to charge less price-sensitive
customers higher prices than more price-sensitive ones, as a way to maximize its
revenues. [12, p. 407-410]
Insurance pricing in general differs from this idea in that the prices are
dynamic with regard to the customers' risks, not their price elasticities. This
enables the insurance company to keep the lowest possible prices for all its
customers. With a uniform pricing structure, the general price level would have
needed to be higher for the insurer to be equally profitable, due to the higher
expected claim costs caused by higher-risk customers. Risk-correct pricing can
thus be considered an advantageous pricing structure from a customer perspective
as well, high-risk policyholders included.
Unlike private individuals, commercial policyholders can be in competition with
each other. For businesses, insurance costs can be substantial, and one could
argue that premiums which are risk-adjusted as far as possible create fairer
competition. Low-risk companies then need not take part in financing the risk of
companies that are more likely to have, for example, large claims. On the other
hand, since getting a lower premium than your competition gives you a competitive
advantage, it can be debated to what extent an insurer should need statistical
support for pricing with respect to a certain characteristic.
6 Conclusions
The project concludes that large claims are to some extent correlated with a
company's financial situation. However, further investigation is needed in order
to find more reliable models with respect to goodness of fit and prediction
accuracy, as well as to gain a better understanding of the impact and importance
of the different characteristics.
Large claim probability. The logistic regression shows that eight of the
characteristics are significant in predicting the occurrence of large claims.
The concordance index indicates that the model has a predictive power of about
76 percent, which means that the model to some extent lacks the ability to
predict responses accurately. An inspection of the Pearson residuals shows that
the model predicts non-events well but has difficulties predicting events,
probably due to the scarcity of large claims in the data set. Whether a better
financial situation corresponds to a higher or a lower risk of having a large
claim varies between characteristics. It is therefore difficult to draw a general
conclusion about which policyholders are the risky ones with respect to this
response variable.
Large claim cost as percentage of premium. The severity model indicates that one
of the characteristics should be eliminated and the rest kept. The Pearson
residuals showed relatively good prediction accuracy, but a tendency to predict
higher values of the response variable than what was observed. The presence of
some very large residuals also causes doubt. For the large claim severity model,
many of the characteristics indicate that a better financial situation corresponds
to a higher risk.
7 Recommendations
For further research, we recommend exploring other ways of modelling the response
variable. For example, a more thorough analysis of which distribution to use might
result in a better model that makes more accurate predictions. Specifically, we
suggest looking for, and attempting to use, a distribution with a heavier tail
than the gamma distribution; a tail comparison is sketched below. To further
improve reliability, we recommend looking deeper into the characteristics used in
this project, to gain a better understanding of which values should be viewed as
invalid or extreme, and to avoid them having too much influence on the model.
Lastly, we recommend constructing models with fewer characteristics, and perhaps
narrower groups, in order to better understand their individual effects and
thereby increase the possibility of implementing them in the pricing of policies.
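As a starting point for the distributional recommendation, the sketch below contrasts the tail of a gamma distribution with that of a lognormal using scipy; the parameter values are hypothetical and chosen only to give the two distributions equal means. The lognormal-Pareto approach of [6] would be a natural next step.

```python
import numpy as np
from scipy import stats

# Hypothetical distributions with equal means (3.0), for tail comparison only.
gamma = stats.gamma(a=2.0, scale=1.5)
sigma = 1.0
lognorm = stats.lognorm(s=sigma, scale=3.0 * np.exp(-sigma**2 / 2))

# Survival function P(X > x): the lognormal keeps far more probability mass
# at large claim sizes than the gamma, whose tail decays exponentially.
for x in [5, 10, 20, 50]:
    print(f"x = {x:>2}: gamma tail = {gamma.sf(x):.2e}, lognormal tail = {lognorm.sf(x):.2e}")
```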
References
[1] If P&C Insurance
[2] Esbjorn Ohlsson, Bjorn Johansson. Non-Life Insurance Pricing with Gener-
alized Linear Models. 2010..
[3] Amy Gallo in Harvard Business Review. A Refresher on Regression Analysis.
2015.
https://hbr.org/2015/11/a-refresher-on-regression-analysis
Accessed on 2018-04-20
[4] If P&C Insurance, lecture on generalized linear models at KTH Royal insti-
tute of technology 2018.
[5] Henrik Hult, Filip Lindskog. Heavy-tailed insurance portfolios: buffer capital
and ruin probabilities. 2006.
[6] Marco Bee. Statistical analysis of the Lognormal-Pareto distribution using
Probability Weighted Moments and Maximum Likelihood. 2012.
[7] Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining. Introduc-
tion to Linear Regression Analysis. 2012
[8] Alan Agresti. Categorical Data Analysis 2nd ed. 2002
[9] English Oxford Living Dictionaries. Overfitting
https://en.oxforddictionaries.com/definition/overfitting
Accessed on 2018-05-23
[10] Murphy, K.P., Brockman, M.J., Lee, P.K.W. Using generalized linear mod-
els to build dynamic pricing systems for personal lines insurance. In: CAS
Winter 2000 Forum
[11] PennState, Eberly College of Science. STAT 504, 7.2.1 Model Diagnostics.
https://onlinecourses.science.psu.edu/stat504/node/161/
Accessed on 2018-05-22
58
[12] Paul Krugman, Robert Wells. Economics 4th ed. 2015.
[13] Patrik Hardin, Sam Tabari. Modelling Non-Life Insurance Policyholder
Price Sensitivity. Bachelor Thesis, KTH 2017.
[14] Lovisa Laestadius, Karin Knobel. Fornyelsegrad och priskanslighet inom
foretagsforsakringar. Bachelor Thesis, KTH 2016.