DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018
Predicting Large Claims within Non-Life Insurance
JACOB BARNHOLDT
JOSEFIN GRAFFORD
KTH SCHOOL OF ENGINEERING SCIENCES
Degree Projects in Applied Mathematics and Industrial Economics
Degree Programme in Industrial Engineering and Management
KTH Royal Institute of Technology year 2018
Supervisor at If P&C Insurance: Hjalmar Heimbürger
Supervisors at KTH: Anja Janssen, Hans Lööf
Examiner at KTH: Henrik Hult
TRITA-SCI-GRU 2018:187
MAT-K 2018:06
Royal Institute of Technology, School of Engineering Sciences, KTH SCI, SE-100 44 Stockholm, Sweden. URL: www.kth.se/sci
Predicting Large Claims within Non-Life Insurance
Abstract
This bachelor thesis within the field of mathematical statistics aims to study the possibility of predicting particularly large claims from non-life insurance policies with commercial policyholders. This is done through regression analysis, where we seek to develop and evaluate a generalized linear model, GLM. The project is carried out in collaboration with the insurance company If P&C Insurance and most of the research is conducted at their headquarters in Stockholm. The explanatory variables of interest are characteristics associated with the policyholders. Due to the scarcity of large claims in the data set, the prediction is done in two steps. Firstly, logistic regression is used to model the probability of a large claim occurring. Secondly, the magnitude of the large claims is modelled using a generalized linear model with a gamma distribution. Two full models with all characteristics included are constructed and then reduced with computer-intensive algorithms. This results in two reduced models, one with two characteristics excluded and one with one characteristic excluded.
Keywords: Mathematical Statistics, Regression Analysis, Generalized Linear Model, Logistic Regression, Data Analysis, Non-Life Insurance, Insurance Pricing, Large Claims
Prediktion av storskador inom sakförsäkring
Sammanfattning
Det här kandidatexamensarbetet inom matematisk statistik avser att studera möjligheten att predicera särskilt stora skador från sakförsäkringspolicys med företag som försäkringstagare. Detta görs med regressionsanalys, där vi ämnar att utveckla och bedöma en generaliserad linjär modell, GLM. Projektet utförs i samarbete med försäkringsbolaget If Skadeförsäkring och merparten av undersökningen sker på deras huvudkontor i Stockholm. Förklaringsvariablerna som är av intresse att undersöka är egenskaper associerade med försäkringstagarna. På grund av sällsyntheten av storskador i datamängden görs prediktionen i två steg. Först används logistisk regression för att modellera sannolikheten för en storskada att inträffa. Sedan modelleras storskadornas omfattning genom en generaliserad linjär modell med en gammafördelning. Två grundmodeller med alla förklaringsvariabler konstrueras för att sedan reduceras med datorintensiva algoritmer. Det resulterar i två reducerade modeller, med två respektive en kundegenskap utesluten.
Nyckelord: Matematisk statistik, Regressionsanalys, Generaliserad linjär modell, Logistisk regression, Dataanalys, Sakförsäkring, Försäkringsprissättning, Storskador
Acknowledgements
We want to thank the analysts of the Product & Price department at If P&C Insurance for giving us the opportunity to write our bachelor thesis with them and for making us feel welcome. A special thanks goes out to Hjalmar Heimbürger, with whom we have worked most closely and who has been our advisor, mentoring us throughout the project and sharing his knowledge. We would also like to thank our thesis supervisor Anja Janssen from the Department of Mathematical Statistics at KTH Royal Institute of Technology. Anja has been of great support throughout this project, giving us advice and feedback which has been helpful and very appreciated.
Contents
1 Introduction
  1.1 Background
  1.2 Project Formulation
2 Theory
  2.1 Literature Review
  2.2 Insurance Theory
    2.2.1 Key Terms
    2.2.2 Performance Measures
    2.2.3 The Insurance Business Model
    2.2.4 Commercial Insurance Policies
    2.2.5 Insurance Pricing
    2.2.6 Accounting for Large Claims
  2.3 Mathematical Theory
    2.3.1 Regression Analysis
    2.3.2 Linear Regression Modelling
    2.3.3 Generalized Linear Models
    2.3.4 Logistic Regression
    2.3.5 Modelling Claim Severity
    2.3.6 Multicollinearity
    2.3.7 Model Validation
3 Methodology
  3.1 Data
    3.1.1 Characteristics
    3.1.2 Grouping
    3.1.3 Aggregation
    3.1.4 Response Variable
  3.2 Model Development
    3.2.1 Modelling Probability of a Large Claim
    3.2.2 Modelling Large Claim Severity
4 Results
  4.1 Multicollinearity Diagnostics
  4.2 Logistic Regression Model
    4.2.1 Full Model Goodness of Fit Diagnostics
    4.2.2 Significance of Variables in Full Model
    4.2.3 Reduced Model Goodness of Fit Diagnostics
    4.2.4 Significance of Variables in Reduced Model
    4.2.5 Final Model Coefficients
    4.2.6 Final Model
    4.2.7 Final Model Residuals and ROC
  4.3 Claim Severity Regression Model
    4.3.1 Full Model Goodness of Fit Diagnostics
    4.3.2 Results From Reducing Algorithm
    4.3.3 Reduced Model Goodness of Fit Diagnostics
    4.3.4 Reduced Model, Significance of Variables
    4.3.5 Final Model Coefficients
    4.3.6 Final Model
    4.3.7 Final Model Residuals
5 Discussion
  5.1 Model Validation and Adequacy
    5.1.1 Sources of Error or Uncertainty
    5.1.2 Assessing the Model Reductions
    5.1.3 Statistical Hypothesis Testing
    5.1.4 Prediction Accuracy
  5.2 Interpretation of Final Models
  5.3 Impact of Risk-Dependent Insurance Pricing
    5.3.1 For If
    5.3.2 For Commercial Policyholders
6 Conclusions
7 Recommendations
References
1 Introduction
1.1 Background
An insurance policy is a contract between an insurer and a customer, where the customer buys protection against financial loss from the insurer by paying a price known as a premium. The economic risk is thereby transferred from the customer, usually referred to as the policyholder, to the insurance company. Insurance contracts can differ with regard to what they cover, under what circumstances they are valid and how much is paid out by the insurer in case of an incident. This creates a wide insurance market, with insurance companies offering different types of insurance under different terms and agreements. Non-life insurance policies, also called property and casualty insurance policies, are all insurance policies that are not classified as life insurance policies. They may for example cover damage to cars, houses or other property, third party liability and costs for business interruptions.
One of the largest companies on the Nordic market offering non-life insurance policies is If P&C Insurance. They operate in Norway, Sweden, Finland, Denmark and the Baltics, with a wide customer base ranging from private individuals to large enterprises. Fully owned by the financial company Sampo plc, If operates for profit and needs to strive for business efficiency in order to stay competitive and maintain a strong market position. An important aspect of this is to develop and use sophisticated pricing models that set the optimal premiums for the customers with respect to the risk that If has undertaken by insuring them. The total amount that If collects in premiums from its customers needs to cover the costs of customer claims and administration, and also generate a return on invested capital. The best way to achieve this is to predict each customer's claims as accurately as possible.
The cost for claims can be divided into two subcategories: small claims and large claims. Small claims are more frequent and of lower individual cost than large claims, which in general are very rare. Since small and large claims are of such different character, large claims are usually handled separately in the pricing of policies.
1.2 Project Formulation
If's current model for accounting for their customers' risk of large claims in the pricing of their premiums uses a few characteristics as explanatory variables. This project aims to investigate whether this model can be made more sophisticated by taking additional customer characteristics into account, and thereby accomplish a more differentiated and risk-correct pricing.

This will be done using a generalized linear model, GLM. If will provide a large data set containing information about, for example, their customers, insurance premiums and claim costs. The response variable to predict is the cost of large claims as a percentage of the insurance premium. The explanatory variables used to model it will be customer characteristics. That is, this project aims to identify characteristics of the commercial policyholder that might affect the risk of large claims, and to what extent.

In this thesis, the distinction between small and large claims is made at 500 000 SEK. This means that the project aims to predict claims with individual costs of 500 000 SEK or more.
2 Theory
2.1 Literature Review
The insurance theory was partly acquired from introductory courses held by current analysts at the Product & Price department at If P&C Insurance and partly from Esbjörn Ohlsson's and Björn Johansson's book Non-Life Insurance Pricing with Generalized Linear Models. The book also provides insight into the mathematics behind insurance theory. The second main source for mathematical theory was Introduction to Linear Regression Analysis by Douglas Montgomery, Elizabeth Peck and Geoffrey Vining. It provided a general understanding of regression analysis and statistical model building, and also served as a practical go-to source when further knowledge on certain topics was needed to move forward during the project work.

In addition to the literature mentioned above, previous bachelor theses carried out in collaboration with If P&C Insurance were studied. These were Modelling Non-Life Insurance Policyholder Price Sensitivity by Patrik Hardin and Sam Tabari, and Förnyelsegrad och priskänslighet inom företagsförsäkringar by Karin Knobel and Lovisa Laestadius. They provided a good orientation of relevant mathematical aspects related to generalized linear models, as well as an introduction to the use of regression analysis in an insurance business setting. Their work also served as inspiration for how to approach the problem addressed in this thesis.
2.2 Insurance Theory
2.2.1 Key Terms
Here follows an introduction to important terms used in the insurance industry.

Claim: A claim arises when a customer of an insurance company suffers an accident or damage and wants to use their insurance. They report it to their insurance company, asking for reimbursement.

Policyholder: A customer of an insurance company. It can be an individual or a company.

Claim cost: Refers to the costs associated with a claim. It is often divided into subcategories:

Paid: The amount that has been paid out to the customer.

Case: The future costs for claims that have been reported. Some claims do not lead to only one direct payment, but cause additional costs after some time. This amount can be uncertain.

IBNR: Abbreviation for "Incurred But Not Reported"; refers to the costs for claims that have occurred but that the customer has not yet reported to the insurer. There is a high degree of uncertainty in estimates of these costs.

Administrative costs: Regular business expenses. Costs for staff, material, office space etc.

Premium: The price that a customer pays for their insurance. It is usually paid on a yearly basis[1].
2.2.2 Performance Measures
From the terms above, important performance and profitability measures are constructed.

Gross Written Premium (GWP): The total amount that the insurance company collects in premiums from its customers for a year.

Gross Earned Premium (GEP): The Gross Written Premium linearized with time. For example, 10 days into an insurance year, the GEP is (10/365) · GWP.

Paid Ratio: Paid / GEP

Reported Ratio: (Paid + Case) / GEP

Risk Ratio: Claim cost / GEP = (Paid + Case + IBNR) / GEP

Cost Ratio: Administrative cost / GEP

Combined Ratio: (Claim cost + Administrative cost) / GEP = (Paid + Case + IBNR + Administrative cost) / GEP [1]
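As a hedged illustration, the ratios above can be computed directly; all figures below are invented for the example (Python is used here and in the later sketches, although the computations in the thesis were made in SAS).

```python
# Illustrative computation of the measures above, with invented figures
# (all amounts in SEK).
paid, case, ibnr = 60e6, 15e6, 10e6   # claim cost components
admin = 20e6                          # administrative costs
gwp = 120e6                           # gross written premium
days_elapsed = 365                    # a fully earned insurance year

gep = (days_elapsed / 365) * gwp      # gross earned premium

risk_ratio = (paid + case + ibnr) / gep
cost_ratio = admin / gep
combined_ratio = risk_ratio + cost_ratio
print(f"risk ratio {risk_ratio:.1%}, combined ratio {combined_ratio:.1%}")
```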
2.2.3 The Insurance Business Model
The business idea of an insurance company is to provide its customers with protection against financial risk in exchange for a fee, the premium. This means that the risk is transferred from the policyholder to the insurance company. By insuring many customers, the loss of the insurance company becomes the sum of many small, approximately independent losses. The law of large numbers therefore makes the loss of an insurance company much more predictable than the loss of an individual. Since an insurance company can predict its losses to some extent, it has the possibility of making a profit by charging premiums that cover its losses and other costs, and leave room for a certain return[2, p. 1].
An important aspect of the business idea is the cash flows of an insurance company. In general, policyholders pay their premiums up front, at the beginning of the period they are buying insurance for, usually a year. Possible claims from the policyholders are therefore incurred some time after they have paid for their product. This means that the insurance company receives its revenues before it has its costs, and can therefore invest the collected premiums in the meantime and receive a return on them[1].
By Swedish law, insurance companies need to hold a capital reserve in order to always be able to help their customers, by having enough coverage for large claims and other unexpected incidents. For If P&C Insurance, this capital reserve is provided by their owner, the Finnish financial company Sampo plc. Currently, Sampo expects a return of 17.5 percent on its invested capital. If needs to do its best to meet this requirement to avoid the risk of eventually losing its financing. The return on equity, ROE, is calculated as follows:

\[ \mathrm{ROE} = \frac{\bigl(\text{Revenue} \cdot (1 - \text{Combined ratio}) + \text{Investment return}\bigr) \cdot (1 - \text{Tax rate})}{\text{Capital reserve}} \]

The combined ratio is defined in section 2.2.2 and is the percentage of the collected premiums needed to cover the expenses of the company; 1 − CR is thus the percentage of the collected premiums that the insurance company can keep as a profit. As seen in the expression for the ROE, a smaller combined ratio produces a larger ROE [1].
The investment return comes from investing the collected premiums and the capital reserve [1]. Higher investment returns benefit the ROE. Their level depends on the current return on the securities If invests in and is thus highly sensitive to interest rates and the economy as a whole.
2.2.4 Commercial Insurance Policies
If's insurance business is divided into four sections: private customers, commercial customers, industry and the Baltics. Commercial insurance policies can have a complicated structure. A company can decide to take out full insurance or to insure only some parts; this is called the insurance level. The objects that the company wants to insure are called exposures, and insurances of different sizes cover different exposures. An exposure can for example be a car or a property, and the set of exposures an insurance covers is called the exposure level. The exposures can have different product levels. A product is a collection of what the insurance will cover for an exposure; the parts to include are called product modules[1].
2.2.5 Insurance Pricing
It has been shown to be most advantageous for insurance companies to charge risk-correct, or fair, prices. This means that the premium each customer pays depends on the individual risk they pose to the insurer. Simplified, one can say that a customer with a higher expected claim cost should pay more for their policy premium than a customer with a lower expected claim cost[2, p. 2].
The reason is that this has proven to be the best alternative from a business perspective for an insurer acting in a competitive market. In a situation where the policyholders are charged the same prices independent of their respective risks, some customers, who are of high risk, get an unfairly low price at the expense of other customers of lower risk. The customers charged a too high premium are then likely to be lost to a competitor that offers a fairer premium. At the same time, the insurer with the uniform prices will attract more high-risk customers, who benefit from a uniform pricing structure. The result is an undesirable economic situation where the insurer loses profitable deals and gains unprofitable ones[2, p. 2].
To avoid these problems and accomplish fair prices, the insurer needs to predict, as accurately as possible, the expected losses from each of its customers. This is typically done by creating a so-called tariff, a set of tables that calculates the premium for any given customer by taking into account the values of a number of variables, so-called rating factors, for that particular customer. The rating factors are often properties of the policyholder or of the insured object. For a driver looking to insure their vehicle, the tariff could for example account for the driver's age and the weight of the vehicle, two factors that are likely to affect the risk. Creating a tariff is typically done by using regression analysis to model the relationship between possible rating factors, acting as explanatory variables, and a response variable that says something about the risk of a customer[2, p. 2].
The details of how the premium is calculated can differ between insurance companies, but it is common to predict the expected losses from, and hence the risk of, a customer by modelling the response variables claim severity and claim frequency for claims of small or average cost. Claims with costs above a certain threshold, large claims, are often modelled separately.
After the risk of the customer is predicted, the final premium is calculated by adding a certain amount to cover administrative costs. Finally, the price is multiplied by an adjustment term that consists of appropriate coefficients adjusting the price according to, for example, discounts from alliance agreements[1].

Premium = (Risk + Administrative costs + Return) · Adjustment term
2.2.6 Accounting for Large Claims
As mentioned, there is a point in separating the modelling of small and large claims. One reason is that dominating large claims can make the estimates of the total claim cost very volatile. A common approach is to truncate the claims at a certain point and leave out the part of the claim cost above this threshold from the modelling of the small and average-sized claims. The cost above the truncation point also needs to be accounted for in the premium. One way to do this is to simply assume that the differences in risk for large claims are the same as those for the other claims, and therefore distribute the cost for large claims among the policyholders accordingly. This means only adjusting the base premium so that the overall premium level is adequate.

There is no obvious way to account for differences in contribution to large claims between different groups of policyholders. One approach is to estimate the proportion of large claims using a generalized linear model with a binomial distribution and a logit link. One then models the number of large claims divided by the total number of claims. An estimate of the large claim frequency is then given by multiplying the original claim frequency by the predicted proportion of large claims[2, p. 63-64]. The severity of a large claim, being a rather extreme event, can be assumed to have a heavy-tailed distribution[5, p. 2]. It is often modelled with, for example, the log-normal or a Pareto distribution, which better capture the right tail of the distribution arising from high-impact events of low probability[6, p. 2].
2.3 Mathematical Theory
2.3.1 Regression Analysis
Regression analysis comprises a set of statistical techniques for finding and estimating the relationships between different variables, with numerous applications in many fields[7, p. 2]. The typical idea is to find a relation for how the values of one or more independent variables impact the outcome of another, dependent variable of interest. This is done by attempting to fit a function to a data set of observations of the dependent and independent variables[3]. The data is of high importance since it constitutes the foundation of the model that is fitted; it can be retrieved through, for instance, an observational study or an experiment. One could also perform regression analysis on a set of historical data that has been saved, with or without the original intention of investigating the relationships of interest.
2.3.2 Linear regression modelling
The simplest form of regression model is the one where the dependent variable, often called the response variable, is assumed to have a linear relationship to one independent variable, often called the predictor variable, regressor variable or explanatory variable. This is called simple linear regression and implies a straight-line relationship of the form

\[ y = \beta_0 + \beta_1 x + \varepsilon \tag{1} \]

where β0 denotes the intercept, β1 the slope and ε is an error term.

The coefficients β0 and β1 are unknown parameters, called regression coefficients. Fitting a regression model means estimating these coefficients using sample data consisting of a number of observation pairs (y1, x1), ..., (yn, xn)[7, p. 13]. With the coefficient estimates, a model is created that can be used to predict the value of the response variable y at a point x. Since the number of data points is typically greater than two, it is unlikely that all data points lie on a perfectly straight line. Instead, they create an overdetermined system of linear equations.

The simple linear regression concepts can be generalized to the case where the response variable may be related to several explanatory variables, yielding a multiple linear regression model of the form

\[ y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + \varepsilon \tag{2} \]
For both simple and multiple linear regression, estimating the model coefficients is most commonly done through the method of least squares. Another commonly used method for coefficient estimation in regression analysis is maximum likelihood estimation.

In linear regression modelling, the errors are assumed to have mean zero and unknown constant variance σ², and to be uncorrelated. Furthermore, a common assumption is that the errors are normally distributed, which is required for procedures for evaluating model parameters, such as hypothesis testing and the construction of confidence intervals.
2.3.3 Generalized Linear Models
Generalized linear models are a special form of regression models that can be used when the usual assumptions of normality and constant variance are not satisfied. In generalized linear models, the distribution of the response variable is not required to be normal. Instead, it needs to belong to the exponential family. The exponential family of distributions includes the normal distribution, the Poisson distribution and the gamma distribution, among others[7, p. 421]. Members of the exponential family have probability density functions (or probability mass functions) that can be expressed in the following form:

\[ f(y_i; \theta_i, \varphi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{\varphi / w_i} + h(y_i, \varphi, w_i) \right\} \tag{3} \]

where θi is a so-called natural location parameter that varies with i, φ is a positive scale or dispersion parameter and wi is a weight greater than zero. b(θi) is the cumulant function. This function is twice continuously differentiable, and every choice of such a function yields a family of probability distributions, such as the normal or the Poisson distributions. The function h(yi, φ, wi) is of little interest in GLM theory, but is required in order for the total probability to equal one[4][2, p. 17].
The idea behind generalized linear models is to obtain a linear model for a function of the expected value of the response variable. Define the linear predictor ηi by

\[ \eta_i = g(E[y_i]) = g(\mu_i) = x_i^T \beta \tag{4} \]

where the function g is an appropriately chosen link function, which relates the mean of the response to the linear predictor through

\[ E[y_i] = g^{-1}(x_i^T \beta) \tag{5} \]

There are several possible choices of link function, but it is common to choose ηi = θi, where θi is the natural location parameter of the distribution assumed for the response variable. This link function is called the canonical link of that distribution[7, p. 451].

The parameter estimates in a generalized linear model are calculated as the maximum likelihood estimates[7, p. 452]. This means maximizing the likelihood function with the chosen link function inserted, which is equivalent to finding the parameter estimates β̂ that maximize the log-likelihood function. With the estimated parameters β̂, the model becomes

\[ \hat{y}_i = g^{-1}(x_i^T \hat{\beta}) \tag{6} \]

where g is the link function. This gives estimates of the mean response at points x of interest.
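To make the fitting procedure of equations (4)-(6) concrete, the following minimal sketch simulates data and fits a GLM with the Poisson distribution and its canonical log link using Python's statsmodels package; the thesis itself used SAS, and the design matrix, coefficients and seed here are all invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))   # design matrix with intercept
beta_true = np.array([0.2, 0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))           # E[y_i] = g^{-1}(x_i^T beta), log link

result = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # ML estimates
print(result.params)                             # estimates of beta
```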
2.3.4 Logistic Regression
Logistic regression is a special type of generalized linear modelling where the response variable is binary. Typically, the response takes one of the values 0 and 1, often representing a non-event and an event respectively. The model aims to predict the probability of the response variable taking the value 1. The response variable is thus a Bernoulli random variable, taking only binary values[7, p. 428]. Consider the linear model

\[ y_i = x_i^T \beta + \varepsilon_i \]

and assume that the response variable is a Bernoulli random variable with distribution

\[ P(y_i = 1) = \pi_i, \qquad P(y_i = 0) = 1 - \pi_i \]

The expected value is then

\[ E[y_i] = 1 \cdot \pi_i + 0 \cdot (1 - \pi_i) = \pi_i \]

which implies that

\[ E[y_i] = x_i^T \beta = \pi_i \]

Since the response is binary and hence restricted to the values 0 and 1, the errors can only take one of two possible values:

\[ \varepsilon_i = 1 - x_i^T \beta \quad \text{or} \quad \varepsilon_i = -x_i^T \beta \]

The errors are thus not normally distributed. Nor is the error variance constant:

\[ \operatorname{Var}(y_i) = E\left[(y_i - E[y_i])^2\right] = (1 - \pi_i)^2 \pi_i + (0 - \pi_i)^2 (1 - \pi_i) = \pi_i(1 - \pi_i) \]

Thus Var(yi) = E[yi](1 − E[yi]), which means that the variance is a function of the mean. Since 0 ≤ πi ≤ 1, we have 0 ≤ E[yi] ≤ 1. This constraint makes the linear response function shown earlier an infeasible choice for predicting the binary response. A strictly increasing or decreasing S- or reversed-S-shaped function is better suited, which is why the so-called logistic response function is used:

\[ E[y] = \frac{\exp(x^T \beta)}{1 + \exp(x^T \beta)} = \frac{1}{1 + \exp(-x^T \beta)} \]

This can easily be transformed into a linear model by introducing the linear predictor η = x^T β, where η = ln(π / (1 − π)). This transformation of the probability π is called the logit transformation, and the ratio π / (1 − π) is called the odds.
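A corresponding sketch for logistic regression, where exponentiating the fitted coefficients gives odds ratios; the data and coefficients are again invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(1000, 3)))
eta = X @ np.array([-2.0, 0.8, -0.5, 0.3])     # linear predictor eta = x^T beta
pi = 1.0 / (1.0 + np.exp(-eta))                # logistic response function
y = rng.binomial(1, pi)                        # Bernoulli response

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit is canonical
print(np.exp(fit.params))                      # e^{beta_j}: estimated odds ratios
```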
2.3.5 Modelling Claim Severity
We now turn to claim severity. Here, a measure of the claim size is of interest: Y = X/w, where X is the total claim cost in the cell, w is the exposure and Y is the claim severity weighted by the exposure.

X is a random variable, and it is not clear which distribution to assume for it. However, the gamma distribution has become a standard in GLM analysis of claim severity[10, p. 10]. The gamma assumption implies that the standard deviation is proportional to E[Y], which means that we have a constant coefficient of variation[2, p. 20].

To derive the density of Y, first consider the case w = 1:

\[ f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad x > 0 \tag{7} \]

We denote this distribution G(α, β) for short. The expectation is α/β and the variance α/β² [2, p. 20]. A sum of independent gamma-distributed random variables with the same scale parameter β is gamma distributed with the same scale parameter and with index parameter equal to the sum of the individual α's. With X being the sum of w independent G(α, β) random variables, we have X ∼ G(wα, β).

The density of the claim severity Y = X/w is then

\[ f_Y(y) = w f_X(wy) = \frac{(w\beta)^{w\alpha}}{\Gamma(w\alpha)} y^{w\alpha - 1} e^{-w\beta y}, \quad y > 0 \tag{8} \]

Thus Y ∼ G(wα, wβ), with expectation α/β. This distribution can be transformed to the exponential family form shown above. Before doing so, a re-parameterization is made with µ = α/β > 0 and φ = 1/α > 0. Now,

\[ f_Y(y; \mu, \varphi) = \frac{1}{\Gamma(w/\varphi)} \left( \frac{w}{\mu\varphi} \right)^{w/\varphi} y^{w/\varphi - 1} e^{-wy/(\mu\varphi)} = \exp\left( \frac{-y/\mu - \log(\mu)}{\varphi / w} + c(y, \varphi, w) \right), \quad y > 0 \tag{9} \]

where

\[ c(y, \varphi, w) = \frac{w}{\varphi} \log\left( \frac{wy}{\varphi} \right) - \log(y) - \log\Gamma\left( \frac{w}{\varphi} \right) \]

and

\[ E(Y) = \frac{w\alpha}{w\beta} = \mu, \qquad \operatorname{Var}(Y) = \frac{w\alpha}{(w\beta)^2} = \frac{\varphi\mu^2}{w} \]

To show that this gamma distribution is a member of the exponential family, we change parameter to θ = −1/µ < 0. By attaching the index i and setting b(θi) = −log(−θi), we can write

\[ f_{Y_i}(y_i; \theta_i, \varphi) = \exp\left( \frac{y_i\theta_i - b(\theta_i)}{\varphi / w_i} + c(y_i, \varphi, w_i) \right) \tag{10} \]

which is of exponential family form. Hence, the gamma distribution can be used in generalized linear models.
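A hedged sketch of fitting the exposure-weighted gamma model (10) in statsmodels, here with a log link (as in the severity model used later in the thesis) rather than the canonical link; the exposure weights enter through the var_weights argument, and all names and numbers are assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
X = sm.add_constant(rng.normal(size=(n, 2)))
w = rng.integers(1, 10, size=n).astype(float)     # exposure weights w_i
mu = np.exp(X @ np.array([1.0, 0.4, -0.6]))       # mean via log link
phi = 0.5                                         # dispersion parameter
y = rng.gamma(shape=w / phi, scale=mu * phi / w)  # E(Y)=mu, Var(Y)=phi*mu^2/w

fit = sm.GLM(y, X,
             family=sm.families.Gamma(link=sm.families.links.Log()),
             var_weights=w).fit()
print(fit.params)
```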
2.3.6 Multicollinearity
In regression models with multiple regressors, the significance of the model can be disturbed if there is correlation between the explanatory variables; this is called multicollinearity. There is said to be multicollinearity if the regressors are nearly linearly dependent. The optimal situation for avoiding multicollinearity is orthogonal regressors; regressors are normally not orthogonal, but the lack of total orthogonality need not be serious[7, p. 285]. If the regressors are nearly perfectly linearly dependent, however, the results will most likely be misleading or erroneous, since the variances of the coefficient estimates become large, which needs to be avoided as far as possible.
2.3.6.1 Correlation Matrix One method for detecting multicollinearity between variables is to examine the correlation matrix. Given a regression model in matrix form, with the regressors centered and scaled to unit length, the correlation matrix is X^T X, with elements denoted r_{i,j}, where i and j represent the indices of the regressor variables. All diagonal elements r_{i,i} of this matrix are equal to one, and the degree of collinearity between two different variables x_i and x_j is assessed by inspecting the absolute value of their corresponding off-diagonal element r_{i,j}. If the regressors are nearly linearly dependent, r_{i,j} will be close to unity. This is a simple way of detecting dependencies between pairs of regressor variables. If more than two variables are involved in a near-linear dependence, it may not be captured by the correlation matrix, which is why other methods are needed for a more thorough analysis[7, p. 293-294].
2.3.6.2 Variance Inflation Factors Another method for detecting multicollinearity is to examine the variance inflation factors, VIFs. These are found as the diagonal elements C_{j,j} of the matrix C = (X^T X)^{-1} and can be written as

\[ \mathrm{VIF}_j = C_{j,j} = (1 - R_j^2)^{-1} \]

where R_j² is the coefficient of determination obtained when the variable x_j is regressed on the remaining regressors.
The coefficient R_j² can be viewed as the proportion of the variability in x_j that is explained by the remaining regressor variables. Values of R_j² close to 1 imply that most of the variability in x_j is explained by the other independent variables. Hence, if R_j² is near unity, VIF_j is large and x_j is nearly linearly dependent on some subset of the remaining regressors. The VIF of each variable in the model measures the combined effect of the dependencies among the regressor variables on the variance of that particular coefficient estimate. If one or more variables have a large VIF, it indicates a problem with multicollinearity. There is no formal rule for when a VIF is to be considered large, but practical experience indicates that regression coefficients may be poorly estimated if some VIF exceeds 10[7, p. 296-297].
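A sketch of computing VIFs with the helper in statsmodels, on simulated data where one regressor is deliberately made nearly collinear with another:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)   # nearly collinear with x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

for j, name in enumerate(["x1", "x2", "x3"]):
    # regresses column j on the others and returns (1 - R_j^2)^(-1)
    print(name, variance_inflation_factor(X, j))
```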
2.3.7 Model Validation
2.3.7.1 Hypothesis Testing In regression it is important to test whether the coefficient estimates of the model are significant; a coefficient is meaningful only if its explanatory variable has a genuine relationship to the response variable. The general way to perform such tests in regression is to formulate hypotheses[7, p. 84]:

\[ H_0: \beta_j = 0, \qquad H_1: \beta_j \neq 0 \]

If the null hypothesis is rejected, the characteristic variable associated with βj contributes significantly to the model; if not, the variable should be excluded. To decide whether the null hypothesis is to be rejected, a p-value is calculated. The p-value is the probability of obtaining results similar to or more extreme than those observed, given that the null hypothesis is true. If the p-value is larger than the selected significance level, the null hypothesis is not rejected. The p-value must be equal to or less than the significance level in order to reject the null hypothesis and declare that the variable is significant and should not be excluded from the model. The p-value is derived through comparison between the test statistic and its distribution under the null hypothesis.
2.3.7.2 Deviance Goodness of Fit Test Deviance is a measure used to assess goodness of fit for generalized linear models. It compares the current model with a saturated model, quantifying how far the likelihood of the current model falls below the perfect fit. The saturated model is a trivial model of no individual interest, often used as a benchmark when assessing the goodness of fit of other models, since it fits the data perfectly[2, p. 39]. Deviance is defined as

\[ D = 2 \ln \frac{L(\text{saturated model})}{L(\text{current model})} \tag{11} \]

where L is the likelihood function. The deviance approximately follows a χ²-distribution with n − k degrees of freedom, where n is the number of observations and k the number of parameters in the current model. The adequacy of the model can be evaluated using the deviance, since a low deviance with a large p-value suggests that the current model is a satisfactory fit. Another way to use the measure is to divide the deviance by its number of degrees of freedom; a ratio larger than 1 indicates that the current model is not a good fit[7, p. 433].
2.3.7.3 Pearson Chi-Squared (χ²) The Pearson chi-squared test, also called the χ²-test, is a goodness of fit measure for logistic regression models which compares the observed and expected numbers of successes and failures for each covariate class in the observations[7, p. 432]. If the expected number of successes is n_i·π̂_i and the expected number of failures is n_i·(1 − π̂_i), then

\[ \chi^2 = \sum_{i=1}^{n} \left( \frac{(y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i} + \frac{\left[(n_i - y_i) - n_i(1 - \hat{\pi}_i)\right]^2}{n_i(1 - \hat{\pi}_i)} \right) = \sum_{i=1}^{n} \frac{(y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i(1 - \hat{\pi}_i)} \tag{12} \]

The Pearson chi-squared statistic can then be compared to a chi-squared distribution with n − k degrees of freedom. Goodness of fit is indicated by small values of the statistic and/or a large p-value.
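A direct transcription of equation (12), with invented observed counts and fitted probabilities (here with k = 2 fitted parameters assumed):

```python
import numpy as np
from scipy import stats

def pearson_chi2(y, n, pi_hat):
    """Equation (12): y = observed successes, n = group sizes, pi_hat =
    fitted probabilities, all arrays over the covariate classes."""
    return np.sum((y - n * pi_hat) ** 2 / (n * pi_hat * (1.0 - pi_hat)))

y = np.array([3.0, 1.0, 7.0, 2.0])
n = np.array([40.0, 25.0, 60.0, 30.0])
pi_hat = np.array([0.08, 0.05, 0.11, 0.07])
stat = pearson_chi2(y, n, pi_hat)
print(stat, stats.chi2.sf(stat, df=len(y) - 2))   # small stat / large p = good fit
```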
2.3.7.4 Receiver Operating Characteristic (ROC) The receiver operating characteristic is used to assess the predictive power of a logistic regression model. The ROC curve is a plot of sensitivity as a function of (1 − specificity), where the sensitivity measures the model's ability to predict events correctly and the specificity measures its ability to predict non-events correctly:

\[ \text{sensitivity} = P(\hat{y} = 1 \mid y = 1) \qquad \text{and} \qquad \text{specificity} = P(\hat{y} = 0 \mid y = 0) \]

where ŷ represents the model's predicted value of the response variable y. The curve summarizes the predictive power over all possible threshold probabilities π0, which act as cut-off points: an observation with a predicted probability higher than π0 is classified as 1, and an observation with a lower probability is classified as 0. The ROC curve is usually evaluated through the area under it, AUC; the larger the area under the curve, the better the predictions. The AUC measures the probability that the predictions and the outcomes are concordant, i.e. that an observation with y = 1 also gets a larger predicted response ŷ than an observation with y = 0. Thus an AUC of 0.9 is a good result, while an AUC of 0.5 means that the predictive power of the model is no better than random guessing[8, p. 228-229].
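A sketch of computing the ROC curve and AUC with scikit-learn, on synthetic rare-event data meant only to illustrate the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(4)
y_true = rng.binomial(1, 0.1, size=1000)       # rare events, as for large claims
# crude synthetic scores: events tend to get higher predicted probabilities
p_hat = np.clip(0.10 + 0.25 * y_true + rng.normal(0.0, 0.15, 1000), 0.0, 1.0)

fpr, tpr, thresholds = roc_curve(y_true, p_hat)  # (1 - specificity, sensitivity)
print("AUC:", roc_auc_score(y_true, p_hat))      # 0.5 = random, 1.0 = perfect
```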
2.3.7.5 Wald Chi-Squared The Wald chi-squared test, also known as the Wald test, is a significance test based on large-sample properties of maximum likelihood estimators, and can be seen as a rough approximation of the likelihood ratio test. A likelihood ratio test, however, requires at least two models, while the Wald test can be run with only one. The Wald test assesses the significance of an explanatory variable through a null hypothesis test: the squared deviation of the maximum likelihood estimate β̂ from its hypothesized value β0, divided by the estimated variance of β̂, asymptotically follows a χ²-distribution. The statistic is compared to a χ²-distribution with 1 degree of freedom, and the null hypothesis is rejected if the Wald statistic exceeds the corresponding critical value[7, p. 437]. The Wald statistic is

\[ W = \frac{(\hat{\beta} - \beta_0)^2}{\widehat{\operatorname{Var}}(\hat{\beta})} \tag{13} \]
2.3.7.6 AIC and BIC The Akaike Information Criterion, AIC, is an estimate of the information expected to be lost by a model and is calculated as

\[ \mathrm{AIC} = -2\ln(L) + 2p \]

where L is the maximized value of the likelihood function for the model and p is the number of parameters in the model[7, p. 336]. The AIC rewards goodness of fit, as seen from the likelihood term that lowers the value, but also puts a penalty on adding explanatory variables, to discourage overfitting. Since adding variables to a model almost always improves the goodness of fit, the AIC is a trade-off between getting a good fit from including many variables and avoiding the risk of an overfitted model that predicts poorly on unseen data[9]. The Akaike Information Criterion is used for variable selection by comparing subset models against one another to determine which one is better[7, p. 332]. A lower value of AIC is desired, since it is an estimate of the information loss of a model. However, since the AIC is a relative measure of model fit, it says nothing about the absolute quality of a single model; the AIC of a model can only be assessed in relation to the AIC of other models.

The Bayesian Information Criterion, BIC, is an extension of the AIC that puts a greater penalty on adding explanatory variables as the sample size increases. There are several BIC measures, one of the more commonly used being the one defined by Schwarz (1978):

\[ \mathrm{BIC} = -2\ln(L) + p \cdot \ln(n) \]

where n is the number of observations in the model. The BIC is interpreted and used in essentially the same way as the AIC, and the two measures are often used together to complement each other. The Akaike and Bayesian information criteria are both commonly used for more complex modelling situations such as GLMs[7, p. 337].
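The two criteria are easy to compute from a maximized log-likelihood; the values below are invented for a toy comparison of a full and a reduced model.

```python
import numpy as np

def aic(loglike: float, p: int) -> float:
    return -2.0 * loglike + 2.0 * p

def bic(loglike: float, p: int, n: int) -> float:
    return -2.0 * loglike + p * np.log(n)

# Hypothetical full model (10 parameters) vs reduced model (8 parameters);
# lower values are better for both criteria.
print(aic(-1000.0, 10), aic(-1003.5, 8))
print(bic(-1000.0, 10, 500), bic(-1003.5, 8, 500))
```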
2.3.7.7 Residual Analysis Residuals are one of the most appropriate tools for conducting model adequacy checks in regression analysis. A residual is the difference between the observed value of the response variable and the value of the response predicted by the model; the residuals therefore give an indication of how accurately the model predicts responses. An ordinary residual is defined as

\[ e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n \]

where i is the index of a specific observation[7, p. 130]. In generalized linear models, one of the most common choices of residuals is the Pearson residual[2, p. 53]. Pearson residuals are based on the idea of subtracting off the predicted mean and dividing by an estimate of the standard deviation of the observed value, and are defined as

\[ r_{P,i} = \frac{y_i - \hat{y}_i}{\sqrt{\widehat{\operatorname{Var}}(y_i)}}, \quad i = 1, 2, \ldots, n \]

If most of the Pearson residuals of a model lie within a band between −3 and +3, it is an indication that the model has high predictive power[11].
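A sketch of extracting Pearson residuals from a fitted GLM and checking the band between −3 and +3; the simulated model is illustrative only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(400, 2)))
y = rng.poisson(np.exp(X @ np.array([0.3, 0.5, -0.4])))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

resid = fit.resid_pearson                 # (y_i - mu_hat_i)/sqrt(Var_hat(y_i))
print(f"{np.mean(np.abs(resid) <= 3):.1%} of Pearson residuals in [-3, 3]")
```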
3 Methodology
3.1 Data
The first part of the project was to manage the data before any analyses could
be done. Due to the sheer amount of raw data provided by If, structuring,
aggregating and choosing relevant data was a major part of this project.
3.1.1 Characteristics
Since If was primarily interested in customer-related causes of large claims, we were handed a set of nine possible rating factors that all represented customer characteristics associated with commercial policyholders. These nine rating factors were different characteristics that in different ways described the customers' financial situations. Some characteristics were continuous, either on some closed interval or on the real line, and some were categorical. The nine characteristics were chosen as explanatory variables for the initial model to predict large claims. A tenth rating factor, the product code, was added to keep track of any effects arising from the type of insurance. Since the data quality of the first nine characteristics varied between observations associated with different countries, the analysis was restricted to the country with the best quality of data with regard to those variables.
3.1.2 Grouping
The characteristics were divided into discrete groups, where each group was represented by one explanatory variable in the regression, each receiving its own coefficient estimate. Each observation belonged to one group per characteristic, where each group acted as a dummy variable. The main reason behind this approach was to separate any missing or extreme values in the data from the more reasonable ones, without having to exclude those observations altogether. By dividing the valid values of continuous characteristics into, in most cases, one group with higher and one with lower values, it was easier to distinguish the effects of the level of the characteristic in question. Built-in procedures in SAS were used to get an overview of the spread of the values of the analyzed characteristics. This was to ensure that there was a sufficient amount of data in each group, which is important to avoid erroneous output. All explanatory variables, consisting of the groups of the 10 characteristics A-J, are presented in table 1. Groups denoted by 'H' represent the higher values of the corresponding characteristic and groups denoted by 'L' the lower values. Groups denoted by 'X' or 'Missing' represent missing or invalid values of the characteristic.
Table 1: Grouping of Characteristics
Variable Grouping
Characteristic A H/L/X
Characteristic B H/L/X
Characteristic C H/L/X
Characteristic D H/L/X
Characteristic E H/L/X
Characteristic F H/L/X
Characteristic G H/L/X
Characteristic H 1-3
Characteristic I H/L/X
Characteristic J 1-23
H = High values
L = Low values
X = Extreme/missing values
Numbers = Groups in categories
3.1.3 Aggregation
After the characteristics were grouped and attached to the initial data, a SAS procedure was used to aggregate the data to a less granular level than the initial data table. The data was aggregated on the basis of year, product code and the other characteristics of interest, and at the same time the observed values of the response variable were summed. This resulted in a table where each row represented a unique combination of the variables mentioned, together with the resulting sums of the response variable. Many customers could thus together make up a single row, reducing the number of rows in the table significantly. Each row was then to act as one observation in the regression modelling.
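A minimal pandas sketch of this aggregation step; the column names and records are hypothetical stand-ins for the real policy data.

```python
import pandas as pd

# One policy-level row per record; all column names are hypothetical.
policies = pd.DataFrame({
    "year":             [2016, 2016, 2017],
    "product_code":     ["P1", "P1", "P2"],
    "char_A":           ["H", "H", "L"],
    "large_claim_cost": [0.0, 600_000.0, 0.0],
    "premium":          [50_000.0, 80_000.0, 30_000.0],
})

# Aggregate to one row per unique combination, summing the response parts.
aggregated = (policies
              .groupby(["year", "product_code", "char_A"], as_index=False)
              .agg(large_claim_cost=("large_claim_cost", "sum"),
                   premium=("premium", "sum")))
print(aggregated)
```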
3.1.4 Response Variable
The variable that If asked us to model in this project was the cost of large claims as a percentage of the premium. This is not a common response variable in generalized linear modelling, and there is therefore no obvious way to model it presented in the mathematical or insurance literature. When modelling claim severity and claim frequency for smaller claims, it is practically a standard to use the gamma and the Poisson distributions, respectively. As the response variable to predict in this project is not as well studied, we chose to decide on a suitable distribution by inspecting a histogram of all values of the response variable in our data set. This showed that the majority of the observations had a response value of zero. This was not surprising, since large claims are to be viewed as more or less extreme events, which is why few insurance contracts have any costs for large claims associated with them.

Since no distribution in the exponential family has such a large mass at zero, we chose to divide the analysis into two parts. In the first part, all observations were used to predict the probability of a large claim occurring, using a binary response variable and logistic regression. In the second part, only the observations with a large claim were extracted, with the intent to model the requested response variable, cost of large claims as a percentage of premium. Only the part of the large claim exceeding the large claim truncation point of 500 000 SEK was included in the response variable. A histogram showed that the data set contained observed values of the response variable ranging from almost zero to very large values, with the bulk at small values and a long tail. Due to the strong resemblance to a gamma distribution, this distribution was chosen for the response variable.
3.2 Model Development
Since the analysis was divided into two parts, there were two different model developments: one for large claim probability and one for large claim severity. The two models need not retain the same characteristics; a characteristic can be excluded from the probability model but kept in the severity model. The models were built in parallel from the same initial model and reduced independently of each other.
3.2.1 Modelling Probability of a Large Claim
The initial development of the model for the probability of a large claim required a binary response variable. The large claims in the initial data were recorded either as missing, meaning no claims; as zero, meaning that there could have been a claim but below the 500 000 SEK cost that classifies it as large; or as some number corresponding to the value above the large claim threshold. All missing values were set to zero, and all values above zero were set to one.

With a SAS procedure calculating variance inflation factors and the correlation matrix of the explanatory variables, multicollinearity could be investigated before any logistic regression was initiated. These multicollinearity diagnostics are presented in section 4.1. With the variance inflation factors and the correlation between certain variables in mind, the logistic regression model was constructed with a built-in SAS procedure. The SAS procedure calculates the estimates with maximum likelihood and evaluates the odds ratios. The procedure also has a selection function which, by removing and adding variables, evaluates different combinations of variables and performs selection based on the significance levels of the variables.

The reduced model eliminated two variables. The estimates were analyzed and examined with respect to their plausibility. To be able to conclude that the reduced model was an improvement, it was compared to the full model using AIC, BIC and AUC (area under the ROC curve). With no multicollinearity and improved goodness of fit, the reduction of the model stopped here. Section 4.2 presents the goodness of fit diagnostics for both the full model and the reduced model, as well as the variables included and their corresponding significance levels for the two models.

The logistic regression thus resulted in an equation of the following form:

\[ \ln\left( \frac{\pi_i}{1 - \pi_i} \right) = \sum_{j=0}^{38} x_{i,j} \hat{\beta}_j \tag{14} \]
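A rough sketch of significance-based backward elimination in the spirit of the SAS selection routine described above; the threshold, model family and data are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X: pd.DataFrame, alpha: float = 0.05):
    """Drop the least significant variable until all p-values <= alpha."""
    cols = list(X.columns)
    while cols:
        fit = sm.GLM(y, sm.add_constant(X[cols]),
                     family=sm.families.Binomial()).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()              # least significant remaining variable
        if pvals[worst] <= alpha:
            return fit, cols                # everything left is significant
        cols.remove(worst)                  # eliminate it and refit
    return None, []

# Tiny illustration with one informative and one pure-noise regressor.
rng = np.random.default_rng(9)
X = pd.DataFrame({"x1": rng.normal(size=800), "noise": rng.normal(size=800)})
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * X["x1"]))))
fit, kept = backward_eliminate(y, X)
print("kept:", kept)
```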
3.2.2 Modelling Large Claim Severity
Since the initial variables of the severity model and the probability model are the same, the VIF and correlation matrix results could be reused for this model. The response variable was constructed by dividing the large claim by the premium, and a table was made containing only the extracted observations with a response variable above zero. Once again, all missing claim values were set to zero.

A GLM with a log link was fitted in SAS, and again a built-in function for variable selection was used, which selected and kept variables with respect to their significance. This time only one variable was eliminated. The estimates were analyzed and examined with respect to their plausibility. To be able to conclude that the reduced model was an improvement, it was compared to the full model using AIC and BIC. With no multicollinearity and improved goodness of fit, the reduction of the model stopped here. The results and diagnostics are presented in section 4.3.

This log-gamma GLM thus resulted in a model of the following form:

\[ \ln(\hat{y}_i) = \sum_{j=0}^{49} x_{i,j} \hat{\beta}_j \tag{15} \]
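For completeness, the two parts can be fitted as sketched below. Combining them multiplicatively into one expected cost share is a natural reading of the two-step approach, but note that the thesis itself reports the two models separately; everything in the sketch is simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 2)))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-2.5, 0.6, -0.4]))))
occurred = rng.binomial(1, p_true)           # did a large claim occur?
cost_share = rng.gamma(2.0, 0.5, size=n)     # severity where a claim occurred

# Part 1: probability of a large claim, fitted on all observations.
logit_fit = sm.GLM(occurred, X, family=sm.families.Binomial()).fit()

# Part 2: cost as a share of premium, fitted only on observations with a claim.
mask = occurred == 1
gamma_fit = sm.GLM(cost_share[mask], X[mask],
                   family=sm.families.Gamma(sm.families.links.Log())).fit()

# One natural way to combine the parts into an overall prediction:
expected_share = logit_fit.predict(X) * gamma_fit.predict(X)
print(expected_share[:5])
```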
4 Results
4.1 Multicollinearity Diagnostics
Table 2: Variance Inflation Factors
Variable VIF
Characteristic A 1,48491
Characteristic B 1,1308
Characteristic C 3,21279
Characteristic D 1,19488
Characteristic E 5,11038
Characteristic F 4,83526
Characteristic G 4,62401
Characteristic I 1
Table 3: Correlation Matrix
Characteristics G A D C F B E I
G 1 0,0892 0,1421 -0,6726 0,0094 -0,1987 -0,4262 -0,0003
A 0,0892 1 -0,0823 0,1571 -0,4242 -0,1299 0,0525 -0,0003
D 0,1421 -0,0823 1 -0,0174 0,0599 -0,1103 -0,2729 -0,0011
C -0,6726 0,1571 -0,0174 1 -0,253 0,1208 0,1921 0,0002
F 0,0094 -0,4242 0,0599 -0,253 1 0,131 -0,6635 -0,0011
B -0,1987 -0,1299 -0,1103 0,1208 0,131 1 -0,0819 -0,0002
E -0,4262 0,0525 -0,2729 0,1921 -0,6635 -0,0819 1 0,0011
I -0,0003 -0,0003 -0,0011 0,0002 -0,0011 -0,0002 0,0011 1
4.2 Logistic Regression Model
4.2.1 Full Model Goodness of Fit Diagnostics
Table 4: Goodness of Fit
Full Model
LogLike -7059,91
AUC 0,757988
AIC 14205,82
BIC 14580,74
4.2.2 Significance of Variables in Full Model
Table 5: Significance
Full Model WaldChiSq ProbChiSq
Characteristic A 3,2873 0,1933
Characteristic B 31,5545 0,0000
Characteristic C 116,4300 0,0000
Characteristic D 0,4495 0,7987
Characteristic E 124,2040 0,0000
Characteristic F 60,2795 0,0000
Characteristic G 47,0024 0,0000
Characteristic H 23,8441 0,0000
Characteristic I 72,6220 0,0000
Characteristic J 811,7597 0,0000
4.2.3 Reduced Model Goodness of Fit Diagnostics
Table 6: Goodness of Fit
Reduced Model
LogLike -7061,81
AUC 0,756928
AIC 14201,61
BIC 14541,65
4.2.4 Significance of Variables in Reduced Model
Table 7: Significance
Reduced Model WaldChiSq ProbChiSq
Characteristic B 30,8980 0,0000
Characteristic C 116,1273 0,0000
Characteristic E 199,7650 0,0000
Characteristic F 67,0019 0,0000
Characteristic G 43,8051 0,0000
Characteristic H 24,3994 0,0000
Characteristic I 97,6561 0,0000
Characteristic J 809,6167 0,0000
4.2.5 Final Model Coefficients
Table 8: Coefficients
Reduced Model Group Estimate (β̂)
Intercept - -2,9765
Characteristic B H - 0,258563
Characteristic B L -0,056793
Characteristic C H 0,44482
Characteristic C L -0,123971
Characteristic E H -0,104184
Characteristic E L -0,715658
Characteristic F H -0,403137
Characteristic F L 0,08744
Characteristic G H 0,07087
Characteristic G L -0,256218
Characteristic H 1 -0,277118
Characteristic H 2 -0,125842
Characteristic H 3 -0,101876
Characteristic I H -0,131207
Characteristic I L 0,36145
Characteristic J 1 -0,195867
Characteristic J 2 0,20147
Characteristic J 3 -3,561178
Characteristic J 4 -0,444718
Characteristic J 5 0,37214
Characteristic J 6 1,22243
Characteristic J 7 -1,075213
Characteristic J 8 -1,095859
Characteristic J 9 0,92637
Characteristic J 10 0,03483
Characteristic J 11 1,00475
Characteristic J 12 -0,086167
Characteristic J 13 0,06773
Characteristic J 14 -0,09741
Characteristic J 15 -0,132908
Characteristic J 16 -0,404597
Characteristic J 17 -0,467246
Characteristic J 18 -0,312258
Characteristic J 19 -1,174643
Characteristic J 20 0,77836
Characteristic J 21 0,76851
Characteristic J 22 1,61931
Characteristic J 23 1,57733
4.2.6 Final Model
From the coefficients in table 8, the following equation could be constructed:

\[ \frac{\hat{\pi}_i}{1 - \hat{\pi}_i} = e^{-2,9765} \cdot \prod_{j=1}^{38} e^{x_{i,j}\hat{\beta}_j} = e^{-2,9765} \cdot e^{x_{i,1}\hat{\beta}_1} \cdot \ldots \cdot e^{x_{i,38}\hat{\beta}_{38}} \tag{16} \]

which can be written as

\[ e^{-2,9765} \cdot \underbrace{\begin{cases} e^{\hat{\beta}_{B,H}}, & \text{if Group} = \text{H} \\ e^{\hat{\beta}_{B,L}}, & \text{if Group} = \text{L} \\ 1, & \text{otherwise} \end{cases}}_{\text{for characteristic B}} \cdot \; \ldots \; \cdot \underbrace{\begin{cases} e^{\hat{\beta}_{H,1}}, & \text{if Group} = 1 \\ e^{\hat{\beta}_{H,2}}, & \text{if Group} = 2 \\ e^{\hat{\beta}_{H,3}}, & \text{if Group} = 3 \\ 1, & \text{otherwise} \end{cases}}_{\text{for characteristic H}} \cdot \; \ldots \]
4.2.7 Final Model Residuals and ROC
Figure 1: Pearson Residuals. The residuals plotted versus case number. Events are shown in red and non-events in blue.
Figure 2: ROC curve and AUC measure
4.3 Claim Severity Regression Model
4.3.1 Full Model Goodness of Fit Diagnostics
Table 9: Goodness of Fit
Full Model Value Value/DF
Deviance 7445,7034 3,9521
Scaled Deviance 2593,5880 1,3766
Pearson Chi-Square 20967,8539 11,1294
Log Likelihood -2482,7098
Full Log Likelihood -2482,7098
AIC 5053,4197
BIC 5298,2233
4.3.2 Results From Reducing Algorithm
Table 10: Reduction of Variables
Variable Status p-value
Characteristic B Included 0,0000
Characteristic C Included 0,0001
Characteristic D Included 0,0000
Characteristic E Included 0,0000
Characteristic F Included 0,0000
Characteristic G Included 0,0000
Characteristic H Included 0,0000
Characteristic I Included 0,0000
Characteristic J Included 0,0000
Characteristic A Removed 0,0528
4.3.3 Reduced Model Goodness of Fit Diagnostics
Table 11: Goodness of Fit
Reduced Model Value Value/DF
Deviance 7487,7109 3,9702
Scaled Deviance 2595,6330 1,3763
Pearson Chi-Square 21797,5034 11,5575
Log Likelihood -2490,0085
Full Log Likelihood -2490,0085
AIC 5064,0169
BIC 5297,6931
4.3.4 Reduced Model, Significance of Variables
Table 12: Significance
Reduced Model ChiSq ProbChiSq
Characteristic B 150,02 0,0000
Characteristic C 13,59 0,0011
Characteristic D 15,80 0,0004
Characteristic E 15,78 0,0004
Characteristic F 75,29 0,0000
Characteristic G 78,92 0,0000
Characteristic H 13,93 0,0030
Characteristic I 126,82 0,0000
Characteristic J 1695,72 0,0000
4.3.5 Final Model Coefficients
Table 13: Coefficients
Reduced Model Group Estimate
Intercept - -2,5222
Characteristic B H 2,6932
Characteristic B L 1,8412
Characteristic B X 0,0000
Characteristic C H 0,0347
Characteristic C L 0,4542
Characteristic C X 0,0000
Characteristic D H -1,4678
Characteristic D L -0,7142
Characteristic D X 0,0000
Characteristic E H -1,7617
Characteristic E L -2,1307
Characteristic E X 0,0000
Characteristic F H 0,0463
Characteristic F L -1,0185
Characteristic F X 0,0000
Characteristic G H 2,4602
Characteristic G L 2,2324
Characteristic G X 0,0000
Characteristic H 1 0,1487
Characteristic H 2 -0,2592
Characteristic H 3 0,0102
Characteristic H Missing 0,0000
Characteristic I H 1,9625
Characteristic I L 0,6583
Characteristic I X 0,0000
Characteristic J 1 1,6589
Characteristic J 2 3,1279
Characteristic J 3 1,4446
Characteristic J 4 2,3687
Characteristic J 5 1,0927
Characteristic J 6 1,5759
Characteristic J 7 0,3709
Characteristic J 8 2,4712
Characteristic J 9 0,0327
Characteristic J 10 0,5404
Characteristic J 11 0,7100
Characteristic J 12 1,7923
Characteristic J 13 2,9658
Characteristic J 14 4,5324
Characteristic J 15 2,2599
Characteristic J 16 4,3868
Characteristic J 17 3,5561
Characteristic J 18 1,5794
Characteristic J 19 4,4475
Characteristic J 20 2,4725
Characteristic J 21 -0,2355
Characteristic J 22 2,9615
Characteristic J 23 6,6656
Characteristic J 24 0,0000
4.3.6 Final Model
From table 13, the following equation could be constructed:
\[
\hat{y}_i = e^{-2,5222}\cdot\prod_{j=1}^{49} e^{x_{j,i}\hat{\beta}_j} = e^{-2,5222}\cdot e^{x_{1,i}\hat{\beta}_1}\cdot\ldots\cdot e^{x_{25,i}\hat{\beta}_{25}}\cdot\ldots\cdot e^{x_{49,i}\hat{\beta}_{49}}
= e^{-2,5222}\cdot
\underbrace{\begin{cases}
e^{\hat{\beta}_{B,H}}, & \text{if Group} = H\\
e^{\hat{\beta}_{B,L}}, & \text{if Group} = L\\
1, & \text{otherwise}
\end{cases}}_{\text{for characteristic } B}
\cdot\;\ldots\;\cdot
\underbrace{\begin{cases}
e^{\hat{\beta}_{H,1}}, & \text{if Group} = 1\\
e^{\hat{\beta}_{H,2}}, & \text{if Group} = 2\\
e^{\hat{\beta}_{H,3}}, & \text{if Group} = 3\\
1, & \text{otherwise}
\end{cases}}_{\text{for characteristic } H}
\cdot\;\ldots
\]
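For concreteness, a severity model of this multiplicative form can be fitted as a gamma GLM with a log link. The minimal sketch below uses Python's statsmodels as a stand-in for the software actually used in the project; the column names and data are hypothetical. Exponentiating the fitted coefficients yields the per-group factors appearing in the equation above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: large claim cost as a percentage of premium, with two
# categorical policyholder characteristics whose reference group is "X".
data = pd.DataFrame({
    "severity": [120.0, 35.5, 410.2, 88.1, 250.0, 60.3, 150.7, 95.0],
    "char_B":   ["H", "L", "H", "X", "L", "X", "H", "L"],
    "char_C":   ["L", "H", "X", "L", "H", "X", "H", "L"],
})

# Gamma GLM with log link: E[y] = exp(x'beta), i.e. a multiplicative model.
model = smf.glm(
    "severity ~ C(char_B, Treatment('X')) + C(char_C, Treatment('X'))",
    data=data,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()

# The exponentiated coefficients are the per-group multipliers on expected severity.
print(np.exp(model.params).round(3))
```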
4.3.7 Final Model Residuals
Figure 3: Pearson Residuals plotted against the linear predictor
5 Discussion
5.1 Model Validation and Adequacy
5.1.1 Sources of error or uncertainty
Due to the large amount of initial data, there was a risk of hidden errors.
With millions of observations it is difficult to detect such errors, and even
more difficult without knowing what they look like. Some parts of the data
obviously contained errors, for example when a variable that should only contain
positive numbers, or percentages between 0 and 100%, held values outside that
range. In addition to these obvious errors there were probably further inaccuracies
that we could not detect. Furthermore, there is no guarantee that the constructed
groups are risk-homogeneous, or that our interpretation of what should be considered
extreme values was sophisticated enough. Having more, and narrower, groups for each
characteristic could have been one way to minimize the risk of misleading results
from such problems. However, other problems could then have arisen from having a
model with a very large number of variables; for example, some groups of a
characteristic might get significant coefficients while others do not.
A common approach when building this kind of insurance model is to aggregate
the data to the requested policy level and then weight the observations to reduce
the risk of distorted significance levels. However, the model failed to converge
when weighted, so this was not an option. The ideal alternative would have been to
not aggregate the data at all, but that resulted in a data set too large for the
software to handle when fitting the logistic regression. This may make the results
misleading, since the model does not take into account the number of original
observations behind each aggregated observation. It would have been possible to
aggregate the data for the GLM, but we decided to use the same policy level for
both sets of data.
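To illustrate the weighting idea described above: some GLM implementations accept frequency weights, letting each aggregated row count as the number of underlying policies it represents. A minimal sketch with statsmodels follows; the column names and numbers are hypothetical, and this is not the setup used in the project.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical aggregated data: each row summarizes n_policies original policies.
agg = pd.DataFrame({
    "large_claim_rate": [0.01, 0.04, 0.02, 0.03],  # share of policies with a large claim
    "n_policies":       [1200, 300, 750, 500],     # policies behind each aggregated row
    "high_group":       [0, 1, 1, 0],              # indicator for one characteristic group
})

X = sm.add_constant(agg[["high_group"]])

# freq_weights makes each aggregated observation count as n_policies repeated
# observations, so the likelihood reflects the original data volume.
model = sm.GLM(
    agg["large_claim_rate"], X,
    family=sm.families.Binomial(),
    freq_weights=agg["n_policies"].to_numpy(),
).fit()
print(model.params)
```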
The VIFs and the correlation matrix both indicate low multicollinearity between
the characteristics. However, the characteristics were almost exclusively financial
measurements, and intuition says that such measurements should be correlated. This
raises the further question of whether something is wrong with the data set.
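For reference, VIFs of the kind referred to here can be computed directly from the design matrix. The sketch below uses statsmodels; the column names and data are hypothetical, with one column deliberately built to correlate with another.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical financial measurements; measure_3 is built to correlate with measure_1.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "measure_1": rng.normal(size=100),
    "measure_2": rng.normal(size=100),
})
X["measure_3"] = 0.5 * X["measure_1"] + rng.normal(scale=0.9, size=100)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing column j on the others.
exog = sm.add_constant(X).to_numpy()
for j, name in enumerate(X.columns, start=1):  # skip the constant at index 0
    print(f"{name}: VIF = {variance_inflation_factor(exog, j):.2f}")
```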
5.1.2 Assessing the model reductions
The reduced model obtained with logistic regression gave smaller AIC and BIC
values than the full model, which indicates a better model fit. The reduced GLM,
on the other hand, gave a smaller BIC value but not a smaller AIC value than the
full model. AIC is more tolerant of additional variables, while BIC puts a stricter
penalty on adding them. We are nonetheless convinced that the reduced model is
preferable, since its AIC differs only slightly from that of the full model and
the excluded variable did not meet the desired significance level.
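The different behaviour of the two criteria follows from their standard definitions,

\[
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},
\]

where $k$ is the number of estimated parameters, $n$ the number of observations and $\hat{L}$ the maximized likelihood. Since $\ln n > 2$ as soon as $n > e^{2} \approx 7.4$, BIC charges more per added parameter than AIC on any realistically sized data set, which is why a reduction can improve BIC without improving AIC.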
5.1.3 Statistical hypothesis testing
The deviance, Wald chi-square and Pearson chi-square tests indicate significant
p-values for both models and a satisfactory goodness of fit. However, the scaled
deviance divided by the degrees of freedom has a value larger than one, which
indicates a problem with the fit. Furthermore, in both the reduced logistic model
and the reduced GLM there are significant p-values for the characteristics, while
within some characteristics only one of the groups shows significance. This is not
necessarily serious, since significance is a binary verdict for each group taken
separately. It could have been more appropriate to merge the two other groups
(most characteristics are categorized into three groups), but since one group
represents missing or extreme values, and we are interested in seeing trends,
that option was not preferred.
5.1.4 Prediction accuracy
The prediction accuracy of the logistic regression model for the large claim
probability can be assessed by the concordance index, also called the area under
the curve (AUC) of the ROC. As seen in section 4.2.7, the model had an AUC of
approximately 76 percent. Since this is well over 50 percent, the model is shown
to have a certain ability to accurately predict the occurrence of large claims.
At the same time, the AUC is far from the optimal level of 100 percent, showing
that the model to some extent lacks the ability to make accurate predictions. One
reason for this could be the distribution of the response variable, which consists
of very few events (large claims) compared to the number of non-events (no large
claims). This could have made it difficult to find strong patterns for the
occurrence of large claims, an issue that is hard to overcome due to the nature
of the data.
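To indicate how such an AUC figure is computed, a minimal sketch follows using scikit-learn; the labels and fitted probabilities are hypothetical.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical data: 1 = large claim (event), 0 = no large claim (non-event).
y_true  = [0, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.02, 0.10, 0.05, 0.40, 0.08, 0.15, 0.03, 0.60]  # fitted probabilities

# AUC = probability that a randomly chosen event is ranked above a
# randomly chosen non-event; 0.5 is no better than chance.
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```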
For both models, the prediction accuracy can also be evaluated by analyzing the
residuals. In section 4.2.7, residuals for the large claim probability model are
presented. In figure 1, which shows Pearson residuals plotted for each observation,
one can note that the residuals are positive for practically all observations with
a large claim, shown in red, and negative for the observations without a large
claim, shown in blue. Since a positive residual means that the predicted response
is smaller than the observed response, and a negative residual means the opposite,
this result is expected: observed events have been coded as ones and non-events as
zeros. One also notes that the positive residuals are further from zero than the
negative ones. This indicates that the model predicts events less accurately than
non-events. As stated, this is likely caused by the scarcity of large claims in
the data. Many of the positive residuals also appear to be larger than 3, which
indicates a problem with prediction accuracy as well as outliers in the data. The
negative residuals are all close to zero, indicating that the model successfully
predicts non-events.
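For reference, the Pearson residual plotted in figure 1 is, for a binary response,

\[
r_i = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i\,(1-\hat{\pi}_i)}},
\]

where $\hat{\pi}_i$ is the fitted probability of a large claim. An event ($y_i = 1$) with a small $\hat{\pi}_i$ therefore produces a large positive residual, while a non-event with a small $\hat{\pi}_i$ produces a residual only slightly below zero, which is exactly the asymmetry seen in the plot.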
In figure 3 of section 4.3.7, Pearson residuals plotted against the linear
predictors of the large claim severity model are presented. One can see in the
scatter that most residuals are slightly negative, which indicates that the
predicted response is greater than the observed response in many cases. This
means that the model tends to assign higher costs for large claims, as a
percentage of premium, than they actually had. There is also an apparent presence
of some very large positive residuals; for those observations the model predicts
a much lower response than what is observed. The overall pattern indicates a
possible problem with outliers, which may be what pushes most other residuals
slightly below zero. However, since most residuals do not deviate much beyond
the limit of +/- 3, the overall predictive power is acceptable.
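For the gamma severity model, the corresponding Pearson residual uses the gamma variance function $V(\mu) = \mu^2$,

\[
r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}} = \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i},
\]

which (up to the estimated dispersion) is bounded below by $-1$ but unbounded above. This is consistent with the mildly negative bulk of residuals and the few very large positive values described above.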
5.2 Interpretation of final models
The model development and the procedures for model reduction resulted in two
final models with 8 and 9 included characteristics, respectively. It was found
that the models did not need to be heavily reduced; rather, they differed from
their corresponding full models by just two and one characteristics, respectively.
Since the variance inflation factors did not point towards any issues with
multicollinearity, there was no mathematical support for reducing the models
further. It is interesting that so many of the characteristics were shown to be
significant and to add explanatory power to the models, since it indicates that
there are several aspects of commercial policyholders that affect their risk of
large claims. However, this is not necessarily the most desirable result in
practice, since more complex models make it more difficult to implement the
results in the pricing of insurance policies. It is not as simple as to just
start pricing on the basis of all the significant characteristics found, which
is why a model using one or a few characteristics in isolation would in some
sense have been preferable.
The characteristic that was excluded from the claim size model was also one of
the characteristics excluded from the large claim probability model. This is an
interesting result, since it indicates that this characteristic does in fact not
have much effect on the large claim risk associated with commercial policyholders.
However, given the overall uncertainty about the quality of the data used for the
regression model building, it can be questioned to what extent this conclusion is
safe to draw and to generalize. As one more characteristic was excluded from the
large claim probability model than from the claim size model, there appears to be
a certain difference between what causes large claims to occur at all and what
affects their severity: that characteristic does not seem to explain anything
about the occurrence of large claims, but adds some explanatory value about their
size. Differences in what drives risk in the two models can also be observed in
the coefficient estimates of the characteristics and their indicator variables.
If, for example, the higher-valued group of a certain characteristic has a larger
coefficient estimate than the lower-valued group in one model, while the
relationship is the opposite in the other model, this shows a difference in the
effect of that characteristic, even if it is significant in both models.
The coefficient estimates for the large claim probability model are presented
in table 8 and analyzed in the form of odds ratios. These show that for some of
the characteristics there is not a large difference in odds between the groups.
An example is characteristic H, where the three groups not representing missing
or invalid values have the odds 0.458, 0.532 and 0.545. These are all close to
each other, indicating no apparent difference in the risk of having a large claim
depending on which of these groups the customer belongs to. For characteristics I
and C there is a larger difference in odds ratio between the higher and the lower
groups, and their coefficient estimates are oppositely signed. This indicates that
having a lower value of characteristic I and a higher value of characteristic C
both increase the risk of a policyholder having a large claim according to this
model. For one of these characteristics the higher-risk group corresponds to the
policyholder being in a better financial situation, and for the other the opposite
holds. For the remaining characteristics (B, E, F and G) there is a certain, but
not very large, difference in risk depending on whether the customer has a higher
or a lower value of those variables. Whether the better financial situation
corresponded to a higher risk varied between these characteristics.
Since the model for the large claim size was constructed using the log link
function, the risks of different variable groups can be evaluated by exponentiating
the coefficient estimates, yielding a multiplicative model similar to the odds
ratios of the logistic model. An exponentiated coefficient gives a multiplier on
the expected response when the corresponding variable changes by one. Hence, a
group with a larger exponentiated coefficient estimate contributes more to the
predicted large claim for that variable than a group with a smaller multiplier.
This reveals large differences in risk, depending on which group the customer
belongs to, for characteristics B, C, F, G and I. For most of these characteristics
(B, F, G and I) the higher risk, i.e. more severe large claims in relation to
premium, corresponds to the higher-valued group; for characteristic C the risk is
higher for the lower-valued group. This is the one exception where the lower risk
corresponds to a worse financial situation. For some characteristics the trend is
opposite to that of the logistic model. Characteristic E, which showed a certain
risk difference in the logistic model, showed little difference in the severity
model. This indicates that there are differences between what causes large claims
to occur at all and what causes them to be severe.
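As a concrete example of this reading, take characteristic B in table 13:

\[
e^{\hat{\beta}_{B,H}} = e^{2.6932} \approx 14.8, \qquad e^{\hat{\beta}_{B,L}} = e^{1.8412} \approx 6.3,
\]

so, all else equal, a policyholder in group H is predicted to have a large claim cost, as a percentage of premium, roughly 14.8 times that of the reference group X, while group L lands at roughly 6.3 times.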
By analyzing the differences in estimates between groups in this way, one
realizes that although many characteristics were significant, and therefore kept
through the model reduction steps, not all of them would necessarily be candidates
for actual implementation in the pricing models. When, for example, there is no
apparent difference in risk between the higher and the lower group of a variable,
it may not be meaningful to include it in pricing, even if the coefficients are
significant. There can also be a problem if the risk differences between the
groups do not follow the insurer's intuition about how risk should differ between
customers. Furthermore, differences in the behaviour of the variables between the
two models, large claim probability and large claim severity, might cause
implementation problems. These are all aspects to consider when choosing which
characteristics to keep investigating, or to try to include in the actual pricing
of premiums.
5.3 Impact of risk-dependent insurance pricing
5.3.1 For If
As explained in the introduction to insurance pricing, section 2.2.5,
sophisticated pricing models that account for the customers' risks as accurately
as possible are of great interest for an insurance company such as If. An
important part of this is not only to continuously improve predictions of the
frequently occurring regular-sized claims, but also to find efficient ways to
account for the more rarely occurring large claims. Even if these are less likely
to occur than other claims, the costs for If are extensive when they do. As was
seen when analyzing the data set of commercial policyholders in this project, the
costs for large claims corresponding to some groups of policyholders amount to
several thousand percent of their paid premiums. For other groups of
policyholders, costs for large claims were non-existent or only a small fraction
of the amount those policyholders had paid. A pricing structure that is completely
fair should not disregard these variations in large claim risk between commercial
policyholders.
By charging premiums optimally risk-corrected for large claims, the premiums are
fairer than they would otherwise be. This means that If is likely to achieve a
better ratio between claim costs and collected premiums, since low-risk customers
are more prone to choose, and to stay with, If. Recall the expression for return
on equity stated in section 2.2.3: it shows that the ROE grows as the combined
ratio shrinks. A smaller combined ratio is partly acquired through fairer pricing,
since that produces a higher GEP for a lower claim cost than more uniform pricing
structures. Thus, working to achieve prices optimally adjusted for large claim
risk can produce a higher return on equity, an important goal for If.
5.3.2 For commercial policyholders
The customers may not be as keen on the idea of risk-dependent prices as the
insurance company. This project aimed to investigate the possibility of predicting
risk, and hence of pricing, with regard to characteristics of the policyholder
which in some cases did not have an obvious connection to the insured object and
its usage. Rather, they were general characteristics of the companies. Discussions
about the ethical aspects of risk-dependent insurance pricing are therefore of
interest in this case. There are already regulations in the Swedish insurance
industry forbidding insurers from using, for example, gender as an explanatory
variable in the pricing of policies. One can question at which other
characteristics one starts to approach a situation where the pricing is
discriminating.
A phenomenon somewhat related to this is the concept of price optimization, in
the literature often referred to as price discrimination. This describes the
situation where a company charges different customers different prices for the
same product, and it is related to the concept of price elasticity of demand in
microeconomics. Typically, a company wants to charge less price-sensitive
customers higher prices than more price-sensitive ones, as a way to maximize its
revenues. [12, p. 407-410]
Insurance pricing in general differs from this idea in that the prices are
dynamic with regard to the customers' risks, not their price elasticities. This
enables the insurance company to keep the lowest possible prices for all its
customers. With a uniform pricing structure, the general price level would have
needed to be higher for the insurer to be equally profitable, due to the higher
expected claim costs caused by higher-risk customers. Risk-correct pricing can
thus be considered an advantageous pricing structure from a customer perspective
as well, high-risk policyholders included.
Unlike private individuals, commercial policyholders can be in competition with
each other. For businesses, insurance costs can be substantial, and one could
argue that premiums which are risk-adjusted as far as possible create fairer
competition. Low-risk companies then need not take part in financing the risk of
companies that are more likely to have, for example, large claims. On the other
hand, since getting a lower premium than your competition gives you a competitive
advantage, it can be debated to what extent an insurer should need statistical
support for pricing with respect to a certain characteristic.
6 Conclusions
The project concludes that large claims are to some extent correlated with a
company's financial situation. However, further investigation is needed in order
to find more reliable models with respect to goodness of fit and prediction
accuracy, as well as to gain a better understanding of the impact and importance
of the different characteristics.
Large claim probability. The logistic regression shows that eight of the
characteristics are significant in predicting the occurrence of large claims.
The concordance index indicates that the model has a predictive power of about
76 percent, which means that the model to some extent lacks the ability to
predict responses accurately. An inspection of the Pearson residuals shows that
the model predicts non-events well but has difficulties predicting events,
probably due to the scarcity of large claims in the data set. Whether a better
financial situation corresponds to a higher or a lower risk of having a large
claim varies between characteristics. It is therefore difficult to draw a general
conclusion about which policyholders are the risky ones with respect to this
response variable.
Large claim cost as percentage of premium. The severity model indicates that one
of the characteristics should be eliminated and the rest kept. The Pearson
residuals showed relatively good prediction accuracy, but a tendency to predict
higher values of the response variable than what was observed. The presence of
some very large residuals also causes doubt. For the large claim severity model,
many of the characteristics indicate that a better financial situation corresponds
to a higher risk.
7 Recommendations
For further research, we recommend exploring other ways of modelling the response
variable. For example, a more thorough analysis of which distribution to use might
result in a better model that makes more accurate predictions. Specifically, we
suggest looking for, and attempting to use, a distribution with a heavier tail
than the gamma distribution; a tail comparison is sketched below. To further
improve reliability, we recommend looking deeper into the characteristics used in
this project, to gain a better understanding of which values should be viewed as
invalid or extreme, and to avoid them having too much influence on the model.
Lastly, we recommend constructing models with fewer characteristics, and perhaps
narrower groups, in order to better understand their individual effects and
thereby increase the possibility of implementing them in the pricing of policies.
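As a starting point for the distributional recommendation, the sketch below contrasts the tail of a gamma distribution with that of a lognormal using scipy; the parameter values are hypothetical and chosen only to give the two distributions equal means. The lognormal-Pareto approach of [6] would be a natural next step.

```python
import numpy as np
from scipy import stats

# Hypothetical distributions with equal means (3.0), for tail comparison only.
gamma = stats.gamma(a=2.0, scale=1.5)
sigma = 1.0
lognorm = stats.lognorm(s=sigma, scale=3.0 * np.exp(-sigma**2 / 2))

# Survival function P(X > x): the lognormal keeps far more probability mass
# at large claim sizes than the gamma, whose tail decays exponentially.
for x in [5, 10, 20, 50]:
    print(f"x = {x:>2}: gamma tail = {gamma.sf(x):.2e}, lognormal tail = {lognorm.sf(x):.2e}")
```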
References
[1] If P&C Insurance
[2] Esbjorn Ohlsson, Bjorn Johansson. Non-Life Insurance Pricing with Gener-
alized Linear Models. 2010..
[3] Amy Gallo in Harvard Business Review. A Refresher on Regression Analysis.
2015.
https://hbr.org/2015/11/a-refresher-on-regression-analysis
Accessed on 2018-04-20
[4] If P&C Insurance, lecture on generalized linear models at KTH Royal insti-
tute of technology 2018.
[5] Henrik Hult, Filip Lindskog. Heavy-tailed insurance portfolios: buffer capital
and ruin probabilities. 2006.
[6] Marco Bee. Statistical analysis of the Lognormal-Pareto distribution using
Probability Weighted Moments and Maximum Likelihood. 2012.
[7] Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining. Introduc-
tion to Linear Regression Analysis. 2012
[8] Alan Agresti. Categorical Data Analysis 2nd ed. 2002
[9] English Oxford Living Dictionaries. Overfitting
https://en.oxforddictionaries.com/definition/overfitting
Accessed on 2018-05-23
[10] Murphy, K.P., Brockman, M.J., Lee, P.K.W. Using generalized linear mod-
els to build dynamic pricing systems for personal lines insurance. In: CAS
Winter 2000 Forum
[11] PennState, Eberly College of Science. STAT 504, 7.2.1 Model Diagnostics.
https://onlinecourses.science.psu.edu/stat504/node/161/
Accessed on 2018-05-22
58
[12] Paul Krugman, Robert Wells. Economics 4th ed. 2015.
[13] Patrik Hardin, Sam Tabari. Modelling Non-Life Insurance Policyholder
Price Sensitivity. Bachelor Thesis, KTH 2017.
[14] Lovisa Laestadius, Karin Knobel. Fornyelsegrad och priskanslighet inom
foretagsforsakringar. Bachelor Thesis, KTH 2016.