
Advances and Applications in Statistical Sciences, Volume …, Issue …, 2009, Pages …
© 2009 Mili Publications

2000 Mathematics Subject Classification: Please provide.

Keywords: Bayesian variable selection, lifetime value analysis, marketing applications, model assessment.

Received April 23, 2009

BAYESIAN CHURN MODELS

SILVIA FIGINI and PAOLO GIUDICI

Department of Statistics and Applied Economics, University of Pavia
E-mail: [email protected]

Abstract

We consider the problem of estimating the probability that a customer will abandon a company (churn). In such a context, a number of classical modelling challenges arise. We propose a Bayesian approach to the problem and compare it with classical churn models in a real case study.

1. Background and Preliminaries

Our problem concerns the estimation of customer lifetime value (LTV). This problem is important and timely, especially for private companies and public institutions that provide services to their customers or users. In this highly dynamic and competitive market, customers may churn, that is, switch service provider, and therefore cause a loss for the company that they abandon. It is possible to identify a number of components that can generate churn behaviour: a static component, determined by the customer characteristics and the type/subject of the contract; a dynamic component, which incorporates trends and the contacts the customers have with the company; a seasonal component, tied to the period of subscription of the contract; and external factors, including the course that the market and competitors take.

Statistical models typically used to predict churn are based on logistic regression, classification trees or other data mining models (see e.g. Giudici [20]). Generally, all models are evaluated on the basis of a test sample (by definition not included in the training phase) and classified in terms of predictive accuracy with respect to the actual response values. Predictive accuracy means being able to identify correctly those individuals that will become churners during the evaluation phase (correct identification). Despite their good predictive capability, classical data mining models are often useless for marketing actions, as a very simple model based on customer contract deadlines will perform just as well. New methods are therefore necessary to obtain a predictive tool that takes into account the fact that the churn data at hand are time dependent.

Modeling churn is very useful in order to derive LTV. In general, an LTV model has three components: customer value over time, customer length of service and a discounting factor. More precisely, for each customer, the components are:

− The customer value over time, $v(t)$ for $t > 0$, where $t$ denotes time. The value function $v(t)$ is usually elicited by business experts.

− A model describing the customer's churn probability over time. This is usually expressed by a "survival" function $S(t)$, which describes the probability that the customer will still be active at time $t$.

− A discounting factor $D(t)$, which describes how much each monetary unit gained at some future time $t$ is worth right now. This factor is usually calculated from business knowledge.

Given these three components, we can write the explicit formula for the LTV as follows:

$$\mathrm{LTV} = \int_0^{\infty} S(t)\, v(t)\, D(t)\, dt.$$

In other words, LTV is the total value to be gained while the customer is still active. While this formula is attractive and straightforward, the essence of the challenge lies, of course, in estimating the $S(t)$ component in a reasonable way. This will be the aim of the present paper. In Section 2 and Section 3 we formalize the problem and our proposal. In Section 4 we derive the expressions needed for actual computation and, finally, in Section 5 we apply the methodology to a real problem.
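To make the integral concrete, here is a minimal numerical sketch of the LTV computation; the exponential survival curve, constant value function and discount rate below are illustrative assumptions, not quantities estimated in this paper.

```python
import numpy as np

# Illustrative inputs (assumptions, not the paper's estimates):
# S(t): exponential survival with a monthly churn hazard of 0.05;
# v(t): a constant value of 30 monetary units per month;
# D(t): continuous discounting at 1% per month.
S = lambda t: np.exp(-0.05 * t)
v = lambda t: 30.0 * np.ones_like(t)
D = lambda t: np.exp(-0.01 * t)

# Approximate LTV = integral of S(t) v(t) D(t) over [0, inf) by truncating
# the horizon and applying the trapezoidal rule.
t = np.linspace(0.0, 600.0, 60001)  # months; the tail beyond is negligible
ltv = np.trapz(S(t) * v(t) * D(t), t)
print(f"LTV ~ {ltv:.1f}")           # analytic value: 30 / (0.05 + 0.01) = 500
```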


2. Bayesian Variable Selection for Churn Models

In order to be useful in predicting churn, a survival analysis model must take into account the information present in the several covariates typically contained in the database at hand. It seems natural to employ a causal model such as the Cox proportional hazards model (Cox [8]). A crucial aspect of causal models in survival analysis is the preliminary stage, in which a set of explanatory variables must be properly chosen and designed; as in our real case, this choice is usually among a very large number of alternatives. This part of the analysis is typically accomplished with the help of descriptive tools, such as plots of the observed hazard rates at the covariate values. However, it is often the case that such tools are not sufficiently informative. As a consequence, a large number of variables are included as predictors and a model selection procedure needs to be run in order to find a parsimonious combination.

Our claim is that classical Cox proportional hazard models may not be the best strategy for customer LTV modeling. Some criticisms are:

− If repeated events occur, as in our case, a different model structure (e.g. based on counting processes) should be adopted.

− The Cox model assumes that every subject experiences at most one event, and that the event times are independent. In our context, a subject can experience multiple events (e.g. a churn event at different times and in different locations), possibly with dependencies among the event times of the same individual. Modelling multiple event time data requires a different approach (see e.g. Gail, Santner and Brown [16]).

− When many explanatory, and possibly correlated, variables are specified, the efficiency of Cox's model selection and estimation becomes heavily dependent on the number of available observations. Variable selection is thus needed as a model selection step.

− It may be difficult, particularly in observational studies, to have complete information on all relevant covariates. Furthermore, random effects, expressing accident proneness or frailties may affect inferences on fixed effects.


In this paper we shall show how to improve the classical Cox model for LTV estimation by introducing a Bayesian survival model based on counting processes. Before illustrating our approach to variable selection for LTV, we report some of the latest useful references. Variable selection for multivariate failure time data has been analyzed by Fan, Lin and Zhou [12], based on a penalized pseudo-partial likelihood method. Fan and Li [11] give methods for variable selection in parametric models via nonconcave penalized likelihood. It has been shown there that the resulting procedures perform just as well as if the subset of significant variables were known in advance; such a property is called the oracle property. While Tibshirani [26] proposed the LASSO methodology, Dunson [10] proposes a semi-parametric Bayesian approach for inference on an unknown regression function $f(x)$ characterizing the relationship between a continuous predictor $X$ and a count response variable $Y$, adjusting for covariates $Z$. Finally, Giudici, Mezzetti and Muliere [21] proposed a nonparametric variable selection approach for survival analysis.

Our variable selection proposal adapts Giudici [19] to our context. To illustrate this approach on the basis of the data at hand, we shall first assume an exponential survival time, such that $\lambda_i(t) = \lambda_i$ for $i = 1, \ldots, n$. Then it can be shown that, given the observed data $y = (y_1, \ldots, y_n)^T$, the likelihood of $\lambda = (\lambda_1, \ldots, \lambda_n)^T$ is:

$$L(\lambda) = \prod_{i \in U} \lambda_i \exp\left\{ -\sum_{i=1}^{n} \lambda_i t_i \right\}, \qquad (1)$$

where $U = \{i : \delta_i = 1\}$ are the uncensored subjects. Now, let $g$ indicate a partition of the index set $\mathcal{I} = \{1, \ldots, n\}$ with $d_g$ subsets $S_k(g)$, for $k = 1, \ldots, d_g$. Clearly, given the correspondence between $\mathcal{I}, y$ and $\lambda$, $g$ also defines a partition of the data and of the hazard functions. Notice that the likelihood $L(\lambda)$ assumes all $\lambda_i$ to be distinct and, thus, is in fact conditional on the independence partition $g_{ind} = \{\{1\}, \ldots, \{n\}\}$, containing $d_g = n$ separate subsets $S_i$, each with $|S_i| = 1$. For this reason, it can be indicated by $L(\lambda \mid g_{ind})$.


A different likelihood arises when all hazards can be set equal to a common rate, say $\mu$. This situation occurs when no covariate or frailty affects the survival times and corresponds to considering all data to be exchangeable. The resulting likelihood can be seen as conditional on the partition $g_{exc} = \{1, \ldots, n\}$, containing a single subset $S_1$ with $|S_1| = n$:

$$L(\mu \mid g_{exc}) = \mu^{d} \exp\{-\mu V\},$$

with $d = \sum_{i=1}^{n} \delta_i$ the total number of failures and $V = \sum_{i=1}^{n} t_i$ the overall time at risk.

Apart from the above situations, which can be regarded as somewhat extreme, survival analysis is typically concerned with a plurality of effects that may induce dependencies among survival times. Such effects may be either observable (possibly with some missing values) or unobservable. In any case, when relevant, they induce a partition of the observations, by associating different hazards to individuals with the same level for that factor. In our approach, we will explore several partition structures. This amounts to considering a collection of alternative partial exchangeability structures for the survival times. Our model consists of two parts: a likelihood specification and a hierarchical prior distribution on the partition structure as well as on the corresponding set of hazards. Conditionally on a general partition $g$, let $\lambda_i = \mu_k, \forall i \in S_k(g)$. Consequently, the likelihood of the hazards $\mu = (\mu_1, \ldots, \mu_{d_g})$ is the following:

$$L(\mu \mid g) = \prod_{k=1}^{d_g} \mu_k^{d_k} \exp\{-\mu_k V_k\},$$

where $\mu_k$, $d_k = \sum_{i \in S_k(g)} \delta_i$ and $V_k = \sum_{i \in S_k(g)} t_i$, for $k = 1, \ldots, d_g$, are the hazard, the number of deaths and the total time at risk of the $k$th partition subset.

On the other hand, the prior specification requires the definition of a class of possible partitions $\mathcal{G} = \{1, \ldots, G\}$. Once $\mathcal{G}$ is specified, it is necessary to assign a probability distribution on both $\lambda \mid g \in \mathbb{R}^{d_g}$ and $g \in \mathcal{G}$. Specifically, conditionally on a partition $g$ we shall take, for $k = 1, \ldots, d_g$ and $\forall i \in S_k(g)$:

$$\mu_k \stackrel{ind}{\sim} \mathrm{Gamma}(r_k m_k, r_k),$$

with $m_k$ and $r_k$ being known positive constants. In the partition model this is not a novel proposal, but reflects the literature. Following Giudici [19], the two parameters are derived on the basis of expert opinion and a priori knowledge. In particular, $r_k$ and $m_k$ can be interpreted, respectively, as pre-experimental 'total time at risk' and 'observed events' (e.g. coming from a meta-analysis). Finally, a simple probability function on $\mathcal{G}$ would take $p(g)$ to be uniformly spread among partitions, i.e. $p(g) = G^{-1}$.

Our first aim is to evaluate the importance of each prognostic factor. This can be achieved by calculating, given the observed evidence $y$, the posterior probability of each partition, $p(g \mid y)$. It can be shown that

$$p(y \mid g) = \prod_{k=1}^{d_g} \frac{r_k^{r_k m_k}\, \Gamma(r_k m_k + d_k)}{\Gamma(r_k m_k)\, (V_k + r_k)^{r_k m_k + d_k}}.$$

Furthermore, Bayes theorem gives $p(g \mid y) \propto p(y \mid g)\, p(g)$, from which $p(g \mid y)$ is obtained by normalization.
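As an illustration of how the partition score can be evaluated in practice, the sketch below computes $\log p(y \mid g)$ for candidate partitions using the closed form above; the toy data and the unit prior constants $r_k = m_k = 1$ are assumptions made only for this example.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(partition, delta, t, r, m):
    """log p(y | g) for a candidate partition g of the subjects, under
    exponential survival and Gamma(r_k m_k, r_k) priors on the hazards."""
    logp = 0.0
    for k, subset in enumerate(partition):
        d_k = delta[subset].sum()      # failures in subset k
        V_k = t[subset].sum()          # total time at risk in subset k
        a, b = r[k] * m[k], r[k]       # prior shape and rate
        logp += (a * np.log(b) + gammaln(a + d_k)
                 - gammaln(a) - (a + d_k) * np.log(b + V_k))
    return logp

# Toy data (assumed): six subjects with censoring indicators and exposures.
delta = np.array([1, 1, 0, 1, 0, 1])
t = np.array([2.0, 1.5, 3.0, 0.5, 4.0, 1.0])

# Compare the exchangeable partition with a two-group partition; with a
# uniform p(g), posterior odds equal the ratio of the marginal likelihoods.
for g in ([np.arange(6)], [np.arange(3), np.arange(3, 6)]):
    K = len(g)
    print(K, log_marginal_likelihood(g, delta, t, np.ones(K), np.ones(K)))
```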

The procedure that we have shown for a constant hazard context can be generalized to more sophisticated frameworks, such as those described in the next section.

3. Bayesian Inference for Churn Models

Several authors have discussed Bayesian inference for censored survival data with an integrated baseline hazard function to be estimated non-parametrically (see e.g. Kalbfleisch [22], Kalbfleisch and Prentice [23] and Fleming and Harrington [15]).

In particular, Clayton [7] formulates the Cox model using the counting process notation introduced by Andersen and Gill [2] and discusses estimation of the baseline hazard and regression parameters using a Bayesian approach based on Markov chain Monte Carlo. Although his approach may appear somewhat contrived, it forms the basis for extensions to random effects frailty models, time-dependent covariates, smoothed hazards and multiple events.

Here we follow Clayton's guidelines and propose a methodology based on counting processes. With regard to our problem, the counting process that we present is characterized by a dynamic process (intensity) and a special pattern of incompleteness of observations (right-censoring or left-truncation). Having defined the intensity process, interest then focuses on parameter estimation.

Inferential procedures in this framework were first presented in Aalen [1], and turned out to be very useful. For further developments, see Andersen [3].

We consider the analysis of multiple event data, where there are $n$ groups and the $i$th cluster has $m_i$ individuals associated with an unobserved frailty $\omega_i$, $i = 1, \ldots, n$. The $j$th individual in the $i$th cluster is associated with the covariate variable $x_{ij}$, for $j = 1, \ldots, m_i$ and $i = 1, \ldots, n$. Such individuals are assigned as belonging to a specific cluster because they are somehow related, perhaps by family association or geographical location. Conditional on the frailties $\omega_i$, the complete survival times are assumed to be independent.

Let us consider the model:

$$h_i(t \mid \omega, X) = \omega_i [h_0(t) + h_1(t \mid X)], \quad t \geq 0, \; i = 1, \ldots, n,$$

where $h_i(t \mid \omega, X)$ represents a hazard function on the $i$th cluster that has been modified by the inclusion of a frailty.

The frailty random variable is assumed to be independent of $t$ and $X$ for all clusters, with some parametric distribution with unit mean, usually Gamma [6], with the unknown variance, say $\eta$, quantifying the amount of heterogeneity among individuals. In other words, we assume that

$$\omega_i \mid \eta \sim \mathrm{Gamma}(\eta^{-1}, \eta^{-1}),$$

that is, given $\eta$, each $\omega_i$ $(i = 1, \ldots, n)$ is modeled as a Gamma distribution with shape parameter $\eta^{-1}$ and rate parameter $\eta^{-1}$, so that the frailty has unit mean and variance $\eta$. The frailty random variable measures the random sensitivity of the $i$th cluster to the event of interest after taking into account the effect of the covariate. The nonparametric part of the model, $h_0(t)$, is assumed to be a semi-parametric hazard specification, which has the advantage of being a simple way to obtain a flexible hazard function through a simple estimation (see e.g. Gamerman [17]). On the other hand, it has a major disadvantage: the hazard is not continuous as a function of time, as there are jumps at the interval end points. In order to avoid such discontinuities we can use an ordinary piecewise exponential model (see e.g. Gamerman [17]).

To construct such a model, we first split the time axis into intervals $0 = a_0 < a_1 < \cdots < a_g$, where $g$ is the number of intervals of observation times, with $a_g > t_{ij}$ for all $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$. The hazard in the interval $I_k = (a_{k-1}, a_k]$ is defined by:

$$\lambda_0^* + \lambda_k^*\,(t - a_{k-1})\, I(a_{k-1} < t).$$

Therefore, the hazard function is obtained as:

$$h_0(t) = \lambda_0^* + \sum_{k=1}^{g} \lambda_k^*\,(t - a_{k-1})\, I(a_{k-1} < t), \qquad (2)$$

where the function $I(\cdot)$ is the indicator function, whose value is one if the argument is true and zero otherwise. Formally, equation (2) has one more parameter than the usual piecewise constant hazard model, but as the function is continuous, the number of intervals can sometimes be reduced. Note that the likelihood function can be easily formulated by means of an indicator $\delta_{ijk} = \delta_{ij}\, I(a_{k-1} < t_{ij})$ of death for the $j$th individual in the $i$th cluster throughout the $k$th interval, and the observation time $t_{ijk}$ in the interval. This quantity equals:

$$t_{ijk} = \begin{cases} t_{ij} - a_{k-1} & \text{if } t_{ij} > a_{k-1}, \\ 0 & \text{if } t_{ij} < a_{k-1}. \end{cases}$$

Finally, since we assume that the covariates are time independent, the parametric part of the model is constant in $t$, and it is then the hazard function of an exponential distribution with parameter $\theta$ (the mean is $\theta$ and the variance is $\theta^2$), so that:

$$h_1(t \mid x) = \frac{1}{\theta}, \quad \text{if } t > 0.$$


The survival function is related to the hazard function through the expression $S(t) = \exp[-H(t)]$, where the integrated hazard function $H(t)$ is given by $H(t) = \int_0^t h(u)\, du = -\ln[S(t)]$. Hence, given the relationship between the hazard and the survival function, it can be shown that the individual survival function, which has been modified by the inclusion of a frailty component, specified by the parameters $\omega^T = (\omega_1, \ldots, \omega_n)$, $\lambda^{*T} = (\lambda_0^*, \ldots, \lambda_g^*)$ and $\theta$, is:

$$S_i(t \mid \omega, \lambda^*, \theta) = [S_0(t \mid \lambda^*)\, S_1(t \mid \theta)]^{\omega_i}, \quad i = 1, \ldots, n,$$

where $S_0(t)$ and $S_1(t)$ are, respectively, the survival functions related to the piecewise baseline hazard function and the exponential hazard function, i.e.,

$$S_0(t \mid \lambda^*) = \exp\left(-\int_0^t h_0(u)\, du\right) = \exp\left(-\sum_{k=0}^{g} c_k(t)\, \lambda_k^*\right),$$

$$S_1(t \mid \theta) = \int_t^{\infty} \frac{1}{\theta} \exp\left(-\frac{u}{\theta}\right) du = \exp\left(-\frac{t}{\theta}\right), \qquad (3)$$

where the $c_k(t)$ are positive statistics. Therefore, the density function takes the form:

$$f_i(t \mid \omega, \lambda^*, \theta) = h_i(t \mid \omega, \lambda^*, \theta)\, S_i(t \mid \omega, \lambda^*, \theta)$$
$$= \omega_i [h_0(t \mid \lambda^*) + h_1(t \mid \theta)]\, [S_0(t \mid \lambda^*)\, S_1(t \mid \theta)]^{\omega_i}, \quad i = 1, \ldots, n. \qquad (4)$$
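The following sketch evaluates these survival components under the piecewise-linear baseline of equation (2), for which the statistics are $c_0(t) = t$ and $c_k(t) = (t - a_{k-1})_+^2 / 2$; the interval grid and parameter values are illustrative assumptions, not fitted quantities.

```python
import numpy as np

# Interval cut points a_0 = 0 < a_1 < ... < a_g (assumed grid, in months).
a = np.array([0.0, 4.0, 12.0, 24.0])          # g = 3

def c_stats(t, a):
    """Positive statistics c_k(t), k = 0..g, so that the integrated
    baseline hazard is H0(t) = sum_k c_k(t) * lambda_k, under the
    piecewise-linear baseline of equation (2)."""
    c = [t]                                    # c_0(t) = t, from lambda*_0
    for ak in a[:-1]:                          # knots a_0 .. a_{g-1}
        c.append(0.5 * max(t - ak, 0.0) ** 2)  # (t - a_{k-1})_+^2 / 2
    return np.array(c)

def S0(t, lam, a):
    return np.exp(-c_stats(t, a) @ lam)        # baseline survival

def S1(t, theta):
    return np.exp(-t / theta)                  # exponential part, mean theta

def S_frailty(t, omega_i, lam, theta, a):
    return (S0(t, lam, a) * S1(t, theta)) ** omega_i

# Illustrative parameter values (assumptions, not fitted estimates).
lam = np.array([0.02, 0.001, 0.0005, 0.0002])  # lambda*_0 .. lambda*_3
print(S_frailty(t=6.0, omega_i=1.3, lam=lam, theta=40.0, a=a))
```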

We assume that the parameter θ is specific for each individual in the population, but related to the covariates X through a probabilistic model.

In order to facilitate the implementation, it is convenient to assume that

$$\log \theta \mid X \sim N(X\beta, \sigma_\theta^2), \qquad (5)$$

that is, given $X$, the logarithm of the mean, $\log \theta$, is modeled as a normal distribution with variance $\sigma_\theta^2$ and mean $X\beta$, a linear combination of the covariate effects, where $\beta = (\beta_1, \ldots, \beta_p)^T$ and $X$ is the $N \times p$ matrix of the covariates with rows $x_1, \ldots, x_N$, with $N$ the total number of individuals. The hyper-parameters $\beta$ and $\sigma_\theta^2$ are unknown constants common to all individuals in the population.

The hierarchical model given by the two stages (4) and (5) allows complete heterogeneity ("frailty") in the population, so we can find two different individuals who have the same covariates but whose hazard functions are not necessarily identical. Further, the first stage (4) describes each individual with its own parametric model, given vectors of specific parameters to be estimated later, while the second stage (5) accounts for cross-sectional (between-subjects) heterogeneity through the matrix parameter $\theta = \{\theta_{ij}\}$, $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$, so that $\theta_{ij}$ denotes the expected event time for the $j$th individual in cluster $i$.

4. Bayesian Computations for Churn Models

A hierarchical representation of the model enables us to implement the MCMC methodology that allows a Bayesian analysis of the problem. We shall now develop this representation. Suppose that the survival time $t_{ij}$ of the $j$th individual in the $i$th cluster is an absolutely continuous random variable, conditionally independent of a right-censoring time $W_{ij}$ given the covariates $x_{ij}$ and the frailty $\omega_i$.

Let $V_{ij} = \min(t_{ij}, W_{ij})$ and $\delta_{ij} = I(t_{ij} \leq W_{ij})$ denote the time to the end-point event and the indicator for the event of interest to take place, respectively. Suppose that the $(V_{ij}, \delta_{ij}, x_{ij}, \omega_i)$ are i.i.d., for $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$, and that the conditional hazard function of $t_{ij}$ given $x_{ij}$ and $\omega_i$ satisfies the hazard model described in Section 3. For subject $j$ in cluster $i$, let $N_{ij}(t) = 1$ if $\delta_{ij} = 1$ and the event falls in the interval $[0, t]$, and $N_{ij}(t) = 0$ otherwise; and let $Y_{ij}(t) = 1$ if the subject is still exposed to risk at time $t$ and $Y_{ij}(t) = 0$ otherwise.

Hence, we have a set of $N = \sum_{i=1}^{n} m_i$ subjects, which means that the counting process $\{N_{ij}(t);\, t \geq 0\}$ for the $j$th subject in the $i$th cluster records the number of observed events up to time $t$. Letting $dN_{ij}(t)$ denote the increment of $N_{ij}(t)$ over the small interval $[t, t + dt)$, the likelihood of the data conditional on $\omega_i$ is then proportional to:

$$L(\lambda^*, \omega, \theta) \propto \prod_{i=1}^{n} \prod_{j=1}^{m_i} \prod_{t \geq 0} \left\{ Y_{ij}(t)\, \omega_i [h_0(t \mid \lambda^*) + h_1(t \mid \theta)] \right\}^{dN_{ij}(t)}$$
$$\times \exp\left( -\int_{t \geq 0} Y_{ij}(t)\, \omega_i [h_0(t \mid \lambda^*) + h_1(t \mid \theta)]\, dt \right). \qquad (6)$$

Since we allow each $N_{ij}(t)$ to take at most one jump for each subject, note that the $dN_{ij}(t)$ contribute to the likelihood in the same manner as independent Poisson random variables, even though $dN_{ij}(t) \leq 1$ for all $i, j$ and $t$.

Suppose that the time axis $[0, +\infty)$ is partitioned into $g + 1$ disjoint intervals $I_1, \ldots, I_{g+1}$, where $I_k = [a_{k-1}, a_k)$ for $k = 1, \ldots, g + 1$, with $a_0 = 0$ and $a_{g+1} = +\infty$. In the $k$th interval, given $\omega_i$, the $j$th subject in the $i$th cluster has a hazard equal to $\omega_i [h_0(t \mid \lambda^*) + h_1(t \mid \theta_{ij})]$, $k = 1, \ldots, g_{ij}$, where $g_{ij}$ denotes the number of partitions of the time interval for the $j$th subject in the $i$th group.

Given the complete data $(T, \omega)$, where $T = \{t_{ij} : i = 1, \ldots, n;\; j = 1, \ldots, m_i\}$, the likelihood can be re-expressed as:

$$\prod_{i=1}^{n} \prod_{j=1}^{m_i} \prod_{k=1}^{g} \prod_{t \in (a_{k-1}, a_k]} \left\{ Y_{ij}(t)\, \omega_i [h_0(t \mid \lambda^*) + h_1(t \mid \theta)] \right\}^{dN_{ijk}}$$
$$\times \exp\left( -\int_{t \geq 0} Y_{ij}(t)\, \omega_i [h_0(t \mid \lambda^*) + h_1(t \mid \theta)]\, dt \right), \qquad (7)$$

where $dN_{ijk}$ is the change in the count function for the $j$th subject in the $i$th group in the interval $k$. Under the assumption that the risk occurring in the interval $I_k$ is small, i.e.,

$$\int_{a_{k-1}}^{a_k} Y_{ij}(t)\, [h_0(t \mid \lambda^*) + h_1(t \mid \theta)]\, dt \approx 0,$$

for all $i, j, k$, the likelihood contribution across this interval for individuals at risk is approximately:

$$\left\{ \omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right\}^{dN_{ijk}} \times \exp\left( -\omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right),$$

where $dH_{0k} = \int_{a_{k-1}}^{a_k} h_0(t)\, dt$ is the usual cumulative baseline intensity function for the $k$th interval.

Notice again that the likelihood is essentially Poisson in form, reflecting the fact that it may be thought of as generated by the independent contributions of many data 'atoms', each concerned with the observation of an individual over a very short interval during which the intensity may be regarded as constant and approximately zero (for a review of this point, see e.g. Clayton [7]). Based on basic algebra, we can express the likelihood as:

$$L(\lambda^*, \omega, \theta) \propto \prod_{i=1}^{n} \prod_{j=1}^{m_i} \prod_{k:\, Y_{ijk} = 1} \left\{ \omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right\}^{dN_{ijk}}$$
$$\times \exp\left( -\omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right), \qquad (8)$$

where $Y_{ijk} = 1$ if the $j$th subject in the $i$th group is exposed to risk at some time $t \in (a_{k-1}, a_k]$ and $Y_{ijk} = 0$ otherwise.
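The Poisson form of (8) is straightforward to evaluate; the sketch below computes the log-likelihood from exposure indicators, counts, interval baseline increments and frailties, all of which are assumed toy arrays here.

```python
import numpy as np
from scipy.special import gammaln  # trivial for dN in {0,1}; kept general

def loglik_poisson_form(Y, dN, dH0, omega, theta, a):
    """Log of likelihood (8): for every exposed cell (i, j, k),
    dN_ijk ~ Poisson(omega_i * [dH0_k + (a_k - a_{k-1}) / theta_ij])."""
    widths = np.diff(a)                                   # a_k - a_{k-1}
    mu = omega[:, None, None] * (dH0[None, None, :]
          + widths[None, None, :] / theta[:, :, None])    # Poisson means
    ll = Y * (dN * np.log(mu) - mu - gammaln(dN + 1.0))
    return ll.sum()

# Toy dimensions and values (assumptions for illustration only).
n, m, g = 2, 3, 4
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
rng = np.random.default_rng(1)
Y = np.ones((n, m, g))                        # everyone exposed everywhere
dN = rng.integers(0, 2, size=(n, m, g)).astype(float)
dH0 = np.array([0.05, 0.08, 0.04, 0.03])      # baseline increment per interval
omega = np.array([0.8, 1.2])
theta = np.full((n, m), 20.0)
print(loglik_poisson_form(Y, dN, dH0, omega, theta, a))
```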

To complete the Bayesian specification of the model, prior distributions are needed for the vector parameter $\lambda^*$ and the hyper-parameters $\beta$ and $\sigma_\theta^2$. It seems natural to assume that $\lambda^* = (\lambda_0^*, \ldots, \lambda_g^*)^T$ is independent of $(\beta, \sigma_\theta^2)$, and that $\eta$ is independent of $(\lambda^*, \beta, \sigma_\theta^2)$. Specifically, for $\lambda^*$ we assume independent Gamma priors:

$$\lambda_k^* \sim \mathrm{Gamma}(\lambda_k^* \mid a_k^*, b_k^*), \quad k = 1, \ldots, g_{ij},$$

where $a_k^* / b_k^*$ is the prior expectation for $\lambda_k^*$ and $a_k^* / (b_k^*)^2$ is the prior variance, with prior independence assumed across the intervals, hence: $\lambda^* \sim \prod_{k=1}^{g_{ij}} \mathrm{Gamma}(\lambda_k^* \mid a_k^*, b_k^*)$. Concerning $\beta$ and $\sigma_\theta^2$ (corresponding to the parameters of a regression model), we assume a normal-inverse-gamma prior (see e.g. Bernardo and Smith [4]): $\beta \mid \sigma_\theta^2 \sim N_p(m_\theta, \sigma_\theta^2 V_\theta)$, with $\frac{1}{\sigma_\theta^2} \sim \mathrm{Gamma}(a_\theta, b_\theta)$. Finally, we suggest a Gamma distribution as a prior for $\eta$, i.e.: $\eta \sim \mathrm{Gamma}(\phi_1, \phi_2)$, where $\phi_1 / \phi_2$ is the prior expectation for $\eta$ and $\phi_1 / \phi_2^2$ is the prior variance. We take $E(\eta) = \phi_1 / \phi_2 = 1$.

To obtain the conditional posterior distributions required for Gibbs sampling, we use the approach of data augmentation [25]. The idea of data augmentation is to insert latent or missing data in order to exploit the simplicity of the resulting conditional posterior distributions of the vector parameters of interest. Although this will increase the dimensionality of the problem (possibly at the expense of extra computing time), the Gibbs sampler will be simpler.

Our objective is to derive the posterior distribution of $(\lambda^*, \beta, \sigma_\theta^2, \omega)$. Such a posterior cannot be computed analytically and, therefore, we use Monte Carlo approximations through the Gibbs sampler algorithm. We remark that, under the above specifications, the likelihood is essentially Poisson in form, reflecting the fact that it may be thought of as generated by the independent contributions of many data atoms, each concerned with the observation of individual $j$ of cluster $i$ over a very short interval during which the intensity may be regarded as constant, i.e.,

$$P(dN_{ijk} = n \mid dH_{0k}, x_{ij}, Y_{ijk}, \omega_i, \theta_{ij}) = \frac{1}{n!} \left\{ \omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right\}^{n}$$
$$\times \exp\left( -\omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right). \qquad (9)$$

In a compact form we have:

$$dN_{ijk} \sim \mathrm{Poisson}\left( dN_{ijk} \,\Big|\, \omega_i\, dH_{0k} + \frac{\omega_i}{\theta_{ij}} (a_k - a_{k-1}) \right).$$


Since the additive form of the Poisson mean does not yield conditional posterior distributions in closed form, we can solve this problem by expressing the likelihood in an augmented form, involving independent Poisson latent variables (unobserved or missing data) corresponding to each term in the expression for the Poisson mean. In particular, we assume

$$dN_{ijk} = dN_{ijk}^0 + dN_{ijk}^2 + dN_{ijk}^1, \quad \text{for all } i, j : Y_{ij} = 1 \text{ and } k = 1, \ldots, g,$$

such that:

$$dN_{ijk}^0 \sim \mathrm{Poisson}\left( dN_{ijk}^0 \,\big|\, \omega_i\, \lambda_0^* (a_k - a_{k-1}) \right),$$

$$dN_{ijk}^2 \sim \mathrm{Poisson}\left( dN_{ijk}^2 \,\Big|\, \omega_i\, \lambda_k^*\, \frac{(a_k - a_{k-1})^2}{2} \right)$$

and

$$dN_{ijk}^1 \sim \mathrm{Poisson}\left( dN_{ijk}^1 \,\Big|\, \frac{\omega_i}{\theta_{ij}} (a_k - a_{k-1}) \right).$$

Using the property that the sum of independent Poisson random variables is also Poisson, it is straightforward to show that the previous equations are equivalent. Such an expression allows us to take advantage of Poisson-Gamma conjugacy to obtain the conditional posteriors. Indeed, in this work we obtained all the required conditionals. The calculations are as follows.

The joint posterior density of the parameters $(\lambda^*, \beta, \sigma_\theta^2, \omega)$ and latent variables $(dN_{ijk}^0, dN_{ijk}^2, dN_{ijk}^1)$ is proportional to:

$$A_1 \times \mathrm{Gamma}(\eta \mid \phi_1, \phi_2)\, \mathrm{Gamma}(\lambda_0^* \mid a_0^*, b_0^*)\, N_p(\beta \mid m_\theta, \sigma_\theta^2 V_\theta)\, \mathrm{Gamma}\left( \frac{1}{\sigma_\theta^2} \,\Big|\, a_\theta, b_\theta \right),$$

where $A_1$ is:

$$A_1 = \prod_{i=1}^{n} \prod_{j=1}^{m_i} \prod_{k=1}^{g_{ij}} I(dN_{ijk} = dN_{ijk}^0 + dN_{ijk}^2 + dN_{ijk}^1)$$
$$\times \mathrm{Poisson}\left( dN_{ijk}^0 \,\big|\, \omega_i \lambda_0^* (a_k - a_{k-1}) \right) \times \mathrm{Poisson}\left( dN_{ijk}^2 \,\Big|\, \omega_i \lambda_k^* \frac{(a_k - a_{k-1})^2}{2} \right)$$
$$\times \mathrm{Poisson}\left( dN_{ijk}^1 \,\Big|\, \frac{\omega_i}{\theta_{ij}} (a_k - a_{k-1}) \right)$$
$$\times N(\log \theta_{ij} \mid x_{ij} \beta, \sigma_\theta^2)\, \mathrm{Gamma}(\lambda_k^* \mid a_k^*, b_k^*)\, \mathrm{Gamma}(\omega_i \mid \eta^{-1}, \eta^{-1}). \qquad (10)$$

Having obtained the joint posterior density, we can now outline the Monte Carlo sampler that we developed. The detailed description is reported in the Appendix.

5. Application of Bayesian Churn Models

We now turn our attention towards the application of the presented methodologies for modeling survival risk. In our problem the risk concerns the value that derives from the loss of a customer that churns. The objective is to determine which combination of covariates affects the risk function, studying specifically the characteristics and the relationship with the probability of survival for each customer. The model will be applied to a real case study that concerns an Italian media company that renews its customer contracts annually.

5.1. The available data

The data available for our analysis contain information that can affect the event time, such as demographic variables, variables regarding the contract, the payment and the contacts, and geo-marketing variables. The response variable, used as a dependent variable to build predictive models, distinguishes two different types of customers: those who are active during the survey and those who, instead, regularly cancelled their subscription. Concerning the explanatory variables, the variables used were taken from the different databases used within the company, which contained, respectively: socio-demographic information about the customers; information about their contractual situation and about its change over time; information about contacts with the customers (through the call centre, promotion campaigns, etc.) and, finally, geo-marketing information (divided into census, municipality and larger geographical section information).

Regardless of their provenance, all variables have gone through a pre-processing feature selection step aimed at reducing them from a large number to a smaller set (containing 24 variables). The available sample is composed of 3000 observations.


5.2. Descriptive survival analysis

In order to build a survival analysis model, we have constructed two variables: one variable of status (which distinguishes between active and non-active customers) and one of duration (an indicator of customer seniority). The first step in the analysis of survival data consists of a plot of the survival function and of the hazard.

The application of the Kaplan-Meier estimator (see e.g. Kaplan and Meier [24]) to our data leads to the estimated survival function in Figure 1. From Figure 1 we note that the survival function has varying slopes, corresponding to different periods. When the curve decreases rapidly we have time periods with high churn rates; when the curve decreases slowly we have periods of loyalty. We remark that the final jump is due to a distortion caused by a few observations in the tail of the lifecycle distribution.

Figure 1. Survival function.

Figure 2. Hazard function.


Table 1. Bayesian feature selection results.

    Variable                    p(y | g)   p(g | y)
    β1  info disconnection      0.2451     0.0472
    β2  decoder sold            0.2452     0.0472
    β3  decoder rental          0.2466     0.0475
    β4  payment credit card     0.2497     0.0481
    β5  promotion               0.2514     0.0484
    β6  channel of sale         0.2588     0.0491
    β7  ex decoder rental       0.2521     0.0488
    β8  special offers          0.2835     0.0546

In Figure 2 we show the hazard function, which demonstrates how the instantaneous risk rate varies in time. From Figure 2 we note two peaks, corresponding to months 4 and 12, the most risky ones. Note that the risk rate is otherwise almost constant along the lifecycle. Of course there is a peak at the end, corresponding to what we observed in Figure 1.

We now move to the building of a complete predictive model. We have chosen to implement the classical Cox model first. The result, following a stepwise model selection procedure, is a set of about 20 explanatory variables. Such variables can be grouped into three main categories, according to the sign of their association with the churn rate, represented by the hazard ratio:

− Variables that show a positive association (e.g. wealth of the geographic regions, the quality of the call-center service, the sales channel);

− Variables that show a negative association (e.g. number of technical problems, cost of service bought, payment method);

− Variables that have no association (e.g. equipment rental cost, age of customer, number of family components).

5.3. Bayesian variable selection: results

We now show the results of our approach to Bayesian variable selection described in Section 2. In Table 1 we show the most important covariates for predicting churn, according to the posterior probability of the corresponding partition. We observe that the most important variable for predicting churn risk regards information on disconnection, that is, whether a customer has contacted the call-center to terminate the contract. The second concerns decoder usage, followed by covariates for payment method, promotion and special offers.

Table 2. Two step model: parameter estimation from the Bayesian Cox model. (The columns 0.025 and 0.975 give the bounds of the 95% Bayesian credible interval.)

    Variable                   Mean      Sd       MC error   0.025     Median    0.975
    β1  info disconnection     0.7769    0.2123   0.00139    0.3547    0.7831    1.162
    β2  decoder sold           -1.632    2.223    0.08101    -5.938    -1.688    3.186
    β3  decoder rental         -1.731    0.6359   0.0308     -2.991    -1.718    -0.4818
    β4  payment credit card    -2.203    0.8412   0.04174    -3.715    -2.25     -0.3793
    β5  promotion              -1.368    0.6166   0.02468    -2.514    -1.396    -0.1127
    β6  channel of sale        -0.7287   1.626    0.09111    -3.206    -0.9579   3.382
    β7  ex decoder rental      -1.494    0.6678   0.02963    -2.845    -1.48     -0.215
    β8  special offers         0.67      2.141    0.1202     -3.957    0.6207    4.817

5.4. Bayesian inference: results

We now proceed with Bayesian inference. Table 2 shows the results of Bayesian estimation for our model parameters. For each covariate selected by our Bayesian feature selection approach in the first step we calculated, for each parameter, the mean, the standard deviation, the Monte Carlo error, the median and the Bayesian confidence interval.

We estimated our models with different MCMC chains. The most stable result is with 10000 iterations and 500 iterations as a burn-in. We used the idea of parallel multiple chains to check the convergence of the Gibbs sampler, following Gelman and Rubin [18]. In particular, to generate the Gibbs posterior samples, we used three parallel chains. Monitoring convergence of the chains was carried out via the Brooks and Gelman [5] convergence diagnostic graph. We obtained that for the model in Table 2, the estimated correlations between parameters are quite low and the diagnostics indicate that the model has converged.
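As a sketch of the convergence check described above, the potential scale reduction factor of Gelman and Rubin can be computed from parallel chains as follows; the chains are simulated here and merely stand in for actual posterior draws.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for one scalar parameter.
    `chains` has shape (M, T): M parallel chains of T post-burn-in draws."""
    M, T = chains.shape
    chain_means = chains.mean(axis=1)
    B = T * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (T - 1) / T * W + B / T       # pooled variance estimate
    return np.sqrt(var_plus / W)

# Three parallel chains of 10000 draws after a 500-iteration burn-in,
# simulated here for illustration; values near 1 indicate convergence.
rng = np.random.default_rng(2)
chains = rng.normal(loc=0.7769, scale=0.2123, size=(3, 10000))
print(f"R-hat = {gelman_rubin(chains):.4f}")
```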

Furthermore, the goodness of a model should be evaluated in terms of predictive accuracy in a cross-validation exercise. We thus split the data set into the two usual subsets: training and test. Both were proportionally sampled with respect to the status variable. All sampled data contain information on all the finally chosen explanatory variables (about twenty). In order to evaluate the predictive performance of the model, and compare it with a classical Cox model [8], we have focused our attention on a 3-month-ahead prediction. We have proposed and implemented a procedure based on the estimated survival probabilities, aimed at calculating the percentage of true churners captured by the model. We remark that this is not a fair comparison, as survival models predict more than a single time point; however, company experts typically ask for this type of model benchmarking.

In order to measure the predictive power of the models we use a confusion matrix (see e.g. Giudici [20]). The confusion matrix is used as an indication of the properties of a classification (discriminant) rule. It contains the number of elements that have been correctly or incorrectly classified for each class. On its main diagonal we can see the number of observations that have been correctly classified for each class, while the off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is (explicitly or implicitly) assumed that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called the rate of error, or misclassification error, and it must be minimized. Of course, the assumption of equal costs can be replaced by weighting errors with their relative costs. If there are different costs for different errors, a model with a lower general level of accuracy is preferable to one that has greater accuracy but also higher costs.

Table 3. Theoretical confusion matrix.

    Observed \ Predicted    P = 0    P = 1    Total
    O = 0                   a        b        a + b
    O = 1                   c        d        c + d
    Total                   a + c    b + d    a + b + c + d

In our application we consider a sample of 1000 customers to validate the models. Table 3 classifies the observations of a validation dataset into four possible categories: the observations predicted as events which effectively are events (with absolute frequency equal to a); the observations predicted as events which are effectively non-events (with frequency equal to c); the observations predicted as non-events which are effectively events (with frequency equal to b); and the observations predicted as non-events which turn out to be non-events (with frequency equal to d). We obtain the following results: a = 445, b = 45, c = 106, d = 404 for the classical Cox model and a = 497, b = 87, c = 108, d = 308 for the Bayesian Cox model proposed in Section 4. Therefore, based on Table 3, we see that the confusion matrix based on the classical Cox model gives poor results in model prediction.
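For concreteness, the sketch below tabulates the equal-cost misclassification rate implied by the reported frequencies; as noted above, when errors carry different costs this single rate is not the whole story.

```python
# Confusion-matrix counts reported above (validation sample of 1000),
# with a, b, c, d laid out as in Table 3.
models = {
    "classical Cox": dict(a=445, b=45, c=106, d=404),
    "Bayesian Cox":  dict(a=497, b=87, c=108, d=308),
}
for name, f in models.items():
    total = sum(f.values())
    error = (f["b"] + f["c"]) / total   # off-diagonal share, equal error costs
    print(f"{name}: misclassification rate = {error:.3f}")
# With unequal costs (e.g. a missed churner costs more than a false alarm),
# a cost-weighted error should be minimized instead, as discussed in the text.
```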

6. Final Remarks

In this paper we have presented a Bayesian methodology to predict customer churn rates and, therefore, to estimate customer LTV. The Bayesian approach we have proposed, based on survival analysis modeling, leads to better performance. Our results show that our Bayesian survival analysis models are effective tools for LTV analysis and, consequently, for the actual planning of a range of marketing actions that impact both prospective and actual customers. We believe that although Bayesian survival analysis is a very promising tool in the area of LTV estimation, further research is indeed needed, both in applied and methodological terms. From an applied viewpoint, directions to investigate further concern the application of the methodology to a wider range of problems (we are conducting research in the area of credit risk). From a methodological viewpoint, further research is needed to make the Bayesian Cox model more robust, particularly by taking possible random effects into account.

7. Appendix

The sampler iterates through the following steps:

Step 1. Sample the latent variables $(dN_{ijk}^0, dN_{ijk}^2, dN_{ijk}^1)^T$, for all $i, j, k : Y_{ijk} = 1$, jointly from their full conditional posterior distribution as follows:

1. If $dN_{ijk} = 0$, then let $dN_{ijk}^0 = dN_{ijk}^2 = dN_{ijk}^1 = 0$;

2. If $dN_{ijk} > 0$, then sample $(dN_{ijk}^0, dN_{ijk}^2, dN_{ijk}^1)$ from a Multinomial$(dN_{ijk} \mid P_{ijk}^0, P_{ijk}^2, P_{ijk}^1)$ distribution, where (the frailty $\omega_i$ multiplies all three Poisson means and therefore cancels):

$$P_{ijk}^0 = \frac{\lambda_0^* (a_k - a_{k-1})}{\lambda_0^* (a_k - a_{k-1}) + \lambda_k^* (a_k - a_{k-1})^2 / 2 + (a_k - a_{k-1}) / \theta_{ij}}, \qquad (11)$$

$$P_{ijk}^2 = \frac{\lambda_k^* (a_k - a_{k-1})^2 / 2}{\lambda_0^* (a_k - a_{k-1}) + \lambda_k^* (a_k - a_{k-1})^2 / 2 + (a_k - a_{k-1}) / \theta_{ij}}, \qquad (12)$$

$$P_{ijk}^1 = \frac{(a_k - a_{k-1}) / \theta_{ij}}{\lambda_0^* (a_k - a_{k-1}) + \lambda_k^* (a_k - a_{k-1})^2 / 2 + (a_k - a_{k-1}) / \theta_{ij}}. \qquad (13)$$
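A sketch of this augmentation step: when $dN_{ijk} > 0$ the latent counts are allocated multinomially with the probabilities (11)-(13); the parameter values and interval endpoints below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def split_latent_counts(dN, lam0, lam_k, theta_ij, ak1, ak):
    """Step 1: allocate dN_ijk among (dN0, dN2, dN1) via (11)-(13).
    The frailty omega_i multiplies all three Poisson means and cancels."""
    if dN == 0:
        return 0, 0, 0
    w = np.array([lam0 * (ak - ak1),                 # baseline-level term
                  lam_k * (ak - ak1) ** 2 / 2.0,     # baseline-slope term
                  (ak - ak1) / theta_ij])            # exponential term
    return tuple(rng.multinomial(dN, w / w.sum()))

# Assumed values for one (i, j, k) cell with an observed event.
print(split_latent_counts(dN=1, lam0=0.02, lam_k=0.001,
                          theta_ij=20.0, ak1=4.0, ak=12.0))
```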

It follows from the expression of $A_1$ that the full conditional distribution of the latent variables is proportional to:

$$A_2 = I(dN_{ijk} = dN_{ijk}^0 + dN_{ijk}^2 + dN_{ijk}^1) \times \frac{[\omega_i \lambda_0^* (a_k - a_{k-1})]^{dN_{ijk}^0}}{dN_{ijk}^0!}$$
$$\times \frac{[\omega_i \lambda_k^* (a_k - a_{k-1})^2 / 2]^{dN_{ijk}^2}}{dN_{ijk}^2!} \times \frac{[\omega_i (a_k - a_{k-1}) / \theta_{ij}]^{dN_{ijk}^1}}{dN_{ijk}^1!}.$$

On the other hand, given $A_1$, we have that $A_2$ is also proportional to:

$$\frac{dN_{ijk}!}{dN_{ijk}^0!\, dN_{ijk}^2!\, dN_{ijk}^1!}\, (P_{ijk}^0)^{dN_{ijk}^0} (P_{ijk}^2)^{dN_{ijk}^2} (P_{ijk}^1)^{dN_{ijk}^1}$$
$$\propto \mathrm{Multinomial}(dN_{ijk}^0, dN_{ijk}^2, dN_{ijk}^1 \mid dN_{ijk};\, \{P_{ijk}^0, P_{ijk}^2, P_{ijk}^1\}),$$

where $P_{ijk}^0, P_{ijk}^2, P_{ijk}^1$ are defined in (11), (12) and (13).

Step 2. Sample $\lambda_0^*$; the full conditional distribution of $\lambda_0^*$ is proportional to:

$$\prod_{i,j,k:\, Y_{ijk} = 1} \frac{[\omega_i \lambda_0^* (a_k - a_{k-1})]^{dN_{ijk}^0}}{dN_{ijk}^0!} \exp\{-\omega_i \lambda_0^* (a_k - a_{k-1})\} \times (\lambda_0^*)^{a_0^* - 1} \exp(-b_0^* \lambda_0^*)$$
$$\propto (\lambda_0^*)^{a_0^* + \sum_{i,j,k} dN_{ijk}^0 - 1} \exp\left\{ -\lambda_0^* \left[ b_0^* + \sum_{i,j,k:\, Y_{ijk} = 1} \omega_i (a_k - a_{k-1}) \right] \right\}$$
$$\propto \mathrm{Gamma}\left( \lambda_0^* \,\Big|\, a_0^* + \sum_{i=1}^{n} \sum_{j=1}^{m_i} \sum_{k=1}^{g_{ij}} dN_{ijk}^0,\; b_0^* + \sum_{i=1}^{n} \sum_{j=1}^{m_i} \sum_{k=1}^{g_{ij}} Y_{ijk}\, \omega_i (a_k - a_{k-1}) \right).$$

Step 3. Sample $\lambda_k^*$, $k = 1, \ldots, g_{ij}$; the full conditional distribution of $\lambda_k^*$ is proportional to:

$$\prod_{i,j:\, Y_{ijk} = 1} \left[ \omega_i \lambda_k^* \frac{(a_k - a_{k-1})^2}{2} \right]^{dN_{ijk}^2} \exp\left\{ -\omega_i \lambda_k^* \frac{(a_k - a_{k-1})^2}{2} \right\} \times (\lambda_k^*)^{a_k^* - 1} \exp(-b_k^* \lambda_k^*)$$
$$\propto \mathrm{Gamma}\left( \lambda_k^* \,\Big|\, a_k^* + \sum_{i=1}^{n} \sum_{j=1}^{m_i} dN_{ijk}^2,\; b_k^* + \sum_{i=1}^{n} \sum_{j=1}^{m_i} Y_{ijk}\, \omega_i \frac{(a_k - a_{k-1})^2}{2} \right).$$


Step 4. To derive the conditional distribution of $\omega_i$, $i = 1, \ldots, n$, we start with the joint posterior density of the parameters prior to augmentation, which is proportional to:

$$\prod_{j=1}^{m_i} \prod_{k:\, Y_{ijk} = 1} \left\{ \omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right\}^{dN_{ijk}}$$
$$\times \exp\left\{ -\omega_i \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right\} \omega_i^{\eta^{-1} - 1} \exp(-\eta^{-1} \omega_i)$$
$$\propto \omega_i^{\eta^{-1} + \sum_{j=1}^{m_i} \sum_{k=1}^{g_{ij}} dN_{ijk} - 1} \exp\left\{ -\omega_i \left( \eta^{-1} + \sum_{j=1}^{m_i} \sum_{k=1}^{g_{ij}} \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right) \right\}$$
$$\propto \mathrm{Gamma}\left( \eta^{-1} + \sum_{j=1}^{m_i} \sum_{k=1}^{g_{ij}} dN_{ijk},\; \eta^{-1} + \sum_{j=1}^{m_i} \sum_{k=1}^{g_{ij}} \left[ dH_{0k} + \frac{1}{\theta_{ij}} (a_k - a_{k-1}) \right] \right).$$
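Step 4 likewise yields one Gamma draw per cluster; a sketch with assumed inputs:

```python
import numpy as np

rng = np.random.default_rng(5)

def update_omega_i(eta, dN_i, dH0, theta_i, widths, Y_i):
    """Step 4: omega_i | ... ~ Gamma(1/eta + sum_jk dN_ijk,
    1/eta + sum_jk Y_ijk [dH0_k + widths_k / theta_ij])."""
    shape = 1.0 / eta + dN_i.sum()
    rate = 1.0 / eta + (Y_i * (dH0[None, :]
                  + widths[None, :] / theta_i[:, None])).sum()
    return rng.gamma(shape, 1.0 / rate)

# Assumed inputs for one cluster with m=3 subjects over g=4 intervals.
m, g = 3, 4
print(update_omega_i(eta=0.5, dN_i=np.zeros((m, g)),
                     dH0=np.array([0.05, 0.08, 0.04, 0.03]),
                     theta_i=np.full(m, 20.0),
                     widths=np.ones(g), Y_i=np.ones((m, g))))
```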

Step 5. Sample $\omega_i$, $i = 1, \ldots, n$, from the expression we have found at Step 4. The full conditional distribution of $(\beta, \sigma_\theta^2)$ is proportional to:

$$\left\{ \prod_{i=1}^{n} \prod_{j=1}^{m_i} N(\log \theta_{ij} \mid x_{ij} \beta, \sigma_\theta^2) \right\} N_p(\beta \mid m_\theta, \sigma_\theta^2 V_\theta)\, \mathrm{Gamma}\left( \frac{1}{\sigma_\theta^2} \,\Big|\, a_\theta, b_\theta \right). \qquad (14)$$

That expression is the same as the one that appears in the usual conjugate analysis of normal data (see e.g. DeGroot [9]). It is then proportional to a multivariate normal-inverse-gamma distribution, i.e.,

$$\beta \mid \sigma_\theta^2 \sim N_p\big( \hat{\beta},\; \sigma_\theta^2 (V_\theta^{-1} + X^T X)^{-1} \big),$$

$$\frac{1}{\sigma_\theta^2} \sim \mathrm{Gamma}\left( a_\theta + \frac{N}{2},\; b_\theta + \frac{1}{2} \left[ (y - X\hat{\beta})^T y + (m_\theta - \hat{\beta})^T V_\theta^{-1} m_\theta \right] \right),$$

where $y = (\log \theta_{11}, \ldots, \log \theta_{n m_n})^T$, $X$ is the covariate matrix, and the estimate $\hat{\beta}$ of the regression coefficients is calculated as

$$\hat{\beta} = (V_\theta^{-1} + X^T X)^{-1} (V_\theta^{-1} m_\theta + X^T y).$$

Step 6. Sample $\sigma_\theta^2$ and then $\beta \mid \sigma_\theta^2$ from the expressions we have found at Step 5.


The other conditionals do not admit a conjugate analysis. For each $j = 1, \ldots, m_i$ and $i = 1, \ldots, n$, the conditional distribution for $\theta_{ij}$ is proportional to:

$$Y_{ij}(t)\, [h(t \mid \lambda^*, \theta_{ij})]^{dN_{ijk}}\, S(t \mid \lambda^*, \theta_{ij})\, N(\log \theta_{ij} \mid x_{ij} \beta, \sigma_\theta^2), \quad k = 1, \ldots, g_{ij}.$$

This expression does not have a closed form, but it is still possible to sample from it using a Metropolis algorithm.

Finally, letting $\xi = \eta^{-1}$, the full conditional distribution of $\xi$ does not have a closed form either. It is proportional to:

$$\left( \prod_{i=1}^{n} \omega_i \right)^{\xi - 1} \xi^{n \xi}\, \exp\left( -\xi \sum_{i=1}^{n} \omega_i \right) [\Gamma(\xi)]^{-n}\, f(\xi),$$

with $\xi \sim f(\xi) = \mathrm{Gamma}(\xi \mid \phi_1, \phi_2)$. With this choice of priors it can be shown that the above full conditional density is log-concave. Thus we can use the adaptive rejection algorithm to sample from this full conditional. The results are in the application section.

Acknowledgements

This work has been supported by MUSING 2006 contract number 027097, 2006-2010 and MIUR PRIN Project 2005-2006 on Data Mining for E-business applications.

References

[1] O. Aalen, Nonparametric inference for a family of counting processes, Annals of Statistics 6 (1978), 701-726.

[2] P. K. Andersen and R. D. Gill, Cox’s regression model for counting processes: a large sample study, Annals of Statistics 10 (1982), 1100-1120.

[3] P. K. Andersen, O. Borgan, R. D. Gill and N. Keiding, Statistical Models Based on Counting Processes, Springer Verlag, New York, 1993.

[4] J. M. Bernardo and A. F. M. Smith, Bayesian Theory, Wiley, Chichester, 1994.

[5] S. P. Brooks and A. Gelman, General methods for monitoring convergence of iterative simulations, Journal of Computational and Graphical Statistics 7 (1998), 434-456.

[6] D. G. Clayton, A Monte Carlo method for Bayesian inference in frailty models, Biometrics 47 (1991), 467-485.


[7] D. G. Clayton, Bayesian Analysis of Frailty Models, Technical report, Medical Research Council Biostatistics Unit, Cambridge, 1994.

[8] D. R. Cox, Regression models and life tables, Journal of the Royal Statistical Society, Series B 34 (1972), 187-220.

[9] M. H. DeGroot, Optimal Statistical Decisions, McGraw Hill, 1970.

[10] D. B. Dunson, Bayesian semiparametric isotonic regression for count data, Journal of the American Statistical Association 100 (2005), 618-627.

[11] J. Fan and R. Li, Variable selection via penalized likelihood, Technical Report, Department of Statistics, UCLA, 2002.

[12] J. Fan, H. Lin and Y. Zhou, Local partial-likelihood estimation for lifetime data, Annals of Statistics 34 (2006), 290-325.

[13] S. Figini, Customer relationship: a survival analysis approach, (to appear in Proceedings of Compstat, Roma), 2006.

[14] S. Figini, Bayesian variable and model selection for Customer Lifetime Value, Phd. dissertation, 2006.

[15] T. R. Fleming and D. P. Harrington, Counting Processes and Survival Analysis, Wiley, New York, 1991.

[16] M. H. Gail, T. J. Santner and C. C. Brown, An analysis of comparative carcinogenesis experiments with multiple times to tumor, Biometrics 36 (1980), 255-266.

[17] D. Gamerman, Dynamic Bayesian models for survival data, Applied Statistics 40 (1991), 63-79.

[18] A. Gelman and D. B. Rubin, A single series from the Gibbs sampler provides a false sense of security, Bayesian Statistics 4, Oxford University Press (1992), 641-650.

[19] P. Giudici, Hierarchical model to identify prognostic factors in survival analysis, Statistical Application 1 (1996), 319-326.

[20] P. Giudici, Applied Data Mining, Wiley, 2003.

[21] P. Giudici, M. Mezzetti and P. Muliere, Mixtures of products of Dirichlet processes for variable selection in survival analysis, Journal of Statistical Planning and Inference 111 (2003), 101-115.

[22] J. D. Kalbfleisch, Nonparametric Bayesian analysis of survival time data, Journal of the Royal Statistical Society B 40 (1978), 214-221.

[23] J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data, Wiley, New York, 1980.

[24] E. L. Kaplan and P. Meier, Nonparametric estimation from incomplete observations, Journal of the American Statistical Association 53 (1958), 457-481.

[25] M. A. Tanner and W. H. Wong, The calculation of posterior distributions by data augmentation, Journal of the American Statistical Association 82 (1987), 528-540.

[26] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 (1996), 267-288.