PREDICTING THE UNPREDICTABLE:
Investigating Customer Profitability over Time
author
Jantien Dekker
17 June 2018
MSc Marketing Thesis - Marketing Intelligence
17 June 2018
author
Jantien G. Dekker
S3177874
Palembangstraat 1, 9715 LK Groningen (NL)
+31 6 23653675
First supervisor: Prof. Dr. J.E. Wieringa
Second supervisor: Dr. J.T. Bouma
University of Groningen
Faculty of Economics & Business
Department of Marketing
PO Box 800, 9700 AV Groningen (NL)
SUMMARY
Over two decades ago, Foster, Gupta, and Sjoblom (1996) acknowledged the challenge of
tracking customer profitability (CP) over time: "a customer that is unprofitable now and is
expected to remain unprofitable requires a different set of corrective actions than a customer
that is unprofitable now but expected to be profitable in the foreseeable future". Measuring and
predicting customer profitability has become a major topic within marketing. Estimating
individual-level customer lifetime value (CLV) has proven to be difficult, with sophisticated
methods performing about as well as simple methods (Donkers, Verhoef, & de Jong, 2007).
Perhaps this is the result of little attention to modelling changes in CP over time.
This thesis answers the management question of how we can identify profitable customers,
and especially customers that become more or less profitable over time. The general research
problem is how we can predict future profitability of individual customers, while accounting for
changes in CP over time. We use transaction data from a supplier of non-food products to retail
stores throughout the Netherlands, covering a three-year period. To address this research
problem, we need to answer the following questions:
(1) What are drivers of customer profitability?
(2) How can we predict the future profitability of individual customers?
(3) To what extent are we able to predict changes in individual customer profitability
over time?
(4) How can we identify profitable customer segments based on our predictions?
We answer these questions by identifying CP drivers and components based on past
research, and we develop a predictive model that captures these identified drivers and
components to predict future CP over time.
Our main goal is to measure and predict individual customer profitability, which is the
revenues derived from a customer minus the costs to serve that customer. The specification of
both customer revenues and customer costs is hypothesized to have a significant influence on
model performance, and both components are mainly driven by past customer behavior,
customer characteristics, and firm actions.
We measure the profitability of a customer i (CP_i) as:
$$CP_i = \sum_{t=1}^{T} \left( GM_{it} - MC_{it} \right)$$
where GM_{it} = gross margins of customer i in period t
MC_{it} = marketing costs of customer i in period t
T = time horizon of our measurement
Our final CP model consists of two separate components: a Negative Binomial count model for
the number of visits, which drives our cost calculations, and a multiplicative fixed effects
OLS model for gross margins. We combine both models by subtracting the number of visits
multiplied by the average costs per visit from gross margins. For both the visits and the
gross margins component, we use a hurdle model with a binary logit component that models the
zero-observations.
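As an illustration of how the two components could be combined, the following Python sketch computes CP for one customer from per-period predictions. It assumes the usual hurdle-model expectation (the probability of a non-zero observation multiplied by the expected value given a non-zero observation). The class name, field names, and all numbers are hypothetical and are not taken from the R code in the appendices.

```python
from dataclasses import dataclass


@dataclass
class PeriodPrediction:
    """Hypothetical per-period model outputs for one customer."""
    p_margin: float        # logit part: probability of a non-zero gross margin
    margin_if_any: float   # OLS part: expected gross margin given it is non-zero
    p_visit: float         # logit part: probability of at least one visit
    visits_if_any: float   # count part: expected visits given at least one visit


def expected_cp(periods, cost_per_visit):
    """Sum expected gross margins minus expected visit costs over all periods."""
    total = 0.0
    for p in periods:
        gross_margin = p.p_margin * p.margin_if_any
        visit_costs = p.p_visit * p.visits_if_any * cost_per_visit
        total += gross_margin - visit_costs
    return total


# Made-up predictions for two periods and a made-up average cost per visit.
periods = [
    PeriodPrediction(p_margin=0.8, margin_if_any=100.0, p_visit=0.5, visits_if_any=2.0),
    PeriodPrediction(p_margin=0.5, margin_if_any=200.0, p_visit=1.0, visits_if_any=1.0),
]
cp = expected_cp(periods, cost_per_visit=25.0)
print(f"Expected CP over {len(periods)} periods: {cp:.2f}")
```

Keeping the cost side as visits multiplied by a single average cost per visit mirrors the description above; a different cost allocation would only change the `visit_costs` line.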
We found that past customer behavior, customer heterogeneity, and firm actions all drive
CP through its costs and gross margins components. Our model is able to capture changes in
CP over time, but the predicted changes are not very accurate: our model does not perform
significantly better than a model that projects past average CP onto future time periods.
Also, a model without a separate cost component does not perform significantly worse than
our model. For segmentation purposes, our model could offer managers a tool to better
understand what drives CP, especially within customer segments.
In this thesis we aimed to “predict the unpredictable”. We were able to predict changes in
customer profitability over time, but the predicted changes did not match the actual observed
changes in CP very well. We therefore conclude that trying to predict the unpredictable is very
difficult, perhaps even impossible. This matters especially because managers have scarce
resources and must weigh the investment needed to develop a sophisticated model against using
a relatively simple model that seems to predict CP almost equally well.
PREFACE
Dear reader,
Hereby I present to you my thesis for the MSc Marketing Intelligence. Over ten years ago I
graduated from secondary education, with no clue about what I wanted to do or who I wanted
to be in ten years. Several years passed in which I did not attend any form of education. Instead, I
worked at multiple companies in several positions, to find out that my heart lies in marketing.
After finishing my part-time Bachelor of Business Administration with a specialization in
Marketing Management at the Hanze University Groningen, I realized that I was still missing
“something”. When I learned about the MSc Marketing Intelligence, I quit my full-time job to
be a full-time student for the first time in my life. I have not regretted it for a single day: it has
given me exactly the “something” that I felt I was missing two years ago.
The data and the research problem of this thesis come from the company that offered me
my first work experience. It therefore also carries a personal touch. Experiencing a bankruptcy
of the company that you love working for at the age of nineteen was a very hard but informative
experience. I want to thank the provider of the dataset, since it is very interesting, real-world
data. Although the company does not exist anymore, it has given me the chance to put what I
have learned during my education into practice.
My gratitude goes out to Jaap Wieringa for supervising me during the process. I sometimes
can get carried away in my enthusiasm, and he could get me back on track. And although my
initial plan to combine my thesis with doing an internship did not come true, I also want to
thank Jelle Bouma for his conversations. The bumps in the road offered me a learning experience
that goes beyond writing a thesis. Finally, I want to thank all lecturers from the (Pre-)MSc
Marketing courses for their support and sharing their knowledge during my education at the
University of Groningen.
Kind regards,
Jantien Dekker
CONTENTS
1 Introduction
1.1 Changes in Customer Profitability over Time
1.2 Description of the Organization
1.3 Scope and Contribution
2 Customer Profitability
2.1 Managing Customer Profitability
2.2 Defining Customer Profitability
2.3 Measuring Customer Profitability
2.3.1 Customer Relationship
2.3.2 Customer Revenues and Risk
2.3.3 Customer Costs
2.4 Understanding Customer Profitability
2.4.1 Customer Behavior and Characteristics
2.4.2 Firm Actions
2.4.3 Market Variables
2.5 Conceptual Model
3 Model
3.1 Model Specification
3.1.1 Model for the Number of Visits
3.1.2 Model for Gross Margins
3.2 Data
3.3 Procedure
4 Results
4.1 Number of Visits
4.2 Gross Margins
4.3 Customer Profitability
4.3.1 Changes in CP over Time
4.3.2 Model Variants
4.4 Customer Segments
5 Discussion
5.1 General Discussion
5.2 Managerial Implications
5.3 Limitations
5.4 Future Research
5.5 Conclusion
References
Digital Appendices
Appendix A: R-Code Data Preparation
Appendix B: R-Code Model Components
Appendix C: R-Code Customer Profitability
1 INTRODUCTION
Over two decades ago, Foster, Gupta, and Sjoblom (1996) acknowledged the challenge of
tracking customer profitability (CP) over time: "a customer that is unprofitable now and is
expected to remain unprofitable requires a different set of corrective actions than a customer
that is unprofitable now but expected to be profitable in the foreseeable future". Measuring and
predicting customer profitability has become a major topic within marketing. However, to date,
many attempts to estimate individual-level customer profitability have been rather unsuccessful,
with simple models often performing just as well as more sophisticated ones. Perhaps this is
the result of little attention to modelling the possible changes in customer contributions over
time.
1.1 CHANGES IN CUSTOMER PROFITABILITY OVER TIME
Many CP models predict future profitability based on current contributions or the average past
contribution from the customer, assuming that a customer’s margins stay stable over time. This
assumption might be reasonable in some situations. However, there may be large variation
within customer contributions over time. This variation is mostly addressed by incorporating the
probability that the customer is still “alive” (i.e. retention probability).
Nevertheless, estimating individual-level customer lifetime value (CLV) has proven to be
difficult, with sophisticated methods performing about as well as simple methods (Donkers,
Verhoef, & de Jong, 2007). Also, using current or past average margins in CLV calculations may
lead to biases, for example in markets in which the cost to serve a customer takes up a large
proportion of the gross margin, such as B2B contexts with a high degree of personal sales, but
also in markets with complex dynamics, especially if the (B2B) firm’s customers operate in
different industries.
This thesis answers the management question of how we can identify profitable customers,
and especially customers that become more or less profitable over time. The general research
problem is how we can predict future profitability of individual customers, while accounting for
changes in CP over time. We use transaction data from a supplier of non-food products to retail
stores throughout the Netherlands, covering a three-year period. To address this research
problem, we need to answer the following questions:
(1) What are drivers of customer profitability?
(2) How can we predict the future profitability of individual customers?
(3) To what extent are we able to predict changes in individual customer profitability
over time?
(4) How can we identify profitable customer segments based on our predictions?
We answer these questions by identifying CP drivers and components based on past
research, and we develop a predictive model that captures these identified drivers and
components to predict future CP over time.
1.2 DESCRIPTION OF THE ORGANIZATION
This thesis uses data from a supplier of non-food products to retail stores (e.g. supermarkets, drug
stores, hardware stores, etc.) throughout the Netherlands over the period 2006 through 2008.
The products were sold and distributed directly by the sales representatives, who put the goods
in the store in a leased display. Goods that were not sold could be returned: the representative
took them back at the next visit. Each representative was responsible for managing the
customers within his region, from acquisition to retention.
The company offered three main services/product categories:
(1) Regular: non-food products that were placed into the store in a display that was
provided by the supplier (e.g. socks, toys, cleaning accessories);
(2) Loyalty programs: products used by the customer for consumer loyalty programs (i.e.
consumers saved loyalty stamps for discounts on products). One program usually
covered a period of 4 to 8 weeks. The distributor delivered promotional material
(e.g. posters and vouchers) and made sure that there was always enough stock
within the store;
(3) Theme displays: introduced during the World Championship soccer in 2008. The
displays contained products with a specific theme (during WC soccer: orange/Dutch
products). The “orange” displays were sold out within a month. The company
therefore decided to order more theme displays, such as Christmas and Party.
The company only measured its performance based on aggregate-level revenues per sales
representative and per period. The management suspected that some customers were costing
more than they yielded. They therefore wanted to gain insight into the profitability of their
customers. They used simple summary statistics aggregated on the segment level (e.g.
segmented on retail chain), and found that some segments appeared to be highly profitable, but
the company only had a few customers within these segments. After that, the sales force put
in a lot of effort to acquire more customers within these segments. However, by that time, it
was already too late. The company was already in a rough patch and the theme displays were a
final hope. They seemed a success at first, but after a few months, a disastrous number of
theme displays were returned, after which the company had no choice but to file for bankruptcy
in March 2009.
1.3 SCOPE AND CONTRIBUTION
The model can be used diagnostically, for assessing the performance of the company’s customer
base and the aspects that drive customer-based profits, and also normatively, as input for
managerial decision-making on the selection and targeting of profitable customers. This research
contributes to theory by developing and testing a model that predicts CP on the individual
customer level in a context with a high degree of uncertainty and changes in customer profits
over time, which has proven to be extremely difficult.
We limit our scope to a B2B supplier to B2C retailers within multiple industries. The focus
is solely on customer behavior, not on customer perceptions or attitudes. We only investigate
the measurement and prediction of CP, not its actual implementation in management practice.
In the next chapter we first discuss the concept of CP and its theoretical background. Based
on this discussion we develop our model in chapter 3. We present our results in chapter 4.
Finally, we discuss our findings and their implications for both theory and practice.
2 CUSTOMER PROFITABILITY
In this chapter we first discuss customer management, which will answer the question of why it
is relevant to measure CP. We then define customer profitability, and discuss the different
research streams that have developed within CP. Next, we discuss the measurement and general
components of CP, followed by an identification of CP antecedents to get a better
understanding of the measure. We end the chapter with a conceptual model for our research,
in which all the identified components of customer profitability are present.
2.1 MANAGING CUSTOMER PROFITABILITY
Customer management can be defined as the processes through and actions by which the
contribution or value from each customer to the firm’s overall profitability is maximized, by
making use of individual data on customers (Kumar, Ramani, & Bohling, 2004; Verhoef & Lemon,
2013). Customer management involves making decisions on (a) selecting customers for
targeting, (b) allocating resources to these selected customers, and (c) nurturing customers to
increase future profitability (Kumar, Venkatesan, Bohling, & Beckmann, 2008). Customer
profitability can be increased by acquisition, up-selling, cross-selling, reducing customer costs,
and retention (Verhoef & Lemon, 2013). The underlying philosophy is that to derive value
from customers, an organization should first be able to provide value to customers.
Another approach to managing customers is customer asset management. Within this
approach, customers are viewed and managed as economic assets. Kumar (2018) defines an
asset as “any physical, organizational, or human attribute that enables the firm to generate and
implement strategies that improve its efficiency and effectiveness in the marketplace”. Nenonen
and Storbacka (2016) give four actions to manage the customer asset to optimize profits: (1)
increasing revenues from customers: by customer acquisition, retention, development, price
increases, and innovation; (2) decreasing customer-related costs: both reducing costs to serve
and costs to acquire; (3) optimized asset utilization: optimizing capital investments in customer
relationships, and managing business volumes for economies of scale; (4) reducing customer-
related risks: diversifying the customer base, and reducing risk correlations within the customer
base.
In his recent work, Kumar (2018) provides the Customer Valuation Theory, in which he
attempts to integrate the concepts of customer value and customer assets. His theory connects
individual-level customer value to the performance and valuation of the entire firm by “(1)
valuing customers as assets, (2) managing a portfolio of customers, and (3) nurturing profitable
customers”.
Whichever of these approaches or actions a firm takes to manage its customers, it requires
a thorough understanding of its customer profitability and the drivers of this profitability to be
able to determine the optimal courses of action. This allows firms to better identify and target
profitable customers, and optimize resource allocations to profitable customers and activities,
which leads to an increased marketing ROI (Reinartz, Thomas, & Kumar, 2005; Venkatesan &
Kumar, 2004).
2.2 DEFINING CUSTOMER PROFITABILITY
Pfeifer, Haskins and Conroy (2005) define customer profitability as “the difference between the
revenues earned from and costs associated with the customer relationship during a specified
period”. The authors state that if CP is viewed strictly from an accounting perspective, then CP
should focus on past and current contribution of a customer to the firm. Hence, CP is backward-
looking by definition. However, a firm cannot change or control its past, only its future.
Therefore, it needs to be able to anticipate expected customer profitability by making
predictions about the future if it wishes to take appropriate actions. In the literature, expected
future customer profitability is often referred to as customer lifetime value (CLV), which is the
expected future customer profitability during the entire relationship of a customer with the firm,
discounted by the current value of future capital (Holm, Kumar, & Rohde, 2012; Pfeifer et al.,
2005). If a customer is seen as an asset, then customer value can be viewed as the price that
someone would be willing to pay to acquire that asset (Pfeifer et al., 2005).
Many other related terms exist, such as customer equity or customer-based valuation
(Gupta, 2009), and net present value of expected gross contribution (Kumar et al., 2004).
Scholars have been debating when to use and how to define customer profitability versus
customer value for decades, and both terms seem to be used interchangeably (Gupta,
2009; Holm et al., 2012; Kumar, 2018; Mulhern, 1999; Pfeifer et al., 2005). From this
discussion we may conclude that, because CP and customer value (CV) are used interchangeably,
it is important to define and specify any measurement of the concepts, and to be aware of the
differences between them. In this thesis our main goal is to not only measure past CP, but also to make
predictions about the future. We use sources from both customer profitability and customer
value research, as long as it is relevant for the current research.
2.3 MEASURING CUSTOMER PROFITABILITY
A lot of different models for measuring CP exist, and each model seems to capture and/or focus
on different components. Holm, Kumar, and Rohde (2012) argue that model specification and
sophistication should depend on the complexity of the context, which they view as consisting of
customer behavioral complexity and customer service complexity. They define customer
behavioral complexity as “the degree of variation in retention durations (relationship length),
transaction frequency and value of transactions (relationship depth), and cross-buying behavior
(relationship breadth) across the total number of customer relationships a firm serves”.
Customer service complexity is defined as “the degree of variation in service needs and
requirements that invoke differential activities on an organization across customer-facing
functions in terms of the number of activities performed as well as the time spent on each
activity”.
The model should capture all aspects that are relevant for the specific context, as long as
the benefits of measuring each aspect outweigh its costs. We adopt the view of Holm,
Kumar, and Rohde (2012) on measuring CP, since it is highly flexible while still capturing all
components that are commonly used within CP literature, and it seems to bridge a gap between
several CP research streams (CPA and CLV). We will next discuss three aspects of customer
profitability that seem to be important to distinguish: customer relationship, customer revenues
and risk, and customer costs.
2.3.1 CUSTOMER RELATIONSHIP
According to Gupta et al. (2006), marketing actions of the firm lead to customer behavior, which
in turn leads to CP. They distinguish between three customer behaviors that represent the
lifetime stages of a customer:
(1) Customer acquisition: the first purchase of a customer;
(2) Customer margin: the purchase behavior during the customer-firm relationship (i.e.
up- and cross-selling);
(3) Customer retention: repeat-purchases and/or customer defection.
These three behavioral components seem to be the focus of many CP models (Gupta et al.,
2006; Gupta, 2009; Venkatesan & Kumar, 2004), and capture relationship length, depth, and
breadth (Bolton, Lemon, & Verhoef, 2004). Relationship depth and breadth refer to the
revenues that are associated with the customer relationship, including purchase frequency and
up-selling (depth) and cross-buying (breadth). We discuss customer revenues in more detail in
the following subsection. In the remainder of this subsection on customer relationship, we will
discuss the relationship length in more detail.
Relationship length relates to the customer retention component of CP. Let us first discuss
the possible natures of a customer-firm relationship. Fader and Hardie (2009) distinguish
between contractual and non-contractual relationships. Within a contractual relationship, it is
relatively easy to observe the relationship termination. The customer needs to let the firm know
that he is terminating the relationship, or he simply does not extend his contract. Within a
non-contractual relationship, customer defection is usually unobserved, making it harder to
determine whether a customer is still “alive” at a certain point. CP models usually model
customer retention by means of the probability that a customer will still be active at a certain
point in time (Gupta et al., 2006). Another distinction that Fader and Hardie (2009) make is
between the transaction opportunities. These can either occur continuously (i.e. at any given
time) or discretely (i.e. only at certain points in time). They presented a quadrant for both the
relationship type and transaction opportunities dimensions, and each setting asks for a different
modelling approach.
Another distinction that is made is the “lost-for-good” versus the “always-a-share”
relationship (Jackson, 1985). In a lost-for-good setting, a customer typically buys its product or
service from one company. Switching costs are generally believed to be high. The customer is
“alive” at some point, until he “dies” (i.e. he terminates the relationship completely). In an always-
a-share setting, the customer may spread his purchases between multiple sellers. The customer
never truly “dies”, since there is a probability that he will come back at each purchase
opportunity. Usually, the lost-for-good approach is used within contractual relationships, while
the always-a-share approach seems to be more appropriate for non-contractual relationships (Rust,
Lemon, & Zeithaml, 2004).
Finally, one should specify the period over which CP is measured or estimated. If it is
modelled for the entire (expected) relationship with the customer (i.e. its entire lifetime), the
time horizon should generally be set to infinity. It is, however, difficult to predict over a long
period of time, since markets are usually dynamic in nature. Besides, most companies set their
strategies for the next three to five years, which makes it reasonable to set a limited time horizon
for CP predictions (Kumar et al., 2008).
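To make the roles of the time horizon, the retention probability, and discounting concrete, here is a minimal Python sketch of a finite-horizon CLV calculation. It assumes a constant per-period retention probability and a constant discount rate, and it treats the length of the profit list as the time horizon. The function name, the convention that the customer survives period t with probability `retention_prob ** t`, and all figures are illustrative assumptions rather than the specification used in this thesis.

```python
def discounted_clv(expected_profits, retention_prob, discount_rate):
    """Discounted sum of expected per-period profits over a finite horizon.

    expected_profits: expected profit for periods 1..T (T = len of the list)
    retention_prob:   assumed constant probability of staying active per period
    discount_rate:    assumed constant per-period discount rate
    """
    if not 0.0 <= retention_prob <= 1.0:
        raise ValueError("retention_prob must lie between 0 and 1")
    clv = 0.0
    for t, profit in enumerate(expected_profits, start=1):
        survival = retention_prob ** t            # chance the customer is still active
        discount = (1.0 + discount_rate) ** t     # time value of money
        clv += survival * profit / discount
    return clv


# Made-up example: the same yearly profit over a three- and a five-year horizon.
three_year = discounted_clv([500.0] * 3, retention_prob=0.8, discount_rate=0.10)
five_year = discounted_clv([500.0] * 5, retention_prob=0.8, discount_rate=0.10)
print(f"3-year CLV: {three_year:.2f}, 5-year CLV: {five_year:.2f}")
```

Because both the survival and the discount term shrink with t, later periods add progressively less, which is one practical argument for a limited horizon.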
2.3.2 CUSTOMER REVENUES AND RISK
Instead of customer revenues, we can also refer to customer gross margins, which are revenues
minus the costs of goods sold (COGS) (Pfeifer et al., 2005). Using revenues and COGS
separately or using gross margins should depend on the variability in product margins. Many
models use average contribution margin of a customer to project future CP (Gupta, Lehmann,
& Stuart, 2004; Reinartz & Kumar, 2003). This might be reasonable if the company offers
relatively few service propositions and margins are relatively stable over time, but might be
biased if large variations in cash flows between customers exist. That is why several authors
have argued for more attention to risk in CP models (Bolton et al., 2004; Holm et al., 2012;
Kumar, 2018; Nenonen & Storbacka, 2016).
Kumar (2018) refers to risk in future CP as “the volatility and vulnerability in cash flows”.
Risk is usually captured through discount rate and retention probability (Gupta, 2009), which
could be seen as “vulnerabilities” in cash flows. Many models, however, do not include
“volatility” in cash flows. Customer risk may result in a high reliance on a few
customer relationships, or in unsteady cash-flows, both of which can pose a threat to the
company’s health and should therefore be appropriately identified and managed (Nenonen &
Storbacka, 2016). An example of measuring risk due to volatility in cash flows is the
“risk-adjusted lifetime value” of Dhar and Glazer (2003), which captures how far a customer’s
returns deviate from the mean expected returns and can be complemented with macro-economic
factors to understand this deviation. They call this risk aspect the “customer beta”, which is
the covariance between the customer’s cash flows and those of the overall customer portfolio,
divided by the variance of the portfolio’s cash flows.
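A beta-style risk measure of this kind can be sketched in a few lines of Python. The sketch assumes the finance-style definition in which a customer's cash flows are compared with a benchmark series, here taken to be the cash flows of the firm's overall customer portfolio; that choice of benchmark, the function name, and the numbers are assumptions made for illustration.

```python
from statistics import mean


def customer_beta(customer_flows, portfolio_flows):
    """Covariance of customer and portfolio cash flows over portfolio variance."""
    if len(customer_flows) != len(portfolio_flows) or len(customer_flows) < 2:
        raise ValueError("need two equally long series with at least two periods")
    mean_c = mean(customer_flows)
    mean_p = mean(portfolio_flows)
    covariance = sum(
        (c - mean_c) * (p - mean_p) for c, p in zip(customer_flows, portfolio_flows)
    )
    variance = sum((p - mean_p) ** 2 for p in portfolio_flows)
    if variance == 0:
        raise ValueError("portfolio cash flows do not vary")
    # The 1 / (n - 1) factors of covariance and variance cancel in the ratio.
    return covariance / variance


# Made-up quarterly cash flows for one customer and for the whole portfolio.
portfolio = [1000.0, 1200.0, 900.0, 1100.0]
customer = [40.0, 60.0, 25.0, 55.0]
print(f"Customer beta: {customer_beta(customer, portfolio):.3f}")
```

Under this definition a beta above 1 marks a customer whose cash flows swing more strongly than the portfolio as a whole, and a beta below 1 a customer whose cash flows are comparatively steady.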
Based on the previous discussion we hypothesize that there is a significant improvement in
model performance when changes in customer revenues are taken into consideration, compared
to a model that is based on average past contribution:
H1. A model that incorporates changes in customer revenues over time predicts future
customer profitability significantly better than a model that predicts future customer
profitability based on the average past contribution of a customer.
2.3.3 CUSTOMER COSTS
Several scholars stress the importance of explicitly specifying costs in CP calculations, since
most calculations of CP seem to focus on demand resulting from customer behavior, while the
costs related to serving customers are an important part of the customer margin (Blattberg,
Malthouse, & Neslin, 2009; Gupta, 2009). Pfeifer et al. (2005) describe three accounting
methods to allocate costs to customers: (1) divide the costs by the number of customers,
assuming that all customers use the same amount of resources, (2) assign costs to customers
relative to their size (e.g. revenues), and (3) based on their use of resources. The latter is referred
to as Activity-Based Costing (ABC), which is a common theme within CP analyses. ABC was
developed by Cooper and Kaplan (1988) with the underlying philosophy that costs should be
attributed to the activities proportional to their use of resources, i.e. splitting costs and tracing
them to individual products, instead of simply dividing costs by the number of units. ABC can
also be used to trace back costs to individual customers (Niraj, Gupta, & Narasimhan, 2001),
which works the same way, but with a different unit of interest. Costs are first grouped into
“pools”, each pool is linked to a “driver”, and the costs are then attributed to customers (Foster
et al., 1996). Take, for example, distribution costs as a cost pool. These costs may depend on the
number of product units sold, which is the cost driver. The total distribution costs are then
divided by the number of product units sold to obtain a cost per unit, which is attributed to
customers in proportion to the units they bought.
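The distribution-cost example can be sketched as follows. This is a minimal illustration with hypothetical customers and amounts; the function name `allocate_pool_by_driver` is an assumption made for this sketch.

```python
def allocate_pool_by_driver(pool_cost, driver_units):
    """Attribute one cost pool to customers in proportion to their driver units."""
    total_units = sum(driver_units.values())
    if total_units <= 0:
        raise ValueError("the cost driver must have a positive total")
    rate_per_unit = pool_cost / total_units
    return {customer: rate_per_unit * units
            for customer, units in driver_units.items()}

# Hypothetical figures: a distribution cost pool of 5,000 and units bought per customer.
distribution_costs = 5000.0
units_bought = {"A": 600, "B": 300, "C": 100}

allocated = allocate_pool_by_driver(distribution_costs, units_bought)
```

Here the cost per unit is 5.0, so customer A, who bought 600 of the 1,000 units, carries 3,000 of the pool.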
An important decision to be made is which costs to include in CP. If the total costs of the
company are traced back and attributed to individual customers, then customer profitability
reflects the overall profitability of the firm. If only the costs that are specific and variable to
serving customers are allocated to individual customers, then customer profitability could be
used for comparing customers within the company’s customer-base (Pfeifer et al., 2005).
Blattberg et al. (2009) refer to including all company costs as a “full-costing” approach, and only
including the variable costs of serving customers as a “marginal-costing” approach. Again, the
choice depends on the context and goals at hand.
We hypothesize that there is a considerable difference in model performance when the costs
to serve customers are attributed to individual customers, compared to a model that does not
include a cost component:
H2. A model that attributes customer costs to individual customers predicts future
customer profitability significantly better than a model without a separate cost
component.
2.4 UNDERSTANDING CUSTOMER PROFITABILITY
It is essential that a firm is not only able to measure CP, but also understands what drives CP
in order to be able to control it. Based on past research, we identified customer behavior and
characteristics, firm actions, and control variables that are found to influence CP or its
components (i.e. costs and revenues) in B2B settings. We chose to only investigate observed
behavior and characteristics, and thus we do not investigate perceptions or attitudes that were
found to influence CP. We subsequently discuss each identified driver and how it influences CP.
2.4.1 CUSTOMER BEHAVIOR AND CHARACTERISTICS
Customer behaviors that are found to be highly predictive of future behavior (and with that,
CP) are past purchase behavior, cross-buying behavior, and product returns behavior. In terms
of customer characteristics, we identified customer size and location as important predictors of
future CP. We subsequently discuss each CP driver that is related to customer behavior and
customer characteristics.
Past purchase behavior
Past purchase behavior is seen as one of the best predictors of future purchase behavior,
and with that, future customer profitability (Blattberg et al., 2009). The most commonly used
metrics to measure (past) purchase behavior are RFM metrics: Recency – the time since the
last purchase, Frequency – the number of purchases, and Monetary Value of these purchases.
Several other metrics can be derived from RFM measures, such as interpurchase time (i.e.
average time between transactions) and the average spend per transaction (i.e. M/F). Although
RFM metrics are amongst the most studied antecedents of customer profitability, findings on
the direction of the effect between RFM and profitability remain inconclusive. Most studies
show a positive link: customers who purchased more (often/recently) in the past are also more
likely to purchase more in the future, which is positively related to future profitability (Reinartz
et al., 2005).
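The RFM metrics and the two derived measures mentioned above can be computed from a customer's transaction list along these lines. This is a sketch with hypothetical transactions; the function name and the choice of days as the time unit are assumptions.

```python
from datetime import date

def rfm_metrics(purchases, today):
    """purchases: list of (purchase_date, amount) tuples for one customer."""
    if not purchases:
        raise ValueError("at least one purchase is required")
    dates = sorted(d for d, _ in purchases)
    frequency = len(purchases)
    monetary = sum(amount for _, amount in purchases)
    # Gaps between consecutive purchases, used for the interpurchase time.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return {
        "recency_days": (today - dates[-1]).days,
        "frequency": frequency,
        "monetary": monetary,
        "avg_spend": monetary / frequency,  # M / F
        "interpurchase_days": sum(gaps) / len(gaps) if gaps else None,
    }

# Hypothetical customer with three purchases.
example = rfm_metrics(
    [(date(2007, 1, 10), 100.0), (date(2007, 2, 9), 50.0), (date(2007, 4, 10), 150.0)],
    today=date(2007, 4, 30),
)
```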
Niraj et al. (2001) found that frequency was actually negatively related to profits, because
it adds complexity to the purchases. They found that frequency does not translate into
significantly higher average gross margins, but it does significantly increase costs. Compared to
many other CP studies, their model can be considered one of the most detailed in terms of
attributing costs to individual customers. They assigned not only marketing costs to customers,
but also the costs of each individual product. For example, costs are first attributed to separate
items (e.g. warehousing, distribution, negotiation with suppliers), and then to customers based
on their unit purchases of each item. Thus, especially if the costs per order are high relative to
the gross margins, and this is appropriately captured within the profitability model, we would
expect similar results to Niraj et al. (2001).
Cross-buying
Another surprising finding of Niraj et al. (2001) is that cross-buying did not have any
significant effect on customer profitability, while generally it has been found to be positively
related to customer profitability (Reinartz et al., 2005; Reinartz & Kumar, 2003; Rust, Kumar, &
Venkatesan, 2011). Kumar, George, and Pancras (2008) found that cross-buying is related to
the first product (category) purchased, and that it shows an inverted U-shaped relationship with
interpurchase time: customers with an average interpurchase time are most likely to cross-buy.
They also found that higher focused buying (i.e. buying more within a category) is positively
related to cross-buying. This is an interesting finding, because Reinartz and Kumar (2003) found
a negative relationship between focused buying and customer profitability. Generally, both
behaviors are believed to be positively related to profitability, and cross-buying has a larger
effect on profitability than focused buying (Kumar et al., 2004).
Perhaps the reason why Niraj et al. (2001) did not find a significant effect is again because
of their cost attribution method. Cross-buying could be associated with higher spending levels,
but also with higher order complexity and thus higher costs, which may lead to diminishing
returns. Also, Shah, Kumar, Qu, and Chen (2012) found that approximately 10 to 35% of cross-
buying customers are in unprofitable relationships, and that the unprofitability increases with
the degree of cross-buying. They found that this is due to other unprofitable behaviors that
these customers show, such as excessive service requests (which is in line with our reasoning
that cross-buying may add to the costs) and promotion purchase behavior (i.e. lower gross
margins).
The contradictory findings on cross-buying behavior suggest that, on an aggregated level,
cross-buying is related to higher profits. However, large differences between
customers can exist, interactions with other behaviors are likely to be present, and
cost-attribution methods might influence the result.
Product returns
Research on the effect of product returns on CP shows contradictory findings (Petersen &
Kumar, 2015; Reinartz & Kumar, 2003). On the one hand, product returns result in higher costs and
lower revenues, which has a negative consequence for profitability (Reinartz & Kumar, 2003).
On the other hand, returns could decrease risk/price perceptions and therefore enhance future
spending, which in turn may positively influence CP (Petersen & Kumar, 2015). Kumar et al.
(2004) suggest that there is an optimal level of product returns, and thus that product returns
show a U-shaped relationship with CP.
Customer size
It would be logical to assume that larger customers (e.g. based on their own store sales)
also spend more, and thus are more profitable. However, Bowman and Narayandas (2004)
suggest that large customers are not necessarily more profitable, because they are usually also
more demanding in terms of both quality and price. The influence of customer size on
profitability has also been reported by Van Raaij et al. (2003), who found an inverted U-shaped
relationship between customer size and profitability: the top 1% of customers in terms of
customer size show a lower profitability than large and medium-sized customers, and small-sized
customers are reported to be most unprofitable. Reinartz and Kumar (2003) and Rust et al.
(2011) found a positive relationship.
Population density
Reinartz and Kumar (2003) found that population density had a negative effect on
customer profitability within B2C contexts. There was no effect present within B2B settings.
However, within a B2B context that deals with B2C retailers, and thus is indirectly related to a
B2C context, we could argue that the effect might be present.
Thus, to conclude, we hypothesize that past customer behavior and customer characteristics
are important drivers for CP, with implications and interactions for both revenues and costs:
H3. Past customer behavior and customer characteristics are significant drivers of CP and
of both its cost and revenue components.
2.4.2 FIRM ACTIONS
Firm initiated contacts are found to positively influence the length of the customer-firm
relationship and individual profitability (Kumar et al., 2008; Reinartz et al., 2005; Reinartz &
Kumar, 2003). However, Blattberg et al. (2009) suggest that there is an optimal number of
marketing contacts. Above a certain point, there are diminishing returns, which is referred to as
wearout. Thus, marketing contacts are believed to show an inverted U-shaped relationship with
profitability. Rust et al. (2011) find evidence that marketing contacts do not only drive customer
behavior, but that the number of contacts in turn is also driven by past customer behavior.
Niraj et al. (2001) found that offering “extra items” (i.e. customized products or services) is
negatively related to customer profitability, since it adds to the service costs, but does not
necessarily result in higher revenues. They argue that this may be a result of an orientation of
sales representatives towards short-term revenues, instead of towards long-term profitability.
This leads us to the possible influence of sales representatives on CP. Bowman and
Narayandas (2004) found that the hours spent at an account by a sales representative are
positively related to CP, especially if the relationship between the rep and the customer has a
long tenure. Sales reps’ perceptions of customer profitability can be biased by their
self-efficacy and customer orientation (Mullins, Ahearne, Lam, Hall, & Boichuk, 2014), implying
that we should control for the effect of salespersons on customer profitability when reps make
their own decisions about visiting customers.
To conclude, we hypothesize that firm actions drive both revenues and costs, and that there is a
“point of diminishing returns” after which CP declines:
H4. Firm actions drive both revenues and costs, and they show a diminishing return on CP,
implying that there is an ideal point of firm actions.
2.4.3 MARKET VARIABLES
The market of retailers is highly dynamic, with a large number of mergers & acquisitions,
changing customer demands, increased competition (both off- and online) and a growing number of
strategic alliances (Grewal, Roggeveen, & Nordfält, 2017; Kumar, Anand, & Song, 2017). This
dynamism implies that market conditions could potentially affect customer profitability, and
thus, although they cannot be controlled by the firm, they need to be accounted for.
2.5 CONCEPTUAL MODEL
Our main goal is to measure and predict individual customer profitability, which is the revenues
derived from a customer minus the costs to serve that customer. In this thesis we test the
following hypotheses:
H1. A model that incorporates changes in customer revenues over time predicts future
customer profitability significantly better than a model that predicts future customer
profitability based on the average past contribution of a customer.
H2. A model that attributes customer costs to individual customers predicts future
customer profitability significantly better than a model without a separate cost
component.
H3. Past customer behavior and customer characteristics are significant drivers of CP and
of both its cost and revenue components.
H4. Firm actions drive both revenues and costs, and they show a diminishing return on CP,
implying that there is an ideal point of firm actions.
We test these assumptions by including the identified antecedents in our CP model, and
determine their individual effects on the components of CP (i.e. revenues/gross margins and
customer costs). Also, we compare our model to simpler model variants without changes in CP
over time and without a cost component.
To summarize, we present an overview of all identified antecedents of CP and their
relationship with costs, revenues, and profitability in table 2.1. We present the most important
drivers of CP and relationships between concepts in our conceptual model (figure 2.1). We
expect that firm actions are driven by past firm actions, customer characteristics, and past
customer behavior. Customer behavior is driven by both current and past firm actions, customer
characteristics, and past customer behavior. Customer profitability is driven by both firm actions
and customer behavior, and this relationship is influenced by market dynamics.
Figure 2.1: conceptual model (elements: Customer profitability; Market dynamics; Firm actions; Customer behavior; Past firm actions; Customer characteristics; Past customer behavior)
Antecedents (legend: ∩ = inverted U-shaped relationship; U = U-shaped relationship; C = control variable; / = no significant effects)
Columns: Niraj et al. (2001) | Van Raaij et al. (2003) | Reinartz & Kumar (2003) | Bowman & Narayandas (2004) | Kumar et al. (2004) | Reinartz et al. (2005) | Kumar et al. (2008) | Rust et al. (2011) | Mullins et al. (2014) | Petersen & Kumar (2015) | Grewal et al. (2017) | Kumar et al. (2017) | Revenues | Costs | Profitability
B2B setting* X X X X X X X X X R R
Customer behavior
Frequency - + + + + + ∩
Interpurchase time ∩ ∩ - ∩ ∩ ∩
Spending level + + + + + +
Cross-buying / + + + + + + + ∩
Product returns ? ∩ + - + U
Firm actions
Marketing contacts + + + + + + +
Extra services - + - + + ∩
Customer characteristics
Customer size ∩ + + + + ∩
Location / ?
Control variables
Sales representative C C C C C
Market dynamics C C C C C C
Table 2.1: antecedents of CP
* Only the study of Petersen and Kumar (2015) investigated CP within a B2C context. Grewal et al.
(2017) and Kumar et al. (2017) did not study CP, but discussed the future of retailing. All other
cited articles studied CP within a B2B context.
3 MODEL
In this chapter we develop our model to measure and predict CP. We first define our
specification of CP, followed by the specification of its two components: costs and gross margins.
We then discuss the data available for the research, and the procedure that we follow to answer
our research questions and test our hypotheses.
3.1 MODEL SPECIFICATION
We measure the profitability of a customer i (CPi) as:
\[ CP_i = \sum_{t=1}^{T} \left( GM_{it} - MC_{it} \right) \]
where $GM_{it}$ = gross margins in period t
$MC_{it}$ = marketing costs in period t
$T$ = time horizon of our measurement
We chose to only include marketing costs in our model and not general overhead costs,
because our main goal is to compare CP between customers, instead of determining the CP of
the entire customer-base. Thus, to estimate future CP we must predict two components: the
number of visits and the gross margins for each period:
\[ \widehat{CP}_i = \sum_{t=1}^{T} \left( \widehat{GM}_{it} - \hat{V}_{it} C \right) \]
where $\widehat{GM}_{it}$ = predicted gross margins (in euros) in period t
$\hat{V}_{it}$ = predicted number of visits in period t
$C$ = costs per visit
Due to data limitations we have to make several assumptions related to the costs to serve
customers. For example, the sales force did not keep record of their visits to and hours spent
on each individual customer. We therefore assume that the number of visits (i.e.
marketing contacts) equals the number of orders, and that each visit takes the
same amount of time. We expect that this assumption is reasonable within the given context,
since sales representatives are responsible for maintaining the relationships with their own
customer-base, and they delivered products straight from their car on each visit (section 1.2).
Thus, the costs to serve a customer depend on the number of orders derived from that
customer.
Also, the costs of the sales force are only known on an aggregated level. We therefore divide
the total costs of the sales force over our time horizon by the total number of orders
within that time horizon, to arrive at the average costs per visit. Individual customer costs are
thus the number of orders (i.e. the number of visits) placed by that customer,
multiplied by the average costs per order over our entire time horizon. We have estimated the
average costs per visit at € 70.39 based on the average order costs in 2008 (total costs of the
sales force divided by the total number of orders: 544,396.26 / 7,734).
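The cost-per-visit calculation and the CP definition can be reproduced in a few lines. The two totals are the figures reported above; the per-period gross margins and visits in the example are hypothetical, as is the function name.

```python
# Totals reported in the text.
TOTAL_SALES_FORCE_COSTS = 544_396.26
TOTAL_ORDERS = 7_734

cost_per_visit = TOTAL_SALES_FORCE_COSTS / TOTAL_ORDERS  # roughly 70.39 euros

def customer_profitability(gross_margins, visits, cost_per_visit):
    """CP_i = sum over periods of (GM_it - V_it * C)."""
    if len(gross_margins) != len(visits):
        raise ValueError("gross margins and visits must cover the same periods")
    return sum(gm - v * cost_per_visit for gm, v in zip(gross_margins, visits))

# Hypothetical customer observed over three periods.
example_cp = customer_profitability(
    gross_margins=[400.0, 0.0, 250.0],
    visits=[2, 0, 1],
    cost_per_visit=70.39,
)
```

For this hypothetical customer, three visits cost 211.17 in total, so gross margins of 650.00 translate into a CP of 438.83.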
3.1.1 MODEL FOR THE NUMBER OF VISITS
We are interested in predicting the number of visits for our costs predictions. The number of
visits takes on discrete values from 0 to 26, with mean 1.83 and variance 5.82. Since the variance
is much larger than the mean (i.e. overdispersion), we assume a Negative Binomial distribution
(NBD) for our count data. Also, a relatively large part of our observations have zero values
(32.3%). We expect that a regular count model would not handle these zero-observations very
well. We therefore estimate a zero-inflated and a zero-hurdle model, and choose the model that
offers the best fit.
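To make the hurdle idea concrete, the sketch below writes out the probability function of a hurdle Negative Binomial model: zeros come only from a binary part, and positive counts follow a zero-truncated Negative Binomial. The 32.3% zero share and the mean and variance are taken from the text; `MU` and `THETA` are hypothetical values chosen for illustration and are not estimates from this thesis.

```python
from math import exp, lgamma, log

def nb_pmf(k, mu, theta):
    """Negative Binomial probability of count k, with mean mu and size theta."""
    log_p = (lgamma(k + theta) - lgamma(theta) - lgamma(k + 1)
             + theta * log(theta / (theta + mu))
             + k * log(mu / (theta + mu)))
    return exp(log_p)

def hurdle_nb_pmf(k, p_zero, mu, theta):
    """Hurdle model: P(0) = p_zero; positives follow a zero-truncated NB."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * nb_pmf(k, mu, theta) / (1.0 - nb_pmf(0, mu, theta))

# Variance-to-mean ratio reported in the text; values above 1 indicate overdispersion.
dispersion_ratio = 5.82 / 1.83

# P_ZERO is the observed zero share; MU and THETA are hypothetical.
P_ZERO, MU, THETA = 0.323, 2.0, 0.8
total_probability = sum(hurdle_nb_pmf(k, P_ZERO, MU, THETA) for k in range(400))
```

A zero-inflated model differs in that zeros can come from both the binary part and the count part, which is why the two variants are compared on fit.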
We hypothesized that the number of visits is a function of past purchase behavior, past
marketing contacts, market variables, and customer characteristics. Therefore, our initial
estimation of our visits component will have the following functional form:
\[
\begin{aligned}
V_{it} = {} & \alpha + \beta_1 \mathit{Rec}_{it} + \beta_2 V_{i,t-1} + \beta_3 \mathit{V.sum}_{i,t-1} + \beta_4 \mathit{V.avg}_{i,t-1} + \beta_5 \mathit{GM}_{i,t-1} + \beta_6 \mathit{GM.sum}_{i,t-1} \\
& + \beta_7 \mathit{GM.avg}_{i,t-1} + \beta_8 \mathit{PR}_{i,t-1} + \beta_9 \mathit{PR.dum}_{t} + \beta_{10} \mathit{GDP}_{t} + \beta_{11} \mathit{Cat}_{i} + \beta_{12} \mathit{Pop}_{i} + u_{it}
\end{aligned}
\]
where α = intercept
Rec = periods t since last purchase
V = number of visits
GM = gross margins
sum = cumulative sum from t=1 till t-1
avg = cumulative average from t=1 till t-1
PR = number of premium orders
PR.dum = dummy indicating that no order details were recorded
GDP = GDP of consumers
Cat = number of categories purchased over the entire time horizon
Pop = population density
u = error term
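The lagged and cumulative predictors (the direct lag, the cumulative sum and the cumulative average up to t-1) can be derived from a customer's per-period series as sketched below. The function name and the example series are hypothetical; note that the first period yields no row, because no history exists for it yet.

```python
def lagged_features(series):
    """Return lag, cumulative sum and cumulative average up to t-1 for periods 2..T."""
    rows = []
    running_sum = 0.0
    for index, value in enumerate(series):
        if index > 0:
            rows.append({
                "t": index + 1,                # period number, starting at 1
                "lag": series[index - 1],      # value in t-1
                "sum": running_sum,            # cumulative sum from t=1 till t-1
                "avg": running_sum / index,    # cumulative average from t=1 till t-1
            })
        running_sum += value
    return rows

# Hypothetical number of visits for one customer over four periods.
visit_features = lagged_features([2, 0, 3, 1])
```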
3.1.2 MODEL FOR GROSS MARGINS
Our response variable gross margins (i.e. revenues derived from a customer minus the costs of
goods sold) can take positive and zero values, and its distribution is somewhat skewed to the
right (mean = 359.58, sd = 878.88). Also, there are considerable outliers present within our
data. All these characteristics are possible issues that can bias our estimations. To somewhat
account for these issues, and to accommodate potential interactions between our variables, we
estimate our gross margins model as a multiplicative (log-log) model. To account for the zero-
observations, we fit a zero-hurdle model to our data, in which we allow the variables and
parameters for the zero-hurdle part to differ from the positive gross margins model. Our initial
gross margins model takes the following specification:
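\[
\begin{aligned}
\mathit{GM}^{*}_{it} = {} & \alpha + \beta_1 V^{*}_{it} + \beta_2 \mathit{Rec}^{*}_{it} + \beta_3 V^{*}_{i,t-1} + \beta_4 \mathit{V.sum}^{*}_{i,t-1} + \beta_5 \mathit{V.avg}^{*}_{i,t-1} + \beta_6 \mathit{GM}^{*}_{i,t-1} \\
& + \beta_7 \mathit{GM.sum}^{*}_{i,t-1} + \beta_8 \mathit{GM.avg}^{*}_{i,t-1} + \beta_9 \mathit{PR}^{*}_{i,t-1} + \beta_{10} \mathit{PR.dum}^{*}_{i} + \beta_{11} \mathit{GDP}^{*}_{i,t-1} \\
& + \beta_{12} \mathit{Cat}^{*}_{i,t-1} + \beta_{13} \mathit{Pop}^{*}_{i,t-1} + \beta_{14} \mathit{Returns}^{*}_{it} + \varepsilon_{it}
\end{aligned}
\]
where α = intercept
Rec = periods t since last purchase
V = number of visits
GM = gross margins
sum = cumulative sum from t=1 till t-1
avg = cumulative average from t=1 till t-1
PR = number of premium orders
PR.dum = dummy indicating that no order details were recorded
GDP = GDP of consumers
Cat = number of categories purchased over the entire time horizon
Pop = population density
Returns = gross margins of product returns
ε = error term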
3.2 DATA
In total we have 3 years of observations (2006 - 2008), which we divided into 12 quarters. We
only model customer profitability for customers that have purchased within the first period of
the data (Q1 2006) to avoid potential problems with left-censoring. Because of variable
transformations (e.g. lagged variables of our response variables) we lose the first period of our
data, which leaves us with a total of 11 time periods. We then excluded all customers that did
not make any purchase in the remaining 11 time periods. In total we have observations of 349
customers over each period, which results in 3839 observations. We use the first 8 quarters of
our data for estimating our model (Q2 2006 to Q1 2008), and the last 3 quarters for assessing
its predictive validity (Q2 2008 to Q4 2008).
In the first five periods of our observations (Q1 2006 – Q1 2007), the company did not
keep track of order details, only of order totals. As a result, we do not have data on which
products were ordered (and also not the number of premium programs or product categories),
nor on the products that were returned. We therefore added a dummy variable that indicates
missing data for premium programs (i.e. the variable Premium Dummy), and we set the
cross-buying variable as a fixed customer-specific variable that does not change over time (i.e.
the variable Categories, which is the number of product categories bought over the entire time
horizon).
Because of the possibility for customers to return goods that were not sold, gross margins
can take negative values. Since negative values cannot be log-transformed, we had two options:
(a) add a constant to our data to make sure that our values are
positive, or (b) exclude returns entirely from the gross margins response variable. We chose
option (b), since adding a constant would still cause problems in reliably estimating
zero-observations. We resolved this by subtracting returns from gross margins, using the following
steps:
(1) For each period in which the returns were registered (from Q2 2007), and where
the return ratio was less than 1 (i.e. the customer bought more than he returned)
and higher than 0, we multiplied gross margins by the return ratio (e.g. if gross
margins was 100, and the return ratio 0.5, then gross margins was set at 50);
(2) For the resulting negative gross margins, we subtracted these from the gross
margins in the previous period, and repeated this step until every gross margins
value was zero, or negative for our first period of observations. We chose to subtract
them from previous periods, because gross margins can only be negative if a
customer returned products that he bought in an earlier period;
(3) We then set negative values for the first period of observations at zero, since these
were returns of products that were bought before our observation periods.
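Steps (2) and (3) can be read as a carry-back procedure, sketched below under that interpretation: negative gross margins are netted against the previous period, working backwards, and whatever remains negative in the first period is set to zero. The function name and example series are hypothetical, and step (1), the return-ratio adjustment, is not covered here.

```python
def carry_back_negative_margins(gross_margins):
    """Net negative gross margins against earlier periods, then floor period 1 at zero."""
    adjusted = [float(value) for value in gross_margins]
    # Walk from the last period back to the second one; a single backward pass
    # is equivalent to repeating the step until no negatives remain after period 1.
    for t in range(len(adjusted) - 1, 0, -1):
        if adjusted[t] < 0:
            adjusted[t - 1] += adjusted[t]
            adjusted[t] = 0.0
    # Returns of products bought before the observation window are dropped.
    if adjusted and adjusted[0] < 0:
        adjusted[0] = 0.0
    return adjusted

# Hypothetical series of period gross margins for two customers.
case_one = carry_back_negative_margins([100.0, 50.0, -30.0])
case_two = carry_back_negative_margins([10.0, -30.0, 5.0])
```

In the second case the return of 30 exceeds the 10 earned in the first period, so both periods end at zero and the excess is discarded.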
Thus, product returns are not included within our gross margins response variable. We
therefore included product returns as a predictor within our gross margins component to still
account for the effect of product returns. Note that, just as with our premium program orders,
product returns were not registered before Q2 2007.
We had 7% missing values for postal code, which we assume to be random and due to
administrative errors. This led to missing values for our variable population density. Also, the
data obtained from Statistics Netherlands could not completely be matched to every postal
code, possibly because of changes within the municipalities within the last decade. This led to
a further 17% missing values for population density. Thus, in total we had 24% missing values for
population density. We imputed these values based on predictive mean matching, with 5
imputed datasets and 50 iterations.
The R-code for our data preparation and manipulation is included in Appendix A.
3.3 PROCEDURE
First, we start out with a full model as specified in section 3.1. Next, we assess face validity and
resolve any possible issues relating to multicollinearity by assessing Variance Inflation Factors
(VIF) of predictor variables. We optimize our model components by comparing several nested
versions of the models based on McFadden or adjusted R², χ² or F-values, AIC, and suitable
measures of predictive accuracy. For both model components we assess whether a zero-inflated
or a zero-hurdle Binary Logit component significantly improves our model by comparing models
based on AIC scores.
Once we have fitted each model component, we investigate heterogeneity by comparing the
model with a model that includes effects for individual customers, customer groups, and/or
sales representatives. For our gross margins component, we test whether considerable
heterogeneity between individual customers is present by performing an F-test between the
pooled version and a fixed-effects model. We then estimate a random-effects model, and
perform a Hausman test to verify that the heterogeneity between customers is exogenous to (i.e.
uncorrelated with) our predictors, which is an important assumption of a random-effects model.
For our gross margins component we assess whether our residuals are normally distributed
by both a Shapiro-Wilk and a Kolmogorov-Smirnov normality test. We test for autocorrelation
using the Durbin-Watson test, and assess whether heteroskedasticity is present within our data
before and after the company started to register order details, by means of a Breusch-Pagan
test. Selection bias may be present within our model. We therefore re-estimate the model by
using the Heckman procedure. A significant Inverse Mills Ratio indicates that selection bias is
present, and that we have to apply a Heckman correction to our gross margins expectation.
We test the predictive accuracy of our models by testing our models on both our estimation
and holdout sample. We use the Mean Absolute Error (MAE), Root Mean Squared Error
(RMSE), and Relative Absolute Error (RAE) for assessing predictive validity.
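The three accuracy measures can be written out as follows. This is a sketch with hypothetical numbers; the RAE benchmark is passed in explicitly, here filled with a previous-period forecast as an assumed naive model.

```python
from math import sqrt

def mae(observed, predicted):
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root Mean Squared Error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def rae(observed, predicted, naive):
    """Relative Absolute Error: model absolute error relative to a naive benchmark."""
    model_error = sum(abs(o - p) for o, p in zip(observed, predicted))
    naive_error = sum(abs(o - n) for o, n in zip(observed, naive))
    return model_error / naive_error

# Hypothetical visits for four customer-periods.
observed = [2, 0, 3, 1]
predicted = [1.5, 0.5, 2.0, 1.0]
naive = [1, 2, 0, 3]  # assumed benchmark: the value observed in the previous period

example_mae = mae(observed, predicted)
example_rmse = rmse(observed, predicted)
example_rae = rae(observed, predicted, naive)
```

An RAE below 1 means that the model's absolute errors are smaller than those of the benchmark.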
Once we have validated each model component and have made predictions for our holdout
sample, we calculate both the observed and the predicted customer profitability. We divide
customers into profitability segments for both the validation and the holdout sample, and we
check whether we observe and predict changes in individual customer profitability based on
shifts from and to profitability segments. We then compare our model to a model that takes
the average past contribution and projects it on the future and to a model without cost
component. Finally, we investigate differences in CP between customer segments by performing
a cluster analysis by using the Ward method based on Euclidean distance.
The R-code for our visits and gross margins components and our customer profitability
analyses is included in Appendices B and C.
4 RESULTS
In this chapter we present the results of our analyses. We first estimate and predict our visits
and gross margins components, and present both models’ results. Then, we combine the results
of both models to arrive at our customer profitability predictions. We then investigate whether
our model is able to predict the changes in customer profitability over time, and compare our
model to simple variants. Finally, we inspect whether we can find differences between customers
for the purpose of customer management.
4.1 NUMBER OF VISITS
For our visits component we first estimated a Poisson model and deleted variables that showed
high collinearity and a relatively poor fit compared to correlated predictors (i.e. the cumulative
sum of both visits and gross margins, the direct lag of visits, and the cumulative average of gross
margins). We then performed a dispersion test, that showed significant results (dispersion =
1.617, z = 4.188, p = .000). Therefore, we tried fitting a Negative Binomial distribution to our
data, which performed significantly better than our Poisson model (LL Poisson = -4352.0, LL
Negative Binomial = -4243.3, Chi squared = 217.32, p = .000). We continued optimizing our
model assuming a Negative Binomial distribution.
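The comparison between the Poisson and Negative Binomial fits is a likelihood-ratio test, which can be recomputed from the two reported log-likelihoods. The sketch assumes one extra parameter (the dispersion parameter) and therefore one degree of freedom; the function names are hypothetical. With the rounded log-likelihoods the statistic comes out at 217.4, close to the reported 217.32; the small gap presumably stems from rounding.

```python
from math import erfc, sqrt

def likelihood_ratio(ll_restricted, ll_full):
    """LR statistic: twice the gain in log-likelihood of the fuller model."""
    return 2.0 * (ll_full - ll_restricted)

def chi2_upper_tail_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))

# Log-likelihoods reported in the text.
LL_POISSON = -4352.0
LL_NEGATIVE_BINOMIAL = -4243.3

lr_statistic = likelihood_ratio(LL_POISSON, LL_NEGATIVE_BINOMIAL)
p_value = chi2_upper_tail_df1(lr_statistic)
```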
Since 32.3% of our observations are zero-observations, we fitted a zero-inflated and a
hurdle model to our data. Both models show a large improvement in AIC compared to the regular
NBD model (regular NBD = 8506.6, zero-inflated = 8310.7, hurdle = 8306.6). The hurdle model
provides a better fit, and offers more flexibility in estimating the zero-observations using
different predictors. We therefore continue fitting the hurdle variant of our model.
Until now, we have neither considered heterogeneity between customers, nor have we
investigated the effect of sales representatives. A model with a customer-specific intercept
provides a perfect fit, and is therefore not an option. We fitted variants of our model by adding
the effects of sales reps, customer industry/retail chain, and both. Our results (table 4.1)
indicate that customer industry or retail chain and sales reps both have a significant effect on
the number of orders placed by a customer.
Estimates, including confidence intervals and marginal effects, are presented in table 4.2.
We present a comparison between the observed and predicted number of visits in figure 4.1.
Let us first discuss the zero-model: the binary logit model. We observe the strongest negative
effect for recency. For each unit increase of recency, keeping all else equal, the
odds of purchase are multiplied by 0.653, i.e. they decrease by 34.7%. We observe the strongest
positive effect for the lag of visits: for each unit increase, keeping all else equal, the odds of
purchase increase by 61%.
For the number of visits we observe that for each unit increase of the cumulative average
of visits, the number of visits increases by 20.1%. Above a certain point, this effect diminishes,
as we observe a significant effect for the squared term of the cumulative average of visits. Thus,
customers with a higher average past number of visits are also predicted to show a higher
number of visits in the future. For each unit increase of the lag of premium orders, the number
of visits increases by 5.2%. During the period in which order details were not registered, the
number of visits was 18.4% higher. The GDP of consumers also shows a significant effect: for
each unit increase, keeping all else equal, the number of visits increases by 10.2%. We do not
observe any negative effects of predictor variables on the number of visits, only the strength of
the increase of the cumulative average past visits diminishes above a certain point.
Our model shows a Relative Absolute Error (RAE) of 0.59 (out-of-sample), which means
that it outperforms a naïve model where the estimated number of visits equals the number of
visits in the previous period. We observe a Root Mean Squared Error (RMSE) of 1.665 in our
estimation sample, against 1.358 in our validation sample. The Mean Absolute Error (MAE) of
1.358 of our holdout sample indicates that, on average, the predicted value for visits deviates
1.358 from the observed value for visits.
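For reference, the three accuracy measures can be computed as in the sketch below. It assumes that the RAE is the total absolute error of the model divided by the total absolute error of the naïve forecast (the value of the previous period), so that values below 1 favour the model. The function name and the numbers are made up for illustration and are not thesis data.

```r
# Accuracy measures for observed values, model predictions and a naive benchmark
accuracy <- function(observed, predicted, naive) {
  abs_err <- abs(observed - predicted)
  c(MAE  = mean(abs_err),
    RMSE = sqrt(mean((observed - predicted)^2)),
    RAE  = sum(abs_err) / sum(abs(observed - naive)))
}

# Made-up illustration (not thesis data)
observed  <- c(2, 0, 3, 1, 4)
previous  <- c(1, 2, 0, 3, 1)      # naive forecast: visits in the previous period
predicted <- c(1.5, 0.5, 2, 1.5, 3)

acc <- accuracy(observed, predicted, previous)
print(round(acc, 3))
```

Because the RMSE squares the errors before averaging, it can never be smaller than the MAE on the same sample, and it reacts more strongly to a few large errors.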
Model                            k    LL        Chisq        AIC
Model without qualitative IVs    14   -4082.3                8192.6
Sales Rep                        24   -4058.7   47.255 ***   8165.4
Industry/Chain                   26   -4039.9   37.615 ***   8131.8
Industry/Chain + Sales Rep       36   -4019.8   40.266 ***   8111.5
Table 4.1: model variants
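The AIC and chi-squared values in table 4.1 can be retraced from k and LL with the standard formulas below. The observation that each chi-squared statistic matches a comparison with the row directly above it is our reading of the numbers, not a statement from the table; small differences arise because LL is reported to one decimal.

```latex
\begin{align*}
\mathrm{AIC} &= 2k - 2\,LL \\
\text{Model without qualitative IVs:}\quad & 2(14) - 2(-4082.3) = 8192.6 \\
\text{Sales Rep:}\quad & 2(24) - 2(-4058.7) = 8165.4 \\[4pt]
\chi^2_{LR} &= 2\,\bigl(LL_{\text{row}} - LL_{\text{row above}}\bigr) \\
\text{Sales Rep:}\quad & 2(-4058.7 + 4082.3) = 47.2 \\
\text{Industry/Chain:}\quad & 2(-4039.9 + 4058.7) = 37.6 \\
\text{Industry/Chain + Sales Rep:}\quad & 2(-4019.8 + 4039.9) = 40.2
\end{align*}
```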
Figure 4.1: distribution of Visits
Count-model (Negative Binomial)
Variable          Estimate  Std. error  z-value  p-value      2.5%     97.5%    Marginal
Visits avg        0.183     0.016       11.498   0.000 ***    0.152    0.214    1.201
Premium t-1       0.051     0.017       2.932    0.003 **     0.017    0.084    1.052
Premium dum       0.169     0.046       3.701    0.000 ***    0.079    0.258    1.184
GDP Cons.         0.097     0.014       8.352    0.000 ***    0.069    0.125    1.102
I(Visits avg^2)   -0.044    0.006       -7.048   0.000 ***    0.053    0.086    1.072

Zero-model (Binary Logit)
Variable          Estimate  Std. error  z-value  p-value      2.5%     97.5%    Marginal
Intercept         -0.429    0.172       -2.487   0.013 *      -0.767   -0.091   0.651
Recency           -0.427    0.064       -6.701   0.000 ***    -0.552   -0.302   0.653
Visits t-1        0.476     0.064       -6.701   0.000 ***    0.374    0.578    1.610
Visits sum        -0.024    0.006       -3.835   0.000 ***    -0.037   -0.012   0.976
Categories        0.281     0.019       14.899   0.000 ***    0.244    0.318    1.325
I(Categories^2)   -0.320    0.069       -4.647   0.000 ***    -0.455   -0.185   0.726

Log-likelihood = -4019.8, LR test: Chi squared (33) = 2191.1***, AIC = 8111.51
Table 4.2: estimates Visits model
4.2 GROSS MARGINS
We estimated a multiplicative model for our gross margins (GM) component, with again a zero-
hurdle component for our zero-observations that has the same specification as for our Visits
model. We dropped the cumulative sum of both frequency and gross margins due to high
collinearity. Our initial model is significant (F = 114.4, df = 10; 1878, p = .000) and explains
37.9% of the variance within gross margins.
Since we expect considerable heterogeneity between customers, we have modeled three
variants of our model: (1) a pooled model, (2) a model with fixed customer effects, and (3) a
model with random customer effects. For the fixed effects model we deleted the variable
categories because it is customer-specific and time-invariant, and therefore cannot be estimated
alongside customer fixed effects. An F-test between the pooled and fixed effects model shows that
there are significant individual
differences present (F = 2.082, df1 = 346, df2 = 1532, p = .000).
We then estimated a random-effects model, but a Hausman test showed that the random
effects are correlated with our predictors, from which we must conclude that a random effects
model is not appropriate (Chi squared = 631.59, df = 8, p = .000). Thus, we estimate our model with
a fixed effect for each customer, but we do not allow for customer-specific error terms. We
further fitted our model by deleting recency, premium dummy, GDP consumers, and the lag of
gross margins. Including quadratic effects of our variables did not improve the model’s
performance.
Since we estimate a two-stage model, selection bias may be present. We therefore re-estimated
our model using the Heckman procedure. The Inverse Mills ratio was not significant
(IMR = 0.050, t = 0.394, p = .900), indicating that we can estimate our model without applying
the Heckman correction. No autocorrelation was detected between the residuals of each
customer (Durbin-Watson = 2.193, p = 1), but we did find significant heteroskedasticity
(Breusch–Pagan = 36.594, df = 5, p = 0.000). Also, the residuals of our model failed to meet
the normality assumption (Shapiro-Wilk = 0.894, p = .000; Kolmogorov-Smirnov = 0.131, p =
.000). We therefore based the t-tests of the coefficients on robust standard errors, computed with the
Arellano method that accounts for heteroskedasticity in fixed effects models.
We present the final estimates of our (censored) gross margins component in table 4.3. We
do not report estimates of the zero-hurdle component, as they are the same as presented in
table 4.2. Our overall model is significant, and explains 36.2% of the variation in gross margins.
Except for the number of premium orders, all estimates are significant. Since we estimated a
multiplicative model, estimates are presented as elasticities. We calculated the original estimates
by multiplying the elasticities by the variance within the standard errors divided by two.
The number of visits shows the largest positive effect on gross margins: for each 1%
increase in visits, gross margins increase by 1.8%. The lag of visits, the past average of gross
margins (GM avg), and product returns show negative effects, with the largest effect resulting
from product returns. For each 1% increase in the value of product returns, gross margins
decrease by 0.3%. The weighted mean
of the fixed effects is 4.767 with (robust) standard error 0.418. Except for 16 of the customers,
all fixed effects are significant and positive.
We predict the values for gross margins by taking the exponential of our log-transformed
predictions and multiplying these predictions by the probability of purchase. We present our
measures of accuracy in table 4.4. Our model shows an RAE of 0.529 on our validation sample,
which indicates that it outperforms a naïve model. The out-of-sample MAE is 186.34, which
means that, on average, our predictions are 186.34 off the true values of gross margins.
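The prediction step described above can be written compactly as follows. The notation is ours and serves only as an illustration: the first factor is the purchase probability from the zero-hurdle component, and the second is the back-transformed prediction of the log-linear gross margins model. The numerical values are hypothetical.

```latex
\begin{align*}
\widehat{GM}_{it} &= \hat{p}_{it} \cdot \exp\!\bigl(\widehat{\ln GM}_{it}\bigr) \\
\text{e.g.}\quad \hat{p}_{it} = 0.6,\ \widehat{\ln GM}_{it} = 5.5
  &\;\Rightarrow\; \widehat{GM}_{it} = 0.6 \cdot \exp(5.5) \approx 0.6 \cdot 244.7 \approx 146.8
\end{align*}
```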
Variable     Estimate  Std. error  z-value   p-value      Elasticity
Visits       5.157     0.142       12.897    0.000 ***    1.826
Visits t-1   -0.215    0.089       -2.722    0.007 **     -0.243
GM avg       -0.217    0.070       -3.529    0.000 ***    -0.245
Premium      0.211     0.144       1.343     0.179        0.193
Returns      -0.284    0.021       -15.656   0.000 ***    -0.334

Unbalanced Panel: n = 349, T = 1-8, N = 1889
σ̂ = 1.253, RSS = 2965.8, ESS = 4649.3, R² = 0.362, Adj. R² = 0.215
F-statistic (5, 1535) = 174.258, p-value = .000
Table 4.3: estimates gross margins model
Sample               MAE      RAE     RMSE
Estimation (Q1-Q8)   184.99   0.422   521.07
Holdout (Q9-Q11)     175.22   0.498   506.84
Table 4.4: predictive accuracy GM model
4.3 CUSTOMER PROFITABILITY
Now that we have estimated our individual model components, we predict customer profitability.
Means and standard deviations of measured and predicted V, GM, and CP, for both our
estimation and validation sample are presented in table 4.5. Especially the predicted number of
visits within the validation period appears to be far off its observed values. This is probably
because of observations with relatively high values that could be considered outliers (figure 4.1).
On average, the absolute deviation between the observed and predicted values for CP is
204.03 within the estimation period, and 187.11 within the validation period. Our model shows
an RAE of 0.579 and 0.646 in the estimation and validation period respectively, and thus
performs better than a naïve model. The RMSE is 511.13 in the estimation period and 521.92 in the
validation period, and the prediction errors in the estimation period have a standard deviation
of 468.72.
4.3.1 CHANGES IN CP OVER TIME
We now investigate changes in CP over time. For this purpose we have divided customers into
three profitability segments: low (0-25%), middle (25-75%), and high (75-100%). We
examine the shifts within segments by comparing the average CP in the year prior to the
validation period to the average CP in the validation period. We chose to only take the year prior
to the validation period (Q5-Q8) for our comparison to prevent large changes in CP within the
estimation period from disturbing our comparisons. Also, we deleted the customers that did not make
any purchases within the year prior to the validation period, because (a) this resulted in a very
high percentage of zero observations that prevented us from dividing customers into realistic
profitability segments, as 41.6% of the customers showed a CP of zero within the validation
period, and (b) we did not believe that this would bias our comparison too much, since only 3
out of the 70 customers that did not purchase within Q5-Q8 eventually did make a purchase in
Q9-Q11. We will refer to Q5-Q8 as period 1, and to Q9-Q11 as period 2.
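The segmentation and the comparison between the two periods can be sketched as follows. This is an illustration under assumptions rather than the code used for this thesis: the cut-offs are taken as the 25th and 75th percentiles within each period, the helper name segment_cp is hypothetical, and the CP values are made up.

```r
# Assign customers to profitability segments by percentile cut-offs:
# low = bottom 25%, middle = 25-75%, high = top 25%
segment_cp <- function(cp) {
  breaks <- quantile(cp, probs = c(0, 0.25, 0.75, 1))
  cut(cp, breaks = breaks, labels = c("Low", "Middle", "High"),
      include.lowest = TRUE)
}

# Made-up average CP per customer in period 1 and period 2 (not thesis data)
cp_period1 <- c(-20, 5, 30, 60, 90, 150, 400, 1200)
cp_period2 <- c(10, 80, 45, -5, 300, 120, 350, 1500)

seg1 <- segment_cp(cp_period1)
seg2 <- segment_cp(cp_period2)

# Rows: segment in period 1; columns: segment in period 2 (share of customers)
shifts <- prop.table(table(period1 = seg1, period2 = seg2))
print(round(100 * shifts, 1))

# Share of customers that stay in the same segment
stayers <- sum(diag(shifts))
print(stayers)
```

Percentile cut-offs only yield distinct segments when the CP values are sufficiently spread out, which is why a large share of identical zero values makes this kind of segmentation impractical.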
In table 4.6 we show both observed and predicted shifts in CP segments from period 1 to
period 2. 53.3% of the customers did not change profitability segment, which means that
46.7% of the customers did. Only 3% of the lowest segment (2 customers) shifts to the highest
segment, while 10% of the highest segment (7 customers) shifts to the lowest segment. When
examining the predicted shifts, these do not look far off: 56.5% of the customers are predicted
to stay within the same CP segment.
If we compare the predicted CP segments with the observed CP segments in period 2 (table
4.7), 66.6% of the customers are classified correctly. The largest errors seem to take place
between the lowest and the middle segment. 21.7% of the customers in the middle segment
are predicted to be in the lowest segment, and 34.8% of the customers in the lowest segment
are predicted to be in the middle segment. A possible explanation for these errors is the
distribution within each segment. For example, the difference between the first and the third
quartile of the predicted CP is only € 151.24, while the total range is € 5356.30. Also, there is
a considerable overlap between the lowest observed CP segment, and the middle predicted CP
segment. Thus, we could conclude that our model is predicting high CP reasonably well, but it
has difficulty in predicting lower and average CP.
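The classification figures quoted above follow directly from table 4.7. The sketch below re-enters the percentages from that table and computes the overall hit rate and the distribution of predictions within each observed segment; the results agree with the text up to rounding of the published percentages.

```r
# Shares of customers (in %) from table 4.7:
# rows = observed segment, columns = predicted segment in Q9-Q11
conf <- matrix(c(12.3,  8.7,  4.0,
                 10.9, 36.2,  2.9,
                  1.8,  5.1, 18.1),
               nrow = 3, byrow = TRUE,
               dimnames = list(observed  = c("Low", "Middle", "High"),
                               predicted = c("Low", "Middle", "High")))

# Overall share of customers classified in the correct segment
accuracy <- sum(diag(conf))
print(accuracy)

# Distribution of predictions within each observed segment (rows sum to 100)
row_shares <- round(100 * prop.table(conf, margin = 1), 1)
print(row_shares)
```

Read by row, the diagonal shows which observed segments are recovered best, and the off-diagonal cells show where the misclassified customers end up.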
On average, CP decreases by 232% from period 1 to period 2, while our model predicted an
average increase of 220%. We found that investigating these relative changes in CP is not
useful, since our data contains many values close to zero. For example, customer X showed a
relative CP increase of 80,000%, because his CP in the first period was - € 0.0625, while he
showed an average CP of € 86.11 in the second period.
4.3.2 MODEL VARIANTS
To what extent does our model outperform a model that predicts future CP based on the past
average CP, and a model without a separate cost component? We took the average observed
CP of period 1 (Q5-Q8) to predict CP of period 2 (Q9-Q11), and compared it to the observed
CP of period 2. We also compared our model to several simpler variants that include fewer model
components than our main model, for example a model that is based on the past average of
gross margins with a correction for predicted purchase probability (∅).
We report the MAE of each model in table 4.8. Our main model shows the lowest MAE.
However, especially the model that is based on past CP with a correction for purchase probability
comes very close. T-tests comparing the absolute errors of each simpler model with the absolute
errors of our main model showed that none of the simpler models had a significantly higher MAE
than our main model.
Thus, our model does not provide a significantly better performance compared to models based
on past average contribution and to models without a separate cost component.
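A comparison of this kind can be sketched as below. The sketch assumes a paired t-test, since both models are evaluated on the same customers; whether the tests reported in table 4.8 were paired is not stated, so this is an assumption. All data are simulated and the object names are illustrative.

```r
# Simulated holdout data (not thesis data)
set.seed(1)
observed    <- rgamma(50, shape = 1, scale = 200)
pred_main   <- observed + rnorm(50, sd = 150)
pred_simple <- observed + rnorm(50, sd = 160)

abs_err_main   <- abs(observed - pred_main)
abs_err_simple <- abs(observed - pred_simple)

# Paired test on the absolute errors: a negative t-value means that the
# main model has the lower mean absolute error
test <- t.test(abs_err_main, abs_err_simple, paired = TRUE)
print(c(MAE_main = mean(abs_err_main), MAE_simple = mean(abs_err_simple)))
print(test$statistic)
print(test$p.value)
```

Absolute errors are typically right-skewed, so a non-parametric alternative such as wilcox.test with paired = TRUE could serve as a robustness check.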
In figure 4.2 we show observed and predicted CP for 6 customers, and also predicted CP
based on past average CP multiplied by the probability of purchase. Customers A and B showed
a large under-prediction, customers C and D a large over-prediction, and E and F a relatively
good performance. As we can see, the simple model does not predict large changes over time.
Our main model does show changes in CP over time, but often too much or in the wrong
direction. Therefore, the simple model is often just as close to the observed value as our main
model.
                     Visits          Gross Margins   CP
                     Mean    SD      Mean    SD      Mean    SD
Q1-Q8    Actual      1.83    2.41    360     879     231     752
         Predicted   1.83    1.81    280     732     151     667
Q9-Q11   Actual      1.92    1.84    245     648     161     568
         Predicted   1.01    0.90    194     575     123     546
Table 4.5: summary statistics CP
                   Q9-Q11 Observed            Q9-Q11 Predicted
Q5-Q8 Observed     Low      Middle   High     Low      Middle   High
Low                8.3%     15.9%    0.7%     9.4%     14.5%    1.1%
Middle             14.1%    28.3%    7.6%     11.6%    30.8%    7.6%
High               2.5%     5.8%     16.7%    4.0%     4.7%     16.3%
Table 4.6: shifts in CP segments from Q5-Q8 to Q9-Q11
                    Predicted
                    Low      Middle   High
Observed   Low      12.3%    8.7%     4.0%
           Middle   10.9%    36.2%    2.9%
           High     1.8%     5.1%     18.1%
Table 4.7: observed vs. predicted CP segments in Q9-Q11
Model                         MAE      t-value
Main model                    187.11
∅ GM.past.avg_it - V̂_it C     197.67   -0.507
GM.past.avg_it - V̂_it C       223.41   -1.696 .
∅ GM.past.avg_it              220.41   -1.596
∅ CP.past.avg_it              190.69   -0.173
CP.past.avg_it                208.36   -1.009
Table 4.8: predictive accuracy CP model variants (holdout sample)
Figure 4.2: patterns in CP for 6 customers (panels A to F)
4.4 CUSTOMER SEGMENTS
We now investigate differences between customers. We divided customers into clusters based
on their average number of orders and spending levels in both Q5-Q8 and Q9-Q11 (Ward
method, Euclidean distance): this way, both components of CP are used for clustering, and also
the changes from the first to the second period are captured. Six segments were identified, of
which we present averages across multiple variables in table 4.9 and the distribution of CP within
each segment in figure 4.3.
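A hierarchical clustering of this kind can be sketched in base R as follows. The data are simulated and the column names are illustrative. Standardising the inputs and using the "ward.D2" variant of hclust are assumptions, since the text only states that the Ward method with Euclidean distance was used.

```r
# Simulated customer-level averages for two periods (not thesis data)
set.seed(42)
n <- 60
features <- data.frame(
  visits_p1 = rpois(n, 2),
  visits_p2 = rpois(n, 2),
  gm_p1     = rgamma(n, shape = 1, scale = 300),
  gm_p2     = rgamma(n, shape = 1, scale = 300)
)

# Standardise so that visits and gross margins contribute on a comparable scale
# (assumption: the thesis does not state whether the inputs were standardised)
d  <- dist(scale(features), method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Cut the tree into six segments, as in the text
segment <- cutree(hc, k = 6)
print(table(segment))

# Segment means across the clustering variables
print(aggregate(features, by = list(segment = segment), FUN = mean))
```

Including the values of both periods as separate clustering variables is what allows customers with similar starting levels but different trajectories to end up in different segments.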
The segments with the highest average visits and gross margins also show the highest CP.
The ordering also holds for the returns ratio: customers in the highest profitability segment
return the smallest share of their products, while the customers in the lower segments return a
large share of their products. Segment 6 shows a returns ratio above 100%, which indicates that
these customers returned more than they bought. We can only explain this by assuming that these
customers returned products that they bought in the first observed periods, in which the company
did not yet register product returns. The least profitable segments show the highest relative rate
of product returns. Segments 3 and 4 show why it can be helpful to segment on both visits and
gross margins over both periods. Both segments start out really close to each other. However,
the profitability of segment 4 increases from period 1 to period 2, because of a large increase in
gross margins. Our model predicted this increase in gross margins, but it was not able to predict
the increase in CP. Instead it predicted that segment 3 would show a large increase in CP, while
that segment stayed relatively stable.
When we repeat the cluster analysis using predicted instead of observed
visits and gross margins in period 2, 73.2% of the customers are classified within the same
segment as in our first cluster analysis. The fourth cluster seems to be the problem: only 5.8%
of the customers in segment 4 are assigned to it correctly. Thus, we could again conclude that our
model components are not able to predict changes in CP over time very well, which was the
main purpose of our model.
In table 4.10 we present averages across variables for each customer industry or retail chain.
Although some segments are definitely less or more profitable than others, determining which
customer segments are more profitable than others based on customer industry or retail chain
appears to be much more difficult than segmenting based on customer behavior or profitability.
Figure 4.3: distribution of CP within customer segments
Segment                       1      2      3      4      5      6
n                             24     22     18     52     70     90
Visits          Q5-Q8         5.0    3.6    2.1    1.8    1.5    0.7
                Q9-Q11        4.9    2.8    1.9    2.6    0.8    0.4
                Predicted     2.7    2.2    1.5    1.6    1.0    0.6
Gross margins   Q5-Q8         1698   861    317    282    168    33
                Q9-Q11        1766   502    239    758    48     17
                Predicted     1277   541    235    441    50     15
CP              Q5-Q8         1346   607    166    153    63     -19
                Q9-Q11        1421   305    104    573    -7     -9
                Predicted     1085   387    326    128    -21    -27
CP Segment      Q5-Q8         3.0    3.0    2.2    2.2    1.9    1.4
                Q9-Q11        3.0    2.6    2.0    3.0    1.6    1.6
                Predicted     2.9    2.6    2.6    2.1    1.7    1.7
MAE                           3097   1293   1536   769    198    89
Premium orders                8.5    4.0    1.9    1.6    0.4    0.1
Returns ratio                 0.05   0.14   0.17   0.29   0.48   1.32
Categories                    11.5   10.4   10.1   9.1    6.7    5.3
Table 4.9: averages per customer segment
Figure 4.5: customer profitability per customer industry/retail chain
Industry/chain    A      B      D      E      F      G      I      O      X      Y
n                 9      19     49     7      6      9      12     82     36     6
Visits            2.8    3.5    2.2    3.3    2.6    1.6    2.2    0.9    1.6    1.1
Gross margins     521    1151   407    698    751    158    397    101    336    146
CP                321    903    252    466    566    47     245    38     224    69
CP Predicted      83     647    154    482    201    188    140    7      259    103
CP Segment        2.3    2.4    2.2    2.5    2.6    1.7    2.5    1.7    2.1    1.9
CP Seg. Pred.     1.6    2.5    2.0    2.3    2.3    2.1    2.1    1.8    2.1    2.0
MAE               1151   1789   780    2073   1609   844    547    213    887    248
Change in CP      -1.3   -4.2   1.0    0.3    18.2   -70    -0.4   0.4    -1.1   -0.3
Pred. change CP   -1.0   -2.1   0.6    0.8    -1.1   31.4   -0.6   0.3    -1.8   -0.3
Premium orders    4.4    5.5    2.9    1.4    3.8    0.2    2.8    0.4    0.6    0.00
Returns ratio     0.14   0.13   0.47   0.20   0.05   0.28   0.24   1.16   0.44   0.93
Categories        10.3   9.6    9.2    10.7   10.0   9.7    8.4    6.5    6.1    6.5
Table 4.10: averages per customer industry/retail chain
5 DISCUSSION
In this chapter we first present our main findings, in which we discuss contributions and
implications for theory. We then discuss managerial implications, and how our research
contributes to marketing practice. Next, we discuss limitations of our study, and provide
suggestions for future research. The chapter concludes with a final conclusion for our research.
5.1 GENERAL DISCUSSION
We posited that future CP is driven by past customer behavior, customer characteristics, and
both past and current firm actions. We found considerable evidence that past behavior is a
strong predictor of future behavior, which confirms existing theory (Blattberg et al., 2009;
Reinartz et al., 2005). Recency (negative) and the number of orders in the previous period
(positive) show the strongest effects on purchase propensity. The strongest predictor of the
number of orders/visits is the cumulative average of visits up till the previous period. We found
a diminishing effect above a certain point, which indicates that there is an ideal number of
purchases to optimize returns (Niraj et al., 2001). The number of visits in both the current and
the past period are shown to be two of the strongest predictors of current spending levels.
Returns have the strongest, negative effect on spending levels. However, we did find that a
higher CP is related to a lower returns ratio, and vice versa.
We also found that the number of categories purchased had a significant effect on purchase
incidence, which confirms the theory of cross-buying having a positive effect (Reinartz et al.,
2005; Reinartz & Kumar, 2003; Rust, Kumar, & Venkatesan, 2011).
We found that the GDP of consumers acted as a significant control variable, which confirms
the theory of market dynamics having an influence on CP (Grewal, Roggeveen, & Nordfölt, 2017;
Kumar, Anand, & Song, 2017). Both sales representative and customer group were found to have a
significant effect on the number of purchases. Mullins et al. (2014) found that the
perception of CP by sales representatives may be biased, and could thus have an effect on actual
CP. We did not study perceptions, and thus cannot confirm whether our finding is due to sales rep
perception of CP. However, our results do indicate that a CP model should account for the effect
of sales representatives.
For population density we did not find any significant effects, which confirms the theory of
Reinartz and Kumar (2003) that the effect of population density is not significant within a B2B
context. We did not investigate customer size. However, because we found significant
effects of customer group in our visits component, and of individual fixed effects in our gross
margins component, we confirm our theory that there is significant heterogeneity between
customers.
Based on our previous discussion of our model’s results, we conclude that we have confirmed
our hypotheses that past customer behavior, customer characteristics, and firm actions are
significant drivers of CP.
We hypothesized that a model that accounts for changes in revenues over time would result
in a better performance than a model based on past average contributions. We have modelled
CP with separate zero-hurdle components for costs based on the number of visits, and gross
margins, and found that our model does not significantly outperform a simple model that uses
the average past contribution to predict future CP. We therefore conclude that our hypothesis
on an improved model performance when modeling changes in revenues over time is not
confirmed. Therefore, we could not contradict the findings of Donkers, Verhoef, and De Jong
(2007), who found that simple models to predict CP often perform just as well as more
sophisticated models.
Also, we did not find evidence to confirm our hypothesis that a model that attributes
customer costs on the individual customer-level outperforms a model without a separate cost
component.
5.2 MANAGERIAL IMPLICATIONS
What is the value of our model for managerial decision-making? Since managers often have
limited resources in terms of both money and time, we conclude that, at least within our context,
a manager could best predict future CP based on past average contribution, possibly
extended by a purchase probability model. A model based on past average contribution and
purchase probability does not capture changes in CP for future time periods, but it does predict
future CP nearly as well as a more sophisticated model. Therefore, when trade-offs between
resources and model performance need to be made, a simple model offers more advantages
compared to a more sophisticated model.
If a manager wishes to determine drivers of customer profitability, our model does provide
value, especially in distinguishing between the most and the least profitable customers. It can
also be used as a tool to segment and compare customers based on their relative CP. Thus, we
can conclude that for diagnostic and descriptive purposes, our model could be used as a
management tool. For predicting changes in CP over time, our model could provide guidelines,
but we recommend not to rely on it as a normative tool, as the predicted changes show large
errors, with many under- and overpredictions.
5.3 LIMITATIONS
One major limitation of our research is that the company that was under investigation filed for
bankruptcy after our observation period. This bankruptcy was not sudden: it had been
struggling for quite some time. We cannot determine whether this might have biased our
predictions. For example, right before the bankruptcy, several customers ended their relationship
with the firm because the company was not able to refund payments for returned products.
Also, in the last six months of our observations, the company launched a new product group
that was not present during our estimation period, and we therefore cannot determine whether
this has influenced our predictions. Especially since a very large number of these products were
eventually returned, there is a serious possibility that this made our predictions less accurate.
Although we attributed costs on the individual level based on the average order handling
costs, we needed to make several assumptions regarding cost allocation. For example, the sales
force did not record hours spent on each customer, and it is possible that visits took place
without a purchase, which would not have been captured by our model.
Because of difficulties in estimating a model based on continuous data with negative,
positive, and zero values, we had to exclude product returns from the total gross margins
amounts. Although we did include the returns ratio as a predictor in our final model, it may not
have fully captured the true influence of returns on CP.
5.4 FUTURE RESEARCH
We found evidence that product returns have a significant influence on customer profitability
and especially its gross margins component. Especially lower customer profitability seems
related to a higher rate of product returns. Our research only included average order handling
costs. However, a company can be assumed to incur substantial costs for product returns, for
example additional inventory and shipping costs. Investigating the influence of product returns
on customer profitability while including costs for product returns could be a promising avenue
for further research. Also, since lower levels of CP seem related to a higher level of product
returns, managers could experiment with differentiating in return policy between CP segments.
If a manager chooses to experiment with differentiated return policies, they should account for
the possibility that such differentiated service may lead to lower overall customer satisfaction,
and with that, a lower overall CP (Petersen & Kumar, 2015).
Since our model is able to predict changes in CP over time, but often not in the right direction
or of the right size, it could be worthwhile to further investigate errors in predicting changes in CP
over time. To what extent can under- or overpredictions in CP predictions result in a decrease
or increase in actual future CP? A company could, for example, experiment by differentiating its
marketing efforts based on expected future CP, and compare results to a control group that did
not receive differentiated service based on their expected future CP.
5.5 CONCLUSION
In this thesis we aimed to “predict the unpredictable”. We were able to predict changes in
customer profitability over time. However, these predicted changes did not match the actual
observed CP very well. We therefore conclude that trying to predict the unpredictable is very
difficult, perhaps even impossible, especially because managers with scarce resources must trade
off the investment needed to develop a sophisticated model against the use of a relatively simple
model that seems to predict CP almost equally well.
REFERENCES
Blattberg, R. C., Malthouse, E. C., & Neslin, S. A. (2009). Customer lifetime value: Empirical
generalizations and some conceptual questions. Journal of Interactive Marketing, 23(2),
157-168.
Bolton, R. N., Lemon, K. N., & Verhoef, P. C. (2004). The theoretical underpinnings of customer
asset management: A framework and propositions for future research. Journal of the
Academy of Marketing Science, 32(3), 271-292.
Bowman, D., & Narayandas, D. (2004). Linking customer management effort to customer
profitability in business markets. Journal of Marketing Research, 41(4), 433-447.
Cooper, R., & Kaplan, R. S. (1988). Measure costs right: Make the right decisions. Harvard
Business Review, 66(5), 96-103.
Dhar, R., & Glazer, R. (2003). Hedging customers.
Donkers, B., Verhoef, P. C., & de Jong, M. G. (2007). Modeling CLV: A test of competing
models in the insurance industry. Quantitative Marketing and Economics, 5(2), 163-
190.
Fader, P. S., & Hardie, B. G. (2009). Probability models for customer-base analysis. Journal of
Interactive Marketing, 23(1), 61-69.
Foster, G., Gupta, M., & Sjoblom, L. (1996). Customer profitability analysis: Challenges and new
directions. Journal of Cost Management, 10, 5-17.
Grewal, D., Roggeveen, A. L., & Nordfölt, J. (2017). The future of retailing. Journal of Retailing,
93(1), 1-6.
Gupta, S. (2009). Customer-based valuation. Journal of Interactive Marketing, 23(2), 169-
178.
Gupta, S., Hanssens, D., Hardie, B., Kahn, W., Kumar, V., Lin, N., . . . Sriram, S. (2006). Modeling
customer lifetime value. Journal of Service Research, 9(2), 139-155.
Gupta, S., Lehmann, D. R., & Stuart, J. A. (2004). Valuing customers. Journal of Marketing
Research, 41(1), 7-18.
Holm, M., Kumar, V., & Rohde, C. (2012). Measuring customer profitability in complex
environments: An interdisciplinary contingency framework. Journal of the Academy of
Marketing Science, 40(3), 387-401.
Jackson, B. B. (1985). Build customer relationships that last. Harvard Business Review.
Kumar, V. (2018). A theory of customer valuation: Concepts, metrics, strategy, and
implementation. Journal of Marketing, 82(1), 1-19.
Kumar, V., Anand, A., & Song, H. (2017). Future of retailer profitability: An organizing
framework. Journal of Retailing, 93(1), 96-119.
Kumar, V., George, M., & Pancras, J. (2008). Cross-buying in retailing: Drivers and
consequences. Journal of Retailing, 84(1), 15-27.
Kumar, V., Ramani, G., & Bohling, T. (2004). Customer lifetime value approaches and best
practice applications. Journal of Interactive Marketing, 18(3), 60-72.
Kumar, V., Venkatesan, R., Bohling, T., & Beckmann, D. (2008). Practice prize Report—The
power of CLV: Managing customer lifetime value at IBM. Marketing Science, 27(4),
585-599.
Mulhern, F. J. (1999). Customer profitability analysis: Measurement, concentration, and
research directions. Journal of Interactive Marketing, 13(1), 25-40.
Mullins, R. R., Ahearne, M., Lam, S. K., Hall, Z. R., & Boichuk, J. P. (2014). Know your customer:
How salesperson perceptions of customer relationship quality form and influence
account profitability. Journal of Marketing, 78(6), 38-58.
Nenonen, S., & Storbacka, K. (2016). Driving shareholder value with customer asset
management: Moving beyond customer lifetime value. Industrial Marketing
Management, 52, 140-150.
Niraj, R., Gupta, M., & Narasimhan, C. (2001). Customer profitability in a supply chain. Journal
of Marketing, 65(3), 1-16.
Petersen, J. A., & Kumar, V. (2015). Perceived risk, product returns, and optimal resource
allocation: Evidence from a field experiment. Journal of Marketing Research, 52(2), 268-
285.
Pfeifer, P. E., Haskins, M. E., & Conroy, R. M. (2005). Customer lifetime value, customer
profitability, and the treatment of acquisition spending. Journal of Managerial Issues,
11-25.
Reinartz, W. J., & Kumar, V. (2003). The impact of customer relationship characteristics on
profitable lifetime duration. Journal of Marketing, 67(1), 77-99.
Reinartz, W., Thomas, J. S., & Kumar, V. (2005). Balancing acquisition and retention resources
to maximize customer profitability. Journal of Marketing, 69(1), 63-79.
Rust, R. T., Kumar, V., & Venkatesan, R. (2011). Will the frog change into a prince? Predicting
future customer profitability. International Journal of Research in Marketing, 28(4),
281-294.
Rust, R. T., Lemon, K. N., & Zeithaml, V. A. (2004). Return on marketing: Using customer equity
to focus marketing strategy. Journal of Marketing, 68(1), 109-127.
Shah, D., Kumar, V., Qu, Y., & Chen, S. (2012). Unprofitable cross-buying: Evidence from
consumer and business markets. Journal of Marketing, 76(3), 78-95.
Van Raaij, E. M., Vernooij, M. J., & van Triest, S. (2003). The implementation of customer
profitability analysis: A case study. Industrial Marketing Management, 32(7), 573-583.
Venkatesan, R., & Kumar, V. (2004). A customer lifetime value framework for customer
selection and resource allocation strategy. Journal of Marketing, 68(4), 106-125.
Verhoef, P. C., & Lemon, K. N. (2013). Successful customer value management: Key lessons
and emerging trends. European Management Journal, 31(1), 1-15.
APPENDIX A: R-CODE DATA PREPARATION
> rm(list = ls())
> setwd(" ")
> library(dplyr)
> library(zoo)
> library(BTYD)
> library(DataCombine)
> if(file.exists("CustomersFinal.csv")) {
+ customers <- read.csv("CustomersFinal.csv", header = TRUE, sep=",")
+ } else {
+ library(mice)
+ customers <- read.csv("CustomersClean.csv", header = TRUE, sep=",")
+ colnames(customers)[1] <- "cust"
+ customers <- dplyr::select(customers, -woonpl, -naam, -adres3, -betcond, -levwijze, -syscreated, -sysmodified)
+
+ # Population Density
+ bev <- read.csv("Bevolking.csv", header = TRUE, sep=";")
+ gem <- read.csv("Gemeentes2.csv", header = TRUE, sep=",")
+ bev <- filter(bev, grepl("GM", RegioS))
+ bev$RegioS <- as.character(bev$RegioS)
+ bev$RegioS <- substr(bev$RegioS, 3, 7)
+ bev <- dplyr::select(bev, RegioS, Bevolkingsdichtheid_57)
+ colnames(bev) <- c("Gem2017", "pop_dens")
+ bev$Gem2017 <- as.numeric(bev$Gem2017)
+ gem <- left_join(gem, bev, by=c("Gem2017"))
+ gem$Gem2017 <- NULL
+ customers <- left_join(customers, gem, by=c("postcode"))
+ rm(bev, gem)
+
+ # Categories
+ orders <- read.csv("OrdersClean.csv", header = TRUE, sep=",")
+ orderlines <- read.csv("OrderlinesClean.csv", header = TRUE, sep=",")
+ temp <- orders[,1:2]
+ orderlines <- left_join(orderlines, temp, by=c("ordernr"))
+ orderlines <- filter(orderlines, !is.na(debnr))
+ temp <- aggregate(orderlines$groepnaam, by=list(orderlines$debnr), function(x) length(unique(x)))
+ colnames(temp) <- c("cust", "categories")
+ customers <- left_join(customers, temp, by=c("cust"))
+ rm(orders, orderlines, temp)
+
+ # Impute postal code and population density
+ temp <- dplyr::select(customers, -cust)
+ impute <- mice(temp,m=5,maxit=50,meth='pmm',seed=500)
+ completedData <- complete(impute,1)
+ customers$pop_dens <- completedData$pop_dens
+ customers$postcode <- completedData$postcode
+
+ write.csv(customers, "CustomersFinal.csv", row.names = FALSE)
+ }
> if(file.exists("CBT.csv")) {
+ cbt <- read.csv("CBT.csv", header = TRUE, sep=",")
+ } else {
+ orders <- read.csv("OrdersClean.csv", header = TRUE, sep=",")
+ ### Quarterly data
+ elog <- orders[,2:4]
+ colnames(elog) <- c("cust", "date", "sales")
+ elog$filter <- format(as.Date(elog$date), "%Y-%m")
+ elog <- filter(elog, filter != "2005-11" & filter != "2005-12")
+ elog$yq <- as.yearqtr(elog$date, format = "%Y-%m-%d")
+ elog$date <- elog$yq
+ elog[,4:5] <- NULL
+ freq <- data.frame(dc.BuildCBTFromElog(elog, statistic = "freq"))
+ spend <- data.frame(dc.BuildCBTFromElog(elog, statistic = "total.spend"))
+ colnames(spend)[3] <- "gm"
+ cbt <- left_join(freq, spend, by=c("date", "cust"))
+ cbt$pur <- ifelse(cbt$Freq == 0, 0, 1)
+ cbt$cust <- as.numeric(as.character(cbt$cust))
+ colnames(cbt)[3] <- "freq"
+ rm(elog, freq, spend)
+ cbt$date <- as.character(cbt$date)
+
+ ## We only take customers who made a purchase in t=1.
+ # Add time for both first and last purchase to customers df
+ customers$first.pur <- 0
+ customers$last.pur <- 0
+ customers$tot.pur <- 0
+ # customers$all.pur <- 0
+ for (c in unique(cbt$cust)) {
+ cbt.c <- filter(cbt, cust == c)
+ customers$first.pur[customers$cust == c] <- min(which(cbt.c$pur == 1))
+ customers$last.pur[customers$cust == c] <- max(which(cbt.c$pur == 1))
+ customers$tot.pur[customers$cust == c] <- sum(cbt.c$pur==1)
+ # customers$all.pur[customers$debnr == c] <- ifelse(sum(cbt.c$pur) == 13, 1, 0)
+ }
+ customers <- filter(customers, first.pur == 1, cust %in% cbt$cust)
+ cbt <- filter(cbt, cust %in% customers$cust)
+ cbt <- filter(cbt, date != "2009 Q1") # contains a lot of noise
+
+ write.csv(cbt, "CBT.csv", row.names = FALSE)
+ }
> if(file.exists("Dataset.csv")) {
+ cbt <- read.csv("Dataset.csv", header = TRUE, sep=",")
+ } else {
+ cbt <- arrange(cbt, cust, date)
+
+ # Add recency
+ cbt$rec <- 0
+ temp <- cbt[1,]
+ temp$freq <- 999
+ for (c in unique(cbt$cust)) {
+ c.1 <- filter(cbt, cust == c)
+ c.pur <- c.1$pur
+ for (i in 2:12) {
+ c.i <- c.pur[1:(i-1)] # explicit range; 1:i-1 parses as (1:i)-1
+ c.max <- max(which(c.i[] == 1))
+ c.1[i,6] <- i-c.max
+ }
+ temp <- rbind(temp, c.1)
+ }
+ temp <- filter(temp, freq != 999)
+ cbt$rec <- temp$rec
+
+ # Lags
+ cbt <- slide(cbt, Var = "freq", GroupVar = "cust", NewVar = "freq.lag", slideBy = -1)
+ cbt <- slide(cbt, Var = "gm", GroupVar = "cust", NewVar = "gm.lag", slideBy = -1)
+
+ # Add dynamic variables:
+ cbt <- arrange(cbt, cust, date)
+ cbt[,11:20] <- 0
+ colnames(cbt)[11:20] <- c("freq.diff", "gm.diff", "freq.ets",
+ "freq.hw", "gm.ets", "gm.hw", "freq.cumsum", "gm.cumsum",
+ "freq.cumavg", "gm.cumavg")
+ temp2 <- cbt[1,]
+ temp2$freq <- 999
+
+ library(forecast)
+ for (c in unique(cbt$cust)) {
+ cbt.c <- filter(cbt, cust == c)
+ # Two-period differences do not depend on i, so compute them once per customer
+ cbt.c$freq.diff <- c(NA, NA, diff(cbt.c$freq, lag = 2))
+ cbt.c$gm.diff <- c(NA, NA, diff(cbt.c$gm, lag = 2))
+ for (i in 2:12) {
+ # 1:(i-1) makes the intended range explicit; 1:i-1 parses as (1:i)-1
+ cbt.c$freq.cumsum[i] <- sum(cbt.c$freq[1:(i-1)])
+ cbt.c$gm.cumsum[i] <- sum(cbt.c$gm[1:(i-1)])
+ cbt.c$freq.cumavg[i] <- mean(cbt.c$freq[1:(i-1)])
+ cbt.c$gm.cumavg[i] <- mean(cbt.c$gm[1:(i-1)])
+ }
+ for (i in 2:11) {
+ cbt.c$freq.ets[i+1] <- forecast(ets(cbt.c$freq[1:i]), 1)$mean
+ cbt.c$freq.hw[i+1] <- forecast(HoltWinters(cbt.c$freq[1:i], beta=FALSE, gamma=FALSE), 1)$mean
+ cbt.c$gm.ets[i+1] <- forecast(ets(cbt.c$gm[1:i]), 1)$mean
+ cbt.c$gm.hw[i+1] <- forecast(HoltWinters(cbt.c$gm[1:i], beta=FALSE, gamma=FALSE), 1)$mean
+ }
+ temp2 <- rbind(temp2, cbt.c)
+ }
+ temp2 <- filter(temp2, freq != 999)
+ temp2 <- temp2[,11:20]
+ cbt[,11:20] <- temp2
+
+ cbt <- slide(cbt, Var = "freq", GroupVar = "cust", NewVar = "freq.lag.2", slideBy = -2)
+ cbt <- slide(cbt, Var = "gm", GroupVar = "cust", NewVar = "gm.lag.2", slideBy = -2)
+
+ # Premium & Returns
+ orderlines <- read.csv("OrderlinesClean.csv", header = TRUE, sep=",")
+ orders <- read.csv("OrdersClean.csv", header = TRUE, sep=",")
+ orderlines$premium <- ifelse(orderlines$Class_01 == "SPAAR", 1, 0)
+ temp <- aggregate(orderlines[,21:22], by=list(ordernr = orderlines$ordernr), max)
+ orders <- left_join(orders, temp, by=c("ordernr"))
+ orders$yq <- as.yearqtr(orders$fakdat, format = "%Y-%m-%d")
+ orders <- aggregate(orders[,c(4,8:10,22:23)], by=list(date = orders$yq, cust = orders$debnr), sum)
+ orders$returns <- abs(orders$total_returns)
+
+ orders <- dplyr::select(orders, date, cust, premium, theme, returns)
+ cbt$date <- as.character(cbt$date)
+ orders$date <- as.character(orders$date)
+ cbt <- left_join(cbt, orders, by=c("cust", "date"))
+
+ cbt$premium[is.na(cbt$premium)] <- 0
+ cbt$returns[is.na(cbt$returns)] <- 0
+
+ cbt <- slide(cbt, Var = "premium", GroupVar = "cust", NewVar = "premium.lag", slideBy = -1)
+ cbt <- slide(cbt, Var = "returns", GroupVar = "cust", NewVar = "returns.lag", slideBy = -1)
+
+ # add dummy for observations before Q2 2007 (no information about premium/returns before that time)
+ no.premium.dates <- c("2006 Q1", "2006 Q2", "2006 Q3", "2006 Q4", "2007 Q1")
+ cbt$premium.dum <- ifelse(cbt$date %in% no.premium.dates, 1, 0)
+ rm(temp,orders,orderlines,no.premium.dates, temp2, c, c.i, c.max, c.pur, i, c.1)
+
+ ### --- MARKET VARIABLES
+ # We use the variable national household consumption as change from same period in previous year
+ bbp <- read.csv("BBP.csv", header = TRUE, sep=";")
+ bbp <- bbp[-13,c(2,6)]
+ colnames(bbp) <- c("date", "GDP.Cons")
+ bbp$date <- unique(cbt$date)
+ cbt <- left_join(cbt, bbp, by=c("date"))
+ rm(bbp)
+
+ ### --- COMBINING CBT AND CUSTOMER DATA
+ customers <- filter(customers, cust %in% cbt$cust)
+ cbt <- left_join(cbt, customers, by=c("cust"))
+
+ write.csv(cbt, "Dataset.csv", row.names = FALSE)
+ }
> if(file.exists("DatasetExReturns.csv")) {
+ cbt.new <- read.csv("DatasetExReturns.csv", header = TRUE, sep=",")
+ } else {
+ cbt.new <- cbt
+ cbt.new <- arrange(cbt.new, cust, date)
+
+ # Exclude returns
+ # (1) First for same period
+ cbt.new$gm.new <- ifelse(cbt.new$returns > 0 & cbt.new$returns < 1, cbt.new$gm * cbt.new$returns,
+ cbt.new$gm)
+ # (2) If sales in t is lower than amount returned, then:
+ temp2 <- cbt.new[1,]
+ temp2$freq <- 999
+ for (c in unique(cbt.new$cust)) {
+ cbt.new.c <- filter(cbt.new, cust == c)
+ for (i in 2:12) {
+ if (cbt.new.c$gm.new[i] < 0) {
+ cbt.new.c$gm.new[i-1] <- cbt.new.c$gm.new[i-1] - abs(cbt.new.c$gm.new[i])
+ cbt.new.c$gm.new[i] <- 0
+ }
+ }
+ temp2 <- rbind(temp2, cbt.new.c)
+ }
+ temp2 <- filter(temp2, freq != 999)
+ cbt.new$gm.new <- temp2$gm.new
+ # Repeat
+ temp2 <- cbt.new[1,]
+ temp2$freq <- 999
+ for (c in unique(cbt.new$cust)) {
+ cbt.new.c <- filter(cbt.new, cust == c)
+ for (i in 2:12) {
+ if (cbt.new.c$gm.new[i] < 0) {
+ cbt.new.c$gm.new[i-1] <- cbt.new.c$gm.new[i-1] - abs(cbt.new.c$gm.new[i])
+ cbt.new.c$gm.new[i] <- 0
+ }
+ }
+ temp2 <- rbind(temp2, cbt.new.c)
+ }
+ temp2 <- filter(temp2, freq != 999)
+ cbt.new$gm.new <- temp2$gm.new
+ # Still 39 observations left (excluding t=1)
+ temp2 <- cbt.new[1,]
+ temp2$freq <- 999
+ for (c in unique(cbt.new$cust)) {
+ cbt.new.c <- filter(cbt.new, cust == c)
+ for (i in 2:12) {
+ if (cbt.new.c$gm.new[i] < 0) {
+ cbt.new.c$gm.new[i-1] <- cbt.new.c$gm.new[i-1] - abs(cbt.new.c$gm.new[i])
+ cbt.new.c$gm.new[i] <- 0
+ }
+ }
+ temp2 <- rbind(temp2, cbt.new.c)
+ }
+ temp2 <- filter(temp2, freq != 999)
+ cbt.new$gm.new <- temp2$gm.new
+ # negative sales from unobserved left period
+ cbt.new$gm.new <- ifelse(cbt.new$gm.new < 0, 0, cbt.new$gm.new)
+
+ # Now set lags for gm without returns
+ cbt.new <- slide(cbt.new, Var = "gm.new", GroupVar = "cust", NewVar = "gm.lag", slideBy = -1)
+
+ temp2 <- cbt.new[1,]
+ temp2$freq <- 999
+ for (c in unique(cbt.new$cust)) {
+ cbt.new.c <- filter(cbt.new, cust == c)
+ for (i in 2:12) {
+ cbt.new.c$gm.cumsum[i] <- sum(cbt.new.c$gm.new[1:(i-1)]) # explicit range; 1:i-1 parses as (1:i)-1
+ cbt.new.c$gm.cumavg[i] <- mean(cbt.new.c$gm.new[1:(i-1)])
+ }
+ temp2 <- rbind(temp2, cbt.new.c)
+ }
+
+ temp2 <- filter(temp2, freq != 999)
+ cbt.new$gm.cumavg <- temp2$gm.cumavg
+ cbt.new$gm.cumsum <- temp2$gm.cumsum
+ rm(temp2)
+ cbt.new$gm <- cbt.new$gm.new
+ cbt.new <- dplyr::select(cbt.new, cust, date, freq, gm, pur, rec, freq.lag, freq.cumsum, freq.cumavg,
+ gm.lag, gm.cumsum, gm.cumavg, premium, premium.lag, premium.dum,
+ GDP.Cons, represent_id, klantgroep, pop_dens, categories)
+ cbt.new[,c(4,9:12)] <- round(cbt.new[,c(4,9:12)], 2)
+ write.csv(cbt.new, "DatasetExReturns.csv", row.names = FALSE)
+ }
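The recency loop in the Dataset.csv step above counts, for each period, how many periods have passed since the customer's most recent earlier purchase. As a minimal sketch, the same quantity can be computed by a small stand-alone function. The function name recency and the toy purchase vector are illustrative assumptions and not part of the original script. The sketch relies on every customer having purchased in the first period, which the filter first.pur == 1 ensures.

```r
# Hypothetical helper (not in the original script): periods since the most
# recent purchase strictly before period i, for one customer's 0/1 vector.
recency <- function(pur) {
  n <- length(pur)
  rec <- integer(n)        # rec[1] stays 0, as in the original loop
  last <- NA_integer_      # index of the most recent purchase seen so far
  for (i in seq_len(n)) {
    if (i > 1 && !is.na(last)) rec[i] <- i - last
    if (pur[i] == 1) last <- i
  }
  rec
}

# Toy example: purchases in periods 1, 4 and 5
recency(c(1, 0, 0, 1, 1))
# 0 1 2 3 1
```

Applied per customer (for example with ave(cbt$pur, cbt$cust, FUN = recency) on data sorted by customer and date), this would avoid growing a temporary data frame with rbind inside the loop.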
APPENDIX B: R-CODE MODEL COMPONENTS
> rm(list = ls())
>
> setwd(" ")
>
> library(dplyr)
> library(MASS)
> library(lmtest)
> library(pscl)
> library(car)
> library(AER)
> library(forecast)
> library(Metrics)
> library(nortest)
> library(plm)
> library(sampleSelection)
> library(car)
>
> # Import & prepare data
> dat <- read.csv("DatasetExReturns.csv", header = TRUE, sep=",")
> dat <- filter(dat, !is.na(gm.lag)) # Delete t=1 because of NAs for lags
> dat$represent_id <- as.factor(dat$represent_id)
>
> # Remove customers that didn't buy anything from t2 to t9 (= estimation sample):
> remove <- c(82, 126, 136, 271, 338, 367, 461, 486, 488, 522, 565, 601, 629, 653, 676, 708, 722, 725,
+ 727, 746, 897, 920, 941, 986, 1005, 1009, 1018, 1025, 1324)
> dat <- filter(dat, !cust %in% remove); dat$cust <- as.factor(dat$cust)
> summary(dat)
cust date freq gm pur rec
1 : 11 2006 Q2: 349 Min. : 0.000 Min. : 0.00 Min. :0.0000 Min. : 1.000
18 : 11 2006 Q3: 349 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 1.000
31 : 11 2006 Q4: 349 Median : 1.000 Median : 43.03 Median :1.0000 Median : 1.000
41 : 11 2007 Q1: 349 Mean : 1.657 Mean : 328.22 Mean :0.6205 Mean : 1.972
43 : 11 2007 Q2: 349 3rd Qu.: 2.000 3rd Qu.: 297.36 3rd Qu.:1.0000 3rd Qu.: 2.000
48 : 11 2007 Q3: 349 Max. :26.000 Max. :15487.74 Max. :1.0000 Max. :10.000
(Other):3773 (Other):1745
freq.lag freq.cumsum freq.cumavg gm.lag gm.cumsum gm.cumavg
Min. : 0.000 Min. : 1.00 Min. : 0.180 Min. : 0.00 Min. : 0.0 Min. : 0.00
1st Qu.: 0.000 1st Qu.: 4.00 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.: 401.7 1st Qu.: 81.69
Median : 1.000 Median : 8.00 Median : 1.670 Median : 65.04 Median : 936.4 Median : 199.01
Mean : 1.809 Mean : 12.67 Mean : 2.311 Mean : 362.05 Mean : 2558.7 Mean : 472.27
3rd Qu.: 2.000 3rd Qu.: 16.00 3rd Qu.: 3.000 3rd Qu.: 340.91 3rd Qu.: 2601.3 3rd Qu.: 471.30
Max. :26.000 Max. :133.00 Max. :24.000 Max. :17026.82 Max. :67772.4 Max. :17026.82
premium premium.lag premium.dum GDP.Cons represent_id klantgroep
Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. :-1.000 6 :869 Overig :1265
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:-0.400 13 :693 Supermarkt D : 605
Median : 0.0000 Median : 0.0000 Median :0.0000 Median : 1.600 35 :517 Bouwmarkt/Tuincentrum: 517
Mean : 0.1151 Mean : 0.1058 Mean :0.3636 Mean : 1.355 16 :396 Tankstation : 429
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:1.0000 3rd Qu.: 2.400 27 :352 Supermarkt B : 231
Max. :19.0000 Max. :19.0000 Max. :1.0000 Max. : 4.500 2 :264 Supermarkt Overig : 231
(Other):748 (Other) : 561
pop_dens categories
Min. : 54 Min. : 2.000
1st Qu.: 149 1st Qu.: 3.000
Median : 318 Median : 6.000
Mean : 760 Mean : 6.564
3rd Qu.: 798 3rd Qu.: 9.000
Max. :5771 Max. :15.000
>
> # Mean center variables that we'll include for quadratic effects
> dat$freq.mc <- scale(dat$freq, center = TRUE, scale = FALSE)[1:3839,1]
> dat$rec.mc <- scale(dat$rec, center = TRUE, scale = FALSE)[1:3839,1]
> dat$freq.lag.mc <- scale(dat$freq.lag, center = TRUE, scale = FALSE)[1:3839,1]
> dat$freq.cumsum.mc <- scale(dat$freq.cumsum, center = TRUE, scale = FALSE)[1:3839,1]
> dat$freq.cumavg.mc <- scale(dat$freq.cumavg, center = TRUE, scale = FALSE)[1:3839,1]
> dat$gm.lag.mc <- scale(dat$gm.lag, center = TRUE, scale = FALSE)[1:3839,1]
> dat$gm.cumavg.mc <- scale(dat$gm.cumavg, center = TRUE, scale = FALSE)[1:3839,1]
> dat$gm.cumsum.mc <- scale(dat$gm.cumsum, center = TRUE, scale = FALSE)[1:3839,1]
> dat$premium.mc <- scale(dat$premium, center = TRUE, scale = FALSE)[1:3839,1]
> dat$premium.lag.mc <- scale(dat$premium.lag, center = TRUE, scale = FALSE)[1:3839,1]
> dat$categories.mc <- scale(dat$categories, center = TRUE, scale = FALSE)[1:3839,1]
>
> temp <- read.csv("Dataset.csv", header = TRUE, sep=",")
> temp <- filter(temp, !cust %in% remove)
> temp$cust <- as.factor(temp$cust)
> dat <- left_join(dat, temp[,c(1:2,25)], by=c("cust", "date"))
> #write.csv(dat, "DatasetPlusReturns.csv", col.names = TRUE, row.names = FALSE)
>
> # Divide in estimation and validation sample
> dat$date <- as.numeric(dat$date)
> dat.v <- filter(dat, date >= 10)
> dat <- filter(dat, date < 10)
> dat <- arrange(dat, date, cust)
>
> # Some functions that we use multiple times throughout the code
> clog <- function(x) log(x + 1)
> vifs <- function(model) {print(summary(model)); print(sqrt(vif(model)) > 2); print(vif(model))}
> assumptions <- function(model) {checkresiduals(model); print(gqtest(model, 0.5)); dwtest(model)}
> performance <- function(obs, pred) {print(rae(obs, pred)); print(rmse(obs, pred)); print(mae(obs, pred))}
>
> # Inspect data
> dat %>% dplyr::select(freq:GDP.Cons, pop_dens:categories,
+ gm.cumsum) %>% cor() %>% round(2)
freq gm pur rec freq.lag freq.cumsum freq.cumavg gm.lag gm.cumsum gm.cumavg premium premium.lag
freq 1.00 0.79 0.52 -0.29 0.66 0.39 0.66 0.59 0.44 0.62 0.31 0.13
gm 0.79 1.00 0.28 -0.15 0.47 0.29 0.48 0.54 0.42 0.58 0.44 0.15
pur 0.52 0.28 1.00 -0.47 0.37 0.20 0.30 0.22 0.16 0.20 0.10 0.08
rec -0.29 -0.15 -0.47 1.00 -0.38 -0.17 -0.29 -0.20 -0.13 -0.16 -0.05 -0.07
freq.lag 0.66 0.47 0.37 -0.38 1.00 0.51 0.83 0.78 0.53 0.72 0.13 0.26
freq.cumsum 0.39 0.29 0.20 -0.17 0.51 1.00 0.75 0.37 0.85 0.61 0.19 0.23
freq.cumavg 0.66 0.48 0.30 -0.29 0.83 0.75 1.00 0.63 0.70 0.82 0.09 0.11
gm.lag 0.59 0.54 0.22 -0.20 0.78 0.37 0.63 1.00 0.55 0.80 0.20 0.35
gm.cumsum 0.44 0.42 0.16 -0.13 0.53 0.85 0.70 0.55 1.00 0.80 0.20 0.26
gm.cumavg 0.62 0.58 0.20 -0.16 0.72 0.61 0.82 0.80 0.80 1.00 0.11 0.13
premium 0.31 0.44 0.10 -0.05 0.13 0.19 0.09 0.20 0.20 0.11 1.00 0.42
premium.lag 0.13 0.15 0.08 -0.07 0.26 0.23 0.11 0.35 0.26 0.13 0.42 1.00
premium.dum 0.15 0.10 0.17 -0.31 0.19 -0.32 0.12 0.12 -0.18 0.07 -0.15 -0.14
GDP.Cons 0.12 0.07 0.07 -0.07 0.07 -0.12 0.03 0.04 -0.07 0.02 -0.03 -0.02
pop_dens 0.06 0.03 -0.01 0.04 0.08 0.11 0.13 0.03 0.05 0.05 -0.03 -0.04
categories 0.48 0.35 0.48 -0.41 0.45 0.44 0.46 0.34 0.41 0.40 0.14 0.13
premium.dum GDP.Cons pop_dens categories
freq 0.15 0.12 0.06 0.48
gm 0.10 0.07 0.03 0.35
pur 0.17 0.07 -0.01 0.48
rec -0.31 -0.07 0.04 -0.41
freq.lag 0.19 0.07 0.08 0.45
freq.cumsum -0.32 -0.12 0.11 0.44
freq.cumavg 0.12 0.03 0.13 0.46
gm.lag 0.12 0.04 0.03 0.34
gm.cumsum -0.18 -0.07 0.05 0.41
gm.cumavg 0.07 0.02 0.05 0.40
premium -0.15 -0.03 -0.03 0.14
premium.lag -0.14 -0.02 -0.04 0.13
premium.dum 1.00 0.15 0.00 0.00
GDP.Cons 0.15 1.00 0.00 0.00
pop_dens 0.00 0.00 1.00 -0.05
categories 0.00 0.00 -0.05 1.00
> dat %>% filter(pur == 1) %>% dplyr::select(freq:gm, rec:GDP.Cons, pop_dens:categories,
+ gm.cumsum) %>% cor() %>% round(2)
freq gm rec freq.lag freq.cumsum freq.cumavg gm.lag gm.cumsum gm.cumavg premium premium.lag
freq 1.00 0.78 -0.12 0.61 0.37 0.64 0.58 0.44 0.63 0.31 0.10
gm 0.78 1.00 -0.04 0.43 0.27 0.45 0.52 0.41 0.57 0.43 0.14
rec -0.12 -0.04 1.00 -0.30 -0.16 -0.23 -0.15 -0.10 -0.13 0.00 -0.05
freq.lag 0.61 0.43 -0.30 1.00 0.51 0.85 0.78 0.53 0.73 0.11 0.25
freq.cumsum 0.37 0.27 -0.16 0.51 1.00 0.72 0.36 0.85 0.60 0.19 0.24
freq.cumavg 0.64 0.45 -0.23 0.85 0.72 1.00 0.63 0.69 0.82 0.07 0.09
gm.lag 0.58 0.52 -0.15 0.78 0.36 0.63 1.00 0.55 0.81 0.19 0.35
gm.cumsum 0.44 0.41 -0.10 0.53 0.85 0.69 0.55 1.00 0.80 0.20 0.26
gm.cumavg 0.63 0.57 -0.13 0.73 0.60 0.82 0.81 0.80 1.00 0.09 0.12
premium 0.31 0.43 0.00 0.11 0.19 0.07 0.19 0.20 0.09 1.00 0.42
premium.lag 0.10 0.14 -0.05 0.25 0.24 0.09 0.35 0.26 0.12 0.42 1.00
premium.dum 0.08 0.06 -0.13 0.12 -0.41 0.07 0.08 -0.23 0.04 -0.21 -0.18
GDP.Cons 0.11 0.06 -0.07 0.04 -0.16 0.01 0.03 -0.09 0.01 -0.04 -0.02
pop_dens 0.09 0.04 -0.02 0.12 0.13 0.16 0.04 0.05 0.05 -0.03 -0.04
categories 0.35 0.29 -0.20 0.35 0.43 0.40 0.30 0.41 0.38 0.12 0.11
premium.dum GDP.Cons pop_dens categories
freq 0.08 0.11 0.09 0.35
gm 0.06 0.06 0.04 0.29
rec -0.13 -0.07 -0.02 -0.20
freq.lag 0.12 0.04 0.12 0.35
freq.cumsum -0.41 -0.16 0.13 0.43
freq.cumavg 0.07 0.01 0.16 0.40
gm.lag 0.08 0.03 0.04 0.30
gm.cumsum -0.23 -0.09 0.05 0.41
gm.cumavg 0.04 0.01 0.05 0.38
premium -0.21 -0.04 -0.03 0.12
premium.lag -0.18 -0.02 -0.04 0.11
premium.dum 1.00 0.16 0.00 -0.14
GDP.Cons 0.16 1.00 0.00 -0.05
pop_dens 0.00 0.00 1.00 -0.04
categories -0.14 -0.05 -0.04 1.00
> ##### --- (1) --- VISITS --- (1) --- #####
> # Full model:
> v1 <- glm(freq ~ rec + freq.lag + freq.cumavg + gm.lag + gm.cumavg + premium.lag + premium.dum + GDP.Cons +
+ pop_dens + categories, data = dat, family = poisson(link = "log")); vifs(v1)
Call:
glm(formula = freq ~ rec + freq.lag + freq.cumavg + gm.lag +
gm.cumavg + premium.lag + premium.dum + GDP.Cons + pop_dens +
categories, family = poisson(link = "log"), data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.4614 -1.0043 -0.3282 0.3922 8.5629
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.743e-01 7.111e-02 -2.452 0.01421 *
rec -4.454e-01 3.341e-02 -13.331 < 2e-16 ***
freq.lag 1.993e-02 9.930e-03 2.007 0.04479 *
freq.cumavg 8.605e-02 1.082e-02 7.954 1.81e-15 ***
gm.lag 6.246e-05 2.051e-05 3.045 0.00233 **
gm.cumavg -4.561e-05 2.137e-05 -2.134 0.03285 *
premium.lag 2.908e-02 1.342e-02 2.166 0.03028 *
premium.dum 1.305e-01 3.174e-02 4.113 3.91e-05 ***
GDP.Cons 7.333e-02 9.988e-03 7.342 2.11e-13 ***
pop_dens 3.247e-05 1.252e-05 2.593 0.00950 **
categories 9.391e-02 4.667e-03 20.125 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 7057.6 on 2791 degrees of freedom
Residual deviance: 3668.7 on 2781 degrees of freedom
AIC: 8709.3
Number of Fisher Scoring iterations: 6
rec freq.lag freq.cumavg gm.lag gm.cumavg premium.lag premium.dum GDP.Cons pop_dens
FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
categories
FALSE
rec freq.lag freq.cumavg gm.lag gm.cumavg premium.lag premium.dum GDP.Cons pop_dens
1.137372 11.723727 12.465840 10.899398 11.578315 1.557589 1.238092 1.045716 1.151738
categories
1.473479
>
> # (1) Resolve multicollinearity:
> v1a <- update(v1, . ~ . - freq.lag - freq.cumavg)
> v1a <- update(v1, . ~ . - gm.lag - gm.cumavg)
> v1b <- update(v1, . ~ . - freq.lag - gm.lag)
> v1c <- update(v1, . ~ . - freq.lag - gm.cumavg)
> v1d <- update(v1, . ~ . - freq.cumavg - gm.lag)
> v1e <- update(v1, . ~ . - freq.cumavg - gm.cumavg)
> AIC(v1b, v1c, v1d, v1e); BIC(v1b, v1c, v1d, v1e)
df AIC
v1b 9 8741.470
v1c 9 8721.961
v1d 9 8770.447
v1e 9 8783.177
df BIC
v1b 9 8794.881
v1c 9 8775.371
v1d 9 8823.858
v1e 9 8836.588
> # v1c performs best + multicollinearity is resolved
> v1 <- v1c; vifs(v1)
Call:
glm(formula = freq ~ rec + freq.cumavg + gm.lag + premium.lag +
premium.dum + GDP.Cons + pop_dens + categories, family = poisson(link = "log"),
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.8621 -1.0025 -0.3318 0.3996 8.4674
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.438e-01 7.108e-02 -2.024 0.04300 *
rec -4.630e-01 3.334e-02 -13.888 < 2e-16 ***
freq.cumavg 9.205e-02 4.796e-03 19.191 < 2e-16 ***
gm.lag 4.401e-05 9.158e-06 4.806 1.54e-06 ***
premium.lag 4.946e-02 1.221e-02 4.049 5.13e-05 ***
premium.dum 1.552e-01 3.112e-02 4.986 6.17e-07 ***
GDP.Cons 7.276e-02 9.980e-03 7.291 3.08e-13 ***
pop_dens 3.855e-05 1.218e-05 3.165 0.00155 **
categories 9.207e-02 4.647e-03 19.813 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 7057.6 on 2791 degrees of freedom
Residual deviance: 3685.4 on 2783 degrees of freedom
AIC: 8722
Number of Fisher Scoring iterations: 6
rec freq.cumavg gm.lag premium.lag premium.dum GDP.Cons pop_dens categories
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
rec freq.cumavg gm.lag premium.lag premium.dum GDP.Cons pop_dens categories
1.113060 2.406071 2.259919 1.259330 1.190579 1.045856 1.093382 1.476864
> dispersiontest(v1)
Overdispersion test
data: v1
z = 4.1876, p-value = 1.41e-05
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.616171
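The overdispersion test above reports an estimated dispersion of about 1.62 for the Poisson model. A rough cross-check that needs only base R is the Pearson statistic divided by the residual degrees of freedom; values clearly above 1 point in the same direction. This is a different estimator from the one used by dispersiontest, so the two numbers need not coincide. The sketch below uses simulated counts purely for illustration; the variable names and simulation settings are assumptions and this is not the thesis data.

```r
# Simulated overdispersed counts (illustrative assumption, not the thesis data)
set.seed(1)
x <- rnorm(500)
y <- rnbinom(500, mu = exp(0.5 + 0.3 * x), size = 1.5)

# Poisson regression with a log link, the same model family as v1 above
fit <- glm(y ~ x, family = poisson(link = "log"))

# Pearson dispersion: sum of squared Pearson residuals / residual df
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi  # values clearly above 1 suggest overdispersion
```

When such a check indicates overdispersion, a negative binomial specification is a common next step, since it adds a dispersion parameter to the Poisson variance.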
>
> v1a <- glm.nb(freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum + GDP.Cons +
+ pop_dens + categories, data = dat); AIC(v1, v1a); lrtest(v1, v1a)
df AIC
v1 9 8721.961
v1a 10 8506.642
Likelihood ratio test
Model 1: freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum +
GDP.Cons + pop_dens + categories
Model 2: freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum +
GDP.Cons + pop_dens + categories
#Df LogLik Df Chisq Pr(>Chisq)
1 9 -4352.0
2 10 -4243.3 1 217.32 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> # try zero-inflated + hurdle:
> v1b <- zeroinfl(freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum + GDP.Cons + pop_dens + categories |
+ rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+ data = dat, dist = "negbin"); AIC(v1a, v1b)
Warning message:
In sqrt(diag(vc)[np]) : NaNs produced
df AIC
v1a 10 8506.642
v1b 16 8310.695
> v1c <- hurdle(freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum + GDP.Cons + pop_dens + categories |
+ rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+ data = dat, dist = "negbin"); AIC(v1a, v1b, v1c)
Warning message:
In sqrt(diag(vc_count)[kx + 1]) : NaNs produced
df AIC
v1a 10 8506.642
v1b 16 8310.695
v1c 16 8306.612
>
> # We choose the hurdle model. Fitting the model:
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons + I(freq.cumavg.mc^2) |
+ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin")
> v1a <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons |
+ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin")
> v1b <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons |
+ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin")
> v1c <- hurdle(freq ~ freq.cumavg + GDP.Cons |
+ rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+ data = dat, dist = "negbin")
> v1d <- hurdle(freq ~ freq.cumavg |
+ rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+ data = dat, dist = "negbin")
> AIC(v1, v1a, v1b, v1c, v1d) # v1b = best
df AIC
v1 13 8281.024
v1a 12 8412.995
v1b 11 8411.500
v1c 10 8422.506
v1d 9 8454.126
>
> # With or without quadratic term?
> v1a <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) +
+ klantgroep + represent_id | rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin"); vifs(v1a); AIC(v1a)
Call:
hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep + represent_id |
rec + freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.7289 -0.5857 -0.2242 0.3776 8.2492
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.349365 0.182784 -1.911 0.05596 .
freq.cumavg 0.226669 0.016075 14.101 < 2e-16 ***
premium.lag 0.035984 0.017480 2.059 0.03954 *
GDP.Cons 0.099612 0.014442 6.898 5.29e-12 ***
I(freq.cumavg.mc^2) -0.008199 0.001106 -7.415 1.22e-13 ***
klantgroepDrogist -0.160366 0.203000 -0.790 0.42954
klantgroepOverig -0.540194 0.092232 -5.857 4.72e-09 ***
klantgroepSupermarkt A 0.194173 0.109322 1.776 0.07571 .
klantgroepSupermarkt B 0.193783 0.093179 2.080 0.03755 *
klantgroepSupermarkt C 0.263956 0.143097 1.845 0.06510 .
klantgroepSupermarkt D 0.133835 0.079477 1.684 0.09219 .
klantgroepSupermarkt E 0.243169 0.129522 1.877 0.06046 .
klantgroepSupermarkt F 0.087361 0.127500 0.685 0.49323
klantgroepSupermarkt G -0.092162 0.152539 -0.604 0.54572
klantgroepSupermarkt H 0.234399 0.255640 0.917 0.35919
klantgroepSupermarkt Overig 0.271438 0.104106 2.607 0.00913 **
klantgroepTankstation -0.245506 0.104774 -2.343 0.01912 *
represent_id2 0.235488 0.205991 1.143 0.25296
represent_id6 0.200035 0.170934 1.170 0.24190
represent_id13 -0.020956 0.175829 -0.119 0.90513
represent_id14 0.316662 0.197220 1.606 0.10836
represent_id16 0.183518 0.174210 1.053 0.29214
represent_id22 -0.445171 0.247843 -1.796 0.07247 .
represent_id23 0.226813 0.195366 1.161 0.24566
represent_id27 0.258063 0.174276 1.481 0.13867
represent_id34 0.294302 0.210547 1.398 0.16217
represent_id35 0.205256 0.171956 1.194 0.23261
Log(theta) 1.692253 0.140752 12.023 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.428856 0.172448 -2.487 0.012887 *
rec -0.426756 0.063687 -6.701 2.07e-11 ***
freq.lag 0.475956 0.052169 9.123 < 2e-16 ***
freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***
categories 0.281413 0.018888 14.899 < 2e-16 ***
I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 5.4317
Number of iterations in BFGS optimization: 41
Log-likelihood: -4058 on 34 Df
[1] 8183.417
> v1b <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep |
+ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin"); vifs(v1b); AIC(v1b)
Call:
hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep | rec +
freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.6835 -0.5892 -0.2232 0.3806 8.2858
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.233767 0.087743 -2.664 0.00772 **
freq.cumavg 0.238178 0.015664 15.206 < 2e-16 ***
premium.lag 0.037368 0.017560 2.128 0.03334 *
GDP.Cons 0.097508 0.014522 6.715 1.89e-11 ***
I(freq.cumavg.mc^2) -0.008701 0.001065 -8.168 3.13e-16 ***
klantgroepDrogist -0.288771 0.199590 -1.447 0.14795
klantgroepOverig -0.533142 0.087100 -6.121 9.30e-10 ***
klantgroepSupermarkt A 0.252604 0.107073 2.359 0.01832 *
klantgroepSupermarkt B 0.230905 0.090814 2.543 0.01100 *
klantgroepSupermarkt C 0.227836 0.136079 1.674 0.09407 .
klantgroepSupermarkt D 0.156906 0.074444 2.108 0.03506 *
klantgroepSupermarkt E 0.254081 0.122365 2.076 0.03786 *
klantgroepSupermarkt F 0.139156 0.119182 1.168 0.24297
klantgroepSupermarkt G -0.081099 0.150990 -0.537 0.59119
klantgroepSupermarkt H 0.191586 0.255244 0.751 0.45289
klantgroepSupermarkt Overig 0.286725 0.097178 2.951 0.00317 **
klantgroepTankstation -0.224644 0.095998 -2.340 0.01928 *
Log(theta) 1.640681 0.138881 11.814 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.428856 0.172448 -2.487 0.012887 *
rec -0.426756 0.063687 -6.701 2.07e-11 ***
freq.lag 0.475956 0.052169 9.123 < 2e-16 ***
freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***
categories 0.281413 0.018888 14.899 < 2e-16 ***
I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 5.1587
Number of iterations in BFGS optimization: 29
Log-likelihood: -4072 on 24 Df
[1] 8191.474
> v1c <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + represent_id |
+ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin"); vifs(v1c); AIC(v1c)
Call:
hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + represent_id |
rec + freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.6154 -0.5809 -0.2315 0.3769 9.8160
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.621028 0.178797 -3.473 0.000514 ***
freq.cumavg 0.297628 0.014618 20.360 < 2e-16 ***
premium.lag 0.047889 0.018515 2.586 0.009697 **
GDP.Cons 0.100025 0.014958 6.687 2.28e-11 ***
I(freq.cumavg.mc^2) -0.011792 0.001087 -10.850 < 2e-16 ***
represent_id2 0.127078 0.207672 0.612 0.540591
represent_id6 0.190873 0.173805 1.098 0.272116
represent_id13 -0.073649 0.178633 -0.412 0.680123
represent_id14 0.384512 0.199690 1.926 0.054162 .
represent_id16 0.270425 0.175664 1.539 0.123696
represent_id22 -0.412954 0.251403 -1.643 0.100466
represent_id23 0.120824 0.192609 0.627 0.530460
represent_id27 0.261183 0.178527 1.463 0.143471
represent_id34 0.307211 0.213905 1.436 0.150944
represent_id35 0.218052 0.175808 1.240 0.214872
Log(theta) 1.477740 0.132398 11.161 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.428856 0.172448 -2.487 0.012887 *
rec -0.426756 0.063687 -6.701 2.07e-11 ***
freq.lag 0.475956 0.052169 9.123 < 2e-16 ***
freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***
categories 0.281413 0.018888 14.899 < 2e-16 ***
I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 4.383
Number of iterations in BFGS optimization: 27
Log-likelihood: -4110 on 22 Df
[1] 8263.174
> v1d <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) |
+ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin"); vifs(v1d); AIC(v1d)
Call:
hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) | rec + freq.lag +
freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.6337 -0.5786 -0.2336 0.3557 10.9936
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.524283 0.066630 -7.869 3.59e-15 ***
freq.cumavg 0.320577 0.013788 23.251 < 2e-16 ***
premium.lag 0.050382 0.018790 2.681 0.00733 **
GDP.Cons 0.096470 0.015047 6.411 1.44e-10 ***
I(freq.cumavg.mc^2) -0.012687 0.001043 -12.169 < 2e-16 ***
Log(theta) 1.422972 0.130889 10.872 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.428856 0.172448 -2.487 0.012887 *
rec -0.426756 0.063687 -6.701 2.07e-11 ***
freq.lag 0.475956 0.052169 9.123 < 2e-16 ***
freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***
categories 0.281413 0.018888 14.899 < 2e-16 ***
I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 4.1494
Number of iterations in BFGS optimization: 22
Log-likelihood: -4128 on 12 Df
freq.cumavg premium.lag GDP.Cons I(freq.cumavg.mc^2)
TRUE FALSE FALSE TRUE
freq.cumavg premium.lag GDP.Cons I(freq.cumavg.mc^2)
11.574419 1.046349 3.097466 5.128628
[1] 8279.72
> AIC(v1a,v1b,v1c,v1d)
df AIC
v1a 34 8183.417
v1b 24 8191.474
v1c 22 8263.174
v1d 12 8279.720
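The AIC values in this comparison follow from the log-likelihoods and parameter counts printed in the four model summaries, via AIC = 2k - 2 logLik. Because the summaries print the log-likelihood rounded to whole numbers, recomputing from the printed output agrees with the table only to within about one unit. The helper name aic_from_loglik is a hypothetical name used for illustration.

```r
# Hypothetical helper: AIC from a log-likelihood and a parameter count
aic_from_loglik <- function(loglik, k) 2 * k - 2 * loglik

# Rounded log-likelihoods and Df as printed in the four summaries above
aic_from_loglik(c(v1a = -4058, v1b = -4072, v1c = -4110, v1d = -4128),
                c(34, 24, 22, 12))
# 8184 8192 8264 8280, versus 8183.417, 8191.474, 8263.174, 8279.720 reported
```

The ranking is unchanged by the rounding: the model with both klantgroep and represent_id has the lowest AIC despite its 34 parameters.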
>
> # Final model:
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep + represent_id |
+ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+ data = dat, dist = "negbin"); vifs(v1)
Call:
hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep + represent_id |
rec + freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.7289 -0.5857 -0.2242 0.3776 8.2492
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.349365 0.182784 -1.911 0.05596 .
freq.cumavg 0.226669 0.016075 14.101 < 2e-16 ***
premium.lag 0.035984 0.017480 2.059 0.03954 *
GDP.Cons 0.099612 0.014442 6.898 5.29e-12 ***
I(freq.cumavg.mc^2) -0.008199 0.001106 -7.415 1.22e-13 ***
klantgroepDrogist -0.160366 0.203000 -0.790 0.42954
klantgroepOverig -0.540194 0.092232 -5.857 4.72e-09 ***
klantgroepSupermarkt A 0.194173 0.109322 1.776 0.07571 .
klantgroepSupermarkt B 0.193783 0.093179 2.080 0.03755 *
klantgroepSupermarkt C 0.263956 0.143097 1.845 0.06510 .
klantgroepSupermarkt D 0.133835 0.079477 1.684 0.09219 .
klantgroepSupermarkt E 0.243169 0.129522 1.877 0.06046 .
klantgroepSupermarkt F 0.087361 0.127500 0.685 0.49323
klantgroepSupermarkt G -0.092162 0.152539 -0.604 0.54572
klantgroepSupermarkt H 0.234399 0.255640 0.917 0.35919
klantgroepSupermarkt Overig 0.271438 0.104106 2.607 0.00913 **
klantgroepTankstation -0.245506 0.104774 -2.343 0.01912 *
represent_id2 0.235488 0.205991 1.143 0.25296
represent_id6 0.200035 0.170934 1.170 0.24190
represent_id13 -0.020956 0.175829 -0.119 0.90513
represent_id14 0.316662 0.197220 1.606 0.10836
represent_id16 0.183518 0.174210 1.053 0.29214
represent_id22 -0.445171 0.247843 -1.796 0.07247 .
represent_id23 0.226813 0.195366 1.161 0.24566
represent_id27 0.258063 0.174276 1.481 0.13867
represent_id34 0.294302 0.210547 1.398 0.16217
represent_id35 0.205256 0.171956 1.194 0.23261
Log(theta) 1.692253 0.140752 12.023 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.428856 0.172448 -2.487 0.012887 *
rec -0.426756 0.063687 -6.701 2.07e-11 ***
freq.lag 0.475956 0.052169 9.123 < 2e-16 ***
freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***
categories 0.281413 0.018888 14.899 < 2e-16 ***
I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 5.4317
Number of iterations in BFGS optimization: 41
Log-likelihood: -4058 on 34 Df
GVIF Df GVIF^(1/(2*Df))
freq.cumavg TRUE FALSE TRUE
premium.lag FALSE FALSE FALSE
GDP.Cons FALSE FALSE FALSE
I(freq.cumavg.mc^2) TRUE FALSE FALSE
klantgroep TRUE TRUE FALSE
represent_id TRUE TRUE FALSE
GVIF Df GVIF^(1/(2*Df))
freq.cumavg 18.644456 1 4.317923
premium.lag 1.095011 1 1.046428
GDP.Cons 3.264560 1 1.806809
I(freq.cumavg.mc^2) 6.743572 1 2.596839
klantgroep 38.898415 12 1.164789
represent_id 218.292832 10 1.309037
> rm(v1a, v1b, v1c, v1d, v1e)
>
> # --- 4. Obtain final estimates
> v1.est.final <- data.frame(round(v1$coefficients$count, 3))
> v1.est.final <- round(summary(v1)$coef, 3)
> v1.confint <- round(confint(v1), 3)
> v1.est.final; v1.confint
round.v1.coefficients.count..3.
(Intercept) -0.349
freq.cumavg 0.227
premium.lag 0.036
GDP.Cons 0.100
I(freq.cumavg.mc^2) -0.008
klantgroepDrogist -0.160
klantgroepOverig -0.540
klantgroepSupermarkt A 0.194
klantgroepSupermarkt B 0.194
klantgroepSupermarkt C 0.264
klantgroepSupermarkt D 0.134
klantgroepSupermarkt E 0.243
klantgroepSupermarkt F 0.087
klantgroepSupermarkt G -0.092
klantgroepSupermarkt H 0.234
klantgroepSupermarkt Overig 0.271
klantgroepTankstation -0.246
represent_id2 0.235
represent_id6 0.200
represent_id13 -0.021
represent_id14 0.317
represent_id16 0.184
represent_id22 -0.445
represent_id23 0.227
represent_id27 0.258
represent_id34 0.294
represent_id35 0.205
2.5 % 97.5 %
count_(Intercept) -0.708 0.009
count_freq.cumavg 0.195 0.258
count_premium.lag 0.002 0.070
count_GDP.Cons 0.071 0.128
count_I(freq.cumavg.mc^2) -0.010 -0.006
count_klantgroepDrogist -0.558 0.238
count_klantgroepOverig -0.721 -0.359
count_klantgroepSupermarkt A -0.020 0.408
count_klantgroepSupermarkt B 0.011 0.376
count_klantgroepSupermarkt C -0.017 0.544
count_klantgroepSupermarkt D -0.022 0.290
count_klantgroepSupermarkt E -0.011 0.497
count_klantgroepSupermarkt F -0.163 0.337
count_klantgroepSupermarkt G -0.391 0.207
count_klantgroepSupermarkt H -0.267 0.735
count_klantgroepSupermarkt Overig 0.067 0.475
count_klantgroepTankstation -0.451 -0.040
count_represent_id2 -0.168 0.639
count_represent_id6 -0.135 0.535
count_represent_id13 -0.366 0.324
count_represent_id14 -0.070 0.703
count_represent_id16 -0.158 0.525
count_represent_id22 -0.931 0.041
count_represent_id23 -0.156 0.610
count_represent_id27 -0.084 0.600
count_represent_id34 -0.118 0.707
count_represent_id35 -0.132 0.542
zero_(Intercept) -0.767 -0.091
zero_rec -0.552 -0.302
zero_freq.lag 0.374 0.578
zero_freq.cumsum -0.037 -0.012
zero_categories 0.244 0.318
zero_I(categories.mc^2) -0.035 -0.014
> exp(coef(v1))
count_(Intercept) count_freq.cumavg count_premium.lag
0.7051355 1.2544152 1.0366390
count_GDP.Cons count_I(freq.cumavg.mc^2) count_klantgroepDrogist
1.1047424 0.9918340 0.8518321
count_klantgroepOverig count_klantgroepSupermarkt A count_klantgroepSupermarkt B
0.5826352 1.2143067 1.2138325
count_klantgroepSupermarkt C count_klantgroepSupermarkt D count_klantgroepSupermarkt E
1.3020706 1.1432046 1.2752843
count_klantgroepSupermarkt F count_klantgroepSupermarkt G count_klantgroepSupermarkt H
1.0912907 0.9119570 1.2641491
count_klantgroepSupermarkt Overig count_klantgroepTankstation count_represent_id2
1.3118497 0.7823085 1.2655263
count_represent_id6 count_represent_id13 count_represent_id14
1.2214454 0.9792624 1.3725380
count_represent_id16 count_represent_id22 count_represent_id23
1.2014371 0.6407147 1.2545950
count_represent_id27 count_represent_id34 count_represent_id35
1.2944203 1.3421886 1.2278393
zero_(Intercept) zero_rec zero_freq.lag
0.6512537 0.6526225 1.6095523
zero_freq.cumsum zero_categories zero_I(categories.mc^2)
0.9759296 1.3250002 0.9756941
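As a reading aid for the exponentiated coefficients above, the sketch below re-derives two of them from the printed estimates; the values are copied by hand from the summary output and no model is refitted. In the count part (log link) an exponentiated coefficient is a multiplicative factor on the mean of the underlying negative binomial, and in the zero part (logit link) it is an odds ratio for observing a non-zero count. Both readings hold other predictors fixed. For `freq.cumavg` the model also contains a squared mean-centred term, so the factor below is only the linear part of its effect.

```r
# Estimates copied from the printed summary of v1 (nothing is refitted here).
b_freq_cumavg <- 0.226669    # count part, log link
b_rec         <- -0.426756   # zero part, logit link

# Multiplicative change in the count-model mean per unit increase in freq.cumavg
rate_ratio <- exp(b_freq_cumavg)
# Multiplicative change in the odds of a non-zero count per unit increase in rec
odds_ratio <- exp(b_rec)

print(round(c(rate_ratio = rate_ratio, odds_ratio = odds_ratio), 4))
```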
>
> # --- 5. Model performance
> dat$v1 <- fitted(v1)
> dat.v$v1 <- predict(v1, dat.v)
>
> performance(dat$freq, dat$v1)
[1] 0.6380279
[1] 1.691994
[1] 1.027299
> performance(dat.v$freq, dat.v$v1)
[1] 0.5804381
[1] 1.305343
[1] 0.7716979
> ##### --- (3) --- GROSS MARGINS --- (3) --- #####
> # Zero part
> p1 <- glm(pur ~ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   family = binomial(link = "logit"), data = dat); vifs(p1)
Call:
glm(formula = pur ~ rec + freq.lag + freq.cumsum + categories +
I(categories.mc^2), family = binomial(link = "logit"), data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0050 -0.5581 0.3732 0.6523 2.3057
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.428856 0.172449 -2.487 0.012887 *
rec -0.426756 0.063687 -6.701 2.07e-11 ***
freq.lag 0.475956 0.052168 9.124 < 2e-16 ***
freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***
categories 0.281413 0.018888 14.899 < 2e-16 ***
I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3514.7 on 2791 degrees of freedom
Residual deviance: 2371.3 on 2786 degrees of freedom
AIC: 2383.3
Number of Fisher Scoring iterations: 6
rec freq.lag freq.cumsum categories I(categories.mc^2)
FALSE FALSE FALSE FALSE FALSE
rec freq.lag freq.cumsum categories I(categories.mc^2)
1.446646 1.603007 1.331112 1.222432 1.046219
> dat$p1 <- predict(p1, dat, type = "response"); dat.v$p1 <- predict(p1, dat.v, type = "response")
>
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons + categories +
+   I(freq.cumavg.mc**2) + klantgroep + represent_id |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin")
> dat$v1 <- predict(v1, dat, type = "response"); dat.v$v1 <- predict(v1, dat.v, type = "response")
>
> # Full model
> gm1 <- lm(log(gm+1) ~ log(rec) + log(freq+1) + log(freq.lag+1) + log(gm.lag+1) + log(gm.cumavg+1) +
+   log(premium+1) + premium.dum + log(GDP.Cons+2) + log(pop_dens) + log(categories),
+   data = subset(dat, pur == 1)); vifs(gm1) # no multicollinearity issues
Call:
lm(formula = log(gm + 1) ~ log(rec) + log(freq + 1) + log(freq.lag +
1) + log(gm.lag + 1) + log(gm.cumavg + 1) + log(premium +
1) + premium.dum + log(GDP.Cons + 2) + log(pop_dens) + log(categories),
data = subset(dat, pur == 1))
Residuals:
Min 1Q Median 3Q Max
-7.4544 -0.3708 0.2998 0.9402 4.3774
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.31171 0.33436 0.932 0.35133
log(rec) 0.57145 0.18763 3.046 0.00235 **
log(freq + 1) 2.23260 0.10420 21.426 < 2e-16 ***
log(freq.lag + 1) -0.76847 0.11160 -6.886 7.81e-12 ***
log(gm.lag + 1) 0.17233 0.02869 6.007 2.27e-09 ***
log(gm.cumavg + 1) 0.02032 0.04407 0.461 0.64478
log(premium + 1) 0.25475 0.13203 1.930 0.05382 .
premium.dum 0.48776 0.08361 5.833 6.38e-09 ***
log(GDP.Cons + 2) 0.10793 0.10675 1.011 0.31210
log(pop_dens) 0.02571 0.03426 0.750 0.45309
log(categories) 0.69955 0.08446 8.283 2.26e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.628 on 1878 degrees of freedom
Multiple R-squared: 0.3785, Adjusted R-squared: 0.3752
F-statistic: 114.4 on 10 and 1878 DF, p-value: < 2.2e-16
log(rec) log(freq + 1) log(freq.lag + 1) log(gm.lag + 1) log(gm.cumavg + 1) log(premium + 1)
FALSE FALSE FALSE FALSE FALSE FALSE
premium.dum log(GDP.Cons + 2) log(pop_dens) log(categories)
FALSE FALSE FALSE FALSE
log(rec) log(freq + 1) log(freq.lag + 1) log(gm.lag + 1) log(gm.cumavg + 1) log(premium + 1)
2.094514 1.876297 3.504090 3.522942 2.536442 1.199199
premium.dum log(GDP.Cons + 2) log(pop_dens) log(categories)
1.227559 1.060033 1.038463 1.368221
>
> # Now try with customer intercept:
> gm2 <- lm(log(gm+1) ~ log(rec) + log(freq+1) + log(freq.lag+1) + log(gm.lag+1) + log(gm.cumavg+1) +
+   log(premium+1) + premium.dum + log(GDP.Cons+2) + cust,
+   data = subset(dat, pur == 1)) # Delete categories & pop_dens = collinear with cust
>
> # For testing the assumptions we use the plm package (same estimates, but convenient for testing assumptions)
> gm.pooled <- plm(formula(gm1), data = subset(dat, pur == 1), model = "pooling")
> gm.fixed <- plm(log(gm+1) ~ log(rec) + log(freq+1) + log(freq.lag+1) + log(gm.lag+1) + log(gm.cumavg+1) +
+   log(premium+1) + premium.dum + log(GDP.Cons+2), data = subset(dat, pur == 1), model = "within")
> gm.random <- plm(formula(gm1), data = subset(dat, pur == 1), random.method="swar", model="random")
>
> pFtest(gm.fixed, gm.pooled) # Significant: choose FE model over pooled model
F test for individual effects
data: log(gm + 1) ~ log(rec) + log(freq + 1) + log(freq.lag + 1) + ...
F = 2.0819, df1 = 346, df2 = 1532, p-value < 2.2e-16
alternative hypothesis: significant effects
> plmtest(gm.pooled, effect="individual") # Significant: choose RE model over pooled model
Lagrange Multiplier Test - (Honda) for unbalanced panels
data: formula(gm1)
normal = 3.9349, p-value = 4.162e-05
alternative hypothesis: significant effects
> phtest(gm.fixed, gm.random) # Significant: differences are endogenous to our predictors = use FE model
Hausman Test
data: log(gm + 1) ~ log(rec) + log(freq + 1) + log(freq.lag + 1) + ...
chisq = 631.59, df = 8, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent
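The first of the three tests above, the F test for individual effects, compares a model with one common intercept against a model with a separate intercept per customer. The sketch below shows the same kind of comparison in base R on simulated data, using two `lm()` fits and `anova()`. The data, the variable names (`id`, `x`, `y`) and the effect sizes are made up for illustration, and the sketch does not cover the Lagrange multiplier or Hausman tests.

```r
# Illustration only: simulated panel, not the thesis data.
set.seed(42)
n_units   <- 20
n_periods <- 6
id    <- factor(rep(seq_len(n_units), each = n_periods))
alpha <- rnorm(n_units, sd = 2)[as.integer(id)]      # unit-specific intercepts
x     <- rnorm(n_units * n_periods) + 0.5 * alpha    # regressor correlated with the unit effect
y     <- 1 + 0.8 * x + alpha + rnorm(n_units * n_periods)

pooled <- lm(y ~ x)        # one common intercept
fixed  <- lm(y ~ x + id)   # one dummy per unit (fixed effects)

# Joint F test of the unit dummies
ftest   <- anova(pooled, fixed)
p_value <- ftest[["Pr(>F)"]][2]
print(ftest)
```

A small p-value here favours the model with unit-specific intercepts, which mirrors the reading of the `pFtest` output in the transcript.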
>
> # Further fitting our model:
> gm2a <- update(gm2, .~. - log(rec))
> gm2b <- update(gm2a, .~. - log(GDP.Cons+2))
> gm2c <- update(gm2b, .~. - log(gm.lag+1))
> gm2d <- update(gm2c, .~. - log(premium+1))
> AIC(gm2, gm2a, gm2b, gm2c, gm2d) # Best fit: gm2c
      df      AIC
gm2  358 7179.175
gm2a 357 7177.317
gm2b 356 7176.115
gm2c 355 7175.576
gm2d 354 7187.305
> # Multiple R-squared: 0.5768, Adjusted R-squared: 0.4794
> gm2e <- update(gm2c, .~. + I(freq.lag.mc^2)); AIC(gm2c, gm2e) # 0.4809
      df      AIC
gm2c 355 7175.576
gm2e 356 7170.861
>
> # FINAL MODEL:
> gm2 <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   premium.dum + cust, data = subset(dat, pur == 1))
>
> # Omitted variable bias due to exclusion of product returns?
> gm2b <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   premium.dum + log(returns + 1) + cust, data = subset(dat, pur == 1)); vifs(gm2b)
Call:
lm(formula = log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) +
log(gm.cumavg + 1) + log(premium + 1) + premium.dum + log(returns +
1) + cust, data = subset(dat, pur == 1))
Residuals:
Min 1Q Median 3Q Max
-7.0244 -0.3889 0.0724 0.6516 4.5003
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.961926 1.439745 3.446 0.000583 ***
log(freq + 1) 1.823209 0.107100 17.023 < 2e-16 ***
log(freq.lag + 1) -0.249912 0.086190 -2.900 0.003790 **
log(gm.cumavg + 1) -0.243653 0.060333 -4.038 5.65e-05 ***
log(premium + 1) 0.209589 0.134478 1.559 0.119313
premium.dum 0.029012 0.085791 0.338 0.735279
log(returns + 1) -0.330481 0.022297 -14.822 < 2e-16 ***
…
[ reached getOption("max.print") -- omitted 155 rows ]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.39 on 1534 degrees of freedom
Multiple R-squared: 0.6298, Adjusted R-squared: 0.5443
F-statistic: 7.371 on 354 and 1534 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
log(freq + 1) FALSE FALSE FALSE
log(freq.lag + 1) FALSE FALSE FALSE
log(gm.cumavg + 1) TRUE FALSE FALSE
log(premium + 1) FALSE FALSE FALSE
premium.dum FALSE FALSE FALSE
log(returns + 1) FALSE FALSE FALSE
cust TRUE TRUE FALSE
GVIF Df GVIF^(1/(2*Df))
log(freq + 1) 2.717965 1 1.648625
log(freq.lag + 1) 2.865585 1 1.692804
log(gm.cumavg + 1) 6.517957 1 2.553029
log(premium + 1) 1.705863 1 1.306087
premium.dum 1.771953 1 1.331147
log(returns + 1) 1.703809 1 1.305300
cust 26.665094 348 1.004729
> gm2c <- update(gm2b, .~. - premium.dum); AIC(gm2b, gm2c)
      df      AIC
gm2b 356 6924.741
gm2c 355 6922.881
> gm2d <- update(gm2c, .~. - log(premium + 1)); AIC(gm2b, gm2c, gm2d)
      df      AIC
gm2b 356 6924.741
gm2c 355 6922.881
gm2d 354 6923.796
>
> # Testing for sample selection bias by estimating a heckit model to obtain inverse mills ratio:
> gm.heckit <- heckit(formula(p1), formula(gm2c), data = dat, method= "2step"); summary(gm.heckit)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
2792 observations (903 censored and 1889 observed)
363 free parameters (df = 2430)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.190195 0.097764 -1.945 0.051837 .
rec -0.271961 0.035344 -7.695 2.05e-14 ***
freq.lag 0.251860 0.026636 9.456 < 2e-16 ***
freq.cumsum -0.012305 0.003593 -3.425 0.000626 ***
categories 0.163162 0.010455 15.606 < 2e-16 ***
I(categories.mc^2) -0.015068 0.002941 -5.124 3.22e-07 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9399186 1.3651958 3.618 0.000302 ***
log(freq + 1) 1.8269359 0.0962099 18.989 < 2e-16 ***
log(freq.lag + 1) -0.2268806 0.1488771 -1.524 0.127652
log(gm.cumavg + 1) -0.2459647 0.0543327 -4.527 6.27e-06 ***
log(premium + 1) 0.1921051 0.1136542 1.690 0.091106 .
log(returns + 1) -0.3343286 0.0176285 -18.965 < 2e-16 ***
…
[ reached getOption("max.print") -- omitted 154 rows ]
Multiple R-Squared:0.6298, Adjusted R-Squared:0.5443
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio 0.04960 0.39377 0.126 0.9
sigma 1.25336 NA NA NA
rho 0.03957 NA NA NA
--------------------------------------------
> # IMR: not significant: continue with original model
>
> gm.fixed <- plm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   log(returns + 1), data = subset(dat, pur == 1), model = "within")
>
> pdwtest(gm.fixed) # Durbin-Watson: insignificant: no autocorrelation
Durbin-Watson test for serial correlation in panel models
data: log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) + log(returns +
1)
DW = 2.1932, p-value = 1
alternative hypothesis: serial correlation in idiosyncratic errors
> bptest(gm.fixed) # Breusch-Pagan: significant heteroskedasticity
studentized Breusch-Pagan test
data: gm.fixed
BP = 36.594, df = 5, p-value = 7.222e-07
> shapiro.test(gm.fixed$residuals) # Not normally distributed
Shapiro-Wilk normality test
data: gm.fixed$residuals
W = 0.89444, p-value < 2.2e-16
> lillie.test(gm.fixed$residuals)
Lilliefors (Kolmogorov-Smirnov) normality test
data: gm.fixed$residuals
D = 0.13141, p-value < 2.2e-16
>
> # Obtain final estimates with robust standard errors
> gm.est <- data.frame(coeftest(gm.fixed, vcov. = vcovHC, method = "arellano")[1:5,1:4])
> temp <- data.frame(exp(gm.est[,1]))
> alpha_hat_star <- gm.est[,1]
> sd_alpha_hat_star <- gm.est[,2]
> alpha_hat <- (exp(alpha_hat_star)-1) * exp(-0.5*(sd_alpha_hat_star^2))
> temp <- round(data.frame(alpha_hat), 3)
> gm.est; temp
Estimate Std..Error t.value Pr...t..
log(freq + 1) 1.8264566 0.14161467 12.897369 3.286461e-36
log(freq.lag + 1) -0.2430427 0.08929350 -2.721841 6.565345e-03
log(gm.cumavg + 1) -0.2454251 0.06954109 -3.529211 4.290972e-04
log(premium + 1) 0.1933578 0.14396427 1.343096 1.794395e-01
log(returns + 1) -0.3341422 0.02134234 -15.656307 2.269688e-51
alpha_hat
1 5.160
2 -0.215
3 -0.217
4 0.211
5 -0.284
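The `alpha_hat` column can be checked by hand. The sketch below applies the same transformation as in the transcript to the robust estimates and standard errors copied from the printed `gm.est` table; nothing is refitted, and the vector names are added only for readability.

```r
# Robust estimates and standard errors copied from the printed gm.est table.
alpha_hat_star <- c(
  "log(freq + 1)"      =  1.8264566,
  "log(freq.lag + 1)"  = -0.2430427,
  "log(gm.cumavg + 1)" = -0.2454251,
  "log(premium + 1)"   =  0.1933578,
  "log(returns + 1)"   = -0.3341422
)
sd_alpha_hat_star <- c(0.14161467, 0.08929350, 0.06954109, 0.14396427, 0.02134234)

# Same transformation as in the transcript:
# (exp(estimate) - 1), scaled by exp(-0.5 * squared standard error)
alpha_hat <- (exp(alpha_hat_star) - 1) * exp(-0.5 * sd_alpha_hat_star^2)
print(round(alpha_hat, 3))
```

The rounded values agree with the `alpha_hat` column printed above, so the table can be regenerated from the coefficient table alone.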
>
> # Predictive performance
> gm2 <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   log(returns+1) + cust, data = subset(dat, pur == 1)); vifs(gm2)
Call:
lm(formula = log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) +
log(gm.cumavg + 1) + log(premium + 1) + log(returns + 1) +
cust, data = subset(dat, pur == 1))
Residuals:
Min 1Q Median 3Q Max
-7.0267 -0.3883 0.0764 0.6503 4.4820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.994637 1.436078 3.478 0.000519 ***
log(freq + 1) 1.826457 0.106638 17.128 < 2e-16 ***
log(freq.lag + 1) -0.243043 0.083738 -2.902 0.003756 **
log(gm.cumavg + 1) -0.245425 0.060087 -4.084 4.65e-05 ***
log(premium + 1) 0.193358 0.125585 1.540 0.123850
log(returns + 1) -0.334142 0.019487 -17.147 < 2e-16 ***
…
[ reached getOption("max.print") -- omitted 154 rows ]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.39 on 1535 degrees of freedom
Multiple R-squared: 0.6297, Adjusted R-squared: 0.5446
F-statistic: 7.396 on 353 and 1535 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
log(freq + 1) FALSE FALSE FALSE
log(freq.lag + 1) FALSE FALSE FALSE
log(gm.cumavg + 1) TRUE FALSE FALSE
log(premium + 1) FALSE FALSE FALSE
log(returns + 1) FALSE FALSE FALSE
cust TRUE TRUE FALSE
GVIF Df GVIF^(1/(2*Df))
log(freq + 1) 2.696111 1 1.641984
log(freq.lag + 1) 2.706437 1 1.645125
log(gm.cumavg + 1) 6.468767 1 2.543377
log(premium + 1) 1.488558 1 1.220065
log(returns + 1) 1.302189 1 1.141135
cust 22.371575 348 1.004475
> dat$gm2 <- (exp(predict(gm2, dat)) - 1) * dat$p1; dat.v$gm2 <- (exp(predict(gm2, dat.v))-1) * dat.v$p1
>
> mean(dat$gm); sd(dat$gm); mean(dat$gm2); sd(dat$gm2)
[1] 359.5807
[1] 878.883
[1] 279.6916
[1] 731.5249
> mean(dat.v$gm); sd(dat.v$gm); mean(dat.v$gm2); sd(dat.v$gm2)
[1] 244.5904
[1] 648.022
[1] 193.8051
[1] 575.2432
> performance(dat$gm, dat$gm2)
[1] 0.4223258
[1] 521.0698
[1] 184.9897
> performance(dat.v$gm, dat.v$gm2)
[1] 0.4977186
[1] 506.8427
[1] 175.2225
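The gross-margin predictions above combine two parts: a logit model for whether a purchase occurs (`p1`) and a log-linear model for the margin given a purchase (`gm2`). The following sketch repeats that construction on simulated data so the mechanics can be followed in isolation. The data-generating values and the names `part1`, `part2`, `p_hat` and `gm_hat` are assumptions made for the illustration.

```r
# Illustration only: simulated data, not the thesis data.
set.seed(7)
n   <- 500
x   <- rnorm(n)
pur <- rbinom(n, 1, plogis(0.3 + 0.8 * x))                   # 1 = purchase made
gm  <- ifelse(pur == 1, exp(3 + 0.5 * x + rnorm(n)) - 1, 0)  # margin only when purchasing
d   <- data.frame(x, pur, gm)

# Part 1: probability of a purchase
part1 <- glm(pur ~ x, family = binomial(link = "logit"), data = d)
# Part 2: log margin, estimated on purchasers only
part2 <- lm(log(gm + 1) ~ x, data = subset(d, pur == 1))

# Combine as in the transcript: back-transformed conditional prediction
# multiplied by the predicted purchase probability
p_hat  <- predict(part1, d, type = "response")
gm_hat <- (exp(predict(part2, d)) - 1) * p_hat

print(c(observed_mean = mean(d$gm), predicted_mean = mean(gm_hat)))
```

Exponentiating a log-scale prediction without a correction term tends to understate the conditional mean, which may be one reason the predicted means in the transcript lie below the observed ones.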
>
>
> # Inspect fixed effects
> gm.fe <- round(data.frame(summary(fixef(gm.fixed))),3)
> fx_level_robust1 <- fixef(gm.fixed, vcov = vcovHC(gm.fixed))
> gm.fe.sum <- round(data.frame(summary(fx_level_robust1)),3)
> gm.fe.sum$cust <- row.names(gm.fe.sum)
> dat$cust <- as.character(dat$cust); dat.v$cust <- as.character(dat.v$cust)
> dat <- left_join(dat, gm.fe.sum[,c("Estimate", "cust")], by=c("cust"))
> within_intercept(gm.fixed, vcov = vcovHC)
(overall_intercept)
4.767096
attr(,"se")
[1] 0.4183012
> within_intercept(gm.fixed, vcov = function(x) vcovHC(x, method="arellano", type="HC0"))
(overall_intercept)
4.767096
attr(,"se")
[1] 0.4183012
APPENDIX C: R-CODE CUSTOMER PROFITABILITY
> rm(list = ls())
> setwd(" ")
> library(dplyr)
> library(DescTools)
> library(Metrics)
> library(pscl)
> library(ggplot2)
> dat <- read.csv("DatasetPlusReturns.csv", header = TRUE, sep=",")
> dat$cust <- as.factor(dat$cust)
> dat$represent_id <- as.factor(dat$represent_id)
> dat$date <- as.numeric(dat$date)
> dat <- arrange(dat, date, cust)
> dat.v <- filter(dat, date >= 9)
> dat <- filter(dat, date < 9)
> p1 <- glm(pur ~ rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   family = binomial(link = "logit"), data = dat)
> dat$p1 <- predict(p1, dat, type = "response"); dat.v$p1 <- predict(p1, dat.v, type = "response")
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons + categories +
+   I(scale(freq.cumavg, center = TRUE)**2) + klantgroep + represent_id |
+   rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   data = dat, dist = "negbin")
> dat$v1 <- fitted(v1); dat.v$v1 <- predict(v1, dat.v)
> gm1 <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   log(returns+1) + cust, data = subset(dat, pur == 1))
> dat$gm1 <- (exp(predict(gm1, dat)) - 1) * dat$p1; dat.v$gm1 <- (exp(predict(gm1, dat.v))-1) * dat.v$p1
> rm(p1, v1, gm1)
> dat <- rbind(dat, dat.v); rm(dat.v)
> cp <- dplyr::select(dat, cust, date, freq, gm, v1, gm1, p1)
> cp[is.na(cp)] <- 0
> cp$holdout <- ifelse(cp$date >= 9, 1, 0)
> costs <- 70.39 # Give here the value for costs per visit
> cp <- mutate(cp, cp = gm - (freq * costs), pred = gm1 - (v1 * costs))
> acc <- function(t) {
+   temp <- filter(cp, date %in% t)
+   print("MAE"); print(MAE(temp$pred, temp$cp))
+   print("RAE"); print(rae(temp$cp, temp$pred))
+   print("RMSE"); print(RMSE(temp$pred, temp$cp))
+ }
> print("Estimation: "); acc(1:8); print("Validation: "); acc(9:11)
[1] "Estimation: "
[1] "MAE"
[1] 204.0338
[1] "RAE"
[1] 0.5790955
[1] "RMSE"
[1] 511.1327
[1] "Validation: "
[1] "MAE"
[1] 187.1082
[1] "RAE"
[1] 0.6459613
[1] "RMSE"
[1] 521.924
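To make the customer profitability calculation and the three accuracy measures concrete, the sketch below applies them to three made-up customer-quarters. Only `costs` is taken from the transcript. The toy values and the helpers `mae_toy`, `rmse_toy` and `rae_toy` are hypothetical base-R stand-ins, written under the usual definitions of these measures, for the `MAE`, `RMSE` and `rae` functions loaded from DescTools and Metrics.

```r
costs <- 70.39  # cost per visit, as in the transcript

# Made-up observed and predicted values for three customer-quarters.
toy <- data.frame(
  gm   = c(100, 0, 250),          # observed gross margin
  freq = c(1, 0, 2),              # observed number of visits
  gm1  = c(90.39, 10, 240.78),    # predicted gross margin
  v1   = c(1, 0, 2)               # predicted number of visits
)

# Customer profitability = gross margin minus visit costs
toy$cp   <- toy$gm  - toy$freq * costs
toy$pred <- toy$gm1 - toy$v1   * costs

# Base-R stand-ins for the accuracy measures
mae_toy  <- function(obs, pred) mean(abs(obs - pred))
rmse_toy <- function(obs, pred) sqrt(mean((obs - pred)^2))
rae_toy  <- function(obs, pred) sum(abs(obs - pred)) / sum(abs(obs - mean(obs)))

mae  <- mae_toy(toy$cp, toy$pred)
rmse <- rmse_toy(toy$cp, toy$pred)
rae  <- rae_toy(toy$cp, toy$pred)
print(c(MAE = mae, RMSE = rmse, RAE = rae))
```

An RAE below 1 means the predictions have a smaller total absolute error than always predicting the mean of the observed values, which is how the 0.58 and 0.65 in the transcript can be read.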
> ##### --- CP SEGMENTS
> # Compare t6-9 observed with t10-12 observed and predicted
> cp$error <- abs(cp$pred - cp$cp)
> cp.t6_9 <- cp %>% filter(date %in% 5:8) %>% group_by(cust) %>% summarise(v.avg.1= mean(freq),
+   gm.avg.1 = mean(gm), cp.avg.1 = mean(cp), cp.sd.1 = sd(cp)) %>% dplyr::select(cust, v.avg.1:cp.sd.1)
> cp.t10_12 <- cp %>% filter(date %in% 9:11) %>% group_by(cust) %>% summarise(v.avg.v = mean(freq),
+   gm.avg.v = mean(gm), cp.avg.v = mean(cp), cp.sd.v = sd(cp), pred.avg.v = mean(pred),
+   error.avg.v = mean(error), mad.sum = sum(error)) %>% dplyr::select(cust, v.avg.v:mad.sum)
> cp.tot <- left_join(cp.t6_9, cp.t10_12, by = "cust"); rm(cp.t6_9, cp.t10_12)
>
> # Make profitability segments
> add.segment <- function(var) {
+   new.var <- ifelse(var > quantile(var, probs = 0.75), 3,
+     ifelse(var < quantile(var, probs = 0.25), 1, 2))
+   return(new.var)
+ }
>
> # Since the zero observations disturb our analysis, we delete the customers that did not make any
> # purchase in the year prior to our validation period (70 observations of which only 3 customers eventually
> # did make a purchase in the validation period)
> cp.tot <- filter(cp.tot, cp.avg.1 != 0)
> cp.tot$segment.cp.1 <- add.segment(cp.tot$cp.avg.1); table(cp.tot$segment.cp.1)
1 2 3
69 138 69
> cp.tot$segment.cp.v <- add.segment(cp.tot$cp.avg.v); table(cp.tot$segment.cp.v)
1 2 3
69 138 69
> cp.tot$segment.pred.v <- add.segment(cp.tot$pred.avg.v); table(cp.tot$segment.pred.v)
1 2 3
69 138 69
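The identical 69 / 138 / 69 split in all three tables follows from how `add.segment` works: it cuts each variable at its own first and third quartile. The sketch below applies the same function to a made-up vector of eight profitability values, `toy_cp`, which is hypothetical.

```r
# Same logic as add.segment in the transcript: bottom quartile = 1,
# top quartile = 3, everything in between = 2.
add.segment <- function(var) {
  ifelse(var > quantile(var, probs = 0.75), 3,
         ifelse(var < quantile(var, probs = 0.25), 1, 2))
}

# Made-up average profitability for eight customers.
toy_cp <- c(-50, -10, 0, 20, 40, 80, 150, 900)
seg    <- add.segment(toy_cp)

print(data.frame(toy_cp, seg))
print(table(seg))
```

Because the cut-offs are recomputed for every variable, segment membership is relative to the other customers in the same period rather than to a fixed profitability level.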
>
> # We now make confusion matrices
> confusion.matrix <- function(temp) {
+   temp2 <- data.frame(temp[1:3], temp[4:6], temp[7:9]); temp2 <- round(temp2/sum(temp2),3)
+   return(temp2)
+ }
> shifts.observed <- confusion.matrix(table(cp.tot$segment.cp.1, cp.tot$segment.cp.v))
> shifts.predicted <- confusion.matrix(table(cp.tot$segment.cp.1, cp.tot$segment.pred.v))
> observed.predicted <- confusion.matrix(table(cp.tot$segment.cp.v, cp.tot$segment.pred.v))
> summary(cp.tot$cp.avg.1); summary(cp.tot$cp.avg.v); summary(cp.tot$pred.avg.v)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-139.650 -7.434 45.457 216.582 211.411 5659.632
Min. 1st Qu. Median Mean 3rd Qu. Max.
-164.243 -0.082 0.000 199.905 183.306 4190.093
Min. 1st Qu. Median Mean 3rd Qu. Max.
-164.92 -34.62 -15.74 156.22 116.62 5191.38
>
> # 3 segments: positive, negative, zero (allow +/- 10 for predicted zero)
> cp.tot$seg.cp.pos <- ifelse(cp.tot$cp.avg.v == 0, 0, ifelse(cp.tot$cp.avg.v > 0, 1, -1))
> cp.tot$seg.pred.pos <- ifelse(cp.tot$pred.avg.v < -10, -1, ifelse(cp.tot$pred.avg.v > 10, 1, 0))
> observed.predicted.positive <- confusion.matrix(table(cp.tot$seg.cp.pos, cp.tot$seg.pred.pos))
>
>
> ##### --- VOLATILITY
> # % change in CP from Q6-Q9 to Q10-Q12
> cp.tot <- mutate(cp.tot, cp.change = cp.avg.v - cp.avg.1, cp.change.ratio = cp.change / cp.avg.1,
+   pred.change = pred.avg.v - cp.avg.1, pred.change.ratio = pred.change / cp.avg.1)
> # Relative change is not reliable!
>
> # CP.NoVol = gm.avg.1 * p1 - v1 * costs
> cp.novol <- cp %>% filter(date >= 9)
> cp.novol <- left_join(cp.novol, cp.tot[c("cust", "gm.avg.1", "v.avg.1", "cp.avg.1")], by=c("cust"))
> cp.novol$gm.avg.1[is.na(cp.novol$gm.avg.1)] <- 0
> cp.novol$v.avg.1[is.na(cp.novol$v.avg.1)] <- 0
> cp.novol$cp.avg.1[is.na(cp.novol$cp.avg.1)] <- 0
> cp.novol$pred.novol <- cp.novol$gm.avg.1 * cp.novol$p1 - cp.novol$freq * costs
> cp.novol$pred.novol <- cp.novol$gm.avg.1 * cp.novol$p1 - cp.novol$v1 * costs
> cp.novol$pred.novol <- cp.novol$gm.avg.1 - cp.novol$v1 * costs
> cp.novol$pred.novol <- cp.novol$gm.avg.1 * cp.novol$p1
> cp.novol$pred.novol <- cp.novol$cp.avg.1 * cp.novol$p1
> cp.novol$pred.novol <- cp.novol$cp.avg.1
> cp.novol$pred.novol <- cp.novol$pred
> # MAE and t-test
> MAE(cp.novol$pred.novol, cp.novol$cp)
[1] 187.1082
> cp.novol$MAE.main <- abs(cp.novol$cp - cp.novol$pred)
> cp.novol$MAE.simple <- abs(cp.novol$cp - cp.novol$pred.novol)
> var.test(cp.novol$MAE.main, cp.novol$MAE.simple)
F test to compare two variances
data: cp.novol$MAE.main and cp.novol$MAE.simple
F = 1, num df = 1046, denom df = 1046, p-value = 1
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8857959 1.1289283
sample estimates:
ratio of variances
1
> t.test(cp.novol$MAE.main, cp.novol$MAE.simple, var.equal = TRUE)
Two Sample t-test
data: cp.novol$MAE.main and cp.novol$MAE.simple
t = 0, df = 2092, p-value = 1
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-41.78155 41.78155
sample estimates:
mean of x mean of y
187.1082 187.1082
>
> # What if we take out the customers that made no purchase in Q5-Q8?
> cp.novol <- filter(cp.novol, cp.avg.1 != 0)
> cp.novol$pred.simple <- cp.novol$cp.avg.1 * cp.novol$p1
>
> cp.novol$MAE.main <- abs(cp.novol$cp - cp.novol$pred)
> cp.novol$MAE.simple <- abs(cp.novol$cp - cp.novol$pred.simple)
>
> var.test(cp.novol$MAE.main, cp.novol$MAE.simple)
F test to compare two variances
data: cp.novol$MAE.main and cp.novol$MAE.simple
F = 1.1481, num df = 827, denom df = 827, p-value = 0.04714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.001748 1.315943
sample estimates:
ratio of variances
1.148148
> t.test(cp.novol$MAE.main, cp.novol$MAE.simple, var.equal = FALSE)
Welch Two Sample t-test
data: cp.novol$MAE.main and cp.novol$MAE.simple
t = -0.21362, df = 1646.2, p-value = 0.8309
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-55.01782 44.21045
sample estimates:
mean of x mean of y
232.1723 237.5760
> ##### --- DIFFERENCES BETWEEN CUSTOMERS
> customers <- read.csv("CP_Customers.csv", header = TRUE, sep=",")
> cp.tot$cust <- as.numeric(as.character(cp.tot$cust))
> customers <- left_join(cp.tot, customers, by=c("cust"))
> cust.by.group <- customers %>% group_by(klantgroep) %>% summarise(n(), sum(cp.avg.1), sum(v.avg.1),
+   sum(gm.avg.1), mean(cp.avg.v), sd(cp.avg.v),
+   mean(pred.avg.v), sd(pred.avg.v), mean(segment.cp.1), mean(segment.cp.v),
+   mean(segment.pred.v), mean(cp.change), mean(v.avg.1), sd(v.avg.1),
+   mean(gm.avg.1), sd(gm.avg.1), mean(cp.avg.1), sd(cp.avg.1), mean(v.avg.v),
+   sd(v.avg.v), mean(gm.avg.v), sd(gm.avg.v), mean(premium.tot), max(represent_id),
+   mean(categories))
> cust.by.group[,3:26] <- round(cust.by.group[,3:26],2)
> # Boxplots for observed vs predicted cp, gm, and v per group/rep
>
> library(RColorBrewer)
> display.brewer.all()
> library(extrafont)
> loadfonts(device = "win")
> par(family = "Dubai Light")
>
> customers$klantgroep <- factor(customers$klantgroep,levels(customers$klantgroep)[c(4:12,3,1:2,13)])
> levels(customers$klantgroep) <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "O", "X", "Y", "Z")
> customers$represent_id <- as.factor(customers$represent_id)
> levels(customers$represent_id) <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
>
> ggplot(data=customers, aes(x=klantgroep, y=cp.avg.1, color=klantgroep)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-150, 6000), "Observed CP in Q5-Q8") + theme_bw()
> # Drop 3 outliers:
> ggplot(data=customers, aes(x=klantgroep, y=cp.avg.1, color=klantgroep)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-150, 1500), "Observed CP (Q5-Q8)") +
+   scale_x_discrete("Customer Group") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0))) +
+   theme(axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 0, l = 0))) +
+   theme(legend.position="none")
>
> # Boxplot MAE per klantgroep
> ggplot(data=customers, aes(x=klantgroep, y=mad.sum, color=klantgroep)) + geom_boxplot() +
+   scale_y_continuous(limits = c(0, 7500), "Absolute Error") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10),
+     axis.title.x = element_blank()) +
+   theme(legend.position="none")
>
> # All-in-one plot
> temp <- select(customers, cust, klantgroep, cp.avg.1, represent_id); temp$period <- "Q5-Q8 Observed"
> temp2 <- select(customers, cust, klantgroep, cp.avg.v, represent_id); temp2$period <- "Q9-Q11 Observed"
> temp3 <- select(customers, cust, klantgroep, pred.avg.v, represent_id); temp3$period <- "Q9-Q11 Predicted"
> colnames(temp)[3] <- "cp"; colnames(temp2)[3] <- "cp"; colnames(temp3)[3] <- "cp"
> temp <- rbind(temp, temp2, temp3)
> colnames(temp)[5] <- "Period"
> # Compare customer groups
> ggplot(data=temp, aes(x=klantgroep, y=cp, color=Period)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-175, 2000), "Customer Profitability") +
+   scale_x_discrete("Customer Group") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10), axis.title.x = element_blank()) +
+   theme(legend.position="bottom") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
> #+ scale_fill_grey(start = 0.25, end = 0.75, na.value = "red")
>
> # Compare represent_id
> ggplot(data=temp, aes(x=represent_id, y=cp, color=Period)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-175, 2000), "Customer Profitability") +
+   scale_x_discrete("Sales Representative") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10), axis.title.x = element_blank()) +
+   theme(legend.position="bottom") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
> customers %>% group_by(represent_id) %>% summarise(n())
# A tibble: 11 x 2
represent_id `n()`
<fct> <int>
1 0 7
2 1 10
3 2 66
4 3 53
5 4 6
6 5 30
7 6 12
8 7 9
9 8 32
10 9 7
11 10 44
>
> # CP Segments
> ggplot(data=temp, aes(x=Period, y=cp, color=Period)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-175, 2000), "Customer Profitability") +
+   theme_bw() + theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10),
+     axis.title.x = element_blank()) +
+   theme(legend.position="none") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
>
>
>
> # Plot observed and predicted CP for some customers
> cp <- left_join(cp, cp.novol[,c("cust", "date", "pred.simple")], by=c("cust", "date"))
> temp <- cp; colnames(temp)[c(9,10,12)] <- c("Observed", "Predicted", "Simple")
> make.graphs <- function(var) {
+   for (c in var) {
+     df <- temp %>% filter(cust == c) %>%
+       dplyr::select(date, Observed, Predicted, Simple) %>%
+       tidyr::gather(key = "variable", value = "value", -date)
+     plot <- ggplot(df, aes(x = date, y = value)) + geom_line(aes(color = variable), size = 1) +
+       geom_line(aes(color = variable), size = 1) + geom_line(aes(color = variable), size = 1) +
+       scale_y_continuous(limits = c(min(df$value) - 100, max(df$value) + 100)) +
+       scale_x_discrete(limits = c(1:11)) + theme_bw() +
+       theme(text = element_text(size=10, family="Dubai Light", color = "black"),
+         axis.title.y=element_blank(), axis.title.x=element_blank(), legend.position="none") +
+       scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
+     print(plot)
+   }
+ }
>
> too_low <- c(417, 572, 1307, 163, 121)
> too_high <- c(644, 740, 1298, 160, 1330)
> good <- c(117, 551, 249, 893, 641) #117, 893
> make.graphs(c(121,417)); make.graphs(c(1298,740)); make.graphs(c(117,893)) # 325x210
>
> temp <- select(cp, holdout, freq, gm, cp); temp$what <- "Observed"
> temp2 <- select(cp, holdout, v1, gm1, pred); temp2$what <- "Predicted"
> colnames(temp)[2:4] <- c("Visits2", "GrossMargins", "CustomerProfitability")
> colnames(temp2)[2:4] <- c("Visits2", "GrossMargins", "CustomerProfitability")
> temp <- rbind(temp, temp2)
> temp$holdout <- as.factor(temp$holdout)
> levels(temp$holdout) <- c("Q1-Q8", "Q9-Q11")
> temp$holdout <- factor(temp$holdout,levels(temp$holdout)[c(2,1)])
> colnames(temp)[5] <- "Visits"
> ggplot(data=temp, aes(x=holdout, y=Visits2, color=Visits)) + geom_boxplot() +
+   scale_x_discrete() +
+   theme_bw() + theme(text = element_text(size=12, family="Dubai Light", color = "black")) +
+   theme(axis.title.x = element_blank(),
+     axis.text = element_text(size=10),
+     axis.title.y = element_blank()) + coord_flip() +
+   theme(legend.position="bottom") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))