
PREDICTING THE UNPREDICTABLE:

Investigating Customer Profitability over Time

MSc Marketing Thesis - Marketing Intelligence

17 June 2018

author

Jantien G. Dekker

S3177874

Palembangstraat 1, 9715 LK Groningen (NL)

[email protected]

+31 6 23653675

First supervisor: Prof. Dr. J.E. Wieringa

Second supervisor: Dr. J.T. Bouma

University of Groningen

Faculty of Economics & Business

Department of Marketing

PO Box 800, 9700 AV Groningen (NL)


SUMMARY

Over two decades ago, Foster, Gupta, and Sjoblom (1996) acknowledged the challenge of

tracking customer profitability (CP) over time: "a customer that is unprofitable now and is

expected to remain unprofitable requires a different set of corrective actions than a customer

that is unprofitable now but expected to be profitable in the foreseeable future". Measuring and

predicting customer profitability has become a major topic within marketing. Estimating

individual-level CLV has proven to be difficult, with sophisticated methods performing only as well as simple methods (Donkers, Verhoef, & de Jong, 2007). Perhaps this is the result of the little attention paid to modelling changes in CP over time.

This thesis answers the management question of how we can identify profitable customers,

and especially customers that become more or less profitable over time. The general research

problem is how we can predict future profitability of individual customers, while accounting for

changes in CP over time. We use transaction data of a supplier of non-food products to retail

stores throughout The Netherlands over a three-year period. Following from our discussion, we should answer the following questions to be able to provide a solution to our research problem:

(1) What are drivers of customer profitability?

(2) How can we predict the future profitability of individual customers?

(3) To what extent are we able to predict changes in individual customer profitability

over time?

(4) How can we identify profitable customer segments based on our predictions?

We answer these questions by identifying CP drivers and components based on past

research, and we develop a predictive model that captures these identified drivers and

components to predict future CP over time.

Our main goal is to measure and predict individual customer profitability, which is the

revenues derived from a customer minus the costs to serve that customer. The specification of

both customer revenues and customer costs are hypothesized to have a significant influence on

model performance, and both components are mainly driven by past customer behavior,

customer characteristics, and firm actions.


We measure the profitability of a customer i (CPi) as:

CP_i = \sum_{t=1}^{T} (GM_{it} - MC_{it})

where GM_{it} = gross margins in period t
      MC_{it} = marketing costs in period t
      T = time horizon of our measurement

We model two separate components for our final CP model: a Negative Binomial count model for the number of visits, which feeds our cost calculations, and a multiplicative fixed effects OLS model for our gross margins component. We combine both models by subtracting the number of visits multiplied by the average costs per visit from gross margins. For both the visits and the gross margins component, we use a hurdle model with a binary logit component that models the zero-observations.

We found that past customer behavior, customer heterogeneity, and firm actions all drive CP through its costs and gross margins components. Our model is able to capture changes in CP over time, but these predicted changes are not very accurate: our model does not offer a significantly better performance than a model that uses past average CP and projects it onto future time periods. Also, a model without a separate cost component does not perform significantly worse than our model. For segmentation purposes, our model could offer managers a tool to gain a better understanding of what drives CP, especially within customer segments.

In this thesis we aimed to “predict the unpredictable”. We were able to predict changes in customer profitability over time, but these predictions did not match the actually observed changes in CP very well. We therefore conclude that trying to predict the unpredictable is very difficult, perhaps even impossible, especially given that managers with scarce resources must trade off the investment required to develop a sophisticated model against using a relatively simple model that seems to predict CP almost equally well.


PREFACE

Dear reader,

Hereby I present to you my thesis for the MSc Marketing Intelligence. Over ten years ago I graduated from secondary education, with no clue about what I wanted to do or who I wanted to be in ten years. Several years passed in which I did not attend any form of education. Instead, I worked at multiple companies in several positions, only to find out that my heart lies in marketing. After finishing my part-time Bachelor of Business Administration with a specialization in Marketing Management at the Hanze University Groningen, I realized that I was still missing “something”. When I learned about the MSc Marketing Intelligence, I quit my full-time job to be a full-time student for the first time in my life. I have not regretted it for a single day: it has given me exactly the “something” that I felt I was missing two years ago.

The data and the research problem of this thesis come from the company that offered me my first work experience. It therefore also carries a personal touch. Experiencing the bankruptcy of a company that you love working for at the age of nineteen was a very hard but informative experience. I want to thank the provider of the dataset, since it is very interesting, real-world data. Although the company does not exist anymore, it has given me the chance to put what I have learned during my education into practice.

My gratitude goes out to Jaap Wieringa for supervising me during the process. I can sometimes get carried away in my enthusiasm, and he could get me back on track. And although my initial plan to combine my thesis with an internship did not come to fruition, I also want to thank Jelle Bouma for his conversations. The bumps in the road offered me a learning experience that goes beyond writing a thesis. Finally, I want to thank all lecturers of the (Pre-)MSc Marketing courses for their support and for sharing their knowledge during my education at the University of Groningen.

Kind regards,

Jantien Dekker


CONTENTS

1 Introduction
  1.1 Changes in Customer Profitability over Time
  1.2 Description of the Organization
  1.3 Scope and Contribution
2 Customer Profitability
  2.1 Managing Customer Profitability
  2.2 Defining Customer Profitability
  2.3 Measuring Customer Profitability
    2.3.1 Customer Relationship
    2.3.2 Customer Revenues and Risk
    2.3.3 Customer Costs
  2.4 Understanding Customer Profitability
    2.4.1 Customer Behavior and Characteristics
    2.4.2 Firm Actions
    2.4.3 Market Variables
  2.5 Conceptual Model
3 Model
  3.1 Model Specification
    3.1.1 Model for the Number of Visits
    3.1.2 Model for Gross Margins
  3.2 Data
  3.3 Procedure
4 Results
  4.1 Number of Visits
  4.2 Gross Margins
  4.3 Customer Profitability
    4.3.1 Changes in CP over Time
    4.3.2 Model Variants
  4.4 Customer Segments
5 Discussion
  5.1 General Discussion
  5.2 Managerial Implications
  5.3 Limitations
  5.4 Future Research
  5.5 Conclusion
References

Digital Appendices
Appendix A: R-Code Data Preparation
Appendix B: R-Code Model Components
Appendix C: R-Code Customer Profitability


1 INTRODUCTION

Over two decades ago, Foster, Gupta, and Sjoblom (1996) acknowledged the challenge of

tracking customer profitability (CP) over time: "a customer that is unprofitable now and is

expected to remain unprofitable requires a different set of corrective actions than a customer

that is unprofitable now but expected to be profitable in the foreseeable future". Measuring and

predicting customer profitability has become a major topic within marketing. However, to date, many attempts to estimate individual-level customer profitability have been rather unsuccessful, with simple models often performing just as well as more sophisticated ones. Perhaps this is the result of the little attention paid to modelling the possible changes in customer contributions over time.

1.1 CHANGES IN CUSTOMER PROFITABILITY OVER TIME

Many CP models predict future profitability based on current contributions or the average past

contribution from the customer, assuming that a customer’s margins stay stable over time. This

assumption might be reasonable in some situations. However, there may be large variation

within customer contributions over time. This variation is mostly addressed by incorporating the

probability that the customer is still “alive” (i.e. retention probability).

Nevertheless, estimating individual-level CLV has proven to be difficult, with sophisticated methods performing only as well as simple methods (Donkers, Verhoef, & de Jong, 2007). Also, using current or past average margins in CLV calculations may lead to biases: for example, in markets in which the cost to serve a customer takes up a large proportion of the gross margin, such as B2B contexts with a high degree of personal selling, but also in markets with complex dynamics, especially if the (B2B) firm’s customers operate within different industries.

This thesis answers the management question of how we can identify profitable customers,

and especially customers that become more or less profitable over time. The general research

problem is how we can predict future profitability of individual customers, while accounting for

changes in CP over time. We use transaction data of a supplier of non-food products to retail

stores throughout The Netherlands over a three-year period. Following from our discussion, we should answer the following questions to be able to provide a solution to our research problem:


(1) What are drivers of customer profitability?

(2) How can we predict the future profitability of individual customers?

(3) To what extent are we able to predict changes in individual customer profitability over time?

(4) How can we identify profitable customer segments based on our predictions?

We answer these questions by identifying CP drivers and components based on past

research, and we develop a predictive model that captures these identified drivers and

components to predict future CP over time.

1.2 DESCRIPTION OF THE ORGANIZATION

This thesis uses data of a supplier of non-food products to retail stores (e.g. supermarkets, drug stores, and hardware stores) throughout The Netherlands over the period 2006 to 2008.

The products were sold and distributed directly by the sales representatives, who put the goods

in the store in a leased display. Goods that were not sold could be returned: the representative

took them back at the next visit. Each representative was responsible for managing the

customers within his region, from acquisition to retention.

The company offered three main services/product categories:

(1) Regular: non-food products that were placed into the store in a display that was

provided by the supplier (e.g. socks, toys, cleaning accessories);

(2) Loyalty programs: products used by the customer (the retailer) for consumer loyalty programs (i.e. consumers saved loyalty stamps for discounts on products). One program usually covered a period of 4 to 8 weeks. The distributor delivered promotional material

(e.g. posters and vouchers) and made sure that there was always enough stock

within the store;

(3) Theme displays: introduced during the European Championship soccer tournament in 2008. The displays contained products with a specific theme (during the tournament: orange/Dutch

products). The “orange” displays were sold out within a month. The company

therefore decided to order more theme displays, such as Christmas and Party.

The company only measured its performance based on aggregate-level revenues per sales

representative and per period. The management suspected that some customers were costing


more than they yielded. They therefore wanted to gain insight into the profitability of their customers. They used simple summary statistics aggregated at the segment level (e.g. segmented on retail chain), and found that some segments appeared to be highly profitable, but the company only had a few customers within these segments. After that, the sales force put in a lot of effort to acquire more customers within these segments. However, by that time, it was already too late. The company was already in a rough patch and the theme displays were a final hope. They seemed a success at first, but after a few months, a disastrous number of theme displays was returned, after which the company had no choice but to file for bankruptcy in March 2009.

1.3 SCOPE AND CONTRIBUTION

The model can be used diagnostically, for assessing the performance of the company’s customer base and the aspects that drive customer-based profits, and also normatively, as input for managerial decision-making in the selection and targeting of profitable customers. This research contributes to theory by developing and testing a model that predicts CP at the individual customer level in a context with a high degree of uncertainty and changes in customer profits over time, which has proven to be extremely difficult.

We limit our scope to a B2B supplier to B2C retailers within multiple industries. The focus

is solely on customer behavior, not on customer perceptions or attitudes. We only investigate

the measurement and prediction of CP, not its actual implementation in management practice.

In the next chapter we first discuss the concept of CP and its theoretical background. Based on this discussion we develop our model in chapter 3. We present our results in the chapter thereafter. Finally, we discuss our findings and their implications for both theory and practice.


2 CUSTOMER PROFITABILITY

In this chapter we first discuss customer management, which answers the question of why it is relevant to measure CP. We then define customer profitability, and discuss the different research streams that have developed within CP. Next, we discuss the measurement and general components of CP, followed by an identification of CP antecedents to get a better understanding of the measure. We end the chapter with a conceptual model for our research, in which all the identified components of customer profitability are present.

2.1 MANAGING CUSTOMER PROFITABILITY

Customer management can be defined as the processes and actions through which the

contribution or value from each customer to the firm’s overall profitability is maximized, by

making use of individual data on customers (Kumar, Ramani, & Bohling, 2004; Verhoef & Lemon,

2013). Customer management involves making decisions on (a) selecting customers for

targeting, (b) allocating resources to these selected customers, and (c) nurturing customers to

increase future profitability (Kumar, Venkatesan, Bohling, & Beckmann, 2008). Customer

profitability can be increased by acquisition, up-selling, cross-selling, reducing customer costs,

and retention (Verhoef & Lemon, 2013). The underlying philosophy is that to derive value from customers, an organization should first be able to provide value to customers.

Another approach to managing customers is customer asset management. Within this

approach, customers are viewed and managed as economic assets. Kumar (2018) defines an

asset as “any physical, organizational, or human attribute that enables the firm to generate and

implement strategies that improve its efficiency and effectiveness in the marketplace”. Nenonen

and Storbacka (2016) give four actions to manage the customer asset to optimize profits: (1)

increasing revenues from customers: by customer acquisition, retention, development, price

increases, and innovation; (2) decreasing customer-related costs: both reducing costs to serve

and costs to acquire; (3) optimized asset utilization: optimizing capital investments in customer

relationships, and managing business volumes for economies of scale; (4) reducing customer-

related risks: diversifying the customer base, and reducing risk correlations within the customer

base.


In his recent work, Kumar (2018) provides the Customer Valuation Theory, in which he

attempts to integrate the concepts of customer value and customer assets. His theory connects

individual-level customer value to the performance and valuation of the entire firm by “(1)

valuing customers as assets, (2) managing a portfolio of customers, and (3) nurturing profitable

customers”.

Whichever of these approaches or actions a firm takes to manage its customers, it requires

a thorough understanding of its customer profitability and the drivers of this profitability to be

able to determine the optimal courses of action. This allows firms to better identify and target

profitable customers, and optimize resource allocations to profitable customers and activities,

which leads to an increased marketing ROI (Reinartz, Thomas, & Kumar, 2005; Venkatesan &

Kumar, 2004).

2.2 DEFINING CUSTOMER PROFITABILITY

Pfeifer, Haskins and Conroy (2005) define customer profitability as “the difference between the

revenues earned from and costs associated with the customer relationship during a specified

period”. The authors state that if CP is viewed strictly from an accounting perspective, then CP

should focus on past and current contribution of a customer to the firm. Hence, CP is backward-

looking by definition. However, a firm cannot change or control its past, only its future.

Therefore, it needs to be able to anticipate expected customer profitability by making predictions about the future if it wishes to take appropriate actions. In the literature, expected

future customer profitability is often referred to as customer lifetime value (CLV), which is the

expected future customer profitability during the entire relationship of a customer with the firm,

discounted by the current value of future capital (Holm, Kumar, & Rohde, 2012; Pfeifer et al.,

2005). If a customer is seen as an asset, then customer value can be viewed as the price that

someone would be willing to pay to acquire that asset (Pfeifer et al., 2005).

Many other related terms exist, such as customer equity or customer-based valuation

(Gupta, 2009), and net present value of expected gross contribution (Kumar et al., 2004).

Scholars have been debating when to use and how to define customer profitability versus customer value for decades, and both terms seem to be used interchangeably (Gupta,

2009; Holm et al., 2012; Kumar, 2018; Mulhern, 1999; Pfeifer et al., 2005). Derived from this


discussion we may conclude that CP and CV are used interchangeably, that it is important to define and specify any measurement of the concepts, and that one should be aware of the differences between them. In this thesis our main goal is not only to measure past CP, but also to make predictions about the future. We use sources from both customer profitability and customer value research, as long as they are relevant for the current research.

2.3 MEASURING CUSTOMER PROFITABILITY

Many different models for measuring CP exist, and each model seems to capture and/or focus

on different components. Holm, Kumar, and Rohde (2012) argue that model specification and

sophistication should depend on the complexity of the context, which they view as consisting of

customer behavioral complexity and customer service complexity. They define customer

behavioral complexity as “the degree of variation in retention durations (relationship length),

transaction frequency and value of transactions (relationship depth), and cross-buying behavior

(relationship breadth) across the total number of customer relationships a firm serves”.

Customer service complexity is defined as “the degree of variation in service needs and

requirements that invoke differential activities on an organization across customer-facing

functions in terms of the number of activities performed as well as the time spent on each

activity”.

The model should capture all aspects that are relevant for the specific context, as long as

the benefit of measuring each aspect outweighs its cost. We adopt the view of Holm,

Kumar, and Rohde (2012) on measuring CP, since it is highly flexible while still capturing all

components that are commonly used within CP literature, and it seems to bridge a gap between

several CP research streams (CPA and CLV). We will next discuss three aspects of customer

profitability that seem to be important to distinguish: customer relationship, customer revenues

and risk, and customer costs.

2.3.1 CUSTOMER RELATIONSHIP

According to Gupta et al. (2006), marketing actions of the firm lead to customer behavior, which

in turn leads to CP. They distinguish between three customer behaviors that represent the

lifetime stages of a customer:

(1) Customer acquisition: the first purchase of a customer;


(2) Customer margin: the purchase behavior during the customer-firm relationship (i.e.

up- and cross-selling);

(3) Customer retention: repeat-purchases and/or customer defection.

These three behavioral components seem to be the focus of many CP models (Gupta et al.,

2006; Gupta, 2009; Venkatesan & Kumar, 2004), and capture relationship length, depth, and

breadth (Bolton, Lemon, & Verhoef, 2004). Relationship depth and breadth refer to the

revenues that are associated with the customer relationship, including purchase frequency/up-

selling (depth) and cross-buying (breadth). We go into more depth on customer revenues in

the following subsection. In the remainder of this subsection on customer relationship, we will

discuss the relationship length in more detail.

Relationship length relates to the customer retention component of CP. Let us first discuss

the possible natures of a customer-firm relationship. Fader and Hardie (2009) distinguish

between contractual and non-contractual relationships. Within a contractual relationship, it is

relatively easy to observe the relationship termination. The customer needs to let the firm know

that he is terminating the relationship, or he simply does not extend his contract. Within a non-

determine whether a customer is still “alive” at a certain point. CP models usually model

customer retention by means of the probability that a customer will still be active at a certain

point in time (Gupta et al., 2006). Another distinction that Fader and Hardie (2009) make is

between the transaction opportunities. These can either occur continuously (i.e. at any given

time) or discretely (i.e. only at certain points in time). They presented a quadrant for both the

relationship type and transaction opportunities dimensions, and each setting asks for a different

modelling approach.

Another distinction that is made is the “lost-for-good” versus the “always-a-share”

relationship (Jackson, 1985). In a lost-for-good setting, a customer typically buys its product or

service from one company. Switching costs are generally believed to be high. The customer is

“alive” at some point, until he “dies” (i.e. he terminates the relationship completely). In an always-

a-share setting, the customer may spread his purchases between multiple sellers. The customer

never truly “dies”, since there is a probability that he will come back at each purchase

opportunity. Usually, the lost-for-good approach is used within contractual relationships, while


the always-a-share seems to be more appropriate for noncontractual relationships (Rust,

Lemon, & Zeithaml, 2004).

Finally, one should specify the period over which CP is measured or estimated. If it is

modelled for the entire (expected) relationship with the customer (i.e. its entire lifetime), the

time horizon should generally be set to infinity. It is, however, difficult to predict over a long period of time, since markets are usually dynamic in nature. Besides, most companies set their

strategies for the next three to five years, which makes it reasonable to set a limited time horizon

for CP predictions (Kumar et al., 2008).

2.3.2 CUSTOMER REVENUES AND RISK

Instead of customer revenues, we can also refer to customer gross margins, which are revenues

minus the costs of goods sold (COGS) (Pfeifer et al., 2005). Using revenues and COGS

separately or using gross margins should depend on the variability in product margins. Many

models use average contribution margin of a customer to project future CP (Gupta, Lehmann,

& Stuart, 2004; Reinartz & Kumar, 2003). This might be reasonable if the company offers

relatively few service propositions and margins are relatively stable over time, but might be

biased if large variations in cash flows between customers exist. That is why several authors

have argued for more attention to risk in CP models (Bolton et al., 2004; Holm et al., 2012;

Kumar, 2018; Nenonen & Storbacka, 2016).

Kumar (2018) refers to risk in future CP as “the volatility and vulnerability in cash flows”.

Risk is usually captured through discount rate and retention probability (Gupta, 2009), which

could be seen as “vulnerabilities” in cash flows. Many models, however, seem to lack the inclusion of “volatility” in cash flows. Customer risk may result in a high reliance on a few

customer relationships, or in unsteady cash-flows, both of which can pose a threat to the

company’s health and should therefore be appropriately identified and managed (Nenonen &

Storbacka, 2016). An example of measuring risk due to volatility in cash flows is the “risk-adjusted lifetime value” of Dhar and Glazer (2003), in which they capture a customer’s deviation from the mean expected returns, which can be complemented with macro-economic factors to understand this deviation. They call this risk aspect the “customer beta”, which, analogous to a stock’s beta, is the covariance of a customer’s cash flows with those of the firm’s entire customer base, divided by the variance of the customer base’s cash flows.
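As a minimal illustration of this measure, the customer beta can be computed from two cash flow series; the numbers below are hypothetical and not taken from the thesis data:

```r
# Illustrative customer beta (Dhar & Glazer, 2003): covariance of a customer's
# cash flows with the firm's aggregate customer cash flows, divided by the
# variance of the aggregate cash flows (hypothetical quarterly figures).
customer_cf  <- c(120, 80, 150, 90, 200, 60)
portfolio_cf <- c(1100, 950, 1300, 1000, 1500, 900)

beta <- cov(customer_cf, portfolio_cf) / var(portfolio_cf)
beta  # beta > 1 indicates cash flows more volatile than the customer base
```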


Based on the previous discussion we hypothesize that there is a significant improvement in

model performance when changes in customer revenues are taken into consideration, compared

to a model that is based on average past contribution:

H1. A model that incorporates changes in customer revenues over time predicts future

customer profitability significantly better than a model that predicts future customer

profitability based on the average past contribution of a customer.

2.3.3 CUSTOMER COSTS

Several scholars stress the importance of explicitly specifying costs in CP calculations, since

most calculations of CP seem to focus on demand resulting from customer behavior, while the

costs related to serving customers are an important part of the customer margin (Blattberg,

Malthouse, & Neslin, 2009; Gupta, 2009). Pfeifer et al. (2005) describe three accounting

methods to allocate costs to customers: (1) divide the costs by the number of customers,

assuming that all customers use the same amount of resources, (2) assign costs to customers

relative to their size (e.g. revenues), and (3) based on their use of resources. The latter is referred

to as Activity-Based Costing (ABC), which is a common theme within CP analyses. ABC was

developed by Cooper and Kaplan (1988) with the underlying philosophy that costs should be

attributed to the activities proportional to their use of resources, i.e. splitting costs and tracing

them to individual products, instead of simply dividing costs by the number of units. ABC can

also be used to trace back costs to individual customers (Niraj, Gupta, & Narasimhan, 2001),

which works the same way, but with a different unit of interest. Costs can first be divided into

“pools”, and then into “drivers”, after which they are attributed to customers (Foster et al.,

1996). Take, for example, distribution costs as a cost pool. These costs may depend on the

number of product units sold, which is the cost driver. The total distribution costs are then

divided by the product units sold, which can then be attributed to customers, relative to their

units bought.
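A minimal numerical sketch of such an allocation (all figures are hypothetical, not taken from the thesis data):

```r
# ABC-style allocation of one cost pool to customers, with units sold as the
# cost driver; pool size and unit counts are illustrative only.
pool_costs <- 50000                     # total distribution costs in a period
units <- c(A = 1200, B = 300, C = 500)  # units bought per customer
rate <- pool_costs / sum(units)         # cost driver rate: cost per unit sold
allocated <- rate * units               # costs attributed to each customer
allocated                               # A: 30000, B: 7500, C: 12500
```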

An important decision to be made is which costs to include in CP. If the total costs of the

company are traced back and attributed to individual customers, then customer profitability

reflects the overall profitability of the firm. If only the costs that are specific and variable to

serving customers are allocated to individual customers, then customer profitability could be


used for comparing customers within the company’s customer-base (Pfeifer et al., 2005).

Blattberg et al. (2009) refer to including all company costs as a “full-costing” approach, and only

including the variable costs of serving customers as a “marginal-costing” approach. Again, the

choice depends on the context and goals at hand.

We hypothesize that there is a considerable difference in model performance when the costs to serve customers are attributed to individual customers, compared to a model that does not include a cost component:

H2. A model that attributes customer costs to individual customers predicts future customer profitability significantly better than a model without a separate cost component.

2.4 UNDERSTANDING CUSTOMER PROFITABILITY

It is essential that a firm is not only able to measure CP, but also understands what drives CP

in order to be able to control it. Based on past research, we identified customer behavior and

characteristics, firm actions, and control variables that are found to influence CP or its

components (i.e. costs and revenues) in B2B settings. We chose to only investigate observed

behavior and characteristics, and thus we do not investigate perceptions or attitudes that were

found to influence CP. We subsequently discuss each identified driver and how it influences CP.

2.4.1 CUSTOMER BEHAVIOR AND CHARACTERISTICS

Customer behaviors that are found to be highly predictive of future behavior (and with that,

CP) are past purchase behavior, cross-buying behavior, and product returns behavior. In terms

of customer characteristics, we identified customer size and location as important predictors of

future CP. We subsequently discuss each CP driver that is related to customer behavior and

customer characteristics.

Past purchase behavior

Past purchase behavior is seen as one of the best predictors of future purchase behavior,

and with that, future customer profitability (Blattberg et al., 2009). The most commonly used

metrics to measure (past) purchase behavior are RFM metrics: Recency – the time since the

last purchase, Frequency – the number of purchases, and Monetary Value of these purchases.


Several other metrics can be derived from RFM measures, such as interpurchase time (i.e.

average time between transactions) and the average spend per transaction (i.e. M/F). Although

RFM metrics are amongst the most studied antecedents of customer profitability, findings on

the direction of the effect between RFM and profitability remain inconclusive. Most studies

show a positive link: customers who purchased more (often/recently) in the past are also more

likely to purchase more in the future, which is positively related to future profitability (Reinartz

et al., 2005).

Niraj et al. (2001) found that frequency was actually negatively related to profits, because

it adds complexity to the purchases. They found that frequency does not translate into

significantly higher average gross margins, but it does significantly increase costs. Compared to

many other CP studies, their model can be considered as one of the most detailed in terms of

attributing costs to individual customers. They did not only assign marketing costs to customers,

but also costs of each individual product. For example, costs are first attributed to separate

items (e.g. warehousing, distribution, negotiation with suppliers), and then to customers based

on their unit purchases of each item. Thus, especially if the costs per order are high relative to

the gross margins, and this is appropriately captured within the profitability model, we would

expect similar results to Niraj et al. (2001).

Cross-buying

Another surprising finding of Niraj et al. (2001) is that cross-buying did not have any

significant effect on customer profitability, while generally it has been found to be positively

related to customer profitability (Reinartz et al., 2005; Reinartz & Kumar, 2003; Rust, Kumar, &

Venkatesan, 2011). Kumar, George, and Pancras (2008) found that cross-buying is related to

the first product (category) purchased, and that it shows a U-shaped relationship with

interpurchase time: customers with an average interpurchase time are most likely to cross-buy.

They also found that higher focused buying (i.e. buying more within a category) is positively

related to cross-buying. This is an interesting finding, because Reinartz and Kumar (2003) found

a negative effect between focused buying and customer profitability. Generally, both behaviors

are believed to be positively related to profitability, and cross-buying has a larger effect on

profitability than focused-buying (Kumar et al., 2004).


Perhaps the reason why Niraj et al. (2001) did not find a significant effect is again because

of their cost attribution method. Cross-buying could be associated with higher spending levels,

but also with higher order complexity and thus higher costs, which may lead to diminishing

returns. Also, Shah, Kumar, Qu, and Chen (2012) found that approximately 10 to 35% of cross-

buying customers are in unprofitable relationships, and that the unprofitability increases with

the degree of cross-buying. They found that this is due to other unprofitable behaviors that

these customers show, such as excessive service requests (which is in line with our reasoning

that cross-buying may add to the costs) and promotion purchase behavior (i.e. lower gross

margins).

The contradictory findings on cross-buying behavior suggest that on an aggregated level,

cross-buying seems to be related to higher profits. However, large differences between

customers can exist, interactions with other behaviors are likely to be present, and cost-

attribution methods might influence the result.

Product returns

Research on the effect of product returns on CP shows contradictory findings (Petersen &

Kumar, 2015; Reinartz & Kumar, 2003). On the one hand, product returns result in higher costs and

lower revenues, which has a negative consequence for profitability (Reinartz & Kumar, 2003).

However, returns could also decrease risk/price perceptions and therefore enhance future

spending, which in turn may positively influence CP (Petersen & Kumar, 2015). Kumar et al.

(2004) suggest that there is an optimal level of product returns, and that product returns thus show a U-shaped relationship with CP.

Customer size

It would be logical to assume that larger customers (e.g. based on their own store sales)

also spend more, and thus are more profitable. However, Bowman and Narayandas (2004)

suggest that large customers are not necessarily more profitable, because they are usually also

more demanding in terms of both quality and price. The influence of customer size on

profitability has also been reported by Van Raaij et al. (2003), who found an inverted U-shaped

relationship between customer size and profitability: the top 1% of customers in terms of

customer size shows a lower profitability than large and medium-sized customers, and small


customers are reported to be most unprofitable. Reinartz and Kumar (2003) and Rust et al.

(2011) found a positive relationship.

Population density

Reinartz and Kumar (2003) found that population density had a negative effect on

customer profitability within B2C contexts. There was no effect present within B2B settings.

However, within a B2B context that deals with B2C retailers, and thus is indirectly related to a

B2C context, we could argue that the effect might be present.

Thus, to conclude, we hypothesize that past customer behavior and customer characteristics

are important drivers for CP, with implications and interactions for both revenues and costs:

H3. Past customer behavior and customer characteristics are significant drivers of CP and

both its costs and revenues components.

2.4.2 FIRM ACTIONS

Firm-initiated contacts are found to positively influence the length of the customer-firm

relationship and individual profitability (Kumar et al., 2008; Reinartz et al., 2005; Reinartz &

Kumar, 2003). However, Blattberg et al. (2009) suggest that there is an optimal number of

marketing contacts. Above a certain point, there are diminishing returns, which is referred to as

wearout. Thus, marketing contacts are believed to show an inverted U-shaped relationship with

profitability. Rust et al. (2011) find evidence that marketing contacts do not only drive customer

behavior, but that the number of contacts in turn is also driven by past customer behavior.

Niraj et al. (2001) found that offering “extra items” (i.e. customized products or services) is

negatively related to customer profitability, since it adds to the service costs but does not necessarily result in higher revenues. They argue that this may be a result of an orientation of sales

representatives towards short-term revenues, instead of towards long-term profitability. This

immediately leads us to the possible influence of sales representatives on CP. Bowman and Narayandas (2004) found that the hours spent at an account by a sales representative are positively related to CP, especially if the relationship between the rep and the customer has a long tenure. Sales reps’ perceptions of customer profitability can be biased based on their self-

efficacy and customer-orientation (Mullins, Ahearne, Lam, Hall, & Boichuk, 2014), implying that


we should control for the effect of sales persons on customer profitability when reps make their

own decisions about visiting customers.

To conclude, we hypothesize that firm actions drive both revenues and costs: there is a

“point of diminishing returns” after which CP declines:

H4. Firm actions drive both revenues and costs, and they show diminishing returns on CP, implying that there is an ideal point of firm actions.

2.4.3 MARKET VARIABLES

The retail market is highly dynamic, with a large number of mergers and acquisitions, changing customer demands, increased competition (both offline and online) and an increasing number of strategic alliances (Grewal, Roggeveen, & Nordfält, 2017; Kumar, Anand, & Song, 2017). These market dynamics could potentially affect customer profitability, and thus, although they cannot be controlled by the firm, they need to be accounted for.

2.5 CONCEPTUAL MODEL

Our main goal is to measure and predict individual customer profitability, which is the revenues

derived from a customer minus the costs to serve that customer. In this thesis we test the

following hypotheses:

H1. A model that incorporates changes in customer revenues over time predicts future

customer profitability significantly better than a model that predicts future customer

profitability based on the average past contribution of a customer.

H2. A model that attributes customer costs to individual customers predicts future customer profitability significantly better than a model without a separate cost component.

H3. Past customer behavior and customer characteristics are significant drivers of CP and

both its costs and revenues components.

H4. Firm actions drive both revenues and costs, and they show diminishing returns on CP, implying that there is an ideal point of firm actions.


We test these hypotheses by including the identified antecedents in our CP model, and

determine their individual effects on the components of CP (i.e. revenues/gross margins and

customer costs). Also, we compare our model to simpler model variants without changes in CP

over time and without a cost component.

To summarize, we present an overview of all identified antecedents of CP and their

relationship with costs, revenues, and profitability in table 2.1. We present the most important

drivers of CP and relationships between concepts in our conceptual model (figure 2.1). We

expect that firm actions are driven by past firm actions, customer characteristics, and past

customer behavior. Customer behavior is driven by both current and past firm actions, customer

characteristics, and past customer behavior. Customer profitability is driven by both firm actions

and customer behavior, and this relationship is influenced by market dynamics.

Figure 2.1: conceptual model. [Diagram: past firm actions, customer characteristics, and past customer behavior feed into firm actions and customer behavior; firm actions and customer behavior in turn drive customer profitability, with market dynamics influencing this last relationship.]


Table 2.1: antecedents of CP

Legend: ∩ = inverted U-shaped relationship; U = U-shaped relationship; C = control variable; / = no significant effects.

Studies (the table's columns, in order): Niraj et al. (2001); Van Raaij et al. (2003); Reinartz & Kumar (2003); Bowman & Narayandas (2004); Kumar et al. (2004); Reinartz et al. (2005); Kumar et al. (2008); Rust et al. (2011); Mullins et al. (2014); Petersen & Kumar (2015); Grewal et al. (2017); Kumar et al. (2017); followed by three summary columns for the effect on Revenues, Costs, and Profitability.

Reported effects per antecedent, listed in column order (empty cells are omitted in the source, so the entries do not align one-to-one with the study list):

B2B setting*: X X X X X X X X X R R

Customer behavior
  Frequency: - + + + + + ∩
  Interpurchase time: ∩ ∩ - ∩ ∩ ∩
  Spending level: + + + + + +
  Cross-buying: / + + + + + + + ∩
  Product returns: ? ∩ + - + U

Firm actions
  Marketing contacts: + + + + + + +
  Extra services: - + - + + ∩

Customer characteristics
  Customer size: ∩ + + + + ∩
  Location: / ?

Control variables
  Sales representative: C C C C C
  Market dynamics: C C C C C C

* Only the study of Petersen and Kumar (2015) investigated CP within a B2C context. Grewal et al. (2017) and Kumar et al. (2017) did not study CP, but discussed the future of retailing. All other cited articles studied CP within a B2B context.


3 MODEL

In this chapter we develop our model to measure and predict CP. We first define our

specification of CP, followed by the specification of its two components: costs and gross margins.

We then discuss the data available for the research, and the procedure that we follow to answer

our research questions and test our hypotheses.

3.1 MODEL SPECIFICATION

We measure the profitability of a customer i (CPi) as:

CP_i = \sum_{t=1}^{T} (GM_{it} - MC_{it})

where GM_{it} = gross margins in period t
      MC_{it} = marketing costs in period t
      T = time horizon of our measurement

We chose to only include marketing costs in our model and not general overhead costs,

because our main goal is to compare CP between customers, instead of determining the CP of

the entire customer-base. Thus, to estimate future CP we must predict two components: the

number of visits and the gross margins for each period:

\hat{CP}_i = \sum_{t=1}^{T} (\hat{GM}_{it} - \hat{V}_{it} C)

where \hat{GM}_{it} = predicted gross margins (in euros) in period t
      \hat{V}_{it} = predicted number of visits in period t
      C = costs per visit

Due to data limitations we have to make several assumptions related to the costs to serve

customers. For example, the sales force did not keep a record of their visits to and hours spent on each individual customer. We therefore assume that the number of visits (i.e. marketing contacts) equals the number of orders, and that each visit takes the same amount of time. We expect that this assumption is reasonable within the given context,


since sales representatives are responsible for maintaining the relationships with their own

customer-base, and they delivered products straight from their car on each visit (section 1.2).

Thus, the costs to serve a customer depend on the number of orders derived from that

customer.

Also, the costs of the sales force are only known at an aggregated level. We therefore divide the total costs of the sales force within our time horizon by the total number of orders within that horizon to arrive at the average costs per visit. Individual customer costs are thus a function of the number of orders (i.e. the number of visits) placed by that customer, multiplied by the average costs per order over our entire time horizon. We estimated the average costs per visit at €70.39, based on the average order costs in 2008 (total costs of the sales force divided by the total number of orders: 544,396.26 / 7,734).
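As a minimal sketch of how these pieces combine into a CP prediction (the data frame, column names, and example values below are illustrative assumptions, not the thesis code from Appendix C):

```r
# Combine the two predicted components into customer profitability.
# 'pred' holds one row per customer per quarter, with predicted gross margins
# (gm_hat) and predicted number of visits (v_hat); values are made up.
pred <- data.frame(
  customer = c("c1", "c1", "c2", "c2"),
  quarter  = c("2008Q2", "2008Q3", "2008Q2", "2008Q3"),
  gm_hat   = c(420, 380, 150, 0),
  v_hat    = c(3, 2, 1, 0)
)
cost_per_visit <- 544396.26 / 7734  # average costs per visit in 2008 (70.39)

pred$profit <- pred$gm_hat - pred$v_hat * cost_per_visit
cp_hat <- aggregate(profit ~ customer, data = pred, FUN = sum)  # CP_i over T
```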

3.1.1 MODEL FOR THE NUMBER OF VISITS

We are interested in predicting the number of visits for our costs predictions. The number of

visits takes on discrete values from 0 to 26, with mean 1.83 and variance 5.82. Since the variance

is much larger than the mean (i.e. overdispersion), we assume a Negative Binomial distribution

(NBD) for our count data. Also, a relatively large part of our observations have zero values

(32.3%). We expect that a regular count model would not handle these zero-observations very

well. We therefore estimate a zero-inflated and a zero-hurdle model, and choose the model that

offers the best fit.

We hypothesized that the number of visits is a function of past purchase behavior, past

marketing contacts, market variables, and customer characteristics. Therefore, our initial

estimation of our visits component will have the following functional form:

V_{it} = \alpha + \beta_1 Rec_{it} + \beta_2 V_{i,t-1} + \beta_3 V.sum_{i,t-1} + \beta_4 V.avg_{i,t-1} + \beta_5 GM_{i,t-1} + \beta_6 GM.sum_{i,t-1} + \beta_7 GM.avg_{i,t-1} + \beta_8 PR_{i,t-1} + \beta_9 PR.dum_t + \beta_{10} GDP_t + \beta_{11} Cat_i + \beta_{12} Pop_i + u_{it}

where \alpha = intercept
      Rec = periods t since last purchase
      V = number of visits
      GM = gross margins
      sum = cumulative sum from t=1 till t-1
      avg = cumulative average from t=1 till t-1
      PR = number of premium orders
      PR.dum = dummy indicating that no order details were recorded
      GDP = GDP of consumers
      Cat = number of categories purchased over the entire time horizon
      Pop = population density
      u = error term
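A sketch of how this component can be estimated with the pscl package follows; the data frame est_data and the column names are illustrative stand-ins for the variables defined above, not the exact code of Appendix B:

```r
# Negative Binomial hurdle and zero-inflated models for the number of visits.
library(pscl)

f <- visits ~ rec + visits_lag + visits_sum + visits_avg +
  gm_lag + gm_sum + gm_avg + pr_lag + pr_dum + gdp + categories + pop_density

nb_hurdle <- hurdle(f, data = est_data, dist = "negbin")    # binary logit hurdle + NB counts
nb_zinf   <- zeroinfl(f, data = est_data, dist = "negbin")  # zero-inflated alternative

AIC(nb_hurdle, nb_zinf)  # retain the specification with the lower AIC
```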

3.1.2 MODEL FOR GROSS MARGINS

Our response variable gross margins (i.e. revenues derived from a customer minus the costs of

goods sold) can take positive and zero values, and its distribution is somewhat skewed to the

right (mean = 359.58, sd = 878.88). Also, there are considerable outliers present within our

data. All these characteristics are possible issues that can bias our estimations. To somewhat

account for these issues, and to accommodate potential interactions between our variables, we

estimate our gross margins model as a multiplicative (log-log) model. To account for the zero-

observations, we fit a zero-hurdle model to our data, in which we allow the variables and

parameters for the zero-hurdle part to differ from the positive gross margins model. Our initial

gross margins model takes the following specification:

GM^*_{it} = \alpha + \beta_1 V^*_{it} + \beta_2 Rec^*_{it} + \beta_3 V^*_{i,t-1} + \beta_4 V.sum^*_{i,t-1} + \beta_5 V.avg^*_{i,t-1} + \beta_6 GM^*_{i,t-1} + \beta_7 GM.sum^*_{i,t-1} + \beta_8 GM.avg^*_{i,t-1} + \beta_9 PR^*_{i,t-1} + \beta_{10} PR.dum_i + \beta_{11} GDP^*_{i,t-1} + \beta_{12} Cat^*_{i,t-1} + \beta_{13} Pop^*_{i,t-1} + \beta_{14} Returns^*_{it} + \varepsilon_{it}

where \alpha = intercept
      * = log-transformed variable (the model is estimated in logs)
      Rec = periods t since last purchase
      V = number of visits
      GM = gross margins
      sum = cumulative sum from t=1 till t-1
      avg = cumulative average from t=1 till t-1
      PR = number of premium orders
      PR.dum = dummy indicating that no order details were recorded
      GDP = GDP of consumers
      Cat = number of categories purchased over the entire time horizon
      Pop = population density
      Returns = gross margins of product returns
      \varepsilon = error term
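A sketch of this two-part estimation follows, assuming the hurdle is a binary logit for positive gross margins and OLS on log-transformed variables for the positive observations; est_data, the variable names, and the log1p transformation (used here only to guard against zero-valued regressors) are illustrative assumptions, not the thesis code:

```r
# Hurdle part: binary logit modelling whether gross margins are positive.
est_data$gm_pos <- as.numeric(est_data$gm > 0)
hurdle_fit <- glm(gm_pos ~ rec + visits_lag + gm_lag + pr_dum + gdp,
                  data = est_data, family = binomial(link = "logit"))

# Positive part: multiplicative (log-log) model on the positive observations.
gm_fit <- lm(log(gm) ~ log(visits) + log1p(rec) + log1p(visits_lag) +
               log1p(gm_lag) + log1p(returns_gm) + pr_dum,
             data = subset(est_data, gm > 0))
```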

3.2 DATA

In total we have 3 years of observations (2006 - 2008), which we divided into 12 quarters. We

only model customer profitability for customers that have purchased within the first period of

the data (Q1 2006) to avoid potential problems with left-censoring. Because of variable

transformations (e.g. lagged variables of our response variables) we lose the first period of our

data, which leaves us with a total of 11 time periods. We then excluded all customers that did

not make any purchase in the remaining 11 time periods. In total we have observations of 349

customers over each period, which results in 3,839 observations. We use the first 8 quarters of

our data for estimating our model (Q2 2006 to Q1 2008), and the last 3 quarters for assessing

its predictive validity (Q2 2008 to Q4 2008).

In the first five periods of our observations (Q1 2006 – Q1 2007), the company did not

keep track of order details, only of order totals. As a result, we do not have data on which products were ordered (and thus not on the number of premium programs or product categories), nor on the products that were returned. We therefore added a dummy variable that indicates missing data for premium programs (i.e. the variable Premium Dummy), and we set the cross-buying variable as a fixed customer-specific variable that does not change over time (i.e. the variable categories, which is the sum of the bought product categories over the entire time

horizon).

Because customers can return goods that were not sold, gross margins can take negative values. Since the logarithm is undefined for negative values, we had two options: (a) add a constant to our data to make all values positive, or (b) exclude returns entirely from the gross margins response variable. We chose option (b), since adding a constant would still pose problems in reliably estimating the zero-observations. We therefore netted returns out of gross margins in the following steps (a code sketch follows the list):


(1) For each period in which the returns were registered (from Q2 2007), and where

the return ratio was less than 1 (i.e. the customer bought more than he returned)

and higher than 0, we multiplied gross margins by the return ratio (e.g. if gross

margins was 100, and the return ratio 0.5, then gross margins was set at 50);

(2) Any resulting negative gross margins we subtracted from the gross margins of the previous period, and we repeated this step until every gross margins value was zero or positive (or negative only in the first period of observations). We subtracted from previous periods because gross margins can only turn negative when a customer returns products bought in an earlier period;

(3) We then set the remaining negative values in the first period of observations to zero, since these were returns of products that were bought before our observation periods.
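As an illustration, the following R sketch applies these three steps to one customer's quarterly gross margins vector gm with return ratios ratio (hypothetical object names; the full loop over all customers is in Appendix A). A single backward pass is equivalent to the repeated forward passes used there:

gm <- ifelse(ratio > 0 & ratio < 1, gm * ratio, gm)  # step (1): scale by the return ratio
for (i in length(gm):2) {                            # step (2): push negatives to earlier periods
  if (gm[i] < 0) {
    gm[i - 1] <- gm[i - 1] - abs(gm[i])
    gm[i]     <- 0
  }
}
gm[1] <- max(gm[1], 0)                               # step (3): drop pre-sample returns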

Thus, product returns are not included within our gross margins response variable. We

therefore included product returns as a predictor within our gross margins component to still

account for the effect of product returns. Note that, just as with our premium program orders,

product returns were not registered before Q2 2007.

We had 7% missing values for postal code, which we assume to be missing at random due to administrative errors. This led to missing values for our variable population density. In addition, the data obtained from Statistics Netherlands could not be matched to every postal code, possibly because of changes in municipality boundaries over the last decade. This led to

17% missing values for Population Density. Thus, in total we had 24% missing values for

population density. We imputed these values based on predictive mean matching, with 5

imputed datasets and 50 iterations.
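For reference, the imputation call used in Appendix A, from which we take the first completed dataset:

impute <- mice(temp, m = 5, maxit = 50, meth = 'pmm', seed = 500)
customers$pop_dens <- complete(impute, 1)$pop_dens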

The R-code for our data preparation and manipulation is included in Appendix A.

3.3 PROCEDURE

First, we start out with the full model as specified in section 3.1. Next, we assess face validity and resolve possible multicollinearity issues by inspecting the Variance Inflation Factors (VIF) of the predictor variables. We optimize our model components by comparing several nested versions of the models based on McFadden or adjusted R2, χ2 or F-values, AIC, and suitable measures of predictive accuracy. For both model components we assess whether a zero-inflated


or a zero-hurdle Binary Logit component significantly improves our model by comparing models

based on AIC scores.

Once we have fitted each model component, we investigate heterogeneity by comparing the

model with a model that includes effects for individual customers, customer groups, and/or

sales representatives. For our gross margins component, we test whether considerable

heterogeneity between individual customers is present by performing an F-test between the

pooled version and a fixed effects model. We then estimate a random-effects model, and perform a Hausman test to verify that the heterogeneity between customers is exogenous to our predictors, which is an important assumption of a random-effects model.
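A minimal sketch of these tests with the plm package (simplified formula; pdat is assumed to hold the positive gross margins observations as a panel):

library(plm)
f      <- log(gm) ~ log1p(freq) + log1p(returns)  # simplified specification
pooled <- plm(f, data = pdat, index = c("cust", "date"), model = "pooling")
fixed  <- plm(f, data = pdat, index = c("cust", "date"), model = "within")
random <- plm(f, data = pdat, index = c("cust", "date"), model = "random")
pFtest(fixed, pooled)  # significant -> individual effects present
phtest(fixed, random)  # significant -> random effects inconsistent, keep fixed effects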

For our gross margins component we assess whether the residuals are normally distributed using both a Shapiro-Wilk and a Kolmogorov-Smirnov normality test. We test for autocorrelation using the Durbin-Watson test, and we assess whether heteroskedasticity is present within our data before and after the company started to register order details, by means of a Breusch-Pagan

test. Selection bias may be present within our model. We therefore re-estimate the model by

using the Heckman procedure. A significant Inverse Mills Ratio indicates that selection bias is

present, and that we have to apply a Heckman correction to our gross margins expectation.
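A sketch of this check with the sampleSelection package (simplified selection and outcome equations; object names hypothetical):

library(sampleSelection)
heck <- heckit(selection = pur ~ rec + freq.lag + categories,
               outcome   = log1p(gm) ~ log1p(freq) + log1p(returns),
               data = dat, method = "2step")
summary(heck)  # inspect the coefficient on the inverse Mills ratio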

We test the predictive accuracy of our models by testing our models on both our estimation

and holdout sample. We use the Mean Absolute Error (MAE), Root Mean Squared Error

(RMSE), and Relative Absolute Error (RAE) for assessing predictive validity.
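In base R, these measures can be written as follows (Appendix B relies on the Metrics package); the naive benchmark we use for the RAE is the previous-period value, so an RAE below 1 means the model beats the naive forecast:

mae  <- function(obs, pred)        mean(abs(obs - pred))
rmse <- function(obs, pred)        sqrt(mean((obs - pred)^2))
rae  <- function(obs, pred, naive) sum(abs(obs - pred)) / sum(abs(obs - naive))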

Once we have validated each model component and have made predictions for our holdout sample, we calculate both the observed and the predicted customer profitability. We divide customers into profitability segments for both the estimation and the holdout sample, and we check whether we observe and predict changes in individual customer profitability based on shifts between profitability segments. We then compare our model to a model that projects the average past contribution onto the future, and to a model without a cost component. Finally, we investigate differences in CP between customer segments by performing a cluster analysis using Ward's method based on Euclidean distance.

The R-codes for our visits and gross margins components and our customer profitability analyses are included in Appendices B and C.


4 RESULTS

In this chapter we present the results of our analyses. We first estimate and predict our visits and gross margins components, and present both models' results. Then, we combine the results of both models to arrive at our customer profitability predictions. We investigate whether our model is able to predict changes in customer profitability over time, and compare our model to simpler variants. Finally, we inspect whether we can find differences between customers for the purpose of customer management.

4.1 NUMBER OF VISITS

For our visits component we first estimated a Poisson model and deleted variables that showed high collinearity and a relatively poor fit compared to correlated predictors (i.e. the cumulative sum of both visits and gross margins, the direct lag of visits, and the cumulative average of gross margins). We then performed a dispersion test, which showed significant overdispersion (dispersion = 1.617, z = 4.188, p = .000). We therefore fitted a Negative Binomial distribution to our data, which performed significantly better than our Poisson model (LL Poisson = -4352.0, LL Negative Binomial = -4243.3, Chi squared = 217.32, p = .000). We continued optimizing our model assuming a Negative Binomial distribution.

Since 32.3% of our observations are zero-observations, we fitted both a zero-inflated and a hurdle model to our data. Both show a large improvement in AIC compared to the regular NBD model (regular NBD = 8506.6, zero-inflated = 8310.7, hurdle = 8306.6). The hurdle model provides the better fit, and offers more flexibility in estimating the zero-observations with different predictors. We therefore continue with the hurdle variant of our model.
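This model-selection step can be sketched as follows (AER and pscl as loaded in Appendix B; the predictor sets shown are simplified):

library(AER); library(pscl)
pois <- glm(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons,
            data = dat, family = poisson)
dispersiontest(pois)  # significant -> overdispersion, Poisson too restrictive
# hurdle(): count part before the "|", zero part after it
hfit <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons |
                 rec + freq.lag + freq.cumsum + categories,
               data = dat, dist = "negbin")
AIC(hfit)             # compare against the zeroinfl() variant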

So far, we have neither considered heterogeneity between customers, nor have we investigated the effect of sales representatives. A model with a customer-specific intercept provides a perfect fit, and is therefore not an option. We fitted variants of our model by adding the effects of sales reps, of customer industry/retail chain, and of both. Our results (table 4.1) indicate that customer industry or retail chain and sales reps both have a significant effect on the number of orders placed by a customer.

Estimates, including confidence intervals and marginal effects, are presented in table 4.2.

We present a comparison between the observed and predicted number of visits in figure 4.1.


Let us first discuss the zero-model: the binary logit model. We observe the strongest effect for recency, which is negative: for each unit increase in recency, keeping all else equal, the odds of purchase are multiplied by 0.653 (a decrease of 34.7%). The strongest positive effect is that of the lag of visits: for each unit increase, keeping all else equal, the odds of purchase increase by 61%.

For the number of visits we observe that for each unit increase in the cumulative average of visits, the expected number of visits increases by 20.1%. Above a certain point this effect diminishes, as we observe a significant negative effect for the squared term of the cumulative average of visits. Thus, customers with a higher average past number of visits are also predicted to show a higher number of visits in the future. For each unit increase in the lag of premium orders, the number of visits increases by 5.2%. During the period in which order details were not registered, the number of visits was 18.4% higher. The GDP of consumers also shows a significant effect: for each unit increase, keeping all else equal, the number of visits increases by 10.2%. We observe no negative effects of predictor variables on the number of visits; only the strength of the increase from the cumulative average of past visits diminishes above a certain point.
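The marginal effects in table 4.2 are the multiplicative effects exp(β) of the estimates, which can be recovered directly from a fitted hurdle model (hfit as in the earlier sketch):

exp(coef(hfit))  # e.g. exp(-0.427) = 0.653 for recency in the zero part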

Our model shows a Relative Absolute Error (RAE) of 0.59 (out-of-sample), which means that it outperforms a naïve model in which the estimated number of visits equals the number of visits in the previous period. We observe a Root Mean Squared Error (RMSE) of 1.665 in our estimation sample, against 1.358 in our validation sample. The Mean Absolute Error (MAE) of 1.358 in our holdout sample indicates that, on average, the predicted number of visits deviates by 1.358 from the observed number of visits.

Model k LL Chisq AIC

Model without qualitative IVs 14 -4082.3 8192.6

Sales Rep 24 -4058.7 47.255 *** 8165.4

Industry/Chain 26 -4039.9 37.615 *** 8131.8

Industry/Chain + Sales Rep 36 -4019.8 40.266 *** 8111.5

Table 4.1: model variants


Figure 4.1: distribution of Visits

Count-model (Negative Binomial)

Variable Estimate Std. error z-value p-value 2.5% 97.5% Marginal

Visits avg 0.183 0.016 11.498 0.000 *** 0.152 0.214 1.201

Premium t-1 0.051 0.017 2.932 0.003 ** 0.017 0.084 1.052

Premium dum 0.169 0.046 3.701 0.000 *** 0.079 0.258 1.184

GDP Cons. 0.097 0.014 8.352 0.000 *** 0.069 0.125 1.102

I(Visits avg2) -0.044 0.006 -7.048 0.000 *** 0.053 0.086 1.072

Zero-model (Binary Logit)

Variable Estimate Std. error z-value p-value 2.5% 97.5% Marginal

Intercept -0.429 0.172 -2.487 0.013 * -0.767 -0.091 0.651

Recency -0.427 0.064 -6.701 0.000 *** -0.552 -0.302 0.653

Visits t-1 0.476 0.064 -6.701 0.000 *** 0.374 0.578 1.610

Visits sum -0.024 0.006 -3.835 0.000 *** -0.037 -0.012 0.976

Categories 0.281 0.019 14.899 0.000 *** 0.244 0.318 1.325

I(Categories2) -0.320 0.069 -4.647 0.000 *** -0.455 -0.185 0.726

Log-likelihood = -4019.8, LR test: Chi squared (33) = 2191.1***, AIC = 8111.51

Table 4.2: estimates Visits model


4.2 GROSS MARGINS

We estimated a multiplicative model for our gross margins (GM) component, again with a zero-hurdle component for the zero-observations, specified identically to the one in our visits model. We dropped the cumulative sum of both frequency and gross margins because of high collinearity. Our initial model is significant (F = 114.4, df = 10; 1878, p = .000) and explains 37.9% of the variance in gross margins.

Since we expect considerable heterogeneity between customers, we have modeled three

variants of our model: (1) a pooled model, (2) a model with fixed customer effects, and (3) a

model with random customer effects. For the fixed effects model we deleted the variable categories, because it is customer-specific and time-invariant and therefore cannot be identified within a fixed effects model. An F-test between the pooled and the fixed effects model shows that significant individual differences are present (F = 2.082, df1 = 346, df2 = 1532, p = .000).

We then estimated a random-effects model, but a Hausman test showed that the individual effects are correlated with our predictors, from which we must conclude that a random-effects model is inconsistent here (Chi squared = 631.59, df = 8, p = .000). Thus, we estimate our model with a fixed effect for each customer, but we do not allow for customer-specific error terms. We further refined our model by deleting recency, the premium dummy, GDP of consumers, and the lag of gross margins. Including quadratic effects of our variables did not improve the model's performance.

Since we estimate a two-stage model, selection bias may be present. We therefore re-estimated our model using the Heckman procedure. The Inverse Mills Ratio was not significant (IMR = 0.050, t = 0.394, p = .900), indicating that we can estimate our model without applying the Heckman correction. No autocorrelation was detected between the residuals of each customer (Durbin-Watson = 2.193, p = 1), but we did find significant heteroskedasticity (Breusch-Pagan = 36.594, df = 5, p = .000). Also, the residuals of our model failed to meet the normality assumption (Shapiro-Wilk = 0.894, p = .000; Kolmogorov-Smirnov = 0.131, p = .000). We therefore performed t-tests of the coefficients with robust standard errors, using the Arellano method, which accounts for heteroskedasticity in fixed effects models.
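With plm and lmtest (both loaded in Appendix B), this robust inference step can be sketched as (fixed as in the earlier panel-model sketch):

library(lmtest)
coeftest(fixed, vcov = vcovHC(fixed, method = "arellano"))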

We present the final estimates of our (censored) gross margins component in table 4.3. We

do not report estimates of the zero-hurdle component, as they are the same as presented in


table 4.2. Our overall model is significant, and explains 36.2% of the variation in gross margins.

Except for the number of premium orders, all estimates are significant. Since we estimated a

multiplicative model, the estimates can be interpreted directly as elasticities. We recovered the original-scale estimates with a retransformation correction of half the residual variance (the standard log-normal correction).

The number of visits shows the largest, positive effect on gross margins: for each 1% increase in visits, gross margins increase by 1.8%. The lag of visits, the cumulative average of gross margins, and product returns show negative effects, with the largest effect resulting from product returns: for each 1% increase in the value of product returns, gross margins decrease by 0.3%. The weighted mean of the fixed effects is 4.767, with (robust) standard error 0.418. Except for 16 of the customers, all fixed effects are significant and positive.

We predict the values for gross margins by taking the exponential of our log-transformed predictions and multiplying them by the probability of purchase. We present our measures of accuracy in table 4.4. Our model shows an RAE of 0.529 on our validation sample, which indicates that it outperforms a naïve model. The out-of-sample MAE is 186.34, which means that, on average, our predictions are 186.34 off the true values of gross margins.
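A sketch of this prediction step (hurdle_fit and gm_fit as in the section 3.1.2 sketch; dat.v denotes the holdout data, as in Appendix B):

p_buy  <- predict(hurdle_fit, newdata = dat.v, type = "response")  # purchase probability
gm_hat <- exp(predict(gm_fit, newdata = dat.v)) * p_buy            # expected gross margins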

Variable Estimate Std. error z-value p-value Elasticity

Visits 5.157 0.142 12.897 0.000 *** 1.826

Visits t-1 -0.215 0.089 -2.722 0.007 ** -0.243

GM avg -0.217 0.070 -3.529 0.000 *** -0.245

Premium 0.211 0.144 1.343 0.179 0.193

Returns -0.284 0.021 -15.656 0.000 *** -0.334

Unbalanced Panel: n = 349, T = 1-8, N = 1889

σ̂ = 1.253, RSS = 2965.8, ESS = 4649.3, R2 = 0.362, Adj. R2 = 0.215

F-statistic (5, 1535) = 174.258, p-value = .000

Table 4.3: estimates gross margins model

Sample MAE RAE RMSE

Estimation (Q1-Q8) 184.99 0.422 521.07

Holdout (Q9-Q11) 175.22 0.498 506.84

Table 4.4: predictive accuracy GM model


4.3 CUSTOMER PROFITABILITY

Now that we have estimated our individual model components, we predict customer profitability.

Means and standard deviations of measured and predicted V, GM, and CP, for both our

estimation and validation sample are presented in table 4.5. The predicted number of visits within the validation period, especially, appears to be far off the observed values, probably because of observations with relatively high values that could be considered outliers (figure 4.1).
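The combination of the two components into a CP prediction can be sketched as follows (hfit as in the visits sketch; handling_cost stands for the firm's average order handling cost, whose value we do not restate here):

v_hat  <- predict(hfit, newdata = dat.v, type = "response")  # expected number of orders
cp_hat <- gm_hat - v_hat * handling_cost                     # CP = gross margins - marketing costs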

On average, the absolute deviation between the observed and predicted values for CP is 204.03 within the estimation period and 187.11 within the validation period. Our model shows an RAE of 0.579 and 0.646 in the estimation and validation period respectively, and thus performs better than a naïve model. The RMSE is 511.13 in the estimation period and 521.92 in the validation period, with a standard deviation of 468.72 in the prediction errors of the estimation period.

4.3.1 CHANGES IN CP OVER TIME

We now investigate changes in CP over time. For this purpose we have divided customers into

three profitability segments: low (0-25%), middle (25-75%), and high (75-100%). We

examine the shifts within segments by comparing the average CP in the year prior to the validation period to the average CP in the validation period. We only take the year prior to the validation period (Q5-Q8) for this comparison, to prevent large changes in CP early in the estimation period from disturbing our comparisons. We also deleted the customers that did not make any purchases within the year prior to the validation period, because (a) including them resulted in a very high percentage of zero-observations that prevented us from dividing customers into realistic profitability segments, as 41.6% of the customers showed a CP of zero within the validation period, and (b) we do not believe this biases our comparison much, since only 3 of the 70 customers that did not purchase within Q5-Q8 eventually made a purchase in Q9-Q11. We will refer to Q5-Q8 as period 1, and to Q9-Q11 as period 2.
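The segment assignment by CP quartiles can be sketched as (hypothetical object names):

q   <- quantile(cp, probs = c(0, 0.25, 0.75, 1))
seg <- cut(cp, breaks = q, labels = c("low", "middle", "high"), include.lowest = TRUE)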

In table 4.6 we show both observed and predicted shifts in CP segments from period 1 to

period 2. 53.3% of the customers did not change profitability segment, which means that 46.7% did. Only 3% of the lowest segment (2 customers) shifts to the highest


segment, while 10% of the highest segment (7 customers) shifts to the lowest segment. When

examining the predicted shifts, these do not look far off: 56.5% of the customers are predicted

to stay within the same CP segment.

If we compare the predicted CP segments with the observed CP segments in period 2 (table 4.7), 66.6% of the segments are classified correctly. The largest errors occur between the lowest and the middle segment: 21.7% of the customers in the middle segment are predicted to be in the lowest segment, and 34.8% of the customers in the lowest segment are predicted to be in the middle segment. A possible explanation for these errors is the distribution within each segment. For example, the difference between the first and the third quartile of the predicted CP is only € 151.24, while the total range is € 5356.30. There is also considerable overlap between the lowest observed CP segment and the middle predicted CP segment. Thus, we conclude that our model predicts high CP reasonably well, but has difficulty predicting lower and average CP.

On average, CP decreases by 232% from period 1 to period 2, while our model predicted an average increase of 220%. We found that investigating these relative changes in CP is not useful, since our data contain many values close to zero. For example, customer X showed a relative CP increase of 80,000%, because his CP in the first period was -€ 0.0625, while he showed an average CP of € 86.11 in the second period.

4.3.2 MODEL VARIANTS

To what extent does our model outperform a model that predicts future CP from the past average CP, or a model without a separate cost component? We took the average observed CP of period 1 (Q5-Q8) to predict CP of period 2 (Q9-Q11), and compared it to the observed CP of period 2. We also compared our model to several simpler variants with fewer model components than our main model, for example a model based on the past average of gross margins with a correction for the predicted purchase probability (denoted ∅).
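A sketch of these benchmark predictions (hypothetical names; cp_q5q8 is a customer-by-quarter matrix of observed CP in period 1, p_buy the predicted purchase probability):

cp_naive <- rowMeans(cp_q5q8)          # CP.past.avg
cp_prob  <- p_buy * rowMeans(cp_q5q8)  # ∅ CP.past.avg, weighted by purchase probability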

We report the MAE of each model in table 4.8. Our main model shows the lowest MAE.

However, the model based on past CP with a correction for purchase probability, especially, comes very close. T-tests comparing the absolute errors of each model with those of our main model showed that none of the simpler models had a significantly higher MAE.


Thus, our model does not perform significantly better than models based on past average contribution, or than models without a separate cost component.

In figure 4.2 we show observed and predicted CP for six customers, together with the predicted CP based on past average CP multiplied by the probability of purchase. Customers A and B show a large under-prediction, customers C and D a large over-prediction, and customers E and F a relatively good fit. As we can see, the simple model does not predict large changes over time. Our main model does show changes in CP over time, but often too large or in the wrong direction. Therefore, the simple model is often just as close to the observed value as our main model.

Visits Gross Margins CP

Mean SD Mean SD Mean SD

Q1-Q8 Actual 1.83 2.41 360 879 231 752

Predicted 1.83 1.81 280 732 151 667

Q9-Q11 Actual 1.92 1.84 245 648 161 568

Predicted 1.01 0.90 194 575 123 546

Table 4.5: summary statistics CP

Q9-Q11 Observed Q9-Q11 Predicted

Q5-Q8 Observed Low Middle High Low Middle High

Low 8.3% 15.9% 0.7% 9.4% 14.5% 1.1%

Middle 14.1% 28.3% 7.6% 11.6% 30.8% 7.6%

High 2.5% 5.8% 16.7% 4.0% 4.7% 16.3%

Table 4.6: shifts in CP segments from Q5-Q8 to Q9-Q11

Predicted

Low Middle High

Observed Low 12.3% 8.7% 4.0%

Middle 10.9% 36.2% 2.9%

High 1.8% 5.1% 18.1%

Table 4.7: observed vs. predicted CP segments in Q9-Q11


Model MAE t-value

Main model 187.11

∅ GM.past.avgit - V̂itC 197.67 -0.507

GM.past.avgit - V̂itC 223.41 -1.696 .

∅ GM.past.avgit 220.41 -1.596

∅ CP.past.avgit 190.69 -0.173

CP.past.avgit 208.36 -1.009

Table 4.8: predictive accuracy CP model variants (holdout sample)

Figure 4.2: patterns in CP for 6 customers (panels A-F)


4.4 CUSTOMER SEGMENTS

We now investigate differences between customers. We divided customers into clusters based

on their average number of orders and spending levels in both Q5-Q8 and Q9-Q11 (Ward

method, Euclidean distance): this way, both components of CP are used for clustering, and the changes from the first to the second period are also captured.

which we present averages across multiple variables in table 4.9 and the distribution of CP within

each segment in figure 4.3.
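This clustering step can be sketched as follows (hypothetical names; standardizing the inputs is our assumption and is not stated in the text):

X   <- scale(cbind(v_p1, v_p2, gm_p1, gm_p2))  # visits and gross margins in periods 1 and 2
hc  <- hclust(dist(X), method = "ward.D2")     # Ward linkage on Euclidean distances
seg <- cutree(hc, k = 6)                       # the six customer segments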

The segments with the highest average visits and gross margins also show the highest CP.

The ordering also holds for the returns ratio: customers in the highest profitability segment return the smallest share of their products, while customers in the lower segments return a large share. Segment 6 shows a returns ratio above 100%, which indicates that these customers returned more than they bought. We can only explain this by customers returning products that they had bought in the first observed periods, in which the company did not yet register product returns. The least profitable segments show relatively the highest rate of product returns. Segments 3 and 4 show why it can be helpful to segment on both visits and gross margins over both periods. Both segments start out very close to each other. However, the profitability of segment 4 increases from period 1 to period 2, because of a large increase in

gross margins. Our model predicted this increase in gross margins, but it was not able to predict

the increase in CP. Instead it predicted that segment 3 would show a large increase in CP, while

that segment stayed relatively stable.

When we performed the cluster analysis based on predicted instead of observed visits and gross margins in period 2, 73.2% of the customers were classified within the same segment as in our first cluster analysis. The fourth cluster appears to be the problem: only 5.8% of the customers in segment 4 are predicted correctly. Thus, we must again conclude that our model components are not able to predict changes in CP over time very well, which was the main purpose of our model.

In table 4.10 we present averages across variables for each customer industry or retail chain.

Although some industries and chains are clearly more or less profitable than others, determining which customer segments are more profitable based on customer industry or retail chain appears to be much more difficult than segmenting based on customer behavior or profitability.


Figure 4.3: distribution of CP within customer segments

Segment 1 2 3 4 5 6

n 24 22 18 52 70 90

Visits Q5-Q8 5.0 3.6 2.1 1.8 1.5 0.7

Q9-Q11 4.9 2.8 1.9 2.6 0.8 0.4

Predicted 2.7 2.2 1.5 1.6 1.0 0.6

Gross margins Q5-Q8 1698 861 317 282 168 33

Q9-Q11 1766 502 239 758 48 17

Predicted 1277 541 235 441 50 15

CP Q5-Q8 1346 607 166 153 63 -19

Q9-Q11 1421 305 104 573 -7 -9

Predicted 1085 387 326 128 -21 -27

CP Segment Q5-Q8 3.0 3.0 2.2 2.2 1.9 1.4

Q9-Q11 3.0 2.6 2.0 3.0 1.6 1.6

Predicted 2.9 2.6 2.6 2.1 1.7 1.7

MAE 3097 1293 1536 769 198 89

Premium orders 8.5 4.0 1.9 1.6 0.4 0.1

Returns ratio 0.05 0.14 0.17 0.29 0.48 1.32

Categories 11.5 10.4 10.1 9.1 6.7 5.3

Table 4.9: averages per customer segment


Figure 4.4: customer profitability per customer industry/retail chain

Industry/chain A B D E F G I O X Y

n 9 19 49 7 6 9 12 82 36 6

Visits 2.8 3.5 2.2 3.3 2.6 1.6 2.2 0.9 1.6 1.1

Gross margins 521 1151 407 698 751 158 397 101 336 146

CP 321 903 252 466 566 47 245 38 224 69

CP Predicted 83 647 154 482 201 188 140 7 259 103

CP Segment 2.3 2.4 2.2 2.5 2.6 1.7 2.5 1.7 2.1 1.9

CP Seg. Pred. 1.6 2.5 2.0 2.3 2.3 2.1 2.1 1.8 2.1 2.0

MAE 1151 1789 780 2073 1609 844 547 213 887 248

Change in CP -1.3 -4.2 1.0 0.3 18.2 -70 -0.4 0.4 -1.1 -0.3

Pred. change CP -1.0 -2.1 0.6 0.8 -1.1 31.4 -0.6 0.3 -1.8 -0.3

Premium orders 4.4 5.5 2.9 1.4 3.8 0.2 2.8 0.4 0.6 0.00

Returns ratio 0.14 0.13 0.47 0.20 0.05 0.28 0.24 1.16 0.44 0.93

Categories 10.3 9.6 9.2 10.7 10.0 9.7 8.4 6.5 6.1 6.5

Table 4.10: averages per customer industry/retail chain


5 DISCUSSION

In this chapter we first present our main findings, in which we discuss contributions and

implications for theory. We then discuss managerial implications, and how our research

contributes to marketing practice. Next, we discuss limitations of our study, and provide

suggestions for future research. The chapter closes with an overall conclusion.

5.1 GENERAL DISCUSSION

We posited that future CP is driven by past customer behavior, customer characteristics, and

both past and current firm actions. We found considerable evidence that past behavior is a

strong predictor of future behavior, which confirms existing theory (Blattberg et al., 2009;

Reinartz et al., 2005). Recency (negative) and the number of orders in the previous period

(positive) show the strongest effects on purchase propensity. The strongest predictor of the

number of orders/visits is the cumulative average of visits up till the previous period. We found

a diminishing effect above a certain point, which indicates that there is an ideal number of

purchases to optimize returns (Niraj, Gupta, & Narasimhan, 2001). The number of visits in both the current and the previous period is shown to be among the strongest predictors of current spending levels. Returns have the strongest negative effect on spending levels. However, we did find that a

higher CP is related to a lower returns ratio, and vice versa.

We also found that the number of categories purchased had a significant effect on purchase

incidence, which confirms the theory that cross-buying has a positive effect (Reinartz, Thomas, & Kumar, 2005; Reinartz & Kumar, 2003; Rust, Kumar, & Venkatesan, 2011).

We found that the GDP of consumers acted as a significant control variable, which confirms the theory that market dynamics influence CP (Grewal, Roggeveen, & Nordfält, 2017; Kumar, Anand, & Song, 2017). Both sales representative and customer group were found to have a significant effect on the number of purchases. Mullins et al. (2014) found that the perception of CP by sales representatives may be biased, and could thus affect actual CP. We did not study perceptions, and thus cannot confirm whether our effect is due to sales reps' perception of CP. However, our results do indicate that a CP model should account for the effect of sales representatives.


For population density we did not find any significant effects, which confirms the theory of

Reinartz and Kumar (2003) that the effect of population density is not significant within a B2B

context. We did not investigate customer size. However, given that we found significant effects of customer group in our visits component, and significant individual fixed effects in our gross margins component, we confirm our expectation that there is considerable heterogeneity between customers.

Based on our previous discussion of our model’s results, we conclude that we have confirmed

our hypotheses that past customer behavior, customer characteristics, and firm actions are

significant drivers of CP.

We hypothesized that a model that accounts for changes in revenues over time would result

in a better performance than a model based on past average contributions. We modelled CP with separate zero-hurdle components for the number of visits (driving costs) and for gross margins, and found that our model does not significantly outperform a simple model that uses the average past contribution to predict future CP. We therefore conclude that our hypothesis of improved model performance when modeling changes in revenues over time is not confirmed. Hence, we could not contradict the findings of Donkers, Verhoef, and de Jong (2007), who found that simple models to predict CP often perform just as well as more sophisticated models.

Also, we did not find evidence to confirm our hypothesis that a model attributing customer costs at the individual customer level outperforms a model without a separate cost component.

5.2 MANAGERIAL IMPLICATIONS

What is the value of our model for managerial decision-making? Since managers often have limited resources in terms of both money and time, we conclude that, at least within our context, a manager can best predict future CP based on past average contribution, possibly extended with a purchase probability model. A model based on past average contribution and purchase probability does not capture changes in CP for future time periods, but it does predict future CP nearly as well as a more sophisticated model. Therefore, when trade-offs between


resources and model performance need to be made, a simple model offers more advantages

compared to a more sophisticated model.

If a manager wishes to determine drivers of customer profitability, our model does provide value, especially in distinguishing between the most and the least profitable customers. It can also be used as a tool to segment and compare customers based on their relative CP. Thus, we conclude that for diagnostic and descriptive purposes, our model can serve as a management tool. For predicting changes in CP over time, our model can provide guidelines, but we recommend not to rely on it as a normative tool, as these changes show large prediction errors, with many under- and overpredictions.

5.3 LIMITATIONS

One major limitation of our research is that the company that was under investigation filed for

bankruptcy after our observation period. The bankruptcy did not come suddenly: the company had been struggling for quite some time. We cannot determine whether this has biased our predictions. For example, right before the bankruptcy, several customers ended their relationship with the firm because the company was unable to refund payments for returned products. Also, in the last six months of our observations, the company launched a new product group that was not present during our estimation period, and we can therefore not determine whether this influenced our predictions. Since a very large share of these products was eventually returned, there is a serious possibility that this made our predictions less accurate.

Although we attributed costs on the individual level based on the average order handling

costs, we needed to make several assumptions regarding cost allocation. For example, the sales

force did not record hours spent on each customer, and it is possible that visits took place

without a purchase, which would not have been captured by our model.

Because of difficulties in estimating a model based on continuous data with negative,

positive, and zero values, we had to exclude product returns from the total gross margins

amounts. Although we did include the returns ratio as a predictor in our final model, it may not have fully captured the true influence of returns on CP.


5.4 FUTURE RESEARCH

We found evidence that product returns have a significant influence on customer profitability

and especially its gross margins component. Especially lower customer profitability seems

related to a higher rate of product returns. Our research only included average order handling costs. However, a company presumably incurs substantial costs for product returns, for example additional inventory and shipping costs. Investigating the influence of product returns on customer profitability while including the costs of product returns could be a promising avenue for further research. Also, since lower levels of CP seem related to higher levels of product returns, managers could experiment with differentiating return policies between CP segments. A manager who chooses to experiment with differentiated return policies should account for the possibility that this differentiated service lowers overall customer satisfaction, and with that, overall CP (Petersen & Kumar, 2015).

Since our model is able to predict changes in CP over time, but often not in the right direction or of the right size, it could be worthwhile to further investigate errors in predicting changes in CP over time. To what extent do under- or overpredictions of CP result in a decrease or increase in actual future CP? A company could, for example, experiment by differentiating its marketing efforts based on expected future CP, and compare results to a control group that did not receive differentiated service based on expected future CP.

5.5 CONCLUSION

In this thesis we aimed to “predict the unpredictable”. We were able to predict changes in customer profitability over time. However, these predicted changes did not track the actually observed CP very well. We therefore conclude that trying to predict the unpredictable is very difficult, perhaps even impossible, especially for managers with scarce resources, who must weigh the investment required to develop a sophisticated model against using a relatively simple model that predicts CP almost equally well.


REFERENCES

Blattberg, R. C., Malthouse, E. C., & Neslin, S. A. (2009). Customer lifetime value: Empirical

generalizations and some conceptual questions. Journal of Interactive Marketing, 23(2),

157-168.

Bolton, R. N., Lemon, K. N., & Verhoef, P. C. (2004). The theoretical underpinnings of customer

asset management: A framework and propositions for future research. Journal of the

Academy of Marketing Science, 32(3), 271-292.

Bowman, D., & Narayandas, D. (2004). Linking customer management effort to customer

profitability in business markets. Journal of Marketing Research, 41(4), 433-447.

Cooper, R., & Kaplan, R. S. (1988). Measure costs right: Make the right decisions. Harvard

Business Review, 66(5), 96-103.

Dhar, R., & Glazer, R. (2003). Hedging customers. Harvard Business Review, 81(5), 86-92.

Donkers, B., Verhoef, P. C., & de Jong, M. G. (2007). Modeling CLV: A test of competing

models in the insurance industry. Quantitative Marketing and Economics, 5(2), 163-

190.

Fader, P. S., & Hardie, B. G. (2009). Probability models for customer-base analysis. Journal of

Interactive Marketing, 23(1), 61-69.

Foster, G., Gupta, M., & Sjoblom, L. (1996). Customer profitability analysis: Challenges and new

directions. Journal of Cost Management, 10, 5-17.

Grewal, D., Roggeveen, A. L., & Nordfält, J. (2017). The future of retailing. Journal of Retailing,

93(1), 1-6.

Gupta, S. (2009). Customer-based valuation. Journal of Interactive Marketing, 23(2), 169-

178.

Gupta, S., Hanssens, D., Hardie, B., Kahn, W., Kumar, V., Lin, N., . . . Sriram, S. (2006). Modeling

customer lifetime value. Journal of Service Research, 9(2), 139-155.

Gupta, S., Lehmann, D. R., & Stuart, J. A. (2004). Valuing customers. Journal of Marketing

Research, 41(1), 7-18.

Holm, M., Kumar, V., & Rohde, C. (2012). Measuring customer profitability in complex

environments: An interdisciplinary contingency framework. Journal of the Academy of

Marketing Science, 40(3), 387-401.


Jackson, B. B. (1985). Build customer relationships that last. Harvard Business Review, 63(6), 120-128.

Kumar, V. (2018). A theory of customer valuation: Concepts, metrics, strategy, and

implementation. Journal of Marketing, 82(1), 1-19.

Kumar, V., Anand, A., & Song, H. (2017). Future of retailer profitability: An organizing

framework. Journal of Retailing, 93(1), 96-119.

Kumar, V., George, M., & Pancras, J. (2008). Cross-buying in retailing: Drivers and

consequences. Journal of Retailing, 84(1), 15-27.

Kumar, V., Ramani, G., & Bohling, T. (2004). Customer lifetime value approaches and best

practice applications. Journal of Interactive Marketing, 18(3), 60-72.

Kumar, V., Venkatesan, R., Bohling, T., & Beckmann, D. (2008). Practice prize Report—The

power of CLV: Managing customer lifetime value at IBM. Marketing Science, 27(4),

585-599.

Mulhern, F. J. (1999). Customer profitability analysis: Measurement, concentration, and

research directions. Journal of Interactive Marketing, 13(1), 25-40.

Mullins, R. R., Ahearne, M., Lam, S. K., Hall, Z. R., & Boichuk, J. P. (2014). Know your customer:

How salesperson perceptions of customer relationship quality form and influence

account profitability. Journal of Marketing, 78(6), 38-58.

Nenonen, S., & Storbacka, K. (2016). Driving shareholder value with customer asset

management: Moving beyond customer lifetime value. Industrial Marketing

Management, 52, 140-150.

Niraj, R., Gupta, M., & Narasimhan, C. (2001). Customer profitability in a supply chain. Journal

of Marketing, 65(3), 1-16.

Petersen, J. A., & Kumar, V. (2015). Perceived risk, product returns, and optimal resource

allocation: Evidence from a field experiment. Journal of Marketing Research, 52(2), 268-

285.

Pfeifer, P. E., Haskins, M. E., & Conroy, R. M. (2005). Customer lifetime value, customer

profitability, and the treatment of acquisition spending. Journal of Managerial Issues, 17(1), 11-25.

Reinartz, W. J., & Kumar, V. (2003). The impact of customer relationship characteristics on

profitable lifetime duration. Journal of Marketing, 67(1), 77-99.


Reinartz, W., Thomas, J. S., & Kumar, V. (2005). Balancing acquisition and retention resources

to maximize customer profitability. Journal of Marketing, 69(1), 63-79.

Rust, R. T., Kumar, V., & Venkatesan, R. (2011). Will the frog change into a prince? predicting

future customer profitability. International Journal of Research in Marketing, 28(4),

281-294.

Rust, R. T., Lemon, K. N., & Zeithaml, V. A. (2004). Return on marketing: Using customer equity

to focus marketing strategy. Journal of Marketing, 68(1), 109-127.

Shah, D., Kumar, V., Qu, Y., & Chen, S. (2012). Unprofitable cross-buying: Evidence from

consumer and business markets. Journal of Marketing, 76(3), 78-95.

Van Raaij, E. M., Vernooij, M. J., & van Triest, S. (2003). The implementation of customer

profitability analysis: A case study. Industrial Marketing Management, 32(7), 573-583.

Venkatesan, R., & Kumar, V. (2004). A customer lifetime value framework for customer

selection and resource allocation strategy. Journal of Marketing, 68(4), 106-125.

Verhoef, P. C., & Lemon, K. N. (2013). Successful customer value management: Key lessons

and emerging trends. European Management Journal, 31(1), 1-15.


APPENDIX A: R-CODE DATA PREPARATION

> rm(list = ls())

> setwd(" ")

> library(dplyr)

> library(zoo)

> library(BTYD)

> library(DataCombine)

> if(file.exists("CustomersFinal.csv")) {

+ customers <- read.csv("CustomersFinal.csv", header = TRUE, sep=",")

+ } else {

+ library(mice)

+ customers <- read.csv("CustomersClean.csv", header = TRUE, sep=",")

+ colnames(customers)[1] <- "cust"

+ customers <- dplyr::select(customers, -woonpl, -naam, -adres3, -betcond, -levwijze, -syscreated, -sysmodified)

+

+ # Population Density

+ bev <- read.csv("Bevolking.csv", header = TRUE, sep=";")

+ gem <- read.csv("Gemeentes2.csv", header = TRUE, sep=",")

+ bev <- filter(bev, grepl("GM", RegioS))

+ bev$RegioS <- as.character(bev$RegioS)

+ bev$RegioS <- substr(bev$RegioS, 3, 7)

+ bev <- dplyr::select(bev, RegioS, Bevolkingsdichtheid_57)

+ colnames(bev) <- c("Gem2017", "pop_dens")

+ bev$Gem2017 <- as.numeric(bev$Gem2017)

+ gem <- left_join(gem, bev, by=c("Gem2017"))

+ gem$Gem2017 <- NULL

+ customers <- left_join(customers, gem, by=c("postcode"))

+ rm(bev, gem)

+

+ # Categories

+ orders <- read.csv("OrdersClean.csv", header = TRUE, sep=",")

+ orderlines <- read.csv("OrderlinesClean.csv", header = TRUE, sep=",")

+ temp <- orders[,1:2]

+ orderlines <- left_join(orderlines, temp, by=c("ordernr"))

+ orderlines <- filter(orderlines, !is.na(debnr))

+ temp <- aggregate(orderlines$groepnaam, by=list(orderlines$debnr), function(x) length(unique(x)))

+ colnames(temp) <- c("cust", "categories")

+ customers <- left_join(customers, temp, by=c("cust"))

+ rm(orders, orderlines, temp)

+

+ # Impute postal code and population density

+ temp <- dplyr::select(customers, -cust)

+ impute <- mice(temp,m=5,maxit=50,meth='pmm',seed=500)

+ completedData <- complete(impute,1)

+ customers$pop_dens <- completedData$pop_dens

+ customers$postcode <- completedData$postcode

+

+ write.csv(customers, "CustomersFinal.csv", col.names = TRUE, row.names = FALSE)

+ }

> if(file.exists("CBT.csv")) {

+ cbt <- read.csv("CBT.csv", header = TRUE, sep=",")

+ } else {

+ orders <- read.csv("OrdersClean.csv", header = TRUE, sep=",")

+ ### Quarterly data

+ elog <- orders[,2:4]

+ colnames(elog) <- c("cust", "date", "sales")

+ elog$filter <- format(as.Date(elog$date), "%Y-%m")

+ elog <- filter(elog, filter != "2005-11" & filter != "2005-12")

+ elog$yq <- as.yearqtr(elog$date, format = "%Y-%m-%d")

+ elog$date <- elog$yq

+ elog[,4:5] <- NULL

+ freq <- data.frame(dc.BuildCBTFromElog(elog, statistic = "freq"))

+ spend <- data.frame(dc.BuildCBTFromElog(elog, statistic = "total.spend"))

+ colnames(spend)[3] <- "gm"

+ cbt <- left_join(freq, spend, by=c("date", "cust"))

+ cbt$pur <- ifelse(cbt$Freq == 0, 0, 1)

+ cbt$cust <- as.numeric(as.character(cbt$cust))

+ colnames(cbt)[3] <- "freq"


+ rm(elog, freq, spend)

+ cbt$date <- as.character(cbt$date)

+

+ ## We only take customers who made a purchase in t=1.

+ # Add time for both first and last purchase to customers df

+ customers$first.pur <- 0

+ customers$last.pur <- 0

+ customers$tot.pur <- 0

+ # customers$all.pur <- 0

+ for (c in unique(cbt$cust)) {

+ cbt.c <- filter(cbt, cust == c)

+ customers$first.pur[customers$cust == c] <- min(which(cbt.c$pur == 1))

+ customers$last.pur[customers$cust == c] <- max(which(cbt.c$pur == 1))

+ customers$tot.pur[customers$cust == c] <- sum(cbt.c$pur==1)

+ # customers$all.pur[customers$debnr == c] <- ifelse(sum(cbt.c$pur) == 13, 1, 0)

+ }

+ customers <- filter(customers, first.pur == 1, cust %in% cbt$cust)

+ cbt <- filter(cbt, cust %in% customers$cust)

+ cbt <- filter(cbt, date != "2009 Q1") # contains a lot of noise

+

+ write.csv(cbt, "CBT.csv", col.names = TRUE, row.names = FALSE)

+ }

> if(file.exists("Dataset.csv")) {

+ cbt <- read.csv("Dataset.csv", header = TRUE, sep=",")

+ } else {

+ cbt <- arrange(cbt, cust, date)

+

+ # Add recency

+ cbt$rec <- 0

+ temp <- cbt[1,]

+ temp$freq <- 999

+ for (c in unique(cbt$cust)) {

+ c.1 <- filter(cbt, cust == c)

+ c.pur <- c.1$pur

+ for (i in 2:12) {

+ c.i <- c.pur[1:i-1]

+ c.max <- max(which(c.i[] == 1))

+ c.1[i,6] <- i-c.max

+ }

+ temp <- rbind(temp, c.1)

+ }

+ temp <- filter(temp, freq != 999)

+ cbt$rec <- temp$rec

+

+ # Lags

+ cbt <- slide(cbt, Var = "freq", GroupVar = "cust", NewVar = "freq.lag", slideBy = -1)

+ cbt <- slide(cbt, Var = "gm", GroupVar = "cust", NewVar = "gm.lag", slideBy = -1)

+

+ # Add dynamic variables:

+ cbt <- arrange(cbt, cust, date)

+ cbt[,11:20] <- 0

+ colnames(cbt)[11:20] <- c("freq.diff", "gm.diff", "freq.ets",

+ "freq.hw", "gm.ets", "gm.hw", "freq.cumsum", "gm.cumsum",

+ "freq.cumavg", "gm.cumavg")

+ temp2 <- cbt[1,]

+ temp2$freq <- 999

+

+ library(forecast)

+ for (c in unique(cbt$cust)) {

+ cbt.c <- filter(cbt, cust == c)

+ for (i in 2:12) {

+ cbt.c$freq.diff <- c(NA, NA, diff(cbt.c$freq, lag = 2))

+ cbt.c$gm.diff <- c(NA, NA, diff(cbt.c$gm, lag = 2))

+ cbt.c$freq.cumsum[i] <- sum(cbt.c$freq[1:i-1])

+ cbt.c$gm.cumsum[i] <- sum(cbt.c$gm[1:i-1])

+ cbt.c$freq.cumavg[i] <- mean(cbt.c$freq[1:i-1])

+ cbt.c$gm.cumavg[i] <- mean(cbt.c$gm[1:i-1])

+ }

+ for (i in 2:11) {

+ cbt.c$freq.ets[i+1] <- forecast(ets(cbt.c$freq[1:i]), 1)$mean

+ cbt.c$freq.hw[i+1] <- forecast(HoltWinters(cbt.c$freq[1:i], beta=FALSE, gamma=FALSE), 1)$mean

+ cbt.c$gm.ets[i+1] <- forecast(ets(cbt.c$gm[1:i]), 1)$mean


+ cbt.c$gm.hw[i+1] <- forecast(HoltWinters(cbt.c$gm[1:i], beta=FALSE, gamma=FALSE), 1)$mean

+ }

+ temp2 <- rbind(temp2, cbt.c)

+ }

+ temp2 <- filter(temp2, freq != 999)

+ temp2 <- temp2[,11:20]

+ cbt[,11:20] <- temp2

+

+ cbt <- slide(cbt, Var = "freq", GroupVar = "cust", NewVar = "freq.lag.2", slideBy = -2)

+ cbt <- slide(cbt, Var = "gm", GroupVar = "cust", NewVar = "gm.lag.2", slideBy = -2)

+

+ # Premium & Returns

+ orderlines <- read.csv("OrderlinesClean.csv", header = TRUE, sep=",")

+ orders <- read.csv("OrdersClean.csv", header = TRUE, sep=",")

+ orderlines$premium <- ifelse(orderlines$Class_01 == "SPAAR", 1, 0)

+ temp <- aggregate(orderlines[,21:22], by=list(ordernr = orderlines$ordernr), max)

+ orders <- left_join(orders, temp, by=c("ordernr"))

+ orders$yq <- as.yearqtr(orders$fakdat, format = "%Y-%m-%d")

+ orders <- aggregate(orders[,c(4,8:10,22:23)], by=list(date = orders$yq, cust = orders$debnr), sum)

+ orders$returns <- abs(orders$total_returns)

+

+ orders <- dplyr::select(orders, date, cust, premium, theme, returns)

+ cbt$date <- as.character(cbt$date)

+ orders$date <- as.character(orders$date)

+ cbt <- left_join(cbt, orders, by=c("cust", "date"))

+

+ cbt$premium[is.na(cbt$premium)] <- 0

+ cbt$returns[is.na(cbt$returns)] <- 0

+

+ cbt <- slide(cbt, Var = "premium", GroupVar = "cust", NewVar = "premium.lag", slideBy = -1)

+ cbt <- slide(cbt, Var = "returns", GroupVar = "cust", NewVar = "returns.lag", slideBy = -1)

+

+ # add dummy for observations before Q2 2007 (no information about premium/returns before that time)

+ no.premium.dates <- c("2006 Q1", "2006 Q2", "2006 Q3", "2006 Q4", "2007 Q1")

+ cbt$premium.dum <- ifelse(cbt$date %in% no.premium.dates, 1, 0)

+ rm(temp,orders,orderlines,no.premium.dates, temp2, c, c.i, c.max, c.pur, i, c.1)

+

+ ### --- MARKET VARIABLES

+ # We use the variable national household consumption as change from same period in previous year

+ bbp <- read.csv("BBP.csv", header = TRUE, sep=";")

+ bbp <- bbp[-13,c(2,6)]

+ colnames(bbp) <- c("date", "GDP.Cons")

+ bbp$date <- unique(cbt$date)

+ cbt <- left_join(cbt, bbp, by=c("date"))

+ rm(bbp)

+

+ ### --- COMBINING CBT AND CUSTOMER DATA

+ customers <- filter(customers, cust %in% cbt$cust)

+ cbt <- left_join(cbt, customers, by=c("cust"))

+

+ write.csv(cbt, "Dataset.csv", col.names = TRUE, row.names = FALSE)

+ }

> if(file.exists("DatasetExReturns.csv")) {

+ cbt.new <- read.csv("DatasetExReturns.csv", header = TRUE, sep=",")

+ } else {

+ cbt.new <- cbt

+ cbt.new <- arrange(cbt.new, cust, date)

+

+ # Exclude returns

+ # (1) First for same period

+ cbt.new$gm.new <- ifelse(cbt.new$returns > 0 & cbt.new$returns < 1, cbt.new$gm * cbt.new$returns,

+ cbt.new$gm)

+ # (2) If sales in t is lower than amount returned, then:

+ temp2 <- cbt.new[1,]

+ temp2$freq <- 999

+ for (c in unique(cbt.new$cust)) {

+ cbt.new.c <- filter(cbt.new, cust == c)

+ for (i in 2:12) {

+ if (cbt.new.c$gm.new[i] < 0) {

+ cbt.new.c$gm.new[i-1] <- cbt.new.c$gm.new[i-1] - abs(cbt.new.c$gm.new[i])


+ cbt.new.c$gm.new[i] <- 0

+ }

+ }

+ temp2 <- rbind(temp2, cbt.new.c)

+ }

+ temp2 <- filter(temp2, freq != 999)

+ cbt.new$gm.new <- temp2$gm.new

+ # Repeat

+ temp2 <- cbt.new[1,]

+ temp2$freq <- 999

+ for (c in unique(cbt.new$cust)) {

+ cbt.new.c <- filter(cbt.new, cust == c)

+ for (i in 2:12) {

+ if (cbt.new.c$gm.new[i] < 0) {

+ cbt.new.c$gm.new[i-1] <- cbt.new.c$gm.new[i-1] - abs(cbt.new.c$gm.new[i])

+ cbt.new.c$gm.new[i] <- 0

+ }

+ }

+ temp2 <- rbind(temp2, cbt.new.c)

+ }

+ temp2 <- filter(temp2, freq != 999)

+ cbt.new$gm.new <- temp2$gm.new

+ # Still 39 observations left (excluding t=1)

+ temp2 <- cbt.new[1,]

+ temp2$freq <- 999

+ for (c in unique(cbt.new$cust)) {

+ cbt.new.c <- filter(cbt.new, cust == c)

+ for (i in 2:12) {

+ if (cbt.new.c$gm.new[i] < 0) {

+ cbt.new.c$gm.new[i-1] <- cbt.new.c$gm.new[i-1] - abs(cbt.new.c$gm.new[i])

+ cbt.new.c$gm.new[i] <- 0

+ }

+ }

+ temp2 <- rbind(temp2, cbt.new.c)

+ }

+ temp2 <- filter(temp2, freq != 999)

+ cbt.new$gm.new <- temp2$gm.new

+ # negative sales from unobserved left period

+ cbt.new$gm.new <- ifelse(cbt.new$gm.new < 0, 0, cbt.new$gm.new)

+

+ # Now set lags for gm without returns

+ cbt.new <- slide(cbt.new, Var = "gm.new", GroupVar = "cust", NewVar = "gm.lag", slideBy = -1)

+

+ temp2 <- cbt.new[1,]

+ temp2$freq <- 999

+ for (c in unique(cbt.new$cust)) {

+ cbt.new.c <- filter(cbt.new, cust == c)

+ for (i in 2:12) {

+ cbt.new.c$gm.cumsum[i] <- sum(cbt.new.c$gm.new[1:i-1])

+ cbt.new.c$gm.cumavg[i] <- mean(cbt.new.c$gm.new[1:i-1])

+ }

+ temp2 <- rbind(temp2, cbt.new.c)

+ }

+

+ temp2 <- filter(temp2, freq != 999)

+ cbt.new$gm.cumavg <- temp2$gm.cumavg

+ cbt.new$gm.cumsum <- temp2$gm.cumsum

+ rm(temp2)

+ cbt.new$gm <- cbt.new$gm.new

+ cbt.new <- dplyr::select(cbt.new, cust, date, freq, gm, pur, rec, freq.lag, freq.cumsum, freq.cumavg,

+ gm.lag, gm.cumsum, gm.cumavg, premium, premium.lag, premium.dum,

+ GDP.Cons, represent_id, klantgroep, pop_dens, categories)

+ cbt.new[,c(4,9:12)] <- round(cbt.new[,c(4,9:12)], 2)

+ write.csv(cbt.new, "DatasetExReturns.csv", col.names = TRUE, row.names = FALSE)

+ }


APPENDIX B: R-CODE MODEL COMPONENTS

> rm(list = ls())
> setwd(" ")
> library(dplyr)
> library(MASS)
> library(lmtest)
> library(pscl)
> library(car)
> library(AER)
> library(forecast)
> library(Metrics)
> library(nortest)
> library(plm)
> library(sampleSelection)
> # Import & prepare data
> dat <- read.csv("DatasetExReturns.csv", header = TRUE, sep=",")
> dat <- filter(dat, !is.na(gm.lag)) # Delete t=1 because of NAs for lags
> dat$represent_id <- as.factor(dat$represent_id)
> # Remove customers that didn't buy anything from t2 to t9 (= estimation sample):
> remove <- c(82, 126, 136, 271, 338, 367, 461, 486, 488, 522, 565, 601, 629, 653, 676, 708, 722, 725,
+ 727, 746, 897, 920, 941, 986, 1005, 1009, 1018, 1025, 1324)
> dat <- filter(dat, !cust %in% remove); dat$cust <- as.factor(dat$cust)
> summary(dat)
 cust date freq gm pur rec

1 : 11 2006 Q2: 349 Min. : 0.000 Min. : 0.00 Min. :0.0000 Min. : 1.000

18 : 11 2006 Q3: 349 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 1.000

31 : 11 2006 Q4: 349 Median : 1.000 Median : 43.03 Median :1.0000 Median : 1.000

41 : 11 2007 Q1: 349 Mean : 1.657 Mean : 328.22 Mean :0.6205 Mean : 1.972

43 : 11 2007 Q2: 349 3rd Qu.: 2.000 3rd Qu.: 297.36 3rd Qu.:1.0000 3rd Qu.: 2.000

48 : 11 2007 Q3: 349 Max. :26.000 Max. :15487.74 Max. :1.0000 Max. :10.000

(Other):3773 (Other):1745

freq.lag freq.cumsum freq.cumavg gm.lag gm.cumsum gm.cumavg

Min. : 0.000 Min. : 1.00 Min. : 0.180 Min. : 0.00 Min. : 0.0 Min. : 0.00

1st Qu.: 0.000 1st Qu.: 4.00 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.: 401.7 1st Qu.: 81.69

Median : 1.000 Median : 8.00 Median : 1.670 Median : 65.04 Median : 936.4 Median : 199.01

Mean : 1.809 Mean : 12.67 Mean : 2.311 Mean : 362.05 Mean : 2558.7 Mean : 472.27

3rd Qu.: 2.000 3rd Qu.: 16.00 3rd Qu.: 3.000 3rd Qu.: 340.91 3rd Qu.: 2601.3 3rd Qu.: 471.30

Max. :26.000 Max. :133.00 Max. :24.000 Max. :17026.82 Max. :67772.4 Max. :17026.82

premium premium.lag premium.dum GDP.Cons represent_id klantgroep

Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. :-1.000 6 :869 Overig :1265

1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:-0.400 13 :693 Supermarkt D : 605

Median : 0.0000 Median : 0.0000 Median :0.0000 Median : 1.600 35 :517 Bouwmarkt/Tuincentrum: 517

Mean : 0.1151 Mean : 0.1058 Mean :0.3636 Mean : 1.355 16 :396 Tankstation : 429

3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:1.0000 3rd Qu.: 2.400 27 :352 Supermarkt B : 231

Max. :19.0000 Max. :19.0000 Max. :1.0000 Max. : 4.500 2 :264 Supermarkt Overig : 231

(Other):748 (Other) : 561

pop_dens categories

Min. : 54 Min. : 2.000

1st Qu.: 149 1st Qu.: 3.000

Median : 318 Median : 6.000

Mean : 760 Mean : 6.564

3rd Qu.: 798 3rd Qu.: 9.000

Max. :5771 Max. :15.000

> # Mean center variables that we'll include for quadratic effects
> dat$freq.mc <- scale(dat$freq, center = TRUE, scale = FALSE)[1:3839,1]
> dat$rec.mc <- scale(dat$rec, center = TRUE, scale = FALSE)[1:3839,1]
> dat$freq.lag.mc <- scale(dat$freq.lag, center = TRUE, scale = FALSE)[1:3839,1]
> dat$freq.cumsum.mc <- scale(dat$freq.cumsum, center = TRUE, scale = FALSE)[1:3839,1]
> dat$freq.cumavg.mc <- scale(dat$freq.cumavg, center = TRUE, scale = FALSE)[1:3839,1]
> dat$gm.lag.mc <- scale(dat$gm.lag, center = TRUE, scale = FALSE)[1:3839,1]
> dat$gm.cumavg.mc <- scale(dat$gm.cumavg, center = TRUE, scale = FALSE)[1:3839,1]
> dat$gm.cumsum.mc <- scale(dat$gm.cumsum, center = TRUE, scale = FALSE)[1:3839,1]
> dat$premium.mc <- scale(dat$premium, center = TRUE, scale = FALSE)[1:3839,1]
> dat$premium.lag.mc <- scale(dat$premium.lag, center = TRUE, scale = FALSE)[1:3839,1]
> dat$categories.mc <- scale(dat$categories, center = TRUE, scale = FALSE)[1:3839,1]
> temp <- read.csv("Dataset.csv", header = TRUE, sep=",")
> temp <- filter(temp, !cust %in% remove)
> temp$cust <- as.factor(temp$cust)
> dat <- left_join(dat, temp[,c(1:2,25)], by=c("cust", "date"))
> #write.csv(dat, "DatasetPlusReturns.csv", col.names = TRUE, row.names = FALSE)
> # Divide in estimation and validation sample
> dat$date <- as.numeric(dat$date)
> dat.v <- filter(dat, date >= 10)
> dat <- filter(dat, date < 10)
> dat <- arrange(dat, date, cust)
> # Some functions that we use multiple times throughout the code
> clog <- function(x) log(x + 1)
> vifs <- function(model) {print(summary(model)); print(sqrt(vif(model)) > 2); print(vif(model))}
> assumptions <- function(model) {checkresiduals(model); print(gqtest(model, 0.5)); dwtest(model)}
> performance <- function(obs, pred) {print(rae(obs, pred)); print(rmse(obs, pred)); print(mae(obs, pred))}
> # Inspect data
> dat %>% dplyr::select(freq:GDP.Cons, pop_dens:categories,
+ gm.cumsum) %>% cor() %>% round(2)


 freq gm pur rec freq.lag freq.cumsum freq.cumavg gm.lag gm.cumsum gm.cumavg premium premium.lag
freq 1.00 0.79 0.52 -0.29 0.66 0.39 0.66 0.59 0.44 0.62 0.31 0.13
gm 0.79 1.00 0.28 -0.15 0.47 0.29 0.48 0.54 0.42 0.58 0.44 0.15
pur 0.52 0.28 1.00 -0.47 0.37 0.20 0.30 0.22 0.16 0.20 0.10 0.08
rec -0.29 -0.15 -0.47 1.00 -0.38 -0.17 -0.29 -0.20 -0.13 -0.16 -0.05 -0.07
freq.lag 0.66 0.47 0.37 -0.38 1.00 0.51 0.83 0.78 0.53 0.72 0.13 0.26
freq.cumsum 0.39 0.29 0.20 -0.17 0.51 1.00 0.75 0.37 0.85 0.61 0.19 0.23
freq.cumavg 0.66 0.48 0.30 -0.29 0.83 0.75 1.00 0.63 0.70 0.82 0.09 0.11
gm.lag 0.59 0.54 0.22 -0.20 0.78 0.37 0.63 1.00 0.55 0.80 0.20 0.35
gm.cumsum 0.44 0.42 0.16 -0.13 0.53 0.85 0.70 0.55 1.00 0.80 0.20 0.26
gm.cumavg 0.62 0.58 0.20 -0.16 0.72 0.61 0.82 0.80 0.80 1.00 0.11 0.13
premium 0.31 0.44 0.10 -0.05 0.13 0.19 0.09 0.20 0.20 0.11 1.00 0.42
premium.lag 0.13 0.15 0.08 -0.07 0.26 0.23 0.11 0.35 0.26 0.13 0.42 1.00
premium.dum 0.15 0.10 0.17 -0.31 0.19 -0.32 0.12 0.12 -0.18 0.07 -0.15 -0.14
GDP.Cons 0.12 0.07 0.07 -0.07 0.07 -0.12 0.03 0.04 -0.07 0.02 -0.03 -0.02
pop_dens 0.06 0.03 -0.01 0.04 0.08 0.11 0.13 0.03 0.05 0.05 -0.03 -0.04
categories 0.48 0.35 0.48 -0.41 0.45 0.44 0.46 0.34 0.41 0.40 0.14 0.13

 premium.dum GDP.Cons pop_dens categories
freq 0.15 0.12 0.06 0.48
gm 0.10 0.07 0.03 0.35
pur 0.17 0.07 -0.01 0.48
rec -0.31 -0.07 0.04 -0.41
freq.lag 0.19 0.07 0.08 0.45
freq.cumsum -0.32 -0.12 0.11 0.44
freq.cumavg 0.12 0.03 0.13 0.46
gm.lag 0.12 0.04 0.03 0.34
gm.cumsum -0.18 -0.07 0.05 0.41
gm.cumavg 0.07 0.02 0.05 0.40
premium -0.15 -0.03 -0.03 0.14
premium.lag -0.14 -0.02 -0.04 0.13
premium.dum 1.00 0.15 0.00 0.00
GDP.Cons 0.15 1.00 0.00 0.00
pop_dens 0.00 0.00 1.00 -0.05
categories 0.00 0.00 -0.05 1.00

> dat %>% filter(pur == 1) %>% dplyr::select(freq:gm, rec:GDP.Cons, pop_dens:categories,
+   gm.cumsum) %>% cor() %>% round(2)

            freq   gm   rec freq.lag freq.cumsum freq.cumavg gm.lag gm.cumsum gm.cumavg premium premium.lag

freq 1.00 0.78 -0.12 0.61 0.37 0.64 0.58 0.44 0.63 0.31 0.10

gm 0.78 1.00 -0.04 0.43 0.27 0.45 0.52 0.41 0.57 0.43 0.14

rec -0.12 -0.04 1.00 -0.30 -0.16 -0.23 -0.15 -0.10 -0.13 0.00 -0.05

freq.lag 0.61 0.43 -0.30 1.00 0.51 0.85 0.78 0.53 0.73 0.11 0.25

freq.cumsum 0.37 0.27 -0.16 0.51 1.00 0.72 0.36 0.85 0.60 0.19 0.24

freq.cumavg 0.64 0.45 -0.23 0.85 0.72 1.00 0.63 0.69 0.82 0.07 0.09

gm.lag 0.58 0.52 -0.15 0.78 0.36 0.63 1.00 0.55 0.81 0.19 0.35

gm.cumsum 0.44 0.41 -0.10 0.53 0.85 0.69 0.55 1.00 0.80 0.20 0.26

gm.cumavg 0.63 0.57 -0.13 0.73 0.60 0.82 0.81 0.80 1.00 0.09 0.12

premium 0.31 0.43 0.00 0.11 0.19 0.07 0.19 0.20 0.09 1.00 0.42

premium.lag 0.10 0.14 -0.05 0.25 0.24 0.09 0.35 0.26 0.12 0.42 1.00

premium.dum 0.08 0.06 -0.13 0.12 -0.41 0.07 0.08 -0.23 0.04 -0.21 -0.18

GDP.Cons 0.11 0.06 -0.07 0.04 -0.16 0.01 0.03 -0.09 0.01 -0.04 -0.02

pop_dens 0.09 0.04 -0.02 0.12 0.13 0.16 0.04 0.05 0.05 -0.03 -0.04

categories 0.35 0.29 -0.20 0.35 0.43 0.40 0.30 0.41 0.38 0.12 0.11

premium.dum GDP.Cons pop_dens categories

freq 0.08 0.11 0.09 0.35

gm 0.06 0.06 0.04 0.29

rec -0.13 -0.07 -0.02 -0.20

freq.lag 0.12 0.04 0.12 0.35

freq.cumsum -0.41 -0.16 0.13 0.43

freq.cumavg 0.07 0.01 0.16 0.40

gm.lag 0.08 0.03 0.04 0.30

gm.cumsum -0.23 -0.09 0.05 0.41

gm.cumavg 0.04 0.01 0.05 0.38

premium -0.21 -0.04 -0.03 0.12

premium.lag -0.18 -0.02 -0.04 0.11

premium.dum 1.00 0.16 0.00 -0.14

GDP.Cons 0.16 1.00 0.00 -0.05

pop_dens 0.00 0.00 1.00 -0.04

categories -0.14 -0.05 -0.04 1.00

> ##### --- (1) --- VISITS --- (1) --- #####
> # Full model:
> v1 <- glm(freq ~ rec + freq.lag + freq.cumavg + gm.lag + gm.cumavg + premium.lag + premium.dum + GDP.Cons +
+   pop_dens + categories, data = dat, family = poisson(link = "log")); vifs(v1)

Call:

glm(formula = freq ~ rec + freq.lag + freq.cumavg + gm.lag +

gm.cumavg + premium.lag + premium.dum + GDP.Cons + pop_dens +


categories, family = poisson(link = "log"), data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-5.4614 -1.0043 -0.3282 0.3922 8.5629

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.743e-01 7.111e-02 -2.452 0.01421 *

rec -4.454e-01 3.341e-02 -13.331 < 2e-16 ***

freq.lag 1.993e-02 9.930e-03 2.007 0.04479 *

freq.cumavg 8.605e-02 1.082e-02 7.954 1.81e-15 ***

gm.lag 6.246e-05 2.051e-05 3.045 0.00233 **

gm.cumavg -4.561e-05 2.137e-05 -2.134 0.03285 *

premium.lag 2.908e-02 1.342e-02 2.166 0.03028 *

premium.dum 1.305e-01 3.174e-02 4.113 3.91e-05 ***

GDP.Cons 7.333e-02 9.988e-03 7.342 2.11e-13 ***

pop_dens 3.247e-05 1.252e-05 2.593 0.00950 **

categories 9.391e-02 4.667e-03 20.125 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 7057.6 on 2791 degrees of freedom

Residual deviance: 3668.7 on 2781 degrees of freedom

AIC: 8709.3

Number of Fisher Scoring iterations: 6

rec freq.lag freq.cumavg gm.lag gm.cumavg premium.lag premium.dum GDP.Cons pop_dens

FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE

categories

FALSE

rec freq.lag freq.cumavg gm.lag gm.cumavg premium.lag premium.dum GDP.Cons pop_dens

1.137372 11.723727 12.465840 10.899398 11.578315 1.557589 1.238092 1.045716 1.151738

categories

1.473479

> # (1) Resolve multicollinearity:
> v1a <- update(v1, . ~ . - freq.lag - freq.cumavg)
> v1a <- update(v1, . ~ . - gm.lag - gm.cumavg)
> v1b <- update(v1, . ~ . - freq.lag - gm.lag)
> v1c <- update(v1, . ~ . - freq.lag - gm.cumavg)
> v1d <- update(v1, . ~ . - freq.cumavg - gm.lag)
> v1e <- update(v1, . ~ . - freq.cumavg - gm.cumavg)
> AIC(v1b, v1c, v1d, v1e); BIC(v1b, v1c, v1d, v1e)
    df      AIC

v1b 9 8741.470

v1c 9 8721.961

v1d 9 8770.447

v1e 9 8783.177

df BIC

v1b 9 8794.881

v1c 9 8775.371

v1d 9 8823.858

v1e 9 8836.588

> # v1c performs best + multicollinearity is resolved
> v1 <- v1c; vifs(v1)

Call:

glm(formula = freq ~ rec + freq.cumavg + gm.lag + premium.lag +

premium.dum + GDP.Cons + pop_dens + categories, family = poisson(link = "log"),

data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-4.8621 -1.0025 -0.3318 0.3996 8.4674

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.438e-01 7.108e-02 -2.024 0.04300 *

rec -4.630e-01 3.334e-02 -13.888 < 2e-16 ***

freq.cumavg 9.205e-02 4.796e-03 19.191 < 2e-16 ***

gm.lag 4.401e-05 9.158e-06 4.806 1.54e-06 ***

premium.lag 4.946e-02 1.221e-02 4.049 5.13e-05 ***

premium.dum 1.552e-01 3.112e-02 4.986 6.17e-07 ***

GDP.Cons 7.276e-02 9.980e-03 7.291 3.08e-13 ***

pop_dens 3.855e-05 1.218e-05 3.165 0.00155 **

categories 9.207e-02 4.647e-03 19.813 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 7057.6 on 2791 degrees of freedom

Residual deviance: 3685.4 on 2783 degrees of freedom

AIC: 8722

Number of Fisher Scoring iterations: 6

rec freq.cumavg gm.lag premium.lag premium.dum GDP.Cons pop_dens categories

FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE


rec freq.cumavg gm.lag premium.lag premium.dum GDP.Cons pop_dens categories

1.113060 2.406071 2.259919 1.259330 1.190579 1.045856 1.093382 1.476864

> dispersiontest(v1)

Overdispersion test

data: v1

z = 4.1876, p-value = 1.41e-05

alternative hypothesis: true dispersion is greater than 1

sample estimates:

dispersion

1.616171
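Note on this test: as we understand the implementation in the AER package, dispersiontest() evaluates the Poisson equidispersion restriction against a multiplicatively inflated variance,

H_0: \mathrm{Var}(y_i \mid x_i) = \mu_i \quad \text{versus} \quad H_1: \mathrm{Var}(y_i \mid x_i) = \sigma^2 \mu_i, \; \sigma^2 > 1.

With an estimated dispersion of 1.62 and p < 0.001, equidispersion is rejected, which motivates the Negative Binomial specification fitted next.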

> v1a <- glm.nb(freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum + GDP.Cons +
+   pop_dens + categories, data = dat); AIC(v1, v1a); lrtest(v1, v1a)
    df      AIC

v1 9 8721.961

v1a 10 8506.642

Likelihood ratio test

Model 1: freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum +

GDP.Cons + pop_dens + categories

Model 2: freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum +

GDP.Cons + pop_dens + categories

#Df LogLik Df Chisq Pr(>Chisq)

1 9 -4352.0

2 10 -4243.3 1 217.32 < 2.2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> # try zero-inflated + hurdle:
> v1b <- zeroinfl(freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum + GDP.Cons + pop_dens + categories |
+   rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   data = dat, dist = "negbin"); AIC(v1a, v1b)
Warning message:
In sqrt(diag(vc)[np]) : NaNs produced
    df      AIC

v1a 10 8506.642

v1b 16 8310.695

> v1c <- hurdle(freq ~ rec + freq.cumavg + gm.lag + premium.lag + premium.dum + GDP.Cons + pop_dens + categories |
+   rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   data = dat, dist = "negbin"); AIC(v1a, v1b, v1c)
Warning message:
In sqrt(diag(vc_count)[kx + 1]) : NaNs produced
    df      AIC

v1a 10 8506.642

v1b 16 8310.695

v1c 16 8306.612

> # We choose the hurdle model. Fitting the model:
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons + I(freq.cumavg.mc^2) |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin")
> v1a <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin")
> v1b <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin")
> v1c <- hurdle(freq ~ freq.cumavg + GDP.Cons |
+   rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   data = dat, dist = "negbin")
> v1d <- hurdle(freq ~ freq.cumavg |
+   rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   data = dat, dist = "negbin")
> AIC(v1, v1a, v1b, v1c, v1d) # v1b = best
    df      AIC

v1 13 8281.024

v1a 12 8412.995

v1b 11 8411.500

v1c 10 8422.506

v1d 9 8454.126
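For completeness, the hurdle specification selected here separates the decision to visit at all from the number of visits given at least one. Writing \pi_i for the probability of clearing the zero hurdle (the logit part) and f(\cdot \mid x_i) for the untruncated Negative Binomial density with mean \mu_i (the count part), the implied expectation is the standard

E[y_i \mid x_i] = \pi_i \cdot \frac{\mu_i}{1 - f(0 \mid x_i)},

which is why the two parts can, and here do, carry different sets of covariates.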

> # With or without quadratic term?
> v1a <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) +
+   klantgroep + represent_id | rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin"); vifs(v1a); AIC(v1a)

Call:

hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep + represent_id |

rec + freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")

Pearson residuals:

Min 1Q Median 3Q Max

-1.7289 -0.5857 -0.2242 0.3776 8.2492

Count model coefficients (truncated negbin with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.349365 0.182784 -1.911 0.05596 .

freq.cumavg 0.226669 0.016075 14.101 < 2e-16 ***

premium.lag 0.035984 0.017480 2.059 0.03954 *

GDP.Cons 0.099612 0.014442 6.898 5.29e-12 ***

I(freq.cumavg.mc^2) -0.008199 0.001106 -7.415 1.22e-13 ***

klantgroepDrogist -0.160366 0.203000 -0.790 0.42954

klantgroepOverig -0.540194 0.092232 -5.857 4.72e-09 ***


klantgroepSupermarkt A 0.194173 0.109322 1.776 0.07571 .

klantgroepSupermarkt B 0.193783 0.093179 2.080 0.03755 *

klantgroepSupermarkt C 0.263956 0.143097 1.845 0.06510 .

klantgroepSupermarkt D 0.133835 0.079477 1.684 0.09219 .

klantgroepSupermarkt E 0.243169 0.129522 1.877 0.06046 .

klantgroepSupermarkt F 0.087361 0.127500 0.685 0.49323

klantgroepSupermarkt G -0.092162 0.152539 -0.604 0.54572

klantgroepSupermarkt H 0.234399 0.255640 0.917 0.35919

klantgroepSupermarkt Overig 0.271438 0.104106 2.607 0.00913 **

klantgroepTankstation -0.245506 0.104774 -2.343 0.01912 *

represent_id2 0.235488 0.205991 1.143 0.25296

represent_id6 0.200035 0.170934 1.170 0.24190

represent_id13 -0.020956 0.175829 -0.119 0.90513

represent_id14 0.316662 0.197220 1.606 0.10836

represent_id16 0.183518 0.174210 1.053 0.29214

represent_id22 -0.445171 0.247843 -1.796 0.07247 .

represent_id23 0.226813 0.195366 1.161 0.24566

represent_id27 0.258063 0.174276 1.481 0.13867

represent_id34 0.294302 0.210547 1.398 0.16217

represent_id35 0.205256 0.171956 1.194 0.23261

Log(theta) 1.692253 0.140752 12.023 < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.428856 0.172448 -2.487 0.012887 *

rec -0.426756 0.063687 -6.701 2.07e-11 ***

freq.lag 0.475956 0.052169 9.123 < 2e-16 ***

freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***

categories 0.281413 0.018888 14.899 < 2e-16 ***

I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 5.4317

Number of iterations in BFGS optimization: 41

Log-likelihood: -4058 on 34 Df

[1] 8183.417

> v1b <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin"); vifs(v1b); AIC(v1b)

Call:

hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep | rec +

freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")

Pearson residuals:

Min 1Q Median 3Q Max

-1.6835 -0.5892 -0.2232 0.3806 8.2858

Count model coefficients (truncated negbin with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.233767 0.087743 -2.664 0.00772 **

freq.cumavg 0.238178 0.015664 15.206 < 2e-16 ***

premium.lag 0.037368 0.017560 2.128 0.03334 *

GDP.Cons 0.097508 0.014522 6.715 1.89e-11 ***

I(freq.cumavg.mc^2) -0.008701 0.001065 -8.168 3.13e-16 ***

klantgroepDrogist -0.288771 0.199590 -1.447 0.14795

klantgroepOverig -0.533142 0.087100 -6.121 9.30e-10 ***

klantgroepSupermarkt A 0.252604 0.107073 2.359 0.01832 *

klantgroepSupermarkt B 0.230905 0.090814 2.543 0.01100 *

klantgroepSupermarkt C 0.227836 0.136079 1.674 0.09407 .

klantgroepSupermarkt D 0.156906 0.074444 2.108 0.03506 *

klantgroepSupermarkt E 0.254081 0.122365 2.076 0.03786 *

klantgroepSupermarkt F 0.139156 0.119182 1.168 0.24297

klantgroepSupermarkt G -0.081099 0.150990 -0.537 0.59119

klantgroepSupermarkt H 0.191586 0.255244 0.751 0.45289

klantgroepSupermarkt Overig 0.286725 0.097178 2.951 0.00317 **

klantgroepTankstation -0.224644 0.095998 -2.340 0.01928 *

Log(theta) 1.640681 0.138881 11.814 < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.428856 0.172448 -2.487 0.012887 *

rec -0.426756 0.063687 -6.701 2.07e-11 ***

freq.lag 0.475956 0.052169 9.123 < 2e-16 ***

freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***

categories 0.281413 0.018888 14.899 < 2e-16 ***

I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 5.1587

Number of iterations in BFGS optimization: 29

Log-likelihood: -4072 on 24 Df

[1] 8191.474

> v1c <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + represent_id |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin"); vifs(v1c); AIC(v1c)

Call:

hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + represent_id |

rec + freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")

Pearson residuals:

Min 1Q Median 3Q Max


-1.6154 -0.5809 -0.2315 0.3769 9.8160

Count model coefficients (truncated negbin with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.621028 0.178797 -3.473 0.000514 ***

freq.cumavg 0.297628 0.014618 20.360 < 2e-16 ***

premium.lag 0.047889 0.018515 2.586 0.009697 **

GDP.Cons 0.100025 0.014958 6.687 2.28e-11 ***

I(freq.cumavg.mc^2) -0.011792 0.001087 -10.850 < 2e-16 ***

represent_id2 0.127078 0.207672 0.612 0.540591

represent_id6 0.190873 0.173805 1.098 0.272116

represent_id13 -0.073649 0.178633 -0.412 0.680123

represent_id14 0.384512 0.199690 1.926 0.054162 .

represent_id16 0.270425 0.175664 1.539 0.123696

represent_id22 -0.412954 0.251403 -1.643 0.100466

represent_id23 0.120824 0.192609 0.627 0.530460

represent_id27 0.261183 0.178527 1.463 0.143471

represent_id34 0.307211 0.213905 1.436 0.150944

represent_id35 0.218052 0.175808 1.240 0.214872

Log(theta) 1.477740 0.132398 11.161 < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.428856 0.172448 -2.487 0.012887 *

rec -0.426756 0.063687 -6.701 2.07e-11 ***

freq.lag 0.475956 0.052169 9.123 < 2e-16 ***

freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***

categories 0.281413 0.018888 14.899 < 2e-16 ***

I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 4.383

Number of iterations in BFGS optimization: 27

Log-likelihood: -4110 on 22 Df

[1] 8263.174

> v1d <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin"); vifs(v1d); AIC(v1d)

Call:

hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) | rec + freq.lag +

freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")

Pearson residuals:

Min 1Q Median 3Q Max

-1.6337 -0.5786 -0.2336 0.3557 10.9936

Count model coefficients (truncated negbin with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.524283 0.066630 -7.869 3.59e-15 ***

freq.cumavg 0.320577 0.013788 23.251 < 2e-16 ***

premium.lag 0.050382 0.018790 2.681 0.00733 **

GDP.Cons 0.096470 0.015047 6.411 1.44e-10 ***

I(freq.cumavg.mc^2) -0.012687 0.001043 -12.169 < 2e-16 ***

Log(theta) 1.422972 0.130889 10.872 < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.428856 0.172448 -2.487 0.012887 *

rec -0.426756 0.063687 -6.701 2.07e-11 ***

freq.lag 0.475956 0.052169 9.123 < 2e-16 ***

freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***

categories 0.281413 0.018888 14.899 < 2e-16 ***

I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 4.1494

Number of iterations in BFGS optimization: 22

Log-likelihood: -4128 on 12 Df

freq.cumavg premium.lag GDP.Cons I(freq.cumavg.mc^2)

TRUE FALSE FALSE TRUE

freq.cumavg premium.lag GDP.Cons I(freq.cumavg.mc^2)

11.574419 1.046349 3.097466 5.128628

[1] 8279.72

> AIC(v1a,v1b,v1c,v1d)
    df      AIC

v1a 34 8183.417

v1b 24 8191.474

v1c 22 8263.174

v1d 12 8279.720

> # Final model:
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep + represent_id |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin"); vifs(v1)

Call:

hurdle(formula = freq ~ freq.cumavg + premium.lag + GDP.Cons + I(freq.cumavg.mc^2) + klantgroep + represent_id |

rec + freq.lag + freq.cumsum + categories + I(categories.mc^2), data = dat, dist = "negbin")

Pearson residuals:

Min 1Q Median 3Q Max

-1.7289 -0.5857 -0.2242 0.3776 8.2492


Count model coefficients (truncated negbin with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.349365 0.182784 -1.911 0.05596 .

freq.cumavg 0.226669 0.016075 14.101 < 2e-16 ***

premium.lag 0.035984 0.017480 2.059 0.03954 *

GDP.Cons 0.099612 0.014442 6.898 5.29e-12 ***

I(freq.cumavg.mc^2) -0.008199 0.001106 -7.415 1.22e-13 ***

klantgroepDrogist -0.160366 0.203000 -0.790 0.42954

klantgroepOverig -0.540194 0.092232 -5.857 4.72e-09 ***

klantgroepSupermarkt A 0.194173 0.109322 1.776 0.07571 .

klantgroepSupermarkt B 0.193783 0.093179 2.080 0.03755 *

klantgroepSupermarkt C 0.263956 0.143097 1.845 0.06510 .

klantgroepSupermarkt D 0.133835 0.079477 1.684 0.09219 .

klantgroepSupermarkt E 0.243169 0.129522 1.877 0.06046 .

klantgroepSupermarkt F 0.087361 0.127500 0.685 0.49323

klantgroepSupermarkt G -0.092162 0.152539 -0.604 0.54572

klantgroepSupermarkt H 0.234399 0.255640 0.917 0.35919

klantgroepSupermarkt Overig 0.271438 0.104106 2.607 0.00913 **

klantgroepTankstation -0.245506 0.104774 -2.343 0.01912 *

represent_id2 0.235488 0.205991 1.143 0.25296

represent_id6 0.200035 0.170934 1.170 0.24190

represent_id13 -0.020956 0.175829 -0.119 0.90513

represent_id14 0.316662 0.197220 1.606 0.10836

represent_id16 0.183518 0.174210 1.053 0.29214

represent_id22 -0.445171 0.247843 -1.796 0.07247 .

represent_id23 0.226813 0.195366 1.161 0.24566

represent_id27 0.258063 0.174276 1.481 0.13867

represent_id34 0.294302 0.210547 1.398 0.16217

represent_id35 0.205256 0.171956 1.194 0.23261

Log(theta) 1.692253 0.140752 12.023 < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.428856 0.172448 -2.487 0.012887 *

rec -0.426756 0.063687 -6.701 2.07e-11 ***

freq.lag 0.475956 0.052169 9.123 < 2e-16 ***

freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***

categories 0.281413 0.018888 14.899 < 2e-16 ***

I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 5.4317

Number of iterations in BFGS optimization: 41

Log-likelihood: -4058 on 34 Df

GVIF Df GVIF^(1/(2*Df))

freq.cumavg TRUE FALSE TRUE

premium.lag FALSE FALSE FALSE

GDP.Cons FALSE FALSE FALSE

I(freq.cumavg.mc^2) TRUE FALSE FALSE

klantgroep TRUE TRUE FALSE

represent_id TRUE TRUE FALSE

GVIF Df GVIF^(1/(2*Df))

freq.cumavg 18.644456 1 4.317923

premium.lag 1.095011 1 1.046428

GDP.Cons 3.264560 1 1.806809

I(freq.cumavg.mc^2) 6.743572 1 2.596839

klantgroep 38.898415 12 1.164789

represent_id 218.292832 10 1.309037

> rm(v1a, v1b, v1c, v1d, v1e)
> # --- 4. Obtain final estimates
> v1.est.final <- data.frame(round(v1$coefficients$count, 3))
> v1.est.final <- round(summary(v1)$coef, 3)
> v1.confint <- round(confint(v1), 3)
> v1.est.final; v1.confint
                            round.v1.coefficients.count..3.

(Intercept) -0.349

freq.cumavg 0.227

premium.lag 0.036

GDP.Cons 0.100

I(freq.cumavg.mc^2) -0.008

klantgroepDrogist -0.160

klantgroepOverig -0.540

klantgroepSupermarkt A 0.194

klantgroepSupermarkt B 0.194

klantgroepSupermarkt C 0.264

klantgroepSupermarkt D 0.134

klantgroepSupermarkt E 0.243

klantgroepSupermarkt F 0.087

klantgroepSupermarkt G -0.092

klantgroepSupermarkt H 0.234

klantgroepSupermarkt Overig 0.271

klantgroepTankstation -0.246

represent_id2 0.235

represent_id6 0.200

represent_id13 -0.021

represent_id14 0.317

represent_id16 0.184

represent_id22 -0.445

represent_id23 0.227

represent_id27 0.258

represent_id34 0.294


represent_id35 0.205

2.5 % 97.5 %

count_(Intercept) -0.708 0.009

count_freq.cumavg 0.195 0.258

count_premium.lag 0.002 0.070

count_GDP.Cons 0.071 0.128

count_I(freq.cumavg.mc^2) -0.010 -0.006

count_klantgroepDrogist -0.558 0.238

count_klantgroepOverig -0.721 -0.359

count_klantgroepSupermarkt A -0.020 0.408

count_klantgroepSupermarkt B 0.011 0.376

count_klantgroepSupermarkt C -0.017 0.544

count_klantgroepSupermarkt D -0.022 0.290

count_klantgroepSupermarkt E -0.011 0.497

count_klantgroepSupermarkt F -0.163 0.337

count_klantgroepSupermarkt G -0.391 0.207

count_klantgroepSupermarkt H -0.267 0.735

count_klantgroepSupermarkt Overig 0.067 0.475

count_klantgroepTankstation -0.451 -0.040

count_represent_id2 -0.168 0.639

count_represent_id6 -0.135 0.535

count_represent_id13 -0.366 0.324

count_represent_id14 -0.070 0.703

count_represent_id16 -0.158 0.525

count_represent_id22 -0.931 0.041

count_represent_id23 -0.156 0.610

count_represent_id27 -0.084 0.600

count_represent_id34 -0.118 0.707

count_represent_id35 -0.132 0.542

zero_(Intercept) -0.767 -0.091

zero_rec -0.552 -0.302

zero_freq.lag 0.374 0.578

zero_freq.cumsum -0.037 -0.012

zero_categories 0.244 0.318

zero_I(categories.mc^2) -0.035 -0.014

> exp(coef(v1))
                count_(Intercept)         count_freq.cumavg         count_premium.lag

0.7051355 1.2544152 1.0366390

count_GDP.Cons count_I(freq.cumavg.mc^2) count_klantgroepDrogist

1.1047424 0.9918340 0.8518321

count_klantgroepOverig count_klantgroepSupermarkt A count_klantgroepSupermarkt B

0.5826352 1.2143067 1.2138325

count_klantgroepSupermarkt C count_klantgroepSupermarkt D count_klantgroepSupermarkt E

1.3020706 1.1432046 1.2752843

count_klantgroepSupermarkt F count_klantgroepSupermarkt G count_klantgroepSupermarkt H

1.0912907 0.9119570 1.2641491

count_klantgroepSupermarkt Overig count_klantgroepTankstation count_represent_id2

1.3118497 0.7823085 1.2655263

count_represent_id6 count_represent_id13 count_represent_id14

1.2214454 0.9792624 1.3725380

count_represent_id16 count_represent_id22 count_represent_id23

1.2014371 0.6407147 1.2545950

count_represent_id27 count_represent_id34 count_represent_id35

1.2944203 1.3421886 1.2278393

zero_(Intercept) zero_rec zero_freq.lag

0.6512537 0.6526225 1.6095523

zero_freq.cumsum zero_categories zero_I(categories.mc^2)

0.9759296 1.3250002 0.9756941
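Since both parts are estimated on log/logit scales, these exponentiated coefficients read as multiplicative effects: in the count part, e^{\beta} is the factor by which the expected number of visits changes per unit increase in a regressor (e.g., e^{0.227} \approx 1.254 for freq.cumavg), while in the zero part e^{\beta} is an odds ratio for making at least one visit.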

> # --- 5. Model performance
> dat$v1 <- fitted(v1)
> dat.v$v1 <- predict(v1, dat.v)
> performance(dat$freq, dat$v1)
[1] 0.6380279

[1] 1.691994

[1] 1.027299

> performance(dat.v$freq, dat.v$v1)
[1] 0.5804381

[1] 1.305343

[1] 0.7716979
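For reference, the three numbers printed by each performance() call are, in order, the relative absolute error, the root mean squared error, and the mean absolute error. A minimal sketch, assuming the standard definitions used by the Metrics package (the underscored names below are hypothetical re-implementations, for transparency only):

> rae_  <- function(obs, pred) sum(abs(obs - pred)) / sum(abs(obs - mean(obs)))  # RAE < 1 beats a mean-only benchmark
> rmse_ <- function(obs, pred) sqrt(mean((obs - pred)^2))                        # penalizes large errors quadratically
> mae_  <- function(obs, pred) mean(abs(obs - pred))                             # average absolute deviation

Under these definitions, the RAE of 0.58 in the validation sample indicates a clear improvement over simply predicting the mean number of visits.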

> ##### --- (3) --- GROSS MARGINS --- (3) --- #####
> # Zero part
> p1 <- glm(pur ~ rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   family = binomial(link = "logit"), data = dat); vifs(p1)

Call:

glm(formula = pur ~ rec + freq.lag + freq.cumsum + categories +

I(categories.mc^2), family = binomial(link = "logit"), data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-3.0050 -0.5581 0.3732 0.6523 2.3057

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.428856 0.172449 -2.487 0.012887 *

rec -0.426756 0.063687 -6.701 2.07e-11 ***

freq.lag 0.475956 0.052168 9.124 < 2e-16 ***

freq.cumsum -0.024365 0.006353 -3.835 0.000125 ***

categories 0.281413 0.018888 14.899 < 2e-16 ***

I(categories.mc^2) -0.024606 0.005295 -4.647 3.37e-06 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


(Dispersion parameter for binomial family taken to be 1)

Null deviance: 3514.7 on 2791 degrees of freedom

Residual deviance: 2371.3 on 2786 degrees of freedom

AIC: 2383.3

Number of Fisher Scoring iterations: 6

rec freq.lag freq.cumsum categories I(categories.mc^2)

FALSE FALSE FALSE FALSE FALSE

rec freq.lag freq.cumsum categories I(categories.mc^2)

1.446646 1.603007 1.331112 1.222432 1.046219

> dat$p1 <- predict(p1, dat, type = "response"); dat.v$p1 <- predict(p1, dat.v, type = "response")
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons + categories +
+   I(freq.cumavg.mc**2) + klantgroep + represent_id |
+   rec + freq.lag + freq.cumsum + categories + I(categories.mc^2),
+   data = dat, dist = "negbin")
> dat$v1 <- predict(v1, dat, type = "response"); dat.v$v1 <- predict(v1, dat.v, type = "response")
> # Full model
> gm1 <- lm(log(gm+1) ~ log(rec) + log(freq+1) + log(freq.lag+1) + log(gm.lag+1) + log(gm.cumavg+1) +
+   log(premium+1) + premium.dum + log(GDP.Cons+2) + log(pop_dens) + log(categories),
+   data = subset(dat, pur == 1)); vifs(gm1) # no multicollinearity issues

Call:

lm(formula = log(gm + 1) ~ log(rec) + log(freq + 1) + log(freq.lag +

1) + log(gm.lag + 1) + log(gm.cumavg + 1) + log(premium +

1) + premium.dum + log(GDP.Cons + 2) + log(pop_dens) + log(categories),

data = subset(dat, pur == 1))

Residuals:

Min 1Q Median 3Q Max

-7.4544 -0.3708 0.2998 0.9402 4.3774

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.31171 0.33436 0.932 0.35133

log(rec) 0.57145 0.18763 3.046 0.00235 **

log(freq + 1) 2.23260 0.10420 21.426 < 2e-16 ***

log(freq.lag + 1) -0.76847 0.11160 -6.886 7.81e-12 ***

log(gm.lag + 1) 0.17233 0.02869 6.007 2.27e-09 ***

log(gm.cumavg + 1) 0.02032 0.04407 0.461 0.64478

log(premium + 1) 0.25475 0.13203 1.930 0.05382 .

premium.dum 0.48776 0.08361 5.833 6.38e-09 ***

log(GDP.Cons + 2) 0.10793 0.10675 1.011 0.31210

log(pop_dens) 0.02571 0.03426 0.750 0.45309

log(categories) 0.69955 0.08446 8.283 2.26e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.628 on 1878 degrees of freedom

Multiple R-squared: 0.3785, Adjusted R-squared: 0.3752

F-statistic: 114.4 on 10 and 1878 DF, p-value: < 2.2e-16

log(rec) log(freq + 1) log(freq.lag + 1) log(gm.lag + 1) log(gm.cumavg + 1) log(premium + 1)

FALSE FALSE FALSE FALSE FALSE FALSE

premium.dum log(GDP.Cons + 2) log(pop_dens) log(categories)

FALSE FALSE FALSE FALSE

log(rec) log(freq + 1) log(freq.lag + 1) log(gm.lag + 1) log(gm.cumavg + 1) log(premium + 1)

2.094514 1.876297 3.504090 3.522942 2.536442 1.199199

premium.dum log(GDP.Cons + 2) log(pop_dens) log(categories)

1.227559 1.060033 1.038463 1.368221

> # Now try with customer intercept:
> gm2 <- lm(log(gm+1) ~ log(rec) + log(freq+1) + log(freq.lag+1) + log(gm.lag+1) + log(gm.cumavg+1) +
+   log(premium+1) + premium.dum + log(GDP.Cons+2) + cust,
+   data = subset(dat, pur == 1)) # Delete categories & pop_dens = collinear with cust
> # For testing the assumptions we use the plm package (same estimates, but convenient for testing assumptions)
> gm.pooled <- plm(formula(gm1), data = subset(dat, pur == 1), model = "pooling")
> gm.fixed <- plm(log(gm+1) ~ log(rec) + log(freq+1) + log(freq.lag+1) + log(gm.lag+1) + log(gm.cumavg+1) +
+   log(premium+1) + premium.dum + log(GDP.Cons+2), data = subset(dat, pur == 1), model = "within")
> gm.random <- plm(formula(gm1), data = subset(dat, pur == 1), random.method="swar", model="random")
> pFtest(gm.fixed, gm.pooled) # Significant: choose FE model over pooled model

F test for individual effects

data: log(gm + 1) ~ log(rec) + log(freq + 1) + log(freq.lag + 1) + ...

F = 2.0819, df1 = 346, df2 = 1532, p-value < 2.2e-16

alternative hypothesis: significant effects

> plmtest(gm.pooled, effect="individual") # Significant: choose RE model over pooled model

Lagrange Multiplier Test - (Honda) for unbalanced panels

data: formula(gm1)

normal = 3.9349, p-value = 4.162e-05

alternative hypothesis: significant effects

> phtest(gm.fixed, gm.random) # Significant: differences are endogenous to our predictors = use FE model


Hausman Test

data: log(gm + 1) ~ log(rec) + log(freq + 1) + log(freq.lag + 1) + ...

chisq = 631.59, df = 8, p-value < 2.2e-16

alternative hypothesis: one model is inconsistent

> # Further fitting our model:
> gm2a <- update(gm2, .~. - log(rec))
> gm2b <- update(gm2a, .~. - log(GDP.Cons+2))
> gm2c <- update(gm2b, .~. - log(gm.lag+1))
> gm2d <- update(gm2c, .~. - log(premium+1))
> AIC(gm2, gm2a, gm2b, gm2c, gm2d) # Best fit: gm2c
     df      AIC

gm2 358 7179.175

gm2a 357 7177.317

gm2b 356 7176.115

gm2c 355 7175.576

gm2d 354 7187.305

> # Multiple R-squared: 0.5768, Adjusted R-squared: 0.4794
> gm2e <- update(gm2c, .~. + I(freq.lag.mc^2)); AIC(gm2c, gm2e) # 0.4809
     df      AIC

gm2c 355 7175.576

gm2e 356 7170.861

> # FINAL MODEL:
> gm2 <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   premium.dum + cust, data = subset(dat, pur == 1))
> # Omitted variable bias due to exclusion of product returns?
> gm2b <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   premium.dum + log(returns + 1) + cust, data = subset(dat, pur == 1)); vifs(gm2b)

Call:

lm(formula = log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) +

log(gm.cumavg + 1) + log(premium + 1) + premium.dum + log(returns +

1) + cust, data = subset(dat, pur == 1))

Residuals:

Min 1Q Median 3Q Max

-7.0244 -0.3889 0.0724 0.6516 4.5003

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.961926 1.439745 3.446 0.000583 ***

log(freq + 1) 1.823209 0.107100 17.023 < 2e-16 ***

log(freq.lag + 1) -0.249912 0.086190 -2.900 0.003790 **

log(gm.cumavg + 1) -0.243653 0.060333 -4.038 5.65e-05 ***

log(premium + 1) 0.209589 0.134478 1.559 0.119313

premium.dum 0.029012 0.085791 0.338 0.735279

log(returns + 1) -0.330481 0.022297 -14.822 < 2e-16 ***

[ reached getOption("max.print") -- omitted 155 rows ]

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.39 on 1534 degrees of freedom

Multiple R-squared: 0.6298, Adjusted R-squared: 0.5443

F-statistic: 7.371 on 354 and 1534 DF, p-value: < 2.2e-16

GVIF Df GVIF^(1/(2*Df))

log(freq + 1) FALSE FALSE FALSE

log(freq.lag + 1) FALSE FALSE FALSE

log(gm.cumavg + 1) TRUE FALSE FALSE

log(premium + 1) FALSE FALSE FALSE

premium.dum FALSE FALSE FALSE

log(returns + 1) FALSE FALSE FALSE

cust TRUE TRUE FALSE

GVIF Df GVIF^(1/(2*Df))

log(freq + 1) 2.717965 1 1.648625

log(freq.lag + 1) 2.865585 1 1.692804

log(gm.cumavg + 1) 6.517957 1 2.553029

log(premium + 1) 1.705863 1 1.306087

premium.dum 1.771953 1 1.331147

log(returns + 1) 1.703809 1 1.305300

cust 26.665094 348 1.004729

> gm2c <- update(gm2b, .~. - premium.dum); AIC(gm2b, gm2c)
     df      AIC

gm2b 356 6924.741

gm2c 355 6922.881

> gm2d <- update(gm2c, .~. - log(premium + 1)); AIC(gm2b, gm2c, gm2d)
     df      AIC

gm2b 356 6924.741

gm2c 355 6922.881

gm2d 354 6923.796

> # Testing for sample selection bias by estimating a heckit model to obtain the inverse Mills ratio:
> gm.heckit <- heckit(formula(p1), formula(gm2c), data = dat, method= "2step"); summary(gm.heckit)
--------------------------------------------

Tobit 2 model (sample selection model)

2-step Heckman / heckit estimation

2792 observations (903 censored and 1889 observed)

363 free parameters (df = 2430)

Probit selection equation:

Estimate Std. Error t value Pr(>|t|)


(Intercept) -0.190195 0.097764 -1.945 0.051837 .

rec -0.271961 0.035344 -7.695 2.05e-14 ***

freq.lag 0.251860 0.026636 9.456 < 2e-16 ***

freq.cumsum -0.012305 0.003593 -3.425 0.000626 ***

categories 0.163162 0.010455 15.606 < 2e-16 ***

I(categories.mc^2) -0.015068 0.002941 -5.124 3.22e-07 ***

Outcome equation:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.9399186 1.3651958 3.618 0.000302 ***

log(freq + 1) 1.8269359 0.0962099 18.989 < 2e-16 ***

log(freq.lag + 1) -0.2268806 0.1488771 -1.524 0.127652

log(gm.cumavg + 1) -0.2459647 0.0543327 -4.527 6.27e-06 ***

log(premium + 1) 0.1921051 0.1136542 1.690 0.091106 .

log(returns + 1) -0.3343286 0.0176285 -18.965 < 2e-16 ***

[ reached getOption("max.print") -- omitted 154 rows ]

Multiple R-Squared:0.6298, Adjusted R-Squared:0.5443

Error terms:

Estimate Std. Error t value Pr(>|t|)

invMillsRatio 0.04960 0.39377 0.126 0.9

sigma 1.25336 NA NA NA

rho 0.03957 NA NA NA

--------------------------------------------
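For background: the two-step heckit augments the outcome equation with the inverse Mills ratio from the probit selection equation,

\lambda_i = \frac{\phi(z_i'\hat{\gamma})}{\Phi(z_i'\hat{\gamma})},

where \phi and \Phi denote the standard normal density and distribution function. Its coefficient above (0.0496, p \approx 0.9) is nowhere near significant, so there is no indication of sample selection bias here.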

> # IMR: not significant: continue with original model
> gm.fixed <- plm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   log(returns + 1), data = subset(dat, pur == 1), model = "within")
> pdwtest(gm.fixed) # Durbin-Watson: insignificant: no autocorrelation

Durbin-Watson test for serial correlation in panel models

data: log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) + log(returns +

1)

DW = 2.1932, p-value = 1

alternative hypothesis: serial correlation in idiosyncratic errors

> bptest(gm.fixed) # Breusch-Pagan: significant heteroskedasticity

studentized Breusch-Pagan test

data: gm.fixed

BP = 36.594, df = 5, p-value = 7.222e-07

> shapiro.test(gm.fixed$residuals) # Not normally distributed

Shapiro-Wilk normality test

data: gm.fixed$residuals

W = 0.89444, p-value < 2.2e-16

> lillie.test(gm.fixed$residuals)

Lilliefors (Kolmogorov-Smirnov) normality test

data: gm.fixed$residuals

D = 0.13141, p-value < 2.2e-16

> # Obtain final estimates with robust standard errors
> gm.est <- data.frame(coeftest(gm.fixed, vcov. = vcovHC, method = "arellano")[1:5,1:4])
> temp <- data.frame(exp(gm.est[,1]))
> alpha_hat_star <- gm.est[,1]
> sd_alpha_hat_star <- gm.est[,2]
> alpha_hat <- (exp(alpha_hat_star)-1) * exp(-0.5*(sd_alpha_hat_star^2))
> temp <- round(data.frame(alpha_hat), 3)
> gm.est; temp
                      Estimate  Std..Error    t.value     Pr...t..

log(freq + 1) 1.8264566 0.14161467 12.897369 3.286461e-36

log(freq.lag + 1) -0.2430427 0.08929350 -2.721841 6.565345e-03

log(gm.cumavg + 1) -0.2454251 0.06954109 -3.529211 4.290972e-04

log(premium + 1) 0.1933578 0.14396427 1.343096 1.794395e-01

log(returns + 1) -0.3341422 0.02134234 -15.656307 2.269688e-51

alpha_hat

1 5.160

2 -0.215

3 -0.217

4 0.211

5 -0.284
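The transformation applied above converts the fixed-effects estimates on the log scale into approximately unbiased multiplicative effects; to our reading this is the usual retransformation correction

\hat{\alpha} = \left(e^{\hat{\alpha}^*} - 1\right) \cdot e^{-\frac{1}{2}\,\mathrm{se}(\hat{\alpha}^*)^2},

which, for example, maps the log(freq + 1) estimate of 1.826 (robust s.e. 0.142) into the reported effect of 5.160.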

> # Predictive performance
> gm2 <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   log(returns+1) + cust, data = subset(dat, pur == 1)); vifs(gm2)

Call:

lm(formula = log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) +

log(gm.cumavg + 1) + log(premium + 1) + log(returns + 1) +

cust, data = subset(dat, pur == 1))

Residuals:

Min 1Q Median 3Q Max

-7.0267 -0.3883 0.0764 0.6503 4.4820

Coefficients:

Estimate Std. Error t value Pr(>|t|)


(Intercept) 4.994637 1.436078 3.478 0.000519 ***

log(freq + 1) 1.826457 0.106638 17.128 < 2e-16 ***

log(freq.lag + 1) -0.243043 0.083738 -2.902 0.003756 **

log(gm.cumavg + 1) -0.245425 0.060087 -4.084 4.65e-05 ***

log(premium + 1) 0.193358 0.125585 1.540 0.123850

log(returns + 1) -0.334142 0.019487 -17.147 < 2e-16 ***

[ reached getOption("max.print") -- omitted 154 rows ]

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.39 on 1535 degrees of freedom

Multiple R-squared: 0.6297, Adjusted R-squared: 0.5446

F-statistic: 7.396 on 353 and 1535 DF, p-value: < 2.2e-16

GVIF Df GVIF^(1/(2*Df))

log(freq + 1) FALSE FALSE FALSE

log(freq.lag + 1) FALSE FALSE FALSE

log(gm.cumavg + 1) TRUE FALSE FALSE

log(premium + 1) FALSE FALSE FALSE

log(returns + 1) FALSE FALSE FALSE

cust TRUE TRUE FALSE

GVIF Df GVIF^(1/(2*Df))

log(freq + 1) 2.696111 1 1.641984

log(freq.lag + 1) 2.706437 1 1.645125

log(gm.cumavg + 1) 6.468767 1 2.543377

log(premium + 1) 1.488558 1 1.220065

log(returns + 1) 1.302189 1 1.141135

cust 22.371575 348 1.004475

> dat$gm2 <- (exp(predict(gm2, dat)) - 1) * dat$p1; dat.v$gm2 <- (exp(predict(gm2, dat.v))-1) * dat.v$p1
> mean(dat$gm); sd(dat$gm); mean(dat$gm2); sd(dat$gm2)
[1] 359.5807

[1] 878.883

[1] 279.6916

[1] 731.5249

> mean(dat.v$gm); sd(dat.v$gm); mean(dat.v$gm2); sd(dat.v$gm2)
[1] 244.5904

[1] 648.022

[1] 193.8051

[1] 575.2432

> performance(dat$gm, dat$gm2)
[1] 0.4223258

[1] 521.0698

[1] 184.9897

> performance(dat.v$gm, dat.v$gm2)
[1] 0.4977186

[1] 506.8427

[1] 175.2225
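These predictions combine the two model parts: the purchase-incidence probability from the logit and the naively retransformed conditional gross margin,

\widehat{GM}_{it} = \hat{p}_{it} \cdot \left(e^{\widehat{\log(GM_{it}+1)}} - 1\right).

That the predicted means (279.69 and 193.81) fall below the observed means (359.58 and 244.59) is consistent with this naive retransformation, which tends to understate the mean of a right-skewed outcome in the absence of a smearing correction.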

> # Inspect fixed effects
> gm.fe <- round(data.frame(summary(fixef(gm.fixed))),3)
> fx_level_robust1 <- fixef(gm.fixed, vcov = vcovHC(gm.fixed))
> gm.fe.sum <- round(data.frame(summary(fx_level_robust1)),3)
> gm.fe.sum$cust <- row.names(gm.fe.sum)
> dat$cust <- as.character(dat$cust); dat.v$cust <- as.character(dat.v$cust)
> dat <- left_join(dat, gm.fe.sum[,c("Estimate", "cust")], by=c("cust"))
> within_intercept(gm.fixed, vcov = vcovHC)
(overall_intercept)

4.767096

attr(,"se")

[1] 0.4183012

> within_intercept(gm.fixed, vcov = function(x) vcovHC(x, method="arellano", type="HC0"))
(overall_intercept)

4.767096

attr(,"se")

[1] 0.4183012


APPENDIX C: R-CODE CUSTOMER PROFITABILITY

> rm(list = ls())
> setwd(" ")
> library(dplyr)
> library(DescTools)
> library(Metrics)
> library(pscl)
> library(ggplot2)
> dat <- read.csv("DatasetPlusReturns.csv", header = TRUE, sep=",")
> dat$cust <- as.factor(dat$cust)
> dat$represent_id <- as.factor(dat$represent_id)
> dat$date <- as.numeric(dat$date)
> dat <- arrange(dat, date, cust)
> dat.v <- filter(dat, date >= 9)
> dat <- filter(dat, date < 9)
> p1 <- glm(pur ~ rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   family = binomial(link = "logit"), data = dat)
> dat$p1 <- predict(p1, dat, type = "response"); dat.v$p1 <- predict(p1, dat.v, type = "response")
> v1 <- hurdle(freq ~ freq.cumavg + premium.lag + premium.dum + GDP.Cons + categories +
+   I(scale(freq.cumavg, center = TRUE)**2) + klantgroep + represent_id |
+   rec + freq.lag + freq.cumsum + categories + I(scale(categories, center = TRUE)^2),
+   data = dat, dist = "negbin")
> dat$v1 <- fitted(v1); dat.v$v1 <- predict(v1, dat.v)
> gm1 <- lm(log(gm + 1) ~ log(freq + 1) + log(freq.lag + 1) + log(gm.cumavg + 1) + log(premium + 1) +
+   log(returns+1) + cust, data = subset(dat, pur == 1))
> dat$gm1 <- (exp(predict(gm1, dat)) - 1) * dat$p1; dat.v$gm1 <- (exp(predict(gm1, dat.v))-1) * dat.v$p1
> rm(p1, v1, gm1)
> dat <- rbind(dat, dat.v); rm(dat.v)
> cp <- dplyr::select(dat, cust, date, freq, gm, v1, gm1, p1)
> cp[is.na(cp)] <- 0
> cp$holdout <- ifelse(cp$date >= 9, 1, 0)
> costs <- 70.39 # Give here the value for costs per visit
> cp <- mutate(cp, cp = gm - (freq * costs), pred = gm1 - (v1 * costs))
> acc <- function(t) {
+   temp <- filter(cp, date %in% t)
+   print("MAE"); print(MAE(temp$pred, temp$cp))
+   print("RAE"); print(rae(temp$cp, temp$pred))
+   print("RMSE"); print(RMSE(temp$pred, temp$cp))
+ }
> print("Estimation: "); acc(1:8); print("Validation: "); acc(9:11)
[1] "Estimation: "

[1] "MAE"

[1] 204.0338

[1] "RAE"

[1] 0.5790955

[1] "RMSE"

[1] 511.1327

[1] "Validation: "

[1] "MAE"

[1] 187.1082

[1] "RAE"

[1] 0.6459613

[1] "RMSE"

[1] 521.924
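These accuracy figures refer to quarterly customer profitability computed, as in the mutate() call above, as gross margin minus visit-driven costs,

CP_{it} = GM_{it} - v_{it} \cdot c \qquad \text{and} \qquad \widehat{CP}_{it} = \widehat{GM}_{it} - \hat{v}_{it} \cdot c,

where c = 70.39 is the per-visit cost figure plugged in above.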

> ##### --- CP SEGMENTS
> # Compare t6-9 observed with t10-12 observed and predicted
> cp$error <- abs(cp$pred - cp$cp)
> cp.t6_9 <- cp %>% filter(date %in% 5:8) %>% group_by(cust) %>% summarise(v.avg.1 = mean(freq),
+   gm.avg.1 = mean(gm), cp.avg.1 = mean(cp), cp.sd.1 = sd(cp)) %>% dplyr::select(cust, v.avg.1:cp.sd.1)
> cp.t10_12 <- cp %>% filter(date %in% 9:11) %>% group_by(cust) %>% summarise(v.avg.v = mean(freq),
+   gm.avg.v = mean(gm), cp.avg.v = mean(cp), cp.sd.v = sd(cp), pred.avg.v = mean(pred),
+   error.avg.v = mean(error), mad.sum = sum(error)) %>% dplyr::select(cust, v.avg.v:mad.sum)
> cp.tot <- left_join(cp.t6_9, cp.t10_12, by = "cust"); rm(cp.t6_9, cp.t10_12)
> # Make profitability segments
> add.segment <- function(var) {
+   new.var <- ifelse(var > quantile(var, probs = 0.75), 3,
+     ifelse(var < quantile(var, probs = 0.25), 1, 2))
+   return(new.var)
+ }
> # Since the zero observations disturb our analysis, we delete the customers that did not make any
> # purchase in the year prior to our validation period (70 observations of which only 3 customers eventually
> # did make a purchase in the validation period)
> cp.tot <- filter(cp.tot, cp.avg.1 != 0)
> cp.tot$segment.cp.1 <- add.segment(cp.tot$cp.avg.1); table(cp.tot$segment.cp.1)

1 2 3

69 138 69

> cp.tot$segment.cp.v <- add.segment(cp.tot$cp.avg.v); table(cp.tot$segment.cp.v)

1 2 3

69 138 69

> cp.tot$segment.pred.v <- add.segment(cp.tot$pred.avg.v); table(cp.tot$segment.pred.v)

1 2 3

69 138 69

> # We now make confusion matrices
> confusion.matrix <- function(temp) {
+   temp2 <- data.frame(temp[1:3], temp[4:6], temp[7:9]); temp2 <- round(temp2/sum(temp2),3)
+   return(temp2)
+ }
> shifts.observed <- confusion.matrix(table(cp.tot$segment.cp.1, cp.tot$segment.cp.v))
> shifts.predicted <- confusion.matrix(table(cp.tot$segment.cp.1, cp.tot$segment.pred.v))
> observed.predicted <- confusion.matrix(table(cp.tot$segment.cp.v, cp.tot$segment.pred.v))
> summary(cp.tot$cp.avg.1); summary(cp.tot$cp.avg.v); summary(cp.tot$pred.avg.v)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.


-139.650 -7.434 45.457 216.582 211.411 5659.632

Min. 1st Qu. Median Mean 3rd Qu. Max.

-164.243 -0.082 0.000 199.905 183.306 4190.093

Min. 1st Qu. Median Mean 3rd Qu. Max.

-164.92 -34.62 -15.74 156.22 116.62 5191.38

> # 3 segments: positive, negative, zero (allow +/- 10 for predicted zero)
> cp.tot$seg.cp.pos <- ifelse(cp.tot$cp.avg.v == 0, 0, ifelse(cp.tot$cp.avg.v > 0, 1, -1))
> cp.tot$seg.pred.pos <- ifelse(cp.tot$pred.avg.v < -10, -1, ifelse(cp.tot$pred.avg.v > 10, 1, 0))
> observed.predicted.positive <- confusion.matrix(table(cp.tot$seg.cp.pos, cp.tot$seg.pred.pos))
> ##### --- VOLATILITY
> # % change in CP from Q6-Q9 to Q10-Q12
> cp.tot <- mutate(cp.tot, cp.change = cp.avg.v - cp.avg.1, cp.change.ratio = cp.change / cp.avg.1,
+   pred.change = pred.avg.v - cp.avg.1, pred.change.ratio = pred.change / cp.avg.1)
> # Relative change is not reliable!
> # CP.NoVol = gm.avg.1 * p1 - v1 * costs
> cp.novol <- cp %>% filter(date >= 9)
> cp.novol <- left_join(cp.novol, cp.tot[c("cust", "gm.avg.1", "v.avg.1", "cp.avg.1")], by=c("cust"))
> cp.novol$gm.avg.1[is.na(cp.novol$gm.avg.1)] <- 0
> cp.novol$v.avg.1[is.na(cp.novol$v.avg.1)] <- 0
> cp.novol$cp.avg.1[is.na(cp.novol$cp.avg.1)] <- 0
> cp.novol$pred.novol <- cp.novol$gm.avg.1 * cp.novol$p1 - cp.novol$freq * costs
> cp.novol$pred.novol <- cp.novol$gm.avg.1 * cp.novol$p1 - cp.novol$v1 * costs
> cp.novol$pred.novol <- cp.novol$gm.avg.1 - cp.novol$v1 * costs
> cp.novol$pred.novol <- cp.novol$gm.avg.1 * cp.novol$p1
> cp.novol$pred.novol <- cp.novol$cp.avg.1 * cp.novol$p1
> cp.novol$pred.novol <- cp.novol$cp.avg.1
> cp.novol$pred.novol <- cp.novol$pred
> # MAE and t-test
> MAE(cp.novol$pred.novol, cp.novol$cp)
[1] 187.1082

> cp.novol$MAE.main <- abs(cp.novol$cp - cp.novol$pred)
> cp.novol$MAE.simple <- abs(cp.novol$cp - cp.novol$pred.novol)
> var.test(cp.novol$MAE.main, cp.novol$MAE.simple)

F test to compare two variances

data: cp.novol$MAE.main and cp.novol$MAE.simple

F = 1, num df = 1046, denom df = 1046, p-value = 1

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

0.8857959 1.1289283

sample estimates:

ratio of variances

1

> t.test(cp.novol$MAE.main, cp.novol$MAE.simple, var.equal = TRUE)

Two Sample t-test

data: cp.novol$MAE.main and cp.novol$MAE.simple

t = 0, df = 2092, p-value = 1

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-41.78155 41.78155

sample estimates:

mean of x mean of y

187.1082 187.1082

> # What if we take out the customers that made no purchase in Q5-Q8?
> cp.novol <- filter(cp.novol, cp.avg.1 != 0)
> cp.novol$pred.simple <- cp.novol$cp.avg.1 * cp.novol$p1
> cp.novol$MAE.main <- abs(cp.novol$cp - cp.novol$pred)
> cp.novol$MAE.simple <- abs(cp.novol$cp - cp.novol$pred.simple)
> var.test(cp.novol$MAE.main, cp.novol$MAE.simple)

F test to compare two variances

data: cp.novol$MAE.main and cp.novol$MAE.simple

F = 1.1481, num df = 827, denom df = 827, p-value = 0.04714

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

1.001748 1.315943

sample estimates:

ratio of variances

1.148148

> t.test(cp.novol$MAE.main, cp.novol$MAE.simple, var.equal = FALSE)

Welch Two Sample t-test

data: cp.novol$MAE.main and cp.novol$MAE.simple

t = -0.21362, df = 1646.2, p-value = 0.8309

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-55.01782 44.21045

sample estimates:

mean of x mean of y

232.1723 237.5760

> ##### --- DIFFERENCES BETWEEN CUSTOMERS
> customers <- read.csv("CP_Customers.csv", header = TRUE, sep=",")


> cp.tot$cust <- as.numeric(as.character(cp.tot$cust))
> customers <- left_join(cp.tot, customers, by=c("cust"))
> cust.by.group <- customers %>% group_by(klantgroep) %>% summarise(n(), sum(cp.avg.1), sum(v.avg.1),
+   sum(gm.avg.1), mean(cp.avg.v), sd(cp.avg.v),
+   mean(pred.avg.v), sd(pred.avg.v), mean(segment.cp.1), mean(segment.cp.v),
+   mean(segment.pred.v), mean(cp.change), mean(v.avg.1), sd(v.avg.1),
+   mean(gm.avg.1), sd(gm.avg.1), mean(cp.avg.1), sd(cp.avg.1), mean(v.avg.v),
+   sd(v.avg.v), mean(gm.avg.v), sd(gm.avg.v), mean(premium.tot), max(represent_id),
+   mean(categories))
> cust.by.group[,3:26] <- round(cust.by.group[,3:26],2)
> # Boxplots for observed vs predicted cp, gm, and v per group/rep
> library(RColorBrewer)
> display.brewer.all()
> library(extrafont)
> loadfonts(device = "win")
> par(family = "Dubai Light")
> customers$klantgroep <- factor(customers$klantgroep,levels(customers$klantgroep)[c(4:12,3,1:2,13)])
> levels(customers$klantgroep) <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "O", "X", "Y", "Z")
> customers$represent_id <- as.factor(customers$represent_id)
> levels(customers$represent_id) <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
> ggplot(data=customers, aes(x=klantgroep, y=cp.avg.1, color=klantgroep)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-150, 6000), "Observed CP in Q5-Q8") + theme_bw()
> # Drop 3 outliers:
> ggplot(data=customers, aes(x=klantgroep, y=cp.avg.1, color=klantgroep)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-150, 1500), "Observed CP (Q5-Q8)") +
+   scale_x_discrete("Customer Group") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0))) +
+   theme(axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 0, l = 0))) +
+   theme(legend.position="none")
> # Boxplot MAE per klantgroep
> ggplot(data=customers, aes(x=klantgroep, y=mad.sum, color=klantgroep)) + geom_boxplot() +
+   scale_y_continuous(limits = c(0, 7500), "Absolute Error") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10),
+     axis.title.x = element_blank()) +
+   theme(legend.position="none")
> # All-in-one plot
> temp <- select(customers, cust, klantgroep, cp.avg.1, represent_id); temp$period <- "Q5-Q8 Observed"
> temp2 <- select(customers, cust, klantgroep, cp.avg.v, represent_id); temp2$period <- "Q9-Q11 Observed"
> temp3 <- select(customers, cust, klantgroep, pred.avg.v, represent_id); temp3$period <- "Q9-Q11 Predicted"
> colnames(temp)[3] <- "cp"; colnames(temp2)[3] <- "cp"; colnames(temp3)[3] <- "cp"
> temp <- rbind(temp, temp2, temp3)
> colnames(temp)[5] <- "Period"
> # Compare customer groups
> ggplot(data=temp, aes(x=klantgroep, y=cp, color=Period)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-175, 2000), "Customer Profitability") +
+   scale_x_discrete("Customer Group") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10), axis.title.x = element_blank()) +
+   theme(legend.position="bottom") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
> #+ scale_fill_grey(start = 0.25, end = 0.75, na.value = "red")
> # Compare represent_id
> ggplot(data=temp, aes(x=represent_id, y=cp, color=Period)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-175, 2000), "Customer Profitability") +
+   scale_x_discrete("Sales Representative") + theme_bw() +
+   theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10), axis.title.x = element_blank()) +
+   theme(legend.position="bottom") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
> customers %>% group_by(represent_id) %>% summarise(n())
# A tibble: 11 x 2

represent_id `n()`

<fct> <int>

1 0 7

2 1 10

3 2 66

4 3 53

5 4 6

6 5 30

7 6 12

8 7 9

9 8 32

10 9 7

11 10 44

> # CP Segments
> ggplot(data=temp, aes(x=Period, y=cp, color=Period)) + geom_boxplot() +
+   scale_y_continuous(limits = c(-175, 2000), "Customer Profitability") +
+   theme_bw() + theme(text = element_text(size=14, family="Dubai Light", color = "black")) +
+   theme(axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
+     axis.text = element_text(size=10),
+     axis.title.x = element_blank()) +
+   theme(legend.position="blank") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
> # Plot observed and predicted CP for some customers
> cp <- left_join(cp, cp.novol[,c("cust", "date", "pred.simple")], by=c("cust", "date"))
> temp <- cp; colnames(temp)[c(9,10,12)] <- c("Observed", "Predicted", "Simple")
> make.graphs <- function(var) {
+   for (c in var) {
+     df <- temp %>% filter(cust == c) %>%
+       dplyr::select(date, Observed, Predicted, Simple) %>%


+       tidyr::gather(key = "variable", value = "value", -date)
+     plot <- ggplot(df, aes(x = date, y = value)) + geom_line(aes(color = variable), size = 1) +
+       geom_line(aes(color = variable), size = 1) + geom_line(aes(color = variable), size = 1) +
+       scale_y_continuous(limits = c(min(df$value) - 100, max(df$value) + 100)) +
+       scale_x_discrete(limits = c(1:11)) + theme_bw() +
+       theme(text = element_text(size=10, family="Dubai Light", color = "black"),
+         axis.title.y=element_blank(), axis.title.x=element_blank(), legend.position="blank") +
+       scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))
+     print(plot)
+   }
+ }
> too_low <- c(417, 572, 1307, 163, 121)
> too_high <- c(644, 740, 1298, 160, 1330)
> good <- c(117, 551, 249, 893, 641) #117, 893
> make.graphs(c(121,417)); make.graphs(c(1298,740)); make.graphs(c(117,893)) # 325x210
> temp <- select(cp, holdout, freq, gm, cp); temp$what <- "Observed"
> temp2 <- select(cp, holdout, v1, gm1, pred); temp2$what <- "Predicted"
> colnames(temp)[2:4] <- c("Visits2", "GrossMargins", "CustomerProfitability")
> colnames(temp2)[2:4] <- c("Visits2", "GrossMargins", "CustomerProfitability")
> temp <- rbind(temp, temp2)
> temp$holdout <- as.factor(temp$holdout)
> levels(temp$holdout) <- c("Q1-Q8", "Q9-Q11")
> temp$holdout <- factor(temp$holdout,levels(temp$holdout)[c(2,1)])
> colnames(temp)[5] <- "Visits"
> ggplot(data=temp, aes(x=holdout, y=Visits2, color=Visits)) + geom_boxplot() +
+   scale_x_discrete() +
+   theme_bw() + theme(text = element_text(size=12, family="Dubai Light", color = "black")) +
+   theme(axis.title.x = element_blank(),
+     axis.text = element_text(size=10),
+     axis.title.y = element_blank()) + coord_flip() +
+   theme(legend.position="bottom") + scale_color_manual(values=c("#95dc83", "#eaaf50", "#88d8fe"))