A HIERARCHICAL RULE-BASED INFERENTIAL
MODELLING AND PREDICTION WITH APPLICATION IN
STRATEGIC PURCHASING BEHAVIOUR
A Thesis Submitted to The University of Manchester for the degree of
Doctor of Philosophy
in the Faculty of Humanities
2020
YUN PRIHANTINA MULYANI
ALLIANCE MANCHESTER BUSINESS SCHOOL
List of Contents
List of Contents ................................................................................................. 2
List of Tables ..................................................................................................... 4
List of Figures .................................................................................................... 7
Abbreviation ....................................................................................................... 9
Abstract ............................................................................................................ 10
Declaration ....................................................................................................... 11
Copyright Statement ....................................................................................... 12
Acknowledgements ......................................................................................... 13
Chapter 1 Introduction ................................................................................ 14
1.1. Background ......................................................................................... 14
1.2. Research Questions ............................................................................ 20
1.3. Research Objectives............................................................................ 21
1.4. Research Contributions ....................................................................... 21
1.5. Research Significance ......................................................................... 24
1.6. Thesis Structure .................................................................................. 26
Chapter 2 Research Background ................................................................ 30
2.1. Introduction .......................................................................................... 30
2.2. Introduction to Revenue Management Theories .................................. 30
2.3. Advanced Booking Decision-Making .................................................... 34
2.4. Introduction to Machine Learning ......................................................... 37
2.5. Classification Models: Advantages and Disadvantages ....................... 39
Chapter 3 Research Methodologies ........................................................... 44
3.1. Introduction .......................................................................................... 44
3.2. Research Approach ............................................................................. 44
3.3. Data Collection .................................................................................... 45
3.4. Evidential Reasoning ........................................................................... 50
3.5. Maximum Likelihood Evidential Reasoning (MAKER) Framework ....... 58
3.6. Machine Learning Methods.................................................................. 63
3.7. Sequential Least Squares Programming (SLSQP) .............................. 74
3.8. Evaluation Metrics ............................................................................... 77
3.9. Summary ............................................................................................. 83
Chapter 4 A Hierarchical Rule-based Inferential Modelling and Prediction ......... 85
4.1. Introduction .......................................................................................... 85
4.2. Introduction to MAKER Framework ...................................................... 85
4.3. MAKER Algorithm with Referential Values ........................................... 89
4.4. Belief Rule Base .................................................................................. 98
4.5. The Decomposition of Input Variables ................................................. 99
4.6. Parameter Learning ........................................................................... 105
4.7. A Comparative Analysis ..................................................................... 108
4.8. Summary ........................................................................................... 136
Chapter 5 Application to Customer Classification .................................. 137
5.1. Introduction ........................................................................................ 137
5.2. Theoretical Foundations: Customer Types and Behaviours ............... 138
5.3. Conceptual Framework ...................................................................... 145
5.4. Data Preparation ............................................................................... 153
5.5. Hierarchical Rule-based Models for Customer Classification ............. 154
5.6. Model Comparisons ........................................................................... 183
5.7. Summary ........................................................................................... 198
Chapter 6 Application to Customer Decision Model ............................... 201
6.1. Introduction ........................................................................................ 201
6.2. Conceptual Framework: Input Variables and Decisions ..................... 201
6.3. Data Preparation ................................................................................ 218
6.4. Hierarchical Rule-based Models for Predicting Customer Decisions ... 220
6.5. Model Comparisons ........................................................................... 253
6.6. Summary ........................................................................................... 269
Chapter 7 Conclusions and Recommendations for Future Research .... 271
7.1. Conclusions ....................................................................................... 271
7.2. Limitations and Recommendations for Future Research .................... 274
References ..................................................................................................... 277
Appendices .................................................................................................... 286
List of Tables
Table 1.1. Thesis structure ................................................................................ 27
Table 2.1. Advantages and disadvantages of classification methods ................. 39
Table 3.1. Data characteristics ........................................................................... 48
Table 3.2. Threshold metrics ............................................................................. 79
Table 3.3. Rules of thumb for AUC .................................................................... 82
Table 4.1. An example of data transformation .................................................... 95
Table 4.2. Generated datasets with four input variables ................................... 124
Table 4.3. Performance measures for the dataset 1 ........................................ 129
Table 4.4. Performance measures for the dataset 2 ........................................ 130
Table 4.5. Performance measures for the dataset 3 ........................................ 131
Table 4.6. Performance measures for the dataset 4 ........................................ 132
Table 4.7. Performance measures for the dataset 5 ........................................ 133
Table 4.8. Grand averages of performance measures of the five generated
datasets ........................................................................................................... 134
Table 5.1. Definitions of strategic customers .................................................... 141
Table 5.2. Input variables ................................................................................. 147
Table 5.3. Descriptive statistics and spearman correlation matrix .................... 156
Table 5.4. Percentiles of the dataset ................................................................ 159
Table 5.5. The optimised referential values obtained from MAKER-ER- based
models of the first round .................................................................................. 163
Table 5.6. The frequencies of the referential values of the input variable of TS 164
Table 5.7. The likelihoods of the referential values of the input variable of TS . 164
Table 5.8. The probabilities of referential values of the input variable of TS ..... 165
Table 5.9. The probabilities of referential values of the input variable of HP .... 165
Table 5.10. The joint probabilities of different combinations of the referential values
from input variables HP and TS ....................................................................... 167
Table 5.11. The interdependence indices between the referential values from the
input variables HP and TS ............................................................................... 168
Table 5.12. Interdependence indices between referential values from the input
variables FB and ICR ....................................................................................... 169
Table 5.13. Two adjacent referential values of each input variable of an observation
from the customer-type dataset: {.2105, .3955, 4, .1415} .............................. 171
Table 5.14. The belief rule base of the first group of evidence and the activated
belief rules by an observation of the input variables of group 1 from the customer-
type dataset: {.2105, .3955} ............................................................................. 172
Table 5.15. The belief rule base of the second group of evidence with activated
belief rule base by an observation of the input variables of group 2 from the
customer-type dataset: {4, .1415} .................................................................... 173
Table 5.16. The belief rule base of the top hierarchy of inference with the initial
belief degrees for the customer-type dataset ................................................... 177
Table 5.17. The belief rule base of the top hierarchy of inference with the optimised
belief degrees of the training set of the first fold for the customer-type dataset 178
Table 5.18. The joint similarity degree of the outputs generated by group 1: {.1371,
.8629} and group 2: {.2537, .7463} from the customer-type dataset .............. 179
Table 5.19. Selected hyperparameters of SVM, ANN, CT, and Weighted KNN for
customer type models ...................................................................................... 185
Table 5.20. F-beta scores for customer behaviour classifiers .......................... 186
Table 5.21. Accuracies for customer behaviour classifiers ............................... 187
Table 5.22. Precisions of the test sets for customer behaviour classifiers ........ 188
Table 5.23. Recalls of the test sets for customer behaviour classifiers ............. 189
Table 5.24. The MSEs and AUCs of the prediction models (training set) for
customer type classifiers .................................................................................. 196
Table 5.25. The MSEs and AUCs of the prediction models (test set) for customer
type classifiers ................................................................................................. 197
Table 6.1. Descriptive Statistics and Correlation Matrix ................................... 221
Table 6.2. Percentiles of the dataset ................................................................ 224
Table 6.3. Optimised referential values obtained from MAKER-ER-based models
of the first round ............................................................................................... 230
Table 6.4. The frequencies of the referential values of the input variable of WPT
........................................................................................................................ 231
Table 6.5. The likelihoods of the referential values of the input variable of WPT
........................................................................................................................ 231
Table 6.6. The probabilities of referential values of the input variable of WPT . 233
Table 6.7. The probabilities of referential values of the input variable of APT .. 233
Table 6.8. Joint probabilities for different combinations of referential values from
input variables: WPT and APT ......................................................................... 235
Table 6.9. Interdependence indices for referential values of the input variables:
WPT and APT .................................................................................................. 235
Table 6.10. Interdependence indices for referential values of the input variables:
HP and DD ...................................................................................................... 235
Table 6.11. Interdependence indices for referential values of the input variables:
NF and C ......................................................................................................... 236
Table 6.12. The belief rule base of the first group of evidence and the activated
belief rules by an observation from the customer-decision dataset: {.2946, .1193}
........................................................................................................................ 239
Table 6.13. The belief rule base of the second group of evidence with activated
belief rule base by an observation from the customer-decision dataset: {.3955,
1.9816} ............................................................................................................ 239
Table 6.14. The belief rule base of the third group of evidence with activated belief
rule base by an observation from the customer-decision dataset: {62, 1} ...... 240
Table 6.15. Two adjacent referential values of each input variable of an observation
from the customer-decision dataset: {.2946, .1193, .3954, 1.9816, 62, 1} ........ 240
Table 6.16. Initial belief rule base of the top hierarchy for the customer-decision
dataset ............................................................................................................. 248
Table 6.17. Optimised belief rule base of the top hierarchy with the belief rules activated
by the three MAKER-generated outputs: {(1, .6007), (2, .3993)}; {(1, .8468), (2,
.1532)}; and {(1, .2387), (2, .7613)} .................................................................. 249
Table 6.18. The selected hyperparameters of CT, SVM, KNN, Weighted KNN, and
NN for customer decision models .................................................................... 254
Table 6.19. F-beta scores for customer behaviour classifiers .......................... 256
Table 6.20. Accuracies for customer decision models ..................................... 257
Table 6.21. Precisions of the test sets for customer decision models ............... 258
Table 6.22. Recalls of the test sets for customer decision models ................... 259
Table 6.23. MSEs and AUCs of classifiers for customer decision models ........ 267
List of Figures
Figure 3.1. A single-hidden-layer neural network (Bishop, 2006) ........................ 69
Figure 3.2. The procedure of sequential (least squares) quadratic programming
method .............................................................................................................. 78
Figure 3.3. Confusion matrix of binary problem .................................................. 79
Figure 3.4. ROC curve ....................................................................................... 81
Figure 4.1. Hierarchical MAKER-based training process .................................. 104
Figure 4.2. A hierarchical rule-based inferential modelling and prediction based on
MAKER framework for n groups of evidence ................................................... 107
Figure 4.3. Referential Value-based Discretization Technique: an input variable
(upper), and two input variables (bottom) ......................................................... 115
Figure 4.4. Scatter plot from the datasets ........................................................ 125
Figure 4.5. Plot of the grand average scores of performance measures of the five
generated datasets for each model .................................................................. 135
Figure 5.1. Illustration 1 (several weeks before departure date) ....................... 146
Figure 5.2. Illustration 2 (some days before departure date) ............................ 147
Figure 5.3. Data linkage ................................................................................... 149
Figure 5.4. Hierarchical MAKER frameworks for customer classification ......... 157
Figure 5.5. Scatter plot of the observed data of the training set of the first fold with
plotted optimised referential values in each of the input variables from the
customer-type dataset from the optimisation of the MAKER-ER-based model ... 160
Figure 5.6. Scatter plot of the observed data of the training set of the first fold with
plotted optimised referential values in each of the input variables from the
customer-type dataset from the optimisation of the MAKER-BRB-based model ... 161
Figure 5.7. Individual support of the referential values of each input variable ... 165
Figure 5.8. The ROC curve of the MAKER-ER-based classifier, MAKER-BRB-
based classifier, and all the alternative machine learning methods of the test sets
of the customer-type dataset ........................................................................... 192
Figure 5.9. The PR curve of the MAKER-ER-based classifier, MAKER-BRB-based
classifier, and all the alternative machine learning methods of the test sets of the
customer-type dataset ..................................................................................... 194
Figure 6.1. Conceptual framework for decisions by advanced booking customers
under dynamic pricing ..................................................................................... 209
Figure 6.2. Conceptual framework for decision by advanced booking customers
under dynamic pricing after refinement ............................................................ 209
Figure 6.3. Data linkage for customer decision model ...................................... 215
Figure 6.4. Example of a booking journey ........................................................ 216
Figure 6.5. Hierarchical MAKER framework for customer decision prediction .. 222
Figure 6.6. Scatter plot for observed data, with plotted optimised referential values
for each input variable in the optimisation of MAKER-ER-based model from the
customer-decision dataset. ........................................................................... 226
Figure 6.7. Scatter plot for observed data, with plotted optimised referential values
for each input variable in the optimisation of MAKER-BRB-based model from the
customer-decision dataset. ........................................................................... 227
Figure 6.8. Individual support of referential values of each input variable of the
training set of the first fold of the customer decision dataset ............................ 232
Figure 6.9. The ROC curve of MAKER-ER-based classifier, MAKER-BRB-based
classifier, and all the alternative machine learning methods for the test sets of the
customer-decision dataset ............................................................................... 262
Figure 6.10. The PR curve of MAKER-ER-based classifier, MAKER-BRB-based
classifier, and all the alternative machine learning methods for the test sets of the
customer-decision dataset ............................................................................... 264
Abbreviation
APT Average price trend
AUC Area under the curve
AUCPR Area under the precision-recall curve
AUCROC Area under the receiver operating characteristic curve
BRB Belief rule base
C Customer type
CT Classification tree
DD Days before departure date
D-S Dempster-Shafer
ER Evidential reasoning
FB Frequency of bookings
HP Length of the holding period
ICR Interval between cancelling and booking again
LR Linear regression
KNN K-nearest neighbours
MAKER Maximum likelihood evidential reasoning
MSE Mean squared error
MLP Multilayer perceptron
NB Naïve Bayes
NF Number of flights offered in a day
PNR Passenger name record
PR Precision-recall
SLSQP Sequential (least squares) quadratic programming
TS Time spent confirming a booking
RIMER Rule-based inference methodology using evidential reasoning
ROC Receiver operating characteristics
WPT Waiting patience time
Abstract
Strategic purchasing behaviour has received growing attention in revenue management, as it can cause providers substantial revenue losses. Researchers have highlighted the need to detect strategic customers and predict their decisions. Theoretical models rely on assumptions about how customers make decisions and which factors influence them, while conditioned experiments are relatively expensive and not representative of the actual system. By comparison, statistical and machine learning approaches that learn from historical data can be relatively cheap and representative of actual conditions. However, widely used approaches face challenges of interpretability, overfitting, and stability, which may limit their ability to classify, that is, to predict customer types and decisions.
We propose a conceptual framework and data linkage for detecting strategic customers and predicting customer decisions. The framework and data linkage were developed around cancel-rebook behaviour by two customer types, strategic and myopic, and around two customer decisions: buy or wait. The evidence showed that the input variables in the framework were good predictors of customer types and decisions. Ultimately, we propose a new approach, a hierarchical rule-based inferential modelling and prediction, which integrates statistical analysis, rule-based inference, maximum likelihood prediction, and machine learning in a hierarchical structure. The referential value-based discretisation technique used in this approach can alleviate the information loss and distortion caused by over-generalisation in discretisation, and it captures the structure of the data better than other discretisation techniques. Belief-rule-based inference is used to analyse the relationship between inputs and outputs, and an interdependence index measures the relationships between input variables. The hierarchical structure deals with sparse rule bases by decomposing the input variables into several groups of evidence; the outputs generated by all groups of evidence are then combined to obtain the final inference.
The classifiers, developed based on the maximum likelihood evidential reasoning (MAKER) framework and the proposed hierarchical rule-based inferential modelling and prediction, are transparent and interpretable. They perform better than the majority of alternative classification models on both datasets (customer types and customer decisions), and their performance is similar to that of classification trees.
Keywords: Rule-based Inference, Statistical Analysis, Evidential Reasoning, Machine Learning, Data Discretisation, Probabilistic Inference, Classification, Strategic Customer, Revenue Management
Declaration
No portion of the work referred to in the thesis has been submitted in support of an
application for another degree or qualification of this or any other university or other
institute of learning.
Copyright Statement
The following four notes on copyright and the ownership of intellectual property
rights must be included as written below:
i. The author of this thesis (including any appendices and/or schedules to
this thesis) owns certain copyright or related rights in it (the “Copyright”)
and s/he has given The University of Manchester certain rights to use
such Copyright, including for administrative purposes.
ii. Copies of this thesis, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright,
Designs and Patents Act 1988 (as amended) and regulations issued
under it or, where appropriate, in accordance with licensing agreements
which the University has from time to time. This page must form part of
any such copies made.
iii. The ownership of certain Copyright, patents, designs, trademarks and
other intellectual property (the “Intellectual Property”) and any
reproductions of copyright works in the thesis, for example graphs and
tables (“Reproductions”), which may be described in this thesis, may not
be owned by the author and may be owned by third parties. Such
Intellectual Property and Reproductions cannot and must not be made
available for use without the prior written permission of the owner(s) of the
relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication
and commercialisation of this thesis, the Copyright and any Intellectual
Property and/or Reproductions described in it may take place is available
in the University IP Policy (see
http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=24420), in any
relevant Thesis restriction declarations deposited in the University Library,
The University Library’s regulations (see
http://www.library.manchester.ac.uk/about/regulations/) and in The
University’s policy on Presentation of Theses.
Acknowledgements
I would like to gratefully acknowledge the many people who have journeyed with
me over the last four years as I have worked on this thesis. First, I would like to
express my sincere gratitude to my supervisors, Prof. Jian-bo Yang and
Prof. Dong-Ling Xu, for their guidance, constant supervision, and support in
completing this endeavour.
Second, I would like to thank my family for the encouragement that sustained me
during my study: my beloved husband, Budi Wuryanto, who was always supportive
and by my side through every hard time and struggle; my lovely sons, Dimas F
Alghazi and Tristan H Alghazi, who served as my inspiration to finish my study;
and my father, my mother, and my sisters, who always gave me strength and
motivation.
Third, I thank my colleagues at the University of Manchester, especially in the
Decision and Cognitive Science Research Centre, for all the fun times and
discussions we have had over the last four years.
Fourth, I would like to acknowledge the Indonesia Endowment Fund for Education
(LPDP) for the financial support.
Many thanks and appreciation also go to my colleagues and everyone who has
willingly helped me.
Chapter 1 Introduction
1.1. Background
In businesses characterised by perishable goods and constrained capacity,
dynamic pricing is prevalent. It stimulates demand and increases revenue in the
short term (Cho et al., 2008; Kimes, 1989). In the context of air travel, individual
price differences, in which the ticket fare paid by one passenger may differ from
that paid by the adjacent passenger, are compatible with this practice (Kimes,
2003). The practice has boosted the success of the airline and hospitality
industries (Kimes, 1989; Talurri and Ryzin, 2004).
In the past, airlines segmented their markets based on the belief that customers’
willingness to pay would increase as the consumption date approached; hence,
the company naturally faced different segments over time (Talurri and Ryzin,
2004). This classical segmentation approach may no longer be applicable.
Because of dynamic pricing, price transparency, and the widespread use of price
comparison sites (PCS) and other tools that help customers minimise their travel
and search costs (Bilotkach, 2010; Boyd and Bilegan, 2003), customers have
started to act strategically by timing their purchases, a pattern known as strategic
purchasing behaviour (Anderson and Wilson, 2003). Research has shown that
PCS also influence offline price evaluations; offline travel agents must deal with
this strategic behaviour, as do online agents (Bodu et al., 2015; Toh et al., 2012).
Researchers have given different labels to such behaviour, but with the same
essence. Some of these labels include deal-seeker (Schwartz, 2006), forward-
looking customer (Chevalier and Goolsbee, 2009), and strategic customer
(Anderson and Wilson, 2003).
The success of dynamic pricing relies on effective policies that minimise
cannibalisation across price levels, which occurs when customers with a high
willingness to pay buy at an available lower price, and that induce early purchase.
‘Waiting’ behaviour in strategic purchasing makes demand uncertain and can lead
to underestimated demand projections (Liu and Ryzin, 2008). Customers who
would buy, or are willing to pay, at a higher price may obtain a lower price by
delaying their purchase (Anderson and Wilson, 2003) or by following a
cancel-rebook strategy (Gorin et al., 2012). Gorin et al. (2012) found numerous
examples in airline databases of customers who, having already booked a ticket,
cancelled the booking and rebooked the same flight at a lower price. Because of
this strategic purchasing behaviour, the revenue system records relatively few
bookings at high prices, underestimates future demand for those price classes,
and hence recommends prices that are lower than they should be. This condition
is termed the spiral-down effect (Cooper et al., 2006).
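The spiral-down feedback can be sketched numerically. The sketch below is purely illustrative and not drawn from the thesis data or models: it assumes a fixed true high-fare demand, a fixed share of strategic customers whose cancel-rebook behaviour moves their bookings to a lower fare class, and a simple exponential-smoothing forecaster; all names and numbers are hypothetical.

```python
# Hypothetical illustration of the spiral-down effect (Cooper et al., 2006):
# when strategic customers cancel-rebook at lower fares, high-fare bookings
# are under-recorded, so the forecast of high-fare demand, and with it the
# recommended price, drifts downward round after round.

def next_forecast(prev_forecast: float, observed: float, alpha: float = 0.5) -> float:
    """Exponential smoothing of the demand observed at the high fare."""
    return alpha * observed + (1 - alpha) * prev_forecast

true_high_fare_demand = 100.0   # customers genuinely willing to pay the high fare
strategic_share = 0.3           # fraction who cancel-rebook into a lower fare

forecast = true_high_fare_demand
history = []
for _ in range(6):
    # Only non-strategic customers end up recorded at the high fare.
    observed = true_high_fare_demand * (1 - strategic_share)
    forecast = next_forecast(forecast, observed)
    history.append(round(forecast, 1))

print(history)  # forecast falls from 100 toward 70, well below true demand
```

Setting `strategic_share = 0` keeps the forecast at 100, which makes explicit that the downward drift in this sketch comes entirely from the under-recorded high-fare bookings, not from any change in true demand.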
Some studies have examined the effect of strategic purchasing, finding that it can
cause profit losses of between 7% and 50% (Besanko and Winston, 1990; Levin
et al., 2008; Nair, 2007; Zhang and Cooper, 2008). These findings highlight the
significance, for academia and for practitioners, of the gains achievable by dealing
appropriately with strategic purchasing.
In revenue management practice, research has mainly focused on developing
approaches to deal with strategic customers. Early studies started from
assumptions about how customers make decisions and derived theoretical models
explaining firms’ responses (Aviv and Pazgal, 2008; Levin et al., 2008; Liu and
Ryzin, 2008). These models assumed that all customers act strategically, delaying
their purchase if a future offer is predicted to be cheaper and buying right away
otherwise. This assumption has been criticised as unrealistic: the proportion of
strategic customers in the market likely determines which policy best alleviates the
effect of strategic purchasing behaviour on revenue (Cleophas and Bartke, 2011;
Su, 2007). Recent studies corroborate this view. For example, Wang et al. (2013)
and Lai et al. (2010) varied the percentage of strategic customers and found that
revenue management policies were sensitive to this percentage. In conclusion,
firms should treat the market according to the behaviour that predominates. This
research illustrates the need to explore how strategic customers arise in the
market and how they make purchase decisions.
Several researchers have presented empirical evidence of the extent to which
strategic customers occur in the market. Chevalier and Goolsbee (2009) analysed
buyer behaviour regarding college textbooks and confirmed the presence of
strategic customers. Osadchiy and Bendoly (2011) found that the percentage of
strategic customers ranged between 8% and 38%, using a conditioned simulation
with 155 financially motivated subjects. Li et al. (2014) developed a structural
model to estimate the percentage of strategic customers in a study of real-life
airline ticketing; the proportion ranged from 5.2% to 19.2%.
A laboratory experiment by Mak et al. (2014) indicated that only 6% of customers
were completely myopic – customers who do not intend to strategically time their
purchase. This group always buys right away as long as the price fits with their
valuation. However, most customers were strategic, although their decisions might
deviate somewhat from the optimal one; that is, they might choose to wait when
they should have bought early, and vice versa.
These diverse findings reflect divergent methods, assumptions, and settings, as
well as relatively small samples (Gönsch et al., 2013). In addition, conditioned
experiments may not represent the real situation, in which subjects make purchase
decisions with their own money; hence, biases may occur.
The rationale behind ‘waiting’ behaviour has been discussed in terms of decision-
making theory. Research in this area has mainly examined the factors that
influenced customers’ decision making when they were exposed to any type of
promotion, deals, or discounted products. An interesting finding was that deal-
seeking behaviour was mainly determined by a cognitive process rather than by
emotions (Chandon, Wansink, and Laurent, 2000; Christou, 2011; Lichtenstein et
al., 1993). However, deal evaluation and deal-proneness might intensify
customers’ emotional state, which could induce an intention to purchase (Christou,
2011). Such studies explained how antecedents influenced the customers’
motivation and enticed them to book a deal. Once they were motivated and showed
intention, their behavioural responses could be stimulated and predicted
(Gollwitzer and Brandstätter, 1997). Therefore, it might be possible to identify customer
types by their response to any means of gaining a lower price, through
understanding their behaviour in booking a deal.
Scholars often use theoretical models based on intuitive thinking when the
challenges and expenses of empirical work are high (Cleophas and Bartke, 2011).
However, identifying strategic customers is no longer impractical, since the
industries that apply revenue management have recently adopted advanced
information technology. In addition, initial studies have explored strategic customers and
their associated tangible behaviours. Evidence has shown that the behaviour of
checking prices often and cancelling and then rebooking was correlated with
customers’ intention to obtain a lower price. This behaviour was perceived as a
manifestation of ‘waiting’ behaviour (Toh et al., 2012). Cancel-and-rebook
behaviour in an airline database was significantly related to evidence of strategic
purchasing (Gorin et al., 2012). The researchers found numerous examples of
customers who purposefully monitored the prices, then cancelled and rebooked
when a lower price became available in the system.
Despite the many studies addressing strategic customers, none have explored
how to detect a strategic customer from quantifiable behaviour, specifically cancel-
rebook behaviour. Previous findings have shown that purchase-related activities
were correlated with customers’ intention to obtain lower prices. Those works
became an initial footing to form a detection procedure and a classification model
for predicting customer types through quantifiable behaviour. In addition, limited
research has examined empirical evidence, as most researchers test their models
with numerical experiments. This study aims to bridge the gap in the literature and
to provide a model that is tested using empirical data for detecting strategic
customers through their cancel-rebook behaviour and predicting their decisions.
The existence of strategic customers can cause substantial losses if
inappropriately addressed. Understanding how such behaviour dominates a
market plays a role in formulating revenue management-related policies aimed at
long-term profits. Purchase-related activities potentially demonstrate customers’
intention to look for a lower price. Hence, there is scope for developing methods to
detect strategic customers from available records.
Customer behaviour, in general, is ‘the process and activities people engage in
when searching for, selecting, purchasing, using, evaluating, and disposing of
products and services so as to satisfy their needs and desires’ (Belch and Belch,
1998). Although customer behaviour is complicated to study, a better understanding
of it can help firms to segment the market effectively. Ultimately, this may lead to
gains in revenue (Rickwood and White, 2009). Therefore, it is
important to develop a decision support system to predict customer types based
on their purchase behaviour, and to predict customer decisions in the environment
of dynamic pricing.
This research is important as it may relax the widespread assumption in theoretical
revenue management models that all customers act strategically. It improves the
established revenue management models and helps to identify representative firm-
level responses. We utilised the case of airline ticketing, but the model is
applicable for other adopters of revenue management with similar conditions, such
as hotels.
1.2. Research Questions
The primary goals of this study were (1) to define a detection procedure for
customer types in response to dynamic pricing, through refining the fit between
theory and available data; (2) to develop a new classification model with a
hierarchical rule-based inferential modelling and prediction approach based on the
MAKER framework, to predict customer types from their perceptible purchase-
related activities – especially cancel-rebook behaviour; and (3) to extend the
approach to predicting customer decisions.
To achieve the primary goals of this study, the following questions guided the
analysis:
Q1. What perceptible purchase-related activities might describe the differences
among customer types regarding their responses to dynamic pricing?
Q2. What factors influence customer decisions to buy or to wait, in the environment
of dynamic pricing?
Q3. What are the drawbacks and benefits of alternative models for classification,
that is, predicting discrete outputs (i.e. customer types and decisions)?
Q4. How do MAKER frameworks deal with sparse matrices?
Q5. How do MAKER frameworks work for complex numerical data?
Q6. How do the alternative models perform in customer detection and decision
prediction?
1.3. Research Objectives
The objectives of this research are listed below:
RO1. To construct a conceptual framework for detecting customer types and
decisions in the environment of dynamic pricing.
RO2. To construct MAKER-based frameworks as an alternative classification
model under sparse matrices and complex numerical systems.
RO3. To examine the interpretability and the performance of the proposed model
compared with other alternative models.
1.4. Research Contributions
The research contributions of this work are as follows:
1. A conceptual framework was developed for customer-type detection and data
linkage, based on the literature in revenue management, and was refined
based on available data. This is a research innovation in the field of revenue
management. The framework is useful to detect customer types, which
provides companies with an understanding of the composition of customer
types in their market. In turn, this can lead to an effective policy to deal with
the existence of strategic customers. Most studies in revenue management
have utilised numerical experiments, resulting in divergent findings. Our
framework is interpretable and tested on a real case.
2. A conceptual framework for customer-decision prediction and data linkage
was constructed; this is another innovation in the research. This framework is
useful to predict customer decisions derived from historical data rather than
theoretical assumptions. The model is also useful to inform managerial
decision making to address strategic purchasing. We examined the model with
a real case of airline ticketing.
3. A hierarchical rule-based inferential modelling and prediction approach was
developed to deal with sparse matrices. This is an innovative approach to
modelling and prediction that allows interdependencies between input
variables to be determined without violating statistical requirements. It is
applicable when the data distribution is heavily skewed, such as when joint
frequency matrices violate the statistical requirement for sample size.
Grouping too many referential values together to meet the minimum statistical
requirement (i.e. five cases per cell) leads to substantial loss of information
and to distortion. Our grouping is rooted in the strength of the relationship
between input variables and outputs, as well as the statistical requirements for
sample sizes for pairs of referential values. In addition, the input variables
being formed in a group are conceptually correlated. The hierarchical rule-
based approach can reduce the complexity of rule-based inferential modelling
and prediction as the number of referential values of each input variable and
the number of input variables increase.
4. A referential-value-based data discretisation technique was utilised to deal
with complex numerical data for modelling. The initial referential values were
obtained from an unsupervised discretisation method, such as equal-frequency
discretisation. These values were then trained simultaneously with the other
model parameters, because learning them directly contributes to optimising the
model's performance.
5. We propose an approach that integrates statistical analysis, rule-based
inference, maximum likelihood prediction, and machine learning, embedded in
hierarchical rule-based inferential modelling and prediction. The approach
leads to transparent and interpretable results, where the relationship between
inputs and outputs becomes clear.
6. A hierarchical rule-based inferential modelling and prediction approach was
used to establish MAKER-based classifiers for predicting customer types and
decisions. With heavily skewed distributions, input variables were
decomposed into groups and ER or BRB rule-based inference was then
utilised. Parameters and referential values were learned simultaneously in the
model. The MAKER-based models outperformed complex methods such as
support vector machine and neural networks, which are recognised as black-
box models. The MAKER-based models generally performed better than the
majority of the interpretable classifiers: logistic regression, naïve Bayes, k-
nearest neighbours, distance-based weighted k-nearest neighbours, linear
discriminant, and quadratic discriminant, and performed similarly to the
classification tree. MAKER-based models are transparent and interpretable,
and the relationship between inputs and outputs can be clearly explained.
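The sparse-matrix problem behind contribution 3 can be made concrete with a small sketch. The function names and toy data below are illustrative only (they are not part of the MAKER framework); the sketch simply checks a joint frequency matrix of two input variables against the common five-cases-per-cell rule of thumb:

```python
from collections import Counter
from itertools import product

def joint_frequency(xs, ys):
    """Count how often each pair of referential values co-occurs."""
    return Counter(zip(xs, ys))

def satisfies_five_per_cell(xs, ys, min_count=5):
    """Check the rule of thumb that every cell of the joint frequency
    matrix holds at least `min_count` observations."""
    counts = joint_frequency(xs, ys)
    cells = product(set(xs), set(ys))
    return all(counts[cell] >= min_count for cell in cells)

# A heavily skewed sample: almost all mass sits in one cell, so the
# joint matrix is sparse and the rule of thumb fails.
x = ["low"] * 18 + ["high"] * 2
y = ["buy"] * 18 + ["wait"] * 2
print(satisfies_five_per_cell(x, y))  # False: e.g. ("low", "wait") has 0 cases
```

When this check fails, one would either merge referential values (losing information) or, as proposed here, decompose the input variables into smaller groups before rule-based inference.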
1.5. Research Significance
In this section, we explain the theoretical and practical significance of this research.
The research draws on two fields: revenue management, and modelling and
computational approaches. Its significance for both fields is as follows.
Theoretical significance:
• In the field of revenue management
This research contributes to a new approach for detecting strategic customers from
visible booking-related behaviours. This topic is a growing issue in revenue
management. We propose a framework for customer-type detection with
refinement according to data availability. As mentioned earlier, research in this field
often relies on the intuitive assumption that all customers act strategically, leading
to suboptimal revenue management policies. Our framework offers a more precise
approach to estimating the presence of strategic customers in the market, which in
turn can lead to more appropriate revenue management.
This study also creates a conceptual framework for a customer-decision model in
the environment of dynamic pricing with an advance booking mechanism, where
customers can place a guaranteed reservation before the departure date. It comprises
the factors that influence customer decisions, including provider-controlled factors,
risk-related factors, and customers' personal factors. This framework can aid
scholars and more specifically practitioners to understand how customers make
decisions whether to buy or to wait in an environment of dynamic pricing.
• In the field of modelling and computational approach
This research contributes to the development of an integrated approach for
statistical analysis, hierarchical rule-based inference, maximum likelihood
prediction, and machine learning for classifying various data types under heavily
skewed data distributions. A heavily skewed distribution produces sparse joint
frequency matrices, in which combining too many referential values leads to
information loss. Specifically, input variables were grouped based on statistical
and conceptual considerations and on feasibility in terms of the statistical
requirements for sample size. The MAKER framework was applied to each
evidence group. Then an ER-based or BRB-based model was used to combine
the MAKER-generated outputs from all evidence groups. This approach is
transparent and interpretable, and the relationship between system inputs and
outputs becomes clear and understandable.
In addition, a referential-value-based data discretisation technique was employed
to deal with numerical data. These values were trained simultaneously with all the
model parameters. In other words, the learning process was designed so that all
parameters embedded in the model were adjusted to achieve the objective of
minimising errors.
Practical significance:
This research offers an approach to detecting strategic customers and predicting
customer decisions in the environment of dynamic pricing and advance-booking
mechanisms. It can benefit practitioners at the managerial level, especially
revenue management adopters with fixed capacity or similar characteristics – such
as hotels, airlines, sport and entertainment ticketing, and advertisements.
Professionals in these fields can use our model because of its hierarchical rule-
based inferential modelling and prediction approach, which is transparent and
interpretable. It offers similar or improved accuracy compared to other
classification models.
The framework is also adaptable if professionals examine many inputs and outputs
or different referential values. Such scenarios are likely to occur in fields in which
customer behaviour may change dynamically in response to changes in the
business environment. The MAKER-based models are essentially a white-box
approach, that is, one with a transparent machine learning process and
interpretable model features. They enable professionals to find useful patterns in
the data and thus to develop beneficial managerial levers to deal with strategic
customers appropriately.
1.6. Thesis Structure
This thesis consists of seven chapters, as outlined in Table 1.1. Each chapter is
designed to answer specific research questions, with corresponding research
objectives.
Table 1.1. Thesis structure

Chapter                                                      Research questions   Research objectives
Chapter 1  Introduction                                      -                    -
Chapter 2  Literature review                                 Q3                   RO3
Chapter 3  Research methodologies                            Q3                   RO3
Chapter 4  A hierarchical rule-based inferential             Q4, Q5               RO2, RO3
           modelling and prediction
Chapter 5  Application to customer classification            Q1, Q4, Q5, Q6       RO1, RO2, RO3
Chapter 6  Application to customer decision model            Q2, Q4, Q5, Q6       RO1, RO2, RO3
Chapter 7  Conclusion and recommendations for                -                    -
           further study
Chapter 2 provides a systematic literature review regarding established
classification models. We identify the advantages and disadvantages of the
established models and formulate the need for a new approach. Chapter 2
answers Q3, with corresponding RO3.
Chapter 3 explains the research methodologies used in this thesis to answer Q3,
which address RO3. In this chapter, we explain the general research methods,
data collection, the research methods for the rule-based inferential modelling and
prediction approach, the optimisation method to find optimised model parameters,
and evaluation metrics for model performance comparison.
Chapter 4 presents our proposed new approach to hierarchical rule-based
inferential modelling and prediction, established on the MAKER framework,
namely the MAKER-ER- and MAKER-BRB-based models, to deal with sparse
matrices and to address data transformation for numerical data. This chapter
answers Q4 and Q5, fulfilling RO2. We explain the proposed approach
analytically and graphically to highlight the advantages of our model compared
to other models. Then, to analyse whether the hierarchical structure of the
proposed approach affects its generalisation capability and complexity, we apply
a full MAKER model, a BRB model, and hierarchical MAKER models to five
generated datasets and compare their performance in terms of model complexity,
computation time, accuracy, area under the receiver operating characteristic
curve (AUC-ROC), and mean squared error (MSE).
Chapter 5 addresses Q1, Q4, Q5, and Q6, and fulfils RO1, RO2, and RO3. This
chapter presents the application of the MAKER-ER- and MAKER-BRB-based
models (described in Chapter 4) to customer-type detection. In this chapter, we
provide the theory regarding customer types in revenue management and discuss
business settings and tangible booking behaviours. We then formulate a
conceptual framework for customer-type detection, followed by the designed data
linkage to obtain the necessary dataset from different data sources. Then, we
describe how we applied MAKER-ER- and MAKER-BRB-based models to this
case. Finally, we compare the model’s performance to that of alternative methods.
Chapter 6 presents the application of hierarchical rule-based inferential modelling
and prediction based on the MAKER framework to customer decisions. We explain
customers' advanced booking decision-making in the environment of dynamic
pricing. We then formulate a conceptual framework for customer decisions –
including formulating input variables and designing data linkage to obtain the
desired datasets for further analysis. We describe how we developed MAKER-ER-
and MAKER-BRB-based classifiers and compare them to other alternative
methods in terms of their performance of prediction. This chapter answers Q2, Q4,
Q5, and Q6, and fulfils RO1, RO2, and RO3.
Chapter 7 summarises the findings of this research and provides a final conclusion.
We also suggest directions for further research.
Chapter 2 Research Background
2.1. Introduction
This chapter provides the theoretical background underlying the development of
conceptual frameworks for detecting customer types and predicting customer
decisions in revenue management. We also provide a comprehensive analysis of
classification methods that are widely used for predicting discrete outputs, a task
compatible with the purpose of our models: predicting customer types and
decisions. Section 2.2 introduces the concepts of revenue management and
dynamic pricing. Section 2.3 explains advanced booking decision models, and
Section 2.4 gives a brief explanation of machine learning and classification.
Section 2.5 presents machine learning methods that are widely used for
classification. Finally, Section 2.6 provides a critical analysis of these well-known
classification methods and highlights some of their drawbacks.
2.2. Introduction to Revenue Management Theories
Classical revenue management, originally known as yield management, was first
applied in the airline industry (Littlewood, 2005). It is defined as ‘allocating the right type of
capacity to the right kind of customer at the right price in the right time so as to
maximize revenue or yield’ (Kimes, 1989), with the clause ‘to the right distribution
channel’ added to the definition for mixed-channel cases (Hayes and Miller, 2011).
Revenue management can be adopted by an industry that meets the following
conditions. First, it engages in customer segmentation, by means of which different
customers accept different fares for the same product. In addition, its products
generally have low variable costs but high fixed costs. The industry also sells a
perishable product that is characterized by fixed or inflexible capacity (or inventory)
and a limited selling period; once the selling period has ended, the remaining
products cannot be stored as inventory. In addition, the industry uses an advanced
booking mechanism and experiences significant demand fluctuation. Finally, it has
a sophisticated and decentralized information system to gather data about customer behaviour,
demand patterns, and more (Kimes, 1989). Industries which feature these
characteristics include hotels, restaurants, sports and entertainment ticketing,
airlines, cloud computing, advertising, telecommunications, shipping, railways,
electricity suppliers, water suppliers, and retail. These industries have recently
applied revenue management techniques and have been successful adopters
(Ivanov and Zhechev, 2012; Kimes, 2003; Qiwen, 2010; Talluri and van Ryzin, 2004).
Revenue management theory addresses three basic types of decisions (Talluri
and van Ryzin, 2004). First, price decisions include choices about how to price across
segments, product categories, and distribution channels. Second, quantity
decisions answer questions about how to allocate limited or constrained capacity
to different segments, products, or channels; when to open or close fare classes
during the selling period; and so on. Third are structural decisions, which support the
other decisions. Examples of outcomes of structural decisions include price fencing
to define fare restrictions and limitations, selling format (e.g. auctions or price
updating), segmentation formula, and selling design.
Dynamic pricing is one of the most successful strategies in revenue management.
Dynamic pricing, by definition, is a business strategy to maximise revenue by
changing prices ‘either over time, across customers, or across products/bundles’
(Kannan and Kopalle, 2001). It is fundamentally different from fixed price
approaches since it allows customers to buy the same good or service at various
prices, regardless of the promotion format (Talluri and van Ryzin, 2004). The practice
of dynamic pricing works well in situations of limited capacity with high fixed costs
– for example, airlines, hotels, and sports and entertainment ticketing (Etzioni,
Tuchinda, Knoblock, and Yates, 2003; Sahay, 2007). Additionally, the use of the
Internet makes the process of dynamic pricing easier, less costly, and potentially
more effective (Cho et al., 2008).
There are three types of dynamic pricing (Kannan and Kopalle, 2001): 1) posted
price, 2) auction pricing, and 3) bundle pricing. Airlines have applied all of these
types. In the first dynamic price type, the basic strategy is that products are sold at
posted prices, which are updated over time (Etzioni et al., 2003). For the second
type, firms such as hotwire.com and priceline.com sell tickets with hidden attributes
and reveal them after the purchase has been made. Through negotiation at these
sites, customers and sellers reach an agreed-upon price. This type, which is
termed reverse-auction, has been identified as the most successful method for
helping airlines sell excessive seats or last-minute deals for customers with very
flexible schedules (Jerath et al., 2010). The third dynamic pricing type is bundle
pricing. To maximise customer utility based on customers’ needs and preferences,
some airlines offer greater flexibility by allowing customers to select service and
ticket bundles (for example, www.united.com) (Granados et al., 2012). In this
study, we focus on dynamic posted price updating, in which airlines change the
posted price on any of their platforms, such as their website, dynamically over time.
In the airline industry, a general pattern of prices increasing as a flight’s departure
date draws closer has long been applied. This pattern is adjusted to
prospective customers’ shifting willingness to pay (Schwartz, 2000). Leisure
travellers might get lower prices by booking in advance, and business travellers
have to pay higher prices due to their tendency to book close to their date of travel.
Etzioni et al. (2003) found that this general pattern remained the same in the
Internet era; however, over a one-month observation window, prices were divided
into as many as four tiers, and within each tier prices fluctuated with a smaller
variance. Nevertheless, in their study, prices often dropped over time and were
eventually lower than prices very early in the selling period.
On average, airfares can change five to seven times in a day (Etzioni et al., 2003).
The price changes are a manifestation of a company’s response to uncertain
conditions, such as supply–demand pressures (Sahay, 2007), remaining time
(Talluri and van Ryzin, 2004), competitors’ strategy (Levin et al., 2009), or seasonality
(Etzioni et al., 2003). Airlines respond to these conditions by using real-time
reservations, customer booking history, and customer characteristics to perform
demand predictions.
The success of dynamic pricing applications relies on the effectiveness of policies
put in place to minimise cannibalisation (when customers with a high willingness
to pay choose an available lower price) and to induce early purchases so that
demand uncertainty can be reduced (Liu and van Ryzin, 2008; Talluri and van
Ryzin, 2004). Airline companies normally impose restrictions on lower price classes, such
as restrictions on cancellation, ticket reissue, rerouting, and maximum or minimum
stay (Meissner and Strauss, 2010) to prevent such cannibalization. However, given
the availability of information through the Internet, customers now are able to
collect necessary data to decide just when to buy to gain more benefit from this
practice (Chen and Schwartz, 2008). This is the idea of strategic customer
behaviour – that customers anticipate future price drops and delay their purchases.
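The buy-now-or-wait trade-off that defines strategic customer behaviour can be caricatured in a few lines. This is a deliberately simplified, hypothetical decision rule of my own, not a model from the literature reviewed here; the parameter names are illustrative:

```python
def should_wait(price_now, expected_future_price, p_sellout, sellout_cost, search_cost):
    """Toy buy-now-vs-wait comparison: wait only if the expected future
    price, plus the expected cost of a sell-out and the cost of further
    search, undercuts today's posted price. Illustrative only."""
    expected_cost_of_waiting = (
        expected_future_price + p_sellout * sellout_cost + search_cost
    )
    return expected_cost_of_waiting < price_now

# A strategic customer weighs the anticipated drop against the risk:
print(should_wait(200, 160, p_sellout=0.1, sellout_cost=300, search_cost=5))  # True
print(should_wait(200, 180, p_sellout=0.3, sellout_cost=300, search_cost=5))  # False
```

Even in this caricature, restrictions that raise the effective sell-out cost or search cost tip the decision back towards buying now, which is exactly the lever the restriction policies above pull.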
2.3. Advanced Booking Decision-Making
In some cases – for example, airlines and hotels – the customer purchase cycle
theory may not be entirely applicable. In airline ticketing and hotels, for instance,
customers have the option to make a reservation (to book) before the ‘real
purchase’. They may secure a seat or room with full, partial, or zero payment
before the actual purchase. Schwartz (2000, 2006) developed the advanced
booking decision model (ABDM), which provides a theoretical framework for how
savvy online customers exploit dynamic pricing in advanced booking settings. In
these cases, customers are restricted by some conditions, such as cancellation
policies. After evaluating the alternatives, customers may choose the best option
according to their values and then book a ticket. They are then likely to monitor
price changes until departure time. The information retrieved may bring them to
change their decision, triggering them to cancel the previous booking and book a
more favourable ticket once it is available. Gorin et al. (2012) presented a real
example of airline ticket sales which confirmed that customers truly engage in the
‘book then search’ strategy. The dynamic changing of product prices and
availability shapes customers’ perceptions over time and indeed makes customers’
decisions fluctuate even after they have narrowed down their consideration set to
a single product (Chen and Schwartz, 2006).
The ABDM framework was used to identify four possible decisions that online
customers may choose: 1) a ‘book’ strategy, in which customers place a
reservation and do nothing more, with no further search or evaluation; 2) a ‘book then
search’ strategy in which customers place a reservation at an agreed price,
continue the search for the same product by collecting price information until a time
closer to the date of consumption, and rebook if necessary (i.e. when a better deal
is offered); 3) a ‘search’ strategy in which customers search for a better deal
without booking the product until the seemingly best deal comes out; and 4) an
‘exit’ strategy in which customers choose another carrier. This framework has
expanded the former widely applied two-stage decision model – buy now or buy
later (e.g. Anderson and Wilson, 2003) – and three-stage decision model – buy
now, buy later, or exit (Su, 2007). The factors Schwartz considers in his framework
(Schwartz, 2000, 2006) are 1) price pattern (Chen and Schwartz, 2006), 2) time
before the consumption date (Chen and Schwartz, 2008), 3) cancellation fee and
deadline (Chen, Schwartz, and Vargas, 2011), and 4) search cost.
Customers try to balance between benefits (e.g. the possibility of getting a lower
price), costs (e.g. for search efforts, including time spent and physical and
psychological efforts), and risks (e.g. the possibility of sell-outs) when they are
making a purchase decision. While bearing the risk of losing the product,
customers may continue to search for relevant information as long as the perceived
gains outweigh the search effort. Today, customers tend to use meta-search websites,
search engines, or third-party websites which provide supporting features (Etzioni
et al., 2003). These sites help customers to gather and compare information with
specified search strategies from numerous products, and they display the results
in an easy-to-understand view. Additionally, some third-party websites provide
customers with highly sophisticated functions such as price tickers, price trends,
and price alerts. These sites mine information which allows them to make
suggestions to customers on the best time to purchase. As a result, customers
tend to spend less time and less effort during the purchase-related evaluation and
comparison process. This has increasingly fostered customers who are more
knowledgeable and more eager to maximise their gains by exploiting dynamic
pricing.
Online intermediaries are efficient information resources for customers and play a
significant role in shaping the internal reference price that customers use when
evaluating prices available through other agents, either online or offline. Customers may browse
through these resources as their primary information source before making an
offline transaction. This strategy is the most common approach to what is known
as ‘research shopping’ (Bodur et al., 2015). This also happens in airline ticketing:
customers check prices via the Internet and call their trusted offline agents to
finalise their booking (Toh et al., 2012).
In airline ticket sales, ‘book now, pay later’ with or without a deposit has been
widely applied by offline agents, although online agents have started to adopt it
with different degrees of leniency. Online customers, however, are normally asked
for immediate, full payment on completing a booking request with an online
agent. It has thus become a reasonable strategy for customers to buy through offline
agents while monitoring prices over time through online intermediaries in case a
lower price becomes available. In this case, customers can exploit dynamic pricing
with reduced risk of sell-out at reduced cost.
2.4. Introduction to Machine Learning
Machine learning is defined by Arthur Samuel (cited in Awad and Khanna, 2015) as ‘a
field of study that gives computers the ability to learn without being explicitly
programmed’. Machine learning incorporates scientific computing, mathematics,
and statistics (Lee, 2019). It consists of algorithms and techniques that create
systems for data learning (Lee, 2019). Machine learning methods are divided into
supervised, semi-supervised, or unsupervised learning based on whether labelled
data is required during training (Lee, 2019). Supervised learning methods use
labelled data as training data and make predictions for unseen data, whereas
unsupervised learning methods take unlabelled data as training data and make
predictions for unseen data (Lee, 2019). Semi-supervised learning methods learn
from both labelled and unlabelled data (Lin and Cohen, 2010). This last approach
can achieve a comparable level of accuracy when only a few labelled training
samples are available (Lin and Cohen, 2010).
Supervised machine learning acquires information about the input–output
relationships in a system from a training set of input–output pairs and uses this
acquired information to make predictions for unseen inputs (Lu and Wu Ying,
2012). The goal of supervised machine learning is to build a system that is able to
learn the mapping between inputs and outputs and to use that system to predict
the output when given a new input. If the output is numerical data, it is considered
to be a regression task (Lee, 2019). If the output is discrete data, it is considered
to be a classification task (Lee, 2019). Some classification methods are presented
in Section 3.6, including logistic regression (LR), support vector machines (SVM),
neural networks (NN), classification tree (CT), k-nearest neighbours (KNN), and
naïve Bayes (NB).
A brief introduction to the above-mentioned classification methods is summarised
as follows. LR is a linear classification model that learns the relationship between
input and output by minimising the error between the probability of a sample
belonging to a certain class and the actual classification. NB is a classification
method based on Bayes’ theorem under a naïve assumption of conditional
independence for every input variable. KNN is a non-parametric classification
method that classifies a new observation based on a similarity function – for
example, a distance function – with other available observations. A classification
tree is a non-parametric classification technique in the form of a tree structure,
developed through recursive partitioning. The method classifies observations by
decomposing them into subsets based on the values of input variables. SVM is a
classification technique that finds the hyperplane, defined by support vectors
(cases), that maximises the margin between two classes. NN is a complex system
consisting of an interconnected group of nodes that imitates the working of
neurons in the human brain. Further explanation of these classification methods
can be found in Section 3.6. The following section
presents critical analysis of these classification methods.
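The six classifiers named above are all available in scikit-learn. The following is a minimal, illustrative sketch of fitting and scoring them on a synthetic dataset; the dataset and all parameter values here are assumptions for demonstration only, not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The six classification methods introduced above.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "NN": MLPClassifier(max_iter=2000, random_state=0),
    "CT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))  # test-set accuracy
```

Each classifier exposes the same `fit`/`score` interface, which is what makes the kind of side-by-side comparison reported later in the thesis straightforward.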
2.5. Classification Models: Advantages and
Disadvantages
In this section, the machine learning classification methods presented in Section
3.6 are critically analysed in terms of their advantages and disadvantages, as
summarised in Table 2.1.
Table 2.1. Advantages and disadvantages of classification methods

Logistic Regression
Advantages:
1. Logistic regression is an interpretable method at the modular level, with its weights presenting the degree to which an input variable contributes to a certain class prediction (Carvalho, Pereira, and Cardoso, 2019).
2. In terms of flexibility and robustness in case of violations of the assumptions about the underlying data, logistic regression is better than linear discriminant analysis (Liong and Foo, 2013).
Disadvantages:
1. Logistic regression classifiers assume that input variables are independent of each other and are sensitive to outliers (Molnar, 2019).
2. Logistic regression classifiers are limited to linearly separable two-class problems (Molnar, 2019).

Support Vector Machines
Advantages:
1. Support vector machines tend to find a globally optimal solution, since model complexity is considered as a structural risk in SVM training (Ren, 2012).
2. Support vector machines minimise both the empirical risk learnt from the training set and the above-mentioned structural risk. Consequently, these classification models have strong generalization capability (Ren, 2012).
3. Support vector machines are robust and precise under biased data distributions (Auria and Moro, 2008).
Disadvantages:
1. Support vector machines differ from other classifiers in their lack of explicit approximations (Knox, 2018).

Neural Networks
Advantages:
1. Neural networks can model non-linear and complex relationships without imposing any fixed relationships on the data (Haykin, 1999; Tu, 1996).
2. Neural networks are relatively robust to noisy and incomplete labelling (Reed and Lee, 2015).
3. Neural networks have the potential for inherently fault-tolerant and robust computation (Haykin, 1999).
4. Neural networks are capable of adapting to changes in the surrounding environment; for example, when operating in a nonstationary environment, neural networks can change their weights in real time (Haykin, 1999).
5. The parallel nature of neural networks makes the computation of certain tasks fast (Haykin, 1999).
Disadvantages:
1. A neural network is considered a ‘black-box’ model whose internal workings are not transparent and are hence difficult to understand (Molnar, 2019).
2. Other disadvantages of neural networks are their greater computational burden, tendency to overfit, and the empirical nature of model development (Tu, 1996).

Classification Tree
Advantages:
1. Classification trees capture interactions between input variables in the data (Molnar, 2019).
2. The natural visualisation of a classification tree makes it simple and interpretable (Molnar, 2019).
Disadvantages:
1. Classification trees are not efficient when dealing with a linear relationship between an input variable and the output (Molnar, 2019).
2. Slight changes in the input variables can have a big impact on the predicted output (Molnar, 2019).
3. The method is quite unstable, since a few changes in the training set can produce a completely different tree (Molnar, 2019).

K-Nearest Neighbours
Advantages:
1. The k-nearest neighbours algorithm is relatively straightforward (Knox, 2018).
2. The k-nearest neighbours algorithm is interpretable at the local level (Molnar, 2019).
Disadvantages:
1. K-nearest neighbours is expensive to implement, especially for large datasets (Knox, 2018).
2. A small value of k makes the classifier sensitive to particular data points, while a large value of k makes the behaviour of the classifier insensitive to local variations in the class densities. Hence, careful adjustment of k is required (Knox, 2018).

Naïve Bayes
Advantages:
1. Naïve Bayes classifiers are not limited to non-parametric methods, which are relatively expensive to implement. They can be used with parametric, non-parametric, or semi-parametric (a mixture of the two) methods (Knox, 2018).
2. Naïve Bayes classifiers are interpretable models at the modular level, as the contribution of each input variable toward a certain class prediction is very clear (Molnar, 2019).
Disadvantages:
1. Naïve Bayes works under an unrealistically strong assumption of independence between input variables (Molnar, 2019).
2. Naïve Bayes generally provides lower accuracy for problems of a complicated nature than do other, more complex methods (Karim and Rahman, 2013).
From the analysis above, we can highlight two issues with the
classification methods listed in Table 2.1: interpretability and the assumption of
independence between input variables. In this study, we propose hierarchical rule-
based inferential and prediction modelling under the MAKER framework, which is
discussed in Chapter 4.
Chapter 3 Research Methodologies
3.1. Introduction
This chapter explains the research approach, data collection, and available
technologies used in this thesis. Section 3.2 presents the general research
approach of the thesis. Section 3.3 describes the data collection, including data
sources, data components, and data characteristics used in the thesis. Section 3.4
introduces the evidential reasoning (ER) rule, followed by Section 3.5 with an
introduction of the MAKER framework. Section 3.6 briefly explains the machine
learning methods for classification that have been chosen for comparison with a
hierarchical rule-based modelling and prediction based on MAKER framework
proposed in this study. Section 3.7 explains the optimisation method used to find
the optimal parameters of MAKER-based models in this study. The performance
measures are explained in Section 3.8, and Section 3.9 summarises this chapter.
3.2. Research Approach
There are generally three research approaches: 1) qualitative, 2) quantitative, and
3) mixed-method approaches (Creswell, 2018). Qualitative research is an
approach used to explore and understand individuals or groups in human or social
problems. The various types of qualitative designs include narrative research,
phenomenology, grounded theory, ethnographies, and case studies. Researchers
analyse qualitative data inductively, working from particulars to general themes –
for example, taking a recorded interview, document data, audio visual data, and
observation data and making interpretations of the meaning of the data.
Quantitative research, by contrast, is an approach to testing objective theories
using mathematical and statistical theories or models. The variables tested in
quantitative research can be measured and expressed numerically. Mixed-method
approaches incorporate elements of both qualitative and quantitative research. All
the data used in this research is numeric, and the purpose of the study is to
introduce a new approach to customer classification and decision prediction along
with performing other machine learning methods as comparison. Therefore, the
research approach in this project is considered quantitative.
3.3. Data Collection
As discussed previously, before applying them to the real-world data – that is,
customer types and decisions – in Chapter 4 we analysed the effect of the
hierarchical structure employed in MAKER frameworks on the complexity and
accuracy of the MAKER-based models. Hence, we need to evaluate the
generalization capability, efficiency, and complexity of the hierarchical MAKER
frameworks. In this research, we utilised the ‘make_classification’ and
‘make_blobs’ functions in sklearn in Python to generate datasets with different
characteristics – for example, with or without noise, one or two clusters per class,
and blob data. More clusters per class in a dataset lead to a more complex, non-
linear class boundary. The noisy datasets were designed to investigate the
robustness of the models.
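A sketch of the kind of synthetic-data generation described above is given below. The exact sample sizes and parameters used in this study are not restated here, so the values below are assumptions for illustration.

```python
from sklearn.datasets import make_blobs, make_classification

# Two clusters per class produce a more complex, non-linear class boundary;
# flip_y injects label noise to test robustness (parameter values are assumed).
X1, y1 = make_classification(n_samples=1000, n_features=2, n_informative=2,
                             n_redundant=0, n_clusters_per_class=2,
                             flip_y=0.05, random_state=0)

# Blob data: isotropic Gaussian clusters, one centre per class.
X2, y2 = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=0)

print(X1.shape, y1.shape, X2.shape, y2.shape)
```

Varying `n_clusters_per_class` and `flip_y` is one convenient way to sweep from linearly separable to noisy, non-linear datasets with a fixed generator interface.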
Based on the generalization capability of the hierarchical MAKER frameworks,
we then apply the frameworks to the real-world classification datasets – customer
types and decisions. The performance of the hierarchical MAKER frameworks was
then compared against that of other machine learning methods. The rest of this
section describes the datasets of customer types and decisions.
Data was collected from an online reservation application provider from Indonesia
with the web address www.pointer.co.id. They provide a booking system which has
been adopted by more than 500 agents. The data consists of two main sources:
passenger booking records and price databases. Personal information of
passengers is completely confidential and fully anonymized.
First, the passenger booking database consists of passenger name records
(PNRs), which contain name, origin-destination, departure date, departure time,
carrier/airline, ticket price, group size or the composition of passengers for group
booking (i.e. the numbers of adults, infants, and children), booking status, booking
time (A1), and date and time by which payment must be made (B1). Passengers
who place a reservation can secure a seat without paying a fee or deposit, and the
seat is issued once they make full payment before B1. Later we define the period
between A1 and B1 as the holding period. When passengers alter their travel plans
– for example, changing the departure date, cancelling the booking, or not making
a payment before the holding period ends – the system creates a new PNR if they
place a reservation again. Therefore, tracking passenger historical bookings must
be done carefully, a process which is explained in Chapters 5 and 6. Since this
study focuses on cancel-rebook behaviour by customers who wish to find lower-
priced alternatives for their travel plans, we consider only cancel-rebook records
generated after customers presumably have fixed travel plans: that is, same flight,
same departure time, same origin-destination, and same composition of
passengers for group booking. For customers with seemingly fixed travel plans,
cancel-rebook behaviour is perceived as their effort to find lower-priced
alternatives by delaying their purchase and waiting for an expected lower price.
Hence, any previous cancel-rebook records that show changes in their travel plans
are removed from the database.
Second, in the price database, prices of each flight for certain departure dates are
recorded and updated every three hours to capture price changes. Only direct
flights are considered in this research. Connecting flights were removed. The
database consists of posted prices, the name of airlines (carrier), departure date
and time, origin-destination, and update time. Data was collected from 18th July
2017 to 24th September 2017, meaning that it covered price dynamics for the two
months prior to the departure date. Thirty-one pairs of cities were chosen to
represent origin-destinations with different characteristics in terms of price range
and the number of flights offered per day, as presented in Table 3.1.
These two databases were used to generate the desired datasets containing
useful information for predicting customer types and decisions, as presented in
Sections 5.3 and 6.2.3. In short, there are four system input variables and two
system output categories for predicting customer types, as further discussed in
Chapter 5, and six system input variables generated with two system outputs – buy
or wait – for predicting customer decisions, as further discussed in Chapter 6.
Table 3.1. Data characteristics
(Columns: No; Origin-Destination; Price in Rupiahs – Min, Max, Stdev; Flights per day – Min, Max; Number of bookings)
1 NTXBTH 555,000 1,763,900 129,834 0 2 12
2 BTHNTX 857,000 1,583,000 170,973 0 2 3
3 UPGTIM 1,060,000 3,073,000 300,084 1 3 12
4 TIMUPG 1,025,000 2,406,600 162,702 2 3 13
5 AMQCGK 1,070,300 4,032,600 493,210 3 4 13
6 PKYCGK 587,000 1,470,500 179,789 3 4 77
7 CGKPKY 427,000 1,713,500 172,419 3 4 49
8 SUBPKY 457,800 1,283,000 107,364 3 3 25
9 PKYSUB 397,800 1,135,000 103,271 3 3 54
10 CGKAMQ 1,133,000 5,445,000 550,224 4 5 16
11 BTJCGK 832,000 2,962,700 405,221 4 5 21
12 CGKBTJ 847,000 3,057,700 353,912 4 5 15
13 UPGPLW 338,000 2,655,000 259,337 4 5 17
14 PNKSUB 401,000 1,495,900 119,559 4 4 33
15 SUBPNK 511,700 1,545,900 140,822 4 4 13
16 DJJUPG 971,000 3,053,500 285,361 5 5 7
17 UPGDJJ 860,000 3,755,000 698,107 5 5 11
18 PLWUPG 338,000 2,620,000 129,179 5 5 17
19 JOGBPN 335,500 1,970,100 194,495 6 6 21
20 BPNJOG 579,500 2,020,100 262,824 6 6 5
21 CGKBKS 328,000 1,613,400 132,750 7 8 14
22 UPGBPN 382,200 1,280,000 89,524 7 7 35
23 BPNUPG 415,000 1,128,000 89,121 7 7 23
24 BKSCGK 348,000 1,513,400 164,531 8 8 12
25 BDJSUB 373,800 1,182,500 112,529 9 9 43
26 SUBBDJ 313,000 1,222,500 120,057 9 9 23
27 LOPCGK 524,800 3,310,000 433,196 10 10 15
28 UPGKDI 224,000 1,192,000 86,978 11 12 42
29 SOCCGK 319,000 1,204,500 218,541 11 13 27
30 CGKLOP 514,800 3,300,000 375,778 12 12 19
31 CGKSOC 319,000 1,284,500 267,697 12 13 18
32 BTHCGK 410,000 3,365,000 222,671 13 13 23
33 CGKBTH 410,000 1,965,400 194,336 13 13 46
34 JOGHLP 318,300 2,024,000 182,238 13 13 52
35 HLPJOG 318,300 2,024,000 171,693 13 13 61
36 KDIUPG 224,000 1,333,000 96,674 13 13 30
37 BDJCGK 511,300 1,919,500 194,945 15 16 40
38 CGKBDJ 511,300 1,999,500 192,695 15 16 72
39 BPNSUB 348,000 1,579,000 119,936 15 17 16
40 SUBBPN 494,300 1,655,900 156,028 15 17 22
41 UPGSUB 352,000 2,336,000 172,009 16 17 52
42 TKGCGK 135,000 992,000 105,196 17 19 72
43 PDGCGK 500,400 2,619,000 256,082 18 20 58
44 CGKPDG 489,500 3,432,000 265,338 18 20 46
45 PKUCGK 489,800 3,295,000 151,032 18 21 16
46 CGKPKU 494,800 1,937,900 169,696 18 20 39
47 CGKTKG 135,000 1,022,000 133,430 18 19 59
48 SUBUPG 332,000 2,361,000 170,240 19 19 46
49 SRGCGK 303,600 1,903,000 209,129 21 23 34
50 PLMCGK 311,100 1,221,000 156,497 22 23 35
51 CGKPLM 311,100 2,002,000 166,410 22 24 39
52 CGKSRG 313,600 1,903,000 237,245 22 23 45
53 PNKCGK 315,000 1,714,800 173,475 23 24 104
54 CGKPNK 315,000 1,804,800 197,689 23 24 114
55 JOGCGK 320,500 2,035,000 271,576 24 25 33
56 CGKJOG 335,500 2,035,000 294,266 25 25 54
57 CGKKNO 495,000 3,212,000 309,450 34 37 68
58 KNOCGK 495,000 4,260,000 279,839 35 37 54
59 CGKUPG 552,000 4,301,000 334,070 35 36 138
60 UPGCGK 552,000 4,316,000 344,121 37 40 126
61 CGKSUB 401,500 2,772,000 266,977 48 51 92
62 SUBCGK 320,000 2,812,000 258,385 50 52 56
3.4. Evidential Reasoning
Evidential reasoning (ER) is built on the basis of Dempster–Shafer theory
(D-S theory), first developed by Dempster in the 1960s and extended by Shafer in
the 1970s (Binaghi and Madella, 1999). In D-S theory, we first define a set of possible
propositions that are mutually exclusive and collectively exhaustive, called a
discernment framework. We then perform a basic probability assignment or mass
function that measures the probability of pointing exactly to a certain proposition
(Chen et al., 1960). Dempster’s rule of combination is applied to combine two
independent sets of mass functions in a frame of discernment under an orthogonal
sum operation that is associative and commutative.
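Dempster's rule of combination described above can be sketched in a few lines of Python. Propositions are represented as frozensets over a frame of discernment; the two mass functions below are illustrative assumptions, not data from this study.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset -> mass)
    under Dempster's rule, renormalising away the conflicting mass."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:  # non-empty intersection supports that proposition
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:      # empty intersection: conflicting mass
            conflict += ma * mb
    k = 1.0 - conflict  # normalisation constant
    return {p: v / k for p, v in combined.items()}

# Two pieces of evidence over the frame {'buy', 'wait'} (illustrative values):
m1 = {frozenset({'buy'}): 0.6, frozenset({'buy', 'wait'}): 0.4}
m2 = {frozenset({'buy'}): 0.5, frozenset({'wait'}): 0.3,
      frozenset({'buy', 'wait'}): 0.2}
print(dempster_combine(m1, m2))
```

Because the orthogonal sum is associative and commutative, repeated calls to `dempster_combine` fuse any number of mass functions in any order.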
In general, classification includes ‘learning the invariant and common properties of
a set of samples characterizing a class’ and ‘deciding if a new sample is a possible
member of the class’ (Binaghi and Madella, 1999). These two tasks are named
abstraction and generalization, respectively. Classification models estimate the
function characterizing class membership and develop a deductive inference
mechanism to perform a reasoning process to assign a new sample to a given
class (Binaghi and Madella, 1999). In contrast to Bayesian inference, D-S theory
does not require a priori knowledge. Since the aim of classification is to detect
unseen data, and a priori knowledge may not always be provided, D-S theory is
suitable for classification (Chen et al., 1960).
However, D-S theory has difficulty managing conflicting beliefs (Yang and Xu,
2013) when combining two pieces of evidence. Another disadvantage of D-S
theory is that it assumes that all the evidence is completely reliable. Yang and Xu
(2013) proposed the ER rule, a generic conjunctive probability reasoning process,
to deal with the limitations of D-S theory and introduce inherent properties of
evidence – namely, the quality of the information source and the relative
importance of evidence, denoted by reliability and weight, respectively. Weighted
belief distributions (WBD) and weighted belief distributions with reliability
(WBDR) replace the belief distribution of Dempster’s rule.
Dempster’s rule is a special case of the ER rule – the case in which all the evidence
is completely reliable. The ER rule also improves the original ER algorithm (Xu,
Yang, and Wang, 2006; Yang, 2001) when the reliability of evidence is equal to its
weight, and the weights are normalised. The ER rule does not always require such
normalisation. The ER rule essentially can deal with different types of uncertainty
and supports rule- or utility-based information transformation techniques (Xu,
2011; Yang, 2001). The ER
algorithm has been adopted in the belief rule base (BRB) system, namely in the
rule-based inference methodology using evidential reasoning (RIMER) approach
(Yang et al., 2006). The RIMER approach has been applied in many areas (Chang
et al., 2013; Kong et al., 2016; Tang et al., 2011). The RIMER approach is able to
model the relationship between system inputs and outputs and to handle different
types of information under different types of uncertainty. However, the number of
rules increases exponentially as the number of input variables and the referential
value of each variable increases (Yang and Xu, 2017). The ER rule and the RIMER
approach are explained in the following sections.
3.4.1. Evidential Reasoning Rule
The ER rule has been established for combining evidence while taking weights
and reliabilities (Yang and Xu, 2013) into account when forming a belief
distribution. Suppose that Θ = {𝜃1, … , 𝜃𝑁} is a set of mutually exclusive and
collectively exhaustive propositions, referred to as the frame of discernment. Its
power set contains 2^N subsets of Θ, consisting of the empty set, the singleton
propositions, and their combinations, as seen in Equation (3.1).

$$P(\Theta) = 2^{\Theta} = \{\phi, \{\theta_1\}, \dots, \{\theta_N\}, \{\theta_1, \theta_2\}, \dots, \{\theta_1, \dots, \theta_{N-1}\}, \Theta\} \tag{3.1}$$

where 𝜙 is the empty set. Each piece of evidence is profiled by a belief distribution
(BD), as displayed in Equation (3.2).

$$e_j = \Big\{(\theta, p_{\theta,j}),\ \forall \theta \subseteq \Theta,\ \sum_{\theta \subseteq \Theta} p_{\theta,j} = 1\Big\} \tag{3.2}$$
for 𝑗 = {1,… , 𝐿} where L is the number of pieces of evidence, and N is the number
of propositions. 𝑝𝜃,𝑗 represents the degree to which a piece of evidence, 𝑒𝑗, points
to proposition 𝜃, which can be any subset of Θ or any element of 𝑃(Θ) except the
empty set.
Reliability and weight are the parameters associated with each piece of evidence.
Reliability is denoted by 𝑟𝑗, with 0 ≤ 𝑟𝑗 ≤ 1; 𝑟𝑗 = 0 stands for ‘not reliable at all’, and
𝑟𝑗 = 1 stands for ‘completely reliable’ (Yang and Xu, 2013). A piece of evidence
may also have weights, denoted by 𝑤𝑗, which indicates the relative importance of
that piece of evidence compared with other evidence (Yang and Xu, 2013). In the
case with 𝑤𝑗 = 𝑟𝑗 in the frame of WBD, both share the same definition of reliability,
and both are measured in the same joint space (Yang and Xu, 2014). As such, 1 -
𝑟𝑗 acts as the unreliability of evidence 𝑒𝑗, and it provides room for another piece of
evidence to support or oppose different propositions. On the other hand, if 𝑤𝑗 ≠ 𝑟𝑗,
it means that the different pieces of evidence have been generated from different
sources or different measurements (Xu et al., 2017). WBDR is defined in Equation
(3.3).
$$m_j = \big\{(\theta, \tilde{m}_{\theta,j}),\ \forall \theta \subseteq \Theta;\ \big(P(\Theta), \tilde{m}_{P(\Theta),j}\big)\big\}$$

$$\tilde{m}_{\theta,j} = \begin{cases} 0 & \theta = \phi \\ c_{rw,j}\, m_{\theta,j} & \theta \subseteq \Theta,\ \theta \neq \phi \\ c_{rw,j}\,(1 - r_j) & \theta = P(\Theta) \end{cases} \tag{3.3}$$

where $m_{\theta,j} = w_j p_{\theta,j}$, and $c_{rw,j} = 1/(1 + w_j - r_j)$ is a normalisation factor ensuring
$\sum_{\theta \subseteq \Theta} \tilde{m}_{\theta,j} + \tilde{m}_{P(\Theta),j} = 1$, given that $\sum_{\theta \subseteq \Theta} p_{\theta,j} = 1$. $\tilde{m}_{\theta,j}$ shows the degree to which
evidence 𝑒𝑗 supports proposition 𝜃, with weights and reliabilities considered.
Through an orthogonal sum operation, two independent pieces of evidence can be
combined in any order, as displayed in Equations (3.4)–(3.7), to measure the
degree of joint support resulting from 𝑒1 and 𝑒2, which is denoted by 𝑃𝜃,𝑒(2). The
information given by 𝑒2 does not depend on the result of 𝑒1 and vice versa, and
these pieces of evidence include belief distribution, reliability and weight. This
combination process must be done recursively in the case of multiple pieces of
evidence (i.e. L pieces of evidence) before generating the total combined degree
of joint support for proposition 𝜃, 𝑃𝜃,𝑒(𝐿), which is explicitly written in Equation (3.4).
Let 𝑒(𝑖) be defined as the combination of the first 𝑖 pieces of evidence. In addition,
due to the natural properties of the orthogonal sum operation – that is, associativity
and commutativity – this combination can be performed in any order.
$$p_{\theta,e(2)} = \begin{cases} 0 & \theta = \phi \\[4pt] \dfrac{\hat{m}_{\theta,e(2)}}{\sum_{D \subseteq \Theta} \hat{m}_{D,e(2)}} & \theta \subseteq \Theta,\ \theta \neq \phi \end{cases} \tag{3.4}$$

$$\hat{m}_{\theta,e(2)} = \big[(1 - r_2)\, m_{\theta,1} + (1 - r_1)\, m_{\theta,2}\big] + \sum_{B \cap C = \theta} m_{B,1}\, m_{C,2} \quad \forall \theta \subseteq \Theta \tag{3.5}$$

$$\hat{m}_{\theta,e(i)} = \big[(1 - r_i)\, m_{\theta,e(i-1)} + m_{P(\Theta),e(i-1)}\, m_{\theta,i}\big] + \sum_{B \cap C = \theta} m_{B,e(i-1)}\, m_{C,i} \quad \forall \theta \subseteq \Theta \tag{3.6}$$

$$m_{P(\Theta),e(i)} = (1 - r_i)\, m_{P(\Theta),e(i-1)} \tag{3.7}$$
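The two-evidence combination of Equations (3.4)–(3.5) can be sketched as follows. Propositions are frozensets, each piece of evidence carries a belief distribution with a weight w and reliability r, and all numeric values are illustrative assumptions.

```python
def er_combine(b1, w1, r1, b2, w2, r2):
    """Combine two pieces of evidence and return the normalised joint
    support p_{theta,e(2)} for each non-empty proposition (Eq. 3.4)."""
    m1 = {t: w1 * p for t, p in b1.items()}  # m_{theta,j} = w_j * p_{theta,j}
    m2 = {t: w2 * p for t, p in b2.items()}
    # All non-empty propositions that can receive support.
    thetas = set(m1) | set(m2) | {b & c for b in m1 for c in m2 if b & c}
    m_hat = {}
    for theta in thetas:
        # Bracketed unreliability terms of Equation (3.5) ...
        v = (1 - r2) * m1.get(theta, 0.0) + (1 - r1) * m2.get(theta, 0.0)
        # ... plus the orthogonal-sum term over B intersect C = theta.
        for b in m1:
            for c in m2:
                if b & c == theta:
                    v += m1[b] * m2[c]
        m_hat[theta] = v
    total = sum(m_hat.values())  # normalisation in Equation (3.4)
    return {t: v / total for t, v in m_hat.items()}

theta1, theta2 = frozenset({'buy'}), frozenset({'wait'})
whole = theta1 | theta2
b1 = {theta1: 0.7, whole: 0.3}  # illustrative belief distributions
b2 = {theta2: 0.4, whole: 0.6}
print(er_combine(b1, 0.9, 0.8, b2, 0.6, 0.7))
```

For more than two pieces of evidence, the same step is applied recursively via Equations (3.6)–(3.7), carrying the unnormalised masses forward between steps.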
3.4.2. Rule-based Inference Methodology Using Evidential
Reasoning (RIMER)
RIMER is developed on the basis of a BRB system, which is an extension of a
traditional IF-THEN rule base to represent different types of knowledge under
uncertainty, and the ER rule to combine multiple pieces of evidence from activated
belief rules. Traditional IF-THEN rules are extended by assigning degrees of belief
in all possible consequences of each rule, as displayed in Equation (3.8). In the
RIMER framework, other parameters, including rule weights, attribute weights, and
consequent belief degrees, are designed to represent the belief rules (Kong et al.,
2016). Those parameters can be fine-tuned through a learning process with
historical data. Whereas a traditional IF-THEN rule is clear-cut, a BRB system
needs to undergo a learning process to get the best performance from the model.
The belief rules take the form shown in Equation (3.8).

$$\text{If } A_1^k \wedge A_2^k \wedge \dots \wedge A_{T_k}^k, \text{ then } \big\{(D_1, \beta_{1k}), (D_2, \beta_{2k}), \dots, (D_N, \beta_{Nk})\big\},$$
$$\text{with rule weight } \theta_k \text{ and attribute weights } \delta_1, \delta_2, \dots, \delta_{T_k},$$
$$\text{where } \beta_{jk} \geq 0 \text{ and } \sum_{j=1}^{N} \beta_{jk} \leq 1 \tag{3.8}$$

𝐴𝑖𝑘 (𝑖 = 1,… , 𝑇𝑘) corresponds to the referential point of the ith attribute used in the
kth rule. 𝛽𝑗𝑘 (𝑗 = 1,… ,𝑁; 𝑘 = 1,… , 𝐿) is the belief degree assigned to consequent
𝐷𝑗, where 𝑁 is the number of consequents and 𝐿 is the number of rules. Belief
degrees can initially be drawn from experts, historical data, or common knowledge.
𝜃𝑘 is the rule weight, which shows the relative importance of the kth rule, while
𝛿𝑖 (𝑖 = 1,… , 𝑇𝑘) is the attribute weight representing the relative importance of the ith
attribute. 𝑇𝑘 is the total number of attributes used in the kth rule. Belief degrees are
expressed in terms of the probability with which 𝐷𝑗 is likely to occur. The total belief
degree of a rule can be less than or equal to one, which is designed to handle
missing data or unknown consequents.
If necessary, input values are transformed to belief distributions corresponding to
referential points used in the BRB. These belief distributions represent the degree
to which the input values belong to the referential points. 𝑥𝑖, as the input value for
the ith attribute, is transformed as 𝑆(𝑥𝑖), as seen in Equation (3.9).
$$S(x_i) = \big\{(A_{ij}, \alpha_{ij});\ j = 1, \dots, J_i\big\}, \quad i = 1, \dots, T$$
$$\text{where } 0 \leq \alpha_{ij} \leq 1 \text{ and } \sum_{j=1}^{J_i} \alpha_{ij} \leq 1 \tag{3.9}$$
𝐴𝑖𝑗 is the jth referential category of the ith attribute while 𝛼𝑖𝑗 shows the degree to
which 𝑥𝑖 belongs to the referential point 𝐴𝑖𝑗. 𝐽𝑖 is the number of all referential points
of the ith attribute and 𝑇 is the number of all attributes. If a BRB has 𝑇 attributes,
then the rules are extended from Equation (3.8) by taking all possible combinations
of the referential points for the 𝑇 attributes as displayed in Equation (3.10)
$$(A_1^k, \alpha_1^k) \wedge (A_2^k, \alpha_2^k) \wedge \dots \wedge (A_{T_k}^k, \alpha_{T_k}^k) \tag{3.10}$$

where 𝐴𝑖𝑘 ∈ {𝐴𝑖𝑗 , 𝑗 = 1,… , 𝐽𝑖} and 𝛼𝑖𝑘 ∈ {𝛼𝑖𝑗 , 𝑗 = 1,… , 𝐽𝑖}. Therefore, the input can be
transformed into referential points with a distributed probability or belief structure.
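The input transformation of Equation (3.9) can be sketched as follows for a numeric attribute: the input value is distributed over its two neighbouring referential points in proportion to its distance from each. The referential points chosen here are illustrative assumptions.

```python
def transform(x, ref_points):
    """Return {referential_point: matching degree alpha_ij} for input x,
    distributing x over its two neighbouring referential points."""
    pts = sorted(ref_points)
    if x <= pts[0]:          # clamp below the smallest referential point
        return {pts[0]: 1.0}
    if x >= pts[-1]:         # clamp above the largest referential point
        return {pts[-1]: 1.0}
    for lo, hi in zip(pts, pts[1:]):
        if lo <= x <= hi:
            alpha_hi = (x - lo) / (hi - lo)
            return {lo: 1.0 - alpha_hi, hi: alpha_hi}

# Example: referential points 0, 50, 100 for some numeric attribute (assumed).
print(transform(80.0, [0.0, 50.0, 100.0]))  # {50.0: 0.4, 100.0: 0.6}
```

The matching degrees always sum to one here, giving a complete belief distribution; assigning less than full belief would model ignorance, as Equation (3.9) permits.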
We then need to calculate the activation weight of each belief rule in the rule base.
The activation weight, denoted by 𝜔𝑘, represents the degree to which the packet
attribute denoted by 𝐴𝑘 in the kth rule is triggered by the inputs. It can be calculated
using Equation (3.11).
$$\omega_k = \frac{\theta_k \alpha_k}{\sum_{j=1}^{L} \theta_j \alpha_j} = \frac{\theta_k \prod_{i=1}^{T_k} \big(\alpha_i^k\big)^{\bar{\delta}_{ki}}}{\sum_{j=1}^{L} \Big[\theta_j \prod_{l=1}^{T_k} \big(\alpha_l^j\big)^{\bar{\delta}_{jl}}\Big]}, \qquad \bar{\delta}_{ki} = \frac{\delta_{ki}}{\max_{i=1,\dots,T_k}(\delta_{ki})},\ \ 0 \leq \bar{\delta}_{ki} \leq 1 \tag{3.11}$$

where 𝑘 = 1,… , 𝐿.
As shown in Equation (3.11), the activation weight depends on the rule weight (𝜃𝑘)
and the belief degrees associated with various referential points resulting from
input transformation. 𝛼𝑖𝑘, obtained by Equation (3.9), is the matching degree to
which the input value is associated with 𝐴𝑖𝑘 (𝑖 = 1,… , 𝑇𝑘; 𝑘 = 1,… , 𝐿), where
𝐴𝑖𝑘 is the referential point of the ith attribute in the kth rule. 𝛼𝑘 represents the degree
to which the input vector matches the packet attribute 𝐴𝑘 in the kth rule. 𝑇𝑘 is the
total number of all attributes in kth rule. 𝐿 is the number of belief rules in the rule
base. A belief rule with 𝜔𝑘 = 0 is by default not activated; otherwise, a belief rule
with 𝜔𝑘 > 0 is activated. The activated belief degrees associated with consequents
are then combined through an inference process using an ER approach. An
activated belief rule is treated as a piece of basic evidence, with belief degrees transformed
into basic probability masses, and the ER algorithm combines the activated belief
rules to generate the joint probability for each possible consequent (𝐷𝑗).
$$\beta_j = \frac{\mu \Big[\prod_{k=1}^{L}\big(\omega_k \beta_{jk} + 1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\big) - \prod_{k=1}^{L}\big(1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\big)\Big]}{1 - \mu \Big[\prod_{k=1}^{L}(1 - \omega_k)\Big]} \tag{3.12}$$

$$\mu = \bigg[\sum_{j=1}^{N} \prod_{k=1}^{L}\Big(\omega_k \beta_{jk} + 1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\Big) - (N - 1)\prod_{k=1}^{L}\Big(1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\Big)\bigg]^{-1} \tag{3.13}$$
The combined belief degrees denoted by 𝛽𝑗 associated with consequent
𝐷𝑗(𝑗 = 1,… ,𝑁) are a function of 𝜔𝑘 by Equation (3.12) and the belief degrees
𝛽𝑗𝑘 (𝑗 = 1,… ,𝑁; 𝑘 = 1,… , 𝐿). The activation weight 𝜔𝑘 itself depends on the input
vector 𝑥, the attribute weights 𝛿𝑖(𝑖 = 1,… , 𝑇), and the rule weights 𝜃𝑘.
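Equations (3.11)–(3.13) can be sketched end to end as follows: activation weights are computed for each rule, and the activated rules are then combined analytically. The two-rule, two-consequent rule base and all numbers below are illustrative assumptions, not values from this study.

```python
from math import prod

def activation_weights(alphas, rule_weights, attr_weights):
    """Equation (3.11): alphas[k][i] is the matching degree of the ith
    attribute in the kth rule; attribute weights are normalised by their max."""
    d_bar = [d / max(attr_weights) for d in attr_weights]
    raw = [rw * prod(a ** db for a, db in zip(al, d_bar))
           for rw, al in zip(rule_weights, alphas)]
    return [v / sum(raw) for v in raw]

def combined_beliefs(omega, beta):
    """Equations (3.12)-(3.13): beta[k][j] is the belief degree of the jth
    consequent in the kth rule; omega[k] is that rule's activation weight."""
    L, N = len(beta), len(beta[0])
    s = [sum(bk) for bk in beta]  # total belief degree of each rule
    pj = [prod(omega[k] * beta[k][j] + 1 - omega[k] * s[k] for k in range(L))
          for j in range(N)]
    p0 = prod(1 - omega[k] * s[k] for k in range(L))
    mu = 1.0 / (sum(pj) - (N - 1) * p0)       # Equation (3.13)
    p_none = prod(1 - w for w in omega)
    return [mu * (pj[j] - p0) / (1 - mu * p_none) for j in range(N)]

omega = activation_weights([[0.8, 0.3], [0.2, 0.7]], [1.0, 1.0], [1.0, 0.5])
beta = combined_beliefs(omega, [[0.9, 0.1], [0.2, 0.8]])
print(omega, beta)
```

With complete rules (belief degrees summing to one in every rule), the combined belief degrees also sum to one, which is a useful sanity check on any implementation.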
3.5. Maximum Likelihood Evidential Reasoning
(MAKER) Framework
In this section, we provide a brief explanation of the MAKER framework developed
by Yang and Xu (2017) as an alternative inferential process for system analysis
and decision making. The rule-based inferential modelling and prediction
approaches developed in this study are fundamentally based on the MAKER
framework.
This framework defines two spaces: a state space model (SSM) and an evidence
space model (ESM) (Yang and Xu, 2017). An SSM describes system states or
changes with different inputs, while an ESM describes multiple pieces of evidence
with interdependencies in a probabilistic and distributed manner to represent
system behaviours. A probability, which is obtained from likelihoods generated
from data, is assigned to each evidential element associated with a subset of
system states. As such, a piece of evidence is profiled as a basic probability
distribution. The degree of interdependence is statistically calculated through
marginal and joint likelihood functions. Two pieces of evidence are then combined
through a conjunctive ER rule.
In an SSM, suppose that 𝐻𝑛 is a system state. A system space may consist of N
mutually exclusive and collectively exhaustive system states, and hence the SSM
can be denoted by Θ = {𝐻1, 𝐻2, … , 𝐻𝑁}, with 𝐻𝑖 ∩ 𝐻𝑗 = 𝜙 for any 𝑖 ≠ 𝑗. Let 𝑃(Θ) or 2Θ be
the power set of Θ, which contains the empty set, single system states, and subsets
of system states, as described in Equation (3.14). An output is profiled by a basic
probability, which is defined as an ordinary discrete probability distribution (Yang
and Xu, 2017). No probability is assigned to the empty set. The conditions for a
probability function are described in Equations (3.15)–(3.17).
$$P(\Theta) = 2^{\Theta} = \{\phi, \{H_1\}, \dots, \{H_N\}, \{H_1, H_2\}, \dots, \{H_1, \dots, H_{N-1}\}, \Theta\} \tag{3.14}$$

$$0 \leq p(\theta) \leq 1 \quad \forall \theta \subseteq \Theta \tag{3.15}$$

$$\sum_{\theta \subseteq \Theta} p(\theta) = 1 \tag{3.16}$$

$$p(\phi) = 0 \tag{3.17}$$
𝜃 is a subset of states, or a proposition, whose probability cannot be decomposed
into pieces assigned to subsets of 𝜃. 𝑝(𝜃) is the probability that the proposition 𝜃 is true. A
system output, 𝑦, is profiled as a probability distribution as displayed in Equation
(3.18).
$$y = \Big\{(\theta, p(\theta)),\ \forall \theta \subseteq \Theta,\ \sum_{\theta \subseteq \Theta} p(\theta) = 1\Big\} \tag{3.18}$$
In an ESM, each piece of evidence is generated from data and is divided into
several evidential elements. Each element points to exactly one proposition.
Suppose that 𝑒𝑖,𝑙(𝜃) is an element of the ith piece of evidence from input variable
𝑥𝑙 which points exactly to proposition 𝜃. The evidential element of 𝑒𝑖,𝑙 represents
the evidence subspace for the ith value of 𝑥𝑙. 𝑝𝜃,𝑖,𝑙 is a basic probability assigned to
𝑒𝑖,𝑙(𝜃) according to the likelihood principle and the Bayesian principle (Yang and
Xu, 2014). 𝑐𝜃,𝑖,𝑙 is the likelihood of the ith value of 𝑥𝑙 given proposition 𝜃, and 𝑝𝜃,𝑖,𝑙
is a normalised likelihood as stated in Equation (3.20). Each 𝑒𝑖,𝑙 is then profiled by
a basic probability distribution as displayed in Equation (3.19).
$$e_{i,l} = \Big\{\big(e_{i,l}(\theta), p_{\theta,i,l}\big),\ \forall \theta \subseteq \Theta,\ \sum_{\theta \subseteq \Theta} p_{\theta,i,l} = 1\Big\} \tag{3.19}$$

$$p_{\theta,i,l} = c_{\theta,i,l} \Big/ \sum_{A \subseteq \Theta} c_{A,i,l} \tag{3.20}$$
For a discrete 𝑥𝑙, the evidence subspace can be denoted by 𝐸𝑙 =
{𝑒1,𝑙 , 𝑒2,𝑙 , … , 𝑒𝑖,𝑙 , …}, leading to a discrete ESM. For a continuous 𝑥𝑙, the most direct
approach is to discretise 𝑥𝑙, a process which is explained in Chapter 4. The
interrelationship between each pair of input variables is assessed based on the
statistical interdependence between the two inputs. According to the likelihood
principle and the Bayesian principle, the joint basic probability can be obtained
from a joint likelihood function as described in Equation (3.21). Suppose that 𝑒𝑖,𝑙
and 𝑒𝑗,𝑚 are the two pieces of evidence from input variables 𝑥𝑙 and 𝑥𝑚,
respectively. The interrelationship between the two evidential elements is
represented by the interdependence index term as shown in Equation (3.22), with
its properties shown in Equation (3.23).
p_{θ,il,jm} = c_{θ,il,jm} / ∑_{A⊆Θ} c_{A,il,jm} (3.21)

α_{A,B,i,j} = 0 if p_{A,i,l} = 0 or p_{B,j,m} = 0, and α_{A,B,i,j} = p_{A,B,il,jm} / (p_{A,i,l} p_{B,j,m}) otherwise (3.22)

α_{A,B,i,j} = 0 if e_{i,l} and e_{j,m} are disjoint; α_{A,B,i,j} = 1 if e_{i,l} and e_{j,m} are independent (3.23)
Multiple pieces of evidence are then combined through the conjunctive MAKER
rule process. In the joint-evidence state space, each output of an SSM intersects
with each evidential element in an ESM. As such, it is possible to measure the
individual support for a proposition from evidential elements and to measure joint
support with interdependence among evidential elements considered. Let s_{i,l}(θ) =
θ ∩ e_{i,l}(θ) represent the intersection between θ and e_{i,l}(θ), meaning that e_{i,l}(θ)
supports the proposition θ. If evidence e_{i,l} is generated from the same data source
as the other evidence, with probability function p, then p(s_{i,l}(θ)) is the probability
mass that proposition θ is supported by e_{i,l}(θ), as given below.

p(s_{i,l}(θ)) = p(θ|e_{i,l}(θ)) p(e_{i,l}(θ)) = r_{θ,i,l} p(e_{i,l}(θ)) (3.24)
The reliability of the evidential element e_{i,l}(θ) is denoted by r_{θ,i,l}, which is defined as
the conditional probability that proposition 𝜃 is true given that 𝑒𝑖,𝑙 supports 𝜃. This
definition measures the quality of 𝑒𝑖,𝑙. 𝑟𝜃,𝑖,𝑙 can be trained from data so that the
likelihood of the true state can be maximised. If 𝑒𝑖,𝑙 is profiled with the probability
distribution generated by Equation (3.20), based on the likelihood principle, its
support for proposition 𝜃, 𝑝𝑙 (𝑠𝑖,𝑙(𝜃)), must be proportional to 𝑝 (𝑠𝑖,𝑙(𝜃)) as stated
in Equation (3.25).
m_{θ,i,l} = p(s_{i,l}(θ)) = ω_{i,l} p_l(s_{i,l}(θ)) = ω_{i,l} p_l(θ|e_{i,l}(θ)) p_l(e_{i,l}(θ)) = w_{θ,i,l} p_l(e_{i,l}(θ)) (3.25)
where ω_{i,l} is a positive scaling constant, and w_{θ,i,l} = ω_{i,l} p_l(θ|e_{i,l}(θ)) is the
weight of an evidential element so that 𝑝 (𝑠𝑖,𝑙(𝜃)) and 𝑝𝑙 (𝑠𝑖,𝑙(𝜃)) are proportional
to each other when 𝑒𝑖,𝑙 is acquired from data for 𝑥𝑙 only. In the case where 𝑝 = 𝑝𝑙,
then 𝑤𝜃,𝑖,𝑙 = 𝑟𝜃,𝑖,𝑙 or 𝜔𝑖,𝑙 = 1. As with 𝑟𝜃,𝑖,𝑙 , 𝑤𝜃,𝑖,𝑙 can also be trained together with
other parameters to maximise the likelihood of the true state.
To determine the total degree of support for a proposition, the combination process
must be done at an elementary level and exhaustively accumulated at the end of
the process. Equation (3.26) shows the conjunctive MAKER rule generating the
degree of support for a proposition from two pieces of evidence, 𝑒𝑖,𝑙 and 𝑒𝑗,𝑚. This
process must be recursively done for all combinations before generating the total
degree of support for proposition 𝜃, denoted by 𝑝(𝜃) as presented in Equation
(3.27).
m_θ = [(1 − r_{j,m}) m_{θ,i,l} + (1 − r_{i,l}) m_{θ,j,m}] + ∑_{A∩B=θ} γ_{A,B,i,j} α_{A,B,i,j} m_{A,i,l} m_{B,j,m} (3.26)

p(θ) = 0 if θ = ∅, and p(θ) = m_θ / ∑_{C⊆Θ} m_C otherwise (3.27)
The MAKER algorithm relaxes the assumption of evidence independence in
the ER rule by statistically measuring the interdependence between each pair of
pieces of evidence, while keeping intact the core properties of the probabilistic
reasoning process in the ER rule. As it grounds the measurement through a
statistical test, some statistical rules of thumb must be satisfied, including the
minimum sample size requirement.
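To make the combination step concrete, the sketch below applies Equations (3.26) and (3.27) to two hypothetical pieces of evidence over Θ = {H1, H2}. It assumes both reliabilities equal 1 and sets the weights γ and interdependence indices α to 1 (fully reliable, independent evidence); these simplifying assumptions, the propositions, and the probability values are illustrative, not part of the thesis.

```python
# Conjunctive combination of two pieces of evidence (Eqs. 3.26-3.27),
# sketched under simplifying assumptions: reliabilities r = 1 and
# gamma = alpha = 1 (fully reliable, independent evidence).
# Propositions are frozensets of states; probabilities are illustrative.

def combine(m1, m2, r1=1.0, r2=1.0, gamma=1.0, alpha=1.0):
    """Combined support m_theta per Eq. (3.26), normalised per Eq. (3.27)."""
    m = {}
    # Bounded-sum part: support carried over when the other evidence is unreliable.
    for theta in set(m1) | set(m2):
        m[theta] = (1 - r2) * m1.get(theta, 0.0) + (1 - r1) * m2.get(theta, 0.0)
    # Conjunctive part: products of supports whose propositions intersect in theta.
    for a, ma in m1.items():
        for b, mb in m2.items():
            theta = a & b
            if theta:  # no mass is assigned to the empty set
                m[theta] = m.get(theta, 0.0) + gamma * alpha * ma * mb
    total = sum(m.values())
    return {theta: v / total for theta, v in m.items() if v > 0}

H1, TH = frozenset({"H1"}), frozenset({"H1", "H2"})
e1 = {H1: 0.7, TH: 0.3}
e2 = {H1: 0.6, TH: 0.4}
p = combine(e1, e2)
# p[H1] = 0.7*0.6 + 0.7*0.4 + 0.3*0.6 = 0.88 and p[Theta] = 0.12
```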
3.6. Machine Learning Methods
In this section, we briefly explain some machine learning methods, specifically for
classification, including logistic regression, k-nearest neighbour, classification tree,
naïve Bayes, support vector machines, and neural networks.
3.6.1. Logistic Regression
The goal of logistic regression (LR) analysis is, more or less, similar to the linear
regression model in terms of the general principle employed in the analysis
(Hosmer et al., 2013). The difference is that the outcome of the LR model is binary
or dichotomous, reflected in the form of the model and its assumptions. When used
for more than two classes, it is called multinomial logistic regression. The goal of
LR is to find the best fitting and most parsimonious model which interpretably
describes the relationship between a response (outcome or dependent variable)
and one or more independent variables (predictors, covariates, or explanatory
variables) by estimating the probabilities that reflect how closely the output belongs
to a response. In this model, let 𝑝𝑖 = 𝐸(𝑌|𝑥) be the conditional mean of 𝑌 given 𝑥,
where 𝑌 is the outcome and 𝑥 is the specific vector of predictors. 𝑝𝑖 is expressed
for the ith subject or case as seen in Equation (3.28).
p_i = exp[β_0 + β_1 x_{1,i} + ⋯ + β_j x_{j,i} + ⋯ + β_m x_{m,i}] / (1 + exp[β_0 + β_1 x_{1,i} + ⋯ + β_j x_{j,i} + ⋯ + β_m x_{m,i}]) (3.28)
where 𝑗 = 1,… ,𝑚, and 𝑚 is the number of predictors. The predictor itself must be
at least interval-scaled. Data transformation is required if categorical variables are
included in the predictor. The logit transformation is defined as follows:
logit(p_i) = ln[p_i / (1 − p_i)] = β_0 + β_1 x_{1,i} + ⋯ + β_j x_{j,i} + ⋯ + β_m x_{m,i} (3.29)
𝛽𝑗 is a coefficient or parameter representing the magnitude of change in the
outcome as a result of the unit change in 𝑥𝑗𝑖. These unknown parameters are
estimated through a learning process from a set of data based on maximum
likelihood. The next step in LR analysis is to assess the significance
of the coefficient of a variable in the model and to keep only the significant variables
in the model. This involves assessing whether the presence of the variable in the
model explains more about the variance of the outcome through a statistical test
for significance. However, in this research all variables are included in the model
to be evaluated.
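Equations (3.28) and (3.29) are straightforward to verify numerically. The sketch below computes p_i from a hypothetical coefficient vector and checks that the logit transformation recovers the linear predictor; the coefficients and function names are illustrative only.

```python
import math

# Logistic response (Eq. 3.28) and logit transformation (Eq. 3.29) for a
# single case. The coefficients below are illustrative, not fitted values.

def logistic_p(beta, x):
    """p_i = exp(b0 + b1*x1 + ... + bm*xm) / (1 + exp(...))."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """logit(p) = ln(p / (1 - p)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

beta = [-1.0, 2.0]              # beta_0 and one slope coefficient
p = logistic_p(beta, [0.5])     # z = -1 + 2*0.5 = 0, so p = 0.5
# logit(p) recovers the linear predictor z = 0
```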
3.6.2. Support Vector Machine (SVM)
A support vector machine (SVM) is a supervised machine learning algorithm for
solving problems in classification, regression, and novelty detection. SVMs have
become popular because the determination of parameters is based on a convex
optimisation problem which results in any local optimum equalling a global
optimum. An SVM for solving two-class classification problems using linear models
is discussed in this section (Bishop, 2006). The basic idea of this method is to
transform the input vector into a higher dimensional vector so that two classes can
be linearly separated by a higher dimensional surface – a so-called hyperplane.
Suppose that the training data set consists of 𝑁 input vectors denoted by 𝑥𝑛 (𝑛 =
1,… ,𝑁) with corresponding target values 𝑡𝑛, where 𝑡𝑛 ∈ {−1,1}, and new data
points 𝑥 are classified depending on the sign of 𝑦(𝑥), which is formulated using
linear models as depicted in Equation (3.30). An SVM classifier needs to satisfy
y(x_n) > 0 for points having t_n = 1 and y(x_n) < 0 for points having t_n = −1, so that
t_n y(x_n) > 0 holds for all training data points.
𝑦(𝑥) = 𝑤𝑇𝜙(𝑥) + 𝑏 (3.30)
where ϕ(x) denotes a fixed feature-space transformation, w is the normal vector
to the learned hyperplane, and b is a bias parameter.
An SVM determines the optimal hyperplane based on the concept of a margin,
which is defined as the perpendicular distance between the hyperplane and the
closest data points. y(x) = 0 defines the hyperplane that discriminates between the
two classes, with each data point assigned to the class t_n = 1 or t_n = −1 according
to the sign of y(x). The location of the decision boundary or hyperplane is determined by a
subset of the data points known as support vectors. The optimal hyperplane is
found by maximising the margin.
In practice, however, the class-conditional distributions may overlap, resulting in
poor generalisation when an exact separation is made. Therefore, an SVM is
modified such that some of the training points can be misclassified, but with a
penalty whose value is a linear function of the distance from the boundary, denoted
by ξ_n (n = 1, …, N), known as a slack variable. ξ_n = 0 for a data point on or inside
the correct margin boundary, ξ_n = |t_n − y(x_n)| for other points, and ξ_n = 1 for a
data point exactly on the decision boundary. Hence, points with 0 < ξ_n ≤ 1 lie inside
the margin but on the correct side of the boundary, while points with ξ_n > 1 are
misclassified. This technique is described as relaxing a hard margin constraint to
be a soft margin. Note that the penalty increases with ξ_n, so the framework remains
sensitive to outliers. Hence, a parameter C > 0 is introduced as a regularisation
coefficient to control the trade-off between the slack variable penalty and the
margin – more precisely, the trade-off between training errors and model
complexity. The optimal
parameters for w and b can be found by solving the quadratic programming
problem in Equation (3.31), in which the objective is to minimise a quadratic
function subject to a set of linear inequality constraints.

min_{w,b,ξ_n} (1/2) wᵀw + C ∑_{n=1}^{N} ξ_n
s.t. t_n(wᵀϕ(x_n) + b) ≥ 1 − ξ_n and ξ_n ≥ 0, n = 1, …, N (3.31)

To solve the problem, Lagrange multipliers α_n ≥ 0 are introduced, with one
multiplier for each constraint. The dual representation of the maximum margin
problem can be derived as displayed in Equation (3.32).

max_α ∑_{n=1}^{N} α_n − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} α_n α_m t_n t_m k(x_n, x_m)
s.t. ∑_{n=1}^{N} α_n t_n = 0 and 0 ≤ α_n ≤ C, n = 1, …, N (3.32)

The kernel function is defined by k(x, x′) = ϕ(x)ᵀϕ(x′), which maps the input
vectors into a suitable feature space. Several kernel types are used in SVMs,
such as linear, polynomial, and radial basis function kernels. Once the problem
defined in Equation (3.32) has been solved, the following formula can be used to
classify new data points.

y(x) = wᵀϕ(x) + b = ∑_{n=1}^{N} α_n t_n k(x, x_n) + b = ∑_{n∈sv} α_n t_n k(x, x_n) + b (3.33)
Data points with 𝛼𝑛 = 0 do not contribute to defining the predictive model in
Equation (3.33), and the remaining data points, known as support vectors (𝑠𝑣) with
𝛼𝑛 > 0, define the decision function.
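Once the α_n are known, applying Equation (3.33) only requires the support vectors. The sketch below evaluates the decision function with a linear kernel for hand-picked, hypothetical support vectors, multipliers, and bias; it is not a full SVM trainer.

```python
# SVM decision function (Eq. 3.33): y(x) = sum over support vectors of
# alpha_n * t_n * k(x, x_n) + b. The support vectors, multipliers, and
# bias below are hand-picked for illustration, not learned from data.

def linear_kernel(x, x_n):
    return sum(a * b for a, b in zip(x, x_n))

def decision(x, support_vectors, kernel, b=0.0):
    """support_vectors: list of (x_n, t_n, alpha_n) triples with alpha_n > 0."""
    return sum(alpha * t * kernel(x, x_n) for x_n, t, alpha in support_vectors) + b

sv = [((2.0,), 1, 0.5),     # positive-class support vector
      ((-2.0,), -1, 0.5)]   # negative-class support vector

# New points are classified by the sign of y(x):
y_pos = decision((3.0,), sv, linear_kernel)   # > 0, so class +1
y_neg = decision((-1.0,), sv, linear_kernel)  # < 0, so class -1
```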
3.6.3. Neural Networks (NN)
A neural network (NN) is a computational graph with nodes as computing units and
directed edges as transmission units which pass the numerical information from
node to node (Bishop, 2006; Haykin, 1999). One of the most famous structures for
an NN, the feed-forward neural network, also known as the multilayer perceptron
(MLP), is discussed in this chapter. The MLP consists of multiple layers of neurons,
which are input layers directly connected to external data, one or more hidden
layers, and an output layer. The structure in Figure 3.1 is described as a single-
hidden-layer network, a typical MLP with one hidden layer, where each layer is
fully connected to the next layer. An MLP is a series of function transformations
which generate the predicted output in the case of either classification or
regression from external data through an activation or transfer function. An MLP is
a nonlinear function of a linear combination of the inputs with adaptive coefficients
or parameters.
Figure 3.1 A single-hidden-layer neural network (Bishop, 2006)
Suppose that we have D input variables with M neurons in one hidden layer to
predict K classes in the case of classification. First, 𝑀 linear combinations of the
input variables are constructed as follows
a_j = ∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} (3.34)

z_j = h(a_j) (3.35)
where 𝑗 = 1,… ,𝑀; 𝑖 = 1,… , 𝐷.
𝑎𝑗 values are known as activations. These values are then transformed through a
nonlinear activation function, ℎ(. ), which is generally chosen to be a sigmoid
function – for example, tanh. The superscript (1) indicates that the corresponding
weights, denoted by w_{ji}, and biases, denoted by w_{j0}, are in the first layer of the
network. The values resulting from Equation (3.35), known as hidden units, are
again linearly combined to give the output unit activation as presented in Equation
(3.36).
a_k = ∑_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} (3.36)
where 𝑘 = 1,… , 𝐾 and 𝐾 is the number of outputs.
This process corresponds to the second layer of the network shown by the
superscript (2). The final step is to transform the output unit activation using an
activation function, 𝑓(. ), which depends on the type of the outputs: for example, a
logistic sigmoid function for a binary case as shown in Equation (3.38) and a
softmax activation function for a multiclass problem as shown in Equation (3.39).
The overall neural network for all stages can be formulated in Equation (3.37).
y_k(x, w) = f(∑_{j=1}^{M} w_{kj}^{(2)} h(∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}) + w_{k0}^{(2)}) (3.37)

y_k = σ(a) = 1 / (1 + exp(−a)) (3.38)

y_k = exp(a_k) / ∑_{B=1}^{K} exp(a_B) (3.39)
For network training, different algorithms have been applied to find the optimal
vector of 𝑤. The most popular algorithm is backpropagation. It minimises the error
function in weight space through the method of gradient descent. Given a training
set of N samples, the network is trained by minimising the sum-of-squares error function
between the output vector generated by the network (𝑦𝑛) and the target denoted
by 𝑡𝑛 using Equations (3.40)-(3.41).
E(w) = (1/2) ∑_{n=1}^{N} (y_n − t_n)² (3.40)

E = (1/2) ∑_{k} (y_k − t_k)² (3.41)
A solution of the learning problem is the combination of weights which minimises
the error function. Every 𝑘-th component of the output vector is evaluated by
Equation (3.41), where 𝑦𝑘 and 𝑡𝑘 denote the 𝑘-th component of the output vector
𝑦𝑛 and of the target 𝑡𝑛, respectively. Those values are then accumulated to give
the sum 𝐸. The nonlinearity of the network function causes the error function 𝐸(𝑤)
to be nonconvex, and hence the solution might be a local minimum of the error
function. At first, initial weights are randomly chosen, and then the gradient of the
error function is computed and used to correct the initial weights as displayed in
Equation (3.42).
∇E = (∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_ℓ) (3.42)

Δw_p = −η ∂E/∂w_p (3.43)
where p = 1, 2, …, ℓ, and η is the so-called learning constant, which defines the step
length of each iteration in the direction of the negative gradient. The gradient is
then recursively computed until the local minimum is found.
The following steps are designed for a one-hidden-layer network. The weights
between the hidden layer and the output layer are updated by Equation (3.44).
Similarly, the weight updates between the neurons in the input and hidden layer
are made as displayed in Equation (3.45).
w_{kj}(t + 1) = w_{kj}(t) + Δw_{kj}(t), where

Δw_{kj}(t) = −η ∂E/∂w_{kj} = −η (∂E/∂y_k)(∂y_k/∂a_k)(∂a_k/∂w_{kj}) = −η (y_k − t_k) f′(a_k) z_j = −η δ_{y_k} z_j (3.44)
where 𝛿𝑦𝑘 is referred to as the error signal of the neuron 𝑘 in the output layer.
Δw_{ji}(t) = −η ∂E/∂w_{ji} = −η ∑_{k} (∂E/∂y_k)(∂y_k/∂a_k)(∂a_k/∂z_j)(∂z_j/∂a_j)(∂a_j/∂w_{ji}) = −η ∑_{k} (y_k − t_k) f′(a_k) w_{kj} h′(a_j) x_i = −η δ_{z_j} x_i (3.45)
where δ_{z_j} = h′(a_j) ∑_{k} δ_{y_k} w_{kj} corresponds to the error signal of neuron j in the
hidden layer. The steps above are repeated until convergence is reached (i.e.
when the error is below the pre-set value). The optimal weight vector is considered
to be the solution.
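The weight updates in Equations (3.43)–(3.45) can be checked on a tiny network. The sketch below takes one gradient step for a single training case with a single output and verifies that the error of Equation (3.41) decreases; the network size, weights, target, and learning rate are hypothetical, and for simplicity both layers use the logistic sigmoid, so h′(a) = f′(a) = f(a)(1 − f(a)).

```python
import math

# One backpropagation step (Eqs. 3.43-3.45) on a tiny 1-2-1 network. Both
# layers use the logistic sigmoid; all numeric values are illustrative.

def sig(a):
    return 1.0 / (1.0 + math.exp(-a))

def step(x, t, w1, b1, w2, b2, eta=0.5):
    # Forward pass (Eqs. 3.34-3.38): one input, two hidden units, one output.
    a1 = [w1[j] * x + b1[j] for j in range(2)]
    z = [sig(a) for a in a1]
    a2 = sum(w2[j] * z[j] for j in range(2)) + b2
    y = sig(a2)
    # Error signals: delta_y = (y - t) f'(a2); delta_z_j = h'(a1_j) delta_y w2_j.
    dy = (y - t) * y * (1.0 - y)
    dz = [z[j] * (1.0 - z[j]) * dy * w2[j] for j in range(2)]
    # Gradient-descent updates, w <- w - eta * dE/dw (Eqs. 3.43-3.45).
    w1n = [w1[j] - eta * dz[j] * x for j in range(2)]
    b1n = [b1[j] - eta * dz[j] for j in range(2)]
    w2n = [w2[j] - eta * dy * z[j] for j in range(2)]
    b2n = b2 - eta * dy
    error = 0.5 * (y - t) ** 2          # Eq. (3.41) for a single output
    return w1n, b1n, w2n, b2n, error

params = ([0.1, -0.2], [0.0, 0.0], [0.3, 0.3], 0.0)
*updated, e_before = step(1.0, 1.0, *params)
e_after = step(1.0, 1.0, *updated)[-1]
# e_after < e_before: one gradient step reduces the error
```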
3.6.4. Classification Tree
The classification and regression trees (CART) algorithm is one of the popular
algorithms for tree induction. It is able to perform under nonlinear relationships
between features and outcome and also where features interact with each other
(Molnar, 2019). Equation (3.46) explains the relationship between the features (𝑥)
and the outcome (𝑦).
ŷ = f(x) = ∑_{m=1}^{M} c_m I{x ∈ R_m} (3.46)

Each sample x must belong to exactly one leaf node, denoted by R_m. The
indicator function I{x ∈ R_m} equals 1 if a sample is in the subset R_m and 0
otherwise. If a sample belongs to a leaf node R_l, the predicted outcome ŷ equals
c_l, where c_l is the average of all training samples in the leaf node R_l.
The subsets are obtained by recursively partitioning the input space. At first, a
feature which will result in the best partitions in terms of the Gini index, which
indicates the impurity of a node, is selected to become a decision node. Then the
algorithm searches for the best cut-off point of the selected feature that minimizes
the Gini index of the class distribution of the outcome. If all classes in a split
subset occur with the same frequency, the node is maximally impure, whereas a
subset containing a single class is pure. This process is repeated recursively until
the stop criterion is met.
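The Gini impurity used by CART to choose splits can be sketched as follows; the labels and cut-off are toy values, and a full tree grower would repeat this search recursively over nodes and features.

```python
# Gini impurity of a node and the weighted impurity of a binary split,
# as used by CART to select decision nodes. Labels below are toy values.

def gini(labels):
    """1 minus the sum of squared class proportions; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(values, labels, cut):
    """Weighted Gini impurity after splitting a single feature at `cut`."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = [0, 0, 1, 1]
g = gini(labels)                                   # equal frequencies: 0.5
s = split_gini([1.0, 2.0, 8.0, 9.0], labels, 5.0)  # perfect cut: impurity 0
```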
3.6.5. Naïve Bayes
Naïve Bayes utilises the Bayes’ theorem of conditional probabilities for
classification as presented in Equation (3.47). With a strong (naïve) assumption of
independence between features, this method calculates the probability of a sample
belonging to a class based on the value of each feature. The class probability is
estimated for each feature independently.
P(C_k|x) = (1/Z) P(C_k) ∏_{i=1}^{n} P(x_i|C_k) (3.47)
where Z is a scaling factor that makes the probabilities over all classes sum to 1,
and n is the number of features in the dataset.
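Equation (3.47) amounts to multiplying a prior by per-feature likelihoods and renormalising. The priors, likelihood tables, class labels, and feature values below are invented for illustration.

```python
# Naive Bayes posterior (Eq. 3.47): prior times the product of per-feature
# likelihoods, renormalised by Z. All probabilities below are invented.

def _prod(factors):
    out = 1.0
    for f in factors:
        out *= f
    return out

def posterior(priors, likelihoods, x):
    """priors: {class: P(C_k)}; likelihoods: {class: [P(x_i|C_k) lookup per feature]}."""
    unnorm = {
        c: priors[c] * _prod(likelihoods[c][i][v] for i, v in enumerate(x))
        for c in priors
    }
    z = sum(unnorm.values())  # Z makes the class probabilities sum to 1
    return {c: u / z for c, u in unnorm.items()}

priors = {"buy": 0.5, "skip": 0.5}
likelihoods = {
    "buy":  [{"low": 0.2, "high": 0.8}],   # one categorical feature
    "skip": [{"low": 0.6, "high": 0.4}],
}
p = posterior(priors, likelihoods, ["high"])
# p["buy"] = 0.5*0.8 / (0.5*0.8 + 0.5*0.4) = 2/3
```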
3.7. Sequential Least Squares Programming
(SLSQP)
In this study, we propose a hierarchical rule-based inferential modelling and
prediction approach based on the MAKER framework. We train the model
parameters, including weights (reliabilities), referential values, and belief degrees
of consequents, while keeping the sample size per combination of referential
values of different input variables at least equal to the minimum statistical
requirement. These parameters are optimised by minimising the mean squared
error. The objective is thus to minimise a quadratic function subject to a set of
inequality constraints – the minimum sample size requirement per combination of
referential values of different input variables – and equality constraints – the total
degrees of belief for the consequents of each belief rule must sum to 1. The
sequential (least squares) quadratic programming algorithm can deal with this
kind of optimisation problem.
The sequential (least squares) quadratic programming (SLSQP) algorithm is one
of the more popular, robust, and efficient computational methods for nonlinear
optimisation problems (Kraft, 1988; Boggs and Tolle, 1995). It is designed as a
nonlinearly constrained, gradient-based optimisation with equality and inequality
constraints (Kraft, 1988). Sequential (least squares) quadratic programming is a
powerful tool in data analytics software, with well-established implementations on
many platforms – for example, SLSQP in SciPy, fmincon in Matlab, SNOPT, and
filterSQP. Given these advantages, this method is applied in this study. We utilise
SLSQP through the minimize function in SciPy to find the optimised model
parameters – that is, the weights, referential values, and degrees of belief for
consequents of each belief rule.
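As a minimal illustration of the SciPy interface used here (not the thesis's actual objective or constraints), the sketch below minimises a small quadratic with SLSQP. Note that SciPy expresses 'ineq' constraints as fun(x) ≥ 0, the opposite sign convention to c(x) ≤ 0 in Equation (3.48).

```python
from scipy.optimize import minimize

# Minimal SLSQP example (not the thesis's model): minimise
# (x0 - 1)^2 + (x1 - 2.5)^2 subject to linear inequality constraints and
# nonnegativity bounds. SciPy's 'ineq' convention means fun(x) >= 0.

fun = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2
cons = (
    {"type": "ineq", "fun": lambda x:  x[0] - 2 * x[1] + 2},
    {"type": "ineq", "fun": lambda x: -x[0] - 2 * x[1] + 6},
    {"type": "ineq", "fun": lambda x: -x[0] + 2 * x[1] + 2},
)
res = minimize(fun, x0=[2.0, 0.0], method="SLSQP",
               bounds=[(0, None), (0, None)], constraints=cons)
# res.x is approximately (1.4, 1.7)
```

Equality constraints, such as requiring the degrees of belief in a rule to sum to 1, would be added with a `{"type": "eq", ...}` entry in the same constraints tuple.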
In general, the nonlinear optimisation problem with equality and inequality
constraints can be defined in Equation (3.48). The SLSQP algorithm sequentially
approximates this original problem. It is solved iteratively with an initial vector of
parameters denoted by x_0, while x_k indicates the vector of parameters at the kth
iteration. x_{k+1} can be obtained using Equation (3.51).
min_x f(x)
s.t. b(x) = 0, c(x) ≤ 0 (3.48)

L(x, λ, μ) = f(x) + λᵀb(x) + μᵀc(x) (3.49)

where λ and μ are the vectors of multipliers for the equality and inequality
constraints, respectively.

min_d ∇f(x_k)ᵀd + (1/2) dᵀH(x_k)d
s.t. b_i(x_k) + ∇b_i(x_k)ᵀd = 0, i = 1, 2, …, m
c_j(x_k) + ∇c_j(x_k)ᵀd ≤ 0, j = 1, 2, …, n (3.50)

x_{k+1} = x_k + α_k d_k, α_k ∈ (0, 1] (3.51)

where d_k indicates the search direction within the kth step and α_k is the step
length.
The search direction (𝑑𝑘) is obtained from information generated by solving a
quadratic programming subproblem which is formulated by a quadratic
approximation of the Lagrange function of nonlinear programming, subject to linear
approximations of the constraints (Kraft, 1988). The Lagrangian function of this
problem can be seen in Equation (3.49). The quadratic subproblem is defined in
Equation (3.50). The vectors d, Δλ, and Δμ are defined as d = x − x_k, Δλ = λ − λ_k, and
Δμ = μ − μ_k, respectively. The solution of this subproblem provides a search direction for x.
The quadratic subproblem reflects the local properties of the original problem: it is
relatively easy to solve, and its objective represents the nonlinearities of the
original problem (Boggs and Tolle, 1995). If the quadratic subproblem is
appropriately chosen, this method can be seen as an extension of Newton and
quasi-Newton methods. Thus, the SLSQP method is expected to share the
characteristic of Newton-like methods, namely rapid convergence with an optimal
step length of α_k = 1 when the iterates are close to the solution. Far from the
solution, however, the iterates can behave erratically; the step length α_k is
therefore modified so that x_{k+1} becomes a better approximation to the optimum
solution (Boggs and Tolle, 1995).
The general flow chart of SLSQP can be seen in Figure 3.2. δ and K are predefined
parameters: the solution is considered to have converged when the number of
iterations reaches K or the norm of the vector d falls below δ. In this study, δ and
K are set at .0001 and 200, respectively.
3.8. Evaluation Metrics
The evaluation metric discussed below is designed for classification problems.
Classification can be divided based on data types into binary, multiclass, and multi-
labelled classification. The metrics are categorised into three types, including
threshold, probability, and ranking metrics (Hossin and Sulaiman, 2015). All these
types produce a single value, which makes evaluation easier, although in some
cases subtle details of a classifier's performance cannot be explicitly captured.
For example, in a case of very imbalanced data, relying solely on accuracy can be
misleading.
3.8.1. Threshold Metrics
Before the evaluation metrics are explained further, the confusion matrix as a base
for many common classification metrics is explained in Figure 3.3 (Awad and
Khanna, 2015). Many classifiers yield a probabilistic output showing the degree to
which an instance is a member of a class. A decision threshold converts this
probabilistic output into a discrete classifier. Suppose that the threshold is set at
0.5: any instance with a probabilistic output above 0.5 is predicted as a positive
instance; otherwise it is predicted as a negative instance.
[Flow chart: start with an initial vector of parameters x_0 and set the iteration
counter k = 0; at the kth iteration, evaluate f(x_k), b(x_k), and c(x_k) and update
the Lagrange function H; solve the quadratic subproblem to determine d_k; if
‖d_k‖ < δ or k ≥ K, stop; otherwise modify α_k so that x_{k+1} is closer to the
solution, calculate x_{k+1}, update k = k + 1, and repeat.]
Figure 3.2. The procedure of sequential (least squares) quadratic programming method
                     Actual class: Positive    Actual class: Negative
Predicted positive:  True positives (tp)       False positives (fp)
Predicted negative:  False negatives (fn)      True negatives (tn)
Column totals:       Total positives (P)       Total negatives (N)
Figure 3.3. Confusion matrix of binary problem
From this confusion matrix, tp and tn are the numbers of correctly classified
positive and negative instances, respectively, while fp and fn denote the numbers
of misclassified negative and positive instances, respectively. From this matrix,
some evaluation metrics are generated, as listed in Table 3.2. Accuracy is the
most-used metric, since it is easy to compute; applicable for binary, multiclass, or
multi-label problems; and easy to interpret (Hossin and Sulaiman, 2015).
However, it is less distinctive and provides less discriminable values. This limitation
becomes misleading in the case of very imbalanced data: the accuracy value may
seem acceptable even if none of the minority class instances are correctly
predicted by a trained classifier.
Table 3.2. Threshold metrics

Metric                                              Equation
Accuracy                                            (tp + tn) / (P + N)
Misclassification rate                              (fp + fn) / (P + N)
True positive rate (tp rate) / Recall (r) /
  Sensitivity                                       tp / P
False positive rate (fp rate)                       fp / N
Specificity                                         tn / (tn + fp) = 1 − fp rate
Precision (p)                                       tp / (tp + fp)
F-measure                                           2 / (1/p + 1/r) = 2 × (p × r) / (p + r)
F-beta score                                        (1 + β²) × (p × r) / ((β² × p) + r)
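The metrics in Table 3.2 follow directly from the confusion-matrix counts; the counts in the sketch below are arbitrary illustrative values.

```python
# Threshold metrics computed from confusion-matrix counts (Table 3.2).
# The counts below are arbitrary illustrative values, not real results.

def threshold_metrics(tp, fp, fn, tn):
    p_total, n_total = tp + fn, fp + tn          # column totals P and N
    precision = tp / (tp + fp)
    recall = tp / p_total                        # tp rate / sensitivity
    return {
        "accuracy": (tp + tn) / (p_total + n_total),
        "fp_rate": fp / n_total,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

m = threshold_metrics(tp=8, fp=2, fn=1, tn=9)
# accuracy = 17/20 = 0.85, precision = 0.8, recall = 8/9, F-measure = 16/19
```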
3.8.2. Probability Metrics
Mean squared error (MSE) is an example of a probability metric (Hossin and
Sulaiman, 2015). It measures the gap between the predicted values and the actual
values, denoted by 𝑃𝑛 and 𝐴𝑛, respectively. It is defined for 𝑁 samples as depicted
in Equation (3.52).
MSE = (1/N) ∑_{n=1}^{N} (P_n − A_n)² (3.52)
3.8.3. Ranking Metrics
A receiver operating characteristics (ROC) curve is proposed for performance
visualisation and model selection (Hossin and Sulaiman, 2015). A ROC curve is a
two-dimensional graph where tp rate (or sensitivity) on the y axis is plotted against
fp rate (1 - specificity) on the x axis for different cut-off points. See Table 3.2 to
recall the definitions of sensitivity and specificity. The trade-off between benefits
(true positives) and costs (false positives) can be seen in the ROC curve, because
any increase in tp rate (sensitivity) generally comes with an increase in fp rate
(1 − specificity). A perfect classifier gives 100% for both sensitivity and specificity, which
is point (0,1), meaning that the closer the plot to the upper left corner, the better
the classifier. The 45-degree diagonal line depicted in Figure 3.4 shows a random
classifier. Point (0,0) shows that a classifier never issues a positive class; as such
the classifier fails to predict positive classes, which results in zero values for both
false positive errors and true positives. Meanwhile, point (1,1) unconditionally
issues positive classes. Any point at the left-hand side of a ROC curve near the x
axis is considered ‘conservative’, a situation in which a classifier issues positive
classes only with strong evidence, making few false positive errors and having a
low tp rate as well (Fawcett, 2006). Classifiers located at the right-hand side of a
ROC curve can be considered ‘liberal’, a situation in which the classifier utilises
weak evidence and issues nearly all positives correctly but often produces a high
fp rate.
[Plot: true positive rate on the y-axis against false positive rate on the x-axis,
each ranging from 0 to 1.]
Figure 3.4. ROC curve
The area under the curve measures the accuracy of an algorithm and is
abbreviated as AUC. AUC is one of the most popular ranking metrics and has been
proven to provide a better representation of an algorithm’s performance than does
accuracy. Its values range between 0 and 1. The advantage of using AUC is its
ability to reflect the overall ranking performance of a classifier with a single scalar
value. For binary problems, AUC can be defined as depicted in Equation (3.53)
(Hossin and Sulaiman, 2015).
AUC = (S_p − n_p(n_p + 1)/2) / (n_p n_n) (3.53)

where S_p is the sum of the ranks of the positive instances, and n_p and n_n are the
numbers of positive and negative instances, respectively.
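Equation (3.53) can be computed from the ranks of the positive instances. The sketch below assumes no tied scores (ties would require average ranks); the scores and labels are toy values.

```python
# Rank-based AUC (Eq. 3.53). Instances are ranked in ascending order of
# score; S_p is the sum of the positives' ranks. Assumes no tied scores.

def auc(scores, labels):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = {idx: rank for rank, idx in enumerate(order, start=1)}
    n_p = sum(labels)
    n_n = len(labels) - n_p
    s_p = sum(ranks[i] for i, y in enumerate(labels) if y == 1)
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)

# A perfect ranking gives AUC = 1; swapping one pair out of four gives 0.75.
perfect = auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])   # 1.0
swapped = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])   # 0.75
```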
Table 3.3 lists the rules of thumb for AUCROC according to Hosmer et al. (2013).
According to the table, AUCROC scores range from .5 to 1. The nearer the score
is to 1, the better the classifier at discriminating the outcome groups. An AUCROC
score of .5 shows that a classifier fails to discriminate between the outcome
groups, since this corresponds to chance or a random classifier.
Table 3.3. Rules of thumb for AUC
Area Point system
.5 - .6 Fail
.6 - .7 Poor
.7 - .8 Fair
.8 - .9 Good
.9 – 1 Excellent
Another alternative performance metric under a large skew class distribution is the
precision-recall curve (Davis and Goadrich, 2006), or PR curve. The curve plots
recall on the x-axis and precision on y-axis. Recall is the same as the true positive
rate used in the ROC curve. The fraction of observations that are positive and are
classified as positive is called precision, as seen in Table 3.2. For the ROC curve,
the closer the line is to the upper-left corner, the better the model performance;
for the PR curve, the closer the line is to the upper-right corner, the better the
model performance. Davis and Goadrich (2006) explained the differing visual
representations of ROC and PR curves and highlighted that, under a large skew
in the class distribution, the performance difference among classifiers can be
identified more clearly with the PR curve than with the ROC curve. In their
example of a cancer detection dataset, all classifiers seemed close to optimal
based on the ROC curve; the PR curve, however, indicated that there was still
vast room for improvement. Similarly to the ROC curve, the area under the PR
curve (AUCPR) can be estimated using a composite trapezoidal method.
3.9. Summary
In this chapter, we presented the research methods used in Chapters 4, 5, and 6.
First, we discussed how the data used in Chapter 4 were generated; explained
the data collection, including the source and components of the database used in
Chapters 5 and 6; and provided a brief justification for the database chosen to
obtain the desired dataset. Further explanation of how the
desired dataset was obtained is given in Chapters 5 and 6. Second, we briefly
explained the original ER rule as a foundation for the MAKER framework, the
application of the ER algorithm in the BRB system, ideas and rationales useful for
the development of a hierarchical rule-based inferential modelling and prediction
based on the MAKER framework in Chapter 4. Third, we illustrated the sequential
(least squares) quadratic programming used in the classification of customer types
and decisions in Chapters 4, 5 and 6. Fourth, we briefly introduced some popular
machine learning methods and their algorithms for classification. Finally, we also
reviewed some performance evaluation metrics as a foundation for selecting
metrics for model comparisons in Chapters 4, 5 and 6.
Chapter 4 A Hierarchical Rule-based
Inferential Modelling and Prediction
4.1. Introduction
This chapter thoroughly explains the classifiers based on the MAKER framework
established by hierarchical rule-based modelling and prediction – namely, the
MAKER-ER-based and MAKER-BRB-based models for dealing with sparse
matrices and complex numerical data. It starts with an introduction to the MAKER
framework with referential values for data discretisation in Section 4.2. Section 4.3
explains the MAKER algorithm with a referential value-based discretisation
technique for data transformation. Section 4.4 explains the concept of the belief
rule base. Section 4.5 explores the hierarchical rule-based inferential modelling
and prediction approach, investigates the methods for grouping evidence, and
explains the process for final inference. Section 4.6 explains how the model
parameters can be learned from data. The proposed models are then compared
analytically and graphically with other machine learning methods in Section 4.7. A
summary of the chapter is provided in Section 4.8.
4.2. Introduction to MAKER Framework
The maximum likelihood evidential reasoning (MAKER) framework was introduced
by Yang and Xu (2017). It is a data-driven inference process to predict the outputs
of a system from its input under uncertainty. Yang and Xu (2017) emphasize four
unique features of this approach: 1) its establishment with unknown prior
probability as a default, 2) its explicit measurement of ambiguity in data, 3) its
explicit measurement of the quality of data (known as evidence reliability), and 4)
its ability to take into account statistically measured dependencies between pieces
of evidence. According to Yang and Xu (2017), the MAKER framework defines two
types of models – state space models (SSMs) and evidence space models (ESMs)
– and a conjunctive MAKER rule. The basic concepts and steps in the MAKER
framework are presented in the following.
An SSM describes a system whose states change with different inputs. It consists
of a finite number of states, which makes Dempster's original thinking on state
space the foundation of SSMs (Dempster, 2008). Following Yang and Xu (2017),
suppose that H_n is a system state. The system has at least N disjoint states which
do not overlap each other, and hence the SSM can be denoted by Θ =
{H_1, …, H_n, …, H_N}, with H_i ∩ H_j = ∅ for any i ≠ j. We can assign probability to a subset of system states.
Let 𝑃(Θ) or 2Θ be the power set of Θ which contains the empty set ∅ and the full
state space Θ. According to Yang and Xu (2017), an output of the system is
modelled by a unique set function, which is referred to as a basic probability
function that is defined as an ordinary discrete probability distribution function. No
probability is assigned to the empty set. The basic probability function is presented
in Definition 4.1.
Definition 4.1 (Basic probability function)
A basic probability function is defined as 𝑝: 2Θ → [0,1] if conditions (4.1)–(4.3) are
satisfied. 𝜃 is a subset of states, which is known as an assertion. 𝑝(𝜃) is the
probability when proposition 𝜃 is true. It is assigned exactly to 𝜃 and cannot be
decomposed into pieces assigned to subsets of 𝜃 (Yang and Xu, 2017).

0 ≤ 𝑝(𝜃) ≤ 1, ∀𝜃 ⊆ Θ (4.1)

∑_{𝜃⊆Θ} 𝑝(𝜃) = 1 (4.2)

𝑝(∅) = 0 (4.3)
Definition 4.2 (System output)
A system output 𝑦 is defined as a probability distribution as shown in Equation
(4.4). 𝑝(𝜃) can be obtained using Equations (4.16) and (4.17). If 𝑝(𝜃) > 0, 𝜃 is
referred to as a focal element of 𝑦. Yang and Xu (2017) stated that, with the
foundation from Dempster (2008), an assertion can be profiled by three
probabilities with 𝑝𝑡 + 𝑝𝑓 + 𝑝𝑢 = 1, where 𝑝𝑡, 𝑝𝑓, and 𝑝𝑢 are the probabilities
representing 'true', 'false', and 'unknown', termed the triad of an assertion.
Therefore, this framework allows the inference process to be conducted with
ambiguous information or unknown data (Yang and Xu, 2017).

𝑦 = {(𝜃, 𝑝(𝜃)), ∀𝜃 ⊆ Θ, ∑_{𝜃⊆Θ} 𝑝(𝜃) = 1} (4.4)
In an ESM, an evidence space is a space derived from data. Each piece of
evidence in the evidence space is acquired from data. Each piece of evidence can
be partitioned into evidential elements each of which points to exactly one
assertion in the state space or an element in the power set of the states.
Evidence acquisition from data is developed based on the likelihood principle and
the Bayesian principle. According to Rohde (2014), evidence derived from
observations that have proportional likelihoods should be the same, which is
known as the likelihood principle. The likelihood principle essentially holds that the
likelihood function, or likelihood in short, is the sole basis for inference. A likelihood
function denoted by 𝑓(𝑥; 𝜃) arises from a probability density function of 𝑥, which is
a function of the unknown parameter 𝜃 (Rohde, 2014). Meanwhile, the Bayesian
principle indicates that combining the evidence with the prior distribution of the
states should lead to the posterior probability (Yang and Xu, 2017).
Based on the data acquisition as described above, we can build a one-dimensional
ESM for each input variable (Yang and Xu, 2017). Suppose that 𝑒𝑖,𝑙(𝜃) is an
element of the ith piece of evidence from input variable 𝑥𝑙 which points exactly to
proposition 𝜃. The evidential element of 𝑒𝑖,𝑙 represents the evidence subspace for
the ith value of 𝑥𝑙. 𝑝𝜃,𝑖,𝑙 is the basic probability that the evidence element 𝑒𝑖,𝑙 points
exactly to assertion 𝜃, presented by 𝑝𝜃,𝑖,𝑙 = 𝑝𝑙 (𝑒𝑖,𝑙 (𝜃)). Let 𝑐𝜃,𝑖,𝑙 be the likelihood
of the ith value of 𝑥𝑙 given proposition 𝜃. The basic probability 𝑝𝜃,𝑖,𝑙 is a normalised
likelihood as stated in Equation (4.5). Given the basic probability 𝑝𝜃,𝑖,𝑙 that is
acquired from input 𝑥𝑙 for each assertion, we can then define the system input from
evidence 𝑒𝑖,𝑙, as explained in Definition 4.3 (Yang and Xu, 2017).

𝑝𝜃,𝑖,𝑙 = 𝑐𝜃,𝑖,𝑙 / ∑_{𝐴⊆Θ} 𝑐𝐴,𝑖,𝑙 (4.5)
Definition 4.3 (System input)
A basic probability distribution can be assigned to 𝑒𝑖,𝑙 as presented in Equation
(4.6), forming a system input (Yang and Xu, 2017).

𝑒𝑖,𝑙 = {(𝑒𝑖,𝑙(𝜃), 𝑝𝜃,𝑖,𝑙), ∀𝜃 ⊆ Θ, ∑_{𝜃⊆Θ} 𝑝𝜃,𝑖,𝑙 = 1} (4.6)
where 𝑝𝜃,𝑖,𝑙 is acquired from input variable 𝑥𝑙 by using Equation (4.5). According to
Yang and Xu (2017), evidential elements 𝑒𝑖,𝑙(𝐻𝑛) for all 𝐻𝑛 ∈ Θ represent the
evidence subspace for the 𝑖th value of 𝑥𝑙. If 𝑥𝑙 is discrete, the evidence subspace
can be denoted by 𝐸𝑙 = {𝑒1,𝑙, 𝑒2,𝑙, … , 𝑒𝑖,𝑙, … }, leading to a discrete ESM.
As previously stated, one of the unique features of the MAKER framework is that
it considers the interrelationship between a pair of evidence in the model.
According to Yang and Xu (2017), the interdependence is measured by the statistical
interdependence between the pair of evidence. According to the likelihood
principle and the Bayesian principle, joint basic probability can be obtained from a
joint likelihood function, which is discussed later. The following section describes
the MAKER algorithm with referential values as a discretization technique for
numerical inputs.
4.3. MAKER Algorithm with Referential Values
4.3.1. Evidence Acquisition
Suppose that we have a data set of N instances consisting of M input variables
and an output variable with K classes. The input 𝑥𝑛 = {𝑥𝑛,𝑙 | 𝑛 = 1, … , 𝑁; 𝑙 = 1, … , 𝑀}
can be either discrete or continuous. Each instance is classified into one of the class
memberships with numerical expressions in Θ = {𝑘 | 𝑘 = 1, … , 𝐾}, denoted by
𝑦𝑛 ∈ {1, … , 𝐾}, 𝑛 = 1, … , 𝑁. As mentioned above, the MAKER framework is
constructed with discrete functions, so numerical data needs to be transformed.
Such transformation makes the MAKER framework applicable to numerical data.
In this framework, referential value-based transformation is applied. The initial
referential values can be set based on expert knowledge, random rules without
prior knowledge, or common sense, and afterwards values can be learned from
input-output data (Xu et al., 2017). Referential values include minimum and
maximum values of an input variable and the values between them. In addition,
the number of referential values of an input variable can differ from that of other
input variables; the framework is flexible in this respect.
An input value 𝑥𝑛,𝑙 of an input variable, the corresponding output of which
belongs to class 𝑘, is transformed as denoted in Equation (4.7). 𝐴^𝑙_𝑖 is the 𝑖th
referential value of the 𝑙th input variable, while 𝑎^𝑘_{𝑛,𝑙,𝑖} represents the degree
to which the 𝑛th input value of the 𝑙th input variable (i.e. 𝑥𝑛,𝑙) belongs to
referential value 𝐴^𝑙_𝑖 or, in other words, how close 𝑥𝑛,𝑙 is to 𝐴^𝑙_𝑖.

𝑆𝑙(𝑥𝑛,𝑙) = {(𝐴^𝑙_𝑖, 𝑎^𝑘_{𝑛,𝑙,𝑖}); 𝑖 = 1, … , 𝐼𝑙}, where 𝐼𝑙 is the number of referential
values of input variable 𝑙, and

𝑎^𝑘_{𝑛,𝑙,𝑖} = (𝐴^𝑙_{𝑖+1} − 𝑥𝑛,𝑙) / (𝐴^𝑙_{𝑖+1} − 𝐴^𝑙_𝑖), 𝑎^𝑘_{𝑛,𝑙,𝑖+1} = 1 − 𝑎^𝑘_{𝑛,𝑙,𝑖}
for 𝐴^𝑙_𝑖 ≤ 𝑥𝑛,𝑙 ≤ 𝐴^𝑙_{𝑖+1}, and 𝑎^𝑘_{𝑛,𝑙,𝑖′} = 0 for 𝑖′ ≠ 𝑖, 𝑖 + 1 (4.7)
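As an illustration, the transformation in Equation (4.7) can be sketched in Python. The function name `transform` and the list-based representation of the referential values are assumptions made for exposition, not part of the framework itself.

```python
def transform(x, refs):
    """Illustrative sketch of Equation (4.7): distribute an input value x
    linearly between the two adjacent referential values that bracket it."""
    if x <= refs[0]:
        return [1.0] + [0.0] * (len(refs) - 1)
    if x >= refs[-1]:
        return [0.0] * (len(refs) - 1) + [1.0]
    degrees = [0.0] * len(refs)
    for i in range(len(refs) - 1):
        if refs[i] <= x <= refs[i + 1]:
            degrees[i] = (refs[i + 1] - x) / (refs[i + 1] - refs[i])
            degrees[i + 1] = 1.0 - degrees[i]
            break
    return degrees

# An input of 80 against referential values {0, 50, 100}:
print(transform(80, [0, 50, 100]))  # [0.0, 0.4, 0.6]
```

The similarity degrees always sum to one, so no information is lost in the discretisation.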
After all the input values of an input variable are transformed, the next step is to
aggregate all the belief distributions for referential values under different class
memberships. In this way, the frequencies of the referential values of an input
variable under different classes can be obtained, and based on this calculation, the
likelihood 𝑐𝑘,𝑙,𝑖 and the basic probability 𝑝𝑘,𝑙,𝑖 can be estimated using Equations
(4.8) to (4.10).
The aggregated similarity degrees form a 𝐾 × 𝐼𝑙 frequency matrix [𝑎^𝑘_{𝑙,𝑖}], where

𝑎^𝑘_{𝑙,𝑖} = ∑_{𝑛=1}^{𝑁} 𝑎^𝑘_{𝑛,𝑙,𝑖} and
∑_{𝑘=1}^{𝐾} ∑_{𝑖=1}^{𝐼𝑙} 𝑎^𝑘_{𝑙,𝑖} = ∑_{𝑖=1}^{𝐼𝑙} ∑_{𝑘=1}^{𝐾} 𝑎^𝑘_{𝑙,𝑖} = 𝑁 (4.8)
The likelihoods are obtained by normalising each class row of the frequency matrix:

𝑐^𝑘_{𝑙,𝑖} = 𝑎^𝑘_{𝑙,𝑖} / ∑_{𝑖′=1}^{𝐼𝑙} 𝑎^𝑘_{𝑙,𝑖′} if ∑_{𝑖′=1}^{𝐼𝑙} 𝑎^𝑘_{𝑙,𝑖′} ≠ 0, and 𝑐^𝑘_{𝑙,𝑖} = 0 otherwise,
for 𝑘 = 1, … , 𝐾 and 𝑖 = 1, … , 𝐼𝑙, forming a 𝐾 × 𝐼𝑙 likelihood matrix [𝑐^𝑘_{𝑙,𝑖}] (4.9)

with ∑_{𝑖=1}^{𝐼𝑙} 𝑐^𝑘_{𝑙,𝑖} = 1 for all 𝑘 ⊆ Θ and
∑_{𝑘=1}^{𝐾} ∑_{𝑖=1}^{𝐼𝑙} 𝑐^𝑘_{𝑙,𝑖} = ∑_{𝑖=1}^{𝐼𝑙} ∑_{𝑘=1}^{𝐾} 𝑐^𝑘_{𝑙,𝑖} = 𝐾.

The basic probabilities are then obtained by normalising the likelihoods over the classes:

𝑝𝑘,𝑙,𝑖 = 𝑐^𝑘_{𝑙,𝑖} / ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑙,𝑖} if ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑙,𝑖} ≠ 0, and 𝑝𝑘,𝑙,𝑖 = 0 otherwise,
forming a 𝐾 × 𝐼𝑙 basic probability matrix [𝑝𝑘,𝑙,𝑖] with ∑_{𝑘=1}^{𝐾} 𝑝𝑘,𝑙,𝑖 = 1 (4.10)
Suppose that 𝑒𝑖𝑙(𝑘) is the 𝑖th referential value of the 𝑙th input variable directly
supporting to class 𝑘. 𝑐𝑙,𝑖𝑘 is the likelihood that the 𝑖th referential value of the 𝑙th input
variable is observed given that class 𝑘 is true. The basic probability 𝑝𝑘,𝑙,𝑖 is obtained
from normalised 𝑐𝑙,𝑖𝑘 . Each piece of evidence is profiled by a set of basic
probabilities as stated in Equation (4.11).
𝑒^𝑙_𝑖 = {(𝑒^𝑙_𝑖(𝑘), 𝑝𝑘,𝑙,𝑖), 𝑘 = 1, … , 𝐾 and ∑_{𝑘=1}^{𝐾} 𝑝𝑘,𝑙,𝑖 = 1} (4.11)
Through this technique, numerical data can be profiled into a simple discrete
distribution function without losing information or overgeneralising.
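The evidence acquisition steps in Equations (4.8)–(4.10) can be summarised with a short, illustrative Python sketch. The function name and the nested-list representation of the matrices are assumptions made for exposition.

```python
def acquire_evidence(degrees, labels, K):
    """Aggregate similarity degrees per class (Eq. 4.8), normalise each
    class row into likelihoods (Eq. 4.9), then normalise each referential
    value column into basic probabilities (Eq. 4.10)."""
    I = len(degrees[0])
    freq = [[0.0] * I for _ in range(K)]
    for d, k in zip(degrees, labels):
        for i in range(I):
            freq[k][i] += d[i]
    lik = [[a / sum(row) if sum(row) else 0.0 for a in row] for row in freq]
    col = [sum(lik[k][i] for k in range(K)) for i in range(I)]
    prob = [[lik[k][i] / col[i] if col[i] else 0.0 for i in range(I)]
            for k in range(K)]
    return lik, prob

# Three transformed instances over two referential values, two classes:
lik, prob = acquire_evidence([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0, 1, 1], K=2)
```

Each row of `lik` sums to one across referential values, and each column of `prob` sums to one across classes, matching the constraints attached to Equations (4.9) and (4.10).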
4.3.2. Interdependence Index
A key characteristic introduced in the MAKER framework is its ability to measure
the interrelationship between a pair of evidence when multiple input variables are
considered. As such, the assumption of input independency in ER can be relaxed
in this framework. The measurement is based on the statistical interdependence
between a pair of evidence from different input variables. It can be acquired from
a joint likelihood function.
Suppose that we have multiple input variables denoted by a vector
𝑥𝑛 = {𝑥_{𝑛,𝑗1}, … , 𝑥_{𝑛,𝑗𝑙}, … , 𝑥_{𝑛,𝑗𝑚}}, with 𝑛 = 1, … , 𝑁; 𝑗𝑙 = 1, … , 𝑀; 𝑙 = 1, … , 𝑚;
𝑚 = 2, … , 𝑀; 𝑖𝑙 = 1, … , 𝐼_{𝑗𝑙}. The input is then transformed for the combination of
referential values 𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚} as follows:

𝑆_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚}(𝑥𝑛) = {(𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚}, 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚})},
𝑛 = 1, … , 𝑁; 𝑙 = 1, … , 𝑚; 𝑚 = 2, … , 𝑀; 𝑖𝑙 = 1, … , 𝐼_{𝑗𝑙},

where 𝐼_{𝑗𝑙} is the total number of referential values of input variable 𝑗𝑙,

𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚} = {𝐴^{𝑗1}_{𝑖1}, … , 𝐴^{𝑗𝑙}_{𝑖𝑙}, … , 𝐴^{𝑗𝑚}_{𝑖𝑚}},

𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑎^𝑘_{𝑛,𝑗1,𝑖1} × … × 𝑎^𝑘_{𝑛,𝑗𝑙,𝑖𝑙} × … × 𝑎^𝑘_{𝑛,𝑗𝑚,𝑖𝑚},

and ∑_{𝑘=1}^{𝐾} ∑_{(𝑗1…𝑗𝑙…𝑗𝑚)∈𝑇} ∑_{(𝑖1…𝑖𝑙…𝑖𝑚)∈𝑅} 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 1 (4.12)
The similarity degree 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} represents the degree to which
an input value of 𝑥𝑛 matches the combination of referential values
𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚}. Suppose that {(𝑗1…𝑗𝑙…𝑗𝑚), 𝑗𝑙 = 1, … , 𝑀; 𝑙 = 1, … , 𝑚} ∈ 𝑇 and
{(𝑖1…𝑖𝑙…𝑖𝑚), 𝑖𝑙 = 1, … , 𝐼_{𝑗𝑙}; 𝑙 = 1, … , 𝑚} ∈ 𝑅. We can then calculate the likelihood
𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} that the combination of referential values
𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚} occurs given that class 𝑘 is true, under the conditions stated in
Equation (4.13). The belief degree representing the extent to which the combination of
referential values directly points to class 𝑘 can be calculated according to Equation (4.14).

𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = ∑_{𝑛=1}^{𝑁} 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚},

∑_{𝑘=1}^{𝐾} ∑_{𝑇} ∑_{𝑅} 𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑁,

𝛿𝑘 = ∑_{𝑇} ∑_{𝑅} 𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚},

𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} / 𝛿𝑘 if 𝛿𝑘 ≠ 0, and 0 if 𝛿𝑘 = 0,

with ∑_{𝑇} ∑_{𝑅} 𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 1 for all 𝑘 ⊆ Θ and
∑_{𝑘=1}^{𝐾} ∑_{𝑇} ∑_{𝑅} 𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝐾 (4.13)
As a simple example, suppose that we have two pieces of evidence with referential
values as follows: {0, 50, 100} and {0, 1, 3}. An input with values 80 and 0.2 for the
first and second pieces of evidence can be transformed using Equation (4.7).
Hence, we can obtain 𝑆1 = {(𝐴^1_1, 0), (𝐴^1_2, 0.4), (𝐴^1_3, 0.6)} and
𝑆2 = {(𝐴^2_1, 0.8), (𝐴^2_2, 0.2), (𝐴^2_3, 0)}. The input is then transformed for each
combination of referential values using Equation (4.12), as stated below.

Table 4.1. An example of data transformation

Combination:  𝐴^{1,2}_{1,1} = {0,0}, 𝐴^{1,2}_{1,2} = {0,1}, 𝐴^{1,2}_{1,3} = {0,3},
𝐴^{1,2}_{2,1} = {50,0}, 𝐴^{1,2}_{2,2} = {50,1}, 𝐴^{1,2}_{2,3} = {50,3},
𝐴^{1,2}_{3,1} = {100,0}, 𝐴^{1,2}_{3,2} = {100,1}, 𝐴^{1,2}_{3,3} = {100,3}
Joint degree: 0, 0, 0, 0.32, 0.08, 0, 0.48, 0.12, 0 (total = 1)
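The joint similarity degrees in Table 4.1 can be reproduced with a few lines of Python (illustrative only):

```python
# Individual similarity degrees from the example above (Equation 4.7).
s1 = [0.0, 0.4, 0.6]   # input 80 against referential values {0, 50, 100}
s2 = [0.8, 0.2, 0.0]   # input 0.2 against referential values {0, 1, 3}

# Joint degrees are products of the individual degrees (Equation 4.12).
joint = [[a * b for b in s2] for a in s1]

print(round(joint[1][0], 2), round(joint[1][1], 2),
      round(joint[2][0], 2), round(joint[2][1], 2))  # 0.32 0.08 0.48 0.12
print(round(sum(sum(row) for row in joint), 6))      # 1.0
```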
The interdependence index introduced in the MAKER framework measures the
interdependence between a pair of evidence. The joint basic probability in
Equation (4.14) is obtained by normalising the joint likelihoods of Equation (4.13)
over the classes:

𝑝_{𝑘,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} / ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚}
if ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} ≠ 0, and 0 otherwise, with
∑_{𝑘=1}^{𝐾} 𝑝_{𝑘,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 1 (4.14)

Suppose that 𝑖1 and 𝑖2 index the referential values of two pieces of evidence
acquired from input variables 𝑗1 and 𝑗2, respectively. The interdependence index
between the two pieces of evidence, denoted by 𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2}, can be defined by
Equation (4.15). 𝑝_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} represents the degree to which the two pieces of
evidence jointly support class 𝑘 and can be obtained by using Equations (4.13) and
(4.14). 𝑝_{𝑘,𝑗1,𝑖1} and 𝑝_{𝑘,𝑗2,𝑖2} are the belief degrees of evidence elements
𝑒^{𝑗1}_{𝑖1}(𝑘) and 𝑒^{𝑗2}_{𝑖2}(𝑘), respectively, pointing to class 𝑘 and are obtained
using Equations (4.7)–(4.11).

𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 0 if 𝑝_{𝑘,𝑗1,𝑖1} = 0 or 𝑝_{𝑘,𝑗2,𝑖2} = 0, and
𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 𝑝_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} / (𝑝_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗2,𝑖2}) otherwise; in particular,
𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 0 if 𝑒_{𝑗1,𝑖1}(𝑘) and 𝑒_{𝑗2,𝑖2}(𝑘) are disjoint and
𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 1 if they are independent (4.15)
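As a minimal sketch of Equation (4.15), the interdependence index compares the joint basic probability of a pair of evidential elements with the product of their marginal basic probabilities. The function name is illustrative.

```python
def interdependence_index(p_joint, p1, p2):
    """Equation (4.15): 0 if either marginal support is zero; otherwise the
    ratio of joint support to the product of the individual supports."""
    if p1 == 0.0 or p2 == 0.0:
        return 0.0
    return p_joint / (p1 * p2)

# Independent elements: joint support equals the product, so the index is 1.
print(interdependence_index(0.25, 0.5, 0.5))  # 1.0
```

A value above 1 indicates that the two elements support the class jointly more often than independence would predict; a value below 1 indicates the opposite.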
4.3.3. Evidence Combination
In the MAKER framework, every referential value, referred to as an evidential element
in the evidence space, is directly connected to each class membership in the state
space. To obtain the aggregate level of support for a class membership, including
independent support from each element and joint support from a combination of
referential values with their interdependence considered, all supports for a class
membership from all evidential elements are exhaustively combined.

Let 𝑝_{𝑘,𝑒(2)} be the combined degree of belief with which two pieces of evidence
𝑒^{𝑗1}_{𝑖1}(𝑘) and 𝑒^{𝑗2}_{𝑖2}(𝑘) jointly support class 𝑘. With the interdependence
between the two pieces of evidence considered, 𝑝_{𝑘,𝑒(2)} can be calculated using
Equation (4.16). 𝑟_{𝑗1,𝑖1} and 𝑟_{𝑗2,𝑖2} are the reliabilities of evidence elements
𝑒^{𝑗1}_{𝑖1} and 𝑒^{𝑗2}_{𝑖2}, respectively. 𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} is a nonnegative
coefficient which represents the degree of joint support for class 𝑘 from both
𝑒^{𝑗1}_{𝑖1}(𝑘) and 𝑒^{𝑗2}_{𝑖2}(𝑘) relative to their individual support.
𝛼_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} is the interdependence index between evidence elements
𝑒^{𝑗1}_{𝑖1} and 𝑒^{𝑗2}_{𝑖2}, which can be obtained using Equation (4.15). The model
parameters 𝜔_{𝑘,𝑗1,𝑖1}, 𝜔_{𝑘,𝑗2,𝑖2}, 𝑟_{𝑘,𝑗1,𝑖1}, 𝑟_{𝑘,𝑗2,𝑖2}, and 𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)}
can be trained. In this study, 𝜔_{𝑘,𝑗1,𝑖1} = 𝜔_{𝑘,𝑗2,𝑖2} = 1; as such,
𝑤_{𝑘,𝑗1,𝑖1} = 𝑟_{𝑘,𝑗1,𝑖1} and 𝑤_{𝑘,𝑗2,𝑖2} = 𝑟_{𝑘,𝑗2,𝑖2}. In addition,
𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} is set to 1.
𝑝_{𝑘,𝑒(2)} = 0 if 𝑘 = ∅, and 𝑝_{𝑘,𝑒(2)} = 𝑚𝑘 / ∑_{𝐶⊆Θ} 𝑚𝐶 if 𝑘 ⊆ Θ (4.16)

𝑚𝑘 = [(1 − 𝑟_{𝑗2,𝑖2}) 𝑚_{𝑘,𝑗1,𝑖1} + (1 − 𝑟_{𝑗1,𝑖1}) 𝑚_{𝑘,𝑗2,𝑖2}]
+ 𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} 𝛼_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} 𝑚_{𝑘,𝑗1,𝑖1} 𝑚_{𝑘,𝑗2,𝑖2},

where 𝑚_{𝑘,𝑗1,𝑖1} = 𝑤_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗1,𝑖1} = 𝜔_{𝑘,𝑗1,𝑖1} 𝑟_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗1,𝑖1},
𝑚_{𝑘,𝑗2,𝑖2} = 𝑤_{𝑘,𝑗2,𝑖2} 𝑝_{𝑘,𝑗2,𝑖2} = 𝜔_{𝑘,𝑗2,𝑖2} 𝑟_{𝑘,𝑗2,𝑖2} 𝑝_{𝑘,𝑗2,𝑖2},

𝑟_{𝑗1,𝑖1} = ∑_{𝑘=1}^{𝐾} 𝑟_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗1,𝑖1} and 𝑟_{𝑗2,𝑖2} = ∑_{𝑘=1}^{𝐾} 𝑟_{𝑘,𝑗2,𝑖2} 𝑝_{𝑘,𝑗2,𝑖2} (4.17)

𝑟_{𝑘,𝑒(2)} = 0 if 𝑘 = ∅; 𝑟_{𝑘,𝑒(2)} = 𝑚_{𝑘,𝑒(2)} / 𝑝_{𝑘,𝑒(2)} if 𝑘 ⊆ Θ and 𝑘 ≠ ∅;
𝑟_{𝑘,𝑒(2)} = 1 − 𝑚_{Θ,𝑒(2)} if 𝑘 = 𝑃(Θ) (4.18)
Equations (4.16)–(4.18) are adapted from the conjunctive MAKER rule of Yang
and Xu (2017), as presented in Equations (4.19) and (4.20).
𝑝(𝜃) = 0 if 𝜃 = ∅, and 𝑝(𝜃) = 𝑚𝜃 / ∑_{𝐶⊆Θ} 𝑚𝐶 if 𝜃 ⊆ Θ (4.19)

𝑚𝜃 = [(1 − 𝑟_{𝑗,𝑚}) 𝑚_{𝜃,𝑖,𝑙} + (1 − 𝑟_{𝑖,𝑙}) 𝑚_{𝜃,𝑗,𝑚}]
+ ∑_{𝐴∩𝐵=𝜃} 𝛾_{𝐴,𝐵,𝑖,𝑗} 𝛼_{𝐴,𝐵,𝑖,𝑗} 𝑚_{𝐴,𝑖,𝑙} 𝑚_{𝐵,𝑗,𝑚},

where 𝑚_{𝜃,𝑖,𝑙} = 𝑝(𝑠_{𝑖,𝑙}(𝜃)) = 𝜔_{𝑖,𝑙} 𝑝𝑙(𝑠_{𝑖,𝑙}(𝜃)) = 𝜔_{𝑖,𝑙} 𝑝𝑙(𝜃 | 𝑒_{𝑖,𝑙}(𝜃)) 𝑝𝑙(𝑒_{𝑖,𝑙}(𝜃))
= 𝑤_{𝜃,𝑖,𝑙} 𝑝𝑙(𝑒_{𝑖,𝑙}(𝜃)) and 𝑟_{𝑖,𝑙} = ∑_{𝜃⊆Θ} 𝑟_{𝜃,𝑖,𝑙} 𝑝(𝑒_{𝑖,𝑙}(𝜃)) (4.20)
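The conjunctive combination in Equations (4.16)–(4.17) can be sketched for two pieces of evidence over singleton classes, assuming 𝜔 = 𝛾 = 1 as in this study (so that 𝑤 = 𝑟). The function and variable names are illustrative, not part of the framework.

```python
def maker_combine(p1, p2, r1, r2, alpha):
    """Combine two pieces of evidence (basic probabilities p1, p2 with
    class-specific reliabilities r1, r2 and interdependence indices alpha)
    following Equations (4.16)-(4.17) with omega = gamma = 1."""
    K = len(p1)
    R1 = sum(r1[k] * p1[k] for k in range(K))   # overall reliability, Eq. (4.17)
    R2 = sum(r2[k] * p2[k] for k in range(K))
    m1 = [r1[k] * p1[k] for k in range(K)]      # m = w * p with w = r
    m2 = [r2[k] * p2[k] for k in range(K)]
    m = [(1 - R2) * m1[k] + (1 - R1) * m2[k]
         + alpha[k] * m1[k] * m2[k] for k in range(K)]
    total = sum(m)
    return [mk / total for mk in m]             # normalisation, Eq. (4.16)

# Two fully reliable, independent pieces of evidence (alpha = 1):
out = maker_combine([0.8, 0.2], [0.6, 0.4], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

With full reliability the bracketed terms vanish and the rule reduces to a normalised product of supports; lower reliabilities shift weight back towards the individual pieces of evidence.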
4.4. Belief Rule Base
The traditional IF-THEN rules can be expressed as 𝑅𝑘 in Equation (4.21), where
𝐴^𝑘_𝑖 (𝑖 = 1, … , 𝑇𝑘) is the referential value or grade of the 𝑖th attribute in the 𝑘th rule
and 𝑇𝑘 is the number of attributes used in the 𝑘th rule. The symbol ∧ denotes an 'AND'
relationship between the attributes. 𝐷𝑘 is the consequent of the 𝑘th rule being
activated. Equation (4.21) indicates that an input vector with those
corresponding referential values points directly to the outcome 𝐷𝑘 with 100%
probability. This simple form does not consider the relative importance of each
attribute, the relative importance of rules in the rule base, or the distribution of
consequents. The traditional IF-THEN rules can be extended by including
attribute weights, rule weights, and belief degrees for all possible consequents,
as expressed in Equation (4.22) (Yang et al., 2006).
if 𝐴^𝑘_1 ∧ … ∧ 𝐴^𝑘_𝑖 ∧ … ∧ 𝐴^𝑘_{𝑇𝑘} then 𝐷𝑘 (4.21)

if 𝐴^𝑘_1 ∧ 𝐴^𝑘_2 ∧ … ∧ 𝐴^𝑘_{𝑇𝑘} then {(𝐷1, 𝛽^𝑘_1), (𝐷2, 𝛽^𝑘_2), … , (𝐷𝑁, 𝛽^𝑘_𝑁)}, 𝑘 ∈ {1, … , 𝐿},
where 𝛽^𝑘_𝑗 ≥ 0 and ∑_{𝑗=1}^{𝑁} 𝛽^𝑘_𝑗 ≤ 1, with a rule weight 𝜃𝑘 and attribute weights
𝛿_{𝑘,1}, … , 𝛿_{𝑘,𝑖}, … , 𝛿_{𝑘,𝑇𝑘} (4.22)
𝐴^𝑘_𝑖 (𝑖 = 1, … , 𝑇𝑘) indicates the referential value or grade of the 𝑖th attribute in the
𝑘th rule, where 𝑇𝑘 is the number of attributes used in the 𝑘th rule, and 𝐿 is the number
of rules in the rule base. According to Yang et al. (2006), if an input satisfies the
packet antecedent attributes 𝐴^𝑘 = (𝐴^𝑘_1, 𝐴^𝑘_2, … , 𝐴^𝑘_{𝑇𝑘}), the rule 𝑅𝑘 is activated
and points to the consequent 𝐷𝑗 with degree of belief 𝛽^𝑘_𝑗 (𝑗 = 1, … , 𝑁), where 𝑁 is
the number of possible consequents. Belief degrees are expressed as the probability
with which 𝐷𝑗 is likely to occur. The total belief degree of a rule can be less than or
equal to one, which leaves room to handle missing data or unknown consequents. 𝜃𝑘
is the weight of the 𝑘th rule, acting as the relative importance of the rule compared
with other rules in the rule base. 𝛿_{𝑘,𝑖} is the attribute weight of the 𝑖th attribute in
the 𝑘th rule, indicating the relative importance of the attribute among all the
attributes used in the 𝑘th rule. In Equation (4.22), a rule weight, attribute weights,
and consequent belief degrees are embedded in each rule in the rule base. If a rule is
presented in the format of Equation (4.22), it is referred to as a belief rule. A
collection of belief rules is called a belief rule base (BRB).
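A belief rule in the format of Equation (4.22) can be represented directly as a small data structure. This is a hypothetical sketch; the class and field names are illustrative, not part of the BRB literature.

```python
from dataclasses import dataclass

@dataclass
class BeliefRule:
    antecedents: tuple        # packet antecedent (A_1^k, ..., A_Tk^k)
    beliefs: dict             # consequent D_j -> belief degree beta_j^k
    rule_weight: float = 1.0  # theta_k
    attr_weights: tuple = ()  # (delta_k1, ..., delta_kTk)

# A rule whose belief degrees sum to less than one leaves room for an
# unknown consequent, as discussed above.
rule = BeliefRule(("low", "high"), {"class 1": 0.5, "class 2": 0.25})
print(sum(rule.beliefs.values()))  # 0.75
```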
4.5. The Decomposition of Input Variables
In this study, the belief rule base inference of the rule-based inferential modelling
and prediction is transparent and interpretable. However, Xu et al. (2017) stated that
a BRB suffers from high combinatorial complexity in the number of referential values
of the input variables. The size of a belief rule base increases exponentially as the
number of input variables and the number of referential values of each input variable
increase (Yang and Xu, 2017). Consequently, the number of parameters required for
training increases exponentially (Yang and Xu, 2017). For example, if we have six
input variables with three referential values each, we will have 3^6 = 729 belief
rules. The rule-based models will be extremely complex.
Furthermore, the rule-based modelling and prediction in this study is developed
based on the MAKER framework. In this framework, evidence is acquired through
statistical analysis directly from data. The interdependence index is measured using
the statistical interdependence between a pair of evidence. It can be acquired from a
joint likelihood function, as presented in Equations (4.13)–(4.15). For a discrete joint
likelihood function, the likelihood is calculated on the basis of the frequencies of
the combinations of referential values for each class membership. This principle is
widely used as a basis for measuring the interdependencies between two variables
with nominal or ordinal data – for example, in a chi-square test for independence
based on a contingency table (Bishop, 2007). Each cell of the contingency table
contains the cases that match a certain combination of categories. It is worth
noting that ‘category’ has the same meaning as ‘referential value’ in this study.
The minimum requirement for a presumably sufficiently large sample size is five
cases per cell. According to Bishop (2007), there are two types of zero entries in a
contingency table: 1) sampling zeros that may occur for cells that are realistic
combinations of categories with relatively small samples when compared to a large
number of cells, or 2) structural zeros, which occur because it is not possible to
collect observations for certain combinations of categories – that is, certain
combinations of referential values of input variables. A sampling zero occurs when no
observation is found for a certain combination of variables, but it is probable that
the combination exists. Meanwhile, structural zeros are attached to unrealistic
combinations due to features of the data or the data structure (Bishop, 2007).
Strategies to deal with combinations of categories that violate the minimum sample
size requirement are explained below.
There are several ways to deal with a sparse contingency table. The common
practice is to collapse the categories to obtain a smaller and less sparse table with
fewer categories (Kateri and Iliopoulos, 2010). Two categories are combined on
the basis of homogeneity and structure (Kateri and Iliopoulos, 2010). These
collapsed categories are also considered to be theoretically or practically
equivalent. However, this practice can produce misleading statistical inferences in
which significant associations are found in the collapsed table when there are no
such associations in the original table (Bishop, 2007; Fienberg and Rinaldo, 2007).
The practice also potentially distorts the modelling process and leads to a loss of
some valuable data or information.
Another approach is to add small positive quantities to every cell in the table. This
practice was discussed in Fienberg and Rinaldo (2007) with numerical examples,
and it is evident that it can result in misleading and incorrect inferences. The
simplest – yet expensive – way to remedy a sparse contingency table is to collect
more samples (Bishop, 2007). In the case of a table with a zero denominator,
another strategy is to arbitrarily define zero divided by zero to be zero (Fienberg,
1980). Careful justification is essential when taking any of the actions above
(Fienberg and Rinaldo, 2007). The current study mainly uses numerical data, and
hence a referential value-based discretization method is applied. Since the
referential values are trained, the justification explained above, which is very
complex if not impossible, must be done as part of the optimisation process.
As stated above, the models under the rule-based inferential modelling and
prediction become extremely complex as the number of input variables and the
number of referential values of each input variable increase (Yang and Xu, 2017).
In addition, under the MAKER framework with sparse matrices, statistical analysis
is difficult when many cells contain fewer cases than the statistically required
sample size. Therefore, this study proposes a hierarchical rule-based inferential
modelling and prediction approach based on the MAKER framework. The input variables
are split into n groups; as such, the number of rules decreases, and violations of
statistical requirements can be avoided. Furthermore, the need for careful
justification when optimising the trained referential values in sparse matrices, a
need which increases computational complexity, can be reduced.
Measures for selecting the best split consider the strength of the relationship
between the input variables and the outputs, either a linear or nonlinear
relationship. The input variable with the strongest relationship to the output makes
the largest contribution to explaining variances in the output. The strongest
relationship indicates the variable that retains the most information and is the most
significant in the prediction model. Steps for grouping evidence or input variables
are listed below.
Step 1: With the full dataset, we sort the input variables based on their estimated
importance in the prediction model by measuring their linear or non-linear
relationship to the outputs.

Step 2: We set the most important variable as the initial member of the first evidence
group. We add the next most important variable to this group if the statistical
requirement of sample size per cell is met. Otherwise, we set this variable as the
initial member of the second group. We then move on to the third variable and add
it to the first evidence group if the minimum sample size per cell is met. If not, it
can be added to the second group or put into a new evidence group if the
statistical requirement of sample size per cell is violated. This step is repeated until
evidence groups whose members have joint frequency matrices with at least the
minimum sample size per cell, except for structural zeros, are formed.
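The grouping steps above can be sketched as a greedy procedure; `importance` and `cell_ok` are stand-ins for the relationship measure and the sample-size-per-cell check described in the text.

```python
def group_variables(variables, importance, cell_ok):
    """Step 1: rank variables by importance. Step 2: greedily add each
    variable to the first existing group whose joint frequency matrix still
    meets the minimum sample size per cell; otherwise open a new group."""
    ordered = sorted(variables, key=importance, reverse=True)
    groups = []
    for v in ordered:
        for g in groups:
            if cell_ok(g + [v]):
                g.append(v)
                break
        else:
            groups.append([v])
    return groups

# Toy check: if no group may hold more than two variables,
# five variables split into three groups.
groups = group_variables(list(range(5)), lambda v: v, lambda g: len(g) <= 2)
print(groups)  # [[4, 3], [2, 1], [0]]
```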
We apply the MAKER framework to each group, and hence each group generates
its output based on the input variables as described in Figure 4.1. We provide two
ways to combine the MAKER-generated outputs of two or more evidence groups.
The first method is to combine them according to a MAKER rule. Each evidence
group makes an inference based on the input of variables within the group denoted
by 𝑝𝑔(𝑠)(𝜃), 𝑔 = 1,… , 𝐺, 𝑠 = 1,… , 𝑆, where G is the total number of evidence groups
formed, and S is the total number of observations or samples. An ER rule has been
developed for combining evidence while considering weight and reliability (Yang
and Xu, 2013). To combine pairs of MAKER-generated outputs, we need their
weights. By using Equation (4.18), we can obtain the weights of all evidence
groups. Hence, Equation (4.16) can be recursively applied to obtain the whole
system-generated output for each input vector 𝑥𝑠. The parameters 𝛾𝐴,𝐵,𝑖,𝑗 and 𝛼𝐴,𝐵,𝑖,𝑗
in Equation (4.16) can be trained. In this study, 𝛾𝐴,𝐵,𝑖,𝑗 and 𝛼𝐴,𝐵,𝑖,𝑗
of this part are set to 1. This approach is called the MAKER-ER-based model.
The second approach to combining evidence groups is to use a BRB. As stated in
Section 4.4, generating a BRB basically requires finding all possibilities for an IF-
THEN rule. If the consequent 𝜃 is supported by all evidence groups, the belief
degree of the consequent 𝜃 is assumed to be 1. Otherwise, if none of the groups
supports the consequent 𝜃, its belief degree is assumed to be 0. If the consequent
𝜃 is supported by one or more evidence groups while the remaining evidence
supports its negation, the belief degree of the consequent 𝜃 is logically between 0
and 1. In this way, we can generate a BRB in the form of Equations (4.21) and
(4.22). In this setting, the packet antecedent of a belief rule, written as
𝐴^𝑘_1 ∧ 𝐴^𝑘_2 ∧ … ∧ 𝐴^𝑘_{𝑇𝑘}, should be read as 'if the groups of evidence point to
class 𝑘'. Therefore, the number of antecedents in this BRB equals the number of
evidence groups in the system. Furthermore, {(𝐷1, 𝛽^𝑘_1), (𝐷2, 𝛽^𝑘_2), … , (𝐷𝑁, 𝛽^𝑘_𝑁)},
𝑘 ∈ {1, … , 𝐿}, should be read as 'the probability that an observation belongs to 𝜃'.
[Figure 4.1 depicts the two-stage hierarchical training process: in Stage 1, the input
𝑥𝑠 feeds 𝐺 MAKER-based systems, each generating an output 𝑝𝑔(𝜃); in Stage 2, an
ER- or BRB-based system combines these into the system-generated output 𝑝(𝜃), which
is compared with the observed output 𝑝̂(𝑠)(𝜃) of the real system to train the model
parameters 𝑃.]

Figure 4.1. Hierarchical MAKER-based training process
𝑝𝑔(𝑠)(𝜃) is the MAKER-generated output of evidence group 𝑔 (𝑔 = 1,… , 𝐺)
corresponding to the degree to which the evidence group supports the consequent
𝜃 based on an input vector 𝑥𝑠. The outputs of all evidence groups become an input
for the next system, which is a BRB system as depicted in Figure 4.1. Hence, 𝑝𝑔(𝑠)
is a numerical input, and it can be transformed as shown in Equation (4.7). Let
𝑝𝑔(𝑠)(𝜃) act as a similarity degree, indicating the degree to which a MAKER-
generated output belongs to 𝜃. Using Equation (4.12), we can obtain the degree
of joint similarity between the outputs generated by each group of evidence and
the combination of antecedents of the BRB (see Sections 5.5.7 and 6.4.7 for
examples), and these values will activate the associated belief rules.
Subsequently, we can apply Equations (3.11) and (3.13) to combine the belief
degrees of the activated belief rules to obtain the system-generated output, denoted
by 𝑝(𝜃). In this study, all rule weights (𝜃𝑘) and attribute weights (𝛿𝑘,𝑖) are set to be
equal, and only the consequent belief degrees (𝛽^𝑘_𝑗) are trained. We can train the
consequent belief degrees along with other model parameters using historical
data. As such, the gap between the observed outputs (𝑝̂(𝑠)(𝜃)) and the system-
generated outputs (𝑝(𝜃)) is minimised, as depicted in Figure 4.1. This approach is
called the MAKER-BRB-based model.
4.6. Parameter Learning
As stated in Yang and Xu (2017) and Yang et al. (2007), the MAKER parameters –
that is, 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, and 𝛾𝐴,𝐵,𝑖,𝑗 – and the BRB-based system
parameters at the top of the hierarchy – that is, 𝜃𝑘, 𝛿𝑖,𝑘, and 𝛽𝑗,𝑘 – can be trained
using historical data.

For the purpose of parameter learning, a general least squares optimisation model
is established, as shown in Equation (4.23) for the MAKER-ER-based system and
Equation (4.24) for the MAKER-BRB-based system. Once we obtain the trained
parameters, we can use them to predict system outputs from given system inputs.
In Equations (4.23) and (4.24), 𝑝̂(𝑠)(𝜃) is the probability that the consequent 𝜃 is
true given the 𝑠th observation. The objective function measures the MSE between the
system-generated outputs and the observed outputs, as depicted in Figure 4.1,
bringing 𝑝(𝜃) as close as possible to 𝑝̂(𝑠)(𝜃). 𝛺 defines the feasible space of the
parameters – for example, 0 ≤ 𝛽𝑗,𝑘 ≤ 1, meaning that consequent belief degrees
must be nonnegative and less than or equal to 1. 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, and 𝛾𝜃,𝑖,𝑗
are MAKER parameters, while 𝜃𝑘, 𝛿𝑖,𝑘, and 𝛽𝑗,𝑘 are BRB parameters of the top
hierarchy.
min 𝛿 = (1/2𝑆) ∑_{𝑠=1}^{𝑆} ∑_{𝜃⊆Θ} (𝑝(𝜃) − 𝑝̂(𝑠)(𝜃))²
subject to 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, 𝛾𝜃,𝑖,𝑗, 𝜃𝑘, 𝛿𝑖,𝑘, 𝛽𝑗,𝑘 ∈ 𝛺 and
∑_{𝑗=1}^{𝑁} 𝛽^𝑘_𝑗 = 1, where 𝑁 is the number of consequents in the BRB of the top
hierarchy (4.24)
For the optimisation tool, we utilise sequential least squares programming provided
in the SciPy package in Python (Pedregosa et al., 2011). It is designed to minimise
a function of several variables with bounds, equality, and inequality constraints.
The model parameters, including MAKER model parameters of all evidence groups
and BRB parameters of the top hierarchy, can be optimised simultaneously to
minimise the function in Equation (4.23) or Equation (4.24). Based on the
explanations in Sections 3.7 and 4.3, Figure 4.2 explains all the steps required for
a hierarchical rule-based modelling and prediction.
min 𝛿 = (1/2𝑆) ∑_{𝑠=1}^{𝑆} ∑_{𝜃⊆Θ} (𝑝(𝜃) − 𝑝̂(𝑠)(𝜃))²
subject to 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, 𝛾𝜃,𝑖,𝑗 ∈ 𝛺 (4.23)
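The learning setup can be illustrated with SciPy's SLSQP optimiser on a toy objective. The objective here is a stand-in for the full MAKER/BRB model, and the variable names are illustrative: it shows only how bounds and an equality constraint (belief degrees summing to one) are expressed.

```python
import numpy as np
from scipy.optimize import minimize

observed = np.array([0.7, 0.3])  # observed output for one observation

def mse(beta):
    # Least squares objective in the spirit of Equations (4.23)-(4.24)
    return 0.5 * np.sum((beta - observed) ** 2)

result = minimize(
    mse,
    x0=np.array([0.5, 0.5]),
    method="SLSQP",
    bounds=[(0.0, 1.0), (0.0, 1.0)],                       # 0 <= beta <= 1
    constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
)
print(np.round(result.x, 3))
```

In the actual models, the decision vector would stack all MAKER parameters of every evidence group together with the top-level BRB parameters, so that they are optimised simultaneously.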
[Figure 4.2 summarises the full procedure. For each evidence group 1 to G, the
training data are used to: (1) acquire evidence using referential values, (2)
calculate interdependence indexes between pairs of evidence within the group, (3)
combine evidence based on the MAKER rule using weights (Yang and Xu, 2017), (4)
generate a belief rule base (Yang et al., 2006), and (5) predict the class membership
from the activated belief rules by maximum likelihood prediction. At the top of the
hierarchy, the group outputs are combined – either by generating a belief rule base
from the belief degrees of each consequent or by using the ER rule – and a final
maximum likelihood prediction is made. The grouping of evidence and the final
inference stage are the parts introduced by this study.]
Figure 4.2. A hierarchical rule-based inferential modelling and prediction based on MAKER framework for n groups of evidence
This study applied the MAKER framework – the work of Yang and Xu (2017) – and a
BRB – the work of Yang et al. (2006) – in a hierarchical structure so that the
proposed framework can deal with sparse matrices, maintain interpretability, and
reduce model complexity. This study introduced how the framework can deal with
complex numerical inputs by using referential values. In addition, it explained how
the outputs generated by groups of evidence can be combined to make a final
inference. In Figure 4.2, the novelty of this research can be seen in the red boxes.
4.7. A Comparative Analysis
In this section, the hierarchical rule-based inferential modelling and prediction
based on the MAKER framework is discussed analytically and graphically. First, we
compare the referential value-based discretization technique with other techniques
and highlight the advantages of using this approach for data transformation. Second,
we present the modelling and inference process of the MAKER-based framework in
dealing with sparse matrices. The proposed framework is evaluated in comparison with
other interpretable machine learning methods to emphasise its advantages. Third, we
compare the predictive power, model complexity, and computation time of MAKER,
BRB, and the hierarchical MAKER framework.
4.7.1. Referential Value-based Discretization Technique
Data, in general, can be divided into two main types: qualitative data and
quantitative data (Maimon and Rokach, 2005). Quantitative data is data that can
be measured and expressed numerically. It can be further divided into discrete and
continuous types (Maimon and Rokach, 2005). When required, we can use
discretization techniques to transform numerical data into qualitative data: for
example, in the application of a classification tree (Agre and Peev, 2002).
In machine learning, discretization is usually applied in a classification
context as a data pre-processing step. It is used to improve model accuracy,
especially for classification, and to accelerate the learning process for very
large datasets (Agre and Peev, 2002). In addition, some machine learning
algorithms work only with discrete data, and hence a discretization technique
is needed to make these algorithms applicable to real datasets (Agre and Peev,
2002).
Since the data used in this research is mainly numerical, a discretization technique
is required to develop an efficient and effective learning algorithm (Agre and Peev,
2002). There are some established discretization methods. These are divided into
1) supervised versus unsupervised, 2) global versus local, and 3) static (univariate)
versus dynamic (multivariate) techniques (Agre and Peev, 2002; Dougherty,
Kohavi, and Sahami, 1995). Supervised discretization techniques utilise class
information, whereas unsupervised ones do not. Global discretization occurs prior
to model development, whereas local methods are performed during the process
of model development. Dynamic methods search for the cut-off points of all
variables simultaneously, allowing the interdependencies between variables to
be captured, whereas static methods search for cut-off points for each
variable independently.
The simplest unsupervised discretization methods include equal width (EWB) and
equal frequency (EFB) discretization methods (Agre and Peev, 2002; Dougherty
et al., 1995). EWB discretization divides numerical data into 𝑘 equal-length (width)
intervals. In EFB discretization, the data is partitioned into 𝑘 intervals. Each interval
has roughly equal frequencies. The number of intervals, 𝑘, is a user-specified
parameter. The EWB algorithm determines the minimum and maximum values of
the variable and then calculates the width (𝑤) by dividing the range by 𝑘. In this
way, the variable can be discretized into 𝑘 intervals by the number pairs:
{[𝑑0, 𝑑0 + 𝑤], [𝑑0 + 𝑤, 𝑑0 + 2𝑤], … , [𝑑0 + (𝑘 − 1)𝑤, 𝑑𝑚]}, where 𝑑0 is the minimum
value of the variable, and 𝑑𝑚 = 𝑑0 + 𝑘𝑤 is the maximum value of the variable.
The EFB algorithm determines the minimum and maximum values of the variable,
sorts all the values in ascending order, and divides the range into 𝑘 intervals
in such a way that all intervals contain approximately equal frequencies of
values. These techniques are considered to be static methods, since
discretization is performed for each variable, with its own value of 𝑘,
independently of the other variables (Dougherty et al., 1995).
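As an illustration, both algorithms can be sketched in Python; the function names are ours, and `np.digitize` assigns each observation to the interval it falls into:

```python
import numpy as np

def equal_width_bins(x, k):
    """EWB: split the range of x into k equal-width intervals and
    return the interval index (0..k-1) of each observation."""
    d0, dm = np.min(x), np.max(x)
    w = (dm - d0) / k                      # interval width
    edges = d0 + w * np.arange(1, k)       # k-1 interior cut-off points
    return np.digitize(x, edges)

def equal_frequency_bins(x, k):
    """EFB: choose cut-off points so that each of the k intervals
    holds roughly the same number of observations."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(x, edges)

x = np.array([1.0, 2.0, 2.5, 3.0, 8.0, 9.0])
print(equal_width_bins(x, 2))       # width (9-1)/2 = 4, cut-off at 5
print(equal_frequency_bins(x, 2))   # cut-off at the median, 2.75
```

Note how the two methods place the cut-off differently for the same data: EWB by range, EFB by counts.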
In contrast with unsupervised discretization techniques, supervised
approaches take class information into account and are able to capture
class-variable interdependence (Agre and Peev, 2002). For example, the
entropy-based supervised discretization
(EBD) method proposed by Fayyad and Irani (1993) is a divisive hierarchical
clustering technique which utilises an entropy measure as a criterion for recursively
partitioning numerical data and a minimum description length (MDL) principle as a
stopping criterion. Each value of a variable can be a potential boundary that splits
the variable into two intervals. The partition boundary which minimises the entropy
function over all possible boundaries is chosen (Dougherty et al., 1995). The
information gain is the difference between the entropy of the interval before
and after splitting. A new boundary is recursively sought within each interval
produced in the previous steps, and this recursive process terminates when the
stopping criterion is reached.
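A minimal sketch of one recursion step of this approach, choosing the boundary that minimises the weighted class entropy of the two resulting intervals (the MDL stopping criterion is omitted and the function names are ours):

```python
import numpy as np

def entropy(y):
    """Class entropy of a set of labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_boundary(x, y):
    """One recursion step: evaluate every candidate boundary and
    return the one minimising the weighted entropy of the two
    resulting intervals."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_cut, best_e = None, float("inf")
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue          # no boundary between equal values
        e = (i * entropy(y[:i]) + (len(x) - i) * entropy(y[i:])) / len(x)
        if e < best_e:
            best_cut, best_e = (x[i] + x[i - 1]) / 2, e
    return best_cut, best_e

print(best_boundary([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1]))  # (5.0, 0.0)
```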
The ChiMerge system by Kerber (1992) is another type of supervised discretization
method based on statistical analysis. At its initial stage, each observation value is
placed into its own interval, and chi-square tests are performed to determine
whether adjacent intervals should be merged. A chi-square test is an
independence test based on an empirical measure of the expected frequencies of
the classes represented in each interval. Two adjacent intervals whose class
frequencies are statistically independent of the interval division are merged.
A chi-square threshold is predefined to determine the extent of the merging
process (Dougherty et al., 1995). This method is a supervised, global
discretization technique. StatDisc is another discretization
method using statistical tests to determine intervals (Richeldi and Rossotto, 1995).
This heuristic bottom-up method is similar to ChiMerge; however, StatDisc merges
𝑁 adjacent intervals at a time, while ChiMerge combines only two at a time
(Dougherty et al., 1995).
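The chi-square statistic that ChiMerge computes for a pair of adjacent intervals can be sketched as follows; the merging loop and threshold comparison are omitted, and the function name is ours:

```python
import numpy as np

def chi2_adjacent(freq_a, freq_b):
    """Chi-square statistic for a pair of adjacent intervals, where
    freq_a and freq_b are the class-frequency vectors of each
    interval. ChiMerge repeatedly merges the pair with the lowest
    statistic while it stays below a predefined threshold."""
    table = np.array([freq_a, freq_b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    expected[expected == 0] = 1e-9         # guard against empty classes
    return float(((table - expected) ** 2 / expected).sum())

print(chi2_adjacent([5, 5], [5, 5]))    # identical distributions -> 0.0
print(chi2_adjacent([10, 0], [0, 10]))  # perfectly separated -> 20.0
```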
Adaptive quantizers is a method that mixes supervised and unsupervised
discretization (Chan, Batur, and Srinivasan, 1991). Intervals are initially set based
on an unsupervised discretization method, such as a binary equal width interval
(Dougherty et al., 1995). A set of classification rules are then applied to the
discretized data. The interval with the lowest prediction accuracy is then split into
two partitions of equal width. This process is repeated until the termination
parameter is reached (Dougherty et al., 1995).
As explained previously, unsupervised discretization does not use class
information, although it is essential (Dougherty et al., 1995). Ignoring such
information may lead to the formation of inappropriate intervals and consequently
a poorly performing prediction model (Dougherty et al., 1995). Most studies provide
evidence that supervised discretization methods are able to perform better than
unsupervised ones in terms of error rates – that is, the accuracy of the prediction
model (Dougherty et al., 1995).
The discretization techniques discussed above are those widely used in research.
However, the literature highlights some limitations of these discretization methods.
The major disadvantage is that numerical data is partitioned into intervals and is
labelled to indicate which interval an observation belongs to (Dougherty et al.,
1995). The labels then replace the original observation values. This naturally leads
to information loss and distortion, which potentially cause inaccuracies when
making subsequent inferences (Yang et al., 2006).
Unsupervised discretization’s failure to use class information when it is available
generally results in inappropriate cut-off points and the loss of valuable information
in the development of a prediction model. Consequently, such techniques deliver
poor modelling performance (Agre and Peev, 2002). Although it is evident that
supervised discretization methods perform better than unsupervised ones as
measured by error rates, in some cases supervised discretization has been applied
to the entire dataset before the dataset is split into several folds for training and
testing purposes. Research has recognized that discretization before creating folds
gives the discretization method a chance to have access to the test sets, which is
likely to produce optimistic error rates (Agre and Peev, 2002).
Referential value-based discretization provides a reasonable approach which
captures the relationship between observation values and each referential value.
It is equivalent to transforming an observation value into a distribution of referential
values using belief degree values (Yang et al., 2006). The belief degree represents
the extent to which an input value or observation belongs to each referential value.
In other words, it measures how close an observation value is to each referential
value, reducing information loss and distortion. In addition, it allows the structure
of the data to be well captured.
In the entropy-based discretization method by Fayyad and Irani (1993), the
intervals along each branch are recursively and independently evaluated, leading
to imbalanced intervals (Dougherty et al., 1995). Meanwhile, the referential value-
based discretization technique has been developed in such a way that the
referential values of all input variables are determined simultaneously, and thus,
interdependencies between input variables are well captured. In addition to this,
the search for referential values occurs during the process of constructing the
prediction model. Therefore, the determination of referential values is also reflected
directly in the model accuracy. In this way, discretization makes the prediction not
only more efficient but also more effective by directly minimizing the error between
predicted outputs and observed outputs.
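For illustration, the transformation of a numerical observation into a belief distribution over referential values can be sketched as follows; the function name and the referential values in the example are ours, and only the two referential values adjacent to the observation receive nonzero belief:

```python
import numpy as np

def belief_distribution(x, ref_values):
    """Transform observation x into a belief distribution over sorted
    referential values: belief is shared between the two adjacent
    referential values in proportion to closeness."""
    ref = np.asarray(ref_values, dtype=float)
    beliefs = np.zeros(len(ref))
    if x <= ref[0]:
        beliefs[0] = 1.0
    elif x >= ref[-1]:
        beliefs[-1] = 1.0
    else:
        i = np.searchsorted(ref, x) - 1          # left neighbour
        beliefs[i] = (ref[i + 1] - x) / (ref[i + 1] - ref[i])
        beliefs[i + 1] = 1.0 - beliefs[i]
    return beliefs

print(belief_distribution(3.0, [0.0, 2.0, 4.0, 6.0]))  # [0. 0.5 0.5 0.]
```

Unlike interval labelling, the original value is recoverable from the distribution, which is why information loss is reduced.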
4.7.2. MAKER-based Models
As discussed previously, a hierarchical rule-based inferential modelling and
prediction approach is adopted because of the sparsity of matrices. The
approach is proposed to reduce the size of the belief rule base, which
consequently reduces the model complexity. It is also designed to deal with
the sparsity of matrices, in which only a few joint frequencies are nonzero,
in order to avoid misleading and incorrect inference, information loss, and
computational complexity.
As depicted in Figure 4.1, input variables are decomposed into 𝑔 groups in stage
1. By applying rule-based inferential modelling and prediction to each group based
on the MAKER framework, we can generate the probability for each class (system
output) of each group of each observation. The inputs for stage 2 are the 𝑔
probabilities for each class (system output) of each observation. The final
prediction is achieved by combining these MAKER-generated outputs from all
groups through an ER rule or BRB approach, known as the MAKER-ER-based or
MAKER-BRB-based model, respectively.
The discussion below starts with an explanation of the modelling core and
inference mechanism in stage 1, where the MAKER framework is applied to each
group of input variables, followed by the final inference mechanism in stage 2. We
also present a comparison analysis with other modelling and prediction
approaches to highlight the advantages of the hierarchical MAKER framework.
• The modelling core and inference mechanism
The graphical representation of referential value-based discretization for one input
variable (upper) and two input variables (lower) is depicted in Figure 4.3. The
number of referential values is defined for each input variable, denoted by
$A_i^l$ ($l = 1, \ldots, M$; $i = 1, \ldots, I_l$), where $M$ is the number of
input variables and $I_l$ is the number of referential values of the $l$th
input variable. Through referential value-based discretization, a continuous
space is decomposed into $(I_1 - 1) \times (I_2 - 1) \times \cdots \times
(I_M - 1)$ sub-spaces. An observation is located within one of these
sub-spaces.
Figure 4.3. Referential Value-based Discretization Technique: an input variable (upper), and two input variables (bottom)
As seen in Figure 4.3, an observation value (green dot) lies between two adjacent
referential values of an input variable (red dots). Through discretization, an input
value is transformed into a discrete value with the corresponding belief distribution
for referential values. In a higher dimension – for example, for two input variables
as depicted in Figure 4.3– a continuous space is decomposed into (5 − 1) × (4 −
1) = 12 sub-spaces. An observation is located within a sub-space determined by
the intersections of the referential values. This concept is also applicable for higher
dimensions with more input variables.
Each intersection of referential values, denoted by blue and red dots in
Figure 4.3, represents the 'IF' form in the concept of the BRB as discussed in
Section 0, specifically in Equation (4.22). The 'IF' form, expressed as
$A^k = (A_1^k, A_2^k, \ldots, A_{T_k}^k)$ and known as a packet antecedent
$A^k$, should be interpreted in this study as a combination of referential
values of the input variables, or an intersection of referential values. A
belief degree or probability for each system output is assigned to each
intersection, forming the 'THEN' expression: i.e.
$\{(D_1, \beta_{1k}), (D_2, \beta_{2k}), \ldots, (D_N, \beta_{Nk})\}$. The belief
degree or probability of each output (consequence) is obtained by combining
pieces of evidence from a group of input variables and their corresponding weights
using a MAKER rule considering the interdependency of two pieces of evidence,
as described in Section 4.3. The weights of the combined pieces of evidence (or a
packet antecedent) which are obtained by Equation (4.18) are used for inference.
This is how a BRB is generated, from which an inference can be made.
The similarity degree explained in Section 4.3.2 measures how close an
observation is to the intersections of referential values or the combinations of
referential values. On this basis, we can estimate the relative location of an input
vector in the input space. Logically, the greater the number of referential values,
the higher the location accuracy of an input vector in the input space. However,
this greater accuracy also causes greater model complexity.
The similarity degrees activate the intersections of the referential values with the
corresponding probabilities of each output (consequence) – that is, the belief rules.
As depicted in Figure 4.3 for the case of two input variables, an input vector
(green dot), based on its degrees of similarity to the combinations of
referential values, activates the intersections of referential values (red
dots). On the basis of the
degrees of similarity and the weights of intersections of referential values, we can
obtain updated weights that measure the degree to which an intersection of
referential values is triggered by an observation. The activated belief rules are then
combined using Equations (4.16)–(4.18). In this way, we can obtain the probability
of each output (consequence) for an input vector resulting from a group of input
variables in stage 1. The rule-based modelling and prediction based on the
MAKER framework explained above is applied for each group of input variables in
stage 1.
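To make the activation mechanism concrete, the following simplified sketch uses the product of per-variable belief degrees as the matching degree of each intersection; it deliberately omits the MAKER evidence weights and interdependence indexes, so it illustrates rule activation only, not the full framework:

```python
import numpy as np
from itertools import product

def belief_distribution(x, ref_values):
    """Belief degrees of x over sorted referential values."""
    ref = np.asarray(ref_values, dtype=float)
    b = np.zeros(len(ref))
    if x <= ref[0]:
        b[0] = 1.0
    elif x >= ref[-1]:
        b[-1] = 1.0
    else:
        i = np.searchsorted(ref, x) - 1
        b[i] = (ref[i + 1] - x) / (ref[i + 1] - ref[i])
        b[i + 1] = 1.0 - b[i]
    return b

def rule_activations(obs, refs_per_var):
    """Matching degree of each intersection of referential values,
    taken here as the product of per-variable belief degrees. Only
    intersections adjacent to the observation get nonzero weight."""
    per_var = [belief_distribution(x, r) for x, r in zip(obs, refs_per_var)]
    act = {}
    for combo in product(*(range(len(b)) for b in per_var)):
        w = float(np.prod([per_var[v][i] for v, i in enumerate(combo)]))
        if w > 0:
            act[combo] = w
    return act

acts = rule_activations((3.0, 1.0), [[0.0, 2.0, 4.0], [0.0, 2.0]])
print(acts)  # four activated intersections, weights summing to 1
```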
Stage 2 of the hierarchical MAKER framework accepts the MAKER-generated
outputs from all groups of input variables from stage 1 as the input for making final
inferences. The inferences are presented by the probability of each output
(consequence) of each observation. Hence, we have numerical inputs in stage 2.
Suppose that in stage 1, input variables are split into 𝑔 groups with 𝐾 classes
for the output variable. In stage 2, we have 𝑔 input variables with 𝐾
referential values for each of the input variables. Therefore, we have a BRB
consisting of 𝐾^𝑔 belief rules. For example, with 𝑔 = 2 groups and 𝐾 = 2
classes, the stage-2 BRB contains 2^2 = 4 belief rules.
Based on the concept of a BRB, a packet antecedent should be expressed in the
form ‘if a group in the system points to consequence 𝑘’ and the ‘THEN’ form should
be expressed as ‘the probability of each consequence’. The MAKER-generated
probability of a group of input variables in stage 1 represents how likely the group
points to a certain output (consequence) that naturally indicates a belief distribution
of referential values in stage 2. If all groups fully support a certain output
(consequence) with a probability of 1.0, the final inference must logically
assign a probability of 1.0 to that class. On the other hand, if all groups
completely oppose a certain class, the probability of that class must be 0.
belief rules representing the conflicting inference made by the groups in stage 1
can be trained. On the basis of the similarity degrees and the trained belief degree
of each output (consequence) of each belief rule, we can obtain a probability
pointing to each output (consequence) as a final inference by an observation. This
approach describes the MAKER-BRB-based model.
According to the explanation above, we can acquire the probabilities pointing to
different class membership generated by a MAKER rule from a group of input
variables in stage 1. In stage 2, we can perceive these probabilities as pieces of
evidence that can be directly combined through a MAKER rule using Equations
(4.16)–(4.18). We can obtain the weight of each group by Equation (4.18) when
combining the activated belief rules in stage 1. Given those pieces of evidence with
their corresponding weights, we can use Equation (4.16) to combine evidence, and
therefore we can generate the probability of each consequent with all input
variables in the system being considered. This approach, the MAKER-ER-based
model, is more direct than the former one.
• The advantages of hierarchical MAKER frameworks
As is clear from the explanation above, the hierarchical MAKER framework can
acquire evidence and measure the interdependencies of pairs of evidence directly
from data using statistical analysis. The input variables are split according to an
adjustment to avoid violations of statistical requirements, since the validity of
inferences drawn from this modelling and prediction approach depends on how
well the framework meets the statistical requirements. In each group of input
variables, combining multiple pieces of evidence from the input variables based on
the MAKER framework generates a BRB. On the basis of the BRB and the degree
of similarity between the input vector and packet antecedents of the BRB,
predictions can be generated for these sub-models – that is, groups of input
variables. The outputs generated by sub-models – that is, groups of evidence –
are then aggregated on the basis of either an ER rule or a BRB to make a final
inference. For any given observation, on the basis of the BRB and a maximum
likelihood prediction, an inference can be made. It can be seen that this
approach is completely transparent and interpretable, resulting in an
objective, robust, and rigorous data-driven inference method. The model
parameters can be trained to maximise the likelihood of true states. Through
designated machine learning, the parameters can be optimised under the
optimisation function discussed in Section 4.6.
Machine learning models can make predictions with a high level of accuracy, but
they often do not have the ability to explain how their algorithm arrives at its
conclusion or prediction. This ability is known as ‘interpretability’ and is defined by
Kim et al. in Carvalho et al. (2019) in the context of a machine learning system as
‘the degree to which a human can consistently predict the model’s result’. It was
also recently defined as the ‘ability to explain or to present in understandable terms
to a human’ by Doshi-Velez and Kim in Carvalho et al. (2019). Interpretability is
crucial for learning transfer, extraction of scientific findings, behaviour explanation,
modelling faulty assessment, and so on. In addition, interpretability can increase
human trust and acceptance of a model, which is a key factor in determining
whether users want to use it (Carvalho et al., 2019).
In the customer choice model, interpretability plays an important role. If a model is
a black box not revealing a transparent relationship between input and output, the
model can only generate predictions without explanations. Scientific findings
remain completely hidden in the model: for example, why a customer makes a
particular decision at one point and a different decision at another. With an
interpretable model, we can trace the differences in inputs leading to different
customer decisions. Analysing customer behaviour is a fundamental need that
must be met to drive managerial decision-making.
Logistic regression, classification trees, k-nearest neighbours and naive Bayes
models are commonly used interpretable machine learning models. They have
meaningful parameters and/or features; based on these, useful information can be
extracted, and predictions can be explained.
The weights in logistic regression are the interpretable elements. We can observe
the estimated odds change that results from the increase of a feature by one unit.
However, logistic regression is restricted to binary classification, assumes
a linear relationship between the inputs and the log-odds of the output, and
treats all inputs as independent of each other. The hierarchical MAKER
framework can deal with nonlinear binary
classification and multiple classification. It also takes the interdependencies
between input variables into account through a measured interdependence index.
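The odds-ratio interpretation of logistic regression weights can be illustrated with scikit-learn on a synthetic dataset; the dataset and settings below are our choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(weight) gives the multiplicative change in the odds of the
# positive class when a feature increases by one unit
odds_ratios = np.exp(model.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"feature {i}: odds multiplied by {ratio:.3f} per unit increase")
```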
A classification tree can be used in situations in which the relationship between
input variables and outputs is nonlinear, and there is interaction among input
variables. A classification tree recursively partitions the input space into regions,
and each observation belongs to exactly one region. MAKER-based classifiers
also divide the input space into sub-spaces (regions). However, MAKER-based
classifiers are more representative of reality because they use the probabilities of
the intersections of referential values and the degree of similarity between an
observation and the intersections of referential values to generate predicted
outputs.
The tree structure delivers arguably simple interpretations with natural
visualization. However, a classification tree is quite unstable and lacks smoothness
(Molnar, 2019). The cut-off points of the input variables and the structure of the
classification tree can be completely changed by just a few changes in the training
sets. Moreover, slight changes in the input variables can have a large impact on
the predicted outputs, which is a rather unintuitive and undesirable outcome.
MAKER-based models are generally more stable and smoother than classification
trees. Each input value activates a number of belief rules, and an inference can be
made on the basis of these belief rules and the degree of similarity between an
input value and referential values. The same principle is applied for higher levels
in the hierarchy. This means a subtle change in referential values (cut-off points)
will not have a large impact on the predicted outputs of the hierarchical MAKER
framework.
Naïve Bayes classifiers make predictions based on Bayes’ theorem with a naive
assumption of conditional independence between input variables. The contribution
of each input variable toward the predicted output is very clear, making Naïve
Bayes an interpretable classifier. The approach, however, requires prior
probabilities. MAKER-based classifiers are not dependent on prior probabilities. If
available, prior probabilities are treated as independent pieces of evidence (Yang
and Xu, 2017).
Unlike the above-mentioned classifiers, k-nearest neighbour classifiers are
instance-based learning algorithms. This non-parametric method makes
predictions based on the proximity of an observation to other instances. There is
no interpretability at the modular level. The other above-mentioned classifiers can
explain how parts of the model affect predictions, but a k-nearest neighbour
classifier cannot reach this level of interpretability. We can explain why an
observation belongs to a certain class by retrieving the k neighbours that are used
for predictions. The models become less interpretable, however, as the number of
input variables increases.
Two important criteria when developing prediction models are accuracy and
interpretability (Mori and Uchihira, 2019). However, these criteria are connected
and often competing: that is, the more accurate the prediction, the less
understandable it becomes (Carvalho et al., 2019). Although approaches using
logistic regression, classification trees, naïve Bayes, and k-nearest neighbours are
easy to interpret, they are generally less accurate than the more complex and
opaque models. The interpretability aspect of the hierarchical MAKER framework
is demonstrated in the following chapters in an application predicting customer
types and customer decisions in revenue management. The performance of the
hierarchical MAKER framework is compared with other machine learning methods
in terms of accuracy and other metrics.
4.7.3. Performance Comparison
In order to analyse whether the hierarchical structure proposed in this thesis
affects the predictive power, model complexity, and computation time, this section
presents a performance comparison for MAKER, BRB, and hierarchical MAKER
frameworks – MAKER-BRB- and MAKER-ER-based models. Five binary
classification datasets with four input variables were generated by the
'make_classification' and 'make_blobs' functions provided by sklearn in
Python. In this study, the 'make_classification' function generates a random
binary classification problem by initially creating clusters of points
normally distributed about the vertices of a four-dimensional hypercube and
then assigning an equal number of clusters to each class. It introduces
interdependencies between input variables. To increase the complexity of the
classification problem, we can add more clusters per class and decrease the
separation between classes, leading to a complex non-linear decision boundary
for the classifier. We can also add noise to the dataset to test the efficacy
of the classifier.
In this study, we set two clusters per class with a normal decision boundary.
The 'class_sep' parameter, which determines how well the clusters are
separated, was set to 1.5; the larger the value of 'class_sep', the less the
clusters overlap, and a value of 1.5 is considered a normal difficulty level.
The 'flip_y' parameter, which determines the fraction of data points whose
class is randomly assigned, was set to .20, meaning that 20% of the dataset
was noise. The 'make_blobs' function, which generates isotropic Gaussian blobs
for clustering, was used for dataset 5. The characteristics of the datasets
are presented in Table 4.2. All the
datasets in Table 4.2 consisted of 200 samples. The scatterplots of the data points
of each dataset can be seen in Figure 4.4.
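The dataset generation described above can be reproduced approximately as follows; the random seeds and the choice of four informative features are our assumptions:

```python
from sklearn.datasets import make_classification, make_blobs

# Datasets 1-4: binary problems with four input variables;
# 'n_clusters_per_class', 'class_sep' and 'flip_y' vary as described.
X1, y1 = make_classification(n_samples=200, n_features=4,
                             n_informative=4, n_redundant=0,
                             n_clusters_per_class=2, class_sep=1.5,
                             flip_y=0.0, random_state=0)
X3, y3 = make_classification(n_samples=200, n_features=4,
                             n_informative=4, n_redundant=0,
                             n_clusters_per_class=2, class_sep=1.5,
                             flip_y=0.2, random_state=0)   # 20% label noise

# Dataset 5: isotropic Gaussian blobs with two centres
X5, y5 = make_blobs(n_samples=200, n_features=4, centers=2, random_state=0)
```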
Table 4.2. Generated datasets with four input variables

Dataset   With/without noise             Number of clusters per class
1         Without noise ('flip_y' = 0)   2 ('class_sep' = 1.5)
2         Without noise ('flip_y' = 0)   1
3         With noise ('flip_y' = .2)     2 ('class_sep' = 1.5)
4         With noise ('flip_y' = .2)     1
5         Blobs, 2 centres, 4 input variables
The datasets included the observed input values and the observed output
values. With all of these observed input-output data pairs, we can use rule-based
inferential modelling and prediction to develop models, and train the parameters of
the models by minimising the differences between the observed output values and
the predicted output values generated by the classifiers. In this study, the
referential values were set to be fixed as the minima and the maxima of the
observed input values. Hence, all parameters other than the referential values
need to be trained. All the datasets had at least five cases per cell of the joint frequency
matrices between input variables and hence, a full MAKER framework could be
implemented. For a hierarchical MAKER framework, we split the input variables
into two groups of evidence; MAKER is performed for each group, and the
predicted outputs of each group are then combined at the upper level by
applying the BRB or ER rule to suggest a final inference regarding whether an
observation belongs to a certain class given the values of the four input
variables. Because all the
input variables in the datasets were informative, how we split the input variables
did not matter.
Figure 4.4. Scatter plot from the datasets
MAKER, BRB, MAKER-BRB, and MAKER-ER were applied for all the datasets.
We utilised five-fold cross validation: each dataset was partitioned into five
folds with a similar class distribution using stratified five-fold cross
validation in Python. Each fold was in turn treated as the test set, while the
remaining folds acted as the training set. Therefore, we obtained five rounds
for each classifier for each dataset. The processes of modelling, prediction,
and parameter learning are summarised as follows.
The steps of modelling, prediction, and parameter learning of a full MAKER model
are displayed below.
Step 1: Applying the approach of rule-based inferential modelling and prediction –
such as evidence acquisition, analysis of evidence interdependence, and inference
making – to develop a full MAKER-based model with minima and maxima of the
observed input values as fixed referential values.
Step 2: With the training set, using SLSQP to train the relevant weights of
referential values to obtain the optimised weights of referential values.
Step 3: Generating the predicted outputs of the test set on the basis of a
full MAKER model with optimised weights of referential values.
Step 4: Performing model evaluation by comparing the observed against the
predicted outputs, and recording the computation time, that is, the time
required to learn the pattern of the data.
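The SLSQP training step can be illustrated with `scipy.optimize.minimize`; the `predict` function below is a deliberately simple placeholder standing in for the model inference, so this sketch shows only how bounded parameters are trained by minimising the observed-versus-predicted gap:

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder model (NOT the MAKER inference): a weighted sum of
# inputs squashed to (0, 1).
def predict(params, X):
    return 1.0 / (1.0 + np.exp(-X @ params))

def mse_loss(params, X, y):
    """Gap between observed and predicted outputs."""
    return float(np.mean((predict(params, X) - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# train parameters constrained to [0, 1] with SLSQP
result = minimize(mse_loss, x0=np.full(4, 0.5), args=(X, y),
                  method="SLSQP", bounds=[(0, 1)] * 4)
print(result.x, result.fun)
```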
The steps of modelling, prediction, and parameter learning of a BRB model are
displayed below.
Step 1: Developing a belief rule base consisting of a packet antecedent (a
combination of referential values of the input variables) and the
probabilities of each consequence of each belief rule, that is, the
probability that an observation with the corresponding input values belongs to
a certain class. Initial belief degrees are generated.
Step 2: With the training set, using SLSQP to train belief degrees to obtain
the optimised belief degrees that minimise the gap between the observed and
the predicted outputs of the training set.
Step 3: Generating the predicted outputs of the test set on the basis of a BRB
model with optimised belief degrees.
Step 4: Performing model evaluation.
The steps of modelling, prediction, and parameter learning of a hierarchical
MAKER model are summarised in the following part.
Step 1: Splitting the input variables into two groups of evidence.
Step 2: Using the approach of rule-based inferential modelling and prediction
(evidence acquisition, analysis of evidence interdependence, and inference
making) to develop a full MAKER-based model for each group of evidence, with
the minima and maxima of the corresponding observed input values as fixed
referential values.
Step 3: Aggregating the predicted outputs of both groups of evidence by
applying a BRB or ER rule (the MAKER-BRB-based and MAKER-ER-based models,
respectively) to make a final inference.
Step 4: With the training set, using SLSQP to train the weights (the
parameters of the MAKER-based model of each group of evidence) and to train
the belief degrees of the BRB if the MAKER-BRB-based model is applied.
Step 5: Performing model evaluation.
In this section, we compared the model performances of a full MAKER, BRB,
MAKER-ER-based model, and MAKER-BRB-based model on the five datasets –
two clusters per class without noise, one cluster per class without noise, two
clusters per class with noise, one cluster per class with noise, and gaussian blobs.
Each of the five datasets had been partitioned into five folds using stratified five-
fold cross validation in Python to ensure each fold has a similar class distribution.
As previously explained, each of the datasets had two fixed referential values.
Hence, only weights and belief degrees were trained.
To compare the classifiers, performance measures are required. These include accuracy, AUCROC, MSE, computation time, and the number of trained parameters. The classification threshold was set to .5. A perfect classifier yields an AUCROC and an accuracy of 1, whereas an AUCROC of .5 is no better than a random classifier. The lower the MSE, down to a minimum of 0, the better the model.
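These measures can be computed with scikit-learn; the labels and predicted probabilities below are hypothetical, and the .5 threshold matches the one stated above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

y_true = np.array([0, 0, 1, 1, 1, 0])               # toy observed classes
p_hat = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1])    # toy predicted P(class 1)
y_pred = (p_hat >= 0.5).astype(int)                  # apply the .5 threshold

acc = accuracy_score(y_true, y_pred)     # 1 for a perfect classifier
auc = roc_auc_score(y_true, p_hat)       # .5 is no better than random
mse = mean_squared_error(y_true, p_hat)  # lower is better, 0 is best
print(round(acc, 4), round(auc, 4), round(mse, 4))
```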
As explained previously, we utilised five-fold cross validation. One fold was selected as the test set and the remaining folds were used to train the model. The optimised model parameters obtained from model training were then applied to the test set. If a classifier can generalise the pattern of the data, its performance on the test sets is similar to its performance on the training set. Hence, in this section, we present the model performances on the test sets over the five rounds. Tables 4.3-4.7 provide the scores for computation time, accuracy, AUCROC, and MSE. The average scores of these measures are summarised in Table 4.8.
Table 4.3. Performance measures for the dataset 1
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 57 73 70 64 68 66.4
2 BRB 132 132 137 155 234 158
3 MAKER-BRB 97 86 92 102 99 95.2
4 MAKER-ER 31 116 96 129 112 96.8
Accuracy
1 MAKER .9512 .9000 .9000 .9750 .9231 .9299
2 BRB 1.0000 .9250 .9000 .9750 .9824 .9565
3 MAKER-BRB .9024 .8750 .9250 1.0000 .9487 .9302
4 MAKER-ER .9512 .8250 .9000 .9950 .9487 .9240
AUCROC
1 MAKER .9929 .9600 .9775 .9975 .9711 .9798
2 BRB 1.0000 .9800 .9725 .9975 .9487 .9797
3 MAKER-BRB .9857 .9575 .9775 1.0000 .9684 .9778
4 MAKER-ER .9476 .9299 .9700 .9950 .9605 .9606
MSE
1 MAKER .1022 .1241 .1062 .0719 .1066 .1022
2 BRB .0785 .1010 .0890 .0583 .0774 .0808
3 MAKER-BRB .0693 .0982 .0753 .0753 .0745 .0785
4 MAKER-ER .1341 .1491 .1176 .0901 .1289 .1240
Table 4.4. Performance measures for the dataset 2
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 51 54 60 49 62 55.2
2 BRB 188 218 205 238 222 214.2
3 MAKER-BRB 108 281 96 191 81 151.4
4 MAKER-ER 60 58 47 89 59 62.6
Accuracy
1 MAKER .9756 .9750 1.0000 1.0000 .9487 .9799
2 BRB .9756 .9500 1.0000 1.0000 .9487 .9749
3 MAKER-BRB .9756 .9250 1.0000 1.0000 .9231 .9647
4 MAKER-ER .9756 .9750 .9750 .9750 .9763 .9754
AUCROC
1 MAKER .9976 .9975 1.0000 1.0000 .9658 .9922
2 BRB 1.0000 1.0000 1.0000 1.0000 .9658 .9932
3 MAKER-BRB 1.0000 .9450 1.0000 1.0000 .9579 .9806
4 MAKER-ER 1.0000 1.0000 .9950 1.0000 .9487 .9887
MSE
1 MAKER .0487 .0722 .0650 .0576 .0831 .0653
2 BRB .0246 .0422 .0338 .0348 .0624 .0396
3 MAKER-BRB .0458 .0742 .0554 .0479 .0822 .0611
4 MAKER-ER .0261 .0386 .0439 .0269 .0569 .0385
Table 4.5. Performance measures for the dataset 3
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 73 96 114 128 173 116.8
2 BRB 268 312 170 464 336 310
3 MAKER-BRB 77 75 78 68 73 74.2
4 MAKER-ER 83 87 113 135 145 112.6
Accuracy
1 MAKER .7561 .8049 .7750 .7179 .7692 .7646
2 BRB .8780 .8780 .8000 .8205 .8205 .8394
3 MAKER-BRB .7317 .7805 .7750 .7436 .7436 .7549
4 MAKER-ER .7561 .8293 .8000 .7436 .7692 .7796
AUCROC
1 MAKER .8429 .8762 .8396 .8000 .7895 .8296
2 BRB .9405 .9381 .8596 .8895 .9026 .9061
3 MAKER-BRB .8333 .8571 .8396 .7579 .7868 .8149
4 MAKER-ER .8286 .8044 .8396 .7737 .8184 .8129
MSE
1 MAKER .1668 .1414 .1696 .1793 .1839 .1682
2 BRB .1110 .1059 .1449 .1197 .1299 .1223
3 MAKER-BRB .1684 .1479 .1585 .1841 .1805 .1679
4 MAKER-ER .1823 .1596 .1754 .1876 .1904 .1791
Table 4.6. Performance measures for the dataset 4
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 146 107 143 100 87 116.6
2 BRB 288 380 251 148 142 241.8
3 MAKER-BRB 126 124 118 111 92 114.2
4 MAKER-ER 127 260 175 87 118 153.4
Accuracy
1 MAKER .8049 .8095 .7250 .6667 .8462 .7705
2 BRB .8049 .8049 .7750 .7439 .8718 .8001
3 MAKER-BRB .8293 .8049 .7250 .6923 .8205 .7744
4 MAKER-ER .8049 .7561 .7750 .7179 .7692 .7646
AUCROC
1 MAKER .8548 .8293 .8049 .7974 .8711 .8315
2 BRB .8643 .8024 .8049 .8132 .8912 .8352
3 MAKER-BRB .8548 .7786 .8025 .7947 .8842 .8230
4 MAKER-ER .8071 .7952 .7875 .7947 .8605 .8090
MSE
1 MAKER .1679 .1841 .1862 .1885 .1804 .1814
2 BRB .1525 .1756 .1868 .1804 .1425 .1676
3 MAKER-BRB .1477 .1778 .1804 .1864 .1434 .1671
4 MAKER-ER .1806 .1905 .1911 .1891 .1722 .1847
Table 4.7. Performance measures for the dataset 5
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 58 92 86 57 74 73.4
2 BRB 220 183 202 176 200 196.2
3 MAKER-BRB 31 24 56 26 29 33.2
4 MAKER-ER 131 76 115 103 120 109
Accuracy
1 MAKER 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
2 BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
3 MAKER-BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
4 MAKER-ER 1.0000 .9750 1.0000 1.0000 1.0000 .9950
AUCROC
1 MAKER 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
2 BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
3 MAKER-BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
4 MAKER-ER 1.0000 .9500 1.0000 1.0000 1.0000 .9900
MSE
1 MAKER .0219 .0236 .0189 .0228 .2080 .0590
2 BRB .0054 .0081 .0049 .0050 .0051 .0057
3 MAKER-BRB .0059 .0070 .0023 .0055 .0036 .0049
4 MAKER-ER .0228 .0437 .0316 .0222 .0206 .0282
Table 4.8. Grand averages of performance measures of the five generated
datasets
Model Number of parameters Computation time (in seconds) Accuracy AUCROC MSE
MAKER 16 85.68 .8890 .9266 .1152
BRB 32 224.04 .9142 .9428 .0832
MAKER-BRB 24 93.64 .8848 .9193 .0959
MAKER-ER 16 94.3 .8877 .9123 .1109
With four input variables, the MAKER, BRB, MAKER-BRB-based, and MAKER-ER-based models required 16, 32, 24, and 16 trained parameters, respectively. Longer computation time is required as the number of parameters increases. As seen in Table 4.8, BRB took the longest training time, 224.04 seconds, while MAKER required the shortest, 85.68 seconds. The MAKER-BRB- and MAKER-ER-based models recorded similar times – 93.64 and 94.3 seconds, respectively. These were less than half of BRB's computation time but still slightly higher than that of the full MAKER model.
According to Tables 4.3-4.7, all the classifiers performed similarly on the five datasets. In general, the accuracy of the hierarchical MAKER models – .8848 and .8877 for the MAKER-BRB- and MAKER-ER-based models, respectively – was similar to that of the full MAKER model, .8890. All the classifiers also had similar AUCROC and MSE scores. This result signifies that, like the full MAKER and BRB models, the hierarchical MAKER frameworks can generalise the pattern of the data and perform well on unseen data – that is, the test sets.
Figure 4.5 plots the average computation time of each classifier against its average performance measures – AUCROC and accuracy. It clearly shows that, on the same datasets, BRB required more trained parameters, which increased its computation time. The number of rules in a BRB increases exponentially as the number of input variables, the referential values of each variable, and the number of consequents increase (Yang and Xu, 2017). Meanwhile, the MAKER-BRB- and MAKER-ER-based models required slightly longer computation times than the full MAKER model did, but their performance was similar to that of the full MAKER model.
Figure 4.5. Plot of the grand average scores of performance measures (AUCROC, accuracy, and computation time in seconds) of the five generated datasets for each model
Thus, we concluded that the hierarchical structure applied in the MAKER-BRB- and MAKER-ER-based models does not significantly increase the computation time required to train parameters or to learn the pattern of the data. Their generalisation capability, based on accuracies, AUCROCs, and MSEs, is similar to that of the full MAKER model. The hierarchical MAKER models can perform well on datasets with complex non-linear boundaries and noise.
4.8. Summary
In this chapter, we have explained the algorithm of the hierarchical MAKER
framework – namely, the MAKER-ER-based and MAKER-BRB-based models with
a referential value-based discretization technique for data transformation, one of
the main contributions of this research. First, we presented the evidence
acquisition process, the measurement of interdependencies between input
variables, evidence combination approaches, the generation of a BRB, and the
bottom-up inference process in the hierarchical MAKER framework. We then
performed a comparative analysis between this framework and other machine
learning methods to highlight the framework’s advantages, and we compared the
referential value-based discretization technique used within the framework with
other discretization techniques. We conducted a comparative analysis of the full MAKER, BRB, and hierarchical MAKER frameworks based on the five generated datasets with complex non-linear boundaries and noise. Performance measures including computation time, accuracy, AUCROC, and MSE were presented. The hierarchical MAKER frameworks required less computation time than BRB did, while their performance was similar to that of the full MAKER.
Chapter 5 Application to Customer
Classification
5.1. Introduction
This chapter presents the application of hierarchical rule-based inferential modelling and prediction, based on the MAKER framework, to customer classification in revenue management. The chapter is organised as follows. Section 5.2 explains the
theoretical foundations and the formulation of a conceptual framework in customer
detection. It includes a literature review on customer types in revenue
management, the opportunity for customer detection from perceptible booking-
related behaviours, and the booking setting applied in the real case used in this
research. Section 5.3 describes data linkage to 1) extract the desired dataset,
comprising the estimated values of the input variables; and 2) label customer types
based on a customer booking journey. Section 5.4 describes the data preparation
including data cleaning, data transformation, and data partitioning. Section 5.5 explains how the classifiers, that is, the MAKER-ER- and MAKER-BRB-based models, were built and trained. Section 5.6 presents a comparative study of classifier performance for the proposed framework and other machine learning methods. A summary of the chapter is presented in Section 5.7.
5.2. Theoretical Foundations: Customer Types and
Behaviours
This section explains the theoretical foundation for identifying customer types and learning their booking behaviour. First, we select the customer types considered in this study according to recent literature in revenue management. Second, we critically analyse the possibility of detecting customer types through customer booking behaviour. Third, we present the business setting of the case study used in this research.
5.2.1. Customer Types in Revenue Management Practice
According to recent literature, there are four customer types with regard to purchase timing. These are summarised below.
Myopic customers. They buy the product immediately if the price fits their valuation (Su, 2007; Cachon and Swinney, 2009).
Strategic customers. These customers possess knowledge of dynamic pricing and a desire to save money or to gain a sense of achievement and excitement. They rationally time their purchase based on their expectations of future prices and other considerations, while weighing the risk of losing a ticket due to lack of stock or of paying more (Cason and Reynolds, 2005; Mak et al., 2014; Osadchiy and Bendoly, 2011; Reynolds, 2000). Their decision is formed through a learning process (e.g. Anderson and Wilson, 2003; Cleophas and Bartke, 2011) by searching for
information related to their goal. Different labels for strategic customers have been
used in the literature, such as deal-seeker, functional procrastination, forward-
looking customer, and rational buyer. However, these concepts are substantially
the same.
Bargain-hunter customers. These customers seek a sufficiently low discounted price. They appear at the end of a selling period and buy excess inventory (Cachon and Swinney, 2009; Cleophas and Bartke, 2011; Jerath et al., 2010; Ovchinnikov and Milner, 2012). This type has mainly been discussed in the retail industry and is considered in the case of markdown pricing, in which prices consistently drop.
Inertia. Recently, some scholars have introduced a behaviour termed customer inertia, in which customers delay the purchase even though the best decision is to buy immediately (Su, 2009). This behaviour may be caused by a psychological trait called dysfunctional procrastination (Darpy, 2000).
Osadchiy and Bendoly (2011) introduced a classification system based on
purchase patterns by financially motivated subjects, who were conditioned to act
strategically. Subjects showed different decisions once they were exposed to
dynamic pricing, even if they were all conditioned to act as a strategic customer
and received identical information. The classification consists of five types: 1)
rational strategic, who consistently make decisions fitting with the rational choice
model; 2) risk averse, who always choose to buy now; 3) risk affine, who always
choose to wait; 4) counter rational, who make decisions opposite to the rational
choice (e.g. choosing to wait when they should have bought); and 5) random, whose purchase pattern could not be identified.
This research considers two customer types: strategic and myopic. The other
customer types were omitted for several reasons. First, as mentioned earlier,
bargain-hunters appear if the markdown pricing strategy is applied, especially in
the retail industry (e.g. electronics or high-end fashion). As the product value
declines over time, the prices are naturally discounted as the season progresses
Aviv and Pazgal, 2008). Cleophas and Bartke (2011) considered myopic, strategic,
and bargain-hunter together in an airline case study. However, the model was
designed when an airline employs a markdown pricing strategy. In airlines, the
pricing strategy is dynamic and does not have a consistent pattern. Therefore,
bargain-hunter was excluded in this study. To the best of our knowledge, bargain-
hunters have mainly been discussed for the fashion industry (e.g. Cachon and
Swinney, 2009). Second, some researchers have explained customer inertia as resulting from human limitations in processing information and, accordingly, deciding away from the optimal path. In reality, human decisions generally may deviate from rational optimality. In addition, customers probably receive only partial information. Hence, in this study, customers who keep waiting when they should buy immediately are considered strategic customers who make suboptimal decisions due to the natural limitations of their cognitive processes.
5.2.2. Tangible Booking Behaviours
To illustrate the behaviour of strategic customers, Table 5.1 displays definitions
from selected scholarly publications. Some identified aspects of strategic
customers are as follows: 1) the propensity to wait or delay their purchase, 2) the
intention to maximise their utility or the value of money spent, 3) rational thinking,
4) the learning process in searching for information related to prices and probability
of stock-outs, 5) the tendency to rush in at the last minute, 6) communication with
other customers, and 7) cancel-rebook behaviour.
Table 5.1. Definitions of strategic customers
No. Definitions of strategic customers
1 ‘Customers are waiting for… anticipating price markdowns of…, and
tracking prices of ….’ (Zbaracki et al., 2004)
2 ‘Rational customers anticipating the pricing path….’ (Stokey, 1981)
3 ‘They time their purchase in anticipation of future discounts and need to consider not only future prices, but also the likelihood of stock-outs.’ (Aviv and Pazgal, 2008)
4 ‘They recognize that the product may become available on the salvage market and consider delaying their purchase…to maximize their expected surplus.’ (Ye and Sun, 2015)
5 ‘Strategic customers has become synonymous with this type of rational,
forward-looking purchasing behaviour.’ (Su and Zhang, 2009)
6 ‘They may reason strategically the best time to buy, search for deals,
rush in at the last minute.’ (Wang et al., 2013)
7 ‘They may strategically delay a purchase to learn more about product
value.’ (Cachon and Swinney, 2009)
8 ‘They are completely rational customers who can be opposed to
customers with bounded rational behaviour.’ (Shen and Su, 2007)
Table 5.1. Continued.
No. Definitions of strategic customers
9 ‘They are intertemporal utility maximizers.’ (Besanko and Winston,
1990)
10 ‘Strategic customers plan their buys according to their expectations, current observations, and communication with their peers.’ (Cleophas and Bartke, 2011)
11 ‘There are many indications that deal-seeking travelers continue to search after they have made a reservation, looking for an even better deal for the same tourism product or service…cancel their existing reservation and rebook the better deal.’ (Chen et al., 2011)
Many researchers have identified delay or waiting behaviour as a definite
consequence of strategic behaviour. Toh et al. (2012) showed through a
questionnaire with statistical tests that frequently checking for lower prices and
rebooking if necessary were significantly correlated with the behaviour of waiting,
checking for lower fares over time, and keeping contact with agents about lower
prices that were available. Similarly, Gorin et al. (2012) illustrated strategic behaviour with a real-life example of numerous rebooking occurrences from an airline database. A customer booked a fully refundable fare (Y class, €549), either because of uncertain travel plans or because no lower price was available. Then, a week before the departure date, they cancelled and rebooked at a lower price (B class, €439). Finally, five days before the travel date, they cancelled the previous ticket and rebooked at a lower price in V class (€107). The passenger was willing to pay for class Y but chose class V once it became available.
This behaviour is denoted as ‘cannibalisation’ in revenue management. Customers search for information about lower prices, secure a seat to reduce the risk of losing the ticket, and rebook once a lower price becomes available. This behaviour maximises their benefit. Therefore, cancel-rebook behaviour is potentially useful for detecting strategic purchasing.
5.2.3. Flexible Payment
In general, providers apply a cancellation policy in advanced booking settings. In many cases, a guaranteed reservation requires a deposit or full or partial prepayment. If the booking is cancelled, the advance payment is not fully (or is only partly) reimbursed and is kept as compensation by the provider.
on a ‘book now, pay later’ system can hold their seats at the posted price, for free
(Yip, 2019). Similar slogans have been introduced for hotels, such as ‘book now,
pay when you stay’ (Lorenz, 2019). This feature allows people to book and pay
later without worrying about sell-outs or price increases. However, agents normally
give a certain time limit for the ‘holding’ or consideration period. During this time,
the payment must be made, or the booking will automatically be cancelled. The
holding period ranges from minutes to several days, depending on the number of
days to departure and the policies agreed by agents and airlines. This feature has
been widely used by offline agents. However, some agents use similar terms but
with a different meaning. ‘Book now, pay later’ can also mean reserving a seat and
opting for monthly instalments. In this research, the first definition is used.
Online agents initially introduced the same features to compete with offline agents and to entice offline customers through payment flexibility. Another term for similar offerings is ‘Hold my booking’, for a minimum fee or zero fee, through both online
and offline ticketing offices. Offline agents are less restricted than online agents.
Through online channels, customers can generally hold their booking for up to 48 or 72 hours. Another term used is ‘free cancellation’. The
difference is that customers must pay the full price, but can be reimbursed without
any cancellation fee if they cancel within the specified period.
Strategic customers who perceive themselves as experienced and capable of
influencing other customers tend to gather and share information related to their
experiences. They may engage in discussions in online forums to share and influence other people’s purchase decisions (Clark and Goldsmith, 2005).
Cleophas and Bartke (2011) considered these interactions among customers in
their model. Beyond transaction data, tracking online forums, websites, social media, and other media is valuable.
Several forums or websites provide tips for making cheap bookings (e.g.
Flightdelayclaimsteam.com, 2019). They may explicitly suggest that customers
book, cancel, and repeatedly rebook by exploiting a 24-hour cancellation policy. In
addition, they suggest rebooking immediately, even before the original booking
expires, for extra safety. They highlight that prices will likely drop within 24 hours.
Zero-deposit ‘book now, pay later’ gives customers time to finalise their travel plans, check that the booking details are correct, and conduct more research if they wish. In addition, the risk of sell-outs and higher prices is reduced, since customers have secured a seat by paying a small deposit or even no deposit at all. Customers can secure a seat for the holding period while they seek other available lower prices. If a more favourable price appears, they may make another booking at minimal or zero cost.
5.3. Conceptual Framework
In this section, we discuss input variables that were identified through refinement
of the theoretical foundations and available data. Following the identification, we
describe how we extracted values of input variables to obtain the desired dataset.
As there were no labelled customer types in the system, we formulated a procedure to label each customer as strategic or myopic. The procedure mimicked strategic purchasing using the price information extracted from the system.
5.3.1. Influential Variables
For illustration, a real example from customer transaction records is presented. A passenger attempted to make a booking for a 17th Sep 19:55 flight by Lion Air from CGK to BDJ. He booked a class V ticket (Rp863k; Rp is Indonesia’s currency) on 31st July at 17:44, about six weeks before the departure date. He could hold the seat until 6th August 23:29 (6.24 days). At that time, either because he was unsure about his travel plan or because he realised that the prices were stable and no lower price was available, he let the ticket go. He then made a second attempt on 18th August at 07:25, with a holding period of about 6.67 days, at the same class (V, Rp836k). When a lower price (M, Rp583k) was released for the same flight, he made a third attempt and, on 31st August at 07:21, issued the ticket with full payment. In the end, he got 30.26% off the previous fare. Figure 5.1 gives an illustration of this real example.
Figure 5.1. Illustration 1 (several weeks before departure date)
Another example is illustrated in Figure 5.2. The booking was made by a passenger four days before the departure date. The first booking was for a Rp950k ticket from UPG to CGK on 18 Sep at 06:30. The length of the holding period was 9.5 hours (0.395 days). In the middle of the period, a lower class became available. He cancelled the first booking on 14 Sep at 20:00, before the holding period ended, immediately made another booking for class T (Rp862k), and purchased it 12 minutes later. These two real examples, with different arrival times, were chosen from numerous similar cases in the dataset. Although both passengers obtained lower prices through the cancel-rebook strategy, in other cases customers ended up paying a higher price by applying the same strategy.
Figure 5.2. Illustration 2 (some days before departure date)
Based on the literature explained in the previous section, there are certain typical behaviours among strategic customers when making a reservation: 1) spending the holding period monitoring prices, 2) continuously cancelling and rebooking, and 3) immediately rebooking once the previous reservation is cancelled or released by the system. In other words, given the same length of holding period, compared to myopic customers, strategic customers tend to spend a longer time, make more frequent attempts or bookings, and have a shorter interval between cancelling and booking again. We thus selected four input variables: the length of the holding period, the time spent confirming the booking, the frequency of bookings, and the interval between cancelling and rebooking, as listed in Table 5.2.
Table 5.2. Input variables
No. Input variable Label Unit
1 The length of ‘hold’ period HP Day
2 Time spent for confirming booking TS Day
3 Frequency of bookings FB Times
4 Interval between cancelling and booking again ICR Day
Figure 5.3 illustrates cancel-rebook behaviour when attempting to purchase a flight ticket. Customers may rebook due to changes in their travel plans – for example, changes to the departure date or time, origin or destination, number of tickets, or the composition of passengers (e.g. adults, infants, and children). To eliminate this effect on the classification model, data were recorded only if no such changes were made, indicating that customers probably enacted the behaviour to exploit dynamic pricing.
A customer attempts to book a ticket n times. She first books at time A1, and the seat is secured at the agreed price until B1. She can confirm the reservation at any time, denoted by C1, between A1 and B1 (A1 ≤ C1 ≤ B1). Given the holding period (H), she spends TS (‘time spent for confirming booking’) amount of time (0 ≤ TS ≤ H). Her choices are 1) to make the payment, or 2) to wait, either cancelling purposefully or being released by the system because no payment was made by B1. She then rebooks the same ticket CR1 units of time later, at A2. CR1 indicates the interval between cancelling and booking again. She repeats this cancel-rebook behaviour until the nth attempt, with avgCR as the average of CR1 to CRn-1.
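The four input variables can be computed directly from the booking timestamps. In the sketch below, the (An, Bn, Cn) tuples are made-up timestamps loosely echoing Illustration 1; the variable definitions follow the text.

```python
from datetime import datetime as dt

# Toy attempts as (A_n booking time, B_n hold deadline, C_n confirmation time).
attempts = [
    (dt(2019, 7, 31, 17, 44), dt(2019, 8, 6, 23, 29), dt(2019, 8, 6, 23, 29)),
    (dt(2019, 8, 18, 7, 25), dt(2019, 8, 25, 0, 0), dt(2019, 8, 25, 0, 0)),
    (dt(2019, 8, 29, 9, 0), dt(2019, 9, 4, 0, 0), dt(2019, 8, 31, 7, 21)),
]

def days(delta):
    return delta.total_seconds() / 86400.0

HP = [days(b - a) for a, b, c in attempts]        # length of 'hold' period
TS = [days(c - a) for a, b, c in attempts]        # time spent confirming booking
FB = len(attempts)                                # frequency of bookings
ICR = [days(attempts[i + 1][0] - attempts[i][2])  # cancel-to-rebook intervals
       for i in range(FB - 1)]
avg_icr = sum(ICR) / len(ICR)                     # avgCR over the n-1 intervals
print(f"HP1 = {HP[0]:.2f} days, FB = {FB}, avgCR = {avg_icr:.2f} days")
```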
Notes. An: booking time of the nth attempt; Bn: maximum time to make payment; Cn: confirmation time, i.e. the time when the status of the reservation is changed; CRn-1: the period between cancelling the (n-1)th attempt and booking again for the nth attempt; H: the length of the holding period; TS: time spent given the holding period; n: the number of attempts or bookings made.
Figure 5.3. Data linkage (diagram: booking records of the 1st to nth attempts are linked, via the search keywords origin-destination, departure date and time, and confirmation time, to a price database of posted prices; booked prices and actual consumer decisions are then compared with estimated strategic decisions to label consumer types)
5.3.2. Detecting Customer Types
Previous studies assume that strategic customers make a perfectly rational choice
(e.g. Aviv and Pazgal, 2008). However, this assumption is contradicted by the theory
of bounded rationality (Simon, 1955, 1956). Li et al. (2014) proposed that strategic
customers may have different levels of sophistication in predicting future prices. They
divided customers into three categories: 1) perfect foresight, 2) weak-form rational
expectation and 3) strong-form rational expectation. The authors explained that
although customers receive the same information, their decisions may be different but
still follow the rational choice model.
This study has a different setting from that of Li et al. (2014). Here, customers do not need to project the future price. Instead, they obtain perfect information in real time just before they confirm the booking. In other words, at any time before B1, they can update the information on the Internet to see whether lower prices are available. Hence, we assume that their decision to buy or to wait relies on perfect information.
The rational choice is to wait (i.e. cancel and rebook) if a lower price appears. If the
price remains the same, strategic customers can choose either to buy or to wait
depending on their patience (Su, 2009). Although a set of perfect information is given
to them, this does not guarantee that the outcome will be as expected. Time is needed
to process the rebooking, and offerings may change during this period. Hence,
some customers attempt to rebook before the previous booking is released.
Therefore, identifying customer types solely based on whether they obtain lower
prices can be misleading.
The identification of customer types was designed for one-time purchases. In each purchase, the customer may make one or more bookings before the payment. It is assumed that customers check prices just before they confirm at Cn. For each booking, we identified the rational choice using the price information given just before Cn (the confirmation time) and checked whether customers followed it. From the booking record, travel-related information – such as departure time, name of airline, origin-destination, and confirmation time (Cn) – was utilised to search for the price posted just before Cn. The rational choice is grounded in the comparison between the agreed price (pAn) and the posted price (pCn).
Rational choice = { buy, if pAn < pCn; wait, if pAn > pCn; either buy or wait, if pAn = pCn }
To label customer types, the principle of comparing actual customer decisions against the rational choice was applied (e.g. Mak et al., 2014). If the two were the same, the customer’s choice was consistent with the rational choice. If customers chose to buy when they should have waited, they were immediate buyers, which is similar to myopic behaviour. If they chose to wait when they should have paid immediately, they were labelled as persistently choosing to wait; this insistence on waiting was considered strategic waiting in this case. If they fully followed the rational choice over all n attempts, they were labelled ‘strategic’. If they showed immediate buying, they were categorised as ‘myopic’.
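The labelling procedure can be sketched as a small function. The price comparison follows the rational-choice rule described above (wait, i.e. cancel and rebook, when a lower posted price appears); the function and field names are illustrative, and, per the text, only immediate buying against the rational choice yields a ‘myopic’ label, while persistent waiting counts as strategic.

```python
def rational_choice(p_agreed, p_posted):
    """Rational choice at one confirmation point: rebook (wait) if a lower
    posted price appears; buy if the agreed price is the better deal."""
    if p_posted < p_agreed:
        return "wait"
    if p_posted > p_agreed:
        return "buy"
    return "either"  # price ties permit either action

def label_customer(bookings):
    """bookings: list of (agreed_price, posted_price, actual_decision).
    'myopic' if the customer ever bought when waiting was rational;
    otherwise 'strategic' (persistent waiting counts as strategic waiting)."""
    for agreed, posted, decision in bookings:
        if rational_choice(agreed, posted) == "wait" and decision == "buy":
            return "myopic"
    return "strategic"

print(label_customer([(950, 862, "wait"), (862, 900, "buy")]))  # strategic
print(label_customer([(950, 862, "buy")]))                      # myopic
```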
A customer booking normally creates a passenger name record (PNR), which
contains travel-related information such as the name of passenger, fare class, and
their flight sequence. When a cancellation occurs, a new PNR is generated. This
makes tracking cancel-rebook behaviour difficult and costly. Nonetheless, tracking
cancel-rebook behaviour can provide insight regarding strategic customers and may ultimately impact the airline’s revenue. Systematic tracking was developed in this research to extract the relevant input variables used to label and classify customers. Based on this information, a classification model was developed, to be applied in subsequent detection systems without scrutinising PNRs for every individual.
This classification model is proposed specifically for small and medium-sized travel agents. Such agents have less sophisticated revenue management and information systems than airlines do for dealing with different kinds of customer behaviour. In addition, airlines make pricing and capacity allocation policies for the full capacity of a flight in response to the presence of strategic customers; therefore, costly systematic tracking to detect strategic customers is worth implementing for airlines. However, a small or medium-sized travel agent may sell only a very small portion of the seats on a flight, or even just one or two tickets. To perform systematic tracking, travel agents would have to collect updates of price changes and ticket availability every hour until the last minute before departure and save them to their data storage. This exhaustive tracking would be done for only one or two tickets sold on a particular flight, so its benefit is not worth the effort. Instead, agents can use the proposed classification to detect customer types from customers’ past transaction history, without scrutinising PNRs, collecting updates of price changes and ticket availability, or conducting costly systematic tracking.
5.4. Data Preparation
Real-world data may be incomplete, noisy, and inconsistent, which leads to low performance, poor-quality outputs, and hidden useful patterns (Zhang, Zhang, and Yang, 2003). Data preparation is therefore required before model development to yield quality data. It includes data integration, data transformation, data cleaning, data reduction, and data partitioning (Zhang, Zhang, and Yang, 2003). This study mainly used data integration, data cleaning, and data partitioning. Data integration is the combination of technical and business processes used to combine data from different sources into the desired dataset, that is, meaningful and valuable information (Hendler, 2014); Section 5.3 presents the data linkage used to obtain the dataset for this study. Data cleaning includes dealing with missing values, noisy data, and outliers, and resolving inconsistencies (Zhang, Zhang, and Yang, 2003). Data partitioning divides the dataset into multiple smaller parts.
The focus of the study was to examine which factors can be used to discriminate between strategic and myopic customers in a dynamic-pricing environment through their cancel-rebook behaviour. The procedure used to label customer types mimicked strategic purchasing using the price information extracted from the system. Incomplete price information at the time closest to Cn (confirmation time) could bias or mislead the inference about whether customers follow strategic purchasing, and without price information, customer types could not be detected at all. Hence, for data cleaning, we accepted only complete records for each customer.
In data partitioning, we utilised five-fold cross-validation with stratified random sampling: the data were divided into five folds with similar class distributions. Customers who made several attempts, or who bought more than once, have multiple data points in the dataset. In this situation it is advisable to shuffle the dataset, that is, to randomly reorder it before splitting. The partitions obtained through k-fold cross-validation with shuffling then generally derive from different customers, which prevents the model from learning the patterns of particular customers. We employed stratified five-fold cross-validation with shuffling in Python to partition the dataset into five folds. Each fold was treated in turn as the test set, while the remaining folds acted as the training set, giving five rounds for each classifier.
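The partitioning step can be sketched as follows. The thesis states only that stratified five-fold cross-validation with shuffling was done in Python (scikit-learn's StratifiedKFold(n_splits=5, shuffle=True) is the usual tool), so this dependency-free round-robin split is an illustration rather than the exact procedure used:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Split indices into k folds with similar class distributions.

    Illustrative stand-in for scikit-learn's StratifiedKFold(shuffle=True):
    shuffle the indices of each class, then deal them round-robin to folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)          # shuffle within each class
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)  # deal indices round-robin to folds
    return folds

# Each fold serves once as the test set; the remaining folds form the
# training set, giving five rounds per classifier.
labels = ["myopic"] * 80 + ["strategic"] * 20
folds = stratified_kfold(labels, k=5)
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
```

Because each class is dealt separately, every fold inherits roughly the overall myopic/strategic proportions.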
5.5. Hierarchical Rule-based Models for Customer
Classification
In this section, the building of a hierarchical rule-based inferential modelling and prediction approach based on the MAKER framework – namely, the MAKER-ER- and MAKER-BRB-based classifiers – for predicting customer types is explained, together with a numerical study using the described dataset. As previously stated, we used four input variables – HP, TS, FB, and ICR – to predict the customer types: myopic or strategic. The definitions of these variables and customer types are detailed in Section 5.2. The data were shuffled and partitioned into five groups with similar class distributions based on stratified random sampling; the training set of the first group is used here to illustrate how the MAKER-ER and MAKER-BRB frameworks were applied to the customer-type dataset in this case study.
5.5.1. Hierarchical MAKER frameworks
A minimum of five cases per cell of the joint frequency matrices between the input variables – except for disjointed pieces of evidence – must be satisfied to implement a full MAKER framework. MAKER-ER- and MAKER-BRB-based models are designed when this statistical requirement is not satisfied. These models are also useful for reducing the multiplicative complexity in the number of referential values of the input variables in the belief rule base. To group input variables, one starts with the input variable that exerts the strongest impact on the model outcome and then adds the other input variables one by one, such that the joint frequency matrices of the pairs of input variables in a MAKER model fulfil the statistical requirement of at least five cases per cell.
For initialisation, since the data were not normally distributed, a Spearman correlation test was used to analyse the strength of the monotonic correlation between the input variables and the output, and among the input variables. According to Table 5.3, the input variables ranked from strongest to weakest correlation with the output variable were TS, FB, ICR, and HP; hence, customer decisions were most strongly influenced by TS and FB. Based on this order, we added the input variables one by one to TS until all the joint frequency matrices between the input variables had at least five cases per cell, except where pieces of evidence were disjointed due to structural zeros. Input variables that could not satisfy this condition were excluded and formed another group of evidence.
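Spearman's rank correlation, used here to rank the input variables, is simply the Pearson correlation of the ranks; a minimal dependency-free sketch (equivalently, scipy.stats.spearmanr):

```python
def rank(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Ranking the input variables by the absolute value of spearman(X_i, y) against the output reproduces the ordering step described above.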
Table 5.3. Descriptive statistics and Spearman correlation matrix

Factor   Min    Max     Mean    SD      Customer types   HP        TS        FB        ICR
HP       .007   6.732   .439    .911    .232**           1
TS       .000   6.056   .065    .293    .446**           –.160**   1
FB       1.000  34.000  1.232   1.045   .422**           –.074**   .395**    1
ICR      .000   1.199   .003    .042    .264**           –.029     .299**    .654**    1

Note: * correlation is significant at .05 (2-tailed); ** correlation is significant at .01 (2-tailed)

In this way, we defined the groups of evidence as depicted in Figure 5.4. Theoretically,
group 1 (HP-TS) explains how customers spend time given the length of the holding
period, and group 2 (FB-ICR) describes how quickly customers book again if they make
several attempts before the final purchase. The MAKER-based model was applied to each group of evidence, and its output presented the probability of a customer being myopic or strategic. These outputs were then aggregated into a final inference about whether a customer is myopic or strategic given the values of the four input variables.
5.5.2. Optimised Referential Values of the Model
This section demonstrates how to develop MAKER-ER- and MAKER-BRB-based classifiers for a model of the customer types, with a numerical study using the dataset explained in Section 3.3. We split the input variables into two groups of evidence: HP and TS as group 1, and FB and ICR as group 2. The output variable was the customer type: the ‘myopic’ or ‘strategic’ class. The definitions of these two types can be found in Section 5.2.
[Figure: two diagrams, one for the MAKER-ER-based model and one for the MAKER-BRB-based model. In each, the input variables – the length of the holding period, time spent confirming the booking, frequency of bookings, and the interval between cancelling and booking again – feed the MAKER-based classifiers for the two groups of evidence; the MAKER-generated outputs (probabilities p1/1-p1 of being myopic or strategic for group 1, and p2/1-p2 for group 2) are combined, via rules k in the BRB case, into the final inference: myopic or strategic.]
Figure 5.4. Hierarchical MAKER frameworks for customer classification
As explained above, the data were shuffled and then partitioned into five groups with similar class distributions. The model parameters – that is, the referential values and weights – were assigned to develop a MAKER framework. For the purpose of illustration, we use the optimised parameters of the first group as an example throughout this section.
Discretisation is often applied to transform quantitative data into qualitative data to
make learning from the qualitative data more efficient and effective. All the input
variables were numerical. A discretisation technique with referential values was
applied to all input variables.
Referential values consist of the lower and upper boundaries of the input variables for
the dataset and any values between those boundaries. The boundaries can be set
based on the minima and maxima of the observed values for the input variables of
the whole dataset. Alternatively, experts can determine the boundaries, such as in the
study by Kong et al. (2016) about trauma outcome. In this study, we utilised the
percentiles of the observed values for input variables of the whole dataset.
In this study, we set the percentiles of 1% and 99% as the lower and upper
boundaries. Table 5.4 demonstrates that the minimum and the first percentile of the
observed values of the input variable FB were 1. The 99th percentile and the maximum
of FB (observed) were 5 and 35 respectively. Almost all the customers – 99% – in the
dataset made five or fewer bookings. The significant difference between the 99th
percentile and the maximum of FB could indicate there were extreme values in the
dataset. Furthermore, 0.5% of the dataset (12 customers) made several attempts: 6
to 35 bookings. We set the 99th percentile as the upper boundary; hence, booking more
than five times was equivalent to booking five times. These percentiles were selected
because we could obtain complete joint frequency matrices, that is, all the cells of the
joint frequency matrices of the pairs of evidence did not have sampling zeros. In
addition, the performance of the classifiers was not significantly affected by this
modification. For other machine learning methods, we replaced the extreme values
with the values of these boundaries of each input variable.
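The boundary treatment described above – replacing extreme values with the 1st/99th percentile boundaries for the other machine learning methods – can be sketched as follows; the linear-interpolation percentile rule is an assumption, as the thesis does not state which convention was used:

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list.
    This interpolation convention is an assumption, not from the thesis."""
    pos = q / 100 * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def clip_to_percentiles(values, low_q=1, high_q=99):
    """Replace extreme values with the low/high percentile boundaries,
    mirroring the treatment of the input variables described above."""
    s = sorted(values)
    lo, hi = percentile(s, low_q), percentile(s, high_q)
    return [min(max(v, lo), hi) for v in values]
```

Applied to FB, for example, this caps the handful of customers with six or more bookings at the 99th-percentile boundary while leaving the bulk of the data untouched.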
Table 5.4. Percentiles of the dataset

Input variable   0%             100%    1%             99%
HP               6.551 × 10⁻³   6.732   9.982 × 10⁻³   5.532
TS               5.800 × 10⁻⁵   6.537   4.750 × 10⁻⁴   2.111
FB               1              35      1              5
ICR              –3.494         8.648   –.2939         .559
As explained in Section 3.7, the model parameters – including weights and referential
values – were optimised through sequential least squares programming (SLSQP) with
randomly set initial parameters, and the MSE score was used as an objective function.
Equations (4.23) and (4.24) were used for the MAKER-ER- and MAKER-BRB-based
models, respectively. The optimisation algorithm identifies the direction to find a new
solution based on the evaluation of the MSE score and was run for up to 200 iterations or until a tolerance of .0001 was reached.
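The thesis runs SLSQP (in SciPy, scipy.optimize.minimize(method="SLSQP")) over all parameters jointly; as a dependency-free illustration of the objective only, the sketch below scans a single candidate referential value and keeps the one minimising the MSE of a simple threshold prediction – a stand-in for, not a reimplementation of, the actual optimisation:

```python
def mse(predicted, observed):
    """Mean squared error between predicted class probabilities and labels."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

def fit_referential_value(xs, ys, lower, upper, steps=200):
    """Scan candidate referential values between the boundaries and keep the
    one whose threshold prediction minimises the MSE score. Grid search is
    used here purely for illustration; the thesis optimises all referential
    values and weights jointly with SLSQP."""
    best_v, best_score = lower, float("inf")
    for i in range(steps + 1):
        v = lower + (upper - lower) * i / steps
        # predict P(strategic) = 1 above the candidate value, else 0
        pred = [1.0 if x > v else 0.0 for x in xs]
        score = mse(pred, ys)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# toy data: strategic customers (y = 1) tend to have larger TS values
xs = [0.01, 0.05, 0.08, 0.3, 0.9, 1.5]
ys = [0, 0, 0, 1, 1, 1]
v, score = fit_referential_value(xs, ys, lower=0.0005, upper=2.111)
```

The optimised referential value lands between the two classes, which is exactly the behaviour visible in Figures 5.5 and 5.6.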
The target of the optimisation of both the MAKER-ER- and MAKER-BRB-based models is to maximise the likelihood of the true state of a training set and thereby minimise the MSE score, which denotes the difference between the model outputs and the observed values. Optimising the referential values of each input variable means identifying how to divide the input variable's range so that the observations of a given class form the majority within each partition. More trained referential values can reasonably improve the classifier, but the associated cost (i.e., model complexity) increases. In this case, we used one optimised referential value per input variable because adding more referential values did not significantly improve the AUC scores but caused higher model complexity, and sparser joint frequency matrices were found when more referential values were added. In addition, two adjacent referential values – that is, no trained referential value – can only approximate a monotonic function; at least one trained referential value is required to approximate a non-monotonic function.
Figure 5.5 illustrates the scatter plot of the first training set across the four input variables. There are two panels, one for each group of evidence: HP-TS (left) and FB-ICR (right). The red dots represent ‘myopic’ and the blue dots ‘strategic’, while the vertical and horizontal lines indicate the optimised referential values of the input variables. As the figure shows, these lines split the data into several regions. Because the referential values are optimised through the MAKER-ER- and MAKER-BRB-based classifiers, each region indicates where most of a class is placed; with one trained referential value for each input variable, each panel features four regions.
Figure 5.5. Scatter plot of the observed data of the training set of the first fold, with the optimised referential values from the optimisation of the MAKER-ER-based model plotted for each input variable of the customer-type dataset. Left panel: group of evidence HP-TS; right panel: group of evidence FB-ICR.
In general, Figures 5.5 and 5.6 illustrate that data patterns existed for records in different classes across the input variables of the dataset. For the HP-TS group of evidence, both classes (i.e., myopic and strategic) were distributed over the same range; at higher values of the input variable HP, most of the strategic customers were spread over a large value range of the input variable TS. For the FB-ICR group of evidence, ‘strategic’ (blue dots) generally dominated the right side of the panel, meaning that ‘strategic’ featured a large range of values of the input variable FB. The myopic customers were mainly distributed close to the lower boundary of the input variable ICR, while the strategic customers generally covered a large value range of ICR. In addition, there was no single observation in the upper left corner of the panel: if a customer books only once, the value of ICR is zero, a condition known as a structural zero.
Figure 5.6. Scatter plot of the observed data of the training set of the first fold, with the optimised referential values from the optimisation of the MAKER-BRB-based model plotted for each input variable of the customer-type dataset. Left panel: group of evidence HP-TS; right panel: group of evidence FB-ICR.
The horizontal and vertical lines denote the optimised referential values of the input variables of the respective training set. As stated earlier, optimising the referential values with respect to the MSE score led to a separation of the observations for each input variable, so the majority of a class fell within the same value range. As shown in Figures 5.5 and 5.6, the optimised referential values are generally located around the separation point between the myopic and strategic classes.
For the following sections, the optimised referential values and other model
parameters of the training set of the first group of both MAKER-based classifiers are
taken as an example to demonstrate how MAKER-ER- and MAKER-BRB-based
models are constructed for the customer type prediction of a given dataset. The next
section discusses the MAKER-based models according to four aspects: 1) evidence
acquisition from data, 2) evidence interdependence, 3) belief-rule inference, and 4)
inference of the top hierarchy, including the ER rule and BRB inference.
5.5.3. Evidence Acquisition from Data
Section 4.3 explains the MAKER framework with referential values as a discretisation
method for numerical data. As already stated, the referential values of each input
variable in numerical data must be defined to acquire evidence from a dataset. The
referential values as model parameters can initially be set based on expert knowledge
or can be randomly generated. They can then be trained using historical data under
an optimisation objective (Xu et al., 2017). For illustration purposes, we used the
solution of optimisation of the first round for the MAKER-ER-based model – including
weights and an optimised referential value for each input variable of the training set.
Table 5.5 depicts the optimised referential values used for this illustration. The referential values include the boundary referential values – the lower and upper boundaries determined in Section 5.5.2 – and one referential value which lies between them. To acquire evidence from a dataset, the first step is to transform each input value of each input variable of the training set using Equation (4.7): 1) find the two adjacent referential values of the respective input variable between which the input value lies, and 2) calculate the belief distribution over these two adjacent referential values, called the similarity degrees. The second step is to aggregate the similarity degrees of each referential value under the different classes of the training set according to Equation (4.8); the frequencies of the referential values of each input variable under the different classes of the output variable can subsequently be generated. Table 5.6 displays these frequencies for the input variable TS as an example.
Table 5.5. The optimised referential values obtained from the MAKER-ER- and MAKER-BRB-based models of the first round

Input variable                                   TS      HP      FB      ICR
Lower boundary                                   .0005   .0010   1       –.2939
Optimised referential value (MAKER-ER-based)     .1338   .1585   1.3390  .0312
Optimised referential value (MAKER-BRB-based)    .1802   .1206   1.0848  .0612
Upper boundary                                   2.1110  5.5320  5       .5590
Table 5.6. The frequencies of the referential values of the input variable of TS
Class\referential values .0005 .1338 2.1110
Myopic 1515.1435 278.4574 31.3991
Strategic 204.5881 350.1775 67.2344
The third step is to calculate the likelihood of a referential value of an input variable being observed given that a class of the output variable is true; Equation (4.9) is applied to all referential values of all input variables of the training set. Once the likelihood of a referential value has been obtained, the probability that the respective referential value points to a class of the output can be calculated using Equation (4.10). Table 5.7 presents the likelihoods for the referential values .0005, .1338, and 2.1110 of the input variable TS as an example.
Table 5.7. The likelihoods of the referential values of the input variable of TS
Class\referential values .0005 .1338 2.1110
Myopic .8302 .1526 .0172
Strategic .3289 .5630 .1081
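The arithmetic linking Tables 5.6, 5.7, and 5.8 can be checked in a few lines; the sketch below assumes that Equation (4.9) normalises the frequencies within each class and Equation (4.10) normalises the resulting likelihoods across classes, an assumption that does reproduce the published values:

```python
# Frequencies of the referential values of TS under each class (Table 5.6)
freq = {
    "myopic":    [1515.1435, 278.4574, 31.3991],
    "strategic": [204.5881, 350.1775, 67.2344],
}

# Likelihoods (Table 5.7): each class's frequencies normalised by its total
likelihood = {cls: [f / sum(fs) for f in fs] for cls, fs in freq.items()}

# Probabilities (Table 5.8): likelihoods normalised across classes
# separately for each referential value
prob = {cls: [] for cls in freq}
for j in range(3):
    total = sum(likelihood[cls][j] for cls in freq)
    for cls in freq:
        prob[cls].append(likelihood[cls][j] / total)
```

Rounding to four decimal places recovers, for instance, the likelihood .8302 for the myopic class at referential value .0005 and the probability .7162 in Table 5.8.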
Figure 5.7 depicts the individual support of each piece of evidence regarding class
membership – for myopic (blue) and strategic (orange), which is obtained from the
probability of each referential value of each input variable of the training set. Table
5.8 presents the probabilities of the referential values .0005, .1338, and 2.1110 of the
input variable of TS of the first group training set and Table 5.9 presents the
probabilities of the referential values .0010, .1585, and 5.5320 of the input variable
HP.
Table 5.8. The probabilities of referential values of the input variable of TS
Class\referential values .0005 .1338 2.1110
Myopic .7162 .2132 .1373
Strategic .2838 .7868 .8627
Table 5.9. The probabilities of referential values of the input variable of HP
Class\referential values .0010 .1585 5.5320
Myopic .5685 .5180 .2808
Strategic .4315 .4820 .7192
Figure 5.7. Individual support of the referential values of each input variable. [Four bar-chart panels – TS, HP, FB, and ICR – plot the basic probability at the lower boundary, the trained referential value, and the upper boundary of each input variable.]
Several pieces of evidence can be acquired from the probabilities calculated above. The probabilities of the referential values of the input variables of the training set
represent the degree to which the respective referential values of the input variables indicate different class memberships. In this way, we can acquire various pieces of evidence. For example, in Table 5.9 the probabilities of the lower boundary of the input variable HP (i.e. .0010) are .5685 and .4315 for the myopic and strategic classes, respectively. This means that if an observation has an HP input value of .0010, the probability of the observation being myopic is .5685 and of being strategic .4315; in other words, an input value of HP of .0010 constitutes a piece of evidence pointing to the myopic and strategic classes with probabilities .5685 and .4315, respectively.
5.5.4. Analysis of Evidence Interdependence
This section discusses the interdependence index, denoted by α in the MAKER framework, as a measurement of the interdependence between a pair of evidential elements. As explained in Section 4.3, the MAKER-based model is purposely developed to relax the assumption, made when combining evidence under the ER rule, that pairs of evidential elements are independent. The interdependence index can be calculated using Equation (4.14): the first step is to calculate the similarity degrees of the input values for the combination of evidential elements using Equation (4.12); the second step is to apply Equation (4.13) to obtain the joint probability of the pair of evidential elements; the interdependence index between the pair can then be estimated with Equation (4.14).
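The thesis computes the index with Equation (4.14), which is not reproduced here; as an illustration of the idea only, a pointwise dependence ratio behaves analogously – 1 under independence, above 1 for positive dependence, and 0 for disjoint combinations such as {2.1110, .0010} – though its normalisation differs, so it does not reproduce the values in Tables 5.11 and 5.12:

```python
def dependence_ratio(p_joint, p_a, p_b):
    """Pointwise dependence of two evidential elements: 1 = independent,
    > 1 = positively dependent, 0 = disjoint (never observed together).
    Illustrative only; the MAKER interdependence index of Equation (4.14)
    uses a different normalisation and is not reimplemented here."""
    if p_a == 0 or p_b == 0:
        return 0.0
    return p_joint / (p_a * p_b)
```

For instance, if the joint probability factorises into the marginals (p_joint = p_a × p_b), the ratio is exactly 1, matching the intuition that independent evidence carries no extra joint information.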
Table 5.10 displays the joint probabilities of all pairs of evidential elements of the input variables HP and TS indicating the different classes of the output variable: myopic and strategic. These joint probabilities were calculated from the frequencies of the different combinations of referential values of the pieces of evidence from HP and TS under each class membership. The frequencies have at least five samples, except for the combination of referential values {2.1110, .0010}, for which the pieces of evidence are disjoint for both classes.
Table 5.10. The joint probabilities of different combinations of the referential values from input variables HP and TS

Class\combination   {.0005, .0010}   {.0005, .1585}   {.0005, 5.5320}   {.1338, .0010}   {.1338, .1585}   {.1338, 5.5320}   {2.1110, .0010}   {2.1110, .1585}   {2.1110, 5.5320}
Myopic              .5220            .7179            .9128             .2481            .2079            .3090             0                 .1500             .2516
Strategic           .4780            .2821            .0872             .7519            .7912            .6910             0                 .8500             .7484
As already stated, TS depends on HP such that if the value of HP is .0010, there is no possibility of TS having a value of 2.1110. The combination of referential values {1, .5590} for the input variables FB and ICR is also disjoint. Therefore, we defined inequality constraints for all combinations of referential values of the input variables of each group of evidence, except the combination {2.1110, .0010} for group 1 and the combination {1, .5590} for group 2.
The last step is to calculate the interdependence index of a pair of evidential elements with respect to class membership. With the probabilities obtained in Section 5.5.3 – displayed in Tables 5.8, 5.9, and 5.10, which give the basic probability distributions of the input variables TS and HP and the joint probabilities of the pair of pieces of evidence, respectively – the interdependence indices between the pieces of evidence from the input variables HP and TS can be obtained through Equation (4.21).
From Table 5.11, it can be observed that the interdependence indices of the input variables HP and TS generally lie between 1 and 10, meaning that the two input variables are moderately independent of each other, except for the combination of referential values {2.1110, .0010}, which has an interdependence index of 0 (i.e., disjoint). According to Table 5.12, the input variables FB and ICR are likewise generally moderately independent of each other; their interdependence indices lie between 1 and 3. However, some combinations of referential values display high values: the referential value 1.3390 of FB and .5590 of ICR are highly dependent on each other under the ‘myopic’ class, with an interdependence index of 50.7658, and the same holds for the referential values 5 of FB and .5590 of ICR, with an index of 42.3690.
Table 5.11. The interdependence indices between the referential values from the input variables HP and TS

Class\combination   {.0005, .0010}   {.0005, .1585}   {.0005, 5.5320}   {.1338, .0010}   {.1338, .1585}   {.1338, 5.5320}   {2.1110, .0010}   {2.1110, .1585}   {2.1110, 5.5320}
Myopic              1.6079           2.6232           6.1651            1.7218           3.4673           8.1328            0                 4.8174            9.7909
Strategic           2.5845           1.5906           .7065             2.3813           1.5199           1.0724            0                 1.4021            1.0385
Table 5.12. Interdependence indices between referential values from the input variables FB and ICR

Class\combination   {1, –.2939}   {1, .0312}   {1, .5590}   {1.3390, –.2939}   {1.3390, .0312}   {1.3390, .5590}   {5, –.2939}   {5, .0312}   {5, .5590}
Myopic              1.9318        1.8726       0            1.6575             1.8626            50.7658           1.8986        1.9147       42.3690
Strategic           2.1033        2.2008       0            2.0821             2.0921            .5680             2.0510        2.0851       .6888
5.5.5. Belief Rule Base
Once the evidence from the dataset and the interdependence indices between pairs of pieces of evidence have been acquired, we are in a position to develop a belief rule base from which an inference can be made. As stated in Section 4.4, a belief rule should be expressed in the form of Equation (4.22). The ‘IF’ part, expressed as 𝐴1𝑘 ∧ 𝐴2𝑘 ∧ … ∧ 𝐴𝑇𝑘𝑘 (where 𝑇𝑘 is the number of antecedent attributes of the 𝑘th rule) and called a packet antecedent 𝐴𝑘, should be interpreted in this study as a combination of the referential values of the input variables, or ‘if the input value of each input variable is equal to a referential value of that input variable’. The ‘THEN’ part, expressing the probabilities of each consequent, i.e. {(𝐷1, 𝛽1𝑘), (𝐷2, 𝛽2𝑘), …, (𝐷𝑁, 𝛽𝑁𝑘)}, should be interpreted as the probability of a customer with the corresponding input values being strategic or myopic.
Since the ‘IF’ part represents a combination of the referential values of the input variables, the size of a belief rule base equals the product of the numbers of referential values of the input variables. For example, in group 1 there are two input variables, each with three referential values: the lower boundary, the trained referential value, and the upper boundary. Hence, the size of the BRB of group 1 is 3 × 3 = 9. The BRBs of groups 1 and 2 can be seen in Tables 5.14 and 5.15, respectively. It is worth noting that the trained referential values are solutions of the optimisation of the MAKER-ER- or MAKER-BRB-based classifiers; in this section, we also utilise other optimised model parameters, such as the weights of the input variables. Meanwhile, the ‘THEN’ part consists of the consequents, myopic and strategic, with their corresponding probabilities. To obtain the probabilities of a customer being myopic or strategic, the MAKER rule is used to combine the pieces of evidence in a group of evidence, taking the interdependence of pairs of evidence into account, using Equation (4.16) in Section 4.3.3. Using Equation (4.18), we can obtain the weights of the combined evidence from the probability mass 𝑚𝜃,𝑒(𝐿) and the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙, or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿). These weights, called the rule weights and denoted by 𝜃𝑘, are used for inference in the next section. For example, with the calculation explained in Section 4.3.3, in group 1 the probabilities of the combination of referential values {.1338, .0010} being myopic and strategic are .1018 and .8982, respectively, and for the combination {1, –.2939} in group 2 the probabilities are .8022 (myopic) and .1978 (strategic).
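The rule-base size claim can be made concrete by enumerating the packet antecedents of group 1; the belief degrees reported in Tables 5.14 and 5.15 come from the MAKER combination, so they are left as placeholders here:

```python
from itertools import product

# referential values of group 1 (Table 5.5, MAKER-ER-based model)
ts_refs = [0.0005, 0.1338, 2.1110]
hp_refs = [0.0010, 0.1585, 5.5320]

# one belief rule per combination of referential values (packet antecedent);
# the consequent probabilities would be filled in by the MAKER rule
brb = [
    {"antecedent": (ts, hp), "myopic": None, "strategic": None}
    for ts, hp in product(ts_refs, hp_refs)
]
```

With two input variables of three referential values each, the enumeration yields exactly the 3 × 3 = 9 rules of Table 5.14.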
5.5.6. BRB Inference with Referential Values
For discrete or nominal data, making an inference from a belief rule base is a direct
process. For example, if the input vector presents the combination ‘High ∧ Low ∧
High’, then we can obtain a probability 𝑝1 of consequent 1 and 𝑝2 of consequent 2,
171
where 𝑝1 = 𝛽1𝑘 and 𝑝2 = 𝛽2𝑘 from the 𝑘th rule from which the IF rule of ‘High ∧ Low ∧
High’ is mentioned. The inference process with referential values as a discretisation
method is consequently different from that with discrete data. The inference process
is discussed in this section.
A belief rule base was developed in the previous section, each belief rule consisting of a packet antecedent 𝐴𝑘 – a combination of the referential values of the input variables – and the corresponding consequent probabilities. Based on this form, we need to transform the numerical data onto the combinations of the referential values of the input variables, that is, to calculate a similarity degree for each observed value of each input variable. An input value can be transformed using Equation (4.7); the similarity degree indicates the degree to which the input value matches each of the referential values. For example, an observation with the input values {.2105, .3955, 4, .1415} for TS, HP, FB, and ICR, respectively, under the referential values defined in Table 5.5, has the two adjacent referential values of each input variable depicted in Table 5.13. Using Equation (4.12), we can then calculate the joint similarity degree between the observation and the combination of the referential values of each belief rule, i.e. the packet antecedent. These values represent the individual matching degree to which the input vector, or observation, belongs to a packet antecedent 𝐴𝑘, denoted by 𝛼𝑘 for the 𝑘th rule.
Table 5.13. Two adjacent referential values of each input variable of an observation from the customer – type dataset: {.2105, .3955, 4, .1415}
TS HP FB ICR
.1338 .1585 1.3390 .0312
2.1110 5.5320 5 .5590
Since each observed value of an input variable is expressed by its distances to two
referential values, a number of belief rules are activated out of the total, ranging from
1 (for an input vector exactly equal to a combination of the referential values of the
input variables) to 2^𝑁, where 𝑁 is the number of input variables, for an input vector
whose every observed value lies between two adjacent referential values. In this
case, as we have two input variables with three referential values each in every group
of evidence, between 1 and 2^2 = 4 of the 9 belief rules are activated. By using
Equation (4.12), we can obtain the joint similarity degree of
each belief rule in the BRB of each group of evidence, as depicted in Tables 5.14-
5.15 for an observation of {.2105, .3955, 4, .1415}. In these tables, it can be found
that four combinations of the two activated adjacent referential values in Table 5.13
have 𝛼𝑘 > 0, while other combinations of other referential values have 𝛼𝑘 = 0.
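A sketch of how these joint similarity degrees can be computed, assuming (as in many BRB formulations) that Equation (4.12) reduces to an unweighted product of the individual degrees. The individual degrees below are illustrative; the actual 𝛼𝑘 in Table 5.14 also reflect the evidence weights:

```python
from itertools import product as cartesian

# Illustrative individual similarity degrees of the group-1 observation
# {.2105, .3955} to the three referential values of TS and HP; zeros for
# non-adjacent referential values.
ts_deg = {0.0005: 0.0, 0.1338: 0.9612, 2.1110: 0.0388}
hp_deg = {0.0010: 0.0, 0.1585: 0.9559, 5.5320: 0.0441}

# Unweighted-product assumption for the joint degree alpha_k of each rule.
alpha = {(a, b): ts_deg[a] * hp_deg[b]
         for a, b in cartesian(ts_deg, hp_deg)}

# Only the 2 x 2 = 4 rules built from adjacent referential values fire.
activated = [rule for rule, a in alpha.items() if a > 0]
```

The nine products correspond to the nine belief rules of the group, and exactly four of them are non-zero, mirroring Table 5.14.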
Table 5.14. The belief rule base of the first group of evidence and the activated belief rules by an observation of the input variables of group 1 from the customer-type dataset: {.2105, .3955}
Antecedent Belief degree
Rule 𝐴1 (TS) 𝐴2 (HP) Myopic Strategic 𝛼𝑘
1 .0005 .0010 .8117 .1883 0
2 .0005 .1585 .9122 .0878 0
3 .0005 5.5320 .7864 .2136 0
4 .1338 .0010 .1018 .8982 0
5 .1338 .1585 .0836 .9164 .9089
6 .1338 5.5320 .0761 .9239 .0645
7 2.1110 .0010 .2511 .7489 0
8 2.1110 .1585 .1488 .8512 .0249
9 2.1110 5.5320 .1472 .8528 .0018
Table 5.15. The belief rule base of the second group of evidence with activated belief rule base by an observation of the input variables of group 2 from the customer-type dataset: {4, .1415}
Antecedent Belief degree
Rule 𝐴3 (FB) 𝐴4 (ICR) Myopic Strategic 𝛼𝑘
1 1 –.2939 .8022 .1978 0
2 1 .0312 .7989 .2011 0
3 1 .5590 .5640 .4360 0
4 1.3390 –.2939 .0420 .9580 0
5 1.3390 .0312 .0415 .9585 .2682
6 1.3390 .5590 .0878 .9122 .0049
7 5 –.2939 .0659 .9341 0
8 5 .0312 .0641 .9359 .7138
9 5 .5590 .1820 .8180 .0131
At this point, for each belief rule, we have 𝛼𝑘 as an individual matching degree to which
the input values belong to a packet antecedent 𝐴𝑘; the weights of the combined
pieces of evidence of 𝐴𝑘, which are obtained from the probability mass 𝑚𝜃,𝑒(𝐿), the
probability 𝑝𝜃,𝑒(𝐿), and the probability mass 𝑚𝑃(Θ),𝑒(𝐿); and the probability of each
consequence as a result of the combination of pieces of evidence of 𝐴𝑘. Hence, the
weights of the pieces of evidence affect the weights of each belief rule activated by
an observation.
Once we obtain the activated belief rules with the corresponding joint similarity
degrees and their weights, the next step is to combine these belief rules to predict the
probabilities of each consequence (i.e., a customer being myopic or strategic). First,
we need to calculate the updated weight denoted by 𝜔𝑘 of each belief rule in BRB
based on the joint similarity degrees and the associated rule weight 𝜃𝑘 from Equation
(3.11), with 𝐿 referring to the number of belief rules in the BRB. The value 𝜔𝑘 is designed
to measure the degree to which a packet antecedent 𝐴𝑘 in the 𝑘th rule is triggered by
an observation. As stated in the previous section, the weights of the input variables
contribute to the weight of each belief rule, and based on the joint similarity degrees
and those weights, we calculate the updated weight of each belief rule. We can
conclude that the weights of the input variables influence the updated weight of each
belief rule, which measures the degree to which a belief rule is triggered in predicting
the probability of each consequence. Second, given the updated weight of each belief
rule and the associated probability of each consequence, we can combine these
pieces of evidence using the conjunctive MAKER rule as demonstrated in Equation
(4.16). The output of this framework is the probability of a customer being myopic or
strategic.
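In the standard BRB formulation, the updated weight of Equation (3.11) normalises each rule's activation. A sketch under the assumption 𝜔𝑘 = 𝜃𝑘𝛼𝑘 / Σ𝑙 𝜃𝑙𝛼𝑙; the MAKER combination of Equation (4.16) additionally carries the evidence weights, which are omitted here:

```python
def updated_weights(alpha, theta=None):
    """Normalised activation weight of each belief rule.

    alpha: joint similarity degrees alpha_k of the rules;
    theta: rule weights theta_k (equal weights assumed when omitted).
    Returns omega_k = theta_k * alpha_k / sum_l theta_l * alpha_l.
    """
    L = len(alpha)
    theta = theta or [1.0] * L
    total = sum(t * a for t, a in zip(theta, alpha))
    return [t * a / total for t, a in zip(theta, alpha)]

# The four activated rules of group 1 (Table 5.14) with equal rule weights
omega = updated_weights([0.9089, 0.0645, 0.0249, 0.0018])
```

The most strongly matched rule (rule 5 in Table 5.14) dominates the combination, and the updated weights sum to one by construction.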
For example, with an observation of group 1 of {.2105, .3955} for TS and HP
respectively, we obtain a probability of .0778 of being myopic and .9222 of being
strategic. In addition, given an observation of group 2 of {4, .1415} for FB and ICR
respectively, we obtain a probability of .0483 of being myopic, and .9517 of being
strategic. At this point, we can determine the probability of a customer being strategic
or myopic based on a subset, but not all, of the input variables in the input
system. How to generate the probability of each consequence with all the input variables
of the input system considered is discussed in the following section.
5.5.7. Inference of the Top Hierarchy
Based on the previous section, we can obtain the probability of each consequence as
a result of the evidence combination of some but not all the input variables in the
system. As depicted in Figure 4.1, a system consists of some groups of evidence,
each of which features a number of input variables. In the lower levels of the hierarchy,
each group of evidence makes inferences based on the input variables in the input
system of the group. As such, each group of evidence generates the probability of
each consequence of the output system. As we have acquired the MAKER-generated
outputs from the input variables of each group of evidence, we can now combine the
outputs to reach the final inference of the top hierarchy, which is the probability of
being myopic or strategic with all the input variables being considered. We provide
two combination methods: ER- and BRB-based models.
First is the ER rule. According to the previous section, we can acquire the probabilities
generated by the MAKER rule from the input variables of a group of evidence. That is,
an observation of the input variables of a group of evidence generates
the probabilities of class membership. Therefore, we can acquire a piece of evidence
from the observation. As such, we obtain the same number of pieces of evidence as
the number of groups of evidence in the hierarchy.
To combine these pieces of evidence using the ER rule, we need their weights. We
can obtain the weight of each group of evidence from the probability mass 𝑚𝜃,𝑒(𝐿) and
the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙 or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿) using
Equation (4.18) when combining the activated belief rules in the previous section.
Given these pieces of evidence and their weights, we can use Equation (4.16) for
evidence combination, and therefore, we can generate the probability of each
consequence, considering all the input variables in the system. Since the weights of
the input variables of a group of evidence have an effect on the updated weight of
each belief rule, and the weight of a group of evidence is the weight of the combined
activated belief rules, we can conclude that the weight of a group of evidence is
influenced by the weights of the input variables of the group of evidence. As such, in
the top hierarchy, we can conclude that the final inference generated considers the
weights of all the input variables in the system.
For example, in this study, there are two groups of evidence as depicted in Figure 5.4;
as such, we should have two outputs: the probabilities of being myopic and strategic.
Consider an observation with the input values of group 1 being {.2105, .3955} for TS and HP,
respectively, and group 2 of {4, .1415} for FB and ICR, respectively. By following the
procedures in the previous sections and given two groups of evidence, we can obtain
two pieces of evidence as the MAKER-generated outputs: {(1, .0778), (2, .9222)} and
{(1, .0483), (2, .9517)} for groups 1 and 2, respectively. With their weights and using
Equation (4.16), we can generate a probability of .0233 of being myopic and .9767 of
being strategic as a final output of the system, where the probabilities are obtained
with all the input variables in the system together.
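The ER-rule combination of Equation (4.16) is not reproduced in full here; the sketch below shows the general shape of combining two weighted pieces of evidence, where each probability distribution is discounted by its evidence weight and the unassigned mass is redistributed on combination. The weights w1 and w2 below are illustrative assumptions, not the weights derived in the thesis:

```python
def er_combine(p1, p2, w1, w2):
    """Combine two pieces of evidence over the classes {myopic, strategic}.

    A simplified ER-rule-style combination: each probability is discounted
    by its evidence weight, the residual mass 1 - w plays the role of the
    unassigned mass m_P(Theta), and the combined singleton masses are
    renormalised.
    """
    m1 = [w1 * p for p in p1] + [1.0 - w1]   # class masses + residual
    m2 = [w2 * p for p in p2] + [1.0 - w2]
    K = len(p1)
    combined = [m1[k] * m2[k] + m1[k] * m2[K] + m1[K] * m2[k]
                for k in range(K)]
    total = sum(combined)
    return [c / total for c in combined]

# Group-level outputs from the text, with illustrative weights of .8 each
p = er_combine([0.0778, 0.9222], [0.0483, 0.9517], w1=0.8, w2=0.8)
```

With both groups pointing strongly to the strategic class, the combined probability of being strategic reinforces above either input; the exact figures in the text depend on the trained weights.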
Second is the BRB rule. As depicted in Figure 4.1, there are a number of groups of
evidence, each of which consists of some input variables. As stated above, each
group of evidence generates the probability of each consequence. We can make
inferences based on the concept of the belief rule base. To construct a belief rule
base, we must follow the expression of the extended IF-THEN rule, as described in
Section 0, specifically in Equation (4.22). In this state, the antecedent of the belief rule,
written as 𝐴1^𝑘 ∧ 𝐴2^𝑘 ∧ ⋯ ∧ 𝐴𝑇𝑘^𝑘, should be expressed as ‘if a group of evidence
points to class 𝑘’. Therefore, the number of antecedents equals the number of groups
of evidence in the system. In this study, there are two groups of evidence; hence,
there are two antecedents in the BRB. Furthermore, the consequent, written as
{(D1, 𝛽1,𝑘), (D2, 𝛽2,𝑘), … , (D𝑁, 𝛽𝑁,𝑘)}, 𝑘 = 1, … , 𝐿, should be expressed in this state as
‘the probability of a customer being myopic or strategic given the values of the
antecedents’, or we can say ‘the probability of a customer being of the class
membership myopic or strategic, given the results from each group of evidence’.
The antecedents in this study are the outputs generated by each group of evidence.
In addition, the outputs refer to the class membership such that the number of
combinations equals 𝐾^𝐺, where 𝐾 is the number of outputs in the output system, and
𝐺 is the number of groups of evidence in the system. In this study, there are two class
memberships as the outputs, with two groups of evidence formed in the system.
Therefore, we have 2^2 = 4 belief rules, as depicted in Table 5.16 and Figure 5.4.
Table 5.16. The belief rule base of the top hierarchy of inference with the initial belief degrees for the customer-type dataset
No. Antecedent Consequence
𝐴1 𝐴2 Myopic (1) Strategic (2)
1 1 1 1 0
2 1 2 .5 .5
3 2 1 .5 .5
4 2 2 0 1
We suppose that 𝐴^1 is the output generated by group 1; 𝐴1^1 = 1 if the group of
evidence indicates class 𝑘 = 1 (myopic), and 𝐴2^1 = 2 if the group of evidence
indicates class 𝑘 = 2 (strategic). Furthermore, 𝐴1^2 = 1 and 𝐴2^2 = 2 signify that group
2 points to myopic (𝑘 = 1) and strategic (𝑘 = 2), respectively. As we do not have prior
knowledge regarding the belief degrees assigned to each consequence, denoted by
𝛽𝑗,𝑘 for the 𝑗th consequence in the 𝑘th rule as displayed in Equation (4.22), we can
construct a BRB as follows.
• The construction of belief rule base
Given the observed values of the input variables in the input system of each group of
evidence, if both groups of evidence indicate the same class membership, this means
that the observation of all the input variables fully indicates the corresponding class.
For example, the first and fourth belief rules generate a probability of 1 for myopic and
strategic, respectively. If both groups of evidence point to different class memberships,
we cannot say the observation of all the input variables exactly indicates a particular
class membership, meaning that the probability of each consequence can range from
0 to 1. These belief degrees can be trained along with other model parameters
simultaneously. For initialisation, we use the initial belief degrees as listed in Table
5.16. Table 5.17 provides the optimised belief degrees of the belief rule base. We use
these belief degrees for this section as an example.
Table 5.17. The belief rule base of the top hierarchy of inference with the optimised belief degrees of the training set of the first fold for the customer-type dataset
No. Antecedent Consequence
𝐴1 𝐴2 Myopic (1) Strategic (2)
1 1 1 1 0
2 1 2 .1623 .8377
3 2 1 .1771 .8299
4 2 2 0 1
The antecedent in the BRB is defined as ‘a group of evidence points to a class
membership with the probability of 1’. Since each group of evidence generates the
probability of each consequence, which measures the degree to which the observed
values of the input variables within the group indicate a class
membership, we cannot reach a direct conclusion based on the BRB. For the purpose
of demonstration, we use the observed values {.2105, .3955, 4, .1415} and the
optimised model parameters obtained from the MAKER-BRB-based model, including
the optimised referential values in Table 5.5, which are .1802, .1206, 1.0848, and
.0612 respectively for TS, HP, FB, and ICR.
• The calculation of joint similarity degree
For example, the observed values {.2105, .3955} for TS and HP generate the
probabilities {.1371, .8629}, meaning that this observation belongs to 𝐴2^1 to a high
degree (.8629) and to 𝐴1^1 to a low degree (.1371). As such, we can obtain the belief
distribution of the antecedents. Therefore, we can apply Equation (4.12) to obtain the
joint similarity degree between the outputs generated by each group of evidence and
the combination of the antecedents of each belief rule. For example, based on the
probabilities obtained from groups 1 and 2, which are {(1, .1371), (2, .8629)} and {(1,
.2537), (2, .7463)}, respectively, we can obtain the joint similarity degree for each
antecedent as displayed in Table 5.18.
Table 5.18. The joint similarity degree of the outputs generated by group 1: {.1371, .8629} and group 2: {.2537, .7463} from the customer-type dataset
 {𝐴1^1, 𝐴1^2} {𝐴1^1, 𝐴2^2} {𝐴2^1, 𝐴1^2} {𝐴2^1, 𝐴2^2} Total
 .0348 .1023 .2189 .6440 1
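Because each antecedent at this level is a class indication, the joint similarity degree of Equation (4.12) for this layer reduces to the product of the two group-level probabilities, which reproduces the entries of Table 5.18:

```python
g1 = {1: 0.1371, 2: 0.8629}   # probabilities generated by group 1
g2 = {1: 0.2537, 2: 0.7463}   # probabilities generated by group 2

# Joint similarity degree of each antecedent combination (A^1_i, A^2_j),
# rounded to four decimals as in Table 5.18.
joint = {(i, j): round(g1[i] * g2[j], 4) for i in g1 for j in g2}
# → {(1, 1): 0.0348, (1, 2): 0.1023, (2, 1): 0.2189, (2, 2): 0.644}
```

The four joint degrees sum to one, so every observation fully activates the four belief rules of Table 5.16 in proportion to these products.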
• Making inference from activated belief rules
These joint similarity degrees activate four belief rules. As with the previous section,
these values are used to calculate the updated weight of each belief rule. The rule
weights denoted by 𝜃𝑘 can be trained. However, in this study, the rule weights were
set to be equal. The joint similarity degree influences how we invoke the activated
belief rules to contribute to the inference. Since the joint similarity degree is calculated
from the outputs generated by each group of evidence, each of which consists of
some but not all of the input variables in the system, we may conclude that by
combining the outputs in this way, the inference is obtained by considering all the
input variables in the system. The probabilities {(1, .1371), (2, .8629)} and
{(1, .2537), (2, .7463)} obtained from an observation of {.2105, .3955, 4, .1415} generate
the following prediction of class membership: .0901 of being myopic and .9099 of
being strategic.
5.5.8. The Interpretability of Hierarchical MAKER Frameworks
As mentioned previously, a set of model parameters in this study consists of one
trained referential value of the four input variables of the system and the weights of
the evidential elements (referential values) of the four input variables for the MAKER-
ER-based classifier. An additional set of parameters is a set of trained belief degrees
of each consequence of the respective belief rules for the MAKER-BRB-based
classifier. The trained referential values are utilised to obtain pieces of evidence,
which are then combined in the upper level of the hierarchy. Given the optimised
weights of the evidential elements of the input variables of each group of evidence,
we can generate the probability of each consequence. For each group of evidence,
the weights of the input variables impact the updated weight of each activated belief
rule through an observation to predict the probabilities of the classes of the output
system.
In the MAKER-ER-based classifier, given the probabilities generated by the MAKER
rule for each group of evidence and the weight of the combined activated belief rules
of each group of evidence, we can make predictions (i.e., the probabilities of the
classes of the output system with all four input variables considered) in the upper level
of the hierarchy. The weights of the two input variables of each group of evidence
have an impact on the updated weights of each activated belief rule, and the weight
of the combined activated belief rules of each group of evidence has an impact on the
inference made in the upper level.
In the MAKER-BRB-based classifier, the probabilities generated by the MAKER rule
for each group of evidence indicate the degree to which the input variables of each
group of evidence point to each class of the output system. As such, we can calculate
the joint similarity degree for each combination of the antecedents. Given the trained
belief degrees of the consequences of each belief rule in the BRB and the joint
similarity degrees, we can make predictions in the upper level of the hierarchy, which
are inferred based on the four input variables in the system.
Through these two ways (i.e., the MAKER-ER- and MAKER-BRB-based models), we
can bring the predicted outputs (i.e., the predicted probabilities of each class of
the output system) as close as possible to the true observed outputs of the training set,
minimising the MSE score, by optimising the model parameters, including the referential values
of the four input variables, the weights of the evidential elements for both classifiers,
and the trained belief degrees of the relevant belief rules specifically for the MAKER-BRB-based model.
In this study, given the optimised referential values (i.e., the trained referential values)
of the four input variables, we can construct the MAKER-based classifier for an
illustration, specifically how to acquire pieces of evidence from the data. On the basis
of the referential values and other optimised solutions (i.e., the weights as well as the
belief degrees of each consequence in the BRB of the top hierarchy), we can use the
MAKER-ER- and MAKER-BRB-based models to make inferences through the
process described in this section. For the example used earlier, {.2105, .3955, 4,
.1415}, the predicted probabilities of each class are {.0233, .9767} and {.0901, .9099}
for the MAKER-ER- and MAKER-BRB-based models, respectively. Based on the process
established in these classifiers, we can conclude that the MAKER-ER- and MAKER-BRB-based
classifiers constitute an interpretable approach, integrating statistical analysis
in acquiring pieces of evidence, the measurement of the
interdependencies between pairs of evidence, belief rule-based inference in the
MAKER rule, maximum likelihood prediction, and machine learning. Furthermore,
even with the input variables in the system split into multiple groups of evidence, the
inference process established for both classifiers has combined all pieces of evidence
from the lower level in the hierarchy. In every combination process of the pieces of
evidence from the bottom to the top of the hierarchy, the knowledge embedded in a
piece of evidence, including its weights, is continuously forwarded until the final
inference at the top of the hierarchy. In this way, we may conclude that the predicted
system outputs of both classifiers are a result of the inference process over all the
input variables in the system with knowledge representation parameters embedded
in each piece of evidence (i.e., the weights, referential values, and consequent belief
degrees).
5.6. Model Comparisons
In this section, we compare the model performances of the MAKER-ER- and MAKER-BRB-based
models with those of other common machine learning methods for classification,
including logistic regression (LR), support vector machines (SVM), neural networks
(NN), classification trees (CT), Naïve Bayes (NB), k-nearest neighbour (KNN),
distance-based weighted k-nearest neighbour (weighted KNN), linear discriminant
(LD), and quadratic discriminant (QD), on the dataset of the case of customer
classification in revenue management.
As explained in Section 5.4, we utilise five-fold cross validation. The dataset is divided
into five folds with shuffled stratified cross validation to obtain a similar class
distribution. As such, each fold has nearly the same class distribution. In cross
validation, four folds are used as a training set, and the rest act as a test set. The
training set is used to train the model. These optimal parameters, which are obtained
from the model training, are then applied to the test set. If the model can generalise
the pattern of the data, the performance of the models on the test sets is relatively
similar to the performance on the training set. Therefore, in this section, we compare
all the classifiers based on their performances over the five rounds. We provide the
reports for both training and test sets.
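The shuffled stratified split would typically be produced with scikit-learn's StratifiedKFold; a pure-Python sketch of the idea, with toy labels mirroring the 1:4 imbalance described later in this section:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Shuffled stratified k-fold assignment: shuffle the indices of each
    class separately, then deal them round-robin so that every fold keeps
    a near-identical class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

# Toy labels with a 1:4 class imbalance (100 strategic, 400 myopic)
labels = [1] * 100 + [0] * 400
folds = stratified_folds(labels, k=5)
```

Each of the five folds then holds 100 observations with exactly 20 minority-class members, so every round of cross validation trains and tests on the same class distribution.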
We use several performance measures, including accuracy, precision, and recall, with
a threshold value of .5 for the classifiers based on probabilities. Specifically, for
SVM, a threshold value of 0 is used. As already mentioned, the dataset used in this
case study is highly imbalanced (1:4). Precision and recall scores are suitable for
checking whether the classifier can make accurate predictions on both classes
regardless of the existing imbalance in the dataset (Davis and Goadrich, 2006). The
ideal case is obtaining high precision and high recall. We also report the MSE scores
since the MAKER-ER- and MAKER-BRB-based models are optimised under this
objective function. We report the area under the receiver operating characteristic
curve (AUCROC) and the area under the precision-recall curve (AUCPR) scores
since these metrics provide a better measure than accuracy alone.
The higher the AUC score, up to a maximum of 1.0, the better the model
is. A further explanation of the measures can be found in Section 3.8.
It is also worth noting that for SVM, ANN, CT, KNN, and weighted KNN, we should
determine the hyperparameters of these classifiers. A hyperparameter is a parameter
whose value is determined before the learning process begins. We utilise
GridSearchCV in the Python library scikit-learn to find the optimal hyperparameters based on a five-round
model training method. Rather than relying solely on one performance measure
(i.e., accuracy), since this dataset is highly imbalanced (1:4), we use the F-beta score,
whose value is 1 for the best and 0 for the worst. Beta is a weight assigned to the F-beta
score; its values range from 0 to infinity (Maratea et al., 2014). The F-beta
score puts more emphasis on precision when beta is lower than 1 and weights toward
recall when beta is greater than 1. With a beta value of 1, the F-beta
score is exactly the same as the F-measure, which is an equally weighted harmonic
mean of precision and recall, as seen in Table 3.2. F0.5, F1, and F2 measures, the
notations for the beta values of .5, 1, and 2 respectively, are the most widely used
F-beta scores (Maratea et al., 2014). In this study, a beta value of 1 is deliberately
chosen so that the importance of precision is set to be equal to that of recall. The
hyperparameters of classifiers with the highest F-beta score of the left-out data after
the five-round training method are selected as presented in Table 5.19.
Table 5.19. Selected hyperparameters of SVM, ANN, CT, and Weighted KNN for customer type models
Classifier Selected hyperparameter
CT The maximum depth = 3; the minimum samples per leaf = 50;
the minimum size each leaf = 170
SVM Penalty parameter C = 9; the kernel type is radial basis function
kernel.
KNN k = 25
Weighted KNN k = 33
NN Multilayer perceptron is selected; the number of hidden layers =
1; the number of neurons in the hidden layer = 10; the activation
function is rectified linear unit function.
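The selection logic behind GridSearchCV with an F-beta scorer can be sketched in pure Python. The fbeta function below is the standard definition; the evaluate function and the grid values are hypothetical placeholders for an actual train-and-score step:

```python
from itertools import product

def fbeta(precision, recall, beta=1.0):
    """F-beta score: the weighted harmonic mean of precision and recall.
    beta < 1 favours precision, beta > 1 favours recall; beta = 1 gives
    the usual F-measure."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical stand-in for cross-validated training and scoring: a real
# search would fit a classifier per setting and score its held-out folds.
def evaluate(c, gamma):
    return fbeta(0.8 - 0.01 * abs(c - 9), 0.7 + 0.005 * gamma)

grid = {"C": [1, 3, 9, 27], "gamma": [1, 2, 4]}
best = max(product(grid["C"], grid["gamma"]),
           key=lambda cg: evaluate(*cg))
```

Every grid point is scored with the same F-beta criterion and the best-scoring hyperparameter combination is kept, which is what the scikit-learn search does across the five training rounds.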
5.6.1. Accuracies, Precisions, Recalls, and F-beta Scores
As stated earlier, the dataset used in this case study is highly imbalanced (1:4); hence,
we provide the performance measures for each class over five training and test sets.
Table 5.20 provides the F-beta scores for both the training and test sets with ‘myopic’ as
the negative class. As mentioned earlier, we set the beta value to 1. Tables 5.21-5.23
provide the scores of accuracy, precision, and recall, respectively, for each class.
Table 5.20. F-beta scores for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .801 .779 .669 .796 .782 .765 .055
MAKER-BRB .826 .787 .826 .826 .814 .816 .017
LR .445 .580 .447 .435 .398 .461 .069
SVM .695 .720 .720 .715 .709 .712 .011
NN .792 .781 .796 .784 .777 .786 .008
CT .792 .798 .802 .792 .792 .795 .005
NB .520 .509 .488 .471 .461 .490 .025
KNN .749 .763 .773 .758 .759 .761 .009
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .443 .458 .437 .417 .408 .433 .020
QD .498 .583 .509 .473 .458 .504 .048
Test
MAKER-ER .790 .749 .662 .816 .824 .768 .066
MAKER-BRB .815 .735 .836 .836 .814 .808 .042
LR .396 .504 .398 .413 .498 .442 .054
SVM .715 .665 .710 .720 .740 .710 .028
NN .792 .751 .780 .794 .758 .775 .020
CT .802 .786 .780 .804 .802 .795 .011
NB .393 .497 .449 .544 .536 .484 .063
KNN .754 .742 .749 .764 .753 .753 .008
Weighted KNN .775 .744 .754 .755 .784 .762 .016
LD .386 .431 .388 .405 .509 .424 .051
QD .432 .508 .439 .533 .545 .491 .053
Table 5.21. Accuracies for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .892 .881 .845 .888 .884 .878 .019
MAKER-BRB .898 .883 .897 .895 .896 .894 .006
LR .797 .821 .798 .796 .785 .799 .013
SVM .850 .861 .857 .854 .856 .856 .004
NN .881 .880 .886 .879 .878 .881 .003
CT .885 .887 .886 .883 .884 .885 .003
NB .823 .820 .821 .814 .824 .820 .003
KNN .871 .875 .877 .870 .874 .874 .003
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .796 .803 .796 .793 .786 .795 .003
QD .801 .818 .798 .796 .782 .799 .003
Test
MAKER-ER .884 .867 .838 .900 .906 .879 .027
MAKER-BRB .894 .853 .902 .906 .902 .891 .022
LR .790 .788 .785 .783 .822 .794 .016
SVM .859 .833 .855 .859 .869 .855 .014
NN .884 .863 .877 .888 .869 .876 .010
CT .886 .878 .879 .890 .888 .884 .005
NB .823 .819 .822 .812 .812 .818 .005
KNN .869 .859 .869 .875 .869 .868 .006
Weighted KNN .884 .863 .873 .873 .888 .876 .010
LD .790 .784 .793 .783 .824 .795 .017
QD .792 .784 .779 .814 .810 .796 .016
Table 5.22. Precisions of the test sets for customer behaviour classifiers
Model/Iteration 1st 2nd 3rd 4th 5th Average Stdev
Myopic
MAKER-ER .950 .920 .880 .960 .950 .932 .033
MAKER-BRB .730 .930 .970 .970 .970 .914 .104
LR .740 .820 .800 .800 .820 .796 .033
SVM .730 .890 .900 .900 .910 .866 .076
NN .720 .930 .950 .950 .930 .896 .099
CT .730 .960 .950 .960 .960 .912 .102
NB .660 .820 .810 .830 .830 .790 .073
KNN .730 .930 .920 .930 .930 .888 .088
Weighted KNN .760 .920 .920 .920 .930 .890 .073
LD .750 .800 .790 .800 .820 .792 .026
QD .710 .820 .810 .830 .830 .800 .051
Strategic
MAKER-ER .730 .720 .710 .760 .790 .742 .033
MAKER-BRB .720 .680 .760 .760 .730 .730 .033
LR .740 .630 .690 .660 .860 .716 .090
SVM .730 .670 .720 .730 .750 .720 .030
NN .720 .700 .720 .730 .720 .718 .011
CT .730 .710 .720 .740 .730 .726 .011
NB .660 .630 .660 .740 .710 .680 .044
KNN .730 .700 .730 .740 .720 .724 .015
Weighted KNN .760 .720 .730 .750 .760 .744 .018
LD .750 .660 .690 .670 .870 .728 .087
QD .710 .600 .620 .730 .690 .670 .057
Table 5.23. Recalls of the test sets for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Myopic
MAKER-ER .890 .900 .910 .910 .920 .906 .011
MAKER-BRB .880 .870 .900 .900 .890 .888 .013
LR .970 .920 .960 .950 .980 .956 .023
SVM .910 .890 .910 .910 .920 .908 .011
NN .880 .880 .880 .890 .890 .884 .005
CT .950 .920 .940 .950 .940 .940 .012
NB .880 .880 .890 .890 .890 .886 .012
KNN .900 .880 .900 .900 .900 .896 .012
Weighted KNN .920 .900 .900 .910 .920 .910 .012
LD .970 .940 .960 .950 .980 .960 .012
QD .960 .900 .930 .950 .930 .934 .012
Strategic
MAKER-ER .860 .780 .620 .880 .860 .800 .108
MAKER-BRB .940 .800 .930 .930 .920 .904 .059
LR .270 .420 .280 .300 .350 .324 .062
SVM .700 .660 .700 .710 .730 .700 .025
NN .880 .810 .850 .870 .800 .842 .036
CT .890 .880 .850 .880 .890 .878 .016
NB .280 .410 .340 .430 .430 .378 .066
KNN .780 .790 .770 .790 .790 .784 .009
Weighted KNN .790 .770 .780 .760 .810 .782 .019
LD .260 .320 .270 .290 .360 .300 .041
QD .310 .440 .340 .420 .450 .392 .063
The three highlighted numbers in bold are the first-, second-, and third-best classifiers
based on the corresponding measure. All the classifiers listed in the mentioned tables
demonstrate relatively good performance in terms of accuracy, precision, and recall, except LR,
NB, LD, and QD, which are discussed later. We calculate the average score of all the
performance measures for test sets across all the classifiers and over the five-round
validation: .848, .861, .718, .915, .644, and .655 for accuracy, precision of the myopic
class, precision of the strategic class, recall of the myopic class, recall of the strategic
class, and F-beta score respectively.
As an evaluation metric, the F-beta score provides a single score that considers both
precision and recall, with beta as the weight of recall in the combined score. The F-beta
score lies in the range 0 to 1, with 0 being the worst and 1 being the best.
According to Table 5.20, the MAKER-ER- and MAKER-BRB-based classifiers, CT, and
NN are the four best classifiers based on the F-beta scores, ranging between .768 and .808.
Meanwhile, LR, NB, LD, and QD have scores below the average F-beta score
of .655. The performances of the classifiers for each class are compared as explained
below.
The MAKER-ER- and MAKER-BRB-based classifiers and the classification tree provide
the best performance measures among the alternative classifiers for this
case since both proposed models are among the three best classifiers
in terms of accuracy, precision, and recall for the strategic
class. The average recalls of the MAKER-ER- and MAKER-BRB-based
classifiers for the myopic class are .906 and .888, which are close to the grand average
recall of .915 for the myopic class. However, the performance differences between the
proposed classifiers (i.e., the MAKER-ER- and MAKER-BRB-based classifiers) and the
other alternative classifiers are subtle, except for the recall (i.e., sensitivity) of the
strategic class, which is the minority in this dataset with only about 25%.
The MAKER-BRB-based model produces the highest average recall for the strategic class: .904. The
classifier CT also produces a high sensitivity score: .878. The average recall of the
MAKER-ER-based classifier for the strategic class is .800, which is above the grand
average recall of .644. In addition, NN also provides good performance based on its
average recall for the strategic class, which is .842. These four classifiers produce higher
sensitivity scores than the alternative methods. The classifiers LR, NB, LD,
and QD exhibit the lowest recall scores, between .300 and .392, meaning that only a
few of the actual strategic customers are correctly identified. In addition, for LR, NB, LD,
and QD, the scores of accuracy, precision, and recall (specifically for the
strategic class) are always below the grand average of the corresponding scores.
The explanation above suggests that the MAKER-ER- and
MAKER-BRB-based classifiers and the classification tree outperform the other
alternative methods for customer classification in this dataset at the predefined thresholds
of .5 and 0 used to estimate the performance measures: accuracies, precisions, and recalls.
5.6.2. MSEs and AUCs
In this section, we report the mean square errors (MSEs) and the areas under the curves
(AUCs) of all the classifiers. We also provide the ROC and PR curves of the proposed
classifiers, the MAKER-ER- and MAKER-BRB-based models, compared to the other
alternative machine learning methods. Figure 5.8 illustrates the ROC curves of all the
classifiers of all the test sets of the dataset. As displayed in this figure, there are five
lines with different colours presenting the ROC curves for the test set of each round.
Round 1 features the first fold as the test set, round 2 features the second fold as the
test set, and so on. The diagonal red line represents a random classifier. The further
the line moves from this red diagonal line, or the closer the line moves to the top-left corner
of the plot, the better the classifier is. Figure 5.9 demonstrates the PR curves of all
the classifiers of the test sets over the five-round training process. Similar to the
ROC curves, the five lines in the PR curves represent the PR curve for the test set of
each round. The better the discrimination of the classifier, the closer the line moves to
the top-right corner of the curve (see Section 3.8.3). The grey area in both curves
indicates the dispersion of the curves between rounds with ± 1 standard deviation.
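AUCROC also admits a simple probabilistic reading: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative. A pure-Python sketch of that rank-based computation; the label and score vectors are toy data, not the thesis results:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    AUC = P(score of a random positive > score of a random negative),
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
perfect = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # separable case
random_ish = roc_auc(labels, [0.5, 0.2, 0.8, 0.6, 0.4, 0.7])
```

A perfectly separating classifier scores 1.0, while scores that mix the classes fall toward .5, which corresponds to the red diagonal in Figure 5.8.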
[Figure 5.8 comprises one ROC panel per classifier: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, SVM, NN, CT, NB, KNN, weighted KNN, LD, and QD.]
Figure 5.8. The ROC curves of the MAKER-ER-based classifier, the MAKER-BRB-based classifier, and all the alternative machine learning methods on the test sets of the customer-type dataset
[Figure 5.9 comprises one PR panel per classifier: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, CT, SVM, NN, NB, KNN, weighted KNN, LD, and QD.]
Figure 5.9. The PR curves of the MAKER-ER-based classifier, the MAKER-BRB-based classifier, and all the alternative machine learning methods on the test sets of the customer-type dataset
We display the MSEs and AUCs of the classifiers for the training and test sets of all
five rounds in Tables 5.24 and 5.25, respectively. We report these metrics for both
the training and test sets to check whether overtraining occurs. Since these metrics
on the training sets are similar to those on the test sets over the five rounds, we
conclude that overtraining does not occur. This result signifies that similar to the other
machine learning methods, MAKER-ER- and MAKER-BRB-based models can learn
and generalise the pattern of the data and perform well on unseen data.
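A minimal sketch of the two quantities involved in this check (the exact implementation in the thesis may differ; names are ours):

```python
import numpy as np

def mse(y_true, y_prob):
    """Mean square error between the 0/1 class label and the predicted
    probability of the strategic class."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    return float(np.mean((y_true - y_prob) ** 2))

def generalisation_gap(train_scores, test_scores):
    """Gap between the average train and test scores over the folds;
    a near-zero gap is the informal overtraining check described above."""
    return abs(float(np.mean(train_scores)) - float(np.mean(test_scores)))
```

For example, the MAKER-ER MSE averages of .086 (train) and .085 (test) give a gap of about .001, consistent with no overtraining.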
Table 5.24. The MSEs and AUCs of the prediction models (training set) for customer type classifiers
Train
Model/Iteration 1st 2nd 3rd 4th 5th Average Std CI (95%)
AUCROCs
MAKER-ER .948 .948 .934 .947 .944 .944 .006 .939-.949
MAKER-BRB .945 .942 .943 .942 .943 .943 .001 .942-.944
LR .881 .894 .889 .893 .878 .887 .007 .881-.893
SVM .905 .908 .907 .902 .906 .906 .003 .903-.908
NN .919 .919 .918 .913 .916 .917 .003 .915-.919
CT .912 .915 .915 .909 .914 .913 .003 .911-.915
NB .872 .801 .801 .808 .804 .817 .003 .790-.844
KNN .925 .928 .924 .920 .923 .924 .003 .922-.927
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .003 1.000-1.000
LD .887 .903 .895 .900 .883 .894 .003 .886-.901
QD .885 .890 .884 .877 .879 .883 .003 .879-.887
AUCPRs
MAKER-ER .799 .783 .779 .747 .775 .777 .019 .758-.796
MAKER-BRB .789 .784 .744 .778 .757 .770 .019 .751-.789
LR .675 .709 .689 .690 .658 .684 .019 .665-.703
SVM .717 .722 .725 .712 .724 .720 .005 .715-.725
NN .778 .759 .787 .742 .733 .760 .023 .737-.783
CT .795 .803 .799 .803 .804 .801 .004 .797-.805
NB .682 .687 .660 .656 .641 .632 .019 .613-.652
KNN .779 .790 .764 .776 .778 .778 .009 .768-.787
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.000-1.000
LD .680 .719 .696 .696 .664 .691 .021 .670-.712
QD .692 .698 .670 .662 .647 .636 .021 .615-.657
MSEs
MAKER-ER .080 .081 .103 .081 .083 .086 .010 .077-.094
MAKER-BRB .076 .081 .077 .076 .077 .078 .002 .076-.079
LR .143 .138 .142 .144 .149 .143 .004 .140-.146
SVM .109 .106 .106 .108 .107 .107 .002 .106-.108
NN .091 .089 .091 .092 .092 .091 .001 .090-.092
CT .128 .133 .129 .132 .130 .130 .003 .128-.132
NB .088 .086 .086 .088 .087 .087 .003 .086-.088
KNN .094 .090 .090 .093 .091 .092 .003 .090-.093
Weighted KNN .000 .001 .001 .001 .000 .001 .003 .000-.001
LD .146 .142 .146 .147 .151 .147 .003 .144-.150
QD .175 .167 .176 .178 .187 .177 .003 .171-.183
Table 5.25. The MSEs and AUCs of the prediction models (test set) for customer type classifiers
Test
Model/Iteration 1st 2nd 3rd 4th 5th Average Std CI (95%)
AUCROCs
MAKER-ER .936 .939 .938 .947 .961 .944 .010 .935-.953
MAKER-BRB .937 .928 .942 .941 .950 .940 .008 .933-.946
LR .885 .870 .873 .910 .896 .887 .016 .872-.901
SVM .906 .884 .893 .912 .906 .900 .011 .890-.910
NN .911 .902 .899 .929 .912 .911 .012 .900-.921
CT .908 .903 .903 .914 .902 .906 .005 .901-.910
NB .733 .813 .843 .793 .790 .794 .040 .759-.830
KNN .905 .895 .898 .925 .904 .905 .012 .895-.915
Weighted KNN .908 .907 .904 .927 .926 .915 .011 .905-.924
LD .893 .876 .880 .914 .900 .893 .015 .879-.906
QD .882 .859 .878 .913 .885 .883 .019 .866-.900
AUCPRs
MAKER-ER .745 .763 .790 .731 .865 .779 .053 .726-.832
MAKER-BRB .759 .756 .775 .768 .834 .778 .032 .747-.810
LR .681 .631 .645 .704 .756 .683 .050 .633-.733
SVM .700 .668 .701 .710 .712 .698 .018 .680-.716
NN .695 .718 .707 .741 .733 .719 .019 .700-.738
CT .789 .784 .786 .750 .722 .766 .029 .737-.795
NB .640 .634 .644 .695 .723 .667 .040 .627-.707
KNN .729 .704 .694 .755 .708 .718 .024 .694-.742
Weighted KNN .772 .768 .758 .786 .780 .773 .011 .762-.783
LD .699 .630 .654 .705 .756 .689 .049 .640-.738
QD .649 .628 .661 .715 .725 .675 .043 .633-.718
MSEs
MAKER-ER .087 .089 .097 .080 .070 .085 .010 .076-.094
MAKER-BRB .082 .094 .077 .076 .070 .080 .009 .072-.088
LR .147 .159 .148 .142 .127 .145 .012 .134-.155
SVM .109 .121 .109 .108 .104 .110 .006 .105-.116
NN .093 .097 .098 .085 .095 .094 .005 .089-.098
CT .088 .092 .092 .088 .089 .090 .002 .088-.092
NB .139 .128 .120 .133 .138 .132 .008 .125-.138
KNN .094 .103 .100 .090 .096 .097 .005 .092-.101
Weighted KNN .091 .097 .097 .092 .085 .092 .005 .088-.097
LD .150 .165 .153 .147 .126 .148 .014 .136-.161
QD .185 .201 .185 .163 .159 .179 .018 .163-.194
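The 95% confidence intervals reported in Tables 5.24 and 5.25 are consistent with a normal-approximation interval of mean ± 1.96·std/√5 over the five folds. Assuming that formula (an assumption on our part), it can be sketched as:

```python
import numpy as np

def ci95(fold_scores):
    """Normal-approximation 95% confidence interval for the mean of
    per-fold scores: mean +/- 1.96 * sample std / sqrt(number of folds)."""
    x = np.asarray(fold_scores, float)
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return float(x.mean() - half), float(x.mean() + half)
```

For instance, the five training-set AUCROCs of MAKER-ER (.948, .948, .934, .947, .944) reproduce the reported interval of .939-.949.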
As depicted in the previously mentioned tables, based on the performance of the
comparative analysis on the test sets of the five rounds, the MAKER-ER- and
MAKER-BRB-based models outperform the alternative methods with the average
scores and standard deviations of AUCROCs: .944 (.010) and .940 (.008),
respectively. According to Table 3.3 in Section 3.8.3, an AUC between .9 and 1
indicates excellent discrimination. The MAKER-ER- and MAKER-BRB-based models
can be considered to have excellent discrimination because the average scores of
the AUC are above .9: .944 for the MAKER-ER-based model and .940 for the MAKER-
BRB-based model. Meanwhile, the other machine learning methods perform lower,
with average AUCROCs ranging from .794 to .914; these fall into the 'fair' (.70-.80),
'good' (.80-.90), and 'excellent' (.90-1.0) bands. Similar to the
AUCROCs, the MAKER-ER- and MAKER-BRB-based classifiers outperform the
alternative methods, indicated by the average AUCPR of .779 and .778 respectively.
Both classifiers also exhibit the lowest MSE scores with standard deviations as
follows: .085 (.010) for the MAKER-ER-based model and .080 (.009) for the MAKER-
BRB-based model.
5.7. Summary
This chapter presented the application of the MAKER-ER- and MAKER-BRB-based
models to customer classification in revenue management with two outputs, myopic
and strategic, and four input variables regarding customers’ booking behaviour in the
environment of dynamic pricing. This chapter consisted of six main subsections.
First, we presented the theoretical foundations: identified customer types in revenue
management in response to dynamic pricing, tangible purchase behaviour, and the
booking setting used in the case study. Second, we formulated the conceptual
framework of customer classification including the input variables which may
discriminate customer classes, the data linkage which explains how we can obtain
the desired dataset given the available booking and price records, and the detection
procedure to label the customer types. Third, we introduced the data preparation
including data cleaning, and data partitioning used to obtain five groups for five-fold
cross validation applied for all the classifiers.
Fourth, we performed a statistical test on the dataset obtained in the previous section
to determine whether the four input variables can explain the variance in class
membership and whether the input variables are conceptually correlated. Based on
the statistical test, we also described how we created groups of evidence because
the statistical requirement for joint frequencies between pieces of evidence was
violated such that the groups of evidence formed are statistically correct and
theoretically meaningful. Fifth, according to evidence acquisition from the data,
interdependency indices, belief rule-based inference, maximum likelihood prediction,
and machine learning, we described how to construct the MAKER-ER- and MAKER-
BRB-based classifiers for the hierarchical MAKER framework in which the input data
are split into groups of evidence. Given the optimised referential values and the other
optimised model parameters, such as weights, and with the training set of the first
round, we provided a demonstration for both classifiers.
Sixth, with consideration of the highly imbalanced class distribution (1:4), we analysed
the model performance comparison based on accuracy, precision, recall, F-beta,
AUCs and MSE for all classifiers. Based on the analysis, it is evident that the MAKER-
ER- and MAKER-BRB-based classifiers outperform eight of nine alternative machine
learning methods: LR, SVM, NN, NB, KNN, weighted KNN, LD and QD. Meanwhile,
the classification tree exhibits a similar performance to both classifiers. The MAKER-
ER- and MAKER-BRB-based models, as interpretable and robust classifiers, are
recommended for customer classification.
Chapter 6 Application to Customer Decision
Model
6.1. Introduction
This chapter presents the application of hierarchical rule-based inferential modelling
and prediction based on the MAKER framework for predicting customer decisions in
an environment of dynamic pricing. The chapter is structured as follows. Section 6.2
explains the theoretical framework, including possible decisions considered in the
model, the input variables that potentially influence customer decisions, the
hierarchical MAKER framework, and the data linkage to obtain the desired dataset
from the available data in the system. Section 6.3 describes the data preparation,
including data cleaning and data partitioning. Section 6.4 explains how the proposed
classifiers, namely the MAKER-ER- and MAKER-BRB-based models, were built and
trained in this case study. Section 6.5 presents a comparative analysis of model
performances for the proposed framework and alternative methods. A summary of
this chapter is presented in Section 6.6.
6.2. Conceptual Framework: Input Variables and
Decisions
The conceptual framework of the prediction model for customer decisions in an
environment of dynamic pricing was developed on the basis of the literature. This section
explains the conceptual framework, including the following aspects: customer
decisions; input variables, denoting the factors that might possibly influence customer
decisions in environments of dynamic pricing; and data linkage, which describes how
we obtained the desired dataset from the data available in the system.
The booking setting was the same as that discussed in the previous chapter (see
Section 5.2.3). Customers book a ticket and are given time to pay denoted as the
holding period. They can secure the ticket at the price posted when booking, with zero
deposit, and pay later – at any time before the holding period ends. Otherwise, the
ticket is automatically cancelled. In this setting, strategic customers can intentionally
delay their purchase – that is, their payment of the full price, and strategically wait
until lower prices become available.
Revenue management theory was designed for perishable products, such as airline
seats and hotel rooms: the remaining capacity cannot be stored as inventory once the
selling period is over (Talluri and van Ryzin, 2004). At the same time, companies may
have inflexible or limited capacity, which means more capacity cannot easily be
added to meet high demand in the future. Pricing and capacity allocation are the two
major practices of revenue management (Choi and Kimes, 2002), as a means of
balancing supply and demand under capacity restrictions, demand uncertainty, and
various market conditions in order to maximise profit (Talluri and van Ryzin, 2004).
As explained in Section 2.3, when advance booking is applied, travellers often book
before making full payment. Through a guaranteed reservation, they can secure a
seat. During the period up until the departure date, revenue management practice is
applied in the industry, and thus prices and seat availability change over time.
Customers sometimes look for a better deal and rebook if necessary, replacing an
earlier booking with one at a more favourable price (Toh et al., 2012). To obtain a
lower price, they search and update
their information, and learn and evaluate whether they should change their previous
decision (Cleophas and Bartke, 2011).
This study focuses on an additional phase of the purchase cycle that customers
experience during the purchase decision process. As stated above, the information
search-and-evaluation phases can be repeated even after customers place a
reservation, up until the departure date (Schwartz, 2000, 2006). Customers in
advanced booking settings that offer dynamic pricing face uncertainty regarding price
and other related factors, such as product availability. At the same time, they have
the opportunity to maximise the value of the money they spend (Chen and Schwartz,
2008). In this study, we modelled an additional phase in which, after placing a
guaranteed reservation, customers either continue to make full payment right away
or wait in the hope of getting a better deal in the future.
6.2.1. Input variables
People tend to respond to promotions or to any means of gaining a lower price,
including strategic purchasing behaviour (Choi and Kimes, 2002). In addition,
purchase decision-making requires cognitive evaluation of consequences (Christou,
2011). Relevant information may shape a customer's beliefs or perceptions and
hence may influence their decisions. In this study, we included both internal and
external determinants that might influence the purchase decision.
Advanced booking customers have to bear the risk associated with strategic waiting
while observing whether a lower price will become available. If they choose to wait
and their reservation is cancelled, they need time to book again. During that time, the
ticket might no longer be available due to being sold out. Several researchers have
predicted the propensity to buy based on two customer perceptions: the perception of
risk and the perception of benefit (Aviv, Levin, and Nediak, 2009; Chen and
Schwartz, 2008; and Cleophas and Bartke, 2011, to name a few).
Some researchers have experimented with showing information about the remaining
capacity to the participants (e.g. Mak et al., 2014). However, in reality, customers
cannot access such information and may interpret price changes as showing the risk
of sell-out (Chen and Schwartz, 2008). If customers associate the price changes with
the existence of applied revenue management, they may assume that limited seats
mean high demand, hence an increasing price. Li et al. (2014) implicitly modelled the
likelihood of sell-out by operationalising the lowest posted price as a baseline demand
model. Another possible approach is to use a time element. Customers may perceive
a higher risk of sell-out as the desired departure date approaches. Through a
controlled experiment in the hotel case, Schwartz (2000) demonstrated that the closer
to the date of stay, the higher the willingness to pay and consequently the higher the
propensity to book. Related to this, the decision of strategic customers is also affected
by the level of product scarcity (Dasu and Tong, 2010; Mak et al., 2014). Strategic
customers may perceive a higher risk of sell-out when fewer flights are offered in a day.
Revenue management adopters, including airlines and hotels, add cancellation
policies to anticipate unsold inventory – such as seats or rooms – due to last-minute
cancellations and no-shows (Chen, 2016). This is also an effective strategy,
decreasing the no-show rate by about 8% for airlines and 5% for hotels (Dekay,
Yates, and Toh, 2004). In addition, this policy may affect customer decisions in
advanced booking settings. A controlled experiment by Chen et al. (2011) found that
the effect of cancellation deadlines on customer decisions was statistically
significant. In contrast, cancellation fees did not significantly influence customer
decisions. A lenient cancellation policy might induce customers to search extensively
after buying a fully refundable ticket and to rebook once a lower price is available
(Gorin et al., 2012). If the policy is lenient, such as with a distant deadline or zero fee,
customers are more likely to continue searching than to book right away. Similarly, a
'book-now-pay-later' system without a deposit gives customers time to confirm their
booking, whether or not they finalise it by paying the full price. The duration of the
holding period probably affects customers' tendency to search and to wait for a lower
price.
In addition to the external determinants explained above, researchers have also
considered personal factors. Customers who are exposed to exactly the same
information may make different decisions. The first possible explanation is customers'
differing propensity to respond to methods of obtaining a discounted or lower price
(Lichtenstein, Netemeyer, and Burton, 1990; Lichtenstein, Ridgway, and Netemeyer,
1993). Some customers seek to maximise the value gained from the money spent on
a product or service (Kwon and Kwon, 2013). They may experience positive
emotions, such as pleasure and enjoyment, while looking for a better deal (Fortin,
2000). Regardless of whether they are financially or emotionally motivated, getting a
lower price is their goal in shopping (Chandon, Wansink, and Laurent, 2000; Kwon
and Kwon, 2013). In revenue management theory (see Section 5.2.1), four customer
types are identified by their divergent responses to dynamic pricing. In this study, we
considered only the two relevant to our market.
A second possible reason is that customers exhibit different levels of waiting
patience, or willingness to wait. This is used to express the degree of a customer's
strategic behaviour. Two estimation approaches exist: discrete and numerical.
The first segments the market into discrete customer types. Su (2007) categorised
customers into two patience levels, high and low. In an extension of this model,
Besbes and Lobel (2015) interpreted patience as 'the time they are present in the
system', expressed by a discrete value denoted by ω. A value of ω = 0 represents
customers who are completely impatient, whereas ω = 1 means the customer can
postpone their purchase for one period, and so on. Another approach is to use a
discount factor, equivalent to a waiting cost, expressed as a numerical variable to
measure customer patience (e.g. Levin et al., 2009). However, this approach is
complex in terms of identification and computation (Li et al., 2014).
The third possible reason is customers' divergent emotions. Emotions are defined as
a mental state of readiness that is affected by an individual's assessment of events
and thoughts (Bagozzi, Gurhan-Canli, and Priester, 2002). These subjective feelings
are associated with what customers feel during and after evaluation, purchase, and
consumption (Ruth, 2001). The consequences of and feedback from performing a
certain behaviour, including information processing, evaluation, and justification of
decisions, generate an emotional response. Customers feel satisfied, happy, excited,
and thrilled when they find a better deal than they expected or secure a lower price
than others pay. Regret arises when customers choose to wait but then lose the
opportunity due to sell-out, or when they buy the product immediately and it later
becomes available at a lower price. Anticipated regret may shorten the search, since
it can influence a person's desire to perform a certain task (Bagozzi et al., 2002).
Zeelenberg (1999) discusses regret theory in detail.
Studies in revenue management quantify regret as 'stock-out regret' or 'high-price
regret' (Eren and Parker, 2010; Nasiry and Popescu, 2012; and Özer and Zheng,
2015, to name a few). In brief, anticipated regret is measured by comparing the
perceived probability and the actual probability of stock-out and of a high price (e.g.
Özer and Zheng, 2015). The perceived probabilities result from the customer's
observation. Because we could not obtain customers' perceptions, that is, the
perceived probabilities of stock-out or a higher price, we excluded this factor from the
analysis.
The fourth possible reason is customers' differing attitudes toward risk. Studies of
strategic customers in revenue management typically assume that customers are risk
neutral and thus make decisions that maximise their total expected surplus (Swinney,
2011). Liu and van Ryzin (2008) introduced a degree of risk aversion. Through
experiments, Osadchiy and Bendoly (2011) identified three perceived-risk groups
among forward-looking customers: 1) those who correctly perceived the risks of
waiting, 2) those who underestimated the risks, and 3) those who overestimated
them. Identifying these groups requires knowing the customers' decision-making
patterns. In our study, only three months of transaction data were available, which
was not enough to determine each individual's decision-making pattern; longer-term
data, with at least several purchase records per customer, are needed.
The fifth possible reason is divergent customers’ search cost. Through the Internet,
price comparison sites (PCSs) and other similar facilities, customers have access to
a novel and convenient method of searching for a lower price (Jung, Cho, and Lee,
2014). Hence, the search cost for customers is low and tends towards zero
(Clemons, Hann, and Hitt, 2002). Some researchers have even assumed that
strategic customers incur a homogeneous search cost (e.g. Su and Zhang, 2009). This
factor, in the current situation where technology advances enhance customers’
access to information, seems unlikely to significantly influence the purchase decisions
of strategic customers. However, some studies in revenue management have
considered search costs either experimentally (e.g. Schwartz, 2000) or through
mathematical programming (e.g. Wang et al., 2013). Schwartz suggested quantifying
the indirect component of search costs as 'time and/or energy spent', which is not
necessarily literally 'money spent' during an information search; this was to avoid
misleading the decision modelling. However, no further explanation was offered
about quantifying these indirect costs. In this study, we followed the argument of
Clemons et al. (2002) and excluded this factor from the analysis.
To summarise, the literature identifies three major categories: 1) provider-controlled
factors, 2) uncontrolled factors, and 3) personal factors. These are depicted in Figure
6.1. Provider-controlled factors are all factors that can be manipulated by the
provider, such as airlines and travel agents; this group includes price changes,
information about remaining products or inventory, and the cancellation policy.
Uncontrolled factors are those which cannot be manipulated by the provider but
impact customer decisions, including customer arrival time (i.e. time before
departure) and the number of flights offered in a day. Personal factors are individual
factors specific to customers, which vary from person to person and strongly
influence their purchase decisions.
The framework in Figure 6.1 was refined based on the available data as previously
explained. All the input variables considered in this study are depicted in Figure 6.2
and explained below.
[Figure 6.1 groups the candidate determinants into: provider-controlled factors (price changes; demand information; cancellation policy: deadline and fee); uncontrolled factors (time before consumption date; product scarcity: number of flights offered a day); and personal factors (consumer's response toward dynamic pricing; waiting patience; consumer's regret; consumer risk profile; search cost), feeding the estimated lower rate (ELR) and estimated sell-out risk (ESR), and ultimately the propensity to buy.]
Figure 6.1. Conceptual framework for decisions by advanced booking customers under dynamic pricing
[Figure 6.2 retains: provider-controlled factors (price changes; cancellation deadline, i.e. the length of the holding period); uncontrolled factors (days before departure time; number of flights offered a day); and personal factors (consumer's response toward dynamic pricing; waiting patience), leading to the wait-or-buy decision.]
Figure 6.2. Conceptual framework for decisions by advanced booking customers under dynamic pricing after refinement
Price changes. Chen and Schwartz (2008) examined price-change patterns,
categorised into four conditions: increasing, decreasing, fluctuating, and no change.
The magnitude of the price changes did not affect the categorisation. Other
researchers have utilised price reductions or discounts to express the magnitude of
the benefit customers can potentially obtain if they choose to wait (e.g. Cleophas and
Bartke, 2011). This tactic is usually employed for downward price trends. Li et al.
(2014) included the price average and price volatility, that is, the standard deviation
and coefficient of variation, in their model. The model focused on predicting the
percentage of strategic customers present in the market, with various degrees of
foresight: perfect, strong, and weak. In this study, we used historical price trends to
indicate the magnitude and direction of the price changes. Negative, positive, nearly
zero, and absolute-zero values of the price trend can be interpreted as decreasing,
increasing, relatively stable, and no change respectively. Assuming that customers
observed the price changes after the booking was placed, price trends could be
estimated during the holding period. How this estimation was made is discussed in
the next section.
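The worked example in Figure 6.3 suggests that the price trend is the relative change between consecutive posted prices, e.g. (407,000 − 473,000)/473,000 ≈ −.1395. A minimal sketch under that reading (the averaging step for APT is our assumption):

```python
def price_trend(prev_price, new_price):
    """Relative change between two consecutive posted prices; negative
    means a decreasing price, positive increasing, zero no change."""
    return (new_price - prev_price) / prev_price

def average_price_trend(prices):
    """APT sketch: mean of the consecutive relative changes observed
    during the holding period (the averaging rule is an assumption)."""
    trends = [price_trend(a, b) for a, b in zip(prices, prices[1:])]
    return sum(trends) / len(trends)
```

The two trend values shown in the price records of Figure 6.3 (−.1395 and −.1081) are reproduced by `price_trend` on the corresponding price pairs.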
Length of holding period. A cancellation policy, including the cancellation deadline
and fee, hypothetically impacts strategic purchase behaviour (Chen et al., 2011).
However, in the experiment by Chen et al., only the cancellation deadline statistically
influenced customers' decisions: the stricter the deadline, the higher the tendency to
keep searching rather than book now. The cancellation fee has not been shown to
have a statistically significant impact on customers' decisions. In the case of 'book
now, pay later', the cancellation fee is zero, meaning no deposit is required to place a
booking. However, customers are only allowed to hold the ticket for several hours or
days, denoted by the length of the holding period. This timing element hypothetically
influences customers' propensity to buy the ticket instead of waiting. The longer the
holding period, the stronger their tendency to cancel and rebook, with the expectation
of obtaining a lower price in the future.
Days before departure time. Customer arrival time is defined from the first contact
the customer makes, that is, the time of the first search for a ticket to fulfil their need
(Schwartz, 2000). In this study, we defined first contact as the first booking made by
the customer, as we had no access to customers' search records. Hence, we
assumed that once a customer placed a booking, they had confirmed their travel plan
and narrowed down all possible alternatives, regardless of how long they had
searched for information, and had chosen the best of the alternatives.
The number of available flights. This refers to the number of flights offered on a
day. The number can change dynamically due to sell-outs. In addition, each city pair
has a particular range for the number of flights offered on a day. For example, the
city pair NTXBTH normally provides two flights per day; however, before the
departure date both could be sold out, as shown by a zero value for the minimum
number of flights. Similarly, the city pair SUBCGK provided 52 flights a day, and the
minimum number of flights on record was 50; this means that before the departure
date, two flights had no remaining seats and were no longer visible on the website.
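A sketch of how NF might be derived from one price snapshot (the record layout and field names are assumptions, not the thesis's schema):

```python
from collections import defaultdict

def flights_per_day(price_records):
    """Count the distinct flights (departure time + carrier) still visible
    in a price snapshot, per departure date, for one city pair; sold-out
    flights disappear from the snapshot, so the count can drop below the
    city pair's normal range."""
    flights = defaultdict(set)
    for rec in price_records:
        day = rec["departure"].split(" ")[0]  # e.g. '23/09/17'
        flights[day].add((rec["departure"], rec["carrier"]))
    return {day: len(f) for day, f in flights.items()}
```

Applied to the third price-record block of Figure 6.3 (Batik, Lion, ..., Garuda on 23/09/17), this would count the flights still offered for that departure date.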
Waiting patience. Besbes and Lobel (2015) discretised the waiting patience level
because their model was time-dependent and they sought optimal strategies for each
period; discretisation was the most convenient way to reduce the model's complexity.
To include different levels of waiting patience, illustrating how long customers remain
in the system, in this study we defined 'waiting patience' as the accumulated time for
which the customer was present in the system before their final purchase. The final
purchase was indicated by paying the price in full.
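As an illustration of one possible operationalisation (the thesis's exact accumulation rule is not restated here, so this is an assumption), WPT can be read as the elapsed time from the customer's first booking to the payment in full:

```python
from datetime import datetime

def waiting_patience(first_book_time, full_payment_time):
    """WPT sketch: time the customer was present in the system, read here
    as first booking through to the final, paid-in-full confirmation."""
    return full_payment_time - first_book_time
```

For the passenger in Figure 6.3, the first booking at 20/09/17 21:34:55 and the payment at 21/09/17 21:21:39 give a WPT of roughly 23.8 hours.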
Customer’s responses to dynamic pricing. We considered two customer
responses to dynamic pricing: myopic and strategic. Myopic customers always buy
the product at the price best fitted to their valuation. Strategic customers intend to
time their purchase strategically to obtain an expected lower price. Section 5.3.2
explains the identification of these customer types; the labels obtained from that
process were used in this study.
6.2.2. Decisions
As explained in Section 2.3, the decision-making process in any business that uses
advance booking, such as airlines or hotels, differs from that of other products. The
difference is that the actual purchase is not necessarily the same as placing a
reservation or booking. Customers may gather more information after booking and
might change their decision about whether to keep the reservation. This process may
be repeated until shortly before departure time. According to the literature on revenue
management, three main decision models exist, explained as follows.
Wait or buy. In this concept, it is assumed that customers may choose to postpone
their purchase and never leave the market (e.g. Anderson and Wilson, 2003;
Cleophas and Bartke, 2011). This model focuses on the factors that influence the
odds of a customer choosing to buy rather than wait.
Buy now, wait, or exit. This model extends the 'wait or buy' concept by allowing
customers to choose another alternative, that is, the second-best alternative (e.g.
Chen and Schwartz, 2006; Li et al., 2014; Su, 2007).
Four decisions. Schwartz (2000) introduced the advanced booking decision model
(ABDM) to account for online-savvy customers. In this case, customers face a
booking restriction, that is, a cancellation policy. Customers may place a reservation
after evaluating the product and then take no further action; this is the 'book'
strategy. They may choose to search for information and then decide which product
best fits their needs, which is the 'search' strategy. They may book to minimise the
risk of stock-outs, then continue to search for more information and rebook if a
cheaper price for the same product becomes available; this is the 'book then search'
strategy. Like the other models, Schwartz's model also accounts for people booking
other alternatives, namely the 'exit' strategy.
In the setting of 'book now, pay later' (see Section 5.2.3), payment can be made at
any time between the reservation and the end of the specified holding period. If by
the end of the holding period the customer has not paid for the reservation, the
system automatically cancels it. The customer can evaluate the alternatives and then
book a ticket according to what they believe is the best option. Once the booking is
made, they are likely to monitor the price changes over time until departure, along
with other related information such as product availability. This information may
change their decision about whether to proceed with payment or to cancel the
reservation and book again if a more favourable ticket is offered.
The focus of this study is on developing a model to predict customer decisions and
on understanding what factors affect those decisions; specifically, what factors can
induce customers to buy rather than wait. We included two customer decisions:
'wait' and 'buy'. 'Wait' refers to a customer placing a booking and then letting all or
some of the holding period pass before continuing with the booking for the same
flight, that is, with no changes to their itinerary. 'Buy' refers to a customer placing a
reservation and making payment before the holding period lapses.
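A hypothetical labelling rule consistent with these definitions, using the book-limit and confirmation-time fields of Figure 6.3 (the exact rule used in the thesis may differ):

```python
def label_decision(book_time, book_limit, confirmation_time):
    """Hypothetical labelling: 'buy' if payment (confirmation) arrives
    before the holding period lapses; 'wait' if the reservation runs to
    the book limit unpaid and is auto-cancelled."""
    if confirmation_time < book_limit:
        return "buy"
    return "wait"
```

In the Figure 6.3 example, the first two bookings (confirmation equal to the book limit) would be labelled 'wait', while the third (paid before the limit) would be labelled 'buy'.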
6.2.3. Data linkage
As previously mentioned, in this study we considered six input variables: 1) average
price trend; 2) the length of the holding period; 3) days before departure time; 4)
number of flights offered on a day; 5) waiting patience time; and 6) customer types,
summarised as APT, HP, DD, NF, WPT, and C. In this section, we explain how we
retrieved the desired dataset from the available transaction and price records, as
depicted in Figure 6.3. Codes B1-B9 mean the data were retrieved from transaction
or booking records; codes P1-P9 were from price records. APT, HP, DD, and NF can
be obtained from transaction records.
Booking records (codes B1–B9: Name, Origin-Destination, Departure date & time, Carrier/Provider, Price (Rps), Book time, Book limit time, Status, Confirmation time):

1st AS | CGK-SRG | 23/09/17 18:00 | Batik | 473,000 | 20/09/17 21:34:55 | 21/09/17 07:04:00 | 1 | 21/09/17 07:04:00
2nd AS | CGK-SRG | 23/09/17 18:00 | Batik | 440,000 | 21/09/17 08:25:46 | 21/09/17 17:55:00 | 1 | 21/09/17 17:55:00
3rd AS* | CGK-SRG | 23/09/17 18:00 | Batik | 363,000 | 21/09/17 20:10:53 | 22/09/17 05:40:00 | 2 | 21/09/17 21:21:39

Note: *) In this example, we give only the passenger’s initials for data privacy

Price records (codes P1–P5: Updating date & time, Origin-Destination, Departure date & time, Carrier/Provider, Posted price (Rps)), with the derived price trend:

Price trend for the 1st booking, with B6 ≤ P1 ≤ B9:
20/09/17 22:20:36 | CGK-SRG | 23/09/17 18:00 | Batik | 473,000
20/09/17 02:17:34 | CGK-SRG | 23/09/17 18:00 | Batik | 473,000 | 0
20/09/17 06:15:48 | CGK-SRG | 23/09/17 18:00 | Batik | 407,000 | -.1395

Price trend for the 2nd booking:
21/09/17 10:20:25 | CGK-SRG | 23/09/17 18:00 | Batik | 407,000
21/09/17 14:18:17 | CGK-SRG | 23/09/17 18:00 | Batik | 363,000 | -.1081

Number of flights offered on the day for the 3rd booking (P1 is the time closest to B9):
21/09/17 21:21:27 | CGK-SRG | 23/09/17 18:00 | Batik | 363,000
21/09/17 21:21:27 | CGK-SRG | 23/09/17 15:30 | Lion | 316,800
...
21/09/17 21:21:27 | CGK-SRG | 23/09/17 19:35 | Garuda | 668,500

Derived variables: DD = B3 – B6; HP = B7 – B6.
Figure 6.3. Data linkage for customer decision model
From booking records, we obtained information about the passenger’s name, origin
and destination, departure date and time, carrier or provider, price, booking time,
booking limit time, booking status, and confirmation time. Booking time refers to the
point at which the customer placed a guaranteed reservation to secure a seat.
Booking limit time represents the cancellation deadline. Customers can pay later for
the reserved seat, up to the booking limit time, without worrying about the price
increasing. The status of the booking was denoted by B8 and coded as 1, 2, 3, or 4, meaning cancelled by the system, issued, booked, or cancelled on request, respectively. The time when the status changed – for example, from booked (3) to issued (2) – was recorded in B9. In other words, B9 showed the confirmation time and thus indicated when customers made their decision. ‘Cancelled on request’ differed from ‘cancelled by the system’: the former was a cancellation requested by the customer; the latter occurred if the customer did not pay before the booking limit time.
A detailed example from the records is shown in Figure 6.4. The passenger (AS)
made three bookings for a 23 September 18:00 flight, with Batik Air as the provider,
from CGK to SRG. The first attempt was made on 20 September at 21:34:55. The
provider or agent gave the passenger time to secure the ticket by 21 September
07:04:00 or it would be automatically cancelled by the system. DD refers to the time
gap between booking and departure; that is, DD = B3 – B6. This variable (DD)
indicates how close the booking was to the travel date. HP represents the gap
between booking time (here, 20 September 21:34:55) and the booking limit (21
September 07:04:00), which showed how long the passenger held the seat or waited
before securing it, while checking the price changes. As mentioned earlier, the longer
the holding period, the more likely customers are to continue to search and wait for
lower prices. The values in this example were DD of 2.8508 days and HP of .3952
days.
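The DD and HP calculations above can be reproduced directly from the timestamps in Figure 6.3. A minimal sketch using Python’s standard `datetime` module (the variable names `b3`, `b6`, `b7` follow the column codes; the helper function is illustrative, not the thesis code):

```python
from datetime import datetime

FMT = "%d/%m/%y %H:%M:%S"

def days_between(later: str, earlier: str) -> float:
    """Return the gap between two timestamps as a fraction of days."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds() / 86400.0

# First booking of passenger AS (values from Figure 6.3)
b3 = "23/09/17 18:00:00"  # departure date & time
b6 = "20/09/17 21:34:55"  # booking time
b7 = "21/09/17 07:04:00"  # booking limit time

dd = days_between(b3, b6)  # days before departure: DD = B3 - B6
hp = days_between(b7, b6)  # holding period:        HP = B7 - B6
print(round(dd, 4), round(hp, 4))  # 2.8508 0.3952
```

The printed values match the DD of 2.8508 days and HP of .3952 days quoted above.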
[Figure: timeline of passenger AS’s booking journey. Bookings placed at A1 (20/09/17 21:34:55), A2 (21/09/17 08:25:46), and A3 (21/09/17 20:10:53); booking limits at B1 (21/09/17 07:04:00), B2 (21/09/17 17:55:00), and B3 (22/09/17 05:40:00); decision points C1 = B1, C2 = B2, and C3 (21/09/17 21:21:39). Holding periods of .3952, .3953, and .3952 of a day (the last only partially used: .0491); gaps between attempts of .0568 and .0944 of a day. Cumulative waiting patience at C1, C2, and C3: .3952, .8473, and .9908 of a day.]
Figure 6.4. Example of a booking journey
To examine the price changes over time during the holding period, we collected the prices updated between B6 and B9, that is, B6 ≤ P1 ≤ B9. These results are shown in the second and third panels in Figure 6.3. The price trend was calculated as PT_t = (P_t − P_{t−1}) / P_{t−1}, where PT_t is the price trend at time t and P_t is the posted price updated at time t. APT is the sum of all price trends observed during the holding period divided by the number of observations. For example, for the first booking, the price trends were 0 and -.1395; hence, the APT is the average of those values, -.0698. The APT for the second booking was -.1081.
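The trend and APT computations can be sketched as follows, using the posted prices of the first booking from Figure 6.3 (the helper function is illustrative, not the thesis code):

```python
def price_trends(prices):
    """PT_t = (P_t - P_{t-1}) / P_{t-1} for consecutive posted prices."""
    return [(p - q) / q for q, p in zip(prices, prices[1:])]

# Posted prices observed during the first booking's holding period (Figure 6.3)
prices_first = [473_000, 473_000, 407_000]
trends = price_trends(prices_first)
apt = sum(trends) / len(trends)  # average price trend over the observations
print([round(t, 4) for t in trends], round(apt, 4))  # [0.0, -0.1395] -0.0698
```

The output reproduces the trends 0 and -.1395 and the APT of -.0698 quoted above.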
The input variable NF indicates the number of flights available when the customer made a decision at B9. In addition to conveying information about demand, such information could influence the customer’s perception of product scarcity. Because prices were updated every three to four hours, it was not guaranteed that we could obtain information about the available flights exactly at B9; therefore, we used the update closest in time to B9 to represent the conditions when the customer made their decision. This approach enabled us to obtain the list of all flights available on 23 September from all providers. In this example, there were approximately 23 flights available when the customer made their decision at B9.
The input variables WPT and C were acquired from the booking records, as depicted in Figure 6.3. To illustrate WPT, Figure 6.4 depicts the booking journey of the passenger AS shown in Figure 6.3. For the first attempt, the customer waited the whole holding period and let the ticket be cancelled by the system at B1; hence C1 = B1. At this point, the customer had been present in the system for .3952 of a day. Later, they booked again at A2 after waiting for another .0568 of a day and did not pay at B2. Therefore, they had waited for .8473 of a day at C2. For the last booking, the customer was given .3952 of a day – that is, the time from A3 to B3 – but spent only .0491 of a day; at C3, they decided to buy the ticket. The values of WPT for the first, second, and third bookings were therefore .3952, .8473, and .9908 of a day, respectively. The customer type (C) was obtained from the customers labelled through the systematic tracking in Section 5.3.2. Each customer was labelled as either myopic or strategic, with myopic coded as 0 and strategic as 1.
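The WPT accumulation along this booking journey can be sketched as a running total over the timeline segments of Figure 6.4 (segment values from the source; the segment/decision-point bookkeeping is illustrative, not the thesis code):

```python
# Segments of the booking journey in Figure 6.4, in fractions of a day:
# alternating holding periods and gaps between attempts.
segments = [
    0.3952,  # A1 -> B1: first holding period, fully used (C1 = B1)
    0.0568,  # B1 -> A2: gap before the second booking
    0.3953,  # A2 -> B2: second holding period, fully used (C2 = B2)
    0.0944,  # B2 -> A3: gap before the third booking
    0.0491,  # A3 -> C3: part of the third holding period, until payment
]
decision_after = [0, 2, 4]  # segment indices ending at decision points C1, C2, C3

wpt, total = [], 0.0
for i, seg in enumerate(segments):
    total += seg
    if i in decision_after:
        wpt.append(round(total, 4))  # cumulative waiting patience at each C
print(wpt)  # [0.3952, 0.8473, 0.9908]
```

The accumulated values match the WPT of .3952, .8473, and .9908 of a day for the three bookings.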
6.3. Data preparation
Real-world data may be incomplete, noisy, and inconsistent, which leads to low performance, poor-quality outputs, and useful patterns remaining hidden (Zhang, Zhang, and Yang, 2003). Data preparation is required to deal with such issues and yield quality data. It includes data integration, data transformation, data cleaning, data reduction, and data partitioning (Zhang, Zhang, and Yang, 2003), and is therefore required before model development. This study mainly used data integration, data cleaning, and data partitioning. Data integration is the combination of technical and business processes used to combine data from different sources into the desired dataset, that is, meaningful and valuable information (Hendler, 2014). Section 6.2 presents the data linkage used to obtain the dataset for this study. Data cleaning includes dealing with missing values, noisy data, and outliers, and resolving inconsistencies (Zhang, Zhang, and Yang, 2003). Data partitioning is a technique for dividing the dataset into multiple smaller parts.
The focus of the study was to examine what factors influence the ‘wait’ or ‘buy’ decision in a dynamic-pricing environment. Hence, for data cleaning, we minimised the possibility of customers rebooking for other reasons, such as changes to their travel plan. Customers may choose to cancel, or let the system automatically cancel, their booking, usually because of travel-plan changes. This could bias the model or disguise useful information. Therefore, we considered the ‘waiting’ state only when customers appeared to have a fixed travel plan, evident through no changes being made to the origin-destination, departure date, number of passengers, or adult-child-infant ratio.
For example, suppose a customer makes four booking attempts. On the second attempt, they change their planned departure date, and then they repeatedly delay the purchase until the fourth attempt, when the ticket is issued. In this case, we would omit the first attempt, consider ‘waiting’ to span the second and third attempts, and treat the fourth attempt as the ‘buy’ state. In addition, all records for customers who did not pay in the end were categorised as ‘exit’ and were deleted from our dataset.
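The cleaning rule above can be sketched on a toy sequence of attempts. This is a hedged illustration: the record fields and the status code 2 = issued follow the description in Section 6.2.3, but the data structure itself is hypothetical, not the thesis schema:

```python
# Hypothetical booking attempts for one customer; itinerary = (route, date).
# Status code 2 = issued (paid), per the booking-record coding in Section 6.2.3.
attempts = [
    {"itinerary": ("CGK-SRG", "22/09/17"), "status": 1},  # original travel plan
    {"itinerary": ("CGK-SRG", "23/09/17"), "status": 1},  # plan changed here
    {"itinerary": ("CGK-SRG", "23/09/17"), "status": 1},
    {"itinerary": ("CGK-SRG", "23/09/17"), "status": 2},  # ticket issued
]

final = attempts[-1]
if final["status"] != 2:
    kept = []  # customer never paid: an 'exit', so drop all their records
else:
    # keep only attempts sharing the final (fixed) travel plan
    kept = [a for a in attempts if a["itinerary"] == final["itinerary"]]

# all kept attempts but the last are 'wait' states; the last is 'buy'
labels = ["wait"] * (len(kept) - 1) + ["buy"]
print(len(kept), labels)  # 3 ['wait', 'wait', 'buy']
```

As in the four-attempt example above, the first attempt is omitted and the remaining three become two ‘wait’ states and one ‘buy’ state.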
In data partitioning, we utilised five-fold cross-validation with stratified random sampling. The data were divided into five folds with a similar class distribution. If customers made several attempts, or bought more than once, they would have multiple data points in the dataset. In this condition, it is advisable to shuffle the dataset, that is, to randomly reorganise it. The partitions obtained through k-fold cross-validation with shuffling generally derive from different customers, which prevents the model from learning the patterns of particular customers. We employed stratified five-fold cross-validation with shuffling in Python to partition the dataset into five folds. Each fold was treated in turn as the test set, while the remaining folds acted as the training set. Therefore, we obtained five rounds for each classifier.
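The partitioning step can be sketched in plain Python. This is a minimal re-implementation of what stratified shuffled k-fold splitting (e.g. scikit-learn’s `StratifiedKFold` with `shuffle=True`, a plausible but unconfirmed choice for the thesis workflow) does:

```python
import random

def stratified_kfold(y, k=5, seed=0):
    """Shuffle indices within each class, then deal them round-robin into
    k folds so every fold keeps a similar class distribution."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(y):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)  # shuffling breaks up per-customer runs of records
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

# Toy labels: 0 = 'wait', 1 = 'buy', imbalanced for illustration
y = [0] * 80 + [1] * 20
folds = stratified_kfold(y, k=5)
for f in folds:
    print(len(f), sum(y[i] for i in f))  # each fold: 20 points, 4 'buy'
```

Each fold then serves once as the test set, with the other four folds as the training set.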
6.4. Hierarchical Rule-based Models for Predicting
Customer Decisions
In this section, the building of MAKER-ER- and MAKER-BRB-based classifiers for
predicting customer decisions is demonstrated. A numerical study using the described
dataset is presented in the remainder of this section. As previously stated, we had six
input variables: APT, HP, DD, NF, WPT, and C to predict the customer’s decision
either to wait or to buy. The definitions of these variables and decisions are detailed
in Section 6.2. In addition, the data were shuffled and partitioned based on stratified
random sampling into five folds with similar class distribution. The training set of the
first fold is used here to illustrate how MAKER-ER and MAKER-BRB frameworks were
applied in this case study, that is, to a customer-decision dataset.
6.4.1. Hierarchical MAKER frameworks
A minimum of five cases per cell of the joint frequency matrices between input variables, except for disjoint pieces of evidence, must be satisfied to implement a full MAKER framework. MAKER-ER- and MAKER-BRB-based models are designed for when this statistical requirement is not satisfied. To group input variables, one starts with the input variable having the strongest impact on the model outcome, then adds the other input variables one by one, ensuring that the joint frequency matrices of each pair of input variables within a MAKER model fulfil the five-cases-per-cell requirement. According to Table 6.1, the input variables ranked from strongest to weakest correlation with the output variable were WPT, C, HP, NF, DD, and APT. Hence, customer decisions were most strongly influenced by the personal factors WPT and C. The next input variable, HP, indicated the opportunity for customers to exploit dynamic pricing. Then NF, DD, and APT shaped the customer’s perception of the benefit and risk of strategic waiting, which in turn impacted their purchase decision.
Table 6.1. Descriptive Statistics and Correlation Matrix

Factor | Min | Max | Mean | SD | Decision | WPT | NF | HP | DD | APT
WPT | .000 | 31.796 | .361 | .394 | -.341** | | | | |
NF | 2 | 100 | 18.661 | 1.445 | .041* | .015 | | | |
HP | .006 | 6.732 | .405 | 14.266 | .140** | .223** | .117** | | |
DD | .182 | 63.627 | 3.225 | .875 | .026 | .243** | .112** | .577** | |
APT | -.563 | 1.674 | .007 | 5.698 | .015 | .034* | .045** | .094** | .079** |
C | | | | | .415** | -.497** | .058** | -.066** | -.113** | -.014

Note: * correlation is significant at .05 (2-tailed); ** correlation is significant at .01 (2-tailed)

Based on this order, we added the input variables one by one to WPT until all the joint
frequency matrices between input variables had five cases per cell, except for those
where pieces of evidence were disjoint due to structural zeros. Input variables which
could not satisfy this condition were excluded and formed another group of evidence.
In this way, we defined groups of evidence as depicted in Figure 6.5. The MAKER-generated output of each group of evidence gives the probabilities of a customer choosing to wait or to buy. These outputs were then aggregated to suggest a final inference on whether the customer chooses to wait or buy, given the values of the six input variables.
[Figure: two diagrams of the hierarchical MAKER framework. In both, six input variables – waiting patience time, average price trend, the length of the holding period, days before departure time, number of flights offered in a day, and consumer type – feed three MAKER-based classifiers, each producing the MAKER-generated outputs Buy (g) and Wait (g) with probabilities p_g and 1 − p_g (g = 1, 2, 3). One diagram, the MAKER-BRB-based model, routes these outputs through a set of rules (k) before the final inferences Buy (f) and Wait (f); the other, the MAKER-ER-based model, combines the outputs directly into the final inferences. Legend: (1), (2), (3) = outputs generated by groups 1, 2, 3, respectively; (f) = final inference.]
Figure 6.5. Hierarchical MAKER framework for customer decision prediction
6.4.2. Optimised Referential Values of the Model
This section demonstrates how to develop a classifier based on MAKER-ER- and
MAKER-BRB-based systems for a customer-decision model. A numerical study is
presented here, using the dataset explained in Section 3.3. We split the input variables into three groups of evidence: WPT and APT as group 1, HP and DD as group
2, and NF and C as group 3. The output variable was customer decision: wait or buy.
‘Wait’ means the customer chose not to pay and rebooked in the future. ‘Buy’ means
the customer paid before the holding period ended.
As explained above, the data were partitioned into five folds with similar class
distribution, with the data shuffled beforehand. For the purpose of illustration, we use
the first fold as an example in this section. The model parameters – that is, referential
values and weights – were assigned to develop a MAKER framework. We used the
optimised parameters of the first fold as an example.
Discretization is often applied to transform quantitative data into qualitative data, so that learning becomes more efficient and effective. The input variable ‘customer type’ was qualitative (nominal) data with numerical codes: 0 for myopic and 1 for strategic. The other input variables were numerical. Hence, a discretization technique with referential values was applied to all input variables except C.
Referential values consist of lower and upper boundaries of the observed values of
the input variables for the dataset, and any values between those boundaries. The
boundaries can be set based on minima and maxima of the observed values for input
variables of the whole dataset. Alternatively, experts can determine the boundaries.
In this study, we set the percentiles of 1% and 99% for lower and upper boundaries
respectively. Table 6.2 shows that the minima and the 1st percentile of the observed
values of the input variable WPT were .0001 and .0005. The 99th percentile and
maxima of WPT (observed) were 4.7441 and 31.7962. Almost all the customers – 99% – in the dataset were present in the system for fewer than 4.7441 days. One customer booked long before the departure date and waited up to 31.7962 days. Hence, waiting for more than 4.7441 days was treated as equivalent to waiting 4.7441 days.
Table 6.2. Percentiles of the dataset

Input variable | 0% | 1% | 99% | 100%
WPT | 1 × 10⁻⁴ | 5 × 10⁻⁴ | 4.7441 | 31.7962
APT | -.5632 | -.1077 | .2154 | 1.6738
HP | .0065 | .0099 | 5.4128 | 6.7321
DD | .1818 | .3526 | 32.8692 | 63.6265
NF | 2 | 2 | 68 | 100
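The effect of replacing the raw min/max with the 1st and 99th percentiles can be sketched on toy data (the `percentile` helper and the data values are illustrative, not the thesis code; they only show how one extreme observation, like the 31.7962-day WPT, is trimmed):

```python
def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    frac = pos - lo
    return s[lo] if frac == 0 else s[lo] + frac * (s[lo + 1] - s[lo])

# Toy sample with one extreme outlier, loosely echoing the WPT distribution
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 31.8]
lower, upper = percentile(data, 1), percentile(data, 99)
print(round(lower, 3), round(upper, 3))  # 0.109 29.019
```

The 99th-percentile upper boundary sits well below the raw maximum of 31.8, so a single extreme waiter no longer stretches the referential-value range.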
As explained in Section 3.7, the model parameters – including weights and referential values – were optimised through sequential least squares programming (SLSQP) with randomly set initial parameters. The MSE score was used as the objective function; that is, Equations (4.23) and (4.24) were used for the MAKER-ER- and MAKER-BRB-based models respectively. At each iteration, the algorithm searches for a new solution in the direction indicated by the evaluated MSE score. The algorithm was run for 200 iterations or until a tolerance of .0001 was reached.
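The SLSQP setup described above can be sketched with `scipy.optimize.minimize`, a standard SLSQP implementation in Python (the thesis does not name its solver library, so this is an assumption). The objective below is a stand-in quadratic, not the real MAKER MSE, which requires evaluating the full model; the bounds reuse the WPT, APT, HP, DD, and NF boundaries from Table 6.2:

```python
import numpy as np
from scipy.optimize import minimize

# Lower/upper boundaries for the five numerical input variables (Table 6.2)
lower = np.array([0.0005, -0.1077, 0.0099, 0.3526, 2.0])
upper = np.array([4.7441,  0.2154, 5.4128, 32.8692, 68.0])

def mse(ref_values):
    """Toy stand-in for the MAKER MSE objective: distance to a placeholder
    'best' split point per variable (here, the midpoint of each range)."""
    target = (lower + upper) / 2
    return np.mean((ref_values - target) ** 2)

rng = np.random.default_rng(0)
x0 = lower + rng.random(5) * (upper - lower)  # randomly set initial parameters

result = minimize(mse, x0, method="SLSQP",
                  bounds=list(zip(lower, upper)),
                  options={"maxiter": 200, "ftol": 1e-4})  # 200 iters, 1e-4 tol
print(result.success, round(result.fun, 6))
```

The real objective would replace `mse` with the MAKER-ER or MAKER-BRB model evaluation of Equations (4.23)/(4.24); the SLSQP call itself is unchanged.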
The target of the optimisation of both the MAKER-ER- and MAKER-BRB-based models is to maximise the likelihood of the true state of a training set, and thereby to minimise the MSE scores, which denote the difference between model outputs and observed values. Optimising the referential values of each input variable amounts to finding how to divide each input variable so that the observations of a given class form the majority within each interval. More trained referential values can reasonably improve the classifier; however, the associated cost is increased model complexity. In this case, we used one optimised referential value per input variable, because adding more referential values did not significantly improve the AUC scores but caused higher model complexity; sparser joint frequency matrices were also found when more referential values were added. In addition, with only two adjacent referential values – that is, no trained referential value – the model can only approximate a monotonic function; at least one trained referential value is required to approximate a non-monotonic function.
Figure 6.6 illustrates the scatter plot for the first training set across the six input variables. There are three grids, one per group of evidence: WPT-APT (left), HP-DD (middle), and NF-C (right). Red dots represent ‘buy’ and blue ‘wait’. The vertical and horizontal lines indicate the optimised referential values of the input variables. As shown in the figure, these lines split the data into several regions; because the referential values are optimised through the MAKER-ER- and MAKER-BRB-based classifiers, each region indicates where the majority of a class is placed. Because there was one trained referential value for each input variable, each panel shows four regions.
Figure 6.6. Scatter plot for observed data, with plotted optimised referential values for each input variable in the optimisation of MAKER-ER-based model from the customer – decision dataset.
Figure 6.7. Scatter plot for observed data, with plotted optimised referential values for each input variable in the optimisation of MAKER-BRB-based model from the customer – decision dataset.
Generally, the figures above illustrate that different data patterns existed for records in different classes across the input variables of the dataset. For the WPT-APT evidence, most of the ‘wait’ decisions (blue dots) with long waiting-patience times were distributed around a near-zero price trend, while ‘buy’ decisions with short waiting-patience times (meaning that the customer bought almost immediately) were mainly distributed well above or below a stable, near-zero price trend.
For the HP-DD group of evidence, a linear relationship existed between HP and DD: the further the booking time was from the departure date, the longer the holding period. The data distribution was dense for DD values below 10 days and HP values below 1 day. It is also interesting to note that longer holding periods did not necessarily make customers wait and book again; the majority class in the upper-right corner of the second grid is red dots. Furthermore, the bottom-left corner, in which customers faced last-minute selling – that is, bookings very close to the departure date with a short holding period – was occupied by blue dots. Hence, other situational factors were affecting these decisions.
For the C-NF group of evidence, strategic customers generally made ‘wait’ decisions, as shown by blue dots dominating the upper line (strategic customers, denoted by 1). Myopic customers tended to make ‘buy’ decisions, as shown by more red dots in the lower line (myopic customers, denoted by 0). For NF, both classes were generally distributed over the same range.
The horizontal and vertical lines denote the optimised referential values of the input
variables of the respective training set. As stated earlier, the optimisation of referential
values with respect to MSE score led to data separation of the observations for each
input variable. Hence, the majority of a class fell within the same value range for each
input variable. As shown in Figures 6.6 and 6.7, the optimised referential values are
generally located around the separation point between classes: wait and buy.
For the following sections, optimised referential values and other model parameters
of the training set of the first fold of both MAKER-based classifiers are taken as an
example to demonstrate how MAKER-ER- and MAKER-BRB-based models are
constructed for customer decision prediction of a given dataset. The next section
discusses the MAKER-based model according to four aspects: 1) evidence
acquisition from data; 2) evidence interdependence; 3) belief-rule inference; and 4)
inference of the top hierarchy, including ER rule and BRB rule inference.
6.4.3. Evidence Acquisition from Data
Section 4.3 explains the MAKER framework with referential values as a discretization
method for numerical data. As already stated, the referential values of each input
variable in numerical data must be defined to acquire evidence from a dataset.
Referential values as model parameters can initially be set based on expert
knowledge or can be randomly generated. They can then be trained using historical
data under an optimisation objective (Xu et al., 2017). For illustration purposes, we
used the solution of optimisation for the MAKER-BRB-based model of the first round.
It included weights and an optimised referential value for each input variable of the
training set.
Table 6.3 depicts the optimised referential values used for this illustration. The referential values include the boundary referential values – the lower and upper boundaries (see Section 6.4.2) – and a trained referential value, which lies between the boundaries. The number of trained referential values can be changed depending on the desired balance between model complexity and performance, as well as the statistical requirements (see Section 6.4.2).
Table 6.3. Optimised referential values obtained from the MAKER-based models of the first round
Note: N/A = not applicable (the input variable C was discrete)
In addition, the customer decision dataset included an input variable with nominal data, that is, customer type (C). Evidence acquisition with nominal data is simpler than for numerical data. In the data transformation in Equation (4.7), the term a_{n,l,i}^k represents the degree to which the nth input value of the lth input variable (x_{n,l}) belongs to the referential value A_i^l; in other words, it shows how close x_{n,l} is to A_i^l. For the input variable C, the values of a_{n,l,i}^k are either 0 or 1. For example, if an observation of C is 1 (strategic), then S_1 = {(A_1^6, 0), (A_2^6, 1)}; if C is 0 (myopic), then S_1 = {(A_1^6, 1), (A_2^6, 0)}.
To acquire evidence from a dataset, the first step is to transform each input value of
each input variable of the training set using Equation (4.1). The input value is located
between two adjacent referential values of the respective input variable. The belief
distributions termed ‘similarity degree’ are calculated with respect to the adjacent
referential values. The second step is to aggregate the similarity degrees for each
referential value under the different classes of the training set, according to Equation (4.2).

Input variables | WPT | APT | HP | DD | NF | C
Lower boundary | .0005 | -.1077 | .0010 | .3526 | 2 | 0
Optimised referential values (MAKER-ER-based model) | 1.0203 | .05 | .7 | 1.6 | 16.2358 | N/A
Optimised referential values (MAKER-BRB-based model) | .5614 | .009 | .5783 | 1.4534 | 15.98 | N/A
Upper boundary | 4.7441 | .2154 | 5.5320 | 32.8692 | 68 | 1

The frequencies of the referential values of each input variable, under the different classes of the output variable of the training set, can then be generated. Table 6.4 displays the frequencies of the referential values of the input variable WPT, using the trained referential values of the first round as an example.
Table 6.4. The frequencies of the referential values of the input variable WPT

Class \ Referential value | .0005 | .5614 | 4.7441
Wait | 1984.4742 | 395.9835 | 66.5432
Buy | 293.0559 | 219.3981 | 51.5460
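The first transformation step – distributing an input value over its two adjacent referential values – can be sketched as follows, using the trained WPT referential values above and a hypothetical input of 0.3 (the helper function is illustrative of the Equation (4.1)-style rule, not the thesis code):

```python
def similarity_degrees(x, refs):
    """Distribute input x over its two adjacent referential values; the
    resulting similarity degrees sum to 1, all other degrees are 0."""
    refs = sorted(refs)
    x = min(max(x, refs[0]), refs[-1])  # clip to the boundary values
    degrees = [0.0] * len(refs)
    for i in range(len(refs) - 1):
        lo, hi = refs[i], refs[i + 1]
        if lo <= x <= hi:
            degrees[i] = (hi - x) / (hi - lo)
            degrees[i + 1] = 1.0 - degrees[i]
            break
    return degrees

# Referential values of WPT from Table 6.4, and a hypothetical input value
refs_wpt = [0.0005, 0.5614, 4.7441]
print([round(d, 4) for d in similarity_degrees(0.3, refs_wpt)])  # [0.466, 0.534, 0.0]
```

Aggregating these degrees over all training observations, per class, yields frequency tables such as Table 6.4.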
Next, to calculate the likelihood of a referential value of an input variable being
observed given that a class of the output variable is true, Equation (4.9) was applied.
The procedure was repeated for all referential values for all input variables of the
training set in the dataset. Once we knew the likelihood of a referential value of an
input variable, we used Equation (4.10) to calculate the probability of the respective
referential value pointing to a class of the output. Table 6.5 presents the likelihood of
the referential values: .0005, .5614, and 4.7441 of the input variable WPT as an
example.
Table 6.5. The likelihoods of the referential values of the input variable WPT

Class \ Referential value | .0005 | .5614 | 4.7441
Wait | .8110 | .1618 | .0272
Buy | .5196 | .3890 | .0914
Figure 6.8 depicts the individual support from each piece of evidence for different class memberships, obtained from the probability of each referential value of each input variable of the training set. Table 6.6 exhibits the probabilities for the referential values .0005, .5614, and 4.7441 of the input variable WPT, and Table 6.7 for the referential values -.1077, .009, and .2154 of the input variable APT.
Figure 6.8. Individual support of referential values of each input variable of the training set of the first fold of the customer decision dataset
Table 6.6. The probabilities of the referential values of the input variable WPT

Class \ Referential value | .0005 | .5614 | 4.7441
Wait | .6095 | .2938 | .2293
Buy | .3905 | .7062 | .7707

Table 6.7. The probabilities of the referential values of the input variable APT

Class \ Referential value | -.1077 | .009 | .2154
Wait | .4870 | .5013 | .5053
Buy | .5130 | .4987 | .4947
Several pieces of evidence can be acquired from the probabilities calculated above. The probabilities of the referential values of the input variables represent the degree to which the respective referential values point to different class memberships. For example, the probabilities of the lower boundary of APT, -.1077, are .4870 and .5130 for the ‘wait’ and ‘buy’ decisions respectively (see Table 6.7). Hence, if an observation has an input APT value of -.1077, the probability of the customer choosing ‘wait’ is .4870 and of choosing ‘buy’ is .5130.
6.4.4. Analysis of Evidence Interdependence
The six input variables of the customer-decision dataset (WPT, APT, HP, DD, NF, and C) express provider-controlled, uncontrolled, and personal factors that influence customer decisions in a dynamic-pricing environment. Predicting customer decisions based on only one input variable is likely to be insufficient, as a single variable cannot explain much of the variance in customer decisions. Therefore, it is necessary to combine multiple pieces of evidence to predict customer decisions.
In evidential reasoning (ER), the general assumption when combining two pieces of
evidence is that the two pieces of evidence are independent from each other. Using
the MAKER framework, this assumption can be relaxed. The interdependence index
can be calculated using Equation (4.14). To generate the interdependence index for
a pair of evidential elements, the first step is to calculate the degree of similarity for
input values for the combination of evidential elements, using Equation (4.12). Then,
joint probability for the pair of evidential elements is obtained by applying Equation
(4.13). Subsequently, Equation (4.14) estimates the interdependence index for a pair
of evidential elements.
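Equation (4.14) is not reproduced in this chunk, but the tabulated values are consistent with the interdependence index being the joint probability of a pair of evidential elements divided by the product of their marginal probabilities. A sketch under that assumption, reproducing one entry of Table 6.9 from Tables 6.6–6.8:

```python
# Marginal and joint probabilities for the pair {WPT = .0005, APT = -.1077}
# under the 'wait' class (Tables 6.6, 6.7, and 6.8)
p_wpt_wait = 0.6095    # P(wait | WPT = .0005),  Table 6.6
p_apt_wait = 0.4870    # P(wait | APT = -.1077), Table 6.7
p_joint_wait = 0.6295  # joint probability,      Table 6.8

# Assumed form of Equation (4.14): joint / (product of marginals);
# an index of 1 would indicate full independence, 0 disjoint evidence
alpha = p_joint_wait / (p_wpt_wait * p_apt_wait)
print(round(alpha, 4))  # 2.1208, matching Table 6.9's 2.1209 up to input rounding
```

The small discrepancy against Table 6.9 comes from the four-decimal rounding of the tabulated inputs.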
Table 6.8 displays the joint probabilities of all pairs of evidential elements of the input variables WPT and APT pointing to the different classes of the output variable, namely ‘wait’ and ‘buy’. These joint probabilities are calculated from the frequencies of the different combinations of referential values of the pieces of evidence from WPT and APT under different class memberships. The frequencies have at least five cases unless two pieces of evidence are disjoint, such as {4.7441, .2154}; the same applies to the combination of referential values {.0010, 32.8692}. This means those pieces of evidence are disjoint under the different classes. Therefore, we defined inequality constraints for all the combinations of referential values of the input variables of each group of evidence, except for the structural zeros: the combination of referential values {4.7441, .2154} for evidence group 1 and the combination {.0010, 32.8692} for evidence group 2.
Table 6.8. Joint probabilities for different combinations of referential values from the input variables WPT and APT

Class \ {WPT, APT} | {.0005, -.1077} | {.0005, .009} | {.0005, .2154} | {.5614, -.1077} | {.5614, .009} | {.5614, .2154} | {4.7441, -.1077} | {4.7441, .009} | {4.7441, .2154}
Wait | .6295 | .6063 | .6608 | .2800 | .2934 | .3268 | .2149 | .2311 | 0
Buy | .3705 | .3937 | .3392 | .7200 | .7066 | .6732 | .7851 | .7689 | 0
Table 6.9. Interdependence indices for referential values of the input variables WPT and APT

Class \ {WPT, APT} | {.0005, -.1077} | {.0005, .009} | {.0005, .2154} | {.5614, -.1077} | {.5614, .009} | {.5614, .2154} | {4.7441, -.1077} | {4.7441, .009} | {4.7441, .2154}
Wait | 2.1209 | 1.9844 | 2.1457 | 1.9574 | 1.9926 | 2.2015 | 1.9246 | 2.0106 | 0
Buy | 1.8494 | 2.0216 | 1.7556 | 1.9872 | 2.0061 | 1.9269 | 1.9856 | 2.004 | 0
Table 6.10. Interdependence indices for referential values of the input variables HP and DD

Class \ {HP, DD} | {.0010, .3526} | {.0010, 1.4534} | {.0010, 32.8692} | {.5783, .3526} | {.5783, 1.4534} | {.5783, 32.8692} | {5.5320, .3526} | {5.5320, 1.4534} | {5.5320, 32.8692}
Wait | 2.0404 | 1.9983 | 1.8145 | 2.2374 | 2.0464 | 1.5913 | 0 | 2.0573 | 1.6175
Buy | 1.9662 | 1.9984 | 2.2368 | 1.6856 | 1.9518 | 2.5751 | 0 | 1.9349 | 2.5698
Table 6.11. Interdependence indices for referential values of the input variables NF and C

Class \ {NF, C} | {2, 0} | {2, 1} | {16.2358, 0} | {16.2358, 1} | {68, 0} | {68, 1}
Wait | 1.5278 | 4.0076 | 1.3381 | 3.8792 | .9519 | 4.6195
Buy | 3.5462 | 1.3297 | 4.0599 | 1.3737 | 5.3558 | 1.1082
Thereafter, we calculated the interdependence index for a pair of evidential elements
with respect to different class membership. The probabilities obtained from the
previous steps (see Section 6.4.3), displayed in Tables 6.6, 6.7, and 6.8, are a basic
probability distribution of the input variable WPT, a basic probability distribution of the
input variable APT, and joint probabilities of the pair of pieces of evidence from WPT
and APT respectively. The interdependence indices between each piece of evidence
from the input variable of WPT and the input variable of APT was obtained by
Equation (4.21).
Table 6.9 indicates that the interdependence index values for the input variables WPT and APT generally ranged from 1 to 3, meaning that these input variables were nearly independent of each other, except for the combination of referential values {4.7441, .2154}, which had an index value of 0 (i.e. disjoint). According to Table 6.10, the input variables HP and DD were likewise generally nearly independent of each other, with interdependence indices between 1 and 3; the combination of referential values {5.5320, .3526} was disjoint. Meanwhile, the input variables NF and C were only moderately independent of each other, with interdependence indices ranging from 1 to 6, as presented in Table 6.11.
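The exact form of Equation (4.21) is not reproduced in this excerpt. A common way to quantify interdependence between two pieces of evidence is the ratio of their joint probability to the product of their marginals, where 1 indicates independence and 0 indicates disjoint evidence; the sketch below uses that plain ratio. Note that the thesis's index reports values around 2 for "nearly independent" pairs, so Equation (4.21) evidently applies a different normalisation; treat this as an illustrative assumption, not the thesis's formula.

```python
def interdependence_index(p_joint, p_x, p_y):
    """Plain ratio of joint probability to the product of marginals.

    1.0 -> the two pieces of evidence are statistically independent;
    0.0 -> they are disjoint (never observed together).
    NOTE: an illustrative sketch only; the thesis's Equation (4.21)
    appears to use a different normalisation (typical values near 2).
    """
    if p_x == 0 or p_y == 0:
        return 0.0  # a piece of evidence that never occurs is treated as disjoint
    return p_joint / (p_x * p_y)
```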
6.4.5. Belief Rule Base
The sections above discussed how we acquired evidence from all six input variables and analysed the interdependence indices for pairs of pieces of evidence. The next step was the development of a belief rule base from which inferences could be drawn. We applied the belief rule explained in Section 0. In this case study, the 'IF' part of Equation (4.22), A^k = A_1^k ∧ A_2^k ∧ … ∧ A_{T_k}^k, should be interpreted as a combination of referential values of the input variables from a group of evidence, that is, 'if the input value of each input variable equals a referential value of that input variable'. The combination of referential values is termed a packet antecedent 𝐴𝑘. The 'THEN' part of Equation (4.22) expresses the probability of each consequent, that is, {(D_1, β_{1,k}), (D_2, β_{2,k}), …, (D_N, β_{N,k})}. This should be interpreted as the probability of a customer with the corresponding input values choosing 'wait' or 'buy'. Using evidence group 1 as an example: if the input value of WPT equals a referential value of the input variable WPT and the input value of APT equals a referential value of the input variable APT, then the probabilities of the customer choosing to buy and to wait are β_{1,k} and β_{2,k} respectively.
To obtain the probabilities of a customer choosing to buy and to wait, the MAKER rule is used to combine pieces of evidence from a group of evidence, taking the interdependency of each pair of pieces of evidence into account through Equation (4.16). Through Equation (4.18), we can obtain the weights of the combined evidence from the probability mass 𝑚𝜃,𝑒(𝐿) and the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙, or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿). These weights are used for inference in the next section and are termed rule weights, denoted by 𝜃𝑘. For example, with the calculation explained in Section 4.3.3, the
probabilities of the group-1 combination of referential values {.0005, -.1077} were .9059 for choosing to buy and .0941 for choosing to wait. We also obtained a probability of .9583 for 'buy' and .0417 for 'wait' for the group-2 combination {.5783, 32.8692}. For the group-3 combination {68, 1}, the probabilities of choosing to buy and to wait were .2563 and .7437 respectively.
A belief rule base for evidence groups 1, 2, and 3 is provided in Tables 6.12, 6.13, and 6.14 respectively. It is worth noting that the trained referential values are solutions of the optimisation of the MAKER-based classifier. In this section, we also utilised other optimised model parameters, such as the weights of the input variables. Each group of evidence contained two input variables, each with three referential values, namely the lower boundary, the trained referential value, and the upper boundary. This yielded nine combinations for each group of evidence, except group 3, which had six: in that group, the input variable C had only two referential values, 0 for myopic and 1 for strategic. Each combination contains one referential value from each input variable within a group of evidence.
The 'THEN' part consists of the consequents 'buy' and 'wait', with the corresponding probability values obtained through the MAKER rule. To build these belief rules, we utilised the optimised model parameters, including the weights and referential values of each input variable, which are solutions of the optimisation of the MAKER-based classifiers. The results are depicted in Tables 6.12, 6.13, and 6.14. The following section explains how we drew a BRB inference from an observation.
Table 6.12. The belief rule base of the first group of evidence and the belief rules activated by an observation from the customer-decision dataset: {.2946, .1193}
Antecedent Belief degree
Rule 𝐴1 (WPT) 𝐴2 (APT) Buy Wait 𝛼𝑘
1 .0005 -.1077 .9059 .0941 0
2 .0005 .009 .9626 .0374 .2133
3 .0005 .2154 .8632 .1368 .2524
4 .5614 -.1077 .2786 .7214 0
5 .5614 .009 .3787 .6213 .2417
6 .5614 .2154 .2352 .7648 .2926
7 4.7441 -.1077 .2315 .7685 0
8 4.7441 .009 .3227 .6773 0
9 4.7441 .2154 .2195 .7805 0
Table 6.13. The belief rule base of the second group of evidence and the belief rules activated by an observation from the customer-decision dataset: {.3954, 1.9816}
Antecedent Belief degree
Rule 𝐴3 (HP) 𝐴4 (DD) Buy Wait 𝛼𝑘
1 .0010 .3526 .5533 .4467 0
2 .0010 1.4534 .3806 .6194 .1426
3 .0010 32.8692 .8961 .1039 .0067
4 .5783 .3526 .9068 .0932 0
5 .5783 1.4534 .8624 .1376 .8319
6 .5783 32.8692 .9583 .0417 .0188
7 5.5320 .3526 .9074 .0926 0
8 5.5320 1.4534 .8760 .1240 0
9 5.5320 32.8692 .9622 .0378 0
Table 6.14. The belief rule base of the third group of evidence and the belief rules activated by an observation from the customer-decision dataset: {62, 1}
Antecedent Belief degree
Rule 𝐴5 (NF) 𝐴6 (C) Buy Wait 𝛼𝑘
1 2 0 .8978 .1022 0
2 2 1 .0478 .9522 0
3 15.98 0 .9152 .0848 0
4 15.98 1 .0731 .9269 .0949
5 68 0 .6414 .3586 0
6 68 1 .2563 .7437 .9051
Table 6.15. Two adjacent referential values of each input variable of an observation from the customer decision dataset: {.2946, .1193, .3954, 1.9816, 62, 1}
WPT APT HP DD NF C
.0005 .009 .0010 1.4534 15.98 0
.5614 .2154 .5783 32.8692 68 1
6.4.6. BRB Inference with Referential Values
We constructed BRBs for three groups of evidence, as depicted in Tables 6.12, 6.13,
and 6.14, through the MAKER framework. We were then able to draw an inference
from a BRB for each observation in the dataset. For discrete or nominal data, drawing an inference from a BRB is direct. For example, if the input vector matches the combination 'High ∧ Low ∧ High', the probabilities are obtained as 𝑝1 = 𝛽1𝑘 for consequent 1 and 𝑝2 = 𝛽2𝑘 for consequent 2, taken from the 𝑘th rule whose IF part is 'High ∧ Low ∧ High'. In this study, however, all the input variables were numerical and were discretized through a referential-value-based data-discretization method. The inference process with this kind of discretization differs from that for discrete or nominal data.
Belief rule bases for the three groups of evidence were developed in the previous
section. Each belief rule base is constructed from a packet antecedent 𝐴𝑘, which is a
combination of referential values of input variables and its corresponding probabilities
of consequents. Based on this form, we needed to transform the numerical data into combinations of referential values of the input variables. First, we calculated the similarity degree for each observed value of each input variable; an input value can be transformed using Equation (4.7). The similarity degree indicates the degree to
which an input value matches each of the referential values. An observation with input
values: {.2946, .1193, .3954, 1.9816, 62, 1} for WPT, APT, HP, DD, NF, and C
respectively with referential values defined in Table 6.3 was taken as an example.
The observation has two adjacent referential values for each input variable, as
depicted in Table 6.15. Using Equation (4.12), we calculated the joint similarity degree between the observation and the combination of referential values (the packet antecedent) of each belief rule. These values represent the individual matching degrees, indicating the degree to which an observation is close to a packet antecedent 𝐴𝑘, and are denoted by 𝛼𝑘 for the 𝑘th rule.
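Equations (4.7) and (4.12) are not reproduced in this excerpt. The sketch below assumes the usual referential-value transformation, a linear split of an input value over its two adjacent referential values, and a product-form joint similarity; both are common readings, not the thesis's exact formulas.

```python
def similarity_degrees(x, refs):
    """Distribute an input value over its two adjacent referential values
    (an Equation (4.7)-style transformation, assumed linear here).
    `refs` is the list of referential values; returns {ref: degree}."""
    refs = sorted(refs)
    if x <= refs[0]:
        return {refs[0]: 1.0}
    if x >= refs[-1]:
        return {refs[-1]: 1.0}
    for lo, hi in zip(refs, refs[1:]):
        if lo <= x <= hi:
            d_hi = (x - lo) / (hi - lo)  # closeness to the upper reference
            return {lo: 1.0 - d_hi, hi: d_hi}

def joint_similarity(degrees_per_var, packet):
    """Joint similarity alpha_k of an observation to a packet antecedent:
    the product of the per-variable similarity degrees (one common
    reading of an Equation (4.12)-style joint matching degree)."""
    alpha = 1.0
    for degrees, ref in zip(degrees_per_var, packet):
        alpha *= degrees.get(ref, 0.0)  # 0 if the rule uses a non-adjacent reference
    return alpha
```

With the example observation, `similarity_degrees(0.2946, [0.0005, 0.5614, 4.7441])` distributes the WPT value over its two adjacent referential values, and only packet antecedents built from adjacent references receive a non-zero joint similarity.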
Each input value was discretized according to its distance from its two adjacent referential values. Hence, between 1 and 2^N belief rules were activated, where N is the number of input variables, for an input vector in which each observed value lies between two adjacent referential values. If an observation exactly equals a combination of referential values of the input variables, only one belief rule is activated. Since each group of evidence contained two input variables with three referential values each, between 1 and 2^2 = 4 belief rules were activated. For the third group of evidence specifically, an input value of the variable C always exactly equalled a referential value, either 0 (myopic) or 1 (strategic). Therefore, for this group of evidence, between 1 and 2^1 = 2 of its six belief rules were activated.
To illustrate, applying Equation (4.12) yields a joint similarity degree for each belief rule in the BRB of each group of evidence, as depicted in Tables 6.12, 6.13, and 6.14 for the observation {.2946, .1193, .3954, 1.9816, 62, 1}. Four combinations of the activated adjacent referential values in each of Tables 6.12 and 6.13 have 𝛼𝑘 > 0, whereas the other combinations of referential values have 𝛼𝑘 = 0. Two combinations of referential values of the third group of evidence are activated, as presented in Table 6.14. At this
point, for each belief rule we have: 𝛼𝑘, the individual matching degree to which the input values belong to the packet antecedent 𝐴𝑘; the weights of the combined pieces of evidence of 𝐴𝑘, obtained from the probability mass 𝑚𝜃,𝑒(𝐿), the probability 𝑝𝜃,𝑒(𝐿), and the probability mass 𝑚𝑃(Θ),𝑒(𝐿); and the probability of each consequent resulting from the combination of the pieces of evidence in 𝐴𝑘. Hence, the weights of the pieces of evidence affect the weight of each belief rule activated by an observation.
After obtaining the activated belief rules with their joint similarity degrees and weights, the next step is to combine these belief rules to predict the probability of each consequent, that is, the probability of a customer choosing to buy or to wait. First, we calculated the updated weight, denoted by 𝜔𝑘, of each belief rule in the BRB, based on the joint similarity degrees and the associated rule weight 𝜃𝑘, using Equation (3.11); the term 𝐿 refers to the number of belief rules in the BRB. The term 𝜔𝑘 measures the degree to which the packet antecedent 𝐴𝑘 in the 𝑘th rule is triggered by an observation. As stated in the previous section, the weights of the input variables contribute to the weight of each belief rule, and therefore to the updated weight of each belief rule, which governs the degree to which that rule contributes to predicting the probability of each consequent. Second, given the updated weight of each belief rule and the associated probability of each consequent, we combined those pieces of evidence using the conjunctive MAKER rule, as shown in Equation (4.16). The output of this framework is the probability of a customer choosing to buy or to wait.
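The two steps above can be sketched as follows. The activation-weight formula is the standard one used in belief-rule-base systems, which Equation (3.11) appears to follow; the final aggregation, however, is only a placeholder weighted average, since the conjunctive MAKER rule of Equation (4.16) also carries evidence weights and reliabilities and is not reproduced in this excerpt.

```python
def updated_weights(alphas, thetas=None):
    """Activation weight of each rule: w_k = theta_k * alpha_k / sum_j theta_j * alpha_j,
    where alpha_k is the joint similarity (matching degree) and theta_k the rule weight."""
    if thetas is None:
        thetas = [1.0] * len(alphas)  # equal rule weights if none are trained
    raw = [t * a for t, a in zip(thetas, alphas)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

def combine_rules(weights, beliefs):
    """Placeholder aggregation: a weighted average of the consequent beliefs
    of the activated rules. The thesis instead combines rules with the
    conjunctive MAKER rule (Equation (4.16)), which is NOT this simple average."""
    n_consequents = len(beliefs[0])
    return [sum(w * b[j] for w, b in zip(weights, beliefs)) for j in range(n_consequents)]
```

Rules with zero matching degree drop out automatically, since their updated weight is zero.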
For example, for the observation from evidence group 1, {.2946, .1193} for WPT and APT respectively, a probability of .6007 was obtained for 'buy' and .3993 for 'wait'. For the observation from evidence group 2, {.3954, 1.9816} for HP and DD respectively, the probabilities were .8468 for 'buy' and .1532 for 'wait'. For the third group of evidence, with the input values {62, 1}, we obtained probabilities of .2387 for 'buy' and .7613 for 'wait'. At this point, the probability of a customer choosing to buy or to wait was obtained based on some but not all input variables in the input system.
Generating the probability of each consequent from all input variables of the input
system considered together is discussed in the following section.
6.4.7. Inference on the Top Hierarchy
The previous section demonstrated obtaining the probability of each consequent as a
result of evidence combinations for some but not all input variables in the system. As
depicted in Figure 4.1, a system consists of groups of evidence, each of which has
several input variables. In the lower levels of the hierarchy, each group of evidence
performs prediction. Thus, each group of evidence generates the probability of each
consequent of the output system. We first acquired the MAKER-generated outputs from the input variables of each group of evidence. We could then combine these outputs into a final inference at the top of the hierarchy: the probability of a customer choosing to buy or to wait, with all input variables considered. We provide two combination methods for the top hierarchy, namely the ER-based model and the BRB-based model, as depicted in Figure 6.5.
According to the previous section, we can acquire the probabilities generated by the
MAKER rule from input variables of a group of evidence. In other words, an
observation of the input variables of a group of evidence generated the probabilities
pointing to class membership. Therefore, we acquired evidence from the observation.
We obtained the same number of pieces of evidence as the number of groups of
evidence in the hierarchy. To combine the MAKER-generated outputs for each group
of evidence, we calculated their weights as described below.
As previously explained, within a group of evidence, the weights of input variables
have an impact on the weight of each belief rule. On the basis of the degree of joint
similarity between an observation and each belief rule, and the weight of each belief
rule, we calculated the updated weight of each belief rule. These updated weights
measure the degree to which a belief rule is activated or triggered in predicting the
probability of each consequent. In the next step, the MAKER-generated outputs for
each group of evidence were obtained by combining the activated belief rules using
the conjunctive MAKER rule, shown in Equation (4.16).
We obtained the weight of each group of evidence from the probability mass 𝑚𝜃,𝑒(𝐿)
and the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙 or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿),
using Equation (4.18) when combining the activated belief rules. Given those pieces
of evidence and their weights, we used Equation (4.16) to combine the evidence and
generate the probability of each consequent, considering all input variables in the
system. The weights of the input variables within a group of evidence affect the weights of the activated belief rules. Through the conjunctive MAKER rule, we combined the activated belief rules and calculated the weight of the combined belief rules, which serves as the weight of that group of evidence. In the top hierarchy, the MAKER-generated outputs were then combined using the weights of the three groups of evidence. The final inference therefore took the weights of all input variables in the system into account.
This study examined three groups of evidence, each of which yielded an inference
based on the groups’ input variables. At this point, we should have three outputs; that
is, the probabilities of a customer choosing to buy or to wait. In the previous section,
the example observations were {.2946, .1193}, {.3954, 1.9816}, and {62, 1} for the
first, second, and third groups of evidence, respectively. We obtained the MAKER-generated outputs through the procedures explained in the previous sections, as follows: {(1, .7582), (2, .2418)} for the group WPT-APT; {(1, .5968), (2, .4032)} for the group HP-DD; and {(1, .57), (2, .43)} for the group NF-C. Using their weights and Equation (4.16), we generated the probabilities of .7496 for 'buy' and .2504 for 'wait' as the final output of the system for the observation {.2946, .1193, .3954, 1.9816, 62, 1}. These probabilities were obtained with all the input variables in the system being considered. Because of this method of evidence combination, this framework is termed the MAKER-ER-based classifier, as seen in Figure 6.5.
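For intuition, the simplest conjunctive combination of the three group-level probability distributions is a Dempster-style product-and-normalise over the two classes, sketched below. This is an assumption for illustration only: the MAKER rule of Equation (4.16) additionally weights each piece of evidence, so this naive sketch will generally not reproduce the reported .7496/.2504 output.

```python
def conjunctive_combine(prob_dists):
    """Naive conjunctive (Dempster-style) combination of singleton
    probability distributions over the same classes:
    multiply per class across pieces of evidence, then normalise.
    The MAKER rule (Equation (4.16)) also applies evidence weights,
    which this sketch deliberately omits."""
    classes = list(prob_dists[0].keys())
    prod = {c: 1.0 for c in classes}
    for dist in prob_dists:
        for c in classes:
            prod[c] *= dist[c]
    z = sum(prod.values())  # normalising constant (mass not in conflict)
    return {c: v / z for c, v in prod.items()}
```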
The following description explains how we obtain a final inference using BRB. As
depicted in Figure 4.1, there are several groups of evidence, each of which consists
of input variables. As stated above, each group of evidence generates the probability
of each consequent. We can draw inferences based on the concept of a belief rule
base.
• The construction of the belief rule base
To construct a belief rule base, we follow the expression of the extended IF-THEN rule described in Section 0, specifically Equation (4.22). At this level, the packet antecedent of the belief rule, written as A^k = A_1^k ∧ A_2^k ∧ … ∧ A_{T_k}^k, should be read as 'if each group of evidence points to a particular class'. Therefore, the number of antecedents in this belief rule base equals the number of groups of evidence in the system; in this study, there were three groups of evidence and hence three antecedents in the BRB. Furthermore, {(D_1, β_{1,k}), (D_2, β_{2,k}), …, (D_N, β_{N,k})}, k = 1, …, L, should be read at this level as 'the probability of a customer choosing to buy or to wait, given the values of the antecedents'. Alternatively, it is the probability of a customer choosing either class membership, 'buy' or 'wait', given the results of each group of evidence.
'Antecedent' at this level refers to the outputs generated by each group of evidence. These outputs refer to class membership, so the number of antecedent combinations equals K^G, where K is the number of outputs in the output system and G is the number of groups of evidence in the system. In this study, there were two class memberships as outputs and three groups of evidence, giving 2^3 = 8 belief rules, as depicted in Table 6.16.
Suppose that A^1 is the output generated by evidence group 1. The term A_1^1 = 1 corresponds to the group of evidence pointing to class 'buy' (k = 1), and A_2^1 = 2 to the group pointing to class 'wait' (k = 2). Similarly, A_1^2 = 1 and A_2^2 = 2 mean that evidence group 2 points to 'buy' and 'wait' respectively. As we lacked prior knowledge of the belief degrees assigned to each consequent, denoted by 𝛽𝑗,𝑘 for the 𝑗th consequent in the 𝑘th rule as shown in Equation (4.22), we constructed the BRB as follows. Logically, given an observation, if all groups point to the same class, the observation of all input variables fully points to that class; for example, the 1st and 8th belief rules generate a probability of 1 for 'buy' and for 'wait' respectively. If the groups of evidence point to different classes, the observation of all input variables does not point exactly to a particular class, so the probability of each consequent can range from 0 to 1. These belief degrees can be trained together with the other model parameters. For initialization, we used the belief degrees shown in Table 6.16. Table 6.17 provides the optimised belief degrees of the belief rule base; we use these optimised belief degrees as the example in this section.
Table 6.16. Initial belief rule base of the top hierarchy for the customer-decision dataset
No. Antecedent Consequent
𝐴1 𝐴2 𝐴3 ‘to buy’ (k=1) ‘to wait’ (k=2)
1 1 1 1 1 0
2 1 1 2 .75 .25
3 1 2 1 .75 .25
4 1 2 2 .25 .75
5 2 1 1 .75 .25
6 2 1 2 .25 .75
7 2 2 1 .25 .75
8 2 2 2 0 1
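The initial rule base in Table 6.16 can be generated mechanically: one rule per combination of the three binary antecedents, with the initial belief in 'buy' stepping down (1, .75, .25, 0) as more antecedents point to 'wait'. A minimal sketch:

```python
from itertools import product

def initial_top_brb():
    """Rebuild the initial belief rule base of Table 6.16: one rule per
    combination of the three group-level antecedents (1 = 'buy', 2 = 'wait').
    The initial belief in 'buy' depends only on how many antecedents
    point to 'wait': 0 -> 1.0, 1 -> .75, 2 -> .25, 3 -> 0.0."""
    buy_by_waits = {0: 1.0, 1: 0.75, 2: 0.25, 3: 0.0}
    rules = []
    for antecedents in product((1, 2), repeat=3):  # same order as Table 6.16
        n_wait = sum(a == 2 for a in antecedents)
        buy = buy_by_waits[n_wait]
        rules.append((antecedents, (buy, 1.0 - buy)))
    return rules
```

`itertools.product((1, 2), repeat=3)` enumerates the antecedent combinations in exactly the row order of Table 6.16.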
If the output of each group of evidence fully points to a class membership, inference with the BRB can be made directly. However, the probability of each consequent generated by a group of evidence can range from 0 to 1; this probability measures the degree to which an evidence group points to a class membership.
• The calculation of joint similarity degree
As an example, we use the observed values {.2946, .1193, .3954, 1.9816, 62, 1} and
the optimised model parameters obtained from the MAKER-BRB-based model,
including optimised referential values in Table 6.3. The observed values {.2946,
.1193} for WPT and APT generated the probabilities {.6007, .3993}. These results mean that this observation belongs to A_2^1 to a low degree (.3993) and to A_1^1 to a high degree (.6007).
The above procedure allowed us to obtain the belief distribution of antecedents. We
applied Equation (4.12) to obtain the degree of joint similarity between the outputs
generated by each group of evidence and the combination of antecedents of each
belief rule. For example, based on the probabilities obtained from the groups of
evidence 1, 2, and 3: {(1, .6007), (2, .3993)} for WPT-APT; {(1, .8468), (2, .1532)} for
HP-DD; and {(1, .2387), (2, .7613)} for NF-C respectively, we obtained the degrees of
joint similarity shown in Table 6.17. These joint similarity degrees activated all eight belief rules.
Table 6.17. Optimised belief rule base of the top hierarchy and the belief rules activated by the three MAKER-generated outputs: {(1, .6007), (2, .3993)}; {(1, .8468), (2, .1532)}; and {(1, .2387), (2, .7613)}
No. Antecedent Consequent
𝐴1 𝐴2 𝐴3 ‘to buy’ (k=1) ‘to wait’ (k=2) 𝛼𝑘
1 1 1 1 1 0 .1214
2 1 1 2 1 0 .3873
3 1 2 1 .9991 .0009 .0220
4 1 2 2 0 1 .0701
5 2 1 1 .7575 .2425 .0807
6 2 1 2 .4818 .5182 .2574
7 2 2 1 0 1 .0146
8 2 2 2 0 1 .0466
• Making final inference from activated belief rules
As in the previous section, these values were used to calculate the updated weight of each belief rule. Rule weights, denoted by 𝜃𝑘, can be trained; in this study, however, we set equal rule weights. The joint similarity degree determines how strongly each activated belief rule contributes to the inference. The joint similarity is calculated from the outputs generated by each group of evidence, each of which covers some but not all input variables in the system; by combining the outputs in this way, the inference takes all input variables in the system into account. The probabilities {(1, .6007), (2, .3993)}, {(1, .8468), (2, .1532)}, and {(1, .2387), (2, .7613)} obtained from the observation {.2946, .1193, .3954, 1.9816, 62, 1} generated the following prediction of class membership: .7158 for 'buy' and .2842 for 'wait'.
As mentioned in Section 6.4.2, the set of model parameters in this study consisted of 1) a trained referential value for each input variable of the system and 2) the weights of the evidential elements (referential values) of the input variables for the MAKER-ER-based classifier. An additional model parameter for the MAKER-BRB-based classifier is the set of trained belief degrees of each consequent of each relevant belief rule, where Σ_{i=1}^{N} β_{i,k} = 1. The trained referential values were utilised to obtain pieces of evidence, which were then combined in the upper level of the hierarchy. Given the optimised weights of the evidential elements of the input variables of each evidence group, we generated the probability of each consequent. For each group of evidence, the weights of the input variables influenced the updated weight of each belief rule activated by an observation, and hence the predicted probabilities of the classes of the output system.
6.4.8. The Interpretability of Hierarchical MAKER Frameworks
In the MAKER-ER-based classifier, given the probabilities generated by the MAKER rule for each group of evidence and the weight of the combined activated belief rules of each group, we can make predictions in the upper level of the hierarchy, that is, calculate the probabilities pointing to the different class memberships with all input variables in the system considered. The weights of the input variables of each evidence group affect the updated weights of the activated belief rules, and the weight of the combined activated belief rules of each group affects the inference derived in the upper level.
In the MAKER-BRB-based classifier, the probabilities generated by the MAKER rule for each group of evidence show the degree to which the input variables of that group point to each class of the output system. We then calculate the joint similarity for each combination of the antecedents. Given the trained belief degrees and the joint similarities, we can make predictions in the upper level of the hierarchy; these are inferred from all the input variables in the system.
Through these two approaches, the MAKER-ER-based and MAKER-BRB-based models, we make the predicted outputs, that is, the predicted probabilities of each class of the output system, as close as possible to the true observed outputs of the training set by minimising the MSE score. Through this optimisation process, the model parameters, including the referential values of the input variables and the weights of the evidential elements for both classifiers, together with the trained belief degrees of the relevant belief rules for the MAKER-BRB-based model specifically, are trained using historical data.
In this study, given the optimised (i.e. trained) referential values of the six input
variables, we constructed the MAKER-based classifier to illustrate how to acquire
pieces of evidence from data. On the basis of the referential values and other
optimised solutions – that is, the weights and the belief degrees of each consequent
in the BRB of the top hierarchy, we used MAKER-ER- and MAKER-BRB-based
models to draw inferences. The process has been described in this section. An
example used earlier was {.2946, .1193, .3954, 1.9816, 62, 1}. The predicted
probabilities for this example for each class were {.7496, .2504} for the MAKER-ER-
based model and {.7158, .2842} for the MAKER-BRB-based model. Based on the
process established in these classifiers, we concluded that the MAKER-ER- and
MAKER-BRB-based classifiers offered an interpretable approach. They integrated
statistical analysis when acquiring pieces of evidence, the measurement of
interdependencies between pairs of pieces of evidence, belief rule-based inference
in the MAKER rule, maximum likelihood prediction, and machine learning.
Even with the input variables in the system split into multiple groups of evidence, the
inference process established for both classifiers combined all pieces of evidence
from the lower level in the hierarchy. In every combination process of pieces of
evidence from the bottom to the top of the hierarchy, the knowledge embedded in a
piece of evidence, including its weights, was continuously forwarded until the final
inference in the top hierarchy. Hence, we concluded that the predicted outputs of the system for both classifiers resulted from an inference process involving all the input variables in the system, with knowledge-representation parameters embedded
in each piece of evidence. These parameters were the weights, referential values,
and consequent belief degrees.
6.5. Model comparisons
In this section, the performance of the MAKER-ER- and MAKER-BRB-based models is compared with that of other common machine learning methods for classification: LR, SVM, CT, NN, NB, KNN, weighted KNN, LD, and QD. The customer-decision dataset from the revenue-management case was used.
As already stated, the customer-decision dataset was partitioned into five folds with shuffled stratified cross-validation, to obtain nearly equal class distributions in each fold. The training and test sets of the five rounds were generated from these five folds, and the optimised parameters were then applied to the test sets. Hence, we compared all the classifiers based on their performance over the five test sets across the five rounds.
In this section, we report accuracies, precisions, and recalls, with a threshold value of .5 for the probability-based classifiers; for SVM, a threshold value of 0 is used. For imbalanced data, the best possible outcome is high precision and recall scores together. We also present the MSE scores under which the MAKER-ER- and MAKER-BRB-based models were optimised. In addition, we report the area under the receiver operating characteristic curve (AUCROC), because this metric provides a better measure than accuracy alone, and the area under the precision-recall curve (AUCPR), which is better suited to imbalanced data. The AUC value ranges from .5 to 1: the closer the AUC score is to 1.0, the more accurate the classifier, while an AUCROC score of .5 indicates a random classifier. Further explanation of these measures appears in Section 3.8.
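AUCROC can be computed without plotting the curve, via the rank-sum (Mann-Whitney) identity: it equals the probability that a randomly chosen positive scores above a randomly chosen negative. A minimal from-scratch sketch (the thesis presumably uses a library implementation):

```python
def auc_roc(scores, labels):
    """AUCROC via the rank-sum identity; labels are 1 (positive) / 0 (negative).
    0.5 corresponds to a random ranking, 1.0 to a perfect one.
    Tied scores receive the average rank of their tie block."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1  # extend over the block of tied scores
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```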
For SVM, NN, KNN, weighted KNN, and CT, we also tuned the hyperparameters of the classifiers. We utilised GridSearchCV in scikit-learn (Python) to find the optimal hyperparameters based on the five-round model-training method. Using accuracy as the only performance measure would be misleading, since this dataset is highly imbalanced (1:4.5). We therefore used the F-beta score, the weighted harmonic mean of precision and recall, and set beta to 1 in this study. The hyperparameters with the highest F-beta score on the held-out data after the five-round training method were selected. The hyperparameters of the above-mentioned classifiers are discussed in Section 2.3, and Table 6.18 lists the selected values.
Table 6.18. The selected hyperparameters of CT, SVM, KNN, Weighted KNN, and NN for customer decision models
Classifier Selected hyperparameters
CT The maximum depth = 4; the minimum number of samples per leaf = 50;
the minimum number of samples per split = 170
SVM Penalty parameter C = 1; the kernel type is linear.
KNN k = 25
Weighted KNN k = 33
NN Multilayer perceptron is selected. The number of hidden layers
= 1; the number of neurons in the hidden layer = 6; the activation
function is linear.
6.5.1. Accuracies, Precisions, Recalls, and F-beta Scores
Table 6.19 presents the F-beta scores with a beta value of 1 as the weight parameter; we define 'wait' as the negative class. Tables 6.20 - 6.22 present the accuracy, precision, and recall scores respectively for each class over the five train-and-test sets. These measurements used a threshold value of .50 for the probability-based classifiers and 0 for SVM, given the binary outcome, the 'buy' and 'wait' classes. The machine learning models presented in the tables were selected based on the F-beta score, as previously explained, and the models with the best hyperparameters were then compared against the MAKER-based models.
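The thresholded metrics above follow directly from the confusion-matrix counts. A minimal sketch, assuming labels coded 1 ('buy', positive) and 0 ('wait', negative) and thresholding the predicted positive-class probability at .5:

```python
def precision_recall_fbeta(y_true, prob_pos, beta=1.0, threshold=0.5):
    """Precision, recall, and F-beta for the positive class, obtained by
    thresholding the predicted positive-class probability. With beta = 1
    this is the F1 score (the harmonic mean of precision and recall)."""
    y_pred = [1 if p >= threshold else 0 for p in prob_pos]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta
```

Swapping the label coding gives the same metrics for the 'wait' class, as reported in Tables 6.21 and 6.22.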
For a comprehensive evaluation of each classifier’s performance, we used the average score of each performance metric – that is, accuracy, precision, recall, and F-beta score – for each classifier. The three numbers highlighted in bold indicate the first-, second-, and third-best classifiers on the corresponding measure. Overall, the average scores across all the classifiers over the five test sets were .351 for the F-beta score, .829 for accuracy, .848 for the precision of the ‘buy’ class, .615 for the precision of ‘wait’, .961 for the recall of ‘buy’, and .255 for the recall of ‘wait’.
The results showed that the MAKER-based classifiers were among the best three classifiers in terms of the F-beta score, accuracy, precision of the ‘buy’ class, and recall of the ‘wait’ class. The average precision for ‘wait’ in the MAKER-ER- and MAKER-BRB-based classifiers was .594 and .596 respectively, close to the grand average of .615 across all the classifiers over the five test sets. For the recall of ‘buy’, the average scores of the MAKER-ER- and MAKER-BRB-based models were .952 and .948 respectively, close to the average of .961 across all the classifiers over the five test sets.
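Computing precision and recall separately for the ‘buy’ and ‘wait’ classes, as reported above, can be sketched as follows; this minimal implementation is illustrative, and the function name is an assumption.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, classes=("buy", "wait")):
    """Precision and recall computed separately for each class,
    treating that class as the positive class in turn."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        out[c] = {"precision": float(precision), "recall": float(recall)}
    return out
```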
Table 6.19. F-beta scores for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .409 .411 .418 .382 .420 .408 .015
MAKER-BRB .420 .456 .422 .393 .426 .423 .022
LR .330 .283 .308 .290 .305 .303 .018
SVM .185 .142 .157 .157 .187 .166 .020
NN .344 .326 .346 .310 .338 .333 .015
CT .471 .501 .471 .475 .462 .476 .015
NB .364 .336 .362 .342 .349 .351 .012
KNN .371 .314 .346 .357 .298 .337 .031
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .351 .341 .343 .336 .351 .344 .007
QD .515 .370 .343 .524 .343 .419 .092
Test
MAKER-ER .337 .507 .355 .366 .429 .399 .070
MAKER-BRB .423 .383 .378 .475 .399 .412 .040
LR .307 .274 .292 .333 .314 .304 .022
SVM .138 .161 .203 .192 .147 .168 .028
NN .309 .327 .313 .364 .329 .328 .022
CT .462 .542 .474 .417 .487 .476 .045
NB .368 .332 .364 .378 .296 .348 .034
KNN .276 .290 .280 .287 .364 .299 .036
Weighted KNN .388 .445 .344 .323 .447 .390 .057
LD .324 .363 .336 .371 .324 .344 .022
QD .452 .357 .319 .528 .311 .393 .094
Table 6.20. Accuracies for customer decision models
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .838 .834 .838 .831 .834 .835 .003
MAKER-BRB .834 .839 .837 .826 .832 .834 .005
LR .833 .830 .831 .826 .829 .830 .003
SVM .819 .819 .819 .819 .821 .819 .001
NN .829 .830 .831 .825 .832 .829 .003
CT .862 .855 .861 .863 .861 .860 .003
NB .823 .822 .822 .820 .825 .822 .002
KNN .841 .836 .838 .838 .832 .837 .003
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .827 .826 .827 .822 .827 .826 .002
QD .833 .828 .827 .831 .827 .829 .003
Test
MAKER-ER .809 .847 .816 .837 .842 .830 .017
MAKER-BRB .826 .819 .814 .854 .837 .830 .016
LR .823 .831 .827 .842 .829 .830 .007
SVM .813 .824 .822 .826 .805 .818 .009
NN .816 .831 .819 .842 .825 .827 .010
CT .856 .857 .862 .854 .862 .858 .004
NB .826 .819 .824 .834 .814 .823 .008
KNN .823 .834 .816 .832 .844 .830 .011
Weighted KNN .818 .836 .814 .819 .839 .825 .011
LD .813 .829 .822 .841 .822 .825 .010
QD .819 .824 .816 .832 .819 .822 .007
Table 6.21. Precisions of the test sets for customer decision models
Model/Iteration 1st 2nd 3rd 4th 5th Average Stdev
Buy
MAKER-ER .850 .880 .850 .850 .860 .858 .013
MAKER-BRB .860 .850 .850 .870 .850 .856 .009
LR .840 .840 .840 .840 .840 .840 .000
SVM .820 .830 .830 .830 .820 .826 .005
NN .840 .840 .840 .850 .850 .844 .005
CT .860 .880 .860 .860 .870 .866 .009
NB .850 .840 .850 .850 .840 .846 .005
KNN .840 .840 .840 .840 .850 .842 .004
Weighted KNN .850 .860 .850 .840 .860 .852 .008
LD .840 .850 .840 .850 .840 .844 .005
QD .870 .850 .840 .890 .840 .858 .022
Wait
MAKER-ER .480 .640 .520 .680 .650 .594 .088
MAKER-BRB .560 .530 .510 .740 .640 .596 .094
LR .570 .700 .630 .800 .620 .664 .089
SVM .500 .770 .650 .750 .400 .614 .160
NN .520 .640 .540 .750 .580 .606 .093
CT .770 .680 .840 .820 .800 .782 .063
NB .580 .540 .560 .630 .500 .562 .048
KNN .590 .740 .530 .710 .750 .664 .098
Weighted KNN .520 .610 .510 .540 .620 .560 .051
LD .500 .600 .560 .720 .550 .586 .083
QD .520 .570 .520 .560 .530 .540 .023
Table 6.22. Recalls of the test sets for customer decision models
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Buy
MAKER-ER .940 .950 .940 .970 .960 .952 .013
MAKER-BRB .940 .940 .930 .970 .960 .948 .016
LR .960 .980 .970 .990 .970 .974 .011
SVM .980 .990 .990 .990 .970 .984 .009
NN .950 .970 .960 .980 .960 .964 .011
CT .980 .950 .990 .990 .980 .978 .016
NB .960 .950 .950 .960 .950 .954 .005
KNN .970 .990 .960 .980 .980 .976 .011
Weighted KNN .930 .950 .940 .960 .950 .946 .011
LD .940 .960 .960 .980 .960 .960 .014
QD .920 .960 .950 .910 .960 .940 .023
Wait
MAKER-ER .260 .420 .270 .250 .320 .304 .070
MAKER-BRB .340 .300 .300 .350 .290 .316 .027
LR .210 .170 .190 .210 .210 .198 .018
SVM .080 .090 .120 .110 .090 .098 .016
NN .220 .220 .220 .240 .230 .226 .009
CT .330 .450 .330 .280 .350 .348 .063
NB .270 .240 .270 .270 .210 .252 .027
KNN .180 .180 .190 .180 .240 .194 .026
Weighted KNN .310 .350 .260 .230 .350 .300 .054
LD .240 .260 .240 .250 .230 .244 .011
QD .400 .260 .230 .500 .220 .322 .123
This result indicates that the MAKER-based models performed better than LR, SVM, NN, LD, QD, KNN, and Weighted KNN. LR, one of the simple interpretable classifiers, largely failed to predict the ‘wait’ class correctly, as shown by its ‘wait’ recall of .198. The ‘wait’ recall of SVM was similarly low, at .098. The classification tree, one of the complex interpretable classifiers, showed slightly better performance in predicting customer decisions.

Hence, at the threshold of .50, the MAKER-ER- and MAKER-BRB-based classifiers outperformed the other simple interpretable classifiers, that is, LR, KNN, Weighted KNN, LD, and QD. They also outperformed other complex machine learning methods, including SVM and NN. Despite its complexity, the classification tree performed slightly better than the other classifiers.
6.5.2. MSEs and AUCs
In this section, we report probability and ranking metrics, namely MSEs, AUCROCs, and AUCPRs. Classifiers generate a probabilistic output that shows the degree to which an observation is a member of a class. The performance metrics explained above – that is, accuracy, precision, and recall – convert this probabilistic output into a discrete classification. The threshold value of .50 is the cut-off point: any probabilistic output above the cut-off indicates the positive class and any output below it indicates the negative class. A ROC curve plots the true positive rate against the false positive rate for different cut-off points. A PR curve plots precision against the true positive rate, also known as recall, for different classification thresholds. The AUC is a single scalar value that reflects a classifier’s performance regardless of the classification threshold.
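The two threshold-free metrics can be computed as sketched below, assuming scikit-learn is available and the ‘buy’ class is coded as 1; `average_precision_score` is used here as a common estimator of the AUCPR, and the function name is an assumption.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def threshold_free_scores(y_true, p_buy):
    """AUCROC summarises the ROC curve over all cut-off points;
    average precision summarises the PR curve (AUCPR), which is
    more informative on highly imbalanced data."""
    return {
        "auc_roc": float(roc_auc_score(y_true, p_buy)),
        "auc_pr": float(average_precision_score(y_true, p_buy)),
    }
```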
In the following paragraphs, we provide the AUCROC and AUCPR scores. The latter is recommended for highly imbalanced data. Moreover, these two metrics provide a better performance measure than the threshold metrics of accuracy, precision, and recall. We also report the MSE as a probability metric; it measures the gap between the predicted values, that is, the probabilities generated by a classifier, and the actual values.
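As a minimal sketch, the MSE on predicted probabilities (the Brier score for binary outcomes) can be computed as follows, assuming ‘buy’ is coded 1 and ‘wait’ 0; the function name is an assumption.

```python
import numpy as np

def probability_mse(y_true, p_buy):
    """Mean squared gap between the predicted probability of 'buy'
    and the actual outcome (1 = 'buy', 0 = 'wait')."""
    y_true = np.asarray(y_true, dtype=float)
    p_buy = np.asarray(p_buy, dtype=float)
    return float(np.mean((p_buy - y_true) ** 2))
```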
Figure 6.9 shows the ROC curves of all the classifiers for all the test sets of the dataset. The five differently coloured lines in each panel present the ROC curve for the test set of each round: round 1 had the 1st fold as its test set, round 2 the 2nd fold, and so on. The diagonal red line represents a random classifier with an AUCROC score of .5. The further a curve lies from this red diagonal, or the closer it is to the top-left corner, the better the classifier. The blue line indicates the average ROC curve over the five test sets, and the grey area illustrates the dispersion of the curves over the five rounds, that is, ± 1 standard deviation. Figure 6.10 presents the PR curves for all the classifiers over the five rounds. As with the ROC curves, the differently coloured lines indicate the PR curve of each round’s test set. The closer a curve is to the top-right corner, the better the performance of the classifier. The grey area again indicates the dispersion of the curves at ± 1 standard deviation.
[Figure panels: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, SVM, NN, CT]
Figure 6.9. The ROC curve of MAKER-ER-based classifier, MAKER-BRB-based classifier, and all the alternative machine learning methods for the test sets of the customer-decision dataset
[Figure panels: NB, Weighted KNN, KNN, QD, LD]
Figure 6.9. Continued.
[Figure panels: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, CT, SVM, NN, NB, Weighted KNN]
Figure 6.10. The PR curve of MAKER-ER-based classifier, MAKER-BRB-based classifier, and all the alternative machine learning methods for the test sets of the customer-decision dataset
[Figure panels: KNN, QD, LD]
Figure 6.10. Continued.
Table 6.23 displays the MSEs, AUCROCs, and AUCPRs of the classifiers for the training and test sets of all five rounds. The metric scores for the training sets were similar to those for the test sets, meaning that the classifiers learned and generalised the pattern of the data and performed well on unseen data. The grand averages across all the classifiers over the five test sets were .121, .825, and .519 for MSE, AUCROC, and AUCPR respectively.
The highlighted scores in Table 6.23 indicate the best, second-best, and third-best performances among the classifiers. It is evident that the MAKER-ER- and MAKER-BRB-based models, along with the classification tree, outperformed the other classifiers in terms of the three metrics. The average scores (and standard deviations) of the AUCROCs for the MAKER-ER- and MAKER-BRB-based models were .836 (.019) and .848 (.020) respectively. Both classifiers likewise ranked second in terms of MSE, at .114 (.005) and .113 (.006) respectively. According to the average AUCPRs, the MAKER-based classifiers performed better than all the classifiers except the classification tree, with scores of .544 (.048) and .562 (.036) for the MAKER-ER- and MAKER-BRB-based models respectively. According to Table 3.3 in Section 3.8.3, an AUCROC between .8 and .9 indicates good discrimination.
Subtle differences were noted in the average AUCROCs in Table 6.23. For example, the average AUCROC of the MAKER-ER-based model was .836, nearly the same as the averages of SVM (.840) and NN (.829). However, the average AUCPR of the MAKER-ER-based model was .544, a difference of .034 compared with NN (.510) and .043 compared with SVM (.501). This score was also close to the average AUCPR of all the classifiers (.519). The MAKER-BRB-based model and the classification tree performed best in terms of the average AUCPR over the five test sets, at .562 and .636 respectively.
Thus, we concluded that the performance of the MAKER-ER- and MAKER-BRB-based models in predicting customer decisions in this study was superior to that of the complex machine learning methods, that is, SVM and NN. They also performed better than the simple interpretable classifiers, namely LR, KNN, weighted KNN, LD, and QD.
Table 6.23. MSEs and AUCs of classifiers for customer decision models
Train Test
Models/Folds 1st 2nd 3rd 4th 5th Avg Std CI (95%) 1st 2nd 3rd 4th 5th Avg Stdev CI (95%)
AUCROCs
MAKER-ER .844 .827 .835 .838 .835 .836 .006 .831-.841 .815 .861 .849 .821 .833 .836 .019 .819-.853
MAKER-BRB .851 .855 .861 .846 .853 .853 .006 .848-.858 .858 .840 .823 .875 .843 .848 .020 .830-.865
LR .831 .830 .832 .836 .833 .832 .002 .830-.834 .766 .845 .845 .806 .848 .822 .036 .791-.853
SVM .854 .842 .843 .845 .839 .845 .006 .840-.850 .797 .853 .858 .841 .851 .840 .025 .818-.862
NN .846 .832 .831 .838 .830 .835 .007 .830-.841 .812 .849 .837 .805 .840 .829 .019 .812-.846
CT .891 .887 .879 .895 .897 .890 .007 .884-.896 .861 .872 .894 .874 .886 .877 .013 .866-.889
NB .795 .796 .799 .815 .801 .801 .008 .794-.808 .738 .810 .830 .804 .782 .793 .035 .762-.824
KNN .836 .832 .828 .834 .831 .832 .003 .829-.835 .773 .812 .807 .768 .786 .789 .020 .772-.807
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.00-1.00 .805 .831 .837 .784 .796 .811 .023 .791-.830
LD .836 .839 .842 .841 .841 .840 .002 .838-.842 .773 .843 .861 .829 .846 .830 .034 .800-.860
QD .818 .801 .813 .817 .811 .812 .007 .806-.818 .791 .815 .804 .772 .820 .801 .019 .784-.818
MSEs
MAKER-ER .112 .115 .114 .114 .114 .114 .001 .113-.115 .122 .108 .116 .113 .114 .114 .005 .110-.119
MAKER-BRB .111 .110 .110 .113 .112 .111 .001 .110-.112 .112 .118 .120 .104 .111 .113 .006 .107-.118
LR .117 .121 .120 .120 .119 .119 .002 .118-.121 .130 .114 .118 .118 .120 .120 .006 .115-.125
SVM .133 .138 .135 .135 .133 .135 .002 .133-.137 .138 .135 .134 .131 .141 .136 .004 .132-.139
NN .113 .116 .116 .116 .116 .115 .001 .114-.117 .126 .111 .116 .117 .116 .117 .005 .112-.122
Table 6.23. Continued.
Train Test
Models/Folds 1st 2nd 3rd 4th 5th Avg Std CI (95%) 1st 2nd 3rd 4th 5th Avg Stdev CI (95%)
CT .095 .098 .098 .093 .093 .095 .002 .093-.097 .104 .103 .094 .100 .098 .100 .004 .096-.103
NB .127 .132 .128 .130 .129 .129 .002 .128-.131 .136 .131 .120 .128 .138 .131 .007 .125-.137
KNN .115 .116 .116 .115 .117 .116 .001 .115-.117 .127 .119 .126 .126 .120 .123 .004 .120-.127
Weighted KNN .000 .000 .000 .000 .000 .000 .000 .000-.000 .125 .115 .125 .129 .119 .123 .005 .118-.127
LD .118 .122 .121 .121 .121 .121 .002 .119-.122 .134 .116 .117 .119 .122 .121 .007 .115-.128
QD .123 .127 .126 .127 .126 .126 .002 .124-.127 .139 .126 .130 .125 .132 .130 .006 .125-.135
AUCPRs
MAKER-ER .548 .526 .549 .537 .540 .540 .009 .532-.548 .496 .589 .496 .594 .545 .544 .048 .502-.586
MAKER-BRB .573 .548 .566 .545 .555 .557 .012 .547-.568 .519 .587 .530 .601 .572 .562 .036 .531-.593
LR .514 .485 .496 .483 .499 .495 .013 .484-.506 .420 .528 .510 .543 .486 .497 .048 .455-.540
SVM .510 .487 .493 .483 .503 .495 .011 .485-.505 .434 .542 .517 .541 .469 .501 .048 .459-.542
NN .673 .655 .656 .673 .672 .666 .009 .657-.674 .445 .574 .503 .552 .479 .510 .053 .464-.557
CT .612 .625 .661 .624 .659 .636 .022 .616-.656 .612 .625 .661 .624 .659 .636 .022 .616-.656
NB .464 .444 .452 .454 .456 .450 .007 .444-.457 .420 .449 .507 .478 .411 .453 .040 .418-.488
KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.00-1.00 .483 .521 .461 .501 .499 .493 .022 .473-.512
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.00-1.00 .516 .562 .492 .480 .545 .519 .035 .489-.549
LD .518 .487 .497 .485 .503 .450 .013 .439-.462 .422 .525 .522 .553 .477 .500 .051 .455-.545
QD .493 .469 .480 .480 .489 .450 .009 .442-.459 .444 .488 .455 .495 .443 .465 .025 .443-.486
Despite its complexity, the classification tree performed slightly better than the MAKER-based classifiers. The MAKER-ER- and MAKER-BRB-based classifiers are interpretable classifiers with an integrated process of statistical analysis, belief-rule-based inference, and machine learning for predicting customer decisions in a dynamic-pricing environment. Hence, further analysis to drive managerial decision-making should be conducted.
6.6. Summary
This chapter presents the application of the hierarchical MAKER framework, namely the MAKER-ER- and MAKER-BRB-based classifiers, to customer decisions in an airline advanced booking setting. The two outputs of the models were ‘buy’ and ‘wait’, with six input variables being considered. These included provider-controlled variables, namely the length of the holding period and the average price trend; uncontrolled variables, namely the number of flights offered in a day and the time before departure; and personal variables, namely waiting patience and customer type in response to dynamic pricing.
This chapter consisted of six main sections. First, we described a conceptual framework that explains the input variables, the identification of customer decisions, and the data linkage. Based on the literature and a refinement process, six variables that might influence customer decisions were selected. Wait and buy decisions were considered in this study. In addition, we created a data linkage to integrate customer transaction records and price records to obtain a meaningful dataset for further analysis. Second, we explained the data preparation used in this study, including
data cleaning and data partitioning. Five folds for five-round cross-validation were used for all the classifiers.
Third, we demonstrated the formulation of groups of evidence, evidence acquisition
from data, interdependency indices, belief-rule-based inference, maximum likelihood
prediction, and the inference process for all the generated MAKER outputs at the top of the hierarchy. This process indicates how we constructed hierarchical rule-based modelling and prediction based on the MAKER framework, in which the input variables were split into several groups of evidence.
Given the optimised referential values and other model parameters, such as the weights and belief degrees of the belief rules, we used the training set of the first round to demonstrate both classifiers. Fourth, considering the highly imbalanced class distribution (1:4.5), we analysed and compared the models’ performance on the measures of accuracy, precision, recall, F-beta score, AUCROC, AUCPR, and MSE for all the classifiers. The analysis results indicated that the MAKER-ER- and MAKER-BRB-based classifiers outperformed eight of the nine alternative machine learning methods, while the classification tree showed a performance similar to that of both classifiers. Therefore, we concluded that the MAKER-ER- and MAKER-BRB-based models, as interpretable and robust classifiers, are suitable for predicting customer decisions. Furthermore, they can be utilised to learn about customer purchasing behaviour to assist managerial decision making.
Chapter 7 Conclusions and
Recommendations for Future Research
7.1. Conclusions
The existence of strategic customers potentially hurts providers’ revenue, causing significant profit losses. Researchers have developed theoretical models and formulated optimal provider responses to address strategic purchasing behaviour, that is, delaying a purchase in the hope of obtaining a lower price. Most of these methods were developed under the assumption that all customers act strategically. Another popular approach is controlled experiments. In addition, the numerous examples of cancel-rebook behaviour found in airline databases provide useful information for distinguishing strategic customers from non-strategic – namely, myopic – customers. Therefore, we developed a classification model for detecting strategic customers through their cancel-rebook behaviour. In addition, we developed a customer-decision model as a support system to help providers address strategic purchasing behaviour. Approaches based on statistics and machine learning applied to historical databases can be relatively cheap and are representative of real life rather than experimental conditions. Empirical approaches are also free of the assumptions on which theoretical models rely, such as how customers make their decisions and what factors influence those decisions.
The classification methods in widespread use at present have their own challenges, such as poor interpretability, overfitting, and low stability. These issues can influence the performance of these models in classification, that is, in predicting customer types and decisions. In this research, we proposed a new method: a hierarchical rule-based inferential modelling and prediction approach based on the MAKER framework. It integrates statistical analysis, rule-based inference, maximum likelihood prediction, and machine learning for classification in a hierarchical structure. The proposed model addresses the challenges of popular classification methods and deals with sparse matrices. The input variables are decomposed into several groups of evidence, each of which performs rule-based inferential modelling and prediction based on the MAKER framework. The outputs generated from each group of evidence are combined for a final inference.
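To make the idea of combining group outputs concrete, the toy sketch below merges two groups’ class probabilities with a weighted product rule and renormalisation. This is only an illustrative stand-in for the full ER-rule and MAKER combination described in earlier chapters; the weighting scheme and the function name are assumptions.

```python
import numpy as np

def combine_groups(p1, p2, w1=1.0, w2=1.0):
    """Toy conjunctive combination of two groups' class probabilities:
    discount each distribution by its weight (as an exponent), multiply
    elementwise, and renormalise so the result sums to one. NOT the full
    ER-rule/MAKER combination, just an illustration of the idea."""
    p1 = np.asarray(p1, dtype=float) ** w1
    p2 = np.asarray(p2, dtype=float) ** w2
    joint = p1 * p2
    return joint / joint.sum()
```

For instance, combining the distributions [0.8, 0.2] and [0.6, 0.4] over (‘buy’, ‘wait’) reinforces the class both groups favour.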
The proposed method enabled us to acquire evidence directly from the data and to
combine multiple pieces of evidence from input variables within a group of evidence
to generate a belief rule base. For any given inputs, we could generate outputs – the
probability associated with each class – using the belief rule and maximum likelihood
prediction. The outputs generated from each group of evidence were then combined
using the evidential reasoning rule or a belief rule base with trained belief degrees of the consequents. Using a machine learning algorithm, we optimised the model parameters to maximise the likelihood of the true state. The findings for this approach are summarised below.
• By proposing a conceptual framework and data linkage for detecting strategic
customers, we fulfilled research objective 1. The proposed conceptual
framework and data linkage were developed based on cancel-rebook behaviour
(see Section 5.3). The conceptual framework was tested using historical data. In
the case study, the input variables were good predictors.
• By proposing a conceptual framework and data linkage for predicting customer decisions, we achieved research objective 1. The proposed framework and data linkage were refined according to the availability of data (see Section 6.2). Wait-or-buy decisions in an advanced booking setting with zero deposit were considered. It was evident that the input variables in the framework were good predictors.
• By comparing the alternative approaches – popular methods in machine learning
– with the theory regarding classification, we achieved research objective 3. The
alternative classification methods that were used in this comparison included
SVM, NN, NB, LR, CT, KNN, weighted KNN, LD, and QD. On the basis of a rule-
based inference and maximum likelihood evidential reasoning (MAKER), the
MAKER-based framework that we propose was transparent and interpretable.
The relationship between inputs and outputs can be clearly analysed. Compared
with other interpretable machine learning models, such as LR, NB, KNN,
weighted KNN, LD, and QD, the proposed method performed better. In addition,
LR and NB assumes independence among input variables, but MAKER-based
framework does not.
• By comparing the performance of the various approaches on both datasets (i.e. the customer-type and customer-decision datasets), we achieved research objective 3. The MAKER-based classifiers outperformed most of the other models, that is, LR, SVM, NN, KNN, weighted KNN, NB, LD, and QD, and performed similarly to the classification tree. The differences in performance among classifiers were identified more clearly using AUCPR than AUCROC. Along with their ability to predict, the MAKER-based classifiers also measured the interdependence between input variables. This measure indicates whether – and the extent to which – input variables depend on each other. For illustration, Sections 5.5.7 and 6.4.7 report the probabilities generated by the framework regarding whether a customer was strategic or myopic, and the probabilities of a customer choosing to buy or wait.
• By applying a referential-value-based data discretisation technique in the
hierarchical MAKER framework, we achieved research objective 2. This
technique alleviated the information loss and distortion resulting from over-
generalisation caused by discretisation. It also captured the structure of the data
better than other discretisation techniques.
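The referential-value-based transformation mentioned here is commonly realised by distributing a continuous input over its two neighbouring referential values by linear interpolation, so that the belief degrees sum to one. A minimal sketch, with the function name assumed, is:

```python
def belief_distribution(x, ref_values):
    """Transform a continuous input into belief degrees over its
    referential values. Values at or beyond the extremes attach fully
    to the nearest referential value; interior values are split between
    the two neighbouring referential values by linear interpolation."""
    ref_values = sorted(ref_values)
    if x <= ref_values[0]:
        return {ref_values[0]: 1.0}
    if x >= ref_values[-1]:
        return {ref_values[-1]: 1.0}
    for lo, hi in zip(ref_values, ref_values[1:]):
        if lo <= x <= hi:
            beta_hi = (x - lo) / (hi - lo)  # degree attached to the upper value
            return {lo: 1.0 - beta_hi, hi: beta_hi}
```

Unlike hard binning, the split degrees retain how close the input lies to each referential value, which is the information-loss point made above.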
• By proposing a hierarchical rule-based modelling and prediction approach, we achieved research objective 2. The structure is applicable in the case of sparse matrices. We proposed decomposing the input variables into several groups of evidence. This approach avoids misleading and incorrect inferences caused either by violations of the statistical requirements for sample size or by information loss, as well as the multiplicative (computational) complexity of the many referential values of the input variables in the belief rules. The hierarchical structure enables MAKER-based classifiers to make predictions from several groups of evidence and to combine the outputs at the aggregate level for a final inference. The hierarchical MAKER framework performed well for both datasets (customer type and customer decision).
7.2. Limitations and Recommendations for Future Research
This research has several limitations.
• The hierarchical MAKER framework and the other machine learning methods were applied and tested on a case study in Indonesia; different datasets might therefore yield different model parameters and model performances.
• The datasets used in this research were highly skewed, and hence the dominance of the majority class can influence the development of the classification model.
• In this research, cancelled transactions were deleted from the datasets. This kind of information might be useful for capturing the ‘exit’ decisions made by customers.
Suggested directions for further study are summarised below.
• Based on the AUCROC scores, the hierarchical MAKER framework is an effective and adequate classifier. However, the AUCPR scores indicate that there is still much room for improvement, especially for highly skewed datasets; both the customer-type and customer-decision datasets had a highly imbalanced class distribution. The model parameters were trained by optimising the mean squared error (MSE), which represents the difference between the actual and predicted outputs. The MSE is obtained by averaging the squared differences over the dataset and may therefore favour the majority class. Further research could focus on improving the performance measure on which the machine learning algorithm is based, so that both classes are treated equally.
• For both datasets used in this study, groups of evidence were formed, and complete joint frequency matrices were obtained by decomposing the input variables. However, most large matrices are sparse because almost all of their entries are zeros. Decomposing only the input variables does not necessarily solve the problem of sampling zeros; it might be necessary to decompose down to the level of sub-rules, that is, the most frequently activated combinations of referential values. Hence, future research could focus on hierarchical rule-based inference composed of sub-rules.
• The customer-type and customer-decision datasets were retrieved from transactions made by customers who eventually bought tickets. In reality, ‘exit’ decisions also occur. Another potential direction for further study in the field of revenue management is thus extending the decision models to include exit decisions; a rule-based inferential modelling and prediction approach is applicable to multiple classification tasks.
277
References
Agre, G., & Peev, S. (2002). On Supervised and Unsupervised Discretization. Bulgarian Academy of Sciences, 2(43–57).
Anderson, C. K., & Wilson, J. G. (2003). Wait or buy? The strategic consumer: Pricing and profit implications. Journal of Operational Research Society, 54(3), 299–306.
Auria, L., & Moro, R. A. (2008). Support Vector Machines (SVM) as a Technique for Solvency Analysis (August 1, 2008). DIW Berlin Discussion Paper No. 811. Retrieved from http://dx.doi.org/10.2139/ssrn.1424949.
Aviv, Y., & Pazgal, A. (2008). Optimal Pricing of Seasonal Products in the Presence of Forward-Looking Consumers. Manufacturing & Service Operations Management, 10(3), 339–359.
Aviv, Yossi, Levin, Y., & Nediak, M. (2009). Counteracting Strategic Consumer Behavior in Dynamic Pricing Systems. In N. S & T. CS (Eds.), Consumer-Driven Demand and Operations Management Models (pp. 323–352).
Bagozzi, R. P., Gurhan-Canli, Z., & Priester, J. R. (2002). The social psychology of consumer behaviour. In The Social Psychology of Consumer Behaviour. Buckingham: Open University Press.
Belch, G. E., & Belch, M. A. (1998). Advertising and promotion : an integrated marketing communications perspective (4th ed.). Maidenhead: McGraw-Hill.
Besanko, D., & Winston, W. L. (1990). Optimal Price Skimming by a Monopolist Facing Rational Consumers. Management Science, 36(5), 555–567.
Besbes, O., & Lobel, I. (2015). Intertemporal Price Discrimination : Structure and Computation of Optimal Policies. Management Science, 61(1), 92–110.
Bilotkach, V. (2010). Reputation, search cost, and airfares. Journal of Air Transport Management, 16(5), 251–257.
Binaghi, E., & Madella, P. (1999). Fuzzy Dempster – Shafer Reasoning for Rule-Based Classifiers. International Journal of Intellegent Systems, 14(6), 559–583.
Bishop, C. M. (2006). Pattern recognition and machine learning. In Information Science and Statistics. New York, N.Y: Springer.
Bishop, Y. M. M. (2007). Discrete Multivariate Analysis Theory and Practice (S. E. Fienberg & P. W. Holland, Eds.). Retrieved from https://doi.org/10.1007/978-0-387-72806-3.
Bodur, H. O., Klein, N. M., & Arora, N. (2015). Online price search: Impact of price comparison sites on offline price evaluations. Journal of Retailing, 91(1), 125–139.
Boggs, P. T., & Tolle, J. W. (1995). Sequential Quadratic Programming *. Acta Numerica, 4, 1–51.
Boyd, E. A., & Bilegan, I. C. (2003). Revenue Management and E-Commerce.
278
Management Science, 49(10), 1363–1386.
Cachon, G. P., & Swinney, R. (2009). Purchasing, Pricing, and Quick Response in the Presence of Strategic Consumers. Management Science, 55(3), 497–511.
Carvalho, D. V, Pereira, E. M., & Cardoso, J. S. (2019). Machine Learning Interpretability : A Survey on Methods and Metrics. Electronics, 8(2019), 1–34.
Cason, T. N., & Reynolds, S. S. (2005). Bounded rationality in laboratory bargaining with asymmetric information. Economic Theory, 25(3), 553–574.
Chan, C.-C., Batur, C., & Srinivasan, A. (1991). Determination of quantization intervals in rule based model for dynamic systems. Proceedings of the IEEE Conference on Systems, Man, and Cybernetics, 1719–1723. Charlottesvile, Virginia.
Chandon, P., Wansink, B., & Laurent, G. (2000). A Benefit Congruency Framework of Sales Promotion Effectiveness. Journal of Marketing, 64(October 2000), 65–81.
Chang, L., Zhou, Y., Jiang, J., Li, M., & Zhang, X. (2013). Structure learning for belief rule base expert system: A comparative study. Knowledge-Based Systems, 39, 159–172.
Chen, C.-C., & Schwartz, Z. (2006). The Importance of Information Asymmetry in Customers’ Booking Decisions: A Cautionary Tale from the Internet. Cornell Hotel and Restaurant Administration Quarterly, 47(3), 272–285.
Chen, C.-C., & Schwartz, Z. (2008). Timing Matters: Travelers’ Advanced-Booking Expectations and Decisions. Journal of Travel Research, 47(1), 35–42.
Chen, C. C., Schwartz, Z., & Vargas, P. (2011). The search for the best deal: How hotel cancellation policies affect the search and booking decisions of deal-seeking customers. International Journal of Hospitality Management, 30(1), 129–135.
Chen, C., & Schwartz, Z. (2008). Room rate patterns and customers’ propensity to book a hotel room. Journal of Hospitality & Tourism Research, 32(3), 287–306.
Chen, Chihchien. (2016). Cancellation policies in the hotel, airline and restaurant industries. Journal of Revenue and Pricing Management,15(3–4), 271–276.
Chen, Q., Whitbrook, A., Aickelin, U., & Roadknight, C. (1960). Data Classification Using the Dempster-Shafer Method.
Chevalier, J., & Goolsbee, A. (2009). Are Durable Goods Consumers Forward-Looking? Evidence from College Textbooks. The Quarterly Journal of Economics, 124(4), 1853–1884.
Cho, M., Fan, M., & Zhou, Y. (2008). Strategic Consumer Response to Dynamic Pricing of Perishable Product. In International Series in Operations Research and Management Science, 131, 435-458.
Choi, S., & Kimes, S. E. (2002). Electronic distribution channels’ effect on hotel revenue management. The Cornell Hotel and Restaurant Administration Quarterly, 43(3), 23–31.
279
Christou, E. (2011). Exploring Online Sales Promotions in the Hospitality Industry Exploring Online Sales Promotions. Journal of Hospitality Marketing & Management, 20, 814–829.
Clark, R. A., & Goldsmith, R. E. (2005). Market Mavens : Psychological Influences. Psychology & Marketing, 22(4), 289–312.
Clemons, E. K., Hann, I.-H., & Hitt, L. M. (2002). Price Dispersion and Differentiation in Online Travel: An Empirical Investigation. Management Science, 48(4), 534–549.
Cleophas, C., & Bartke, P. (2011). Modeling strategic customers using simulations - With examples from airline revenue management. Procedia - Social and Behavioral Sciences, 20, 1060–1068.
Cooper, W. L., Homem-de-Mello, T., & Kleywegt, A. J. (2006). Models of the Spiral-Down Effect in Revenue Management. Operations Research, 54(5), 968–987.
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative & mixed methods approaches (5th ed.). Los Angeles: Sage.
Darpy, D. (2000). Consumer Procrastination and Purchase Delay. 29th Annual Conference EMAC, 1–7.
Dasu, S., & Tong, C. (2010). Dynamic pricing when consumers are strategic: Analysis of posted and contingent pricing schemes. European Journal of Operational Research, 204(3), 662–671.
Davis, J., & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA.
Dekay, F., Yates, B., & Toh, R. S. (2004). Non-performance penalties in the hotel industry. International Journal of Hospitality Management, 23(3), 273–286.
Dempster, A. P. (2008). The Dempster – Shafer calculus for statisticians. International Journal of Approximate Reasoning, 48(2), 365–377.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Machine Learning: Proceedings of the 12th International Conference, 194–202.
Eren, S. S., & Parker, J. (2010). Monopoly pricing with limited demand information. Journal of Revenue and Pricing Management, 9(1–2), 23–48.
Etzioni, O., Tuchinda, R., Knoblock, C. a., & Yates, A. (2003). To buy or not to buy: Mining airfare data to minimize ticket purchase price. Knowledge Discovery and Data Mining Proceedings of the Ninth ACM SIGKDD International Conference, 119–128.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. The 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1022–1027.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge: MIT Press.
Fienberg, S. E., & Rinaldo, A. (2007). Three centuries of categorical data analysis: Log-linear models and maximum likelihood estimation. Journal of Statistical Planning and Inference, 137(11), 3430–3445.
Flightdelayclaimsteam.com. (2019). 15 Flight Hacks you can use for Ridiculously Cheap Bookings Today. Available at: https://www.flightdelayclaimsteam.com/flight-hacks-for-cheaper-bookings-you-can-use-today/. [Accessed 18 July 2019].
Fortin, D. R. (2000). Clipping Coupons in Cyberspace: A Proposed Model of Behavior for Deal-Prone Consumers. Psychology and Marketing, 17(6), 515–534.
Gollwitzer, P. M., & Brandstätter, V. (1997). Implementation Intentions and Effective Goal Pursuit. Journal of Personality and Social Psychology, 73(1), 186–199.
Gönsch, J., Klein, R., Neugebauer, M., & Steinhardt, C. (2013). Dynamic pricing with strategic customers. Journal of Business Economics, 83(5), 505–549.
Gorin, T., Walczak, D., Bartke, P., & Friedemann, M. (2012). Incorporating cancel and rebook behavior in revenue management optimization. Journal of Revenue and Pricing Management, 11(6), 645–660.
Granados, N., Kauffman, R. J., Lai, H., & Lin, H. C. (2012). À La Carte Pricing and Price Elasticity of Demand in Air Travel. Decision Support Systems, 53(2), 381–394.
Hayes, D. K., & Miller, A. (2011). Revenue Management for the Hospitality Industry. Journal of Revenue and Pricing Management, 11(4), 479–480.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Delhi: Pearson Education.
Hendler, J. (2014). Data Integration for Heterogenous Datasets. Big Data, 2(4), 205–215.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. In Wiley Series in Probability and Statistics (3rd ed.). Hoboken, NJ: Wiley.
Hossin, M., & Sulaiman, M. N. (2015). A Review On Evaluation Metrics for Data Classification Evaluations. International Journal of Data Mining & Knowledge Management Process (IJDKP), 5(2), 1–11.
Ivanov, S., & Zhechev, V. (2012). Hotel revenue management – a critical literature review. Tourism Review, 60(2), 175–197.
Jerath, K., Netessine, S., & Veeraraghavan, S. K. (2010). Revenue Management with Strategic Customers: Last-Minute Selling and Opaque Selling. Management Science, 56(3), 430–448.
Jung, K., Cho, Y. C., & Lee, S. (2014). Online shoppers’ response to price comparison sites. Journal of Business Research, 67(10), 2079–2087.
Kannan, P. K., & Kopalle, K. (2001). Dynamic Pricing on the internet-importance and Implications for Consumer Behavior. International Journal of Electronic Commerce, 5(3), 63–83.
Karim, M., & Rahman, R. M. (2013). Decision Tree and Naïve Bayes Algorithm for Classification and Generation of Actionable Knowledge for Direct Marketing. Journal of Software Engineering and Applications, 6(4), 196–206.
Kateri, M., & Iliopoulos, G. (2010). On collapsing categories in two-way contingency tables. Statistics, 37(5), 443-455.
Kerber, R. (1992). Chimerge: Discretization of numeric attributes. Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.
Kimes, S. E. (1989). The Basics of Yield Management. Cornell Hotel and Restaurant Administration Quarterly, 30(3), 14–19.
Kimes, S. E. (2003). Revenue Management: A Retrospective. Cornell Hotel and Restaurant Administration Quarterly, 44(5), 131–138.
Knox, S. W. (2018). Machine learning: A concise introduction. Hoboken, NJ: John Wiley & Sons, Inc.
Kong, G., Xu, D. L., Yang, J. B., Yin, X., Wang, T., Jiang, B., & Hu, Y. (2016). Belief rule-based inference for predicting trauma outcome. Knowledge-Based Systems, 95, 35–44.
Kraft, D. (1988). A software package for sequential quadratic programming. Tech. Rep. DFVLR-FB 88-28, DLR German Aerospace Center - Institute for Flight Mechanics, Köln, Germany.
Kwon, K., & Kwon, Y. J. (2013). Heterogeneity of deal proneness : Value-mining , price-mining , and encounters. Journal of Retailing and Consumer Services, 20(2), 182–188.
Lai, G., Debo, L. G., & Sycara, K. (2010). Buy Now and Match Later: Impact of Posterior Price Matching on Profit with Strategic Consumers. Manufacturing & Service Operations Management, 12(1), 33–35.
Lee, W.-M. (2019). Python machine learning. Indianapolis, IN: Wiley.
Levin, Y., McGill, J., & Nediak, M. (2009). Dynamic Pricing in the Presence of Strategic Consumers and Oligopolistic Competition. Management Science, 55(1), 32–46.
Li, J., Granados, N. F., & Netessine, S. (2014). Are Consumers Strategic? Structural Estimation from the Air-Travel Industry. Management Science, 60(9), 2114–2137.
Lichtenstein, D. R., Netemeyer, R. G., & Burton, S. (1990). Distinguishing coupon proneness from value consciousness: An acquisition-transaction utility theory perspective. Journal of Marketing, 54(3), 54–67.
Lichtenstein, D. R., Ridgway, N. M., & Netemeyer, R. G. (1993). Price Perceptions and Consumer Shopping Behavior: A Field Study. Journal of Marketing Research, 30(2), 234–246.
Lin, F., & Cohen, W. W. (2010). Semi-supervised classification of network data using very few labels. Proceedings - 2010 International Conference on Advances in Social Network Analysis and Mining, ASONAM 2010, 192–199.
Liong, C., & Foo, S. (2013). Comparison of linear discriminant analysis and logistic regression for data classification. AIP Conference Proceedings, 1522(1), 1159–1165.
Littlewood, K. (2005). Forecasting and control of passenger bookings. Journal of Revenue & Pricing Management, 4(2), 111–123.
Liu, Q., & van Ryzin, G. J. (2008). Strategic Capacity Rationing to Induce Early Purchases. Management Science, 54(6), 1115–1131.
Lorenz, T. (2019). 7 sites to find book now pay later hotels. Available at: https://www.finder.com.au/book-now-pay-later-hotels. [Accessed 18 July 2019].
Liu, Q., & Ying, W. (2012). Supervised learning. In N. M. Seel (Ed.), Encyclopedia of the Sciences of Learning. Boston, MA: Springer.
Maimon, O., & Rokach, L. (2005). Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications. World Scientific.
Mak, V., Rapoport, A., Gisches, E. J., & Han, J. (2014). Purchasing Scarce Products Under Dynamic Pricing: An Experimental Investigation. Manufacturing & Service Operations Management, 16(3), 425–438.
Maratea, A., Petrosino, A., & Manzo, M. (2014). Adjusted F-measure and kernel scaling for imbalanced data learning. Information Sciences, 257, 331–341.
Meissner, J., & Strauss, A. K. (2010). Pricing structure optimization in mixed restricted/unrestricted fare environments. Journal of Revenue and Pricing Management, 9(5), 399–418.
Molnar, C. (2019). Interpretable machine learning. A Guide for Making Black Box Models Explainable. Available at: http://christophm.github.io/interpretable-ml-book/. [Accessed 18 June 2019].
Mori, T., & Uchihira, N. (2019). Balancing the trade-off between accuracy and interpretability in software defect prediction. Empirical Software Engineering, 24(2), 779-825.
Nair, H. (2007). Intertemporal price discrimination with forward-looking consumers: Application to the US market for console video-games. Quantitative Marketing and Economics, 5(3), 239–292.
Nasiry, J., & Popescu, I. (2012). Advance Selling When Consumers Regret. Management Science, 58(6), 1160–1177.
Osadchiy, N., & Bendoly, E. (2011). Are Consumers Really Strategic? Implications from an Experimental Study. 2011 MSOM Annual Conference.
Ovchinnikov, A., & Milner, J. M. (2012). Revenue management with end-of-period discounts in the presence of customer learning. Production and Operations Management, 21(1), 69–84.
Özer, Ö., & Zheng, Y. (2015). Markdown or Everyday Low Price? The Role of Behavioral Motives. Management Science, 62(2), 326-346.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Qiwen, J., Weijun, Z., & Youyan, H. (2010). Revenue Management in the Service Industry: Research Overview and Prospect. International Conference on Management and Service Science (MASS), 1–5.
Reed, S. E., & Lee, H. (2015). Training deep neural networks on noisy labels with bootstrapping. In ICLR, 1–11.
Ren, J. (2012). ANN vs. SVM: Which one performs better in classification of MCCs in mammogram imaging. Knowledge-Based Systems, 26, 144–153.
Reynolds, S. S. (2000). Durable-Goods Monopoly: Laboratory Market and Bargaining Experiments. The RAND Journal of Economics, 31(2), 375–394.
Richeldi, M., & Rossotto, M. (1995). Class-driven statistical discretization of continuous attributes (extended abstract). In Lecture Notes in Artificial Intelligence 914 (N. Lavrac, pp. 335–338). Berlin, Heidelberg, New York: Springer Verlag.
Rickwood, C., & White, L. (2009). Pre-purchase decision-making for a complex service: retirement planning. Journal of Services Marketing, 23(3), 145–153.
Rohde, C. A. (2014). Introductory Statistical Inference with the Likelihood Function. Cham: Springer International Publishing.
Ruth, J. A. (2001). Promoting a Brand's Emotion Benefits: The Influence of Emotion Categorization Processes on Consumer Evaluations. Journal of Consumer Psychology, 11(2), 99–113.
Sahay, A. (2007). How to reap higher profits with dynamic pricing. MIT Sloan Management Review, 48(4), 53–62.
Schwartz, Z. (2000). Changes in Hotel Guests’ Willingness To Pay as The Date of Stay Draws Closer. Journal of Hospitality & Tourism Research, 24(2), 180–198.
Schwartz, Z. (2006). Advanced booking and revenue management: Room rates and the consumers’ strategic zones. International Journal of Hospitality Management, 25(3), 447–462.
Shen, Z. M., & Su, X. (2007). Customer Behavior Modeling in Revenue Management and Auctions: A Review and New Research Opportunities. Production & Operations Management, 16(6), 780–790.
Simon, H. A. (1955). A Behavioral Model of Rational Choice. The Quarterly Journal of Economics, 69(1), 99–118.
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129–138.
Stokey, N. L. (1981). Rational Expectations and Durable Goods Pricing. The Bell Journal of Economics, 12(1), 112–128.
Su, X. (2007). Intertemporal Pricing with Customer Behavior. Management Science, 53(5), 726–741.
Su, X. (2009). A Model of Consumer Inertia with Applications to Dynamic Pricing. Production & Operations Management, 18(4), 365–380.
Su, X., & Zhang, F. (2009). On the Value of Commitment and Availability Guarantees When Selling to Strategic Consumers. Management Science, 55(5), 713–726.
Swinney, R. (2011). Selling to Strategic Consumers When Product Value Is Uncertain: The Value of Matching Supply and Demand. Management Science, 57(10), 1737–1751.
Talluri, K. T., & van Ryzin, G. J. (2004). The Theory and Practice of Revenue Management. Boston: Kluwer Academic Publishers.
Tang, D., Yang, J.-B., Chin, K.-S., Wong, Z. S. Y., & Liu, X. (2011). A methodology to generate a belief rule base for customer perception risk analysis in new product development. Expert Systems with Applications, 38(5), 5373–5383.
Toh, R. S., Dekay, F., & Raven, P. (2012). Travel Planning: Searching for and Booking Online Seats on the Internet. Transportation Journal, 51(1), 80–98.
Tu, J. V. (1996). Advantages and Disadvantages of Using Artificial Neural Networks versus Logistic Regression for Predicting Medical Outcomes. Journal of Clinical Epidemiology, 49(11), 1225–1231.
Wang, M., Ma, M., Yue, X., & Mukhopadhyay, S. (2013). A capacitated firm’s pricing strategies for strategic consumers with different search costs. Annals of Operations Research, 240(2), 731–760.
Xu, D. (2011). An introduction and survey of the evidential reasoning approach for multiple criteria decision analysis. Annals of Operations Research, 195(1), 163–187.
Xu, D. L., Yang, J. B., & Wang, Y. M. (2006). The evidential reasoning approach for multi-attribute decision analysis under interval uncertainty. European Journal of Operational Research, 174(3), 1914–1943.
Xu, X., Zheng, J., Yang, J.-B., Xu, D.-L., & Chen, Y.-W. (2017). Data classification using evidence reasoning rule. Knowledge-Based Systems, 116, 144–151.
Yang, J.-B., & Xu, D.-L. (2014). A Study on Generalising Bayesian Inference to Evidential Reasoning. In Belief Functions: Theory and Applications - Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, Vol. 8764, pp. 180-189.
Yang, J. B. (2001). Rule and utility based evidential reasoning approach for multiattribute decision analysis under uncertainties. European Journal of Operational Research, 131(1), 31–61.
Yang, J. B., Liu, J., Wang, J., Sii, H. S., & Wang, H. W. (2006). Belief rule-base inference methodology using the evidential reasoning approach - RIMER. IEEE Transactions on Systems, Man, and Cybernetics Part A:Systems and Humans, 36(2), 266–285.
Yang, J. B., & Xu, D. L. (2013). Evidential reasoning rule for evidence combination. Artificial Intelligence, 205, 1–29.
Yang, J.-B., Liu, J., Xu, D.-L., Wang, J., & Wang, H. (2007). Optimization models for training belief-rule-based systems. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 37(4), 569–585.
Yang, J.-B., & Xu, D.-L. (2017). Inferential Modelling and Decision Making with Data. IEEE International Conference on Automation and Computing (ICAC 2017), Huddersfield, UK, 7–8 September.
Ye, T., & Sun, H. (2015). Price-setting newsvendor with strategic consumers. Omega, 63, 103-110.
Yip, S. (2019). 11 airlines and websites that offer layby flights to book now and pay later. Available at: https://www.finder.com.au/book-now-pay-later. [Accessed 18 July 2019].
Zbaracki, M. J., Ritson, M., Levy, D., Dutta, S., & Bergen, M. (2004). Managerial and Customer Costs of Price Adjustment: Direct Evidence from Industrial Markets. The Review of Economics and Statistics, 86(2), 514–533.
Zeelenberg, M. (1999). Anticipated Regret, Expected Feedback and Behavioral Decision Making. Journal of Behavioral Decision Making, 12(2), 93–106.
Zhang, D., & Cooper, W. L. (2008). Managing Clearance Sales in the Presence of Strategic Customers. Production & Operations Management, 17(4), 416–431.
Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied Artificial Intelligence, 17(5–6), 375–381.
Appendices
Examples of customer type and decision dataset (200 samples)
No. FB ICR TS HP DD APT NF WPT Decision Customer type
1 1 .0000 .4574 .4817 1.3185 .0636 3 .4574 buy myopic
2 1 .0000 .0009 .0972 1.4097 1.1565 28 .0009 buy myopic
3 2 .5054 .0732 .0779 1.6786 .7493 36 .6517 buy myopic
4 3 .0025 .1422 .3952 1.9972 .0841 4 .8975 buy strategic
5 1 .0000 .0201 .3955 1.6656 .2142 16 .0201 buy myopic
6 2 -.2791 .0648 .3950 1.8534 -.0813 4 .1281 buy strategic
7 1 .0000 .1520 .3954 2.0093 .0841 4 .1520 buy strategic
8 2 -.2823 .2772 .3955 2.0316 -.0157 72 .2721 buy strategic
9 2 -.3038 .0626 .3955 1.8767 -.1136 4 .1236 buy strategic
10 1 .0000 .0354 .3952 1.8841 -.1136 4 .0354 buy myopic
11 1 .0000 .0012 .0952 1.5848 -.3413 36 .0012 buy myopic
12 2 -.3038 .0626 .3956 1.9685 -.1458 4 .0935 wait strategic
13 2 -.2791 .0648 .3955 1.9698 -.1458 4 .1180 wait strategic
14 1 .0000 .1208 .3955 1.9719 -.1458 4 .1208 buy strategic
15 2 -.2823 .2772 .3956 2.1449 -.0262 72 .3956 wait strategic
16 1 .0000 .0351 .3953 2.2189 .0042 37 .0351 buy myopic
17 1 .0000 .0216 .3950 1.9248 .0391 5 .0216 buy myopic
18 2 .4830 .2166 .3955 1.3975 .1630 90 .9162 buy myopic
19 2 .0191 .2167 .3955 1.4017 .1630 90 .4525 buy myopic
20 3 .0025 .1422 .3957 2.8616 .0459 4 .4007 wait strategic
21 3 .0025 .1422 .3952 2.8667 .0459 4 .0029 wait strategic
22 1 .0000 .0519 .3951 2.8681 .0459 4 .0519 buy myopic
23 2 .0191 .2167 .3951 1.8159 -.1786 45 .3951 wait myopic
24 1 .0000 .2946 .3954 1.9816 .1193 62 .2946 buy strategic
25 1 .0000 .3360 .3952 1.8014 -.0056 32 .3360 buy strategic
26 1 .0000 .0437 .3949 2.0005 .0938 22 .0437 buy myopic
27 3 .2260 .2776 .3953 2.3988 -.1081 23 .8473 wait strategic
28 2 .0965 .2371 .3956 2.5275 .2571 19 .3956 wait strategic
29 2 .4830 .2166 .3947 2.2753 -.0893 45 .3947 wait myopic
30 1 .0000 .0054 .3948 1.5344 -.0066 66 .0054 buy myopic
31 1 .0000 .0647 .3950 3.3804 -.0833 22 .0647 buy myopic
32 3 .2260 .2776 .3952 2.8508 -.0698 23 .3952 wait strategic
33 1 .0000 .0482 .3956 2.0657 -.0312 34 .0482 buy myopic
34 1 .0000 .0027 .3953 1.8723 -.0875 44 .0044 buy strategic
35 2 .2151 .2638 .3952 1.6911 -.0615 26 .8253 wait strategic
36 1 .0000 .0473 .0872 1.0733 -.1239 24 .0473 buy myopic
37 1 .0000 .0740 .5145 4.2297 -.1149 45 .0740 buy myopic
38 1 .0000 .0106 .3956 4.2727 -.1250 22 .0106 buy myopic
39 1 .0000 .1971 1.2984 1.4026 .0745 17 .1971 buy strategic
40 2 .4412 .2008 .3956 4.4394 -.0841 4 .8428 buy strategic
41 1 .0000 .0370 .3951 1.8340 -.0207 13 .0370 buy myopic
42 1 .0000 .2908 .3956 3.0074 -.0930 9 .2908 buy myopic
43 1 .0000 .3047 .3951 3.0083 -.0930 9 .3047 buy myopic
44 2 .0590 .2299 .3117 1.6041 .1159 28 .6124 buy strategic
45 2 .5466 .4896 .1041 4.9090 -.2902 48 1.5257 buy strategic
46 2 .0891 .2796 .3953 1.6529 .1021 35 .7476 buy strategic
47 1 .0000 .2731 .3952 3.0508 -.0540 4 .2731 buy strategic
48 3 .3040 .2978 .3954 2.7801 .1930 9 .3954 wait strategic
49 1 .0000 .1120 .3951 4.6055 -.0509 47 .0364 buy strategic
50 3 .2823 .1661 .0621 .7399 .0308 74 .6891 buy strategic
51 1 .0000 .0015 .3954 4.8704 -.0475 12 .0015 buy myopic
52 1 .0000 .0366 .3953 5.2168 -.0781 23 .0366 buy myopic
53 1 .0000 .0029 .3952 2.0195 -.0745 4 .0033 buy strategic
54 1 .0000 .1312 .3956 3.0393 -.0484 9 .1312 buy strategic
55 5 .3264 .3375 .5132 4.0722 .1721 35 1.5584 wait myopic
56 8 .3508 .3251 .3952 3.8945 .0476 17 .3921 wait strategic
57 2 .4696 .1998 .3950 2.1609 .1061 32 .8692 buy strategic
58 2 4.2156 .8762 .3952 4.4403 -.0359 9 8.6492 wait strategic
59 1 .0000 .3028 .3955 5.0309 -.0354 49 .3028 buy myopic
60 7 .0451 .0226 .0414 1.1456 -.0091 68 .0824 wait myopic
61 5 .3264 .3375 .4822 4.5413 .2288 35 1.0584 wait myopic
62 1 .0000 .3865 .3953 1.7988 -.2597 13 .3865 buy myopic
63 3 .2823 .1661 .3950 1.3874 .0766 37 .3950 wait strategic
64 1 .0000 .0076 .0525 .6601 .5694 20 .0076 buy myopic
65 2 -.1081 .3645 .3956 2.0158 .0908 44 .6758 buy strategic
66 1 .0000 .0145 .3957 1.9477 .2545 23 .0145 buy myopic
67 1 .0000 .0946 .3950 1.5110 .1146 17 .0946 buy strategic
68 1 .0000 .1521 .3953 4.8960 -.0789 13 .1521 buy strategic
69 1 .0000 .1538 .3953 3.7315 -.0806 13 .1538 buy strategic
70 1 .0000 .1898 .3950 1.8874 .0912 4 .1898 buy strategic
71 1 .0000 .0169 .5187 5.8659 .1897 51 .0169 buy myopic
72 1 .0000 .1523 .3956 1.4866 .7339 23 .1523 buy strategic
73 1 .0000 .0025 .3956 1.5734 -.0260 35 .0025 buy myopic
74 1 .0000 .0178 .3957 3.7137 -.0455 12 .0178 buy myopic
75 1 .0000 .0053 .3954 1.6044 -.0260 35 .0053 buy myopic
76 2 .2950 .3919 .3954 5.6356 -.0499 23 .3954 wait myopic
77 1 .0000 .2603 .3954 6.1565 -.0851 6 .2603 buy strategic
78 2 -.1081 .3645 .3956 2.3032 -.0265 46 .3406 wait strategic
79 1 .0000 .3574 .3953 1.8335 -.0468 74 .3574 buy strategic
80 2 .5466 .4896 .9787 6.4343 .0409 50 .9787 wait strategic
81 5 .3264 .3375 .5167 5.1174 .2288 35 .5167 wait myopic
82 1 .0000 1.3152 2.4823 7.0420 .0006 20 1.3152 buy strategic
83 2 -.3640 1.0160 2.1650 6.6726 -.0332 3 1.7994 wait strategic
84 1 .0000 .0091 .0625 1.8334 -.1059 13 .0091 buy myopic
85 1 .0000 .0086 2.5670 6.8385 .0639 13 .0086 buy myopic
86 2 .0509 .1982 .3956 2.1657 .1008 17 .4474 buy strategic
87 1 .0000 .2620 .3955 2.1656 .1804 10 .2620 buy strategic
88 1 .0000 .2983 .3954 3.3711 .1039 34 .2983 buy myopic
89 2 .0055 .2009 .3950 1.5221 .0743 10 .3950 wait strategic
90 1 .0000 .0375 .3953 1.7856 .1630 4 .0375 buy myopic
91 2 .0509 .1982 .3953 2.6119 .0046 17 .3953 wait strategic
92 4 .4436 .3015 .3953 1.5641 .0307 18 1.9227 buy strategic
93 1 .0000 .0251 .3952 1.4938 .0492 18 .0251 buy myopic
94 2 -.2686 .0680 .3950 3.4374 .1021 37 .1356 buy strategic
95 1 .0000 .0347 .3954 2.8586 .0931 13 .0347 buy myopic
96 1 .0000 .0028 .3956 1.4157 .0083 35 .0028 buy myopic
97 1 .0000 .1505 .3952 3.4792 .1021 32 .1505 buy myopic
98 1 .0000 .0027 .3125 7.1637 .9526 34 .0027 buy myopic
99 1 .0000 .0014 2.2310 7.8011 .0019 5 .0014 buy myopic
100 1 .0000 .0071 .3953 1.5328 .2059 10 .0071 buy myopic
101 3 .8433 .3500 .3949 1.9533 -.2175 5 3.1408 buy strategic
102 1 .0000 .0011 .3949 2.0262 -.0512 16 .0011 buy myopic
103 1 .0000 .0033 .3951 3.5201 .4109 22 .0033 buy myopic
104 1 .0000 .2282 .3950 3.5575 .0047 32 .2282 buy myopic
105 1 .0000 .2322 .3956 3.5623 .0047 32 .2322 buy myopic
106 2 -.2686 .0680 .3955 3.5643 .0047 37 .1273 wait strategic
107 1 .0000 .2352 .3954 3.5662 .0047 32 .2352 buy myopic
108 1 .0000 .0092 .0620 1.0280 -.0545 17 .0092 buy myopic
109 1 .0000 .1817 .3954 2.2364 -.0845 5 .1817 buy strategic
110 1 .0000 .0275 2.3441 9.9386 .1911 10 .0275 buy myopic
111 1 .0000 .2425 .3955 2.6080 .4087 23 .1026 buy strategic
112 1 .0000 .0954 .3950 2.6103 .4087 23 .0954 buy strategic
113 1 .0000 .7706 1.9595 7.9872 .0416 13 .7706 buy myopic
114 1 .0000 .2167 .3955 3.8073 -.0999 24 .2167 buy strategic
115 1 .0000 .2171 .3951 3.8097 -.0999 24 .2171 buy strategic
116 1 .0000 .0018 .3950 3.0956 .0909 20 .0018 buy myopic
117 1 .0000 .0197 .1288 3.2260 .7820 13 .0197 buy myopic
118 1 .0000 .1361 .3951 3.4145 .4348 18 .1361 buy strategic
119 1 .0000 .1734 .3950 2.8249 .0191 35 .1734 buy myopic
120 1 .0000 .0190 .3955 3.8225 .1021 37 .0352 buy strategic
121 1 .0000 .1372 .3955 5.3948 -.0337 47 .1372 buy strategic
122 1 .0000 .1428 .3953 3.8175 -.1297 48 .1428 buy strategic
123 1 .0000 .3145 .3953 3.8745 .1021 37 .3145 buy strategic
124 1 .0000 .2392 .3951 3.8757 .1021 37 .2392 buy strategic
125 1 .0000 .3152 .3951 3.8757 .1021 37 .3152 buy strategic
126 1 .0000 .0105 .3954 1.8774 .1198 23 .0105 buy myopic
127 1 .0000 .0005 2.6824 10.5685 .0125 43 .0005 buy myopic
128 1 .0000 .1246 .3122 1.5024 -.0775 3 .1246 buy strategic
129 1 .0000 .2022 .3954 3.0357 .1150 4 .5924 buy strategic
130 1 .0000 .0224 .3952 1.5424 .3958 34 .0224 buy myopic
131 3 .8433 .3500 .3951 2.7694 -.0993 5 2.5059 wait strategic
132 1 .0000 1.8552 2.0989 6.3705 -.0094 36 1.8552 buy myopic
133 1 .0000 1.5541 2.1265 11.0751 .0116 36 1.5541 buy strategic
134 1 .0000 .1262 .3953 2.6148 .0031 13 .1262 buy myopic
135 2 -.0778 .2238 .3952 4.0549 .2334 9 .4472 buy strategic
136 1 .0000 .0028 .3954 1.4065 .0444 35 .0028 buy myopic
137 1 .0000 .0410 2.1773 6.8863 .0113 10 .0410 buy myopic
138 1 .0000 .0074 .3951 2.9499 .0384 26 .0074 buy myopic
139 1 .0000 .0298 .3950 1.8263 -.0767 4 .0298 buy myopic
140 1 .0000 .0567 .3954 1.9572 .0542 24 .0567 buy myopic
141 1 .0000 .1385 .3954 1.8009 .1056 5 .1385 buy strategic
142 1 .0000 .1546 .3954 2.0565 -.0424 4 .1546 buy myopic
143 1 .0000 .2147 .3953 1.8196 .0248 22 .2147 buy strategic
144 1 .0000 .0663 .3954 4.3489 -.0516 4 .0663 buy myopic
145 1 .0000 .0552 2.3098 10.2375 .0008 26 .0552 buy myopic
146 1 .0000 .3095 .3956 2.0219 .0396 50 .2707 buy strategic
147 2 .0156 .3601 .3955 2.9281 .1111 12 .3955 wait myopic
148 1 .0000 .0213 .3955 2.8469 -.0816 19 .0213 buy myopic
149 1 .0000 .2614 .9785 7.9903 .0159 19 .2614 buy myopic
150 1 .0000 1.1773 1.9427 8.8489 -.0194 34 1.1773 buy myopic
151 1 .0000 .3784 1.9580 9.9303 .0135 20 .3784 buy strategic
152 1 .0000 .5319 1.9683 9.9232 .0006 5 .5319 buy strategic
153 1 .0000 .2025 .4899 4.9031 .0725 8 .2025 buy strategic
154 2 -2.5954 .4871 2.5954 8.1933 .0029 41 .9830 buy strategic
155 1 .0000 .1562 .3950 2.8937 .1200 16 .1562 buy strategic
156 2 -2.5954 .4871 2.6066 8.2045 .0029 51 .0023 wait strategic
157 1 .0000 .0213 .0416 2.2798 .0516 10 .0213 buy myopic
158 1 .0000 .0006 2.6426 7.1919 .0170 10 .0006 buy myopic
159 1 .0000 .0168 2.6473 7.0161 .0078 8 .0168 buy myopic
160 6 .0988 .0740 .1319 2.0971 1.6738 35 .0432 wait myopic
161 2 .4070 .3852 .3953 2.6828 .0789 22 1.3298 wait strategic
162 1 .0000 .0115 .3950 5.5027 .0987 8 .0115 buy myopic
163 3 3.5814 2.6370 2.0903 7.9382 .0005 26 23.4267 buy strategic
164 1 .0000 .0043 .3953 1.6953 .0461 40 .0043 buy myopic
165 1 .0000 .1014 .3955 1.1796 .0983 5 .1014 buy strategic
166 1 .0000 .0014 2.1616 11.6449 .0063 11 .0014 buy myopic
167 2 -.9773 .4728 .9787 6.0801 .0010 10 .8721 buy myopic
168 2 -.9773 .4728 .9785 6.0813 .0010 19 .0747 wait myopic
169 1 .0000 .0022 .3952 5.7647 -.0676 13 .0022 buy myopic
170 2 .2114 .1997 .3955 1.6830 -.0996 3 .3955 wait strategic
171 10 .1124 .0849 .3952 2.0396 .0743 10 .6761 wait strategic
172 2 -.3548 .0218 .3951 3.1833 .0952 35 .0433 buy strategic
173 2 .1802 .1979 .3951 1.4979 -.1321 21 .3951 wait strategic
174 1 .0000 .0015 .3951 4.1229 -.0500 16 .0015 buy myopic
175 3 .2408 .2029 .3952 1.6216 -.1191 17 .6235 wait strategic
176 1 .0000 .0574 .3952 3.5632 -.1078 17 .0574 buy myopic
177 1 .0000 .0084 2.2714 7.7478 .2213 38 .0084 buy myopic
178 2 -.3548 .0218 .3953 3.2238 .0476 35 .0408 wait strategic
179 1 .0000 .0407 .3956 2.7158 -.0410 38 .0407 buy myopic
180 1 .0000 .0141 .0621 .5496 -.0314 13 .0141 buy myopic
181 1 .0000 .0637 .3954 3.5717 -.1078 17 .0637 buy myopic
182 1 .0000 .0648 .3955 3.5740 -.1078 17 .0648 buy myopic
183 1 .0000 1.0991 2.2865 7.0233 .0093 17 1.0991 buy strategic
184 1 .0000 .0192 .3956 3.5622 -.0148 13 .0192 buy myopic
185 1 .0000 .0206 .3956 3.7206 .0300 26 .0206 buy myopic
186 4 .2264 .1220 .3954 1.6003 -.0893 4 .3954 wait strategic
187 2 -.3940 .0010 .3954 1.7565 .0307 20 .0021 buy strategic
188 2 -.3940 .0010 .3953 1.7578 .0307 20 .0011 wait strategic
189 1 .0000 .1966 .3951 2.9778 -.1458 15 .1966 buy strategic
190 1 .0000 .1971 .3956 2.9789 -.1458 15 .1971 buy strategic
191 1 .0000 .6856 2.3367 7.4277 .3564 8 .6856 buy strategic
192 1 .0000 .0763 .5112 4.6258 .0698 19 .0763 buy myopic
193 1 .0000 .3375 .3949 3.1435 -.1463 15 .2800 buy strategic
194 1 .0000 .1448 .4879 4.7830 -.0985 23 .1448 buy strategic
195 1 .0000 .0106 2.3722 6.9910 .0175 20 .0106 buy myopic
196 1 .0000 .0125 .3950 4.8430 -.1141 19 .0125 buy myopic
197 1 .0000 .2483 .3955 2.1809 -.0645 22 .2483 buy strategic
198 1 .0000 .0057 .3951 2.1771 .0317 13 .0057 buy myopic
199 1 .0000 .0020 .0625 .8481 -.2198 12 .0020 buy myopic
200 2 .0203 .0657 .0985 1.1263 1.3429 34 .0985 wait myopic
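The table above is whitespace-separated, with a decision label (buy/wait) and a customer-type label (myopic/strategic) per sample. As a minimal illustrative sketch only (not the thesis's actual pipeline), rows in this format can be parsed with pandas and the class balance inspected before training a classifier; the snake_case column name customer_type and the embedded sample rows are assumptions for the example:

```python
import io

import pandas as pd

# A few rows copied verbatim from the appendix table; in practice the full
# 200-sample table would be read from a file with the same separator.
raw = """No FB ICR TS HP DD APT NF WPT decision customer_type
1 1 .0000 .4574 .4817 1.3185 .0636 3 .4574 buy myopic
2 1 .0000 .0009 .0972 1.4097 1.1565 28 .0009 buy myopic
4 3 .0025 .1422 .3952 1.9972 .0841 4 .8975 buy strategic
12 2 -.3038 .0626 .3956 1.9685 -.1458 4 .0935 wait strategic
"""

# Whitespace-separated values; numeric columns (including ".0000"-style
# floats) are parsed automatically.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Class balance of the two label columns used in the classification study.
print(df["decision"].value_counts().to_dict())
print(df["customer_type"].value_counts().to_dict())
```

For the full dataset, the same call with a file path in place of the `StringIO` buffer would suffice, and the balance check indicates whether metrics robust to class imbalance (e.g. the F-measure discussed in the references) are needed.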