A HIERARCHICAL RULE-BASED INFERENTIAL
MODELLING AND PREDICTION WITH APPLICATION IN
STRATEGIC PURCHASING BEHAVIOUR
A Thesis Submitted to The University of Manchester for the degree of
Doctor of Philosophy
in the Faculty of Humanities
2020
YUN PRIHANTINA MULYANI
ALLIANCE MANCHESTER BUSINESS SCHOOL
List of Contents
List of Contents ................................................................................................. 2
List of Tables ..................................................................................................... 4
List of Figures .................................................................................................... 7
Abbreviation ....................................................................................................... 9
Abstract ............................................................................................................ 10
Declaration ....................................................................................................... 11
Copyright Statement ....................................................................................... 12
Acknowledgements ......................................................................................... 13
Chapter 1 Introduction ................................................................................ 14
1.1. Background ......................................................................................... 14
1.2. Research Questions ............................................................................ 20
1.3. Research Objectives............................................................................ 21
1.4. Research Contributions ....................................................................... 21
1.5. Research Significance ......................................................................... 24
1.6. Thesis Structure .................................................................................. 26
Chapter 2 Research Background ................................................................ 30
2.1. Introduction .......................................................................................... 30
2.2. Introduction to Revenue Management Theories .................................. 30
2.3. Advanced Booking Decision-Making .................................................... 34
2.4. Introduction to Machine Learning ......................................................... 37
2.5. Classification Models: Advantages and Disadvantages ....................... 39
Chapter 3 Research Methodologies ........................................................... 44
3.1. Introduction .......................................................................................... 44
3.2. Research Approach ............................................................................. 44
3.3. Data Collection .................................................................................... 45
3.4. Evidential Reasoning ........................................................................... 50
3.5. Maximum Likelihood Evidential Reasoning (MAKER) Framework ....... 58
3.6. Machine Learning Methods.................................................................. 63
3.7. Sequential Least Squares Programming (SLSQP) .............................. 74
3.8. Evaluation Metrics ............................................................................... 77
3.9. Summary ............................................................................................. 83
Chapter 4 A Hierarchical Rule-based Inferential Modelling and Prediction ......... 85
4.1. Introduction .......................................................................................... 85
4.2. Introduction to MAKER Framework ...................................................... 85
4.3. MAKER Algorithm with Referential Values ........................................... 89
4.4. Belief Rule Base .................................................................................. 98
4.5. The Decomposition of Input Variables ................................................. 99
4.6. Parameter Learning ........................................................................... 105
4.7. A Comparative Analysis ..................................................................... 108
4.8. Summary ........................................................................................... 136
Chapter 5 Application to Customer Classification .................................. 137
5.1. Introduction ........................................................................................ 137
5.2. Theoretical Foundations: Customer Types and Behaviours ............... 138
5.3. Conceptual Framework ...................................................................... 145
5.4. Data Preparation ............................................................................... 153
5.5. Hierarchical Rule-based Models for Customer Classification ............. 154
5.6. Model Comparisons ........................................................................... 183
5.7. Summary ........................................................................................... 198
Chapter 6 Application to Customer Decision Model ............................... 201
6.1. Introduction ........................................................................................ 201
6.2. Conceptual Framework: Input Variables and Decisions ..................... 201
6.3. Data Preparation ................................................................................ 218
6.4. Hierarchical Rule-based Models for Predicting Customer Decisions ... 220
6.5. Model Comparisons ........................................................................... 253
6.6. Summary ........................................................................................... 269
Chapter 7 Conclusions and Recommendations for Future Research .... 271
7.1. Conclusions ....................................................................................... 271
7.2. Limitations and Recommendations for Future Research .................... 274
References ..................................................................................................... 277
Appendices .................................................................................................... 286
List of Tables
Table 1.1. Thesis structure ................................................................................ 27
Table 2.1. Advantages and disadvantages of classification methods ................. 39
Table 3.1. Data characteristics ........................................................................... 48
Table 3.2. Threshold metrics ............................................................................. 79
Table 3.3. Rules of thumb for AUC .................................................................... 82
Table 4.1. An example of data transformation .................................................... 95
Table 4.2. Generated datasets with four input variables ................................... 124
Table 4.3. Performance measures for the dataset 1 ........................................ 129
Table 4.4. Performance measures for the dataset 2 ........................................ 130
Table 4.5. Performance measures for the dataset 3 ........................................ 131
Table 4.6. Performance measures for the dataset 4 ........................................ 132
Table 4.7. Performance measures for the dataset 5 ........................................ 133
Table 4.8. Grand averages of performance measures of the five generated
datasets ........................................................................................................... 134
Table 5.1. Definitions of strategic customers .................................................... 141
Table 5.2. Input variables ................................................................................. 147
Table 5.3. Descriptive statistics and spearman correlation matrix .................... 156
Table 5.4. Percentiles of the dataset ................................................................ 159
Table 5.5. The optimised referential values obtained from MAKER-ER- based
models of the first round .................................................................................. 163
Table 5.6. The frequencies of the referential values of the input variable of TS 164
Table 5.7. The likelihoods of the referential values of the input variable of TS . 164
Table 5.8. The probabilities of referential values of the input variable of TS ..... 165
Table 5.9. The probabilities of referential values of the input variable of HP .... 165
Table 5.10. The joint probabilities of different combinations of the referential values
from input variables HP and TS ....................................................................... 167
Table 5.11. The interdependence indices between the referential values from the
input variables HP and TS ............................................................................... 168
Table 5.12. Interdependence indices between referential values from the input
variables FB and ICR ....................................................................................... 169
Table 5.13. Two adjacent referential values of each input variable of an observation
from the customer-type dataset: {.2105, .3955, 4, .1415} .............................. 171
Table 5.14. The belief rule base of the first group of evidence and the activated
belief rules by an observation of the input variables of group 1 from the customer-
type dataset: {.2105, .3955} ............................................................................. 172
Table 5.15. The belief rule base of the second group of evidence with activated
belief rule base by an observation of the input variables of group 2 from the
customer-type dataset: {4, .1415} .................................................................... 173
Table 5.16. The belief rule base of the top hierarchy of inference with the initial
belief degrees for the customer-type dataset ................................................... 177
Table 5.17. The belief rule base of the top hierarchy of inference with the optimised
belief degrees of the training set of the first fold for the customer-type dataset 178
Table 5.18. The joint similarity degree of the outputs generated by group 1: {.1371,
.8629} and group 2: {.2537, .7463} from the customer-type dataset .............. 179
Table 5.19. Selected hyperparameters of SVM, ANN, CT, and Weighted KNN for
customer type models ...................................................................................... 185
Table 5.20. F-beta scores for customer behaviour classifiers .......................... 186
Table 5.21. Accuracies for customer behaviour classifiers ............................... 187
Table 5.22. Precisions of the test sets for customer behaviour classifiers ........ 188
Table 5.23. Recalls of the test sets for customer behaviour classifiers ............. 189
Table 5.24. The MSEs and AUCs of the prediction models (training set) for
customer type classifiers .................................................................................. 196
Table 5.25. The MSEs and AUCs of the prediction models (test set) for customer
type classifiers ................................................................................................. 197
Table 6.1. Descriptive Statistics and Correlation Matrix ................................... 221
Table 6.2. Percentiles of the dataset ................................................................ 224
Table 6.3. Optimised referential values obtained from MAKER-ER-based models
of the first round ............................................................................................... 230
Table 6.4. The frequencies of the referential values of the input variable of WPT
........................................................................................................................ 231
Table 6.5. The likelihoods of the referential values of the input variable of WPT
........................................................................................................................ 231
Table 6.6. The probabilities of referential values of the input variable of WPT . 233
Table 6.7. The probabilities of referential values of the input variable of APT .. 233
Table 6.8. Joint probabilities for different combinations of referential values from
input variables: WPT and APT ......................................................................... 235
Table 6.9. Interdependence indices for referential values of the input variables:
WPT and APT .................................................................................................. 235
Table 6.10. Interdependence indices for referential values of the input variables:
HP and DD ...................................................................................................... 235
Table 6.11. Interdependence indices for referential values of the input variables:
NF and C ......................................................................................................... 236
Table 6.12. The belief rule base of the first group of evidence and the activated
belief rules by an observation from the customer-decision dataset: {.2946, .1193}
........................................................................................................................ 239
Table 6.13. The belief rule base of the second group of evidence with activated
belief rule base by an observation from the customer-decision dataset: {.3955,
1.9816} ............................................................................................................ 239
Table 6.14. The belief rule base of the third group of evidence with activated belief
rule base by an observation from the customer-decision dataset: {62, 1} ...... 240
Table 6.15. Two adjacent referential values of each input variable of an observation
from the customer-decision dataset: {.2946, .1193, .3954, 1.9816, 62, 1} ........ 240
Table 6.16. Initial belief rule base of the top hierarchy for the customer-decision
dataset ............................................................................................................. 248
Table 6.17. Optimised belief rule base of the top hierarchy with the belief rules activated
by the three MAKER-generated outputs: {(1, .6007), (2, .3993)}; {(1, .8468), (2,
.1532)}; and {(1, .2387), (2, .7613)} .................................................................. 249
Table 6.18. The selected hyperparameters of CT, SVM, KNN, Weighted KNN, and
NN for customer decision models .................................................................... 254
Table 6.19. F-beta scores for customer behaviour classifiers .......................... 256
Table 6.20. Accuracies for customer decision models ..................................... 257
Table 6.21. Precisions of the test sets for customer decision models ............... 258
Table 6.22. Recalls of the test sets for customer decision models ................... 259
Table 6.23. MSEs and AUCs of classifiers for customer decision models ........ 267
List of Figures
Figure 3.1. A single-hidden-layer neural network (Bishop, 2006) ........................ 69
Figure 3.2. The procedure of sequential (least squares) quadratic programming
method .............................................................................................................. 78
Figure 3.3. Confusion matrix of binary problem .................................................. 79
Figure 3.4. ROC curve ....................................................................................... 81
Figure 4.1. Hierarchical MAKER-based training process .................................. 104
Figure 4.2. A hierarchical rule-based inferential modelling and prediction based on
MAKER framework for n groups of evidence ................................................... 107
Figure 4.3. Referential Value-based Discretization Technique: an input variable
(upper), and two input variables (bottom) ......................................................... 115
Figure 4.4. Scatter plot from the datasets ........................................................ 125
Figure 4.5. Plot of the grand average scores of performance measures of the five
generated datasets for each model .................................................................. 135
Figure 5.1. Illustration 1 (several weeks before departure date) ....................... 146
Figure 5.2. Illustration 2 (some days before departure date) ............................ 147
Figure 5.3. Data linkage ................................................................................... 149
Figure 5.4. Hierarchical MAKER frameworks for customer classification ......... 157
Figure 5.5. Scatter plot of the observed data of the training set of the first fold with
plotted optimised referential values in each of the input variables from the
customer-type dataset from the optimisation of the MAKER-ER-based model ... 160
Figure 5.6. Scatter plot of the observed data of the training set of the first fold with
plotted optimised referential values in each of the input variables from the
customer-type dataset from the optimisation of the MAKER-BRB-based model ... 161
Figure 5.7. Individual support of the referential values of each input variable ... 165
Figure 5.8. The ROC curve of the MAKER-ER-based classifier, MAKER-BRB-
based classifier, and all the alternative machine learning methods of the test sets
of the customer-type dataset ........................................................................... 192
Figure 5.9. The PR curve of the MAKER-ER-based classifier, MAKER-BRB-based
classifier, and all the alternative machine learning methods of the test sets of the
customer-type dataset ..................................................................................... 194
Figure 6.1. Conceptual framework for decisions by advanced booking customers
under dynamic pricing ..................................................................................... 209
Figure 6.2. Conceptual framework for decision by advanced booking customers
under dynamic pricing after refinement ............................................................ 209
Figure 6.3. Data linkage for customer decision model ...................................... 215
Figure 6.4. Example of a booking journey ........................................................ 216
Figure 6.5. Hierarchical MAKER framework for customer decision prediction .. 222
Figure 6.6. Scatter plot for observed data, with plotted optimised referential values
for each input variable in the optimisation of MAKER-ER-based model from the
customer-decision dataset. ........................................................................... 226
Figure 6.7. Scatter plot for observed data, with plotted optimised referential values
for each input variable in the optimisation of MAKER-BRB-based model from the
customer-decision dataset. ........................................................................... 227
Figure 6.8. Individual support of referential values of each input variable of the
training set of the first fold of the customer decision dataset ............................ 232
Figure 6.9. The ROC curve of MAKER-ER-based classifier, MAKER-BRB-based
classifier, and all the alternative machine learning methods for the test sets of the
customer-decision dataset ............................................................................... 262
Figure 6.10. The PR curve of MAKER-ER-based classifier, MAKER-BRB-based
classifier, and all the alternative machine learning methods for the test sets of the
customer-decision dataset ............................................................................... 264
Abbreviation
APT Average price trend
AUC Area under the curve
AUCPR Area under the precision-recall curve
AUCROC Area under the receiver operating characteristic curve
BRB Belief rule base
C Customer type
CT Classification tree
DD Days before departure date
D-S Dempster-Shafer
ER Evidential reasoning
FB Frequency of bookings
HP Length of the holding period
ICR Interval between cancelling and booking again
LR Linear regression
KNN K-nearest neighbours
MAKER Maximum likelihood evidential reasoning
MSE Mean squared error
MLP Multilayer perceptron
NB Naïve Bayes
NF Number of flights offered in a day
PNR Passenger name record
PR Precision-recall
SLSQP Sequential (least squares) quadratic programming
TS Time spent confirming a booking
RIMER Rule-based inference methodology using evidential reasoning
ROC Receiver operating characteristics
WPT Waiting patience time
Abstract
Strategic purchasing behaviour has received growing attention in revenue management, as it can cause providers substantial revenue losses. Researchers have highlighted the need to detect strategic customers and predict their decisions. Theoretical models rely on assumptions about how customers make decisions and which factors influence them, while conditioned experiments are relatively expensive and not representative of the actual system. By comparison, statistical and machine learning approaches that learn from historical data can be relatively cheap and representative of actual conditions. However, widely used approaches face challenges of interpretability, overfitting, and stability, which may limit their ability to classify, that is, to predict customer types and decisions.
We propose a conceptual framework and data linkage for detecting strategic customers and predicting customer decisions. The framework and data linkage were developed around cancel-rebook behaviour by two customer types, strategic and myopic, and around two customer decisions: buy or wait. The evidence showed that the input variables in the framework were good predictors of customer types and decisions. Ultimately, we propose a new approach, a hierarchical rule-based inferential modelling and prediction, which integrates statistical analysis, rule-based inference, maximum likelihood prediction, and machine learning in a hierarchical structure. The referential value-based discretisation technique used in this approach can alleviate the information loss and distortion caused by over-generalisation in discretisation, and it captures the structure of the data better than other discretisation techniques. Belief-rule-based inference is used to analyse the relationship between inputs and outputs, and an interdependence index measures the relationships between input variables. The hierarchical structure deals with sparse rule bases by decomposing the input variables into several groups of evidence; the outputs generated by all groups of evidence are then combined to obtain the final inference.
The classifiers, developed based on the maximum likelihood evidential reasoning (MAKER) framework and the proposed hierarchical rule-based inferential modelling and prediction, are transparent and interpretable. They perform better than the majority of alternative classification models on both datasets (customer types and customer decisions), and their performance is similar to that of classification trees.
Keywords: Rule-based Inference, Statistical Analysis, Evidential Reasoning, Machine Learning, Data Discretisation, Probabilistic Inference, Classification, Strategic Customer, Revenue Management
Declaration
No portion of the work referred to in the thesis has been submitted in support of an
application for another degree or qualification of this or any other university or other
institute of learning.
Copyright Statement
The following four notes on copyright and the ownership of intellectual property
rights must be included as written below:
i. The author of this thesis (including any appendices and/or schedules to
this thesis) owns certain copyright or related rights in it (the “Copyright”)
and s/he has given The University of Manchester certain rights to use
such Copyright, including for administrative purposes.
ii. Copies of this thesis, either in full or in extracts and whether in hard or
electronic copy, may be made only in accordance with the Copyright,
Designs and Patents Act 1988 (as amended) and regulations issued
under it or, where appropriate, in accordance with licensing agreements
which the University has from time to time. This page must form part of
any such copies made.
iii. The ownership of certain Copyright, patents, designs, trademarks and
other intellectual property (the “Intellectual Property”) and any
reproductions of copyright works in the thesis, for example graphs and
tables (“Reproductions”), which may be described in this thesis, may not
be owned by the author and may be owned by third parties. Such
Intellectual Property and Reproductions cannot and must not be made
available for use without the prior written permission of the owner(s) of the
relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication
and commercialisation of this thesis, the Copyright and any Intellectual
Property and/or Reproductions described in it may take place is available
in the University IP Policy (see
http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=24420), in any
relevant Thesis restriction declarations deposited in the University Library,
The University Library’s regulations (see
http://www.library.manchester.ac.uk/about/regulations/) and in The
University’s policy on Presentation of Theses.
Acknowledgements
I would like to gratefully acknowledge the many people who have journeyed with
me over the last four years as I have worked on this thesis. First, I would like to
express my sincere gratitude to my supervisors, Prof. Jian-bo Yang and
Prof. Dong-Ling Xu, for their guidance, constant supervision, and support in
completing this endeavour.
Second, I would like to thank my family for the encouragement that sustained me
during my study: my beloved husband, Budi Wuryanto, who was always supportive
and by my side through every hard time and struggle; my lovely sons, Dimas F
Alghazi and Tristan H Alghazi, who served as my inspiration to finish my study;
and my father, my mother, and my sisters, who always gave me strength and
motivation.
Third, I thank my colleagues at the University of Manchester, especially in the
Decision and Cognitive Science Research Centre, for all the fun times and
discussions we have had over the last four years.
Fourth, I would like to acknowledge the Indonesia Endowment Fund for Education
(LPDP) for the financial support.
Many thanks and appreciation also go to my colleagues and everyone who has
willingly helped me.
Chapter 1 Introduction
1.1. Background
In businesses characterised by perishable goods and constrained capacity,
dynamic pricing is prevalent. It stimulates demand and increases revenue in the
short term (Cho et al., 2008; Kimes, 1989). In the context of air travel, individual
price differences, in which the ticket fare paid by one passenger may differ from
that paid by the adjacent passenger, are compatible with this practice (Kimes,
2003). The practice has boosted the success of the airline and hospitality
industries (Kimes, 1989; Talurri and Ryzin, 2004).
In the past, airlines segmented their markets based on the belief that customers’
willingness to pay would increase as the consumption date approached; hence,
the company naturally faced different segments over time (Talurri and Ryzin,
2004). This classical segmentation approach may no longer be applicable.
Because of dynamic pricing, price transparency, and the widespread use of price
comparison sites (PCS) and other tools that help customers minimise their travel
and search costs (Bilotkach, 2010; Boyd and Bilegan, 2003), customers have
started to act strategically by timing their purchases, a pattern known as strategic
purchasing behaviour (Anderson and Wilson, 2003). Research has shown that
PCS also influence offline price evaluations; offline travel agents must deal with
this strategic behaviour, as do online agents (Bodu et al., 2015; Toh et al., 2012).
Researchers have given different labels to such behaviour, but with the same
essence. Some of these labels include deal-seeker (Schwartz, 2006), forward-
looking customer (Chevalier and Goolsbee, 2009), and strategic customer
(Anderson and Wilson, 2003).
The success of dynamic pricing relies on effective policies that minimise
cannibalisation across price levels, which occurs when customers with a high
willingness to pay buy at an available lower price, and that induce early purchase.
‘Waiting’ behaviour in strategic purchasing makes demand uncertain and can lead
to underestimated demand projections (Liu and Ryzin, 2008). Customers who
would buy, or are willing to pay, at a higher price may obtain a lower price by
delaying their purchase (Anderson and Wilson, 2003) or by following a
cancel-rebook strategy (Gorin et al., 2012). Gorin et al. (2012) found numerous
examples in airline databases of customers who, having already booked a ticket,
cancelled the booking and rebooked the same flight at a lower price. Because of
this strategic purchasing behaviour, the revenue system records relatively few
bookings at high prices, underestimates future demand for those price classes,
and hence recommends prices that are lower than they should be. This condition
is termed the spiral-down effect (Cooper et al., 2006).
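The spiral-down feedback can be sketched numerically. The sketch below is purely illustrative and not drawn from the thesis data or models: it assumes a fixed true high-fare demand, a fixed share of strategic customers whose cancel-rebook behaviour moves their bookings to a lower fare class, and a simple exponential-smoothing forecaster; all names and numbers are hypothetical.

```python
# Hypothetical illustration of the spiral-down effect (Cooper et al., 2006):
# when strategic customers cancel-rebook at lower fares, high-fare bookings
# are under-recorded, so the forecast of high-fare demand, and with it the
# recommended price, drifts downward round after round.

def next_forecast(prev_forecast: float, observed: float, alpha: float = 0.5) -> float:
    """Exponential smoothing of the demand observed at the high fare."""
    return alpha * observed + (1 - alpha) * prev_forecast

true_high_fare_demand = 100.0   # customers genuinely willing to pay the high fare
strategic_share = 0.3           # fraction who cancel-rebook into a lower fare

forecast = true_high_fare_demand
history = []
for _ in range(6):
    # Only non-strategic customers end up recorded at the high fare.
    observed = true_high_fare_demand * (1 - strategic_share)
    forecast = next_forecast(forecast, observed)
    history.append(round(forecast, 1))

print(history)  # forecast falls from 100 toward 70, well below true demand
```

Setting `strategic_share = 0` keeps the forecast at 100, which makes explicit that the downward drift in this sketch comes entirely from the under-recorded high-fare bookings, not from any change in true demand.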
Some studies have examined the effect of strategic purchasing, finding that it can
cause profit losses of between 7% and 50% (Besanko and Winston, 1990; Levin
et al., 2008; Nair, 2007; Zhang and Cooper, 2008). These findings highlight the
significance, for academia and for practitioners, of the gains achievable by dealing
appropriately with strategic purchasing.
In revenue management practice, research has mainly focused on developing
approaches to deal with strategic customers. Early studies started from
assumptions about how customers make decisions and derived theoretical models
explaining firms’ responses (Aviv and Pazgal, 2008; Levin et al., 2008; Liu and
Ryzin, 2008). These models assumed that all customers act strategically, delaying
their purchase if a future offer is predicted to be cheaper and buying right away
otherwise. This assumption has been criticised as unrealistic: the proportion of
strategic customers in the market likely determines which policy best alleviates the
effect of strategic purchasing behaviour on revenue (Cleophas and Bartke, 2011;
Su, 2007). Recent studies corroborate this view. For example, Wang et al. (2013)
and Lai et al. (2010) varied the percentage of strategic customers and found that
revenue management policies were sensitive to this percentage. In conclusion,
firms should treat the market according to the behaviour that predominates. This
research illustrates the need to explore how strategic customers arise in the
market and how they make purchase decisions.
Several researchers have presented empirical evidence of the extent to which
strategic customers occur in the market. Chevalier and Goolsbee (2009) analysed
buyer behaviour regarding college textbooks and confirmed the presence of
strategic customers. Osadchiy and Bendoly (2011) found that the percentage of
strategic customers ranged between 8% and 38%, using a conditioned simulation
with 155 financially motivated subjects. Li et al. (2014) developed a structural
model to estimate the percentage of strategic customers in a study of real-life
airline ticketing; the proportion ranged from 5.2% to 19.2%.
A laboratory experiment by Mak et al. (2014) indicated that only 6% of customers
were completely myopic – customers who do not intend to strategically time their
purchase. This group always buys right away as long as the price fits with their
valuation. However, most customers were strategic, although their decisions might
deviate somewhat from the optimal one; that is, they might choose to wait when
they should have bought early, and vice versa.
These diverse findings reflect divergent methods, assumptions, and settings, as
well as relatively small samples (Gönsch et al., 2013). In addition, conditioned
experiments may not represent the real situation, in which subjects make purchase
decisions with their own money; hence, biases may occur.
The rationale behind ‘waiting’ behaviour has been discussed in terms of decision-
making theory. Research in this area has mainly examined the factors that
influenced customers’ decision making when they were exposed to any type of
promotion, deals, or discounted products. An interesting finding was that deal-
seeking behaviour was mainly determined by a cognitive process rather than by
emotions (Chandon, Wansink, and Laurent, 2000; Christou, 2011; Lichtenstein et
al., 1993). However, deal evaluation and deal-proneness might intensify
customers’ emotional state, which could induce an intention to purchase (Christou,
2011). Such studies explained how antecedents influenced the customers’
motivation and enticed them to book a deal. Once they were motivated and showed
intention, their behavioural responses could be stimulated and predicted
(Gollwitzer and Brandstätter, 1997). Therefore, it might be possible to identify customer
types by their response to any means of gaining a lower price, through
understanding their behaviour in booking a deal.
Scholars often use theoretical models based on intuitive thinking when the
challenges and expenses of empirical work are high (Cleophas and Bartke, 2011).
However, identifying strategic customers is no longer impractical, since the
industries that apply revenue management have recently adopted advanced
information technology. In addition, initial studies have explored strategic customers and
their associated tangible behaviours. Evidence has shown that the behaviour of
checking prices often and cancelling and then rebooking was correlated with
customers’ intention to obtain a lower price. This behaviour was perceived as a
manifestation of ‘waiting’ behaviour (Toh et al., 2012). Cancel-and-rebook
behaviour in an airline database was significantly related to evidence of strategic
purchasing (Gorin et al., 2012). The researchers found numerous examples of
customers who purposefully monitored the prices, then cancelled and rebooked
when a lower price became available in the system.
Despite the many studies addressing strategic customers, none have explored
how to detect a strategic customer from quantifiable behaviour, specifically cancel-
rebook behaviour. Previous findings have shown that purchase-related activities
were correlated with customers’ intention to obtain lower prices. Those works
became an initial footing to form a detection procedure and a classification model
for predicting customer types through quantifiable behaviour. In addition, limited
research has examined empirical evidence, as most researchers test their models
with numerical experiments. This study aims to bridge the gap in the literature and
to provide a model that is tested using empirical data for detecting strategic
customers through their cancel-rebook behaviour and predicting their decisions.
The existence of strategic customers can cause substantial losses if
inappropriately addressed. Understanding how such behaviour dominates a
market plays a role in formulating revenue management-related policies aimed at
long-term profits. Purchase-related activities potentially demonstrate customers’
intention to look for a lower price. Hence, there is scope for developing methods to
detect strategic customers from available records.
Customer behaviour, in general, is ‘the process and activities people engage in
when searching for, selecting, purchasing, using, evaluating, and disposing of
products and services so as to satisfy their needs and desires’ (Belch and Belch,
1998). Although customer behaviour is complicated to study, a better understanding
of it can help firms to segment the market effectively. Ultimately, this may lead to
gains in revenue (Rickwood and White, 2009). Therefore, it is
important to develop a decision support system to predict customer types based
on their purchase behaviour, and to predict customer decisions in the environment
of dynamic pricing.
This research is important as it may relax the widespread assumption in theoretical
revenue management models that all customers act strategically. It improves the
established revenue management models and helps to identify representative firm-
level responses. We utilised the case of airline ticketing, but the model is
applicable for other adopters of revenue management with similar conditions, such
as hotels.
1.2. Research Questions
The primary goals of this study were (1) to define a detection procedure for
customer types in response to dynamic pricing, through refining the fit between
theory and available data; (2) to develop a new classification model with a
hierarchical rule-based inferential modelling and prediction approach based on the
MAKER framework, to predict customer types from their perceptible purchase-
related activities – especially cancel-rebook behaviour; and (3) to extend the
approach to predicting customer decisions.
To achieve the primary goals of this study, the following questions guided the
analysis:
Q1. What perceptible purchase-related activities might describe the differences
among customer types regarding their responses to dynamic pricing?
Q2. What factors influence customer decisions to buy or to wait, in the environment
of dynamic pricing?
Q3. What are the drawbacks and benefits of alternative models for classification,
that is, predicting discrete outputs (i.e. customer types and decisions)?
Q4. How do MAKER frameworks deal with sparse matrices?
Q5. How do MAKER frameworks work for complex numerical data?
Q6. How do the alternative models perform in customer detection and decision
prediction?
1.3. Research Objectives
The objectives of this research are listed below:
RO1. To construct a conceptual framework for detecting customer types and
decisions in the environment of dynamic pricing.
RO2. To construct MAKER-based frameworks as an alternative classification
model under sparse matrices and complex numerical systems.
RO3. To examine the interpretability and the performance of the proposed model
compared with other alternative models.
1.4. Research Contributions
The research contributions of this work are as follows:
1. A conceptual framework was developed for customer-type detection and data
linkage, based on the literature in revenue management, and was refined
based on available data. This is a research innovation in the field of revenue
management. The framework is useful to detect customer types, which
provides companies with an understanding of the composition of customer
types in their market. In turn, this can lead to an effective policy to deal with
the existence of strategic customers. Most studies in revenue management
have utilised numerical experiments, resulting in divergent findings. Our
framework is interpretable and tested on a real case.
2. A conceptual framework for customer-decision prediction and data linkage
was constructed; this is another innovation in the research. This framework is
useful to predict customer decisions derived from historical data rather than
theoretical assumptions. The model is also useful to inform managerial
decision making to address strategic purchasing. We examined the model with
a real case of airline ticketing.
3. A hierarchical rule-based inferential modelling and prediction approach was
developed to deal with sparse matrices. This is an innovative approach to
modelling and prediction that allows interdependencies between input
variables to be determined without violating statistical requirements. It is
applicable when the data distribution is heavily skewed, such as when joint
frequency matrices violate the statistical requirement for sample size.
Grouping too many referential values together to meet the minimum statistical
requirement (i.e. five cases per cell) leads to substantial loss of information
and to distortion. Our grouping is rooted in the strength of the relationship
between input variables and outputs, as well as the statistical requirements for
sample sizes for pairs of referential values. In addition, the input variables
being formed in a group are conceptually correlated. The hierarchical rule-
based approach can reduce the complexity of rule-based inferential modelling
and prediction as the number of referential values of each input variable and
the number of input variables increase.
4. A referential-value-based data discretisation technique was utilised to deal
with complex numerical data for modelling. The initial referential values were
obtained from an unsupervised discretisation method, such as equal-frequency
discretisation. These values were then trained simultaneously with the other
model parameters, because learning them directly contributes to optimising the
model's performance.
5. We propose an approach that integrates statistical analysis, rule-based
inference, maximum likelihood prediction, and machine learning, embedded in
hierarchical rule-based inferential modelling and prediction. The approach
leads to transparent and interpretable results, where the relationship between
inputs and outputs becomes clear.
6. A hierarchical rule-based inferential modelling and prediction approach was
used to establish MAKER-based classifiers for predicting customer types and
decisions. With heavily skewed distributions, input variables were
decomposed into groups and ER or BRB rule-based inference was then
utilised. Parameters and referential values were learned simultaneously in the
model. The MAKER-based models outperformed complex methods such as
support vector machine and neural networks, which are recognised as black-
box models. The MAKER-based models generally performed better than the
majority of the interpretable classifiers: logistic regression, naïve Bayes, k-
nearest neighbours, distance-based weighted k-nearest neighbours, linear
discriminant, and quadratic discriminant, and performed similarly to the
classification tree. MAKER-based models are transparent and interpretable,
and the relationship between inputs and outputs can be clearly explained.
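The sparse-matrix problem behind contribution 3 can be made concrete with a small sketch. The function names and toy data below are illustrative only (they are not part of the MAKER framework); the sketch simply checks a joint frequency matrix of two input variables against the common five-cases-per-cell rule of thumb:

```python
from collections import Counter
from itertools import product

def joint_frequency(xs, ys):
    """Count how often each pair of referential values co-occurs."""
    return Counter(zip(xs, ys))

def satisfies_five_per_cell(xs, ys, min_count=5):
    """Check the rule of thumb that every cell of the joint frequency
    matrix holds at least `min_count` observations."""
    counts = joint_frequency(xs, ys)
    cells = product(set(xs), set(ys))
    return all(counts[cell] >= min_count for cell in cells)

# A heavily skewed sample: almost all mass sits in one cell, so the
# joint matrix is sparse and the rule of thumb fails.
x = ["low"] * 18 + ["high"] * 2
y = ["buy"] * 18 + ["wait"] * 2
print(satisfies_five_per_cell(x, y))  # False: e.g. ("low", "wait") has 0 cases
```

When this check fails, one would either merge referential values (losing information) or, as proposed here, decompose the input variables into smaller groups before rule-based inference.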
1.5. Research Significance
In this section, we explain the theoretical and practical significance of this research.
The research draws on two fields: revenue management, and modelling and
computational approaches. Its significance for both fields is as follows.
Theoretical significance:
• In the field of revenue management
This research contributes to a new approach for detecting strategic customers from
visible booking-related behaviours. This topic is a growing issue in revenue
management. We propose a framework for customer-type detection with
refinement according to data availability. As mentioned earlier, research in this field
often relies on the intuitive assumption that all customers act strategically, leading
to suboptimal revenue management policies. Our framework offers a more precise
approach to estimating the presence of strategic customers in the market, which in
turn can lead to more appropriate revenue management.
This study also creates a conceptual framework for a customer-decision model in
the environment of dynamic pricing with an advance booking mechanism, where
customers can place a guaranteed reservation before the departure date. It comprises
the factors that influence customer decisions, including provider-controlled factors,
risk-related factors, and customers' personal factors. This framework can aid
scholars and more specifically practitioners to understand how customers make
decisions whether to buy or to wait in an environment of dynamic pricing.
• In the field of modelling and computational approach
This research contributes to the development of an integrated approach for
statistical analysis, hierarchical rule-based inference, maximum likelihood
prediction, and machine learning for classifying various data types under heavily
skewed data distributions. A heavily skewed distribution produces sparse joint
frequency matrices, in which combining too many referential values leads to
information loss. Specifically, input variables were grouped based on statistical
and conceptual considerations and on feasibility in terms of the statistical
requirements for sample size. The MAKER framework was applied to each
evidence group. Then an ER-based or BRB-based model was used to combine
the MAKER-generated outputs from all evidence groups. This approach is
transparent and interpretable, and the relationship between system inputs and
outputs becomes clear and understandable.
In addition, a referential-value-based data discretisation technique was employed
to deal with numerical data. These values were trained simultaneously with all the
model parameters. In other words, the learning process was designed so that all
parameters embedded in the model were adjusted to achieve the objective of
minimising errors.
Practical significance:
This research offers an approach to detecting strategic customers and predicting
customer decisions in the environment of dynamic pricing and advance-booking
mechanisms. It can benefit practitioners at the managerial level, especially
revenue management adopters with fixed capacity or similar characteristics – such
as hotels, airlines, sport and entertainment ticketing, and advertisements.
Professionals in these fields can use our model because of its hierarchical rule-
based inferential modelling and prediction approach, which is transparent and
interpretable. It offers similar or improved accuracy compared to other
classification models.
The framework is also adaptable if professionals examine many inputs and outputs
or different referential values. Such scenarios are likely to occur in fields in which
customer behaviour may change dynamically in response to changes in the
business environment. The MAKER-based models are essentially a white-box
approach, that is, one with a transparent machine learning process and
interpretable model features. They enable professionals to find useful patterns in
the data and thus to develop beneficial managerial levers to deal with strategic
customers appropriately.
1.6. Thesis Structure
This thesis consists of seven chapters, as outlined in Table 1.1. Each chapter is
designed to answer specific research questions, with corresponding research
objectives.
Table 1.1. Thesis structure

Chapter                                                      Research questions   Research objectives
Chapter 1  Introduction                                      -                    -
Chapter 2  Literature review                                 Q3                   RO3
Chapter 3  Research methodologies                            Q3                   RO3
Chapter 4  A hierarchical rule-based inferential             Q4, Q5               RO2, RO3
           modelling and prediction
Chapter 5  Application to customer classification            Q1, Q4, Q5, Q6       RO1, RO2, RO3
Chapter 6  Application to customer decision model            Q2, Q4, Q5, Q6       RO1, RO2, RO3
Chapter 7  Conclusion and recommendations for                -                    -
           further study
Chapter 2 provides a systematic literature review regarding established
classification models. We identify the advantages and disadvantages of the
established models and formulate the need for a new approach. Chapter 2
answers Q3, with corresponding RO3.
Chapter 3 explains the research methodologies used in this thesis to answer Q3,
which address RO3. In this chapter, we explain the general research methods,
data collection, the research methods for the rule-based inferential modelling and
prediction approach, the optimisation method to find optimised model parameters,
and evaluation metrics for model performance comparison.
Chapter 4 presents our proposed new approach to hierarchical rule-based
inferential modelling and prediction, established on the MAKER framework,
namely the MAKER-ER- and MAKER-BRB-based models, to deal with sparse
matrices and to address data transformation for numerical data. This chapter
answers Q4 and Q5, fulfilling RO2. We explain the proposed approach
analytically and graphically to highlight the advantages of our model compared
to other models. Then, to analyse whether the hierarchical structure of the
proposed approach affects its generalisation capability and complexity, we apply
a full MAKER model, a BRB model, and hierarchical MAKER models to five
generated datasets and compare their performance in terms of model complexity,
computation time, accuracy, area under the receiver operating characteristic
curve (AUC-ROC), and mean squared error (MSE).
Chapter 5 addresses Q1, Q4, Q5, and Q6, and fulfils RO1, RO2, and RO3. This
chapter presents the application of the MAKER-ER- and MAKER-BRB-based
models (described in Chapter 4) to customer-type detection. In this chapter, we
provide the theory regarding customer types in revenue management and discuss
business settings and tangible booking behaviours. We then formulate a
conceptual framework for customer-type detection, followed by the designed data
linkage to obtain the necessary dataset from different data sources. Then, we
describe how we applied MAKER-ER- and MAKER-BRB-based models to this
case. Finally, we compare the model’s performance to that of alternative methods.
Chapter 6 presents the application of hierarchical rule-based inferential modelling
and prediction based on the MAKER framework to customer decisions. We explain
customers' advanced booking decision-making in the environment of dynamic
pricing. We then formulate a conceptual framework for customer decisions –
including formulating input variables and designing data linkage to obtain the
desired datasets for further analysis. We describe how we developed MAKER-ER-
and MAKER-BRB-based classifiers and compare them to other alternative
methods in terms of their performance of prediction. This chapter answers Q2, Q4,
Q5, and Q6, and fulfils RO1, RO2, and RO3.
Chapter 7 summarises the findings of this research and provides a final conclusion.
We also suggest directions for further research.
Chapter 2 Research Background
2.1. Introduction
This chapter provides the theoretical background underlying the development of
conceptual frameworks for detecting customer types and predicting customer
decisions in revenue management. We also provide a comprehensive analysis of
classification methods that are widely used for predicting discrete outputs, a task
compatible with the purpose of our models: predicting customer types and
decisions. Section 2.2 introduces the concepts of revenue management and
dynamic pricing. Section 2.3 explains advanced booking decision models, and
Section 2.4 gives a brief explanation of machine learning and classification.
Section 2.5 presents machine learning methods that are widely used for
classification. Finally, Section 2.6 provides a critical analysis of these well-known
classification methods and highlights some of their drawbacks.
2.2. Introduction to Revenue Management Theories
Classical revenue management, originally known as yield management, was first
applied in the airline industry (Littlewood, 2005). It is defined as ‘allocating the right type of
capacity to the right kind of customer at the right price in the right time so as to
maximize revenue or yield’ (Kimes, 1989), with the clause ‘to the right distribution
channel’ added to the definition for mixed-channel cases (Hayes and Miller, 2011).
Revenue management can be adopted by an industry that meets the following
conditions. First, it engages in customer segmentation, by means of which different
customers accept different fares for the same product. In addition, its products
generally have low variable costs but high fixed costs. The industry also sells a
perishable product that is characterized by fixed or inflexible capacity (or inventory)
and a limited selling period; once the selling period has ended, the remaining
products cannot be stored as inventory. In addition, the industry uses an advanced
booking mechanism and experiences significant demand fluctuation. Finally, it has
a sophisticated and decentralized information system to gather data about customer behaviour,
demand patterns, and more (Kimes, 1989). Industries which feature these
characteristics include hotels, restaurants, sports and entertainment ticketing,
airlines, cloud computing, advertising, telecommunications, shipping, railways,
electricity suppliers, water suppliers, and retail. These industries have recently
applied revenue management techniques and have been successful adopters
(Ivanov and Zhechev, 2012; Kimes, 2003; Qiwen, 2010; Talluri and van Ryzin, 2004).
Revenue management theory addresses three basic types of decisions (Talluri
and van Ryzin, 2004). First, price decisions include choices about how to price across
segments, product categories, and distribution channels. Second, quantity
decisions answer questions about how to allocate limited or constrained capacity
to different segments, products, or channels; when to open or close fare classes
during the selling period; and so on. Third are structural decisions, which support the
other decisions. Examples of outcomes of structural decisions include price fencing
to define fare restrictions and limitations, selling format (e.g. auctions or price
updating), segmentation formula, and selling design.
Dynamic pricing is one of the most successful strategies in revenue management.
Dynamic pricing, by definition, is a business strategy to maximise revenue by
changing prices ‘either over time, across customers, or across products/bundles’
(Kannan and Kopalle, 2001). It is fundamentally different from fixed price
approaches since it allows customers to buy the same good or service at various
prices, regardless of the promotion format (Talluri and van Ryzin, 2004). The practice
of dynamic pricing works well in situations of limited capacity with high fixed costs
– for example, airlines, hotels, and sports and entertainment ticketing (Etzioni,
Tuchinda, Knoblock, and Yates, 2003; Sahay, 2007). Additionally, the use of the
Internet makes the process of dynamic pricing easier, less costly, and potentially
more effective (Cho et al., 2008).
There are three types of dynamic pricing (Kannan and Kopalle, 2001): 1) posted
price, 2) auction pricing, and 3) bundle pricing. Airlines have applied all of these
types. In the first dynamic price type, the basic strategy is that products are sold at
posted prices, which are updated over time (Etzioni et al., 2003). For the second
type, firms such as hotwire.com and priceline.com sell tickets with hidden attributes
and reveal them after the purchase has been made. Through negotiation at these
sites, customers and sellers reach an agreed-upon price. This type, which is
termed reverse-auction, has been identified as the most successful method for
helping airlines sell excessive seats or last-minute deals for customers with very
flexible schedules (Jerath et al., 2010). The third dynamic pricing type is bundle
pricing. To maximise customer utility based on customers’ needs and preferences,
some airlines offer greater flexibility by allowing customers to select service and
ticket bundles (for example, www.united.com) (Granados et al., 2012). In this
study, we focus on dynamic posted price updating, in which airlines change the
posted price on any of their platforms, such as their website, dynamically over time.
In the airline industry, a general pattern of prices increasing as a flight’s departure
date draws closer has long been applied. This pattern is adjusted to
prospective customers’ shifting willingness to pay (Schwartz, 2000). Leisure
travellers might get lower prices by booking in advance, and business travellers
have to pay higher prices due to their tendency to book close to their date of travel.
Etzioni et al. (2003) found that this general pattern remained the same in the
Internet era; however, over a one-month observation window, prices were divided
into as many as four tiers, and within each tier prices fluctuated with a smaller
variance. Nevertheless, in their study, prices often dropped over time and were
eventually lower than prices very early in the selling period.
On average, airfares can change five to seven times in a day (Etzioni et al., 2003).
The price changes are a manifestation of a company’s response to uncertain
conditions, such as supply–demand pressures (Sahay, 2007), remaining time
(Talluri and van Ryzin, 2004), competitors’ strategy (Levin et al., 2009), or seasonality
(Etzioni et al., 2003). Airlines respond to these conditions by using real-time
reservations, customer booking history, and customer characteristics to perform
demand predictions.
The success of dynamic pricing applications relies on the effectiveness of policies
put in place to minimise cannibalisation (when customers with a high willingness
to pay choose an available lower price) and to induce early purchases so that
demand uncertainty can be reduced (Liu and van Ryzin, 2008; Talluri and van
Ryzin, 2004). Airline companies normally impose restrictions on lower price classes, such
as restrictions on cancellation, ticket reissue, rerouting, and maximum or minimum
stay (Meissner and Strauss, 2010) to prevent such cannibalization. However, given
the availability of information through the Internet, customers now are able to
collect necessary data to decide just when to buy to gain more benefit from this
practice (Chen and Schwartz, 2008). This is the idea of strategic customer
behaviour – that customers anticipate future price drops and delay their purchases.
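The buy-now-or-wait trade-off that defines strategic customer behaviour can be caricatured in a few lines. This is a deliberately simplified, hypothetical decision rule of my own, not a model from the literature reviewed here; the parameter names are illustrative:

```python
def should_wait(price_now, expected_future_price, p_sellout, sellout_cost, search_cost):
    """Toy buy-now-vs-wait comparison: wait only if the expected future
    price, plus the expected cost of a sell-out and the cost of further
    search, undercuts today's posted price. Illustrative only."""
    expected_cost_of_waiting = (
        expected_future_price + p_sellout * sellout_cost + search_cost
    )
    return expected_cost_of_waiting < price_now

# A strategic customer weighs the anticipated drop against the risk:
print(should_wait(200, 160, p_sellout=0.1, sellout_cost=300, search_cost=5))  # True
print(should_wait(200, 180, p_sellout=0.3, sellout_cost=300, search_cost=5))  # False
```

Even in this caricature, restrictions that raise the effective sell-out cost or search cost tip the decision back towards buying now, which is exactly the lever the restriction policies above pull.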
2.3. Advanced Booking Decision-Making
In some cases – for example, airlines and hotels – the customer purchase cycle
theory may not be entirely applicable. In airline ticketing and hotels, for instance,
customers have the option to make a reservation (to book) before the ‘real
purchase’. They may secure a seat or room with full, partial, or zero payment
before the actual purchase. Schwartz (2000, 2006) developed the advanced
booking decision model (ABDM), which provides a theoretical framework for how
savvy online customers exploit dynamic pricing in advanced booking settings. In
these cases, customers are restricted by some conditions, such as cancellation
policies. After evaluating the alternatives, customers may choose the best option
according to their values and then book a ticket. They are then likely to monitor
price changes until departure time. The information retrieved may bring them to
change their decision, triggering them to cancel the previous booking and book a
more favourable ticket once it is available. Gorin et al. (2012) presented a real
example of airline ticket sales which confirmed that customers truly engage in the
‘book then search’ strategy. The dynamic changing of product prices and
availability shapes customers’ perceptions over time and indeed makes customers’
decisions fluctuate even after they have narrowed down their consideration set to
a single product (Chen and Schwartz, 2006).
The ABDM framework was used to identify four possible decisions that online
customers may choose: 1) a ‘book’ strategy, in which customers place a
reservation and do nothing more, with no further search or evaluation; 2) a ‘book then
search’ strategy in which customers place a reservation at an agreed price,
continue the search for the same product by collecting price information until a time
closer to the date of consumption, and rebook if necessary (i.e. when a better deal
is offered); 3) a ‘search’ strategy in which customers search for a better deal
without booking the product until the seemingly best deal comes out; and 4) an
‘exit’ strategy in which customers choose another carrier. This framework has
expanded the former widely applied two-stage decision model – buy now or buy
later (e.g. Anderson and Wilson, 2003) – and three-stage decision model – buy
now, buy later, or exit (Su, 2007). The factors Schwartz considers in his framework
(Schwartz, 2000, 2006) are 1) price pattern (Chen and Schwartz, 2006), 2) time
before the consumption date (Chen and Schwartz, 2008), 3) cancellation fee and
deadline (Chen, Schwartz, and Vargas, 2011), and 4) search cost.
Customers try to balance between benefits (e.g. the possibility of getting a lower
price), costs (e.g. for search efforts, including time spent and physical and
psychological efforts), and risks (e.g. the possibility of sell-outs) when they are
making a purchase decision. While bearing the risk of losing the product,
customers may continue to search for relevant information as long as the perceived
gains outweigh the search effort. Today, customers tend to use meta-search websites,
search engines, or third-party websites which provide supporting features (Etzioni
et al., 2003). These sites help customers to gather and compare information with
specified search strategies from numerous products, and they display the results
in an easy-to-understand view. Additionally, some third-party websites provide
customers with highly sophisticated functions such as price tickers, price trends,
and price alerts. These sites mine information which allows them to make
suggestions to customers on the best time to purchase. As a result, customers
tend to spend less time and less effort during the purchase-related evaluation and
comparison process. This has increasingly fostered customers who are more
knowledgeable and more eager to maximise their gains by exploiting dynamic
pricing.
Online intermediaries are efficient information resources for customers and play a
significant role in shaping the internal reference price that customers use when
evaluating prices available through other agents, either online or offline. Customers may browse
through these resources as their primary information source before making an
offline transaction. This strategy is the most common approach to what is known
as ‘research shopping’ (Bodur et al., 2015). This also happens in airline ticketing:
customers check prices via the Internet and call their trusted offline agents to
finalise their booking (Toh et al., 2012).
In airline ticket sales, ‘book now, pay later’ with or without a deposit has been
widely applied by offline agents, although online agents have started to adopt it
with different degrees of leniency. Online customers, however, are normally asked
for immediate, full payment on completing a booking request with an online
agent. It has thus become a reasonable strategy for customers to buy through offline
agents while monitoring prices over time through online intermediaries in case a
lower price becomes available. In this case, customers can exploit dynamic pricing
with reduced risk of sell-out at reduced cost.
2.4. Introduction to Machine Learning
Machine learning is defined by Arthur Samuel (cited in Awad and Khanna, 2015) as ‘a
field of study that gives computers the ability to learn without being explicitly
programmed’. Machine learning incorporates scientific computing, mathematics,
and statistics (Lee, 2019). It consists of algorithms and techniques that create
systems for data learning (Lee, 2019). Machine learning methods are divided into
supervised, semi-supervised, or unsupervised learning based on whether labelled
data is required during training (Lee, 2019). Supervised learning methods use
labelled data as training data and make predictions for unseen data, whereas
unsupervised learning methods take unlabelled data as training data and make
predictions for unseen data (Lee, 2019). Semi-supervised learning methods learn
from both labelled and unlabelled data (Lin and Cohen, 2010). This last approach
can achieve a comparable level of accuracy when only a few labelled training
samples are available (Lin and Cohen, 2010).
Supervised machine learning acquires information about the input–output
relationships in a system from a training set of input–output pairs and uses this
acquired information to make predictions for unseen inputs (Lu and Wu Ying,
2012). The goal of supervised machine learning is to build a system that is able to
learn the mapping between inputs and outputs and to use that system to predict
the output when given a new input. If the output is numerical data, it is considered
to be a regression task (Lee, 2019). If the output is discrete data, it is considered
to be a classification task (Lee, 2019). Some classification methods are presented
in Section 3.6, including logistic regression (LR), support vector machines (SVM),
neural networks (NN), classification tree (CT), k-nearest neighbours (KNN), and
naïve Bayes (NB).
A brief introduction to the above-mentioned classification methods is summarised
as follows. LR is a linear classification model that learns the relationship between
input and output by minimising the error between the probability of a sample
belonging to a certain class and the actual classification. NB is a classification
method based on Bayes’ theorem under a naïve assumption of conditional
independence for every input variable. KNN is a non-parametric classification
method that classifies a new observation based on a similarity function – for
example, a distance function – with other available observations. A classification
tree is a non-parametric classification technique in the form of a tree structure,
developed through recursive partitioning. The method classifies observations by
decomposing them into subsets based on the values of input variables. SVM is a
classification technique that finds the hyperplane, defined by support vectors
(cases), that maximises the margin between two classes. NN is a complex system
consisting of an interconnected group of nodes that imitates the working of
neurons in the human brain. Further explanation of these classification methods
can be found in Section 3.6. The following section
presents critical analysis of these classification methods.
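The six classifiers named above are all available in scikit-learn. The following is a minimal, illustrative sketch of fitting and scoring them on a synthetic dataset; the dataset and all parameter values here are assumptions for demonstration only, not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The six classification methods introduced above.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "NN": MLPClassifier(max_iter=2000, random_state=0),
    "CT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))  # test-set accuracy
```

Each classifier exposes the same `fit`/`score` interface, which is what makes the kind of side-by-side comparison reported later in the thesis straightforward.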
2.5. Classification Models: Advantages and
Disadvantages
In this section, the machine learning classification methods presented in Section
3.6 are critically analysed in terms of their advantages and disadvantages, as
summarised in Table 2.1.
Table 2.1. Advantages and disadvantages of classification methods

Logistic Regression
Advantages:
1. Logistic regression is an interpretable method at the modular level, with its weights presenting the degree to which an input variable contributes to a certain class prediction (Carvalho, Pereira, and Cardoso, 2019).
2. In terms of flexibility and robustness in case of violations of the assumptions about the underlying data, logistic regression is better than linear discriminant analysis (Liong and Foo, 2013).
Disadvantages:
1. Logistic regression classifiers assume that input variables are independent of each other and are sensitive to outliers (Molnar, 2019).
2. Logistic regression classifiers are limited to linearly separable two-class problems (Molnar, 2019).

Support Vector Machines
Advantages:
1. Support vector machines tend to find a globally optimal solution, since model complexity is considered as a structural risk in SVM training (Ren, 2012).
2. Support vector machines minimise both the empirical risk learnt from the training set and the above-mentioned structural risk. Consequently, these classification models have strong generalization capability (Ren, 2012).
3. Support vector machines are robust and precise under biased data distributions (Auria and Moro, 2008).
Disadvantages:
1. Support vector machines differ from other classifiers in their lack of explicit approximations (Knox, 2018).

Neural Networks
Advantages:
1. Neural networks can model non-linear and complex relationships without imposing any fixed relationships on the data (Haykin, 1999; Tu, 1996).
2. Neural networks are relatively robust to noisy and incomplete labelling (Reed and Lee, 2015).
3. Neural networks have the potential for inherently fault-tolerant and robust computation (Haykin, 1999).
4. Neural networks are capable of adapting to changes in the surrounding environment; for example, when operating in a nonstationary environment, neural networks can change their weights in real time (Haykin, 1999).
5. The parallel nature of neural networks makes the computation of certain tasks fast (Haykin, 1999).
Disadvantages:
1. A neural network is considered a ‘black-box’ model whose internal workings are not transparent and are hence difficult to understand (Molnar, 2019).
2. Other disadvantages of neural networks are their greater computational burden, tendency to overfit, and the empirical nature of model development (Tu, 1996).

Classification Tree
Advantages:
1. Classification trees capture interactions between input variables in the data (Molnar, 2019).
2. The natural visualisation of a classification tree makes it simple and interpretable (Molnar, 2019).
Disadvantages:
1. Classification trees are not efficient when dealing with a linear relationship between an input variable and the output (Molnar, 2019).
2. Slight changes in the input variables can have a big impact on the predicted output (Molnar, 2019).
3. The method is quite unstable, since a few changes in the training set can produce a completely different tree (Molnar, 2019).

K-Nearest Neighbours
Advantages:
1. The k-nearest neighbours algorithm is relatively straightforward (Knox, 2018).
2. The k-nearest neighbours algorithm is interpretable at the local level (Molnar, 2019).
Disadvantages:
1. K-nearest neighbours is expensive to implement, especially for large datasets (Knox, 2018).
2. A small value of k makes the classifier sensitive to particular data points, while a large value of k makes the behaviour of the classifier insensitive to local variations in the class densities. Hence, careful adjustment of k is required (Knox, 2018).

Naïve Bayes
Advantages:
1. Naïve Bayes classifiers are not limited to non-parametric methods, which are relatively expensive to implement. They can be used with parametric, non-parametric, or semi-parametric (a mixture of the two) methods (Knox, 2018).
2. Naïve Bayes classifiers are interpretable models at the modular level, as the contribution of each input variable toward a certain class prediction is very clear (Molnar, 2019).
Disadvantages:
1. Naïve Bayes works under an unrealistically strong assumption of independence between input variables (Molnar, 2019).
2. Naïve Bayes generally provides lower accuracy for problems of a complicated nature than do other, more complex methods (Karim and Rahman, 2013).
From the analysis above, we can highlight two issues with the
classification methods listed in Table 2.1: interpretability and the assumption of
independence between input variables. In this study, we propose hierarchical rule-
based inferential and prediction modelling under the MAKER framework, which is
discussed in Chapter 4.
Chapter 3 Research Methodologies
3.1. Introduction
This chapter explains the research approach, data collection, and available
technologies used in this thesis. Section 3.2 presents the general research
approach of the thesis. Section 3.3 describes the data collection, including data
sources, data components, and data characteristics used in the thesis. Section 3.4
introduces the evidential reasoning (ER) rule, followed by Section 3.5 with an
introduction of the MAKER framework. Section 3.6 briefly explains the machine
learning methods for classification that have been chosen for comparison with a
hierarchical rule-based modelling and prediction based on MAKER framework
proposed in this study. Section 3.7 explains the optimisation method used to find
the optimal parameters of MAKER-based models in this study. The performance
measures are explained in Section 3.8, and Section 3.9 summarises this chapter.
3.2. Research Approach
There are generally three research approaches: 1) qualitative, 2) quantitative, and
3) mixed-method approaches (Creswell, 2018). Qualitative research is an
approach used to explore and understand individuals or groups in human or social
problems. The various types of qualitative designs include narrative research,
phenomenology, grounded theory, ethnographies, and case studies. Researchers
analyse qualitative data inductively, working from particulars to general themes –
for example, taking a recorded interview, document data, audio visual data, and
observation data and making interpretations of the meaning of the data.
Quantitative research, by contrast, is an approach to testing objective theories
using mathematical and statistical theories or models. The variables tested in
quantitative research can be measured and expressed numerically. Mixed-method
approaches incorporate elements of both qualitative and quantitative research. All
the data used in this research is numeric, and the purpose of the study is to
introduce a new approach to customer classification and decision prediction along
with performing other machine learning methods as comparison. Therefore, the
research approach in this project is considered quantitative.
3.3. Data Collection
As discussed previously, before applying them to the real-world data – that is,
customer types and decisions – in Chapter 4 we analysed the effect of the
hierarchical structure employed in MAKER frameworks on the complexity and
accuracy of the MAKER-based models. Hence, we need to evaluate the
generalization capability, efficiency, and complexity of the hierarchical MAKER
frameworks. In this research, we utilised the ‘make_classification’ and
‘make_blobs’ functions in sklearn in Python to generate datasets with different
characteristics – for example, with or without noise, one or two clusters per class,
and blob data. More clusters per class in a dataset lead to a more complex, non-
linear class boundary. The noisy datasets were designed to investigate the
robustness of the models.
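A sketch of the kind of synthetic-data generation described above is given below. The exact sample sizes and parameters used in this study are not restated here, so the values below are assumptions for illustration.

```python
from sklearn.datasets import make_blobs, make_classification

# Two clusters per class produce a more complex, non-linear class boundary;
# flip_y injects label noise to test robustness (parameter values are assumed).
X1, y1 = make_classification(n_samples=1000, n_features=2, n_informative=2,
                             n_redundant=0, n_clusters_per_class=2,
                             flip_y=0.05, random_state=0)

# Blob data: isotropic Gaussian clusters, one centre per class.
X2, y2 = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=0)

print(X1.shape, y1.shape, X2.shape, y2.shape)
```

Varying `n_clusters_per_class` and `flip_y` is one convenient way to sweep from linearly separable to noisy, non-linear datasets with a fixed generator interface.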
Based on the generalization capability of the hierarchical MAKER frameworks,
we then apply the frameworks to the real-world classification datasets – customer
types and decisions. The performance of the hierarchical MAKER frameworks was
then compared against that of other machine learning methods. The rest of this
section describes the datasets of customer types and decisions.
Data was collected from an online reservation application provider from Indonesia
with the web address www.pointer.co.id. They provide a booking system which has
been adopted by more than 500 agents. The data consists of two main sources:
passenger booking records and price databases. Personal information of
passengers is completely confidential and fully anonymized.
First, the passenger booking database consists of passenger name records
(PNRs), which contain name, origin-destination, departure date, departure time,
carrier/airline, ticket price, group size or the composition of passengers for group
booking (i.e. the numbers of adults, infants, and children), booking status, booking
time (A1), and date and time by which payment must be made (B1). Passengers
who place a reservation can secure a seat without paying a fee or deposit, and the
seat is issued once they make full payment before B1. Later we define the period
between A1 and B1 as the holding period. When passengers alter their travel plans
– for example, changing the departure date, cancelling the booking, or not making
a payment before the holding period ends – the system creates a new PNR if they
place a reservation again. Therefore, tracking passenger historical bookings must
be done carefully, a process which is explained in Chapters 5 and 6. Since this
study focuses on cancel-rebook behaviour by customers who wish to find lower-
priced alternatives for their travel plans, we consider only cancel-rebook records
generated after customers presumably have fixed travel plans: that is, same flight,
same departure time, same origin-destination, and same composition of
passengers for group booking. For customers with seemingly fixed travel plans,
cancel-rebook behaviour is perceived as their effort to find lower-priced
alternatives by delaying their purchase and waiting for an expected lower price.
Hence, any previous cancel-rebook records that show changes in their travel plans
are removed from the database.
Second, in the price database, prices of each flight for certain departure dates are
recorded and updated every three hours to capture price changes. Only direct
flights are considered in this research. Connecting flights were removed. The
database consists of posted prices, the name of airlines (carrier), departure date
and time, origin-destination, and update time. Data was collected from 18th July
2017 to 24th September 2017, meaning that it covered price dynamics for the two
months prior to the departure date. Thirty-one pairs of cities were chosen to
represent origin-destinations with different characteristics in terms of price range
and the number of flights offered per day, as presented in Table 3.1.
These two databases were used to generate the desired datasets containing
useful information for predicting customer types and decisions, as presented in
Sections 5.3 and 6.2.3. In short, there are four system input variables and two
system output categories for predicting customer types, as further discussed in
Chapter 5, and six system input variables generated with two system outputs – buy
or wait – for predicting customer decisions, as further discussed in Chapter 6.
Table 3.1. Data characteristics
(Columns: No; Origin-Destination; Price in Rupiahs – Min, Max, Stdev; Flights per day – Min, Max; Number of bookings)
1 NTXBTH 555,000 1,763,900 129,834 0 2 12
2 BTHNTX 857,000 1,583,000 170,973 0 2 3
3 UPGTIM 1,060,000 3,073,000 300,084 1 3 12
4 TIMUPG 1,025,000 2,406,600 162,702 2 3 13
5 AMQCGK 1,070,300 4,032,600 493,210 3 4 13
6 PKYCGK 587,000 1,470,500 179,789 3 4 77
7 CGKPKY 427,000 1,713,500 172,419 3 4 49
8 SUBPKY 457,800 1,283,000 107,364 3 3 25
9 PKYSUB 397,800 1,135,000 103,271 3 3 54
10 CGKAMQ 1,133,000 5,445,000 550,224 4 5 16
11 BTJCGK 832,000 2,962,700 405,221 4 5 21
12 CGKBTJ 847,000 3,057,700 353,912 4 5 15
13 UPGPLW 338,000 2,655,000 259,337 4 5 17
14 PNKSUB 401,000 1,495,900 119,559 4 4 33
15 SUBPNK 511,700 1,545,900 140,822 4 4 13
16 DJJUPG 971,000 3,053,500 285,361 5 5 7
17 UPGDJJ 860,000 3,755,000 698,107 5 5 11
18 PLWUPG 338,000 2,620,000 129,179 5 5 17
19 JOGBPN 335,500 1,970,100 194,495 6 6 21
20 BPNJOG 579,500 2,020,100 262,824 6 6 5
21 CGKBKS 328,000 1,613,400 132,750 7 8 14
22 UPGBPN 382,200 1,280,000 89,524 7 7 35
23 BPNUPG 415,000 1,128,000 89,121 7 7 23
24 BKSCGK 348,000 1,513,400 164,531 8 8 12
25 BDJSUB 373,800 1,182,500 112,529 9 9 43
26 SUBBDJ 313,000 1,222,500 120,057 9 9 23
27 LOPCGK 524,800 3,310,000 433,196 10 10 15
28 UPGKDI 224,000 1,192,000 86,978 11 12 42
29 SOCCGK 319,000 1,204,500 218,541 11 13 27
30 CGKLOP 514,800 3,300,000 375,778 12 12 19
31 CGKSOC 319,000 1,284,500 267,697 12 13 18
32 BTHCGK 410,000 3,365,000 222,671 13 13 23
33 CGKBTH 410,000 1,965,400 194,336 13 13 46
34 JOGHLP 318,300 2,024,000 182,238 13 13 52
35 HLPJOG 318,300 2,024,000 171,693 13 13 61
36 KDIUPG 224,000 1,333,000 96,674 13 13 30
37 BDJCGK 511,300 1,919,500 194,945 15 16 40
38 CGKBDJ 511,300 1,999,500 192,695 15 16 72
39 BPNSUB 348,000 1,579,000 119,936 15 17 16
40 SUBBPN 494,300 1,655,900 156,028 15 17 22
41 UPGSUB 352,000 2,336,000 172,009 16 17 52
42 TKGCGK 135,000 992,000 105,196 17 19 72
43 PDGCGK 500,400 2,619,000 256,082 18 20 58
44 CGKPDG 489,500 3,432,000 265,338 18 20 46
45 PKUCGK 489,800 3,295,000 151,032 18 21 16
46 CGKPKU 494,800 1,937,900 169,696 18 20 39
47 CGKTKG 135,000 1,022,000 133,430 18 19 59
48 SUBUPG 332,000 2,361,000 170,240 19 19 46
49 SRGCGK 303,600 1,903,000 209,129 21 23 34
50 PLMCGK 311,100 1,221,000 156,497 22 23 35
51 CGKPLM 311,100 2,002,000 166,410 22 24 39
52 CGKSRG 313,600 1,903,000 237,245 22 23 45
53 PNKCGK 315,000 1,714,800 173,475 23 24 104
54 CGKPNK 315,000 1,804,800 197,689 23 24 114
55 JOGCGK 320,500 2,035,000 271,576 24 25 33
56 CGKJOG 335,500 2,035,000 294,266 25 25 54
57 CGKKNO 495,000 3,212,000 309,450 34 37 68
58 KNOCGK 495,000 4,260,000 279,839 35 37 54
59 CGKUPG 552,000 4,301,000 334,070 35 36 138
60 UPGCGK 552,000 4,316,000 344,121 37 40 126
61 CGKSUB 401,500 2,772,000 266,977 48 51 92
62 SUBCGK 320,000 2,812,000 258,385 50 52 56
3.4. Evidential Reasoning
Evidential reasoning (ER) is built on the basis of Dempster–Shafer theory
(D-S theory), first developed by Dempster in the 1960s and extended by Shafer in
the 1970s (Binaghi and Madella, 1999). In D-S theory, we first define a set of possible
propositions that are mutually exclusive and collectively exhaustive, called a
discernment framework. We then perform a basic probability assignment or mass
function that measures the probability of pointing exactly to a certain proposition
(Chen et al., 1960). Dempster’s rule of combination is applied to combine two
independent sets of mass functions in a frame of discernment under an orthogonal
sum operation that is associative and commutative.
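Dempster's rule of combination described above can be sketched in a few lines of Python. Propositions are represented as frozensets over a frame of discernment; the two mass functions below are illustrative assumptions, not data from this study.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset -> mass)
    under Dempster's rule, renormalising away the conflicting mass."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:  # non-empty intersection supports that proposition
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:      # empty intersection: conflicting mass
            conflict += ma * mb
    k = 1.0 - conflict  # normalisation constant
    return {p: v / k for p, v in combined.items()}

# Two pieces of evidence over the frame {'buy', 'wait'} (illustrative values):
m1 = {frozenset({'buy'}): 0.6, frozenset({'buy', 'wait'}): 0.4}
m2 = {frozenset({'buy'}): 0.5, frozenset({'wait'}): 0.3,
      frozenset({'buy', 'wait'}): 0.2}
print(dempster_combine(m1, m2))
```

Because the orthogonal sum is associative and commutative, repeated calls to `dempster_combine` fuse any number of mass functions in any order.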
In general, classification includes ‘learning the invariant and common properties of
a set of samples characterizing a class’ and ‘deciding if a new sample is a possible
member of the class’ (Binaghi and Madella, 1999). These two tasks are named
abstraction and generalization, respectively. Classification models estimate the
function characterizing class membership and develop a deductive inference
mechanism to perform a reasoning process to assign a new sample to a given
class (Binaghi and Madella, 1999). In contrast to Bayesian inference, D-S theory
does not require a priori knowledge. Since the aim of classification is to detect
unseen data, and a priori knowledge may not always be provided, D-S theory is
suitable for classification (Chen et al., 1960).
However, D-S theory has difficulty managing conflicting beliefs (Yang and Xu,
2013) when combining two pieces of evidence. Another disadvantage of D-S
theory is that it assumes that all the evidence is completely reliable. Yang and Xu
(2013) proposed the ER rule, a generic conjunctive probability reasoning process,
to deal with the limitations of D-S theory and introduce inherent properties of
evidence – namely, the quality of the information source and the relative
importance of evidence, denoted by reliability and weight, respectively. Weighted
belief distributions (WBD) and weighted belief distributions with reliability
(WBDR) replace the belief distribution of Dempster’s rule.
Dempster’s rule is a special case of the ER rule – the case in which all the evidence
is completely reliable. The ER rule also improves the original ER algorithm (Xu,
Yang, and Wang, 2006; Yang, 2001) when the reliability of evidence is equal to its
weight, and the weights are normalised. The ER rule does not always require such
normalisation. The ER rule essentially can deal with different types of uncertainty
and supports rule- or utility-based information transformation techniques (Xu,
2011; Yang, 2001). The ER
algorithm has been adopted in the belief rule base (BRB) system, namely in the
rule-based inference methodology using evidential reasoning (RIMER) approach
(Yang et al., 2006). The RIMER approach has been applied in many areas (Chang
et al., 2013; Kong et al., 2016; Tang et al., 2011). The RIMER approach is able to
model the relationship between system inputs and outputs and to handle different
types of information under different types of uncertainty. However, the number of
rules increases exponentially as the number of input variables and the referential
value of each variable increases (Yang and Xu, 2017). The ER rule and the RIMER
approach are explained in the following sections.
3.4.1. Evidential Reasoning Rule
The ER rule has been established for combining evidence while taking weights
and reliabilities (Yang and Xu, 2013) into account when forming a belief
distribution. Suppose that Θ = {𝜃1, … , 𝜃𝑁} is a set of mutually exclusive and
collectively exhaustive propositions, referred to as the frame of discernment. Its
power set contains 2^N subsets of Θ, consisting of the empty set, the singleton
propositions, and their combinations, as seen in Equation (3.1).

$$P(\Theta) = 2^{\Theta} = \{\phi, \{\theta_1\}, \dots, \{\theta_N\}, \{\theta_1, \theta_2\}, \dots, \{\theta_1, \dots, \theta_{N-1}\}, \Theta\} \tag{3.1}$$

where 𝜙 is the empty set. Each piece of evidence is profiled by a belief distribution
(BD), as displayed in Equation (3.2).

$$e_j = \Big\{(\theta, p_{\theta,j}),\ \forall \theta \subseteq \Theta,\ \sum_{\theta \subseteq \Theta} p_{\theta,j} = 1\Big\} \tag{3.2}$$
for 𝑗 = {1,… , 𝐿} where L is the number of pieces of evidence, and N is the number
of propositions. 𝑝𝜃,𝑗 represents the degree to which a piece of evidence, 𝑒𝑗, points
to proposition 𝜃, which can be any subset of Θ or any element of 𝑃(Θ) except the
empty set.
Reliability and weight are the parameters associated with each piece of evidence.
Reliability is denoted by 𝑟𝑗, with 0 ≤ 𝑟𝑗 ≤ 1; 𝑟𝑗 = 0 stands for ‘not reliable at all’, and
𝑟𝑗 = 1 stands for ‘completely reliable’ (Yang and Xu, 2013). A piece of evidence
may also have weights, denoted by 𝑤𝑗, which indicates the relative importance of
that piece of evidence compared with other evidence (Yang and Xu, 2013). In the
case with 𝑤𝑗 = 𝑟𝑗 in the frame of WBD, both share the same definition of reliability,
and both are measured in the same joint space (Yang and Xu, 2014). As such, 1 -
𝑟𝑗 acts as the unreliability of evidence 𝑒𝑗, and it provides room for another piece of
evidence to support or oppose different propositions. On the other hand, if 𝑤𝑗 ≠ 𝑟𝑗,
it means that the different pieces of evidence have been generated from different
sources or different measurements (Xu et al., 2017). WBDR is defined in Equation
(3.3).
$$m_j = \big\{(\theta, \tilde{m}_{\theta,j}),\ \forall \theta \subseteq \Theta;\ \big(P(\Theta), \tilde{m}_{P(\Theta),j}\big)\big\}$$

$$\tilde{m}_{\theta,j} = \begin{cases} 0 & \theta = \phi \\ c_{rw,j}\, m_{\theta,j} & \theta \subseteq \Theta,\ \theta \neq \phi \\ c_{rw,j}\,(1 - r_j) & \theta = P(\Theta) \end{cases} \tag{3.3}$$

where $m_{\theta,j} = w_j p_{\theta,j}$, and $c_{rw,j} = 1/(1 + w_j - r_j)$ is a normalisation factor ensuring
$\sum_{\theta \subseteq \Theta} \tilde{m}_{\theta,j} + \tilde{m}_{P(\Theta),j} = 1$, given that $\sum_{\theta \subseteq \Theta} p_{\theta,j} = 1$. $\tilde{m}_{\theta,j}$ shows the degree to which
evidence 𝑒𝑗 supports proposition 𝜃, with weights and reliabilities considered.
Through an orthogonal sum operation, two independent pieces of evidence can be
combined in any order, as displayed in Equations (3.4)–(3.7), to measure the
degree of joint support resulting from 𝑒1 and 𝑒2, which is denoted by 𝑃𝜃,𝑒(2). The
information given by 𝑒2 does not depend on the result of 𝑒1 and vice versa, and
these pieces of evidence include belief distribution, reliability and weight. This
combination process must be done recursively in the case of multiple pieces of
evidence (i.e. L pieces of evidence) before generating the total combined degree
of joint support for proposition 𝜃, 𝑃𝜃,𝑒(𝐿), which is explicitly written in Equation (3.4).
Let 𝑒(𝑖) be defined as the combination of the first 𝑖 pieces of evidence. In addition,
due to the natural properties of the orthogonal sum operation – that is, associativity
and commutativity – this combination can be performed in any order.
$$p_{\theta,e(2)} = \begin{cases} 0 & \theta = \phi \\[4pt] \dfrac{\hat{m}_{\theta,e(2)}}{\sum_{D \subseteq \Theta} \hat{m}_{D,e(2)}} & \theta \subseteq \Theta,\ \theta \neq \phi \end{cases} \tag{3.4}$$

$$\hat{m}_{\theta,e(2)} = \big[(1 - r_2)\, m_{\theta,1} + (1 - r_1)\, m_{\theta,2}\big] + \sum_{B \cap C = \theta} m_{B,1}\, m_{C,2} \quad \forall \theta \subseteq \Theta \tag{3.5}$$

$$\hat{m}_{\theta,e(i)} = \big[(1 - r_i)\, m_{\theta,e(i-1)} + m_{P(\Theta),e(i-1)}\, m_{\theta,i}\big] + \sum_{B \cap C = \theta} m_{B,e(i-1)}\, m_{C,i} \quad \forall \theta \subseteq \Theta \tag{3.6}$$

$$m_{P(\Theta),e(i)} = (1 - r_i)\, m_{P(\Theta),e(i-1)} \tag{3.7}$$
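The two-evidence combination of Equations (3.4)–(3.5) can be sketched as follows. Propositions are frozensets, each piece of evidence carries a belief distribution with a weight w and reliability r, and all numeric values are illustrative assumptions.

```python
def er_combine(b1, w1, r1, b2, w2, r2):
    """Combine two pieces of evidence and return the normalised joint
    support p_{theta,e(2)} for each non-empty proposition (Eq. 3.4)."""
    m1 = {t: w1 * p for t, p in b1.items()}  # m_{theta,j} = w_j * p_{theta,j}
    m2 = {t: w2 * p for t, p in b2.items()}
    # All non-empty propositions that can receive support.
    thetas = set(m1) | set(m2) | {b & c for b in m1 for c in m2 if b & c}
    m_hat = {}
    for theta in thetas:
        # Bracketed unreliability terms of Equation (3.5) ...
        v = (1 - r2) * m1.get(theta, 0.0) + (1 - r1) * m2.get(theta, 0.0)
        # ... plus the orthogonal-sum term over B intersect C = theta.
        for b in m1:
            for c in m2:
                if b & c == theta:
                    v += m1[b] * m2[c]
        m_hat[theta] = v
    total = sum(m_hat.values())  # normalisation in Equation (3.4)
    return {t: v / total for t, v in m_hat.items()}

theta1, theta2 = frozenset({'buy'}), frozenset({'wait'})
whole = theta1 | theta2
b1 = {theta1: 0.7, whole: 0.3}  # illustrative belief distributions
b2 = {theta2: 0.4, whole: 0.6}
print(er_combine(b1, 0.9, 0.8, b2, 0.6, 0.7))
```

For more than two pieces of evidence, the same step is applied recursively via Equations (3.6)–(3.7), carrying the unnormalised masses forward between steps.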
3.4.2. Rule-based Inference Methodology Using Evidential
Reasoning (RIMER)
RIMER is developed on the basis of a BRB system, which is an extension of a
traditional IF-THEN rule base to represent different types of knowledge under
uncertainty, and the ER rule to combine multiple pieces of evidence from activated
belief rules. Traditional IF-THEN rules are extended by assigning degrees of belief
in all possible consequences of each rule, as displayed in Equation (3.8). In the
RIMER framework, other parameters, including rule weights, attribute weights, and
consequent belief degrees, are designed to represent the belief rules (Kong et al.,
2016). Those parameters can be fine-tuned through a learning process with
historical data. Whereas a traditional IF-THEN rule is clear-cut, a BRB system
needs to undergo a learning process to get the best performance from the model.
The belief rules take the form shown in Equation (3.8).

$$\text{If } A_1^k \wedge A_2^k \wedge \dots \wedge A_{T_k}^k, \text{ then } \big\{(D_1, \beta_{1k}), (D_2, \beta_{2k}), \dots, (D_N, \beta_{Nk})\big\},$$
$$\text{with rule weight } \theta_k \text{ and attribute weights } \delta_1, \delta_2, \dots, \delta_{T_k},$$
$$\text{where } \beta_{jk} \geq 0 \text{ and } \sum_{j=1}^{N} \beta_{jk} \leq 1 \tag{3.8}$$

𝐴𝑖𝑘 (𝑖 = 1,… , 𝑇𝑘) corresponds to the referential point of the ith attribute used in the
kth rule. 𝛽𝑗𝑘 (𝑗 = 1,… ,𝑁; 𝑘 = 1,… , 𝐿) is the belief degree assigned to consequent
𝐷𝑗, where 𝑁 is the number of consequents and 𝐿 is the number of rules. Belief
degrees can initially be drawn from experts, historical data, or common knowledge.
𝜃𝑘 is the rule weight, which shows the relative importance of the kth rule, while
𝛿𝑖 (𝑖 = 1,… , 𝑇𝑘) is the attribute weight representing the relative importance of the ith
attribute. 𝑇𝑘 is the total number of attributes used in the kth rule. Belief degrees are
expressed in terms of the probability with which 𝐷𝑗 is likely to occur. The total belief
degree of a rule can be less than or equal to one, which is designed to handle
missing data or unknown consequents.
If necessary, input values are transformed to belief distributions corresponding to
referential points used in the BRB. These belief distributions represent the degree
to which the input values belong to the referential points. 𝑥𝑖, as the input value for
the ith attribute, is transformed as 𝑆(𝑥𝑖), as seen in Equation (3.9).
$$S(x_i) = \big\{(A_{ij}, \alpha_{ij});\ j = 1, \dots, J_i\big\}, \quad i = 1, \dots, T$$
$$\text{where } 0 \leq \alpha_{ij} \leq 1 \text{ and } \sum_{j=1}^{J_i} \alpha_{ij} \leq 1 \tag{3.9}$$
𝐴𝑖𝑗 is the jth referential category of the ith attribute while 𝛼𝑖𝑗 shows the degree to
which 𝑥𝑖 belongs to the referential point 𝐴𝑖𝑗. 𝐽𝑖 is the number of all referential points
of the ith attribute and 𝑇 is the number of all attributes. If a BRB has 𝑇 attributes,
then the rules are extended from Equation (3.8) by taking all possible combinations
of the referential points for the 𝑇 attributes as displayed in Equation (3.10)
$$(A_1^k, \alpha_1^k) \wedge (A_2^k, \alpha_2^k) \wedge \dots \wedge (A_{T_k}^k, \alpha_{T_k}^k) \tag{3.10}$$

where 𝐴𝑖𝑘 ∈ {𝐴𝑖𝑗 , 𝑗 = 1,… , 𝐽𝑖} and 𝛼𝑖𝑘 ∈ {𝛼𝑖𝑗 , 𝑗 = 1,… , 𝐽𝑖}. Therefore, the input can be
transformed into referential points with a distributed probability or belief structure.
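The input transformation of Equation (3.9) can be sketched as follows for a numeric attribute: the input value is distributed over its two neighbouring referential points in proportion to its distance from each. The referential points chosen here are illustrative assumptions.

```python
def transform(x, ref_points):
    """Return {referential_point: matching degree alpha_ij} for input x,
    distributing x over its two neighbouring referential points."""
    pts = sorted(ref_points)
    if x <= pts[0]:          # clamp below the smallest referential point
        return {pts[0]: 1.0}
    if x >= pts[-1]:         # clamp above the largest referential point
        return {pts[-1]: 1.0}
    for lo, hi in zip(pts, pts[1:]):
        if lo <= x <= hi:
            alpha_hi = (x - lo) / (hi - lo)
            return {lo: 1.0 - alpha_hi, hi: alpha_hi}

# Example: referential points 0, 50, 100 for some numeric attribute (assumed).
print(transform(80.0, [0.0, 50.0, 100.0]))  # {50.0: 0.4, 100.0: 0.6}
```

The matching degrees always sum to one here, giving a complete belief distribution; assigning less than full belief would model ignorance, as Equation (3.9) permits.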
We then need to calculate the activation weight of each belief rule in the rule base.
The activation weight, denoted by 𝜔𝑘, represents the degree to which the packet
attribute denoted by 𝐴𝑘 in the kth rule is triggered by the inputs. It can be calculated
using Equation (3.11).
$$\omega_k = \frac{\theta_k \alpha_k}{\sum_{j=1}^{L} \theta_j \alpha_j} = \frac{\theta_k \prod_{i=1}^{T_k} \big(\alpha_i^k\big)^{\bar{\delta}_{ki}}}{\sum_{j=1}^{L} \Big[\theta_j \prod_{l=1}^{T_k} \big(\alpha_l^j\big)^{\bar{\delta}_{jl}}\Big]}, \qquad \bar{\delta}_{ki} = \frac{\delta_{ki}}{\max_{i=1,\dots,T_k}(\delta_{ki})},\ \ 0 \leq \bar{\delta}_{ki} \leq 1 \tag{3.11}$$

where 𝑘 = 1,… , 𝐿.
As shown in Equation (3.11), the activation weight depends on the rule weight (𝜃𝑘)
and the belief degrees associated with various referential points resulting from
input transformation. 𝛼𝑖𝑘, obtained by Equation (3.9), is the matching degree to
which the input value is associated with 𝐴𝑖𝑘 (𝑖 = 1,… , 𝑇𝑘; 𝑘 = 1,… , 𝐿), where
𝐴𝑖𝑘 is the referential point of the ith attribute in the kth rule. 𝛼𝑘 represents the degree
to which the input vector matches the packet attribute 𝐴𝑘 in the kth rule. 𝑇𝑘 is the
total number of all attributes in kth rule. 𝐿 is the number of belief rules in the rule
base. A belief rule with 𝜔𝑘 = 0 is by default not activated; otherwise, a belief rule
with 𝜔𝑘 > 0 is activated. The activated belief degrees associated with consequents
are then combined through an inference process using an ER approach. An
activated belief rule is treated as a piece of basic evidence, with belief degrees transformed
into basic probability masses, and the ER algorithm combines the activated belief
rules to generate the joint probability for each possible consequent (𝐷𝑗).
$$\beta_j = \frac{\mu \Big[\prod_{k=1}^{L}\big(\omega_k \beta_{jk} + 1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\big) - \prod_{k=1}^{L}\big(1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\big)\Big]}{1 - \mu \Big[\prod_{k=1}^{L}(1 - \omega_k)\Big]} \tag{3.12}$$

$$\mu = \bigg[\sum_{j=1}^{N} \prod_{k=1}^{L}\Big(\omega_k \beta_{jk} + 1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\Big) - (N - 1)\prod_{k=1}^{L}\Big(1 - \omega_k \sum_{i=1}^{N} \beta_{ik}\Big)\bigg]^{-1} \tag{3.13}$$
The combined belief degrees denoted by 𝛽𝑗 associated with consequent
𝐷𝑗(𝑗 = 1,… ,𝑁) are a function of 𝜔𝑘 by Equation (3.12) and the belief degrees
𝛽𝑗𝑘 (𝑗 = 1,… ,𝑁; 𝑘 = 1,… , 𝐿). The activation weight 𝜔𝑘 itself depends on the input
vector 𝑥, the attribute weights 𝛿𝑖(𝑖 = 1,… , 𝑇), and the rule weights 𝜃𝑘.
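Equations (3.11)–(3.13) can be sketched end to end as follows: activation weights are computed for each rule, and the activated rules are then combined analytically. The two-rule, two-consequent rule base and all numbers below are illustrative assumptions, not values from this study.

```python
from math import prod

def activation_weights(alphas, rule_weights, attr_weights):
    """Equation (3.11): alphas[k][i] is the matching degree of the ith
    attribute in the kth rule; attribute weights are normalised by their max."""
    d_bar = [d / max(attr_weights) for d in attr_weights]
    raw = [rw * prod(a ** db for a, db in zip(al, d_bar))
           for rw, al in zip(rule_weights, alphas)]
    return [v / sum(raw) for v in raw]

def combined_beliefs(omega, beta):
    """Equations (3.12)-(3.13): beta[k][j] is the belief degree of the jth
    consequent in the kth rule; omega[k] is that rule's activation weight."""
    L, N = len(beta), len(beta[0])
    s = [sum(bk) for bk in beta]  # total belief degree of each rule
    pj = [prod(omega[k] * beta[k][j] + 1 - omega[k] * s[k] for k in range(L))
          for j in range(N)]
    p0 = prod(1 - omega[k] * s[k] for k in range(L))
    mu = 1.0 / (sum(pj) - (N - 1) * p0)       # Equation (3.13)
    p_none = prod(1 - w for w in omega)
    return [mu * (pj[j] - p0) / (1 - mu * p_none) for j in range(N)]

omega = activation_weights([[0.8, 0.3], [0.2, 0.7]], [1.0, 1.0], [1.0, 0.5])
beta = combined_beliefs(omega, [[0.9, 0.1], [0.2, 0.8]])
print(omega, beta)
```

With complete rules (belief degrees summing to one in every rule), the combined belief degrees also sum to one, which is a useful sanity check on any implementation.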
3.5. Maximum Likelihood Evidential Reasoning
(MAKER) Framework
In this section, we provide a brief explanation of the MAKER framework developed
by Yang and Xu (2017) as an alternative inferential process for system analysis
and decision making. The rule-based inferential modelling and prediction
approaches developed in this study are fundamentally based on the MAKER
framework.
This framework defines two spaces: a state space model (SSM) and an evidence
space model (ESM) (Yang and Xu, 2017). An SSM describes system states or
changes with different inputs, while an ESM describes multiple pieces of evidence
with interdependencies in a probabilistic and distributed manner to represent
system behaviours. A probability, which is obtained from likelihoods generated
from data, is assigned to each evidential element associated with a subset of
system states. As such, a piece of evidence is profiled as a basic probability
distribution. The degree of interdependence is statistically calculated through
marginal and joint likelihood functions. Two pieces of evidence are then combined
through a conjunctive ER rule.
In an SSM, suppose that 𝐻𝑛 is a system state. A system space may consist of N
mutually exclusive and collectively exhaustive system states, and hence the SSM
can be denoted by Θ = {𝐻1, 𝐻2, … , 𝐻𝑁}, with 𝐻𝑖 ∩ 𝐻𝑗 = 𝜙 for any 𝑖 ≠ 𝑗. Let 𝑃(Θ) or 2Θ be
the power set of Θ, which contains the empty set, single system states, and subsets
of system states, as described in Equation (3.14). An output is profiled by a basic
probability, which is defined as an ordinary discrete probability distribution (Yang
and Xu, 2017). No probability is assigned to the empty set. The conditions for a
probability function are described in Equations (3.15)–(3.17).
$$P(\Theta) = 2^{\Theta} = \{\phi, \{H_1\}, \dots, \{H_N\}, \{H_1, H_2\}, \dots, \{H_1, \dots, H_{N-1}\}, \Theta\} \tag{3.14}$$

$$0 \leq p(\theta) \leq 1 \quad \forall \theta \subseteq \Theta \tag{3.15}$$

$$\sum_{\theta \subseteq \Theta} p(\theta) = 1 \tag{3.16}$$

$$p(\phi) = 0 \tag{3.17}$$
𝜃 is a subset of states, or a proposition, whose probability cannot be decomposed
into pieces assigned to subsets of 𝜃. 𝑝(𝜃) is the probability that the proposition 𝜃 is true. A
system output, 𝑦, is profiled as a probability distribution as displayed in Equation
(3.18).
$$y = \Big\{(\theta, p(\theta)),\ \forall \theta \subseteq \Theta,\ \sum_{\theta \subseteq \Theta} p(\theta) = 1\Big\} \tag{3.18}$$
In an ESM, each piece of evidence is generated from data and is divided into
several evidential elements. Each element points to exactly one proposition.
Suppose that 𝑒𝑖,𝑙(𝜃) is an element of the ith piece of evidence from input variable
𝑥𝑙 which points exactly to proposition 𝜃. The evidential element of 𝑒𝑖,𝑙 represents
the evidence subspace for the ith value of 𝑥𝑙. 𝑝𝜃,𝑖,𝑙 is a basic probability assigned to
𝑒𝑖,𝑙(𝜃) according to the likelihood principle and the Bayesian principle (Yang and
Xu, 2014). 𝑐𝜃,𝑖,𝑙 is the likelihood of the ith value of 𝑥𝑙 given proposition 𝜃, and 𝑝𝜃,𝑖,𝑙
is a normalised likelihood as stated in Equation (3.20). Each 𝑒𝑖,𝑙 is then profiled by
a basic probability distribution as displayed in Equation (3.19).
$$e_{i,l} = \Big\{\big(e_{i,l}(\theta), p_{\theta,i,l}\big),\ \forall \theta \subseteq \Theta,\ \sum_{\theta \subseteq \Theta} p_{\theta,i,l} = 1\Big\} \tag{3.19}$$

$$p_{\theta,i,l} = c_{\theta,i,l} \Big/ \sum_{A \subseteq \Theta} c_{A,i,l} \tag{3.20}$$
For a discrete 𝑥𝑙, the evidence subspace can be denoted by 𝐸𝑙 =
{𝑒1,𝑙 , 𝑒2,𝑙 , … , 𝑒𝑖,𝑙 , …}, leading to a discrete ESM. For a continuous 𝑥𝑙, the most direct
approach is to discretise 𝑥𝑙, a process which is explained in Chapter 4. The
interrelationship between each pair of input variables is assessed based on the
statistical interdependence between the two inputs. According to the likelihood
principle and the Bayesian principle, the joint basic probability can be obtained
from a joint likelihood function as described in Equation (3.21). Suppose that 𝑒𝑖,𝑙
and 𝑒𝑗,𝑚 are the two pieces of evidence from input variables 𝑥𝑙 and 𝑥𝑚,
respectively. The interrelationship between the two evidential elements is
represented by the interdependence index term as shown in Equation (3.22), with
its properties shown in Equation (3.23).
p_{θ,il,jm} = c_{θ,il,jm} / ∑_{A⊆Θ} c_{A,il,jm} (3.21)

α_{A,B,i,j} = 0 if p_{A,i,l} = 0 or p_{B,j,m} = 0, and α_{A,B,i,j} = p_{A,B,il,jm} / (p_{A,i,l} p_{B,j,m}) otherwise (3.22)

α_{A,B,i,j} = 0 if e_{i,l} and e_{j,m} are disjoint; α_{A,B,i,j} = 1 if e_{i,l} and e_{j,m} are independent (3.23)
Multiple pieces of evidence are then combined through the conjunctive MAKER
rule process. In the joint-evidence state space, each output of an SSM intersects
with each evidential element in an ESM. As such, it is possible to measure the
individual support for a proposition from evidential elements and to measure joint
support with interdependence among evidential elements considered. Let s_{i,l}(θ) =
θ ∩ e_{i,l}(θ) represent the intersection between θ and e_{i,l}(θ), meaning that e_{i,l}(θ)
supports the proposition θ. If evidence e_{i,l} is generated from the same data source
as the other evidence, with probability function p, then p(s_{i,l}(θ)) is the probability
mass that proposition θ is supported by e_{i,l}(θ), as given below.

p(s_{i,l}(θ)) = p(θ|e_{i,l}(θ)) p(e_{i,l}(θ)) = r_{θ,i,l} p(e_{i,l}(θ)) (3.24)
The reliability of the evidential element e_{i,l}(θ) is denoted by r_{θ,i,l}, which is defined as
the conditional probability that proposition 𝜃 is true given that 𝑒𝑖,𝑙 supports 𝜃. This
definition measures the quality of 𝑒𝑖,𝑙. 𝑟𝜃,𝑖,𝑙 can be trained from data so that the
likelihood of the true state can be maximised. If 𝑒𝑖,𝑙 is profiled with the probability
distribution generated by Equation (3.20), based on the likelihood principle, its
support for proposition 𝜃, 𝑝𝑙 (𝑠𝑖,𝑙(𝜃)), must be proportional to 𝑝 (𝑠𝑖,𝑙(𝜃)) as stated
in Equation (3.25).
m_{θ,i,l} = p(s_{i,l}(θ)) = ω_{i,l} p_l(s_{i,l}(θ)) = ω_{i,l} p_l(θ|e_{i,l}(θ)) p_l(e_{i,l}(θ)) = w_{θ,i,l} p_l(e_{i,l}(θ)) (3.25)
where ω_{i,l} is a positive scaling constant, and w_{θ,i,l} = ω_{i,l} p_l(θ|e_{i,l}(θ)) is the
weight of an evidential element so that 𝑝 (𝑠𝑖,𝑙(𝜃)) and 𝑝𝑙 (𝑠𝑖,𝑙(𝜃)) are proportional
to each other when 𝑒𝑖,𝑙 is acquired from data for 𝑥𝑙 only. In the case where 𝑝 = 𝑝𝑙,
then 𝑤𝜃,𝑖,𝑙 = 𝑟𝜃,𝑖,𝑙 or 𝜔𝑖,𝑙 = 1. As with 𝑟𝜃,𝑖,𝑙 , 𝑤𝜃,𝑖,𝑙 can also be trained together with
other parameters to maximise the likelihood of the true state.
To determine the total degree of support for a proposition, the combination process
must be done at an elementary level and exhaustively accumulated at the end of
the process. Equation (3.26) shows the conjunctive MAKER rule generating the
degree of support for a proposition from two pieces of evidence, 𝑒𝑖,𝑙 and 𝑒𝑗,𝑚. This
process must be recursively done for all combinations before generating the total
degree of support for proposition 𝜃, denoted by 𝑝(𝜃) as presented in Equation
(3.27).
m_θ = [(1 − r_{j,m}) m_{θ,i,l} + (1 − r_{i,l}) m_{θ,j,m}] + ∑_{A∩B=θ} γ_{A,B,i,j} α_{A,B,i,j} m_{A,i,l} m_{B,j,m} (3.26)

p(θ) = 0 if θ = ∅, and p(θ) = m_θ / ∑_{C⊆Θ} m_C otherwise (3.27)
The MAKER algorithm relaxes the assumption of evidence independence in
the ER rule by statistically measuring the interdependence between each pair of
pieces of evidence, while keeping intact the core properties of the probabilistic
reasoning process in the ER rule. As it grounds the measurement through a
statistical test, some statistical rules of thumb must be satisfied, including the
minimum sample size requirement.
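To make the combination step concrete, the sketch below applies Equations (3.26) and (3.27) to two hypothetical pieces of evidence over Θ = {H1, H2}. It assumes both reliabilities equal 1 and sets the weights γ and interdependence indices α to 1 (fully reliable, independent evidence); these simplifying assumptions, the propositions, and the probability values are illustrative, not part of the thesis.

```python
# Conjunctive combination of two pieces of evidence (Eqs. 3.26-3.27),
# sketched under simplifying assumptions: reliabilities r = 1 and
# gamma = alpha = 1 (fully reliable, independent evidence).
# Propositions are frozensets of states; probabilities are illustrative.

def combine(m1, m2, r1=1.0, r2=1.0, gamma=1.0, alpha=1.0):
    """Combined support m_theta per Eq. (3.26), normalised per Eq. (3.27)."""
    m = {}
    # Bounded-sum part: support carried over when the other evidence is unreliable.
    for theta in set(m1) | set(m2):
        m[theta] = (1 - r2) * m1.get(theta, 0.0) + (1 - r1) * m2.get(theta, 0.0)
    # Conjunctive part: products of supports whose propositions intersect in theta.
    for a, ma in m1.items():
        for b, mb in m2.items():
            theta = a & b
            if theta:  # no mass is assigned to the empty set
                m[theta] = m.get(theta, 0.0) + gamma * alpha * ma * mb
    total = sum(m.values())
    return {theta: v / total for theta, v in m.items() if v > 0}

H1, TH = frozenset({"H1"}), frozenset({"H1", "H2"})
e1 = {H1: 0.7, TH: 0.3}
e2 = {H1: 0.6, TH: 0.4}
p = combine(e1, e2)
# p[H1] = 0.7*0.6 + 0.7*0.4 + 0.3*0.6 = 0.88 and p[Theta] = 0.12
```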
3.6. Machine Learning Methods
In this section, we briefly explain some machine learning methods, specifically for
classification, including logistic regression, k-nearest neighbour, classification tree,
naïve Bayes, support vector machines, and neural networks.
3.6.1. Logistic Regression
The goal of logistic regression (LR) analysis is, more or less, similar to the linear
regression model in terms of the general principle employed in the analysis
(Hosmer et al., 2013). The difference is that the outcome of the LR model is binary
or dichotomous, reflected in the form of the model and its assumptions. When used
for more than two classes, it is called multinomial logistic regression. The goal of
LR is to find the best fitting and most parsimonious model which interpretably
describes the relationship between a response (outcome or dependent variable)
and one or more independent variables (predictors, covariates, or explanatory
variables) by estimating the probabilities that reflect how closely the output belongs
to a response. In this model, let 𝑝𝑖 = 𝐸(𝑌|𝑥) be the conditional mean of 𝑌 given 𝑥,
where 𝑌 is the outcome and 𝑥 is the specific vector of predictors. 𝑝𝑖 is expressed
for the ith subject or case as seen in Equation (3.28).
p_i = exp[β_0 + β_1 x_{1,i} + ⋯ + β_j x_{j,i} + ⋯ + β_m x_{m,i}] / (1 + exp[β_0 + β_1 x_{1,i} + ⋯ + β_j x_{j,i} + ⋯ + β_m x_{m,i}]) (3.28)
where 𝑗 = 1,… ,𝑚, and 𝑚 is the number of predictors. The predictor itself must be
at least interval-scaled. Data transformation is required if categorical variables are
included in the predictor. The logit transformation is defined as follows:
logit(p_i) = ln[p_i / (1 − p_i)] = β_0 + β_1 x_{1,i} + ⋯ + β_j x_{j,i} + ⋯ + β_m x_{m,i} (3.29)
𝛽𝑗 is a coefficient or parameter representing the magnitude of change in the
outcome as a result of the unit change in 𝑥𝑗𝑖. These unknown parameters are
estimated through a learning process from a set of data based on maximum
likelihood. The next step in LR analysis is to assess the significance
of the coefficient of a variable in the model and to keep only the significant variables
in the model. This involves assessing whether the presence of the variable in the
model explains more about the variance of the outcome through a statistical test
for significance. However, in this research all variables are included in the model
to be evaluated.
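Equations (3.28) and (3.29) are straightforward to verify numerically. The sketch below computes p_i from a hypothetical coefficient vector and checks that the logit transformation recovers the linear predictor; the coefficients and function names are illustrative only.

```python
import math

# Logistic response (Eq. 3.28) and logit transformation (Eq. 3.29) for a
# single case. The coefficients below are illustrative, not fitted values.

def logistic_p(beta, x):
    """p_i = exp(b0 + b1*x1 + ... + bm*xm) / (1 + exp(...))."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """logit(p) = ln(p / (1 - p)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

beta = [-1.0, 2.0]              # beta_0 and one slope coefficient
p = logistic_p(beta, [0.5])     # z = -1 + 2*0.5 = 0, so p = 0.5
# logit(p) recovers the linear predictor z = 0
```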
3.6.2. Support Vector Machine (SVM)
A support vector machine (SVM) is a supervised machine learning algorithm for
solving problems in classification, regression, and novelty detection. SVMs have
become popular because the determination of parameters is based on a convex
optimisation problem which results in any local optimum equalling a global
optimum. An SVM for solving two-class classification problems using linear models
is discussed in this section (Bishop, 2006). The basic idea of this method is to
transform the input vector into a higher dimensional vector so that two classes can
be linearly separated by a higher dimensional surface – a so-called hyperplane.
Suppose that the training data set consists of 𝑁 input vectors denoted by 𝑥𝑛 (𝑛 =
1,… ,𝑁) with corresponding target values 𝑡𝑛, where 𝑡𝑛 ∈ {−1,1}, and new data
points 𝑥 are classified depending on the sign of 𝑦(𝑥), which is formulated using
linear models as depicted in Equation (3.30). An SVM classifier needs to satisfy
y(x_n) > 0 for points having t_n = 1 and y(x_n) < 0 for points having t_n = −1, so that
t_n y(x_n) > 0 holds for all training data points.
𝑦(𝑥) = 𝑤𝑇𝜙(𝑥) + 𝑏 (3.30)
where ϕ(x) denotes a fixed feature-space transformation, w is the normal vector
to the learned hyperplane, and b is a bias parameter.
An SVM determines the optimal hyperplane based on the concept of a margin,
which is defined as the perpendicular distance between the hyperplane and the
closest data points. y(x) = 0 defines the hyperplane that discriminates between the
two classes, with each data point assigned to the class t_n = 1 or t_n = −1 according
to the sign of y(x). The location of the decision boundary or hyperplane is determined by a
subset of the data points known as support vectors. The optimal hyperplane is
found by maximising the margin.
In practice, however, the class-conditional distributions may overlap, resulting in
poor generalisation when an exact separation is made. Therefore, an SVM is
modified such that some of the training points can be misclassified, but with a
penalty whose value is a linear function of the distance from the boundary, denoted
by ξ_n (n = 1, …, N), known as a slack variable. ξ_n = 0 for a data point on or inside
the correct margin boundary, ξ_n = |t_n − y(x_n)| for other points, and ξ_n = 1 for a
data point exactly on the decision boundary. Hence, points with 0 < ξ_n ≤ 1 lie inside
the margin but on the correct side of the boundary, while points with ξ_n > 1 are
misclassified. This technique is described as relaxing a hard margin constraint to
be a soft margin. Note that the penalty increases with ξ_n, so the framework remains
sensitive to outliers. Hence, a parameter C > 0 is introduced as a regularisation
coefficient to control the trade-off between the slack variable penalty and the
margin – more precisely, the trade-off between training errors and model
complexity. The optimal
parameters for w and b can be found by solving the quadratic programming
problem in Equation (3.31), in which the objective is to minimise a quadratic
function subject to a set of linear inequality constraints.

min_{w,b,ξ_n} (1/2) wᵀw + C ∑_{n=1}^{N} ξ_n
s.t. t_n(wᵀϕ(x_n) + b) ≥ 1 − ξ_n and ξ_n ≥ 0, n = 1, …, N (3.31)

To solve the problem, Lagrange multipliers α_n ≥ 0 are introduced, with one
multiplier for each constraint. The dual representation of the maximum margin
problem can be derived as displayed in Equation (3.32).

max_α ∑_{n=1}^{N} α_n − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} α_n α_m t_n t_m k(x_n, x_m)
s.t. ∑_{n=1}^{N} α_n t_n = 0 and 0 ≤ α_n ≤ C, n = 1, …, N (3.32)

The kernel function is defined by k(x, x′) = ϕ(x)ᵀϕ(x′), which maps the input
vectors into a suitable feature space. Several kernel types are used in SVMs,
such as linear, polynomial, and radial basis function kernels. Once the problem
defined in Equation (3.32) has been solved, the following formula can be used to
classify new data points.

y(x) = wᵀϕ(x) + b = ∑_{n=1}^{N} α_n t_n k(x, x_n) + b = ∑_{n∈sv} α_n t_n k(x, x_n) + b (3.33)
Data points with 𝛼𝑛 = 0 do not contribute to defining the predictive model in
Equation (3.33), and the remaining data points, known as support vectors (𝑠𝑣) with
𝛼𝑛 > 0, define the decision function.
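Once the α_n are known, applying Equation (3.33) only requires the support vectors. The sketch below evaluates the decision function with a linear kernel for hand-picked, hypothetical support vectors, multipliers, and bias; it is not a full SVM trainer.

```python
# SVM decision function (Eq. 3.33): y(x) = sum over support vectors of
# alpha_n * t_n * k(x, x_n) + b. The support vectors, multipliers, and
# bias below are hand-picked for illustration, not learned from data.

def linear_kernel(x, x_n):
    return sum(a * b for a, b in zip(x, x_n))

def decision(x, support_vectors, kernel, b=0.0):
    """support_vectors: list of (x_n, t_n, alpha_n) triples with alpha_n > 0."""
    return sum(alpha * t * kernel(x, x_n) for x_n, t, alpha in support_vectors) + b

sv = [((2.0,), 1, 0.5),     # positive-class support vector
      ((-2.0,), -1, 0.5)]   # negative-class support vector

# New points are classified by the sign of y(x):
y_pos = decision((3.0,), sv, linear_kernel)   # > 0, so class +1
y_neg = decision((-1.0,), sv, linear_kernel)  # < 0, so class -1
```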
3.6.3. Neural Networks (NN)
A neural network (NN) is a computational graph with nodes as computing units and
directed edges as transmission units which pass the numerical information from
node to node (Bishop, 2006; Haykin, 1999). One of the most famous structures for
an NN, the feed-forward neural network, also known as the multilayer perceptron
(MLP), is discussed in this chapter. The MLP consists of multiple layers of neurons,
which are input layers directly connected to external data, one or more hidden
layers, and an output layer. The structure in Figure 3.1 is described as a single-
hidden-layer network, a typical MLP with one hidden layer, where each layer is
fully connected to the next layer. An MLP is a series of function transformations
which generate the predicted output in the case of either classification or
regression from external data through an activation or transfer function. An MLP is
a nonlinear function of a linear combination of the inputs with adaptive coefficients
or parameters.
Figure 3.1 A single-hidden-layer neural network (Bishop, 2006)
Suppose that we have D input variables with M neurons in one hidden layer to
predict K classes in the case of classification. First, 𝑀 linear combinations of the
input variables are constructed as follows
a_j = ∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} (3.34)

z_j = h(a_j) (3.35)
where 𝑗 = 1,… ,𝑀; 𝑖 = 1,… , 𝐷.
𝑎𝑗 values are known as activations. These values are then transformed through a
nonlinear activation function, ℎ(. ), which is generally chosen to be a sigmoid
function – for example, tanh. The superscript (1) indicates that the corresponding
weights, denoted by w_{ji}, and biases, denoted by w_{j0}, are in the first layer of the
network. The values resulting from Equation (3.35), known as hidden units, are
again linearly combined to give the output unit activation as presented in Equation
(3.36).
a_k = ∑_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} (3.36)
where 𝑘 = 1,… , 𝐾 and 𝐾 is the number of outputs.
This process corresponds to the second layer of the network shown by the
superscript (2). The final step is to transform the output unit activation using an
activation function, 𝑓(. ), which depends on the type of the outputs: for example, a
logistic sigmoid function for a binary case as shown in Equation (3.38) and a
softmax activation function for a multiclass problem as shown in Equation (3.39).
The overall neural network for all stages can be formulated in Equation (3.37).
y_k(x, w) = f(∑_{j=1}^{M} w_{kj}^{(2)} h(∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}) + w_{k0}^{(2)}) (3.37)

y_k = σ(a) = 1 / (1 + exp(−a)) (3.38)

y_k = exp(a_k) / ∑_{B=1}^{K} exp(a_B) (3.39)
For network training, different algorithms have been applied to find the optimal
vector of 𝑤. The most popular algorithm is backpropagation. It minimises the error
function in weight space through the method of gradient descent. Given a training
set of N samples, the network is trained by minimising the sum-of-squares error function
between the output vector generated by the network (𝑦𝑛) and the target denoted
by 𝑡𝑛 using Equations (3.40)-(3.41).
E(w) = (1/2) ∑_{n=1}^{N} (y_n − t_n)² (3.40)

E = (1/2) ∑_{k} (y_k − t_k)² (3.41)
A solution of the learning problem is the combination of weights which minimises
the error function. Every 𝑘-th component of the output vector is evaluated by
Equation (3.41), where 𝑦𝑘 and 𝑡𝑘 denote the 𝑘-th component of the output vector
𝑦𝑛 and of the target 𝑡𝑛, respectively. Those values are then accumulated to give
the sum 𝐸. The nonlinearity of the network function causes the error function 𝐸(𝑤)
to be nonconvex, and hence the solution might be a local minimum of the error
function. At first, initial weights are randomly chosen, and then the gradient of the
error function is computed and used to correct the initial weights as displayed in
Equation (3.42).
∇E = (∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_ℓ) (3.42)

Δw_p = −η ∂E/∂w_p (3.43)
where p = 1, 2, …, ℓ, and η is the so-called learning constant, which defines the step
length of each iteration in the direction of the negative gradient. The gradient is
then recursively computed until the local minimum is found.
The following steps are designed for a one-hidden-layer network. The weights
between the hidden layer and the output layer are updated by Equation (3.44).
Similarly, the weight updates between the neurons in the input and hidden layer
are made as displayed in Equation (3.45).
w_{kj}(t + 1) = w_{kj}(t) + Δw_{kj}(t), where

Δw_{kj}(t) = −η ∂E/∂w_{kj} = −η (∂E/∂y_k)(∂y_k/∂a_k)(∂a_k/∂w_{kj}) = −η (y_k − t_k) f′(a_k) z_j = −η δ_{y_k} z_j (3.44)
where 𝛿𝑦𝑘 is referred to as the error signal of the neuron 𝑘 in the output layer.
Δw_{ji}(t) = −η ∂E/∂w_{ji} = −η ∑_{k} (∂E/∂y_k)(∂y_k/∂a_k)(∂a_k/∂z_j)(∂z_j/∂a_j)(∂a_j/∂w_{ji}) = −η ∑_{k} (y_k − t_k) f′(a_k) w_{kj} h′(a_j) x_i = −η δ_{z_j} x_i (3.45)
where δ_{z_j} = h′(a_j) ∑_{k} δ_{y_k} w_{kj} corresponds to the error signal of neuron j in the
hidden layer. The steps above are repeated until convergence is reached (i.e.
when the error is below the pre-set value). The optimal weight vector is considered
to be the solution.
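The weight updates in Equations (3.43)–(3.45) can be checked on a tiny network. The sketch below takes one gradient step for a single training case with a single output and verifies that the error of Equation (3.41) decreases; the network size, weights, target, and learning rate are hypothetical, and for simplicity both layers use the logistic sigmoid, so h′(a) = f′(a) = f(a)(1 − f(a)).

```python
import math

# One backpropagation step (Eqs. 3.43-3.45) on a tiny 1-2-1 network. Both
# layers use the logistic sigmoid; all numeric values are illustrative.

def sig(a):
    return 1.0 / (1.0 + math.exp(-a))

def step(x, t, w1, b1, w2, b2, eta=0.5):
    # Forward pass (Eqs. 3.34-3.38): one input, two hidden units, one output.
    a1 = [w1[j] * x + b1[j] for j in range(2)]
    z = [sig(a) for a in a1]
    a2 = sum(w2[j] * z[j] for j in range(2)) + b2
    y = sig(a2)
    # Error signals: delta_y = (y - t) f'(a2); delta_z_j = h'(a1_j) delta_y w2_j.
    dy = (y - t) * y * (1.0 - y)
    dz = [z[j] * (1.0 - z[j]) * dy * w2[j] for j in range(2)]
    # Gradient-descent updates, w <- w - eta * dE/dw (Eqs. 3.43-3.45).
    w1n = [w1[j] - eta * dz[j] * x for j in range(2)]
    b1n = [b1[j] - eta * dz[j] for j in range(2)]
    w2n = [w2[j] - eta * dy * z[j] for j in range(2)]
    b2n = b2 - eta * dy
    error = 0.5 * (y - t) ** 2          # Eq. (3.41) for a single output
    return w1n, b1n, w2n, b2n, error

params = ([0.1, -0.2], [0.0, 0.0], [0.3, 0.3], 0.0)
*updated, e_before = step(1.0, 1.0, *params)
e_after = step(1.0, 1.0, *updated)[-1]
# e_after < e_before: one gradient step reduces the error
```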
3.6.4. Classification Tree
The classification and regression trees (CART) algorithm is one of the popular
algorithms for tree induction. It is able to perform under nonlinear relationships
between features and outcome and also where features interact with each other
(Molnar, 2019). Equation (3.46) explains the relationship between the features (𝑥)
and the outcome (𝑦).
ŷ = f(x) = ∑_{m=1}^{M} c_m I{x ∈ R_m} (3.46)

Each sample x must belong to exactly one leaf node, denoted by R_m. The
indicator function I{x ∈ R_m} equals 1 if a sample is in the subset R_m and 0
otherwise. If a sample belongs to a leaf node R_l, the predicted outcome ŷ equals
c_l, where c_l is the average of all training samples in the leaf node R_l.
The subsets are obtained by recursively partitioning the input space. At first, a
feature which will result in the best partitions in terms of the Gini index, which
indicates the impurity of a node, is selected to become a decision node. Then the
algorithm searches for the best cut-off point of the selected feature that minimizes
the Gini index of the class distribution of the outcome. If all classes in a split
subset occur with the same frequency, the node is maximally impure, whereas a
subset containing a single class is pure. This process is repeated recursively until
the stop criterion is met.
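The Gini impurity used by CART to choose splits can be sketched as follows; the labels and cut-off are toy values, and a full tree grower would repeat this search recursively over nodes and features.

```python
# Gini impurity of a node and the weighted impurity of a binary split,
# as used by CART to select decision nodes. Labels below are toy values.

def gini(labels):
    """1 minus the sum of squared class proportions; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(values, labels, cut):
    """Weighted Gini impurity after splitting a single feature at `cut`."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = [0, 0, 1, 1]
g = gini(labels)                                   # equal frequencies: 0.5
s = split_gini([1.0, 2.0, 8.0, 9.0], labels, 5.0)  # perfect cut: impurity 0
```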
3.6.5. Naïve Bayes
Naïve Bayes utilises the Bayes’ theorem of conditional probabilities for
classification as presented in Equation (3.47). With a strong (naïve) assumption of
independence between features, this method calculates the probability of a sample
belonging to a class based on the value of each feature. The class probability is
estimated for each feature independently.
P(C_k|x) = (1/Z) P(C_k) ∏_{i=1}^{n} P(x_i|C_k) (3.47)
where Z is a scaling factor that makes the probabilities over all classes sum to 1,
and n is the number of features in the dataset.
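Equation (3.47) amounts to multiplying a prior by per-feature likelihoods and renormalising. The priors, likelihood tables, class labels, and feature values below are invented for illustration.

```python
# Naive Bayes posterior (Eq. 3.47): prior times the product of per-feature
# likelihoods, renormalised by Z. All probabilities below are invented.

def _prod(factors):
    out = 1.0
    for f in factors:
        out *= f
    return out

def posterior(priors, likelihoods, x):
    """priors: {class: P(C_k)}; likelihoods: {class: [P(x_i|C_k) lookup per feature]}."""
    unnorm = {
        c: priors[c] * _prod(likelihoods[c][i][v] for i, v in enumerate(x))
        for c in priors
    }
    z = sum(unnorm.values())  # Z makes the class probabilities sum to 1
    return {c: u / z for c, u in unnorm.items()}

priors = {"buy": 0.5, "skip": 0.5}
likelihoods = {
    "buy":  [{"low": 0.2, "high": 0.8}],   # one categorical feature
    "skip": [{"low": 0.6, "high": 0.4}],
}
p = posterior(priors, likelihoods, ["high"])
# p["buy"] = 0.5*0.8 / (0.5*0.8 + 0.5*0.4) = 2/3
```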
3.7. Sequential Least Squares Programming
(SLSQP)
In this study, we propose a hierarchical rule-based inferential modelling and
prediction approach based on the MAKER framework. We train the model
parameters, including weights (reliabilities), referential values, and belief degrees
of consequents, while keeping the sample size per combination of referential
values of different input variables at least equal to the minimum statistical
requirement. These parameters are optimised by minimising the mean squared
error. The objective is thus to minimise a quadratic function subject to a set of
inequality constraints – the minimum sample size requirement per combination of
referential values of different input variables – and equality constraints – the total
degrees of belief for the consequents of each belief rule must sum to 1. The
sequential (least squares) quadratic programming algorithm can deal with this
kind of optimisation problem.
The sequential (least squares) quadratic programming (SLSQP) algorithm is one
of the more popular, robust, and efficient computational methods for nonlinear
optimisation problems (Kraft, 1988; Boggs and Tolle, 1995). It is designed as a
nonlinearly constrained, gradient-based optimisation with equality and inequality
constraints (Kraft, 1988). Sequential (least squares) quadratic programming is a
powerful tool in data analytics software, with well-established implementations on
many platforms – for example, SLSQP in SciPy, fmincon in Matlab, SNOPT, and
filterSQP. Given these advantages, this method is applied in this study. We utilise
SLSQP through the minimize function in SciPy to find the optimised model
parameters – that is, the weights, referential values, and degrees of belief for
consequents of each belief rule.
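As a minimal illustration of the SciPy interface used here (not the thesis's actual objective or constraints), the sketch below minimises a small quadratic with SLSQP. Note that SciPy expresses 'ineq' constraints as fun(x) ≥ 0, the opposite sign convention to c(x) ≤ 0 in Equation (3.48).

```python
from scipy.optimize import minimize

# Minimal SLSQP example (not the thesis's model): minimise
# (x0 - 1)^2 + (x1 - 2.5)^2 subject to linear inequality constraints and
# nonnegativity bounds. SciPy's 'ineq' convention means fun(x) >= 0.

fun = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2
cons = (
    {"type": "ineq", "fun": lambda x:  x[0] - 2 * x[1] + 2},
    {"type": "ineq", "fun": lambda x: -x[0] - 2 * x[1] + 6},
    {"type": "ineq", "fun": lambda x: -x[0] + 2 * x[1] + 2},
)
res = minimize(fun, x0=[2.0, 0.0], method="SLSQP",
               bounds=[(0, None), (0, None)], constraints=cons)
# res.x is approximately (1.4, 1.7)
```

Equality constraints, such as requiring the degrees of belief in a rule to sum to 1, would be added with a `{"type": "eq", ...}` entry in the same constraints tuple.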
In general, the nonlinear optimisation problem with equality and inequality
constraints can be defined in Equation (3.48). The SLSQP algorithm sequentially
approximates this original problem. It is solved iteratively with an initial vector of
parameters denoted by x_0, while x_k indicates the vector of parameters at the kth
iteration. x_{k+1} can be obtained using Equation (3.51).
min_x f(x)
s.t. b(x) = 0, c(x) ≤ 0 (3.48)

L(x, λ, μ) = f(x) + λᵀb(x) + μᵀc(x) (3.49)

where λ and μ are the vectors of multipliers for the equality and inequality
constraints, respectively.

min_d ∇f(x_k)ᵀd + (1/2) dᵀH(x_k)d
s.t. b_i(x_k) + ∇b_i(x_k)ᵀd = 0, i = 1, 2, …, m
c_j(x_k) + ∇c_j(x_k)ᵀd ≤ 0, j = 1, 2, …, n (3.50)

x_{k+1} = x_k + α_k d_k, α_k ∈ (0, 1] (3.51)

where d_k indicates the search direction within the kth step and α_k is the step
length.
The search direction (𝑑𝑘) is obtained from information generated by solving a
quadratic programming subproblem which is formulated by a quadratic
approximation of the Lagrange function of nonlinear programming, subject to linear
approximations of the constraints (Kraft, 1988). The Lagrangian function of this
problem can be seen in Equation (3.49). The quadratic subproblem is defined in
Equation (3.50). The vectors d, Δλ, and Δμ are defined as d = x − x_k, Δλ = λ − λ_k, and
Δμ = μ − μ_k, respectively. The solution of this subproblem provides a search direction for x.
The quadratic subproblem reflects the local properties of the original problem: it is
relatively easy to solve, and its objective represents the nonlinearities of the
original problem (Boggs and Tolle, 1995). If the quadratic subproblem is
appropriately chosen, this method can be seen as an extension of Newton and
quasi-Newton methods. Thus, the SLSQP method is expected to share the
characteristic of Newton-like methods, namely rapid convergence with an optimal
step length of α_k = 1 when the iterates are close to the solution. Far from the
solution, however, the iterates can behave erratically; the step length α_k is
therefore modified so that x_{k+1} becomes a better approximation to the optimum
solution (Boggs and Tolle, 1995).
The general flow chart of SLSQP can be seen in Figure 3.2. δ and K are predefined
parameters: the solution is considered to have converged when the number of
iterations reaches K or the norm of the vector d falls below δ. In this study, δ and
K are set at .0001 and 200, respectively.
3.8. Evaluation Metrics
The evaluation metric discussed below is designed for classification problems.
Classification can be divided based on data types into binary, multiclass, and multi-
labelled classification. The metrics are categorised into three types, including
threshold, probability, and ranking metrics (Hossin and Sulaiman, 2015). All these
types produce a single value, which makes evaluation easier, although in some
cases subtle details of a classifier's performance cannot be explicitly captured.
For example, in a case of very imbalanced data, relying solely on accuracy can be
misleading.
3.8.1. Threshold Metrics
Before the evaluation metrics are explained further, the confusion matrix as a base
for many common classification metrics is explained in Figure 3.3 (Awad and
Khanna, 2015). Many classifiers yield a probabilistic output showing the degree to
which an instance is a member of a class. A decision threshold converts this
probabilistic output into a discrete classifier. Suppose that the threshold is set at
0.5: any instance with a probabilistic output above 0.5 is predicted as a positive
instance; otherwise it is predicted as a negative instance.
[Flow chart: start with an initial vector of parameters x_0 and set the iteration
counter k = 0; at the kth iteration, evaluate f(x_k), b(x_k), and c(x_k) and update
the Lagrange function H; solve the quadratic subproblem to determine d_k; if
‖d_k‖ < δ or k ≥ K, stop; otherwise modify α_k so that x_{k+1} is closer to the
solution, calculate x_{k+1}, update k = k + 1, and repeat.]
Figure 3.2. The procedure of sequential (least squares) quadratic programming method
                     Actual class: Positive    Actual class: Negative
Predicted positive:  True positives (tp)       False positives (fp)
Predicted negative:  False negatives (fn)      True negatives (tn)
Column totals:       Total positives (P)       Total negatives (N)
Figure 3.3. Confusion matrix of binary problem
From this confusion matrix, tp and tn are the numbers of correctly classified
positive and negative instances, respectively, while fp and fn denote the numbers
of misclassified negative and positive instances, respectively. From this matrix,
some evaluation metrics are generated, as listed in Table 3.2. Accuracy is the
most-used metric, since it is easy to compute; applicable for binary, multiclass, or
multi-label problems; and easy to interpret (Hossin and Sulaiman, 2015).
However, it is less distinctive and provides less discriminable values. This limitation
becomes misleading in the case of very imbalanced data: the accuracy value may
seem acceptable even if none of the minority class instances are correctly
predicted by a trained classifier.
Table 3.2. Threshold metrics

Metric                                              Equation
Accuracy                                            (tp + tn) / (P + N)
Misclassification rate                              (fp + fn) / (P + N)
True positive rate (tp rate) / Recall (r) /
  Sensitivity                                       tp / P
False positive rate (fp rate)                       fp / N
Specificity                                         tn / (tn + fp) = 1 − fp rate
Precision (p)                                       tp / (tp + fp)
F-measure                                           2 / (1/p + 1/r) = 2 × (p × r) / (p + r)
F-beta score                                        (1 + β²) × (p × r) / ((β² × p) + r)
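The metrics in Table 3.2 follow directly from the confusion-matrix counts; the counts in the sketch below are arbitrary illustrative values.

```python
# Threshold metrics computed from confusion-matrix counts (Table 3.2).
# The counts below are arbitrary illustrative values, not real results.

def threshold_metrics(tp, fp, fn, tn):
    p_total, n_total = tp + fn, fp + tn          # column totals P and N
    precision = tp / (tp + fp)
    recall = tp / p_total                        # tp rate / sensitivity
    return {
        "accuracy": (tp + tn) / (p_total + n_total),
        "fp_rate": fp / n_total,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

m = threshold_metrics(tp=8, fp=2, fn=1, tn=9)
# accuracy = 17/20 = 0.85, precision = 0.8, recall = 8/9, F-measure = 16/19
```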
3.8.2. Probability Metrics
Mean squared error (MSE) is an example of a probability metric (Hossin and
Sulaiman, 2015). It measures the gap between the predicted values and the actual
values, denoted by 𝑃𝑛 and 𝐴𝑛, respectively. It is defined for 𝑁 samples as depicted
in Equation (3.52).
MSE = (1/N) ∑_{n=1}^{N} (P_n − A_n)² (3.52)
3.8.3. Ranking Metrics
A receiver operating characteristics (ROC) curve is proposed for performance
visualisation and model selection (Hossin and Sulaiman, 2015). A ROC curve is a
two-dimensional graph where tp rate (or sensitivity) on the y axis is plotted against
fp rate (1 - specificity) on the x axis for different cut-off points. See Table 3.2 to
recall the definitions of sensitivity and specificity. The trade-off between benefits
(true positives) and costs (false positives) can be seen in the ROC curve, because
any increase in tp rate (sensitivity) generally comes with an increase in fp rate
(1 − specificity). A perfect classifier gives 100% for both sensitivity and specificity, which
is point (0,1), meaning that the closer the plot to the upper left corner, the better
the classifier. The 45-degree diagonal line depicted in Figure 3.4 shows a random
classifier. Point (0,0) shows that a classifier never issues a positive class; as such
the classifier fails to predict positive classes, which results in zero values for both
false positive errors and true positives. Meanwhile, point (1,1) unconditionally
issues positive classes. Any point at the left-hand side of a ROC curve near the x
axis is considered ‘conservative’, a situation in which a classifier issues positive
classes only with strong evidence, making few false positive errors and having a
low tp rate as well (Fawcett, 2006). Classifiers located at the right-hand side of a
ROC curve can be considered ‘liberal’, a situation in which the classifier utilises
weak evidence and issues nearly all positives correctly but often produces a high
fp rate.
[Plot: true positive rate on the y-axis against false positive rate on the x-axis,
each ranging from 0 to 1.]
Figure 3.4. ROC curve
The area under the curve measures the accuracy of an algorithm and is
abbreviated as AUC. AUC is one of the most popular ranking metrics and has been
proven to provide a better representation of an algorithm’s performance than does
accuracy. Its values range between 0 and 1. The advantage of using AUC is its
ability to reflect the overall ranking performance of a classifier with a single scalar
value. For binary problems, AUC can be defined as depicted in Equation (3.53)
(Hossin and Sulaiman, 2015).
AUC = (S_p − n_p(n_p + 1)/2) / (n_p n_n) (3.53)

where S_p is the sum of the ranks of the positive instances, and n_p and n_n are the
numbers of positive and negative instances, respectively.
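Equation (3.53) can be computed from the ranks of the positive instances. The sketch below assumes no tied scores (ties would require average ranks); the scores and labels are toy values.

```python
# Rank-based AUC (Eq. 3.53). Instances are ranked in ascending order of
# score; S_p is the sum of the positives' ranks. Assumes no tied scores.

def auc(scores, labels):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = {idx: rank for rank, idx in enumerate(order, start=1)}
    n_p = sum(labels)
    n_n = len(labels) - n_p
    s_p = sum(ranks[i] for i, y in enumerate(labels) if y == 1)
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)

# A perfect ranking gives AUC = 1; swapping one pair out of four gives 0.75.
perfect = auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])   # 1.0
swapped = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])   # 0.75
```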
Table 3.3 lists the rules of thumb for AUCROC according to Hosmer et al. (2013).
According to the table, AUCROC scores range from .5 to 1. The nearer the score
is to 1, the better the classifier at discriminating the outcome groups. An AUCROC
score of .5 shows that a classifier fails to discriminate between the outcome
groups, since this corresponds to chance or a random classifier.
Table 3.3. Rules of thumb for AUC
Area Point system
.5 - .6 Fail
.6 - .7 Poor
.7 - .8 Fair
.8 - .9 Good
.9 – 1 Excellent
Another alternative performance metric under a large skew class distribution is the
precision-recall curve (Davis and Goadrich, 2006), or PR curve. The curve plots
recall on the x-axis and precision on y-axis. Recall is the same as the true positive
rate used in the ROC curve. The fraction of observations that are positive and are
classified as positive is called precision, as seen in Table 3.2. For the ROC curve,
the closer the line is to the upper-left corner, the better the model performance;
for the PR curve, the closer the line is to the upper-right corner, the better the
model performance. Davis and Goadrich (2006) explained the differing visual
representations of ROC and PR curves and highlighted that, under a large skew
in the class distribution, the performance difference among classifiers can be
identified more clearly with the PR curve than with the ROC curve. In their
example of a cancer detection dataset, all classifiers seemed close to optimal
based on the ROC curve; the PR curve, however, indicated that there was still
vast room for improvement. Similarly to the ROC curve, the area under the PR
curve (AUCPR) can be estimated using a composite trapezoidal method.
3.9. Summary
In this chapter, we presented the research methods used in Chapters 4, 5, and 6.
First, we discussed how the data used in Chapter 4 were generated; explained
the data collection, including the source and components of the database used in
Chapters 5 and 6; and provided a brief justification for the database chosen to
obtain the desired dataset. Further explanation of how the
desired dataset was obtained is given in Chapters 5 and 6. Second, we briefly
explained the original ER rule as a foundation for the MAKER framework, the
application of the ER algorithm in the BRB system, ideas and rationales useful for
the development of a hierarchical rule-based inferential modelling and prediction
based on the MAKER framework in Chapter 4. Third, we illustrated the sequential
(least squares) quadratic programming used in the classification of customer types
and decisions in Chapters 4, 5 and 6. Fourth, we briefly introduced some popular
machine learning methods and their algorithms for classification. Finally, we also
reviewed some performance evaluation metrics as a foundation for selecting
metrics for model comparisons in Chapters 4, 5 and 6.
Chapter 4 A Hierarchical Rule-based
Inferential Modelling and Prediction
4.1. Introduction
This chapter thoroughly explains the classifiers based on the MAKER framework
established by hierarchical rule-based modelling and prediction – namely, the
MAKER-ER-based and MAKER-BRB-based models for dealing with sparse
matrices and complex numerical data. It starts with an introduction to the MAKER
framework with referential values for data discretisation in Section 4.2. Section 4.3
explains the MAKER algorithm with a referential value-based discretisation
technique for data transformation. Section 4.4 explains the concept of the belief
rule base. Section 4.5 explores the hierarchical rule-based inferential modelling
and prediction approach, investigates the methods for grouping evidence, and
explains the process for final inference. Section 4.6 explains how the model
parameters can be learned from data. The proposed models are then compared
analytically and graphically with other machine learning methods in Section 4.7. A
summary of the chapter is provided in Section 4.8.
4.2. Introduction to MAKER Framework
The maximum likelihood evidential reasoning (MAKER) framework was introduced
by Yang and Xu (2017). It is a data-driven inference process to predict the outputs
of a system from its input under uncertainty. Yang and Xu (2017) emphasize four
unique features of this approach: 1) its establishment with unknown prior
probability as a default, 2) its explicit measurement of ambiguity in data, 3) its
explicit measurement of the quality of data (known as evidence reliability), and 4)
its ability to take into account statistically measured dependencies between pieces
of evidence. According to Yang and Xu (2017), the MAKER framework defines two
types of models – state space models (SSMs) and evidence space models (ESMs)
– and a conjunctive MAKER rule. The basic concepts and steps in the MAKER
framework are presented in the following.
An SSM describes a system whose states change with different inputs. It consists
of a finite number of states, which makes Dempster's original thinking on state
space the foundation of SSMs (Dempster, 2008). Following Yang and Xu (2017),
suppose that H_n is a system state. The system has at least N disjoint states which
do not overlap each other, and hence the SSM can be denoted by Θ =
{H_1, …, H_n, …, H_N}, with H_i ∩ H_j = ∅ for any i ≠ j. We can assign probability to a subset of system states.
Let 𝑃(Θ) or 2Θ be the power set of Θ which contains the empty set ∅ and the full
state space Θ. According to Yang and Xu (2017), an output of the system is
modelled by a unique set function, which is referred to as a basic probability
function that is defined as an ordinary discrete probability distribution function. No
probability is assigned to the empty set. The basic probability function is presented
in Definition 4.1.
Definition 4.1 (Basic probability function)
A basic probability function is defined as 𝑝: 2Θ → [0,1] if conditions (4.1)–(4.3) are
satisfied. 𝜃 is a subset of states, which is known as an assertion. 𝑝(𝜃) is the
probability when proposition 𝜃 is true. It is assigned exactly to 𝜃 and cannot be
decomposed into pieces assigned to subsets of 𝜃 (Yang and Xu, 2017).

0 ≤ 𝑝(𝜃) ≤ 1, ∀𝜃 ⊆ Θ (4.1)

∑_{𝜃⊆Θ} 𝑝(𝜃) = 1 (4.2)

𝑝(∅) = 0 (4.3)
Definition 4.2 (System output)
A system output 𝑦 is defined as a probability distribution as shown in Equation
(4.4). 𝑝(𝜃) can be obtained using Equations (4.16) and (4.17). If 𝑝(𝜃) > 0, 𝜃 is
referred to as a focal element of 𝑦. Yang and Xu (2017) stated that, with the
foundation from Dempster (2008), an assertion can be profiled by three
probabilities with 𝑝𝑡 + 𝑝𝑓 + 𝑝𝑢 = 1, where 𝑝𝑡, 𝑝𝑓, and 𝑝𝑢 are the probabilities
representing 'true', 'false', and 'unknown', termed the triad of an assertion.
Therefore, this framework allows the inference process to be conducted with
ambiguous information or unknown data (Yang and Xu, 2017).

𝑦 = {(𝜃, 𝑝(𝜃)), ∀𝜃 ⊆ Θ, ∑_{𝜃⊆Θ} 𝑝(𝜃) = 1} (4.4)
In an ESM, an evidence space is a space derived from data. Each piece of
evidence in the evidence space is acquired from data. Each piece of evidence can
be partitioned into evidential elements each of which points to exactly one
assertion in the state space or an element in the power set of the states.
Evidence acquisition from data is developed based on the likelihood principle and
the Bayesian principle. According to Rohde (2014), evidence derived from
observations that have proportional likelihoods should be the same, which is
known as the likelihood principle. The likelihood principle essentially holds that the
likelihood function, or likelihood in short, is the sole basis for inference. A likelihood
function denoted by 𝑓(𝑥; 𝜃) arises from a probability density function of 𝑥, which is
a function of the unknown parameter 𝜃 (Rohde, 2014). Meanwhile, the Bayesian
principle indicates that combining the evidence with the prior distribution of the
states should lead to the posterior probability (Yang and Xu, 2017).
Based on the data acquisition as described above, we can build a one-dimensional
ESM for each input variable (Yang and Xu, 2017). Suppose that 𝑒𝑖,𝑙(𝜃) is an
element of the ith piece of evidence from input variable 𝑥𝑙 which points exactly to
proposition 𝜃. The evidential element of 𝑒𝑖,𝑙 represents the evidence subspace for
the ith value of 𝑥𝑙. 𝑝𝜃,𝑖,𝑙 is the basic probability that the evidence element 𝑒𝑖,𝑙 points
exactly to assertion 𝜃, presented by 𝑝𝜃,𝑖,𝑙 = 𝑝𝑙 (𝑒𝑖,𝑙 (𝜃)). Let 𝑐𝜃,𝑖,𝑙 be the likelihood
of the ith value of 𝑥𝑙 given proposition 𝜃. The basic probability 𝑝𝜃,𝑖,𝑙 is a normalised
likelihood as stated in Equation (4.5). Given the basic probability 𝑝𝜃,𝑖,𝑙 that is
acquired from input 𝑥𝑙 for each assertion, we can then define the system input from
evidence 𝑒𝑖,𝑙, as explained in Definition 4.3 (Yang and Xu, 2017).

𝑝𝜃,𝑖,𝑙 = 𝑐𝜃,𝑖,𝑙 / ∑_{𝐴⊆Θ} 𝑐𝐴,𝑖,𝑙 (4.5)
Definition 4.3 (System input)
A basic probability distribution can be assigned to 𝑒𝑖,𝑙 as presented in Equation
(4.6), forming a system input (Yang and Xu, 2017).

𝑒𝑖,𝑙 = {(𝑒𝑖,𝑙(𝜃), 𝑝𝜃,𝑖,𝑙), ∀𝜃 ⊆ Θ, ∑_{𝜃⊆Θ} 𝑝𝜃,𝑖,𝑙 = 1} (4.6)
where 𝑝𝜃,𝑖,𝑙 is acquired from input variable 𝑥𝑙 by using Equation (4.5). According to
Yang and Xu (2017), evidential elements 𝑒𝑖,𝑙(𝐻𝑛) for all 𝐻𝑛 ∈ Θ represent the
evidence subspace for the 𝑖th value of 𝑥𝑙. If 𝑥𝑙 is discrete, the evidence subspace
can be denoted by 𝐸𝑙 = {𝑒1,𝑙, 𝑒2,𝑙, … , 𝑒𝑖,𝑙, … }, leading to a discrete ESM.
As previously stated, one of the unique features of the MAKER framework is that
it considers the interrelationship between a pair of evidence in the model.
According to Yang and Xu (2017), the interdependence is measured by the statistical
interdependence between the pair of evidence. According to the likelihood
principle and the Bayesian principle, joint basic probability can be obtained from a
joint likelihood function, which is discussed later. The following section describes
the MAKER algorithm with referential values as a discretization technique for
numerical inputs.
4.3. MAKER Algorithm with Referential Values
4.3.1. Evidence Acquisition
Suppose that we have a data set of N instances consisting of M input variables
and an output variable with K classes. The input 𝑥𝑛 = {𝑥𝑛,𝑙 | 𝑛 = 1, … , 𝑁; 𝑙 = 1, … , 𝑀}
can be either discrete or continuous. Each instance is classified into one of the class
memberships with numerical expressions in Θ = {𝑘 | 𝑘 = 1, … , 𝐾}, denoted by
𝑦𝑛 ∈ {1, … , 𝐾}, 𝑛 = 1, … , 𝑁. As mentioned above, the MAKER framework is
constructed with discrete functions, so numerical data needs to be transformed.
Such transformation makes the MAKER framework applicable to numerical data.
In this framework, referential value-based transformation is applied. The initial
referential values can be set based on expert knowledge, random rules without
prior knowledge, or common sense, and afterwards values can be learned from
input-output data (Xu et al., 2017). Referential values include minimum and
maximum values of an input variable and the values between them. In addition,
the number of referential values of an input variable can differ from that of other
input variables; the framework is flexible in this respect.
An input value 𝑥𝑛,𝑙 of an input variable, the corresponding output of which
belongs to class 𝑘, is transformed as denoted in Equation (4.7). 𝐴^𝑙_𝑖 is the 𝑖th
referential value of the 𝑙th input variable, while 𝑎^𝑘_{𝑛,𝑙,𝑖} represents the degree
to which the 𝑛th input value of the 𝑙th input variable (i.e. 𝑥𝑛,𝑙) belongs to
referential value 𝐴^𝑙_𝑖 or, in other words, how close 𝑥𝑛,𝑙 is to 𝐴^𝑙_𝑖.

𝑆𝑙(𝑥𝑛,𝑙) = {(𝐴^𝑙_𝑖, 𝑎^𝑘_{𝑛,𝑙,𝑖}); 𝑖 = 1, … , 𝐼𝑙}, where 𝐼𝑙 is the number of referential
values of input variable 𝑙, and

𝑎^𝑘_{𝑛,𝑙,𝑖} = (𝐴^𝑙_{𝑖+1} − 𝑥𝑛,𝑙) / (𝐴^𝑙_{𝑖+1} − 𝐴^𝑙_𝑖), 𝑎^𝑘_{𝑛,𝑙,𝑖+1} = 1 − 𝑎^𝑘_{𝑛,𝑙,𝑖}
for 𝐴^𝑙_𝑖 ≤ 𝑥𝑛,𝑙 ≤ 𝐴^𝑙_{𝑖+1}, and 𝑎^𝑘_{𝑛,𝑙,𝑖′} = 0 for 𝑖′ ≠ 𝑖, 𝑖 + 1 (4.7)
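As an illustration, the transformation in Equation (4.7) can be sketched in Python. The function name `transform` and the list-based representation of the referential values are assumptions made for exposition, not part of the framework itself.

```python
def transform(x, refs):
    """Illustrative sketch of Equation (4.7): distribute an input value x
    linearly between the two adjacent referential values that bracket it."""
    if x <= refs[0]:
        return [1.0] + [0.0] * (len(refs) - 1)
    if x >= refs[-1]:
        return [0.0] * (len(refs) - 1) + [1.0]
    degrees = [0.0] * len(refs)
    for i in range(len(refs) - 1):
        if refs[i] <= x <= refs[i + 1]:
            degrees[i] = (refs[i + 1] - x) / (refs[i + 1] - refs[i])
            degrees[i + 1] = 1.0 - degrees[i]
            break
    return degrees

# An input of 80 against referential values {0, 50, 100}:
print(transform(80, [0, 50, 100]))  # [0.0, 0.4, 0.6]
```

The similarity degrees always sum to one, so no information is lost in the discretisation.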
After all the input values of an input variable are transformed, the next step is to
aggregate all the belief distributions for referential values under different class
memberships. In this way, the frequencies of the referential values of an input
variable under different classes can be obtained, and based on this calculation, the
likelihood 𝑐𝑘,𝑙,𝑖 and the basic probability 𝑝𝑘,𝑙,𝑖 can be estimated using Equations
(4.8) to (4.10).
The aggregated similarity degrees form a 𝐾 × 𝐼𝑙 frequency matrix [𝑎^𝑘_{𝑙,𝑖}], where

𝑎^𝑘_{𝑙,𝑖} = ∑_{𝑛=1}^{𝑁} 𝑎^𝑘_{𝑛,𝑙,𝑖} and
∑_{𝑘=1}^{𝐾} ∑_{𝑖=1}^{𝐼𝑙} 𝑎^𝑘_{𝑙,𝑖} = ∑_{𝑖=1}^{𝐼𝑙} ∑_{𝑘=1}^{𝐾} 𝑎^𝑘_{𝑙,𝑖} = 𝑁 (4.8)
The likelihoods are obtained by normalising each class row of the frequency matrix:

𝑐^𝑘_{𝑙,𝑖} = 𝑎^𝑘_{𝑙,𝑖} / ∑_{𝑖′=1}^{𝐼𝑙} 𝑎^𝑘_{𝑙,𝑖′} if ∑_{𝑖′=1}^{𝐼𝑙} 𝑎^𝑘_{𝑙,𝑖′} ≠ 0, and 𝑐^𝑘_{𝑙,𝑖} = 0 otherwise,
for 𝑘 = 1, … , 𝐾 and 𝑖 = 1, … , 𝐼𝑙, forming a 𝐾 × 𝐼𝑙 likelihood matrix [𝑐^𝑘_{𝑙,𝑖}] (4.9)

with ∑_{𝑖=1}^{𝐼𝑙} 𝑐^𝑘_{𝑙,𝑖} = 1 for all 𝑘 ⊆ Θ and
∑_{𝑘=1}^{𝐾} ∑_{𝑖=1}^{𝐼𝑙} 𝑐^𝑘_{𝑙,𝑖} = ∑_{𝑖=1}^{𝐼𝑙} ∑_{𝑘=1}^{𝐾} 𝑐^𝑘_{𝑙,𝑖} = 𝐾.

The basic probabilities are then obtained by normalising the likelihoods over the classes:

𝑝𝑘,𝑙,𝑖 = 𝑐^𝑘_{𝑙,𝑖} / ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑙,𝑖} if ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑙,𝑖} ≠ 0, and 𝑝𝑘,𝑙,𝑖 = 0 otherwise,
forming a 𝐾 × 𝐼𝑙 basic probability matrix [𝑝𝑘,𝑙,𝑖] with ∑_{𝑘=1}^{𝐾} 𝑝𝑘,𝑙,𝑖 = 1 (4.10)
Suppose that 𝑒𝑖𝑙(𝑘) is the 𝑖th referential value of the 𝑙th input variable directly
supporting to class 𝑘. 𝑐𝑙,𝑖𝑘 is the likelihood that the 𝑖th referential value of the 𝑙th input
variable is observed given that class 𝑘 is true. The basic probability 𝑝𝑘,𝑙,𝑖 is obtained
from normalised 𝑐𝑙,𝑖𝑘 . Each piece of evidence is profiled by a set of basic
probabilities as stated in Equation (4.11).
𝑒^𝑙_𝑖 = {(𝑒^𝑙_𝑖(𝑘), 𝑝𝑘,𝑙,𝑖), 𝑘 = 1, … , 𝐾 and ∑_{𝑘=1}^{𝐾} 𝑝𝑘,𝑙,𝑖 = 1} (4.11)
Through this technique, numerical data can be profiled into a simple discrete
distribution function without losing information or overgeneralising.
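The evidence acquisition steps in Equations (4.8)–(4.10) can be summarised with a short, illustrative Python sketch. The function name and the nested-list representation of the matrices are assumptions made for exposition.

```python
def acquire_evidence(degrees, labels, K):
    """Aggregate similarity degrees per class (Eq. 4.8), normalise each
    class row into likelihoods (Eq. 4.9), then normalise each referential
    value column into basic probabilities (Eq. 4.10)."""
    I = len(degrees[0])
    freq = [[0.0] * I for _ in range(K)]
    for d, k in zip(degrees, labels):
        for i in range(I):
            freq[k][i] += d[i]
    lik = [[a / sum(row) if sum(row) else 0.0 for a in row] for row in freq]
    col = [sum(lik[k][i] for k in range(K)) for i in range(I)]
    prob = [[lik[k][i] / col[i] if col[i] else 0.0 for i in range(I)]
            for k in range(K)]
    return lik, prob

# Three transformed instances over two referential values, two classes:
lik, prob = acquire_evidence([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0, 1, 1], K=2)
```

Each row of `lik` sums to one across referential values, and each column of `prob` sums to one across classes, matching the constraints attached to Equations (4.9) and (4.10).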
4.3.2. Interdependence Index
A key characteristic introduced in the MAKER framework is its ability to measure
the interrelationship between a pair of evidence when multiple input variables are
considered. As such, the assumption of input independency in ER can be relaxed
in this framework. The measurement is based on the statistical interdependence
between a pair of evidence from different input variables. It can be acquired from
a joint likelihood function.
Suppose that we have multiple input variables denoted by a vector
𝑥𝑛 = {𝑥_{𝑛,𝑗1}, … , 𝑥_{𝑛,𝑗𝑙}, … , 𝑥_{𝑛,𝑗𝑚}}, with 𝑛 = 1, … , 𝑁; 𝑗𝑙 = 1, … , 𝑀; 𝑙 = 1, … , 𝑚;
𝑚 = 2, … , 𝑀; 𝑖𝑙 = 1, … , 𝐼_{𝑗𝑙}. The input is then transformed for the combination of
referential values 𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚} as follows:

𝑆_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚}(𝑥𝑛) = {(𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚}, 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚})},
𝑛 = 1, … , 𝑁; 𝑙 = 1, … , 𝑚; 𝑚 = 2, … , 𝑀; 𝑖𝑙 = 1, … , 𝐼_{𝑗𝑙},

where 𝐼_{𝑗𝑙} is the total number of referential values of input variable 𝑗𝑙,

𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚} = {𝐴^{𝑗1}_{𝑖1}, … , 𝐴^{𝑗𝑙}_{𝑖𝑙}, … , 𝐴^{𝑗𝑚}_{𝑖𝑚}},

𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑎^𝑘_{𝑛,𝑗1,𝑖1} × … × 𝑎^𝑘_{𝑛,𝑗𝑙,𝑖𝑙} × … × 𝑎^𝑘_{𝑛,𝑗𝑚,𝑖𝑚},

and ∑_{𝑘=1}^{𝐾} ∑_{(𝑗1…𝑗𝑙…𝑗𝑚)∈𝑇} ∑_{(𝑖1…𝑖𝑙…𝑖𝑚)∈𝑅} 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 1 (4.12)
The similarity degree 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} represents the degree to which
an input value of 𝑥𝑛 matches the combination of referential values
𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚}. Suppose that {(𝑗1…𝑗𝑙…𝑗𝑚), 𝑗𝑙 = 1, … , 𝑀; 𝑙 = 1, … , 𝑚} ∈ 𝑇 and
{(𝑖1…𝑖𝑙…𝑖𝑚), 𝑖𝑙 = 1, … , 𝐼_{𝑗𝑙}; 𝑙 = 1, … , 𝑚} ∈ 𝑅. We can then calculate the likelihood
𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} that the combination of referential values
𝐴^{𝑗1…𝑗𝑙…𝑗𝑚}_{𝑖1…𝑖𝑙…𝑖𝑚} occurs given that class 𝑘 is true, under the conditions stated in
Equation (4.13). The belief degree representing the extent to which the combination of
referential values directly points to class 𝑘 can be calculated according to Equation (4.14).

𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = ∑_{𝑛=1}^{𝑁} 𝑎^𝑘_{𝑛,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚},

∑_{𝑘=1}^{𝐾} ∑_{𝑇} ∑_{𝑅} 𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑁,

𝛿𝑘 = ∑_{𝑇} ∑_{𝑅} 𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚},

𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑎^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} / 𝛿𝑘 if 𝛿𝑘 ≠ 0, and 0 if 𝛿𝑘 = 0,

with ∑_{𝑇} ∑_{𝑅} 𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 1 for all 𝑘 ⊆ Θ and
∑_{𝑘=1}^{𝐾} ∑_{𝑇} ∑_{𝑅} 𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝐾 (4.13)
As a simple example, suppose that we have two pieces of evidence with referential
values as follows: {0, 50, 100} and {0, 1, 3}. An input with values 80 and 0.2 for the
first and second pieces of evidence can be transformed using Equation (4.7).
Hence, we can obtain 𝑆1 = {(𝐴^1_1, 0), (𝐴^1_2, 0.4), (𝐴^1_3, 0.6)} and
𝑆2 = {(𝐴^2_1, 0.8), (𝐴^2_2, 0.2), (𝐴^2_3, 0)}. The input is then transformed for each
combination of referential values using Equation (4.12), as stated below.

Table 4.1. An example of data transformation

Combination:  𝐴^{1,2}_{1,1} = {0,0}, 𝐴^{1,2}_{1,2} = {0,1}, 𝐴^{1,2}_{1,3} = {0,3},
𝐴^{1,2}_{2,1} = {50,0}, 𝐴^{1,2}_{2,2} = {50,1}, 𝐴^{1,2}_{2,3} = {50,3},
𝐴^{1,2}_{3,1} = {100,0}, 𝐴^{1,2}_{3,2} = {100,1}, 𝐴^{1,2}_{3,3} = {100,3}
Joint degree: 0, 0, 0, 0.32, 0.08, 0, 0.48, 0.12, 0 (total = 1)
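The joint similarity degrees in Table 4.1 can be reproduced with a few lines of Python (illustrative only):

```python
# Individual similarity degrees from the example above (Equation 4.7).
s1 = [0.0, 0.4, 0.6]   # input 80 against referential values {0, 50, 100}
s2 = [0.8, 0.2, 0.0]   # input 0.2 against referential values {0, 1, 3}

# Joint degrees are products of the individual degrees (Equation 4.12).
joint = [[a * b for b in s2] for a in s1]

print(round(joint[1][0], 2), round(joint[1][1], 2),
      round(joint[2][0], 2), round(joint[2][1], 2))  # 0.32 0.08 0.48 0.12
print(round(sum(sum(row) for row in joint), 6))      # 1.0
```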
The interdependence index introduced in the MAKER framework measures the
interdependence between a pair of evidence. The joint basic probability in
Equation (4.14) is obtained by normalising the joint likelihoods of Equation (4.13)
over the classes:

𝑝_{𝑘,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 𝑐^𝑘_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} / ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚}
if ∑_{𝑘′=1}^{𝐾} 𝑐^{𝑘′}_{𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} ≠ 0, and 0 otherwise, with
∑_{𝑘=1}^{𝐾} 𝑝_{𝑘,𝑗1…𝑗𝑙…𝑗𝑚,𝑖1…𝑖𝑙…𝑖𝑚} = 1 (4.14)

Suppose that 𝑖1 and 𝑖2 index the referential values of two pieces of evidence
acquired from input variables 𝑗1 and 𝑗2, respectively. The interdependence index
between the two pieces of evidence, denoted by 𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2}, can be defined by
Equation (4.15). 𝑝_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} represents the degree to which the two pieces of
evidence jointly support class 𝑘 and can be obtained by using Equations (4.13) and
(4.14). 𝑝_{𝑘,𝑗1,𝑖1} and 𝑝_{𝑘,𝑗2,𝑖2} are the belief degrees of evidence elements
𝑒^{𝑗1}_{𝑖1}(𝑘) and 𝑒^{𝑗2}_{𝑖2}(𝑘), respectively, pointing to class 𝑘 and are obtained
using Equations (4.7)–(4.11).

𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 0 if 𝑝_{𝑘,𝑗1,𝑖1} = 0 or 𝑝_{𝑘,𝑗2,𝑖2} = 0, and
𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 𝑝_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} / (𝑝_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗2,𝑖2}) otherwise; in particular,
𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 0 if 𝑒_{𝑗1,𝑖1}(𝑘) and 𝑒_{𝑗2,𝑖2}(𝑘) are disjoint and
𝛼_{𝑘,𝑗1,𝑖1,𝑗2,𝑖2} = 1 if they are independent (4.15)
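As a minimal sketch of Equation (4.15), the interdependence index compares the joint basic probability of a pair of evidential elements with the product of their marginal basic probabilities. The function name is illustrative.

```python
def interdependence_index(p_joint, p1, p2):
    """Equation (4.15): 0 if either marginal support is zero; otherwise the
    ratio of joint support to the product of the individual supports."""
    if p1 == 0.0 or p2 == 0.0:
        return 0.0
    return p_joint / (p1 * p2)

# Independent elements: joint support equals the product, so the index is 1.
print(interdependence_index(0.25, 0.5, 0.5))  # 1.0
```

A value above 1 indicates that the two elements support the class jointly more often than independence would predict; a value below 1 indicates the opposite.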
4.3.3. Evidence Combination
In the MAKER framework, every referential value, referred to as an evidential element
in the evidence space, is directly connected to each class membership in the state
space. To obtain the aggregate level of support for a class membership, including
independent support from each element and joint support from a combination of
referential values with their interdependence considered, all supports for a class
membership from all evidential elements are exhaustively combined.

Let 𝑝_{𝑘,𝑒(2)} be the combined degree of belief with which two pieces of evidence
𝑒^{𝑗1}_{𝑖1}(𝑘) and 𝑒^{𝑗2}_{𝑖2}(𝑘) jointly support class 𝑘. With the interdependence
between the two pieces of evidence considered, 𝑝_{𝑘,𝑒(2)} can be calculated using
Equation (4.16). 𝑟_{𝑗1,𝑖1} and 𝑟_{𝑗2,𝑖2} are the reliabilities of evidence elements
𝑒^{𝑗1}_{𝑖1} and 𝑒^{𝑗2}_{𝑖2}, respectively. 𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} is a nonnegative
coefficient which represents the degree of joint support for class 𝑘 from both
𝑒^{𝑗1}_{𝑖1}(𝑘) and 𝑒^{𝑗2}_{𝑖2}(𝑘) relative to their individual support.
𝛼_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} is the interdependence index between evidence elements
𝑒^{𝑗1}_{𝑖1} and 𝑒^{𝑗2}_{𝑖2}, which can be obtained using Equation (4.15). The model
parameters 𝜔_{𝑘,𝑗1,𝑖1}, 𝜔_{𝑘,𝑗2,𝑖2}, 𝑟_{𝑘,𝑗1,𝑖1}, 𝑟_{𝑘,𝑗2,𝑖2}, and 𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)}
can be trained. In this study, 𝜔_{𝑘,𝑗1,𝑖1} = 𝜔_{𝑘,𝑗2,𝑖2} = 1; as such,
𝑤_{𝑘,𝑗1,𝑖1} = 𝑟_{𝑘,𝑗1,𝑖1} and 𝑤_{𝑘,𝑗2,𝑖2} = 𝑟_{𝑘,𝑗2,𝑖2}. In addition,
𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} is set to 1.
𝑝_{𝑘,𝑒(2)} = 0 if 𝑘 = ∅, and 𝑝_{𝑘,𝑒(2)} = 𝑚𝑘 / ∑_{𝐶⊆Θ} 𝑚𝐶 if 𝑘 ⊆ Θ (4.16)

𝑚𝑘 = [(1 − 𝑟_{𝑗2,𝑖2}) 𝑚_{𝑘,𝑗1,𝑖1} + (1 − 𝑟_{𝑗1,𝑖1}) 𝑚_{𝑘,𝑗2,𝑖2}]
+ 𝛾_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} 𝛼_{𝑘,(𝑗1,𝑖1),(𝑗2,𝑖2)} 𝑚_{𝑘,𝑗1,𝑖1} 𝑚_{𝑘,𝑗2,𝑖2},

where 𝑚_{𝑘,𝑗1,𝑖1} = 𝑤_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗1,𝑖1} = 𝜔_{𝑘,𝑗1,𝑖1} 𝑟_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗1,𝑖1},
𝑚_{𝑘,𝑗2,𝑖2} = 𝑤_{𝑘,𝑗2,𝑖2} 𝑝_{𝑘,𝑗2,𝑖2} = 𝜔_{𝑘,𝑗2,𝑖2} 𝑟_{𝑘,𝑗2,𝑖2} 𝑝_{𝑘,𝑗2,𝑖2},

𝑟_{𝑗1,𝑖1} = ∑_{𝑘=1}^{𝐾} 𝑟_{𝑘,𝑗1,𝑖1} 𝑝_{𝑘,𝑗1,𝑖1} and 𝑟_{𝑗2,𝑖2} = ∑_{𝑘=1}^{𝐾} 𝑟_{𝑘,𝑗2,𝑖2} 𝑝_{𝑘,𝑗2,𝑖2} (4.17)

𝑟_{𝑘,𝑒(2)} = 0 if 𝑘 = ∅; 𝑟_{𝑘,𝑒(2)} = 𝑚_{𝑘,𝑒(2)} / 𝑝_{𝑘,𝑒(2)} if 𝑘 ⊆ Θ and 𝑘 ≠ ∅;
𝑟_{𝑘,𝑒(2)} = 1 − 𝑚_{Θ,𝑒(2)} if 𝑘 = 𝑃(Θ) (4.18)
Equations (4.16)–(4.18) are adapted from the conjunctive MAKER rule of Yang
and Xu (2017), as presented in Equations (4.19) and (4.20).
𝑝(𝜃) = 0 if 𝜃 = ∅, and 𝑝(𝜃) = 𝑚𝜃 / ∑_{𝐶⊆Θ} 𝑚𝐶 if 𝜃 ⊆ Θ (4.19)

𝑚𝜃 = [(1 − 𝑟_{𝑗,𝑚}) 𝑚_{𝜃,𝑖,𝑙} + (1 − 𝑟_{𝑖,𝑙}) 𝑚_{𝜃,𝑗,𝑚}]
+ ∑_{𝐴∩𝐵=𝜃} 𝛾_{𝐴,𝐵,𝑖,𝑗} 𝛼_{𝐴,𝐵,𝑖,𝑗} 𝑚_{𝐴,𝑖,𝑙} 𝑚_{𝐵,𝑗,𝑚},

where 𝑚_{𝜃,𝑖,𝑙} = 𝑝(𝑠_{𝑖,𝑙}(𝜃)) = 𝜔_{𝑖,𝑙} 𝑝𝑙(𝑠_{𝑖,𝑙}(𝜃)) = 𝜔_{𝑖,𝑙} 𝑝𝑙(𝜃 | 𝑒_{𝑖,𝑙}(𝜃)) 𝑝𝑙(𝑒_{𝑖,𝑙}(𝜃))
= 𝑤_{𝜃,𝑖,𝑙} 𝑝𝑙(𝑒_{𝑖,𝑙}(𝜃)) and 𝑟_{𝑖,𝑙} = ∑_{𝜃⊆Θ} 𝑟_{𝜃,𝑖,𝑙} 𝑝(𝑒_{𝑖,𝑙}(𝜃)) (4.20)
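The conjunctive combination in Equations (4.16)–(4.17) can be sketched for two pieces of evidence over singleton classes, assuming 𝜔 = 𝛾 = 1 as in this study (so that 𝑤 = 𝑟). The function and variable names are illustrative, not part of the framework.

```python
def maker_combine(p1, p2, r1, r2, alpha):
    """Combine two pieces of evidence (basic probabilities p1, p2 with
    class-specific reliabilities r1, r2 and interdependence indices alpha)
    following Equations (4.16)-(4.17) with omega = gamma = 1."""
    K = len(p1)
    R1 = sum(r1[k] * p1[k] for k in range(K))   # overall reliability, Eq. (4.17)
    R2 = sum(r2[k] * p2[k] for k in range(K))
    m1 = [r1[k] * p1[k] for k in range(K)]      # m = w * p with w = r
    m2 = [r2[k] * p2[k] for k in range(K)]
    m = [(1 - R2) * m1[k] + (1 - R1) * m2[k]
         + alpha[k] * m1[k] * m2[k] for k in range(K)]
    total = sum(m)
    return [mk / total for mk in m]             # normalisation, Eq. (4.16)

# Two fully reliable, independent pieces of evidence (alpha = 1):
out = maker_combine([0.8, 0.2], [0.6, 0.4], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

With full reliability the bracketed terms vanish and the rule reduces to a normalised product of supports; lower reliabilities shift weight back towards the individual pieces of evidence.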
4.4. Belief Rule Base
The traditional IF-THEN rules can be expressed as 𝑅𝑘 in Equation (4.21), where
𝐴^𝑘_𝑖 (𝑖 = 1, … , 𝑇𝑘) is the referential value or grade of the 𝑖th attribute in the 𝑘th rule
and 𝑇𝑘 is the number of attributes used in the 𝑘th rule. The symbol ∧ denotes an 'AND'
relationship between the attributes. 𝐷𝑘 is the consequent of the 𝑘th rule being
activated. Equation (4.21) indicates that an input vector with those
corresponding referential values points directly to the outcome 𝐷𝑘 with 100%
probability. This simple form does not consider the relative importance of each
attribute, the relative importance of rules in the rule base, or the distribution of
consequents. The traditional IF-THEN rules can be extended by including
attribute weights, rule weights, and belief degrees for all possible consequents,
as expressed in Equation (4.22) (Yang et al., 2006).
if 𝐴^𝑘_1 ∧ … ∧ 𝐴^𝑘_𝑖 ∧ … ∧ 𝐴^𝑘_{𝑇𝑘} then 𝐷𝑘 (4.21)

if 𝐴^𝑘_1 ∧ 𝐴^𝑘_2 ∧ … ∧ 𝐴^𝑘_{𝑇𝑘} then {(𝐷1, 𝛽^𝑘_1), (𝐷2, 𝛽^𝑘_2), … , (𝐷𝑁, 𝛽^𝑘_𝑁)}, 𝑘 ∈ {1, … , 𝐿},
where 𝛽^𝑘_𝑗 ≥ 0 and ∑_{𝑗=1}^{𝑁} 𝛽^𝑘_𝑗 ≤ 1, with a rule weight 𝜃𝑘 and attribute weights
𝛿_{𝑘,1}, … , 𝛿_{𝑘,𝑖}, … , 𝛿_{𝑘,𝑇𝑘} (4.22)
𝐴^𝑘_𝑖 (𝑖 = 1, … , 𝑇𝑘) indicates the referential value or grade of the 𝑖th attribute in the
𝑘th rule, where 𝑇𝑘 is the number of attributes used in the 𝑘th rule, and 𝐿 is the number
of rules in the rule base. According to Yang et al. (2006), if an input satisfies the
packet antecedent attributes 𝐴^𝑘 = (𝐴^𝑘_1, 𝐴^𝑘_2, … , 𝐴^𝑘_{𝑇𝑘}), the rule 𝑅𝑘 is activated
and points to the consequent 𝐷𝑗 with degree of belief 𝛽^𝑘_𝑗 (𝑗 = 1, … , 𝑁), where 𝑁 is
the number of possible consequents. Belief degrees are expressed as the probability
with which 𝐷𝑗 is likely to occur. The total belief degree of a rule can be less than or
equal to one, which leaves room to handle missing data or unknown consequents. 𝜃𝑘
is the weight of the 𝑘th rule, acting as the relative importance of the rule compared
with other rules in the rule base. 𝛿_{𝑘,𝑖} is the attribute weight of the 𝑖th attribute in
the 𝑘th rule, indicating the relative importance of the attribute among all the
attributes used in the 𝑘th rule. In Equation (4.22), a rule weight, attribute weights,
and consequent belief degrees are embedded in each rule in the rule base. If a rule is
presented in the format of Equation (4.22), it is referred to as a belief rule. A
collection of belief rules is called a belief rule base (BRB).
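A belief rule in the format of Equation (4.22) can be represented directly as a small data structure. This is a hypothetical sketch; the class and field names are illustrative, not part of the BRB literature.

```python
from dataclasses import dataclass

@dataclass
class BeliefRule:
    antecedents: tuple        # packet antecedent (A_1^k, ..., A_Tk^k)
    beliefs: dict             # consequent D_j -> belief degree beta_j^k
    rule_weight: float = 1.0  # theta_k
    attr_weights: tuple = ()  # (delta_k1, ..., delta_kTk)

# A rule whose belief degrees sum to less than one leaves room for an
# unknown consequent, as discussed above.
rule = BeliefRule(("low", "high"), {"class 1": 0.5, "class 2": 0.25})
print(sum(rule.beliefs.values()))  # 0.75
```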
4.5. The Decomposition of Input Variables
In this study, the belief rule base inference of the rule-based inferential modelling
and prediction is transparent and interpretable. However, Xu et al. (2017) stated that
a BRB suffers from high combinatorial complexity in the number of referential values
of the input variables. The size of a belief rule base increases exponentially as the
number of input variables and the number of referential values of each input variable
increase (Yang and Xu, 2017). Consequently, the number of parameters required for
training increases exponentially (Yang and Xu, 2017). For example, if we have six
input variables with three referential values each, we will have 3^6 = 729 belief
rules. The rule-based models will be extremely complex.
Furthermore, the rule-based modelling and prediction in this study is developed
based on the MAKER framework. In this framework, evidence is acquired through
statistical analysis directly from data. The interdependence index is measured using
the statistical interdependence between a pair of evidence. It can be acquired from a
joint likelihood function, as presented in Equations (4.13)–(4.15). For a discrete joint
likelihood function, the likelihood is calculated on the basis of the frequencies of
the combinations of referential values for each class membership. This principle is
widely used as a basis for measuring the interdependencies between two variables
with nominal or ordinal data – for example, in a chi-square test for independence
based on a contingency table (Bishop, 2007). Each cell of the contingency table
contains the cases that match a certain combination of categories. It is worth
noting that ‘category’ has the same meaning as ‘referential value’ in this study.
The minimum requirement for a presumably sufficiently large sample size is five
cases per cell. According to Bishop (2007), there are two types of zero entries in a
contingency table: 1) sampling zeros that may occur for cells that are realistic
combinations of categories with relatively small samples when compared to a large
number of cells, or 2) structural zeros, which occur because it is not possible to
collect observations for certain combinations of categories – that is, certain
combinations of referential values of input variables. A sampling zero occurs when no
observation is found for a certain combination of variables, but it is probable that
the combination exists. Meanwhile, structural zeros are attached to unrealistic
combinations due to features of the data or the data structure (Bishop, 2007).
Strategies to deal with combinations of categories that violate the minimum sample
size requirement are explained below.
There are several ways to deal with a sparse contingency table. The common
practice is to collapse the categories to obtain a smaller and less sparse table with
fewer categories (Kateri and Iliopoulos, 2010). Two categories are combined on
the basis of homogeneity and structure (Kateri and Iliopoulos, 2010). These
collapsed categories are also considered to be theoretically or practically
equivalent. However, this practice can produce misleading statistical inferences in
which significant associations are found in the collapsed table when there are no
such associations in the original table (Bishop, 2007; Fienberg and Rinaldo, 2007).
The practice also potentially distorts the modelling process and leads to a loss of
some valuable data or information.
Another approach is to add small positive quantities to every cell in the table. This
practice was discussed in Fienberg and Rinaldo (2007) with numerical examples,
and it is evident that it can result in misleading and incorrect inferences. The
simplest – yet expensive – way to remedy a sparse contingency table is to collect
more samples (Bishop, 2007). In the case of a table with a zero denominator,
another strategy is to arbitrarily define zero divided by zero to be zero (Fienberg,
1980). Careful justification is essential when taking any of the actions above
(Fienberg and Rinaldo, 2007). The current study mainly uses numerical data, and
hence a referential value-based discretization method is applied. Since the
referential values are trained, the justification explained above, which is very
complex if not impossible, must be done as part of the optimisation process.
As stated above, the models under the rule-based inferential modelling and
prediction become extremely complex as the number of input variables and the
number of referential values of each input variable increase (Yang and Xu, 2017).
In addition, under the MAKER framework with sparse matrices, statistical analysis
is difficult when many cells contain fewer cases than the statistically required
sample size. Therefore, this study proposes a hierarchical rule-based inferential
modelling and prediction approach based on the MAKER framework. The input variables
are split into n groups; as such, the number of rules decreases, and violations of
statistical requirements can be avoided. Furthermore, the need for careful
justification when optimising the trained referential values in sparse matrices, a
need which increases computational complexity, can be reduced.
Measures for selecting the best split consider the strength of the relationship
between the input variables and the outputs, either a linear or nonlinear
relationship. The input variable with the strongest relationship to the output makes
the largest contribution to explaining variances in the output. The strongest
relationship indicates the variable that retains the most information and is the most
significant in the prediction model. Steps for grouping evidence or input variables
are listed below.
Step 1: With the full dataset, we sort the input variables based on their estimated
importance in the prediction model by measuring their linear or non-linear
relationship to the outputs.

Step 2: We set the most important variable as the initial member of the first evidence
group. We add the next most important variable to this group if the statistical
requirement of sample size per cell is met. Otherwise, we set this variable as the
initial member of the second group. We then move on to the third variable and add
it to the first evidence group if the minimum sample size per cell is met. If not, it
can be added to the second group or put into a new evidence group if the
statistical requirement of sample size per cell is violated. This step is repeated until
evidence groups whose members have joint frequency matrices with at least the
minimum sample size per cell, except for structural zeros, are formed.
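The grouping steps above can be sketched as a greedy procedure; `importance` and `cell_ok` are stand-ins for the relationship measure and the sample-size-per-cell check described in the text.

```python
def group_variables(variables, importance, cell_ok):
    """Step 1: rank variables by importance. Step 2: greedily add each
    variable to the first existing group whose joint frequency matrix still
    meets the minimum sample size per cell; otherwise open a new group."""
    ordered = sorted(variables, key=importance, reverse=True)
    groups = []
    for v in ordered:
        for g in groups:
            if cell_ok(g + [v]):
                g.append(v)
                break
        else:
            groups.append([v])
    return groups

# Toy check: if no group may hold more than two variables,
# five variables split into three groups.
groups = group_variables(list(range(5)), lambda v: v, lambda g: len(g) <= 2)
print(groups)  # [[4, 3], [2, 1], [0]]
```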
We apply the MAKER framework to each group, and hence each group generates
its output based on the input variables as described in Figure 4.1. We provide two
ways to combine the MAKER-generated outputs of two or more evidence groups.
The first method is to combine them according to a MAKER rule. Each evidence
group makes an inference based on the input of variables within the group denoted
by 𝑝𝑔(𝑠)(𝜃), 𝑔 = 1,… , 𝐺, 𝑠 = 1,… , 𝑆, where G is the total number of evidence groups
formed, and S is the total number of observations or samples. An ER rule has been
developed for combining evidence while considering weight and reliability (Yang
and Xu, 2013). To combine pairs of MAKER-generated outputs, we need their
weights. By using Equation (4.18), we can obtain the weights of all evidence
groups. Hence, Equation (4.16) can be recursively applied to obtain the whole
system-generated output for each input vector 𝑥𝑠. The parameters 𝛾𝐴,𝐵,𝑖,𝑗 and 𝛼𝐴,𝐵,𝑖,𝑗
in Equation (4.16) can be trained. In this study, 𝛾𝐴,𝐵,𝑖,𝑗 and 𝛼𝐴,𝐵,𝑖,𝑗
of this part are set to 1. This approach is called the MAKER-ER-based model.
The second approach to combining evidence groups is to use a BRB. As stated in
Section 4.4, generating a BRB basically requires finding all possibilities for an IF-
THEN rule. If the consequent 𝜃 is supported by all evidence groups, the belief
degree of the consequent 𝜃 is assumed to be 1. Otherwise, if none of the groups
supports the consequent 𝜃, its belief degree is assumed to be 0. If the consequent
𝜃 is supported by one or more evidence groups while the remaining evidence
supports its negation, the belief degree of the consequent 𝜃 is logically between 0
and 1. In this way, we can generate a BRB in the form of Equations (4.21) and
(4.22). In this setting, the packet antecedent of a belief rule, written as
𝐴^𝑘_1 ∧ 𝐴^𝑘_2 ∧ … ∧ 𝐴^𝑘_{𝑇𝑘}, should be read as 'if the groups of evidence point to
class 𝑘'. Therefore, the number of antecedents in this BRB equals the number of
evidence groups in the system. Furthermore, {(𝐷1, 𝛽^𝑘_1), (𝐷2, 𝛽^𝑘_2), … , (𝐷𝑁, 𝛽^𝑘_𝑁)},
𝑘 ∈ {1, … , 𝐿}, should be read as 'the probability that an observation belongs to 𝜃'.
[Figure 4.1 depicts the two-stage hierarchical training process: in Stage 1, the input
𝑥𝑠 feeds 𝐺 MAKER-based systems, each generating an output 𝑝𝑔(𝜃); in Stage 2, an
ER- or BRB-based system combines these into the system-generated output 𝑝(𝜃), which
is compared with the observed output 𝑝̂(𝑠)(𝜃) of the real system to train the model
parameters 𝑃.]

Figure 4.1. Hierarchical MAKER-based training process
𝑝𝑔(𝑠)(𝜃) is the MAKER-generated output of evidence group 𝑔 (𝑔 = 1,… , 𝐺)
corresponding to the degree to which the evidence group supports the consequent
𝜃 based on an input vector 𝑥𝑠. The outputs of all evidence groups become an input
for the next system, which is a BRB system as depicted in Figure 4.1. Hence, 𝑝𝑔(𝑠)
is a numerical input, and it can be transformed as shown in Equation (4.7). Let
𝑝𝑔(𝑠)(𝜃) act as a similarity degree, indicating the degree to which a MAKER-
generated output belongs to 𝜃. Using Equation (4.12), we can obtain the degree
of joint similarity between the outputs generated by each group of evidence and
the combination of antecedents of the BRB (see Sections 5.5.7 and 6.4.7 for
examples), and these values will activate the associated belief rules.
Subsequently, we can apply Equations (3.11) and (3.13) to combine the belief
degrees of the activated belief rules to obtain the system-generated output, denoted
by 𝑝(𝜃). In this study, all rule weights (𝜃𝑘) and attribute weights (𝛿𝑘,𝑖) are set to be
equal, and only the consequent belief degrees (𝛽^𝑘_𝑗) are trained. We can train the
consequent belief degrees along with other model parameters using historical
data. As such, the gap between the observed outputs (𝑝̂(𝑠)(𝜃)) and the system-
generated outputs (𝑝(𝜃)) is minimised, as depicted in Figure 4.1. This approach is
called the MAKER-BRB-based model.
4.6. Parameter Learning
As stated in Yang and Xu (2017) and Yang et al. (2007), the MAKER parameters –
that is, 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, and 𝛾𝐴,𝐵,𝑖,𝑗 – and the BRB-based system
parameters at the top of the hierarchy – that is, 𝜃𝑘, 𝛿𝑖,𝑘, and 𝛽𝑗,𝑘 – can be trained
using historical data.

For the purpose of parameter learning, a general least squares optimisation model
is established, as shown in Equation (4.23) for the MAKER-ER-based system and
Equation (4.24) for the MAKER-BRB-based system. Once we obtain the trained
parameters, we can use them to predict system outputs from given system inputs.
In Equations (4.23) and (4.24), 𝑝̂(𝑠)(𝜃) is the probability that the consequent 𝜃 is
true given the 𝑠th observation. The objective function measures the MSE between the
system-generated outputs and the observed outputs, as depicted in Figure 4.1,
bringing 𝑝(𝜃) as close as possible to 𝑝̂(𝑠)(𝜃). 𝛺 defines the feasible space of the
parameters – for example, 0 ≤ 𝛽𝑗,𝑘 ≤ 1, meaning that consequent belief degrees
must be nonnegative and less than or equal to 1. 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, and 𝛾𝜃,𝑖,𝑗
are MAKER parameters, while 𝜃𝑘, 𝛿𝑖,𝑘, and 𝛽𝑗,𝑘 are BRB parameters of the top
hierarchy.
min 𝛿 = (1/2𝑆) ∑_{𝑠=1}^{𝑆} ∑_{𝜃⊆Θ} (𝑝(𝜃) − 𝑝̂(𝑠)(𝜃))²
subject to 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, 𝛾𝜃,𝑖,𝑗, 𝜃𝑘, 𝛿𝑖,𝑘, 𝛽𝑗,𝑘 ∈ 𝛺 and
∑_{𝑗=1}^{𝑁} 𝛽^𝑘_𝑗 = 1, where 𝑁 is the number of consequents in the BRB of the top
hierarchy (4.24)
For the optimisation tool, we utilise sequential least squares programming provided
in the SciPy package in Python (Pedregosa et al., 2011). It is designed to minimise
a function of several variables with bounds, equality, and inequality constraints.
The model parameters, including MAKER model parameters of all evidence groups
and BRB parameters of the top hierarchy, can be optimised simultaneously to
minimise the function in Equation (4.23) or Equation (4.24). Based on the
explanations in Sections 3.7 and 4.3, Figure 4.2 explains all the steps required for
a hierarchical rule-based modelling and prediction.
min 𝛿 = (1/2𝑆) ∑_{𝑠=1}^{𝑆} ∑_{𝜃⊆Θ} (𝑝(𝜃) − 𝑝̂(𝑠)(𝜃))²
subject to 𝑟𝜃,𝑖,𝑙, 𝑟𝜃,𝑗,𝑚, 𝑤𝜃,𝑖,𝑙, 𝑤𝜃,𝑗,𝑚, 𝛾𝜃,𝑖,𝑗 ∈ 𝛺 (4.23)
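The learning setup can be illustrated with SciPy's SLSQP optimiser on a toy objective. The objective here is a stand-in for the full MAKER/BRB model, and the variable names are illustrative: it shows only how bounds and an equality constraint (belief degrees summing to one) are expressed.

```python
import numpy as np
from scipy.optimize import minimize

observed = np.array([0.7, 0.3])  # observed output for one observation

def mse(beta):
    # Least squares objective in the spirit of Equations (4.23)-(4.24)
    return 0.5 * np.sum((beta - observed) ** 2)

result = minimize(
    mse,
    x0=np.array([0.5, 0.5]),
    method="SLSQP",
    bounds=[(0.0, 1.0), (0.0, 1.0)],                       # 0 <= beta <= 1
    constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
)
print(np.round(result.x, 3))
```

In the actual models, the decision vector would stack all MAKER parameters of every evidence group together with the top-level BRB parameters, so that they are optimised simultaneously.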
[Figure 4.2 summarises the full procedure. For each evidence group 1 to G, the
training data are used to: (1) acquire evidence using referential values, (2)
calculate interdependence indexes between pairs of evidence within the group, (3)
combine evidence based on the MAKER rule using weights (Yang and Xu, 2017), (4)
generate a belief rule base (Yang et al., 2006), and (5) predict the class membership
from the activated belief rules by maximum likelihood prediction. At the top of the
hierarchy, the group outputs are combined – either by generating a belief rule base
from the belief degrees of each consequent or by using the ER rule – and a final
maximum likelihood prediction is made. The grouping of evidence and the final
inference stage are the parts introduced by this study.]
Figure 4.2. A hierarchical rule-based inferential modelling and prediction based on MAKER framework for n groups of evidence
This study applied the MAKER framework – the work of Yang and Xu (2017) – and a
BRB – the work of Yang et al. (2006) – in a hierarchical structure so that the
proposed framework can deal with sparse matrices, maintain interpretability, and
reduce model complexity. This study introduced how the framework can deal with
complex numerical inputs by using referential values. In addition, it explained how
the outputs generated by groups of evidence can be combined to make a final
inference. In Figure 4.2, the novelty of this research can be seen in the red boxes.
4.7. A Comparative Analysis
In this section, the hierarchical rule-based inferential modelling and prediction
based on the MAKER framework is discussed analytically and graphically. First, we
compare the referential value-based discretization technique with other techniques
and highlight the advantages of using this approach for data transformation. Second,
we present the modelling and inference process of the MAKER-based framework in
dealing with sparse matrices. The proposed framework is evaluated in comparison with
other interpretable machine learning methods to emphasise its advantages. Third, we
compare the predictive power, model complexity, and computation time of MAKER,
BRB, and the hierarchical MAKER framework.
4.7.1. Referential Value-based Discretization Technique
Data, in general, can be divided into two main types: qualitative data and
quantitative data (Maimon and Rokach, 2005). Quantitative data is data that can
be measured and expressed numerically. It can be further divided into discrete and
continuous types (Maimon and Rokach, 2005). When required, we can use
discretization techniques to transform numerical data into qualitative data: for
example, in the application of a classification tree (Agre and Peev, 2002).
In machine learning, discretization is usually applied in a classification
context as a data pre-processing step. It is used to improve model accuracy,
especially for classification, and to accelerate the learning process for very
large datasets (Agre and Peev, 2002). In addition, some machine learning
algorithms work only with discrete data, and hence a discretization technique
is needed to make these algorithms applicable to real datasets (Agre and Peev,
2002).
Since the data used in this research is mainly numerical, a discretization technique
is required to develop an efficient and effective learning algorithm (Agre and Peev,
2002). There are some established discretization methods. These are divided into
1) supervised versus unsupervised, 2) global versus local, and 3) static (univariate)
versus dynamic (multivariate) techniques (Agre and Peev, 2002; Dougherty,
Kohavi, and Sahami, 1995). Supervised discretization techniques utilise class
information, whereas unsupervised ones do not. Global discretization occurs prior
to model development, whereas local methods are performed during the process
of model development. Dynamic methods search for the cut-off points of all
variables simultaneously, allowing the interdependencies between variables to
be captured, whereas static methods search for cut-off points for each
variable independently.
The simplest unsupervised discretization methods include equal width (EWB) and
equal frequency (EFB) discretization methods (Agre and Peev, 2002; Dougherty
et al., 1995). EWB discretization divides numerical data into 𝑘 equal-length (width)
intervals. In EFB discretization, the data is partitioned into 𝑘 intervals. Each interval
has roughly equal frequencies. The number of intervals, 𝑘, is a user-specified
parameter. The EWB algorithm determines the minimum and maximum values of
the variable and then calculates the width (𝑤) by dividing the range by 𝑘. In this
way, the variable can be discretized into 𝑘 intervals by the number pairs:
{[𝑑0, 𝑑0 + 𝑤], [𝑑0 + 𝑤, 𝑑0 + 2𝑤], … , [𝑑0 + (𝑘 − 1)𝑤, 𝑑𝑚]}, where 𝑑0 is the minimum
value of the variable, and 𝑑𝑚 = 𝑑0 + 𝑘𝑤 is the maximum value of the variable.
The EFB algorithm determines the minimum and maximum values of the variable,
sorts all the values in ascending order, and divides the range into 𝑘 intervals
in such a way that all intervals contain approximately equal frequencies of
values. These techniques are considered to be static methods, since
discretization is performed for each variable, with its own value of 𝑘,
independently of the other variables (Dougherty et al., 1995).
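As an illustration, both algorithms can be sketched in Python; the function names are ours, and `np.digitize` assigns each observation to the interval it falls into:

```python
import numpy as np

def equal_width_bins(x, k):
    """EWB: split the range of x into k equal-width intervals and
    return the interval index (0..k-1) of each observation."""
    d0, dm = np.min(x), np.max(x)
    w = (dm - d0) / k                      # interval width
    edges = d0 + w * np.arange(1, k)       # k-1 interior cut-off points
    return np.digitize(x, edges)

def equal_frequency_bins(x, k):
    """EFB: choose cut-off points so that each of the k intervals
    holds roughly the same number of observations."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(x, edges)

x = np.array([1.0, 2.0, 2.5, 3.0, 8.0, 9.0])
print(equal_width_bins(x, 2))       # width (9-1)/2 = 4, cut-off at 5
print(equal_frequency_bins(x, 2))   # cut-off at the median, 2.75
```

Note how the two methods place the cut-off differently for the same data: EWB by range, EFB by counts.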
In contrast with unsupervised discretization techniques, supervised
approaches take class information into account and are able to capture
class-variable interdependence (Agre and Peev, 2002). For example, the
entropy-based supervised discretization
(EBD) method proposed by Fayyad and Irani (1993) is a divisive hierarchical
clustering technique which utilises an entropy measure as a criterion for recursively
partitioning numerical data and a minimum description length (MDL) principle as a
stopping criterion. Each value of a variable can be a potential boundary that splits
the variable into two intervals. The partition boundary which minimises the entropy
function over all possible boundaries is chosen (Dougherty et al., 1995). The
information gain is the difference between the entropy of the interval before
and after splitting. A new boundary is recursively sought within each interval
produced in the previous steps, and this recursive process terminates when the
stopping criterion is reached.
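A minimal sketch of one recursion step of this approach, choosing the boundary that minimises the weighted class entropy of the two resulting intervals (the MDL stopping criterion is omitted and the function names are ours):

```python
import numpy as np

def entropy(y):
    """Class entropy of a set of labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_boundary(x, y):
    """One recursion step: evaluate every candidate boundary and
    return the one minimising the weighted entropy of the two
    resulting intervals."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_cut, best_e = None, float("inf")
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue          # no boundary between equal values
        e = (i * entropy(y[:i]) + (len(x) - i) * entropy(y[i:])) / len(x)
        if e < best_e:
            best_cut, best_e = (x[i] + x[i - 1]) / 2, e
    return best_cut, best_e

print(best_boundary([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1]))  # (5.0, 0.0)
```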
The ChiMerge system by Kerber (1992) is another type of supervised discretization
method based on statistical analysis. At its initial stage, each observation value is
placed into its own interval, and chi-square tests are performed to determine
whether adjacent intervals should be merged. A chi-square test is an
independence test based on an empirical measure of the expected frequencies of
the classes represented in each interval. Two adjacent intervals whose class
frequencies are statistically independent of the interval division are merged.
A chi-square threshold is predefined to determine the extent of the merging
process (Dougherty et al., 1995). This method is a supervised, global
discretization technique. StatDisc is another discretization
method using statistical tests to determine intervals (Richeldi and Rossotto, 1995).
This heuristic bottom-up method is similar to ChiMerge; however, StatDisc merges
𝑁 adjacent intervals at a time, while ChiMerge combines only two at a time
(Dougherty et al., 1995).
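The chi-square statistic that ChiMerge computes for a pair of adjacent intervals can be sketched as follows; the merging loop and threshold comparison are omitted, and the function name is ours:

```python
import numpy as np

def chi2_adjacent(freq_a, freq_b):
    """Chi-square statistic for a pair of adjacent intervals, where
    freq_a and freq_b are the class-frequency vectors of each
    interval. ChiMerge repeatedly merges the pair with the lowest
    statistic while it stays below a predefined threshold."""
    table = np.array([freq_a, freq_b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    expected[expected == 0] = 1e-9         # guard against empty classes
    return float(((table - expected) ** 2 / expected).sum())

print(chi2_adjacent([5, 5], [5, 5]))    # identical distributions -> 0.0
print(chi2_adjacent([10, 0], [0, 10]))  # perfectly separated -> 20.0
```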
Adaptive quantizers is a method that mixes supervised and unsupervised
discretization (Chan, Batur, and Srinivasan, 1991). Intervals are initially set based
on an unsupervised discretization method, such as a binary equal width interval
(Dougherty et al., 1995). A set of classification rules are then applied to the
discretized data. The interval with the lowest prediction accuracy is then split into
two partitions of equal width. This process is repeated until the termination
parameter is reached (Dougherty et al., 1995).
As explained previously, unsupervised discretization does not use class
information, although it is essential (Dougherty et al., 1995). Ignoring such
information may lead to the formation of inappropriate intervals and consequently
a poorly performing prediction model (Dougherty et al., 1995). Most studies provide
evidence that supervised discretization methods are able to perform better than
unsupervised ones in terms of error rates – that is, the accuracy of the prediction
model (Dougherty et al., 1995).
The discretization techniques discussed above are those widely used in research.
However, the literature highlights some limitations of these discretization methods.
The major disadvantage is that numerical data is partitioned into intervals and is
labelled to indicate which interval an observation belongs to (Dougherty et al.,
1995). The labels then replace the original observation values. This naturally leads
to information loss and distortion, which potentially cause inaccuracies when
making subsequent inferences (Yang et al., 2006).
Unsupervised discretization’s failure to use class information when it is available
generally results in inappropriate cut-off points and the loss of valuable information
in the development of a prediction model. Consequently, such techniques deliver
poor modelling performance (Agre and Peev, 2002). Although it is evident that
supervised discretization methods perform better than unsupervised ones as
measured by error rates, in some cases supervised discretization has been applied
to the entire dataset before the dataset is split into several folds for training and
testing purposes. Research has recognized that discretization before creating folds
gives the discretization method a chance to have access to the test sets, which is
likely to produce optimistic error rates (Agre and Peev, 2002).
Referential value-based discretization provides a reasonable approach which
captures the relationship between observation values and each referential value.
It is equivalent to transforming an observation value into a distribution of referential
values using belief degree values (Yang et al., 2006). The belief degree represents
the extent to which an input value or observation belongs to each referential value.
In other words, it measures how close an observation value is to each referential
value, reducing information loss and distortion. In addition, it allows the structure
of the data to be well captured.
In the entropy-based discretization method by Fayyad and Irani (1993), the
intervals along each branch are recursively and independently evaluated, leading
to imbalanced intervals (Dougherty et al., 1995). Meanwhile, the referential value-
based discretization technique has been developed in such a way that the
referential values of all input variables are determined simultaneously, and thus,
interdependencies between input variables are well captured. In addition to this,
the search for referential values occurs during the process of constructing the
prediction model. Therefore, the determination of referential values is also reflected
directly in the model accuracy. In this way, discretization makes the prediction not
only more efficient but also more effective by directly minimizing the error between
predicted outputs and observed outputs.
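For illustration, the transformation of a numerical observation into a belief distribution over referential values can be sketched as follows; the function name and the referential values in the example are ours, and only the two referential values adjacent to the observation receive nonzero belief:

```python
import numpy as np

def belief_distribution(x, ref_values):
    """Transform observation x into a belief distribution over sorted
    referential values: belief is shared between the two adjacent
    referential values in proportion to closeness."""
    ref = np.asarray(ref_values, dtype=float)
    beliefs = np.zeros(len(ref))
    if x <= ref[0]:
        beliefs[0] = 1.0
    elif x >= ref[-1]:
        beliefs[-1] = 1.0
    else:
        i = np.searchsorted(ref, x) - 1          # left neighbour
        beliefs[i] = (ref[i + 1] - x) / (ref[i + 1] - ref[i])
        beliefs[i + 1] = 1.0 - beliefs[i]
    return beliefs

print(belief_distribution(3.0, [0.0, 2.0, 4.0, 6.0]))  # [0. 0.5 0.5 0.]
```

Unlike interval labelling, the original value is recoverable from the distribution, which is why information loss is reduced.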
4.7.2. MAKER-based Models
As discussed previously, a hierarchical rule-based inferential modelling and
prediction approach is adopted because of the sparsity of matrices. The
approach is proposed to reduce the size of the belief rule base, which
consequently reduces the model complexity. It is also designed to deal with
the sparsity of matrices, in which only a few joint frequencies are nonzero,
in order to avoid misleading and incorrect inference, information loss, and
computational complexity.
As depicted in Figure 4.1, input variables are decomposed into 𝑔 groups in stage
1. By applying rule-based inferential modelling and prediction to each group based
on the MAKER framework, we can generate the probability for each class (system
output) of each group of each observation. The inputs for stage 2 are the 𝑔
probabilities for each class (system output) of each observation. The final
prediction is achieved by combining these MAKER-generated outputs from all
groups through an ER rule or BRB approach, known as the MAKER-ER-based or
MAKER-BRB-based model, respectively.
The discussion below starts with an explanation of the modelling core and
inference mechanism in stage 1, where the MAKER framework is applied to each
group of input variables, followed by the final inference mechanism in stage 2. We
also present a comparison analysis with other modelling and prediction
approaches to highlight the advantages of the hierarchical MAKER framework.
• The modelling core and inference mechanism
The graphical representation of referential value-based discretization for one input
variable (upper) and two input variables (lower) is depicted in Figure 4.3. The
number of referential values is defined for each input variable, denoted by
$A_i^l$ ($l = 1, \ldots, M$; $i = 1, \ldots, I_l$), where $M$ is the number of
input variables and $I_l$ is the number of referential values of the $l$th
input variable. Through referential value-based discretization, a continuous
space is decomposed into $(I_1 - 1) \times (I_2 - 1) \times \cdots \times
(I_M - 1)$ sub-spaces. An observation is located within one of these
sub-spaces.
Figure 4.3. Referential Value-based Discretization Technique: an input variable (upper), and two input variables (bottom)
As seen in Figure 4.3, an observation value (green dot) lies between two adjacent
referential values of an input variable (red dots). Through discretization, an input
value is transformed into a discrete value with the corresponding belief distribution
for referential values. In a higher dimension – for example, for two input variables
as depicted in Figure 4.3– a continuous space is decomposed into (5 − 1) × (4 −
1) = 12 sub-spaces. An observation is located within a sub-space determined by
the intersections of the referential values. This concept is also applicable for higher
dimensions with more input variables.
Each intersection of referential values, denoted by blue and red dots in
Figure 4.3, represents the 'IF' form in the concept of the BRB as discussed in
Section 0, specifically in Equation (4.22). The 'IF' form, expressed as
$A^k = (A_1^k, A_2^k, \ldots, A_{T_k}^k)$ and known as a packet antecedent
$A^k$, should be interpreted in this study as a combination of referential
values of the input variables, or an intersection of referential values. A
belief degree or probability for each system output is assigned to each
intersection, forming the 'THEN' expression: i.e.
$\{(D_1, \beta_{1k}), (D_2, \beta_{2k}), \ldots, (D_N, \beta_{Nk})\}$. The belief
degree or probability of each output (consequence) is obtained by combining
pieces of evidence from a group of input variables and their corresponding weights
using a MAKER rule considering the interdependency of two pieces of evidence,
as described in Section 4.3. The weights of the combined pieces of evidence (or a
packet antecedent) which are obtained by Equation (4.18) are used for inference.
This is how a BRB is generated, from which an inference can be made.
The similarity degree explained in Section 4.3.2 measures how close an
observation is to the intersections of referential values or the combinations of
referential values. On this basis, we can estimate the relative location of an input
vector in the input space. Logically, the greater the number of referential values,
the higher the location accuracy of an input vector in the input space. However,
this greater accuracy also causes greater model complexity.
The similarity degrees activate the intersections of the referential values with the
corresponding probabilities of each output (consequence) – that is, the belief rules.
As depicted in Figure 4.3 for the case of two input variables, an input vector
(green dot), based on its degrees of similarity to the combinations of
referential values, activates the intersections of referential values (red
dots). On the basis of the
degrees of similarity and the weights of intersections of referential values, we can
obtain updated weights that measure the degree to which an intersection of
referential values is triggered by an observation. The activated belief rules are then
combined using Equations (4.16)–(4.18). In this way, we can obtain the probability
of each output (consequence) for an input vector resulting from a group of input
variables in stage 1. The rule-based modelling and prediction based on the
MAKER framework explained above is applied for each group of input variables in
stage 1.
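To make the activation mechanism concrete, the following simplified sketch uses the product of per-variable belief degrees as the matching degree of each intersection; it deliberately omits the MAKER evidence weights and interdependence indexes, so it illustrates rule activation only, not the full framework:

```python
import numpy as np
from itertools import product

def belief_distribution(x, ref_values):
    """Belief degrees of x over sorted referential values."""
    ref = np.asarray(ref_values, dtype=float)
    b = np.zeros(len(ref))
    if x <= ref[0]:
        b[0] = 1.0
    elif x >= ref[-1]:
        b[-1] = 1.0
    else:
        i = np.searchsorted(ref, x) - 1
        b[i] = (ref[i + 1] - x) / (ref[i + 1] - ref[i])
        b[i + 1] = 1.0 - b[i]
    return b

def rule_activations(obs, refs_per_var):
    """Matching degree of each intersection of referential values,
    taken here as the product of per-variable belief degrees. Only
    intersections adjacent to the observation get nonzero weight."""
    per_var = [belief_distribution(x, r) for x, r in zip(obs, refs_per_var)]
    act = {}
    for combo in product(*(range(len(b)) for b in per_var)):
        w = float(np.prod([per_var[v][i] for v, i in enumerate(combo)]))
        if w > 0:
            act[combo] = w
    return act

acts = rule_activations((3.0, 1.0), [[0.0, 2.0, 4.0], [0.0, 2.0]])
print(acts)  # four activated intersections, weights summing to 1
```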
Stage 2 of the hierarchical MAKER framework accepts the MAKER-generated
outputs from all groups of input variables from stage 1 as the input for making final
inferences. The inferences are presented by the probability of each output
(consequence) of each observation. Hence, we have numerical inputs in stage 2.
Suppose that in stage 1, input variables are split into 𝑔 groups with 𝐾 classes
for the output variable. In stage 2, we have 𝑔 input variables with 𝐾
referential values for each of the input variables. Therefore, we have a BRB
consisting of 𝐾^𝑔 belief rules. For example, with 𝑔 = 2 groups and 𝐾 = 2
classes, the stage-2 BRB contains 2^2 = 4 belief rules.
Based on the concept of a BRB, a packet antecedent should be expressed in the
form ‘if a group in the system points to consequence 𝑘’ and the ‘THEN’ form should
be expressed as ‘the probability of each consequence’. The MAKER-generated
probability of a group of input variables in stage 1 represents how likely the group
points to a certain output (consequence) that naturally indicates a belief distribution
of referential values in stage 2. If all groups fully support a certain output
(consequence) with a probability of 1.0, the final inference must logically
assign a probability of 1.0 to that class. On the other hand, if all groups
completely oppose a certain class, the probability of that class must be 0.
belief rules representing the conflicting inference made by the groups in stage 1
can be trained. On the basis of the similarity degrees and the trained belief degree
of each output (consequence) of each belief rule, we can obtain a probability
pointing to each output (consequence) as a final inference by an observation. This
approach describes the MAKER-BRB-based model.
According to the explanation above, we can acquire the probabilities pointing to
different class membership generated by a MAKER rule from a group of input
variables in stage 1. In stage 2, we can perceive these probabilities as pieces of
evidence that can be directly combined through a MAKER rule using Equations
(4.16)–(4.18). We can obtain the weight of each group by Equation (4.18) when
combining the activated belief rules in stage 1. Given those pieces of evidence with
their corresponding weights, we can use Equation (4.16) to combine evidence, and
therefore we can generate the probability of each consequent with all input
variables in the system being considered. This approach, the MAKER-ER-based
model, is more direct than the former one.
• The advantages of hierarchical MAKER frameworks
As is clear from the explanation above, the hierarchical MAKER framework can
acquire evidence and measure the interdependencies of pairs of evidence directly
from data using statistical analysis. The input variables are split according to an
adjustment to avoid violations of statistical requirements, since the validity of
inferences drawn from this modelling and prediction approach depends on how
well the framework meets the statistical requirements. In each group of input
variables, combining multiple pieces of evidence from the input variables based on
the MAKER framework generates a BRB. On the basis of the BRB and the degree
of similarity between the input vector and packet antecedents of the BRB,
predictions can be generated for these sub-models – that is, groups of input
variables. The outputs generated by sub-models – that is, groups of evidence –
are then aggregated on the basis of either an ER rule or a BRB to make a final
inference. For any given observation, on the basis of the BRB and a maximum
likelihood prediction, an inference can be made. It can be seen that this
approach is completely transparent and interpretable, resulting in an
objective, robust, and rigorous data-driven inference method. The model
parameters can be trained to maximise the likelihood of true states. Through
designated machine learning, the parameters can be optimised under the
optimisation function discussed in Section 4.6.
Machine learning models can make predictions with a high level of accuracy, but
they often do not have the ability to explain how their algorithm arrives at its
conclusion or prediction. This ability is known as ‘interpretability’ and is defined by
Kim et al. in Carvalho et al. (2019) in the context of a machine learning system as
‘the degree to which a human can consistently predict the model’s result’. It was
also recently defined as the ‘ability to explain or to present in understandable terms
to a human’ by Doshi-Velez and Kim in Carvalho et al. (2019). Interpretability is
crucial for learning transfer, extraction of scientific findings, behaviour explanation,
modelling faulty assessment, and so on. In addition, interpretability can increase
human trust and acceptance of a model, which is a key factor in determining
whether users want to use it (Carvalho et al., 2019).
In the customer choice model, interpretability plays an important role. If a model is
a black box not revealing a transparent relationship between input and output, the
model can only generate predictions without explanations. Scientific findings
remain completely hidden in the model: for example, why a customer makes a
particular decision at one point and a different decision at another. With an
interpretable model, we can trace the differences in inputs leading to different
customer decisions. Analysing customer behaviour is a fundamental need that
must be met to drive managerial decision-making.
Logistic regression, classification trees, k-nearest neighbours and naive Bayes
models are commonly used interpretable machine learning models. They have
meaningful parameters and/or features; based on these, useful information can be
extracted, and predictions can be explained.
The weights in logistic regression are the interpretable elements. We can observe
the estimated odds change that results from the increase of a feature by one unit.
However, logistic regression is restricted to binary classification, assumes
a linear relationship between the inputs and the log-odds of the output, and
treats all inputs as independent of each other. The hierarchical MAKER
framework can deal with nonlinear binary
classification and multiple classification. It also takes the interdependencies
between input variables into account through a measured interdependence index.
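The odds-ratio interpretation of logistic regression weights can be illustrated with scikit-learn on a synthetic dataset; the dataset and settings below are our choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(weight) gives the multiplicative change in the odds of the
# positive class when a feature increases by one unit
odds_ratios = np.exp(model.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"feature {i}: odds multiplied by {ratio:.3f} per unit increase")
```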
A classification tree can be used in situations in which the relationship between
input variables and outputs is nonlinear, and there is interaction among input
variables. A classification tree recursively partitions the input space into regions,
and each observation belongs to exactly one region. MAKER-based classifiers
also divide the input space into sub-spaces (regions). However, MAKER-based
classifiers are more representative of reality because they use the probabilities of
the intersections of referential values and the degree of similarity between an
observation and the intersections of referential values to generate predicted
outputs.
The tree structure delivers arguably simple interpretations with natural
visualization. However, a classification tree is quite unstable and lacks smoothness
(Molnar, 2019). The cut-off points of the input variables and the structure of the
classification tree can be completely changed by just a few changes in the training
sets. Moreover, slight changes in the input variables can have a large impact on
the predicted outputs, which is a rather unintuitive and undesirable outcome.
MAKER-based models are generally more stable and smoother than classification
trees. Each input value activates a number of belief rules, and an inference can be
made on the basis of these belief rules and the degree of similarity between an
input value and referential values. The same principle is applied for higher levels
in the hierarchy. This means a subtle change in referential values (cut-off points)
will not have a large impact on the predicted outputs of the hierarchical MAKER
framework.
Naïve Bayes classifiers make predictions based on Bayes’ theorem with a naive
assumption of conditional independence between input variables. The contribution
of each input variable toward the predicted output is very clear, making Naïve
Bayes an interpretable classifier. The approach, however, requires prior
probabilities. MAKER-based classifiers are not dependent on prior probabilities. If
available, prior probabilities are treated as independent pieces of evidence (Yang
and Xu, 2017).
Unlike the above-mentioned classifiers, k-nearest neighbour classifiers are
instance-based learning algorithms. This non-parametric method makes
predictions based on the proximity of an observation to other instances. There is
no interpretability at the modular level. The other above-mentioned classifiers can
explain how parts of the model affect predictions, but a k-nearest neighbour
classifier cannot reach this level of interpretability. We can explain why an
observation belongs to a certain class by retrieving the k neighbours that are used
for predictions. The models become less interpretable, however, as the number of
input variables increases.
Two important criteria when developing prediction models are accuracy and
interpretability (Mori and Uchihira, 2019). However, these criteria are connected
and often competing: that is, the more accurate the prediction, the less
understandable it becomes (Carvalho et al., 2019). Although approaches using
logistic regression, classification trees, naïve Bayes, and k-nearest neighbours are
easy to interpret, they are generally less accurate than the more complex and
opaque models. The interpretability aspect of the hierarchical MAKER framework
is demonstrated in the following chapters in an application predicting customer
types and customer decisions in revenue management. The performance of the
hierarchical MAKER framework is compared with other machine learning methods
in terms of accuracy and other metrics.
4.7.3. Performance Comparison
In order to analyse whether the hierarchical structure proposed in this thesis
affects the predictive power, model complexity, and computation time, this section
presents a performance comparison for MAKER, BRB, and hierarchical MAKER
frameworks – MAKER-BRB- and MAKER-ER-based models. Five binary
classification datasets with four input variables were generated by the
'make_classification' and 'make_blobs' functions provided by sklearn in
Python. In this study, the 'make_classification' function generates a random
binary classification problem by initially creating clusters of points
normally distributed about the vertices of a four-dimensional hypercube and
then assigning an equal number of clusters to each class. It introduces
interdependencies between input variables. To increase the complexity of the
classification problem, we can add more clusters per class and decrease the
separation between classes, leading to a complex non-linear decision boundary
for the classifier. We can also add noise to the dataset to test the efficacy
of the classifier.
In this study, we set two clusters per class with a normal decision boundary.
The 'class_sep' parameter, which determines how well the clusters are
separated, was set to 1.5; the larger the value of 'class_sep', the less the
clusters overlap, and a value of 1.5 is considered a normal difficulty level.
The 'flip_y' parameter, which determines the fraction of data points whose
class is randomly assigned, was set to .20, meaning that 20% of the dataset
was noise. The 'make_blobs' function, which generates isotropic Gaussian blobs
for clustering, was used for dataset 5. The characteristics of the datasets
are presented in Table 4.2. All the
datasets in Table 4.2 consisted of 200 samples. The scatterplots of the data points
of each dataset can be seen in Figure 4.4.
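The dataset generation described above can be reproduced approximately as follows; the random seeds and the choice of four informative features are our assumptions:

```python
from sklearn.datasets import make_classification, make_blobs

# Datasets 1-4: binary problems with four input variables;
# 'n_clusters_per_class', 'class_sep' and 'flip_y' vary as described.
X1, y1 = make_classification(n_samples=200, n_features=4,
                             n_informative=4, n_redundant=0,
                             n_clusters_per_class=2, class_sep=1.5,
                             flip_y=0.0, random_state=0)
X3, y3 = make_classification(n_samples=200, n_features=4,
                             n_informative=4, n_redundant=0,
                             n_clusters_per_class=2, class_sep=1.5,
                             flip_y=0.2, random_state=0)   # 20% label noise

# Dataset 5: isotropic Gaussian blobs with two centres
X5, y5 = make_blobs(n_samples=200, n_features=4, centers=2, random_state=0)
```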
Table 4.2. Generated datasets with four input variables

Dataset   With/without noise             Number of clusters per class
1         Without noise ('flip_y' = 0)   2 ('class_sep' = 1.5)
2         Without noise ('flip_y' = 0)   1
3         With noise ('flip_y' = .2)     2 ('class_sep' = 1.5)
4         With noise ('flip_y' = .2)     1
5         Blobs, 2 centres, 4 input variables
The datasets included the observed input values and the observed output
values. With all of these observed input-output data pairs, we can use rule-based
inferential modelling and prediction to develop models, and train the parameters of
the models by minimising the differences between the observed output values and
the predicted output values generated by the classifiers. In this study, the
referential values were set to be fixed as the minima and the maxima of the
observed input values. Hence, all parameters other than the referential values
need to be trained. All the datasets had at least five cases per cell of the joint frequency
matrices between input variables and hence, a full MAKER framework could be
implemented. For a hierarchical MAKER framework, we split the input variables
into two groups of evidence; MAKER is performed for each group, and the
predicted outputs of each group are then combined at the upper level by
applying the BRB or ER rule to suggest a final inference regarding whether an
observation belongs to a certain class given the values of the four input
variables. Because all the
input variables in the datasets were informative, how we split the input variables
did not matter.
Figure 4.4. Scatter plot from the datasets
MAKER, BRB, MAKER-BRB, and MAKER-ER were applied for all the datasets.
We utilised five-fold cross validation: each dataset was partitioned into five
folds with a similar class distribution using stratified five-fold cross
validation in Python. Each fold was in turn treated as the test set, while the
remaining folds acted as the training set. Therefore, we obtained five rounds
for each classifier for each dataset. The processes of modelling, prediction,
and parameter learning are summarised as follows.
The steps of modelling, prediction, and parameter learning of a full MAKER model
are displayed below.
Step 1: Applying the approach of rule-based inferential modelling and prediction –
such as evidence acquisition, analysis of evidence interdependence, and inference
making – to develop a full MAKER-based model with minima and maxima of the
observed input values as fixed referential values.
Step 2: With the training set, using SLSQP to train the relevant weights of
referential values to obtain the optimised weights of referential values.
Step 3: Generating the predicted outputs of the test set on the basis of a
full MAKER model with optimised weights of referential values.
Step 4: Performing model evaluation by comparing the observed against the
predicted outputs, and recording the computation time, that is, the time
required to learn the pattern of the data.
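The SLSQP training step can be illustrated with `scipy.optimize.minimize`; the `predict` function below is a deliberately simple placeholder standing in for the model inference, so this sketch shows only how bounded parameters are trained by minimising the observed-versus-predicted gap:

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder model (NOT the MAKER inference): a weighted sum of
# inputs squashed to (0, 1).
def predict(params, X):
    return 1.0 / (1.0 + np.exp(-X @ params))

def mse_loss(params, X, y):
    """Gap between observed and predicted outputs."""
    return float(np.mean((predict(params, X) - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# train parameters constrained to [0, 1] with SLSQP
result = minimize(mse_loss, x0=np.full(4, 0.5), args=(X, y),
                  method="SLSQP", bounds=[(0, 1)] * 4)
print(result.x, result.fun)
```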
The steps of modelling, prediction, and parameter learning of a BRB model are
displayed below.
Step 1: Developing a belief rule base consisting of a packet antecedent (a
combination of referential values of the input variables) and the
probabilities of each consequence of each belief rule, that is, the
probability that an observation with the corresponding input values belongs to
a certain class. Initial belief degrees are generated.
Step 2: With the training set, using SLSQP to train belief degrees to obtain
the optimised belief degrees that minimise the gap between the observed and
the predicted outputs of the training set.
Step 3: Generating the predicted outputs of the test set on the basis of a BRB
model with optimised belief degrees.
Step 4: Performing model evaluation.
The steps of modelling, prediction, and parameter learning of a hierarchical
MAKER model are summarised in the following part.
Step 1: Splitting the input variables into two groups of evidence.
Step 2: Using the approach of rule-based inferential modelling and prediction
(evidence acquisition, analysis of evidence interdependence, and inference
making) to develop a full MAKER-based model for each group of evidence, with
the minima and maxima of the corresponding observed input values as fixed
referential values.
Step 3: Aggregating the predicted outputs of both groups of evidence by
applying a BRB or ER rule (the MAKER-BRB-based and MAKER-ER-based models,
respectively) to make a final inference.
Step 4: With the training set, using SLSQP to train the weights (the
parameters of the MAKER-based model of each group of evidence) and to train
the belief degrees of the BRB if the MAKER-BRB-based model is applied.
Step 5: Performing model evaluation.
In this section, we compared the model performances of a full MAKER, BRB,
MAKER-ER-based model, and MAKER-BRB-based model on the five datasets –
two clusters per class without noise, one cluster per class without noise, two
clusters per class with noise, one cluster per class with noise, and gaussian blobs.
Each of the five datasets had been partitioned into five folds using stratified five-
fold cross validation in Python to ensure each fold has a similar class distribution.
As previously explained, each of the datasets had two fixed referential values.
Hence, only weights and belief degrees were trained.
To compare the classifiers, performance measures are required. These include accuracy, AUCROC, MSE, computation time, and the number of trained parameters. The classification threshold was set to .5. A perfect classifier yields an AUCROC and an accuracy of 1, whereas an AUCROC of .5 is no better than a random classifier. The lower the MSE, down to a minimum of 0, the better the model.
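These measures can be computed with scikit-learn; the labels and predicted probabilities below are hypothetical, and the .5 threshold matches the one stated above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

y_true = np.array([0, 0, 1, 1, 1, 0])               # toy observed classes
p_hat = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1])    # toy predicted P(class 1)
y_pred = (p_hat >= 0.5).astype(int)                  # apply the .5 threshold

acc = accuracy_score(y_true, y_pred)     # 1 for a perfect classifier
auc = roc_auc_score(y_true, p_hat)       # .5 is no better than random
mse = mean_squared_error(y_true, p_hat)  # lower is better, 0 is best
print(round(acc, 4), round(auc, 4), round(mse, 4))
```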
As explained previously, we utilised five-fold cross validation. One fold was selected as the test set and the remaining folds were used to train the model. The optimised model parameters obtained from model training were then applied to the test set. If a classifier can generalise the pattern of the data, its performance on the test sets is similar to its performance on the training set. Hence, in this section, we present the model performances on the test sets over the five rounds. Tables 4.3-4.7 provide the scores for computation time, accuracy, AUCROC, and MSE. The average scores of these measures are summarised in Table 4.8.
Table 4.3. Performance measures for the dataset 1
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 57 73 70 64 68 66.4
2 BRB 132 132 137 155 234 158
3 MAKER-BRB 97 86 92 102 99 95.2
4 MAKER-ER 31 116 96 129 112 96.8
Accuracy
1 MAKER .9512 .9000 .9000 .9750 .9231 .9299
2 BRB 1.0000 .9250 .9000 .9750 .9824 .9565
3 MAKER-BRB .9024 .8750 .9250 1.0000 .9487 .9302
4 MAKER-ER .9512 .8250 .9000 .9950 .9487 .9240
AUCROC
1 MAKER .9929 .9600 .9775 .9975 .9711 .9798
2 BRB 1.0000 .9800 .9725 .9975 .9487 .9797
3 MAKER-BRB .9857 .9575 .9775 1.0000 .9684 .9778
4 MAKER-ER .9476 .9299 .9700 .9950 .9605 .9606
MSE
1 MAKER .1022 .1241 .1062 .0719 .1066 .1022
2 BRB .0785 .1010 .0890 .0583 .0774 .0808
3 MAKER-BRB .0693 .0982 .0753 .0753 .0745 .0785
4 MAKER-ER .1341 .1491 .1176 .0901 .1289 .1240
Table 4.4. Performance measures for the dataset 2
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 51 54 60 49 62 55.2
2 BRB 188 218 205 238 222 214.2
3 MAKER-BRB 108 281 96 191 81 151.4
4 MAKER-ER 60 58 47 89 59 62.6
Accuracy
1 MAKER .9756 .9750 1.0000 1.0000 .9487 .9799
2 BRB .9756 .9500 1.0000 1.0000 .9487 .9749
3 MAKER-BRB .9756 .9250 1.0000 1.0000 .9231 .9647
4 MAKER-ER .9756 .9750 .9750 .9750 .9763 .9754
AUCROC
1 MAKER .9976 .9975 1.0000 1.0000 .9658 .9922
2 BRB 1.0000 1.0000 1.0000 1.0000 .9658 .9932
3 MAKER-BRB 1.0000 .9450 1.0000 1.0000 .9579 .9806
4 MAKER-ER 1.0000 1.0000 .9950 1.0000 .9487 .9887
MSE
1 MAKER .0487 .0722 .0650 .0576 .0831 .0653
2 BRB .0246 .0422 .0338 .0348 .0624 .0396
3 MAKER-BRB .0458 .0742 .0554 .0479 .0822 .0611
4 MAKER-ER .0261 .0386 .0439 .0269 .0569 .0385
Table 4.5. Performance measures for the dataset 3
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 73 96 114 128 173 116.8
2 BRB 268 312 170 464 336 310
3 MAKER-BRB 77 75 78 68 73 74.2
4 MAKER-ER 83 87 113 135 145 112.6
Accuracy
1 MAKER .7561 .8049 .7750 .7179 .7692 .7646
2 BRB .8780 .8780 .8000 .8205 .8205 .8394
3 MAKER-BRB .7317 .7805 .7750 .7436 .7436 .7549
4 MAKER-ER .7561 .8293 .8000 .7436 .7692 .7796
AUCROC
1 MAKER .8429 .8762 .8396 .8000 .7895 .8296
2 BRB .9405 .9381 .8596 .8895 .9026 .9061
3 MAKER-BRB .8333 .8571 .8396 .7579 .7868 .8149
4 MAKER-ER .8286 .8044 .8396 .7737 .8184 .8129
MSE
1 MAKER .1668 .1414 .1696 .1793 .1839 .1682
2 BRB .1110 .1059 .1449 .1197 .1299 .1223
3 MAKER-BRB .1684 .1479 .1585 .1841 .1805 .1679
4 MAKER-ER .1823 .1596 .1754 .1876 .1904 .1791
Table 4.6. Performance measures for the dataset 4
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 146 107 143 100 87 116.6
2 BRB 288 380 251 148 142 241.8
3 MAKER-BRB 126 124 118 111 92 114.2
4 MAKER-ER 127 260 175 87 118 153.4
Accuracy
1 MAKER .8049 .8095 .7250 .6667 .8462 .7705
2 BRB .8049 .8049 .7750 .7439 .8718 .8001
3 MAKER-BRB .8293 .8049 .7250 .6923 .8205 .7744
4 MAKER-ER .8049 .7561 .7750 .7179 .7692 .7646
AUCROC
1 MAKER .8548 .8293 .8049 .7974 .8711 .8315
2 BRB .8643 .8024 .8049 .8132 .8912 .8352
3 MAKER-BRB .8548 .7786 .8025 .7947 .8842 .8230
4 MAKER-ER .8071 .7952 .7875 .7947 .8605 .8090
MSE
1 MAKER .1679 .1841 .1862 .1885 .1804 .1814
2 BRB .1525 .1756 .1868 .1804 .1425 .1676
3 MAKER-BRB .1477 .1778 .1804 .1864 .1434 .1671
4 MAKER-ER .1806 .1905 .1911 .1891 .1722 .1847
Table 4.7. Performance measures for the dataset 5
No. Model Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Computation time (in seconds)
1 MAKER 58 92 86 57 74 73.4
2 BRB 220 183 202 176 200 196.2
3 MAKER-BRB 31 24 56 26 29 33.2
4 MAKER-ER 131 76 115 103 120 109
Accuracy
1 MAKER 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
2 BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
3 MAKER-BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
4 MAKER-ER 1.0000 .9750 1.0000 1.0000 1.0000 .9950
AUCROC
1 MAKER 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
2 BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
3 MAKER-BRB 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
4 MAKER-ER 1.0000 .9500 1.0000 1.0000 1.0000 .9900
MSE
1 MAKER .0219 .0236 .0189 .0228 .2080 .0590
2 BRB .0054 .0081 .0049 .0050 .0051 .0057
3 MAKER-BRB .0059 .0070 .0023 .0055 .0036 .0049
4 MAKER-ER .0228 .0437 .0316 .0222 .0206 .0282
Table 4.8. Grand averages of performance measures of the five generated
datasets
Model Number of parameters Computation time (in seconds) Accuracy AUCROC MSE
MAKER 16 85.68 .8890 .9266 .1152
BRB 32 224.04 .9142 .9428 .0832
MAKER-BRB 24 93.64 .8848 .9193 .0959
MAKER-ER 16 94.3 .8877 .9123 .1109
With four input variables, the MAKER, BRB, MAKER-BRB-based, and MAKER-ER-based models required 16, 32, 24, and 16 trained parameters, respectively. Longer computation time is required as the number of parameters increases. As seen in Table 4.8, BRB took the longest training time, 224.04 seconds, while MAKER required the shortest, 85.68 seconds. The MAKER-BRB- and MAKER-ER-based models recorded similar times – 93.64 and 94.3 seconds, respectively. These were less than half of BRB's computation time but still slightly higher than that of the full MAKER model.
According to Tables 4.3-4.7, all the classifiers performed similarly on the five datasets. In general, the accuracy of the hierarchical MAKER models – .8848 and .8877 for the MAKER-BRB- and MAKER-ER-based models, respectively – was similar to that of the full MAKER model, .8890. All the classifiers also had similar AUCROC and MSE scores. This result signifies that, like the full MAKER and BRB models, the hierarchical MAKER frameworks can generalise the pattern of the data and perform well on unseen data – that is, the test sets.
Figure 4.5 plots the average computation time of each classifier against its average performance measures – AUCROC and accuracy. It clearly shows that, on the same datasets, BRB required more trained parameters, which increased its computation time. The number of rules in a BRB increases exponentially as the number of input variables, the referential values of each variable, and the number of consequents increase (Yang and Xu, 2017). Meanwhile, the MAKER-BRB- and MAKER-ER-based models required slightly longer computation times than the full MAKER model did, but their performance was similar to that of the full MAKER model.
Figure 4.5. Plot of the grand average scores of performance measures (AUCROC, accuracy, and computation time in seconds) of the five generated datasets for each model
Thus, we concluded that the hierarchical structure applied in the MAKER-BRB- and MAKER-ER-based models does not significantly increase the computation time required to train parameters or to learn the pattern of the data. Their generalisation capability, based on accuracies, AUCROCs, and MSEs, is similar to that of the full MAKER model. The hierarchical MAKER models can perform well on datasets with complex non-linear boundaries and noise.
4.8. Summary
In this chapter, we have explained the algorithm of the hierarchical MAKER
framework – namely, the MAKER-ER-based and MAKER-BRB-based models with
a referential value-based discretization technique for data transformation, one of
the main contributions of this research. First, we presented the evidence
acquisition process, the measurement of interdependencies between input
variables, evidence combination approaches, the generation of a BRB, and the
bottom-up inference process in the hierarchical MAKER framework. We then
performed a comparative analysis between this framework and other machine
learning methods to highlight the framework’s advantages, and we compared the
referential value-based discretization technique used within the framework with
other discretization techniques. We conducted a comparative analysis of the full MAKER, BRB, and hierarchical MAKER frameworks based on the five generated datasets with complex non-linear boundaries and noise. Performance measures including computation time, accuracy, AUCROC, and MSE were presented. The hierarchical MAKER frameworks required less computation time than BRB did, while their performance was similar to that of the full MAKER.
Chapter 5 Application to Customer
Classification
5.1. Introduction
This chapter presents the application of hierarchical rule-based inferential modelling and prediction, based on the MAKER framework, to customer classification in revenue management. The chapter is organised as follows. Section 5.2 explains the
theoretical foundations and the formulation of a conceptual framework in customer
detection. It includes a literature review on customer types in revenue
management, the opportunity for customer detection from perceptible booking-
related behaviours, and the booking setting applied in the real case used in this
research. Section 5.3 describes data linkage to 1) extract the desired dataset,
comprising the estimated values of the input variables; and 2) label customer types
based on a customer booking journey. Section 5.4 describes the data preparation
including data cleaning, data transformation, and data partitioning. Section 5.5 explains how the classifiers, that is, the MAKER-ER- and MAKER-BRB-based models, were built and trained. Section 5.6 presents a comparative study of classifier performance for the proposed framework and other machine learning methods. A summary of the chapter is presented in Section 5.7.
5.2. Theoretical Foundations: Customer Types and
Behaviours
This section explains the theoretical foundation for identifying customer types and learning their booking behaviour. First, we select the customer types considered in this study according to recent literature in revenue management. Second, we critically analyse the possibility of detecting customer types through customer booking behaviour. Third, we present the business setting of the case study used in this research.
5.2.1. Customer Types in Revenue Management Practice
According to recent literature, there are four customer types with regard to purchase timing. These are summarised below.
Myopic customers. They buy the product immediately if the price fits their valuation (Su, 2007; Cachon and Swinney, 2009).
Strategic customers. These customers possess knowledge of dynamic pricing and a desire to save money or to gain a sense of achievement and excitement. They rationally time their purchase based on their expectations of future prices and other considerations, while weighing the risk of losing a ticket due to lack of stock or of paying more (Cason and Reynolds, 2005; Mak et al., 2014; Osadchiy and Bendoly, 2011; Reynolds, 2000). Their decision is formed through a learning process (e.g. Anderson and Wilson, 2003; Cleophas and Bartke, 2011) by searching for
information related to their goal. Different labels for strategic customers have been
used in the literature, such as deal-seeker, functional procrastination, forward-
looking customer, and rational buyer. However, these concepts are substantially
the same.
Bargain-hunter customers. These customers seek a sufficiently low discounted price. They appear at the end of a selling period and buy excess inventory (Cachon and Swinney, 2009; Cleophas and Bartke, 2011; Jerath et al., 2010; Ovchinnikov and Milner, 2012). This type has mainly been discussed in the retail industry and is considered in the case of markdown pricing, in which prices consistently drop.
Inertia. Recently, some scholars have introduced a behaviour termed customer inertia, in which customers delay the purchase even though the best decision is to buy immediately (Su, 2009). This behaviour may be caused by a psychological trait called dysfunctional procrastination (Darpy, 2000).
Osadchiy and Bendoly (2011) introduced a classification system based on
purchase patterns by financially motivated subjects, who were conditioned to act
strategically. Subjects showed different decisions once they were exposed to
dynamic pricing, even if they were all conditioned to act as a strategic customer
and received identical information. The classification consists of five types: 1)
rational strategic, who consistently make decisions fitting with the rational choice
model; 2) risk averse, who always choose to buy now; 3) risk affine, who always
choose to wait; 4) counter rational, who make decisions opposite to the rational
choice (e.g. choosing to wait when they should have bought); and 5) random, whose purchase pattern could not be identified.
This research considers two customer types: strategic and myopic. The other
customer types were omitted for several reasons. First, as mentioned earlier,
bargain-hunters appear if the markdown pricing strategy is applied, especially in
the retail industry (e.g. electronics or high-end fashion). As the product value
declines over time, the prices are naturally discounted as the season progresses
Aviv and Pazgal, 2008). Cleophas and Bartke (2011) considered myopic, strategic,
and bargain-hunter together in an airline case study. However, the model was
designed when an airline employs a markdown pricing strategy. In airlines, the
pricing strategy is dynamic and does not have a consistent pattern. Therefore,
bargain-hunter was excluded in this study. To the best of our knowledge, bargain-
hunters have mainly been discussed for the fashion industry (e.g. Cachon and
Swinney, 2009). Second, some researchers have explained customer inertia as resulting from human limitations in processing information and, accordingly, deciding away from the optimal path. In reality, human decisions generally may deviate from rational optimality. In addition, customers probably receive only partial information. Hence, in this study, customers who keep waiting when they should buy immediately are considered strategic customers who make suboptimal decisions due to the natural limitations of their cognitive processes.
5.2.2. Tangible Booking Behaviours
To illustrate the behaviour of strategic customers, Table 5.1 displays definitions
from selected scholarly publications. Some identified aspects of strategic
customers are as follows: 1) the propensity to wait or delay their purchase, 2) the
intention to maximise their utility or the value of money spent, 3) rational thinking,
4) the learning process in searching for information related to prices and probability
of stock-outs, 5) the tendency to rush in at the last minute, 6) communication with
other customers, and 7) cancel-rebook behaviour.
Table 5.1. Definitions of strategic customers
No. Definitions of strategic customers
1 ‘Customers are waiting for… anticipating price markdowns of…, and
tracking prices of ….’ (Zbaracki et al., 2004)
2 ‘Rational customers anticipating the pricing path….’ (Stokey, 1981)
3 ‘They time their purchase in anticipation of future discounts and need to consider not only future prices, but also the likelihood of stock-outs.’ (Aviv and Pazgal, 2008)
4 ‘They recognize that the product may become available on the salvage market and consider delaying their purchase…to maximize their expected surplus.’ (Ye and Sun, 2015)
5 ‘Strategic customers has become synonymous with this type of rational,
forward-looking purchasing behaviour.’ (Su and Zhang, 2009)
6 ‘They may reason strategically the best time to buy, search for deals,
rush in at the last minute.’ (Wang et al., 2013)
7 ‘They may strategically delay a purchase to learn more about product
value.’ (Cachon and Swinney, 2009)
8 ‘They are completely rational customers who can be opposed to
customers with bounded rational behaviour.’ (Shen and Su, 2007)
Table 5.1. Continued.
No. Definitions of strategic customers
9 ‘They are intertemporal utility maximizers.’ (Besanko and Winston,
1990)
10 ‘Strategic customers plan their buys according to their expectations, current observations, and communication with their peers.’ (Cleophas and Bartke, 2011)
11 ‘There are many indications that deal-seeking travelers continue to search after they have made a reservation, looking for an even better deal for the same tourism product or service…cancel their existing reservation and rebook the better deal.’ (Chen et al., 2011)
Many researchers have identified delay or waiting behaviour as a definite
consequence of strategic behaviour. Toh et al. (2012) showed through a
questionnaire with statistical tests that frequently checking for lower prices and
rebooking if necessary were significantly correlated with the behaviour of waiting,
checking for lower fares over time, and keeping contact with agents about lower
prices that were available. Similarly, Gorin et al. (2012) illustrated strategic behaviour with a real-life example of numerous rebooking occurrences from an airline database. A customer booked a fully refundable fare (Y class, €549), either because of uncertain travel plans or because no lower price was available. Then, a week before the departure date, they cancelled and rebooked at a lower price (B class, €439). Finally, five days before the travel date, they cancelled the previous ticket and rebooked at a lower price in V class (€107). The passenger was willing to pay for class Y but chose class V once it became available.
This behaviour is denoted as ‘cannibalisation’ in revenue management. Customers search for information about lower prices, secure a seat to reduce the risk of losing the ticket, and rebook once a lower price becomes available. This behaviour maximises their benefit. Therefore, cancel-rebook behaviour is potentially useful for detecting strategic purchasing.
5.2.3. Flexible Payment
In general, providers apply a cancellation policy in advanced booking settings. In many cases, a guaranteed reservation requires a deposit or full or partial prepayment. If the booking is cancelled, the advance payment is not fully (or is only partly) reimbursed and is kept as compensation by the provider.
on a ‘book now, pay later’ system can hold their seats at the posted price, for free
(Yip, 2019). Similar slogans have been introduced for hotels, such as ‘book now,
pay when you stay’ (Lorenz, 2019). This feature allows people to book and pay
later without worrying about sell-outs or price increases. However, agents normally
give a certain time limit for the ‘holding’ or consideration period. During this time,
the payment must be made, or the booking will automatically be cancelled. The
holding period ranges from minutes to several days, depending on the number of
days to departure and the policies agreed by agents and airlines. This feature has
been widely used by offline agents. However, some agents use similar terms but
with a different meaning. ‘Book now, pay later’ can also mean reserving a seat and
opting for monthly instalments. In this research, the first definition is used.
Online agents initially introduced the same features to compete with offline agents and to entice offline customers through payment flexibility. Another term for similar offerings is ‘Hold my booking’, for a minimum fee or zero fee, through both online
and offline ticketing offices. Offline agents are less restricted than online agents.
Through online channels, customers can generally hold their booking for up to 48 or 72 hours. Another term used is ‘free cancellation’. The
difference is that customers must pay the full price, but can be reimbursed without
any cancellation fee if they cancel within the specified period.
Strategic customers who perceive themselves as experienced and capable of
influencing other customers tend to gather and share information related to their
experiences. They may engage in discussions in online forums to share and influence other people’s purchase decisions (Clark and Goldsmith, 2005).
Cleophas and Bartke (2011) considered these interactions among customers in
their model. Beyond transaction data, tracking online forums, websites, social media, and other media is valuable.
Several forums or websites provide tips for making cheap bookings (e.g.
Flightdelayclaimsteam.com, 2019). They may explicitly suggest that customers
book, cancel, and repeatedly rebook by exploiting a 24-hour cancellation policy. In
addition, they suggest rebooking immediately, even before the original booking
expires, for extra safety. They highlight that prices will likely drop within 24 hours.
Zero-deposit ‘book now, pay later’ gives customers time to finalise their travel plans, check that the booking details are correct, and conduct more research if they wish. In addition, the risk of sell-outs and higher prices is reduced, since customers have secured a seat by paying a small deposit or even no deposit at all. Customers can secure a seat for the holding period while they seek other available lower prices. If a more favourable price appears, they may make another booking at minimal or zero cost.
5.3. Conceptual Framework
In this section, we discuss input variables that were identified through refinement
of the theoretical foundations and available data. Following the identification, we
describe how we extracted values of input variables to obtain the desired dataset.
As there were no labelled customer types in the system, we formulated a procedure to label each customer as strategic or myopic. The procedure mimicked strategic purchasing using the price information extracted from the system.
5.3.1. Influential Variables
For illustration, a real example from customer transaction records is presented. A passenger attempted to make a booking for a 17th Sep 19:55 flight by Lion Air from CGK to BDJ. He booked a class V ticket (Rp863k; Rp is Indonesia’s currency) on 31st July at 17:44, about six weeks before the departure date. He could hold the seat until 6th August 23:29 (6.24 days). At that time, either because he was unsure about his travel plan or because he realised that the prices were stable and no lower price was available, he let the ticket go. He then made a second attempt on 18th August at 07:25, with a holding period of about 6.67 days, at the same class (V, Rp836k). When a lower price (M, Rp583k) was released for the same flight, he made a third attempt and, on 31st August at 07:21, issued the ticket with full payment. In the end, he got 30.26% off the previous fare. Figure 5.1 gives an illustration of this real example.
Figure 5.1. Illustration 1 (several weeks before departure date)
Another example is illustrated in Figure 5.2. The booking was made by a passenger four days before the departure date. The first booking was for a Rp950k ticket from UPG to CGK on 18 Sep at 06:30. The length of the holding period was 9.5 hours (0.395 days). In the middle of the period, a lower class became available. He cancelled the first booking on 14 Sep at 20:00, before the holding period ended, immediately made another booking for class T (Rp862k), and purchased it 12 minutes later. These two real examples, with different arrival times, were chosen from numerous similar cases in the dataset. Although both passengers obtained lower prices through the cancel-rebook strategy, in other cases customers ended up paying a higher price by applying the same strategy.
Figure 5.2. Illustration 2 (some days before departure date)
Based on the literature explained in the previous section, there are certain typical behaviours among strategic customers when making a reservation: 1) spending the holding period monitoring prices, 2) continuously cancelling and rebooking, and 3) immediately rebooking once the previous reservation is cancelled or released by the system. In other words, given the same length of holding period, compared to myopic customers, strategic customers tend to spend a longer time, make more frequent attempts or bookings, and have a shorter interval between cancelling and booking again. We thus selected four input variables: the length of the holding period, the time spent confirming the booking, the frequency of bookings, and the interval between cancelling and rebooking, as listed in Table 5.2.
Table 5.2. Input variables
No. Input variable Label Unit
1 The length of ‘hold’ period HP Day
2 Time spent for confirming booking TS Day
3 Frequency of bookings FB Times
4 Interval between cancelling and booking again ICR Day
Figure 5.3 illustrates cancel-rebook behaviour when attempting to purchase a flight ticket. Customers may rebook due to changes in their travel plans – for example, changes to the departure date or time, origin or destination, number of tickets, or the composition of passengers (e.g. adults, infants, and children). To eliminate this effect on the classification model, data were recorded only if no such changes were made, indicating that customers probably enacted the behaviour to exploit dynamic pricing.
A customer attempts to book a ticket n times. She first books at time A1, and the seat is secured at the agreed price until B1. She can confirm the reservation at any time, denoted by C1, between A1 and B1 (A1 ≤ C1 ≤ B1). Given the holding period (H), she spends TS (‘time spent for confirming booking’) amount of time (0 ≤ TS ≤ H). Her choices are 1) to make the payment, or 2) to wait, either cancelling purposefully or being released by the system because no payment was made by B1. She then rebooks the same ticket CR1 units of time later, at A2. CR1 indicates the interval between cancelling and booking again. She repeats this cancel-rebook behaviour until the nth attempt, with avgCR as the average of CR1 to CRn-1.
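The four input variables can be computed directly from the booking timestamps. In the sketch below, the (An, Bn, Cn) tuples are made-up timestamps loosely echoing Illustration 1; the variable definitions follow the text.

```python
from datetime import datetime as dt

# Toy attempts as (A_n booking time, B_n hold deadline, C_n confirmation time).
attempts = [
    (dt(2019, 7, 31, 17, 44), dt(2019, 8, 6, 23, 29), dt(2019, 8, 6, 23, 29)),
    (dt(2019, 8, 18, 7, 25), dt(2019, 8, 25, 0, 0), dt(2019, 8, 25, 0, 0)),
    (dt(2019, 8, 29, 9, 0), dt(2019, 9, 4, 0, 0), dt(2019, 8, 31, 7, 21)),
]

def days(delta):
    return delta.total_seconds() / 86400.0

HP = [days(b - a) for a, b, c in attempts]        # length of 'hold' period
TS = [days(c - a) for a, b, c in attempts]        # time spent confirming booking
FB = len(attempts)                                # frequency of bookings
ICR = [days(attempts[i + 1][0] - attempts[i][2])  # cancel-to-rebook intervals
       for i in range(FB - 1)]
avg_icr = sum(ICR) / len(ICR)                     # avgCR over the n-1 intervals
print(f"HP1 = {HP[0]:.2f} days, FB = {FB}, avgCR = {avg_icr:.2f} days")
```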
Notes. An: booking time of the nth attempt; Bn: maximum time to make payment; Cn: confirmation time, i.e. the time when the status of the reservation is changed; CRn-1: the period between cancelling the (n-1)th attempt and booking again for the nth attempt; H: the length of the holding period; TS: time spent given the holding period; n: the number of attempts or bookings made.
Figure 5.3. Data linkage (diagram: booking records of the 1st to nth attempts are linked, via the search keywords origin-destination, departure date and time, and confirmation time, to a price database of posted prices; booked prices and actual consumer decisions are then compared with estimated strategic decisions to label consumer types)
5.3.2. Detecting Customer Types
Previous studies assume that strategic customers make a perfectly rational choice
(e.g. Aviv and Pazgal, 2008). However, this assumption is contradicted by the theory
of bounded rationality (Simon, 1955, 1956). Li et al. (2014) proposed that strategic
customers may have different levels of sophistication in predicting future prices. They
divided customers into three categories: 1) perfect foresight, 2) weak-form rational
expectation and 3) strong-form rational expectation. The authors explained that
although customers receive the same information, their decisions may be different but
still follow the rational choice model.
This study has a different setting from that of Li et al. (2014). Here, customers do not need to project the future price. Instead, they obtain perfect information in real time just before they confirm the booking. In other words, at any time before B1, they can update the information on the Internet to see whether lower prices are available. Hence, we assume that their decision to buy or to wait relies on perfect information.
The rational choice is to wait (i.e. cancel and rebook) if a lower price appears. If the
price remains the same, strategic customers can choose either to buy or to wait
depending on their patience (Su, 2009). Although a set of perfect information is given
to them, this does not guarantee that the outcome will be as expected. Time is needed
to process the rebooking, and offerings may change during this period. Hence,
some customers attempt to rebook before the previous booking is released.
Therefore, identifying customer types solely based on whether they obtain lower
prices can be misleading.
The identification of customer types was designed for one-time purchases. In each purchase, the customer may make one or more bookings before the payment. It is assumed that customers check prices just before they confirm at Cn. For each booking, we identified the rational choice using the price information given just before Cn (the confirmation time) and checked whether customers followed it. From the booking record, travel-related information – such as departure time, name of airline, origin-destination, and confirmation time (Cn) – was utilised to search for the price posted just before Cn. The rational choice is grounded in the comparison between the agreed price (pAn) and the posted price (pCn).
Rational choice = { buy, if pAn < pCn; wait, if pAn > pCn; either buy or wait, if pAn = pCn }
To label customer types, the principle of comparing actual customer decisions against the rational choice was applied (e.g. Mak et al., 2014). If the two were the same, the customer’s choice was consistent with the rational choice. If customers chose to buy when they should have waited, they were immediate buyers, which is similar to myopic behaviour. If they chose to wait when they should have paid immediately, they were labelled as persistently choosing to wait; this insistence on waiting was considered strategic waiting in this case. If they fully followed the rational choice over all n attempts, they were labelled ‘strategic’. If they showed immediate buying, they were categorised as ‘myopic’.
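The labelling procedure can be sketched as a small function. The price comparison follows the rational-choice rule described above (wait, i.e. cancel and rebook, when a lower posted price appears); the function and field names are illustrative, and, per the text, only immediate buying against the rational choice yields a ‘myopic’ label, while persistent waiting counts as strategic.

```python
def rational_choice(p_agreed, p_posted):
    """Rational choice at one confirmation point: rebook (wait) if a lower
    posted price appears; buy if the agreed price is the better deal."""
    if p_posted < p_agreed:
        return "wait"
    if p_posted > p_agreed:
        return "buy"
    return "either"  # price ties permit either action

def label_customer(bookings):
    """bookings: list of (agreed_price, posted_price, actual_decision).
    'myopic' if the customer ever bought when waiting was rational;
    otherwise 'strategic' (persistent waiting counts as strategic waiting)."""
    for agreed, posted, decision in bookings:
        if rational_choice(agreed, posted) == "wait" and decision == "buy":
            return "myopic"
    return "strategic"

print(label_customer([(950, 862, "wait"), (862, 900, "buy")]))  # strategic
print(label_customer([(950, 862, "buy")]))                      # myopic
```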
A customer booking normally creates a passenger name record (PNR), which
contains travel-related information such as the name of passenger, fare class, and
their flight sequence. When a cancellation occurs, a new PNR is generated. This
makes tracking cancel-rebook behaviour difficult and costly. Nonetheless, tracking
cancel-rebook behaviour can provide insight regarding strategic customers and may ultimately impact the airline’s revenue. Systematic tracking was developed in this research to extract the relevant input variables used to label and classify customers. Based on this information, a classification model was developed, to be applied in subsequent detection systems without scrutinising PNRs for every individual.
This classification model is proposed specifically for small and medium-sized travel agents. Such agents have less sophisticated revenue management and information systems than airlines do for dealing with different kinds of customer behaviour. In addition, airlines make pricing and capacity allocation policies for the full capacity of a flight in response to the presence of strategic customers; therefore, costly systematic tracking to detect strategic customers is worth implementing for airlines. However, a small or medium-sized travel agent may sell only a very small portion of the seats on a flight, or even just one or two tickets. To perform systematic tracking, travel agents would have to collect updates of price changes and ticket availability every hour until the last minute before departure and save them to their data storage. This exhaustive tracking would be done for only one or two tickets sold on a particular flight, so its benefit is not worth the effort. Instead, agents can use the proposed classification to detect customer types from customers’ past transaction history, without scrutinising PNRs, collecting updates of price changes and ticket availability, or conducting costly systematic tracking.
5.4. Data Preparation
Real-world data may be incomplete, noisy, and inconsistent, which leads to low performance, poor-quality outputs, and hidden useful patterns (Zhang, Zhang, and Yang, 2003). Data preparation is therefore required before model development to yield quality data. It includes data integration, data transformation, data cleaning, data reduction, and data partitioning (Zhang, Zhang, and Yang, 2003). This study mainly used data integration, data cleaning, and data partitioning. Data integration is the combination of technical and business processes used to combine data from different sources into the desired dataset, that is, meaningful and valuable information (Hendler, 2014); Section 5.3 presents the data linkage used to obtain the dataset for this study. Data cleaning includes dealing with missing values, noisy data, and outliers, and resolving inconsistencies (Zhang, Zhang, and Yang, 2003). Data partitioning divides the dataset into multiple smaller parts.
The focus of the study was to examine which factors can be used to discriminate between strategic and myopic customers in a dynamic-pricing environment through their cancel-rebook behaviour. The procedure used to label customer types mimicked strategic purchasing using the price information extracted from the system. Incomplete price information at the time closest to Cn (confirmation time) could bias or mislead the inference about whether customers follow strategic purchasing, and without price information, customer types could not be detected at all. Hence, for data cleaning, we accepted only complete records for each customer.
In data partitioning, we utilised five-fold cross-validation with stratified random sampling: the data were divided into five folds with similar class distributions. Customers who made several attempts, or who bought more than once, have multiple data points in the dataset. In this situation it is advisable to shuffle the dataset, that is, to randomly reorder it before splitting. The partitions obtained through k-fold cross-validation with shuffling then generally derive from different customers, which prevents the model from learning the patterns of particular customers. We employed stratified five-fold cross-validation with shuffling in Python to partition the dataset into five folds. Each fold was treated in turn as the test set, while the remaining folds acted as the training set, giving five rounds for each classifier.
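The partitioning step can be sketched as follows. The thesis states only that stratified five-fold cross-validation with shuffling was done in Python (scikit-learn's StratifiedKFold(n_splits=5, shuffle=True) is the usual tool), so this dependency-free round-robin split is an illustration rather than the exact procedure used:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Split indices into k folds with similar class distributions.

    Illustrative stand-in for scikit-learn's StratifiedKFold(shuffle=True):
    shuffle the indices of each class, then deal them round-robin to folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)          # shuffle within each class
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)  # deal indices round-robin to folds
    return folds

# Each fold serves once as the test set; the remaining folds form the
# training set, giving five rounds per classifier.
labels = ["myopic"] * 80 + ["strategic"] * 20
folds = stratified_kfold(labels, k=5)
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
```

Because each class is dealt separately, every fold inherits roughly the overall myopic/strategic proportions.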
5.5. Hierarchical Rule-based Models for Customer
Classification
In this section, the building of a hierarchical rule-based inferential modelling and prediction approach based on the MAKER framework – namely, the MAKER-ER- and MAKER-BRB-based classifiers – for predicting customer types is explained, together with a numerical study using the described dataset. As previously stated, we used four input variables – HP, TS, FB, and ICR – to predict the customer types: myopic or strategic. The definitions of these variables and customer types are detailed in Section 5.2. The data were shuffled and partitioned into five groups with similar class distributions based on stratified random sampling; the training set of the first group is used here to illustrate how the MAKER-ER and MAKER-BRB frameworks were applied to the customer-type dataset in this case study.
5.5.1. Hierarchical MAKER frameworks
A minimum of five cases per cell of the joint frequency matrices between the input variables – except for disjointed pieces of evidence – must be satisfied to implement a full MAKER framework. MAKER-ER- and MAKER-BRB-based models are designed when this statistical requirement is not satisfied. These models are also useful for reducing the multiplicative complexity in the number of referential values of the input variables in the belief rule base. To group input variables, one starts with the input variable that exerts the strongest impact on the model outcome and then adds the other input variables one by one, such that the joint frequency matrices of the pairs of input variables in a MAKER model fulfil the statistical requirement of at least five cases per cell.
For initialisation, since the data were not normally distributed, a Spearman correlation test was used to analyse the strength of the monotonic correlation between the input variables and the output, and among the input variables. According to Table 5.3, the input variables ranked from strongest to weakest correlation with the output variable were TS, FB, ICR, and HP; hence, customer decisions were most strongly influenced by TS and FB. Based on this order, we added the input variables one by one to TS until all the joint frequency matrices between the input variables had at least five cases per cell, except where pieces of evidence were disjointed due to structural zeros. Input variables that could not satisfy this condition were excluded and formed another group of evidence.
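Spearman's rank correlation, used here to rank the input variables, is simply the Pearson correlation of the ranks; a minimal dependency-free sketch (equivalently, scipy.stats.spearmanr):

```python
def rank(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Ranking the input variables by the absolute value of spearman(X_i, y) against the output reproduces the ordering step described above.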
Table 5.3. Descriptive statistics and Spearman correlation matrix

Factor   Min    Max     Mean    SD      Customer types   HP        TS        FB        ICR
HP       .007   6.732   .439    .911    .232**           1
TS       .000   6.056   .065    .293    .446**           –.160**   1
FB       1.000  34.000  1.232   1.045   .422**           –.074**   .395**    1
ICR      .000   1.199   .003    .042    .264**           –.029     .299**    .654**    1

Note: * correlation is significant at .05 (2-tailed); ** correlation is significant at .01 (2-tailed)

In this way, we defined the groups of evidence as depicted in Figure 5.4. Theoretically,
group 1 (HP-TS) explains how customers spend time given the length of the holding
period, and group 2 (FB-ICR) describes how quickly customers book again if they make
several attempts before the final purchase. The MAKER-based model was applied to each group of evidence, and its output presented the probability of a customer being myopic or strategic. These outputs were then aggregated into a final inference about whether a customer is myopic or strategic given the values of the four input variables.
5.5.2. Optimised Referential Values of the Model
This section demonstrates how to develop MAKER-ER- and MAKER-BRB-based classifiers for a model of the customer types, with a numerical study using the dataset explained in Section 3.3. We split the input variables into two groups of evidence: HP and TS as group 1, and FB and ICR as group 2. The output variable was the customer type: the ‘myopic’ or ‘strategic’ class. The definitions of these two types can be found in Section 5.2.
[Figure: two diagrams, one for the MAKER-ER-based model and one for the MAKER-BRB-based model. In each, the input variables – the length of the holding period, time spent confirming the booking, frequency of bookings, and the interval between cancelling and booking again – feed the MAKER-based classifiers for the two groups of evidence; the MAKER-generated outputs (probabilities p1/1-p1 of being myopic or strategic for group 1, and p2/1-p2 for group 2) are combined, via rules k in the BRB case, into the final inference: myopic or strategic.]
Figure 5.4. Hierarchical MAKER frameworks for customer classification
As explained above, the data were shuffled and then partitioned into five groups with similar class distributions. The model parameters – that is, the referential values and weights – were assigned to develop a MAKER framework. For the purpose of illustration, we use the optimised parameters of the first group as an example throughout this section.
Discretisation is often applied to transform quantitative data into qualitative data to
make learning from the qualitative data more efficient and effective. All the input
variables were numerical. A discretisation technique with referential values was
applied to all input variables.
Referential values consist of the lower and upper boundaries of the input variables for
the dataset and any values between those boundaries. The boundaries can be set
based on the minima and maxima of the observed values for the input variables of
the whole dataset. Alternatively, experts can determine the boundaries, such as in the
study by Kong et al. (2016) about trauma outcome. In this study, we utilised the
percentiles of the observed values for input variables of the whole dataset.
In this study, we set the percentiles of 1% and 99% as the lower and upper
boundaries. Table 5.4 demonstrates that the minimum and the first percentile of the
observed values of the input variable FB were 1. The 99th percentile and the maximum
of FB (observed) were 5 and 35 respectively. Almost all the customers – 99% – in the
dataset made five or fewer bookings. The significant difference between the 99th
percentile and the maximum of FB could indicate there were extreme values in the
dataset. Furthermore, 0.5% of the dataset (12 customers) made several attempts: 6
to 35 bookings. We set the 99th percentile as the upper boundary; hence, booking more
than five times was equivalent to booking five times. These percentiles were selected
because we could obtain complete joint frequency matrices, that is, all the cells of the
joint frequency matrices of the pairs of evidence did not have sampling zeros. In
addition, the performance of the classifiers was not significantly affected by this
modification. For other machine learning methods, we replaced the extreme values
with the values of these boundaries of each input variable.
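The boundary treatment described above – replacing extreme values with the 1st/99th percentile boundaries for the other machine learning methods – can be sketched as follows; the linear-interpolation percentile rule is an assumption, as the thesis does not state which convention was used:

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list.
    This interpolation convention is an assumption, not from the thesis."""
    pos = q / 100 * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def clip_to_percentiles(values, low_q=1, high_q=99):
    """Replace extreme values with the low/high percentile boundaries,
    mirroring the treatment of the input variables described above."""
    s = sorted(values)
    lo, hi = percentile(s, low_q), percentile(s, high_q)
    return [min(max(v, lo), hi) for v in values]
```

Applied to FB, for example, this caps the handful of customers with six or more bookings at the 99th-percentile boundary while leaving the bulk of the data untouched.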
Table 5.4. Percentiles of the dataset

Input variable   0%             100%    1%             99%
HP               6.551 × 10⁻³   6.732   9.982 × 10⁻³   5.532
TS               5.800 × 10⁻⁵   6.537   4.750 × 10⁻⁴   2.111
FB               1              35      1              5
ICR              –3.494         8.648   –.2939         .559
As explained in Section 3.7, the model parameters – including weights and referential
values – were optimised through sequential least squares programming (SLSQP) with
randomly set initial parameters, and the MSE score was used as an objective function.
Equations (4.23) and (4.24) were used for the MAKER-ER- and MAKER-BRB-based
models, respectively. The optimisation algorithm identifies the direction to find a new
solution based on the evaluation of the MSE score and was run for up to 200 iterations or until a tolerance of .0001 was reached.
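The thesis runs SLSQP (in SciPy, scipy.optimize.minimize(method="SLSQP")) over all parameters jointly; as a dependency-free illustration of the objective only, the sketch below scans a single candidate referential value and keeps the one minimising the MSE of a simple threshold prediction – a stand-in for, not a reimplementation of, the actual optimisation:

```python
def mse(predicted, observed):
    """Mean squared error between predicted class probabilities and labels."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

def fit_referential_value(xs, ys, lower, upper, steps=200):
    """Scan candidate referential values between the boundaries and keep the
    one whose threshold prediction minimises the MSE score. Grid search is
    used here purely for illustration; the thesis optimises all referential
    values and weights jointly with SLSQP."""
    best_v, best_score = lower, float("inf")
    for i in range(steps + 1):
        v = lower + (upper - lower) * i / steps
        # predict P(strategic) = 1 above the candidate value, else 0
        pred = [1.0 if x > v else 0.0 for x in xs]
        score = mse(pred, ys)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# toy data: strategic customers (y = 1) tend to have larger TS values
xs = [0.01, 0.05, 0.08, 0.3, 0.9, 1.5]
ys = [0, 0, 0, 1, 1, 1]
v, score = fit_referential_value(xs, ys, lower=0.0005, upper=2.111)
```

The optimised referential value lands between the two classes, which is exactly the behaviour visible in Figures 5.5 and 5.6.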
The target of the optimisation of both the MAKER-ER- and MAKER-BRB-based models is to maximise the likelihood of the true state of a training set and thereby minimise the MSE score, which denotes the difference between the model outputs and the observed values. Optimising the referential values of each input variable means identifying how to divide the input variable's range so that the observations of a given class form the majority within each partition. More trained referential values can reasonably improve the classifier, but the associated cost (i.e., model complexity) increases. In this case, we used one optimised referential value per input variable because adding more referential values did not significantly improve the AUC scores but caused higher model complexity, and sparser joint frequency matrices were found when more referential values were added. In addition, two adjacent referential values – that is, no trained referential value – can only approximate a monotonic function; at least one trained referential value is required to approximate a non-monotonic function.
Figure 5.5 illustrates the scatter plot of the first training set across the four input variables. There are two panels, one for each group of evidence: HP-TS (left) and FB-ICR (right). The red dots represent ‘myopic’ and the blue dots ‘strategic’, while the vertical and horizontal lines indicate the optimised referential values of the input variables. As the figure shows, these lines split the data into several regions. Because the referential values are optimised through the MAKER-ER- and MAKER-BRB-based classifiers, each region indicates where most of a class is placed; with one trained referential value for each input variable, each panel features four regions.
Figure 5.5. Scatter plot of the observed data of the training set of the first fold, with the optimised referential values from the optimisation of the MAKER-ER-based model plotted for each input variable of the customer-type dataset. Left panel: group of evidence HP-TS; right panel: group of evidence FB-ICR.
In general, Figures 5.5 and 5.6 illustrate that data patterns existed for records in different classes across the input variables of the dataset. For the HP-TS group of evidence, both classes (i.e., myopic and strategic) were distributed over the same range; at higher values of the input variable HP, most of the strategic customers were spread over a large value range of the input variable TS. For the FB-ICR group of evidence, ‘strategic’ (blue dots) generally dominated the right side of the panel, meaning that ‘strategic’ featured a large range of values of the input variable FB. The myopic customers were mainly distributed close to the lower boundary of the input variable ICR, while the strategic customers generally covered a large value range of ICR. In addition, there was no single observation in the upper left corner of the panel: if a customer books only once, the value of ICR is zero, a condition known as a structural zero.
Figure 5.6. Scatter plot of the observed data of the training set of the first fold, with the optimised referential values from the optimisation of the MAKER-BRB-based model plotted for each input variable of the customer-type dataset. Left panel: group of evidence HP-TS; right panel: group of evidence FB-ICR.
The horizontal and vertical lines denote the optimised referential values of the input variables of the respective training set. As stated earlier, optimising the referential values with respect to the MSE score led to a separation of the observations for each input variable, so the majority of a class fell within the same value range. As shown in Figures 5.5 and 5.6, the optimised referential values are generally located around the separation point between the myopic and strategic classes.
For the following sections, the optimised referential values and other model
parameters of the training set of the first group of both MAKER-based classifiers are
taken as an example to demonstrate how MAKER-ER- and MAKER-BRB-based
models are constructed for the customer type prediction of a given dataset. The next
section discusses the MAKER-based models according to four aspects: 1) evidence
acquisition from data, 2) evidence interdependence, 3) belief-rule inference, and 4)
inference of the top hierarchy, including the ER rule and BRB inference.
5.5.3. Evidence Acquisition from Data
Section 4.3 explains the MAKER framework with referential values as a discretisation
method for numerical data. As already stated, the referential values of each input
variable in numerical data must be defined to acquire evidence from a dataset. The
referential values as model parameters can initially be set based on expert knowledge
or can be randomly generated. They can then be trained using historical data under
an optimisation objective (Xu et al., 2017). For illustration purposes, we used the
solution of optimisation of the first round for the MAKER-ER-based model – including
weights and an optimised referential value for each input variable of the training set.
Table 5.5 depicts the optimised referential values used for this illustration. The referential values include the boundary referential values – the lower and upper boundaries determined in Section 5.5.2 – and one referential value which lies between them. To acquire evidence from a dataset, the first step is to transform each input value of each input variable of the training set using Equation (4.7): 1) find the two adjacent referential values of the respective input variable between which the input value lies, and 2) calculate the belief distribution over these two adjacent referential values, called the similarity degrees. The second step is to aggregate the similarity degrees of each referential value under the different classes of the training set according to Equation (4.8); the frequencies of the referential values of each input variable under the different classes of the output variable can subsequently be generated. Table 5.6 displays these frequencies for the input variable TS as an example.
Table 5.5. The optimised referential values obtained from the MAKER-ER- and MAKER-BRB-based models of the first round

Input variable                                   TS      HP      FB      ICR
Lower boundary                                   .0005   .0010   1       –.2939
Optimised referential value (MAKER-ER-based)     .1338   .1585   1.3390  .0312
Optimised referential value (MAKER-BRB-based)    .1802   .1206   1.0848  .0612
Upper boundary                                   2.1110  5.5320  5       .5590
Table 5.6. The frequencies of the referential values of the input variable of TS
Class\referential values .0005 .1338 2.1110
Myopic 1515.1435 278.4574 31.3991
Strategic 204.5881 350.1775 67.2344
The third step is to calculate the likelihood of a referential value of an input variable being observed given that a class of the output variable is true; Equation (4.9) is applied to all referential values of all input variables of the training set. Once the likelihood of a referential value has been obtained, the probability that the respective referential value points to a class of the output can be calculated using Equation (4.10). Table 5.7 presents the likelihoods for the referential values .0005, .1338, and 2.1110 of the input variable TS as an example.
Table 5.7. The likelihoods of the referential values of the input variable of TS
Class\referential values .0005 .1338 2.1110
Myopic .8302 .1526 .0172
Strategic .3289 .5630 .1081
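The arithmetic linking Tables 5.6, 5.7, and 5.8 can be checked in a few lines; the sketch below assumes that Equation (4.9) normalises the frequencies within each class and Equation (4.10) normalises the resulting likelihoods across classes, an assumption that does reproduce the published values:

```python
# Frequencies of the referential values of TS under each class (Table 5.6)
freq = {
    "myopic":    [1515.1435, 278.4574, 31.3991],
    "strategic": [204.5881, 350.1775, 67.2344],
}

# Likelihoods (Table 5.7): each class's frequencies normalised by its total
likelihood = {cls: [f / sum(fs) for f in fs] for cls, fs in freq.items()}

# Probabilities (Table 5.8): likelihoods normalised across classes
# separately for each referential value
prob = {cls: [] for cls in freq}
for j in range(3):
    total = sum(likelihood[cls][j] for cls in freq)
    for cls in freq:
        prob[cls].append(likelihood[cls][j] / total)
```

Rounding to four decimal places recovers, for instance, the likelihood .8302 for the myopic class at referential value .0005 and the probability .7162 in Table 5.8.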
Figure 5.7 depicts the individual support of each piece of evidence regarding class
membership – for myopic (blue) and strategic (orange), which is obtained from the
probability of each referential value of each input variable of the training set. Table
5.8 presents the probabilities of the referential values .0005, .1338, and 2.1110 of the
input variable of TS of the first group training set and Table 5.9 presents the
probabilities of the referential values .0010, .1585, and 5.5320 of the input variable
HP.
Table 5.8. The probabilities of referential values of the input variable of TS
Class\referential values .0005 .1338 2.1110
Myopic .7162 .2132 .1373
Strategic .2838 .7868 .8627
Table 5.9. The probabilities of referential values of the input variable of HP
Class\referential values .0010 .1585 5.5320
Myopic .5685 .5180 .2808
Strategic .4315 .4820 .7192
Figure 5.7. Individual support of the referential values of each input variable. [Four bar-chart panels – TS, HP, FB, and ICR – plot the basic probability at the lower boundary, the trained referential value, and the upper boundary of each input variable.]
Several pieces of evidence can be acquired from the probabilities calculated above. The probabilities of the referential values of the input variables of the training set
represent the degree to which the respective referential values of the input variables indicate different class memberships. In this way, we can acquire various pieces of evidence. For example, in Table 5.9 the probabilities of the lower boundary of the input variable HP (i.e. .0010) are .5685 and .4315 for the myopic and strategic classes, respectively. This means that if an observation has an HP input value of .0010, the probability of the observation being myopic is .5685 and of being strategic .4315; in other words, an input value of HP of .0010 constitutes a piece of evidence pointing to the myopic and strategic classes with probabilities .5685 and .4315, respectively.
5.5.4. Analysis of Evidence Interdependence
This section discusses the interdependence index, denoted by α in the MAKER framework, as a measurement of the interdependence between a pair of evidential elements. As explained in Section 4.3, the MAKER-based model is purposely developed to relax the assumption, made when combining evidence under the ER rule, that pairs of evidential elements are independent. The interdependence index can be calculated using Equation (4.14): the first step is to calculate the similarity degrees of the input values for the combination of evidential elements using Equation (4.12); the second step is to apply Equation (4.13) to obtain the joint probability of the pair of evidential elements; the interdependence index between the pair can then be estimated with Equation (4.14).
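The thesis computes the index with Equation (4.14), which is not reproduced here; as an illustration of the idea only, a pointwise dependence ratio behaves analogously – 1 under independence, above 1 for positive dependence, and 0 for disjoint combinations such as {2.1110, .0010} – though its normalisation differs, so it does not reproduce the values in Tables 5.11 and 5.12:

```python
def dependence_ratio(p_joint, p_a, p_b):
    """Pointwise dependence of two evidential elements: 1 = independent,
    > 1 = positively dependent, 0 = disjoint (never observed together).
    Illustrative only; the MAKER interdependence index of Equation (4.14)
    uses a different normalisation and is not reimplemented here."""
    if p_a == 0 or p_b == 0:
        return 0.0
    return p_joint / (p_a * p_b)
```

For instance, if the joint probability factorises into the marginals (p_joint = p_a × p_b), the ratio is exactly 1, matching the intuition that independent evidence carries no extra joint information.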
Table 5.10 displays the joint probabilities of all pairs of evidential elements of the input variables HP and TS indicating the different classes of the output variable: myopic and strategic. These joint probabilities were calculated from the frequencies of the different combinations of referential values of the pieces of evidence from HP and TS under each class membership. The frequencies have at least five samples, except for the combination of referential values {2.1110, .0010}, for which the pieces of evidence are disjoint for both classes.
Table 5.10. The joint probabilities of different combinations of the referential values from input variables HP and TS

Class\combination   {.0005, .0010}   {.0005, .1585}   {.0005, 5.5320}   {.1338, .0010}   {.1338, .1585}   {.1338, 5.5320}   {2.1110, .0010}   {2.1110, .1585}   {2.1110, 5.5320}
Myopic              .5220            .7179            .9128             .2481            .2079            .3090             0                 .1500             .2516
Strategic           .4780            .2821            .0872             .7519            .7912            .6910             0                 .8500             .7484
As already stated, TS depends on HP such that if the value of HP is .0010, there is no possibility of TS having a value of 2.1110. The combination of referential values {1, .5590} for the input variables FB and ICR is also disjoint. Therefore, we defined inequality constraints for all combinations of referential values of the input variables of each group of evidence, except the combination {2.1110, .0010} for group 1 and the combination {1, .5590} for group 2.
The last step is to calculate the interdependence index of a pair of evidential elements with respect to class membership. With the probabilities obtained in Section 5.5.3 – displayed in Tables 5.8, 5.9, and 5.10, which give the basic probability distributions of the input variables TS and HP and the joint probabilities of the pair of pieces of evidence, respectively – the interdependence indices between the pieces of evidence from the input variables HP and TS can be obtained through Equation (4.21).
From Table 5.11, it can be observed that the interdependence indices of the input variables HP and TS generally lie between 1 and 10, meaning that the two input variables are moderately independent of each other, except for the combination of referential values {2.1110, .0010}, which has an interdependence index of 0 (i.e., disjoint). According to Table 5.12, the input variables FB and ICR are likewise generally moderately independent of each other; their interdependence indices lie between 1 and 3. However, some combinations of referential values display high values: the referential value 1.3390 of FB and .5590 of ICR are highly dependent on each other under the ‘myopic’ class, with an interdependence index of 50.7658, and the same holds for the referential values 5 of FB and .5590 of ICR, with an index of 42.3690.
Table 5.11. The interdependence indices between the referential values from the input variables HP and TS

Class\combination   {.0005, .0010}   {.0005, .1585}   {.0005, 5.5320}   {.1338, .0010}   {.1338, .1585}   {.1338, 5.5320}   {2.1110, .0010}   {2.1110, .1585}   {2.1110, 5.5320}
Myopic              1.6079           2.6232           6.1651            1.7218           3.4673           8.1328            0                 4.8174            9.7909
Strategic           2.5845           1.5906           .7065             2.3813           1.5199           1.0724            0                 1.4021            1.0385
Table 5.12. Interdependence indices between referential values from the input variables FB and ICR

Class\combination   {1, –.2939}   {1, .0312}   {1, .5590}   {1.3390, –.2939}   {1.3390, .0312}   {1.3390, .5590}   {5, –.2939}   {5, .0312}   {5, .5590}
Myopic              1.9318        1.8726       0            1.6575             1.8626            50.7658           1.8986        1.9147       42.3690
Strategic           2.1033        2.2008       0            2.0821             2.0921            .5680             2.0510        2.0851       .6888
5.5.5. Belief Rule Base
Once the evidence from the dataset and the interdependence indices between pairs of pieces of evidence have been acquired, we are in a position to develop a belief rule base from which an inference can be made. As stated in Section 4.4, a belief rule should be expressed in the form of Equation (4.22). The ‘IF’ part, expressed as 𝐴1𝑘 ∧ 𝐴2𝑘 ∧ … ∧ 𝐴𝑇𝑘𝑘 (where 𝑇𝑘 is the number of antecedent attributes of the 𝑘th rule) and called a packet antecedent 𝐴𝑘, should be interpreted in this study as a combination of the referential values of the input variables, or ‘if the input value of each input variable is equal to a referential value of that input variable’. The ‘THEN’ part, expressing the probabilities of each consequent, i.e. {(𝐷1, 𝛽1𝑘), (𝐷2, 𝛽2𝑘), …, (𝐷𝑁, 𝛽𝑁𝑘)}, should be interpreted as the probability of a customer with the corresponding input values being strategic or myopic.
Since the ‘IF’ part represents a combination of the referential values of the input variables, the size of a belief rule base equals the product of the numbers of referential values of the input variables. For example, in group 1 there are two input variables, each with three referential values: the lower boundary, the trained referential value, and the upper boundary. Hence, the size of the BRB of group 1 is 3 × 3 = 9. The BRBs of groups 1 and 2 can be seen in Tables 5.14 and 5.15, respectively. It is worth noting that the trained referential values are solutions of the optimisation of the MAKER-ER- or MAKER-BRB-based classifiers; in this section, we also utilise other optimised model parameters, such as the weights of the input variables. Meanwhile, the ‘THEN’ part consists of the consequents, myopic and strategic, with their corresponding probabilities. To obtain the probabilities of a customer being myopic or strategic, the MAKER rule is used to combine the pieces of evidence in a group of evidence, taking the interdependence of pairs of evidence into account, using Equation (4.16) in Section 4.3.3. Using Equation (4.18), we can obtain the weights of the combined evidence from the probability mass 𝑚𝜃,𝑒(𝐿) and the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙, or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿). These weights, called the rule weights and denoted by 𝜃𝑘, are used for inference in the next section. For example, with the calculation explained in Section 4.3.3, in group 1 the probabilities of the combination of referential values {.1338, .0010} being myopic and strategic are .1018 and .8982, respectively, and for the combination {1, –.2939} in group 2 the probabilities are .8022 (myopic) and .1978 (strategic).
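The rule-base size claim can be made concrete by enumerating the packet antecedents of group 1; the belief degrees reported in Tables 5.14 and 5.15 come from the MAKER combination, so they are left as placeholders here:

```python
from itertools import product

# referential values of group 1 (Table 5.5, MAKER-ER-based model)
ts_refs = [0.0005, 0.1338, 2.1110]
hp_refs = [0.0010, 0.1585, 5.5320]

# one belief rule per combination of referential values (packet antecedent);
# the consequent probabilities would be filled in by the MAKER rule
brb = [
    {"antecedent": (ts, hp), "myopic": None, "strategic": None}
    for ts, hp in product(ts_refs, hp_refs)
]
```

With two input variables of three referential values each, the enumeration yields exactly the 3 × 3 = 9 rules of Table 5.14.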
5.5.6. BRB Inference with Referential Values
For discrete or nominal data, making an inference from a belief rule base is a direct
process. For example, if the input vector presents the combination ‘High ∧ Low ∧
High’, then we can obtain a probability 𝑝1 of consequent 1 and 𝑝2 of consequent 2,
171
where 𝑝1 = 𝛽1𝑘 and 𝑝2 = 𝛽2𝑘 from the 𝑘th rule from which the IF rule of ‘High ∧ Low ∧
High’ is mentioned. The inference process with referential values as a discretisation
method is consequently different from that with discrete data. The inference process
is discussed in this section.
A belief rule base was developed in the previous section, each belief rule consisting of a packet antecedent 𝐴𝑘 – a combination of the referential values of the input variables – and the corresponding consequent probabilities. Based on this form, we need to transform the numerical data onto the combinations of the referential values of the input variables, that is, to calculate a similarity degree for each observed value of each input variable. An input value can be transformed using Equation (4.7); the similarity degree indicates the degree to which the input value matches each of the referential values. For example, an observation with the input values {.2105, .3955, 4, .1415} for TS, HP, FB, and ICR, respectively, under the referential values defined in Table 5.5, has the two adjacent referential values of each input variable depicted in Table 5.13. Using Equation (4.12), we can then calculate the joint similarity degree between the observation and the combination of the referential values of each belief rule, i.e. the packet antecedent. These values represent the individual matching degree to which the input vector, or observation, belongs to a packet antecedent 𝐴𝑘, denoted by 𝛼𝑘 for the 𝑘th rule.
Table 5.13. Two adjacent referential values of each input variable of an observation from the customer – type dataset: {.2105, .3955, 4, .1415}
TS HP FB ICR
.1338 .1585 1.3390 .0312
2.1110 5.5320 5 .5590
Since each observed value of an input variable is expressed by its distances to two
referential values, a number of belief rules are activated out of the total, ranging from
1 (for an input vector exactly equal to a combination of the referential values of the
input variables) to 2^𝑁, where 𝑁 is the number of input variables, for an input vector
whose every observed value lies between two adjacent referential values. In this
case, as we have two input variables with three referential values each in every group
of evidence, between 1 and 2^2 = 4 of the 9 belief rules are activated. By using
Equation (4.12), we can obtain the joint similarity degree of
each belief rule in the BRB of each group of evidence, as depicted in Tables 5.14-
5.15 for an observation of {.2105, .3955, 4, .1415}. In these tables, it can be found
that four combinations of the two activated adjacent referential values in Table 5.13
have 𝛼𝑘 > 0, while other combinations of other referential values have 𝛼𝑘 = 0.
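A sketch of how these joint similarity degrees can be computed, assuming (as in many BRB formulations) that Equation (4.12) reduces to an unweighted product of the individual degrees. The individual degrees below are illustrative; the actual 𝛼𝑘 in Table 5.14 also reflect the evidence weights:

```python
from itertools import product as cartesian

# Illustrative individual similarity degrees of the group-1 observation
# {.2105, .3955} to the three referential values of TS and HP; zeros for
# non-adjacent referential values.
ts_deg = {0.0005: 0.0, 0.1338: 0.9612, 2.1110: 0.0388}
hp_deg = {0.0010: 0.0, 0.1585: 0.9559, 5.5320: 0.0441}

# Unweighted-product assumption for the joint degree alpha_k of each rule.
alpha = {(a, b): ts_deg[a] * hp_deg[b]
         for a, b in cartesian(ts_deg, hp_deg)}

# Only the 2 x 2 = 4 rules built from adjacent referential values fire.
activated = [rule for rule, a in alpha.items() if a > 0]
```

The nine products correspond to the nine belief rules of the group, and exactly four of them are non-zero, mirroring Table 5.14.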
Table 5.14. The belief rule base of the first group of evidence and the activated belief rules by an observation of the input variables of group 1 from the customer-type dataset: {.2105, .3955}
Antecedent Belief degree
Rule 𝐴1 (TS) 𝐴2 (HP) Myopic Strategic 𝛼𝑘
1 .0005 .0010 .8117 .1883 0
2 .0005 .1585 .9122 .0878 0
3 .0005 5.5320 .7864 .2136 0
4 .1338 .0010 .1018 .8982 0
5 .1338 .1585 .0836 .9164 .9089
6 .1338 5.5320 .0761 .9239 .0645
7 2.1110 .0010 .2511 .7489 0
8 2.1110 .1585 .1488 .8512 .0249
9 2.1110 5.5320 .1472 .8528 .0018
Table 5.15. The belief rule base of the second group of evidence with activated belief rule base by an observation of the input variables of group 2 from the customer-type dataset: {4, .1415}
Antecedent Belief degree
Rule 𝐴3 (FB) 𝐴4 (ICR) Myopic Strategic 𝛼𝑘
1 1 –.2939 .8022 .1978 0
2 1 .0312 .7989 .2011 0
3 1 .5590 .5640 .4360 0
4 1.3390 –.2939 .0420 .9580 0
5 1.3390 .0312 .0415 .9585 .2682
6 1.3390 .5590 .0878 .9122 .0049
7 5 –.2939 .0659 .9341 0
8 5 .0312 .0641 .9359 .7138
9 5 .5590 .1820 .8180 .0131
At this point, for each belief rule, we have 𝛼𝑘 as an individual matching degree to which
the input values belong to a packet antecedent 𝐴𝑘; the weights of the combined
pieces of evidence of 𝐴𝑘, which are obtained from the probability mass 𝑚𝜃,𝑒(𝐿), the
probability 𝑝𝜃,𝑒(𝐿), and the probability mass 𝑚𝑃(Θ),𝑒(𝐿); and the probability of each
consequence as a result of the combination of pieces of evidence of 𝐴𝑘. Hence, the
weights of the pieces of evidence affect the weights of each belief rule activated by
an observation.
Once we obtain the activated belief rules with the corresponding joint similarity
degrees and their weights, the next step is to combine these belief rules to predict the
probabilities of each consequence (i.e., a customer being myopic or strategic). First,
we need to calculate the updated weight denoted by 𝜔𝑘 of each belief rule in BRB
based on the joint similarity degrees and the associated rule weight 𝜃𝑘 from Equation
(3.11), with 𝐿 referring to the number of belief rules in the BRB. The value 𝜔𝑘 is designed
to measure the degree to which a packet antecedent 𝐴𝑘 in the 𝑘th rule is triggered by
an observation. As stated in the previous section, the weights of the input variables
contribute to the weight of each belief rule, and based on the joint similarity degrees
and those weights, we calculate the updated weight of each belief rule. We can
conclude that the weights of the input variables influence the updated weight of each
belief rule, which measures the degree to which a belief rule is triggered in predicting
the probability of each consequence. Second, given the updated weight of each belief
rule and the associated probability of each consequence, we can combine these
pieces of evidence using the conjunctive MAKER rule as demonstrated in Equation
(4.16). The output of this framework is the probability of a customer being myopic or
strategic.
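In the standard BRB formulation, the updated weight of Equation (3.11) normalises each rule's activation. A sketch under the assumption 𝜔𝑘 = 𝜃𝑘𝛼𝑘 / Σ𝑙 𝜃𝑙𝛼𝑙; the MAKER combination of Equation (4.16) additionally carries the evidence weights, which are omitted here:

```python
def updated_weights(alpha, theta=None):
    """Normalised activation weight of each belief rule.

    alpha: joint similarity degrees alpha_k of the rules;
    theta: rule weights theta_k (equal weights assumed when omitted).
    Returns omega_k = theta_k * alpha_k / sum_l theta_l * alpha_l.
    """
    L = len(alpha)
    theta = theta or [1.0] * L
    total = sum(t * a for t, a in zip(theta, alpha))
    return [t * a / total for t, a in zip(theta, alpha)]

# The four activated rules of group 1 (Table 5.14) with equal rule weights
omega = updated_weights([0.9089, 0.0645, 0.0249, 0.0018])
```

The most strongly matched rule (rule 5 in Table 5.14) dominates the combination, and the updated weights sum to one by construction.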
For example, with an observation of group 1 of {.2105, .3955} for TS and HP
respectively, we obtain a probability of .0778 of being myopic and .9222 of being
strategic. In addition, given an observation of group 2 of {4, .1415} for FB and ICR
respectively, we obtain a probability of .0483 of being myopic, and .9517 of being
strategic. At this point, we can determine the probability of a customer being strategic
or myopic based on a subset, but not all, of the input variables in the input
system. How to generate the probability of each consequence with all the input variables
of the input system considered is discussed in the following section.
5.5.7. Inference of the Top Hierarchy
Based on the previous section, we can obtain the probability of each consequence as
a result of the evidence combination of some but not all the input variables in the
system. As depicted in Figure 4.1, a system consists of some groups of evidence,
each of which features a number of input variables. In the lower levels of the hierarchy,
each group of evidence makes inferences based on the input variables in the input
system of the group. As such, each group of evidence generates the probability of
each consequence of the output system. As we have acquired the MAKER-generated
outputs from the input variables of each group of evidence, we can now combine the
outputs to reach the final inference of the top hierarchy, which is the probability of
being myopic or strategic with all the input variables being considered. We provide
two combination methods: ER- and BRB-based models.
First is the ER rule. According to the previous section, we can acquire the probabilities
generated by the MAKER rule from the input variables of a group of evidence. That is,
an observation of the input variables of a group of evidence generates
the probabilities of class membership. Therefore, we can acquire a piece of evidence
from the observation. As such, we obtain the same number of pieces of evidence as
the number of groups of evidence in the hierarchy.
To combine these pieces of evidence using the ER rule, we need their weights. We
can obtain the weight of each group of evidence from the probability mass 𝑚𝜃,𝑒(𝐿) and
the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙 or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿) using
Equation (4.18) when combining the activated belief rules in the previous section.
Given these pieces of evidence and their weights, we can use Equation (4.16) for
evidence combination, and therefore, we can generate the probability of each
consequence, considering all the input variables in the system. Since the weights of
the input variables of a group of evidence have an effect on the updated weight of
each belief rule, and the weight of a group of evidence is the weight of the combined
activated belief rules, we can conclude that the weight of a group of evidence is
influenced by the weights of the input variables of the group of evidence. As such, in
the top hierarchy, we can conclude that the final inference generated considers the
weights of all the input variables in the system.
For example, in this study, there are two groups of evidence as depicted in Figure 5.4;
as such, we should have two outputs: the probabilities of being myopic and strategic.
Consider an observation with the input values of group 1 being {.2105, .3955} for TS and HP,
respectively, and group 2 of {4, .1415} for FB and ICR, respectively. By following the
procedures in the previous sections and given two groups of evidence, we can obtain
two pieces of evidence as the MAKER-generated outputs: {(1, .0778), (2, .9222)} and
{(1, .0483), (2, .9517)} for groups 1 and 2, respectively. With their weights and using
Equation (4.16), we can generate a probability of .0233 of being myopic and .9767 of
being strategic as a final output of the system, where the probabilities are obtained
with all the input variables in the system together.
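The ER-rule combination of Equation (4.16) is not reproduced in full here; the sketch below shows the general shape of combining two weighted pieces of evidence, where each probability distribution is discounted by its evidence weight and the unassigned mass is redistributed on combination. The weights w1 and w2 below are illustrative assumptions, not the weights derived in the thesis:

```python
def er_combine(p1, p2, w1, w2):
    """Combine two pieces of evidence over the classes {myopic, strategic}.

    A simplified ER-rule-style combination: each probability is discounted
    by its evidence weight, the residual mass 1 - w plays the role of the
    unassigned mass m_P(Theta), and the combined singleton masses are
    renormalised.
    """
    m1 = [w1 * p for p in p1] + [1.0 - w1]   # class masses + residual
    m2 = [w2 * p for p in p2] + [1.0 - w2]
    K = len(p1)
    combined = [m1[k] * m2[k] + m1[k] * m2[K] + m1[K] * m2[k]
                for k in range(K)]
    total = sum(combined)
    return [c / total for c in combined]

# Group-level outputs from the text, with illustrative weights of .8 each
p = er_combine([0.0778, 0.9222], [0.0483, 0.9517], w1=0.8, w2=0.8)
```

With both groups pointing strongly to the strategic class, the combined probability of being strategic reinforces above either input; the exact figures in the text depend on the trained weights.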
Second is the BRB rule. As depicted in Figure 4.1, there are a number of groups of
evidence, each of which consists of some input variables. As stated above, each
group of evidence generates the probability of each consequence. We can make
inferences based on the concept of the belief rule base. To construct a belief rule
base, we must follow the expression of the extended IF-THEN rule, as described in
Section 0, specifically in Equation (4.22). In this state, the antecedent of the belief rule,
written as 𝐴1^𝑘 ∧ 𝐴2^𝑘 ∧ ⋯ ∧ 𝐴𝑇𝑘^𝑘, should be expressed as ‘if a group of evidence
points to class 𝑘’. Therefore, the number of antecedents equals the number of groups
of evidence in the system. In this study, there are two groups of evidence; hence,
there are two antecedents in the BRB. Furthermore, the consequent, written as
{(D1, 𝛽1,𝑘), (D2, 𝛽2,𝑘), … , (D𝑁, 𝛽𝑁,𝑘)}, 𝑘 = 1, … , 𝐿, should be expressed in this state as
‘the probability of a customer being myopic or strategic given the values of the
antecedents’, or we can say ‘the probability of a customer being of the class
membership myopic or strategic, given the results from each group of evidence’.
The antecedents in this study are the outputs generated by each group of evidence.
In addition, the outputs refer to the class membership such that the number of
combinations equals 𝐾^𝐺, where 𝐾 is the number of outputs in the output system, and
𝐺 is the number of groups of evidence in the system. In this study, there are two class
memberships as the outputs, with two groups of evidence formed in the system.
Therefore, we have 2^2 = 4 belief rules, as depicted in Table 5.16 and Figure 5.4.
Table 5.16. The belief rule base of the top hierarchy of inference with the initial belief degrees for the customer-type dataset
No. Antecedent Consequence
𝐴1 𝐴2 Myopic (1) Strategic (2)
1 1 1 1 0
2 1 2 .5 .5
3 2 1 .5 .5
4 2 2 0 1
We suppose that 𝐴^1 is the output generated by group 1; 𝐴1^1 = 1 if the group of
evidence indicates class 𝑘 = 1 (myopic), and 𝐴2^1 = 2 if the group of evidence
indicates class 𝑘 = 2 (strategic). Furthermore, 𝐴1^2 = 1 and 𝐴2^2 = 2 signify that group
2 points to myopic (𝑘 = 1) and strategic (𝑘 = 2), respectively. As we do not have prior
knowledge regarding the belief degrees assigned to each consequence, denoted by
𝛽𝑗,𝑘 for the 𝑗th consequence in the 𝑘th rule as displayed in Equation (4.22), we can
construct a BRB as follows.
• The construction of belief rule base
Given the observed values of the input variables in the input system of each group of
evidence, if both groups of evidence indicate the same class membership, this means
that the observation of all the input variables fully indicates the corresponding class.
For example, the first and fourth belief rules generate a probability of 1 for myopic and
strategic, respectively. If both groups of evidence point to different class memberships,
we cannot say the observation of all the input variables exactly indicates a particular
class membership, meaning that the probability of each consequence can range from
0 to 1. These belief degrees can be trained along with other model parameters
simultaneously. For initialisation, we use the initial belief degrees as listed in Table
5.16. Table 5.17 provides the optimised belief degrees of the belief rule base. We use
these belief degrees for this section as an example.
Table 5.17. The belief rule base of the top hierarchy of inference with the optimised belief degrees of the training set of the first fold for the customer-type dataset
No. Antecedent Consequence
𝐴1 𝐴2 Myopic (1) Strategic (2)
1 1 1 1 0
2 1 2 .1623 .8377
3 2 1 .1771 .8299
4 2 2 0 1
The antecedent in the BRB is defined as ‘a group of evidence points to a class
membership with the probability of 1’. Since each group of evidence generates the
probability of each consequence, which measures the degree to which the observed
values of the input variables within the group indicate a class
membership, we cannot reach a direct conclusion based on the BRB. For the purpose
of demonstration, we use the observed values {.2105, .3955, 4, .1415} and the
optimised model parameters obtained from the MAKER-BRB-based model, including
the optimised referential values in Table 5.5, which are .1802, .1206, 1.0848, and
.0612 respectively for TS, HP, FB, and ICR.
• The calculation of joint similarity degree
For example, the observed values {.2105, .3955} for TS and HP generate the
probabilities {.1371, .8629}, meaning that this observation belongs to 𝐴2^1 to a high
degree (.8629) and to 𝐴1^1 to a low degree (.1371). As such, we can obtain the belief
distribution of the antecedents. Therefore, we can apply Equation (4.12) to obtain the
joint similarity degree between the outputs generated by each group of evidence and
the combination of the antecedents of each belief rule. For example, based on the
probabilities obtained from groups 1 and 2, which are {(1, .1371), (2, .8629)} and {(1,
.2537), (2, .7463)}, respectively, we can obtain the joint similarity degree for each
antecedent as displayed in Table 5.18.
Table 5.18. The joint similarity degree of the outputs generated by group 1: {.1371, .8629} and group 2: {.2537, .7463} from the customer-type dataset
 {𝐴1^1, 𝐴1^2} {𝐴1^1, 𝐴2^2} {𝐴2^1, 𝐴1^2} {𝐴2^1, 𝐴2^2} Total
 .0348 .1023 .2189 .6440 1
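Because each antecedent at this level is a class indication, the joint similarity degree of Equation (4.12) for this layer reduces to the product of the two group-level probabilities, which reproduces the entries of Table 5.18:

```python
g1 = {1: 0.1371, 2: 0.8629}   # probabilities generated by group 1
g2 = {1: 0.2537, 2: 0.7463}   # probabilities generated by group 2

# Joint similarity degree of each antecedent combination (A^1_i, A^2_j),
# rounded to four decimals as in Table 5.18.
joint = {(i, j): round(g1[i] * g2[j], 4) for i in g1 for j in g2}
# → {(1, 1): 0.0348, (1, 2): 0.1023, (2, 1): 0.2189, (2, 2): 0.644}
```

The four joint degrees sum to one, so every observation fully activates the four belief rules of Table 5.16 in proportion to these products.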
• Making inference from activated belief rules
These joint similarity degrees activate four belief rules. As with the previous section,
these values are used to calculate the updated weight of each belief rule. The rule
weights denoted by 𝜃𝑘 can be trained. However, in this study, the rule weights were
set to be equal. The joint similarity degree influences how we invoke the activated
belief rules to contribute to the inference. Since the joint similarity degree is calculated
from the outputs generated by each group of evidence, each of which consists of
some but not all of the input variables in the system, we may conclude that by
combining the outputs in this way, the inference is obtained by considering all the
input variables in the system. The probabilities {(1, .1371), (2, .8629)} and
{(1, .2537), (2, .7463)} obtained from an observation of {.2105, .3955, 4, .1415} generate
the following prediction of class membership: .0901 of being myopic and .9099 of
being strategic.
5.5.8. The Interpretability of Hierarchical MAKER Frameworks
As mentioned previously, a set of model parameters in this study consists of one
trained referential value of the four input variables of the system and the weights of
the evidential elements (referential values) of the four input variables for the MAKER-
ER-based classifier. An additional set of parameters is a set of trained belief degrees
of each consequence of the respective belief rules for the MAKER-BRB-based
classifier. The trained referential values are utilised to obtain pieces of evidence,
which are then combined in the upper level of the hierarchy. Given the optimised
weights of the evidential elements of the input variables of each group of evidence,
we can generate the probability of each consequence. For each group of evidence,
the weights of the input variables impact the updated weight of each activated belief
rule through an observation to predict the probabilities of the classes of the output
system.
In the MAKER-ER-based classifier, given the probabilities generated by the MAKER
rule for each group of evidence and the weight of the combined activated belief rules
of each group of evidence, we can make predictions (i.e., the probabilities of the
classes of the output system with all four input variables considered) in the upper level
of the hierarchy. The weights of the two input variables of each group of evidence
have an impact on the updated weights of each activated belief rule, and the weight
of the combined activated belief rules of each group of evidence has an impact on the
inference made in the upper level.
In the MAKER-BRB-based classifier, the probabilities generated by the MAKER rule
for each group of evidence indicate the degree to which the input variables of each
group of evidence point to each class of the output system. As such, we can calculate
the joint similarity degree for each combination of the antecedents. Given the trained
belief degrees of the consequences of each belief rule in the BRB and the joint
similarity degrees, we can make predictions in the upper level of the hierarchy, which
are inferred based on the four input variables in the system.
Through these two ways (i.e., the MAKER-ER- and MAKER-BRB-based models), we
can bring the predicted outputs (i.e., the predicted probabilities of each class of
the output system) as close as possible to the true observed outputs of the training set,
minimising the MSE score, by optimising the model parameters, including the referential values
of the four input variables, the weights of the evidential elements for both classifiers,
and the trained belief degrees of the relevant belief rules specifically for the MAKER-BRB-based model.
In this study, given the optimised referential values (i.e., the trained referential values)
of the four input variables, we can construct the MAKER-based classifier for an
illustration, specifically how to acquire pieces of evidence from the data. On the basis
of the referential values and other optimised solutions (i.e., the weights as well as the
belief degrees of each consequence in the BRB of the top hierarchy), we can use the
MAKER-ER- and MAKER-BRB-based models to make inferences through the
process described in this section. For the example used earlier, {.2105, .3955, 4,
.1415}, the predicted probabilities of each class are {.0233, .9767} and {.0901, .9099}
for the MAKER-ER- and MAKER-BRB-based models, respectively. Based on the process
established in these classifiers, we can conclude that the MAKER-ER- and MAKER-BRB-based
classifiers constitute an interpretable approach, integrating statistical analysis
in acquiring pieces of evidence, the measurement of the
interdependencies between pairs of evidence, belief rule-based inference in the
MAKER rule, maximum likelihood prediction, and machine learning. Furthermore,
even with the input variables in the system split into multiple groups of evidence, the
inference process established for both classifiers has combined all pieces of evidence
from the lower level in the hierarchy. In every combination process of the pieces of
evidence from the bottom to the top of the hierarchy, the knowledge embedded in a
piece of evidence, including its weights, is continuously forwarded until the final
inference at the top of the hierarchy. In this way, we may conclude that the predicted
system outputs of both classifiers are a result of the inference process over all the
input variables in the system with knowledge representation parameters embedded
in each piece of evidence (i.e., the weights, referential values, and consequent belief
degrees).
5.6. Model Comparisons
In this section, we compare the model performances of the MAKER-ER- and MAKER-BRB-based
models with those of other common machine learning methods for classification,
including logistic regression (LR), support vector machines (SVM), neural networks
(NN), classification trees (CT), Naïve Bayes (NB), k-nearest neighbour (KNN),
distance-based weighted k-nearest neighbour (weighted KNN), linear discriminant
(LD), and quadratic discriminant (QD), on the dataset of the case of customer
classification in revenue management.
As explained in Section 5.4, we utilise five-fold cross validation. The dataset is divided
into five folds with shuffled stratified cross validation to obtain a similar class
distribution. As such, each fold has nearly the same class distribution. In cross
validation, four folds are used as a training set, and the rest act as a test set. The
training set is used to train the model. These optimal parameters, which are obtained
from the model training, are then applied to the test set. If the model can generalise
the pattern of the data, the performance of the models on the test sets is relatively
similar to the performance on the training set. Therefore, in this section, we compare
all the classifiers based on their performances over the five rounds. We provide the
reports for both training and test sets.
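The shuffled stratified split would typically be produced with scikit-learn's StratifiedKFold; a pure-Python sketch of the idea, with toy labels mirroring the 1:4 imbalance described later in this section:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Shuffled stratified k-fold assignment: shuffle the indices of each
    class separately, then deal them round-robin so that every fold keeps
    a near-identical class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

# Toy labels with a 1:4 class imbalance (100 strategic, 400 myopic)
labels = [1] * 100 + [0] * 400
folds = stratified_folds(labels, k=5)
```

Each of the five folds then holds 100 observations with exactly 20 minority-class members, so every round of cross validation trains and tests on the same class distribution.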
We use several performance measures, including accuracy, precision, and recall, with
a threshold value of .5 for the classifiers based on probabilities. Specifically, for
SVM, a threshold value of 0 is used. As already mentioned, the dataset used in this
case study is highly imbalanced (1:4). Precision and recall scores are suitable for
checking whether the classifier can make accurate predictions on both classes
regardless of the existing imbalance in the dataset (Davis and Goadrich, 2006). The
ideal case is obtaining high precision and high recall. We also report the MSE scores
since the MAKER-ER- and MAKER-BRB-based models are optimised under this
objective function. We report the area under the receiver operating characteristic
curve (AUCROC) and the area under the precision-recall curve (AUCPR) scores
since these metrics provide a better measure than accuracy alone.
The higher the AUC score, up to a maximum of 1.0, the better the model
is. A further explanation of the measures can be found in Section 3.8.
It is also worth noting that for SVM, ANN, CT, KNN, and weighted KNN, we should
determine the hyperparameters of these classifiers. A hyperparameter is a parameter
whose value is determined before the learning process begins. We utilise
GridSearchCV in the Python library scikit-learn to find the optimal hyperparameters based on a five-round
model training method. Rather than relying solely on one performance measure
(i.e., accuracy), since this dataset is highly imbalanced (1:4), we use the F-beta score,
whose value is 1 for the best and 0 for the worst. Beta is a weight assigned to the F-beta
score; its values range from 0 to infinity (Maratea et al., 2014). The F-beta
score puts more emphasis on precision when beta is lower than 1 and weights toward
recall when beta is greater than 1. With a beta value of 1, the F-beta
score is exactly the same as the F-measure, which is an equally weighted harmonic
mean of precision and recall, as seen in Table 3.2. F0.5, F1, and F2 measures, the
notations for the beta values of .5, 1, and 2 respectively, are the most widely used
F-beta scores (Maratea et al., 2014). In this study, a beta value of 1 is deliberately
chosen so that the importance of precision is set to be equal to that of recall. The
hyperparameters of classifiers with the highest F-beta score of the left-out data after
the five-round training method are selected as presented in Table 5.19.
Table 5.19. Selected hyperparameters of SVM, ANN, CT, and Weighted KNN for customer type models
Classifier Selected hyperparameter
CT The maximum depth = 3; the minimum samples per leaf = 50;
the minimum size each leaf = 170
SVM Penalty parameter C = 9; the kernel type is radial basis function
kernel.
KNN k = 25
Weighted KNN k = 33
NN Multilayer perceptron is selected; the number of hidden layers =
1; the number of neurons in the hidden layer = 10; the activation
function is rectified linear unit function.
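The selection logic behind GridSearchCV with an F-beta scorer can be sketched in pure Python. The fbeta function below is the standard definition; the evaluate function and the grid values are hypothetical placeholders for an actual train-and-score step:

```python
from itertools import product

def fbeta(precision, recall, beta=1.0):
    """F-beta score: the weighted harmonic mean of precision and recall.
    beta < 1 favours precision, beta > 1 favours recall; beta = 1 gives
    the usual F-measure."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical stand-in for cross-validated training and scoring: a real
# search would fit a classifier per setting and score its held-out folds.
def evaluate(c, gamma):
    return fbeta(0.8 - 0.01 * abs(c - 9), 0.7 + 0.005 * gamma)

grid = {"C": [1, 3, 9, 27], "gamma": [1, 2, 4]}
best = max(product(grid["C"], grid["gamma"]),
           key=lambda cg: evaluate(*cg))
```

Every grid point is scored with the same F-beta criterion and the best-scoring hyperparameter combination is kept, which is what the scikit-learn search does across the five training rounds.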
5.6.1. Accuracies, Precisions, Recalls, and F-beta Scores
As stated earlier, the dataset used in this case study is highly imbalanced (1:4); hence,
we provide the performance measures for each class over five training and test sets.
Table 5.20 provides the F-beta scores for both the training and test sets with ‘myopic’ as
the negative class. As mentioned earlier, we set the beta value to 1. Tables 5.21-5.23
provide the scores of accuracy, precision, and recall, respectively, for each class.
Table 5.20. F-beta scores for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .801 .779 .669 .796 .782 .765 .055
MAKER-BRB .826 .787 .826 .826 .814 .816 .017
LR .445 .580 .447 .435 .398 .461 .069
SVM .695 .720 .720 .715 .709 .712 .011
NN .792 .781 .796 .784 .777 .786 .008
CT .792 .798 .802 .792 .792 .795 .005
NB .520 .509 .488 .471 .461 .490 .025
KNN .749 .763 .773 .758 .759 .761 .009
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .443 .458 .437 .417 .408 .433 .020
QD .498 .583 .509 .473 .458 .504 .048
Test
MAKER-ER .790 .749 .662 .816 .824 .768 .066
MAKER-BRB .815 .735 .836 .836 .814 .808 .042
LR .396 .504 .398 .413 .498 .442 .054
SVM .715 .665 .710 .720 .740 .710 .028
NN .792 .751 .780 .794 .758 .775 .020
CT .802 .786 .780 .804 .802 .795 .011
NB .393 .497 .449 .544 .536 .484 .063
KNN .754 .742 .749 .764 .753 .753 .008
Weighted KNN .775 .744 .754 .755 .784 .762 .016
LD .386 .431 .388 .405 .509 .424 .051
QD .432 .508 .439 .533 .545 .491 .053
Table 5.21. Accuracies for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .892 .881 .845 .888 .884 .878 .019
MAKER-BRB .898 .883 .897 .895 .896 .894 .006
LR .797 .821 .798 .796 .785 .799 .013
SVM .850 .861 .857 .854 .856 .856 .004
NN .881 .880 .886 .879 .878 .881 .003
CT .885 .887 .886 .883 .884 .885 .003
NB .823 .820 .821 .814 .824 .820 .003
KNN .871 .875 .877 .870 .874 .874 .003
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .796 .803 .796 .793 .786 .795 .003
QD .801 .818 .798 .796 .782 .799 .003
Test
MAKER-ER .884 .867 .838 .900 .906 .879 .027
MAKER-BRB .894 .853 .902 .906 .902 .891 .022
LR .790 .788 .785 .783 .822 .794 .016
SVM .859 .833 .855 .859 .869 .855 .014
NN .884 .863 .877 .888 .869 .876 .010
CT .886 .878 .879 .890 .888 .884 .005
NB .823 .819 .822 .812 .812 .818 .005
KNN .869 .859 .869 .875 .869 .868 .006
Weighted KNN .884 .863 .873 .873 .888 .876 .010
LD .790 .784 .793 .783 .824 .795 .017
QD .792 .784 .779 .814 .810 .796 .016
Table 5.22. Precisions of the test sets for customer behaviour classifiers
Model/Iteration 1st 2nd 3rd 4th 5th Average Stdev
Myopic
MAKER-ER .950 .920 .880 .960 .950 .932 .033
MAKER-BRB .730 .930 .970 .970 .970 .914 .104
LR .740 .820 .800 .800 .820 .796 .033
SVM .730 .890 .900 .900 .910 .866 .076
NN .720 .930 .950 .950 .930 .896 .099
CT .730 .960 .950 .960 .960 .912 .102
NB .660 .820 .810 .830 .830 .790 .073
KNN .730 .930 .920 .930 .930 .888 .088
Weighted KNN .760 .920 .920 .920 .930 .890 .073
LD .750 .800 .790 .800 .820 .792 .026
QD .710 .820 .810 .830 .830 .800 .051
Strategic
MAKER-ER .730 .720 .710 .760 .790 .742 .033
MAKER-BRB .720 .680 .760 .760 .730 .730 .033
LR .740 .630 .690 .660 .860 .716 .090
SVM .730 .670 .720 .730 .750 .720 .030
NN .720 .700 .720 .730 .720 .718 .011
CT .730 .710 .720 .740 .730 .726 .011
NB .660 .630 .660 .740 .710 .680 .044
KNN .730 .700 .730 .740 .720 .724 .015
Weighted KNN .760 .720 .730 .750 .760 .744 .018
LD .750 .660 .690 .670 .870 .728 .087
QD .710 .600 .620 .730 .690 .670 .057
Table 5.23. Recalls of the test sets for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Myopic
MAKER-ER .890 .900 .910 .910 .920 .906 .011
MAKER-BRB .880 .870 .900 .900 .890 .888 .013
LR .970 .920 .960 .950 .980 .956 .023
SVM .910 .890 .910 .910 .920 .908 .011
NN .880 .880 .880 .890 .890 .884 .005
CT .950 .920 .940 .950 .940 .940 .012
NB .880 .880 .890 .890 .890 .886 .012
KNN .900 .880 .900 .900 .900 .896 .012
Weighted KNN .920 .900 .900 .910 .920 .910 .012
LD .970 .940 .960 .950 .980 .960 .012
QD .960 .900 .930 .950 .930 .934 .012
Strategic
MAKER-ER .860 .780 .620 .880 .860 .800 .108
MAKER-BRB .940 .800 .930 .930 .920 .904 .059
LR .270 .420 .280 .300 .350 .324 .062
SVM .700 .660 .700 .710 .730 .700 .025
NN .880 .810 .850 .870 .800 .842 .036
CT .890 .880 .850 .880 .890 .878 .016
NB .280 .410 .340 .430 .430 .378 .066
KNN .780 .790 .770 .790 .790 .784 .009
Weighted KNN .790 .770 .780 .760 .810 .782 .019
LD .260 .320 .270 .290 .360 .300 .041
QD .310 .440 .340 .420 .450 .392 .063
The three highlighted numbers in bold are the first-, second-, and third-best classifiers
based on the corresponding measure. All the classifiers listed in the mentioned tables
demonstrate relatively good performance in terms of accuracy, precision, and recall, except LR,
NB, LD, and QD, which are discussed later. We calculate the average score of all the
performance measures for test sets across all the classifiers and over the five-round
validation: .848, .861, .718, .915, .644, and .655 for accuracy, precision of the myopic
class, precision of the strategic class, recall of the myopic class, recall of the strategic
class, and F-beta score respectively.
As an evaluation metric, the F-beta score provides a single score that considers both
precision and recall, with beta as the weight of recall in the combined score. The F-beta
score lies in the range 0 to 1, with 0 being the worst and 1 being the best.
According to Table 5.20, the MAKER-ER- and MAKER-BRB-based classifiers, CT, and
NN are the four best classifiers based on the F-beta scores, ranging between .768 and .808.
Meanwhile, LR, NB, LD, and QD have scores below the average F-beta score
of .655. The performances of the classifiers for each class are compared as explained
below.
The MAKER-ER- and MAKER-BRB-based classifiers and the classification tree provide
the best performance measures among the alternative classifiers for this
case since both proposed models are among the three best classifiers
in terms of accuracy, precision, and recall for the strategic
class. The average recalls of the MAKER-ER- and MAKER-BRB-based
classifiers for the myopic class are .906 and .888, which are close to the grand average
recall of .915 for the myopic class. However, the performance differences between the
proposed classifiers (i.e., the MAKER-ER- and MAKER-BRB-based classifiers) and the
other alternative classifiers are subtle, except for the recall (i.e., sensitivity) of the
strategic class, which is the minority in this dataset with only about 25%.
The MAKER-BRB-based model produces the highest average recall for the strategic class: .904. The
classifier CT also produces a high sensitivity score: .878. The average recall of the
MAKER-ER-based classifier for the strategic class is .800, which is above the grand
average recall of .644. In addition, NN also provides good performance based on its
average recall for the strategic class, which is .842. These four classifiers produce higher
sensitivity scores than the alternative methods. The classifiers LR, NB, LD,
and QD exhibit the lowest recall scores, between .300 and .392, meaning that only a
few of the actual strategic customers are correctly identified. In addition, for LR, NB, LD,
and QD, the scores of accuracy, precision, and recall (specifically for the
strategic class) are always below the grand average of the corresponding scores.
The explanation above suggests that the MAKER-ER- and
MAKER-BRB-based classifiers and the classification tree outperform the other
alternative methods for customer classification in this dataset at the predefined thresholds
of .5 and 0 used to estimate the performance measures: accuracies, precisions, and recalls.
5.6.2. MSEs and AUCs
In this section, we report the mean square errors (MSEs) and the areas under the curves
(AUCs) of all the classifiers. We also provide the ROC and PR curves of the proposed
classifiers, the MAKER-ER- and MAKER-BRB-based models, compared to the other
alternative machine learning methods. Figure 5.8 illustrates the ROC curves of all the
classifiers of all the test sets of the dataset. As displayed in this figure, there are five
lines with different colours presenting the ROC curves for the test set of each round.
Round 1 features the first fold as the test set, round 2 features the second fold as the
test set, and so on. The diagonal red line represents a random classifier. The further
the line moves from this red diagonal line, or the closer the line moves to the top-left corner
of the plot, the better the classifier is. Figure 5.9 demonstrates the PR curves of all
the classifiers of the test sets over the five-round training process. Similar to the
ROC curves, the five lines in the PR curves represent the PR curve for the test set of
each round. The better the discrimination of the classifier, the closer the line moves to
the top-right corner of the curve (see Section 3.8.3). The grey area in both curves
indicates the dispersion of the curves between rounds with ± 1 standard deviation.
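AUCROC also admits a simple probabilistic reading: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative. A pure-Python sketch of that rank-based computation; the label and score vectors are toy data, not the thesis results:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    AUC = P(score of a random positive > score of a random negative),
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
perfect = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # separable case
random_ish = roc_auc(labels, [0.5, 0.2, 0.8, 0.6, 0.4, 0.7])
```

A perfectly separating classifier scores 1.0, while scores that mix the classes fall toward .5, which corresponds to the red diagonal in Figure 5.8.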
[Figure 5.8 comprises one ROC panel per classifier: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, SVM, NN, CT, NB, KNN, weighted KNN, LD, and QD.]
Figure 5.8. The ROC curves of the MAKER-ER-based classifier, the MAKER-BRB-based classifier, and all the alternative machine learning methods on the test sets of the customer-type dataset
[Figure 5.9 comprises one PR panel per classifier: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, CT, SVM, NN, NB, KNN, weighted KNN, LD, and QD.]
Figure 5.9. The PR curves of the MAKER-ER-based classifier, the MAKER-BRB-based classifier, and all the alternative machine learning methods on the test sets of the customer-type dataset
We display the MSEs and AUCs of the classifiers for the training and test sets of all
five rounds in Tables 5.24 and 5.25, respectively. We report these metrics for both
the training and test sets to check whether overtraining occurs. Since these metrics
on the training sets are similar to those on the test sets over the five rounds, we
conclude that overtraining does not occur. This result signifies that similar to the other
machine learning methods, MAKER-ER- and MAKER-BRB-based models can learn
and generalise the pattern of the data and perform well on unseen data.
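A minimal sketch of the two quantities involved in this check (the exact implementation in the thesis may differ; names are ours):

```python
import numpy as np

def mse(y_true, y_prob):
    """Mean square error between the 0/1 class label and the predicted
    probability of the strategic class."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    return float(np.mean((y_true - y_prob) ** 2))

def generalisation_gap(train_scores, test_scores):
    """Gap between the average train and test scores over the folds;
    a near-zero gap is the informal overtraining check described above."""
    return abs(float(np.mean(train_scores)) - float(np.mean(test_scores)))
```

For example, the MAKER-ER MSE averages of .086 (train) and .085 (test) give a gap of about .001, consistent with no overtraining.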
Table 5.24. The MSEs and AUCs of the prediction models (training set) for customer type classifiers
Train
Model/Iteration 1st 2nd 3rd 4th 5th Average Std CI (95%)
AUCROCs
MAKER-ER .948 .948 .934 .947 .944 .944 .006 .939-.949
MAKER-BRB .945 .942 .943 .942 .943 .943 .001 .942-.944
LR .881 .894 .889 .893 .878 .887 .007 .881-.893
SVM .905 .908 .907 .902 .906 .906 .003 .903-.908
NN .919 .919 .918 .913 .916 .917 .003 .915-.919
CT .912 .915 .915 .909 .914 .913 .003 .911-.915
NB .872 .801 .801 .808 .804 .817 .003 .790-.844
KNN .925 .928 .924 .920 .923 .924 .003 .922-.927
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .003 1.000-1.000
LD .887 .903 .895 .900 .883 .894 .003 .886-.901
QD .885 .890 .884 .877 .879 .883 .003 .879-.887
AUCPRs
MAKER-ER .799 .783 .779 .747 .775 .777 .019 .758-.796
MAKER-BRB .789 .784 .744 .778 .757 .770 .019 .751-.789
LR .675 .709 .689 .690 .658 .684 .019 .665-.703
SVM .717 .722 .725 .712 .724 .720 .005 .715-.725
NN .778 .759 .787 .742 .733 .760 .023 .737-.783
CT .795 .803 .799 .803 .804 .801 .004 .797-.805
NB .682 .687 .660 .656 .641 .632 .019 .613-.652
KNN .779 .790 .764 .776 .778 .778 .009 .768-.787
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.000-1.000
LD .680 .719 .696 .696 .664 .691 .021 .670-.712
QD .692 .698 .670 .662 .647 .636 .021 .615-.657
MSEs
MAKER-ER .080 .081 .103 .081 .083 .086 .010 .077-.094
MAKER-BRB .076 .081 .077 .076 .077 .078 .002 .076-.079
LR .143 .138 .142 .144 .149 .143 .004 .140-.146
SVM .109 .106 .106 .108 .107 .107 .002 .106-.108
NN .091 .089 .091 .092 .092 .091 .001 .090-.092
CT .128 .133 .129 .132 .130 .130 .003 .128-.132
NB .088 .086 .086 .088 .087 .087 .003 .086-.088
KNN .094 .090 .090 .093 .091 .092 .003 .090-.093
Weighted KNN .000 .001 .001 .001 .000 .001 .003 .000-.001
LD .146 .142 .146 .147 .151 .147 .003 .144-.150
QD .175 .167 .176 .178 .187 .177 .003 .171-.183
Table 5.25. The MSEs and AUCs of the prediction models (test set) for customer type classifiers
Test
Model/Iteration 1st 2nd 3rd 4th 5th Average Std CI (95%)
AUCROCs
MAKER-ER .936 .939 .938 .947 .961 .944 .010 .935-.953
MAKER-BRB .937 .928 .942 .941 .950 .940 .008 .933-.946
LR .885 .870 .873 .910 .896 .887 .016 .872-.901
SVM .906 .884 .893 .912 .906 .900 .011 .890-.910
NN .911 .902 .899 .929 .912 .911 .012 .900-.921
CT .908 .903 .903 .914 .902 .906 .005 .901-.910
NB .733 .813 .843 .793 .790 .794 .040 .759-.830
KNN .905 .895 .898 .925 .904 .905 .012 .895-.915
Weighted KNN .908 .907 .904 .927 .926 .915 .011 .905-.924
LD .893 .876 .880 .914 .900 .893 .015 .879-.906
QD .882 .859 .878 .913 .885 .883 .019 .866-.900
AUCPRs
MAKER-ER .745 .763 .790 .731 .865 .779 .053 .726-.832
MAKER-BRB .759 .756 .775 .768 .834 .778 .032 .747-.810
LR .681 .631 .645 .704 .756 .683 .050 .633-.733
SVM .700 .668 .701 .710 .712 .698 .018 .680-.716
NN .695 .718 .707 .741 .733 .719 .019 .700-.738
CT .789 .784 .786 .750 .722 .766 .029 .737-.795
NB .640 .634 .644 .695 .723 .667 .040 .627-.707
KNN .729 .704 .694 .755 .708 .718 .024 .694-.742
Weighted KNN .772 .768 .758 .786 .780 .773 .011 .762-.783
LD .699 .630 .654 .705 .756 .689 .049 .640-.738
QD .649 .628 .661 .715 .725 .675 .043 .633-.718
MSEs
MAKER-ER .087 .089 .097 .080 .070 .085 .010 .076-.094
MAKER-BRB .082 .094 .077 .076 .070 .080 .009 .072-.088
LR .147 .159 .148 .142 .127 .145 .012 .134-.155
SVM .109 .121 .109 .108 .104 .110 .006 .105-.116
NN .093 .097 .098 .085 .095 .094 .005 .089-.098
CT .088 .092 .092 .088 .089 .090 .002 .088-.092
NB .139 .128 .120 .133 .138 .132 .008 .125-.138
KNN .094 .103 .100 .090 .096 .097 .005 .092-.101
Weighted KNN .091 .097 .097 .092 .085 .092 .005 .088-.097
LD .150 .165 .153 .147 .126 .148 .014 .136-.161
QD .185 .201 .185 .163 .159 .179 .018 .163-.194
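The 95% confidence intervals reported in Tables 5.24 and 5.25 are consistent with a normal-approximation interval of mean ± 1.96·std/√5 over the five folds. Assuming that formula (an assumption on our part), it can be sketched as:

```python
import numpy as np

def ci95(fold_scores):
    """Normal-approximation 95% confidence interval for the mean of
    per-fold scores: mean +/- 1.96 * sample std / sqrt(number of folds)."""
    x = np.asarray(fold_scores, float)
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return float(x.mean() - half), float(x.mean() + half)
```

For instance, the five training-set AUCROCs of MAKER-ER (.948, .948, .934, .947, .944) reproduce the reported interval of .939-.949.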
As depicted in the previously mentioned tables, based on the performance of the
comparative analysis on the test sets of the five rounds, the MAKER-ER- and
MAKER-BRB-based models outperform the alternative methods with the average
scores and standard deviations of AUCROCs: .944 (.010) and .940 (.008),
respectively. According to Table 3.3 in Section 3.8.3, an AUC between .9 and 1
indicates excellent discrimination. The MAKER-ER- and MAKER-BRB-based models
can be considered to have excellent discrimination because the average scores of
the AUC are above .9: .944 for the MAKER-ER-based model and .940 for the MAKER-
BRB-based model. Meanwhile, the other machine learning methods perform lower,
with average AUCROCs ranging from .794 to .914; these fall into the 'fair' (.70-.80),
'good' (.80-.90), and 'excellent' (.90-1.0) bands. Similar to the
AUCROCs, the MAKER-ER- and MAKER-BRB-based classifiers outperform the
alternative methods, indicated by the average AUCPR of .779 and .778 respectively.
Both classifiers also exhibit the lowest MSE scores with standard deviations as
follows: .085 (.010) for the MAKER-ER-based model and .080 (.009) for the MAKER-
BRB-based model.
5.7. Summary
This chapter presented the application of the MAKER-ER- and MAKER-BRB-based
models to customer classification in revenue management with two outputs, myopic
and strategic, and four input variables regarding customers’ booking behaviour in the
environment of dynamic pricing. This chapter consisted of six main subsections.
First, we presented the theoretical foundations: identified customer types in revenue
management in response to dynamic pricing, tangible purchase behaviour, and the
booking setting used in the case study. Second, we formulated the conceptual
framework of customer classification including the input variables which may
discriminate customer classes, the data linkage which explains how we can obtain
the desired dataset given the available booking and price records, and the detection
procedure to label the customer types. Third, we introduced the data preparation
including data cleaning, and data partitioning used to obtain five groups for five-fold
cross validation applied for all the classifiers.
Fourth, we performed a statistical test on the dataset obtained in the previous section
to determine whether the four input variables can explain the variance in class
membership and whether the input variables are conceptually correlated. Based on
the statistical test, we also described how we created groups of evidence because
the statistical requirement for joint frequencies between pieces of evidence was
violated such that the groups of evidence formed are statistically correct and
theoretically meaningful. Fifth, according to evidence acquisition from the data,
interdependency indices, belief rule-based inference, maximum likelihood prediction,
and machine learning, we described how to construct the MAKER-ER- and MAKER-
BRB-based classifiers for the hierarchical MAKER framework in which the input data
are split into groups of evidence. Given the optimised referential values and the other
optimised model parameters, such as weights, and with the training set of the first
round, we provided a demonstration for both classifiers.
Sixth, with consideration of the highly imbalanced class distribution (1:4), we analysed
the model performance comparison based on accuracy, precision, recall, F-beta,
AUCs and MSE for all classifiers. Based on the analysis, it is evident that the MAKER-
ER- and MAKER-BRB-based classifiers outperform eight of nine alternative machine
learning methods: LR, SVM, NN, NB, KNN, weighted KNN, LD and QD. Meanwhile,
the classification tree exhibits a similar performance to both classifiers. The MAKER-
ER- and MAKER-BRB-based models, as interpretable and robust classifiers, are
recommended for customer classification.
Chapter 6 Application to Customer Decision
Model
6.1. Introduction
This chapter presents the application of hierarchical rule-based inferential modelling
and prediction based on the MAKER framework for predicting customer decisions in
an environment of dynamic pricing. The chapter is structured as follows. Section 6.2
explains the theoretical framework, including possible decisions considered in the
model, the input variables that potentially influence customer decisions, the
hierarchical MAKER framework, and the data linkage to obtain the desired dataset
from the available data in the system. Section 6.3 describes the data preparation,
including data cleaning and data partitioning. Section 6.4 explains how the proposed
classifiers, namely the MAKER-ER- and MAKER-BRB-based models, were built and
trained in this case study. Section 6.5 presents a comparative analysis of model
performances for the proposed framework and alternative methods. A summary of
this chapter is presented in Section 6.6.
6.2. Conceptual Framework: Input Variables and
Decisions
The conceptual framework of the prediction model for customer decisions in an
environment of dynamic pricing was developed on the basis of the literature. This section
explains the conceptual framework, including the following aspects: customer
decisions; input variables, denoting the factors that might possibly influence customer
decisions in environments of dynamic pricing; and data linkage, which describes how
we obtained the desired dataset from the data available in the system.
The booking setting was the same as that discussed in the previous chapter (see
Section 5.2.3). Customers book a ticket and are given time to pay denoted as the
holding period. They can secure the ticket at the price posted when booking, with zero
deposit, and pay later – at any time before the holding period ends. Otherwise, the
ticket is automatically cancelled. In this setting, strategic customers can intentionally
delay their purchase – that is, their payment of the full price, and strategically wait
until lower prices become available.
Revenue management theory was designed for perishable products, such as airline
seats and hotel rooms: the remaining capacity cannot be stored as inventory once the
selling period is over (Talluri and van Ryzin, 2004). At the same time, companies may
have inflexible or limited capacity, which means more capacity cannot easily be
added to meet high demand in the future. Pricing and capacity allocation are the two
major practices of revenue management (Choi and Kimes, 2002), as a means of
balancing supply and demand under capacity restrictions, demand uncertainty, and
various market conditions in order to maximise profit (Talluri and van Ryzin, 2004).
As explained in Section 2.3, when advance booking is applied, travellers often book
before making full payment. Through a guaranteed reservation, they can secure a
seat. During the period up until the departure date, revenue management practice is
applied in the industry, and thus prices and seat availability change over time.
Customers sometimes look for a better deal and rebook if necessary, replacing an
earlier booking with one at a more favourable price (Toh et al., 2012). To obtain a
lower price, they search and update
their information, and learn and evaluate whether they should change their previous
decision (Cleophas and Bartke, 2011).
This study focuses on an additional phase of the purchase cycle that customers
experience during the purchase decision process. As stated above, the information
search-and-evaluation phases can be repeated even after customers place a
reservation, up until the departure date (Schwartz, 2000, 2006). Customers in
advanced booking settings that offer dynamic pricing face uncertainty regarding price
and other related factors, such as product availability. At the same time, they have
the opportunity to maximise the value of the money they spend (Chen and Schwartz,
2008). In this study, we modelled an additional phase in which, after placing a
guaranteed reservation, customers either continue to make full payment right away
or wait in the hope of getting a better deal in the future.
6.2.1. Input variables
People tend to respond to promotions or to any means of gaining a lower price,
including strategic purchasing behaviour (Choi and Kimes, 2002). In addition,
purchase decision-making requires cognitive evaluation of consequences (Christou,
2011). Relevant information may shape a customer's beliefs or perceptions and
hence may influence their decisions. In this study, we included both internal and
external determinants that might influence the purchase decision.
Advanced booking customers have to bear the risk associated with strategic waiting
while observing whether a lower price will become available. If they choose to wait
and their reservation is cancelled, they need time to book again. During that time, the
ticket might no longer be available due to being sold out. Several researchers have
predicted the propensity to buy based on two customer perceptions: the perception of
risk and the perception of benefit (Aviv, Levin, and Nediak, 2009; Chen and
Schwartz, 2008; and Cleophas and Bartke, 2011, to name a few).
Some researchers have experimented with showing information about the remaining
capacity to the participants (e.g. Mak et al., 2014). However, in reality, customers
cannot access such information and may interpret price changes as showing the risk
of sell-out (Chen and Schwartz, 2008). If customers associate the price changes with
the existence of applied revenue management, they may assume that limited seats
mean high demand, hence an increasing price. Li et al. (2014) implicitly modelled the
likelihood of sell-out by operationalising the lowest posted price as a baseline demand
model. Another possible approach is to use a time element. Customers may perceive
a higher risk of sell-out as the desired departure date approaches. Through a
controlled experiment in the hotel case, Schwartz (2000) demonstrated that the closer
to the date of stay, the higher the willingness to pay and consequently the higher the
propensity to book. Related to this, the decision of strategic customers is also affected
by the level of product scarcity (Dasu and Tong, 2010; Mak et al., 2014). Strategic
customers may perceive a higher risk of sell-out when fewer flights are offered in a day.
Revenue management adopters, including airlines and hotels, add cancellation
policies to anticipate unsold inventory – such as seats or rooms – due to last-minute
cancellations and no-shows (Chen, 2016). This is also an effective strategy,
decreasing the no-show rate by about 8% for airlines and 5% for hotels (Dekay,
Yates, and Toh, 2004). In addition, this policy may affect customer decisions in
advanced booking settings. A controlled experiment by Chen et al. (2011) found that
the effect of cancellation deadlines on customer decisions was statistically
significant. In contrast, cancellation fees did not significantly influence customer
decisions. A lenient cancellation policy might induce customers to search extensively
after buying a fully refundable ticket and to rebook once a lower price is available
(Gorin et al., 2012). If the policy is lenient, such as with a distant deadline or zero fee,
customers are more likely to continue searching than to book right away. Similarly, a
'book-now-pay-later' system without a deposit gives customers time to confirm their
booking, whether or not they finalise it by paying the full price. The duration of the
holding period probably affects customers' tendency to search and to wait for a lower
price.
In addition to the external determinants explained above, researchers have also
considered personal factors. Customers who are exposed to exactly the same
information may make different decisions. The first possible explanation is customers'
differing propensity to respond to methods of obtaining a discounted or lower price
(Lichtenstein, Netemeyer, and Burton, 1990; Lichtenstein, Ridgway, and Netemeyer,
1993). Some customers seek to maximise the value gained from the money spent on
a product or service (Kwon and Kwon, 2013). They may experience positive
emotions, such as pleasure and enjoyment, while looking for a better deal (Fortin,
2000). Regardless of whether they are financially or emotionally motivated, getting a
lower price is their goal in shopping (Chandon, Wansink, and Laurent, 2000; Kwon
and Kwon, 2013). In revenue management theory (see Section 5.2.1), four customer
types are identified by their divergent responses to dynamic pricing. In this study, we
considered only the two relevant to our market.
A second possible reason is that customers exhibit different levels of waiting
patience, or willingness to wait. This is used to express the degree of a customer's
strategic behaviour. Two estimation approaches exist: discrete and numerical.
The first segments the market into discrete customer types. Su (2007) categorised
customers into two patience levels, high and low. In an extension of this model,
Besbes and Lobel (2015) interpreted patience as 'the time they are present in the
system', expressed by a discrete value denoted by ω. A value of ω = 0 represents
customers who are completely impatient, whereas ω = 1 means the customer can
postpone their purchase for one period, and so on. Another approach is to use a
discount factor, equivalent to a waiting cost, expressed as a numerical variable to
measure customer patience (e.g. Levin et al., 2009). However, this approach is
complex in terms of identification and computation (Li et al., 2014).
The third possible reason is customers' divergent emotions. Emotions are defined as
a mental state of readiness that is affected by an individual's assessment of events
and thoughts (Bagozzi, Gurhan-Canli, and Priester, 2002). These subjective feelings
are associated with what customers feel during and after evaluation, purchase, and
consumption (Ruth, 2001). The consequences of and feedback from performing a
certain behaviour, including information processing, evaluation, and justification of
decisions, generate an emotional response. Customers feel satisfied, happy, excited,
and thrilled when they find a better deal than they expected or secure a lower price
than others pay. Regret arises when customers choose to wait but then lose the
opportunity due to sell-out, or when they buy the product immediately and it later
becomes available at a lower price. Anticipated regret may shorten the search, since
it can influence a person's desire to perform a certain task (Bagozzi et al., 2002).
Zeelenberg (1999) discusses regret theory in detail.
Studies in revenue management quantify regret as 'stock-out regret' or 'high-price
regret' (Eren and Parker, 2010; Nasiry and Popescu, 2012; and Özer and Zheng,
2015, to name a few). In brief, anticipated regret is measured by comparing the
perceived probability and the actual probability of stock-out and of a high price (e.g.
Özer and Zheng, 2015). The perceived probabilities result from the customer's
observation. Because we could not obtain customers' perceptions, that is, the
perceived probabilities of stock-out or a higher price, we excluded this factor from the
analysis.
The fourth possible reason is customers' differing attitudes toward risk. Studies of
strategic customers in revenue management typically assume that customers are risk
neutral and thus make decisions that maximise their total expected surplus (Swinney,
2011). Liu and van Ryzin (2008) introduced a degree of risk aversion. Through
experiments, Osadchiy and Bendoly (2011) identified three perceived-risk groups
among forward-looking customers: 1) those who correctly perceived the risks of
waiting, 2) those who underestimated the risks, and 3) those who overestimated
them. Identifying these groups requires knowing the customers' decision-making
patterns. In our study, only three months of transaction data were available, which
was not enough to determine each individual's decision-making pattern; longer-term
data, with at least several purchase records per customer, are needed.
The fifth possible reason is divergent customers’ search cost. Through the Internet,
price comparison sites (PCSs) and other similar facilities, customers have access to
a novel and convenient method of searching for a lower price (Jung, Cho, and Lee,
2014). Hence, the search cost for customers is low and tends towards zero
(Clemons, Hann, and Hitt, 2002). Some researchers have even assumed that
strategic customers incur a homogeneous search cost (e.g. Su and Zhang, 2009). This
factor, in the current situation where technology advances enhance customers’
access to information, seems unlikely to significantly influence the purchase decisions
of strategic customers. However, some studies in revenue management have
considered search costs either experimentally (e.g. Schwartz, 2000) or through
mathematical programming (e.g. Wang et al., 2013). Schwartz suggested quantifying
the indirect component of search costs as 'time and/or energy spent', which is not
necessarily literally 'money spent' during an information search; this was to avoid
misleading the decision modelling. However, no further explanation was offered
about quantifying these indirect costs. In this study, we followed the argument of
Clemons et al. (2002) and excluded this factor from the analysis.
To summarise, the literature identifies three major categories: 1) provider-controlled
factors, 2) uncontrolled factors, and 3) personal factors. These are depicted in Figure
6.1. Provider-controlled factors are all factors that can be manipulated by the
provider, such as airlines and travel agents; this group includes price changes,
information about remaining products or inventory, and the cancellation policy.
Uncontrolled factors are those which cannot be manipulated by the provider but
impact customer decisions, including customer arrival time (i.e. time before
departure) and the number of flights offered in a day. Personal factors are individual
factors specific to customers, which vary from person to person and strongly
influence their purchase decisions.
The framework in Figure 6.1 was refined based on the available data as previously
explained. All the input variables considered in this study are depicted in Figure 6.2
and explained below.
[Figure 6.1 groups the candidate determinants into: provider-controlled factors (price changes; demand information; cancellation policy: deadline and fee); uncontrolled factors (time before consumption date; product scarcity: number of flights offered a day); and personal factors (consumer's response toward dynamic pricing; waiting patience; consumer's regret; consumer risk profile; search cost), feeding the estimated lower rate (ELR) and estimated sell-out risk (ESR), and ultimately the propensity to buy.]
Figure 6.1. Conceptual framework for decisions by advanced booking customers under dynamic pricing
[Figure 6.2 retains: provider-controlled factors (price changes; cancellation deadline, i.e. the length of the holding period); uncontrolled factors (days before departure time; number of flights offered a day); and personal factors (consumer's response toward dynamic pricing; waiting patience), leading to the wait-or-buy decision.]
Figure 6.2. Conceptual framework for decisions by advanced booking customers under dynamic pricing after refinement
Price changes. Chen and Schwartz (2008) examined price-change patterns,
categorised into four conditions: increasing, decreasing, fluctuating, and no change.
The magnitude of the price changes did not affect the categorisation. Other
researchers have utilised price reductions or discounts to express the magnitude of
the benefit customers can potentially obtain if they choose to wait (e.g. Cleophas and
Bartke, 2011). This tactic is usually employed for downward price trends. Li et al.
(2014) included the price average and price volatility, that is, the standard deviation
and coefficient of variation, in their model. The model focused on predicting the
percentage of strategic customers present in the market, with various degrees of
foresight: perfect, strong, and weak. In this study, we used historical price trends to
indicate the magnitude and direction of the price changes. Negative, positive, nearly
zero, and absolute-zero values of the price trend can be interpreted as decreasing,
increasing, relatively stable, and no change respectively. Assuming that customers
observed the price changes after the booking was placed, price trends could be
estimated during the holding period. How this estimation was made is discussed in
the next section.
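The worked example in Figure 6.3 suggests that the price trend is the relative change between consecutive posted prices, e.g. (407,000 − 473,000)/473,000 ≈ −.1395. A minimal sketch under that reading (the averaging step for APT is our assumption):

```python
def price_trend(prev_price, new_price):
    """Relative change between two consecutive posted prices; negative
    means a decreasing price, positive increasing, zero no change."""
    return (new_price - prev_price) / prev_price

def average_price_trend(prices):
    """APT sketch: mean of the consecutive relative changes observed
    during the holding period (the averaging rule is an assumption)."""
    trends = [price_trend(a, b) for a, b in zip(prices, prices[1:])]
    return sum(trends) / len(trends)
```

The two trend values shown in the price records of Figure 6.3 (−.1395 and −.1081) are reproduced by `price_trend` on the corresponding price pairs.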
Length of holding period. A cancellation policy, including the cancellation deadline
and fee, hypothetically impacts strategic purchase behaviour (Chen et al., 2011).
However, in the experiment by Chen et al., only the cancellation deadline statistically
influenced customers' decisions: the stricter the deadline, the higher the tendency to
keep searching rather than book now. The cancellation fee has not been shown to
have a statistically significant impact on customers' decisions. In the case of 'book
now, pay later', the cancellation fee is zero, meaning no deposit is required to place a
booking. However, customers are only allowed to hold the ticket for several hours or
days, denoted by the length of the holding period. This timing element hypothetically
influences customers' propensity to buy the ticket instead of waiting. The longer the
holding period, the stronger their tendency to cancel and rebook, with the expectation
of obtaining a lower price in the future.
Days before departure time. Customer arrival time is defined from the first contact
the customer makes, that is, the time of the first search for a ticket to fulfil their need
(Schwartz, 2000). In this study, we defined first contact as the first booking made by
the customer, as we had no access to customers' search records. Hence, we
assumed that once a customer placed a booking, they had confirmed their travel plan
and narrowed down all possible alternatives, regardless of how long they had
searched for information, and had chosen the best of the alternatives.
The number of available flights. This refers to the number of flights offered on a
day. The number can change dynamically due to sell-outs. In addition, each city pair
has a particular range for the number of flights offered on a day. For example, the
city pair NTXBTH normally provides two flights per day; however, before the
departure date both could be sold out, as shown by a zero value for the minimum
number of flights. Similarly, the city pair SUBCGK provided 52 flights a day, and the
minimum number of flights on record was 50; this means that before the departure
date, two flights had no remaining seats and were no longer visible on the website.
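A sketch of how NF might be derived from one price snapshot (the record layout and field names are assumptions, not the thesis's schema):

```python
from collections import defaultdict

def flights_per_day(price_records):
    """Count the distinct flights (departure time + carrier) still visible
    in a price snapshot, per departure date, for one city pair; sold-out
    flights disappear from the snapshot, so the count can drop below the
    city pair's normal range."""
    flights = defaultdict(set)
    for rec in price_records:
        day = rec["departure"].split(" ")[0]  # e.g. '23/09/17'
        flights[day].add((rec["departure"], rec["carrier"]))
    return {day: len(f) for day, f in flights.items()}
```

Applied to the third price-record block of Figure 6.3 (Batik, Lion, ..., Garuda on 23/09/17), this would count the flights still offered for that departure date.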
Waiting patience. Besbes and Lobel (2015) discretised the waiting patience level
because their model was time-dependent and they sought optimal strategies for each
period; discretisation was the most convenient way to reduce the model's complexity.
To include different levels of waiting patience, illustrating how long customers remain
in the system, in this study we defined 'waiting patience' as the accumulated time for
which the customer was present in the system before their final purchase. The final
purchase was indicated by paying the price in full.
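As an illustration of one possible operationalisation (the thesis's exact accumulation rule is not restated here, so this is an assumption), WPT can be read as the elapsed time from the customer's first booking to the payment in full:

```python
from datetime import datetime

def waiting_patience(first_book_time, full_payment_time):
    """WPT sketch: time the customer was present in the system, read here
    as first booking through to the final, paid-in-full confirmation."""
    return full_payment_time - first_book_time
```

For the passenger in Figure 6.3, the first booking at 20/09/17 21:34:55 and the payment at 21/09/17 21:21:39 give a WPT of roughly 23.8 hours.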
Customer’s responses to dynamic pricing. We considered two customer
responses to dynamic pricing: myopic and strategic. Myopic customers always buy
the product at the price best fitted to their valuation. Strategic customers intend to
time their purchase strategically to obtain an expected lower price. Section 5.3.2
explains the identification of these customer types; the labels obtained from that
process were used in this study.
6.2.2. Decisions
As explained in Section 2.3, the decision-making process in any business that uses
advance booking, such as airlines or hotels, differs from that of other products. The
difference is that the actual purchase is not necessarily the same as placing a
reservation or booking. Customers may gather more information after booking and
might change their decision about whether to keep the reservation. This process may
be repeated until shortly before departure time. According to the literature on revenue
management, three main decision models exist, explained as follows.
Wait or buy. In this concept, it is assumed that customers may choose to postpone
their purchase and never leave the market (e.g. Anderson and Wilson, 2003;
Cleophas and Bartke, 2011). This model focuses on the factors that influence the
odds of a customer choosing to buy rather than wait.
Buy now, wait, or exit. This model extends the 'wait or buy' concept by allowing
customers to choose another alternative, that is, the second-best alternative (e.g.
Chen and Schwartz, 2006; Li et al., 2014; Su, 2007).
Four decisions. Schwartz (2000) introduced the advanced booking decision model
(ABDM) to account for online-savvy customers. In this case, customers face a
booking restriction, that is, a cancellation policy. Customers may place a reservation
after evaluating the product and then take no further action; this is the 'book'
strategy. They may choose to search for information and then decide which product
best fits their needs, which is the 'search' strategy. They may book to minimise the
risk of stock-outs, then continue to search for more information and rebook if a
cheaper price for the same product becomes available; this is the 'book then search'
strategy. Like the other models, Schwartz's model also accounts for people booking
other alternatives, namely the 'exit' strategy.
In the setting of 'book now, pay later' (see Section 5.2.3), payment can be made at
any time between the reservation and the end of the specified holding period. If by
the end of the holding period the customer has not paid for the reservation, the
system automatically cancels it. The customer can evaluate the alternatives and then
book a ticket according to what they believe is the best option. Once the booking is
made, they are likely to monitor the price changes over time until departure, along
with other related information such as product availability. This information may
change their decision about whether to proceed with payment or to cancel the
reservation and book again if a more favourable ticket is offered.
The focus of this study is on developing a model to predict customer decisions and
on understanding what factors affect those decisions; specifically, what factors can
induce customers to buy rather than wait. We included two customer decisions:
'wait' and 'buy'. 'Wait' refers to a customer placing a booking and then letting all or
some of the holding period pass before continuing with the booking for the same
flight, that is, with no changes to their itinerary. 'Buy' refers to a customer placing a
reservation and making payment before the holding period lapses.
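A hypothetical labelling rule consistent with these definitions, using the book-limit and confirmation-time fields of Figure 6.3 (the exact rule used in the thesis may differ):

```python
def label_decision(book_time, book_limit, confirmation_time):
    """Hypothetical labelling: 'buy' if payment (confirmation) arrives
    before the holding period lapses; 'wait' if the reservation runs to
    the book limit unpaid and is auto-cancelled."""
    if confirmation_time < book_limit:
        return "buy"
    return "wait"
```

In the Figure 6.3 example, the first two bookings (confirmation equal to the book limit) would be labelled 'wait', while the third (paid before the limit) would be labelled 'buy'.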
6.2.3. Data linkage
As previously mentioned, in this study we considered six input variables: 1) average
price trend; 2) the length of the holding period; 3) days before departure time; 4)
number of flights offered on a day; 5) waiting patience time; and 6) customer types,
summarised as APT, HP, DD, NF, WPT, and C. In this section, we explain how we
retrieved the desired dataset from the available transaction and price records, as
depicted in Figure 6.3. Codes B1-B9 mean the data were retrieved from transaction
or booking records; codes P1-P9 were from price records. APT, HP, DD, and NF can
be obtained from transaction records.
Booking records (codes B1–B9: Name, Origin-Destination, Departure date & time, Carrier/Provider, Price (Rps), Book time, Book limit time, Status, Confirmation time):

1st AS | CGK-SRG | 23/09/17 18:00 | Batik | 473,000 | 20/09/17 21:34:55 | 21/09/17 07:04:00 | 1 | 21/09/17 07:04:00
2nd AS | CGK-SRG | 23/09/17 18:00 | Batik | 440,000 | 21/09/17 08:25:46 | 21/09/17 17:55:00 | 1 | 21/09/17 17:55:00
3rd AS* | CGK-SRG | 23/09/17 18:00 | Batik | 363,000 | 21/09/17 20:10:53 | 22/09/17 05:40:00 | 2 | 21/09/17 21:21:39

Note: *) In this example, we give only the passenger’s initials for data privacy

Price records (codes P1–P5: Updating date & time, Origin-Destination, Departure date & time, Carrier/Provider, Posted price (Rps)), with the derived price trend:

Price trend for the 1st booking, with B6 ≤ P1 ≤ B9:
20/09/17 22:20:36 | CGK-SRG | 23/09/17 18:00 | Batik | 473,000
20/09/17 02:17:34 | CGK-SRG | 23/09/17 18:00 | Batik | 473,000 | 0
20/09/17 06:15:48 | CGK-SRG | 23/09/17 18:00 | Batik | 407,000 | -.1395

Price trend for the 2nd booking:
21/09/17 10:20:25 | CGK-SRG | 23/09/17 18:00 | Batik | 407,000
21/09/17 14:18:17 | CGK-SRG | 23/09/17 18:00 | Batik | 363,000 | -.1081

Number of flights offered on the day for the 3rd booking (P1 is the time closest to B9):
21/09/17 21:21:27 | CGK-SRG | 23/09/17 18:00 | Batik | 363,000
21/09/17 21:21:27 | CGK-SRG | 23/09/17 15:30 | Lion | 316,800
...
21/09/17 21:21:27 | CGK-SRG | 23/09/17 19:35 | Garuda | 668,500

Derived variables: DD = B3 – B6; HP = B7 – B6.
Figure 6.3. Data linkage for customer decision model
From booking records, we obtained information about the passenger’s name, origin
and destination, departure date and time, carrier or provider, price, booking time,
booking limit time, booking status, and confirmation time. Booking time refers to the
point at which the customer placed a guaranteed reservation to secure a seat.
Booking limit time represents the cancellation deadline. Customers can pay later for
the reserved seat, up to the booking limit time, without worrying about the price
increasing. The status of the booking was denoted by B8 and coded as 1, 2, 3, or 4, meaning cancelled by the system, issued, booked, or cancelled on request, respectively. The time when the status changed – for example, from booked (3) to issued (2) – was recorded in B9. In other words, B9 showed the confirmation time and thus indicated when customers made their decision. ‘Cancelled on request’ differed from ‘cancelled by the system’: the former was a cancellation requested by the customer; the latter occurred if the customer did not pay before the booking limit time.
A detailed example from the records is shown in Figure 6.4. The passenger (AS)
made three bookings for a 23 September 18:00 flight, with Batik Air as the provider,
from CGK to SRG. The first attempt was made on 20 September at 21:34:55. The
provider or agent gave the passenger time to secure the ticket by 21 September
07:04:00 or it would be automatically cancelled by the system. DD refers to the time
gap between booking and departure; that is, DD = B3 – B6. This variable (DD)
indicates how close the booking was to the travel date. HP represents the gap
between booking time (here, 20 September 21:34:55) and the booking limit (21
September 07:04:00), which showed how long the passenger held the seat or waited
before securing it, while checking the price changes. As mentioned earlier, the longer
the holding period, the more likely customers are to continue to search and wait for
lower prices. The values in this example were DD of 2.8508 days and HP of .3952
days.
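The DD and HP calculations above can be reproduced directly from the timestamps in Figure 6.3. A minimal sketch using Python’s standard `datetime` module (the variable names `b3`, `b6`, `b7` follow the column codes; the helper function is illustrative, not the thesis code):

```python
from datetime import datetime

FMT = "%d/%m/%y %H:%M:%S"

def days_between(later: str, earlier: str) -> float:
    """Return the gap between two timestamps as a fraction of days."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds() / 86400.0

# First booking of passenger AS (values from Figure 6.3)
b3 = "23/09/17 18:00:00"  # departure date & time
b6 = "20/09/17 21:34:55"  # booking time
b7 = "21/09/17 07:04:00"  # booking limit time

dd = days_between(b3, b6)  # days before departure: DD = B3 - B6
hp = days_between(b7, b6)  # holding period:        HP = B7 - B6
print(round(dd, 4), round(hp, 4))  # 2.8508 0.3952
```

The printed values match the DD of 2.8508 days and HP of .3952 days quoted above.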
[Figure: timeline of passenger AS’s booking journey. Bookings placed at A1 (20/09/17 21:34:55), A2 (21/09/17 08:25:46), and A3 (21/09/17 20:10:53); booking limits at B1 (21/09/17 07:04:00), B2 (21/09/17 17:55:00), and B3 (22/09/17 05:40:00); decision points C1 = B1, C2 = B2, and C3 (21/09/17 21:21:39). Holding periods of .3952, .3953, and .3952 of a day (the last only partially used: .0491); gaps between attempts of .0568 and .0944 of a day. Cumulative waiting patience at C1, C2, and C3: .3952, .8473, and .9908 of a day.]
Figure 6.4. Example of a booking journey
To examine the price changes over time during the holding period, we collected the prices updated between B6 and B9, that is, B6 ≤ P1 ≤ B9. These results are shown in the second and third panels in Figure 6.3. The price trend was calculated as PT_t = (P_t − P_{t−1}) / P_{t−1}, where PT_t is the price trend at time t and P_t is the posted price updated at time t. APT is the sum of all price trends observed during the holding period divided by the number of observations. For example, for the first booking, the price trends were 0 and -.1395; hence, the APT is the average of those values, -.0698. The APT for the second booking was -.1081.
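The trend and APT computations can be sketched as follows, using the posted prices of the first booking from Figure 6.3 (the helper function is illustrative, not the thesis code):

```python
def price_trends(prices):
    """PT_t = (P_t - P_{t-1}) / P_{t-1} for consecutive posted prices."""
    return [(p - q) / q for q, p in zip(prices, prices[1:])]

# Posted prices observed during the first booking's holding period (Figure 6.3)
prices_first = [473_000, 473_000, 407_000]
trends = price_trends(prices_first)
apt = sum(trends) / len(trends)  # average price trend over the observations
print([round(t, 4) for t in trends], round(apt, 4))  # [0.0, -0.1395] -0.0698
```

The output reproduces the trends 0 and -.1395 and the APT of -.0698 quoted above.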
The input variable NF indicates the number of flights available when the customer made a decision at B9. In addition to conveying information about demand, such information could influence the customer’s perception of product scarcity. Because prices were updated every three to four hours, it was not guaranteed that we could obtain information about the available flights exactly at B9; therefore, we used the update closest in time to B9 to represent the conditions when the customer made their decision. This approach enabled us to obtain the list of all flights available on 23 September from all providers. In this example, there were approximately 23 flights available when the customer made their decision at B9.
The input variables WPT and C were acquired from the booking records, as depicted in Figure 6.3. To illustrate WPT, Figure 6.4 depicts the booking journey of the passenger AS shown in Figure 6.3. For the first attempt, the customer waited the whole holding period and let the ticket be cancelled by the system at B1; hence C1 = B1. At this point, the customer had been present in the system for .3952 of a day. Later, they booked again at A2 after waiting for another .0568 of a day and did not pay at B2. Therefore, they had waited for .8473 of a day at C2. For the last booking, the customer was given .3952 of a day – that is, the time from A3 to B3 – but spent only .0491 of a day; at C3, they decided to buy the ticket. The values of WPT for the first, second, and third bookings were therefore .3952, .8473, and .9908 of a day, respectively. The customer type (C) was obtained from the customers labelled through the systematic tracking in Section 5.3.2. Each customer was labelled as either myopic or strategic, with myopic coded as 0 and strategic as 1.
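The WPT accumulation along this booking journey can be sketched as a running total over the timeline segments of Figure 6.4 (segment values from the source; the segment/decision-point bookkeeping is illustrative, not the thesis code):

```python
# Segments of the booking journey in Figure 6.4, in fractions of a day:
# alternating holding periods and gaps between attempts.
segments = [
    0.3952,  # A1 -> B1: first holding period, fully used (C1 = B1)
    0.0568,  # B1 -> A2: gap before the second booking
    0.3953,  # A2 -> B2: second holding period, fully used (C2 = B2)
    0.0944,  # B2 -> A3: gap before the third booking
    0.0491,  # A3 -> C3: part of the third holding period, until payment
]
decision_after = [0, 2, 4]  # segment indices ending at decision points C1, C2, C3

wpt, total = [], 0.0
for i, seg in enumerate(segments):
    total += seg
    if i in decision_after:
        wpt.append(round(total, 4))  # cumulative waiting patience at each C
print(wpt)  # [0.3952, 0.8473, 0.9908]
```

The accumulated values match the WPT of .3952, .8473, and .9908 of a day for the three bookings.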
6.3. Data preparation
Real-world data may be incomplete, noisy, and inconsistent, which leads to low performance, poor-quality outputs, and useful patterns remaining hidden (Zhang, Zhang, and Yang, 2003). Data preparation is required to deal with such issues and yield quality data. It includes data integration, data transformation, data cleaning, data reduction, and data partitioning (Zhang, Zhang, and Yang, 2003), and is therefore required before model development. This study mainly used data integration, data cleaning, and data partitioning. Data integration is the combination of technical and business processes used to combine data from different sources into the desired dataset, that is, meaningful and valuable information (Hendler, 2014). Section 6.2 presents the data linkage used to obtain the dataset for this study. Data cleaning includes dealing with missing values, noisy data, and outliers, and resolving inconsistencies (Zhang, Zhang, and Yang, 2003). Data partitioning is a technique for dividing the dataset into multiple smaller parts.
The focus of the study was to examine what factors influence the ‘wait’ or ‘buy’ decision in a dynamic-pricing environment. Hence, for data cleaning, we minimised the possibility of customers rebooking for other reasons, such as changes to their travel plan. Customers may choose to cancel, or let the system automatically cancel, their booking, usually because of travel-plan changes. This could bias the model or disguise useful information. Therefore, we considered the ‘waiting’ state only when customers appeared to have a fixed travel plan, evident through no changes being made to the origin-destination, departure date, number of passengers, or adult-child-infant ratio.
For example, suppose a customer makes four booking attempts. On the second attempt, they change their planned departure date, and then they repeatedly delay the purchase until the fourth attempt, when the ticket is issued. In this case, we would omit the first attempt, consider ‘waiting’ to span the second and third attempts, and treat the fourth attempt as the ‘buy’ state. In addition, all records for customers who did not pay in the end were categorised as ‘exit’ and were deleted from our dataset.
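The cleaning rule above can be sketched on a toy sequence of attempts. This is a hedged illustration: the record fields and the status code 2 = issued follow the description in Section 6.2.3, but the data structure itself is hypothetical, not the thesis schema:

```python
# Hypothetical booking attempts for one customer; itinerary = (route, date).
# Status code 2 = issued (paid), per the booking-record coding in Section 6.2.3.
attempts = [
    {"itinerary": ("CGK-SRG", "22/09/17"), "status": 1},  # original travel plan
    {"itinerary": ("CGK-SRG", "23/09/17"), "status": 1},  # plan changed here
    {"itinerary": ("CGK-SRG", "23/09/17"), "status": 1},
    {"itinerary": ("CGK-SRG", "23/09/17"), "status": 2},  # ticket issued
]

final = attempts[-1]
if final["status"] != 2:
    kept = []  # customer never paid: an 'exit', so drop all their records
else:
    # keep only attempts sharing the final (fixed) travel plan
    kept = [a for a in attempts if a["itinerary"] == final["itinerary"]]

# all kept attempts but the last are 'wait' states; the last is 'buy'
labels = ["wait"] * (len(kept) - 1) + ["buy"]
print(len(kept), labels)  # 3 ['wait', 'wait', 'buy']
```

As in the four-attempt example above, the first attempt is omitted and the remaining three become two ‘wait’ states and one ‘buy’ state.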
In data partitioning, we utilised five-fold cross-validation with stratified random sampling. The data were divided into five folds with a similar class distribution. If customers made several attempts, or bought more than once, they would have multiple data points in the dataset. In this condition, it is advisable to shuffle the dataset, that is, to randomly reorganise it. The partitions obtained through k-fold cross-validation with shuffling generally derive from different customers, which prevents the model from learning the patterns of particular customers. We employed stratified five-fold cross-validation with shuffling in Python to partition the dataset into five folds. Each fold was treated in turn as the test set, while the remaining folds acted as the training set. Therefore, we obtained five rounds for each classifier.
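The partitioning step can be sketched in plain Python. This is a minimal re-implementation of what stratified shuffled k-fold splitting (e.g. scikit-learn’s `StratifiedKFold` with `shuffle=True`, a plausible but unconfirmed choice for the thesis workflow) does:

```python
import random

def stratified_kfold(y, k=5, seed=0):
    """Shuffle indices within each class, then deal them round-robin into
    k folds so every fold keeps a similar class distribution."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(y):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)  # shuffling breaks up per-customer runs of records
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

# Toy labels: 0 = 'wait', 1 = 'buy', imbalanced for illustration
y = [0] * 80 + [1] * 20
folds = stratified_kfold(y, k=5)
for f in folds:
    print(len(f), sum(y[i] for i in f))  # each fold: 20 points, 4 'buy'
```

Each fold then serves once as the test set, with the other four folds as the training set.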
6.4. Hierarchical Rule-based Models for Predicting
Customer Decisions
In this section, the building of MAKER-ER- and MAKER-BRB-based classifiers for
predicting customer decisions is demonstrated. A numerical study using the described
dataset is presented in the remainder of this section. As previously stated, we had six
input variables: APT, HP, DD, NF, WPT, and C to predict the customer’s decision
either to wait or to buy. The definitions of these variables and decisions are detailed
in Section 6.2. In addition, the data were shuffled and partitioned based on stratified
random sampling into five folds with similar class distribution. The training set of the
first fold is used here to illustrate how MAKER-ER and MAKER-BRB frameworks were
applied in this case study, that is, to a customer-decision dataset.
6.4.1. Hierarchical MAKER frameworks
A minimum of five cases per cell of the joint frequency matrices between input variables, except for disjoint pieces of evidence, must be satisfied to implement a full MAKER framework. MAKER-ER- and MAKER-BRB-based models are designed for when this statistical requirement is not satisfied. To group input variables, one starts with the input variable having the strongest impact on the model outcome, then adds the other input variables one by one, ensuring that the joint frequency matrices of each pair of input variables within a MAKER model fulfil the five-cases-per-cell requirement. According to Table 6.1, the input variables ranked from strongest to weakest correlation with the output variable were WPT, C, HP, NF, DD, and APT. Hence, customer decisions were most strongly influenced by the personal factors WPT and C. The next input variable, HP, indicated the opportunity for customers to exploit dynamic pricing. Then NF, DD, and APT shaped the customer’s perception of the benefit and risk of strategic waiting, which in turn impacted their purchase decision.
Table 6.1. Descriptive Statistics and Correlation Matrix

Factor | Min | Max | Mean | SD | Decision | WPT | NF | HP | DD | APT
WPT | .000 | 31.796 | .361 | .394 | -.341** | | | | |
NF | 2 | 100 | 18.661 | 1.445 | .041* | .015 | | | |
HP | .006 | 6.732 | .405 | 14.266 | .140** | .223** | .117** | | |
DD | .182 | 63.627 | 3.225 | .875 | .026 | .243** | .112** | .577** | |
APT | -.563 | 1.674 | .007 | 5.698 | .015 | .034* | .045** | .094** | .079** |
C | | | | | .415** | -.497** | .058** | -.066** | -.113** | -.014

Note: * correlation is significant at .05 (2-tailed); ** correlation is significant at .01 (2-tailed)

Based on this order, we added the input variables one by one to WPT until all the joint
frequency matrices between input variables had five cases per cell, except for those
where pieces of evidence were disjoint due to structural zeros. Input variables which
could not satisfy this condition were excluded and formed another group of evidence.
In this way, we defined groups of evidence as depicted in Figure 6.5. The MAKER-generated output of each group of evidence gives the probabilities of a customer choosing to wait or to buy. These outputs were then aggregated to suggest a final inference on whether the customer chooses to wait or buy, given the values of the six input variables.
[Figure: two diagrams of the hierarchical MAKER framework. In both, six input variables – waiting patience time, average price trend, the length of the holding period, days before departure time, number of flights offered in a day, and consumer type – feed three MAKER-based classifiers, each producing the MAKER-generated outputs Buy (g) and Wait (g) with probabilities p_g and 1 − p_g (g = 1, 2, 3). One diagram, the MAKER-BRB-based model, routes these outputs through a set of rules (k) before the final inferences Buy (f) and Wait (f); the other, the MAKER-ER-based model, combines the outputs directly into the final inferences. Legend: (1), (2), (3) = outputs generated by groups 1, 2, 3, respectively; (f) = final inference.]
Figure 6.5. Hierarchical MAKER framework for customer decision prediction
6.4.2. Optimised Referential Values of the Model
This section demonstrates how to develop a classifier based on MAKER-ER- and
MAKER-BRB-based systems for a customer-decision model. A numerical study is
presented here, using the dataset explained in Section 3.3. We split the input variables into three groups of evidence: WPT and APT as group 1, HP and DD as group
2, and NF and C as group 3. The output variable was customer decision: wait or buy.
‘Wait’ means the customer chose not to pay and rebooked in the future. ‘Buy’ means
the customer paid before the holding period ended.
As explained above, the data were partitioned into five folds with similar class
distribution, with the data shuffled beforehand. For the purpose of illustration, we use
the first fold as an example in this section. The model parameters – that is, referential
values and weights – were assigned to develop a MAKER framework. We used the
optimised parameters of the first fold as an example.
Discretization is often applied to transform quantitative data into qualitative data, so that learning becomes more efficient and effective. The input variable ‘customer type’ was qualitative (nominal) data with numerical codes: 0 for myopic and 1 for strategic. The other input variables were numerical. Hence, a discretization technique with referential values was applied to all input variables except C.
Referential values consist of lower and upper boundaries of the observed values of
the input variables for the dataset, and any values between those boundaries. The
boundaries can be set based on minima and maxima of the observed values for input
variables of the whole dataset. Alternatively, experts can determine the boundaries.
In this study, we set the percentiles of 1% and 99% for lower and upper boundaries
respectively. Table 6.2 shows that the minima and the 1st percentile of the observed
values of the input variable WPT were .0001 and .0005. The 99th percentile and
maxima of WPT (observed) were 4.7441 and 31.7962. Almost all the customers – 99% – in the dataset were present in the system for fewer than 4.7441 days. One customer booked long before the departure date and waited up to 31.7962 days. Hence, waiting for more than 4.7441 days was treated as equivalent to waiting 4.7441 days.
Table 6.2. Percentiles of the dataset

Input variable | 0% | 1% | 99% | 100%
WPT | 1 × 10⁻⁴ | 5 × 10⁻⁴ | 4.7441 | 31.7962
APT | -.5632 | -.1077 | .2154 | 1.6738
HP | .0065 | .0099 | 5.4128 | 6.7321
DD | .1818 | .3526 | 32.8692 | 63.6265
NF | 2 | 2 | 68 | 100
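The effect of replacing the raw min/max with the 1st and 99th percentiles can be sketched on toy data (the `percentile` helper and the data values are illustrative, not the thesis code; they only show how one extreme observation, like the 31.7962-day WPT, is trimmed):

```python
def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    frac = pos - lo
    return s[lo] if frac == 0 else s[lo] + frac * (s[lo + 1] - s[lo])

# Toy sample with one extreme outlier, loosely echoing the WPT distribution
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 31.8]
lower, upper = percentile(data, 1), percentile(data, 99)
print(round(lower, 3), round(upper, 3))  # 0.109 29.019
```

The 99th-percentile upper boundary sits well below the raw maximum of 31.8, so a single extreme waiter no longer stretches the referential-value range.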
As explained in Section 3.7, the model parameters – including weights and referential values – were optimised through sequential least squares programming (SLSQP) with randomly set initial parameters. The MSE score was used as the objective function; that is, Equations (4.23) and (4.24) were used for the MAKER-ER- and MAKER-BRB-based models respectively. At each iteration, the algorithm searches for a new solution in the direction indicated by the evaluated MSE score. The algorithm was run for 200 iterations or until a tolerance of .0001 was reached.
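The SLSQP setup described above can be sketched with `scipy.optimize.minimize`, a standard SLSQP implementation in Python (the thesis does not name its solver library, so this is an assumption). The objective below is a stand-in quadratic, not the real MAKER MSE, which requires evaluating the full model; the bounds reuse the WPT, APT, HP, DD, and NF boundaries from Table 6.2:

```python
import numpy as np
from scipy.optimize import minimize

# Lower/upper boundaries for the five numerical input variables (Table 6.2)
lower = np.array([0.0005, -0.1077, 0.0099, 0.3526, 2.0])
upper = np.array([4.7441,  0.2154, 5.4128, 32.8692, 68.0])

def mse(ref_values):
    """Toy stand-in for the MAKER MSE objective: distance to a placeholder
    'best' split point per variable (here, the midpoint of each range)."""
    target = (lower + upper) / 2
    return np.mean((ref_values - target) ** 2)

rng = np.random.default_rng(0)
x0 = lower + rng.random(5) * (upper - lower)  # randomly set initial parameters

result = minimize(mse, x0, method="SLSQP",
                  bounds=list(zip(lower, upper)),
                  options={"maxiter": 200, "ftol": 1e-4})  # 200 iters, 1e-4 tol
print(result.success, round(result.fun, 6))
```

The real objective would replace `mse` with the MAKER-ER or MAKER-BRB model evaluation of Equations (4.23)/(4.24); the SLSQP call itself is unchanged.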
The target of the optimisation of both the MAKER-ER- and MAKER-BRB-based models is to maximise the likelihood of the true state of a training set, and thereby to minimise the MSE scores, which denote the difference between model outputs and observed values. Optimising the referential values of each input variable amounts to finding how to divide each input variable so that the observations of a given class form the majority within each interval. More trained referential values can reasonably improve the classifier; however, the associated cost is increased model complexity. In this case, we used one optimised referential value per input variable, because adding more referential values did not significantly improve the AUC scores but caused higher model complexity; sparser joint frequency matrices were also found when more referential values were added. In addition, with only two adjacent referential values – that is, no trained referential value – the model can only approximate a monotonic function; at least one trained referential value is required to approximate a non-monotonic function.
Figure 6.6 illustrates the scatter plot for the first training set across the six input variables. There are three grids, one per group of evidence: WPT-APT (left), HP-DD (middle), and NF-C (right). Red dots represent ‘buy’ and blue ‘wait’. The vertical and horizontal lines indicate the optimised referential values of the input variables. As shown in the figure, these lines split the data into several regions; because the referential values are optimised through the MAKER-ER- and MAKER-BRB-based classifiers, each region indicates where the majority of a class is placed. Because there was one trained referential value for each input variable, each panel shows four regions.
Figure 6.6. Scatter plot for observed data, with plotted optimised referential values for each input variable in the optimisation of MAKER-ER-based model from the customer – decision dataset.
Figure 6.7. Scatter plot for observed data, with plotted optimised referential values for each input variable in the optimisation of MAKER-BRB-based model from the customer – decision dataset.
Generally, the figures above illustrate that different data patterns existed for records in different classes across the input variables of the dataset. For the WPT-APT evidence, most of the ‘wait’ decisions (blue dots) with long waiting-patience times were distributed around a near-zero price trend, while ‘buy’ decisions with short waiting-patience times (meaning that the customer bought almost immediately) were mainly distributed well above or below a stable, near-zero price trend.
For the HP-DD group of evidence, a linear relationship existed between HP and DD: the further the booking time was from the departure date, the longer the holding period. The data distribution was dense for DD values below 10 days and HP values below 1 day. It is also interesting to note that longer holding periods did not necessarily make customers wait and book again; the majority class in the upper-right corner of the second grid is red dots. Furthermore, the bottom-left corner, in which customers faced last-minute selling – that is, bookings very close to the departure date with a short holding period – was occupied by blue dots. Hence, other situational factors were affecting these decisions.
For the C-NF group of evidence, strategic customers generally made ‘wait’ decisions, as shown by blue dots dominating the upper line (strategic customers, denoted by 1). Myopic customers tended to make ‘buy’ decisions, as shown by more red dots in the lower line (myopic customers, denoted by 0). For NF, both classes were generally distributed over the same range.
The horizontal and vertical lines denote the optimised referential values of the input
variables of the respective training set. As stated earlier, the optimisation of referential
values with respect to MSE score led to data separation of the observations for each
input variable. Hence, the majority of a class fell within the same value range for each
input variable. As shown in Figures 6.6 and 6.7, the optimised referential values are
generally located around the separation point between classes: wait and buy.
For the following sections, optimised referential values and other model parameters
of the training set of the first fold of both MAKER-based classifiers are taken as an
example to demonstrate how MAKER-ER- and MAKER-BRB-based models are
constructed for customer decision prediction of a given dataset. The next section
discusses the MAKER-based model according to four aspects: 1) evidence
acquisition from data; 2) evidence interdependence; 3) belief-rule inference; and 4)
inference of the top hierarchy, including ER rule and BRB rule inference.
6.4.3. Evidence Acquisition from Data
Section 4.3 explains the MAKER framework with referential values as a discretization
method for numerical data. As already stated, the referential values of each input
variable in numerical data must be defined to acquire evidence from a dataset.
Referential values as model parameters can initially be set based on expert
knowledge or can be randomly generated. They can then be trained using historical
data under an optimisation objective (Xu et al., 2017). For illustration purposes, we
used the solution of optimisation for the MAKER-BRB-based model of the first round.
It included weights and an optimised referential value for each input variable of the
training set.
Table 6.3 depicts the optimised referential values used for this illustration. The referential values include the boundary referential values – the lower and upper boundaries (see Section 6.4.2) – and a trained referential value, which lies between the boundaries. The number of trained referential values can be changed depending on the desired balance between model complexity and performance, as well as the statistical requirements (see Section 6.4.2).
Table 6.3. Optimised referential values obtained from the MAKER-based models of the first round
Note: N/A = not applicable (the input variable C was discrete)
In addition, the customer decision dataset included an input variable with nominal data, that is, customer type (C). Evidence acquisition with nominal data is simpler than for numerical data. In the data transformation in Equation (4.7), the term a_{n,l,i}^k represents the degree to which the nth input value of the lth input variable (x_{n,l}) belongs to the referential value A_i^l; in other words, it shows how close x_{n,l} is to A_i^l. For the input variable C, the values of a_{n,l,i}^k are either 0 or 1. For example, if an observation of C is 1 (strategic), then S_1 = {(A_1^6, 0), (A_2^6, 1)}; if C is 0 (myopic), then S_1 = {(A_1^6, 1), (A_2^6, 0)}.
To acquire evidence from a dataset, the first step is to transform each input value of
each input variable of the training set using Equation (4.1). The input value is located
between two adjacent referential values of the respective input variable. The belief
distributions termed ‘similarity degree’ are calculated with respect to the adjacent
referential values. The second step is to aggregate the similarity degrees for each
referential value under the different classes of the training set, according to Equation (4.2).

Input variables | WPT | APT | HP | DD | NF | C
Lower boundary | .0005 | -.1077 | .0010 | .3526 | 2 | 0
Optimised referential values (MAKER-ER-based model) | 1.0203 | .05 | .7 | 1.6 | 16.2358 | N/A
Optimised referential values (MAKER-BRB-based model) | .5614 | .009 | .5783 | 1.4534 | 15.98 | N/A
Upper boundary | 4.7441 | .2154 | 5.5320 | 32.8692 | 68 | 1

The frequencies of the referential values of each input variable, under the different classes of the output variable of the training set, can then be generated. Table 6.4 displays the frequencies of the referential values of the input variable WPT, using the trained referential values of the first round as an example.
Table 6.4. The frequencies of the referential values of the input variable WPT

Class \ Referential value | .0005 | .5614 | 4.7441
Wait | 1984.4742 | 395.9835 | 66.5432
Buy | 293.0559 | 219.3981 | 51.5460
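The first transformation step – distributing an input value over its two adjacent referential values – can be sketched as follows, using the trained WPT referential values above and a hypothetical input of 0.3 (the helper function is illustrative of the Equation (4.1)-style rule, not the thesis code):

```python
def similarity_degrees(x, refs):
    """Distribute input x over its two adjacent referential values; the
    resulting similarity degrees sum to 1, all other degrees are 0."""
    refs = sorted(refs)
    x = min(max(x, refs[0]), refs[-1])  # clip to the boundary values
    degrees = [0.0] * len(refs)
    for i in range(len(refs) - 1):
        lo, hi = refs[i], refs[i + 1]
        if lo <= x <= hi:
            degrees[i] = (hi - x) / (hi - lo)
            degrees[i + 1] = 1.0 - degrees[i]
            break
    return degrees

# Referential values of WPT from Table 6.4, and a hypothetical input value
refs_wpt = [0.0005, 0.5614, 4.7441]
print([round(d, 4) for d in similarity_degrees(0.3, refs_wpt)])  # [0.466, 0.534, 0.0]
```

Aggregating these degrees over all training observations, per class, yields frequency tables such as Table 6.4.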
Next, to calculate the likelihood of a referential value of an input variable being
observed given that a class of the output variable is true, Equation (4.9) was applied.
The procedure was repeated for all referential values for all input variables of the
training set in the dataset. Once we knew the likelihood of a referential value of an
input variable, we used Equation (4.10) to calculate the probability of the respective
referential value pointing to a class of the output. Table 6.5 presents the likelihood of
the referential values: .0005, .5614, and 4.7441 of the input variable WPT as an
example.
Table 6.5. The likelihoods of the referential values of the input variable WPT

Class \ Referential value | .0005 | .5614 | 4.7441
Wait | .8110 | .1618 | .0272
Buy | .5196 | .3890 | .0914
Figure 6.8 depicts the individual support from each piece of evidence for different class memberships, obtained from the probability of each referential value of each input variable of the training set. Table 6.6 exhibits the probabilities for the referential values .0005, .5614, and 4.7441 of the input variable WPT, and Table 6.7 for the referential values -.1077, .009, and .2154 of the input variable APT.
Figure 6.8. Individual support of referential values of each input variable of the training set of the first fold of the customer decision dataset
Table 6.6. The probabilities of the referential values of the input variable WPT

Class \ Referential value | .0005 | .5614 | 4.7441
Wait | .6095 | .2938 | .2293
Buy | .3905 | .7062 | .7707

Table 6.7. The probabilities of the referential values of the input variable APT

Class \ Referential value | -.1077 | .009 | .2154
Wait | .4870 | .5013 | .5053
Buy | .5130 | .4987 | .4947
Several pieces of evidence can be acquired from the probabilities calculated above. The probabilities of the referential values of the input variables represent the degree to which the respective referential values point to different class memberships. For example, the probabilities of the lower boundary of APT, -.1077, are .4870 and .5130 for the ‘wait’ and ‘buy’ decisions respectively (see Table 6.7). Hence, if an observation has an input APT value of -.1077, the probability of the customer choosing ‘wait’ is .4870 and of choosing ‘buy’ is .5130.
6.4.4. Analysis of Evidence Interdependence
The six input variables of the customer-decision dataset (WPT, APT, HP, DD, NF, and C) express provider-controlled, uncontrolled, and personal factors that influence customer decisions in a dynamic-pricing environment. Predicting customer decisions based on only one input variable is likely to be insufficient, as a single variable cannot explain much of the variance in customer decisions. Therefore, it is necessary to combine multiple pieces of evidence to predict customer decisions.
In evidential reasoning (ER), the general assumption when combining two pieces of
evidence is that the two pieces of evidence are independent from each other. Using
the MAKER framework, this assumption can be relaxed. The interdependence index
can be calculated using Equation (4.14). To generate the interdependence index for
a pair of evidential elements, the first step is to calculate the degree of similarity for
input values for the combination of evidential elements, using Equation (4.12). Then,
joint probability for the pair of evidential elements is obtained by applying Equation
(4.13). Subsequently, Equation (4.14) estimates the interdependence index for a pair
of evidential elements.
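Equation (4.14) is not reproduced in this chunk, but the tabulated values are consistent with the interdependence index being the joint probability of a pair of evidential elements divided by the product of their marginal probabilities. A sketch under that assumption, reproducing one entry of Table 6.9 from Tables 6.6–6.8:

```python
# Marginal and joint probabilities for the pair {WPT = .0005, APT = -.1077}
# under the 'wait' class (Tables 6.6, 6.7, and 6.8)
p_wpt_wait = 0.6095    # P(wait | WPT = .0005),  Table 6.6
p_apt_wait = 0.4870    # P(wait | APT = -.1077), Table 6.7
p_joint_wait = 0.6295  # joint probability,      Table 6.8

# Assumed form of Equation (4.14): joint / (product of marginals);
# an index of 1 would indicate full independence, 0 disjoint evidence
alpha = p_joint_wait / (p_wpt_wait * p_apt_wait)
print(round(alpha, 4))  # 2.1208, matching Table 6.9's 2.1209 up to input rounding
```

The small discrepancy against Table 6.9 comes from the four-decimal rounding of the tabulated inputs.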
Table 6.8 displays the joint probabilities of all pairs of evidential elements of the input variables WPT and APT pointing to the different classes of the output variable, namely ‘wait’ and ‘buy’. These joint probabilities are calculated from the frequencies of the different combinations of referential values of the pieces of evidence from WPT and APT under different class memberships. The frequencies have at least five cases unless two pieces of evidence are disjoint, such as {4.7441, .2154}; the same applies to the combination of referential values {.0010, 32.8692}. This means those pieces of evidence are disjoint under the different classes. Therefore, we defined inequality constraints for all the combinations of referential values of the input variables of each group of evidence, except for the structural zeros: the combination of referential values {4.7441, .2154} for evidence group 1 and the combination {.0010, 32.8692} for evidence group 2.
Table 6.8. Joint probabilities for different combinations of referential values from the input variables WPT and APT

Class \ {WPT, APT} | {.0005, -.1077} | {.0005, .009} | {.0005, .2154} | {.5614, -.1077} | {.5614, .009} | {.5614, .2154} | {4.7441, -.1077} | {4.7441, .009} | {4.7441, .2154}
Wait | .6295 | .6063 | .6608 | .2800 | .2934 | .3268 | .2149 | .2311 | 0
Buy | .3705 | .3937 | .3392 | .7200 | .7066 | .6732 | .7851 | .7689 | 0
Table 6.9. Interdependence indices for referential values of the input variables WPT and APT

Class \ {WPT, APT} | {.0005, -.1077} | {.0005, .009} | {.0005, .2154} | {.5614, -.1077} | {.5614, .009} | {.5614, .2154} | {4.7441, -.1077} | {4.7441, .009} | {4.7441, .2154}
Wait | 2.1209 | 1.9844 | 2.1457 | 1.9574 | 1.9926 | 2.2015 | 1.9246 | 2.0106 | 0
Buy | 1.8494 | 2.0216 | 1.7556 | 1.9872 | 2.0061 | 1.9269 | 1.9856 | 2.004 | 0
Table 6.10. Interdependence indices for referential values of the input variables HP and DD

Class \ {HP, DD} | {.0010, .3526} | {.0010, 1.4534} | {.0010, 32.8692} | {.5783, .3526} | {.5783, 1.4534} | {.5783, 32.8692} | {5.5320, .3526} | {5.5320, 1.4534} | {5.5320, 32.8692}
Wait | 2.0404 | 1.9983 | 1.8145 | 2.2374 | 2.0464 | 1.5913 | 0 | 2.0573 | 1.6175
Buy | 1.9662 | 1.9984 | 2.2368 | 1.6856 | 1.9518 | 2.5751 | 0 | 1.9349 | 2.5698
Table 6.11. Interdependence indices for referential values of the input variables NF and C

Class \ {NF, C} | {2, 0} | {2, 1} | {16.2358, 0} | {16.2358, 1} | {68, 0} | {68, 1}
Wait | 1.5278 | 4.0076 | 1.3381 | 3.8792 | .9519 | 4.6195
Buy | 3.5462 | 1.3297 | 4.0599 | 1.3737 | 5.3558 | 1.1082
Thereafter, we calculated the interdependence index for a pair of evidential elements
with respect to different class membership. The probabilities obtained from the
previous steps (see Section 6.4.3), displayed in Tables 6.6, 6.7, and 6.8, are a basic
probability distribution of the input variable WPT, a basic probability distribution of the
input variable APT, and joint probabilities of the pair of pieces of evidence from WPT
and APT respectively. The interdependence indices between each piece of evidence
from the input variable of WPT and the input variable of APT was obtained by
Equation (4.21).
Table 6.9 indicates that the interdependence index values for the input variables WPT and APT generally ranged from 1 to 3, meaning that these input variables were nearly independent of each other, except for the combination of referential values {4.7441, .2154}, which had an index value of 0 (i.e. disjoint). According to Table 6.10, the input variables HP and DD were likewise generally nearly independent of each other, with interdependence indices between 1 and 3; the combination of referential values {5.5320, .3526} was disjoint. Meanwhile, the input variables NF and C were only moderately independent of each other, with interdependence indices ranging from 1 to 6, as presented in Table 6.11.
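The exact form of Equation (4.21) is not reproduced in this excerpt. A common way to quantify interdependence between two pieces of evidence is the ratio of their joint probability to the product of their marginals, where 1 indicates independence and 0 indicates disjoint evidence; the sketch below uses that plain ratio. Note that the thesis's index reports values around 2 for "nearly independent" pairs, so Equation (4.21) evidently applies a different normalisation; treat this as an illustrative assumption, not the thesis's formula.

```python
def interdependence_index(p_joint, p_x, p_y):
    """Plain ratio of joint probability to the product of marginals.

    1.0 -> the two pieces of evidence are statistically independent;
    0.0 -> they are disjoint (never observed together).
    NOTE: an illustrative sketch only; the thesis's Equation (4.21)
    appears to use a different normalisation (typical values near 2).
    """
    if p_x == 0 or p_y == 0:
        return 0.0  # a piece of evidence that never occurs is treated as disjoint
    return p_joint / (p_x * p_y)
```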
6.4.5. Belief Rule Base
The sections above discussed how we acquired evidence from all six input variables and analysed the interdependence indices for pairs of pieces of evidence. The next step was the development of a belief rule base from which inferences could be drawn. We applied the belief rule explained in Section 0. In this case study, the 'IF' part of Equation (4.22), A^k = A_1^k ∧ A_2^k ∧ … ∧ A_{T_k}^k, should be interpreted as a combination of referential values of the input variables from a group of evidence, that is, 'if the input value of each input variable equals a referential value of that input variable'. The combination of referential values is termed a packet antecedent 𝐴𝑘. The 'THEN' part of Equation (4.22) expresses the probability of each consequent, that is, {(D_1, β_{1,k}), (D_2, β_{2,k}), …, (D_N, β_{N,k})}. This should be interpreted as the probability of a customer with the corresponding input values choosing 'wait' or 'buy'. Using evidence group 1 as an example: if the input value of WPT equals a referential value of the input variable WPT and the input value of APT equals a referential value of the input variable APT, then the probabilities of the customer choosing to buy and to wait are β_{1,k} and β_{2,k} respectively.
To obtain the probabilities of a customer choosing to buy and to wait, the MAKER rule is used to combine pieces of evidence from a group of evidence, taking the interdependency of each pair of pieces of evidence into account through Equation (4.16). Through Equation (4.18), we can obtain the weights of the combined evidence from the probability mass 𝑚𝜃,𝑒(𝐿) and the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙, or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿). These weights are used for inference in the next section and are termed rule weights, denoted by 𝜃𝑘. For example, with the calculation explained in Section 4.3.3, the
probabilities of the group-1 combination of referential values {.0005, -.1077} were .9059 for choosing to buy and .0941 for choosing to wait. We also obtained a probability of .9583 for 'buy' and .0417 for 'wait' for the group-2 combination {.5783, 32.8692}. For the group-3 combination {68, 1}, the probabilities of choosing to buy and to wait were .2563 and .7437 respectively.
A belief rule base for evidence groups 1, 2, and 3 is provided in Tables 6.12, 6.13, and 6.14 respectively. It is worth noting that the trained referential values are solutions of the optimisation of the MAKER-based classifier. In this section, we also utilised other optimised model parameters, such as the weights of the input variables. Each group of evidence contained two input variables, each with three referential values, namely the lower boundary, the trained referential value, and the upper boundary. This yielded nine combinations for each group of evidence, except group 3, which had six: in that group, the input variable C had only two referential values, 0 for myopic and 1 for strategic. Each combination contains one referential value from each input variable within a group of evidence.
The 'THEN' part consists of the consequents 'buy' and 'wait', with the corresponding probability values obtained through the MAKER rule. To build these belief rules, we utilised the optimised model parameters, including the weights and referential values of each input variable, which are solutions of the optimisation of the MAKER-based classifiers. The results are depicted in Tables 6.12, 6.13, and 6.14. The following section explains how we drew a BRB inference from an observation.
Table 6.12. The belief rule base of the first group of evidence and the belief rules activated by an observation from the customer-decision dataset: {.2946, .1193}
Antecedent Belief degree
Rule 𝐴1 (WPT) 𝐴2 (APT) Buy Wait 𝛼𝑘
1 .0005 -.1077 .9059 .0941 0
2 .0005 .009 .9626 .0374 .2133
3 .0005 .2154 .8632 .1368 .2524
4 .5614 -.1077 .2786 .7214 0
5 .5614 .009 .3787 .6213 .2417
6 .5614 .2154 .2352 .7648 .2926
7 4.7441 -.1077 .2315 .7685 0
8 4.7441 .009 .3227 .6773 0
9 4.7441 .2154 .2195 .7805 0
Table 6.13. The belief rule base of the second group of evidence and the belief rules activated by an observation from the customer-decision dataset: {.3954, 1.9816}
Antecedent Belief degree
Rule 𝐴3 (HP) 𝐴4 (DD) Buy Wait 𝛼𝑘
1 .0010 .3526 .5533 .4467 0
2 .0010 1.4534 .3806 .6194 .1426
3 .0010 32.8692 .8961 .1039 .0067
4 .5783 .3526 .9068 .0932 0
5 .5783 1.4534 .8624 .1376 .8319
6 .5783 32.8692 .9583 .0417 .0188
7 5.5320 .3526 .9074 .0926 0
8 5.5320 1.4534 .8760 .1240 0
9 5.5320 32.8692 .9622 .0378 0
Table 6.14. The belief rule base of the third group of evidence and the belief rules activated by an observation from the customer-decision dataset: {62, 1}
Antecedent Belief degree
Rule 𝐴5 (NF) 𝐴6 (C) Buy Wait 𝛼𝑘
1 2 0 .8978 .1022 0
2 2 1 .0478 .9522 0
3 15.98 0 .9152 .0848 0
4 15.98 1 .0731 .9269 .0949
5 68 0 .6414 .3586 0
6 68 1 .2563 .7437 .9051
Table 6.15. Two adjacent referential values of each input variable of an observation from the customer decision dataset: {.2946, .1193, .3954, 1.9816, 62, 1}
WPT APT HP DD NF C
.0005 .009 .0010 1.4534 15.98 0
.5614 .2154 .5783 32.8692 68 1
6.4.6. BRB Inference with Referential Values
We constructed BRBs for three groups of evidence, as depicted in Tables 6.12, 6.13,
and 6.14, through the MAKER framework. We were then able to draw an inference
from a BRB for each observation in the dataset. For discrete or nominal data, drawing an inference from a BRB is direct. For example, if the input vector matches the combination 'High ∧ Low ∧ High', the probabilities are obtained as 𝑝1 = 𝛽1𝑘 for consequent 1 and 𝑝2 = 𝛽2𝑘 for consequent 2, taken from the 𝑘th rule whose IF part is 'High ∧ Low ∧ High'. In this study, however, all the input variables were numerical and were discretized through a referential-value-based data-discretization method. The inference process with this kind of discretization differs from that for discrete or nominal data.
Belief rule bases for the three groups of evidence were developed in the previous
section. Each belief rule base is constructed from a packet antecedent 𝐴𝑘, which is a
combination of referential values of input variables and its corresponding probabilities
of consequents. Based on this form, we needed to transform the numerical data into combinations of referential values of the input variables. First, we calculated the similarity degree for each observed value of each input variable; an input value can be transformed using Equation (4.7). The similarity degree indicates the degree to
which an input value matches each of the referential values. An observation with input
values: {.2946, .1193, .3954, 1.9816, 62, 1} for WPT, APT, HP, DD, NF, and C
respectively with referential values defined in Table 6.3 was taken as an example.
The observation has two adjacent referential values for each input variable, as
depicted in Table 6.15. Using Equation (4.12), we calculated the joint similarity degree between the observation and the combination of referential values (the packet antecedent) of each belief rule. These values represent the individual matching degrees, indicating the degree to which an observation is close to a packet antecedent 𝐴𝑘, and are denoted by 𝛼𝑘 for the 𝑘th rule.
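Equations (4.7) and (4.12) are not reproduced in this excerpt. The sketch below assumes the usual referential-value transformation, a linear split of an input value over its two adjacent referential values, and a product-form joint similarity; both are common readings, not the thesis's exact formulas.

```python
def similarity_degrees(x, refs):
    """Distribute an input value over its two adjacent referential values
    (an Equation (4.7)-style transformation, assumed linear here).
    `refs` is the list of referential values; returns {ref: degree}."""
    refs = sorted(refs)
    if x <= refs[0]:
        return {refs[0]: 1.0}
    if x >= refs[-1]:
        return {refs[-1]: 1.0}
    for lo, hi in zip(refs, refs[1:]):
        if lo <= x <= hi:
            d_hi = (x - lo) / (hi - lo)  # closeness to the upper reference
            return {lo: 1.0 - d_hi, hi: d_hi}

def joint_similarity(degrees_per_var, packet):
    """Joint similarity alpha_k of an observation to a packet antecedent:
    the product of the per-variable similarity degrees (one common
    reading of an Equation (4.12)-style joint matching degree)."""
    alpha = 1.0
    for degrees, ref in zip(degrees_per_var, packet):
        alpha *= degrees.get(ref, 0.0)  # 0 if the rule uses a non-adjacent reference
    return alpha
```

With the example observation, `similarity_degrees(0.2946, [0.0005, 0.5614, 4.7441])` distributes the WPT value over its two adjacent referential values, and only packet antecedents built from adjacent references receive a non-zero joint similarity.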
Each input value was discretized according to its distance from its two adjacent referential values. Hence, between 1 and 2^N belief rules were activated, where N is the number of input variables, for an input vector in which each observed value lies between two adjacent referential values. If an observation exactly equals a combination of referential values of the input variables, only one belief rule is activated. Since each group of evidence contained two input variables with three referential values each, between 1 and 2^2 = 4 belief rules were activated. For the third group of evidence specifically, an input value of the variable C always exactly equalled a referential value, either 0 (myopic) or 1 (strategic). Therefore, for this group of evidence, between 1 and 2^1 = 2 of its six belief rules were activated.
To illustrate, applying Equation (4.12) yields a joint similarity degree for each belief rule in the BRB of each group of evidence, as depicted in Tables 6.12, 6.13, and 6.14 for the observation {.2946, .1193, .3954, 1.9816, 62, 1}. Four combinations of the activated adjacent referential values in each of Tables 6.12 and 6.13 have 𝛼𝑘 > 0, whereas the other combinations of referential values have 𝛼𝑘 = 0. Two combinations of referential values of the third group of evidence are activated, as presented in Table 6.14. At this
point, for each belief rule we have: 𝛼𝑘, the individual matching degree to which the input values belong to the packet antecedent 𝐴𝑘; the weights of the combined pieces of evidence of 𝐴𝑘, obtained from the probability mass 𝑚𝜃,𝑒(𝐿), the probability 𝑝𝜃,𝑒(𝐿), and the probability mass 𝑚𝑃(Θ),𝑒(𝐿); and the probability of each consequent resulting from the combination of the pieces of evidence in 𝐴𝑘. Hence, the weights of the pieces of evidence affect the weight of each belief rule activated by an observation.
After obtaining the activated belief rules with their joint similarity degrees and weights, the next step is to combine these belief rules to predict the probability of each consequent, that is, the probability of a customer choosing to buy or to wait. First, we calculated the updated weight, denoted by 𝜔𝑘, of each belief rule in the BRB, based on the joint similarity degrees and the associated rule weight 𝜃𝑘, using Equation (3.11); the term 𝐿 refers to the number of belief rules in the BRB. The term 𝜔𝑘 measures the degree to which the packet antecedent 𝐴𝑘 in the 𝑘th rule is triggered by an observation. As stated in the previous section, the weights of the input variables contribute to the weight of each belief rule, and therefore to the updated weight of each belief rule, which governs the degree to which that rule contributes to predicting the probability of each consequent. Second, given the updated weight of each belief rule and the associated probability of each consequent, we combined those pieces of evidence using the conjunctive MAKER rule, as shown in Equation (4.16). The output of this framework is the probability of a customer choosing to buy or to wait.
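The two steps above can be sketched as follows. The activation-weight formula is the standard one used in belief-rule-base systems, which Equation (3.11) appears to follow; the final aggregation, however, is only a placeholder weighted average, since the conjunctive MAKER rule of Equation (4.16) also carries evidence weights and reliabilities and is not reproduced in this excerpt.

```python
def updated_weights(alphas, thetas=None):
    """Activation weight of each rule: w_k = theta_k * alpha_k / sum_j theta_j * alpha_j,
    where alpha_k is the joint similarity (matching degree) and theta_k the rule weight."""
    if thetas is None:
        thetas = [1.0] * len(alphas)  # equal rule weights if none are trained
    raw = [t * a for t, a in zip(thetas, alphas)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

def combine_rules(weights, beliefs):
    """Placeholder aggregation: a weighted average of the consequent beliefs
    of the activated rules. The thesis instead combines rules with the
    conjunctive MAKER rule (Equation (4.16)), which is NOT this simple average."""
    n_consequents = len(beliefs[0])
    return [sum(w * b[j] for w, b in zip(weights, beliefs)) for j in range(n_consequents)]
```

Rules with zero matching degree drop out automatically, since their updated weight is zero.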
For example, for the observation from evidence group 1, {.2946, .1193} for WPT and APT respectively, a probability of .6007 was obtained for 'buy' and .3993 for 'wait'. For the observation from evidence group 2, {.3954, 1.9816} for HP and DD respectively, the probabilities were .8468 for 'buy' and .1532 for 'wait'. For the third group of evidence, with the input values {62, 1}, we obtained probabilities of .2387 for 'buy' and .7613 for 'wait'. At this point, the probability of a customer choosing to buy or to wait was obtained based on some but not all input variables in the input system.
Generating the probability of each consequent from all input variables of the input
system considered together is discussed in the following section.
6.4.7. Inference on the Top Hierarchy
The previous section demonstrated obtaining the probability of each consequent as a
result of evidence combinations for some but not all input variables in the system. As
depicted in Figure 4.1, a system consists of groups of evidence, each of which has
several input variables. In the lower levels of the hierarchy, each group of evidence
performs prediction. Thus, each group of evidence generates the probability of each
consequent of the output system. We first acquired the MAKER-generated outputs from the input variables of each group of evidence. We could then combine these outputs into a final inference at the top of the hierarchy: the probability of a customer choosing to buy or to wait, with all input variables considered. We provide two combination methods for the top hierarchy, namely the ER-based model and the BRB-based model, as depicted in Figure 6.5.
According to the previous section, we can acquire the probabilities generated by the
MAKER rule from input variables of a group of evidence. In other words, an
observation of the input variables of a group of evidence generated the probabilities
pointing to class membership. Therefore, we acquired evidence from the observation.
We obtained the same number of pieces of evidence as the number of groups of
evidence in the hierarchy. To combine the MAKER-generated outputs for each group
of evidence, we calculated their weights as described below.
As previously explained, within a group of evidence, the weights of input variables
have an impact on the weight of each belief rule. On the basis of the degree of joint
similarity between an observation and each belief rule, and the weight of each belief
rule, we calculated the updated weight of each belief rule. These updated weights
measure the degree to which a belief rule is activated or triggered in predicting the
probability of each consequent. In the next step, the MAKER-generated outputs for
each group of evidence were obtained by combining the activated belief rules using
the conjunctive MAKER rule, shown in Equation (4.16).
We obtained the weight of each group of evidence from the probability mass 𝑚𝜃,𝑒(𝐿)
and the probability 𝑝𝜃,𝑒(𝐿) for 𝜃 ⊆ Θ, 𝜃 ≠ 𝜙 or from the probability mass 𝑚𝑃(Θ),𝑒(𝐿),
using Equation (4.18) when combining the activated belief rules. Given those pieces
of evidence and their weights, we used Equation (4.16) to combine the evidence and
generate the probability of each consequent, considering all input variables in the
system. The weights of the input variables within a group of evidence affect the weights of the activated belief rules. Through the conjunctive MAKER rule, we combined the activated belief rules and calculated the weight of the combined belief rules, which serves as the weight of that group of evidence. In the top hierarchy, the MAKER-generated outputs were then combined using the weights of the three groups of evidence. The final inference therefore took the weights of all input variables in the system into account.
This study examined three groups of evidence, each of which yielded an inference
based on the groups’ input variables. At this point, we should have three outputs; that
is, the probabilities of a customer choosing to buy or to wait. In the previous section,
the example observations were {.2946, .1193}, {.3954, 1.9816}, and {62, 1} for the
first, second, and third groups of evidence, respectively. We obtained the MAKER-generated outputs through the procedures explained in the previous sections, as follows: {(1, .7582), (2, .2418)} for the group WPT-APT; {(1, .5968), (2, .4032)} for the group HP-DD; and {(1, .57), (2, .43)} for the group NF-C. Using their weights and Equation (4.16), we generated the probabilities of .7496 for 'buy' and .2504 for 'wait' as the final output of the system for the observation {.2946, .1193, .3954, 1.9816, 62, 1}. These probabilities were obtained with all the input variables in the system being considered. Because of this method of evidence combination, this framework is termed the MAKER-ER-based classifier, as seen in Figure 6.5.
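For intuition, the simplest conjunctive combination of the three group-level probability distributions is a Dempster-style product-and-normalise over the two classes, sketched below. This is an assumption for illustration only: the MAKER rule of Equation (4.16) additionally weights each piece of evidence, so this naive sketch will generally not reproduce the reported .7496/.2504 output.

```python
def conjunctive_combine(prob_dists):
    """Naive conjunctive (Dempster-style) combination of singleton
    probability distributions over the same classes:
    multiply per class across pieces of evidence, then normalise.
    The MAKER rule (Equation (4.16)) also applies evidence weights,
    which this sketch deliberately omits."""
    classes = list(prob_dists[0].keys())
    prod = {c: 1.0 for c in classes}
    for dist in prob_dists:
        for c in classes:
            prod[c] *= dist[c]
    z = sum(prod.values())  # normalising constant (mass not in conflict)
    return {c: v / z for c, v in prod.items()}
```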
The following description explains how we obtain a final inference using BRB. As
depicted in Figure 4.1, there are several groups of evidence, each of which consists
of input variables. As stated above, each group of evidence generates the probability
of each consequent. We can draw inferences based on the concept of a belief rule
base.
• The construction of the belief rule base
To construct a belief rule base, we follow the expression of the extended IF-THEN rule described in Section 0, specifically Equation (4.22). At this level, the packet antecedent of the belief rule, written as A^k = A_1^k ∧ A_2^k ∧ … ∧ A_{T_k}^k, should be read as 'if each group of evidence points to a particular class'. Therefore, the number of antecedents in this belief rule base equals the number of groups of evidence in the system; in this study, there were three groups of evidence and hence three antecedents in the BRB. Furthermore, {(D_1, β_{1,k}), (D_2, β_{2,k}), …, (D_N, β_{N,k})}, k = 1, …, L, should be read at this level as 'the probability of a customer choosing to buy or to wait, given the values of the antecedents'. Alternatively, it is the probability of a customer choosing either class membership, 'buy' or 'wait', given the results of each group of evidence.
'Antecedent' at this level refers to the outputs generated by each group of evidence. These outputs refer to class membership, so the number of antecedent combinations equals K^G, where K is the number of outputs in the output system and G is the number of groups of evidence in the system. In this study, there were two class memberships as outputs and three groups of evidence, giving 2^3 = 8 belief rules, as depicted in Table 6.16.
Suppose that A^1 is the output generated by evidence group 1. The term A_1^1 = 1 corresponds to the group of evidence pointing to class 'buy' (k = 1), and A_2^1 = 2 to the group pointing to class 'wait' (k = 2). Similarly, A_1^2 = 1 and A_2^2 = 2 mean that evidence group 2 points to 'buy' and 'wait' respectively. As we lacked prior knowledge of the belief degrees assigned to each consequent, denoted by 𝛽𝑗,𝑘 for the 𝑗th consequent in the 𝑘th rule as shown in Equation (4.22), we constructed the BRB as follows. Logically, given an observation, if all groups point to the same class, the observation of all input variables fully points to that class; for example, the 1st and 8th belief rules generate a probability of 1 for 'buy' and for 'wait' respectively. If the groups of evidence point to different classes, the observation of all input variables does not point exactly to a particular class, so the probability of each consequent can range from 0 to 1. These belief degrees can be trained together with the other model parameters. For initialization, we used the belief degrees shown in Table 6.16. Table 6.17 provides the optimised belief degrees of the belief rule base; we use these optimised belief degrees as the example in this section.
Table 6.16. Initial belief rule base of the top hierarchy for the customer-decision dataset
No. Antecedent Consequent
𝐴1 𝐴2 𝐴3 ‘to buy’ (k=1) ‘to wait’ (k=2)
1 1 1 1 1 0
2 1 1 2 .75 .25
3 1 2 1 .75 .25
4 1 2 2 .25 .75
5 2 1 1 .75 .25
6 2 1 2 .25 .75
7 2 2 1 .25 .75
8 2 2 2 0 1
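The initial rule base in Table 6.16 can be generated mechanically: one rule per combination of the three binary antecedents, with the initial belief in 'buy' stepping down (1, .75, .25, 0) as more antecedents point to 'wait'. A minimal sketch:

```python
from itertools import product

def initial_top_brb():
    """Rebuild the initial belief rule base of Table 6.16: one rule per
    combination of the three group-level antecedents (1 = 'buy', 2 = 'wait').
    The initial belief in 'buy' depends only on how many antecedents
    point to 'wait': 0 -> 1.0, 1 -> .75, 2 -> .25, 3 -> 0.0."""
    buy_by_waits = {0: 1.0, 1: 0.75, 2: 0.25, 3: 0.0}
    rules = []
    for antecedents in product((1, 2), repeat=3):  # same order as Table 6.16
        n_wait = sum(a == 2 for a in antecedents)
        buy = buy_by_waits[n_wait]
        rules.append((antecedents, (buy, 1.0 - buy)))
    return rules
```

`itertools.product((1, 2), repeat=3)` enumerates the antecedent combinations in exactly the row order of Table 6.16.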
If the output of each group of evidence fully points to a class membership, inference with the BRB can be made directly. However, the probability of each consequent generated by a group of evidence can range from 0 to 1; this probability measures the degree to which an evidence group points to a class membership.
• The calculation of joint similarity degree
As an example, we use the observed values {.2946, .1193, .3954, 1.9816, 62, 1} and
the optimised model parameters obtained from the MAKER-BRB-based model,
including optimised referential values in Table 6.3. The observed values {.2946,
.1193} for WPT and APT generated the probabilities {.6007, .3993}. These results mean that this observation belongs to A_2^1 to a low degree (.3993) and to A_1^1 to a high degree (.6007).
The above procedure allowed us to obtain the belief distribution of antecedents. We
applied Equation (4.12) to obtain the degree of joint similarity between the outputs
generated by each group of evidence and the combination of antecedents of each
belief rule. For example, based on the probabilities obtained from the groups of
evidence 1, 2, and 3: {(1, .6007), (2, .3993)} for WPT-APT; {(1, .8468), (2, .1532)} for
HP-DD; and {(1, .2387), (2, .7613)} for NF-C respectively, we obtained the degrees of
joint similarity shown in Table 6.17. These joint similarity degrees activated all eight belief rules.
Table 6.17. Optimised belief rule base of the top hierarchy and the belief rules activated by the three MAKER-generated outputs: {(1, .6007), (2, .3993)}; {(1, .8468), (2, .1532)}; and {(1, .2387), (2, .7613)}
No. Antecedent Consequent
𝐴1 𝐴2 𝐴3 ‘to buy’ (k=1) ‘to wait’ (k=2) 𝛼𝑘
1 1 1 1 1 0 .1214
2 1 1 2 1 0 .3873
3 1 2 1 .9991 .0009 .0220
4 1 2 2 0 1 .0701
5 2 1 1 .7575 .2425 .0807
6 2 1 2 .4818 .5182 .2574
7 2 2 1 0 1 .0146
8 2 2 2 0 1 .0466
• Making final inference from activated belief rules
As in the previous section, these values were used to calculate the updated weight of each belief rule. Rule weights, denoted by 𝜃𝑘, can be trained; in this study, however, we set equal rule weights. The joint similarity degree determines how strongly each activated belief rule contributes to the inference. The joint similarity is calculated from the outputs generated by each group of evidence, each of which covers some but not all input variables in the system; by combining the outputs in this way, the inference takes all input variables in the system into account. The probabilities {(1, .6007), (2, .3993)}, {(1, .8468), (2, .1532)}, and {(1, .2387), (2, .7613)} obtained from the observation {.2946, .1193, .3954, 1.9816, 62, 1} generated the following prediction of class membership: .7158 for 'buy' and .2842 for 'wait'.
As mentioned in Section 6.4.2, the set of model parameters in this study consisted of 1) a trained referential value for each input variable of the system and 2) the weights of the evidential elements (referential values) of the input variables for the MAKER-ER-based classifier. An additional model parameter for the MAKER-BRB-based classifier is the set of trained belief degrees of each consequent of each relevant belief rule, where Σ_{i=1}^{N} β_{i,k} = 1. The trained referential values were utilised to obtain pieces of evidence, which were then combined in the upper level of the hierarchy. Given the optimised weights of the evidential elements of the input variables of each evidence group, we generated the probability of each consequent. For each group of evidence, the weights of the input variables influenced the updated weight of each belief rule activated by an observation, and hence the predicted probabilities of the classes of the output system.
6.4.8. The Interpretability of Hierarchical MAKER Frameworks
In the MAKER-ER-based classifier, given the probabilities generated by the MAKER rule for each group of evidence and the weight of the combined activated belief rules of each group, we can make predictions in the upper level of the hierarchy, that is, calculate the probabilities pointing to the different class memberships with all input variables in the system considered. The weights of the input variables of each evidence group affect the updated weights of the activated belief rules, and the weight of the combined activated belief rules of each group affects the inference derived in the upper level.
In the MAKER-BRB-based classifier, the probabilities generated by the MAKER rule for each group of evidence show the degree to which the input variables of that group point to each class of the output system. We then calculate the joint similarity for each combination of the antecedents. Given the trained belief degrees and the joint similarities, we can make predictions in the upper level of the hierarchy; these are inferred from all the input variables in the system.
Through these two approaches, the MAKER-ER-based and MAKER-BRB-based models, we make the predicted outputs, that is, the predicted probabilities of each class of the output system, as close as possible to the true observed outputs of the training set by minimising the MSE score. Through this optimisation process, the model parameters, including the referential values of the input variables and the weights of the evidential elements for both classifiers, together with the trained belief degrees of the relevant belief rules for the MAKER-BRB-based model specifically, are trained using historical data.
In this study, given the optimised (i.e. trained) referential values of the six input
variables, we constructed the MAKER-based classifier to illustrate how to acquire
pieces of evidence from data. On the basis of the referential values and other
optimised solutions – that is, the weights and the belief degrees of each consequent
in the BRB of the top hierarchy, we used MAKER-ER- and MAKER-BRB-based
models to draw inferences. The process has been described in this section. An
example used earlier was {.2946, .1193, .3954, 1.9816, 62, 1}. The predicted
probabilities for this example for each class were {.7496, .2504} for the MAKER-ER-
based model and {.7158, .2842} for the MAKER-BRB-based model. Based on the
process established in these classifiers, we concluded that the MAKER-ER- and
MAKER-BRB-based classifiers offered an interpretable approach. They integrated
statistical analysis when acquiring pieces of evidence, the measurement of
interdependencies between pairs of pieces of evidence, belief rule-based inference
in the MAKER rule, maximum likelihood prediction, and machine learning.
Even with the input variables in the system split into multiple groups of evidence, the
inference process established for both classifiers combined all pieces of evidence
from the lower level in the hierarchy. In every combination process of pieces of
evidence from the bottom to the top of the hierarchy, the knowledge embedded in a
piece of evidence, including its weights, was continuously forwarded until the final
inference in the top hierarchy. Hence, we concluded that the predicted outputs of the system for both classifiers resulted from an inference process involving all the input variables in the system, with knowledge-representation parameters embedded
in each piece of evidence. These parameters were the weights, referential values,
and consequent belief degrees.
6.5. Model comparisons
In this section, the performance of the MAKER-ER- and MAKER-BRB-based models is compared with that of other common machine learning methods for classification: LR, SVM, CT, NN, NB, KNN, weighted KNN, LD, and QD. The customer-decision dataset from the revenue-management case was used.
As already stated, the customer-decision dataset was partitioned into five folds with shuffled stratified cross-validation, to obtain nearly equal class distributions in each fold. The training and test sets of the five rounds were generated from these five folds, and the optimised parameters were then applied to the test sets. Hence, we compared all the classifiers based on their performance over the five test sets across the five rounds.
In this section, we report accuracies, precisions, and recalls, with a threshold value of .5 for the probability-based classifiers; for SVM, a threshold value of 0 is used. For imbalanced data, the best possible outcome is high precision and recall scores together. We also present the MSE scores under which the MAKER-ER- and MAKER-BRB-based models were optimised. In addition, we report the area under the receiver operating characteristic curve (AUCROC), because this metric provides a better measure than accuracy alone, and the area under the precision-recall curve (AUCPR), which is better suited to imbalanced data. The AUC value ranges from .5 to 1: the closer the AUC score is to 1.0, the more accurate the classifier, while an AUCROC score of .5 indicates a random classifier. Further explanation of these measures appears in Section 3.8.
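AUCROC can be computed without plotting the curve, via the rank-sum (Mann-Whitney) identity: it equals the probability that a randomly chosen positive scores above a randomly chosen negative. A minimal from-scratch sketch (the thesis presumably uses a library implementation):

```python
def auc_roc(scores, labels):
    """AUCROC via the rank-sum identity; labels are 1 (positive) / 0 (negative).
    0.5 corresponds to a random ranking, 1.0 to a perfect one.
    Tied scores receive the average rank of their tie block."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1  # extend over the block of tied scores
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```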
For SVM, NN, KNN, weighted KNN, and CT, we also tuned the hyperparameters of the classifiers. We utilised GridSearchCV in scikit-learn (Python) to find the optimal hyperparameters based on the five-round model-training method. Using accuracy as the only performance measure would be misleading, since this dataset is highly imbalanced (1:4.5). We therefore used the F-beta score, the weighted harmonic mean of precision and recall, and set beta to 1 in this study. The hyperparameters with the highest F-beta score on the held-out data after the five-round training method were selected. The hyperparameters of the above-mentioned classifiers are discussed in Section 2.3, and Table 6.18 lists the selected values.
Table 6.18. The selected hyperparameters of CT, SVM, KNN, Weighted KNN, and NN for customer decision models
Classifier Selected hyperparameters
CT The maximum depth = 4; the minimum number of samples per leaf = 50;
the minimum number of samples per split = 170
SVM Penalty parameter C = 1; the kernel type is linear.
KNN k = 25
Weighted KNN k = 33
NN Multilayer perceptron is selected. The number of hidden layers
= 1; the number of neurons in the hidden layer = 6; the activation
function is linear.
6.5.1. Accuracies, Precisions, Recalls, and F-beta Scores
Table 6.19 presents the F-beta scores with a beta value of 1 as the weight parameter; we define 'wait' as the negative class. Tables 6.20 - 6.22 present the accuracy, precision, and recall scores respectively for each class over the five train-and-test sets. These measurements used a threshold value of .50 for the probability-based classifiers and 0 for SVM, given the binary outcome, the 'buy' and 'wait' classes. The machine learning models presented in the tables were selected based on the F-beta score, as previously explained, and the models with the best hyperparameters were then compared against the MAKER-based models.
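The thresholded metrics above follow directly from the confusion-matrix counts. A minimal sketch, assuming labels coded 1 ('buy', positive) and 0 ('wait', negative) and thresholding the predicted positive-class probability at .5:

```python
def precision_recall_fbeta(y_true, prob_pos, beta=1.0, threshold=0.5):
    """Precision, recall, and F-beta for the positive class, obtained by
    thresholding the predicted positive-class probability. With beta = 1
    this is the F1 score (the harmonic mean of precision and recall)."""
    y_pred = [1 if p >= threshold else 0 for p in prob_pos]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta
```

Swapping the label coding gives the same metrics for the 'wait' class, as reported in Tables 6.21 and 6.22.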
For a comprehensive evaluation of each classifier’s performance, we used the average score of each performance metric – that is, accuracy, precision, recall, and F-beta score – for each classifier. The three numbers highlighted in bold indicate the first-, second-, and third-best classifiers on the corresponding measure. Overall, the average scores across all the classifiers over the five test sets were .351 for the F-beta score, .829 for accuracy, .848 for the precision of the ‘buy’ class, .615 for the precision of ‘wait’, .961 for the recall of ‘buy’, and .255 for the recall of ‘wait’.
The results showed that the MAKER-based classifiers were among the best three classifiers in terms of the F-beta score, accuracy, precision of the ‘buy’ class, and recall of the ‘wait’ class. The average precision for ‘wait’ in the MAKER-ER- and MAKER-BRB-based classifiers was .594 and .596 respectively, close to the grand average of .615 across all the classifiers over the five test sets. For the recall of ‘buy’, the average scores of the MAKER-ER- and MAKER-BRB-based models were .952 and .948 respectively, close to the average of .961 across all the classifiers over the five test sets.
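Computing precision and recall separately for the ‘buy’ and ‘wait’ classes, as reported above, can be sketched as follows; this minimal implementation is illustrative, and the function name is an assumption.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, classes=("buy", "wait")):
    """Precision and recall computed separately for each class,
    treating that class as the positive class in turn."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        out[c] = {"precision": float(precision), "recall": float(recall)}
    return out
```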
Table 6.19. F-beta scores for customer behaviour classifiers
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .409 .411 .418 .382 .420 .408 .015
MAKER-BRB .420 .456 .422 .393 .426 .423 .022
LR .330 .283 .308 .290 .305 .303 .018
SVM .185 .142 .157 .157 .187 .166 .020
NN .344 .326 .346 .310 .338 .333 .015
CT .471 .501 .471 .475 .462 .476 .015
NB .364 .336 .362 .342 .349 .351 .012
KNN .371 .314 .346 .357 .298 .337 .031
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .351 .341 .343 .336 .351 .344 .007
QD .515 .370 .343 .524 .343 .419 .092
Test
MAKER-ER .337 .507 .355 .366 .429 .399 .070
MAKER-BRB .423 .383 .378 .475 .399 .412 .040
LR .307 .274 .292 .333 .314 .304 .022
SVM .138 .161 .203 .192 .147 .168 .028
NN .309 .327 .313 .364 .329 .328 .022
CT .462 .542 .474 .417 .487 .476 .045
NB .368 .332 .364 .378 .296 .348 .034
KNN .276 .290 .280 .287 .364 .299 .036
Weighted KNN .388 .445 .344 .323 .447 .390 .057
LD .324 .363 .336 .371 .324 .344 .022
QD .452 .357 .319 .528 .311 .393 .094
Table 6.20. Accuracies for customer decision models
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Train
MAKER-ER .838 .834 .838 .831 .834 .835 .003
MAKER-BRB .834 .839 .837 .826 .832 .834 .005
LR .833 .830 .831 .826 .829 .830 .003
SVM .819 .819 .819 .819 .821 .819 .001
NN .829 .830 .831 .825 .832 .829 .003
CT .862 .855 .861 .863 .861 .860 .003
NB .823 .822 .822 .820 .825 .822 .002
KNN .841 .836 .838 .838 .832 .837 .003
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000
LD .827 .826 .827 .822 .827 .826 .002
QD .833 .828 .827 .831 .827 .829 .003
Test
MAKER-ER .809 .847 .816 .837 .842 .830 .017
MAKER-BRB .826 .819 .814 .854 .837 .830 .016
LR .823 .831 .827 .842 .829 .830 .007
SVM .813 .824 .822 .826 .805 .818 .009
NN .816 .831 .819 .842 .825 .827 .010
CT .856 .857 .862 .854 .862 .858 .004
NB .826 .819 .824 .834 .814 .823 .008
KNN .823 .834 .816 .832 .844 .830 .011
Weighted KNN .818 .836 .814 .819 .839 .825 .011
LD .813 .829 .822 .841 .822 .825 .010
QD .819 .824 .816 .832 .819 .822 .007
Table 6.21. Precisions of the test sets for customer decision models
Model/Iteration 1st 2nd 3rd 4th 5th Average Stdev
Buy
MAKER-ER .850 .880 .850 .850 .860 .858 .013
MAKER-BRB .860 .850 .850 .870 .850 .856 .009
LR .840 .840 .840 .840 .840 .840 .000
SVM .820 .830 .830 .830 .820 .826 .005
NN .840 .840 .840 .850 .850 .844 .005
CT .860 .880 .860 .860 .870 .866 .009
NB .850 .840 .850 .850 .840 .846 .005
KNN .840 .840 .840 .840 .850 .842 .004
Weighted KNN .850 .860 .850 .840 .860 .852 .008
LD .840 .850 .840 .850 .840 .844 .005
QD .870 .850 .840 .890 .840 .858 .022
Wait
MAKER-ER .480 .640 .520 .680 .650 .594 .088
MAKER-BRB .560 .530 .510 .740 .640 .596 .094
LR .570 .700 .630 .800 .620 .664 .089
SVM .500 .770 .650 .750 .400 .614 .160
NN .520 .640 .540 .750 .580 .606 .093
CT .770 .680 .840 .820 .800 .782 .063
NB .580 .540 .560 .630 .500 .562 .048
KNN .590 .740 .530 .710 .750 .664 .098
Weighted KNN .520 .610 .510 .540 .620 .560 .051
LD .500 .600 .560 .720 .550 .586 .083
QD .520 .570 .520 .560 .530 .540 .023
Table 6.22. Recalls of the test sets for customer decision models
Model\Iteration 1st 2nd 3rd 4th 5th Average Stdev
Buy
MAKER-ER .940 .950 .940 .970 .960 .952 .013
MAKER-BRB .940 .940 .930 .970 .960 .948 .016
LR .960 .980 .970 .990 .970 .974 .011
SVM .980 .990 .990 .990 .970 .984 .009
NN .950 .970 .960 .980 .960 .964 .011
CT .980 .950 .990 .990 .980 .978 .016
NB .960 .950 .950 .960 .950 .954 .005
KNN .970 .990 .960 .980 .980 .976 .011
Weighted KNN .930 .950 .940 .960 .950 .946 .011
LD .940 .960 .960 .980 .960 .960 .014
QD .920 .960 .950 .910 .960 .940 .023
Wait
MAKER-ER .260 .420 .270 .250 .320 .304 .070
MAKER-BRB .340 .300 .300 .350 .290 .316 .027
LR .210 .170 .190 .210 .210 .198 .018
SVM .080 .090 .120 .110 .090 .098 .016
NN .220 .220 .220 .240 .230 .226 .009
CT .330 .450 .330 .280 .350 .348 .063
NB .270 .240 .270 .270 .210 .252 .027
KNN .180 .180 .190 .180 .240 .194 .026
Weighted KNN .310 .350 .260 .230 .350 .300 .054
LD .240 .260 .240 .250 .230 .244 .011
QD .400 .260 .230 .500 .220 .322 .123
This result indicates that the MAKER-based models performed better than LR, SVM, NN, LD, QD, KNN, and Weighted KNN. LR, one of the simple interpretable classifiers, largely failed to predict the ‘wait’ class correctly, as shown by its ‘wait’ recall of .198. The ‘wait’ recall of SVM was similarly low, at .098. The classification tree, one of the complex interpretable classifiers, showed slightly better performance in predicting customer decisions.

Hence, at the threshold of .50, the MAKER-ER- and MAKER-BRB-based classifiers outperformed the other simple interpretable classifiers, that is, LR, KNN, Weighted KNN, LD, and QD. They also outperformed other complex machine learning methods, including SVM and NN. Despite its complexity, the classification tree performed slightly better than the other classifiers.
6.5.2. MSEs and AUCs
In this section, we report probability and ranking metrics, namely MSEs, AUCROCs, and AUCPRs. Classifiers generate a probabilistic output that shows the degree to which an observation is a member of a class. The performance metrics explained above – that is, accuracy, precision, and recall – convert this probabilistic output into a discrete classification. The threshold value of .50 is the cut-off point: any probabilistic output above the cut-off indicates the positive class and any output below it indicates the negative class. A ROC curve plots the true positive rate against the false positive rate for different cut-off points. A PR curve plots precision against the true positive rate, also known as recall, for different classification thresholds. The AUC is a single scalar value that reflects a classifier’s performance regardless of the classification threshold.
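The two threshold-free metrics can be computed as sketched below, assuming scikit-learn is available and the ‘buy’ class is coded as 1; `average_precision_score` is used here as a common estimator of the AUCPR, and the function name is an assumption.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def threshold_free_scores(y_true, p_buy):
    """AUCROC summarises the ROC curve over all cut-off points;
    average precision summarises the PR curve (AUCPR), which is
    more informative on highly imbalanced data."""
    return {
        "auc_roc": float(roc_auc_score(y_true, p_buy)),
        "auc_pr": float(average_precision_score(y_true, p_buy)),
    }
```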
In the following paragraphs, we provide the AUCROC and AUCPR scores. The latter is recommended for highly imbalanced data. Moreover, these two metrics provide a better performance measure than the threshold metrics of accuracy, precision, and recall. We also report the MSE as a probability metric; it measures the gap between the predicted values, that is, the probabilities generated by a classifier, and the actual values.
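As a minimal sketch, the MSE on predicted probabilities (the Brier score for binary outcomes) can be computed as follows, assuming ‘buy’ is coded 1 and ‘wait’ 0; the function name is an assumption.

```python
import numpy as np

def probability_mse(y_true, p_buy):
    """Mean squared gap between the predicted probability of 'buy'
    and the actual outcome (1 = 'buy', 0 = 'wait')."""
    y_true = np.asarray(y_true, dtype=float)
    p_buy = np.asarray(p_buy, dtype=float)
    return float(np.mean((p_buy - y_true) ** 2))
```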
Figure 6.9 shows the ROC curves of all the classifiers for all the test sets of the dataset. The five differently coloured lines in each panel present the ROC curve for the test set of each round: round 1 had the 1st fold as its test set, round 2 the 2nd fold, and so on. The diagonal red line represents a random classifier with an AUCROC score of .5. The further a curve lies from this red diagonal, or the closer it is to the top-left corner, the better the classifier. The blue line indicates the average ROC curve over the five test sets, and the grey area illustrates the dispersion of the curves over the five rounds, that is, ± 1 standard deviation. Figure 6.10 presents the PR curves for all the classifiers over the five rounds. As with the ROC curves, the differently coloured lines indicate the PR curve of each round’s test set. The closer a curve is to the top-right corner, the better the performance of the classifier. The grey area again indicates the dispersion of the curves at ± 1 standard deviation.
[Figure panels: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, SVM, NN, CT]
Figure 6.9. The ROC curve of MAKER-ER-based classifier, MAKER-BRB-based classifier, and all the alternative machine learning methods for the test sets of the customer-decision dataset
[Figure panels: NB, Weighted KNN, KNN, QD, LD]
Figure 6.9. Continued.
[Figure panels: MAKER-ER-based classifier, MAKER-BRB-based classifier, LR, CT, SVM, NN, NB, Weighted KNN]
Figure 6.10. The PR curve of MAKER-ER-based classifier, MAKER-BRB-based classifier, and all the alternative machine learning methods for the test sets of the customer-decision dataset
[Figure panels: KNN, QD, LD]
Figure 6.10. Continued.
Table 6.23 displays the MSEs, AUCROCs, and AUCPRs of the classifiers for the training and test sets of all five rounds. The metric scores for the training sets were similar to those for the test sets, meaning that the classifiers learned and generalised the pattern of the data and performed well on unseen data. The grand averages across all the classifiers over the five test sets were .121, .825, and .519 for MSE, AUCROC, and AUCPR respectively.
The highlighted scores in Table 6.23 indicate the best, second-best, and third-best performances among the classifiers. It is evident that the MAKER-ER- and MAKER-BRB-based models, along with the classification tree, outperformed the other classifiers in terms of the three metrics. The average scores (and standard deviations) of the AUCROCs for the MAKER-ER- and MAKER-BRB-based models were .836 (.019) and .848 (.020) respectively. Both classifiers likewise ranked second in terms of MSE, at .114 (.005) and .113 (.006) respectively. According to the average AUCPRs, the MAKER-based classifiers performed better than all the classifiers except the classification tree, with scores of .544 (.048) and .562 (.036) for the MAKER-ER- and MAKER-BRB-based models respectively. According to Table 3.3 in Section 3.8.3, an AUCROC between .8 and .9 indicates good discrimination.
Subtle differences were noted in the average AUCROCs in Table 6.23. For example, the average AUCROC of the MAKER-ER-based model was .836, nearly the same as the averages of SVM (.840) and NN (.829). However, the average AUCPR of the MAKER-ER-based model was .544, a difference of .034 compared with NN (.510) and .043 compared with SVM (.501). This score was also close to the average AUCPR of all the classifiers (.519). The MAKER-BRB-based model and the classification tree performed best in terms of the average AUCPR over the five test sets, at .562 and .636 respectively.
Thus, we concluded that the performance of the MAKER-ER- and MAKER-BRB-based models in predicting customer decisions in this study was superior to that of the complex machine learning methods, that is, SVM and NN. They also performed better than the simple interpretable classifiers, namely LR, KNN, weighted KNN, LD, and QD.
Table 6.23. MSEs and AUCs of classifiers for customer decision models
Train Test
Models/Folds 1st 2nd 3rd 4th 5th Avg Std CI (95%) 1st 2nd 3rd 4th 5th Avg Stdev CI (95%)
AUCROCs
MAKER-ER .844 .827 .835 .838 .835 .836 .006 .831-.841 .815 .861 .849 .821 .833 .836 .019 .819-.853
MAKER-BRB .851 .855 .861 .846 .853 .853 .006 .848-.858 .858 .840 .823 .875 .843 .848 .020 .830-.865
LR .831 .830 .832 .836 .833 .832 .002 .830-.834 .766 .845 .845 .806 .848 .822 .036 .791-.853
SVM .854 .842 .843 .845 .839 .845 .006 .840-.850 .797 .853 .858 .841 .851 .840 .025 .818-.862
NN .846 .832 .831 .838 .830 .835 .007 .830-.841 .812 .849 .837 .805 .840 .829 .019 .812-.846
CT .891 .887 .879 .895 .897 .890 .007 .884-.896 .861 .872 .894 .874 .886 .877 .013 .866-.889
NB .795 .796 .799 .815 .801 .801 .008 .794-.808 .738 .810 .830 .804 .782 .793 .035 .762-.824
KNN .836 .832 .828 .834 .831 .832 .003 .829-.835 .773 .812 .807 .768 .786 .789 .020 .772-.807
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.00-1.00 .805 .831 .837 .784 .796 .811 .023 .791-.830
LD .836 .839 .842 .841 .841 .840 .002 .838-.842 .773 .843 .861 .829 .846 .830 .034 .800-.860
QD .818 .801 .813 .817 .811 .812 .007 .806-.818 .791 .815 .804 .772 .820 .801 .019 .784-.818
MSEs
MAKER-ER .112 .115 .114 .114 .114 .114 .001 .113-.115 .122 .108 .116 .113 .114 .114 .005 .110-.119
MAKER-BRB .111 .110 .110 .113 .112 .111 .001 .110-.112 .112 .118 .120 .104 .111 .113 .006 .107-.118
LR .117 .121 .120 .120 .119 .119 .002 .118-.121 .130 .114 .118 .118 .120 .120 .006 .115-.125
SVM .133 .138 .135 .135 .133 .135 .002 .133-.137 .138 .135 .134 .131 .141 .136 .004 .132-.139
NN .113 .116 .116 .116 .116 .115 .001 .114-.117 .126 .111 .116 .117 .116 .117 .005 .112-.122
Table 6.23. Continued.
Train Test
Models/Folds 1st 2nd 3rd 4th 5th Avg Std CI (95%) 1st 2nd 3rd 4th 5th Avg Stdev CI (95%)
CT .095 .098 .098 .093 .093 .095 .002 .093-.097 .104 .103 .094 .100 .098 .100 .004 .096-.103
NB .127 .132 .128 .130 .129 .129 .002 .128-.131 .136 .131 .120 .128 .138 .131 .007 .125-.137
KNN .115 .116 .116 .115 .117 .116 .001 .115-.117 .127 .119 .126 .126 .120 .123 .004 .120-.127
Weighted KNN .000 .000 .000 .000 .000 .000 .000 .000-.000 .125 .115 .125 .129 .119 .123 .005 .118-.127
LD .118 .122 .121 .121 .121 .121 .002 .119-.122 .134 .116 .117 .119 .122 .121 .007 .115-.128
QD .123 .127 .126 .127 .126 .126 .002 .124-.127 .139 .126 .130 .125 .132 .130 .006 .125-.135
AUCPRs
MAKER-ER .548 .526 .549 .537 .540 .540 .009 .532-.548 .496 .589 .496 .594 .545 .544 .048 .502-.586
MAKER-BRB .573 .548 .566 .545 .555 .557 .012 .547-.568 .519 .587 .530 .601 .572 .562 .036 .531-.593
LR .514 .485 .496 .483 .499 .495 .013 .484-.506 .420 .528 .510 .543 .486 .497 .048 .455-.540
SVM .510 .487 .493 .483 .503 .495 .011 .485-.505 .434 .542 .517 .541 .469 .501 .048 .459-.542
NN .673 .655 .656 .673 .672 .666 .009 .657-.674 .445 .574 .503 .552 .479 .510 .053 .464-.557
CT .612 .625 .661 .624 .659 .636 .022 .616-.656 .612 .625 .661 .624 .659 .636 .022 .616-.656
NB .464 .444 .452 .454 .456 .450 .007 .444-.457 .420 .449 .507 .478 .411 .453 .040 .418-.488
KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.00-1.00 .483 .521 .461 .501 .499 .493 .022 .473-.512
Weighted KNN 1.000 1.000 1.000 1.000 1.000 1.000 .000 1.00-1.00 .516 .562 .492 .480 .545 .519 .035 .489-.549
LD .518 .487 .497 .485 .503 .450 .013 .439-.462 .422 .525 .522 .553 .477 .500 .051 .455-.545
QD .493 .469 .480 .480 .489 .450 .009 .442-.459 .444 .488 .455 .495 .443 .465 .025 .443-.486
Despite its complexity, the classification tree performed slightly better than the MAKER-based classifiers. The MAKER-ER- and MAKER-BRB-based classifiers are interpretable classifiers with an integrated process of statistical analysis, belief-rule-based inference, and machine learning for predicting customer decisions in a dynamic-pricing environment. Hence, further analysis to drive managerial decision-making should be conducted.
6.6. Summary
This chapter presents the application of the hierarchical MAKER framework, namely the MAKER-ER- and MAKER-BRB-based classifiers, to customer decisions in an airline advanced booking setting. The two outputs of the models were ‘buy’ and ‘wait’, with six input variables being considered. These included provider-controlled variables, namely the length of the holding period and the average price trend; uncontrolled variables, namely the number of flights offered in a day and the time before departure; and personal variables, namely waiting patience and customer type in response to dynamic pricing.
This chapter consisted of six main sections. First, we described a conceptual framework that explains the input variables, the identification of customer decisions, and the data linkage. Based on the literature and a refinement process, six variables that might influence customer decisions were selected. Wait and buy decisions were considered in this study. In addition, we created a data linkage to integrate customer transaction records and price records to obtain a meaningful dataset for further analysis. Second, we explained the data preparation used in this study, including
data cleaning and data partitioning. Five folds for five-round cross-validation were used for all the classifiers.
Third, we demonstrated the formulation of groups of evidence, evidence acquisition
from data, interdependency indices, belief-rule-based inference, maximum likelihood
prediction, and the inference process for all the generated MAKER outputs at the top of the hierarchy. This process indicates how we constructed hierarchical rule-based modelling and prediction based on the MAKER framework, in which the input variables were split into several groups of evidence.
Given the optimised referential values and other model parameters, such as the weights and belief degrees of the belief rules, we used the training set of the first round to demonstrate both classifiers. Fourth, considering the highly imbalanced class distribution (1:4.5), we analysed and compared the models’ performance on the measures of accuracy, precision, recall, F-beta score, AUCROC, AUCPR, and MSE for all the classifiers. The analysis results indicated that the MAKER-ER- and MAKER-BRB-based classifiers outperformed eight of the nine alternative machine learning methods, while the classification tree showed a performance similar to that of both classifiers. Therefore, we concluded that the MAKER-ER- and MAKER-BRB-based models, as interpretable and robust classifiers, are suitable for predicting customer decisions. Furthermore, they can be utilised to learn about customer purchasing behaviour to assist managerial decision making.
Chapter 7 Conclusions and
Recommendations for Future Research
7.1. Conclusions
The existence of strategic customers potentially hurts providers’ revenue, causing significant profit losses. Researchers have developed theoretical models and formulated optimal provider responses to address strategic purchasing behaviour, that is, delaying a purchase in the hope of obtaining a lower price. Most of these methods were developed under the assumption that all customers act strategically. Another popular approach is controlled experiments. In addition, the numerous examples of cancel-rebook behaviour found in airline databases provide useful information for distinguishing strategic customers from non-strategic – namely, myopic – customers. Therefore, we developed a classification model for detecting strategic customers through their cancel-rebook behaviour. In addition, we developed a customer-decision model as a support system to help providers address strategic purchasing behaviour. Approaches based on statistics and machine learning applied to historical databases can be relatively cheap and are representative of real life rather than experimental conditions. Empirical approaches are also free of the assumptions on which theoretical models rely, such as how customers make their decisions and what factors influence those decisions.
The classification methods in widespread use at present have their own challenges, such as poor interpretability, overfitting, and low stability. These issues can influence the performance of these models in classification, that is, in predicting customer types and decisions. In this research, we proposed a new method: a hierarchical rule-based inferential modelling and prediction approach based on the MAKER framework. It integrates statistical analysis, rule-based inference, maximum likelihood prediction, and machine learning for classification in a hierarchical structure. The proposed model addresses the challenges of popular classification methods and deals with sparse matrices. The input variables are decomposed into several groups of evidence, each of which performs rule-based inferential modelling and prediction based on the MAKER framework. The outputs generated from each group of evidence are combined for a final inference.
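To make the idea of combining group outputs concrete, the toy sketch below merges two groups’ class probabilities with a weighted product rule and renormalisation. This is only an illustrative stand-in for the full ER-rule and MAKER combination described in earlier chapters; the weighting scheme and the function name are assumptions.

```python
import numpy as np

def combine_groups(p1, p2, w1=1.0, w2=1.0):
    """Toy conjunctive combination of two groups' class probabilities:
    discount each distribution by its weight (as an exponent), multiply
    elementwise, and renormalise so the result sums to one. NOT the full
    ER-rule/MAKER combination, just an illustration of the idea."""
    p1 = np.asarray(p1, dtype=float) ** w1
    p2 = np.asarray(p2, dtype=float) ** w2
    joint = p1 * p2
    return joint / joint.sum()
```

For instance, combining the distributions [0.8, 0.2] and [0.6, 0.4] over (‘buy’, ‘wait’) reinforces the class both groups favour.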
The proposed method enabled us to acquire evidence directly from the data and to
combine multiple pieces of evidence from input variables within a group of evidence
to generate a belief rule base. For any given inputs, we could generate outputs – the
probability associated with each class – using the belief rule and maximum likelihood
prediction. The outputs generated from each group of evidence were then combined
using the evidential reasoning rule or a belief rule base with trained belief degrees of the consequents. Using a machine learning algorithm, we optimised the model parameters to maximise the likelihood of the true state. The findings for this approach are summarised below.
• By proposing a conceptual framework and data linkage for detecting strategic
customers, we fulfilled research objective 1. The proposed conceptual
framework and data linkage were developed based on cancel-rebook behaviour
(see Section 5.3). The conceptual framework was tested using historical data. In
the case study, the input variables were good predictors.
• By proposing a conceptual framework and data linkage for predicting customer decisions, we achieved research objective 1. The proposed framework and data linkage were refined according to the availability of data (see Section 6.2). Wait-or-buy decisions in an advanced booking setting with zero deposit were considered. It was evident that the input variables in the framework were good predictors.
• By comparing the alternative approaches – popular methods in machine learning
– with the theory regarding classification, we achieved research objective 3. The
alternative classification methods that were used in this comparison included
SVM, NN, NB, LR, CT, KNN, weighted KNN, LD, and QD. On the basis of a rule-
based inference and maximum likelihood evidential reasoning (MAKER), the
MAKER-based framework that we propose was transparent and interpretable.
The relationship between inputs and outputs can be clearly analysed. Compared
with other interpretable machine learning models, such as LR, NB, KNN,
weighted KNN, LD, and QD, the proposed method performed better. In addition,
LR and NB assumes independence among input variables, but MAKER-based
framework does not.
• By comparing the performance of the various approaches on both datasets (i.e. the customer-type and customer-decision datasets), we achieved research objective 3. The MAKER-based classifiers outperformed most of the other models, that is, LR, SVM, NN, KNN, weighted KNN, NB, LD, and QD, and performed similarly to the classification tree. The differences in performance among classifiers were identified more clearly using AUCPR than AUCROC. Along with their ability to predict, the MAKER-based classifiers also measured the interdependence between input variables. This measure indicates whether – and the extent to which – input variables depend on each other. For illustration, Sections 5.5.7 and 6.4.7 report the probabilities generated by the framework regarding whether a customer was strategic or myopic, and the probabilities of a customer choosing to buy or wait.
• By applying a referential-value-based data discretisation technique in the
hierarchical MAKER framework, we achieved research objective 2. This
technique alleviated the information loss and distortion resulting from over-
generalisation caused by discretisation. It also captured the structure of the data
better than other discretisation techniques.
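The referential-value-based transformation mentioned here is commonly realised by distributing a continuous input over its two neighbouring referential values by linear interpolation, so that the belief degrees sum to one. A minimal sketch, with the function name assumed, is:

```python
def belief_distribution(x, ref_values):
    """Transform a continuous input into belief degrees over its
    referential values. Values at or beyond the extremes attach fully
    to the nearest referential value; interior values are split between
    the two neighbouring referential values by linear interpolation."""
    ref_values = sorted(ref_values)
    if x <= ref_values[0]:
        return {ref_values[0]: 1.0}
    if x >= ref_values[-1]:
        return {ref_values[-1]: 1.0}
    for lo, hi in zip(ref_values, ref_values[1:]):
        if lo <= x <= hi:
            beta_hi = (x - lo) / (hi - lo)  # degree attached to the upper value
            return {lo: 1.0 - beta_hi, hi: beta_hi}
```

Unlike hard binning, the split degrees retain how close the input lies to each referential value, which is the information-loss point made above.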
• By proposing a hierarchical rule-based modelling and prediction approach, we achieved research objective 2. The structure is applicable in the case of sparse matrices. We proposed decomposing the input variables into several groups of evidence. This approach avoids misleading and incorrect inferences caused either by violations of the statistical requirements for sample size or by information loss, as well as the multiplicative (computational) complexity of the many referential values of the input variables in the belief rules. The hierarchical structure enables MAKER-based classifiers to make predictions from several groups of evidence and to combine the outputs at the aggregate level for a final inference. The hierarchical MAKER framework performed well for both datasets (customer type and customer decision).
7.2. Limitations and Recommendations for Future Research
This research has several limitations.
• The hierarchical MAKER framework and the other machine learning methods were applied and tested on a case study in Indonesia; different datasets might therefore yield different model parameters and model performances.
• The datasets used in this research were highly skewed, and hence the dominance of the majority class can influence the development of the classification model.
• In this research, cancelled transactions were deleted from the datasets. This kind of information might be useful for capturing the ‘exit’ decisions made by customers.
Suggested directions for further study are summarised below.
• Based on the AUCROC scores, the hierarchical MAKER framework is an effective and adequate classifier. However, the AUCPR scores indicate that there is still much room for improvement, especially for highly skewed datasets; both the customer-type and customer-decision datasets had a highly imbalanced class distribution. The model parameters were trained by optimising the mean squared error (MSE), which represents the difference between the actual and predicted outputs. The MSE is obtained by averaging the squared differences over the dataset and may therefore favour the majority class. Further research could focus on improving the performance measure on which the machine learning algorithm is based, so that both classes are treated equally.
• For both datasets used in this study, groups of evidence were formed, and complete joint frequency matrices were obtained by decomposing the input variables. However, most large matrices are sparse because almost all of their entries are zeros. Decomposing only the input variables does not necessarily solve the problem of sampling zeros; it might be necessary to decompose down to the level of sub-rules, that is, the most frequently activated combinations of referential values. Hence, future research could focus on hierarchical rule-based inference composed of sub-rules.
• The customer-type and customer-decision datasets were retrieved from transactions made by customers who eventually bought tickets. In reality, ‘exit’ decisions also occur. Another potential direction for further study in the field of revenue management is thus extending the decision models to include exit decisions; a rule-based inferential modelling and prediction approach is applicable to multiple classification tasks.
277
References
Agre, G., & Peev, S. (2002). On Supervised and Unsupervised Discretization. Bulgarian Academy of Sciences, 2(43–57).
Anderson, C. K., & Wilson, J. G. (2003). Wait or buy? The strategic consumer: Pricing and profit implications. Journal of Operational Research Society, 54(3), 299–306.
Auria, L., & Moro, R. A. (2008). Support Vector Machines (SVM) as a Technique for Solvency Analysis (August 1, 2008). DIW Berlin Discussion Paper No. 811. Retrieved from http://dx.doi.org/10.2139/ssrn.1424949.
Aviv, Y., & Pazgal, A. (2008). Optimal Pricing of Seasonal Products in the Presence of Forward-Looking Consumers. Manufacturing & Service Operations Management, 10(3), 339–359.
Aviv, Yossi, Levin, Y., & Nediak, M. (2009). Counteracting Strategic Consumer Behavior in Dynamic Pricing Systems. In N. S & T. CS (Eds.), Consumer-Driven Demand and Operations Management Models (pp. 323–352).
Bagozzi, R. P., Gurhan-Canli, Z., & Priester, J. R. (2002). The social psychology of consumer behaviour. In The Social Psychology of Consumer Behaviour. Buckingham: Open University Press.
Belch, G. E., & Belch, M. A. (1998). Advertising and promotion : an integrated marketing communications perspective (4th ed.). Maidenhead: McGraw-Hill.
Besanko, D., & Winston, W. L. (1990). Optimal Price Skimming by a Monopolist Facing Rational Consumers. Management Science, 36(5), 555–567.
Besbes, O., & Lobel, I. (2015). Intertemporal Price Discrimination : Structure and Computation of Optimal Policies. Management Science, 61(1), 92–110.
Bilotkach, V. (2010). Reputation, search cost, and airfares. Journal of Air Transport Management, 16(5), 251–257.
Binaghi, E., & Madella, P. (1999). Fuzzy Dempster – Shafer Reasoning for Rule-Based Classifiers. International Journal of Intellegent Systems, 14(6), 559–583.
Bishop, C. M. (2006). Pattern recognition and machine learning. In Information Science and Statistics. New York, N.Y: Springer.
Bishop, Y. M. M. (2007). Discrete Multivariate Analysis Theory and Practice (S. E. Fienberg & P. W. Holland, Eds.). Retrieved from https://doi.org/10.1007/978-0-387-72806-3.
Bodur, H. O., Klein, N. M., & Arora, N. (2015). Online price search: Impact of price comparison sites on offline price evaluations. Journal of Retailing, 91(1), 125–139.
Boggs, P. T., & Tolle, J. W. (1995). Sequential Quadratic Programming *. Acta Numerica, 4, 1–51.
Boyd, E. A., & Bilegan, I. C. (2003). Revenue Management and E-Commerce.
278
Management Science, 49(10), 1363–1386.
Cachon, G. P., & Swinney, R. (2009). Purchasing, Pricing, and Quick Response in the Presence of Strategic Consumers. Management Science, 55(3), 497–511.
Carvalho, D. V, Pereira, E. M., & Cardoso, J. S. (2019). Machine Learning Interpretability : A Survey on Methods and Metrics. Electronics, 8(2019), 1–34.
Cason, T. N., & Reynolds, S. S. (2005). Bounded rationality in laboratory bargaining with asymmetric information. Economic Theory, 25(3), 553–574.
Chan, C.-C., Batur, C., & Srinivasan, A. (1991). Determination of quantization intervals in rule based model for dynamic systems. Proceedings of the IEEE Conference on Systems, Man, and Cybernetics, 1719–1723. Charlottesvile, Virginia.
Chandon, P., Wansink, B., & Laurent, G. (2000). A Benefit Congruency Framework of Sales Promotion Effectiveness. Journal of Marketing, 64(October 2000), 65–81.
Chang, L., Zhou, Y., Jiang, J., Li, M., & Zhang, X. (2013). Structure learning for belief rule base expert system: A comparative study. Knowledge-Based Systems, 39, 159–172.
Chen, C.-C., & Schwartz, Z. (2006). The Importance of Information Asymmetry in Customers’ Booking Decisions: A Cautionary Tale from the Internet. Cornell Hotel and Restaurant Administration Quarterly, 47(3), 272–285.
Chen, C.-C., & Schwartz, Z. (2008). Timing Matters: Travelers’ Advanced-Booking Expectations and Decisions. Journal of Travel Research, 47(1), 35–42.
Chen, C. C., Schwartz, Z., & Vargas, P. (2011). The search for the best deal: How hotel cancellation policies affect the search and booking decisions of deal-seeking customers. International Journal of Hospitality Management, 30(1), 129–135.
Chen, C., & Schwartz, Z. (2008). Room rate patterns and customers’ propensity to book a hotel room. Journal of Hospitality & Tourism Research, 32(3), 287–306.
Chen, Chihchien. (2016). Cancellation policies in the hotel, airline and restaurant industries. Journal of Revenue and Pricing Management,15(3–4), 271–276.
Chen, Q., Whitbrook, A., Aickelin, U., & Roadknight, C. (1960). Data Classification Using the Dempster-Shafer Method.
Chevalier, J., & Goolsbee, A. (2009). Are Durable Goods Consumers Forward-Looking? Evidence from College Textbooks. The Quarterly Journal of Economics, 124(4), 1853–1884.
Cho, M., Fan, M., & Zhou, Y. (2008). Strategic Consumer Response to Dynamic Pricing of Perishable Product. In International Series in Operations Research and Management Science, 131, 435-458.
Choi, S., & Kimes, S. E. (2002). Electronic distribution channels’ effect on hotel revenue management. The Cornell Hotel and Restaurant Administration Quarterly, 43(3), 23–31.
279
Christou, E. (2011). Exploring Online Sales Promotions in the Hospitality Industry Exploring Online Sales Promotions. Journal of Hospitality Marketing & Management, 20, 814–829.
Clark, R. A., & Goldsmith, R. E. (2005). Market Mavens : Psychological Influences. Psychology & Marketing, 22(4), 289–312.
Clemons, E. K., Hann, I.-H., & Hitt, L. M. (2002). Price Dispersion and Differentiation in Online Travel: An Empirical Investigation. Management Science, 48(4), 534–549.
Cleophas, C., & Bartke, P. (2011). Modeling strategic customers using simulations - With examples from airline revenue management. Procedia - Social and Behavioral Sciences, 20, 1060–1068.
Cooper, W. L., Homem-de-Mello, T., & Kleywegt, A. J. (2006). Models of the Spiral-Down Effect in Revenue Management. Operations Research, 54(5), 968–987.
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative & mixed methods approaches (5th ed.). Los Angeles: Sage.
Darpy, D. (2000). Consumer Procrastination and Purchase Delay. 29th Annual Conference EMAC, 1–7.
Dasu, S., & Tong, C. (2010). Dynamic pricing when consumers are strategic: Analysis of posted and contingent pricing schemes. European Journal of Operational Research, 204(3), 662–671.
Davis, J., & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA.
Dekay, F., Yates, B., & Toh, R. S. (2004). Non-performance penalties in the hotel industry. International Journal of Hospitality Management, 23(3), 273–286.
Dempster, A. P. (2008). The Dempster – Shafer calculus for statisticians. International Journal of Approximate Reasoning, 48(2), 365–377.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Machine Learning: Proceedings of the 12th International Conference, 194–202.
Eren, S. S., & Parker, J. (2010). Monopoly pricing with limited demand information. Journal of Revenue and Pricing Management, 9(1–2), 23–48.
Etzioni, O., Tuchinda, R., Knoblock, C. a., & Yates, A. (2003). To buy or not to buy: Mining airfare data to minimize ticket purchase price. Knowledge Discovery and Data Mining Proceedings of the Ninth ACM SIGKDD International Conference, 119–128.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. The 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1022–1027.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge: MIT Press.
Fienberg, S. E., & Rinaldo, A. (2007). Three centuries of categorical data analysis: Log-linear models and maximum likelihood estimation. Journal of Statistical Planning and Inference, 137(11), 3430–3445.
Flightdelayclaimsteam.com. (2019). 15 Flight Hacks you can use for Ridiculously Cheap Bookings Today. Available at: https://www.flightdelayclaimsteam.com/flight-hacks-for-cheaper-bookings-you-can-use-today/. [Accessed 18 July 2019].
Fortin, D. R. (2000). Clipping Coupons in Cyberspace: A Proposed Model of Behavior for Deal-Prone Consumers. Psychology and Marketing, 17(6), 515–534.
Gollwitzer, P. M., & Brandstätter, V. (1997). Implementation Intentions and Effective Goal Pursuit. Journal of Personality and Social Psychology, 73(1), 186–199.
Gönsch, J., Klein, R., Neugebauer, M., & Steinhardt, C. (2013). Dynamic pricing with strategic customers. Journal of Business Economics, 83(5), 505–549.
Gorin, T., Walczak, D., Bartke, P., & Friedemann, M. (2012). Incorporating cancel and rebook behavior in revenue management optimization. Journal of Revenue and Pricing Management, 11(6), 645–660.
Granados, N., Kauffman, R. J., Lai, H., & Lin, H. C. (2012). À La Carte Pricing and Price Elasticity of Demand in Air Travel. Decision Support Systems, 53(2), 381–394.
Hayes, D. K., & Miller, A. (2011). Revenue Management for the Hospitality Industry. Journal of Revenue and Pricing Management, 11(4), 479–480.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Delhi: Pearson Education.
Hendler, J. (2014). Data Integration for Heterogenous Datasets. Big Data, 2(4), 205–215.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. In Wiley Series in Probability and Statistics (3rd ed.). Hoboken, NJ: Wiley.
Hossin, M., & Sulaiman, M. N. (2015). A Review On Evaluation Metrics for Data Classification Evaluations. International Journal of Data Mining & Knowledge Management Process (IJDKP), 5(2), 1–11.
Ivanov, S., & Zhechev, V. (2012). Hotel revenue management – a critical literature review. Tourism Review, 60(2), 175–197.
Jerath, K., Netessine, S., & Veeraraghavan, S. K. (2010). Revenue Management with Strategic Customers: Last-Minute Selling and Opaque Selling. Management Science, 56(3), 430–448.
Jung, K., Cho, Y. C., & Lee, S. (2014). Online shoppers’ response to price comparison sites. Journal of Business Research, 67(10), 2079–2087.
Kannan, P. K., & Kopalle, K. (2001). Dynamic Pricing on the internet-importance and Implications for Consumer Behavior. International Journal of Electronic Commerce, 5(3), 63–83.
Karim, M., & Rahman, R. M. (2013). Decision Tree and Naïve Bayes Algorithm for Classification and Generation of Actionable Knowledge for Direct Marketing. Journal of Software Engineering and Applications, 6(4), 196–206.
Kateri, M., & Iliopoulos, G. (2010). On collapsing categories in two-way contingency tables. Statistics, 37(5), 443-455.
Kerber, R. (1992). Chimerge: Discretization of numeric attributes. Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.
Kimes, S. E. (1989). The Basics of Yield Management. Cornell Hotel and Restaurant Administration Quarterly, 30(3), 14–19.
Kimes, S. E. (2003). Revenue Management: A Retrospective. Cornell Hotel and Restaurant Administration Quarterly, 44(5), 131–138.
Knox, S. W. (2018). Machine learning: A concise introduction. Hoboken, NJ: John Wiley & Sons, Inc.
Kong, G., Xu, D. L., Yang, J. B., Yin, X., Wang, T., Jiang, B., & Hu, Y. (2016). Belief rule-based inference for predicting trauma outcome. Knowledge-Based Systems, 95, 35–44.
Kraft, D. (1988). A software package for sequential quadratic programming. Tech. Rep. DFVLR-FB 88-28, DLR German Aerospace Center - Institute for Flight Mechanics, Köln, Germany.
Kwon, K., & Kwon, Y. J. (2013). Heterogeneity of deal proneness : Value-mining , price-mining , and encounters. Journal of Retailing and Consumer Services, 20(2), 182–188.
Lai, G., Debo, L. G., & Sycara, K. (2010). Buy Now and Match Later: Impact of Posterior Price Matching on Profit with Strategic Consumers. Manufacturing & Service Operations Management, 12(1), 33–35.
Lee, W.-M. (2019). Python machine learning. Indianapolis, IN: Wiley.
Levin, Y., McGill, J., & Nediak, M. (2009). Dynamic Pricing in the Presence of Strategic Consumers and Oligopolistic Competition. Management Science, 55(1), 32–46.
Li, J., Granados, N. F., & Netessine, S. (2014). Are Consumers Strategic? Structural Estimation from the Air-Travel Industry. Management Science, 60(9), 2114–2137.
Lichtenstein, D. R., Netemeyer, R. G., & Burton, S. (1990). Distinguishing coupon proneness from value consciousness: An acquisition-transaction utility theory perspective. Journal of Marketing, 54(3), 54–67.
Lichtenstein, D. R., Ridgway, N. M., & Netemeyer, R. G. (1993). Price Perceptions and Consumer Shopping Behavior: A Field Study. Journal of Marketing Research, 30(2), 234–246.
Lin, F., & Cohen, W. W. (2010). Semi-supervised classification of network data using very few labels. Proceedings - 2010 International Conference on Advances in Social Network Analysis and Mining, ASONAM 2010, 192–199.
Liong, C., & Foo, S. (2013). Comparison of linear discriminant analysis and logistic regression for data classification. AIP Conference Proceedings, 1522(1), 1159–1165.
Littlewood, K. (2005). Forecasting and control of passenger bookings. Journal of Revenue & Pricing Management, 4(2), 111–123.
Liu, Q., & van Ryzin, G. J. (2008). Strategic Capacity Rationing to Induce Early Purchases. Management Science, 54(6), 1115–1131.
Lorenz, T. (2019). 7 sites to find book now pay later hotels. Available at: https://www.finder.com.au/book-now-pay-later-hotels. [Accessed 18 July 2019].
Liu, Q., & Ying, W. (2012). Supervised learning. In N. M. Seel (Ed.), Encyclopedia of the Sciences of Learning. Boston, MA: Springer.
Maimon, O., & Rokach, L. (2005). Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications. World Scientific.
Mak, V., Rapoport, A., Gisches, E. J., & Han, J. (2014). Purchasing Scarce Products Under Dynamic Pricing: An Experimental Investigation. Manufacturing & Service Operations Management, 16(3), 425–438.
Maratea, A., Petrosino, A., & Manzo, M. (2014). Adjusted F-measure and kernel scaling for imbalanced data learning. Information Sciences, 257, 331–341.
Meissner, J., & Strauss, A. K. (2010). Pricing structure optimization in mixed restricted/unrestricted fare environments. Journal of Revenue and Pricing Management, 9(5), 399–418.
Molnar, C. (2019). Interpretable machine learning. A Guide for Making Black Box Models Explainable. Available at: http://christophm.github.io/interpretable-ml-book/. [Accessed 18 June 2019].
Mori, T., & Uchihira, N. (2019). Balancing the trade-off between accuracy and interpretability in software defect prediction. Empirical Software Engineering, 24(2), 779-825.
Nair, H. (2007). Intertemporal price discrimination with forward-looking consumers: Application to the US market for console video-games. Quantitative Marketing and Economics, 5(3), 239–292.
Nasiry, J., & Popescu, I. (2012). Advance Selling When Consumers Regret. Management Science, 58(6), 1160–1177.
Osadchiy, N., & Bendoly, E. (2011). Are Consumers Really Strategic? Implications from an Experimental Study. 2011 MSOM Annual Conference.
Ovchinnikov, A., & Milner, J. M. (2012). Revenue management with end-of-period discounts in the presence of customer learning. Production and Operations Management, 21(1), 69–84.
Özer, Ö., & Zheng, Y. (2015). Markdown or Everyday Low Price? The Role of Behavioral Motives. Management Science, 62(2), 326-346.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Qiwen, J., Weijun, Z., & Youyan, H. (2010). Revenue Management in the Service Industry: Research Overview and Prospect. International Conference on Management and Service Science (MASS), 1–5.
Reed, S. E., & Lee, H. (2015). Training deep neural networks on noisy labels with bootstrapping. In ICLR, 1–11.
Ren, J. (2012). ANN vs. SVM: Which one performs better in classification of MCCs in mammogram imaging. Knowledge-Based Systems, 26, 144–153.
Reynolds, S. S. (2000). Durable-Goods Monopoly: Laboratory Market and Bargaining Experiments. The RAND Journal of Economics, 31(2), 375–394.
Richeldi, M., & Rossotto, M. (1995). Class-driven statistical discretization of continuous attributes (extended abstract). In Lecture Notes in Artificial Intelligence 914 (N. Lavrac, pp. 335–338). Berlin, Heidelberg, New York: Springer Verlag.
Rickwood, C., & White, L. (2009). Pre-purchase decision-making for a complex service: retirement planning. Journal of Services Marketing, 23(3), 145–153.
Rohde, C. A. (2014). Introductory Statistical Inference with the Likelihood Function. Cham: Springer International Publishing.
Ruth, J. A. (2001). Promoting a Brand's Emotion Benefits: The Influence of Emotion Categorization Processes on Consumer Evaluations. Journal of Consumer Psychology, 11(2), 99–113.
Sahay, A. (2007). How to reap higher profits with dynamic pricing. MIT Sloan Management Review, 48(4), 53–62.
Schwartz, Z. (2000). Changes in Hotel Guests’ Willingness To Pay as The Date of Stay Draws Closer. Journal of Hospitality & Tourism Research, 24(2), 180–198.
Schwartz, Z. (2006). Advanced booking and revenue management: Room rates and the consumers’ strategic zones. International Journal of Hospitality Management, 25(3), 447–462.
Shen, Z. M., & Su, X. (2007). Customer Behavior Modeling in Revenue Management and Auctions: A Review and New Research Opportunities. Production & Operations Management, 16(6), 780–790.
Simon, H. A. (1955). A Behavioral Model of Rational Choice. The Quarterly Journal of Economics, 69(1), 99–118.
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129–138.
Stokey, N. L. (1981). Rational Expectations and Durable Goods Pricing. The Bell Journal of Economics, 12(1), 112–128.
Su, X. (2007). Intertemporal Pricing with Customer Behavior. Management Science, 53(5), 726–741.
Su, X. (2009). A Model of Consumer Inertia with Applications to Dynamic Pricing. Production & Operations Management, 18(4), 365–380.
Su, X., & Zhang, F. (2009). On the Value of Commitment and Availability Guarantees When Selling to Strategic Consumers. Management Science, 55(5), 713–726.
Swinney, R. (2011). Selling to Strategic Consumers When Product Value Is Uncertain: The Value of Matching Supply and Demand. Management Science, 57(10), 1737–1751.
Talluri, K. T., & van Ryzin, G. J. (2004). The Theory and Practice of Revenue Management. Boston: Kluwer Academic Publishers.
Tang, D., Yang, J.-B., Chin, K.-S., Wong, Z. S. Y., & Liu, X. (2011). A methodology to generate a belief rule base for customer perception risk analysis in new product development. Expert Systems with Applications, 38(5), 5373–5383.
Toh, R. S., Dekay, F., & Raven, P. (2012). Travel Planning: Searching for and Booking Online Seats on the Internet. Transportation Journal, 51(1), 80–98.
Tu, J. V. (1996). Advantages and Disadvantages of Using Artificial Neural Networks versus Logistic Regression for Predicting Medical Outcomes. Journal of Clinical Epidemiology, 49(11), 1225–1231.
Wang, M., Ma, M., Yue, X., & Mukhopadhyay, S. (2013). A capacitated firm’s pricing strategies for strategic consumers with different search costs. Annals of Operations Research, 240(2), 731–760.
Xu, D. (2011). An introduction and survey of the evidential reasoning approach for multiple criteria decision analysis. Annals of Operations Research, 195(1), 163–187.
Xu, D. L., Yang, J. B., & Wang, Y. M. (2006). The evidential reasoning approach for multi-attribute decision analysis under interval uncertainty. European Journal of Operational Research, 174(3), 1914–1943.
Xu, X., Zheng, J., Yang, J.-B., Xu, D.-L., & Chen, Y.-W. (2017). Data classification using evidence reasoning rule. Knowledge-Based Systems, 116, 144–151.
Yang, J.-B., & Xu, D.-L. (2014). A Study on Generalising Bayesian Inference to Evidential Reasoning. In Belief Functions: Theory and Applications - Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, Vol. 8764, pp. 180-189.
Yang, J. B. (2001). Rule and utility based evidential reasoning approach for multiattribute decision analysis under uncertainties. European Journal of Operational Research, 131(1), 31–61.
Yang, J. B., Liu, J., Wang, J., Sii, H. S., & Wang, H. W. (2006). Belief rule-base inference methodology using the evidential reasoning approach - RIMER. IEEE Transactions on Systems, Man, and Cybernetics Part A:Systems and Humans, 36(2), 266–285.
Yang, J. B., & Xu, D. L. (2013). Evidential reasoning rule for evidence combination. Artificial Intelligence, 205, 1–29.
Yang, J.-B., Liu, J., Xu, D.-L., Wang, J., & Wang, H. (2007). Optimization models for training belief-rule-based systems. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 37(4), 569–585.
Yang, J.-B., & Xu, D.-L. (2017). Inferential Modelling and Decision Making with Data. IEEE International Conference on Automation and Computing (ICAC 2017), Huddersfield, UK, 7–8 September.
Ye, T., & Sun, H. (2015). Price-setting newsvendor with strategic consumers. Omega, 63, 103-110.
Yip, S. (2019). 11 airlines and websites that offer layby flights to book now and pay later. Available at: https://www.finder.com.au/book-now-pay-later. [Accessed 18 July 2019].
Zbaracki, M. J., Ritson, M., Levy, D., Dutta, S., & Bergen, M. (2004). Managerial and Customer Costs of Price Adjustment: Direct Evidence from Industrial Markets. The Review of Economics and Statistics, 86(2), 514–533.
Zeelenberg, M. (1999). Anticipated Regret, Expected Feedback and Behavioral Decision Making. Journal of Behavioral Decision Making, 12(2), 93–106.
Zhang, D., & Cooper, W. L. (2008). Managing Clearance Sales in the Presence of Strategic Customers. Production & Operations Management, 17(4), 416–431.
Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied Artificial Intelligence, 17(5–6), 375–381.
Appendices
Examples of customer type and decision dataset (200 samples)
No. FB ICR TS HP DD APT NF WPT Decision Customer type
1 1 .0000 .4574 .4817 1.3185 .0636 3 .4574 buy myopic
2 1 .0000 .0009 .0972 1.4097 1.1565 28 .0009 buy myopic
3 2 .5054 .0732 .0779 1.6786 .7493 36 .6517 buy myopic
4 3 .0025 .1422 .3952 1.9972 .0841 4 .8975 buy strategic
5 1 .0000 .0201 .3955 1.6656 .2142 16 .0201 buy myopic
6 2 -.2791 .0648 .3950 1.8534 -.0813 4 .1281 buy strategic
7 1 .0000 .1520 .3954 2.0093 .0841 4 .1520 buy strategic
8 2 -.2823 .2772 .3955 2.0316 -.0157 72 .2721 buy strategic
9 2 -.3038 .0626 .3955 1.8767 -.1136 4 .1236 buy strategic
10 1 .0000 .0354 .3952 1.8841 -.1136 4 .0354 buy myopic
11 1 .0000 .0012 .0952 1.5848 -.3413 36 .0012 buy myopic
12 2 -.3038 .0626 .3956 1.9685 -.1458 4 .0935 wait strategic
13 2 -.2791 .0648 .3955 1.9698 -.1458 4 .1180 wait strategic
14 1 .0000 .1208 .3955 1.9719 -.1458 4 .1208 buy strategic
15 2 -.2823 .2772 .3956 2.1449 -.0262 72 .3956 wait strategic
16 1 .0000 .0351 .3953 2.2189 .0042 37 .0351 buy myopic
17 1 .0000 .0216 .3950 1.9248 .0391 5 .0216 buy myopic
18 2 .4830 .2166 .3955 1.3975 .1630 90 .9162 buy myopic
19 2 .0191 .2167 .3955 1.4017 .1630 90 .4525 buy myopic
20 3 .0025 .1422 .3957 2.8616 .0459 4 .4007 wait strategic
21 3 .0025 .1422 .3952 2.8667 .0459 4 .0029 wait strategic
22 1 .0000 .0519 .3951 2.8681 .0459 4 .0519 buy myopic
23 2 .0191 .2167 .3951 1.8159 -.1786 45 .3951 wait myopic
24 1 .0000 .2946 .3954 1.9816 .1193 62 .2946 buy strategic
25 1 .0000 .3360 .3952 1.8014 -.0056 32 .3360 buy strategic
26 1 .0000 .0437 .3949 2.0005 .0938 22 .0437 buy myopic
27 3 .2260 .2776 .3953 2.3988 -.1081 23 .8473 wait strategic
28 2 .0965 .2371 .3956 2.5275 .2571 19 .3956 wait strategic
29 2 .4830 .2166 .3947 2.2753 -.0893 45 .3947 wait myopic
30 1 .0000 .0054 .3948 1.5344 -.0066 66 .0054 buy myopic
31 1 .0000 .0647 .3950 3.3804 -.0833 22 .0647 buy myopic
32 3 .2260 .2776 .3952 2.8508 -.0698 23 .3952 wait strategic
33 1 .0000 .0482 .3956 2.0657 -.0312 34 .0482 buy myopic
34 1 .0000 .0027 .3953 1.8723 -.0875 44 .0044 buy strategic
35 2 .2151 .2638 .3952 1.6911 -.0615 26 .8253 wait strategic
36 1 .0000 .0473 .0872 1.0733 -.1239 24 .0473 buy myopic
37 1 .0000 .0740 .5145 4.2297 -.1149 45 .0740 buy myopic
38 1 .0000 .0106 .3956 4.2727 -.1250 22 .0106 buy myopic
39 1 .0000 .1971 1.2984 1.4026 .0745 17 .1971 buy strategic
40 2 .4412 .2008 .3956 4.4394 -.0841 4 .8428 buy strategic
41 1 .0000 .0370 .3951 1.8340 -.0207 13 .0370 buy myopic
42 1 .0000 .2908 .3956 3.0074 -.0930 9 .2908 buy myopic
43 1 .0000 .3047 .3951 3.0083 -.0930 9 .3047 buy myopic
44 2 .0590 .2299 .3117 1.6041 .1159 28 .6124 buy strategic
45 2 .5466 .4896 .1041 4.9090 -.2902 48 1.5257 buy strategic
46 2 .0891 .2796 .3953 1.6529 .1021 35 .7476 buy strategic
47 1 .0000 .2731 .3952 3.0508 -.0540 4 .2731 buy strategic
48 3 .3040 .2978 .3954 2.7801 .1930 9 .3954 wait strategic
49 1 .0000 .1120 .3951 4.6055 -.0509 47 .0364 buy strategic
50 3 .2823 .1661 .0621 .7399 .0308 74 .6891 buy strategic
51 1 .0000 .0015 .3954 4.8704 -.0475 12 .0015 buy myopic
52 1 .0000 .0366 .3953 5.2168 -.0781 23 .0366 buy myopic
53 1 .0000 .0029 .3952 2.0195 -.0745 4 .0033 buy strategic
54 1 .0000 .1312 .3956 3.0393 -.0484 9 .1312 buy strategic
55 5 .3264 .3375 .5132 4.0722 .1721 35 1.5584 wait myopic
56 8 .3508 .3251 .3952 3.8945 .0476 17 .3921 wait strategic
57 2 .4696 .1998 .3950 2.1609 .1061 32 .8692 buy strategic
58 2 4.2156 .8762 .3952 4.4403 -.0359 9 8.6492 wait strategic
59 1 .0000 .3028 .3955 5.0309 -.0354 49 .3028 buy myopic
60 7 .0451 .0226 .0414 1.1456 -.0091 68 .0824 wait myopic
61 5 .3264 .3375 .4822 4.5413 .2288 35 1.0584 wait myopic
62 1 .0000 .3865 .3953 1.7988 -.2597 13 .3865 buy myopic
63 3 .2823 .1661 .3950 1.3874 .0766 37 .3950 wait strategic
64 1 .0000 .0076 .0525 .6601 .5694 20 .0076 buy myopic
65 2 -.1081 .3645 .3956 2.0158 .0908 44 .6758 buy strategic
66 1 .0000 .0145 .3957 1.9477 .2545 23 .0145 buy myopic
67 1 .0000 .0946 .3950 1.5110 .1146 17 .0946 buy strategic
68 1 .0000 .1521 .3953 4.8960 -.0789 13 .1521 buy strategic
69 1 .0000 .1538 .3953 3.7315 -.0806 13 .1538 buy strategic
70 1 .0000 .1898 .3950 1.8874 .0912 4 .1898 buy strategic
71 1 .0000 .0169 .5187 5.8659 .1897 51 .0169 buy myopic
72 1 .0000 .1523 .3956 1.4866 .7339 23 .1523 buy strategic
73 1 .0000 .0025 .3956 1.5734 -.0260 35 .0025 buy myopic
74 1 .0000 .0178 .3957 3.7137 -.0455 12 .0178 buy myopic
75 1 .0000 .0053 .3954 1.6044 -.0260 35 .0053 buy myopic
76 2 .2950 .3919 .3954 5.6356 -.0499 23 .3954 wait myopic
77 1 .0000 .2603 .3954 6.1565 -.0851 6 .2603 buy strategic
78 2 -.1081 .3645 .3956 2.3032 -.0265 46 .3406 wait strategic
79 1 .0000 .3574 .3953 1.8335 -.0468 74 .3574 buy strategic
80 2 .5466 .4896 .9787 6.4343 .0409 50 .9787 wait strategic
81 5 .3264 .3375 .5167 5.1174 .2288 35 .5167 wait myopic
82 1 .0000 1.3152 2.4823 7.0420 .0006 20 1.3152 buy strategic
83 2 -.3640 1.0160 2.1650 6.6726 -.0332 3 1.7994 wait strategic
84 1 .0000 .0091 .0625 1.8334 -.1059 13 .0091 buy myopic
85 1 .0000 .0086 2.5670 6.8385 .0639 13 .0086 buy myopic
86 2 .0509 .1982 .3956 2.1657 .1008 17 .4474 buy strategic
87 1 .0000 .2620 .3955 2.1656 .1804 10 .2620 buy strategic
88 1 .0000 .2983 .3954 3.3711 .1039 34 .2983 buy myopic
89 2 .0055 .2009 .3950 1.5221 .0743 10 .3950 wait strategic
90 1 .0000 .0375 .3953 1.7856 .1630 4 .0375 buy myopic
91 2 .0509 .1982 .3953 2.6119 .0046 17 .3953 wait strategic
92 4 .4436 .3015 .3953 1.5641 .0307 18 1.9227 buy strategic
93 1 .0000 .0251 .3952 1.4938 .0492 18 .0251 buy myopic
94 2 -.2686 .0680 .3950 3.4374 .1021 37 .1356 buy strategic
95 1 .0000 .0347 .3954 2.8586 .0931 13 .0347 buy myopic
96 1 .0000 .0028 .3956 1.4157 .0083 35 .0028 buy myopic
97 1 .0000 .1505 .3952 3.4792 .1021 32 .1505 buy myopic
98 1 .0000 .0027 .3125 7.1637 .9526 34 .0027 buy myopic
99 1 .0000 .0014 2.2310 7.8011 .0019 5 .0014 buy myopic
100 1 .0000 .0071 .3953 1.5328 .2059 10 .0071 buy myopic
101 3 .8433 .3500 .3949 1.9533 -.2175 5 3.1408 buy strategic
102 1 .0000 .0011 .3949 2.0262 -.0512 16 .0011 buy myopic
103 1 .0000 .0033 .3951 3.5201 .4109 22 .0033 buy myopic
104 1 .0000 .2282 .3950 3.5575 .0047 32 .2282 buy myopic
105 1 .0000 .2322 .3956 3.5623 .0047 32 .2322 buy myopic
106 2 -.2686 .0680 .3955 3.5643 .0047 37 .1273 wait strategic
107 1 .0000 .2352 .3954 3.5662 .0047 32 .2352 buy myopic
108 1 .0000 .0092 .0620 1.0280 -.0545 17 .0092 buy myopic
109 1 .0000 .1817 .3954 2.2364 -.0845 5 .1817 buy strategic
110 1 .0000 .0275 2.3441 9.9386 .1911 10 .0275 buy myopic
111 1 .0000 .2425 .3955 2.6080 .4087 23 .1026 buy strategic
112 1 .0000 .0954 .3950 2.6103 .4087 23 .0954 buy strategic
113 1 .0000 .7706 1.9595 7.9872 .0416 13 .7706 buy myopic
114 1 .0000 .2167 .3955 3.8073 -.0999 24 .2167 buy strategic
115 1 .0000 .2171 .3951 3.8097 -.0999 24 .2171 buy strategic
116 1 .0000 .0018 .3950 3.0956 .0909 20 .0018 buy myopic
117 1 .0000 .0197 .1288 3.2260 .7820 13 .0197 buy myopic
118 1 .0000 .1361 .3951 3.4145 .4348 18 .1361 buy strategic
119 1 .0000 .1734 .3950 2.8249 .0191 35 .1734 buy myopic
120 1 .0000 .0190 .3955 3.8225 .1021 37 .0352 buy strategic
121 1 .0000 .1372 .3955 5.3948 -.0337 47 .1372 buy strategic
122 1 .0000 .1428 .3953 3.8175 -.1297 48 .1428 buy strategic
123 1 .0000 .3145 .3953 3.8745 .1021 37 .3145 buy strategic
124 1 .0000 .2392 .3951 3.8757 .1021 37 .2392 buy strategic
125 1 .0000 .3152 .3951 3.8757 .1021 37 .3152 buy strategic
126 1 .0000 .0105 .3954 1.8774 .1198 23 .0105 buy myopic
127 1 .0000 .0005 2.6824 10.5685 .0125 43 .0005 buy myopic
128 1 .0000 .1246 .3122 1.5024 -.0775 3 .1246 buy strategic
129 1 .0000 .2022 .3954 3.0357 .1150 4 .5924 buy strategic
130 1 .0000 .0224 .3952 1.5424 .3958 34 .0224 buy myopic
131 3 .8433 .3500 .3951 2.7694 -.0993 5 2.5059 wait strategic
132 1 .0000 1.8552 2.0989 6.3705 -.0094 36 1.8552 buy myopic
133 1 .0000 1.5541 2.1265 11.0751 .0116 36 1.5541 buy strategic
134 1 .0000 .1262 .3953 2.6148 .0031 13 .1262 buy myopic
135 2 -.0778 .2238 .3952 4.0549 .2334 9 .4472 buy strategic
136 1 .0000 .0028 .3954 1.4065 .0444 35 .0028 buy myopic
137 1 .0000 .0410 2.1773 6.8863 .0113 10 .0410 buy myopic
138 1 .0000 .0074 .3951 2.9499 .0384 26 .0074 buy myopic
139 1 .0000 .0298 .3950 1.8263 -.0767 4 .0298 buy myopic
140 1 .0000 .0567 .3954 1.9572 .0542 24 .0567 buy myopic
141 1 .0000 .1385 .3954 1.8009 .1056 5 .1385 buy strategic
142 1 .0000 .1546 .3954 2.0565 -.0424 4 .1546 buy myopic
143 1 .0000 .2147 .3953 1.8196 .0248 22 .2147 buy strategic
144 1 .0000 .0663 .3954 4.3489 -.0516 4 .0663 buy myopic
145 1 .0000 .0552 2.3098 10.2375 .0008 26 .0552 buy myopic
146 1 .0000 .3095 .3956 2.0219 .0396 50 .2707 buy strategic
147 2 .0156 .3601 .3955 2.9281 .1111 12 .3955 wait myopic
148 1 .0000 .0213 .3955 2.8469 -.0816 19 .0213 buy myopic
149 1 .0000 .2614 .9785 7.9903 .0159 19 .2614 buy myopic
150 1 .0000 1.1773 1.9427 8.8489 -.0194 34 1.1773 buy myopic
151 1 .0000 .3784 1.9580 9.9303 .0135 20 .3784 buy strategic
152 1 .0000 .5319 1.9683 9.9232 .0006 5 .5319 buy strategic
153 1 .0000 .2025 .4899 4.9031 .0725 8 .2025 buy strategic
154 2 -2.5954 .4871 2.5954 8.1933 .0029 41 .9830 buy strategic
155 1 .0000 .1562 .3950 2.8937 .1200 16 .1562 buy strategic
156 2 -2.5954 .4871 2.6066 8.2045 .0029 51 .0023 wait strategic
157 1 .0000 .0213 .0416 2.2798 .0516 10 .0213 buy myopic
158 1 .0000 .0006 2.6426 7.1919 .0170 10 .0006 buy myopic
159 1 .0000 .0168 2.6473 7.0161 .0078 8 .0168 buy myopic
160 6 .0988 .0740 .1319 2.0971 1.6738 35 .0432 wait myopic
161 2 .4070 .3852 .3953 2.6828 .0789 22 1.3298 wait strategic
162 1 .0000 .0115 .3950 5.5027 .0987 8 .0115 buy myopic
163 3 3.5814 2.6370 2.0903 7.9382 .0005 26 23.4267 buy strategic
164 1 .0000 .0043 .3953 1.6953 .0461 40 .0043 buy myopic
165 1 .0000 .1014 .3955 1.1796 .0983 5 .1014 buy strategic
166 1 .0000 .0014 2.1616 11.6449 .0063 11 .0014 buy myopic
167 2 -.9773 .4728 .9787 6.0801 .0010 10 .8721 buy myopic
168 2 -.9773 .4728 .9785 6.0813 .0010 19 .0747 wait myopic
169 1 .0000 .0022 .3952 5.7647 -.0676 13 .0022 buy myopic
170 2 .2114 .1997 .3955 1.6830 -.0996 3 .3955 wait strategic
171 10 .1124 .0849 .3952 2.0396 .0743 10 .6761 wait strategic
172 2 -.3548 .0218 .3951 3.1833 .0952 35 .0433 buy strategic
173 2 .1802 .1979 .3951 1.4979 -.1321 21 .3951 wait strategic
174 1 .0000 .0015 .3951 4.1229 -.0500 16 .0015 buy myopic
175 3 .2408 .2029 .3952 1.6216 -.1191 17 .6235 wait strategic
176 1 .0000 .0574 .3952 3.5632 -.1078 17 .0574 buy myopic
177 1 .0000 .0084 2.2714 7.7478 .2213 38 .0084 buy myopic
178 2 -.3548 .0218 .3953 3.2238 .0476 35 .0408 wait strategic
179 1 .0000 .0407 .3956 2.7158 -.0410 38 .0407 buy myopic
180 1 .0000 .0141 .0621 .5496 -.0314 13 .0141 buy myopic
181 1 .0000 .0637 .3954 3.5717 -.1078 17 .0637 buy myopic
182 1 .0000 .0648 .3955 3.5740 -.1078 17 .0648 buy myopic
183 1 .0000 1.0991 2.2865 7.0233 .0093 17 1.0991 buy strategic
184 1 .0000 .0192 .3956 3.5622 -.0148 13 .0192 buy myopic
185 1 .0000 .0206 .3956 3.7206 .0300 26 .0206 buy myopic
186 4 .2264 .1220 .3954 1.6003 -.0893 4 .3954 wait strategic
187 2 -.3940 .0010 .3954 1.7565 .0307 20 .0021 buy strategic
188 2 -.3940 .0010 .3953 1.7578 .0307 20 .0011 wait strategic
189 1 .0000 .1966 .3951 2.9778 -.1458 15 .1966 buy strategic
190 1 .0000 .1971 .3956 2.9789 -.1458 15 .1971 buy strategic
191 1 .0000 .6856 2.3367 7.4277 .3564 8 .6856 buy strategic
192 1 .0000 .0763 .5112 4.6258 .0698 19 .0763 buy myopic
193 1 .0000 .3375 .3949 3.1435 -.1463 15 .2800 buy strategic
194 1 .0000 .1448 .4879 4.7830 -.0985 23 .1448 buy strategic
195 1 .0000 .0106 2.3722 6.9910 .0175 20 .0106 buy myopic
196 1 .0000 .0125 .3950 4.8430 -.1141 19 .0125 buy myopic
197 1 .0000 .2483 .3955 2.1809 -.0645 22 .2483 buy strategic
198 1 .0000 .0057 .3951 2.1771 .0317 13 .0057 buy myopic
199 1 .0000 .0020 .0625 .8481 -.2198 12 .0020 buy myopic
200 2 .0203 .0657 .0985 1.1263 1.3429 34 .0985 wait myopic
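The table above is whitespace-separated, with a decision label (buy/wait) and a customer-type label (myopic/strategic) per sample. As a minimal illustrative sketch only (not the thesis's actual pipeline), rows in this format can be parsed with pandas and the class balance inspected before training a classifier; the snake_case column name customer_type and the embedded sample rows are assumptions for the example:

```python
import io

import pandas as pd

# A few rows copied verbatim from the appendix table; in practice the full
# 200-sample table would be read from a file with the same separator.
raw = """No FB ICR TS HP DD APT NF WPT decision customer_type
1 1 .0000 .4574 .4817 1.3185 .0636 3 .4574 buy myopic
2 1 .0000 .0009 .0972 1.4097 1.1565 28 .0009 buy myopic
4 3 .0025 .1422 .3952 1.9972 .0841 4 .8975 buy strategic
12 2 -.3038 .0626 .3956 1.9685 -.1458 4 .0935 wait strategic
"""

# Whitespace-separated values; numeric columns (including ".0000"-style
# floats) are parsed automatically.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Class balance of the two label columns used in the classification study.
print(df["decision"].value_counts().to_dict())
print(df["customer_type"].value_counts().to_dict())
```

For the full dataset, the same call with a file path in place of the `StringIO` buffer would suffice, and the balance check indicates whether metrics robust to class imbalance (e.g. the F-measure discussed in the references) are needed.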