Unexpectedness and Non-Obviousness in Recommendation Technologies and Their Impact
on Consumer Decision Making
by
Panagiotis Adamopoulos
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
Department of Information, Operations and Management Sciences
Leonard N. Stern School of Business, New York University
April 2016
Doctoral Committee:
Professor Gediminas Adomavicius
Professor Anindya Ghose
Assistant Professor Srikanth Jagabathula
Professor Alexander Tuzhilin, Chair
© Panagiotis Adamopoulos 2016
All Rights Reserved
ACKNOWLEDGEMENTS
I would like to thank my doctoral advisor and co-author, Prof. Alexander Tuzhilin,
for his support, guidance, and valuable feedback throughout these years.
I owe much gratitude to Leonard N. Stern School of Business and the Department
of Information, Operations & Management Sciences (IOMS). Special thanks to Prof.
Anindya Ghose, Prof. Gediminas Adomavicius, Prof. Daria Dzyabura, Prof. Srikanth
Jagabathula, Prof. Panos Ipeirotis, Prof. Hila Lifshitz-Assaf, Prof. Natalia Levina,
Prof. Foster Provost, and Prof. Norman White.
I would also like to thank my parents and my brother Eleftherios for always having
faith in me and supporting me with all their means.
Most of all, I would like to thank Vilma Todri for her unconditional love, encour-
agement, and support.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . ii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . x
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
II. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Over-specialization of Recommendations . . . . . . . . . . . . 9
2.2 Concentration Bias of Recommendations . . . . . . . . . . . . 9
2.3 Novelty of Recommendations . . . . . . . . . . . . . . . . . . 10
2.4 Serendipity of Recommendations . . . . . . . . . . . . . . . . 11
2.5 Diversity of Recommendations . . . . . . . . . . . . . . . . . 12
2.6 Unexpectedness of Recommendations . . . . . . . . . . . . . 14
2.7 Business Value of Recommendations . . . . . . . . . . . . . . 15
2.8 Recommender Systems and Consumer Decision Making . . . 16
III. Neighborhood Selection in Collaborative Filtering Systems . 19
3.1 Introduction to Neighborhood Selection . . . . . . . . . . . . 19
3.2 Related Approaches . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Methods for Neighborhood Selection . . . . . . . . . . . . . . 24
3.3.1 Neighborhood Models . . . . . . . . . . . . . . . . . 24
3.3.2 Theoretical Motivation . . . . . . . . . . . . . . . . 26
3.3.3 Probabilistic Neighborhood Selection . . . . . . . . 30
3.3.4 Optimized Neighborhood Selection . . . . . . . . . . 32
3.4 Experimental Settings of Probabilistic Neighborhood Selection . 38
3.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . 39
3.5 Results of Probabilistic Neighborhood Selection . . . . . . . . 43
3.5.1 Orthogonality of Recommendations . . . . . . . . . 45
3.5.2 Comparison of Coverage and Diversity . . . . . . . 49
3.5.3 Comparison of Dispersion and Diversity Reinforcement . 52
3.5.4 Comparison of Item Prediction . . . . . . . . . . . . 55
3.5.5 Comparison of Utility-based Ranking . . . . . . . . 60
3.6 Experimental Settings of Optimized Neighborhood Selection . 61
3.7 Results of Optimized Neighborhood Selection . . . . . . . . . 62
3.7.1 Orthogonality of Recommendations . . . . . . . . . 63
3.7.2 Comparison of Coverage and Diversity . . . . . . . 65
3.7.3 Comparison of Dispersion and Diversity Reinforcement . 67
3.7.4 Comparison of Item Prediction . . . . . . . . . . . . 67
3.7.5 Comparison of Rating Prediction . . . . . . . . . . . 70
3.7.6 Comparison of Utility-based Ranking . . . . . . . . 73
3.8 Discussion of Neighborhood Selection . . . . . . . . . . . . . 73
IV. Unexpectedness in Recommender Systems . . . . . . . . . . . 77
4.1 Introduction to Unexpectedness . . . . . . . . . . . . . . . . . 77
4.2 Related Concepts . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Definition of Unexpectedness . . . . . . . . . . . . . . . . . . 81
4.3.1 Unexpectedness of Recommendations . . . . . . . . 81
4.3.2 Utility of Recommendations . . . . . . . . . . . . . 84
4.3.3 Recommendation Algorithm . . . . . . . . . . . . . 86
4.3.4 Evaluation of Recommendations . . . . . . . . . . . 87
4.4 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 90
4.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . 94
4.5 Results of Unexpectedness Method . . . . . . . . . . . . . . . 101
4.5.1 Comparison of Unexpectedness . . . . . . . . . . . . 103
4.5.2 Comparison of Rating Prediction . . . . . . . . . . . 112
4.5.3 Comparison of Item Prediction . . . . . . . . . . . . 116
4.5.4 Comparison of Diversity and Dispersion . . . . . . . 120
4.6 Discussion of Unexpectedness . . . . . . . . . . . . . . . . . . 126
V. Business Value of Recommendations . . . . . . . . . . . . . . . 130
5.1 Introduction to Business Value of Recommendations . . . . . 130
5.2 Related Concepts and Approaches . . . . . . . . . . . . . . . 133
5.3 Empirical Setting and Data . . . . . . . . . . . . . . . . . . . 137
5.4 Empirical Method and Models . . . . . . . . . . . . . . . . . 143
5.4.1 Identification Strategy . . . . . . . . . . . . . . . . 146
5.5 Deep-Learning Model of User-Generated Reviews . . . . . . . 147
5.6 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . 148
5.6.1 Out-of-Sample Performance . . . . . . . . . . . . . 152
5.6.2 Moderating Effects on Business Value . . . . . . . . 155
5.7 Robustness Checks . . . . . . . . . . . . . . . . . . . . . . . . 163
5.7.1 Falsification Tests . . . . . . . . . . . . . . . . . . . 172
5.8 Discussion of Business Value of Recommendations . . . . . . 175
VI. Conclusions, Limitations, and Future Directions . . . . . . . . 182
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
A. Measuring the Concentration Reinforcement Bias of Recommender Systems . . . . . . . . . . . . 191
B. Weighted Percentile Methods in Collaborative Filtering Systems . . . . . . . . . . . . 197
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
LIST OF FIGURES
Figure
3.1 Examples of Sampling Distributions for Probabilistic Neighborhood Selection . . . 42
3.2 Examples of Probabilistically Sampled Neighborhoods . . . 42
3.3 Summary of Performance of Probabilistic Neighborhood Selection . . . 44
3.4 Overlap of Recommendations of Probabilistic Neighborhood Selection . . . 46
3.5 Correlation of Recommendations of Probabilistic Neighborhood Selection . . . 48
3.6 Aggregate Diversity of Probabilistic Neighborhood Selection . . . 50
3.7 Dispersion of Recommendations of Probabilistic Neighborhood Selection . . . 53
3.8 Diversity Reinforcement of Probabilistic Neighborhood Selection . . . 56
3.9 Item Prediction Performance of Probabilistic Neighborhood Selection . . . 57
3.10 Item Prediction Performance of Probabilistic Neighborhood Selection for Fixed Recommendation Lists Size . . . 58
3.11 Summary of Performance of Optimized Neighborhood Selection . . . 63
3.12 Overlap of Recommendations of Optimized Neighborhood Selection . . . 64
3.13 Aggregated Diversity of Optimized Neighborhood Selection . . . 66
3.14 Dispersion of Recommendations of Optimized Neighborhood Selection . . . 68
3.15 Diversity Reinforcement of Optimized Neighborhood Selection . . . 69
3.16 Item Prediction Performance of Optimized Neighborhood Selection . . . 71
3.17 Rating Prediction Performance of Optimized Neighborhood Selection . . . 72
3.18 Utility-based Ranking of Optimized Neighborhood Selection . . . 74
4.1 Unexpectedness Performance . . . 105
4.2 Unexpectedness Performance for Different Algorithms . . . 106
4.3 Unexpectedness Performance for Different Sets of Expectations . . . 108
4.4 Post-hoc Analysis of Unexpectedness Performance . . . 109
4.5 Rating Prediction Performance of Unexpectedness Method . . . 115
4.6 Post-hoc Analysis of Rating Prediction Performance . . . 116
4.7 Item Prediction Performance of Unexpectedness Method . . . 119
4.8 Post-hoc Analysis of Item Prediction Performance . . . 120
4.9 Diversity Performance of Unexpectedness Method . . . 123
4.10 Post-hoc Analysis of Diversity Performance . . . 124
4.11 Dispersion Performance of Unexpectedness Method . . . 124
5.1 Locations of Recommended Venues . . . 138
5.2 Correlation of Main Variables Employed in Econometric Specifications for Business Value of Recommendations . . . 141
5.3 In-Sample Evaluation of Econometric Specifications Assessing Business Value of Recommendations . . . 156
5.4 Out-of-Sample Evaluation of Econometric Specifications Assessing Business Value of Recommendations . . . 156
5.5 Conceptual Model of Effects of Recommender Systems . . . 177
A.1 Performance (ranking) of various RS algorithms . . . 196
B.1 Weighted Percentile Methods: Prediction Accuracy . . . 202
B.2 Weighted Percentile Methods: Post hoc analysis . . . 203
LIST OF TABLES
Table
3.1 Probability Distributions for Neighborhood Selection . . . 40
4.1 Sets of Expected Recommendations . . . 98
4.2 Unexpectedness Performance . . . 104
4.3 Rating Prediction Performance of Unexpectedness Method . . . 114
4.4 Item Prediction Performance of Unexpectedness Method . . . 117
4.5 Diversity Performance of Unexpectedness Method . . . 122
5.1 Locations and Corresponding Number of Venues . . . 139
5.2 Coefficient Estimates of Logit Model . . . 151
5.3 Coefficient Estimates of Nested Logit Model . . . 153
5.4 Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects . . . 154
5.5 In-Sample Validation of Nested Logit Model with Alternative-level Fixed effects . . . 155
5.6 Out-of-Sample Validation of Nested Logit Model with Alternative-level Fixed effects . . . 155
5.7 In-Sample Validation of Logit Model . . . 156
5.8 Out-of-Sample Validation of Logit Model . . . 157
5.9 In-Sample Validation of Nested Logit Model . . . 157
5.10 Out-of-Sample Validation of Nested Logit Model . . . 157
5.11 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Price . . . 160
5.12 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Marketing Promotions . . . 161
5.13 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Novelty . . . 162
5.14 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Popularity . . . 164
5.15 Moderating Effect of Context on Effectiveness of Recommendations – Public Holidays . . . 165
5.16 Moderating Effect of Context on Effectiveness of Recommendations – Temperature . . . 166
5.17 Coefficient Estimates of Nested Logit Model (Sub-sample Analysis) . . . 168
5.18 Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects (Sub-sample Analysis) . . . 169
5.19 Coefficient Estimates of Logit Model with Random Coefficients . . . 170
5.20 Coefficient Estimates of Nested Logit Model with Random Coefficients . . . 171
5.21 Coefficient Estimates of Nested Logit Model with Instrumental Variables . . . 173
5.22 Coefficient Estimates of Nested Logit Model with Instrumental Variables . . . 174
5.23 Falsification Check (Pseudo-recommendations) . . . 175
5.24 Falsification Check (Pseudo-timing of recommendations) . . . 175
B.1 Weighted Percentile Methods: Item Prediction Accuracy . . . 203
B.2 Weighted Percentile Methods: Catalog Coverage Performance . . . 205
LIST OF ABBREVIATIONS
AMZ Amazon
BC BookCrossing
CF Collaborative Filtering
HCI Human-Computer Interaction
IR Information Retrieval
IS Information Systems
k-BN k Better Neighbors
k-FN k Furthest Neighbors
k-NN k Nearest Neighbors
k-PN k Probabilistic Neighbors
MAD Mean-Absolute Deviation
MAPE Mean-Absolute-Percent Error
MF Matrix Factorization
ML MovieLens
MSE Mean-Square Error
MT MovieTweetings
O2O Online-to-offline
RMSE Root-Mean-Square Error
RS Recommender System
US United States
ABSTRACT
Unexpectedness and Non-Obviousness in Recommendation Technologies and Their Impact on Consumer Decision Making
by
Panagiotis Adamopoulos
Chair: Alexander Tuzhilin
Despite the numerous benefits of personalization techniques, many current approaches
create “filter bubbles”; i.e., isolated information neighborhoods that are highly cus-
tomized based on a user’s prior activity patterns, and that sometimes significantly re-
duce cultural or ideological diversity in what a user is exposed to. These filter bubbles
limit exposure to alternative views and options, often without the user actually realiz-
ing this kind of “informational isolation” is occurring. This thesis presents a number
of studies aiming at moving recommender systems beyond the traditional paradigm
and the classical perspective of rating prediction accuracy. We contribute to existing
helpful but less explored recommendation strategies and propose new approaches aim-
ing at more useful recommendations for both users and businesses. Working towards
this direction, we discuss the studies we have conducted in this stream of research
with the goal to avoid this problem of filter bubbles and, in particular, to alleviate the
over-specialization and concentration problems in recommender systems by design-
ing techniques that deliver non-obvious, unexpected, and high quality personalized
recommendations. The overall goal of this research program is to move the focus away from even more accurate rating predictions and towards offering a holistic experience to the users. The conducted prescriptive studies are supplemented with descriptive
user behavior studies that examine the effects of the proposed type of non-obvious
(unexpected) recommendations on consumer decision-making.
In particular, we formulate the classical neighborhood-based collaborative filter-
ing method as an ensemble method, thus, allowing us to show the suboptimality of
the nearest neighbors (k-NN) approach in terms of not only over-specialization and
concentration biases but also predictive accuracy. Besides, focusing on neighborhood
selection, we propose a novel optimized neighborhood-based method (k-BN; k Better
Neighbors) and a new probabilistic neighborhood-based method (k-PN; k Probabilis-
tic Neighbors) as improvements of the standard k-NN approach alleviating some of
the most common problems of collaborative filtering recommender systems, based
on classical metrics of dispersion and diversity of recommendations as well as some
newly proposed metrics. Furthermore, we propose a concept of unexpectedness in
recommender systems illustrating the differences from the related but different terms
of novelty, serendipity, and diversity. We then operationalize unexpectedness by sug-
gesting various mechanisms for specifying the expectations of the users and proposing
a recommendation method for providing the users with non-obvious but high qual-
ity personalized recommendations that fairly match their interests based on specific
metrics of unexpectedness. Finally, we employ econometric modeling and machine
learning techniques in order to estimate the effectiveness and impact of various types of
recommendations in the mobile context on consumers’ utility and real-world demand.
Concluding this thesis, we also summarize the conclusions of the conducted studies,
discuss the limitations of our work, outline the implications of this stream of research,
and theoretically integrate our findings into the existing literature on recommender
systems by extending a current conceptual model of the effects of recommender sys-
tem use, their characteristics, and other factors on consumer decision-making.
CHAPTER I
Introduction
Personalization techniques (i.e., the design, management, and delivery of content
and business processes to users based on known, observed, and predictive informa-
tion [Meister et al., 2002]) play an important role in both businesses and society. In
many aspects of our everyday life, our choices are guided by such promising tech-
niques. However, despite the numerous benefits of these techniques, narrow-minded
personalization can create “filter bubbles” [Pariser, 2011b]: invisible and personal uni-
verses of information that might trap users into a relevance paradox confining them
to isolated information neighborhoods and restricting them from seeing or exploring
the vast array of other possibilities [Andrews, 1984; Pariser, 2011b] (also known as
“pigeonhole” problem). In particular, in an effort to overcome information overload
problems, we build systems that aim at discovering such personal universes and are
rewarded for delivering information exclusively from these universes at the expense of
providing more serendipitous information [Leopold, 2013]. Our increasing reliance on
these systems in combination with our consumption behaviors [Bakshy et al., 2015]
can then establish strong feedback loops that narrow the diversity of our choices and
limit our exposure to alternative views, disturbing the intrinsic balance of our choices
[Pariser, 2011a], as advocated by various philosophers and researchers, including Pat-
tie Maes, who developed one of the first recommender systems (RSes) [Shardanand
and Maes, 1995; Thompson, 2008]. Due to the prevalence of such simplistic person-
alization approaches that create filter bubbles and in view of the importance of the
implications of these over-specialization problems, which in specific aspects of our ev-
eryday lives are associated with even adopting more extreme attitudes [Stroud, 2008]
and misperceiving facts about events [Kull et al., 2003], there is growing scientific
interest in this phenomenon (e.g., [Bakshy et al., 2015]).
At the same time, in a pursuit of relevance, common personalization techniques
disproportionately take into consideration and amplify the popularity of available
choices and options. Because of that, and also due to certain statistical biases [Pradel
et al., 2012; Radovanovic et al., 2010], such personalization techniques are character-
ized by a severe concentration bias and tend to amplify a “rich-get-richer” effect for
already popular options, at the expense of the long-tail (also known as “blockbuster”
problem) (e.g., [Adomavicius and Kwon, 2011, 2012; Evans, 2008; Radovanovic et al.,
2010]). As a result, common personalization techniques that serve such purposes
guide our choices towards common and frequent consumption patterns. In other
words, naive algorithms and personalization techniques can create commonalities
among these emerging filter bubbles, making them more similar to each other over time; such concentration biases can hence be thought of, metaphorically, as a “gravitational force” grouping together and shrinking filter bubbles. Thus, narrow-minded personalization techniques can have additional detri-
mental effects, such as deconstructing non-prevailing views, opinions, and behaviors
(e.g., [Evans, 2008]).
Online recommender systems are one family of personalization technologies and information systems for which there is initial empirical evidence that they often suffer from these effects [Nguyen et al., 2014]. Over the last two decades, a wide variety of
different types of recommender systems (RSes) has been developed and successfully
used across several domains [Adomavicius and Tuzhilin, 2005]. During this time,
many researchers have focused mainly on the development and improvement of effi-
cient algorithms for more accurate rating prediction. Although the recommendations
of the latest class of systems are significantly more accurate than they used to be a
decade ago [Bell et al., 2009] and the broad social and business acceptance of RSes
has already been achieved, there is still a long way to go in terms of satisfaction of the
actual needs of the users [Konstan and Riedl, 2012]. This is due, primarily, to the fact
that many existing RSes focus on providing even more accurate rather than more use-
ful recommendations. The key under-explored dimensions for further improvement
include the (perceived) usefulness of random stimuli, diverse viewpoints, attitude-
challenging information, or cross-cutting content in recommendations [Adamopoulos,
2014c]. Instead, common recommenders, such as collaborative filtering (CF) algo-
rithms, recommend products based on prior sales and ratings. Hence, they tend not
to recommend products with limited historical data, even if these items would be
rated favorably. Thus, these recommenders can create a rich-get-richer effect for pop-
ular items while this concentration bias can prevent what may otherwise be better
consumer-product matches [Fleder and Hosanagar, 2009]. This phenomenon leads
to commonalities in exposure, experiences, and selected choices among the different
users and is related to the previously discussed metaphor of “gravitational force”.
At the same time, common RSes usually recommend items very similar to what the
users have already purchased or liked in the past [Abbassi et al., 2009]. However, this
over-specialization of recommendations enhances the aforementioned phenomenon of
filter bubbles by not expanding, or even restricting, users’ exposure to more diverse
and non-obvious options. Nevertheless, the over-specialization and concentration bi-
ases of popular personalization technologies are also often inconsistent with users’
preferences and needs, business goals, and social welfare (e.g., [Evans, 2008; Sinha
and Swearingen, 2001]).
Moving beyond the classical perspective of the rating prediction accuracy, the
main objective of this stream of research is to contribute to existing helpful but less
explored paradigms of RSes as well as to propose new approaches that will result in
more useful recommendations for both users and businesses. Working towards this
direction, we discuss the studies we have conducted towards alleviating the problems
of over-specialization and concentration biases in recommender systems by designing
techniques that deliver to the users non-obvious, diverse, unexpected, and, at the
same time, high quality personalized recommendations, which could potentially ex-
pand users’ exposure and choices. We focus this research program on the front-end
of the “design - solution - perceptions - intentions - behavior” causal chain and we
move the focus away from even more accurate rating predictions and aim at offering a
holistic experience to the users. The conducted prescriptive design science studies
are supplemented with descriptive (explanatory) user behavior studies that examine
the effects of the proposed type of recommendations on consumer decision-making.
Finally, we theoretically integrate our findings into the current literature on RSes by
extending a current conceptual model of the effects of RS use, RS characteristics, and
other factors on consumer decision-making. The discussed impact of RSes on con-
sumers’ decision-making and choices can be supported through the lenses of various
IS theories, including theories of human information processing as well as theories of
satisfaction.
In detail, in Chapter II we first provide a brief survey of the related work on over-
specialization and concentration biases of recommendations as well as the concepts of
novelty, serendipity, diversity, and unexpectedness that offer the potential to alleviate
the aforementioned problems, then present the related work assessing the value of
recommendations for businesses and, finally, we provide a brief overview of the main
IS theories regarding the impact of RSes on consumer decision-making.
Focusing on the problems of over-specialization and concentration biases, in Chap-
ter III we formulate the classical neighborhood-based collaborative filtering method,
k nearest neighbors (k-NN), as an ensemble method, thus, allowing us to show the
suboptimality of the k-NN approach in terms of not only over-specialization and
concentration of recommendations but also predictive accuracy. Besides, focusing on
neighborhood selection, we propose a novel optimized neighborhood-based method (k-
BN; k Better Neighbors) and a new probabilistic neighborhood-based method (k-PN;
k Probabilistic Neighbors) as improvements of the standard k-NN approach alle-
viating some of the most common problems of collaborative filtering recommender
systems, based on classical metrics of dispersion and diversity as well as some newly
proposed metrics.
Another key dimension for significant improvement is the concept of unexpected-
ness. In Chapter IV, we propose a method to improve user satisfaction by generating
unexpected recommendations based on the utility theory of economics. In particular,
we propose a new concept of unexpectedness in RSes as recommending to users those
items that depart from what they expect from the system. We define and formalize
the concept of unexpectedness and discuss how it differs from the related notions of
novelty, serendipity, and diversity. Besides, we suggest several mechanisms for speci-
fying the users’ expectations and propose specific performance metrics to measure the
unexpectedness of recommendation lists. We also take into consideration the qual-
ity of recommendations using certain utility functions and present an algorithm for
providing the users with unexpected recommendations of high quality that are hard
to discover but fairly match their interests. Last but not least, we conduct several
experiments on “real-world” data sets to compare our recommendation results with
several other baseline methods. The proposed approach outperforms these baseline
methods in terms of unexpectedness and other important metrics, such as coverage,
aggregate diversity, and dispersion, while avoiding any accuracy loss.
Chapter V employs econometric modeling and machine learning techniques in or-
der to estimate the impact of recommendations in the mobile context on consumers’
utility and real-world demand. This chapter delves further into the differences in
effectiveness of recommendations and examines this heterogeneity by considering the
moderating effect of various item attributes and contextual factors in order to gain
a more detailed understanding of the effectiveness of the various types of recommen-
dations. We also validate the robustness of our findings using multiple econometric
specifications as well as instrumental variable methods with instruments based on a
machine-learning model employing deep-learning techniques.
Finally, in Chapter VI we summarize the conclusions of the conducted studies
and discuss their theoretical and managerial implications along with the limitations
of our work.
CHAPTER II
Related Work
In this section, we first discuss the extant work on the antecedents and conse-
quences of the problems of over-specialization and concentration biases of recommen-
dations as well as existing approaches aiming at alleviating these problems in RSes.
We then present the related work assessing the value of recommendations for busi-
nesses and their impact on product demand. Finally, we provide a brief overview
of the main IS theories regarding the impact of RSes on consumer decision-making
processes and outcomes.
Since the first collaborative filtering systems were introduced in the mid-90’s
[Goldberg et al., 1992; Konstan et al., 1997], there have been many attempts to im-
prove their performance focusing mainly on rating prediction accuracy [Desrosiers and
Karypis, 2011; Koren and Bell, 2011]. Common approaches include rating normaliza-
tion [Desrosiers and Karypis, 2011] (based on mean-centering [Resnick et al., 1994]
and Z-score [Herlocker et al., 1999]), similarity weighting of neighbors [Said et al.,
2012a] (accounting for significance [Herlocker et al., 2002; Bell et al., 2007] and vari-
ance [Jin et al., 2004]), and neighborhood selection, using top-N filtering, threshold
filtering, or negative filtering [Herlocker et al., 1999; Desrosiers and Karypis, 2011].
Even though the rating prediction perspective is the prevailing paradigm in rec-
ommender systems, there are other perspectives that have been gaining significant
attention in this field [Jannach et al., 2010] and try to alleviate the problems pertain-
ing to the narrow rating prediction focus [Adamopoulos, 2013a, 2014b]. This narrow
focus has been evident in laboratory studies and real-world online experiments, which
indicated that higher predictive accuracy does not always correspond to higher lev-
els of user-perceived quality or to increased sales [McNee et al., 2006; Jannach and
Hegelich, 2009; Jannach et al., 2013; Cremonesi et al., 2011]. Some streams of research
that aim to improve recommender systems going beyond rating prediction accuracy
include work on human-computer interaction (HCI) [Yoo et al., 2013], which involves
the study and design of the interaction between users and RSes, and explanations for recommendations [Tintarev and Masthoff, 2011], which provide transparency into the workings of the recommendation process by exposing the reasoning and data behind each
recommendation. Besides, other approaches pertain to diversification [Adomavicius
and Kwon, 2012; Zhou et al., 2010], which maximizes the variety of items in a rec-
ommendation list, group recommenders [O’Connor et al., 2002], which recommend
items for groups of people, rather than individuals, and recommendation sequences
[Masthoff, 2011], where sequences of ordered items are recommended instead of single
items.
In this dissertation, we focus on two of the most important problems related to
this narrow focus of many RSes that hinder user satisfaction (i.e., the problems of
over-specialization and concentration biases of recommendations) and we propose the
concept of unexpectedness that can alleviate these problems while improving both user
satisfaction and business outcomes.
2.1 Over-specialization of Recommendations
The problem of over-specialization pertains to the observation that common RSes
usually recommend items very similar to what the users have already purchased or
liked in the past [Abbassi et al., 2009]. However, this over-specialization of recom-
mendations is often inconsistent with sales goals and consumers’ preferences. For
instance, Ghose et al. [2012b] provided empirical evidence that indeed consumers
prefer diversity in ranking results. This problem is often practically addressed by
injecting randomness in the recommendation procedure [Balabanovic and Shoham,
1997], filtering out items which are too similar to items the user has rated in the past
[Billsus and Pazzani, 2000], or increasing the diversity of recommendations [Ziegler
et al., 2005]. Interestingly, Said et al. [2012b, 2013] presented an inverted neigh-
borhood model, k-furthest neighbors, to identify less ordinary neighborhoods for the
purpose of creating more diverse recommendations by recommending items disliked
by the least similar users.
2.2 Concentration Bias of Recommendations
Similarly, CF algorithms tend not to recommend products with limited historical
data, even if these items would be rated favorably. Aiming at verifying and measuring
over-specialization bias, Nguyen et al. [2014], employing a longitudinal data set that
represents users’ interactions with RSes and consumption of information items, find
that RSes indeed expose the users to narrowing sets of items over time. Thus, these
recommenders can create a rich-get-richer effect for popular items while this concen-
tration bias can prevent what may otherwise be better consumer-product matches
[Fleder and Hosanagar, 2009]. Studying the concentration bias of recommendations,
Jannach et al. [2013] compared different RS algorithms with respect to aggregate
diversity and their tendency to focus on certain parts of the product spectrum and
showed that popular algorithms may lead to an undesired popularity boost of already
popular items. Finally, Fleder and Hosanagar [2009] showed that this concentration
bias, in contrast to the potential goal of RSes to promote long-tail items, can create a
rich-get-richer effect for popular products leading to a subsequent reduction in profits
and sales diversity and suggested that better RS designs which limit popularity ef-
fects and promote exploration are still needed. However, Hinz et al. [2011] and Matt
et al. [2013] maintain that whether over-specialization and concentration biases will
be enhanced or alleviated depends on the applied personalization technology.
Furthermore, in the past, some researchers working on the over-specialization and
concentration biases of recommender systems tried to provide alternative definitions
of unexpectedness and various related but still different concepts, such as novelty,
diversity, and serendipity. In the following paragraphs, we discuss the aforementioned
concepts and corresponding approaches.
2.3 Novelty of Recommendations
In particular, novel recommendations are recommendations of those items that the
user did not know about [Konstan et al., 2006]. Hijikata et al. [2009] use collaborative
filtering to derive novel recommendations by explicitly asking users what items they
already know. Besides, Weng et al. [2007] suggest a taxonomy-based RS that utilizes
hot topic detection using association rules to improve novelty and quality of recom-
mendations, whereas Zhang and Hurley [2009] propose to enhance novelty at a small
cost to overall accuracy by partitioning the user profile into clusters of similar items
and composing the recommendation list of items that match well with each cluster,
rather than with the entire user profile. Besides, Celma and Herrera [2008] analyze
the item-based recommendation network to detect whether its intrinsic topology has
a pathology that hinders long-tail novel recommendations and Nakatsuji et al. [2010]
define and measure novelty as the smallest distance from the class the user accessed
before to the class that includes target items over the taxonomy.
2.4 Serendipity of Recommendations
Moreover, serendipity, the most closely related concept to unexpectedness, in-
volves a positive emotional response of the user about a previously unknown (novel)
item and measures how surprising these recommendations are [Shani and Gunawar-
dana, 2011]; serendipitous recommendations are, by definition, also novel. However,
a serendipitous recommendation involves an item that the user would not be likely
to discover otherwise, whereas the user might autonomously discover novel items.
Iaquinta et al. [2008] propose to enhance serendipity by recommending novel items
whose description is semantically far from users’ profiles and Kawamae et al. [2009;
2010] suggest an algorithm for recommending novel items based on the assumption
that users follow earlier adopters who have demonstrated similar preferences. In
addition, Sugiyama and Kan [2011] propose a method for recommending scholarly
papers utilizing dissimilar users and co-authors to construct the profile of the target
researcher. Also, Andre et al. [2009] examine the potential for serendipity in web
search and suggest that information about personal interests and behavior may be
used to support serendipity.
2.5 Diversity of Recommendations
Furthermore, diversification is defined as the process of maximizing the variety
of items in a recommendation list. Most of the literature in RSes and Information
Retrieval (IR) studies the principle of diversity to improve user satisfaction. Typical
approaches replace items in the derived recommendation lists to minimize similarity
among all items or remove “obvious” items from them as in [Billsus and Pazzani,
2000]. Ziegler et al. [2005] propose a similarity metric using a taxonomy-based clas-
sification and use this to assess the topical diversity of recommendation lists. They
also provide a heuristic algorithm to increase the diversity of the recommendation list.
Then, Zhang and Hurley [2008] focus on intra-list diversity and address the problem
as the joint optimization of two objective functions reflecting preference similarity
and item diversity, and Hurley and Zhang [2011] formulate the trade-off between di-
versity and matching quality as a binary optimization problem. Besides, Wang and
Zhu [2009], inspired by the modern portfolio theory in financial markets, suggest an
algorithm that generalizes the probability ranking principle by considering both the
uncertainty of relevance predictions and correlations between retrieved documents.
Also, Said et al. [2012b] suggest an inverted nearest neighbor model and recommend
items disliked by the least similar users.
Following a different direction, McSherry [2002] investigates the conditions in
which similarity can be increased without loss of diversity and presents an approach
to retrieval that is designed to deliver such similarity-preserving increases in diversity.
In addition, Zhang et al. [2012] propose a collection of algorithms to simultaneously
increase novelty, diversity, and serendipity, at a slight cost to accuracy, and Zhou
et al. [2010] suggest a hybrid algorithm that, without relying on any semantic or
context-specific information, simultaneously gains in both accuracy and diversity of
recommendations.
In other streams of research, Panniello et al. [2009] compare several contextual pre-
filtering, post-filtering, and contextual modeling methods in terms of accuracy and
diversity of their recommendations to determine which methods outperform others
and under which circumstances. Considering how to measure diversity, Castells et al.
[2011] and Vargas and Castells [2011] aim to cover and generalize the metrics reported
in the RS literature [Zhang and Hurley, 2008; Zhou et al., 2010; Ziegler et al., 2005],
and derive new ones. They suggest novelty and diversity metric schemes that take into
consideration item position and relevance through a probabilistic recommendation-
browsing model.
Besides, other researchers studied the importance of personalization and users’
perception in diversity. In particular, Hu and Pu [2011] investigate design issues that
can enhance users’ perception of recommendation diversity and improve users’ satis-
faction, and Ge et al. [2012] show that the perceived diversity of a recommendation
list depends on the placement of diverse items. Further, Vargas et al. [2012] sug-
gest that the combination of personalization and diversification achieves competitive
performance, improving upon the baseline, plain personalization, and plain diversification
approaches in terms of both diversity and accuracy measures, Shi et al. [2012] ar-
gue that the diversification level in a recommendation list should be adapted to the
target users’ individual situations and needs, and propose a framework to adaptively
diversify recommendation results for individual users based on latent factor models,
while Chen et al. [2013] explore the impact of personality values on users’ needs for
recommendation diversity.
Lastly, examining similar but yet different concepts of diversity, Adomavicius and
Kwon [2009, 2012] propose the concept of aggregated diversity as the ability of a
system to recommend across all users as many different items as possible while keeping
accuracy loss to a minimum, by a controlled promotion of less popular items towards
the top of the recommendation lists. Also, Lathia et al. [2010] consider the concept
of temporal diversity, the diversity in the sequence of recommendation lists produced
over time. Finally, Jannach et al. [2013] compare different recommender systems
algorithms with respect to aggregate diversity and their tendency of focusing on
certain parts of the product spectrum and maintain that some popular algorithms may
lead to an undesired popularity boost of already popular items, whereas Bellogín et al.
[2012] present a comparative study on the influence that different types of information
available in social systems have on item recommendation, aiming to identify which
sources of user interest evidence are more effective to achieve useful recommendations.
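To make the diversity notions discussed in this section concrete, the following is a minimal Python sketch (with illustrative function names and toy data that are not drawn from any of the cited implementations) that computes aggregate diversity as the number of distinct items recommended across all users, together with a simple average pairwise intra-list diversity for a single recommendation list:

from itertools import combinations

def aggregate_diversity(rec_lists):
    # Number of distinct items recommended across all users
    # (the "aggregate diversity" notion discussed above).
    return len({item for items in rec_lists.values() for item in items})

def intra_list_diversity(items, distance):
    # Average pairwise distance among the items of one recommendation list
    # (a common individual, intra-list diversity measure).
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(distance(i, j) for i, j in pairs) / len(pairs)

# Illustrative usage with made-up recommendation lists and a toy 0/1 distance.
recs = {"u1": ["a", "b", "c"], "u2": ["a", "d", "e"]}
print(aggregate_diversity(recs))                                     # 5 distinct items
print(intra_list_diversity(recs["u1"], lambda i, j: float(i != j)))  # 1.0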
2.6 Unexpectedness of Recommendations
Pertaining to unexpectedness, in the field of knowledge discovery, Silberschatz and
Tuzhilin [1996]; Berger and Tuzhilin [1998]; Padmanabhan and Tuzhilin [1998, 2000,
2006] propose a characterization relative to the system of prior domain beliefs and
develop efficient algorithms for the discovery of unexpected patterns, which com-
bine the independent concepts of unexpectedness and minimality of patterns. Also,
Kontonasios et al. [2012] survey different methods for assessing the unexpectedness
of patterns focusing on frequent item sets, tiles, association rules, and classification
rules.
In the field of recommender systems, Murakami et al. [2008] and Ge et al. [2010]
suggest both a definition of unexpectedness as the difference in predictions between two algorithms (i.e., the deviation of a recommender system from the results obtained from a primitive prediction model that shows high ratability) and corresponding metrics
for evaluating this system-centric notion of unexpectedness. Besides, Akiyama et al.
[2010] propose unexpectedness as a general metric that does not depend on a user’s
record and only involves an unlikely combination of item features.
In the next paragraphs, we present the related work assessing the value of recom-
mendations for businesses as well as an overview of the main IS theories regarding
the impact of RSes on consumer decision-making processes and outcomes.
2.7 Business Value of Recommendations
Prior work has also examined the business value of recommender systems and their
effect on demand levels. Studying the effects of recommender systems on aggregate
demand and markets, Fleder and Hosanagar [2009] show analytically that RSes can
lead to a reduction in aggregate sales diversity, creating a rich-get-richer effect for
popular products and preventing what may otherwise be better consumer-product
matches. However, Brynjolfsson et al. [2011] provide empirical evidence that RSes are
associated with an increase in niche products, reflecting lower search costs in addition
to the increased product availability and corroborating the findings of [Pathak et al.,
2010] regarding the heterogenizing effects of RSes.
Focusing on the impact of recommender systems on demand levels for individual
products, Tintarev et al. [2010] conduct a user study with 21 subjects and find that
RSes can increase the demand levels, especially for long tail items. Oestreicher-Singer
and Sundararajan [2012b] study how the explicit visibility of related-product networks
can influence the demand for products in such networks and find that complementary
products have significant influence on each other’s demand. They also find that
newer and more popular products benefit more from the attention they garner from
their network position in such related-product networks. In contrast, Chen et al.
[2004] find that such network-based recommendations in a desktop setting are more
effective for less-popular books. Similarly, Pathak et al. [2010], examining a desktop
recommender for item-to-item networks of books and focusing on 156 top-selling books
on Amazon.com, find that the impact of the strength of recommendations on sales
rank is moderated by the recency effect.
Our work is also related to the extant literature estimating the business value of
multi-dimensional data sets in context-based recommender systems. Adamopoulos
and Tuzhilin [2014a] propose a method for estimating the expected economic value
of multi-dimensional data sets in RSes and illustrate the proposed approach using
a unique data set combining implicit and explicit ratings with rich content, spatio-
temporal contextual dimensions, and social network profiles [Adamopoulos, 2014a].
This approach can lead to better and more profitable managerial decisions as well as
more useful evaluation metrics.
2.8 Effects of Recommender Systems Usage and Characteristics on Consumer Decision-Making and Choices
Finally, the impact of common RSes on consumers’ decision-making and choices
has been demonstrated in various empirical settings and has found theoretical sup-
port through the lenses of various IS theories. In particular, one set of theories that
provide support for such findings is that of human information processing. For in-
stance, Shafir et al. [1993b] maintain that people evaluate alternatives by comparing
them separately on distinct dimensions and that relationships among alternatives
may be perceived to be more compelling reasons or arguments for choice than de-
riving overall values for each alternative and choosing the alternative with the best
value. Because such differences (e.g., whether an alternative is recommended, ranking
in recommendation list, etc.) can be perceived with little effort, relationships among
alternatives may be used to make choices even in situations with simple alternatives
and even if they do not provide good justifications [Bettman et al., 1998]. Another
explanation is that the recommended alternatives maximize the ease of justifying
consumers’ decisions. This explanation is becoming even more important in the case
of collective decisions (e.g., restaurant selection) and group recommenders; Shafir
et al. [1993a] have demonstrated that decision makers often construct reasons in or-
der to justify a decision to themselves (i.e., increase their confidence in the decision)
and/or justify (explain) their decision to others. Additionally, another explanation is
that the recommended alternatives are simply becoming more salient and hence are
more frequently selected by the consumers either deliberately (e.g., lower search costs
[Wilde, 1980]) or inadvertently since they capture their attention. Finally, a utili-
tarian explanation of the positive and significant effect of recommendations on the
demand levels for the recommended products is that consumers might perceive the
recommendations as endorsements of the candidate items from the RS or as another
reputation dimension (in addition to the item rating, consumer reviews, etc.). Apart
from the aforementioned theoretical perspectives, there is also significant empirical
evidence that RSes have indeed great impact on users’ choices and that this influence
is greater than the influence of peers and experts [Senecal and Nantel, 2004].
Focusing now on the potential differences in effects between the proposed type of
recommendations and the standard recommendation types and, especially, the effec-
tiveness of unexpected and non-obvious recommendations, based on the perspective of
the IS success model [DeLone and McLean, 1992] and, in particular, the components
of usefulness and uniqueness of information, which affect user satisfaction through
the construct of information quality, the proposed type of recommendations can sig-
nificantly increase the effectiveness of RSes as well as user evaluation of RSes and
consumer decision-making. In addition, the expectation-disconfirmation paradigm
[McKinney et al., 2002; Olson and Dover, 1979] postulates that when positive discon-
firmation (i.e., when a customer’s evaluations of system or product performance are
different from his or her pre-trial expectations about the product or system) occurs, it results in enhanced user satisfaction. However, there is no empirical study that has
directly investigated the effects of expectation (dis)confirmation on satisfaction with
RSes. On the contrary, the theory of interpersonal similarity [Byrne and Griffitt, 1969;
Levin et al., 2002; Lichtenthal and Tellefsen, 2001; Zucker, 1986] postulates that the
greater the degree of similarity between two parties, the greater the attraction will
be, resulting in increased user satisfaction. Hence, apart from designing methods that
enhance the unexpectedness and non-obviousness of recommendations, we also pro-
pose to examine empirically whether unexpected and non-obvious recommendations
can enhance the effectiveness and user evaluation of recommender systems.
CHAPTER III
Neighborhood Selection in Collaborative Filtering
Systems
“I don’t need a friend who changes when I change and who nods when I
nod; my shadow does that much better.”
- Plutarch, 46 - 120 AD
3.1 Introduction to Neighborhood Selection
Although a wide variety of different types of recommender systems (RSes) has been
developed and used across several domains over the last 20 years [Adomavicius and
Tuzhilin, 2005], the classical user-based k-NN collaborative filtering (CF) method still
remains one of the most popular and prominent methods used in the recommender
systems community [Jannach et al., 2010].
Neighborhood-based collaborative filtering recommendation methods predict any
unknown ratings using the existing ratings given by/to the most similar users/items,
called nearest neighbors. It is often assumed that selecting the k most similar neigh-
bors results in the best performance the standard collaborative filtering approach
can produce. However, some investigations show that it is possible to select other
neighbors than the most similar that outperform the standard collaborative filtering
approach [Adamopoulos and Tuzhilin, 2013b]. This is a reflection of the fact that
some of the most similar neighbors have a detrimental effect on the accuracy of the
predictions and should actually be replaced. As a matter of fact, the complementarity of the neighbors that are selected is a key factor in the predictive accuracy of this approach. Hence, the problem of selecting an optimal neighborhood can be shown to be NP-hard, and its exact solution by exhaustive exploration is infeasible for typical applications that involve
a large number of users and items.
This observation and the employed theoretical framework can help answer a number of interesting research questions, such as:
• What is the optimal size k for a neighborhood?
• How can this size k be dynamically estimated for each user and item?
• Who are the optimal neighbors?
• What are the optimal weights for the selected neighbors?
At the same time, apart from enhancing the predictive performance of neighborhood-
based methods, by selecting users other than the most similar ones to the target user,
we can alleviate the important problems of over-specialization and concentration bi-
ases and hence enhance the usefulness of collaborative filtering RSes. In particular,
the proposed approaches have the potential to provide personalized recommendations
from a wide range of items in order to escape the obvious and expected recommenda-
tions, while avoiding predictive accuracy loss. The key intuition for this is three-fold.
First, using the neighborhood with the most similar users to estimate unknown rat-
ings and recommend candidate items, the generated recommendation lists usually
consist of known items with which the users are already familiar. Second, because
of the multidimensionality of user preferences, there are many items that the target
user may like and are unknown to her k most similar users. Third, selecting very
similar neighbors might have a detrimental effect on the performance of the model
since such neighbors tend to capture the same predictive signals and information.
In this chapter, we present certain variations of the classical k-NN method. We
propose a method for optimized neighborhood selection (k-BN; k Better Neighbors)
in collaborative filtering recommender systems that addresses the problem of identify-
ing neighborhoods closer to the optimal ones. In addition, we present a probabilistic
neighborhood selection approach (k-PN; k Probabilistic Neighbors) in which the es-
timation of an unknown rating of the user for an item is based not on the weighted
averages of the k most similar (nearest) neighbors but on k probabilistically selected
neighbors.
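As a rough illustration of this idea (the specific sampling distributions used by k-PN are presented later in Section 3.3.3 and Table 3.1), a minimal Python sketch could sample the k candidate neighbors without replacement, with selection probabilities increasing in their similarity to the target user; the exponential weighting and the parameter names below are illustrative assumptions rather than the actual k-PN specification:

import numpy as np

def sample_neighborhood(similarities, k, temperature=1.0, rng=None):
    # Sample k candidate neighbors (as indices) without replacement,
    # with probabilities increasing in similarity to the target user.
    # The exponential (softmax-style) weighting is an illustrative choice,
    # not one of the specific distributions evaluated in this chapter.
    rng = rng or np.random.default_rng()
    sims = np.asarray(similarities, dtype=float)
    weights = np.exp(sims / temperature)
    probs = weights / weights.sum()
    return rng.choice(len(sims), size=min(k, len(sims)), replace=False, p=probs)

# Illustrative usage: six candidate neighbors, three of them sampled probabilistically;
# unlike k-NN, the selected set varies across calls and can include less similar users.
print(sample_neighborhood([0.9, 0.7, 0.2, 0.1, 0.05, -0.3], k=3))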
To evaluate the proposed probabilistic approach, we conduct an empirical study showing that, by selecting diverse representative neighborhoods, the proposed methods generate recommendations that are very different from the classical CF ap-
proaches and alleviate the over-specialization and concentration problems while out-
performing k-NN, k-FN [Said et al., 2012b, 2013], and matrix factorization methods.
We also demonstrate that the proposed methods outperform, by a wide margin in
most cases, both the standard k-nearest neighbors and the k-furthest neighbors ap-
proaches in terms of both item prediction accuracy and utility-based ranking. The
experimental results are also in accordance with the phenomenon of “hubness” and
the ensemble learning theory that we employ in the neighborhood-based CF frame-
work. Besides, we show that the performance improvement is not achieved at the
expense of other popular performance measures, such as catalog coverage, aggregate
diversity, and diversity reinforcement.
In summary, the main contributions of the studies presented in this chapter are:
• We formulated the classical neighborhood-based collaborative filtering method
as an ensemble method, thus, allowing us to show the potential suboptimality
of the k-NN approach in terms of predictive accuracy.
• We proposed a new optimized neighborhood-based method (k-BN; k Better
Neighbors) as an improvement of the standard k-NN approach.
• We proposed a new probabilistic neighborhood-based method (k-PN; k Proba-
bilistic Neighbors) as an improvement of the standard k-NN approach.
• We empirically showed that the proposed methods outperform, by a wide mar-
gin, the classical collaborative filtering algorithm and practically illustrated its
suboptimality in addition to providing a theoretical justification of this empir-
ical observation.
• We showed that the proposed methods alleviate the common problems of over-
specialization and concentration biases of recommendations in terms of various
popular metrics and a newly proposed metric that measures the diversity rein-
forcement of recommendations.
• We identified a particular implementation of the k-PN method that performs
consistently well across various experimental settings.
• We illustrated that, in most cases, the k-BN method outperforms the k-PN
method.
3.2 Related Approaches
Different ways of selecting a number of candidate neighbors and forming the
neighborhood Ni(u) have been proposed in the literature. Many approaches cluster
the set of users (or items) in order to improve the scalability and accuracy of recom-
mender systems [O’Connor and Herlocker, 1999; Sarwar et al., 2002; Xue et al., 2005].
For instance, Bellogín and Parapar [2012] use a spectral clustering technique, Normal-
ized Cut, in order to derive a cluster-based collaborative filtering algorithm and frame
this technique as a method for neighbor selection in user-based collaborative filtering
recommender systems. This method clusters the users in the collection by finding
the optimal cut of the computed graph, where Pearson similarity is used to weight
the edges between items. Then, it forms the neighborhood from those users
who belong to the same cluster as the target user. In some cases, additional external
information can also be used either for the clustering of users/items or directly for
the neighborhood selection. For instance, using the concept of trust, many approaches
select only the most trustworthy users. This concept of trust, apart from external
information, can also be based on some trust metrics [O’Donovan and Smyth, 2005].
Similarly, Bellogín et al. [2013] propose to select neighbors according to the overlap of
their preferences with those of the target user. In particular, the authors propose an
overlap-based filtering in which the users who have more preferred items in common
with the target user are selected as neighbors and they investigate the consideration
of the above principle as the single criterion for neighbor selection, after which the
overlap is no longer taken into account (neither in the user similarity function, nor
in any posterior user weighting). Finally, building on this idea, Bellogín et al. [2013]
use relevance-based language models from information retrieval in order to identify
neighbors in recommendations. Such Relevance Models (RM) are formulated in text
IR on a triadic space (query, documents, words), whereas the CF space is typically
dyadic (users and items). In essence, this method tries to capture how relevant each
candidate neighbor would be to the target user. In contrast to the majority of the
related approaches, the algorithms proposed in this dissertation are based on robust
theoretical motivation and hence specific variations can be viewed as optimization
problems.
3.3 Methods for Neighborhood Selection
Collaborative filtering (CF) methods produce user specific recommendations of
items based on patterns of ratings or usage (e.g., purchases) without the need for
exogenous information about either items or users [Ricci and Shapira, 2011]. Hence,
in order to estimate unknown ratings and recommend items to users, CF systems
need to relate two fundamentally different entities: items and users.
3.3.1 Neighborhood Models
User-based neighborhood recommendation methods predict the rating ru,i of user
u for item i using the ratings given to i by users most similar to u, called nearest
neighbors and denoted by Ni(u). Taking into account the fact that the neighbors can
have different levels of similarity, wu,v, and considering the k users v with the highest
similarity to u (i.e., the standard user-based k-NN collaborative filtering approach),
the predicted rating is:
$$r_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N_i(u)} |w_{u,v}|}, \qquad (3.1)$$

where $\bar{r}_u$ is the average of the ratings given by user $u$.
ALGORITHM 1: k-NN Recommendation Algorithm
Input: User-Item Rating matrix R
Output: Recommendation lists of size l
k: Number of users in the neighborhood of user u, N_i(u)
l: Number of items recommended to user u

for each user u do
    for each item i do
        Find the k users in the neighborhood of user u, N_i(u);
        Combine ratings given to item i by neighbors N_i(u);
    end
    Recommend to user u the top-l items having the highest predicted rating r_{u,i};
end
However, the ratings given to item i by the nearest neighbors of user u can be
combined into a single estimation using various combining (or aggregating) functions
[Adomavicius and Tuzhilin, 2005]. Examples of combining functions include majority
voting, distance-moderated voting, weighted average, adjusted weighted average, and
percentiles [Adamopoulos and Tuzhilin, 2013c].
In the same way, the neighborhood used in estimating the unknown ratings and
recommending items can be formed in different ways. Instead of using the k users
with the highest similarity to the target user, any approach or procedure that selects
k of the candidate neighbors can be used, in principle.
Algorithm 1 summarizes the user-based k-nearest neighbors (k-NN) collaborative
filtering approach using a general combining function and neighborhood selection
approach.
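To make the prediction step concrete, the following minimal Python sketch implements Eq. (3.1) with a pluggable neighborhood-selection function. It assumes a dense NumPy rating matrix with NaN for unknown ratings and a precomputed user-user similarity matrix; the helper names are illustrative and not the implementation used in our experiments.

```python
import numpy as np

def predict_rating(R, S, u, i, k, select_neighbors):
    """Predict r_{u,i} as in Eq. (3.1).
    R: (users x items) rating matrix with np.nan for unknown ratings.
    S: (users x users) similarity matrix (e.g., Pearson correlation).
    select_neighbors: callable (similarities, candidates, k) -> k user indices;
        swapping this callable yields k-NN, k-FN, or the probabilistic k-PN variant."""
    rated = ~np.isnan(R[:, i])                               # users who rated item i
    candidates = np.flatnonzero(rated & (np.arange(R.shape[0]) != u))
    r_bar_u = np.nanmean(R[u])
    if candidates.size == 0:
        return r_bar_u                                       # fall back to the user's mean
    neighbors = select_neighbors(S[u], candidates, k)
    num = sum(S[u, v] * (R[v, i] - np.nanmean(R[v])) for v in neighbors)
    den = sum(abs(S[u, v]) for v in neighbors)
    return r_bar_u + num / den if den > 0 else r_bar_u

def k_nearest(similarities, candidates, k):
    """Standard k-NN selector: the k most similar candidate neighbors."""
    order = np.argsort(-similarities[candidates])
    return candidates[order[:k]]
```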
In this chapter, we propose a novel k-NN method (k-PN; k Probabilistic Neigh-
bors) using probabilistic neighborhood selection that also takes into consideration
similarity levels between the target user and the n candidate neighbors. We also
propose an optimized neighborhood selection method (k-BN; k Better Neighbors) in
collaborative filtering recommender systems.
3.3.2 Theoretical Motivation
In this section, we present the theoretical motivation for the proposed approaches
and the connections to the phenomenon of “hubness” as well as the ensemble learning
theory. In particular, we discuss a major implication of selecting just the most similar
candidates (or even all the candidates) as neighbors and we motivate how the proposed
method can alleviate the over-specialization and concentration problems without sig-
nificantly reducing, and even increasing, the predictive accuracy, demonstrating that
similar but diverse neighbors should be used in neighborhood-based methods.
It should be clear by now that selecting neighborhoods using underlying probabil-
ity distributions, instead of deterministically selecting just the k nearest neighbors,
can result in very different recommendations from those generated based on the stan-
dard neighborhood-based approaches. For the sake of brevity, we focus on the phe-
nomenon of “hubness” and the effect of selecting diverse neighbors on the predictive
accuracy of the proposed approach.
The phenomenon of “hubness” is related to a new aspect of the dimensionality
curse and affects the distribution of k-occurrences: the number of times a point oc-
curs among the k nearest neighbors of the other points in a data set, according to
some distance measure [Radovanovic et al., 2010]. This distribution becomes con-
siderably skewed as dimensionality increases, causing the emergence of hubs, that is,
points which appear in many more k-NN lists than other points, effectively making
them “popular” nearest neighbors. This is an inherent property that depends on
the intrinsic, rather than embedding, dimensionality of data and, thus, dimensional-
ity reduction techniques, such as matrix factorization, do not alleviate the problem
effectively. For the same reason “hubness” occurs even for small values of k and
for all cosine-like measures, such as Pearson correlation, cosine similarity, and ad-
justed cosine. Besides, “hubness” is unrelated to other data properties like sparsity
or skewness of the distribution of ratings [Nanopoulos et al., 2009]. Nevertheless, this
phenomenon is part of the problem of concentration bias of recommendations. In
particular, Seyerlehner et al. [2009] show that hubness reduces coverage and reach-
ability, especially of long-tail items, in both content-based and CF systems. Thus,
these problems can be alleviated by selecting neighbors other than the most similar
to the target.
Moreover, in order to further theoretically motivate the proposed approach, we
focus on ensemble learning theory. In particular, for the predictive tasks of a recom-
mender system, we should construct an estimator f(x; w) that approximates an un-
known target function g(x) given a set of N training samples zN = {z1, z2, . . . , zN} =
{(x1, y1), (x2, y2), . . . , (xN , yN)}, where xi ∈ Rd, y ∈ R, and w is a weight vector; zN
is a realization of a random sequence ZN = {Z1, . . . , ZN} whose i-th component con-
sists of a random vector Zi = (Xi, Yi) and, thus, each zi = (xi, yi) is an independent
and identically distributed (i.i.d) sample from an unknown joint distribution p(x, y).
Without loss of generality, we can assume that g(xi) ∈ R; the following derivations
can be easily generalized to situations where g(xi) and y ∈ Rd′ with d′ > 1. We also
assume that there is a functional relationship between the training pair zi = (xi, yi):
$y_i = g(x_i) + \varepsilon$, where $\varepsilon$ is additive noise with zero mean ($E\{\varepsilon\} = 0$) and finite variance ($Var\{\varepsilon\} = \sigma^2 < \infty$).
Since the estimate w depends on the given zN , we should write w(zN) to clarify
this dependency. Hence, we should also write f(x; w(zN)); however for simplicity
we will write f(x; zN) as in [Geman et al., 1992]. Then, introducing a new random
vector Z0 = (X0, Y0) ∈ Rd+1, which has a distribution identical to that of Zi, but
is independent of Zi for all i, the generalization error (GErr), defined as the mean
squared error averaged over all possible realizations of $Z^N$ and $Z_0$,

$$GErr(f) = E_{Z^N}\big\{ E_{Z_0}\big\{ [\,Y_0 - f(X_0; Z^N)\,]^2 \big\} \big\},$$

can be expressed by the following "bias/variance" decomposition [Geman et al., 1992]:

$$GErr(f) = E_{X_0}\big\{ Var\{f \mid X_0\} + Bias\{f \mid X_0\}^2 \big\} + \sigma^2.$$
However, using ensemble estimators, instead of a single estimator f , we have a
collection of them: f1, f2, . . . , fk, where each fi has its own parameter vector wi and
k is the total number of estimators. The output of the ensemble estimator for some
input x can be defined as the weighted average of outputs of k estimators for x:
$$f_{ens}^{(k)}(x) = \sum_{m=1}^{k} \alpha_m f_m, \qquad (3.2)$$

where, without loss of generality, $\alpha_m > 0$ and $\sum_{m=1}^{k} \alpha_m = 1$.
Following Ueda and Nakano [1996], the generalization error of this ensemble esti-
mator is:
$$GErr(f_{ens}^{(k)}) = E_{X_0}\big\{ Var\{f_{ens}^{(k)} \mid X_0\} + Bias\{f_{ens}^{(k)} \mid X_0\}^2 \big\} + \sigma^2,$$
which can also be expressed as:

$$GErr(f_{ens}^{(k)}) = E_{X_0}\Bigg\{ \Bigg[ \sum_{m=1}^{k} a_m^2\, E_{Z^N_{(m)}}\!\big[(f_m - E_{Z^N_{(m)}}(f_m))^2\big] + \sum_{m}\sum_{i \neq m} a_m a_i\, E_{Z^N_{(m)},\,Z^N_{(i)}}\!\big\{[f_m - E_{Z^N_{(m)}}(f_m)][f_i - E_{Z^N_{(i)}}(f_i)]\big\} \Bigg] + \Bigg[ \sum_{m=1}^{k} a_m\, E_{Z^N_{(m)}}(f_m - g) \Bigg]^2 \Bigg\} + \sigma^2,$$

where the term $E_{Z^N_{(m)},\,Z^N_{(i)}}\{[f_m - E_{Z^N_{(m)}}(f_m)][f_i - E_{Z^N_{(i)}}(f_i)]\}$ corresponds to the pairwise covariance of the estimators $m$ and $i$, $Cov\{f_m, f_i \mid X_0\}$.
The results can also be extended to the following equation:
$$GErr(f_{ens}^{(k)}) = E_{X_0}\Bigg\{ \sum_{m=1}^{k} \sum_{i=1}^{k} a_m a_i\, C_{mi} \Bigg\}, \qquad (3.3)$$

where $C_{ij}$ indicates the $i,j$ component of the symmetric correlation matrix $C$. The $i,j$ component of matrix $C$ is given by:

$$C_{ij} = \begin{cases} Var\{f_i \mid X_0\} + Bias\{f_i \mid X_0\}^2, & \text{if } i = j \\ Cov\{f_i, f_j \mid X_0\} + Bias\{f_i \mid X_0\}\,Bias\{f_j \mid X_0\}, & \text{otherwise.} \end{cases}$$
Hence, in addition to the bias and variance of the individual estimators (and the
noise variance), the generalization error of an ensemble also depends on the covariance
between the individuals; an ensemble is controlled by a three-way trade-off. Thus, if
fi and fj are positively correlated, then the correlation increases the generalization
error, whereas if they are negatively correlated, then the correlation contributes to a
decrease in the generalization error.
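To illustrate the role of the covariance terms, the brief sketch below evaluates Eq. (3.3) for a hypothetical pair of estimators (the numerical values are illustrative only): with equal individual errors, positive covariance inflates the ensemble error while negative covariance reduces it.

```python
import numpy as np

def ensemble_generalization_error(C, a):
    """Evaluate Eq. (3.3): GErr = sum_m sum_i a_m a_i C_{mi}."""
    a = np.asarray(a, dtype=float)
    return float(a @ C @ a)

a = np.array([0.5, 0.5])                       # equal combination weights
C_pos = np.array([[1.0, 0.8], [0.8, 1.0]])     # positively correlated estimators
C_neg = np.array([[1.0, -0.8], [-0.8, 1.0]])   # negatively correlated estimators

print(ensemble_generalization_error(C_pos, a))  # ~0.9: positive correlation inflates the error
print(ensemble_generalization_error(C_neg, a))  # ~0.1: negative correlation reduces it
```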
In the context of neighborhood-based collaborative filtering methods in recom-
mender systems, we can think of the ith (most similar to the target user) neighbor
as corresponding to a single estimator fi that simply predicts the rating of this spe-
cific neighbor. Thus, reducing the aggregated pairwise covariance of the neighbors
(estimators) can decrease the generalization error of the model ; at the same time, it
may increase the bias or variance of the estimators and the generalization error as
well. Hence, one way to reduce the covariance is not to restrict the k estimators only
to the k nearest (most similar) neighbors but to use also other candidate neighbors
(estimators).1,2
In the next paragraphs, we first present a probabilistic neighborhood selection
method and then an optimized technique that have the potential to alleviate the
over-specialization and concentration problems of RSes by selecting diverse neighbors
(i.e., neighbors with lower covariance levels).
3.3.3 Probabilistic Neighborhood Selection
In this section, we present a novel k-NN CF method (k-PN; k Probabilistic Neigh-
bors) using a probabilistic neighborhood selection technique that, instead of the most
similar neighbors, carefully selects a set of diverse neighbors in order to alleviate the
over-specialization and concentration problems. The proposed approach uses a gen-
eral algorithm for efficient sampling [Wong and Easton, 1980] that can also take into
consideration similarity levels between the target user and the $n$ candidate neighbors.

1 Let $r_{u,i}$ and $r_{u,j}$ be the correlations of target user $u$ with candidate neighbors $i$ and $j$, respectively; then the correlation $r_{i,j}$ of neighbors $i$ and $j$ is bounded by the following expression: $r_{u,i} r_{u,j} - \sqrt{1 - r_{u,i}^2}\,\sqrt{1 - r_{u,j}^2} \le r_{i,j} \le r_{u,i} r_{u,j} + \sqrt{1 - r_{u,i}^2}\,\sqrt{1 - r_{u,j}^2}$.
2 For a formal argument why the proposed probabilistic approach can result in very different recommendations from those generated based on the standard k-NN approach and how the item predictive accuracy can be affected, a 0/1 loss can be used in the context of classification ensemble learning with the (highly) rated items corresponding to the positive class. For a rigorous derivation of the generalization error in ensemble learning using the bias-variance-covariance decomposition and a 0/1 loss function see [Roli and Fumera, 2002; Tumer and Ghosh, 1996].
For the probabilistic neighborhood selection phase of the proposed algorithm, we
allow the neighbors to represent the whole spectrum of candidates, while focusing on
specific areas of this spectrum. Selecting such diverse neighborhoods, the proposed
method aims at alleviating the problems of over-specialization and concentration
biases (see Section 3.3.2).
In a nutshell, for the neighborhood selection phase of the k-PN approach, an initial
weight is assigned to each candidate neighbor and then the candidates are sampled,
without replacement, proportionally to their assigned weights. These initial weights
can be derived based on popular distance metrics (e.g., Cosine similarity, Pearson
correlation, etc.), probability distributions, or other strategies and techniques. For
instance, in order to use certain probability distributions aiming at specific areas
of the spectrum of candidates, the initial weight wi for each candidate i can be
generated using some function of its distance from the target u or its ranking (based
on the distance metric) and the corresponding probability density function (e.g.,
wi = P (ranki) or wi = P (sim (u, i)); for a complete example see Section 3.4.2). Based
on the selection of the initial weights, the algorithm will select different neighborhoods
and, thus, generate different recommendations. We should note here that including
all the candidates in a neighborhood does not alleviate the problems under study, as
discussed in Section 3.3.2.
For implementing the proposed approach, we suggest an efficient method (based on
[Fagin and Price, 1978; Wong and Easton, 1980]) for weighted sampling of k neighbors
without replacement that takes into consideration similarity levels between the target
user and the population of n candidate neighbors. In particular, the set of candidate
neighbors at any time is described by values $\{w'_1, w'_2, \ldots, w'_n\}$. In general, if user $i$ is still a candidate for selection, then $w'_i = w_i$ (where $w_i$ is generated as previously described), whereas $w'_i = 0$ if the user has already been selected in the neighborhood and, hence, removed from the set of candidates. Denote the sum of the weights of the first $j$ candidates by $S_j = \sum_{i=1}^{j} w'_i$, where $j = 1, \ldots, n$, and let $Q = S_n$ be the sum of the weights $\{w'_i\}$ of all the candidates. In order to draw a neighbor, choose $x$ with uniform probability from $[0, Q]$ and find $l$ such that $S_{l-1} \le x \le S_l$. Then, add $l$ to the neighborhood and remove it from the set of candidates by setting $w'_l = 0$.
After a candidate has been selected into the neighborhood, this neighbor is no longer
available for later selection.
This method can be easily implemented using a binary search tree having all
n candidate neighbors as leaves with values {w1, w2, . . . , wn}, whereas the value of
each internal node of the tree is the sum of the values of the corresponding immedi-
ate descendant nodes. This sampling method requires O(n) initialization operations,
O(k log n) additions and comparisons, and O(k) divisions and random number gen-
erations [Wong and Easton, 1980]. The suggested method can be used with any
distance metric and valid probability distribution including the empirical distribu-
tion of users’ similarity (see Section 3.4.2). Algorithm 2 summarizes the method for
efficient weighted sampling without replacement [Wong and Easton, 1980].
Note that the same approach can also be used for item-based neighborhood meth-
ods by simply sampling diverse neighborhoods of items, instead of users.
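A minimal Python sketch of this weighted sampling without replacement is shown below; for clarity it uses a simple cumulative-sum scan over the remaining weights (O(n) per draw) instead of the binary-tree structure of Algorithm 2, but the selection rule is the same.

```python
import numpy as np

def sample_neighborhood(weights, k, rng=None):
    """Draw k candidate indices without replacement, with probability
    proportional to their (non-negative) remaining weights."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float).copy()
    neighborhood = []
    for _ in range(min(k, int(np.count_nonzero(w)))):
        Q = w.sum()                                   # total remaining weight
        x = rng.uniform(0.0, Q)                       # uniform draw from [0, Q)
        cumulative = np.cumsum(w)                     # S_1, ..., S_n
        l = min(int(np.searchsorted(cumulative, x)), w.size - 1)
        neighborhood.append(l)
        w[l] = 0.0                                    # selected: remove from candidates
    return neighborhood

# Example: five candidates whose weights come from some density over ranks.
print(sample_neighborhood([0.4, 0.3, 0.15, 0.1, 0.05], k=3))
```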
3.3.4 Optimized Neighborhood Selection
In this section, we propose a novel k-NN CF method (k-BN; k Better Neigh-
bors) using a greedy neighborhood selection technique that, instead of the most sim-
ilar neighbors, carefully selects a set of diverse neighbors in order to alleviate the
over-specialization and concentration problems while improving also the predictive
accuracy of the generated recommendations.
ALGORITHM 2: Weighted Sampling (Without Replacement) Algorithm
Input: Initial weights {w_1, ..., w_n} of candidates for neighborhood N_i(u)
Output: Neighborhood of user u, N_i(u)
k: Number of users in the neighborhood of user u, N_i(u)
L(v): The left descendant of node v
R(v): The right descendant of node v
G_v: The sum of weights of the leaves in the left subtree of node v
Q: The sum of weights of the nodes in the binary tree

Build a binary search tree with n leaves labeled 1, 2, ..., n;
Assign to the leaves the corresponding values w_1, w_2, ..., w_n;
Associate values G_v with the internal nodes;
Set Q = sum_{v=1}^{n} w_v;
Set N_i(u) = ∅;
for j ← 1 to k do
    Set C = 0; Set v = the root node; Set D = ∅;
    Select x uniformly from [0, Q];
    repeat
        if x ≤ G_v + C then
            Set D = D ∪ {v}; Move to node/leaf L(v);
        else
            Set C = C + G_v; Move to node/leaf R(v);
        end
    until a leaf is reached;
    Set N_i(u) = N_i(u) ∪ {v};
    for each node d ∈ D do
        Set G_d = G_d − w_v;
    end
    Set Q = Q − w_v; Set w_v = 0;
end
3.3.4.1 Ordered Selection
For the simpler case, using equal weights for all the neighbors, the predicted rating
can be estimated as the average of the individual ratings of the k users (neighbors)
in the neighborhood:
$$r_i = \frac{1}{k} \sum_{u=1}^{k} r_{u,i}$$

and the generalization error can be expressed as:

$$GErr(k) = \frac{1}{k^2} \sum_{m=1}^{k} \sum_{n=1}^{k} C_{mn},$$

where the diagonal elements $C_{mm}$ are the average squared errors of the $m$-th ensemble members (neighbors) and the off-diagonal elements $C_{mn}$ correspond to the products of the biases of the $m$-th and $n$-th neighbors plus their covariance.
We propose a polynomial time greedy algorithm that constructs at each step the
best local solution. The algorithm starts with an empty neighborhood (or alternatively a seed of neighbors) and then selects at each iteration the optimal neighbor from all the remaining candidates (i.e., the neighbor that minimizes the training error). This forward (ordered) selection process is halted when k neighbors have been selected (or alternatively when the expected generalization error starts increasing) and the
corresponding neighborhood is returned. This early stopping criterion allows the
selection of a neighborhood that avoids overfitting and can improve the generaliza-
tion performance of the proposed method. The process is similar to the one used in
[Hernandez-Lobato et al., 2011] to prune regression bagging ensembles.
Algorithm 3 summarizes the proposed method for selecting the optimal neighborhood in polynomial time using a greedy approach.
ALGORITHM 3: Optimized (Ordered) Neighborhood Selection Algorithm
Input: User-Item Rating matrix R
Output: Neighborhood N of size k
U: The set of users
R: The set of known ratings
N: The indexes of the users selected in the neighborhood
k: Number of users in the neighborhood N

for m ← 1 to |U| do
    for n ← m to |U| do
        C_{mn} ← (1/|R|) · Σ_{r_{u,i} ∈ R} [(r_{m,i} − r_{u,i})(r_{n,i} − r_{u,i})];
        C_{nm} ← C_{mn};
    end
end
N ← ∅;
for j ← 1 to k do
    minimum ← +∞;
    for u ∈ {1, ..., |U|} \ {N_1, ..., N_{j−1}} do
        value ← (1/j²) · (Σ_{m=1}^{j−1} Σ_{n=1}^{j−1} C_{N_m N_n} + 2 Σ_{m=1}^{j−1} C_{N_m u} + C_{uu});
        if value < minimum then
            N_j ← u; minimum ← value;
        end
    end
end
return N
The selected neighborhood tries to minimize the error

$$GErr(k) = \frac{1}{k^2} \sum_{m=1}^{k} \sum_{n=1}^{k} C_{N_m N_n}$$

by minimizing the training error of the $k$ ensemble members; hence, $C_{mn}$ is calculated as an average over the training set instead of the expectation over the population.3

3 We avoid overfitting by selecting $k$ neighbors and not using all the candidates.
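A simplified Python rendering of this greedy (ordered) selection, assuming the pairwise error matrix C has already been estimated on the training set, is given below; unlike Algorithm 3, it recomputes the neighborhood error from scratch at every step, which is less efficient but easier to follow.

```python
import numpy as np

def greedy_neighborhood(C, k):
    """Greedy (ordered) forward selection: at each step add the candidate
    that minimizes GErr(j) = (1/j^2) * sum of C[m, n] over the current neighborhood."""
    n_candidates = C.shape[0]
    selected = []
    for j in range(1, k + 1):
        best_u, best_val = None, np.inf
        for u in range(n_candidates):
            if u in selected:
                continue
            idx = selected + [u]
            val = C[np.ix_(idx, idx)].sum() / (j * j)   # training-error estimate
            if val < best_val:
                best_u, best_val = u, val
        selected.append(best_u)
    return selected
```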
Furthermore, instead of using equal weights for all the neighbors as in the previous paragraph, we can estimate the optimal weight for each neighbor in each neighborhood. Using the method of Lagrange multipliers to solve for the optimal weights ($a_m^*$) in Eq. (3.3), the necessary conditions are:

$$\frac{\partial \left[ \sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j C_{ij} - \lambda \left( \sum_{i} \alpha_i - 1 \right) \right]}{\partial \alpha_m} = 0 \quad \forall\, m = 1, \ldots, k$$

and

$$\sum_{i} \alpha_i - 1 = 0.$$
Even though, based on the above conditions and Cramer's rule, the generalization error can be directly minimized using

$$a_i^* = \frac{\sum_{j} C_{ij}^{-1}}{\sum_{l}\sum_{j} C_{lj}^{-1}}, \qquad (3.4)$$

this method depends on a reliable estimate of $C$ and on $C$ being non-singular so that it can be easily inverted [Perrone and Cooper, 1993]. In practice, though, errors are often highly correlated; thus, the rows of $C$ are nearly linearly dependent, so that $C$ is a singular or ill-conditioned matrix and inverting it leads to significant round-off errors [Jimenez, 1998; Zhou et al., 2002].
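Assuming a reasonably well-conditioned estimate of C, Eq. (3.4) can be evaluated directly; the sketch below is illustrative and adds an optional ridge term and a pseudo-inverse fallback (both are assumptions of the sketch, not part of the derivation) to cope with the ill-conditioning discussed above.

```python
import numpy as np

def optimal_weights(C, ridge=0.0):
    """Combination weights of Eq. (3.4):
    a_i* = sum_j (C^-1)_ij / sum_l sum_j (C^-1)_lj."""
    C_reg = C + ridge * np.eye(C.shape[0])   # optional ridge term (illustrative)
    try:
        C_inv = np.linalg.inv(C_reg)
    except np.linalg.LinAlgError:
        C_inv = np.linalg.pinv(C_reg)        # pseudo-inverse fallback if singular
    row_sums = C_inv.sum(axis=1)
    return row_sums / row_sums.sum()         # weights sum to one
```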
However, all the information needed for solving the optimization problem is the matrix $C$, which can be estimated over the training set and is expected to be similar to the matrix that would be obtained if it were estimated over $p(x, y)$, the true distribution of the data. Nevertheless, despite
selecting only k neighbors and not using all the candidates, this assumption and the
corresponding design choice of estimating C over only the training data might lead to
overfitting to the training data and hence the empirical evaluation of the proposed
approach is needed.
Furthermore, because of the sparsity problem in the rating matrix of collaborative
filtering recommender systems, higher performance might be achieved if a baseline
predictor for the score of the ensemble member (rating of the neighbor) is used in
the estimation of the covariance of the ensemble members and matrix C in the cases
where the true rating of the neighbor is unknown. In this work, we propose the use
of a simple baseline predictor that provides a first-order approximation of the bias in
our rating matrix. For instance, the following baseline predictor can be used:
$$r_{u,i} = \mu + b_u + b_i,$$

where $\mu$ is the global average rating in our data set, $b_i$ is the average rating bias of the corresponding item (i.e., $b_i = \frac{\sum_{u \in R(i)} (r_{u,i} - \mu)}{|R(i)|}$), and $b_u$ is the average rating bias of the corresponding user (i.e., $b_u = \frac{\sum_{i \in R(u)} (r_{u,i} - \mu - b_i)}{|R(u)|}$).
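A minimal sketch of this first-order baseline predictor, assuming a dense rating matrix with NaN for unknown ratings (an illustrative helper, not necessarily the exact implementation used in the experiments):

```python
import numpy as np

def baseline_predictor(R):
    """Return the matrix of baseline estimates mu + b_u + b_i for all
    (user, item) pairs; R holds ratings with np.nan for unknown entries."""
    mu = np.nanmean(R)                                      # global average rating
    b_i = np.nan_to_num(np.nanmean(R - mu, axis=0))         # item rating biases
    b_u = np.nan_to_num(np.nanmean(R - mu - b_i, axis=1))   # user rating biases
    return mu + b_u[:, np.newaxis] + b_i[np.newaxis, :]
```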
In summary, in this study we introduce various algorithms to address the problem
of identifying the optimal subset of neighbors in collaborative filtering recommender
systems. We first present a novel method for recommending items based on prob-
abilistic neighborhood selection (k-PN) in collaborative filtering systems. Then, we
propose an ordered neighborhood selection approach where neighborhoods of in-
creasing size are constructed by incorporating at each iteration the neighbor that
reduces the predictive error the most (k-BN). All the algorithms designed in this
study reach an approximate solution in polynomial time. However, a difficulty, in-
herent to all machine learning algorithms, is that the optimal neighborhood selected
on the basis of the training set may have a suboptimal generalization performance
[Martínez-Muñoz and Suárez, 2006]. Nevertheless, despite the fact that the found
solution might not be globally optimal, it can generally be a near-optimal local min-
imum. Hence, the empirical evaluation of the proposed approaches is needed.
3.4 Experimental Settings of Probabilistic Neighborhood Se-
lection
To empirically validate the k-PN method presented in Section 3.3.3 and evaluate the generated recommendations, we conduct a large number of experiments on "real-world" data sets and compare our results to different baselines. For an apples-to-apples comparison, the selected baselines include the user-based k-NN CF approach, which we aim to improve in this study. Compared to other popular algorithms,
user-based k-NN generates recommendations that suffer less from over-specialization
and concentration biases [Desrosiers and Karypis, 2011; Jannach et al., 2013] and has
also been found to perform well in terms of other performance measures [Burke, 2002;
Cremonesi et al., 2011; Adamopoulos and Tuzhilin, 2011, 2013a]. Nevertheless, the
proposed approach can be applied to any neighborhood-based method and it is not
specific to the user-based approach, which has been selected for increased compatibil-
ity as well as interpretability of the results. Additionally, we also compare our results
against furthest neighbors models (k-FN) [Said et al., 2013, 2012b]. Finally, we also
compare our experimental results against matrix factorization (MF) [Gantner et al.,
2011].
3.4.1 Data Sets
The data sets that we used are the MovieLens [GroupLens, 2011] and MovieTweet-
ings [Dooms et al., 2013] as well as a snapshot from Amazon [McAuley and Leskovec,
2013]. The RecSys HetRec 2011 MovieLens (ML) data set [GroupLens, 2011] contains
855,598 ratings (on a 1-5 scale) from 2,113 users on 10,197 movies. Moreover, the
MovieTweetings (MT) data set is described in [Dooms et al., 2013] and consists of
ratings included in well-structured tweets on Twitter. Owing to the extreme sparsity
of the data set, we decided to condense the data set in order to obtain more mean-
ingful results from collaborative filtering algorithms. In particular, we removed items
and users with fewer than 10 ratings. The resulting data set contains 12,332 ratings
(on a 0-10 scale) from 839 users on 836 movies. Finally, the Amazon (AMZ) data
set is described in [McAuley and Leskovec, 2013] and consists of reviews of fine foods
during a period of more than 10 years. After removing items with fewer than 10
ratings and reviewers with fewer than 25 ratings each, the data set consists of 15,235
ratings (on a 1-5 scale) from 407 users on 4,316 items.
3.4.2 Experimental Setup
Using the ML, MT, and AMZ data sets, we conducted a large number of exper-
iments and compared the results against the standard user-based k-NN approach,
different k-FN methods, and matrix factorization. In order to test the proposed ap-
proach of probabilistic neighborhood selection under various experimental settings,
we used different sizes of neighborhoods (k ∈ {20, 30, . . . , 80}) and different proba-
bility distributions (P ∈ {normal, exponential, Weibull, folded normal, uniform}),
with various specifications (i.e., location and scale parameters), as well as the empir-
ical distribution of user similarity, described in Table 3.1. The uniform distribution
is used in order to compare the proposed method against randomly selecting neigh-
bors. The specific distributions were selected because they focus on different areas
of the spectrum of candidate neighbors and they constitute common but flexible ex-
amples that can be easily reproduced. Additionally, we used two k-FN models [Said
et al., 2013, 2012b]; the second furthest neighbor model (k-FN2) employed in this
study corresponds to recommending the least liked items of the furthest neighbors
instead of the most liked ones (k-FN1). We should note here that because of the
strict deterministic nature of both k-NN and k-FN, it is not possible to interpo-
late between these two methods and select diverse neighbors that approximate the
results of k-PN. In addition, we generated recommendation lists of different sizes
(l ∈ {1, 3, 5, 10, 20, . . . , 100}). In summary, we used 3 data sets, 7 different sizes of
neighborhoods, 12 probability distributions, and 13 different lengths of recommenda-
tion lists, resulting in 3,276 experiments in total.
Table 3.1: Probability Distributions and Density Functions for Probabilistic Neighborhood Selection.

Label | Probability Distribution | Probability Density Function (weights) | Location and Shape Parameters
k-NN  | -                        | $1/k$ if $x \le n - k$; $0$ otherwise   | -
E     | Empirical Similarity     | $w_x / \sum_{i=1}^{n} w_i$              | -
U     | Uniform                  | $1/n$                                   | -
N1    | Normal                   | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu = 0$, $\sigma = (0.25/15.0)\,n$
N2    | Normal                   | (as above)                              | $\mu = (2.0/15.0)\,n$, $\sigma = (0.5/15.0)\,n$
Exp1  | Exponential              | $\lambda e^{-\lambda x}$                | $\lambda = 1/k$
Exp2  | Exponential              | (as above)                              | $\lambda = 2/k$
W1    | Weibull                  | $\frac{\mu}{\lambda} \left(\frac{x}{\lambda}\right)^{\mu-1} e^{-(x/\lambda)^{\mu}}$ | $\mu = 0.25$, $\lambda = n/20$
W2    | Weibull                  | (as above)                              | $\mu = 0.50$, $\lambda = n/20$
FN1   | Folded normal            | $\frac{\sqrt{2}}{\sigma\sqrt{\pi}} e^{-\frac{\theta^2}{2}} e^{-\frac{x^2}{2\sigma^2}} \cosh\!\left(\frac{\theta x}{\sigma}\right)$ | $\theta = 1$, $\sigma = k$
FN2   | Folded normal            | (as above)                              | $\theta = 1$, $\sigma = k/2$
k-FN  | -                        | $1/k$ if $x \ge n - k$; $0$ otherwise   | -
For the probabilistic neighborhood selection, we used the method described in Section 3.3.3. In order to estimate the initial weights $\{w_i\}$ of the procedure, we used the probability density functions illustrated in Table 3.1. Without loss of generality, in order to take into consideration similarity levels of the candidate neighbors, the candidates can be ordered and re-labeled such that $s_{u,1} \ge s_{u,2} \ge \ldots \ge s_{u,n}$, where $s_{u,j}$ is the similarity level of target $u$ and candidate $j$ based on some distance metric.
Then, the initial weight wj for each candidate can be generated using its ranking
and a probability density function. For instance, using the Weibull probability dis-
tribution (i.e., W1 or W2), the weight of the most similar candidate (i.e., j = 1) is
$w_1 = \frac{\mu}{\lambda} \left(\frac{1}{\lambda}\right)^{\mu-1} e^{-(1/\lambda)^{\mu}}$, where $\mu$ and $\lambda$ are the shape and scale parameters of the
distribution and n is the total number of all the candidate neighbors.4 In contrast
to the deterministic k-NN and k-FN approaches, depending on the parameters of the
employed probability density function, this candidate neighbor (i.e., the most sim-
ilar to the target) may or may not have the highest weight wj.5 Figure 3.1 shows
the likelihood of sampling each candidate neighbor using different probability distri-
butions for the MovieLens data set and k = 80 and Figure 3.2 shows the sampled
neighborhoods for a randomly selected target user using the different distributions;
the candidate neighbors for each target user and item in the x axis are ordered based
on their similarity to the target user with 0 corresponding to the nearest (i.e., most
similar) candidate. As we can see, the selected distributions focus on different areas
of the spectrum of candidate neighbors. We should note here that using the empirical
distribution of user similarity resulted in more diverse neighborhoods.
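For illustration, the sketch below derives initial weights from the candidates' similarity ranking and the Weibull density of Table 3.1; the W1 parameter values and the toy similarity vector are assumptions of the example, and any of the other densities in the table can be plugged in the same way.

```python
import numpy as np

def weibull_pdf(x, mu, lam):
    """Weibull density as parameterized in Table 3.1 (mu: shape, lam: scale)."""
    return (mu / lam) * (x / lam) ** (mu - 1.0) * np.exp(-(x / lam) ** mu)

def initial_weights(similarities, pdf):
    """Rank candidates by decreasing similarity to the target user
    (rank 1 = most similar) and assign weight w_j = pdf(rank_j)."""
    similarities = np.asarray(similarities, dtype=float)
    order = np.argsort(-similarities)
    ranks = np.empty(similarities.size, dtype=float)
    ranks[order] = np.arange(1, similarities.size + 1)
    return pdf(ranks)

n = 400                                   # illustrative number of candidates
sims = np.random.rand(n)                  # illustrative similarity values
w = initial_weights(sims, lambda r: weibull_pdf(r, mu=0.25, lam=n / 20))
```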
In all the conducted experiments, in order to measure the similarity among the candidate neighbors, we used the Pearson correlation; similar results were also obtained using the cosine similarity. We also used significance weighting as in [Herlocker et al., 1999], in order to penalize for similarity based on few common ratings, and filtered any candidate neighbors with zero weight [Desrosiers and Karypis, 2011]. For the similarity estimation of the candidates in the k-furthest neighbor algorithm, we used the approach described in [Said et al., 2013, 2012b].

4 For continuous probability distributions, the cumulative distribution function can also be used, such as $w_i = F(i + 0.5) - F(i - 0.5)$ or $w_i = F(i) - F(i - 1)$.
5 For a probabilistic furthest neighbors model, the candidates can be ordered in reverse similarity order such that $s_{u,1} \le s_{u,2} \le \ldots \le s_{u,n}$. Initial experiments illustrated that such models underperform the proposed approach.
Figure 3.1: Sampling probability for the nearest candidate neighbors using different probability distributions for the MovieLens (ML) data set.

Figure 3.2: Sampled neighborhoods using the different probability distributions for the MovieLens data set.
Besides, we used the standard combining function as in Eq. (3.1). Similar results were
also obtained using a combining function without a first-order bias approximation:
$r_{u,i} = \sum_{v \in N_i(u)} w_{u,v} r_{v,i} \big/ \sum_{v \in N_i(u)} |w_{u,v}|$; any differences are explicitly discussed in the
following section. In addition, we used a holdout validation scheme in all of our ex-
periments with 80/20 splits of the rating tuples to the training/test parts in order
to avoid overfitting. Finally, the evaluation of the various approaches in each ex-
perimental setting is based on users with more than k candidate neighbors, where
k is the corresponding neighborhood size; if a user has k or fewer available candidate
neighbors, then the same neighbors are always selected and the results for the specific
user are in principle identical for all the examined approaches, apart from the inverse
k-FN (k-FN2) method. Similarly, the generated recommendation lists were also eval-
uated using a subset of the test set containing only highly rated items as well as only
long-tail items [Cremonesi et al., 2010].
3.5 Results of Probabilistic Neighborhood Selection
The aim of this study is to demonstrate that the proposed method indeed effec-
tively generates recommendations that alleviate the over-specialization and concen-
tration problems while performing well in terms of other important metrics of RSes.
Therefore, we conduct a comparative analysis of our method and the standard base-
line (k-NN), matrix factorization, and the k-furthest neighbor approaches, in different
experimental settings.
Given the number and the diversity of experimental settings, the presentation
of the results constitutes a challenging problem. A reasonable way to compare the
results across the different settings is by computing the relative performance differ-
ences and discussing only the most interesting dimensions. Due to space limitations,
detailed results, supplementary graphs and tables, and tests of statistical significance
about all the conducted experiments as well as additional performance metrics mea-
suring orthogonality of recommendations and predictive accuracy are included in
[Adamopoulos and Tuzhilin, 2013b].
Overall, the proposed method generates recommendations that are very different
from the classical CF approaches and alleviates the over-specialization and concentra-
tion problems, based on metrics of coverage, dispersion, and diversity reinforcement
(mobility of recommendations), while avoiding any significant accuracy loss. Fig. 3.3
shows an overview of the performance of all the methods on the ML data set across
various metrics for recommendation lists of size l = 10.
Figure 3.3: Summary of performance for the ML data set.
3.5.1 Orthogonality of Recommendations
In this section, we examine whether the proposed approach finds and recommends
different items than those recommended by the standard recommenders and, thus,
whether it can alleviate the over-specialization problem [Said et al., 2012b]. In par-
ticular, we investigate the overlap of recommendations (i.e., the percentage of items
that belong to both recommendation lists) between the classical neighborhood-based
collaborative filtering method and the various specifications (i.e., E, N1, N2, Exp1,
Exp2, W1, W2, FN1, and FN2) of the proposed approach described in Table 3.1. Fig.
3.4 presents the results obtained by applying our method to the MovieLens, Movi-
eTweetings, and Amazon data sets. The values reported are computed as the average
overlap over seven neighborhood sizes, k ∈ {20, 30, . . . , 80}, for recommendation lists
of size l = 10.
As Fig. 3.4 demonstrates, the proposed method generates recommendations that
are very different from the recommendations provided by the classical k-NN approach.
The most dissimilar recommendations were achieved using the empirical distribution of user similarity and the inverse k-furthest neighbors approach [Said et al., 2013].
In particular, the average overlap across all the proposed probability distributions,
neighborhoods, and recommendation list sizes was 14.87%, 64.17%, and 64.64% for
the ML, MT, and AMZ data sets, respectively; the corresponding overlap using only
the empirical distribution was 2.79%, 44.82%, and 39.80% for the different data sets.
Hence, it is worth noting that not only the k-FN approach but also the proposed probabilistic method resulted in recommendations orthogonal to those of the standard k-NN method. Besides, for the sparser data sets (i.e., MovieTweetings, Amazon), the recommendation lists exhibit greater overlap, since there are proportionally fewer candidate neighbors available to sample from and, thus, the neighborhoods tend to be
more similar.

Figure 3.4: Overlap of recommendation lists of size l = 10 for the different data sets ((a) MovieLens, (b) MovieTweetings, (c) Amazon).
This is also depicted in the experiments using one of the standard prob-
ability distributions but the empirical distance of candidate neighbors, the uniform
distribution, and the deterministic k-FN approach. Similarly, recommendation lists of
smaller size resulted in even smaller overlap among the various methods. Moreover,
the experiments conducted using the U and k-FN1 approaches resulted in recom-
mendations very different from the recommendations provided by the classical k-NN
approach only when the first-order bias approximation was not used in the combin-
ing function of the ratings. In general, without the first-order bias approximation the
average overlap was further reduced by 58.70%, 19.69%, and 52.18% for the ML, MT,
and AMZ data sets, respectively. As one would expect, the experiments conducted
using the same probability distribution (e.g., Exp1 and Exp2) result in similar perfor-
mance. To determine statistical significance, we have tested the null hypothesis that
the performance of each of the methods is the same using the Friedman test. Based
on the results, we reject the null hypothesis with p < 0.0001. Performing post hoc
analysis on Friedman’s Test results, all the specifications of the proposed approach
(apart from the cases of the N2, W2, Exp, and FN specifications for the AMZ data
set) significantly outperform the k-FN1 method in all the data sets. The difference
between the empirical distribution and k-FN2 is not statistically significant for any
data set.
Nevertheless, even a large overlap between two recommendation lists does not
imply that these lists are the same. For instance, two recommendation lists might
contain the same items but in reverse order. In order to further examine the orthog-
onality of the generated recommendations, we measure the rank correlation of the
generated lists using the Spearman’s rank correlation coefficient [Spearman, 1987],
which measures the Pearson correlation coefficient between the ranked variables. In
particular, we use the top 100 items recommended by method $i$ and examine the correlation $\rho_{ij}$ in the rankings generated by methods $i$ and $j$ for those items; $\rho_{ij}$ might be different from $\rho_{ji}$.

Figure 3.5: Spearman's rank correlation coefficient for the MovieTweetings data set.
Fig. 3.5 shows the average ranking correlation over seven neighbor-
hood sizes, k ∈ {20, 30, . . . , 80}, using the Spearman’s rank correlation coefficient ρij
(i corresponds to the row index and j to the column index) for the MovieTweetings
data set (i.e., the data set that exhibits the largest overlap). The correlation between
the classical neighborhood-based collaborative filtering method and the probabilistic
approach with the empirical distribution using the top 100 items recommended by
the k-NN method is 22.21%. As Figs. 3.4b and 3.5 illustrate, even though some
specifications may result in recommendation lists that exhibit significant overlap, the
ranking of the recommended items is not strongly correlated.
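For completeness, the two orthogonality measures used in this section can be sketched as follows; the helper that builds method j's ranking of method i's top items is an illustrative assumption, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import spearmanr

def overlap(list_a, list_b):
    """Fraction of items that appear in both recommendation lists."""
    return len(set(list_a) & set(list_b)) / float(len(list_a))

def rank_correlation(scores_i, scores_j, top=100):
    """Spearman correlation between method i's and method j's rankings of
    the top-`top` items recommended by method i (scores: item -> score)."""
    top_items = sorted(scores_i, key=scores_i.get, reverse=True)[:top]
    ranks_i = np.arange(len(top_items))                       # ranking by method i
    scores = [-scores_j.get(item, -np.inf) for item in top_items]
    ranks_j = np.argsort(np.argsort(scores))                  # ranking by method j
    rho, _ = spearmanr(ranks_i, ranks_j)
    return rho
```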
3.5.2 Comparison of Coverage and Diversity
In this section, we investigate the effect of the proposed method on coverage
and aggregate diversity; two important metrics that in combination with other mea-
sures discussed in this study show whether the proposed approach alleviates the over-
specialization and concentration problems of common RSes. The results obtained
using the catalog coverage metric are equivalent to those using the diversity-in-top-
N metric for aggregate diversity; henceforth, only one set of results is presented.
Fig. 3.6 presents the results obtained by applying our method to the ML, MT, and
AMZ data sets. In particular, the Hinton diagram in Fig. 3.6 shows the percentage
increase/decrease in performance compared to the k-NN baseline for each probability
distribution and recommendation lists of size l ∈ {1, 3, 5, 10, 20, . . . , 100} over seven
neighborhood sizes, k ∈ {20, 30, . . . , 80}. Positive and negative values are represented
by white and black squares, respectively, and the size of each square represents the
magnitude of each value.
Fig. 3.6 demonstrates that the proposed method in most cases performs better than
the user-based k-NN, matrix factorization, and the k-FN methods.

Figure 3.6: Increase in aggregate diversity performance for the different data sets and recommendation list sizes ((a) MovieLens, (b) MovieTweetings, (c) Amazon).
The most diverse recommendations were obtained using the empirical distribution of user similarity and the inverse k-furthest neighbors approach (k-FN2). In particular, the average
aggregate diversity across all the probability distributions, neighborhoods, and rec-
ommendation list sizes was 22.10%, 46.09%, and 13.52% for the ML, MT, and AMZ
data sets, respectively; the corresponding diversity using only the empirical distribu-
tion was 24.20%, 50.55%, and 17.04% for the different data sets. The corresponding
performance of MF [Gantner et al., 2011] was measured as 14.17%, 10.70%, and
14.39%, respectively.
Furthermore, the performance increased both in the experiments where the k-NN method, because of the specifics of the particular data sets, resulted in low aggregate diversity (e.g., Amazon) and in those where it resulted in high diversity (e.g., MovieTweetings).
In addition, the experiments conducted using the same probability distribution (e.g.,
Exp1 and Exp2) exhibit very similar performance. As one would expect, in most
cases the aggregate diversity increased, whereas the magnitude of the difference in
performance decreased, with increasing recommendation list size l. Without using the
first-order bias approximation in the combining function, the standard k-NN method
resulted in higher aggregate diversity and catalog coverage but the proposed approach
still outperformed the classical algorithm in most of the cases by a narrower margin;
using the inverse k-FN method (k-FN2) without the first-order bias approximation
resulted in a decrease in performance for the Amazon data set. The performance of the empirical distribution was 33.37%, 61.06%, and 41.50% for the different data sets.
Nevertheless, using all the candidates, instead of probabilistically selecting a diverse
neighborhood, underperforms the proposed approach since the overall contribution of
neighbors other than the most similar is significantly discounted.
In terms of statistical significance, using the Friedman test and performing post
hoc analysis, the differences among the employed baselines (i.e., k-NN, MF, k-FN1,
and k-FN2) and all the proposed specifications are statistically significant (p < 0.001)
for the ML data set. For the MT and AMZ data sets, all the proposed specifications
(i.e., E, N1, N2, Exp1, Exp2, W1, W2, FN1, and FN2) significantly outperform the
k-NN and matrix factorization algorithms; the empirical distribution significantly
outperforms also the k-FN1 method.
3.5.3 Comparison of Dispersion and Diversity Reinforcement
In order to conclude whether the proposed approach alleviates the concentration
biases, the generated recommendation lists should also be evaluated for the inequality
across items using the Gini coefficient. Fig. 3.7 shows the percentage increase (white
squares) or decrease (black squares) in dispersion of recommendations compared to
the k-NN baseline. The Gini coefficient was on average improved by 6.81%, 3.67%,
and 1.67% for the ML, MT, and AMZ data sets, respectively; the corresponding
figures using only the empirical distribution were 7.48%, 6.73%, and 3.45% for the
different data sets, which implies an improvement of 7.41%, 16.76%, and 1.54% over
MF and 9.69%, 5.34%, and 2.84% over k-FN. The more uniformly distributed rec-
ommendation lists were achieved using the empirical distribution of user similarity
and the inverse k-furthest neighbors approach. Moreover, the larger the size of the
recommendation lists, the larger the improvement in the Gini coefficient. Similarly,
without using the first-order bias approximation in the rating combining function,
the average dispersion was further improved by 6.48%, 6.83%, and 20.22% for the
ML, MT, and AMZ data sets, respectively. This implies an improvement of 14.91%,
22.83%, and 21.19% against MF and 16.94%, 5.90%, and 2.38% against k-FN. As
we can conclude, in the recommendation lists generated from the proposed method,
the number of times an item is recommended is more equally distributed compared to
other CF methods.

Figure 3.7: Increase in dispersion of recommendations for the different data sets and recommendation list sizes ((a) MovieLens, (b) MovieTweetings, (c) Amazon).
In terms of statistical significance, all the proposed specifications
(apart from the N1, Exp2, and FN2 for the MT data set and the N2, Exp2 for the
AMZ data set) significantly outperform the k-NN, matrix factorization, and k-FN1
methods (p < 0.001). The empirical distribution also significantly outperforms the
k-FN2 method for the ML data set; the differences are not statistically significant for
the other data sets.
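For reference, the dispersion measure can be sketched as a standard Gini-coefficient computation over per-item recommendation counts (an illustrative formulation, not necessarily the exact estimator used in the experiments):

```python
import numpy as np

def gini(recommendation_counts):
    """Gini coefficient of per-item recommendation counts;
    0 = perfectly equal exposure, larger values = more concentration."""
    x = np.sort(np.asarray(recommendation_counts, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    index = np.arange(1, n + 1)
    return float(2.0 * np.sum(index * x) / (n * total) - (n + 1.0) / n)

print(gini([25, 25, 25, 25]))   # 0.0: perfectly uniform exposure
print(gini([0, 0, 0, 100]))     # 0.75, the maximum possible for four items
```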
However, simply evaluating the recommendation lists in terms of dispersion and
inequality does not provide any information about the (popularity-based) diversity
reinforcement and mobility of the recommendations (i.e., whether popular or long-
tail items are more likely to be recommended) since these metrics do not consider
the prior state of the system. Hence, we employ a diversity reinforcement measure
M to assess whether the proposed recommender system approach follows or changes
the prior popularity of items when recommendation lists are generated. Thus, we
define M , which equals the proportion of items that are “mobile” (e.g., changed from
popular in terms of number of ratings to “long tail” in terms of recommendation
frequency), as follows:
$$M = 1 - \sum_{i=1}^{K} \pi_i \rho_{ii},$$
where the vector π denotes the initial distribution of each of the K (popularity)
categories and ρii the probability of staying in category i, given that i was the initial
category.6 A score of zero denotes no change (i.e., the number of times an item is
recommended is proportional to the number of ratings it has received) whereas a score
of one denotes that the RS recommends only the long-tail items (i.e., the number of
times an item is recommended is proportional to the inverse of the number of ratings
it has received).
6 The proposed diversity reinforcement score can be easily adapted in order to differentiate the direction of change and the magnitude.
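A minimal sketch of this measure for the two-category ("head"/"tail") split described in the next paragraph, assuming per-item rating counts and recommendation counts are available (the inputs and the 20% head share are illustrative assumptions):

```python
import numpy as np

def diversity_reinforcement(rating_counts, rec_counts, head_share=0.2):
    """M = 1 - sum_i pi_i * rho_ii with two popularity categories:
    'head' = top items by popularity, 'tail' = the rest."""
    rating_counts = np.asarray(rating_counts)
    rec_counts = np.asarray(rec_counts)
    n_items = len(rating_counts)
    n_head = max(1, int(round(head_share * n_items)))

    head_before = set(np.argsort(-rating_counts)[:n_head].tolist())  # by ratings
    head_after = set(np.argsort(-rec_counts)[:n_head].tolist())      # by recommendations
    tail_before = set(range(n_items)) - head_before
    tail_after = set(range(n_items)) - head_after

    pi = np.array([len(head_before), len(tail_before)], dtype=float) / n_items
    rho_ii = np.array([len(head_before & head_after) / len(head_before),
                       len(tail_before & tail_after) / len(tail_before)])
    return 1.0 - float(np.dot(pi, rho_ii))
```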
In the conducted experiments, based on the 80-20 rule or Pareto principle, we
use two categories, labeled as “head” and “tail”, where the former category contains
the top 20% of items (in terms of ratings or recommendations frequency) and the
latter category the remaining 80%. The experimental results demonstrate that the
proposed method generates recommendation lists that exhibit in most cases higher
diversity reinforcement compared to the k-NN, MF, and k-FN methods. In particular,
the performance was increased by 0.91%, 0.95%, and 0.19% for the ML, MT, and
AMZ data sets, respectively; the corresponding improvement using only the empirical
distribution was 1.29%, 1.46%, and 0.45% for the different data sets which implies
an improvement of 1.52%, 1.12%, and 2.11% over MF and 1.01%, 1.31%, and 0.29%
over k-FN. We also note that recommendation lists of larger size resulted on average
in even larger improvements. Besides, considering a smaller number of items as
popular also resulted in larger improvements. Similarly, without the first-order bias
approximation the average diversity reinforcement was further increased by 0.69%,
0.53%, and 3.28% for the ML, MT, and AMZ data sets, respectively. This implies
an improvement of 2.40%, 1.82%, and 1.65% against MF and 1.83%, 1.44%, and
1.82% against k-FN. Fig. 3.8 shows the transition probabilities of each category for
recommendation lists of size l = 100 using the empirical distribution of similarity
and the MovieTweetings data set. In terms of statistical significance, in most of the
cases all the proposed specifications significantly outperform the baseline methods
(p < 0.005) [Adamopoulos and Tuzhilin, 2013b].
3.5.4 Comparison of Item Prediction
Apart from alleviating the concentration bias and over-specialization problems in
CF systems, the proposed approach should also perform well in terms of predictive
accuracy. Thus, the goal in this section is to compare the proposed method with the
Figure 3.8: Diversity reinforcement for the MT data set.
standard baseline methods in terms of traditional metrics for item prediction, such as
the F1 score. Figs. 3.9 and 3.10 present the results obtained by applying the proposed
method to the MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data
sets. The values reported in Fig. 3.10 are computed as the average performance over
seven neighborhood sizes, k ∈ {20, 30, . . . , 80}, using the F1 score for recommendation
lists of size l = 10. The Hilton diagram show in Figure 3.9 presents the relative F1
score for recommendation lists of size l ∈ {1, 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 100}; the
size of each white square represents the magnitude of each value with the maximum
size corresponding to the maximum value achieved in the conducted experiments for
each data set. Similar results were also obtained using as positive instances only the
highly rated items (i.e., items rated above the average rating or above the 80% of the
rating scale) in the test set.
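A minimal sketch of the item-prediction metric (F1 at list size l, treating a user's held-out test items as the relevant set; an illustrative formulation):

```python
def f1_at_l(recommended, relevant, l=10):
    """F1 score of a top-l recommendation list against a user's test items."""
    top_l = list(recommended)[:l]
    hits = len(set(top_l) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / float(len(top_l))
    recall = hits / float(len(relevant))
    return 2.0 * precision * recall / (precision + recall)

print(f1_at_l(["a", "b", "c"], {"b", "d"}, l=3))   # 0.4: one hit out of three
```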
Figure 3.9: Item prediction performance for the different data sets ((a) MovieLens, (b) MovieTweetings, (c) Amazon).

Figure 3.10: Item prediction performance for the different data sets and recommendation lists of size l = 10 ((a) MovieLens, (b) MovieTweetings, (c) Amazon).

Figs. 3.9 and 3.10 demonstrate that the proposed method outperforms the stan-
dard user-based k-NN method and the k-FN approach in most of the cases. The
most accurate recommendations were generated using the empirical distribution of
user similarity, the normal or the exponential distribution. In particular, the average
F1 score across all the proposed probability distributions, neighborhoods, and recom-
mendation list sizes was 0.0018, 0.0050, and 0.0010 for the ML, MT, and AMZ data
sets, respectively; the corresponding performance using only the empirical distribu-
tion was 0.0015, 0.0055, and 0.0022 for the different data sets resulting on average
in a 4-fold increase. Besides, without using the first-order bias approximation in the
rating combining function, the proposed approach outperformed in most of the cases
the classical k-NN algorithm and the k-FN method by a wider margin. Furthermore,
we should note that the performance was increased across various experimental speci-
fications, including different sparsity levels, neighborhood sizes, and recommendation
list lengths. This performance improvement is due to the reduction of covariance
among the selected neighbors and is in accordance with the phenomenon of “hub-
ness” and the ensemble learning theory that we introduce in the neighborhood-based
collaborative filtering framework in Section 3.3.2.
To determine the statistical significance of the previous findings, we have tested
using the Friedman test the null hypothesis that the performance of each of the
methods is the same. Based on the results, we reject the null hypothesis with p <
0.0001. Performing post hoc analysis on Friedman’s Test results, in most of the
cases (i.e., 86.42% of the experimental settings) the proposed approach significantly
outperforms the employed baselines and in the remaining cases the differences are not
statistically significant. In particular, the differences between the traditional k-NN
and each one of the proposed variations (apart from the case of the FN1 specification
for the MT data set) are statistically significant for all the data sets; similar results
were also obtained for the differences among the proposed approach and the k-FN
models.
3.5.5 Comparison of Utility-based Ranking
Further, in order to better assess the quality of the proposed approach, the recommendation lists should also be evaluated for the ranking of the items that they present to the users, taking into account the rating scale of the selected data sets. In principle,
since all items are not of equal relevance/quality to the users, the relevant/better
items should be identified and ranked higher for presentation. Assuming that the
utility of each recommendation is the rating of the recommended item discounted by
a factor that depends on its position in the list of recommendations, in this section
we evaluate the generated recommendation lists based on the normalized Discounted Cumulative Gain (nDCG) [Järvelin and Kekäläinen, 2002], where positions are discounted logarithmically.
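A minimal sketch of this metric with a logarithmic position discount is given below; taking the gains directly as the held-out ratings of the recommended items is an assumption of the illustration.

```python
import numpy as np

def ndcg(recommended_ratings, l=None):
    """Normalized discounted cumulative gain with a log2 position discount;
    recommended_ratings are the held-out ratings of the recommended items,
    in the order presented to the user."""
    r = np.asarray(recommended_ratings, dtype=float)[:l]
    discounts = 1.0 / np.log2(np.arange(2, r.size + 2))    # positions 1..l
    dcg = float(np.sum(r * discounts))
    idcg = float(np.sum(np.sort(r)[::-1] * discounts))     # best possible ordering
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg([3, 5, 4, 1]))   # below 1.0: the list is not ideally ordered
```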
The highest performance was again achieved using the empirical distribution of
user similarity, the normal, or the Weibull distribution. In particular, the average
increase of the nDCG score across all the examined probability distributions, neigh-
borhoods, and recommendation list sizes was 100.06%, 20.05%, and 89.85% for the
ML, MT, and AMZ data sets, respectively; the corresponding increase using only the
empirical distribution was 117.65%, 23.01%, and 383.99% for the different data sets
resulting on average in a 2-fold increase. The absolute performance of the empirical
distribution for the different data sets was 73.54%, 74.62%, and 42.63%, respectively.
Even though k-PN on average underperforms MF, it performs very well on both ML
and MT, especially given the goals of this method. Without using the first-order
bias approximation in the rating combining function, the proposed approach outper-
formed in most of the cases the classical k-NN algorithm and the k-FN methods by
an even wider margin. The same wide margin was also observed focusing on long-tail
items, except for MT. In terms of statistical significance, the differences among the
employed baselines and all the proposed specifications (apart from the FN1 for the
MT data set and the N1, Exp2, W2, and FN2 for the AMZ data set) are statistically
significant.
3.6 Experimental Settings of Optimized Neighborhood Se-
lection
To empirically validate the k-BN method presented in Section 3.3.4 and evaluate
the generated recommendations, we conduct a large number of experiments on “real-
world” data sets and compare our results to the k-PN (using the empirical distribution
of similarity) that outperforms different baselines as shown in Section 3.5.
The data sets that we used are the MovieLens [GroupLens, 2011] and MovieTweet-
ings [Dooms et al., 2013] as well as a snapshot from Amazon [McAuley and Leskovec,
2013] as in Section 3.4. Using the ML, MT, and AMZ data sets, we conducted a large
number of experiments and compared the results against the k-PN approach. In order to test the proposed approach of optimized neighborhood selection under various experimental settings, we used different sizes of neighborhoods (k ∈ {10, 20, . . . , 80})
with various specifications. In addition, we generated recommendation lists of differ-
ent sizes (l ∈ {1, 3, 5, 10, 20, . . . , 100}). In summary, we used 3 data sets, 8 different
sizes of neighborhoods, 6 specifications, and 13 different lengths of recommendation
lists, resulting in 1,872 experiments in total.
The different specifications we employed in the experiments correspond to the following conditions: i) k-BNu, where the weights in the combining function are equal across all neighbors (i.e., w_{u,v} = 1), ii) k-BNs, where the weights depend on the similarity between each neighbor and the target user (i.e., w_{u,v} = s_{u,v}), iii) k-BNo, where the weights are estimated based on Eqn. 3.4, iv) k-BNwu, where the neighbor’s rating prediction for the target user is weighted by the similarity with the target user s_{u,v} and the weights in the combining function are equal across all neighbors, v) k-BNws, where the neighbor’s rating prediction is estimated as before and the weights depend on the similarity, and finally vi) k-BNwo, where the prediction is estimated as before and the weights are based on Eqn. 3.4.
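As a rough illustration only, the following Python sketch shows how the neighbor weights could be formed under the three weighting schemes (the variable names are hypothetical, and the estimation of the optimized weights via Eqn. 3.4 is not reproduced here; the “w” variants would additionally multiply each neighbor’s rating prediction by s_{u,v} before combining):

    import numpy as np

    def combine_neighbor_predictions(neighbor_preds, similarities, spec, optimized_weights=None):
        """Combine the neighbors' rating predictions under the k-BN weighting specifications.

        neighbor_preds    : per-neighbor rating predictions for the target user
        similarities      : similarities s_{u,v} between the target user and each neighbor
        spec              : 'u' (equal weights), 's' (similarity weights), or 'o' (optimized weights)
        optimized_weights : weights estimated separately (e.g., via Eqn. 3.4) when spec == 'o'
        """
        neighbor_preds = np.asarray(neighbor_preds, dtype=float)
        similarities = np.asarray(similarities, dtype=float)
        if spec == 'u':      # k-BNu / k-BNwu: w_{u,v} = 1
            weights = np.ones_like(similarities)
        elif spec == 's':    # k-BNs / k-BNws: w_{u,v} = s_{u,v}
            weights = similarities
        elif spec == 'o':    # k-BNo / k-BNwo: externally estimated weights
            weights = np.asarray(optimized_weights, dtype=float)
        else:
            raise ValueError("unknown specification")
        return float(np.dot(weights, neighbor_preds) / weights.sum())

    # Hypothetical example: three neighbors combined with similarity-based weights
    print(combine_neighbor_predictions([4.0, 3.0, 5.0], [0.9, 0.5, 0.2], spec='s'))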
In all the conducted experiments, in order to measure the similarity among the
candidate neighbors, we used the cosine similarity; similar results were also obtained
using the Pearson correlation. We also used significance weighting as in [Herlocker
et al., 1999], in order to penalize for similarity based on few common ratings, and
filtered any candidate neighbors with zero weight [Desrosiers and Karypis, 2011]. In
addition, we used a holdout validation scheme in all of our experiments with 80/20
splits of the rating tuples to the training/test parts in order to avoid overfitting. For
the k-BN method and the estimation of bias, variance, and covariance, we also used 3 folds, each sampling with replacement 50% of the observations in the training set. Finally,
the generated recommendation lists were also evaluated using a subset of the test set
containing only highly rated items as well as only long-tail items [Cremonesi et al.,
2010].
3.7 Results of Optimized Neighborhood Selection
Overall, the proposed method generates recommendations that alleviate the over-
specialization and concentration problems, based on metrics of coverage, dispersion,
and diversity reinforcement (mobility of recommendations), while avoiding any sig-
nificant accuracy loss. Fig. 3.11 shows an overview of the performance of the k-BN
and k-PN methods on the MT data set across various metrics for recommendation
lists of size l = 10.
Figure 3.11: Summary of performance for the MT data set.
3.7.1 Orthogonality of Recommendations
In this section, we examine whether the proposed approach finds and recommends
different items than those recommended by the k-PN method. In particular, we in-
vestigate the overlap of recommendations (i.e., the percentage of items that belong
to both recommendation lists) between the k-PN method and the various specifica-
tions of the proposed approach. Fig. 3.12 presents the results obtained by applying our method to the MovieLens, MovieTweetings, and Amazon data sets. The values reported are computed as the average overlap over eight neighborhood sizes, k ∈ {10, 20, . . . , 80}.

[Figure 3.12: Overlap of recommendation lists for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon; rows: k-BNwo, k-BNws, k-BNwu, k-BNo, k-BNs, k-BNu; x-axis: recommendation list size.]
As Fig. 3.12 demonstrates, the proposed method generates recommendations that
are very different from the recommendations provided by the k-PN approach. In par-
ticular, the average overlap across all the proposed specifications, neighborhoods, and recommendation list sizes was 0.30%, 20.54%, and 0.25% for the ML,
MT, and AMZ data sets, respectively. Besides, recommendation lists of smaller size
resulted in even smaller overlap among the various methods.
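For clarity, the overlap between two recommendation lists of equal length can be computed as in the following minimal sketch (hypothetical helper; the reported figures additionally average this quantity over users, neighborhood sizes, and list sizes):

    def overlap_percentage(list_a, list_b):
        """Percentage of items that appear in both top-N recommendation lists (equal length assumed)."""
        return 100.0 * len(set(list_a) & set(list_b)) / len(list_a)

    # Example: two top-5 lists sharing a single item yield 20% overlap
    print(overlap_percentage([1, 2, 3, 4, 5], [5, 6, 7, 8, 9]))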
3.7.2 Comparison of Coverage and Diversity
In this section, we investigate the effect of the proposed method on coverage and
aggregate diversity. The results obtained using the catalog coverage metric are equiv-
alent to those using the diversity-in-top-N metric for aggregate diversity; henceforth,
only one set of results is presented. Fig. 3.13 presents the results obtained by ap-
plying our method to the ML, MT, and AMZ data sets. In particular, the Hinton
diagram in Fig. 3.13 shows the percentage increase/decrease in performance compared
to the k-PN method for each specification and recommendation lists of size
l ∈ {1, 3, 5, 10, 20, . . . , 100} over eight neighborhood sizes, k ∈ {10, 20, . . . , 80}. Posi-
tive and negative values are represented by white and black squares, respectively, and
the size of each square represents the magnitude of each value.
Fig. 3.13 demonstrates that the proposed method in most cases performs better than
the user-based k-NN, matrix factorization, k-FN, and k-PN methods. In particular,
the average aggregate diversity increase across all the specifications, neighborhoods,
and recommendation list sizes was 5.90%, 27.71%, and 1.34% for the ML, MT, and
AMZ data sets, respectively. The performance increased both in experiments where the k-NN method, because of the specifics of the particular data sets, resulted in low aggregate diversity (e.g., Amazon) and in experiments where it resulted in high aggregate diversity (e.g., MovieTweetings). We should note that, for the ML data set, estimating the optimal weights for the neighbors resulted in decreased performance compared to the k-PN approach.

[Figure 3.13: Increase in aggregate diversity performance for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]
3.7.3 Comparison of Dispersion and Diversity Reinforcement
Fig. 3.14 shows the percentage increase (white squares) or decrease (black squares)
in dispersion of recommendations compared to the k-PN method. The Gini coefficient
was on average improved by 0.015%, 0.24%, and 0.007% for the ML, MT, and AMZ
data sets, respectively. Hence, in the recommendation lists generated from the pro-
posed method, the number of times an item is recommended is more equally distributed
compared to other standard CF methods as well as the k-PN method.
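One standard way to compute such a dispersion measure over the per-item recommendation counts is sketched below (a common Gini formulation; the exact variant used in the experiments may differ):

    import numpy as np

    def gini_coefficient(recommendation_counts):
        """Gini coefficient of how often each item is recommended.

        0 means every item is recommended equally often; values close to 1 indicate
        that recommendations concentrate on only a few items.
        """
        x = np.sort(np.asarray(recommendation_counts, dtype=float))
        n = x.size
        if x.sum() == 0:
            return 0.0
        ranks = np.arange(1, n + 1)
        return float(2.0 * np.dot(ranks, x) / (n * x.sum()) - (n + 1.0) / n)

    # Example: perfectly balanced versus highly concentrated recommendation counts
    print(gini_coefficient([10, 10, 10, 10]))  # 0.0
    print(gini_coefficient([0, 0, 0, 40]))     # 0.75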
We also evaluate the generated recommendations based on the (popularity-based)
diversity reinforcement and mobility of the recommendations (i.e., whether popular
or long-tail items are more likely to be recommended) as in Section 3.5.3. The experi-
mental results illustrated in Fig. 3.15 demonstrate that the proposed method generates
recommendation lists that exhibit in most cases higher diversity reinforcement com-
pared to the k-NN, MF, k-FN, and k-PN methods. In particular, the performance
was increased by 0.52%, 4.95%, and −0.43% for the ML, MT, and AMZ data sets,
respectively. We also note that recommendation lists of larger size resulted on aver-
age in even larger improvements. Besides, considering a smaller number of items as
popular also resulted in larger improvements.
[Figure 3.14: Increase in dispersion of recommendations for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

[Figure 3.15: Increase in diversity reinforcement and mobility of recommendations for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

3.7.4 Comparison of Item Prediction

Apart from alleviating the concentration bias and over-specialization problems in CF systems, the proposed approach should also perform well in terms of predictive
accuracy. Thus, the goal in this section is to compare the proposed method with the
standard baseline methods in terms of traditional metrics for item prediction, such
as the F1 score. Fig. 3.16 presents the results obtained by applying the proposed
method to the MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data
sets. The values reported in Fig. 3.16 present the relative F1 score improvements in
comparison to k-PN for recommendation lists of size l ∈ {1, 3, 5, 10, 20, . . . , 100}; the
size of each white square represents the magnitude of each value with the maximum
size corresponding to the maximum increase achieved in the conducted experiments
for each data set. Similar results were also obtained using as positive instances only the highly rated items (i.e., items rated above the average rating or above 80% of the rating scale) in the test set.

Fig. 3.16 demonstrates that the proposed method outperforms the k-PN approach in most of the cases. In particular, the average F1 score increase across all the proposed specifications, neighborhoods, and recommendation list sizes was
−0.04%, 62.03%, and 770% (38.22% for recommendations of size larger than 1) for
the ML, MT, and AMZ data sets, respectively.
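The F1 score at a given list size, as used for item prediction here, can be computed per user as in this sketch (standard definitions; the choice of relevant items, e.g., highly rated held-out items, follows the setup described above):

    def precision_recall_f1(recommended, relevant):
        """Precision, recall, and F1 for one user's top-N recommendation list."""
        recommended, relevant = set(recommended), set(relevant)
        hits = len(recommended & relevant)
        if hits == 0:
            return 0.0, 0.0, 0.0
        precision = hits / len(recommended)
        recall = hits / len(relevant)
        return precision, recall, 2 * precision * recall / (precision + recall)

    # Example: 2 of the 5 recommended items are among the user's 4 relevant test items
    print(precision_recall_f1([1, 2, 3, 4, 5], [2, 5, 8, 9]))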
3.7.5 Comparison of Rating Prediction
We also compare the proposed method with the k-PN method in terms of RMSE
score. Fig. 3.17 presents the results obtained by applying the proposed method to the
MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data sets. The values
reported in Fig. 3.17 present the relative RMSE score improvement (i.e., decrease)
for neighborhoods of size k ∈ {10, 20, . . . , 80}; the size of each white square represents
the magnitude of each value with the maximum size corresponding to the maximum
change achieved in the conducted experiments for each data set.
[Figure 3.16: Item prediction performance for the different data sets. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

[Figure 3.17: Rating prediction performance for the different data sets. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

Fig. 3.17 demonstrates that the proposed method outperforms the k-PN approach
in most of the cases. In particular, the average RMSE improvement across all the proposed specifications, neighborhoods, and recommendation list sizes was 0.011%, 3.58%, and 1.92% for the ML, MT, and AMZ data sets, respectively. The corresponding improvement compared to k-NN is 0.022%, 3.94%, and 1.92%, respectively.
3.7.6 Comparison of Utility-based Ranking
In this section we evaluate the generated recommendation lists based on the normalized Discounted Cumulative Gain (nDCG) [Jarvelin and Kekalainen, 2002], where positions are discounted logarithmically. Fig. 3.18 presents the results obtained by applying the proposed method to the MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data sets. The values reported in Fig. 3.18 present the relative nDCG improvement for recommendation lists of size l ∈ {1, 3, 5, 10, 20, . . . , 100}.
The average increase of the nDCG score across all the examined specifications, neighborhoods, and recommendation list sizes was 39.58%, −0.63%, and 293.91% for the ML, MT, and AMZ data sets, respectively.

[Figure 3.18: Utility-based ranking performance for the different data sets. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

3.8 Discussion of Neighborhood Selection

In this chapter, we introduce various algorithms to address the problem of identifying better subsets of neighbors in collaborative filtering systems. We first present a novel method for recommending items based on probabilistic neighborhood selection (k-PN) in collaborative filtering systems. We then propose an ordered neighborhood selection approach where neighborhoods of increasing size are constructed by incorporating at each iteration the neighbor that reduces the expected predictive error the most (k-BN). We illustrate the practical implementation of the proposed approaches in the context of memory-based systems, adapting and improving the standard k-
nearest neighbors (k-NN) method. In the proposed approaches, the neighborhoods
are selected based on expected generalization error (i.e., k-BN) or an underlying
probability distribution (i.e., k-PN), instead of just the k neighbors with the highest
similarity level to the target. For the probabilistic neighborhood selection (k-PN)
approach, we use an efficient method for weighted sampling of k neighbors that takes
into consideration similarity levels between the target and all the candidate neighbors.
In addition, we conduct an empirical study showing that the proposed methods,
by selecting diverse representative neighborhoods, generate recommendations that are
very different from the classical CF approaches and alleviate the over-specialization
and concentration problems while outperforming k-NN, k-FN, and matrix factoriza-
tion methods. We also demonstrate that the proposed methods outperform, by a
wide margin in most cases, both the standard k-nearest neighbors and the k-furthest
neighbors approaches in terms of both item prediction accuracy and utility-based
ranking. The experimental results are also in accordance with the phenomenon of
“hubness” and the ensemble learning theory that we employ in the neighborhood-
based CF framework. Besides, we show that the performance improvement is not
achieved at the expense of other popular performance measures.
Moreover, the proposed methods can be further extended and modified in order to
sample k neighbors from the x nearest candidates, instead of all the available users,
and combined with additional rating normalization and similarity weighting schemes
[Jin et al., 2004] beyond those employed in this study. Similarly, the complexity of
the proposed approaches can be further reduced by first filtering out the candidates
for selection in the resulting neighborhood. For instance, only users with more than a pre-defined number of ratings can be considered as candidate neighbors. Also, apart
from the user-based and item-based k-NN collaborative filtering approaches, other
popular methods that can be easily extended with the use of the proposed neigh-
borhood selection methods, in order to allow us to generate both accurate and novel
recommendations, include Matrix Factorization approaches and global neighborhood
models (e.g., [Koren, 2008, 2010]). Besides, this approach can be further extended to
the popular methods of k-NN classification and regression in IR.
CHAPTER IV
On Unexpectedness in Recommender Systems: Or
How to Expect the Unexpected
“If you do not expect it, you will not find the unexpected, for it is hard
to find and difficult”.
- Heraclitus of Ephesus, 544 - 484 B.C.
4.1 Introduction to Unexpectedness
One key dimension for improvement that can significantly contribute to the overall
performance and usefulness of RSes, and is still under-explored, is the notion of unex-
pectedness. RSes often recommend expected items that the users are already familiar
with and, thus, they are of little interest to them. For example, a shopping RS may
recommend to customers products such as milk and bread. Although being accurate,
in the sense that the customer will indeed buy these two products, such recommen-
dations are of little interest because they are obvious, since the shopper will, most
likely, buy these products even without these recommendations. Therefore, because
of this potential for higher user satisfaction, it is important to study non-obvious
recommendations. Motivated by the challenges and implications of this problem, we
try to resolve it by recommending unexpected items of significant usefulness to the
users.
Following the Greek philosopher Heraclitus, we approach this hard and difficult
problem of finding and recommending unexpected items by first capturing the ex-
pectations of the user. The challenge is not only to identify the items expected by
the user and then derive the unexpected ones, but also to enhance the concept of
unexpectedness while still delivering recommendations of high quality that fairly match the user’s interests.
In this chapter, we formalize this concept by providing a new formal definition
of unexpected recommendations, as those recommendations that significantly depart
from user’s expectations, and differentiate it from various related concepts, such as
novelty and serendipity. We also propose a method for generating unexpected recom-
mendations and suggest specific metrics to measure the unexpectedness of recommen-
dation lists. Finally, we show that the proposed method can enhance unexpectedness
while maintaining the same or higher levels of accuracy of recommendations.
4.2 Related Concepts
In the following paragraphs, we discuss how various related but still different
concepts, such as novelty, diversity, and serendipity, differ from the proposed notion
of unexpectedness.
In particular, comparing novelty to unexpectedness, a novel recommendation
might be unexpected but novelty is strictly defined in terms of previously unknown
non-redundant items without allowing for known but unexpected ones. Also, novelty
does not include any positive reactions of the user to recommendations. Illustrat-
ing some of these differences in the movie context, assume that the user John Doe
is mainly interested in Action & Adventure films. Recommending to this user the
newly released production of one of his favorite Action & Adventure film directors is
a novel recommendation but not necessarily unexpected and possibly of low utility
for him since John was either expecting the release of this film or he could easily find
out about it. Similarly, assume that we recommend to this user the latest Children &
Family film. Although this is definitely a novel recommendation, it is probably also
of low utility and would be likely considered “irrelevant” because it departs too much
from his expectations.
Moreover, even though both serendipity and unexpectedness involve a positive sur-
prise of the user, serendipity is restricted to novel items and their accidental discovery,
without taking into consideration the expectations of the users and the relevance of
the items, and thus constitutes a different type of recommendation that can be more
risky and ambiguous. To further illustrate the differences of these two concepts, let
us assume that we recommend to John Doe the latest Romance film. There are some
chances that John will like this novel item and the accidental discovery of a serendip-
itous recommendation. However, such a recommendation might also be of low utility
to the user since it does not take into consideration his expectations and the relevance
of the items. On the other hand, assume that we recommend to John Doe a movie in
which one of his favorite Action & Adventure film directors is performing as an actor
in an old (non-novel) Action film of another director. The user will most probably
like this unexpected but non-serendipitous recommendation.
Under a definition of diversification as the process of maximizing the variety of
items in a recommendation list, avoiding a too narrow set of choices is generally a
good approach to increase the usefulness of the recommendation list since it enhances
the chances that a user is pleased by at least some recommended items. However,
diversity is a very different concept from unexpectedness and constitutes an ex-post
process that can be combined with the concept of unexpectedness.
Pertaining to unexpectedness, the previously proposed system-centric approaches
do not fully capture the multi-faceted concept of unexpectedness since they do not
truly take into account the actual expectations of the users, which is crucial accord-
ing to philosophers, such as Heraclitus, and some modern researchers [Silberschatz
and Tuzhilin, 1996; Berger and Tuzhilin, 1998; Padmanabhan and Tuzhilin, 1998].
Hence, an alternative user-centric definition of unexpectedness, taking into account
prior expectations of the users, and methods for providing to the users unexpected
recommendations are still needed. In particular, a user-centric definition of unex-
pectedness and the corresponding methods should avoid recommendations that are
obvious, irrelevant, or expected to the user, but without being strictly restricted only
to novel items, and also should allow for a notion of positive discovery, as a recom-
mendation makes more sense when it exposes the user to a relevant experience that
she/he has not thought of or experienced yet. In this dissertation, we deviate from
the previous definitions of unexpectedness and propose a new formal user-centric def-
inition, as recommending to the users those items that depart from what they expect
from the recommender system, which we thoroughly discuss in the next section.
Based on the previous definitions and the discussed similarities and differences,
one can conclude that the concepts of novelty, serendipity, and unexpectedness are
overlapping since all these entities are linked to a notion of discovery, as a recom-
mendation makes more sense when it exposes the user to a relevant experience that
she/he has not thought of or experienced yet. However, unexpectedness includes the
positive reaction of a user to recommendations about previously unknown items, but
without being restricted only to novel items and, also avoids recommendations that
are obvious, irrelevant, or expected to the user.
4.3 Definition of Unexpectedness
In this section, we formally model and define the concept of unexpected rec-
ommendations as those recommendations that significantly depart from the user’s
expectations. However, unexpectedness alone is not enough for providing truly useful
recommendations since it is possible to deliver unexpected recommendations but of
low quality. Therefore, after defining unexpectedness, we introduce utility of a recom-
mendation and provide an example of utility as a function of the quality of recommen-
dation (e.g., specified by the item’s rating) and its unexpectedness. We maintain that
this utility of a recommended item is the concept on which we should focus (vis-a-vis
“pure” unexpectedness) by recommending items with the highest levels of utility to
the user. Finally, we propose an algorithm for providing the users with unexpected
recommendations of high quality that are hard to discover but fairly match their in-
terests and present specific performance measures for evaluating the unexpectedness
of the generated recommendations. We define unexpectedness in Section 4.3.1, the
utility of recommendations in Section 4.3.2, and we propose a method for delivering
unexpected recommendations of high quality in Section 4.3.3 and metrics for their
evaluation in Section 4.3.4.
4.3.1 Unexpectedness of Recommendations
To define unexpectedness, we start with user expectations. The expected items
for each user u can be defined as a consideration set, a finite collection of typical items and those items that the user considers as choice candidates in order to serve her own current needs or fulfill her intentions, as indicated by interacting with the
recommender system. This concept of the sets of user expectations can be more
precisely specified and operationalized in the lower level of a specific application and
recommendation setting. In particular, the set of expected items Eu for a user can
be specified in various ways, such as the set of past transactions performed by the
user, or as a set of “typical” recommendations that she expects to receive or has
received in the past. Moreover, the sets of user expectations, as the true expectations of the users, can also be adapted to different contexts and evolve over time. For
example, in case of a movie RS, this set of expected items may include all the movies
already seen by the user and all their related and similar movies, where “relatedness”
and “similarity” are specified and operationalized through specific mechanisms in
Section 4.4.
Intuitively, an item included in the set of expected recommendations derives “zero
unexpectedness” for the user, whereas the more an item departs from the set of expec-
tations, the more unexpected it is, until it starts being perceived as irrelevant by the
user. Unexpectedness should thus be a positive, unbounded function of the distance
of this item from the set of expected items. More formally, we define unexpectedness
in recommender systems as follows. First, we define:
δ_{u,i} = d(i; E_u),   (4.1)
where d(i; Eu) is the distance of item i from the set of expected items Eu for user
u. Then, unexpectedness of item i with respect to user expectations Eu is defined as
some unimodal function ∆ of this distance:
∆(δ_{u,i}; δ^*_u),   (4.2)
where δ∗u is the best (most preferred) unexpected distance from the set of expected
items Eu for user u (the mode of distribution ∆). In particular, the most preferred
unexpected distance δ∗u for user u is a horizontally differentiated feature and can be
interpreted as the distance that results in the highest utility for a given quality of an
item (see Section 4.3.2) and captures the preferences of the user about unexpectedness.
Intuitively, unimodality of this function ∆ indicates that:
1. there is only one most preferred unexpected distance,
2. an item that greatly departs from the user’s expectations, even though it results in a large departure from expectations, will probably be perceived as irrelevant by the user and, hence, is not truly unexpected, and
3. items that are close to the expected set are not truly unexpected but rather
obvious to the user.
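Purely for illustration, one simple function with these properties (not necessarily the operationalization adopted in Section 4.4) is a Gaussian-shaped function of the distance, centered at the preferred unexpected distance:

    ∆(δ_{u,i}; δ^*_u) = exp( −(δ_{u,i} − δ^*_u)^2 / (2σ^2) ),

which attains its single maximum at δ_{u,i} = δ^*_u and decays both for items very close to the expected set (obvious recommendations) and for items very far from it (recommendations likely perceived as irrelevant); the width σ, an illustrative parameter, would control how quickly unexpectedness falls off around the preferred distance.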
The above definitions1 clearly take into consideration the actual expectations of the
users as we discussed in Section 4.2. Hence, unexpectedness is neither a characteristic
of items nor users, since an item can be expected for a specific user but unexpected
for another one. It is the interplay of the user and the item that characterizes whether
the particular recommendation is unexpected for the specific user or not.
However, recommending to a user the items that result in the highest level of
unexpectedness could be problematic, since recommendations should also be of high
quality and fairly match user preferences. In other words, it is important to emphasize
that simply increasing the unexpectedness of a recommendation list is valueless if this
list does not contain relevant items of high quality that the user likes. In order to
generate such recommendations that would maximize the users’ satisfaction, we use
certain concepts from the utility theory in economics [Marshall, 1920].
1 The aforementioned definitions serve as templates of the proposed concepts that are precisely defined and thoroughly operationalized through specific mechanisms in Sections 4.4.2.1-4.4.2.4.
4.3.2 Utility of Recommendations
In the context of recommender systems, pertaining to the concept of unexpect-
edness and trying to keep the complexity of our method to a minimum, we specify
the utility of a recommendation of an item to a user in terms of two components:
the utility of quality that the user will gain from using the product and the util-
ity of unexpectedness of the recommended item, as defined in Section 4.3.1. Our
proposed model follows the standard assumption in economics that the users engage in optimal utility-maximizing behavior [Marshall, 1920]. Additionally, we
consider the quality of an item to be a vertically differentiated characteristic [Tirole,
1988], which means that utility is a monotone function of quality and hence, given
the unexpectedness of an item, the greater the quality of this item, the greater the
utility of the recommendation to the user. Consequently, without loss of generality,
we propose that we can estimate this overall utility of a recommendation using the
previously mentioned utility of quality and the loss in utility by the departure from
the preferred level of unexpectedness δ∗u. This will allow the utility function to have
the required characteristics described so far. Note that the distribution of utility as
a function of unexpectedness and quality is non-linear, bounded, and experiences a
global maximum.
Formalizing these concepts, in order to provide an example of a utility function
to illustrate the proposed method, we assume that each user u values the quality of
an item by a positive constant qu and that the quality of the item i is represented by
the corresponding rating ru,i. Then, we define the utility derived from the quality of
the recommended item i to the user u as:
U^q_{u,i} = q_u × r_{u,i} + ε^q_{u,i},   (4.3)
where εqu,i is the error term defined as a random variable capturing the stochastic
aspect of recommending item i to user u.
We also assume that user u values the unexpectedness of an item by a non-negative
factor λu measuring the user’s tolerance to redundancy and irrelevance. The utility of
the user decreases by departing from the preferred level of unexpectedness δ∗u. Then,
the utility of the unexpectedness of a recommendation can be represented as:
U^δ_{u,i} = −λ_u × φ(δ_{u,i}; δ^*_u) + ε^δ_{u,i},   (4.4)
where function φ captures the departure of unexpectedness of item i from the preferred
level of unexpectedness δ∗u for user u and εδu,i is the error term for user u and item i.
Then, the utility of recommending items to users is computed as the sum of (4.3)
and (4.4):
U_{u,i} = U^q_{u,i} + U^δ_{u,i}   (4.5)

U_{u,i} = q_u × r_{u,i} − λ_u × φ(δ_{u,i}; δ^*_u) + ε_{u,i},   (4.6)
where εu,i is the stochastic error term.
Function φ can also be defined in various ways. For example, using popular
location models for horizontal and vertical differentiation of products in economics
[Cremer and Thisse, 1991; Neven, 1985], the departure from the preferred level of
unexpectedness can be defined as the linear distance:
U_{u,i} = q_u × r_{u,i} − λ_u × |δ_{u,i} − δ^*_u|,   (4.7)

or the quadratic one:

U_{u,i} = q_u × r_{u,i} − λ_u × (δ_{u,i} − δ^*_u)^2.   (4.8)
Note that the utility of a recommendation is linearly increasing with the rating
for these distances, whereas, given the quality of the product, it increases with un-
expectedness up to the threshold of the preferred level of unexpectedness δ∗u. This
threshold δ∗u is specific for each user and context. Also, note that two recommended
items of different quality and distance from the set of expected items may derive the
same levels of usefulness (i.e., indifference curves).2
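A minimal sketch of the deterministic part of these two utility variants, with hypothetical parameter values, is the following (the stochastic error terms are omitted):

    def utility_linear(r_ui, delta_ui, q_u, lambda_u, delta_star_u):
        """Utility with a linear penalty for departing from the preferred unexpectedness, as in (4.7)."""
        return q_u * r_ui - lambda_u * abs(delta_ui - delta_star_u)

    def utility_quadratic(r_ui, delta_ui, q_u, lambda_u, delta_star_u):
        """Utility with a quadratic penalty, as in (4.8)."""
        return q_u * r_ui - lambda_u * (delta_ui - delta_star_u) ** 2

    # Hypothetical example: a 4-star item whose distance from the expected set is
    # slightly above the user's preferred level of unexpectedness (0.6 vs. 0.5)
    print(utility_linear(4.0, 0.6, q_u=1.0, lambda_u=2.0, delta_star_u=0.5))     # 3.8
    print(utility_quadratic(4.0, 0.6, q_u=1.0, lambda_u=2.0, delta_star_u=0.5))  # 3.98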
4.3.3 Recommendation Algorithm
Once the utility function Uu,i is defined, we can then make recommendations
to user u by selecting items i having the highest values of utility Uu,i. Additionally,
specific restrictions can be applied on the quality and unexpectedness of the candidate
items, if appropriate in the application, in order to ensure that the recommended items
will exhibit specific levels of unexpectedness and quality.3
Algorithm 4 summarizes the proposed method for generating unexpected recom-
mendations of high quality that are hard to discover and fairly match the users’
interests. In particular, we compute for each user u a set of expected recommenda-
tions Eu. Then, for each item i in our product base, if the estimated quality of the
item q_{u,i} is above the threshold \underline{q}, we compute the distance δ_{u,i} of the specific item from the set of expectations E_u for the particular user. If the distance δ_{u,i} is within the specified interval [\underline{δ}, \bar{δ}], we compute the utility of unexpectedness U^δ_{u,i} of item i for user u based on φ(δ_{u,i}; δ^*_u). Next, we estimate the final utility U_{u,i} of recommend-
2 Eqs. (4.5) and (4.6) illustrate a simple example of a utility function for the problem of unexpectedness in recommender systems. Any utility function may be used and not necessarily a weighted sum of two or more distinct components. The reader might even derive examples of utility functions without the use of δ^* but may lose some of the discussed properties (e.g., global maximum). Besides, function φ does not have to be symmetric as in the examples provided in (4.7) and (4.8).

3 In the same sense, if required in a specific setting, only items not included in the set of user expectations can be considered candidates for recommendation. An alternative way to control the expected levels of unexpectedness can be based on the utility function of choice and tuning of its coefficients.
ALGORITHM 4: Unexpectedness Recommendation Algorithm

Input: Users’ profiles, utility function, estimated quality of items for users, context, etc.
Output: Recommendation lists of size N_u

q_{u,i}: Quality of item i for user u
\underline{q}: Lower limit on quality of recommended items
\underline{δ}: Lower limit on distance of recommended items from expectations
\bar{δ}: Upper limit on distance of recommended items from expectations
N_u: Number of items recommended to user u

for each user u do
    Compute expectations E_u for user u;
    for each item i do
        if q_{u,i} ≥ \underline{q} then
            Compute distance δ_{u,i} of item i from expectations E_u for user u;
            if δ_{u,i} ∈ [\underline{δ}, \bar{δ}] then
                Estimate utility of unexpectedness U^δ_{u,i} of item i for user u based on φ(δ_{u,i}; δ^*_u);
                Estimate utility U_{u,i} of item i for user u;
            end
        end
    end
    Recommend to user u the top N_u items having the highest utility U_{u,i};
end
ing this item to the specific user based on the different components of the specified
utility function; the estimated utility corresponds to the final predicted rating ru,i of
the classical recommender system algorithms. Finally, we recommend to the user the
items that exhibit the highest estimated utility U_{u,i}. Examples of how to compute the set of expected items E_u for a user are provided in Section 4.4.2.3.
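A compact Python rendering of Algorithm 4 is sketched below; the callables compute_expectations, estimate_quality, distance_from_set, and utility are placeholders for the application-specific mechanisms of Sections 4.4.2.1-4.4.2.4 and are not part of the original algorithm statement:

    def recommend_unexpected(users, items, compute_expectations, estimate_quality,
                             distance_from_set, utility, n_u, q_min, delta_min, delta_max):
        """Sketch of Algorithm 4: rank candidate items by utility of quality and unexpectedness."""
        recommendations = {}
        for u in users:
            expectations = compute_expectations(u)            # set of expected items E_u
            scored = []
            for i in items:
                q_ui = estimate_quality(u, i)
                if q_ui < q_min:                              # lower limit on quality
                    continue
                delta_ui = distance_from_set(i, expectations)
                if not (delta_min <= delta_ui <= delta_max):  # allowed unexpectedness interval
                    continue
                scored.append((utility(u, i, q_ui, delta_ui), i))
            scored.sort(key=lambda pair: pair[0], reverse=True)
            recommendations[u] = [i for _, i in scored[:n_u]]
        return recommendations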
4.3.4 Evaluation of Recommendations
[Adomavicius and Tuzhilin, 2005; Herlocker et al., 2004; McNee et al., 2006;
Adamopoulos, 2014b, 2013a; Adamopoulos et al., 2014] suggest that RS should be
evaluated not only by their accuracy, but also by other important metrics such as
coverage, serendipity, unexpectedness, and usefulness. Hence, we propose specific
metrics to evaluate the candidate items and the generated recommendations.
4.3.4.1 Metrics of Unexpectedness
In order to accurately and precisely measure the unexpectedness of candidate
items and generated recommendation lists, we deviate from the approach proposed
by Murakami et al. [2008] and Ge et al. [2010], and propose new metrics to evaluate
our method. In particular, Murakami et al. [2008] and Ge et al. [2010] focus on the
difference in predictions between two algorithms (i.e., the deviation of beliefs in a
recommender system from the results obtained from a primitive prediction model
that shows high ratability) and thus Ge et al. [2010] calculate the unexpected set of
recommendations (UNEXP) as:
UNEXP = RS \ PM (4.9)
where PM is a set of recommendations generated by a primitive prediction model
and RS denotes the recommendations generated by a recommender system. When an
element of RS does not belong to PM, they consider this element to be unexpected.
As Ge et al. [2010] argue, unexpected recommendations may not be always useful
and, thus, the paper also introduces a serendipity measure as:
SRDP = |UNEXP ∩ USEFUL| / |N|   (4.10)
where USEFUL denotes the set of “useful” items and N the length of the recom-
mendation list. For instance, the usefulness of an item can be judged by the users or
approximated by the items’ ratings as we describe in Section 4.4.2.6.
However, these measures do not fully capture the proposed user-centric definition
of unexpectedness since a PM usually contains just the most popular items and does not actually take the expectations of the users into account at all. Consequently, we revise their definition and introduce new metrics to measure unexpectedness as
follows. First of all, we define expectedness (EXPECTED) as the mean ratio of the
items that are included in both the set of expected recommendations for a user (Eu)
and the generated recommendation list (RSu):
EXPECTED = Σ_u |RS_u ∩ E_u| / |N|.   (4.11)
Furthermore, we propose a metric of unexpectedness (UNEXPECTED) as the
mean ratio of the items that are not included in the set of expected recommendations
for the user but are included in the generated recommendation lists:
UNEXPECTED = Σ_u |RS_u \ E_u| / |N|.   (4.12)
Correspondingly, we can also derive a new metric, following the SRDP measure
of serendipity [Murakami et al., 2008], based on the proposed concept and metric of
unexpectedness:
UNEXPECTED^+ = Σ_u |(RS_u \ E_u) ∩ USEFUL_u| / |N|.   (4.13)
For the sake of simplicity, the metrics defined so far consider whether an item is
expected to the user or not in terms of strict boolean identity. However, we can relax
this restriction using the distance of an item from the set of expectations as in (4.1),
or the unexpectedness of an item as in (4.2). For instance:
UNEXPECTED = Σ_u ∆(δ_{u,i}; δ^*_u) / |N|.   (4.14)
Moreover, the metrics proposed in this section can be combined with those sug-
gested by Murakami et al. [2008] and Ge et al. [2010] as described in Section 4.4.2.6.
Besides, the proposed metrics can be adapted to take into consideration the rank of
the item in the recommendation list by using a rank discount factor as in [Castells
et al., 2011].
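The boolean variants of these metrics can be computed as in the following sketch (hypothetical data structures; it mirrors the summation form of Eqs. (4.11)-(4.13), and a per-user mean is obtained by dividing by the number of users):

    def unexpectedness_metrics(recommendations, expectations, useful, n):
        """EXPECTED, UNEXPECTED, and UNEXPECTED+ aggregated over users, following (4.11)-(4.13)."""
        expected = unexpected = unexpected_plus = 0.0
        for u, rec_list in recommendations.items():
            rec, exp_u = set(rec_list), set(expectations[u])
            expected += len(rec & exp_u) / n
            unexpected += len(rec - exp_u) / n
            unexpected_plus += len((rec - exp_u) & useful) / n
        return expected, unexpected, unexpected_plus

    # Hypothetical example with two users and recommendation lists of size n = 3
    recs = {"u1": ["a", "b", "c"], "u2": ["b", "d", "e"]}
    exps = {"u1": {"a"}, "u2": {"d", "e"}}
    useful = {"b", "c"}
    print(unexpectedness_metrics(recs, exps, useful, n=3))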
4.3.4.2 Metrics of Accuracy
The recommendation lists can also be evaluated for the accuracy of rating and item
predictions using standard metrics such as Root Mean Square Error, Mean Absolute
Error, Precision, Recall, and the F-measure. In applications where the number of
recommendations presented to the user is preordained, the most useful measure of
interest usually is precision at N [Shani and Gunawardana, 2011].
Finally, recommender systems can also be evaluated based on various other metrics
including diversity, confidence, trust, robustness, adaptivity, and catalog coverage
[Shani and Gunawardana, 2011].
4.4 Experimental Settings
To empirically validate the method presented in Section 4.3.3 and evaluate the
unexpectedness of the generated recommendations, we conduct a large number of
experiments on “real-world” data sets and compare our results to popular baseline
methods.
Unfortunately, we could not compare our results with other methods for deriving
unexpected recommendations for the following reasons. First, among the previously
proposed methods of unexpectedness, as explained in Section 4.2, the authors present
only the performance metrics and do not provide any clear computational algorithm
for computing recommendations, thus making the comparison impossible. Further,
most of the existing methods are based on related but different principles such as
diversity and novelty. Since these concepts are, in principle, very different from our
definition, they cannot be directly compared with our approach. Besides, most of the
methods of novelty and serendipity require additional data, such as explicit informa-
tion from the users about known items. In addition, many of the methods of these
related concepts are not generic and cannot be implemented in a traditional recom-
mendation setting, but assume very specific applications and domains. Consequently,
we selected a number of standard Collaborative Filtering (CF) and other algorithms
as baseline methods to compare with the proposed approach. In particular, we se-
lected both the item-based and user-based k-Nearest Neighborhood approach (kNN),
the Slope One (SO) algorithm [Lemire and Maclachlan, 2007], a Matrix Factorization
(MF) method [Koren et al., 2009], the average rating value of an item, and a baseline
using the average rating value plus a regularized user and item bias [Koren, 2010].
We would like to indicate that, although the selected baseline methods do not explic-
itly support the notion of unexpectedness, they constitute fairly reasonable baselines
because, as was pointed out in [Burke, 2002], CF methods also perform well in terms
of other performance measures besides the classical accuracy measures.4
4 The proposed method also outperforms in terms of unexpectedness other methods that capture the related but different concepts of novelty, serendipity, and diversity, such as the k-furthest neighbor collaborative filtering recommender algorithm [Said et al., 2012b].
4.4.1 Data Sets
The basic data sets that we used are the RecSys HetRec 2011 MovieLens data set
[Cantador et al., 2011] and the BookCrossing data set [Ziegler et al., 2005].
The RecSys HetRec 2011 MovieLens (ML) data set [Cantador et al., 2011] is
an extension of a data set published by [GroupLens, 2011], which contains personal
ratings and tags about movies, and consists of 855,598 ratings from 2,113 users on
10,197 movies. This data set is relatively dense (3.97%) compared to other frequently
used data sets but we believe that this characteristic is a virtue that will let us better
evaluate our method since it allows us to better specify the set of expected movies
for each user. Besides, in order to test the proposed method under various levels of
sparsity [Adomavicius and Zhang, 2012], we consider different proper subsets of the
data sets.
Additionally, we used information and further details from Wikipedia [Wikipedia,
2012] and IMDb [IMDb, 2011]. Joining these data sets we were able to enhance
the available information by identifying whether a movie is an episode or sequel of
another movie included in our data set. We succeeded in identifying “related” items
(i.e., episodes, sequels, movies with exactly the same title) for 2,443 of our movies
(23.95% of the movies with 2.18 related movies on average and a maximum of 22).
We used this information about related movies to identify sets of expectations, as
described in Section 4.4.2.3. We also consider a proper subset (b) of the MovieLens
data set consisting of 4,735 items and 2,029 users, with at least 25 ratings each,
exhibiting 807,167 ratings.
The BookCrossing (BC) data set is gathered by Ziegler et al. [2005] from Bookcross-
ing.com [BookCrossing, 2004], a social networking site founded to encourage the ex-
change of books. This data set contains fully anonymized information on 278,858
members and 1,157,112 personal ratings, both implicit and explicit, referring to
271,379 distinct ISBNs. The specific data set was selected because we can use the
implicit ratings of the users to better specify their expectations, as described in Sec-
tion 4.4.2.3. Besides, we supplemented the available data for 261,229 books with
information from Amazon [Amazon, 2012], Google Books [Google, 2012], ISBNdb [IS-
BNdb.com, 2012], LibraryThing [LibraryThing, 2012], Wikipedia [Wikipedia, 2012],
and WorldCat [WorldCat, 2012]. Such data is often publicly available and, there-
fore, it can be freely and widely used in many recommender systems [Umyarov and
Tuzhilin, 2011].
Since some books on BookCrossing refer to rare, non-English books, or outdated
titles not in print anymore, we were able to collect background information and
“related” books (i.e., alternative editions, sequels, books in the same series, with
same subjects and classifications, with the same tags, and books identified as related
or similar by the aforementioned services) for 152,702 of the books with an average of
31 related books per ISBN. Following Ziegler et al. [2005] and owing to the extreme
sparsity of the BookCrossing data set, we decided to further condense the data set
in order to obtain more meaningful results from collaborative filtering algorithms.
Hence, we discarded all the books for which we were not able to find any information,
along with all the ratings referring to them. Next, we also removed book titles with
fewer than 4 ratings and community members with fewer than 8 ratings each. The
dimensions of the resulting data set were considerably more moderate, featuring 8,824
users, 18,607 books, and 377,749 ratings (147,403 explicit ratings). Finally, we also
consider two proper subsets of this; (b) 3,580 items with at least 10 ratings and 2,545
users, with at least 15 ratings each, exhibiting 57,176 explicit and 95,067 implicit
ratings and (c) 870 items and 1,379 users with at least 25 ratings exhibiting 22,192
explicit and 37,115 implicit ratings.
Based on the collected information, we approximated the sets of expected recom-
mendations for the users, using the mechanisms described in detail in Section 4.4.2.3.
4.4.2 Experimental Setup
Using the MovieLens data set, we conducted 7,488 experiments. In half of the
experiments we assume that the users are homogeneous (Hom) and have exactly the
same preferences. In the other half, we investigate the more realistic case (Het) where
users have different preferences that depend on previous interactions with the system.
Furthermore, we use two different and diverse sets of expected movies for each user,
and different utility functions. We also use different rating prediction algorithms
and various measures of distance between movies and among a movie and the set of
expected recommendations. Finally, we derived recommendation lists of different sizes
(k ∈ {1, 3, 5, 10, 20, . . . , 100}). In conclusion, we used 2 subsets, 2 sets of expected
movies, 6 algorithms for rating prediction, 3 correlation metrics, 2 distance metrics,
2 utility functions, 2 different assumptions about user preferences, and 13 different
lengths of recommendation lists, resulting in 7,488 experiments in total.
Using the BookCrossing data set, we conducted our experiments on three different
proper subsets described in Section 4.4.1. As before, we also assume different specifi-
cations for the experiments. In particular, we used 3 subsets, 3 sets of expected books,
6 algorithms for rating prediction, 3 correlation metrics, 2 distance metrics, 2 utility
functions, 2 different assumptions about user preferences, and 13 different lengths
of recommendation lists, resulting in 16,848 experiments in total. The experimental
settings are described in detail in Sections 4.4.2.1 - 4.4.2.4.
4.4.2.1 Utility of Recommendation
We consider the following utility functions:
(1a) Representative agent (homogeneous users) with linear distance (Hom-Lin): The users are homogeneous and have similar preferences (i.e., parameters q, λ, δ^* are the same across all users) and φ(δ_{u,i}; δ^*_u) is linear in δ_{u,i} in (4.6):

U_{u,i} = q × r_{u,i} − λ × |δ_{u,i} − δ^*|.   (4.15)

(1b) Representative agent (homogeneous users) with quadratic distance (Hom-Quad): The users are homogeneous but φ(δ_{u,i}; δ^*_u) is quadratic in δ_{u,i} in (4.6):

U_{u,i} = q × r_{u,i} − λ × (δ_{u,i} − δ^*)^2.   (4.16)

(2a) Heterogeneous users with linear distance (Het-Lin): The users are heterogeneous, have different preferences (i.e., q_u, λ_u, δ^*_u), and φ(δ_{u,i}; δ^*_u) is linear in δ_{u,i} as in (4.7):

U_{u,i} = q_u × r_{u,i} − λ_u × |δ_{u,i} − δ^*_u|.   (4.17)

(2b) Heterogeneous users with quadratic distance (Het-Quad): Users have different preferences and φ(δ_{u,i}; δ^*_u) is quadratic in δ_{u,i}. This case corresponds to function (4.8):

U_{u,i} = q_u × r_{u,i} − λ_u × (δ_{u,i} − δ^*_u)^2.   (4.18)
4.4.2.2 Item Similarity
To generate the set of unexpected recommendations, the system computes the
distance d(i, j) between two items. In the conducted experiments, we use both
collaborative-based and content-based item distance.5 The distance matrix can be
5 Additional similarity measures were tested in [Adamopoulos and Tuzhilin, 2011] with similar results.
easily updated with respect to new ratings as in [Khabbaz et al., 2011] in order to
address potential scalability issues in large scale systems. The complexity of the pro-
posed algorithm can also be reduced by appropriately setting a lower limit in quality
(¯q) as illustrated in Algorithm 4. Other techniques that should also be explored in
future research include user clustering, low rank approximation of unexpectedness
matrix, and partitioning the item space based on product category or subject classi-
fication.
4.4.2.3 Sets of Expected Recommendations
The set of expected recommendations for each user can be precisely specified and
operationalized using various mechanisms that can be applied across various domains
and applications. Such mechanisms are the past transactions performed by the user,
knowledge discovery and data mining techniques (e.g., association rule learning and
user profiling), and experts’ domain knowledge. The mechanisms for specifying sets of
expected recommendations for the users can also be seeded, as and when needed, with
the past transactions as well as implicit and explicit ratings of the users. In order to
test the proposed method under various and diverse sets of expected recommendations
of different cardinalities that have been specified using the mechanisms summarized
in Table 4.1, we consider the following settings.6
1. Expected Movies: We use the following two examples of definitions of expected
movies in our study. The first set of expected movies (E^{(Base)}_u) for user u follows
a very strict definition of expectedness, as defined in Section 4.3.1. The profile of
user u consists of the set of movies that she/he has already rated. In particular,
6 In this experimental study, the expectations of the users were specified in terms of strict boolean identity because of the characteristics of the specific data sets and for the sake of simplicity. As part of the future work, we plan to relax this assumption using the proposed definition and metric of unexpectedness (Eq. 4.14).
movie i is expected for user u if the user has already rated some movie j such
that i has the same title or is an episode or sequel of movie j, where episode
or sequel is identified as explained in Section 4.4.1. These sets of expected
recommendations have on average a cardinality of 517 and 451 for the different
subsets.
The second set of expected movies (E^{(Base+RL)}_u) follows a broader definition of
expectations and is generated based on some set of rules. It includes the first
set plus a number of closely “related” movies (E^{(Base+RL)}_u ⊇ E^{(Base)}_u). In order
to form the second set of expected movies, we also use content-based similarity
between movies. More specifically, two movies are related if at least one of the
following conditions holds: (i) they were produced by the same director, belong
to the same genre, and were released within an interval of 5 years, (ii) the same
set of protagonists appears in both of them (where a protagonist is defined as
an actor with ranking ∈ {1, 2, 3}) and they belong to the same genre, (iii) the
two movies share more than twenty common tags, are in the same language,
and their correlation metric is above a certain threshold θ (Jaccard coefficient
(J) > 0.50), (iv) there is a link from the Wikipedia article for movie i to the
article for movie j and the two movies are sufficiently correlated (J > 0.50) and
(v) the content-based distance metric is below a threshold θ (d < 0.50). The
extended set of expected movies has an average size of 1,127 and 949 items per
user, for the two subsets, respectively.
2. Expected Books: For the BookCrossing data set, we use three different examples
of expected books for our users. The first set of expectations (E^{(Base)}_u) consists
of only the items that user u rated implicitly or explicitly.7 The second set
7Only explicit ratings were used with the baseline rating prediction algorithms.
Table 4.1: Sets of Expected Recommendations for Different Experimental Settings.
Data set       Set of Expected Recommendations   Mechanism           Method
MovieLens      Base                              Past Transactions   Explicit Ratings
MovieLens      Base+RL                           Domain Knowledge    Set of Rules
BookCrossing   Base                              Past Transactions   Implicit Ratings
BookCrossing   Base+RI                           Domain Knowledge    Related Items
BookCrossing   Base+AR                           Data Mining         Association Rules
of expected books (E^{(Base+RI)}_u) includes the first set plus the related or similar books identified by various third-party services as described in Section 4.4.1. These sets of expectations contain on average 1,257, 1,030, and 296 items for the three subsets, respectively. Finally, the third set of expected recommendations (E^{(Base+AR)}_u) is generated using association rule learning. In detail, an item i is expected for user u if i is a consequent of a rule with support of at least 5% and user u has implicitly or explicitly rated all the antecedent items. Because of the nature of this procedure, there is little variation in the set of expectations among the different users and, in general, these sets consist of the most popular items, defined in terms of number of ratings. These sets of expected recommendations have on average a cardinality of 808, 670, and 194 for the different subsets.
4.4.2.4 Distance from the Set of Expectations
After estimating the expectations of user u, we can then define the distance of
item i from the set of expected recommendations Eu in various ways. For example,
it can be determined by averaging the distances between the candidate item i and
all the items included in set Eu. Additionally, we also use the Centroid distance that
is defined as the distance of an item i from the centroid point of the set of expected
recommendations Eu for user u.8
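Both variants can be written down directly, as in this sketch (a generic pairwise item distance d over hypothetical item feature vectors is assumed):

    import numpy as np

    def average_distance(item_vec, expected_vecs, d):
        """Average distance of a candidate item from all items in the set of expectations."""
        return float(np.mean([d(item_vec, e) for e in expected_vecs]))

    def centroid_distance(item_vec, expected_vecs, d):
        """Distance of a candidate item from the centroid of the set of expectations."""
        centroid = np.mean(np.asarray(expected_vecs, dtype=float), axis=0)
        return float(d(item_vec, centroid))

    # Example with a Euclidean pairwise distance on toy two-dimensional item vectors
    euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))
    expected = [[0.0, 0.0], [2.0, 0.0]]
    print(average_distance([1.0, 1.0], expected, euclidean))   # ~1.414
    print(centroid_distance([1.0, 1.0], expected, euclidean))  # 1.0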
4.4.2.5 Utility Estimation
Since the users are restricted to provide ratings on a specific scale, the correspond-
ing item ratings in our data sets are censored from below and above (also known as
censoring from left and right, respectively) [Davidson and MacKinnon, 2004]. Hence,
in order to model the consumer choice, estimate the parameters of interest (i.e., qu
and λu in equations (4.15) - (4.18)), and make predictions within the same scale that
was available to the users, we borrow from the field of economics popular models of
censored multiple linear regressions [McDonald and Moffitt, 1980; Olsen, 1978; Long,
1997],9 also imposing a restriction on these models for non-negative coefficients (i.e., q_u, λ_u ≥ 0) [Greene, 2012; Wooldridge, 2002].
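As a very rough illustration of imposing the non-negativity restriction when fitting the per-user parameters (and only that: the sketch below uses ordinary non-negative least squares and deliberately ignores the censoring of the rating scale that the models cited above handle properly), one could write:

    import numpy as np
    from scipy.optimize import nnls

    def fit_user_parameters(predicted_quality, departures, observed_ratings):
        """Fit non-negative q_u and lambda_u so that q_u*quality - lambda_u*departure approximates the rating.

        A crude stand-in for the censored-regression estimation described in the text:
        it enforces q_u, lambda_u >= 0 but does not model the censoring from below and above.
        """
        X = np.column_stack([predicted_quality, -np.asarray(departures, dtype=float)])
        coefficients, _residual = nnls(X, np.asarray(observed_ratings, dtype=float))
        q_u, lambda_u = coefficients
        return q_u, lambda_u

    # Toy example with hypothetical per-item quality estimates, departures, and observed ratings
    print(fit_user_parameters([3.5, 4.0, 2.5, 4.5], [0.1, 0.4, 0.2, 0.6], [3.4, 3.6, 2.3, 3.9]))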
Furthermore, given the limitations of offline experiments and our data sets, we
use the predicted ratings from the baseline methods as a measure of quality for the
recommended items and the actual ratings of the users as a proxy for the utility of
the recommendations; this, in combination with the choice of utility functions de-
scribed in Section 4.4.2.1, will allow us to study the effect of taking unexpectedness
into consideration, without introducing any other source of variation into our model.
We also used the average distance of rated items from the set of expected recommen-
dations in order to estimate the preferred level of unexpectedness δ∗u for each user and
distance metric; for the case of homogeneous users, we used the average value over
all users. In addition, we did not use the unexpectedness and quality thresholds \underline{δ}, \bar{δ}, and \underline{q}, described in Section 4.3.3, to limit the candidate items for recommendation.

8 The experiments conducted in [Adamopoulos and Tuzhilin, 2011] using the Hausdorff distance (d(i, E_u) = inf{d(i, j) : j ∈ E_u}) indicate inconsistent performance and sometimes under-performed the standard CF methods. Hence, in this work we only conducted experiments using the average and the centroid distance.

9 Multiple linear regression models and generalized linear latent and mixed models estimated by maximum likelihood [Rabe-Hesketh et al., 2002] were also tested with similar results. Shivaswamy et al. [2007] and Khan and Zubek [2008] may also be used for utility estimation.
Besides, we used a holdout validation scheme in all of our experiments with 80/20
splits of data to the training/test part in order to avoid overfitting. Finally, we as-
sume an application scenario where an item can be a candidate for recommendation
to a user if and only if it has not been rated by the specific user; expected items can
be recommended.
4.4.2.6 Metrics of Unexpectedness and Accuracy
To evaluate our approach in terms of unexpectedness, we use the metrics described
in Section 4.3.4.1. Additionally, we further evaluate the recommendation lists using
different (i.e., expanded) sets of expectations, compared to the expectations used for
the utility estimation, based on metrics derived by combining the proposed metrics
with those suggested by Murakami et al. [2008] and Ge et al. [2010]. For the primitive
prediction model (PM) of Ge et al. [2010] in (4.9) we used the top-N items with highest
average rating and the largest number of ratings. For instance, for the experiments
conducted using the main subset of the MovieLens data set, the PM model consists of
the top 200 items with the highest average rating and top 800 items with the greatest
number of ratings; the same ratio was used for all the experiments.
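A possible construction of such a primitive prediction model is sketched below; the pandas column names and the 200/800 split (taken from the MovieLens example above) are assumptions for illustration.

```python
import pandas as pd

def primitive_model(ratings, n_by_avg=200, n_by_count=800):
    """Union of the top items by average rating and the top items by rating count."""
    stats = ratings.groupby("item_id")["rating"].agg(["mean", "count"])
    top_by_avg = set(stats.nlargest(n_by_avg, "mean").index)
    top_by_count = set(stats.nlargest(n_by_count, "count").index)
    return top_by_avg | top_by_count
```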
Besides, we introduce an additional metric of expectedness (EXPECTEDPM) as
the mean ratio of the recommended items that are either included in the set of
expected recommendations for a user or in the primitive prediction model, and are
also included in the generated recommendation list. Correspondingly, we define an
additional metric of unexpectedness (UNEXPECTEDPM) as the mean ratio of the
recommended items that are neither included in expectations nor in the primitive
prediction model, and are included in the generated recommendations:
UNEXPECTED_PM = Σ_u |RS_u \ (E_u ∪ PM)| / |N| .    (4.19)
Based on the ratio of Ge et al. [2010] in (4.10), we also use the metrics UNEXPECTED+
and UNEXPECTED+PM to evaluate serendipitous [Murakami et al., 2008] recommen-
dations in conjunction with the metrics of unexpectedness in (4.12) and (4.19), re-
spectively. To compute these metrics, the usefulness of an item for a user can be
judged by the specific user or approximated by the item’s ratings. For instance,
we consider an item to be useful if its average rating is greater than the mean of
the rating scale. In particular, in the experiments conducted using the ML and BC
data sets, we consider an item to be useful if its average rating is greater than 2.5
(USEFUL = {i : r̄_i > 2.5}) and 5.0, respectively.
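The following sketch illustrates how these metrics could be computed from per-user recommendation lists; the data structures (dictionaries keyed by user, sets of item identifiers) and the use of the recommendation-list length as the denominator are assumptions made for illustration.

```python
def list_metrics(rec_lists, expectations, pm_items, useful_items):
    """Mean per-user EXPECTED_PM, UNEXPECTED_PM (Eq. 4.19), and UNEXPECTED+_PM."""
    exp_pm, unexp_pm, unexp_plus_pm = [], [], []
    for user, recs in rec_lists.items():
        recs = set(recs)
        known = set(expectations.get(user, ())) | pm_items
        unexpected = recs - known
        n = float(len(recs))                 # assumes non-empty recommendation lists
        exp_pm.append(len(recs & known) / n)
        unexp_pm.append(len(unexpected) / n)
        unexp_plus_pm.append(len(unexpected & useful_items) / n)
    m = len(rec_lists)
    return sum(exp_pm) / m, sum(unexp_pm) / m, sum(unexp_plus_pm) / m
```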
Finally, we also evaluate the generated recommendation lists based on the aggre-
gate recommendation diversity, coverage of product base, dispersion of recommenda-
tions, as well as accuracy of rating and item predictions using the metrics discussed
in Section 4.3.4.2.
4.5 Results of Unexpectedness Method
The aim of this study is to demonstrate, through a comparative analysis of our
method and the standard baseline algorithms in different experimental settings, that
the proposed method indeed effectively captures the concept of unexpectedness and
performs well in terms of the classical accuracy metrics.
Given the number of experimental settings (5 subsets based on 2 data sets, 5 sets
of expected items, 6 algorithms for rating prediction, 3 correlation metrics, 2 distance
metrics, 2 utility functions, 2 different assumptions about users preferences, and 13
different lengths of recommendation lists, resulting in 24,336 conducted experiments
in total), the presentation of results constitutes a challenging problem. To give a
“flavor” of the results, instead of plotting individual graphs, a more concise represen-
tation can be obtained by computing the average values of performance for the main
experimental settings (see Section 4.4.2.1) and testing the statistical significance of
the differences in performance, if any. The averages are taken over the six algorithms
for rating prediction, the two correlation metrics, and the two distance metrics, except
as otherwise noted. However, given the diversity of the aforementioned experimental
settings, both the different baselines and the proposed approach may exhibit different
performance in each setting. A reasonable way to compare the results across different
experimental settings is by computing the relative performance differences:
Diff = (Perf_unxp − Perf_bsln) / Perf_bsln,    (4.20)
taken as averages over some experimental settings, where bsln refers to the baseline
methods and unxp to the proposed method for unexpectedness. A positive value
of Diff means that the proposed method outperforms the baseline, and a negative
value the opposite. For each metric, only the most interesting dimensions are discussed.
Using the utility estimation method described in Section 4.4.2.5, the average qu is
1.005 for the experiments conducted on the MovieLens data set. For the experiments
with the first set of expected movies, the average λu is 0.144 for the linear distance
and 0.146 for the quadratic one. For the extended set of expected movies, the average
estimated λu is 0.207 and 1.568, respectively. In the experiments conducted on the
BookCrossing data set, the average qu is 1.003. For the experiments with the first set
of expected books, the average λu is 0.710 for the linear distance and 3.473 for the
quadratic one. For the second and third set of expected items, the average estimated
λu is 0.717 and 3.1240, and 0.576 and 2.218, respectively.
In Section 4.5.1, we compare the proposed method for unexpected recommen-
dations with the standard baseline methods in terms of unexpectedness and
serendipity of recommendation lists. Then, in Sections 4.5.2 and 4.5.3, we study the
effects on rating and item prediction accuracy, respectively. Finally, in Section 4.5.4,
we compare the proposed method with the baseline methods in terms of other popular
metrics, such as catalog coverage, aggregate recommendation diversity, and dispersion
of recommendations.
4.5.1 Comparison of Unexpectedness
In this section, we experimentally demonstrate that the proposed method effec-
tively captures the notion of unexpectedness and, hence, outperforms the standard
baseline methods in terms of unexpectedness. Table 4.2 presents the results ob-
tained by applying our method to the MovieLens (ML) and BookCrossing (BC)
data sets. The values reported are computed using the proposed unexpectedness
metric (4.12) as the average increase in performance over six algorithms for rating
prediction, two distance metrics, different subsets, and three correlation metrics for
recommendation lists of size k ∈ {1, 3, 5, 10, 30, 50, 100}. Besides, Fig. 4.1 presents
the average performance over the same dimensions for recommendation lists of size
k ∈ {1, 3, 5, 10, 20, . . . , 100}. Similar results were also obtained using the additional
metrics described in Section 4.4.2.6. In addition, similar patterns were also observed
when specifying the user expectations using different mechanisms for the training and test
data.
Table 4.2 and Fig. 4.1 demonstrate that the proposed method outperforms the
standard baselines. As we can observe, the increase in performance is larger for
Table 4.2: Unexpectedness Performance for the MovieLens and BookCrossing Data Sets.

Data Set | User Expectations | Experimental Setting | k=1 | k=3 | k=5 | k=10 | k=30 | k=50 | k=100
MovieLens | Base | Homogeneous Linear | 1.90% | 3.57% | 3.93% | 2.30% | 1.74% | 1.51% | 1.08%
MovieLens | Base | Homogeneous Quadratic | 1.81% | 3.33% | 3.63% | 2.40% | 1.77% | 1.58% | 1.16%
MovieLens | Base | Heterogeneous Linear | 1.77% | 2.24% | 2.46% | 1.86% | 1.37% | 1.21% | 0.87%
MovieLens | Base | Heterogeneous Quadratic | 1.61% | 1.99% | 2.21% | 1.68% | 1.27% | 1.13% | 0.84%
MovieLens | Base+RL | Homogeneous Linear | 20.84% | 18.37% | 16.01% | 12.53% | 10.51% | 9.98% | 7.97%
MovieLens | Base+RL | Homogeneous Quadratic | 17.86% | 17.67% | 16.14% | 13.31% | 11.28% | 10.82% | 8.99%
MovieLens | Base+RL | Heterogeneous Linear | 16.14% | 14.82% | 13.28% | 11.06% | 9.22% | 8.90% | 7.46%
MovieLens | Base+RL | Heterogeneous Quadratic | 14.43% | 13.50% | 12.20% | 10.39% | 8.76% | 8.51% | 7.26%
BookCrossing | Base | Homogeneous Linear | 0.89% | 0.90% | 0.84% | 0.84% | 0.79% | 0.77% | 0.73%
BookCrossing | Base | Homogeneous Quadratic | 0.62% | 0.65% | 0.62% | 0.56% | 0.52% | 0.50% | 0.47%
BookCrossing | Base | Heterogeneous Linear | 0.43% | 0.46% | 0.44% | 0.44% | 0.44% | 0.45% | 0.45%
BookCrossing | Base | Heterogeneous Quadratic | 0.39% | 0.42% | 0.40% | 0.40% | 0.41% | 0.41% | 0.41%
BookCrossing | Base+RI | Homogeneous Linear | 182.12% | 152.70% | 146.17% | 131.80% | 114.17% | 104.80% | 90.69%
BookCrossing | Base+RI | Homogeneous Quadratic | 184.29% | 155.78% | 149.89% | 136.12% | 117.89% | 108.54% | 93.88%
BookCrossing | Base+RI | Heterogeneous Linear | 91.03% | 79.54% | 78.75% | 68.62% | 60.64% | 57.82% | 50.74%
BookCrossing | Base+RI | Heterogeneous Quadratic | 84.19% | 73.90% | 73.57% | 63.73% | 56.53% | 54.18% | 47.69%
BookCrossing | Base+AR | Homogeneous Linear | 157.56% | 133.80% | 127.74% | 115.27% | 98.71% | 90.49% | 76.75%
BookCrossing | Base+AR | Homogeneous Quadratic | 158.95% | 136.38% | 130.90% | 118.38% | 101.16% | 92.43% | 78.44%
BookCrossing | Base+AR | Heterogeneous Linear | 79.30% | 70.04% | 69.09% | 59.62% | 51.84% | 49.09% | 42.22%
BookCrossing | Base+AR | Heterogeneous Quadratic | 73.31% | 64.99% | 64.44% | 55.24% | 48.17% | 45.86% | 39.57%
[Figure 4.1 shows, for each of the five panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the unexpectedness (y-axis) against the recommendation list size (x-axis) for the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings.]

Figure 4.1: Unexpectedness performance of different experimental settings for the (a), (b) MovieLens (ML) and (c), (d), (e) BookCrossing (BC) data sets.
[Figure 4.2 shows, for the panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the probability (y-axis) of each level of unexpectedness (x-axis) for the Baseline, Homogeneous, and Heterogeneous settings.]

Figure 4.2: Distribution of Unexpectedness for recommendation lists of size k = 5 and different experimental settings for the MovieLens (ML) and BookCrossing (BC) data sets.
recommendation lists of smaller size k. This, in combination with the observation
that unexpectedness was significantly enhanced also for large values of k, illustrates
that the proposed method both introduces new items in the recommendation lists
and also effectively re-ranks the existing items promoting the unexpected ones. Fig.
4.1 also shows that unexpectedness was enhanced both in cases where the definition
of unexpectedness was strict, as described in Section 4.4.2.3, and thus the baseline
recommendation system methods resulted in high unexpectedness (i.e., Base) and in
cases where the measured unexpectedness of the baselines was low (i.e., Base+RL,
Base+RI, and Base+AR). Similarly, the performance was increased both for the base-
line methods that resulted in high unexpectedness (e.g., Slope One algorithm) in the
conducted experiments and the methods where unexpectedness was low (e.g., Matrix
Factorization method, item-based k-Nearest Neighbors recommendation algorithm).
Additionally, the experiments conducted using the more accurate sets of expectations
based on the information collected from various third-party websites (Base+RI) out-
performed those automatically derived by association rules (Base+AS). Besides, the
increase in performance is larger also in the experiments where the sparsity of the sub-
set of data (see Section 4.4.1) is higher, which is the most realistic scenario in practice.
In particular, for the MovieLens data set, the average unexpectedness of the recom-
mendation lists was increased by 1.62% and 10.83% (17.32% for k = 1) for the (Base)
and (Base+RL) sets of expected movies, respectively. For the BookCrossing data set,
for the (Base) set of expectations the average unexpectedness was increased by 0.55%.
For the (Base+RI) and (Base+AR) sets of expected books, the average improvement
was 135.41% (188.61% for k = 1) and 78.16% (117.28% for k = 1). Unexpected-
ness was increased in 85.43% and 89.14% of the experiments for the MovieLens and
BookCrossing data sets, respectively. Finally, the unexpectedness of the generated
recommendation lists can be further enhanced, as described in Section 4.3.3, using
[Figure 4.3 shows scatter plots of the increase in unexpectedness performance (x-axis) against the cardinality of the set of user expectations (y-axis) for (a) ML - Base+RL and (b) BC - Base+RI.]

Figure 4.3: Increase in Unexpectedness for recommendation lists of size k = 5 for the MovieLens (ML) and BookCrossing (BC) data sets using different sets of expectations.
appropriate thresholds on the unexpectedness of individual items.
A particularly noteworthy observation, as demonstrated through the distribution
of unexpectedness across all the generated recommendation lists for the ML and BC
data sets in Fig. 4.2, is that the higher the cardinality and the better approximated
the sets of users’ expectations are, the greater the improvements against the baseline
methods. In principle, if no expectations are specified, the recommendation results
will be the same as the baseline method. The same pattern can also be observed in
Fig. 4.3 showing the cardinality of the set of user expectations along the vertical axis,
the increase in unexpectedness performance along the horizontal axis, and a linear
fit to the data for recommendation lists of size k = 5.10 This informal notion
of “monotonicity” of expectations is useful in order to achieve the desired levels of
unexpectedness. We believe that this pattern is a general property of the proposed
method, because of the explicit use of users’ expectations and the departure function,
and we plan to explore this topic as part of our future research.
10 We also tried higher-order polynomials, but they do not offer a significantly better fit to the data.
(a) MovieLens data set (b) BookCrossing data set
Figure 4.4: Post hoc analysis for Friedman’s Test of Unexpectedness Performance of different methods for the (a) MovieLens and (b) BookCrossing data sets.
To determine statistical significance, we have tested the null hypothesis that the
performance of each of the five lines of the graphs in Fig. 4.1 is the same, using the
Friedman test (nonparametric repeated-measures ANOVA) [Berry and Linoff, 1997]
and we reject the null hypothesis with p < 0.0001. Performing post hoc analysis on
Friedman’s Test results for the ML data set, the differences between the Baseline and
each one of the experimental settings, apart from the difference between the Baseline
and Heterogeneous Quadratic, are statistically significant.
tween Homogeneous Quadratic and Heterogeneous Linear, Homogeneous Linear and
Heterogeneous Quadratic, and Homogeneous Quadratic and Heterogeneous Quadratic
are statistically significant, as well. For the BC data set, the difference between the
Baseline and each one of the experimental settings is also statistically significant
with p < 0.0001. Moreover, the differences among Homogeneous Linear, Homoge-
neous Quadratic, Heterogeneous Linear, and Heterogeneous Quadratic, apart from
the difference between Homogeneous Linear and Homogeneous Quadratic, are also
statistically significant. Fig. 4.4 presents the box-and-whisker diagrams [Benjamini,
1988] displaying the aforementioned differences among the various methods.
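For reference, a minimal sketch of this testing procedure is given below; the thesis does not state which post hoc procedure was applied, so the Nemenyi test from the third-party scikit-posthocs package is shown here as one common choice, and the performance matrix is a random placeholder rather than the actual experimental results.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp    # third-party package, assumed available

# Rows are blocks (e.g., combinations of data subset and list size); columns are
# the five compared settings: Baseline, Hom-Lin, Hom-Quad, Het-Lin, Het-Quad.
perf = np.random.rand(30, 5)    # placeholder performance measurements

stat, p_value = friedmanchisquare(*(perf[:, j] for j in range(perf.shape[1])))
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4g}")

# Pairwise post hoc comparisons (Nemenyi) over the same blocked measurements.
print(sp.posthoc_nemenyi_friedman(perf))
```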
4.5.1.1 Qualitative Comparison of Unexpectedness
The proposed approach avoids obvious recommendations such as recommending
to a user the movies “The Lord of the Rings: The Return of the King”, “The Bourne
Identity”, and “The Dark Knight” because the user had already highly rated all the
sequels or prequels of these movies. Besides, the proposed method provides recom-
mendations from a wider range of items and does not focus mostly on bestsellers as
described in Section 4.5.4. In addition, even though the proposed method generates
truly unexpected recommendations, these recommendations are not irrelevant and
they still provide a fair match to the user’s interests.
Using the MovieLens data set and the (Base) sets of expected recommendations,
the baseline methods recommend to a user, who highly rates very popular Action,
Adventure, and Drama films, the movies “The Lord of the Rings: The Two Towers”,
“The Dark Knight”, and “The Lord of the Rings: The Return of the King” (user
id = 36803 with Matrix Factorization). However, this user has already highly rated
prequels or sequels of these movies (i.e., “The Lord of the Rings: The Fellowship
of the Ring” and “Batman Begins”) and, hence, the aforementioned popular recom-
mendations are expected for this specific user. On the other hand, for the same user,
the proposed method generated the following recommendations: “The Pianist”, “La
vita e bella”, and “Rear Window”. These movies are of high quality, unexpected,
and not irrelevant since they fairly match the user’s interests. In particular, based
on the definitions and mechanisms used to specify the user expectations as described
in Section 4.4.2.3, all these interesting movies are unexpected for the user since they
significantly depart from her/his expectations. Additionally, they are of great quality
in terms of the average rating, even though less popular in terms of the number of
ratings. Besides, these Biography, Drama, Romance, and Mystery movies are not
irrelevant to the user and they fairly match the user’s profile since they involve ele-
ments in their plot, such as war, that can also be found in other films that she/he has
already highly rated such as “Erin Brockovich”, “October Sky”, and “Three Kings”.
Finally, interestingly enough, some of these high quality, interesting, and unexpected
recommendations are also based on films by the same director as a film the user rated
highly (i.e., “Pinocchio” and “La vita e bella”).
Using the BookCrossing data set and the (Base+RI) set of expectations described
in Section 4.4.2.3, the baseline methods recommend to a user, who has already rated
a very large number of items, the following expected books: “I Know This Much Is
True”, “Outlander”, and “The Catcher in the Rye” (user id = 153662 with Matrix
Factorization). In particular, the book “I Know This Much Is True” is highly ex-
pected because the specific user has already rated and she/he is familiar with the
books “A Tangled Web”, “A Virtuous Woman”, “Thursday’s Child”, and “Drowning
Ruth”. Similarly, the book “Outlander” is expected because of the books “Dragonfly
in Amber”, “Enslaved”, “When Lightning Strikes”, “Touch of Enchantment”, and
“Thorn in My Heart”. Finally, the recommendation about the item “The Catcher
in the Rye” is expected since the user has highly rated the books “Forever: A Novel
of Good and Evil, Love and Hope”, “Fahrenheit 451”, and “Dream Country”. In
summary, all of the aforementioned recommendations are expected for the user be-
cause the recommended items are very similar to other books, which the user has
already highly rated, from the same authors that were published around the same
time (e.g., “I Know This Much Is True” and “A Virtuous Woman”, or “Outlander”
and “Dragonfly in Amber”, etc.), frequently bought together on popular websites
such as Amazon.com [Amazon, 2012] and LibraryThing [LibraryThing, 2012] (e.g., “I
Know This Much Is True” and “Drowning Ruth”, etc.), with similar library subjects,
plots and classifications (e.g., “The Catcher in the Rye” and “Dream Country”, etc.),
with similar tags (e.g., “The Catcher in the Rye” and “Forever: A Novel of Good and
Evil, Love and Hope”), etc. In spite of that, the proposed algorithm recommends
to the user the following books that significantly depart from her/his expectations:
“Doing Good”, “The Reader”, and “Tuesdays with Morrie: An Old Man, a Young
Man, and Life’s Greatest Lesson”. These high quality and interesting recommenda-
tions, even though unexpected to the user, are not irrelevant since they provide
a fair match to the user’s interests, as she/he has already highly rated books that
deal with relevant issues such as family, romance, life, and memoirs.
4.5.1.2 Comparison of Serendipity
Pertaining to the notion of serendipity as defined in [Ge et al., 2010], the results
are very similar to those obtained using the proposed measures of unexpectedness
and demonstrate that the proposed method outperforms the standard baselines in
most of the experimental settings.
In summary, we demonstrated in this section that the proposed method for unex-
pected recommendations effectively captures the notion of unexpectedness by providing
the users with interesting and unexpected recommendations of high quality that fairly
match their interests and, hence, outperforms the standard baseline methods in terms
of the proposed unexpectedness metrics.
4.5.2 Comparison of Rating Prediction
In this section, we examine how the proposed method for unexpected recommen-
dations compares with the standard baseline methods in terms of the classical rating
prediction accuracy-based metrics, such as RMSE and MAE. In typical offline experi-
ments, such as those presented here, the data is not collected using the recommender system
or method under evaluation. In particular, the observations in our test sets were not
based on unexpected recommendations generated from the proposed method.11 Also,
the user ratings had been submitted over a long period of time representing the tastes
of the users and their expectations of the recommender system at that specific point in
time that they rated each item. Therefore, in order to effectively evaluate the rating
and item prediction accuracy of our method, when we compute the unexpectedness
of item i for user u (see Section 4.3.3), we treat item i as not being included in the
set of expectations Eu for user u – whether it is included or not – and we compute the
distance of item i from the rest of the items in the set of expectations Eu^{-i}, where
Eu^{-i} := Eu \ {i}, to generate the corresponding prediction r_{u,i} (i.e., the estimated
utility of recommending the candidate item i to the target user u).
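A small sketch of this leave-one-out treatment is shown below; the item-vector representation and the distance_fn argument (for example, the average or centroid distance sketched earlier) are assumptions for illustration.

```python
import numpy as np

def departure_excluding_item(item_id, expectations_u, item_vectors, distance_fn):
    """Distance of item i from the expectation set with item i itself removed."""
    remaining = [j for j in expectations_u if j != item_id]
    if not remaining:
        return 0.0                          # no expectations left: no departure
    expected_vecs = np.stack([item_vectors[j] for j in remaining])
    return distance_fn(item_vectors[item_id], expected_vecs)
```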
Table 4.3 presents the results obtained by applying our method to the ML and
BC data sets using the different sets of expectations and baseline predictive methods.
The values reported are computed as the difference in average performance over the
different subsets, the different utility functions, two distance metrics, and three corre-
lation metrics. In Fig. 4.5, the bars labeled as Baseline represent performance of the
standard baseline methods. The bars labeled as Homogeneous Linear, Homogeneous
Quadratic, Heterogeneous Linear, and Heterogeneous Quadratic present the average
performance over the different subsets and sets of expectations, two distance met-
rics, and three correlation metrics, for the different experimental settings described
in Section 4.4.2.1. All the bars have been grouped by baseline algorithm (x-axis).
11 For instance, the assumption that unused items would not have been used even if they had been recommended is erroneous when evaluating unexpected recommendations (i.e., a user may not have used an item because she/he was unaware of its existence, but after the recommendation exposes that item the user can decide to select it [Shani and Gunawardana, 2011]).
Table 4.3: Average RMSE Performance for the MovieLens and BookCrossing Data Sets.

Data Set | Rating Prediction Algorithm | Expectations Set | Baseline | Homogeneous Linear | Homogeneous Quadratic | Heterogeneous Linear | Heterogeneous Quadratic
MovieLens | Matrix Factorization | Base | 0.7892 | 0.11% | 0.13% | 0.07% | 0.12%
MovieLens | Matrix Factorization | Base+RL | 0.7892 | 0.12% | 0.13% | 0.07% | 0.12%
MovieLens | Slope One | Base | 0.8242 | 0.29% | 0.29% | 0.43% | 0.43%
MovieLens | Slope One | Base+RL | 0.8242 | 0.29% | 0.29% | 0.43% | 0.42%
MovieLens | Item kNN | Base | 0.8093 | -0.01% | -0.01% | 0.00% | 0.01%
MovieLens | Item kNN | Base+RL | 0.8093 | -0.01% | -0.01% | 0.01% | 0.02%
MovieLens | User kNN | Base | 0.8160 | 0.01% | 0.01% | 0.03% | 0.04%
MovieLens | User kNN | Base+RL | 0.8160 | 0.01% | 0.01% | 0.03% | 0.04%
MovieLens | User Item Baseline | Base | 0.8256 | 0.01% | 0.00% | 0.04% | 0.05%
MovieLens | User Item Baseline | Base+RL | 0.8256 | 0.01% | 0.01% | 0.06% | 0.05%
MovieLens | Item Average | Base | 0.8932 | 0.01% | 0.00% | 1.26% | 1.52%
MovieLens | Item Average | Base+RL | 0.8932 | 0.02% | 0.01% | 1.29% | 1.57%
BookCrossing | Matrix Factorization | Base | 1.7882 | 0.28% | 0.35% | -0.35% | 0.02%
BookCrossing | Matrix Factorization | Base+RI | 1.7882 | 0.05% | -0.14% | -0.42% | 0.01%
BookCrossing | Matrix Factorization | Base+AS | 1.7882 | 0.01% | -0.14% | -0.46% | -0.01%
BookCrossing | Slope One | Base | 1.8585 | 3.43% | 3.52% | 2.58% | 3.12%
BookCrossing | Slope One | Base+RI | 1.8585 | 3.15% | 3.01% | 2.32% | 2.79%
BookCrossing | Slope One | Base+AS | 1.8585 | 3.21% | 3.04% | 2.37% | 2.91%
BookCrossing | Item kNN | Base | 1.6248 | 1.46% | 1.45% | -1.21% | -0.23%
BookCrossing | Item kNN | Base+RI | 1.6248 | 1.43% | 1.02% | -1.44% | -0.59%
BookCrossing | Item kNN | Base+AS | 1.6248 | 1.48% | 1.02% | -1.52% | -0.54%
BookCrossing | User kNN | Base | 1.7280 | 1.41% | 1.19% | -0.41% | 0.25%
BookCrossing | User kNN | Base+RI | 1.7280 | 1.44% | 0.99% | -0.66% | -0.02%
BookCrossing | User kNN | Base+AS | 1.7280 | 1.46% | 1.01% | -0.60% | 0.10%
BookCrossing | User Item Baseline | Base | 1.5779 | 2.48% | 2.34% | 0.21% | 0.99%
BookCrossing | User Item Baseline | Base+RI | 1.5779 | 1.93% | 1.77% | -0.14% | 0.68%
BookCrossing | User Item Baseline | Base+AS | 1.5779 | 1.98% | 1.78% | -0.14% | 0.71%
BookCrossing | Item Average | Base | 1.7615 | 0.07% | -0.10% | -0.17% | 0.50%
BookCrossing | Item Average | Base+RI | 1.7615 | -0.04% | -0.32% | -0.28% | 0.56%
BookCrossing | Item Average | Base+AS | 1.7615 | 0.01% | -0.41% | -0.35% | 0.50%
[Figure 4.5 shows, for (a) ML and (b) BC, the RMSE (y-axis) of the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings grouped by rating prediction algorithm (Matrix Factorization, Slope One, Item kNN, User kNN, User Item Baseline, and Item Average).]

Figure 4.5: RMSE performance for the (a) MovieLens and (b) BookCrossing data sets.
In the aforementioned tables and figures, we observe that the proposed method
performs at least as well as the standard baseline methods in most of the experimental
settings. In particular, for the ML data set the RMSE was on average reduced by
0.07% and 0.34% for the cases of the homogeneous and heterogeneous users. For the
BC data set, the RMSE was improved by 1.30% and 0.31%, respectively. The overall
minimum average RMSE achieved was 0.7848 for the ML and 1.5018 for the BC data
set.
Using the Friedman test, we have tested the null hypothesis that the performance
of each of the five lines of the graphs in Fig. 4.5 is the same; we reject the null
hypothesis with p < 0.001. Performing post hoc analysis on Friedman’s Test results,
for the ML data set only the difference between the Heterogeneous Quadratic and
Baseline is statistically significant for the RMSE accuracy metric. For the BC data
set, the differences between the Homogeneous Linear and Baseline, and Homogeneous
Quadratic and Baseline are statistically significant, as well. Fig. 4.6 presents the box-
and-whisker diagrams displaying the aforementioned differences among the various
methods.
(a) ML - RMSE (b) BC - RMSE
Figure 4.6: Post hoc analysis for Friedman’s Test of Accuracy Performance of different methods for the (a) MovieLens (ML) and (b) BookCrossing (BC) data sets.
In summary, we demonstrated in this section that the proposed method performs
at least as well as, and in some cases even better than, the standard baseline methods
in terms of the classical rating prediction accuracy-based metrics.
4.5.3 Comparison of Item Prediction
The goal in this section is to compare our method with the standard baseline
methods in terms of traditional metrics for item prediction, such as precision, recall,
and F1 score. Table 4.4 presents the results obtained by applying our method to
the MovieLens and BookCrossing data sets. The values reported are computed as
the difference in average performance over the different subsets, six algorithms for
rating prediction, two distance metrics, and three correlation metrics using the F1
score for recommendation lists of size k ∈ {1, 3, 5, 10, 30, 50, 100}. Respectively, Fig.
4.7 illustrates the average performance over the same dimensions for lists of size
k ∈ {1, 3, 5, 10, 20, . . . , 100}.
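For completeness, a minimal per-user computation of these item prediction metrics might look as follows; treating the held-out (e.g., highly rated) items of each user as the relevant set is an assumption for illustration.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Item prediction metrics for one user's top-k recommendation list."""
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```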
Table 4.4: F1 Performance for the MovieLens and BookCrossing Data Sets.

Data Set | User Expectations | Experimental Setting | k=1 | k=3 | k=5 | k=10 | k=30 | k=50 | k=100
MovieLens | Base | Homogeneous Linear | 5.00% | 4.29% | 7.10% | 9.54% | 8.15% | 6.17% | 5.57%
MovieLens | Base | Homogeneous Quadratic | 4.00% | 4.87% | 5.63% | 6.68% | 5.35% | 4.10% | 3.36%
MovieLens | Base | Heterogeneous Linear | 5.00% | 10.92% | 13.67% | 17.78% | 15.63% | 14.81% | 15.29%
MovieLens | Base | Heterogeneous Quadratic | 7.50% | 12.09% | 14.61% | 17.78% | 15.50% | 14.09% | 14.07%
MovieLens | Base+RL | Homogeneous Linear | 3.00% | 4.48% | 7.37% | 10.15% | 8.78% | 6.64% | 6.33%
MovieLens | Base+RL | Homogeneous Quadratic | 4.50% | 5.46% | 6.70% | 7.98% | 6.55% | 5.14% | 4.37%
MovieLens | Base+RL | Heterogeneous Linear | 4.00% | 10.33% | 12.87% | 16.39% | 14.57% | 13.81% | 14.80%
MovieLens | Base+RL | Heterogeneous Quadratic | 4.50% | 11.11% | 13.00% | 15.96% | 14.08% | 12.88% | 13.33%
BookCrossing | Base | Homogeneous Linear | 23.08% | 9.84% | 7.41% | 1.90% | 2.45% | 1.83% | 1.02%
BookCrossing | Base | Homogeneous Quadratic | 23.08% | 10.66% | 8.33% | 4.05% | 3.06% | 2.03% | 1.23%
BookCrossing | Base | Heterogeneous Linear | 12.50% | 6.56% | 9.26% | 4.29% | 2.24% | 2.43% | 1.84%
BookCrossing | Base | Heterogeneous Quadratic | 11.54% | 6.56% | 7.10% | 3.57% | 1.84% | 1.42% | 1.02%
BookCrossing | Base+RI | Homogeneous Linear | 29.81% | 13.52% | 8.02% | 2.14% | 2.65% | 2.23% | 2.04%
BookCrossing | Base+RI | Homogeneous Quadratic | 25.96% | 13.52% | 8.95% | 3.57% | 3.67% | 2.64% | 2.25%
BookCrossing | Base+RI | Heterogeneous Linear | 13.46% | 7.38% | 8.33% | 3.10% | 2.24% | 2.64% | 1.64%
BookCrossing | Base+RI | Heterogeneous Quadratic | 14.42% | 6.56% | 7.10% | 3.33% | 1.63% | 1.22% | 0.82%
BookCrossing | Base+AR | Homogeneous Linear | 22.12% | 6.15% | 4.32% | -0.48% | 1.02% | 0.81% | 1.02%
BookCrossing | Base+AR | Homogeneous Quadratic | 22.12% | 7.38% | 5.56% | 1.19% | 1.84% | 1.22% | 1.23%
BookCrossing | Base+AR | Heterogeneous Linear | 8.65% | 2.05% | 4.63% | 0.71% | 0.20% | 0.81% | 0.20%
BookCrossing | Base+AR | Heterogeneous Quadratic | 12.50% | 5.74% | 6.17% | 2.86% | 1.02% | 1.01% | 0.61%
In particular, for the MovieLens data set and the case of the homogeneous users
F1 score was improved by 6.14%, on average. In the case of heterogeneous customers
performance was increased by 13.90%. For the BookCrossing data set, in the case
of homogeneous users, F1 score was on average enhanced by 4.85% and, for hetero-
geneous users, by 3.16%. Table 4.4 shows that performance was increased both in
cases where the definition of unexpectedness was strict (i.e., Base) and in cases where
the definition was broader (i.e., Base+RL, Base+RI, and Base+AR). Additionally,
the experiments conducted using the more accurate sets of expectations based on
the information collected from various third-party websites (Base+RI) outperformed
those using the expected sets automatically derived by association rules (Base+AS).
To determine statistical significance, we have tested the null hypothesis that the
performance of each of the five lines of the graphs in Fig. 4.7 is the same using the
Friedman test. Based on the results we reject the null hypothesis with p < 0.0001.
Performing post hoc analysis on Friedman’s Test results for the ML data set, the
differences between the Baseline and each one of the experimental settings are sta-
tistically significant for the F1 score. For the BC data set, the differences between
the Baseline and each one of the experimental settings are also statistically signif-
icant.12 Even though the lines are very close to each other and the differences in
performance in absolute values are not large (e.g., Fig. 4.7e), the results are statisti-
cally significant since the performance of the proposed method is ranked consistently
higher than the baselines (lines do not cross). Fig. 4.8 presents the box-and-whisker
diagrams displaying the aforementioned differences among the various methods.
In conclusion, we demonstrated in this section that the proposed method for
unexpected recommendations performs at least as well as, and in some cases even
12 In the experiments conducted using the MovieLens data set, the difference between Homogeneous Quadratic and Baseline is statistically significant with p < 0.01.
[Figure 4.7 shows, for the panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the F1 score (y-axis) against the recommendation list size (x-axis) for the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings.]

Figure 4.7: F1 performance of different experimental settings for the (a), (b) MovieLens (ML) and (c), (d), (e) BookCrossing (BC) data sets.
(a) MovieLens data set (b) BookCrossing data set
Figure 4.8: Post hoc analysis for Friedman’s Test of F1 Performance of different methods for the (a) MovieLens and (b) BookCrossing data sets.
better than, the standard baseline methods in terms of the classical item prediction
metrics.
4.5.4 Comparison of Diversity and Dispersion
In this section we investigate the effect of the proposed method for unexpected
recommendations on coverage, aggregate diversity, and dispersion, three important
metrics for RSes [Ge et al., 2010; Adomavicius and Kwon, 2012; Shani and Gunawar-
dana, 2011].13 The results obtained using the catalog coverage metric [Herlocker et al.,
2004; Ge et al., 2010] (i.e., the percentage of items in the catalog that are ever recom-
mended to users: |∪_{u∈U} RS_u| / |I|) are very similar to those using the diversity-in-top-
N metric for aggregate diversity [Adomavicius and Kwon, 2011, 2012]; henceforth,
only results on coverage are presented. Table 4.5 presents the results obtained by
13 High unexpectedness of recommendation lists does not imply high coverage and diversity. For example, if the system recommends to all users the same k best unexpected items from the product base, the recommendation list for each user is unexpected, but only k distinct items are recommended to all users.
applying our method to the MovieLens and BookCrossing data sets. The values re-
ported are computed as the average catalog coverage over the different subsets, six
algorithms for rating prediction, two distance metrics, and three correlation met-
rics for recommendation lists of size k ∈ {1, 3, 5, 10, 30, 50, 100}. Fig. 4.9 presents
the average performance over the same dimensions for recommendation lists of size
k ∈ {1, 3, 5, 10, 20, . . . , 100}.
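For reference, the catalog coverage metric quoted above can be computed with a few lines of Python; the dictionary-of-lists input format is an assumption for illustration.

```python
def catalog_coverage(rec_lists, num_items):
    """Share of the item catalog that is recommended to at least one user."""
    recommended = set().union(*rec_lists.values()) if rec_lists else set()
    return len(recommended) / num_items
```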
As Table 4.5 and Fig. 4.9 demonstrate, the proposed method outperforms the
standard baselines in most of the experimental settings. As we can see, the experi-
ments conducted under the assumption of heterogeneous users exhibit higher catalog
coverage than those using a representative agent. This is an interesting result that
can be useful in practice, especially in settings with potential adverse effects of over-
recommending an item or very large catalogs. For instance, it would be profitable
for Netflix if the recommender system encouraged users to rent “long-tail” movies
because they are less costly to license and acquire from distributors than new-release
or highly popular movies of big studios [Goldstein and Goldstein, 2006]. Also, we can
observe that the smaller the size of the recommendation list, the greater the increase
in performance. In particular, as we see in Table 4.5, for the MovieLens data set the
average coverage was increased by 19.48% (39.10% for k = 1) and 37.40% (58.39%
for k = 1) for the cases of the homogeneous and heterogeneous users, respectively.
For the BookCrossing data set, in the case of homogeneous customers coverage was
improved by 9.26% (39.00% for k = 1) and for heterogeneous customers by 23.17%
(59.62% for k = 1), on average. Besides, the increase in performance is larger also
in the experiments where the sparsity of the subset of data is higher. In general,
coverage was increased in 95.68% (max = 55.74%) and 91.57% (max = 100%) of the
experiments for the MovieLens and BookCrossing data sets, respectively.
In terms of statistical significance, with the Friedman test, we have rejected the
Table 4.5: Coverage Performance for the MovieLens and BookCrossing Data Sets.

Data Set | User Expectations | Experimental Setting | k=1 | k=3 | k=5 | k=10 | k=30 | k=50 | k=100
MovieLens | Base | Homogeneous Linear | 38.58% | 37.05% | 35.15% | 28.35% | 16.27% | 12.38% | 7.70%
MovieLens | Base | Homogeneous Quadratic | 38.41% | 36.48% | 34.65% | 28.32% | 16.62% | 12.47% | 7.77%
MovieLens | Base | Heterogeneous Linear | 58.33% | 56.29% | 55.56% | 48.75% | 34.71% | 30.49% | 27.12%
MovieLens | Base | Heterogeneous Quadratic | 52.64% | 50.99% | 49.55% | 42.21% | 28.15% | 23.38% | 19.12%
MovieLens | Base+RL | Homogeneous Linear | 40.00% | 37.41% | 35.91% | 28.93% | 16.88% | 13.11% | 8.82%
MovieLens | Base+RL | Homogeneous Quadratic | 39.41% | 37.01% | 35.28% | 28.65% | 17.04% | 13.32% | 9.38%
MovieLens | Base+RL | Heterogeneous Linear | 63.43% | 62.77% | 61.29% | 53.80% | 39.09% | 34.62% | 30.60%
MovieLens | Base+RL | Heterogeneous Quadratic | 59.16% | 57.61% | 56.31% | 48.77% | 34.67% | 29.81% | 25.71%
BookCrossing | Base | Homogeneous Linear | 46.55% | 30.27% | 21.69% | 12.84% | 5.66% | 4.09% | 2.97%
BookCrossing | Base | Homogeneous Quadratic | 46.16% | 29.79% | 21.33% | 12.72% | 5.56% | 4.06% | 2.90%
BookCrossing | Base | Heterogeneous Linear | 56.77% | 40.50% | 31.45% | 22.71% | 16.96% | 17.67% | 20.31%
BookCrossing | Base | Heterogeneous Quadratic | 52.54% | 35.67% | 26.34% | 16.54% | 8.68% | 7.68% | 7.78%
BookCrossing | Base+RI | Homogeneous Linear | 36.60% | 23.92% | 17.31% | 10.84% | 5.19% | 4.67% | 5.52%
BookCrossing | Base+RI | Homogeneous Quadratic | 35.42% | 22.78% | 16.15% | 9.43% | 3.51% | 2.94% | 4.24%
BookCrossing | Base+RI | Heterogeneous Linear | 65.11% | 48.12% | 38.85% | 29.81% | 22.75% | 22.11% | 22.20%
BookCrossing | Base+RI | Heterogeneous Quadratic | 60.61% | 43.07% | 33.55% | 23.63% | 15.32% | 13.92% | 14.34%
BookCrossing | Base+AR | Homogeneous Linear | 35.26% | 21.74% | 15.19% | 8.80% | 2.84% | 1.97% | 1.36%
BookCrossing | Base+AR | Homogeneous Quadratic | 34.04% | 20.43% | 13.86% | 7.31% | 0.76% | -0.48% | -1.59%
BookCrossing | Base+AR | Heterogeneous Linear | 63.52% | 46.43% | 37.12% | 27.70% | 20.53% | 19.96% | 20.29%
BookCrossing | Base+AR | Heterogeneous Quadratic | 59.19% | 41.13% | 31.52% | 21.35% | 12.26% | 10.47% | 9.62%
[Figure 4.9 shows, for the panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the catalog coverage (y-axis) against the recommendation list size (x-axis) for the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings.]

Figure 4.9: Coverage performance of different experimental settings for the (a), (b) MovieLens (ML) and (c), (d), (e) BookCrossing (BC) data sets.
(a) MovieLens data set (b) BookCrossing data set
Figure 4.10: Post hoc analysis for Friedman’s Test of Coverage Performance of different methods for the (a) MovieLens and (b) BookCrossing data sets.
[Figure 4.11 shows, for (a) ML - Base+RL and (b) BC - Base+RI, the cumulative percentage of recommendations (y-axis) received by the bottom cumulative percentage of items (x-axis) for the Baseline and Unexpectedness methods, together with the line of perfect equality.]

Figure 4.11: Lorenz curves for recommendation lists of size k = 5 for the (a) MovieLens (ML) and (b) BookCrossing (BC) data sets.
null hypothesis (p < 0.0001) that the performance of each of the five lines of the graphs
in Fig. 4.9 is the same. Performing post hoc analysis on Friedman’s Test results, for
both the data sets the difference between the Baseline and each of the remaining
experimental settings is statistically significant (p < 0.001). Fig. 4.10 presents the
box-and-whisker diagrams displaying the aforementioned differences among the dif-
ferent methods.
The derived recommendation lists can also be evaluated for the inequality across
items, the dispersion of recommendations, using the Gini coefficient [Gini, 1909], the
Hoover (Robin Hood) index [Hoover, 1985], or the Lorenz curve [Lorenz, 1905]. In
particular, Fig. 4.11 uses the Lorenz curve to graphically represent the cumulative dis-
tribution function of the empirical probability distribution of recommendations; it is
a graph showing for the bottom x% of items, what percentage y% of the total recom-
mendations they have. As we can conclude from Fig. 4.11, in the recommendation lists
generated from the proposed method, the number of times an item is recommended
is more equally distributed compared to the baseline methods. Such systems provide
recommendations from a wider range of items and do not focus mostly on bestsellers,
which users are often capable of discovering by themselves. Hence, they are benefi-
cial for both users and some organizations [Brynjolfsson et al., 2003, 2011; Goldstein
and Goldstein, 2006]. Finally, the difference in increase in performance between Figs.
4.11a and 4.11b, 0.98% and 7.17% respectively in terms of the Hoover index, could
be attributed to both idiosyncrasies of the two data sets and the differences in defini-
tions and cardinalities of the sets of expected recommendations discussed in Section
4.4.2.3.
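A minimal sketch of these dispersion measures is given below; it assumes rec_counts contains one entry per catalog item (including zeros for items that are never recommended), which is an assumption for illustration.

```python
import numpy as np

def lorenz_curve(rec_counts):
    """Cumulative share of recommendations received by the bottom x% of items."""
    counts = np.sort(np.asarray(rec_counts, dtype=float))
    return np.insert(np.cumsum(counts) / counts.sum(), 0, 0.0)

def gini(rec_counts):
    """Gini coefficient of the distribution of recommendations over items."""
    counts = np.sort(np.asarray(rec_counts, dtype=float))
    n = counts.size
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * counts) / (n * counts.sum()) - (n + 1.0) / n

def hoover(rec_counts):
    """Hoover (Robin Hood) index: share of recommendations to redistribute for equality."""
    counts = np.asarray(rec_counts, dtype=float)
    shares = counts / counts.sum()
    return 0.5 * np.abs(shares - 1.0 / counts.size).sum()
```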
In summary, we demonstrated in this section that the proposed method for unex-
pected recommendations outperforms the standard baseline methods in terms of the
classical catalog coverage measure, aggregate recommendation diversity, and disper-
sion of recommendations.
4.6 Discussion of Unexpectedness
In this chapter, we proposed a method to improve user satisfaction by generating
unexpected recommendations based on the utility theory of economics. In particular,
we proposed and studied a new concept of unexpected recommendations as recom-
mending to a user those items that depart from what the specific user expects from
the recommender system. We defined and formalized the concept of unexpectedness
and discussed how it differs from the related notions of novelty, serendipity, and diver-
sity. Besides, we suggested several mechanisms for specifying the users’ expectations
and proposed specific performance metrics to measure the unexpectedness of recom-
mendation lists. After formally defining and formulating theoretically this concept,
we operationalized the notion of unexpectedness and presented a method for provid-
ing unexpected recommendations of high quality that are hard to discover but fairly
match user interests.
Moreover, we compared the generated unexpected recommendations with popular
baseline methods using the proposed performance metrics of unexpectedness. Our
experimental results demonstrate that the proposed method improves performance
in terms of unexpectedness while maintaining the same or higher levels of accuracy
of recommendations. Besides, we showed that the proposed method for unexpected
recommendations also improves performance based on other important metrics, such
as catalog coverage, aggregate diversity, and dispersion of recommendations. More
specifically, using different “real-world” data sets, various examples of sets of expected
recommendations, and different utility functions and distance metrics, we were able
to test the proposed method under a large number of experimental settings including
various levels of sparsity, different mechanisms for specifying users’ expectations, and
different cardinalities of these sets of expectations. As discussed in Section 4.5, all
the examined variations of the proposed method, including homogeneous and hetero-
geneous users with different departure functions, significantly outperformed in terms
of unexpectedness the standard baseline algorithms, including item-based and user-
based k-Nearest Neighbors, Slope One [Lemire and Maclachlan, 2007], and Matrix
Factorization [Koren et al., 2009]. This demonstrates that the proposed method in-
deed effectively captures the concept of unexpectedness since, in principle, it should
do better than unexpectedness-agnostic methods such as the classical Collaborative
Filtering approach. Furthermore, the proposed unexpected recommendation method
performed at least as well as, and in some cases even better than, the baseline al-
gorithms in terms of the classical accuracy-based measures, such as RMSE and F1
score.
One of the main premises of the proposed method is that users’ expectations
should be explicitly considered in order to provide the users with unexpected recom-
mendations of high quality that are hard to discover but fairly match their interests.
If no expectations are specified, the recommendation results will not differ from those
of the standard rating prediction algorithms in recommender systems. Hence, the
greatest improvements both in terms of unexpectedness and accuracy vis-a-vis all
other approaches were observed in the experiments using the sets of expectations
exhibiting larger cardinality (Base+RL, Base+RI, and Base+AS). These sets of ex-
pected recommendations allowed us to better approximate the expectations of each
user through a non-restricting but more realistic and natural definition of “expected”
items using the particular characteristics of the selected data sets (see Section 4.4.1).
Additionally, the experiments conducted using the more accurate sets of expectations
based on the information collected from various third-party websites (Base+RI) out-
performed those using the expected sets automatically derived by association rules
(Base+AS). Also, the fact that the proposed method delivers unexpected recom-
mendations of high quality is depicted in the small differences between the proposed
metric of unexpectedness (Eq. 4.12) and the adapted metric of serendipity (Eq. 4.13).
The assumption of heterogeneous users allowed for better approximation of users’
preferences at the individual level, while the extended set of expected movies allowed
us to better approximate the expectations of each user through a more realistic and
natural definition of closely “related” items.
Moreover, the standard example of a utility function that was provided in Sec-
tion 4.3.2 illustrates that the proposed method can be easily used in existing recom-
mender systems as a new component that enhances unexpectedness of recommenda-
tions, without the need to modify the current rating prediction procedures. Further,
since the proposed method is not specific to the examples of utility functions and sets
of expected recommendations that were provided in this work, we suggest adapting
the proposed method to the particular recommendation applications, by experiment-
ing with different utility functions, estimation procedures, and sets of expectations,
exploiting the domain knowledge.
As a part of the future work, we would like to conduct live experiments with real
users for evaluating unexpected recommendations and analyze both qualitative and
quantitative aspects in a traditional on-line retail setting as well as in a platform for
massive open on-line courses [Adamopoulos, 2013b]. Also, we would like to further
evaluate the proposed approach and mechanisms specifying the user expectations us-
ing different mechanisms for the training and test data. Moreover, we would like
further explore the notion of “monotonicity” introduced in Section 4.5.1 with the
goal of formally and empirically demonstrating this effect. Further, we assumed in
all the experiments reported in this chapter that a recommendation can be either
expected or unexpected. We plan to relax this assumption in our future experiments
using the proposed definition and metrics of unexpectedness. Besides, we would also
like to introduce and study additional metrics of unexpectedness and further inves-
tigate how the different existing recommender system algorithms perform in terms
of unexpectedness vis-a-vis other popular properties of recent systems. Also, we as-
sume an application scenario where the items that the user has already chosen are
not recommended again. However, our method can be easily adapted to application
scenarios such as location recommendation systems where it might be useful to rec-
ommend venues familiar to the user or places that she/he visits periodically. For
instance, such a set of expectations could also take into consideration the distance
of the user from each venue and for how long she/he has not been to that venue or
similar ones, while adapting to different contexts and evolving with the time. Finally,
future system implementation using the proposed method might also allow the user
to explicitly control the different parameters in the proposed model so that individual
desired levels of unexpectedness can be obtained.
CHAPTER V
The Business Value of Recommendations:
Evidence from a City Guide Mobile Application
5.1 Introduction to Business Value of Recommendations
Mobile devices have become a major platform for information as consumers spend
an increasing amount of time with mobile devices [Nielsen, 2014] and use them more
often to search for products [Fargo, 2014]. Mobile devices have also been driving
the fast growth in e-commerce sales whereas, at the same time, the contribution of
traditional shopping channels has been declining [eMarketer, 2013]. These trends are
mainly attributed to smartphone users and are expected to continue for several years.
It is projected that by 2019 the number of mobile shoppers will reach 213.7 million
(from 125 million in 2013) with 87% of smartphone users shopping online using their
mobile device (compared to 75% in 2013) [eMarketer, 2015], while the number of
search queries will almost double and the amount of sales will triple [UBS, 2015].
At the same time, despite the already widespread penetration of mobile devices
and the recent advances of technology, information overload problems are still more
acute in such platforms, compared to desktops, due to various technical characteristics
and idiosyncrasies of mobile devices. These unique characteristics include, among
others, the distinct human-computer interaction, the increased impact of the external
environment, the differences in behavioral characteristics of mobile users, context of
usage, the smaller screen size of mobile devices, etc. [Ghose et al., 2012a; Ricci,
2010]. Recommender system (RS) techniques though offer the potential to further
increase the usability of mobile devices and alleviate some of the implications of
the aforementioned idiosyncrasies by providing more focused content and effectively
limiting the negative effects of information overload. Given the significance of mobile
platforms and the emerging opportunities of recommender systems, it is of paramount
importance to measure and better understand any differences in effectiveness among
various recommendation types and algorithms in a mobile context. Similarly, in
order to better leverage the benefits of RSes, it is also important to understand the
differences in the effectiveness of recommendations across the various candidate items
and recommendation settings and thus to examine the moderating effect of various
item attributes and contextual factors in the mobile channel.
However, despite the increasing prevalence and importance of electronic com-
merce, mobile devices, and smartphone applications, there has been scant academic
research regarding the economic impact of RSes, especially in the context of mobile
recommendations. This is mainly due to the inherent difficulty of measuring the
economic impact of RSes, the limited availability of appropriate data sets, and the
increasingly important privacy concerns that RSes and location-based services raise
[Krumm, 2009; Riboni et al., 2009]. Therefore, even though the impact of recommen-
dations on user behavior and economic demand and especially their corresponding
effects when using a mobile platform is a promising field of research, our understand-
ing of how various types of recommendations in the mobile context may affect the
demand levels for individual products is rather limited yet. For instance, 78% of
marketers cite lack of such knowledge regarding personalization techniques as a bar-
131
rier to their adoption in mobile settings as well as to successful implementations of
marketing strategies [Econsultancy.com, 2013].
In this study, we measure the effectiveness of recommendations as the increase in
demand for the recommended candidate items. In particular, we employ econometric
techniques in order to estimate the impact of recommendations on consumers’ utility
and real-world demand, using an observational study with actual data corresponding
to all the users of a popular real-world mobile application. The main contributions
of this study are the following. First, we measure the effectiveness and economic
impact of various types of real-world recommendations in the mobile settings, based
on a structural method following discrete-choice models of product demand that have
a long history in econometrics (e.g., [McFadden, 1980]). Second, we facilitate the
estimation of causal effects in the presence of endogeneity (a common issue in RSes,
targeted advertising, etc.) using machine learning methods. In particular, we leverage
an exogenous shock to the recommendation process and extend the family of the
popular BLP-style instruments of item “isolation” and differentiation [Berry et al.,
1995] to the latent space through using deep learning techniques that, instead of
commonly treating the individual words of user-generated texts as unique symbols
without meaning, reflect semantic and syntactic similarities and differences among
words and phrases. Third, we discover significant new findings that extend the current
knowledge regarding the heterogeneous impact of RSes, reconcile contradictory prior
findings in the related literature, and draw significant business implications.
Our main results show that an increase by 10% in the number of recommenda-
tions raises demand by about 7.1%. This effect is both statistically and economically
significant and can have greater impact on demand than specific item attributes. Our
findings also highlight the importance of “in-the-moment” marketing and recommen-
dations on the mobile world [Oliver et al., 1998]. In particular, we find that trending
132
recommendations have a much stronger effect on consumers’ choices compared to tra-
ditional recommendations in a mobile setting. This effect is relatively stable across
various levels of popularity, whereas traditional recommendations contribute to the
“rich-get-richer” problem. We also find significant differences in effectiveness among
various types of traditional recommendations. Finally, we also examine various mod-
erating effects of item attributes and contextual factors. We find that in our empirical
setting the effectiveness of recommendations increases during holidays and with bet-
ter weather conditions while more expensive alternatives better leverage the effect of
recommendations. Besides, we find that recommendations simply based on just the
novelty of each alternative do not have a significant effect but novel alternatives accrue
greater benefits from recommendations when item attributes, such as the quality of
the alternative, are also taken into consideration by the recommendation algorithm.
The rest of the chapter is organized as follows. In Section 5.2, we discuss the
relevant literature on recommender systems and mobile platforms. In Section 5.3, we
provide an overview of the employed data set and application domain. This is followed
by a description of the methods used to estimate the effects of recommendations in
Section 5.4 and the employed deep-learning techniques for econometric instruments
in Section 5.5. We then report the results of our empirical study in Section 5.6.
The chapter concludes with a discussion of the findings and limitations as well as an
overview of directions for future research in Section 5.8.
5.2 Literature Review and Research Question
Our work is related to several streams of research, including mobile consumer
behavior, desktop and mobile recommender systems, and effects of recommendations
on sales distribution. In the next paragraphs, due to space limitations, we focus on
the most relevant topics and the corresponding works. For a rigorous review of the
related work in RSes, please see, for instance, [Adomavicius and Tuzhilin, 2005; Li
and Karahanna, 2015; Ricci, 2010; Xiao and Benbasat, 2007].
Studying the effects of desktop recommender systems on aggregate demand and
markets, Fleder and Hosanagar [2009] show analytically that RSes can lead to a reduc-
tion in aggregate sales diversity, creating a rich-get-richer effect for popular products
and preventing what may otherwise be better consumer-product matches. However,
Brynjolfsson et al. [2011] provide empirical evidence that RSes are associated with
an increase in niche products, reflecting lower search costs in addition to the in-
creased product availability and corroborating the findings of [Pathak et al., 2010]
regarding the heterogenizing effects of RSes. Nevertheless, Hosanagar et al. [2013]
study whether RSes are fragmenting the online population and find that users widen
their interests, which in turn creates commonality with others instead of heterogeniz-
ing users. In this study, we focus especially on recommender systems in the mobile
context and demand levels for individual products rather than effects on aggregate
demand at the market level.
Focusing on the impact of desktop recommender systems on demand levels for indi-
vidual products, Oestreicher-Singer and Sundararajan [2012b] study how the explicit
visibility of related-product networks can influence the demand for products in such
networks and find that complementary products have significant influence on each
other’s demand. They also find that newer and more popular products benefit more
from the attention they garner from their network position in such related-product
networks. In contrast, Chen et al. [2004] find that such network-based recommenda-
tions in a desktop setting are more effective for less-popular books. Similarly, Pathak
et al. [2010], examining a desktop recommender for item-to-item networks of books
and focusing on 156 top-selling books on Amazon.com, find that the impact of the
strength of recommendations on sales rank is moderated by the recency effect. Even
though the majority of studies have focused on the effects of RSes at the market level
and item-to-item networks of hyperlinked products in the non-mobile context, there
are also empirical studies examining the effect of various types of recommendations
on demand levels for individual products. Lee and Benbasat [2010] conduct a lab
experiment with 43 subjects and find that RSes reduce users’ perceived effort and
increase the accuracy of their decisions, while their findings support the notion that RSes should be designed to fit the task undertaken by the user. Besides, Tintarev et al. [2010]
conduct a user study with 21 subjects and find that RSes can increase the demand
levels, especially for long tail items. In addition, Jannach and Hegelich [2009] present
a case study evaluating the effectiveness of item recommendations for mobile apps
(either paid or free apps) in different navigational situations. However, Jannach and
Hegelich [2009] focus only on the most frequent users and employ only traditional RS
algorithms, while consumers’ utility and willingness to pay as well as economic de-
mand are out of the scope of that case study. In this study, we focus on the assessment
of the impact of real-world recommendations on sales and the utility of consumers
in the context of mobile recommendations, based on actual data corresponding to all
the users of a popular real-world mobile application. More specifically, we examine:
RQ: What is the relative economic effectiveness of recommendations on real-world
demand for individual items in the mobile context?
In addition to the main effect of the various types of RSes on the demand levels
for individual products, we also examine specific moderating effects in order to gain
a more detailed understanding of the effectiveness of the various types of recommen-
dations in a mobile setting. In particular, we examine whether the popularity, price
or novelty of an alternative moderate the effectiveness of different types of mobile
recommendations. In addition to such item attributes, we also study whether mar-
keting promotions and context moderate the effectiveness of RSes. Prior research in
RSes has examined other specific moderators such as product type, product com-
plexity, and product novelty. In particular, Senecal and Nantel [2004] examine the
moderating effect of product type and find that recommendations are more effective
for experience products compared to search goods. Fasolo et al. [2005] examine the
effect of product complexity and find that consumers using RSes engaged in more
information search and were less confident in their product choices for higher product
complexity. Finally, in prior research Ekstrand et al. [2014] and Matt et al. [2014] find
that novelty has a significant negative effect on consumers’ satisfaction and perceived
enjoyment whereas Vargas and Castells [2011] argue that novelty is a key positive
quality of recommendations in real scenarios.
Our work is also related to the extant literature in mobile consumer behavior. In
the context of mobile advertising, Kannan et al. [2001] propose that mobile advertising
is likely to significantly increase the frequency of impulse purchases due to the instant
gratification and the immediate need fulfillment enabled by the medium. Examining
mobile coupons, Danaher et al. [2015] find that how long coupons are valid can
influence redemption rates in a mobile setting as consumers redeem mobile coupons
much faster than traditional coupons and do not usually store them for future use, while Fong et al. [2015] show that mobile coupons are more effective when they are
sent to consumers close to the consumption time. Furthermore, Panniello et al. [2016]
examine how contextual information affects customer trust, sales, and other business
performance metrics in the context of an e-commerce website by conducting A/B
testing with the customers of that website. Finally, Andrews et al. [2015] examine
the moderating effect of context on mobile ad effectiveness and illustrate the impact
of physical crowdedness on consumer response to mobile ads. In addition, our study is
also related to online-to-offline (O2O) commerce as we examine the impact of online
recommendations on real-world (offline) demand [Rampell, 2010].
Finally, our work is also related to the stream of literature that integrates machine
learning and data mining with econometric techniques. In particular, in this study, we
employ deep learning methods to introduce new machine learning-based econometric
instruments that extend a popular family of instruments from the observed product
characteristics space to the latent space. Such a machine learning-based approach
has the potential to generate more appropriate instruments as well as to leverage the
abundance of user-generated content when structured product attributes are either
not available or not sufficient. Hence, our work is also related to the extant literature
in Information Systems that employs text-mining, sentiment-analysis, and other data
mining methods with user-generated content in empirical econometric studies (e.g.,
[Archak et al., 2011; Ghose and Ipeirotis, 2011; Ghose et al., 2012b; Goes et al., 2014;
Goh et al., 2013; Netzer et al., 2012; Tirunillai and Tellis, 2014]).
5.3 Mobile Recommendations and Data
Our data set is from a mobile platform that identifies and recommends interesting
events and places. In aggregate, our data set includes 12,119 venues and the corre-
sponding visits of several million active users from February 2015 until March 2015;
the maximum number of total visits to a single venue in our data set is 121,524. In
particular, our data set includes all the restaurants in the mobile urban guide app
for the 10 most popular cities (in terms of population) in the United States. Table
5.1 shows the specific cities that are included in our data set and the corresponding
number of venues in each city and Figure 5.1 shows the locations of venues for three
of these cities.
Figure 5.1: Venues included in the panels of (a) Chicago, (b) New York City, and (c) San Francisco in our data set (panels plot venue locations by longitude and latitude).
Table 5.1: US Cities included in Analysis and Corresponding Number of Venues.

City  State  Venues
Austin  TX  788
Chicago  IL  1,585
Dallas  TX  651
Houston  TX  1,042
Los Angeles  CA  953
New York  NY  3,811
Philadelphia  PA  661
San Antonio  TX  712
San Diego  CA  747
San Francisco  CA  1,169
The dependent variable (DV) of our analysis corresponds to the total number of
visits to a particular venue in a single time period. The independent variables (IVs)
of interest include various types of recommendations, such as traditional recommen-
dations based on past historical trends and data. In particular, the different types
of recommendation include recommendations of venues based on the total number of submitted positive user-generated reviews, positive ratings from the users, etc. (i.e., ‘quality recommendations’), as well as recommendations of businesses that are endorsed through reviews by famous brands and experts (i.e., ‘expert recommendations’). They also include recommenda-
tions for novel venues, as alternatives that opened recently are also recommended to
the users (i.e., ‘novel recommendations’). In addition, the mobile application also rec-
ommends to the users venues that have scheduled upcoming events for customers (i.e.,
‘event recommendations’). The IVs also include whether a venue is recommended as
a “trending” venue (i.e., ‘trending recommendations’). This type of recommendations
takes into consideration the latest trends for the specific alternatives by explicitly discounting past historical trends and data and by leveraging available information that captures current trends, based on the normalized relative differences in item attributes (e.g., number of photos) during the most recent time periods, so that interesting alternatives, and not only the most popular ones, are recommended to the users. This type of recommendation is designed to take advantage of consumers' higher involvement with mobile devices and to leverage the differences in consumption patterns (e.g., more instantaneous and less planned behaviors) on mobile platforms; from a business perspective, it captures trends related to “in-the-moment” mar-
keting. Apart from the distinction of recommendations between the different types
(i.e., ‘quality’, ‘expert’, ‘novel’, ‘event’, and ‘trending’ recommendations), all the rec-
ommendations are seemingly similar as they are presented to the users in the same
way, even though they are generated as described above. Moreover, all the recom-
mendations are generated before the realization of the demand for each alternative.
Besides, all the recommendation lists are explicitly diversified by the RS algorithms
[Ziegler et al., 2005] enhancing the exogeneity of recommendations (see Figure 5.2
for correlation and potential endogeneity), while the ‘trending’ recommendations also
exhibit higher levels of temporal diversity [Lathia et al., 2010]. In other words, the
diversification process provides an exogenous shock to the recommendation process.
For each of the types of recommendations and time period, our data set includes the
relative number of times a venue was recommended to the users as well as statistics
(e.g., average) of the ranking of the venue in the generated recommendation lists.
Moreover, the independent variables of interest also include how many brands and
experts have endorsed each venue and whether there are (and how many) scheduled
upcoming events in this specific venue. This information is available for all businesses
and not only the recommended venues.
Additionally, even though the employed estimation method (see Section 5.4) does
not require the researcher to observe all the relevant product characteristics,1 our data
1 The demand equation discussed in Section 5.4 is given an explicit structural interpretation accommodating unobserved demand factors.

Figure 5.2: Correlation of Main Variables Employed in Econometric Specifications for Business Value of Recommendations.
set includes a large number of contextual variables and item attributes. In particular,
our data set includes the exact location of the venue, the number of user-generated
reviews for the venue, the average numerical rating of the venue and the corresponding
number of ratings submitted by users, the number of photos uploaded by users for
this venue, the price tier of the venue (i.e., from 1 to 4 with 1 corresponding to
least pricey), whether the venue is part of a chain, the categories that have been
applied to this venue (e.g., American restaurant, Vegetarian) by users, whether a
marketing promotion is taking place, whether the venue offers breakfast, brunch,
lunch, dinner, alcohol, delivery, or take-outs, whether the venue takes reservations,
whether it accepts credit cards, whether there is live music, DJ, TVs, Wi-Fi, outdoor
seating, and parking availability, when the venue opened and was first introduced in
the platform, as well as a description of the venue and the user-generated reviews.
We further supplement our data set with additional contextual variables. In par-
ticular, for each one of the venues in our data set we also include climate data (e.g.,
temperature and precipitation) from the National Center for Environmental Infor-
mation (NCEI) of the National Oceanic and Atmospheric Administration (NOAA),
geospatial data (e.g., elevation level) from the Consultative Group for International
Agricultural Research (CGIAR) Consortium for Spatial Information (CSI), as well as
U.S. rental prices (e.g., median rental price at this location) from the U.S. Census
Bureau, 2009-2013 5-Year American Community Survey, and Zillow Group. Besides,
we also include calendar data with information about bank holidays, etc. Figure 5.2
shows the correlation matrix of the variables of main interest.
One of the main advantages of the presented study is the use of actual data
corresponding to all the available venues and all the users of a popular real-world
application and not only to a specific sub-population (e.g., most frequent users, users
opting in either online or offline surveys, etc.). In addition, from a methodological
perspective, a couple of other important advantages are that we can easily quantify the
quality of all the alternatives and their marketing promotions and that the recom-
mendations are characterized by high levels of both intra-list and temporal diversity
as discussed in this section. Besides, our data set includes observations corresponding
to multiple cities and several time periods.
5.4 Empirical Method and Models
In this section, we discuss the econometric structural model we apply to estimate
the utility of consumers regarding the different alternatives and the corresponding
effect of the various types of recommendations. In a nutshell, each consumer selects
the alternative that gives her/him the highest utility while the utility of consumers
depends on the alternative characteristics, specific contextual factors, whether the
particular alternative is recommended by the mobile application, as well as individual
taste parameters. The alternative (market) shares are then derived as the aggregate
outcome of individual consumer decisions and the utility parameters are inferred
based on consumer decisions.
In particular, there are R markets (i.e., cities) with Nr alternatives (i.e., venues)
in market r. For each alternative j in market r and time period (i.e., day) t, the
observed characteristics are denoted by vector $z_{jrt} \in \mathbb{R}^{K_z}$, contextual factors by vector $w_{jrt} \in \mathbb{R}^{K_w}$, and recommendation types by vector $\rho_{jrt} \in [0,1]^{K_\rho}$; for simplicity $z_j$, $w_j$, and $\rho_j$, respectively. The elements of $z_j$, $w_j$, and $\rho_j$ combined include attributes $x_j$ (e.g., quality, frequency of each type of recommendation, temperature) that affect the demand levels $q_{jrt}$ (i.e., number of visitors); for simplicity $q_j$. The unobserved characteristics (e.g., perception of status) of alternative $j$ are denoted by $\xi_j$. The utility $u_{ij}$ of user $i$ for alternative $j$ depends on the characteristics of the alternative and the user as well as the price $p_j$.
In addition to the competing venues j = 1, . . . , N , we also model the existence of
an outside option, j = 0. This outside option corresponds to alternatives that might
not be present in our data set or the option of a user not visiting any venue at all
in time period t. Consumers may choose to select the outside option instead of the
N “inside” alternatives; the mean utility value of the outside option is normalized to
zero.
Following the standard assumption of consumer rationality for utility maximiza-
tion (i.e., the consumer chooses the alternative that maximizes utility surplus) and
assuming that εij, which captures user-specific taste parameters, follows an extreme
value distribution and no random coefficients, the probability that a user i chooses
alternative j is [McFadden, 1980]:
$$\Pr(\text{choice}_{ij}) = \frac{e^{u_{ij}}}{\sum_{k=0}^{N} e^{u_{ik}}} = \frac{e^{\beta x_j - \alpha \rho_j + \xi_j}}{1 + \sum_{k=1}^{N} e^{\beta x_k - \alpha \rho_k + \xi_k}}, \qquad (5.1)$$

$\forall k$ in the same market $r$ and $k \neq j$.
The market share $s_{jrt}$, for simplicity $s_j$, of each alternative is then calculated as $s_j = q_j / M_r$, where $M_r$ is the total market size for the corresponding city (i.e., market) $r$. This market size $M_r$ is set to the maximum number of unique active users that has ever been observed in the mobile application for that city. Alternatively, the market
size could be assumed to be the population of each city or the number of households
instead of individuals; the results remain qualitatively the same. Inverting the market
share equation and taking the logarithm in Eqn. (5.1), the market share of alternative
j is:
$$\ln(s_j) - \ln(s_0) = \beta x_j - \alpha \rho_j + \xi_j. \qquad (5.2)$$
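For completeness, the inversion behind Eqn. (5.2) can be sketched as follows, writing $\delta_j$ for the mean utility of alternative $j$ (a shorthand introduced here): since the mean utility of the outside option is normalized to zero,

$$s_j = \frac{e^{\delta_j}}{1 + \sum_{k=1}^{N} e^{\delta_k}}, \qquad s_0 = \frac{1}{1 + \sum_{k=1}^{N} e^{\delta_k}}, \qquad \delta_j = \beta x_j - \alpha \rho_j + \xi_j,$$

so that $s_j / s_0 = e^{\delta_j}$ and taking logarithms yields Eqn. (5.2).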
Additionally, if we assume that user tastes are correlated across alternatives and
group the alternatives into G exhaustive and mutually exclusive sets, g = 1, . . . , G,
the market share of alternative j is [Cardell, 1997]:
$$\ln(s_j) - \ln(s_0) = \beta x_j - \alpha \rho_j + \sigma \ln(s_{j/g}) + \xi_j, \qquad (5.3)$$
where sj/g is the market share of alternative j as a fraction of the total group (nest)
share and j is in group g. As the parameter σ approaches one, the within group
correlation of utility levels goes to one, and as σ approaches zero, the within group
correlation goes to zero. In the empirical section of our study, we estimate the pro-
posed model both with and without assuming that user tastes are correlated across
alternatives.
In other words, using demand-estimation approaches from economics, we estimate
the weights that consumers (implicitly) assign to alternative characteristics, recom-
mendations, and contextual factors, as well as the sensitivity of consumers to changes
in these factors and characteristics. This is done by inverting the function defining
market shares to uncover the utility levels of the alternatives and relating these utility
levels to alternative characteristics, recommendations, and contextual factors. Then,
based on these estimates, we derive the utility gain that each type of recommendations
generates. The employed methodology estimates any effects in a privacy-preserving
manner, as it does not require individual consumer data but only aggregate data and
statistics, even though it is a model of individual behavior.
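To make these mechanics concrete, the following minimal sketch (not the code used in this study) illustrates how the estimating equation in Eqn. (5.2) could be taken to an aggregate venue-day panel with standard linear methods; the file name and column names (e.g., visits, market_size, rec_trending) are hypothetical placeholders, and the actual specifications additionally include the nesting term, richer controls, fixed effects, and the instruments discussed below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical aggregate panel: one row per venue and day.
df = pd.read_csv("venue_day_panel.csv")

# Market shares and the log share ratio of Eqn. (5.2); the outside option
# absorbs the remaining mass of each city-day market.
df["share"] = df["visits"] / df["market_size"]
outside = 1.0 - df.groupby(["city", "date"])["share"].transform("sum")
df["log_share_ratio"] = np.log(df["share"]) - np.log(outside)

# Linear estimation with market- and category-level dummies, clustered by venue.
model = smf.ols(
    "log_share_ratio ~ rec_trending + rec_quality + rec_expert + rec_event"
    " + rec_novel + price + rating + C(city) + C(category)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["venue_id"]})
print(model.summary())
```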
Apart from the benefits discussed in the previous section (e.g., privacy, real data
and users, popular real-world application, quantifiable quality, exogenous variation,
etc.), this method also allows for unobserved product characteristics, including
determinants that are difficult to measure (e.g., consumers’ perceptions about sta-
tus). In particular, we can consistently obtain the effects of interest even if other
covariates are unobserved or endogenous, and we have no outside instruments for
them [Crawford, 2012]. Besides, the resulting model can make predictions not only
for the existing alternatives under different conditions and contextual factors but also
for new alternatives that might not be included in our data set.
5.4.1 Identification Strategy
We can treat Eqn. (5.3) as an estimation equation, treating ξj as an unobserved
error term, and use typical econometric techniques in order to estimate the unknown
parameters. In particular, we employ panel data techniques in order to control also
for unobserved confounders, in addition to the extensive set of observed confounders,
while we also leverage the within-alternative variation in our data set (i.e., temporal
diversity of recommendations – see Section 5.3). In addition, our econometric specifi-
cations alleviate common endogeneity issues in prices, for instance, as we control for both the quality of products and marketing promotions, as well as in the recommendations themselves, since the recommendations are explicitly diversified by the recommender
system [Ziegler et al., 2005]. Nevertheless, even though we are interested in the relative
differences of effectiveness across the various types of recommendations in a mobile
context and we leverage the exogenous variation in the recommendation mechanisms,
while we also control for numerous confounders, including quality and marketing pro-
motions, we also explore as a robustness check the possibility that different variables in
our econometric specifications are endogenous. Hence, in our empirical study, we also
use traditional instruments, such as rental prices, degree of competition, and specific
variables that are used in the algorithms to generate the recommendations but do not
affect the utility of the alternatives for consumers in the current time period given
the observed confounders (e.g., lagged variables corresponding to number of photos in
previous time periods) [Sweeting, 2013], as well as a novel instrument derived from a
metric of alternative differentiation and isolation based on a machine learning model
of the user-generated reviews employing deep-learning techniques. The motivation
for the latter instrument is twofold: first, it is similar to the BLP-style instruments
[Berry, 1994; Berry et al., 1995], which measure the isolation in product characteris-
tics space as products (or alternatives) that are more isolated are related to higher
margins; second, it is related to whether an alternative is recommended by the mo-
bile application as alternatives that are more isolated have higher likelihood of being
included in a recommendation list because of the diversification process of the em-
ployed recommendation algorithms. As discussed in the following section, these new
machine learning-based econometric instruments extend the popular family of BLP-
style instruments from the observable space of product characteristics to the latent
space.
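As an illustration of how such instruments can be used, the sketch below runs a simplified two-stage least squares specification with the linearmodels package; the column names (e.g., latent_isolation for the embedding-based differentiation and isolation measure, rental_price, hausman_price) are hypothetical placeholders, and the specification is far sparser than the ones reported later in this chapter.

```python
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import IV2SLS

df = pd.read_csv("venue_day_panel.csv")  # hypothetical aggregate panel, as above

# Treat price and the recommendation frequency as potentially endogenous and
# instrument them with rental prices, Hausman-style prices of other venues,
# and the latent-space isolation measure described in Section 5.5.
dependent = df["log_share_ratio"]
exog = sm.add_constant(df[["rating", "n_reviews", "n_photos"]])
endog = df[["price", "rec_frequency"]]
instruments = df[["rental_price", "hausman_price", "latent_isolation"]]

iv_res = IV2SLS(dependent, exog, endog, instruments).fit(
    cov_type="clustered", clusters=df["venue_id"]
)
print(iv_res.summary)  # first-stage and overidentification diagnostics can also be inspected
```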
5.5 Deep-Learning Model of User-Generated Reviews
In order to implement these machine learning-based instruments that measure the
differentiation of alternatives in the latent space of products, we use an efficient and
state-of-the-art method based on deep-learning techniques [Le and Mikolov, 2014;
Mikolov et al., 2013b]. In particular, instead of treating the individual words of
user-generated texts as unique symbols (tokens) without meaning, as in common
bag-of-words and bag-of-n-grams models, the employed method reflects semantic and
syntactic similarities and differences among words and phrases by representing them
as dense vectors, usually referred to as “neural embeddings”. The common paradigm
for deriving such representations is based on the distributional hypothesis of Harris
[1954] that words in similar contexts have similar meanings. In essence, the continuous
space representations are learned in an unsupervised fashion by trying to maximize
the dot-product between the vectors of frequently occurring word-context pairs and
minimize it for random word-context pairs [Levy and Goldberg, 2007]. These contin-
uous space representations can be used with any standard distance metrics in order
to measure the differentiation of alternatives in the latent space.
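For concreteness, one common form of this training criterion, the skip-gram objective with negative sampling used in the word2vec family of models (the symbols $D$, $k$, and $P_n$ are notation introduced here), is

$$\sum_{(w,c) \in D} \Big[\, \log \sigma(\vec{w} \cdot \vec{c}) \;+\; \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\big[\log \sigma(-\vec{w} \cdot \vec{c}_i)\big] \Big],$$

which is maximized over the word vectors $\vec{w}$ and context vectors $\vec{c}$; here $\sigma$ is the logistic function, $D$ is the set of observed word-context pairs, and $P_n$ is a noise distribution from which the $k$ random (negative) contexts are drawn.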
The neural network based word vectors are usually trained using stochastic gra-
dient descent, where the gradient is obtained via back-propagation [Mikolov et al.,
2013a; Rumelhart et al., 1988]. After the training converges, words with similar mean-
ing are mapped to a similar position in the latent space. We use a pre-trained open-
source model that contains 300-dimensional vectors for 3 million words and phrases
trained on part of the Google News dataset (about 100 billion words) and described in
[Mikolov et al., 2013a], in order to estimate in each time period the representation of
the alternatives in the latent space, as well as the distance of the various alternatives;
the generated econometric instruments exhibit within-subject variation over time.
We should note that this method is general and applicable to texts of any length (e.g.,
sentences, paragraphs, documents) and does not require task-specific tuning nor does
it rely on parse trees [Le and Mikolov, 2014].
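A minimal sketch, under stated assumptions, of how such a latent-space differentiation (isolation) measure could be computed from the pre-trained vectors with the gensim library is shown below; the file path, the simple whitespace tokenization, and the choice of averaging review-word vectors per venue are illustrative assumptions rather than the exact procedure of this chapter.

```python
import numpy as np
from gensim.models import KeyedVectors
from scipy.spatial.distance import cosine

# Pre-trained 300-dimensional Google News vectors [Mikolov et al., 2013a]; path is illustrative.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def venue_vector(review_texts):
    """Represent a venue by the average embedding of its in-vocabulary review tokens."""
    tokens = [t for text in review_texts for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in tokens], axis=0) if tokens else None

def latent_isolation(target_vec, competitor_vecs):
    """BLP-style 'isolation' in the latent space: mean cosine distance to competing venues."""
    return float(np.mean([cosine(target_vec, v) for v in competitor_vecs]))
```

The resulting per-venue, per-period isolation scores can then enter the set of instruments discussed in Section 5.4.1.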
5.6 Empirical Results
In order to discover the impact of different types of recommendations in the mobile
context, we estimate different specifications of the structural econometric model we
presented in the previous section. First, we estimate various specifications based
on the structural logit model of demand presented in Eqn. (5.2) and the nested
model in Eqn. (5.3) (see Tables 5.2 and 5.3). Then, we also control for unobserved
heterogeneity and estimate our models introducing alternative-level fixed effects (see
Table 5.4). Due to space limitations, we only present the results corresponding to
the latter (nested logit) specification; the results are qualitatively the same across
the different specifications. Similar results were also obtained employing reduced
form models rather than structural models, which provide better fit to our data. All
our econometric specifications provide very good fit to the data as well as out-of-
sample predictive performance (see Tables 5.5 – 5.6). Apart from estimating the
main variables of interest, we also examine a number of interaction effects (see Tables
5.11 – 5.16). In order to test the robustness of our findings and control for potential
endogeneity, we conduct a subsample analysis, allow for parameter heterogeneity,
and also employ instrumental variable techniques (see Tables 5.21 – 5.22) based on
econometric instruments derived using deep learning methods.
Finally, we also conduct various falsification tests in order to verify the validity of our
findings.
Table 5.2 shows the coefficient estimates for venue demand and the corresponding
impact of the various types of recommendations in the mobile context, based on the
logit model with multi-level fixed effects (i.e., market, venue category), various alter-
native and context controls (e.g., number of events, meals served, location, holiday,
temperature, precipitation) and a time trend as well as day of the week effects, in
order to control for correlation of tastes across alternatives as well as different po-
tentially unobserved effects. Models 1 and 2 identify the average effect of recommen-
dations, in general. In particular, in Model 1 the variable of interest is binary while
Model 2 provides a more precise measurement using the relative frequency of recom-
mendation of the specific venue. Model 3 separates the effect of the different types
of recommendations (i.e., ‘quality recommendations’, ‘event recommendations’, ‘ex-
pert recommendations’, ‘novel recommendations’, and ‘trending recommendations’)
in which we are interested. Then, Model 4 also controls for the ranking of each rec-
ommendation in the generated recommendation lists. Due to space limitations, only
coefficients of the main variables of interest and statistically significant effects are
shown.
As Models 1 – 4 show, recommendations have a positive impact on demand in
a mobile recommendation setting. This impact is significant at level p < 0.001, in
most of the cases. In particular, computing alternative-level derivatives of the de-
mand function (elasticities), an increase by 10% in the number of times a venue is
recommended raises the demand by 7.15% for already recommended alternatives and
by 0.92% on average for all alternatives in general. Hence, there are positive effects on
both individual demand and aggregate-level demand in the market. These effects are
both statistically and economically significant; comparing these numbers with typical click-through rates in display advertising, the recommendation effect in this mobile setting is orders of magnitude larger. Besides, comparing the effect of recommendations to the effects of various attributes on alternative demand, an increase of 1% in the relative frequency of recommendation of an alternative has the same effect on demand as an increase of about 8% in the rating of the alternative, about 4.20% in the number of reviews, or about 13% in the number of photos. Hence, recommendations in a mobile setting
can have a greater impact on demand compared to various item attributes. Moreover,
trending recommendations that provide “in-the-moment” content to the users have a
much stronger effect, in the particular mobile setting, compared to traditional rec-
ommendations based on historical trends and data, which highlights the importance
of “in-the-moment” marketing in a mobile context. Additionally, recommendations
based on the quality of venues and experts’ reviews outperform other types of recom-
mendations. On the contrary, recommendations based on the number of upcoming
events or simply the novelty of the alternative underperform the other types of mo-
bile recommendations we examine. The differences in effectiveness among the various
Table 5.2: Coefficient Estimates of Logit Model.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.7207***
Recommendation  0.9926***
Trending recommendation  1.6179***  1.6130***
Quality recommendation  0.9576***  1.0729***
Event recommendation  0.3026***  0.4471***
Expert recommendation  0.9935***  1.1119***
Novel recommendation  0.3707***  0.5130***
Recommendation ranking  -0.0031***
Price  -0.0105***  -0.0114***  -0.0101***  -0.0099***
Rating  0.0096***  0.0100***  0.0121***  0.0103***
Number of Reviews  0.1885***  0.2021***  0.1923***  0.1903***
Sentiment of Reviews  0.0120***  0.0132***  0.0121***  0.0120***
Photos  0.0587***  0.0552***  0.0559***  0.0570***
Chain  0.0352***  0.0391***  0.0367***  0.0374***
Promotions  -0.0032***  -0.0033***  -0.0032***  -0.0031***
Alcohol  0.0222***  0.0199***  0.0213***  0.0221***
Delivery  -0.0347***  -0.0332***  -0.0323***  -0.0322***
Takeout  -0.0900***  -0.0914***  -0.0999***  -0.0998***
Reservations  0.0168***  0.0157***  0.0140***  0.0144***
Credit cards  0.0637***  0.0741***  0.0726***  0.0726***
Outdoor seating  -0.0069*  -0.0064*  -0.0039  -0.0038
Wi-Fi  0.0195***  0.0160***  0.0178***  0.0177***
Parking  0.1117***  0.1016***  0.0951***  0.0900***
Wheelchair accessible  0.0237**  0.0218**  0.0157  0.0160*
TVs  0.0048  0.0163  0.0165  0.0169
Music  0.0545***  0.0532***  0.0523***  0.0530***
Market-level fixed effects  Yes  Yes  Yes  Yes
Category-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -859504  -857790  -855548  -855405
Adjusted R2  0.520  0.522  0.525  0.525
p  0.0000  0.0000  0.0000  0.0000
N  711,673  711,673  711,673  711,673

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
types of recommendations are statistically significant. This significant difference in ef-
fects persists also after controlling for both the overall ranking in the recommendation
list and the ranking within the specific type of recommendation.
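For reference, under the plain logit specification such alternative-level derivatives have a standard closed form: for any characteristic $x_{jk}$ that enters the mean utility linearly with coefficient $\beta_k$,

$$\frac{\partial s_j}{\partial x_{jk}} = \beta_k\, s_j (1 - s_j), \qquad \eta_{jk} = \frac{\partial s_j}{\partial x_{jk}}\, \frac{x_{jk}}{s_j} = \beta_k\, x_{jk}\, (1 - s_j),$$

so the elasticity depends on the estimated coefficient, the level of the characteristic, and the current share of the alternative, which is one reason why the reported effect differs between already recommended alternatives and the average alternative.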
Table 5.3 then presents the results of the nested logit model allowing for correlated
user behaviors. The results corroborate our previous findings regarding the overall
effect of recommendations as well as the effect of each type of recommendations.
Then, we also control for unobserved heterogeneity across alternatives. Table 5.4
presents the corresponding results. The results further substantiate our previous find-
ings. We should note that after controlling for unobserved heterogeneity at the level
of alternative, the effect of novel recommendations is not found to be statistically sig-
nificant. This result regarding the novelty of recommendations is in accordance with
the findings of Adamopoulos and Tuzhilin [2014c] that novelty should be considered
vis-a-vis the overall quality and utility of each alternative when generating recom-
mendations, rather than recommending the most novel items. It is also worth noting
that the effect of promotional marketing strategies is now positive and significant.
Taking into consideration also the results of the aforementioned econometric spec-
ifications, this indicates that even though marketing promotions have on average a
positive effect on demand, such promotions are usually offered by alternatives (venues)
that experience lower levels of consumer demand than expected.
5.6.1 Out-of-Sample Performance
In order to assess the out-of-sample performance of our models and validate our
aforementioned findings, we employ a hold-out evaluation scheme with 80/20 random
split of data and evaluate each model in terms of root-mean-square error (RMSE),
mean-square error (MSE), mean absolute deviation (MAD), and mean absolute per-
cent error (MAPE). In particular, Tables 5.5 and 5.6 present for each econometric
Table 5.3: Coefficient Estimates of Nested Logit Model.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.6944***
Recommendation  0.9712***
Trending recommendation  1.4935***  1.4908***
Quality recommendation  0.9436***  1.0134***
Event recommendation  0.2687***  0.3562***
Expert recommendation  1.0730***  1.1444***
Novel recommendation  0.2535***  0.3399***
Recommendation ranking  -0.0019***
Price  -0.0168***  -0.0177***  -0.0164***  -0.0163***
Rating  0.0030*  0.0030*  0.0050***  0.0039**
Number of Reviews  0.1510***  0.1627***  0.1543***  0.1532***
Sentiment of Reviews  0.0200***  0.0213***  0.0203***  0.0203***
Photos  0.0549***  0.0510***  0.0515***  0.0522***
Chain  0.0373***  0.0413***  0.0392***  0.0397***
Promotions  -0.0032***  -0.0033***  -0.0032***  -0.0032***
Alcohol  0.0526***  0.0509***  0.0516***  0.0520***
Delivery  -0.0586***  -0.0575***  -0.0563***  -0.0562***
Takeout  -0.0425***  -0.0406***  -0.0492***  -0.0493***
Reservations  0.0208***  0.0198***  0.0182***  0.0184***
Credit cards  0.0467***  0.0552***  0.0535***  0.0535***
Outdoor seating  -0.0049  -0.0047  -0.0024  -0.0023
Wi-Fi  0.0185***  0.0149***  0.0167***  0.0166***
Parking  0.0868***  0.0759***  0.0687***  0.0657***
Wheelchair accessible  0.0138  0.0116  0.0055  0.0057
TVs  0.0006  0.0116  0.0121  0.0124
Music  0.0358***  0.0341***  0.0338***  0.0343***
With-in group share  0.1101***  0.1127***  0.1114***  0.1111***
Market-level fixed effects  Yes  Yes  Yes  Yes
Category-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -846109  -843649  -841686  -841639
Adjusted R2  0.537  0.541  0.543  0.543
p  0.0000  0.0000  0.0000  0.0000
N  711,673  711,673  711,673  711,673

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.4: Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.5506***
Recommendation  0.8893***
Trending recommendation  1.0610***  1.0572***
Quality recommendation  0.8903***  0.9765***
Event recommendation  0.4896***  0.5877***
Expert recommendation  0.8421***  0.9294***
Novel recommendation  -0.0629  0.0299
Recommendation ranking  -0.0023***
Rating  0.0234***  0.0166***  0.0180***  0.0174***
Number of Reviews  0.0925***  0.1068***  0.1063***  0.1060***
Sentiment of Reviews  0.0015  0.0013  0.0012  0.0012
Photos  0.0183***  -0.0012  -0.0012  -0.0005
Promotions  0.0024*  0.0024*  0.0023*  0.0023*
With-in group share  0.2487***  0.2544***  0.2532***  0.2529***
Alternative-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -800812  -796329  -795765  -795667
Adjusted R2  0.587  0.592  0.593  0.598
p  0.0000  0.0000  0.0000  0.0000
N  711,673  711,673  711,673  711,673

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
specification the in-sample and out-of-sample performance, respectively. Figures 5.3
and 5.4 graphically illustrate the in-sample and out-of-sample RMSE, respectively, for each econometric specification. Based on the results, in addition to very
good explanatory power, all the employed models and econometric specifications ex-
hibit very good out-of-sample performance.
Table 5.5: In-Sample Validation of Nested Logit Model with Alternative-level Fixed effects.

Model 1  Model 2  Model 3  Model 4
RMSE  0.745510  0.740829  0.740243  0.740175
MSE  0.555785  0.548827  0.547959  0.547860
MAD  0.439188  0.436594  0.436228  0.436224
MAPE  5.535493  5.485244  5.482124  5.481529

Table 5.6: Out-of-Sample Validation of Nested Logit Model with Alternative-level Fixed effects.

Model 1  Model 2  Model 3  Model 4
RMSE  0.927897  0.919884  0.918142  0.917934
MSE  0.860992  0.846186  0.842985  0.842603
MAD  0.596312  0.587145  0.585642  0.585507
MAPE  8.178065  8.042463  8.028449  8.025713
Tables 5.7-5.10 provide the corresponding results for the logit and nested logit
models. The results corroborate our previous findings. The results also illustrate
that the nested logit specification with alternative-level fixed effects provides a better
fit to our data.
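As a small illustration of the hold-out scheme described above, the sketch below computes one common definition of each reported error metric on synthetic numbers; the exact split and metric conventions used in this chapter (e.g., whether MAPE is expressed in percentage points) are not implied.

```python
import numpy as np

def holdout_metrics(y_true, y_pred):
    """Hold-out error metrics (RMSE, MSE, MAD, MAPE) under one common definition."""
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    return {
        "RMSE": float(np.sqrt(mse)),
        "MSE": mse,
        "MAD": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / y_true))),  # scaling convention is an assumption
    }

# Illustrative 80/20 random split on synthetic outcomes and predictions.
rng = np.random.default_rng(0)
y = rng.lognormal(size=1_000)                 # placeholder outcome
y_hat = y * rng.normal(1.0, 0.1, size=1_000)  # placeholder model predictions
cut = int(0.8 * len(y))
print(holdout_metrics(y[cut:], y_hat[cut:]))
```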
5.6.2 Moderating Effects on Effectiveness of Recommendations
Moreover, we delve further into the differences in effectiveness of recommendations
and we examine the moderating effect of various item attributes and contextual fac-
tors in order to gain a more detailed understanding of the effectiveness of the various
Figure 5.3: In-Sample Evaluation of Accuracy of Econometric Specifications Assessing Business Value of Recommendations.

Figure 5.4: Out-of-Sample Evaluation of Predictive Accuracy of Econometric Specifications Assessing Business Value of Recommendations.

Table 5.7: In-Sample Validation of Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.809600  0.807652  0.805111  0.804932
MSE  0.655452  0.652302  0.648204  0.647916
MAD  0.488359  0.489543  0.486214  0.486156
MAPE  6.140014  6.128357  6.098491  6.096787

Table 5.8: Out-of-Sample Validation of Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.994408  0.991234  0.985066  0.984596
MSE  0.988847  0.982545  0.970354  0.969430
MAD  0.616621  0.618346  0.610596  0.610394
MAPE  8.587711  8.535874  8.477126  8.471730

Table 5.9: In-Sample Validation of Nested Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.794504  0.791763  0.789581  0.789514
MSE  0.631236  0.626888  0.623439  0.623333
MAD  0.482875  0.483424  0.480524  0.480507
MAPE  6.051936  6.031686  6.005812  6.005087

Table 5.10: Out-of-Sample Validation of Nested Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.968876  0.963149  0.957922  0.957720
MSE  0.938721  0.927655  0.917615  0.917228
MAD  0.610642  0.609633  0.602756  0.602656
MAPE  8.458919  8.376500  8.323763  8.320875
types of recommendations in a mobile context. Tables 5.11 – 5.16 present the mod-
erating effects of the various attributes and contextual factors. In particular, Table
5.11 examines the moderating effect of price; Table 5.12 examines the moderating in-
teraction of marketing promotions and recommendations; Table 5.13 investigates the
interaction effect of the novelty of the alternative and recommendations; Table 5.14
presents the interaction effect with popularity; Table 5.15 presents the interaction
effect between recommendations and public holidays; and Table 5.16 the interaction
effect with temperature. All the presented results extend the results presented in
Table 5.4 and control for time-varying venue attributes, climate attributes, geospa-
tial attributes, and calendar attributes as before, as well as for individual-level fixed
effects. All the additional controls and effects are included in all Tables 5.11 – 5.16,
even though they are not depicted in the results, due to space restrictions. Base levels
are also estimated as usual, even though they are likewise not reported.
Based on the results presented in Table 5.11, we find a positive and significant
moderating effect of price on the effectiveness of recommendations. This finding in-
dicates that even though alternatives that are more expensive are less appealing to
the users, ceteris paribus, they can more effectively leverage the additional attention
they garner from recommendations. Further decomposing the ‘traditional’ recom-
mendations into different types, as before, we find statistically significant differences
among the various types of recommendations. In particular, the largest moderating
effect of price is in the case of novel recommendations indicating that novel expensive
venues benefit more from recommendations compared to novel but cheaper venues;
whereas event recommendations exhibit a negative interaction effect. This finding
also illustrates the need to examine multiple types of recommendations rather than
a single class. The presented econometric specifications include market-level fixed
effects rather than alternative-level ones, as there is no significant within-venue variation in price.
Based on the results presented in Table 5.12, we do not find significant moderating
effects of promotions on the effectiveness of recommendations for any type of recom-
mendations. The effect of marketing promotions in the presence of recommendations
is a topic of interest that should be thoroughly examined in future research.
Additionally, based on the results presented in Table 5.13, novel alternatives gain
in general greater benefits from recommendations. In combination with the findings
presented in Tables 5.2 – 5.4, this result illustrates that novel alternatives accrue
greater benefits from recommendations when those recommendations are not solely
based on the characteristic of novelty but also take into consideration attributes such
as the quality of the item. This is highlighted also by the magnitude of the effect in
the case of recommendations based on quality. Hence, this finding reconciles differ-
ent contradictory findings in prior literature in recommender systems. For instance,
Ekstrand et al. [2014] and Matt et al. [2014] maintain that novelty has a significant
negative effect on consumers’ satisfaction and perceived enjoyment, whereas Vargas
and Castells [2011] argue that novelty is a key positive quality of recommendations
in real scenarios.
In Table 5.14, we investigate the effect of the different types of recommendations
across different levels of venue popularity using the number of visits as metric of pop-
ularity and employing the technique of quantile regression. Recommendations have a
stronger positive effect on average for more popular alternatives. This empirical find-
ing is consistent with the observation of Oestreicher-Singer and Sundararajan [2012b]
according to which more popular products make more efficient use of the attention they garner from their network position in recommendation networks in electronic markets.
An interesting and unexpected observation though is that the effect of traditional
recommendations is much stronger for more popular alternatives whereas the effect
Table 5.11: Moderating Effect of Price on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Price  -0.0074
Recommendation x Price  0.0515***
Trending recommendation x Price  0.0409*  0.0419*
Quality recommendation x Price  0.0130  0.0132
Event recommendation x Price  -0.1705***  -0.1835***
Expert recommendation x Price  0.0666  0.0863*
Novel recommendation x Price  0.1967***  0.2044***
Table 5.12: Moderating Effect of Marketing Promotions on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Promotions  -0.0046***
Recommendation x Promotions  -0.0023
Trending recommendation x Promotions  -0.0010  0.0009
Quality recommendation x Promotions  -0.0025  -0.0020
Event recommendation x Promotions  0.0127  0.0127
Expert recommendation x Promotions  -0.0071  -0.0075
Novel recommendation x Promotions  0.0121  0.0170
Table 5.13: Moderating Effect of Novelty on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Weeks open  0.0061
Recommendation x Weeks open  -0.1060***
Trending recommendation x Weeks open  -0.2230***  -0.2226***
Quality recommendation x Weeks open  -0.2000***  -0.1992***
Event recommendation x Weeks open  0.0794*  0.0645
Expert recommendation x Weeks open  0.0393  0.0358
Novel recommendation x Weeks open  0.0750***  0.0747***
of “in-the-moment” recommendations is more stable across alternatives. This finding suggests that such recommendations can alleviate the popularity reinforcement effects of recommender systems, which
have already been observed in various settings (e.g., [Fleder and Hosanagar, 2009;
Oestreicher-Singer and Sundararajan, 2012a]), and potentially contribute to disper-
sion in the tail. Besides, we can also see that the strong and increasing moderating
effect of popularity on the effect of recommendations on demand is mainly through
recommendations based on quality and experts’ reviews whereas this effect is not as
strong for recommendations based on the novelty of alternatives.
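For illustration, the sketch below shows how such quantile regressions across the popularity distribution could be run with statsmodels; the data file and column names are hypothetical placeholders and the specification is heavily simplified relative to the one behind Table 5.14.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("venue_day_panel.csv")  # hypothetical aggregate panel, as before

# One quantile regression per decile of the (log) demand distribution, as in Table 5.14;
# the full specification also includes fixed effects and the additional controls.
for q in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    res = smf.quantreg(
        "log_visits ~ rec_trending + rec_quality + rec_event + rec_expert"
        " + rec_novel + price + rating",
        data=df,
    ).fit(q=q)
    print(q, res.params.filter(like="rec_"))
```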
Finally, Tables 5.15 and 5.16 show the effect of context on the effectiveness of recommendations. In particular, both holidays and better weather conditions (i.e., higher levels of temperature during the examined time period) have a positive and significant effect on the effectiveness of recommendations. Based on the detailed
results, this moderating effect is stronger for ‘trending’, ‘quality’ and ‘expert’ recom-
mendations, highlighting the differences in user behavior across various contexts. It
is worth noting that holidays have a negative effect on event recommendation effec-
tiveness indicating that during holidays users prefer recommended events less. This
finding contributes to the emerging literature on mobile marketing and contextual at-
tributes (e.g., effect of crowdedness on mobile offers [Andrews et al., 2015], geographic
mobility and responsiveness to mobile ads [Ghose and Han, 2011]).
5.7 Robustness Checks
In order to assess the possibility that the aforementioned findings are capturing
other unobserved factors instead of the effect of recommendations on the demand
levels, we also conduct various robustness checks. First, we conduct a subsample
analysis, leveraging the within-subject variation in our panel data set (see Section
Table 5.14: Moderating Effect of Popularity on Effectiveness of Recommendations.

Q:0.10  Q:0.20  Q:0.30  Q:0.40  Q:0.50  Q:0.60  Q:0.70  Q:0.80  Q:0.90
Trending recommendation  1.297***  1.830***  2.080***  2.240***  1.819***  1.632***  1.567***  1.475***  1.269***
Quality recommendation  0.015***  0.119***  0.487***  0.808***  0.671***  0.630***  0.702***  0.983***  4.244***
Event recommendation  -0.016***  -0.219***  -0.500***  -0.693***  -0.442***  -0.280***  -0.172***  0.194***  3.649***
Expert recommendation  0.045***  0.337***  0.863***  1.009***  1.033***  1.040***  1.244***  1.629***  2.255***
Novel recommendation  0.012***  0.092***  0.429***  0.744***  0.599***  0.538***  0.530***  0.499***  1.085***
Market-level fixed effects  Yes (all quantiles)
Category-level fixed effects  Yes (all quantiles)
Additional alternative controls  Yes (all quantiles)
Context controls  Yes (all quantiles)
Time trend  Yes (all quantiles)
Pseudo-R2  0.505  0.508  0.508  0.491  0.456  0.453  0.441  0.419  0.409
p  0.0000 (all quantiles)
N  711,673 (all quantiles)

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.15: Moderating Effect of Public Holidays on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Holiday  0.1020***
Recommendation x Holiday  0.0965***
Trending recommendation x Holiday  0.3392***  0.3396***
Quality recommendation x Holiday  0.0781***  0.0806***
Event recommendation x Holiday  -0.3393***  -0.3417***
Expert recommendation x Holiday  0.0750  0.0781
Novel recommendation x Holiday  0.1416*  0.1440*
Table 5.16: Moderating Effect of Temperature on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Temperature  0.0026***
Recommendation x Temperature  -0.0020***
Trending recommendation x Temperature  0.0152***  0.0150***
Quality recommendation x Temperature  0.0052***  0.0048***
Event recommendation x Temperature  -0.0058  -0.0074*
Expert recommendation x Temperature  -0.0017  -0.0013
Novel recommendation x Temperature  0.0095***  0.0088***
5.3), and we estimate the effectiveness of the various types of recommendations only
for alternatives that have been recommended. Table 5.18 presents the results for the
nested logit model controlling for unobserved heterogeneity at the level of alternative;
Table 5.17 provides the corresponding results employing multi-level fixed effects as
well as time-varying alternative controls, context controls, and a time trend. The
results corroborate our previous findings.
Moreover, we further consider heterogeneous effects by assigning random coeffi-
cients to the main variables of interest. Table 5.20 shows the corresponding results
for the nested logit model; Table 5.19 provides the corresponding results for the logit
model. The results corroborate our previous findings. It is worth noting that expert
and event recommendations exhibit higher variance compared to the other types of
recommendations in the mobile context.
Even though we are interested in the relative differences of effectiveness across the
various types of recommendations in a mobile context and we leverage the exogenous
variation in the recommendation mechanisms (see Section 5.3) while we also control
for numerous confounders, including perceived quality and marketing promotions, we
also assume, as a robustness check, that different variables are endogenous. Hence, we
employ instrumental variable techniques. Since instrumental variable estimates are
consistent, the large size of our data set becomes an important advantage [Angrist
and Krueger, 2001]. Tables 5.21 – 5.22 present the effect of the different types of
recommendations after accounting for potential endogeneity in prices, within-group
share, and recommendations. Table 5.21 uses as instruments rental prices and Haus-
man instruments (e.g., the average price of other venues in the same market and same
rating category), as well as the novel metric of alternative differentiation and isola-
tion based on the employed machine learning model of the user-generated reviews
using deep-learning techniques (see Section 5.4). Then, Table 5.22, in addition to
Table 5.17: Coefficient Estimates of Nested Logit Model (Sub-sample Analysis).

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.6105***
Recommendation  0.8946***
Trending recommendation  1.2147***  1.2114***
Quality recommendation  0.8579***  0.9305***
Event recommendation  0.4606***  0.5526***
Expert recommendation  1.0776***  1.1508***
Novel recommendation  0.1342***  0.2224***
Recommendation ranking  -0.0020***
Market-level fixed effects  Yes  Yes  Yes  Yes
Category-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -476628  -474380  -473651  -473619
Adjusted R2  0.452  0.459  0.462  0.462
p  0.0000  0.0000  0.0000  0.0000
N  336,953  336,953  336,953  336,953

Note: The additional alternative controls include the price, rating, number of reviews, sentiments of reviews, number of photos, whether the alternative is part of a chain, hours of operation of the specific alternative, marketing promotions, whether the venue serves alcohol, whether it offers delivery and takeout services, whether it accepts credit cards and reservations, whether it offers outdoor seating, Wi-Fi, parking space, whether it is accessible by wheelchair, whether it has TVs and music, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.18: Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects (Sub-sample Analysis).

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.5410***
Recommendation  0.8594***
Trending recommendation  0.9697***  0.9660***
Quality recommendation  0.8506***  0.9251***
Event recommendation  0.5820***  0.6698***
Expert recommendation  0.9676***  1.0430***
Novel recommendation  -0.0373  0.0471
Recommendation ranking  -0.0020***
Alternative-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Adjusted R2  0.439  0.442  0.443  0.445
p  0.0000  0.0000  0.0000  0.0000
N  336,953  336,953  336,953  336,953

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.19: Coefficient Estimates of Logit Model with Random Coefficients.
                                    Model 1     Model 2     Model 3     Model 4
Recommendation (Binary)             0.5162***
Recommendation                                  0.7053***
Trending recommendation                                     1.4377***   1.4322***
Quality recommendation                                      0.5834***   0.6811***
Event recommendation                                        -0.0087     0.0866
Expert recommendation                                       0.7753***   0.8999***
Novel recommendation                                        0.2511***   0.3232***
Recommendation ranking                                                 -0.0025***
Rating                              0.0282***   0.0291***   0.0310***   0.0303***
Number of Reviews                   0.1490***   0.1558***   0.1514***   0.1513***
Sentiment of Reviews                0.0032*     0.0032*     0.0033*     0.0033*
Photos                              0.0541***   0.0527***   0.0550***   0.0552***
Promotions                          -0.0019**   -0.0018**   -0.0018**   -0.0018**

St. dev. Recommendation (Binary)    0.5177***
St. dev. Recommendation                         0.5085***
St. dev. Trending recommendation                            0.7401***   0.7370***
St. dev. Quality recommendation                             0.5599***   0.5576***
St. dev. Event recommendation                               1.2449***   1.2369***
St. dev. Expert recommendation                              1.2018***   1.1774***
St. dev. Novel recommendation                               0.3315***   0.3358***

Alternative-level effects           Yes         Yes         Yes         Yes
Additional alternative controls     Yes         Yes         Yes         Yes
Context controls                    Yes         Yes         Yes         Yes
Time trend                          Yes         Yes         Yes         Yes

Log-likelihood                      -835679     -839088     -836291     -836255
χ2                                  275,203     274,191     277,642     277,665
p                                   0.0000      0.0000      0.0000      0.0000
N                                   711,673     711,673     711,673     711,673
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.20: Coefficient Estimates of Nested Logit Model with Random Coefficients.
                                    Model 1     Model 2     Model 3     Model 4
Recommendation (Binary)             0.4868***
Recommendation                                  0.6979***
Trending recommendation                                     1.1585***   1.1538***
Quality recommendation                                      0.6077***   0.6958***
Event recommendation                                        0.1174      0.2027
Expert recommendation                                       0.7658***   0.8794***
Novel recommendation                                        0.0936      0.1896**
Recommendation ranking                                                 -0.0022***
Rating                              0.0264***   0.0261***   0.0275***   0.0269***
Number of Reviews                   0.0930***   0.0982***   0.0951***   0.0950***
Sentiment of Reviews                0.0046***   0.0047***   0.0048***   0.0048***
Photos                              0.0365***   0.0326***   0.0352***   0.0353***
Promotions                          0.0005      0.0007      0.0007      0.0007
Within-group share                  0.2295***   0.2316***   0.2287***   0.2287***

St. dev. Recommendation (Binary)    0.5183***
St. dev. Recommendation                         0.4662***
St. dev. Trending recommendation                            0.5801***   0.5772***
St. dev. Quality recommendation                             0.5418***   0.5418***
St. dev. Event recommendation                               1.1700**    1.1653**
St. dev. Expert recommendation                              1.1574**    1.1362**
St. dev. Novel recommendation                               0.2392***   0.2480***

Alternative-level effects           Yes         Yes         Yes         Yes
Additional alternative controls     Yes         Yes         Yes         Yes
Context controls                    Yes         Yes         Yes         Yes
Time trend                          Yes         Yes         Yes         Yes

Log-likelihood                      -808547     -811554     -809425     -809394
χ2                                  301,897     307,668     309,507     309,581
p                                   0.0000      0.0000      0.0000      0.0000
N                                   711,673     711,673     711,673     711,673
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
the metrics based on the deep learning model, includes variables used in generating
the recommendations. In particular, the instruments for the trending and traditional
quality recommendations respectively include the average and standard deviation of
the alternative differentiation and isolation for the specific time period, lags of the
standardized percentage change in the number of photos and positive ratings, as well
as the lag of the within-category standardized rating and number of photos. We
have tested the instruments’ validity using the Hansen-Sargan test [Hansen, 1982]
and the Stock-Yogo critical values [Stock and Yogo, 2005]. In addition, both models
include alternative-level fixed effects as well as time-varying controls for venue, cli-
mate, geospatial, and calendar attributes, even though not depicted in the following
table due to space restrictions. The results further corroborate our previous find-
ings regarding the overall recommendation effect as well as the impact of the various
recommendation types. The differences between the effects of the different recom-
mendation types are statistically significant and highlight the importance of
“in-the-moment” recommendations.
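As a concrete illustration of the overidentification check mentioned above, the sketch below computes the classical Sargan-Hansen N·R² statistic from second-stage residuals. It is a minimal, generic linear-IV approximation with hypothetical variable names and dimensions, not the exact estimation routine used for the nested logit models in this chapter.

```python
import numpy as np
from scipy.stats import chi2

def sargan_statistic(resid, Z):
    """Sargan-Hansen overidentification statistic: n * R^2 from regressing
    the second-stage residuals on the full instrument matrix Z (a constant,
    the exogenous regressors, and the excluded instruments). Under the null
    of valid instruments it is asymptotically chi-squared with degrees of
    freedom equal to the number of overidentifying restrictions."""
    n = len(resid)
    beta, *_ = np.linalg.lstsq(Z, resid, rcond=None)
    fitted = Z @ beta
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return n * r2

# Hypothetical usage: resid from a 2SLS fit, 8 instrument columns in Z,
# 2 endogenous recommendation variables -> 6 overidentifying restrictions.
# stat = sargan_statistic(resid, Z)
# p_value = chi2.sf(stat, df=6)
```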
5.7.1 Falsification Tests
One might argue that the previous set of models is simply
picking up spurious effects as a result of pure coincidence, a general increase in the
corresponding metrics, or other unobserved factors. To assess the possibility that the
aforementioned findings are a statistical artifact and the identified positive significant
effects were captured by chance or because of other confounding factors, we run dif-
ferent falsification tests (“placebo” studies) using the same models as above (in order
to maintain consistency) but randomly indicating which alternatives (i.e., random
alternative recommended) were recommended and when (i.e., random time period
of recommendation), respectively. The results of the falsification tests are shown in
Table 5.21: Coefficient Estimates of Nested Logit Model with Instrumental Variables.
                                  Model 1      Model 2      Model 3      Model 4
Recommendation (Binary)           0.6944***
Recommendation                                 0.9780***
Trending recommendation                                     1.5141***    1.5122***
Quality recommendation                                      0.9472***    1.0244***
Event recommendation                                        0.2828***    0.3819***
Expert recommendation                                       1.0504***    1.1277***
Novel recommendation                                        0.2800***    0.3754***
Recommendation ranking                                                  -0.0021***
Price                             -0.0436***   -0.0438***   -0.0407***   -0.0398***
Rating                            0.0037       0.0043*      0.0061**     0.0049*
Number of Reviews                 0.1569***    0.1718***    0.1606***    0.1597***
Sentiment of Reviews              0.0165***    0.0172***    0.0169***    0.0168***
Photos                            0.0587***    0.0552***    0.0553***    0.0560***
Promotions                        -0.0032***   -0.0033***   -0.0032***   -0.0032***
Within-group share                0.0969***    0.0892**     0.0957***    0.0945***

Market-level fixed effects        Yes          Yes          Yes          Yes
Category-level fixed effects      Yes          Yes          Yes          Yes
Additional alternative controls   Yes          Yes          Yes          Yes
Context controls                  Yes          Yes          Yes          Yes
Time trend                        Yes          Yes          Yes          Yes

Log-likelihood                    -829027      -827001      -824543      -824572
Adjusted R2                       0.540        0.542        0.546        0.546
p                                 0.0000       0.0000       0.0000       0.0000
N                                 711,673      711,673      711,673      711,673
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.22: Coefficient Estimates of Nested Logit Model with Instrumental Variables.
                          Coefficient    Robust      z        P>|z|    [95% Conf. Interval]
                                         Std. Err.
Trending recommendation   1.448042***    0.425770     3.40    0.001     0.613549    2.282535
Quality recommendation    0.863831***    0.168845     5.12    0.000     0.532902    1.194761
Event recommendation      0.350116***    0.075285     4.65    0.000     0.202560    0.497673
Expert recommendation     1.093406***    0.065746    16.63    0.000     0.964546    1.222267
Novel recommendation      0.273595***    0.033846     8.08    0.000     0.207259    0.339931
Price                     -0.016360***   0.001559   -10.49    0.000    -0.019410   -0.013300
Rating                    0.007253***    0.001887     3.84    0.000     0.003554    0.010952
Number of Reviews         0.157965***    0.003894    40.57    0.000     0.150333    0.165597
Sentiment of Reviews      0.020388***    0.000977    20.87    0.000     0.018473    0.022303
Photos                    0.052817***    0.003223    16.39    0.000     0.046500    0.059135
Promotions                -0.003190***   0.000359    -8.89    0.000    -0.003890   -0.002490
Within-group share        0.113758***    0.001504    75.64    0.000     0.110810    0.116706

Log-likelihood: -786170    Adjusted R2: 0.551    p: 0.0000    N: 680,347
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Market-level and category-level fixed effects are included as well. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Tables 5.23 and 5.24. We see that, under these checks, the corresponding effects are
not statistically significant, indicating that our previous findings are not a statistical
artifact of our specifications but instead reflect actual effects.
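For concreteness, the following is a minimal sketch of how such placebo treatments could be constructed; the DataFrame layout and column names (recommended, venue_id, week) are hypothetical and only illustrate the two randomizations described above, namely a random recommended alternative and a random recommendation time period.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def pseudo_recommendations(df, flag_col="recommended"):
    """Placebo A: randomly reassign which observations are flagged as
    recommended, keeping the overall share of recommendations fixed."""
    out = df.copy()
    out[flag_col] = rng.permutation(out[flag_col].to_numpy())
    return out

def pseudo_timing(df, flag_col="recommended", unit_col="venue_id"):
    """Placebo B: within each alternative, shuffle the time periods in which
    it was recommended, keeping its total number of recommended periods fixed."""
    out = df.copy()
    out[flag_col] = (
        out.groupby(unit_col)[flag_col]
           .transform(lambda s: rng.permutation(s.to_numpy()))
    )
    return out

# The same choice models are then re-estimated on the placebo data; the
# insignificant placebo coefficients (Tables 5.23 and 5.24) support the findings.
```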
Table 5.23: Falsification Check Employing Pseudo-recommendations.
                                     Model 1    Model 2    Model 3
(Pseudo) Recommendation (Binary)     -0.0026
(Pseudo) Recommendation                         0.0025
(Pseudo) Trending recommendation                           0.0003
(Pseudo) Quality recommendation                            -0.0007
(Pseudo) Event recommendation                              0.0034
(Pseudo) Expert recommendation                             0.0036
(Pseudo) Novel recommendation                              0.0039
Table 5.24: Falsification Check Employing Pseudo-timing of Recommendations.
                                     Model 1    Model 2    Model 3
(Pseudo) Recommendation (Binary)     0.0009
(Pseudo) Recommendation                         0.0025
(Pseudo) Trending recommendation                           0.0024
(Pseudo) Quality recommendation                            -0.0023
(Pseudo) Event recommendation                              0.0052
(Pseudo) Expert recommendation                             0.0088
(Pseudo) Novel recommendation                              0.0052
5.8 Discussion of Business Value of Recommendations
Apart from quantifying the impact of various types of mobile recommendations
on demand levels for individual items and identifying various moderating effects on
this impact, we theoretically integrate our research questions and the corresponding
findings into the current literature on RSes by extending a current conceptual model
of the effects of RS use, RS characteristics, and other factors on consumer decision-
making that was first articulated by Xiao and Benbasat [2007]. Figure 5.5 depicts the
updated conceptual model incorporating our research questions and findings regarding
the decision outcome of consumers.
The discussed impact of RSes on consumers’ decision-making and choices can be
supported through the lenses of various IS theories. Recommendations serve to poten-
tially reduce the effort required [Shugan, 1980] as well as the uncertainty surrounding
a decision, and thus both reduce the difficulty of making a choice and increase the
confidence associated with it [Fitzsimons and Lehmann, 2004]. Another utilitarian
explanation of the aforementioned finding that recommendations have a positive and
significant effect on the demand levels for the recommended products is that con-
sumers might perceive the recommendations as endorsements of the candidate items
from the mobile RS or as another reputation dimension (in addition to the item rating
and consumer reviews). Apart from arguments based on utility theory, we can also
find support for our findings in the theories of human information processing. For
instance, Shafir et al. [1993b] maintain that people evaluate alternatives by compar-
ing them separately on distinct dimensions and that relationships among alternatives
may be perceived to be more compelling reasons or arguments for choice than de-
riving overall values for each alternative and choosing the alternative with the best
value. Because such differences (e.g., whether an alternative is recommended, rank-
ing in recommendation list) can be perceived with little effort, relationships among
alternatives may be used to make choices even in situations with simple alternatives
and even if they do not provide good justifications [Bettman et al., 1998]. Another
possible explanation is that the recommended alternatives maximize the ease of jus-
tifying consumers’ decisions. This explanation is becoming even more important in
the case of collective decisions (e.g., restaurant selection) and group recommenders;
Shafir et al. [1993a] have demonstrated that decision makers often construct reasons in
order to justify a decision to themselves (i.e., increase their confidence in the decision)
Figure 5.5: Conceptual model of effects of recommender systems.
and/or justify (explain) their decision to others. Alternatively, another explanation
is that the recommended alternatives are simply becoming more salient and hence
are more frequently selected by the consumers either deliberately (e.g., lower search
costs [Wilde, 1980]) or inadvertently since they capture their attention. Furthermore,
the technical differences between the trending and the traditional recommendations
contribute to the differences in effectiveness. In particular, the trending recommen-
dations are characterized by higher levels of temporal diversity and hence they are more
likely to recommend alternatives that are not already included in the consideration
set of a user and thus to have higher effectiveness due to differences in awareness levels
[Bodapati, 2008]. Finally, the significant difference in effectiveness among the various
types of recommendations and, especially, the effect of “in-the-moment” recommen-
dations can also be explained based on the IS success model [DeLone and McLean,
1992] and the components of timeliness and uniqueness of information which affect
user satisfaction through the construct of information quality. Similarly, the dif-
ferences of event recommendations and recommendations from “experts” relative to the
other types of examined recommendations can be explained based on the relevance
of information.
The findings of this study, apart from a theoretical contribution, also have impor-
tant managerial implications. Based on our results, we find that recommendations
in the mobile context have a positive effect on individual demand levels for the rec-
ommended alternatives. This is an important and timely finding for managers as
mobile recommender systems have currently been adopted by only 13% of the com-
panies across the globe, even though RSes are the most widespread personalization
technique for traditional channels [Econsultancy.com, 2013]. Apart from highlight-
ing in this study the economic impact of recommendations in the mobile context
and their effects on consumers’ decision-making, we also illustrate how managers and
practitioners can leverage observational data to evaluate different recommender sys-
tem algorithms and estimate their effects. This is especially important nowadays as
more than half of the companies (57%) do not test the performance of their own
recommender systems and 75% of them do not quantify the improvement in conver-
sion rates resulting from their RSes, even though they realize the importance of RSes
[Econsultancy.com, 2013]. Moreover, further disentangling the effects and economic
impact of various types of recommendations, we find in the mobile context that
recommendations that provide “in-the-moment” content to the users have a much
stronger effect compared to traditional recommendations based on historical trends
and data. The importance and economic significance of these findings is further am-
plified by the prediction that in the next years the number of mobile shoppers will
reach 213.7 million with 87% of smartphone users shopping online using their mobile
device. Another actionable finding of significant importance for businesses and man-
agers is that novel alternatives accrue greater benefits from recommendations when
those recommendations are not solely based on the characteristic of novelty but also
take into consideration specific item attributes, such as the quality of the item. Sim-
ilarly, the remaining moderating effects we examined in this study (i.e., popularity,
price, marketing promotions, and context) are also important for businesses and rec-
ommender system practitioners as they can guide the development and adoption of
future mobile RS algorithms and highlight the importance of various product design
decisions that managers should consider in order to improve the performance of their
RSes. The managerial importance of these findings and the corresponding detailed
understanding of the effectiveness of recommendations is further highlighted by the
fact that 78% of the companies consider lack of knowledge as a barrier to adopting or
improving recommendation systems in their organization [Econsultancy.com, 2013].
Finally, another important finding with significant managerial implications is that
the effect of traditional recommendations is much stronger for more popular prod-
ucts, contributing to a rich-get-richer effect, whereas the effect of “in-the-moment”
recommendations is more stable across alternatives, allowing businesses to effectively
leverage the benefits of long-tail products.
In addition to the theoretical and managerial contributions, our study also con-
tributes to the stream of literature that integrates machine learning and data mining
approaches with econometric techniques. In particular, in this study, we employ deep
learning techniques to introduce new machine learning-based econometric instruments
that extend a popular family of instruments from the observed product characteristics
space to the latent space. Hence, we contribute to the extant literature in Informa-
tion Systems that employs text-mining, sentiment-analysis, and other data mining
methods with user-generated content in empirical econometric studies (e.g., [Archak
et al., 2011; Ghose and Ipeirotis, 2011; Ghose et al., 2012b; Goes et al., 2014; Goh
et al., 2013; Netzer et al., 2012; Tirunillai and Tellis, 2014]).
Future research, apart from estimating the impact of recommender systems across
various settings in the mobile context, can also apply the proposed techniques in order
to generate contextual recommendation lists based on consumers’ surplus, rather than
the predicted numerical ratings for the various alternatives or the probability of accep-
tance of each recommendation, as proposed by Ghose et al. [2012b] in the context of
web-based travel search engines. Future research should also further examine the eco-
nomic effectiveness of recommendations in different application domains investigating
a wider range of recommended items. Besides, future research should also compare
the effectiveness of mobile recommendations vis-a-vis traditional recommendations in
other channels (e.g., web). In addition, the effect of marketing promotions in the
presence of recommendations is another topic of significant academic and managerial
interest that should be thoroughly examined in future research. Finally, disentan-
gling the effect of recommendations on repeated visits of existing customers from the
corresponding effects for new users and alternatives is another promising topic for
future research.
One of the limitations of this study is that we focus only on the effect of recom-
mendations on the demand for the candidate items (e.g., restaurants). Nevertheless,
the explicit user satisfaction should also be considered and precisely measured. How-
ever, explicit ratings from individual users were not available for privacy reasons.
Another limitation of our data set is that it contains observations corresponding to a
specific domain in the mobile channel, rather than multiple channels and different
product categories. Despite these limitations, our contribution may be widely rele-
vant to managers while also seeding a number of new directions for future research.
Our hope is that these limitations are viewed not as a liability but as a path towards
future research that extends our research question while strengthening the relevant
theory and empirical evidence.
CHAPTER VI
Conclusions, Limitations, and Future Directions
The work presented in this thesis will hopefully help the recommender systems
(RSes) field move further beyond the perspective of rating prediction accuracy by fo-
cusing on developing methods for avoiding “filter bubbles”. Following this stream
of research that both contributes to existing helpful but less explored paradigms for
recommender systems and proposes new valuable approaches and perspectives, we
discussed the studies we have conducted towards alleviating the problems of over-
specialization and concentration biases in recommender systems by delivering to the
users non-obvious, diverse, unexpected, and, at the same time, high quality personal-
ized recommendations, which could potentially expand users’ exposure and choices.
We focus this research program on the front-end of the “design - solution - perceptions
- intentions - behavior” causal chain, moving the focus from ever more accurate rating
predictions in the context of recommender systems toward offering a
holistic experience to the users. The conducted prescriptive studies are supplemented
with descriptive (explanatory) user behavior studies that examine the effects of the
proposed type of recommendations on consumer decision-making. Finally, we theoret-
ically integrate our findings into the current literature on RSes by extending a current
conceptual model of the effects of RS use, RS characteristics, and other factors on con-
sumer decision-making. The discussed impact of RSes on consumer decision-making
and choices can be supported through the lenses of various IS theories, including theo-
ries of human information processing as well as theories of satisfaction. In particular,
the studies discussed in Chapters III and IV move our focus from even more accurate
rating predictions and aim at offering a holistic experience to the users by avoiding
the over-specialization and concentration of recommendations and providing the users
with non-obvious but high quality personalized recommendations that fairly match
their interests and that they will remarkably like. Then, Chapter V assesses the business
value of various real-world types of recommendations and evaluates how they con-
tribute to the over-specialization and concentration biases of modern recommender
systems.
In detail, Chapter II provides a brief survey of the related work on over-specialization
and concentration biases of recommendations as well as novelty, serendipity, diver-
sity, and unexpectedness. It also presents the related work assessing the value of
recommendations for businesses and, finally, provides an overview of the main Infor-
mation Systems theories regarding the impact of RSes on consumer decision-making.
Then, Chapter III formulates the classical neighborhood-based collaborative filter-
ing method as an ensemble method, thus, allowing us to show the suboptimality of
the k-NN (k Nearest Neighbors) approach in terms of not only over-specialization
and concentration but also predictive accuracy. Besides, focusing on neighborhood
selection, it proposes a novel optimized neighborhood-based method (k-BN; k Better
Neighbors) and a new probabilistic neighborhood-based method (k-PN; k Probabilis-
tic Neighbors) as improvements of the standard k-NN approach alleviating some of
the most common problems of collaborative filtering recommender systems, based on
classical metrics of dispersion and diversity as well as some newly proposed metrics.
This performance improvement is in accordance with ensemble learning theory and
the phenomenon of “hubness” in recommender systems. Then, Chapter IV proposes
a concept of unexpectedness in recommender systems illustrating the differences from
the related but different terms of novelty, serendipity, and diversity. In addition, it
fully operationalizes unexpectedness by suggesting various mechanisms for specifying
the expectations of the users and proposing a recommendation method for providing
the users with non-obvious but high quality personalized recommendations that fairly
match their interests based on specific metrics of unexpectedness. Finally, Chapter
V employs econometric modeling and machine learning techniques in order to esti-
mate the impact of recommendations in the mobile context on consumers’ utility and
real-world demand. This chapter delves further into the differences in effectiveness of
recommendations and examines this heterogeneity through the moderating effects
of various item attributes and contextual factors in order to gain a more detailed
understanding of the effectiveness of the various types of recommendations. We also
validate the robustness of our findings using multiple econometric specifications as
well as instrumental variable methods with instruments based on a machine-learning
model employing deep-learning techniques.
In summary, the main contributions of the studies presented in this thesis are:
• We formulated the classical neighborhood-based collaborative filtering method
as an ensemble method, thus, allowing us to show the potential suboptimality
of the k-NN approach in terms of predictive accuracy.
• We proposed a new optimized neighborhood-based method (k-BN; k Better
Neighbors) as an improvement of the standard k-NN approach.
• We proposed a new probabilistic neighborhood-based method (k-PN; k Proba-
bilistic Neighbors) as an improvement of the standard k-NN approach.
• We empirically showed that the proposed methods (i.e., k-BN and k-PN) out-
perform, by a wide margin, the classical collaborative filtering algorithm and
practically illustrated the suboptimality of k-NN in addition to providing a theo-
retical justification of this empirical observation. Moreover, we showed that the
proposed methods alleviate the common problems of over-specialization and
concentration biases of recommendations in terms of various popular metrics
and a newly proposed metric that measures the diversity reinforcement of rec-
ommendations. Besides, we identified a particular implementation of the k-PN
method that performs consistently well across various experimental settings.
We also illustrated that most of the time the k-BN method outperforms
employed baselines as well as the k-PN method.
• We proposed a new formal definition of unexpectedness in recommender sys-
tems and conducted a survey of related concepts, illustrating the differences from
these related terms. We also formalized this concept of unexpectedness and
fully operationalized it by suggesting various mechanisms for specifying the ex-
pectations of the users.
• We proposed a method for providing the users with non-obvious but high qual-
ity recommendations that fairly match their interests and suggested specific
metrics to measure the unexpectedness of recommendation lists. Finally, we
showed that the proposed method for unexpected recommendations can en-
hance unexpectedness while maintaining the same or higher levels of accuracy
of recommendations.
• We estimated the effectiveness and economic impact of various types of real-
world recommendations in a mobile setting based on a structural econometric
method following discrete-choice models of product demand.
• We introduced new machine learning-based econometric instruments that ex-
tend a popular family of instruments to the latent space and facilitate the usage
of instrumental variable techniques for causal inference in the presence of endo-
geneity in the field of RSes.
• We discovered significant new findings and moderating effects that extend our
current knowledge regarding the heterogeneous impact of recommender systems
and reconcile contradictory prior findings in the related literature.
One of the main advantages of the proposed methodology is the use of publicly
available data sets. Employing such data sources allows the direct comparison with
the existing state-of-the-art algorithms in the field of recommender systems. Such
a direct comparison facilitates the incremental contribution to the literature in the
fields of recommender systems, data mining, and machine learning while avoiding any
further fragmentation. At the same time, this strategy makes our research studies
reproducible and replicable and better allows other researchers and practitioners to
further test our findings in other scientific fields, recommendation domains, and appli-
cations as well as efficiently compare the proposed approaches to any algorithms that
will be proposed in the near future. However, the presented and proposed studies also
have some limitations. The main limitation of these studies is the absence of online
evaluation with real users (by means of an A/B test, for instance). In the past, it
has been observed that the results from online evaluation may contradict the results
of offline evaluation because of the possible incompleteness of the employed data sets
[Beel et al., 2013]. Nevertheless, we believe that the employed methodology and the
corresponding evaluation have been appropriately designed and hence the findings will
hold in a carefully designed online evaluation. We should also note that conducting
only online studies in the absence of offline evaluation would not allow us to report
comparable results for the same methods “under the same conditions.”
In addition to the theoretical foundation and the empirical contribution of this
program of research, we theoretically integrate our research questions and the corre-
sponding findings into the current literature on RSes by extending a current concep-
tual model of the effects of RS use, RS characteristics, and other factors on consumer
decision-making that was first articulated by Xiao and Benbasat [2007] and then fur-
ther refined by the authors in [Xiao and Benbasat, 2014]. The discussed impact of
RSes on consumers’ decision-making and choices can be supported through the lenses
of various IS theories, including theories of human information processing as well as
theories of satisfaction [Adamopoulos and Tuzhilin, 2015; Adamopoulos et al., 2016].
The presented research studies also have important managerial implications. Ad-
hering to our main research objective, we work towards the direction of providing
more useful recommendations for both users and businesses. Avoiding obvious and
expected recommendations while maintaining high predictive accuracy levels, we can
alleviate the common problems of over-specialization and concentration biases that
often characterize the collaborative filtering algorithms. Building such a recommender
system, we have the potential to further increase user satisfaction and engagement
and offer a superior experience to the users [Baumol and Ide, 1956; Kahn et al., 1991].
In addition, unexpectedness and its related notions can improve the welfare of con-
sumers by allowing them to locate and buy products that otherwise they would not
have purchased. Introducing unexpectedness in recommender systems can vastly re-
duce customers’ search cost by recommending items that the user would rate highly
but would be quite unlikely to discover on her/his own, or would have to spend a large
amount of time to do so. As a result, the inefficiencies caused by buyer
search costs are reduced, while increasing the ability of markets to optimally allocate
productive resources [Bakos, 1997].
Furthermore, the generated recommendations should be useful not only for the
users but for the businesses as well. Based on our analysis, we find a significant
rise in the demand for recommended items. Examining the impact of different types
of recommendations, our findings highlight the importance of inter-temporal diversity
as well as non-obviousness and unexpectedness. Apart from the direct effect of in-
creased sales and willingness-to-pay, the proposed approaches also exhibit a potential
positive economic impact based on the enhanced customer loyalty leading to lasting
and valuable relationships [Gorgoglione et al., 2011; Zhang et al., 2011] through offer-
ing recommendations from a wider range of items that are more useful to the users, enabling
them to find relevant items that are harder to discover, and making the users familiar
with the whole product catalog. Apart from the significant gains in producer welfare
from the additional sales [Brynjolfsson et al., 2003], businesses might also leverage rev-
enues from market niches [Fleder and Hosanagar, 2009]. Thus, there is a potential
positive economic impact based on the effect of recommending items from the long
tail and not focusing mostly on bestsellers that usually exhibit higher marginal costs
and lower profit margins because of acquisition costs and licenses as well as increased
competition. Another actionable finding of importance for businesses and managers
is that novel alternatives accrue greater benefits from recommendations when those
recommendations are not solely based on the characteristic of novelty but also take
into consideration additional attributes such as the quality of the item. Similarly,
another important finding for businesses is that the effect of traditional recommen-
dations is much stronger for more popular products, contributing to a rich-get-richer
effect, whereas the effect of unexpected, diverse, and non-obvious recommendations
is more stable across alternatives, allowing businesses to effectively leverage the bene-
fits of long-tail products. Nevertheless, the remaining moderating effects we examine
are also important for businesses and recommender system practitioners as they can
guide the development of future RS algorithms and highlight the importance of vari-
ous product design decisions [Adamopoulos and Tuzhilin, 2015; Adamopoulos et al.,
2016].
This thesis may facilitate future research to integrate the proposed approaches
with related existing techniques in the fields of web search and data mining. More-
over, we would like to implement and evaluate the proposed approaches conducting
live experiments in on-line retail settings [Todri and Adamopoulos, 2014; Adamopou-
los and Todri, 2015a], taking into consideration various user attributes and charac-
teristics [Adamopoulos and Todri, 2015b; Adamopoulos et al., 2015a], as well as in
platforms for massive open online courses [Adamopoulos, 2013b]. Besides, we would
like to conduct a series of live controlled experiments with human subjects in order
to study the on-line user behavior, examine and actively adjust the trade-off between
exploration (e.g., unexpectedness, serendipity, diversity, etc.) and exploitation (e.g.,
accuracy) of recommender systems, and further evaluate the proposed perspectives
in a user-centric framework for top-N recommendations.
Appendices
Appendix A
Measuring the Concentration Reinforcement Bias
of Recommender Systems
Several measures have been employed in prior research in order to measure the con-
centration reinforcement and popularity bias of RSes as well as other similar concepts.
These metrics include catalog coverage, aggregate diversity, and the Gini coefficient.
In particular, catalog coverage measures the percentage of items for which the RS is
able to make predictions [Herlocker et al., 2004] while aggregate diversity uses the
total number of distinct items among the top-N recommendation lists across all users
to measure the absolute long-tail diversity of recommendations [Adomavicius and
Kwon, 2012]. The Gini coefficient [Gini, 1921] is used to measure the distributional
dispersion of the number of times each item is recommended across all users; similar
are the Hoover (Robin Hood) index and the Lorenz curve [Hoover, 1985; Lorenz, 1905].
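As a point of reference, the following is a minimal sketch of two of these standard measures, catalog coverage (aggregate diversity over the generated top-N lists) and the Gini coefficient of recommendation frequencies; the input formats are illustrative assumptions.

```python
import numpy as np

def aggregate_diversity(top_n_lists, catalog_size):
    """Catalog coverage / aggregate diversity: fraction of distinct catalog
    items that appear in at least one user's top-N recommendation list."""
    distinct = set(item for rec_list in top_n_lists for item in rec_list)
    return len(distinct) / catalog_size

def gini_coefficient(rec_counts):
    """Gini coefficient of the distribution of how many times each item is
    recommended across all users (0 corresponds to a uniform distribution)."""
    x = np.sort(np.asarray(rec_counts, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Illustrative example: three users, top-2 lists over a 6-item catalog.
lists = [["a", "b"], ["a", "c"], ["a", "b"]]
counts = [3, 2, 1, 0, 0, 0]           # times each catalog item was recommended
print(aggregate_diversity(lists, 6))   # 3 distinct items / 6 = 0.5
print(gini_coefficient(counts))        # ~0.61
```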
However, these metrics do not take into consideration the prior popularity of
candidate items and, hence, do not provide sufficient evidence on whether the prior
concentration of popularity is reinforced or alleviated by the RS. Moving towards this
direction, Adamopoulos and Tuzhilin [2013b, 2014b] employ a popularity reinforce-
ment measure M to assess whether a RS follows or changes the prior popularity of
items when recommendations are generated. To evaluate the concentration reinforce-
ment bias of recommendations, Adamopoulos and Tuzhilin [2013b, 2014b] measure
the proportion of items that changed from “long-tail” in terms of prior sales (or num-
ber of positive ratings) to popular in terms of recommendation frequency as follows:
$$ M = 1 - \sum_{i=1}^{K} \pi_i \, \rho_{ii}, $$
where the vector π denotes the initial distribution of each of the K popularity cat-
egories and ρii the probability of staying in category i, given that i was the initial
category. In [Adamopoulos and Tuzhilin, 2013b, 2014b], the popularity categories,
labeled as “head” and “tail”, are based on the Pareto principle and hence the “head”
category contains the top 20% of items (in terms of positive ratings or recommen-
dation frequency, respectively) and the “tail” category the remaining 80%. Based
on this metric, a score of zero denotes no change (i.e., the number of times an item
is recommended is proportional to the number of ratings it has received) whereas a
score of one denotes that the RS recommends mainly the long-tail items (i.e., the
number of times an item is recommended is proportional to the inverse of the num-
ber of ratings it has received). However, this metric of concentration reinforcement
(popularity) bias entails an arbitrary selection of popularity categories. Besides, all
items included in the same popularity category are contributing equally to this metric,
despite any differences in popularity.
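A minimal sketch of one plausible reading of this popularity reinforcement measure under the Pareto-based two-category scheme is given below, assuming that π is taken as the share of items initially falling in each category and that ρ_ii is estimated as the fraction of those items that remain in the same category when popularity is re-measured by recommendation frequency.

```python
import numpy as np

def popularity_reinforcement(prior_popularity, rec_frequency, head_share=0.2):
    """M = 1 - sum_i pi_i * rho_ii over K = 2 categories: "head" = top 20%
    of items by prior popularity, "tail" = remaining 80%. pi_i is the share
    of items initially in category i; rho_ii is the probability that an item
    stays in category i when popularity is re-measured by how often it is
    recommended."""
    prior = np.asarray(prior_popularity)
    rec = np.asarray(rec_frequency)
    n = len(prior)
    k_head = max(1, int(round(head_share * n)))
    head_prior = set(np.argsort(-prior)[:k_head])   # head by prior ratings/sales
    head_rec = set(np.argsort(-rec)[:k_head])       # head by recommendation counts
    m = 0.0
    for is_head, members in ((True, head_prior),
                             (False, set(range(n)) - head_prior)):
        pi_i = len(members) / n
        stayed = sum(1 for i in members if (i in head_rec) == is_head)
        m += pi_i * stayed / len(members)
    return 1.0 - m
```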
To precisely measure the concentration reinforcement (popularity) bias of RSes
and alleviate the problems of the aforementioned metrics, in [Adamopoulos et al.,
2015b] we propose a new metric as follows:
$$ CI@N = \sum_{i \in I} \left[ \frac{1}{2} \frac{s(i)}{\sum_{j \in I} s(j)} \ln\left( \frac{(s(i)+1)/(\sum_{j \in I} s(j)+1)}{(r_N(i)+1)/(N \cdot |U| + |I|)} \right) + \frac{1}{2} \frac{r_N(i)}{N \cdot |U|} \ln\left( \frac{(r_N(i)+1)/(N \cdot |U| + |I|)}{(s(i)+1)/(\sum_{j \in I} s(j)+1)} \right) \right], $$
where s(i) is the prior popularity of item i (i.e., the number of positive ratings for
item i in the training set or correspondingly the number of prior sales of item i), rN(i)
is the number of times item i is included in the generated top-N recommendation
lists, and U and I are the sets of users and items, respectively.1 In essence, following
the notion of Jensen-Shannon divergence in probability theory and statistics, the
proposed metric captures the distributional divergence between the popularity of
each item in terms of prior sales (or number of positive ratings) and the number of
times each item is recommended across all users. Based on this metric, a score of zero
denotes no change (i.e., the number of times an item is recommended is proportional
to its prior popularity) whereas a (more) positive score denotes that the generated
recommendations deviate (more) from the prior popularity (i.e., sales or positive
ratings) of items.
In order to measure whether the deviation of recommendations from the distribu-
tion of prior sales (or positive ratings) promotes long-tail rather than popular items,
we also propose a measure of “long-tail enforcement” as follows:
$$ LTI_{\lambda}@N = \frac{1}{|I|} \sum_{i \in I} \left[ \lambda \left( 1 - \frac{s(i)}{\sum_{j \in I} s(j)} \right) \ln\left( \frac{(r_N(i)+1)/(N \cdot |U| + |I|)}{(s(i)+1)/(\sum_{j \in I} s(j)+1)} \right) + (1-\lambda) \frac{s(i)}{\sum_{j \in I} s(j)} \ln\left( \frac{(s(i)+1)/(\sum_{j \in I} s(j)+1)}{(r_N(i)+1)/(N \cdot |U| + |I|)} \right) \right], $$
^1 Another smoothed version of the proposed metric is $CI@N = \sum_{i \in I} \frac{1}{2} s_{\%}(i) \ln\left( \frac{s_{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r^N_{\%}(i)} \right) + \frac{1}{2} r^N_{\%}(i) \ln\left( \frac{r^N_{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r^N_{\%}(i)} \right)$, where $s_{\%}(i) = \frac{s(i)}{\sum_{j \in I} s(j)}$ and $r^N_{\%}(i) = \frac{r_N(i)}{N \cdot |U|}$.
where λ ∈ (0, 1) controls which items are considered long-tail (i.e., the percentile of
popularity below which a RS should increase the frequency of recommendation of an
item). In essence, the proposed metric rewards a RS for increasing the frequency
of recommendations of long-tail items while penalizing for frequently recommending
already popular items.
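Both metrics can be computed directly from item-level counts, as in the following sketch; the inputs are arrays of prior popularity s(i) and recommendation counts r_N(i), the smoothing constants follow the formulas above, and the default λ = 0.8 is only illustrative.

```python
import numpy as np

def concentration_index(s, r, N, num_users):
    """CI@N: symmetrized, smoothed divergence between the distribution of prior
    popularity s(i) and the distribution of recommendation counts r_N(i);
    larger values mean the recommendations deviate more from prior popularity."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    num_items = len(s)
    s_pct = s / s.sum()
    r_pct = r / (N * num_users)
    s_sm = (s + 1) / (s.sum() + 1)                 # smoothed prior-popularity share
    r_sm = (r + 1) / (N * num_users + num_items)   # smoothed recommendation share
    return np.sum(0.5 * s_pct * np.log(s_sm / r_sm)
                  + 0.5 * r_pct * np.log(r_sm / s_sm))

def long_tail_index(s, r, N, num_users, lam=0.8):
    """LTI_lambda@N: rewards recommending long-tail items more often than their
    prior popularity implies and penalizes over-recommending popular items."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    num_items = len(s)
    s_pct = s / s.sum()
    s_sm = (s + 1) / (s.sum() + 1)
    r_sm = (r + 1) / (N * num_users + num_items)
    return np.mean(lam * (1 - s_pct) * np.log(r_sm / s_sm)
                   + (1 - lam) * s_pct * np.log(s_sm / r_sm))
```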
A.1 Experimental Results
To empirically illustrate the usefulness of the proposed metrics, we conduct a
large number of experiments comparing various algorithms across different perfor-
mance measures. The data sets we used are the MovieLens 100k (ML-100k), 1M
(ML-1m), and “latest-small” (ML-ls), and the FilmTrust (FT). The recommendations
were produced using the algorithms of association rules (AR), item-based collabora-
tive filtering (CF) nearest neighbors (ItemKNN), user-based CF nearest neighbors
(UserKNN), CF ensemble for ranking (RankSGD) [Jahrer and Toscher, 2012], list-
wise learning to rank with matrix factorization (LRMF) [Shi et al., 2010], Bayesian
personalized ranking (BPR) [Rendle et al., 2009], and BPR for non-uniformly sampled
items (WBPR) [Gantner et al., 2012] implemented in [Guo et al., 2015].
Figure A.1 illustrates the results of the comparative analysis of the different al-
gorithms across various metrics. In particular, Fig. A.1 shows the relative ranking in
performance for each algorithm based on popular metrics of predictive accuracy and
dispersion as well as the newly proposed metrics; green (red) squares indicate that
the specific algorithm achieved the best (worst) relative performance among all the
algorithms for the corresponding dataset and metric.2
Based on the results, we can see that the proposed metrics capture different per-
^2 We have reversed the scale of the Gini coefficient for easier interpretation of the results (i.e., the green color corresponds to the most uniformly distributed recommendations).
formance dimensions of an algorithm compared to the relevant metrics of Gini coef-
ficient and aggregate diversity. Comparing the performance based on the proposed
concentration bias metric (CI@N) with the metric of Gini coefficient, we see that
even though on aggregate an algorithm might distribute the number of times each
item is recommended more equally than another algorithm, it might still achieve this
by deviating less from the prior popularity (i.e., number of sales or positive ratings)
of each item separately (e.g., green color for Gini coefficient and red color for concen-
tration reinforcement). Nevertheless, the differences among the LTIλ performance
and the other metrics (e.g., aggregate diversity) indicate that even though some al-
gorithms might recommend fewer (more) items than others or distribute how many
times each item is recommended less (more) equally among the recommended items,
they might achieve this by frequently recommending more (fewer) long-tail items
rather than more (fewer) popular items (e.g., red color for Gini coefficient and green
color for “long-tail enforcement”). Hence, the two proposed metrics should be used
in combination in order to evaluate i) how much the recommendations of a RS algo-
rithm deviate from the prior popularity of items and ii) whether this deviation occurs
by promoting long-tail rather than already popular items.
Figure A.1: Performance (ranking) of various RS algorithms.
Appendix B
Weighted Percentile Methods in Collaborative
Filtering Systems
Under a definition of recommendation opportunity as how much a user could re-
alistically like an item, we are looking for a high percentile (e.g., 80 percent) of the
conditional distribution of the rating for the specific target user and candidate item,
given all the information we have about them. Utilizing the high percentiles, we aim
at recommending items that the users will remarkably like. One of the challenges
here is to get a good estimate of the conditional rating percentile for each user and
item from the available data. In the next section, we illustrate the practical imple-
mentation of the proposed approach in the context of neighborhood models.
User-based neighborhood recommendation methods predict the rating ru,i of user
u for item i using the ratings given to i by users most similar to u, called nearest
neighbors and denoted by Ni(u). Taking into account the fact that the neighbors can
have different levels of similarity, wu,v, and considering the k users v with the highest
similarity to u (i.e., the standard user-based k-NN collaborative filtering approach),
the predicted rating is:
$$ r_{u,i} = \frac{\sum_{v \in N_i(u)} w_{u,v} \, r_{v,i}}{\sum_{v \in N_i(u)} |w_{u,v}|} $$
However, the ratings given to item i by the nearest neighbors of user u can be
combined into a single estimation using various combining (or aggregating) functions
[Adomavicius and Tuzhilin, 2005].
In [Adamopoulos and Tuzhilin, 2013c], we propose to use higher weighted per-
centiles as a combining function. Such a high percentile p (e.g., 70th, . . . , 90th) of the
conditional distribution of the user’s rating, given all the information that we have
available, characterizes how much the target user u could realistically like the candi-
date item i. Intuitively, using high percentiles is analogous to “shifting the needle”
in our rating combining function from the middle of the rating distribution, as is the
case with the weighted average, towards the tail on the right side of the distribution
targeting recommendations that the users will like better. Formally, the percentile,
denoted by rpu,i, is defined such that the probability that user u would rate item i with
a rating of rpu,i or less is p%. Note that both low and high ratings contribute to the
estimation since they affect the rank of the values and, thus, the percentile quantity
of interest.
In a typical k-NN collaborative filtering model, the information that we have
available in order to estimate an unknown rating, and respectively the quantity rpu,i,
is the neighbors of user u, denoted by Ni(u), the similarity levels of these neighbors
wNi(u) := (wu,v : v ∈ Ni(u)), and the corresponding ratings rN (u),i := (rv,i : v ∈
Ni(u)). As an example of the proposed method and its differences from the clas-
sical approaches, consider the neighborhood N (u) of size 4 with similarity weights
wN (u) = (0.2, 0.4, 0.3, 0.1) and items x and y with ratings rN (u),x = (2, 3, 3, 4) and
ALGORITHM 5: k-NN Recommendation Algorithm
Input: User-Item Rating matrix R
Output: Recommendation lists of size l
k: Number of users in the neighborhood of user u, N_i(u)

for each user u do
    Find the k users most similar to user u, N_i(u);
    for each item i do
        Combine ratings given to item i by neighbors N_i(u);
    end
    Recommend to user u the top-l items having the highest predicted rating r_{u,i};
end
ALGORITHM 6: Weighted Percentile Estimation Algorithm
Input: Values v1, ..., vn, Weights w1, ..., wn, and p percentile to be estimated
Output: p-th weighted percentile of ordered values v1, ..., vn

Order values v1, ..., vn from least to greatest;
Rearrange weights w1, ..., wn based on ordered values;
Calculate the percent rank for p based on weights w1, ..., wn;
Use linear interpolation between the two nearest ranks;
rN (u),y = (2, 2, 4, 4), respectively. Using the standard combining function, item x
would be recommended. However, using, for instance, the weighted 80th percentile
of the variable rN (u),i, item y would be recommended since the specific percentile for
item y, denoted by rp=80u,y , corresponds to a higher rating than item x and, thus, there
is high potential that user u could realistically like item y more than x; equivalently,
the probability of user u assigning a rating greater or equal to 4 is higher for item y
than x.
Algorithm 5 summarizes the user-based k-nearest neighbors (k-NN) collaborative
filtering approach with a general combining function and Algorithm 6 shows a pro-
cedure to estimate a weighted percentile rpu,i (i.e., the proposed combining function),
where the values rN (u),i are the ratings given to candidate item i by neighbors Ni(u),
the k users most similar to target user u, and the weights wNi(u) are the corresponding
similarity levels of neighbors to user u.
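A minimal sketch of the two combining functions, reproducing the worked example above, is shown below; the midpoint-rank interpolation convention is an assumption, as the exact rule used by the percentile routine in the experiments may differ slightly.

```python
import numpy as np

def weighted_average(values, weights):
    """Standard k-NN combining function: similarity-weighted mean."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(values, w) / np.abs(w).sum())

def weighted_percentile(values, weights, p):
    """p-th weighted percentile (p in [0, 100]) with linear interpolation
    between the two nearest weighted ranks (midpoint convention)."""
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(v)
    v, w = v[order], w[order]
    cum = np.cumsum(w)
    ranks = (cum - 0.5 * w) / cum[-1]    # percent rank of each ordered value
    return float(np.interp(p / 100.0, ranks, v))

# Worked example from the text: neighborhood of size 4.
weights = [0.2, 0.4, 0.3, 0.1]
x = [2, 3, 3, 4]
y = [2, 2, 4, 4]
print(weighted_average(x, weights), weighted_average(y, weights))                # 2.9, 2.8  -> x wins
print(weighted_percentile(x, weights, 80), weighted_percentile(y, weights, 80))  # 3.25, 4.0 -> y wins
```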
B.1 Experimental Settings
To empirically validate the proposed method and evaluate the generated recom-
mendations, we conduct a large number of experiments on “real-world” data sets and
compare our results to the k-NN CF approach, which has been found to perform well
also in terms of other performance measures besides the classical accuracy metrics
[Burke, 2002; Adamopoulos and Tuzhilin, 2011, 2013a; Jannach et al., 2013].
The data sets that we used are the RecSys HetRec 2011 MovieLens data set
[Cantador et al., 2011] and the BookCrossing data set [Ziegler et al., 2005].
The RecSys HetRec 2011 MovieLens (ML) data set contains personal ratings and
tags about movies and consists of 855,598 ratings from 2,113 users on 10,197 items.
The BookCrossing (BX) data set is described by Ziegler et al. [2005] and gathered
from Bookcrossing.com, a social networking site founded to encourage the exchange
of books. Following Ziegler et al. [2005] and owing to the extreme sparsity of the data,
we decided to condense the data set in order to obtain more meaningful results from
collaborative filtering algorithms. Hence, we discarded as in [Ziegler et al., 2005] all
books for which we were not able to find any information, along with all the ratings
referring to them. Next, we also removed book titles with fewer than 4 ratings and
community members with fewer than 8 ratings each. The dimensions of the resulting
data set were considerably more moderate, featuring 8,824 users, 7,818 books, and
107,367 explicit ratings.
Using the ML and BX data sets, we conducted a large number of experiments
and compared our method against the standard user-based k nearest neighbors col-
laborative filtering approach. In order to test the proposed approach of weighted
percentiles under various experimental settings, we used 2 data sets, 6 different sizes
of neighborhoods (k ∈ {30, 40, . . . , 80}), 9 different percentiles as combining func-
tions (p ∈ {10, 20, . . . , 90}), and generated recommendation lists of 13 different sizes
(l ∈ {1, 3, 5, 10, 20, . . . , 100}), resulting in 1,404 experiments in total.
For the computation of the weighted percentiles, we used the gmisclib [gmisclib,
2013] scientific library. Besides, we used Pearson correlation to measure similarity.
Finally, we used a holdout validation scheme in all of our experiments with 80/20
splits of data to the training/test part in order to avoid overfitting.
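For reference, a minimal sketch of the Pearson similarity computation over co-rated items is given below; the dictionary-based rating representation is an illustrative assumption and not the exact data structure used in the experiments.

```python
import numpy as np

def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users, computed over the items both
    have rated; ratings_u and ratings_v map item id -> rating."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    a = np.array([ratings_u[i] for i in common], dtype=float)
    b = np.array([ratings_v[i] for i in common], dtype=float)
    a, b = a - a.mean(), b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```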
B.2 Results
The aim of this study is to demonstrate, through a comparative analysis of our method
and the standard k-NN algorithm in different experimental settings, that the proposed
method effectively improves the classical item prediction accuracy measures and performs
well in terms of other popular performance measures, such as catalog coverage.1
The goal in this section is to compare our method with the standard baseline
methods in terms of traditional metrics for item prediction, such as precision, re-
call, and F1 score. Table B.1 presents the results obtained by applying our method
to the MovieLens and BookCrossing data sets. The values reported are computed
as the average performance over the six neighborhood sizes using the F1 score for
recommendation lists of size l ∈ {3, 5, 10, 30, 50, 100}. Respectively, Fig. B.1 illus-
trates the average performance for neighborhoods of size k ∈ {30, 80} and lists of size
l ∈ {1, 3, 5, 10, 20, . . . , 100}.
Table B.1 and Fig. B.1 demonstrate that the proposed method outperforms the
^1 Similar results were also obtained using the k users with the highest similarity to the target user u, N(u), independently of whether they rated the specific candidate item i. For each metric, only the most interesting dimensions are discussed. Finally, results for low percentiles are not presented, since they consistently underperform the experiments using the high percentiles.
Figure B.1: Prediction Accuracy (F1 score) versus recommendation list size for the (a), (b) MovieLens (ML) and (c), (d) BookCrossing (BX) data sets, for neighborhood sizes k = 30 and k = 80 (k-NN baseline and the 60th, 70th, 80th, and 90th percentiles).
Table B.1: Item Prediction Accuracy (F1 score × 10²).

Data Set   Method             Recommendation List Size
                              3        5        10       30       50       100
ML         k-NN               0.0026   0.0039   0.0078   0.0164   0.3575   0.3362
           60th percentile    0.0902   0.1533   0.2975   0.7208   0.8094   0.8611
           70th percentile    0.1784   0.2907   0.5309   1.2440   1.8051   2.3565
           80th percentile    0.0993   0.1575   0.2848   0.6966   0.9525   1.2930
           90th percentile    0.0456   0.0854   0.1848   0.4552   0.6316   0.9344
BX         k-NN               0.1606   0.1876   0.2415   0.2882   0.2899   0.2807
           60th percentile    0.1743   0.2396   0.3149   0.3654   0.3716   0.3526
           70th percentile    0.1864   0.2418   0.3184   0.3751   0.3841   0.3823
           80th percentile    0.2130   0.2590   0.3654   0.4419   0.4361   0.4256
           90th percentile    0.2126   0.3065   0.3772   0.4592   0.4795   0.4728
Figure B.2: Post hoc analysis for Friedman's Test of Item Prediction Accuracy (F1 score) for both data sets: (a) MovieLens data set, (b) BookCrossing data set.
k-NN method by a wide margin. In particular, for both data sets, accuracy was
improved in all the experiments using high percentiles. Besides, as we can observe,
the increase in performance is larger for recommendation lists of larger size. For the
ML data set the maximum F1 score was achieved using the 70th percentile (0.024)
whereas for BX the maximum was 0.0048 using the 90th percentile.
To determine statistical significance, we have tested the null hypothesis that the
performance of each of the methods is the same using the Friedman test. Based
on the results, we reject the null hypothesis with p < 0.0001. Performing post hoc
analysis on Friedman’s Test results, the differences between the Baseline k-NN and
each one of the experimental settings are statistically significant. Fig. B.2 presents
the box-and-whisker diagrams displaying the aforementioned differences among the
various methods.
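The Friedman test can be run directly on the per-setting scores of the competing methods, as in the following sketch; the three rows shown are taken from Table B.1 (ML data set, list sizes 3, 5, and 10) purely to illustrate the call, whereas the reported test uses all experimental settings.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows: experimental settings (blocks); columns: methods
# (k-NN baseline, 60th, 70th, 80th, 90th percentiles), F1 score x 10^2.
f1 = np.array([
    [0.0026, 0.0902, 0.1784, 0.0993, 0.0456],
    [0.0039, 0.1533, 0.2907, 0.1575, 0.0854],
    [0.0078, 0.2975, 0.5309, 0.2848, 0.1848],
])
stat, p_value = friedmanchisquare(*f1.T)  # one sample of scores per method
print(stat, p_value)
```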
Similar results were also obtained using standard utility-based ranking metrics,
such as the normalized discounted cumulative gain (nDCG) and mean reciprocal rank
(MRR).
In this section we investigate the effect of the proposed method on coverage and
aggregate diversity, two important metrics for RSes [Ricci and Shapira, 2011], that go
beyond the classical perspective of rating prediction accuracy [Adamopoulos, 2013a;
Adamopoulos et al., 2014]. The results obtained using the catalog coverage metric [Ge
et al., 2010] (i.e., the percentage of items in the catalog that are ever recommended
to users) are equivalent to those using the diversity-in-top-N metric for aggregate
diversity [Adomavicius and Kwon, 2012]; henceforth, only results on coverage are
presented. Table B.2 presents the results obtained by applying our method to the
ML and BX data sets. The values reported are computed as the average catalog
coverage over six neighborhood sizes (k ∈ {30, 40, . . . , 80}) for recommendation lists
of size l = {3, 5, 10, 30, 50, 100}.
Table B.2: Catalog Coverage Performance.

Data Set   Method             Recommendation List Size
                              3        5        10       30       50       100
ML         k-NN               0.50%    0.57%    0.68%    0.92%    2.80%    3.98%
           60th percentile    1.17%    1.29%    1.50%    1.92%    2.30%    3.77%
           70th percentile    2.62%    2.88%    3.24%    3.89%    4.29%    5.28%
           80th percentile    5.99%    6.46%    6.98%    8.08%    8.63%    9.77%
           90th percentile    11.31%   13.29%   15.02%   16.94%   17.93%   19.32%
BX         k-NN               45.45%   54.34%   65.52%   84.14%   90.50%   95.36%
           60th percentile    45.44%   54.04%   65.52%   83.85%   90.35%   95.16%
           70th percentile    44.85%   53.34%   64.84%   84.02%   90.13%   95.09%
           80th percentile    46.47%   54.66%   65.90%   84.26%   90.39%   95.16%
           90th percentile    46.33%   54.58%   66.04%   84.28%   90.44%   95.09%
Table B.2 demonstrates that the proposed method performs at least as well as, and
in some cases even better than, the standard user-based k-NN method. In particular,
for the ML data set, where the Baseline k-NN results in low coverage, performance
is increased on average by 643.77%, with the 90th percentile exhibiting the highest
coverage. For the BX data, where the Baseline k-NN results in high coverage be-
cause of the specifics of the particular data set and the larger number of users, the
performance is on average the same (+0.00). In terms of statistical significance, using
the Friedman test and performing post hoc analysis, for the ML data set the differ-
ences among the standard user-based k-NN method and all the experimental settings
are statistically significant (p < 0.005). For the BX data set, only the differences
between the 70th percentile and the remaining experimental settings are statistically
significant.
The generated recommendation lists can also be evaluated for the inequality across
items using the Gini coefficient. In particular, for the ML and BX data sets the
Gini coefficient was on average improved by 2.58% and 0.27%, respectively. As we
can conclude, in the recommendation lists generated from the proposed method, the
number of times an item is recommended is more equally distributed.
In summary, we demonstrated that the proposed method outperforms the stan-
dard user-based k-NN algorithm by a wide margin in terms of item prediction accuracy
and utility-based ranking metrics and performs at least as well as, and in some cases
even better than, the standard baseline method in terms of several other popular
performance measures.
BIBLIOGRAPHY
Abbassi, Z., Amer-Yahia, S., Lakshmanan, L. V., Vassilvitskii, S., and Yu, C.
(2009). Getting recommender systems to think outside the box. In Proceedings of
the third ACM conference on Recommender systems, RecSys ’09, pages 285–288,
New York, NY, USA. ACM.
Adamopoulos, P. (2013a). Beyond rating prediction accuracy: On new perspec-
tives in recommender systems. In Proceedings of the seventh ACM conference on
Recommender systems, RecSys ’13, pages 459–462, New York, NY, USA. ACM.
Adamopoulos, P. (2013b). What makes a great MOOC? An interdisciplinary anal-
ysis of student retention in online courses. In Proceedings of the 34th International
Conference on Information Systems, ICIS 2013.
Adamopoulos, P. (2014a). ConcertTweets: A multi-
dimensional data set for recommender systems research.
http://people.stern.nyu.edu/padamopo/data/concertTweets.html.
Adamopoulos, P. (2014b). Novel perspectives in collaborative filtering recom-
mender systems. In 23rd International Conference on World Wide Web (WWW)
PhD Symposium.
Adamopoulos, P. (2014c). On discovering non-obvious recommendations: Using
unexpectedness and neighborhood selection methods in collaborative filtering sys-
tems. In Proceedings of the 7th ACM International Conference on Web Search and
Data Mining, WSDM ’14, pages 655–660, New York, NY, USA. ACM.
Adamopoulos, P., Bellogín, A., Castells, P., Cremonesi, P., and Steck, H. (2014).
Redd 2014 - international workshop on recommender systems evaluation: Dimen-
sions and design. In Proceedings of the 8th ACM Conference on Recommender
systems, pages 393–394. ACM.
Adamopoulos, P., Ghose, A., and Todri, V. (2015a). Estimating the impact of
user personality traits on Word-of-Mouth: Text-mining microblogging platforms.
Available at SSRN 2679199.
Adamopoulos, P., Ghose, A., and Tuzhilin, A. (2016). The business value of recom-
mendations in a mobile application: Combining deep learning with econometrics.
Working Paper, New York University.
Adamopoulos, P. and Todri, V. (2015a). The effectiveness of marketing strate-
gies in social media: Evidence from promotional events. In Proceedings of the
21st ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 1641–1650. ACM.
Adamopoulos, P. and Todri, V. (2015b). Personality-based recommendations: Evi-
dence from Amazon.com. In 9th ACM Conference on Recommender systems. ACM.
Adamopoulos, P. and Tuzhilin, A. (2011). On unexpectedness in recommender
systems: Or how to expect the unexpected. In DiveRS 2011 - ACM RecSys 2011
Workshop on Novelty and Diversity in Recommender Systems, RecSys 2011, New
York, NY, USA. ACM.
Adamopoulos, P. and Tuzhilin, A. (2013a). On unexpectedness in recommender
systems: Or how to better expect the unexpected. Working Paper: CBA-13-03,
New York University. http://ssrn.com/abstract=2282999.
Adamopoulos, P. and Tuzhilin, A. (2013b). Probabilistic neighborhood selection in
collaborative filtering systems. Working Paper: CBA-13-04, New York University.
http://hdl.handle.net/2451/31988.
Adamopoulos, P. and Tuzhilin, A. (2013c). Recommendation opportunities: Im-
proving item prediction using weighted percentile methods in collaborative filtering
systems. In Proceedings of the seventh ACM conference on Recommender systems,
RecSys ’13, pages 351–354, New York, NY, USA. ACM.
Adamopoulos, P. and Tuzhilin, A. (2014a). Estimating the value of multi-
dimensional data sets in context-based recommender systems. In ACM conference
on Recommender Systems (RecSys) 2014 Poster Proceedings.
Adamopoulos, P. and Tuzhilin, A. (2014b). On over-specialization and concentra-
tion bias of recommendations: Probabilistic neighborhood selection in collaborative
filtering systems. In Proceedings of the 8th ACM Conference on Recommender sys-
tems, pages 153–160. ACM.
Adamopoulos, P. and Tuzhilin, A. (2014c). On unexpectedness in recommender
systems: Or how to better expect the unexpected. ACM Transactions on Intelligent
Systems and Technology (TIST), 5(4):54.
Adamopoulos, P. and Tuzhilin, A. (2015). The business value of recommendations:
A privacy-preserving econometric analysis. In Proceedings of the 36th International
Conference on Information Systems, ICIS. AIS.
Adamopoulos, P., Tuzhilin, A., and Mountanos, P. (2015b). Measuring the con-
centration reinforcement bias of recommender systems. In 9th ACM Conference on
Recommender systems. ACM.
Adomavicius, G. and Kwon, Y. (2009). Toward more diverse recommendations:
Item re-ranking methods for recommender systems. In Proceedings of the 19th
Workshop on Information Technology and Systems (WITS’09).
Adomavicius, G. and Kwon, Y. (2011). Maximizing aggregate recommendation
diversity: A graph-theoretic approach. In DiveRS 2011 - ACM RecSys 2011 Work-
shop on Novelty and Diversity in Recommender Systems, RecSys 2011, New York,
NY, USA. ACM.
Adomavicius, G. and Kwon, Y. (2012). Improving aggregate recommendation di-
versity using ranking-based techniques. Knowledge and Data Engineering, IEEE
Transactions on, 24(5):896–911.
Adomavicius, G. and Tuzhilin, A. (2005). Toward the next generation of recom-
mender systems: A survey of the state-of-the-art and possible extensions. IEEE
Trans. on Knowl. and Data Eng., 17(6):734–749.
Adomavicius, G. and Zhang, J. (2012). Impact of data characteristics on rec-
ommender systems performance. ACM Transactions on Management Information
Systems, 3(1):1–17.
Akiyama, T., Obara, K., and Tanizaki, M. (2010). Proposal and evaluation of
serendipitous recommendation method using general unexpectedness. In Proceed-
ings of the ACM RecSys Workshop on Practical Use of Recommender Systems,
Algorithms and Technologies (PRSAT 2010), RecSys 2010, New York, NY, USA.
ACM.
Amazon (2012). Amazon.com, Inc. http://www.amazon.com.
André, P., Teevan, J., and Dumais, S. T. (2009). From x-rays to silly putty via
Uranus: Serendipity and its role in web search. In Proceedings of the 27th in-
ternational conference on Human factors in computing systems, CHI ’09, pages
2033–2036, New York, NY, USA. ACM.
Andrews, D. (1984). The IRG Solution: Hierarchical Incompetence and How to
Overcome It. Souvenir Press.
Andrews, M., Luo, X., Fang, Z., and Ghose, A. (2015). Mobile ad effectiveness:
Hyper-contextual targeting with crowdedness. Marketing Science.
Angrist, J. and Krueger, A. B. (2001). Instrumental variables and the search for
identification: From supply and demand to natural experiments. Report, National
Bureau of Economic Research.
Archak, N., Ghose, A., and Ipeirotis, P. G. (2011). Deriving the pricing power of
product features by mining consumer reviews. Management Science, 57(8):1485–
1509.
Bakos, J. Y. (1997). Reducing buyer search costs: implications for electronic mar-
ketplaces. Management Science, 43(12):1676–1692.
Bakshy, E., Messing, S., and Adamic, L. A. (2015). Exposure to ideologically
diverse news and opinion on Facebook. Science, 348(6239):1130–1132.
Balabanovic, M. and Shoham, Y. (1997). Fab: content-based, collaborative recom-
mendation. Communications of the ACM, 40(3):66–72.
Baumol, W. J. and Ide, E. A. (1956). Variety in retailing. Management Science,
3(1):93–101.
Beel, J., Langer, S., Genzmehr, M., Gipp, B., and Nurnberger, A. (2013). A com-
parative analysis of offline and online evaluations and discussion of research paper
recommender system evaluation. In Proceedings of the Workshop on Reproducibil-
ity and Replication in Recommender Systems Evaluation (RepSys) at the ACM
Recommender System Conference (RecSys).
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple
scales to improve accuracy of large recommender systems. In Proceedings of the 13th
ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 95–104. ACM.
Bell, R. M., Bennett, J., Koren, Y., and Volinsky, C. (2009). The million dollar
programming prize. IEEE Spectr., 46(5):28–33.
Bellogín, A., Cantador, I., and Castells, P. (2012). A comparative study of hetero-
geneous item recommendations in social systems. Information Sciences.
Bellogín, A., Castells, P., and Cantador, I. (2013). Improving memory-based col-
laborative filtering by neighbour selection based on user preference overlap. In
Proceedings of the 10th Conference on Open Research Areas in Information Re-
trieval, OAIR ’13, pages 145–148, Paris, France.
Bellogín, A. and Parapar, J. (2012). Using graph partitioning techniques for neigh-
bour selection in user-based collaborative filtering. In Proceedings of the sixth ACM
conference on Recommender systems, RecSys ’12, pages 213–216, New York, NY,
USA. ACM.
Bellogín, A., Parapar, J., and Castells, P. (2013). Probabilistic collaborative fil-
tering with negative cross entropy. In Proceedings of the 7th ACM conference on
Recommender systems, RecSys ’13, pages 387–390, New York, NY, USA. ACM.
Benjamini, Y. (1988). Opening the box of a boxplot. The American Statistician,
42(4):257–262.
Berger, G. and Tuzhilin, A. (1998). Discovering unexpected patterns in temporal
data using temporal logic. Temporal Databases: research and practice, pages 281–
309.
Berry, M. J. and Linoff, G. (1997). Data Mining Techniques: For Marketing, Sales,
and Customer Support. John Wiley & Sons, Inc., New York, NY, USA.
Berry, S. (1994). Estimating discrete-choice models of product differentiation. The
RAND Journal of Economics, 25(2):242–262.
Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equi-
librium. Econometrica, 63(4):841–890.
Bettman, J. R., Luce, M. F., and Payne, J. W. (1998). Constructive consumer
choice processes. Journal of consumer research, 25(3):187–217.
Billsus, D. and Pazzani, M. J. (2000). User modeling for adaptive news access.
User Modeling and User-Adapted Interaction, 10(2-3):147–180.
Bodapati, A. V. (2008). Recommendation systems with purchase data. Journal of
Marketing Research, 45(1):77–93.
BookCrossing (2004). Bookcrossing, Inc. http://www.bookcrossing.com.
Brynjolfsson, E., Hu, Y. J., and Simester, D. (2011). Goodbye pareto principle,
hello long tail: The effect of search costs on the concentration of product sales.
Management Science, 57(8):1373–1386.
Brynjolfsson, E., Hu, Y. J., and Smith, M. D. (2003). Consumer surplus in the
digital economy: Estimating the value of increased product variety at online book-
sellers. Management Science, 49(11):1580–1596.
Burke, R. (2002). Hybrid recommender systems: Survey and experiments. User
Modeling and User-Adapted Interaction, 12(4):331–370.
Byrne, D. and Griffitt, W. (1969). Similarity and awareness of similarity of per-
sonality characteristics as determinants of attraction. Journal of Experimental
Research in Personality.
Cantador, I., Brusilovsky, P., and Kuflik, T. (2011). 2nd workshop on information
heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings
of the 5th ACM conference on Recommender systems, RecSys ’11, New York, NY,
USA. ACM.
Cardell, N. S. (1997). Variance components structures for the extreme-value and
logistic distributions with application to models of heterogeneity. Econometric
Theory, 13(2):185–213.
Castells, P., Vargas, S., and Wang, J. (2011). Novelty and diversity metrics for rec-
ommender systems: Choice, discovery and relevance. In International Workshop
on Diversity in Document Retrieval (DDR 2011) at the 33rd European Conference
on Information Retrieval (ECIR 2011).
Celma, O. and Herrera, P. (2008). A new approach to evaluating novel recommen-
dations. In Proceedings of the 2008 ACM conference on Recommender systems,
RecSys ’08, pages 179–186, New York, NY, USA. ACM.
Chen, L., Wu, W., and He, L. (2013). How personality influences users’ needs for
recommendation diversity? In CHI ’13 Extended Abstracts on Human Factors in
Computing Systems, CHI EA ’13, pages 829–834, New York, NY, USA. ACM.
Chen, P.-Y., Wu, S.-y., and Yoon, J. (2004). The impact of online recommendations
and consumer feedback on sales. ICIS 2004 Proceedings, page 58.
Crawford, G. S. (2012). Endogenous product choice: A progress report. Interna-
tional Journal of Industrial Organization, 30(3):315–320.
Cremer, H. and Thisse, J.-F. (1991). Location models of horizontal differentia-
tion: A special case of vertical differentiation models. The Journal of Industrial
Economics, 39(4):383–390.
Cremonesi, P., Garzotto, F., Negro, S., Papadopoulos, A. V., and Turrin, R.
(2011). Looking for “good” recommendations: A comparative evaluation of rec-
ommender systems. In Human-Computer Interaction–INTERACT 2011, pages
152–168. Springer.
Cremonesi, P., Koren, Y., and Turrin, R. (2010). Performance of recommender
algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM
Conference on Recommender Systems, RecSys ’10, pages 39–46, New York, NY,
USA. ACM.
Danaher, P. J., Smith, M. S., Ranasinghe, K., and Danaher, T. S. (2015). Where,
when and how long: Factors that influence the redemption of mobile phone
coupons. Journal of Marketing Research.
Davidson, R. and MacKinnon, J. (2004). Econometric Theory and Methods. Ox-
ford University Press.
DeLone, W. H. and McLean, E. R. (1992). Information systems success: The quest
for the dependent variable. Information systems research, 3(1):60–95.
Desrosiers, C. and Karypis, G. (2011). A comprehensive survey of neighborhood-
based recommendation methods. In Recommender systems handbook, pages 107–
144. Springer.
Dooms, S., De Pessemier, T., and Martens, L. (2013). MovieTweetings: a movie
rating dataset collected from twitter. In Workshop on Crowdsourcing and Human
Computation for Recommender Systems, CrowdRec at RecSys ’13.
Dror, G., Koren, Y., and Weimer, M., editors (2012). Proceedings of KDD Cup
2011 competition, San Diego, CA, USA, 2011, volume 18 of JMLR Proceedings.
JMLR.org.
Econsultancy.com (2013). The realities of online personalization. Technical report,
Econsultancy.com.
Ekstrand, M. D., Harper, F. M., Willemsen, M. C., and Konstan, J. A. (2014).
User perception of differences in recommender algorithms. In Proceedings of the
8th ACM Conference on Recommender systems, pages 161–168. ACM.
eMarketer (2013). Smartphones, tablets drive faster growth in ecommerce sales.
Report, eMarketer.
eMarketer (2015). Holiday shopping preview. Report, eMarketer.
Evans, J. A. (2008). Electronic publication and the narrowing of science and schol-
arship. Science, 321(5887):395–399.
Fagin, R. and Price, T. G. (1978). Efficient calculation of expected miss ratios in
the independent reference model. SIAM Journal on Computing, 7(3):288–297.
Fargo, W. (2014). Data, data, data: 2015 internet advertising themes. Report,
Wells Fargo.
Fasolo, B., McClelland, G. H., and Lange, K. A. (2005). The effect of site de-
sign and interattribute correlations on interactive web-based decisions. Lawrence
Erlbaum Associates, Inc.
Fitzsimons, G. J. and Lehmann, D. R. (2004). Reactance to recommendations:
When unsolicited advice yields contrary responses. Marketing Science, 23(1):82–
94.
Fleder, D. and Hosanagar, K. (2009). Blockbuster culture’s next rise or fall:
The impact of recommender systems on sales diversity. Management Science,
55(5):697–712.
Fong, N. M., Fang, Z., and Luo, X. (2015). Geo-conquesting: Competitive loca-
tional targeting of mobile promotions. Journal of Marketing Research.
Gantner, Z., Drumond, L., et al. (2012). Personalized ranking for non-uniformly
sampled items. In Dror et al. [2012], pages 231–247.
Gantner, Z., Rendle, S., Freudenthaler, C., and Schmidt-Thieme, L. (2011). My-
MediaLite: A free recommender system library. In 5th ACM International Confer-
ence on Recommender Systems (RecSys 2011).
Ge, M., Delgado-Battenfeld, C., and Jannach, D. (2010). Beyond accuracy: Eval-
uating recommender systems by coverage and serendipity. In Proceedings of the
fourth ACM conference on Recommender systems, RecSys ’10, pages 257–260, New
York, NY, USA. ACM.
Ge, M., Jannach, D., Gedikli, F., and Hepp, M. (2012). Effects of the placement
of diverse items in recommendation lists. In Proceedings of 14th International
Conference on Enterprise Information Systems (ICEIS 2012).
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the
bias/variance dilemma. Neural computation, 4(1):1–58.
Ghose, A., Goldfarb, A., and Han, S. P. (2012a). How is the mobile internet differ-
ent? Search costs and local activities. Information Systems Research, 24(3):613–
631.
Ghose, A. and Han, S. (2011). An empirical analysis of user content generation
and usage behavior on the mobile internet. Management Science, 57(9):1671–1691.
Ghose, A., Ipeirotis, P., and Li, B. (2012b). Designing ranking systems for ho-
tels on travel search engines by mining user-generated and crowd-sourced content.
Marketing Science.
Ghose, A. and Ipeirotis, P. G. (2011). Estimating the helpfulness and economic
impact of product reviews: Mining text and reviewer characteristics. IEEE Trans-
actions on Knowledge and Data Engineering, 23(10):1498–1512.
Gini, C. (1909). Concentration and dependency ratios (in Italian). English trans-
lation in Rivista di Politica Economica, 87:769–789.
Gini, C. (1921). Measurement of inequality of incomes. The Economic Journal,
pages 124–126.
gmisclib (2013). gmisclib, scientific library.
http://kochanski.org/gpk/code/speechresearch/gmisclib/.
Goes, P. B., Lin, M., and Au Yeung, C.-m. (2014). Popularity effect in user-
generated content: Evidence from online product reviews. Information Systems
Research, 25(2):222–238.
Goh, K.-Y., Heng, C.-S., and Lin, Z. (2013). Social media brand community
and consumer behavior: Quantifying the relative impact of user-and marketer-
generated content. Information Systems Research, 24(1):88–107.
Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. (1992). Using collabora-
tive filtering to weave an information tapestry. Communications of the ACM,
35(12):61–70.
Goldstein, D. and Goldstein, D. (2006). Profiting from the long tail. Harvard
Business Review, 84(6):24–28.
Google (2012). Google Books. http://books.google.com.
Gorgoglione, M., Panniello, U., and Tuzhilin, A. (2011). The effect of context-
aware recommendations on customer purchasing behavior and trust. In Proceedings
of the fifth ACM conference on Recommender systems, RecSys ’11, pages 85–92,
New York, NY, USA. ACM.
Greene, W. (2012). Econometric Analysis. Pearson series in economics. Prentice
Hall.
GroupLens (2011). GroupLens research group.
Guo, G., Zhang, J., Sun, Z., and Yorke-Smith, N. (2015). Librec: A java library
for recommender systems. In UMAP’15.
Hansen, L. P. (1982). Large sample properties of generalized method of moments
estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054.
Harris, Z. S. (1954). Distributional structure. Word.
Herlocker, J., Konstan, J. A., and Riedl, J. (2002). An empirical analysis of de-
sign choices in neighborhood-based collaborative filtering algorithms. Information
retrieval, 5(4):287–310.
Herlocker, J. L., Konstan, J. A., Borchers, A., and Riedl, J. (1999). An algorithmic
framework for performing collaborative filtering. In Proceedings of the 22nd annual
international ACM SIGIR conference on Research and development in information
retrieval, pages 230–237. ACM.
Herlocker, J. L., Konstan, J. A., Terveen, L. G., and Riedl, J. T. (2004). Evaluating
collaborative filtering recommender systems. ACM Transactions on Information
Systems, 22(1):5–53.
Hernández-Lobato, D., Martínez-Muñoz, G., and Suárez, A. (2011). Empirical
analysis and evaluation of approximate techniques for pruning regression bagging
ensembles. Neurocomput., 74(12-13):2250–2264.
Hijikata, Y., Shimizu, T., and Nishida, S. (2009). Discovery-oriented collaborative
filtering for improving user satisfaction. In Proceedings of the 14th international
conference on Intelligent user interfaces, IUI ’09, pages 67–76, New York, NY,
USA. ACM.
Hinz, O., Eckert, J., and Skiera, B. (2011). Drivers of the long tail phenomenon:
an empirical analysis. Journal of management information systems, 27(4):43–70.
Hoover, E. (1985). An introduction to regional economics. A. A. Knopf, New York.
Hosanagar, K., Fleder, D., Lee, D., and Buja, A. (2013). Recommender systems
and their effects on consumers: The fragmentation debate. Management Science.
Hu, R. and Pu, P. (2011). Helping users perceive recommendation diversity. Work-
shop on Novelty and Diversity in Recommender Systems (DiveRS 2011), page 43.
Hurley, N. and Zhang, M. (2011). Novelty and diversity in top-n recommendation–
analysis and evaluation. ACM Transactions on Internet Technology (TOIT),
10(4):14.
Iaquinta, L., Gemmis, M. d., Lops, P., Semeraro, G., Filannino, M., and Molino,
P. (2008). Introducing serendipity in a content-based recommender system. In
Proceedings of the 2008 8th International Conference on Hybrid Intelligent Systems,
HIS ’08, pages 168–173, Washington, DC, USA. IEEE Computer Society.
IMDb (2011). Imdb.com, Inc. http://www.imdb.com.
ISBNdb.com (2012). The ISBN database. http://isbndb.com.
Jahrer, M. and Töscher, A. (2012). Collaborative filtering ensemble for ranking.
In Dror et al. [2012], pages 153–167.
Jannach, D. and Hegelich, K. (2009). A case study on the effectiveness of recom-
mendations in the mobile internet. In Proceedings of the third ACM conference on
Recommender systems, RecSys ’09, pages 205–208. ACM.
Jannach, D., Lerche, L., Gedikli, F., and Bonnin, G. (2013). What recommenders
recommend–an analysis of accuracy, popularity, and sales diversity effects. In User
Modeling, Adaptation, and Personalization, pages 25–37. Springer.
Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010). Recommender
systems: an introduction. Cambridge University Press.
Järvelin, K. and Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR
techniques. ACM Transactions on Information Systems, 20(4):422–446.
Jimenez, D. (1998). Dynamically weighted ensemble neural networks for classifica-
tion. In Neural Networks Proceedings, 1998. IEEE World Congress on Computa-
tional Intelligence. The 1998 IEEE International Joint Conference on, volume 1,
pages 753–756. IEEE.
Jin, R., Chai, J. Y., and Si, L. (2004). An automatic weighting scheme for col-
laborative filtering. In Proceedings of the 27th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 337–344.
ACM.
Kahn, B., Lehmann, D., and Dept, W. S. M. (1991). Modeling choice among
assortments. Working paper (Wharton School. Marketing Dept.). Wharton School,
University of Pennsylvania, Marketing Department.
Kannan, P., Chang, A., and Whinston, A. (2001). Wireless commerce: Marketing
issues and possibilities.
Kawamae, N. (2010). Serendipitous recommendations via innovators. In Proceed-
ings of the 33rd international ACM SIGIR conference on Research and development
in information retrieval, SIGIR ’10, pages 218–225, New York, NY, USA. ACM.
Kawamae, N., Sakano, H., and Yamada, T. (2009). Personalized recommendation
based on the personal innovator degree. In Proceedings of the third ACM conference
on Recommender systems, RecSys ’09, pages 329–332, New York, NY, USA. ACM.
Khabbaz, M., Xie, M., and Lakshmanan, L. (2011). TopRecs: Pushing the envelope
on recommender systems. Data Engineering, page 61.
Khan, F. and Zubek, V. (2008). Support vector regression for censored data
(SVRc): A novel tool for survival analysis. In Data Mining, 2008. ICDM ’08.
Eighth IEEE International Conference on, pages 863–868.
Konstan, J. A., McNee, S. M., Ziegler, C.-N., Torres, R., Kapoor, N., and Riedl,
J. T. (2006). Lessons on applying automated recommender systems to information-
seeking tasks. In proceedings of the 21st national conference on Artificial intelligence
- Volume 2, AAAI’06, pages 1630–1633, Palo Alto, CA, USA. AAAI Press.
Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., and Riedl,
J. (1997). GroupLens: applying collaborative filtering to Usenet news. Communi-
cations of the ACM, 40(3):77–87.
Konstan, J. A. and Riedl, J. T. (2012). Recommender systems: from algorithms
to user experience. User Modeling and User-Adapted Interaction, 22:101–123.
Kontonasios, K.-N., Spyropoulou, E., and De Bie, T. (2012). Knowledge discovery
interestingness measures based on unexpectedness. Wiley Interdisciplinary Re-
views: Data Mining and Knowledge Discovery, 2(5):386–399.
Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collab-
orative filtering model. In Proceedings of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 426–434. ACM.
Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative
filtering. ACM Trans. Knowl. Discov. Data, 4(1):1–24.
Koren, Y. and Bell, R. (2011). Advances in collaborative filtering. In Recommender
Systems Handbook, pages 145–186. Springer.
Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for
recommender systems. Computer, 42(8):30–37.
Krumm, J. (2009). A survey of computational location privacy. Personal and
Ubiquitous Computing, 13(6):391–399.
Kull, S., Ramsay, C., and Lewis, E. (2003). Misperceptions, the media, and the
Iraq war. Political Science Quarterly, 118(4):569–598.
Lathia, N., Hailes, S., Capra, L., and Amatriain, X. (2010). Temporal diversity
in recommender systems. In Proceedings of the 33rd international ACM SIGIR
conference on Research and development in information retrieval, SIGIR ’10, pages
210–217, New York, NY, USA. ACM.
Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and
documents. arXiv preprint arXiv:1405.4053.
Lee, Y. E. and Benbasat, I. (2010). Interaction design for mobile product recom-
mendation agents: Supporting users’ decisions in retail stores. ACM Transactions
on Computer-Human Interaction, 17(4):1–32.
Lemire, D. and Maclachlan, A. (2007). Slope one predictors for online rating-based
collaborative filtering. CoRR, abs/cs/0702144.
Leopold, T. (2013). Internet gains are serendipity’s loss.
http://www.cnn.com/2013/11/20/tech/web/internet-serendipity/.
Levin, D. Z., Cross, R., and Abrams, L. C. (2002). Why should I trust you? Pre-
dictors of interpersonal trust in a knowledge transfer context. Academy of Man-
agement.
Levy, O. and Goldberg, Y. (2007). Dependency-based word embeddings. In Proceed-
ings of the 52nd Annual Meeting of the Association for Computational Linguistics,
volume 2, pages 302–308.
Li, S. S. and Karahanna, E. (2015). Online recommendation systems in a B2C
e-commerce context: A review and future directions. Journal of the Association
for Information Systems, 16(2):2.
LibraryThing (2012). LibraryThing. http://www.librarything.com.
Lichtenthal, J. D. and Tellefsen, T. (2001). Toward a theory of business buyer-seller
similarity. Journal of Personal Selling & Sales Management, 21(1):1–14.
Long, J. (1997). Regression models for categorical and limited dependent variables,
volume 7. Sage Publications, Incorporated.
Lorenz, M. O. (1905). Methods of measuring the concentration of wealth. Publi-
cations of the American Statistical Association, 9(70):209–219.
Marshall, A. (1920). Principles of Economics, volume 1. Macmillan and Co.,
London, UK.
Martínez-Muñoz, G. and Suárez, A. (2006). Pruning in ordered bagging ensembles.
In Proceedings of the 23rd international conference on Machine learning, pages 609–
616. ACM.
Masthoff, J. (2011). Group recommender systems: Combining individual models.
In Recommender Systems Handbook, pages 677–702. Springer.
Matt, C., Benlian, A., Hess, T., and Weiß, C. (2014). Escaping from the filter
bubble? The effects of novelty and serendipity on users’ evaluations of online rec-
ommendations. In Proceedings of the 35th International Conference on Information
Systems.
Matt, C., Hess, T., and Weiß, C. (2013). The differences between recommender
technologies in their impact on sales diversity. In Proceedings of the 34th Interna-
tional Conference on Information Systems.
McAuley, J. and Leskovec, J. (2013). Hidden factors and hidden topics: under-
standing rating dimensions with review text. In Proceedings of the seventh ACM
conference on Recommender systems, RecSys ’13, New York, NY, USA. ACM.
McDonald, J. F. and Moffitt, R. A. (1980). The uses of tobit analysis. The Review
of Economics and Statistics, 62(2):318–321.
McFadden, D. (1980). Econometric models for probabilistic choice among products.
Journal of Business, pages 13–29.
McKinney, V., Yoon, K., and Zahedi, F. M. (2002). The measurement of web-
customer satisfaction: An expectation and disconfirmation approach. Information
systems research, 13(3):296–315.
McNee, S. M., Riedl, J., and Konstan, J. A. (2006). Being accurate is not enough:
how accuracy metrics have hurt recommender systems. In Proceedings of CHI ’06,
CHI EA ’06, pages 1097–1101, New York, NY, USA. ACM.
McSherry, D. (2002). Diversity-conscious retrieval. In Proceedings of the 6th Euro-
pean Conference on Advances in Case-Based Reasoning, ECCBR ’02, pages 219–
233, London, UK, UK. Springer-Verlag.
Meister, F., Shin, D., and Andrews, L. (2002). Getting to know you: what’s
new in personalization technologies. E-Doc, March-April
(http://www.aiim.org/Resources/Archive/Magazine/2002-Mar-Apr/25187).
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Dis-
tributed representations of words and phrases and their compositionality. In Ad-
vances in Neural Information Processing Systems, pages 3111–3119.
Murakami, T., Mori, K., and Orihara, R. (2008). Metrics for evaluating the
serendipity of recommendation lists. In Proceedings of the 2007 conference on
New frontiers in artificial intelligence, JSAI’07, pages 40–46, Berlin, Heidelberg.
Springer-Verlag.
Nakatsuji, M., Fujiwara, Y., Tanaka, A., Uchiyama, T., Fujimura, K., and Ishida,
T. (2010). Classical music for rock fans?: Novel recommendations for expanding
user interests. In Proceedings of the 19th ACM international conference on Infor-
mation and knowledge management, CIKM ’10, pages 949–958, New York, NY,
USA. ACM.
Nanopoulos, A., Radovanovic, M., and Ivanovic, M. (2009). How does high dimen-
sionality affect collaborative filtering? In Proceedings of the Third ACM Conference
on Recommender Systems, RecSys ’09, pages 293–296, New York, NY, USA. ACM.
Netzer, O., Feldman, R., Goldenberg, J., and Fresko, M. (2012). Mine your own
business: Market-structure surveillance through text mining. Marketing Science,
31(3):521–543.
Neven, D. (1985). Two stage (perfect) equilibrium in Hotelling’s model. The Jour-
nal of Industrial Economics, 33(3):317–325.
Nguyen, T. T., Hui, P.-M., Harper, F. M., Terveen, L., and Konstan, J. A. (2014).
Exploring the filter bubble: the effect of using recommender systems on content
diversity. In Proceedings of the 23rd international conference on World wide web,
pages 677–686. ACM.
Nielsen (2014). Total audience report: Q3 2014. Report, Nielsen.
O’Connor, M., Cosley, D., Konstan, J. A., and Riedl, J. (2002). PolyLens: a rec-
ommender system for groups of users. In ECSCW 2001, pages 199–218. Springer.
O’Connor, M. and Herlocker, J. (1999). Clustering items for collaborative filtering.
In Proceedings of the ACM SIGIR workshop on recommender systems, volume 128.
UC Berkeley.
O’Donovan, J. and Smyth, B. (2005). Trust in recommender systems. In Pro-
ceedings of the 10th international conference on Intelligent user interfaces, IUI ’05,
pages 167–174, New York, NY, USA. ACM.
Oestreicher-Singer, G. and Sundararajan, A. (2012a). Recommendation networks
and the long tail of electronic commerce. Mis Quarterly, 36(1):65–83.
Oestreicher-Singer, G. and Sundararajan, A. (2012b). The visible hand? demand
effects of recommendation networks in electronic markets. Management Science,
58(11):1963–1981.
Oliver, R. W., Rust, R. T., and Varki, S. (1998). Real-time marketing. Marketing
Management, 7(4):28.
Olsen, R. J. (1978). Note on the uniqueness of the maximum likelihood estimator
for the tobit model. Econometrica, 46(5):1211–1215.
Olson, J. C. and Dover, P. A. (1979). Disconfirmation of consumer expectations
through product trial. Journal of Applied psychology, 64(2):179.
Padmanabhan, B. and Tuzhilin, A. (1998). A belief-driven method for discover-
ing unexpected patterns. In Proceedings of the third International Conference on
Knowledge Discovery and Data Mining, KDD ’98, pages 94–100, Palo Alto, CA,
USA. AAAI Press.
Padmanabhan, B. and Tuzhilin, A. (2000). Small is beautiful: discovering the
minimal set of unexpected patterns. In Proceedings of the sixth ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD ’00, pages
54–63, New York, NY, USA. ACM.
Padmanabhan, B. and Tuzhilin, A. (2006). On characterization and discovery of
minimal unexpected patterns in rule discovery. IEEE Trans. on Knowl. and Data
Eng., 18(2):202–216.
Panniello, U., Gorgoglione, M., and Tuzhilin, A. (2016). In CARS we trust: How
context-aware recommendations affect customers’ trust and other business perfor-
mance measures of recommender systems. Information Systems Research.
Panniello, U., Tuzhilin, A., Gorgoglione, M., Palmisano, C., and Pedone, A. (2009).
Experimental comparison of pre- vs. post-filtering approaches in context-aware rec-
ommender systems. In Proceedings of the third ACM conference on Recommender
systems, RecSys ’09, pages 265–268, New York, NY, USA. ACM.
Pariser, E. (2011a). Eli Pariser: Beware online “filter bubbles.” TED.
Pariser, E. (2011b). The filter bubble: What the Internet is hiding from you. Pen-
guin UK.
Pathak, B., Garfinkel, R., Gopal, R. D., Venkatesan, R., and Yin, F. (2010). Em-
pirical analysis of the impact of recommender systems on sales. Journal of Man-
agement Information Systems, 27(2):159–188.
Perrone, M. P. and Cooper, L. N. (1993). When Networks Disagree: Ensemble
Methods for Hybrid Neural Networks, pages 126–142. Chapman and Hall.
Pradel, B., Usunier, N., and Gallinari, P. (2012). Ranking with non-random missing
ratings: influence of popularity and positivity on evaluation metrics. In Proceedings
of the sixth ACM conference on Recommender systems, pages 147–154. ACM.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2002). Reliable estimation of
generalized linear mixed models using adaptive quadrature. Stata Journal, 2(1):1–
21.
Radovanovic, M., Nanopoulos, A., and Ivanovic, M. (2010). Hubs in space: Popular
nearest neighbors in high-dimensional data. The Journal of Machine Learning
Research, 11:2487–2531.
Rampell, A. (2010). Why Online2Offline commerce is a trillion dollar opportunity.
techcrunch.com (available online at http://techcrunch.com/2010/08/07/why-
online2offline-commerce-is-a-trillion-dollaropportunity/).
Rendle, S., Freudenthaler, C., et al. (2009). BPR: Bayesian Personalized Rank-
ing from Implicit Feedback. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, UAI ’09, pages 452–461, Arlington, Virginia, United States.
AUAI Press.
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. (1994). Grou-
pLens: an open architecture for collaborative filtering of netnews. In Proceedings of
the 1994 ACM conference on Computer supported cooperative work, pages 175–186.
ACM.
Riboni, D., Pareschi, L., and Bettini, C. (2009). Privacy in georeferenced context-
aware services: A survey, pages 151–172. Springer.
Ricci, F. (2010). Mobile recommender systems. Information Technology &
Tourism, 12(3):205–231.
Ricci, F. and Shapira, B. (2011). Recommender systems handbook. Springer.
Roli, F. and Fumera, G. (2002). Analysis of linear and order statistics combin-
ers for fusion of imbalanced classifiers. In Proceedings of the Third International
Workshop on Multiple Classifier Systems, MCS ’02, pages 252–261, London, UK,
UK. Springer-Verlag.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning represen-
tations by back-propagating errors. Cognitive modeling, 5.
Said, A., Fields, B., Jain, B. J., and Albayrak, S. (2013). User-centric evaluation of
a k-furthest neighbor collaborative filtering recommender algorithm. In Proceedings
of the ACM 2013 conference on Computer Supported Cooperative Work. ACM.
Said, A., Jain, B. J., and Albayrak, S. (2012a). Analyzing weighting schemes in
collaborative filtering: cold start, post cold start and power users. In Proceedings
of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pages 2035–
2040, New York, NY, USA. ACM.
Said, A., Jain, B. J., Kille, B., and Albayrak, S. (2012b). Increasing diversity
through furthest neighbor-based recommendation. In Proceedings of the WSDM’12
Workshop on Diversity in Document Retrieval (DDR’12).
Sarwar, B. M., Karypis, G., Konstan, J., and Riedl, J. (2002). Recommender sys-
tems for large-scale e-commerce: Scalable neighborhood formation using clustering.
In Proceedings of the fifth international conference on computer and information
technology, volume 1.
Senecal, S. and Nantel, J. (2004). The influence of online product recommendations
on consumers online choices. Journal of retailing, 80(2):159–169.
Seyerlehner, K., Flexer, A., and Widmer, G. (2009). On the limitations of browsing
top-n recommender systems. In Proceedings of the Third ACM Conference on
Recommender Systems, RecSys ’09, pages 321–324, New York, NY, USA. ACM.
Shafir, E., Simonson, I., and Tversky, A. (1993a). Reason-based choice. Cognition,
49(1):11–36.
Shafir, E. B., Osherson, D. N., and Smith, E. E. (1993b). The advantage model: A
comparative theory of evaluation and choice under risk. Organizational Behavior
and Human Decision Processes, 55(3):325–378.
Shani, G. and Gunawardana, A. (2011). Evaluating recommendation systems.
Recommender Systems Handbook, 12(19):1–41.
Shardanand, U. and Maes, P. (1995). Social information filtering: algorithms for
automating word of mouth. In Proceedings of the SIGCHI conference on Human
factors in computing systems, pages 210–217. ACM Press/Addison-Wesley Pub-
lishing Co.
Shi, Y., Larson, M., et al. (2010). List-wise learning to rank with matrix factoriza-
tion for collaborative filtering. In Proceedings of the Fourth ACM Conference on
Recommender Systems, RecSys ’10, pages 269–272, New York, NY, USA. ACM.
Shi, Y., Zhao, X., Wang, J., Larson, M., and Hanjalic, A. (2012). Adaptive
diversification of recommendation results via latent factor portfolio. In SIGIR.
Shivaswamy, P., Chu, W., and Jansche, M. (2007). A support vector approach to
censored targets. In Data Mining, 2007. ICDM 2007. Seventh IEEE International
Conference on, pages 655–660.
Shugan, S. M. (1980). The cost of thinking. Journal of consumer Research, pages
99–111.
Silberschatz, A. and Tuzhilin, A. (1996). What makes patterns interesting in knowl-
edge discovery systems. Knowledge and Data Engineering, IEEE Transactions on,
8(6):970–974.
Sinha, R. R. and Swearingen, K. (2001). Comparing recommendations made by
online systems and friends. In DELOS workshop: personalisation and recommender
systems in digital libraries, volume 1.
Spearman, C. (1987). The proof and measurement of association between two
things. The American Journal of Psychology, 100(3/4):441–471.
Stock, J. H. and Yogo, M. (2005). Testing for weak instruments in linear IV
regression. Identification and inference for econometric models: Essays in honor
of Thomas Rothenberg.
Stroud, N. J. (2008). Media use and political predispositions: Revisiting the con-
cept of selective exposure. Political Behavior, 30(3):341–366.
Sugiyama, K. and Kan, M.-Y. (2011). Serendipitous recommendation for scholarly
papers considering relations among researchers. In Proceedings of the 11th annual
international ACM/IEEE joint conference on Digital libraries, JCDL ’11, pages
307–310, New York, NY, USA. ACM.
Sweeting, A. (2013). Dynamic product positioning in differentiated product mar-
kets: The effect of fees for musical performance rights on the commercial radio
industry. Econometrica, 81(5):1763–1803.
Thompson, C. (2008). If you liked this, you’re sure to love that. The New York
Times, 21.
Tintarev, N., Flores, A., and Amatriain, X. (2010). Off the beaten track: a mobile
field study exploring the long tail of tourist recommendations.
Tintarev, N. and Masthoff, J. (2011). Designing and evaluating explanations
for recommender systems. In Recommender Systems Handbook, pages 479–510.
Springer.
Tirole, J. (1988). The Theory of Industrial Organization. Mit Press.
Tirunillai, S. and Tellis, G. J. (2014). Mining marketing meaning from online chat-
ter: Strategic brand analysis of big data using latent dirichlet allocation. Journal
of Marketing Research, 51(4):463–479.
Todri, V. and Adamopoulos, P. (2014). Social commerce: An empirical examina-
tion of the antecedents and consequences of commerce in social network platforms.
In Proceedings of the 35th International Conference on Information Systems, ICIS,
page 18. AIS.
Tumer, K. and Ghosh, J. (1996). Error correlation and error reduction in ensemble
classifiers. Connection science, 8(3-4):385–404.
UBS (2015). US internet & interactive entertainment: Can internet stocks rise
despite headwinds? Report, UBS.
Ueda, N. and Nakano, R. (1996). Generalization error of ensemble estimators.
In Neural Networks, 1996., IEEE International Conference on, volume 1, pages
90–95. IEEE.
Umyarov, A. and Tuzhilin, A. (2011). Using external aggregate ratings for improv-
ing individual recommendations. ACM Transactions on the Web, 5(3):1–40.
Vargas, S. and Castells, P. (2011). Rank and relevance in novelty and diversity
metrics for recommender systems. In Proceedings of the fifth ACM conference on
Recommender systems, RecSys ’11, pages 109–116, New York, NY, USA. ACM.
Vargas, S., Castells, P., and Vallet, D. (2012). Explicit relevance models in intent-
oriented information retrieval diversification. In 35th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval (SIGIR
2012), Portland, OR, USA.
Wang, J. and Zhu, J. (2009). Portfolio theory of information retrieval. In Proc. of
the Annual International ACM SIGIR Conference on Research and Development
on Information Retrieval (SIGIR).
Weng, L.-T., Xu, Y., Li, Y., and Nayak, R. (2007). Improving recommendation
novelty based on topic taxonomy. In Proceedings of the 2007 IEEE/WIC/ACM
International Conferences on Web Intelligence and Intelligent Agent Technology -
Workshops, WI-IATW ’07, pages 115–118, Washington, DC, USA. IEEE Computer
Society.
Wikipedia (2012). Wikimedia foundation, Inc. http://www.wikipedia.org.
Wilde, L. L. (1980). The economics of consumer information acquisition. The
Journal of Business, 53(3):143–158.
Wong, C.-K. and Easton, M. C. (1980). An efficient method for weighted sampling
without replacement. SIAM Journal on Computing, 9(1):111–113.
Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data.
Econometric Analysis of Cross Section and Panel Data. Mit Press.
WorldCat (2012). OCLC online computer library center, Inc.
http://www.worldcat.org.
Xiao, B. and Benbasat, I. (2007). E-commerce product recommendation agents:
use, characteristics, and impact. Mis Quarterly, 31(1):137–209.
Xiao, B. and Benbasat, I. (2014). Research on the use, characteristics, and impact
of e-commerce product recommendation agents: A review and update for 2007–
2012. In Handbook of Strategic e-Business Management, pages 403–431. Springer.
Xue, G.-R., Lin, C., Yang, Q., Xi, W., Zeng, H.-J., Yu, Y., and Chen, Z. (2005).
Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the
28th annual international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’05, pages 114–121, New York, NY, USA. ACM.
Yoo, K.-H., Gretzel, U., and Zanker, M. (2013). Persuasive Recommender Systems:
Conceptual Background and Implications. Springer.
Zhang, M. and Hurley, N. (2008). Avoiding monotony: Improving the diversity
of recommendation lists. In Proceedings of the 2008 ACM conference on Recom-
mender systems, RecSys ’08, pages 123–130, New York, NY, USA. ACM.
Zhang, M. and Hurley, N. (2009). Novel item recommendation by user profile par-
titioning. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Con-
ference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT
’09, pages 508–515, Washington, DC, USA. IEEE Computer Society.
Zhang, T., Agarwal, R., and Lucas Jr, H. C. (2011). The value of IT-enabled
retailer learning: personalized product recommendations and customer store loyalty
in electronic markets. MIS Quarterly-Management Information Systems, 35(4):859.
Zhang, Y. C., Seaghdha, D. O., Quercia, D., and Jambor, T. (2012). Auralist:
introducing serendipity into music recommendation. In Proceedings of the fifth
ACM international conference on Web search and data mining, WSDM ’12, pages
13–22, New York, NY, USA. ACM.
Zhou, T., Kuscsik, Z., Liu, J.-G., Medo, M., Wakeling, J. R., and Zhang, Y.-C.
(2010). Solving the apparent diversity-accuracy dilemma of recommender systems.
Proceedings of the National Academy of Sciences, 107(10):4511–4515.
Zhou, Z.-H., Wu, J., and Tang, W. (2002). Ensembling neural networks: many
could be better than all. Artificial intelligence, 137(1):239–263.
Ziegler, C.-N., McNee, S. M., Konstan, J. A., and Lausen, G. (2005). Improving
recommendation lists through topic diversification. In Proceedings of the 14th in-
ternational conference on World Wide Web, WWW ’05, pages 22–32, New York,
NY, USA. ACM.
Zucker, L. G. (1986). Production of trust: Institutional sources of economic struc-
ture, 1840–1920. Research in organizational behavior.