Unexpectedness and Non-Obviousness in Recommendation Technologies and Their Impact
on Consumer Decision Making
by
Panagiotis Adamopoulos
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
Department of Information, Operations and Management Sciences
Leonard N. Stern School of Business, New York University
April 2016
Doctoral Committee:
Professor Gediminas Adomavicius
Professor Anindya Ghose
Assistant Professor Srikanth Jagabathula
Professor Alexander Tuzhilin, Chair
© Panagiotis Adamopoulos 2016
All Rights Reserved
ACKNOWLEDGEMENTS
I would like to thank my doctoral advisor and co-author, Prof. Alexander Tuzhilin,
for his support, guidance, and valuable feedback throughout these years.
I owe much gratitude to Leonard N. Stern School of Business and the Department
of Information, Operations & Management Sciences (IOMS). Special thanks to Prof.
Anindya Ghose, Prof. Gediminas Adomavicius, Prof. Daria Dzyabura, Prof. Srikanth
Jagabathula, Prof. Panos Ipeirotis, Prof. Hila Lifshitz-Assaf, Prof. Natalia Levina,
Prof. Foster Provost, and Prof. Norman White.
I would also like to thank my parents and my brother Eleftherios for always having
faith in me and supporting me with all their means.
Most of all, I would like to thank Vilma Todri for her unconditional love, encour-
agement, and support.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . ii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . x
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
II. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Over-specialization of Recommendations . . . . . . . . . . . . 9
2.2 Concentration Bias of Recommendations . . . . . . . . . . . . 9
2.3 Novelty of Recommendations . . . . . . . . . . . . . . . . . . 10
2.4 Serendipity of Recommendations . . . . . . . . . . . . . . . . 11
2.5 Diversity of Recommendations . . . . . . . . . . . . . . . . . 12
2.6 Unexpectedness of Recommendations . . . . . . . . . . . . . 14
2.7 Business Value of Recommendations . . . . . . . . . . . . . . 15
2.8 Recommender Systems and Consumer Decision Making . . . 16
III. Neighborhood Selection in Collaborative Filtering Systems . 19
3.1 Introduction to Neighborhood Selection . . . . . . . . . . . . 19
3.2 Related Approaches . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Methods for Neighborhood Selection . . . . . . . . . . . . . . 24
3.3.1 Neighborhood Models . . . . . . . . . . . . . . . . . 24
3.3.2 Theoretical Motivation . . . . . . . . . . . . . . . . 26
3.3.3 Probabilistic Neighborhood Selection . . . . . . . . 30
3.3.4 Optimized Neighborhood Selection . . . . . . . . . . 32
3.4 Experimental Settings of Probabilistic Neighborhood Selection . 38
3.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . 39
3.5 Results of Probabilistic Neighborhood Selection . . . . . . . . 43
3.5.1 Orthogonality of Recommendations . . . . . . . . . 45
3.5.2 Comparison of Coverage and Diversity . . . . . . . 49
3.5.3 Comparison of Dispersion and Diversity Reinforcement . 52
3.5.4 Comparison of Item Prediction . . . . . . . . . . . . 55
3.5.5 Comparison of Utility-based Ranking . . . . . . . . 60
3.6 Experimental Settings of Optimized Neighborhood Selection . 61
3.7 Results of Optimized Neighborhood Selection . . . . . . . . . 62
3.7.1 Orthogonality of Recommendations . . . . . . . . . 63
3.7.2 Comparison of Coverage and Diversity . . . . . . . 65
3.7.3 Comparison of Dispersion and Diversity Reinforcement . 67
3.7.4 Comparison of Item Prediction . . . . . . . . . . . . 67
3.7.5 Comparison of Rating Prediction . . . . . . . . . . . 70
3.7.6 Comparison of Utility-based Ranking . . . . . . . . 73
3.8 Discussion of Neighborhood Selection . . . . . . . . . . . . . 73
IV. Unexpectedness in Recommender Systems . . . . . . . . . . . 77
4.1 Introduction to Unexpectedness . . . . . . . . . . . . . . . . . 77
4.2 Related Concepts . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Definition of Unexpectedness . . . . . . . . . . . . . . . . . . 81
4.3.1 Unexpectedness of Recommendations . . . . . . . . 81
4.3.2 Utility of Recommendations . . . . . . . . . . . . . 84
4.3.3 Recommendation Algorithm . . . . . . . . . . . . . 86
4.3.4 Evaluation of Recommendations . . . . . . . . . . . 87
4.4 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 90
4.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . 94
4.5 Results of Unexpectedness Method . . . . . . . . . . . . . . . 101
4.5.1 Comparison of Unexpectedness . . . . . . . . . . . . 103
4.5.2 Comparison of Rating Prediction . . . . . . . . . . . 112
4.5.3 Comparison of Item Prediction . . . . . . . . . . . . 116
4.5.4 Comparison of Diversity and Dispersion . . . . . . . 120
4.6 Discussion of Unexpectedness . . . . . . . . . . . . . . . . . . 126
V. Business Value of Recommendations . . . . . . . . . . . . . . . 130
5.1 Introduction to Business Value of Recommendations . . . . . 130
5.2 Related Concepts and Approaches . . . . . . . . . . . . . . . 133
5.3 Empirical Setting and Data . . . . . . . . . . . . . . . . . . . 137
5.4 Empirical Method and Models . . . . . . . . . . . . . . . . . 143
5.4.1 Identification Strategy . . . . . . . . . . . . . . . . 146
5.5 Deep-Learning Model of User-Generated Reviews . . . . . . . 147
5.6 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . 148
5.6.1 Out-of-Sample Performance . . . . . . . . . . . . . 152
5.6.2 Moderating Effects on Business Value . . . . . . . . 155
5.7 Robustness Checks . . . . . . . . . . . . . . . . . . . . . . . . 163
5.7.1 Falsification Tests . . . . . . . . . . . . . . . . . . . 172
5.8 Discussion of Business Value of Recommendations . . . . . . 175
VI. Conclusions, Limitations, and Future Directions . . . . . . . . 182
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
A. Measuring the Concentration Reinforcement Bias of Recommender Systems . . . . . . . . . . . . 191
B. Weighted Percentile Methods in Collaborative Filtering Systems . . . . . . . . . . . . 197
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
LIST OF FIGURES
Figure
3.1 Examples of Sampling Distributions for Probabilistic Neighborhood Selection . . . 42
3.2 Examples of Probabilistically Sampled Neighborhoods . . . 42
3.3 Summary of Performance of Probabilistic Neighborhood Selection . . . 44
3.4 Overlap of Recommendations of Probabilistic Neighborhood Selection . . . 46
3.5 Correlation of Recommendations of Probabilistic Neighborhood Selection . . . 48
3.6 Aggregate Diversity of Probabilistic Neighborhood Selection . . . 50
3.7 Dispersion of Recommendations of Probabilistic Neighborhood Selection . . . 53
3.8 Diversity Reinforcement of Probabilistic Neighborhood Selection . . . 56
3.9 Item Prediction Performance of Probabilistic Neighborhood Selection . . . 57
3.10 Item Prediction Performance of Probabilistic Neighborhood Selection for Fixed Recommendation Lists Size . . . 58
3.11 Summary of Performance of Optimized Neighborhood Selection . . . 63
3.12 Overlap of Recommendations of Optimized Neighborhood Selection . . . 64
3.13 Aggregated Diversity of Optimized Neighborhood Selection . . . 66
3.14 Dispersion of Recommendations of Optimized Neighborhood Selection . . . 68
3.15 Diversity Reinforcement of Optimized Neighborhood Selection . . . 69
3.16 Item Prediction Performance of Optimized Neighborhood Selection . . . 71
3.17 Rating Prediction Performance of Optimized Neighborhood Selection . . . 72
3.18 Utility-based Ranking of Optimized Neighborhood Selection . . . 74
4.1 Unexpectedness Performance . . . 105
4.2 Unexpectedness Performance for Different Algorithms . . . 106
4.3 Unexpectedness Performance for Different Sets of Expectations . . . 108
4.4 Post-hoc Analysis of Unexpectedness Performance . . . 109
4.5 Rating Prediction Performance of Unexpectedness Method . . . 115
4.6 Post-hoc Analysis of Rating Prediction Performance . . . 116
4.7 Item Prediction Performance of Unexpectedness Method . . . 119
4.8 Post-hoc Analysis of Item Prediction Performance . . . 120
4.9 Diversity Performance of Unexpectedness Method . . . 123
4.10 Post-hoc Analysis of Diversity Performance . . . 124
4.11 Dispersion Performance of Unexpectedness Method . . . 124
5.1 Locations of Recommended Venues . . . 138
5.2 Correlation of Main Variables Employed in Econometric Specifications for Business Value of Recommendations . . . 141
5.3 In-Sample Evaluation of Econometric Specifications Assessing Business Value of Recommendations . . . 156
5.4 Out-of-Sample Evaluation of Econometric Specifications Assessing Business Value of Recommendations . . . 156
5.5 Conceptual Model of Effects of Recommender Systems . . . 177
A.1 Performance (ranking) of various RS algorithms . . . 196
B.1 Weighted Percentile Methods: Prediction Accuracy . . . 202
B.2 Weighted Percentile Methods: Post hoc analysis . . . 203
LIST OF TABLES
Table
3.1 Probability Distributions for Neighborhood Selection . . . 40
4.1 Sets of Expected Recommendations . . . 98
4.2 Unexpectedness Performance . . . 104
4.3 Rating Prediction Performance of Unexpectedness Method . . . 114
4.4 Item Prediction Performance of Unexpectedness Method . . . 117
4.5 Diversity Performance of Unexpectedness Method . . . 122
5.1 Locations and Corresponding Number of Venues . . . 139
5.2 Coefficient Estimates of Logit Model . . . 151
5.3 Coefficient Estimates of Nested Logit Model . . . 153
5.4 Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects . . . 154
5.5 In-Sample Validation of Nested Logit Model with Alternative-level Fixed effects . . . 155
5.6 Out-of-Sample Validation of Nested Logit Model with Alternative-level Fixed effects . . . 155
5.7 In-Sample Validation of Logit Model . . . 156
5.8 Out-of-Sample Validation of Logit Model . . . 157
5.9 In-Sample Validation of Nested Logit Model . . . 157
5.10 Out-of-Sample Validation of Nested Logit Model . . . 157
5.11 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Price . . . 160
5.12 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Marketing Promotions . . . 161
5.13 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Novelty . . . 162
5.14 Moderating Effect of Item Attributes on Effectiveness of Recommendations – Popularity . . . 164
5.15 Moderating Effect of Context on Effectiveness of Recommendations – Public Holidays . . . 165
5.16 Moderating Effect of Context on Effectiveness of Recommendations – Temperature . . . 166
5.17 Coefficient Estimates of Nested Logit Model (Sub-sample Analysis) . . . 168
5.18 Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects (Sub-sample Analysis) . . . 169
5.19 Coefficient Estimates of Logit Model with Random Coefficients . . . 170
5.20 Coefficient Estimates of Nested Logit Model with Random Coefficients . . . 171
5.21 Coefficient Estimates of Nested Logit Model with Instrumental Variables . . . 173
5.22 Coefficient Estimates of Nested Logit Model with Instrumental Variables . . . 174
5.23 Falsification Check (Pseudo-recommendations) . . . 175
5.24 Falsification Check (Pseudo-timing of recommendations) . . . 175
B.1 Weighted Percentile Methods: Item Prediction Accuracy . . . 203
B.2 Weighted Percentile Methods: Catalog Coverage Performance . . . 205
LIST OF ABBREVIATIONS
AMZ Amazon
BC BookCrossing
CF Collaborative Filtering
HCI Human-Computer Interaction
IR Information Retrieval
IS Information Systems
k-BN k Better Neighbors
k-FN k Furthest Neighbors
k-NN k Nearest Neighbors
k-PN k Probabilistic Neighbors
MAD Mean-Absolute Deviation
MAPE Mean-Absolute-Percent Error
MF Matrix Factorization
ML MovieLens
MSE Mean-Square Error
MT MovieTweetings
O2O Online-to-offline
RMSE Root-Mean-Square Error
RS Recommender System
US United States
ABSTRACT
Unexpectedness and Non-Obviousness in Recommendation Technologies and Their Impact on Consumer Decision Making
by
Panagiotis Adamopoulos
Chair: Alexander Tuzhilin
Despite the numerous benefits of personalization techniques, many current approaches
create “filter bubbles”; i.e., isolated information neighborhoods that are highly cus-
tomized based on a user’s prior activity patterns, and that sometimes significantly re-
duce cultural or ideological diversity in what a user is exposed to. These filter bubbles
limit exposure to alternative views and options, often without the user actually realiz-
ing this kind of “informational isolation” is occurring. This thesis presents a number
of studies aiming at moving recommender systems beyond the traditional paradigm
and the classical perspective of rating prediction accuracy. We contribute to existing
helpful but less explored recommendation strategies and propose new approaches aim-
ing at more useful recommendations for both users and businesses. Working towards
this direction, we discuss the studies we have conducted in this stream of research
with the goal to avoid this problem of filter bubbles and, in particular, to alleviate the
over-specialization and concentration problems in recommender systems by design-
ing techniques that deliver non-obvious, unexpected, and high quality personalized
recommendations. The overall goal of this research program is to move the focus away from even more accurate rating predictions and towards offering a holistic experience to the users. The conducted prescriptive studies are supplemented with descriptive
user behavior studies that examine the effects of the proposed type of non-obvious
(unexpected) recommendations on consumer decision-making.
In particular, we formulate the classical neighborhood-based collaborative filter-
ing method as an ensemble method, thus, allowing us to show the suboptimality of
the nearest neighbors (k-NN) approach in terms of not only over-specialization and
concentration biases but also predictive accuracy. Besides, focusing on neighborhood
selection, we propose a novel optimized neighborhood-based method (k-BN; k Better
Neighbors) and a new probabilistic neighborhood-based method (k-PN; k Probabilis-
tic Neighbors) as improvements of the standard k-NN approach alleviating some of
the most common problems of collaborative filtering recommender systems, based
on classical metrics of dispersion and diversity of recommendations as well as some
newly proposed metrics. Furthermore, we propose a concept of unexpectedness in
recommender systems illustrating the differences from the related but different terms
of novelty, serendipity, and diversity. We then operationalize unexpectedness by sug-
gesting various mechanisms for specifying the expectations of the users and proposing
a recommendation method for providing the users with non-obvious but high qual-
ity personalized recommendations that fairly match their interests based on specific
metrics of unexpectedness. Finally, we employ econometric modeling and machine
learning techniques in order to estimate the effectiveness and impact of various types of
recommendations in the mobile context on consumers’ utility and real-world demand.
Concluding this thesis, we also summarize the conclusions of the conducted studies,
discuss the limitations of our work, outline the implications of this stream of research,
and theoretically integrate our findings into the existing literature on recommender
systems by extending a current conceptual model of the effects of recommender sys-
tem use, their characteristics, and other factors on consumer decision-making.
CHAPTER I
Introduction
Personalization techniques (i.e., the design, management, and delivery of content
and business processes to users based on known, observed, and predictive informa-
tion [Meister et al., 2002]) play an important role in both businesses and society. In
many aspects of our everyday life, our choices are guided by such promising tech-
niques. However, despite the numerous benefits of these techniques, narrow-minded
personalization can create “filter bubbles” [Pariser, 2011b]: invisible and personal uni-
verses of information that might trap users into a relevance paradox confining them
to isolated information neighborhoods and restricting them from seeing or exploring
the vast array of other possibilities [Andrews, 1984; Pariser, 2011b] (also known as
“pigeonhole” problem). In particular, in an effort to overcome information overload
problems, we build systems that aim at discovering such personal universes and are
rewarded for delivering information exclusively from these universes at the expense of
providing more serendipitous information [Leopold, 2013]. Our increasing reliance on
these systems in combination with our consumption behaviors [Bakshy et al., 2015]
can then establish strong feedback loops that narrow the diversity of our choices and
limit our exposure to alternative views, disturbing the intrinsic balance of our choices
[Pariser, 2011a], as advocated by various philosophers and researchers, including Pat-
tie Maes, who developed one of the first recommender systems (RSes) [Shardanand
and Maes, 1995; Thompson, 2008]. Due to the prevalence of such simplistic person-
alization approaches that create filter bubbles and in view of the importance of the
implications of these over-specialization problems, which in specific aspects of our ev-
eryday lives are associated with even adopting more extreme attitudes [Stroud, 2008]
and misperceiving facts about events [Kull et al., 2003], there is growing scientific
interest in this phenomenon (e.g., [Bakshy et al., 2015]).
At the same time, in a pursuit of relevance, common personalization techniques
disproportionately take into consideration and amplify the popularity of available
choices and options. Because of that, and also due to certain statistical biases [Pradel
et al., 2012; Radovanovic et al., 2010], such personalization techniques are character-
ized by a severe concentration bias and tend to amplify a “rich-get-richer” effect for
already popular options, at the expense of the long-tail (also known as “blockbuster”
problem) (e.g., [Adomavicius and Kwon, 2011, 2012; Evans, 2008; Radovanovic et al.,
2010]). As a result, common personalization techniques that serve such purposes
guide our choices towards common and frequent consumption patterns. In other
words, naive algorithms and personalization techniques can create commonalities
among these emerging filter bubbles, making them more similar to each other over time; such concentration biases can hence be thought of, metaphorically, as a “gravitational force” grouping together and shrinking filter bubbles. Thus, narrow-minded personalization techniques can have additional detri-
mental effects, such as deconstructing non-prevailing views, opinions, and behaviors
(e.g., [Evans, 2008]).
Online recommender systems are one family of personalization technologies and information systems for which there is initial empirical evidence that they often suffer from these effects [Nguyen et al., 2014]. Over the last two decades, a wide variety of
different types of recommender systems (RSes) has been developed and successfully
used across several domains [Adomavicius and Tuzhilin, 2005]. During this time,
many researchers have focused mainly on the development and improvement of effi-
cient algorithms for more accurate rating prediction. Although the recommendations
of the latest class of systems are significantly more accurate than they used to be a
decade ago [Bell et al., 2009] and the broad social and business acceptance of RSes
has already been achieved, there is still a long way to go in terms of satisfaction of the
actual needs of the users [Konstan and Riedl, 2012]. This is due, primarily, to the fact
that many existing RSes focus on providing even more accurate rather than more use-
ful recommendations. The key under-explored dimensions for further improvement
include the (perceived) usefulness of random stimuli, diverse viewpoints, attitude-
challenging information, or cross-cutting content in recommendations [Adamopoulos,
2014c]. Instead, common recommenders, such as collaborative filtering (CF) algo-
rithms, recommend products based on prior sales and ratings. Hence, they tend not
to recommend products with limited historical data, even if these items would be
rated favorably. Thus, these recommenders can create a rich-get-richer effect for pop-
ular items while this concentration bias can prevent what may otherwise be better
consumer-product matches [Fleder and Hosanagar, 2009]. This phenomenon leads
to commonalities in exposure, experiences, and selected choices among the different
users and is related to the previously discussed metaphor of “gravitational force”.
At the same time, common RSes usually recommend items very similar to what the
users have already purchased or liked in the past [Abbassi et al., 2009]. However, this
over-specialization of recommendations enhances the aforementioned phenomenon of
filter bubbles by not expanding, or even restricting, users’ exposure to more diverse
and non-obvious options. Nevertheless, the over-specialization and concentration bi-
ases of popular personalization technologies are also often inconsistent with users’
preferences and needs, business goals, and social welfare (e.g., [Evans, 2008; Sinha
and Swearingen, 2001]).
Moving beyond the classical perspective of the rating prediction accuracy, the
main objective of this stream of research is to contribute to existing helpful but less
explored paradigms of RSes as well as to propose new approaches that will result in
more useful recommendations for both users and businesses. Working towards this
direction, we discuss the studies we have conducted towards alleviating the problems
of over-specialization and concentration biases in recommender systems by designing
techniques that deliver to the users non-obvious, diverse, unexpected, and, at the
same time, high quality personalized recommendations, which could potentially ex-
pand users’ exposure and choices. We focus this research program on the front-end
of the “design - solution - perceptions - intentions - behavior” causal chain and we
move the focus away from even more accurate rating predictions and aim at offering a
holistic experience to the users. The conducted prescriptive design science studies
are supplemented with descriptive (explanatory) user behavior studies that examine
the effects of the proposed type of recommendations on consumer decision-making.
Finally, we theoretically integrate our findings into the current literature on RSes by
extending a current conceptual model of the effects of RS use, RS characteristics, and
other factors on consumer decision-making. The discussed impact of RSes on con-
sumers’ decision-making and choices can be supported through the lenses of various
IS theories, including theories of human information processing as well as theories of
satisfaction.
In detail, in Chapter II we first provide a brief survey of the related work on over-
specialization and concentration biases of recommendations as well as the concepts of
novelty, serendipity, diversity, and unexpectedness that offer the potential to alleviate
the aforementioned problems, then present the related work assessing the value of
recommendations for businesses and, finally, we provide a brief overview of the main
IS theories regarding the impact of RSes on consumer decision-making.
Focusing on the problems of over-specialization and concentration biases, in Chap-
ter III we formulate the classical neighborhood-based collaborative filtering method,
k nearest neighbors (k-NN), as an ensemble method, thus, allowing us to show the
suboptimality of the k-NN approach in terms of not only over-specialization and
concentration of recommendations but also predictive accuracy. Besides, focusing on
neighborhood selection, we propose a novel optimized neighborhood-based method (k-
BN; k Better Neighbors) and a new probabilistic neighborhood-based method (k-PN;
k Probabilistic Neighbors) as improvements of the standard k-NN approach alle-
viating some of the most common problems of collaborative filtering recommender
systems, based on classical metrics of dispersion and diversity as well as some newly
proposed metrics.
Another key dimension for significant improvement is the concept of unexpected-
ness. In Chapter IV, we propose a method to improve user satisfaction by generating
unexpected recommendations based on the utility theory of economics. In particular,
we propose a new concept of unexpectedness in RSes as recommending to users those
items that depart from what they expect from the system. We define and formalize
the concept of unexpectedness and discuss how it differs from the related notions of
novelty, serendipity, and diversity. Besides, we suggest several mechanisms for speci-
fying the users’ expectations and propose specific performance metrics to measure the
unexpectedness of recommendation lists. We also take into consideration the qual-
ity of recommendations using certain utility functions and present an algorithm for
providing the users with unexpected recommendations of high quality that are hard
to discover but fairly match their interests. Last but not least, we conduct several
experiments on “real-world” data sets to compare our recommendation results with
several other baseline methods. The proposed approach outperforms these baseline
methods in terms of unexpectedness and other important metrics, such as coverage,
aggregate diversity, and dispersion, while avoiding any accuracy loss.
Chapter V employs econometric modeling and machine learning techniques in or-
der to estimate the impact of recommendations in the mobile context on consumers’
utility and real-world demand. This chapter delves further into the differences in
effectiveness of recommendations and examines this heterogeneity by considering the
moderating effect of various item attributes and contextual factors in order to gain
a more detailed understanding of the effectiveness of the various types of recommen-
dations. We also validate the robustness of our findings using multiple econometric
specifications as well as instrumental variable methods with instruments based on a
machine-learning model employing deep-learning techniques.
Finally, in Chapter VI we summarize the conclusions of the conducted studies
and discuss their theoretical and managerial implications along with the limitations
of our work.
CHAPTER II
Related Work
In this section, we first discuss the extant work on the antecedents and conse-
quences of the problems of over-specialization and concentration biases of recommen-
dations as well as existing approaches aiming at alleviating these problems in RSes.
We then present the related work assessing the value of recommendations for busi-
nesses and their impact on product demand. Finally, we provide a brief overview
of the main IS theories regarding the impact of RSes on consumer decision-making
processes and outcomes.
Since the first collaborative filtering systems were introduced in the mid-90’s
[Goldberg et al., 1992; Konstan et al., 1997], there have been many attempts to im-
prove their performance focusing mainly on rating prediction accuracy [Desrosiers and
Karypis, 2011; Koren and Bell, 2011]. Common approaches include rating normaliza-
tion [Desrosiers and Karypis, 2011] (based on mean-centering [Resnick et al., 1994]
and Z-score [Herlocker et al., 1999]), similarity weighting of neighbors [Said et al.,
2012a] (accounting for significance [Herlocker et al., 2002; Bell et al., 2007] and vari-
ance [Jin et al., 2004]), and neighborhood selection, using top-N filtering, threshold
filtering, or negative filtering [Herlocker et al., 1999; Desrosiers and Karypis, 2011].
Even though the rating prediction perspective is the prevailing paradigm in rec-
ommender systems, there are other perspectives that have been gaining significant
attention in this field [Jannach et al., 2010] and try to alleviate the problems pertain-
ing to the narrow rating prediction focus [Adamopoulos, 2013a, 2014b]. This narrow
focus has been evident in laboratory studies and real-world online experiments, which
indicated that higher predictive accuracy does not always correspond to higher lev-
els of user-perceived quality or to increased sales [McNee et al., 2006; Jannach and
Hegelich, 2009; Jannach et al., 2013; Cremonesi et al., 2011]. Some streams of research
that aim to improve recommender systems going beyond rating prediction accuracy
include work on human-computer interaction (HCI) [Yoo et al., 2013], which involves
the study and design of the interaction between users and RSes, and explanations for recommendations [Tintarev and Masthoff, 2011], which provide transparency into the workings of the recommendation process by exposing the reasoning and data behind each
recommendation. Besides, other approaches pertain to diversification [Adomavicius
and Kwon, 2012; Zhou et al., 2010], which maximizes the variety of items in a rec-
ommendation list, group recommenders [O’Connor et al., 2002], which recommend
items for groups of people, rather than individuals, and recommendation sequences
[Masthoff, 2011], where sequences of ordered items are recommended instead of single
items.
In this dissertation, we focus on two of the most important problems related to
this narrow focus of many RSes that hinder user satisfaction (i.e., the problems of
over-specialization and concentration biases of recommendations) and we propose the
concept of unexpectedness that can alleviate these problems while improving both user
satisfaction and business outcomes.
2.1 Over-specialization of Recommendations
The problem of over-specialization pertains to the observation that common RSes
usually recommend items very similar to what the users have already purchased or
liked in the past [Abbassi et al., 2009]. However, this over-specialization of recom-
mendations is often inconsistent with sales goals and consumers’ preferences. For
instance, Ghose et al. [2012b] provided empirical evidence that indeed consumers
prefer diversity in ranking results. This problem is often practically addressed by
injecting randomness in the recommendation procedure [Balabanovic and Shoham,
1997], filtering out items which are too similar to items the user has rated in the past
[Billsus and Pazzani, 2000], or increasing the diversity of recommendations [Ziegler
et al., 2005]. Interestingly, Said et al. [2012b, 2013] presented an inverted neigh-
borhood model, k-furthest neighbors, to identify less ordinary neighborhoods for the
purpose of creating more diverse recommendations by recommending items disliked
by the least similar users.
2.2 Concentration Bias of Recommendations
Similarly, CF algorithms tend not to recommend products with limited historical
data, even if these items would be rated favorably. Aiming at verifying and measuring
over-specialization bias, Nguyen et al. [2014], employing a longitudinal data set that
represents users’ interactions with RSes and consumption of information items, find
that RSes indeed expose the users to narrowing sets of items over time. Thus, these
recommenders can create a rich-get-richer effect for popular items while this concen-
tration bias can prevent what may otherwise be better consumer-product matches
[Fleder and Hosanagar, 2009]. Studying the concentration bias of recommendations,
Jannach et al. [2013] compared different RS algorithms with respect to aggregate
diversity and their tendency to focus on certain parts of the product spectrum and
showed that popular algorithms may lead to an undesired popularity boost of already
popular items. Finally, Fleder and Hosanagar [2009] showed that this concentration
bias, in contrast to the potential goal of RSes to promote long-tail items, can create a
rich-get-richer effect for popular products leading to a subsequent reduction in profits
and sales diversity and suggested that better RS designs which limit popularity ef-
fects and promote exploration are still needed. However, Hinz et al. [2011] and Matt
et al. [2013] maintain that whether over-specialization and concentration biases will
be enhanced or alleviated depends on the applied personalization technology.
Furthermore, in the past, some researchers working on the over-specialization and
concentration biases of recommender systems tried to provide alternative definitions
of unexpectedness and various related but still different concepts, such as novelty,
diversity, and serendipity. In the following paragraphs, we discuss the aforementioned
concepts and corresponding approaches.
2.3 Novelty of Recommendations
In particular, novel recommendations are recommendations of those items that the
user did not know about [Konstan et al., 2006]. Hijikata et al. [2009] use collaborative
filtering to derive novel recommendations by explicitly asking users what items they
already know. Besides, Weng et al. [2007] suggest a taxonomy-based RS that utilizes
hot topic detection using association rules to improve novelty and quality of recom-
mendations, whereas Zhang and Hurley [2009] propose to enhance novelty at a small
cost to overall accuracy by partitioning the user profile into clusters of similar items
and composing the recommendation list of items that match well with each cluster,
rather than with the entire user profile. Besides, Celma and Herrera [2008] analyze
the item-based recommendation network to detect whether its intrinsic topology has
a pathology that hinders long-tail novel recommendations and Nakatsuji et al. [2010]
define and measure novelty as the smallest distance from the class the user accessed
before to the class that includes target items over the taxonomy.
2.4 Serendipity of Recommendations
Moreover, serendipity, the most closely related concept to unexpectedness, in-
volves a positive emotional response of the user about a previously unknown (novel)
item and measures how surprising these recommendations are [Shani and Gunawar-
dana, 2011]; serendipitous recommendations are, by definition, also novel. However,
a serendipitous recommendation involves an item that the user would not be likely
to discover otherwise, whereas the user might autonomously discover novel items.
Iaquinta et al. [2008] propose to enhance serendipity by recommending novel items
whose description is semantically far from users’ profiles and Kawamae et al. [2009;
2010] suggest an algorithm for recommending novel items based on the assumption
that users follow earlier adopters who have demonstrated similar preferences. In
addition, Sugiyama and Kan [2011] propose a method for recommending scholarly
papers utilizing dissimilar users and co-authors to construct the profile of the target
researcher. Also, Andre et al. [2009] examine the potential for serendipity in web
search and suggest that information about personal interests and behavior may be
used to support serendipity.
2.5 Diversity of Recommendations
Furthermore, diversification is defined as the process of maximizing the variety
of items in a recommendation list. Most of the literature in RSes and Information
Retrieval (IR) studies the principle of diversity to improve user satisfaction. Typical
approaches replace items in the derived recommendation lists to minimize similarity
among all items or remove “obvious” items from them as in [Billsus and Pazzani,
2000]. Ziegler et al. [2005] propose a similarity metric using a taxonomy-based clas-
sification and use this to assess the topical diversity of recommendation lists. They
also provide a heuristic algorithm to increase the diversity of the recommendation list.
Then, Zhang and Hurley [2008] focus on intra-list diversity and address the problem
as the joint optimization of two objective functions reflecting preference similarity
and item diversity, and Hurley and Zhang [2011] formulate the trade-off between di-
versity and matching quality as a binary optimization problem. Besides, Wang and
Zhu [2009], inspired by the modern portfolio theory in financial markets, suggest an
algorithm that generalizes the probability ranking principle by considering both the
uncertainty of relevance predictions and correlations between retrieved documents.
Also, Said et al. [2012b] suggest an inverted nearest neighbor model and recommend
items disliked by the least similar users.
Following a different direction, McSherry [2002] investigates the conditions in
which similarity can be increased without loss of diversity and presents an approach
to retrieval that is designed to deliver such similarity-preserving increases in diversity.
In addition, Zhang et al. [2012] propose a collection of algorithms to simultaneously
increase novelty, diversity, and serendipity, at a slight cost to accuracy, and Zhou
et al. [2010] suggest a hybrid algorithm that, without relying on any semantic or
context-specific information, simultaneously gains in both accuracy and diversity of
recommendations.
In other streams of research, Panniello et al. [2009] compare several contextual pre-
filtering, post-filtering, and contextual modeling methods in terms of accuracy and
diversity of their recommendations to determine which methods outperform others
and under which circumstances. Considering how to measure diversity, Castells et al.
[2011] and Vargas and Castells [2011] aim to cover and generalize the metrics reported
in the RS literature [Zhang and Hurley, 2008; Zhou et al., 2010; Ziegler et al., 2005],
and derive new ones. They suggest novelty and diversity metric schemes that take into
consideration item position and relevance through a probabilistic recommendation-
browsing model.
Besides, other researchers studied the importance of personalization and users’
perception in diversity. In particular, Hu and Pu [2011] investigate design issues that
can enhance users’ perception of recommendation diversity and improve users’ satis-
faction, and Ge et al. [2012] show that the perceived diversity of a recommendation
list depends on the placement of diverse items. Further, Vargas et al. [2012] sug-
gest that the combination of personalization and diversification achieves competitive
performance, improving upon the baseline, plain personalization, and plain diversification
approaches in terms of both diversity and accuracy measures, Shi et al. [2012] ar-
gue that the diversification level in a recommendation list should be adapted to the
target users’ individual situations and needs, and propose a framework to adaptively
diversify recommendation results for individual users based on latent factor models,
while Chen et al. [2013] explore the impact of personality values on users’ needs for
recommendation diversity.
Lastly, examining similar but yet different concepts of diversity, Adomavicius and
Kwon [2009, 2012] propose the concept of aggregated diversity as the ability of a
system to recommend across all users as many different items as possible while keeping
accuracy loss to a minimum, by a controlled promotion of less popular items towards
the top of the recommendation lists. Also, Lathia et al. [2010] consider the concept
of temporal diversity, the diversity in the sequence of recommendation lists produced
over time. Finally, Jannach et al. [2013] compare different recommender systems
algorithms with respect to aggregate diversity and their tendency of focusing on
certain parts of the product spectrum and maintain that some popular algorithms may
lead to an undesired popularity boost of already popular items, whereas Bellogín et al.
[2012] present a comparative study on the influence that different types of information
available in social systems have on item recommendation, aiming to identify which
sources of user interest evidence are more effective to achieve useful recommendations.
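To make the diversity notions discussed in this section concrete, the following is a minimal Python sketch (with illustrative function names and toy data that are not drawn from any of the cited implementations) that computes aggregate diversity as the number of distinct items recommended across all users, together with a simple average pairwise intra-list diversity for a single recommendation list:

from itertools import combinations

def aggregate_diversity(rec_lists):
    # Number of distinct items recommended across all users
    # (the "aggregate diversity" notion discussed above).
    return len({item for items in rec_lists.values() for item in items})

def intra_list_diversity(items, distance):
    # Average pairwise distance among the items of one recommendation list
    # (a common individual, intra-list diversity measure).
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(distance(i, j) for i, j in pairs) / len(pairs)

# Illustrative usage with made-up recommendation lists and a toy 0/1 distance.
recs = {"u1": ["a", "b", "c"], "u2": ["a", "d", "e"]}
print(aggregate_diversity(recs))                                     # 5 distinct items
print(intra_list_diversity(recs["u1"], lambda i, j: float(i != j)))  # 1.0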
2.6 Unexpectedness of Recommendations
Pertaining to unexpectedness, in the field of knowledge discovery, Silberschatz and
Tuzhilin [1996]; Berger and Tuzhilin [1998]; Padmanabhan and Tuzhilin [1998, 2000,
2006] propose a characterization relative to the system of prior domain beliefs and
develop efficient algorithms for the discovery of unexpected patterns, which com-
bine the independent concepts of unexpectedness and minimality of patterns. Also,
Kontonasios et al. [2012] survey different methods for assessing the unexpectedness
of patterns focusing on frequent item sets, tiles, association rules, and classification
rules.
In the field of recommender systems, Murakami et al. [2008] and Ge et al. [2010]
suggest both a definition of unexpectedness as the difference in predictions between two algorithms (i.e., the deviation of a recommender system from the results obtained from a primitive prediction model that shows high ratability) and corresponding metrics
for evaluating this system-centric notion of unexpectedness. Besides, Akiyama et al.
[2010] propose unexpectedness as a general metric that does not depend on a user’s
record and only involves an unlikely combination of item features.
In the next paragraphs, we present the related work assessing the value of recom-
mendations for businesses as well as an overview of the main IS theories regarding
the impact of RSes on consumer decision-making processes and outcomes.
2.7 Business Value of Recommendations
Prior work has also examined the business value of recommender systems and their
effect on demand levels. Studying the effects of recommender systems on aggregate
demand and markets, Fleder and Hosanagar [2009] show analytically that RSes can
lead to a reduction in aggregate sales diversity, creating a rich-get-richer effect for
popular products and preventing what may otherwise be better consumer-product
matches. However, Brynjolfsson et al. [2011] provide empirical evidence that RSes are
associated with an increase in niche products, reflecting lower search costs in addition
to the increased product availability and corroborating the findings of [Pathak et al.,
2010] regarding the heterogenizing effects of RSes.
Focusing on the impact of recommender systems on demand levels for individual
products, Tintarev et al. [2010] conduct a user study with 21 subjects and find that
RSes can increase the demand levels, especially for long tail items. Oestreicher-Singer
and Sundararajan [2012b] study how the explicit visibility of related-product networks
can influence the demand for products in such networks and find that complementary
products have significant influence on each other’s demand. They also find that
newer and more popular products benefit more from the attention they garner from
their network position in such related-product networks. In contrast, Chen et al.
[2004] find that such network-based recommendations in a desktop setting are more
effective for less-popular books. Similarly, Pathak et al. [2010], examining a desktop
recommender for item-to-item networks of books and focusing on 156 top-selling books
on Amazon.com, find that the impact of the strength of recommendations on sales
rank is moderated by the recency effect.
Our work is also related to the extant literature estimating the business value of
multi-dimensional data sets in context-based recommender systems. Adamopoulos
and Tuzhilin [2014a] propose a method for estimating the expected economic value
of multi-dimensional data sets in RSes and illustrate the proposed approach using
a unique data set combining implicit and explicit ratings with rich content, spatio-
temporal contextual dimensions, and social network profiles [Adamopoulos, 2014a].
This approach can lead to better and more profitable managerial decisions as well as
more useful evaluation metrics.
2.8 Effects of Recommender Systems Usage and Characteristics on Consumer Decision-Making and Choices
Finally, the impact of common RSes on consumers’ decision-making and choices
has been demonstrated in various empirical settings and has found theoretical sup-
port through the lenses of various IS theories. In particular, one set of theories that
provide support for such findings is that of human information processing. For in-
stance, Shafir et al. [1993b] maintain that people evaluate alternatives by comparing
them separately on distinct dimensions and that relationships among alternatives
may be perceived to be more compelling reasons or arguments for choice than de-
riving overall values for each alternative and choosing the alternative with the best
value. Because such differences (e.g., whether an alternative is recommended, ranking
in recommendation list, etc.) can be perceived with little effort, relationships among
alternatives may be used to make choices even in situations with simple alternatives
and even if they do not provide good justifications [Bettman et al., 1998]. Another
explanation is that the recommended alternatives maximize the ease of justifying
consumers’ decisions. This explanation is becoming even more important in the case
of collective decisions (e.g., restaurant selection) and group recommenders; Shafir
et al. [1993a] have demonstrated that decision makers often construct reasons in or-
der to justify a decision to themselves (i.e., increase their confidence in the decision)
and/or justify (explain) their decision to others. Additionally, another explanation is
that the recommended alternatives are simply becoming more salient and hence are
more frequently selected by the consumers either deliberately (e.g., lower search costs
[Wilde, 1980]) or inadvertently since they capture their attention. Finally, a utili-
tarian explanation of the positive and significant effect of recommendations on the
demand levels for the recommended products is that consumers might perceive the
recommendations as endorsements of the candidate items from the RS or as another
reputation dimension (in addition to the item rating, consumer reviews, etc.). Apart
from the aforementioned theoretical perspectives, there is also significant empirical
evidence that RSes have indeed great impact on users’ choices and that this influence
is greater than the influence of peers and experts [Senecal and Nantel, 2004].
Focusing now on the potential differences in effects between the proposed type of
recommendations and the standard recommendation types and, especially, the effec-
tiveness of unexpected and non-obvious recommendations, based on the perspective of
the IS success model [DeLone and McLean, 1992] and, in particular, the components
of usefulness and uniqueness of information, which affect user satisfaction through
the construct of information quality, the proposed type of recommendations can sig-
nificantly increase the effectiveness of RSes as well as user evaluation of RSes and
consumer decision-making. In addition, the expectation-disconfirmation paradigm
[McKinney et al., 2002; Olson and Dover, 1979] postulates that when positive discon-
firmation (i.e., when a customer’s evaluations of system or product performance are
different from his or her pre-trial expectations about the product or system) occurs, it results in enhanced user satisfaction. However, there is no empirical study that has
directly investigated the effects of expectation (dis)confirmation on satisfaction with
RSes. On the contrary, the theory of interpersonal similarity [Byrne and Griffitt, 1969;
Levin et al., 2002; Lichtenthal and Tellefsen, 2001; Zucker, 1986] postulates that the
greater the degree of similarity between two parties, the greater the attraction will
be, resulting in increased user satisfaction. Hence, apart from designing methods that
enhance the unexpectedness and non-obviousness of recommendations, we also pro-
pose to examine empirically whether unexpected and non-obvious recommendations
can enhance the effectiveness and user evaluation of recommender systems.
CHAPTER III
Neighborhood Selection in Collaborative Filtering
Systems
“I don’t need a friend who changes when I change and who nods when I
nod; my shadow does that much better.”
- Plutarch, 46 - 120 AD
3.1 Introduction to Neighborhood Selection
Although a wide variety of different types of recommender systems (RSes) has been
developed and used across several domains over the last 20 years [Adomavicius and
Tuzhilin, 2005], the classical user-based k-NN collaborative filtering (CF) method still
remains one of the most popular and prominent methods used in the recommender
systems community [Jannach et al., 2010].
Neighborhood-based collaborative filtering recommendation methods predict any
unknown ratings using the existing ratings given by/to the most similar users/items,
called nearest neighbors. It is often assumed that selecting the k most similar neigh-
bors results in the best performance the standard collaborative filtering approach
can produce. However, some investigations show that it is possible to select other
neighbors than the most similar that outperform the standard collaborative filtering
approach [Adamopoulos and Tuzhilin, 2013b]. This is a reflection of the fact that
some of the most similar neighbors have a detrimental effect on the accuracy of the
predictions and should actually be replaced. As a matter of fact, the complementarity of the neighbors that are selected is a key factor in the predictive accuracy of this approach. Hence, the problem of selecting an optimal neighborhood can be shown to be NP-hard, and its exact solution by exhaustive exploration is infeasible for typical applications that involve
a large number of users and items.
This observation and the employed theoretical framework can help answer a number of interesting research questions, such as:
• What is the optimal size k for a neighborhood?
• How can this size k be dynamically estimated for each user and item?
• Who are the optimal neighbors?
• What are the optimal weights for the selected neighbors?
At the same time, apart from enhancing the predictive performance of neighborhood-
based methods, by selecting users other than the most similar ones to the target user,
we can alleviate the important problems of over-specialization and concentration bi-
ases and hence enhance the usefulness of collaborative filtering RSes. In particular,
the proposed approaches have the potential to provide personalized recommendations
from a wide range of items in order to escape the obvious and expected recommenda-
tions, while avoiding predictive accuracy loss. The key intuition for this is three-fold.
First, using the neighborhood with the most similar users to estimate unknown rat-
ings and recommend candidate items, the generated recommendation lists usually
consist of known items with which the users are already familiar. Second, because
of the multidimensionality of user preferences, there are many items that the target
user may like and are unknown to her k most similar users. Third, selecting very
similar neighbors might have a detrimental effect on the performance of the model
since such neighbors tend to capture the same predictive signals and information.
In this chapter, we present certain variations of the classical k-NN method. We
propose a method for optimized neighborhood selection (k-BN; k Better Neighbors)
in collaborative filtering recommender systems that addresses the problem of identify-
ing neighborhoods closer to the optimal ones. In addition, we present a probabilistic
neighborhood selection approach (k-PN; k Probabilistic Neighbors) in which the es-
timation of an unknown rating of the user for an item is based not on the weighted
averages of the k most similar (nearest) neighbors but on k probabilistically selected
neighbors.
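As a rough illustration of this idea (the specific sampling distributions used by k-PN are presented later in Section 3.3.3 and Table 3.1), a minimal Python sketch could sample the k candidate neighbors without replacement, with selection probabilities increasing in their similarity to the target user; the exponential weighting and the parameter names below are illustrative assumptions rather than the actual k-PN specification:

import numpy as np

def sample_neighborhood(similarities, k, temperature=1.0, rng=None):
    # Sample k candidate neighbors (as indices) without replacement,
    # with probabilities increasing in similarity to the target user.
    # The exponential (softmax-style) weighting is an illustrative choice,
    # not one of the specific distributions evaluated in this chapter.
    rng = rng or np.random.default_rng()
    sims = np.asarray(similarities, dtype=float)
    weights = np.exp(sims / temperature)
    probs = weights / weights.sum()
    return rng.choice(len(sims), size=min(k, len(sims)), replace=False, p=probs)

# Illustrative usage: six candidate neighbors, three of them sampled probabilistically;
# unlike k-NN, the selected set varies across calls and can include less similar users.
print(sample_neighborhood([0.9, 0.7, 0.2, 0.1, 0.05, -0.3], k=3))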
To evaluate the proposed probabilistic approach, we conduct an empirical study showing that, by selecting diverse representative neighborhoods, the proposed methods generate recommendations that are very different from the classical CF ap-
proaches and alleviate the over-specialization and concentration problems while out-
performing k-NN, k-FN [Said et al., 2012b, 2013], and matrix factorization methods.
We also demonstrate that the proposed methods outperform, by a wide margin in
most cases, both the standard k-nearest neighbors and the k-furthest neighbors ap-
proaches in terms of both item prediction accuracy and utility-based ranking. The
experimental results are also in accordance with the phenomenon of “hubness” and
the ensemble learning theory that we employ in the neighborhood-based CF frame-
work. Besides, we show that the performance improvement is not achieved at the
expense of other popular performance measures, such as catalog coverage, aggregate
diversity, and diversity reinforcement.
In summary, the main contributions of the studies presented in this chapter are:
• We formulated the classical neighborhood-based collaborative filtering method
as an ensemble method, thus, allowing us to show the potential suboptimality
of the k-NN approach in terms of predictive accuracy.
• We proposed a new optimized neighborhood-based method (k-BN; k Better
Neighbors) as an improvement of the standard k-NN approach.
• We proposed a new probabilistic neighborhood-based method (k-PN; k Proba-
bilistic Neighbors) as an improvement of the standard k-NN approach.
• We empirically showed that the proposed methods outperform, by a wide mar-
gin, the classical collaborative filtering algorithm and practically illustrated its
suboptimality in addition to providing a theoretical justification of this empir-
ical observation.
• We showed that the proposed methods alleviate the common problems of over-
specialization and concentration biases of recommendations in terms of various
popular metrics and a newly proposed metric that measures the diversity rein-
forcement of recommendations.
• We identified a particular implementation of the k-PN method that performs
consistently well across various experimental settings.
• We illustrated that, in most cases, the k-BN method outperforms the k-PN
method.
3.2 Related Approaches
Different ways of selecting a number of candidate neighbors and forming the
neighborhood Ni(u) have been proposed in the literature. Many approaches cluster
the set of users (or items) in order to improve the scalability and accuracy of recom-
mender systems [O’Connor and Herlocker, 1999; Sarwar et al., 2002; Xue et al., 2005].
For instance, Bellogín and Parapar [2012] use a spectral clustering technique, Normal-
ized Cut, in order to derive a cluster-based collaborative filtering algorithm and frame
this technique as a method for neighbor selection in user-based collaborative filtering
recommender systems. This method clusters the users in the collection by finding
the optimal cut of the computed graph, where Pearson similarity is used to weight
the edges between items. Then, it forms the neighborhood from those users
who belong to the same cluster as the target user. In some cases, additional external
information can also be used either for the clustering of users/items or directly for
the neighborhood selection. For instance, using the concept of trust, many approaches
select only the most trustworthy users. This concept of trust, apart from external
information, can also be based on some trust metrics [O’Donovan and Smyth, 2005].
Similarly, Bellogín et al. [2013] propose to select neighbors according to the overlap of
their preferences with those of the target user. In particular, the authors propose an
overlap-based filtering in which the users who have more preferred items in common
with the target user are selected as neighbors and they investigate the consideration
of the above principle as the single criterion for neighbor selection, after which the
overlap is no longer taken into account (neither in the user similarity function, nor
in any posterior user weighting). Finally, building on this idea, Bellogín et al. [2013]
use relevance-based language models from information retrieval in order to identify
neighbors in recommendations. Such Relevance Models (RM) are formulated in text
IR on a triadic space (query, documents, words), whereas the CF space is typically
dyadic (users and items). In essence, this method tries to capture how relevant each
candidate neighbor would be to the target user. In contrast to the majority of the
related approaches, the algorithms proposed in this dissertation are based on robust
theoretical motivation and hence specific variations can be viewed as optimization
problems.
3.3 Methods for Neighborhood Selection
Collaborative filtering (CF) methods produce user specific recommendations of
items based on patterns of ratings or usage (e.g., purchases) without the need for
exogenous information about either items or users [Ricci and Shapira, 2011]. Hence,
in order to estimate unknown ratings and recommend items to users, CF systems
need to relate two fundamentally different entities: items and users.
3.3.1 Neighborhood Models
User-based neighborhood recommendation methods predict the rating ru,i of user
u for item i using the ratings given to i by users most similar to u, called nearest
neighbors and denoted by Ni(u). Taking into account the fact that the neighbors can
have different levels of similarity, wu,v, and considering the k users v with the highest
similarity to u (i.e., the standard user-based k-NN collaborative filtering approach),
the predicted rating is:
$$r_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N_i(u)} |w_{u,v}|}, \qquad (3.1)$$

where $\bar{r}_u$ is the average of the ratings given by user $u$.
ALGORITHM 1: k-NN Recommendation Algorithm
Input: User-Item Rating matrix R
Output: Recommendation lists of size l
k: Number of users in the neighborhood of user u, N_i(u)
l: Number of items recommended to user u

for each user u do
    for each item i do
        Find the k users in the neighborhood of user u, N_i(u);
        Combine ratings given to item i by neighbors N_i(u);
    end
    Recommend to user u the top-l items having the highest predicted rating r_{u,i};
end
However, the ratings given to item i by the nearest neighbors of user u can be
combined into a single estimation using various combining (or aggregating) functions
[Adomavicius and Tuzhilin, 2005]. Examples of combining functions include majority
voting, distance-moderated voting, weighted average, adjusted weighted average, and
percentiles [Adamopoulos and Tuzhilin, 2013c].
In the same way, the neighborhood used in estimating the unknown ratings and
recommending items can be formed in different ways. Instead of using the k users
with the highest similarity to the target user, any approach or procedure that selects
k of the candidate neighbors can be used, in principle.
Algorithm 1 summarizes the user-based k-nearest neighbors (k-NN) collaborative
filtering approach using a general combining function and neighborhood selection
approach.
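To make the prediction step concrete, the following minimal Python sketch implements Eq. (3.1) with a pluggable neighborhood-selection function. It assumes a dense NumPy rating matrix with NaN for unknown ratings and a precomputed user-user similarity matrix; the helper names are illustrative and not the implementation used in our experiments.

```python
import numpy as np

def predict_rating(R, S, u, i, k, select_neighbors):
    """Predict r_{u,i} as in Eq. (3.1).
    R: (users x items) rating matrix with np.nan for unknown ratings.
    S: (users x users) similarity matrix (e.g., Pearson correlation).
    select_neighbors: callable (similarities, candidates, k) -> k user indices;
        swapping this callable yields k-NN, k-FN, or the probabilistic k-PN variant."""
    rated = ~np.isnan(R[:, i])                               # users who rated item i
    candidates = np.flatnonzero(rated & (np.arange(R.shape[0]) != u))
    r_bar_u = np.nanmean(R[u])
    if candidates.size == 0:
        return r_bar_u                                       # fall back to the user's mean
    neighbors = select_neighbors(S[u], candidates, k)
    num = sum(S[u, v] * (R[v, i] - np.nanmean(R[v])) for v in neighbors)
    den = sum(abs(S[u, v]) for v in neighbors)
    return r_bar_u + num / den if den > 0 else r_bar_u

def k_nearest(similarities, candidates, k):
    """Standard k-NN selector: the k most similar candidate neighbors."""
    order = np.argsort(-similarities[candidates])
    return candidates[order[:k]]
```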
In this chapter, we propose a novel k-NN method (k-PN; k Probabilistic Neigh-
bors) using probabilistic neighborhood selection that also takes into consideration
similarity levels between the target user and the n candidate neighbors. We also
propose an optimized neighborhood selection method (k-BN; k Better Neighbors) in
collaborative filtering recommender systems.
3.3.2 Theoretical Motivation
In this section, we present the theoretical motivation for the proposed approaches
and the connections to the phenomenon of “hubness” as well as the ensemble learning
theory. In particular, we discuss a major implication of selecting just the most similar
candidates (or even all the candidates) as neighbors and we motivate how the proposed
method can alleviate the over-specialization and concentration problems without sig-
nificantly reducing, and even increasing, the predictive accuracy, demonstrating that
similar but diverse neighbors should be used in neighborhood-based methods.
It should be clear by now that selecting neighborhoods using underlying probabil-
ity distributions, instead of deterministically selecting just the k nearest neighbors,
can result in very different recommendations from those generated based on the stan-
dard neighborhood-based approaches. For the sake of brevity, we focus on the phe-
nomenon of “hubness” and the effect of selecting diverse neighbors on the predictive
accuracy of the proposed approach.
The phenomenon of “hubness” is related to a new aspect of the dimensionality
curse and affects the distribution of k-occurrences: the number of times a point oc-
curs among the k nearest neighbors of the other points in a data set, according to
some distance measure [Radovanovic et al., 2010]. This distribution becomes con-
siderably skewed as dimensionality increases, causing the emergence of hubs, that is,
points which appear in many more k-NN lists than other points, effectively making
them “popular” nearest neighbors. This is an inherent property that depends on
the intrinsic, rather than embedding, dimensionality of data and, thus, dimensional-
ity reduction techniques, such as matrix factorization, do not alleviate the problem
effectively. For the same reason “hubness” occurs even for small values of k and
for all cosine-like measures, such as Pearson correlation, cosine similarity, and ad-
justed cosine. Besides, “hubness” is unrelated to other data properties like sparsity
or skewness of the distribution of ratings [Nanopoulos et al., 2009]. Nevertheless, this
phenomenon is part of the problem of concentration bias of recommendations. In
particular, Seyerlehner et al. [2009] show that hubness reduces coverage and reach-
ability, especially of long-tail items, in both content-based and CF systems. Thus,
these problems can be alleviated by selecting neighbors other than the most similar
to the target.
Moreover, in order to further theoretically motivate the proposed approach, we
focus on ensemble learning theory. In particular, for the predictive tasks of a recom-
mender system, we should construct an estimator f(x; w) that approximates an un-
known target function g(x) given a set of N training samples zN = {z1, z2, . . . , zN} =
{(x1, y1), (x2, y2), . . . , (xN , yN)}, where xi ∈ Rd, y ∈ R, and w is a weight vector; zN
is a realization of a random sequence ZN = {Z1, . . . , ZN} whose i-th component con-
sists of a random vector Zi = (Xi, Yi) and, thus, each zi = (xi, yi) is an independent
and identically distributed (i.i.d) sample from an unknown joint distribution p(x, y).
Without loss of generality, we can assume that g(xi) ∈ R; the following derivations
can be easily generalized to situations where g(xi) and y ∈ Rd′ with d′ > 1. We also
assume that there is a functional relationship between the training pair zi = (xi, yi):
$y_i = g(x_i) + \varepsilon$, where $\varepsilon$ is additive noise with zero mean ($E\{\varepsilon\} = 0$) and finite variance ($Var\{\varepsilon\} = \sigma^2 < \infty$).
Since the estimate w depends on the given zN , we should write w(zN) to clarify
this dependency. Hence, we should also write f(x; w(zN)); however for simplicity
we will write f(x; zN) as in [Geman et al., 1992]. Then, introducing a new random
vector Z0 = (X0, Y0) ∈ Rd+1, which has a distribution identical to that of Zi, but
is independent of Zi for all i, the generalization error (GErr), defined as the mean
squared error averaged over all possible realizations of $Z^N$ and $Z_0$,

$$GErr(f) = E_{Z^N}\big\{ E_{Z_0}\big\{ [\,Y_0 - f(X_0; Z^N)\,]^2 \big\} \big\},$$

can be expressed by the following "bias/variance" decomposition [Geman et al., 1992]:

$$GErr(f) = E_{X_0}\big\{ Var\{f \mid X_0\} + Bias\{f \mid X_0\}^2 \big\} + \sigma^2.$$
However, using ensemble estimators, instead of a single estimator f , we have a
collection of them: f1, f2, . . . , fk, where each fi has its own parameter vector wi and
k is the total number of estimators. The output of the ensemble estimator for some
input x can be defined as the weighted average of outputs of k estimators for x:
$$f_{ens}^{(k)}(x) = \sum_{m=1}^{k} \alpha_m f_m, \qquad (3.2)$$

where, without loss of generality, $\alpha_m > 0$ and $\sum_{m=1}^{k} \alpha_m = 1$.
Following Ueda and Nakano [1996], the generalization error of this ensemble esti-
mator is:
$$GErr(f_{ens}^{(k)}) = E_{X_0}\big\{ Var\{f_{ens}^{(k)} \mid X_0\} + Bias\{f_{ens}^{(k)} \mid X_0\}^2 \big\} + \sigma^2,$$
which can also be expressed as:

$$GErr(f_{ens}^{(k)}) = E_{X_0}\Bigg\{ \Bigg[ \sum_{m=1}^{k} a_m^2\, E_{Z^N_{(m)}}\!\big[(f_m - E_{Z^N_{(m)}}(f_m))^2\big] + \sum_{m}\sum_{i \neq m} a_m a_i\, E_{Z^N_{(m)},\,Z^N_{(i)}}\!\big\{[f_m - E_{Z^N_{(m)}}(f_m)][f_i - E_{Z^N_{(i)}}(f_i)]\big\} \Bigg] + \Bigg[ \sum_{m=1}^{k} a_m\, E_{Z^N_{(m)}}(f_m - g) \Bigg]^2 \Bigg\} + \sigma^2,$$

where the term $E_{Z^N_{(m)},\,Z^N_{(i)}}\{[f_m - E_{Z^N_{(m)}}(f_m)][f_i - E_{Z^N_{(i)}}(f_i)]\}$ corresponds to the pairwise covariance of the estimators $m$ and $i$, $Cov\{f_m, f_i \mid X_0\}$.
The results can also be extended to the following equation:
$$GErr(f_{ens}^{(k)}) = E_{X_0}\Bigg\{ \sum_{m=1}^{k} \sum_{i=1}^{k} a_m a_i\, C_{mi} \Bigg\}, \qquad (3.3)$$

where $C_{ij}$ indicates the $i,j$ component of the symmetric correlation matrix $C$. The $i,j$ component of matrix $C$ is given by:

$$C_{ij} = \begin{cases} Var\{f_i \mid X_0\} + Bias\{f_i \mid X_0\}^2, & \text{if } i = j \\ Cov\{f_i, f_j \mid X_0\} + Bias\{f_i \mid X_0\}\,Bias\{f_j \mid X_0\}, & \text{otherwise.} \end{cases}$$
Hence, in addition to the bias and variance of the individual estimators (and the
noise variance), the generalization error of an ensemble also depends on the covariance
between the individuals; an ensemble is controlled by a three-way trade-off. Thus, if
fi and fj are positively correlated, then the correlation increases the generalization
error, whereas if they are negatively correlated, then the correlation contributes to a
decrease in the generalization error.
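To illustrate the role of the covariance terms, the brief sketch below evaluates Eq. (3.3) for a hypothetical pair of estimators (the numerical values are illustrative only): with equal individual errors, positive covariance inflates the ensemble error while negative covariance reduces it.

```python
import numpy as np

def ensemble_generalization_error(C, a):
    """Evaluate Eq. (3.3): GErr = sum_m sum_i a_m a_i C_{mi}."""
    a = np.asarray(a, dtype=float)
    return float(a @ C @ a)

a = np.array([0.5, 0.5])                       # equal combination weights
C_pos = np.array([[1.0, 0.8], [0.8, 1.0]])     # positively correlated estimators
C_neg = np.array([[1.0, -0.8], [-0.8, 1.0]])   # negatively correlated estimators

print(ensemble_generalization_error(C_pos, a))  # ~0.9: positive correlation inflates the error
print(ensemble_generalization_error(C_neg, a))  # ~0.1: negative correlation reduces it
```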
In the context of neighborhood-based collaborative filtering methods in recom-
mender systems, we can think of the ith (most similar to the target user) neighbor
as corresponding to a single estimator fi that simply predicts the rating of this spe-
cific neighbor. Thus, reducing the aggregated pairwise covariance of the neighbors
(estimators) can decrease the generalization error of the model ; at the same time, it
may increase the bias or variance of the estimators and the generalization error as
well. Hence, one way to reduce the covariance is not to restrict the k estimators only
to the k nearest (most similar) neighbors but to use also other candidate neighbors
(estimators).1,2
In the next paragraphs, we first present a probabilistic neighborhood selection
method and then an optimized technique that have the potential to alleviate the
over-specialization and concentration problems of RSes by selecting diverse neighbors
(i.e., neighbors with lower covariance levels).
3.3.3 Probabilistic Neighborhood Selection
In this section, we present a novel k-NN CF method (k-PN; k Probabilistic Neigh-
bors) using a probabilistic neighborhood selection technique that, instead of the most
similar neighbors, carefully selects a set of diverse neighbors in order to alleviate the
over-specialization and concentration problems. The proposed approach uses a gen-
eral algorithm for efficient sampling [Wong and Easton, 1980] that can also take into
consideration similarity levels between the target user and the $n$ candidate neighbors.

1 Let $r_{u,i}$ and $r_{u,j}$ be the correlations of target user $u$ with candidate neighbors $i$ and $j$, respectively; then the correlation $r_{i,j}$ of neighbors $i$ and $j$ is bounded by the following expression: $r_{u,i} r_{u,j} - \sqrt{1 - r_{u,i}^2}\,\sqrt{1 - r_{u,j}^2} \le r_{i,j} \le r_{u,i} r_{u,j} + \sqrt{1 - r_{u,i}^2}\,\sqrt{1 - r_{u,j}^2}$.
2 For a formal argument why the proposed probabilistic approach can result in very different recommendations from those generated based on the standard k-NN approach and how the item predictive accuracy can be affected, a 0/1 loss can be used in the context of classification ensemble learning with the (highly) rated items corresponding to the positive class. For a rigorous derivation of the generalization error in ensemble learning using the bias-variance-covariance decomposition and a 0/1 loss function see [Roli and Fumera, 2002; Tumer and Ghosh, 1996].
For the probabilistic neighborhood selection phase of the proposed algorithm, we
allow the neighbors to represent the whole spectrum of candidates, while focusing on
specific areas of this spectrum. Selecting such diverse neighborhoods, the proposed
method aims at alleviating the problems of over-specialization and concentration
biases (see Section 3.3.2).
In a nutshell, for the neighborhood selection phase of the k-PN approach, an initial
weight is assigned to each candidate neighbor and then the candidates are sampled,
without replacement, proportionally to their assigned weights. These initial weights
can be derived based on popular distance metrics (e.g., Cosine similarity, Pearson
correlation, etc.), probability distributions, or other strategies and techniques. For
instance, in order to use certain probability distributions aiming at specific areas
of the spectrum of candidates, the initial weight wi for each candidate i can be
generated using some function of its distance from the target u or its ranking (based
on the distance metric) and the corresponding probability density function (e.g.,
wi = P (ranki) or wi = P (sim (u, i)); for a complete example see Section 3.4.2). Based
on the selection of the initial weights, the algorithm will select different neighborhoods
and, thus, generate different recommendations. We should note here that including
all the candidates in a neighborhood does not alleviate the problems under study, as
discussed in Section 3.3.2.
For implementing the proposed approach, we suggest an efficient method (based on
[Fagin and Price, 1978; Wong and Easton, 1980]) for weighted sampling of k neighbors
without replacement that takes into consideration similarity levels between the target
user and the population of n candidate neighbors. In particular, the set of candidate
neighbors at any time is described by values $\{w'_1, w'_2, \ldots, w'_n\}$. In general, if user $i$ is still a candidate for selection, then $w'_i = w_i$ (where $w_i$ is generated as previously described), whereas $w'_i = 0$ if the user has already been selected in the neighborhood and, hence, removed from the set of candidates. Denote the sum of the weights of the first $j$ candidates by $S_j = \sum_{i=1}^{j} w'_i$, where $j = 1, \ldots, n$, and let $Q = S_n$ be the sum of the weights $\{w'_i\}$ of all the candidates. In order to draw a neighbor, choose $x$ with uniform probability from $[0, Q]$ and find $l$ such that $S_{l-1} \le x \le S_l$. Then, add $l$ to the neighborhood and remove it from the set of candidates by setting $w'_l = 0$.
After a candidate has been selected into the neighborhood, this neighbor is no longer
available for later selection.
This method can be easily implemented using a binary search tree having all
n candidate neighbors as leaves with values {w1, w2, . . . , wn}, whereas the value of
each internal node of the tree is the sum of the values of the corresponding immedi-
ate descendant nodes. This sampling method requires O(n) initialization operations,
O(k log n) additions and comparisons, and O(k) divisions and random number gen-
erations [Wong and Easton, 1980]. The suggested method can be used with any
distance metric and valid probability distribution including the empirical distribu-
tion of users’ similarity (see Section 3.4.2). Algorithm 2 summarizes the method for
efficient weighted sampling without replacement [Wong and Easton, 1980].
Note that the same approach can also be used for item-based neighborhood meth-
ods by simply sampling diverse neighborhoods of items, instead of users.
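A minimal Python sketch of this weighted sampling without replacement is shown below; for clarity it uses a simple cumulative-sum scan over the remaining weights (O(n) per draw) instead of the binary-tree structure of Algorithm 2, but the selection rule is the same.

```python
import numpy as np

def sample_neighborhood(weights, k, rng=None):
    """Draw k candidate indices without replacement, with probability
    proportional to their (non-negative) remaining weights."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float).copy()
    neighborhood = []
    for _ in range(min(k, int(np.count_nonzero(w)))):
        Q = w.sum()                                   # total remaining weight
        x = rng.uniform(0.0, Q)                       # uniform draw from [0, Q)
        cumulative = np.cumsum(w)                     # S_1, ..., S_n
        l = min(int(np.searchsorted(cumulative, x)), w.size - 1)
        neighborhood.append(l)
        w[l] = 0.0                                    # selected: remove from candidates
    return neighborhood

# Example: five candidates whose weights come from some density over ranks.
print(sample_neighborhood([0.4, 0.3, 0.15, 0.1, 0.05], k=3))
```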
3.3.4 Optimized Neighborhood Selection
In this section, we propose a novel k-NN CF method (k-BN; k Better Neigh-
bors) using a greedy neighborhood selection technique that, instead of the most sim-
ilar neighbors, carefully selects a set of diverse neighbors in order to alleviate the
over-specialization and concentration problems while improving also the predictive
accuracy of the generated recommendations.
ALGORITHM 2: Weighted Sampling (Without Replacement) Algorithm
Input: Initial weights {w_1, ..., w_n} of candidates for neighborhood N_i(u)
Output: Neighborhood of user u, N_i(u)
k: Number of users in the neighborhood of user u, N_i(u)
L(v): The left descendant of node v
R(v): The right descendant of node v
G_v: The sum of weights of the leaves in the left subtree of node v
Q: The sum of weights of the nodes in the binary tree

Build a binary search tree with n leaves labeled 1, 2, ..., n;
Assign to the leaves the corresponding values w_1, w_2, ..., w_n;
Associate values G_v with the internal nodes;
Set Q = sum_{v=1}^{n} w_v;
Set N_i(u) = ∅;
for j ← 1 to k do
    Set C = 0; Set v = the root node; Set D = ∅;
    Select x uniformly from [0, Q];
    repeat
        if x ≤ G_v + C then
            Set D = D ∪ {v}; Move to node/leaf L(v);
        else
            Set C = C + G_v; Move to node/leaf R(v);
        end
    until a leaf is reached;
    Set N_i(u) = N_i(u) ∪ {v};
    for each node d ∈ D do
        Set G_d = G_d − w_v;
    end
    Set Q = Q − w_v; Set w_v = 0;
end
3.3.4.1 Ordered Selection
For the simpler case, using equal weights for all the neighbors, the predicted rating
can be estimated as the average of the individual ratings of the k users (neighbors)
in the neighborhood:
$$r_i = \frac{1}{k} \sum_{u=1}^{k} r_{u,i}$$

and the generalization error can be expressed as:

$$GErr(k) = \frac{1}{k^2} \sum_{m=1}^{k} \sum_{n=1}^{k} C_{mn},$$

where the diagonal elements $C_{mm}$ are the average squared errors of the $m$-th ensemble members (neighbors) and the off-diagonal elements $C_{mn}$ correspond to the products of the biases of the $m$-th and $n$-th neighbors plus their covariance.
We propose a polynomial time greedy algorithm that constructs at each step the
best local solution. The algorithm starts with an empty neighborhood (or alternatively a seed of neighbors) and then selects at each iteration the optimal neighbor from all the remaining candidates (i.e., the neighbor that minimizes the training error). This forward (ordered) selection process is halted when k neighbors have been selected (or alternatively when the expected generalization error starts increasing) and the
corresponding neighborhood is returned. This early stopping criterion allows the
selection of a neighborhood that avoids overfitting and can improve the generaliza-
tion performance of the proposed method. The process is similar to the one used in
[Hernandez-Lobato et al., 2011] to prune regression bagging ensembles.
Algorithm 3 summarizes the proposed method for selecting the optimal neighborhood in polynomial time using a greedy approach.
ALGORITHM 3: Optimized (Ordered) Neighborhood Selection Algorithm
Input: User-Item Rating matrix R
Output: Neighborhood N of size k
U: The set of users
R: The set of known ratings
N: The indexes of the users selected in the neighborhood
k: Number of users in the neighborhood N

for m ← 1 to |U| do
    for n ← m to |U| do
        C_{mn} ← (1/|R|) · Σ_{r_{u,i} ∈ R} [(r_{m,i} − r_{u,i})(r_{n,i} − r_{u,i})];
        C_{nm} ← C_{mn};
    end
end
N ← ∅;
for j ← 1 to k do
    minimum ← +∞;
    for u ∈ {1, ..., |U|} \ {N_1, ..., N_{j−1}} do
        value ← (1/j²) · (Σ_{m=1}^{j−1} Σ_{n=1}^{j−1} C_{N_m N_n} + 2 Σ_{m=1}^{j−1} C_{N_m u} + C_{uu});
        if value < minimum then
            N_j ← u; minimum ← value;
        end
    end
end
return N
The selected neighborhood tries to minimize the error

$$GErr(k) = \frac{1}{k^2} \sum_{m=1}^{k} \sum_{n=1}^{k} C_{N_m N_n}$$

by minimizing the training error of the $k$ ensemble members; hence, $C_{mn}$ is calculated as an average over the training set instead of the expectation over the population.3

3 We avoid overfitting by selecting $k$ neighbors and not using all the candidates.
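A simplified Python rendering of this greedy (ordered) selection, assuming the pairwise error matrix C has already been estimated on the training set, is given below; unlike Algorithm 3, it recomputes the neighborhood error from scratch at every step, which is less efficient but easier to follow.

```python
import numpy as np

def greedy_neighborhood(C, k):
    """Greedy (ordered) forward selection: at each step add the candidate
    that minimizes GErr(j) = (1/j^2) * sum of C[m, n] over the current neighborhood."""
    n_candidates = C.shape[0]
    selected = []
    for j in range(1, k + 1):
        best_u, best_val = None, np.inf
        for u in range(n_candidates):
            if u in selected:
                continue
            idx = selected + [u]
            val = C[np.ix_(idx, idx)].sum() / (j * j)   # training-error estimate
            if val < best_val:
                best_u, best_val = u, val
        selected.append(best_u)
    return selected
```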
Furthermore, instead of using equal weights for all the neighbors as in the previous paragraph, we can estimate the optimal weight for each neighbor in each neighborhood. Using the method of Lagrange multipliers to solve for the optimal weights ($a_m^*$) in Eq. (3.3), the necessary conditions are:

$$\frac{\partial \left[ \sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j C_{ij} - \lambda \left( \sum_{i} \alpha_i - 1 \right) \right]}{\partial \alpha_m} = 0 \quad \forall\, m = 1, \ldots, k$$

and

$$\sum_{i} \alpha_i - 1 = 0.$$
Even though, based on the above conditions and Cramer's rule, the generalization error can be directly minimized using

$$a_i^* = \frac{\sum_{j} C_{ij}^{-1}}{\sum_{l}\sum_{j} C_{lj}^{-1}}, \qquad (3.4)$$

this method depends on a reliable estimate of $C$ and on $C$ being non-singular so that it can be easily inverted [Perrone and Cooper, 1993]. In practice, though, errors are often highly correlated; thus, the rows of $C$ are nearly linearly dependent, so that $C$ is a singular or ill-conditioned matrix and inverting it leads to significant round-off errors [Jimenez, 1998; Zhou et al., 2002].
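Assuming a reasonably well-conditioned estimate of C, Eq. (3.4) can be evaluated directly; the sketch below is illustrative and adds an optional ridge term and a pseudo-inverse fallback (both are assumptions of the sketch, not part of the derivation) to cope with the ill-conditioning discussed above.

```python
import numpy as np

def optimal_weights(C, ridge=0.0):
    """Combination weights of Eq. (3.4):
    a_i* = sum_j (C^-1)_ij / sum_l sum_j (C^-1)_lj."""
    C_reg = C + ridge * np.eye(C.shape[0])   # optional ridge term (illustrative)
    try:
        C_inv = np.linalg.inv(C_reg)
    except np.linalg.LinAlgError:
        C_inv = np.linalg.pinv(C_reg)        # pseudo-inverse fallback if singular
    row_sums = C_inv.sum(axis=1)
    return row_sums / row_sums.sum()         # weights sum to one
```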
However, all the information needed for solving the optimization problem is the matrix $C$, which can be estimated over the training set and is expected to be similar to the matrix that would be obtained if it were estimated over $p(x, y)$, the true distribution of the data. Nevertheless, despite
selecting only k neighbors and not using all the candidates, this assumption and the
corresponding design choice of estimating C over only the training data might lead to
overfitting to the training data and hence the empirical evaluation of the proposed
approach is needed.
Furthermore, because of the sparsity problem in the rating matrix of collaborative
filtering recommender systems, higher performance might be achieved if a baseline
predictor for the score of the ensemble member (rating of the neighbor) is used in
the estimation of the covariance of the ensemble members and matrix C in the cases
where the true rating of the neighbor is unknown. In this work, we propose the use
of a simple baseline predictor that provides a first-order approximation of the bias in
our rating matrix. For instance, the following baseline predictor can be used:
$$r_{u,i} = \mu + b_u + b_i,$$

where $\mu$ is the global average rating in our data set, $b_i$ is the average rating bias of the corresponding item (i.e., $b_i = \frac{\sum_{u \in R(i)} (r_{u,i} - \mu)}{|R(i)|}$), and $b_u$ is the average rating bias of the corresponding user (i.e., $b_u = \frac{\sum_{i \in R(u)} (r_{u,i} - \mu - b_i)}{|R(u)|}$).
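A minimal sketch of this first-order baseline predictor, assuming a dense rating matrix with NaN for unknown ratings (an illustrative helper, not necessarily the exact implementation used in the experiments):

```python
import numpy as np

def baseline_predictor(R):
    """Return the matrix of baseline estimates mu + b_u + b_i for all
    (user, item) pairs; R holds ratings with np.nan for unknown entries."""
    mu = np.nanmean(R)                                      # global average rating
    b_i = np.nan_to_num(np.nanmean(R - mu, axis=0))         # item rating biases
    b_u = np.nan_to_num(np.nanmean(R - mu - b_i, axis=1))   # user rating biases
    return mu + b_u[:, np.newaxis] + b_i[np.newaxis, :]
```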
In summary, in this study we introduce various algorithms to address the problem
of identifying the optimal subset of neighbors in collaborative filtering recommender
systems. We first present a novel method for recommending items based on prob-
abilistic neighborhood selection (k-PN) in collaborative filtering systems. Then, we
propose an ordered neighborhood selection approach where neighborhoods of in-
creasing size are constructed by incorporating at each iteration the neighbor that
reduces the predictive error the most (k-BN). All the algorithms designed in this
study reach an approximate solution in polynomial time. However, a difficulty, in-
herent to all machine learning algorithms, is that the optimal neighborhood selected
on the basis of the training set may have a suboptimal generalization performance
[Martínez-Muñoz and Suárez, 2006]. Nevertheless, despite the fact that the found
solution might not be globally optimal, it can generally be a near-optimal local min-
imum. Hence, the empirical evaluation of the proposed approaches is needed.
3.4 Experimental Settings of Probabilistic Neighborhood Se-
lection
To empirically validate the k-PN method presented in Section 3.3.3 and evaluate the generated recommendations, we conduct a large number of experiments on "real-world" data sets and compare our results to different baselines. For an apples-to-apples comparison, the selected baselines include the user-based k-NN CF approach, which we aim to improve in this study. Compared to other popular algorithms,
user-based k-NN generates recommendations that suffer less from over-specialization
and concentration biases [Desrosiers and Karypis, 2011; Jannach et al., 2013] and has
also been found to perform well in terms of other performance measures [Burke, 2002;
Cremonesi et al., 2011; Adamopoulos and Tuzhilin, 2011, 2013a]. Nevertheless, the
proposed approach can be applied to any neighborhood-based method and it is not
specific to the user-based approach, which has been selected for increased compatibil-
ity as well as interpretability of the results. Additionally, we also compare our results
against furthest neighbors models (k-FN) [Said et al., 2013, 2012b]. Finally, we also
compare our experimental results against matrix factorization (MF) [Gantner et al.,
2011].
3.4.1 Data Sets
The data sets that we used are the MovieLens [GroupLens, 2011] and MovieTweet-
ings [Dooms et al., 2013] as well as a snapshot from Amazon [McAuley and Leskovec,
2013]. The RecSys HetRec 2011 MovieLens (ML) data set [GroupLens, 2011] contains
855,598 ratings (on a 1-5 scale) from 2,113 users on 10,197 movies. Moreover, the
MovieTweetings (MT) data set is described in [Dooms et al., 2013] and consists of
ratings included in well-structured tweets on Twitter. Owing to the extreme sparsity
of the data set, we decided to condense the data set in order to obtain more mean-
ingful results from collaborative filtering algorithms. In particular, we removed items
and users with fewer than 10 ratings. The resulting data set contains 12,332 ratings
(on a 0-10 scale) from 839 users on 836 movies. Finally, the Amazon (AMZ) data
set is described in [McAuley and Leskovec, 2013] and consists of reviews of fine foods
during a period of more than 10 years. After removing items with fewer than 10
ratings and reviewers with fewer than 25 ratings each, the data set consists of 15,235
ratings (on a 1-5 scale) from 407 users on 4,316 items.
3.4.2 Experimental Setup
Using the ML, MT, and AMZ data sets, we conducted a large number of exper-
iments and compared the results against the standard user-based k-NN approach,
different k-FN methods, and matrix factorization. In order to test the proposed ap-
proach of probabilistic neighborhood selection under various experimental settings,
we used different sizes of neighborhoods (k ∈ {20, 30, . . . , 80}) and different proba-
bility distributions (P ∈ {normal, exponential, Weibull, folded normal, uniform}),
with various specifications (i.e., location and scale parameters), as well as the empir-
ical distribution of user similarity, described in Table 3.1. The uniform distribution
is used in order to compare the proposed method against randomly selecting neigh-
bors. The specific distributions were selected because they focus on different areas
of the spectrum of candidate neighbors and they constitute common but flexible ex-
amples that can be easily reproduced. Additionally, we used two k-FN models [Said
et al., 2013, 2012b]; the second furthest neighbor model (k-FN2) employed in this
study corresponds to recommending the least liked items of the furthest neighbors
instead of the most liked ones (k-FN1). We should note here that because of the
strict deterministic nature of both k-NN and k-FN, it is not possible to interpo-
late between these two methods and select diverse neighbors that approximate the
results of k-PN. In addition, we generated recommendation lists of different sizes
(l ∈ {1, 3, 5, 10, 20, . . . , 100}). In summary, we used 3 data sets, 7 different sizes of
neighborhoods, 12 probability distributions, and 13 different lengths of recommenda-
tion lists, resulting in 3,276 experiments in total.
Table 3.1: Probability Distributions and Density Functions for Probabilistic Neighborhood Selection.

Label | Probability Distribution | Probability Density Function (weights) | Location and Shape Parameters
k-NN  | -                        | $1/k$ if $x \le n - k$; $0$ otherwise   | -
E     | Empirical Similarity     | $w_x / \sum_{i=1}^{n} w_i$              | -
U     | Uniform                  | $1/n$                                   | -
N1    | Normal                   | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu = 0$, $\sigma = (0.25/15.0)\,n$
N2    | Normal                   | (as above)                              | $\mu = (2.0/15.0)\,n$, $\sigma = (0.5/15.0)\,n$
Exp1  | Exponential              | $\lambda e^{-\lambda x}$                | $\lambda = 1/k$
Exp2  | Exponential              | (as above)                              | $\lambda = 2/k$
W1    | Weibull                  | $\frac{\mu}{\lambda} \left(\frac{x}{\lambda}\right)^{\mu-1} e^{-(x/\lambda)^{\mu}}$ | $\mu = 0.25$, $\lambda = n/20$
W2    | Weibull                  | (as above)                              | $\mu = 0.50$, $\lambda = n/20$
FN1   | Folded normal            | $\frac{\sqrt{2}}{\sigma\sqrt{\pi}} e^{-\frac{\theta^2}{2}} e^{-\frac{x^2}{2\sigma^2}} \cosh\!\left(\frac{\theta x}{\sigma}\right)$ | $\theta = 1$, $\sigma = k$
FN2   | Folded normal            | (as above)                              | $\theta = 1$, $\sigma = k/2$
k-FN  | -                        | $1/k$ if $x \ge n - k$; $0$ otherwise   | -
For the probabilistic neighborhood selection, we used the method described in Section 3.3.3. In order to estimate the initial weights $\{w_i\}$ of the procedure, we used the probability density functions illustrated in Table 3.1. Without loss of generality, in order to take into consideration similarity levels of the candidate neighbors, the candidates can be ordered and re-labeled such that $s_{u,1} \ge s_{u,2} \ge \ldots \ge s_{u,n}$, where $s_{u,j}$ is the similarity level of target $u$ and candidate $j$ based on some distance metric.
Then, the initial weight wj for each candidate can be generated using its ranking
and a probability density function. For instance, using the Weibull probability dis-
tribution (i.e., W1 or W2), the weight of the most similar candidate (i.e., j = 1) is
$w_1 = \frac{\mu}{\lambda} \left(\frac{1}{\lambda}\right)^{\mu-1} e^{-(1/\lambda)^{\mu}}$, where $\mu$ and $\lambda$ are the shape and scale parameters of the
distribution and n is the total number of all the candidate neighbors.4 In contrast
to the deterministic k-NN and k-FN approaches, depending on the parameters of the
employed probability density function, this candidate neighbor (i.e., the most sim-
ilar to the target) may or may not have the highest weight wj.5 Figure 3.1 shows
the likelihood of sampling each candidate neighbor using different probability distri-
butions for the MovieLens data set and k = 80 and Figure 3.2 shows the sampled
neighborhoods for a randomly selected target user using the different distributions;
the candidate neighbors for each target user and item in the x axis are ordered based
on their similarity to the target user with 0 corresponding to the nearest (i.e., most
similar) candidate. As we can see, the selected distributions focus on different areas
of the spectrum of candidate neighbors. We should note here that using the empirical
distribution of user similarity resulted in more diverse neighborhoods.
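For illustration, the sketch below derives initial weights from the candidates' similarity ranking and the Weibull density of Table 3.1; the W1 parameter values and the toy similarity vector are assumptions of the example, and any of the other densities in the table can be plugged in the same way.

```python
import numpy as np

def weibull_pdf(x, mu, lam):
    """Weibull density as parameterized in Table 3.1 (mu: shape, lam: scale)."""
    return (mu / lam) * (x / lam) ** (mu - 1.0) * np.exp(-(x / lam) ** mu)

def initial_weights(similarities, pdf):
    """Rank candidates by decreasing similarity to the target user
    (rank 1 = most similar) and assign weight w_j = pdf(rank_j)."""
    similarities = np.asarray(similarities, dtype=float)
    order = np.argsort(-similarities)
    ranks = np.empty(similarities.size, dtype=float)
    ranks[order] = np.arange(1, similarities.size + 1)
    return pdf(ranks)

n = 400                                   # illustrative number of candidates
sims = np.random.rand(n)                  # illustrative similarity values
w = initial_weights(sims, lambda r: weibull_pdf(r, mu=0.25, lam=n / 20))
```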
In all the conducted experiments, in order to measure the similarity among the candidate neighbors, we used the Pearson correlation; similar results were also obtained using the cosine similarity. We also used significance weighting as in [Herlocker et al., 1999], in order to penalize for similarity based on few common ratings, and filtered any candidate neighbors with zero weight [Desrosiers and Karypis, 2011]. For the similarity estimation of the candidates in the k-furthest neighbor algorithm, we used the approach described in [Said et al., 2013, 2012b].

4 For continuous probability distributions, the cumulative distribution function can also be used, such as $w_i = F(i + 0.5) - F(i - 0.5)$ or $w_i = F(i) - F(i - 1)$.
5 For a probabilistic furthest neighbors model, the candidates can be ordered in reverse similarity order such that $s_{u,1} \le s_{u,2} \le \ldots \le s_{u,n}$. Initial experiments illustrated that such models underperform the proposed approach.
Figure 3.1: Sampling probability for the nearest candidate neighbors using different probability distributions for the MovieLens (ML) data set.

Figure 3.2: Sampled neighborhoods using the different probability distributions for the MovieLens data set.
Besides, we used the standard combining function as in Eq. (3.1). Similar results were
also obtained using a combining function without a first-order bias approximation:
$r_{u,i} = \sum_{v \in N_i(u)} w_{u,v} r_{v,i} \big/ \sum_{v \in N_i(u)} |w_{u,v}|$; any differences are explicitly discussed in the
following section. In addition, we used a holdout validation scheme in all of our ex-
periments with 80/20 splits of the rating tuples to the training/test parts in order
to avoid overfitting. Finally, the evaluation of the various approaches in each ex-
perimental setting is based on users with more than k candidate neighbors, where
k is the corresponding neighborhood size; if a user has k or fewer available candidate
neighbors, then the same neighbors are always selected and the results for the specific
user are in principle identical for all the examined approaches, apart from the inverse
k-FN (k-FN2) method. Similarly, the generated recommendation lists were also eval-
uated using a subset of the test set containing only highly rated items as well as only
long-tail items [Cremonesi et al., 2010].
3.5 Results of Probabilistic Neighborhood Selection
The aim of this study is to demonstrate that the proposed method indeed effec-
tively generates recommendations that alleviate the over-specialization and concen-
tration problems while performing well in terms of other important metrics of RSes.
Therefore, we conduct a comparative analysis of our method and the standard base-
line (k-NN), matrix factorization, and the k-furthest neighbor approaches, in different
experimental settings.
Given the number and the diversity of experimental settings, the presentation
of the results constitutes a challenging problem. A reasonable way to compare the
results across the different settings is by computing the relative performance differ-
ences and discussing only the most interesting dimensions. Due to space limitations,
detailed results, supplementary graphs and tables, and tests of statistical significance
about all the conducted experiments as well as additional performance metrics mea-
suring orthogonality of recommendations and predictive accuracy are included in
[Adamopoulos and Tuzhilin, 2013b].
Overall, the proposed method generates recommendations that are very different
from the classical CF approaches and alleviates the over-specialization and concentra-
tion problems, based on metrics of coverage, dispersion, and diversity reinforcement
(mobility of recommendations), while avoiding any significant accuracy loss. Fig. 3.3
shows an overview of the performance of all the methods on the ML data set across
various metrics for recommendation lists of size l = 10.
Figure 3.3: Summary of performance for the ML data set.
3.5.1 Orthogonality of Recommendations
In this section, we examine whether the proposed approach finds and recommends
different items than those recommended by the standard recommenders and, thus,
whether it can alleviate the over-specialization problem [Said et al., 2012b]. In par-
ticular, we investigate the overlap of recommendations (i.e., the percentage of items
that belong to both recommendation lists) between the classical neighborhood-based
collaborative filtering method and the various specifications (i.e., E, N1, N2, Exp1,
Exp2, W1, W2, FN1, and FN2) of the proposed approach described in Table 3.1. Fig.
3.4 presents the results obtained by applying our method to the MovieLens, Movi-
eTweetings, and Amazon data sets. The values reported are computed as the average
overlap over seven neighborhood sizes, k ∈ {20, 30, . . . , 80}, for recommendation lists
of size l = 10.
As Fig. 3.4 demonstrates, the proposed method generates recommendations that
are very different from the recommendations provided by the classical k-NN approach.
The most dissimilar recommendations were achieved using the empirical distribution of user similarity and the inverse k-furthest neighbors approach [Said et al., 2013].
In particular, the average overlap across all the proposed probability distributions,
neighborhoods, and recommendation list sizes was 14.87%, 64.17%, and 64.64% for
the ML, MT, and AMZ data sets, respectively; the corresponding overlap using only
the empirical distribution was 2.79%, 44.82%, and 39.80% for the different data sets.
Hence, it is worth noting that not only the k-FN approach but also the proposed probabilistic method resulted in recommendations orthogonal to those of the standard k-NN method. Besides, for the sparser data sets (i.e., MovieTweetings, Amazon), the recommendation lists exhibit greater overlap, since there are proportionally fewer candidate neighbors available to sample from and, thus, the neighborhoods tend to be
more similar.

Figure 3.4: Overlap of recommendation lists of size l = 10 for the different data sets ((a) MovieLens, (b) MovieTweetings, (c) Amazon).
This is also depicted in the experiments using one of the standard prob-
ability distributions but the empirical distance of candidate neighbors, the uniform
distribution, and the deterministic k-FN approach. Similarly, recommendation lists of
smaller size resulted in even smaller overlap among the various methods. Moreover,
the experiments conducted using the U and k-FN1 approaches resulted in recom-
mendations very different from the recommendations provided by the classical k-NN
approach only when the first-order bias approximation was not used in the combin-
ing function of the ratings. In general, without the first-order bias approximation the
average overlap was further reduced by 58.70%, 19.69%, and 52.18% for the ML, MT,
and AMZ data sets, respectively. As one would expect, the experiments conducted
using the same probability distribution (e.g., Exp1 and Exp2) result in similar perfor-
mance. To determine statistical significance, we have tested the null hypothesis that
the performance of each of the methods is the same using the Friedman test. Based
on the results, we reject the null hypothesis with p < 0.0001. Performing post hoc
analysis on Friedman’s Test results, all the specifications of the proposed approach
(apart from the cases of the N2, W2, Exp, and FN specifications for the AMZ data
set) significantly outperform the k-FN1 method in all the data sets. The difference
between the empirical distribution and k-FN2 is not statistically significant for any
data set.
Nevertheless, even a large overlap between two recommendation lists does not
imply that these lists are the same. For instance, two recommendation lists might
contain the same items but in reverse order. In order to further examine the orthog-
onality of the generated recommendations, we measure the rank correlation of the
generated lists using the Spearman’s rank correlation coefficient [Spearman, 1987],
which measures the Pearson correlation coefficient between the ranked variables. In
particular, we use the top 100 items recommended by method $i$ and examine the correlation $\rho_{ij}$ in the rankings generated by methods $i$ and $j$ for those items; $\rho_{ij}$ might be different from $\rho_{ji}$.

Figure 3.5: Spearman's rank correlation coefficient for the MovieTweetings data set.
Fig. 3.5 shows the average ranking correlation over seven neighbor-
hood sizes, k ∈ {20, 30, . . . , 80}, using the Spearman’s rank correlation coefficient ρij
(i corresponds to the row index and j to the column index) for the MovieTweetings
data set (i.e., the data set that exhibits the largest overlap). The correlation between
the classical neighborhood-based collaborative filtering method and the probabilistic
approach with the empirical distribution using the top 100 items recommended by
the k-NN method is 22.21%. As Figs. 3.4b and 3.5 illustrate, even though some
specifications may result in recommendation lists that exhibit significant overlap, the
ranking of the recommended items is not strongly correlated.
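For completeness, the two orthogonality measures used in this section can be sketched as follows; the helper that builds method j's ranking of method i's top items is an illustrative assumption, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import spearmanr

def overlap(list_a, list_b):
    """Fraction of items that appear in both recommendation lists."""
    return len(set(list_a) & set(list_b)) / float(len(list_a))

def rank_correlation(scores_i, scores_j, top=100):
    """Spearman correlation between method i's and method j's rankings of
    the top-`top` items recommended by method i (scores: item -> score)."""
    top_items = sorted(scores_i, key=scores_i.get, reverse=True)[:top]
    ranks_i = np.arange(len(top_items))                       # ranking by method i
    scores = [-scores_j.get(item, -np.inf) for item in top_items]
    ranks_j = np.argsort(np.argsort(scores))                  # ranking by method j
    rho, _ = spearmanr(ranks_i, ranks_j)
    return rho
```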
3.5.2 Comparison of Coverage and Diversity
In this section, we investigate the effect of the proposed method on coverage
and aggregate diversity; two important metrics that in combination with other mea-
sures discussed in this study show whether the proposed approach alleviates the over-
specialization and concentration problems of common RSes. The results obtained
using the catalog coverage metric are equivalent to those using the diversity-in-top-
N metric for aggregate diversity; henceforth, only one set of results is presented.
Fig. 3.6 presents the results obtained by applying our method to the ML, MT, and
AMZ data sets. In particular, the Hinton diagram in Fig. 3.6 shows the percentage
increase/decrease in performance compared to the k-NN baseline for each probability
distribution and recommendation lists of size l ∈ {1, 3, 5, 10, 20, . . . , 100} over seven
neighborhood sizes, k ∈ {20, 30, . . . , 80}. Positive and negative values are represented
by white and black squares, respectively, and the size of each square represents the
magnitude of each value.
Fig. 3.6 demonstrates that the proposed method in most cases performs better than
the user-based k-NN, matrix factorization, and the k-FN methods.

Figure 3.6: Increase in aggregate diversity performance for the different data sets and recommendation list sizes ((a) MovieLens, (b) MovieTweetings, (c) Amazon).
The most diverse recommendations were obtained using the empirical distribution of user similarity and the inverse k-furthest neighbors approach (k-FN2). In particular, the average
aggregate diversity across all the probability distributions, neighborhoods, and rec-
ommendation list sizes was 22.10%, 46.09%, and 13.52% for the ML, MT, and AMZ
data sets, respectively; the corresponding diversity using only the empirical distribu-
tion was 24.20%, 50.55%, and 17.04% for the different data sets. The corresponding
performance of MF [Gantner et al., 2011] was measured as 14.17%, 10.70%, and
14.39%, respectively.
Furthermore, the performance increased both in the experiments where the k-NN method, because of the specifics of the particular data sets, resulted in low aggregate diversity (e.g., Amazon) and in those where it resulted in high diversity (e.g., MovieTweetings).
In addition, the experiments conducted using the same probability distribution (e.g.,
Exp1 and Exp2) exhibit very similar performance. As one would expect, in most
cases the aggregate diversity increased, whereas the magnitude of the difference in
performance decreased, with increasing recommendation list size l. Without using the
first-order bias approximation in the combining function, the standard k-NN method
resulted in higher aggregate diversity and catalog coverage but the proposed approach
still outperformed the classical algorithm in most of the cases by a narrower margin;
using the inverse k-FN method (k-FN2) without the first-order bias approximation
resulted in a decrease in performance for the Amazon data set. The performance of the empirical distribution was 33.37%, 61.06%, and 41.50% for the different data sets.
Nevertheless, using all the candidates, instead of probabilistically selecting a diverse
neighborhood, underperforms the proposed approach since the overall contribution of
neighbors other than the most similar is significantly discounted.
In terms of statistical significance, using the Friedman test and performing post
hoc analysis, the differences among the employed baselines (i.e., k-NN, MF, k-FN1,
and k-FN2) and all the proposed specifications are statistically significant (p < 0.001)
for the ML data set. For the MT and AMZ data sets, all the proposed specifications
(i.e., E, N1, N2, Exp1, Exp2, W1, W2, FN1, and FN2) significantly outperform the
k-NN and matrix factorization algorithms; the empirical distribution significantly
outperforms also the k-FN1 method.
3.5.3 Comparison of Dispersion and Diversity Reinforcement
In order to conclude whether the proposed approach alleviates the concentration
biases, the generated recommendation lists should also be evaluated for the inequality
across items using the Gini coefficient. Fig. 3.7 shows the percentage increase (white
squares) or decrease (black squares) in dispersion of recommendations compared to
the k-NN baseline. The Gini coefficient was on average improved by 6.81%, 3.67%,
and 1.67% for the ML, MT, and AMZ data sets, respectively; the corresponding
figures using only the empirical distribution were 7.48%, 6.73%, and 3.45% for the
different data sets, which implies an improvement of 7.41%, 16.76%, and 1.54% over
MF and 9.69%, 5.34%, and 2.84% over k-FN. The more uniformly distributed rec-
ommendation lists were achieved using the empirical distribution of user similarity
and the inverse k-furthest neighbors approach. Moreover, the larger the size of the
recommendation lists, the larger the improvement in the Gini coefficient. Similarly,
without using the first-order bias approximation in the rating combining function,
the average dispersion was further improved by 6.48%, 6.83%, and 20.22% for the
ML, MT, and AMZ data sets, respectively. This implies an improvement of 14.91%,
22.83%, and 21.19% against MF and 16.94%, 5.90%, and 2.38% against k-FN. As
we can conclude, in the recommendation lists generated from the proposed method,
the number of times an item is recommended is more equally distributed compared to
other CF methods.

Figure 3.7: Increase in dispersion of recommendations for the different data sets and recommendation list sizes ((a) MovieLens, (b) MovieTweetings, (c) Amazon).
In terms of statistical significance, all the proposed specifications
(apart from the N1, Exp2, and FN2 for the MT data set and the N2, Exp2 for the
AMZ data set) significantly outperform the k-NN, matrix factorization, and k-FN1
methods (p < 0.001). The empirical distribution also significantly outperforms the
k-FN2 method for the ML data set; the differences are not statistically significant for
the other data sets.
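For reference, the dispersion measure can be sketched as a standard Gini-coefficient computation over per-item recommendation counts (an illustrative formulation, not necessarily the exact estimator used in the experiments):

```python
import numpy as np

def gini(recommendation_counts):
    """Gini coefficient of per-item recommendation counts;
    0 = perfectly equal exposure, larger values = more concentration."""
    x = np.sort(np.asarray(recommendation_counts, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    index = np.arange(1, n + 1)
    return float(2.0 * np.sum(index * x) / (n * total) - (n + 1.0) / n)

print(gini([25, 25, 25, 25]))   # 0.0: perfectly uniform exposure
print(gini([0, 0, 0, 100]))     # 0.75, the maximum possible for four items
```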
However, simply evaluating the recommendation lists in terms of dispersion and
inequality does not provide any information about the (popularity-based) diversity
reinforcement and mobility of the recommendations (i.e., whether popular or long-
tail items are more likely to be recommended) since these metrics do not consider
the prior state of the system. Hence, we employ a diversity reinforcement measure
M to assess whether the proposed recommender system approach follows or changes
the prior popularity of items when recommendation lists are generated. Thus, we
define M , which equals the proportion of items that are “mobile” (e.g., changed from
popular in terms of number of ratings to “long tail” in terms of recommendation
frequency), as follows:
$$M = 1 - \sum_{i=1}^{K} \pi_i \rho_{ii},$$
where the vector π denotes the initial distribution of each of the K (popularity)
categories and ρii the probability of staying in category i, given that i was the initial
category.6 A score of zero denotes no change (i.e., the number of times an item is
recommended is proportional to the number of ratings it has received) whereas a score
of one denotes that the RS recommends only the long-tail items (i.e., the number of
times an item is recommended is proportional to the inverse of the number of ratings
it has received).
6 The proposed diversity reinforcement score can be easily adapted in order to differentiate the direction of change and the magnitude.
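A minimal sketch of this measure for the two-category ("head"/"tail") split described in the next paragraph, assuming per-item rating counts and recommendation counts are available (the inputs and the 20% head share are illustrative assumptions):

```python
import numpy as np

def diversity_reinforcement(rating_counts, rec_counts, head_share=0.2):
    """M = 1 - sum_i pi_i * rho_ii with two popularity categories:
    'head' = top items by popularity, 'tail' = the rest."""
    rating_counts = np.asarray(rating_counts)
    rec_counts = np.asarray(rec_counts)
    n_items = len(rating_counts)
    n_head = max(1, int(round(head_share * n_items)))

    head_before = set(np.argsort(-rating_counts)[:n_head].tolist())  # by ratings
    head_after = set(np.argsort(-rec_counts)[:n_head].tolist())      # by recommendations
    tail_before = set(range(n_items)) - head_before
    tail_after = set(range(n_items)) - head_after

    pi = np.array([len(head_before), len(tail_before)], dtype=float) / n_items
    rho_ii = np.array([len(head_before & head_after) / len(head_before),
                       len(tail_before & tail_after) / len(tail_before)])
    return 1.0 - float(np.dot(pi, rho_ii))
```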
In the conducted experiments, based on the 80-20 rule or Pareto principle, we
use two categories, labeled as “head” and “tail”, where the former category contains
the top 20% of items (in terms of ratings or recommendations frequency) and the
latter category the remaining 80%. The experimental results demonstrate that the
proposed method generates recommendation lists that exhibit in most cases higher
diversity reinforcement compared to the k-NN, MF, and k-FN methods. In particular,
the performance was increased by 0.91%, 0.95%, and 0.19% for the ML, MT, and
AMZ data sets, respectively; the corresponding improvement using only the empirical
distribution was 1.29%, 1.46%, and 0.45% for the different data sets which implies
an improvement of 1.52%, 1.12%, and 2.11% over MF and 1.01%, 1.31%, and 0.29%
over k-FN. We also note that recommendation lists of larger size resulted on average
in even larger improvements. Besides, considering a smaller number of items as
popular also resulted in larger improvements. Similarly, without the first-order bias
approximation the average diversity reinforcement was further increased by 0.69%,
0.53%, and 3.28% for the ML, MT, and AMZ data sets, respectively. This implies
an improvement of 2.40%, 1.82%, and 1.65% against MF and 1.83%, 1.44%, and
1.82% against k-FN. Fig. 3.8 shows the transition probabilities of each category for
recommendation lists of size l = 100 using the empirical distribution of similarity
and the MovieTweetings data set. In terms of statistical significance, in most of the
cases all the proposed specifications significantly outperform the baseline methods
(p < 0.005) [Adamopoulos and Tuzhilin, 2013b].
3.5.4 Comparison of Item Prediction
Apart from alleviating the concentration bias and over-specialization problems in
CF systems, the proposed approach should also perform well in terms of predictive
accuracy. Thus, the goal in this section is to compare the proposed method with the
Figure 3.8: Diversity reinforcement for the MT data set.
standard baseline methods in terms of traditional metrics for item prediction, such as
the F1 score. Figs. 3.9 and 3.10 present the results obtained by applying the proposed
method to the MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data
sets. The values reported in Fig. 3.10 are computed as the average performance over
seven neighborhood sizes, k ∈ {20, 30, . . . , 80}, using the F1 score for recommendation
lists of size l = 10. The Hilton diagram show in Figure 3.9 presents the relative F1
score for recommendation lists of size l ∈ {1, 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 100}; the
size of each white square represents the magnitude of each value with the maximum
size corresponding to the maximum value achieved in the conducted experiments for
each data set. Similar results were also obtained using as positive instances only the
highly rated items (i.e., items rated above the average rating or above the 80% of the
rating scale) in the test set.
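A minimal sketch of the item-prediction metric (F1 at list size l, treating a user's held-out test items as the relevant set; an illustrative formulation):

```python
def f1_at_l(recommended, relevant, l=10):
    """F1 score of a top-l recommendation list against a user's test items."""
    top_l = list(recommended)[:l]
    hits = len(set(top_l) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / float(len(top_l))
    recall = hits / float(len(relevant))
    return 2.0 * precision * recall / (precision + recall)

print(f1_at_l(["a", "b", "c"], {"b", "d"}, l=3))   # 0.4: one hit out of three
```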
Figure 3.9: Item prediction performance for the different data sets ((a) MovieLens, (b) MovieTweetings, (c) Amazon).

Figure 3.10: Item prediction performance for the different data sets and recommendation lists of size l = 10 ((a) MovieLens, (b) MovieTweetings, (c) Amazon).

Figs. 3.9 and 3.10 demonstrate that the proposed method outperforms the stan-
dard user-based k-NN method and the k-FN approach in most of the cases. The
most accurate recommendations were generated using the empirical distribution of
user similarity, the normal or the exponential distribution. In particular, the average
F1 score across all the proposed probability distributions, neighborhoods, and recom-
mendation list sizes was 0.0018, 0.0050, and 0.0010 for the ML, MT, and AMZ data
sets, respectively; the corresponding performance using only the empirical distribu-
tion was 0.0015, 0.0055, and 0.0022 for the different data sets resulting on average
in a 4-fold increase. Besides, without using the first-order bias approximation in the
rating combining function, the proposed approach outperformed in most of the cases
the classical k-NN algorithm and the k-FN method by a wider margin. Furthermore,
we should note that the performance was increased across various experimental speci-
fications, including different sparsity levels, neighborhood sizes, and recommendation
list lengths. This performance improvement is due to the reduction of covariance
among the selected neighbors and is in accordance with the phenomenon of “hub-
ness” and the ensemble learning theory that we introduce in the neighborhood-based
collaborative filtering framework in Section 3.3.2.
To determine the statistical significance of the previous findings, we have tested
using the Friedman test the null hypothesis that the performance of each of the
methods is the same. Based on the results, we reject the null hypothesis with p <
0.0001. Performing post hoc analysis on Friedman’s Test results, in most of the
cases (i.e., 86.42% of the experimental settings) the proposed approach significantly
outperforms the employed baselines and in the remaining cases the differences are not
statistically significant. In particular, the differences between the traditional k-NN
and each one of the proposed variations (apart from the case of the FN1 specification
for the MT data set) are statistically significant for all the data sets; similar results
were also obtained for the differences among the proposed approach and the k-FN
models.
3.5.5 Comparison of Utility-based Ranking
Further, in order to better assess the quality of the proposed approach, the recommendation lists should also be evaluated for the ranking of the items that they present to the users, taking into account the rating scale of the selected data sets. In principle,
since all items are not of equal relevance/quality to the users, the relevant/better
items should be identified and ranked higher for presentation. Assuming that the
utility of each recommendation is the rating of the recommended item discounted by
a factor that depends on its position in the list of recommendations, in this section
we evaluate the generated recommendation lists based on the normalized Discounted Cumulative Gain (nDCG) [Järvelin and Kekäläinen, 2002], where positions are discounted logarithmically.
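A minimal sketch of this metric with a logarithmic position discount is given below; taking the gains directly as the held-out ratings of the recommended items is an assumption of the illustration.

```python
import numpy as np

def ndcg(recommended_ratings, l=None):
    """Normalized discounted cumulative gain with a log2 position discount;
    recommended_ratings are the held-out ratings of the recommended items,
    in the order presented to the user."""
    r = np.asarray(recommended_ratings, dtype=float)[:l]
    discounts = 1.0 / np.log2(np.arange(2, r.size + 2))    # positions 1..l
    dcg = float(np.sum(r * discounts))
    idcg = float(np.sum(np.sort(r)[::-1] * discounts))     # best possible ordering
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg([3, 5, 4, 1]))   # below 1.0: the list is not ideally ordered
```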
The highest performance was again achieved using the empirical distribution of
user similarity, the normal, or the Weibull distribution. In particular, the average
increase of the nDCG score across all the examined probability distributions, neigh-
borhoods, and recommendation list sizes was 100.06%, 20.05%, and 89.85% for the
ML, MT, and AMZ data sets, respectively; the corresponding increase using only the
empirical distribution was 117.65%, 23.01%, and 383.99% for the different data sets
resulting on average in a 2-fold increase. The absolute performance of the empirical
distribution for the different data sets was 73.54%, 74.62%, and 42.63%, respectively.
Even though k-PN on average underperforms MF, it performs very well on both ML
and MT, especially given the goals of this method. Without using the first-order
bias approximation in the rating combining function, the proposed approach outper-
formed in most of the cases the classical k-NN algorithm and the k-FN methods by
an even wider margin. The same wide margin was also observed focusing on long-tail
items, except for MT. In terms of statistical significance, the differences among the
employed baselines and all the proposed specifications (apart from the FN1 for the
MT data set and the N1, Exp2, W2, and FN2 for the AMZ data set) are statistically
significant.
3.6 Experimental Settings of Optimized Neighborhood Se-
lection
To empirically validate the k-BN method presented in Section 3.3.4 and evaluate
the generated recommendations, we conduct a large number of experiments on “real-
world” data sets and compare our results to the k-PN (using the empirical distribution
of similarity) that outperforms different baselines as shown in Section 3.5.
The data sets that we used are the MovieLens [GroupLens, 2011] and MovieTweet-
ings [Dooms et al., 2013] as well as a snapshot from Amazon [McAuley and Leskovec,
2013] as in Section 3.4. Using the ML, MT, and AMZ data sets, we conducted a large
number of experiments and compared the results against the k-PN approach. In order to test the proposed approach of optimized neighborhood selection under various experimental settings, we used different sizes of neighborhoods (k ∈ {10, 20, . . . , 80})
with various specifications. In addition, we generated recommendation lists of differ-
ent sizes (l ∈ {1, 3, 5, 10, 20, . . . , 100}). In summary, we used 3 data sets, 8 different
sizes of neighborhoods, 6 specifications, and 13 different lengths of recommendation
lists, resulting in 1,872 experiments in total.
The different specifications we employed in the experiments correspond to the following conditions: i) k-BNu, where the weights in the combining function are equal across all neighbors (i.e., w_{u,v} = 1), ii) k-BNs, where the weights depend on the similarity between each neighbor and the target user (i.e., w_{u,v} = s_{u,v}), iii) k-BNo, where the weights are estimated based on Eqn. 3.4, iv) k-BNwu, where the neighbor’s rating prediction for the target user is weighted by the similarity with the target user s_{u,v} and the weights in the combining function are equal across all neighbors, v) k-BNws, where the neighbor’s rating prediction is estimated as before and the weights depend on the similarity, and finally vi) k-BNwo, where the prediction is estimated as before and the weights are based on Eqn. 3.4.
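As a rough illustration only, the following Python sketch shows how the neighbor weights could be formed under the three weighting schemes (the variable names are hypothetical, and the estimation of the optimized weights via Eqn. 3.4 is not reproduced here; the “w” variants would additionally multiply each neighbor’s rating prediction by s_{u,v} before combining):

    import numpy as np

    def combine_neighbor_predictions(neighbor_preds, similarities, spec, optimized_weights=None):
        """Combine the neighbors' rating predictions under the k-BN weighting specifications.

        neighbor_preds    : per-neighbor rating predictions for the target user
        similarities      : similarities s_{u,v} between the target user and each neighbor
        spec              : 'u' (equal weights), 's' (similarity weights), or 'o' (optimized weights)
        optimized_weights : weights estimated separately (e.g., via Eqn. 3.4) when spec == 'o'
        """
        neighbor_preds = np.asarray(neighbor_preds, dtype=float)
        similarities = np.asarray(similarities, dtype=float)
        if spec == 'u':      # k-BNu / k-BNwu: w_{u,v} = 1
            weights = np.ones_like(similarities)
        elif spec == 's':    # k-BNs / k-BNws: w_{u,v} = s_{u,v}
            weights = similarities
        elif spec == 'o':    # k-BNo / k-BNwo: externally estimated weights
            weights = np.asarray(optimized_weights, dtype=float)
        else:
            raise ValueError("unknown specification")
        return float(np.dot(weights, neighbor_preds) / weights.sum())

    # Hypothetical example: three neighbors combined with similarity-based weights
    print(combine_neighbor_predictions([4.0, 3.0, 5.0], [0.9, 0.5, 0.2], spec='s'))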
In all the conducted experiments, in order to measure the similarity among the
candidate neighbors, we used the cosine similarity; similar results were also obtained
using the Pearson correlation. We also used significance weighting as in [Herlocker
et al., 1999], in order to penalize for similarity based on few common ratings, and
filtered any candidate neighbors with zero weight [Desrosiers and Karypis, 2011]. In
addition, we used a holdout validation scheme in all of our experiments with 80/20
splits of the rating tuples to the training/test parts in order to avoid overfitting. For
the k-BN method and the estimation of bias, variance, and covariance, we also used 3 folds, each sampling with replacement 50% of the observations in the training set. Finally,
the generated recommendation lists were also evaluated using a subset of the test set
containing only highly rated items as well as only long-tail items [Cremonesi et al.,
2010].
3.7 Results of Optimized Neighborhood Selection
Overall, the proposed method generates recommendations that alleviate the over-
specialization and concentration problems, based on metrics of coverage, dispersion,
and diversity reinforcement (mobility of recommendations), while avoiding any sig-
nificant accuracy loss. Fig. 3.11 shows an overview of the performance of the k-BN
and k-PN methods on the MT data set across various metrics for recommendation
lists of size l = 10.
Figure 3.11: Summary of performance for the MT data set.
3.7.1 Orthogonality of Recommendations
In this section, we examine whether the proposed approach finds and recommends
different items than those recommended by the k-PN method. In particular, we in-
vestigate the overlap of recommendations (i.e., the percentage of items that belong
to both recommendation lists) between the k-PN method and the various specifica-
tions of the proposed approach. Fig. 3.12 presents the results obtained by applying our method to the MovieLens, MovieTweetings, and Amazon data sets. The values reported are computed as the average overlap over eight neighborhood sizes, k ∈ {10, 20, . . . , 80}.

[Figure 3.12: Overlap of recommendation lists for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon; rows: k-BNwo, k-BNws, k-BNwu, k-BNo, k-BNs, k-BNu; x-axis: recommendation list size.]
As Fig. 3.12 demonstrates, the proposed method generates recommendations that
are very different from the recommendations provided by the k-PN approach. In par-
ticular, the average overlap across all the proposed specifications, neighborhoods, and recommendation list sizes was 0.30%, 20.54%, and 0.25% for the ML,
MT, and AMZ data sets, respectively. Besides, recommendation lists of smaller size
resulted in even smaller overlap among the various methods.
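For clarity, the overlap between two recommendation lists of equal length can be computed as in the following minimal sketch (hypothetical helper; the reported figures additionally average this quantity over users, neighborhood sizes, and list sizes):

    def overlap_percentage(list_a, list_b):
        """Percentage of items that appear in both top-N recommendation lists (equal length assumed)."""
        return 100.0 * len(set(list_a) & set(list_b)) / len(list_a)

    # Example: two top-5 lists sharing a single item yield 20% overlap
    print(overlap_percentage([1, 2, 3, 4, 5], [5, 6, 7, 8, 9]))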
3.7.2 Comparison of Coverage and Diversity
In this section, we investigate the effect of the proposed method on coverage and
aggregate diversity. The results obtained using the catalog coverage metric are equiv-
alent to those using the diversity-in-top-N metric for aggregate diversity; henceforth,
only one set of results is presented. Fig. 3.13 presents the results obtained by ap-
plying our method to the ML, MT, and AMZ data sets. In particular, the Hinton
diagram in Fig. 3.13 shows the percentage increase/decrease in performance compared
to the k-PN method for each specification and recommendation lists of size
l ∈ {1, 3, 5, 10, 20, . . . , 100} over eight neighborhood sizes, k ∈ {10, 20, . . . , 80}. Posi-
tive and negative values are represented by white and black squares, respectively, and
the size of each square represents the magnitude of each value.
Fig. 3.13 demonstrates that the proposed method in most cases performs better than
the user-based k-NN, matrix factorization, k-FN, and k-PN methods. In particular,
the average aggregate diversity increase across all the specifications, neighborhoods,
and recommendation list sizes was 5.90%, 27.71%, and 1.34% for the ML, MT, and
AMZ data sets, respectively. The performance increased both in experiments where the k-NN method, because of the specifics of the particular data sets, resulted in low aggregate diversity (e.g., Amazon) and in experiments where it resulted in high aggregate diversity (e.g., MovieTweetings). We should note that, for the ML data set, estimating the optimal weights for the neighbors resulted in decreased performance compared to the k-PN approach.

[Figure 3.13: Increase in aggregate diversity performance for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]
3.7.3 Comparison of Dispersion and Diversity Reinforcement
Fig. 3.14 shows the percentage increase (white squares) or decrease (black squares)
in dispersion of recommendations compared to the k-PN method. The Gini coefficient
was on average improved by 0.015%, 0.24%, and 0.007% for the ML, MT, and AMZ
data sets, respectively. Hence, in the recommendation lists generated from the pro-
posed method, the number of times an item is recommended is more equally distributed
compared to other standard CF methods as well as the k-PN method.
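One standard way to compute such a dispersion measure over the per-item recommendation counts is sketched below (a common Gini formulation; the exact variant used in the experiments may differ):

    import numpy as np

    def gini_coefficient(recommendation_counts):
        """Gini coefficient of how often each item is recommended.

        0 means every item is recommended equally often; values close to 1 indicate
        that recommendations concentrate on only a few items.
        """
        x = np.sort(np.asarray(recommendation_counts, dtype=float))
        n = x.size
        if x.sum() == 0:
            return 0.0
        ranks = np.arange(1, n + 1)
        return float(2.0 * np.dot(ranks, x) / (n * x.sum()) - (n + 1.0) / n)

    # Example: perfectly balanced versus highly concentrated recommendation counts
    print(gini_coefficient([10, 10, 10, 10]))  # 0.0
    print(gini_coefficient([0, 0, 0, 40]))     # 0.75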
We also evaluate the generated recommendations based on the (popularity-based)
diversity reinforcement and mobility of the recommendations (i.e., whether popular
or long-tail items are more likely to be recommended) as in Section 3.5.3. The experi-
mental results illustrated in Fig. 3.15 demonstrate that the proposed method generates
recommendation lists that exhibit in most cases higher diversity reinforcement com-
pared to the k-NN, MF, k-FN, and k-PN methods. In particular, the performance
was increased by 0.52%, 4.95%, and −0.43% for the ML, MT, and AMZ data sets,
respectively. We also note that recommendation lists of larger size resulted on aver-
age in even larger improvements. Besides, considering a smaller number of items as
popular also resulted in larger improvements.
[Figure 3.14: Increase in dispersion of recommendations for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

[Figure 3.15: Increase in diversity reinforcement and mobility of recommendations for the different data sets and recommendation list sizes. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

3.7.4 Comparison of Item Prediction

Apart from alleviating the concentration bias and over-specialization problems in CF systems, the proposed approach should also perform well in terms of predictive
accuracy. Thus, the goal in this section is to compare the proposed method with the
standard baseline methods in terms of traditional metrics for item prediction, such
as the F1 score. Fig. 3.16 presents the results obtained by applying the proposed
method to the MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data
sets. The values reported in Fig. 3.16 present the relative F1 score improvements in
comparison to k-PN for recommendation lists of size l ∈ {1, 3, 5, 10, 20, . . . , 100}; the
size of each white square represents the magnitude of each value with the maximum
size corresponding to the maximum increase achieved in the conducted experiments
for each data set. Similar results were also obtained using as positive instances only the highly rated items (i.e., items rated above the average rating or above 80% of the rating scale) in the test set.

Fig. 3.16 demonstrates that the proposed method outperforms the k-PN approach in most of the cases. In particular, the average F1 score increase across all the proposed specifications, neighborhoods, and recommendation list sizes was
−0.04%, 62.03%, and 770% (38.22% for recommendations of size larger than 1) for
the ML, MT, and AMZ data sets, respectively.
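The F1 score at a given list size, as used for item prediction here, can be computed per user as in this sketch (standard definitions; the choice of relevant items, e.g., highly rated held-out items, follows the setup described above):

    def precision_recall_f1(recommended, relevant):
        """Precision, recall, and F1 for one user's top-N recommendation list."""
        recommended, relevant = set(recommended), set(relevant)
        hits = len(recommended & relevant)
        if hits == 0:
            return 0.0, 0.0, 0.0
        precision = hits / len(recommended)
        recall = hits / len(relevant)
        return precision, recall, 2 * precision * recall / (precision + recall)

    # Example: 2 of the 5 recommended items are among the user's 4 relevant test items
    print(precision_recall_f1([1, 2, 3, 4, 5], [2, 5, 8, 9]))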
3.7.5 Comparison of Rating Prediction
We also compare the proposed method with the k-PN method in terms of RMSE
score. Fig. 3.17 presents the results obtained by applying the proposed method to the
MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data sets. The values
reported in Fig. 3.17 present the relative RMSE score improvement (i.e., decrease)
for neighborhoods of size k ∈ {10, 20, . . . , 80}; the size of each white square represents
the magnitude of each value with the maximum size corresponding to the maximum
change achieved in the conducted experiments for each data set.
[Figure 3.16: Item prediction performance for the different data sets. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

[Figure 3.17: Rating prediction performance for the different data sets. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

Fig. 3.17 demonstrates that the proposed method outperforms the k-PN approach
in most of the cases. In particular, the average RMSE improvement across all the proposed specifications, neighborhoods, and recommendation list sizes was 0.011%, 3.58%, and 1.92% for the ML, MT, and AMZ data sets, respectively. The corresponding improvement compared to k-NN is 0.022%, 3.94%, and 1.92%, respectively.
3.7.6 Comparison of Utility-based Ranking
In this section we evaluate the generated recommendation lists based on the normalized Discounted Cumulative Gain (nDCG) [Jarvelin and Kekalainen, 2002], where positions are discounted logarithmically. Fig. 3.18 presents the results obtained by applying the proposed method to the MovieLens (ML), MovieTweetings (MT), and Amazon (AMZ) data sets. The values reported in Fig. 3.18 present the relative nDCG improvement for recommendation lists of size l ∈ {1, 3, 5, 10, 20, . . . , 100}.
The average increase of the nDCG score across all the examined specifications, neighborhoods, and recommendation list sizes was 39.58%, −0.63%, and 293.91% for the ML, MT, and AMZ data sets, respectively.

[Figure 3.18: Utility-based ranking performance for the different data sets. Panels: (a) MovieLens, (b) MovieTweetings, (c) Amazon.]

3.8 Discussion of Neighborhood Selection

In this chapter, we introduce various algorithms to address the problem of identifying better subsets of neighbors in collaborative filtering systems. We first present a novel method for recommending items based on probabilistic neighborhood selection (k-PN) in collaborative filtering systems. We then propose an ordered neighborhood selection approach where neighborhoods of increasing size are constructed by incorporating at each iteration the neighbor that reduces the expected predictive error the most (k-BN). We illustrate the practical implementation of the proposed approaches in the context of memory-based systems, adapting and improving the standard k-
nearest neighbors (k-NN) method. In the proposed approaches, the neighborhoods
are selected based on expected generalization error (i.e., k-BN) or an underlying
probability distribution (i.e., k-PN), instead of just the k neighbors with the highest
similarity level to the target. For the probabilistic neighborhood selection (k-PN)
approach, we use an efficient method for weighted sampling of k neighbors that takes
into consideration similarity levels between the target and all the candidate neighbors.
In addition, we conduct an empirical study showing that the proposed methods,
by selecting diverse representative neighborhoods, generate recommendations that are
very different from the classical CF approaches and alleviate the over-specialization
and concentration problems while outperforming k-NN, k-FN, and matrix factoriza-
tion methods. We also demonstrate that the proposed methods outperform, by a
wide margin in most cases, both the standard k-nearest neighbors and the k-furthest
neighbors approaches in terms of both item prediction accuracy and utility-based
ranking. The experimental results are also in accordance with the phenomenon of
“hubness” and the ensemble learning theory that we employ in the neighborhood-
based CF framework. Besides, we show that the performance improvement is not
achieved at the expense of other popular performance measures.
Moreover, the proposed methods can be further extended and modified in order to
sample k neighbors from the x nearest candidates, instead of all the available users,
and combined with additional rating normalization and similarity weighting schemes
[Jin et al., 2004] beyond those employed in this study. Similarly, the complexity of
the proposed approaches can be further reduced by first filtering out the candidates
for selection in the resulting neighborhood. For instance, only users with more than a pre-defined number of ratings can be considered as candidate neighbors. Also, apart
from the user-based and item-based k-NN collaborative filtering approaches, other
popular methods that can be easily extended with the use of the proposed neigh-
borhood selection methods, in order to allow us to generate both accurate and novel
recommendations, include Matrix Factorization approaches and global neighborhood
models (e.g., [Koren, 2008, 2010]). Besides, this approach can be further extended to
the popular methods of k-NN classification and regression in IR.
CHAPTER IV
On Unexpectedness in Recommender Systems: Or
How to Expect the Unexpected
“If you do not expect it, you will not find the unexpected, for it is hard
to find and difficult”.
- Heraclitus of Ephesus, 544 - 484 B.C.
4.1 Introduction to Unexpectedness
One key dimension for improvement that can significantly contribute to the overall
performance and usefulness of RSes, and is still under-explored, is the notion of unex-
pectedness. RSes often recommend expected items that the users are already familiar
with and, thus, they are of little interest to them. For example, a shopping RS may
recommend to customers products such as milk and bread. Although being accurate,
in the sense that the customer will indeed buy these two products, such recommen-
dations are of little interest because they are obvious, since the shopper will, most
likely, buy these products even without these recommendations. Therefore, because
of this potential for higher user satisfaction, it is important to study non-obvious
recommendations. Motivated by the challenges and implications of this problem, we
try to resolve it by recommending unexpected items of significant usefulness to the
users.
Following the Greek philosopher Heraclitus, we approach this hard and difficult
problem of finding and recommending unexpected items by first capturing the ex-
pectations of the user. The challenge is not only to identify the items expected by
the user and then derive the unexpected ones, but also to enhance the concept of
unexpectedness while still delivering recommendations of high quality that fairly match the user’s interests.
In this chapter, we formalize this concept by providing a new formal definition
of unexpected recommendations, as those recommendations that significantly depart
from user’s expectations, and differentiate it from various related concepts, such as
novelty and serendipity. We also propose a method for generating unexpected recom-
mendations and suggest specific metrics to measure the unexpectedness of recommen-
dation lists. Finally, we show that the proposed method can enhance unexpectedness
while maintaining the same or higher levels of accuracy of recommendations.
4.2 Related Concepts
In the following paragraphs, we discuss how various related but still different
concepts, such as novelty, diversity, and serendipity, differ from the proposed notion
of unexpectedness.
In particular, comparing novelty to unexpectedness, a novel recommendation
might be unexpected but novelty is strictly defined in terms of previously unknown
non-redundant items without allowing for known but unexpected ones. Also, novelty
does not include any positive reactions of the user to recommendations. Illustrat-
ing some of these differences in the movie context, assume that the user John Doe
is mainly interested in Action & Adventure films. Recommending to this user the
newly released production of one of his favorite Action & Adventure film directors is
a novel recommendation but not necessarily unexpected and possibly of low utility
for him since John was either expecting the release of this film or he could easily find
out about it. Similarly, assume that we recommend to this user the latest Children &
Family film. Although this is definitely a novel recommendation, it is probably also
of low utility and would be likely considered “irrelevant” because it departs too much
from his expectations.
Moreover, even though both serendipity and unexpectedness involve a positive sur-
prise of the user, serendipity is restricted to novel items and their accidental discovery,
without taking into consideration the expectations of the users and the relevance of
the items, and thus constitutes a different type of recommendation that can be more
risky and ambiguous. To further illustrate the differences of these two concepts, let
us assume that we recommend to John Doe the latest Romance film. There are some
chances that John will like this novel item and the accidental discovery of a serendip-
itous recommendation. However, such a recommendation might also be of low utility
to the user since it does not take into consideration his expectations and the relevance
of the items. On the other hand, assume that we recommend to John Doe a movie in
which one of his favorite Action & Adventure film directors is performing as an actor
in an old (non-novel) Action film of another director. The user will most probably
like this unexpected but non-serendipitous recommendation.
Under a definition of diversification as the process of maximizing the variety of
items in a recommendation list, avoiding a too narrow set of choices is generally a
good approach to increase the usefulness of the recommendation list since it enhances
the chances that a user is pleased by at least some recommended items. However,
diversity is a very different concept from unexpectedness and constitutes an ex-post
process that can be combined with the concept of unexpectedness.
Pertaining to unexpectedness, the previously proposed system-centric approaches
do not fully capture the multi-faceted concept of unexpectedness since they do not
truly take into account the actual expectations of the users, which is crucial accord-
ing to philosophers, such as Heraclitus, and some modern researchers [Silberschatz
and Tuzhilin, 1996; Berger and Tuzhilin, 1998; Padmanabhan and Tuzhilin, 1998].
Hence, an alternative user-centric definition of unexpectedness, taking into account
prior expectations of the users, and methods for providing to the users unexpected
recommendations are still needed. In particular, a user-centric definition of unex-
pectedness and the corresponding methods should avoid recommendations that are
obvious, irrelevant, or expected to the user, but without being strictly restricted only
to novel items, and also should allow for a notion of positive discovery, as a recom-
mendation makes more sense when it exposes the user to a relevant experience that
she/he has not thought of or experienced yet. In this dissertation, we deviate from
the previous definitions of unexpectedness and propose a new formal user-centric def-
inition, as recommending to the users those items that depart from what they expect
from the recommender system, which we thoroughly discuss in the next section.
Based on the previous definitions and the discussed similarities and differences,
one can conclude that the concepts of novelty, serendipity, and unexpectedness are
overlapping since all these entities are linked to a notion of discovery, as a recom-
mendation makes more sense when it exposes the user to a relevant experience that
she/he has not thought of or experienced yet. However, unexpectedness includes the
positive reaction of a user to recommendations about previously unknown items, but
without being restricted only to novel items and, also avoids recommendations that
are obvious, irrelevant, or expected to the user.
4.3 Definition of Unexpectedness
In this section, we formally model and define the concept of unexpected rec-
ommendations as those recommendations that significantly depart from the user’s
expectations. However, unexpectedness alone is not enough for providing truly useful
recommendations since it is possible to deliver unexpected recommendations but of
low quality. Therefore, after defining unexpectedness, we introduce utility of a recom-
mendation and provide an example of utility as a function of the quality of recommen-
dation (e.g., specified by the item’s rating) and its unexpectedness. We maintain that
this utility of a recommended item is the concept on which we should focus (vis-a-vis
“pure” unexpectedness) by recommending items with the highest levels of utility to
the user. Finally, we propose an algorithm for providing the users with unexpected
recommendations of high quality that are hard to discover but fairly match their in-
terests and present specific performance measures for evaluating the unexpectedness
of the generated recommendations. We define unexpectedness in Section 4.3.1, the
utility of recommendations in Section 4.3.2, and we propose a method for delivering
unexpected recommendations of high quality in Section 4.3.3 and metrics for their
evaluation in Section 4.3.4.
4.3.1 Unexpectedness of Recommendations
To define unexpectedness, we start with user expectations. The expected items
for each user u can be defined as a consideration set, a finite collection of typical items and those items that the user considers as choice candidates in order to serve her own current needs or fulfill her intentions, as indicated by interacting with the
recommender system. This concept of the sets of user expectations can be more
precisely specified and operationalized in the lower level of a specific application and
recommendation setting. In particular, the set of expected items Eu for a user can
be specified in various ways, such as the set of past transactions performed by the
user, or as a set of “typical” recommendations that she expects to receive or has
received in the past. Moreover, the sets of user expectations, as the true expectations of the users, can also be adapted to different contexts and evolve over time. For
example, in case of a movie RS, this set of expected items may include all the movies
already seen by the user and all their related and similar movies, where “relatedness”
and “similarity” are specified and operationalized through specific mechanisms in
Section 4.4.
Intuitively, an item included in the set of expected recommendations derives “zero
unexpectedness” for the user, whereas the more an item departs from the set of expec-
tations, the more unexpected it is, until it starts being perceived as irrelevant by the
user. Unexpectedness should thus be a positive, unbounded function of the distance
of this item from the set of expected items. More formally, we define unexpectedness
in recommender systems as follows. First, we define:
δ_{u,i} = d(i; E_u),   (4.1)
where d(i; Eu) is the distance of item i from the set of expected items Eu for user
u. Then, unexpectedness of item i with respect to user expectations Eu is defined as
some unimodal function ∆ of this distance:
∆(δ_{u,i}; δ^*_u),   (4.2)
where δ∗u is the best (most preferred) unexpected distance from the set of expected
items Eu for user u (the mode of distribution ∆). In particular, the most preferred
unexpected distance δ∗u for user u is a horizontally differentiated feature and can be
interpreted as the distance that results in the highest utility for a given quality of an
item (see Section 4.3.2) and captures the preferences of the user about unexpectedness.
Intuitively, unimodality of this function ∆ indicates that:
1. there is only one most preferred unexpected distance,
2. an item that greatly departs from the user’s expectations, even though it results in a large departure from expectations, will probably be perceived as irrelevant by the user and, hence, is not truly unexpected, and
3. items that are close to the expected set are not truly unexpected but rather
obvious to the user.
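Purely for illustration, one simple function with these properties (not necessarily the operationalization adopted in Section 4.4) is a Gaussian-shaped function of the distance, centered at the preferred unexpected distance:

    ∆(δ_{u,i}; δ^*_u) = exp( −(δ_{u,i} − δ^*_u)^2 / (2σ^2) ),

which attains its single maximum at δ_{u,i} = δ^*_u and decays both for items very close to the expected set (obvious recommendations) and for items very far from it (recommendations likely perceived as irrelevant); the width σ, an illustrative parameter, would control how quickly unexpectedness falls off around the preferred distance.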
The above definitions1 clearly take into consideration the actual expectations of the
users as we discussed in Section 4.2. Hence, unexpectedness is neither a characteristic
of items nor users, since an item can be expected for a specific user but unexpected
for another one. It is the interplay of the user and the item that characterizes whether
the particular recommendation is unexpected for the specific user or not.
However, recommending to a user the items that result in the highest level of
unexpectedness could be problematic, since recommendations should also be of high
quality and fairly match user preferences. In other words, it is important to emphasize
that simply increasing the unexpectedness of a recommendation list is valueless if this
list does not contain relevant items of high quality that the user likes. In order to
generate such recommendations that would maximize the users’ satisfaction, we use
certain concepts from the utility theory in economics [Marshall, 1920].
1 The aforementioned definitions serve as templates of the proposed concepts that are precisely defined and thoroughly operationalized through specific mechanisms in Sections 4.4.2.1-4.4.2.4.
4.3.2 Utility of Recommendations
In the context of recommender systems, pertaining to the concept of unexpect-
edness and trying to keep the complexity of our method to a minimum, we specify
the utility of a recommendation of an item to a user in terms of two components:
the utility of quality that the user will gain from using the product and the util-
ity of unexpectedness of the recommended item, as defined in Section 4.3.1. Our
proposed model follows the standard assumption in economics that the users engage in optimal utility-maximizing behavior [Marshall, 1920]. Additionally, we
consider the quality of an item to be a vertically differentiated characteristic [Tirole,
1988], which means that utility is a monotone function of quality and hence, given
the unexpectedness of an item, the greater the quality of this item, the greater the
utility of the recommendation to the user. Consequently, without loss of generality,
we propose that we can estimate this overall utility of a recommendation using the
previously mentioned utility of quality and the loss in utility by the departure from
the preferred level of unexpectedness δ∗u. This will allow the utility function to have
the required characteristics described so far. Note that the distribution of utility as
a function of unexpectedness and quality is non-linear, bounded, and experiences a
global maximum.
Formalizing these concepts, in order to provide an example of a utility function
to illustrate the proposed method, we assume that each user u values the quality of
an item by a positive constant qu and that the quality of the item i is represented by
the corresponding rating ru,i. Then, we define the utility derived from the quality of
the recommended item i to the user u as:
U^q_{u,i} = q_u × r_{u,i} + ε^q_{u,i},   (4.3)
where εqu,i is the error term defined as a random variable capturing the stochastic
aspect of recommending item i to user u.
We also assume that user u values the unexpectedness of an item by a non-negative
factor λu measuring the user’s tolerance to redundancy and irrelevance. The utility of
the user decreases by departing from the preferred level of unexpectedness δ∗u. Then,
the utility of the unexpectedness of a recommendation can be represented as:
U^δ_{u,i} = −λ_u × φ(δ_{u,i}; δ^*_u) + ε^δ_{u,i},   (4.4)
where function φ captures the departure of unexpectedness of item i from the preferred
level of unexpectedness δ∗u for user u and εδu,i is the error term for user u and item i.
Then, the utility of recommending items to users is computed as the sum of (4.3)
and (4.4):
U_{u,i} = U^q_{u,i} + U^δ_{u,i}   (4.5)

U_{u,i} = q_u × r_{u,i} − λ_u × φ(δ_{u,i}; δ^*_u) + ε_{u,i},   (4.6)
where εu,i is the stochastic error term.
Function φ can also be defined in various ways. For example, using popular
location models for horizontal and vertical differentiation of products in economics
[Cremer and Thisse, 1991; Neven, 1985], the departure from the preferred level of
unexpectedness can be defined as the linear distance:
U_{u,i} = q_u × r_{u,i} − λ_u × |δ_{u,i} − δ^*_u|,   (4.7)

or the quadratic one:

U_{u,i} = q_u × r_{u,i} − λ_u × (δ_{u,i} − δ^*_u)^2.   (4.8)
Note that the utility of a recommendation is linearly increasing with the rating
for these distances, whereas, given the quality of the product, it increases with un-
expectedness up to the threshold of the preferred level of unexpectedness δ∗u. This
threshold δ∗u is specific for each user and context. Also, note that two recommended
items of different quality and distance from the set of expected items may derive the
same levels of usefulness (i.e., indifference curves).2
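A minimal sketch of the deterministic part of these two utility variants, with hypothetical parameter values, is the following (the stochastic error terms are omitted):

    def utility_linear(r_ui, delta_ui, q_u, lambda_u, delta_star_u):
        """Utility with a linear penalty for departing from the preferred unexpectedness, as in (4.7)."""
        return q_u * r_ui - lambda_u * abs(delta_ui - delta_star_u)

    def utility_quadratic(r_ui, delta_ui, q_u, lambda_u, delta_star_u):
        """Utility with a quadratic penalty, as in (4.8)."""
        return q_u * r_ui - lambda_u * (delta_ui - delta_star_u) ** 2

    # Hypothetical example: a 4-star item whose distance from the expected set is
    # slightly above the user's preferred level of unexpectedness (0.6 vs. 0.5)
    print(utility_linear(4.0, 0.6, q_u=1.0, lambda_u=2.0, delta_star_u=0.5))     # 3.8
    print(utility_quadratic(4.0, 0.6, q_u=1.0, lambda_u=2.0, delta_star_u=0.5))  # 3.98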
4.3.3 Recommendation Algorithm
Once the utility function Uu,i is defined, we can then make recommendations
to user u by selecting items i having the highest values of utility Uu,i. Additionally,
specific restrictions can be applied on the quality and unexpectedness of the candidate
items, if appropriate in the application, in order to ensure that the recommended items
will exhibit specific levels of unexpectedness and quality.3
Algorithm 4 summarizes the proposed method for generating unexpected recom-
mendations of high quality that are hard to discover and fairly match the users’
interests. In particular, we compute for each user u a set of expected recommenda-
tions Eu. Then, for each item i in our product base, if the estimated quality of the
item q_{u,i} is above the threshold \underline{q}, we compute the distance δ_{u,i} of the specific item from the set of expectations E_u for the particular user. If the distance δ_{u,i} is within the specified interval [\underline{δ}, \bar{δ}], we compute the utility of unexpectedness U^δ_{u,i} of item i for user u based on φ(δ_{u,i}; δ^*_u). Next, we estimate the final utility U_{u,i} of recommend-
2 Eqs. (4.5) and (4.6) illustrate a simple example of a utility function for the problem of unexpectedness in recommender systems. Any utility function may be used and not necessarily a weighted sum of two or more distinct components. The reader might even derive examples of utility functions without the use of δ^* but may lose some of the discussed properties (e.g., global maximum). Besides, function φ does not have to be symmetric as in the examples provided in (4.7) and (4.8).

3 In the same sense, if required in a specific setting, only items not included in the set of user expectations can be considered candidates for recommendation. An alternative way to control the expected levels of unexpectedness can be based on the utility function of choice and tuning of its coefficients.
ALGORITHM 4: Unexpectedness Recommendation Algorithm

Input: Users’ profiles, utility function, estimated quality of items for users, context, etc.
Output: Recommendation lists of size N_u

q_{u,i}: Quality of item i for user u
\underline{q}: Lower limit on quality of recommended items
\underline{δ}: Lower limit on distance of recommended items from expectations
\bar{δ}: Upper limit on distance of recommended items from expectations
N_u: Number of items recommended to user u

for each user u do
    Compute expectations E_u for user u;
    for each item i do
        if q_{u,i} ≥ \underline{q} then
            Compute distance δ_{u,i} of item i from expectations E_u for user u;
            if δ_{u,i} ∈ [\underline{δ}, \bar{δ}] then
                Estimate utility of unexpectedness U^δ_{u,i} of item i for user u based on φ(δ_{u,i}; δ^*_u);
                Estimate utility U_{u,i} of item i for user u;
            end
        end
    end
    Recommend to user u the top N_u items having the highest utility U_{u,i};
end
ing this item to the specific user based on the different components of the specified
utility function; the estimated utility corresponds to the final predicted rating ru,i of
the classical recommender system algorithms. Finally, we recommend to the user the
items that exhibit the highest estimated utility U_{u,i}. Examples of how to compute the set of expected items E_u for a user are provided in Section 4.4.2.3.
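A compact Python rendering of Algorithm 4 is sketched below; the callables compute_expectations, estimate_quality, distance_from_set, and utility are placeholders for the application-specific mechanisms of Sections 4.4.2.1-4.4.2.4 and are not part of the original algorithm statement:

    def recommend_unexpected(users, items, compute_expectations, estimate_quality,
                             distance_from_set, utility, n_u, q_min, delta_min, delta_max):
        """Sketch of Algorithm 4: rank candidate items by utility of quality and unexpectedness."""
        recommendations = {}
        for u in users:
            expectations = compute_expectations(u)            # set of expected items E_u
            scored = []
            for i in items:
                q_ui = estimate_quality(u, i)
                if q_ui < q_min:                              # lower limit on quality
                    continue
                delta_ui = distance_from_set(i, expectations)
                if not (delta_min <= delta_ui <= delta_max):  # allowed unexpectedness interval
                    continue
                scored.append((utility(u, i, q_ui, delta_ui), i))
            scored.sort(key=lambda pair: pair[0], reverse=True)
            recommendations[u] = [i for _, i in scored[:n_u]]
        return recommendations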
4.3.4 Evaluation of Recommendations
[Adomavicius and Tuzhilin, 2005; Herlocker et al., 2004; McNee et al., 2006;
Adamopoulos, 2014b, 2013a; Adamopoulos et al., 2014] suggest that RS should be
evaluated not only by their accuracy, but also by other important metrics such as
coverage, serendipity, unexpectedness, and usefulness. Hence, we propose specific
metrics to evaluate the candidate items and the generated recommendations.
4.3.4.1 Metrics of Unexpectedness
In order to accurately and precisely measure the unexpectedness of candidate
items and generated recommendation lists, we deviate from the approach proposed
by Murakami et al. [2008] and Ge et al. [2010], and propose new metrics to evaluate
our method. In particular, Murakami et al. [2008] and Ge et al. [2010] focus on the
difference in predictions between two algorithms (i.e., the deviation of beliefs in a
recommender system from the results obtained from a primitive prediction model
that shows high ratability) and thus Ge et al. [2010] calculate the unexpected set of
recommendations (UNEXP) as:
UNEXP = RS \ PM (4.9)
where PM is a set of recommendations generated by a primitive prediction model
and RS denotes the recommendations generated by a recommender system. When an
element of RS does not belong to PM, they consider this element to be unexpected.
As Ge et al. [2010] argue, unexpected recommendations may not be always useful
and, thus, the paper also introduces a serendipity measure as:
SRDP = |UNEXP ∩ USEFUL| / |N|   (4.10)
where USEFUL denotes the set of “useful” items and N the length of the recom-
mendation list. For instance, the usefulness of an item can be judged by the users or
approximated by the items’ ratings as we describe in Section 4.4.2.6.
However, these measures do not fully capture the proposed user-centric definition
of unexpectedness since a PM usually contains just the most popular items and does not actually take the expectations of the users into account at all. Consequently, we revise their definition and introduce new metrics to measure unexpectedness as
follows. First of all, we define expectedness (EXPECTED) as the mean ratio of the
items that are included in both the set of expected recommendations for a user (Eu)
and the generated recommendation list (RSu):
EXPECTED = Σ_u |RS_u ∩ E_u| / |N|.   (4.11)
Furthermore, we propose a metric of unexpectedness (UNEXPECTED) as the
mean ratio of the items that are not included in the set of expected recommendations
for the user but are included in the generated recommendation lists:
UNEXPECTED = Σ_u |RS_u \ E_u| / |N|.   (4.12)
Correspondingly, we can also derive a new metric, following the SRDP measure
of serendipity [Murakami et al., 2008], based on the proposed concept and metric of
unexpectedness:
UNEXPECTED^+ = Σ_u |(RS_u \ E_u) ∩ USEFUL_u| / |N|.   (4.13)
For the sake of simplicity, the metrics defined so far consider whether an item is
expected to the user or not in terms of strict boolean identity. However, we can relax
this restriction using the distance of an item from the set of expectations as in (4.1),
or the unexpectedness of an item as in (4.2). For instance:
UNEXPECTED = Σ_u ∆(δ_{u,i}; δ^*_u) / |N|.   (4.14)
Moreover, the metrics proposed in this section can be combined with those sug-
gested by Murakami et al. [2008] and Ge et al. [2010] as described in Section 4.4.2.6.
Besides, the proposed metrics can be adapted to take into consideration the rank of
the item in the recommendation list by using a rank discount factor as in [Castells
et al., 2011].
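The boolean variants of these metrics can be computed as in the following sketch (hypothetical data structures; it mirrors the summation form of Eqs. (4.11)-(4.13), and a per-user mean is obtained by dividing by the number of users):

    def unexpectedness_metrics(recommendations, expectations, useful, n):
        """EXPECTED, UNEXPECTED, and UNEXPECTED+ aggregated over users, following (4.11)-(4.13)."""
        expected = unexpected = unexpected_plus = 0.0
        for u, rec_list in recommendations.items():
            rec, exp_u = set(rec_list), set(expectations[u])
            expected += len(rec & exp_u) / n
            unexpected += len(rec - exp_u) / n
            unexpected_plus += len((rec - exp_u) & useful) / n
        return expected, unexpected, unexpected_plus

    # Hypothetical example with two users and recommendation lists of size n = 3
    recs = {"u1": ["a", "b", "c"], "u2": ["b", "d", "e"]}
    exps = {"u1": {"a"}, "u2": {"d", "e"}}
    useful = {"b", "c"}
    print(unexpectedness_metrics(recs, exps, useful, n=3))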
4.3.4.2 Metrics of Accuracy
The recommendation lists can also be evaluated for the accuracy of rating and item
predictions using standard metrics such as Root Mean Square Error, Mean Absolute
Error, Precision, Recall, and the F-measure. In applications where the number of
recommendations presented to the user is preordained, the most useful measure of
interest usually is precision at N [Shani and Gunawardana, 2011].
Finally, recommender systems can also be evaluated based on various other metrics
including diversity, confidence, trust, robustness, adaptivity, and catalog coverage
[Shani and Gunawardana, 2011].
4.4 Experimental Settings
To empirically validate the method presented in Section 4.3.3 and evaluate the
unexpectedness of the generated recommendations, we conduct a large number of
experiments on “real-world” data sets and compare our results to popular baseline
methods.
Unfortunately, we could not compare our results with other methods for deriving
unexpected recommendations for the following reasons. First, among the previously
proposed methods of unexpectedness, as explained in Section 4.2, the authors present
only the performance metrics and do not provide any clear computational algorithm
for computing recommendations, thus making the comparison impossible. Further,
most of the existing methods are based on related but different principles such as
diversity and novelty. Since these concepts are, in principle, very different from our
definition, they cannot be directly compared with our approach. Besides, most of the
methods of novelty and serendipity require additional data, such as explicit informa-
tion from the users about known items. In addition, many of the methods of these
related concepts are not generic and cannot be implemented in a traditional recom-
mendation setting, but assume very specific applications and domains. Consequently,
we selected a number of standard Collaborative Filtering (CF) and other algorithms
as baseline methods to compare with the proposed approach. In particular, we se-
lected both the item-based and user-based k-Nearest Neighborhood approach (kNN),
the Slope One (SO) algorithm [Lemire and Maclachlan, 2007], a Matrix Factorization
(MF) method [Koren et al., 2009], the average rating value of an item, and a baseline
using the average rating value plus a regularized user and item bias [Koren, 2010].
We would like to indicate that, although the selected baseline methods do not explic-
itly support the notion of unexpectedness, they constitute fairly reasonable baselines
because, as was pointed out in [Burke, 2002], CF methods also perform well in terms
of other performance measures besides the classical accuracy measures.4
4 The proposed method also outperforms in terms of unexpectedness other methods that capture the related but different concepts of novelty, serendipity, and diversity, such as the k-furthest neighbor collaborative filtering recommender algorithm [Said et al., 2012b].
4.4.1 Data Sets
The basic data sets that we used are the RecSys HetRec 2011 MovieLens data set
[Cantador et al., 2011] and the BookCrossing data set [Ziegler et al., 2005].
The RecSys HetRec 2011 MovieLens (ML) data set [Cantador et al., 2011] is
an extension of a data set published by [GroupLens, 2011], which contains personal
ratings and tags about movies, and consists of 855,598 ratings from 2,113 users on
10,197 movies. This data set is relatively dense (3.97%) compared to other frequently
used data sets but we believe that this characteristic is a virtue that will let us better
evaluate our method since it allows us to better specify the set of expected movies
for each user. Besides, in order to test the proposed method under various levels of
sparsity [Adomavicius and Zhang, 2012], we consider different proper subsets of the
data sets.
Additionally, we used information and further details from Wikipedia [Wikipedia,
2012] and IMDb [IMDb, 2011]. Joining these data sets we were able to enhance
the available information by identifying whether a movie is an episode or sequel of
another movie included in our data set. We succeeded in identifying “related” items
(i.e., episodes, sequels, movies with exactly the same title) for 2,443 of our movies
(23.95% of the movies with 2.18 related movies on average and a maximum of 22).
We used this information about related movies to identify sets of expectations, as
described in Section 4.4.2.3. We also consider a proper subset (b) of the MovieLens
data set consisting of 4,735 items and 2,029 users, with at least 25 ratings each,
exhibiting 807,167 ratings.
The BookCrossing (BC) data set is gathered by Ziegler et al. [2005] from Bookcross-
ing.com [BookCrossing, 2004], a social networking site founded to encourage the ex-
change of books. This data set contains fully anonymized information on 278,858
members and 1,157,112 personal ratings, both implicit and explicit, referring to
271,379 distinct ISBNs. The specific data set was selected because we can use the
implicit ratings of the users to better specify their expectations, as described in Sec-
tion 4.4.2.3. Besides, we supplemented the available data for 261,229 books with
information from Amazon [Amazon, 2012], Google Books [Google, 2012], ISBNdb [IS-
BNdb.com, 2012], LibraryThing [LibraryThing, 2012], Wikipedia [Wikipedia, 2012],
and WorldCat [WorldCat, 2012]. Such data is often publicly available and, there-
fore, it can be freely and widely used in many recommender systems [Umyarov and
Tuzhilin, 2011].
Since some books on BookCrossing refer to rare, non-English books, or outdated
titles not in print anymore, we were able to collect background information and
“related” books (i.e., alternative editions, sequels, books in the same series, with
same subjects and classifications, with the same tags, and books identified as related
or similar by the aforementioned services) for 152,702 of the books with an average of
31 related books per ISBN. Following Ziegler et al. [2005] and owing to the extreme
sparsity of the BookCrossing data set, we decided to further condense the data set
in order to obtain more meaningful results from collaborative filtering algorithms.
Hence, we discarded all the books for which we were not able to find any information,
along with all the ratings referring to them. Next, we also removed book titles with
fewer than 4 ratings and community members with fewer than 8 ratings each. The
dimensions of the resulting data set were considerably more moderate, featuring 8,824
users, 18,607 books, and 377,749 ratings (147,403 explicit ratings). Finally, we also
consider two proper subsets of this; (b) 3,580 items with at least 10 ratings and 2,545
users, with at least 15 ratings each, exhibiting 57,176 explicit and 95,067 implicit
ratings and (c) 870 items and 1,379 users with at least 25 ratings exhibiting 22,192
explicit and 37,115 implicit ratings.
Based on the collected information, we approximated the sets of expected recom-
mendations for the users, using the mechanisms described in detail in Section 4.4.2.3.
4.4.2 Experimental Setup
Using the MovieLens data set, we conducted 7,488 experiments. In half of the
experiments we assume that the users are homogeneous (Hom) and have exactly the
same preferences. In the other half, we investigate the more realistic case (Het) where
users have different preferences that depend on previous interactions with the system.
Furthermore, we use two different and diverse sets of expected movies for each user,
and different utility functions. We also use different rating prediction algorithms
and various measures of distance between movies and among a movie and the set of
expected recommendations. Finally, we derived recommendation lists of different sizes
(k ∈ {1, 3, 5, 10, 20, . . . , 100}). In conclusion, we used 2 subsets, 2 sets of expected
movies, 6 algorithms for rating prediction, 3 correlation metrics, 2 distance metrics,
2 utility functions, 2 different assumptions about user preferences, and 13 different
lengths of recommendation lists, resulting in 7,488 experiments in total.
Using the BookCrossing data set, we conducted our experiments on three different
proper subsets described in Section 4.4.1. As before, we also assume different specifi-
cations for the experiments. In particular, we used 3 subsets, 3 sets of expected books,
6 algorithms for rating prediction, 3 correlation metrics, 2 distance metrics, 2 utility
functions, 2 different assumptions about user preferences, and 13 different lengths
of recommendation lists, resulting in 16,848 experiments in total. The experimental
settings are described in detail in Sections 4.4.2.1 - 4.4.2.4.
4.4.2.1 Utility of Recommendation
We consider the following utility functions:
(1a) Representative agent (homogeneous users) with linear distance (Hom-Lin): The users are homogeneous and have similar preferences (i.e., parameters q, λ, δ^* are the same across all users) and φ(δ_{u,i}; δ^*_u) is linear in δ_{u,i} in (4.6):

U_{u,i} = q × r_{u,i} − λ × |δ_{u,i} − δ^*|.   (4.15)

(1b) Representative agent (homogeneous users) with quadratic distance (Hom-Quad): The users are homogeneous but φ(δ_{u,i}; δ^*_u) is quadratic in δ_{u,i} in (4.6):

U_{u,i} = q × r_{u,i} − λ × (δ_{u,i} − δ^*)^2.   (4.16)

(2a) Heterogeneous users with linear distance (Het-Lin): The users are heterogeneous, have different preferences (i.e., q_u, λ_u, δ^*_u), and φ(δ_{u,i}; δ^*_u) is linear in δ_{u,i} as in (4.7):

U_{u,i} = q_u × r_{u,i} − λ_u × |δ_{u,i} − δ^*_u|.   (4.17)

(2b) Heterogeneous users with quadratic distance (Het-Quad): Users have different preferences and φ(δ_{u,i}; δ^*_u) is quadratic in δ_{u,i}. This case corresponds to function (4.8):

U_{u,i} = q_u × r_{u,i} − λ_u × (δ_{u,i} − δ^*_u)^2.   (4.18)
4.4.2.2 Item Similarity
To generate the set of unexpected recommendations, the system computes the
distance d(i, j) between two items. In the conducted experiments, we use both
collaborative-based and content-based item distance.5 The distance matrix can be
5 Additional similarity measures were tested in [Adamopoulos and Tuzhilin, 2011] with similar results.
easily updated with respect to new ratings as in [Khabbaz et al., 2011] in order to
address potential scalability issues in large scale systems. The complexity of the pro-
posed algorithm can also be reduced by appropriately setting a lower limit in quality
(¯q) as illustrated in Algorithm 4. Other techniques that should also be explored in
future research include user clustering, low rank approximation of unexpectedness
matrix, and partitioning the item space based on product category or subject classi-
fication.
4.4.2.3 Sets of Expected Recommendations
The set of expected recommendations for each user can be precisely specified and
operationalized using various mechanisms that can be applied across various domains
and applications. Such mechanisms are the past transactions performed by the user,
knowledge discovery and data mining techniques (e.g., association rule learning and
user profiling), and experts’ domain knowledge. The mechanisms for specifying sets of
expected recommendations for the users can also be seeded, as and when needed, with
the past transactions as well as implicit and explicit ratings of the users. In order to
test the proposed method under various and diverse sets of expected recommendations
of different cardinalities that have been specified using the mechanisms summarized
in Table 4.1, we consider the following settings.6
1. Expected Movies: We use the following two examples of definitions of expected
movies in our study. The first set of expected movies (E^{(Base)}_u) for user u follows
a very strict definition of expectedness, as defined in Section 4.3.1. The profile of
user u consists of the set of movies that she/he has already rated. In particular,
6 In this experimental study, the expectations of the users were specified in terms of strict boolean identity because of the characteristics of the specific data sets and for the sake of simplicity. As part of the future work, we plan to relax this assumption using the proposed definition and metric of unexpectedness (Eq. 4.14).
movie i is expected for user u if the user has already rated some movie j such
that i has the same title or is an episode or sequel of movie j, where episode
or sequel is identified as explained in Section 4.4.1. These sets of expected
recommendations have on average a cardinality of 517 and 451 for the different
subsets.
The second set of expected movies (E^{(Base+RL)}_u) follows a broader definition of
expectations and is generated based on some set of rules. It includes the first
set plus a number of closely “related” movies (E^{(Base+RL)}_u ⊇ E^{(Base)}_u). In order
to form the second set of expected movies, we also use content-based similarity
between movies. More specifically, two movies are related if at least one of the
following conditions holds: (i) they were produced by the same director, belong
to the same genre, and were released within an interval of 5 years, (ii) the same
set of protagonists appears in both of them (where a protagonist is defined as
an actor with ranking ∈ {1, 2, 3}) and they belong to the same genre, (iii) the
two movies share more than twenty common tags, are in the same language,
and their correlation metric is above a certain threshold θ (Jaccard coefficient
(J) > 0.50), (iv) there is a link from the Wikipedia article for movie i to the
article for movie j and the two movies are sufficiently correlated (J > 0.50) and
(v) the content-based distance metric is below a threshold θ (d < 0.50). The
extended set of expected movies has an average size of 1,127 and 949 items per
user, for the two subsets, respectively.
2. Expected Books: For the BookCrossing data set, we use three different examples
of expected books for our users. The first set of expectations (E^{(Base)}_u) consists
of only the items that user u rated implicitly or explicitly.7 The second set
7Only explicit ratings were used with the baseline rating prediction algorithms.
Table 4.1: Sets of Expected Recommendations for Different Experimental Settings.
Data set       Set of Expected Recommendations   Mechanism           Method
MovieLens      Base                              Past Transactions   Explicit Ratings
MovieLens      Base+RL                           Domain Knowledge    Set of Rules
BookCrossing   Base                              Past Transactions   Implicit Ratings
BookCrossing   Base+RI                           Domain Knowledge    Related Items
BookCrossing   Base+AR                           Data Mining         Association Rules
of expected books (E^{(Base+RI)}_u) includes the first set plus the related or similar books identified by various third-party services as described in Section 4.4.1. These sets of expectations contain on average 1,257, 1,030, and 296 items for the three subsets, respectively. Finally, the third set of expected recommendations (E^{(Base+AR)}_u) is generated using association rule learning. In detail, an item i is expected for user u if i is a consequent of a rule with support of at least 5% and user u has implicitly or explicitly rated all the antecedent items. Because of the nature of this procedure, there is little variation in the set of expectations among the different users and, in general, these sets consist of the most popular items, defined in terms of number of ratings. These sets of expected recommendations have on average a cardinality of 808, 670, and 194 for the different subsets.
4.4.2.4 Distance from the Set of Expectations
After estimating the expectations of user u, we can then define the distance of
item i from the set of expected recommendations Eu in various ways. For example,
it can be determined by averaging the distances between the candidate item i and
all the items included in set Eu. Additionally, we also use the Centroid distance that
is defined as the distance of an item i from the centroid point of the set of expected
recommendations Eu for user u.8
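Both variants can be written down directly, as in this sketch (a generic pairwise item distance d over hypothetical item feature vectors is assumed):

    import numpy as np

    def average_distance(item_vec, expected_vecs, d):
        """Average distance of a candidate item from all items in the set of expectations."""
        return float(np.mean([d(item_vec, e) for e in expected_vecs]))

    def centroid_distance(item_vec, expected_vecs, d):
        """Distance of a candidate item from the centroid of the set of expectations."""
        centroid = np.mean(np.asarray(expected_vecs, dtype=float), axis=0)
        return float(d(item_vec, centroid))

    # Example with a Euclidean pairwise distance on toy two-dimensional item vectors
    euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))
    expected = [[0.0, 0.0], [2.0, 0.0]]
    print(average_distance([1.0, 1.0], expected, euclidean))   # ~1.414
    print(centroid_distance([1.0, 1.0], expected, euclidean))  # 1.0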
4.4.2.5 Utility Estimation
Since the users are restricted to provide ratings on a specific scale, the correspond-
ing item ratings in our data sets are censored from below and above (also known as
censoring from left and right, respectively) [Davidson and MacKinnon, 2004]. Hence,
in order to model the consumer choice, estimate the parameters of interest (i.e., qu
and λu in equations (4.15) - (4.18)), and make predictions within the same scale that
was available to the users, we borrow from the field of economics popular models of
censored multiple linear regressions [McDonald and Moffitt, 1980; Olsen, 1978; Long,
1997],9 also imposing a restriction on these models for non-negative coefficients (i.e., q_u, λ_u ≥ 0) [Greene, 2012; Wooldridge, 2002].
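As a very rough illustration of imposing the non-negativity restriction when fitting the per-user parameters (and only that: the sketch below uses ordinary non-negative least squares and deliberately ignores the censoring of the rating scale that the models cited above handle properly), one could write:

    import numpy as np
    from scipy.optimize import nnls

    def fit_user_parameters(predicted_quality, departures, observed_ratings):
        """Fit non-negative q_u and lambda_u so that q_u*quality - lambda_u*departure approximates the rating.

        A crude stand-in for the censored-regression estimation described in the text:
        it enforces q_u, lambda_u >= 0 but does not model the censoring from below and above.
        """
        X = np.column_stack([predicted_quality, -np.asarray(departures, dtype=float)])
        coefficients, _residual = nnls(X, np.asarray(observed_ratings, dtype=float))
        q_u, lambda_u = coefficients
        return q_u, lambda_u

    # Toy example with hypothetical per-item quality estimates, departures, and observed ratings
    print(fit_user_parameters([3.5, 4.0, 2.5, 4.5], [0.1, 0.4, 0.2, 0.6], [3.4, 3.6, 2.3, 3.9]))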
Furthermore, given the limitations of offline experiments and our data sets, we
use the predicted ratings from the baseline methods as a measure of quality for the
recommended items and the actual ratings of the users as a proxy for the utility of
the recommendations; this, in combination with the choice of utility functions de-
scribed in Section 4.4.2.1, will allow us to study the effect of taking unexpectedness
into consideration, without introducing any other source of variation into our model.
We also used the average distance of rated items from the set of expected recommen-
dations in order to estimate the preferred level of unexpectedness δ∗u for each user and
distance metric; for the case of homogeneous users, we used the average value over
all users. In addition, we did not use the unexpectedness and quality thresholds \underline{δ}, \bar{δ}, and \underline{q}, described in Section 4.3.3, to limit the candidate items for recommendation.

8 The experiments conducted in [Adamopoulos and Tuzhilin, 2011] using the Hausdorff distance (d(i, E_u) = inf{d(i, j) : j ∈ E_u}) indicate inconsistent performance and sometimes under-performed the standard CF methods. Hence, in this work we only conducted experiments using the average and the centroid distance.

9 Multiple linear regression models and generalized linear latent and mixed models estimated by maximum likelihood [Rabe-Hesketh et al., 2002] were also tested with similar results. Shivaswamy et al. [2007] and Khan and Zubek [2008] may also be used for utility estimation.
Besides, we used a holdout validation scheme in all of our experiments with 80/20
splits of data to the training/test part in order to avoid overfitting. Finally, we as-
sume an application scenario where an item can be a candidate for recommendation
to a user if and only if it has not been rated by the specific user; expected items can
be recommended.
4.4.2.6 Metrics of Unexpectedness and Accuracy
To evaluate our approach in terms of unexpectedness, we use the metrics described
in Section 4.3.4.1. Additionally, we further evaluate the recommendation lists using
different (i.e., expanded) sets of expectations, compared to the expectations used for
the utility estimation, based on metrics derived by combining the proposed metrics
with those suggested by Murakami et al. [2008] and Ge et al. [2010]. For the primitive
prediction model (PM) of Ge et al. [2010] in (4.9) we used the top-N items with highest
average rating and the largest number of ratings. For instance, for the experiments
conducted using the main subset of the MovieLens data set, the PM model consists of
the top 200 items with the highest average rating and top 800 items with the greatest
number of ratings; the same ratio was used for all the experiments.
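A possible construction of such a primitive prediction model is sketched below; the pandas column names and the 200/800 split (taken from the MovieLens example above) are assumptions for illustration.

```python
import pandas as pd

def primitive_model(ratings, n_by_avg=200, n_by_count=800):
    """Union of the top items by average rating and the top items by rating count."""
    stats = ratings.groupby("item_id")["rating"].agg(["mean", "count"])
    top_by_avg = set(stats.nlargest(n_by_avg, "mean").index)
    top_by_count = set(stats.nlargest(n_by_count, "count").index)
    return top_by_avg | top_by_count
```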
Besides, we introduce an additional metric of expectedness (EXPECTEDPM) as
the mean ratio of the recommended items that are either included in the set of
expected recommendations for a user or in the primitive prediction model, and are
also included in the generated recommendation list. Correspondingly, we define an
additional metric of unexpectedness (UNEXPECTEDPM) as the mean ratio of the
recommended items that are neither included in expectations nor in the primitive
prediction model, and are included in the generated recommendations:
UNEXPECTED_PM = Σ_u |RS_u \ (E_u ∪ PM)| / |N| .    (4.19)
Based on the ratio of Ge et al. [2010] in (4.10), we also use the metrics UNEXPECTED+
and UNEXPECTED+PM to evaluate serendipitous [Murakami et al., 2008] recommen-
dations in conjunction with the metrics of unexpectedness in (4.12) and (4.19), re-
spectively. To compute these metrics, the usefulness of an item for a user can be
judged by the specific user or approximated by the item’s ratings. For instance,
we consider an item to be useful if its average rating is greater than the mean of
the rating scale. In particular, in the experiments conducted using the ML and BC
data sets, we consider an item to be useful if its average rating is greater than 2.5
(USEFUL = {i : r̄_i > 2.5}) and 5.0, respectively.
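The following sketch illustrates how these metrics could be computed from per-user recommendation lists; the data structures (dictionaries keyed by user, sets of item identifiers) and the use of the recommendation-list length as the denominator are assumptions made for illustration.

```python
def list_metrics(rec_lists, expectations, pm_items, useful_items):
    """Mean per-user EXPECTED_PM, UNEXPECTED_PM (Eq. 4.19), and UNEXPECTED+_PM."""
    exp_pm, unexp_pm, unexp_plus_pm = [], [], []
    for user, recs in rec_lists.items():
        recs = set(recs)
        known = set(expectations.get(user, ())) | pm_items
        unexpected = recs - known
        n = float(len(recs))                 # assumes non-empty recommendation lists
        exp_pm.append(len(recs & known) / n)
        unexp_pm.append(len(unexpected) / n)
        unexp_plus_pm.append(len(unexpected & useful_items) / n)
    m = len(rec_lists)
    return sum(exp_pm) / m, sum(unexp_pm) / m, sum(unexp_plus_pm) / m
```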
Finally, we also evaluate the generated recommendation lists based on the aggre-
gate recommendation diversity, coverage of product base, dispersion of recommenda-
tions, as well as accuracy of rating and item predictions using the metrics discussed
in Section 4.3.4.2.
4.5 Results of Unexpectedness Method
The aim of this study is to demonstrate, through a comparative analysis of our
method and the standard baseline algorithms in different experimental settings, that
the proposed method indeed effectively captures the concept of unexpectedness and
performs well in terms of the classical accuracy metrics.
Given the number of experimental settings (5 subsets based on 2 data sets, 5 sets
of expected items, 6 algorithms for rating prediction, 3 correlation metrics, 2 distance
metrics, 2 utility functions, 2 different assumptions about users preferences, and 13
different lengths of recommendation lists, resulting in 24,336 conducted experiments
in total), the presentation of results constitutes a challenging problem. To give a
“flavor” of the results, instead of plotting individual graphs, a more concise represen-
tation can be obtained by computing the average values of performance for the main
experimental settings (see Section 4.4.2.1) and testing the statistical significance of
the differences in performance, if any. The averages are taken over the six algorithms
for rating prediction, the two correlation metrics, and the two distance metrics, except
as otherwise noted. However, given the diversity of the aforementioned experimental
settings, both the different baselines and the proposed approach may exhibit different
performance in each setting. A reasonable way to compare the results across different
experimental settings is by computing the relative performance differences:
Diff = (Perf_unxp − Perf_bsln) / Perf_bsln,    (4.20)
taken as averages over some experimental settings, where bsln refers to the baseline
methods and unxp to the proposed method for unexpectedness. A positive value
of Diff means that the proposed method outperforms the baseline, and a negative
value the opposite. For each metric, only the most interesting dimensions are discussed.
Using the utility estimation method described in Section 4.4.2.5, the average qu is
1.005 for the experiments conducted on the MovieLens data set. For the experiments
with the first set of expected movies, the average λu is 0.144 for the linear distance
and 0.146 for the quadratic one. For the extended set of expected movies, the average
estimated λu is 0.207 and 1.568, respectively. In the experiments conducted on the
BookCrossing data set, the average qu is 1.003. For the experiments with the first set
of expected books, the average λu is 0.710 for the linear distance and 3.473 for the
quadratic one. For the second and third set of expected items, the average estimated
λu is 0.717 and 3.1240, and 0.576 and 2.218, respectively.
In Section 4.5.1, we compare the proposed method for unexpected recommen-
dations with the standard baseline methods in terms of unexpectedness and
serendipity of recommendation lists. Then, in Sections 4.5.2 and 4.5.3, we study the
effects on rating and item prediction accuracy, respectively. Finally, in Section 4.5.4,
we compare the proposed method with the baseline methods in terms of other popular
metrics, such as catalog coverage, aggregate recommendation diversity, and dispersion
of recommendations.
4.5.1 Comparison of Unexpectedness
In this section, we experimentally demonstrate that the proposed method effec-
tively captures the notion of unexpectedness and, hence, outperforms the standard
baseline methods in terms of unexpectedness. Table 4.2 presents the results ob-
tained by applying our method to the MovieLens (ML) and BookCrossing (BC)
data sets. The values reported are computed using the proposed unexpectedness
metric (4.12) as the average increase in performance over six algorithms for rating
prediction, two distance metrics, different subsets, and three correlation metrics for
recommendation lists of size k ∈ {1, 3, 5, 10, 30, 50, 100}. Besides, Fig. 4.1 presents
the average performance over the same dimensions for recommendation lists of size
k ∈ {1, 3, 5, 10, 20, . . . , 100}. Similar results were also obtained using the additional
metrics described in Section 4.4.2.6. In addition, similar patterns were also observed
when specifying the user expectations using different mechanisms for the training and test
data.
Table 4.2 and Fig. 4.1 demonstrate that the proposed method outperforms the
standard baselines. As we can observe, the increase in performance is larger for
Table 4.2: Unexpectedness Performance for the MovieLens and BookCrossing Data Sets.

Data Set | User Expectations | Experimental Setting | k=1 | k=3 | k=5 | k=10 | k=30 | k=50 | k=100
MovieLens | Base | Homogeneous Linear | 1.90% | 3.57% | 3.93% | 2.30% | 1.74% | 1.51% | 1.08%
MovieLens | Base | Homogeneous Quadratic | 1.81% | 3.33% | 3.63% | 2.40% | 1.77% | 1.58% | 1.16%
MovieLens | Base | Heterogeneous Linear | 1.77% | 2.24% | 2.46% | 1.86% | 1.37% | 1.21% | 0.87%
MovieLens | Base | Heterogeneous Quadratic | 1.61% | 1.99% | 2.21% | 1.68% | 1.27% | 1.13% | 0.84%
MovieLens | Base+RL | Homogeneous Linear | 20.84% | 18.37% | 16.01% | 12.53% | 10.51% | 9.98% | 7.97%
MovieLens | Base+RL | Homogeneous Quadratic | 17.86% | 17.67% | 16.14% | 13.31% | 11.28% | 10.82% | 8.99%
MovieLens | Base+RL | Heterogeneous Linear | 16.14% | 14.82% | 13.28% | 11.06% | 9.22% | 8.90% | 7.46%
MovieLens | Base+RL | Heterogeneous Quadratic | 14.43% | 13.50% | 12.20% | 10.39% | 8.76% | 8.51% | 7.26%
BookCrossing | Base | Homogeneous Linear | 0.89% | 0.90% | 0.84% | 0.84% | 0.79% | 0.77% | 0.73%
BookCrossing | Base | Homogeneous Quadratic | 0.62% | 0.65% | 0.62% | 0.56% | 0.52% | 0.50% | 0.47%
BookCrossing | Base | Heterogeneous Linear | 0.43% | 0.46% | 0.44% | 0.44% | 0.44% | 0.45% | 0.45%
BookCrossing | Base | Heterogeneous Quadratic | 0.39% | 0.42% | 0.40% | 0.40% | 0.41% | 0.41% | 0.41%
BookCrossing | Base+RI | Homogeneous Linear | 182.12% | 152.70% | 146.17% | 131.80% | 114.17% | 104.80% | 90.69%
BookCrossing | Base+RI | Homogeneous Quadratic | 184.29% | 155.78% | 149.89% | 136.12% | 117.89% | 108.54% | 93.88%
BookCrossing | Base+RI | Heterogeneous Linear | 91.03% | 79.54% | 78.75% | 68.62% | 60.64% | 57.82% | 50.74%
BookCrossing | Base+RI | Heterogeneous Quadratic | 84.19% | 73.90% | 73.57% | 63.73% | 56.53% | 54.18% | 47.69%
BookCrossing | Base+AR | Homogeneous Linear | 157.56% | 133.80% | 127.74% | 115.27% | 98.71% | 90.49% | 76.75%
BookCrossing | Base+AR | Homogeneous Quadratic | 158.95% | 136.38% | 130.90% | 118.38% | 101.16% | 92.43% | 78.44%
BookCrossing | Base+AR | Heterogeneous Linear | 79.30% | 70.04% | 69.09% | 59.62% | 51.84% | 49.09% | 42.22%
BookCrossing | Base+AR | Heterogeneous Quadratic | 73.31% | 64.99% | 64.44% | 55.24% | 48.17% | 45.86% | 39.57%
[Figure 4.1 shows, for each of the five panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the unexpectedness (y-axis) against the recommendation list size (x-axis) for the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings.]

Figure 4.1: Unexpectedness performance of different experimental settings for the (a), (b) MovieLens (ML) and (c), (d), (e) BookCrossing (BC) data sets.
[Figure 4.2 shows, for the panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the probability (y-axis) of each level of unexpectedness (x-axis) for the Baseline, Homogeneous, and Heterogeneous settings.]

Figure 4.2: Distribution of Unexpectedness for recommendation lists of size k = 5 and different experimental settings for the MovieLens (ML) and BookCrossing (BC) data sets.
recommendation lists of smaller size k. This, in combination with the observation
that unexpectedness was significantly enhanced also for large values of k, illustrates
that the proposed method both introduces new items in the recommendation lists
and also effectively re-ranks the existing items promoting the unexpected ones. Fig.
4.1 also shows that unexpectedness was enhanced both in cases where the definition
of unexpectedness was strict, as described in Section 4.4.2.3, and thus the baseline
recommendation system methods resulted in high unexpectedness (i.e., Base) and in
cases where the measured unexpectedness of the baselines was low (i.e., Base+RL,
Base+RI, and Base+AR). Similarly, the performance was increased both for the base-
line methods that resulted in high unexpectedness (e.g., Slope One algorithm) in the
conducted experiments and the methods where unexpectedness was low (e.g., Matrix
Factorization method, item-based k-Nearest Neighbors recommendation algorithm).
Additionally, the experiments conducted using the more accurate sets of expectations
based on the information collected from various third-party websites (Base+RI) out-
performed those automatically derived by association rules (Base+AS). Besides, the
increase in performance is larger also in the experiments where the sparsity of the sub-
set of data (see Section 4.4.1) is higher, which is the most realistic scenario in practice.
In particular, for the MovieLens data set, the average unexpectedness of the recom-
mendation lists was increased by 1.62% and 10.83% (17.32% for k = 1) for the (Base)
and (Base+RL) sets of expected movies, respectively. For the BookCrossing data set,
for the (Base) set of expectations the average unexpectedness was increased by 0.55%.
For the (Base+RI) and (Base+AR) sets of expected books, the average improvement
was 135.41% (188.61% for k = 1) and 78.16% (117.28% for k = 1). Unexpected-
ness was increased in 85.43% and 89.14% of the experiments for the MovieLens and
BookCrossing data sets, respectively. Finally, the unexpectedness of the generated
recommendation lists can be further enhanced, as described in Section 4.3.3, using
[Figure 4.3 shows scatter plots of the increase in unexpectedness performance (x-axis) against the cardinality of the set of user expectations (y-axis) for (a) ML - Base+RL and (b) BC - Base+RI.]

Figure 4.3: Increase in Unexpectedness for recommendation lists of size k = 5 for the MovieLens (ML) and BookCrossing (BC) data sets using different sets of expectations.
appropriate thresholds on the unexpectedness of individual items.
A particularly noteworthy observation, as demonstrated through the distribution
of unexpectedness across all the generated recommendation lists for the ML and BC
data sets in Fig. 4.2, is that the higher the cardinality and the better approximated
the sets of users’ expectations are, the greater the improvements against the baseline
methods. In principle, if no expectations are specified, the recommendation results
will be the same as the baseline method. The same pattern can also be observed in
Fig. 4.3 showing the cardinality of the set of user expectations along the vertical axis,
the increase in unexpectedness performance along the horizontal axis, and a linear
fit to the data for recommendation lists of size k = 5.10 This informal notion
of “monotonicity” of expectations is useful in order to achieve the desired levels of
unexpectedness. We believe that this pattern is a general property of the proposed
method, because of the explicit use of users’ expectations and the departure function,
and we plan to explore this topic as part of our future research.
10 We also tried higher-order polynomials, but they do not offer a significantly better fit to the data.
(a) MovieLens data set (b) BookCrossing data set
Figure 4.4: Post hoc analysis for Friedman’s Test of Unexpectedness Performance of different methods for the (a) MovieLens and (b) BookCrossing data sets.
To determine statistical significance, we have tested the null hypothesis that the
performance of each of the five lines of the graphs in Fig. 4.1 is the same, using the
Friedman test (nonparametric repeated-measures ANOVA) [Berry and Linoff, 1997]
and we reject the null hypothesis with p < 0.0001. Performing post hoc analysis on
Friedman’s Test results for the ML data set, the differences between the Baseline and
each one of the experimental settings, apart from the difference between the Baseline
and Heterogeneous Quadratic, are statistically significant.
tween Homogeneous Quadratic and Heterogeneous Linear, Homogeneous Linear and
Heterogeneous Quadratic, and Homogeneous Quadratic and Heterogeneous Quadratic
are statistically significant, as well. For the BC data set, the difference between the
Baseline and each one of the experimental settings is also statistically significant
with p < 0.0001. Moreover, the differences among Homogeneous Linear, Homoge-
neous Quadratic, Heterogeneous Linear, and Heterogeneous Quadratic, apart from
the difference between Homogeneous Linear and Homogeneous Quadratic, are also
statistically significant. Fig. 4.4 presents the box-and-whisker diagrams [Benjamini,
1988] displaying the aforementioned differences among the various methods.
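For reference, a minimal sketch of this testing procedure is given below; the thesis does not state which post hoc procedure was applied, so the Nemenyi test from the third-party scikit-posthocs package is shown here as one common choice, and the performance matrix is a random placeholder rather than the actual experimental results.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp    # third-party package, assumed available

# Rows are blocks (e.g., combinations of data subset and list size); columns are
# the five compared settings: Baseline, Hom-Lin, Hom-Quad, Het-Lin, Het-Quad.
perf = np.random.rand(30, 5)    # placeholder performance measurements

stat, p_value = friedmanchisquare(*(perf[:, j] for j in range(perf.shape[1])))
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4g}")

# Pairwise post hoc comparisons (Nemenyi) over the same blocked measurements.
print(sp.posthoc_nemenyi_friedman(perf))
```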
4.5.1.1 Qualitative Comparison of Unexpectedness
The proposed approach avoids obvious recommendations such as recommending
to a user the movies “The Lord of the Rings: The Return of the King”, “The Bourne
Identity”, and “The Dark Knight” because the user had already highly rated all the
sequels or prequels of these movies. Besides, the proposed method provides recom-
mendations from a wider range of items and does not focus mostly on bestsellers as
described in Section 4.5.4. In addition, even though the proposed method generates
truly unexpected recommendations, these recommendations are not irrelevant and
they still provide a fair match to the user’s interests.
Using the MovieLens data set and the (Base) sets of expected recommendations,
the baseline methods recommend to a user, who highly rates very popular Action,
Adventure, and Drama films, the movies “The Lord of the Rings: The Two Towers”,
“The Dark Knight”, and “The Lord of the Rings: The Return of the King” (user
id = 36803 with Matrix Factorization). However, this user has already highly rated
prequels or sequels of these movies (i.e., “The Lord of the Rings: The Fellowship
of the Ring” and “Batman Begins”) and, hence, the aforementioned popular recom-
mendations are expected for this specific user. On the other hand, for the same user,
the proposed method generated the following recommendations: “The Pianist”, “La
vita e bella”, and “Rear Window”. These movies are of high quality, unexpected,
and not irrelevant since they fairly match the user’s interests. In particular, based
on the definitions and mechanisms used to specify the user expectations as described
in Section 4.4.2.3, all these interesting movies are unexpected for the user since they
significantly depart from her/his expectations. Additionally, they are of great quality
in terms of the average rating, even though less popular in terms of the number of
ratings. Besides, these Biography, Drama, Romance, and Mystery movies are not
irrelevant to the user and they fairly match the user’s profile since they involve ele-
ments in their plot, such as war, that can also be found in other films that she/he has
already highly rated such as “Erin Brockovich”, “October Sky”, and “Three Kings”.
Finally, interestingly enough, some of these high quality, interesting, and unexpected
recommendations are also based on films by the same director as a film the user rated
highly (i.e., “Pinocchio” and “La vita e bella”).
Using the BookCrossing data set and the (Base+RI) set of expectations described
in Section 4.4.2.3, the baseline methods recommend to a user, who has already rated
a very large number of items, the following expected books: “I Know This Much Is
True”, “Outlander”, and “The Catcher in the Rye” (user id = 153662 with Matrix
Factorization). In particular, the book “I Know This Much Is True” is highly ex-
pected because the specific user has already rated and she/he is familiar with the
books “A Tangled Web”, “A Virtuous Woman”, “Thursday’s Child”, and “Drowning
Ruth”. Similarly, the book “Outlander” is expected because of the books “Dragonfly
in Amber”, “Enslaved”, “When Lightning Strikes”, “Touch of Enchantment”, and
“Thorn in My Heart”. Finally, the recommendation about the item “The Catcher
in the Rye” is expected since the user has highly rated the books “Forever: A Novel
of Good and Evil, Love and Hope”, “Fahrenheit 451”, and “Dream Country”. In
summary, all of the aforementioned recommendations are expected for the user be-
cause the recommended items are very similar to other books, which the user has
already highly rated, from the same authors that were published around the same
time (e.g., “I Know This Much Is True” and “A Virtuous Woman”, or “Outlander”
and “Dragonfly in Amber”, etc.), frequently bought together on popular websites
such as Amazon.com [Amazon, 2012] and LibraryThing [LibraryThing, 2012] (e.g., “I
Know This Much Is True” and “Drowning Ruth”, etc.), with similar library subjects,
plots and classifications (e.g., “The Catcher in the Rye” and “Dream Country”, etc.),
with similar tags (e.g., “The Catcher in the Rye” and “Forever: A Novel of Good and
Evil, Love and Hope”), etc. In spite of that, the proposed algorithm recommends
to the user the following books that significantly depart from her/his expectations:
“Doing Good”, “The Reader”, and “Tuesdays with Morrie: An Old Man, a Young
Man, and Life’s Greatest Lesson”. These high quality and interesting recommenda-
tions, even though unexpected to the user, are not irrelevant since they provide
a fair match to the user’s interests, as she/he has already highly rated books that
deal with relevant issues such as family, romance, life, and memoirs.
4.5.1.2 Comparison of Serendipity
Pertaining to the notion of serendipity as defined in [Ge et al., 2010], the results
are very similar to those obtained using the proposed measures of unexpectedness
and demonstrate that the proposed method outperforms the standard baselines in
most of the experimental settings.
In summary, we demonstrated in this section that the proposed method for unex-
pected recommendations effectively captures the notion of unexpectedness by providing
the users with interesting and unexpected recommendations of high quality that fairly
match their interests and, hence, outperforms the standard baseline methods in terms
of the proposed unexpectedness metrics.
4.5.2 Comparison of Rating Prediction
In this section, we examine how the proposed method for unexpected recommen-
dations compares with the standard baseline methods in terms of the classical rating
prediction accuracy-based metrics, such as RMSE and MAE. In typical offline experi-
ments, such as those presented here, the data is not collected using the recommender system
or method under evaluation. In particular, the observations in our test sets were not
based on unexpected recommendations generated from the proposed method.11 Also,
the user ratings had been submitted over a long period of time representing the tastes
of the users and their expectations of the recommender system at that specific point in
time that they rated each item. Therefore, in order to effectively evaluate the rating
and item prediction accuracy of our method, when we compute the unexpectedness
of item i for user u (see Section 4.3.3), we treat item i as not being included in the
set of expectations Eu for user u – whether it is included or not – and we compute the
distance of item i from the rest of the items in the set of expectations Eu^{-i}, where
Eu^{-i} := Eu \ {i}, to generate the corresponding prediction r_{u,i} (i.e., the estimated
utility of recommending the candidate item i to the target user u).
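A small sketch of this leave-one-out treatment is shown below; the item-vector representation and the distance_fn argument (for example, the average or centroid distance sketched earlier) are assumptions for illustration.

```python
import numpy as np

def departure_excluding_item(item_id, expectations_u, item_vectors, distance_fn):
    """Distance of item i from the expectation set with item i itself removed."""
    remaining = [j for j in expectations_u if j != item_id]
    if not remaining:
        return 0.0                          # no expectations left: no departure
    expected_vecs = np.stack([item_vectors[j] for j in remaining])
    return distance_fn(item_vectors[item_id], expected_vecs)
```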
Table 4.3 presents the results obtained by applying our method to the ML and
BC data sets using the different sets of expectations and baseline predictive methods.
The values reported are computed as the difference in average performance over the
different subsets, the different utility functions, two distance metrics, and three corre-
lation metrics. In Fig. 4.5, the bars labeled as Baseline represent performance of the
standard baseline methods. The bars labeled as Homogeneous Linear, Homogeneous
Quadratic, Heterogeneous Linear, and Heterogeneous Quadratic present the average
performance over the different subsets and sets of expectations, two distance met-
rics, and three correlation metrics, for the different experimental settings described
in Section 4.4.2.1. All the bars have been grouped by baseline algorithm (x-axis).
11 For instance, the assumption that unused items would not have been used even if they had been recommended is erroneous when evaluating unexpected recommendations (i.e., a user may not have used an item because she/he was unaware of its existence, but after the recommendation exposes that item the user can decide to select it [Shani and Gunawardana, 2011]).
Table 4.3: Average RMSE Performance for the MovieLens and BookCrossing Data Sets.

Data Set | Rating Prediction Algorithm | Expectations Set | Baseline | Homogeneous Linear | Homogeneous Quadratic | Heterogeneous Linear | Heterogeneous Quadratic
MovieLens | Matrix Factorization | Base | 0.7892 | 0.11% | 0.13% | 0.07% | 0.12%
MovieLens | Matrix Factorization | Base+RL | 0.7892 | 0.12% | 0.13% | 0.07% | 0.12%
MovieLens | Slope One | Base | 0.8242 | 0.29% | 0.29% | 0.43% | 0.43%
MovieLens | Slope One | Base+RL | 0.8242 | 0.29% | 0.29% | 0.43% | 0.42%
MovieLens | Item kNN | Base | 0.8093 | -0.01% | -0.01% | 0.00% | 0.01%
MovieLens | Item kNN | Base+RL | 0.8093 | -0.01% | -0.01% | 0.01% | 0.02%
MovieLens | User kNN | Base | 0.8160 | 0.01% | 0.01% | 0.03% | 0.04%
MovieLens | User kNN | Base+RL | 0.8160 | 0.01% | 0.01% | 0.03% | 0.04%
MovieLens | User Item Baseline | Base | 0.8256 | 0.01% | 0.00% | 0.04% | 0.05%
MovieLens | User Item Baseline | Base+RL | 0.8256 | 0.01% | 0.01% | 0.06% | 0.05%
MovieLens | Item Average | Base | 0.8932 | 0.01% | 0.00% | 1.26% | 1.52%
MovieLens | Item Average | Base+RL | 0.8932 | 0.02% | 0.01% | 1.29% | 1.57%
BookCrossing | Matrix Factorization | Base | 1.7882 | 0.28% | 0.35% | -0.35% | 0.02%
BookCrossing | Matrix Factorization | Base+RI | 1.7882 | 0.05% | -0.14% | -0.42% | 0.01%
BookCrossing | Matrix Factorization | Base+AS | 1.7882 | 0.01% | -0.14% | -0.46% | -0.01%
BookCrossing | Slope One | Base | 1.8585 | 3.43% | 3.52% | 2.58% | 3.12%
BookCrossing | Slope One | Base+RI | 1.8585 | 3.15% | 3.01% | 2.32% | 2.79%
BookCrossing | Slope One | Base+AS | 1.8585 | 3.21% | 3.04% | 2.37% | 2.91%
BookCrossing | Item kNN | Base | 1.6248 | 1.46% | 1.45% | -1.21% | -0.23%
BookCrossing | Item kNN | Base+RI | 1.6248 | 1.43% | 1.02% | -1.44% | -0.59%
BookCrossing | Item kNN | Base+AS | 1.6248 | 1.48% | 1.02% | -1.52% | -0.54%
BookCrossing | User kNN | Base | 1.7280 | 1.41% | 1.19% | -0.41% | 0.25%
BookCrossing | User kNN | Base+RI | 1.7280 | 1.44% | 0.99% | -0.66% | -0.02%
BookCrossing | User kNN | Base+AS | 1.7280 | 1.46% | 1.01% | -0.60% | 0.10%
BookCrossing | User Item Baseline | Base | 1.5779 | 2.48% | 2.34% | 0.21% | 0.99%
BookCrossing | User Item Baseline | Base+RI | 1.5779 | 1.93% | 1.77% | -0.14% | 0.68%
BookCrossing | User Item Baseline | Base+AS | 1.5779 | 1.98% | 1.78% | -0.14% | 0.71%
BookCrossing | Item Average | Base | 1.7615 | 0.07% | -0.10% | -0.17% | 0.50%
BookCrossing | Item Average | Base+RI | 1.7615 | -0.04% | -0.32% | -0.28% | 0.56%
BookCrossing | Item Average | Base+AS | 1.7615 | 0.01% | -0.41% | -0.35% | 0.50%
[Figure 4.5 shows, for (a) ML and (b) BC, the RMSE (y-axis) of the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings grouped by rating prediction algorithm (Matrix Factorization, Slope One, Item kNN, User kNN, User Item Baseline, and Item Average).]

Figure 4.5: RMSE performance for the (a) MovieLens and (b) BookCrossing data sets.
In the aforementioned tables and figures, we observe that the proposed method
performs at least as well as the standard baseline methods in most of the experimental
settings. In particular, for the ML data set the RMSE was on average reduced by
0.07% and 0.34% for the cases of the homogeneous and heterogeneous users. For the
BC data set, the RMSE was improved by 1.30% and 0.31%, respectively. The overall
minimum average RMSE achieved was 0.7848 for the ML and 1.5018 for the BC data
set.
Using the Friedman test, we have tested the null hypothesis that the performance
of each of the five lines of the graphs in Fig. 4.5 is the same; we reject the null
hypothesis with p < 0.001. Performing post hoc analysis on Friedman’s Test results,
for the ML data set only the difference between the Heterogeneous Quadratic and
Baseline is statistically significant for the RMSE accuracy metric. For the BC data
set, the differences between the Homogeneous Linear and Baseline, and Homogeneous
Quadratic and Baseline are statistically significant, as well. Fig. 4.6 presents the box-
and-whisker diagrams displaying the aforementioned differences among the various
methods.
(a) ML - RMSE (b) BC - RMSE
Figure 4.6: Post hoc analysis for Friedman’s Test of Accuracy Performance of different methods for the (a) MovieLens (ML) and (b) BookCrossing (BC) data sets.
In summary, we demonstrated in this section that the proposed method performs
at least as well as, and in some cases even better than, the standard baseline methods
in terms of the classical rating prediction accuracy-based metrics.
4.5.3 Comparison of Item Prediction
The goal in this section is to compare our method with the standard baseline
methods in terms of traditional metrics for item prediction, such as precision, recall,
and F1 score. Table 4.4 presents the results obtained by applying our method to
the MovieLens and BookCrossing data sets. The values reported are computed as
the difference in average performance over the different subsets, six algorithms for
rating prediction, two distance metrics, and three correlation metrics using the F1
score for recommendation lists of size k ∈ {1, 3, 5, 10, 30, 50, 100}. Respectively, Fig.
4.7 illustrates the average performance over the same dimensions for lists of size
k ∈ {1, 3, 5, 10, 20, . . . , 100}.
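For completeness, a minimal per-user computation of these item prediction metrics might look as follows; treating the held-out (e.g., highly rated) items of each user as the relevant set is an assumption for illustration.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Item prediction metrics for one user's top-k recommendation list."""
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```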
Table 4.4: F1 Performance for the MovieLens and BookCrossing Data Sets.

Data Set | User Expectations | Experimental Setting | k=1 | k=3 | k=5 | k=10 | k=30 | k=50 | k=100
MovieLens | Base | Homogeneous Linear | 5.00% | 4.29% | 7.10% | 9.54% | 8.15% | 6.17% | 5.57%
MovieLens | Base | Homogeneous Quadratic | 4.00% | 4.87% | 5.63% | 6.68% | 5.35% | 4.10% | 3.36%
MovieLens | Base | Heterogeneous Linear | 5.00% | 10.92% | 13.67% | 17.78% | 15.63% | 14.81% | 15.29%
MovieLens | Base | Heterogeneous Quadratic | 7.50% | 12.09% | 14.61% | 17.78% | 15.50% | 14.09% | 14.07%
MovieLens | Base+RL | Homogeneous Linear | 3.00% | 4.48% | 7.37% | 10.15% | 8.78% | 6.64% | 6.33%
MovieLens | Base+RL | Homogeneous Quadratic | 4.50% | 5.46% | 6.70% | 7.98% | 6.55% | 5.14% | 4.37%
MovieLens | Base+RL | Heterogeneous Linear | 4.00% | 10.33% | 12.87% | 16.39% | 14.57% | 13.81% | 14.80%
MovieLens | Base+RL | Heterogeneous Quadratic | 4.50% | 11.11% | 13.00% | 15.96% | 14.08% | 12.88% | 13.33%
BookCrossing | Base | Homogeneous Linear | 23.08% | 9.84% | 7.41% | 1.90% | 2.45% | 1.83% | 1.02%
BookCrossing | Base | Homogeneous Quadratic | 23.08% | 10.66% | 8.33% | 4.05% | 3.06% | 2.03% | 1.23%
BookCrossing | Base | Heterogeneous Linear | 12.50% | 6.56% | 9.26% | 4.29% | 2.24% | 2.43% | 1.84%
BookCrossing | Base | Heterogeneous Quadratic | 11.54% | 6.56% | 7.10% | 3.57% | 1.84% | 1.42% | 1.02%
BookCrossing | Base+RI | Homogeneous Linear | 29.81% | 13.52% | 8.02% | 2.14% | 2.65% | 2.23% | 2.04%
BookCrossing | Base+RI | Homogeneous Quadratic | 25.96% | 13.52% | 8.95% | 3.57% | 3.67% | 2.64% | 2.25%
BookCrossing | Base+RI | Heterogeneous Linear | 13.46% | 7.38% | 8.33% | 3.10% | 2.24% | 2.64% | 1.64%
BookCrossing | Base+RI | Heterogeneous Quadratic | 14.42% | 6.56% | 7.10% | 3.33% | 1.63% | 1.22% | 0.82%
BookCrossing | Base+AR | Homogeneous Linear | 22.12% | 6.15% | 4.32% | -0.48% | 1.02% | 0.81% | 1.02%
BookCrossing | Base+AR | Homogeneous Quadratic | 22.12% | 7.38% | 5.56% | 1.19% | 1.84% | 1.22% | 1.23%
BookCrossing | Base+AR | Heterogeneous Linear | 8.65% | 2.05% | 4.63% | 0.71% | 0.20% | 0.81% | 0.20%
BookCrossing | Base+AR | Heterogeneous Quadratic | 12.50% | 5.74% | 6.17% | 2.86% | 1.02% | 1.01% | 0.61%
In particular, for the MovieLens data set and the case of the homogeneous users
F1 score was improved by 6.14%, on average. In the case of heterogeneous customers
performance was increased by 13.90%. For the BookCrossing data set, in the case
of homogeneous users, F1 score was on average enhanced by 4.85% and, for hetero-
geneous users, by 3.16%. Table 4.4 shows that performance was increased both in
cases where the definition of unexpectedness was strict (i.e., Base) and in cases where
the definition was broader (i.e., Base+RL, Base+RI, and Base+AR). Additionally,
the experiments conducted using the more accurate sets of expectations based on
the information collected from various third-party websites (Base+RI) outperformed
those using the expected sets automatically derived by association rules (Base+AS).
To determine statistical significance, we have tested the null hypothesis that the
performance of each of the five lines of the graphs in Fig. 4.7 is the same using the
Friedman test. Based on the results we reject the null hypothesis with p < 0.0001.
Performing post hoc analysis on Friedman’s Test results for the ML data set, the
differences between the Baseline and each one of the experimental settings are sta-
tistically significant for the F1 score. For the BC data set, the differences between
the Baseline and each one of the experimental settings are also statistically signif-
icant.12 Even though the lines are very close to each other and the differences in
performance in absolute values are not large (e.g., Fig. 4.7e), the results are statisti-
cally significant since the performance of the proposed method is ranked consistently
higher than the baselines (lines do not cross). Fig. 4.8 presents the box-and-whisker
diagrams displaying the aforementioned differences among the various methods.
In conclusion, we demonstrated in this section that the proposed method for
unexpected recommendations performs at least as well as, and in some cases even
12 In the experiments conducted using the MovieLens data set, the difference between Homogeneous Quadratic and Baseline is statistically significant with p < 0.01.
[Figure 4.7 shows, for the panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the F1 score (y-axis) against the recommendation list size (x-axis) for the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings.]

Figure 4.7: F1 performance of different experimental settings for the (a), (b) MovieLens (ML) and (c), (d), (e) BookCrossing (BC) data sets.
(a) MovieLens data set (b) BookCrossing data set
Figure 4.8: Post hoc analysis for Friedman’s Test of F1 Performance of different methods for the (a) MovieLens and (b) BookCrossing data sets.
better than, the standard baseline methods in terms of the classical item prediction
metrics.
4.5.4 Comparison of Diversity and Dispersion
In this section we investigate the effect of the proposed method for unexpected
recommendations on coverage, aggregate diversity, and dispersion, three important
metrics for RSes [Ge et al., 2010; Adomavicius and Kwon, 2012; Shani and Gunawar-
dana, 2011].13 The results obtained using the catalog coverage metric [Herlocker et al.,
2004; Ge et al., 2010] (i.e., the percentage of items in the catalog that are ever recom-
mended to users: |∪_{u∈U} RS_u| / |I|) are very similar to those using the diversity-in-top-
N metric for aggregate diversity [Adomavicius and Kwon, 2011, 2012]; henceforth,
only results on coverage are presented. Table 4.5 presents the results obtained by
13 High unexpectedness of recommendation lists does not imply high coverage and diversity. For example, if the system recommends to all users the same k best unexpected items from the product base, the recommendation list for each user is unexpected, but only k distinct items are recommended to all users.
applying our method to the MovieLens and BookCrossing data sets. The values re-
ported are computed as the average catalog coverage over the different subsets, six
algorithms for rating prediction, two distance metrics, and three correlation met-
rics for recommendation lists of size k ∈ {1, 3, 5, 10, 30, 50, 100}. Fig. 4.9 presents
the average performance over the same dimensions for recommendation lists of size
k ∈ {1, 3, 5, 10, 20, . . . , 100}.
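For reference, the catalog coverage metric quoted above can be computed with a few lines of Python; the dictionary-of-lists input format is an assumption for illustration.

```python
def catalog_coverage(rec_lists, num_items):
    """Share of the item catalog that is recommended to at least one user."""
    recommended = set().union(*rec_lists.values()) if rec_lists else set()
    return len(recommended) / num_items
```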
As Table 4.5 and Fig. 4.9 demonstrate, the proposed method outperforms the
standard baselines in most of the experimental settings. As we can see, the experi-
ments conducted under the assumption of heterogeneous users exhibit higher catalog
coverage than those using a representative agent. This is an interesting result that
can be useful in practice, especially in settings with potential adverse effects of over-
recommending an item or very large catalogs. For instance, it would be profitable
for Netflix if the recommender system encouraged users to rent “long-tail” movies
because they are less costly to license and acquire from distributors than new-release
or highly popular movies of big studios [Goldstein and Goldstein, 2006]. Also, we can
observe that the smaller the size of the recommendation list, the greater the increase
in performance. In particular, as we see in Table 4.5, for the MovieLens data set the
average coverage was increased by 19.48% (39.10% for k = 1) and 37.40% (58.39%
for k = 1) for the cases of the homogeneous and heterogeneous users, respectively.
For the BookCrossing data set, in the case of homogeneous customers coverage was
improved by 9.26% (39.00% for k = 1) and for heterogeneous customers by 23.17%
(59.62% for k = 1), on average. Besides, the increase in performance is larger also
in the experiments where the sparsity of the subset of data is higher. In general,
coverage was increased in 95.68% (max = 55.74%) and 91.57% (max = 100%) of the
experiments for the MovieLens and BookCrossing data sets, respectively.
In terms of statistical significance, with the Friedman test, we have rejected the
Table 4.5: Coverage Performance for the MovieLens and BookCrossing Data Sets.

Data Set | User Expectations | Experimental Setting | k=1 | k=3 | k=5 | k=10 | k=30 | k=50 | k=100
MovieLens | Base | Homogeneous Linear | 38.58% | 37.05% | 35.15% | 28.35% | 16.27% | 12.38% | 7.70%
MovieLens | Base | Homogeneous Quadratic | 38.41% | 36.48% | 34.65% | 28.32% | 16.62% | 12.47% | 7.77%
MovieLens | Base | Heterogeneous Linear | 58.33% | 56.29% | 55.56% | 48.75% | 34.71% | 30.49% | 27.12%
MovieLens | Base | Heterogeneous Quadratic | 52.64% | 50.99% | 49.55% | 42.21% | 28.15% | 23.38% | 19.12%
MovieLens | Base+RL | Homogeneous Linear | 40.00% | 37.41% | 35.91% | 28.93% | 16.88% | 13.11% | 8.82%
MovieLens | Base+RL | Homogeneous Quadratic | 39.41% | 37.01% | 35.28% | 28.65% | 17.04% | 13.32% | 9.38%
MovieLens | Base+RL | Heterogeneous Linear | 63.43% | 62.77% | 61.29% | 53.80% | 39.09% | 34.62% | 30.60%
MovieLens | Base+RL | Heterogeneous Quadratic | 59.16% | 57.61% | 56.31% | 48.77% | 34.67% | 29.81% | 25.71%
BookCrossing | Base | Homogeneous Linear | 46.55% | 30.27% | 21.69% | 12.84% | 5.66% | 4.09% | 2.97%
BookCrossing | Base | Homogeneous Quadratic | 46.16% | 29.79% | 21.33% | 12.72% | 5.56% | 4.06% | 2.90%
BookCrossing | Base | Heterogeneous Linear | 56.77% | 40.50% | 31.45% | 22.71% | 16.96% | 17.67% | 20.31%
BookCrossing | Base | Heterogeneous Quadratic | 52.54% | 35.67% | 26.34% | 16.54% | 8.68% | 7.68% | 7.78%
BookCrossing | Base+RI | Homogeneous Linear | 36.60% | 23.92% | 17.31% | 10.84% | 5.19% | 4.67% | 5.52%
BookCrossing | Base+RI | Homogeneous Quadratic | 35.42% | 22.78% | 16.15% | 9.43% | 3.51% | 2.94% | 4.24%
BookCrossing | Base+RI | Heterogeneous Linear | 65.11% | 48.12% | 38.85% | 29.81% | 22.75% | 22.11% | 22.20%
BookCrossing | Base+RI | Heterogeneous Quadratic | 60.61% | 43.07% | 33.55% | 23.63% | 15.32% | 13.92% | 14.34%
BookCrossing | Base+AR | Homogeneous Linear | 35.26% | 21.74% | 15.19% | 8.80% | 2.84% | 1.97% | 1.36%
BookCrossing | Base+AR | Homogeneous Quadratic | 34.04% | 20.43% | 13.86% | 7.31% | 0.76% | -0.48% | -1.59%
BookCrossing | Base+AR | Heterogeneous Linear | 63.52% | 46.43% | 37.12% | 27.70% | 20.53% | 19.96% | 20.29%
BookCrossing | Base+AR | Heterogeneous Quadratic | 59.19% | 41.13% | 31.52% | 21.35% | 12.26% | 10.47% | 9.62%
[Figure 4.9 shows, for the panels (a) ML - Base, (b) ML - Base+RL, (c) BC - Base, (d) BC - Base+RI, and (e) BC - Base+AR, the catalog coverage (y-axis) against the recommendation list size (x-axis) for the Baseline, Hom-Lin, Hom-Quad, Het-Lin, and Het-Quad settings.]

Figure 4.9: Coverage performance of different experimental settings for the (a), (b) MovieLens (ML) and (c), (d), (e) BookCrossing (BC) data sets.
(a) MovieLens data set (b) BookCrossing data set
Figure 4.10: Post hoc analysis for Friedman’s Test of Coverage Performance of different methods for the (a) MovieLens and (b) BookCrossing data sets.
[Figure 4.11 shows, for (a) ML - Base+RL and (b) BC - Base+RI, the cumulative percentage of recommendations (y-axis) received by the bottom cumulative percentage of items (x-axis) for the Baseline and Unexpectedness methods, together with the line of perfect equality.]

Figure 4.11: Lorenz curves for recommendation lists of size k = 5 for the (a) MovieLens (ML) and (b) BookCrossing (BC) data sets.
null hypothesis (p < 0.0001) that the performance of each of the five lines of the graphs
in Fig. 4.9 is the same. Performing post hoc analysis on Friedman’s Test results, for
both the data sets the difference between the Baseline and each of the remaining
experimental settings is statistically significant (p < 0.001). Fig. 4.10 presents the
box-and-whisker diagrams displaying the aforementioned differences among the dif-
ferent methods.
The derived recommendation lists can also be evaluated for the inequality across
items, the dispersion of recommendations, using the Gini coefficient [Gini, 1909], the
Hoover (Robin Hood) index [Hoover, 1985], or the Lorenz curve [Lorenz, 1905]. In
particular, Fig. 4.11 uses the Lorenz curve to graphically represent the cumulative dis-
tribution function of the empirical probability distribution of recommendations; it is
a graph showing for the bottom x% of items, what percentage y% of the total recom-
mendations they have. As we can conclude from Fig. 4.11, in the recommendation lists
generated from the proposed method, the number of times an item is recommended
is more equally distributed compared to the baseline methods. Such systems provide
recommendations from a wider range of items and do not focus mostly on bestsellers,
which users are often capable of discovering by themselves. Hence, they are benefi-
cial for both users and some organizations [Brynjolfsson et al., 2003, 2011; Goldstein
and Goldstein, 2006]. Finally, the difference in increase in performance between Figs.
4.11a and 4.11b, 0.98% and 7.17% respectively in terms of the Hoover index, could
be attributed to both idiosyncrasies of the two data sets and the differences in defini-
tions and cardinalities of the sets of expected recommendations discussed in Section
4.4.2.3.
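A minimal sketch of these dispersion measures is given below; it assumes rec_counts contains one entry per catalog item (including zeros for items that are never recommended), which is an assumption for illustration.

```python
import numpy as np

def lorenz_curve(rec_counts):
    """Cumulative share of recommendations received by the bottom x% of items."""
    counts = np.sort(np.asarray(rec_counts, dtype=float))
    return np.insert(np.cumsum(counts) / counts.sum(), 0, 0.0)

def gini(rec_counts):
    """Gini coefficient of the distribution of recommendations over items."""
    counts = np.sort(np.asarray(rec_counts, dtype=float))
    n = counts.size
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * counts) / (n * counts.sum()) - (n + 1.0) / n

def hoover(rec_counts):
    """Hoover (Robin Hood) index: share of recommendations to redistribute for equality."""
    counts = np.asarray(rec_counts, dtype=float)
    shares = counts / counts.sum()
    return 0.5 * np.abs(shares - 1.0 / counts.size).sum()
```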
In summary, we demonstrated in this section that the proposed method for unex-
pected recommendations outperforms the standard baseline methods in terms of the
classical catalog coverage measure, aggregate recommendation diversity, and disper-
sion of recommendations.
4.6 Discussion of Unexpectedness
In this chapter, we proposed a method to improve user satisfaction by generating
unexpected recommendations based on the utility theory of economics. In particular,
we proposed and studied a new concept of unexpected recommendations as recom-
mending to a user those items that depart from what the specific user expects from
the recommender system. We defined and formalized the concept of unexpectedness
and discussed how it differs from the related notions of novelty, serendipity, and diver-
sity. Besides, we suggested several mechanisms for specifying the users’ expectations
and proposed specific performance metrics to measure the unexpectedness of recom-
mendation lists. After formally defining and formulating theoretically this concept,
we operationalized the notion of unexpectedness and presented a method for provid-
ing unexpected recommendations of high quality that are hard to discover but fairly
match user interests.
Moreover, we compared the generated unexpected recommendations with popular
baseline methods using the proposed performance metrics of unexpectedness. Our
experimental results demonstrate that the proposed method improves performance
in terms of unexpectedness while maintaining the same or higher levels of accuracy
of recommendations. Besides, we showed that the proposed method for unexpected
recommendations also improves performance based on other important metrics, such
as catalog coverage, aggregate diversity, and dispersion of recommendations. More
specifically, using different “real-world” data sets, various examples of sets of expected
recommendations, and different utility functions and distance metrics, we were able
to test the proposed method under a large number of experimental settings including
various levels of sparsity, different mechanisms for specifying users’ expectations, and
different cardinalities of these sets of expectations. As discussed in Section 4.5, all
the examined variations of the proposed method, including homogeneous and hetero-
geneous users with different departure functions, significantly outperformed in terms
of unexpectedness the standard baseline algorithms, including item-based and user-
based k-Nearest Neighbors, Slope One [Lemire and Maclachlan, 2007], and Matrix
Factorization [Koren et al., 2009]. This demonstrates that the proposed method in-
deed effectively captures the concept of unexpectedness since, in principle, it should
do better than unexpectedness-agnostic methods such as the classical Collaborative
Filtering approach. Furthermore, the proposed unexpected recommendation method
performed at least as well as, and in some cases even better than, the baseline al-
gorithms in terms of the classical accuracy-based measures, such as RMSE and F1
score.
One of the main premises of the proposed method is that users’ expectations
should be explicitly considered in order to provide the users with unexpected recom-
mendations of high quality that are hard to discover but fairly match their interests.
If no expectations are specified, the recommendation results will not differ from those
of the standard rating prediction algorithms in recommender systems. Hence, the
greatest improvements both in terms of unexpectedness and accuracy vis-a-vis all
other approaches were observed in the experiments using the sets of expectations
exhibiting larger cardinality (Base+RL, Base+RI, and Base+AS). These sets of ex-
pected recommendations allowed us to better approximate the expectations of each
user through a non-restricting but more realistic and natural definition of “expected”
items using the particular characteristics of the selected data sets (see Section 4.4.1).
Additionally, the experiments conducted using the more accurate sets of expectations
based on the information collected from various third-party websites (Base+RI) out-
performed those using the expected sets automatically derived by association rules
(Base+AS). Also, the fact that the proposed method delivers unexpected recom-
mendations of high quality is depicted in the small differences between the proposed
metric of unexpectedness (Eq. 4.12) and the adapted metric of serendipity (Eq. 4.13).
The assumption of heterogeneous users allowed for better approximation of users’
preferences at the individual level, while the extended set of expected movies allowed
us to better approximate the expectations of each user through a more realistic and
natural definition of closely “related” items.
Moreover, the standard example of a utility function that was provided in Sec-
tion 4.3.2 illustrates that the proposed method can be easily used in existing recom-
mender systems as a new component that enhances unexpectedness of recommenda-
tions, without the need to modify the current rating prediction procedures. Further,
since the proposed method is not specific to the examples of utility functions and sets
of expected recommendations that were provided in this work, we suggest adapting
the proposed method to the particular recommendation applications, by experiment-
ing with different utility functions, estimation procedures, and sets of expectations,
exploiting the domain knowledge.
As a part of the future work, we would like to conduct live experiments with real
users for evaluating unexpected recommendations and analyze both qualitative and
quantitative aspects in a traditional on-line retail setting as well as in a platform for
massive open on-line courses [Adamopoulos, 2013b]. Also, we would like to further
evaluate the proposed approach and mechanisms specifying the user expectations us-
ing different mechanisms for the training and test data. Moreover, we would like
further explore the notion of “monotonicity” introduced in Section 4.5.1 with the
goal of formally and empirically demonstrating this effect. Further, we assumed in
all the experiments reported in this chapter that a recommendation can be either
expected or unexpected. We plan to relax this assumption in our future experiments
using the proposed definition and metrics of unexpectedness. Besides, we would also
like to introduce and study additional metrics of unexpectedness and further inves-
tigate how the different existing recommender system algorithms perform in terms
of unexpectedness vis-a-vis other popular properties of recent systems. Also, we as-
sume an application scenario where the items that the user has already chosen are
not recommended again. However, our method can be easily adapted to application
scenarios such as location recommendation systems where it might be useful to rec-
ommend venues familiar to the user or places that she/he visits periodically. For
instance, such a set of expectations could also take into consideration the distance
of the user from each venue and for how long she/he has not been to that venue or
similar ones, while adapting to different contexts and evolving with the time. Finally,
future system implementation using the proposed method might also allow the user
to explicitly control the different parameters in the proposed model so that individual
desired levels of unexpectedness can be obtained.
CHAPTER V
The Business Value of Recommendations:
Evidence from a City Guide Mobile Application
5.1 Introduction to Business Value of Recommendations
Mobile devices have become a major platform for information as consumers spend
an increasing amount of time with mobile devices [Nielsen, 2014] and use them more
often to search for products [Fargo, 2014]. Mobile devices have also been driving
the fast growth in e-commerce sales whereas, at the same time, the contribution of
traditional shopping channels has been declining [eMarketer, 2013]. These trends are
mainly attributed to smartphone users and are expected to continue for several years.
It is projected that by 2019 the number of mobile shoppers will reach 213.7 million
(from 125 million in 2013) with 87% of smartphone users shopping online using their
mobile device (compared to 75% in 2013) [eMarketer, 2015], while the number of
search queries will almost double and the amount of sales will triple [UBS, 2015].
At the same time, despite the already widespread penetration of mobile devices
and the recent advances of technology, information overload problems are still more
acute in such platforms, compared to desktops, due to various technical characteristics
and idiosyncrasies of mobile devices. These unique characteristics include, among
others, the distinct human-computer interaction, the increased impact of the external
environment, the differences in behavioral characteristics of mobile users, context of
usage, the smaller screen size of mobile devices, etc. [Ghose et al., 2012a; Ricci,
2010]. Recommender system (RS) techniques though offer the potential to further
increase the usability of mobile devices and alleviate some of the implications of
the aforementioned idiosyncrasies by providing more focused content and effectively
limiting the negative effects of information overload. Given the significance of mobile
platforms and the emerging opportunities of recommender systems, it is of paramount
importance to measure and better understand any differences in effectiveness among
various recommendation types and algorithms in a mobile context. Similarly, in
order to better leverage the benefits of RSes, it is also important to understand the
differences in the effectiveness of recommendations across the various candidate items
and recommendation settings and thus to examine the moderating effect of various
item attributes and contextual factors in the mobile channel.
However, despite the increasing prevalence and importance of electronic com-
merce, mobile devices, and smartphone applications, there has been scant academic
research regarding the economic impact of RSes, especially in the context of mobile
recommendations. This is mainly due to the inherent difficulty of measuring the
economic impact of RSes, the limited availability of appropriate data sets, and the
increasingly important privacy concerns that RSes and location-based services raise
[Krumm, 2009; Riboni et al., 2009]. Therefore, even though the impact of recommen-
dations on user behavior and economic demand and especially their corresponding
effects when using a mobile platform is a promising field of research, our understand-
ing of how various types of recommendations in the mobile context may affect the
demand levels for individual products is rather limited yet. For instance, 78% of
marketers cite lack of such knowledge regarding personalization techniques as a bar-
131
rier to their adoption in mobile settings as well as to successful implementations of
marketing strategies [Econsultancy.com, 2013].
In this study, we measure the effectiveness of recommendations as the increase in
demand for the recommended candidate items. In particular, we employ econometric
techniques in order to estimate the impact of recommendations on consumers’ utility
and real-world demand, using an observational study with actual data corresponding
to all the users of a popular real-world mobile application. The main contributions
of this study are the following. First, we measure the effectiveness and economic
impact of various types of real-world recommendations in the mobile settings, based
on a structural method following discrete-choice models of product demand that have
a long history in econometrics (e.g., [McFadden, 1980]). Second, we facilitate the
estimation of causal effects in the presence of endogeneity (a common issue in RSes,
targeted advertising, etc.) using machine learning methods. In particular, we leverage
an exogenous shock to the recommendation process and extend the family of the
popular BLP-style instruments of item “isolation” and differentiation [Berry et al.,
1995] to the latent space through using deep learning techniques that, instead of
commonly treating the individual words of user-generated texts as unique symbols
without meaning, reflect semantic and syntactic similarities and differences among
words and phrases. Third, we discover significant new findings that extend the current
knowledge regarding the heterogeneous impact of RSes, reconcile contradictory prior
findings in the related literature, and draw significant business implications.
Our main results show that an increase by 10% in the number of recommenda-
tions raises demand by about 7.1%. This effect is both statistically and economically
significant and can have greater impact on demand than specific item attributes. Our
findings also highlight the importance of “in-the-moment” marketing and recommen-
dations on the mobile world [Oliver et al., 1998]. In particular, we find that trending
132
recommendations have a much stronger effect on consumers’ choices compared to tra-
ditional recommendations in a mobile setting. This effect is relatively stable across
various levels of popularity, whereas traditional recommendations contribute to the
“rich-get-richer” problem. We also find significant differences in effectiveness among
various types of traditional recommendations. Finally, we also examine various mod-
erating effects of item attributes and contextual factors. We find that in our empirical
setting the effectiveness of recommendations increases during holidays and with bet-
ter weather conditions while more expensive alternatives better leverage the effect of
recommendations. Besides, we find that recommendations simply based on just the
novelty of each alternative do not have a significant effect but novel alternatives accrue
greater benefits from recommendations when item attributes, such as the quality of
the alternative, are also taken into consideration by the recommendation algorithm.
The rest of the chapter is organized as follows. In Section 5.2, we discuss the
relevant literature on recommender systems and mobile platforms. In Section 5.3, we
provide an overview of the employed data set and application domain. This is followed
by a description of the methods used to estimate the effects of recommendations in
Section 5.4 and the employed deep-learning techniques for econometric instruments
in Section 5.5. We then report the results of our empirical study in Section 5.6.
The chapter concludes with a discussion of the findings and limitations as well as an
overview of directions for future research in Section 5.8.
5.2 Literature Review and Research Question
Our work is related to several streams of research, including mobile consumer
behavior, desktop and mobile recommender systems, and effects of recommendations
on sales distribution. In the next paragraphs, due to space limitations, we focus on
the most relevant topics and the corresponding works. For a rigorous review of the
related work in RSes, please see, for instance, [Adomavicius and Tuzhilin, 2005; Li
and Karahanna, 2015; Ricci, 2010; Xiao and Benbasat, 2007].
Studying the effects of desktop recommender systems on aggregate demand and
markets, Fleder and Hosanagar [2009] show analytically that RSes can lead to a reduc-
tion in aggregate sales diversity, creating a rich-get-richer effect for popular products
and preventing what may otherwise be better consumer-product matches. However,
Brynjolfsson et al. [2011] provide empirical evidence that RSes are associated with
an increase in niche products, reflecting lower search costs in addition to the in-
creased product availability and corroborating the findings of [Pathak et al., 2010]
regarding the heterogenizing effects of RSes. Nevertheless, Hosanagar et al. [2013]
study whether RSes are fragmenting the online population and find that users widen
their interests, which in turn creates commonality with others instead of heterogeniz-
ing users. In this study, we focus especially on recommender systems in the mobile
context and demand levels for individual products rather than effects on aggregate
demand at the market level.
Focusing on the impact of desktop recommender systems on demand levels for indi-
vidual products, Oestreicher-Singer and Sundararajan [2012b] study how the explicit
visibility of related-product networks can influence the demand for products in such
networks and find that complementary products have significant influence on each
other’s demand. They also find that newer and more popular products benefit more
from the attention they garner from their network position in such related-product
networks. In contrast, Chen et al. [2004] find that such network-based recommenda-
tions in a desktop setting are more effective for less-popular books. Similarly, Pathak
et al. [2010], examining a desktop recommender for item-to-item networks of books
and focusing on 156 top-selling books on Amazon.com, find that the impact of the
strength of recommendations on sales rank is moderated by the recency effect. Even
though the majority of studies have focused on the effects of RSes at the market level
and item-to-item networks of hyperlinked products in the non-mobile context, there
are also empirical studies examining the effect of various types of recommendations
on demand levels for individual products. Lee and Benbasat [2010] conduct a lab
experiment with 43 subjects and find that RSes reduce users’ perceived effort and
increase the accuracy of their decisions, while their findings support the notion that RSes should be designed to fit the task undertaken by the user. Besides, Tintarev et al. [2010]
conduct a user study with 21 subjects and find that RSes can increase the demand
levels, especially for long tail items. In addition, Jannach and Hegelich [2009] present
a case study evaluating the effectiveness of item recommendations for mobile apps
(either paid or free apps) in different navigational situations. However, Jannach and
Hegelich [2009] focus only on the most frequent users and employ only traditional RS
algorithms, while consumers’ utility and willingness to pay as well as economic de-
mand are out of the scope of that case study. In this study, we focus on the assessment
of the impact of real-world recommendations on sales and the utility of consumers
in the context of mobile recommendations, based on actual data corresponding to all
the users of a popular real-world mobile application. More specifically, we examine:
RQ: What is the relative economic effectiveness of recommendations on real-world
demand for individual items in the mobile context?
In addition to the main effect of the various types of RSes on the demand levels
for individual products, we also examine specific moderating effects in order to gain
a more detailed understanding of the effectiveness of the various types of recommen-
dations in a mobile setting. In particular, we examine whether the popularity, price
or novelty of an alternative moderate the effectiveness of different types of mobile
recommendations. In addition to such item attributes, we also study whether mar-
keting promotions and context moderate the effectiveness of RSes. Prior research in
RSes has examined other specific moderators such as product type, product com-
plexity, and product novelty. In particular, Senecal and Nantel [2004] examine the
moderating effect of product type and find that recommendations are more effective
for experience products compared to search goods. Fasolo et al. [2005] examine the
effect of product complexity and find that consumers using RSes engaged in more
information search and were less confident in their product choices for higher product
complexity. Finally, in prior research Ekstrand et al. [2014] and Matt et al. [2014] find
that novelty has a significant negative effect on consumers’ satisfaction and perceived
enjoyment whereas Vargas and Castells [2011] argue that novelty is a key positive
quality of recommendations in real scenarios.
Our work is also related to the extant literature in mobile consumer behavior. In
the context of mobile advertising, Kannan et al. [2001] propose that mobile advertising
is likely to significantly increase the frequency of impulse purchases due to the instant
gratification and the immediate need fulfillment enabled by the medium. Examining
mobile coupons, Danaher et al. [2015] find that how long coupons are valid can
influence redemption rates in a mobile setting as consumers redeem mobile coupons
much faster than traditional coupons and do not usually store them for future use, while Fong et al. [2015] show that mobile coupons are more effective when they are
sent to consumers close to the consumption time. Furthermore, Panniello et al. [2016]
examine how contextual information affects customer trust, sales, and other business
performance metrics in the context of an e-commerce website by conducting A/B
testing with the customers of that website. Finally, Andrews et al. [2015] examine
the moderating effect of context on mobile ad effectiveness and illustrate the impact
of physical crowdedness on consumer response to mobile ads. In addition, our study is
also related to online-to-offline (O2O) commerce as we examine the impact of online
recommendations on real-world (offline) demand [Rampell, 2010].
Finally, our work is also related to the stream of literature that integrates machine
learning and data mining with econometric techniques. In particular, in this study, we
employ deep learning methods to introduce new machine learning-based econometric
instruments that extend a popular family of instruments from the observed product
characteristics space to the latent space. Such a machine learning-based approach
has the potential to generate more appropriate instruments as well as to leverage the
abundance of user-generated content when structured product attributes are either
not available or not sufficient. Hence, our work is also related to the extant literature
in Information Systems that employs text-mining, sentiment-analysis, and other data
mining methods with user-generated content in empirical econometric studies (e.g.,
[Archak et al., 2011; Ghose and Ipeirotis, 2011; Ghose et al., 2012b; Goes et al., 2014;
Goh et al., 2013; Netzer et al., 2012; Tirunillai and Tellis, 2014]).
5.3 Mobile Recommendations and Data
Our data set is from a mobile platform that identifies and recommends interesting
events and places. In aggregate, our data set includes 12,119 venues and the corre-
sponding visits of several million active users from February 2015 until March 2015;
the maximum number of total visits to a single venue in our data set is 121,524. In
particular, our data set includes all the restaurants in the mobile urban guide app
for the 10 most popular cities (in terms of population) in the United States. Table
5.1 shows the specific cities that are included in our data set and the corresponding
number of venues in each city and Figure 5.1 shows the locations of venues for three
of these cities.
Figure 5.1: Venues included in the panels of (a) Chicago, (b) New York City, and (c) San Francisco in our data set (panels plot venue locations by longitude and latitude).
Table 5.1: US Cities included in Analysis and Corresponding Number of Venues.

City  State  Venues
Austin  TX  788
Chicago  IL  1,585
Dallas  TX  651
Houston  TX  1,042
Los Angeles  CA  953
New York  NY  3,811
Philadelphia  PA  661
San Antonio  TX  712
San Diego  CA  747
San Francisco  CA  1,169
The dependent variable (DV) of our analysis corresponds to the total number of
visits to a particular venue in a single time period. The independent variables (IVs)
of interest include various types of recommendations, such as traditional recommen-
dations based on past historical trends and data. In particular, the different types
of recommendation include recommendations of venues based on the total number of submitted positive user-generated reviews, positive ratings from the users, etc. (i.e., ‘quality recommendations’), as well as recommendations of businesses that are endorsed through reviews by famous brands and experts (i.e., ‘expert recommendations’). They also include recommenda-
tions for novel venues, as alternatives that opened recently are also recommended to
the users (i.e., ‘novel recommendations’). In addition, the mobile application also rec-
ommends to the users venues that have scheduled upcoming events for customers (i.e.,
‘event recommendations’). The IVs also include whether a venue is recommended as
a “trending” venue (i.e., ‘trending recommendations’). This type of recommendations
takes into consideration the latest trends for the specific alternatives by explicitly discounting past historical trends and data and by leveraging available information that captures current trends, based on the normalized relative differences in item attributes (e.g., number of photos) during the most recent time periods, so that interesting alternatives, and not only the most popular ones, are recommended to the users. This type of recommendation is designed to take advantage of consumers' higher involvement with mobile devices and to leverage the differences in consumption patterns (e.g., more instantaneous and less planned behaviors) on mobile platforms; from a business perspective, it captures trends related to “in-the-moment” mar-
keting. Apart from the distinction of recommendations between the different types
(i.e., ‘quality’, ‘expert’, ‘novel’, ‘event’, and ‘trending’ recommendations), all the rec-
ommendations are seemingly similar as they are presented to the users in the same
way, even though they are generated as described above. Moreover, all the recom-
mendations are generated before the realization of the demand for each alternative.
Besides, all the recommendation lists are explicitly diversified by the RS algorithms
[Ziegler et al., 2005] enhancing the exogeneity of recommendations (see Figure 5.2
for correlation and potential endogeneity), while the ‘trending’ recommendations also
exhibit higher levels of temporal diversity [Lathia et al., 2010]. In other words, the
diversification process provides an exogenous shock to the recommendation process.
For each of the types of recommendations and time period, our data set includes the
relative number of times a venue was recommended to the users as well as statistics
(e.g., average) of the ranking of the venue in the generated recommendation lists.
Moreover, the independent variables of interest also include how many brands and
experts have endorsed each venue and whether there are (and how many) scheduled
upcoming events in this specific venue. This information is available for all businesses
and not only the recommended venues.
Additionally, even though the employed estimation method (see Section 5.4) does
not require the researcher to observe all the relevant product characteristics,1 our data
1 The demand equation discussed in Section 5.4 is given an explicit structural interpretation accommodating unobserved demand factors.

Figure 5.2: Correlation of Main Variables Employed in Econometric Specifications for Business Value of Recommendations.
set includes a large number of contextual variables and item attributes. In particular,
our data set includes the exact location of the venue, the number of user-generated
reviews for the venue, the average numerical rating of the venue and the corresponding
number of ratings submitted by users, the number of photos uploaded by users for
this venue, the price tier of the venue (i.e., from 1 to 4 with 1 corresponding to
least pricey), whether the venue is part of a chain, the categories that have been
applied to this venue (e.g., American restaurant, Vegetarian) by users, whether a
marketing promotion is taking place, whether the venue offers breakfast, brunch,
lunch, dinner, alcohol, delivery, or take-outs, whether the venue takes reservations,
whether it accepts credit cards, whether there is live music, DJ, TVs, Wi-Fi, outdoor
seating, and parking availability, when the venue opened and was first introduced in
the platform, as well as a description of the venue and the user-generated reviews.
We further supplement our data set with additional contextual variables. In par-
ticular, for each one of the venues in our data set we also include climate data (e.g.,
temperature and precipitation) from the National Center for Environmental Infor-
mation (NCEI) of the National Oceanic and Atmospheric Administration (NOAA),
geospatial data (e.g., elevation level) from the Consultative Group for International
Agricultural Research (CGIAR) Consortium for Spatial Information (CSI), as well as
U.S. rental prices (e.g., median rental price at this location) from the U.S. Census
Bureau, 2009-2013 5-Year American Community Survey, and Zillow Group. Besides,
we also include calendar data with information about bank holidays, etc. Figure 5.2
shows the correlation matrix of the variables of main interest.
One of the main advantages of the presented study is the use of actual data
corresponding to all the available venues and all the users of a popular real-world
application and not only to a specific sub-population (e.g., most frequent users, users
opting in either online or offline surveys, etc.). In addition, from a methodological
perspective, a couple of other important advantages are that we can easily quantify the
quality of all the alternatives and their marketing promotions and that the recom-
mendations are characterized by high levels of both intra-list and temporal diversity
as discussed in this section. Besides, our data set includes observations corresponding
to multiple cities and several time periods.
5.4 Empirical Method and Models
In this section, we discuss the econometric structural model we apply to estimate
the utility of consumers regarding the different alternatives and the corresponding
effect of the various types of recommendations. In a nutshell, each consumer selects
the alternative that gives her/him the highest utility while the utility of consumers
depends on the alternative characteristics, specific contextual factors, whether the
particular alternative is recommended by the mobile application, as well as individual
taste parameters. The alternative (market) shares are then derived as the aggregate
outcome of individual consumer decisions and the utility parameters are inferred
based on consumer decisions.
In particular, there are R markets (i.e., cities) with Nr alternatives (i.e., venues)
in market r. For each alternative j in market r and time period (i.e., day) t, the
observed characteristics are denoted by vector $z_{jrt} \in \mathbb{R}^{K_z}$, contextual factors by vector $w_{jrt} \in \mathbb{R}^{K_w}$, and recommendation types by vector $\rho_{jrt} \in [0,1]^{K_\rho}$; for simplicity $z_j$, $w_j$, and $\rho_j$, respectively. The elements of $z_j$, $w_j$, and $\rho_j$ combined include attributes $x_j$ (e.g., quality, frequency of each type of recommendation, temperature) that affect the demand levels $q_{jrt}$ (i.e., number of visitors); for simplicity $q_j$. The unobserved characteristics (e.g., perception of status) of alternative $j$ are denoted by $\xi_j$. The utility $u_{ij}$ of user $i$ for alternative $j$ depends on the characteristics of the alternative and the user as well as the price $p_j$.
In addition to the competing venues j = 1, . . . , N , we also model the existence of
an outside option, j = 0. This outside option corresponds to alternatives that might
not be present in our data set or the option of a user not visiting any venue at all
in time period t. Consumers may choose to select the outside option instead of the
N “inside” alternatives; the mean utility value of the outside option is normalized to
zero.
Following the standard assumption of consumer rationality for utility maximiza-
tion (i.e., the consumer chooses the alternative that maximizes utility surplus) and
assuming that εij, which captures user-specific taste parameters, follows an extreme
value distribution and no random coefficients, the probability that a user i chooses
alternative j is [McFadden, 1980]:
$$\Pr(\text{choice}_{ij}) = \frac{e^{u_{ij}}}{\sum_{k=0}^{N} e^{u_{ik}}} = \frac{e^{\beta x_j - \alpha \rho_j + \xi_j}}{1 + \sum_{k=1}^{N} e^{\beta x_k - \alpha \rho_k + \xi_k}}, \qquad (5.1)$$

$\forall k$ in the same market $r$ and $k \neq j$.
The market share $s_{jrt}$, for simplicity $s_j$, of each alternative is then calculated as $s_j = q_j / M_r$, where $M_r$ is the total market size for the corresponding city (i.e., market) $r$. This market size $M_r$ is set to the maximum number of unique active users that has ever been observed in the mobile application for that city. Alternatively, the market
size could be assumed to be the population of each city or the number of households
instead of individuals; the results remain qualitatively the same. Inverting the market
share equation and taking the logarithm in Eqn. (5.1), the market share of alternative
j is:
$$\ln(s_j) - \ln(s_0) = \beta x_j - \alpha \rho_j + \xi_j. \qquad (5.2)$$
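For completeness, the inversion behind Eqn. (5.2) can be sketched as follows, writing $\delta_j$ for the mean utility of alternative $j$ (a shorthand introduced here): since the mean utility of the outside option is normalized to zero,

$$s_j = \frac{e^{\delta_j}}{1 + \sum_{k=1}^{N} e^{\delta_k}}, \qquad s_0 = \frac{1}{1 + \sum_{k=1}^{N} e^{\delta_k}}, \qquad \delta_j = \beta x_j - \alpha \rho_j + \xi_j,$$

so that $s_j / s_0 = e^{\delta_j}$ and taking logarithms yields Eqn. (5.2).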
Additionally, if we assume that user tastes are correlated across alternatives and
group the alternatives into G exhaustive and mutually exclusive sets, g = 1, . . . , G,
the market share of alternative j is [Cardell, 1997]:
$$\ln(s_j) - \ln(s_0) = \beta x_j - \alpha \rho_j + \sigma \ln(s_{j/g}) + \xi_j, \qquad (5.3)$$
where sj/g is the market share of alternative j as a fraction of the total group (nest)
share and j is in group g. As the parameter σ approaches one, the within group
correlation of utility levels goes to one, and as σ approaches zero, the within group
correlation goes to zero. In the empirical section of our study, we estimate the pro-
posed model both with and without assuming that user tastes are correlated across
alternatives.
In other words, using demand-estimation approaches from economics, we estimate
the weights that consumers (implicitly) assign to alternative characteristics, recom-
mendations, and contextual factors, as well as the sensitivity of consumers to changes
in these factors and characteristics. This is done by inverting the function defining
market shares to uncover the utility levels of the alternatives and relating these utility
levels to alternative characteristics, recommendations, and contextual factors. Then,
based on these estimates, we derive the utility gain that each type of recommendations
generates. The employed methodology estimates any effects in a privacy-preserving
manner, as it does not require individual consumer data but only aggregate data and
statistics, even though it is a model of individual behavior.
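To make these mechanics concrete, the following minimal sketch (not the code used in this study) illustrates how the estimating equation in Eqn. (5.2) could be taken to an aggregate venue-day panel with standard linear methods; the file name and column names (e.g., visits, market_size, rec_trending) are hypothetical placeholders, and the actual specifications additionally include the nesting term, richer controls, fixed effects, and the instruments discussed below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical aggregate panel: one row per venue and day.
df = pd.read_csv("venue_day_panel.csv")

# Market shares and the log share ratio of Eqn. (5.2); the outside option
# absorbs the remaining mass of each city-day market.
df["share"] = df["visits"] / df["market_size"]
outside = 1.0 - df.groupby(["city", "date"])["share"].transform("sum")
df["log_share_ratio"] = np.log(df["share"]) - np.log(outside)

# Linear estimation with market- and category-level dummies, clustered by venue.
model = smf.ols(
    "log_share_ratio ~ rec_trending + rec_quality + rec_expert + rec_event"
    " + rec_novel + price + rating + C(city) + C(category)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["venue_id"]})
print(model.summary())
```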
Apart from the benefits discussed in the previous section (e.g., privacy, real data
and users, popular real-world application, quantifiable quality, exogenous variation,
etc.), this method also allows for unobserved product characteristics, including
determinants that are difficult to measure (e.g., consumers’ perceptions about sta-
tus). In particular, we can consistently obtain the effects of interest even if other
covariates are unobserved or endogenous, and we have no outside instruments for
them [Crawford, 2012]. Besides, the resulting model can make predictions not only
for the existing alternatives under different conditions and contextual factors but also
for new alternatives that might not be included in our data set.
5.4.1 Identification Strategy
We can treat Eqn. (5.3) as an estimation equation, treating ξj as an unobserved
error term, and use typical econometric techniques in order to estimate the unknown
parameters. In particular, we employ panel data techniques in order to control also
for unobserved confounders, in addition to the extensive set of observed confounders,
while we also leverage the within-alternative variation in our data set (i.e., temporal
diversity of recommendations – see Section 5.3). In addition, our econometric specifi-
cations alleviate common endogeneity issues in prices, for instance, as we control for both the quality of products and marketing promotions, as well as in the recommendations themselves, since the recommendations are explicitly diversified by the recommender
system [Ziegler et al., 2005]. Nevertheless, even though we are interested in the relative
differences of effectiveness across the various types of recommendations in a mobile
context and we leverage the exogenous variation in the recommendation mechanisms,
while we also control for numerous confounders, including quality and marketing pro-
motions, we also explore as a robustness check the possibility that different variables in
our econometric specifications are endogenous. Hence, in our empirical study, we also
use traditional instruments, such as rental prices, degree of competition, and specific
variables that are used in the algorithms to generate the recommendations but do not
affect the utility of the alternatives for consumers in the current time period given
the observed confounders (e.g., lagged variables corresponding to number of photos in
previous time periods) [Sweeting, 2013], as well as a novel instrument derived from a
metric of alternative differentiation and isolation based on a machine learning model
of the user-generated reviews employing deep-learning techniques. The motivation
for the latter instrument is twofold: first, it is similar to the BLP-style instruments
[Berry, 1994; Berry et al., 1995], which measure the isolation in product characteris-
tics space as products (or alternatives) that are more isolated are related to higher
margins; second, it is related to whether an alternative is recommended by the mo-
bile application as alternatives that are more isolated have higher likelihood of being
included in a recommendation list because of the diversification process of the em-
ployed recommendation algorithms. As discussed in the following section, these new
machine learning-based econometric instruments extend the popular family of BLP-
style instruments from the observable space of product characteristics to the latent
space.
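As an illustration of how such instruments can be used, the sketch below runs a simplified two-stage least squares specification with the linearmodels package; the column names (e.g., latent_isolation for the embedding-based differentiation and isolation measure, rental_price, hausman_price) are hypothetical placeholders, and the specification is far sparser than the ones reported later in this chapter.

```python
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import IV2SLS

df = pd.read_csv("venue_day_panel.csv")  # hypothetical aggregate panel, as above

# Treat price and the recommendation frequency as potentially endogenous and
# instrument them with rental prices, Hausman-style prices of other venues,
# and the latent-space isolation measure described in Section 5.5.
dependent = df["log_share_ratio"]
exog = sm.add_constant(df[["rating", "n_reviews", "n_photos"]])
endog = df[["price", "rec_frequency"]]
instruments = df[["rental_price", "hausman_price", "latent_isolation"]]

iv_res = IV2SLS(dependent, exog, endog, instruments).fit(
    cov_type="clustered", clusters=df["venue_id"]
)
print(iv_res.summary)  # first-stage and overidentification diagnostics can also be inspected
```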
5.5 Deep-Learning Model of User-Generated Reviews
In order to implement these machine learning-based instruments that measure the
differentiation of alternatives in the latent space of products, we use an efficient and
state-of-the-art method based on deep-learning techniques [Le and Mikolov, 2014;
Mikolov et al., 2013b]. In particular, instead of treating the individual words of
user-generated texts as unique symbols (tokens) without meaning, as in common
bag-of-words and bag-of-n-grams models, the employed method reflects semantic and
syntactic similarities and differences among words and phrases by representing them
as dense vectors, usually referred to as “neural embeddings”. The common paradigm
for deriving such representations is based on the distributional hypothesis of Harris
[1954] that words in similar contexts have similar meanings. In essence, the continuous
space representations are learned in an unsupervised fashion by trying to maximize
the dot-product between the vectors of frequently occurring word-context pairs and
minimize it for random word-context pairs [Levy and Goldberg, 2007]. These contin-
uous space representations can be used with any standard distance metrics in order
to measure the differentiation of alternatives in the latent space.
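For concreteness, one common form of this training criterion, the skip-gram objective with negative sampling used in the word2vec family of models (the symbols $D$, $k$, and $P_n$ are notation introduced here), is

$$\sum_{(w,c) \in D} \Big[\, \log \sigma(\vec{w} \cdot \vec{c}) \;+\; \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\big[\log \sigma(-\vec{w} \cdot \vec{c}_i)\big] \Big],$$

which is maximized over the word vectors $\vec{w}$ and context vectors $\vec{c}$; here $\sigma$ is the logistic function, $D$ is the set of observed word-context pairs, and $P_n$ is a noise distribution from which the $k$ random (negative) contexts are drawn.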
The neural network based word vectors are usually trained using stochastic gra-
dient descent, where the gradient is obtained via back-propagation [Mikolov et al.,
2013a; Rumelhart et al., 1988]. After the training converges, words with similar mean-
ing are mapped to a similar position in the latent space. We use a pre-trained open-
source model that contains 300-dimensional vectors for 3 million words and phrases
trained on part of the Google News dataset (about 100 billion words) and described in
[Mikolov et al., 2013a], in order to estimate in each time period the representation of
the alternatives in the latent space, as well as the distance of the various alternatives;
the generated econometric instruments exhibit within-subject variation over time.
We should note that this method is general and applicable to texts of any length (e.g.,
sentences, paragraphs, documents) and does not require task-specific tuning nor does
it rely on parse trees [Le and Mikolov, 2014].
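A minimal sketch, under stated assumptions, of how such a latent-space differentiation (isolation) measure could be computed from the pre-trained vectors with the gensim library is shown below; the file path, the simple whitespace tokenization, and the choice of averaging review-word vectors per venue are illustrative assumptions rather than the exact procedure of this chapter.

```python
import numpy as np
from gensim.models import KeyedVectors
from scipy.spatial.distance import cosine

# Pre-trained 300-dimensional Google News vectors [Mikolov et al., 2013a]; path is illustrative.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def venue_vector(review_texts):
    """Represent a venue by the average embedding of its in-vocabulary review tokens."""
    tokens = [t for text in review_texts for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in tokens], axis=0) if tokens else None

def latent_isolation(target_vec, competitor_vecs):
    """BLP-style 'isolation' in the latent space: mean cosine distance to competing venues."""
    return float(np.mean([cosine(target_vec, v) for v in competitor_vecs]))
```

The resulting per-venue, per-period isolation scores can then enter the set of instruments discussed in Section 5.4.1.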
5.6 Empirical Results
In order to discover the impact of different types of recommendations in the mobile
context, we estimate different specifications of the structural econometric model we
presented in the previous section. First, we estimate various specifications based
on the structural logit model of demand presented in Eqn. (5.2) and the nested
model in Eqn. (5.3) (see Tables 5.2 and 5.3). Then, we also control for unobserved
heterogeneity and estimate our models introducing alternative-level fixed effects (see
Table 5.4). Due to space limitations, we only present the results corresponding to
the latter (nested logit) specification; the results are qualitatively the same across
the different specifications. Similar results were also obtained employing reduced
form models rather than structural models, which provide better fit to our data. All
our econometric specifications provide very good fit to the data as well as out-of-
sample predictive performance (see Tables 5.5 – 5.6). Apart from estimating the
main variables of interest, we also examine a number of interaction effects (see Tables
5.11 – 5.16). In order to test the robustness of our findings and control for potential
endogeneity, we conduct a subsample analysis, allow for parameter heterogeneity,
and also employ instrumental variable techniques (see Tables 5.21 – 5.22) based on
econometric instruments derived using deep learning methods.
Finally, we also conduct various falsification tests in order to verify the validity of our
findings.
Table 5.2 shows the coefficient estimates for venue demand and the corresponding
impact of the various types of recommendations in the mobile context, based on the
logit model with multi-level fixed effects (i.e., market, venue category), various alter-
native and context controls (e.g., number of events, meals served, location, holiday,
temperature, precipitation) and a time trend as well as day of the week effects, in
order to control for correlation of tastes across alternatives as well as different po-
tentially unobserved effects. Models 1 and 2 identify the average effect of recommen-
dations, in general. In particular, in Model 1 the variable of interest is binary while
Model 2 provides a more precise measurement using the relative frequency of recom-
mendation of the specific venue. Model 3 separates the effect of the different types
of recommendations (i.e., ‘quality recommendations’, ‘event recommendations’, ‘ex-
pert recommendations’, ‘novel recommendations’, and ‘trending recommendations’)
in which we are interested. Then, Model 4 also controls for the ranking of each rec-
ommendation in the generated recommendation lists. Due to space limitations, only
coefficients of the main variables of interest and statistically significant effects are
shown.
As Models 1 – 4 show, recommendations have a positive impact on demand in
a mobile recommendation setting. This impact is significant at level p < 0.001, in
most of the cases. In particular, computing alternative-level derivatives of the de-
mand function (elasticities), an increase by 10% in the number of times a venue is
recommended raises the demand by 7.15% for already recommended alternatives and
by 0.92% on average for all alternatives in general. Hence, there are positive effects on
both individual demand and aggregate-level demand in the market. These effects are
both statistically and economically significant; comparing these numbers with typical click-through rates in display advertising, the recommendation effect in this mobile setting is orders of magnitude larger. Besides, comparing the effect of recommendations to the effects of various attributes on alternative demand, an increase of 1% in the relative frequency of recommendation of an alternative has the same effect on demand as an increase of about 8% in the rating of the alternative, about 4.20% in the number of reviews, or about 13% in the number of photos. Hence, recommendations in a mobile setting
can have a greater impact on demand compared to various item attributes. Moreover,
trending recommendations that provide “in-the-moment” content to the users have a
much stronger effect, in the particular mobile setting, compared to traditional rec-
ommendations based on historical trends and data, which highlights the importance
of “in-the-moment” marketing in a mobile context. Additionally, recommendations
based on the quality of venues and experts’ reviews outperform other types of recom-
mendations. On the contrary, recommendations based on the number of upcoming
events or simply the novelty of the alternative underperform the other types of mo-
bile recommendations we examine. The differences in effectiveness among the various
Table 5.2: Coefficient Estimates of Logit Model.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.7207***
Recommendation  0.9926***
Trending recommendation  1.6179***  1.6130***
Quality recommendation  0.9576***  1.0729***
Event recommendation  0.3026***  0.4471***
Expert recommendation  0.9935***  1.1119***
Novel recommendation  0.3707***  0.5130***
Recommendation ranking  -0.0031***
Price  -0.0105***  -0.0114***  -0.0101***  -0.0099***
Rating  0.0096***  0.0100***  0.0121***  0.0103***
Number of Reviews  0.1885***  0.2021***  0.1923***  0.1903***
Sentiment of Reviews  0.0120***  0.0132***  0.0121***  0.0120***
Photos  0.0587***  0.0552***  0.0559***  0.0570***
Chain  0.0352***  0.0391***  0.0367***  0.0374***
Promotions  -0.0032***  -0.0033***  -0.0032***  -0.0031***
Alcohol  0.0222***  0.0199***  0.0213***  0.0221***
Delivery  -0.0347***  -0.0332***  -0.0323***  -0.0322***
Takeout  -0.0900***  -0.0914***  -0.0999***  -0.0998***
Reservations  0.0168***  0.0157***  0.0140***  0.0144***
Credit cards  0.0637***  0.0741***  0.0726***  0.0726***
Outdoor seating  -0.0069*  -0.0064*  -0.0039  -0.0038
Wi-Fi  0.0195***  0.0160***  0.0178***  0.0177***
Parking  0.1117***  0.1016***  0.0951***  0.0900***
Wheelchair accessible  0.0237**  0.0218**  0.0157  0.0160*
TVs  0.0048  0.0163  0.0165  0.0169
Music  0.0545***  0.0532***  0.0523***  0.0530***
Market-level fixed effects  Yes  Yes  Yes  Yes
Category-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -859504  -857790  -855548  -855405
Adjusted R2  0.520  0.522  0.525  0.525
p  0.0000  0.0000  0.0000  0.0000
N  711,673  711,673  711,673  711,673

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
types of recommendations are statistically significant. This significant difference in ef-
fects persists also after controlling for both the overall ranking in the recommendation
list and the ranking within the specific type of recommendation.
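For reference, under the plain logit specification such alternative-level derivatives have a standard closed form: for any characteristic $x_{jk}$ that enters the mean utility linearly with coefficient $\beta_k$,

$$\frac{\partial s_j}{\partial x_{jk}} = \beta_k\, s_j (1 - s_j), \qquad \eta_{jk} = \frac{\partial s_j}{\partial x_{jk}}\, \frac{x_{jk}}{s_j} = \beta_k\, x_{jk}\, (1 - s_j),$$

so the elasticity depends on the estimated coefficient, the level of the characteristic, and the current share of the alternative, which is one reason why the reported effect differs between already recommended alternatives and the average alternative.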
Table 5.3 then presents the results of the nested logit model allowing for correlated
user behaviors. The results corroborate our previous findings regarding the overall
effect of recommendations as well as the effect of each type of recommendations.
Then, we also control for unobserved heterogeneity across alternatives. Table 5.4
presents the corresponding results. The results further substantiate our previous find-
ings. We should note that after controlling for unobserved heterogeneity at the level
of alternative, the effect of novel recommendations is not found to be statistically sig-
nificant. This result regarding the novelty of recommendations is in accordance with
the findings of Adamopoulos and Tuzhilin [2014c] that novelty should be considered
vis-a-vis the overall quality and utility of each alternative when generating recom-
mendations, rather than recommending the most novel items. It is also worth noting
that the effect of promotional marketing strategies is now positive and significant.
Taking into consideration also the results of the aforementioned econometric spec-
ifications, this indicates that even though marketing promotions have on average a
positive effect on demand, such promotions are usually offered by alternatives (venues)
that experience lower levels of consumer demand than expected.
5.6.1 Out-of-Sample Performance
In order to assess the out-of-sample performance of our models and validate our
aforementioned findings, we employ a hold-out evaluation scheme with 80/20 random
split of data and evaluate each model in terms of root-mean-square error (RMSE),
mean-square error (MSE), mean absolute deviation (MAD), and mean absolute per-
cent error (MAPE). In particular, Tables 5.5 and 5.6 present for each econometric
Table 5.3: Coefficient Estimates of Nested Logit Model.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.6944***
Recommendation  0.9712***
Trending recommendation  1.4935***  1.4908***
Quality recommendation  0.9436***  1.0134***
Event recommendation  0.2687***  0.3562***
Expert recommendation  1.0730***  1.1444***
Novel recommendation  0.2535***  0.3399***
Recommendation ranking  -0.0019***
Price  -0.0168***  -0.0177***  -0.0164***  -0.0163***
Rating  0.0030*  0.0030*  0.0050***  0.0039**
Number of Reviews  0.1510***  0.1627***  0.1543***  0.1532***
Sentiment of Reviews  0.0200***  0.0213***  0.0203***  0.0203***
Photos  0.0549***  0.0510***  0.0515***  0.0522***
Chain  0.0373***  0.0413***  0.0392***  0.0397***
Promotions  -0.0032***  -0.0033***  -0.0032***  -0.0032***
Alcohol  0.0526***  0.0509***  0.0516***  0.0520***
Delivery  -0.0586***  -0.0575***  -0.0563***  -0.0562***
Takeout  -0.0425***  -0.0406***  -0.0492***  -0.0493***
Reservations  0.0208***  0.0198***  0.0182***  0.0184***
Credit cards  0.0467***  0.0552***  0.0535***  0.0535***
Outdoor seating  -0.0049  -0.0047  -0.0024  -0.0023
Wi-Fi  0.0185***  0.0149***  0.0167***  0.0166***
Parking  0.0868***  0.0759***  0.0687***  0.0657***
Wheelchair accessible  0.0138  0.0116  0.0055  0.0057
TVs  0.0006  0.0116  0.0121  0.0124
Music  0.0358***  0.0341***  0.0338***  0.0343***
With-in group share  0.1101***  0.1127***  0.1114***  0.1111***
Market-level fixed effects  Yes  Yes  Yes  Yes
Category-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -846109  -843649  -841686  -841639
Adjusted R2  0.537  0.541  0.543  0.543
p  0.0000  0.0000  0.0000  0.0000
N  711,673  711,673  711,673  711,673

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.4: Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.5506***
Recommendation  0.8893***
Trending recommendation  1.0610***  1.0572***
Quality recommendation  0.8903***  0.9765***
Event recommendation  0.4896***  0.5877***
Expert recommendation  0.8421***  0.9294***
Novel recommendation  -0.0629  0.0299
Recommendation ranking  -0.0023***
Rating  0.0234***  0.0166***  0.0180***  0.0174***
Number of Reviews  0.0925***  0.1068***  0.1063***  0.1060***
Sentiment of Reviews  0.0015  0.0013  0.0012  0.0012
Photos  0.0183***  -0.0012  -0.0012  -0.0005
Promotions  0.0024*  0.0024*  0.0023*  0.0023*
With-in group share  0.2487***  0.2544***  0.2532***  0.2529***
Alternative-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -800812  -796329  -795765  -795667
Adjusted R2  0.587  0.592  0.593  0.598
p  0.0000  0.0000  0.0000  0.0000
N  711,673  711,673  711,673  711,673

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
specification the in-sample and out-of-sample performance, respectively. Figures 5.3
and 5.4 graphically illustrate the in-sample and out-of-sample RMSE, respectively, for each econometric specification. Based on the results, in addition to very
good explanatory power, all the employed models and econometric specifications ex-
hibit very good out-of-sample performance.
Table 5.5: In-Sample Validation of Nested Logit Model with Alternative-level Fixed effects.

Model 1  Model 2  Model 3  Model 4
RMSE  0.745510  0.740829  0.740243  0.740175
MSE  0.555785  0.548827  0.547959  0.547860
MAD  0.439188  0.436594  0.436228  0.436224
MAPE  5.535493  5.485244  5.482124  5.481529

Table 5.6: Out-of-Sample Validation of Nested Logit Model with Alternative-level Fixed effects.

Model 1  Model 2  Model 3  Model 4
RMSE  0.927897  0.919884  0.918142  0.917934
MSE  0.860992  0.846186  0.842985  0.842603
MAD  0.596312  0.587145  0.585642  0.585507
MAPE  8.178065  8.042463  8.028449  8.025713
Tables 5.7-5.10 provide the corresponding results for the logit and nested logit
models. The results corroborate our previous findings. The results also illustrate
that the nested logit specification with alternative-level fixed effects provides a better
fit to our data.
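As a small illustration of the hold-out scheme described above, the sketch below computes one common definition of each reported error metric on synthetic numbers; the exact split and metric conventions used in this chapter (e.g., whether MAPE is expressed in percentage points) are not implied.

```python
import numpy as np

def holdout_metrics(y_true, y_pred):
    """Hold-out error metrics (RMSE, MSE, MAD, MAPE) under one common definition."""
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    return {
        "RMSE": float(np.sqrt(mse)),
        "MSE": mse,
        "MAD": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / y_true))),  # scaling convention is an assumption
    }

# Illustrative 80/20 random split on synthetic outcomes and predictions.
rng = np.random.default_rng(0)
y = rng.lognormal(size=1_000)                 # placeholder outcome
y_hat = y * rng.normal(1.0, 0.1, size=1_000)  # placeholder model predictions
cut = int(0.8 * len(y))
print(holdout_metrics(y[cut:], y_hat[cut:]))
```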
5.6.2 Moderating Effects on Effectiveness of Recommendations
Moreover, we delve further into the differences in effectiveness of recommendations
and we examine the moderating effect of various item attributes and contextual fac-
tors in order to gain a more detailed understanding of the effectiveness of the various
Figure 5.3: In-Sample Evaluation of Accuracy of Econometric Specifications Assessing Business Value of Recommendations.

Figure 5.4: Out-of-Sample Evaluation of Predictive Accuracy of Econometric Specifications Assessing Business Value of Recommendations.

Table 5.7: In-Sample Validation of Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.809600  0.807652  0.805111  0.804932
MSE  0.655452  0.652302  0.648204  0.647916
MAD  0.488359  0.489543  0.486214  0.486156
MAPE  6.140014  6.128357  6.098491  6.096787

Table 5.8: Out-of-Sample Validation of Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.994408  0.991234  0.985066  0.984596
MSE  0.988847  0.982545  0.970354  0.969430
MAD  0.616621  0.618346  0.610596  0.610394
MAPE  8.587711  8.535874  8.477126  8.471730

Table 5.9: In-Sample Validation of Nested Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.794504  0.791763  0.789581  0.789514
MSE  0.631236  0.626888  0.623439  0.623333
MAD  0.482875  0.483424  0.480524  0.480507
MAPE  6.051936  6.031686  6.005812  6.005087

Table 5.10: Out-of-Sample Validation of Nested Logit Model.

Model 1  Model 2  Model 3  Model 4
RMSE  0.968876  0.963149  0.957922  0.957720
MSE  0.938721  0.927655  0.917615  0.917228
MAD  0.610642  0.609633  0.602756  0.602656
MAPE  8.458919  8.376500  8.323763  8.320875
types of recommendations in a mobile context. Tables 5.11 – 5.16 present the mod-
erating effects of the various attributes and contextual factors. In particular, Table
5.11 examines the moderating effect of price; Table 5.12 examines the moderating in-
teraction of marketing promotions and recommendations; Table 5.13 investigates the
interaction effect of the novelty of the alternative and recommendations; Table 5.14
presents the interaction effect with popularity; Table 5.15 presents the interaction
effect between recommendations and public holidays; and Table 5.16 the interaction
effect with temperature. All the presented results extend the results presented in
Table 5.4 and control for time-varying venue attributes, climate attributes, geospa-
tial attributes, and calendar attributes as before, as well as for individual-level fixed
effects. All the additional controls and effects are included in all Tables 5.11 – 5.16,
even though they are not depicted in the results, due to space restrictions. Base levels
are also estimated as usual, even though they are likewise not reported.
Based on the results presented in Table 5.11, we find a positive and significant
moderating effect of price on the effectiveness of recommendations. This finding in-
dicates that even though alternatives that are more expensive are less appealing to
the users, ceteris paribus, they can more effectively leverage the additional attention
they garner from recommendations. Further decomposing the ‘traditional’ recom-
mendations into different types, as before, we find statistically significant differences
among the various types of recommendations. In particular, the largest moderating
effect of price is in the case of novel recommendations indicating that novel expensive
venues benefit more from recommendations compared to novel but cheaper venues;
whereas event recommendations exhibit a negative interaction effect. This finding
also illustrates the need to examine multiple types of recommendations rather than
a single class. The presented econometric specifications include market-level fixed
effects rather than alternative-level ones, as there is no significant within-venue variation in price.
Based on the results presented in Table 5.12, we do not find significant moderating
effects of promotions on the effectiveness of recommendations for any type of recom-
mendations. The effect of marketing promotions in the presence of recommendations
is a topic of interest that should be thoroughly examined in future research.
Additionally, based on the results presented in Table 5.13, novel alternatives gain
in general greater benefits from recommendations. In combination with the findings
presented in Tables 5.2 – 5.4, this result illustrates that novel alternatives accrue
greater benefits from recommendations when those recommendations are not solely
based on the characteristic of novelty but also take into consideration attributes such
as the quality of the item. This is highlighted also by the magnitude of the effect in
the case of recommendations based on quality. Hence, this finding reconciles differ-
ent contradictory findings in prior literature in recommender systems. For instance,
Ekstrand et al. [2014] and Matt et al. [2014] maintain that novelty has a significant
negative effect on consumers’ satisfaction and perceived enjoyment, whereas Vargas
and Castells [2011] argue that novelty is a key positive quality of recommendations
in real scenarios.
In Table 5.14, we investigate the effect of the different types of recommendations
across different levels of venue popularity using the number of visits as metric of pop-
ularity and employing the technique of quantile regression. Recommendations have a
stronger positive effect on average for more popular alternatives. This empirical find-
ing is consistent with the observation of Oestreicher-Singer and Sundararajan [2012b]
according to which more popular products make more efficient use of the attention they garner from their network position in recommendation networks in electronic markets.
An interesting and unexpected observation though is that the effect of traditional
recommendations is much stronger for more popular alternatives whereas the effect
Table 5.11: Moderating Effect of Price on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Price  -0.0074
Recommendation x Price  0.0515***
Trending recommendation x Price  0.0409*  0.0419*
Quality recommendation x Price  0.0130  0.0132
Event recommendation x Price  -0.1705***  -0.1835***
Expert recommendation x Price  0.0666  0.0863*
Novel recommendation x Price  0.1967***  0.2044***
Table 5.12: Moderating Effect of Marketing Promotions on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Promotions  -0.0046***
Recommendation x Promotions  -0.0023
Trending recommendation x Promotions  -0.0010  0.0009
Quality recommendation x Promotions  -0.0025  -0.0020
Event recommendation x Promotions  0.0127  0.0127
Expert recommendation x Promotions  -0.0071  -0.0075
Novel recommendation x Promotions  0.0121  0.0170
Table 5.13: Moderating Effect of Novelty on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Weeks open  0.0061
Recommendation x Weeks open  -0.1060***
Trending recommendation x Weeks open  -0.2230***  -0.2226***
Quality recommendation x Weeks open  -0.2000***  -0.1992***
Event recommendation x Weeks open  0.0794*  0.0645
Expert recommendation x Weeks open  0.0393  0.0358
Novel recommendation x Weeks open  0.0750***  0.0747***
of “in-the-moment” recommendations is more stable across alternatives. This finding suggests that such recommendations can alleviate the popularity reinforcement effects of recommender systems, which
have already been observed in various settings (e.g., [Fleder and Hosanagar, 2009;
Oestreicher-Singer and Sundararajan, 2012a]), and potentially contribute to disper-
sion in the tail. Besides, we can also see that the strong and increasing moderating
effect of popularity on the effect of recommendations on demand is mainly through
recommendations based on quality and experts’ reviews whereas this effect is not as
strong for recommendations based on the novelty of alternatives.
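For illustration, the sketch below shows how such quantile regressions across the popularity distribution could be run with statsmodels; the data file and column names are hypothetical placeholders and the specification is heavily simplified relative to the one behind Table 5.14.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("venue_day_panel.csv")  # hypothetical aggregate panel, as before

# One quantile regression per decile of the (log) demand distribution, as in Table 5.14;
# the full specification also includes fixed effects and the additional controls.
for q in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    res = smf.quantreg(
        "log_visits ~ rec_trending + rec_quality + rec_event + rec_expert"
        " + rec_novel + price + rating",
        data=df,
    ).fit(q=q)
    print(q, res.params.filter(like="rec_"))
```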
Finally, Tables 5.15 and 5.16 show the effect of context on the effectiveness of recommendations. In particular, both holidays and better weather conditions (i.e., higher levels of temperature during the examined time period) have a positive and significant effect on the effectiveness of recommendations. Based on the detailed
results, this moderating effect is stronger for ‘trending’, ‘quality’ and ‘expert’ recom-
mendations, highlighting the differences in user behavior across various contexts. It
is worth noting that holidays have a negative effect on event recommendation effec-
tiveness indicating that during holidays users prefer recommended events less. This
finding contributes to the emerging literature on mobile marketing and contextual at-
tributes (e.g., effect of crowdedness on mobile offers [Andrews et al., 2015], geographic
mobility and responsiveness to mobile ads [Ghose and Han, 2011]).
5.7 Robustness Checks
In order to assess the possibility that the aforementioned findings are capturing
other unobserved factors instead of the effect of recommendations on the demand
levels, we also conduct various robustness checks. First, we conduct a subsample
analysis, leveraging the within-subject variation in our panel data set (see Section
Table 5.14: Moderating Effect of Popularity on Effectiveness of Recommendations.

Q:0.10  Q:0.20  Q:0.30  Q:0.40  Q:0.50  Q:0.60  Q:0.70  Q:0.80  Q:0.90
Trending recommendation  1.297***  1.830***  2.080***  2.240***  1.819***  1.632***  1.567***  1.475***  1.269***
Quality recommendation  0.015***  0.119***  0.487***  0.808***  0.671***  0.630***  0.702***  0.983***  4.244***
Event recommendation  -0.016***  -0.219***  -0.500***  -0.693***  -0.442***  -0.280***  -0.172***  0.194***  3.649***
Expert recommendation  0.045***  0.337***  0.863***  1.009***  1.033***  1.040***  1.244***  1.629***  2.255***
Novel recommendation  0.012***  0.092***  0.429***  0.744***  0.599***  0.538***  0.530***  0.499***  1.085***
Market-level fixed effects  Yes (all quantiles)
Category-level fixed effects  Yes (all quantiles)
Additional alternative controls  Yes (all quantiles)
Context controls  Yes (all quantiles)
Time trend  Yes (all quantiles)
Pseudo-R2  0.505  0.508  0.508  0.491  0.456  0.453  0.441  0.419  0.409
p  0.0000 (all quantiles)
N  711,673 (all quantiles)

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.15: Moderating Effect of Public Holidays on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Holiday  0.1020***
Recommendation x Holiday  0.0965***
Trending recommendation x Holiday  0.3392***  0.3396***
Quality recommendation x Holiday  0.0781***  0.0806***
Event recommendation x Holiday  -0.3393***  -0.3417***
Expert recommendation x Holiday  0.0750  0.0781
Novel recommendation x Holiday  0.1416*  0.1440*
Table 5.16: Moderating Effect of Temperature on Effectiveness of Recommendations.

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary) x Temperature  0.0026***
Recommendation x Temperature  -0.0020***
Trending recommendation x Temperature  0.0152***  0.0150***
Quality recommendation x Temperature  0.0052***  0.0048***
Event recommendation x Temperature  -0.0058  -0.0074*
Expert recommendation x Temperature  -0.0017  -0.0013
Novel recommendation x Temperature  0.0095***  0.0088***
5.3), and we estimate the effectiveness of the various types of recommendations only
for alternatives that have been recommended. Table 5.18 presents the results for the
nested logit model controlling for unobserved heterogeneity at the level of alternative;
Table 5.17 provides the corresponding results employing multi-level fixed effects as
well as time-varying alternative controls, context controls, and a time trend. The
results corroborate our previous findings.
Moreover, we further consider heterogeneous effects by assigning random coeffi-
cients to the main variables of interest. Table 5.20 shows the corresponding results
for the nested logit model; Table 5.19 provides the corresponding results for the logit
model. The results corroborate our previous findings. It is worth noting that expert
and event recommendations exhibit higher variance compared to the other types of
recommendations in the mobile context.
Even though we are interested in the relative differences of effectiveness across the
various types of recommendations in a mobile context and we leverage the exogenous
variation in the recommendation mechanisms (see Section 5.3) while we also control
for numerous confounders, including perceived quality and marketing promotions, we
also assume, as a robustness check, that different variables are endogenous. Hence, we
employ instrumental variable techniques. Since instrumental variable estimates are
consistent, the large size of our data set becomes an important advantage [Angrist
and Krueger, 2001]. Tables 5.21 – 5.22 present the effect of the different types of
recommendations after accounting for potential endogeneity in prices, within-group
share, and recommendations. Table 5.21 uses as instruments rental prices and Haus-
man instruments (e.g., the average price of other venues in the same market and same
rating category), as well as the novel metric of alternative differentiation and isola-
tion based on the employed machine learning model of the user-generated reviews
using deep-learning techniques (see Section 5.4). Then, Table 5.22, in addition to
Table 5.17: Coefficient Estimates of Nested Logit Model (Sub-sample Analysis).

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.6105***
Recommendation  0.8946***
Trending recommendation  1.2147***  1.2114***
Quality recommendation  0.8579***  0.9305***
Event recommendation  0.4606***  0.5526***
Expert recommendation  1.0776***  1.1508***
Novel recommendation  0.1342***  0.2224***
Recommendation ranking  -0.0020***
Market-level fixed effects  Yes  Yes  Yes  Yes
Category-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Log-likelihood  -476628  -474380  -473651  -473619
Adjusted R2  0.452  0.459  0.462  0.462
p  0.0000  0.0000  0.0000  0.0000
N  336,953  336,953  336,953  336,953

Note: The additional alternative controls include the price, rating, number of reviews, sentiments of reviews, number of photos, whether the alternative is part of a chain, hours of operation of the specific alternative, marketing promotions, whether the venue serves alcohol, whether it offers delivery and takeout services, whether it accepts credit cards and reservations, whether it offers outdoor seating, Wi-Fi, parking space, whether it is accessible by wheelchair, whether it has TVs and music, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.18: Coefficient Estimates of Nested Logit Model with Alternative-level Fixed effects (Sub-sample Analysis).

Model 1  Model 2  Model 3  Model 4
Recommendation (Binary)  0.5410***
Recommendation  0.8594***
Trending recommendation  0.9697***  0.9660***
Quality recommendation  0.8506***  0.9251***
Event recommendation  0.5820***  0.6698***
Expert recommendation  0.9676***  1.0430***
Novel recommendation  -0.0373  0.0471
Recommendation ranking  -0.0020***
Alternative-level fixed effects  Yes  Yes  Yes  Yes
Additional alternative controls  Yes  Yes  Yes  Yes
Context controls  Yes  Yes  Yes  Yes
Time trend  Yes  Yes  Yes  Yes
Adjusted R2  0.439  0.442  0.443  0.445
p  0.0000  0.0000  0.0000  0.0000
N  336,953  336,953  336,953  336,953

Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.19: Coefficient Estimates of Logit Model with Random Coefficients.
                                    Model 1     Model 2     Model 3     Model 4
Recommendation (Binary)             0.5162***
Recommendation                                  0.7053***
Trending recommendation                                     1.4377***   1.4322***
Quality recommendation                                      0.5834***   0.6811***
Event recommendation                                        -0.0087     0.0866
Expert recommendation                                       0.7753***   0.8999***
Novel recommendation                                        0.2511***   0.3232***
Recommendation ranking                                                 -0.0025***
Rating                              0.0282***   0.0291***   0.0310***   0.0303***
Number of Reviews                   0.1490***   0.1558***   0.1514***   0.1513***
Sentiment of Reviews                0.0032*     0.0032*     0.0033*     0.0033*
Photos                              0.0541***   0.0527***   0.0550***   0.0552***
Promotions                          -0.0019**   -0.0018**   -0.0018**   -0.0018**

St. dev. Recommendation (Binary)    0.5177***
St. dev. Recommendation                         0.5085***
St. dev. Trending recommendation                            0.7401***   0.7370***
St. dev. Quality recommendation                             0.5599***   0.5576***
St. dev. Event recommendation                               1.2449***   1.2369***
St. dev. Expert recommendation                              1.2018***   1.1774***
St. dev. Novel recommendation                               0.3315***   0.3358***

Alternative-level effects           Yes         Yes         Yes         Yes
Additional alternative controls     Yes         Yes         Yes         Yes
Context controls                    Yes         Yes         Yes         Yes
Time trend                          Yes         Yes         Yes         Yes

Log-likelihood                      -835679     -839088     -836291     -836255
χ2                                  275,203     274,191     277,642     277,665
p                                   0.0000      0.0000      0.0000      0.0000
N                                   711,673     711,673     711,673     711,673
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.20: Coefficient Estimates of Nested Logit Model with Random Coefficients.
                                    Model 1     Model 2     Model 3     Model 4
Recommendation (Binary)             0.4868***
Recommendation                                  0.6979***
Trending recommendation                                     1.1585***   1.1538***
Quality recommendation                                      0.6077***   0.6958***
Event recommendation                                        0.1174      0.2027
Expert recommendation                                       0.7658***   0.8794***
Novel recommendation                                        0.0936      0.1896**
Recommendation ranking                                                 -0.0022***
Rating                              0.0264***   0.0261***   0.0275***   0.0269***
Number of Reviews                   0.0930***   0.0982***   0.0951***   0.0950***
Sentiment of Reviews                0.0046***   0.0047***   0.0048***   0.0048***
Photos                              0.0365***   0.0326***   0.0352***   0.0353***
Promotions                          0.0005      0.0007      0.0007      0.0007
Within-group share                  0.2295***   0.2316***   0.2287***   0.2287***

St. dev. Recommendation (Binary)    0.5183***
St. dev. Recommendation                         0.4662***
St. dev. Trending recommendation                            0.5801***   0.5772***
St. dev. Quality recommendation                             0.5418***   0.5418***
St. dev. Event recommendation                               1.1700**    1.1653**
St. dev. Expert recommendation                              1.1574**    1.1362**
St. dev. Novel recommendation                               0.2392***   0.2480***

Alternative-level effects           Yes         Yes         Yes         Yes
Additional alternative controls     Yes         Yes         Yes         Yes
Context controls                    Yes         Yes         Yes         Yes
Time trend                          Yes         Yes         Yes         Yes

Log-likelihood                      -808547     -811554     -809425     -809394
χ2                                  301,897     307,668     309,507     309,581
p                                   0.0000      0.0000      0.0000      0.0000
N                                   711,673     711,673     711,673     711,673
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
the metrics based on the deep learning model, includes variables used in generating
the recommendations. In particular, the instruments for the trending and traditional
quality recommendations respectively include the average and standard deviation of
the alternative differentiation and isolation for the specific time period, lags of the
standardized percentage change in the number of photos and positive ratings, as well
as the lag of the within-category standardized rating and number of photos. We
have tested the instruments’ validity using the Hansen-Sargan test [Hansen, 1982]
and the Stock-Yogo critical values [Stock and Yogo, 2005]. In addition, both models
include alternative-level fixed effects as well as time-varying controls for venue, cli-
mate, geospatial, and calendar attributes, even though not depicted in the following
table due to space restrictions. The results further corroborate our previous find-
ings regarding the overall recommendation effect as well as the impact of the various
recommendation types. The differences between the effects of the different recom-
mendation types are statistically significant and highlight the importance of
“in-the-moment” recommendations.
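As a concrete illustration of the overidentification check mentioned above, the sketch below computes the classical Sargan-Hansen N·R² statistic from second-stage residuals. It is a minimal, generic linear-IV approximation with hypothetical variable names and dimensions, not the exact estimation routine used for the nested logit models in this chapter.

```python
import numpy as np
from scipy.stats import chi2

def sargan_statistic(resid, Z):
    """Sargan-Hansen overidentification statistic: n * R^2 from regressing
    the second-stage residuals on the full instrument matrix Z (a constant,
    the exogenous regressors, and the excluded instruments). Under the null
    of valid instruments it is asymptotically chi-squared with degrees of
    freedom equal to the number of overidentifying restrictions."""
    n = len(resid)
    beta, *_ = np.linalg.lstsq(Z, resid, rcond=None)
    fitted = Z @ beta
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return n * r2

# Hypothetical usage: resid from a 2SLS fit, 8 instrument columns in Z,
# 2 endogenous recommendation variables -> 6 overidentifying restrictions.
# stat = sargan_statistic(resid, Z)
# p_value = chi2.sf(stat, df=6)
```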
5.7.1 Falsification Tests
One might argue that the previous set of models is simply
picking up spurious effects as a result of pure coincidence, a general increase in the
corresponding metrics, or other unobserved factors. To assess the possibility that the
aforementioned findings are a statistical artifact and the identified positive significant
effects were captured by chance or because of other confounding factors, we run dif-
ferent falsification tests (“placebo” studies) using the same models as above (in order
to maintain consistency) but randomly indicating which alternatives (i.e., random
alternative recommended) were recommended and when (i.e., random time period
of recommendation), respectively. The results of the falsification tests are shown in
Table 5.21: Coefficient Estimates of Nested Logit Model with Instrumental Variables.
                                  Model 1      Model 2      Model 3      Model 4
Recommendation (Binary)           0.6944***
Recommendation                                 0.9780***
Trending recommendation                                     1.5141***    1.5122***
Quality recommendation                                      0.9472***    1.0244***
Event recommendation                                        0.2828***    0.3819***
Expert recommendation                                       1.0504***    1.1277***
Novel recommendation                                        0.2800***    0.3754***
Recommendation ranking                                                  -0.0021***
Price                             -0.0436***   -0.0438***   -0.0407***   -0.0398***
Rating                            0.0037       0.0043*      0.0061**     0.0049*
Number of Reviews                 0.1569***    0.1718***    0.1606***    0.1597***
Sentiment of Reviews              0.0165***    0.0172***    0.0169***    0.0168***
Photos                            0.0587***    0.0552***    0.0553***    0.0560***
Promotions                        -0.0032***   -0.0033***   -0.0032***   -0.0032***
Within-group share                0.0969***    0.0892**     0.0957***    0.0945***

Market-level fixed effects        Yes          Yes          Yes          Yes
Category-level fixed effects      Yes          Yes          Yes          Yes
Additional alternative controls   Yes          Yes          Yes          Yes
Context controls                  Yes          Yes          Yes          Yes
Time trend                        Yes          Yes          Yes          Yes

Log-likelihood                    -829027      -827001      -824543      -824572
Adjusted R2                       0.540        0.542        0.546        0.546
p                                 0.0000       0.0000       0.0000       0.0000
N                                 711,673      711,673      711,673      711,673
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Table 5.22: Coefficient Estimates of Nested Logit Model with Instrumental Variables.
                          Coefficient    Robust      z        P>|z|    [95% Conf. Interval]
                                         Std. Err.
Trending recommendation   1.448042***    0.425770     3.40    0.001     0.613549    2.282535
Quality recommendation    0.863831***    0.168845     5.12    0.000     0.532902    1.194761
Event recommendation      0.350116***    0.075285     4.65    0.000     0.202560    0.497673
Expert recommendation     1.093406***    0.065746    16.63    0.000     0.964546    1.222267
Novel recommendation      0.273595***    0.033846     8.08    0.000     0.207259    0.339931
Price                     -0.016360***   0.001559   -10.49    0.000    -0.019410   -0.013300
Rating                    0.007253***    0.001887     3.84    0.000     0.003554    0.010952
Number of Reviews         0.157965***    0.003894    40.57    0.000     0.150333    0.165597
Sentiment of Reviews      0.020388***    0.000977    20.87    0.000     0.018473    0.022303
Photos                    0.052817***    0.003223    16.39    0.000     0.046500    0.059135
Promotions                -0.003190***   0.000359    -8.89    0.000    -0.003890   -0.002490
Within-group share        0.113758***    0.001504    75.64    0.000     0.110810    0.116706

Log-likelihood: -786170    Adjusted R2: 0.551    p: 0.0000    N: 680,347
Note: The additional alternative controls include the hours of operation of the specific alternative, for how many weeks the alternative has been operating, and the number of events in the specific alternative and time period. The context controls include day of the week and holiday effects, local temperature and precipitation levels, as well as the average geographic distance of the alternative. Market-level and category-level fixed effects are included as well. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Tables 5.23 and 5.24. We see that, under these checks, the corresponding effects are
not statistically significant, indicating that our previous findings are not a statistical
artifact of our specifications but instead reflect actual effects.
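For concreteness, the following is a minimal sketch of how such placebo treatments could be constructed; the DataFrame layout and column names (recommended, venue_id, week) are hypothetical and only illustrate the two randomizations described above, namely a random recommended alternative and a random recommendation time period.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def pseudo_recommendations(df, flag_col="recommended"):
    """Placebo A: randomly reassign which observations are flagged as
    recommended, keeping the overall share of recommendations fixed."""
    out = df.copy()
    out[flag_col] = rng.permutation(out[flag_col].to_numpy())
    return out

def pseudo_timing(df, flag_col="recommended", unit_col="venue_id"):
    """Placebo B: within each alternative, shuffle the time periods in which
    it was recommended, keeping its total number of recommended periods fixed."""
    out = df.copy()
    out[flag_col] = (
        out.groupby(unit_col)[flag_col]
           .transform(lambda s: rng.permutation(s.to_numpy()))
    )
    return out

# The same choice models are then re-estimated on the placebo data; the
# insignificant placebo coefficients (Tables 5.23 and 5.24) support the findings.
```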
Table 5.23: Falsification Check Employing Pseudo-recommendations.
                                     Model 1    Model 2    Model 3
(Pseudo) Recommendation (Binary)     -0.0026
(Pseudo) Recommendation                         0.0025
(Pseudo) Trending recommendation                           0.0003
(Pseudo) Quality recommendation                            -0.0007
(Pseudo) Event recommendation                              0.0034
(Pseudo) Expert recommendation                             0.0036
(Pseudo) Novel recommendation                              0.0039
Table 5.24: Falsification Check Employing Pseudo-timing of Recommendations.
                                     Model 1    Model 2    Model 3
(Pseudo) Recommendation (Binary)     0.0009
(Pseudo) Recommendation                         0.0025
(Pseudo) Trending recommendation                           0.0024
(Pseudo) Quality recommendation                            -0.0023
(Pseudo) Event recommendation                              0.0052
(Pseudo) Expert recommendation                             0.0088
(Pseudo) Novel recommendation                              0.0052
5.8 Discussion of Business Value of Recommendations
Apart from quantifying the impact of various types of mobile recommendations
on demand levels for individual items and identifying various moderating effects on
this impact, we theoretically integrate our research questions and the corresponding
findings into the current literature on RSes by extending a current conceptual model
of the effects of RS use, RS characteristics, and other factors on consumer decision-
making that was first articulated by Xiao and Benbasat [2007]. Figure 5.5 depicts the
updated conceptual model incorporating our research questions and findings regarding
the decision outcome of consumers.
The discussed impact of RSes on consumers’ decision-making and choices can be
supported through the lenses of various IS theories. Recommendations serve to poten-
tially reduce the effort required [Shugan, 1980] as well as the uncertainty surrounding
a decision, and thus both reduce the difficulty of making a choice and increase the
confidence associated with it [Fitzsimons and Lehmann, 2004]. Another utilitarian
explanation of the aforementioned finding that recommendations have a positive and
significant effect on the demand levels for the recommended products is that con-
sumers might perceive the recommendations as endorsements of the candidate items
from the mobile RS or as another reputation dimension (in addition to the item rating
and consumer reviews). Apart from arguments based on utility theory, we can also
find support for our findings in the theories of human information processing. For
instance, Shafir et al. [1993b] maintain that people evaluate alternatives by compar-
ing them separately on distinct dimensions and that relationships among alternatives
may be perceived to be more compelling reasons or arguments for choice than de-
riving overall values for each alternative and choosing the alternative with the best
value. Because such differences (e.g., whether an alternative is recommended, rank-
ing in recommendation list) can be perceived with little effort, relationships among
alternatives may be used to make choices even in situations with simple alternatives
and even if they do not provide good justifications [Bettman et al., 1998]. Another
possible explanation is that the recommended alternatives maximize the ease of jus-
tifying consumers’ decisions. This explanation is becoming even more important in
the case of collective decisions (e.g., restaurant selection) and group recommenders;
Shafir et al. [1993a] have demonstrated that decision makers often construct reasons in
order to justify a decision to themselves (i.e., increase their confidence in the decision)
Figure 5.5: Conceptual model of effects of recommender systems.
and/or justify (explain) their decision to others. Alternatively, another explanation
is that the recommended alternatives are simply becoming more salient and hence
are more frequently selected by the consumers either deliberately (e.g., lower search
costs [Wilde, 1980]) or inadvertently since they capture their attention. Furthermore,
the technical differences between the trending and the traditional recommendations
contribute to the differences in effectiveness. In particular, the trending recommen-
dations are characterized by higher levels of temporal diversity and hence they are more
likely to recommend alternatives that are not already included in the consideration
set of a user and thus to have higher effectiveness due to differences in awareness levels
[Bodapati, 2008]. Finally, the significant difference in effectiveness among the various
types of recommendations and, especially, the effect of “in-the-moment” recommen-
dations can also be explained based on the IS success model [DeLone and McLean,
1992] and the components of timeliness and uniqueness of information which affect
user satisfaction through the construct of information quality. Similarly, the dif-
ferences of event recommendations and recommendations from “experts” relative to the
other types of examined recommendations can be explained based on the relevance
of information.
The findings of this study, apart from a theoretical contribution, also have impor-
tant managerial implications. Based on our results, we find that recommendations
in the mobile context have a positive effect on individual demand levels for the rec-
ommended alternatives. This is an important and timely finding for managers as
mobile recommender systems have currently been adopted by only 13% of the com-
panies across the globe, even though RSes are the most widespread personalization
technique for traditional channels [Econsultancy.com, 2013]. Apart from highlight-
ing in this study the economic impact of recommendations in the mobile context
and their effects on consumers’ decision-making, we also illustrate how managers and
practitioners can leverage observational data to evaluate different recommender sys-
tem algorithms and estimate their effects. This is especially important nowadays as
more than half of the companies (57%) do not test the performance of their own
recommender systems and 75% of them do not quantify the improvement in conver-
sion rates resulting from their RSes, even though they realize the importance of RSes
[Econsultancy.com, 2013]. Moreover, further disentangling the effects and economic
impact of various types of recommendations, we find in the mobile context that
recommendations that provide “in-the-moment” content to the users have a much
stronger effect compared to traditional recommendations based on historical trends
and data. The importance and economic significance of these findings is further am-
plified by the prediction that in the next years the number of mobile shoppers will
reach 213.7 million with 87% of smartphone users shopping online using their mobile
device. Another actionable finding of significant importance for businesses and man-
agers is that novel alternatives accrue greater benefits from recommendations when
those recommendations are not solely based on the characteristic of novelty but also
take into consideration specific item attributes, such as the quality of the item. Sim-
ilarly, the remaining moderating effects we examined in this study (i.e., popularity,
price, marketing promotions, and context) are also important for businesses and rec-
ommender system practitioners as they can guide the development and adoption of
future mobile RS algorithms and highlight the importance of various product design
decisions that managers should consider in order to improve the performance of their
RSes. The managerial importance of these findings and the corresponding detailed
understanding of the effectiveness of recommendations is further highlighted by the
fact that 78% of the companies consider lack of knowledge as a barrier to adopting or
improving recommendation systems in their organization [Econsultancy.com, 2013].
Finally, another important finding with significant managerial implications is that
the effect of traditional recommendations is much stronger for more popular prod-
ucts, contributing to a rich-get-richer effect, whereas the effect of “in-the-moment”
recommendations is more stable across alternatives, allowing businesses to effectively
leverage the benefits of long-tail products.
In addition to the theoretical and managerial contributions, our study also con-
tributes to the stream of literature that integrates machine learning and data mining
approaches with econometric techniques. In particular, in this study, we employ deep
learning techniques to introduce new machine learning-based econometric instruments
that extend a popular family of instruments from the observed product characteristics
space to the latent space. Hence, we contribute to the extant literature in Informa-
tion Systems that employs text-mining, sentiment-analysis, and other data mining
methods with user-generated content in empirical econometric studies (e.g., [Archak
et al., 2011; Ghose and Ipeirotis, 2011; Ghose et al., 2012b; Goes et al., 2014; Goh
et al., 2013; Netzer et al., 2012; Tirunillai and Tellis, 2014]).
Future research, apart from estimating the impact of recommender systems across
various settings in the mobile context, can also apply the proposed techniques in order
to generate contextual recommendation lists based on consumers’ surplus, rather than
the predicted numerical ratings for the various alternatives or the probability of accep-
tance of each recommendation, as proposed by Ghose et al. [2012b] in the context of
web-based travel search engines. Future research should also further examine the eco-
nomic effectiveness of recommendations in different application domains investigating
a wider range of recommended items. Besides, future research should also compare
the effectiveness of mobile recommendations vis-a-vis traditional recommendations in
other channels (e.g., web). In addition, the effect of marketing promotions in the
presence of recommendations is another topic of significant academic and managerial
interest that should be thoroughly examined in future research. Finally, disentan-
gling the effect of recommendations on repeated visits of existing customers from the
corresponding effects for new users and alternatives is another promising topic for
future research.
One of the limitations of this study is that we focus only on the effect of recom-
mendations on the demand for the candidate items (e.g., restaurants). Nevertheless,
the explicit user satisfaction should also be considered and precisely measured. How-
ever, explicit ratings from individual users were not available for privacy reasons.
Another limitation of our data set is that it contains observations corresponding to a
specific domain in the mobile channel, rather than multiple channels and different
product categories. Despite these limitations, our contribution may be widely rele-
vant to managers while also seeding a number of new directions for future research.
Our hope is that these limitations are viewed not as a liability but as a path towards
future research that extends our research question while strengthening the relevant
theory and empirical evidence.
CHAPTER VI
Conclusions, Limitations, and Future Directions
The work presented in this thesis will hopefully help the recommender systems
(RSes) field move further beyond the perspective of rating prediction accuracy by fo-
cusing on developing methods for avoiding “filter bubbles”. Following this stream
of research that both contributes to existing helpful but less explored paradigms for
recommender systems and proposes new valuable approaches and perspectives, we
discussed the studies we have conducted towards alleviating the problems of over-
specialization and concentration biases in recommender systems by delivering to the
users non-obvious, diverse, unexpected, and, at the same time, high quality personal-
ized recommendations, which could potentially expand users’ exposure and choices.
We focus this research program on the front-end of the “design - solution - perceptions
- intentions - behavior” causal chain, moving the focus from ever more accurate rating
predictions in the context of recommender systems toward offering a
holistic experience to the users. The conducted prescriptive studies are supplemented
with descriptive (explanatory) user behavior studies that examine the effects of the
proposed type of recommendations on consumer decision-making. Finally, we theoret-
ically integrate our findings into the current literature on RSes by extending a current
conceptual model of the effects of RS use, RS characteristics, and other factors on con-
sumer decision-making. The discussed impact of RSes on consumer decision-making
and choices can be supported through the lenses of various IS theories, including theo-
ries of human information processing as well as theories of satisfaction. In particular,
the studies discussed in Chapters III and IV move our focus from even more accurate
rating predictions and aim at offering a holistic experience to the users by avoiding
the over-specialization and concentration of recommendations and providing the users
with non-obvious but high quality personalized recommendations that fairly match
their interests and that they will remarkably like. Then, Chapter V assesses the business
value of various real-world types of recommendations and evaluates how they con-
tribute to the over-specialization and concentration biases of modern recommender
systems.
In detail, Chapter II provides a brief survey of the related work on over-specialization
and concentration biases of recommendations as well as novelty, serendipity, diver-
sity, and unexpectedness. It also presents the related work assessing the value of
recommendations for businesses and, finally, provides an overview of the main Infor-
mation Systems theories regarding the impact of RSes on consumer decision-making.
Then, Chapter III formulates the classical neighborhood-based collaborative filter-
ing method as an ensemble method, thus, allowing us to show the suboptimality of
the k-NN (k Nearest Neighbors) approach in terms of not only over-specialization
and concentration but also predictive accuracy. Besides, focusing on neighborhood
selection, it proposes a novel optimized neighborhood-based method (k-BN; k Better
Neighbors) and a new probabilistic neighborhood-based method (k-PN; k Probabilis-
tic Neighbors) as improvements of the standard k-NN approach alleviating some of
the most common problems of collaborative filtering recommender systems, based on
classical metrics of dispersion and diversity as well as some newly proposed metrics.
This performance improvement is in accordance with ensemble learning theory and
the phenomenon of “hubness” in recommender systems. Then, Chapter IV proposes
a concept of unexpectedness in recommender systems illustrating the differences from
the related but different terms of novelty, serendipity, and diversity. In addition, it
fully operationalizes unexpectedness by suggesting various mechanisms for specifying
the expectations of the users and proposing a recommendation method for providing
the users with non-obvious but high quality personalized recommendations that fairly
match their interests based on specific metrics of unexpectedness. Finally, Chapter
V employs econometric modeling and machine learning techniques in order to esti-
mate the impact of recommendations in the mobile context on consumers’ utility and
real-world demand. This chapter delves further into the differences in effectiveness of
recommendations and examines this heterogeneity through the moderating effects
of various item attributes and contextual factors in order to gain a more detailed
understanding of the effectiveness of the various types of recommendations. We also
validate the robustness of our findings using multiple econometric specifications as
well as instrumental variable methods with instruments based on a machine-learning
model employing deep-learning techniques.
In summary, the main contributions of the studies presented in this thesis are:
• We formulated the classical neighborhood-based collaborative filtering method
as an ensemble method, thus, allowing us to show the potential suboptimality
of the k-NN approach in terms of predictive accuracy.
• We proposed a new optimized neighborhood-based method (k-BN; k Better
Neighbors) as an improvement of the standard k-NN approach.
• We proposed a new probabilistic neighborhood-based method (k-PN; k Proba-
bilistic Neighbors) as an improvement of the standard k-NN approach.
• We empirically showed that the proposed methods (i.e., k-BN and k-PN) out-
perform, by a wide margin, the classical collaborative filtering algorithm and
practically illustrated the suboptimality of k-NN in addition to providing a theo-
retical justification of this empirical observation. Moreover, we showed that the
proposed methods alleviate the common problems of over-specialization and
concentration biases of recommendations in terms of various popular metrics
and a newly proposed metric that measures the diversity reinforcement of rec-
ommendations. Besides, we identified a particular implementation of the k-PN
method that performs consistently well across various experimental settings.
We also illustrated that most of the time the k-BN method outperforms
employed baselines as well as the k-PN method.
• We proposed a new formal definition of unexpectedness in recommender sys-
tems and conducted a survey of related concepts, illustrating the differences from
these related terms. We also formalized this concept of unexpectedness and
fully operationalized it by suggesting various mechanisms for specifying the ex-
pectations of the users.
• We proposed a method for providing the users with non-obvious but high qual-
ity recommendations that fairly match their interests and suggested specific
metrics to measure the unexpectedness of recommendation lists. Finally, we
showed that the proposed method for unexpected recommendations can en-
hance unexpectedness while maintaining the same or higher levels of accuracy
of recommendations.
• We estimated the effectiveness and economic impact of various types of real-
world recommendations in a mobile setting based on a structural econometric
method following discrete-choice models of product demand.
• We introduced new machine learning-based econometric instruments that ex-
tend a popular family of instruments to the latent space and facilitate the usage
of instrumental variable techniques for causal inference in the presence of endo-
geneity in the field of RSes.
• We discovered significant new findings and moderating effects that extend our
current knowledge regarding the heterogeneous impact of recommender systems
and reconcile contradictory prior findings in the related literature.
One of the main advantages of the proposed methodology is the use of publicly
available data sets. Employing such data sources allows the direct comparison with
the existing state-of-the-art algorithms in the field of recommender systems. Such
a direct comparison facilitates the incremental contribution to the literature in the
fields of recommender systems, data mining, and machine learning while avoiding any
further fragmentation. At the same time, this strategy makes our research studies
reproducible and replicable and better allows other researchers and practitioners to
further test our findings in other scientific fields, recommendation domains, and appli-
cations as well as efficiently compare the proposed approaches to any algorithms that
will be proposed in the near future. However, the presented and proposed studies also
have some limitations. The main limitation of these studies is the absence of online
evaluation with real users (by means of an A/B test, for instance). In the past, it
has been observed that the results from online evaluation may contradict the results
of offline evaluation because of the possible incompleteness of the employed data sets
[Beel et al., 2013]. Nevertheless, we believe that the employed methodology and the
corresponding evaluation have been appropriately designed and hence the findings will
hold in a carefully designed online evaluation. We should also note that conducting
only online studies in the absence of offline evaluation would not allow us to report
comparable results for the same methods “under the same conditions.”
In addition to the theoretical foundation and the empirical contribution of this
program of research, we theoretically integrate our research questions and the corre-
sponding findings into the current literature on RSes by extending a current concep-
tual model of the effects of RS use, RS characteristics, and other factors on consumer
decision-making that was first articulated by Xiao and Benbasat [2007] and then fur-
ther refined by the authors in [Xiao and Benbasat, 2014]. The discussed impact of
RSes on consumers’ decision-making and choices can be supported through the lenses
of various IS theories, including theories of human information processing as well as
theories of satisfaction [Adamopoulos and Tuzhilin, 2015; Adamopoulos et al., 2016].
The presented research studies also have important managerial implications. Ad-
hering to our main research objective, we work towards the direction of providing
more useful recommendations for both users and businesses. Avoiding obvious and
expected recommendations while maintaining high predictive accuracy levels, we can
alleviate the common problems of over-specialization and concentration biases that
often characterize the collaborative filtering algorithms. Building such a recommender
system, we have the potential to further increase user satisfaction and engagement
and offer a superior experience to the users [Baumol and Ide, 1956; Kahn et al., 1991].
In addition, unexpectedness and its related notions can improve the welfare of con-
sumers by allowing them to locate and buy products that otherwise they would not
have purchased. Introducing unexpectedness in recommender systems can vastly re-
duce customers’ search cost by recommending items that the user would rate highly
but would be quite unlikely to discover on her/his own, or would have to spend a large
amount of time to do so. As a result, the inefficiencies caused by buyer
search costs are reduced, while increasing the ability of markets to optimally allocate
productive resources [Bakos, 1997].
Furthermore, the generated recommendations should be useful not only for the
users but for the businesses as well. Based on our analysis, we find a significant
rise in the demand for recommended items. Examining the impact of different types
of recommendations, our findings highlight the importance of inter-temporal diversity
as well as non-obviousness and unexpectedness. Apart from the direct effect of in-
creased sales and willingness-to-pay, the proposed approaches also exhibit a potential
positive economic impact based on the enhanced customer loyalty leading to lasting
and valuable relationships [Gorgoglione et al., 2011; Zhang et al., 2011] through offer-
ing recommendations from a wider range of items that are more useful to the users, enabling
them to find relevant items that are harder to discover, and making the users familiar
with the whole product catalog. Apart from the significant gains in producer welfare
from the additional sales [Brynjolfsson et al., 2003], businesses might also leverage rev-
enues from market niches [Fleder and Hosanagar, 2009]. Thus, there is a potential
positive economic impact based on the effect of recommending items from the long
tail and not focusing mostly on bestsellers that usually exhibit higher marginal costs
and lower profit margins because of acquisition costs and licenses as well as increased
competition. Another actionable finding of importance for businesses and managers
is that novel alternatives accrue greater benefits from recommendations when those
recommendations are not solely based on the characteristic of novelty but also take
into consideration additional attributes such as the quality of the item. Similarly,
another important finding for businesses is that the effect of traditional recommen-
dations is much stronger for more popular products, contributing to a rich-get-richer
effect, whereas the effect of unexpected, diverse, and non-obvious recommendations
is more stable across alternatives, allowing businesses to effectively leverage the bene-
fits of long-tail products. Nevertheless, the remaining moderating effects we examine
are also important for businesses and recommender system practitioners as they can
guide the development of future RS algorithms and highlight the importance of vari-
ous product design decisions [Adamopoulos and Tuzhilin, 2015; Adamopoulos et al.,
2016].
This thesis may facilitate future research to integrate the proposed approaches
with related existing techniques in the fields of web search and data mining. More-
over, we would like to implement and evaluate the proposed approaches conducting
live experiments in on-line retail settings [Todri and Adamopoulos, 2014; Adamopou-
los and Todri, 2015a], taking into consideration various user attributes and charac-
teristics [Adamopoulos and Todri, 2015b; Adamopoulos et al., 2015a], as well as in
platforms for massive open online courses [Adamopoulos, 2013b]. Besides, we would
like to conduct a series of live controlled experiments with human subjects in order
to study the on-line user behavior, examine and actively adjust the trade-off between
exploration (e.g., unexpectedness, serendipity, diversity, etc.) and exploitation (e.g.,
accuracy) of recommender systems, and further evaluate the proposed perspectives
in a user-centric framework for top-N recommendations.
Appendices
Appendix A
Measuring the Concentration Reinforcement Bias
of Recommender Systems
Several measures have been employed in prior research in order to measure the con-
centration reinforcement and popularity bias of RSes as well as other similar concepts.
These metrics include catalog coverage, aggregate diversity, and the Gini coefficient.
In particular, catalog coverage measures the percentage of items for which the RS is
able to make predictions [Herlocker et al., 2004] while aggregate diversity uses the
total number of distinct items among the top-N recommendation lists across all users
to measure the absolute long-tail diversity of recommendations [Adomavicius and
Kwon, 2012]. The Gini coefficient [Gini, 1921] is used to measure the distributional
dispersion of the number of times each item is recommended across all users; similar
are the Hoover (Robin Hood) index and the Lorenz curve [Hoover, 1985; Lorenz, 1905].
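As a point of reference, the following is a minimal sketch of two of these standard measures, catalog coverage (aggregate diversity over the generated top-N lists) and the Gini coefficient of recommendation frequencies; the input formats are illustrative assumptions.

```python
import numpy as np

def aggregate_diversity(top_n_lists, catalog_size):
    """Catalog coverage / aggregate diversity: fraction of distinct catalog
    items that appear in at least one user's top-N recommendation list."""
    distinct = set(item for rec_list in top_n_lists for item in rec_list)
    return len(distinct) / catalog_size

def gini_coefficient(rec_counts):
    """Gini coefficient of the distribution of how many times each item is
    recommended across all users (0 corresponds to a uniform distribution)."""
    x = np.sort(np.asarray(rec_counts, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Illustrative example: three users, top-2 lists over a 6-item catalog.
lists = [["a", "b"], ["a", "c"], ["a", "b"]]
counts = [3, 2, 1, 0, 0, 0]           # times each catalog item was recommended
print(aggregate_diversity(lists, 6))   # 3 distinct items / 6 = 0.5
print(gini_coefficient(counts))        # ~0.61
```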
However, these metrics do not take into consideration the prior popularity of
candidate items and, hence, do not provide sufficient evidence on whether the prior
concentration of popularity is reinforced or alleviated by the RS. Moving towards this
direction, Adamopoulos and Tuzhilin [2013b, 2014b] employ a popularity reinforce-
ment measure M to assess whether a RS follows or changes the prior popularity of
items when recommendations are generated. To evaluate the concentration reinforce-
ment bias of recommendations, Adamopoulos and Tuzhilin [2013b, 2014b] measure
the proportion of items that changed from “long-tail” in terms of prior sales (or num-
ber of positive ratings) to popular in terms of recommendation frequency as follows:
$$ M = 1 - \sum_{i=1}^{K} \pi_i \, \rho_{ii}, $$
where the vector π denotes the initial distribution of each of the K popularity cat-
egories and ρii the probability of staying in category i, given that i was the initial
category. In [Adamopoulos and Tuzhilin, 2013b, 2014b], the popularity categories,
labeled as “head” and “tail”, are based on the Pareto principle and hence the “head”
category contains the top 20% of items (in terms of positive ratings or recommen-
dation frequency, respectively) and the “tail” category the remaining 80%. Based
on this metric, a score of zero denotes no change (i.e., the number of times an item
is recommended is proportional to the number of ratings it has received) whereas a
score of one denotes that the RS recommends mainly the long-tail items (i.e., the
number of times an item is recommended is proportional to the inverse of the num-
ber of ratings it has received). However, this metric of concentration reinforcement
(popularity) bias entails an arbitrary selection of popularity categories. Besides, all
items included in the same popularity category are contributing equally to this metric,
despite any differences in popularity.
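A minimal sketch of one plausible reading of this popularity reinforcement measure under the Pareto-based two-category scheme is given below, assuming that π is taken as the share of items initially falling in each category and that ρ_ii is estimated as the fraction of those items that remain in the same category when popularity is re-measured by recommendation frequency.

```python
import numpy as np

def popularity_reinforcement(prior_popularity, rec_frequency, head_share=0.2):
    """M = 1 - sum_i pi_i * rho_ii over K = 2 categories: "head" = top 20%
    of items by prior popularity, "tail" = remaining 80%. pi_i is the share
    of items initially in category i; rho_ii is the probability that an item
    stays in category i when popularity is re-measured by how often it is
    recommended."""
    prior = np.asarray(prior_popularity)
    rec = np.asarray(rec_frequency)
    n = len(prior)
    k_head = max(1, int(round(head_share * n)))
    head_prior = set(np.argsort(-prior)[:k_head])   # head by prior ratings/sales
    head_rec = set(np.argsort(-rec)[:k_head])       # head by recommendation counts
    m = 0.0
    for is_head, members in ((True, head_prior),
                             (False, set(range(n)) - head_prior)):
        pi_i = len(members) / n
        stayed = sum(1 for i in members if (i in head_rec) == is_head)
        m += pi_i * stayed / len(members)
    return 1.0 - m
```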
To precisely measure the concentration reinforcement (popularity) bias of RSes
and alleviate the problems of the aforementioned metrics, in [Adamopoulos et al.,
2015b] we propose a new metric as follows:
$$ CI@N = \sum_{i \in I} \left[ \frac{1}{2} \frac{s(i)}{\sum_{j \in I} s(j)} \ln\left( \frac{(s(i)+1)/(\sum_{j \in I} s(j)+1)}{(r_N(i)+1)/(N \cdot |U| + |I|)} \right) + \frac{1}{2} \frac{r_N(i)}{N \cdot |U|} \ln\left( \frac{(r_N(i)+1)/(N \cdot |U| + |I|)}{(s(i)+1)/(\sum_{j \in I} s(j)+1)} \right) \right], $$
where s(i) is the prior popularity of item i (i.e., the number of positive ratings for
item i in the training set or correspondingly the number of prior sales of item i), rN(i)
is the number of times item i is included in the generated top-N recommendation
lists, and U and I are the sets of users and items, respectively.1 In essence, following
the notion of Jensen-Shannon divergence in probability theory and statistics, the
proposed metric captures the distributional divergence between the popularity of
each item in terms of prior sales (or number of positive ratings) and the number of
times each item is recommended across all users. Based on this metric, a score of zero
denotes no change (i.e., the number of times an item is recommended is proportional
to its prior popularity) whereas a (more) positive score denotes that the generated
recommendations deviate (more) from the prior popularity (i.e., sales or positive
ratings) of items.
In order to measure whether the deviation of recommendations from the distribu-
tion of prior sales (or positive ratings) promotes long-tail rather than popular items,
we also propose a measure of “long-tail enforcement” as follows:
$$ LTI_{\lambda}@N = \frac{1}{|I|} \sum_{i \in I} \left[ \lambda \left( 1 - \frac{s(i)}{\sum_{j \in I} s(j)} \right) \ln\left( \frac{(r_N(i)+1)/(N \cdot |U| + |I|)}{(s(i)+1)/(\sum_{j \in I} s(j)+1)} \right) + (1-\lambda) \frac{s(i)}{\sum_{j \in I} s(j)} \ln\left( \frac{(s(i)+1)/(\sum_{j \in I} s(j)+1)}{(r_N(i)+1)/(N \cdot |U| + |I|)} \right) \right], $$
^1 Another smoothed version of the proposed metric is $CI@N = \sum_{i \in I} \frac{1}{2} s_{\%}(i) \ln\left( \frac{s_{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r^N_{\%}(i)} \right) + \frac{1}{2} r^N_{\%}(i) \ln\left( \frac{r^N_{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r^N_{\%}(i)} \right)$, where $s_{\%}(i) = \frac{s(i)}{\sum_{j \in I} s(j)}$ and $r^N_{\%}(i) = \frac{r_N(i)}{N \cdot |U|}$.
where λ ∈ (0, 1) controls which items are considered long-tail (i.e., the percentile of
popularity below which a RS should increase the frequency of recommendation of an
item). In essence, the proposed metric rewards a RS for increasing the frequency
of recommendations of long-tail items while penalizing for frequently recommending
already popular items.
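Both metrics can be computed directly from item-level counts, as in the following sketch; the inputs are arrays of prior popularity s(i) and recommendation counts r_N(i), the smoothing constants follow the formulas above, and the default λ = 0.8 is only illustrative.

```python
import numpy as np

def concentration_index(s, r, N, num_users):
    """CI@N: symmetrized, smoothed divergence between the distribution of prior
    popularity s(i) and the distribution of recommendation counts r_N(i);
    larger values mean the recommendations deviate more from prior popularity."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    num_items = len(s)
    s_pct = s / s.sum()
    r_pct = r / (N * num_users)
    s_sm = (s + 1) / (s.sum() + 1)                 # smoothed prior-popularity share
    r_sm = (r + 1) / (N * num_users + num_items)   # smoothed recommendation share
    return np.sum(0.5 * s_pct * np.log(s_sm / r_sm)
                  + 0.5 * r_pct * np.log(r_sm / s_sm))

def long_tail_index(s, r, N, num_users, lam=0.8):
    """LTI_lambda@N: rewards recommending long-tail items more often than their
    prior popularity implies and penalizes over-recommending popular items."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    num_items = len(s)
    s_pct = s / s.sum()
    s_sm = (s + 1) / (s.sum() + 1)
    r_sm = (r + 1) / (N * num_users + num_items)
    return np.mean(lam * (1 - s_pct) * np.log(r_sm / s_sm)
                   + (1 - lam) * s_pct * np.log(s_sm / r_sm))
```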
A.1 Experimental Results
To empirically illustrate the usefulness of the proposed metrics, we conduct a
large number of experiments comparing various algorithms across different perfor-
mance measures. The data sets we used are the MovieLens 100k (ML-100k), 1M
(ML-1m), and “latest-small” (ML-ls), and the FilmTrust (FT). The recommendations
were produced using the algorithms of association rules (AR), item-based collabora-
tive filtering (CF) nearest neighbors (ItemKNN), user-based CF nearest neighbors
(UserKNN), CF ensemble for ranking (RankSGD) [Jahrer and Toscher, 2012], list-
wise learning to rank with matrix factorization (LRMF) [Shi et al., 2010], Bayesian
personalized ranking (BPR) [Rendle et al., 2009], and BPR for non-uniformly sampled
items (WBPR) [Gantner et al., 2012] implemented in [Guo et al., 2015].
Figure A.1 illustrates the results of the comparative analysis of the different al-
gorithms across various metrics. In particular, Fig. A.1 shows the relative ranking in
performance for each algorithm based on popular metrics of predictive accuracy and
dispersion as well as the newly proposed metrics; green (red) squares indicate that
the specific algorithm achieved the best (worst) relative performance among all the
algorithms for the corresponding dataset and metric.2
Based on the results, we can see that the proposed metrics capture different per-
^2 We have reversed the scale of the Gini coefficient for easier interpretation of the results (i.e., the green color corresponds to the most uniformly distributed recommendations).
formance dimensions of an algorithm compared to the relevant metrics of Gini coef-
ficient and aggregate diversity. Comparing the performance based on the proposed
concentration bias metric (CI@N) with the metric of Gini coefficient, we see that
even though on aggregate an algorithm might distribute the number of times each
item is recommended more equally than another algorithm, it might still achieve this
by deviating less from the prior popularity (i.e., number of sales or positive ratings)
of each item separately (e.g., green color for Gini coefficient and red color for concen-
tration reinforcement). Nevertheless, the differences among the LTIλ performance
and the other metrics (e.g., aggregate diversity) indicate that even though some al-
gorithms might recommend fewer (more) items than others or distribute how many
times each item is recommended less (more) equally among the recommended items,
they might achieve this by frequently recommending more (fewer) long-tail items
rather than more (fewer) popular items (e.g., red color for Gini coefficient and green
color for “long-tail enforcement”). Hence, the two proposed metrics should be used
in combination in order to evaluate i) how much the recommendations of a RS algo-
rithm deviate from the prior popularity of items and ii) whether this deviation occurs
by promoting long-tail rather than already popular items.
Figure A.1: Performance (ranking) of various RS algorithms.
Appendix B
Weighted Percentile Methods in Collaborative
Filtering Systems
Under a definition of recommendation opportunity as how much a user could re-
alistically like an item, we are looking for a high percentile (e.g., 80 percent) of the
conditional distribution of the rating for the specific target user and candidate item,
given all the information we have about them. Utilizing the high percentiles, we aim
at recommending items that the users will remarkably like. One of the challenges
here is to get a good estimate of the conditional rating percentile for each user and
item from the available data. In the next section, we illustrate the practical imple-
mentation of the proposed approach in the context of neighborhood models.
User-based neighborhood recommendation methods predict the rating ru,i of user
u for item i using the ratings given to i by users most similar to u, called nearest
neighbors and denoted by Ni(u). Taking into account the fact that the neighbors can
have different levels of similarity, wu,v, and considering the k users v with the highest
similarity to u (i.e., the standard user-based k-NN collaborative filtering approach),
the predicted rating is:
$$ r_{u,i} = \frac{\sum_{v \in N_i(u)} w_{u,v} \, r_{v,i}}{\sum_{v \in N_i(u)} |w_{u,v}|} $$
However, the ratings given to item i by the nearest neighbors of user u can be
combined into a single estimation using various combining (or aggregating) functions
[Adomavicius and Tuzhilin, 2005].
In [Adamopoulos and Tuzhilin, 2013c], we propose to use higher weighted per-
centiles as a combining function. Such a high percentile p (e.g., 70th, . . . , 90th) of the
conditional distribution of the user’s rating, given all the information that we have
available, characterizes how much the target user u could realistically like the candi-
date item i. Intuitively, using high percentiles is analogous to “shifting the needle”
in our rating combining function from the middle of the rating distribution, as is the
case with the weighted average, towards the tail on the right side of the distribution
targeting recommendations that the users will like better. Formally, the percentile,
denoted by rpu,i, is defined such that the probability that user u would rate item i with
a rating of rpu,i or less is p%. Note that both low and high ratings contribute to the
estimation since they affect the rank of the values and, thus, the percentile quantity
of interest.
In a typical k-NN collaborative filtering model, the information that we have
available in order to estimate an unknown rating, and respectively the quantity rpu,i,
is the neighbors of user u, denoted by Ni(u), the similarity levels of these neighbors
wNi(u) := (wu,v : v ∈ Ni(u)), and the corresponding ratings rN (u),i := (rv,i : v ∈
Ni(u)). As an example of the proposed method and its differences from the clas-
sical approaches, consider the neighborhood N (u) of size 4 with similarity weights
wN (u) = (0.2, 0.4, 0.3, 0.1) and items x and y with ratings rN (u),x = (2, 3, 3, 4) and
ALGORITHM 5: k-NN Recommendation Algorithm
Input: User-Item Rating matrix R
Output: Recommendation lists of size l
k: Number of users in the neighborhood of user u, N_i(u)

for each user u do
    Find the k users most similar to user u, N_i(u);
    for each item i do
        Combine ratings given to item i by neighbors N_i(u);
    end
    Recommend to user u the top-l items having the highest predicted rating r_{u,i};
end
ALGORITHM 6: Weighted Percentile Estimation Algorithm
Input: Values v1, ..., vn, Weights w1, ..., wn, and p percentile to be estimated
Output: p-th weighted percentile of ordered values v1, ..., vn

Order values v1, ..., vn from least to greatest;
Rearrange weights w1, ..., wn based on ordered values;
Calculate the percent rank for p based on weights w1, ..., wn;
Use linear interpolation between the two nearest ranks;
rN (u),y = (2, 2, 4, 4), respectively. Using the standard combining function, item x
would be recommended. However, using, for instance, the weighted 80th percentile
of the variable rN (u),i, item y would be recommended since the specific percentile for
item y, denoted by rp=80u,y , corresponds to a higher rating than item x and, thus, there
is high potential that user u could realistically like item y more than x; equivalently,
the probability of user u assigning a rating greater or equal to 4 is higher for item y
than x.
Algorithm 5 summarizes the user-based k-nearest neighbors (k-NN) collaborative
filtering approach with a general combining function and Algorithm 6 shows a pro-
cedure to estimate a weighted percentile rpu,i (i.e., the proposed combining function),
where the values rN (u),i are the ratings given to candidate item i by neighbors Ni(u),
the k users most similar to target user u, and the weights wNi(u) are the corresponding
similarity levels of neighbors to user u.
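A minimal sketch of the two combining functions, reproducing the worked example above, is shown below; the midpoint-rank interpolation convention is an assumption, as the exact rule used by the percentile routine in the experiments may differ slightly.

```python
import numpy as np

def weighted_average(values, weights):
    """Standard k-NN combining function: similarity-weighted mean."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(values, w) / np.abs(w).sum())

def weighted_percentile(values, weights, p):
    """p-th weighted percentile (p in [0, 100]) with linear interpolation
    between the two nearest weighted ranks (midpoint convention)."""
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(v)
    v, w = v[order], w[order]
    cum = np.cumsum(w)
    ranks = (cum - 0.5 * w) / cum[-1]    # percent rank of each ordered value
    return float(np.interp(p / 100.0, ranks, v))

# Worked example from the text: neighborhood of size 4.
weights = [0.2, 0.4, 0.3, 0.1]
x = [2, 3, 3, 4]
y = [2, 2, 4, 4]
print(weighted_average(x, weights), weighted_average(y, weights))                # 2.9, 2.8  -> x wins
print(weighted_percentile(x, weights, 80), weighted_percentile(y, weights, 80))  # 3.25, 4.0 -> y wins
```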
B.1 Experimental Settings
To empirically validate the proposed method and evaluate the generated recom-
mendations, we conduct a large number of experiments on “real-world” data sets and
compare our results to the k-NN CF approach, which has been found to perform well
also in terms of other performance measures besides the classical accuracy metrics
[Burke, 2002; Adamopoulos and Tuzhilin, 2011, 2013a; Jannach et al., 2013].
The data sets that we used are the RecSys HetRec 2011 MovieLens data set
[Cantador et al., 2011] and the BookCrossing data set [Ziegler et al., 2005].
The RecSys HetRec 2011 MovieLens (ML) data set contains personal ratings and
tags about movies and consists of 855,598 ratings from 2,113 users on 10,197 items.
The BookCrossing (BX) data set is described by Ziegler et al. [2005] and gathered
from Bookcrossing.com, a social networking site founded to encourage the exchange
of books. Following Ziegler et al. [2005] and owing to the extreme sparsity of the data,
we decided to condense the data set in order to obtain more meaningful results from
collaborative filtering algorithms. Hence, we discarded as in [Ziegler et al., 2005] all
books for which we were not able to find any information, along with all the ratings
referring to them. Next, we also removed book titles with fewer than 4 ratings and
community members with fewer than 8 ratings each. The dimensions of the resulting
data set were considerably more moderate, featuring 8,824 users, 7,818 books, and
107,367 explicit ratings.
Using the ML and BX data sets, we conducted a large number of experiments
and compared our method against the standard user-based k nearest neighbors col-
laborative filtering approach. In order to test the proposed approach of weighted
percentiles under various experimental settings, we used 2 data sets, 6 different sizes
of neighborhoods (k ∈ {30, 40, . . . , 80}), 9 different percentiles as combining func-
tions (p ∈ {10, 20, . . . , 90}), and generated recommendation lists of 13 different sizes
(l ∈ {1, 3, 5, 10, 20, . . . , 100}), resulting in 1,404 experiments in total.
For the computation of the weighted percentiles, we used the gmisclib [gmisclib,
2013] scientific library. Besides, we used Pearson correlation to measure similarity.
Finally, we used a holdout validation scheme in all of our experiments with 80/20
splits of data to the training/test part in order to avoid overfitting.
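For reference, a minimal sketch of the Pearson similarity computation over co-rated items is given below; the dictionary-based rating representation is an illustrative assumption and not the exact data structure used in the experiments.

```python
import numpy as np

def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users, computed over the items both
    have rated; ratings_u and ratings_v map item id -> rating."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    a = np.array([ratings_u[i] for i in common], dtype=float)
    b = np.array([ratings_v[i] for i in common], dtype=float)
    a, b = a - a.mean(), b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```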
B.2 Results
The aim of this study is to demonstrate, through a comparative analysis of our method
and the standard k-NN algorithm in different experimental settings, that the proposed
method effectively improves the classical item prediction accuracy measures and performs
well in terms of other popular performance measures, such as catalog coverage.1
The goal in this section is to compare our method with the standard baseline
methods in terms of traditional metrics for item prediction, such as precision, re-
call, and F1 score. Table B.1 presents the results obtained by applying our method
to the MovieLens and BookCrossing data sets. The values reported are computed
as the average performance over the six neighborhood sizes using the F1 score for
recommendation lists of size l ∈ {3, 5, 10, 30, 50, 100}. Respectively, Fig. B.1 illus-
trates the average performance for neighborhoods of size k ∈ {30, 80} and lists of size
l ∈ {1, 3, 5, 10, 20, . . . , 100}.
Table B.1 and Fig. B.1 demonstrate that the proposed method outperforms the
^1 Similar results were also obtained using the k users with the highest similarity to the target user u, N(u), independently of whether they rated the specific candidate item i. For each metric, only the most interesting dimensions are discussed. Finally, results for low percentiles are not presented, since they consistently underperform the experiments using the high percentiles.
Figure B.1: Prediction Accuracy (F1 score) versus recommendation list size for the (a), (b) MovieLens (ML) and (c), (d) BookCrossing (BX) data sets, for neighborhood sizes k = 30 and k = 80 (k-NN baseline and the 60th, 70th, 80th, and 90th percentiles).
Table B.1: Item Prediction Accuracy (F1 score × 10²).

Data Set   Method             Recommendation List Size
                              3        5        10       30       50       100
ML         k-NN               0.0026   0.0039   0.0078   0.0164   0.3575   0.3362
           60th percentile    0.0902   0.1533   0.2975   0.7208   0.8094   0.8611
           70th percentile    0.1784   0.2907   0.5309   1.2440   1.8051   2.3565
           80th percentile    0.0993   0.1575   0.2848   0.6966   0.9525   1.2930
           90th percentile    0.0456   0.0854   0.1848   0.4552   0.6316   0.9344
BX         k-NN               0.1606   0.1876   0.2415   0.2882   0.2899   0.2807
           60th percentile    0.1743   0.2396   0.3149   0.3654   0.3716   0.3526
           70th percentile    0.1864   0.2418   0.3184   0.3751   0.3841   0.3823
           80th percentile    0.2130   0.2590   0.3654   0.4419   0.4361   0.4256
           90th percentile    0.2126   0.3065   0.3772   0.4592   0.4795   0.4728
Figure B.2: Post hoc analysis for Friedman's Test of Item Prediction Accuracy (F1 score) for both data sets: (a) MovieLens data set, (b) BookCrossing data set.
k-NN method by a wide margin. In particular, for both data sets, accuracy was
improved in all the experiments using high percentiles. Besides, as we can observe,
the increase in performance is larger for recommendation lists of larger size. For the
ML data set the maximum F1 score was achieved using the 70th percentile (0.024)
whereas for BX the maximum was 0.0048 using the 90th percentile.
To determine statistical significance, we have tested the null hypothesis that the
performance of each of the methods is the same using the Friedman test. Based
on the results, we reject the null hypothesis with p < 0.0001. Performing post hoc
analysis on Friedman’s Test results, the differences between the Baseline k-NN and
each one of the experimental settings are statistically significant. Fig. B.2 presents
the box-and-whisker diagrams displaying the aforementioned differences among the
various methods.
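The Friedman test can be run directly on the per-setting scores of the competing methods, as in the following sketch; the three rows shown are taken from Table B.1 (ML data set, list sizes 3, 5, and 10) purely to illustrate the call, whereas the reported test uses all experimental settings.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows: experimental settings (blocks); columns: methods
# (k-NN baseline, 60th, 70th, 80th, 90th percentiles), F1 score x 10^2.
f1 = np.array([
    [0.0026, 0.0902, 0.1784, 0.0993, 0.0456],
    [0.0039, 0.1533, 0.2907, 0.1575, 0.0854],
    [0.0078, 0.2975, 0.5309, 0.2848, 0.1848],
])
stat, p_value = friedmanchisquare(*f1.T)  # one sample of scores per method
print(stat, p_value)
```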
Similar results were also obtained using standard utility-based ranking metrics,
such as the normalized discounted cumulative gain (nDCG) and mean reciprocal rank
(MRR).
In this section we investigate the effect of the proposed method on coverage and
aggregate diversity, two important metrics for RSes [Ricci and Shapira, 2011], that go
beyond the classical perspective of rating prediction accuracy [Adamopoulos, 2013a;
Adamopoulos et al., 2014]. The results obtained using the catalog coverage metric [Ge
et al., 2010] (i.e., the percentage of items in the catalog that are ever recommended
to users) are equivalent to those using the diversity-in-top-N metric for aggregate
diversity [Adomavicius and Kwon, 2012]; henceforth, only results on coverage are
presented. Table B.2 presents the results obtained by applying our method to the
ML and BX data sets. The values reported are computed as the average catalog
coverage over six neighborhood sizes (k ∈ {30, 40, . . . , 80}) for recommendation lists
of size l = {3, 5, 10, 30, 50, 100}.
Table B.2: Catalog Coverage Performance.

Data Set   Method             Recommendation List Size
                              3        5        10       30       50       100
ML         k-NN               0.50%    0.57%    0.68%    0.92%    2.80%    3.98%
           60th percentile    1.17%    1.29%    1.50%    1.92%    2.30%    3.77%
           70th percentile    2.62%    2.88%    3.24%    3.89%    4.29%    5.28%
           80th percentile    5.99%    6.46%    6.98%    8.08%    8.63%    9.77%
           90th percentile    11.31%   13.29%   15.02%   16.94%   17.93%   19.32%
BX         k-NN               45.45%   54.34%   65.52%   84.14%   90.50%   95.36%
           60th percentile    45.44%   54.04%   65.52%   83.85%   90.35%   95.16%
           70th percentile    44.85%   53.34%   64.84%   84.02%   90.13%   95.09%
           80th percentile    46.47%   54.66%   65.90%   84.26%   90.39%   95.16%
           90th percentile    46.33%   54.58%   66.04%   84.28%   90.44%   95.09%
Table B.2 demonstrates that the proposed method performs at least as well as, and
in some cases even better than, the standard user-based k-NN method. In particular,
for the ML data set, where the Baseline k-NN results in low coverage, performance
is increased on average by 643.77%, with the 90th percentile exhibiting the highest
coverage. For the BX data, where the Baseline k-NN results in high coverage be-
cause of the specifics of the particular data set and the larger number of users, the
performance is on average the same (+0.00). In terms of statistical significance, using
the Friedman test and performing post hoc analysis, for the ML data set the differ-
ences among the standard user-based k-NN method and all the experimental settings
are statistically significant (p < 0.005). For the BX data set, only the differences
between the 70th percentile and the remaining experimental settings are statistically
significant.
The generated recommendation lists can also be evaluated for the inequality across
items using the Gini coefficient. In particular, for the ML and BX data sets the
Gini coefficient was on average improved by 2.58% and 0.27%, respectively. As we
can conclude, in the recommendation lists generated from the proposed method, the
number of times an item is recommended is more equally distributed.
In summary, we demonstrated that the proposed method outperforms the stan-
dard user-based k-NN algorithm by a wide margin in terms of item prediction accuracy
and utility-based ranking metrics and performs at least as well as, and in some cases
even better than, the standard baseline method in terms of several other popular
performance measures.
BIBLIOGRAPHY
Abbassi, Z., Amer-Yahia, S., Lakshmanan, L. V., Vassilvitskii, S., and Yu, C.
(2009). Getting recommender systems to think outside the box. In Proceedings of
the third ACM conference on Recommender systems, RecSys ’09, pages 285–288,
New York, NY, USA. ACM.
Adamopoulos, P. (2013a). Beyond rating prediction accuracy: On new perspec-
tives in recommender systems. In Proceedings of the seventh ACM conference on
Recommender systems, RecSys ’13, pages 459–462, New York, NY, USA. ACM.
Adamopoulos, P. (2013b). What makes a great MOOC? An interdisciplinary anal-
ysis of student retention in online courses. In Proceedings of the 34th International
Conference on Information Systems, ICIS 2013.
Adamopoulos, P. (2014a). ConcertTweets: A multi-
dimensional data set for recommender systems research.
http://people.stern.nyu.edu/padamopo/data/concertTweets.html.
Adamopoulos, P. (2014b). Novel perspectives in collaborative filtering recom-
mender systems. In 23rd International Conference on World Wide Web (WWW)
PhD Symposium.
Adamopoulos, P. (2014c). On discovering non-obvious recommendations: Using
unexpectedness and neighborhood selection methods in collaborative filtering sys-
tems. In Proceedings of the 7th ACM International Conference on Web Search and
Data Mining, WSDM ’14, pages 655–660, New York, NY, USA. ACM.
Adamopoulos, P., Bellogín, A., Castells, P., Cremonesi, P., and Steck, H. (2014).
Redd 2014 - international workshop on recommender systems evaluation: Dimen-
sions and design. In Proceedings of the 8th ACM Conference on Recommender
systems, pages 393–394. ACM.
Adamopoulos, P., Ghose, A., and Todri, V. (2015a). Estimating the impact of
user personality traits on Word-of-Mouth: Text-mining microblogging platforms.
Available at SSRN 2679199.
Adamopoulos, P., Ghose, A., and Tuzhilin, A. (2016). The business value of recom-
mendations in a mobile application: Combining deep learning with econometrics.
Working Paper, New York University.
Adamopoulos, P. and Todri, V. (2015a). The effectiveness of marketing strate-
gies in social media: Evidence from promotional events. In Proceedings of the
21st ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 1641–1650. ACM.
Adamopoulos, P. and Todri, V. (2015b). Personality-based recommendations: Evi-
dence from Amazon.com. In 9th ACM Conference on Recommender systems. ACM.
Adamopoulos, P. and Tuzhilin, A. (2011). On unexpectedness in recommender
systems: Or how to expect the unexpected. In DiveRS 2011 - ACM RecSys 2011
Workshop on Novelty and Diversity in Recommender Systems, RecSys 2011, New
York, NY, USA. ACM.
Adamopoulos, P. and Tuzhilin, A. (2013a). On unexpectedness in recommender
systems: Or how to better expect the unexpected. Working Paper: CBA-13-03,
New York University. http://ssrn.com/abstract=2282999.
Adamopoulos, P. and Tuzhilin, A. (2013b). Probabilistic neighborhood selection in
collaborative filtering systems. Working Paper: CBA-13-04, New York University.
http://hdl.handle.net/2451/31988.
Adamopoulos, P. and Tuzhilin, A. (2013c). Recommendation opportunities: Im-
proving item prediction using weighted percentile methods in collaborative filtering
systems. In Proceedings of the seventh ACM conference on Recommender systems,
RecSys ’13, pages 351–354, New York, NY, USA. ACM.
Adamopoulos, P. and Tuzhilin, A. (2014a). Estimating the value of multi-
dimensional data sets in context-based recommender systems. In ACM conference
on Recommender Systems (RecSys) 2014 Poster Proceedings.
Adamopoulos, P. and Tuzhilin, A. (2014b). On over-specialization and concentra-
tion bias of recommendations: Probabilistic neighborhood selection in collaborative
filtering systems. In Proceedings of the 8th ACM Conference on Recommender sys-
tems, pages 153–160. ACM.
Adamopoulos, P. and Tuzhilin, A. (2014c). On unexpectedness in recommender
systems: Or how to better expect the unexpected. ACM Transactions on Intelligent
Systems and Technology (TIST), 5(4):54.
Adamopoulos, P. and Tuzhilin, A. (2015). The business value of recommendations:
A privacy-preserving econometric analysis. In Proceedings of the 36th International
Conference on Information Systems, ICIS. AIS.
Adamopoulos, P., Tuzhilin, A., and Mountanos, P. (2015b). Measuring the con-
centration reinforcement bias of recommender systems. In 9th ACM Conference on
Recommender systems. ACM.
Adomavicius, G. and Kwon, Y. (2009). Toward more diverse recommendations:
Item re-ranking methods for recommender systems. In Proceedings of the 19th
Workshop on Information Technology and Systems (WITS’09).
Adomavicius, G. and Kwon, Y. (2011). Maximizing aggregate recommendation
diversity: A graph-theoretic approach. In DiveRS 2011 - ACM RecSys 2011 Work-
shop on Novelty and Diversity in Recommender Systems, RecSys 2011, New York,
NY, USA. ACM.
Adomavicius, G. and Kwon, Y. (2012). Improving aggregate recommendation di-
versity using ranking-based techniques. Knowledge and Data Engineering, IEEE
Transactions on, 24(5):896–911.
Adomavicius, G. and Tuzhilin, A. (2005). Toward the next generation of recom-
mender systems: A survey of the state-of-the-art and possible extensions. IEEE
Trans. on Knowl. and Data Eng., 17(6):734–749.
Adomavicius, G. and Zhang, J. (2012). Impact of data characteristics on rec-
ommender systems performance. ACM Transactions on Management Information
Systems, 3(1):1–17.
Akiyama, T., Obara, K., and Tanizaki, M. (2010). Proposal and evaluation of
serendipitous recommendation method using general unexpectedness. In Proceed-
ings of the ACM RecSys Workshop on Practical Use of Recommender Systems,
Algorithms and Technologies (PRSAT 2010), RecSys 2010, New York, NY, USA.
ACM.
Amazon (2012). Amazon.com, Inc. http://www.amazon.com.
André, P., Teevan, J., and Dumais, S. T. (2009). From x-rays to silly putty via
Uranus: Serendipity and its role in web search. In Proceedings of the 27th in-
ternational conference on Human factors in computing systems, CHI ’09, pages
2033–2036, New York, NY, USA. ACM.
Andrews, D. (1984). The IRG Solution: Hierarchical Incompetence and How to
Overcome It. Souvenir Press.
Andrews, M., Luo, X., Fang, Z., and Ghose, A. (2015). Mobile ad effectiveness:
Hyper-contextual targeting with crowdedness. Marketing Science.
Angrist, J. and Krueger, A. B. (2001). Instrumental variables and the search for
identification: From supply and demand to natural experiments. Report, National
Bureau of Economic Research.
Archak, N., Ghose, A., and Ipeirotis, P. G. (2011). Deriving the pricing power of
product features by mining consumer reviews. Management Science, 57(8):1485–
1509.
Bakos, J. Y. (1997). Reducing buyer search costs: implications for electronic mar-
ketplaces. Management Science, 43(12):1676–1692.
Bakshy, E., Messing, S., and Adamic, L. A. (2015). Exposure to ideologically
diverse news and opinion on Facebook. Science, 348(6239):1130–1132.
Balabanovic, M. and Shoham, Y. (1997). Fab: content-based, collaborative recom-
mendation. Communications of the ACM, 40(3):66–72.
Baumol, W. J. and Ide, E. A. (1956). Variety in retailing. Management Science,
3(1):93–101.
Beel, J., Langer, S., Genzmehr, M., Gipp, B., and Nurnberger, A. (2013). A com-
parative analysis of offline and online evaluations and discussion of research paper
recommender system evaluation. In Proceedings of the Workshop on Reproducibil-
ity and Replication in Recommender Systems Evaluation (RepSys) at the ACM
Recommender System Conference (RecSys).
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple
scales to improve accuracy of large recommender systems. In Proceedings of the 13th
ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 95–104. ACM.
Bell, R. M., Bennett, J., Koren, Y., and Volinsky, C. (2009). The million dollar
programming prize. IEEE Spectr., 46(5):28–33.
Bellogín, A., Cantador, I., and Castells, P. (2012). A comparative study of hetero-
geneous item recommendations in social systems. Information Sciences.
Bellogín, A., Castells, P., and Cantador, I. (2013). Improving memory-based col-
laborative filtering by neighbour selection based on user preference overlap. In
Proceedings of the 10th Conference on Open Research Areas in Information Re-
trieval, OAIR ’13, pages 145–148, Paris, France.
Bellogín, A. and Parapar, J. (2012). Using graph partitioning techniques for neigh-
bour selection in user-based collaborative filtering. In Proceedings of the sixth ACM
conference on Recommender systems, RecSys ’12, pages 213–216, New York, NY,
USA. ACM.
Bellogín, A., Parapar, J., and Castells, P. (2013). Probabilistic collaborative fil-
tering with negative cross entropy. In Proceedings of the 7th ACM conference on
Recommender systems, RecSys ’13, pages 387–390, New York, NY, USA. ACM.
Benjamini, Y. (1988). Opening the box of a boxplot. The American Statistician,
42(4):257–262.
Berger, G. and Tuzhilin, A. (1998). Discovering unexpected patterns in temporal
data using temporal logic. Temporal Databases: research and practice, pages 281–
309.
Berry, M. J. and Linoff, G. (1997). Data Mining Techniques: For Marketing, Sales,
and Customer Support. John Wiley & Sons, Inc., New York, NY, USA.
Berry, S. (1994). Estimating discrete-choice models of product differentiation. The
RAND Journal of Economics, 25(2):242–262.
Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equi-
librium. Econometrica, 63(4):841–890.
Bettman, J. R., Luce, M. F., and Payne, J. W. (1998). Constructive consumer
choice processes. Journal of consumer research, 25(3):187–217.
Billsus, D. and Pazzani, M. J. (2000). User modeling for adaptive news access.
User Modeling and User-Adapted Interaction, 10(2-3):147–180.
Bodapati, A. V. (2008). Recommendation systems with purchase data. Journal of
Marketing Research, 45(1):77–93.
BookCrossing (2004). Bookcrossing, Inc. http://www.bookcrossing.com.
Brynjolfsson, E., Hu, Y. J., and Simester, D. (2011). Goodbye pareto principle,
hello long tail: The effect of search costs on the concentration of product sales.
Management Science, 57(8):1373–1386.
Brynjolfsson, E., Hu, Y. J., and Smith, M. D. (2003). Consumer surplus in the
digital economy: Estimating the value of increased product variety at online book-
sellers. Management Science, 49(11):1580–1596.
Burke, R. (2002). Hybrid recommender systems: Survey and experiments. User
Modeling and User-Adapted Interaction, 12(4):331–370.
Byrne, D. and Griffitt, W. (1969). Similarity and awareness of similarity of per-
sonality characteristics as determinants of attraction. Journal of Experimental
Research in Personality.
Cantador, I., Brusilovsky, P., and Kuflik, T. (2011). 2nd workshop on information
heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings
of the 5th ACM conference on Recommender systems, RecSys ’11, New York, NY,
USA. ACM.
Cardell, N. S. (1997). Variance components structures for the extreme-value and
logistic distributions with application to models of heterogeneity. Econometric
Theory, 13(2):185–213.
Castells, P., Vargas, S., and Wang, J. (2011). Novelty and diversity metrics for rec-
ommender systems: Choice, discovery and relevance. In International Workshop
on Diversity in Document Retrieval (DDR 2011) at the 33rd European Conference
on Information Retrieval (ECIR 2011).
Celma, O. and Herrera, P. (2008). A new approach to evaluating novel recommen-
dations. In Proceedings of the 2008 ACM conference on Recommender systems,
RecSys ’08, pages 179–186, New York, NY, USA. ACM.
Chen, L., Wu, W., and He, L. (2013). How personality influences users’ needs for
recommendation diversity? In CHI ’13 Extended Abstracts on Human Factors in
Computing Systems, CHI EA ’13, pages 829–834, New York, NY, USA. ACM.
Chen, P.-Y., Wu, S.-y., and Yoon, J. (2004). The impact of online recommendations
and consumer feedback on sales. ICIS 2004 Proceedings, page 58.
Crawford, G. S. (2012). Endogenous product choice: A progress report. Interna-
tional Journal of Industrial Organization, 30(3):315–320.
Cremer, H. and Thisse, J.-F. (1991). Location models of horizontal differentia-
tion: A special case of vertical differentiation models. The Journal of Industrial
Economics, 39(4):383–390.
Cremonesi, P., Garzotto, F., Negro, S., Papadopoulos, A. V., and Turrin, R.
(2011). Looking for “good” recommendations: A comparative evaluation of rec-
ommender systems. In Human-Computer Interaction–INTERACT 2011, pages
152–168. Springer.
Cremonesi, P., Koren, Y., and Turrin, R. (2010). Performance of recommender
algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM
Conference on Recommender Systems, RecSys ’10, pages 39–46, New York, NY,
USA. ACM.
Danaher, P. J., Smith, M. S., Ranasinghe, K., and Danaher, T. S. (2015). Where,
when and how long: Factors that influence the redemption of mobile phone
coupons. Journal of Marketing Research.
Davidson, R. and MacKinnon, J. (2004). Econometric Theory and Methods. Ox-
ford University Press.
DeLone, W. H. and McLean, E. R. (1992). Information systems success: The quest
for the dependent variable. Information systems research, 3(1):60–95.
Desrosiers, C. and Karypis, G. (2011). A comprehensive survey of neighborhood-
based recommendation methods. In Recommender systems handbook, pages 107–
144. Springer.
Dooms, S., De Pessemier, T., and Martens, L. (2013). MovieTweetings: a movie
rating dataset collected from twitter. In Workshop on Crowdsourcing and Human
Computation for Recommender Systems, CrowdRec at RecSys ’13.
Dror, G., Koren, Y., and Weimer, M., editors (2012). Proceedings of KDD Cup
2011 competition, San Diego, CA, USA, 2011, volume 18 of JMLR Proceedings.
JMLR.org.
Econsultancy.com (2013). The realities of online personalization. Technical report,
Econsultancy.com.
Ekstrand, M. D., Harper, F. M., Willemsen, M. C., and Konstan, J. A. (2014).
User perception of differences in recommender algorithms. In Proceedings of the
8th ACM Conference on Recommender systems, pages 161–168. ACM.
eMarketer (2013). Smartphones, tablets drive faster growth in ecommerce sales.
Report, eMarketer.
eMarketer (2015). Holiday shopping preview. Report, eMarketer.
Evans, J. A. (2008). Electronic publication and the narrowing of science and schol-
arship. Science, 321(5887):395–399.
Fagin, R. and Price, T. G. (1978). Efficient calculation of expected miss ratios in
the independent reference model. SIAM Journal on Computing, 7(3):288–297.
Fargo, W. (2014). Data, data, data: 2015 internet advertising themes. Report,
Wells Fargo.
Fasolo, B., McClelland, G. H., and Lange, K. A. (2005). The effect of site de-
sign and interattribute correlations on interactive web-based decisions. Lawrence
Erlbaum Associates, Inc.
Fitzsimons, G. J. and Lehmann, D. R. (2004). Reactance to recommendations:
When unsolicited advice yields contrary responses. Marketing Science, 23(1):82–
94.
Fleder, D. and Hosanagar, K. (2009). Blockbuster culture’s next rise or fall:
The impact of recommender systems on sales diversity. Management Science,
55(5):697–712.
Fong, N. M., Fang, Z., and Luo, X. (2015). Geo-conquesting: Competitive loca-
tional targeting of mobile promotions. Journal of Marketing Research.
Gantner, Z., Drumond, L., et al. (2012). Personalized ranking for non-uniformly
sampled items. In Dror et al. [2012], pages 231–247.
Gantner, Z., Rendle, S., Freudenthaler, C., and Schmidt-Thieme, L. (2011). My-
MediaLite: A free recommender system library. In 5th ACM International Confer-
ence on Recommender Systems (RecSys 2011).
Ge, M., Delgado-Battenfeld, C., and Jannach, D. (2010). Beyond accuracy: Eval-
uating recommender systems by coverage and serendipity. In Proceedings of the
fourth ACM conference on Recommender systems, RecSys ’10, pages 257–260, New
York, NY, USA. ACM.
Ge, M., Jannach, D., Gedikli, F., and Hepp, M. (2012). Effects of the placement
of diverse items in recommendation lists. In Proceedings of 14th International
Conference on Enterprise Information Systems (ICEIS 2012).
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the
bias/variance dilemma. Neural computation, 4(1):1–58.
Ghose, A., Goldfarb, A., and Han, S. P. (2012a). How is the mobile internet differ-
ent? Search costs and local activities. Information Systems Research, 24(3):613–
631.
Ghose, A. and Han, S. (2011). An empirical analysis of user content generation
and usage behavior on the mobile internet. Management Science, 57(9):1671–1691.
Ghose, A., Ipeirotis, P., and Li, B. (2012b). Designing ranking systems for ho-
tels on travel search engines by mining user-generated and crowd-sourced content.
Marketing Science.
Ghose, A. and Ipeirotis, P. G. (2011). Estimating the helpfulness and economic
impact of product reviews: Mining text and reviewer characteristics. IEEE Trans-
actions on Knowledge and Data Engineering, 23(10):1498–1512.
Gini, C. (1909). Concentration and dependency ratios (in Italian). English trans-
lation in Rivista di Politica Economica, 87:769–789.
Gini, C. (1921). Measurement of inequality of incomes. The Economic Journal,
pages 124–126.
gmisclib (2013). gmisclib, scientific library.
http://kochanski.org/gpk/code/speechresearch/gmisclib/.
Goes, P. B., Lin, M., and Au Yeung, C.-m. (2014). Popularity effect in user-
generated content: Evidence from online product reviews. Information Systems
Research, 25(2):222–238.
Goh, K.-Y., Heng, C.-S., and Lin, Z. (2013). Social media brand community
and consumer behavior: Quantifying the relative impact of user-and marketer-
generated content. Information Systems Research, 24(1):88–107.
Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. (1992). Using collabora-
tive filtering to weave an information tapestry. Communications of the ACM,
35(12):61–70.
Goldstein, D. and Goldstein, D. (2006). Profiting from the long tail. Harvard
Business Review, 84(6):24–28.
Google (2012). Google Books. http://books.google.com.
Gorgoglione, M., Panniello, U., and Tuzhilin, A. (2011). The effect of context-
aware recommendations on customer purchasing behavior and trust. In Proceedings
of the fifth ACM conference on Recommender systems, RecSys ’11, pages 85–92,
New York, NY, USA. ACM.
Greene, W. (2012). Econometric Analysis. Pearson series in economics. Prentice
Hall.
GroupLens (2011). GroupLens research group.
Guo, G., Zhang, J., Sun, Z., and Yorke-Smith, N. (2015). Librec: A java library
for recommender systems. In UMAP’15.
Hansen, L. P. (1982). Large sample properties of generalized method of moments
estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054.
Harris, Z. S. (1954). Distributional structure. Word.
Herlocker, J., Konstan, J. A., and Riedl, J. (2002). An empirical analysis of de-
sign choices in neighborhood-based collaborative filtering algorithms. Information
retrieval, 5(4):287–310.
Herlocker, J. L., Konstan, J. A., Borchers, A., and Riedl, J. (1999). An algorithmic
framework for performing collaborative filtering. In Proceedings of the 22nd annual
international ACM SIGIR conference on Research and development in information
retrieval, pages 230–237. ACM.
Herlocker, J. L., Konstan, J. A., Terveen, L. G., and Riedl, J. T. (2004). Evaluating
collaborative filtering recommender systems. ACM Transactions on Information
Systems, 22(1):5–53.
Hernández-Lobato, D., Martínez-Muñoz, G., and Suárez, A. (2011). Empirical
analysis and evaluation of approximate techniques for pruning regression bagging
ensembles. Neurocomput., 74(12-13):2250–2264.
Hijikata, Y., Shimizu, T., and Nishida, S. (2009). Discovery-oriented collaborative
filtering for improving user satisfaction. In Proceedings of the 14th international
conference on Intelligent user interfaces, IUI ’09, pages 67–76, New York, NY,
USA. ACM.
Hinz, O., Eckert, J., and Skiera, B. (2011). Drivers of the long tail phenomenon:
an empirical analysis. Journal of management information systems, 27(4):43–70.
Hoover, E. (1985). An introduction to regional economics. A. A. Knopf, New York.
Hosanagar, K., Fleder, D., Lee, D., and Buja, A. (2013). Recommender systems
and their effects on consumers: The fragmentation debate. Management Science.
Hu, R. and Pu, P. (2011). Helping users perceive recommendation diversity. Work-
shop on Novelty and Diversity in Recommender Systems (DiveRS 2011), page 43.
Hurley, N. and Zhang, M. (2011). Novelty and diversity in top-n recommendation–
analysis and evaluation. ACM Transactions on Internet Technology (TOIT),
10(4):14.
Iaquinta, L., Gemmis, M. d., Lops, P., Semeraro, G., Filannino, M., and Molino,
P. (2008). Introducing serendipity in a content-based recommender system. In
Proceedings of the 2008 8th International Conference on Hybrid Intelligent Systems,
HIS ’08, pages 168–173, Washington, DC, USA. IEEE Computer Society.
IMDb (2011). Imdb.com, Inc. http://www.imdb.com.
ISBNdb.com (2012). The ISBN database. http://isbndb.com.
Jahrer, M. and Töscher, A. (2012). Collaborative filtering ensemble for ranking.
In Dror et al. [2012], pages 153–167.
Jannach, D. and Hegelich, K. (2009). A case study on the effectiveness of recom-
mendations in the mobile internet. In Proceedings of the third ACM conference on
Recommender systems, RecSys ’09, pages 205–208. ACM.
Jannach, D., Lerche, L., Gedikli, F., and Bonnin, G. (2013). What recommenders
recommend–an analysis of accuracy, popularity, and sales diversity effects. In User
Modeling, Adaptation, and Personalization, pages 25–37. Springer.
Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010). Recommender
systems: an introduction. Cambridge University Press.
Järvelin, K. and Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR
techniques. ACM Transactions on Information Systems, 20(4):422–446.
Jimenez, D. (1998). Dynamically weighted ensemble neural networks for classifica-
tion. In Neural Networks Proceedings, 1998. IEEE World Congress on Computa-
tional Intelligence. The 1998 IEEE International Joint Conference on, volume 1,
pages 753–756. IEEE.
Jin, R., Chai, J. Y., and Si, L. (2004). An automatic weighting scheme for col-
laborative filtering. In Proceedings of the 27th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 337–344.
ACM.
Kahn, B., Lehmann, D., and Dept, W. S. M. (1991). Modeling choice among
assortments. Working paper (Wharton School. Marketing Dept.). Wharton School,
University of Pennsylvania, Marketing Department.
Kannan, P., Chang, A., and Whinston, A. (2001). Wireless commerce: Marketing
issues and possibilities.
Kawamae, N. (2010). Serendipitous recommendations via innovators. In Proceed-
ings of the 33rd international ACM SIGIR conference on Research and development
in information retrieval, SIGIR ’10, pages 218–225, New York, NY, USA. ACM.
Kawamae, N., Sakano, H., and Yamada, T. (2009). Personalized recommendation
based on the personal innovator degree. In Proceedings of the third ACM conference
on Recommender systems, RecSys ’09, pages 329–332, New York, NY, USA. ACM.
Khabbaz, M., Xie, M., and Lakshmanan, L. (2011). TopRecs: Pushing the envelope
on recommender systems. Data Engineering, page 61.
Khan, F. and Zubek, V. (2008). Support vector regression for censored data
(SVRc): A novel tool for survival analysis. In Data Mining, 2008. ICDM ’08.
Eighth IEEE International Conference on, pages 863–868.
Konstan, J. A., McNee, S. M., Ziegler, C.-N., Torres, R., Kapoor, N., and Riedl,
J. T. (2006). Lessons on applying automated recommender systems to information-
seeking tasks. In proceedings of the 21st national conference on Artificial intelligence
- Volume 2, AAAI’06, pages 1630–1633, Palo Alto, CA, USA. AAAI Press.
Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., and Riedl,
J. (1997). GroupLens: applying collaborative filtering to Usenet news. Communi-
cations of the ACM, 40(3):77–87.
Konstan, J. A. and Riedl, J. T. (2012). Recommender systems: from algorithms
to user experience. User Modeling and User-Adapted Interaction, 22:101–123.
Kontonasios, K.-N., Spyropoulou, E., and De Bie, T. (2012). Knowledge discovery
interestingness measures based on unexpectedness. Wiley Interdisciplinary Re-
views: Data Mining and Knowledge Discovery, 2(5):386–399.
Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collab-
orative filtering model. In Proceedings of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 426–434. ACM.
Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative
filtering. ACM Trans. Knowl. Discov. Data, 4(1):1–24.
Koren, Y. and Bell, R. (2011). Advances in collaborative filtering. In Recommender
Systems Handbook, pages 145–186. Springer.
Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for
recommender systems. Computer, 42(8):30–37.
Krumm, J. (2009). A survey of computational location privacy. Personal and
Ubiquitous Computing, 13(6):391–399.
Kull, S., Ramsay, C., and Lewis, E. (2003). Misperceptions, the media, and the
Iraq war. Political Science Quarterly, 118(4):569–598.
Lathia, N., Hailes, S., Capra, L., and Amatriain, X. (2010). Temporal diversity
in recommender systems. In Proceedings of the 33rd international ACM SIGIR
conference on Research and development in information retrieval, SIGIR ’10, pages
210–217, New York, NY, USA. ACM.
Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and
documents. arXiv preprint arXiv:1405.4053.
Lee, Y. E. and Benbasat, I. (2010). Interaction design for mobile product recom-
mendation agents: Supporting users’ decisions in retail stores. ACM Transactions
on Computer-Human Interaction, 17(4):1–32.
Lemire, D. and Maclachlan, A. (2007). Slope one predictors for online rating-based
collaborative filtering. CoRR, abs/cs/0702144.
Leopold, T. (2013). Internet gains are serendipity’s loss.
http://www.cnn.com/2013/11/20/tech/web/internet-serendipity/.
Levin, D. Z., Cross, R., and Abrams, L. C. (2002). Why should I trust you? Pre-
dictors of interpersonal trust in a knowledge transfer context. Academy of Man-
agement.
Levy, O. and Goldberg, Y. (2007). Dependency-based word embeddings. In Proceed-
ings of the 52nd Annual Meeting of the Association for Computational Linguistics,
volume 2, pages 302–308.
Li, S. S. and Karahanna, E. (2015). Online recommendation systems in a B2C
e-commerce context: A review and future directions. Journal of the Association
for Information Systems, 16(2):2.
LibraryThing (2012). LibraryThing. http://www.librarything.com.
Lichtenthal, J. D. and Tellefsen, T. (2001). Toward a theory of business buyer-seller
similarity. Journal of Personal Selling & Sales Management, 21(1):1–14.
Long, J. (1997). Regression models for categorical and limited dependent variables,
volume 7. Sage Publications, Incorporated.
Lorenz, M. O. (1905). Methods of measuring the concentration of wealth. Publi-
cations of the American Statistical Association, 9(70):209–219.
Marshall, A. (1920). Principles of Economics, volume 1. Macmillan and Co.,
London, UK.
Martínez-Muñoz, G. and Suárez, A. (2006). Pruning in ordered bagging ensembles.
In Proceedings of the 23rd international conference on Machine learning, pages 609–
616. ACM.
Masthoff, J. (2011). Group recommender systems: Combining individual models.
In Recommender Systems Handbook, pages 677–702. Springer.
Matt, C., Benlian, A., Hess, T., and Weiß, C. (2014). Escaping from the filter
bubble? The effects of novelty and serendipity on users’ evaluations of online rec-
ommendations. In Proceedings of the 35th International Conference on Information
Systems.
Matt, C., Hess, T., and Weiß, C. (2013). The differences between recommender
technologies in their impact on sales diversity. In Proceedings of the 34th Interna-
tional Conference on Information Systems.
McAuley, J. and Leskovec, J. (2013). Hidden factors and hidden topics: under-
standing rating dimensions with review text. In Proceedings of the seventh ACM
conference on Recommender systems, RecSys ’13, New York, NY, USA. ACM.
McDonald, J. F. and Moffitt, R. A. (1980). The uses of tobit analysis. The Review
of Economics and Statistics, 62(2):318–321.
McFadden, D. (1980). Econometric models for probabilistic choice among products.
Journal of Business, pages 13–29.
McKinney, V., Yoon, K., and Zahedi, F. M. (2002). The measurement of web-
customer satisfaction: An expectation and disconfirmation approach. Information
systems research, 13(3):296–315.
McNee, S. M., Riedl, J., and Konstan, J. A. (2006). Being accurate is not enough:
how accuracy metrics have hurt recommender systems. In Proceedings of CHI ’06,
CHI EA ’06, pages 1097–1101, New York, NY, USA. ACM.
McSherry, D. (2002). Diversity-conscious retrieval. In Proceedings of the 6th Euro-
pean Conference on Advances in Case-Based Reasoning, ECCBR ’02, pages 219–
233, London, UK, UK. Springer-Verlag.
Meister, F., Shin, D., and Andrews, L. (2002). Getting to know you: what’s
new in personalization technologies. E-Doc, March-April
(http://www.aiim.org/Resources/Archive/Magazine/2002-Mar-Apr/25187).
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Dis-
tributed representations of words and phrases and their compositionality. In Ad-
vances in Neural Information Processing Systems, pages 3111–3119.
Murakami, T., Mori, K., and Orihara, R. (2008). Metrics for evaluating the
serendipity of recommendation lists. In Proceedings of the 2007 conference on
New frontiers in artificial intelligence, JSAI’07, pages 40–46, Berlin, Heidelberg.
Springer-Verlag.
Nakatsuji, M., Fujiwara, Y., Tanaka, A., Uchiyama, T., Fujimura, K., and Ishida,
T. (2010). Classical music for rock fans?: Novel recommendations for expanding
user interests. In Proceedings of the 19th ACM international conference on Infor-
mation and knowledge management, CIKM ’10, pages 949–958, New York, NY,
USA. ACM.
Nanopoulos, A., Radovanovic, M., and Ivanovic, M. (2009). How does high dimen-
sionality affect collaborative filtering? In Proceedings of the Third ACM Conference
on Recommender Systems, RecSys ’09, pages 293–296, New York, NY, USA. ACM.
Netzer, O., Feldman, R., Goldenberg, J., and Fresko, M. (2012). Mine your own
business: Market-structure surveillance through text mining. Marketing Science,
31(3):521–543.
Neven, D. (1985). Two stage (perfect) equilibrium in Hotelling’s model. The Jour-
nal of Industrial Economics, 33(3):317–325.
Nguyen, T. T., Hui, P.-M., Harper, F. M., Terveen, L., and Konstan, J. A. (2014).
Exploring the filter bubble: the effect of using recommender systems on content
diversity. In Proceedings of the 23rd international conference on World wide web,
pages 677–686. ACM.
Nielsen (2014). Total audience report: Q3 2014. Report, Nielsen.
O’Connor, M., Cosley, D., Konstan, J. A., and Riedl, J. (2002). PolyLens: a rec-
ommender system for groups of users. In ECSCW 2001, pages 199–218. Springer.
O’Connor, M. and Herlocker, J. (1999). Clustering items for collaborative filtering.
In Proceedings of the ACM SIGIR workshop on recommender systems, volume 128.
UC Berkeley.
O’Donovan, J. and Smyth, B. (2005). Trust in recommender systems. In Pro-
ceedings of the 10th international conference on Intelligent user interfaces, IUI ’05,
pages 167–174, New York, NY, USA. ACM.
Oestreicher-Singer, G. and Sundararajan, A. (2012a). Recommendation networks
and the long tail of electronic commerce. Mis Quarterly, 36(1):65–83.
Oestreicher-Singer, G. and Sundararajan, A. (2012b). The visible hand? demand
effects of recommendation networks in electronic markets. Management Science,
58(11):1963–1981.
Oliver, R. W., Rust, R. T., and Varki, S. (1998). Real-time marketing. Marketing
Management, 7(4):28.
Olsen, R. J. (1978). Note on the uniqueness of the maximum likelihood estimator
for the tobit model. Econometrica, 46(5):1211–1215.
Olson, J. C. and Dover, P. A. (1979). Disconfirmation of consumer expectations
through product trial. Journal of Applied psychology, 64(2):179.
Padmanabhan, B. and Tuzhilin, A. (1998). A belief-driven method for discover-
ing unexpected patterns. In Proceedings of the third International Conference on
Knowledge Discovery and Data Mining, KDD ’98, pages 94–100, Palo Alto, CA,
USA. AAAI Press.
Padmanabhan, B. and Tuzhilin, A. (2000). Small is beautiful: discovering the
minimal set of unexpected patterns. In Proceedings of the sixth ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD ’00, pages
54–63, New York, NY, USA. ACM.
Padmanabhan, B. and Tuzhilin, A. (2006). On characterization and discovery of
minimal unexpected patterns in rule discovery. IEEE Trans. on Knowl. and Data
Eng., 18(2):202–216.
Panniello, U., Gorgoglione, M., and Tuzhilin, A. (2016). In CARS we trust: How
context-aware recommendations affect customers’ trust and other business perfor-
mance measures of recommender systems. Information Systems Research.
Panniello, U., Tuzhilin, A., Gorgoglione, M., Palmisano, C., and Pedone, A. (2009).
Experimental comparison of pre- vs. post-filtering approaches in context-aware rec-
ommender systems. In Proceedings of the third ACM conference on Recommender
systems, RecSys ’09, pages 265–268, New York, NY, USA. ACM.
Pariser, E. (2011a). Eli Pariser: Beware online “filter bubbles.” TED.
Pariser, E. (2011b). The filter bubble: What the Internet is hiding from you. Pen-
guin UK.
Pathak, B., Garfinkel, R., Gopal, R. D., Venkatesan, R., and Yin, F. (2010). Em-
pirical analysis of the impact of recommender systems on sales. Journal of Man-
agement Information Systems, 27(2):159–188.
Perrone, M. P. and Cooper, L. N. (1993). When Networks Disagree: Ensemble
Methods for Hybrid Neural Networks, pages 126–142. Chapman and Hall.
Pradel, B., Usunier, N., and Gallinari, P. (2012). Ranking with non-random missing
ratings: influence of popularity and positivity on evaluation metrics. In Proceedings
of the sixth ACM conference on Recommender systems, pages 147–154. ACM.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2002). Reliable estimation of
generalized linear mixed models using adaptive quadrature. Stata Journal, 2(1):1–
21.
Radovanovic, M., Nanopoulos, A., and Ivanovic, M. (2010). Hubs in space: Popular
nearest neighbors in high-dimensional data. The Journal of Machine Learning
Research, 11:2487–2531.
Rampell, A. (2010). Why Online2Offline commerce is a trillion dollar opportunity.
techcrunch.com (available online at http://techcrunch.com/2010/08/07/why-
online2offline-commerce-is-a-trillion-dollaropportunity/).
Rendle, S., Freudenthaler, C., et al. (2009). BPR: Bayesian Personalized Rank-
ing from Implicit Feedback. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, UAI ’09, pages 452–461, Arlington, Virginia, United States.
AUAI Press.
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. (1994). Grou-
pLens: an open architecture for collaborative filtering of netnews. In Proceedings of
the 1994 ACM conference on Computer supported cooperative work, pages 175–186.
ACM.
Riboni, D., Pareschi, L., and Bettini, C. (2009). Privacy in georeferenced context-
aware services: A survey, pages 151–172. Springer.
Ricci, F. (2010). Mobile recommender systems. Information Technology &
Tourism, 12(3):205–231.
Ricci, F. and Shapira, B. (2011). Recommender systems handbook. Springer.
Roli, F. and Fumera, G. (2002). Analysis of linear and order statistics combin-
ers for fusion of imbalanced classifiers. In Proceedings of the Third International
Workshop on Multiple Classifier Systems, MCS ’02, pages 252–261, London, UK,
UK. Springer-Verlag.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning represen-
tations by back-propagating errors. Cognitive modeling, 5.
Said, A., Fields, B., Jain, B. J., and Albayrak, S. (2013). User-centric evaluation of
a k-furthest neighbor collaborative filtering recommender algorithm. In Proceedings
of the ACM 2013 conference on Computer Supported Cooperative Work. ACM.
Said, A., Jain, B. J., and Albayrak, S. (2012a). Analyzing weighting schemes in
collaborative filtering: cold start, post cold start and power users. In Proceedings
of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pages 2035–
2040, New York, NY, USA. ACM.
Said, A., Jain, B. J., Kille, B., and Albayrak, S. (2012b). Increasing diversity
through furthest neighbor-based recommendation. In Proceedings of the WSDM’12
Workshop on Diversity in Document Retrieval (DDR’12).
Sarwar, B. M., Karypis, G., Konstan, J., and Riedl, J. (2002). Recommender sys-
tems for large-scale e-commerce: Scalable neighborhood formation using clustering.
In Proceedings of the fifth international conference on computer and information
technology, volume 1.
Senecal, S. and Nantel, J. (2004). The influence of online product recommendations
on consumers online choices. Journal of retailing, 80(2):159–169.
Seyerlehner, K., Flexer, A., and Widmer, G. (2009). On the limitations of browsing
top-n recommender systems. In Proceedings of the Third ACM Conference on
Recommender Systems, RecSys ’09, pages 321–324, New York, NY, USA. ACM.
Shafir, E., Simonson, I., and Tversky, A. (1993a). Reason-based choice. Cognition,
49(1):11–36.
Shafir, E. B., Osherson, D. N., and Smith, E. E. (1993b). The advantage model: A
comparative theory of evaluation and choice under risk. Organizational Behavior
and Human Decision Processes, 55(3):325–378.
Shani, G. and Gunawardana, A. (2011). Evaluating recommendation systems.
Recommender Systems Handbook, 12(19):1–41.
Shardanand, U. and Maes, P. (1995). Social information filtering: algorithms for
automating word of mouth. In Proceedings of the SIGCHI conference on Human
factors in computing systems, pages 210–217. ACM Press/Addison-Wesley Pub-
lishing Co.
Shi, Y., Larson, M., et al. (2010). List-wise learning to rank with matrix factoriza-
tion for collaborative filtering. In Proceedings of the Fourth ACM Conference on
Recommender Systems, RecSys ’10, pages 269–272, New York, NY, USA. ACM.
Shi, Y., Zhao, X., Wang, J., Larson, M., and Hanjalic, A. (2012). Adaptive
diversification of recommendation results via latent factor portfolio. In SIGIR.
Shivaswamy, P., Chu, W., and Jansche, M. (2007). A support vector approach to
censored targets. In Data Mining, 2007. ICDM 2007. Seventh IEEE International
Conference on, pages 655–660.
Shugan, S. M. (1980). The cost of thinking. Journal of consumer Research, pages
99–111.
Silberschatz, A. and Tuzhilin, A. (1996). What makes patterns interesting in knowl-
edge discovery systems. Knowledge and Data Engineering, IEEE Transactions on,
8(6):970–974.
Sinha, R. R. and Swearingen, K. (2001). Comparing recommendations made by
online systems and friends. In DELOS workshop: personalisation and recommender
systems in digital libraries, volume 1.
Spearman, C. (1987). The proof and measurement of association between two
things. The American Journal of Psychology, 100(3/4):441–471.
Stock, J. H. and Yogo, M. (2005). Testing for weak instruments in linear IV
regression. Identification and inference for econometric models: Essays in honor
of Thomas Rothenberg.
Stroud, N. J. (2008). Media use and political predispositions: Revisiting the con-
cept of selective exposure. Political Behavior, 30(3):341–366.
Sugiyama, K. and Kan, M.-Y. (2011). Serendipitous recommendation for scholarly
papers considering relations among researchers. In Proceedings of the 11th annual
international ACM/IEEE joint conference on Digital libraries, JCDL ’11, pages
307–310, New York, NY, USA. ACM.
Sweeting, A. (2013). Dynamic product positioning in differentiated product mar-
kets: The effect of fees for musical performance rights on the commercial radio
industry. Econometrica, 81(5):1763–1803.
Thompson, C. (2008). If you liked this, you’re sure to love that. The New York
Times, 21.
Tintarev, N., Flores, A., and Amatriain, X. (2010). Off the beaten track: a mobile
field study exploring the long tail of tourist recommendations.
Tintarev, N. and Masthoff, J. (2011). Designing and evaluating explanations
for recommender systems. In Recommender Systems Handbook, pages 479–510.
Springer.
Tirole, J. (1988). The Theory of Industrial Organization. Mit Press.
Tirunillai, S. and Tellis, G. J. (2014). Mining marketing meaning from online chat-
ter: Strategic brand analysis of big data using latent dirichlet allocation. Journal
of Marketing Research, 51(4):463–479.
Todri, V. and Adamopoulos, P. (2014). Social commerce: An empirical examina-
tion of the antecedents and consequences of commerce in social network platforms.
In Proceedings of the 35th International Conference on Information Systems, ICIS,
page 18. AIS.
Tumer, K. and Ghosh, J. (1996). Error correlation and error reduction in ensemble
classifiers. Connection science, 8(3-4):385–404.
UBS (2015). US internet & interactive entertainment: Can internet stocks rise
despite headwinds? Report, UBS.
Ueda, N. and Nakano, R. (1996). Generalization error of ensemble estimators.
In Neural Networks, 1996., IEEE International Conference on, volume 1, pages
90–95. IEEE.
Umyarov, A. and Tuzhilin, A. (2011). Using external aggregate ratings for improv-
ing individual recommendations. ACM Transactions on the Web, 5(3):1–40.
Vargas, S. and Castells, P. (2011). Rank and relevance in novelty and diversity
metrics for recommender systems. In Proceedings of the fifth ACM conference on
Recommender systems, RecSys ’11, pages 109–116, New York, NY, USA. ACM.
Vargas, S., Castells, P., and Vallet, D. (2012). Explicit relevance models in intent-
oriented information retrieval diversification. In 35th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval (SIGIR
2012), Portland, OR, USA.
Wang, J. and Zhu, J. (2009). Portfolio theory of information retrieval. In Proc. of
the Annual International ACM SIGIR Conference on Research and Development
on Information Retrieval (SIGIR).
Weng, L.-T., Xu, Y., Li, Y., and Nayak, R. (2007). Improving recommendation
novelty based on topic taxonomy. In Proceedings of the 2007 IEEE/WIC/ACM
International Conferences on Web Intelligence and Intelligent Agent Technology -
Workshops, WI-IATW ’07, pages 115–118, Washington, DC, USA. IEEE Computer
Society.
Wikipedia (2012). Wikimedia foundation, Inc. http://www.wikipedia.org.
Wilde, L. L. (1980). The economics of consumer information acquisition. The
Journal of Business, 53(3):143–158.
Wong, C.-K. and Easton, M. C. (1980). An efficient method for weighted sampling
without replacement. SIAM Journal on Computing, 9(1):111–113.
Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data.
Econometric Analysis of Cross Section and Panel Data. Mit Press.
WorldCat (2012). OCLC online computer library center, Inc.
http://www.worldcat.org.
Xiao, B. and Benbasat, I. (2007). E-commerce product recommendation agents:
use, characteristics, and impact. Mis Quarterly, 31(1):137–209.
Xiao, B. and Benbasat, I. (2014). Research on the use, characteristics, and impact
of e-commerce product recommendation agents: A review and update for 2007–
2012. In Handbook of Strategic e-Business Management, pages 403–431. Springer.
Xue, G.-R., Lin, C., Yang, Q., Xi, W., Zeng, H.-J., Yu, Y., and Chen, Z. (2005).
Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the
28th annual international ACM SIGIR conference on Research and development in
information retrieval, SIGIR ’05, pages 114–121, New York, NY, USA. ACM.
Yoo, K.-H., Gretzel, U., and Zanker, M. (2013). Persuasive Recommender Systems:
Conceptual Background and Implications. Springer.
Zhang, M. and Hurley, N. (2008). Avoiding monotony: Improving the diversity
of recommendation lists. In Proceedings of the 2008 ACM conference on Recom-
mender systems, RecSys ’08, pages 123–130, New York, NY, USA. ACM.
Zhang, M. and Hurley, N. (2009). Novel item recommendation by user profile par-
titioning. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Con-
ference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT
’09, pages 508–515, Washington, DC, USA. IEEE Computer Society.
Zhang, T., Agarwal, R., and Lucas Jr, H. C. (2011). The value of IT-enabled
retailer learning: personalized product recommendations and customer store loyalty
in electronic markets. MIS Quarterly-Management Information Systems, 35(4):859.
Zhang, Y. C., Seaghdha, D. O., Quercia, D., and Jambor, T. (2012). Auralist:
introducing serendipity into music recommendation. In Proceedings of the fifth
ACM international conference on Web search and data mining, WSDM ’12, pages
13–22, New York, NY, USA. ACM.
Zhou, T., Kuscsik, Z., Liu, J.-G., Medo, M., Wakeling, J. R., and Zhang, Y.-C.
(2010). Solving the apparent diversity-accuracy dilemma of recommender systems.
Proceedings of the National Academy of Sciences, 107(10):4511–4515.
Zhou, Z.-H., Wu, J., and Tang, W. (2002). Ensembling neural networks: many
could be better than all. Artificial intelligence, 137(1):239–263.
Ziegler, C.-N., McNee, S. M., Konstan, J. A., and Lausen, G. (2005). Improving
recommendation lists through topic diversification. In Proceedings of the 14th in-
ternational conference on World Wide Web, WWW ’05, pages 22–32, New York,
NY, USA. ACM.
Zucker, L. G. (1986). Production of trust: Institutional sources of economic struc-
ture, 1840–1920. Research in organizational behavior.