
Modeling User Exposure in Recommendation

Dawen Liang, Columbia University, New York, NY ([email protected])

Laurent Charlin, McGill University, Montreal, Canada ([email protected])

James McInerney, Columbia University, New York, NY ([email protected])

David M. Blei, Columbia University, New York, NY ([email protected])

ABSTRACT

Collaborative filtering analyzes user preferences for items (e.g., books, movies, restaurants, academic papers) by exploiting the similarity patterns across users. In implicit feedback settings, all the items, including the ones that a user did not consume, are taken into consideration. But this assumption does not accord with the common-sense understanding that users have a limited scope and awareness of items. For example, a user might not have heard of a certain paper, or might live too far away from a restaurant to experience it. In the language of causal analysis [9], the assignment mechanism (i.e., the items that a user is exposed to) is a latent variable that may change for various user/item combinations. In this paper, we propose a new probabilistic approach that directly incorporates user exposure to items into collaborative filtering. The exposure is modeled as a latent variable and the model infers its value from data. In doing so, we recover one of the most successful state-of-the-art approaches as a special case of our model [8], and provide a plug-in method for conditioning exposure on various forms of exposure covariates (e.g., topics in text, venue locations). We show that our scalable inference algorithm outperforms existing benchmarks in four different domains, both with and without exposure covariates.

1. INTRODUCTION

Making good recommendations is an important problem on the web. In the recommendation problem, we observe how a set of users interacts with a set of items, and our goal is to show each user a set of previously unseen items that she will like. Broadly speaking, recommendation systems use historical data to infer users' preferences, and then use the inferred preferences to suggest items. Good recommendation systems are essential as the web grows; users are overwhelmed with choice.

Traditionally there are two modes of this problem, recommendation from explicit data and recommendation from implicit data. With explicit data, users rate some items (positively, negatively, or along a spectrum) and we aim to predict their missing ratings. This is called explicit data because we only need the rated items to infer a user's preferences. Positively rated items indicate types of items that she likes; negatively rated items indicate items that she does not like. But, for all of its qualities, explicit data is of limited use. It is often difficult to obtain.

The more prevalent mode is recommendation from implicit data. In implicit data, each user expresses a binary decision about items—for example this can be clicking, purchasing, viewing—and we aim to predict unclicked items that she would want to click on. Unlike ratings data, implicit data is easily accessible. While ratings data requires action on the part of the users, implicit data is often a natural byproduct of their behavior, e.g., browsing histories, click logs, and past purchases.

But recommendation from implicit data is also more difficult than its explicit counterpart. The reason is that the data is binary and thus, when inferring a user's preferences, we must use unclicked items. Mirroring methods for explicit data, many methods treat unclicked items as those a user does not like. But this assumption is mistaken, and overestimates the effect of the unclicked items. Some of these items—many of them, in large-scale settings—are unclicked because the user didn't see them, rather than because she chose not to click them. This is the crux of the problem of analyzing implicit data: we know users click on items they like, but we do not know why an item is unclicked.

Existing approaches account for this by downweighting the unclicked items. In Hu et al. [8] the data about unclicked items are given a lower "confidence", expressed through the variance of a Gaussian random variable. In Rendle et al. [23], the unclicked items are artificially subsampled at a lower rate in order to reduce their influence on the estimation. These methods are effective, but they involve heuristic alterations to the data.

In this paper, we take a direct approach to solving this problem. We develop a probabilistic model for recommendation called Exposure MF (abbreviated as ExpoMF) that separately captures whether a user has been exposed to an item from whether a user has ultimately decided to click on it. This leads to an algorithm that iterates between estimating the user preferences and estimating the exposure, i.e., why the unclicked items were unclicked. When estimating preferences, it naturally downweights the unclicked items that it expects the user will like, because it imagines that she was not exposed to them.

Concretely, imagine a music listener with a strong preference for alternative rock bands such as Radiohead. Imagine that, in a dataset, there are some Radiohead tracks that this user has not listened to. There are different reasons which may explain unlistened tracks (e.g., the user has a limited listening budget, a particular song is too recent or is unavailable from a particular online service). According to that user's listening history these unlistened tracks would likely make for good recommendations. In this situation our model would assume that the user does not know about these tracks—she has not been exposed to them—and downweight their (negative) contribution when inferring that user's preferences.

arXiv:1510.07025v1 [stat.ML] 23 Oct 2015

Further, by separating the two sides of the problem, our approach enables new innovations in implicit recommendation models. Specifically, we can build models of users' exposure that are guided by additional information such as item content, if exposure to the items typically happens via search, or user/item location, if the users and items are geographically organized.

As an example imagine a recommender system for diners in New York City and diners in Las Vegas. New Yorkers are only exposed to restaurants in New York City. From our model's perspective, unvisited restaurants in New York are therefore more informative in deriving a New Yorker's preferences compared to unvisited restaurants in Las Vegas. Accordingly, for New York users our model will upweight unvisited restaurants in New York while downweighting unvisited Las Vegas restaurants.

We studied our method with user listening history from a music intelligence company, clicks from a scientific e-print server, user bookmarks from an online reference manager, and user checkins at venues from a location-based social network. In all cases, ExpoMF matches or surpasses the state-of-the-art method of Hu et al. [8]. Furthermore, when available, we use extra information to inform our user exposure model. In those cases using the extra information outperforms the simple ExpoMF model. Further, when using document content information our model also outperforms a method specially developed for recommending documents using content and user click information [26]. Finally, we illustrate the alternative-rock-listener and the New-York-diner examples using real data fit with our models in Figure 2 and Figure 3.

This paper is organized as follows. We first review collaborative filtering models in Section 2. We then introduce ExpoMF in Section 3 and location- and content-specific models in the subsequent subsection. We draw connections between ExpoMF and causal inference as well as other recommendation research paths in Section 4. Finally we present an empirical study in Section 5.

2. BACKGROUND

Matrix factorization for collaborative filtering. User-item preference data, whether implicit or not, can be encoded in a user-by-item matrix. In this paper we refer to this user-by-item matrix as the click matrix or the consumption matrix. Given the observed entries in this matrix, the recommendation task is often framed as filling in the unobserved entries. Matrix factorization models, which infer (latent) user preferences and item attributes by factorizing the click matrix, are standard in recommender systems [11]. From a generative modeling perspective they can be understood as first drawing user and item latent factors corresponding, respectively, to user preferences and item attributes, and then drawing observations from a specific distribution (e.g., a Poisson or a Gaussian) with its mean parametrized by the dot product between the user and the item factors. Formally, Gaussian matrix factorization is [18]:

$$\theta_u \sim \mathcal{N}(0, \lambda_\theta^{-1} I_K)$$
$$\beta_i \sim \mathcal{N}(0, \lambda_\beta^{-1} I_K)$$
$$y_{ui} \sim \mathcal{N}(\theta_u^\top \beta_i, \lambda_y^{-1}),$$

where $\theta_u$ and $\beta_i$ represent user $u$'s latent preferences and item $i$'s attributes respectively. We use the mean and covariance to parametrize the Gaussian distribution. $\lambda_\theta$, $\lambda_\beta$, and $\lambda_y$ are treated as hyperparameters. $I_K$ stands for the identity matrix of dimension $K$.
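The generative process above can be simulated in a few lines of NumPy. This is a toy sketch: the sizes U, I, K and the unit precisions are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
U, I, K = 5, 7, 3                      # toy numbers of users, items, latent dims
lam_theta = lam_beta = lam_y = 1.0     # inverse variances (hyperparameters)

# Draw user preferences theta_u and item attributes beta_i from Gaussian priors.
theta = rng.normal(0.0, np.sqrt(1.0 / lam_theta), size=(U, K))
beta = rng.normal(0.0, np.sqrt(1.0 / lam_beta), size=(I, K))

# Each observation y_ui is Gaussian with mean theta_u . beta_i and precision lam_y.
y = rng.normal(theta @ beta.T, np.sqrt(1.0 / lam_y))
```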

Collaborative filtering for implicit data. Weighted matrix factorization (WMF), the standard factorization model for implicit data, selectively downweights evidence from the click matrix [8]. WMF uses a simple heuristic where all unobserved user-item interactions are equally downweighted vis-a-vis the observed interactions. Under WMF an observation is generated from:

$$y_{ui} \sim \mathcal{N}(\theta_u^\top \beta_i, c_{y_{ui}}^{-1}),$$

where the "confidence" $c_{y_{ui}}$ is set such that $c_1 > c_0$. This dependency between a click and itself is unorthodox; because of it WMF is not a generative model. As we will describe in Section 3, we obtain a proper generative model by adding an exposure latent variable.
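As a concrete sketch of this heuristic, the per-entry confidence can be built directly from the click matrix. The constants c0 and c1 below are illustrative; the construction only requires c1 > c0.

```python
import numpy as np

def wmf_confidence(y, c0=0.01, c1=1.0):
    """Assign confidence c1 to clicked entries and c0 < c1 to unclicked ones."""
    return np.where(y > 0, c1, c0)

y = np.array([[1, 0], [0, 1]])
c = wmf_confidence(y)
```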

WMF treats the collaborative filtering problem with implicit data as a regression problem. Concretely, consumed user-item pairs are assigned a value of one and unobserved user-item pairs are assigned a value of zero. Bayesian personalized ranking (BPR) [23, 22] instead treats the problem as one of ranking consumed user-item pairs above unobserved pairs. In a similar vein, the weighted approximate-ranking pairwise (WARP) loss proposed in Weston et al. [27] approximately optimizes Precision@k. To deal with the non-differentiable nature of the ranking loss, these methods typically design specific (stochastic optimization) methods for parameter estimation.

3. EXPOSURE MATRIX FACTORIZATION

We present exposure matrix factorization (ExpoMF). In Section 3.1, we describe the main model. In Section 3.2 we discuss several ways of incorporating external information into ExpoMF (i.e., topics from text, locations). We derive inference procedures for our model (and variants) in Section 3.3. Finally we discuss how to make predictions given our model in Section 3.4.

3.1 Model Description

For every combination of users $u = 1, \dots, U$ and items $i = 1, \dots, I$, consider two sets of variables. The first matrix $A = \{a_{ui}\}$ indicates whether user $u$ has been exposed to item $i$. The second matrix $Y = \{y_{ui}\}$ indicates whether or not user $u$ clicked on item $i$.

Whether a user is exposed to an item comes from a Bernoulli. Conditional on being exposed, the user's preference comes from a matrix factorization model. Similar to the standard methodology, we factorize this conditional distribution into $K$ user preferences $\theta_{u,1:K}$ and $K$ item attributes $\beta_{i,1:K}$,

$$\theta_u \sim \mathcal{N}(0, \lambda_\theta^{-1} I_K)$$
$$\beta_i \sim \mathcal{N}(0, \lambda_\beta^{-1} I_K)$$
$$a_{ui} \sim \mathrm{Bernoulli}(\mu_{ui})$$
$$y_{ui} \mid a_{ui} = 1 \sim \mathcal{N}(\theta_u^\top \beta_i, \lambda_y^{-1})$$
$$y_{ui} \mid a_{ui} = 0 \sim \delta_0, \qquad (1)$$

where $\delta_0$ denotes that $p(y_{ui} = 0 \mid a_{ui} = 0) = 1$, and we introduce a set of hyperparameters denoting the inverse variance ($\lambda_\theta$, $\lambda_\beta$, $\lambda_y$). $\mu_{ui}$ is the prior probability of exposure; we discuss various ways of setting or learning it in subsequent sections. A graphical representation of the model in Equation (1) is given in Figure 1a.
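Equation (1) can be simulated directly. The sketch below (toy sizes and a single scalar prior mu, both illustrative assumptions) makes the role of the exposure indicators explicit: a click value can only be non-zero when the user was exposed.

```python
import numpy as np

rng = np.random.default_rng(1)
U, I, K = 4, 6, 2
lam_theta = lam_beta = lam_y = 1.0
mu = 0.3                               # prior exposure probability (scalar here)

theta = rng.normal(0.0, np.sqrt(1.0 / lam_theta), size=(U, K))
beta = rng.normal(0.0, np.sqrt(1.0 / lam_beta), size=(I, K))

a = rng.binomial(1, mu, size=(U, I))   # a_ui ~ Bernoulli(mu)
clicks = rng.normal(theta @ beta.T, np.sqrt(1.0 / lam_y))
y = np.where(a == 1, clicks, 0.0)      # y_ui is exactly 0 whenever a_ui = 0
```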

We observe the complete click matrix $Y$. These observations have a special structure. When $y_{ui} > 0$, we know that $a_{ui} = 1$. When $y_{ui} = 0$, then $a_{ui}$ is latent. The user might have been exposed to item $i$ and decided not to click (i.e., $a_{ui} = 1$, $y_{ui} = 0$); or she may have never seen the item (i.e., $a_{ui} = 0$, $y_{ui} = 0$). We note that since $Y$ is usually sparse in practice, most $a_{ui}$ will be latent.

The model described in Equation (1) leads to the following log joint probability¹ of exposures and clicks for user $u$ and item $i$,

$$\log p(a_{ui}, y_{ui} \mid \mu_{ui}, \theta_u, \beta_i, \lambda_y^{-1}) = \log \mathrm{Bernoulli}(a_{ui} \mid \mu_{ui}) + a_{ui} \log \mathcal{N}(y_{ui} \mid \theta_u^\top \beta_i, \lambda_y^{-1}) + (1 - a_{ui}) \log \mathbb{I}[y_{ui} = 0], \qquad (2)$$

where $\mathbb{I}[b]$ is the indicator function that evaluates to 1 when $b$ is true, and 0 otherwise.
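A direct transcription of Equation (2) for a single user-item pair might look as follows; this is a sketch, with the Gaussian log-density written out by hand rather than taken from a library.

```python
import numpy as np

def gaussian_logpdf(y, mean, lam):
    """Log density of N(mean, 1/lam) evaluated at y."""
    return 0.5 * (np.log(lam) - np.log(2.0 * np.pi)) - 0.5 * lam * (y - mean) ** 2

def log_joint(a, y, mu, theta_u, beta_i, lam_y):
    """log p(a_ui, y_ui | mu_ui, theta_u, beta_i) as in Equation (2)."""
    lp = a * np.log(mu) + (1 - a) * np.log(1.0 - mu)  # log Bernoulli(a | mu)
    if a == 1:
        lp += gaussian_logpdf(y, theta_u @ beta_i, lam_y)
    elif y != 0:
        lp = -np.inf  # I[y = 0] fails: an unexposed user cannot click
    return lp
```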

What does the distribution in Equation (2) say about the model's exposure beliefs when no clicks are observed? When the predicted preference is high (i.e., when $\theta_u^\top \beta_i$ is high) then the log likelihood of no clicks, $\log \mathcal{N}(0 \mid \theta_u^\top \beta_i, \lambda_y^{-1})$, is low and likely non-positive. This feature penalizes the model for placing probability mass on $a_{ui} = 1$, forcing us to believe that user $u$ is not exposed to item $i$. (The converse argument also holds for low values of $\theta_u^\top \beta_i$.) Interestingly, a low value of $a_{ui}$ downweights the evidence for $\theta_u$ and $\beta_i$ (this is clear by considering extreme values: when $a_{ui} = 0$, the user and item factors do not affect the log joint in Eq. 2 at all; when $a_{ui} = 1$, we recover standard matrix factorization). Like weighted matrix factorization (WMF) [8], ExpoMF selectively downweights evidence from the click matrix.

In ExpoMF, fixing the entries of the exposure matrix to a single value (e.g., $a_{ui} = 1, \forall u, i$) recovers Gaussian probabilistic matrix factorization [18] (see Section 2). WMF is also a special case of our model, which can be obtained by fixing ExpoMF's exposure matrix using $c_0$ and $c_1$ as above.

The intuitions we developed for user exposure from the joint probability do not yet involve $\mu_{ui}$, the prior belief on exposure. As we noted earlier, there is a rich set of choices available in the modeling of $\mu_{ui}$. We discuss several of these next.

3.2 Hierarchical Modeling of Exposure

We now discuss methods for choosing and learning $\mu_{ui}$. One could fix $\mu_{ui}$ at some global value for all users and items, meaning that the user factors, item factors, and clicks would wholly determine exposure (conditioned on variance hyperparameters). One could also fix $\mu_{ui}$ for specific values of $u$ and $i$. This can be done when there is specific extra information that informs us about exposure (denoted as exposure covariates), e.g., the location of a restaurant, the content of a paper. However, we found that empirical performance is highly sensitive to this choice, motivating the need to place models on the prior for $\mu_{ui}$ with flexible parameters.

¹ N.B., we follow the convention that $0 \log 0 = 0$ to allow the log joint to be defined when $y_{ui} > 0$.

[Figure 1 appears here: (a) Exposure MF. (b) Exposure MF with exposure covariates.]

Figure 1: Graphical representation of the exposure MF model (both with and without exposure covariates). Shaded nodes represent observed variables. Unshaded nodes represent hidden variables. A directed edge from node $a$ to node $b$ denotes that the variable $b$ depends on the value of variable $a$. Plates denote replication by the value in the lower corner of the plate. The lightly shaded node $a_{ui}$ indicates that it is partially observed (i.e., it is observed when $y_{ui} = 1$ and unobserved otherwise).

We introduce observed exposure covariates $x_i$ and exposure model parameters $\psi_u$ and condition $\mu_{ui} \mid \psi_u, x_i$ according to some domain-specific structure. The extended graphical model with exposure covariates is shown in Figure 1b. Whatever this exposure model looks like, conditional independence between the priors for exposure and the more standard collaborative filtering parameters (given exposure) ensures that the updates for the model we introduced in Section 3.1 will be the same for many popular inference procedures (e.g., expectation-maximization, variational inference, Gibbs sampling), making the extension to exposure covariates a plug-in procedure. We discuss two possible choices of exposure model next.

Per-item $\mu_i$. A direct way to encode exposure is via item popularity: if a song is popular, it is more likely that you have been exposed to it. Therefore, we choose an item-dependent conjugate prior, $\mu_i \sim \mathrm{Beta}(\alpha_1, \alpha_2)$. This model does not use any external information (beyond clicks).

Text topics or locations as exposure covariates. In the domain of recommending text documents, we consider the exposure covariates as the set of words for each document. In the domain of location-based recommendation, the exposure covariates are the locations of the venues being recommended. We treat both in a similar way.

Consider an $L$-dimensional ($L$ does not necessarily equal the latent space dimension $K$ in the matrix factorization model) representation $x_i$ of the content of document $i$ obtained through natural language processing (e.g., word embeddings [17], latent Dirichlet allocation [3]), or of the position of venue $i$ obtained by first clustering all the venues in the data set and then finding the expected assignment to $L$ clusters for each venue. In both cases, $x_i$ is all positive and normalizes to 1. Denoting $\sigma$ as the sigmoid function, we set

$$\mu_{ui} = \sigma(\psi_u^\top x_i), \qquad (3)$$

where we learn the coefficients $\psi_u$ for each user $u$. Furthermore, we can include intercepts with various levels and interactions [6].

How to interpret the coefficients $\psi_u$? The first interpretation is that of logistic regression, where the independent variables are $x_i$, the dependent binary variables are $a_{ui}$, and the coefficients to learn are $\psi_u$.

The second interpretation is from a recommender systems perspective: $\psi_u$ represents the topics (or geographical points of interest) that a user is usually exposed to, restricting the choice set to documents and venues that match $\psi_u$. For example, if the $l$th topic represents neural networks, and $x_{il}$ is high, then the user must be an avid consumer of neural network papers (i.e., $\psi_{ul}$ must be high) for the model to include an academic paper $i$ in the exposure set of $u$. In the location domain, if the $l$th cluster represents Brooklyn, and $x_{il}$ is high, then the user must live in or visit Brooklyn often for the model to include venues near there in the exposure set of $u$.

3.3 Inference

We use expectation-maximization (EM) [2] to find the maximum a posteriori estimates of the unknown parameters of the model.² The algorithm is summarized in Algorithm 1.

The EM inference procedure for the basic model, ExpoMF, can be found by writing out the full log likelihood of the model, then alternating between finding the expectations of missing data (exposure) in the E(xpectation)-step and finding the maximum of the likelihood with respect to the parameters in the M(aximization)-step. This procedure is analytical for our model because it is conditionally conjugate, meaning that the posterior distribution of each random variable is in the same family as its prior in the model.

Furthermore, as we mentioned in Section 3.2, conditional independence between the priors for $\mu_{ui}$ and the rest of the model (given $\mu_{ui}$) means that the updates for the latent exposure variables and the user and item factors are not altered for any exposure model we use. We present these general updates first.

E-step. In the E-step, we compute the expectation of the exposure latent variable, $\mathbb{E}[a_{ui}]$, for all user-item combinations $(u, i)$ for which there are no observed clicks (recall that the presence of clicks $y_{ui} > 0$ means that $a_{ui} = 1$ deterministically):

$$\mathbb{E}[a_{ui} \mid \theta_u, \beta_i, \mu_{ui}, y_{ui} = 0] = \frac{\mu_{ui} \cdot \mathcal{N}(0 \mid \theta_u^\top \beta_i, \lambda_y^{-1})}{\mu_{ui} \cdot \mathcal{N}(0 \mid \theta_u^\top \beta_i, \lambda_y^{-1}) + (1 - \mu_{ui})}, \qquad (4)$$

where $\mathcal{N}(0 \mid \theta_u^\top \beta_i, \lambda_y^{-1})$ stands for the probability density function of $\mathcal{N}(\theta_u^\top \beta_i, \lambda_y^{-1})$ evaluated at 0.

² There are various other inference methods we could have used, such as Markov chain Monte Carlo inference or variational inference. We chose EM for reasons of efficiency and simplicity, and find that it performs well in practice.

Algorithm 1: Inference for ExpoMF
1: input: click matrix Y, exposure covariates x_{1:I} (topics or locations, optional)
2: random initialization: user factors θ_{1:U}, item factors β_{1:I}, exposure priors µ_{1:I} (for per-item µ_i), OR exposure model parameters ψ_{1:U} (with exposure model)
3: while performance on validation set increases do
4:     Compute expected exposure A (Equation 4)
5:     Update user factors θ_{1:U} (Equation 5)
6:     Update item factors β_{1:I} (Equation 6)
7:     ExpoMF with per-item µ_i: update priors µ_i (Equation 7)
8:     ExpoMF with exposure model µ_{ui} = σ(ψ_u^T x_i): update coefficients ψ_u (Equation 8 or Equation 10)
9: end while
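The E-step of Equation (4) vectorizes naturally over all user-item pairs; a minimal NumPy sketch, with the Gaussian density at zero written out explicitly, is:

```python
import numpy as np

def e_step(mu, theta, beta, lam_y):
    """P(a_ui = 1 | y_ui = 0) for every pair, as in Equation (4)."""
    mean = theta @ beta.T
    # Density of N(theta_u . beta_i, 1/lam_y) evaluated at 0.
    n0 = np.sqrt(lam_y / (2.0 * np.pi)) * np.exp(-0.5 * lam_y * mean ** 2)
    return mu * n0 / (mu * n0 + 1.0 - mu)
```

Note that a large predicted preference $\theta_u^\top \beta_i$ drives the density at zero, and hence the posterior exposure probability, down, which is exactly the downweighting behavior discussed in Section 3.1.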

M-step. For notational convenience, we define $p_{ui} = \mathbb{E}[a_{ui} \mid \theta_u, \beta_i, \mu_{ui}, y_{ui} = 0]$ computed from the E-step. Without loss of generality, we define $p_{ui} = 1$ if $y_{ui} = 1$. The update for the latent collaborative filtering factors is:

$$\theta_u \leftarrow \Big(\lambda_y \sum_i p_{ui} \beta_i \beta_i^\top + \lambda_\theta I_K\Big)^{-1} \Big(\sum_i \lambda_y p_{ui} y_{ui} \beta_i\Big) \qquad (5)$$
$$\beta_i \leftarrow \Big(\lambda_y \sum_u p_{ui} \theta_u \theta_u^\top + \lambda_\beta I_K\Big)^{-1} \Big(\sum_u \lambda_y p_{ui} y_{ui} \theta_u\Big). \qquad (6)$$
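Each update in Equations (5)-(6) is a weighted ridge regression solved per user (or per item). The sketch below shows the user-factor update, assuming for simplicity that p and y are dense U x I arrays; the item update of Equation (6) is symmetric with the roles of users and items swapped.

```python
import numpy as np

def update_user_factors(p, y, beta, lam_y, lam_theta):
    """M-step update of every theta_u (Eq. 5); p holds E[a_ui], with p = 1 on clicks."""
    U, K = p.shape[0], beta.shape[1]
    theta = np.empty((U, K))
    for u in range(U):
        A = lam_y * (beta.T * p[u]) @ beta + lam_theta * np.eye(K)
        b = lam_y * (p[u] * y[u]) @ beta
        theta[u] = np.linalg.solve(A, b)
    return theta
```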

Inference for the Exposure Prior $\mu_{ui}$

We now present inference for the hierarchical variants of Exposure MF. In particular we highlight the updates to $\mu_{ui}$ under the various models we presented in Section 3.2.

Update for per-item $\mu_i$. Maximizing the log likelihood with respect to $\mu_i$ is equivalent to finding the mode of the complete conditional $\mathrm{Beta}(\alpha_1 + \sum_u p_{ui}, \; \alpha_2 + U - \sum_u p_{ui})$, which is:

$$\mu_i \leftarrow \frac{\alpha_1 + \sum_u p_{ui} - 1}{\alpha_1 + \alpha_2 + U - 2}. \qquad (7)$$
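Equation (7) is just the mode of a Beta distribution; in code, vectorized over items, the update is a one-liner (a sketch with illustrative hyperparameter values in the test):

```python
import numpy as np

def update_mu(p, alpha1, alpha2):
    """Mode of Beta(alpha1 + sum_u p_ui, alpha2 + U - sum_u p_ui), per item (Eq. 7)."""
    U = p.shape[0]
    return (alpha1 + p.sum(axis=0) - 1.0) / (alpha1 + alpha2 + U - 2.0)
```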

Update for exposure covariates (topics, location). Setting $\mu_{ui} = \sigma(\psi_u^\top x_i)$, where $x_i$ is given by pre-processing (topic analysis or clustering), presents us with the challenge of maximizing the log likelihood with respect to the exposure model parameters $\psi_u$. Since there is no analytical solution for the mode, we resort to following the gradients of the log likelihood with respect to $\psi_u$:

$$\psi_u^{\mathrm{new}} \leftarrow \psi_u + \eta \nabla_{\psi_u} \mathcal{L}, \qquad (8)$$

for some learning rate $\eta$, where

$$\nabla_{\psi_u} \mathcal{L} = \frac{1}{I} \sum_i \big(p_{ui} - \sigma(\psi_u^\top x_i)\big) \, x_i. \qquad (9)$$

This can be computationally challenging, especially for large item-set sizes $I$. Therefore, we perform (mini-batch) stochastic gradient descent: at each iteration $t$, we randomly subsample a small batch of items $B_t$ and take a noisy gradient step:

$$\psi_u^{\mathrm{new}} \leftarrow \psi_u + \eta_t g_t, \qquad (10)$$

for some learning rate $\eta_t$, where

$$g_t = \frac{1}{|B_t|} \sum_{i \in B_t} \big(p_{ui} - \sigma(\psi_u^\top x_i)\big) \, x_i. \qquad (11)$$
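One noisy gradient step of Equations (10)-(11) can be sketched as follows; the mini-batch indices and learning rate are supplied by the caller, and the sigmoid is written out by hand:

```python
import numpy as np

def sgd_step(psi_u, p_u, x, batch, eta):
    """Noisy gradient step on psi_u from a mini-batch of items (Eqs. 10-11)."""
    sig = 1.0 / (1.0 + np.exp(-(x[batch] @ psi_u)))      # sigma(psi_u . x_i)
    g = ((p_u[batch] - sig)[:, None] * x[batch]).mean(axis=0)
    return psi_u + eta * g
```

When the current predictions $\sigma(\psi_u^\top x_i)$ already match the expected exposures $p_{ui}$, the gradient vanishes and $\psi_u$ is unchanged, as expected at a stationary point.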


For each EM iteration, we found it sufficient to do a single update to the exposure model parameter $\psi_u$ (as opposed to updating until it reaches convergence). This partial M-step [19] is much faster in practice.

Complexity and Implementation Details

A naive implementation of weighted matrix factorization (WMF) [8] has the same complexity as ExpoMF in terms of updating the user and item factors. However, the trick that is used to speed up computations in WMF cannot be applied to ExpoMF due to the non-uniformity of the exposure latent variable $a_{ui}$. On the other hand, the factor updates are still independent across users and items. These updates can therefore easily be parallelized.

In ExpoMF's implementation, explicitly storing the exposure matrix $A$ is impractical for even medium-sized datasets. As an alternative, we perform the E-step on the fly: only the necessary part of the exposure matrix $A$ is constructed for the updates of the user/item factors and exposure priors $\mu_{ui}$. As shown in Section 5, with parallelization and the on-the-fly E-step, ExpoMF can easily be fit to medium-to-large datasets.

3.4 Prediction

In matrix factorization collaborative filtering, the prediction of $y_{ui}$ is given by the dot product between the inferred user and item factors, $\theta_u^\top \beta_i$. This corresponds to the predictive density of ExpoMF, $p(y_{ui} \mid Y)$, using point mass approximations to the posterior given by the EM algorithm.³ However, ExpoMF can also make predictions by integrating out the uncertainty from the exposure latent variable $a_{ui}$:

$$\mathbb{E}_y[y_{ui} \mid \theta_u, \beta_i] = \mathbb{E}_a\big[\mathbb{E}_y[y_{ui} \mid \theta_u, \beta_i, a_{ui}]\big] = \sum_{a_{ui} \in \{0, 1\}} P(a_{ui}) \, \mathbb{E}_y[y_{ui} \mid \theta_u, \beta_i, a_{ui}] = \mu_{ui} \cdot \theta_u^\top \beta_i. \qquad (12)$$

We experimented with both predictions in our study and found that the simple dot product works better for ExpoMF with per-item $\mu_i$, while $\mathbb{E}[y_{ui} \mid \theta_u, \beta_i]$ works better for ExpoMF with exposure covariates. We provide further insights about this difference in Section 5.6.1.
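Both prediction rules are one-liners over the fitted factors; the sketch below exposes them behind a single function, with mu optional to match the two variants discussed above:

```python
import numpy as np

def predict(theta, beta, mu=None):
    """Score all pairs with theta_u . beta_i, optionally weighted by mu (Eq. 12)."""
    scores = theta @ beta.T
    return scores if mu is None else mu * scores

theta = np.array([[1.0, 0.0]])
beta = np.array([[2.0, 0.0]])
mu = np.array([[0.5]])
```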

4. RELATED WORK

In this section we highlight connections between ExpoMF and other similar research directions.

Causal inference. Our work borrows ideas from the field of causal inference [20, 9]. Causal inference aims at understanding and explaining the effect of one variable on another.

One particular aim of causal inference is to answer counterfactual questions. For example, "would this new recommendation engine increase user click-through rate?" While online studies may answer such a question, they are typically expensive, even for large electronic commerce companies. Obtaining answers to such questions using observational data alone (e.g., log data) is therefore of important practical interest [4, 13, 25].

³ This quantity is also the treatment effect $\mathbb{E}[y_{ui} \mid a_{ui} = 1, \theta_u, \beta_i] - \mathbb{E}[y_{ui} \mid a_{ui} = 0, \theta_u, \beta_i]$ in the potential outcomes framework, since $a_{ui} = 0$ deterministically ensures $y_{ui} = 0$.

We establish a connection with the potential outcomes framework of Rubin [24]. In this framework one differentiates the assignment mechanism, whether a user is exposed to an item, from the potential outcome, whether a user consumes an item. In potential outcome terminology our work can thus be understood as a form of latent assignment model. In particular, while consumption implies exposure, we do not know which items users have seen but not consumed. Further, the questions of interest to us, personalized recommendation, depart from traditional work in causal inference, which aims at quantifying the effect of a particular treatment (e.g., the efficacy of a new drug).

Biased CF models. Authors have recognized that typical observational data describing users rating items is biased toward items of interest. Although this observation is somewhat orthogonal to our investigation, models that emerged from this line of work share commonalities with our approach. Specifically, Marlin et al. [16] and Ling et al. [14] separate the selection model (the exposure matrix) from the data model (the matrix factorization). However, their interpretation, rooted in the theory of missing data [15], leads to a much different interpretation of the selection model. They hypothesize that the value of a rating influences whether or not a user will report the rating (this implicitly captures the effect that users mostly consume items they like a priori). This approach is also specific to explicit feedback data. In contrast, we model how (the value of) the exposure matrix affects user rating or consumption.

Exposure in other contexts. In zero-inflated Poisson regression, a latent binary random variable, similar to our exposure variable, is introduced to "explain away" the structural zeros, so that the underlying Poisson model can better capture the count data [12]. This type of model is common in economics, where it is used to account for overly frequent zero-valued observations.

ExpoMF can also be considered an instance of a spike-and-slab model [10], where the "spike" comes from the exposure variables and the matrix factorization component forms the flat "slab" part.

Versatile CF models. As we show in Section 3.2, ExpoMF's exposure matrix can be used to model external information describing user and item interactions. This is in contrast to most CF models, which are crafted to model a single type of data (e.g., document content when recommending scientific papers [26]). An exception is the factorization machine (FM) of Rendle [21]. FM models all types of (numeric) user, item, or user-item features. FM considers the interaction between all features and learns specific parameters for each interaction.

5. EMPIRICAL STUDY

In this section we study the recommendation performance of ExpoMF by fitting the model to several datasets. We provide further insights into ExpoMF's performance by exploring the resulting model fits. We highlight that:

• ExpoMF performs comparably to or better than the state-of-the-art WMF [8] on four datasets representing user clicks, checkins, bookmarks, and listening behavior.

• When augmenting ExpoMF with exposure covariates, its performance is further improved. ExpoMF with location covariates and ExpoMF with content covariates both outperform the simpler ExpoMF with per-item µ_i. Furthermore, ExpoMF with content covariates outperforms a state-of-the-art document recommendation model [26].

                 TPS      Mendeley  Gowalla  ArXiv
  # of users     221,830  45,293    57,629   37,893
  # of items     22,781   76,237    47,198   44,715
  # interactions 14.0M    2.4M      2.3M     2.5M
  % interactions 0.29%    0.07%     0.09%    0.15%

Table 1: Attributes of the datasets after pre-processing. Interactions are non-zero entries (listening counts, clicks, and checkins). % interactions refers to the density of the user-item consumption matrix (Y).

• Through posterior exploration we provide insights into ExpoMF's user-exposure modeling.

5.1 Datasets

Throughout this study we use four medium- to large-scale user-item consumption datasets from various domains: 1) the Taste Profile Subset (TPS) of the Million Song Dataset [1]; 2) scientific article click data from the arXiv^4; 3) user bookmarks from Mendeley^5; and 4) check-in data from the Gowalla dataset [5]. In more detail:

• Taste Profile Subset (TPS): contains user-song play counts collected by the music intelligence company Echo Nest.^6 We binarize the play counts and interpret them as implicit preferences. We further pre-process the dataset by keeping only users with at least 20 songs in their listening history and songs that are listened to by at least 50 users.

• ArXiv: contains user-paper clicks derived from log data collected in 2012 by the arXiv pre-print server. Multiple clicks by the same user on a single paper are considered to be a single click. We pre-process the data to ensure that all users and items have a minimum of 10 clicks.

• Mendeley: contains user-paper bookmarks as provided by the Mendeley service, a "reference manager". The behavior data is filtered such that each user has at least 10 papers in her library, and only papers that are bookmarked by at least 20 users are kept. In addition, this dataset contains the content of the papers, which we pre-process using standard techniques to yield a 10K-word vocabulary. In Section 5.6.1 we make use of paper content to inform ExpoMF's exposure model.

• Gowalla: contains user-venue checkins from a location-based social network. We pre-process the data such that all users and venues have a minimum of 20 checkins. Furthermore, this dataset also contains locations for the venues, which we will use to guide location-based recommendation (Section 5.6.2).

The final dimensions of these datasets are summarized in Table 1.

^4 http://arxiv.org
^5 http://mendeley.com
^6 http://the.echonest.com

5.2 Experimental setup

For each dataset we randomly split the observed user-item interactions into training/test/validation sets with 70/20/10 proportions. In all experiments, the dimension K of the latent space for the collaborative filtering model is 100. The model is trained following the inference algorithm described in Section 3.3. We monitor the convergence of the algorithm using the truncated normalized discounted cumulative gain (NDCG@100, see below for details) on the validation set. Hyper-parameters for ExpoMF-based models and baseline models are also selected according to the same criterion.
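As a minimal sketch (our own illustration, not the authors' code), such a random 70/20/10 split over observed (user, item) interaction pairs might look like:

```python
import random

def split_interactions(pairs, seed=0):
    """Randomly split (user, item) pairs into train/test/validation
    sets with 70/20/10 proportions, as in the experimental setup."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train = int(0.7 * n)
    n_test = int(0.2 * n)
    train = pairs[:n_train]
    test = pairs[n_train:n_train + n_test]
    valid = pairs[n_train + n_test:]
    return train, test, valid

# Toy data: 10 interactions split into 7/2/1.
toy = [(u, i) for u in range(5) for i in range(2)]
train, test, valid = split_interactions(toy)
```

The split is over interactions rather than users, so a user's checkins may end up divided between training and test sets, a point the authors return to in Section 5.6.2.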

To make predictions, for each user u we rank each item by its predicted preference y*_ui = θ_u^T β_i, i = 1, ..., I. We then exclude items from the training and validation sets and compute all the metrics on the resulting ordered list. Further, when using ExpoMF with exposure covariates, we found that performance was improved by predicting missing preferences according to E[y_ui | θ_u, β_i] (see Section 3.4 for details).
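The ranking step can be sketched as follows (a toy illustration with plain Python lists; the variable names are ours):

```python
def rank_items(theta_u, beta, exclude):
    """Rank items for one user by predicted preference theta_u . beta_i,
    skipping items already seen in training/validation (`exclude`).
    theta_u: list of K floats; beta: list of I lists of K floats."""
    scores = []
    for i, beta_i in enumerate(beta):
        if i in exclude:
            continue
        score = sum(t * b for t, b in zip(theta_u, beta_i))
        scores.append((score, i))
    scores.sort(reverse=True)  # highest predicted preference first
    return [i for _, i in scores]

# Toy example: 3 items, K = 2; item 0 was seen in training.
theta_u = [1.0, 0.5]
beta = [[1.0, 0.0], [0.2, 0.8], [0.9, 0.9]]
ranked = rank_items(theta_u, beta, exclude={0})
```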

5.3 Performance measures

To evaluate the recommendation performance, we report Recall@k, a standard information retrieval measure, as well as two ranking-specific metrics: mean average precision (MAP@k) and NDCG@k.^7

We denote by rank(u, i) the rank of item i in user u's predicted list and by y_u^test the set of items in the held-out test set for user u.

• Recall@k: For each user u, Recall@k is computed as follows:

  Recall@k = Σ_{i ∈ y_u^test} 1{rank(u, i) ≤ k} / min(k, |y_u^test|)

  where 1{·} is the indicator function. In all our experiments we report both k = 20 and k = 50. We do not report Precision@k due to the noisy nature of implicit feedback data: even if an item i ∉ y_u^test, it is possible that the user will consume it in the future. This makes Precision@k less interpretable, since it is prone to fluctuations.
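Recall@k as defined here can be computed directly from the ranked list (our own sketch):

```python
def recall_at_k(ranked, test_items, k):
    """Recall@k: fraction of held-out items appearing in the top k,
    normalized by min(k, |y_u^test|) so a perfect list scores 1."""
    hits = sum(1 for i in ranked[:k] if i in test_items)
    return hits / min(k, len(test_items))

# Toy example: 2 of the 3 held-out items are ranked in the top 20.
ranked = [5, 9, 2, 7] + list(range(100, 116))  # 20 ranked items
recall = recall_at_k(ranked, test_items={9, 7, 42}, k=20)
```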

• MAP@k: Mean average precision computes the mean of users' average precision. The (truncated) average precision for user u is:

  Average Precision@k = Σ_{n=1}^{k} Precision@n / min(n, |y_u^test|)

^7 Hu et al. [8] propose the mean percentile rank (MPR) as an evaluation metric for implicit feedback recommendation. Denote by perc(u, i) the percentile ranking of item i within the ranked list of all items for user u; that is, perc(u, i) = 0% if item i is ranked first in the list, and perc(u, i) = 100% if item i is ranked last. Then MPR is defined as:

  MPR = Σ_{u,i} y_{ui}^test · perc(u, i) / Σ_{u,i} y_{ui}^test

We do not use this metric because MPR penalizes ranks linearly: if one algorithm ranks all the held-out test items in the middle of the list, while another ranks half of the held-out test items at the top and the other half at the bottom, both algorithms will obtain similar MPR, even though the second algorithm is clearly more desirable. The metrics we use here will clearly prefer the second algorithm.

               TPS             Mendeley        Gowalla         ArXiv
               WMF    ExpoMF   WMF    ExpoMF   WMF    ExpoMF   WMF    ExpoMF
  Recall@20    0.195  0.201    0.128  0.139    0.122  0.118    0.143  0.147
  Recall@50    0.293  0.286    0.210  0.221    0.192  0.186    0.237  0.236
  NDCG@100     0.255  0.263    0.149  0.159    0.118  0.116    0.154  0.157
  MAP@100      0.092  0.109    0.048  0.055    0.044  0.043    0.051  0.054

Table 2: Comparison between WMF [8] and ExpoMF. While the differences in performance are generally small, ExpoMF performs comparably to or better than WMF across datasets.
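The truncated average precision of Section 5.3 can be implemented exactly as defined (our own sketch; note that, as written, the sum is not explicitly normalized to [0, 1]):

```python
def average_precision_at_k(ranked, test_items, k):
    """Truncated average precision: sum of Precision@n over the
    top-k positions, each divided by min(n, |y_u^test|)."""
    hits, total = 0, 0.0
    for n, item in enumerate(ranked[:k], start=1):
        if item in test_items:
            hits += 1
        precision_at_n = hits / n
        total += precision_at_n / min(n, len(test_items))
    return total

# Toy example: held-out items {1, 3} ranked at positions 1 and 3.
ap = average_precision_at_k([1, 2, 3], test_items={1, 3}, k=3)
```

MAP@k is then the mean of this quantity across users.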

• NDCG@k: Emphasizes the importance of the top ranks by logarithmically discounting ranks. NDCG@k for each user is computed as follows:

  DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log2(i + 1);  NDCG@k = DCG@k / IDCG@k

  IDCG@k is a normalization factor that ensures NDCG lies between zero and one (perfect ranking). In the implicit feedback case the relevance is binary: rel_i = 1 if i ∈ y_u^test, and 0 otherwise. In our study we always report the average NDCG across users.

For the ranking-based measures, in all experiments we set k = 100, which is a reasonable number of items to consider for a user. Results are consistent when using other values of k.
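The binary-relevance NDCG@k used here can be sketched as (our own illustration):

```python
import math

def ndcg_at_k(ranked, test_items, k):
    """NDCG@k with binary relevance: rel_i = 1 if the item at rank i
    is in the user's held-out set. IDCG@k is the DCG of an ideal
    ranking that places all held-out items first."""
    def dcg(rels):
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(rels))
    rels = [1 if item in test_items else 0 for item in ranked[:k]]
    ideal = [1] * min(k, len(test_items))
    return dcg(rels) / dcg(ideal)

# Toy example: the single held-out item sits at rank 2 of 3.
score = ndcg_at_k([10, 42, 7], test_items={42}, k=3)
```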

5.4 Baselines

We compare ExpoMF to weighted matrix factorization (WMF), the standard state-of-the-art method for collaborative filtering with implicit data [8]. WMF is described in Section 2.

We also experimented with Bayesian personalized ranking (BPR) [23], a ranking model for implicit collaborative filtering. However, preliminary results were not competitive with the other approaches. BPR is trained using stochastic optimization, which can be sensitive to hyper-parameter values (especially hyper-parameters related to the optimization procedure). A more exhaustive search over hyper-parameters could yield more competitive results.

We describe baselines specific to modeling exposure covariates in their dedicated subsections.

5.5 Studying Exposure MF

Empirical evaluation. Results comparing ExpoMF to WMF on our four datasets are given in Table 2. Each metric is averaged across all users. We notice that ExpoMF performs comparably to or better than WMF on most datasets (the standard errors are on the order of 10^-4), though the difference in performance is small. In addition, higher values of NDCG@100 and MAP@100 (even when Recall@50 is lower) indicate that the top-ranked items from ExpoMF tend to be more relevant to users' interests.

Exploratory analysis. We now explore the posterior distribution of the exposure latent variables of two specific users from the TPS dataset. This exploration provides insights into how ExpoMF infers user exposure.

The top panel of Figure 2 shows the inferred exposure latent variable E[a_ui] corresponding to y_ui = 0 for user A. E[a_ui] is plotted against the empirical item popularity (measured by the number of times a song was listened to in the training set). We also plot the interpolated per-item consideration prior µ_i learned using Equation 7. There is a strong relationship between song popularity and consideration (this is true across users). User A's training data revealed that she has only listened to songs by either Radiohead or Interpol (both alternative rock bands). Therefore, for most songs, the model infers that the probability of user A considering them is higher than the inferred prior, i.e., it is more likely that user A did not want to listen to them (they are true zeros). However, as highlighted by the rectangular box, there are a few "outliers", mostly songs by Radiohead and Interpol that user A did not listen to (some of them are in fact held out in the test set). Effectively, a posterior E[a_ui] lower than the prior indicates that the model downweights these unlistened songs more. In contrast, WMF downweights all songs uniformly.

A second example is shown in the bottom panel of Figure 2. User B mostly listens to indie rock bands (e.g., Florence and the Machine, Arctic Monkeys, and The Kills). "Dog Days Are Over" by Florence and the Machine is the second most popular song in this dataset, behind "Fireworks" by Katy Perry. These two songs correspond to the two rightmost dots in the figure. Given the user's listening history, the model clearly differentiates these two similarly popular songs. The fact that user B did not listen to "Dog Days Are Over" (again, in the test set) is more likely due to her not having been exposed to it. In contrast, the model infers that the user probably did not like "Fireworks" even though it is popular.

5.6 Incorporating Exposure Covariates

Figure 2 demonstrates that ExpoMF strongly associates user exposure with item popularity. This is partly due to the fact that the model's prior is parametrized with a per-item term µ_i.

Here we are interested in using exposure covariates to provide additional information about the (likely) exposure of users to items (see Figure 1b).

Recall that the role of these exposure covariates is to allow the matrix factorization component to focus on items the user has been exposed to. In particular, this can be done in the model by upweighting (increasing the probability of exposure of) items that users were (likely) exposed to and downweighting items that were not. A motivating example with restaurant recommendations and New York City versus Las Vegas diners was discussed in Section 1.


[Figure 2 here: two scatter plots of the posterior of a_ui and the learned prior µ_i against empirical item popularity, one for "User A (Radiohead and Interpol listener)" and one for "User B (indie listener)".]

Figure 2: We compare the inferred posteriors of the exposure matrix for two users (denoted by blue dots) against the prior probability for exposure (red dashed line). On the top, user A is a fan of the bands Radiohead and Interpol. Accordingly, the model downweights unlistened songs from these two bands. User B has broader interests and notably enjoys listening to the very popular band Florence and the Machine. As with user A, unlistened tracks by Florence and the Machine get downweighted in the posterior.

In the coming subsections we compare content-aware and location-aware versions of ExpoMF, which we refer to as Content ExpoMF and Location ExpoMF respectively. Studying each model in its respective domain, we demonstrate that the exposure covariates improve the quality of the recommendations compared to ExpoMF with per-item µ_i.

5.6.1 Content covariates

Scientists, whether through a search engine, a personal recommendation, or other means, have a higher likelihood of being exposed to papers specific to their own discipline. In this section we study the use of paper content to guide inference of the exposure component of ExpoMF.

In this use case, we model user exposure based on the topics of articles. We use latent Dirichlet allocation (LDA) [3], a model of document collections, to model article content. Assuming there are K topics Φ = φ_{1:K}, each of which is a categorical distribution over a fixed vocabulary, LDA treats each document as a mixture of these topics, where the topic proportions x_i are inferred from the data. One can understand LDA as representing documents in a low-dimensional "topic" space, with the topic proportions x_i as their coordinates.

We use the topic proportions x_i learned from the Mendeley dataset as exposure covariates. Following the notation of Section 3.2, our hierarchical ExpoMF is:

  µ_ui = σ(ψ_u^T x_i + γ_u)

where we include a per-user bias term γ_u. Under this model, a molecular biology paper and a computer science paper that a computer scientist has not read will likely be treated differently: the model will consider that the computer scientist has been exposed to the computer science paper (thus a higher E[a_ui]) but not to the molecular biology paper (hence a lower E[a_ui]). The matrix factorization component of the model will then focus on modeling computer science papers, since they are more likely to have been exposed.

               ExpoMF  Content ExpoMF  CTR [26]
  Recall@20    0.139   0.144           0.127
  Recall@50    0.221   0.229           0.210
  NDCG@100     0.159   0.165           0.150
  MAP@100      0.055   0.056           0.049

Table 3: Comparison between Content ExpoMF and ExpoMF on Mendeley. We also compare to collaborative topic regression (CTR) [26], a model that makes use of the same additional information as Content ExpoMF.

Our model, Content ExpoMF, is trained following Algorithm 1. To update the exposure-related model parameters ψ_u and γ_u, we take mini-batch gradient steps with a batch size of 10 users and a constant step size of 0.5 for 10 epochs.
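As a hedged sketch (our own illustration, not the authors' code), one such gradient step on ψ_u and γ_u might look like the following, treating the current posterior expectations E[a_ui] as regression targets for the logistic exposure prior µ_ui = σ(ψ_u^T x_i + γ_u):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_exposure_params(psi_u, gamma_u, x, p_a, step=0.5):
    """One gradient-ascent step on the Bernoulli log-likelihood of
    the logistic prior mu_ui = sigmoid(psi_u . x_i + gamma_u),
    fitting the current posterior expectations p_a[i] = E[a_ui].
    Hypothetical sketch; variable names are ours."""
    grad_psi = [0.0] * len(psi_u)
    grad_gamma = 0.0
    for x_i, target in zip(x, p_a):
        mu = sigmoid(sum(p * v for p, v in zip(psi_u, x_i)) + gamma_u)
        err = target - mu  # gradient of the Bernoulli log-likelihood
        for k in range(len(psi_u)):
            grad_psi[k] += err * x_i[k]
        grad_gamma += err
    psi_u = [p + step * g / len(x) for p, g in zip(psi_u, grad_psi)]
    gamma_u = gamma_u + step * grad_gamma / len(x)
    return psi_u, gamma_u

# Toy example: K = 1 topic, two items with E[a_ui] = 1.
psi_u, gamma_u = update_exposure_params([0.0], 0.0, [[1.0], [1.0]], [1.0, 1.0])
```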

Study. We evaluate the empirical performance of Content ExpoMF and report results in Table 3. We compare to collaborative topic regression (CTR), a state-of-the-art method for recommending scientific papers [26] that combines LDA and WMF.^8 We did not compare with the more recent and scalable collaborative topic Poisson factorization (CTPF) [7], since the resulting performance differences might simply reflect CTPF's Poisson likelihood (versus the Gaussian likelihood of both ExpoMF and WMF).

^8 Note that to train CTR we first learned a document topic model, fixed it, and then learned the user preference model. Its authors suggested that this learning procedure provides computational advantages without significantly hindering performance [26].

Figure 3: We compare the inferred exposure posteriors of ExpoMF (top row) and Content ExpoMF (bottom row). On the left are the posteriors of user A, who is interested in statistical machine learning, while on the right user B is interested in computer systems research. Neither user has read the "Latent Dirichlet Allocation" paper. ExpoMF infers that both users have about equal probability of having been exposed to it. As we discussed in Section 5.5 (and demonstrated in Figure 2), this is mostly based on the popularity of this paper. In contrast, Content ExpoMF infers that user A has more likely been exposed to this paper because of the closeness between the paper's content and user A's interests. Content ExpoMF therefore upweights the paper. Given user B's interests, the paper is correctly downweighted by the model.

We note that CTR's performance falls in between that of ExpoMF and WMF (see Tables 2 and 3). CTR is particularly well suited to the cold-start case, which is not the data regime we focus on in this study (recall that we only kept papers that have been bookmarked by at least 20 users).

Figure 3 highlights the behavior of Content ExpoMF compared to that of regular ExpoMF. Two users are selected: user A (left column) is interested in statistical machine learning and Bayesian statistics; user B (right column) is interested in computer systems. Neither of them has read "Latent Dirichlet Allocation" (LDA), a seminal paper that falls within user A's interests. The top row shows the posterior of the exposure latent variables E[a_ui] for the two users, inferred from ExpoMF with per-item µ_i. LDA is shown as a white dot. Overall, both users' estimated exposures are dominated by the empirical item popularity.

In contrast, the bottom row plots the results of Content ExpoMF. Allowing the model to use the documents' content to infer user exposure offers greater flexibility compared to the simple ExpoMF model. This extra flexibility may also explain why there is an advantage in using the inferred exposure to predict missing observations (see Section 3.4): when exposure covariates are available, the model can better capture the underlying user exposure to items. In contrast, using the inferred exposure to predict with the simple ExpoMF model performs worse.

               WMF    ExpoMF  Location ExpoMF
  Recall@20    0.122  0.118   0.129
  Recall@50    0.192  0.186   0.199
  NDCG@100     0.118  0.116   0.125
  MAP@100      0.044  0.043   0.048

Table 4: Comparison between Location ExpoMF and ExpoMF with per-item µ_i on Gowalla. Using location exposure covariates outperforms the simpler ExpoMF and WMF according to all metrics.

5.6.2 Location covariates

When studying the Gowalla dataset we can use venue location as exposure covariates. Recall from Section 3.2 that location exposure covariates are created by first clustering all venues (using K-means) and then finding the representation of each venue in this clustering space. Similarly to Content ExpoMF (Section 5.6.1), Location ExpoMF departs from ExpoMF:

  µ_ui = σ(ψ_u^T x_i + γ_u)

where x_ik is venue i's expected assignment to cluster k and γ_u is a per-user bias term.^9
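The construction of these covariates can be sketched as follows; this is our own toy illustration using hard (one-hot) cluster assignments rather than the expected assignments described above:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(g):
    return (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means on 2-D venue coordinates (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[j].append(p)
        centers = [mean(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def venue_covariates(points, centers):
    """One-hot cluster-assignment vector x_i for each venue."""
    xs = []
    for p in points:
        j = min(range(len(centers)), key=lambda c: dist2(p, centers[c]))
        x = [0.0] * len(centers)
        x[j] = 1.0
        xs.append(x)
    return xs

# Toy example: two well-separated pairs of venues, k = 2 clusters.
venues = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers = kmeans(venues, k=2)
xs = venue_covariates(venues, centers)
```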

Study. We train Location ExpoMF following the same procedure as Content ExpoMF. We report the empirical comparison between WMF, ExpoMF, and Location ExpoMF in Table 4. We note that Location ExpoMF outperforms both WMF and the simpler version of ExpoMF.

^9 We named Content ExpoMF and Location ExpoMF differently to make it clear to the reader that they condition on content and location features respectively. The two models are in fact mathematically equivalent.

For comparison purposes we also developed a simple baseline, FilterWMF, which makes use of the location covariates. FilterWMF filters out venues recommended by WMF that are inaccessible (too far away) for the user. Since user location is not directly available in the dataset, we estimate it using the geometric median of all the venues the user has checked into. The median is preferable to the mean because it handles outliers better and is more likely to pick a typical visit location. However, the results of this simple FilterWMF baseline are worse than those of regular WMF. We attribute this poor performance to the fact that a single location focus is too strong an assumption to capture users' visit behavior well. In addition, since we randomly split the data, it is possible that a user's checkins in city A and city B are split between the training and test sets. We leave the exploration of better location-aware baselines to future work.
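The geometric median has no closed form; a standard way to approximate it is Weiszfeld's algorithm. A small sketch (our own illustration, not the paper's code):

```python
import math

def geometric_median(points, iters=100, eps=1e-9):
    """Approximate the geometric median of 2-D points with
    Weiszfeld's algorithm: iteratively re-weight points by the
    inverse of their distance to the current estimate."""
    x = sum(p[0] for p in points) / len(points)  # start at the mean
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < eps:  # estimate coincides with a data point
                continue
            w = 1.0 / d
            num_x += w * px
            num_y += w * py
            denom += w
        if denom == 0.0:
            break
        x, y = num_x / denom, num_y / denom
    return x, y

# Checkins clustered near (0, 0) with one far outlier: the median
# stays near the cluster, while the mean would be pulled away.
med = geometric_median([(0, 0), (0, 1), (1, 0), (10, 10)])
```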

6. CONCLUSION

In this paper, we presented a novel collaborative filtering mechanism that takes into account user exposure to items. In doing so, we theoretically justify existing approaches that downweight unclicked items for recommendation, and provide an extendable framework for specifying more elaborate models of exposure based on logistic regression. In empirical studies we found that the additional flexibility of our model helps it outperform existing approaches to matrix factorization on four datasets from various domains. We note that the same approach can also be used to analyze explicit feedback.

There are several promising avenues for future work. Consider a reader who keeps up to date with the "what's new" pages of a website, or a tourist visiting a new city looking for a restaurant recommendation. The exposure processes are more dynamic in these scenarios and may differ between training and test time. We therefore seek new ways to capture exposure that include ever more realistic assumptions about how users interact with items.

References

[1] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pages 591-596, 2011.

[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[4] L. Bottou, J. Peters, J. Quiñonero Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207-3260, 2013.

[5] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1082-1090. ACM, 2011.

[6] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

[7] P. K. Gopalan, L. Charlin, and D. Blei. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems, pages 3176-3184, 2014.

[8] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), pages 263-272. IEEE, 2008.

[9] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

[10] H. Ishwaran and J. S. Rao. Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, pages 730-773, 2005.

[11] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, Aug. 2009.

[12] D. Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1):1-14, 1992.

[13] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661-670. ACM, 2010.

[14] G. Ling, H. Yang, M. R. Lyu, and I. King. Response aware model-based collaborative filtering. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 501-510, 2012.

[15] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, New York, NY, USA, 1986.

[16] B. M. Marlin, R. S. Zemel, S. T. Roweis, and M. Slaney. Collaborative filtering and the missing at random assumption. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 267-275, 2007.

[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[18] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257-1264, 2007.

[19] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355-368. Springer, 1998.

[20] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.

[21] S. Rendle. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, pages 995-1000, 2010.

[22] S. Rendle and C. Freudenthaler. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 273-282. ACM, 2014.

[23] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452-461, 2009.

[24] D. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688-701, 1974.

[25] A. Swaminathan and T. Joachims. Counterfactual risk minimization. In Proceedings of the 24th International Conference on World Wide Web (WWW Companion), pages 939-941, 2015.

[26] C. Wang and D. Blei. Collaborative topic modeling for recommending scientific articles. In Knowledge Discovery and Data Mining, 2011.

[27] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pages 2764-2770, 2011.