
Style Conditioned Recommendations

Murium Iqbal
Overstock
Midvale, Utah

Kamelia Aryafar
Overstock
Midvale, Utah

Timothy Anderton
Overstock
Midvale, Utah

ABSTRACT
We propose Style Conditioned Recommendations (SCR) and introduce style injection as a method to diversify recommendations. We use a Conditional Variational Autoencoder (CVAE) architecture, where both the encoder and decoder are conditioned on a user profile learned from item content data. This allows us to apply style transfer methodologies to the task of recommendations, which we refer to as injection. To enable style injection, user profiles are learned to be interpretable such that they express users’ propensities for specific predefined styles. These are learned via label-propagation from a dataset of item content, with limited labeled points. To perform injection, the condition on the encoder is learned while the condition on the decoder is selected per explicit feedback. Explicit feedback can be taken either from a user’s response to a style or interest quiz, or from item ratings. In the absence of explicit feedback, the condition at the encoder is applied to the decoder. We show a 12% improvement on NDCG@20 over the traditional VAE based approach and an average 22% improvement on AUC across all classes for predicting user style profiles against our best performing baseline. After injecting styles we compare the user style profile to the style of the recommendations and show that injected styles have an average +133% increase in presence. Our results show that style injection is a powerful method to diversify recommendations while maintaining personal relevance. Our main contribution is an application of a semi-supervised approach that extends item labels to interpretable user profiles.

CCS CONCEPTS
• Information systems → Recommender systems; Personalization; • Computing methodologies → Learning from implicit feedback; Semi-supervised learning settings;

KEYWORDS
Product Recommendation; Variational Autoencoders; Style Transfer

ACM Reference format:
Murium Iqbal, Kamelia Aryafar, and Timothy Anderton. 2019. Style Conditioned Recommendations. In Proceedings of Thirteenth ACM Conference on Recommender Systems, Copenhagen, Denmark, September 16–20, 2019 (RecSys ’19), 9 pages.
DOI: 10.1145/3298689.3347007

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
RecSys ’19, Copenhagen, Denmark
© 2019 ACM. 978-1-4503-6243-6/19/09...$15.00
DOI: 10.1145/3298689.3347007

1 INTRODUCTION
Recommendation systems have a strong popular item bias [1], due to reliance on implicit feedback data. This manifests as the most commonly interacted with items being continually recommended, even when more relevant items are present in the catalog. This can cause users to become stuck in a recommendation filter bubble [19], a phenomenon which causes them to be continually exposed to the same types of items, leading to an unpleasant, redundant user experience. Item content data can improve recommendations by allowing unpopular items to gain exposure. This data contains faults, though, in the form of false item descriptions or poor quality images. Incorporation of item content data thus risks exposing irrelevant items. Hybrid recommendation systems, which make use of item content data and implicit feedback data, are still subject to popular item bias. Using explicit feedback from the user, such as ratings and reviews, can produce results with higher diversity which are more relevant to the user, especially for items which have sparse click data [4]. Unfortunately, many production recommendation systems cannot leverage this finding as explicit feedback data is limited, requiring manual labeling by users. Our goal is to create a system that can leverage all available data to generate diverse, relevant recommendations which are personalized to a user.

We introduce Style Conditioned Recommendations (SCR), which extends Variational Autoencoder (VAE) recommendations by employing the Conditional VAE (CVAE) architecture. VAE recommendations use implicit feedback data to perform collaborative filtering. We incorporate content data into this approach by introducing a condition on the VAE which is learned from item content data. We refer to this as the user style profile. In this context style refers to a grouping or genre of interest. In the context of movies, for example, style could be Romance, Comedy, Horror, etc., while for furniture style could be Modern, Traditional, etc. The model is composed of two parts. The first is an encoder which takes item content data as input, aggregates it into a user content representation and infers user style profiles. We refer to this portion of the network as the text encoder. The second is a CVAE which takes the user item click matrix and the learned user style profiles and generates recommendations. We refer to this as the click VAE.

To leverage explicit feedback data we introduce style injection. This is a novel application of style-transfer techniques, common in computer vision, to the task of recommendations. Styles are injected into recommendations by allowing the user style profile at the encoder to be learned over the data, but selecting the user style profile at the decoder per explicit feedback. This conditions the reconstruction on the new style, and injects it into the recommendations, allowing for incorporation of explicit feedback from a user. Explicit feedback can be incorporated in two ways. The first is by gathering user interests in styles by employing style quizzes, or having users self-identify in interest groups. The second method uses

arXiv:1907.12388v2 [cs.IR] 5 Aug 2019

RecSys ’19, September 16–20, 2019, Copenhagen, Denmark Murium Iqbal, Kamelia Aryafar, and Timothy Anderton

user item ratings. If a user highly rates some items in the catalog, these items can be used to generate a new user profile on which the user recommendations can be conditioned. In the absence of explicit feedback, the user profile can remain unchanged at the decoder, and the system is still able to produce recommendations based only on the implicit feedback and item content data.

To allow for selection of user profiles for injection, we require that user profiles be interpretable. We define interpretability such that each dimension of the user profile represents a pre-defined style. This enables selection, as style quizzes can be made to specifically enquire about styles selected for the user profile. The outputs of the quiz can then be directly mapped to a corresponding user profile with which to condition the decoder. We enforce interpretability on the style profiles by employing a label propagation term. User style profiles are learned from item content data containing limited manually verified style labels. We create a dataset of user content representations with labeled style profiles by sampling the labeled items from our training dataset of user clicks. The result is a dataset which is representative of user click patterns but also bears the style labels.

Throughout this paper we use e-commerce as a case study to frame the problem and examine results. However, it should be noted that the proposed methods are generalizable to all settings that make use of personalized recommendations and are not necessarily limited to e-commerce. Our main contribution is an extension of an unsupervised approach using VAEs to semi-supervised settings. We do so by learning interpretable user style profiles from a limited set of item labels and employing CVAE architecture. This enables our novel style injection approach to recommendations.

2 RELATED WORK
Collaborative filtering based recommendations are widely used in industry due to ease of implementation, scalability, high performance and a large volume of academic work [14, 20, 24]. These models work on implicit feedback data to generate recommendations, i.e. they take a user-item click matrix as input. For understanding genres or styles, the most popular methods involve topic modeling [7, 8]. These models often make use of content data to generate recommendations, i.e. they use item text descriptions or item images as input. Hybrid recommendation systems make use of both implicit feedback data and content data [2, 21, 26]. These systems provide higher coverage, leveraging the content data for items with a low volume of user interactions, but still maintain high accuracy for popular items. We approach hybrid recommendations by learning representations over different modalities of data and concatenating them together as input to a generative process to obtain recommendations [5]. This allows us to design the different portions of our network for the specific modality of data they are meant to process. We differ from other works which use this approach in that we learn both representations at the user level to enable style injection. This adds the constraint that representations learned over the content data must be interpretable.

Some personalized recommendation systems aim to learn user profiles in an unsupervised manner from content data [15], but these profiles are not directly interpretable. Others build user profiles by incorporating non-platform specific data, such as user locations or information about the user’s social network [23, 27]. This requires harvesting data from outside the platform to build the user profile, which is not always readily available. Recent work on learning style aware recommendations makes use of visual data to learn style, unsupervised, at the item level [9, 17]. These representations are not directly interpretable either, with even topic modeling based approaches requiring manual interpretation and labeling of the learned topics. Item level style can also be learned to be interpretable with supervised learning [3, 6], but this requires a large corpus of labeled image data. To the best of our knowledge, a semi-supervised approach has not yet been applied to learning interpretable style profiles at the user level.

Variational Autoencoders (VAEs) [10] have recently been applied to the task of generating online recommendations from implicit feedback data (click data) [13]. VAEs are a natural choice for recommendations as they can rapidly perform collaborative filtering, one of the most popular recommendation methodologies [22], and impute missing values in user item interaction data. Further approaches look at incorporating content information to create hybrid recommendations by using separate VAEs for the content data [12], but learn representations primarily at the item level, combining item level representations to avoid the cold start problem. Further works look at Conditional VAE (CVAE) architecture for recommendations as well as joint VAE (JVAE) [11], but do not look at extensions for interpretable encodings. Furthermore, none of these systems make use of style transfer via VAEs for the purpose of recommendations. Style transfer is a technique commonly used in computer vision to impose learned styles of one image onto another. This is done in VAEs by conditioning the encoder and decoder on an external label, allowing reconstructions to be formed under different selected conditions [16]. As far as we are aware, this methodology has not previously been applied to recommendations.

Generating new items from users’ preferences has also been studied in recent work [9, 28]. In these methods new items or images are generated that do not necessarily exist on the platform. The methods can therefore be useful for inspiring designers and sellers, but are not immediately impactful for many recommendation platforms as the generated images are not of existing items. For example, generating new cover images for movies will not enable further personalization on media streaming platforms, as the generated images won’t correspond to an existing item. Our system instead creates new recommendations of existing items based on a user’s history and new preferences, allowing the user to explore the catalog of items under different assumed user style profiles.

3 METHOD
In this section we introduce the framework and architecture of SCR as well as the methods which enable style injection. We use the following notation: bold faced lettering, X, represents a matrix, and bold faced lettering with an accompanying arrow, \vec{x}, represents a vector. Subscript T indicates content data and related components and encodings, while subscript C indicates click data and related components and encodings. The letter z represents a latent representation and the letters x and v represent raw input. We use a hat to indicate a reconstruction, such as \hat{x}.


3.1 Click VAE
The input to the VAE is a user click matrix X_C with dimensionality U × I. U is the number of users represented in the dataset and I is the number of items. The model learns compressed latent representations, Z_C, of the input and uses these to reconstruct the input matrix and impute missing values. Recommendations are then obtained by taking all values within the reconstruction which exceed a threshold as recommendations, or by taking the top-N items by value per row in the reconstruction as recommendations for the respective users.
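The two ways of reading recommendations off a reconstructed row can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the function names are hypothetical, and excluding items the user has already clicked is an assumption that the text does not state.

```python
def top_n_recommendations(scores, clicked, n):
    """Return the indices of the n highest-scoring items not already clicked.

    scores  -- one row of the reconstructed click matrix (list of floats)
    clicked -- set of item indices the user has already interacted with
               (filtering these out is an assumption of this sketch)
    """
    candidates = [i for i in range(len(scores)) if i not in clicked]
    # Sort by descending score; ties keep the lower item index first.
    candidates.sort(key=lambda i: (-scores[i], i))
    return candidates[:n]


def threshold_recommendations(scores, clicked, threshold):
    """Return every unclicked item whose reconstructed score exceeds threshold."""
    return [i for i, s in enumerate(scores) if s > threshold and i not in clicked]


reconstruction_row = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
already_clicked = {1}
top2 = top_n_recommendations(reconstruction_row, already_clicked, 2)
above = threshold_recommendations(reconstruction_row, already_clicked, 0.15)
```

Top-N yields a fixed-length list per user, whereas thresholding yields a variable number of items depending on how confident the reconstruction is.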

The distribution over the latent representations, Z_C, must be selected such that the probability of our observations, p(X_C), also known as the evidence, is maximized. This requires us to compute the posterior, p(Z_C | X_C), so that the distribution of the latent variable is conditioned on the observations, allowing us to select the most relevant configuration for p(Z_C). This in turn requires calculation of p(X_C) itself, which is intractable, as marginalizing over all possible configurations of the latent variable Z_C is prohibitively expensive. As direct calculation of the true posterior distribution, p(Z_C | X_C), is intractable, an approximation to the true posterior, q(Z_C | X_C), is used. To ensure that this approximation follows our assumed distribution over the posterior p(Z_C | X_C), we seek to minimize the Kullback-Leibler divergence, KL, between the two distributions.

\[
\mathrm{KL}\big(q(Z_C \mid X_C)\,\|\,p(Z_C \mid X_C)\big) = \mathbb{E}[\log q(Z_C \mid X_C)] - \mathbb{E}[\log p(X_C, Z_C)] + \log p(X_C) \tag{1}
\]

Rearranging these terms leads to an objective function, known as the Evidence Lower BOund (ELBO), which is necessarily no greater than log p(X_C), as the KL term is always non-negative. Maximizing the ELBO allows us to maximize p(X_C) without directly calculating it. We define the ELBO as:

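\[
\begin{aligned}
\mathrm{ELBO} &= \mathbb{E}[\log p(X_C, Z_C)] - \mathbb{E}[\log q(Z_C \mid X_C)] \\
&= \mathbb{E}[\log p(X_C, Z_C)] - \mathrm{KL}\big(q(Z_C \mid X_C)\,\|\,p(Z_C)\big)
\end{aligned}
\tag{2}
\]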
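The rearrangement of Equation 1 that yields the bound can be written out in one step. The worked form below only moves terms of Equation 1 across the equality, with expectations taken under q(Z_C | X_C); the underbrace label is added for illustration.

```latex
\begin{align*}
\log p(X_C)
  &= \underbrace{\mathbb{E}[\log p(X_C, Z_C)] - \mathbb{E}[\log q(Z_C \mid X_C)]}_{\mathrm{ELBO}}
   + \mathrm{KL}\big(q(Z_C \mid X_C)\,\|\,p(Z_C \mid X_C)\big) \\
  &\ge \mathrm{ELBO},
  \qquad \text{since } \mathrm{KL}(\cdot\,\|\,\cdot) \ge 0 .
\end{align*}
```

The gap between log p(X_C) and the ELBO is exactly the KL divergence of Equation 1, so tightening the bound and improving the posterior approximation are the same operation.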

The two terms in the ELBO can be seen as a reconstruction loss and a regularization, enforcing some prior, p(Z_C), on the latent representations. For the purposes of recommendations, a multinomial is assumed at the output of the generator [13]. This enables a list-wise approach for the recommendations, as the items must compete for limited probability mass, preventing the model from giving over-confident results. As such, this first term is taken as a cross-entropy over the softmax, σ, of the outputs. The second term is taken as the KL divergence between the approximate multivariate Gaussian and a standard normal Gaussian prior assumed as the true distribution of the latent variables, p(Z_C).

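\[
\mathcal{L} = \sum \left( X_C \times \log(\hat{X}_C) \right) - 0.5 \times \left( \mu^{2} - I - \log\det(\Sigma) + \operatorname{tr}(\Sigma) \right) \tag{3}
\]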
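The two terms of Equation 3 can be evaluated for a single user as in the sketch below. It assumes a diagonal covariance (so the determinant and trace reduce to per-dimension variances), which the text does not state explicitly, and all function names are hypothetical. The `beta` argument corresponds to the KL scaling factor discussed in Section 4.1.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of decoder outputs."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]


def multinomial_log_likelihood(clicks, logits):
    """Sum of x_i * log(softmax(logits)_i) over all items for one user."""
    return sum(x * lp for x, lp in zip(clicks, log_softmax(logits)))


def gaussian_kl(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), assuming a diagonal covariance."""
    return 0.5 * sum(m * m + v - 1.0 - math.log(v) for m, v in zip(mu, var))


def elbo(clicks, logits, mu, var, beta=1.0):
    """Reconstruction term minus the (optionally beta-scaled) KL term."""
    return multinomial_log_likelihood(clicks, logits) - beta * gaussian_kl(mu, var)


clicks = [1.0, 0.0, 1.0, 0.0]
logits = [2.0, 0.0, 1.0, -1.0]
value = elbo(clicks, logits, mu=[0.0, 0.0], var=[1.0, 1.0])
```

Because the softmax normalizes across all items, raising the score of one item necessarily lowers the log-probability of the others, which is the list-wise competition described above.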

Figure 1: A diagram of the click VAE of our model. The input is a sparse matrix of user item clicks and the learned user profile as output by the text encoder. The network uses tanh activations between layers and a softmax activation at the output, which allows for a list-wise approach to recommendations where each item has to compete with others for limited probability mass.

In VAEs, we parameterize the inference distribution and the generative distribution by an encoder and decoder respectively. To allow for sampling of the latent user click representation while still allowing for back propagation, the reparameterization trick is applied [10]; as such, the encoder provides the parameters which dictate the distributions over each element of Z_C. These are the mean, µ, and the variance, Σ. Sampling over a normal Gaussian is done outside of the network, and provided as another input, ϵ. The terms are then combined to obtain the latent representations.

\[
\epsilon \sim \mathcal{N}(0, 1), \qquad Z_C = \mu + \Sigma^{\frac{1}{2}} \circ \epsilon \tag{4}
\]
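Equation 4 can be illustrated for one user as below. This is a sketch with a diagonal variance vector standing in for Σ (an assumption), and the function name is hypothetical. The noise is drawn outside the deterministic computation, which is what keeps µ and Σ differentiable in a real network.

```python
import math
import random


def reparameterize(mu, var, rng):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, 1), element-wise."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.sqrt(v) * e for m, v, e in zip(mu, var, eps)]
    return z, eps


rng = random.Random(0)       # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]        # encoder means
var = [0.25, 1.0, 0.0]       # encoder variances (diagonal of Sigma)
z, eps = reparameterize(mu, var, rng)
```

A dimension with zero variance returns its mean exactly, which makes the role of Σ as a per-dimension noise scale easy to see.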

The symbol ∘ represents the Hadamard product.

This latent representation is provided as input to the decoder, which produces recommendations. Our system furthers this VAE approach by conditioning both the distribution of clicks and their latent representations on a user profile, Z_T. We will refer to the VAE portion of our network as the click VAE, a diagram of which is provided in Figure 1. The updated ELBO reflecting the conditioning is expressed as

\[
\mathrm{ELBO} = \mathbb{E}[\log p(X_C, Z_C, Z_T)] - \mathrm{KL}\big(q(Z_C \mid X_C, Z_T)\,\|\,p(Z_C)\big) \tag{5}
\]

3.2 Text Encoder
This conditioning is achieved in the VAE by simply concatenating the user profile Z_T row-wise to both the input of the encoder, X_C, and the input of the decoder, Z_C. This user profile is learned from a representation of content data, X_T. The user’s content representation is fed into a multi-layer perceptron, similar to the encoder of the click VAE, to obtain the user profiles, Z_T. We will refer to this encoder as the text encoder, a diagram of which is provided in Figure 2 depicting the size and structure of the network.
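The concatenation step can be sketched with toy matrices. The sketch reads "row-wise" as appending each user's profile to that user's row, which is an interpretation rather than something the text spells out; `concat_columns` and the toy values are hypothetical.

```python
def concat_columns(left, right):
    """Append each row of `right` to the matching row of `left`."""
    if len(left) != len(right):
        raise ValueError("both matrices need one row per user")
    return [list(a) + list(b) for a, b in zip(left, right)]


# Toy batch: 2 users, 4 items, 3 styles, 2 latent dimensions.
x_c = [[1, 0, 0, 1], [0, 1, 0, 0]]        # user-item clicks
z_t = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.4]]  # user style profiles
z_c = [[0.3, -1.2], [0.8, 0.1]]           # latent click representations

encoder_input = concat_columns(x_c, z_t)  # users x (items + styles)
decoder_input = concat_columns(z_c, z_t)  # users x (latent + styles)
```

Because the same Z_T is appended at both points, swapping the copy given to the decoder is all that style injection needs to change.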

Users’ content representations are obtained by averaging over the content representations of the items with which they have interacted, M_T. Item content data can be provided in any vector form, such as embeddings of item text or item images. Our dataset contains item representations built on item text. Item text is taken as item name and item attributes, which are short form tags applied to items describing aspects such as color, material, or size. This text is stripped of stop words, stemmed and tokenized. Each token is then passed through a pre-trained word2vec [18] model to obtain a word embedding. The item representation is taken as the average of these embeddings.

Figure 2: A diagram of the text encoder of our model. The input is a dense matrix of content embeddings. The network uses ReLU activations between layers and a sigmoid activation at the output. The learned user profile represents the probability of the user’s interest in each style.

Taking an average over all items a user has interacted with yields flatter distributions for users with many interactions than for those with fewer interactions. Though the majority of users have few item interactions, the distribution has a long heavy tail, containing users with more than ten times the average number of interactions. This causes distributions dependent on averaging to shift depending on the length of a user’s browsing history. To avoid this issue, each user is represented by the average of only a fixed small number k of items with which they have interacted. These items are selected at random each epoch, with k < N, where N is the minimum number of items each user has clicked. The process to generate the user content vectors is documented in Algorithm 1. The result of this process is a U × D matrix, where U is the number of users (the same as for the click matrix, X_C) and D is the dimensionality of the embedding space. This matrix is then used as input for the text encoder, built of a simple MLP structure with 2 hidden layers. The resulting encoding, Z_T, is the user profile on which the click VAE is conditioned. A diagram of the process to sample and create recommendations with SCR is provided in Figure 3.

Algorithm 1 User content representation

procedure GET ITEM VECTOR(item Text)
    bag of words ← tokenize(item Text)
    bag of words ← stem(word) ∀ word ∈ bag of words
    W ← word2vec(word) ∀ word ∈ bag of words
    \vec{m}_T ← mean(W, axis=0)
    return \vec{m}_T

procedure GET USER TEXT VECTOR(User Clicks)
    items ← sample k of the n items in User Clicks
    M ← Get Item Vector(item) ∀ item ∈ items
    \vec{x}_T ← mean(M, axis=0)
    return \vec{x}_T

Figure 3: Generation of style conditioned recommendations.
(a) User click data is used to generate a sparse multi-class vector, \vec{x}_C. Items are sampled. Their text is used to produce content data representations from a word2vec model. These are averaged to obtain the user content vector, \vec{x}_T.
(b) The prepped user data is passed through the SCR. The output of the text encoder is used as input at both the click VAE’s encoder and decoder. This manifests as a skip connection.
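Algorithm 1 can be sketched in plain Python as follows. The toy embedding table is a hypothetical stand-in for a pre-trained word2vec model, the tokenizer is a simple lowercase split, and stop-word removal and stemming are omitted, so this illustrates only the sample-then-average structure.

```python
import random

# Hypothetical 2-dimensional embeddings standing in for a pre-trained word2vec model.
TOY_EMBEDDINGS = {
    "oak": [1.0, 0.0],
    "table": [0.0, 1.0],
    "velvet": [1.0, 1.0],
    "sofa": [0.0, 2.0],
    "lamp": [2.0, 0.0],
}


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def get_item_vector(item_text, embeddings):
    """Average the embeddings of the known tokens in an item's text."""
    tokens = item_text.lower().split()  # stop-word removal and stemming omitted
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return mean_vector(vectors)


def get_user_text_vector(clicked_texts, k, embeddings, rng):
    """Sample k clicked items without replacement and average their vectors."""
    sampled = rng.sample(clicked_texts, k)
    return mean_vector([get_item_vector(t, embeddings) for t in sampled])


rng = random.Random(7)
history = ["Oak Table", "Velvet Sofa", "Oak Lamp", "Table Lamp"]
x_t = get_user_text_vector(history, k=2, embeddings=TOY_EMBEDDINGS, rng=rng)
```

Fixing k means that a user with four clicks and a user with four hundred clicks both contribute an average over the same number of items, which is the length-of-history effect the sampling is meant to remove.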

Algorithm 2 Create Text Encoder Training Data

V_T ← []
S ← []
for i = 1...10 do
    for u ∈ Users do
        Clicks ← UserClicks ∩ ItemLabels
        items ← sample k of the n items in Clicks
        T ← Get Item Vector(item) ∀ item ∈ items
        \vec{v}_T ← mean(T, axis=0)
        \vec{s} ← mean(Item Label) ∀ item ∈ items
        \vec{s} ← [1 if s_j > θ else 0 ∀ s_j ∈ \vec{s}]; θ = 1/k
        V_T ← [V_T, \vec{v}_T]
        S ← [S, \vec{s}]
Text Encoder ← train Label Prop(V_T, S)


3.3 Shaping User Profiles
User profiles can be created to be interpretable, indicating interest in genres, styles, or color palettes. We define interpretable user profiles such that each dimension in the latent space of the profiles indicates a user’s probability of interacting with items of a specific type. Interpretable profiles enable style transfer techniques, common in computer vision, to be applied to recommendations. Style transfer via a conditional VAE is done by updating the condition at the decoder, allowing the structure captured by the encoding to be transferred to a different condition [16]. In recommendations, this allows user-item preferences, captured in the encoding, to be transferred from one taxonomy of items to another, from one genre of items to another, from one color palette to another, etc. This would manifest at the decoder as recommendations which are relevant to the customer based on their prior browsing history but which also contain items indicative of the new interest. For this reason we refer to this as "injection".

Although users are not labeled with style profiles in our dataset, some items have style labels. These labels are limited, with only roughly 2% of our items having reliable, manually validated labels. To create a dataset which can make use of these labels and enable learning profiles at the user level, we sample the training data in a similar fashion to that described in Algorithm 1, but we limit the sampled items to those which bear reliable style labels and obtain content representations V_T. We vectorize each item’s style label, and the corresponding user-level style profile is taken as the average of the labels for the associated items. This average is then thresholded to obtain a multi-class binary profile, S. The threshold is selected as 1/k, so that a label of 1 is only applied to styles which have at least one full item’s worth of mass in their vector. Not all users within our dataset have interacted with at least k of our labeled items. As such, to create a large, diverse corpus of training data, we sample our training set of user to click interactions multiple times to obtain V_T. The label propagation term is then taken as the categorical cross entropy between the true profile S and the profiles Z_V learned by taking V_T as the input to the text encoder.

\[
\mathrm{LabelProp} = -\sum S \times \log(Z_V) \tag{6}
\]

The process to generate the training data for the label propagationterm is documented in Algorithm 2.
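The label side of Algorithm 2 can be sketched as below. The sketch assumes multi-hot item labels and follows the strict comparison against θ = 1/k as written in the algorithm listing; the helper names and toy data are hypothetical, and the content vectors are left out (the sampled item ids are returned instead) to keep the example short.

```python
import random


def user_style_profile(item_labels, k):
    """Average k multi-hot item label vectors and binarise at theta = 1/k."""
    theta = 1.0 / k
    mean_label = [sum(col) / k for col in zip(*item_labels)]
    # Strict inequality, following the comparison written in Algorithm 2.
    return [1 if value > theta else 0 for value in mean_label]


def build_training_labels(user_clicks, labels_by_item, k, passes, rng):
    """Repeatedly sample k labeled items per user; emit (items, profile) pairs."""
    samples = []
    for _ in range(passes):
        for clicks in user_clicks:
            labeled = [item for item in clicks if item in labels_by_item]
            if len(labeled) < k:
                continue  # user has too few labeled clicks to sample from
            items = rng.sample(labeled, k)
            profile = user_style_profile([labels_by_item[i] for i in items], k)
            samples.append((items, profile))
    return samples


# Toy catalog with three styles; item "x" carries no style label.
labels_by_item = {"a": [1, 0, 0], "b": [1, 0, 0], "c": [0, 1, 0]}
user_clicks = [["a", "b", "c", "x"], ["c", "x"]]
samples = build_training_labels(
    user_clicks, labels_by_item, k=3, passes=2, rng=random.Random(0)
)
```

Repeating the outer loop gives each eligible user several differently sampled training rows, which is how a small labeled item set is stretched into a large user-level corpus.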

4 EXPERIMENTATION AND RESULTS
4.1 Experimental Setup
Our system is trained on a dataset of user clicks, item content data, and item style labels. We collect a user item click matrix over a span of 3 months. These are filtered to only include users who have interacted with at least 15 items, and items which have had at least 30 users interact with them. This yields a dataset of 177,415 users × 40,926 items, and a sparsity of 0.1576%. We have a sparse item style label matrix with dimension 8,136 items by 8 styles. These 8 styles are Mid-century Modern, Coastal, Rustic, Glam, Industrial, Bohemian, Modern and Traditional. Each item can be associated with multiple styles, with 13.7% of our items being associated with two to three styles, and the rest being associated with just one. Our item content data is taken as the word2vec¹ embedding of item title and item attributes. The word2vec embeddings yield a dense matrix with dimensions 40,926 items × 300 features.

Distribution of Style in Training Data
M-C.   Coast.  Rustic  Glam   Indust.  Boho.  Modr.  Trad.
0.147  0.048   0.115   0.058  0.043    0.090  0.302  0.170

Table 1: Fraction of training samples in which each style is present.

To allow for effective training over both loss functions, we freeze part of the network while the other part is actively trained. The datasets for each portion of the network are different, with the text encoder taking in text embeddings, V_T, and style labels, S, for training, and the click VAE taking text embeddings, X_T, and the user click matrix, X_C, as inputs for training. We first train the text encoder to learn the user style profiles by optimizing the label propagation term in Equation 6 while the click VAE portion of the network is frozen. The weights in the text encoder are then frozen, as the click VAE is trained to optimize the ELBO defined in Equation 5. Adding a scaling factor, β, to the KL term in the ELBO improves results when β < 1 [13]. We have empirically chosen to set β = 0.17.

4.2 Datasets
To validate the performance of SCR on the task of generating recommendations, we take a held-out set of 10,000 users for validation. The users taken for testing and validation have 20% of their item interactions masked. For validation, the unmasked 80% of clicks are used as input to SCR. The items sampled to create the content representation inputs for SCR are limited to the 80% of unmasked items as well. The model is scored on how well it can recover the masked 20% of clicks via normalized discounted cumulative gain (NDCG).
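The scoring step can be sketched as a binary-relevance NDCG@k, where a recommended item counts as relevant if it is among the user's masked clicks. Binary relevance and the log2 discount are assumptions of this sketch (they are the common convention for implicit feedback), and the function name is hypothetical.

```python
import math


def ndcg_at_k(ranked_items, held_out, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in held_out
    )
    ideal_hits = min(len(held_out), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranking = [10, 4, 7, 1, 3]   # recommended item ids, best first
masked = {4, 3, 99}          # the held-out 20% of this user's clicks
score = ndcg_at_k(ranking, masked, k=5)
```

A hit near the top of the list contributes more than the same hit further down, so the metric rewards ranking the masked clicks early rather than merely including them.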

We have a dataset of 8,135 items with style labels. We hold out one-sixth of the items, 1,356, for validation. Sampling to produce the dataset of labeled user style profiles is done only on the training set of user clicks. This allows us to use the validation set of users to examine results of style profiling and style injection. After sampling we have a dataset of 1,573,340 samples for training the user style profiles and samples for validation. The distribution of styles in the training dataset is presented in Table 1.

                 NDCG            Recall
              @20     @50     @20     @50
SCR           0.157   0.192   0.176   0.264
SCR w/o LP    0.157   0.191   0.176   0.263
VAE-CF        0.140   0.172   0.155   0.233
cSLIM         0.095   0.111   0.095   0.137

Table 2: Offline evaluation metrics for our proposed model, SCR, against SCR without the label propagation term, VAE-CF and cSLIM. We examine the performance of each with NDCG and Recall at 20 and 50. We show that SCR outperforms VAE-CF and cSLIM, and that addition of the label propagation term, which allows for style profiling and style injection, does not detrimentally affect performance on recommendations.

¹ https://spacy.io/usage/vectors-similarity


4.3 Performance on Recommendations
We compare SCR to VAE for collaborative filtering (VAE-CF) [13] and to SLIM [20], a form of matrix factorization. We select VAE-CF as our model is an extension of this recommendation system. We choose SLIM as it is a high-performance matrix factorization method. As is shown in Table 2, SCR outperforms both baselines, showing an improvement of +0.017 on NDCG@20 over the best performing baseline. It should be noted that the addition of the label propagation term neither hurts nor improves the results of SCR, with no change in performance at NDCG@20. We experimented further with SCR by assuming a Dirichlet prior [25] over the latent representations Z_C instead of the standard Gaussian. This did not show an improvement on the task of recommendations. We additionally experimented by training with a Gaussian prior but regularizing the latent space adversarially as in Adversarial Autoencoders (AAE) [16]. This showed an improvement over our results of +0.002 on NDCG@20 but took significantly longer to train, with the Gaussian and Dirichlet priors requiring roughly 60 epochs to converge and the AAE requiring nearly 300.
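The absolute gain of +0.017 on NDCG@20 over VAE-CF can be related to the 12% relative improvement quoted in the abstract with a short calculation, using the NDCG@20 values from Table 2:

```latex
\[
\frac{\mathrm{NDCG@20}_{\mathrm{SCR}} - \mathrm{NDCG@20}_{\mathrm{VAE\text{-}CF}}}
     {\mathrm{NDCG@20}_{\mathrm{VAE\text{-}CF}}}
= \frac{0.157 - 0.140}{0.140}
= \frac{0.017}{0.140}
\approx 0.121
\]
```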

4.4 Performance on User Style Profile Prediction

AUC
               M-C.    Coast.  Rustic  Glam    Indust. Boho.   Modr.   Trad.
SCR            0.964   0.946   0.962   0.975   0.979   0.936   0.926   0.948
SCR w/ Gauss   0.974   0.965   0.965   0.976   0.975   0.950   0.930   0.946
SCR w/ Dir.    0.958   0.953   0.959   0.985   0.979   0.936   0.932   0.940
LR             0.837   0.727   0.812   0.763   0.822   0.729   0.806   0.801
RF             0.671   0.612   0.700   0.684   0.687   0.593   0.667   0.663

Table 3: We show the performance of SCR at predicting user style profiles based on a held-out set of labeled style items. We compare its performance to that of a multi-class all-vs-one logistic regression (LR) and a multi-class random forest (RF).

We compare the performance of SCR to multiclass all-vs-one Logistic Regression. We use per category AUC to evaluate each model’s performance on predicting each style. Results for the held-out set are presented in Table 3. We select Logistic Regression as it is similar to the architecture we have chosen for the text encoder, which is a Multi-Layer Perceptron with a sigmoid activation at the output. We chose to compare to a Random Forest classifier as well, as it is a non-linear ensemble model. We allowed for 1000 trees in the Random Forest, with no maximum depth defined. Our best performing version of SCR, which assumes a Gaussian prior over the user style profiles, improves upon the best performing baseline by +0.172 on AUC on average across all styles.
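Per-category AUC can be computed directly from its pairwise definition, as in the sketch below. This is an illustrative pure-Python version (quadratic in the number of users, so only suitable for small examples), not the evaluation code used for Table 3; the function names and toy profiles are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def per_style_auc(true_profiles, predicted_profiles):
    """One AUC per style column, plus their unweighted mean."""
    per_style = [
        auc(list(true_col), list(pred_col))
        for true_col, pred_col in zip(zip(*true_profiles), zip(*predicted_profiles))
    ]
    return per_style, sum(per_style) / len(per_style)


true_profiles = [[1, 0], [0, 1], [1, 1], [0, 0]]
predicted = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]]
per_style, mean_auc = per_style_auc(true_profiles, predicted)
```

Scoring each style separately keeps a frequent style such as Modern from masking weak performance on a rare one such as Industrial.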

We experiment with the text encoder by also assuming different priors over it. The standard SCR approach assumes no prior. To allow for prior distributions, we perform the same reparameterization trick we use for the click VAE, with the text encoder producing the parameters for the distribution. Without a decoder, the loss consists only of the label propagation term and a KL term to enforce the assumed prior. We examine the results with a Gaussian prior and a Dirichlet prior. Both showed an improvement over SCR without any assumed prior over the user style profiles. The Gaussian prior shows a +0.006 improvement on AUC on average across all styles over standard SCR. The Dirichlet shows an average +0.001 improvement. Results for the two priors are presented in Table 3.

Figure 4: The distribution of styles across user profiles. The Modern style seems to be prevalent across the profiles, while the next most common style is Traditional.

Pearson Correlation Coeff
          M-C.     Coast.   Rustic   Glam     Indust.  Boho.    Modr.    Trad.
M-C.      1.0      −0.048   −0.135   −0.053   −0.016   −0.132   −0.142   −0.240
Coast.    −0.048   1.0      −0.003   −0.041   −0.043   0.026    −0.243   −0.020
Rustic    −0.135   −0.003   1.0      −0.100   0.132    −0.049   −0.352   −0.032
Glam      −0.053   −0.041   −0.100   1.0      −0.022   −0.011   −0.037   −0.071
Indust.   −0.016   −0.043   0.132    −0.022   1.0      −0.076   −0.146   −0.100
Boho.     −0.132   0.026    −0.049   −0.011   −0.076   1.0      −0.298   −0.014
Modr.     −0.142   −0.243   −0.352   −0.037   −0.146   −0.298   1.0      −0.289
Trad.     −0.240   −0.020   −0.032   −0.072   −0.100   −0.014   −0.289   1.0

Table 4: Displayed are the correlations between different styles as measured by the Pearson Correlation Coefficient. Most style pairs are at least weakly negatively correlated with one another (the exceptions being Rustic with Industrial and Coastal with Bohemian), indicating that presence of one style in the profile generally lowers the likelihood of high interest in another style.

We examine the distribution of learned style profiles, which show a bias towards Modern style and Traditional style as depicted in the distributions of learned user style profiles in Figure 4. This is expected, as the highest interacted with items in our dataset are Modern and Traditional items as tabulated in Table 1. We present the correlation between styles in learned user style profiles in Table 4.
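A style-by-style correlation matrix of the kind reported in Table 4 can be computed from a users × styles matrix of learned profiles as sketched here. The toy profiles and function names are hypothetical, and the sketch assumes every style column has non-zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def style_correlation_matrix(profiles):
    """Correlate every pair of style columns across users."""
    columns = [list(col) for col in zip(*profiles)]
    return [[pearson(a, b) for b in columns] for a in columns]


# Toy learned profiles: 4 users x 3 styles.
profiles = [
    [0.9, 0.1, 0.2],
    [0.7, 0.3, 0.1],
    [0.2, 0.8, 0.3],
    [0.1, 0.9, 0.2],
]
corr = style_correlation_matrix(profiles)
```

In the toy data the first two styles are exact complements, so their correlation is −1; real profiles, as in Table 4, show much weaker negative values.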

4.5 Style Injection
To experiment with style injection, users’ content vectors X_T and click vectors X_C are generated as described previously and passed through the text encoder and the encoder of the click VAE. At the decoder of the click VAE, the generated Z_T is replaced with a one-hot encoded vector in which only one of the styles is present. This allows a user’s recommendations to be injected with that style. A visual representation of style injection is presented in Figure 5. To validate that style injection performs as expected, we pass the top 20 recommendations from each injected set into the text encoder, allowing it to sample 5 items at random and produce a new profile. This lets us infer the style of the recommendations after the injection has been performed. We then subtract each user’s original style profile from the style profile learned from the injected recommendations. We


Figure 5: Results of injecting different styles into a user’s recommendations.

Figure 6: The change in style profiles after style injection has been performed. Each row in the figure represents a style being injected; each column shows the average change in that style’s value from the user’s original style profile.

display a heat map of the average changes in profiles in Figure 6. As can be seen, when a style is injected, the dimension corresponding to the injected style sees a large positive shift in the profile.
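The validation procedure above can be sketched as follows: build a one-hot condition for the decoder, then average the per-user difference between the profile inferred from the injected recommendations and the original profile. The style abbreviations follow Table 4; the helper names and the synthetic profiles (a fixed +0.3 shift on the injected style) are assumptions for illustration and do not reproduce our measured results.

```python
import numpy as np

STYLES = ["M-C.", "Coast.", "Rustic", "Glam", "Indust.", "Boho.", "Modr.", "Trad."]

def one_hot_style(style_index, n_styles):
    """Condition vector that replaces Z_T at the decoder during injection."""
    z = np.zeros(n_styles)
    z[style_index] = 1.0
    return z

def injection_shift(original_profiles, injected_profiles):
    """Average change in each style after injection.

    original_profiles: (n_users, n_styles)
    injected_profiles: (n_styles, n_users, n_styles); the first axis is the
        injected style, the last axis is the re-inferred profile.
    Returns (n_styles, n_styles): row = injected style, column = mean change.
    """
    return (injected_profiles - original_profiles[None, :, :]).mean(axis=1)

# Synthetic profiles standing in for text-encoder outputs.
rng = np.random.default_rng(0)
n_users, n_styles = 50, len(STYLES)
original = rng.random((n_users, n_styles)) * 0.5
injected = np.repeat(original[None, :, :], n_styles, axis=0)
for s in range(n_styles):
    injected[s, :, s] += 0.3  # pretend the injected style rises by 0.3

condition = one_hot_style(STYLES.index("Glam"), n_styles)
heat = injection_shift(original, injected)  # diagonal holds the injected shift
```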

5 DISCUSSION
In this section we discuss the results of our experimentation. We examine the effects of architectural decisions on recommendations, the effects of sampling on data distributions, and the performance of style injection.

Employing a CVAE architecture for recommendations affords us the flexibility to structure each encoder so that it performs best on the dataset it consumes. The click data, X_C, is sparse, while the content data, X_T, is dense. We found empirically that Tanh activations worked best for the sparse click data in the click VAE encoder, while ReLU activations worked best for the dense content data in the text encoder. Placing a dropout layer immediately at the input of the text encoder yielded the best style AUC scores, while moving the dropout to the input of the decoder, rather than the input of the entire click VAE, yielded the best NDCG validation results. Our system uses the label propagation term to learn style labels from text data, but any item representation could be substituted. We plan to experiment further with item representations built from images, or from both image and text data.

The CVAE architecture and sampling method allow us to incorporate temporal information into the recommendations at inference. The standard VAE approach to recommendations does not incorporate any temporal features; all user clicks are given equal weight, regardless of how long ago the interaction occurred. At training time our system follows this methodology and ignores any temporal information, sampling uniformly from the items to create the user content representation. At inference time, however, the user style profile can be produced from only the last five items a user has interacted with.


Figure 7: The variance of each feature in the content representation changes with the number of items sampled, with larger samples showing less variance. We therefore choose a sample size of 5, which carries enough information to capture a style but is not so large that variance in the underlying data is lost once the item embeddings are averaged.

This allows recommendations to be conditioned on the user’s most recent style profile, making the recommendations highly relevant to the user’s recent browsing history while still incorporating their entire history.
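The difference between training-time and inference-time construction of the user content vector can be sketched as below. The function name, the convention that click histories are ordered oldest to newest, and sampling without replacement are assumptions made for this example.

```python
import numpy as np

def user_content_vector(item_embeddings, clicked_items, rng=None, k=5):
    """Average the embeddings of k of a user's clicked items.

    item_embeddings: (n_items, dim) content embeddings.
    clicked_items: item indices ordered from oldest to newest click.
    rng: if given, sample k items uniformly (training-time behaviour);
         if None, take the k most recent items (inference-time behaviour).
    """
    clicked = np.asarray(clicked_items)
    if rng is not None:
        chosen = rng.choice(clicked, size=min(k, len(clicked)), replace=False)
    else:
        chosen = clicked[-k:]
    return item_embeddings[chosen].mean(axis=0)

# Toy embeddings: item i has embedding [2i, 2i + 1].
embeddings = np.arange(20, dtype=float).reshape(10, 2)
history = list(range(10))

training_vec = user_content_vector(embeddings, history, rng=np.random.default_rng(0))
inference_vec = user_content_vector(embeddings, history)  # uses items 5..9
```

Because only k embeddings are averaged at inference, the cost is independent of the length of the user's history.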

Sampling the items used to construct the user content data, instead of using all the items, allows us to ensure that the distributions provided for all users are similar. Figure 7 shows that the variance in the feature space of the embeddings diminishes as the number of items used to obtain the average increases, even though the number of users represented remains the same. Using just 5 items affords us high variance while still taking a representative sample of items. This ensures that the distribution of user content data is consistent across users. Furthermore, at inference time, accessing the full user click history and aggregating all item content data may be time consuming. Limiting content representations to depend on only 5 items enables online recommendations by preventing lengthy aggregation computations.
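The effect of sample size on variance can be reproduced on synthetic data. In the sketch below, every hypothetical user draws items from one shared pool of Gaussian embeddings, which is a simplification of real click histories; the pool size, embedding dimension, and sample sizes are all assumptions.

```python
import numpy as np

def mean_feature_variance(item_embeddings, n_users, sample_size, rng):
    """Variance across users of the averaged embedding, averaged over features."""
    idx = np.stack([
        rng.choice(len(item_embeddings), size=sample_size, replace=False)
        for _ in range(n_users)
    ])                                                  # (n_users, sample_size)
    user_vectors = item_embeddings[idx].mean(axis=1)    # (n_users, dim)
    return user_vectors.var(axis=0).mean()

rng = np.random.default_rng(0)
items = rng.standard_normal((2000, 16))  # synthetic item embeddings

# Averaging more items per user shrinks the spread of the user vectors.
variances = {k: mean_feature_variance(items, 500, k, rng) for k in (1, 5, 25, 100)}
```

For independent features the variance of the average falls roughly as one over the sample size, which is why small samples retain more of the underlying variation.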

We argue that the effectiveness of the model at predicting style is due to the fact that we model style at the user level rather than the item level. Style is more obvious for groups of items than for a single item. The same jacket can be worn with business attire to seem formal, but also with casual clothing to be more dressed-down. While the style of the jacket may be ambiguous and subjective, the style of each outfit is distinct. Thus learning style at the user level affords high performance, as shown in our results.

Thresholding the style labels to produce a multi-label one-hot output has allowed us to obtain high performance on predicting user style profiles. This approach assumes a multivariate Bernoulli distribution for each labeled profile, and attempts to minimize a categorical cross-entropy between the labels and the learned profiles. This allows multiple styles to be present in the same profile. We also experimented with assuming a multinomial distribution over the user style profiles. No thresholding was applied and the model attempted to minimize a mean-squared error (MSE) loss term. This yielded poor results, with the model continually converging to a uniform distribution of predicted values across styles for all users. Assuming a multivariate Bernoulli enabled the network to infer distinct profiles

per user. The correlations between styles in Table 4 show that even though we did not assume a multinomial distribution over the profiles, which would cause each style to compete with the others for the limited probability mass, the learned styles still show a negative correlation with one another. This shows that the model has learned that the presence of one style negatively impacts the likelihood of the presence of other styles.
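The contrast between the two distributional assumptions can be illustrated with a small example. Independent sigmoid outputs (multivariate Bernoulli) let several styles be strongly present at once, whereas a softmax (multinomial) forces styles to share one unit of probability mass. The threshold of 0.5, the helper names, and the per-style Bernoulli negative log-likelihood used here are assumptions for this sketch and may differ from the exact loss term in our implementation.

```python
import numpy as np

def threshold_labels(style_scores, threshold=0.5):
    """Turn soft style scores into multi-label 0/1 targets (threshold is hypothetical)."""
    return (style_scores >= threshold).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bernoulli_nll(labels, probs, eps=1e-7):
    """Negative log-likelihood under independent Bernoulli outputs per style."""
    p = np.clip(probs, eps, 1.0 - eps)
    return -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)).sum(axis=-1)

scores = np.array([[0.7, 0.6, 0.1]])   # soft labels for three styles
labels = threshold_labels(scores)      # two styles present at once
logits = np.array([[3.0, 3.0, -3.0]])  # hypothetical network outputs

p_sig = sigmoid(logits)    # both present styles can approach 1
p_soft = softmax(logits)   # present styles must split the mass, each below 0.5

nll_sig = bernoulli_nll(labels, p_sig)
nll_soft = bernoulli_nll(labels, p_soft)
```

Under the softmax parameterization no profile with two active styles can be fit closely, which is consistent with the competition for probability mass described above.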

Style profiles learned under the Gaussian prior perform best. This could be due to the reparameterization trick, which introduces a form of noise-based regularization and allows the model to generalize the learned representations better. The Dirichlet prior performs worse than the Gaussian because it assumes a multinomial distribution over the user style profile. This causes the model to perform poorly, as the labels assume a multivariate Bernoulli distribution. Pairing the Dirichlet prior with the MSE loss term still suffered from convergence to uniform distributions over the user style profiles.

For styles which are negatively correlated, the model does not succeed at injecting one into a user’s recommendations if they are already aligned with the other. We have noticed that in these cases the system gracefully falls back to the user’s original recommendations. We think this is preferable to exposing items which are relevant to the requested style but unrelated to the user.

6 CONCLUSION
We introduce style injection via Style Conditioned Recommendations (SCR) as a novel way to diversify recommendations while maintaining relevance to users. Explicit feedback, though hard to gather, gives a clear picture of the user’s interests and tastes. Implicit feedback allows recommendation systems to leverage patterns of shopping behavior across users to find other relevant items, but creates a heavy popular item bias. Item content data allows all items to be considered, but is subject to faults such as false item descriptions, missing or poor quality item images, poorly named items, and more. SCR is able to leverage all three of these signals to produce relevant personalized recommendations. By employing semi-supervised learning, we leverage a limited number of cleanly labeled items to learn user level style profiles. SCR affords increased diversity as explicit feedback becomes available by using style injection.

In the absence of explicit feedback for style injection, SCR is still a performant recommendation model. SCR is able to infer style profiles and generate recommendations. It outperforms both the VAE-CR method [13] and cSLIM [20] on the task of recommendations, and outperforms baselines of multi-class all-vs-one logistic regression and multi-class random forest on predicting user style profiles.

REFERENCES

[1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 42–46.

[2] Marie Al-Ghossein, Pierre-Alexandre Murena, Talel Abdessalem, Anthony Barré, and Antoine Cornuéjols. 2018. Adaptive collaborative topic modeling for online recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 338–346.

[3] Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. 2012. Apparel classification with style. In Asian conference on computer vision. Springer, 321–335.

[4] Rocío Cañamares and Pablo Castells. 2018. Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems. (2018).

[5] Erion Çano and Maurizio Morisio. 2017. Hybrid recommender systems: A systematic literature review. Intelligent Data Analysis 21, 6 (2017), 1487–1524.

[6] Wei Di, Catherine Wah, Anurag Bhardwaj, Robinson Piramuthu, and Neel Sundaresan. 2013. Style finder: Fine-grained clothing style detection and retrieval. In Proceedings of the IEEE Conference on computer vision and pattern recognition workshops. 8–13.

[7] Wei-Lin Hsiao and Kristen Grauman. 2018. Creating capsule wardrobes from fashion images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7161–7170.

[8] Diane J Hu, Rob Hall, and Josh Attenberg. 2014. Style in the long tail: Discovering unique interests with latent variable models in large scale social e-commerce. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1640–1649.

[9] Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. 2017. Visually-aware fashion recommendation and design with generative image models. In Data Mining (ICDM), 2017 IEEE International Conference on. IEEE, 207–216.

[10] Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR) (2014).

[11] Wonsung Lee, Kyungwoo Song, and Il-Chul Moon. 2017. Augmented variational autoencoders for collaborative filtering with auxiliary information. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1139–1148.

[12] Xiaopeng Li and James She. 2017. Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 305–314.

[13] Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 689–698.

[14] Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 1 (2003), 76–80.

[15] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based recommender systems: State of the art and trends. In Recommender systems handbook. Springer, 73–105.

[16] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015).

[17] Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.

[18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.

[19] Tien T Nguyen, Pik-Mai Hui, F Maxwell Harper, Loren Terveen, and Joseph A Konstan. 2014. Exploring the filter bubble: the effect of using recommender systems on content diversity. In Proceedings of the 23rd international conference on World wide web. ACM, 677–686.

[20] Xia Ning and George Karypis. 2011. Slim: Sparse linear methods for top-n recommender systems. In 2011 11th IEEE International Conference on Data Mining. IEEE, 497–506.

[21] Xia Ning and George Karypis. 2012. Sparse linear methods with side information for top-n recommendations. In Proceedings of the sixth ACM conference on Recommender systems. ACM, 155–162.

[22] Ivens Portugal, Paulo Alencar, and Donald Cowan. 2018. The use of machine learning algorithms in recommender systems: A systematic review. Expert Systems with Applications 97 (2018), 205–227.

[23] Xueming Qian, He Feng, Guoshuai Zhao, and Tao Mei. 2014. Personalized recommendation combining user interest and social circle. IEEE transactions on knowledge and data engineering 26, 7 (2014), 1763–1777.

[24] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 452–461.

[25] Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In Proceedings of ICLR (2017).

[26] Chong Wang and David M Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 448–456.

[27] Hongzhi Yin, Bin Cui, Ling Chen, Zhiting Hu, and Chengqi Zhang. 2015. Modeling location-based user rating profiles for personalized recommendation. ACM Transactions on Knowledge Discovery from Data (TKDD) 9, 3 (2015), 19.

[28] Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, and Chen Change Loy. 2017. Be Your Own Prada: Fashion Synthesis with Structural Coherence. In Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 1689–1697.
