Applied Soft Computing 13 (2013) 2478–2489

A comparison of clustering-based privacy-preserving collaborative filtering schemes

Alper Bilge, Huseyin Polat*
Computer Engineering Department, Anadolu University, 26470 Eskisehir, Turkey
* Corresponding author. Tel.: +90 222 322 3550; fax: +90 222 323 9501. E-mail addresses: [email protected] (A. Bilge), [email protected] (H. Polat).

Article history: Received 15 August 2011; Received in revised form 8 May 2012; Accepted 26 November 2012; Available online 14 December 2012.

Keywords: Privacy; Collaborative filtering; Accuracy; Profiling; Preprocessing; Clustering

Abstract

Privacy-preserving collaborative filtering (PPCF) methods deliver highly beneficial filtering capabilities without deeply jeopardizing privacy. However, they mostly suffer from scalability, sparsity, and accuracy problems. First, applying privacy measures introduces additional costs, making scalability worse. Second, due to the randomness introduced for preserving privacy, the quality of predictions diminishes. Third, with an increasing number of products, sparsity becomes an issue for both CF and PPCF schemes. In this study, we first propose a content-based profiling (CBP) of users to overcome sparsity issues while performing clustering, because the very sparse nature of rating profiles sometimes does not allow strong discrimination. To cope with the scalability and accuracy problems of PPCF schemes, we then show how to apply k-means clustering (KMC), the fuzzy c-means method (FCM), and self-organizing map (SOM) clustering to CF schemes while preserving users' confidentiality. After presenting an evaluation of the clustering-based methods in terms of privacy and supplementary costs, we carry out real data-based experiments to compare the clustering algorithms among themselves and against traditional CF and PPCF approaches in terms of accuracy. Our empirical outcomes demonstrate that FCM achieves the best low-cost performance compared to the other methods due to its approximation-based model. The results also show that our privacy-preserving methods are able to offer precise predictions.

© 2012 Elsevier B.V. All rights reserved. 1568-4946/$ – see front matter. http://dx.doi.org/10.1016/j.asoc.2012.11.046

1. Introduction

Increasing attention to the online facilities introduced by the Internet leads to access to an overwhelming amount of material, contemporarily called information overload. Consequently, online vendors have started offering automated product recommendation services to boost sales in online stores. There are a number of ways to produce automated referrals, including content-based, collaborative, and knowledge-based techniques [1–3]. One of the most successful techniques is collaborative filtering (CF) [4,5], in which products are recommended based on the similarity of users. CF operates on an n × m matrix, referred to as the user–item matrix, where the preferences of n users on m items are recorded. When a new user, denoted as the active user (a), demands a recommendation, the most similar k users in the system are determined. Using their previous ratings on the product for which a recommendation is sought, referred to as the target item (q), a prediction (p_aq) is estimated for a on q. There are several studies demonstrating how well CF performs [3,5,6] and also some successful real-world applications adopting this prosperous technique [7,8].

As the risks of online shopping, such as profiling users, unsolicited marketing, price discrimination, and being subject to government surveillance, become better known, privacy issues have been arousing more interest among customers [9]. Since CF services collect users' preferences about the products they bought and construct personal profiles, they are open to privacy risks [9,10]. Thus, customers refrain from submitting their authentic preferences about products they purchased, or give false data, which makes it difficult to estimate dependable referrals. To overcome this challenge, researchers have proposed different privacy-preserving schemes to produce predictions without jeopardizing privacy [11–14].

Providing privacy measures within CF applications helps them become more widespread and easily accessible; however, such measures bring inevitable costs, and some difficulties emerge concerning the use of privacy-preserving collaborative filtering (PPCF) systems [4,6,12,15,16]. First, as more people get oriented to such applications and online vendors supplement new products, n and m grow and it gets harder to expand such systems, called the scalability problem [15,16]. Second, the privacy measures provided by such systems require extra computational and storage costs that contribute to the scalability issues. Third, due to privacy-preserving measures, it becomes an issue to estimate predictions



with reasonable accuracy. Fourth, originating from the constantly growing nature of PPCF, user–item matrices are generally highly sparse. It gets much harder to determine the neighbors of any user because the similarities can only be calculated through overlapping ratings [17]. Furthermore, even if the similarities can be accurately determined, it is likely that most of the neighbors might not have a rating on q, which renders the effort of finding similar users pointless.

In addition to the challenges caused by the extremely sparse nature of most CF databases, the prediction estimation process of such systems basically automates the old habit of "word-of-mouth". However, all so-called dependable PPCF recommendations rely on doubted past preferences of existing users, which might simply be imprecise or even inconsistent. At this juncture, due to their uncertainty and approximation-based performance, PPCF technologies can be subject to soft computing methods such as fuzzy logic and neural computing rather than conventional (hard) computing techniques. Yet, they have not been evaluated over such architectures. Although some fuzzy clustering algorithms are engaged within the CF context [18–21], their effects on the PPCF framework have not been studied.

In this study, we offer to employ a content-based profiling (CBP) of users' ratings based on item categories to get rid of the effects of the sparse nature of the user–item matrix and empower the separation skills of clustering algorithms. We propose to map rating profiles of users onto a category-based profile. Such profiles are much smaller and denser, facilitating determining similarities with decent accuracy and, therefore, operating clustering algorithms more efficiently. To protect individual privacy, we propose masking users' confidential data before submitting them to the data holder by disguising the rated items and the ratings. Then, we employ some well-known non-hierarchical clustering algorithms, namely k-means clustering (KMC), the fuzzy c-means method (FCM), and self-organizing map (SOM) clustering, to PPCF schemes and provide a comparison in terms of accuracy and performance among them. We especially expect to enhance the tractability and robustness of PPCF schemes by integrating them with fuzzy and neural computing approaches. The contributions of the paper can be listed as follows: (i) we propose a novel CBP method utilizing categorical information of items to alleviate sparsity-related problems; (ii) we employ privacy-preserving measures on sparsity-enhanced profiles to provide a sufficient level of privacy to individuals; (iii) we show the applicability of both conventional and soft non-hierarchical clustering techniques (k-means, fuzzy c-means, and SOM) to the PPCF framework to overcome scalability issues; (iv) we additionally provide a comparison among the utilized clustering techniques. To the best of our knowledge, our paper presents the first analyses and evaluation on integrating uncertainty-based soft computing constituents in the PPCF framework.

2. Clustering algorithms and clustering-based collaborative filtering

In CF applications, clustering is widely used to cope with the scalability problem. It reduces the size of the data set involved in the CF process. KMC is probably the most well-known clustering approach, where k random initial objects are chosen as centers, one for each cluster [22]. Then, n objects are compared with each seed by means of a discrimination criterion and assigned to the closest cluster. This procedure is performed repeatedly and, at each stage, cluster centers are recalculated as the average of the objects assigned to the corresponding cluster. The algorithm converges when the modification in cluster centers between successive stages is close to zero or less than a pre-defined value. At the end, all objects are assigned to only one cluster.

Like KMC, FCM also requires c pre-defined initial objects as cluster centers [23,24]. At each stage of the algorithm, a membership for every object is estimated with respect to each cluster center using a comparison criterion. The cluster centers are recalculated at each stage according to the memberships of objects to clusters, as their weighted average. At the end, the algorithm provides the degree of membership to each cluster center and, according to those memberships, objects can either be included in more than one cluster or in the cluster for which the degree of membership is highest. Unlike KMC, FCM is more flexible [25] because it allows objects to have more than one interface with clusters. Such objects deserve more attention to figure out why they contribute to more than one cluster. In mathematical terms, given a data set D = {x_1, x_2, . . ., x_n}, FCM minimizes the objective function given in Eq. (1), as follows:

J_Q(U, V) = Σ_{j=1}^{n} Σ_{i=1}^{c} (u_ij)^Q d²(x_j, V_i)    (1)

with respect to U (a fuzzy c-partition of the data set) and to V (aset of c prototypes), where Q is a real number greater than 1, Vi isthe centroid of cluster i, uij is the degree of membership of objectxj belonging to cluster i, d2(·,·) is an inner product metric, and c isthe number of clusters. The parameter Q controls the “fuzziness” ofthe resulting clusters [26].
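The alternating update scheme described above can be sketched as follows; this is a minimal NumPy illustration under our own assumptions (function name, squared Euclidean distance for d², random initialization), not the authors' implementation:

```python
import numpy as np

def fuzzy_c_means(X, c, Q=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal FCM sketch: iteratively minimizes
    J_Q(U, V) = sum_j sum_i u_ij^Q * d^2(x_j, V_i)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                    # fuzzy c-partition: columns sum to 1
    for _ in range(n_iter):
        Um = U ** Q
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)   # weighted-average centroids
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        inv = (1.0 / d2) ** (1.0 / (Q - 1.0))
        U_new = inv / inv.sum(axis=0)     # standard membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```

Objects whose largest membership is well below 1 are precisely the "more than one interface" cases discussed above.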

SOM is a type of artificial neural network reducing dimensionsby producing a map of usually one or two dimensions that plotsthe similarities of the data by grouping similar objects together[27,28]. SOMs can be used to explore the groupings and rela-tions within high-dimensional data by projecting the data ontoa two-dimensional space. SOM architecture consists of two fullyconnected layers: an input layer and a Kohonen layer. The num-ber of neurons in the input layer matches the number of attributesof the objects. Each neuron in the input layer has a feed-forwardconnection to each neuron in the Kohonen layer. The neuron in theKohonen layer with the biggest input will become the winning neu-ron. The algorithm forms the SOM by first initializing the weights inthe network by assigning them small random values [26]. Then, itfollows three essential procedures: competition, cooperation, andadaptation [29].
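The three procedures named above (competition, cooperation, adaptation) can be sketched in a few lines; grid size, decay schedules, and the Gaussian neighborhood function are our illustrative assumptions, not choices taken from the paper:

```python
import numpy as np

def train_som(X, grid=(4, 4), n_iter=1000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM sketch: competition (find the winning neuron),
    cooperation (Gaussian neighborhood), adaptation (weight update)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    W = rng.random((h * w, X.shape[1])) * 0.1        # small random initial weights
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    for t in range(n_iter):
        lr = lr0 * (1 - t / n_iter)                  # decaying learning rate
        sigma = sigma0 * (1 - t / n_iter) + 1e-3     # shrinking neighborhood
        x = X[rng.integers(len(X))]
        winner = np.argmin(((W - x) ** 2).sum(axis=1))        # competition
        dist2 = ((coords - coords[winner]) ** 2).sum(axis=1)
        nbh = np.exp(-dist2 / (2 * sigma ** 2))               # cooperation
        W += lr * nbh[:, None] * (x - W)                      # adaptation
    return W
```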

Aforementioned clustering algorithms can be employed as apreprocessing to group similar users in the same cluster and dis-similar ones in diverse clusters off-line in order to enhance onlineperformance or scalability of CF schemes. After collecting users’preferences, the server constructs an n × m user–item matrix, D.Then, it groups users into c clusters using different clusteringapproaches off-line. When an active user a wants a prediction for atarget item q, she sends her known ratings and a query to the serverduring an online interaction. The server determines a’s similarityto each cluster center using the Pearson’s correlation coefficientformula, as follows [30]:

w_aC = [ Σ_{j=1}^{m} (v_aj − v̄_a)(v_Cj − v̄_C) ] / (σ_a × σ_C)    (2)

in which C represents a cluster center; v̄_a, v̄_C, σ_a, and σ_C are the vector mean values and standard deviations of a and the corresponding cluster center, respectively. The active user is then assigned to the cluster with the largest similarity weight.

Once a’s cluster is determined, the similarities between a and allher cluster members are calculated using Eq. (2) similarly. Noticethat after clustering, the server estimates n/c similarities instead ofn similarities online. The best k users in that cluster are then chosen
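The assignment step can be sketched as follows. Since Eq. (2) divides the sum of mean-centered products by σ_a × σ_C, it equals a sum of per-item z-score products, and the sketch computes it that way; the function name and dense, fully rated vectors are simplifying assumptions on our part:

```python
import numpy as np

def assign_to_cluster(a_vec, centers):
    """Sketch of Eq. (2): Pearson-style similarity between active user a
    and each cluster center; a joins the most similar cluster."""
    za = (a_vec - a_vec.mean()) / a_vec.std()
    best, best_w = -1, -np.inf
    for c, center in enumerate(centers):
        zc = (center - center.mean()) / center.std()
        w = (za * zc).sum()            # sum of z-score products = Eq. (2)
        if w > best_w:
            best, best_w = c, w
    return best
```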


as neighbors. Finally, a prediction is produced as a weighted average of those neighbors having a rating on q, as follows [30]:

p_aq = v̄_a + σ_a × [ Σ_{u=1}^{nc} ((v_uq − v̄_u)/σ_u) × w_au ] / [ Σ_{u=1}^{nc} w_au ]    (3)

in which p_aq represents the prediction for a on q and nc is the number of neighbors chosen from the corresponding cluster who have a rating on q.
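Eq. (3) can be sketched as follows; the dict-based data structures, the function name, and the use of the population standard deviation are our assumptions for illustration:

```python
import numpy as np

def predict(a_ratings, neighbors, q, weights):
    """Sketch of Eq. (3): z-score weighted average over cluster neighbors
    who rated target item q. `neighbors` maps user -> {item: rating};
    `weights` maps user -> similarity w_au (hypothetical structures)."""
    known = list(a_ratings.values())
    v_a, s_a = np.mean(known), np.std(known)
    num = den = 0.0
    for u, ratings in neighbors.items():
        if q not in ratings:
            continue                      # only neighbors with a rating on q
        vals = list(ratings.values())
        v_u, s_u = np.mean(vals), np.std(vals)
        num += ((ratings[q] - v_u) / s_u) * weights[u]   # z-score of v_uq
        den += weights[u]
    return v_a + s_a * (num / den)        # denormalize back to a's scale
```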

3. Related work

Collaborative filtering schemes use ratings on items to find similarities between users (or items), where a recommendation is calculated as a weighted average of similar users' ratings on the target item [30,31]. Memory-based CF methods mainly suffer from the scalability problem [32,33]. Model-based CF approaches produce a model off-line and operate on reduced data, which helps deal with scalability and sparsity issues. Clustering [20,34,35] and dimensionality reduction processes [36] are popular model-based approaches.

To improve scalability and cope with the sparsity of the user–item matrix, Chen et al. [4] propose to use orthogonal nonnegative matrix tri-factorization. Jeong et al. [17] propose a novel iterative semi-explicit rating method, which aggregates neighbor ratings and extrapolates unrated elements in a semi-supervised manner to obtain a dense preference matrix. Honda et al. [34] offer to employ principal component analysis and fuzzy clustering simultaneously, where they extract local principal components by using a lower rank approximation of the data matrix and predict the missing values as an approximation. To ameliorate scalability, Feng and Hui-you [16] offer a genetic clustering method to partition the user–item matrix, Georgiou and Tsapatsoulis [35] propose a data clustering method that is based on genetic algorithms, and Russell and Yoon [36] apply discrete wavelet transformation to recommender systems, where data are transformed and reduced significantly to decrease the amount of time for producing a prediction.

Besides overcoming sparsity and refining scalability, there are also approaches to augment the quality of predictions. Bobadilla et al. [37] propose a metric to measure similarity between users, which is formulated via a simple linear combination of genetic algorithm weightings and values. Kim et al. [38] perform collaborative tagging, which is employed as an approach to grasp and filter users' preferences for items, and Lee et al. [39] propose a CF-based recommendation approach based on both implicit ratings and less ambitious ordinal scales for mobile music recommendations. Also, a new algorithm relying on discovering the functional error-correcting dependencies in a data set by using the fractal dimension is presented in [40] for addressing the accuracy problem of CF systems.

PPCF is receiving considerable interest due to the insecure infrastructure of online shopping environments [11]. Canny [12,13] proposes two schemes in which users iteratively compute a public "aggregate" of their data. Homomorphic encryption is employed to privately encrypt and decrypt user vectors to avoid exposing individual data. Kaleli and Polat [41] propose a novel privacy-preserving scheme to produce SOM clustering-based recommendations on vertically distributed data among multiple parties. If a centralized solution is employed, where a server collects user ratings and conducts CF processes, randomized perturbation techniques (RPTs) might be utilized to maintain user confidentiality, as applied in [11]. Bilge and Polat [42] build RPT onto discrete wavelet transform-based recommender systems to preserve individuals' privacy. Randomized response techniques (RRTs) are proposed to be applied along with a naïve Bayesian classifier (NBC) in [43] to produce private referrals. Bilge and Polat [44] improve the performance of the approach proposed in [43] by applying two preprocessing schemes. While it is proven that randomized approaches can handle the privacy-preservation problem in CF systems, it has not been extensively studied whether approaches such as clustering, which address other problems of CF, e.g., scalability and accuracy, can be applied while achieving individual privacy. In this study, we focus on privacy-preserving schemes applied to clustering-based recommendations to produce referrals without greatly jeopardizing users' privacy. We investigate the consequences of applying RPTs to some clustering-based CF schemes in terms of accuracy and performance. We also offer a new profiling that can be used to cluster users by overcoming the sparsity problem.

4. Privacy-preserving clustering-based collaborative filtering schemes

Defining privacy succinctly is not an easy task. In CF applications,we can describe it, as follows: online vendors collecting their cus-tomers’ preferences about various products should not be able tolearn the true preferences of users and the rated items. Hence, thetrue ratings and the rated items are considered confidential. In con-trast, user and item IDs are regarded as public. To achieve individualusers’ privacy, we apply RPTs, explained in the following.

4.1. Privacy protection by randomized perturbation techniques

Accuracy and confidentiality are two major goals that should be achieved by online vendors to attract customers. However, preserving privacy requires a level of distortion in user data, which in turn pulls accuracy down, meaning that these two goals conflict in nature. Therefore, accuracy has to be compromised a bit; on the other hand, privacy measures must be tuned well to balance the trade-off. RPTs are very useful tools for providing a desired level of distortion in data so that privacy levels can be adjusted accordingly. In CF terminology, RPTs replace an individual rating entry v with v + r, where r is a random number drawn from some pre-defined distribution. The distribution's parameters should be chosen to disguise individual data entries irreversibly while not significantly changing the characteristics of the aggregate data. Since the recommendation generation process takes place on aggregate data, the data holder can still perform certain computations on a disguised vector without learning anything about the original vector except the range of the masked data.

In terms of PPCF, confidentiality of customers might be considered with two aspects, as discussed in [11]: (i) masking individual preferences on products and (ii) masking the list of products purchased. The first one avoids privacy disturbances such as unsolicited marketing or profiling users, and the second one circumvents intrusions like price discrimination. To fulfill the first aspect, users can disguise their vectors by adding random numbers to their actual ratings. In addition, to mask their rated and/or unrated items, they can also insert fake ratings (i.e., random numbers) into some fraction of their unrated item cells. Random numbers for disguising purposes can be generated using either a uniform or a Gaussian distribution with standard deviation (σ) and zero mean (μ). In the Gaussian case, users create random values using the normal distribution. In the uniform case, they generate random numbers over the range [−α, α], where α is a constant and α = √3 σ. Users can disguise their private data before sending them to the server, as follows [11]:

1. The server lets each user know the values of the parameters σ_max and β_max. Notice that the optimum values of σ_max and β_max can be determined experimentally.
2. Disguising is performed on normalized ratings, since z-scores are needed for generating recommendations, as explained previously. Thus, users first normalize their ratings by transforming them into z-scores before perturbation. Each user u computes her ratings' average and standard deviation and transforms the votes into z-scores. Given the mean vote of a user u (v̄_u), the standard deviation of her ratings (σ_u), and her rating on item j (v_uj), the z-score of that rating (z_uj) is estimated as z_uj = (v_uj − v̄_u)/σ_u.
3. Then, all users determine the number of unrated items (I_u) in their ratings vectors.
4. After uniformly randomly selecting β_u over the range (0, β_max], each user u uniformly randomly chooses F_u = ⌊(β_u × I_u)/100⌋ unrated item cells to be filled.
5. Each user u chooses σ_u uniformly randomly from the range (0, σ_max] and decides the random number distribution by flipping a coin, i.e., they utilize a uniform or Gaussian distribution with equal probability.
6. Using the selected distribution, they generate R_u = t_u + F_u random numbers (r_uj values for j = 1, 2, . . ., R_u), where t_u is the number of user u's true ratings.
7. Next, each user u disguises her z-scores, obtaining z′_uj = z_uj + r_uj, and fills the selected unrated item cells with the corresponding random numbers.
8. After data disguising, users send their disguised vectors to the server, which creates an n × m disguised user–item matrix, D′. It can now perform CF processes based on D′.

Since active users should also send their ratings vectors to the data holder, their privacy must be preserved as well. Any a can similarly disguise her data and send a masked ratings vector with a query to the server in order to get a prediction.
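The steps above can be sketched from a single user's perspective as follows; the NaN-for-unrated encoding, the function name, and the parameter defaults are our illustrative assumptions:

```python
import numpy as np

def disguise(ratings, sigma_max=2.0, beta_max=50.0, rng=None):
    """Sketch of the RPT disguising steps: z-score normalization,
    random sigma/beta choice, fake ratings in some unrated cells,
    and additive zero-mean noise on true z-scores."""
    rng = rng or np.random.default_rng()
    rated = ~np.isnan(ratings)
    v, s = ratings[rated].mean(), ratings[rated].std()
    z = np.full_like(ratings, np.nan)
    z[rated] = (ratings[rated] - v) / s                 # step 2: z-scores
    I_u = int((~rated).sum())                           # step 3: unrated count
    beta_u = rng.uniform(0, beta_max)                   # step 4
    F_u = int(beta_u * I_u / 100)
    fill = rng.choice(np.where(~rated)[0], F_u, replace=False)
    sigma_u = rng.uniform(0, sigma_max)                 # step 5
    if rng.random() < 0.5:                              # coin flip: uniform vs Gaussian
        noise = lambda k: rng.uniform(-np.sqrt(3) * sigma_u, np.sqrt(3) * sigma_u, k)
    else:
        noise = lambda k: rng.normal(0, sigma_u, k)
    z[rated] += noise(int(rated.sum()))                 # step 7: mask true z-scores
    z[fill] = noise(F_u)                                # fake entries in unrated cells
    return z
```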

4.2. Estimating users' content-based profiles

To cluster users, similarity weights between cluster centers and users should be estimated. Calculating similarity weights in CF relies on comparing common ratings. However, the number of items increases constantly, and the ratings provided over the whole product range remain a very small fraction of it, leading to the sparsity problem. The difficulty of computing similarities accurately is amplified even more by the sparsity of the available data. Even if there are commonly rated items, since the number of common ratings will be very few, similarity calculation cannot be accurate, and therefore clustering algorithms cannot perform well.

To overcome such shortcomings, instead of employing the collected rating/preference profiles of users in the clustering stage, we propose to generate additional profiles over the categories of items for which a preference exists [45]. Individual ratings are independent, but somehow they are correlated due to common features. Our objective is to reduce the typically sparse and large rating-based vectors into smaller and dense CBPs. Hence, we are able to compute a similarity even if there are no or very few common ratings. Moreover, since such CBPs are more compact and fixed in size, the time to calculate similarities becomes stabilized. To generate CBPs, we first need to resolve the categories of the product scale. After determining common features among items, CBPs can be generated by using the following approaches:

A. Purchased-based profiles (PBPs): PBPs are generated by checking whether an item is rated or not; if it is rated (meaning it was previously purchased), then the corresponding feature categories of that item are incremented by 1. In other words, if the user purchased an item, each category that item belongs to is increased by 1. Note that this profiling approach focuses on market basket data rather than users' individual opinions on items.

B. Rating-based profiles (RBPs): RBPs are generated by again checking whether an item is rated or not; if it is rated (meaning an opinion is provided on the item), then the relevant feature categories of that item are augmented by the matching rating value. In other words, each rated item's categories are increased by as much as their matching rating values. The RBP scheme focuses on how much users like or dislike certain types of item categories.

To understand the profiling process, let us go through a sample user–item matrix including the preferences of two users for five books, given in Table 1, where ratings are given on a five-star rating scale and the ⊥ sign indicates that no preference is provided for that item. Also, suppose that the products, being books, have at least one genre from the set {Romance, Mystery, Fiction, Horror, History, Poetry}. Item genre features are specified in Table 2, with 1 indicating the corresponding genre and 0 otherwise. CBPs (PBPs and RBPs) are generated and depicted in Fig. 1 according to the example given in Tables 1 and 2.

Table 1
An example of user–item matrix.

        j1  j2  j3  j4  j5
Alice    5   1   2   ⊥   3
Bob      1   3   ⊥   2   4

Table 2
Genre features of items.

      Romance  Mystery  Fiction  Horror  History  Poetry
j1       1        0        0       0       1        0
j2       0        1        1       0       0        0
j3       1        0        0       0       0        1
j4       0        1        1       1       0        0
j5       0        0        0       0       1        0

As seen from Fig. 1, Alice's PBP is [2, 1, 1, 0, 2, 1] for the genres Romance, Mystery, Fiction, Horror, History, and Poetry, respectively, because there are two items, j1 and j3, whose genre is Romance and she rated both (corresponding PBP value 2); there are two items, j2 and j4, whose genre is Mystery and she rated only one of them (corresponding PBP value 1); and so on. We obtain the PBP for user Bob similarly. For the same genres, as seen from Fig. 1, Alice's and Bob's RBPs are [7, 1, 1, 0, 8, 2] and [1, 5, 5, 2, 5, 0], respectively. Since there are two items, j1 and j3, whose genre is Romance and Alice rated them 5 and 2, respectively, the corresponding RBP value is 5 + 2 = 7. Similarly, since there is only one item (j3) whose genre is Poetry and Bob did not rate that item, the corresponding RBP value is 0 for that genre.

When we examine real data sets collected for CF purposes, we notice that the number of rated items differs across users. For example, in the MovieLens Data (ML) set (http://www.grouplens.org), each user rates 20–200 movies. Similarly, the number of an item's genres might differ across items. For example, as seen from Fig. 1, j4 has three genres, while j5 has only one genre. Due to these reasons, we may end up with profile values far away from each other. To smooth the effects of larger profile values, we normalize them as follows: we find the sum of the profile values and divide each profile value by that sum. The sum of the profile values in RBP_Bob in Fig. 1 is 18. If we normalize this profile by dividing each value by 18, we get normalized profile values of [1/18, 5/18, 5/18, 2/18, 5/18, 0/18].
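The PBP/RBP construction can be sketched as follows; the dict-based inputs and the function name are our assumptions. Run on the Alice/Bob example, it reproduces the profiles discussed above:

```python
import numpy as np

def content_profiles(ratings, genres):
    """Sketch of PBP/RBP generation. `ratings` maps item -> rating
    (rated items only); `genres` maps item -> 0/1 genre vector.
    Returns (PBP, RBP) before normalization."""
    n_genres = len(next(iter(genres.values())))
    pbp = np.zeros(n_genres)
    rbp = np.zeros(n_genres)
    for item, r in ratings.items():
        g = np.asarray(genres[item])
        pbp += g          # purchased: each matching category +1
        rbp += g * r      # rated: each matching category + rating value
    return pbp, rbp
```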

4.3. Performing clustering on masked data

After estimating the CBPs corresponding to the collected disguised user vectors, the server runs a clustering algorithm to group users in terms of their CBPs. Given D, it is an easy task to group users;

however, it is a challenge to cluster users when the server holds the disguised user–item matrix, D′. Thus, it can cluster D′ as follows, using the aforementioned clustering methods.

4.3.1. Clustering users using k-means clustering on disguised data
The KMC algorithm can be considered in two distinct phases: (i) an initialization phase, in which the algorithm randomly assigns the users into c pre-defined clusters, and (ii) an iteration phase, where the distance (or similarity) between each user and each cluster center is computed and the user is assigned to the nearest cluster; then, the cluster centers of any changed clusters are recomputed as the average of the members of each cluster.

In the initialization phase, the server uniformly randomly assigns the users into c pre-defined clusters. It then estimates the cluster centers by taking the average of the members of each cluster. Given n disguised data items x'_i for i = 1, 2, ..., n, where x'_i = x_i + r_i, the average of such values can be estimated as follows:

\bar{x}' = \frac{\sum_{i=1}^{n} x'_i}{n} = \frac{\sum_{i=1}^{n} (x_i + r_i)}{n} = \frac{\sum_{i=1}^{n} x_i}{n} + \frac{\sum_{i=1}^{n} r_i}{n} \approx \frac{\sum_{i=1}^{n} x_i}{n} \qquad (4)

Since the random numbers are drawn from a distribution with zero mean, the expected value of the mean of the random numbers is zero. With increasing n, the average of such random numbers converges to zero.
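The convergence behind Eq. (4) is easy to check numerically. A small sketch with hypothetical data follows; the choice of distributions and of σ = 1 is arbitrary, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(1, 5, size=n)        # hypothetical true values
r = rng.normal(0.0, 1.0, size=n)     # zero-mean masking noise
x_masked = x + r

# Eq. (4): the average of the masked values approximates the true
# average, because the zero-mean noise terms average out.
gap = abs(x_masked.mean() - x.mean())
print(gap)  # small, and it shrinks as n grows
```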

In the second step, the server needs to find the similarities between users and cluster centers, and to update the cluster centers. We employ Pearson's correlation coefficient as the discrimination criterion between individual users; it can also be used to find the similarity between an individual and a cluster center from masked data, as follows:

w'_{Cu} = \sum_{j=1}^{m_u} z'_{Cj} z'_{uj} = \sum_{j=1}^{m_u} (z_{Cj} + r_{Cj})(z_{uj} + r_{uj}) = \sum_{j=1}^{m_u} z_{Cj} z_{uj} + \sum_{j=1}^{m_u} z_{Cj} r_{uj} + \sum_{j=1}^{m_u} r_{Cj} z_{uj} + \sum_{j=1}^{m_u} r_{Cj} r_{uj} \approx \sum_{j=1}^{m_u} z_{Cj} z_{uj} \qquad (5)

in which z'_{Cj} represents the center of cluster C. The expected values of the last three summations in Eq. (5) are zero because the random numbers r_{Cj} and r_{uj} are generated using a distribution with a zero mean. Similarly, the expected means of the z-scores are zero as well. After estimating the similarity weights, the users are assigned to the closest clusters. The server then updates the cluster centers, as explained previously, using Eq. (6):

C_i = \frac{\sum_{k=1}^{N_{C_i}} z'_{uk}}{N_{C_i}} = \frac{\sum_{k=1}^{N_{C_i}} (z_{uk} + r_{uk})}{N_{C_i}} = \frac{\sum_{k=1}^{N_{C_i}} z_{uk}}{N_{C_i}} + \frac{\sum_{k=1}^{N_{C_i}} r_{uk}}{N_{C_i}} \approx \frac{\sum_{k=1}^{N_{C_i}} z_{uk}}{N_{C_i}} \quad \text{for } i = 1, 2, \ldots, c \qquad (6)

where C_i represents the cluster center of the ith cluster, N_{C_i} stands for the number of users in that cluster, and c is the number of clusters. As a result, the server can compute the similarities


between users and calculate cluster centers from perturbed data; hence, it performs KMC with adequate precision. □
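Eqs. (5) and (6) can be sanity-checked numerically. The sketch below uses synthetic correlated z-score profiles and zero-mean Gaussian masking noise; all sizes, the correlation level, and the noise σ = 0.5 are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 500                                      # dimension of the CBP vectors
z_c = rng.normal(size=d)                     # cluster-center z-scores
z_u = 0.8 * z_c + 0.6 * rng.normal(size=d)   # a user correlated with the center
r_c = rng.normal(0.0, 0.5, size=d)           # zero-mean masking noise
r_u = rng.normal(0.0, 0.5, size=d)

# Eq. (5): the dot-product similarity survives masking because every
# cross-term involving noise has zero expectation.
w_true = z_c @ z_u
w_masked = (z_c + r_c) @ (z_u + r_u)
```

With these settings the relative error between `w_masked` and `w_true` stays small, mirroring the approximation argument in the text.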

4.3.2. Clustering users using fuzzy c-means method on disguised data
FCM is an extension of KMC for fuzzy clustering and can be roughly performed in three steps: (i) an initialization step, similar to KMC, to choose the initial c cluster centroids; (ii) computing a membership matrix holding the relationship levels of users to clusters, calculated as a function of the similarity degrees of each user to the cluster centroids; and (iii) updating each cluster centroid as the weighted average of each user's membership to every cluster centroid. It repeats steps (ii) and (iii) until a termination criterion is reached.

The only differences between FCM and KMC are that a user can belong to more than one cluster in FCM and that cluster centers are estimated as weighted averages in FCM. Basically, the same estimations are conducted in FCM. Thus, the first step is easily achievable. Also, since similarity values appear in the second step and a weighted average of the contributions of each member is computed in the third step of the algorithm, FCM can certainly be performed on D′, relying on the equivalences expressed in Eqs. (5) and (6), respectively. □
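For concreteness, one textbook FCM iteration (membership update followed by weighted-center update) can be sketched as below. This is a generic implementation with the standard fuzzifier m = 2 and Euclidean distances, not the authors' MATLAB code, and the toy data are assumptions:

```python
import numpy as np

def fcm_step(X, centers, m=2.0, eps=1e-9):
    """One FCM iteration: membership update, then weighted-center update."""
    # Distances from every point to every centroid (n x c); eps avoids 0/0.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    # Standard FCM membership of point k in cluster i.
    inv = d ** (-2.0 / (m - 1.0))
    u = inv / inv.sum(axis=1, keepdims=True)
    # Centroids as membership-weighted averages; like Eq. (6), this
    # averaging step tolerates zero-mean masking noise.
    w = u ** m
    new_centers = (w.T @ X) / w.T.sum(axis=1, keepdims=True)
    return u, new_centers

rng = np.random.default_rng(2)
# Two well-separated toy clusters in 3 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(5.0, 1.0, (50, 3))])
centers = X[[0, 99]]            # crude initialization, one point per cluster
for _ in range(10):
    u, centers = fcm_step(X, centers)
```

After a few iterations the two centers settle near the true cluster means, and each row of the membership matrix sums to one.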

4.3.3. Clustering users using self-organizing map on disguised data
The SOM algorithm comprises three essential processes after the initialization procedure [26]: (i) a competitive process to determine the winning neuron, (ii) a cooperative process to define a topological neighborhood for locating the center of the cooperating neurons, and finally (iii) an adaptive process, where weight vectors are updated according to the input vectors. For the first step, let u = (z'_{u1}, z'_{u2}, ..., z'_{ud})^T be a randomly selected user in D′, where d is the dimension of the CBPs, and let the weight vector of neuron n in the Kohonen layer be w_n = (w_{n1}, w_{n2}, ..., w_{nd})^T, where n = 1, 2, ..., N and N is the number of neurons in the Kohonen layer. To find the best match of user u with the weight vectors w_1, w_2, ..., w_N, the inner products w_*^T u must be computed and the largest is chosen. The computations can be done approximately on D′, formulated as follows:

w_*^T u = w_*^T z'_u = w_*^T (z_u + r_u) = w_{*1}(z_{u1} + r_{u1}) + \cdots + w_{*d}(z_{ud} + r_{ud}) = w_{*1} z_{u1} + w_{*1} r_{u1} + \cdots + w_{*d} z_{ud} + w_{*d} r_{ud} \approx w_{*1} z_{u1} + \cdots + w_{*d} z_{ud} \qquad (7)

in which the r_{u*} terms are generated from a distribution with zero mean; hence, their products with the w_* terms converge to zero and the equation holds.

In the second step, the topological neighborhood h_{jt} is calculated, a unimodal function of the distance between the winning neuron t and an excited neuron j. It can be calculated as the simple Euclidean distance. However, since this function is computed employing none of the elements of D′, but only the two neurons t and j, it is unaffected by the RPT process. In the last step, the adaptation process, the weight vector w_j of neuron j changes due to user u. Given the weight vector w_j^{(s)} of neuron j at iteration s, the new weight vector w_j^{(s+1)} is calculated as follows:

w_j^{(s+1)} = w_j^{(s)} + \eta(s) \, h_{j,i(u)}(s) \, (u - w_j^{(s)}) \qquad (8)

where η(s) is the learning parameter, a constant, and h_{j,i(u)}(s) is the neighborhood function defined in the previous step. Therefore, we need to focus on the differences u − w_j^{(s)}. Since u comprises random values added to z-score values, the effect of the random values converges to zero in such difference computations, as they are generated from a distribution with zero mean. As these three steps of SOM clustering can be performed without being greatly affected by the disguising procedure, we can conclude that SOM clustering performs precisely on D′ without jeopardizing individuals' confidentiality. □
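The competitive and adaptive steps (Eqs. (7) and (8)) can be sketched for a single masked input as follows. The map size, learning rate, and Gaussian neighborhood over a 1-D map index are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_neurons = 20, 5
w = rng.normal(size=(n_neurons, d))        # Kohonen-layer weight vectors
z_u = rng.normal(size=d)                   # a user's true z-score vector
u = z_u + rng.normal(0.0, 0.5, size=d)     # masked input, z' = z + r

# Competitive step (Eq. (7)): winner = neuron with the largest inner product.
winner = int(np.argmax(w @ u))

# Adaptive step (Eq. (8)): constant learning rate, Gaussian neighborhood
# over the 1-D map index; each weight vector moves toward the input.
eta, sigma = 0.1, 1.0
w_before = w.copy()
for j in range(n_neurons):
    h = np.exp(-((j - winner) ** 2) / (2 * sigma ** 2))
    w[j] += eta * h * (u - w[j])
```

After the update, the winning neuron's weight vector is strictly closer to the (masked) input than before, which is the behavior Eq. (8) describes.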

4.4. Producing online recommendations privately

Typical CF-based recommendation estimation can be considered a three-step process: (i) calculating similarities between a and the train users, (ii) choosing neighbors, and (iii) estimating a weighted prediction relying on the most similar users' ratings on q [33]. In the following, we explain how to provide referrals from masked data:

A. Estimating similarity weights: After determining a's cluster, similarities are calculated between a and each user in her cluster using Eq. (2) based on masked data. Eq. (2) can be written as w_{au} = \sum_{j=1}^{m_u} z_{aj} z_{uj}, where z_{aj} = (v_{aj} − v̄_a)/σ_a and m_u shows the number of commonly rated items between a and user u. When a's and u's rating vectors are disguised, as explained before, the similarity between them can be estimated as formulated in Eq. (5). Therefore, the server can estimate the similarities between a and the users in her cluster from perturbed data with high accuracy.

B. Neighborhood formation: After finding the similarities, the k most similar users are selected as neighbors to contribute to the prediction process. Users whose similarity weights satisfy a pre-defined threshold can be chosen as neighbors; or, after sorting the users in decreasing order of similarity, the first k users are selected as neighbors. After forming the neighborhood, the server can now compute p_{aq}.

C. Recommendation estimation: The prediction is estimated as a weighted sum of the ratings on q given by the users in the neighborhood. Every user in the neighborhood contributes to the predicted rating according to their degree of similarity with a. Since all users, including a, submit normalized ratings to the server, p_{aq} can be estimated as follows:

p'_{aq} = \bar{v}_a + \sigma_a \times \frac{\sum_{u=1}^{N_a} w'_{au} z'_{uq}}{\sum_{u=1}^{N_a} w'_{au}} = \bar{v}_a + \sigma_a \times P'_{aq} \qquad (9)

in which N_a shows the number of a's neighbors who rated q, and v̄_a and σ_a represent a's rating mean and standard deviation, respectively. The server estimates P'_{aq} only and sends it to a. Once a receives it, she can de-normalize it and obtain p'_{aq}. The server can estimate P'_{aq} as follows:

P'_{aq} = \frac{\sum_{u=1}^{N_a} (w_{au} + R_{au})(z_{uq} + r_{uq})}{\sum_{u=1}^{N_a} (w_{au} + R_{au})} = \frac{\sum_{u=1}^{N_a} w_{au} z_{uq} + \sum_{u=1}^{N_a} w_{au} r_{uq} + \sum_{u=1}^{N_a} R_{au} z_{uq} + \sum_{u=1}^{N_a} R_{au} r_{uq}}{\sum_{u=1}^{N_a} w_{au} + \sum_{u=1}^{N_a} R_{au}} \approx \frac{\sum_{u=1}^{N_a} w_{au} z_{uq}}{\sum_{u=1}^{N_a} w_{au}} \qquad (10)

in which, due to the same reasons, the expected values of the last three summations in the numerator and of the last summation in the denominator are zero. Hence, the server can estimate P'_{aq} from masked data.
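The approximation in Eq. (10) and the de-normalization in Eq. (9) can be sketched numerically. The neighborhood size, weight distribution, noise levels, and the active user's mean and standard deviation below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_a = 200                                   # neighbors of a who rated item q
w = rng.uniform(0.2, 1.0, size=n_a)         # true similarity weights
z_q = rng.normal(size=n_a)                  # neighbors' true z-scores on q
R = rng.normal(0.0, 0.1, size=n_a)          # zero-mean error in masked weights
r = rng.normal(0.0, 0.5, size=n_a)          # zero-mean rating noise

# Eq. (10): the masked weighted average approximates the true one,
# since all noise cross-terms have zero expectation.
P_true = (w @ z_q) / w.sum()
P_masked = ((w + R) @ (z_q + r)) / (w + R).sum()

# Eq. (9): the active user de-normalizes with her own mean and
# standard deviation (values here are made up).
v_bar_a, sigma_a = 3.4, 1.1
p_aq = v_bar_a + sigma_a * P_masked
```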

5. Performance and privacy analysis

It is imperative to analyze the proposed CBP scheme, the applied clustering approaches, and the privacy preservation structure in terms of both off-line and online costs to define their effects on the overall performance of the offered CF schemes. Although off-line costs do not have acute effects on performance compared to online overheads, they still need to be analyzed to report the off-line workload. Such extra costs include storage, communication, and computation overheads. Furthermore, a detailed analysis of the privacy levels provided to individuals through the proposed methods is essential.

5.1. Overhead costs analysis

The storage cost of traditional CF schemes is in the order of O(nm). The employed clustering algorithms and the privacy preservation scheme do not introduce any extra storage costs. However, producing PBPs and/or RBPs requires additional space in the order of O(nf) to record such CBPs, where f is the constant number of item features.

Additional communication costs can be considered in two aspects: (i) the number of communications and (ii) the amount of data transferred. The number of communications does not change in the off-line and online phases of the proposed clustering-based CF schemes. The amount of data to be transferred also remains the same in the online phase; however, it increases in the off-line stage due to the data masking procedures. Owing to privacy issues, recall that users insert fake ratings into their profiles to hide their actual ratings. Therefore, according to the scheme, each user u's amount of transferred data grows by β_u% of their number of truly rated items.

Additional computation costs are also expected. The off-line phase includes two stages: (i) generating CBPs and (ii) obtaining a model through clustering. The data holder first generates CBPs in the order of O(nm) because it checks every item of all users in the matrix. Then, it applies one of KMC, FCM, or SOM to the produced CBPs to construct a model. The running times of the KMC, FCM, and SOM clustering methods are O(n^{fc+1} log n), O(nfc^2), and O(n^2), respectively. The actual performance of a recommender system is defined by its response time to queries, which is determined by the online computation load. In our scheme, the server estimates a prediction in an identical manner to the non-privacy-preserving approach, which has an online running time in the order of O(n^2 m). However, due to clustering, the number of neighbors for which similarities must be calculated is reduced significantly. The clustering-based CF approach defined in this study requires three steps to produce an online prediction. When a sends her data with a query, the corresponding CBP of a is first generated by the server in O(m) time because there are m items. After obtaining the CBP, the cluster a belongs to is determined by calculating the similarity between a and all cluster centers. Having c clusters, calculating such similarities via dot products takes O(mc) time. Finally, since users are grouped into distinct clusters, we may assume that there are approximately n/c users in each cluster. Actually, this might not be the case for FCM, since users may be included in more than one cluster; therefore, we may assume that there are at most n/2 users in each cluster for FCM. In this scenario, the online running time of our approach is in the order of O(n^2 m/c^2). As can be seen from this complexity expression, the higher the number of clusters, the lower the online running time.

5.2. Privacy analysis

To prevent the server from learning actual ratings and the rated items, we propose employing data perturbation approaches on user vectors and inserting fake ratings, while affecting the preciseness of the system negligibly. Thus, according to the proposed protocol, there are two particular considerations in analyzing the privacy level of a recommender system.

Table 3
Privacy levels vs. level of perturbation.

σ_max                   0.5     1       2       3       4
Privacy level Π(X|Z)    1.80    2.90    3.71    3.82    4.03

5.2.1. Distinguishing between true and fake ratings
Since users insert fake ratings to hide their truly rated items, the data holder first needs to separate true and fake ratings in a received disguised user vector. Considering there are a total of m items, let m_r and m_e denote the number of true ratings and empty cells, respectively. After a user inserts fake ratings into β_u% of the m_e empty cells, the server receives m'_r ratings with m'_e empty cells, where m'_r > m_r, m'_e < m_e, and m = m_e + m_r = m'_r + m'_e. Knowing β_max and m'_r, the server needs to estimate β_u, m_r, and the complete set of true ratings. The server can estimate β_u only with some probability, because its value depends on β_max and m'_r. Assume that each user provides at least 2 authentic ratings. Then, there are at most m'_r − 2 fake ratings; denote this number by m_f. If m_f is greater than β_max% of m, then the server can estimate the value of β_u with 1-out-of-β_max probability; denote it by p_{β_u}. Otherwise, β_u is chosen from a narrower range and it can be estimated with a probability of 1-out-of-⌊m_f × 100/m⌋. After guessing β_u, the server can now estimate m_r. Hereafter, m_r = m − m_e, where m_e = 100 m'_e/(100 − β_u). Possible sets of actual ratings can then be predicted as one of the combinations of m_r possibilities out of the m'_r observed ratings. If all these possibilities are combined, the probability of determining the exact set of actual ratings, p_{m_r}, out of a disguised rating vector is p_{β_u} × C^{m'_r}_{m_r}, where C^X_Y stands for the number of Y-combinations from a given set of X elements.
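The size of the search space facing the server can be illustrated with hypothetical numbers (m, m_r, and β_u below are made-up values, not taken from the paper's experiments):

```python
from math import comb

# Illustrative setting: m items, m_r true ratings, beta_u percent of the
# m_e empty cells filled with fake ratings.
m, m_r, beta_max = 1000, 100, 25
m_e = m - m_r
beta_u = 10
m_fake = round(m_e * beta_u / 100)     # 90 fake ratings inserted
m_r_prime = m_r + m_fake               # ratings the server actually observes

# Number of candidate true-rating sets hidden inside the observed vector.
# The server must also guess beta_u (a 1-out-of-beta_max choice here), so
# the odds of recovering the exact true set scale with
# beta_max * C(m_r_prime, m_r).
candidates = comb(m_r_prime, m_r)
```

Even for these modest numbers, `candidates` is astronomically large, which is the point of the combinatorial argument above.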

5.2.2. Estimating real ratings from disguised values
After distinguishing between true and fake ratings, the server needs to figure out the real values of the true ratings from their masked z-scores. Therefore, we need to quantify the provided privacy level. Such quantification defines how precisely the original value of a rating can be estimated from its masked form. Agrawal and Aggarwal [46] propose a measure to quantify privacy, and this measure is employed in [11] to show privacy levels in the PPCF context. According to this metric, we compute privacy levels as performed in [11] and present the results in Table 3 with respect to the varying levels of perturbation employed in our experiments.

As followed from the table, with an increasing level of perturbation (randomization), privacy levels increase in an exponentially decaying manner. There are also further challenges for the server in guessing real ratings: estimating the random number distribution type each user utilizes, estimating the value of σ_u, which is randomly chosen from the interval (0, σ_max], and deducing the mean and the standard deviation of each user's original ratings.

6. Experiments

We conducted experiments to see how the three previously explained clustering algorithms perform in terms of accuracy and efficiency in PPCF schemes, and to show how effective our proposed CBP scheme is. We first evaluated how CBPs affect overall performance in traditional CF schemes along with the clustering approach. After producing CBPs on masked data, we performed clustering to assess the performance and to provide a comparison among the three clustering algorithms in the PPCF framework. We finally carried out several experiments on real data to investigate how privacy measures and clustering efforts affect accuracy.


Fig. 2. MAEs for varying number of neighbors (k).

Table 4
Comparing PBP and RBP for varying c (without privacy).

           c=1      c=2      c=3      c=5      c=7      c=10
PBP  KMC   0.7519   0.7561   0.7635   0.7732   0.7881   0.8054
     FCM   0.7519   0.7543   0.7586   0.7595   0.7588   0.7620
     SOM   0.7519   0.7549   0.7668   0.7800   0.7886   0.8103
RBP  KMC   0.7519   0.7604   0.7708   0.7837   0.8114   0.8293
     FCM   0.7519   0.7603   0.7634   0.7706   0.7774   0.7803
     SOM   0.7519   0.7604   0.7748   0.8050   0.8151   0.8379


6.1. Data set and evaluation criteria

Experiments were performed on a variation of the well-known, publicly available MovieLens (ML) data set, which was collected by the GroupLens research team at the University of Minnesota (http://www.grouplens.org). The set contains 100,000 discrete votes from 943 users on 1682 movies on a five-star rating scale. In addition, there are 18 predefined genre categories and each movie in the set belongs to at least one. ML is a very sparse data set: 93.7% of all possible ratings do not exist. Since ML includes genre categories, we chose it for our experiments. The outcomes on this set can be generalized.

We employed the mean absolute error (MAE), which measures how close predictions are to the eventual ratings as an average of the absolute errors e_i = p_i − v_i, where p_i is the prediction and v_i is the actual vote. Hence, the smaller the MAE, the better the results. Since we employ the proposed clustering schemes to improve the scalability of CF-based systems, we also measured the total elapsed time (T), in seconds, spent on producing predictions online.

6.2. Experimentation methodology

Experiments were performed using a fivefold cross-validation methodology in which we uniformly randomly divided the data set (D or D′) into five subsets. We used one of the subsets (D_i or D'_i) as the test set and the remaining four as the train set in each iteration i, where i = 1, 2, 3, 4, 5. After uniformly randomly obtaining train and test users, we withheld five rated items' votes for each test (active) user, replaced their entries with null, tried to predict their values, and compared them with the actual ratings. We first employed the three clustering algorithms along with the proposed profiling scheme in the CF framework to show how they affect performance and accuracy without considering confidentiality issues. Then, we employed the profiling scheme and the clustering approaches in the privacy-preserving framework in the same manner. In both frameworks, we formed clusters relying on the CBPs of users with all three clustering algorithms, then returned to the original rating-based user–item matrix, either D or D′, and estimated a prediction after calculating similarities based on rating profiles. Since RPT schemes disguise data by adding random numbers to actual values, we repeated all trials in the privacy-preserving framework 100 times and took their averages to obtain more dependable outcomes. We ran trials in the MATLAB 7.9.0 environment using a computer with an Intel Xeon 2.8 GHz processor and 6 GB RAM. For clustering operations, we used MATLAB's built-in functions with default options, except for choosing the 'correlation' distance measure (which utilizes Pearson's correlation coefficient) for k-means clustering, the fuzziness exponent 'Q' as 2 for FCM, and a 1-dimensional feature map trained through 500 epochs for SOM.
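The fivefold split and the withholding of five rated items per test user can be sketched as follows. This is a generic illustration of the protocol, not the authors' MATLAB code; function names and the 0-as-null encoding are assumptions:

```python
import numpy as np

def fivefold_indices(n_users, seed=0):
    """Uniformly random fivefold split of user indices."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_users), 5)

def withhold(test_row, n_hide=5, seed=0):
    """Null out n_hide rated entries of a test user's row.

    Returns the masked row and a dict of withheld (item, rating) targets
    to compare predictions against. 0 encodes "null" here.
    """
    rng = np.random.default_rng(seed)
    rated = np.flatnonzero(test_row)
    hide = rng.choice(rated, size=n_hide, replace=False)
    row = test_row.copy()
    targets = {int(j): float(test_row[j]) for j in hide}
    for j in hide:
        row[j] = 0.0
    return row, targets

folds = fivefold_indices(943)   # ML has 943 users
```

Each fold serves once as the test set while the other four form the train set.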

6.3. Results and discussion

The parameters tested throughout the experiments can be listed as follows: (i) the number of neighbors in the recommendation process (k), (ii) the pre-defined number of clusters (c) for the clustering algorithms, (iii) the maximum unrated cell filling percentage (β_max), (iv) the maximum standard deviation used to produce random numbers (σ_max), and (v) the type of random number distribution. It is shown in [11] that employing uniform or Gaussian distributions produces very similar results. Therefore, we did not perform additional trials for this option; instead, we assumed that approximately half of the users employ the uniform distribution and the remaining ones utilize the Gaussian distribution. After determining the optimum values of all parameters, we conducted one last test to provide a clear comparison among the clustering algorithms under their best performing conditions. Detailed procedures and results of the conducted experiments are explained in the following.

Number of neighbors (k): We first performed trials varying the number of neighbors contributing to the recommendation estimation process in both the traditional CF and the PPCF framework, without employing the clustering algorithms. We tried to find the optimum k values for both schemes, varying k from 3 to 100. We kept β_max at 25 and σ_max at 1 in the PPCF scheme. We then estimated the MAEs and display them in Fig. 2.

Accuracy improves with an increasing number of neighbors joining the recommendation estimation process, as seen from Fig. 2. However, while the improvement is significant as k moves from 3 to 60, it stabilizes around 100 neighbors for both the CF and PPCF schemes. On the other hand, employing more than 100 neighbors might aggravate scalability problems, as stated in [33]. Thus, we fixed k at 100 neighbors contributing to the prediction estimation process in the rest of the experiments.

Number of clusters (c): We conducted trials to see the effects of varying c on efficiency and accuracy while running the clustering algorithms. Obviously, as c increases, the online time T spent on producing predictions declines. However, increasing c may also cause losses in accuracy. Therefore, we investigated the behavior of the clustering algorithms and tried to find an optimum value by varying c from 1 to 10 in both the CF and PPCF schemes. Note that c = 1 means there is no clustering at all, which constitutes a baseline against which to compare the performance of the clustering approaches. We kept the other parameters constant at k = 100 for both schemes, and β_max = 25 and σ_max = 1 in the PPCF scheme. We then estimated the MAE and T values for all algorithms in both the CF and PPCF schemes. We display the MAEs in Tables 4 and 5, covering both profiling schemes, for CF (no privacy concerns) and PPCF (with privacy concerns), respectively. Note that since both profiling schemes, PBP and RBP, show very similar trends in the overall averages of the T values, we depict the results for the PBP scheme in Figs. 3 and 4 for the CF and PPCF schemes, respectively.

Table 5
Comparing PBP and RBP for varying c (with privacy).

           c=1      c=2      c=3      c=5      c=7      c=10
PBP  KMC   0.7976   0.7956   0.8010   0.8170   0.8328   0.8580
     FCM   0.7977   0.7943   0.7930   0.7957   0.7941   0.7944
     SOM   0.7977   0.7946   0.7979   0.8128   0.8321   0.8661
RBP  KMC   0.7980   0.8033   0.8164   0.8488   0.8529   0.8630
     FCM   0.7973   0.8734   0.9005   0.9027   0.9029   0.9047
     SOM   0.7967   0.9157   0.9139   0.9153   0.9154   0.9167

Fig. 3. Online time (T) for varying number of clusters (without privacy).

As can be followed from Table 4, when all three clustering algorithms are employed along with both proposed profiling schemes without preserving privacy, all algorithms produce similar MAE values, with FCM performing slightly better than KMC and SOM (depicted in bold). There is a slight decrease in accuracy (from 0.7519 to 0.7543) for c = 2, while T almost halves for FCM. Furthermore, we see that the MAE values of both profiling schemes show similar trends, with PBP performing insignificantly better than RBP. Likewise, we see in Fig. 3 that, as c increases, the online time T spent on producing predictions falls, since the number of users for whom we need to calculate similarity weights decreases. FCM

Fig. 4. Online time (T) for varying number of clusters (with privacy).

produces predictions in around 100 s for all values of c greater than 1. Therefore, we can conclude that c = 2 is the best option for FCM, because the MAE value corresponding to two clusters is the smallest, about 0.7543.

When privacy measures are employed, the clustering algorithms produce even better results compared to the no-clustering approach. As shown in Table 5, with the PBP scheme all three algorithms show MAE values very close to the original PPCF approach, with FCM the best among all (depicted in bold). When c = 3, FCM achieves the smallest MAE, which is better than the base approach (0.7977 for the base and 0.7930 for FCM). Moreover, KMC and SOM produce similar outcomes in terms of MAE values with high accuracy. However, the RBP scheme is not able to produce accurate referrals. The reason is that RBP is affected much more by the privacy protection protocol. Within the protocol, we disguise each rating with random numbers and additionally insert fake random ratings into the profiles. Although each existing rating, whether true or fake, is reflected as an increment of 1 in the PBPs, ratings affect the RBPs by their values; such values are so perturbed that they distort the RBPs much more than the PBPs. The online time T needed to produce predictions shows trends similar to those without preserving confidentiality, as seen from Fig. 4. KMC and SOM produce predictions in an exponentially decaying time scale for increasing c; on the other hand, FCM again shows a non-improving behavior in terms of T, due to fuzzy clustering. In short, while producing predictions with privacy concerns, the PBP scheme performs better than RBP, and while FCM performs better for all levels of c in terms of accuracy, it can only improve T by a factor of about two. Thus, it might be useful to employ KMC or SOM if scalability concerns outweigh accuracy.

Maximum unrated cell filling percent (β_max): To mask their truly rated items, each user u selects a β_u value uniformly randomly over the range (0, β_max], where β_max is a privacy parameter closely related to the accuracy and confidentiality level of the system, because it controls the maximum percentage of unrated item cells filled by users. We tested the effects of this parameter while varying its value from 3 to 100. We kept the other parameters constant at k = 100, c = 3, and σ_max = 1. After computing the overall averages of the errors, we present the outcomes in Fig. 5. We did not employ RBP in the PPCF schemes due to the conclusion of the previous experiments; hence, the results depicted in Fig. 5 represent the outcomes for the PBP scheme.

β_max qualifies the individual privacy level; hence, it is inversely correlated with accuracy. If PBP is utilized, accuracy improves as β_max increases up to 25%, but decreases again as β_max is increased further toward 100%. This is expected, because larger β_max values distort the original data more, causing larger accuracy losses. In other words, the bigger the β_max value, the larger the randomness added to the original data. With increasing randomness, rating and content-based profiles are distorted more and accuracy is expected to worsen. However, because of the sparsity of ML, added randomness up to 25% improved accuracy, as it helped in determining similarities among PBPs. As seen from Fig. 5, the best accuracy results are obtained when β_max = 25. FCM is more resistant to changes in β_max than KMC or SOM. It is apt to choose β_max as 25 for all clustering algorithms.

Maximum standard deviation (σ_max): We lastly tested the parameter σ_max. According to the proposed protocol, each user u uniformly randomly chooses a σ_u value over the range (0, σ_max]. It is imperative to examine the effects of varying σ_max values, because the higher the σ_max value, the better the privacy level and the poorer the accuracy. We experimented on σ_max by changing its value from 0.5 to 4. We kept the other parameters constant at k = 100, c = 3, and β_max = 25. We then estimated the MAEs and present them in Fig. 6.

Fig. 5. MAE values for varying β_max values.

It is understandable that accuracy and σ_max are indeed inversely correlated: as the range used to produce random numbers expands with increasing σ_max, the perturbation level increases, which makes accuracy worse. As seen from Fig. 6, with increasing σ_max values, the quality of the referrals gets dramatically worse. Although accuracy values regress with increasing σ_max values, the privacy level provided to individuals is enhanced due to the augmented randomness. It is fundamental to prevent the data holder from estimating σ_u values in PPCF for individual privacy reasons. Therefore, a σ_max value of 2 might be chosen as an optimal value to still produce dependable and accurate recommendations while disguising the private data irreversibly, and thus not deeply jeopardizing individual privacy.

Fig. 6. MAE values for varying σ_max values.

Overall performance: We finally conducted an experiment to show the joint effects of the controlling parameters of all clustering algorithms in the PPCF framework. For this purpose, we ran all clustering algorithms in the PPCF scheme while varying σ_max from 0.5 to 4. Choosing σ_max depends entirely on the privacy level demanded; therefore, for this search, we did not fix the σ_max parameter but varied it to see all possibilities. However, we set k at 100, c at 2, and β_max at 25 for all algorithms and employed the PBP scheme. We measured MAE and T values. We present our results in Table 6, where we also include the MAE and T values of the traditional CF algorithm without privacy protection (referred to as Base) to offer the reader a point of comparison.

As seen from Table 6, the quality of the predictions obviously worsens due to privacy concerns. However, although there are accuracy losses due to our proposed privacy protection scheme, such losses are small. It is still possible to produce recommendations with fairly good accuracy employing our content-based profiling scheme and a clustering approach. Among the examined clustering approaches, FCM performs the best in terms of both accuracy and online performance. Compared to the base algorithm, it produces 4715 predictions in 91 s instead of 199 s, reducing the online time T to about 45.7% of the base. We assessed the significance of the obtained MAE values using t-tests for all clustering algorithms. The significance tests were performed by mutually comparing the fivefold MAE results of the traditional CF algorithm and each of KMC, FCM, and SOM. We performed one-tailed t-tests for all σ_max values of 0.5, 1, 2, 3, and 4. We hypothesized that the differences between the traditional CF algorithm's and each clustering algorithm's accuracy values are not significant. We present the results in Table 7.

For σ_max values of 0.5 and 1, the t statistics indicate that the difference between the traditional algorithm's and each clustering algorithm's accuracy is not significant at a confidence level of 90% (indicated with a single asterisk), as seen from Table 7. Also, when we increase σ_max to 2, the results are still not significantly different at a confidence level of 95% (indicated

Table 6
Overall performance with varying σ_max values.

σ_max    Base     KMC      FCM      SOM
MAE
0.5      0.7519   0.7884   0.7853   0.7868
1        0.7519   0.7935   0.7915   0.7918
2        0.7519   0.8135   0.8080   0.8108
3        0.7519   0.8622   0.8519   0.8562
4        0.7519   0.9143   0.9079   0.9055

T (s)    199      116      91       104


Table 7
Statistical significance of the differences.

σ_max   KMC          FCM          SOM
0.5     t = 1.34*    t = 1.24*    t = 1.27*
1       t = 1.48*    t = 1.43*    t = 1.42*
2       t = 2.29**   t = 2.03*    t = 2.16**
3       t = 4.06     t = 3.48     t = 3.84
4       t = 6.27     t = 6.04     t = 6.09



4 t = 6.27 t = 6.04 t = 6.09

ith double asterisk). However, for �max values of 3 and 4, theifferences between MAE values seem to be significant, which isxpected due to level of randomization. Since we disguise inputectors with random numbers distributing in a considerably largeange, privacy enhances with the cost of much accuracy loss. Bear-ng the trade-off in mind, a �max value of 2 seems harmonizingetween privacy requirements and accuracy losses. In conclusion,ur profiling and privacy protection scheme along with clusteringlgorithms achieve confidentiality while sacrificing little on accu-acy but bringing a lot in scalability.

7. Conclusions and future work

We proposed a scheme to solve the privacy problem that clustering-based CF schemes encounter by utilizing RPT techniques. Our privacy-preserving scheme hides users' ratings and the items they rated. As expected, the quality of the referrals becomes worse due to our privacy-preserving measures. However, compared to the original scheme, the relative errors due to our proposed scheme are not major, as demonstrated with a significance test. Since accuracy and privacy are conflicting goals, such losses are inevitable and can be considered acceptable. Thus, our scheme is able to achieve privacy without sacrificing much accuracy. At least as important as accuracy and privacy, performance is also among the goals that should be accomplished in CF approaches. The applied clustering algorithms served well in scaling PPCF systems by reducing the online running time of the recommendation algorithms. Especially the fuzzy approach offers a low-cost solution for producing high-quality recommendations and enhances the robustness of PPCF systems due to its approximation-based model. On the other hand, applying privacy-preserving methods might introduce extra costs. Online performance is extremely critical for deploying CF systems successfully. We showed that although our scheme introduces additional costs, they are negligible and do not immensely affect performance.
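A minimal sketch of the kind of randomized-perturbation disguising the scheme builds on: z-score the known ratings, add zero-mean random noise whose level is drawn from (0, σmax], and fill unrated cells with noise so rated items cannot be distinguished. Function and parameter names are illustrative; the paper's exact masking procedure is specified earlier in the article.

```python
import random

def disguise(ratings, sigma_max=2.0, seed=None):
    """Disguise a sparse rating vector (None = unrated) before submitting it.

    Known ratings are z-scored, zero-mean Gaussian noise is added, and
    unrated cells are filled with pure noise to hide which items were rated.
    """
    rng = random.Random(seed)
    rated = [r for r in ratings if r is not None]
    mean = sum(rated) / len(rated)
    sd = (sum((r - mean) ** 2 for r in rated) / len(rated)) ** 0.5 or 1.0
    # Each user draws a private noise level from (0, sigma_max]
    sigma = rng.uniform(1e-9, sigma_max)
    return [((r - mean) / sd if r is not None else 0.0) + rng.gauss(0.0, sigma)
            for r in ratings]
```

Because every cell (rated or not) ends up as a noisy float, the server cannot tell rated cells from filler, which is the property the scheme trades accuracy for as σmax grows.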

We also proposed to employ a content-based profiling approach to overcome sparsity-related similarity calculation problems in CF systems. By taking advantage of common features between items, we are able to produce much smaller and denser content-based profiles to be utilized in the clustering processes. Although our two distinct approaches of producing CBPs perform very well in the traditional CF approach, RBP was strongly affected by the data disguising schemes of PPCF and was not able to produce predictions with high accuracy. However, the PBP scheme performed well in PPCF schemes along with the clustering approaches. Additionally, since these profiles are computed off-line, they do not cause any extra online costs.
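The paper's PBP and RBP constructions are defined earlier in the article; the generic sketch below only illustrates the underlying idea of collapsing a sparse rating vector onto shared item features to obtain a short, dense profile. All names and the mean-per-feature aggregation are hypothetical choices.

```python
def content_profile(ratings, item_features, n_features):
    """Aggregate a sparse rating dict {item: rating} into a dense
    per-feature profile using each item's feature list."""
    totals = [0.0] * n_features
    counts = [0] * n_features
    for item, rating in ratings.items():
        for f in item_features[item]:
            totals[f] += rating
            counts[f] += 1
    # Mean rating per feature; 0.0 for features never encountered
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# Two rated items, three features: item 0 carries features 0 and 1,
# item 1 carries feature 1 only
profile = content_profile({0: 4.0, 1: 2.0}, {0: [0, 1], 1: [1]}, 3)  # [4.0, 3.0, 0.0]
```

Because the profile length is the (small, fixed) number of features rather than the number of items, two users who rated disjoint item sets can still be compared and clustered.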

We are planning to explore other methods of producing supplementary profiles, as well as other approaches such as predicting estimates for unrated cells via approximation algorithms, to deal with the sparsity problem. We will also study how such clustering algorithms might be utilized to improve the scalability of CF systems, and we plan to combine other data reduction approaches with clustering methods. Finally, we will investigate how such clustering-based techniques and profiling schemes can be applied to binary ratings.


Acknowledgement

This work is supported by Grant 108E221 from TUBITAK.
