
Toward Efficient Filter Privacy-Aware Content-Based Pub/Sub Systems

Weixiong Rao, Member, IEEE, Lei Chen, Member, IEEE, and

Sasu Tarkoma, Senior Member, IEEE

Abstract—In recent years, the content-based publish/subscribe [12], [22] has become a popular paradigm to decouple information producers and consumers with the help of brokers. Unfortunately, when users register their personal interests to the brokers, the privacy pertaining to filters defined by honest subscribers could easily be exposed by untrusted brokers, and this situation is further aggravated by the collusion attack between untrusted brokers and compromised subscribers. To protect the filter privacy, we introduce an anonymizer engine to separate the roles of brokers into two parts, and adapt the k-anonymity and ℓ-diversity models to the content-based pub/sub. When the anonymization model is applied to protect the filter privacy, there is an inherent tradeoff between the anonymization level and the publication redundancy. By leveraging partial-order-based generalization of filters to track filters satisfying k-anonymity and ℓ-diversity, we design algorithms to minimize the publication redundancy. Our experiments show that the proposed scheme, compared with studied counterparts, has a smaller forwarding cost while achieving comparable attack resilience.

Index Terms—Content-based pub/sub, k-anonymity, ℓ-diversity


1 INTRODUCTION

In recent years, the content-based publish/subscribe (pub/sub) [12], [22] has become a popular paradigm to decouple information producers and consumers (i.e., publishers and subscribers, respectively). It offers expressive and flexible information targeting capabilities for many Internet and mobile applications. In such a system, subscribers declare their personal interests by defining subscription conditions as filters, and publishers produce publication messages. On receiving publication messages from publishers, brokers match publications with registered filters, and forward each matched publication to the interested subscribers in a one-to-many manner.

The content-based pub/sub offers an excellent decoupling property with the help of brokers. Unfortunately, brokers also introduce privacy concerns [25], [26], [34]. For example, brokers may be hacked, sniffed, subpoenaed, or impersonated. Thus, they cannot be trusted. When users define their personal interests as filters and register the filters to brokers, they could receive publications containing sensitive information (e.g., corporate or military) or political/religious affiliations. The untrusted brokers could expose the personal and sensitive interests of such users. Furthermore, by deploying brokers as public third-party servers, many modern applications, like service-oriented architectures (SOAs) and social computing platforms, have adopted the content-based pub/sub paradigm. Attacks against public third-party servers could easily leak subscribers’ interests. For example, on April 27, 2011, Sony admitted that its PSN platform had been hacked, leading to the leakage of 70 million users’ information [2]. This incident highlights the risks of using public servers that may become compromised.

A number of systems have been proposed in the literature on the security and privacy of pub/sub systems. Most of them [23], [25], [32], [33] address access control, publication content confidentiality, secure routing (using cryptographic techniques), publisher privacy, and so on. However, few works consider the privacy of filters defined by subscribers. In particular, in the so-called collusion attack, compromised subscribers collude with untrusted brokers against honest subscribers. The attack could easily expose the filters defined by the honest subscribers, even if traditional encryption techniques are used to encrypt publications. For example, a group of users subscribes to stock information pertaining to a secret and encrypted stock price. Suppose that one of those users is compromised and colludes with an untrusted broker. Due to the one-to-many communication offered by the pub/sub, the broker correlates publications with the set of filters matched with the publications. Among the matched filters, some are defined by compromised users and some by honest users (for convenience, we call such filters compromised filters and honest filters, respectively). When the compromised users decrypt encrypted publications, attackers easily infer that the honest filters match the same publications, and the honest users are exposed as being interested in the publications. Thus, due to the combined effects of the compromised users and untrusted brokers, traditional encryption techniques alone cannot defend against the collusion attack.

2644 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 11, NOVEMBER 2013

- W. Rao is with the School of Software Engineering, Tongji University, Chao’an Road 4800, Shanghai, China. E-mail: [email protected].

- S. Tarkoma is with the Department of Computer Science, University of Helsinki, PL 68 (Gustaf Hallstromin katu 2b), 00014 Helsinki, Finland. E-mail: [email protected].

- L. Chen is with the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. E-mail: [email protected].

Manuscript received 1 Dec. 2011; revised 25 May 2012; accepted 25 July 2012; published online 6 Sept. 2012. Recommended for acceptance by E. Ferrari. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2011-12-0734. Digital Object Identifier no. 10.1109/TKDE.2012.177.

1041-4347/13/$31.00 © 2013 IEEE Published by the IEEE Computer Society

In this paper, to protect the filter privacy, we adapt the k-anonymity [5], [17], [31], [35] and ℓ-diversity [21] models to the content-based pub/sub. Given the adapted model, called filter anonymization, the collusion between an untrusted broker and compromised filters cannot easily expose honest filters.

When protecting filters by the filter anonymization, there is an inherent tradeoff between filter privacy and publication redundancy. The former is described in terms of how well the filters are protected, and the latter in terms of how many redundant publications are forwarded. Thus, the key questions addressed in the paper are the following: 1) how to ensure that a given filter is k-anonymous and ℓ-diverse (giving a certain level of privacy protection), and 2) how to ensure that the overall publication redundancy is minimized? Our general idea to offer the filter privacy is to cloak filters with more general ones. The main technique is to leverage partial-order-based generalization of filters to track filters satisfying k-anonymity and ℓ-diversity, and to design solutions that minimize the publication redundancy. In summary, the main contributions of this paper are:

- We identify that the one-to-many communication paradigm in the content-based pub/sub cannot defend against the collusion attack, and then introduce an anonymizer engine to separate the roles of brokers.

- We propose the filter anonymization techniques to cloak real filters, and design techniques to minimize the publication redundancy.

- We extend the proposed technique to the distributed content-based pub/sub on a set of clustered machines to offer a scalable solution.

- Our experiments verify that the proposed solution scales well with the numbers of filters, publications, and machines, and achieves the optimal tradeoff between privacy protection and publication redundancy.

This paper is structured as follows: Section 2 presents the data model, and Section 3 introduces the solution. Section 4 gives the techniques for generalizing filters, and Section 5 designs algorithms to minimize the publication redundancy. Next, Section 6 analyzes the inference attack resilience, and Section 7 presents the distributed solution on a set of clustered machines. After that, we report the evaluation results in Section 8 and review the related work in Section 9. Finally, Section 10 concludes the paper.

2 DATA MODEL AND ASSUMPTIONS

In this section, we introduce the content-based pub/sub and the corresponding data model. Fig. 1 summarizes the main symbols and their meanings used in the paper.

2.1 Overview of Content-Based Pub/Sub

Fig. 2a illustrates the three logical parties of the content-based pub/sub [6], [12], [22], [28]: publishers, subscribers, and brokers. Publishers (the users or associated software agents that produce publications) first announce advertisements of to-be-published messages to brokers, and then publish content messages. Subscribers (the users or associated software agents that consume publications) declare their interests by filters, and send subscription requests containing the filters to brokers. Brokers decouple publishers and subscribers to offer asynchronous content delivery. On receiving advertisements from publishers, brokers validate filters and then organize filters into a filter indexing structure, for example, a partially ordered set (in short, poset) [6] (we will introduce the poset soon). Next, when publications come, with the help of the poset, the brokers match incoming publications with the indexed filters. After matched filters are found, the brokers forward publications to the associated subscribers. We note that an advertisement contains the summary information of to-be-published messages [6], and we utilize this summary information in our solution.

2.2 Publications

Following previous work [6], [12], [22], [28], a publication is a set of typed attributes. Each typed attribute contains a type, a name, and a value. The publication as a whole is purely a structural value derived from its attributes. Besides this, we assume that among all publication attributes there exists a special attribute, named the sensitive attribute (SA). The values w.r.t. the SA are called sensitive values (SVs). Typically, each publication is associated with a unique SV, and the SV contains sensitive information.

Given a set N of publications, we denote the cardinality of N by |N|. Since each publication n ∈ N is associated with a unique SV, for all publications in N we denote the total number of distinct SVs by ‖N‖. For example, the three publications in Fig. 3a contain four attributes. Among these attributes, the price is the SA, and indicates the sensitive and secret stock price. These publications are associated with two distinct SVs, equal to 15 and 40. Depending on application requirements, the SA can be of other data types (e.g., string). For example, suppose the SA is of the string type and the SVs are categories of illness, such as HIV or pneumonia.

Fig. 1. Used symbols and their meanings.

Fig. 2. (a) Original pub/sub; (b) privacy-aware pub/sub.

Fig. 3. (a) Three publications; (b) poset with eight filters.
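The publication model above can be sketched as follows; the attribute names and concrete values are illustrative assumptions mirroring the stock example of Fig. 3a, not the paper's actual schema.

```python
# Hypothetical sketch of the publication model: each publication is a set
# of typed attributes, one of which ("price" here) is the sensitive
# attribute (SA) whose value is the sensitive value (SV).

def distinct_svs(publications, sa):
    """Return the set of distinct SVs, so that ||N|| = len(distinct_svs(N, sa))."""
    return {pub[sa] for pub in publications}

# Three publications resembling Fig. 3a: n1 and n3 share the SV 40.
N = [
    {"symbol": "IBM", "volume": 1000, "price": 40},   # n1
    {"symbol": "HP",  "volume": 2000, "price": 15},   # n2
    {"symbol": "SUN", "volume": 500,  "price": 40},   # n3
]

print(len(N))                         # |N|  = 3 publications
print(len(distinct_svs(N, "price")))  # ||N|| = 2 distinct SVs (15 and 40)
```

Note the distinction the section draws: |N| counts publications, while ‖N‖ counts distinct sensitive values, and the two differ as soon as publications share an SV.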

2.3 Filters

A filter f defines a stateless Boolean function that accepts a publication message as an argument. Typically, the filter f contains predicate conditions over typed attributes. For example, f1 in Fig. 3b defines the predicate condition [1, 100] over the price attribute and expects publications having a price inside the interval [1, 100].

A publication n is said to match a filter f if f(n) = true. For the filter f, we denote by N(f) the set consisting of those publications matching f. Since each publication n ∈ N(f) is associated with a unique SV, via N(f) a filter f is associated with |N(f)| SVs and ‖N(f)‖ distinct SVs.

Given two filters f and f′, we say f covers f′ if and only if N(f) ⊇ N(f′), and denote the covering relationship by f ⊒ f′ or f′ ⊑ f. As a special covering relationship, we define an immediate covering relationship, denoted by f ⋗ f′, over a ground set G of filters. That is, f ⋗ f′ holds if 1) f ⊒ f′ holds, and 2) there is no element f″ ∈ G (with f″ ≠ f and f″ ≠ f′) satisfying f ⊒ f″ and f″ ⊒ f′.

When given a set of filters G, the partially ordered set (poset) [6] is defined as an ordered pair P = (G, ⊒), where G is called the ground set and ⊒ is the partial order of P. Based on the poset P, we have the following definitions:

- For each f ∈ G, we have the sets of predecessors and successors of f in P, given by Pred(f) = {f′ ∈ G | f′ ⊒ f and f′ ≠ f} and Succ(f) = {f′ ∈ G | f ⊒ f′ and f′ ≠ f}.

- We have the immediate predecessors and successors of f, given by ImPred(f) = {f′ ∈ G | f′ ⋗ f} and ImSucc(f) = {f′ ∈ G | f ⋗ f′}. We consider ImSucc(f) (resp. ImPred(f)) a special subset of the successors (resp. predecessors) of f, i.e., ImPred(f) ⊆ Pred(f) and ImSucc(f) ⊆ Succ(f).

Fig. 3b shows an example poset of eight filters. We note that a poset could contain multiple roots, not just the one root shown in Fig. 3b. Moreover, we can extend the poset to other types of data (instead of the numeric type illustrated in Fig. 3b). For example, given a set of illness names (the string type), we can leverage the category relations among all illnesses to define the covering relations over them. That is, given any pair of illnesses, we can infer which one is the successor and which one is the predecessor. Based on these relations, we then similarly build a poset for the illness names.
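To make the poset operations concrete, here is a small illustrative sketch (not the paper's implementation): range filters over the price attribute, where f covers f′ iff f's interval contains f′'s interval, so that N(f) ⊇ N(f′). The filter names follow Fig. 3b, but the concrete intervals are invented, chosen so that ImSucc(f2) = {f5, f6, f7}, matching the running example.

```python
# Hypothetical interval filters named after Fig. 3b; the concrete
# intervals are assumptions for illustration only.
filters = {
    "f1": (1, 100), "f2": (1, 60), "f5": (1, 30),
    "f6": (10, 50), "f7": (20, 60), "f8": (25, 45),
}

def covers(f, g):
    """f covers g: f's interval contains g's, so every publication matching g matches f."""
    (lo_f, hi_f), (lo_g, hi_g) = filters[f], filters[g]
    return lo_f <= lo_g and hi_g <= hi_f

def succ(f):
    """Succ(f): all other filters covered by f."""
    return {g for g in filters if g != f and covers(f, g)}

def im_succ(f):
    """ImSucc(f): successors of f with no other successor of f strictly in between."""
    s = succ(f)
    return {g for g in s if not any(g in succ(h) for h in s if h != g)}

print(sorted(succ("f2")))     # ['f5', 'f6', 'f7', 'f8']
print(sorted(im_succ("f2")))  # ['f5', 'f6', 'f7']  (f8 sits below f6 and f7)
```

The immediate-successor computation directly applies clause 2 of the definition: f8 is excluded from ImSucc(f2) because f6 ∈ Succ(f2) lies between f2 and f8.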

3 SOLUTION OVERVIEW

In this section, we introduce the attack model and highlight our solution.

3.1 Attack Model

Definition 1. A filter is exposed if and only if the filter is exactly identified to be matched with the publications having an SV equal to ϑ.

Given the above definition, we want to defend against the following collusion attack without exposing the filter:

Definition 2 (Collusion attack). Among all N filters interested in the sensitive SV ϑ, at least k filters (including the filter f) are honest, and the remaining up to (N − k) filters are all compromised. The compromised filters and an untrusted broker collude together against the honest filter f.

Given the collusion attack above, the original content-based pub/sub (without any privacy protection) cannot defend against the attack, even if publications are encrypted. That is, when the broker provides the decoupling property between publishers and subscribers, it essentially acts in two roles: 1) accepting subscription requests from subscribers for filter registration, and 2) forwarding publications to matched filters. Therefore, the broker correlates (encrypted) publications with a set of matched filters (including the honest ones and the compromised ones). When the subscribers defining the compromised filters decrypt the encrypted publications carrying the SV ϑ, the untrusted broker identifies that all matched filters are interested in ϑ. Among such filters, the honest ones are then exposed as being interested in ϑ. Therefore, the two roles of the broker lead to the correlation between the publications associated with ϑ and a set of filters, and thus expose the honest filters.

3.2 Anonymization Engine

In view of the above issue, we introduce a trusted third-party anonymizer engine and separate the roles of the brokers, as shown in Fig. 2b. The introduced anonymizer engine is operated by a trusted authority (similar to well-known trusted authorities, such as the location anonymizer engine [38] that performs location anonymization and the certificate authority (CA) that issues and verifies digital certificates). Anonymizer engines have been proven to achieve privacy guarantees that are unachievable by client-based or peer-to-peer architectures, and this architecture has been successfully applied in a variety of privacy-preserving systems [17], [19], [38].

As shown in Fig. 2b, the anonymizer engine accepts advertisements from publishers (step a) and filters from subscribers (step b.1). The purpose of the anonymizer engine is to cloak incoming (real and uncloaked) filters and to output cloaked filters that are used to protect the real filters (step b.2).

After that, the broker maintains the indexing structure for the cloaked filters, and forwards publications to matched subscribers (steps c and d). Since the broker registers cloaked filters (instead of real filters), the matching publications contain redundant ones. Thus, among the received publications, subscribers need to filter out those publications that do not match their real filters, and alert users of only the matching publications.

The introduced anonymizer engine and the proposed generalization techniques prevent the aforementioned correlation. We analyze this claim as follows:

- The anonymizer engine, though receiving real filters, is unaware of the publications that match the real filters. Thus, the anonymizer engine cannot correlate the SVs ϑ with the real filters f. Therefore, a curious and even untrusted anonymizer engine alone cannot expose the filters (referring to Definition 1 for the meaning of exposing a filter).


- When the broker matches publications containing sensitive SVs ϑ with the cloaked filters, the publications are forwarded to the matched subscribers. Nevertheless, not all of these subscribers are truly interested in ϑ, and honest filters receive more publications than needed. Thus, attackers cannot precisely identify the real interests of honest filters, even with the help of (N − k) colluded subscribers.

Overall, neither the anonymizer engine nor the broker alone can expose honest filters, unless the anonymizer engine and the broker collude together. However, the anonymizer engine is operated by the trusted authority, and the broker is operated by the pub/sub service providers. Collusion between the two parties is practically infeasible.

Note that our solution is not meant to replace traditional cryptographic techniques. Instead, they can work together and offer a complete solution to protect filters as follows:

- First, following the idea of [27], filters can be encrypted, and publications are matched against the encrypted filters. Therefore, the filter is protected with the least risk of being exposed. This indicates that the filters remain secure even given an untrusted anonymizer engine.

- Second, to defend against collusion between the anonymizer engine and the brokers, we could use the classical cryptographic technique of secure multiparty computation [18], [39]. Specifically, an honest subscriber, together with the remaining (k − 1) honest filters, registers a filter to the anonymizer engine by means of secure multiparty computation [18], [39]. This allows a set of subscribers to register subscription filters and receive content of interest without revealing the individual filters to each other or to outside observers of their publication traffic. It shields every filter against the collusion attack, as well as against collusion between the untrusted broker and a curious anonymizer engine.

In Section 4, we first introduce the proposed privacy model and give the partial order P used to generalize filters. Then, Sections 5 and 6, respectively, present and analyze the solution to minimize the forwarding cost, and Section 7 extends the solution onto clustered machines.

4 PARTIAL ORDER-BASED FILTER ANONYMIZATION

To protect filters, in this section, we first adapt the k-anonymity and ℓ-diversity to the content-based pub/sub, and then leverage the poset P to generalize filters.

4.1 Privacy Models

On receiving a filter f from a subscriber, the anonymizer engine blurs the filter f into a generalized filter f′. We call f′ and f, respectively, a cloaked filter and a real (i.e., uncloaked) filter. The generalization ensures that the filters are still successfully matched with the publications of interest, without incurring false negatives. The cloaked filters generated by the anonymizer engine meet the following privacy models, such that the real filters f are indistinguishable.

Now, let us first define the k-filter anonymity. The k-filter anonymity definition is generic in nature. It ensures that a filter cannot be distinguished within a given reference group, which contains at least k cloaked filters.

Definition 3 (k-filter anonymity). For a publication n having an SV ϑ, the number of cloaked filters that are matched with n is at least k. Thus, any one of the k cloaked filters is identified to be truly interested in ϑ with probability at most 1/k.

The above definition has two requirements: 1) a real filter f is blurred by a cloaked filter f′, and 2) the publication n matches at least k cloaked filters (including the cloaked filter f′). Both requirements are needed. Otherwise, one of the k cloaked filters is exposed to be interested in the SV ϑ with probability larger than 1/k. For example, 1) if no real filter f is generalized, then the k cloaked filters are all real filters, and all of them are exposed to be truly interested in ϑ, and 2) if the number of cloaked filters is smaller than k, then the probability in Definition 3 is larger than 1/k.

Next, we define the ℓ-SV diversity. That is, all publications matching a cloaked filter f′ are associated with at least ℓ diverse SVs (for the publications matching the real filter f, the number of associated SVs might be smaller than ℓ). We believe that both k-filter anonymity and ℓ-SV diversity are required to protect filters. Otherwise, if only the k-filter anonymity is required, f′ may receive publications associated with fewer than ℓ diverse SVs, for example, only one SV ϑ. Then, attackers infer that f′ (and thus f) is particularly interested in the SV ϑ, and the filter is then exposed. Therefore, the ℓ-SV diversity and k-filter anonymity work together to protect the filter’s privacy.

Definition 4 (ℓ-SV diversity). Given any cloaked filter f′, the publications that match f′ contain at least ℓ diverse SVs. Thus, among these ℓ diverse SVs, it is indistinguishable which SV is of interest to f′ (and its real filter f).

When the anonymizer engine cloaks incoming real filters f, we expect that all cloaked filters f′ satisfy both k-filter anonymity and ℓ-SV diversity. We call this privacy model filter anonymization. In the following section, we give the technique to generalize real filters f.
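As a hedged sketch, the two definitions translate into simple membership checks over interval filters; the data, the interval representation, and the function names below are assumptions for illustration, not the paper's code.

```python
# Illustrative checks for Definitions 3 and 4, using range filters over
# the SA "price" (all names and data are hypothetical).

def matches(interval, pub):
    lo, hi = interval
    return lo <= pub["price"] <= hi

def is_k_filter_anonymous(pub, cloaked_filters, k):
    """Definition 3: the publication must match at least k cloaked filters."""
    return sum(matches(f, pub) for f in cloaked_filters) >= k

def is_l_sv_diverse(cloaked, publications, l):
    """Definition 4: publications matching the cloaked filter carry >= l distinct SVs."""
    svs = {p["price"] for p in publications if matches(cloaked, p)}
    return len(svs) >= l

pubs = [{"price": 15}, {"price": 40}, {"price": 40}]
cloaked = [(1, 100), (1, 60), (10, 50)]

n = {"price": 40}
print(is_k_filter_anonymous(n, cloaked, k=3))  # True: all 3 cloaked filters match 40
print(is_l_sv_diverse((1, 60), pubs, l=2))     # True: matching SVs are {15, 40}
```

The second check also shows why both models are needed: a cloaked filter such as (30, 60) would match only publications with SV 40, failing 2-SV diversity even if k-filter anonymity held.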

4.2 Cloaking Filters

After defining the filter privacy model, we leverage the poset data structure to generalize real filters. Before presenting the details of the generalization, we first highlight three auxiliary functions (in addition to the functions Succ, Pred, ImSucc, and ImPred already given in Section 2):

- N(f) returns the publications that match f. Thus, |N(f)| and ‖N(f)‖, respectively, denote the number of distinct publications and the number of distinct SVs in N(f).

- Suppose that F denotes a set of filters. Then, |N(F)| and ‖N(F)‖ return the total numbers of distinct publications matching any filter f ∈ F, and of distinct SVs associated with such publications.

- If Succ(f) returns the set of successors of f, then |N(Succ(f))| returns the total number of distinct publications matching any successor of f. Similar definitions apply to ‖N(Succ(f))‖, |N(ImSucc(f))|, and so on.

Based on the definitions above, we present two anonymization techniques w.r.t. a filter fi. The main idea behind the anonymization is to utilize the covering relation to generalize real filters fj, such that fj satisfies the k-filter anonymity and ℓ-SV diversity and is not exposed even under the collusion attack.

In detail, the first property is named immediate successor anonymization, as shown in Fig. 4a, where the filters correspond to the ones in Fig. 3b.

Property 1 (Immediate successor anonymization). For a filter fi, if |ImSucc(fi)| ≥ k and ‖N(fi)‖ ≥ ℓ hold, fi is used to generalize all filters fj ∈ ImSucc(fi).

In the property above, if a filter fi satisfies |ImSucc(fi)| ≥ k and ‖N(fi)‖ ≥ ℓ, the filters fj ∈ ImSucc(fi) are generalized by fi (i.e., fi is the cloaked filter for fj). For example, in Fig. 4a, for the filter f2, |ImSucc(f2)| = 3 and, among the three publications in Fig. 3, ‖N(f2)‖ = 2 (because n1 and n3 have the same SV 40). Now, we say f2 has the immediate successor anonymization satisfying 3-filter anonymity and 2-SV diversity, and f2 then generalizes all three of its immediate successors f5, f6, and f7.

Theorem 1. Immediate successor anonymization can protect the filters fj ∈ ImSucc(fi) in terms of the k-filter anonymity and ℓ-SV diversity, and defend against the collusion attack.

The proof of the above theorem is given in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.177.

If either |ImSucc(fi)| < k or ‖N(fi)‖ < ℓ occurs, the condition of Property 1 does not hold. Then, we relax the condition and obtain the second property.

Property 2 (Successor anonymization). If both |Succ(fi)| ≥ k and ‖N(fi)‖ ≥ ℓ hold, then fi is used to generalize all successors fj ∈ Succ(fi).

This property offers a higher filter anonymization level than the immediate successor anonymization does. For example, in Fig. 4b, Succ(f2) = {f5, f6, f7, f8} and ‖N(f2)‖ = 2. Then, we have successor anonymization with 4-filter anonymity and 2-SV diversity, and all successors in Succ(f2) are generalized by f2. Following Theorem 1, we similarly have:

Theorem 2. Successor anonymization can protect the filters fj ∈ Succ(fi) in terms of the k-filter anonymity and ℓ-SV diversity, and defend against the collusion attack.
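A minimal sketch of applying the two properties, assuming the per-filter successor sets and SV counts have been precomputed; the helper data below are invented, chosen to echo the f2 example of Figs. 3 and 4.

```python
# Hypothetical precomputed data for the f2 running example:
# ImSucc(f2) = {f5, f6, f7}, Succ(f2) adds f8, and ||N(f2)|| = 2
# (SVs 15 and 40). None of this comes from the paper's code.
im_succ = {"f2": {"f5", "f6", "f7"}}
succ    = {"f2": {"f5", "f6", "f7", "f8"}}
n_svs   = {"f2": 2}   # ||N(f2)||

def can_im_succ_anonymize(fi, k, l):
    """Property 1: |ImSucc(fi)| >= k and ||N(fi)|| >= l."""
    return len(im_succ[fi]) >= k and n_svs[fi] >= l

def can_succ_anonymize(fi, k, l):
    """Property 2 (relaxed): |Succ(fi)| >= k and ||N(fi)|| >= l."""
    return len(succ[fi]) >= k and n_svs[fi] >= l

print(can_im_succ_anonymize("f2", k=3, l=2))  # True: 3-filter anonymity, 2-SV diversity
print(can_im_succ_anonymize("f2", k=4, l=2))  # False: only 3 immediate successors
print(can_succ_anonymize("f2", k=4, l=2))     # True: Property 2 relaxes the condition
```

The three calls trace the section's logic: when Property 1 fails for a larger k, the relaxed Property 2 can still apply, at the cost of generalizing a larger successor set.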

5 MINIMIZING REDUNDANT PUBLICATIONS

In this section, we focus on minimizing the publication redundancy caused by filter anonymization.

5.1 Problem Statement

Recall that the essence of Properties 1 and 2 is to use a filter fi to generalize a set of (immediate) successors fj ∈ Succ(fi). However, the generalization meanwhile incurs publication redundancy. That is, when a filter fi generalizes fj ∈ Succ(fi), the subscriber specifying fj has to receive redundant publications, i.e., N(fi) ∖ N(fj). Therefore, the generalization involves a tradeoff between the filter anonymization level (i.e., the anonymity level and diversity level) and the number of redundant publications. For example, suppose fi is a root in the poset. We can use fi to generalize all its successors fj ∈ Succ(fi). Since the root filters are the most general ones, we achieve the highest filter anonymization level. However, the subscriber defining fj ∈ Succ(fi) has to receive the largest number of redundant publications. Thus, besides the privacy requirement, we want to minimize the forwarding cost, measured by the number of publications matching the cloaked filters fi.

Before presenting the defined problem to minimize the

forwarding cost, we first give the following observations:

- In the poset structure, a filter f might be the successor of multiple filters. For example, in Fig. 4a, f_6 is the successor of three filters f_1, f_2, and f_3. Thus, all such filters (f_1, f_2, and f_3) can be used to cloak f_6. However, each of them involves a specific number of redundant publications. Thus, choosing an appropriate filter to cloak f involves a specific forwarding cost.

- Suppose we use a (real) filter f_i to generalize the successors f_j ∈ Succ(f_i). The generalization benefits a set of successor filters: all filters f_j ∈ Succ(f_i) equally benefit from the generalization. Since we want to cloak all input filters, we thus need to consider the number of successor filters |Succ(f_i)|. When |Succ(f_i)| is larger, more filters benefit from the generalization.

In view of the observations, we have the following

problem:

Problem 1 (Prob_Ano). Among a ground set G of real filters, we select a subset S ⊆ G of filters, such that 1) each filter f_i ∈ S satisfies |Succ(f_i)| ≥ k and ‖N(f_i)‖ ≥ ℓ, 2) every filter f_j ∈ G is generalized by a filter f_i ∈ S such that f_j ∈ Succ(f_i) holds, and 3) the overall forwarding cost Σ_{i=1}^{|S|} [δ_i · |N(f_i)|] is minimized, where δ_i = 1 if f_i ∈ S generalizes any f_j ∈ G, and δ_i = 0 otherwise.

Unfortunately, for the Prob_Ano optimization problem, no efficient solution exists unless P = NP (the proof is given in the Appendix, available in the online supplemental material).

Theorem 3. Prob_Ano is NP-hard.

2648 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 11, NOVEMBER 2013

Fig. 4. Examples of cloaking filters: (a) immediate successor anonymization, and (b) successor anonymization.

5.2 Algorithm Design

Since the Prob_Ano problem is NP-hard, we consequently propose an approximation algorithm to solve it.

Intuitively, when the filters f_i ∈ S ⊆ G generalize all filters f_j ∈ G, we expect to anonymize all filters with the needed anonymization level for the least forwarding cost. For those filters f_i satisfying |Succ(f_i)| ≥ k and ‖N(f_i)‖ ≥ ℓ, we use a parameter w_i to define the unit cost of using f_i for the generalization of the filters f_j ∈ Succ(f_i). (Using w_i is consistent with the classic greedy algorithm [37] for the set-cover problem, so that we have the chance to use the least cost to generalize all filters f_j ∈ G):

w_i = |N(f_i)| / |Succ(f_i) \ R|.    (1)

In the equation above, R denotes the set of those filters that have already been generalized when f_i is currently selected as a member of S. Clearly, a lower ratio w_i means using a lower forwarding cost (i.e., a smaller |N(f_i)|) to generalize all filters f_j ∈ Succ(f_i) \ R. Therefore, we would like to choose the filters f_i with low w_i, instead of those with high w_i.

Based on the ratio w_i, we follow the set-cover greedy algorithm and design Algorithm 1 to solve the Prob_Ano problem. The basic idea of Algorithm 1 is to choose the filters f_i having a low w_i and use them to generalize all filters in Succ(f_i) \ R. The details of Algorithm 1 are as follows:

Algorithm 1. GREEDY_ANO (anonymity level k, diversity level ℓ, filters G)

Require: each filter is associated with a unique Id.

1: initiate a minimal heap H;
2: for all filters f_i ∈ G do
3:   add the pair ⟨f_i, w_i⟩ to heap H;
4: end for
5: initiate a set R = ∅ which is used to contain all resolved filters;
6: while a filter f_i ∈ G is still unresolved and H is not empty do
7:   pick the filter f_i associated with the pair popped from H;
8:   if f_i satisfies both |Succ(f_i) \ R| ≥ k and ‖N(f_i)‖ ≥ ℓ then
9:     generalize f_j ∈ Succ(f_i) \ R by f_i;
10:    add all filters f_j ∈ Succ(f_i) \ R to R, and mark f_j resolved;
11:    for each filter f_i appearing in the heap H do
12:      update the associated ratio w_i in H by |N(f_i)| / |Succ(f_i) \ R|;
13:    end for
14:  end if
15: end while
16: if there still exist ungeneralized filters f_i then
17:   require that f_i and its associated (k − 1) honest filters adopt the multiparty computation to hide the real filter f_i;
18: end if

First, we require that each real filter f is associated with a unique Id u(f). Brokers then leverage the Id to forward the matching publications toward the associated subscriber.

In lines 1-4, we use a minimal heap H to prestore the pairs ⟨f_i, w_i⟩. Line 5 initiates an empty set R which stores all the already resolved (i.e., generalized) filters.

During the while loop (lines 6-15), if there exists an unresolved filter and H is not empty, the head item f_i in H is fetched. The filter f_i has the currently smallest ratio w_i. If the fetched filter f_i satisfies the privacy requirement, i.e., |Succ(f_i) \ R| ≥ k and ‖N(f_i)‖ ≥ ℓ, all of the successors f_j ∈ Succ(f_i) \ R are generalized by f_i. Next, we mark such successors f_j resolved, and add them to the set R.

Besides, for those filters f_i still appearing inside H, we need to consider the updates of w_i (due to the updated R in line 10). That is, if f_j ∈ Succ(f_i) \ R in line 10 is a successor of a filter f_i still appearing inside H, we recompute w_i = |N(f_i)| / |Succ(f_i) \ R|, where R is the set newly updated in line 10.

If there still exist some unresolved filters f_i, the anonymization engine requires f_i together with its (k − 1) honest filters to adopt the multiparty computation [18], [39] for hiding the real filter f_i. For the multiparty computation, an example is using the filter ⊔f_i consisting of the union over all such k honest filters; all such k honest filters are then replaced with ⊔f_i. In this way, f_i is hidden inside a group of k honest filters, which are all associated with ⊔f_i. After that, the engine reruns Algorithm 1 to blur all input filters.

Finally, since each real filter f is associated with a unique Id u(f) and each unique Id is mapped to a real filter f (before the cloaking), Algorithm 1 then returns the Id u(f) with a unique cloaked filter f' (after the cloaking). Note that the cloaked filter f' is associated with at least k unique Ids, where k is the level of the k-anonymity requirement.
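As a minimal sketch (not the authors' implementation), the greedy loop of Algorithm 1 can be written with a lazily updated heap; succ, n_pubs, and sv_count are assumed dictionaries giving Succ(f), |N(f)|, and ‖N(f)‖, and unresolved filters map to None to signal the multiparty-computation fallback of line 17:

```python
import heapq

def greedy_ano(filters, succ, n_pubs, sv_count, k, ell):
    """Greedy sketch of GREEDY_ANO: repeatedly pop the filter with the
    smallest ratio w_i = |N(f_i)| / |Succ(f_i) \\ R| and let it generalize
    its not-yet-resolved successors, if the privacy requirement holds."""
    resolved = set()                      # the set R of generalized filters
    cloak = {}                            # real filter -> cloaking filter
    heap = [(n_pubs[f] / len(succ[f]), f) for f in filters if succ[f]]
    heapq.heapify(heap)
    while len(resolved) < len(filters) and heap:
        w, f = heapq.heappop(heap)
        uncovered = succ[f] - resolved
        if not uncovered:
            continue
        current_w = n_pubs[f] / len(uncovered)
        if current_w > w:                 # stale entry: re-push fresh ratio
            heapq.heappush(heap, (current_w, f))
            continue
        if len(uncovered) >= k and sv_count[f] >= ell:
            for fj in uncovered:          # generalize all uncovered successors
                cloak[fj] = f
            resolved |= uncovered
    for f in filters:                     # leftovers: multiparty computation
        cloak.setdefault(f, None)
    return cloak
```

Stale heap entries are re-pushed with their recomputed ratio instead of being updated in place, which is the standard lazy-update idiom for heapq and matches the ratio updates of lines 11-13.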

We give the privacy analysis as follows (the proof is given in the Appendix, available in the online supplemental material).

Theorem 4. The cloaked filter f_i generalized by Algorithm 1 satisfies the k-filter anonymity and ℓ-SV diversity, and can defend against the collusion attack between untrusted brokers and up to (N − k) compromised filters.

In terms of the approximation ratio, by the greedy algorithm for the set-cover problem, we have the following lemma.

Lemma 5. Let I be an instance of Prob_Ano, and OPT_Cost(I) be the optimal overall forwarding cost used to solve the instance I. The overall forwarding cost pertaining to Algorithm 1 is at most log N · OPT_Cost(I).

We note that there are possible improvements over Algorithm 1. For example, we might consider combinations of k filters working together to blur their associated successors at the least cost. Though this improvement might offer a better tradeoff between privacy protection and forwarding redundancy than Algorithm 1 does, it needs exponential space cost with respect to k. Nevertheless, we can still follow Algorithm 1 to define the weight for such combinations and find the most efficient combinations to blur filters, at the expense of the space cost to prestore the weight w.r.t. each combination.

5.3 Maintaining Dynamic Filters

After Algorithm 1 blurs N real filters, we need to maintain the anonymized poset P in terms of blurring a newly incoming filter f and removing an already cloaked filter f.


To blur a new filter f (Algorithm 2 gives the pseudocode), we first find those filters f_i satisfying f ∈ Succ(f_i) and the privacy requirements. Among such filters f_i, we find the one with the smallest weight w_i computed by (1), and then use it to blur f. In case no such filters are found, following line 17 of Algorithm 1, we adopt the multiparty computation for regeneralization.

Algorithm 2. INSERT_ANO (anonymized poset P, filter f)

1: find all filters f_i satisfying |Succ(f_i)| ≥ k, ‖N(f_i)‖ ≥ ℓ, and f ∈ Succ(f_i);
2: if such filters f_i exist then
3:   compute the weight w_i for each f_i;
4:   return the filter f_i associated with the smallest w_i;
5: else
6:   run line 17 of Alg. 1;
7: end if

Next, we consider how to remove an already cloaked filter f from the poset P (Algorithm 3 gives the pseudocode). Recall that Algorithm 1 maps the unique Id u(f) of each real filter f to a blurred filter f', and f' is associated with at least k unique Ids. For convenience, we denote by U(f') the set consisting of all unique Ids associated with f'. To remove filter f, with the help of u(f), we first find the cloaked filter f' ∈ P associated with the Id u(f), and remove the Id u(f) from the set U(f'). After the removal, if the set U(f') contains fewer than k items, we find a filter f_i (line 4), such that the (real) filters having unique Ids inside U(f') are reblurred by f_i.

Reblurring the real filters ensures that the set U(f_i) contains at least k items (consistent with Algorithm 1). To this end, we first find all filters f_i satisfying f_j ∈ Succ(f_i), where f_j is any filter having u(f_j) ∈ U(f'). If such filters f_i exist, we then find the one f_i having the smallest w_i and let it blur the filters f_j having u(f_j) ∈ U(f'). This means that all items in U(f') are moved to the set U(f_i). Since the set U(f') now becomes empty, we remove the filter f' from P. Therefore, for every filter f_i ∈ P, the set U(f_i) contains at least k items. Otherwise, if no such filters f_i are found, we again adopt the multiparty computation for regeneralization.

Algorithm 3. DELETE_ANO (anonymized poset P, filter f)

1: find the filter f' ∈ P that is associated with u(f);
2: remove u(f) from the set U(f');
3: if the set U(f') now contains fewer than k items then
4:   find all filters f_i satisfying (i) f_j ∈ Succ(f_i) for every filter f_j with u(f_j) ∈ U(f'), (ii) |Succ(f_i)| ≥ k, and (iii) ‖N(f_i)‖ ≥ ℓ;
5:   if such filters f_i are found then
6:     compute the weight w_i for each f_i;
7:     find the f_i with the least w_i; the found f_i then blurs the filters f_j with u(f_j) ∈ U(f');
8:   else
9:     require f' to adopt the multiparty computation for regeneralization;
10:  end if
11: end if

Privacy analysis. For Algorithms 2 and 3, we have the following result (the proof is given in the Appendix, available in the online supplemental material):

Theorem 6. The filter maintenance algorithms satisfy the k-filter anonymity and ℓ-SV diversity, and defend against the collusion attack.

Optimizations. Despite the fact that the above algorithms meet the privacy requirement, they introduce efficiency degradation. For example, suppose that Algorithm 2 has already inserted a set of filters (denoted by F). Next, Algorithm 2 inserts another two filters f_1 and f_2, respectively. Following Algorithm 2, the cloaked filter for f_1 (denoted by f'_1 ∈ F) always belongs to a filter inside F. However, after f_2 arrives, f_2 could be a more efficient cloaked filter for f_1 than f'_1. In this case, Algorithm 2 does not offer a cost-efficient solution.

To overcome the above issue, we propose batch insertion and deletion of filters. That is, the anonymization engine first uses the above algorithms (Algorithms 2 and 3) to maintain the poset P for the dynamic filters. Then, after every batch of operations, the anonymization engine uses Algorithm 1 to reblur all filters to form a new poset P. Such batch algorithms are similarly adopted by many previous works, such as [38].

6 INFERENCE ATTACK RESILIENCE

The generalization of filters Succ(f) by f is only part of the story; one also needs to consider the resilience of the anonymization against the so-called replay attack. In this section, we first give the definition of the replay attack and then analyze the resilience of the proposed algorithm.

6.1 Replay Attack

Briefly, the replay attack leverages some prior knowledge to rerun Algorithm 1 to expose honest filters, i.e., to identify whether the correlation between a specific honest filter f_j ∈ Succ(f_i) \ R and a blurred filter f_i holds with probability higher than 1 / |Succ(f_i) \ R|.

In detail, attackers first know the details of Algorithm 1. Next, the prior knowledge also includes background knowledge, such as the distribution of SVs in publications and the distribution of predicate conditions in filters. Many works have shown that the distributions of publications and filters are skewed. In addition, the collusion attack in Section 3.1 assumes that there are (N − k) compromised filters and k honest filters. In summary, attackers have the knowledge K:

1. the N cloaked filters F' generated by the anonymizer engine (due to the untrusted broker),

2. the (N − k) real filters from compromised filters (due to the collusion attack),

3. the skewed distributions of subscriptions and publications, and

4. the details of Algorithm 1.

Given the knowledge K, attackers try to reveal the remaining k honest filters by the replay attack (we denote the associated k filters by f_i with 1 ≤ i ≤ k). For a specific filter f_i, we define the estimation confidence conf[f_i | f'_i, K] as


the probability that attackers can reveal f_i based on its cloaked filter f'_i and the knowledge K.

In the replay attack, attackers rerun Algorithm 1 with N input filters, among which (N − k) filters come from compromised filters and the other k filters are generated by following the skewed distribution of filters; the latter must be covered by the corresponding cloaked filters (those k generated filters are denoted by f*_i and satisfy f*_i ⊑ f'_i). With such N input filters, Algorithm 1 outputs a set of N cloaked filters (denoted by F*'). If F*' is exactly the same as the set F' output by the anonymizer engine, then with high confidence, attackers can infer that their estimated filters f*_i pertaining to the k honest filters f_i are correct, and the honest filters f_i are exposed.

In case not all filters f*'_i ∈ F*' are the same as the corresponding filters f'_i ∈ F', the estimated filters f*_i are not exactly identical to the real filters f_i. Then, attackers can compute the estimation confidence that f*_i is generalized to f'_i, conf[f'_i | f*_i, K], by the Jaccard similarity between Succ(f*'_i) and Succ(f'_i). That is,

conf[f'_i | f*_i, K] = |Succ(f*'_i) ∩ Succ(f'_i)| / |Succ(f*'_i) ∪ Succ(f'_i)|.

Using Succ(f'_i) (resp. Succ(f*'_i)) to compute conf[f'_i | f*_i, K] makes sense because f'_i (resp. f*'_i) is the generalization of Succ(f'_i) (resp. Succ(f*'_i)). The larger conf[f'_i | f*_i, K] is, the more likely it is that f*'_i is the same as f'_i.

Since conf[f'_i | f*_i, K] directly shows the confidence that the estimated filter f*_i can be generalized to f'_i, we can compute conf[f_i | f'_i, K], the confidence of inferring the real filter f_i from the cloaked filter f'_i and the prior knowledge K, as follows:

conf[f_i | f'_i, K] = conf[f'_i | f*_i, K] / Σ_{j=1}^{k} conf[f'_j | f*_j, K].

Clearly, a larger conf[f_i | f'_i, K] indicates higher confidence in the estimated filter, and f_i is exposed as the estimated filter f*_i with higher probability. Given the k honest filters, attackers can correspondingly compute the k values of conf[f_i | f'_i, K]. Among these computed values, the largest one, denoted by confmax[1:k][f_i | f'_i, K], exposes the corresponding real filter f_i by its estimated filter f*_i with the highest potential.

6.2 Resilience Analysis

Next, we proceed to analyzing the resilience of Algorithm 1

and a counterpart (namely a random approach) with

respect to the replay attack.

Theorem 7. Algorithm 1 achieves resilience to the replay attack comparable to that of the random approach.

The analysis above gives insight into how Algorithm 1 defends against the replay attack. Section 8 empirically verifies the attack resilience of Algorithm 1 and shows that Algorithm 1 provides almost the same resilience as the random approach.

7 DISTRIBUTED SOLUTION ON A CLUSTER OF

MACHINES

In this section, we extend our solution to a distributed content-based pub/sub running on a cluster of C commodity machines. In this distributed version, each machine acts as a broker. Pub/sub on such a cluster has recently become popular to offer scalable, high-throughput, and parallel services for cloud computing [1], [4], [30].

For the content-based pub/sub on clustered machines, we protect filters in terms of the k-filter anonymity and ℓ-diversity models. In addition, due to the existence of multiple brokers, we extend the original collusion attack as follows:

Definition 5 (Distributed collusion attack). We assume all C brokers are untrusted. There are at least k honest subscribers (including the one defining f). Then, the remaining up to (N − k) compromised subscribers and the C untrusted brokers collude together against the subscriber defining f.

To offer the privacy protection, we still follow the previous technique of using the anonymization engine. In case of machine failure, we could use multiple physical machines to act as the anonymization engine for high availability. The privacy-aware distributed pub/sub works as follows. First, publishers announce publication advertisements to the anonymization engine (the same as in the case of a single broker). Second, similar to the case of a single broker, a subscription request, containing a filter f, is first sent to the anonymization engine. The engine then blurs f to generate a cloaked filter f' meeting the privacy requirement, and sends the blurred filter f' to a broker machine R(f) (we give the algorithm to blur the filters and find the machine R(f) shortly). The broker machine R(f) then registers f' to index the filters f'. For all locally registered filters, the machine R(f) similarly builds a poset by Algorithm 1 (and maintains the poset by Algorithms 2 and 3). After that, a publication is forwarded to those brokers that register (cloaked) filters f' satisfied by the publication. The forwarding is enabled by a redirector machine that maintains the pairs ⟨R_i, F_i⟩ for all broker machines R_i (1 ≤ i ≤ C). Here, F_i is the union of all filters f in R_i, and F_i ⊒ f holds. In this way, R_i receives only those publications that satisfy F_i. Such a redirector machine is commonly used in cluster environments [20], [14].

The challenge in designing a privacy-aware solution on clustered machines still involves the tradeoff between privacy protection and forwarding efficiency. First, as a straightforward solution to blur a real filter f, the anonymization engine might randomly choose a machine for f (together with the (k − 1) honest filters associated with f). Then, based on the filters already registered at the chosen machine, the anonymization engine blurs f to satisfy the privacy requirement using the techniques of Algorithm 1 or Algorithm 2. This random approach offers strong privacy protection. However, the problem is the high publication cost: all C machines may contain filters satisfying each publication, incurring C copies per publication.

To overcome this issue, we propose an approach to assign each filter f (together with its associated (k − 1) honest filters) to a specific machine R(f), such that all filters satisfied by a publication are assigned to as few machines as possible. Thus, the number of forwarded copies of each publication is reduced. After that, for the filters f assigned to each machine R(f), we use Algorithms 1 and 2 to blur them.

To design an assignment algorithm for the above purpose, we define a metric to measure the similarity sim(f_i, f_j) between a pair of filters f_i and f_j as follows:

sim(f_i, f_j) = |N(f_i ⊓ f_j)| / |N(f_i ⊔ f_j)|.    (2)

The above metric sim(f_i, f_j) indicates how similar the publications matching the filters f_i and f_j are. If filters having high similarity are assigned to the same machines, we have a better chance to reduce the number of publication copies. Given a set R of C machines (i.e., |R| = C), the assignment of the filters G across R (with C ≥ 3) is a graph partition problem. Due to the known NP-hardness result, we propose a heuristic solution (Algorithm 4) as follows:

Algorithm 4. ASSIGN_ANO (machines R, filters G)

1: initiate a maximal heap H;
2: for each pair of filters f_i and f_j inside G (with 1 ≤ i < j ≤ |G|) do
3:   add the similarity sim(f_i, f_j), together with the pair ⟨f_i, f_j⟩, to heap H;
4: end for
5: while (there still exists an unassigned filter) and (H is not empty) do
6:   fetch the pair ⟨f_i, f_j⟩ from H;
7:   if neither f_i nor f_j is assigned then
8:     assign f_i and f_j, together with their associated honest filters, to a random machine;
9:   else if one of the filters, say f_i, is unassigned, but f_j is assigned then
10:    assign f_i, together with its associated honest filters, to the same machine as f_j;
11:  end if
12: end while
13: for each machine R_i ∈ R (with 1 ≤ i ≤ |R|) do
14:   apply Alg. 1 and Alg. 2 to blur the filters assigned to R_i;
15: end for

Algorithm 4 gives the pseudocode to assign the filters G across the machines R. First, we use a maximal heap to prestore the similarity sim(f_i, f_j), together with the pair ⟨f_i, f_j⟩. Next, the algorithm fetches the pairs of filters having high similarity and assigns them (together with their associated honest filters) to the same machines (lines 6-11). Among filters of equally high similarity, we break the tie by assigning a filter f together with its (k − 1) honest filters to the same machine. This ensures that the privacy requirement is met. After that, Algorithms 1 and 2 blur the filters assigned to each machine (lines 13-15). Following the previous sections, Algorithms 1 and 2 protect the privacy of f and defend against the collusion attack without exposing f. Meanwhile, we leverage the precomputed similarity sim(f_i, f_j) to assign filters having high similarity to the same machines. In this way, we expect that all filters satisfying a given publication are distributed on as few machines as possible.

Privacy analysis. Each clustered machine does host highly similar filters. However, the machine meanwhile contains filters having diverse interests (including the at least k honest filters, and the filters randomly assigned by line 8 of Algorithm 4). Thus, under the collusion attack, the exact interests of the honest filters, which are generalized by Algorithms 1 and 2, remain indistinguishable. In particular, on each clustered machine, we have the same scenario as in the case of a single broker, which has been shown resilient to the collusion and replay attacks (Section 6).

8 EVALUATION

In this section, we evaluate the proposed centralized and distributed solutions, respectively.

8.1 Evaluation of Centralized Solution

In this section, we first give the experimental settings, and

then report the evaluation results in terms of the forwarding

cost and the attack resilience.

8.1.1 Experimental Settings

During the evaluation, we are interested in how the proposed algorithms (i.e., Algorithms 1-3) achieve the tradeoff between the anonymization level and the forwarding cost. Therefore, the experiments are designed around the following three aspects:

- Cost efficiency. The proposed algorithm should not incur excessive forwarding cost, measured by the ratio between the total number of publications when the privacy protection is applied and when it is not. This ratio is called the cost ratio. A larger ratio (≥ 1) indicates more cost used to protect the privacy and, thus, more overhead.

- Attack resilience. The applied privacy protection should be robust enough to defend against malicious replay attacks. That is, it should be difficult for attackers to identify (real) filters defined by subscribers with high certainty. We use an entropy value to measure the resilience; a larger entropy indicates more resilience.

- Maintenance. When considering the maintenance of the poset under dynamic filters by Algorithms 2 and 3, we measure the average running time.
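The entropy metric can be computed as below; resilience_entropy is our name, and the input is a normalized confidence distribution such as the conf[f_i | f'_i, K] values of Section 6:

```python
import math

def resilience_entropy(confidences):
    """Shannon entropy (in bits) of the attacker's normalized confidence
    distribution over candidate filters; higher entropy means the attacker
    is less certain which real filter hides behind a cloaked one."""
    return -sum(p * math.log2(p) for p in confidences if p > 0)
```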

We compare the proposed algorithm with the following three counterparts:

- Random approach. Following Section 6.2, we use the random approach due to its robust resilience against replay attacks [38].

- Predecessor approach. For each real filter f, we use a predecessor f_i ∈ pred(f) to blur f. The chosen predecessor satisfies the privacy requirement and has the smallest publication cost, i.e., |N(f_i)|. We are interested in how this approach compares with the proposed algorithm (i.e., Algorithm 1).


- Simple approach. For each real filter f, we simply use the most general filter (i.e., a root filter in the poset) satisfying the privacy requirement to blur f. The simple approach is inefficient in publication cost, but offers the most robust attack resilience.

Following the previous works [13], [7], we use the Zipf

distribution to generate filters and publications (Table 1

summarizes the parameters used in our experiments):

- Without loss of generality, we use a numeric field to set up the SA inside the domain range [0.0, 1.0].

- To define the predicate interval I of a (real) filter f, we generate its middle point I_m and half length I_{1/2}. In detail, we first follow the Zipf distribution to generate I_m inside [0.0, 1.0]. Next, we follow the Zipf distribution to generate the length 2 · I_{1/2} inside [0.0, H], where H is the smaller of I_m and (1 − I_m), so that the interval I lies inside [0.0, 1.0].

- To define publications, we follow the Zipf distribution to generate the SV m inside [0.0, 1.0]. Moreover, the number of distinct SVs, denoted by ‖N‖, varies from 100 to 10,000.

- During each experiment with a specific k-anonymity level, we randomly group the generated filters, such that each group has at least k filters serving as honest filters for the multiparty computation.
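The filter-generation procedure above can be sketched as follows; zipf_unit discretizes [0, 1] into ranked bins with Zipf weights, which is one plausible reading of the setup (the paper does not specify its discretization):

```python
import random

def zipf_unit(alpha, bins=1000, rng=random):
    """Sample a value in [0, 1) from a discretized Zipf(alpha) distribution:
    bin r (r = 1..bins) gets probability proportional to 1 / r**alpha."""
    weights = [1.0 / r ** alpha for r in range(1, bins + 1)]
    r = rng.choices(range(bins), weights=weights)[0]
    return (r + rng.random()) / bins          # jitter inside the chosen bin

def gen_filter(alpha, rng=random):
    """One predicate interval: Zipf-distributed midpoint I_m in [0, 1], then
    a Zipf-distributed length in [0, H] with H = min(I_m, 1 - I_m), so that
    the interval stays inside [0, 1]."""
    mid = zipf_unit(alpha, rng=rng)
    length = zipf_unit(alpha, rng=rng) * min(mid, 1.0 - mid)
    return (mid - length / 2, mid + length / 2)
```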

8.1.2 Cost Efficiency

For the parameters given in Table 1, we vary their values and study their effects on the cost paid to protect the filter privacy in Figs. 5i-5viii.

We study the effect of the k-anonymity level and the ℓ-diversity level in Figs. 5i, 5ii, and 5iii. As shown in Fig. 5i (the effect of the k-anonymity level), the simple approach incurs the largest cost ratio (around 4.121), which does not change even as the anonymity level increases. This is because the simple approach always uses the most general filters (root filters) to blur real filters, regardless of the anonymity level. Moreover, a larger k-anonymity level leads to a higher cost ratio for all four schemes, and the ratio gradually reaches 4.121 (i.e., the cost ratio of the simple scheme) once the k-anonymity level becomes large enough (≥ 50). This is because a larger k-anonymity level means more general filters are used to cloak real filters. Intuitively, it means those filters inside the root area of the poset


TABLE 1. Parameters Used in Experiments

Fig. 5. Cost efficiency: (i-viii), and attack resilience (ix-xii).

structures are used to generalize real filters, incurring high cost ratios for all schemes. Finally, the proposed greedy scheme incurs a lower cost ratio than the predecessor scheme does. For example, when the k-anonymity level is equal to 2, the cost ratios of the greedy and predecessor schemes are 1.37 and 1.43, respectively. This is because the proposed greedy scheme uses a filter f to blur a set of successors Succ(f), and each of these successors benefits from the generalization (see the second observation in Section 5). The predecessor scheme does not achieve the same benefit.

Second, Fig. 5ii shows the effect of the ℓ-diversity level. A larger ℓ-diversity level incurs a higher cost ratio for all schemes. However, compared with Fig. 5i, the growth is relatively smooth. This is because in this experiment, the default k-anonymity level of 10 is already enough to make the filters cover diverse SVs (satisfying a larger ℓ-diversity level) without incurring a significantly larger number of publications.

Third, Fig. 5iii studies which of the k-anonymity level and the ℓ-SV diversity level is the major parameter for the proposed greedy scheme. From this figure, we observe that for a fixed k-anonymity level, the cost ratio growth trend is relatively smooth, whereas for a fixed ℓ-diversity level, the cost ratio grows much more significantly. Therefore, we verify that the k-anonymity level is the major parameter leading to higher cost ratios for the greedy algorithm.

In Fig. 5iv, to study the effect of the number of distinct SVs ‖N‖, we vary ‖N‖ from 10 to 1,000 (because ℓ is 10 by default, we require ‖N‖ ≥ 10 to enable the privacy protection). As shown in this figure, when ‖N‖ = ℓ = 10, all schemes use the most general filters to blur filters and have the largest ratios. After that, the ratios decrease gradually, because with more distinct SVs, the schemes have more options to select the most economical filters for cloaking. In addition, we purposely set ℓ = 10 and ‖N‖ = 100 by default in our experiments to ensure all filters can be cloaked. In this way, we can study how these schemes work without resorting to the multiparty computation.

Next, Fig. 5v shows the effect of the number of filters. This figure indicates that given a larger number of filters, the greedy, predecessor, and random schemes save more publishing cost. For example, when the filter count is 100,000, the greedy scheme saves 69.4 percent of the cost compared with the random scheme, while when the filter count is 10,000, the saving is only 20.2 percent. For a larger number of filters, it is thus predictable that the greedy scheme saves more cost than the random approach: given more filters, the greedy algorithm has more options to select the most economical filter to cloak successor filters, saving more forwarding cost. This result is particularly useful for the proposed greedy scheme to meet the scalability requirement.

In Fig. 5vi, we are interested in how the Zipf parameter α affects the forwarding cost (we fix the Zipf parameter used to generate publications and vary the Zipf parameter used to generate subscriptions). In this figure, neither the uniform nor the most skewed distribution (i.e., α equal to 0.0 and 2.0, respectively) is associated with the largest cost ratio. In detail, when α = 0.0, the predicate intervals in filters are uniformly distributed; a generated filter f and its successors Succ(f) (i.e., f is used to blur such successors) all have similar filter lengths, indicating that the generalization does not incur too much extra forwarding cost. Next, for α = 2.0, the predicate intervals in filters are very skewed, i.e., most of them have a very narrow width and only a very few have a large width. This means that the majority of filters can be blurred by a relatively general filter, so there are opportunities to further reduce the forwarding cost.

After that, Fig. 5vii shows the effect of the number of publications. Since we fix the number of distinct SVs, a larger number of publications does not significantly increase the cost ratio for the three schemes. This also helps the greedy scheme achieve high scalability for today's applications with large numbers of publications.

Finally, in Fig. 5viii, we measure the average running time of the simple, random, predecessor, and greedy schemes, and of the insertion algorithm (Algorithm 2), to blur a filter. Since the first four algorithms (the simple, random, predecessor, and greedy schemes) operate on a set of given filters, we first measure the overall time to cloak the filters and then compute the average time per filter. For the deletion algorithm (Algorithm 3), we measure the average time per deletion operation. This figure indicates that the greedy scheme uses the most time, since it needs to precompute the weight w_i for all filters and uses the heap H. In addition, the predecessor and insertion schemes traverse the poset structure and also need considerable time.

8.1.3 Attack Resilience

We proceed to evaluate the resilience of the studied algorithms against the replay attack. That is, we first use N real filters as the input to generate N cloaked filters. These cloaked filters, denoted by F′, are treated as the output of the anonymizer engine. Next, among the N real filters, we randomly pick (N − k) filters as compromised filters, and the remaining k as honest filters. Following the replay attack in Section 6.2, we have the prior knowledge K and estimate k filters f* based on the known Zipf distribution. The estimated filters satisfy f*_i ⊑ f′_i, where f′_i ∈ F′ are the cloaked filters with respect to the honest filters f_i. Next, we follow Section 6.2 to rerun the proposed algorithms (Algorithms 1, 2, and 3) and output N cloaked filters F*′. Over 200 rounds of replay attacks, we compute the entropy value of conf_max[1:k][f_i | f′_i, K]. Similarly, we conduct the replay attack for the simple and random schemes and compute the same entropy value.
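As a rough illustration of the entropy metric used here (not the paper's exact conf_max[1:k] aggregation), the sketch below computes the Shannon entropy of an attacker's confidence distribution over the candidate real filters behind one cloaked filter; a higher entropy means a less certain linking and thus stronger resilience. All names are assumptions:

```python
# Illustrative sketch: Shannon entropy of an attacker's confidence
# distribution over candidate real filters behind a cloaked filter.
import math

def linking_entropy(confidences):
    """Entropy (in bits) of a confidence vector after normalization."""
    total = sum(confidences)
    entropy = 0.0
    for c in confidences:
        p = c / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

# Four equally plausible real filters: maximal uncertainty (2 bits).
print(linking_entropy([1, 1, 1, 1]))
# A near-certain linking: entropy close to 0 bits.
print(linking_entropy([0.97, 0.01, 0.01, 0.01]))
```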

By varying the k-anonymity level, the ℓ-SV diversity level, and the number of filters, we plot the aforementioned entropy results of all three schemes in Figs. 5x, 5xi, and 5xii. We find the following:

. The simple scheme achieves the largest entropy value, indicating the strongest attack resilience.

. By comparing the greedy and random schemes, we find that the two schemes have comparable entropy results. This is because the greedy scheme uses a filter f to cloak a set of successors Succ(f) (as shown in Section 6). Thus, when all such successors are cloaked by the same filter f, even if some of them are compromised, it is still difficult to expose the remaining successors, and the greedy scheme achieves resilience comparable to the random scheme.

. Differing from the greedy scheme, the predecessor scheme always uses a specific predecessor pred(f) to cloak a filter f. Thus, a specific predecessor pred(f) blurs a unique filter f, incurring a less diverse distribution of the generated filters and a lower entropy value.

. Finally, a larger number N of filters indicates higher attack resilience, because we have more options to choose filters to cloak the associated successors.

8.2 Evaluation of Distributed Solution

In this section, we vary the number of clustered machines and study the performance of Algorithm 4 in terms of the publication cost and the attack resilience:

. For the publication cost, we measure the average number of publication copies that each machine receives.

. To measure the attack resilience, we compute the entropy of the linkability for all machines. Among all the computed entropy values, we then use the minimal entropy to measure how attack resilient the distributed solution is (because the filters on the machine associated with the minimal entropy are the most likely to be exposed).
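The "weakest link" aggregation above can be sketched as follows; this is an illustration under assumed names, not the paper's implementation: each machine contributes one linking-entropy value, and the deployment is scored by the minimum:

```python
# Illustrative sketch: score a distributed deployment by the minimum
# per-machine linking entropy (the weakest machine). Machine names and
# confidence vectors below are made up.
import math

def entropy_bits(confidences):
    total = sum(confidences)
    return -sum((c / total) * math.log2(c / total)
                for c in confidences if c > 0)

def deployment_resilience(per_machine_confidences):
    """Minimum entropy over all machines; lower means a weaker deployment."""
    return min(entropy_bits(conf) for conf in per_machine_confidences)

machines = {
    "m1": [1, 1, 1, 1],  # 4 equally plausible linkings: 2.0 bits
    "m2": [1, 1],        # 2 equally plausible linkings: 1.0 bit
}
print(deployment_resilience(machines.values()))  # prints 1.0
```

Taking the minimum rather than the average reflects the observation in the text: an attacker targets the machine whose filters are easiest to expose.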

We compare Algorithm 4 (called the similarity approach) with a random approach that evenly assigns filters across the machines and then blurs the assigned filters (by Algorithms 1-2).

First, Fig. 6i plots the cost efficiency when the number of clustered machines varies from 2 to 50. For both schemes, more clustered machines lead to a smaller number of publications per machine. Moreover, because the similarity scheme clusters similar filters on the same machines, it spends fewer publications than the random scheme does.

Second, Fig. 6ii studies the distribution of the publication cost across the 20 clustered machines, where we sort the publication cost per machine in descending order. Though the random scheme achieves more balanced workloads than the similarity scheme, the similarity scheme does not lead to a significantly skewed distribution. For example, for the similarity scheme, the smallest publication cost is as high as 90.57 percent of the heaviest cost.

Finally, Fig. 6iii measures the attack resilience by varying the number of machines from 2 to 50. As shown in this figure, more machines lead to smaller entropy values: given more machines, the number of assigned filters on each machine becomes smaller, leading to a lower entropy value, similar to Fig. 5xi. Nevertheless, even with 50 machines, the entropy value of the similarity scheme, 1.583, is very close to that of the random scheme, 1.603. This is consistent with Figs. 5x, 5xi, and 5xii, because we use the proposed greedy scheme to cloak the assigned filters of each machine.

9 RELATED WORKS

In this section, we first review the privacy models. Next, we show their usage in location-aware mobile computing. After that, we investigate security in pub/sub systems, and finally discuss our solution in relation to these related works.

Privacy-preserving data publishing. In the recently popular privacy-preserving data publishing area, k-anonymity [35], [31], [17], [5] and ℓ-diversity [21] are two popular privacy-preserving models that defend against the two so-called attacks of record linkage and attribute linkage, respectively. In such models, a Quasi-Identifier (QID) is a set of attributes that could potentially identify record owners, and SAs consist of sensitive person-specific information such as disease, salary, and disability status. The k-anonymity model ensures that the minimum group size on QID is at least k. Next, the ℓ-diversity principle requires every QID group to contain at least ℓ "well-represented" SVs; for example, the simple case requires at least ℓ distinct values for the SA in each QID group (i.e., distinct ℓ-diversity), and a stronger form is entropy ℓ-diversity. In addition, related work also includes privacy-preserving anonymization of set-valued data [36], and so on.
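As a minimal illustration of these two definitions (not the paper's algorithm), the sketch below checks a toy anonymized table for k-anonymity and distinct ℓ-diversity; all names and data are made up:

```python
# Illustrative sketch: check k-anonymity and distinct l-diversity.
# Records are (qid_tuple, sensitive_value) pairs; the data is invented.
from collections import defaultdict

def satisfies_k_anonymity_and_l_diversity(records, k, l):
    groups = defaultdict(list)
    for qid, sv in records:
        groups[qid].append(sv)
    # Every QID group must have size >= k and >= l distinct sensitive values.
    return all(len(svs) >= k and len(set(svs)) >= l
               for svs in groups.values())

table = [
    (("zip-021*", "age-20s"), "flu"),
    (("zip-021*", "age-20s"), "cold"),
    (("zip-021*", "age-20s"), "flu"),
    (("zip-105*", "age-30s"), "asthma"),
    (("zip-105*", "age-30s"), "flu"),
    (("zip-105*", "age-30s"), "cold"),
]
print(satisfies_k_anonymity_and_l_diversity(table, k=3, l=2))  # True
print(satisfies_k_anonymity_and_l_diversity(table, k=4, l=2))  # False
```

Each QID group above holds three records (so 3-anonymity holds but 4-anonymity fails) and at least two distinct sensitive values (so distinct 2-diversity holds).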

Location privacy protection. Location anonymization refers to a location information transformation process that perturbs the exact location of a mobile client into a cloaked location box that meets the given location privacy metrics. The k-location anonymity technique uses a cloaked region to represent the client location, and this region needs to contain at least (k − 1) other mobile clients [17]. Well-known techniques for location anonymity are spatial cloaking [9] and transformation-based matching. For example, spatial cloaking enlarges the user location q into a cloaked region Q′ in a way that prevents the reconstruction of q from Q′. A server returns points of interest (POI) to a client using the more general Q′, and the client has to prune the set to find the


Fig. 6. Study of the distributed solution.

interesting elements. Next, to defend against attackers using background knowledge [21], location ℓ-diversity has been proposed to ensure that at least ℓ (> 1) different geographical (or postal) addresses are associated with the location upon release. An enhancement is s-diversity [38], which requires at least s (> 1) different road segments to be associated with the location upon release.

Privacy in Pub/Sub and distributed systems. A number of

cryptographic security services have been proposed for publish/subscribe, for example, PSGuard [34]. However, as shown in the introduction, the collusion attack between untrusted brokers and compromised subscribers can incur the privacy leakage of honest filters.

The authors of [10], [15] study the protection of publisher privacy, instead of filter privacy, and [27] protects the publication confidentiality. Similar to our work, Choi et al. [8] assume that the brokers are curious and may lead to the leakage of filter privacy, and use a cryptographic technique to protect publication confidentiality and subscription privacy. With the same goal as [8], the recent work [24] uses the Paillier homomorphic cryptosystem to address the shortcomings in [8], [27], such as inaccurate content delivery and false positives. Nevertheless, [8], [24] do not consider the collusion between curious brokers and compromised subscribers. Instead, under the strong collusion attack and replay attack, we adopt the k-anonymity and ℓ-diversity privacy models for subscription privacy protection, meanwhile with the goal of minimizing the publication forwarding redundancy. We note that the techniques in [8], [24] are applicable to our work to protect publication confidentiality.

Distributed systems, such as Mist [3], support user anonymity. Mist routes a message through a network with the purpose of keeping the location of the sender hidden from intermediate devices. The system consists of a number of routers, known as Mist routers, ordered in a hierarchical structure. However, systems such as Mist do not address issues pertaining to data semantics. Finally, in P2P networks, the classic work [16] proposes to use redundant routing to hide package destinations, because each of the middle nodes in the routing path could be the real destination.

Discussions. Following the review of the literature, enforcing privacy protection typically requires the degradation of efficiency or precision. For example, privacy-preserving data publishing generalizes exact values of QID attributes. Privacy-aware location-based services blur exact locations with enlarged regions, and users have to prune useless query results w.r.t. such enlarged regions [9], leading to more false positives. To protect publisher privacy [10], each of the k honest publishers has to receive all queries that are answered by such k honest publishers, incurring extra workloads. Freedman and Morris [16] protect the destination privacy at the cost of more redundant routing hops (thus more network traffic). In this paper, in view of the publication redundancy, we therefore propose solutions to minimize the publication forwarding redundancy.

10 CONCLUSIONS AND FUTURE WORK

In this paper, we note that the current content-based pub/sub, even with encrypted publications, cannot defend against the collusion attack. Thus, we introduce an anonymizer engine that is responsible for the cloaking of filters. We next present the filter anonymization model by adapting k-anonymity and ℓ-diversity to the content-based pub/sub. By leveraging partial-order-based generalization of filters, we design centralized and distributed algorithms to blur filters for the filter anonymization model and meanwhile to minimize the forwarding cost. Our experimental results verify that the proposed solution scales well with the numbers of filters and publications, and achieves resilience comparable to the random approach.

As to future work, we are considering stronger privacy protection models [11], and we also plan to plug the privacy model into other forms of semantic filtering approaches, such as the keyword-based approach [29].

ACKNOWLEDGMENTS

This work was supported by the Academy of Finland, grant numbers 135230 and 139144, and in part by the Hong Kong RGC NSFC Joint Funding Project N_HKUST612/09, the National Grand Fundamental Research 973 Program of China under Grant 2012-CB316200, Microsoft Research Asia Grant MRA11EG05, HKUST RPC Grant RPC10EG13, China NSFC (No. 61103006), Shanghai Committee of Science and Technology Grant (No. 12510706200), and the Fundamental Research Funds for the Central Universities (Tongji University). Part of this work was done when the first author was at the Department of Computer Science, University of Helsinki, Finland.

REFERENCES

[1] "Apache Kafka: A High-Throughput Distributed Messaging System," http://incubator.apache.org/kafka/, 2013.

[2] http://tracehotnews.com/sony-admitted-psns-70-million-users-information-leakage, 2013.

[3] J. Al-Muhtadi, R.H. Campbell, A. Kapadia, M.D. Mickunas, and S. Yi, "Routing through the Mist: Privacy Preserving Communication in Ubiquitous Computing Environments," Proc. 22nd Int'l Conf. Distributed Computing Systems (ICDCS), pp. 74-83, 2002.

[4] L.A. Barroso, J. Dean, and U. Holzle, "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003.

[5] R.J. Bayardo and R. Agrawal, "Data Privacy through Optimal k-Anonymization," Proc. 21st Int'l Conf. Data Eng. (ICDE), 2005.

[6] A. Carzaniga, D.S. Rosenblum, and A.L. Wolf, "Design and Evaluation of a Wide-Area Event Notification Service," ACM Trans. Computer Systems, vol. 19, no. 3, pp. 332-383, 2001.

[7] A. Carzaniga and A.L. Wolf, "Forwarding in a Content-Based Network," Proc. ACM SIGCOMM '03, pp. 163-174, Aug. 2003.

[8] S. Choi, G. Ghinita, and E. Bertino, "A Privacy-Enhancing Content-Based Publish/Subscribe System Using Scalar Product Preserving Transformations," Proc. Int'l Conf. Database and Expert Systems Applications (DEXA), pp. 368-384, 2010.

[9] C.-Y. Chow, M.F. Mokbel, and X. Liu, "A Peer-to-Peer Spatial Cloaking Algorithm for Anonymous Location-Based Service," Proc. 14th Ann. ACM Int'l Symp. Advances in Geographic Information Systems (GIS), pp. 171-178, 2006.

[10] E. Curtmola, A. Deutsch, K.K. Ramakrishnan, and D. Srivastava, "Load-Balanced Query Dissemination in Privacy-Aware Online Communities," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 471-482, 2010.

[11] C. Dwork, "Differential Privacy," Proc. Int'l Colloquium Automata, Languages and Programming (ICALP), pp. 1-12, 2006.

2656 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 11, NOVEMBER 2013

[12] F. Fabret, H.-A. Jacobsen, F. Llirbat, J. Pereira, K.A. Ross, and D. Shasha, "Filtering Algorithms and Implementation for Very Fast Publish/Subscribe," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 115-126, 2001.

[13] F. Fabret, H.-A. Jacobsen, F. Llirbat, J. Pereira, K.A. Ross, and D. Shasha, "Filtering Algorithms and Implementation for Very Fast Publish/Subscribe," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 115-126, 2001.

[14] B. Fan, H. Lim, D.G. Andersen, and M. Kaminsky, "Small Cache, Big Effect: Provable Load Balancing for Randomly Partitioned Cluster Services," Proc. ACM Symp. Cloud Computing (SOCC), pp. 1-13, 2011.

[15] P. Felber, M. Rajman, E. Riviere, V. Schiavoni, and J. Valerio, "SPADS: Publisher Anonymization for DHT Storage," Proc. IEEE 10th Int'l Conf. Peer-to-Peer Computing (P2P), pp. 1-10, 2010.

[16] M.J. Freedman and R. Morris, "Tarzan: A Peer-to-Peer Anonymizing Network Layer," Proc. ACM Conf. Computer and Comm. Security, pp. 193-206, 2002.

[17] B. Gedik and L. Liu, "Protecting Location Privacy with Personalized k-Anonymity: Architecture and Algorithms," IEEE Trans. Mobile Computing, vol. 7, no. 1, pp. 1-18, Jan. 2008.

[18] O. Goldreich, S. Micali, and A. Wigderson, "How to Play any Mental Game or a Completeness Theorem for Protocols with Honest Majority," Proc. 19th Ann. ACM Symp. Theory of Computing (STOC), pp. 218-229, 1987.

[19] M. Gruteser and D. Grunwald, "Anonymous Usage of Location-Based Services through Spatial and Temporal Cloaking," Proc. First Int'l Conf. Mobile Systems, Applications and Services (MobiSys), 2003.

[20] H. Liu, Z. Wu, M. Petrovic, and H.-A. Jacobsen, "Optimized Cluster-Based Filtering Algorithm for Graph Metadata," Information Sciences, vol. 181, no. 24, pp. 5468-5484, 2011.

[21] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "L-Diversity: Privacy Beyond K-Anonymity," Proc. Int'l Conf. Data Eng. (ICDE), p. 24, 2006.

[22] G. Mühl, "Large-Scale Content-Based Publish/Subscribe Systems," PhD dissertation, 2002.

[23] M. Nabeel, N. Shang, and E. Bertino, "Privacy-Preserving Filtering and Covering in Content-Based Publish Subscribe Systems," technical report, Purdue Univ., June 2009.

[24] M. Nabeel, N. Shang, and E. Bertino, "Efficient Privacy Preserving Content Based Publish Subscribe Systems," Proc. 17th ACM Symp. Access Control Models and Technologies (SACMAT), pp. 133-144, 2012.

[25] L. Opyrchal, A. Prakash, and A. Agrawal, "Supporting Privacy Policies in a Publish-Subscribe Substrate for Pervasive Environments," J. Networks, vol. 2, no. 1, pp. 17-26, 2007.

[26] L. Pareschi, D. Riboni, A. Agostini, and C. Bettini, "Composition and Generalization of Context Data for Privacy Preservation," Proc. IEEE Sixth Ann. Int'l Conf. Pervasive Computing and Comm. (PerCom), pp. 429-433, 2008.

[27] C. Raiciu and D.S. Rosenblum, "Enabling Confidentiality in Content-Based Publish/Subscribe Infrastructures," Proc. SecureComm Workshops, pp. 1-11, 2006.

[28] W. Rao, L. Chen, and A.W. Fu, "On Efficient Content Matching in Distributed Pub/Sub Systems," Proc. IEEE INFOCOM, 2009.

[29] W. Rao, L. Chen, and A.W.-C. Fu, "Stairs: Towards Efficient Full-Text Filtering and Dissemination in DHT Environments," VLDB J., vol. 20, no. 6, pp. 793-817, 2011.

[30] W. Rao, L. Chen, P. Hui, and S. Tarkoma, "Move: A Large Scale Keyword-Based Content Filtering and Dissemination System," Proc. IEEE 32nd Int'l Conf. Distributed Computing Systems (ICDCS), 2012.

[31] P. Samarati, "Protecting Respondents' Identities in Microdata Release," IEEE Trans. Knowledge and Data Eng., vol. 13, no. 6, pp. 1010-1027, Nov./Dec. 2001.

[32] N. Shang, M. Nabeel, F. Paci, and E. Bertino, "A Privacy-Preserving Approach to Policy-Based Content Dissemination," Proc. IEEE 26th Int'l Conf. Data Eng. (ICDE), pp. 944-955, 2010.

[33] A. Shikfa, M. Onen, and R. Molva, "Privacy-Preserving Content-Based Publish/Subscribe Networks," Proc. 24th Int'l Information Security Conf. (SEC), pp. 270-282, 2009.

[34] M. Srivatsa and L. Liu, "Secure Event Dissemination in Publish-Subscribe Networks," Proc. 27th Int'l Conf. Distributed Computing Systems (ICDCS), p. 22, 2007.

[35] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002.

[36] M. Terrovitis, N. Mamoulis, and P. Kalnis, "Privacy-Preserving Anonymization of Set-Valued Data," Proc. VLDB Endowment, vol. 1, no. 1, pp. 115-125, 2008.

[37] V.V. Vazirani, Approximation Algorithms. Springer, 2001.

[38] T. Wang and L. Liu, "Privacy-Aware Mobile Services over Road Networks," Proc. VLDB Endowment, vol. 2, no. 1, pp. 1042-1053, 2009.

[39] A.C.-C. Yao, "Protocols for Secure Computations (Extended Abstract)," Proc. IEEE Symp. Foundations of Computer Science (FOCS), pp. 160-164, 1982.

Weixiong Rao received the BSc degree from North (Beijing) Jiaotong University, the MSc degree from Shanghai Jiaotong University, and the PhD degree from The Chinese University of Hong Kong in 2009. After his PhD study, he worked for the Hong Kong University of Science and Technology (2010) and the University of Helsinki (2011-2012), and is now with the School of Software Engineering, Tongji University, China. He is a member of the IEEE.

Lei Chen received the BS degree in computer science and engineering from Tianjin University, China, in 1994, the MA degree from the Asian Institute of Technology, Thailand, in 1997, and the PhD degree in computer science from the University of Waterloo, Canada, in 2005. He is currently an associate professor in the Department of Computer Science and Engineering at The Hong Kong University of Science and Technology. His research interests include uncertain databases, graph databases, multimedia and time-series databases, and sensor and peer-to-peer databases. He is a member of the IEEE.

Sasu Tarkoma received the MSc and PhD degrees in computer science from the Department of Computer Science at the University of Helsinki. He is a full professor at the Department of Computer Science, University of Helsinki, and the head of the networking and services specialization line. He has managed and participated in national and international research projects at the University of Helsinki, Aalto University, and the Helsinki Institute for Information Technology (HIIT). He has worked in the IT industry as a consultant and chief system architect, as well as principal researcher and laboratory expert at Nokia Research Center. His interests include mobile computing, Internet technologies, and middleware. He is a senior member of the IEEE.

