
arXiv:1701.02243v1 [cs.CY] 9 Jan 2017

k^{τ,ε}-anonymity: Towards Privacy-Preserving Publishing of Spatiotemporal Trajectory Data

Marco Gramaglia∗, Marco Fiore†, Alberto Tarable†, Albert Banchs∗

∗ IMDEA Networks Institute & Universidad Carlos III de Madrid, Avda. del Mar Mediterraneo, 22, 28918 Leganes (Madrid), Spain

Email: [email protected]

† CNR-IEIIT, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy

Email: [email protected]

Abstract—Mobile network operators can track subscribers via passive or active monitoring of device locations. The recorded trajectories offer an unprecedented outlook on the activities of large user populations, which enables developing new networking solutions and services, and scaling up studies across research disciplines. Yet, the disclosure of individual trajectories raises significant privacy concerns: thus, these data are often protected by restrictive non-disclosure agreements that limit their availability and impede potential usages. In this paper, we contribute to the development of technical solutions to the problem of privacy-preserving publishing of spatiotemporal trajectories of mobile subscribers. We propose an algorithm that generalizes the data so that they satisfy k^{τ,ε}-anonymity, an original privacy criterion that thwarts attacks on trajectories. Evaluations with real-world datasets demonstrate that our algorithm attains its objective while retaining a substantial level of accuracy in the data. Our work is a step forward in the direction of open, privacy-preserving datasets of spatiotemporal trajectories.

I. INTRODUCTION

Subscriber trajectory datasets collected by network operators are logs of timestamped, georeferenced events associated to the communication activities of individuals. The analysis of these datasets allows inferring fine-grained information about the movements, habits and undertakings of vast user populations. This has many different applications, encompassing both business and research. For instance, trajectory data can be used to devise novel data-driven network optimization techniques [1] or support content delivery operations at the network edge [2]. They can also be monetized via added-value services such as transport analytics [3] or location-based marketing [4]. Additionally, the relevance of massive movement data from mobile subscribers is critical in research disciplines such as physics, sociology or epidemiology [5].

The importance of trajectory data has also been recognized in the design of future 5G networks, with a thrust towards the introduction of data interfaces among network operators and over-the-top (OTT) providers to give them online access to this (and other) data. OTTs can leverage such interfaces to automatically retrieve the data and process them on the fly, thus enabling new applications such as intelligent transportation [6] or assisted-life services [7].

All these use cases stem from the disclosure of trajectory datasets to third parties. However, the open release of such data is still largely withheld, which hinders potential usages and applications. A major barrier in this sense is privacy concerns: data circulation exposes it to re-identification attacks, and knowledge of the movement patterns of de-anonymized individuals may reveal sensitive information about them.

This calls for anonymization techniques. The common practice operators adhere to is replacing personal identifiers (e.g., name, phone number, IMSI) with pseudo-identifiers (i.e., random or non-reversible hash values). Whether this is a sufficient measure is often called into question, especially in relation to the possibility of tracking user movements. What is sure is that pseudo-identifiers have been repeatedly proven not to protect against user trajectory uniqueness, i.e., the fact that mobile subscribers have distinctive travel patterns that make them univocally recognizable even in very large populations [8]–[10]. Uniqueness is not a privacy threat per se, but it is a vulnerability that can lead to re-identification. Examples are brought forth by recent attempts at cross-correlating mobile operator-collected trajectories with georeferenced check-ins of Flickr and Twitter users [11], with credit card records [12] or with Yelp, Google Places and Facebook metadata [13].

More dependable anonymization solutions are needed. However, the strategies devised to date for relational databases, location-based services, or regularly sampled (e.g., GPS) mobility do not suit the irregular sampling, time sparsity, and long duration of trajectories collected by mobile operators. Moreover, current privacy criteria, including k-anonymity and differential privacy, do not provide sufficient protection or are impractical in this context. See Sec. V for a detailed discussion.

In this paper, we put forward several contributions towards privacy-preserving data publishing (PPDP) of mobile subscriber trajectories. Our contributions are as follows: (i) we outline attacks that are especially relevant to datasets of spatiotemporal trajectories; (ii) we introduce k^{τ,ε}-anonymity, a novel privacy criterion that effectively copes with the most threatening attacks above; (iii) we develop k-merge, an algorithm that solves a fundamental problem in the anonymization of spatiotemporal trajectories, i.e., effective generalization; (iv) we implement kte-hide, a practical solution based on k-merge that attains k^{τ,ε}-anonymity in spatiotemporal trajectory data; (v) we evaluate our approach on real-world datasets, showing that it achieves its objectives while retaining a substantial level of accuracy in the anonymized data.


II. REQUIREMENTS AND MODELS

We first present the requirements of PPDP, in Sec. II-A, and formalize the specific attacker model we consider, in Sec. II-B. We then propose a consistent privacy model, in Sec. II-C.

A. PPDP requirements

PPDP is defined as the development of methods for the publication of information that allows meaningful knowledge discovery, and yet preserves the privacy of monitored subjects [14]. The requisites of PPDP are similar for all types of databases, including our specific case, i.e., datasets of spatiotemporal trajectories. They are as follows.

1. The non-expert data publisher. Mining of the data is performed by the data recipient, and not by the data publisher. The only task of the data publisher is to anonymize the data for publication.

2. Publication of data, and not of data mining results. The aim of PPDP is producing privacy-preserving datasets, and not anonymized datasets of classifiers, association rules, or aggregate statistics. This sets PPDP apart from privacy-preserving data mining (PPDM), where the final usage of the data is known at dataset compilation time.

3. Truthfulness at the record level. Each record of the published database must correspond to a real-world subject. Moreover, all information on a subject must map to actual activities or features of the subject. This prevents fictitious data from introducing unpredictable biases in the anonymized datasets.

Our privacy model will obey the principles above. We stress that they impose that the privacy model must be agnostic of data usage (points 1 and 2), and that it cannot rely on randomized, perturbed, permuted or synthetic data (point 3).

B. Attacker model

Unlike PPDP requirements, the attacker model is necessarily specific to the type of data we consider, and it is characterized by the knowledge and goal of the adversary. The former describes the information the opponent possesses, while the latter represents his privacy-threatening objective.

1) Attacker knowledge: In trajectory datasets, each data record is a sequence of spatiotemporal samples. We assume an attacker who can track a target subscriber continuously during any amount of time τ. The adversary knowledge then consists of all spatiotemporal samples in the victim's trajectory over a continuous¹ time interval of duration τ.

2) Attacker goal: Attacks against user privacy in published data can have different objectives, and a comprehensive classification is provided in [14]. Two classes of attacks are especially relevant in the context of mobile subscriber trajectory data. Both exploit the uniqueness of movement patterns that, as mentioned in Sec. I, characterizes trajectory data.

¹ Non-continuous tracking in the attacker model is an interesting but very challenging open problem. A mitigative solution realisable with our model is considering a τ that covers all disjoint tracking intervals.

• Record linkage attacks. These attacks aim at univocally distinguishing an individual in the database. A successful record linkage enables cross-database correlation, which may ultimately unveil the identity of the user. Record linkage attacks on mobile traffic data have been repeatedly and successfully demonstrated [8]–[10]. As mentioned in Sec. I, they have also been used for subsequent cross-database correlations [11]–[13].

• Probabilistic attacks. These attacks let an adversary with partial information about an individual enlarge his knowledge on that individual by accessing the database. They are especially relevant to spatiotemporal trajectories, as shown by seminal works that first unveiled the anonymization issues of mobile traffic datasets [8], [9]. Let us imagine a scenario where an adversary knows a small set of spatiotemporal points in the trajectory of a subscriber (because, e.g., he met the target individual there). A successful probabilistic attack would reveal the complete movements of the subscriber to the attacker, who could then use them to infer sensitive information about the victim, such as home/work locations, daily routines, or visits to healthcare structures.

Our privacy model will address both classes of attacks above, led by an adversary with the knowledge described in Sec. II-B1.

C. Privacy model

Our privacy model is designed following the PPDP requirements and attacker model presented before. We start by considering suitable privacy criteria against record linkage and probabilistic attacks, in Sec. II-C1 and Sec. II-C2, respectively. We then show how the first criterion is in fact a specialization of the second, in Sec. II-C3, which allows us to focus on a single unifying privacy model. Finally, we present the elementary techniques that we employ to implement the target privacy criterion, in Sec. II-C4.

1) k-anonymity: The k-anonymity criterion realizes the indistinguishability principle, by requiring that each record in a database be indistinguishable from at least k−1 other records in the same database [15]. In our case, this maps to ensuring that each subscriber is hidden in a crowd of k users whose trajectories cannot be told apart. The popularity of k-anonymity for PPDP has led to indiscriminate use beyond its scope, and subsequent controversy on the privacy guarantees it can provide. E.g., k-anonymity has been proven ineffective against attacks aiming at attribute linkage (including exploits of insufficient side-information diversity), at localizing users, or at disclosing their presence and meetings [16]–[18].

However, k-anonymity remains a legitimate criterion against record linkage attacks on any kind of database [14]. Therefore, this privacy model protects trajectory data from the first type of attack in Sec. II-B, including its variations in [8]–[13].

2) k^{τ,ε}-anonymity: No privacy criterion proposed to date can safeguard spatiotemporal trajectory data from the second type of attacks in Sec. II-B, i.e., probabilistic attacks. This forces us to define an original criterion, as follows.

Fig. 1. Illustrative example of k^{τ,ε}-anonymity of user i, with k=2.

The pertinent principle here is the so-called uninformative principle, i.e., ensuring that the difference between the knowledge of the adversary before and after accessing a database is small [16]. In our context, this principle warrants that an attacker who knows some subset of a subscriber's movements cannot extract from the dataset a substantially longer portion of that user's trajectory.

To attain the uninformative principle, we introduce the k^{τ,ε}-anonymity privacy criterion. k^{τ,ε}-anonymity can be seen as a variation of k^m-anonymity, which establishes that each individual in a dataset must be indistinguishable from at least k−1 other users in the same dataset, when limiting the attacker knowledge to any set of m attributes [19]. k^{τ,ε}-anonymity tailors k^m-anonymity to our scenario, as follows.

• As per Sec. II-B, the attacker knowledge can be any continuous sequence of spatiotemporal samples covering a time interval of length at most τ: thus, the m parameter of k^m-anonymity maps to the (variable) set of samples contained in any time period τ. During any such time period, every trajectory in the dataset must be indistinguishable from at least k−1 other trajectories.

• The maximum additional knowledge that the attacker is allowed to learn is called leakage; it consists of the spatiotemporal samples of the target user's trajectory contained in a time interval of duration at most ε, disjoint from the original τ. In order to fulfill the uninformative principle, the leakage ε must be small.

The two requirements above imply alternating in time the k−1 trajectories that provide anonymization. An intuitive example is provided in Fig. 1. There, the trajectory of a target user i is 2^{τ,ε}-anonymized using those of five other subscribers. The overlapping between the trajectories of a, b, c, d, e and that of i is partial and varied. An adversary knowing a sub-trajectory of i during any time interval of duration τ always finds at least one other user with a movement pattern that is identical to that of i during that interval, but different elsewhere. With this knowledge, the adversary cannot tell apart i from the other subscriber, and thus cannot attribute full trajectories to one user or the other. As this holds no matter where the knowledge interval is shifted to, the attacker can never retrieve the complete movement patterns of i: this achieves the uninformative principle. Still, the adversary can increase its knowledge in some cases. Let us consider the interval τ indicated in the figure: the trajectories of i, d and e are identical for some time after τ, which allows associating to i the movements during ε: the opponent learns one additional spatiotemporal sample of i.

3) Relationship between the privacy criteria: It is easy to see that k-anonymity is a special case of k^{τ,ε}-anonymity. As a matter of fact, the latter criterion reduces to the former when τ + ε covers the whole temporal duration of the trajectory dataset. Then, k^{τ,ε}-anonymity requires that each complete trajectory be indistinguishable from k−1 other trajectories, which is the definition of k-anonymity. Our point here is that an anonymization solution that implements k^{τ,ε}-anonymity can be straightforwardly employed to attain k-anonymity as well, by properly adjusting the τ and ε parameters.

In the light of these considerations, we address the problem of achieving k^{τ,ε}-anonymity in datasets of spatiotemporal trajectories of mobile subscribers. By doing so, we develop a complete anonymization solution that is effective against probabilistic attacks, but can also be specialized to guarantee k-anonymity and counter record linkage attacks.

4) Generalization and suppression: In order to enforce k^{τ,ε}-anonymity for all users in the dataset, we need to modify the spatiotemporal samples in the trajectories of individuals, so that the criterion in Sec. II-C2 is respected for all of them. To that end, we rely on two elementary techniques, i.e., spatiotemporal generalization and suppression of samples.

Spatiotemporal generalization reduces the precision of trajectory samples in space and time, so as to make the samples of two or more users indistinguishable. Suppression removes from the trajectories those samples that are too hard to anonymize. Both techniques are lossy, i.e., imply some reduction of precision in the data. Yet, unlike other approaches, these techniques conform to the PPDP requirement of truthfulness at the record level, see Sec. II-A.

III. ACHIEVING k^{τ,ε}-ANONYMITY

Our goal is ensuring that an anonymized dataset of mobile subscriber trajectories respects the uninformative principle, by implementing, through generalization and suppression, the k^{τ,ε}-anonymity of all subscriber trajectories in the dataset. Clearly, we aim at doing so while minimizing the loss of spatiotemporal granularity in the data.

We start by defining the basic operation of generalizing a set of spatiotemporal samples, and the associated cost in terms of loss of granularity, in Sec. III-A. We then extend both notions to (sub-)trajectories, in Sec. III-B. Building on these definitions, we discuss in Sec. III-C the optimal spatiotemporal generalization of k (sub-)trajectories. We implement the result into k-merge, an optimal low-complexity algorithm that generalizes (sub-)trajectories with minimal loss of data granularity, in Sec. III-D. Once able to merge (sub-)trajectories optimally, we propose an approach to guarantee k^{τ,ε}-anonymity of the trajectory of a single user, in Sec. III-E, and we then scale the solution to multiple users in Sec. III-F. Finally, we introduce kte-hide, an algorithm that ensures k^{τ,ε}-anonymity in spatiotemporal trajectory datasets, in Sec. III-G.

A. Generalization of samples

A (raw) sample of a spatiotemporal trajectory represents the position of a subscriber at a given time, and we model it with a length-3 real vector $s = (t(s), x(s), y(s))$. Since a dataset is characterized by a finite granularity in time and space, a sample is in fact a slot spanning some minimum temporal and spatial intervals. The vector entries above can be regarded as the origins of a normalized length-1 time interval and a normalized 1×1 two-dimensional area².

Fig. 2. Example of merging of trajectories $\mathbf{S}_i = \{s_{i,j}\}$ and $\mathbf{S}_{i'} = \{s_{i',j}\}$ into a generalized trajectory $\mathbf{G} = \{\mathcal{G}\}$. For clarity, space is unidimensional.

Spatiotemporal generalization merges together two or more raw samples into a generalized sample, i.e., a slot with a larger span. Mathematically, a generalized sample $\mathcal{G}$ can be represented as the set of the merged samples. There is a cost associated with merging samples, which is related to the span of the corresponding generalized sample, i.e., to the loss of granularity induced by the generalization. The cost of the operation of merging a set of samples into the generalized sample $\mathcal{G}$ is defined as

\[ c(\mathcal{G}) = c_t(\mathcal{G}) \, c_s(\mathcal{G}), \tag{1} \]

where $c_t(\mathcal{G})$ represents the cost in the time dimension, while $c_s(\mathcal{G})$ is the cost in the space dimensions.

Let $\mathcal{G}_1$ and $\mathcal{G}_2$ be two disjoint generalized samples (i.e., $\mathcal{G}_1 \cap \mathcal{G}_2 = \emptyset$). Then, we make the following two assumptions on the time and space merging costs:

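\[ c_t(\mathcal{G}_1 \cup \mathcal{G}_2) \ge c_t(\mathcal{G}_1) + c_t(\mathcal{G}_2) \tag{2} \]

\[ c_s(\mathcal{G}_1 \cup \mathcal{G}_2) \ge \max\{c_s(\mathcal{G}_1), c_s(\mathcal{G}_2)\}. \tag{3} \]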

Hereafter, we use the following definitions to implement the generic costs $c_t(\mathcal{G})$ and $c_s(\mathcal{G})$:

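\[ c_t(\mathcal{G}) = \Delta_t(\mathcal{G}) \tag{4} \]

\[ c_s(\mathcal{G}) = \Delta_x(\mathcal{G}) + \Delta_y(\mathcal{G}), \tag{5} \]

where

\[ \Delta_\star(\mathcal{G}) = \max_{s \in \mathcal{G}} \star(s) - \min_{s \in \mathcal{G}} \star(s) + 1, \tag{6} \]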

with $\star \in \{t, x, y\}$, is the span in each dimension. Therefore, in our implementation, $c(\mathcal{G})$ is the area of a rectangle with sides $\Delta_t(\mathcal{G})$ and $\Delta_x(\mathcal{G}) + \Delta_y(\mathcal{G})$. A graphical example is provided in Fig. 2, where two raw samples $s_{i,1}$ and $s_{i',1}$ are merged into a generalized sample $\mathcal{G}_1$, spanning $\Delta_t(\mathcal{G}_1)$ in time and $\Delta_x(\mathcal{G}_1)$ in space (portrayed as unidimensional in the figure, for the sake of readability).

² For instance, in our reference datasets, the sample granularity is 1 minute in time and 100 meters in space. A raw sample then spans one slot (i.e., 1 minute) in time and one slot (i.e., a 100×100 m² area) in space. However, our discussion is general, and holds for any precision in the data.

Remark 1: The rationale for our choice of costs is computational efficiency. Also, summing the two space spans before multiplication allows balancing the time and space contributions. Finally, note that with the definition in (5), the space merging cost assumption in (3) is trivially true. Instead, the definition in (4) lets the time merging cost assumption in (2) hold only if the time intervals spanned by $\mathcal{G}_1$ and $\mathcal{G}_2$ are non-overlapping. The time coherence property that we will introduce in Sec. III-B ensures that this is always the case.
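The cost definitions in (1) and (4)–(6) can be illustrated with a minimal Python sketch. The tuple layout `(t, x, y)` in normalized slot units and the function names `span` and `sample_cost` are assumptions made here for illustration; they do not come from our implementation.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[int, int, int]  # assumed layout: (t, x, y) in normalized slots


def span(values: Iterable[int]) -> int:
    """Span of Eq. (6): max - min + 1, so a single slot has span 1."""
    vals = list(values)
    return max(vals) - min(vals) + 1


def sample_cost(G: List[Sample]) -> int:
    """Cost of Eq. (1), using the time cost (4) and the space cost (5)."""
    c_t = span(s[0] for s in G)                       # Eq. (4)
    c_s = span(s[1] for s in G) + span(s[2] for s in G)  # Eq. (5)
    return c_t * c_s


raw = [(10, 4, 7)]                   # one raw sample
merged = [(10, 4, 7), (12, 6, 7)]    # two raw samples merged together
print(sample_cost(raw))     # 1 * (1 + 1) = 2
print(sample_cost(merged))  # 3 * (3 + 1) = 12
```

A raw sample has cost 2 rather than 1 because the two unit space spans are summed before the multiplication by the unit time span.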

B. Generalization of trajectories

A spatiotemporal (sub-)trajectory describes the movements of a single subscriber during the dataset timespan. Formally, a trajectory is an ordered vector of samples $\mathbf{S} = (s_1, \dots, s_N)$, where the ordering is induced by the time coordinate, i.e., $t(s_i) < t(s_{i'})$ if and only if $i < i'$.

A generalized trajectory, obtained by merging different trajectories, is defined as an ordered vector of generalized samples $\mathbf{G} = (\mathcal{G}_1, \dots, \mathcal{G}_Z)$. Here the ordering is more subtle, and based on the fact that the time intervals spanned by the generalized samples are non-overlapping, a property that will be called time coherence. More precisely, if $\mathcal{G}_i$ and $\mathcal{G}_{i'}$, $i < i'$, are two generalized samples of $\mathbf{G}$, then

\[ \max_{s \in \mathcal{G}_i} t(s) < \min_{s \in \mathcal{G}_{i'}} t(s). \]

An example of a generalized trajectory $\mathbf{G}$ merging two trajectories $\mathbf{S}_i$ and $\mathbf{S}_{i'}$ is provided in Fig. 2. $\mathbf{G}$ fulfils time coherence, as its generalized samples are temporally disjoint.

Remark 2: Time coherence is a defining property of generalized trajectories in PPDP. As a matter of fact, publishing trajectory data with time-overlapping samples would generate semantic ambiguity and make analyses cumbersome.

Analogously to the cost of merging samples, we can define a cost of merging multiple trajectories into a generalized trajectory. We define such cost as the sum of costs of all generalized samples belonging to it. More precisely, if $\mathbf{G} = (\mathcal{G}_1, \dots, \mathcal{G}_Z)$, and $c(\cdot)$ is defined as in (1), then the cost of $\mathbf{G}$ is given by:

\[ C(\mathbf{G}) = \sum_{i=1}^{Z} c(\mathcal{G}_i). \tag{7} \]

Remark 3: The cost in (7) is the overall surface covered by samples of the generalized trajectory over the spatiotemporal plane. E.g., in Fig. 2, the cost of $\mathbf{G}$ is the sum of the three areas, i.e., $c(\mathcal{G}_1) + c(\mathcal{G}_2) + c(\mathcal{G}_3)$. It is thus proportional to the total loss of granularity induced by the generalization.
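As a hedged illustration of time coherence and of the trajectory cost in (7), the following Python sketch represents a generalized trajectory as a list of generalized samples, each being a list of `(t, x, y)` tuples. This representation and the names `is_time_coherent` and `trajectory_cost` are assumptions introduced for the example only.

```python
from typing import List, Tuple

Sample = Tuple[int, int, int]            # assumed layout: (t, x, y)
GeneralizedSample = List[Sample]


def sample_cost(G: GeneralizedSample) -> int:
    """Eq. (1) with the spans of Eq. (6) used as in Eqs. (4)-(5)."""
    def span(idx: int) -> int:
        vals = [s[idx] for s in G]
        return max(vals) - min(vals) + 1
    return span(0) * (span(1) + span(2))


def is_time_coherent(traj: List[GeneralizedSample]) -> bool:
    """Each generalized sample must end strictly before the next one starts."""
    return all(
        max(s[0] for s in cur) < min(s[0] for s in nxt)
        for cur, nxt in zip(traj, traj[1:])
    )


def trajectory_cost(traj: List[GeneralizedSample]) -> int:
    """Eq. (7): sum of the costs of all generalized samples."""
    return sum(sample_cost(G) for G in traj)


good = [[(0, 0, 0), (1, 1, 0)], [(3, 2, 0), (4, 2, 0)]]
bad = [[(0, 0, 0), (3, 1, 0)], [(2, 2, 0), (4, 2, 0)]]  # time spans overlap
print(is_time_coherent(good), trajectory_cost(good))  # True 10
print(is_time_coherent(bad))                          # False
```

Checking only consecutive pairs is sufficient when the generalized samples are listed in time order, since the strict inequality is transitive along the chain.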

C. Optimal generalization of trajectories

We now formalize the problem of optimal generalization of spatiotemporal (sub-)trajectories. Suppose that we have k trajectories $\mathbf{S}_1, \dots, \mathbf{S}_k$, with $\mathbf{S}_i = (s_{i,1}, \dots, s_{i,N_i})$, $i = 1, \dots, k$. The goal is a generalized trajectory $\mathbf{G}^* = (\mathcal{G}^*_1, \dots, \mathcal{G}^*_Z)$ from $\mathbf{S}_1, \dots, \mathbf{S}_k$, which satisfies the following conditions.

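i) The union of all generalized samples of $\mathbf{G}^*$ must coincide with the union of all samples of $\mathbf{S}_1, \dots, \mathbf{S}_k$, i.e.,

\[ \mathcal{G}^*_1 \cup \dots \cup \mathcal{G}^*_Z = \mathcal{S}_1 \cup \dots \cup \mathcal{S}_k \triangleq \mathcal{S}, \]

where $\mathcal{S}_i = \bigcup_{j=1}^{N_i} \{s_{i,j}\}$. Thus, $\mathbf{G}^*$ is a partition of the set $\mathcal{S}$ of all samples in the input trajectories: it does not add any alien sample or discard any input sample.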

ii) Each generalized sample contains at least one sample from each of the k input trajectories $\mathbf{S}_1, \dots, \mathbf{S}_k$, i.e.,

\[ \mathcal{G}^*_i \cap \mathcal{S}_{i'} \neq \emptyset, \quad i = 1, \dots, Z, \quad i' = 1, \dots, k. \]

This imposes that each input trajectory contributes to each generalized sample of $\mathbf{G}^*$. Otherwise, the merging could associate generalized samples to users that never visited the generalized location at the generalized time, violating point 3 of the PPDP requirements in Sec. II-A.

iii) The cost of the merging is minimized, i.e.,

\[ \mathbf{G}^* = \arg\min_{\mathbf{G} \in \mathcal{K}} C(\mathbf{G}), \tag{8} \]

where $\mathcal{K}$ is the set of all partitions of $\mathcal{S}$ satisfying time coherence as well as condition ii) above, and $C(\mathbf{G})$ is in (7). In Fig. 2, the generalized trajectory $\mathbf{G}$ fulfils all these requirements, and is thus the optimal merge $\mathbf{G}^*$ of $\mathbf{S}_i$ and $\mathbf{S}_{i'}$.
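To make conditions i)–iii) concrete, the Python sketch below solves (8) by exhaustive search on a toy input. It relies on the observation that a time-coherent partition can only group samples that are contiguous in the time-sorted list, so every candidate is a set of cut positions in that list. The function names and the `(user, t, x, y)` layout are assumptions for illustration, and the approach is only viable for a handful of samples.

```python
from itertools import combinations


def block_cost(block):
    """Eq. (1) with Eqs. (4)-(6); block items are (user, t, x, y)."""
    def span(idx):
        vals = [s[idx] for s in block]
        return max(vals) - min(vals) + 1
    return span(1) * (span(2) + span(3))


def brute_force_merge(trajectories):
    """Return (cost, blocks) of the cheapest valid partition, or (None, None)."""
    k = len(trajectories)
    samples = sorted(
        ((u, t, x, y) for u, traj in enumerate(trajectories) for (t, x, y) in traj),
        key=lambda s: s[1],
    )
    n = len(samples)
    best_cost, best_blocks = None, None
    for r in range(n):                        # r = number of cuts
        for cuts in combinations(range(1, n), r):
            bounds = [0, *cuts, n]
            # Condition i) holds by construction: blocks tile all samples.
            blocks = [samples[a:b] for a, b in zip(bounds, bounds[1:])]
            # Condition ii): every block holds a sample of every user.
            if any({s[0] for s in b} != set(range(k)) for b in blocks):
                continue
            # Time coherence: strict time separation between adjacent blocks.
            if any(b1[-1][1] >= b2[0][1] for b1, b2 in zip(blocks, blocks[1:])):
                continue
            total = sum(block_cost(b) for b in blocks)   # Eq. (7)
            if best_cost is None or total < best_cost:   # condition iii)
                best_cost, best_blocks = total, blocks
    return best_cost, best_blocks


A = [(0, 0, 0), (5, 3, 0)]
B = [(1, 1, 0), (6, 3, 0)]
cost, blocks = brute_force_merge([A, B])
print(cost, len(blocks))  # 10 2
```

In this toy case, merging everything into a single generalized sample would cost 35, while pairing the samples two by two costs 6 + 4 = 10.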

Solving the problem above with a brute-force search is computationally prohibitive, since $\mathcal{K}$ has a size that grows exponentially with $|\mathcal{S}|/k$, where $|\cdot|$ denotes cardinality. However, we can characterize $\mathbf{G}^*$ so that it is possible to compute it with low complexity. To that end, we name elementary a partition $\mathbf{G} \in \mathcal{K}$ that cannot be refined to another partition within $\mathcal{K}$. In other words, none of the generalized samples of an elementary partition can be split into two generalized samples without violating conditions i) and ii) above, or time coherence. Then, we have the following proposition.

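Proposition 1: Given the input trajectories $\mathbf{S}_1, \dots, \mathbf{S}_k$, the optimal $\mathbf{G}^*$ defined in (8) is an elementary partition.

Proof: Suppose $\mathbf{G} \in \mathcal{K}$ is not elementary, so that it can be refined to another partition $\tilde{\mathbf{G}} \in \mathcal{K}$. In particular, without loss of generality, suppose that $\mathbf{G} = (\mathcal{G}_1, \dots, \mathcal{G}_Z)$ and $\tilde{\mathbf{G}} = (\tilde{\mathcal{G}}_1, \dots, \tilde{\mathcal{G}}_{Z+1})$, where

\[ \mathcal{G}_i = \begin{cases} \tilde{\mathcal{G}}_i, & i < Z \\ \tilde{\mathcal{G}}_Z \cup \tilde{\mathcal{G}}_{Z+1}, & i = Z. \end{cases} \tag{9} \]

From (7) and (9), the difference between the costs of $\mathbf{G}$ and $\tilde{\mathbf{G}}$ is given by

\[ C(\mathbf{G}) - C(\tilde{\mathbf{G}}) = c(\mathcal{G}_Z) - c(\tilde{\mathcal{G}}_Z) - c(\tilde{\mathcal{G}}_{Z+1}). \tag{10} \]

Since $\mathcal{G}_Z$ contains the union of raw samples in $\tilde{\mathcal{G}}_Z$ and $\tilde{\mathcal{G}}_{Z+1}$, we can apply properties (2) and (3) (where (2) holds because of time coherence) and obtain:

\begin{align}
c(\mathcal{G}_Z) &= c_t(\mathcal{G}_Z) \, c_s(\mathcal{G}_Z) \nonumber \\
&\ge \left( c_t(\tilde{\mathcal{G}}_Z) + c_t(\tilde{\mathcal{G}}_{Z+1}) \right) c_s(\mathcal{G}_Z) \nonumber \\
&\ge c_t(\tilde{\mathcal{G}}_Z) \, c_s(\tilde{\mathcal{G}}_Z) + c_t(\tilde{\mathcal{G}}_{Z+1}) \, c_s(\tilde{\mathcal{G}}_{Z+1}) \nonumber \\
&= c(\tilde{\mathcal{G}}_Z) + c(\tilde{\mathcal{G}}_{Z+1}). \tag{11}
\end{align}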

Fig. 3. Partition tree for the two trajectories $\mathbf{S}_i = \{s_{i,j}\}$ and $\mathbf{S}_{i'} = \{s_{i',j}\}$ in Fig. 2. Nodes in the complete tree represent the set $\mathcal{K}$ of valid partitions of the set of raw samples $\mathcal{S}$. Elementary partitions are the tree leaves and constitute $\mathcal{K}^*$. The partition in Fig. 2 is the leftmost leaf in the tree.

Algorithm 1: k-merge algorithm pseudocode.
input : Trajectories S_1, ..., S_k, where S_i = (s_{i,1}, ..., s_{i,N_i})
output: Generalized sample set G*, Cost C(G*)
 1 foreach i ∈ [1, k] do
 2     S_i = ⋃_{j=1}^{N_i} {s_{i,j}};
 3 S ← timesort(S_1 ∪ ··· ∪ S_k);
 4 Cost ← (0, ∞, ..., ∞);
 5 Partition ← (NULL, ..., NULL);
 6 foreach s_θ ∈ S do
 7     θ′ = θ − 1;
 8     while incomplete(s_θ′, ..., s_θ) do
 9         θ′ = θ′ − 1;
10     while elementary(s_θ′, ..., s_θ) do
11         G ← generalize(s_θ′, ..., s_θ);
12         if Cost[θ] > c(G) + Cost[θ′ − 1] then
13             Cost[θ] ← c(G) + Cost[θ′ − 1];
14             Partition[θ] ← (θ′ − 1, G);
15         θ′ = θ′ − 1;
16 G* ← visit(Partition);
17 C(G*) ← Cost[|S|];

Comparing (11) with (10), we get that $C(\mathbf{G}) \ge C(\tilde{\mathbf{G}})$. Thus, to search for the optimal $\mathbf{G}^*$, we can drop $\mathbf{G}$ and keep only its refinement $\tilde{\mathbf{G}}$. If $\tilde{\mathbf{G}}$ is not elementary, then we can find one of its refinements, and repeat the above steps to drop also $\tilde{\mathbf{G}}$. This way, we can drop all partitions that are not elementary and be left only with elementary partitions as $\mathbf{G}^*$ candidates.

If we build a tree of partitions belonging to $\mathcal{K}$, such that $\mathcal{S}$ is the root and each node is a partition whose children are its refinements, the leaves are the elementary partitions, which form a subset $\mathcal{K}^*$. The above proposition states that we can limit the search of $\mathbf{G}^*$ to the set $\mathcal{K}^* \subset \mathcal{K}$ of elementary partitions of $\mathcal{S}$, which drastically reduces the search space. An example is provided in Fig. 3, for the trajectories in Fig. 2.

D. Optimal merging algorithm

We propose k-merge, an algorithm to efficiently search the set of raw samples $\mathcal{S}$, extract the subset of elementary partitions, $\mathcal{K}^*$, and identify the optimal partition $\mathbf{G}^*$.

The algorithm, detailed in Alg. 1, starts by populating a set of raw samples $\mathcal{S}$, whose items $s_{i,j}$ are ordered according to their time value $t(s_{i,j})$ (lines 1–3). Then, it processes all samples according to their temporal ordering (line 6). Specifically, the algorithm tests, for each sample $s_\theta$ in position θ, all sets $\{s_{\theta'}, \dots, s_\theta\}$, with θ′ < θ, as follows.

The first loop skips incomplete sets that do not contain at least one sample from each input trajectory (line 8). The second loop runs until the first non-elementary set is encountered (line 10). Therein, the algorithm generalizes the current (complete and elementary) set $\{s_{\theta'}, \dots, s_\theta\}$ to $\mathcal{G}$, and checks if $\mathcal{G}$ reduces the total merging cost up to $s_\theta$. If so, the cost is updated by adding $c(\mathcal{G})$ to the accumulated cost up to $s_{\theta'-1}$, and the resulting (partial) partition of $\mathcal{S}$ that includes $\mathcal{G}$ is stored (lines 11–14). Once out of the loops, the cost associated to the last sample is the optimal cost, and it is sufficient to backward navigate the partition structure to retrieve the associated $\mathbf{G}^*$ (lines 16–17).
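The dynamic-programming structure described above can be sketched in Python as follows. This is a simplified illustration and not a transcription of Alg. 1: for brevity it omits the early stop at the first non-elementary set (lines 10–15) and tries every complete set ending at each sample, which still minimizes (8) over all of $\mathcal{K}$ but takes quadratic time in the worst case. The function name `k_merge`, the `(t, x, y)` tuple layout, and the back-pointer array are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[int, int, int]  # assumed layout: (t, x, y)


def k_merge(trajectories: Sequence[Sequence[Sample]]):
    """Return (generalized samples, total cost) minimizing Eq. (8)."""
    k = len(trajectories)
    # Pool all samples with their owner and sort them by time (lines 1-3).
    S = sorted((t, x, y, u) for u, traj in enumerate(trajectories)
               for (t, x, y) in traj)
    n = len(S)
    cost = [0.0] + [math.inf] * n   # cost[j]: best cost of the first j samples
    back = [None] * (n + 1)         # back[j]: start index of the last block
    for j in range(1, n + 1):
        # Time coherence: a block may end here only if the next sample
        # is strictly later in time.
        if j < n and S[j - 1][0] >= S[j][0]:
            continue
        users = set()
        t_lo = x_lo = y_lo = math.inf
        t_hi = x_hi = y_hi = -math.inf
        for i in range(j, 0, -1):   # candidate block S[i-1:j], grown backwards
            t, x, y, u = S[i - 1]
            users.add(u)
            t_lo, t_hi = min(t_lo, t), max(t_hi, t)
            x_lo, x_hi = min(x_lo, x), max(x_hi, x)
            y_lo, y_hi = min(y_lo, y), max(y_hi, y)
            if len(users) < k or cost[i - 1] == math.inf:
                continue            # incomplete block or unreachable prefix
            c = (t_hi - t_lo + 1) * ((x_hi - x_lo + 1) + (y_hi - y_lo + 1))
            if cost[i - 1] + c < cost[j]:
                cost[j], back[j] = cost[i - 1] + c, i - 1
    if cost[n] == math.inf:
        raise ValueError("no valid merge: some trajectory has no usable sample")
    # Walk the back-pointers to recover the partition (lines 16-17).
    blocks: List[List[Sample]] = []
    j = n
    while j > 0:
        i = back[j]
        blocks.append([s[:3] for s in S[i:j]])
        j = i
    blocks.reverse()
    return blocks, cost[n]


A = [(0, 0, 0), (5, 3, 0)]
B = [(1, 1, 0), (6, 3, 0)]
G_star, C_star = k_merge([A, B])
print(G_star)   # [[(0, 0, 0), (1, 1, 0)], [(5, 3, 0), (6, 3, 0)]]
print(C_star)   # 10.0
```

Restoring the elementary-set early stop would bound the inner loop by the number of complete and elementary sets ending at each sample instead of the full prefix length.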

Note that, in order to update the cost of including the current sample $s_\theta$ (line 13), the algorithm only checks previous samples in time. It thus requires that the optimal decision up to $s_\theta$ does not depend on any of the samples in the original trajectories that come later in time than $s_\theta$. The following proposition guarantees that this is the case.

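Proposition 2: Let $\mathbf{G}^* = (\mathcal{G}^*_1, \dots, \mathcal{G}^*_Z)$ be the optimal generalized trajectory and let us make the hypothesis that $s_\theta$ and $s_{\theta+1}$ do not belong to the same generalized sample of $\mathbf{G}^*$. Let $\mathbf{G}^*_p = (\mathcal{G}^*_1, \dots, \mathcal{G}^*_{Z_1})$ and $\mathbf{G}^*_f = (\mathcal{G}^*_{Z_1+1}, \dots, \mathcal{G}^*_Z)$, so that $s_\theta \in \mathcal{G}^*_{Z_1}$ and $s_{\theta+1} \in \mathcal{G}^*_{Z_1+1}$. Then, $\mathbf{G}^*_p$ can be derived independently of $\mathbf{G}^*_f$.

Proof: Let $\mathbf{G}$, $\mathbf{G}_p$ and $\mathbf{G}_f$ be any generalized sequences containing raw samples $(s_1, \dots, s_N)$, $(s_1, \dots, s_\theta)$ and $(s_{\theta+1}, \dots, s_N)$, respectively. According to the cost definition, we generally have

\begin{align*}
\min_{\mathbf{G}} C(\mathbf{G}) &\le \min_{\mathbf{G}_p, \mathbf{G}_f} C\big((\mathbf{G}_p, \mathbf{G}_f)\big) \\
&= \min_{\mathbf{G}_p} C(\mathbf{G}_p) + \min_{\mathbf{G}_f} C(\mathbf{G}_f),
\end{align*}

where $(\mathbf{G}_p, \mathbf{G}_f)$ is the concatenation of $\mathbf{G}_p$ and $\mathbf{G}_f$. However, by virtue of the hypothesis and by construction,

\begin{align*}
\min_{\mathbf{G}} C(\mathbf{G}) &= C(\mathbf{G}^*) \\
&= C(\mathbf{G}^*_p) + C(\mathbf{G}^*_f) \\
&= \min_{\mathbf{G}_p} C(\mathbf{G}_p) + \min_{\mathbf{G}_f} C(\mathbf{G}_f)
\end{align*}

so that, to minimize $C(\mathbf{G})$ we only need to minimize $C(\mathbf{G}_p)$ and $C(\mathbf{G}_f)$ independently.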

The above proposition guarantees that the algorithm is exploring all possibilities, and as a result, the cost $C(\mathbf{G}^*)$ returned by k-merge is optimal, i.e., it is the minimum loss of granularity necessary to merge the original trajectories.

Note that k-merge has a very low complexity in practical cases. Let $l(\theta)$ be the number of sets $\{s_{\theta'}, \dots, s_\theta\}$ that are both complete and elementary for a given θ. Then, the number of computations and comparisons of sample generalization costs that are performed in k-merge is $\sum_\theta l(\theta) = |\mathcal{S}| \, \bar{l}$, where $\bar{l}$ is the average value of $l(\theta)$. If $\bar{l} = O(1)$, which happens in most trajectory data where the samples of the input trajectories are intercalated in the time axis, then k-merge runs in a time $O(|\mathcal{S}|)$, i.e., linear in the number of samples.

Fig. 4. Overlapping hiding set structure realizing k^{τ,ε}-anonymity for user i.

E. Single user k^{τ,ε}-anonymity

We implement k^{τ,ε}-anonymity for a generic subscriber i as shown in Fig. 4. We discretize time into intervals of length ε, named epochs. At the beginning of the m-th epoch, we select a set of k−1 users different from i, named a hiding set of i and denoted as $h^i_m$. The hiding set $h^i_m$ provides k-anonymity to subscriber i for a subsequent time window τ+ε. By repeating the hiding set selection for all epochs, τ/ε + 1 subsequent hiding sets of user i overlap at any point in time. Such a structure of overlapping hiding sets assures the following.

First, subscriber i is k-anonymized for any possible knowledge of the attacker. No matter where a time interval of length τ is shifted to along the time dimension, it will always be completely covered by the time window of one hiding set, i.e., a period during which i's trajectory is indistinguishable from those of k−1 other users. As an example, in Fig. 4, the attacker knowledge τ (bottom-right of the plot) is fully enclosed in the time window of $h^i_6$, and the sub-trajectory of i is indistinguishable from those of users in $h^i_6$.

Second, the additional knowledge leaked to the attacker is exactly ε. From the first point above, the adversary cannot tell apart i from the users in the hiding set $h^i_m$ whose time window covers his knowledge τ. However, the adversary can follow the (generalized) trajectories of i and users in $h^i_m$ for the full time window τ+ε. Therefore, the adversary can infer new information about the (generalized) trajectory of i during the time window period that exceeds his original knowledge τ, i.e., ε. E.g., in Fig. 4, the time window of $h^i_6$ spans before and after the attacker knowledge τ, for a total of ε.

The two guarantees above let k^{τ,ε}-anonymity, as defined in Sec. II-C2, be fulfilled for the generic user i. The epoch duration ε maps to the knowledge leakage. The following important remarks are in order.

1. Hiding set selection. The structure of overlapping hiding sets is to be implemented so that the loss of accuracy in the k^{τ,ε}-anonymized trajectory is minimized. Thus, the users in the generic hiding set $h^i_m$ shall be those who, during the time window τ+ε starting at the m-th epoch, have sub-trajectories with minimum k-merge cost with respect to i's.

2. Reuse constraint. The uninformative principle requires alternating the k−1 trajectories used in different hiding sets, as per Sec. II-C2. A simple way to enforce this is limiting the inclusion of any subscriber to at most one hiding set of i.

Fig. 5. Example ofk-pick constraint, withk=3, for useri during them-thhiding set selection. Hereǫ = τ , hence the time windows of hiding sets spantwo epochs. For clarity, space is unidimensional. Figure best viewed in colors.

3. Generalization set.As evidenced by the example in Fig. 4,the configuration of hiding sets changes at every epoch, andτ/ǫ + 1 hiding sets overlap during each epoch. This meansthat a spatiotemporal generalization must be used to merge aset ofχ = 1 + (τ/ǫ + 1)(k − 1) trajectories at each epoch.

4. Epoch duration tradeoff.The epoch durationǫ is aconfigurable system parameter, whose setting gives rise toa tradeoff between knowledge leakage and accuracy of theanonymized data. A lowerǫ reduces knowledge leakage.However, it also increasesχ, which typically entails a moremarked generalization and a higher loss of data granularity.
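The tradeoff between leakage and generalization set size can be made concrete with a short sketch that evaluates $\chi = 1 + (\tau/\epsilon + 1)(k-1)$. The function name is hypothetical, and the sketch assumes that $\tau$ is an integer multiple of $\epsilon$.

```python
def generalization_set_size(k, tau, eps):
    """Number of trajectories merged at each epoch: 1 + (tau/eps + 1)(k - 1)."""
    if tau % eps != 0:
        raise ValueError("this sketch assumes tau is an integer multiple of eps")
    return 1 + (tau // eps + 1) * (k - 1)


# Shrinking the leakage eps for a fixed attacker knowledge tau = 60 minutes.
for eps in (60, 30, 20, 10):
    print(f"eps={eps:2d} min -> chi={generalization_set_size(2, 60, eps)}")
```

For $k = 2$ and $\tau = 60$ minutes, reducing $\epsilon$ from 60 to 10 minutes raises $\chi$ from 3 to 8, i.e., more trajectories have to be generalized together at each epoch.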

F. Multiple user $k^{\tau,\epsilon}$-anonymity

Scaling $k^{\tau,\epsilon}$-anonymity from a single user to all subscribers in a dataset implies that the choice of hiding sets cannot be made independently for every user. Therefore, trajectory similarity and reuse constraint fulfillment are no longer sufficient criteria. In addition to the above, the selection of hiding sets needs to be concerted among all users so as to ensure that the generalized trajectories are correctly intertwined and all subscribers are $k$-anonymized during each time window $\tau+\epsilon$.

An intuitive solution is enforcing full consistency: including a subscriber $i$ into the hiding set of user $i'$ at epoch $m$ makes $i'$ automatically become part of $i$'s hiding set at the same epoch. Formally, $i \in h^{i'}_m \Rightarrow i' \in h^i_m, \ \forall i \neq i', \ \forall m$.

In fact, full consistency is an unnecessarily restrictive condition. It is sufficient that hiding set concertation satisfies a $k$-pick constraint: during the $m$-th epoch, each user $i$ in the dataset has to be picked in the hiding sets of at least $k-1$ other subscribers. Formally, $|\{i' \mid i \in h^{i'}_m\}| \geq k-1, \ \forall i, \ \forall m$. This provides an increased flexibility over all existing approaches, which rely on fully consistent generalization strategies.

The rationale behind the $k$-pick constraint is best illustrated by means of a toy example, in Fig. 5. The figure portrays the spatiotemporal samples of users $i$, $i'$ and $i''$ during epochs $m$ and $m+1$. The sub-trajectory of subscriber $i$ in this time interval is $S_i = (s_{i,1}, s_{i,2}, s_{i,3})$, represented as black squares; likewise for $i'$ (orange triangles) and $i''$ (red circles). Samples denoted by letters belong to other users $a$, $b$, $c$ and $d$, and they are instrumental to our example.

Let us assume that $\epsilon = \tau$ (i.e., hiding sets span an interval $2\tau = 2\epsilon$, or epochs $m$ and $m+1$), and $k = 3$. At the beginning of the $m$-th epoch, for subscriber $i$ (resp., $i'$ and $i''$), one needs to select $k-1 = 2$ other users that constitute the hiding set $h^i_m$ (resp., $h^{i'}_m$ and $h^{i''}_m$). Let us consider $h^i_m = \{a, b\}$, $h^{i'}_m = \{i, c\}$, $h^{i''}_m = \{i, d\}$, which results in the generalized sub-trajectories $G_i$, $G_{i'}$, $G_{i''}$ in Fig. 5. The configuration satisfies the $k$-pick constraint for subscriber $i$, who is picked in $k-1 = 2$ hiding sets, i.e., $h^{i'}_m$ and $h^{i''}_m$. Suppose now that the attacker knows the spatiotemporal samples of $i$'s trajectory during any time interval $\tau$ within the $m$-th and $(m+1)$-th epoch: as these samples are within $G_i$, $G_{i'}$ and $G_{i''}$, then $i$ is 3-anonymized.
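The toy configuration can be checked mechanically. The sketch below is an illustration with hypothetical helper names: it counts how many hiding sets pick a given user and tests full consistency only for pairs of users whose hiding sets are both listed (the example does not specify the hiding sets of $a$, $b$, $c$ and $d$).

```python
def pick_count(hiding_sets, user):
    """Number of other users whose hiding set contains `user`."""
    return sum(
        1
        for owner, members in hiding_sets.items()
        if owner != user and user in members
    )


def satisfies_k_pick(hiding_sets, user, k):
    """k-pick constraint for one user: picked by at least k - 1 others."""
    return pick_count(hiding_sets, user) >= k - 1


def is_fully_consistent(hiding_sets):
    """Full consistency: i in h[i'] implies i' in h[i], checked only for
    users whose own hiding set is listed in `hiding_sets`."""
    for owner, members in hiding_sets.items():
        for other in members:
            if other in hiding_sets and owner not in hiding_sets[other]:
                return False
    return True


# Hiding sets of the toy example at epoch m, with k = 3.
toy = {"i": {"a", "b"}, "i'": {"i", "c"}, "i''": {"i", "d"}}
```

On this input, user $i$ satisfies the $k$-pick constraint for $k = 3$, while the configuration is not fully consistent, since $i \in h^{i'}_m$ but $i' \notin h^i_m$.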

The key consideration is that $i$ is $k$-anonymized at epoch $m$ by $i'$ and $i''$, yet it does not contribute to the anonymization of either $i'$ or $i''$, as $i', i'' \notin h^i_m$. Thus, it is possible to decouple the choice of hiding sets across subscribers, without jeopardizing the privacy guarantees granted by $k$-anonymity. Such a decoupling entails a dramatic increase of flexibility in the choice of hiding sets, as per the following proposition.

Proposition 3: Given a dataset of $U$ trajectories and a fixed value of $k$, the number of hiding set configurations allowed by full consistency is a fraction of that allowed by $k$-pick that vanishes more than exponentially for $U \to \infty$.

Proof: Let us consider a set of $U$ users, where $U$ is a multiple of $k$, since otherwise full consistency cannot even be enforced. Let us build a $k \times U$ matrix, in which the $i$-th column contains $(i, h^i_m)$, where $h^i_m$ is the hiding set for user $i$ at a given epoch $m$. (For simplicity, in this proof, we do not take into account the reuse constraints.)

The solution set under the $k$-pick constraint coincides with the set of normalized Latin rectangles³ of size $k \times U$. Let $K_{k,U}$ be the number of $k \times U$ normalized Latin rectangles, which equals the number of possible solutions for our problem with the $k$-pick constraint. An old result by Erdős and Kaplansky [20] states that, for $U \to \infty$ and $k = O\left((\log U)^{3/2-\epsilon}\right)$,

$$K_{k,U} \sim (U!)^{k-1} \exp\left(-k(k-1)/2\right) \qquad (12)$$

If, instead, we enforce full consistency, then the number of solutions equals the number of different partitions of a size-$U$ set into $U/k$ subsets, all with size $k$. Denoting by $C_{k,U}$ this number, we can compute it as

$$C_{k,U} = \frac{\binom{U}{k}\binom{U-k}{k}\cdots\binom{k}{k}}{(U/k)!} = \frac{U!}{(k!)^{U/k}\,(U/k)!} \qquad (13)$$

Thus, for fixed $k$ and $U \to \infty$,

$$\frac{C_{k,U}}{K_{k,U}} \sim \frac{\exp\left(k(k-1)/2\right)}{(U!)^{k-2}\,(k!)^{U/k}\,(U/k)!}$$

which tends to zero more than exponentially for $U \to \infty$.

For large datasets of hundreds of thousands of trajectories, $k$-pick enables a much richer choice of merging configurations. This plausibly allows for better combinations of the original trajectories, and results in more accurate anonymized data.

³ A $k \times n$ Latin rectangle, $k \leq n$, is a matrix in which all entries are taken from the set $\{1, \dots, n\}$, in such a way that each row and column contains each value at most once. The Latin rectangle is said to be normalized if the first row is the ordered set $(1, \dots, n)$.
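To get a feeling for how quickly the ratio in Proposition 3 vanishes, the following sketch evaluates Eq. (13) exactly and the asymptotic ratio in log scale. For $k = 2$ it also uses the fact that a normalized $2 \times U$ Latin rectangle is a permutation without fixed points, so that $K_{2,U}$ is the number of derangements of $U$ elements. Function names are hypothetical, and the log-ratio relies on the asymptotic expression (12), hence it is only indicative for small $U$.

```python
import math


def full_consistency_count(k, U):
    """Eq. (13): partitions of U users into U/k groups of size k."""
    assert U % k == 0
    return math.factorial(U) // (
        math.factorial(k) ** (U // k) * math.factorial(U // k)
    )


def derangements(U):
    """Exact number of normalized 2 x U Latin rectangles (case k = 2)."""
    d_prev, d = 1, 0  # D(0) = 1, D(1) = 0
    for n in range(2, U + 1):
        d_prev, d = d, (n - 1) * (d + d_prev)
    return d if U >= 1 else 1


def log_ratio_asymptotic(k, U):
    """Natural log of C_{k,U} / K_{k,U}, using Eq. (12) for K_{k,U}."""
    log_c = (
        math.lgamma(U + 1)
        - (U // k) * math.lgamma(k + 1)
        - math.lgamma(U // k + 1)
    )
    log_k = (k - 1) * math.lgamma(U + 1) - k * (k - 1) / 2
    return log_c - log_k
```

With only $U = 4$ users and $k = 2$, full consistency admits 3 configurations against 9 for $k$-pick; the gap widens rapidly as $U$ grows.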

Algorithm 2: kte-hide algorithm pseudocode.
input : Anonymization level k, attacker knowledge τ, leakage ǫ
input : Trajectory dataset D
output: Anonymized trajectory dataset D

 1  foreach e_θ ∈ epochs(D) do
 2      D_f ← filter(e_θ, D);
 3      foreach S_i, S_i′ ∈ D_f, S_i ≠ S_i′ do
 4          Costs[S_i, S_i′] ← k-merge(S_i, S_i′);
 5      Clusters[θ] ← spectralClustering(Costs);
 6      if θ ≥ τ/ǫ + 1 then
 7          foreach c ∈ Clusters[θ] do
 8              Subs ← split(c, Clusters[θ − τ/ǫ : θ − 1]);
 9              foreach cs ∈ Subs[θ] do
10                  gs ← graph(cs);
11                  gsc ← greedyCycle(gs, k);
12                  if ∃ gsc then
13                      foreach S_i ∈ cs do
14                          h^i_{θ−τ/ǫ} ← gsc[S_i];
15                  else
16                      suppression(cs);
17  foreach e_θ ∈ epochs(D) do
18      foreach S_i ∈ D do
19          h ← filter(e_θ, S_i, h^i_{θ−τ/ǫ}, ..., h^i_θ);
20          D ← replace(k-merge(h));

G. Practical $k^{\tau,\epsilon}$-anonymity algorithm

Capitalizing on all previous results, we design kte-hide, an algorithm that achieves $k^{\tau,\epsilon}$-anonymity in datasets of spatiotemporal trajectories. Since even finding the optimal solution to the simpler $k$-anonymity problem is known to be NP-hard [14], we resort here to a heuristic solution.

The algorithm, in Alg. 2, proceeds on a per-epoch basis (line 1), finding, for each epoch $\theta$, a set of $\chi$ users (with $\chi$ defined as in Sec. III-E) that hide each subscriber at low merging cost. An exhaustive search for the set of $\chi$ users would have an excessive cost $O(U^{\chi})$, where $U$ is the number of users in the dataset, and $\chi \geq 3$. Thus, we adopt a computationally efficient approach, by clustering user sub-trajectories based on their pairwise merging cost. Costs are computed via k-merge (lines 2–4), and a standard spectral clustering algorithm groups similar trajectories into the same clusters (line 5). This allows operating on each cluster independently in the following.

Starting from epoch $\tau/\epsilon+1$ (line 6), the algorithm processes each identified cluster at epoch $\theta$ separately (line 7). It splits the current cluster $c$ into subsets, which contain user trajectories that share the same sequence of clusters during the last $\tau/\epsilon$ epochs (line 8).
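The split operation of line 8 can be sketched as a grouping of the users in the current cluster by their sequence of cluster labels over the previous epochs. This is a minimal illustration under assumed data structures (one dictionary of user-to-label assignments per past epoch); the name split_by_history and the data layout are hypothetical, not taken from the original implementation.

```python
from collections import defaultdict


def split_by_history(cluster_users, past_labels):
    """Group the users of the current cluster by their cluster history.

    cluster_users: iterable of user identifiers in the current cluster c.
    past_labels:   list of dicts, one per previous epoch (oldest first),
                   each mapping a user identifier to its cluster label.
    Returns a list of subsets; users in a subset share the same label
    sequence over all the previous epochs considered.
    """
    groups = defaultdict(list)
    for user in cluster_users:
        history = tuple(labels.get(user) for labels in past_labels)
        groups[history].append(user)
    return list(groups.values())
```

For instance, with two past epochs, users that were co-clustered in both end up in the same subset, while a user that diverged in either epoch forms a separate subset.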

Let $c_s$ be any of such subsets: $c_s$ is mapped to a directed graph whose nodes are the users within $c_s$, and there is an edge going from user $j$ to user $i$ if $j$ can be in the hiding set $h^i_{\theta-\tau/\epsilon}$ of $i$ without violating the reuse constraint (line 10). If a $k$-anonymity level is required, $k-1$ directional cycles are then built within the graph, involving all nodes in the graph, in such a way that each node has a different parent in each cycle (line 11). The hiding set $h^i_{\theta-\tau/\epsilon}$ is then obtained as the set of user $i$'s parents in the $k-1$ cycles (lines 13–14).

Such a construction of hiding sets complies with the $k$-pick constraint, since every user $i$ is in the hiding set of $k-1$ other users. It may however happen that no valid $k-1$ cycles can be created within $c_s$: this means that subscribers in $c_s$ share a sub-trajectory that is rare in the dataset, and their number is insufficient to implement $k^{\tau,\epsilon}$-anonymity. In this case, we apply suppression and remove all spatiotemporal samples of such users' sub-trajectories (line 16). Once all hiding sets are determined, the merging is performed, on each epoch and for each user, using k-merge (lines 17–20).
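The link between cycles and the $k$-pick constraint can be illustrated with a deliberately simplified stand-in for the cycle construction. The sketch below is not the greedyCycle routine of Alg. 2: it ignores the reuse constraint and the merging cost (i.e., it assumes a complete graph), and simply arranges users on a ring, taking as parents of each user its $k-1$ predecessors. Each shift $j = 1, \dots, k-1$ gives every node exactly one parent; shifts that share a factor with the number of users decompose into several shorter cycles rather than a single one. The function name is hypothetical.

```python
def ring_hiding_sets(users, k):
    """Assign to each user the k - 1 users preceding it on a ring.

    Returns a dict user -> list of parents (its hiding set), or None when
    fewer than k users are available, in which case the caller would fall
    back to suppression.
    """
    n = len(users)
    if n < k:
        return None
    return {
        users[p]: [users[(p - j) % n] for j in range(1, k)]
        for p in range(n)
    }
```

By symmetry, every user appears in the hiding sets of exactly $k-1$ other users, which is what the $k$-pick constraint requires; a graph restricted by the reuse constraint may admit no such assignment, which is the case handled by suppression.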

Overall, the heuristic algorithm above guarantees that overlapping hiding sets that satisfy the reuse constraint (Sec. III-E) are selected for all users. It also ensures that such a choice of hiding sets fulfils the $k$-pick requirement (Sec. III-F). Together, these conditions realize $k^{\tau,\epsilon}$-anonymity of the trajectory data.

The complexity of kte-hide is as follows. Let $U$ be the number of users, $\Theta$ be the number of epochs and $N$ be the average number of samples per user per epoch, so that $N_{tot} = \Theta U N$ is the total number of samples in the dataset. Then: (i) lines 2–4 perform k-merge on two input trajectories $\Theta U^2$ times, each of them with a complexity $O(N)$, for a total complexity of $O(N_{tot} U)$; (ii) spectral clustering (line 5) can be implemented with complexity $O(\Theta U^2)$ using KASP [21]; (iii) the complexity of lines 17–20, performing k-merge on $\chi$ input trajectories $\Theta U$ times, is $O(N_{tot} \chi)$. All other subroutines of kte-hide have a much smaller complexity.

IV. PERFORMANCE EVALUATION

We evaluate our anonymization solutions with five real-world datasets of mobile subscriber trajectories, introduced in Sec. IV-A. A comparative evaluation of k-merge is in Sec. IV-B, while the results of $k^{\tau,\epsilon}$-anonymization via kte-hide are presented in Sec. IV-C.

A. Reference datasets

Our datasets consist of user trajectories extracted from call detail records (CDR) released by Orange within their D4D Challenges [22], and by the University of Minnesota [23]. Three datasets, denoted as abi, dak and shn, describe the spatiotemporal trajectories of tens of thousands of mobile subscribers in urban regions, while the other two, civ and sen hereinafter, are nationwide. In all datasets, user positions map to the latitude and longitude of the current base station (BS) they are associated to. The main features of the datasets are listed in Tab. I, revealing the heterogeneity of the scenarios.

In order to ensure that all datasets yield a minimum level of detail in the trajectory of each tracked subscriber, we had to preprocess the abi and civ datasets. Specifically, we only retained those users whose trajectories have at least one spatiotemporal sample on every day in a specific two-week period. No filtering was needed for the dak and sen datasets, which already contain users who are active for more than 75% of a 2-week timespan, and shn, whose users have even higher sampling rates.
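The filtering rule applied to abi and civ can be sketched as follows. This is an illustration only: the function name, the representation of a trajectory as a list of datetime objects and the choice of the first day are assumptions, not details given in the text.

```python
from datetime import date, datetime, timedelta


def active_every_day(timestamps, first_day, n_days=14):
    """True if the user has at least one sample on each of the n_days
    consecutive days starting at first_day."""
    required = {first_day + timedelta(days=d) for d in range(n_days)}
    observed = {t.date() for t in timestamps}
    return required <= observed  # subset test: no required day is missing


def filter_users(trajectories, first_day, n_days=14):
    """Keep only users (dict user -> list of datetimes) passing the rule."""
    return {
        user: ts
        for user, ts in trajectories.items()
        if active_every_day(ts, first_day, n_days)
    }
```

A user with a single silent day in the two-week window is discarded, regardless of how many samples are recorded on the other days.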

TABLE I
FEATURES OF REFERENCE MOBILE TRAFFIC DATASETS.

Dataset   Surface    BS      BS/Km2    Users     Density       Samples        Timespan
          [Km2]                                  [user/Km2]    [per user/h]   [days]
abi         2,731      400   0.14       29,191   10.68         0.90           14
dak         1,024      457   0.44       71,146   69.47         0.74           14
shn         3,329    2,961   0.89       50,000   15.01         1.00            1
civ       322,463    1,238   0.0038     82,728    0.26         0.75           14
sen       196,712    1,666   0.0085    286,926    1.45         0.45           14

TABLE II
COMPARATIVE PERFORMANCE EVALUATION OF K-MERGE

Dataset  k | k-merge        | Static generalization [success %] | W4M                               | GLOVE
           | Time   Space   | 2h-4Km   4h-10Km   8h-20Km        | Deleted  Created  Time    Space   | Time   Space
           | [min]  [Km]    |                                   | [%]      [%]      [min]   [Km]    | [min]  [Km]
abi      2 |  51    0.624   | 27.2     56.7      80.3           |  9.6     22.0      57     1.166   | 114    2.626
abi      5 | 228    3.423   |  0.7     11.0      40.5           | 31.9     31.2     185     3.809   | 292    3.740
abi      8 | 349    5.720   |  0.1      5.1      22.6           | 23.9     36.7     198     6.163   |  -      -
dak      2 |  47    0.701   | 43.2     68.7      93.3           |  5.9     11.4      39     1.466   | 116    2.498
dak      5 | 220    5.286   |  2.2     14.0      67.0           | 20.3     21.2     172     5.807   | 294    3.192
dak      8 | 377    7.794   |  0.1      8.6      50.7           | 22.0     18.6     189     8.477   |  -      -

We discretized user positions on a 100-m regular grid, which represents the finest spatial granularity we consider⁴. Samples are timestamped with a precision of one minute. This is the granularity granted in the abi and civ datasets. The dak and sen datasets feature a temporal granularity of 10 minutes: in order to have comparable datasets, we added a random uniform noise over a ten-minute timespan to each sample, so as to artificially refine the time granularity of the data to one minute as well. In the case of the shn dataset, the precision is one second, and we used a one-minute binning to bring the data to the same format.
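The three normalization steps described above (spatial snapping, temporal jitter, temporal binning) can be sketched as below. The sketch assumes that positions have already been projected to planar coordinates in metres and that a 10-minute timestamp denotes the start of its slot; both are assumptions made for illustration, as are the function names.

```python
import random


def snap_to_grid(x_m, y_m, cell_m=100):
    """Map planar coordinates (metres) to the indices of a regular grid cell."""
    return (int(x_m // cell_m), int(y_m // cell_m))


def jitter_to_minute(t_min, span_min=10, rng=random):
    """Refine a coarse timestamp (minutes, start of a span_min slot) to
    one-minute precision by adding uniform integer noise in [0, span_min)."""
    return t_min + rng.randrange(span_min)


def bin_to_minute(t_sec):
    """Reduce a one-second timestamp to one-minute precision."""
    return t_sec // 60
```

For reproducible experiments, a seeded generator such as random.Random(0) can be passed as rng.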

B. Comparative evaluation of k-merge

Since no previous solution for $k^{\tau,\epsilon}$-anonymity exists, we are forced to compare our algorithms to previous techniques in terms of simpler $k$-anonymity. Interestingly, this allows validating our proposed approach for merging spatiotemporal trajectories via the k-merge algorithm.

We thus run k-merge on 100 random $k$-tuples of mobile users from the reference datasets, for different values of $k$, and we record the spatiotemporal granularity retained by the resulting generalized trajectories. We compare our results against those obtained by the only three approaches proposed in the literature for the $k$-anonymization of trajectories along both spatial and temporal dimensions.

The first is static generalization [8], [9], which consists in a homogeneous reduction of data granularity, decided arbitrarily and imposed on all user trajectories. Static generalization is a trial-and-error process, and it does not guarantee $k$-anonymity of all users. The second benchmark solution is Wait for Me (W4M) [36]. Intended for regularly sampled (e.g., GPS) trajectories, W4M performs the minimum spatiotemporal translation needed to push all the trajectories within the same cylindrical volume. It allows the creation of new synthetic samples, and it is thus not fully compliant with PPDP principles in Sec. II-A. The latter operation is leveraged to improve the matching among trajectories in a cluster, and assumes that mobile objects (i.e., subscribers in our case) effectuate linear constant-speed movements between spatiotemporal samples. We use W4M with linear spatiotemporal distance (W4M-L), i.e., the version intended for large databases such as those we consider⁵, and configure it with the settings suggested in [36]. The third approach is GLOVE [10], which relies on a heuristic measure of anonymizability to assess the similarity of spatiotemporal trajectories. This measure is fed to a greedy algorithm to achieve $k$-anonymity with limited loss of granularity and without introducing fictitious data. However, unlike k-merge, GLOVE does not provide an optimal solution, and is computationally expensive.

⁴ At 100-m spatial granularity, each grid cell contains at most one antenna from the original dataset: the process does not cause any loss in data accuracy.

⁵ Implementation at http://kdd.isti.cnr.it/W4M/.

The results of our comparative evaluation are summarized in Tab. II, for the abi and dak datasets, when varying the number $k$ of trajectories merged together. Similar results were obtained for the other datasets, and are omitted due to space limitations. We immediately note how static generalization is an ineffective approach: the percentage of successfully merged $k$-tuples is well below 100%, even when dramatically reducing the data granularity to 8 hours in time and 20 km in space. Instead, k-merge, W4M and GLOVE can merge all of the $k$-tuples, while retaining a good level of accuracy in the data. We can directly compare the granularity in time (min) and space (km) retained by k-merge, W4M and GLOVE in merging groups of $k$ trajectories: the spatiotemporal accuracy is comparable in all cases. However, it is important to note that W4M attains this result by deleting and creating a significant amount of samples: in the end, only 40-70% of the original samples are maintained in the generalized data. Conversely, all of the generalized samples created by k-merge reflect the actual real-world data. Also, k-merge obtains a level of precision that is always higher than that of GLOVE, and scales better: indeed, the complexity of GLOVE did not allow computing a solution when $k = 8$.

Overall, the results uphold k-merge as the current state-of-the-art solution to generalize sparse spatiotemporal trajectories while obeying PPDP principles and minimizing accuracy loss.
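To make the static generalization baseline concrete, the sketch below coarsens every sample to a fixed time bin and grid cell and declares a $k$-tuple successfully merged when all its generalized trajectories coincide. The success criterion, the sample format (minute, x in metres, y in metres) and the function names are assumptions made for illustration; the text does not detail how success was measured.

```python
def generalize(trajectory, dt_min, dx_m):
    """Coarsen samples (t_min, x_m, y_m) to (time bin, cell x, cell y)."""
    return frozenset(
        (t // dt_min, x // dx_m, y // dx_m) for (t, x, y) in trajectory
    )


def static_merge_succeeds(trajectories, dt_min, dx_m):
    """True if all trajectories become identical at the given granularity."""
    return len({generalize(tr, dt_min, dx_m) for tr in trajectories}) == 1


# Two hypothetical users, samples as (minute, x in metres, y in metres).
u1 = [(10, 120, 80), (200, 5100, 300)]
u2 = [(50, 900, 700), (230, 4200, 100)]
```

The pair above is merged at a granularity of 2 hours and 4 km, but not at 30 minutes and 1 km, which mirrors the trial-and-error nature of the approach.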

C. Performance evaluation of kte-hide

We run kte-hide on our reference datasets of mobile subscriber trajectories, so that they are $k^{\tau,\epsilon}$-anonymized. As the anonymized data are robust to probabilistic attacks by design, we focus our evaluation on the cost of the anonymization, i.e., the loss of granularity. All results refer to the case of $2^{\tau,\epsilon}$-anonymization, with $\epsilon = \tau$.

1) Citywide datasets: Fig. 6 portrays the mean, median and first/third quartiles of the sample granularity in the $k^{\tau,\epsilon}$-anonymized citywide datasets abi, dak and shn. The plots show how results vary when the adversary knowledge $\tau$ ranges from 10 minutes to 4 hours⁶. They refer to the anonymized data granularity in space⁷, in Fig. 6a-c, and time, in Fig. 6d-f.

Fig. 6. Spatial (a,b,c) and temporal (d,e,f) granularity versus the adversary knowledge $\tau$ in the citywide reference datasets. [Panels: (a) abi, (b) dak, (c) shn, space granularity [Km]; (d) abi, (e) dak, (f) shn, time granularity [min]. Each panel shows the mean, median and 25-75 percentile range versus $\tau$.]

Fig. 7. Spatial (a,b) and temporal (c,d) granularity versus $\tau$ in the nationwide reference datasets. [Panels: (a) civ, (b) sen, space granularity [Km]; (c) civ, (d) sen, time granularity [min]. Each panel shows the mean, median and 25-75 percentile range versus $\tau$.]

Fig. 8. Suppressed samples versus $\tau$. [Percentage of suppressed samples for the abi, dak, shn, civ and sen datasets.]

We remark how the $k^{\tau,\epsilon}$-anonymized datasets retain significant levels of accuracy, with a median granularity in the order of 1-3 km in space and below 45 minutes in time. These levels of precision are largely sufficient for most analyses on mobile subscriber activities, as discussed in, e.g., [24]. The temporal granularity is negatively affected by an increasing adversary knowledge $\tau$, which is expected. Interestingly, however, the spatial granularity is only marginally impacted by $\tau$: protecting the data from a more knowledgeable attacker does not have a significant cost in terms of spatial accuracy.

2) Nationwide datasets: Fig. 7 shows equivalent results for the nationwide datasets civ and sen. The evolution of temporal granularity versus $\tau$, in Fig. 7c-d, is consistent with citywide scenarios. Differences emerge in terms of spatial granularity: in the civ case (Fig. 7a) a reversed trend emerges, as accuracy grows along with the attacker knowledge. This counterintuitive result is explained by the thin user presence in the civ dataset: as per Tab. I, civ has a density of subscribers per Km2 that is one or two orders of magnitude lower than those in our other reference datasets. Such a geographical sparsity makes it difficult to find individuals with similar spatial trajectories: increasing $\tau$ has then the effect of enlarging the set of candidate trajectories for merging at each epoch, with a positive influence on the accuracy in the generalized data.

These considerations are confirmed by the results with the sen dataset (Fig. 7b). As per Tab. I, this dataset features a subscriber density that is about one order of magnitude higher than that of civ, but around one order of magnitude lower than those of abi, dak and shn. Coherently, the spatial granularity trend falls in between those observed for such datasets, and it is not positively or negatively impacted by the attacker knowledge.

⁶ The limited temporal span of the shn data prevents us from testing attacks with knowledge $\tau$ higher than one hour. Indeed, a $\tau$ too close to the full dataset duration implies that the opponent has an a-priori knowledge of the victim's trajectory that is comparable to that contained in the data, making attempts at countering a probabilistic attack futile.

⁷ The spatial granularity in Fig. 6 is expressed as the sum of spans along the Cartesian axes. For instance, 1 km maps to, e.g., a square of side 500 m.

More generally, the results in Fig. 7 demonstrate that kte-hide can scale to large-scale real-world datasets. The absolute performance is good, as the $k^{\tau,\epsilon}$-anonymized data retains substantial precision: the median levels of granularity in space and time are comparable to those achieved in citywide datasets. Finally, we remark that, in all cases, the amount of samples suppressed by kte-hide is in the 1%–7% range.

3) Sample suppression: The amount of samples suppressed by kte-hide in the $k^{\tau,\epsilon}$-anonymization process is portrayed in Fig. 8. We note that resorting to suppression becomes more frequent as the adversary knowledge increases. However, even when the opponent is capable of tracking a user during four consecutive hours, the percentage of suppressed samples remains low, typically well below 10%. Moreover, the trend in the long-timespan datasets is clearly sublinear, suggesting that suppression does not become prevalent with higher $\tau$. Results are fairly consistent across citywide datasets⁸. Nationwide datasets are also aligned, and yield even lower suppression rates, at around 2%. This difference is explained by the fact that a larger number of users allows for a more efficient spectral clustering in kte-hide.

4) Disaggregation over time: As an intriguing concluding remark, Fig. 9 reveals a clear circadian rhythm in the granularity of $k^{\tau,\epsilon}$-anonymized data, as well as in the percentage of suppressed samples. The plots refer to one sample week in the abi and dak datasets, when $\tau = 30$ min, but consistent results were observed in all of our reference datasets. Specifically, the mean spatial granularity, in Fig. 9a, is much finer during daytime, when subscribers are more active and the volume of trajectories is larger: here, it is easier to hide a user into the crowd. Overnight displacements are instead harder to anonymize, since subscribers are limited in number and they tend to have diverse patterns. This is also corroborated by the significantly higher suppression of samples between midnight and early morning, in Fig. 9c. Time granularity, in Fig. 9b, is less subject to day-night oscillations: the slightly higher accuracy recorded at night is an artifact of the important relative suppression of samples at those times.

⁸ The spurious point at $\tau = 1$ hour in shn is due to the fact that the time interval $\tau + \epsilon$ is already very large, at around the same order of magnitude of the full dataset duration.

Fig. 9. Time series of spatiotemporal accuracy (a,b) and suppression usage (c) for one sample week in the abi dataset. [Panels, Monday to Sunday: (a) abi, mean space granularity [Km]; (b) abi, mean time granularity [min]; (c) abi, suppressed samples [%].]

5) Summary: Overall, our results show that kte-hide attains $k^{\tau,\epsilon}$-anonymity of real-world datasets of mobile traffic, while maintaining a remarkable level of accuracy in the data. Interestingly, its performance is better when most needed, at daytime, when the majority of human activities take place.

V. RELATED WORK

Protection of individual mobility data has attracted significant attention in the past decade. However, attack models and privacy criteria are very specific to the different data collection contexts. Hence, solutions developed for a specific type of movement data are typically not reusable in other environments.

For instance, a vast amount of works have targeted user privacy in location-based services (LBS). There, the goal is ensuring that single georeferenced queries are not uniquely identifiable [25]. This is equivalent to anonymizing each spatiotemporal sample independently, and a whole other problem from protecting full trajectories. Even when considering sequences of queries, the LBS milieu allows pseudo-identifier replacement, and most solutions rely on this approach, see, e.g., [26], [27]. If applied to spatiotemporal trajectories, these techniques would seriously and irreversibly break up trajectories in time, disrupting data utility.

Another popular context is that of spatial trajectories that do not have a temporal dimension. The problem of anonymizing datasets of spatial trajectories has been thoroughly explored in data mining, and many practical solutions based on generalization have been proposed, see, e.g., [28]–[31]. Such solutions are not compatible with or easily extended to the more complex spatiotemporal data we consider.

Some works explicitly target privacy preservation of spatiotemporal trajectories. However, the precise context they refer to makes again all the difference. First, most such solutions consider scenarios where user movements are sampled at regular time intervals that are identical for all individuals [32], [33], or where the number of samples per device is very small [34]. These assumptions hold, e.g., for GPS logs or RFID records, but not for trajectories recorded by mobile operators: the latter are irregularly sampled, temporally sparse, and cover long time periods, which results in at least hundreds of samples per user. Second, many of the approaches above disrupt data utility, by, e.g., trimming trajectories [35], or violate the principles of PPDP, by, e.g., perturbing or permuting the trajectories [32], [33], or creating fictitious samples [36]. Third, all previous studies aim at attaining $k$-anonymity of spatiotemporal trajectories, i.e., they protect the data against record linkage; this includes recent work specifically tailored to mobile subscriber trajectory datasets [10]. As explained in Sec. II, $k$-anonymity is only a partial countermeasure to attacks on spatiotemporal trajectories.

Provable privacy guarantees are instead offered by differential privacy, which requires that the presence of a user's data in the published dataset should not change substantially the output of the analysis, and thus formally bounds the privacy risk of that user [37]. There have been attempts at using differential privacy with mobility data. Specifically, it has been successfully used in the LBS context, when publishing aggregate information about the location of a large number of users, see, e.g., [38]. However, the requirements of these solutions already become too strong in the case of individual LBS access data [39]. To address this problem, a variant of differential privacy, named geo-indistinguishability, has been introduced: it requires that any two locations become more indistinguishable as they are geographically closer [40]. Practical mechanisms achieve geo-indistinguishability, see, e.g., [39], [40]. However, all refer to the anonymization of single LBS queries: as of today, differential privacy and its derived definitions still appear impractical in the context of spatiotemporal trajectories.

VI. CONCLUSIONS

In this paper, we presented a first PPDP solution to probabilistic and record linkage attacks against mobile subscriber trajectory data. To that end, we introduced a novel privacy model, $k^{\tau,\epsilon}$-anonymity, which generalizes the popular criterion of $k$-anonymity. Our proposed algorithm, kte-hide, implements $k^{\tau,\epsilon}$-anonymity in real-world datasets, while retaining substantial spatiotemporal accuracy in the anonymized data.

REFERENCES

[1] K. Zheng, Z. Yang, K. Zhang, P. Chatzimisios, K. Yang, W. Xiang, “Big data-driven optimization for mobile networks toward 5G,” IEEE Network, 30(1), 2016.

[2] M. Leconte, G. Paschos, L. Gkatzikis, M. Draief, S. Vassilaras, S. Chouvardas, “Placing Dynamic Content in Caches with Small Population,” IEEE INFOCOM, 2016.

[3] Telefonica Smart Steps, http://dynamicinsights.telefonica.com/smart-steps/.

[4] Orange Flux Vision, http://www.orange-business.com/fr/produits/flux-vision.

[5] D. Naboulsi, M. Fiore, R. Stanica, S. Ribot, “Large-scale Mobile Traffic Analysis: a Survey,” IEEE Communications Surveys and Tutorials, 18(1), 2016.

[6] M. T. Asif, N. Mitrovic, J. Dauwels, P. Jaillet, “Matrix and Tensor Based Methods for Missing Data Estimation in Large Traffic Networks,” IEEE Transactions on ITS, 17(7), 2016.

[7] G. Czibula, A. M. Guran, I. G. Czibula, G. S. Cojocar, “IPA - An intelligent personal assistant agent for task performance support,” IEEE ICCP, 2009.

[8] H. Zang, J. Bolot, “Anonymization of location data does not work: A large-scale measurement study,” ACM MobiCom, 2011.

[9] Y. de Montjoye, C.A. Hidalgo, M. Verleysen, V. Blondel, “Unique in the Crowd: The privacy bounds of human mobility,” Nature Scientific Reports, 3(1376), 2013.

[10] M. Gramaglia, M. Fiore, “Hiding Mobile Traffic Fingerprints with GLOVE,” ACM CoNEXT, 2015.

[11] A. Cecaj, M. Mamei, N. Bicocchi, “Re-identification of Anonymized CDR datasets Using Social Network Data,” IEEE PerCom Workshops, 2014.

[12] C. Riederer, Y. Kim, A. Chaintreau, N. Korula, S. Lattanzi, “Linking Users Across Domains with Location Data: Theory and Validation,” ACM WWW, 2016.

[13] J. Mayer, P. Mutchler, J.C. Mitchell, “Evaluating the privacy properties of telephone metadata,” PNAS, 113(20), 2016.

[14] B.C.M. Fung, K. Wang, R. Chen, P.S. Yu, “Privacy-preserving data publishing: A survey of recent developments,” ACM Computing Surveys, 42(4), 2010.

[15] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 2002.

[16] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” ACM Transactions on Knowledge Discovery from Data, 1(1):3, 2007.

[17] R. Shokri, G. Theodorakopoulos, J.-Y. Le Boudec, J.-P. Hubaux, “Quantifying Location Privacy,” IEEE SP, 2011.

[18] M. Srivatsa, M. Hicks, “Deanonymizing Mobility Traces: Using Social Networks as a Side-Channel,” ACM CCS, 2012.

[19] M. Terrovitis, N. Mamoulis, P. Kalnis, “Privacy-preserving Anonymization of Set-valued Data,” VLDB, 2008.

[20] P. Erdős, I. Kaplansky, “The asymptotic number of Latin rectangles,” Amer. J. Math., 68:230-236, 1946.

[21] D. Yan, L. Huang, M.I. Jordan, “Fast approximate spectral clustering,” ACM SIGKDD, 2009.

[22] Orange D4D Challenge, http://www.d4d.orange.com/en/.

[23] D. Zhang, J. Huang, Y. Li, F. Zhang, C. Xu, T. He, “Exploring Human Mobility with Multi-Source Data at Extremely Large Metropolitan Scales,” ACM MobiCom, 2014.

[24] M. Coscia, S. Rinzivillo, F. Giannotti, D. Pedreschi, “Optimal Spatial Resolution for the Analysis of Human Mobility,” IEEE/ACM ASONAM, 2012.

[25] M. Gruteser, D. Grunwald, “Anonymous Usage of Location-Based Services Through Spatial and Temporal Cloaking,” ACM MobiSys, 2003.

[26] J. Meyerowitz, R.R. Choudhury, “Hiding stars with fireworks: location privacy through camouflage,” ACM MobiCom, 2009.

[27] B. Hoh, M. Gruteser, H. Xiong, A. Alrabady, “Preserving privacy in GPS traces via uncertainty-aware path cloaking,” ACM CCS, 2007.

[28] A. Monreale, G. Andrienko, N. Andrienko, F. Giannotti, D. Pedreschi, S. Rinzivillo, S. Wrobel, “Movement Data Anonymity through Generalization,” Transactions on Data Privacy, 3(2), 2010.

[29] M.E. Nergiz, M. Atzori, Y. Saygin, B. Guc, “Towards Trajectory Anonymization: a Generalization-Based Approach,” Transactions on Data Privacy, 2(1), 2009.

[30] R. Chen, B.C.M. Fung, B.C. Desai, N.M. Sossou, “Differentially private transit data publication: a case study on the Montreal transportation system,” ACM KDD, 2012.

[31] G. Poulis, S. Skiadopoulos, G. Loukides, A. Gkoulalas-Divanis, “Apriori-based algorithms for k^m-anonymizing trajectory data,” Transactions on Data Privacy, 7(2), 2014.

[32] J. Domingo-Ferrer, R. Trujillo-Rasua, “Microaggregation- and permutation-based anonymization of movement data,” Information Science, 208, 2012.

[33] O. Abul, F. Bonchi, M. Nanni, “Never walk alone: Uncertainty for anonymity in moving objects databases,” IEEE ICDE, 2008.

[34] B.C.M. Fung, M. Cao, B.C. Desai, H. Xu, “Privacy protection for RFID data,” ACM SAC, 2009.

[35] Y. Song, D. Dahlmeier, S. Bressan, “Not So Unique in the Crowd: a Simple and Effective Algorithm for Anonymizing Location Data,” PIR, 2014.

[36] O. Abul, F. Bonchi, M. Nanni, “Anonymization of moving objects databases by clustering and perturbation,” Information Systems, 35(8), 2010.

[37] C. Dwork, “Differential privacy,” ICALP, 2006.

[38] R. Chen, G. Acs, C. Castelluccia, “Differentially private sequential data publication via variable-length n-grams,” ACM CCS, 2012.

[39] K. Chatzikokolakis, C. Palamidessi, M. Stronati, “A Predictive Differentially-Private Mechanism for Mobility Traces,” PETS, 2014.

[40] M.E. Andres, N.E. Bordenabe, K. Chatzikokolakis, C. Palamidessi, “Geo-indistinguishability: differential privacy for location-based systems,” ACM CCS, 2013.
