
$k^{\tau,\epsilon}$-anonymity: Towards Privacy-Preserving Publishing of Spatiotemporal Trajectory Data

Marco Gramaglia∗, Marco Fiore†, Alberto Tarable†, Albert Banchs∗

∗ IMDEA Networks Institute & Universidad Carlos III de Madrid, Avda. del Mar Mediterraneo, 22, 28918 Leganes (Madrid), Spain. Email: [email protected]
† CNR-IEIIT, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy. Email: [email protected]

arXiv:1701.02243v1 [cs.CY] 9 Jan 2017

Abstract: Mobile network operators can track subscribers via passive or active monitoring of device locations. The recorded trajectories offer an unprecedented outlook on the activities of large user populations, which enables developing new networking solutions and services, and scaling up studies across research disciplines. Yet, the disclosure of individual trajectories raises significant privacy concerns: thus, these data are often protected by restrictive non-disclosure agreements that limit their availability and impede potential usages. In this paper, we contribute to the development of technical solutions to the problem of privacy-preserving publishing of spatiotemporal trajectories of mobile subscribers. We propose an algorithm that generalizes the data so that they satisfy $k^{\tau,\epsilon}$-anonymity, an original privacy criterion that thwarts attacks on trajectories. Evaluations with real-world datasets demonstrate that our algorithm attains its objective while retaining a substantial level of accuracy in the data. Our work is a step forward in the direction of open, privacy-preserving datasets of spatiotemporal trajectories.

I. INTRODUCTION

Subscriber trajectory datasets collected by network operators are logs of timestamped, georeferenced events associated to the communication activities of individuals. The analysis of these datasets allows inferring fine-grained information about the movements, habits and undertakings of vast user populations. This has many different applications, encompassing both business and research. For instance, trajectory data can be used to devise novel data-driven network optimization techniques [1] or support content delivery operations at the network edge [2]. They can also be monetized via added-value services such as transport analytics [3] or location-based marketing [4]. Additionally, the relevance of massive movement data from mobile subscribers is critical in research disciplines such as physics, sociology or epidemiology [5].

The importance of trajectory data has also been recognized in the design of future 5G networks, with a thrust towards the introduction of data interfaces among network operators and over-the-top (OTT) providers to give them online access to this (and other) data. OTTs can leverage such interfaces to automatically retrieve the data and process them on the fly, thus enabling new applications such as intelligent transportation [6] or assisted-life services [7].

All these use cases stem from the disclosure of trajectory datasets to third parties. However, the open release of such data is still largely withheld, which hinders potential usages and applications. A major barrier in this sense is privacy: data circulation exposes the data to re-identification attacks, and knowledge of the movement patterns of de-anonymized individuals may reveal sensitive information about them.

This calls for anonymization techniques. The common practice operators adhere to is replacing personal identifiers (e.g., name, phone number, IMSI) with pseudo-identifiers (i.e., random or non-reversible hash values). Whether this is a sufficient measure is often called into question, especially in relation to the possibility of tracking user movements. What is sure is that pseudo-identifiers have been repeatedly proven not to protect against user trajectory uniqueness, i.e., the fact that mobile subscribers have distinctive travel patterns that make them univocally recognizable even in very large populations [8]–[10]. Uniqueness is not a privacy threat per se, but it is a vulnerability that can lead to re-identification. Examples are brought forth by recent attempts at cross-correlating mobile operator-collected trajectories with georeferenced check-ins of Flickr and Twitter users [11], with credit card records [12] or with Yelp, Google Places and Facebook metadata [13].

More dependable anonymization solutions are needed. However, the strategies devised to date for relational databases, location-based services, or regularly sampled (e.g., GPS) mobility do not suit the irregular sampling, time sparsity, and long duration of trajectories collected by mobile operators. Moreover, current privacy criteria, including k-anonymity and differential privacy, do not provide sufficient protection or are impractical in this context. See Sec. V for a detailed discussion.

In this paper, we put forward several contributions towards privacy-preserving data publishing (PPDP) of mobile subscriber trajectories. Our contributions are as follows: (i) we outline attacks that are especially relevant to datasets of spatiotemporal trajectories; (ii) we introduce $k^{\tau,\epsilon}$-anonymity, a novel privacy criterion that effectively copes with the most threatening attacks above; (iii) we develop k-merge, an algorithm that solves a fundamental problem in the anonymization of spatiotemporal trajectories, i.e., effective generalization; (iv) we implement kte-hide, a practical solution based on k-merge that attains $k^{\tau,\epsilon}$-anonymity in spatiotemporal trajectory data; (v) we evaluate our approach on real-world datasets, showing that it achieves its objectives while retaining a substantial level of accuracy in the anonymized data.

II. REQUIREMENTS AND MODELS

We first present the requirements of PPDP, in Sec. II-A, and formalize the specific attacker model we consider, in Sec. II-B. We then propose a consistent privacy model, in Sec. II-C.

A. PPDP requirements

PPDP is defined as the development of methods for the publication of information that allows meaningful knowledge discovery, and yet preserves the privacy of monitored subjects [14]. The requisites of PPDP are similar for all types of databases, including our specific case, i.e., datasets of spatiotemporal trajectories. They are as follows.

1. The non-expert data publisher. Mining of the data is performed by the data recipient, and not by the data publisher. The only task of the data publisher is to anonymize the data for publication.
2. Publication of data, and not of data mining results. The aim of PPDP is producing privacy-preserving datasets, and not anonymized datasets of classifiers, association rules, or aggregate statistics. This sets PPDP apart from privacy-preserving data mining (PPDM), where the final usage of the data is known at dataset compilation time.
3. Truthfulness at the record level. Each record of the published database must correspond to a real-world subject. Moreover, all information on a subject must map to actual activities or features of the subject. This prevents fictitious data from introducing unpredictable biases in the anonymized datasets.

Our privacy model will obey the principles above. We stress that they impose that the privacy model must be agnostic to data usage (points 1 and 2), and that it cannot rely on randomized, perturbed, permuted and synthetic data (point 3).

B. Attacker model

Unlike PPDP requirements, the attacker model is necessarily specific to the type of data we consider, and it is characterized by the knowledge and goal of the adversary. The former describes the information the opponent possesses, while the latter represents his privacy-threatening objective.

1) Attacker knowledge: In trajectory datasets, each data record is a sequence of spatiotemporal samples. We assume an attacker who can track a target subscriber continuously during any amount of time $\tau$. The adversary knowledge consists then in all spatiotemporal samples in the victim's trajectory over a continuous¹ time interval of duration $\tau$.

¹ Non-continuous tracking in the attacker model is an interesting but very challenging open problem. A mitigative solution realisable with our model is considering a $\tau$ that covers all disjoint tracking intervals.

2) Attacker goal: Attacks against user privacy in published data can have different objectives, and a comprehensive classification is provided in [14]. Two classes of attacks are especially relevant in the context of mobile subscriber trajectory data. Both exploit the uniqueness of movement patterns that, as mentioned in Sec. I, characterizes trajectory data.

• Record linkage attacks. These attacks aim at univocally distinguishing an individual in the database. A successful record linkage enables cross-database correlation, which may ultimately unveil the identity of the user. Record linkage attacks on mobile traffic data have been repeatedly and successfully demonstrated [8]–[10]. As mentioned in Sec. I, they have also been used for subsequent cross-database correlations [11]–[13].
• Probabilistic attacks. These attacks let an adversary with partial information about an individual enlarge his knowledge on that individual by accessing the database. They are especially relevant to spatiotemporal trajectories, as shown by seminal works that first unveiled the anonymization issues of mobile traffic datasets [8], [9]. Let us imagine a scenario where an adversary knows a small set of spatiotemporal points in the trajectory of a subscriber (because, e.g., he met the target individual there). A successful probabilistic attack would reveal the complete movements of the subscriber to the attacker, who could then use them to infer sensitive information about the victim, such as home/work locations, daily routines, or visits to healthcare structures.

Our privacy model will address both classes of attacks above, led by an adversary with knowledge described in Sec. II-B1.

C. Privacy model

Our privacy model is designed following the PPDP requirements and attacker model presented before. We start by considering suitable privacy criteria against record linkage and probabilistic attacks, in Sec. II-C1 and Sec. II-C2, respectively. We then show how the first criterion is in fact a specialization of the second, in Sec. II-C3, which allows us to focus on a single unifying privacy model. Finally, we present the elementary techniques that we employ to implement the target privacy criterion, in Sec. II-C4.

1) k-anonymity: The k-anonymity criterion realizes the indistinguishability principle, by requiring that each record in a database be indistinguishable from at least $k-1$ other records in the same database [15]. In our case, this maps to ensuring that each subscriber is hidden in a crowd of k users whose trajectories cannot be told apart. The popularity of k-anonymity for PPDP has led to indiscriminate use beyond its scope, and subsequent controversy on the privacy guarantees it can provide. E.g., k-anonymity has been proven ineffective against attacks aiming at attribute linkage (including exploits of insufficient side-information diversity), at localizing users, or at disclosing their presence and meetings [16]–[18].

However, k-anonymity remains a legitimate criterion against record linkage attacks on any kind of database [14]. Therefore, this privacy model protects trajectory data from the first type of attack in Sec. II-B, including its variations in [8]–[13].

2) $k^{\tau,\epsilon}$-anonymity: No privacy criterion proposed to date can safeguard spatiotemporal trajectory data from the second type of attacks in Sec. II-B, i.e., probabilistic attacks. This forces us to define an original criterion, as follows.

The pertinent principle here is the so-called uninformative principle, i.e., ensuring that the difference between the knowledge of the adversary before and after accessing a database is small [16]. In our context, this principle warrants that an attacker who knows some subset of a subscriber's movements cannot extract from the dataset a substantially longer portion of that user's trajectory.

To attain the uninformative principle, we introduce the $k^{\tau,\epsilon}$-anonymity privacy criterion. $k^{\tau,\epsilon}$-anonymity can be seen as a variation of $k^m$-anonymity, which establishes that each individual in a dataset must be indistinguishable from at least $k-1$ other users in the same dataset, when limiting the attacker knowledge to any set of m attributes [19]. $k^{\tau,\epsilon}$-anonymity tailors $k^m$-anonymity to our scenario, as follows.

• As per Sec. II-B, the attacker knowledge can be any continuous sequence of spatiotemporal samples covering a time interval of length at most $\tau$: thus, the m parameter of $k^m$-anonymity maps to the (variable) set of samples contained in any time period $\tau$. During any such time period, every trajectory in the dataset must be indistinguishable from at least $k-1$ other trajectories.
• The maximum additional knowledge that the attacker is allowed to learn is called leakage; it consists of the spatiotemporal samples of the target user's trajectory contained in a time interval of duration at most $\epsilon$, disjoint from the original $\tau$. In order to fulfill the uninformative principle, the leakage $\epsilon$ must be small.

The two requirements above imply alternating in time the $k-1$ trajectories that provide anonymization. An intuitive example is provided in Fig. 1. There, the trajectory of a target user i is $2^{\tau,\epsilon}$-anonymized using those of five other subscribers. The overlapping between the trajectories of a, b, c, d, e and that of i is partial and varied. An adversary knowing a sub-trajectory of i during any time interval of duration $\tau$ always finds at least one other user with a movement pattern that is identical to that of i during that interval, but different elsewhere. With this knowledge, the adversary cannot tell apart i from the other subscriber, and thus cannot attribute full trajectories to one user or the other. As this holds no matter where the knowledge interval is shifted to, the attacker can never retrieve the complete movement patterns of i: this achieves the uninformative principle. Still, the adversary can increase its knowledge in some cases. Let us consider the interval $\tau$ indicated in the figure: the trajectories of i, d and e are identical for some time after $\tau$, which allows associating to i the movements during $\epsilon$: the opponent learns one additional spatiotemporal sample of i.

Fig. 1. Illustrative example of $k^{\tau,\epsilon}$-anonymity of user i, with k = 2.

3) Relationship between the privacy criteria: It is easy to see that k-anonymity is a special case of $k^{\tau,\epsilon}$-anonymity. As a matter of fact, the latter criterion reduces to the former when $\tau + \epsilon$ covers the whole temporal duration of the trajectory dataset. Then, $k^{\tau,\epsilon}$-anonymity requires that each complete trajectory be indistinguishable from $k-1$ other trajectories, which is the definition of k-anonymity. Our point here is that an anonymization solution that implements $k^{\tau,\epsilon}$-anonymity can be straightforwardly employed to attain k-anonymity as well, by properly adjusting the $\tau$ and $\epsilon$ parameters.

In the light of these considerations, we address the problem of achieving $k^{\tau,\epsilon}$-anonymity in datasets of spatiotemporal trajectories of mobile subscribers. By doing so, we develop a complete anonymization solution that is effective against probabilistic attacks, but can also be specialized to guarantee k-anonymity and counter record linkage attacks.

4) Generalization and suppression: In order to enforce $k^{\tau,\epsilon}$-anonymity for all users in the dataset, we need to tweak the spatiotemporal samples in the trajectories of individuals, so that the criterion in Sec. II-C2 is respected for all of them. To that end, we rely on two elementary techniques, i.e., spatiotemporal generalization and suppression of samples.

Spatiotemporal generalization reduces the precision of trajectory samples in space and time, so as to make the samples of two or more users indistinguishable. Suppression removes from the trajectories those samples that are too hard to anonymize. Both techniques are lossy, i.e., imply some reduction of precision in the data. Yet, unlike other approaches, these techniques conform to the PPDP requirement of truthfulness at the record level, see Sec. II-A.

III. ACHIEVING $k^{\tau,\epsilon}$-ANONYMITY

Our goal is ensuring that an anonymized dataset of mobile subscriber trajectories respects the uninformative principle, by implementing, through generalization and suppression, the $k^{\tau,\epsilon}$-anonymity of all subscriber trajectories in the dataset. Clearly, we aim at doing so while minimizing the loss of spatiotemporal granularity in the data.

We start by defining the basic operation of generalizing a set of spatiotemporal samples, and the associated cost in terms of loss of granularity, in Sec. III-A. We then extend both notions to (sub-)trajectories, in Sec. III-B. Building on these definitions, we discuss in Sec. III-C the optimal spatiotemporal generalization of k (sub-)trajectories. We implement the result into k-merge, an optimal low-complexity algorithm that generalizes (sub-)trajectories with minimal loss of data granularity, in Sec. III-D. Once able to merge (sub-)trajectories optimally, we propose an approach to guarantee $k^{\tau,\epsilon}$-anonymity of the trajectory of a single user, in Sec. III-E, and we then scale the solution to multiple users in Sec. III-F. Finally, we introduce kte-hide, an algorithm that ensures $k^{\tau,\epsilon}$-anonymity in spatiotemporal trajectory datasets, in Sec. III-G.

A. Generalization of samples

A (raw) sample of a spatiotemporal trajectory represents the position of a subscriber at a given time, and we model it with a length-3 real vector $s = (t(s), x(s), y(s))$. Since a dataset is characterized by a finite granularity in time and space, a sample is in fact a slot spanning some minimum temporal and spatial intervals. The vector entries above can be regarded as the origins of a normalized length-1 time interval and a normalized 1×1 two-dimensional area².

Spatiotemporal generalization merges together two or more raw samples into a generalized sample, i.e., a slot with a larger span. Mathematically, a generalized sample G can be represented as the set of the merged samples. There is a cost associated with merging samples, which is related to the span of the corresponding generalized sample, i.e., to the loss of granularity induced by the generalization. The cost of the operation of merging a set of samples into the generalized sample G is defined as

$$c(G) = c_t(G)\, c_s(G), \tag{1}$$

where $c_t(G)$ represents the cost in the time dimension, while $c_s(G)$ is the cost in the space dimensions.

Let $G_1$ and $G_2$ be two disjoint generalized samples (i.e., $G_1 \cap G_2 = \emptyset$). Then, we make the following two assumptions on the time and space merging costs:

$$c_t(G_1 \cup G_2) \geq c_t(G_1) + c_t(G_2) \tag{2}$$
$$c_s(G_1 \cup G_2) \geq \max\{c_s(G_1), c_s(G_2)\}. \tag{3}$$

Hereafter, we use the following definitions to implement the generic costs $c_t(G)$ and $c_s(G)$:

$$c_t(G) = \Delta_t(G) \tag{4}$$
$$c_s(G) = \Delta_x(G) + \Delta_y(G), \tag{5}$$

where

$$\Delta_\star(G) = \max_{s \in G} \star(s) - \min_{s \in G} \star(s) + 1, \tag{6}$$

with $\star \in \{t, x, y\}$, is the span in each dimension.

Therefore, in our implementation, $c(G)$ is the area of a rectangle with sides $\Delta_t(G)$ and $\Delta_x(G) + \Delta_y(G)$. A graphical example is provided in Fig. 2, where two raw samples $s_{i,1}$ and $s_{i',1}$ are merged into a generalized sample $G_1$, spanning $\Delta_t(G_1)$ in time and $\Delta_x(G_1)$ in space (portrayed as unidimensional in the figure, for the sake of readability).

Fig. 2. Example of merging of trajectories $S_i = \{s_{i,j}\}$ and $S_{i'} = \{s_{i',j}\}$ into a generalized trajectory $\mathcal{G} = \{G\}$. For clarity, space is unidimensional.

Remark 1: The rationale for our choice of costs is computational efficiency. Also, summing the two space spans before multiplication allows balancing the time and space contributions. Finally, note that with the definition in (5), the space merging cost assumption in (3) is trivially true. Instead, the definition in (4) lets the time merging cost assumption in (2) hold only if the time intervals spanned by $G_1$ and $G_2$ are non-overlapping. The time coherence property that we will introduce in Sec. III-B ensures that this is always the case.

B. Generalization of trajectories

A spatiotemporal (sub-)trajectory describes the movements of a single subscriber during the dataset time span. Formally, a trajectory is an ordered vector of samples $S = (s_1, \dots, s_N)$, where the ordering is induced by the time coordinate, i.e., $t(s_i) < t(s_{i'})$ if and only if $i < i'$.

A generalized trajectory, obtained by merging different trajectories, is defined as an ordered vector of generalized samples $\mathcal{G} = (G_1, \dots, G_Z)$. Here the ordering is more subtle, and based on the fact that the time intervals spanned by the generalized samples are non-overlapping, a property that will be called time coherence. More precisely, if $G_i$ and $G_{i'}$, $i < i'$, are two generalized samples of $\mathcal{G}$, then

$$\max_{s \in G_i} t(s) < \min_{s \in G_{i'}} t(s).$$

An example of a generalized trajectory $\mathcal{G}$ merging two trajectories $S_i$ and $S_{i'}$ is provided in Fig. 2. $\mathcal{G}$ fulfils time coherence, as its generalized samples are temporally disjoint.

Remark 2: Time coherence is a defining property of generalized trajectories in PPDP. As a matter of fact, publishing trajectory data with time-overlapping samples would generate semantic ambiguity and make analyses cumbersome.

Analogously to the cost of merging samples, we can define a cost of merging multiple trajectories into a generalized trajectory. We define such cost as the sum of costs of all generalized samples belonging to it. More precisely, if $\mathcal{G} = (G_1, \dots, G_Z)$, and $c(\cdot)$ is defined as in (1), then the cost of $\mathcal{G}$ is given by:

$$C(\mathcal{G}) = \sum_{i=1}^{Z} c(G_i). \tag{7}$$

Remark 3: The cost in (7) is the overall surface covered by samples of the generalized trajectory over the spatiotemporal plane. E.g., in Fig. 2, the cost of $\mathcal{G}$ is the sum of the three areas, i.e., $c(G_1) + c(G_2) + c(G_3)$. It is thus proportional to the total loss of granularity induced by the generalization.
Optimal generalization of trajectories i,1 and si′,1 are merged into a generalized sample G1, spanning We nowformalizetheproblemofoptimalgeneralizationof spatiotemporal(sub-)trajectories.Suppose that we have k tra- 2Forinstance,inourreferencedatasets,thesamplegranularityis1minute jectoriesS ,...,S , withS =(s ,...,s ), i=1,...,k. imnintiumtee)ainndti1m0e0amndeteornseinslostpa(ci.ee..,Aar1a0w0×sa1m00plem2spaarnesa)thiennsopnaeces.loHto(wi.eev.,er1, The goal is1a genekralized traijectoriy,1G∗ =i(,GN1∗i,...,GZ∗) from ourdiscussionisgeneral, andholdsforanyprecision inthedata. S1,...,Sk, which satisfies the following conditions. i)TheunionofallgeneralizedsamplesofG∗ mustcoincide with the union of all samples of S ,...,S , i.e., 1 k G∗∪···∪G∗ =S ∪···∪S ,S, 1 Z 1 k where S = Ni {s }. Thus, G∗ is a partition of the set S i j=1 i,j of all samples in the input trajectories: it does not add any S alien sample or discard any input sample. ii) Each generalized sample contains at least one sample from each of the k input trajectories S ,...,S , i.e., 1 k Gi∗∩Si′ 6=∅, i=1,...,Z, i′ =1,...,k. Fig.3. PartitiontreeforthetwotrajectoriesSi={si,j}andSi′ ={si′,j} in Fig.2. Nodes in the complete tree represent the set K of valid partitions This imposes that each input trajectory contributes to each of the set of raw samples S. Elementary partitions are the tree leaves and generalized sample of G∗. Otherwise, the merging could constitute K∗.Thepartition inFig.2istheleftmostleafinthetree. associate generalized samples to users that never visited the generalized location at the generalized time, violating point 3 Algorithm 1: k-merge algorithm pseudocode. ofitihi)eTPhPeDcPosrteqoufirtheme emnetsrgiinngSeisc.mIIi-nAim. 
ized, i.e., oinuptpuutt::GTreanjeecratolirzieeds Ssa1m,.p.le.,sSetkG,w∗h,eCreosStiC=(G(s∗i,)1,...,si,Ni) 1 foreach i∈[1,k]do G∗ =arg minC(G), (8) 2 Si=SNj=i1{si,j}; G∈K 3 S ←timesort(S1∪···∪Sk); 4 Cost←(0,∞,...,∞); where K is the set of all partitions of S satisfying time 5 Partition←(NULL, ...,NULL); coherence as well as condition ii) above, and C(G) is in (7). 6 foreach sθ ∈S do InFig.2, the generalizedtrajectoryGfulfilsallthese require- 7 θ′=θ−1; ments, and is thus the optimal merge G∗ of Si and Si′. 98 whileθ′i=ncθo′m−pl1e;te(sθ′,...,sθ)do Solving the problem above with a brute-force search is 10 whileelementary(sθ′,...,sθ)do computationally prohibitive, since K has a size that grows 11 G ←generalize(sθ′,...,sθ); 12 ifCost[θ]>c(G)+Cost[θ′−1]then exponentiallywith |S|/k, where|·| denotescardinality.How- 13 Cost[θ]←c(G)+Cost[θ′−1]; ever,we can characterizeG∗ so that it is possible to compute 14 Partition←(θ′−1,G); it with low complexity. To that end, we name elementary a 15 θ′=θ′−1; partition G ∈ K that cannot be refined to another partition 1167 GC∗(G←∗)v←isiCtos(tP[a|Srt|it]io;n); within K. In other words, none of the generalized samples of an elementary partition can be split into two generalized samples without violating conditions i) and ii) above, or time Comparing (11) with (10), we get that C(G) ≥ C(G). coherence. Then, we have the following proposition. Proposition 1: Given the input trajectories S ,...,S , the Thus, to search for the optimal G∗, we can drop G and keep optimal G∗ defined in (8) is an elementary part1ition. k only G. If G is not elementary, then we can find one ofeits Proof: Suppose G ∈ K is not elementary, so that it can refinements, and repeat the above steps to drop also G. This be refined to another partition G ∈ K. 
In particular, with- way,wee canedropall partitionsthatare notelementaryand be out loss of generality, suppose that G = (G ,...,G ) and left only with elementary partitions as G∗ candidates.e 1 Z IfwebuildatreeofpartitionsbelongingtoK,suchthatthe G= G ,...,G , where e 1 Z+1 S is the root and each node is a partition whose children are (cid:16) (cid:17) itsrefinements,theleavesaretheelementarypartitions,which e e e G , i<Z G = i (9) form a subset K∗. The above proposition states that we can i ( GZ ∪GZ+1, i=Z. limit the search of G∗ to K∗, drastically reducing the search e From(7)and(9),thedifferencebetweenthecostsofGand space of G∗ to the set K∗ ⊂K of elementarypartitionsof S. e e G is given by Anexampleis providedin Fig.3, forthe trajectoriesinFig.2. C(G)−C(G)=c(G )−c(G )−c(G ). (10) D. Optimal merging algorithm e Z Z Z+1 Since G contains the union of raw samples in G and We propose k-merge, an algorithm to efficiently search G , weZcan appely properties (2) eand (3)e(where (2)Zholds the set of raw samples S, extract the subset of elementary Z+1 because of time coherence) and obtain: e partitions, K∗, and identify the optimal partition G∗. The algorithm, detailed in Alg.1, starts by populating a set e c(GZ) = ct(GZ)cs(GZ) of raw samples S, whose items s are ordered according i,j ≥ ct(GZ)+ct(GZ+1) cs(GZ) to their time value t(si,j) (lines 1–3). Then, it processes all samples according to their temporal ordering (line 6). (cid:16) (cid:17) ≥ ct(GZe)cs(GZ)+e ct(GZ+1)cs(GZ+1) Specifically,thealgorithmtests,foreachsamplesθ inposition = c(GZ)+c(GZ+1). (11) θ, all sets {sθ′,...,sθ}, with θ′ <θ, as follows. e e e e e e The first loop skips incomplete sets that do not contain at least one sample from each input trajectory (line 8). The second loop runs until the first non-elementary set is encountered (line 10). Therein, the algorithm generalizes the current(complete and elementary) set {sθ′,...,sθ} to G, and checks if G reduces the total merging cost up to s . 
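The dynamic program behind Alg. 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the name `k_merge`, the helpers `complete`, `cut_ok` and `splittable`, and the toy trajectories are hypothetical; samples are integer slots $(t, x, y)$; cuts between samples with equal timestamps are disallowed here to keep time coherence strict. Completeness and elementarity are tested naively, so the sketch does not attain the linear running time discussed for k-merge.

```python
import math


def span(values):
    values = list(values)
    return max(values) - min(values) + 1


def sample_cost(group):
    # Eq. (1) with (4)-(6): time span times (x span + y span).
    return span(p[0] for p in group) * (
        span(p[1] for p in group) + span(p[2] for p in group)
    )


def k_merge(trajectories):
    """Return (optimal cost, list of generalized samples) for k trajectories."""
    k = len(trajectories)
    # Time-sorted samples, each tagged with the index of its trajectory.
    pts = sorted(
        (t, x, y, i) for i, traj in enumerate(trajectories) for (t, x, y) in traj
    )
    n = len(pts)

    def complete(b, e):
        # pts[b:e] holds at least one sample of every input trajectory.
        return len({p[3] for p in pts[b:e]}) == k

    def cut_ok(j):
        # A cut before position j keeps generalized samples temporally disjoint.
        return j == 0 or j == n or pts[j - 1][0] < pts[j][0]

    def splittable(b, e):
        # Non-elementary: pts[b:e] splits into two complete, coherent groups.
        return any(
            cut_ok(m) and complete(b, m) and complete(m, e)
            for m in range(b + 1, e)
        )

    cost = [0] + [math.inf] * n  # cost[j]: best cost of merging pts[:j]
    prev = [None] * (n + 1)      # prev[j]: start index of the last group
    for e in range(1, n + 1):
        if not cut_ok(e):
            continue
        for b in range(e - 1, -1, -1):
            if not complete(b, e):
                continue  # keep growing the candidate group backwards
            if splittable(b, e):
                break     # every larger group is non-elementary as well
            if cut_ok(b) and cost[b] + sample_cost(pts[b:e]) < cost[e]:
                cost[e] = cost[b] + sample_cost(pts[b:e])
                prev[e] = b

    groups, e = [], n
    while e > 0 and prev[e] is not None:  # backward visit of the partition
        b = prev[e]
        groups.append([p[:3] for p in pts[b:e]])
        e = b
    return cost[n], groups[::-1]


# Hypothetical toy input: two users with three samples each.
traj_a = [(0, 0, 0), (4, 5, 0), (9, 9, 0)]
traj_b = [(1, 1, 0), (5, 5, 1), (8, 8, 0)]
cost, groups = k_merge([traj_a, traj_b])
```

On this toy input the sketch pairs each sample of the first user with the temporally closest sample of the second one, which yields three generalized samples.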
When G reduces the total merging cost up to $s_\theta$, the cost is updated by adding $c(G)$ to the cost accumulated up to $s_{\theta'-1}$, and the resulting (partial) partition of $\mathcal{S}$ that includes G is stored (lines 11–14). Once out of the loops, the cost associated to the last sample is the optimal cost, and it is sufficient to navigate the partition structure backwards to retrieve the associated $\mathcal{G}^*$ (lines 16–17).

Note that, in order to update the cost of including the current sample $s_\theta$ (line 13), the algorithm only checks previous samples in time. It thus needs that the optimal decision up to $s_\theta$ does not depend on any of the samples in the original trajectories that come later in time than $s_\theta$. The following proposition guarantees that this is the case.

Proposition 2: Let $\mathcal{G}^* = (G_1^*, \dots, G_Z^*)$ be the optimal generalized trajectory and let us make the hypothesis that $s_\theta$ and $s_{\theta+1}$ do not belong to the same generalized sample of $\mathcal{G}^*$. Let $\mathcal{G}_p^* = (G_1^*, \dots, G_{Z_1}^*)$ and $\mathcal{G}_f^* = (G_{Z_1+1}^*, \dots, G_Z^*)$, so that $s_\theta \in G_{Z_1}^*$ and $s_{\theta+1} \in G_{Z_1+1}^*$. Then, $\mathcal{G}_p^*$ can be derived independently of $\mathcal{G}_f^*$.

Proof: Let $\mathcal{G}$, $\mathcal{G}_p$ and $\mathcal{G}_f$ be any generalized sequences containing raw samples $(s_1, \dots, s_N)$, $(s_1, \dots, s_\theta)$ and $(s_{\theta+1}, \dots, s_N)$, respectively. According to the cost definition, we generally have

$$\min_{\mathcal{G}} C(\mathcal{G}) \leq \min_{\mathcal{G}_p, \mathcal{G}_f} C((\mathcal{G}_p, \mathcal{G}_f)) = \min_{\mathcal{G}_p} C(\mathcal{G}_p) + \min_{\mathcal{G}_f} C(\mathcal{G}_f),$$

where $(\mathcal{G}_p, \mathcal{G}_f)$ is the concatenation of $\mathcal{G}_p$ and $\mathcal{G}_f$. However, by virtue of the hypothesis and by construction,

$$\min_{\mathcal{G}} C(\mathcal{G}) = C(\mathcal{G}^*) = C(\mathcal{G}_p^*) + C(\mathcal{G}_f^*) = \min_{\mathcal{G}_p} C(\mathcal{G}_p) + \min_{\mathcal{G}_f} C(\mathcal{G}_f),$$

so that, to minimize $C(\mathcal{G})$, we only need to minimize $C(\mathcal{G}_p)$ and $C(\mathcal{G}_f)$ independently.

The above proposition guarantees that the algorithm is exploring all possibilities, and as a result, the cost $C(\mathcal{G}^*)$ returned by k-merge is optimal, i.e., it is the minimum loss of granularity necessary to merge the original trajectories.

Note that k-merge has a very low complexity in practical cases. Let $l(\theta)$ be the number of sets $\{s_{\theta'}, \dots, s_\theta\}$ that are both complete and elementary for a given $\theta$. Then, the number of computations and comparisons of sample generalization costs that are performed in k-merge is $\sum_\theta l(\theta) = |\mathcal{S}|\,\bar{l}$, where $\bar{l}$ is the average value of $l(\theta)$. If $\bar{l} = O(1)$, which happens in most trajectory data where the samples of the input trajectories are intercalated in the time axis, then k-merge runs in a time $O(|\mathcal{S}|)$, i.e., linear in the number of samples.

E. Single user $k^{\tau,\epsilon}$-anonymity

We implement $k^{\tau,\epsilon}$-anonymity for a generic subscriber i as shown in Fig. 4. We discretize time into intervals of length $\epsilon$, named epochs. At the beginning of the m-th epoch, we select a set of $k-1$ users different from i, named a hiding set of i and denoted as $h_m^i$. The hiding set $h_m^i$ provides k-anonymity to subscriber i for a subsequent time window $\tau + \epsilon$. By repeating the hiding set selection for all epochs, $\tau/\epsilon + 1$ subsequent hiding sets of user i overlap at any point in time. Such a structure of overlapping hiding sets assures the following.

Fig. 4. Overlapping hiding set structure realizing $k^{\tau,\epsilon}$-anonymity for user i.

First, subscriber i is k-anonymized for any possible knowledge of the attacker. No matter where a time interval of length $\tau$ is shifted to along the time dimension, it will always be completely covered by the time window of one hiding set, i.e., a period during which i's trajectory is indistinguishable from those of $k-1$ other users. As an example, in Fig. 4, the attacker knowledge $\tau$ (bottom-right of the plot) is fully enclosed in the time window of $h_6^i$, and his sub-trajectory is indistinguishable from those of users in $h_6^i$.

Second, the additional knowledge leaked to the attacker is exactly $\epsilon$. From the first point above, the adversary cannot tell apart i from the users in the hiding set $h_m^i$ whose time window covers his knowledge $\tau$. However, the adversary can follow the (generalized) trajectories of i and users in $h_m^i$ for the full time window $\tau + \epsilon$. Therefore, the adversary can infer new information about the (generalized) trajectory of i during the time window period that exceeds his original knowledge $\tau$, i.e., $\epsilon$. E.g., in Fig. 4, the time window of $h_6^i$ spans before and after the attacker knowledge $\tau$, for a total of $\epsilon$.

The two guarantees above let $k^{\tau,\epsilon}$-anonymity, as defined in Sec. II-C2, be fulfilled for the generic user i. The epoch duration $\epsilon$ maps to the knowledge leakage. The following important remarks are in order.

1. Hiding set selection. The structure of overlapping hiding sets is to be implemented so that the loss of accuracy in the $k^{\tau,\epsilon}$-anonymized trajectory is minimized. Thus, the users in the generic hiding set $h_m^i$ shall be those who, during the time window $\tau + \epsilon$ starting at the m-th epoch, have sub-trajectories with minimum k-merge cost with respect to i's.

2. Reuse constraint. The uninformative principle requires alternating the $k-1$ trajectories used in different hiding sets, as per Sec. II-C2. A simple way to enforce this is limiting the inclusion of any subscriber in at most one hiding set of i.

3. Generalization set. As evidenced by the example in Fig. 4, the configuration of hiding sets changes at every epoch, and $\tau/\epsilon + 1$ hiding sets overlap during each epoch. This means that a spatiotemporal generalization must be used to merge a set of $\chi = 1 + (\tau/\epsilon + 1)(k-1)$ trajectories at each epoch.

4. Epoch duration tradeoff. The epoch duration $\epsilon$ is a configurable system parameter, whose setting gives rise to a tradeoff between knowledge leakage and accuracy of the anonymized data. A lower $\epsilon$ reduces knowledge leakage. However, it also increases $\chi$, which typically entails a more marked generalization and a higher loss of data granularity.

F. Multiple user $k^{\tau,\epsilon}$-anonymity

Scaling $k^{\tau,\epsilon}$-anonymity from a single user to all subscribers in a dataset implies that the choice of hiding sets cannot be made independently for every user. Therefore, trajectory similarity and reuse constraint fulfillment are not sufficient norms anymore. In addition to the above, the selection of hiding sets needs to be concerted among all users so as to ensure that the generalized trajectories are correctly intertwined and all subscribers are k-anonymized during each time window $\tau + \epsilon$.

An intuitive solution is enforcing full consistency: including a subscriber i into the hiding set of user $i'$ at epoch m makes $i'$ automatically become part of i's hiding set at the same epoch. Formally, $i \in h_m^{i'} \Rightarrow i' \in h_m^i$, $\forall i \neq i'$, $\forall m$.

In fact, full consistency is an unnecessarily restrictive condition. It is sufficient that hiding set concertation satisfies a k-pick constraint: during the m-th epoch, each user i in the dataset has to be picked in the hiding sets of at least $k-1$ other subscribers. Formally, $|\{i' : i \in h_m^{i'}\}| \geq k-1$, $\forall i$, $\forall m$. This provides an increased flexibility over all existing approaches, which rely on fully consistent generalization strategies.

The rationale behind the k-pick constraint is best illustrated by means of a toy example, in Fig. 5. At the beginning of the m-th epoch, for subscriber i (resp., $i'$ and $i''$), one needs to select $k-1 = 2$ other users that constitute the hiding set $h_m^i$ (resp., $h_m^{i'}$ and $h_m^{i''}$). Let us consider $h_m^i = \{a, b\}$, $h_m^{i'} = \{i, c\}$, $h_m^{i''} = \{i, d\}$, which results in the generalized sub-trajectories $\mathcal{G}^i$, $\mathcal{G}^{i'}$, $\mathcal{G}^{i''}$ in Fig. 5. The configuration satisfies the k-pick constraint for subscriber i, who is picked in $k-1 = 2$ hiding sets, i.e., $h_m^{i'}$ and $h_m^{i''}$. Suppose now that the attacker knows the spatiotemporal samples of i's trajectory during any time interval $\tau$ within the m-th and (m+1)-th epoch: as these samples are within $\mathcal{G}^i$, $\mathcal{G}^{i'}$ and $\mathcal{G}^{i''}$, then i is 3-anonymized.

Fig. 5. Example of k-pick constraint, with k = 3, for user i during the m-th hiding set selection. Here $\epsilon = \tau$, hence the time windows of hiding sets span two epochs. For clarity, space is unidimensional. Figure best viewed in colors.

The key consideration is that i is k-anonymized at epoch m by $i'$ and $i''$, yet it does not contribute to the anonymization of either $i'$ or $i''$, as $i', i'' \notin h_m^i$. Thus, it is possible to decouple the choice of hiding sets across subscribers, without jeopardizing the privacy guarantees granted by k-anonymity. Such a decoupling entails a dramatic increase of flexibility in the choice of hiding sets, as per the following proposition.

Proposition 3: Given a dataset of U trajectories and a fixed value of k, the number of hiding set configurations allowed by full consistency is a fraction of that allowed by k-pick that vanishes more than exponentially for $U \to \infty$.

Proof: Let us consider a set of U users, where U is a multiple of k, since otherwise full consistency cannot even be enforced. Let us build a $k \times U$ matrix, in which the i-th column contains $(i, h_m^i)$, where $h_m^i$ is the hiding set for user i at a given epoch m. (For simplicity, in this proof, we do not take into account the reuse constraints.)

The solution set under the k-pick constraint coincides with the set of normalized Latin rectangles³ of size $k \times U$. Let $K_{k,U}$ be the number of $k \times U$ normalized Latin rectangles, which equals the number of possible solutions for our problem with the k-pick constraint. An old result by Erdős and Kaplansky [20] states that, for $U \to \infty$ and $k = O\left((\log U)^{3/2-\epsilon}\right)$,

$$K_{k,U} \sim (U!)^{k-1} \exp\left(-k(k-1)/2\right). \tag{12}$$

If, instead, we enforce full consistency, then the number of solutions equals the number of different partitions of a size-U set into $U/k$ subsets, all with size k. Denoting by $C_{k,U}$ this number, we can compute it as

$$C_{k,U} = \frac{\binom{U}{k}\binom{U-k}{k}\cdots\binom{k}{k}}{(U/k)!} = \frac{U!}{(k!)^{U/k}\,(U/k)!}. \tag{13}$$

Thus, for fixed k and $U \to \infty$

$$\frac{C_{k,U}}{K_{k,U}} \sim \frac{\exp\left(k(k-1)/2\right)}{(U!)^{k-2}\,(k!)^{U/k}\,(U/k)!}$$
The figure portrays the which tends to zero more than exponentially for U →∞. spatiotemporal samples of users i, i′ and i′′ during epochs For large datasets of hundreds of thousands trajectories, k- m and m+1. The sub-trajectory of subscriber i in this time pick enables a much richer choice of merging configurations. interval is S =(s ,s ,s ), represented as black squares; This reasonably unbinds better combinations of the original i i,1 i,2 i,3 equivalently for i′ (orange triangles) and i′′ (red circles). trajectories, and results in more accurate anonymized data. Samples denoted by letters belong to other users a, b, c and d, and they are instrumental to our example. 3Ak×nLatinrectangle,k≤n,isamatrixinwhichallentriesaretaken from the set {1,...,n}, in such a way that each row and column contains Let us assume that ǫ=τ (i.e., hiding sets span an interval eachvalue atmostonce. TheLatinrectangle issaidtobenormalized ifthe 2τ =2ǫ,orepochsmandm+1),andk =3.Atthebeginning firstrowistheorderedset(1,...,n). Algorithm 2: kte-hide algorithm pseudocode. be created within c : this means that subscribers in c share s s input :Anonymization level k,attacker knowledge τ,leakage ǫ a sub-trajectory that is rare in the dataset, and their number input :Trajectory datasetD is insufficient to implement kτ,ǫ-anonymity. In this case, we output:Anonymizedtrajectory dataset D apply suppression and remove all spatiotemporal samples of 1 foreach eθ ∈epochs(D)do 2 Df ←filter(eθ,D); such users’ sub-trajectories(line 16). Once all hiding sets are 3 foreach Si,Si′ ∈Df,Si6=Si′ do determined, the merging is performed,on each epoch and for 4 Costs[Si,Si′]←k-merge(Si,Si′); each user, using k-merge (lines 17–20). 5 Clusters[θ]←spectralClustering(Costs); 6 ifθ≥τ/ǫ+1then Overall, the heuristic algorithmaboveguaranteesthat over- 7 foreach c∈Clusters[θ]do lappinghidingsetsthatsatisfythereuseconstraint(Sec.III-E) 8 Subs←split(c,Clusters[θ−τ/ǫ:θ−1]); are selected for all users. 
It also ensures that such a choice of 9 foreach cs∈Subs[θ]do 10 gs←graph(cs); hidingsetsfulfilsthek-pickrequirement(Sec.III-F).Together, 11 gsc←greedyCycle(gs,k); these conditions realize kτ,ǫ-anonymityof the trajectory data. 1123 if∃gfsocretahcehnSi∈cs do The complexity of kte-hide is as follows. Let U be 14 hiθ−τ/ǫ←gsc[Si]; the number of users, Θ be the number of epochs and N be 15 else the average number of samples per user per epoch, so that 16 suppression(cs); N = ΘUN is the total number of samples in the dataset. 1178 foreafcohreeaθch∈Seip∈ocDhdso(D)do Thteotn:(i)lines2–4performk-mergeontwoinputtrajectories 19 h←filter(eθ,Si,hiθ−τ/ǫ,...,hiθ); ΘU2 times,eachofthemwithacomplexityO(N), foratotal 20 D←replace(k-merge(h)); complexity of O(NtotU); (ii) spectral clustering (line 5) can be implemented with complexity O(ΘU2) using KASP [21]; (iii) the complexity of lines 17–20, performing k-merge G. Practical kτ,ǫ-anonymity algorithm on χ input trajectories ΘU times, is O(Ntotχ). All other subroutines of kte-hide have a much smaller complexity. Capitalizing on all previousresults, we design kte-hide, an algorithm that achieves kτ,ǫ-anonymity in datasets of spa- IV. PERFORMANCE EVALUATION tiotemporaltrajectories.Sinceeventheoptimalsolutiontothe simpler k-anonymity problem is known to be NP-hard [14], We evaluate our anonymization solutions with five real- we resort here to an heuristic solution. world datasets of mobile subscriber trajectories, introduced The algorithm, in Alg.2, proceeds on a per-epoch basis in Sec.IV-A. A comparative evaluation of k-merge is (line 1), finding, for each epoch θ, a set of χ users (with in Sec.IV-B, while the results of kτ,ǫ-anonymization via χ defined as in Sec.III-E) that hide each subscriber at low kte-hide are presented in Sec.IV-C. mergingcost.Anextensivesearchforthesetofχuserswould have an excessive cost O(Uχ), where U is the number of A. 
Reference datasets usersindataset,andχ≥3.Thus,weadoptacomputationally efficientapproach,byclusteringusersub-trajectoriesbasedon Our datasets consist of user trajectories extracted from call theirpairwisemergingcost.Costsarecomputedviak-merge detail records (CDR) released by Orange within their D4D (lines2–4),andastandardspectralclusteringalgorithmgroups Challenges [22], and by the University of Minnesota [23]. similar trajectories into same clusters (line 5). This allows Three datasets, denoted as abi, dak and shn, describe operating on each cluster independently in the following. the spatiotemporal trajectories of tens of thousands mobile Startingfromepochτ/ǫ+1(line6),thealgorithmprocesses subscribers in urban regions, while the other two, civ and each identified cluster at epoch θ separately (line 7). It senhereinafter,are nationwide.Inalldatasets, userpositions splits the current cluster c into subsets, which contain user map to the latitude and longitude of the current base station trajectories that share the same sequence of clusters during (BS) they are associated to. The main features of the datasets the last τ/ǫ epochs (line 8). arelistedinTab.I,revealingtheheterogeneityofthescenarios. Let c be any of such subsets: c is mapped to a directed In order to ensure that all datasets yield a minimum level s s graph whose nodes are the users within c , and there is an of detail in the trajectory of each tracked subscriber, we had s edge going from user j to user i if j can be in the hiding set to preprocess the abi and civ datasets. Specifically, we hi of i withoutviolating the reuse constraint(line 10). 
If only retained those users whose trajectories have at least one θ−τ/ǫ a k-anonymity level is required, k−1 directional cycles are spatiotemporal sample on every day in a specific two-week then built within the graph, involving all nodes in the graph, period.Nofilteringwasneededforthedakandsendatasets, in such a way that each node has a different parent in each whichalreadycontainuserswhoareactiveformorethan75% cycle (line 11). The hiding set hi is then obtained as the ofa2-weektimespan,andshn,whoseusershaveevenhigher t−τ/ǫ set of user i’s parents in the k−1 cycles (lines 13–14). sampling rates. Sucha constructionofhidingsetscomplieswiththe k-pick In all datasets, user positions map to the latitude and lon- constraint,sinceeveryuseriisinthehidingsetofk−1other gitude of the current base station (BS) they are associated to. users. It may however happen that no valid k−1 cycles can Wediscretizedtheresultingpositionsona100-mregulargrid, TABLEI TABLEII FEATURESOFREFERENCEMOBILETRAFFICDATASETS. COMPARATIVEPERFORMANCEEVALUATIONOFK-MERGE Dataset Surface BS BS/Km2 Users Density Samples Timespan k-merge Staticgeneralization[success%] W4M GLOVE [Km2] [user/Km2] [peruser/h] [days] Dataset k Time Space 2h-4Km 4h-10Km 8h-20Km Deleted Created Time Space Time Space abi 2,731 400 0.14 29,191 10.68 0.90 14 [min] [Km] [%] [%] [min] [Km] [min] [Km] dak 1,024 457 0.44 71,146 69,47 0.74 14 2 51 0.624 27.2 56.7 80.3 9.6 22.0 57 1.166 114 2.626 shn 3,329 2961 0.89 50,000 15.01 1.00 1 abi 5 228 3.423 0.7 11.0 40.5 31.9 31.2 185 3.809 292 3.740 civ 322,463 1238 0.0038 82,728 0.26 0.75 14 8 349 5.720 0.1 5.1 22.6 23.9 36.7 198 6.163 — — sen 196,712 1666 0.0085 286,926 1.45 0.45 14 2 47 0.701 43.2 68.7 93.3 5.9 11.4 39 1.466 116 2.498 dak 5 220 5.286 2.2 14.0 67.0 20.3 21.2 172 5.807 294 3.192 8 377 7.794 0.1 8.6 50.7 22.0 18.6 189 8.477 — — which represents the finest spatial granularity we consider4. 
ofspatiotemporaltrajectories.Thismeasureis fedto a greedy Samples are timestamped with an precision of one minute. algorithm to achieve k-anonymity with limited loss of granu- This is the granularity granted in the abi and civ datasets. larity and withoutintroducingfictitious data. However,unlike The dak and sen datasets feature a temporal granularity of k-merge,GLOVEdoesnotprovideanoptimalsolution,and 10 minutes: in order to have comparable datasets, we added is computationally expensive. a random uniform noise over a ten-minute timespan to each Theresultsofourcomparativeevaluationaresummarizedin sample, so as to artificially refine the time granularity of the Tab.II,fortheabianddakdatasets,whenvaryingnumberk datatooneminuteaswell. Inthecaseoftheshndataset,the of trajectories merged together. Similar results were obtained precisionis onesecond, andwe used a one-minutebinningto fortheotherdatasets,andareomittedduetospacelimitations. uniform the data to the standard format. We immediately note how static aggregation is an ineffective approach: the percentage of successfully merged k-tuples is B. Comparative evaluation of k-merge well below 100%, even when dramatically reducing the data Since no previous solution for kτ,ǫ-anonymity exists, we granularity to 8 hours in time and 20 km in space. Instead, are forced to compare our algorithms to previous techniques k-merge, W4M and GLOVE can merge all of the k-tuples, in terms of simpler k-anonymity. Interestingly, this allows while retaining a good level of accuracy in the data. We can validating our proposed approach for merging spatiotemporal directlycomparethe granularityintime (min)andspace (km) trajectories via the k-merge algorithm. retained by k-merge, W4M and GLOVE in merginggroups We thus run k-merge on 100 random k-tuples of mobile ofk trajectories:thespatiotemporalaccuracyiscomparablein users from the reference datasets, for different values of k, all cases. 
However, it is important to note that W4M attains and we record the spatiotemporal granularity retained by this result by deleting and creating a significant amount of the resulting generalized trajectories. We compare our results samples: in the end, only 40-70% of the original samples against those obtained by the only three approachesproposed are maintained in the generalized data. Conversely, all of the in the literature for the k-anonymization of trajectories along generalized samples created by k-merge reflect the actual both spatial and temporal dimensions. real-world data. Also, k-merge obtains a level of precision Thefirstisstatic generalization[8],[9], whichconsistsina that is always higher than that of GLOVE, and scales better: homogeneousreductionofdatagranularity,decidedarbitrarily indeed, the complexity of GLOVE did not allow computinga and imposed on all user trajectories. Static generalization is a solution when k =8. trial-and-errorprocess,andit doesnotguaranteek-anonymity Overall,theresultsupholdk-mergeasthecurrentstate-of- of all users. The second benchmark solution is Wait for Me the-artsolutiontogeneralizesparsespatiotemporaltrajectories (W4M)[36].Intendedforregularlysampled(e.g.,GPS)trajec- whileobeyingPPDPprinciplesandminimizingaccuracyloss. tories,W4Mperformstheminimumspatiotemporaltranslation needed to push all the trajectories within the same cylindrical volume. It allows the creation of new synthetic samples, and it is thus not fully compliant with PPDP principles in C. Performance evaluation of kte-hide Sec.II-A. The latter operation is leveraged to improve the matching among trajectories in a cluster, and assumes that mobile objects (i.e., subscribers in our case) effectuate linear Werunkte-hideonourreferencedatasetsofmobilesub- constant-speed movements between spatiotemporal samples. scriber trajectories, so that they are kτ,ǫ-anonymized. 
As the We use W4M with linear spatiotemporal distance (W4M-L), anonymizeddata are robust to probabilisticattacks by design, i.e., the version intended for large databases such as those we focus our evaluation on the cost of the anonymization, we consider 5, and configure it with the settings suggested i.e., the loss of granularity. All results refer to the case of in [36]. The third approachis GLOVE [10], which relieson a 2τ,ǫ-anonymization,with ǫ=τ. heuristic measure of anonymizability to assess the similarity 1) Citywide datasets: Fig.6 portrays the mean, median and first/third quartiles of the sample granularity in the kτ,ǫ- 4At100-mspatialgranularity, eachgridcellcontains atmostoneantenna anonymized citywide datasets abi, dak and shn. The plots fromtheoriginaldataset:theprocessdoesnotcauseanylossindataaccuracy. 5Implementation athttp://kdd.isti.cnr.it/W4M/. showhowresultsvarywhentheadversaryknowledgeτ ranges Mean 90 Mean SpaceGranularity[Km]12345 M25e-d7i5an%-ile TimeGranularity[min]3600 M25e-d7i5an%-ile 010m30m 1h 2hτ 4h 10m30m 1h 2hτ 4h 10m30τm 1h 010m30m 1h 2hτ 4h 10m30m 1h 2hτ 4h 10m30τm 1h (a)abi (b)dak (c)shn (d)abi (e)dak (f)shn Fig.6. Spatial (a,b,c)andtemporal(d,e,f)granularity versustheadversaryknowledge τ inthecitywide reference datasets. 50 120 10 Mean Mean abi SpaceGranularity[Km]12340000 M25e-d7i5an%-ile TimeGranularity[min] 369000 M25e-d7i5an%-ile Suppressed[%] 2468 dscshieavnnk 0 010m30m 1h 2hτ 4h 10m30m 1hτ 2h 010m30m 1h 2hτ 4h 10m30m 1hτ 2h 10m30m 1h 2h τ 4h (a)civ (b)sen (c)civ (d)sen Fig.7. Spatial(a,b)andtemporal (c,d)granularity versusτ inthenationwide reference datasets. Fig.8. Suppressedsamplesversusτ. from 10 minutes to 4 hours6. They refer to the anonymized than that of civ, but around one order of magnitude lower data granularity in space7, in Fig.6a-c and time, in Fig.6d-f. than those of the abi, dak and shn. 
Coherently, the spatial We remarkhowthekτ,ǫ-anonymizeddatasetsretainsignifi- granularity trend falls in between those observed for such cantlevelsofaccuracy,with a mediangranularityinthe order datasets, and it is not positively or negatively impacted by of1-3kminspaceandbelow45minutesintime.Theselevels the attacker knowledge. ofprecisionare largelysufficientformostanalyseson mobile More generally, the results in Fig.7 demonstrate that subscriber activities, as discussed in, e.g., [24]. The temporal kte-hide can scale to large-scale real-world datasets. The granularity is negatively affected by an increasing adversary absolute performance is good, as the kτ,ǫ-anonymized data knowledge τ, which is expected. Interestingly, however, the retains substantial precision: the median levels of granularity spatialgranularityisonlymarginallyimpactedbyτ:protecting inspaceandtimearecomparabletothoseachievedincitywide the data from a more knowledgeableattacker does not have a datasets. Finally, we remark that, in all cases, the amount of significant cost in terms of spatial accuracy. samples suppressed by kte-hide is in the 1%–7% range. 2) Nationwide datasets: Fig.7 shows equivalent results 3) Samplesuppression: Theamountofsamplessuppressed for the nationwide datasets civ and sen. The evolution of bykte-hideinthekτ,ǫ-anonymizationprocessisportrayed temporal granularity versus τ, in Fig.7c-d is consistent with in Fig.8. We notethatresortingto suppressionbecomesmore citywide scenarios. Differences emerge in terms of spatial frequentastheadversaryknowledgeincreases.However,even granularity:inthecivcase(Fig.7a)areversedtrendemerges, when the opponent is capable of tracking a user during four as accuracy grows along with the attacker knowledge. This continuedhours,thepercentageofsuppressedsamplesremains counterintuitiveresultisexplainedbythethinuserpresencein low, typically well below 10%. 
Moreover, the trend in the thecivdataset:asperTab.I,civhasadensityofsubscribers long-timespan datasets is clearly sublinear, suggesting that per Km2 that is one or two orders of magnitude lower than suppressiondoesnotbecomeprevalentwith higherτ. Results those in our other reference datasets. Such a geographical are fairly consistent across citywide datasets8. Nationwide sparsity makes it difficult to find individuals with similar datasets are also aligned, and yield even lower suppression spatialtrajectories:increasingτ hasthentheeffectofenlarging rates, at around 2%. This difference is explained by the fact thesetofcandidatetrajectoriesformergingateachepoch,with that a larger number of users allows for a more efficient a positive influence on the accuracy in the generalized data. spectral clustering in kte-hide. These considerations are confirmed by the results with the 4) Disaggregation over time: As an intriguing concluding sen dataset (Fig.7b). As per Tab.I, this dataset features a remark, Fig.9 reveals a clear circadian rhythm in the granu- subscriberdensitythatisaboutoneorderofmagnitudehigher larityofkτ,ǫ-anonymizeddata,aswellasin thepercentageof suppressedsamples.Theplotsrefertoonesampleweekinthe 6Thelimitedtemporalspanoftheshndatapreventsusfromtestingattacks abianddakdatasets,whenτ =30min,butconsistentresults withknowledgeτhigherthanonehour.Indeed,aτtooclosetothefulldataset were observed in all of our reference datasets. Specifically, duration implies thattheopponent hasana-priori knowledge ofthevictim’s trajectory that is comparable to that contained in the data, making attempts atcountering aprobabilistic attackfutile. 8Thespurious point atτ =1hourinshnis duetothe fact that thetime 7The spatial granularity in Fig.6 is expressed as the sum of spans along interval τ +ǫis already very large, at around the same order of magnitude theCartesianaxes.Forinstance,1kmmapsto,e.g.,asquareofside500m. ofthefulldataset duration.
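As a small illustration of the epoch duration tradeoff stated in Sec. III-E, the sketch below evaluates the generalization set size χ = 1 + (τ/ǫ + 1)(k − 1) for a few epoch durations. The function name `generalization_set_size` and the requirement that τ be an integer multiple of ǫ are assumptions of this sketch, not something the paper prescribes.

```python
def generalization_set_size(k: int, tau: int, eps: int) -> int:
    """Return chi = 1 + (tau/eps + 1) * (k - 1).

    tau and eps are durations in the same unit (e.g., minutes).
    Assumption of this sketch: tau is an integer multiple of eps.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if tau <= 0 or eps <= 0 or tau % eps != 0:
        raise ValueError("tau must be a positive integer multiple of eps")
    overlapping_hiding_sets = tau // eps + 1  # hiding sets active in one epoch
    return 1 + overlapping_hiding_sets * (k - 1)


# A shorter epoch leaks less knowledge but enlarges the set to be merged.
for eps in (120, 60, 30):
    chi = generalization_set_size(k=2, tau=120, eps=eps)
    print(f"tau=120 min, eps={eps} min, k=2 -> chi={chi}")
```

With k = 2 and ǫ = τ, the configuration used in Sec. IV-C, this gives χ = 3, which is the lower bound χ ≥ 3 quoted in the complexity discussion of Sec. III-G.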
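To get a feel for Proposition 3, one can evaluate Eq. (13) and the asymptotic estimate of Eq. (12) in logarithmic form. The sketch below is only a numerical illustration: the function names are hypothetical, and Eq. (12) is an asymptotic approximation, so the printed ratios are indicative rather than exact counts.

```python
import math


def log_full_consistency(k: int, U: int) -> float:
    """ln C_{k,U} from Eq. (13): U! / ((k!)^(U/k) * (U/k)!)."""
    if U % k != 0:
        raise ValueError("U must be a multiple of k")
    groups = U // k
    return (math.lgamma(U + 1)
            - groups * math.lgamma(k + 1)
            - math.lgamma(groups + 1))


def log_k_pick_asymptotic(k: int, U: int) -> float:
    """ln of the Eq. (12) estimate: (U!)^(k-1) * exp(-k(k-1)/2)."""
    return (k - 1) * math.lgamma(U + 1) - k * (k - 1) / 2


def log_ratio(k: int, U: int) -> float:
    """ln(C_{k,U} / K_{k,U}), using the asymptotic estimate for K."""
    return log_full_consistency(k, U) - log_k_pick_asymptotic(k, U)


# The log-ratio becomes rapidly more negative as the dataset grows.
for U in (30, 300, 3000):
    print(f"k=3, U={U}: ln(C/K) ~ {log_ratio(3, U):.1f}")
```

Working with `math.lgamma` avoids overflowing on the factorials, which would otherwise happen well before U reaches the hundreds of thousands of trajectories mentioned in Sec. III-F.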
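The difference between full consistency and the k-pick constraint of Sec. III-F can be checked mechanically on the toy configuration of Fig. 5. The sketch below uses hypothetical helper names. Its `cyclic_shift_hiding_sets` function is a deliberately simplified stand-in for the cycle-based construction of Alg. 2: it assumes every user may hide every other user (i.e., it ignores the reuse constraint and the clustering step) and is not the paper's greedyCycle routine.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Optional, Set

HidingSets = Dict[Hashable, Set[Hashable]]


def pick_counts(hiding_sets: HidingSets) -> Counter:
    """Count in how many other users' hiding sets each user appears."""
    counts = Counter({user: 0 for user in hiding_sets})
    for owner, members in hiding_sets.items():
        for member in members:
            if member != owner:
                counts[member] += 1
    return counts


def satisfies_k_pick(hiding_sets: HidingSets, k: int,
                     users: Optional[Iterable[Hashable]] = None) -> bool:
    """True if each checked user is picked by at least k-1 other users."""
    counts = pick_counts(hiding_sets)
    targets = list(hiding_sets) if users is None else list(users)
    return all(counts[u] >= k - 1 for u in targets)


def is_fully_consistent(hiding_sets: HidingSets) -> bool:
    """True if i in h[i'] always implies i' in h[i]."""
    return all(owner in hiding_sets.get(member, set())
               for owner, members in hiding_sets.items()
               for member in members)


def cyclic_shift_hiding_sets(users: List[Hashable], k: int) -> HidingSets:
    """Simplified construction: user p is hidden by its k-1 predecessors."""
    n = len(users)
    if n < k:
        raise ValueError("need at least k users")
    return {users[p]: {users[(p - s) % n] for s in range(1, k)}
            for p in range(n)}


# Toy configuration of Fig. 5 (k = 3): i is picked by i' and i''.
toy = {"i": {"a", "b"}, "i'": {"i", "c"}, "i''": {"i", "d"}}
print(satisfies_k_pick(toy, k=3, users=["i"]))
print(is_fully_consistent(toy))
```

In the toy configuration, the k-pick check is restricted to user i because the hiding sets of a, b, c and d are not given in the example. The cyclic-shift construction, in contrast, satisfies k-pick for every user while generally not being fully consistent.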
