K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization

Tochukwu Iwuchukwu, Jeffrey F. Naughton
University of Wisconsin, 1210 West Dayton Street, Madison, WI 53706

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '07, September 23-28, 2007, Vienna, Austria.
Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.

ABSTRACT

In this paper we observe that k-anonymizing a data set is strikingly similar to building a spatial index over the data set, so similar in fact that classical spatial indexing techniques can be used to anonymize data sets. We use this observation to leverage over 20 years of work on database indexing to provide efficient and dynamic anonymization techniques. Experiments with our implementation show that the R-tree index-based approach yields a batch anonymization algorithm that is orders of magnitude more efficient than previously proposed algorithms and has the advantage of supporting incremental updates. Finally, we show that the anonymizations generated by the R-tree approach do not sacrifice quality in their search for efficiency; in fact, by several previously proposed quality metrics, the compact partitioning properties of R-trees generate anonymizations superior to those generated by previously proposed anonymization algorithms.

1. INTRODUCTION

The problem of anonymity in published data has been widely studied in recent years. Organizations may release private data for the purposes of facilitating useful data analysis and research; for example, patients' medical records may be released by a clinic to aid a medical study. While such data sharing has its benefits, we must contend with the issue of privacy for those individuals to whom information in the shared data pertains. K-anonymity [24, 25, 26, 29, 30] has been proposed as a means of preserving privacy in data releases. Put simply, the private data set is modified so that each record is indistinguishable from at least k−1 other records. Indistinguishability is defined in terms of any set of attributes that can be used to uniquely identify an individual. This set of attributes has been called a quasi-identifier [7] in the literature. An example of a quasi-identifier is the set of attributes comprising Age, Sex and Zipcode [28]. Figure 1 illustrates how private data can be transformed to preserve anonymity. The 2-anonymous table in Figure 1(b) has three quasi-identifier attributes (Age, Sex and Zipcode) and one sensitive attribute (Ailment). Each record in this table has the same quasi-identifier values as at least one other record.

(a) Private Patient Table

RID  Age  Sex  Zipcode  Ailment
R1   21   M    53706    anemia
R2   26   M    53706    flu
R3   32   F    53710    cancer
R4   36   F    53715    torn ACL
R5   48   M    52108    flu
R6   56   F    52100    whiplash

(b) A 2-anonymous Patient Table

Age      Sex  Zipcode        Ailment
[20–30]  M    53706          anemia
[20–30]  M    53706          flu
[30–40]  F    [53710–53715]  cancer
[30–40]  F    [53710–53715]  torn ACL
[45–60]  *    [52100–52108]  flu
[45–60]  *    [52100–52108]  whiplash

(c) B+-tree index on the Age attribute (diagram not reproduced)

Figure 1: A 2-anonymous representation and database indexing of a patient table.

Since the original definition of k-anonymity, there has been a great deal of research along two orthogonal lines: first, how to refine the definition of k-anonymity to provide different guarantees about the data (for example, augmenting the definition with l-diversity [21]); second, how to efficiently generate anonymizations of data sets that are as precise as possible while still respecting the definition of anonymity (for example, using heuristic algorithms with multidimensional partitioning [19]). Our work is an example of the second class of research.

We present an anonymization algorithm that substantially improves upon previously presented algorithms both with respect to efficiency and with respect to the quality of the anonymization produced. Specifically, the algorithm we present in this paper allows us to anonymize data sets containing at least 100 million records; also, by a recently presented metric for the quality of anonymization ("certainty" [33]), the anonymizations produced by our algorithm are approximately a factor of two better than those of previous algorithms.

The key to our algorithm is to exploit a striking parallel between the "classical" area of database indexing and the relatively new data privacy research domain, k-anonymity. Let T be a table with n quasi-identifier attributes A_1, A_2, ..., A_n. First, we observe that the eventual goal of all k-anonymization algorithms is to transform T by partitioning T into groups of records so that each group contains a minimum of k records. To illustrate the connection between indexing and anonymization, assume that B_T is a B+-tree index on the quasi-identifier attribute A_1. Note that every path from the root node in B_T to a leaf node L produces a set of records in T whose A_1 values satisfy the constraint imposed by the path followed to reach L. Note then that the A_1 values for the records contained in L fall in the range [a, b], where a and b are the left and right separator values in L's parent node that border the "pointer" to L. When we apply this concept to every leaf node in B_T, we can transform T into a new table T_1 by replacing every record's value in A_1 by the appropriate range of values; records in the same leaf node will have the same new A_1 value. Going one step further, a B+-tree index places an occupancy constraint on all nodes in the tree, so every leaf node in B_T must contain between N_min and N_max records. With these properties of B_T, namely an implicit partitioning of the underlying table and a bounded occupancy constraint on all partitions, we get a "k-anonymous" table where k = N_min. Figure 1(c) shows an example of a B+-tree index on the Age attribute of the original table in Figure 1(a), corresponding to the 2-anonymous table in Figure 1(b).

In general, however, the table to be published may contain more than one quasi-identifier attribute, so rather than use B+-trees, we suggest multidimensional spatial indexing, and the R-tree in particular, as the basis for anonymization. In Section 2, we discuss in greater detail how a variant of R-tree indexes provides algorithms for k-anonymizing a multi-attribute table. We discuss R-tree style indexes [5, 10, 27], how these indexes are useful for k-anonymizing a data set, and the properties that give them advantages over other types of indexing mechanisms. In particular, we focus on the variants of the R-tree that do not overlap partitions, for example, the R+-tree [27]. This is due to the universally adopted practice in existing k-anonymization algorithms of generating only non-overlapping partitions in the anonymized data.

The connection between anonymization and spatial indexing is perhaps not entirely surprising, as [22] used a new special-purpose spatial index structure (the "pyramid tree") to anonymize objects moving in the spatial domain. However, to the best of our knowledge, ours is the first work to propose anonymization of non-spatial data by the use of a classical spatial index that is already implemented and distributed in commercial and open-source RDBMS products.

The indexing-anonymizing connection gives us a different perspective in the k-anonymization domain, has several advantages over previously proposed k-anonymization algorithms, and unifies several desired goals for anonymization into a single approach:

• An R-tree index-based approach to k-anonymization furnishes us with efficient index-construction algorithms that enable faster bulk anonymization times than previous techniques, even for memory-resident data sets.

• We further show that applying R-tree bulk-loading algorithms to anonymizing yields anonymization algorithms that perform well even on data sets much larger than main memory. This enables us to anonymize a data set of 100,000,000 records.

• We observe that minimal bounding boxes from the indexing domain [5, 10, 27] suggest anonymizations that leave gaps in the domain. This can yield far more precise anonymizations than previously proposed anonymization techniques, none of which consider leaving gaps. This opens up an interesting and novel aspect of the always-present tension between anonymization and precision that has not been previously explored in the k-anonymization literature.

• Spatial indexes are well-suited to exploit anticipated workloads while anonymizing data sets. Selecting specific quasi-identifier attributes on which to build an index and using biased splitting algorithms are two ways that we can incorporate query workloads into the anonymization.

• A database owner may wish to distribute anonymized tables of different "granularity" to separate groups, reflecting her trust. For example, she may deliver a 5-anonymization of her table to a medical research group while delivering a 10-anonymous version to an insurance research group. Rather than re-anonymize the original table for each group, facing the danger of privacy violation in the presence of collusion, we exploit the tree structure of a spatial index for automatic generation of multi-granular anonymized data sets that preserve k-anonymity.

Since database indexes are specifically designed for record insertions, deletions and updates, by using them for anonymization we automatically get a mechanism for incremental anonymization. However, incremental anonymization raises issues with respect to the preservation of privacy. If an attacker has external knowledge of which individual's records are being inserted, deleted or updated in a data set, then the attacker may be able to issue a series of queries over time and deduce sensitive information. While providing an incrementally updatable anonymization technique does not solve the inference problem, it is a much better platform for updates than current techniques, which could potentially require re-anonymization of the entire data set after each update.

Finally, the index-based approach to anonymization can exploit the efficiency inherent in index update and bulk-loading algorithms. Previous research in k-anonymizing algorithms has focused almost exclusively on the quality of the resulting anonymization, rather than on the speed with which that anonymization is achieved. An exception is the Mondrian algorithm from [19], where the authors present a polynomial time algorithm, thus making it practical to consider anonymizing large data sets. While absolute performance was not the goal of that paper, it is interesting to note that the approach suggested in that paper constitutes a top-down multidimensional spatial partitioning algorithm, whereas spatial index building algorithms represent a bottom-up spatial partitioning approach.

To investigate the quality and efficiency of both approaches, we reimplemented the Mondrian algorithm described in [19] and compared it to bottom-up index-based algorithms. We found that the bottom-up approach gave better quality as measured by the discernibility penalty [4], KL divergence [15] and the "certainty metric" [33]. Furthermore, experiments with our implementation also showed that the bottom-up approach adopted by index bulk-loading algorithms is an order of magnitude faster than the top-down Mondrian approach. It is an interesting area for future research to determine whether this is a fundamental property of all top-down vs. bottom-up approaches.

Note that our use of "top-down" vs. "bottom-up" methods differs from the usage in [33], where the terms refer to two new O(n^2) algorithms. We did not have access to their code in order to compare these new algorithms with ours; however, as the authors of that paper note, their approach is not as efficient as the Mondrian algorithm (6X slower on 100,000 records), and it will not scale to large n (no n^2 algorithm can). Our experiments show that our bottom-up algorithm scales well at least for another factor of 1000 (up to 100 million records). The issue of quality of the anonymizations produced is less clear, since while that paper reports better results than the Mondrian algorithm, it does not consider anything like our "compaction" procedure, which we found essential for good certainty metric scores.

2. R-TREE SPATIAL INDEXING

R-tree spatial indexes bring with them several desirable properties when applied to the problem of k-anonymization.

2.1 Scalability to Large Data Sets

To date, the k-anonymization literature has not considered algorithms for anonymizing data sets that do not fit in memory. Bulk-loading database indexes has almost by definition focused on such data sets. A number of bulk-loading techniques have been proposed for spatial indexes. Some of these techniques require spatial sorting based on space-filling curves [12, 13, 14] (e.g., the Hilbert curve or Z-ordering). While we experimented with such approaches, we found in our implementation that non-sorting bulk-loading techniques based on the "buffer-tree" [2, 6] worked better for higher dimensional data sets.

The buffer-tree is based on the idea of inserting multiple records simultaneously into the tree. Each internal node of the tree has an external buffer where records are temporarily stored. Multiple insertions are processed in the following way. An index node keeps and "blocks" arriving insertions in its buffer. When the number of records in the buffer exceeds a pre-defined threshold, all of the records are "re-activated" and advanced to the next level of the tree. Records are "terminated" when they are inserted into a leaf page.

Figure 2 shows an example of a buffer-tree after a series of insertions have been processed. The buffer-tree consists of three nodes N_1, N_2, N_3 and five leaf pages P_1, ..., P_5. Assume that node buffers contain at most two pages and that a page has a maximum capacity of three records. Consider the insertion of record r_25. Since the root buffer is full, the insertions of the six records in the root buffer are "re-activated" and "pushed down" to the next level of the buffer-tree. After clearing a buffer, it may happen that buffers at the next level also become full. These overflows are again eliminated by clearing these buffers.

Figure 2: Example buffer-tree after inserting 24 records, a full root buffer and record r_25 waiting to be inserted (diagram not reproduced)

A feature of the buffer-tree is that insertions traverse the tree from root to leaf while restructuring operations traverse the tree from the leaf backwards to the root. A restructuring operation consists of a split of an overflowing node (a node whose buffers are full) and an insertion of a new entry in its parent node. A restructuring operation is first triggered by the split of a full leaf page. Just as for record insertions, multiple restructuring operations are also processed simultaneously: an internal node defers an incoming insertion of an entry. When all subtrees of the node have finished their restructuring operations, the entries are then stored in the routing table of the node. This may again produce overflow and further restructuring operations. Figure 3 shows an example of a restructuring process after the records in the root buffer are cleared into the buffers of nodes N_2 and N_3.

Figure 3: Splitting of the index node N_3 after clearing the root buffer (diagram not reproduced)

We can gain insight into the performance of the buffer-tree algorithm by assuming an I/O model with the following parameters:

N = number of records in the input data set
M = maximum number of records that the available memory can hold
B = maximum number of records that a page can hold

The authors in [6] show that the I/O cost for bulk-loading a buffer-tree for a data set of N records is O((N/B) log_{M/B}(N/B)). Thus, buffer-trees achieve similar I/O cost bounds to external sorting. We also expect buffer-trees to have "good" performance when the input data set fits in memory. The buffer-tree amortizes the cost of inserting a set of records by deferring operations on the tree. This contrasts with the tuple-loading approach, which inserts records one by one into the index and typically results in long load times (for example, each newly inserted record may cause a node split that increases the height of the tree index).

Our experiments show reasonable performance for the buffer-tree algorithms in k-anonymizing larger-than-memory as well as memory-resident data sets using a spatial index.

2.2 Incremental Utility

In a dynamic environment, the spatial index is a natural mechanism for allowing changes to be made on a data set while maintaining a k-anonymous view. The prior anonymization algorithms in the literature work on a data set as a complete whole: they start with the complete (non-anonymized) data set as input, and produce a complete anonymized data set as output. If an anonymized data set is to be updated using these algorithms, the only option is to re-anonymize the original data set plus the new data records. This will be inefficient in scenarios with frequent updates.

Again, database indexes are designed to be incrementally updated. One concern that arises is whether the incremental anonymization that results from processing updates one record at a time using an R-tree is of worse quality than one that would result from anonymizing the entire data set at once. Our experiments show that this is not the case: the incrementally updated anonymized data set has quality (measured by discernibility penalty, KL-divergence and certainty penalty) comparable to that of the bulk-anonymized data set.

2.3 Query Performance

In a traditional database setting, the performance of a query on an index is typically determined by factors such as the time it takes to execute the query or how many nodes need to be searched to find all records that satisfy this query. For k-anonymity, we associate query performance with the number of records that are included in the answer that would not satisfy the same query on the original data. This is similar in spirit to the "precision" metric used in the Information Retrieval literature.

Denote W to be the set of leaf nodes (or partitions) in a database index that is searched due to a query Q_1 posed on the original data. Let Q_2 denote the same query on the anonymized data. Note that if a partition P is contained in W, then P is a candidate partition that may contain a record that satisfies Q_1. If P ∉ W, then it is certain that P does not contain any record that satisfies Q_1. Consider, for example, the following query on a table T:

    SELECT COUNT(*)
    FROM T
    WHERE T.Age >= 25 AND T.Age <= 35

If the age interval for P is given as [40–50] then P will not be included in W. On the other hand, if the age interval is [20–30] then P will be included in W. Note that it is still possible that P does not contain any record that satisfies the query. The age values in P may actually range from 20 to 24. Nevertheless, since we have precise record values in the unanonymized data, the candidate partitions will be examined and the relevant records returned. On the other hand, the query Q_2 on anonymized data may return all records in W since we do not have exact information in this case. We define the error for Q_2 as (C_2 − C_1)/C_1, where C_1 and C_2 are the cardinalities of the result sets for Q_1 and Q_2 respectively. Intuitively, the error for Q_2 can be reduced if W contains fewer partitions. The entries in R-tree style indexes are maintained in minimum bounding rectangles (MBRs), giving the minimal extents of their entries. MBRs allow search and range queries on a spatial data set to be executed efficiently. By using MBRs, we increase the likelihood that a partition will not be included in W. In the example given above, using MBRs, the range on the Age attribute for P will be [20–24], thus P will not be included in W.

We note that the exact behavior of queries on anonymized data may differ for different applications. One may choose to take the data distribution into consideration when computing query results. Denote P.Age and Q.Age as the age interval for the partition P and query Q respectively. Denote (P∩Q).Age to be the intersection of the age intervals on P and Q. If Q.Age is (25 ≤ age ≤ 35) and P.Age = [30–40] then (P∩Q).Age = [30–35]. Now, if we assume that the original data set is uniformly distributed on Age then we may compute the result for Q on P as |P| × |(P∩Q).Age| / |P.Age|. If P contains 10 tuples, then the query result for P is 10 × (35 − 30)/(40 − 30) = 5 tuples.

However, regardless of the method used to evaluate query results, these results must be computed based on the set of all partitions, the set W, that may contain a satisfying record.

2.4 Query Workload Bias

Recent works [9, 11, 20, 31] have considered incorporating target query workloads into the anonymization of a data set. We can also tailor a spatial index to take advantage of advance knowledge of the types of workload that will use the anonymized data. Consider a very simple scenario where the majority of the data mining workloads are interested in the single quasi-identifier attribute A_0. If we build the index on A_0 (a one-dimensional index), then the k-anonymized data is clustered on A_0. Under a sort-based bulk-loading scheme, this results in sorting the data set on A_0. For our non-sort spatial index bulk-loading technique, preference is given to a pre-selected subset S of the quasi-identifier attributes when splitting partitions.

Figure 4: Targeting the R+-tree to the Zipcode quasi-identifier attribute of a data set. (a) Unbiased R+-tree anonymized data; (b) Biased R+-tree anonymized data. Both panels plot Age (20 to 23) against Zipcode (53710 to 53713); diagrams not reproduced.

Consider the two sets of anonymized data in Figure 4. A query of the form

    SELECT COUNT(*)
    FROM T
    WHERE T.Zipcode = z

will return more accurate results on the anonymized data generated by the biased R+-tree (see Figure 4(b)) than identical queries on anonymized data generated by the unbiased R+-tree (see Figure 4(a)). For the query type given above and the example data in Figure 4, queries on the biased anonymized data in Figure 4(b) will be twice as accurate as queries on the unbiased anonymized data in Figure 4(a) (assuming that a COUNT query on a partition returns the cardinality of that partition if the query region intersects with the partition). We find, in experiments, that when we build the R+-tree with biased splitting policies for a subset S of the quasi-identifier attributes, queries on S have relatively higher accuracies than queries on anonymized data generated by another R+-tree with an unbiased algorithm. Spatial indexes can hence be useful tools for anonymizing data sets when the types of query workloads that will use the data are known beforehand. Taking a cue from [33], which proposes a weighted certainty penalty metric, a spatial index can also incorporate query workloads into its splitting policies by assigning higher weights to the "more important" quasi-identifier attributes. As a consequence, it benefits the spatial index to split on the more important attributes rather than the less important ones to arrive at a lower penalty score for the new partitions.

3. MULTI-GRANULAR K-ANONYMITY

While a primary goal of the privacy-preserving techniques in k-anonymity is to prevent re-identification of records, it may also be useful to control the precision or granularity of the information that the data owner releases to different entities, to limit the linking abilities of unknown adversaries.

DEFINITION 1. (GRANULARITY) We say that a k-anonymous data set is an anonymous data set of granularity k.

Suppose the administrators of a university hospital have agreed to deliver anonymized medical records to the following three entities: Entity 1, researchers at the same university as the hospital; Entity 2, researchers at a different university; Entity 3, the Internet. One may expect that the hospital administrators place more "trust" in Entity 1 than in Entity 2, Entity 3 being the least trusted. The notion of trust is subjective and may be associated with factors such as the data owner's perception of the target entity's ability or intent to re-identify records. (Data released on the Web is probably more likely in danger of being compromised than data released to a small, local group of researchers.)

We wish to be able to produce multi-granular anonymizations T' = {T'_1, ..., T'_n} of the original data set T while preserving k-anonymity for every individual in T in the presence of an adversary who is able to gain access to more than one anonymized table. Note that each T'_i is an anonymization of the same table T, that is, T is unchanged between anonymizations. While a full discussion of the inference problem that arises from releasing multiple anonymizations of the same data set is beyond the scope of this paper, in what follows we give a condition that, if satisfied, guarantees k-anonymity for any set of multi-granular anonymizations T' of T.

DEFINITION 2. (k-BOUND) We say that the record r_i is k-bound in the original table T if there exists a subset of records R ⊆ T such that |R| ≥ k and, given any partition or equivalence class P in an anonymization of T, if r_i ∈ P, then R ⊆ P.

LEMMA 1. Let T' = {T'_1, ..., T'_n} be a set of k_i-anonymizations of the table T, k_i ≥ k. k-anonymity is preserved over T' if every record r_j ∈ T is k-bound in T.

PROOF. (SKETCH) Given any k_i-anonymization T'_i of T, r_j cannot be linked to fewer than k records (k_i ≥ k) using T'_i alone. An adversary may however be able to circumvent k-anonymity for r_j using the set of anonymizations T' to narrow down the candidate records for r_j to a set C containing fewer than k records.

Pick any record r_j ∈ T. For any new anonymization of T, the adversary may be able to produce a new set of candidate records C' for r_j such that |C'| ≤ |C|, where C is the previous set of candidate records for r_j as determined by the adversary. We know that |C'| ≥ 1 since C' must at least contain r_j. If r_j is k-bound in T, then R is always a candidate set for r_j, so R ⊆ C' and |C'| ≥ k. Thus correlating any new anonymization of T will result in at least k records being candidates for r_j.

One may view a record that is k-bound in a table as being in a group that always "sticks together" in any new k-anonymization of that table.

3.1 A Hierarchical Algorithm for Generating Multi-Granular Anonymized Data Sets

A straightforward approach to generating anonymized data sets of different granularity is to re-anonymize T to obtain T'_1, ..., T'_n. One must then verify that k-anonymity is preserved for each record over all anonymizations of T.

We can, however, take advantage of the tree structure of spatial indexes to generate data sets of different granularity in a way that automatically guarantees that k-anonymity is maintained for the collection of anonymized data sets. This technique exploits Lemma 1 by effectively binding each record r to some pre-determined set of k records.

Let SI be a multi-dimensional spatial index on the original data set with the following properties.

• Leaf nodes in SI contain between k and ck records, for some constant c.

• Internal nodes in SI contain between l and m entries.

From each level in SI, we can automatically generate anonymized data sets of granularity k, lk, l^2 k, ..., l^h k, where h is the height of SI. To generate an anonymized table of granularity l^i k, we map each node N_j at level i to a partition P_j in the anonymized table. The records in P_j are all the records contained in the set of leaf nodes in the subtree rooted at N_j.

Using this hierarchical algorithm for generating multi-granular anonymized data sets guarantees k-anonymity over all anonymizations of T. To see why, pick any record r_i in T and let P_j be any partition from an anonymization of T. If P_j contains r_i, then P_j contains the leaf node L that holds r_i, which lies in the subtree whose root corresponds to P_j. Thus every record in L, and r_i in particular, is k-bound, and k-anonymity follows from Lemma 1.

In generating multi-granular anonymized data sets via the hierarchical algorithm on a spatial index, the data set owner can, at the very least, guarantee the anonymity generated by the leaf nodes in the spatial index. In other words, if the leaf nodes produce a k-anonymous data set (every leaf node contains a minimum of k records), then it can be guaranteed that releasing other data sets at other granularity k_1 > k will not violate k-anonymity. If an adversary manages to obtain multiple versions of anonymized tables, with the goal of re-identifying individuals, she can only recover the information revealed in the finest granular (most precise) anonymized data set in her possession.

3.2 A Leaf Scan Algorithm for Generating Multi-Granular Anonymized Data Sets

In this section, we describe an algorithm that, rather than generate anonymized tables in a hierarchical fashion, utilizes the "sequential ordering" of nodes on the same tree level. The granularity of the data sets that can be generated using a hierarchical algorithm is restricted by the threshold on the minimum number of records in any leaf node in the tree. This minimum occupancy threshold on leaf nodes enforces the property that every subtree contains a minimum number of records. Assume that all nodes (including leaf nodes) in the spatial index SI contain between two and four entries. Then, at best, we can have anonymous data sets of granularity k, 2k, 4k, 8k, ..., 2^h k. Using the hierarchical algorithm, we will be unable to generate, say, a 6-anonymous data set; the best we can do is an 8-anonymous data set. (Of course, by definition of k-anonymity, an 8-anonymous table is also 6-anonymous.) Given a request for a data set of granularity k_1, we can further improve on the hierarchical algorithm by scanning the leaf nodes in order and partitioning the leaf nodes in groups of k_1/k. In our current example, since every leaf node contains at least two records, if k_1 = 6, we scan the leaf nodes, forming groups of three leaf nodes each except for the last group, which may contain between three and five nodes.

Algorithm LeafScan
INPUT: Set of ordered leaf nodes N, granularity parameter k_1
OUTPUT: A new set of partitions S
LS1. S ← empty set of partitions
LS2. while N ≠ ∅
LS3.     P ← empty partition
         while |P| < k_1
             L ← next leaf node in N
             Add all records in L to P
             N ← N − L
LS4.     if the total number of records in the remaining leaf nodes in N is less than k_1, then remove these records from N and add them to P
LS5.     update generalized quasi-identifier values for every record in P
LS6.     S ← S ∪ P. Continue LS2
LS7. Return S

Figure 5: Leaf scan algorithm

Figure 5 shows the leaf scan algorithm for performing multi-granular anonymizations of a data set. Note that since each leaf node may contain between k and ck records, we may be able to form groups containing fewer than k_1/k leaf nodes. The algorithm initializes a new group or partition with the next leaf node; if this group contains at least k_1 records, we are done with this group and start a new one. Otherwise a new leaf is added to the current group. The algorithm stops adding leaf nodes to a group when the total number of records in the group is at least k_1. New groups are iteratively created until the last step: when the number of records in the remaining leaf nodes is less than k_1, we add these leaf nodes to the current and last group, terminating the algorithm. Since the records in a group may span multiple leaf nodes, the algorithm recomputes new generalized quasi-identifier values based on all the records in a group.

We use this approach in our implementation to generate data sets of different granularity. As the results in Section 5 will show, the execution time for anonymizing a data set is independent of the actual anonymity parameter k, since generating an anonymous data set of any granularity requires one full scan of all the leaf nodes (after building the index on a base k value).

By Lemma 1, k-anonymity is also preserved when we generate multi-granular anonymized data sets with the leaf scan algorithm. To see why, we observe that the leaf scan algorithm always forms partitions from whole leaf nodes. If P_j is any partition from an anonymization of T and r_i ∈ P_j, then P_j contains the leaf node partition L that contains r_i; thus r_i is always "bound" to the records in L, and |L| ≥ k.

4. A COMPACTION PROCEDURE

In the process of treating the k-anonymization problem as an indexing problem, we recognized that we could dramatically increase the precision of anonymized data sets by employing some of the techniques for improving query performance on the R-tree style indexes. By using minimum bounding boxes, these spatial indexes leave gaps in the domain, where gaps correspond to spatial portions of the domain that do not contain any record. We thus propose a compaction procedure to increase the precision of an anonymized data set generated by any index, such as the grid file [23], that does not maintain MBRs for its records. Since every k-anonymization algorithm, whether viewed as an indexing technique or not, essentially creates partitions in the original data set, the compaction technique can be retrofitted to previously proposed non-index-based approaches to give dramatic improvements as well.

The goal of the compaction procedure is to regenerate, for each partition P in a k-anonymous data set D, another partition P_1 with a possibly "more precise" description of the records in P. The compaction algorithm is a simple one: it scans each partition P ∈ D and creates the minimum bounding boxes. For each numerical quasi-identifier attribute, the compaction algorithm generates a new range where the endpoints are the minimum and maximum values that occur for records in P. For each categorical attribute, the procedure removes all values from the set that do not occur in P. Where generalization hierarchies are used in place of sets, the procedure chooses the lowest common ancestor in the hierarchy for all the values in P. The old generalized values for every record in P are then replaced with the new, more compact values. Figure 6(b) depicts an example application of the compaction procedure to anonymized data in Figure 6(a).

Figure 6: Applying the compaction procedure to partitions. (a) uncompacted data; (b) compacted data (diagrams not reproduced)

In the rest of the paper, we will refer to the data before applying the compaction process as the uncompacted data and the data after applying the compaction process as the compacted data. A benefit of the compaction procedure is that, when compared to results from the same queries on the original data, query results on compacted data are more accurate than queries on uncompacted data. Due to the introduction of gaps in the anonymized data, queries that would have otherwise returned non-empty result sets for one or more partitions now return results that are more in tune with the original data set. In experiments, we see dramatic improvements in accuracy for queries on compacted data over the same queries on uncompacted data. We should note that the simple nature of the compaction procedure facilitates its application to data generated by any k-anonymization algorithm.

It is also reasonable to expect the execution costs for the compaction process to be relatively small when compared to actual anonymization costs, as its basic operation is a single pass over each partition to determine minimum and maximum values for numerical attributes and minimal sets for categorical attributes. These relatively small compaction costs are verified through experimental results shown in Section 5 by running the compaction procedure on anonymized data generated by a previously proposed k-anonymization algorithm.

This compaction process may lead one to an uneasy feeling that "more is being revealed" than would be revealed if the anonymized data set were not compacted. This is actually true. For example, an adversary can "know" that there is no individual in a "gap" area, something they could not deduce without compaction. This is an example of the tension between anonymization procedures and data utility. But this is really an issue in all k-anonymization research.

For example, the discernibility penalty [4] rewards anonymization procedures that do a good job of putting no more than k data points in a partition. This reveals more information than another anonymization that has partitions with more than k data points. To see this, suppose that anonymization A puts k' > k data points in a number of partitions, but anonymization B puts only k in each partition. Then with anonymization B, the "attacker" knows for every data record that the sensitive value associated with that record must be one of the alternatives appearing in the k records in the partition. On the other hand, with anonymization A, for any data element in a "large" partition, the attacker only knows that the data record has a sensitive value among the values found in the k' > k records in the partition, which in general may be a larger set. Another way of putting this is that if a procedure reduces the number of records in a partition from k' to k, it has "revealed" information, but the discernibility penalty says that the quality of the anonymization has improved. In an extreme case, if k' = N, the number of elements in the data set, the attacker gains almost no information from the anonymized data; but of course, then the anonymized data is not useful for any non-trivial analysis. The discernibility penalty tries to "penalize" the anonymization algorithm for this; one way of viewing it is that it tries to encourage disclosing as much information as possible while still not violating k-anonymity.

The recently proposed "certainty metric" [33] has the same character. It rewards anonymization procedures for creating partitions with small perimeters.

Category          Description
Compiler          gcc 3.2.2
Operating system  Tao Linux release 1
CPU               Intel Pentium 4 3.00 GHz
Memory            1 GB
Hard Disk         Seagate ATA/ATAPI-6

Table 1: System configuration

[...] sign records to regions as they are processed. By comparison, the polynomial-time algorithm suggested in the Mondrian paper [19] can be viewed as top-down, because its first step is to partition the entire space, then to partition the resulting sub-regions, and so forth. Although the goal of the Mondrian algorithm was not absolute performance, we implemented their top-down algorithm to compare the efficiency and quality of the top-down approach with the bottom-up approach inherent in spatial indexing. We are grateful to the authors of that paper for providing us with a copy of their Java prototype implementation of the algorithm, as well as the data sets they used in their experiments.

The first data set we used was a real world data set, the Lands End data set. The configuration for this data set was identical to [18]. The Lands End data set contained customer sale information and 4,591,581 records. It had eight attributes comprising zipcode, order date, gender, style, price, quantity, cost and shipment. Unlike [18] however, hierarchical constraints were eliminated by imposing an intuitive ordering on the values for each categorical attribute in the data set. Each record in the resulting data was 32 bytes and the entire data set was approximately 147 MB in size. The second data set was synthetically generated and had nine attributes comprising salary, commission, age, education level, car, zipcode, house value, house years and loan. The configuration for the synthetic data was based on the generator introduced in [1]. We generated 100 million records, each record was 36 bytes, resulting in a data set size of 3.6 GB.

We built the R+-trees on all eight and nine attributes for the Lands End and synthetic data sets respectively (every attribute was part of the quasi-identifier). As a result of the numerical recoding
Intuitively,asmallerperimeterforaparti- on the original data sets, the schema for an anonymized table is tionPmeansthatthequasi-identifierfortheelementsinPcanbe as follows: each quasi-identifier value for a recordt in the origi- knownfromtheoutsidetoberestrictedtoasmallersetofvalues naldatasetisreplaced,intheanonymizeddata,bytheinterval,on than would be the case if the perimeter were larger. Once again, thatquasi-identifier,oftheMBRcontainingt. Asaconsequence, thegoalistorevealaspreciseinformationaspossiblewithoutvi- wewerealsoabletoperformqueryexperimentsdescribedinSec- olatingk-anonymity. Our“shrinking”or“compaction”procedure tion5.4byspecifyingnumericalrangesinthequerypredicates. is yet another step in this direction. It tries to bound partitions Table1givesadescriptionofthesystemconfigurationsusedin ofk-elementsastightlyaspossiblewhilestillnotviolatingthek- allexperiments. anonymityrequirement. We note that recently there has been work on augmenting the 5.1 PerformanceEvaluation definitionsofk-anonymitytoprovidestrongerguarantees. Forex- We used the Lands End data for the first set of experiments to ample,inl-diversity[21],thek-anonymityrequirementisextended comparerunningtimesfortheR+-treebulk-loadingtoatop-down torequireacertaindegreeofdiversityinthesensitivevaluesofthe multi-dimensionalpartitioningapproach. Wealsoranexperiments recordsinapartition.Ourshrinkingprocedureisorthogonaltothis to evaluate incremental anonymization performance for the R+- kind of requirement — whatever the requirement, it tries to find tree. Fortheseexperiments, both algorithms were each allocated thesmallestboundingboxonthek-elementsthatstillsatisfiesthe amaximumbuffersizeof256MB.Eachexperimentwascarried requirements(k-anonymity,ork-anonymityandl-diversity.) outfivetimeswhileflushingthesystemfilebuffersbetweenruns, Wearguethatthisisthecorrectwaytodealwithinformationdis- andwereportaveragecoldrunningtimes. 
closureinthek-anonymousframework.Ifonethinkstoomuchin- Figure7(a)showsexecutiontimesforR+-treebulkloadandtop- formationisbeingrevealed,oneshouldstrengthentherestrictions downapproachontheLandsEnddatasetfordifferentanonymity onthedefinitionofwhatconstitutesanallowablepartition(forex- levelsk=5,10,25,50,100,250,500,1000.Askincreases,theex- ample,addingl-diversitytok-anonymity)ratherthantrustthatthe ecutiontimesforthetop-downalgorithmdecreasessincethereare anonymizationprocedurewillonlygenerate“loose”or“imprecise” fewerrecursivepartitioningsteps.Resultsshowthatthespatialin- partitioningsinsomeuncontrolledway. Toreiterate,ourphiloso- dexingapproachconsistentlyoutperformsthetop-downtechnique phyisthatthedefinitionofwhatisanallowablepartitionistaken suggestingthattheformerismoreefficientthanatop-downrecur- asinput;thegoalofananonymizationprocedureistoproducethe sivepartitioningschemeevenforbulkanonymization. Noticethat “best” or “most precise” partitioning that respects the definition. theexecutiontimesfortheR+-treeisindependentofk.Thisisdue Thatiswhattheshrinkingprocedureattemptstodo. tothefactthatwechooseabasek forthebulkloadprocess. For theseexperiments, weselectedbasek =5. Fortheactualinputk 5. EXPERIMENTSANDRESULTS parameter,weusedtheleafscanalgorithmdescribedinSection3.2 Wecarriedoutexperimentsontwodatasetsforempiricaleval- toconstructthefinalpartitions.Forexampleifk=5,thenthemap- uation of k-anonymization with spatial indexes. A spatial index pingisoneormoreleafnodestoonepartition. Ifk=10,thenwe bulk-loadingalgorithm,suchasthebuffer-treealgorithmweused, maptwoormoreleafnodestoonepartition. 
canbeviewedasabottom-upalgorithmbecauseitattemptstoas- Westarttheincrementalanonymizationexperimentsbyfirstbulk- 752 Lands End database Lands End database 110 2.6 100 R+−tree 2.5 Top−down partitioning 90 2.4 s) s) ec 80 ec2.3 s s e ( 70 e (2.2 m m n ti 60 n ti2.1 o o uti 50 uti 2 c c e 40 e1.9 x x e e 30 1.8 20 1.7 10 1.6 5 10 25 50 100 250 500 1000 0.5 1 1.5 2 2.5 3 3.5 4 4.5 anonymity level k total number of anonymized records (millions) (a) Anonymizationexecutiontimecomparison (b)R+-treeincrementalanonymizationtimesus- ingbatchsizeof0.5millionrecords(k=10) Figure7:ExecutiontimesontheLandsEnddatabase Synthetic data Synthetic data 103 104.4 s) sec102 e ( s104.3 m O n ti sk I ecutio101 di104.2 x e 100 104.1 0.036 0.18 0.36 0.9 1.8 3.6 32 64 128 256 512 data size (GB) Allocated memory (MB) (a)Executiontimesforvarieddatasetsizes,allo- (b) IOcosts,datasetsize=3.6GB catedmemory=256MB Figure8:R+-treeanonymizationscalingtolargedatasets loadingandanonymizingthefirst0.5millionrecordsfromtheLands (about180MB).Figure8(a)showsthatouralgorithmsadaptgrace- Enddataset. Thenwesubsequentlyselectnewbatchesof0.5mil- fullyasthesizeoftheinputdataisincreased. lion records to be anonymized. Figure 7(b) shows the running Wealsoconductedexperimentstoascertaintheeffectofavail- times for the R+-tree for incrementally anonymizing each batch. ablecomputermemoryontheperformanceofouralgorithms.Fig- Sinceatop-downapproachisnotincremental,itwouldhavetore- ure8(b)showsthetotalnumberofexplicitI/Osystemcallsmade anonymizetheentiredatasetoneachbatchinsert. duringtheanonymizationprocesswhilevaryingthesizeofmem- ory allotted to the process. 
These results show that I/O costs in- 5.2 Scalability crease by less than a factor of two when the allotted memory is Thesyntheticdatasetwasusedforthesecondsetofexperiments reducedbyafactoroftwo, alsoindicatingthatthebuffertreeal- toshowthatourspatialindexingtechniquesscalewelltothesize gorithmperformswellasaspatialindexingbulk-loadingalgorithm of the input data set. We did not test the top-down approach for forlarger-than-memorydatasets. scalingtolarger-than-memorydatasetsbecausetheversionofthe 5.3 AnonymizationQualityExperiments Mondrianalgorithmdescribedin[19]wasnotdesignedtohandle suchdatasets. Since anonymization quality is an essential property of any k- WeusedtheR+-treetok-anonymizedatasetsofdifferentsizes anonymizationalgorithm,weconductedathirdsetofexperiments rangingfromonemillion(0.036GB)to100millionrecords (3.6 —qualityexperimentsonanonymizeddatageneratedbytheR+- GB). We evaluate how the spatial index performs when the data treeandthetop-downapproachproposedin[19]. Wealsoevalu- setistoolargetofitinmainmemory. Weallotted256MBtothe atedthequalityofanonymizeddataafterexecutingcompactionas anonymizationprocessandfromtheallottedmemory,weuse4MB apost-processingstepondatageneratedbythatapproach. to buffer the input data and the remaining memory was used for In choosing algorithms with which to compare our algorithm, theindexbuffers.Theindexwasabletoholdamajorportionofthe we were guided by the following considerations. First, since we recordsinitsbufferswhenthedatasetislessthan5millionrecords hadaccesstothecodeoftheMondrianalgorithm,itwasalogical 753 subtree of H for which the node for the generalized valuet.Ai is root.|T.Ai|isthetotalnumberofleafnodesinH. DEFINITION 5. (KL-DIVERGENCE). The KL divergence for T, KL(T)=(cid:229) p(1)logpt(1) where p(1) and p(2) are the prob- t∈T t pt(2) t t abilities of the tuple t according to the original and anonymized tablesrespectively. (1) See[15]formoredetaileddescriptionsonthecomputationof p t (2) andp . 
t Figure10showsthatexploitingtheminimumboundingrectan- gle property of the R+-tree clearly pays off as it generates much betteranonymizeddataqualitythanthetop-downapproachwith- Figure9:Compactioncostasapercentageoftotalanonymiza- outcompaction.TheMondrianpartitioningisbasedontheheuris- tionexecutiontime(k=10) ticthatitshouldsplitthequasi-identifierattributewiththelargest range of values, whereas the R-tree splits by trying to minimize the area of the resulting partitions. For the data set used in our candidateforcomparison. Second,wedidnotconsidercomparing experiments, minimizing area results in better discernibility, KL- againstanyalgorithmthathadalreadybeenshowntoproducelower divergenceandcertaintypenalty.ResultsinFigures10(b)and10(c) qualityanonymizationsthanMondrian.Thiseliminated[4,18,24]. show that after compaction, the anonymized data quality for the Finally, since our focus in this paper is on scalable anonymiza- top-down approach greatly improves, suggesting that one of the tionalgorithms,wedidnotconsideralgorithmswithrunningtimes maindifferencesbetweentheanonymizationqualityproducedby greaterthanO(n2).Thiseliminated[4,11,18,24,26,33]. the top-down and spatial index approaches arises from the com- First,wemeasuredtherunningtimesforcompactionrelativeto pactionimplicitinthespatialindex. totalanonymizationtimeswhilechoosingsamplesofdifferentsizes WealsoobservefromFigure10(a)thatthediscernibilityscores rangingfrom0.5millionto4.5millionrecordsfromtheLandsEnd for uncompacted and compacted Mondrian-anonymized data are data set. Figure 9 shows that the times for compaction are small identical.Thecompactedanduncompactedanonymizeddatacom- relativetotheanonymizationtimes. 
prisethesamepartitionswiththesamenumberofrecordsperpar- Weusedthediscernibilitymetric[4],KLdivergence[15]andthe tition,althoughtheformermaydescribethesepartitionswithmore certaintymetricproposedin[33]tomeasureandcomparequality precisiononthequasi-identifierattributes.Asaconsequenceofthe withanonymizeddatageneratedfromtheLandsEnddatasetbythe discernibilitypenaltymeasuringqualitybasedonthecardinalities top-downapproachproposedin[19]. Beforecontinuingwiththe ofpartitionsalone,thismetricisunabletomeasuredifferencesbe- anonymization quality experiments, we define these three quality tweencompactedanduncompactedanonymizeddata. Thisshows measures. thatthecertaintypenaltyandKL-divergencemeasurescanidentify LetT beak-anonymoustablewithmquasi-identifierattributes, differences in quality in data anonymization that escape the dis- A1,...,Am,andcomprisingnpartitionsorequivalenceclasses, cernibilitypenalty. Ofcourse,whilewedidnotseeanyinourex- P1,...,Pn. periments,itisalsopossiblethatthereareexamplesofanonymiza- tions in which certainty penalty detects differences that are also DEFINITION 3. (DISCERNIBILITY PENALTY). The discerni- missedbyKL-divergenceandviceversa,inwhichcasethesemet- bilitypenaltyscoreforT,DM(T)=(cid:229) ni=1|Pi|2 rics are incommensurate and future study would be warranted to determinethestrengthsandlimitationsofeach. Thediscernibilitymeasure,ineffect,assignsapenaltytoeachtuple Sinceabenefitofspatialindexesisthatitisincremental,wealso t∈T basedonthesizeofthepartitionPi,giventhatt∈Pi. conductedteststoascertaintheeffects,onquality,ofanonymizing new records incrementally as opposed to a re-anonymization ap- DEFINITION 4. (CERTAINTYPENALTY).Thecertaintypenalty proach.(Earlierwediscussedtheimpactofincrementalanonymiza- scoreforT is tion on the efficiency of anonymization.) 
For the R+-tree index, (cid:229) CM(T)= NCP(t) wefirstloadedaportionofthedata(500,000records),anonymized t∈T them,thenincrementallyanonymizedadditionalbatchesof500,000 whereNCP(t)istheweightednormalizedcertaintypenaltyassigned recordsbyinsertingtheserecordsintotheindex.ForMondrian,we tothetuplet. reanonymized the entire set of records. Results from our experi- mentsvalidateourassertionthattheR+-tree’stechniqueofadding new records to the index in such a way that the total area cov- (cid:229)m |t.Ai| eredbytheresultingMBRsisminimizedisappropriateformain- NCP(t)=i=1(cid:16)wi×|T.Ai|(cid:17) tqauinaliintyggdoooeds ndoattasquuffaelrityfr.omFigiunrcere1m1esnhtaolwasntohnaytmanizoantyiomni,zeindfdaactta, wi istheweightassociatedwiththequasi-identifierattributeAi theR+-treeanonymizeddataisstillofhigherqualitythanthere- toreflectitsimportanceintheanonymizeddata.Inourqualityex- anonymizeddata. periments,wesettheweightsforallquasi-identifierattributesto1. 5.4 QueryAccuracy IfAiisanumericalattributethen|t.Ai|istherangeofthegeneral- izedvaluefortonAi.Forexample,ift.Age=[20–30],then|t.Ai| SincetheKL-divergenceandcertaintypenaltymeasuressuggest =10. |T.Ai|istherangeofalltuplesinT onAi. OtherwiseAiisa thatcompacteddataisatleastas“good”or“better”thanuncom- categoricalattributeandassumingtheexistenceofageneralization pacteddata,wecarriedoutafourthsetofexperimentstoevaluate hierarchy tree H for Ai, |t.Ai| is the number of leaf nodes in the therelativeaccuracyforqueriesonbothtypesofanonymizeddata 754 (a) Discernibilitypenalty (b) KLDivergence (c) Certaintypenalty Figure10:QualitycomparisonsontheLandsEnddatabase (a) Discernibilitypenalty (b) KLDivergence (c) Certaintypenalty Figure11:IncrementalqualitycomparisonsontheLandsEnddatabase(k=10) ontheLandsEnddataset.Wealsocompareaccuracyforthesame queriesontheR+-treeanonymizeddata. 
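As a rough illustration of why compaction lowers the certainty penalty but leaves the discernibility penalty unchanged, the sketch below applies Definitions 3 and 4 to a toy table with two numerical attributes. It is a minimal Python sketch, not the authors' implementation. The function names (`compact`, `discernibility`, `certainty`), the sample records, the "loose" boxes standing in for the output of a top-down partitioner, and the fixed attribute domain of width 64 used for $|T.A_i|$ are all assumptions made for illustration. All weights $w_i$ are set to 1, as in the experiments.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]            # one numeric value per quasi-identifier
Box = List[Tuple[float, float]]       # one (low, high) interval per attribute


def compact(partition: Sequence[Record]) -> Box:
    """Shrink a partition to its minimum bounding box (per-attribute min/max)."""
    dims = len(partition[0])
    return [(min(r[i] for r in partition), max(r[i] for r in partition))
            for i in range(dims)]


def discernibility(partitions: Sequence[Sequence[Record]]) -> int:
    """DM(T): sum of squared partition sizes (Definition 3)."""
    return sum(len(p) ** 2 for p in partitions)


def certainty(partitions, boxes, domain) -> float:
    """CM(T): sum of NCP(t) over all tuples, with every weight w_i = 1."""
    total = 0.0
    for part, box in zip(partitions, boxes):
        ncp = sum((hi - lo) / (d_hi - d_lo)
                  for (lo, hi), (d_lo, d_hi) in zip(box, domain))
        total += ncp * len(part)      # every tuple in a partition shares its box
    return total


domain = [(0, 64), (0, 64)]           # assumed attribute ranges |T.A_i|
partitions = [
    [(8, 4), (12, 20), (16, 12)],
    [(40, 48), (48, 56), (56, 52)],
]
# Hypothetical uncompacted boxes: the domain split once on the first attribute.
loose_boxes = [[(0, 32), (0, 64)], [(32, 64), (0, 64)]]
tight_boxes = [compact(p) for p in partitions]

print(discernibility(partitions))                    # 18, with or without compaction
print(certainty(partitions, loose_boxes, domain))    # 9.0
print(certainty(partitions, tight_boxes, domain))    # 2.25
```

Because compaction changes only the boxes and never the membership of a partition, `discernibility` cannot see it, while `certainty` drops from 9.0 to 2.25 on this toy input.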
We executed 1000 randomly generated range queries over all eight attributes of the Lands End data set. These queries were of the form

    SELECT COUNT(*)
    FROM T
    WHERE T.A1 ≥ a1 AND T.A1 ≤ b1
      AND ...
      AND T.A8 ≥ a8 AND T.A8 ≤ b8

Note that in the above eight-dimensional range query, each of the eight attributes $A_1, \ldots, A_8$ has both an upper and a lower bound on its range. These bounds are set in the following manner: for each query, we pick two records $r_1$ and $r_2$ at random from the unanonymized data set and set each $a_i$, $i = 1, \ldots, 8$, to the smaller of $r_1.A_i$ and $r_2.A_i$, and $b_i$ to the larger value.

A COUNT query Q on the anonymized data set T returns a count of the records in T that match Q. A record $r \in T$ is said to match the query Q if the region defined by r has a non-null intersection with the query region, that is, r intersects with Q on every quasi-identifier attribute. For example, if T has two quasi-identifier attributes, Age and Zipcode, the record r = ([40–50], [53710–53720]) matches the query Q = ((45 ≤ age ≤ 55) ∧ (53700 ≤ zipcode ≤ 53715)). The record r = ([30–35], [53700–53715]) does not match this query. Queries on the original, unanonymized data set work in the traditional way: r matches Q on the original data if the point defined by r lies within the query region.

We compute the error for the query Q, Error(Q), as the normalized difference between the results of evaluating Q on the original and anonymized data sets:

$$Error(Q) = \frac{count(anonymized) - count(original)}{count(original)}$$

In our experiments, we report the average normalized error over the batch of 1000 queries. Figure 12(a) shows the average errors for different k values. The accuracy for queries on compacted-anonymized data is seen to improve over the accuracy for the same queries on uncompacted-anonymized data. Queries on the R+-tree anonymized data have even smaller errors than on the Mondrian-compacted anonymized data, suggesting that the spatial index does a good job of partitioning the original data even for an arbitrary query workload.

Figure 12(b) shows results when we vary the selectivity of the query on the data set. In general, the larger the cardinality of the query result, the smaller the error generated by running the query on anonymized data versus the original data. This will in turn tend to diminish differences between the accuracy of query results generated by queries run over data anonymized by different anonymization procedures. Even the benefit of compaction is seen to drop for large query results.

To show that we could indeed use a spatial index to integrate anticipated workloads into the anonymization, we constructed a query workload that generated random queries on the Zipcode attribute of the Lands End data set. These queries were of the form

    SELECT COUNT(*)
    FROM T
    WHERE T.Zipcode ≥ z1 AND T.Zipcode ≤ z2

The bounds for a range query on the Zipcode attribute are set in the following manner: we pick two records $r_1$ and $r_2$ at random from the unanonymized data set and set $z_1$ and $z_2$ to $r_1$.Zipcode and $r_2$.Zipcode such that $z_1 \leq z_2$.

We also built another R+-tree, but this time, biased our node
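The matching rule and error measure used in the query accuracy experiments can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' code: the names `matches`, `count` and `error` are hypothetical, intervals are treated as closed, every anonymized record is represented by the intervals of its partition, and the six sample records in the error example are invented for illustration. The first two checks reuse the Age/Zipcode example from the text.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]      # closed interval (low, high)
Region = Sequence[Interval]         # one interval per quasi-identifier


def matches(region: Region, query: Region) -> bool:
    """A record matches if its region intersects the query on every attribute."""
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(region, query))


def count(regions: Sequence[Region], query: Region) -> int:
    """COUNT(*) over records, each described by a region."""
    return sum(1 for r in regions if matches(r, query))


def error(points, regions, query) -> float:
    """Normalized difference between anonymized and original counts."""
    # An original record is a point, i.e. a degenerate interval (v, v).
    original = count([[(v, v) for v in p] for p in points], query)
    if original == 0:
        raise ValueError("query selects no original records")
    return (count(regions, query) - original) / original


# Example from the text: Q = (45 <= age <= 55) AND (53700 <= zipcode <= 53715)
q_text = [(45, 55), (53700, 53715)]
print(matches([(40, 50), (53710, 53720)], q_text))  # True
print(matches([(30, 35), (53700, 53715)], q_text))  # False

# Invented data: six records in two partitions of three records each.
points = [(42, 53712), (47, 53711), (49, 53718),
          (31, 53705), (33, 53710), (34, 53702)]
regions = [[(40, 50), (53710, 53720)]] * 3 + [[(30, 35), (53700, 53715)]] * 3

# Query bounds taken from two records, as in the experiments:
# r1 = (47, 53711), r2 = (49, 53718)
query = [(47, 49), (53711, 53718)]
print(error(points, regions, query))  # 0.5
```

Since the query bounds come from two actual records, the original count is never zero in the experimental setup; the explicit check only guards the sketch against arbitrary queries.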

VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007.