PRACTICAL K-ANONYMITY ON LARGE DATASETS

By
Benjamin Podgursky

Thesis
Submitted to the Faculty of the
Graduate School of Vanderbilt University
in partial fulfillment of the requirements
for the degree of
MASTER OF SCIENCE
in
Computer Science

May, 2011
Nashville, Tennessee

Approved:
Professor Gautam Biswas
Professor Douglas H. Fisher


To Nila, Mom, Dad, and Adriane


ACKNOWLEDGMENTS

I thank my advisor Gautam Biswas for his constant support and guidance both after and long before I started writing this thesis. Over the past four years he has helped me jump headfirst into fields I would otherwise have never known about.

I thank my colleagues at Rapleaf for introducing me to this project. I am always impressed by their dedication to finding and solving challenging, important and open problems; each of them has given me some kind of insight or help on this project, for which I am grateful. Most directly I worked with Greg Poulos in originally tackling this problem when I was an intern during the summer of 2010, and without his insight and work this project could not have succeeded.

I owe gratitude to those in the Modeling and Analysis of Complex Systems lab for their patience with my (hopefully not complete) neglect of my other projects as I tackled this thesis.

Each of my professors at Vanderbilt has guided me in invaluable ways over my time here. I want to thank Doug Fisher for helping me figure out clustering algorithms, Larry Dowdy for making me think of every problem as a distributed problem, and Jerry Spinrad for showing me that everything is a graph problem at heart.

Of course, without the consistent encouragement, support, and prodding of my family and those close to me I would never have gotten this far, and this work is dedicated to them.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
I. INTRODUCTION
    Privacy Preserving Data Publishing
    Goals and Contributions of Thesis
    Organization of Thesis
II. BACKGROUND
    Privacy Model
        Identity Disclosure
        Sensitive Attribute Disclosure
    Data Models
        Numeric Data
        Categorical Data
        Hierarchical
        Data with Utility
    Quality Metrics
        Sum of suppressed entries
        Domain generalization hierarchy distance
        Discernibility
        Classification
        Loss metric
        Cluster diameter
        Ambiguity metric
        Entropy based metric
        Workload based quality metrics
    Algorithms
        Anonymization Models
        Multidimensional recoding
        Complexity results and bounded approximations
    Summary
III. PROBLEM DEFINITION
    Data Model
    Adversary Model
    Quality Metrics
    Scale
IV. ALGORITHMS
    Anonymity algorithms
    Approximate Nearest Neighbor Search
V. EXPERIMENTS
    Adults Dataset
        LSH vs Random Sampling vs Full Clustering
        Iterative vs Single-Pass Clustering
        Mondrian multidimensional heuristics
        (K,1) anonymity algorithms
        Overall comparison
    Synthetic Data
    Targeting Dataset
VI. DISCUSSION AND CONCLUSIONS
REFERENCES


LIST OF TABLES

Table
1  Example dataset vulnerable to de-anonymization
2  Example publicly available data
3  Published sensitive dataset vulnerable to join attack
4  3-anonymous released dataset
5  Example anonymized numeric data
6  Another example dataset
7  Data anonymized via global recoding
8  Data anonymized via local recoding
9  Hierarchical categorical professions
10  Enumerated hierarchical data
11  Attributes with assigned numeric utility
12  Records with assigned numeric utility
13  Percent of utility retained on targeting dataset
14  Percent of attribute instances retained on targeting dataset
15  Algorithm runtime on targeting dataset


LIST OF FIGURES

Figure
1  2-anonymous and (2,1) anonymous views of data
2  A simple generalization hierarchy
3  A possible different generalization hierarchy
4  Example generalization lattice
5  Figure from [20] showing multidimensional anonymizing over 2 dimensions
6  An example generalization hierarchy
7  The utility of k-anonymized solutions using clustering with random sampling vs clustering with LSH guided sampling
8  The runtime of clustering anonymizers using random sampling vs clustering with LSH guided sampling
9  Utility of clustering anonymized solutions using clustering with random sampling vs clustering with LSH guided sampling
10  Runtime of clustering anonymizers using random sampling vs clustering with LSH guided sampling
11  Utility of clustering anonymized solutions using single-pass clustering vs iterative clustering algorithm
12  Runtime of clustering algorithms using single-pass clustering vs iterative clustering algorithm
13  Utility of Mondrian anonymized solutions using an information gain heuristic vs a largest-value heuristic
14  Runtime of Mondrian using an information gain heuristic vs a largest-value heuristic
15  Utility of (k,1)-anonymized solutions
16  Runtime of (k,1)-anonymization algorithms
17  Side-by-side comparison of algorithm utilities on the Adults dataset
18  Side-by-side comparison of algorithm runtime on the Adults dataset
19  Attribute frequency distributions
20  Side-by-side comparison of algorithm utility on a synthetic dataset
21  Side-by-side comparison of algorithm runtime on a synthetic dataset


LIST OF ALGORITHMS

Algorithm
1  Mondrian: multidimensional partitioning anonymization
2  (k,1) anonymization algorithm
3  MONDRIAN: Multidimensional partitioning anonymization
4  CLUSTER: approximate clustering algorithm
5  ITER CLUSTER: iterative clustering algorithm
6  TWO PHASE: two-phase aggregation algorithm
7  merge: merge procedure
8  split: split procedure
9  K1 ANON: approximate (k,1) algorithm
10  preprocess: generate the nearest neighbor hash
11  find nn: find the nearest neighbor of a point
12  build hash: generate an LSH hash function
13  hash record: hash a record into an LSH bucket


CHAPTER I

INTRODUCTION

As we spend more of our time online in information-rich and personalized environments, it becomes increasingly easy for details from our offline lives to meld with our online presence. Through Facebook and other social networks, our preferences in friends, food, and games become visible to others; LinkedIn makes our employment history and professional colleagues public information; even public Pandora profiles reveal our detailed tastes in music. Records stored offline contain just as rich a bank of data; public voter records show who is likely to vote, and purchase histories show what someone has bought (and is likely to buy again).

As this data becomes increasingly easy to access and collect, many web providers take advantage of the ability to analyze this data and use it to tailor a person's online experiences to their specific interests. The question becomes, how should one act on this data in a responsible manner? Not many companies have found a foolproof way to do this. Most recently, Facebook and Myspace have received public and media criticism for passing demographic and interest data to advertisers. An intuitive way to balance the need for privacy with the need for customization is to use an anonymized data release to customize web experiences. It is reasonable to personalize a website based on a person's probable interests, as long as the website cannot determine the person's real identity (unless they intentionally log in).

Unfortunately, releasing data and ensuring anonymity at the same time is very difficult. To keep a user's identity unknown, it is not enough to strip data of Personally Identifying Information (PII). Even non-PII data can still be traced back to a single person if enough personal microdata is provided. For example, age, income, job, marital status, and movie interests can often identify a single individual [29].
A number of companies have inadvertently released insufficiently anonymized datasets in past years. Netflix, an online movie rental company, uses movie rental and ranking records to suggest movie rentals to users, and sponsored a data-mining competition where teams competed to beat Netflix's recommendation engine. The second phase of this competition was canceled out of privacy concerns when it was discovered that many individuals could be identified by a combination of movie recommendations [14]. AOL previously released a large set of search queries without realizing that most of the queries could be traced back to a single individual via personalized queries [26].

As a result, many companies and privacy researchers have been forced to look for ways to ensure the anonymity of their data while making sure that the relevant information is retained, so that the data still lends itself to statistical inference, data mining, and online personalization applications.

Rapleaf is a startup which specializes in web personalization [2]. If a website is able to get clues about the interests or demographics of a user, it can tailor the experience to that person. To allow a website to learn about that individual, non-personally identifying microdata can be stored in a cookie on the user's browser. The cookie contains basic non-sensitive information like age, gender, and interests.

The fact that this targeting information is being served to websites and advertisers leads to concerns about user privacy; for privacy purposes, the microdata being served about an individual should not allow the receiver of the data to link a browser to an individual. There is no harm in associating a user with the fact "Enjoys basketball"; instead, the fear is that if enough of these simple pieces of data are stored, the combination will uniquely identify an individual. Unfortunately, the most specific information, and therefore the most de-anonymizing, is often the most valuable and useful data (it is somewhat useful to know if a person is male, but very interesting to know that he drives a Ferrari).
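The concern described above, that individually harmless attributes become identifying in combination, can be illustrated with a small sketch. The records, attribute names, and the helper `count_unique` are hypothetical and chosen only for illustration; they are not taken from the targeting dataset studied in this thesis.

```python
from collections import Counter

# Hypothetical microdata records; attribute names and values are made up.
records = [
    {"gender": "M", "age": "25-34", "interest": "basketball", "car": "sedan"},
    {"gender": "M", "age": "25-34", "interest": "basketball", "car": "sedan"},
    {"gender": "M", "age": "25-34", "interest": "basketball", "car": "Ferrari"},
    {"gender": "F", "age": "25-34", "interest": "basketball", "car": "sedan"},
    {"gender": "F", "age": "25-34", "interest": "cooking", "car": "sedan"},
    {"gender": "F", "age": "35-44", "interest": "cooking", "car": "sedan"},
]


def count_unique(records, attributes):
    """Count records whose combination of values on `attributes` is shared by no other record."""
    combos = Counter(tuple(r[a] for a in attributes) for r in records)
    return sum(1 for r in records if combos[tuple(r[a] for a in attributes)] == 1)


# Each added attribute can only split groups further, so the number of
# uniquely identifiable records never decreases as attributes are added.
attributes = ["gender", "age", "interest", "car"]
for n in range(1, len(attributes) + 1):
    subset = attributes[:n]
    print(subset, "->", count_unique(records, subset), "unique record(s)")
```

In this toy data no record is unique on gender alone, but every added attribute isolates more records, and the single Ferrari driver is isolated from two otherwise identical records by that one specific value.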
Privacy Preserving Data Publishing

Fortunately, the field of research on privacy preserving data publishing studies exactly this problem. This field studies how to publish data in a way that maintains privacy for the individuals whose records are being published, while keeping the released dataset rich enough that it is useful for data-mining purposes.

One popular strategy for maintaining privacy in a released dataset is simply to ensure that the dataset remains anonymous. K-anonymity was the first carefully studied model for data anonymity [36]; the k-anonymity privacy assurance guarantees that a published record can only be identified as one of no fewer than k individuals. The k-anonymity problem has traditionally been researched from the perspective of sensitive data disclosure; a commonly cited domain is that of medical records released for data mining purposes, where it is important to be able to link diseases to demographic data without revealing that an individual has a particular disease.

Most research on data disclosure focuses on protecting individuals from the release of sensitive attributes which could be embarrassing or harmful if released. Traditional examples of sensitive data include medical records and criminal records. However, the objective here is somewhat different; the objective is to prevent a user from being identified by their published data, and none of the targeted information is considered to be sensitive data.
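The k-anonymity guarantee stated above is commonly checked by grouping released records on their identifying attributes and requiring every group to contain at least k records. The following is a minimal sketch of that check, not one of the anonymization algorithms developed in this thesis. The function name `is_k_anonymous`, the term "quasi-identifiers" for the grouping attributes, the use of "*" for a suppressed value, and the sample table are all assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that appears in `records` is shared by at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical released table; "*" marks a suppressed (withheld) value.
released = [
    {"age": "20-29", "gender": "*"},
    {"age": "20-29", "gender": "*"},
    {"age": "20-29", "gender": "*"},
    {"age": "30-39", "gender": "F"},
    {"age": "30-39", "gender": "F"},
]

print(is_k_anonymous(released, ["age", "gender"], 2))  # smallest group has 2 records
print(is_k_anonymous(released, ["age", "gender"], 3))  # the 30-39/F group is too small
```

The check only verifies a release; producing a release that passes it while suppressing or generalizing as little data as possible is the harder problem.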