
PRACTICAL K-ANONYMITY ON LARGE DATASETS By Benjamin Podgursky Thesis Submitted to ... PDF

64 Pages·2011·0.42 MB·English


PRACTICAL K-ANONYMITY ON LARGE DATASETS

By
Benjamin Podgursky

Thesis
Submitted to the Faculty of the
Graduate School of Vanderbilt University
in partial fulfillment of the requirements
for the degree of
MASTER OF SCIENCE
in
Computer Science

May, 2011
Nashville, Tennessee

Approved:
Professor Gautam Biswas
Professor Douglas H. Fisher

To Nila, Mom, Dad, and Adriane

ACKNOWLEDGMENTS

I thank my advisor Gautam Biswas for his constant support and guidance both after and long before I started writing this thesis. Over the past four years he has helped me jump headfirst into fields I would otherwise have never known about.

I thank my colleagues at Rapleaf for introducing me to this project. I am always impressed by their dedication to finding and solving challenging, important and open problems; each of them has given me some kind of insight or help on this project, for which I am grateful. Most directly I worked with Greg Poulos in originally tackling this problem when I was an intern during the summer of 2010, and without his insight and work this project could not have succeeded.

I owe gratitude to those in the Modeling and Analysis of Complex Systems lab for their patience with my (hopefully not complete) neglect of my other projects as I tackled this thesis.

Each of my professors at Vanderbilt has guided me in invaluable ways over my time here. I want to thank Doug Fisher for helping me figure out clustering algorithms, Larry Dowdy for making me think of every problem as a distributed problem, and Jerry Spinrad for showing me that everything is a graph problem at heart.

Of course, without the consistent encouragement, support, and prodding of my family and those close to me I would never have gotten this far, and this work is dedicated to them.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
I. INTRODUCTION
    Privacy Preserving Data Publishing
    Goals and Contributions of Thesis
    Organization of Thesis
II. BACKGROUND
    Privacy Model
        Identity Disclosure
        Sensitive Attribute Disclosure
    Data Models
        Numeric Data
        Categorical Data
        Hierarchical
        Data with Utility
    Quality Metrics
        Sum of suppressed entries
        Domain generalization hierarchy distance
        Discernibility
        Classification
        Loss metric
        Cluster diameter
        Ambiguity metric
        Entropy based metric
        Workload based quality metrics
    Algorithms
        Anonymization Models
        Multidimensional recoding
        Complexity results and bounded approximations
    Summary
III. PROBLEM DEFINITION
    Data Model
    Adversary Model
    Quality Metrics
    Scale
IV. ALGORITHMS
    Anonymity algorithms
    Approximate Nearest Neighbor Search
V. EXPERIMENTS
    Adults Dataset
        LSH vs Random Sampling vs Full Clustering
        Iterative vs Single-Pass Clustering
        Mondrian multidimensional heuristics
        (K,1) anonymity algorithms
        Overall comparison
    Synthetic Data
    Targeting Dataset
VI. DISCUSSION AND CONCLUSIONS
REFERENCES

LIST OF TABLES

1   Example dataset vulnerable to de-anonymization
2   Example publicly available data
3   Published sensitive dataset vulnerable to join attack
4   3-anonymous released dataset
5   Example anonymized numeric data
6   Another example dataset
7   Data anonymized via global recoding
8   Data anonymized via local recoding
9   Hierarchical categorical professions
10  Enumerated hierarchical data
11  Attributes with assigned numeric utility
12  Records with assigned numeric utility
13  Percent of utility retained on targeting dataset
14  Percent of attribute instances retained on targeting dataset
15  Algorithm runtime on targeting dataset

LIST OF FIGURES

1   2-anonymous and (2,1) anonymous views of data
2   A simple generalization hierarchy
3   A possible different generalization hierarchy
4   Example generalization lattice
5   Figure from [20] showing multidimensional anonymizing over 2 dimensions
6   An example generalization hierarchy
7   The utility of k-anonymized solutions using clustering with random sampling vs clustering with LSH guided sampling
8   The runtime of clustering anonymizers using random sampling vs clustering with LSH guided sampling
9   Utility of clustering anonymized solutions using clustering with random sampling vs clustering with LSH guided sampling
10  Runtime of clustering anonymizers using random sampling vs clustering with LSH guided sampling
11  Utility of clustering anonymized solutions using single-pass clustering vs iterative clustering algorithm
12  Runtime of clustering algorithms using single-pass clustering vs iterative clustering algorithm
13  Utility of Mondrian anonymized solutions using an information gain heuristic vs a largest-value heuristic
14  Runtime of Mondrian using an information gain heuristic vs a largest-value heuristic
15  Utility of (k,1)-anonymized solutions
16  Runtime of (k,1)-anonymization algorithms
17  Side-by-side comparison of algorithm utilities on the Adults dataset
18  Side-by-side comparison of algorithm runtime on the Adults dataset
19  Attribute frequency distributions
20  Side-by-side comparison of algorithm utility on a synthetic dataset
21  Side-by-side comparison of algorithm runtime on a synthetic dataset

LIST OF ALGORITHMS

1   Mondrian: multidimensional partitioning anonymization
2   (k,1) anonymization algorithm
3   MONDRIAN: Multidimensional partitioning anonymization
4   CLUSTER: approximate clustering algorithm
5   ITER CLUSTER: iterative clustering algorithm
6   TWO PHASE: two-phase aggregation algorithm
7   merge: merge procedure
8   split: split procedure
9   K1 ANON: approximate (k,1) algorithm
10  preprocess: generate the nearest neighbor hash
11  find nn: find the nearest neighbor of a point
12  build hash: generate an LSH hash function
13  hash record: hash a record into an LSH bucket

CHAPTER I

INTRODUCTION

As we spend more of our time online in information-rich and personalized environments, it becomes increasingly easy for details from our offline lives to meld with our online presence. Through Facebook and other social networks, our preferences in friends, food, and games become visible to others; LinkedIn makes our employment history and professional colleagues public information; even public Pandora profiles reveal our detailed tastes in music. Records stored offline contain just as rich a bank of data; public voter records show who is likely to vote, and purchase histories show what someone has bought (and is likely to buy again).

As this data becomes increasingly easy to access and collect, many web providers take advantage of the ability to analyze this data and use it to tailor a person's online experiences to their specific interests. The question becomes, how should one act on this data in a responsible manner? Not many companies have found a foolproof way to do this. Most recently, Facebook and Myspace have received public and media criticism for passing demographic and interest data to advertisers. An intuitive way to balance the need for privacy with the desire to provide customization is to use an anonymized data release to customize web experiences. It is reasonable to personalize a website based on a person's probable interests, as long as the website cannot determine the person's real identity (unless they intentionally log in).

Unfortunately, releasing data and ensuring anonymity at the same time is very difficult. To keep a user's identity unknown, it is not enough to strip data of Personally Identifying Information (PII). Even non-PII data can still be traced back to a single person if enough personal microdata is provided. For example, age, income, job, marital status, and movie interests can often identify a single individual [29].

A number of companies have inadvertently released insufficiently anonymized datasets in past years. Netflix, an online movie rental company, uses movie rental and ranking records to suggest movie rentals to users, and sponsored a data-mining competition where teams competed to beat Netflix's recommendation engine. The second phase of this competition was canceled out of privacy concerns when it was discovered that many individuals could be identified by a combination of movie recommendations [14]. AOL has previously released a large set of search queries, without realizing that most of the queries could be traced back to a single individual via personalized queries [26].

As a result, many companies and privacy researchers have been forced to look for ways to ensure the anonymity of their data while making sure that the relevant information is retained, so that the data still lends itself to statistical inference, data mining, and online personalization applications.

Rapleaf is a startup which specializes in web personalization [2]. If a website is able to get clues about the interests or demographics of a user, it can tailor the experience to that person. To allow a website to learn about that individual, non-personally identifying microdata can be stored in a cookie on the user's browser. The cookie contains basic non-sensitive information like age, gender, and interests.

The fact that this targeting information is being served to websites and advertisers leads to concerns about user privacy; for privacy purposes, the microdata being served about an individual should not allow the receiver of the data to link a browser to an individual. There is no harm in associating a user with the fact "Enjoys basketball"; instead the fear is that if enough of these simple pieces of data are stored, the combination will uniquely identify an individual. Unfortunately, the most specific information, and therefore the most de-anonymizing, is often the most valuable and useful data (it is somewhat useful to know if a person is male, but very interesting to know that he drives a Ferrari).
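
The concern that a combination of individually harmless attributes can single out one person can be illustrated with a small sketch. The records, the attribute names, and the helper fraction_unique below are hypothetical and are not taken from the targeting data discussed here; the sketch only counts how many records become unique as more attributes are revealed.

```python
from collections import Counter

# Hypothetical microdata; all values are invented for illustration only.
records = [
    {"gender": "M", "age": "30-39", "interest": "basketball", "car": "sedan"},
    {"gender": "M", "age": "30-39", "interest": "basketball", "car": "Ferrari"},
    {"gender": "M", "age": "30-39", "interest": "cooking", "car": "sedan"},
    {"gender": "F", "age": "30-39", "interest": "basketball", "car": "sedan"},
    {"gender": "F", "age": "20-29", "interest": "cooking", "car": "sedan"},
    {"gender": "F", "age": "20-29", "interest": "cooking", "car": "sedan"},
]


def fraction_unique(rows, attributes):
    """Fraction of rows whose values on `attributes` match no other row."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Reveal one more attribute at a time and report how many rows are unique.
revealed = []
for attribute in ["gender", "age", "interest", "car"]:
    revealed.append(attribute)
    print(revealed, round(fraction_unique(records, revealed), 2))
```

On this toy input no record is unique by gender alone, while four of the six records are unique once all four attributes are known, which mirrors the point that the most specific attributes are also the most de-anonymizing.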

Privacy Preserving Data Publishing

Fortunately, the field of research on privacy preserving data publishing studies exactly this problem. This field studies how to publish data in a way that maintains privacy for the individuals whose records are being published, while keeping the released dataset rich enough that it is useful for data-mining purposes.

One popular strategy for maintaining privacy in a released dataset is simply to ensure that the dataset remains anonymous. K-anonymity was the first carefully studied model for data anonymity [36]; the k-anonymity privacy assurance guarantees that a published record can be identified as one of no fewer than k individuals. The k-anonymity problem has traditionally been researched from the perspective of sensitive data disclosure; a commonly cited domain is that of medical records released for data mining purposes, where it is important to be able to link diseases to demographic data without revealing that an individual has a particular disease.

Most research on data disclosure focuses on protecting individuals from the release of sensitive attributes which could be embarrassing or harmful if released. Traditional examples of sensitive data include medical records and criminal records. However, the objective here is somewhat different; the objective is to prevent a user from being identified by their published data, and none of the targeted information is considered to be sensitive data.
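
The k-anonymity guarantee described in this section can be sketched as a simple check: group the published records by their attribute values and confirm that every group contains at least k records. The function name is_k_anonymous, the "*" suppression marker, and the sample rows are assumptions made for illustration; this is a minimal sketch of the definition, not one of the algorithms developed in the thesis.

```python
from collections import Counter


def is_k_anonymous(rows, k):
    """Return True if every distinct published row appears at least k times."""
    counts = Counter(tuple(row) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical release with columns (age range, gender, interest).
raw = [
    ("30-39", "M", "basketball"),
    ("30-39", "M", "cooking"),
    ("20-29", "F", "cooking"),
    ("20-29", "F", "cooking"),
]

# Suppress the interest attribute ("*") so the first two rows become identical.
suppressed = [(age, gender, "*") for age, gender, _ in raw]

print(is_k_anonymous(raw, 2))         # False: the first two rows are unique
print(is_k_anonymous(suppressed, 2))  # True: each row now has a twin
```

Suppressing an entire attribute is the bluntest way to pass the check; the example shows only the privacy side of the trade-off, since the suppressed release no longer carries any interest data.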
