PRIVACY-PRESERVING DATA MINING: MODELS AND ALGORITHMS

Edited by

CHARU C. AGGARWAL
IBM T. J. Watson Research Center, Hawthorne, NY 10532

PHILIP S. YU
University of Illinois at Chicago, Chicago, IL 60607

Kluwer Academic Publishers
Boston/Dordrecht/London

Contents

List of Figures xv
List of Tables xx
Preface xxi

1 An Introduction to Privacy-Preserving Data Mining 1
Charu C. Aggarwal, Philip S. Yu
  1. Introduction 1
  2. Privacy-Preserving Data Mining Algorithms 3
  3. Conclusions and Summary 7
  References 8

2 A General Survey of Privacy-Preserving Data Mining Models and Algorithms 11
Charu C. Aggarwal, Philip S. Yu
  1. Introduction 11
  2. The Randomization Method 13
    2.1 Privacy Quantification 15
    2.2 Adversarial Attacks on Randomization 17
    2.3 Randomization Methods for Data Streams 18
    2.4 Multiplicative Perturbations 18
    2.5 Data Swapping 19
  3. Group Based Anonymization 20
    3.1 The k-Anonymity Framework 20
    3.2 Personalized Privacy-Preservation 23
    3.3 Utility Based Privacy Preservation 24
    3.4 Sequential Releases 25
    3.5 The l-diversity Method 26
    3.6 The t-closeness Model 27
    3.7 Models for Text, Binary and String Data 27
  4. Distributed Privacy-Preserving Data Mining 28
    4.1 Distributed Algorithms over Horizontally Partitioned Data Sets 30
    4.2 Distributed Algorithms over Vertically Partitioned Data 31
    4.3 Distributed Algorithms for k-Anonymity 31
  5. Privacy-Preservation of Application Results 32
    5.1 Association Rule Hiding 32
    5.2 Downgrading Classifier Effectiveness 34
    5.3 Query Auditing and Inference Control 34
  6. Limitations of Privacy: The Curse of Dimensionality 37
  7. Applications of Privacy-Preserving Data Mining 38
    7.1 Medical Databases: The Scrub and Datafly Systems 38
    7.2 Bioterrorism Applications 40
    7.3 Homeland Security Applications 40
    7.4 Genomic Privacy 42
  8. Summary 42
  References 43

3 A Survey of Inference Control Methods for Privacy-Preserving Data Mining 53
Josep Domingo-Ferrer
  1. A classification of microdata protection methods 55
  2. Perturbative masking methods 58
    2.1 Additive noise 58
    2.2 Microaggregation 59
    2.3 Data swapping and rank swapping 61
    2.4 Rounding 62
    2.5 Resampling 62
    2.6 PRAM 62
    2.7 MASSC 63
  3. Non-perturbative masking methods 63
    3.1 Sampling 64
    3.2 Global recoding 64
    3.3 Top and bottom coding 65
    3.4 Local suppression 65
  4. Synthetic microdata generation 65
    4.1 Synthetic data by multiple imputation 65
    4.2 Synthetic data by bootstrap 66
    4.3 Synthetic data by Latin Hypercube Sampling 66
    4.4 Partially synthetic data by Cholesky decomposition 67
    4.5 Other partially synthetic and hybrid microdata approaches 67
    4.6 Pros and cons of synthetic microdata 68
  5. Trading off information loss and disclosure risk 69
    5.1 Score construction 69
    5.2 R-U maps 71
    5.3 k-anonymity 71
  6. Conclusions and research directions 72
  References 73

4 Measures of Anonymity 81
Suresh Venkatasubramanian
  1. Introduction 81
    1.1 What is privacy? 81
    1.2 Data Anonymization Methods 83
    1.3 A Classification Of Methods 84
  2. Statistical Measures of Anonymity 85
    2.1 Query Restriction 85
    2.2 Anonymity via Variance 85
    2.3 Anonymity via Multiplicity 86
  3. Probabilistic Measures of Anonymity 86
    3.1 Measures Based on Random Perturbation 87
    3.2 Measures Based on Generalization 90
    3.3 Utility vs Privacy 93
  4. Computational Measures Of Anonymity 94
    4.1 Anonymity via Isolation 96
  5. Conclusions And New Directions 97
    5.1 New Directions 98
  References 98

5 k-Anonymous Data Mining: A Survey 103
V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati
  1. Introduction 103
  2. k-Anonymity 105
  3. Algorithms for Enforcing k-Anonymity 108
  4. k-Anonymity Threats from Data Mining 115
    4.1 Association Rules 115
    4.2 Classification Mining 116
  5. k-Anonymity in Data Mining 118
  6. Anonymize-and-Mine 120
  7. Mine-and-Anonymize 123
    7.1 Enforcing k-Anonymity on Association Rules 124
    7.2 Enforcing k-Anonymity on Decision Trees 127
  8. Conclusions 130
  Acknowledgments 131
  References 131

6 A Survey of Randomization Methods for Privacy-Preserving Data Mining 135
Charu C. Aggarwal, Philip S. Yu
  1. Introduction 135
  2. Reconstruction Methods for Randomization 137
    2.1 The Bayes Reconstruction Method 137
    2.2 The EM Reconstruction Method 139
    2.3 Utility and Optimality of Randomization Models 141
  3. Applications of Randomization 142
    3.1 Privacy-Preserving Classification with Randomization 142
    3.2 Privacy-Preserving OLAP 143
    3.3 Collaborative Filtering 143
  4. The Privacy-Information Loss Tradeoff 144
  5. Vulnerabilities of the Randomization Method 147
  6. Randomization of Time Series Data Streams 149
  7. Multiplicative Noise for Randomization 150
    7.1 Vulnerabilities of Multiplicative Randomization 151
    7.2 Sketch Based Randomization 151
  8. Conclusions and Summary 152
  References 152

7 A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining 155
Keke Chen and Ling Liu
  1. Introduction 156
    1.1 Data Privacy vs. Data Utility 157
    1.2 Outline 158
  2. Definition of Multiplicative Perturbation 159
    2.1 Notations 159
    2.2 Rotation Perturbation 159
    2.3 Projection Perturbation 160
    2.4 Sketch-based Approach 162
    2.5 Geometric Perturbation 162
  3. Transformation Invariant Data Mining Models 163
    3.1 Definition of Transformation Invariant Models 163
    3.2 Transformation-Invariant Classification Models 164
    3.3 Transformation-Invariant Clustering Models 165
  4. Privacy Evaluation for Multiplicative Perturbation 166
    4.1 A Conceptual Multidimensional Privacy Evaluation Model 166
    4.2 Variance of Difference as Column Privacy Metric 167
    4.3 Incorporating Attack Evaluation 168
    4.4 Other Metrics 168
  5. Attack Resilient Multiplicative Perturbations 169
    5.1 Naive Estimation to Rotation Perturbation 169
    5.2 ICA-Based Attacks 171
    5.3 Distance-Inference Attacks 172
    5.4 Attacks with More Prior Knowledge 174
    5.5 Finding Attack-Resilient Perturbations 174
  6. Conclusion 176
  References 176

8 A Survey of Quantification of Privacy Preserving Data Mining Algorithms 181
Elisa Bertino, Dan Lin, and Wei Jiang
  1. Metrics for Quantifying Privacy Level 184
    1.1 Data Privacy 184
    1.2 Result Privacy 189
  2. Metrics for Quantifying Hiding Failure 190
  3. Metrics for Quantifying Data Quality 191
    3.1 Quality of the Data Resulting from the PPDM Process 191
    3.2 Quality of the Data Mining Results 196
  4. Complexity Metrics 198
  5. How to Select a Proper Metric 199
  6. Conclusion and Research Directions 200
  References 200

9 A Survey of Utility-based Privacy-Preserving Data Transformation Methods 205
Ming Hua and Jian Pei
  1. Introduction 206
    1.1 What is Utility-based Privacy Preservation? 207
  2. Types of Utility-based Privacy Preservation Methods 208
    2.1 Privacy Models 208
    2.2 Utility Measures 210
    2.3 Summary of the Utility-Based Privacy Preserving Methods 212
  3. Utility-Based Anonymization Using Local Recoding 212
    3.1 Global Recoding and Local Recoding 213
    3.2 Utility Measure 214
    3.3 Anonymization Methods 215
    3.4 Summary and Discussion 217
  4. The Utility-based Privacy Preserving Methods in Classification Problems 217
    4.1 The Top-Down Specialization Method 218
    4.2 The Progressive Disclosure Algorithm 222
    4.3 Summary and Discussion 226
  5. Anonymized Marginal: Injecting Utility into Anonymized Data Sets 226
    5.1 Anonymized Marginal 227
    5.2 Utility Measure 228
    5.3 Injecting Utility Using Anonymized Marginals 229
    5.4 Summary and Discussion 231
  6. Summary 232
  References 232

10 Mining Association Rules under Privacy Constraints 237
Jayant R. Haritsa
  1. Problem Framework 238
  2. Evolution of the Literature 244
  3. The FRAPP Framework 249
  4. Sample Results 257
  5. Closing Remarks 261
  References 261

11 A Survey of Association Rule Hiding Methods for Privacy 265
Vassilios S. Verykios and Aris Gkoulalas-Divanis
  1. Introduction 265
  2. Terminology and Preliminaries 267
  3. Taxonomy of Association Rule Hiding Algorithms 268
  4. Classes of Association Rule Algorithms 269
    4.1 Heuristic Approaches 270
    4.2 Border-based Approaches 275
    4.3 Exact Approaches 276
  5. Other Hiding Approaches 277
  6. Metrics and Performance Analysis 279
  7. Discussion and Future Trends 282
  8. Conclusions 283
  References 284

12 A Survey of Statistical Approaches to Preserving Confidentiality of Contingency Table Entries 289
Stephen E. Fienberg and Aleksandra B. Slavkovic
  1. Introduction 289
  2. The Statistical Approach Privacy Protection 290
  3. Datamining Algorithms, Association Rules, and Disclosure Limitation 292
  4. Estimation and Disclosure Limitation for Multi-way Contingency Tables 293
  5. Two Illustrative Examples 299
    5.1 Example 1: Data from a Randomized Clinical Trial 299
    5.2 Example 2: Data from the 1993 U.S. Current Population Survey 303
  6. Conclusions 306
  References 307

13 A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data 311
Murat Kantarcioglu
  1. Introduction 311
  2. Basic Cryptographic Techniques for Privacy-Preserving Distributed Data Mining 313
  3. Common Secure Sub-protocols Used in Privacy-Preserving Distributed Data Mining 316
  4. Privacy-preserving Distributed Data Mining on Horizontally Partitioned Data 321
  5. Comparison to Vertically Partitioned Data Model 324
  6. Extension to Malicious Parties 325
  7. Limitations of the Cryptographic Techniques Used in Privacy-Preserving Distributed Data Mining 327
  8. Privacy Issues Related to Data Mining Results 328
  9. Conclusion 330
  References 330

14 A Survey of Privacy-Preserving Methods across Vertically Partitioned Data 335
Jaideep Vaidya
  1. Classification 337
    1.1 Naïve Bayes Classification 340
    1.2 Bayesian Network Structure Learning 341
    1.3 Decision Tree Classification 342
  2. Clustering 344
  3. Association Rule Mining 345
  4. Outlier detection 347
    4.1 Algorithm 349
    4.2 Security Analysis 350
    4.3 Computation and Communication Analysis 352
  5. Challenges and Research Directions 353
  References 354

15 A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods 357
Kun Liu, Chris Giannella, and Hillol Kargupta
  1. Introduction 358
  2. Definitions and Notation 358
  3. Attacking Additive Data Perturbation 359
    3.1 Eigen-Analysis and PCA Preliminaries 360
    3.2 Spectral Filtering 361
    3.3 SVD Filtering 362
    3.4 PCA Filtering 363
    3.5 MAP Estimation Attack 364
    3.6 Distribution Analysis Attack 365
    3.7 Summary 366
  4. Attacking Matrix Multiplicative Data Perturbation 367
    4.1 Known I/O Attacks 368
    4.2 Known Sample Attack 371
    4.3 Other Attacks Based on ICA 372
    4.4 Summary 373
  5. Attacking k-Anonymization 374
  6. Conclusion 374
  Acknowledgments 375
  References 375

16 Private Data Analysis via Output Perturbation 381
Kobbi Nissim
  1. The Abstract Model – Statistical Databases, Queries, and Sanitizers 383
  2. Privacy 386
    2.1 Interpreting the Privacy Definition 388
  3. The Basic Technique: Calibrating Noise to Sensitivity 392
    3.1 Applications: Functions with Low Global Sensitivity 394
  4. Constructing Sanitizers for Complex Functionalities 398
    4.1 k-Means Clustering 399