Table Of ContentPrivacy-Preserving
Data Mining
Models and Algorithms
ADVANCES IN DATABASE SYSTEMS
Volume34
SeriesEditors
AhmedK.Elmagarmid AmitP.Sheth
PurdueUniversity WrightStateUniversity
WestLafayette,IN47907 Dayton,Ohio45435
OtherbooksintheSeries:
SEQUENCEDATAMINING,GuozhuDong,JianPei;ISBN:978-0-387-69936-3
DATASTREAMS:ModelsandAlgorithms,editedbyCharuC.Aggarwal;ISBN:978-0-387-28759-1
SIMILARITYSEARCH:TheMetricSpaceApproach,P.Zezula,G. Amato,V.Dohnal,M.Batko;
ISBN:0-387-29146-6
STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN:
0-387-24393-3
FUZZYDATABASEMODELINGWITHXML,ZongminMa;ISBN:0-387-24248-1
MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang and Jiong Yang;
ISBN:0-387-24246-5
ADVANCEDSIGNATUREINDEXINGFORMULTIMEDIAANDWEBAPPLICATIONS,Yan-
nisManolopoulos,AlexandrosNanopoulos,EleniTousidou;ISBN:1-4020-7425-5
ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and Policy, edited by
WilliamJ.McIver,Jr.andAhmedK.Elmagarmid;ISBN:1-4020-7067-5
INFORMATIONANDDATABASEQUALITY,MarioPiattini,CoralCaleroandMarcelaGenero;
ISBN:0-7923-7599-8
DATAQUALITY,RichardY.Wang,MostaphaZiad,YangW.Lee:ISBN:0-7923-7215-8
THEFRACTALSTRUCTUREOFDATAREFERENCE:ApplicationstotheMemoryHierarchy,
BruceMcNutt;ISBN:0-7923-7945-4
SEMANTICMODELSFORMULTIMEDIADATABASESEARCHINGANDBROWSING,Shu-
ChingChen,R.L.Kashyap,andArifGhafoor;ISBN:0-7923-7888-1
INFORMATIONBROKERINGACROSSHETEROGENEOUSDIGITALDATA:AMetadata-
basedApproach,VipulKashyap,AmitSheth;ISBN:0-7923-7883-0
DATADISSEMINATIONINWIRELESSCOMPUTINGENVIRONMENTS,Kian-LeeTanand
BengChinOoi;ISBN:0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure,
MichahLerner,GeorgeVanecek,NinoVidovic,DadVrsalovic;ISBN:0-7923-7840-7
ADVANCEDDATABASEINDEXING,YannisManolopoulos,YannisTheodoridis,VassilisJ.Tsotras;
ISBN:0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil Jajodia, Binto
GeorgeISBN:0-7923-7702-8
FUZZYLOGICINDATAMODELING,GuoqingChenISBN:0-7923-8253-6
PRIVACY-PRESERVINGDATAMINING:ModelsandAlgorithms,editedbyCharuC.Aggarwal
andPhilipS.Yu;ISBN:0-387-70991-8
Privacy-Preserving
Data Mining
Models and Algorithms
Edited by
CharuC.Aggarwal
IBMT.J.WatsonResearchCenter,USA
and
PhilipS.Yu
UniversityofIllinoisatChicago,USA
ABC
Editors:
CharuC.Aggarwal PhilipS.Yu
IBMThomasJ.WatsonResearchCenter DepartmentofComputerScience
19SkylineDrive UniversityofIllinoisatChicago
HawthorneNY10532 854SouthMorganStreet
charu@us.ibm.com Chicago,IL60607-7053
psyu@cs.uic.edu
SeriesEditors AmitP.Sheth
AhmedK.Elmagarmid WrightStateUniversity
PurdueUniversity Dayton,Ohio45435
WestLafayette,IN47907
ISBN978-0-387-70991-8 e-ISBN978-0-387-70992-5
DOI10.1007/978-0-387-70992-5
LibraryofCongressControlNumber:2007943463
(cid:176)c 2008SpringerScience+BusinessMedia,LLC.
Allrightsreserved.Thisworkmaynotbetranslatedorcopiedinwholeorinpartwithoutthewritten
permissionofthepublisher(SpringerScience+BusinessMedia,LLC,233SpringStreet,NewYork,NY
10013,USA),exceptforbriefexcerptsinconnectionwithreviewsorscholarlyanalysis.Useinconnection
withanyformofinformationstorageandretrieval,electronicadaptation,computersoftware,orbysimilar
ordissimilarmethodologynowknownorhereafterdevelopedisforbidden.
Theuseinthispublicationoftradenames,trademarks,servicemarks,andsimilarterms,eveniftheyare
notidentifiedassuch,isnottobetakenasanexpressionofopinionastowhetherornottheyaresubject
toproprietaryrights.
Printedonacid-freepaper
9 8 7 6 5 4 3 2 1
springer.com
Preface
In recent years, advances in hardware technology have lead to an increase
in the capability to store and record personal data about consumers and indi-
viduals. This has lead to concerns that the personal data may be misused for
a variety of purposes. In order to alleviate these concerns, a number of tech-
niqueshaverecentlybeenproposedinordertoperformthedataminingtasksin
aprivacy-preserving way.Thesetechniquesforperformingprivacy-preserving
dataminingaredrawnfromawidearrayofrelatedtopicssuchasdatamining,
cryptography and information hiding. The material in this book is designed
to be drawn from the different topics so as to provide a good overview of the
important topicsinthefield.
Whilealargenumberofresearchpapersarenowavailableinthisfield,many
ofthetopics havebeenstudied bydifferent communities withdifferent styles.
At this stage, it becomes important to organize the topics in such a way that
therelative importance ofdifferent research areas isrecognized. Furthermore,
the field of privacy-preserving data mining has been explored independently
by the cryptography, database and statistical disclosure control communities.
Insomecases,theparallellinesofworkarequitesimilar,butthecommunities
are not sufficiently integrated for the provision of a broader perspective. This
book willcontain chapters from researchers ofallthree communities and will
therefore trytoprovideabalanced perspective oftheworkdoneinthisfield.
This book will be structured as an edited book from prominent researchers
inthefield.Eachchapterwillcontainasurveywhichcontainsthekeyresearch
contentonthetopic,andthefuturedirectionsofresearchinthefield.Emphasis
will be placed on making each chapter self-sufficient. While the chapters will
bewrittenbydifferent researchers, thetopicsandcontent isorganized insuch
awaysoastopresentthemostimportantmodels,algorithms,andapplications
in the privacy field in a structured and concise way. In addition, attention is
paid in drawing chapters from researchers working in different areas in order
toprovidedifferentpointsofview.Giventhelackofstructurally organized in-
formation onthetopicofprivacy,thebookwillprovideinsights whicharenot
easily accessible otherwise. A few chapters in the book are not surveys, since
thecorresponding topics fallintheemerging category, andenough material is
vi Preface
notavailable tocreateasurvey. Insuchcases, theindividual resultshavebeen
included to give a flavor of the emerging research in the field. It is expected
that the book will be a great help to researchers and graduate students inter-
estedinthetopic.Whiletheprivacyfieldclearlyfallsintheemergingcategory
becauseofitsrecency,itisnowbeginningtoreachamaturationandpopularity
point, wherethe development ofanoverview book onthe topic becomes both
possible and necessary. It is hoped that this book will provide a reference to
students,researchersandpractitionersinbothintroducingthetopicofprivacy-
preservingdataminingandunderstandingthepracticalandalgorithmicaspects
ofthearea.
Contents
Preface v
ListofFigures xvii
ListofTables xxi
1
AnIntroductiontoPrivacy-PreservingDataMining 1
CharuC.Aggarwal,PhilipS.Yu
1.1. Introduction 1
1.2. Privacy-PreservingDataMiningAlgorithms 3
1.3. ConclusionsandSummary 7
References 8
2
A General Survey of Privacy-Preserving Data Mining Models and 11
Algorithms
CharuC.Aggarwal,PhilipS.Yu
2.1. Introduction 11
2.2. TheRandomizationMethod 13
2.2.1 PrivacyQuantification 15
2.2.2 AdversarialAttacksonRandomization 18
2.2.3 RandomizationMethodsforDataStreams 18
2.2.4 MultiplicativePerturbations 19
2.2.5 DataSwapping 19
2.3. GroupBasedAnonymization 20
k
2.3.1 The -AnonymityFramework 20
2.3.2 PersonalizedPrivacy-Preservation 24
2.3.3 UtilityBasedPrivacyPreservation 24
2.3.4 SequentialReleases 25
l
2.3.5 The -diversityMethod 26
t
2.3.6 The -closenessModel 27
2.3.7 ModelsforText,BinaryandStringData 27
2.4. DistributedPrivacy-PreservingDataMining 28
2.4.1 Distributed Algorithms over Horizontally Partitioned Data
Sets 30
2.4.2 DistributedAlgorithmsoverVerticallyPartitionedData 31
k
2.4.3 DistributedAlgorithmsfor -Anonymity 32
viii Contents
2.5. Privacy-PreservationofApplicationResults 32
2.5.1 AssociationRuleHiding 33
2.5.2 DowngradingClassifierEffectiveness 34
2.5.3 QueryAuditingandInferenceControl 34
2.6. LimitationsofPrivacy:TheCurseofDimensionality 37
2.7. ApplicationsofPrivacy-PreservingDataMining 38
2.7.1 MedicalDatabases:TheScrubandDataflySystems 39
2.7.2 BioterrorismApplications 40
2.7.3 HomelandSecurityApplications 40
2.7.4 GenomicPrivacy 42
2.8. Summary 43
References 43
3
A Survey of Inference Control Methods for Privacy-Preserving Data 53
Mining
JosepDomingo-Ferrer
3.1. Introduction 54
3.2. AclassificationofMicrodataProtectionMethods 55
3.3. PerturbativeMaskingMethods 58
3.3.1 AdditiveNoise 58
3.3.2 Microaggregation 59
3.3.3 DataWappingandRankSwapping 61
3.3.4 Rounding 62
3.3.5 Resampling 62
3.3.6 PRAM 62
3.3.7 MASSC 63
3.4. Non-perturbativeMaskingMethods 63
3.4.1 Sampling 64
3.4.2 GlobalRecoding 64
3.4.3 TopandBottomCoding 65
3.4.4 LocalSuppression 65
3.5. SyntheticMicrodataGeneration 65
3.5.1 SyntheticDatabyMultipleImputation 65
3.5.2 SyntheticDatabyBootstrap 66
3.5.3 SyntheticDatabyLatinHypercubeSampling 66
3.5.4 PartiallySyntheticDatabyCholeskyDecomposition 67
3.5.5 OtherPartiallySyntheticandHybridMicrodataApproaches 67
3.5.6 ProsandConsofSyntheticMicrodata 68
3.6. TradingoffInformationLossandDisclosureRisk 69
3.6.1 ScoreConstruction 69
3.6.2 R-UMaps 71
3.6.3 k-anonymity 71
3.7. ConclusionsandResearchDirections 72
References 73
Contents ix
4
MeasuresofAnonymity 81
SureshVenkatasubramanian
4.1. Introduction 81
4.1.1 WhatisPrivacy? 82
4.1.2 DataAnonymizationMethods 83
4.1.3 AClassificationofMethods 84
4.2. StatisticalMeasuresofAnonymity 85
4.2.1 QueryRestriction 85
4.2.2 AnonymityviaVariance 85
4.2.3 AnonymityviaMultiplicity 86
4.3. ProbabilisticMeasuresofAnonymity 87
4.3.1 MeasuresBasedonRandomPerturbation 87
4.3.2 MeasuresBasedonGeneralization 90
4.3.3 UtilityvsPrivacy 94
4.4. ComputationalMeasuresofAnonymity 94
4.4.1 AnonymityviaIsolation 97
4.5. ConclusionsandNewDirections 97
4.5.1 NewDirections 98
References 99
5
k-AnonymousDataMining:ASurvey 105
V.Ciriani,S.DeCapitanidiVimercati,S.Foresti,andP.Samarati
5.1. Introduction 105
5.2. k-Anonymity 107
5.3. AlgorithmsforEnforcingk-Anonymity 110
5.4. k-AnonymityThreatsfromDataMining 117
5.4.1 AssociationRules 118
5.4.2 ClassificationMining 118
5.5. k-AnonymityinDataMining 120
5.6. Anonymize-and-Mine 123
5.7. Mine-and-Anonymize 126
5.7.1 Enforcingk-AnonymityonAssociationRules 126
5.7.2 Enforcingk-AnonymityonDecisionTrees 130
5.8. Conclusions 133
Acknowledgments 133
References 134
6
ASurveyofRandomizationMethodsforPrivacy-PreservingDataMining 137
CharuC.Aggarwal,PhilipS.Yu
6.1. Introduction 137
6.2. ReconstructionMethodsforRandomization 139
6.2.1 TheBayesReconstructionMethod 139
6.2.2 TheEMReconstructionMethod 141
6.2.3 UtilityandOptimalityofRandomizationModels 143
Description:Advances in hardware technology have increased the capability to store and record personal data about consumers and individuals. This has caused concerns that personal data may be used for a variety of intrusive or malicious purposes.Privacy Preserving Data Mining: Models and Algorithms proposes a n