
Anonymizing Large Transaction Data Using MapReduce

142 Pages·2017·1.43 MB·English


Anonymizing Large Transaction Data Using MapReduce

A thesis submitted in partial fulfilment of the requirement for the degree of Doctor of Philosophy

Neelam Memon
January 2016
School of Computer Science & Informatics
Cardiff University

Declaration

This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.
Signed ..................................... (candidate) Date ..............................

Statement 1
This thesis is being submitted in partial fulfillment of the requirements for the degree of PhD.
Signed ..................................... (candidate) Date ..............................

Statement 2
This thesis is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by explicit references.
Signed ..................................... (candidate) Date ..............................

Statement 3
I hereby give consent for my thesis, if accepted, to be available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organisations.
Signed ..................................... (candidate) Date ..............................

Abstract

Publishing transaction data is important to applications such as marketing research and biomedical studies. Privacy is a concern when publishing such data since they often contain person-specific sensitive information. To address this problem, different data anonymization methods have been proposed. These methods have focused on protecting the associated individuals from different types of privacy leaks as well as preserving utility of the original data. But all these methods are sequential and are designed to process data on a single machine, hence not scalable to large datasets.

Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In this work, we consider how MapReduce may be used to provide scalability in large transaction data anonymization. More specifically, we consider how set-based generalization methods such as RBAT (Rule-Based Anonymization of Transaction data) may be parallelized using MapReduce. Set-based generalization methods have some desirable features for transaction anonymization, but their highly iterative nature makes parallelization challenging. RBAT is a good representative of such methods. We propose a method for transaction data partitioning and representation. We also present two MapReduce-based parallelizations of RBAT. Our methods ensure scalability when the number of transaction records and the domain of items are large. Our preliminary results show that a direct parallelization of RBAT by partitioning data alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT, which generalizes our direct parallel method and allows the parallelization overhead to be controlled. Our experimental results show that MR-RBAT can scale linearly to large datasets and to the available resources while retaining good data utility.

Acknowledgements

I would like to thank all those who have given me academic and moral support for my research work over the last years. I would like to thank the department of Computer Science and Informatics, in particular my supervisors, Dr. Jianhua Shao and Dr. Grigorios Loukides, for their guidance and valuable advice. I would like to thank all the school staff members, specifically my annual review panel members, for their help and constructive feedback on my research. I would also like to thank my family and friends for their support.

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
List of Algorithms

1 Introduction
  1.1 Privacy and Its Protection
  1.2 Privacy Protection in Transaction Data
  1.3 Research Problem
  1.4 Research Methodology
  1.5 Research Challenges
  1.6 Contribution
  1.7 Thesis Structure

2 Related Work
  2.1 Transaction data anonymization
    2.1.1 Generalization-based Transaction Data Anonymization
  2.2 Non-Parallel Methods for Scalable Data Anonymization
  2.3 Possible Choices for Parallelization
    2.3.1 PRAM
    2.3.2 BSP
    2.3.3 MapReduce
  2.4 MapReduce and Privacy Protection
  2.5 Other applications of MapReduce
  2.6 Summary

3 RBAT and MapReduce
  3.1 Preliminaries and Problem Definition
    3.1.1 Problem Statement
  3.2 Overview of RBAT
  3.3 Parallel Design of RBAT using MapReduce
  3.4 Summary

4 Direct Parallelization of RBAT
  4.1 Data Partitioning and Representation
  4.2 Parallel Support Computation
  4.3 Direct Parallelization of RBAT
  4.4 Experimental Evaluation
  4.5 Summary

5 MR-RBAT: A Scalable Transaction Data Anonymization Method
  5.1 MR-RBAT
    5.1.1 α-Split
    5.1.2 γ-Check
  5.2 Performance Estimation
  5.3 Experimental Evaluation
    5.3.1 Evaluation of MR-RBAT
    5.3.2 Evaluation of Performance Estimation
  5.4 Summary

6 Conclusions and Future Work
  6.1 Research Summary
  6.2 Future Work

References

List of Figures

2.1 An example hierarchy
2.2 An example R-tree for k-anonymous data
2.3 An example of 2-anonymous data using Mondrian
2.4 A schematic representation of a MapReduce round
3.1 An example Split-Tree
3.2 A general framework for MapReduce-based design of RBAT
4.1 A schematic representation of Parallel RBAT
4.2 Data size vs. Runtime
4.3 Scalability with respect to cluster size
4.4 Scalability with respect to cluster size
4.5 Performance with respect to |Θ|, k = 60, c = 0.8
4.6 Performance with respect to |Θ|, k = 80, c = 0.7
4.7 Rules Input vs. Checked (k = 60, c = 0.8)
4.8 Performance with respect to |I|
4.9 Scaleup
5.1 Effect of α
5.2 α vs. Skewness
5.3 Scalability with respect to Data size
5.4 Scalability with respect to Cluster Size
5.5 Effect of |I|
5.6 Effect of γ
5.7 Scaleup
5.8 Evaluation of cost estimation of α-Split with respect to α
5.9 Evaluation of cost estimation of γ-Check with respect to γ
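
The abstract describes partitioning transaction data and performing key operations over the partitions with MapReduce, and the contents list a section on parallel support computation. The sketch below illustrates that general idea only: it counts how many transactions contain each itemset by mapping over partitions and summing the partial counts in a reduce step. It is not the thesis's algorithm. The function names (partition, map_support, reduce_support, parallel_support), the round-robin partitioning, and the in-process simulation of a MapReduce round are all assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def partition(transactions, n_parts):
    """Split the transaction list into n_parts chunks (round-robin; an assumed scheme)."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def map_support(chunk, itemsets):
    """Map phase: for one partition, emit (itemset, local count) pairs."""
    local = Counter()
    for transaction in chunk:
        for itemset in itemsets:
            if itemset <= transaction:  # itemset is contained in the transaction
                local[itemset] += 1
    return local.items()


def reduce_support(pairs):
    """Reduce phase: sum the local counts emitted for each itemset."""
    total = Counter()
    for itemset, count in pairs:
        total[itemset] += count
    return dict(total)


def parallel_support(transactions, itemsets, n_parts=2):
    """Simulate one MapReduce round in-process: map every partition, then reduce."""
    chunks = partition(transactions, n_parts)
    mapped = chain.from_iterable(map_support(chunk, itemsets) for chunk in chunks)
    return reduce_support(mapped)
```

For example, with transactions {a, b}, {a, b, c}, {b, c} and {a} represented as frozensets, parallel_support reports a support of 3 for {a} and 2 for both {a, b} and {b, c}, regardless of how many partitions are used. Itemsets that occur in no transaction are simply absent from the result.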
