Table Of Content

Incremental -Anonymous k Microaggregation of Large-Scale Datasets universitat polite`cnica de catalunya Ce´sar Hernańdez Baigorri supervised by David Rebollo-Monedero and Jordi Forne´ 2016 Abstract Every one of us is constantly releasing data about our interests, preferences or affil- iations in a conscious or unconscious way. Improvements on the technological field have led to an enormous volume of data on each individual available on the Internet. Using and publishing sensitive data (e.g. social research, marketing purposes) have stimulated the need for anonymization techniques, where a compromise between use- fulness of the data and privacy protection is sought. k-Anonymous microaggregation permits releasing a set of data where each person cannot be distinguished from, at least, k−1 individuals while maintaining similar statistical dependence between at- tributes. Currently, microaggregation algorithms are commonly used in this field thanks to the simplicity and quality provided. However, used on large datasets these methods result expensive in terms of computation time. This work addresses the need of running k-anonymization in a faster, efficient manner while introducing minimum distortion loss. To do so, we partition the original dataset in two fractions that will be processed in two consecutive steps enabling thepossibilityofstartingevenbeforereceivingtheentiredataset. Intuitivelythepro- cess would be indicated for anonymizing surveys, electoral processes, and all manner of polls, but the method has proved to be faster even without a head start. 2 Resum Cadascun de nosaltres està constantment generant dades sobre els nostres interessos, preferènciesoafiliacions, jasiguidemaneraconscientoinconscient. Elsavançosenel camp tecnolo`gic han portat a un enorme volum de dades disponibles a internet sobre cada individu. L’u´s i publicació de dades sensibles (investigacio´ social, marketing) han generat la necessitat de tècniques d’anonimització, on es busca un comprom´ıs entre la utilitat de les dades i la proteccio´ de la privacitat. La microagregació k- ano`nima permet publicar un conjunt de dades on una persona no pot ser distingida de, comam´ınim, k−1individusmentreesmantéunadependènciaestad´ısticasimilar entre atributs. En l’actualitat, generalment s’utilitzen algorismes de microagregacio´ en aquest camp a gràcies a la simplicitat i qualitat que proporcionen. No obstant aixo`, utilitzats en grans conjunts de dades aquests mètodes resulten cars en termes de temps de procés. Aquest treball aborda la necessitat d’executar la k-anonimitzacio´ d’una manera més ra`pida i eficient tot presentant una pèrdua m´ınima per distorsió. Per a això, par- tim el conjunt de dades original en dues fraccions que seran processades en dues fases consecutives permetent la possibilitat de comen¸car fins i tot abans de la recepcio´ del conjunts de dades sencer. Intuitivament el procés seria indicat per a anonimitzar en- questes, processos electorals, i altres tipus d’escrutinis, pero` el mètode ha demostrat ser més ra`pid fins i tot sense avantatge a l’inici. 3 Resumen Cadaunodenosotrosestaćonstantemente-yaseademaneraconscienteoinconsciente- generando datos sobre nuestros intereses, preferencias o afiliaciones. Los avances en el campo tecnolo´gico han llevado a que haya un enorme volumen de datos disponible en internet sobre cada individuo. El uso y publicacioń de datos sensibles (para investigación social o de marketing) han generado la necesidad de técnicas de anoni- mizacioń, dondesebuscauncompromisoentrelautilidaddelosdatosylaproteccioń de la privacidad. La microagregación k-anońima permite publicar un conjunto de datos donde una persona no puede ser distinguida de, al menos, k − 1 individuos, mientras se mantiene una dependencia estad´ıstica similar entre atributos. En la ac- tualidad, los algoritmos de microagregacioń son usados comuńmente en este campo gracias a la simplicidad y la calidad que proporcionan. Sin embargo, usados en grandes conjuntos de datos estos métodos resultan caros en términos de tiempo de proceso. Este trabajo aborda la necesidad de ejecutar la k-anonimizacioń de una manera ma´s ra´pida y eficiente introduciendo a su vez una pérdida m´ınima por distorsioń. Para ello, partimos el conjunto de datos original en dos fracciones que serań proce- sadas en dos pasos consecutivos permitiendo la posibilidad de empezar incluso antes de recibir todo el conjunto de datos. Intuitivamente el proceso ser´ıa indicado para anonimizarencuestas, procesoselectorales, ytodotipodeescrutinios, peroelmétodo ha demostrado ser ma´s ra´pido incluso sin ventaja en el inicio. 4 Acknowledgments I would like to begin thanking my thesis advisors Dr. David Rebollo-Monedero and Prof. Jordi Forné, for giving me the opportunity to collaborate on this project. This work has not only served as an excellent learning experience but also as my first approach to the scientific research field. David has always been willing to help me and teach me, he has also allowed this thesis to be my own work while constantly steering me in the right direction whenever I needed it. Additionally, I would also like to acknowledge Jordi for his valuable, insightful comments as a second advisor of this thesis. I also owe a huge debt of gratitude to my friends Carlos, Albert, Roger, Guillem, Marc, Izan, Oriol and my colleagues at T2C and Danone, Enric, Rau´l and Joan Carles,whobelievemuchmoreinmycapabilitiesthanIdo. Myspecialthanksbelong to my parents, César and Silvia, and sister, Paula. This thesis is the culmination of a long journey, I would not be who I am or where I am without their unconditional support throughout these years. Last but not least, this thesis is also for Raquel, for making everything so easy with her enthusiasm to see me growing and fulfilling my dreams by her side. 5 Contents Abstract 2 Resum 3 Resumen 4 Acknowledgments 5 1 Introduction 13 1.1 Contribution and Organization . . . . . . . . . . . . . . . . . . . . . 17 2 Review of the State of the Art on k-Anonymous Microaggregation 21 2.1 State of the art on Statistical Disclosure Control . . . . . . . . . . . . 21 2.2 Inference Risk of k-Anonymity . . . . . . . . . . . . . . . . . . . . . . 22 2.3 On the Complexity of k-Anonymity . . . . . . . . . . . . . . . . . . . 24 2.4 Algorithms for k-Anonymous Microaggregation . . . . . . . . . . . . 25 2.5 Other Incremental Approaches . . . . . . . . . . . . . . . . . . . . . . 25 3 Formal Model: The Concept of Incremental Microaggregation 27 3.1 Microaggregation Background . . . . . . . . . . . . . . . . . . . . . . 27 3.2 Formulation for Incremental k-Anonymous Microaggregation . . . . . 28 3.2.1 Running Times . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.2 Critical and Optimal Data Ratio . . . . . . . . . . . . . . . . 30 3.2.3 Deadline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.4 Data Utility Loss . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.5 Effect of Dimensions on Distortion . . . . . . . . . . . . . . . 32 6 Contents 7 3.2.6 Inertial Coefficient . . . . . . . . . . . . . . . . . . . . . . . . 34 4 Algorithms and Theoretical Analysis 37 4.1 Base Algorithm: MDAV . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2 Incremental Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.1 MDAV as an Incremental Algorithm . . . . . . . . . . . . . . 38 4.2.2 Nearest-Neighbor Method . . . . . . . . . . . . . . . . . . . . 45 5 Experimental Results 48 5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2 Experimental Findings . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2.1 MDAV as an Incremental Algorithm . . . . . . . . . . . . . . 52 5.2.2 Nearest-Neighbor Algorithm . . . . . . . . . . . . . . . . . . . 54 5.2.3 Incremental Algorithms Comparison . . . . . . . . . . . . . . 57 6 Conclusion and Future Research 60 Bibliography 64 List of Figures 1.1 Example of k-anonymous microaggregation of published demograph- ical data with k = 3, showing income and political affiliation. Under the assumption of this data being collected during an extended span of time, the traditionalmicroaggregationmethodswouldstart onceall data has been collected. Our proposal consists in starting the process before the whole data availability processing it in two steps. . . . . . 14 1.2 Example of incremental approach versus traditional one in a 10-hour voting process. This may be possible in most forms of electronic surveys or electoral processes where postal mail votes are used, but may not be applicable to processes where ballot boxes must remain closed until the end. Note that the traditional approach would not be usable if the result were required in less than 2 hours. On the other hand, our method allows finishing the process in just 39 minutes, although is shown in conclusions that it can be finished earlier. . . . . . . . . . 18 3.1 Microaggregation interpreted as a quantization problem on the Quasi- Identifiers. The microaggregation example is ran using MDAV, de- scribed in §4.1, in R2 with k = 5. . . . . . . . . . . . . . . . . . . . . 28 3.2 Formulation of the incremental problem, τ defines the relative time of eachcomputationalprocessincomparisonwiththetimeneededtorun the traditional algorithm. The time saving corresponds to the typical super additivity of the traditional microaggregation algorithms which implies that τ +τ < τ = 1, and the head start τ . . . . . . . . . . . 29 0 + − 8 List of Figures 9 3.3 The relative head start time is directly related to the rate of arrivals and nu. In that regard, we introduce the head start coefficient ς. In thispaperweworkundertheassumptionofuniformarrivals, precisely the first plot of this figure. It is easily seen that the head start relative time will be defined τ = ςν. . . . . . . . . . . . . . . . . . . . . . . . 30 − 4.1 Comparison of ν , which represents the critical data ratio, versus its − strong approximation, defined in §5.2.1, which presents more math- ematical tractability and is a good approach for values of ς greater than 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Representation of the optimal data ratio ν∗ under the assumption of uniform arrivals and using MDAV as both base and incremental algorithm. Observe that once we surpass the saturation point ν∗ = ν . 42 − 4.3 Effect of ς on the critical data ratio ν∗ and the relative time gain ∆τ. The relative time gain reaches its maximum where τ +τ = 1, 0 − saturation region, for values bigger than ς = ς . However, when using − values of ς smaller than the threshold the maximum, and optimal data ratio, is reached before the saturation point. . . . . . . . . . . . . . . 43 4.4 In order to establish a lower bound for the relative deadline τ we end define it assuming the maximum relative time gain, concretely the relative time gain obtained using the optimal data ratio ν∗. . . . . . . 44 4.5 Discriminant (2+ς)2−8(1−τ ) of the data ratio needed to end endmin at the deadline is 0 before reaching saturation point ς = ς and grows − past this point. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.1 Relative time gain of MDAV as an incremental algorithm versus traditional MDAV without a head start. . . . . . . . . . . . . . . . . . . 52 5.2 Evaluation of τ as a function of ν using MDAV as an incremen- + tal algorithm in comparison with the expected outcome ν2 and the boundary τ¯ +τ > τ . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 0 − + 5.3 Relative distortion loss of MDAV as an incremental algorithm versus traditional MDAV usingk = 10. It can be seen that the relative distortion loss decreases as the number of dimensions increases. . . . 53 List of Figures 10 5.4 Relative distortion loss of Find Nearest-Neighbors as an incremental algorithm versus traditional MDAV using k = 10 and 100. The performance of this approach in distortion is merely displayed as a comparison with the improved version which uses cell splitting, but it is not a practical method. . . . . . . . . . . . . . . . . . . . . . . . . 54 5.5 Relative distortion loss of Find Nearest-Neighbors, with cell splitting using MDAV, as an incremental algorithm versus traditional MDAV using k = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.6 Relative distortion loss of Find Nearest-Neighbors, with cell splitting using MDAV and with and without the inertial coefficient being con- sidered, as an incremental algorithm versus traditional MDAV using k = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.7 Relative time gain of Find Nearest-Neighbors, with cell splitting using MDAV, as an incremental algorithm versus traditional MDAV without a head start. This plot compares the time efficiency of the two approaches proposed, split cells while adding records and split cells at the end. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.8 Evaluationofτ asafunctionofν usingFindNearest-Neighbors, with + cell splitting using MDAV, as an incremental algorithm in comparison with the expected MDAV outcome ν2 and the boundary τ¯ +τ > τ . 57 0 − + 5.9 Comparison of MDAV as an incremental algorithm vs. Find Nearest- Neighbor with cell splitting at the end. Here we compare the relative running times of the incremental algorithm τ , and the relative dis- + tortion loss introduced δ. This experiment is carried out using 50000 records of 15 dimensional Gaussian data and k = 10. . . . . . . . . . 58 5.10 Comparison of MDAV as an incremental algorithm vs. Find Nearest- Neighbor with cell splitting at the end. Here we compare the relative running times of the incremental algorithm τ , and the relative dis- + tortion loss introduced δ. This experiment is carried out using 50000 randomly picked records of “Large Census” and k = 10. . . . . . . . . 59

Description:

Marc, Izan, Oriol and my colleagues at T2C and Danone, Enric, Raúl and Joan. Carles, who believe .. Having the resourcefulness to handle 1 mil-.

Incremental k-Anonymous Microaggregation of Large-Scale Datasets PDF

68 Pages·2016·1.74 MB·English

by César Hernández Baigorri

Checking for file health...

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Download Incremental k-Anonymous Microaggregation of Large-Scale Datasets PDF Free - Full Version

by César Hernández Baigorri| 2016| 68 pages| 1.74| English

Download Incremental k-Anonymous Microaggregation of Large-Scale Datasets by César Hernández Baigorri in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About Incremental k-Anonymous Microaggregation of Large-Scale Datasets

Marc, Izan, Oriol and my colleagues at T2C and Danone, Enric, Raúl and Joan. Carles, who believe .. Having the resourcefulness to handle 1 mil-.

Detailed Information

Author:	César Hernández Baigorri
Publication Year:	2016
Pages:	68
Language:	English
File Size:	1.74
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free Incremental k-Anonymous Microaggregation of Large-Scale Datasets Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download Incremental k-Anonymous Microaggregation of Large-Scale Datasets PDF?

Yes, on https://PDFdrive.to you can download Incremental k-Anonymous Microaggregation of Large-Scale Datasets by César Hernández Baigorri completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read Incremental k-Anonymous Microaggregation of Large-Scale Datasets on my mobile device?

After downloading Incremental k-Anonymous Microaggregation of Large-Scale Datasets PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of Incremental k-Anonymous Microaggregation of Large-Scale Datasets?

Yes, this is the complete PDF version of Incremental k-Anonymous Microaggregation of Large-Scale Datasets by César Hernández Baigorri. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download Incremental k-Anonymous Microaggregation of Large-Scale Datasets PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.