DOCTORAL THESIS
UNIVERSITY PIERRE AND MARIE CURIE
Speciality: Computer Science
École doctorale Informatique, Télécommunications et Électronique (Paris)

Presented by Maria Coralia Laura NECULA MAAG
for the degree of DOCTOR of the UNIVERSITY PIERRE AND MARIE CURIE

Thesis subject: Automatic Learning of Anonymization Functions for Graphs and Dynamic Graphs
Defended on 8 April 2015

Jury:
M. Patrick Gallinari, Professor, University Pierre and Marie Curie (Director)
M. Ludovic Denoyer, Professor, University Pierre and Marie Curie (Co-director)
M. Fabrice Rossi, Professor, University Panthéon-Sorbonne (Referee)
M. Benjamin Nguyen, Professor, INSA Val de Loire (Referee)
M. Bernd Amann, Professor, University Pierre and Marie Curie (Examiner)
Mrs. Maryline Laurent, Professor, Télécom SudParis (Examiner)
M. Philippe Jacquet, Research Director, Alcatel-Lucent Bell Labs (Examiner)
M. Hakim Hacid, Associate Professor, Zayed University, United Arab Emirates (Examiner)

Abstract

Data privacy is a major problem that has to be considered before releasing datasets to the public, or even to a partner company that would compute statistics or perform a deep analysis of these data. Privacy is ensured by performing data anonymization as required by legislation. In this context, many different anonymization techniques have been proposed in the literature. These techniques are difficult to use in a general context where attacks can be of different types and where measures are not known to the anonymizer. Generic methods able to adapt to different situations therefore become desirable. We address the problem of privacy for graph data which needs, for different reasons, to be made publicly available. This corresponds to the anonymized graph data publishing problem. We take the perspective of an anonymizer who does not have access to the methods that will be used to analyze the data. A generic methodology based on machine learning is proposed to obtain an anonymization function directly from a set of training data, so as to optimize a tradeoff between privacy risk and utility loss. The method thus allows one to obtain a good anonymization procedure for any kind of attack and any characteristic in a given set. The methodology is instantiated for simple graphs and for complex timestamped graphs. A tool implementing the method has been developed and experimented with success on real anonymized datasets coming from Twitter, Enron and Amazon. Results are compared with a baseline, and it is shown that the proposed method is generic and can automatically adapt itself to different anonymization contexts.
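The abstract only names the privacy/utility tradeoff optimized by the learned anonymization function; the exact formulation is given in Chapter 4. As a purely schematic reading (the weight λ, the risk measure R, the utility-loss measure U and the averaging over training graphs G_i below are assumed notation, not the thesis's exact definitions), the learning problem can be pictured as:

    f^{*} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N}
            \Big[ \lambda \, R\big(f(G_i)\big) + (1 - \lambda) \, U\big(G_i, f(G_i)\big) \Big]

Under this reading, λ close to 1 privileges privacy (low re-identification risk on the anonymized graphs f(G_i)), while λ close to 0 privileges utility (anonymized graphs that stay close to the originals).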
Acknowledgments

First of all, I would like to thank my thesis advisors, Prof. Patrick Gallinari and Prof. Ludovic Denoyer, for their availability and their support, which were essential to the completion of this work. I would also like to thank Philippe Jacquet for his support and advice, which greatly guided my research.

I thank those who were behind the initiative of this doctoral thesis, in particular Hakim Hacid, Johann Daigremont and Bruno Aidan, as well as Xavier Andrieu for his help and advice throughout these years.

I thank all my present and former colleagues at Bell Labs. I especially thank Alonso Silva Allende for agreeing to proofread the first versions of my manuscript, as well as Gerard Burnside and Marc-Olivier Buob for their advice on the presentation of this work.

I thank my present and former management at Bell Labs, in particular Jean-Luc Beylat and Chris White, for providing me with an environment conducive to carrying out this project.

I thank Professors Benjamin Nguyen and Fabrice Rossi for accepting the task of reviewing my thesis. I also thank Professors Maryline Laurent, Bernd Amann, Hakim Hacid and Philippe Jacquet, who did me the honor of serving on my thesis committee, as well as Laurent Viennot and Bernd Amann for agreeing to serve on my mid-term committee.

Many thanks to everyone who attended my defense, some of whom came from far away: I was deeply touched by your presence.

I would like to express my deep gratitude to my family, my husband Stéphane, my children Lucas and Antoine, and my mother, Laurentia Veronica. Without their support, this thesis would hardly have been possible.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 General Context
  1.2 Graph Data Anonymization Issue
  1.3 Contributions
  1.4 Detailed Content

2 State of the Art
  2.1 Introduction
  2.2 Privacy Protection for Graph Data
    2.2.1 What to protect?
    2.2.2 De-anonymization Techniques
  2.3 Anonymization Techniques for Graphs
    2.3.1 Structural Modification Based on the k-Anonymity Concept
    2.3.2 Randomization Techniques
    2.3.3 Generalization Techniques
  2.4 Utility Loss Evaluation
    2.4.1 Graph Topological Properties
    2.4.2 Graph Spectral Properties
    2.4.3 Network Queries Aggregation
  2.5 Complex Graphs Anonymization
    2.5.1 Hypergraphs
    2.5.2 Temporal Graphs
  2.6 Differential Privacy for Graph Data Anonymization
    2.6.1 Principle
    2.6.2 Differential Privacy for Data Release
  2.7 Machine Learning in the Data Anonymization Process
    2.7.1 Machine Learning Used for Data De-Anonymization
    2.7.2 Machine Learning Used for Data Anonymization
    2.7.3 Exploring the Privacy-Utility Tradeoff
  2.8 Conclusion

3 Temporal Graphs Anonymization Issue
  3.1 Introduction
  3.2 Graphs Risk for De-Anonymization Based on Subgraphs Partitioning
  3.3 Anonymization by Data Partitioning
  3.4 System Architecture
  3.5 Conclusion

4 Anonymization Methodology Based on Machine Learning
  4.1 Introduction
  4.2 Methodology
  4.3 Notations and Definitions
    4.3.1 Anonymization Function
    4.3.2 Utility Loss
    4.3.3 Privacy Risk
  4.4 Optimization Problem: Balance between Utility Loss and Privacy Risk
  4.5 Summary
  4.6 Optimization Methods
    4.6.1 Estimation of Distribution Algorithm
    4.6.2 Genetic Algorithms
  4.7 Conclusion

5 Simple Graphs Anonymization
  5.1 Introduction
  5.2 Optimization Problem for Simple Graphs
  5.3 Anonymization Method
  5.4 Privacy Risks
    5.4.1 k-Degree Anonymity
    5.4.2 k-Neighborhood Anonymity
  5.5 Utility Loss Evaluation
    5.5.1 Clustering Coefficient Based Utility Loss (CC)
    5.5.2 Page Rank Based Utility Loss (PR)
    5.5.3 Two-Hop Neighborhood Based Utility Loss (THN)
  5.6 Experiments
    5.6.1 Baseline (BL)
    5.6.2 Datasets
    5.6.3 Results
  5.7 Conclusion

6 Adaptive Temporal Graphs Anonymization for Data Publishing
  6.1 Introduction
  6.2 Machine Learning for Call Detail Records Anonymization
    6.2.1 Notations and Definitions
    6.2.2 Learning Problem
  6.3 Anonymization Method
  6.4 Privacy Risks in Call Logs
    6.4.1 Privacy Attack by Communication Sequence Generation (CSG)
    6.4.2 Privacy Attack by Neighborhood Degree Distribution (NDD)
  6.5 Utility Loss Evaluation
    6.5.1 Changes Performed by the Anonymization Algorithm in the Graph (CHG)
    6.5.2 Query Based Measures: Call Distribution Distance (CDD)
    6.5.3 Graph Topological Properties: Vertices In/Out Degrees (DE)
  6.6 Experiments
    6.6.1 Baseline: Random Data Perturbation
    6.6.2 Datasets
    6.6.3 Results
  6.7 Conclusion

7 Conclusion
  7.1 Contributions
  7.2 Perspectives

Bibliography

List of Figures

1.1 Effects on privacy and utility of anonymization with different parameters: black bars represent the privacy of the data (smaller bars signifying more privacy); gray bars represent the utility of the data.
2.1 Fingerprint planted in the initial graph: each node represents a user. The fingerprint obtained by the "seed" subgraph formed by vertices h, 1, 2, 3, 4, 5, 6 can be uniquely re-identified using the degree sequence of h.
2.2 Friendship attack: Bob and Carl can be uniquely identified by their vertex degree pair (2, 4).
2.3 Sequential releases of a dynamic social network: Bob can be re-identified between t_1 and t_2 as being the only person having visited the hospital for the first time.
2.4 Graph partition and edge copy: the edge between nodes 2 and 4 and the crossing edge between 5 and 9 are added in the anonymized graph.
2.5 Edge clustering methods: relational information is kept only between clusters of vertices.
3.1 Main problem of existing anonymization techniques w.r.t. the time (i.e. dynamics) dimension: even if the anonymized graph respects anonymization constraints (e.g. k-anonymity), when decomposing it into subgraphs, the subgraph in timeslot TS4 can be de-anonymized.
3.2 Temporal graphs anonymization reinforcement technique, an approach in three main steps: (i) data decomposition into subsets, (ii) anonymization of each subset independently, and (iii) aggregation of the anonymized subsets.
3.3 General architecture of the dynamic anonymization server (DAS).
4.1 Anonymization general framework: private information can be revealed if the adversary combines external data with anonymized data.
4.2 Learning process.
4.3 Estimation of distribution algorithm (a schematic code sketch follows this list): at each step, candidate solutions are generated. At step 0, the population is initialized from a uniform distribution over the admissible solutions (P). The most promising candidates (PS) are selected, and a new population is generated at each step following a normal distribution N with the distribution parameters PDe.
5.1 Vertices degree hash table: it stores, for each degree d_i present in the graph, the corresponding list of vertices of degree d_i.
5.2 Example for local clustering coefficient: three graphs used to illustrate how the clustering coefficient is computed.
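To make the caption of figure 4.3 concrete, the following is a minimal, self-contained sketch of a generic Gaussian estimation-of-distribution loop. The function name, parameters and toy score are illustrative assumptions for this preview, not the optimizer actually used in the thesis.

import numpy as np

def eda_maximize(score, dim, pop_size=100, n_selected=20, n_steps=50,
                 low=0.0, high=1.0, seed=None):
    """Minimal Gaussian estimation-of-distribution loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 0: initialize the population uniformly over the admissible domain P.
    population = rng.uniform(low, high, size=(pop_size, dim))
    for _ in range(n_steps):
        fitness = np.array([score(x) for x in population])
        # Keep the most promising candidates (the selected set PS).
        selected = population[np.argsort(fitness)[-n_selected:]]
        # Estimate the distribution parameters (per-coordinate mean and std).
        mean = selected.mean(axis=0)
        std = selected.std(axis=0) + 1e-9   # avoid a degenerate distribution
        # Draw the next population from N(mean, std), clipped to the domain.
        population = np.clip(rng.normal(mean, std, size=(pop_size, dim)), low, high)
    return max(population, key=score)

# Toy usage: maximize a score peaked at the vector (0.7, ..., 0.7).
if __name__ == "__main__":
    best = eda_maximize(lambda x: -float(np.sum((x - 0.7) ** 2)), dim=5, seed=0)
    print(best)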
