Learning to de-anonymize social networks

Kumar Sharad
University of Cambridge, Computer Laboratory
Churchill College
November 2016

This dissertation is submitted for the degree of Doctor of Philosophy.

Summary

Releasing anonymized social network data for analysis has been a popular idea among data providers. Despite evidence to the contrary, the belief that anonymization will solve the privacy problem in practice refuses to die. This dissertation contributes to the field of social graph de-anonymization by demonstrating that even automated models can be quite successful in breaching the privacy of such datasets. We propose novel machine-learning-based techniques to learn the identities of nodes in social graphs, thereby automating manual, heuristic-based attacks. Our work extends the vast literature on social graph de-anonymization attacks by systematizing them. We present a random-forest-based classifier which uses structural node features, based on neighborhood degree distribution, to predict node similarity. Using these simple and efficient features, we design versatile and expressive learning models which can learn the de-anonymization task from just a few examples. Our evaluation establishes their efficacy in transforming de-anonymization into a learning problem. The learning is transferable: a model trained on one graph can be used to attack another.

Moving on, we demonstrate the versatility and wider applicability of the proposed model by using it to solve the long-standing problem of benchmarking social graph anonymization schemes. Our framework bridges a fundamental research gap by making cheap, quick and automated analysis of anonymization schemes possible, without even requiring their full description. The benchmark is based on comparing structural information leakage against utility preservation. We study the trade-off of anonymity vs.
utility for six popular anonymization schemes, including those promising k-anonymity. Our analysis shows that none of the schemes is fit for purpose.

Finally, we present an end-to-end social graph de-anonymization attack which uses the proposed machine-learning techniques to recover node mappings across intersecting graphs. Our attack advances the state of the art in graph de-anonymization by demonstrating better performance than all other attacks, including those that use seed knowledge. The attack is seedless and heuristic-free, which demonstrates the superiority of machine-learning techniques as compared to hand-crafted parametric attacks.

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. It is not substantially the same as any that I have submitted, or is being concurrently submitted, for a degree, diploma or other qualification at the University of Cambridge or any other university or similar institution, except as declared in the Preface and specified in the text. I further state that no substantial part of my dissertation has already been submitted, or is being concurrently submitted, for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution, except as declared in the Preface and specified in the text. This dissertation does not exceed the regulation length of 60,000 words, including tables and footnotes.

Acknowledgments

First and foremost, I would like to thank my supervisor Ross Anderson, without whom this dissertation would not have been possible. He helped me at critical junctures and provided encouragement and valuable feedback; I owe him much gratitude for whatever I managed to achieve at Cambridge. I thank George Danezis for mentoring me during the initial stages of my PhD and teaching me how to do research.
I thank Richard Clayton for helping me develop as a researcher; I can now fully appreciate the significance of sharing an office with him as a fresh PhD student and absorbing some of the wisdom on offer. I thank Alastair Beresford for his guidance and help throughout my PhD, especially as an examiner of both my first-year report and my final dissertation. I thank Emiliano De Cristofaro for generously agreeing to examine my dissertation and providing meticulous feedback.

I thank Lise Gough, who went above and beyond in helping me come to Cambridge and then throughout my stay here. I could not have hoped for a more helpful and sincere person; it is due to her efforts that I was able to pursue a PhD at Cambridge. I thank Rebecca Sawalmeh for her ever-helpful presence and for making my stay at Churchill pleasant and comfortable. I thank Juan Garay and Abhradeep Guha Thakurta for graciously hosting me at Yahoo Labs, Sunnyvale, for a very productive internship. Those were by far the sunniest three months I have had since leaving India in 2009.

I thank the members of the Security Group and the Computer Laboratory, past and present, for their help and support, including Ilias Marinos, Laurent Simon, Dimosthenis Pediaditakis, Marios Omar Choudary, Sheharbano Khattak, Rubin Xu, Dongting Yu, Bjoern Zeeb, Christian O'Connell, Alice Hutchings, Sophie Van Der Zee, Wei Ming Khoo, Timothy Goh, Jonathan Anderson, Markus Kuhn, Robert Watson and Frank Stajano. I thank my sponsors — Microsoft Research, EPSRC, the Computer Laboratory, the University of Cambridge and Churchill College — for supporting my PhD studies at Cambridge.

Finally, I thank my family — Ma, Papa and Didi — for their support throughout my education and their understanding, even though I could not visit them as frequently as I would have liked to in the past few years. I dedicate this dissertation to them.
Contents

1 Introduction
  1.1 Chapter outline
  1.2 Publications
  1.3 Statement on research ethics

2 Background
  2.1 Privacy in high-dimensional datasets
    2.1.1 Pitfalls of releasing private datasets
  2.2 Privacy challenges in social networks
  2.3 Anonymity loves company
  2.4 Differential-privacy-based schemes
  2.5 Clustering-based schemes
  2.6 Perturbation-based schemes
    2.6.1 k-Anonymity-based schemes
  2.7 The adversarial model
  2.8 De-anonymizing social networks
    2.8.1 Seed-based attacks
    2.8.2 Seedless attacks
  2.9 Definitions
  2.10 Summary

3 Automating social graph de-anonymization
  3.1 Introduction
  3.2 Motivation: the D4D challenge
    3.2.1 Data release and anonymization
    3.2.2 Robustness of anonymization
    3.2.3 Ad-hoc de-anonymization
    3.2.4 Limitations of ad-hoc de-anonymization
  3.3 Learning de-anonymization
    3.3.1 De-anonymization: a learning problem
    3.3.2 Decision trees and random forests
    3.3.3 Specialized social graph features
    3.3.4 Training and classification of node pairs
  3.4 Evaluation
    3.4.1 Experimental setup
    3.4.2 Results: same training distribution
    3.4.3 Results: different training distribution
    3.4.4 Traditional de-anonymization task
    3.4.5 Error analysis
    3.4.6 Data sample sizes
    3.4.7 Performance
  3.5 Discussion
    3.5.1 Is anonymization effective?
    3.5.2 Improving de-anonymization
    3.5.3 Choosing tree parameters
  3.6 Summary

4 Benchmarking social graph anonymization schemes
  4.1 Introduction
  4.2 Quantifying anonymity in social graphs
    4.2.1 The adversarial model
    4.2.2 Graph generation
    4.2.3 The classification framework
  4.3 Evaluation and results: anonymity vs. utility
    4.3.1 Random Sparsification (RSP)
Description: