
Exploring and Making Sense of Large Graphs

230 Pages·2015·20.62 MB·English


Exploring and Making Sense of Large Graphs

Danai Koutra

CMU-CS-15-126
August 2015

Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Christos Faloutsos, Chair
William Cohen
Roni Rosenfeld
Eric Horvitz, Microsoft Research, Redmond

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2015 Danai Koutra

This research was sponsored by the National Science Foundation under grant numbers IIS-1151017415, IIS-1217559, and IIS-1408924, the Department of Energy/National Nuclear Security Administration under grant number DE-AC52-07NA27344, the Defense Advanced Research Projects Agency under grant number W911NF-11-C-0088, the Air Force Research Laboratory under grant number F8750-11-C-0115, and the US Army Research Lab under grant number W911NF-09-2-0053. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: data mining, graph mining and exploration, understanding graphs, graph similarity, graph matching, network alignment, graph summarization, compression, pattern mining, outlier detection, anomaly detection, attribution, culprits, scalability, fast algorithms, models, visualization, social networks, brain graphs, connectomes, VoG, FaBP, DeltaCon, DeltaCon-Attr, TimeCrunch, BiG-Align, Uni-Align

To my parents and brother

Abstract

Graphs naturally represent information ranging from links between webpages, to friendships in social networks, to connections between neurons in our brains. These graphs often span billions of nodes and interactions between them. Within this deluge of interconnected data, how can we find the most important structures and summarize them? How can we efficiently visualize them? How can we detect anomalies that indicate critical events, such as an attack on a computer system, disease formation in the human brain, or the fall of a company?
To gain insights into these problems, this thesis focuses on developing scalable, principled discovery algorithms that combine globality with locality to make sense of one or more graphs. In addition to our fast algorithmic methodologies, we also contribute graph-theoretical ideas and models, and real-world applications in two main areas:

• Single-Graph Exploration: We show how to interpretably summarize a single graph by identifying its important graph structures. We complement summarization with inference, which leverages information about few entities (obtained via summarization or other methods) and the network structure to efficiently and effectively learn information about the unknown entities.

• Multiple-Graph Exploration: We extend the idea of single-graph summarization to time-evolving graphs, and show how to scalably discover temporal patterns. Apart from summarization, we claim that graph similarity is often the underlying problem in a host of applications where multiple graphs occur (e.g., temporal anomaly detection, discovery of behavioral patterns), and we present principled, scalable algorithms for aligning networks and measuring their similarity.

We leverage techniques from diverse areas, such as matrix algebra, graph theory, optimization, information theory, machine learning, finance, and social science, to solve real-world problems. We have applied our exploration algorithms to massive datasets, including a Web graph of 6.6 billion edges, a Twitter graph of 1.8 billion edges, brain graphs with up to 90 million edges, collaboration, peer-to-peer networks, browser logs, all spanning millions of users and interactions.

Acknowledgements

I am greatly indebted to my advisor, Christos Faloutsos, one of the nicest and happiest people I have ever met. Even during his sabbatical and my first year in graduate school, he spent hours remotely (on Sundays!)
to give me advice on classes, and to teach me how to do research, write technical papers and give conference presentations. Over the past five years, Christos did not just prepare me for every step of the Ph.D. degree, but also for an independent academic career by involving me in research proposals, PI meetings, guest lectures, student mentoring, and following up with feedback and advice. He has also been extremely supportive and understanding, always advising me to rest, work out (zumba!), and balance work and other lifestyle choices, especially when he would see a burnout coming up. Moreover, Christos has always been fair, giving credit where credit is due in various ways; I will not forget the trip to CIKM’12 (which, conveniently, was in Hawaii, and came shortly after a tiring, vacation-less summer) where Christos sent me to thank me for contributing to grants and for attending DARPA PI meetings the previous year. During my last (and most intense) year at CMU, he was always there to give me tips about my job search, support me during the tiring series of flights, and remind me to celebrate each success during the process. Even until the final version of this document, Christos has been providing insightful comments and suggestions. No matter how much I write in this note, it is impossible to express all my gratitude to him.

I would also like to thank the other members of my committee, William Cohen, Roni Rosenfeld and Eric Horvitz, for giving me feedback and asking insightful questions during my thesis proposal and defense. I would like to particularly thank William for spending a lot of time to give me detailed comments on this document, and help improve the content. I am grateful to Eric for not only sitting on my dissertation committee, but also mentoring me through a wonderful internship at MSR, introducing me to the world of social science, advising me through my job application process, and suggesting interesting directions for my work.
I have had the opportunity to work with numerous great mentors during summer and fall internships. During my 4th internship, I noticed that the number of my mentors follows the Fibonacci sequence (1, 1, 2, 3, 5, ...), so I decided to not intern again (okay, I was also running out of CPT time). I thank Hanghang Tong, Paul Bennett, Smriti Bhagat, Stratis Ioannidis, and Udi Weinsberg for the opportunity to work with them on fascinating research problems, and for their guidance.

I thank all my friends in the Database group: Vagelis Papalexakis (Thanks for all the support and fun moments during grad school: the timesharing presentation in Hawaii, the crazy number of coffee shots in Beijing, the summer road trip to LA, and the office visits to cheer me up or get cheered up are only a few of the moments that come to my mind), Alex Beutel (I won’t forget our long discussions about the field of data mining. Also, so sorry that I happened to arrange my defense when Vagelis and you were interning, and you were not able to share ‘the’ videos!), Jay-Yoon Lee, Neil Shah, and Miguel Araújo. I also thank the DB friends who got me started when I joined CMU, and are now excelling in various positions: Leman Akoglu (also for the endless chats and giggles when rooming at conferences), Polo Chau (for your advice during my job search too!), Fan Guo, U Kang, Lei Li, and Aditya Prakash.

I am thankful to all my friends, collaborators (whom I did not mention already) and colleagues who have inspired me over the past five years, and everyone who helped me during my job search by providing insightful feedback on my job materials, and by giving me advice about the interviews and decision process: Tina Eliassi-Rad, Brian Gallagher, Chris Homan, Jure Leskovec, Jen Neville, Andy Pavlo, Jimeng Sun, Evimaria Terzi, Joshua Vogelstein, Jilles Vreeken, and many more.

Everything would have been harder without the flawless administrative support from Marilyn Walgora, Deb Cavlovich, and Todd Seth.
I am particularly grateful to Marilyn for handling all my travel arrangements (which were too many!), for putting together all sorts of applications, for sending out reminders for upcoming deadlines, for helping me with the DB seminars, and for co-organizing surprise birthday parties with me :)

Grad school was not just about work. I was fortunate to have wonderful friends around me who made the experience more fun, and less daunting: Dana and Yair Movshovitz-Attias, Sarah Loos and Jeremy Loos Karnowski, John Wright, Aaditya Ramdas, Gennady Pekhimenko, JP, Anthony Platanios, Eliza Kozyri, my early officemates Gaurav Veda, Kate Taralova and Sue Ann Hong, my later officemates Guru Guruganesh and Sahil Singla, my housemates Elli Fragkaki, Eleana Petropoulou and Eylül Engin, the Greek and Turkish gangs in Pittsburgh, and many more friends with whom we had a blast during internships. Special thanks to my closest friend and partner, Walter Lasecki, for being patient and supporting me, for making future career decisions with me, and for becoming my most important tie to the ‘America land’.

Last but not least, many thanks are due to the most amazing parents and brother, who have been extremely supportive during the whole process. They have always been there ready to answer my skype calls to hear my news, provide encouragement and advice, celebrate my successes, help me recover fast after failures, and cheer me up. Dad, thank you for discussing some of my research projects with me, for patiently going through some of my proofs, and for suggesting the appropriate statistical methods. After writing a paper with my brother, the goal now is for all four of us to write a paper bridging computer science, statistics, finance, and chemistry.

Contents

1 Introduction
  1.1 Overview and Contributions
    1.1.1 Part I: Single-Graph Exploration
    1.1.2 Part II: Multiple-Graph Exploration
  1.2 Overall Impact
2 Background
  2.1 Graphs
  2.2 Graph Properties
  2.3 Graph-theoretic Data Structures
  2.4 Common Symbols

I Single-Graph Exploration

3 Graph Summarization
  3.1 Proposed Method: Overview and Motivation
  3.2 Problem Formulation
    3.2.1 MDL for Graph Summarization
    3.2.2 Encoding the Model
    3.2.3 Encoding the Errors
  3.3 VoG: Summarization Algorithm
    3.3.1 Step 1: Subgraph Generation
    3.3.2 Step 2: Subgraph Labeling
    3.3.3 Step 3: Summary Assembly
    3.3.4 Toy Example
    3.3.5 Time Complexity of VoG
  3.4 Experiments
    3.4.1 Q1: Quantitative Analysis
    3.4.2 Q2: Qualitative Analysis
    3.4.3 Q3: Scalability of VoG
  3.5 Discussion
  3.6 Related Work
    3.6.1 MDL and Data Mining
    3.6.2 Graph Compression and Summarization
    3.6.3 Graph Partitioning
    3.6.4 Graph Visualization
  3.7 Summary
4 Inference in a Graph: Two Classes
  4.1 Background
    4.1.1 Random Walk with Restarts (RWR)
    4.1.2 Semi-supervised learning (SSL)
    4.1.3 Belief Propagation (BP)
    4.1.4 Summary
  4.2 Proposed Method, Theorems and Correspondences
  4.3 Derivation of FaBP
  4.4 Analysis of Convergence
  4.5 Proposed Algorithm: FaBP
  4.6 Experiments
    4.6.1 Q1: Accuracy
    4.6.2 Q2: Convergence
    4.6.3 Q3: Sensitivity to parameters
    4.6.4 Q4: Scalability
  4.7 Summary
5 Inference in a Graph: More Classes
  5.1 Belief Propagation for Multiple Classes
  5.2 Proposed Method: Linearized Belief Propagation
  5.3 Derivation of LinBP
    5.3.1 Centering Belief Propagation
    5.3.2 Closed-form solution for LinBP
  5.4 Additional Benefits of LinBP
    5.4.1 Update equations and Convergence
    5.4.2 Weighted graphs
  5.5 Equivalence to FaBP (k = 2)
  5.6 Experiments
    5.6.1 Experimental setup
    5.6.2 Data
    5.6.3 Experimental Results
  5.7 Related Work
  5.8 Summary

II Multiple-Graph Exploration

6 Summarization of Dynamic Graphs
  6.1 Problem Formulation
    6.1.1 Using MDL for Dynamic Graph Summarization
    6.1.2 Encoding the Model
    6.1.3 Encoding the Errors
  6.2 Proposed Method: TimeCrunch
    6.2.1 Generating Candidate Static Structures
    6.2.2 Labeling Candidate Static Structures
    6.2.3 Stitching Candidate Temporal Structures
    6.2.4 Composing the Summary
  6.3 Experiments
    6.3.1 Datasets and Experimental Setup
    6.3.2 Quantitative Analysis
    6.3.3 Qualitative Analysis
    6.3.4 Scalability
  6.4 Related Work
  6.5 Summary
7 Graph Similarity
  7.1 Proposed Method: Intuition of DeltaCon
    7.1.1 Fundamental Concept
    7.1.2 How to measure node affinity?
    7.1.3 Why use Belief Propagation?
    7.1.4 Which properties should a similarity measure satisfy?
  7.2 Proposed Method: Details of DeltaCon
    7.2.1 Algorithm Description
    7.2.2 Speeding up: DeltaCon
    7.2.3 Properties of DeltaCon
