Table Of Content

Extracting and Analyzing Hidden Graphs from Relational Databases Konstantinos Xirogiannopoulos Amol Deshpande UniversityofMaryland,CollegePark UniversityofMaryland,CollegePark [email protected] [email protected] 7 ABSTRACT data,alargefractionofthedataofinterestinitiallyresidesinrela- 1 Analyzinginterconnectionstructuresamongunderlyingentitiesor tionaldatabasesystems(orsimilarstructuredstoragesystemslike 0 objectsinadatasetthroughtheuseofgraphanalyticscanprovide key-valuestores, withsomesortofschema); thiswilllikelycon- 2 tinuetobethecaseforavarietyofreasonsincludingthematurity tremendousvalueinmanyapplicationdomains. However,graphs ofRDBMSs,theirsupportfortransactionsandSQLqueries,andto b arenottheprimaryrepresentationchoiceforstoringmostdatato- somedegree,inertia. Relationaldatabasestypicallyincludemany e day,andinordertohaveaccesstotheseanalyses,usersareforced useful relationships between entities, and can contain many hid- F tomanuallyextractdatafromtheirdatastores,constructtherequi- den,interestinggraphs. Forexample,considerthefamiliarDBLP sitegraphs,andthenloadthemintosomegraphengineinorderto 1 dataset, whereausermaywanttoconstructagraphwiththeau- executetheirgraphanalysistask. Moreover,inmanycases(espe- 1 thors as the nodes. However, there are many ways to define the ciallywhenthegraphsaredense),thesegraphscanbesignificantly largerthantheinitialinputstoredinthedatabase,makingitinfea- edges;e.g.,wemaycreateanedgebetweentwoauthors:(1)ifthey ] co-authoredapaper,or(2)iftheyco-authoredapaperrecently,or B sible to construct or analyze such graphs in memory. In this pa- (3)iftheyco-authoredmultiplepaperstogether,or(4)iftheyco- D perweaddressbothofthesechallengesbybuildingasystemthat authoredapaperwithveryfewadditionalauthors(indicatingatrue enables users to declaratively specify graph extraction tasks over . collaboration),or(5)iftheyattendedthesameconference,andso s arelationaldatabaseschemaandthenexecutegraphalgorithmson c theextractedgraphs.Weproposeadeclarativedomainspecificlan- on. Someofthesegraphsmightbetoosparseortoodisconnected [ to yield useful insights, while others may exhibit high density or guageforthispurpose,andpairitupwithanovelcondensed,in- 2 memoryrepresentationthatsignificantlyreducesthememoryfoot- noise;however,manyofthesegraphsmayresultindifferenttypes ofinterestinginsights. Itisalsoofteninterestingtojuxtaposeand v print of these graphs, permitting analysis of larger-than-memory comparegraphsconstructedoverdifferenttimeperiods(i.e.,tem- 8 graphs. We present a general algorithm for creating such a con- poralgraphanalytics).Therearemanyothergraphsthatarepossi- 8 densedrepresentationforalargeclassofgraphextractionqueries blyofinteresthere,e.g.,thebipartiteauthor-publicationorauthor- 3 againstarbitraryschemas.Weobservethatthecondensedrepresen- conferencegraphs–identifyingpotentiallyinterestinggraphsitself 7 tationsuffersfromaduplicationissue,thatresultsininaccuracies maybedifficultforlargeschemaswith100softables. 0 formostgraphalgorithms. Wethenpresentasuiteofin-memory Currentlyauserwhowantstoexploresuchstructuresinanex- . representations that handle this duplication in different ways and 1 istingdatabaseisforcedto: (a)manuallyformulatetherightSQL allowtradingoff thememoryrequiredandthecomputationalcost 0 queriestoextractrelevantdata(whichmaynotcompletebecause forexecutingdifferentgraphalgorithms.Wealsointroduceseveral 7 ofthespaceexplosiondiscussedbelow),(b)writescriptstoconvert noveldeduplicationalgorithmsforremovingthisduplicationinthe 1 theresultsintotheformatrequiredbysomegraphdatabasesystem : graph, which are of independent interest for graph compression, v andprovideacomprehensiveexperimentalevaluationoverseveral or computation framework, (c) load the data into it, and then (d) i real-worldandsyntheticdatasetsillustratingthesetrade-offs.1 writeandexecutethegraphalgorithmsontheloadedgraphs. This X isacostly,labor-intensive,andcumbersomeprocess,andposesa r highbarriertoleveraginggraphanalyticsonthesedatasets.Thisis a 1. INTRODUCTION especiallyaproblemgiventhelargenumbersofentitytypespresent Analyzing the interconnection structure, i.e., graph structure, in most real-world datasets and a myriad of potential graphs that among the underlying entities or objects in a dataset can provide couldbedefinedoverthose. significantinsightsandvalueinmanyapplicationdomainssuchas Wearebuildingasystem,called GRAPHGEN,withthegoalto socialmedia,finance,health,andsciences. Thishasledtoanin- make it easy for users to extract a variety of different types of creasinginterestinexecutingawidevarietyofgraphanalysistasks graphsfromarelationaldatabase2,andexecutegraphanalysistasks andgraphalgorithms,e.g.,communitydetection,influencepropa- oralgorithmsovertheminmemory. GRAPHGENsupportsanex- gation, networkevolution, anomalydetection, centralityanalysis, pressiveDomainSpecificLanguage(DSL),basedonDatalog[2], etc. Many specialized graph databases (e.g., Neo4j, Titan, Ori- allowinguserstospecifyasinglegraphoracollectionofgraphs entDB,etc.),andgraphexecutionengines(e.g.,Giraph,GraphLab, to be extracted from the relational database (in essence, as views Ligra,Galois,GraphX)havebeendevelopedinrecentyearstoad- on the database tables). GRAPHGEN uses a translation layer to dresstheseneeds. generatetheappropriateSQLqueriestobeissuedtothedatabase, Althoughsuchspecializedgraphdatamanagementsystemshave madesignificantadvancesinstoringandanalyzinggraph-structured 2Although GRAPHGEN currently only supports PostgreSQL, it requires onlybasicSQLsupportfromtheunderlyingstorageengine,andcouldsim- 1AshorterversionofthispaperappearedinACMSIGMOD2017. plyscanthetablesifneeded. Graph Representation Edges ExtractionTime(s) DBLP Condensed 17,147,302 105.552 10Mrows FullGraph 86,190,578 >1200.000 IMDB Condensed 8,437,792 108.647 4.7Mrows FullGraph 33,066,098 687.223 TPCH Condensed 52,850 15.52 765Krows FullGraph 99,990,000 >1200.000 UNIV Condensed 60,000 0.033 32Krows FullGraph 3,592,176 82.042 (c) Expanded Graph (48 Edges) Table 1: Extracting graphs in GRAPHGEN using our condensedrepresentation(C-DUP)vsextractingthefullgraph(EXP). GRAPHGENenablesscalableextractionandanalysisongraphsthat p1 may not fit in memory. IMDB: Co-actors graph (on a subset of data), DBLP: Co-authors graph, TPCH: Connect customers who p2 buythesameproduct,UNIV:Connectstudentswhohavetakenthe samecourse(synthetic,fromhttp://db-book.com) p3 (a) Relational Tables andcreatesanefficientin-memoryrepresentationofthegraphthat p3 Nodes(ID, Name):-Author(ID, Name). is handed off to the user program or analytics task. GRAPHGEN Edges(ID1, ID2):-AuthorPub(ID1, supports a general-purpose Java Graph API as well as the stan- PubID), AuthorPub(ID2, PubID). dard vertex-centric API for specifying analysis tasks like PageR- (b) Extraction Query [Q1] (d) C-DUP (28 Edges) (e) DEDUP1 (32 Edges) ank. Figure1showsatoyDBLP-likedataset, andthequerythat Figure1: KeyconceptsofGRAPHGEN. ForC-DUPandDEDUP- specifiesa“co-authors”graphtobeconstructed. Figure1cshows 1,theauthornodesareshowntwice(withsubscripts_sand_t)to therequestedco-authorsgraph(GRAPHGENnaturallyextractsdi- avoidclutter(byseparatingthein-edgesandout-edges);physically rectedgraphs,andundirectedgraphsarerepresentedusingbidirec- theyarenotstoredseparately. tionaledges). The main scalability challenge in extracting graphs from rela- tionaltablesisthat:thegraphthattheuserisinterestedinanalyz- ing may be too large to extract and represent in memory, even if InFigure1,wecanseesuchaduplicationfortheedgea1→a4 theunderlyingrelationaldataissmall. Thereisaspaceexplosion sincetheyareconnectedthroughbothp1andp2. Suchduplica- because of the types of large-output3 joins that are often needed tion prevents us from operating on this condensed representation whenconstructingthesegraphs.Table1showsseveralexamplesof directly.Wedevelopasuiteofdifferentin-memoryrepresentations thisphenomenon. OntheDBLPdatasetrestrictedtojournalsand for this condensed graph that paired with a series of “deduplica- conferences,thereareapproximately1.6millionauthors,3million tion”algorithms,leverageavarietyoftechniquesfordealingwith publications, and8.6millionauthor-publicationrelationships; the theproblemofduplicateedges(oneofwhich,calledDEDUP-1,is co-authorsgraphonthatdatasetcontained86millionedges,andre- showninFigure1e). Wediscusstheprosandconsofeachrepre- quiredmorethanhalfanhourtoextractonalaptop.Thecondensed sentationandpresentaformalanalysiscomparingtheirtrade-offs. representationthatweadvocateinthispaperismuchmoreefficient Wealsopresentanextensiveexperimentalevaluation,usingseveral bothintermsofthememoryrequirementsandtheextractiontimes. realandsyntheticdatasets. TheDBLPdatasetis,insomesense,abest-casescenariosincethe Keycontributionsofthispaperinclude: averagenumberofauthorsperpublicationisrelativelysmall.Con- • Ahigh-leveldeclarativeDSLbasedonDatalogforintuitively structing the co-actors graph from the IMDB dataset results in a specifyinggraphextractionqueries. similarspaceexplosion.Similarly,agraphconnectingpairsofcus- • A general framework for extracting a condensed representa- tomerswhoboughtthesameiteminasmallTPCHdatasetresults tion(withduplicates)foracyclic,aggregation-freeextraction inagraphmuchlargerthantheinputdataset. EvenontheDBLP queriesoverarbitraryrelationalschemas. dataset,agraphthatconnectsauthorswhohavepapersatthesame conferencecontains1.8Bedges. • Asuiteofin-memoryrepresentationstohandletheduplication In this paper, we show how to analyze such large graphs by inthecondensedrepresentation. storing and operating upon them using a novel condensed repre- • Asystematicanalysisofthebenefitsandtrade-offsofferedby sentation, for extraction queries that are equivalent to unions of extractingandoperatingontheserepresentations. acyclic conjunctive queries without aggregations. The relational • Noveltechniquesfor“deduplicating”thesecondensedgraphs. modelalreadyprovidesanaturalsuchcondensedrepresentationfor suchqueries,thatwecallC-DUP,obtained(essentiallyforfree)by • Thefirstend-to-endsystemforenablinganalyticsongraphs omittingsomeofthelarge-outputjoinsfromthequeryrequiredfor thatexistwithinpurelyrelationaldatasets,efficiently,andwith- graphextraction. Figure1dshowsanexampleofC-DUPforthe outrequiringcomplexETL. co-authorsgraph,wherewecreateexplicitnodesforthepubs,in Outline: We begin with a brief discussion of how related work additiontothenodesfortheauthors;fortwoauthors,uandv,there has leveraged relational databases for graph analytics in the past isanedgeu → v,iffthereisadirectedpathfromus tovt inC- (Section2). Wethenpresentanoverviewof GRAPHGEN,briefly DUP.Thisrepresentationgeneralizestheideaofusingcliquesand describethegraphextractionDSL,anddiscusshow GRAPHGEN bicliques for graph compression [11, 26]; however, the key chal- decideswhenthecondensedrepresentationshouldbeextractedin- lengeforusisnotgeneratingtherepresentation,butratherdealing steadofthefullgraph(Section3).Wethendiscussthedifferentin- withduplicatepathsbetweentwonodes. memoryrepresentations(Section4)andpresentaseriesofdedupli- cationalgorithms(Section5).Finally,wepresentacomprehensive 3Weusethisterminsteadof“selectivity”termstoavoidconfusion. experimentalevaluation(Section6). 2. RELATEDWORK a1 a5 Results Graph Definition Graph Object a2 + + Graphdatamanagementsystems:Therehasbeenmuchworkon a6 Graph Execution Graph Analytics/query Analytics/query Results graphdatamanagementovertheyears,bothongraphdatabases[4] a4 a3 and graph analytics systems [47]. Unlike graph databases, our a7 a8 work targets the scenario where the data resides in an RDBMS Ingest/Shredding In-memory Graph Execution Engine and cannot be migrated to a graph database. The work on graph SQL Translation analyticssystemsislargelyorthogonalandcomplementary,asour Layer Extraction Results Queries techniquescanbeusedforefficientextractionandin-memoryrep- resentationinthesesystems,aswefurtherdiscussinSection6.4. Nodes Edges Props,.. Severaldistributedgraphanalyticssystemshaveadoptedhigh-level declarativeinterfacesbasedonDatalog[48,39,20,10].Ouruseof Datalogiscurrentlyrestrictedtospecifyingwhichgraphstoextract (inparticular,wedonotallowrecursion). Combiningdeclarative A relational database A relational database (i) (ii) graph extraction ala GRAPHGEN, and high-level graph analytics Figure 2: GRAPHGEN (right) has fundamentally different goals frameworks proposed in that prior work, is a rich area for future thanrecentworkonusingRDBMSsforgraphanalytics(left) work. Severalrecentworkshaveconsideredhowtoefficientlymigrate arelationaldatabasetoagraphdatabase[15,28,23],butthatwork typicallyfocusesoncreatingtheexpandedgraphmostefficiently, machinetodealwithbothissues. Ringodoesincludeanextensive anddoesn’tconsiderthepossibilityofgeneratingacondensedrep- libraryofbuilt-ingraphalgorithmsinSNAP[30],andwedoplan resentation.Priorworkontranslatingschemasfromonedatamodel tosupportRingoasafront-endanalyticsengineforGRAPHGEN. to another has considered more complex translation problems [7, 16];forus,thetranslation(throughourDSL)itselfiswell-defined Factorizedrepresentationofqueryresults: Ourcondensedrep- andstraightforward,buttheexecutionofthetranslationqueryand resentation is similar to the notion of “factorized representation avoidingthegenerationofthefinalresult,arethemainchallenges. of query results”, where the goal is also to maintain the result of a query in a compressed form [37]. This prior work proposes RDBMS&graphanalytics:Superficiallythemostcloselyrelated “schema-level factorizations” where the decisions about how to istherecentworkonleveragingrelationaldatabasesforgraphana- factorize a query result are based purely on the query, the rela- lytics,whoseaimistoshowthatspecializedgraphdatabasesoran- tion schemas, and the functional and multi-valued dependencies alyticsenginesmaybeunnecessary.Vertexica[25,24],GRAIL[17], that hold on the query result. On the other hand, our approach andSQLGraph[44]showhowtonormalizeandstoreagraphdataset canbeseenasexploring“data-dependentfactorization”.Thatprior asacollectionoftablesinanRDBMS(i.e.,howto“shred”graph workshowsthatthefactorizationoftheresultofanacyclicquery data),andhowtomapasubsetofgraphanalysistasksorqueries withoutprojectionsislinearinthesizeoftheinputdatabase;how- to relational operations on those tables (Figure 2). This is simi- ever,forquerieswithprojections,thestoragerequirementscanbe larinspirittotheearlierworkonstoringsemi-structuredorXML significantlyhigher. Forexample,forthequerythatgeneratesthe documentsinanRDBMS[16,40]. EmptyHeaded[1]showshow co-authorsgraph,theschema-levelapproachentailsexpandingthe worst-caseoptimaljoinalgorithmsmaybeusedforefficientgraph graphandmayrequirequadraticstorageintheworstcase(cf. Ap- querying. On the other hand, G-SQL [33] and GQ-Fast [31] ex- pendixA). ploreusinggraphprocessingenginestoexecuteSQLquerieseffi- Recentworkonworst-caseoptimaljoins[36,45]showshowto ciently.However,thosesystemsdonotconsidertheproblemofex- avoidlargeintermediateresultsduringexecutionofmulti-wayjoin tractingdifferentimplicitgraphsfromexistingrelationaldatasets, queries; we plan to integrate those techniques into our system as and rather choose a relational schema for storing a given graph we generalize our work to allow cyclic extraction queries. How- dataset. Further, those systems can typically only execute tasks ever, fortheclassofqueriesconsideredinthispaper(i.e., where (graphorXMLqueries)thatcanbemappedtoSQL;whileGRAPH- the edges are generated using a union of acyclic queries), those GENpushessomecomputationtotherelationalengine,mostofthe techniquesdonotprovideanybenefitsovertheclassicYannakakis complexgraphalgorithmsareexecutedonagraphrepresentationof algorithm[41]. Ourchallengeisthatthefinalqueryresultitselfis thedatainmemorythroughafull-fledgednativegraphAPI.This toolarge. makes GRAPHGEN suitableforcomplexanalysistaskslikecom- munitydetection,densesubgraphdetection/matching,etc.,which GraphCompression: Therehasbeenmuchworkongraph,RDF, require random and arbitrary access to the graph, and cannot be andXMLcompression,whichcanberoughlyclassifiedinto[34]: efficiently,ifatall,executedusingbasicSQL. (a)succinctrepresentations,wherethegoalistoencodethegraph AsterGraphAnalytics[43]andSAPHANAGraphEnginealso withasfewbitsaspossible[8, 5, 9, 42]; (b)structuralcompres- supportspecifyinggraphswithinanSQLquery,andapplyinggraph sion,wherethegraphstructureisanalyzedandchangedtoreduce algorithmsonthosegraphs. However,theinterfaceforspecifying the size of the graph [19, 11, 6, 13, 27]; and (c) lossy compres- whichgraphstoextractisnotveryintuitiveandlimitsthetypesof sion,whichaimtokeeponlysufficientinformationtoanswerspe- graphsthatcanbeextracted.Asteronlysupportsthevertex-centric cific classes of queries [18]. Our approach is complementary to APIforwritinggraphalgorithms.Ourtechniquescouldbeusedto the work on succinct representations and lossy compression, and reducethegraphmemoryfootprintsinthosesystemsaswell. canbeseenasaformofstructuralcompression. Perhapsthemost Ringo[38]hassomewhatsimilargoalstoGRAPHGENandpro- closelyrelatedworkisVMiner[11],whichidentifiesandexploits videsoperatorsforconvertingfromanin-memoryrelationaltable bi-cliquesintheinputgraphtocompressitlosslessly(seeSection representationtoagraphrepresentation. Ithoweverdoesnotcon- 6.1.1formoredetails).Inarecentwork,ManethandPeternek[35] siderexpensivelarge-outputjoinsthatareoftennecessaryforgraph presentatechniquetocompressagraphthroughdetectingrepeat- extraction,orthealternatein-memoryrepresentationoptimizations ing substructures; however, as with many of the graph or XML thatwediscusshere;itinsteadassumesapowerfullarge-memory compressiontechniques,onlycertaintypesofqueriescanbeexe- cutedagainstthecompressedrepresentation. Themaindifference Front End NetworkX / Other fromanyofthatworkandourworkisthat:thosetechniquesrequire Web App Graph Libraries ustofirstexpandthegraphbeforecompressingit,i.e.,theycannot Datalog Query/ In-memory Datalog In-memory Datalog Serialized operateontheimplicitrepresentationofthegraphintherelational VPerortgerxa mCentric Graph Query Graph Query Graph database;ourapproachaimstoavoidtheexpansionstepitself.Our approachisbetterabletoutilizethestructureinthedata,whichthe VeFrrtaemx eCweonrtkric Graph API PythSoenri aAliPzIa/ tGiornaph expansion will remove (as our experimental results comparing to VMinershow).Further,wealsosupportarbitrarygraphoperations on the compressed representations; a necessity for a general-pur- posegraphengine. Condensed Graph Standard Graph AninitialprototypeoftheGRAPHGENsystemwasrecentlydemon- Pre-processing, strated[46],wheretheprimaryfocuswasonautomaticallypropos- Optimization, and ingandextractinghiddengraphsgivenarelationalschema. This Translation to SQL Graph Generation paper provides an in-depth description of the techniques and al- Analysis Cardinali- Final SQL gorithmsforefficientgraphextractionaswellasacomprehensive Queries ties Queries Query experimentalevaluationofthetrade-offstherein. Results Relational Database 3. SYSTEMOVERVIEW (PostgreSQL) WebeginwithabriefdescriptionofthekeycomponentsofGRAPH- Figure3: GRAPHGENOverview GEN,andhowdataflowsthroughthem.WethensketchourDatalog- basedDSLforspecifyinggraphextractionjobs,andAPIsprovided totheusersafteragraphhasbeenloadedintomemory. 3.1 SystemArchitecture ingandaggregationconstructs; inessence, ourDSLallowsusers TheinnerworkingsofGRAPHGENandthecomponentsthator- tointuitivelyandsuccinctlyspecifynodesandedgesofthetarget chestrateitsfunctionalityaredemonstratedinFigure3.Thecorner- graphasviewsovertheunderlyingdatabasetables. Wenotethat stoneofthesystemisanabstractionlayerthatsitsatopanRDBMS, our goal is not to specify a graph algorithm itself using Datalog acceptsagraphextractiontask,andconstructsthequeriedgraph(s) (like Socialite [39]); however we do plan to explore this avenue in memory, which can then be analyzed by a user program. The infuturework,enablinguserstospecifygraphqueriesoranalysis graphextractiontaskisexpressedusingaDatalog-likeDSL,where tasksusingDatalogtogetherwiththegraphextractionquery. theuserspecifieshowtoconstructthenodesandtheedgesofthe OurDSLusestwospecialkeywords: graph(inessence,asviewsovertheunderlyingtables). Thisspec- NodesandEdges. Agraphspecificationcomprisesofatleast ification is parsed by a custom parser, which then analyzes the oneNodes(ID, ...)statement,followedbyatleastoneEdges(ID1, selectivities of the joins required to construct the graph by using ID2, ...)statement.ThefirstattributeofNodesandfirsttwo thestatisticsinthesystemcatalog. Thisanalysisisusedtodecide attributesofEdgesareassumedtodenoteuniqueIDs. Anyad- whethertohandoverthepartialorcompleteedgecreationtaskto ditionalattributesareconvertedintoproperties(e.g.,in[Q2],the thedatabase,ortoskipsomeofthejoinsandloadtheimplicitedges nodesofthegraphwillhaveapropertyName). Usersmayspec- inmemoryinacondensedrepresentation(Section4.2). ifymultipleNodesandEdgesstatementsinordertoextract,say, The system then builds one or more batches of SQL queries, heterogeneousgraphswithmultipledifferenttypesofverticesand where each batch defines one or more of the graphs that need to edges. be extracted. We aim to ensure that the total size of the graphs Figure4demonstratesseveralextractionqueries. Anextraction constructedinasinglebatch(inourmemory-efficientrepresenta- taskcancontainanynumberofjoins;e.g. [Q1]inFigure1,only tion)islessthanthetotalamountofmemoryavailable,sothatthe requiresasinglejoin(inthiscaseaself-joinontheAuthorPub graphs can be analyzed in memory. The queries are executed in table),while[Q2]asshowninFigure5awouldrequireatotalof sequence,andtheoutputgraphobject(s)is(are)handedtotheuser 3 joins, some of which (in this case Orders(order_key1, ID1) (cid:49) program. Amajorfocusofthisworkistoenableanalysisonvery LineItem(order_key1,part_key),andOrders(order_key2,ID2)(cid:49) largegraphsthatwouldtypicallynotfitinmemory. LineItem(order_key2,part_key))willbehandedofftothedatabase After extraction, users can (a) operate directly upon any por- sincetheyaresmall-outputkey-foreignkeyjoins. Theextraction tion of the graph using the Java Graph API, (b) define and run query[Q3]extractsabi-partite(heterogeneous)directedgraphbe- multi-threaded vertex-centric programs on it, (c) visually explore tweeninstructorsandstudentswhotooktheircourses(Figure5b). the graph through our front-end web application, or (d) serialize The GRAPHGEN DSLnaturallygeneratesdirectedgraphs, and thegraphontodisk(initsexpandedrepresentation)inastandard- undirected graphs are representedusing bidirectionaledges. The izedformat,sothatitcanbefurtheranalyzedusinganyspecialized typicalworkflowforauserwhenwritinganextractionquerywould graphprocessingframeworkorgraphlibrary(e.g.,NetworkX). betoinitiallyinspectthedatabaseschema, figureoutwhichrela- tionsarerelevanttothegraphtheyareinterestedinexploring,and 3.2 Datalog-BasedDSL thenchoosewhichattributesinthoserelationswouldconnectthe Dataloghasbeenincreasinglyusedforexpressingdataanalytics definedentitiesinthedesiredway. Wegenerallyassumethatthe workflows, and especially graph analysis tasks [21, 39, 20]. The userisknowledgeableaboutthedifferententitiesandrelationships mainreasonforitsemergenceliesinitselegancefornaturallyex- existentinthedatabase,andisabletoformulatesuchqueries. We pressingrecursivequeries,butalsoinitsoverallintuitiveandsim- havebuiltavisualizationtoolthatallowsuserstodiscoverandex- plesyntaxchoices. Ourdomainspecificlanguage(DSL)isbased tractpotentialgraphsinaninteractivemanner[46],howeverthatis onalimitednon-recursivesubsetofDatalog,augmentedwithloop- notthefocusofthispaper. [Q2] Nodes(ID, Name) :- Customer(ID, Name). Edges(ID1, ID2) :- Orders(order_key1, ID1),LineItem( order_key1, part_key), Orders(order_key2, ID2), LineItem(order_key2,part_key). [Q3] Nodes(ID, Name) :- Instructor(ID, Name). Nodes(ID, Name) :- Student(ID, Name). Edges(ID1, ID2) :- TaugNhotdCeos(uIrDs,e (NaImDe1),:-cCuosutrosmeerI(dI)D,, Name). Edges(ID1, ID2):-Orders(orderkey1, TookCourse(ID2, courseId) ID1), LineItem(orderId1, part_key), Figure4:GraphExtractionQuerOyrdEerxsa(morpdleersId(2c,f .IFDi2g)., 1for[Q1]) LineItem(orderId2, part_key). 3.3 ParsingandTranNosdleas(tIiDo, nName):-Instructor(ID, Name). Nodes(ID, Name):-Student(ID, Name). ThefirststepistoparsetheEdDgeasta(IloD1g, qIuDe2r)y:,-TaanudghctrCeoaurtesea(IsDe1t, of p1 SQLqueriestoexecuteagainstcothuresdeIadta),b aTsoeo.kCTohuerssep(eIDc2ifi, ccsoudrespeeInd)d. o1 o1 c1 onthenatureoftheEdgesstatement(s). p2 Case 1: Each of the Edges statements corresponds to an acyclic, o2 o2 c2 aggregation-freequery.Inthatcase,wemayloadacondensedrep- p3 resentationofthegraphintomemory(Section4.2). o3 o3 c3 Case2:AtleastoneEdgesstatementviolatestheabovecondition, in which case we create a single SQL statement to construct the edgesandexecuteittoloadtheexpandedgraphintomemory. TherestofthispaperfocusesonCase1,whichcorrespondsto (a) Execution of [Q2] (b) Execution of [Q3] the edges being constructed using a union of acyclic conjunctive Figure5:Extractionexamples:(a)Multi-layeredcondensedrepre- queries,andcoversmanynaturalgraphextractiontasks,including sentation,(b)extractingaheterogeneousbipartitegraph(weonly alltheexamplesdiscussedsofar.Evenforthisclassofqueries,ex- listtheschemasforsomeofthetables,andomittuplesforclarity) tractingandoperatinguponthegraphinacondensedformiscom- putationallychallenging. Aswediscussedearlier,therecentwork onfactorizationofqueryresults[37]alsohasthesamegoal;however, because of the projections in graph extraction queries, their • existsEdge(v, u):Returnstrueifthereisanedgebe- techniques will expand the graph (Appendix A). In future work, tweenthetwovertices. weplantogeneralizeourideastootherclassesofqueriesinCase • addEdge(v, u), deleteEdge(v, u), addVer- 2,bydrawinguponthatpriorworkaswellastherecentworkon tex(v), deleteVertex(v): Theseallowformanip- worst-caseoptimaljoins; althoughthelatterlineofworkenables ulatingthegraphsbyaddingorremovingedgesorvertices. onetoavoidgeneratingintermediateresults,ourfocusisonavoid- TheVertexclassalsosupportssettingorretrievingpropertiesas- inggeneratingthefinalresultitself. sociatedwithavertex. 3.4 AnalyzingtheExtractedGraphs Vertex-centricAPI:Thevertex-centricconceptualmodelhasbeen extensivelyusedinthepasttoexpresscomplexgraphalgorithmsby The most efficient means to utilize GRAPHGEN is to directly followingthe“think-like-a-vertex”methodologyindesigningthese operate on the graph either using our native Java Graph API, or algorithms.Wehaveimplementedasimple,multi-threadedvariant throughavertex-centricAPIthatweprovide. Bothofthesehave ofthevertex-centricframeworkinGRAPHGENthatallowsusersto beenimplementedtooperateonallthein-memory(condensedor implementaCOMPUTEfunctionandthenexecutethatagainstthe otherwise)representationsthatwepresentinSection4. extracted graph regardless of its in-memory representation. The BasicDataStructure: The basic data structure that we use for framework is based on a VertexCentric object which coor- storing the graphs is a variant of traditional Compressed Sparse dinatesthemulti-threadedexecutionofthecompute()function Row(CSR)representation[12]. Briefly,foreachnode,wemain- foreachjob. Thecoordinatorobjectsplitsthegraph’snodesinto tain two mutable ArrayLists for its in-coming and out-going chunksdependingonthenumberofcoresinthemachine,anddis- edges.WeuseJavaArrayListsinsteadoflinkedlistsforspace tributestheloadevenlyacrossallcores. Italsokeepstrackofthe efficiency; however, that makes vertex deletions more expensive current superstep, monitors the execution and triggers a termina- because those require rebuilding of the entire index of vertices. tion event when all vertices have voted to a halt. Users simply Wethereforeimplementalazydeletionmechanismwherevertices needtoimplementtheExecutorinterfacewhichcontainsasin- areinitiallyonlyremovedfromtheindex,thuslogicallyremoving gle method definition for compute(), instantiate their executor themfromthegraph,andarethenphysicallyremovedfromthever- andcalltherun()methodoftheVertexCentriccoordinatorobject ticeslist,inbatch,atalaterpointintime. Thiswayonlyasingle withtheExecutorobjectasinput. Theimplementationofmessage re-buildingoftheverticesindexisrequiredafterabatchremoval. passingwe’veadoptedissimilartothegather-apply-scatter(GAS) JavaAPI:Allofourin-memoryrepresentationsimplementasim- modelusedinGraphLab[32]inwhichnodescommunicatebydi- plegraphAPI,consistingofthefollowing7operations: rectlyaccessingtheirneighbors’data,thusavoidingtheoverhead • getVertices(): This function returns an iterator over ofexplicitlystoringmessagesinsomeintermediarydatastructure. alltheverticesinthegraph. ExternalLibraries: GRAPHGEN can also be used through a li- • getNeighbors(v): Foravertexv,thisfunctionreturns brarycalledgraphgenpy4,aPythonwrapperoverGRAPHGENal- aniteratorovertheneighborsofv,whichitselfsupportsthe lowing users to run queries in our DSL through simple Python standard hasNext() and next() functions. If a list of scripts and serialize the resulting graphs in a standard graph for- neighbors is desired (rather than an iterator), it can be re- trievedusinggetNeighbors(v).toList. 4http://konstantinosx.github.io/graphgen-project/ mat,thusopeningupanalysistoanygraphcomputationframework ingandoutgoingedges). Theremaybeself-edgesintheextracted or library (Section 6.4). A similar workflow was used in the im- graph(e.g.,c1 →c1 inFigure5a);however,sinceourextraction s t plementationofourfront-endwebapplication[46]throughwhich queriesareacyclic,G itselfisalwaysadirectedacyclicgraph. C users can visually explore the graphs that exist within their rela- DuplicationProblem:AlthoughtheC-DUPrepresentationiseasy tionalschema. toconstruct,itallowsformultiplepathsbetweenu andv ,since s t that’sthenaturaloutputoftheextractionprocessbelow.Anygraph 4. IN-MEMORY REPRESENTATION AND algorithm whose correctness depends solely on the connectivity TASKEXECUTION structureofthegraph(i.e.,“duplicate-insensitive”algorithms),can beexecuteddirectlyontopofthisrepresentation,withapotential The key efficiency challenge with extracting graphs from rela- for speedup (e.g., connected components or breadth-first search); tionaldatabasesisthat:inmostcases,queriesforextractingexplicit thenotionofrepresentation-independentgraphanalyticsfromre- relationships(i.e.,edges)betweenentitiesfromrelationaldatasets centworkcouldbeusedtofurtherincreasetheapplicabilityofthe (i.e., nodes) requires expensive non-key (large-output) joins. Be- C-DUPrepresentation[22,14]. However,thisduplicationcauses causeofthis,theextractedgraphmaybemuchlargerthantheinput correctnessissuesonallnonduplicate-insensitivegraphalgorithms. sizeitself(cf. Table1). Instead,weproposemaintainingandop- Theduplicationproblementailsthatprogrammatically,wheneach eratingupontheextractedgraphinacondensedfashion.Webegin real node tries to iterate over its neighbors, passing through its withdescribinganovelcondensedrepresentationthatweuseand obligatory virtual neighbors, it may encounter the same neighbor discusswhyitisideallysuitedforthispurpose;wealsodiscussthe morethanonce; thisindicatesaduplicateedge. Thesetofalgo- duplicationissuewiththisrepresentation. Wethenpresentagen- rithms we propose in Section 5 are geared towards dealing with eralalgorithmforconstructingsuchacondensedrepresentationfor thisduplicationproblem. agraphextractionqueryoveranarbitraryschema,thatguarantees thatthecondensedrepresentationrequiresatmostasmuchmem- Single-layervsMulti-layerCondensedGraphs:Acondensed oryasloadingalltheunderlyingtablesintheworstcase.Thecon- graph may have one or more layers of virtual nodes (formally, a densedrepresentationandthealgorithmcanbothhandleanynon- condensedgraphiscalledmulti-layerifitcontainsadirectedpath recursive,acyclicgraphextractionqueryoveranarbitraryschema. oflength>2). Inthemajorityofcases,mostofthejoinsinvolved We then propose a series of in-memory variations of the basic inextractingthesegraphswillbesimplekey-foreignkeyjoins,and condensedrepresentationthathandletheduplicationissuethrough large-outputjoins(whichrequireuseofvirtualnodes)occurrela- uniquelycharacterizedapproaches.Theserepresentationsareprod- tivelyrarely.Althoughoursystemcanhandlearbitrarymulti-layer uctsofasingle-runpreprocessingphaseontopofthecondensed graphs,wealsodevelopspecialalgorithmsforthecommoncaseof representationusinganarrayofalgorithmsdescribedinSection5. single-layercondensedgraphs. We also expand on the natural trade-offs associated with storing 4.2 ExtractingaCondensedGraph andoperatingoneachin-memoryrepresentation. Thekeyideabehindconstructingacondensedgraphistopost- 4.1 CondensedRepresentation&Duplication ponecertainjoins. Herewebrieflysketchouralgorithmformak- Theideaofcompressinggraphsthroughidentifyingspecifictypes ingthosedecisions,extractingthegraph,andpostprocessingitto ofstructureshasbeenaroundforalongtime[19,11].Asdiscussed reduceitssize. inSection2,thosepriortechniquesarenotdirectlyapplicablehere Step1: First,wetranslatetheNodesstatementsintoSQLqueries, since they require the input graph to be in an expanded form be- andexecutethoseagainstthedatabasetoloadthenodesinmemory. fore compressing. Instead, we propose a novel condensed repre- Inthefollowingdiscussion,weassumethatforeverynodeu,we sentation, calledC-DUP,thatiseffectivelyfreetoconstructfrom have two copies u (source) and u (target); physically we only s t thedatabaseandrequireslessmemorytoload. Givenagraphex- storeonecopy. tractionquery,letG(V,E)denotetheoutputexpandedgraph;for Step2: WeconsidereachEdgesstatementinturn. Recallthatthe clarityofexposition,weassumethatGisadirectedgraph.Sincein outputofeachsuchstatementisasetof2-tuples(correspondingto theC-DUPrepresentationthereareonlyedgesbetweenrealnodes asetofedgesbetweenrealnodes),andfurtherthatweassumethe andvirtualnodes,wedenoteverticeswithsubscript_s(source)to statementisacyclicandaggregation-free(cf. Section3.3,Case1). describetheverticesthathaveout-edgestovirtualnodes,andver- Withoutlossofgenerality,wecanrepresentthestatementas: ticeswithsubscript_ (target)forverticesthathavein-edgesfrom t Edges(ID1,ID2):−R (ID1,a ),R (a ,a ),...,R (a ,ID2) virtualnodes. WesayG (V(cid:48),E(cid:48))isanequivalentC-DUPrepre- 1 1 2 1 2 n n−1 C (two different relations, R and R , may correspond to the same sentationifandonlyif: i j databasetable). Generalizationstoallowmulti-attributejoinsand (1)foreverynodeu ∈ V,therearetwonodesu ,u ∈ V(cid:48) –the s t selectionpredicatesarestraightforward. remainingnodesinV(cid:48)arecalledvirtualnodes; ForeachjoinR (a ,a )(cid:49) R (a ,a ),weretrievethe i i−1 i ai i+1 i i+1 (2)GC isadirectedacylicgraph,i.e.,ithasnodirectedcycle; number of distinct values, d, for ai (the join attribute) from the (3)inG ,therearenoincomingedgestou ∀u ∈ V andnoout- systemcatalog(e.g.,n_distinctattributeinthepg_statsta- C s goingedgesfromu ∀u∈V; bleinPostgreSQL).If|R ||R |/d>2(|R |+|R |),thenwe t i i+1 i i+1 (4)foreveryedge(cid:104)u→v(cid:105)∈E,thereisatleastonedirectedpath consider this a large-output join and mark it so (this formula as- fromu tov inG . sumesthatthejoinattributeisuniformlydistributedandmaymiss s t C Figure5showstwoexamplesofsuchcondensedgraphs. Inthe alarge-outputjoinandcouldbeeasilysubstitutedwithamoreso- second case, where a heterogeneous bipartite graph is being ex- phisticatedselectivityestimator). tracted,therearenooutgoingedgesfroms1 ,s2 ,s3 orincoming Step3:Wethenconsidereachsubsequenceoftherelationswithout s s s edgestoi1 ,i2 ,sincetheoutputgraphitselfonlyhasedgesfrom alarge-outputjoin,constructanSQLquerycorrespondingtoit,and t t i_ nodes to s_ nodes. Although we assume there are two copies executeitagainstthedatabase. Leta,a ,...,a denotethejoin l m u of each real node in G here, the physical representation of G attributeswhicharemarkedaslarge-output. Then,thequerieswe C C only requires one copy (with special-case code to handle incom- executecorrespondto: res1(ID1,al):−R1(ID1,a1),...,Rl(al−1,al), a a a V1 a a W res2(al,am):−Rl+1(al,al+1),...,Rm(am−1,am),...,and b b b b b 2 resk(au,ID2):−Ru+1(au,au+1),...,Rn(an−1,ID2). c V1 c c c c Step4:Foreachjoinattributeattr∈{a,a ,...,a },wecreatea setofvirtualnodescorrespondingtoallplossmiblevaluuesattrtakes. u1 u1 u1 u1 u1 W1 Steodtegapev5xir:t→uFaolryn(.oxdF,eyo:r)ax∈lslro→ethse1yr,.wreFesoar,d(fdoxar,y(dxi)r,e∈yc)ter∈deserkde,gswe,ferwoaemddaadadredaainlrenecdotgdedee uud23 V uud23 uud23 V uud23 uud23 W3 t i i betweentwovirtualnodes:x→y. e e e e e Step6(Preprocessing): Foravirtualnode,letinandoutdenote f f f f f thenumberofincomingandoutgoingedgesrespectively;ifin× (a) C-DUP (24 Edges) (b) DEDUP1 (32 Edges) (c) DEDUP2 (22 Edges) out ≤ (in+out+1),we“expand”thisnode,i.e.,weremoveit Figure6: TheresultinggraphaftertheadditionofvirtualnodeV. andadddirectededgesfromitsin-neighborstoitsout-neighbors. (c)showstheresultinggraphforifweaddededgesbetweenvirtual Thispreprocessingstepcanhaveasignificantimpactonmemory nodes(weomit_ and_ subscriptssincetheyareclearfromthe consumption. We have implemented a multi-threaded version of s t context). thistoexploitmulti-coremachines,whichresultedinseveralnon- trivialconcurrencyissues. Weomitadetaileddiscussionforlack ofspace.Finally,thesystemalsocomputesthenumberofedgesin graphalgorithmsthataccessasmallfractionofthegraph(e.g.,if theexpandedgraph(thiscanbecomputedforfreeasaside-effect wewerelookingforinformationaboutasmallnumberofspecific ofallofourdeduplicationalgorithms),andexpandsthegraphifthe nodes). Ontheotherhand,duetotherequiredhashcomputations increaseinsizeissmall. ateverycall,theexecutionpenaltyforthisrepresentationishigh, If the query contains multiple Edges statements, the final con- especiallyformulti-layergraphs;italsosuffersfrommemoryand structed graph would be the union of the graphs constructed for garbagecollectionbottlenecksforalgorithmsthatrequireprocess- eachofthem.Itiseasytoshowthattheconstructedgraphsatisfies ingallthenodesinthegraph. OperationslikedeleteEdge() alltherequiredpropertieslistedabove,thatitisequivalenttothe arealsoquiteinvolvedinthisrepresentation,asdeletionofalogi- outputgraph,anditoccupiesnomorememorythanloadingallthe caledgemayrequirenon-trivialmodificationstothevirtualnodes. inputtablesintomemory. EXP:FullyExpandedGraph:Ontheotherendofthespectrum, IntheexampleshowninFigure5a,thegraphspecifiedinquery [Q2]thatisextractedassumesthatallthreeofthejoinsinvolved wecanchoosetoexpandthegraphinmemory,i.e.,createalldirect edgesbetweenalltherealnodesinthegraphandremovethevirtual portraylowselectivity,andsowechoosenottohandanyofthem nodes. The expanded graph typically has a much larger memory tothedatabase,butextractthecondensedrepresentationbyinstead footprintthantheotherrepresentationsduetothelargenumberof projectingthetablesinmemoryandcreatingintermediatevirtual edges.Itisnevertheless,naturally,themostefficientrepresentation nodesforeachuniquevalueofeachjoincondition. foroperatingon,sinceiterationonlyrequiresasequentialscanover 4.3 In-MemoryRepresentations one’s direct neighbors. The expanded graph is the baseline that weusetocomparetheperformanceofallotherrepresentationsin Next, we propose a series of in-memory graph representations termsoftradingoffmemorywithoperationalcomplexity. thatcanbeutilizedtostorethecondensedrepresentationmentioned above, in its deduplicated state. Here we discuss the representa- DEDUP-1:CondensedDeduplicatedRepresentation:Thisrep- tionformatsandtheirkeyproperties,andwithaspecificfocuson resentationformatisidenticaltoC-DUPinitsuseofvirtualnodes, theimplementationofthegetNeighbors()iterator,whichunderlies withthemajordifferencebeingthatitdoesnotsufferfromdupli- most graph algorithms. We note that, in some cases, we use the catepaths, and thusdoes notrequire theon-the-fly deduplication sametermtobothdenoteanin-memoryrepresentation,aswellas usedinC-DUP(i.e.,getNeighbors()doesnotneedtousethehash- thealgorithmforconstructingthatrepresentation;e.g.,wediscuss set).Thisrepresentationtypicallysitsinthemiddleofthespectrum theDEDUP-1representationbelowandoutlineitskeyproperties betweenEXPandC-DUPintermsofbothmemoryefficiencyand (e.g.,itdoesnotsufferfromduplication),andwediscussseveralal- iterationperformance;itusuallyresultsinalargernumberofedges gorithmsforconstructingtheDEDUP-1representationinthenext thanC-DUP,buthasreducedoverheadofneighboriteration. The section. We note that our representations primarily explore dif- trade-offsherealsoincludetheone-timecostofremovingduplica- ferentwaystodostructuralcompression, andcouldbecombined tion;deduplicatingagraphwhileminimizingthenumberofedges withothergraphcompressionapproaches[34]tofurtherreducethe addedcanbeshowntobeNP-Hard. Unliketheotherrepresenta- memoryfootprint. tionsdiscussedbelow,thisrepresentationmaintainsthesimplicity ofC-DUPandcaneasilybeserializedandusedbyothersystems C-DUP:CondensedDuplicatedRepresentation:Thisistherep- whichneedtosimplyimplementaproperiterator. resentation that we initially extract from the relational database, whichsuffersfromtheedgeduplicationproblem. Wecanutilize BITMAP:DeduplicationusingBitmaps:Thisrepresentationre- thisrepresentationas-isbyemployinganaivesolutiontodedupli- sultsfromapplyingadifferentkindofpreprocessingbasedonmain- cation,i.e.,bydoingdeduplicationontheflyasalgorithmsarebe- taining bitmaps, for filtering out duplicate paths between nodes. ingexecuted. Specifically,whenwecallgetNeighbors(u),itstarts Specifically,avirtualnodeV maybeassociatedwithasetofbitmaps, a depth-first traversal from u and returns all the real nodes (_ indexed by the IDs of the real nodes; the size of each bitmap is s t nodes)reachablefromu ; italsokeepstrackofwhichneighbors equaltothenumberofoutgoingedgesfromV. Consideradepth- s have already been seen (in a hashset) and skips over them if the firsttraversalstartingatu thatreachesV.Wechecktoseeifthere s neighborisseenagain. is a bitmap corresponding to u ; if not, we traverse each of the s Thisistypicallythemostmemory-efficientrepresentation,does outgoingedgesinturn. However,ifthereisindeedabitmapcor- not require any preprocessing overhead, and is a good option for responding to u , then we consult the bitmap to decide which of s A A0 B1 C0 D1 describedthere. Wealsodescribetheruntimecomplexityforeach a1 x11 x12 aa12 a111 A CDB 101 011 101 011 A algorithminwhichwerefertonr asthenumberofrealnodes,nv a1 a1 y11 aa23 11 11 a3 1 a1 V asthenumberofvirtualnodes,kasthenumberoflayersofvirtual 2 x1 y1 x1 B B nodes,anddasthemaximumdegreeofanynode(i.e.,themaxi- mumnumberofoutgoingedges). a2 aa23 y111 y002 aaa123 a1112 a1113 a2 C V C 5.1 PreprocessingforBITMAP 1 a3 x2 y2 x2 a3 D A A0 B0 C1 D Recallthatthegoalofthepreprocessingphasehereistoasso- B 0 0 0 ciateandinitializebitmapswiththevirtualnodestoavoidvisiting C 1 0 0 (a) Multi-layer Graph (b) Single-layer Graph thesamerealnodetwicewheniteratingovertheout-neighborsof agivenrealnode. Webeginwithpresentingasimple,naivealgo- Figure7: UsingBITMAPstohandleduplication;thedottededges rithm for setting the bitmaps; we then analyze the complexity of (correspondingtocolumnsoredgeswithall0s)areremoved. doingsooptimallyandpresentasetcover-basedgreedyalgorithm. the outgoing edges to skip (i.e., if the corresponding bit is set to 5.1.1 BITMAP-1Algorithm 1, wetraversetheedge). Inotherwords, thebitmapsareusedto Thisalgorithmonlyassociatesbitmapswiththevirtualnodesin eliminatethepossibilityofreachingthesameneighbortwice. thepenultimatelayer,i.e.,withthevirtualnodesthathaveoutgo- The main drawback of this representation is the memory over- ing edges to _ nodes. We iterate over all the real nodes in turn. head and complexity of storing these bitmaps, which also makes t For each such nodeu, we initiate a depth-firsttraversal from u , this representation less portable to systems outside GRAPHGEN. s keepingtrackofalltherealnodesvisitedduringtheprocessusing Thepreprocessingrequiredtosetthesebitmapscanalsobequite ahashset,H . ForeachvirtualnodeV visited,wecheckifitisin involvedaswediscussinthenextsection. u thepenultimatelayer;ifyes,weaddabitmapofsizeequaltothe DEDUP-2:OptimizationforSingle-layerSymmetricGraphs: numberofoutgoingedgesfromV. Then,foreachoutgoingedge Thisoptimizedrepresentationcansignificantlyreducethememory V →v ,wecheckifv ∈H . Ifso,wesetthecorrespondingbit t t u requirementsfordensegraphs,forthespecialcaseofasingle-layer, to0;else,wesetitto1andaddv toH . t u symmetriccondensedgraph(i.e.,(cid:104)u → v (cid:105) =⇒ (cid:104)v → u (cid:105)); s t s t Thisistheleastcomputationallycomplexofallthealgorithms, manygraphssatisfytheseconditions. Insuchacase,foravirtual andinpracticethefastestalgorithm.Itmaintainsthesamenumber nodeV,ifu → V,thenV → u ,andwecanomitthe_ nodes s t t ofedgesasC-DUP,whileaddingtheoverheadofthebitmapsand andassociatededges. Figure6illustratesanexampleofthesame theappropriateindexesassociatedwiththemforeachvirtualnode. graphifweweretouseallthreededuplicationrepresentations. In The traversal order in which we process each real node does not C-DUP,wehavetwovirtualnodesV andV , thatarebothcon- 1 2 matterheresincetheendresultwillalwayshavethesamenumber nected to a large number of real nodes. The optimal DEDUP-1 of edges as C-DUP. Changing the processing order only changes representation (Figure 6b) results in a substantial increase in the thewaythesetbitsaredistributedamongthebitmaps. numberofedges,becauseofthelargenumberofduplicatepaths. Complexity:Theworst-caseruntimecomplexityofthisalgorithm TheDEDUP-2representation(Figure6c)usesspecialundirected isO(n ∗dk+1).Althoughthismightseemhigh,wenotethatthis edgesbetweenvirtualnodestohandlesuchascenario.Arealnode r isalwayslowerthanthecostofexpandingthegraph. uisconsideredtobeconnectedtoallrealnodesthatitcanreach througheachofitsdirectneighboringvirtualnodesv,aswellasthe 5.1.2 FormalAnalysis virtualnodesdirectlyconnectedtov(i.e.1hopaway);e.g.,nodea Theabovealgorithm,whilesimple,tendstoinitializeandmain- isconnectedtobandcthroughW ,andtou ,u ,u throughW 2 1 2 3 1 tainalargenumberofbitmaps. Thisleadsustoaskthequestion: (whichisconnectedtoW ),butnottod,e,f (sinceW isnotcon- 2 3 howcanweachievetherequireddeduplicationwhileusingthemin- nectedtoW ).Thisrepresentationisrequiredtobeduplicate-free, 2 imumnumberofbitmaps(orminimumtotalnumberofbits)?This i.e., there can be at most one such path between a pair of nodes. seeminglysimpleproblemunfortunatelyturnsouttobeNP-Hard, The DEDUP-2 representation here requires 11 undirected edges, evenforsingle-layergraphs.Inasingle-layercondensedgraph,let whichisjustbelowthespacerequirementsforC-DUP.However, u denote a real node, with edges to virtual nodes V ,...,V , and fordensegraphs,thebenefitscanbesubstantial(Section6). 1 n let O(V ) denote the set of real nodes to which V has outgoing GeneratingagoodDEDUP-2representationforagivenC-DUP 1 1 edges.Then,theproblemofidentifyingaminimumsetofbitmaps graph is much more intricate than generating a DEDUP-1 repre- to maintain is equivalent to finding a set cover where our goal is sentation. Due to space constraints, we omit the algorithm from tofindasubsetofO(V ),...,O(V )thatcoverstheirunion. Un- themainbodyofthepaper,andpresentasketchinAppendixB. 1 n fortunately,thesetcoverproblemisnotonlyNP-Hard,butisalso knowntobehardtoapproximate. 5. PREPROCESSING&DEDUPLICATION 5.1.3 BITMAP-2Algorithm In this section, we discuss a series of preprocessing and deduplication algorithms we have developed for constructing the dif- This algorithm is based on the standard greedy algorithm for ferentin-memoryrepresentationsforagivenquery. Theinputto set cover, which is known to achieve the best approximation ra- allofthesealgorithmsistheC-DUPrepresentation,thathasbeen tio(O(logn))fortheproblem. Wedescribeitusingtheterminol- extracted and instantiated in memory. We first present a general ogyaboveforsingle-layercondensedgraphs. Thealgorithmstarts preprocessingalgorithmfortheBITMAPrepresentationformulti- bypickingthevirtualnodeV withthelargest|O(V )|. Itaddsa i i layercondensedgraphs. Wethendiscussaseriesofoptimizations bitmapforutoV ,andsetsittoall1s;allnodesinO(V )arenow i i for single-layer condensed graphs, including deduplication algo- consideredtobecovered.ItthenidentifiesthevirtualnodeV with j rithmsthatattempttoeliminateduplication(i.e.,achieveDEDUP-1 thelargest|O(V )−O(V )|,i.e.,thevirtualnodethatconnectsto j i representation). IncontrastwithSection4.3,herewefocusonde- largestnumberofnodesthatremaintobecovered.Itaddsabitmap scribingalgorithmsforhowtogeneratetherepresentationsthatwe foru toV andsetsitappropriately. Itrepeatstheprocessuntil s j densed graphs is the same as the original problem considered by Feder and Motwani [19], which focuses on the reverse problem of compressing cliques that exist in the expanded graph, by find- ingcliquesandconnectingallverticesinthesamecliquetoavir- tualnodeandremovingtheedgesamongthem. Althoughtheex- panded graph is usually very large, it is still only O(n2), so the NP-Hardnessofthededuplicationproblemisthesame. However, thosealgorithmspresentedin[19]arenotapplicableherebecause theinputrepresentationisdifferent,andexpansionisnotanoption. Wepresentfouralgorithmsforthisproblem. Inthedescriptionbelow, foravirtualnodeV, weuseI(V)to (a) 44 Edges (b) 34 Edges denotethesetofrealnodesthatpointtoV, andO(V)todenote N N N N N N N N therealnodesthatV pointsto. N N NaiveVirtualNodesFirst:Thisalgorithmdeduplicatesthegraph Figure8: Deduplicatingu usingthe“real-nodesfirst”algorithm, onevirtualnodeatatime. Westartwithagraphcontainingonly 1 resultingtoanequivalentgraphwithasmallernumberofedges therealnodesandnovirtualnodes,whichistriviallyduplication- free. Wethenaddthevirtualnodesoneatatime,alwaysensuring thatthepartialgraphremainsfreeofanyduplication. When adding a virtual node V: we first collect all of the vir- allthenodesthatarereachablefromu havebeencovered.Forthe s tualnodesR suchthatI(V)∩I(R ) (cid:54)= φ; thesearethevirtual remainingvirtualnodes(ifany),theedgesfromu tothosenodes i i s nodes that the real nodes in I(V) point to. Let this set be R. A aresimplydeletedsincethereisnoreasontotraversethose. processedsetisalsomaintainedwhichkeepstrackofthevirtual Wegeneralizethisbasicalgorithmtomulti-layercondensedgraphs nodesthathavebeenaddedtothecurrentpartialgraph. Forevery byapplyingthesameprincipleateachlayer.LetV1,...,V1denote 1 n virtualnodeR ∈ R∩processed, if|O(V)∩O(R )| > 1, we thesetofvirtualnodespointedtobyu . LetN(u )denoteallthe i i s s modifythevirtualnodestohandletheduplicationbeforeaddingV real_ nodesreachablefromu .ForeachV1,wecounthowmany t s i tothepartialgraph(IfthereisnosuchR ,wearedone).Weselect ofthenodesinN(u )arereachablefromV1,andexplorethevir- i s i arealnoder∈O(V)∩O(R )atrandom,andchoosetoeitherre- tualnodewiththehighestsuchcountfirst.Atthepenultimatelayer, i movetheedge(V →r)or(R →r),dependingonthein-degrees thealgorithmreducestothesingle-layeralgorithmdescribedabove i of the two virtual nodes. The intuition here is that, by removing andappropriatelysetsthebitmaps. Weconsistentlykeeptrackof theedgefromthelower-degreevirtualnode,wehavetoaddfewer how many of the nodes in N(u ) have been covered so far, and s directedgestocompensateforremovaloftheedge. Supposewe use that for making the decisions about which bits to set. So af- removetheformer(V → r)edge. Wethenadddirectedgestor terbitmapshavebeensetforallvirtualnodesreachablefromV1, 1 fromalltherealnodesinI(V),whilecheckingtomakesurethatr iftherearestillnodesinN(u )thatneedtobecovered,wepick s isnotalreadyconnectedtothosenodesthroughothervirtualnodes. thevirtualnodeV1 thatreachesthelargestnumberofuncovered i VirtualnodeV isthenaddedtoaprocessedsetandweconsider nodes,andsoon. thenextvirtualnode. It’simportanttonotethathereweneverdeleteanoutgoingedge fromavirtualnode,sinceitmaybeneededforanotherrealnode. Complexity:TheruntimecomplexityisO(nv∗d4). Instead,weusebitmapstostoptraversingdownthosepaths(e.g., NaiveRealNodesFirst: In this approach, we consider each real edgex2 →y2inFigure7). node in the graph at a time, and handle duplication between the Our implementation exploits multi-core parallelism, by creat- virtualnodesitisconnectedto,intheorderinwhichtheyappear ingequal-sizedchunksofthesetofrealnodes,andprocessingthe initsneighborhood.Thisalgorithmhandlesdeduplicationbetween nodesineachchunkinparallel. twovirtualnodesthatoverlapinexactlythesamewayastheone Complexity: Theruntimecomplexityofthisalgorithmissignifi- described above. It differs however in that it entirely handles all cantlyhigherthanBITMAP-1becauseoftheneedtore-compute duplication between a single real node’s virtual neighbors before the number of reachable nodes after each choice, and the worst- movingontoprocessingthenextrealnode. Aseachrealnodeis case complexity could be as high as: O(n ∗d2k). In practice, handled,itsvirtualnodesareaddedtoaprocessedset,andevery r newvirtualnodethatcomesinischeckedforduplicationagainst kisusually1or2,andthealgorithmfinishesreasonablyquickly, therestofthevirtualnodesinthisprocessedset.Thisprocessed especiallygivenourparallelimplementation. setishoweverlimitedtothevirtualneighborhoodoftherealnode 5.2 DeduplicationforDEDUP-1 thatiscurrentlybeingdeduplicated,andisclearedwhenwemove ontothenextrealnode. ThegoalwithdeduplicationistomodifytheinitialC-DUPgraph toreachastatewherethereisatmostoneuniquepathbetweenany Complexity:TheruntimecomplexityisO(nr∗d4). two real nodes in the graph. We describe a series of novel algo- GreedyRealNodesFirstAlgorithm: In this less naive but still rithms for achieving this for single-layer condensed graphs, and greedyapproach,weconsidereachrealnodeinsequence,anddedu- discuss the pros and cons of using each one as well as their ef- plicateitindividually.Figure8showsanexample,thatwewilluse fectivenessintermsofthesizeoftheresultinggraph. Webriefly to illustrate the algorithm. The figure shows a real node u that 1 sketch how these algorithms can be extended to multi-layer con- isconnectedto5virtualnodes,withsignificantduplication,anda densedgraphs;however,weleaveadetailedstudyofdeduplication deduplication of that node. Our goal here is to ensure that there formulti-layergraphstofuturework. are no duplicate edges involving u – we do not try to eliminate 1 allduplicationamongallofu ’svirtualnodeslikeinthenaiveap- 5.2.1 Single-layerCondensedGraphs 1 proach.Thecoreideaofthisalgorithmisthatweconsultaheuristic Thetheoreticalcomplexityofthisproblemforsingle-layercon- todecidewhethertoremoveanedgetoavirtualnodeandaddthe N(V)∩N(V)={u,u} 1 1 2 N(V)∩N(V)={u,u,u} 2 1 4 5 N(V)∩N(V)={u,u} 3 2 5 missingdirectedges,ortokeeptheedgetothevirtualnode. V1 V1 LetV(cid:48)denotethesetofvirtualnodestowhichu1usremainsconV- u1 u1 u1 u1 u tmpnfhrreoaiocnmtbtiewlmdewemizahnefictceetahehrndeudebstedoedittusasophldaloniidswcucdamnotfinbortnoeonerm,bcoaetfuendNeds;dPdVga-uleH(cid:48)sr(cid:48)soiandi,rngedlnetttohuhtesEiesinrtepdghsreeuoanlcostreeitenuuusetd432sgotu.hfscOettvriuudoirrcintrtugeufacorrltoea.nmleodTisdgthheetioeVsssV123 uuu432 VV23 uuu432 uuu432 VV23 uuu432 uuu321 VVV123 exaWctespertecsoevnetraphreoubrliesmtic. inspiredbythestandarduu5greedysetcover u5 u5 u5 u5 u45 heuristicwhichworksasfollows.WeinitializeVu(cid:48)6=∅,andV(cid:48)(cid:48) =V u6 V u6 u6 V u6 u6 V iVn;Nwe(uasl)s,oalnodgicthaullsyEadd=di{re(cutse,dxg)e|xsf∈rom∪Vu∈sVtoO7a(lVli)t1}s4. nedeWgiegsehbthoerns u7 u7 u7 u7 u7 12 edges movevirtualnodesfromV(cid:48)(cid:48) toV(cid:48) oneatatime. Specifically,for (a) 28 Edges (b) 24 Edges eachvirtualnodeV ∈V(cid:48)(cid:48),weconsidermovingittoV(cid:48). LetX = N(V)={u,u,u}N(V)={u,u,u,u} N(V)={u,u,u},N(V)={u,u,u,u} ∪O(V(cid:48))denotethesetofrealnodesthatuisconnectedtothrough ,N(V13)={u12,u25,u37} N(V2)={u11,u24,u45,u56} N(V13)={u12,u25,u37},N(V)2={u1,1u2,4u4,5u5}6 V(cid:48). In order to move V to V(cid:48), we must disconnect V from all , Figure9:DeduplicationusingGreedyVirtualNodesFirst nodes in O(V)∩X – otherwise there would be duplicate edges between u and those nodes. Then, for any a,b ∈ O(V) ∩ X, we check if any other virtual node in V(cid:48)(cid:48) is connected to both a layergraphs.Insinglelayergraphs,identifyingduplicationisrela- and b – if not, we must add the direct edge (a,b). Finally, for tivelystraightforward;fortwovirtualnodesV andV ,ifO(V )∩ 1 2 1 r ∈O(V)−O(V)∩X,weremovealldirectedges(u,r ). i i O(V )(cid:54)=φandI(V )∩I(V )(cid:54)=φ,thenthereisduplication. We ThebenefitofmovingthevirtualnodeV fromV(cid:48)(cid:48)toV(cid:48)iscom- 2 1 2 keep the neighbor lists in sorted order, thus making these checks puted as the reduction in the total number of edges in every sce- veryfast. However,formulti-layercondensedgraphs,weneedto nario. Weselectthevirtualnodewiththehighestbenefit(> 0)to doexpensivedepth-firsttraversalstosimplifyidentifyduplication. movetoV(cid:48). IfnovirtualnodeinV(cid:48)(cid:48)hasbenefit>0,wemoveon WecanadapttheNaiveVirtualNodesFirstalgorithmdescribed tothenextrealnodeandleaveuconnectedtoitsneighborsthrough abovetothemulti-layercaseasfollows. We(conceptually)adda directedges. dummynodestothecondensedgraphandadddirectededgesfrom Complexity:TheruntimecomplexityhereisroughlyO(n ∗d5). stothe_ copiesofalltherealnodes. Wethentraversethegraph r s in a depth-first fashion, and add the virtual nodes encountered to GreedyVirtualNodesFirstAlgorithm:Exactlylikethenaivever- an initially empty graph one-by-one, while ensuring no duplica- sionabove,thisalgorithmdeduplicatesthegraphonevirtualnode tion. However, this algorithm turned out to be infeasible to run at a time, maintaining a deduplicated partial graph at every step. evenonsmallmulti-layergraphs,andwedonotreportanyexperi- We start with a graph containing only the real nodes and no vir- mentalresultsforthatalgorithm. Instead,weproposeusingeither tualnodes,whichistriviallydeduplicated. Wethenaddthevirtual the BITMAP-2 approach for multi-layer graphs, or first convert- nodesoneatatime,alwaysensuringthatthepartialgraphdoesnot ingitintoasingle-layergraphifpossible(throughexpansionofall have any duplication. Let V denote the virtual node under con- virtualnodesinallbutonelayer)andthenusingoneofthealgo- sideration. LetV = {V ,...,V }denoteallthevirtualnodesthat 1 n rithmsdevelopedabove; notethatthelatterapproachshouldonly share at least 2 real nodes with V (i.e., |O(V)∩O(V )| ≥ 2). i beconsiderediftheexpansiondoesnotresultinaspaceexplosion. LetC = O(V)∩O(V ),denotetherealnodestowhichbothV i i andViareconnected. Atleast|Ci|−1ofthoseedgesmustbere- 6. EXPERIMENTALSTUDY movedfromO(V)andO(V )combinedtoensurethatthereisno i duplication. Inthissection,weprovideacomprehensiveexperimentalevalua- The special case of this problem where |C | = 2,∀i, can be tionofGRAPHGENusingseveralreal-worldandsyntheticdatasets. i shown to be equivalent to finding a vertex cover in a graph (we Wefirstpresentadetailedstudyusing4smalldatasets. Wethen omittheproofduetospaceconstraints).Weagainadoptaheuristic comparetheperformanceofthedifferentdeduplicationalgorithms, inspired by the greedy approximation algorithm for vertex cover. andpresentananalysisusingmuchlargerdatasets,butforasmaller Specifically, for each node in C , we compute the cost and the setofrepresentations.Alltheexperimentswererunonasinglema- i benefitofremovingitfromanyO(V )versusfromO(V). The chinewith24coresrunningat2.20GHz,andwith64GBRAM. i cost of removing the node is computed as the number of direct 6.1 SmallDatasets edgesthatneedtobeadded ifweremovetheedgetothatvirtual node,whereasthebenefitiscomputedasthereductioninthetotal Firstwepresentadetailedstudyusing4relatively-smalldatasets. numberofnodesintheintersectionwithV (Σ|C |)(removingthe WeuserepresentativesamplesoftheDBLPandIMDBdatasetsin i i nodefromO(V )alwaysyieldsabenefitof1,whereasremoving ourstudy(Table2), extractingco-authorandco-actorgraphsre- i itfromO(V)mayhaveahigherbenefit). Wethenmakeamore spectively.Wealsogeneratedaseriesofsyntheticgraphssothatwe informeddecisionandchoosetoremoveanedgefromarealnode can better understand the differences between the representations rnthatleadstotheoverallhighestbenefit/costratio. andalgorithmsonawiderangeofpossibledatasets,withvarying numbersofrealnodesandvirtualnodes,andvaryingdegreedistri- Complexity:Theruntimecomplexityhereis:O(n d(n d2+d)). v v butionsanddensities.Sinceweneedthegraphsinacondensedrep- Wenotethat,thesecomplexityboundslistedheremakeworst-case resentation,wecannotuseanyoftheexistingrandomgraphgener- assumptionsandinpractice,mostalgorithmsrunmuchfaster. atorsforthispurpose.Instead,webuiltasyntheticgraphgenerator, whichwesketchinAppendixC.1. 5.2.2 Multi-layerCondensedGraphs 6.1.1 CompressionPerformance deduplicatingmulti-layercondensedgraphsturnsouttobesig- nificantlytrickierandcomputationallymoreexpensivethansingle- Webeginwithcomparingthegraphsizesforthedifferentrep-