Arabesque: A System for Distributed Graph Mining CarlosH.C.Teixeira∗,AlexandreJ.Fonseca,MarcoSerafini, GeorgosSiganos,MohammedJ.Zaki,AshrafAboulnaga QatarComputingResearchInstitute-HBKU,Qatar Abstract 1. Introduction Distributed data processing platforms such as MapReduce Graphdataisubiquitousinmanyfields,fromtheWebtoad- and Pregel have substantially simplified the design and de- vertising and biology, and the analysis of graphs is becom- ploymentofcertainclassesofdistributedgraphanalyticsal- ing increasingly important. The development of algorithms gorithms.However,theseplatformsdonotrepresentagood forgraphanalyticshasspawnedalargeamountofresearch, match for distributed graph mining problems, as for exam- especiallyinrecentyears.However,graphanalyticshastra- ple finding frequent subgraphs in a graph. Given an input ditionallybeenachallengingproblemtackledbyexpertre- graph,theseproblemsrequireexploringaverylargenumber searchers,whocaneitherdesignnewspecializedalgorithms ofsubgraphsandfindingpatternsthatmatchsome“interest- for the problem at hand, or pick an appropriate and sound ingness” criteria desired by the user. These algorithms are solutionfromaveryvastliterature.Whentheinputgraphor very important for areas such as social networks, semantic the intermediate state or computation complexity becomes web,andbioinformatics. verylarge,scalabilityisanadditionalchallenge. In this paper, we present Arabesque, the first distributed The development of graph processing systems such as data processing platform for implementing graph mining Pregel[25]haschangedthisscenarioandmadeitsimplerto algorithms. Arabesque automates the process of exploring design scalable graph analytics algorithms. Pregel offers a a very large number of subgraphs. It defines a high-level simple“thinklikeavertex”(TLV)programmingparadigm, filter-processcomputationalmodelthatsimplifiesthedevel- where each vertex of the input graph is a processing ele- opmentofscalablegraphminingalgorithms:Arabesqueex- mentholdinglocalstateandcommunicatingwithitsneigh- ploressubgraphsandpassesthemtotheapplication,which borsinthegraph.TLVisaperfectmatchforproblemsthat must simply compute outputs and decide whether the sub- can be represented through linear algebra, where the graph graphshouldbefurtherextended.WeuseArabesque’sAPI is modeled as an adjacency matrix (or some other variant toproducedistributedsolutionstothreefundamentalgraph liketheLaplacianmatrix)andthecurrentstateofeachver- mining problems: frequent subgraph mining, counting mo- texisrepresentedasavector.Wecallthisclassofmethods tifs, and finding cliques. Our implementations require a graph computation problems. A good example is comput- handfuloflinesofcode,scaletotrillionsofsubgraphs,and ing PageRank [6], which is based on iterative sparse ma- represent in some cases the first available distributed solu- trixandvectormultiplicationoperations.TLVcoversseveral tions. otheralgorithmsthatrequireasimilarcomputationalarchi- tecture, for example, shortest path algorithms, and over the years many optimizations of this paradigm have been pro- ∗CurrentlyaPhDstudentattheFederalUniversityofMinasGerais,Brazil. posed[17,26,36,42]. ThisworkwasdonewhiletheauthorwasatQCRI. Despite this progress, there remains an important class of algorithms that cannot be readily formulated using the TLV paradigm. These are graph mining algorithms used to discover relevant patterns that comprise both structure- basedandlabel-basedpropertiesofthegraph.Graphmining Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor is widely used for several applications, for example, dis- classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed covering 3D motifs in protein structures or chemical com- forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitation onthefirstpage.CopyrightsforcomponentsofthisworkownedbyothersthanACM pounds, extracting network motifs or significant subgraphs mustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orrepublish, from protein-protein or gene interaction networks, mining topostonserversortoredistributetolists,requirespriorspecificpermissionand/ora [email protected]. attributed patterns over semantic data (e.g., in Resource SOSP’15, October4–7,2015,Monterey,CA,USA. Description Framework or RDF format), finding structure- Copyright(cid:13)c 2015ACM978-1-4503-3834-9/15/10...$15.00. http://dx.doi.org/10.1145/2815400.2815410 content relationships in social media data, dense subgraph Motifs(MiCo) Motifs(Youtube) Cliques(MiCo) Input&graph& Pa3ern& Embeddings& FSM(CiteSeer) Motifs(SN) 2" 1" 3" hs p a bgr 1012 1" 3" 2" 2" u S g estin 109 4" 5" 3" 1" er nt ofI 106 Figure2:Graphminingconcepts:aninputgraph,anexam- ber ple pattern, and the embeddings of the pattern. Colors rep- Num 103 1 2 3 4 5 6 resent labels. Numbers denote vertex ids. Patterns and em- SizeoftheSubgraphs beddings are two types of subgraphs. However, a pattern is atemplate,whereasanembeddingisaninstance.Inthisex- Figure 1: Exponential growth of the intermediate state in ample,thetwoembeddingsareautomorphic. graph mining problems (motifs counting, clique finding, FSM:Frequentsubgraphmining)ondifferentdatasets. miningforcommunityandlinkspamdetectioninwebdata, cates whether an embedding should be processed, and pro- among others. Graph mining algorithms typically take a cess,whichexaminesanembeddingandmayproducesome labeled and immutable graph as input, and mine patterns output.Forexample,inthecaseoffindingcliquesthefilter that have some algorithm-specific property (e.g., frequency functionprunesembeddingsthatarenotcliques,sincenone abovesomethreshold)byfindingallinstancesofthesepat- of theirextensions canbe cliques,and the processfunction ternsintheinputgraph.Somealgorithmsalsocomputeag- outputsallexploredembeddings,whicharecliquesbycon- gregatedmetricsbasedonthesesubgraphs. struction.Arabesquealsosupportsthepruningoftheexplo- Designinggraphminingalgorithmsisachallengingand rationspacebasedonuser-definedmetricsaggregatedacross active area of research. In particular, scaling graph mining multipleembeddings. algorithmstoevenmoderatelylargegraphsishard.Theset TheArabesqueAPIsimplifiesandthusdemocratizesthe of possible patterns and their subgraphs in a graph can be designofgraphminingalgorithms,andautomatestheirexe- exponentialinthesizeoftheoriginalgraph,resultinginan cutioninadistributedsetting.WeusedArabesquetoimple- explosion of the computation and intermediate state. Fig- ment and evaluate scalable solutions to three fundamental ure 1 shows the exponential growth of the number of “in- anddiversegraphminingproblems:frequentsubgraphmin- teresting”subgraphsofdifferentsizesinsomeofthegraph ing, counting motifs, and finding cliques. These problems miningproblemsanddatasetswewillevaluateinthispaper. aredefinedpreciselyinSection2.Someofthesealgorithms Evengraphswithfewthousandsofedgescanquicklygener- are the first distributed solutions available in the literature, atehundredsofmillionsofinterestingsubgraphs.Theneed whichshowsthesimplicityandgeneralityofArabesque. for enumerating a large number of subgraphs characterizes Arabesque’sembedding-centeredAPIfacilitatesahighly graph mining problems and distinguishes them from graph scalable implementation. The system scales by spread- computationproblems.Despitethisstateexplosionproblem, ing embeddings uniformly across workers, thus avoiding mostgraphminingalgorithmsarecentralizedbecauseofthe hotspots.Bymakingitexplicitthatembeddingsarethefun- complexityofdistributedsolutions. damental unit of exploration, Arabesque is able to use fast In this paper, we propose automatic subgraph explo- coordination-freetechniques,basedonthenotionofembed- rationasagenericbuildingblockforsolvinggraphmining ding canonicality, to avoid redundant work and minimize problems,andintroduceArabesque,thefirstembeddingex- communicationcosts.Italsoenablesustostoreembeddings plorationsystemspecificallydesignedfordistributedgraph efficiently using a new data structure called Overapproxi- mining.Conceptually,wemovefromTLVto“thinklikean mating Directed Acyclic Graph (ODAG), and to devise a embedding” (TLE), where by embedding we denote a sub- new two-level optimization for pattern-based aggregation, graph representing a particular instance of a more general whichisacommonoperationingraphminingalgorithms. templatesubgraphcalledapattern(seeFigure2). Arabesque is implemented as a layer on top of Apache Arabesque defines a high-level filter-process computa- Giraph [3], a Pregel-inspired graph computation system, tional model. Given an input graph, the system takes care thus allowing both graph computation and graph mining of automatically and systematically visiting all the embed- algorithms to run on top of the same infrastructure. The dingsthatneedtobeexploredbytheuser-definedalgorithm, implementation does not use a TLV approach: it considers performingthisexplorationinadistributedmanner.Thesys- Giraph just as a regular data parallel system implementing tempassesalltheembeddingsitexplorestotheapplication, theBulkSynchronousProcessingmodel. whichconsistsprimarilyoftwofunctions:filter,whichindi- Tosummarize,wemakethefollowingcontributions: • Weproposeembeddingexploration,or“thinklikeanem- Characterizing Graph Mining Problems: Arabesque tar- bedding”, as an effective basic building block for graph gets graph mining problems, which involve subgraph enu- mining. We introduce the filter-process computational meration. Given an immutable input graph G with labeled model (Section 3), design an API that enables embed- verticesandedges,graphminingproblemsfocusontheenu- ding exploration to be expressed effectively and suc- merationofallpatternsthatsatisfysomeuser-specified“in- cinctly,andpresentthreeexamplegraphminingapplica- terestingness”criteria.Agivenpatternpisevaluatedbylist- tionsthatcanbeelegantlyexpressedusingtheArabesque ing its matches or embeddings in the input dataset G, and API(Section4). thenfilteringouttheuninterestingones.Graphminingprob- lems often require subgraph isomorphism checks to deter- • Weintroducetechniquestomakedistributedembedding mine the patterns related to sets of embeddings, and also explorationscalable:coordination-freeworksharing,ef- graph automorphism checks to eliminate duplicate embed- ficient storage of embeddings, and an important opti- dings. mizationforpattern-basedaggregation(Section5). We consider graph mining problems where one seeks • We demonstrate the scalability of Arabesque on various connected graph patterns whose embeddings are vertex- or graphs. We show that Arabesque scales to hundreds of edge-induced.Thereareseveralvariantsoftheseproblems. coresoveracluster,obtainingordersofmagnitudereduc- Theinputdatasetmaycompriseacollectionofmanygraphs, tionofrunningtimeoverthecentralizedbaselines(Sec- or a single large graph. The embeddings of a pattern com- tion6),andcananalyzetrillionsofembeddingsonlarge prise the set of (exact) isomorphisms from the pattern p to graphs. the input graph G. However, in inexact matching, one can The Arabesque system, together with all applications seekinexactorapproximateisomorphisms(basedonnotions usedforthispaper,ispubliclyavailableattheproject’sweb- ofedit-distances,labelcosts,etc.).Therearemanyvariants site:www.arabesque.io. intermsoftheinterestingnesscriteria,suchasfrequentsub- graphmining,densesubgraphmining,orextractingtheen- tire frequency distribution of subgraphs up to some given 2. GraphMiningProblems numberofvertices.Alsorelatedtographminingistheprob- In this section, we introduce the graph-theoretic terms we lemofgraphmatching,whereaquerypatternqisfixed,and will use throughout the text and characterize the space of onehastoretrieveallitsmatchesintheinputgraphG.Solu- graphminingproblemsaddressedbyArabesque. tions to graph matching typically use indexing approaches, Terminology: Graph mining problems take an input graph whichpre-computeasetofpathsorfrequentsubgraphs,and Gwhereverticesandedgesarelabeled.Verticesandedges use them to facilitate fast matching and retrieval. As such have unique ids, and their labels are arbitrary, domain- graphminingencompassesthematchingproblem,sincewe specific attributes that can be null. An embedding is a sub- havetobothenumeratethepatternsandfindtheirmatches. graph of G, i.e., a graph containing a subset of the ver- Further,anysolutiontothesingleinputgraphsettingiseas- tices and edges of G. A vertex-induced embedding is de- ilyadaptedtothemultiplegraphdatasetcase.Therefore,in fined starting from a set of vertices by including all edges thispaperwefocusongraphminingtasksonasinglelarge of G whose endpoints are in the set. An edge-induced em- input graph. See [1] for a state-of-the-art review of graph bedding is defined starting from a set of edges by includ- miningandsearch. ing all the endpoints of the edges in the set. For exam- UseCases:Throughoutthispaperweconsiderthreeclasses ple, the two embeddings of Figure 2 are induced by the ofproblems:frequentsubgraphmining,countingmotifs,and edges{(1,2),(2,3)}.Inordertobeinducedbythevertices findingcliques.Wechosetheseproblemsbecausetheyrep- {1,2,3}theembeddingsshouldalsoincludetheedge(1,3). resentdifferentclassesofgraphminingproblems.Thefirst Weconsideronlyconnectedembeddingssuchthatthereisa is an example of explore-and-prune problems, where only pathconnectingeachpairofvertices. embeddings corresponding to a frequent pattern need to be Apatternisanarbitrarygraph.Wesaythatanembedding furtherexplored.Countingmotifsrequiresexhaustivegraph einGisisomorphictoapatternpifandonlyifthereexists explorationuptosomemaximumsize.Findingcliquesisan a one-to-one mapping between the vertices of e and p, and exampleofdensesubgraphmining,andallowsonetoprune betweentheedgesofeandp,suchthat:(i)eachvertex(resp. the embeddings using local information. We discuss these edge)inehasonematchingvertex(resp.edge)inpwiththe problemsbelowinmoredetail. samelabels,andviceversa;(ii)eachmatchingedgeconnects Consider the task of frequent subgraph mining (FSM), matchingvertices.Inthiscase,wesayinformallythatehas i.e.,findingthosesubgraphs(orpatterns,inourterminology) patternp.Equivalently,wesaythatthereexistsasubgraph thatoccuraminimumnumberoftimesintheinputdataset. isomorphism from p to G, and we call e an instance of The occurrences are counted using some anti-monotonic patternpinG.Twoembeddingsareautomorphicifandonly function on the set of its embeddings. The anti-monotonic if they contain the same edges and vertices, i.e., they are propertystatesthatthefrequencyofasupergraphshouldnot equalandthusalsoisomorphic(seeforexampleFigure2). exceedthefrequencyofasubgraph,whichallowsonetostop Algorithm1:ExplorationstepinArabesque. extending patterns/embeddings as soon as they are deemed input :SetIofinitialembeddingsforthisstep tobeinfrequent.Thereareseveralanti-monotonicmetricsto output:SetF ofextendedembeddingsforthenextstep(initially measure the frequency of a pattern. While their differences empty) are not very important for this discussion, they all require output:SetOofnewoutputs(initiallyempty) aggregatingmetricsfromallembeddingsthatcorrespondto foreache∈Isuchthatα(e)do thesamepattern.Weusetheminimumimage-basedsupport addβ(e)toO; C←setofallextensionsofeobtainedbyaddingoneincident metric [7], which defines the frequency of a pattern as the edge/neighboringvertex; minimum number of distinct mappings for any vertex in foreache(cid:48)∈Cdo the pattern, over all embeddings of the pattern. Formally, ifφ(e(cid:48))andthereexistsnoe(cid:48)(cid:48)∈F automorphictoe(cid:48)then let G be the input graph, p be a pattern, and E its set of addπ(e(cid:48))toO; embeddings. The pattern p is frequent if sup(p,E) ≥ θ, adde(cid:48)toF; runaggregationfunctions; wheresup()istheminimumimage-basedsupportfunction, andθisauser-specifiedthreshold.TheFSMtaskistomine allfrequentsubgraphpatternsfromasinglelargegraphG. A motif p is defined as a connected pattern of vertex- canoptionallyalsodefinetwoadditionalfunctionsthatwill induced embeddings that exists in an input graph G. Fur- bedescribedshortly. ther, a set of motifs is required to be non-isomorphic, i.e., Arabesque starts an exploration by generating the set of there should obviously be no duplicate patterns. In motif candidate embeddings C, which are obtained by expand- mining[30],theinputgraphisassumedtobeunlabeled,and ing the embeddings in I. The system computes candidates there is no minimum frequency threshold; rather the goal by adding one incident edge or vertex to e, depending on is to extract all motifs that occur in G along with their fre- whether it runs in edge-based or vertex-based exploration quencydistribution.Sincethistaskisinherentlyexponential, mode.Inedge-basedexploration,anembeddingisanedge- the motifs are typically restricted to patterns of order (i.e., induced subgraph; in vertex-based exploration, it is vertex- number of vertices) at most k. For example, for k = 3 we induced(seeSection2).Inthefirstexplorationstep,I con- have two possible motifs: a chain where ends are not con- tains only a special “undefined” embedding, whose expan- nected and a (complete) triangle. Whereas the original def- sionC consistsofalledgesorverticesofG,dependingon initionofmotifstargetsunlabeledandinducedpatterns,we thetypeofexploration.Theapplicationcandecidebetween caneasilygeneralizethedefinitiontolabeledpatterns. edge-basedorvertex-basedexplorationduringinitialization. In clique mining the task is to enumerate all complete After computing the candidates, the filter function φ ex- subgraphs in the input graph G. A complete subgraph, or amines each candidate e(cid:48) and returns a Boolean value indi- clique,pwithk verticesisdefinedasaconnectedsubgraph cating whether e(cid:48) needs to be processed. If φ returns true, where each vertex has degree k − 1, i.e., each vertex is theprocessfunctionπ takese(cid:48) asinputandoutputsasetof connected to all other vertices (we assume there are no user-definedvalues.Bydefault,e(cid:48)isthenaddedtothesetF. self-loops). The clique problem can also be generalized to After an exploration step is terminated, I is set to be equal maximalcliques,i.e.,thosenotcontainedinanyotherclique, to F before the start of the next step. The computation ter- and frequent cliques, if we impose a minimum frequency minateswhenthesetF isemptyattheendofastep. thresholdinadditiontothecompletenessconstraint. ArabesquerunstheouterloopofAlgorithm1inparallel by partitioning the embeddings in I over multiple servers each running multiple worker threads. This distribution is 3. TheFilter-ProcessModel transparenttoapplications.Eachexecutionstepisexecuted asasuperstepintheBulkSynchronousParallelmodel[38]. We now discuss the filter-process model, which formalizes Optional aggregation functions: The filter-process model how Arabesque performs embedding exploration based on describedsofarconsiderssingleembeddingsinisolation.A application-specificfilterandprocessfunctions. common task in graph mining systems is to aggregate val- uesacrossmultipleembeddings,forexamplegroupingem- 3.1 ComputationalModel beddings by pattern. To this end, Arabesque offers specific Arabesquecomputationsproceedthroughasequenceofex- functions to execute user-defined aggregation for multiple ploration steps. At a conceptual level, the system performs embeddings.Aggregationcangroupembeddingsbyanarbi- theoperationsillustratedinAlgorithm1ineachexploration trary integer value or by pattern, and is executed on candi- step.EachstepisassociatedwithaninitialsetI containing dateembeddingsattheendoftheexplorationstepinwhich embeddingsoftheinputgraphG.Arabesqueautomatesthe theyaregenerated. process of exploring the graph and expanding embeddings. The optional aggregation filter function α and aggrega- Applicationsarespecifiedviatwouser-definedfunctions:a tionprocessfunctionβ canfilterandprocessanembedding filter function φ and a process function π. The application e in the exploration step following its generation. At that time,aggregatedinformationcollected fromalltheembed- approachalsocreatesasignificantnumberofduplicatemes- dings generated in the same exploration step as e becomes sages because each new embedding must be sent to all its available.Theαfunctioncantakefilteringdecisionsbefore bordervertices.Theselimitationssignificantlyaffectperfor- embeddingexpansionbasedonaggregatevalues.Forexam- mance.Inourexperiments,weobservedthatTLV-basedem- ple, in the frequent subgraph mining problem we can use beddingexplorationalgorithmscanbetwoordersofmagni- aggregatorstocounttheembeddingsassociatedwithagiven tudeslowercomparedtoTLE. pattern,andthenfilteroutembeddingsofinfrequentpatterns Thecurrentstate-of-the-artcentralizedmethodsforsolv- withα.Similarly,theβfunctioncanbeusedtooutputaggre- ing graph mining tasks typically adopt a different, pattern- gate information about an embedding, for example its fre- centricor“ThinkLikeaPattern”(TLP)approach.Thekey quency. By default, α returns true and β does not add any differencebetweenTLPandtheembedding-centricviewof outputtoO. thefilter-processmodelisthatitisnotnecessarytoexplicitly Guarantees and requirements: Embedding exploration materializeallembeddings:statecanbekeptatthegranular- processes every embedding that is not filtered out. More ityofpatterns(whicharemuchfewerthanembeddings)and formally,itguaranteesthefollowingcompletenessproperty: embeddings may be re-generated on demand. The process for each embedding e such that φ(e) = α(e) = true, em- startswiththesetofallpossible(labeled)singleverticesor beddingexplorationmustaddπ(e)andβ(e)toO. edges as candidate patterns. It then processes embeddings Completeness assumes some properties of the user- ofeachpattern,oftenbyrecomputingthemonthefly.After defined functions that emerge naturally in graph mining aggregationandpatternfiltering,thevalidsetofpatternsare algorithms. The first property is that the application con- extended by adding one more vertex or edge. Subgraph or siderstwoembeddingseande(cid:48) tobeequivalentiftheyare pattern mining proceeds iteratively via recursive extension, automorphic (see Section 2). Formally, Arabesque requires processing and filtering steps, and continues until no new what we call automorphism invariance: if e and e(cid:48) are au- patterns are found. Parallelizing the computation via parti- tomorphic, then each user-defined function must return the tioning it by pattern can easily result in load imbalance, as same result for e and e(cid:48). Arabesque leverages this natural our experiments show. This is because there are often only property of graph mining algorithms to prune automorphic few patterns that are highly popular – indeed, finding these embeddingsandsubstantiallyreducetheexplorationspace. fewpatternsistheverygoalofgraphmining.Thesepopular The second property Arabesque requires is called anti- patterns result in hotspots among workers and thus in poor monotonicity and is formally defined as follows: if φ(e) = loadbalancing. false then it holds that φ(e(cid:48)) = false for each embedding e(cid:48) that extends e. The same property holds for the optional 4. Arabesque:API,Programming,and filter function α. This is one of the essential properties for Implementation anyeffectivegraphminingmethodandguaranteesthatonce WenowdescribetheArabesqueJavaAPIandshowhowwe afilterfunctiontellstheframeworktopruneanembedding, useittoimplementourexampleapplications. allitsextensionscanbeignored. 3.2 AlternativeParadigms:ThinkLikeaVertexand 4.1 ArabesqueAPI ThinkLikeaPattern The API of Arabesque is illustrated in Figure 3. The user Wecannowcontrastourembedding-centricor“ThinkLike mustimplementtwofunctions:filter,whichcorresponds anEmbedding”(TLE)modelwithotherapproachestobuild to φ, and process, which corresponds to π. The pro- graph mining algorithms. We empirically compare with cess function in the API is responsible for adding results theseapproachesinSection6. to the output by invoking the output function provided The standard paradigm of systems like Pregel is vertex- by Arabesque, which prints the results to the underlying centric or “Think Like a Vertex” (TLV), because computa- file system (e.g. HDFS). The optional functions α and β tion and state are at the level of a vertex in the graph. TLV correspond, respectively, to aggregationFilter and systemsaredesignedtoscaleforlargeinputgraphs:thein- aggregationProcess.Theseapplication-specificfunc- formation about the input graph is distributed and vertices tionsareinvokedbytheArabesqueframeworkasillustrated onlyhaveinformationabouttheirlocalneighborhood.Inor- in Algorithm 1. All these functions have access to a local dertoperformembeddingexploration,eachvertexcankeep read-onlycopyofthegraph. asetoflocalembeddings,initiallycontainingonlythever- For performance and scalability reasons, a MapReduce- texitself.Then,toexpandalocalembeddinge,verticespush like model is used to compute aggregated values. Appli- ittothe“border”verticesofethatknowhowtoexpandeby cations can send data to reducers using the map function, adding its neighbors. Each expansion results in a new em- whichispartoftheArabesqueframeworkandaddsavalue bedding, which is sent again to border vertices and so on. toanaggregationgroupdefinedbyacertainkey.Manyap- Withthisapproach,highlyconnectedverticesmusttakeon plicationsusethepatternofanembeddingastheaggregation a disproportionate fraction of embeddings to expand. The key. Arabesque detects when the key is a pattern and uses Mandatoryapplication-definedfunctions: boolean filter(Embedding e) { return true; } boolean filter (Embedding e) void process(Embedding e){ void process (Embedding e) map (pattern(e), domains(e)); } Pair<Pattern,Domain> reduce Optionalapplication-definedfunctions: (Pattern p, List<Domain> domains){ Domain merged_domain = merge(domains); boolean aggregationFilter (Embedding e) return Pair (p, merged_domain); } void aggregationProcess (Embedding e) boolean aggregationFilter(Embedding e){ Pair<K,V> reduce (K key, List<V> values) Domain m_domain = readAggregate(pattern(e)); Pair<K,V> reduceOutput (K key, List<V> value) return (support(m_domain)>=THRESHOLD); } boolean terminationFilter (Embedding e) void aggregationProcess(Embedding e) { output(e); } Arabesquefunctionsinvokedbyapplications: (a)Frequentsubgraphmining(edge-basedexploration) void output (Object value) void map (K key, V value) boolean filter(Embedding e){ V readAggregate (K key) return (numVertices(e) <= MAX_SIZE);} void mapOutput (K key, V value) void process(Embedding e){ mapOutput (pattern(e),1); } Pair<Pattern,Integer> reduceOutput (Pattern p, List<Integer> counts){ Figure3:Basicuser-definedfunctionsinArabesque. return Pair (p, sum(counts)); } (b)Countingmotifs(vertex-basedexploration) specificoptimizationstomakethisaggregationefficient,as boolean filter (Embedding e){ return isClique(e); } explained in Section 5.4. The application specifies the ag- void process (Embedding e){ output(e); } gregationlogicthroughthereducefunction.Thisfunction (c)Findingcliques(vertex-basedexploration) receives all values mapped to a specific key in one execu- tionstepandaggregatesthem.Anymethodcanreadtheval- Figure4:ExamplesofArabesqueapplications. ues aggregated over the previous execution step using the readAggregatemethod. Output aggregation is a special case where aggregated valuesaresentdirectlytotheunderlyingdistributedfilesys- to the reducer responsible for the pattern p of e. For ex- tem at the end of each exploration step and are not made ample, in Figure 2 the domain of the blue vertex on the availableforlaterreads.Itcanbeusedthroughthemethods top of the pattern is vertex 1 in the first embedding and 3 mapOutput and reduceOutput. Their logic is similar in the second. The function reduce merges all domains: totheaggregationfunctionsdescribedpreviously,butaggre- the merged domain of a vertex in p is the union of all its gationisonlyexecutedwhenthewholecomputationends. aggregated mappings. Since expansion is done by adding TheterminationFilterfunctioncanhaltthecom- one edge in each exploration step, we are sure that all putationofanembeddingwhensomepre-definedcondition embeddings for p are visited and processed in the same isreached.Arabesqueappliesthisfilteronanembeddinge exploration step. The aggregationFilter function afterexecutingπ(e)andbeforeaddingetoF.Thisisjustan reads the merged domains of p using readAggregate optimizationtoavoidunnecessaryexplorationsteps.Forex- and computes the support, which is the minimum size of ample,ifweareinterestedinembeddingsofmaximumsize the domain of any vertex in p. It then filters out embed- n, the termination filter can halt the computation after pro- dingsforpatternsthatdonothaveenoughsupport.Finally, cessing embeddings of size n at step n. Without this filter, the aggregationProcess function outputs all the em- thesystemwouldhavetoproceedtostepn+1andgenerate beddings having a frequent pattern (those that survive the allembeddingsofsizen+1justtofilterallofthemout. aggregation-filter). The implementation of this application consistsof280linesofJavacode.Ofthese,212arerelated 4.2 ProgrammingwithArabesque tohandlingdomainsandcomputingsupport,whicharebasic We used the Arabesque API to implement algorithms that tasks required in any algorithm for frequent subgraph min- solve the three problems discussed in Section 2. The pseu- ing to characterize whether an embedding is relevant. By docode of the implementations is in Figure 4. Each of the comparison, the centralized baseline we use for evaluation, applicationsconsistsofveryfewlinesofcode,astarkcon- GRAMI[14],consistsof5,443linesofJavacode. trastcomparedtothecomplexityofthespecializedstateof Formotiffrequencycomputation,weperformanexhaus- theartalgorithmssolvingthesameproblems[8,30,43]. tive exploration of all embeddings until we reach a given In frequent subgraph mining we use aggregation to cal- maximum size and count all embeddings having the same culatethesupportfunctiondescribedinSection2.Thesup- pattern. Since the input graph is not labeled in this case, a port metric is based on the notion of domain, which is de- pattern corresponds to a motif. The function mapOutput fined as the set of distinct mappings between a vertex in p sends a value to the output reducer responsible for the pat- and the matching vertices in any automorphism of e. The ternofe.ThereduceOutputfunctionoutputsthesumof process function invokes map to send the domains of e thecountsforeachmotifp.Ourimplementationconsistsof 18linesofcode,veryfewcomparedtothe3,145linesofC tion 3) we can avoid redundant work by discarding all but codeofourcentralizedbaseline(Gtries[31]). oneoftheidentical,automorphicembeddings. Infindingcliqueswedoalocalpruning:ifanembedding Arabesquesolvesthisproblemusinganovelcoordination- isnotaclique,noneofitsextensionscanbeaclique.Since freeschemebasedonthenotionofembeddingcanonicality. we visit only cliques, the evaluation function outputs the Informally, we need to select exactly one of the redundant embeddings it receives as input. The isClique function automorphicembeddingsandelectitas“canonical”.Inour checksthatthenewlyaddedvertexisconnectedwithallpre- example, before w and w execute the filter and process 1 2 viousverticesintheembedding.Thisapplicationconsistsof functions on a new embedding e, Arabesque executes an 19linesofcode,whileourcentralizedbaseline(Mace[37]) embedding canonicality check to verify whether e can be consistsof4,621linesofCcode. pruned (see Algorithm 1). This check runs on a single em- In all these examples it is easy to verify that the evalua- beddingwithoutrequiringcoordination,aswenowdiscuss. tionandfilterfunctionssatisfytheanti-monotonicandauto- A sound canonicality check must have the property of morphism invariance properties required by the Arabesque uniqueness:giventhesetS ofallembeddingsautomorphic e computationalmodel. to an embedding e, there is exactly one canonical embed- dinge inS .Wecalle thecanonicalautomorphismofe. c e c 4.3 Arabesqueimplementation InArabesquewealsoneedanadditionalpropertyforcanon- Arabesquecanexecuteontopofanysystemsupportingthe icality checks called extendibility, which we define as fol- BSP model. We have implemented Arabesque as a layer lows.Letebeacandidateembeddingobtainedbyextending on top of Giraph. The implementation does not follow the a parent embedding e(cid:48) by one vertex or edge. The parent TLV paradigm: we use Giraph vertices simply as workers embedding e(cid:48) is canonical because it has not been pruned. that bear no relationship to any specific vertex in the input Let ec be the canonical automorphism of e. Extendibility graph. Each worker has access to a copy of the whole in- requires that ec is one of the extensions of e(cid:48). This allows put graph whose vertices and edges consist of incremental Arabesquetoprunetheautomorphismsofaparente(cid:48) while numeric ids. Communication among workers occurs in an stillexploringthecanonicalautomorphismofeachchilde. arbitrary point-to-point fashion and communication edges are not related to edges in the input graph. Giraph compu- Algorithm 2: Arabesque’s incremental embedding tations proceed through synchronous supersteps according canonicalitycheck(vertex-basedexploration) totheBSPmodel:ateachsuperstep,workersfirstreceiveall input :InputgraphG messagessentintheprevioussuperstep,thenprocessthem, input :Canonicalparentembedding(cid:104)v1,...,vn(cid:105) and finally send new messages to be delivered at the next input :Extensionvertexv superstep. The operations of the workers are described in output:trueiff(cid:104)v1,...,vn,v(cid:105)iscanonical Section 5. Aggregation functions, if specified, are executed ifv1>vthen returnfalse; usingstandardGiraphaggregators.Optionaloutputworkers foundNeighbour←false; are used for applications that aggregate output values. The fori=1...ndo input values for output aggregation persist over supersteps. if foundNeighbour=falseandvineighborofvinGthen Once the computation is over, output workers aggregate all foundNeighbour←true; theirinputvaluesandoutputthem. elseiffoundNeighbour=trueandvi>vthen returnfalse; returntrue; 5. GraphExplorationTechniques We now describe in more detail how Arabesque performs Arabesquechecksembeddingcanonicalityforeachcan- graphexploration.Wefirstdiscussthecoordination-freeex- didate before applying the filter function. There can be a plorationstrategyusedbyworkerstoavoidredundantwork. huge number of candidates, so it is essential that the check Wethenintroducethetechniquesweusetostoreandparti- isefficient.Wedevelopedalinear-timealgorithm,whichis tionembeddingsefficiently. basedonthefollowingdefinitionofcanonicalembedding. Considerthecaseofvertex-basedexploration(theedge- 5.1 Coordination-FreeExplorationStrategy basedcaseisanalogous).Anembeddingeiscanonicalifand When running exploration in a distributed setting, multiple onlyifitsverticeswerevisitedinthefollowingorder:start workers can reach two “identical”, i.e., automorphic (see by visiting the vertex with the smallest id, and then recur- Section2),embeddingsthroughdifferentexplorationpaths. sivelyaddtheneighborinewithsmallestidthathasnotbeen Consider for example the graph of Figure 2. Two different visitedyet.Forbetterperformance,Arabesqueperformsthis workers w and w may reach the two embeddings in Fig- canonicalitycheckinanincrementalfashion,asillustratedin 1 2 ure2,onestartingfromedge(1,2)byadding(2,3)andthe Algorithm2.Whenaworkerprocessesanembeddinge∈I, other starting from edge (3,2) by adding (2,1). Since all itcanalreadyassumethateiscanonicalbecauseArabesque user-definedfunctionsareautomorphism-invariant(seeSec- prunes non-canonical embeddings before passing them on tothenextexplorationstep.Arabesque,characterizesanem- edge-basedcaseisanalogous.TheODAGforasetofcanon- beddingasthelistofitsverticessortedbytheorderinwhich icalembeddingsconsistsofasmanyarraysasthenumberof they have been visited – the embedding is vertex-induced verticesofalltheembeddings.Theitharraycontainstheids so the list uniquely identifies it. When Arabesque checks of all vertices in the ith position in any embedding. Vertex thecanonicalityofanewcandidateembeddingobtainedby v in the ith array is connected to vertex u in the (i+1)th addingavertexvtoaparentcanonicalembeddinge,theal- arrayifthereisatleastonecanonicalembeddingwithvand gorithmscanseinsearchforthefirstneighborv(cid:48) ofv,and uinpositioniandi+1respectivelyintheoriginalset.An thenvertifiesthatthereisnovertexineafterv(cid:48) withhigher exampleofODAGisshowninFigures5and6. idthanv. The extended version of this paper [35] includes proofs Canonical" embeddings" showing that Algorithm 2 satisfies the uniqueness and ex- 3" 1" 4" 2" tendibility properties of canonicality checking. We also 1" 4" 3" showthatthesetwoproperties,togetherwithanti-monotoni- 2" 4" 1" 4" 5" 2" 3" 4" city and automorphism invariance, are sufficient to ensure 1" 5" 2" 4" 5" that Arabesque satisfies the completeness property of em- 3" 4" 5" beddingexploration(seeSection3). Figure5:ExamplegraphanditssetofSofcanonicalvertex- 5.2 StoringEmbeddingsCompactly inducedembeddingsofsize3. Graph mining algorithms can easily generate trillions of Array"3" Array"1" embeddings as intermediate state. Centralized algorithms Array"2" 2" typically do not explicitly store the embeddings they have 1" 3" 3" explored. Instead, they use application-specific knowledge 2" 4" 4" torebuildembeddingsontheflyfromamuchsmallerstate. 3" 5" The only application-level logic available to Arabesque consistsofthefilterandprocessfunctions,whichareopaque Figure6:ODAGfortheexampleofFigure5.Italsoencodes to the system. After each exploration step, Arabesque re- spuriousembeddingssuchas(cid:104)3,4,2(cid:105). ceivesasetofembeddingsfilteredbytheapplicationandit needstokeeptheminordertoexpandtheminthenextstep. Storing an ODAG is more compact than storing the full Storing each embedding separately can be very expensive. set of embeddings. In general, in a graph with N distinct AswehaveseenalsoinAlgorithm2,Arabesquerepresents verticeswecanhaveuptoO(Nk)differentembeddingsof embeddingsassequencesofnumbers,representingvertexor sizek.WithODAGsweonlyhavetokeepedgesbetweenk edge ids depending on whether exploration is vertex-based arrays, where k is the size of the embeddings, so the upper oredge-based.Therefore,weneedtofindtechniquestostore boundonthesizeisO(k·N2) = O(N2)ifk isaconstant setsofsequencesofintegersefficiently. muchsmallerthanN. Existingcompactdatastructuressuchasprefix-treesare Itispossibletoobtainallembeddingsoftheoriginalset too expensive because we would still have to store a new bysimplyfollowingtheedgesintheODAG.However,this leafforeachembeddinginthebestcase,sinceallcanonical will also generate spurious embeddings that are not in the embeddings we want to represent are different. In contrast, originalset.ConsiderforexamplethegraphofFigure5and Arabesque uses a novel technique called Overapproximat- its set of canonical embeddings S. Expanding the ODAG ing Directed Acyclic Graphs, or ODAGs. At a high level, of Figure 6 generates also (cid:104)3,4,2(cid:105) (cid:54)∈ S. Filtering out such ODAGs are similar to prefix trees where all nodes at the spuriousembeddingsrequiresapplication-specificlogic. same depth corresponding to the same vertex in the graph ODAGsinArabesque:InArabesque,workersproducenew are collapsed into a single node. This more compact repre- embeddings in each exploration step and add them to the sentationisanoverapproximation(superset)ofthesetofse- setF (seeAlgorithm1).WorkersuseODAGstostoreF at quenceswewanttostore.Whenextractingembeddingsfrom theendofanexplorationstep,andextractembeddingsfrom ODAGs we must do extra work to discard spurious paths. ODAGsatthebeginningofthenextstep. ODAGsthustradespacecomplexityforcomputationalcom- Filteringoutspuriousembeddings,asdiscussed,requires plexity.ThismakesODAGssimilartotherepresentativesets application-specific logic. Applications written using the introduced in [2], which have a higher compression capa- filter-process model give Arabesque enough information to bility but require more work to filter out spurious embed- perform filtering. In fact, workers can just apply the same dings.Inaddition,itishardertoachieveeffectiveloadbal- filteringcriteriaasAlgorithm1:thecanonicalitycheckand ancing with representative sets. We now discuss the details theuser-definedfilterandaggregatefilterfunctions.Ifanyof ofODAGsandshowhowtheyareusedinArabesque. thesechecksisnegativeforanembedding,weknowthatthe The ODAG Data Structure: For simplicity, we focus the embedding itself or, thanks to the anti-monotonicity prop- discussion on ODAGs for vertex-based exploration; the erty, one of its parents, was filtered out. In addition, since our canonicality check is incremental, we do not need to Theprocesscontinuesrecursivelyonsubsequentarraysuntil considerembeddingsthatextendanon-canonicalsequence, abalancedloadisreached. sowecanprunemultipleembeddingsatonce. 5.4 Two-LevelPatternAggregationforFastPattern After every exploration step, Arabesque merges and CanonicalityChecking broadcasts the new embeddings, and thus the ODAGs, to everyworker.Inordertoreducethenumberofspuriousem- Arabesque uses a special optimization to speed up per- beddings, workers group their embeddings in one ODAG patternaggregation,asdiscussedinSection4.Theoptimiza- perpattern.Foreachpattern,workersmergetheirlocalper- tionwasmotivatedbythehighpotentialcostofthistypeof patternODAGsintoasingleper-patternglobalODAG.Each aggregation,aswenowdiscuss. per-patternglobalODAGisreplicatedateachworkerbefore Consider again the example of Figure 2 and assume thebeginningofthenextexplorationstep. that we want to count the instances of all single-edge pat- MergingODAGsrequiresmergingedgesobtainedbydif- terns. The three single-edge embeddings (1,2), (2,3), and ferentworkers.Forexample,considerthefirsttwoarraysin (3,4) should be aggregated together since they all have a Figure 6. One worker might have explored the edge (cid:104)2,3(cid:105) blue and a yellow endpoint. Therefore, their two patterns and another worker (cid:104)2,4(cid:105). Edges of ODAG arrays are in- (blue,yellow) and (yellow,blue) should be considered dexedbytheirinitialvertex,sothetwoarraysfromthetwo equivalentbecausetheyareisomorphic(seeSection2).The workers will have two different entries for the element 2 aggregation reducer for these two patterns is associated to of the first array that need to be merged. Edge merging is asinglecanonicalpatternthatisisomorphictoboth.Map- very expensive because of the high number of edges in an ping a pattern to its canonical pattern thus entails solving ODAG.Therefore,Arabesqueusesamap-reducesteptoex- the graph isomorphism problem, for which no polynomial- ecuteedgemerging.Eachentryofeacharrayproducedbya timesolutionisknown[16].Thismakespatterncanonicality workerismappedtotheworkerassociatedtoitsindexvalue. much harder than embedding canonicality, which is related Thisworkerisresponsibleformergingalledgesforthaten- tothesimplergraphautomorphismproblem. try produced by all workers into a single entry. The entry Identifyingacanonicalpatternforeachsinglecandidate is then broadcast to all other workers, which computes the embeddingwouldbeasignificantbottleneckforArabesque, unionofallentriesinparallel. as we show in our evaluation, because of the sheer number of candidate embeddings that are generated at each explo- rationstep.Arabesquesolvesthisproblembyusinganovel 5.3 PartitioningEmbeddingsforLoadBalancing techniquecalledtwo-levelpatternaggregation. Afterbroadcastandbeforethebeginningofthenextexplo- The first level of aggregation occurs based on what we ration step, every worker obtains the same set of ODAGs, call quick patterns. A quick pattern of an embedding e is one for each pattern mined in the previous execution step. theoneobtained,inlineartime,bysimplyscanningallver- The next step is to partition the set I of new embeddings tices (or edges, depending on the exploration mode) of e (seeAlgorithm1)amongworkers.Thisisachievedbyparti- and extracting the corresponding labels. The quick pattern tioningtheembeddingsineachpatternODAGseparately. iscalculatedforeachcandidateembedding.Intheprevious Workers could achieve perfect load balancing by using example, we would obtain the quick pattern (blue,yellow) a round-robin strategy to share work: worker 1 takes the for the embeddings (1,2) and (3,4) and the quick pattern first embedding, worker 2 the second and so on. However, (yellow,blue) for the embedding (2,3). Each worker lo- having workers iterate through all embeddings produced in cally executes the reduce function based on quick patterns. thepreviousstep,includingthosethattheyarenotgoingto Once this local aggregation completes, a worker computes process,iscomputationallyintensive.Therefore,workersdo the canonical pattern p for each quick pattern p and sends c roundrobinonlargeblocksofbembeddings.Thequestion thelocallyaggregatedvaluetothereducerforp .Arabesque c nowishowtoidentifysuchblocksefficiently. usestheblisslibrarytodeterminecanonicalpatterns[20]. Arabesqueassociateseachelementv ineveryarraywith Insummary,insteadofexecutinggraphisomorphismfor anestimateofhowmanyembeddings,canonicalornot,can averylargenumberofcandidateembeddings,two-levelpat- begeneratedstartingfromv.Tothisend,Arabesqueassigns ternaggregationcomputesaquickpatternforeveryembed- acost1toeveryelementofthelastarray,anditassignstoan ding,obtainsanumberofquickpatternswhichisordersof elementoftheith arraythesumofthecostsofallelements magnitudesmallerthanthecandidateembeddings,andthen in the (i + 1)th array it is connected to. Load can then calculatesgraphisomorphismonlyforquickpatterns. be balanced by having each worker take a partition of the elements in the first array that has approximately the same 6. Evaluation estimated cost. While partitioning, it could happen that the 6.1 ExperimentalSetup costofanelementvofthefirstarrayneedstobesplitamong multiple workers. In this case, the costs associated to the Platform: We evaluate Arabesque using a cluster of 20 elementsofthesecondarrayconnectedtov arepartitioned. servers.Eachserverhas2IntelXeonE5-2670CPUswitha totalof32executionthreadsat2.67GHzpercoreand256GB tributedsolutionstosolveFSMonasinglelargeinputgraph RAM. The servers are connected with a 10 GbE network. intheliterature. Hadoop 2.6.0 was configured so that each physical server 10 contains a single worker which can use all 32 execution Ideal TLP TLV 8 threads(unlessotherwisestated).ArabesquerunsonGiraph developmenttrunkfromJanuary2015withaddedfunction- up 6 d alityforobtainingclusterdeploymentdetailsandimproving pee 4 S aggregationperformance.Thesemodificationsamountto10 2 extralinesofcode. 0 1 5 10 Vertices Edges Labels Av.Degree Numberofnodes(32threads) CiteSeer 3,312 4,732 6 2.8 Figure 7: Scalability Analysis of Alternative Paradigms: MiCo 100,000 1,080,298 29 21.6 FSM(S=300)onCiteSeer. Patents 2,745,761 13,965,409 37 10 Youtube 4,589,876 43,968,798 80 19 The Case of TLV: Our TLV implementation globally SN 5,022,893 198,613,776 0 79 Instagram 179,527,876 887,390,802 0 9.8 maintains the set of embeddings that have been visited, much like Arabesque. The implementation adopts the TLV approach as described in Section 3.2 and uses the same Table1:Graphsusedfortheevaluation. coordination-free technique as Arabesque to avoid redun- dant work. The TLV implementation also uses application- Datasets: We use six datasets (see Table 1). CiteSeer [14] specific approaches to control the expansion process. Our has publications as vertices, with their Computer Science TLVimplementationofFSMusesthisfeaturetofollowthe areaaslabel,andcitationsasedges.MiCo[14]hasauthors standarddepth-firststrategyofgSpan[43]. asvertices,whicharelabeledwiththeirfieldofinterest,and InFigure7,weshowthescalabilityofFSMwithsupport co-authorship ofa paper asedges. Patents [18]contains ci- 300usingtheCiteSeergraph.Asseenfromthefigure,TLV tationedgesbetweenUSPatentsbetweenJanuary1963and doesnotscalebeyond5servers.Amajorscalabilitybottle- December 1999; the year the patent was granted is consid- neck is that each embedding needs to be replicated to each eredtobethelabel.Youtube[10]listscrawledvideoidsand vertexthathasthenecessarylocalinformationtoexpandthe relatedvideosforeachvideopostedfromFebruary2007to embeddingfurther.Inaddition,high-degreeverticesneedto July 2008. The label is a combination of the video’s rating expandadisproportionatefractionofembeddings.CiteSeer andlength.SN,isasnapshotofarealworldSocialNetwork, isascale-freegraphthusaffectingthescalabilityofTLV. which is not publicly available. Instagram is a snapshot of Overall TLV performance is two orders of magnitude thepopularphotoandvideosharingsocialnetworkcollected slowercomparedtoArabesque.TLVrequiresmorethan300 by [28]. We consider all the graphs to be undirected. Note secondstorunFSMontheCiteSeergraph,whileArabesque thatevenifsomeofthesegraphsarenotverylarge,theex- requires only 7 seconds for the same setup. The total mes- plosion of the intermediate computation and state required sages exchanged for this tiny graph is 120 million, versus forgraphexploration(seeFigure1)makesthemverychal- 137 thousand messages required by Arabesque. Due to the lengingforcentralizedalgorithms. hot-spotsinherenttothegraphstructure,orthelabeldistri- ApplicationsandParameters:Weconsiderthethreeappli- bution, and the extended duplication of state that the TLV cationsdiscussedinSections2,whichwelabelFSM,Motifs paradigm requires, we conclude that TLV is not suited for andCliques.Bydefault,allMotifsexecutionsarerunwitha solvingtheseproblems. maximumembeddingsizeof4,denotedasMS=4,whereas TheCaseofTLP:TheTLPimplementationisbasedon CliquesarerunwithamaximumembeddingsizeofMS=5. GRAMI [14], which represents the state of the art for cen- For FSM, we explicitly state the support, denoted S, used tralizedFSM.GRAMIkeepsstateonaper-patternbasis,so ineachexperimentasthisparameterisverysensitivetothe fewrelativelystraightforwardchangestothecode-basewere propertiesoftheinputgraph. sufficienttoderiveaTLPimplementationwherepatternsare partitionedacrossasetofdistributedworkers. 6.2 AlternativeParadigms:TLVandTLP GRAMIusesanumberofoptimizationsthatarespecific We start by motivating the necessity for a new framework toFSM.Inparticular,itavoidsmaterializingallembeddings fordistributedgraphmining.Weevaluatethetwoalternative relatedtoapattern,acommonapproachforTLPalgorithms. computational paradigms that we discussed in Section 3.2. Whenever a new pattern is generated, its instances are re- Arabesque (i.e., TLE) will be evaluated in the next subsec- calculatedonthefly,stoppingassoonasasufficientnumber tion.Weconsidertheproblemoffrequentsubgraphmining of embeddings to pass the frequency threshold is found. (FSM) as a use case. Note that there are currently no dis- GRAMI thus solves a simpler problem than the TLV and
Description: