Table Of ContentArabesque: A System for Distributed Graph Mining
CarlosH.C.Teixeira∗,AlexandreJ.Fonseca,MarcoSerafini,
GeorgosSiganos,MohammedJ.Zaki,AshrafAboulnaga
QatarComputingResearchInstitute-HBKU,Qatar
Abstract 1. Introduction
Distributed data processing platforms such as MapReduce Graphdataisubiquitousinmanyfields,fromtheWebtoad-
and Pregel have substantially simplified the design and de- vertising and biology, and the analysis of graphs is becom-
ploymentofcertainclassesofdistributedgraphanalyticsal- ing increasingly important. The development of algorithms
gorithms.However,theseplatformsdonotrepresentagood forgraphanalyticshasspawnedalargeamountofresearch,
match for distributed graph mining problems, as for exam- especiallyinrecentyears.However,graphanalyticshastra-
ple finding frequent subgraphs in a graph. Given an input ditionallybeenachallengingproblemtackledbyexpertre-
graph,theseproblemsrequireexploringaverylargenumber searchers,whocaneitherdesignnewspecializedalgorithms
ofsubgraphsandfindingpatternsthatmatchsome“interest- for the problem at hand, or pick an appropriate and sound
ingness” criteria desired by the user. These algorithms are solutionfromaveryvastliterature.Whentheinputgraphor
very important for areas such as social networks, semantic the intermediate state or computation complexity becomes
web,andbioinformatics. verylarge,scalabilityisanadditionalchallenge.
In this paper, we present Arabesque, the first distributed The development of graph processing systems such as
data processing platform for implementing graph mining Pregel[25]haschangedthisscenarioandmadeitsimplerto
algorithms. Arabesque automates the process of exploring design scalable graph analytics algorithms. Pregel offers a
a very large number of subgraphs. It defines a high-level simple“thinklikeavertex”(TLV)programmingparadigm,
filter-processcomputationalmodelthatsimplifiesthedevel- where each vertex of the input graph is a processing ele-
opmentofscalablegraphminingalgorithms:Arabesqueex- mentholdinglocalstateandcommunicatingwithitsneigh-
ploressubgraphsandpassesthemtotheapplication,which borsinthegraph.TLVisaperfectmatchforproblemsthat
must simply compute outputs and decide whether the sub- can be represented through linear algebra, where the graph
graphshouldbefurtherextended.WeuseArabesque’sAPI is modeled as an adjacency matrix (or some other variant
toproducedistributedsolutionstothreefundamentalgraph liketheLaplacianmatrix)andthecurrentstateofeachver-
mining problems: frequent subgraph mining, counting mo- texisrepresentedasavector.Wecallthisclassofmethods
tifs, and finding cliques. Our implementations require a graph computation problems. A good example is comput-
handfuloflinesofcode,scaletotrillionsofsubgraphs,and ing PageRank [6], which is based on iterative sparse ma-
represent in some cases the first available distributed solu- trixandvectormultiplicationoperations.TLVcoversseveral
tions. otheralgorithmsthatrequireasimilarcomputationalarchi-
tecture, for example, shortest path algorithms, and over the
years many optimizations of this paradigm have been pro-
∗CurrentlyaPhDstudentattheFederalUniversityofMinasGerais,Brazil. posed[17,26,36,42].
ThisworkwasdonewhiletheauthorwasatQCRI. Despite this progress, there remains an important class
of algorithms that cannot be readily formulated using the
TLV paradigm. These are graph mining algorithms used
to discover relevant patterns that comprise both structure-
basedandlabel-basedpropertiesofthegraph.Graphmining
Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor is widely used for several applications, for example, dis-
classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed
covering 3D motifs in protein structures or chemical com-
forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitation
onthefirstpage.CopyrightsforcomponentsofthisworkownedbyothersthanACM pounds, extracting network motifs or significant subgraphs
mustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orrepublish,
from protein-protein or gene interaction networks, mining
topostonserversortoredistributetolists,requirespriorspecificpermissionand/ora
fee.Requestpermissionsfrompermissions@acm.org. attributed patterns over semantic data (e.g., in Resource
SOSP’15, October4–7,2015,Monterey,CA,USA. Description Framework or RDF format), finding structure-
Copyright(cid:13)c 2015ACM978-1-4503-3834-9/15/10...$15.00.
http://dx.doi.org/10.1145/2815400.2815410 content relationships in social media data, dense subgraph
Motifs(MiCo) Motifs(Youtube) Cliques(MiCo) Input&graph& Pa3ern& Embeddings&
FSM(CiteSeer) Motifs(SN)
2" 1" 3"
hs
p
a
bgr 1012 1" 3" 2" 2"
u
S
g
estin 109 4" 5" 3" 1"
er
nt
ofI 106 Figure2:Graphminingconcepts:aninputgraph,anexam-
ber ple pattern, and the embeddings of the pattern. Colors rep-
Num 103 1 2 3 4 5 6 resent labels. Numbers denote vertex ids. Patterns and em-
SizeoftheSubgraphs beddings are two types of subgraphs. However, a pattern is
atemplate,whereasanembeddingisaninstance.Inthisex-
Figure 1: Exponential growth of the intermediate state in ample,thetwoembeddingsareautomorphic.
graph mining problems (motifs counting, clique finding,
FSM:Frequentsubgraphmining)ondifferentdatasets.
miningforcommunityandlinkspamdetectioninwebdata, cates whether an embedding should be processed, and pro-
among others. Graph mining algorithms typically take a cess,whichexaminesanembeddingandmayproducesome
labeled and immutable graph as input, and mine patterns output.Forexample,inthecaseoffindingcliquesthefilter
that have some algorithm-specific property (e.g., frequency functionprunesembeddingsthatarenotcliques,sincenone
abovesomethreshold)byfindingallinstancesofthesepat- of theirextensions canbe cliques,and the processfunction
ternsintheinputgraph.Somealgorithmsalsocomputeag- outputsallexploredembeddings,whicharecliquesbycon-
gregatedmetricsbasedonthesesubgraphs. struction.Arabesquealsosupportsthepruningoftheexplo-
Designinggraphminingalgorithmsisachallengingand rationspacebasedonuser-definedmetricsaggregatedacross
active area of research. In particular, scaling graph mining multipleembeddings.
algorithmstoevenmoderatelylargegraphsishard.Theset TheArabesqueAPIsimplifiesandthusdemocratizesthe
of possible patterns and their subgraphs in a graph can be designofgraphminingalgorithms,andautomatestheirexe-
exponentialinthesizeoftheoriginalgraph,resultinginan cutioninadistributedsetting.WeusedArabesquetoimple-
explosion of the computation and intermediate state. Fig- ment and evaluate scalable solutions to three fundamental
ure 1 shows the exponential growth of the number of “in- anddiversegraphminingproblems:frequentsubgraphmin-
teresting”subgraphsofdifferentsizesinsomeofthegraph ing, counting motifs, and finding cliques. These problems
miningproblemsanddatasetswewillevaluateinthispaper. aredefinedpreciselyinSection2.Someofthesealgorithms
Evengraphswithfewthousandsofedgescanquicklygener- are the first distributed solutions available in the literature,
atehundredsofmillionsofinterestingsubgraphs.Theneed whichshowsthesimplicityandgeneralityofArabesque.
for enumerating a large number of subgraphs characterizes Arabesque’sembedding-centeredAPIfacilitatesahighly
graph mining problems and distinguishes them from graph scalable implementation. The system scales by spread-
computationproblems.Despitethisstateexplosionproblem, ing embeddings uniformly across workers, thus avoiding
mostgraphminingalgorithmsarecentralizedbecauseofthe hotspots.Bymakingitexplicitthatembeddingsarethefun-
complexityofdistributedsolutions. damental unit of exploration, Arabesque is able to use fast
In this paper, we propose automatic subgraph explo- coordination-freetechniques,basedonthenotionofembed-
rationasagenericbuildingblockforsolvinggraphmining ding canonicality, to avoid redundant work and minimize
problems,andintroduceArabesque,thefirstembeddingex- communicationcosts.Italsoenablesustostoreembeddings
plorationsystemspecificallydesignedfordistributedgraph efficiently using a new data structure called Overapproxi-
mining.Conceptually,wemovefromTLVto“thinklikean mating Directed Acyclic Graph (ODAG), and to devise a
embedding” (TLE), where by embedding we denote a sub- new two-level optimization for pattern-based aggregation,
graph representing a particular instance of a more general whichisacommonoperationingraphminingalgorithms.
templatesubgraphcalledapattern(seeFigure2). Arabesque is implemented as a layer on top of Apache
Arabesque defines a high-level filter-process computa- Giraph [3], a Pregel-inspired graph computation system,
tional model. Given an input graph, the system takes care thus allowing both graph computation and graph mining
of automatically and systematically visiting all the embed- algorithms to run on top of the same infrastructure. The
dingsthatneedtobeexploredbytheuser-definedalgorithm, implementation does not use a TLV approach: it considers
performingthisexplorationinadistributedmanner.Thesys- Giraph just as a regular data parallel system implementing
tempassesalltheembeddingsitexplorestotheapplication, theBulkSynchronousProcessingmodel.
whichconsistsprimarilyoftwofunctions:filter,whichindi- Tosummarize,wemakethefollowingcontributions:
• Weproposeembeddingexploration,or“thinklikeanem- Characterizing Graph Mining Problems: Arabesque tar-
bedding”, as an effective basic building block for graph gets graph mining problems, which involve subgraph enu-
mining. We introduce the filter-process computational meration. Given an immutable input graph G with labeled
model (Section 3), design an API that enables embed- verticesandedges,graphminingproblemsfocusontheenu-
ding exploration to be expressed effectively and suc- merationofallpatternsthatsatisfysomeuser-specified“in-
cinctly,andpresentthreeexamplegraphminingapplica- terestingness”criteria.Agivenpatternpisevaluatedbylist-
tionsthatcanbeelegantlyexpressedusingtheArabesque ing its matches or embeddings in the input dataset G, and
API(Section4). thenfilteringouttheuninterestingones.Graphminingprob-
lems often require subgraph isomorphism checks to deter-
• Weintroducetechniquestomakedistributedembedding
mine the patterns related to sets of embeddings, and also
explorationscalable:coordination-freeworksharing,ef-
graph automorphism checks to eliminate duplicate embed-
ficient storage of embeddings, and an important opti-
dings.
mizationforpattern-basedaggregation(Section5).
We consider graph mining problems where one seeks
• We demonstrate the scalability of Arabesque on various
connected graph patterns whose embeddings are vertex- or
graphs. We show that Arabesque scales to hundreds of
edge-induced.Thereareseveralvariantsoftheseproblems.
coresoveracluster,obtainingordersofmagnitudereduc-
Theinputdatasetmaycompriseacollectionofmanygraphs,
tionofrunningtimeoverthecentralizedbaselines(Sec-
or a single large graph. The embeddings of a pattern com-
tion6),andcananalyzetrillionsofembeddingsonlarge
prise the set of (exact) isomorphisms from the pattern p to
graphs.
the input graph G. However, in inexact matching, one can
The Arabesque system, together with all applications seekinexactorapproximateisomorphisms(basedonnotions
usedforthispaper,ispubliclyavailableattheproject’sweb- ofedit-distances,labelcosts,etc.).Therearemanyvariants
site:www.arabesque.io. intermsoftheinterestingnesscriteria,suchasfrequentsub-
graphmining,densesubgraphmining,orextractingtheen-
tire frequency distribution of subgraphs up to some given
2. GraphMiningProblems
numberofvertices.Alsorelatedtographminingistheprob-
In this section, we introduce the graph-theoretic terms we lemofgraphmatching,whereaquerypatternqisfixed,and
will use throughout the text and characterize the space of onehastoretrieveallitsmatchesintheinputgraphG.Solu-
graphminingproblemsaddressedbyArabesque. tions to graph matching typically use indexing approaches,
Terminology: Graph mining problems take an input graph whichpre-computeasetofpathsorfrequentsubgraphs,and
Gwhereverticesandedgesarelabeled.Verticesandedges use them to facilitate fast matching and retrieval. As such
have unique ids, and their labels are arbitrary, domain- graphminingencompassesthematchingproblem,sincewe
specific attributes that can be null. An embedding is a sub- havetobothenumeratethepatternsandfindtheirmatches.
graph of G, i.e., a graph containing a subset of the ver- Further,anysolutiontothesingleinputgraphsettingiseas-
tices and edges of G. A vertex-induced embedding is de- ilyadaptedtothemultiplegraphdatasetcase.Therefore,in
fined starting from a set of vertices by including all edges thispaperwefocusongraphminingtasksonasinglelarge
of G whose endpoints are in the set. An edge-induced em- input graph. See [1] for a state-of-the-art review of graph
bedding is defined starting from a set of edges by includ- miningandsearch.
ing all the endpoints of the edges in the set. For exam- UseCases:Throughoutthispaperweconsiderthreeclasses
ple, the two embeddings of Figure 2 are induced by the ofproblems:frequentsubgraphmining,countingmotifs,and
edges{(1,2),(2,3)}.Inordertobeinducedbythevertices findingcliques.Wechosetheseproblemsbecausetheyrep-
{1,2,3}theembeddingsshouldalsoincludetheedge(1,3). resentdifferentclassesofgraphminingproblems.Thefirst
Weconsideronlyconnectedembeddingssuchthatthereisa is an example of explore-and-prune problems, where only
pathconnectingeachpairofvertices. embeddings corresponding to a frequent pattern need to be
Apatternisanarbitrarygraph.Wesaythatanembedding furtherexplored.Countingmotifsrequiresexhaustivegraph
einGisisomorphictoapatternpifandonlyifthereexists explorationuptosomemaximumsize.Findingcliquesisan
a one-to-one mapping between the vertices of e and p, and exampleofdensesubgraphmining,andallowsonetoprune
betweentheedgesofeandp,suchthat:(i)eachvertex(resp. the embeddings using local information. We discuss these
edge)inehasonematchingvertex(resp.edge)inpwiththe problemsbelowinmoredetail.
samelabels,andviceversa;(ii)eachmatchingedgeconnects Consider the task of frequent subgraph mining (FSM),
matchingvertices.Inthiscase,wesayinformallythatehas i.e.,findingthosesubgraphs(orpatterns,inourterminology)
patternp.Equivalently,wesaythatthereexistsasubgraph thatoccuraminimumnumberoftimesintheinputdataset.
isomorphism from p to G, and we call e an instance of The occurrences are counted using some anti-monotonic
patternpinG.Twoembeddingsareautomorphicifandonly function on the set of its embeddings. The anti-monotonic
if they contain the same edges and vertices, i.e., they are propertystatesthatthefrequencyofasupergraphshouldnot
equalandthusalsoisomorphic(seeforexampleFigure2).
exceedthefrequencyofasubgraph,whichallowsonetostop
Algorithm1:ExplorationstepinArabesque.
extending patterns/embeddings as soon as they are deemed
input :SetIofinitialembeddingsforthisstep
tobeinfrequent.Thereareseveralanti-monotonicmetricsto
output:SetF ofextendedembeddingsforthenextstep(initially
measure the frequency of a pattern. While their differences empty)
are not very important for this discussion, they all require output:SetOofnewoutputs(initiallyempty)
aggregatingmetricsfromallembeddingsthatcorrespondto foreache∈Isuchthatα(e)do
thesamepattern.Weusetheminimumimage-basedsupport addβ(e)toO;
C←setofallextensionsofeobtainedbyaddingoneincident
metric [7], which defines the frequency of a pattern as the
edge/neighboringvertex;
minimum number of distinct mappings for any vertex in foreache(cid:48)∈Cdo
the pattern, over all embeddings of the pattern. Formally, ifφ(e(cid:48))andthereexistsnoe(cid:48)(cid:48)∈F automorphictoe(cid:48)then
let G be the input graph, p be a pattern, and E its set of addπ(e(cid:48))toO;
embeddings. The pattern p is frequent if sup(p,E) ≥ θ, adde(cid:48)toF;
runaggregationfunctions;
wheresup()istheminimumimage-basedsupportfunction,
andθisauser-specifiedthreshold.TheFSMtaskistomine
allfrequentsubgraphpatternsfromasinglelargegraphG.
A motif p is defined as a connected pattern of vertex- canoptionallyalsodefinetwoadditionalfunctionsthatwill
induced embeddings that exists in an input graph G. Fur- bedescribedshortly.
ther, a set of motifs is required to be non-isomorphic, i.e., Arabesque starts an exploration by generating the set of
there should obviously be no duplicate patterns. In motif candidate embeddings C, which are obtained by expand-
mining[30],theinputgraphisassumedtobeunlabeled,and ing the embeddings in I. The system computes candidates
there is no minimum frequency threshold; rather the goal by adding one incident edge or vertex to e, depending on
is to extract all motifs that occur in G along with their fre- whether it runs in edge-based or vertex-based exploration
quencydistribution.Sincethistaskisinherentlyexponential, mode.Inedge-basedexploration,anembeddingisanedge-
the motifs are typically restricted to patterns of order (i.e., induced subgraph; in vertex-based exploration, it is vertex-
number of vertices) at most k. For example, for k = 3 we induced(seeSection2).Inthefirstexplorationstep,I con-
have two possible motifs: a chain where ends are not con- tains only a special “undefined” embedding, whose expan-
nected and a (complete) triangle. Whereas the original def- sionC consistsofalledgesorverticesofG,dependingon
initionofmotifstargetsunlabeledandinducedpatterns,we thetypeofexploration.Theapplicationcandecidebetween
caneasilygeneralizethedefinitiontolabeledpatterns. edge-basedorvertex-basedexplorationduringinitialization.
In clique mining the task is to enumerate all complete After computing the candidates, the filter function φ ex-
subgraphs in the input graph G. A complete subgraph, or amines each candidate e(cid:48) and returns a Boolean value indi-
clique,pwithk verticesisdefinedasaconnectedsubgraph cating whether e(cid:48) needs to be processed. If φ returns true,
where each vertex has degree k − 1, i.e., each vertex is theprocessfunctionπ takese(cid:48) asinputandoutputsasetof
connected to all other vertices (we assume there are no user-definedvalues.Bydefault,e(cid:48)isthenaddedtothesetF.
self-loops). The clique problem can also be generalized to After an exploration step is terminated, I is set to be equal
maximalcliques,i.e.,thosenotcontainedinanyotherclique, to F before the start of the next step. The computation ter-
and frequent cliques, if we impose a minimum frequency minateswhenthesetF isemptyattheendofastep.
thresholdinadditiontothecompletenessconstraint. ArabesquerunstheouterloopofAlgorithm1inparallel
by partitioning the embeddings in I over multiple servers
each running multiple worker threads. This distribution is
3. TheFilter-ProcessModel transparenttoapplications.Eachexecutionstepisexecuted
asasuperstepintheBulkSynchronousParallelmodel[38].
We now discuss the filter-process model, which formalizes
Optional aggregation functions: The filter-process model
how Arabesque performs embedding exploration based on
describedsofarconsiderssingleembeddingsinisolation.A
application-specificfilterandprocessfunctions.
common task in graph mining systems is to aggregate val-
uesacrossmultipleembeddings,forexamplegroupingem-
3.1 ComputationalModel
beddings by pattern. To this end, Arabesque offers specific
Arabesquecomputationsproceedthroughasequenceofex- functions to execute user-defined aggregation for multiple
ploration steps. At a conceptual level, the system performs embeddings.Aggregationcangroupembeddingsbyanarbi-
theoperationsillustratedinAlgorithm1ineachexploration trary integer value or by pattern, and is executed on candi-
step.EachstepisassociatedwithaninitialsetI containing dateembeddingsattheendoftheexplorationstepinwhich
embeddingsoftheinputgraphG.Arabesqueautomatesthe theyaregenerated.
process of exploring the graph and expanding embeddings. The optional aggregation filter function α and aggrega-
Applicationsarespecifiedviatwouser-definedfunctions:a tionprocessfunctionβ canfilterandprocessanembedding
filter function φ and a process function π. The application e in the exploration step following its generation. At that
time,aggregatedinformationcollected fromalltheembed- approachalsocreatesasignificantnumberofduplicatemes-
dings generated in the same exploration step as e becomes sages because each new embedding must be sent to all its
available.Theαfunctioncantakefilteringdecisionsbefore bordervertices.Theselimitationssignificantlyaffectperfor-
embeddingexpansionbasedonaggregatevalues.Forexam- mance.Inourexperiments,weobservedthatTLV-basedem-
ple, in the frequent subgraph mining problem we can use beddingexplorationalgorithmscanbetwoordersofmagni-
aggregatorstocounttheembeddingsassociatedwithagiven tudeslowercomparedtoTLE.
pattern,andthenfilteroutembeddingsofinfrequentpatterns Thecurrentstate-of-the-artcentralizedmethodsforsolv-
withα.Similarly,theβfunctioncanbeusedtooutputaggre- ing graph mining tasks typically adopt a different, pattern-
gate information about an embedding, for example its fre- centricor“ThinkLikeaPattern”(TLP)approach.Thekey
quency. By default, α returns true and β does not add any differencebetweenTLPandtheembedding-centricviewof
outputtoO. thefilter-processmodelisthatitisnotnecessarytoexplicitly
Guarantees and requirements: Embedding exploration materializeallembeddings:statecanbekeptatthegranular-
processes every embedding that is not filtered out. More ityofpatterns(whicharemuchfewerthanembeddings)and
formally,itguaranteesthefollowingcompletenessproperty: embeddings may be re-generated on demand. The process
for each embedding e such that φ(e) = α(e) = true, em- startswiththesetofallpossible(labeled)singleverticesor
beddingexplorationmustaddπ(e)andβ(e)toO. edges as candidate patterns. It then processes embeddings
Completeness assumes some properties of the user- ofeachpattern,oftenbyrecomputingthemonthefly.After
defined functions that emerge naturally in graph mining aggregationandpatternfiltering,thevalidsetofpatternsare
algorithms. The first property is that the application con- extended by adding one more vertex or edge. Subgraph or
siderstwoembeddingseande(cid:48) tobeequivalentiftheyare pattern mining proceeds iteratively via recursive extension,
automorphic (see Section 2). Formally, Arabesque requires processing and filtering steps, and continues until no new
what we call automorphism invariance: if e and e(cid:48) are au- patterns are found. Parallelizing the computation via parti-
tomorphic, then each user-defined function must return the tioning it by pattern can easily result in load imbalance, as
same result for e and e(cid:48). Arabesque leverages this natural our experiments show. This is because there are often only
property of graph mining algorithms to prune automorphic few patterns that are highly popular – indeed, finding these
embeddingsandsubstantiallyreducetheexplorationspace. fewpatternsistheverygoalofgraphmining.Thesepopular
The second property Arabesque requires is called anti- patterns result in hotspots among workers and thus in poor
monotonicity and is formally defined as follows: if φ(e) = loadbalancing.
false then it holds that φ(e(cid:48)) = false for each embedding
e(cid:48) that extends e. The same property holds for the optional 4. Arabesque:API,Programming,and
filter function α. This is one of the essential properties for Implementation
anyeffectivegraphminingmethodandguaranteesthatonce
WenowdescribetheArabesqueJavaAPIandshowhowwe
afilterfunctiontellstheframeworktopruneanembedding,
useittoimplementourexampleapplications.
allitsextensionscanbeignored.
3.2 AlternativeParadigms:ThinkLikeaVertexand 4.1 ArabesqueAPI
ThinkLikeaPattern
The API of Arabesque is illustrated in Figure 3. The user
Wecannowcontrastourembedding-centricor“ThinkLike mustimplementtwofunctions:filter,whichcorresponds
anEmbedding”(TLE)modelwithotherapproachestobuild to φ, and process, which corresponds to π. The pro-
graph mining algorithms. We empirically compare with cess function in the API is responsible for adding results
theseapproachesinSection6. to the output by invoking the output function provided
The standard paradigm of systems like Pregel is vertex- by Arabesque, which prints the results to the underlying
centric or “Think Like a Vertex” (TLV), because computa- file system (e.g. HDFS). The optional functions α and β
tion and state are at the level of a vertex in the graph. TLV correspond, respectively, to aggregationFilter and
systemsaredesignedtoscaleforlargeinputgraphs:thein- aggregationProcess.Theseapplication-specificfunc-
formation about the input graph is distributed and vertices tionsareinvokedbytheArabesqueframeworkasillustrated
onlyhaveinformationabouttheirlocalneighborhood.Inor- in Algorithm 1. All these functions have access to a local
dertoperformembeddingexploration,eachvertexcankeep read-onlycopyofthegraph.
asetoflocalembeddings,initiallycontainingonlythever- For performance and scalability reasons, a MapReduce-
texitself.Then,toexpandalocalembeddinge,verticespush like model is used to compute aggregated values. Appli-
ittothe“border”verticesofethatknowhowtoexpandeby cations can send data to reducers using the map function,
adding its neighbors. Each expansion results in a new em- whichispartoftheArabesqueframeworkandaddsavalue
bedding, which is sent again to border vertices and so on. toanaggregationgroupdefinedbyacertainkey.Manyap-
Withthisapproach,highlyconnectedverticesmusttakeon plicationsusethepatternofanembeddingastheaggregation
a disproportionate fraction of embeddings to expand. The key. Arabesque detects when the key is a pattern and uses
Mandatoryapplication-definedfunctions:
boolean filter(Embedding e) { return true; }
boolean filter (Embedding e) void process(Embedding e){
void process (Embedding e) map (pattern(e), domains(e)); }
Pair<Pattern,Domain> reduce
Optionalapplication-definedfunctions: (Pattern p, List<Domain> domains){
Domain merged_domain = merge(domains);
boolean aggregationFilter (Embedding e) return Pair (p, merged_domain); }
void aggregationProcess (Embedding e) boolean aggregationFilter(Embedding e){
Pair<K,V> reduce (K key, List<V> values) Domain m_domain = readAggregate(pattern(e));
Pair<K,V> reduceOutput (K key, List<V> value) return (support(m_domain)>=THRESHOLD); }
boolean terminationFilter (Embedding e) void aggregationProcess(Embedding e) {
output(e); }
Arabesquefunctionsinvokedbyapplications:
(a)Frequentsubgraphmining(edge-basedexploration)
void output (Object value)
void map (K key, V value) boolean filter(Embedding e){
V readAggregate (K key) return (numVertices(e) <= MAX_SIZE);}
void mapOutput (K key, V value) void process(Embedding e){
mapOutput (pattern(e),1); }
Pair<Pattern,Integer> reduceOutput
(Pattern p, List<Integer> counts){
Figure3:Basicuser-definedfunctionsinArabesque.
return Pair (p, sum(counts)); }
(b)Countingmotifs(vertex-basedexploration)
specificoptimizationstomakethisaggregationefficient,as boolean filter (Embedding e){
return isClique(e); }
explained in Section 5.4. The application specifies the ag- void process (Embedding e){ output(e); }
gregationlogicthroughthereducefunction.Thisfunction (c)Findingcliques(vertex-basedexploration)
receives all values mapped to a specific key in one execu-
tionstepandaggregatesthem.Anymethodcanreadtheval- Figure4:ExamplesofArabesqueapplications.
ues aggregated over the previous execution step using the
readAggregatemethod.
Output aggregation is a special case where aggregated
valuesaresentdirectlytotheunderlyingdistributedfilesys- to the reducer responsible for the pattern p of e. For ex-
tem at the end of each exploration step and are not made ample, in Figure 2 the domain of the blue vertex on the
availableforlaterreads.Itcanbeusedthroughthemethods top of the pattern is vertex 1 in the first embedding and 3
mapOutput and reduceOutput. Their logic is similar in the second. The function reduce merges all domains:
totheaggregationfunctionsdescribedpreviously,butaggre- the merged domain of a vertex in p is the union of all its
gationisonlyexecutedwhenthewholecomputationends. aggregated mappings. Since expansion is done by adding
TheterminationFilterfunctioncanhaltthecom- one edge in each exploration step, we are sure that all
putationofanembeddingwhensomepre-definedcondition embeddings for p are visited and processed in the same
isreached.Arabesqueappliesthisfilteronanembeddinge exploration step. The aggregationFilter function
afterexecutingπ(e)andbeforeaddingetoF.Thisisjustan reads the merged domains of p using readAggregate
optimizationtoavoidunnecessaryexplorationsteps.Forex- and computes the support, which is the minimum size of
ample,ifweareinterestedinembeddingsofmaximumsize the domain of any vertex in p. It then filters out embed-
n, the termination filter can halt the computation after pro- dingsforpatternsthatdonothaveenoughsupport.Finally,
cessing embeddings of size n at step n. Without this filter, the aggregationProcess function outputs all the em-
thesystemwouldhavetoproceedtostepn+1andgenerate beddings having a frequent pattern (those that survive the
allembeddingsofsizen+1justtofilterallofthemout. aggregation-filter). The implementation of this application
consistsof280linesofJavacode.Ofthese,212arerelated
4.2 ProgrammingwithArabesque
tohandlingdomainsandcomputingsupport,whicharebasic
We used the Arabesque API to implement algorithms that tasks required in any algorithm for frequent subgraph min-
solve the three problems discussed in Section 2. The pseu- ing to characterize whether an embedding is relevant. By
docode of the implementations is in Figure 4. Each of the comparison, the centralized baseline we use for evaluation,
applicationsconsistsofveryfewlinesofcode,astarkcon- GRAMI[14],consistsof5,443linesofJavacode.
trastcomparedtothecomplexityofthespecializedstateof Formotiffrequencycomputation,weperformanexhaus-
theartalgorithmssolvingthesameproblems[8,30,43]. tive exploration of all embeddings until we reach a given
In frequent subgraph mining we use aggregation to cal- maximum size and count all embeddings having the same
culatethesupportfunctiondescribedinSection2.Thesup- pattern. Since the input graph is not labeled in this case, a
port metric is based on the notion of domain, which is de- pattern corresponds to a motif. The function mapOutput
fined as the set of distinct mappings between a vertex in p sends a value to the output reducer responsible for the pat-
and the matching vertices in any automorphism of e. The ternofe.ThereduceOutputfunctionoutputsthesumof
process function invokes map to send the domains of e thecountsforeachmotifp.Ourimplementationconsistsof
18linesofcode,veryfewcomparedtothe3,145linesofC tion 3) we can avoid redundant work by discarding all but
codeofourcentralizedbaseline(Gtries[31]). oneoftheidentical,automorphicembeddings.
Infindingcliqueswedoalocalpruning:ifanembedding Arabesquesolvesthisproblemusinganovelcoordination-
isnotaclique,noneofitsextensionscanbeaclique.Since freeschemebasedonthenotionofembeddingcanonicality.
we visit only cliques, the evaluation function outputs the Informally, we need to select exactly one of the redundant
embeddings it receives as input. The isClique function automorphicembeddingsandelectitas“canonical”.Inour
checksthatthenewlyaddedvertexisconnectedwithallpre- example, before w and w execute the filter and process
1 2
viousverticesintheembedding.Thisapplicationconsistsof functions on a new embedding e, Arabesque executes an
19linesofcode,whileourcentralizedbaseline(Mace[37]) embedding canonicality check to verify whether e can be
consistsof4,621linesofCcode. pruned (see Algorithm 1). This check runs on a single em-
In all these examples it is easy to verify that the evalua- beddingwithoutrequiringcoordination,aswenowdiscuss.
tionandfilterfunctionssatisfytheanti-monotonicandauto- A sound canonicality check must have the property of
morphism invariance properties required by the Arabesque uniqueness:giventhesetS ofallembeddingsautomorphic
e
computationalmodel. to an embedding e, there is exactly one canonical embed-
dinge inS .Wecalle thecanonicalautomorphismofe.
c e c
4.3 Arabesqueimplementation InArabesquewealsoneedanadditionalpropertyforcanon-
Arabesquecanexecuteontopofanysystemsupportingthe icality checks called extendibility, which we define as fol-
BSP model. We have implemented Arabesque as a layer lows.Letebeacandidateembeddingobtainedbyextending
on top of Giraph. The implementation does not follow the a parent embedding e(cid:48) by one vertex or edge. The parent
TLV paradigm: we use Giraph vertices simply as workers embedding e(cid:48) is canonical because it has not been pruned.
that bear no relationship to any specific vertex in the input Let ec be the canonical automorphism of e. Extendibility
graph. Each worker has access to a copy of the whole in- requires that ec is one of the extensions of e(cid:48). This allows
put graph whose vertices and edges consist of incremental Arabesquetoprunetheautomorphismsofaparente(cid:48) while
numeric ids. Communication among workers occurs in an stillexploringthecanonicalautomorphismofeachchilde.
arbitrary point-to-point fashion and communication edges
are not related to edges in the input graph. Giraph compu- Algorithm 2: Arabesque’s incremental embedding
tations proceed through synchronous supersteps according canonicalitycheck(vertex-basedexploration)
totheBSPmodel:ateachsuperstep,workersfirstreceiveall input :InputgraphG
messagessentintheprevioussuperstep,thenprocessthem, input :Canonicalparentembedding(cid:104)v1,...,vn(cid:105)
and finally send new messages to be delivered at the next input :Extensionvertexv
superstep. The operations of the workers are described in output:trueiff(cid:104)v1,...,vn,v(cid:105)iscanonical
Section 5. Aggregation functions, if specified, are executed ifv1>vthen
returnfalse;
usingstandardGiraphaggregators.Optionaloutputworkers
foundNeighbour←false;
are used for applications that aggregate output values. The fori=1...ndo
input values for output aggregation persist over supersteps. if foundNeighbour=falseandvineighborofvinGthen
Once the computation is over, output workers aggregate all foundNeighbour←true;
theirinputvaluesandoutputthem. elseiffoundNeighbour=trueandvi>vthen
returnfalse;
returntrue;
5. GraphExplorationTechniques
We now describe in more detail how Arabesque performs Arabesquechecksembeddingcanonicalityforeachcan-
graphexploration.Wefirstdiscussthecoordination-freeex- didate before applying the filter function. There can be a
plorationstrategyusedbyworkerstoavoidredundantwork. huge number of candidates, so it is essential that the check
Wethenintroducethetechniquesweusetostoreandparti- isefficient.Wedevelopedalinear-timealgorithm,whichis
tionembeddingsefficiently. basedonthefollowingdefinitionofcanonicalembedding.
Considerthecaseofvertex-basedexploration(theedge-
5.1 Coordination-FreeExplorationStrategy
basedcaseisanalogous).Anembeddingeiscanonicalifand
When running exploration in a distributed setting, multiple onlyifitsverticeswerevisitedinthefollowingorder:start
workers can reach two “identical”, i.e., automorphic (see by visiting the vertex with the smallest id, and then recur-
Section2),embeddingsthroughdifferentexplorationpaths. sivelyaddtheneighborinewithsmallestidthathasnotbeen
Consider for example the graph of Figure 2. Two different visitedyet.Forbetterperformance,Arabesqueperformsthis
workers w and w may reach the two embeddings in Fig- canonicalitycheckinanincrementalfashion,asillustratedin
1 2
ure2,onestartingfromedge(1,2)byadding(2,3)andthe Algorithm2.Whenaworkerprocessesanembeddinge∈I,
other starting from edge (3,2) by adding (2,1). Since all itcanalreadyassumethateiscanonicalbecauseArabesque
user-definedfunctionsareautomorphism-invariant(seeSec- prunes non-canonical embeddings before passing them on
tothenextexplorationstep.Arabesque,characterizesanem- edge-basedcaseisanalogous.TheODAGforasetofcanon-
beddingasthelistofitsverticessortedbytheorderinwhich icalembeddingsconsistsofasmanyarraysasthenumberof
they have been visited – the embedding is vertex-induced verticesofalltheembeddings.Theitharraycontainstheids
so the list uniquely identifies it. When Arabesque checks of all vertices in the ith position in any embedding. Vertex
thecanonicalityofanewcandidateembeddingobtainedby v in the ith array is connected to vertex u in the (i+1)th
addingavertexvtoaparentcanonicalembeddinge,theal- arrayifthereisatleastonecanonicalembeddingwithvand
gorithmscanseinsearchforthefirstneighborv(cid:48) ofv,and uinpositioniandi+1respectivelyintheoriginalset.An
thenvertifiesthatthereisnovertexineafterv(cid:48) withhigher exampleofODAGisshowninFigures5and6.
idthanv.
The extended version of this paper [35] includes proofs Canonical"
embeddings"
showing that Algorithm 2 satisfies the uniqueness and ex- 3" 1" 4" 2"
tendibility properties of canonicality checking. We also 1" 4" 3"
showthatthesetwoproperties,togetherwithanti-monotoni- 2" 4" 1" 4" 5"
2" 3" 4"
city and automorphism invariance, are sufficient to ensure
1" 5" 2" 4" 5"
that Arabesque satisfies the completeness property of em-
3" 4" 5"
beddingexploration(seeSection3).
Figure5:ExamplegraphanditssetofSofcanonicalvertex-
5.2 StoringEmbeddingsCompactly
inducedembeddingsofsize3.
Graph mining algorithms can easily generate trillions of Array"3"
Array"1"
embeddings as intermediate state. Centralized algorithms Array"2"
2"
typically do not explicitly store the embeddings they have 1"
3" 3"
explored. Instead, they use application-specific knowledge 2"
4" 4"
torebuildembeddingsontheflyfromamuchsmallerstate. 3"
5"
The only application-level logic available to Arabesque
consistsofthefilterandprocessfunctions,whichareopaque Figure6:ODAGfortheexampleofFigure5.Italsoencodes
to the system. After each exploration step, Arabesque re- spuriousembeddingssuchas(cid:104)3,4,2(cid:105).
ceivesasetofembeddingsfilteredbytheapplicationandit
needstokeeptheminordertoexpandtheminthenextstep. Storing an ODAG is more compact than storing the full
Storing each embedding separately can be very expensive. set of embeddings. In general, in a graph with N distinct
AswehaveseenalsoinAlgorithm2,Arabesquerepresents verticeswecanhaveuptoO(Nk)differentembeddingsof
embeddingsassequencesofnumbers,representingvertexor sizek.WithODAGsweonlyhavetokeepedgesbetweenk
edge ids depending on whether exploration is vertex-based arrays, where k is the size of the embeddings, so the upper
oredge-based.Therefore,weneedtofindtechniquestostore boundonthesizeisO(k·N2) = O(N2)ifk isaconstant
setsofsequencesofintegersefficiently. muchsmallerthanN.
Existingcompactdatastructuressuchasprefix-treesare Itispossibletoobtainallembeddingsoftheoriginalset
too expensive because we would still have to store a new bysimplyfollowingtheedgesintheODAG.However,this
leafforeachembeddinginthebestcase,sinceallcanonical will also generate spurious embeddings that are not in the
embeddings we want to represent are different. In contrast, originalset.ConsiderforexamplethegraphofFigure5and
Arabesque uses a novel technique called Overapproximat- its set of canonical embeddings S. Expanding the ODAG
ing Directed Acyclic Graphs, or ODAGs. At a high level, of Figure 6 generates also (cid:104)3,4,2(cid:105) (cid:54)∈ S. Filtering out such
ODAGs are similar to prefix trees where all nodes at the spuriousembeddingsrequiresapplication-specificlogic.
same depth corresponding to the same vertex in the graph ODAGsinArabesque:InArabesque,workersproducenew
are collapsed into a single node. This more compact repre- embeddings in each exploration step and add them to the
sentationisanoverapproximation(superset)ofthesetofse- setF (seeAlgorithm1).WorkersuseODAGstostoreF at
quenceswewanttostore.Whenextractingembeddingsfrom theendofanexplorationstep,andextractembeddingsfrom
ODAGs we must do extra work to discard spurious paths. ODAGsatthebeginningofthenextstep.
ODAGsthustradespacecomplexityforcomputationalcom- Filteringoutspuriousembeddings,asdiscussed,requires
plexity.ThismakesODAGssimilartotherepresentativesets application-specific logic. Applications written using the
introduced in [2], which have a higher compression capa- filter-process model give Arabesque enough information to
bility but require more work to filter out spurious embed- perform filtering. In fact, workers can just apply the same
dings.Inaddition,itishardertoachieveeffectiveloadbal- filteringcriteriaasAlgorithm1:thecanonicalitycheckand
ancing with representative sets. We now discuss the details theuser-definedfilterandaggregatefilterfunctions.Ifanyof
ofODAGsandshowhowtheyareusedinArabesque. thesechecksisnegativeforanembedding,weknowthatthe
The ODAG Data Structure: For simplicity, we focus the embedding itself or, thanks to the anti-monotonicity prop-
discussion on ODAGs for vertex-based exploration; the erty, one of its parents, was filtered out. In addition, since
our canonicality check is incremental, we do not need to Theprocesscontinuesrecursivelyonsubsequentarraysuntil
considerembeddingsthatextendanon-canonicalsequence, abalancedloadisreached.
sowecanprunemultipleembeddingsatonce.
5.4 Two-LevelPatternAggregationforFastPattern
After every exploration step, Arabesque merges and
CanonicalityChecking
broadcasts the new embeddings, and thus the ODAGs, to
everyworker.Inordertoreducethenumberofspuriousem- Arabesque uses a special optimization to speed up per-
beddings, workers group their embeddings in one ODAG patternaggregation,asdiscussedinSection4.Theoptimiza-
perpattern.Foreachpattern,workersmergetheirlocalper- tionwasmotivatedbythehighpotentialcostofthistypeof
patternODAGsintoasingleper-patternglobalODAG.Each aggregation,aswenowdiscuss.
per-patternglobalODAGisreplicatedateachworkerbefore Consider again the example of Figure 2 and assume
thebeginningofthenextexplorationstep. that we want to count the instances of all single-edge pat-
MergingODAGsrequiresmergingedgesobtainedbydif- terns. The three single-edge embeddings (1,2), (2,3), and
ferentworkers.Forexample,considerthefirsttwoarraysin (3,4) should be aggregated together since they all have a
Figure 6. One worker might have explored the edge (cid:104)2,3(cid:105) blue and a yellow endpoint. Therefore, their two patterns
and another worker (cid:104)2,4(cid:105). Edges of ODAG arrays are in- (blue,yellow) and (yellow,blue) should be considered
dexedbytheirinitialvertex,sothetwoarraysfromthetwo equivalentbecausetheyareisomorphic(seeSection2).The
workers will have two different entries for the element 2 aggregation reducer for these two patterns is associated to
of the first array that need to be merged. Edge merging is asinglecanonicalpatternthatisisomorphictoboth.Map-
very expensive because of the high number of edges in an ping a pattern to its canonical pattern thus entails solving
ODAG.Therefore,Arabesqueusesamap-reducesteptoex- the graph isomorphism problem, for which no polynomial-
ecuteedgemerging.Eachentryofeacharrayproducedbya timesolutionisknown[16].Thismakespatterncanonicality
workerismappedtotheworkerassociatedtoitsindexvalue. much harder than embedding canonicality, which is related
Thisworkerisresponsibleformergingalledgesforthaten- tothesimplergraphautomorphismproblem.
try produced by all workers into a single entry. The entry Identifyingacanonicalpatternforeachsinglecandidate
is then broadcast to all other workers, which computes the embeddingwouldbeasignificantbottleneckforArabesque,
unionofallentriesinparallel. as we show in our evaluation, because of the sheer number
of candidate embeddings that are generated at each explo-
rationstep.Arabesquesolvesthisproblembyusinganovel
5.3 PartitioningEmbeddingsforLoadBalancing
techniquecalledtwo-levelpatternaggregation.
Afterbroadcastandbeforethebeginningofthenextexplo- The first level of aggregation occurs based on what we
ration step, every worker obtains the same set of ODAGs, call quick patterns. A quick pattern of an embedding e is
one for each pattern mined in the previous execution step. theoneobtained,inlineartime,bysimplyscanningallver-
The next step is to partition the set I of new embeddings tices (or edges, depending on the exploration mode) of e
(seeAlgorithm1)amongworkers.Thisisachievedbyparti- and extracting the corresponding labels. The quick pattern
tioningtheembeddingsineachpatternODAGseparately. iscalculatedforeachcandidateembedding.Intheprevious
Workers could achieve perfect load balancing by using example, we would obtain the quick pattern (blue,yellow)
a round-robin strategy to share work: worker 1 takes the for the embeddings (1,2) and (3,4) and the quick pattern
first embedding, worker 2 the second and so on. However, (yellow,blue) for the embedding (2,3). Each worker lo-
having workers iterate through all embeddings produced in cally executes the reduce function based on quick patterns.
thepreviousstep,includingthosethattheyarenotgoingto Once this local aggregation completes, a worker computes
process,iscomputationallyintensive.Therefore,workersdo the canonical pattern p for each quick pattern p and sends
c
roundrobinonlargeblocksofbembeddings.Thequestion thelocallyaggregatedvaluetothereducerforp .Arabesque
c
nowishowtoidentifysuchblocksefficiently. usestheblisslibrarytodeterminecanonicalpatterns[20].
Arabesqueassociateseachelementv ineveryarraywith Insummary,insteadofexecutinggraphisomorphismfor
anestimateofhowmanyembeddings,canonicalornot,can averylargenumberofcandidateembeddings,two-levelpat-
begeneratedstartingfromv.Tothisend,Arabesqueassigns ternaggregationcomputesaquickpatternforeveryembed-
acost1toeveryelementofthelastarray,anditassignstoan ding,obtainsanumberofquickpatternswhichisordersof
elementoftheith arraythesumofthecostsofallelements magnitudesmallerthanthecandidateembeddings,andthen
in the (i + 1)th array it is connected to. Load can then calculatesgraphisomorphismonlyforquickpatterns.
be balanced by having each worker take a partition of the
elements in the first array that has approximately the same 6. Evaluation
estimated cost. While partitioning, it could happen that the
6.1 ExperimentalSetup
costofanelementvofthefirstarrayneedstobesplitamong
multiple workers. In this case, the costs associated to the Platform: We evaluate Arabesque using a cluster of 20
elementsofthesecondarrayconnectedtov arepartitioned. servers.Eachserverhas2IntelXeonE5-2670CPUswitha
totalof32executionthreadsat2.67GHzpercoreand256GB tributedsolutionstosolveFSMonasinglelargeinputgraph
RAM. The servers are connected with a 10 GbE network. intheliterature.
Hadoop 2.6.0 was configured so that each physical server
10
contains a single worker which can use all 32 execution
Ideal TLP TLV
8
threads(unlessotherwisestated).ArabesquerunsonGiraph
developmenttrunkfromJanuary2015withaddedfunction- up 6
d
alityforobtainingclusterdeploymentdetailsandimproving pee 4
S
aggregationperformance.Thesemodificationsamountto10
2
extralinesofcode.
0
1 5 10
Vertices Edges Labels Av.Degree Numberofnodes(32threads)
CiteSeer 3,312 4,732 6 2.8
Figure 7: Scalability Analysis of Alternative Paradigms:
MiCo 100,000 1,080,298 29 21.6
FSM(S=300)onCiteSeer.
Patents 2,745,761 13,965,409 37 10
Youtube 4,589,876 43,968,798 80 19
The Case of TLV: Our TLV implementation globally
SN 5,022,893 198,613,776 0 79
Instagram 179,527,876 887,390,802 0 9.8 maintains the set of embeddings that have been visited,
much like Arabesque. The implementation adopts the TLV
approach as described in Section 3.2 and uses the same
Table1:Graphsusedfortheevaluation.
coordination-free technique as Arabesque to avoid redun-
dant work. The TLV implementation also uses application-
Datasets: We use six datasets (see Table 1). CiteSeer [14] specific approaches to control the expansion process. Our
has publications as vertices, with their Computer Science TLVimplementationofFSMusesthisfeaturetofollowthe
areaaslabel,andcitationsasedges.MiCo[14]hasauthors standarddepth-firststrategyofgSpan[43].
asvertices,whicharelabeledwiththeirfieldofinterest,and InFigure7,weshowthescalabilityofFSMwithsupport
co-authorship ofa paper asedges. Patents [18]contains ci- 300usingtheCiteSeergraph.Asseenfromthefigure,TLV
tationedgesbetweenUSPatentsbetweenJanuary1963and doesnotscalebeyond5servers.Amajorscalabilitybottle-
December 1999; the year the patent was granted is consid- neck is that each embedding needs to be replicated to each
eredtobethelabel.Youtube[10]listscrawledvideoidsand vertexthathasthenecessarylocalinformationtoexpandthe
relatedvideosforeachvideopostedfromFebruary2007to embeddingfurther.Inaddition,high-degreeverticesneedto
July 2008. The label is a combination of the video’s rating expandadisproportionatefractionofembeddings.CiteSeer
andlength.SN,isasnapshotofarealworldSocialNetwork, isascale-freegraphthusaffectingthescalabilityofTLV.
which is not publicly available. Instagram is a snapshot of Overall TLV performance is two orders of magnitude
thepopularphotoandvideosharingsocialnetworkcollected slowercomparedtoArabesque.TLVrequiresmorethan300
by [28]. We consider all the graphs to be undirected. Note secondstorunFSMontheCiteSeergraph,whileArabesque
thatevenifsomeofthesegraphsarenotverylarge,theex- requires only 7 seconds for the same setup. The total mes-
plosion of the intermediate computation and state required sages exchanged for this tiny graph is 120 million, versus
forgraphexploration(seeFigure1)makesthemverychal- 137 thousand messages required by Arabesque. Due to the
lengingforcentralizedalgorithms. hot-spotsinherenttothegraphstructure,orthelabeldistri-
ApplicationsandParameters:Weconsiderthethreeappli- bution, and the extended duplication of state that the TLV
cationsdiscussedinSections2,whichwelabelFSM,Motifs paradigm requires, we conclude that TLV is not suited for
andCliques.Bydefault,allMotifsexecutionsarerunwitha solvingtheseproblems.
maximumembeddingsizeof4,denotedasMS=4,whereas TheCaseofTLP:TheTLPimplementationisbasedon
CliquesarerunwithamaximumembeddingsizeofMS=5. GRAMI [14], which represents the state of the art for cen-
For FSM, we explicitly state the support, denoted S, used tralizedFSM.GRAMIkeepsstateonaper-patternbasis,so
ineachexperimentasthisparameterisverysensitivetothe fewrelativelystraightforwardchangestothecode-basewere
propertiesoftheinputgraph. sufficienttoderiveaTLPimplementationwherepatternsare
partitionedacrossasetofdistributedworkers.
6.2 AlternativeParadigms:TLVandTLP
GRAMIusesanumberofoptimizationsthatarespecific
We start by motivating the necessity for a new framework toFSM.Inparticular,itavoidsmaterializingallembeddings
fordistributedgraphmining.Weevaluatethetwoalternative relatedtoapattern,acommonapproachforTLPalgorithms.
computational paradigms that we discussed in Section 3.2. Whenever a new pattern is generated, its instances are re-
Arabesque (i.e., TLE) will be evaluated in the next subsec- calculatedonthefly,stoppingassoonasasufficientnumber
tion.Weconsidertheproblemoffrequentsubgraphmining of embeddings to pass the frequency threshold is found.
(FSM) as a use case. Note that there are currently no dis- GRAMI thus solves a simpler problem than the TLV and
Description:Qatar Computing Research Institute - HBKU, Qatar. Abstract. Distributed data . Giraph just as a regular data parallel system implementing the Bulk