Table Of Content

NeMa: Fast Graph Search with Label Similarity Arijit Khan Yinghui Wu Charu C. Aggarwal Xifeng Yan ComputerScience IBMT.J.WatsonResearch UniversityofCalifornia,SantaBarbara Hawthorne,NY {arijitkhan,yinghui,xyan}@cs.ucsb.edu [email protected] ABSTRACT andlanguagesareproposed—suchasSPARQLandXDDforthe RDFandXMLdata—whichrequireastandardschemaofqueries Itisincreasinglycommontofindreal-lifedatarepresentedasnet- and target graphs. Nevertheless, the real-lifegraphs are complex worksoflabeled,heterogeneousentities.Toquerythesenetworks, andnoisy,andoftenlackstandardizedschemas[1].Indeed,(a)the oneoftenneedstoidentifythematchesofagivenquerygraphina nodes may be heterogeneous, referring to different entities (e.g., (typicallylarge)networkmodeledasatargetgraph. Duetonoise persons, companies, documents) [14]. (b)Nodelabelsinagraph and thelackoffixedschema inthetargetgraph, thequery graph oftencarryrichsemantics,e.g.,id,urls,personalinformation,logs, cansubstantiallydifferfromitsmatchesinthetargetgraphinboth opinions [15]. (c) Worse still, the semantics of entities and their structure and node labels, thus bringing challenges to the graph interconnectionsinvariousdatasetsmaybedifferentandunknown queryingtasks.Inthispaper,weproposeNeMa(NetworkMatch), tousers[1]. Inthiscontext,amatchmaynotnecessarilybe(even a neighborhood-based subgraph matching technique for querying approximately)isomorphictothequerygraphintermsoflabeland real-life networks. (1) To measure the quality of the match, we topological equality. Thus, traditional graph querying techniques proposeanovelsubgraphmatchingcostmetricthataggregatesthe arenotabletocapturegoodqualitymatches. Considerthefollow- costsofmatchingindividualnodes,andunifiesbothstructureand ingexampleovertheIMDBmoviedataset. nodelabelsimilarities. (2)Basedonthemetric,weformulatethe minimumcost subgraph matchingproblem. Givenaquery graph andatargetgraph,theproblemistoidentifythe(top-k)matchesof K.( AWcitnosr)let (Mo?vie) K.( AWcitnosr)let (Mo?v ie) thequerygraphwithminimumcostsinthetargetgraph. Weshow Winslet, K. Titanic (actor) (Movie) thattheproblemisNP-hard,andalsohardtoapproximate. (3)We proposeaheuristicalgorithmforsolvingtheproblembasedonan ? ? inferencemodel. Inaddition,weproposeoptimizationtechniques (Director) (Director) toimprovetheefficiencyofourmethod. (4)Weempiricallyverify Cameron, J. Lang, S. thatNeMaisbotheffectiveandefficientcomparedtothekeyword S. Lang (Director) (I) (Actor) (Actor) searchandvariousstate-of-the-artgraphqueryingtechniques. S. Lang ? (Actor) (Movie) 1. INTRODUCTION (a) Query Graph 1 (b) Query Graph 2 (c) Top-1 Match Withtheadvent of theInternet, sourcesof datahaveincreased Figure1:AQueryandItsMatch(Example1.1) dramatically, including the World-Wide Web, social networks, Example 1.1. A user wants to find a movie of actress ‘Kate genome databases, knowledge graphs, medical and government Winslet’thatisdirectedbythesamedirectorwhoalsoworkedwith records. Such dataare often represented as graphs, where nodes actor‘Stephen Lang’. Eveniftheschemaandexact entitylabels arelabeledentitiesandedgesrepresentrelationsamongtheseenti- ofthetargetnetworkarenotavailable, theusercanstillcomeup ties[14,43].Queryingandminingofgraphdataareessentialfora withsomereasonablegraphrepresentation ofthequery[15,17], widerangeofemergingapplications[1,15,30]. asillustratedinFigure1(a)and1(b).Observethatsuchgraphical Toquerythesegraphs,oneoftenneedstoidentifythematchesof representationmaynotbeunique,andtheremightnotbeanexact agivenquerygraphina(typicallylarge)targetgraph. Traditional matchofthequerygraphinthedataset. Indeed,theresultinFig- graph querying models are usually defined in terms of subgraph ure1(c)(astar-shapedgraph)isbynomeanssimilartothequery isomorphismanditsextensions(e.g.,editdistance),whichidentify graphs inFigure1(a)and (b)(bothchain-shaped graphs) under subgraphs that are exactly or approximately isomorphic to query traditionalgraphsimilaritydefinitions. Grapheditdistanceofthe graphs [35, 33, 43]. In addition, a wide range of query models resultgraphwithquerygraphs1and2are4and6, respectively. Thesizeofthemaximumcommonsubgraphis3inbothcases.Nev- Permission tomake digital orhardcopies ofall orpartofthis workfor ertheless,‘Titanic’isthecorrectanswerofthequery; andhence, personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare the result graph should be considered a good match for both the notmadeordistributedforprofitorcommercialadvantageandthatcopies bearthisnoticeandthefullcitationonthefirstpage. Tocopyotherwise,to querygraphsusingsomenovelgraphsimilaritymetric. republish,topostonserversortoredistributetolists,requirespriorspecific This motivates us to investigate fast subgraph matching tech- permissionand/orafee. Articlesfromthisvolumewereinvitedtopresent niques suitable for query answering, which can relax rigidstruc- theirresultsatThe39thInternationalConferenceonVeryLargeDataBases, turalandlabelmatchingconstraintsofsubgraphisomorphismand August26th-30th2013,RivadelGarda,Trento,Italy. other traditional graph similarity measures. Our proposed graph ProceedingsoftheVLDBEndowment,Vol. 6,No.3 Copyright2013VLDBEndowment2150-8097/13/01...$10.00. similarity metric is based on following observations: (a) if two NeMa BLINKS1 IsoRank SAGA NESS1 gStore Precision 0.91 0.52 0.63 0.75 Filter:0.17 0.59 (Node) Filter+Verify:0.80 Recall 0.91 0.52 0.63 0.75 Filter:0.83 0.59 (Node) Filter+Verify:0.80 Precision 0.88 0.50 0.40 0.69 Filter:0.39 0.55 (Graph) Filter+Verify:0.74 Recall 0.88 0.50 0.40 0.69 Filter:0.75 0.55 (Graph) Filter+Verify:0.74 Top-1Match 0.97 1.92 4882.0 15.95 Filter:0.59 0.92 FindingTime(sec) Filter+Verify:56.16 Table1: NeMavs. KeywordSearchandGraphQueryingMethods: ThequerygraphswereextractedfromtheIMDBgraph,andlatermodified byadding 30%structural noiseand50%labelnoise. Wedetermined thetop-1matchforeach querygraphusingvarious methods, andmeasured effectiveness atthelevel of(a)querynodes, and(b)querygraphs. Atthenodelevel, precisionisdefinedastheratioofcorrectlydiscoverednode matchesoveralldiscoverednodematches,whilerecallismeasuredastheratioofcorrectlydiscoverednodematchesoverallcorrectnodematches. Similarly, atthegraphlevel, precisionisdefinedastheratioofcorrectlydiscoveredgraphmatchesoveralldiscoveredgraphmatches,andrecallis measuredastheratioofcorrectlydiscoveredgraphmatchesoverallcorrectgraphmatches. Agraphmatchisconsideredcorrectifatleast70%ofits nodesarematchedcorrectly.Sinceweconsideronlythetop-1match,precisionandrecallhavethesamevalue.Inaddition,wealsoreportprecisionand recallofNESSfilteringphase.Fordetailsaboutthequerygraphs,noise,andevaluationmetrics,seeSection7. nodes are closein a query graph, the corresponding nodes in the toapplyaninferencealgorithmtoheuristicallyidentifytheoptimal resultgraphmustalsobeclose. However, (b)theremaybesome matches.Ourmethodavoidscostlysubgraphisomorphismandedit differencesinlabelsofthematchednodes. distancecomputations. Wefurtherproposeindexingandoptimiza- While the need for such a graph similarity metric is evident tiontechniquesforourmethodinSection6. (e.g.,SAGA [35], IsoRank [33]), there is little work on subgraph (4)WeempiricallyverifytheeffectivenessandefficiencyofNeMa. matchinginlargenetworksconsideringboththecriteria.Recently, Ourexperimentalresultsonreal-worldnetworksinSection7show NESS [20] isproposed for subgraph matching that considers the thatNeMafindsbetterqualityresultsquicklyascomparedtokey- proximityamongnodes, butresortstostrictnodelabelmatching. wordsearch(e.g. BLINKS[16])andvariousgraphqueryingtech- The NESS algorithm is based on a filtering-and-verification ap- niques(e.g.,IsoRank[33],SAGA[35],NESS[20],gStore[43]). proach. Inthefilteringphase, thelesspromisingcandidatenodes areprunediteratively,untilnomorecandidatescanbepruned.The 2. RELATED WORK outputofthefilteringphaseisalimitednumberoffinalcandidates SubgraphMatching. Ullmann’sbacktrackingmethod[38], VF2 for each query node. Then, it verifiesallpossible graph matches [9],SwiftIndex[32]areusedforsubgraphisomorphismchecking. formedbythesefinalcandidates, inorder tofindthetop-k graph The subgraph matching problem identifies all the occurrences matches. OnecanmodifyNESStoleveragefornodelabeldiffer- of a query graph in the target network. In bioinformatics, exact ences. However, thismodificationreducestheeffectivenessofits and approximate subgraph matching have been extensively stud- filteringphase,andresultsinalargenumberoffinalcandidatesfor ied,e.g.,PathBlast[19],SAGA[35],NetAlign[22],IsoRank[33]. each query node (See Appendix for an example). Indeed, in our Amongthem,SAGAisclosetooursintermsofproblemformula- experiments, wefindaverylowprecisionscoreforNESS, atthe tion.However,thesealgorithmstargetsmallerbiologicalnetworks. endofitsfilteringphase(Table1).Hence,itbecomesquiteexpen- Itisdifficulttoapplytheminlargeheterogeneousnetworks. sivetodeterminethetop-kgraphmatchesfromtheselargenumber Therehavebeensignificantstudiesoninexactsubgraphmatch- of final candidates. In contrast, our proposed NeMa framework inginlargegraphs.Tongetal.[37]proposedthebest-effortpattern employsaninferencealgorithmthatiterativelybooststhescoreof matching,whichaimstomaintaintheshapeofthequery. Incon- morepromisingcandidatenodes,consideringbothlabelandstruc- trast,weidentifytheoptimalmatchesintermsofproximityamong turalsimilarity;andtherebydirectlyfindsthetop-kgraphmatches. entitiesratherthantheshapeofthequerygraph. Tianetal. [36] Contributions.Inthispaper,weproposeNeMa,anovelsubgraph proposed an approximate subgraph matching tool, called TALE, matchingframeworkforqueryingheterogeneousnetworks. withefficient indexing. Mongiovi et. al. introduced a set-cover- (1)Wedefinethequeryresultasthematchofagivenquerygraph based inexact subgraph matching technique, called SIGMA [26]. inatargetgraph,intermsofanotionofhomomorphism-basedsub- Boththesetechniquesuseedgemissestomeasurethequalityofa graphmatching. Tomeasurethequalityofthematches,wefurther match; and therefore, cannot incorporate the notion of proximity define a novel subgraph matching cost metric between the query amongentities. Thereareotherworksoninexactsubgraphmatch- graph and its matches (Section 3). In contrast to strict subgraph ing. Anincompletelist(see[13]forsurveys)includeshomomor- isomorphism,ourproposedmetricaggregatesthecostsofmatching phismbasedsubgraphmatching[12],beliefpropagationbasednet individualquerynodes,whichinturndependsonthecostofmatch- alignment [4], edge-edit-distance based subgraph indexing tech- ingnodelabelsandtheirneighborhoodswithinacertainhops. nique[41],subgraphmatchinginbillionnodegraphs[34],regular (2) Basedon thecost metric, wepropose theminimumcost sub- expressionbasedgraphpatternmatching[3],schema[25]andun- graph matching problem (Section 4), which is to identify the balanced ontology matching [42]. Among them, homomorphism matchesofthequerygraphwithminimumcostsinthetargetgraph. basedsubgraphmatching[12]isclosetoours.However,insteadof WeshowthattheproblemisNP-hardandalsohardtoapproximate. identifyingthetop-k matches, thepaperreportsallthesubgraphs (3)Weproposeaheuristicmethodfortheminimumcostsubgraph wherethequeryedgescanbemappedtopathsofagivenmaximum matching problem (Section 5). In a nutshell, NeMa converts the underlying graph homomorphism problem into an equivalent in- 1In this paper, all experimental results with NESS and BLINKS ference problem in graphical models [29], and thereby allows us correspondtotheirmodifiedversions,whereweallowtwonodesto bematchediftheirlabeldifferenceiswithinapredefinedthreshold. lengthandthelabeldifferencesarewithinacertainthreshold. Q(VQ,EQ,LQ) querygraph There are several works on simulation and bisimulation-based G(V,E,L) targetgraph graph pattern matching, e.g., [11, 23], which define subgraph φ:VQ→V subgraphmatchingfunction matchingasarelationamongqueryandtargetnodes.Comparedto ∆L labeldifferencefunction them,NeMa,ismorestrict,sincewedefinesubgraphmatchingas M(v) candidatesetofnodev afunctionfromquerynodestotargetnodes. RG(u) neighborhoodvectorofnodeu Nφ(v,u) neighborhoodmatchingcostbetweenvandu Label and Concept Propagation. Label propagation has been Fφ(v,u) individualnodematchingcostbetweenvandu widelyusedinsemi-supervisedlearning,e.g.,labelingofunlabeled C(φ) subgraphmatchingcostfunction graph nodes [31]. Concept Propagation /Concept Vector, on the Table2:Notations:TargetGraphs,QueriesandSubgraphMatching otherhand,wasoriginallyformulatedtomeasurethesemanticsim- Given a target graph G = (V,E,L) and a query graph Q = ilaritiesbetweenterms/conceptsinataxonomy[21]. Wenotethat thespreadingactivationtheoryofmemory[2]usedasimilaridea (VQ,EQ,LQ),(1)anodeu ∈ V isacandidateforaquerynode of activation propagation. CP/CV and spreading activation have v ∈ VQ if the difference in their labels (i.e., L(u) and LQ(v), respectively),determinedbyagiven(polynomial-timecomputable) beeneffectivelyusedin[7,20]forapproximatestructuralmatching intrees,graphsandalsoforinformationretrievalfromassociated labeldifferencefunction∆L,islessthanorequaltoapredefined thresholdǫ.WedenoteasM(v)thecandidatesetofthequerynode networks[5]. Theseworksconsideronlystrictnodelabelmatch- ing. However, subgraphmatchingwithoutnodelabelsisaharder v.(2)asubgraphmatchingisamany-to-onefunctionφ:VQ→V, problemthansubgraphmatchingwithnodelabels[40].Therefore, suchthat,foreachquerynodev∈VQ,φ(v)∈M(v). insteadofstrictnodelabelequality,whenoneallowsapproximate Remarks.(1)Thelabeldifferencefunction∆Lbetweentwonode nodelabelmatching(e.g.,inourcurrentwork),itsignificantlyin- labels canbe defined by avariety of criteria, such asthe Jaccard creasesthecomplexityofthesearchproblem. similarity,stringeditdistance,ormoresophisticatedsemanticmet- Querying Semi-structured Data. Lorel and UnQl are among rics, e.g., ontology similarity [10]. In this work, we use Jaccard thepreliminaryquerylanguagesdesignedforsemi-structureddata. similaritymeasuretodetermine∆L(Section7). (2)Incontrastto Both of them model input data as labeled graphs, while permit- strictone-to-one mappingasintraditionalsubgraphisomorphism ting users to write queries without detailed knowledge about the tests,weconsideramoregeneralmany-to-onesubgraphmatching schema. Later, an underlying query processing system converts function. Indeed, twoquerynodesmayhavethesamematch[12, thosequeriesintostandardSQLorstructuralrecursionqueries,re- 30]. (3)Inpractice,thenodesinthetargetandquerygraphsmay spectively, for retrieving the correct answers. This idea of query be annotated with types (e.g., Figure 1 and [15]), where a query rewriting has been explored in the context of both relational and nodecanonlybematchedwithtargetnodeshavingthesametype. semi-structureddata,e.g.,[10,28,39,15].Observethatsuchquery Insuchcases,oursubgraphmatchingmodelcanbeeasilyadapted rewritingtechniquesalleviateusersfromthecomplexityofunder- tocapturethetypeconstraintsbyrefiningcandidatesets. standingtheschema;nevertheless,theunderlyingqueryprocessing 3.2 Subgraph Matching CostFunction systemstillrequiresafixedschema. IntherealmofRDF,SPARQLiswidelyusedasthequerypro- Therecanbemany validmatching functions for agivenquery cessinglanguage. However, writingofaSPARQLqueryisoften graph and a target graph [13]. As stated earlier, our novel graph toochallenging, becauseitrequirestheexactknowledgeofstruc- similaritymetricmustpreservetheproximityamongnodepairsin ture,nodelabelsandtypes.gStore[43],whichisthefirststudythat thequerygraph,whilethelabelsofthematchednodesshouldalso considersasubgraphmatching-basedqueryansweringtechniquein besimilar.Takingthisasourguideline,weintroducethesubgraph RDFdata,allowsapproximatenodelabelmatching,butadheresto matchingcostfunctioninNeMaasametrictomeasurethegood- strictstructuralmatches.Incontrast,ourNeMaframeworkpermits ness of a matching. The function adds up the costs of matching bothstructuralandnodelabelmismatches. a query node with its candidate, thereby capturing the difference Our work is different from the keyword search in graphs [16, betweenlabelsandneighborhoodstructuresofthetwonodes. We 18],asourquerieshavebothstructureandkeywords(nodelabels). firstintroducethenotionofaneighborhoodvector. Neighborhoodvectorization. Givenanodeuinthetargetgraph 3. PRELIMINARIES G,werepresenttheneighborhoodofuwithaneighborhoodvector Westartwithafewdefinitions. RG(u) = {hu′,PG(u,u′)i},whereu′isanodewithinh-hopsof 3.1 Target Graphs, Queries andMatching u,andPG(u,u′)denotestheproximityofu′fromuinG. Target graph. A target graph that represents a heterogeneous αd(u,u′) ifd(u,u′)≤h; network dataset can be defined as a labeled, undirected graph PG(u,u′)= (1) (0 otherwise. G=(V,E,L),withthenodesetV,edgesetE,andalabelfunc- tionL, where(1)eachtarget nodeu ∈ V representsanentityin Here,d(u,u′)isthedistancebetweenuandu′.Thepropagation thenetwork,(2)eachedgee∈Edenotestherelationshipbetween factorαisaparameterbetween0and1;andh>0isthehopnum- twoentities,and(3)Lisafunctionthatassignstoeachnodeua ber(effectively,theradius)oftheneighborhoodforvectorization. labelL(u)fromafinitealphabet. Inpractice,thenodelabelsmay The neighborhood vector of node u encodes the proximity infor- representtheattributesoftheentities,e.g.,name,value,etc. mationfromutoitsh-hopneighbors. Itoftensufficestoconsider Querygraph.AquerygraphQ=(VQ,EQ,LQ)isanundirected, smallvaluesofh(e.g.,h=2),sincetherelationshipbetweentwo labeledgraph,withasetofquerynodesVQ,asetofqueryedges entitiesbecomesirrelevantastheirsocialdistanceincreases[6]. EQ, and a label function LQ, which assigns to each query node Basedon neighborhood vectors, wenow proceed tomodel the v∈VQalabelLQ(v)fromafinitealphabet. matching cost of theneighborhoods of aquery node and atarget We next define the subgraph matching of a (connected) query node.Letusdenotethesetofneighboringnodeswithinh-hopsofv graphinalargetargetnetwork. asN(v).Givenamatchingfunctionφ,theneighborhoodmatching u 2 costbetweenvandu=φ(v),denotedbyNφ(v,u),isdefinedas: v2 c c v4 c N (v,u)= v′∈N(v)∆+(PQ(v,v′),PG(u,φ(v′))) (2) φ P v′∈N(v)PQ(v,v′) where∆+(x,y)isafunPctiondefinedas v1 b a v3 u1 b a u3 Query Graph Q Match x−y, ifx>y; ∆ (x,y)= (3) Figure2:ExampleofFalseExactMatchinNeMa + (0 otherwise. Property2. Ifthematchidentifiedbyφisnotisomorphictothe Intuitively, N (v,u) measures thematching cost of the neigh- φ querygraphQ,andφisaone-to-onefunction,thenC(φ)>0. borhood vectorsof v andu. Notethat (i)theuser issuesaquery basedonhervaguenotionofhowtheentitiesareconnectedinthe PROOF. SinceQ is connected and φ is a one-to-one function, target graph. Hence, ∆+ avoids penalizing the cases when two if the match identified by φ is not isomorphic to Q, one of the nodes are closer in the target graph, as compared to their corre- following must hold. (1) There exists some node v ∈ VQ, s.t., sponding nodes in the query graph. (ii) We normalize Nφ(v,u) ∆L LQ(v),L(φ(v)) >0. Then,C(φ)>0,assumingλ6=0in overtheneighborhoodofvthatincursmorecostwhensamenum- Eq.4.(2)Thereexistsanedge(v,v′)inEQ;butthecorresponding berofnodemissesoccursinasmallerneighborhood. edge(cid:0)(u,u′)isnotin(cid:1)graphG. φ(v) = uandφ(v′) = u′. This Recallthatweassumetheexistenceofthelabeldifferencefunc- impliesPQ(v,v′)=α,butPG(u,u′)<α,whichinturnimplies tion 0 ≤ ∆L ≤ 1. Now, the individual node matching cost for Nφ(v,u)>0.Assumingλ6=1inEq.4,wegetC(φ)>0. matchingfunctionφisdefinedasalinearcombinationofthelabel differencefunctionandtheneighborhoodmatchingcostfunction. 4. PROBLEM FORMULATION Fφ(v,u)=λ·∆L(LQ(v),L(u))+(1−λ)·Nφ(v,u), (4) The subgraph matching cost function favors matches with low whereu=φ(v). matchingcosts.Basedonthematchingcostfunction,weintroduce Thisnodematchingcostcombinesbothlabelmatchingcostand theminimumcostsubgraphmatchingproblemasfollows. neighborhood matching cost viaaparameter 0 < λ < 1, whose Problem Statement 1. Minimum Cost Subgraph Matching. optimalvalueliesbetween0.3∼0.5empirically(Section7). Given a target graph G, a query graph Q, and the label noise Wearenowreadytodefineoursubgraphmatchingcostfunction. thresholdǫ,findtheminimumcostmatchingφ, Givenamatchingφfromthequerynodesv ∈ VQtotargetnodes φ(v)∈V,thesubgraphmatchingcostfunctionisdefinedas: argminC(φ), (6) φ C(φ)= F (v,φ(v)) (5) φ vX∈VQ s.t. ∆L(LQ(v),L(u))≤ǫ,∀v∈VQ,u=φ(v) (7) Intuitively, C(φ) is the matching cost of φ between the query Intuitively,insteadofcheckingsubgraphisomorphism,ourprob- graphQandthetargetgraphG,andtheproblemistofindamatch- lemformulationidentifiestheoptimalmatchbyminimizingnode ing function φ that minimizes C(φ). Note that, assuming ∆L label differences as well as node pair distances. The identified is non-negative, F (v,φ(v)) and therefore, C(φ) are both non- matchesserveasanswerstothequerygraph. φ negative,sotheminimumvaluethatC(φ)cantakeis0. The problem is, however, nontrivial. The following theorem showsthatthedecisionversionoftheproblemisintractable,even 3.3 CostFunction Properties whenthesubgraphmatchingfunctionφisnotinjective. Thefollowingpropertiesofoursubgraphmatchingcostfunction illustratesitsconnectionwithsubgraphisomorphism. Theorem1. GivenatargetnetworkG,aquerygraphQ,itisNP- completetodeterminewhetherthereexistsamatchφwithNeMa Property1. IfthequerygraphQissubgraphisomorphic(interms subgraphmatchingcostC(φ)=0. ofstructureandnodelabelsequality)tothetargetgraphG,then thereexistsaminimumcostmatchingfunctionφwithC(φ)=0. PROOF. TheproblemisNP,sincethereisanondeterministical- gorithmwhichguessesamatchingfunctionφ,andverifieswhether Property1ensuresthatallthematchingfunctionsφ,whichiden- itscostC(φ)= 0,inpolynomialtime. WeprovetheNP-hardness tifiesexact (isomorphic) matchesfor Q, musthave cost 0. How- byreductionfromthegraphhomomorphismproblem,whichisNP- ever, a match φ of Q, where C(φ) = 0, may not necessarily be complete [8]. A homomorphism from a graph Q′ to a graph G′ isomorphictoQ.Werefertosuchmatchesasfalseexactmatches. (bothunlabeled) isafunction thatpreservesnodeadjacency (i.e., Example 2.1. Consider aquery graph Q, atarget graph G(Fig- each edge inQ′ ismapped toanedge inG′). Given aninstance ure 2), and a subgraph matching function φ, where φ(v )=u , ofthegraphhomomorphismproblem,weconstructaninstanceof 1 1 φ(v )= φ(v )=u , and φ(v )=u . Assuming h = 1 and α = theminimumcostsubgraphmatchingproblem,whereallnodesin 2 4 2 3 3 0.5,theneighborhoodvectorsinQare: RQ(v1)={hv2,0.5i,hv3, the target graph G and query graph Q have identical labels. We 0.5i}, RQ(v2)={hv1,0.5i}, RQ(v3)={hv1,0.5i, hv4,0.5i}, and alsoassume, w.l.o.g., thatthedepthofvectorizationh = 1. One RQ(v4)={hv3,0.5i}. Similarly,wehavethefollowingneighbor- mayverifythatifthereexistsahomomorphismφ′fromQ′toG′, hood vectors in G: RG(u1) = {hu2, 0.5i, hu3, 0.5i}, RG(u2) then there exists a corresponding matching φ from Q to G, s.t. ={hu1, 0.5i, hu3, 0.5i}, and RG(u3) = {hu1,0.5i, hu2, 0.5i}. C(φ) = 0. Conversely, if φ′ is not a homomorphic matching, Therefore, the individual node matching costs Fφ is 0 for all thenthereexistsanedge(v,v′)inEQ,butthecorrespondingedge v ∈ VQ,andthesubgraphmatchingcostC(φ)is0. Observethat (φ(v),φ(v′)) is not in G. Hence, C(φ) > 0 (λ 6= 1 in Eq. 4). thematchidentifiedbyφisnotisomorphictoQ. Therefore,thereexistsamatchingfunctionφfromQtoG,where However,ifthematchingfunctionφisone-to-one,thefollowing C(φ)=0,ifandonlyifthereisahomomorphicmatchingφ′from propertyshowsthatthefalseexactmatchescanbeavoided,. Q′toG′.Thiscompletestheproof. One may want to find a polynomial time approximation algo- AlgorithmNemaInfer rithm.However,theproblemisalsohardtoapproximate. Input:TargetgraphG(V,E,L),QueryGraphQ(VQ,EQ,LQ). Output:MinimumcostmatchingofQinG. Theorem2. TheminimumcostsubgraphmatchingisAPX-hard. 1. foreachnodev∈VQdocomputeM(v); PROOF. Weshow that this optimization problem is APX-hard 2. i:=0;flag:=true; by performing a reduction (f,g) from the Maximum Graph Ho- 3. InitiateiterativeinferencingwithEq.8; 4. whileflagdo momorphism (MGH) problemwithout self loops, whichisAPX- 5. i:=i+1; hard [27]. An MGH problem identifies a matching which max- 6. foreachv∈VQdo imizes the number of edges of the query graph Q that can be 7. foreachu∈M(v)do mapped to edges of the target graph G (both unlabeled). Given 8. computeUi(v,u)withTheorem3; an instance I of MGH, we construct an instance I′ of the mini- 9. keeptrackofthecurrentmatchesofneighborsv′∈N(v); mumcostsubgraphmatchingproblem,whereallnodesinthetar- 10. computeoptimalmatchOi(v)usingEq.10; 11. ifmorethanathresholdnumberof agnedt neeqtwboerkthGe taontadlqnuuemrybegrraopfhnoQdehsavaendideedngtiecsa,lrleasbpeelcs.tivLeleyt,ninq 12. flquaegry:=nfoadlesse;vsatisfyOi(v)=Oi−1(v)then Q. w.l.o.g., assume the depth of vectorization h = 1, and the 13. constructΦforallv∈VQ(withEq.11,12). proportionalityconstantλ = 1− 1 . WedenotebyOPT(I)the 14. returnΦ; nq valueoftheoptimalsolutionofprobleminstanceI,andVAL(I,x) Figure3:IterativeInferenceAlgorithmNemaInfer the valueof afeasiblesolution xof theproblem instance I. As- sume OPT(I) = eo and VAL(I,x) = e for some feasiblesolu- iterativelycomputes aninference cost for each query node v and tion x of instance I. Clearly, eo ≥ 1. Hence, (1) OPT(I′) < itscandidates, andselectstheoptimalmatchofv asitscandidate 1 ≤ eo = OPT(I). Also, given some feasible solution y of uwiththeminimuminferencecost. NemaInferkeepstrackofthe instance I′, one may verify that |OPT(I) − VAL(I,g(y))| = optimalmatchesforeachquerynode. Theprocedurerepeatsuntil eo − e, and |OPT(I′) −VAL(I′,y)| ≥ 2enoq−eeq. Therefore, (2) it reaches a fixpoint, where the optimal matches for more than a |OPT(I)−VAL(I,g(y))|≤2nqeq|OPT(I′)−VAL(I′,y)|.Thus, threshold number of query nodes remain identical intwo succes- thereexistsareduction(f,g)fromMGHtotheminimumcostsub- siveiterations(lines4-12).Finally,NemaInferrefinesthematches graphmatchingproblem,andthetheoremfollows. ofeachquerynodeanditsneighborhoodthatit“memorizes”viaa memoizationtechnique,andobtainsthebestmatch(line13). The 5. QUERYPROCESSINGALGORITHM constructedsubgraphmatchΦisthenreturned(line14). WenextintroduceseveralproceduresofNemaInferindetail. In this section, we propose a heuristic solution to identify the minimum cost matchings. We start by introducing the max-sum Inference cost and optimal match (lines 3-12). The algorithm inference problem in graphical models [29], and show how our NemaInferimprovesthequalityofthematchingineachiteration, graph homomorphism problem underlying the NeMa framework basedonthenotionofaninferencecostandtheoptimalmatch. isequivalenttoaninferenceproblemingraphicalmodels. Inferencecost.AteachiterationiofNemaInfer,theinferencecost Max-Sum Inference. In graphical models, the joint probability Ui(v,u)foreachv∈VQandu∈M(v)isdefinedasfollows. distributionfunctionp(X)ofasetofvariablesX = {x ,x ,... 1 2 ,xM} can be expressed as a product of the form p(X) = U (v,u)= min F (v,u) (8) ifi(Xi), where each Xi ⊆ X. Alternatively, logp(X) = 0 {φ:φ(v)=u} φ Qilogfi(Xi).TheMax-Suminferenceproblemistofindtheval- IuPnesotohfetrhweovradrsia,bwleeswxo1u,lxd2li,k.e.t.o,xmMaxtihmaitzreelsougltpi(nXm)atxhiamtucamnpb(eXde)-. Ui(v,u)= min Fφ(v,u)+ Ui−1 v′,u′ (9) {φ:φ(v)=u} composedasthesumofseveralfunctionsoftheformlogfi(Xi), (cid:2) v′X∈N(v) (cid:0) (cid:1)(cid:3) eachofwhichdependsonlyonasubsetoftheoriginalvariables. Weassumei>0,andu′ =φ(v′)inEquation9.Intuitively,the The objective of the max-sum inference problem is similar to inferencecostistheminimumsumoftheindividualnodematching that of the minimum cost subgraph matching problem, which is costFφ(v,u)andthepreviousiteration’sinferencecostsUi−1 v′, tominimizetheoverallsubgraphmatchingcostC(φ). Recallthat φ(v′) forallneighborsv′ ofv, over allpossiblematchingfunc- (1) C(φ)isanaggregationoftheindividual nodematchingcosts tionsφ,withtheconstraintφ(v)=u. (cid:0) Fφ(v,φ(v)) of all query nodes v, and (2) the individual node Not(cid:1)e that although we consider the minimization over all pos- matching cost ofaquerynode v depends only onthematches of siblematchingfunctionsφ,s.t.,φ(v) = u, inEquation9, itonly v and its neighbors in N(v). In light of this, we propose an it- dependsonthematchesoftheneighboringnodesinN(v). Asdis- erativeinferencealgorithmsimilartotheloopybeliefpropagation cussedlater,inferencecostscanbecomputedinpolynomialtime. algorithm[29],usedforinferencingingraphicalmodels. Optimalmatch.Ineveryiteration,wealsodefinetheoptimalmatch 5.1 Iterative Inference Algorithm ofeachquerynode. Theoptimalmatchofaquerynodevatitera- Inthissection,weintroduceourinferencealgorithm,denotedas tioni,denotedbyOi(v),isdefinedasfollows. NemaInferandillustratedinFigure3. Overview. Given a query graph Q and a target graph G, Oi(v)=aur∈gMm(vin) Ui(v,u); i≥0 (10) NemaInfer first computes the candidate set for each query node usingthenodelabelsimilarityfunction∆L (line1). Next, itini- Example4.1. WeillustratetheideaofoneiterationofNemaInfer tializesaninferencecostU (v,u)byassigningittotheminimum using Figure 4. Assume we have already determined the candi- 0 possiblevalueofindividualnodematchingcostsF (v,u),overall date matches M(v) for every query node v using the label sim- φ possible matching functions φ, s.t., φ(v) = u (line2-3). It then ilarity function ∆L. For example, M(v2) = {u2,u5,u9} and M(v4)= {u7,u10}inFigure4. Also,considerh= 1. Ati= 0, v1 u1 u4 u8 U0(v2,u5) = U0(v2,u9) = 0. Therefore,wecannotdistinguish a a a a bteertwmeaetcnhuo5favn2d.uH9oiwnetvheer,inoibtisaelirzvaetitohnatrUou0n(dv,4a,suw10h)ic<hoUn0e(vis4a,ub7e)t-. v2 b c v3 ub uc u5 b c u6 uu9 b c u12 u10 isaneighbor of u9, whileu7 aneighbor of u5. Thus, itnot v4 d 2 3 ud 10 d onlyinfluencestheoptimalmatchO0(v4)ofv4atiterationi=0, v5 e 7 u11 e c u13 butitalsomakesU (v ,u ) < U (v ,u )atiterationi = 1,via 1 2 9 1 2 5 Eq.9. Hence, weimprovethematchesineachiterationandpro- Query Graph Q Target Graph G ceedtowardstheminimumcost(heuristic)subgraphmatch. Figure4:OptimalSubgraphMatchFindingAlgorithm Invariant. ThealgorithmNemaInferpossesthefollowinginvari- NemaInfer,wecomputeapartialinferencecostforeachnodev′∈ athnetiinnfeearcehncoefcitossittearnadtitohne,swuhbigcrhapilhlumstaratctehsinthgeccoosntn(Secetcitoionnb3et)w. een N(v),whichisdenotedbyWi(v,u,v′),anddefinedbelow. Invariant1. Ifthereexistsamatchingfunctionφfromthenodes ofQtothenodesofG,suchthat,C(φ)=0,thenUi(v,φ(v))=0 Wi(v,u,v′)={φ:φm(vi)n=u} β(v)·∆+ PQ(v,v′),PG u,φ(v′) forallv∈VQandi≥0. (cid:2)+Ui−1 v′(cid:0),φ(v′) (cid:0) (1(cid:1)3(cid:1)) However, theconverseisnotalwaystrue. Infact,basedonthe properties of the loopy belief propagation algorithm, there is no Here, β(v) = [ v′∈N(v)P(cid:0)Q(v,v′)]−(cid:1)1(cid:3). To compute guaranteethatouralgorithmwillconvergeforallquerynodesafter Wi(v,u,v′), we only nPeed to find the minimum value in Eq. 13 acertainnumberofiterations. Therefore,weterminatetheproce- over the candidates in M(v′). Hence, the partial inference cost durewhenmorethanathresholdnumberofquerynodesvsatisfy Wi(v,u,v′)canbecomputedinpolynomialtime,foreachtriplet theconditionOi(v) = Oi−1(v). WeempiricallyverifiedinSec- v, u, v′, where u ∈ M(v) and v′ ∈ N(v). Next, we show the tion 7 that our method usually requires about 2 to 3 iterations to relationbetweenthepartialinferencecostandtheinferencecost. terminate — around 95% of query nodes converge using IMDB dataset,andalsoperformswellinreal-lifenetworks. Theorem3. TheinferencecostUi(v,u)iscomputableinpolyno- mialtimeviathefollowingformula: Matchingrefinement(line13). Theoptimalmatchofeachquery Ui(v,u)=∆L LQ(v),L(u) + Wi(v,u,v′) (14) node at the final iteration might not correspond to the subgraph matching function with the minimum (heuristic) aggregate cost (cid:0) (cid:1) v′X∈N(v) PROOF. [29]. Thiscanhappen iftherearemultiplegraphmatchingfunc- tions that result in the minimum cost graph matches. Therefore, Ui(v,u) we need to refine the optimal node matches from the final round ofNemaInfertoidentifyonesuchminimumcostsubgraphmatch- = min ∆L(LQ(v),L(u))+β(v)· P ∆+(PQ(v,v′),PG(u,φ(v′))) ing function, say Φ. Werefer tothe matches of the query nodes {φ:φ(v)=u} v′∈N(v) (cid:2) corresponding to Φ as the most probable matches. To find these Fφ(v,u) most probable matches, the standard memoization technique can beused afterthetermination ofour iterativeinference algorithm. | + P Ui−1{(zv′,φ(v′)) } v′∈N(v) First,aquerynode,sayv,isselectedrandomly,anditsmostprob- (cid:3) ablematch,denotedbyΦ(v)∈M(v),isdeterminedasfollows: ={φ:φm(vin)=u}v′∈PN(v) β(v)·∆+(PQ(v,v′),PG(u,φ(v′)))+Ui−1(v′,φ(v′)) (cid:2) (cid:3) Φ(v)=argminUi′(v,u) (11) +∆L(LQ(v),L(u)) u∈M(v) In Eq. 11, i = i′ denotes the final iteration. For the remaining =v′∈PN(v){φ:φm(vin)=u} β(v)·∆+(PQ(v,v′),PG(u,φ(v′)))+Ui−1(v′,φ(v′)) nodes, the most probable matches are determined by memoizing (cid:2) (cid:3) recursively, i.e., wekeep trackof thematches of theneighboring Wi(v,u,v′) nodesthatgiverisetothemostprobablematchofthecurrentnode. | +∆L(LQ{(vz),L(u)) } Forexample,themostprobablematchesΦ(v′)ofallv′∈N(v)are =∆L(LQ(v),L(u))+ Wi(v,u,v′) obtainedusingthemostprobablematchofvasfollows. v′X∈N(v) Hence,thetheorem. φp = argmin Fφ(v,Φ(v))+ Ui′−1 v′,φ(v′) {φ:φ(v)=Φ(v)}(cid:2) v′X∈N(v) (cid:0) (cid:1)(cid:3) It follows from Theorem 3 that the inference cost Ui(v,u) of Φ(v′)=φp(v′) (12) nodesvanducanbeefficientlycomputedinpolynomialtime,by (1)determiningthepartialinferencecostWi(v,u,v′)foreachv′∈ The aforementioned memoization technique is performed until N(v), and (2) aggregating these partial inference costs with ∆L themostprobablematchesofallquerynodesarecomputed. LQ(v),L(u) . The aforementioned technique also keeps track ComputationofInferenceCosts. Astraightforwardapproachto ofwhichmatchesoftheneighboring querynodesgiverisetothe (cid:0) (cid:1) determine the inference cost Ui(v,u) for a query node v and its mostprobablematchofthecurrentquerynode. Thisinformation candidateu(Eq.9)considersallpossiblecombinationofmatches isrequiredduringmatchingrefinement. for all nodes v′ ∈ N(v), which has exponential timecomplexity Time complexity. We analyze the time complexity of the algo- and might beveryexpensive. Inthissection, wepropose atech- rithm NemaInfer. Let us denote the number of nodes in the tar- niquetocomputetheinferencecostinpolynomialtime. getgraphGandthequerygraphQas|V|and|VQ|,respectively. Partialinferencecost. ToevaluatetheinferencecostUi(v,u)for (1)ItrequiresO(|VQ|·|V|)timetoidentifyM(v)foreachquery aquery node v anditscandidate uatiterationiof thealgorithm node v ∈ VQ (line 1). (2) We denote the maximum number of candidates per query node as mQ, and the maximum number of the performance of our inference algorithm NemaInfer. Wefirst h-hop neighbors of each query node as dQ. The computation of introducethenotionofisolatedcandidates. Othe(mop2Qti·mdaQl)mfaotlclhowOintg(vT)hpeeorreqmue3ry(lninoede10v).haTshetirmefeorceo,mthpeletximitye MIso(lva)t,eadnCoadneduid∈atMes(.vG)iivseannaisqoulaetreydncoadnedivdaatnedoiftsv,caifndidateset requiredforeachiterationofNemaInferisO(|VQ|·m2Q·dQ). If therearetotalI iterations,theoveralltimecomplexityisgivenby O(|VQ|·|V|+I ·|VQ|·m2Q ·dQ). Observe that |VQ|, I, |dQ| {u′ :u′ ∈M(v′),v′ ∈N(v)}∩{u′′ :u′′ ∈N(u)}=∅ (15) andmQaretypicallysmall.Indeed,asverifiedinourexperiments Intuitively, thenodeuisanisolatedcandidateofaquerynode (Section7),Iistypicallylessthan4andmQis35,forquerygraphs vifnoneofthecandidateswithinh-hopneighborsofvareinthe with5nodesandreallifegraphscontaining12Mnodes. h-hopneighborhoodofu;otherwise,itisanon-isolatedcandidate 5.2 Generalized Queries ofv.Thus,anisolatedcandidateuofvcannotbematchedwithv. Inthissection,weextendNemaInferforthreegeneralizedcases, Toefficientlyfindthenon-isolatedcandidates,weproposeanop- namely,Top-kmatches,unlabeledqueries,andlabelededges. timizationproblem,basedonverificationcostandcandidatecover. Top-kMatches. Inmanyapplications,thequerygraphisnotsub- Verification Cost. The verification cost associated with a query graph isomorphic to the target network; and hence, weare inter- nodev isdefinedasthetimecomplexitytoverifyallnodesinits ested in identifying the top-k matches rather than only the best candidatesetM(v),whethertheyarenon-isolatedcandidates.Note match. Given the target network G and the query graph Q, the thatthecomplexityofverifyingwhethersomenodeu∈M(v)isa top-ksubgraphmatchingproblemistoidentifythetop-kmatches non-isolatedcandidateisO |N(u)|+ |M(v′)| . v′∈N(v) foraselectedquerynodev∈VQ. Candidate Cover. There exists dependencies between two non- (cid:0) P (cid:1) ThealgorithmNemaInfercanbereadilyadaptedforthisprob- isolatedcandidates:ifuisanon-isolatedcandidateofv,thenthere lem.(1)Thealgorithmcomputestheinferencecostsandterminates must exist a node u′ ∈ N(u), such that, u′ ∈ M(v′) for some at line12. (2)Weidentifythetop-k mostprobable matchesof v v′ ∈ N(v). Clearly, u′ is a non-isolated candidate of v′. If we (Eq.11). (3)Foreachofthesetop-kmostprobablematchesofv, verifiedallcandidates{u′ : u′ ∈ M(v′),v′ ∈ N(v)},thereisno weapplytherecursivememoizingtechnique(Eq.12)todetermine needtoverifythecandidatesinM(v)again.Thus,onemayreduce thecorrespondingmostprobablematchesforotherquerynodes. redundantverificationsusinganotionofcandidatecover. Matching Query with Unlabeled Nodes. A query graph may WedefinecandidatecoverC(Q)asasetofquerynodesv,such have nodes with unknown labels, e.g., query graphs constructed that,forallv′∈VQ,eitherv′ ∈C(Q),orv′ ∈N(v). fromRDFqueries.NeMacanbeadaptedtoevaluatesuchqueries. First,weidentifyallthenodesfromthetargetnetworkthatcanbe C(Q)={v:∀v′(v′ ∈C(Q)∨v′ ∈N(v))} (16) matched withsome labeled query node based on label similarity. Next, we find the subgraph induced by all those matched nodes Allnon-isolatedcandidatenodescanbeidentifiedbyverifying fromthetargetnetworkalongwiththeirneighborswithinh-hops. onlythequerynodesinC(Q). Wedefinetheverificationcostofa Allnodesinthissubgraphareconsideredasthecandidatesforthe candidatecoverasthesumoftheverificationcostsofitsconstituent unlabeledquerynodes. ThealgorithmNemaInferistheninvoked querynodes.Next,weintroducethecandidatecoverproblem. toidentifythematches. Inaddition, iftheunlabeledquerynodes containtypeinformation,thecandidatesetscanfurtherberefined. ProblemStatement 2. CandidateCover. Givena query graph Q,findthecandidatecoverwiththeminimumverificationcost. NeMawithEdgeLabels.TheNeMacostfunctioncanbeadapted toconsidertheedgelabels. Specifically,weconcatenatetheedge Thefollowingresultshowsthatthecandidatecoverproblemis labels along the shortest path between a pair of nodes, and then intractable,butapproximablewithinafactor2inpolynomialtime. updatetheneighborhoodmatchingcost(Equation2)asfollows. Theorem4. Thecandidatecoverproblemis(1)NP-hard,and(2) N (v,u) φ 2-approximable. v′∈N(v)[∆+(PQ(v,v′),PG(u,u′)))+∆L(s(v,v′),s(u,u′)))] PROOF. We show that this problem is NP-hard by reduction = P [PQ(v,v′)+1] from NP-complete weighted minimum vertex cover problem [8]. v′∈N(v) Given a decision version of the weighted minimum vertex cover P problem,weconstructaninstanceofthecandidatecoverproblem, Here, u = φ(v),u′ = φ(v′), ands(v,v′)concatenates theedge wherethevertexweightsareconsideredasthecorrespondingver- labelsalongtheshortestpathbetweenvandv′.Since,weconsider ification costs. Assuming h = 1, the minimum weighted vertex theedgelabelsalongtheshortestpathbetweenapairofnodes,the coverwillbeourcandidatecover. Onecanapplylinearprogram- asymptoticaltimecomplexityofNeMaInferremainsthesame. mingtosolvethisproblemwith2-approximation[8]. 6. INDEXINGANDOPTIMIZATION 6.2 Indexing Inthissection,wediscussindexingandoptimizationtechniques We introduce indexing technique to improve the efficiency of toimprovetheefficiencyofournetworkmatchingalgorithm. the inference algorithm. (1) During the off-line indexing phase, itcomputestheneighborhood vectorsR(u)forallnodesuinthe 6.1 Candidate Selection targetnetworkG,andstoresthevectorsintheindex.(2)Duringthe Thecandidatesetofaquerynodeisdefinedintermsofthelabel on-linenetworkmatchingtechnique,ifuisselectedasacandidate similarity function (see Section 3), which may include candidate ofv,itappliesEq.15toverifywhetheruisanisolatedcandidate nodesthatdonotmatchthequerynodeduetoneighborhoodmis- ofv.Ifso,uiseliminatedfromthecandidatesetofv. match. We introduce optimization techniques to efficiently filter Our index structure has space and time complexity O(ndh), G suchcandidatenodesasmuchaspossible,andtherebyimproving where|V|=n,dG =averagenodedegreeinG,andh=depthof IMDB INDEX 1 LABEL NOISE TH.=35% LABEL NOISE TH.=35% YAGO MATCH (TOP-1) LABEL NOISE TH.=50% LABEL NOISE TH.=50% 1 DBpedia MATCH (TOP-3) 0.03 F1-MEASURE 000...789 TIME (SEC) 1 10 010000 1000 MATCH (TOP-5) F1-MEASURE 000...789 TIME (SEC) 00..0012 0.3 0.6 0 0.3 0.5 0.7 1 0 .00.21 0.6 0 35 50 0.005 0 35 50 λ IMDB YAGODBpedia LABEL NOISE (%) LABEL NOISE (%) (a) Effectiveness (b) Efficiency (a) Effectiveness(DBpedia) (b) MatchTime(DBpedia) Figure5:QueryPerformance Figure6:PerformanceagainstLabelNoise Dataset #Nodes #Edges vectorization.ForNeMawithedgelabels(Section5.2),theasymp- IMDB 2,932,657 11,040,263 totical time and space complexity of indexing remains the same, YAGO 12,811,149 18,282,215 sinceweconsideredgelabelsalongtheshortestpaths. DBpedia 5,177,018 20,835,327 Dynamicmaintenanceof theindex. Our indexing methodscan Table3:DatasetSizes efficientlyaccommodatedynamicupdatesinthetargetnetwork. If The nodes in YAGO and DBpedia are annotated with labels, anodeu(andtheedgesattachedtoit)isaddedordeleted,onlythe whilethenodesinIMDBareannotatedwithbothtypesandlabels. indexesofu’sh-hopneighborsneedtobeupdated.Ifasingleedge Hence,weusedthetypeinformationassociatedwiththenodes,in (u,u′)isaddedordeleted,onlytheh−1hopneighborsofbothu additiontotheirlabels,whilequeryingtheIMDBnetwork. andu′areupdated,thusreducingtheredundantcomputation. Querygraphs. Wegeneratedthequerygraphsbyextractingsub- 6.3 Optimizationfortop-k matching graphs from the target graphs, and then introduced noise to each The inference algorithm NemaInfer can be adapted toidentify querygraph.Specifically,thequerygenerationwascontrolledby: the top-k matches. For small values of k, it is possible to prune • nodenumberanddiameter,denotedby|VQ|andDQ,respec- candidatenodesbysettingacostthreshold. Thecostthresholdǫc tively, where the query diameter is the maximum distance isinitiallysettoasmallvalueǫ0. IfUi(v,u) > ǫc forsomeu ∈ betweenanytwonodesinthequerygraphQ. M(v)atiterationioftheinferencealgorithm,thenuiseliminated from the candidate set M(v) for the subsequent iterations. After • structuralnoise,theratioofthenumberofedgeupdates(ran- dominsertionanddeletionofedges)inQtothenumberof termination,ifthetop-kmatchescannotbeidentified,weincrease edgesintheextractedsubgraph;and ǫcbyasmallvalue,andrepeatthestepsabove. Thecorrectnessof thismethodisensuredbyTheorem5. • labelnoise,measuredbytheJaccardsimilaritybetweenthe labelsof nodesintheextractedsubgraph andtheirupdated Theorem 5. If Ui(v,u) > ǫc at the i-th round of inference al- counterpartsinQ,wheretheupdatedlabelswereobtainedby gorithm, then for all j > i, Uj(v,u) > ǫc at the j-th round of insertingrandomlygeneratedwordstothequerynodelabels. inferencealgorithm. We used Jaccard similarity as the label similarity measure. PROOF. ItfollowsdirectlyfromEq.9andthefactthatUi(v,u) Specifically, given a query node v and a target node u, the label ≥0foralli≥0,overallpairsv,u,whereu∈M(v). difference ∆L(LQ(v),L(u)) is defined as 1− ||wwvv∩∪wwuu||, where Hence,wecaneliminateufromM(v),wheneverUi(v,u)>ǫc wv andwuarethesetofwordsintheirlabelLQ(v)andL(u),respectively. Recall that weallowed some noise inthe node labels occursforthefirsttimeatsomeiterationiofNemaInfer. byvaryingthelabelmatchingcostthresholdǫinourmatchingal- gorithm (Problem Statement 1). A node u in the target network 7. EXPERIMENTAL RESULTS isconsideredacandidatetomatchwithaquerynodeviftheirla- We present three sets of empirical results over three real-life bel difference ∆L(LQ(v),L(u)) isless than the predefined cost datasets to evaluate (1) the effectiveness and efficiency (Sec- thresholdǫ,referredtoasthelabelnoisethreshold. tion 7.2), (2) scalability (Section 7.3), and (3) optimization tech- Evaluationmetrics. Sincethequerygraphswereextractedfrom niques(Section7.4)underlyingtheNeMaframework. targetgraphs,onealreadyhasthecorrectnodematches. Now,the 7.1 Experimental Setup effectivenessofNeMaismeasuredasfollows. Precision(P)isthe ratioofthecorrectlydiscoverednodematchesoveralldiscovered GraphDataSets. Weusedthefollowingthreereallifedatasets, node matches. Recall (R) is the ratioof the correctly discovered each represents a target graph. (1) IMDB Network3. TheInter- nodematchesoverallcorrectnodematches.F1-Measurecombines netMovieDatabase(IMDB)consistsoftheentitiesofmovies,TV theresultsofprecisionandrecall,i.e., series, actors, directors, producers, amongothers, aswellastheir 2 relationships. (2) YAGOEntityRelationshipGraph4. YAGO is F1= (17) (1/R+1/P) aknowledgebasewithinformationharvestedfromtheWikipedia, WordNetandGeoNames.Itcontainsabout20millionRDFtriples. Weconsideredthetop-1matchtoevaluateprecision,recall,and (3) DBpedia Knowledge Base5. DBpedia extracts information F1-measure. Thus, weobtainedthesamevaluesforthem. How- fromtheWikipedia. Weconsidered 22 millionRDFtriplesfrom ever,precisionandrecallwillbeusefulwhileanalyzingNESS. DBpediaarticlecategories,infoboxproperties,andpersondata. ComparingMethods. WecomparedNeMawithkeywordsearch 3http://www.imdb.com/interfaces#plain (BLINKS[16])andvariousgraphqueryingmethods: SAGA[35], 4http://www.mpi-inf.mpg.de/yago-naga/yago/ IsoRank[33],NESS[20],andNeMa -avariationofNeMafol- gs 5http://dbpedia.org/About lowinggStore[43].AllthesemethodswereimplementedinC++. F1-MEASURE 0000....9999 12468 nnnQQQ===357,,, DDDQQQ===123 MATCH TIME (SEC) 00000 0.....00001.124682 nnnQQQ===357,,, DDDQQQ===123 F1-MEASURE 0000....9999 12468 nnnQQQ===357,,, DDDQQQ===123 MATCH TIME (SEC) 0 .1235 nnnQQQ===357,,, DDDQQQ===123 0.9 0 0.9 0.3 0 10 20 30 40 0 10 20 30 40 0 25 35 50 80 0 25 35 50 80 STRUCTURAL NOISE (%) STRUCTURAL NOISE (%) LABEL NOISE THRESHOLD (%) LABEL NOISE THRESHOLD (%) (a) (b) (c) (d) Figure7:QueryPerformancevs.Noise(IMDB);nQ:QueryNodes,DQ:QueryDiameter F1-MEASURE 0000 ....16789 nnnQQQ===573,,, DDDQQQ===231 MATCH TIME (SEC) 00 0..12.255 nnnQQQ===357,,, DDDQQQ===123 F1-MEASURE 0000 000....6789 ...17895555 nnnQQQ===357,,, DDDQQQ===123 MATCH TIME (SEC) 001 ...1473 nnnQQQ===357,,, DDDQQQ===123 0.6 0.5 0.1 0.55 0.1 0 10 20 30 40 0 10 20 30 40 0 25 35 50 80 0 25 35 50 80 STRUCTURAL NOISE (%) STRUCTURAL NOISE (%) LABEL NOISE THRESHOLD (%) LABEL NOISE THRESHOLD (%) (a) (b) (c) (d) Figure8:QueryPerformancevs.Noise(YAGO);nQ:QueryNodes,DQ:QueryDiameter DE DE DIDATE PER QUERY NO 1 12350 2500000 nnnQQQ===573,,, DDDQQQ===231 DIDATE PER QUERY NO 1 12352 0000250 nnnQQQ===573,,, DDDQQQ===231 F1-MEASURE 00 0..89 .1955 nnQQ==57,, DDQQ==23 F1-MEASURE 0000 ....16789 nnQQ==57,, DDQQ==23 N N # CA 1 0 25 35 50 80 # CA 1 0 25 35 50 80 0.8 0 1 2 0.5 0 1 2 LABEL NOISE THRESHOLD (%) LABEL NOISE THRESHOLD (%) # UNLABELED QUERY NODE # UNLABELED QUERY NODE (a) IMDB (b) YAGO (a) Effectiveness(IMDB) (b) Effectiveness(YAGO) Figure9:#Candidatesvs.LabelNoise Figure11:EffectivenesswithUnlabeledQueryNodes # ITERATIONS 223 ...2482 nnnQQQ===573,,, DDDQQQ===231 # ITERATIONS 22 ..224 nnnQQQ===573,,, DDDQQQ===231 F1-MEASURE 0000....9999 12468 W W/O/ EEDDGGEE LLAABBEELL MATCH TIME (SEC) 0000....34565555 W W/O/ EEDDGGEE LLAABBEELL 1.7 1.7 0.9 0.25 0 25 35 50 80 0 25 35 50 80 0 10 20 30 40 0 10 20 30 40 LABEL NOISE THRESHOLD (%) LABEL NOISE THRESHOLD (%) STRUCTURAL NOISE (%) STRUCTURAL NOISE (%) (a) IMDB (b) YAGO (a) Effectiveness (b) Efficiency Figure10:#Iterationsvs.LabelNoise Figure12:PerformancewithEdgeLabels(IMDB) Inour experiments, (1)propagation factorαanddepthofvec- Figure5(b)reportstheefficiencyofNeMausingthesameset- torization h (Section 3) were set as 0.5 and 2, respectively [20], tingasinFigure5(a),includingtherunningtimeofoff-lineindex (2)theoptimalvaluesoftheproportionalityconstantλ(Eq.4)for construction (INDEX) and online query evaluation (MATCH). We different datasetswereobtained empirically(Figure5(a)), (3) the observedthefollowing. (a)NeMaidentifiesthebestmatchinless indexeswerestoredintheharddisk. Alltheexperimentswererun than 0.2 seconds, over all three datasets 6. (b) The top-k match usingasinglecoreina100GB,2.5GHzXeonserver. findingtimedoesnotvarysignificantly,overdifferentk,sinceour inferencingmethodisalwaysexecutedonce.(c)Thetimerequired 7.2 Effectiveness andEfficiency for indexing ismodest (e.g., 9862 secfor theYAGO dataset with 7.2.1 PerformanceoverReal-lifeDataSets 13Mnodesand18Medges). (d)Theindexingandqueryingtimes arelongeroverIMDB,duetoitshigherdensity. Intheseexperiments,weevaluatedtheperformanceofofNeMa 7.2.2 PerformanceagainstNoise over three real-lifegraphs, averaged over 100 queries (Figure5). For each target graph, we randomly generated 100 query graphs Inthissetofexperiments,weinvestigatedtheimpactofvarying with|VQ|=7andDQ =3. Wefixedthestructuralnoiseas30%, noises on the performance of NeMa. Three setsof query graphs labelnoiseas50%,andlabelnoisethresholdas50%. weregeneratedbysetting(i)|VQ| = 3, DQ = 1, (ii)|VQ| = 5, Figure 5(a) shows the effectiveness of NeMa over various DQ = 2,and(iii)|VQ|=7,DQ = 3. Undereachqueryset,100 datasets,andwithdifferentvaluesoftheproportionalityconstantλ. querygraphsweregenerated. Forallthethreedatasets,ouralgorithmalwayscorrectlyidentifies Varyinglabelnoise.Fixingthestructuralnoiseas30%,wevaried more than76% of thequery nodes. Specifically, theF1-measure thelabelnoisefrom0%to50%,andinvestigatedtheeffectiveness is0.94 forIMDB,withλ = 0.3, even whenweintroduced 30% ofNeMa,whenthelabelnoisethresholdwassetas35%and50%, structural noise and 50% label noise. Theeffectiveness ishigher overIMDBduetothetypeconstraintposedwiththequerynodes. 6our indexingandmatchingalgorithmcanbeparallelizedforev- Besides, the optimal value of λ lies between 0.3 ∼ 0.5 in these ery node. Hence, one may implement NeMa ina PREGEL[24] datasets. platform,forlargergraphs. P, R 1 P, R 1 EC) 4000 URE, 00..68 URE, 00..68 ME (S 500 MEAS 00..24 MEAS 00..24 CH TI 1.5 F1- 0 0 20 30 40 F1- 0 0 25 35 50 80 MAT 00..14 STRUCTURAL NOISE (%) LABEL NOISE (%) 0 25 35 50 80 LABEL NOISE (%) IsoRank (F1) NESS (F1) IsoRank (F1) NESS (F1) SAGA (F1) NeMags (F1) SAGA (F1) NeMags (F1) IsoRank NeMags NESS (P) BLINKS (F1) NESS (P) BLINKS (F1) SAGA BLINKS NESS (R) NeMa (F1) NESS (R) NeMa (F1) NESS NeMa (a) (b) (c) Figure13:ComparisonResults(IMDB):NESS,BLINKSModifiedforApproximateLabelMatch.NESSResultsCorrespondtoitsFilteringPhase. respectively. Asshown inFigure 6(a) over DBpedia, (1) the F1- nodes from 0 to 2. As shown in Figure 11, (1) the F1-measure measuredecreasesasthelabelnoiseincreases,sincethecandidate decreases over both datasets while the number of the unlabeled set of each query node may contain more candidates that arenot nodesincreases,becausetheunlabelednodesintroducemorecan- truematches, whichinturnreduces theeffectiveness, (2)theF1- didates, whichinturnreducestheeffectiveness, (2)theeffective- measure ishigher when the label noise threshold ishigher, since nessishigheroverIMDBduetothetypeconstraints,and(3)over thecandidate setsare morelikelytoinclude thecorrect matches. allthecases,theF1-measureisalwaysabove0.50. ObservethattheF1-measureisalwaysabove0.60. 7.2.4 PerformancewithPropagationDepth Figure 6(b) shows that NeMa efficiency is insensitive to label In these experiments, we analyzed the effect of propagation noise,butmoresensitivetolabelnoisethreshold.Thisisbecauseit depth h in our query performance. We randomly selected 100 takesNeMamoretimetoprocessthelargercandidatesetsforthe querygraphasthelabelnoisethresholdincreases. query graphs from YAGO, with |VQ| = 7, DQ = 3, structural noise 30%, label noise 50% and label noise threshold 50%. Ta- Varyingstructuralnoise.Fixingthelabelnoisethresholdas35% ble4showsthattheefficiencyofNeMadecreaseswithincreasing and label noise as 35%, we varied the structural noise from 0% h,especiallytheindextimeincreasesexponentiallywithh. How- to 40% inFigure 7(a) and 7(b). It can be observed that (a) both ever,forh=2,weobtainedanacceptableF1-measureof0.86. effectivenessandefficiencydecreaseasweincreasethestructural noise,and(b)withtheincreaseofthequerysize,botheffectiveness h=1 h=2 h=3 andefficiencyincrease. Thereasonisthat(1)largerquerieshave IndexTime(sec) 265 9862 236553 moreconstraintsintheneighborhoodofaquerynode,whichbene- MatchTime(sec) 0.58 0.92 2.76 F1-Measure 0.61 0.86 0.87 fittheidentificationofcorrectmatches,and(2)ittakeslongertime forNeMatocomparethematchingcostforlargerqueries. More- Table4:QueryPerformancewithVaryingh(YAGO) over,theF1-measureisalwaysabove0.93,andtherunningtimeis alwayslessthan0.1seconds,evenwith40%structuralnoise. 7.2.5 PerformancewithEdgeLabels Weverifiedthequeryperformanceinthepresenceofedgelabels Varyinglabelnoisethreshold.Fixingthestructuralnoiseas30% (Section5.2).Werandomlyselected100querygraphsfromIMDB, and the label noise as35%, weinvestigated theeffect of varying thelabelnoisethresholdonthequeryperformance. with with |VQ| = 7, DQ = 3, label noise 50% and label noise threshold 50%. Wevaried the structural noise from 0% to 40%. TheeffectivenessandefficiencyofNeMaoverIMDBareillus- The labels of the newly inserted edges in the query graphs were tratedinFigures7(c)and7(d),respectively. Weobservedthefol- assignedbygeneratingrandomstrings. Figure12showsthat, (1) lowing. (1)TheF1-measureinitiallyincreaseswhileweincrease whennostructuralnoiseisadded,theF1-measureremainssamefor thelabelnoisethreshold.Thisisbecausethequerynodelabelsare boththecasesoflabeledandunlabelededges. (2)However, with updated by adding random words. Hence, thehigher isthe label theadditionofstructuralnoise,theF1-measuredecreasesslightly noise threshold, there is more chance that the correct match of a for the case of labeled edges, since there are more noises in the querynodewillbeselectedinitscandidateset.(2)Whenthelabel query graphs due to the labels of newly inserted edges. (3) The noise threshold is more than 35%, the F1-measure does not im- runningtimeishigherforthecaseoflabelededgesbecauseaddi- provesignificantly. Therefore,theoptimalvalueofthelabelnoise tionaltimeisrequiredtomeasureedgelabelsimilarities. thresholdcanbedeterminedempiricallybasedonthequeryandtar- getgraphs.Ontheotherhand,therunningtimeofNeMaincreases 7.2.6 ComparisonwithExistingAlgorithms withtheincreaseofthelabelnoisethreshold. Thisisbecause(a) We compared the performance of NeMa with IsoRank [33], thecandidatematchesperquerynodeincreases(seeFigure9),and SAGA [35], NESS [20], gStore [43], and BLINKS [16]. (1) (b)thenumberofiterationsrequiredfortheconvergenceofournet- IsoRank and SAGA find optimal graph matches in smaller bio- workmatchingalgorithmalsoincreases(seeFigure10). Hence,it logicalnetworksconsideringstructureandnodelabelsimilarities. takesmoretimeforNeMatofindthematches. (2) NESS finds the top-k graph matches from a large network, 7.2.3 EffectivenesswithUnlabeledQueryNode but with strict node label equality. Hence, we modified NESS WenextverifiedtheeffectivenessofNeMainthepresenceofun- by allowing two nodes to be matched if their label difference is labeled nodesinquerygraphs. Theseexperiments simulateRDF withinthelabelnoisethreshold. (3)Weconsideredavariationof query answering (see Section 5.2). For these experiments, we NeMa,namely,NeMa ,whichallowslabeldifferencebutresorts gs randomly selected two sets of 100 query graphs each, where (i) to strict isomorphic matching. Thus, NeMags essentially follows |VQ| = 7, DQ = 3, and (ii) |VQ| = 5, DQ = 2, respectively. the same principle as gStore, which is a subgraph isomorphism- Fixingthestructuralnoiseas30%, labelnoiseas35%, andlabel based SPARQL query evaluation method with node label differ- noisethresholdas35%,wevariedthenumberofunlabeledquery ences. (4)BLINKS[16],akeywordsearchmethod,supportsonly

Description:

NeMa: Fast Graph Search with Label Similarity. Arijit Khan Yinghui Wu Charu C. Aggarwal Xifeng Yan. Computer Science. IBM T. J. Watson Research.

NeMa: Fast Graph Search with Label Similarity - Charu Aggarwal PDF

12 Pages·2012·0.3 MB·English

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview NeMa: Fast Graph Search with Label Similarity - Charu Aggarwal

Description:

NeMa: Fast Graph Search with Label Similarity. Arijit Khan Yinghui Wu Charu C. Aggarwal Xifeng Yan. Computer Science. IBM T. J. Watson Research.

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.