Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes

AIDAN HOGAN, Center for Semantic Web Research, DCC, University of Chile

Existential blank nodes greatly complicate a number of fundamental operations on RDF graphs. In particular, the problems of determining if two RDF graphs have the same structure modulo blank node labels (i.e., if they are isomorphic), or determining if two RDF graphs have the same meaning under simple semantics (i.e., if they are simple-equivalent), have no known polynomial-time algorithms. In this paper, we propose methods that can produce two canonical forms of an RDF graph. The first canonical form preserves isomorphism such that any two isomorphic RDF graphs will produce the same canonical form; this iso-canonical form is produced by modifying the well-known canonical labelling algorithm Nauty for application to RDF graphs. The second canonical form additionally preserves simple-equivalence such that any two simple-equivalent RDF graphs will produce the same canonical form; this equi-canonical form is produced by, in a preliminary step, leaning the RDF graph, and then computing the iso-canonical form. These algorithms have a number of practical applications, such as for identifying isomorphic or equivalent RDF graphs in a large collection without requiring pair-wise comparison, for computing checksums or signing RDF graphs, for applying consistent Skolemisation schemes where blank nodes are mapped in a canonical manner to IRIs, and so forth. Likewise, a variety of algorithms can be simplified by presupposing RDF graphs in one of these canonical forms. Both algorithms require exponential steps in the worst case; in our evaluation we demonstrate that there indeed exist difficult synthetic cases, but we also provide results over 9.9 million RDF graphs that suggest such cases occur infrequently in the real world, and that both canonical forms can be efficiently computed in all but a handful of such cases.
CCS Concepts: • Information systems → Resource Description Framework (RDF); • Mathematics of computing → Graph algorithms;

Additional Key Words and Phrases: Semantic Web, Linked Data, Skolemisation, Isomorphism, Signing

ACM Reference format:
Aidan Hogan. 2017. Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes. ACM Trans. Web 0, 0, Article 00 (January 2017), 60 pages. DOI: 0000001.0000001

This work was supported by the Millennium Nucleus Center for Semantic Web Research under Grant No. NC120004 and by Fondecyt Grant No. 11140900.
Author’s address: A. Hogan, DCC, Universidad de Chile, Avenida Beauchef 851, Santiago, Chile.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2017 ACM. 1559-1131/2017/1-ART00 $15.00 DOI: 0000001.0000001
ACM Transactions on the Web, Vol. 0, No. 0, Article 00. Publication date: January 2017.

1 INTRODUCTION

At the very core of the Semantic Web is the Resource Description Framework (RDF): a standard for publishing graph-structured data that uses IRIs as global identifiers such that graphs in remote locations on the Web can collaborate to contribute information about the same resources using consistent terminology in an interoperable manner. The adoption of RDF on the Web has been continuously growing, where we can point to the hundreds of datasets published as RDF using
Linked Data principles [22] spanning a variety of domains, including collections from governmental organisations, scientific communities, social web sites, media outlets, online encyclopaedias, and so forth [51]. Furthermore, hundreds of thousands of web-sites and hundreds of millions of web-pages now contain embedded RDFa [24] – incentivised by initiatives such as Schema.org (promoted by Google, Microsoft, Yahoo! and Yandex), and the Open Graph Protocol (promoted by Facebook) – with three of the largest providers being, for example, tripadvisor.com, yahoo.com and hotels.com [43].

Despite this trend of RDF playing an increasingly important role as a format for structured-data exchange on the Web, there are a number of fundamental operations over RDF graphs for which we lack practical algorithms. In fact, RDF does not consist purely of statements containing IRIs, but also supports literals that represent datatyped values such as strings or numbers, and, more pertinently for the current scope, blank nodes that represent a resource without an explicit identifier. It is the presence of blank nodes in RDF graphs that particularly complicates matters.

In the original W3C Recommendation for RDF published in 1999 [35], anonymous nodes were introduced as a means of describing a resource without an explicit identifier, quoting use-cases such as the representation of bags of resources in RDF, the use of reification to describe RDF statements as if they were themselves resources, or simply to describe resources that did not have a native URI/IRI associated with them. When the W3C Recommendation for RDF was revised in 2004 [20], the serialisation of RDF graphs as triples was supported through the introduction of the modern notion of blank nodes to represent resources without explicit identifiers; these blank nodes were defined as existential variables that are locally-scoped. Intuitively speaking, this existential semantics captures the idea that one can relabel the blank nodes of an RDF graph in a one-to-one manner without affecting the structure [11] or the semantics [21] of the RDF graph, and without having to worry if those same labels already exist in another graph elsewhere on the Web.
Practically speaking, blank nodes are used for two main reasons [28]:
• Blank nodes allow publishers to avoid having to explicitly identify specific resources, where RDF syntaxes such as Turtle [5] use this property to enable various convenient shortcuts for specifying ordered lists, n-ary relations, etc.; tools parsing these syntaxes can invent blank nodes to represent these implicit nodes.
• In other cases, publishers may use blank nodes to represent true existential variables, where a value is known to exist, but the exact value is not known.

In a recent questionnaire we conducted with the Semantic Web community, we found that publishers may (hypothetically) use blank nodes sometimes in one case, or the other, or both [28]. In any case, blank nodes have become widely used on the Web, where in previous work we found that in a survey of 8.4 million Web documents containing RDF crawled from 829 pay-level domains¹, 66% of domains and 45% of documents used blank nodes [28].

¹ Domains such as bbc.co.uk or facebook.com, but not news.bbc.co.uk or co.uk.

Unfortunately, the presence of blank nodes in RDF complicates some fundamental operations on RDF graphs. For example, imagine two different tools parsing the same RDF graph – say, for example, retrieved from the same location on the Web in the same syntax – into a set of triples, labelling blank nodes in an arbitrary manner. Now take the two resulting sets of triples and say we wish to determine if the two RDF graphs are the same modulo blank node labels; i.e., to determine if they are isomorphic [11]. If the original RDF graph did not contain blank nodes, this process is trivially possible in polynomial time by checking if both sets of triples are equal, for example, by sorting both sets of triples and then comparing them sequentially. However, if the original RDF graph contains blank nodes, then the problem of deciding RDF isomorphism has the same computational complexity class as graph isomorphism (GI-complete), for which there are no known polynomial-time algorithms (if such an algorithm were found, it would establish GI = P).
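To make the contrast concrete, the following is a minimal sketch of the sort-and-compare check for graphs without blank nodes. It assumes, purely for illustration, that triples are encoded as tuples of strings and that blank nodes are written with a `_:` prefix; the function name is hypothetical. The last comparison shows why the same check is no longer sufficient once blank node labels are chosen arbitrarily by each parser.

```python
def same_ground_graph(triples_a, triples_b):
    """Compare two graphs given as iterables of (s, p, o) string tuples."""
    # An RDF graph is a set, so duplicates are dropped; sorting removes any
    # dependence on the order in which a parser happened to emit the triples.
    return sorted(set(triples_a)) == sorted(set(triples_b))


# Two parses of the same ground graph, emitted in different orders.
parse_1 = [("ex:Chile", "ex:president", "ex:MBachelet"),
           ("ex:MBachelet", "ex:startYear", '"2014"')]
parse_2 = list(reversed(parse_1))

# Two parses of the same one-triple graph with a blank node (assumed labels).
blank_1 = [("ex:Chile", "ex:presidency", "_:x")]
blank_2 = [("ex:Chile", "ex:presidency", "_:y")]

print(same_ground_graph(parse_1, parse_2))  # equal as sets of triples
print(same_ground_graph(blank_1, blank_2))  # isomorphic, yet not equal as sets
```

The second comparison fails only because of the labels, which is precisely the gap that a canonical labelling of blank nodes is meant to close.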
While isomorphism refers to a structural comparison of RDF graphs, it is also possible to consider a semantic comparison of such graphs. The RDF semantics [21] defines a notion of entailment between RDF graphs such that one RDF graph entails another, loosely speaking, if the entailed graph adds no new information over the entailing graph; in other words, if the entailing graph is considered to be true under the model theory of the semantics, then the entailed graph must likewise be considered true. Two RDF graphs that entail each other are thus considered equivalent: as having the same meaning under a particular semantics. The foundational semantics for RDF, called simple semantics, does not consider any special vocabulary nor the interpretation of datatype values; rather, it considers the meaning of RDF graphs considering blank nodes as existential variables and IRIs and literals as ground terms denoting a particular resource. Given two RDF graphs G and H without blank nodes, asking if G entails H is the same as asking if H contains a subset of the triples of G, which again is possible in polynomial time by, for example, sorting both sets of triples and comparing them sequentially to see if every triple of H is in G. However, as we discuss later, if both RDF graphs contain blank nodes, the problem is in the same complexity class as the problem of graph homomorphism (namely NP-complete), implying that there is no known polynomial-time solution. Furthermore, it is known that determining if two RDF graphs are simple-equivalent – i.e., if they simple-entail each other – falls into the same complexity class [18].
In summary then, there are no known polynomial-time algorithms for these two fundamental operations of determining if two RDF graphs are structurally the same (per isomorphism) or semantically the same (per simple equivalence).²

In this paper, we propose two different canonical forms for RDF graphs. First we must define two RDF graphs as equal (or we may sometimes say the same) if and only if they are equal as sets of RDF triples considering blank node labels as fixed in the same manner as IRIs and literals. The first canonical form, which we call iso-canonical, is an RDF graph unique for each set of isomorphic RDF graphs; in other words, it is a form that is canonical with respect to the structure of RDF graphs. The second canonical form, which we call equi-canonical, is an RDF graph unique for each set of simple-equivalent RDF graphs; in other words, it is a form that is canonical with respect to the (simple) semantics of RDF graphs. More specifically, two RDF graphs are isomorphic if and only if their iso-canonical forms are the same; two RDF graphs are simple-equivalent if and only if their equi-canonical forms are the same.

These canonical forms have a number of use-cases, including:
• given a large set of RDF graphs, detect/remove graphs that are duplicates;
• given an RDF graph, compute a hash of that RDF graph, which can be used for computing and verifying checksums, signatures, etc.;
• given an RDF graph, Skolemise the blank nodes in the RDF graph – replacing them with fresh IRIs – in a deterministic manner based on the content of the graph.

Our methods do not rely on the syntax of the RDF documents in question, but rather operate on the abstract representation of an RDF graph as a set of triples, where the user can decide whether they wish to consider duplicates, signatures, Skolem constants, etc., to be consistent with respect to either isomorphism or (simple) equivalence.

² Although polynomial-time algorithms have been proposed in the literature for computing canonical forms of RDF graphs with respect to isomorphism (e.g., [2, 9, 32]), these may not always yield correct results. We will discuss such works in more detail in Section 7.
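As an illustration of the hashing use-case, the sketch below computes a digest over a graph that is assumed to be already in one of the canonical forms (or to contain no blank nodes), so that its labels can be treated as fixed. The function name, the string encoding of triples, the line-based serialisation and the choice of SHA-256 are all assumptions made for this example rather than part of the methods proposed in this paper; the IRI `ex:P1` is likewise hypothetical.

```python
import hashlib


def graph_digest(triples):
    """Hash a graph whose labels are already fixed (ground or canonicalised)."""
    # Serialise each triple on its own line and sort the lines, so that the
    # digest depends only on the set of triples, not on their input order.
    # Note: this naive serialisation is illustrative; a real scheme would need
    # an unambiguous encoding of terms (e.g., escaping within literals).
    lines = sorted(f"{s} {p} {o} ." for s, p, o in set(triples))
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


graph = [("ex:Chile", "ex:presidency", "ex:P1"),
         ("ex:P1", "ex:president", "ex:MBachelet")]

print(graph_digest(graph) == graph_digest(list(reversed(graph))))
```

Applied to the iso-canonical form, such a digest would agree for isomorphic graphs; applied to the equi-canonical form, it would agree for simple-equivalent graphs.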
Given an input RDF graph, we propose algorithms for computing both its iso-canonical form and equi-canonical form. As expected from earlier discussion, neither of these algorithms is in polynomial time: both have exponential-time worst-case performance in the general case. However, unlike the more general problems of graph isomorphism and graph homomorphism, in the case of real-world RDF graphs, we often have ground information that helps to distinguish blank nodes. Hence in our evaluation, we first present results for computing both canonical forms over the BTC–14 dataset [31] – a collection of 43.6 million RDF graphs of which 9.9 million contain blank nodes – where said results suggest that computing these forms is efficient in all but a handful of cases. To stress-test our algorithms, we also present results for canonicalising a collection of synthetic graphs at various sizes, which give some idea of the type of RDF graph required to invoke exponential runtime, arguing that such graphs are unlikely to occur naturally in practice.

This paper is an extension of earlier work [27] where we first introduced methods to canonically label blank nodes, computing an iso-canonical form of RDF graphs for the purposes of Skolemisation. Aside from extended discussion throughout, the main novel contribution of this paper is to discuss an algorithm for leaning RDF graphs and for computing their equi-canonical form in a manner that takes into account not only structural but also semantic identity. In this latter contribution, we take some of the ideas and experiences learnt from another previous work where we presented some initial ideas on leaning RDF graphs [28]; however, the methods we present in this paper focus on leaning potentially complex RDF graphs in memory where, in particular, we present a novel depth-first search algorithm that is shown to perform better than a variety of baseline methods.

We begin by presenting some preliminaries relating to the structure and semantics of RDF graphs (Section 2).
We then present a theoretical analysis with a mix of new and existing results that help establish both the hardness of the canonicalisation problems we propose to tackle, as well as high-level approaches by which they can be computed (Section 3). Afterwards, we present in detail our algorithms for computing the iso-canonical version of an RDF graph, which relies on a canonical labelling of blank nodes (Section 4); and the equi-canonical version of an RDF graph (Section 5), which relies on a pre-processing step that leans the RDF graph. We then present evaluation results over collections of both real-world and synthetic RDF graphs (Section 6). We then discuss related works before concluding the paper (Sections 7 and 8).

2 PRELIMINARIES

We now present some formal preliminaries with respect to RDF graphs, isomorphism, and the simple semantics of RDF.

2.1 RDF terms, triples and graphs

RDF graphs are sets of triples containing RDF terms, with certain restrictions on which terms can appear in which positions of a triple.

Definition 2.1 (RDF term). Let I, L and B denote the infinite sets of IRIs, literals and blank nodes respectively. These sets are pair-wise disjoint. We refer generically to an element of one of these sets as an RDF term. We refer to elements of the set IL (i.e., I ∪ L) as ground RDF terms.

Definition 2.2 (RDF triple). We call a triple (s, p, o) ∈ IB × I × ILB an RDF triple, where the first element, called the subject, must be an IRI or a blank node; the second element, called the predicate, must be an IRI; and the third element, called the object, can be any RDF term.

Definition 2.3 (RDF graph). An RDF graph G ⊂ IB × I × ILB is a finite set of RDF triples. We denote by terms(G) the set of all RDF terms appearing in G, and by bnodes(G) the set of all blank nodes appearing in G.

Remark 1. We say that two RDF graphs G and H are equal, or the same, if and only if G = H in terms of set equality.
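The definitions above can be mirrored in a small data model. The following sketch is only one possible encoding: the class names `IRI`, `Literal` and `BNode` and the helper `is_rdf_triple` are hypothetical names introduced here, while `terms` and `bnodes` follow the notation of Definition 2.3. Frozen dataclasses are used so that the three kinds of term remain pair-wise distinct even when they carry the same string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    lexical: str


@dataclass(frozen=True)
class BNode:
    label: str


def is_rdf_triple(s, p, o):
    """Check the positional restrictions of Definition 2.2: IB x I x ILB."""
    return (isinstance(s, (IRI, BNode))
            and isinstance(p, IRI)
            and isinstance(o, (IRI, Literal, BNode)))


def terms(graph):
    """All RDF terms appearing in a graph (a set of 3-tuples of terms)."""
    return {term for triple in graph for term in triple}


def bnodes(graph):
    """All blank nodes appearing in a graph."""
    return {term for term in terms(graph) if isinstance(term, BNode)}


G = {(IRI("ex:Chile"), IRI("ex:presidency"), BNode("a1")),
     (BNode("a1"), IRI("ex:startYear"), Literal("2014"))}

print(all(is_rdf_triple(*t) for t in G))
print(bnodes(G))
```

Since an RDF graph is a finite set of triples, Python's built-in `set` equality on such a structure coincides with the notion of equality in Remark 1.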
2.2 RDF isomorphism

Two RDF graphs that are the same modulo blank node labels – i.e., where one can be obtained from the other through a one-to-one mapping of blank nodes – are called isomorphic. We first give a preliminary definition to capture the idea of mapping blank nodes to other RDF terms:

Definition 2.4 (Blank node mapping and bijection). Let µ : ILB → ILB be a partial mapping of RDF terms to RDF terms, where we denote by dom(µ) the domain of µ and by codom(µ) the codomain of µ. If µ is the identity on IL, we call it a blank node mapping. If µ is a blank node mapping that maps blank nodes in dom(µ) to blank nodes in codom(µ) in a bijective manner, we call it a blank node bijection. Abusing notation, given an RDF graph G, we may use µ(G) to denote the image of G under µ (i.e., the result of applying µ to every term in G).

We are now ready to define the notion of isomorphism between RDF graphs.

Definition 2.5 (RDF isomorphism). Two RDF graphs G and H are isomorphic, denoted G ≅ H, if and only if there exists a blank node bijection µ such that µ(G) = H, in which case we call µ an isomorphism.

Lemma 2.6. RDF isomorphism (≅) is an equivalence relation.

Proof. First, ≅ is reflexive per the existence of the identity map µ on blank nodes, which is a blank node bijection. Second, ≅ is symmetric since if G ≅ H, then there exists µ such that µ(G) = H and such that µ⁻¹(H) = G, where µ⁻¹ is also a blank node bijection. Third, ≅ is transitive since if G ≅ H and H ≅ I, then there exist blank node bijections µ and µ′ such that µ(G) = H, µ′(H) = I, and thus µ′(µ(G)) = I, where µ′ ∘ µ is a blank node bijection that witnesses G ≅ I. □

Remark 2. If G = H, then G ≅ H. □

Example 2.7. Take the following two RDF graphs, with G on the left and H on the right, where the term 2014 is a literal (denoted with a square box), all terms prefixed with underscore are blank nodes, and all other terms (in the ex: example namespace) are IRIs. Now we consider: are these RDF graphs isomorphic?
[Figure: RDF graphs G and H, each linking ex:Chile by ex:presidency to two blank nodes that are in turn linked by ex:president to ex:MBachelet, with ex:startYear 2014 given for one of the two blank nodes. G uses the blank nodes _:a1 and _:a2; H uses the blank nodes _:bA and _:b.]

In fact they are: there is a blank node bijection µ such that µ(_:a1) = _:bA and µ(_:a2) = _:b where µ(G) = H. We could also take the inverse mapping µ⁻¹, where µ⁻¹(H) = G and where µ⁻¹ is also a blank node bijection. Thus we conclude G ≅ H. □

A related concept to isomorphism – and one that will play an important role later – is that of an automorphism, which is an isomorphism that maps an RDF graph to itself. Intuitively speaking, automorphisms represent a form of symmetry in the graph.

Definition 2.8 (RDF automorphism). An automorphism of an RDF graph G is an isomorphism that maps G to itself; i.e., a blank node mapping µ is an automorphism of G if µ(G) = G. If µ is the identity mapping on blank nodes in G then µ is a trivial automorphism; otherwise µ is a non-trivial automorphism. We denote the set of all automorphisms of G by Aut(G).

Example 2.9. We give the automorphisms for two RDF graphs G and H. Trivial automorphisms (i.e., those that are the identity mapping) are marked as such.

[Figure: RDF graph G over the blank nodes _:a and _:b, and RDF graph H over the blank nodes _:c, _:d and _:e; all edges are labelled :p.]

Aut(G):
  µ(·)   _:a   _:b
         _:a   _:b   (trivial)
         _:b   _:a

Aut(H):
  µ(·)   _:c   _:d   _:e
         _:c   _:d   _:e   (trivial)
         _:d   _:e   _:c
         _:e   _:c   _:d

Applying any of the automorphisms shown for the graph in question would lead to the same graph (and not just an isomorphic copy).
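Definition 2.8 can be checked directly, if naively, by enumerating every permutation of the blank nodes of a graph and keeping those that map the graph onto itself. The sketch below does exactly that; it is exponential in the number of blank nodes and is meant only to illustrate the definition. It assumes triples are tuples of strings with blank nodes prefixed by `_:`, the function names are hypothetical, and the triples used for H are those listed for this graph in Example 3.7.

```python
from itertools import permutations


def is_bnode(term):
    # Assumed encoding: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def apply_mapping(mu, graph):
    """Image of a graph under a blank node mapping (identity elsewhere)."""
    return {tuple(mu.get(term, term) for term in triple) for triple in graph}


def automorphisms(graph):
    """Enumerate all blank node bijections mu with mu(graph) == graph."""
    blanks = sorted({t for triple in graph for t in triple if is_bnode(t)})
    found = []
    for image in permutations(blanks):
        mu = dict(zip(blanks, image))
        if apply_mapping(mu, graph) == set(graph):
            found.append(mu)
    return found


H = {("_:c", ":p", "_:d"), ("_:d", ":p", "_:e"), ("_:e", ":p", "_:c")}

for mu in automorphisms(H):
    print(mu)
```

For this H, the enumeration yields the identity together with the two rotations of the cycle, matching the three rows of Aut(H) in Example 2.9.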
□

2.3 Simple semantics, interpretations, entailment and equivalence

The RDF (1.1) Semantics recommendation [21] defines four model-theoretic regimes that, loosely speaking, provide a mathematical basis for assigning truth to RDF graphs and, subsequently, for formally defining when one RDF graph entails another: in other words, if one assigns truth to a particular RDF graph, entailment defines which RDF graphs must also hold true as a logical consequence. Thus, unlike RDF isomorphism which is, in some sense, a structural comparison of RDF graphs [11], entailment offers a semantic comparison of RDF graphs in terms of their underlying meaning [21]. The four regimes are: simple semantics, datatype semantics, RDF semantics and RDFS semantics. In this paper, we are interested in the simple semantics, which codifies a meaning for RDF graphs without considering the interpretation of datatype values or special vocabulary terms (such as rdf:type or rdfs:subClassOf).

Each regime is based on the notion of an interpretation, which maps the terms in an RDF graph to a set, and then defines some set-theoretical conditions on the set. The intuition is that RDF describes resources and relationships between them, where interpretations form a bridge from syntactic terms to the resources and relationships they denote. We now define a simple interpretation.

Definition 2.10 (Simple interpretation). A simple interpretation is a 4-tuple I = (Res, Prop, Ext, Int) where Res is a set of resources; Prop is a set of properties that represent types of binary relations between resources (not necessarily disjoint from Res); Ext maps properties to a set of pairs of resources, thus denoting the extension of the relations; and Int maps terms in IL to Res ∪ Prop, i.e., maps terms in the RDF graph to the resources and properties they describe. With respect to blank nodes, let A : B → Res be a function that maps blank nodes to resources, and let Int_A denote a version of Int that maps terms in ILB to Res ∪ Prop using A for blank nodes. We say that I is a model of an RDF graph G if and only if there exists a mapping A such that for each (s, p, o) ∈ G, it holds that Int(p) ∈ Prop and (Int_A(s), Int_A(o)) ∈ Ext(Int(p)).
Here, the existential semantics of blank nodes is covered by the existence of an auxiliary function A, which is not part of the actual interpretation.

Simple entailment of RDF graphs can then be defined in terms of the models of each graph.

Definition 2.11 (Simple entailment). An RDF graph G simple-entails an RDF graph H, denoted G |= H, if and only if every model of G is also a model of H.

Intuitively speaking, if G |= H, then H says nothing new over G, or in other words if we hold G to be true, then we must hold H to be true as a logical consequence. If two RDF graphs entail each other, then we state that they are simple-equivalent. In other words, semantically speaking, both graphs contain the same information.

Definition 2.12 (Simple equivalence). An RDF graph G is simple-equivalent with an RDF graph H, denoted G ≡ H, if and only if every model of G is a model of H and every model of H is a model of G (in other words, G |= H and H |= G).

Lemma 2.13. Simple equivalence (≡) is an equivalence relation.

Proof. G ≡ H if and only if the sets of models of G and H are equal. Since set-equality is an equivalence relation, so is ≡. □

Remark 3. If G ≅ H, then G ≡ H. □

Given that we currently deal exclusively with the simple semantics of RDF, for brevity, henceforth, we may refer to interpretations, models, entailment, equivalence, etc., without qualification, where we implicitly refer to the simple semantics.

An important question is how isomorphism and equivalence are different for RDF graphs. This is perhaps best illustrated with an example.

Example 2.14. Take the following two RDF graphs with G on the left and H on the right. First of all, we ask does G |= H hold (i.e., is every model of G also a model of H)?
[Figure: RDF graphs G and H. In G, ex:Chile is linked by ex:presidency to the blank nodes _:a1 and _:a2, each of which is linked by ex:president to ex:MBachelet; _:a2 additionally has ex:startYear 2014. In H, ex:Chile is linked by ex:presidency to the single blank node _:b, which is linked by ex:president to ex:MBachelet and has ex:startYear 2014.]

Let’s say I = (Res, Prop, Ext, Int) is a model of G, where for the purposes of generality we give no further details. Since the set of ground terms in H is a subset of those in G, Int maps ground terms in H to Res ∪ Prop in the same manner as for G. Let A denote an auxiliary mapping of blank nodes for G such that for each (s, p, o) ∈ G it holds that Int(p) ∈ Prop and (Int_A(s), Int_A(o)) ∈ Ext(Int(p)); in other words, A witnesses that I is a model of G. Now let µ denote a blank node mapping such that µ(_:b) = _:a2. Then, for each (s, p, o) ∈ H, it holds that Int(p) ∈ Prop and (Int_{A∘µ}(s), Int_{A∘µ}(o)) ∈ Ext(Int(p)), where A ∘ µ is a valid auxiliary mapping that satisfies the condition for I to be a model of H. Hence, any model of G is also a model of H, or in other words, G |= H. Intuitively speaking, if we first map _:b in H to _:a2 in G, then we see that H contains a subset of the information of G.

Now let us ask: does H |= G hold? This time consider a blank node mapping µ such that µ(_:a1) = _:b and µ(_:a2) = _:b. Using a similar argument as above but in the opposite direction, we can now see that any model of H must be a model of G: the ground terms of G are a subset of those of H, and if A is an auxiliary mapping that witnesses that I is a model of H, then A ∘ µ witnesses that I is a model of G. Put another way, looking just at G, if we map _:a1 to _:a2, we do not change the meaning of G, and we end up with an isomorphic copy of H, and hence we may see that H |= G.

Given that G |= H and H |= G, we have that G ≡ H, even though G ≇ H. Though both graphs are structurally different, from a semantic perspective both graphs have the same set of models under RDF’s simple semantics: they imply each other. □

In this example, we saw that an RDF graph can be equivalent to a smaller graph: in other words, an RDF graph can contain redundant triples that add no new information in semantic terms. This gives rise to the notion of a lean RDF graph [21], which is one that does not contain such redundancy.

Definition 2.15 (Lean RDF graph).
An RDF graph G is considered lean if and only if there does not exist a proper subgraph G′ ⊂ G such that G′ |= G. Otherwise we call G non-lean.

Example 2.16. Referring back to Example 2.14, H is lean. However, let G′ denote the set of triples of G that do not mention _:a1. We can see that G′ |= G (with the same line of reasoning as to why H |= G). In terms of explaining why G is non-lean, intuitively the graph can be read as making the following claims:
• Chile has a presidency with MBachelet as president and startYear 2014.
• Chile has a presidency with MBachelet as president.
Under simple semantics, the second claim is considered redundant. □

3 THEORETICAL SETTING

The goal of this paper is to propose and develop algorithms to compute two canonical forms for RDF: a canonical form with respect to isomorphism and a canonical form with respect to equivalence. Having defined these notions in the previous section, we now present some theoretical results that show these to be hard problems in general; we also present high-level strategies as to how these canonical forms could be computed. We first focus on isomorphism and later discuss equivalence.

3.1 Isomorphism

To begin, we wish to establish that RDF isomorphism is in the same complexity class as the related and more well-established problem of graph isomorphism for undirected graphs. In fact, this result is folklore and was hinted at previously by other authors, such as Carroll [9], but to the best of our knowledge, no formal proof of this result was given. First we need some preliminary definitions.

Definition 3.1 (Undirected graph). An undirected graph G = (V, E) is a graph where V is the set of vertexes, E ⊆ V × V is the set of edges, and (v, v′) ∈ E if and only if (v′, v) ∈ E (or in other words, the edges are unordered pairs).

Definition 3.2 (Graph isomorphism). Given two undirected graphs G = (V_G, E_G) and H = (V_H, E_H), these graphs are isomorphic, denoted G ≅ H, if and only if there exists a bijection β : V_G → V_H such that (v, v′) ∈ E_G if and only if (β(v), β(v′)) ∈ E_H. In this case, we call β an isomorphism.
The graph isomorphism problem – of deciding if G ≅ H – is GI-complete: a class that belongs to NP but is not known to be equivalent to NP-complete nor to permit polynomial-time solutions. We now give a result stating that the RDF isomorphism problem – of deciding for two RDF graphs if G ≅ H – is in the same complexity class as graph isomorphism.

Theorem 3.3. Given two RDF graphs G and H, determining if G ≅ H is GI-complete.

A proof for Theorem 3.3 – which originally appeared in the conference version of this work [27] – is provided in Appendix A. From Theorem 3.3, we can conclude that it is not known if there exists a polynomial-time algorithm for RDF isomorphism: the existence of such an algorithm would imply GI = P, solving a long-open problem in Computer Science. In a recent result, Babai [3] proved that the graph isomorphism problem can be solved in quasi-polynomial time³, where although a polynomial-time algorithm has not yet been found, an algorithm better than exponential is now known; hence GI is contained within QP: the class of problems solvable in quasi-polynomial time. These results extend to RDF isomorphism per Theorem 3.3. But these theoretical results refer to worst-case analyses, where a previous result by Babai et al. [4] showed that isomorphism for randomly generated graphs can be performed efficiently using a naive algorithm. Hence, despite the possibility of non-polynomial worst cases, many practical algorithms exist to solve graph isomorphism quickly for many cases.

In fact, our goal here is not to solve the isomorphism decision problem but rather to tackle the harder problem of computing an iso-canonical version of an RDF graph.

Definition 3.4 (Iso-canonical RDF mapping). Let M be a mapping from an RDF graph to an RDF graph. M is iso-canonical if and only if M(G) ≅ G for any RDF graph G and M(G) = M(H) if and only if G ≅ H for any two RDF graphs G and H.
We know that computing an iso-canonical form of an RDF graph is GI-hard since it can be used to solve the RDF isomorphism problem: we can compute the iso-canonical form of both graphs and check if the results are equal. Hence we are unlikely to find polynomial-time algorithms.

Before we continue, let us establish an initial iso-canonical mapping that is quite naive and impractical, but establishes the idea of how such a form can be achieved: define a total ordering of RDF graphs and for a set of pairwise isomorphic RDF graphs, define the lowest such graph (following certain fixed criteria) to be canonical.

Definition 3.5 (κ-mapping). Assume a total ordering of all RDF terms, triples and graphs.⁴ Assume that κ is a blank node bijection that labels all k blank nodes in a graph G from _:b1 to _:bk, and let κ(G) denote the image of G under κ. Furthermore, let K denote the set of all such κ-mappings valid for G. We then define the minimal κ-mapping of an RDF graph G as M_κ(G) = min{κ(G) | κ ∈ K}.

Proposition 3.6. M_κ is an iso-canonical mapping.

Proof. First, since we have a total ordering of graphs, there is a unique graph min{κ(G) | κ ∈ K}, and hence M_κ is a mapping from RDF graphs to RDF graphs.

Second, M_κ(G) ≅ G since we only relabel blank nodes: κ is a blank node bijection.

We are now left to prove that M_κ(G) = M_κ(H) if and only if G ≅ H.

(M_κ(G) = M_κ(H) implies G ≅ H) We know that M_κ(G) ≅ G and M_κ(H) ≅ H. If we are given M_κ(G) = M_κ(H), then we have that G ≅ M_κ(G) ≅ M_κ(H) ≅ H, and since ≅ is an equivalence relation, it follows that G ≅ H.

(G ≅ H implies M_κ(G) = M_κ(H)) Suppose the result does not hold and there exist G and H such that G ≅ H and (without loss of generality) M_κ(G) > M_κ(H). Since G ≅ H, there exists a blank node bijection µ such that µ(G) = H. Let κ be the mapping such that κ(H) = M_κ(H). Now κ ∘ µ(G) = M_κ(H). Since κ ∘ µ is a κ-mapping for G and M_κ(G) > κ ∘ µ(G), we arrive at a contradiction per the definition of M_κ, since M_κ(G) would then not be the minimum over the κ-mappings for G.
□

³ On January 9, 2017, the result was briefly retracted as a bug was found in the proof. The result was reasserted a few days later when a fix was found. See http://people.cs.uchicago.edu/~laci/update.html.
⁴ For example, we can consider terms ordered syntactically, triples ordered lexicographically, and graphs ordered such that G < H if and only if G ⊂ H or there exists a triple t ∈ G \ H such that no triple t′ ∈ H \ G exists where t′ < t.

Example 3.7. Take graph H from Example 2.9, where H = {(_:c, :p, _:d), (_:d, :p, _:e), (_:e, :p, _:c)}. We have a total of 3! = 6 possible κ-mappings (with only blank nodes from bnodes(H) in their domain and blank nodes from {_:b1, _:b2, _:b3} in their codomain) as follows.

  κ(_:c)  κ(_:d)  κ(_:e)   κ(H)
  _:b1    _:b2    _:b3     {(_:b1, :p, _:b2), (_:b2, :p, _:b3), (_:b3, :p, _:b1)}
  _:b1    _:b3    _:b2     {(_:b1, :p, _:b3), (_:b3, :p, _:b2), (_:b2, :p, _:b1)}
  _:b2    _:b1    _:b3     {(_:b2, :p, _:b1), (_:b1, :p, _:b3), (_:b3, :p, _:b2)}
  _:b2    _:b3    _:b1     {(_:b2, :p, _:b3), (_:b3, :p, _:b1), (_:b1, :p, _:b2)}
  _:b3    _:b1    _:b2     {(_:b3, :p, _:b1), (_:b1, :p, _:b2), (_:b2, :p, _:b3)}
  _:b3    _:b2    _:b1     {(_:b3, :p, _:b2), (_:b2, :p, _:b1), (_:b1, :p, _:b3)}

The six κ-mappings only produce two distinct graphs, where the first, fourth and fifth mappings correspond to κ′(H) and the rest to κ′′(H), as follows:

  κ′(H)  = {(_:b1, :p, _:b2), (_:b2, :p, _:b3), (_:b3, :p, _:b1)}
  κ′′(H) = {(_:b1, :p, _:b3), (_:b3, :p, _:b2), (_:b2, :p, _:b1)}

Assuming a typical lexical ordering (as per Footnote 4), M_κ(H) = κ′(H). Importantly, one could (bijectively) relabel _:c, _:d, _:e in the original graph without affecting the result: the output would be the same for any M_κ(H′) such that H ≅ H′. □

This discussion suggests a correct and complete brute-force algorithm to compute an iso-canonical form for any RDF graph G: search all κ-mappings of G for one that gives the minimum such graph. However, such a brute-force process is unnecessary and naive: by applying a more fine-grained total ordering on RDF graphs, we can use a similar principle to find an iso-canonical form in a much more efficient way. Such an algorithm will be presented later in Section 4.
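The brute-force search over κ-mappings described above can be sketched in a few lines. This is an illustration of the naive procedure only, not of the algorithm of Section 4. It assumes triples are tuples of strings with blank nodes prefixed by `_:`, and it orders graphs by comparing their sorted triple lists lexicographically, which is one possible total ordering and not necessarily the one of Footnote 4; the function name `minimal_kappa` is hypothetical.

```python
from itertools import permutations


def is_bnode(term):
    # Assumed encoding: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def minimal_kappa(graph):
    """Return the least relabelled graph over all k! kappa-mappings."""
    blanks = sorted({t for triple in graph for t in triple if is_bnode(t)})
    labels = [f"_:b{i}" for i in range(1, len(blanks) + 1)]
    candidates = []
    for image in permutations(labels):
        kappa = dict(zip(blanks, image))
        # Represent kappa(G) as a sorted list so that graphs can be compared.
        relabelled = sorted(tuple(kappa.get(term, term) for term in triple)
                            for triple in graph)
        candidates.append(relabelled)
    return min(candidates)


H = {("_:c", ":p", "_:d"), ("_:d", ":p", "_:e"), ("_:e", ":p", "_:c")}
H_relabelled = {("_:x", ":p", "_:z"), ("_:z", ":p", "_:y"), ("_:y", ":p", "_:x")}

print(minimal_kappa(H))
print(minimal_kappa(H) == minimal_kappa(H_relabelled))
```

Under this ordering the result for H is the graph κ′(H) of Example 3.7, and an isomorphic copy of H with different labels produces the same output. The loop visits all k! permutations, which is what makes the approach impractical beyond a handful of blank nodes.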
3.2 Equivalence

As previously discussed in Section 2.3, two RDF graphs are equivalent if they entail each other. Towards an initial procedure for deciding if two RDF graphs entail each other, we have the following result:

Theorem 3.8. G |= H if and only if a blank node mapping µ exists such that µ(H) ⊆ G [18, 21]. □

Example 3.9. Referring back to Example 2.14, as a witness that G |= H holds, we have a mapping µ such that µ(_:b) = _:a2 and µ(H) ⊂ G. As a witness that H |= G holds, we have a mapping µ′ such that µ′(_:a1) = µ′(_:a2) = _:b and µ′(G) = H.

For argument’s sake, let us consider an RDF graph H′ derived from H by replacing _:b with an IRI :I. We no longer have a mapping µ that witnesses G |= H′; in fact, G ⊭ H′. However, for H′ |= G, we have the mapping µ(_:a1) = µ(_:a2) = :I. □

In traditional graph terms, finding a blank node mapping µ that witnesses such an entailment relates closely with the notion of graph homomorphism.
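Theorem 3.8 suggests a direct, if exponential, decision procedure: enumerate candidate blank node mappings µ for H and test whether µ(H) ⊆ G. The sketch below follows that idea under stated assumptions: triples are tuples of strings, blank nodes carry a `_:` prefix, the function name `simple_entails` is hypothetical, and the triples of G and H are our reading of the graphs of Example 2.14 as described in Examples 2.14, 2.16 and 3.9.

```python
from itertools import product


def is_bnode(term):
    # Assumed encoding: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def simple_entails(g, h):
    """True iff some blank node mapping mu satisfies mu(h) being a subset of g."""
    g = set(g)
    blanks = sorted({t for triple in h for t in triple if is_bnode(t)})
    # Blank nodes occur only as subjects or objects, so only terms in those
    # positions of g can be useful images.
    targets = sorted({t for (s, _, o) in g for t in (s, o)})
    for image in product(targets, repeat=len(blanks)):
        mu = dict(zip(blanks, image))
        if all(tuple(mu.get(t, t) for t in triple) in g for triple in h):
            return True
    return False


G = {("ex:Chile", "ex:presidency", "_:a1"),
     ("_:a1", "ex:president", "ex:MBachelet"),
     ("ex:Chile", "ex:presidency", "_:a2"),
     ("_:a2", "ex:president", "ex:MBachelet"),
     ("_:a2", "ex:startYear", '"2014"')}
H = {("ex:Chile", "ex:presidency", "_:b"),
     ("_:b", "ex:president", "ex:MBachelet"),
     ("_:b", "ex:startYear", '"2014"')}
# H' of Example 3.9: the blank node _:b replaced by the IRI :I.
H_prime = {tuple(":I" if t == "_:b" else t for t in triple) for triple in H}

print(simple_entails(G, H), simple_entails(H, G))              # both directions
print(simple_entails(G, H_prime), simple_entails(H_prime, G))  # one direction only
```

Simple equivalence can then be tested by calling the function in both directions. The search space has |targets|^k candidates for k blank nodes in H, which reflects the connection to graph homomorphism noted above.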