Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes

AIDAN HOGAN, Center for Semantic Web Research, DCC, University of Chile

Existential blank nodes greatly complicate a number of fundamental operations on RDF graphs. In particular, the problems of determining if two RDF graphs have the same structure modulo blank node labels (i.e., if they are isomorphic), or determining if two RDF graphs have the same meaning under simple semantics (i.e., if they are simple-equivalent), have no known polynomial-time algorithms. In this paper, we propose methods that can produce two canonical forms of an RDF graph. The first canonical form preserves isomorphism such that any two isomorphic RDF graphs will produce the same canonical form; this iso-canonical form is produced by modifying the well-known canonical labelling algorithm Nauty for application to RDF graphs. The second canonical form additionally preserves simple-equivalence such that any two simple-equivalent RDF graphs will produce the same canonical form; this equi-canonical form is produced by, in a preliminary step, leaning the RDF graph, and then computing the iso-canonical form. These algorithms have a number of practical applications, such as for identifying isomorphic or equivalent RDF graphs in a large collection without requiring pair-wise comparison, for computing checksums or signing RDF graphs, for applying consistent Skolemisation schemes where blank nodes are mapped in a canonical manner to IRIs, and so forth. Likewise a variety of algorithms can be simplified by presupposing RDF graphs in one of these canonical forms. Both algorithms require exponential steps in the worst case; in our evaluation we demonstrate that there indeed exist difficult synthetic cases, but we also provide results over 9.9 million RDF graphs that suggest such cases occur infrequently in the real world, and that both canonical forms can be efficiently computed in all but a handful of such cases.
CCS Concepts: • Information systems → Resource Description Framework (RDF); • Mathematics of computing → Graph algorithms;

Additional Key Words and Phrases: Semantic Web, Linked Data, Skolemisation, Isomorphism, Signing

ACM Reference format:
Aidan Hogan. 2017. Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes. ACM Trans. Web 0, 0, Article 00 (January 2017), 60 pages. DOI: 0000001.0000001

This work was supported by the Millennium Nucleus Center for Semantic Web Research under Grant No. NC120004 and by Fondecyt Grant No. 11140900.
Author's address: A. Hogan, DCC, Universidad de Chile, Avenida Beauchef 851, Santiago, Chile.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2017 ACM. 1559-1131/2017/1-ART00 $15.00 DOI: 0000001.0000001
ACM Transactions on the Web, Vol. 0, No. 0, Article 00. Publication date: January 2017.

1 INTRODUCTION

At the very core of the Semantic Web is the Resource Description Framework (RDF): a standard for publishing graph-structured data that uses IRIs as global identifiers such that graphs in remote locations on the Web can collaborate to contribute information about the same resources using consistent terminology in an interoperable manner. The adoption of RDF on the Web has been continuously growing, where we can point to the hundreds of datasets published as RDF using
Linked Data principles [22] spanning a variety of domains, including collections from governmental organisations, scientific communities, social web sites, media outlets, online encyclopaedias, and so forth [51]. Furthermore, hundreds of thousands of web-sites and hundreds of millions of web-pages now contain embedded RDFa [24], incentivised by initiatives such as Schema.org (promoted by Google, Microsoft, Yahoo! and Yandex) and the Open Graph Protocol (promoted by Facebook), with three of the largest providers being, for example, tripadvisor.com, yahoo.com and hotels.com [43].

Despite this trend of RDF playing an increasingly important role as a format for structured-data exchange on the Web, there are a number of fundamental operations over RDF graphs for which we lack practical algorithms. In fact, RDF does not consist purely of statements containing IRIs, but also supports literals that represent datatyped values such as strings or numbers, and, more pertinently for the current scope, blank nodes that represent a resource without an explicit identifier. It is the presence of blank nodes in RDF graphs that particularly complicates matters.

In the original W3C Recommendation for RDF published in 1999 [35], anonymous nodes were introduced as a means of describing a resource without an explicit identifier, quoting use-cases such as the representation of bags of resources in RDF, the use of reification to describe RDF statements as if they were themselves resources, or simply to describe resources that did not have a native URI/IRI associated with them. When the W3C Recommendation for RDF was revised in 2004 [20], the serialisation of RDF graphs as triples was supported through the introduction of the modern notion of blank nodes to represent resources without explicit identifiers; these blank nodes were defined as existential variables that are locally-scoped. Intuitively speaking, this existential semantics captures the idea that one can relabel the blank nodes of an RDF graph in a one-to-one manner without affecting the structure [11] or the semantics [21] of the RDF graph, and without having to worry if those same labels already exist in another graph elsewhere on the Web.
Practically speaking, blank nodes are used for two main reasons [28]:
• Blank nodes allow publishers to avoid having to explicitly identify specific resources, where RDF syntaxes such as Turtle [5] use this property to enable various convenient shortcuts for specifying ordered lists, n-ary relations, etc.; tools parsing these syntaxes can invent blank nodes to represent these implicit nodes.
• In other cases, publishers may use blank nodes to represent true existential variables, where a value is known to exist, but the exact value is not known.

In a recent questionnaire we conducted with the Semantic Web community, we found that publishers may (hypothetically) use blank nodes sometimes in one case, or the other, or both [28]. In any case, blank nodes have become widely used on the Web, where in previous work we found that in a survey of 8.4 million Web documents containing RDF crawled from 829 pay-level domains [footnote 1], 66% of domains and 45% of documents used blank nodes [28].

Footnote 1: Domains such as bbc.co.uk or facebook.com, but not news.bbc.co.uk or co.uk.

Unfortunately, the presence of blank nodes in RDF complicates some fundamental operations on RDF graphs. For example, imagine two different tools parsing the same RDF graph (say, for example, retrieved from the same location on the Web in the same syntax) into a set of triples, labelling blank nodes in an arbitrary manner. Now take the two resulting sets of triples and say we wish to determine if the two RDF graphs are the same modulo blank node labels; i.e., to determine if they are isomorphic [11]. If the original RDF graph did not contain blank nodes, this process is trivially possible in polynomial time by checking if both sets of triples are equal, for example, by sorting both sets of triples and then comparing them sequentially. However, if the original RDF graph contains blank nodes, then the problem of deciding RDF isomorphism has the same computational complexity class as graph isomorphism (GI-complete), for which there are no known polynomial-time algorithms (if such an algorithm were found, it would establish GI = P).
While isomorphism refers to a structural comparison of RDF graphs, it is also possible to consider a semantic comparison of such graphs. The RDF semantics [21] defines a notion of entailment between RDF graphs such that one RDF graph entails another, loosely speaking, if the entailed graph adds no new information over the entailing graph; in other words, if the entailing graph is considered to be true under the model theory of the semantics, then the entailed graph must likewise be considered true. Two RDF graphs that entail each other are thus considered equivalent: as having the same meaning under a particular semantics. The foundational semantics for RDF, called simple semantics, does not consider any special vocabulary nor the interpretation of datatype values; rather, it considers the meaning of RDF graphs considering blank nodes as existential variables and IRIs and literals as ground terms denoting a particular resource. Given RDF graphs G and H without blank nodes, asking if G entails H is the same as asking if the triples of H are a subset of the triples of G, which again is possible in polynomial time by, for example, sorting both sets of triples and comparing them sequentially to see if every triple of H is in G. However, as we discuss later, if both RDF graphs contain blank nodes, the problem is in the same complexity class as the problem of graph homomorphism (namely NP-complete), implying that there is no known polynomial-time solution. Furthermore, it is known that determining if two RDF graphs are simple-equivalent (i.e., if they simple-entail each other) falls into the same complexity class [18].
In summary then, there are no known polynomial-time algorithms for these two fundamental operations of determining if two RDF graphs are structurally the same (per isomorphism) or semantically the same (per simple equivalence) [footnote 2].

Footnote 2: Although polynomial-time algorithms have been proposed in the literature for computing canonical forms of RDF graphs with respect to isomorphism (e.g., [2, 9, 32]), these may not always yield correct results. We will discuss such works in more detail in Section 7.

In this paper, we propose two different canonical forms for RDF graphs. First we must define two RDF graphs as equal (or we may sometimes say the same) if and only if they are equal as sets of RDF triples considering blank node labels as fixed in the same manner as IRIs and literals. The first canonical form, which we call iso-canonical, is an RDF graph unique for each set of isomorphic RDF graphs; in other words, it is a form that is canonical with respect to the structure of RDF graphs. The second canonical form, which we call equi-canonical, is an RDF graph unique for each set of simple-equivalent RDF graphs; in other words, it is a form that is canonical with respect to the (simple) semantics of RDF graphs. More specifically, two RDF graphs are isomorphic if and only if their iso-canonical forms are the same; two RDF graphs are simple-equivalent if and only if their equi-canonical forms are the same.

These canonical forms have a number of use-cases, including:
• given a large set of RDF graphs, detect/remove graphs that are duplicates;
• given an RDF graph, compute a hash of that RDF graph, which can be used for computing and verifying checksums, signatures, etc.;
• given an RDF graph, Skolemise the blank nodes in the RDF graph (replacing them with fresh IRIs) in a deterministic manner based on the content of the graph.

Our methods do not rely on the syntax of the RDF documents in question, but rather operate on the abstract representation of an RDF graph as a set of triples, where the user can decide whether they wish to consider duplicates, signatures, Skolem constants, etc., to be consistent with respect to either isomorphism or (simple) equivalence.
Given an input RDF graph, we propose algorithms for computing both its iso-canonical form and equi-canonical form. As expected from earlier discussion, neither of these algorithms is in polynomial-time: both have exponential-time worst-case performance in the general case. However, unlike the more general problems of graph isomorphism and graph homomorphism, in the case of real-world RDF graphs, we often have ground information that helps to distinguish blank nodes. Hence in our evaluation, we first present results for computing both canonical forms over the BTC-14 dataset [31], a collection of 43.6 million RDF graphs of which 9.9 million contain blank nodes, where said results suggest that computing these forms is efficient in all but a handful of cases. To stress-test our algorithms, we also present results for canonicalising a collection of synthetic graphs at various sizes, which give some idea of the type of RDF graph required to invoke exponential runtime, arguing that such graphs are unlikely to occur naturally in practice.

This paper is an extension of earlier work [27] where we first introduced methods to canonically label blank nodes, computing an iso-canonical form of RDF graphs for the purposes of Skolemisation. Aside from extended discussion throughout, the main novel contribution of this paper is to discuss an algorithm for leaning RDF graphs and for computing their equi-canonical form in a manner that takes into account not only structural but also semantic identity. In this latter contribution, we take some of the ideas and experiences learnt from another previous work where we presented some initial ideas on leaning RDF graphs [28]; however, the methods we present in this paper focus on leaning potentially complex RDF graphs in memory where, in particular, we present a novel depth-first search algorithm that is shown to perform better than a variety of baseline methods.

We begin by presenting some preliminaries relating to the structure and semantics of RDF graphs (Section 2).
We then present a theoretical analysis with a mix of new and existing results that help establish both the hardness of the canonicalisation problems we propose to tackle, as well as high-level approaches by which they can be computed (Section 3). Afterwards, we present in detail our algorithms for computing the iso-canonical version of an RDF graph, which relies on a canonical labelling of blank nodes (Section 4); and the equi-canonical version of an RDF graph (Section 5), which relies on a pre-processing step that leans the RDF graph. We then present evaluation results over collections of both real-world and synthetic RDF graphs (Section 6). We then discuss related works before concluding the paper (Sections 7 and 8).

2 PRELIMINARIES

We now present some formal preliminaries with respect to RDF graphs, isomorphism, and the simple semantics of RDF.

2.1 RDF terms, triples and graphs

RDF graphs are sets of triples containing RDF terms, with certain restrictions on which terms can appear in which positions of a triple.

Definition 2.1 (RDF term). Let I, L and B denote the infinite sets of IRIs, literals and blank nodes respectively. These sets are pair-wise disjoint. We refer generically to an element of one of these sets as an RDF term. We refer to elements of the set IL (i.e., I ∪ L) as ground RDF terms.

Definition 2.2 (RDF triple). We call a triple (s, p, o) ∈ IB × I × ILB an RDF triple, where the first element, called the subject, must be an IRI or a blank node; the second element, called the predicate, must be an IRI; and the third element, called the object, can be any RDF term.

Definition 2.3 (RDF graph). An RDF graph G ⊂ IB × I × ILB is a finite set of RDF triples. We denote by terms(G) the set of all RDF terms appearing in G, and bnodes(G) the set of all blank nodes appearing in G.

Remark 1. We say that two RDF graphs G and H are equal, or the same, if and only if G = H in terms of set equality.
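Definitions 2.1 to 2.3 translate fairly directly into a small data model. The following is a minimal sketch in Python, not the paper's implementation: the class names `IRI`, `Literal` and `BNode` and the helper `make_graph` are hypothetical names chosen here for illustration, and literals are reduced to a bare lexical string (datatypes and language tags are ignored).

```python
from dataclasses import dataclass

# Hypothetical term classes (not from the paper). Frozen dataclasses compare
# by class AND value, which keeps the sets I, L and B pair-wise disjoint.
@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class Literal:
    lexical: str

@dataclass(frozen=True)
class BNode:
    label: str

def is_rdf_triple(s, p, o):
    """Check that (s, p, o) is in IB x I x ILB (Definition 2.2)."""
    return (isinstance(s, (IRI, BNode))
            and isinstance(p, IRI)
            and isinstance(o, (IRI, Literal, BNode)))

def make_graph(triples):
    """Build an RDF graph as a finite, immutable set of valid triples."""
    triples = list(triples)
    for t in triples:
        if not is_rdf_triple(*t):
            raise ValueError(f"not a valid RDF triple: {t}")
    return frozenset(triples)

def terms(graph):
    """terms(G): all RDF terms appearing in G."""
    return {x for triple in graph for x in triple}

def bnodes(graph):
    """bnodes(G): all blank nodes appearing in G."""
    return {x for x in terms(graph) if isinstance(x, BNode)}

# Example data: the edge directions are an assumption made for illustration.
chile, mb = IRI("ex:Chile"), IRI("ex:MBachelet")
a1, a2 = BNode("a1"), BNode("a2")
triples = [
    (chile, IRI("ex:presidency"), a1),
    (chile, IRI("ex:presidency"), a2),
    (a1, IRI("ex:president"), mb),
    (a2, IRI("ex:president"), mb),
    (a2, IRI("ex:startYear"), Literal("2014")),
]
G = make_graph(triples)
```

Because a graph is a `frozenset`, equality in the sense of Remark 1 is plain set equality: `make_graph(reversed(triples)) == G` holds regardless of triple order, while relabelling a blank node yields a different (though isomorphic) graph.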
2.2 RDF isomorphism

Two RDF graphs that are the same modulo blank node labels (i.e., where one can be obtained from the other through a one-to-one mapping of blank nodes) are called isomorphic. We first give a preliminary definition to capture the idea of mapping blank nodes to other RDF terms:

Definition 2.4 (Blank node mapping and bijection). Let µ : ILB → ILB be a partial mapping of RDF terms to RDF terms, where we denote by dom(µ) the domain of µ and by codom(µ) the codomain of µ. If µ is the identity on IL, we call it a blank node mapping. If µ is a blank node mapping that maps blank nodes in dom(µ) to blank nodes in codom(µ) in a bijective manner, we call it a blank node bijection. Abusing notation, given an RDF graph G, we may use µ(G) to denote the image of G under µ (i.e., the result of applying µ to every term in G).

We are now ready to define the notion of isomorphism between RDF graphs.

Definition 2.5 (RDF isomorphism). Two RDF graphs G and H are isomorphic, denoted G ≅ H, if and only if there exists a blank node bijection µ such that µ(G) = H, in which case we call µ an isomorphism.

Lemma 2.6. RDF isomorphism (≅) is an equivalence relation.

Proof. First, ≅ is reflexive per the existence of the identity map µ on blank nodes, which is a blank node bijection. Second, ≅ is symmetric since if G ≅ H, then there exists µ such that µ(G) = H and such that µ⁻¹(H) = G, where µ⁻¹ is also a blank node bijection. Third, ≅ is transitive since if G ≅ H and H ≅ I, then there exist blank node bijections µ and µ′ such that µ(G) = H, µ′(H) = I, and thus µ′(µ(G)) = I, where µ′∘µ is a blank node bijection that witnesses G ≅ I. □

Remark 2. If G = H, then G ≅ H. □

Example 2.7. Take the following two RDF graphs, with G on the left and H on the right, where the term 2014 is a literal (denoted with a square box), all terms prefixed with underscore are blank nodes, and all other terms (in the ex: example namespace) are IRIs. Now we consider: are these RDF graphs isomorphic?
[Figure: RDF graphs G and H. In each graph, ex:Chile is linked by ex:presidency to two blank nodes (_:a1 and _:a2 in G; _:bA and _:b in H), each of which is linked by ex:president to ex:MBachelet; one blank node in each graph (_:a2 in G, _:b in H) additionally has ex:startYear 2014.]

In fact they are: there is a blank node bijection µ such that µ(_:a1) = _:bA and µ(_:a2) = _:b where µ(G) = H. We could also take the inverse mapping µ⁻¹, where µ⁻¹(H) = G and where µ⁻¹ is also a blank node bijection. Thus we conclude G ≅ H. □

A related concept to isomorphism, and one that will play an important role later, is that of an automorphism, which is an isomorphism that maps an RDF graph to itself. Intuitively speaking, automorphisms represent a form of symmetry in the graph.

Definition 2.8 (RDF automorphism). An automorphism of an RDF graph G is an isomorphism that maps G to itself; i.e., a blank node mapping µ is an automorphism of G if µ(G) = G. If µ is the identity mapping on blank nodes in G then µ is a trivial automorphism; otherwise µ is a non-trivial automorphism. We denote the set of all automorphisms of G by Aut(G).

Example 2.9. We give the automorphisms for two RDF graphs G and H. Trivial automorphisms (i.e., those that are the identity mapping) are marked with "=".

G = {(_:a, :p, _:b), (_:b, :p, _:a)}

    Aut(G)   µ(·)   _:a   _:b
             =      _:a   _:b
                    _:b   _:a

H = {(_:c, :p, _:d), (_:d, :p, _:e), (_:e, :p, _:c)}

    Aut(H)   µ(·)   _:c   _:d   _:e
             =      _:c   _:d   _:e
                    _:d   _:e   _:c
                    _:e   _:c   _:d

Applying any of the automorphisms shown for the graph in question would lead to the same graph (and not just an isomorphic copy).
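Definitions 2.5 and 2.8 can be checked directly, if naively, by enumerating every blank node bijection. The sketch below is a brute-force illustration only, exponential in the number of blank nodes, and is not the algorithm the paper proposes. It assumes a simplified encoding in which terms are plain strings and blank nodes are exactly the strings starting with `_:`; the function names and the edge directions in the example graphs are likewise assumptions.

```python
from itertools import permutations

def is_bnode(term):
    # Assumed encoding: blank nodes are the strings prefixed with "_:".
    return term.startswith("_:")

def bnodes(graph):
    return sorted({t for triple in graph for t in triple if is_bnode(t)})

def apply(mu, graph):
    """Image of the graph under mu; terms outside dom(mu) are left unchanged."""
    return frozenset(tuple(mu.get(t, t) for t in triple) for triple in graph)

def isomorphisms(g, h):
    """Yield every blank node bijection mu with mu(g) == h."""
    g, h = frozenset(g), frozenset(h)
    bg, bh = bnodes(g), bnodes(h)
    if len(bg) != len(bh) or len(g) != len(h):
        return
    for image in permutations(bh):
        mu = dict(zip(bg, image))
        if apply(mu, g) == h:
            yield mu

def automorphisms(g):
    """Aut(g): the isomorphisms from g to itself."""
    return list(isomorphisms(g, g))

# Graphs in the spirit of Example 2.7 (edge directions assumed).
G = {("ex:Chile", "ex:presidency", "_:a1"),
     ("ex:Chile", "ex:presidency", "_:a2"),
     ("_:a1", "ex:president", "ex:MBachelet"),
     ("_:a2", "ex:president", "ex:MBachelet"),
     ("_:a2", "ex:startYear", '"2014"')}
H = {("ex:Chile", "ex:presidency", "_:bA"),
     ("ex:Chile", "ex:presidency", "_:b"),
     ("_:bA", "ex:president", "ex:MBachelet"),
     ("_:b", "ex:president", "ex:MBachelet"),
     ("_:b", "ex:startYear", '"2014"')}

# The two graphs of Example 2.9.
two_cycle = {("_:a", ":p", "_:b"), ("_:b", ":p", "_:a")}
three_cycle = {("_:c", ":p", "_:d"), ("_:d", ":p", "_:e"), ("_:e", ":p", "_:c")}

witnesses = list(isomorphisms(G, H))
```

Under these assumptions, `witnesses` contains the single bijection mapping `_:a1` to `_:bA` and `_:a2` to `_:b`, while `automorphisms(two_cycle)` and `automorphisms(three_cycle)` return two and three mappings respectively, matching the tables of Example 2.9.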
□

2.3 Simple semantics, interpretations, entailment and equivalence

The RDF (1.1) Semantics recommendation [21] defines four model-theoretic regimes that, loosely speaking, provide a mathematical basis for assigning truth to RDF graphs and, subsequently, for formally defining when one RDF graph entails another: in other words, if one assigns truth to a particular RDF graph, entailment defines which RDF graphs must also hold true as a logical consequence. Thus, unlike RDF isomorphism which is, in some sense, a structural comparison of RDF graphs [11], entailment offers a semantic comparison of RDF graphs in terms of their underlying meaning [21]. The four regimes are: simple semantics, datatype semantics, RDF semantics and RDFS semantics. In this paper, we are interested in the simple semantics, which codifies a meaning for RDF graphs without considering the interpretation of datatype values or special vocabulary terms (such as rdf:type or rdfs:subClassOf).

Each regime is based on the notion of an interpretation, which maps the terms in an RDF graph to a set, and then defines some set-theoretical conditions on the set. The intuition is that RDF describes resources and relationships between them, where interpretations form a bridge from syntactic terms to the resources and relationships they denote. We now define a simple interpretation.

Definition 2.10 (Simple interpretation). A simple interpretation is a 4-tuple I = (Res, Prop, Ext, Int) where Res is a set of resources; Prop is a set of properties that represent types of binary relations between resources (not necessarily disjoint from Res); Ext maps properties to a set of pairs of resources, thus denoting the extension of the relations; and Int maps terms in IL to Res ∪ Prop, i.e., maps terms in the RDF graph to the resources and properties they describe. With respect to blank nodes, let A : B → Res be a function that maps blank nodes to resources, and let Int_A denote a version of Int that maps terms in ILB to Res ∪ Prop using A for blank nodes. We say that I is a model of an RDF graph G if and only if there exists a mapping A such that for each (s, p, o) ∈ G, it holds that Int(p) ∈ Prop and (Int_A(s), Int_A(o)) ∈ Ext(Int(p)).
Here, the existential semantics of blank nodes is covered by the existence of an auxiliary function A, which is not part of the actual interpretation.

Simple entailment of RDF graphs can then be defined in terms of the models of each graph.

Definition 2.11 (Simple entailment). An RDF graph G simple-entails an RDF graph H, denoted G |= H, if and only if every model of G is also a model of H.

Intuitively speaking, if G |= H, then H says nothing new over G, or in other words if we hold G to be true, then we must hold H to be true as a logical consequence. If two RDF graphs entail each other, then we state that they are simple-equivalent. In other words, semantically speaking, both graphs contain the same information.

Definition 2.12 (Simple equivalence). An RDF graph G is simple-equivalent with an RDF graph H, denoted G ≡ H, if and only if every model of G is a model of H and every model of H is a model of G (in other words, G |= H and H |= G).

Lemma 2.13. Simple equivalence (≡) is an equivalence relation.

Proof. G ≡ H if and only if the sets of models of G and H are equal. Since set-equality is an equivalence relation, so is ≡. □

Remark 3. If G ≅ H, then G ≡ H. □

Given that we currently deal exclusively with the simple semantics of RDF, for brevity, henceforth, we may refer to interpretations, models, entailment, equivalence, etc., without qualification, where we implicitly refer to the simple semantics.

An important question is how isomorphism and equivalence are different for RDF graphs. This is perhaps best illustrated with an example.

Example 2.14. Take the following two RDF graphs with G on the left and H on the right. First of all, we ask does G |= H hold (i.e., is every model of G also a model of H)?
[Figure: RDF graphs G and H. In G, ex:Chile is linked by ex:presidency to two blank nodes _:a1 and _:a2, each linked by ex:president to ex:MBachelet, where _:a2 additionally has ex:startYear 2014. In H, ex:Chile is linked by ex:presidency to a single blank node _:b, which is linked by ex:president to ex:MBachelet and has ex:startYear 2014.]

Let's say I = (Res, Prop, Ext, Int) is a model of G, where for the purposes of generality we give no further details. Since the set of ground terms in H is a subset of those in G, Int maps ground terms in H to Res ∪ Prop in the same manner as for G. Let A denote an auxiliary mapping of blank nodes for G such that for each (s, p, o) ∈ G it holds that Int(p) ∈ Prop and (Int_A(s), Int_A(o)) ∈ Ext(Int(p)); in other words, A witnesses that I is a model of G. Now let µ denote a blank node mapping such that µ(_:b) = _:a2. Then, for each (s, p, o) ∈ H, it holds that Int(p) ∈ Prop and (Int_{A∘µ}(s), Int_{A∘µ}(o)) ∈ Ext(Int(p)), where A∘µ is a valid auxiliary mapping that satisfies the condition for I to be a model of H. Hence, any model of G is also a model of H, or in other words, G |= H. Intuitively speaking, if we first map _:b in H to _:a2 in G, then we see that H contains a subset of the information of G.

Now let us ask: does H |= G hold? This time consider a blank node mapping µ such that µ(_:a1) = _:b and µ(_:a2) = _:b. Using a similar argument as above but in the opposite direction, we can now see that any model of H must be a model of G: the ground terms of G are a subset of those of H, and if A is an auxiliary mapping that witnesses that I is a model of H, then A∘µ witnesses that I is a model of G. Put another way, looking just at G, if we map _:a1 to _:a2, we do not change the meaning of G, and we end up with an isomorphic copy of H, and hence we may see that H |= G.

Given that G |= H and H |= G, we have that G ≡ H, even though G ≇ H. Though both graphs are structurally different, from a semantic perspective both graphs have the same set of models under RDF's simple semantics: they imply each other. □

In this example, we saw that an RDF graph can be equivalent to a smaller graph: in other words, an RDF graph can contain redundant triples that add no new information in semantic terms. This gives rise to the notion of a lean RDF graph [21], which is one that does not contain such redundancy.

Definition 2.15 (Lean RDF graph).
An RDF graph G is considered lean if and only if there does not exist a proper subgraph G′ ⊂ G such that G′ |= G. Otherwise we call G non-lean.

Example 2.16. Referring back to Example 2.14, H is lean. However, let G′ denote the set of triples of G that do not mention _:a1. We can see that G′ |= G (with the same line of reasoning as to why H |= G). In terms of explaining why G is non-lean, intuitively the graph can be read as making the following claims:
• Chile has a presidency with MBachelet as president and startYear 2014.
• Chile has a presidency with MBachelet as president.
Under simple semantics, the second claim is considered redundant. □

3 THEORETICAL SETTING

The goal of this paper is to propose and develop algorithms to compute two canonical forms for RDF: a canonical form with respect to isomorphism and a canonical form with respect to equivalence. Having defined these notions in the previous section, we now present some theoretical results that show these to be hard problems in general; we also present high-level strategies as to how these canonical forms could be computed. We first focus on isomorphism and later discuss equivalence.

3.1 Isomorphism

To begin, we wish to establish that RDF isomorphism is in the same complexity class as the related and more well-established problem of graph isomorphism for undirected graphs. In fact, this result is folklore and was hinted at previously by other authors, such as Carroll [9], but to the best of our knowledge, no formal proof of this result was given. First we need some preliminary definitions.

Definition 3.1 (Undirected graph). An undirected graph G = (V, E) is a graph where V is the set of vertexes, E ⊆ V × V is the set of edges, and (v, v′) ∈ E if and only if (v′, v) ∈ E (or in other words, the edges are unordered pairs).

Definition 3.2 (Graph isomorphism). Given two undirected graphs G = (V_G, E_G) and H = (V_H, E_H), these graphs are isomorphic, denoted G ≅ H, if and only if there exists a bijection β : V_G → V_H such that (v, v′) ∈ E_G if and only if (β(v), β(v′)) ∈ E_H. In this case, we call β an isomorphism.
The graph isomorphism problem, of deciding if G ≅ H, is GI-complete: a class that belongs to NP but is not known to be equivalent to NP-complete nor to permit polynomial-time solutions. We now give a result stating that the RDF isomorphism problem, of deciding for two RDF graphs if G ≅ H, is in the same complexity class as graph isomorphism.

Theorem 3.3. Given two RDF graphs G and H, determining if G ≅ H is GI-complete.

A proof for Theorem 3.3, which originally appeared in the conference version of this work [27], is provided in Appendix A. From Theorem 3.3, we can conclude that it is not known if there exists a polynomial-time algorithm for RDF isomorphism: the existence of such an algorithm would imply GI = P, solving a long-open problem in Computer Science. In a recent result, Babai [3] proved that the graph isomorphism problem can be solved in quasi-polynomial time [footnote 3], where although a polynomial-time algorithm has not yet been found, an algorithm better than exponential is now known; hence GI is contained within QP: the class of problems solvable in quasi-polynomial time. These results extend to RDF isomorphism per Theorem 3.3. But these theoretical results refer to worst-case analyses, where a previous result by Babai et al. [4] showed that isomorphism for randomly generated graphs can be performed efficiently using a naive algorithm. Hence, despite the possibility of non-polynomial worst cases, many practical algorithms exist to solve graph isomorphism quickly for many cases.

In fact, our goal here is not to solve the isomorphism decision problem but rather to tackle the harder problem of computing an iso-canonical version of an RDF graph.

Definition 3.4 (Iso-canonical RDF mapping). Let M be a mapping from an RDF graph to an RDF graph. M is iso-canonical if and only if M(G) ≅ G for any RDF graph G and M(G) = M(H) if and only if G ≅ H for any two RDF graphs G and H.
We know that computing an iso-canonical form of an RDF graph is GI-hard since it can be used to solve the RDF isomorphism problem: we can compute the iso-canonical form of both graphs and check if the results are equal. Hence we are unlikely to find polynomial-time algorithms.

Before we continue, let us establish an initial iso-canonical mapping that is quite naive and impractical, but establishes the idea of how such a form can be achieved: define a total ordering of RDF graphs and for a set of pairwise isomorphic RDF graphs, define the lowest such graph (following certain fixed criteria) to be canonical.

Definition 3.5 (κ-mapping). Assume a total ordering of all RDF terms, triples and graphs [footnote 4]. Assume that κ is a blank node bijection that labels all k blank nodes in a graph G from _:b1 to _:bk, and let κ(G) denote the image of G under κ. Furthermore, let K denote the set of all such κ-mappings valid for G. We then define the minimal κ-mapping of an RDF graph G as M_κ(G) = min{κ(G) | κ ∈ K}.

Proposition 3.6. M_κ is an iso-canonical mapping.

Proof. First, since we have a total ordering of graphs, there is a unique graph min{κ(G) | κ ∈ K}, and hence M_κ is a mapping from RDF graphs to RDF graphs.

Second, M_κ(G) ≅ G since we only relabel blank nodes: κ is a blank node bijection.

We are now left to prove that M_κ(G) = M_κ(H) if and only if G ≅ H.

(M_κ(G) = M_κ(H) implies G ≅ H) We know that M_κ(G) ≅ G and M_κ(H) ≅ H. If we are given M_κ(G) = M_κ(H), then we have that G ≅ M_κ(G) ≅ M_κ(H) ≅ H, and since ≅ is an equivalence relation, it follows that G ≅ H.

(G ≅ H implies M_κ(G) = M_κ(H)) Suppose the result does not hold and there exist G and H such that G ≅ H and (without loss of generality) M_κ(G) > M_κ(H). Since G ≅ H, there exists a blank node bijection µ such that µ(G) = H. Let κ be the mapping such that κ(H) = M_κ(H). Now κ∘µ(G) = M_κ(H). Since κ∘µ is a κ-mapping for G and M_κ(G) > κ∘µ(G), we arrive at a contradiction per the definition of M_κ since it does not use the minimum κ-mapping for G.
□

Footnote 3: On January 9, 2017, the result was briefly retracted as a bug was found in the proof. The result was reasserted a few days later when a fix was found. See http://people.cs.uchicago.edu/~laci/update.html.

Footnote 4: For example, we can consider terms ordered syntactically, triples ordered lexicographically, and graphs ordered such that G < H if and only if G ⊂ H or there exists a triple t ∈ G \ H such that no triple t′ ∈ H \ G exists where t′ < t.

Example 3.7. Take graph H from Example 2.9. We have a total of 3! = 6 possible κ-mappings (with only blank nodes from bnodes(H) in their domain and blank nodes from {_:b1, _:b2, _:b3} in their codomain) as follows.

    κ(_:c)   κ(_:d)   κ(_:e)   κ(H)
    (H itself)                 {(_:c, :p, _:d), (_:d, :p, _:e), (_:e, :p, _:c)}
    _:b1     _:b2     _:b3     {(_:b1, :p, _:b2), (_:b2, :p, _:b3), (_:b3, :p, _:b1)}
    _:b1     _:b3     _:b2     {(_:b1, :p, _:b3), (_:b3, :p, _:b2), (_:b2, :p, _:b1)}
    _:b2     _:b1     _:b3     {(_:b2, :p, _:b1), (_:b1, :p, _:b3), (_:b3, :p, _:b2)}
    _:b2     _:b3     _:b1     {(_:b2, :p, _:b3), (_:b3, :p, _:b1), (_:b1, :p, _:b2)}
    _:b3     _:b1     _:b2     {(_:b3, :p, _:b1), (_:b1, :p, _:b2), (_:b2, :p, _:b3)}
    _:b3     _:b2     _:b1     {(_:b3, :p, _:b2), (_:b2, :p, _:b1), (_:b1, :p, _:b3)}

The six κ-mappings only produce two distinct graphs, where the first, fourth and fifth mappings correspond to κ′(H) and the rest to κ″(H), as follows:

    κ′(H) = {(_:b1, :p, _:b2), (_:b2, :p, _:b3), (_:b3, :p, _:b1)}
    κ″(H) = {(_:b1, :p, _:b3), (_:b3, :p, _:b2), (_:b2, :p, _:b1)}

Assuming a typical lexical ordering (as per Footnote 4), M_κ(H) = κ′(H). Importantly, one could (bijectively) relabel _:c, _:d, _:e in the original graph without affecting the result: the output would be the same for any M_κ(H′) such that H ≅ H′. □

This discussion suggests a correct and complete brute-force algorithm to compute an iso-canonical form for any RDF graph G: search all κ-mappings of G for one that gives the minimum such graph. However, such a brute-force process is unnecessary and naive: by applying a more fine-grained total ordering on RDF graphs, we can use a similar principle to find an iso-canonical form in a much more efficient way. Such an algorithm will be presented later in Section 4.
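The naive procedure just described (enumerate all κ-mappings and keep the minimum image) can be sketched in a few lines. This is an illustration of Definition 3.5 only, with k! candidate labellings for k blank nodes; it is not the algorithm of Section 4. Two assumptions are made here: terms are plain strings with blank nodes prefixed by `_:`, and graphs are ordered by comparing their sorted triple lists, which is a total order on graphs but not necessarily the exact order of Footnote 4. The function name `minimal_kappa` is hypothetical.

```python
from itertools import permutations

def is_bnode(term):
    # Assumed encoding: blank nodes are the strings prefixed with "_:".
    return term.startswith("_:")

def minimal_kappa(graph):
    """Brute-force M_kappa(G): try every labelling of the k blank nodes
    with _:b1 .. _:bk and return the smallest image as a tuple of triples."""
    graph = frozenset(graph)
    blanks = sorted({t for triple in graph for t in triple if is_bnode(t)})
    labels = [f"_:b{i}" for i in range(1, len(blanks) + 1)]
    best = None
    for image in permutations(labels):
        kappa = dict(zip(blanks, image))
        # The sorted triple list is the representation used to compare graphs.
        candidate = sorted(tuple(kappa.get(t, t) for t in triple)
                           for triple in graph)
        if best is None or candidate < best:
            best = candidate
    return tuple(best)

# H from Example 2.9 and an isomorphic copy with different blank node labels.
H = {("_:c", ":p", "_:d"), ("_:d", ":p", "_:e"), ("_:e", ":p", "_:c")}
H_copy = {("_:z", ":p", "_:x"), ("_:x", ":p", "_:y"), ("_:y", ":p", "_:z")}
# A non-isomorphic graph: a path rather than a cycle.
path = {("_:c", ":p", "_:d"), ("_:d", ":p", "_:e")}

canonical_H = minimal_kappa(H)
```

With this ordering, `canonical_H` is the cycle `_:b1 → _:b2 → _:b3 → _:b1`, i.e. κ′(H) from Example 3.7, and `minimal_kappa(H_copy)` returns the same value, whereas `minimal_kappa(path)` differs. A hash of the returned tuple could then serve as a checksum for the whole isomorphism class, which is one of the use-cases listed in the introduction.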
3.2 Equivalence

As previously discussed in Section 2.3, two RDF graphs are equivalent if they entail each other. Towards an initial procedure for deciding if two RDF graphs entail each other, we have the following result:

Theorem 3.8. G |= H if and only if a blank node mapping µ exists such that µ(H) ⊆ G [18, 21]. □

Example 3.9. Referring back to Example 2.14, as a witness that G |= H holds, we have a mapping µ such that µ(_:b) = _:a2 and µ(H) ⊂ G. As a witness that H |= G holds, we have a mapping µ′ such that µ′(_:a1) = µ′(_:a2) = _:b and µ′(G) = H.

For argument's sake, let us consider an RDF graph H′ derived from H by replacing _:b with an IRI :I. We no longer have a mapping µ that witnesses G |= H′; in fact, G ⊭ H′. However, for H′ |= G, we have the mapping µ(_:a1) = µ(_:a2) = :I. □

In traditional graph terms, finding a blank node mapping µ that witnesses such an entailment relates closely with the notion of graph homomorphism.
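Theorem 3.8 gives a direct, if exponential, decision procedure: enumerate the blank node mappings from H into the terms of G and test whether any image of H is contained in G. The sketch below follows that idea and also derives equivalence (Definition 2.12) and a leanness test (Definition 2.15, using the observation that G is non-lean exactly when some mapping sends G onto a proper subset of itself). It is a brute-force illustration under assumptions, not the paper's algorithm: terms are plain strings, blank nodes are the strings prefixed with `_:`, the function names are hypothetical, and the edge directions in the example graphs are assumed.

```python
from itertools import product

def is_bnode(term):
    # Assumed encoding: blank nodes are the strings prefixed with "_:".
    return term.startswith("_:")

def bnodes(graph):
    return sorted({t for triple in graph for t in triple if is_bnode(t)})

def terms(graph):
    return sorted({t for triple in graph for t in triple})

def apply(mu, graph):
    """Image of the graph under the blank node mapping mu."""
    return frozenset(tuple(mu.get(t, t) for t in triple) for triple in graph)

def mappings(source, target):
    """Yield every mapping from bnodes(source) to terms(target)."""
    blanks = bnodes(source)
    for image in product(terms(target), repeat=len(blanks)):
        yield dict(zip(blanks, image))

def entails(g, h):
    """G |= H iff some blank node mapping mu gives mu(H) ⊆ G (Theorem 3.8)."""
    g = frozenset(g)
    return any(apply(mu, h).issubset(g) for mu in mappings(h, g))

def equivalent(g, h):
    return entails(g, h) and entails(h, g)

def is_lean(g):
    """G is lean iff no mapping sends G onto a proper subset of itself."""
    g = frozenset(g)
    for mu in mappings(g, g):
        image = apply(mu, g)
        if image != g and image.issubset(g):
            return False
    return True

# Graphs in the spirit of Example 2.14 (edge directions assumed).
G = {("ex:Chile", "ex:presidency", "_:a1"),
     ("ex:Chile", "ex:presidency", "_:a2"),
     ("_:a1", "ex:president", "ex:MBachelet"),
     ("_:a2", "ex:president", "ex:MBachelet"),
     ("_:a2", "ex:startYear", '"2014"')}
H = {("ex:Chile", "ex:presidency", "_:b"),
     ("_:b", "ex:president", "ex:MBachelet"),
     ("_:b", "ex:startYear", '"2014"')}
# H' from Example 3.9: the blank node _:b replaced by the IRI ex:I.
H_prime = {tuple("ex:I" if t == "_:b" else t for t in triple) for triple in H}
```

On this data the sketch reproduces the conclusions of Examples 2.14, 2.16 and 3.9: `equivalent(G, H)` holds, `is_lean(H)` holds while `is_lean(G)` does not, and `entails(H_prime, G)` holds while `entails(G, H_prime)` does not. The search space has |terms(G)|^k candidates for k blank nodes in H, which is consistent with the NP-completeness discussed earlier.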
