Canonical Forms for Isomorphic and Equivalent RDF
Graphs: Algorithms for Leaning and Labelling Blank Nodes
AIDAN HOGAN,
Center for Semantic Web Research, DCC, University of Chile
Existential blank nodes greatly complicate a number of fundamental operations on RDF graphs. In particular,
the problems of determining if two RDF graphs have the same structure modulo blank node labels (i.e., if they
are isomorphic), or determining if two RDF graphs have the same meaning under simple semantics (i.e., if
they are simple-equivalent), have no known polynomial-time algorithms. In this paper, we propose methods
that can produce two canonical forms of an RDF graph. The first canonical form preserves isomorphism
such that any two isomorphic RDF graphs will produce the same canonical form; this iso-canonical form is
produced by modifying the well-known canonical labelling algorithm Nauty for application to RDF graphs.
The second canonical form additionally preserves simple-equivalence such that any two simple-equivalent
RDF graphs will produce the same canonical form; this equi-canonical form is produced by, in a preliminary
step, leaning the RDF graph, and then computing the iso-canonical form. These algorithms have a number
of practical applications, such as for identifying isomorphic or equivalent RDF graphs in a large collection
without requiring pair-wise comparison, for computing checksums or signing RDF graphs, for applying
consistent Skolemisation schemes where blank nodes are mapped in a canonical manner to IRIs, and so forth.
Likewise a variety of algorithms can be simplified by presupposing RDF graphs in one of these canonical
forms. Both algorithms require exponential steps in the worst case; in our evaluation we demonstrate that
there indeed exist difficult synthetic cases, but we also provide results over 9.9 million RDF graphs that suggest
such cases occur infrequently in the real world, and that both canonical forms can be efficiently computed in
all but a handful of such cases.
CCS Concepts: • Information systems → Resource Description Framework (RDF); • Mathematics of
computing → Graph algorithms;
Additional Key Words and Phrases: Semantic Web, Linked Data, Skolemisation, Isomorphism, Signing
ACM Reference format:
Aidan Hogan. 2017. Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and
Labelling Blank Nodes. ACM Trans. Web 0, 0, Article 00 (January 2017), 60 pages.
DOI: 0000001.0000001
1 INTRODUCTION
At the very core of the Semantic Web is the Resource Description Framework (RDF): a standard
for publishing graph-structured data that uses IRIs as global identifiers such that graphs in remote
locations on the Web can collaborate to contribute information about the same resources using
consistent terminology in an interoperable manner. The adoption of RDF on the Web has been
continuously growing, where we can point to the hundreds of datasets published as RDF using
This work was supported by the Millennium Nucleus Center for Semantic Web Research under Grant No. NC120004 and by
Fondecyt Grant No. 11140900.
Author's address: A. Hogan, DCC, Universidad de Chile, Avenida Beauchef 851, Santiago, Chile.
© 2017 ACM. 1559-1131/2017/1-ART00 $15.00
ACM Transactions on the Web, Vol. 0, No. 0, Article 00. Publication date: January 2017.
Linked Data principles [22] spanning a variety of domains, including collections from governmental
organisations, scientific communities, social web sites, media outlets, online encyclopaedias, and so
forth [51]. Furthermore, hundreds of thousands of web-sites and hundreds of millions of web-pages
now contain embedded RDFa [24] – incentivised by initiatives such as Schema.org (promoted by
Google, Microsoft, Yahoo! and Yandex), and the Open Graph Protocol (promoted by Facebook) – with
three of the largest providers being, for example, tripadvisor.com, yahoo.com and hotels.com [43].
Despite this trend of RDF playing an increasingly important role as a format for structured-data
exchange on the Web, there are a number of fundamental operations over RDF graphs for which
we lack practical algorithms. In fact, RDF does not consist purely of statements containing IRIs,
but also supports literals that represent datatyped values such as strings or numbers, and, more
pertinently for the current scope, blank nodes that represent a resource without an explicit identifier.
It is the presence of blank nodes in RDF graphs that particularly complicates matters.
In the original W3C Recommendation for RDF published in 1999 [35], anonymous nodes were
introduced as a means of describing a resource without an explicit identifier, quoting use-cases
such as the representation of bags of resources in RDF, the use of reification to describe RDF
statements as if they were themselves resources, or simply to describe resources that did not have
a native URI/IRI associated with them. When the W3C Recommendation for RDF was revised in
2004 [20], the serialisation of RDF graphs as triples was supported through the introduction of the
modern notion of blank nodes to represent resources without explicit identifiers; these blank nodes
were defined as existential variables that are locally-scoped. Intuitively speaking, this existential
semantics captures the idea that one can relabel the blank nodes of an RDF graph in a one-to-one
manner without affecting the structure [11] or the semantics [21] of the RDF graph, and without
having to worry if those same labels already exist in another graph elsewhere on the Web.
Practically speaking, blank nodes are used for two main reasons [28]:
• Blank nodes allow publishers to avoid having to explicitly identify specific resources, where
RDF syntaxes such as Turtle [5] use this property to enable various convenient shortcuts
for specifying ordered lists, n-ary relations, etc.; tools parsing these syntaxes can invent
blank nodes to represent these implicit nodes.
• In other cases, publishers may use blank nodes to represent true existential variables, where
a value is known to exist, but the exact value is not known.
In a recent questionnaire we conducted with the Semantic Web community, we found that publishers
may (hypothetically) use blank nodes sometimes in one case, or the other, or both [28]. In any case,
blank nodes have become widely used on the Web, where in previous work we found that in a
survey of 8.4 million Web documents containing RDF crawled from 829 pay-level domains (see footnote 1), 66% of
domains and 45% of documents used blank nodes [28].
Unfortunately, the presence of blank nodes in RDF complicates some fundamental operations
on RDF graphs. For example, imagine two different tools parsing the same RDF graph – say, for
example, retrieved from the same location on the Web in the same syntax – into a set of triples,
labelling blank nodes in an arbitrary manner. Now take the two resulting sets of triples and say we
wish to determine if the two RDF graphs are the same modulo blank node labels; i.e., to determine
if they are isomorphic [11]. If the original RDF graph did not contain blank nodes, this process
is trivially possible in polynomial time by checking if both sets of triples are equal, for example,
by sorting both sets of triples and then comparing them sequentially. However, if the original
RDF graph contains blank nodes, then the problem of deciding RDF isomorphism has the same
Footnote 1: Domains such as bbc.co.uk or facebook.com, but not news.bbc.co.uk or co.uk.
computational complexity class as graph isomorphism (GI-complete), for which there are no known
polynomial-time algorithms (if such an algorithm were found, it would establish GI = P).
While isomorphism refers to a structural comparison of RDF graphs, it is also possible to consider
a semantic comparison of such graphs. The RDF semantics [21] defines a notion of entailment
between RDF graphs such that one RDF graph entails another, loosely speaking, if the entailed
graph adds no new information over the entailing graph; in other words, if the entailing graph
is considered to be true under the model theory of the semantics, then the entailed graph must
likewise be considered true. Two RDF graphs that entail each other are thus considered equivalent:
as having the same meaning under a particular semantics. The foundational semantics for RDF,
called simple semantics, does not consider any special vocabulary nor the interpretation of datatype
values; rather, it considers the meaning of RDF graphs considering blank nodes as existential
variables and IRIs and literals as ground terms denoting a particular resource. Given two RDF graphs
G and H without blank nodes, asking if G entails H is the same as asking if the triples of H are a
subset of the triples of G, which again is possible in polynomial time by, for example, sorting both
sets of triples and comparing them sequentially to see if every triple of H is in G. However, as
we discuss later, if both RDF graphs contain blank nodes, the problem is in the same complexity
class as the problem of graph homomorphism (namely NP-complete), implying that there is no
known polynomial-time solution. Furthermore, it is known that determining if two RDF graphs are
simple-equivalent – i.e., if they simple-entail each other – falls into the same complexity class [18].
In summary then, there are no known polynomial-time algorithms for these two fundamental
operations of determining if two RDF graphs are structurally the same (per isomorphism) or
semantically the same (per simple equivalence) (see footnote 2).
In this paper, we propose two different canonical forms for RDF graphs. First we must define two
RDF graphs as equal (or we may sometimes say the same) if and only if they are equal as sets of
RDF triples considering blank node labels as fixed in the same manner as IRIs and literals. The first
canonical form, which we call iso-canonical, is an RDF graph unique for each set of isomorphic RDF
graphs; in other words, it is a form that is canonical with respect to the structure of RDF graphs.
The second canonical form, which we call equi-canonical, is an RDF graph unique for each set of
simple-equivalent RDF graphs; in other words, it is a form that is canonical with respect to the
(simple) semantics of RDF graphs. More specifically, two RDF graphs are isomorphic if and only if
their iso-canonical forms are the same; two RDF graphs are simple-equivalent if and only if their
equi-canonical forms are the same.
These canonical forms have a number of use-cases, including:
• given a large set of RDF graphs, detect/remove graphs that are duplicates;
• given an RDF graph, compute a hash of that RDF graph, which can be used for computing
and verifying checksums, signatures, etc.;
• given an RDF graph, Skolemise the blank nodes in the RDF graph – replacing them with
fresh IRIs – in a deterministic manner based on the content of the graph.
Our methods do not rely on the syntax of the RDF documents in question, but rather operate on
the abstract representation of an RDF graph as a set of triples, where the user can decide whether
they wish to consider duplicates, signatures, Skolem constants, etc., to be consistent with respect
to either isomorphism or (simple) equivalence.
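The hashing use-case can be illustrated with a minimal sketch. It assumes the graph has already been brought into one of the canonical forms by some other procedure; the function name `graph_hash`, the encoding of triples as tuples of strings, and the JSON-based serialisation are illustrative assumptions, not part of the methods proposed in the paper.

```python
import hashlib
import json


def graph_hash(graph):
    """Hash a graph given as an iterable of (subject, predicate, object) strings.

    Assumption: the graph is already in a canonical form, so that blank node
    labels are deterministic. This function does NOT canonicalise anything;
    it only removes the dependence on triple order.
    """
    # Sort the triples so that the serialisation is independent of input order.
    ordered = sorted(list(triple) for triple in graph)
    # JSON gives an unambiguous serialisation even if terms contain spaces.
    text = json.dumps(ordered, ensure_ascii=False)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


triples = [
    ("ex:Chile", "ex:presidency", "_:b1"),
    ("_:b1", "ex:president", "ex:MBachelet"),
]
same_triples_other_order = list(reversed(triples))
```

Two graphs that are equal as sets of triples receive the same hash under this sketch; isomorphic graphs with different blank node labels only do so once both have been canonicalised.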
Footnote 2: Although polynomial-time algorithms have been proposed in the literature for computing canonical forms of RDF graphs
with respect to isomorphism (e.g., [2, 9, 32]), these may not always yield correct results. We will discuss such works in more
detail in Section 7.
Given an input RDF graph, we propose algorithms for computing both its iso-canonical form
and equi-canonical form. As expected from earlier discussion, neither of these algorithms is in
polynomial-time: both have exponential-time worst-case performance in the general case. However,
unlike the more general problems of graph isomorphism and graph homomorphism, in the case of
real-world RDF graphs, we often have ground information that helps to distinguish blank nodes.
Hence in our evaluation, we first present results for computing both canonical forms over the
BTC–14 dataset [31] – a collection of 43.6 million RDF graphs of which 9.9 million contain blank
nodes – where said results suggest that computing these forms is efficient in all but a handful
of cases. To stress-test our algorithms, we also present results for canonicalising a collection of
synthetic graphs at various sizes, which give some idea of the type of RDF graph required to invoke
exponential runtime, arguing that such graphs are unlikely to occur naturally in practice.
This paper is an extension of earlier work [27] where we first introduced methods to canonically
label blank nodes, computing an iso-canonical form of RDF graphs for the purposes of Skolemisation.
Aside from extended discussion throughout, the main novel contribution of this paper is to discuss
an algorithm for leaning RDF graphs and for computing their equi-canonical form in a manner
that takes into account not only structural but also semantic identity. In this latter contribution, we
take some of the ideas and experiences learnt from another previous work where we presented
some initial ideas on leaning RDF graphs [28]; however, the methods we present in this paper focus
on leaning potentially complex RDF graphs in memory where, in particular, we present a novel
depth-first search algorithm that is shown to perform better than a variety of baseline methods.
We begin by presenting some preliminaries relating to the structure and semantics of RDF graphs
(Section 2). We then present a theoretical analysis with a mix of new and existing results that
help establish both the hardness of the canonicalisation problems we propose to tackle, as well as
high-level approaches by which they can be computed (Section 3). Afterwards, we present in detail
our algorithms for computing the iso-canonical version of an RDF graph, which relies on a canonical
labelling of blank nodes (Section 4); and the equi-canonical version of an RDF graph (Section 5),
which relies on a pre-processing step that leans the RDF graph. We then present evaluation results
over collections of both real-world and synthetic RDF graphs (Section 6). We then discuss related
works before concluding the paper (Sections 7 and 8).
2 PRELIMINARIES
We now present some formal preliminaries with respect to RDF graphs, isomorphism, and the
simple semantics of RDF.
2.1 RDF terms, triples and graphs
RDF graphs are sets of triples containing RDF terms, with certain restrictions on which terms can
appear in which positions of a triple.
Definition 2.1 (RDF term). Let I, L and B denote the infinite sets of IRIs, literals and blank nodes
respectively. These sets are pair-wise disjoint. We refer generically to an element of one of these
sets as an RDF term. We refer to elements of the set IL (i.e., I ∪ L) as ground RDF terms.
Definition 2.2 (RDF triple). We call a triple (s, p, o) ∈ IB × I × ILB an RDF triple, where the first
element, called the subject, must be an IRI or a blank node; the second element, called the predicate,
must be an IRI; and the third element, called the object, can be any RDF term.
Definition 2.3 (RDF graph). An RDF graph G ⊂ IB × I × ILB is a finite set of RDF triples. We
denote by terms(G) the set of all RDF terms appearing in G, and by bnodes(G) the set of all blank
nodes appearing in G.
Remark 1. We say that two RDF graphs G and H are equal, or the same, if and only if G = H in
terms of set equality.
2.2 RDF isomorphism
Two RDF graphs that are the same modulo blank node labels – i.e., where one can be obtained
from the other through a one-to-one mapping of blank nodes – are called isomorphic. We first give
a preliminary definition to capture the idea of mapping blank nodes to other RDF terms:
Definition 2.4 (Blank node mapping and bijection). Let µ : ILB → ILB be a partial mapping of RDF
terms to RDF terms, where we denote by dom(µ) the domain of µ and by codom(µ) the codomain
of µ. If µ is the identity on IL, we call it a blank node mapping. If µ is a blank node mapping that
maps blank nodes in dom(µ) to blank nodes in codom(µ) in a bijective manner, we call it a blank
node bijection.
Abusing notation, given an RDF graph G, we may use µ(G) to denote the image of G under µ
(i.e., the result of applying µ to every term in G).
We are now ready to define the notion of isomorphism between RDF graphs.
Definition 2.5 (RDF isomorphism). Two RDF graphs G and H are isomorphic, denoted G ≅ H, if
and only if there exists a blank node bijection µ such that µ(G) = H, in which case we call µ an
isomorphism.
Lemma 2.6. RDF isomorphism (≅) is an equivalence relation.
Proof. First, ≅ is reflexive per the existence of the identity map µ on blank nodes, which is a
blank node bijection. Second, ≅ is symmetric since if G ≅ H, then there exists µ such that µ(G) = H
and such that µ⁻¹(H) = G, where µ⁻¹ is also a blank node bijection. Third, ≅ is transitive since if
G ≅ H and H ≅ I, then there exist blank node bijections µ and µ′ such that µ(G) = H, µ′(H) = I,
and thus µ′(µ(G)) = I, where µ′ ◦ µ is a blank node bijection that witnesses G ≅ I. □
Remark 2. If G = H, then G ≅ H. □
Example 2.7. Take the following two RDF graphs, with G on the left and H on the right, where
the term 2014 is a literal (denoted with a square box in the figure), all terms prefixed with underscore are blank
nodes, and all other terms (in the ex: example namespace) are IRIs. Now we consider: are these
RDF graphs isomorphic?
[Figure: two RDF graphs. G (left) contains the blank nodes _:a1 and _:a2; H (right) contains the blank nodes _:bA and _:b. Both graphs use the IRIs ex:Chile and ex:MBachelet, the predicates ex:presidency, ex:president and ex:startYear, and the literal 2014.]
In fact they are: there is a blank node bijection µ such that µ(_:a1) = _:bA and µ(_:a2) = _:b where
µ(G) = H. We could also take the inverse mapping µ⁻¹, where µ⁻¹(H) = G and where µ⁻¹ is also a
blank node bijection. Thus we conclude G ≅ H. □
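Definition 2.5 can be turned directly into a naive isomorphism test by trying every blank node bijection. The following is a minimal sketch, not the algorithm proposed in this paper: terms are encoded as strings, blank nodes are assumed to start with `_:`, and the triples of G and H are an assumed reconstruction of the figure in Example 2.7 (the direction of the edges is a guess). The helper names are hypothetical.

```python
from itertools import permutations


def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def bnodes(graph):
    # Sorted list of blank nodes in subject or object position.
    return sorted({t for (s, _p, o) in graph for t in (s, o) if is_blank(t)})


def apply_mapping(mu, graph):
    # Image of the graph under mu; terms outside dom(mu) are left unchanged.
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}


def find_isomorphism(g, h):
    """Return a blank node bijection mu with mu(g) == h, or None.

    Brute force: tries all k! bijections for k blank nodes.
    """
    bg, bh = bnodes(g), bnodes(h)
    if len(g) != len(h) or len(bg) != len(bh):
        return None
    for image in permutations(bh):
        mu = dict(zip(bg, image))
        if apply_mapping(mu, g) == set(h):
            return mu
    return None


# Assumed reconstruction of the graphs of Example 2.7.
G = {
    ("ex:Chile", "ex:presidency", "_:a1"),
    ("ex:Chile", "ex:presidency", "_:a2"),
    ("_:a1", "ex:president", "ex:MBachelet"),
    ("_:a2", "ex:president", "ex:MBachelet"),
    ("_:a2", "ex:startYear", '"2014"'),
}
H = {
    ("ex:Chile", "ex:presidency", "_:bA"),
    ("ex:Chile", "ex:presidency", "_:b"),
    ("_:bA", "ex:president", "ex:MBachelet"),
    ("_:b", "ex:president", "ex:MBachelet"),
    ("_:b", "ex:startYear", '"2014"'),
}
```

Under these assumptions, `find_isomorphism(G, H)` finds the bijection named in the example, mapping `_:a1` to `_:bA` and `_:a2` to `_:b`. The factorial search is only workable for a handful of blank nodes.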
A related concept to isomorphism – and one that will play an important role later – is that of an
automorphism, which is an isomorphism that maps an RDF graph to itself. Intuitively speaking,
automorphisms represent a form of symmetry in the graph.
Definition 2.8 (RDF automorphism). An automorphism of an RDF graph G is an isomorphism
that maps G to itself; i.e., a blank node mapping µ is an automorphism of G if µ(G) = G. If µ is the
identity mapping on blank nodes in G then µ is a trivial automorphism; otherwise µ is a non-trivial
automorphism. We denote the set of all automorphisms of G by Aut(G).
Example 2.9. We give the automorphisms for two RDF graphs G and H. Trivial automorphisms
(i.e., those that are the identity mapping) are marked with =.
[Figure: G consists of the blank nodes _:a and _:b connected by two :p edges; H consists of the blank nodes _:c, _:d and _:e connected by three :p edges.]
Aut(G):
  µ(·)   _:a   _:b
  =      _:a   _:b
         _:b   _:a
Aut(H):
  µ(·)   _:c   _:d   _:e
  =      _:c   _:d   _:e
         _:d   _:e   _:c
         _:e   _:c   _:d
Applying any of the automorphisms shown for the graph in question would lead to the same graph
(and not just an isomorphic copy). □
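The sets Aut(G) and Aut(H) of Example 2.9 can be enumerated by brute force following Definition 2.8. This is a minimal sketch under stated assumptions: the string encoding of terms and the helper names are hypothetical, H is taken as the directed three-cycle listed in Example 3.7, and G is assumed to consist of two :p edges in opposite directions between _:a and _:b.

```python
from itertools import permutations


def bnodes(graph):
    # Assumption: blank nodes are strings starting with "_:".
    return sorted({t for (s, _p, o) in graph for t in (s, o) if t.startswith("_:")})


def apply_mapping(mu, graph):
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}


def automorphisms(graph):
    """List every blank node bijection mu on bnodes(graph) with mu(graph) == graph."""
    bs = bnodes(graph)
    result = []
    for image in permutations(bs):
        mu = dict(zip(bs, image))
        if apply_mapping(mu, graph) == set(graph):
            result.append(mu)
    return result


def is_trivial(mu):
    # The trivial automorphism maps every blank node to itself.
    return all(k == v for k, v in mu.items())


# Assumed encoding of the graphs of Example 2.9.
G = {("_:a", ":p", "_:b"), ("_:b", ":p", "_:a")}
H = {("_:c", ":p", "_:d"), ("_:d", ":p", "_:e"), ("_:e", ":p", "_:c")}
```

With these inputs the sketch finds two automorphisms for G and three for H, matching the tables in the example; exactly one in each list is trivial.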
2.3 Simple semantics, interpretations, entailment and equivalence
The RDF (1.1) Semantics recommendation [21] defines four model-theoretic regimes that, loosely
speaking, provide a mathematical basis for assigning truth to RDF graphs and, subsequently, for
formally defining when one RDF graph entails another: in other words, if one assigns truth to
a particular RDF graph, entailment defines which RDF graphs must also hold true as a logical
consequence. Thus, unlike RDF isomorphism which is, in some sense, a structural comparison of
RDF graphs [11], entailment offers a semantic comparison of RDF graphs in terms of their underlying
meaning [21]. The four regimes are: simple semantics, datatype semantics, RDF semantics and
RDFS semantics. In this paper, we are interested in the simple semantics, which codifies a meaning
for RDF graphs without considering the interpretation of datatype values or special vocabulary
terms (such as rdf:type or rdfs:subClassOf).
Each regime is based on the notion of an interpretation, which maps the terms in an RDF graph to
a set, and then defines some set-theoretical conditions on the set. The intuition is that RDF describes
resources and relationships between them, where interpretations form a bridge from syntactic
terms to the resources and relationships they denote. We now define a simple interpretation.
Definition 2.10 (Simple interpretation). A simple interpretation is a 4-tuple I = (Res, Prop, Ext, Int)
where Res is a set of resources; Prop is a set of properties that represent types of binary relations
between resources (not necessarily disjoint from Res); Ext maps properties to a set of pairs of
resources, thus denoting the extension of the relations; and Int maps terms in IL to Res ∪ Prop,
i.e., maps terms in the RDF graph to the resources and properties they describe. With respect to
blank nodes, let A : B → Res be a function that maps blank nodes to resources, and let Int_A denote
a version of Int that maps terms in ILB to Res ∪ Prop using A for blank nodes. We say that I is a
model of an RDF graph G if and only if there exists a mapping A such that for each (s, p, o) ∈ G, it
holds that Int(p) ∈ Prop and (Int_A(s), Int_A(o)) ∈ Ext(Int(p)).
Here, the existential semantics of blank nodes is covered by the existence of an auxiliary function
A, which is not part of the actual interpretation.
Simple entailment of RDF graphs can then be defined in terms of the models of each graph.
Definition 2.11 (Simple entailment). An RDF graph G simple-entails an RDF graph H, denoted
G |= H, if and only if every model of G is also a model of H.
Intuitively speaking, if G |= H, then H says nothing new over G, or in other words if we hold G
to be true, then we must hold H to be true as a logical consequence. If two RDF graphs entail each
other, then we state that they are simple-equivalent. In other words, semantically speaking, both
graphs contain the same information.
Definition 2.12 (Simple equivalence). An RDF graph G is simple-equivalent with an RDF graph H,
denoted G ≡ H, if and only if every model of G is a model of H and every model of H is a model of
G (in other words, G |= H and H |= G).
Lemma 2.13. Simple equivalence (≡) is an equivalence relation.
Proof. G ≡ H if and only if the sets of models of G and H are equal. Since set-equality is an
equivalence relation, so is ≡. □
Remark 3. If G ≅ H, then G ≡ H. □
Given that we currently deal exclusively with the simple semantics of RDF, for brevity, henceforth,
we may refer to interpretations, models, entailment, equivalence, etc., without qualification, where
we implicitly refer to the simple semantics.
An important question is how isomorphism and equivalence are different for RDF graphs. This
is perhaps best illustrated with an example.
Example 2.14. Take the following two RDF graphs with G on the left and H on the right. First of
all, we ask: does G |= H hold (i.e., is every model of G also a model of H)?
[Figure: two RDF graphs. G (left) contains the blank nodes _:a1 and _:a2; H (right) contains the single blank node _:b. Both graphs use the IRIs ex:Chile and ex:MBachelet, the predicates ex:presidency, ex:president and ex:startYear, and the literal 2014.]
Let's say I = (Res, Prop, Ext, Int) is a model of G, where for the purposes of generality we give no
further details. Since the set of ground terms in H is a subset of those in G, Int maps ground terms in
H to Res ∪ Prop in the same manner as for G. Let A denote an auxiliary mapping of blank nodes for
G such that for each (s, p, o) ∈ G it holds that Int(p) ∈ Prop and (Int_A(s), Int_A(o)) ∈ Ext(Int(p)); in
other words, A witnesses that I is a model of G. Now let µ denote a blank node mapping such that
µ(_:b) = _:a2. Then, for each (s, p, o) ∈ H, it holds that Int(p) ∈ Prop and (Int_{A◦µ}(s), Int_{A◦µ}(o)) ∈
Ext(Int(p)), where A ◦ µ is a valid auxiliary mapping that satisfies the condition for I to be a model
of H. Hence, any model of G is also a model of H, or in other words, G |= H. Intuitively speaking, if
we first map _:b in H to _:a2 in G, then we see that H contains a subset of the information of G.
Now let us ask: does H |= G hold? This time consider a blank node mapping µ such that
µ(_:a1) = _:b and µ(_:a2) = _:b. Using a similar argument as above but in the opposite direction,
we can now see that any model of H must be a model of G: the ground terms of G are a subset of
those of H, and if A is an auxiliary mapping that witnesses that I is a model of H, then A ◦ µ witnesses that
I is a model of G. Put another way, looking just at G, if we map _:a1 to _:a2, we do not change the
meaning of G, and we end up with an isomorphic copy of H, and hence we may see that H |= G.
Given that G |= H and H |= G, we have that G ≡ H, even though G ≇ H. Though both graphs
are structurally different, from a semantic perspective both graphs have the same set of models
under RDF's simple semantics: they imply each other. □
In this example, we saw that an RDF graph can be equivalent to a smaller graph: in other words,
an RDF graph can contain redundant triples that add no new information in semantic terms. This
gives rise to the notion of a lean RDF graph [21], which is one that does not contain such redundancy.
Definition 2.15 (Lean RDF graph). An RDF graph G is considered lean if and only if there does
not exist a proper subgraph G′ ⊂ G such that G′ |= G. Otherwise we call G non-lean.
Example 2.16. Referring back to Example 2.14, H is lean. However, let G′ denote the set of triples
of G that do not mention _:a1. We can see that G′ |= G (with the same line of reasoning as to why
H |= G). In terms of explaining why G is non-lean, intuitively the graph can be read as making the
following claims:
• Chile has a presidency with MBachelet as president and startYear 2014.
• Chile has a presidency with MBachelet as president.
Under simple semantics, the second claim is considered redundant. □
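Leanness per Definition 2.15 can be checked naively by searching for a blank node mapping that sends G onto a proper subset of itself; this uses the mapping-based characterisation of entailment stated as Theorem 3.8 in Section 3.2. The sketch below is a brute-force illustration only, not the leaning algorithm of this paper. The string encoding of terms, the helper names, and the triples for G and H (an assumed reconstruction of the figure in Example 2.14) are assumptions.

```python
from itertools import product


def bnodes(graph):
    # Assumption: blank nodes are strings starting with "_:".
    return sorted({t for (s, _p, o) in graph for t in (s, o) if t.startswith("_:")})


def terms(graph):
    return sorted({t for triple in graph for t in triple})


def apply_mapping(mu, graph):
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}


def is_lean(graph):
    """True if no blank node mapping sends the graph to a proper subgraph of itself.

    Brute force: tries n**k mappings for k blank nodes and n terms.
    """
    graph = set(graph)
    bs, candidates = bnodes(graph), terms(graph)
    for image in product(candidates, repeat=len(bs)):
        mu = dict(zip(bs, image))
        if apply_mapping(mu, graph) < graph:  # "<" is proper subset on sets
            return False
    return True


# Assumed reconstruction of the graphs of Example 2.14.
G = {
    ("ex:Chile", "ex:presidency", "_:a1"),
    ("ex:Chile", "ex:presidency", "_:a2"),
    ("_:a1", "ex:president", "ex:MBachelet"),
    ("_:a2", "ex:president", "ex:MBachelet"),
    ("_:a2", "ex:startYear", '"2014"'),
}
H = {
    ("ex:Chile", "ex:presidency", "_:b"),
    ("_:b", "ex:president", "ex:MBachelet"),
    ("_:b", "ex:startYear", '"2014"'),
}
```

For G, mapping `_:a1` to `_:a2` collapses the graph onto the three triples mentioning `_:a2`, so the sketch reports G as non-lean; no such mapping exists for H.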
3 THEORETICAL SETTING
The goal of this paper is to propose and develop algorithms to compute two canonical forms for RDF:
a canonical form with respect to isomorphism and a canonical form with respect to equivalence.
Having defined these notions in the previous section, we now present some theoretical results that
show these to be hard problems in general; we also present high-level strategies as to how these
canonical forms could be computed. We first focus on isomorphism and later discuss equivalence.
3.1 Isomorphism
To begin, we wish to establish that RDF isomorphism is in the same complexity class as the related
and more well-established problem of graph isomorphism for undirected graphs. In fact, this result
is folklore and was hinted at previously by other authors, such as Carroll [9], but to the best of our
knowledge, no formal proof of this result was given. First we need some preliminary definitions.
Definition 3.1 (Undirected graph). An undirected graph G = (V, E) is a graph where V is the set of
vertexes, E ⊆ V × V is the set of edges, and (v, v′) ∈ E if and only if (v′, v) ∈ E (or in other words,
the edges are unordered pairs).
Definition 3.2 (Graph isomorphism). Given two undirected graphs G = (V_G, E_G) and H = (V_H, E_H),
these graphs are isomorphic, denoted G ≅ H, if and only if there exists a bijection β : V_G → V_H
such that (v, v′) ∈ E_G if and only if (β(v), β(v′)) ∈ E_H. In this case, we call β an isomorphism.
The graph isomorphism problem – of deciding if G ≅ H – is GI-complete: a class that is contained in
NP but is not known to coincide with the NP-complete problems nor to permit polynomial-time solutions. We
now give a result stating that the RDF isomorphism problem – of deciding for two RDF graphs if
G ≅ H – is in the same complexity class as graph isomorphism.
Theorem 3.3. Given two RDF graphs G and H, determining if G ≅ H is GI-complete.
A proof for Theorem 3.3 – which originally appeared in the conference version of this work [27]
– is provided in Appendix A. From Theorem 3.3, we can conclude that it is not known if there
exists a polynomial-time algorithm for RDF isomorphism: the existence of such an algorithm would
imply GI = P, solving a long-open problem in Computer Science. In a recent result, Babai [3] proved
that the graph isomorphism problem can be solved in quasi-polynomial time (see footnote 3), where although a
polynomial-time algorithm has not yet been found, an algorithm better than exponential is now
known; hence GI is contained within QP: the class of problems solvable in quasi-polynomial time.
These results extend to RDF isomorphism per Theorem 3.3. But these theoretical results refer
to worst-case analyses, where a previous result by Babai et al. [4] showed that isomorphism for
randomly generated graphs can be performed efficiently using a naive algorithm. Hence, despite
the possibility of non-polynomial worst cases, many practical algorithms exist to solve graph
isomorphism quickly for many cases.
In fact, our goal here is not to solve the isomorphism decision problem but rather to tackle the
harder problem of computing an iso-canonical version of an RDF graph.
Definition 3.4 (Iso-canonical RDF mapping). Let M be a mapping from an RDF graph to an RDF
graph. M is iso-canonical if and only if M(G) ≅ G for any RDF graph G and M(G) = M(H) if and
only if G ≅ H for any two RDF graphs G and H.
We know that computing an iso-canonical form of an RDF graph is GI-hard since it can be used
to solve the RDF isomorphism problem: we can compute the iso-canonical form of both graphs and
check if the results are equal. Hence we are unlikely to find polynomial-time algorithms.
Before we continue, let us establish an initial iso-canonical mapping that is quite naive and
impractical, but establishes the idea of how such a form can be achieved: define a total ordering
of RDF graphs and for a set of pairwise isomorphic RDF graphs, define the lowest such graph
(following certain fixed criteria) to be canonical.
Definition 3.5 (κ-mapping). Assume a total ordering of all RDF terms, triples and graphs (see footnote 4). Assume
that κ is a blank node bijection that labels all k blank nodes in a graph G from _:b1 to _:bk, and let
κ(G) denote the image of G under κ. Furthermore, let K denote the set of all such κ-mappings valid
for G. We then define the minimal κ-mapping of an RDF graph G as M_κ(G) = min{κ(G) | κ ∈ K}.
Proposition 3.6. M_κ is an iso-canonical mapping.
Proof. First, since we have a total ordering of graphs, there is a unique graph min{κ(G) | κ ∈ K},
and hence M_κ is a mapping from RDF graphs to RDF graphs.
Second, M_κ(G) ≅ G since we only relabel blank nodes: κ is a blank node bijection.
We are now left to prove that M_κ(G) = M_κ(H) if and only if G ≅ H.
(M_κ(G) = M_κ(H) implies G ≅ H) We know that M_κ(G) ≅ G and M_κ(H) ≅ H. If we are given
M_κ(G) = M_κ(H), then we have that G ≅ M_κ(G) ≅ M_κ(H) ≅ H, and since ≅ is an equivalence
relation, it follows that G ≅ H.
(G ≅ H implies M_κ(G) = M_κ(H)) Suppose the result does not hold and there exist G and H
such that G ≅ H and (without loss of generality) M_κ(G) > M_κ(H). Since G ≅ H, there exists
a blank node bijection µ such that µ(G) = H. Let κ be the mapping such that κ(H) = M_κ(H).
Now κ ◦ µ(G) = M_κ(H). Since κ ◦ µ is a κ-mapping for G and M_κ(G) > κ ◦ µ(G), we arrive at a
contradiction per the definition of M_κ since it does not use the minimum κ-mapping for G. □
Footnote 3: On January 9, 2017, the result was briefly retracted as a bug was found in the proof. The result was reasserted a few days
later when a fix was found. See http://people.cs.uchicago.edu/~laci/update.html.
Footnote 4: For example, we can consider terms ordered syntactically, triples ordered lexicographically, and graphs ordered such that
G < H if and only if G ⊂ H or there exists a triple t ∈ G \ H such that no triple t′ ∈ H \ G exists where t′ < t.
Example 3.7. Take graph H from Example 2.9. We have a total of 3! = 6 possible κ-mappings
(with only blank nodes from bnodes(H) in their domain and blank nodes from {_:b1, _:b2, _:b3} in
their codomain) as follows.
H = {(_:c, :p, _:d), (_:d, :p, _:e), (_:e, :p, _:c)}
  κ(_:c)  κ(_:d)  κ(_:e)   κ(H)
  _:b1    _:b2    _:b3     {(_:b1, :p, _:b2), (_:b2, :p, _:b3), (_:b3, :p, _:b1)}
  _:b1    _:b3    _:b2     {(_:b1, :p, _:b3), (_:b3, :p, _:b2), (_:b2, :p, _:b1)}
  _:b2    _:b1    _:b3     {(_:b2, :p, _:b1), (_:b1, :p, _:b3), (_:b3, :p, _:b2)}
  _:b2    _:b3    _:b1     {(_:b2, :p, _:b3), (_:b3, :p, _:b1), (_:b1, :p, _:b2)}
  _:b3    _:b1    _:b2     {(_:b3, :p, _:b1), (_:b1, :p, _:b2), (_:b2, :p, _:b3)}
  _:b3    _:b2    _:b1     {(_:b3, :p, _:b2), (_:b2, :p, _:b1), (_:b1, :p, _:b3)}
The six κ-mappings only produce two distinct graphs, where the first, fourth and fifth mappings
correspond to κ′(H) and the rest to κ″(H), as follows:
κ′(H) = {(_:b1, :p, _:b2), (_:b2, :p, _:b3), (_:b3, :p, _:b1)}
κ″(H) = {(_:b1, :p, _:b3), (_:b3, :p, _:b2), (_:b2, :p, _:b1)}
Assuming a typical lexical ordering (as per Footnote 4), M_κ(H) = κ′(H). Importantly, one could
(bijectively) relabel _:c, _:d, _:e in the original graph without affecting the result: the output would
be the same for any M_κ(H′) such that H ≅ H′. □
This discussion suggests a correct and complete brute force algorithm to compute an iso-canonical
form for any RDF graph G: search all κ-mappings of G for one that gives the minimum such graph.
However, such a brute-force process is unnecessary and naive: by applying a more fine-grained
total ordering on RDF graphs, we can use a similar principle to find an iso-canonical form in a
much more efficient way. Such an algorithm will be presented later in Section 4.
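The brute-force procedure just described can be sketched in a few lines. This illustrates the minimal κ-mapping of Definition 3.5 only; it is not the efficient algorithm of Section 4. The string encoding of terms and the function names are assumptions, and graphs are ordered here by comparing their sorted triple lists lexicographically, which is one possible total ordering in the spirit of Footnote 4.

```python
from itertools import permutations


def bnodes(graph):
    # Assumption: blank nodes are strings starting with "_:".
    return sorted({t for (s, _p, o) in graph for t in (s, o) if t.startswith("_:")})


def apply_mapping(mu, graph):
    # Simultaneous substitution, so existing labels such as _:b1 are handled safely.
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}


def minimal_kappa_mapping(graph):
    """Return the lowest graph over all k! relabellings of blank nodes to _:b1.._:bk."""
    bs = bnodes(graph)
    labels = [f"_:b{i}" for i in range(1, len(bs) + 1)]
    candidates = (
        apply_mapping(dict(zip(bs, image)), graph) for image in permutations(labels)
    )
    # Assumed total order: compare the sorted triple lists lexicographically.
    return frozenset(min(candidates, key=sorted))


# Graph H of Example 3.7 and a relabelled isomorphic copy.
H = {("_:c", ":p", "_:d"), ("_:d", ":p", "_:e"), ("_:e", ":p", "_:c")}
H_RELABELLED = {("_:x", ":p", "_:y"), ("_:y", ":p", "_:z"), ("_:z", ":p", "_:x")}
KAPPA_PRIME_H = frozenset(
    {("_:b1", ":p", "_:b2"), ("_:b2", ":p", "_:b3"), ("_:b3", ":p", "_:b1")}
)
```

For H this sketch selects the graph called κ′(H) in Example 3.7, and the relabelled copy yields the same result, as the proposition requires.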
3.2 Equivalence
As previously discussed in Section 2.3, two RDF graphs are equivalent if they entail each other. Towards
an initial procedure for deciding if two RDF graphs entail each other, we have the following result:
Theorem 3.8. G |= H if and only if a blank node mapping µ exists such that µ(H) ⊆ G [18, 21]. □
Example 3.9. Referring back to Example 2.14, as a witness that G |= H holds, we have a mapping
µ such that µ(_:b) = _:a2 and µ(H) ⊂ G. As a witness that H |= G holds, we have a mapping µ′
such that µ′(_:a1) = µ′(_:a2) = _:b and µ′(G) = H.
For argument's sake, let us consider an RDF graph H′ derived from H by replacing _:b with
an IRI :I. We no longer have a mapping µ that witnesses G |= H′; in fact, G ⊭ H′. However, for
H′ |= G, we have the mapping µ(_:a1) = µ(_:a2) = :I. □
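Theorem 3.8 suggests a direct, if exponential, entailment test: enumerate the blank node mappings from bnodes(H) into terms(G) and check whether any image of H is contained in G. The following is a minimal sketch under stated assumptions: terms are strings, blank nodes start with `_:`, the helper names are hypothetical, and the triples are an assumed reconstruction of the graphs of Example 2.14 (with `ex:I` standing in for the IRI :I of Example 3.9).

```python
from itertools import product


def bnodes(graph):
    # Assumption: blank nodes are strings starting with "_:".
    return sorted({t for (s, _p, o) in graph for t in (s, o) if t.startswith("_:")})


def terms(graph):
    return sorted({t for triple in graph for t in triple})


def apply_mapping(mu, graph):
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}


def entails(g, h):
    """True if some blank node mapping mu satisfies mu(h) being a subset of g."""
    g = set(g)
    bh, candidates = bnodes(h), terms(g)
    for image in product(candidates, repeat=len(bh)):
        mu = dict(zip(bh, image))
        if apply_mapping(mu, h) <= g:
            return True
    return False


def equivalent(g, h):
    # Simple equivalence: entailment in both directions.
    return entails(g, h) and entails(h, g)


# Assumed reconstruction of the graphs of Examples 2.14 and 3.9.
G = {
    ("ex:Chile", "ex:presidency", "_:a1"),
    ("ex:Chile", "ex:presidency", "_:a2"),
    ("_:a1", "ex:president", "ex:MBachelet"),
    ("_:a2", "ex:president", "ex:MBachelet"),
    ("_:a2", "ex:startYear", '"2014"'),
}
H = {
    ("ex:Chile", "ex:presidency", "_:b"),
    ("_:b", "ex:president", "ex:MBachelet"),
    ("_:b", "ex:startYear", '"2014"'),
}
H_PRIME = {tuple("ex:I" if t == "_:b" else t for t in triple) for triple in H}
```

Under these assumptions the sketch agrees with Example 3.9: G and H entail each other, H′ entails G, and G does not entail H′. The search space grows as n^k for k blank nodes in H and n terms in G.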
In traditional graph terms, finding a blank node mapping µ that witnesses such an entailment
relates closely with the notion of graph homomorphism.