ebook img

Evaluating navigational RDF queries over the Web PDF

0.8 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Evaluating navigational RDF queries over the Web

Evaluating navigational RDF queries over the Web JorgeBaier,DietrichDaroch,JuanL.Reutter,DomagojVrgocˇ PontificaUniversidadCato´licadeChileandCenterforSemanticWebResearch ABSTRACT worksusingtheonlineRDFdocumentspublishedbyDBLP,one SemanticWeb,anditsunderlyingdataformatRDF,lendthemselves ofthesimplestdatasetsnowformingpartoftheWebofLinked naturallytonavigationalqueryingduetotheirgraph-likestructure. Data.IntheRDFrepresentationofDBLP,eachresearcherisgivena This is particularly evident when considering RDF data on the uniqueIRI,aswellaseachpaper.Theauthorshiprelationindicating Web,wherevariousseparatelypublisheddatasetsreferenceeach thatanauthorAwroteapaperP isthenrepresentedbythetriple otherandformagiantgraphknownastheWebofLinkedData. {P,dc:creator,A}.TheIRIforeachauthor,thenservesasagood 7 And while navigational queries over singular RDF datasets are startingpointforinvestigatingtheDBLPdataset,asdereferenc- 1 supportedthroughSPARQLpropertypaths,notmuchisknown ingtheirIRIwillintuitivelygiveusallthepaperswrittenbythis 0 aboutevaluatingthemoverLinkedData.Inthispaperwepropose author.Forexample,ifwedereferencetheIRIM.Stonebraker,rep- 2 amethodforevaluatingpropertypathqueriesovertheWebbased resentingMichaelStonebraker,weobtainadocumentcontaining, r ontheclassicalAIsearchalgorithmA*,showitsoptimalityinthe amongstotherthings,thefollowingtriples a M openworldsettingoftheWeb,andtestitusingrealworldqueries M.Stonebraker foaf:name “M.Stonebraker” whichaccessavarietyofRDFdatasetsavailableonlineandthat inTods:StonebrakerWKH76 dc:creator M.Stonebraker arenotnecessarilyknowninadvance. inSigmod:PavloPRADMS09 dc:creator M.Stonebraker 4 1 ACMReferenceformat: These triples indicate that M.Stonebraker is the author of JorgeBaier,DietrichDaroch,JuanL.Reutter,DomagojVrgocˇ.2016.Evalu- the papers represented by IRIs inTods:StonebrakerWKH76 and ] atingnavigationalRDFqueriesovertheWeb.InProceedingsof,,,10pages. inSigmod:PavloPRADMS09,andthatthenameoftheentityrepre- B DOI:10.1145/nnnnnnn.nnnnnnn sentedbyM.Stonebrakerisindeed“MichaelStonebraker”. Sup- D posenowthatweneedtoretrievethenamesofalltheco-authors . 1 INTRODUCTION ofMichaelStonebraker.Itisveryeasytodothisusingthelinked s c TheResourceDescriptionFramework(RDF)[30]istheWorldWide datainfrastructure: WefirstdereferencetheIRIM.Stonebraker, [ obtaininganRDFdocumentthatcontains,inparticular,atriple Webconsortium(W3C)standardforrepresentingSemanticWeb 2 data.Inessence,anRDFgraphisasetoftriplesofinternationalised {P,dc:creator,M.Stonebraker}foreachpaperPauthoredbyM. Stonebraker.ThenwejustneedtodereferenceeachoftheIRIsof v resourceidentifiers(IRIs),wherethefirstandlastofthemrepresent 4 entityresources,andthemiddleonerelatestheseresources,just thesepapers:dereferencingeachoftheseIRIsP givesustriplesof 5 asisitdoneingraphdatabases[4].Theofficialquerylanguagefor theform{P,dc:creator,A},andnowweknowthatAisacoau- 4 RDFdatabasesisSPARQL[15]. thorofM.Stonebraker.ThelaststepistofurtherdereferencetheIRI 6 To answer the need for including navigational features into ofeachoftheseresearchers,tolookforatriple{A,foaf:name,N} 0 SPARQL,thelatestversionofthelanguageincludespropertypaths, thatindicatesthenameoftheresearcher(inthiscaseN). . Ofcourse,thequerylookingforco-authorsofMichaelStone- 1 asetofqueriesthatcanbeseenastheanaloguesofestablished brakercanbeseenasafixedpattern:namely,itisapathoflength 0 graphdatabaselanguagessuchasregularpathqueriesandtwo-way 7 regularpathqueries[11].Consequently,propertypathsarealready two,startingintheIRIM.Stonebrakerandtraversingtheedge 1 supportedbythevastmajorityofexistingSPARQLengines(e.g., dc:creatorbackwards(thusreachingapaperwrittenbyMichael : Stonebraker),andthentraversingthedc:creatoredgeforwards v [10,24,37]).Theinclusionofnavigationalqueriesisalsopresent toreachoneofhisco-authors.Butwhathappenswhenwewantto i inmostothergraphdatabasemodels(seee.g.[4,7]). X generalisethisqueryandobtainthecollaborationreachofMichael Besidesthetraditionalapproachwhereoneissuesaqueryovera Stonebraker,thatis,hisco-authors,theco-authorsofhisco-authors, r (setof)graphdatabases,thecommunityhasfurtherraisedtheneed a theirco-authors,etc?ThisissimilartothepopularnotionofErdo˝s forafundamentallydifferentwayofqueryingRDFdata:toobtain number,butthistimestartingwithadifferentauthor.Toanswer answersofqueriesoverthewholecorpusofRDFdatapresentonthe suchaqueryafixedlengthpathwillnolongersuffice,sincewedo WebandlinkedtogetherintowhatisknownastheWebofLinked notknowthedistancebetweenthestartingnodeandtheending Data,inadistributedwayandwithoutassuminganymediation nodeinadvance.Wethereforeneedtousepropertypaths;inthis norcentralisedorganisationincontrolofthedata,followingthe casethiswouldbedoneusingthequery LinkedDataPrinciples[9]. ThefundamentalpropertyofRDFdatathatmakesthisquerying M.Stonebraker (ˆdc:creator/dc:creator)* ?x, possibleisthattheIRIsinRDFdocumentspublishedonlineshould bedereferenceable.Thisbasicallymeansthatbyaccessinganygiven whichrepeatsthesimplepathfromoneauthortoapaper(using IRI,weobtainanewRDFdocumentdescribingitsneighbourhood ˆdc:creatortofollowanedgelabelleddc:creatorinareverse (orapartofit)intheLinkedDatagraph.Letusexplainhowthis direction)andthentoanotherauthor(usingdc:creator)anarbi- trarynumberoftimes,assignifiedbythestaroperator*.Theidea , isasbefore,butnowonceaco-authorisretrieved,searchdoesnot 2016.978-x-xxxx-xxxx-x/YY/MM...$15.00 DOI:10.1145/nnnnnnn.nnnnnnn stop,butcontinueswiththis(co-)authorasthestartingnode. ,, JorgeBaier,DietrichDaroch,JuanL.Reutter,DomagojVrgocˇ Whenevaluatingthesequerieswehaveonlydereferencedand - Itisveryrobustwhenevaluatingpropertypathsliveoverthe fetchedthedocumentsthatweneededinordertoanswerthequery, Webinfrastructure,andcanoftenanswerquerieswhichfaileven andthuswearetakingfulladvantageofthenatureofLinkedData. onSPARQLimplementationsexecutedoveralocaldataset. Thereisanotherfundamentaladvantageofthisapproach:wecan ApartfromdescribingthebasicimplementationoftheA*al- crossbetweendifferentdomainswithoutanyeffortbyusingthe gorithmandprovingitsoptimality, wealsodevelopseveralop- infrastructureoftheWeb, whichhappenswhenadereferenced timisationsgearedtowardsqueryansweringintheLinkedData IRIlinkstoanotherIRIresidingonadifferentserver. Thisisin setting. Mostnotably,weshowthatdereferencingmultipleIRIs contrastwith,forinstance,issuingasingledistributedqueryto inparallelcanspeedupthecomputationofpropertypathssignifi- acentralisedendpoint,sincewecanaccessanarbitrarynumber cantly.Finally,wedescribehowourimplementationrunsoverthe ofdifferentsources. Furthermore,wecanaccessdatathatisnot WebofLinkeddatausinganumberofreal-worldquerieswhich publishedondedicatedendpoints;allthatweneedisdatapublished utilisedifferentRDFdatasets. WecompareourapproachtoBFS onthestandardWebarchitecture.Uptoourbestknowledge,this andDFS-basedalgorithmsandtheirparallelversions,showingthat frameworkofdistributed,decentralisedandungovernedquerying A*issuperiorwhenitcomestoqueryingovertheWeb. hasnotbeenconsideredbeforetheadventofLinkedData. TheadvantagesoftheseapproacheshaveledtheSemanticWeb Outline.WeformaliseLinkedDataandpropertypathsinSection 2.InSection3wedescribehowDFSandBFScanbeusedtoanswer communitytoinvestigatethefundamentalsofqueryingoverthe propertypathqueriesandwhataretheirshortcomings.InSection Web[18],anddevelopingalgorithmsforansweringSPARQLqueries 4weintroducetheA*algorithmandshowitsoptimality.Optimi- overLinkedData[19,35].Unfortunately,despitethepotentialthat sationsarepresentedinSection5,andreal-worldexperimentsin propertypathscouldhaveinWebquerying,mostofthealgorithms Section6.WeconcludeinSection7. developedinthiscontextfocusonthepatternmatchingfeaturesof SPARQL,anddonotconsiderpropertypaths.Indeed,themajority ofstudiesaboutpropertypathsonlyconsiderhowtheyworkover 2 PRELIMINARIES a single centralised dataset [5, 14, 26, 38]. And while the need RDFgraphs. LetI andL becountablyinfinitedisjointsetsof forunderstandinghowpropertypathsmightworkovertheWeb IRIsandliterals,respectively.AnRDFtripleisatriple(s,p,o)from hasrepeatedlybeenraisedbytheresearchcommunity[8,20,21], (I∪L)×I×(I∪L),wheresiscalledsubject,ppredicate,ando previousstudieshavemostlyfocusedonunderstandingappropriate object.An(RDF)graphisafinitesetofRDFtriples.Forsimplicity semanticsand/orproposingnewlanguagestohelpusersnavigate weonlydealwithRDFdocumentsthatdonotcontainblanknodes. the Web, instead of describing the algorithms computing these LinkedData.Weareinterestedincomputingnavigationalqueries answers.Theonlyexceptionis[13],suggestingabasicdepth-first overthewidebodyofRDFdocumentspublishedontheWebthat searchalgorithminthecontextofNautiLODqueries:alanguage comprisewhatisknownastheWebofLinkedData.Ascustomary proposalthatextendspropertypaths.Therefore,themainobjective intheliterature(seee.g.[3,17]),wetreatthiscorpusofdocuments ofthispaperistoanswerthequestion: Howcanoneefficiently as a tupleW = (G,adoc), where G is a set of RDF graphs and evaluatepropertypathqueriesovertheWebofLinkedData? adoc : I → G ∪{∅} is a function that assigns graphs in G to someIRIs,andtheemptygraphtotherestoftheIRIs.Notethat Contributions. Ourmaincontributionisanalgorithmforeffi- cientlyretrievinganswerstopropertypathqueriesoverLinked previouswork(e.g.[17])usuallydefinesadocasapartialfunction. Data. Our solution is based on the observation that evaluating Weadoptinsteadtheconventionthatadoc(u)=∅wheneveradoc propertypathscanbeseenasasearchproblemoveraninitially isnotdefinedforu,asitsimplifiesthepresentation. unknowngraph.Indeed,intheexamplesabovewestartfromone The intuition behind this definition is that G represents the knownIRI(M.Stonebraker)andbeginexploringitsneighbours setofdocumentsontheWebofLinkeddata,andadoccaptures guidedbythequerywearetryingtoanswer.Butthisproblemhas dereferencing;thatis,adoc(u)givesustheneighboursofu inG. beenwellstudiedbytheArtificialIntelligencecommunity,andit Note that G is usually not available and has to be retrieved by isgenerallyagreedthatthemostappropriatesolutionhereisan lookingupIRIswithadoc. heuristic-searchalgorithmsuchasA*[16,31]. Inthispaperwe Example2.1. Wecannowformalisetheoperationsperformedin proposeavariantofA*forthesettingofLinkedDatabyusingthe theintroductionoverthelinkeddataarchitectureofDBLP.Starting propertypathwearetryingtoanswerasaheuristictoguideour with theIRI M.Stonebraker, wecan invoke adoc on thisIRI to search.Themainadvantagesofthisapproacharethefollowing: fetchitsassociatedgraph - It allows to overcome shortcomings of basic graph traversal algorithms such as depth-first search (DFS) and breadth-first adoc(M.Stonebraker)= search(BFS).Infact,weshowthatA*dominatesBFSandDFS, andthatitisoptimalwithrespecttothepartofthegraphthat  inSigMm.oSdt:oPnaevblroaPkReArDMS09 dfco:acfr:enaatmoer “MM.S.tStoonneebbrraakkeerr” becameavailableduringthesearch.This,insomesense,isthe . . . bestwecanhopeforintheopen-worldsettingoftheWeb.  .. .. .. - It does not only allow to find pairs of nodes connected by a When looking for the coauthors of M. Stonebraker, we might propertypath,butitcanalsoreturn(oneofthe)shortestpaths want to fetch adoc(inSigmod:PavloPRADMS09), which will give whichwitnessthisconnection:afeaturethatexistingSPARQL usagraphcontaining,amongstotherthings,triplesoftheform enginesarecurrentlylacking. (inSigmod:PavloPRADMS09,dc:creator,A),withAbeingtheIRI EvaluatingnavigationalRDFqueriesovertheWeb ,, oftheauthorsofthepaper.TogetthenameofAwefetchthegraph (NFA)AeoverI±thatacceptsthesamelanguagease,whenviewed adoc(A)andlookforthetriplewithfoaf:nameasthepredicate. asaregularexpression.Wecannowshow: PropertyPaths.Navigationalqueriesovergraphdatabasescom- Proposition2.2([11,12,26]). PPComputationcanbesolvedin monlyaskforpathsthatsatisfycertainproperties.Themostsimple O(|G|·|e|)(thuslinearinboththesizeofthegraphandthequery). ofthemcorrespondtoregularpathqueries,orRPQs[2,12],which The idea is as follows. LetG be an RDF graph,e a property select pairs of nodes connected by a path conforming to a reg- pathexpressionandu anIRI.First,weconstructtheautomaton wulhairchexepxrteesnsdioRnP,Qanwdi2th-wthaeyarbeigliutlyartoptartahveqruseeraiens,edogre2bRaPcQkwsa[1rd1]s,. Ae =(Qe,I±,qe0,Fe,δe)equivalenttothequerye,whereQe isthe setofstates,q0 istheinitialstate,F isthesetoffinalstatesand SpPaAthRs,QwLhfiecahtuarreesthaecmlasseslvoefsnaanviegxatteinosniaolnqoufetrhieeswkenlolwknnoawsnpcrloapsesrotyf δe ⊆ Qe ×I±e×Qe isthetransitionrelation. Next,fromG and 2RPQs.Forreadabilityweassumewedealonlywith2RPQs,adopt- Ae weconstructthelabelledproductgraphG×Ae whosenodes ingtheformalisationin[26]. Notehoweverthatouralgorithms comefromI×Qe,andthereisanedgefromanode(u1,q1)to (andourimplementation)workforallpropertypathexpressions. anode(u2,q2)labelledwitha ∈ I ifandonlyif(i)G containsa Formally,wedefinepropertypathsbythegrammar triple(u1,a,u2)and(ii)thetransitionrelationδe containsthetriple (q1,a,q2), thatis, ifinAe onecanadvancefromq1 toq2 while e := u |e− |e1·e2 |e1+e2 |e∗ |e?, readinga.Similarly,thereisanedgebetween(u1,q1)and(u2,q2) whereuisanIRIinI.Thesemanticsofpropertypaths,denotedby labelledwitha− ∈I−if(i)(u2,a,u1)∈Gand(ii)(q1,a−,q2)∈δe. e G,forapropertypatheandanRDFgraphG,isshownbelow. Itisnownotdifficulttoshowthefollowingproperty: (cid:74) (cid:75) a G ={(s,o) |(s,a,o)∈G}, Lemma2.3([12]). Apair(u,v)belongsto e G ifandonlyifthere e(cid:74)−(cid:75)G ={(s,o) |(o,s)∈ e G}, isapathfrom(u,qe0)to(v,qef)inthelabelle(cid:74)d(cid:75)graphG×Ae,where e1(cid:74)·e2(cid:75)G = e1 G◦ e2 G, (cid:74) (cid:75) qef ∈Fe isafinalstateofAe. (cid:74) (cid:75) (cid:74) (cid:75) (cid:74) (cid:75) e1+e2 G = e1 G∪ e2 G, WecannowsolvethePPComputationproblembytraversing (cid:74) e∗(cid:75)G =(cid:216)(cid:74) (cid:75)ei G(cid:74)∪(cid:75){(a,a) |aisaterminG}, theproductgraphG×Ae startingin(u,qe0)andreturningallthe (cid:74)e?(cid:75)G =i≥e1(cid:74)G∪(cid:75){(a,a) |aisaterminG}. IoRuIrstvrasvuecrhsatlh.aThtwuse,einncaosuenntseer,aonneodcaen(vre,qcaefs)t,twhiethprqoefbl∈emFeo,fdquurienrgy (cid:74) (cid:75) (cid:74) (cid:75) Here◦istheusualcompositionofbinaryrelations,andei isthe computation(inasinglegraph)astheproblemofsearchingforall concatenationoficopiesofe. connectedfinalnodesintheproductgraph.Thisdualitybetween evaluationandsearchisacrucialcomponentofourapproachfor 2.1 EvaluatingPropertyPathsviaAutomata queryingmultiplegraphsontheWebofLinkedData. AsinthecaseofthequerycomputingthecoauthorreachofM. 3 COMPUTINGPROPERTYPATHSOVER Stonebraker,oneisusuallyinterestedincomputingalltheIRIsthat canbereachedfromastartingIRIubymeansofapropertypath THEWEB expression.Formally,westudythefollowingproblem. WhencomputingtheanswerofapropertypathovertheWeb,we cannotsimplyrelyonthealgorithmoutlinedinSection2.1,because Problem: PPComputation thisassumesthatwehaveourentiregraphinmemory,whichis Input: PropertyPathe,RDFgraphG,startingIRIu notafeasibleoptionforthecaseoftheWeb. Havingastarting Output: AllIRIsvsuchthat(u,v)∈(cid:74)e(cid:75)G IRIucomesinhandyhere,aswecanemulatethealgorithmfrom Alternatively,onemaywishtocomputethefullevaluation e G Section2.1bydereferencingu,retrievingitsneighboursinadoc(u), ofpairsconnectedviaapathconformingtoe. However,thi(cid:74)so(cid:75)p- andcontinuingfromthere,thusbuildingalocalcopyofaportion erationisseldomusedinpractice: itisnotanintuitivequeryto oftheWebgraphneededtoanswerthequery. ask,andwhenusingpropertypathsinSPARQLoneusuallyobtains Toformalisethis,letusdefinetheWebgraphGWebastheRDF startingpointsfromotherpatternsorjoinsofpatterns.Also,com- graphconsistingoftheunion(cid:208)u∈Iadoc(u)ofallthegraphsresult- putingthefull e G isnotevensupportedinallSPARQLsystems ingbydereferencinganIRIinI(i.e.thecompleteWebofLinked (forinstanceVi(cid:74)rt(cid:75)uosoallowsonlypropertypathswithastarting Data). Fromhereonwardsweassumethatadoc(u)givesusthe point). Furthermore,aswewillseeinthefollowingsections,in neighboursofuintheWebgraph(seeSection5.2foradiscussion theopenworldsettingofLinkedDataitisonlynaturaltohavea ofhowtodealwiththeshortcomingsofthecurrentLinkedData startingpointforoursearch,sinceitisunrealistictoexpectthe infrastructure). Theproblemwearenowinterestedinissolving computationtotraverseandmanipulatetheentireWebgraph.This PPComputationaboveforthegraphGWeb,i.e.theproblem: iswhywechosetofocusonPPComputation. Problem: PP over the Web TosolvethePPComputationproblem,thetheoreticalliterature Input: PropertyPathe,startingIRIu proposedasimplealgorithmbasedonautomatatheory.Topresent thisalgorithm,notefirstthatourpropertypathsarenothingmore Output: AllIRIsvsuchthat(u,v)∈ e GWeb (cid:74) (cid:75) thanregularexpressionsoverthealphabetI±=I∪{u− |u ∈I} Now, althoughtheapproachofSection2.1wouldrequireusto thatcontainsallIRIsandtheirinverses. Thus,foreachproperty dooursearchoverGWeb×Ae,wecanrecastPP over the Web pathewecanconstructanondeterministicfinitestateautomaton as finding paths inside a subgraphGP ⊂ GWeb ×Ae which is ,, JorgeBaier,DietrichDaroch,JuanL.Reutter,DomagojVrgocˇ constructeddynamicallybydereferencingIRIsstartingatu.And Algorithm1:Breadth/Depth-FirstSearch astlothpocuognhsttrhuectginragpihtwGhPenmwigehtdebseirme),uscehlescmtinalgletrhe(inbefsatcat,lgworeitchamn 12 funcitnioitn←Se(aurc,hq(ue0),Ae) forproducingthisgraphanddoingpathsearchingoveritisnot 3 init.parent←null anobvioustask,duetothefollowingissuesnotoccurringinthe 4 ifqe0 ∈Fethenreturninitoraddinittosolutions 5 InitialiseOpenasanemptystack(DFS)orqueue(BFS) classicalpath-findingsetting. 6 InitialiseSeenasanemptyset First,path-findingalgorithmsaredesignedtoworkwithgraphs 7 InsertinitintobothOpenandSeen thatcanbeeitherstoredinmemoryorgeneratedefficiently. In 89 whilEexOtrpaecntnisondoetsem=p(tvy,dqo)fromOpenandcomputeNeighbours(s) contrast,graphGP isgeneratedbydereferencingIRIswhichin- 10 foreacht=(v(cid:48),q(cid:48))inNeighbours(s)thatisnotinSeendo volvesresolvinganumberofHTTPrequests. Thetimerequired 11 t.parent←s tocompletearequestdominatessignificantlythetimerequiredto 1132 Iinfsqe(cid:48)rt∈tFinetothbeonthreOtuprenntanodrSaedednttosolutions carryoutanyoperationperformedinmemory.Efficientalgorithms forthisproblemshouldthereforeaimatreducingnetworkrequests, 14 functionNeighbours((v,q)) afactorthatisusuallynotconsideredwhensolvingpath-finding 15 InitialiseSuccasanemptysetandRDFgraphGtempasanemptygraph 16 Gtemp←adoc(v) problems.Thesecondissueisthathereweareinterestedinmore 17 foreachIRIa ∈Iandstateq(cid:48)s.t.(q,a,q(cid:48))isinδedo than one solution. As such, an algorithm that returns answers 18 foreachtriple(v,a,v(cid:48))inGtempdo Insert(q(cid:48),v(cid:48))intoSucc incrementallyseemstobeamoresensibleoptionthanonethat 19 foreachIRIa− ∈I−andstateq(cid:48)s.t.(q,a−,q(cid:48))isinδedo 20 foreachtriple(v(cid:48),a,v)inGtempdo Insert(q(cid:48),v(cid:48))intoSucc computesallanswerspriortoreturningany. 21 returnSucc Next,wediscusshowclassicalpath-findingalgorithmscanbe modifiedtoreturnanswerstopropertypathqueriesovertheWeb andpinpointsomeoftheirshortcomingsinthissetting. isreachablefrominit thenthealgorithmeventuallyretrievesthis node. Thisisimportantbecauseitguaranteesthatallsolutions toaqueryareeventuallyreturned. Third,thememoryfootprint 3.1 Depth-FirstSearch ofDFSisrelativelylow.Actually,ifthedepthofthenodeontop Depth-FirstSearch(DFS)isaneasy-to-implementpath-findingalgo- ofthestackisk andthemaximumbranchingfactor(numberof rithmthatcanbeusedtosolvethePP over the Webproblem.On neighboursofanode)isb,thenthesizeofOpenisO(kb). inputastartingIRIuandanautomatonAe overI±,thealgorithm ToseehowDFSworkswhensolvingPP over the Web, let beginsasearchoverthegraphGWeb×Ae startingwiththenode us consider the first steps taken when processing the prop- init = (u,qe0),whereqe0 istheinitialnodeofAe. Thegoalofthe erty path (dc:creator− · dc:creator)∗, with the starting IRI algorithmistolookfornodesoftheform(v,qf),withvanIRIand M.Stonebraker,whichwaspresentedintheintroduction.First,let qofftahefinalaglosrtiatthemo.fAAte;etvheirsyismcoommemnotndluyriknngoewxnecausttiohne,gtohaelaclognodriitthiomn Ae bethefollowingNFA: dc:creator− maintainsasearchfrontier (orOpenlist)implementedasastack. start q0 q1 Atinitialisation,thefrontierissettoonlycontainthestartnode init.Inthemainloop,anodesisextractedfromthefrontierand dc:creator expandedbycomputingitsneighboursinGWeb×Ae,bymeansof As explained in Lemma 2.3, the starting node for our search thefunctionNeighbours.Allneighbouringgoalnodesarereturned, is (M.Stonebraker,q0). In the first iteration we extract this andthenallneighboursthathavenotbeenpreviouslyaddedtothe node from Open, dereference the IRI M.Stonebraker, obtain- frontierarenowinsertedatthetopofthefrontier.Thealgorithm ing, amongst others, the triple (inTods:StonebrakerWKH76, terminatesunsuccessfullyifthefrontierempties. dc:creator, M.Stonebraker). This will allow us to add to our ApseudocodeforDFSispresentedinAlgorithm1.Notethatwe frontier the node (inTods:StonebrakerWKH76,q1) and we pro- needOpentobeastackforDFS(Line7).Observeadditionallythat ceedsimilarlyforothertriplesinadoc(M.Stonebraker). Forthe thealgorithmdoesnotreturnapathbutratheranodefromwhich seconditeration,letusassume(inTods:StonebrakerWKH76,q1) apathcanbeobtainedbyfollowingtheso-calledparentpointers is at the top of the stack. When expanded, DFS will lookup (setinLine11).Finally,observethatinthecontextofnavigational adoc(inTods:StonebrakerWKH76) and retrieve, amongst other queryanswering,computingtheneighboursofanode(function things, all nodes connected to inTods:StonebrakerWKH76 by Neighbours)needsIRIdereferencing(setinLine16)whichinturn meansofalabeldc:creator. This,inparticular,yieldsall4au- requiresnetworkcommunication,anoperationthatmaytakesig- thorsofthispaper,but(M.Stonebraker,q0)isnotaddedtoOpen nificantlymoretimethanotherscarriedoutbythealgorithm,such becauseitwasalreadyinSeen.ForthenextiterationDFStakesone asdatamanagement. ofthesenodes,say(G.Held,q0),expandsthemagain,obtainingall TherearethreepropertiesofDFSthatareimportantforquery papersofG.Held;thenextiterationexpandsoneofthesepapers, answering. First,DFScanbeeasilymodifiedtoreturnpathsin- addsalltheauthorstothelistofanswers;andsoon. crementallyinsteadofonlyonepath.Indeed,insteadofreturning Algorithm1implementsaloopdetectionbypreventingthein- inLine12,thenodejustfoundtobeagoalnodecanbeaddedto sertionofapreviouslyseennodetoOpen. Thisisimportantto alistofsolutions. Inthesamespirit, onecaneasilyadaptDFS guaranteethatthealgorithmterminatesoverafinitegraphand toreturnthefirstk solutionsbyintroducingk asanadditional thattheanswersarecomplete. Inourcasethisimpliesthatwe parameter.Second,DFSiscompleteforfinitegraphs:ifagoalnode arelookingforsimplepaths,albeitnotintheRDFgraphbutin EvaluatingnavigationalRDFqueriesovertheWeb ,, the productgraphGWeb ×Ae. Inpractice thisimplies that our thefirstanswer. Indeed,startingfromStonebraker’sIRI,wejust algorithmlooksforpathswherethesameIRImayberepeatedat choosetheIRIforoneofStonebraker’spapers,weexpandsuchan mostanumberoftimesequivalenttothestatesoftheexpression IRI,fromwherewechoosetheIRIofoneofhisco-author’s.After automataAe. CompletenessofDFSinourcontextfollowsfrom dereferencingthelatterIRIwefindthefirstsolution. asimplepumpingargumentandthefactthatpropertypathsare DFShasdifferentyetimportantissuewiththisverysamequery. regularexpressionsoverI±. Tofindafirstsolution,DFSactuallydoestheminimumamount ThemostnotabledrawbackofDFSisthatthereisnoguaran- ofeffort,dereferencingtheminimumnumberofIRIs,asdescribed teeonsolutionquality,andsolutionswithmuchshortestpaths above.Theissueappearswhenlookingfortheanswersthatfollow maybemissed.Forinstance,inthequeryaboveitwillreturnthe thefirst. BecausethefocusofDFSisdepth,whenexecutedover co-authorsofG.Held, whichareatdistancetwoormorefrom DBLP,the5thanswerofourqueryhaslength6,thenext4answers M.Stonebraker,beforereturningtheotherauthorsofthepaper havelength12,andthefollowingones32andup.Thisimpliesthat inTods:StonebrakerWKH76.Inpracticethismeansthatwewould DFSwillincurinmorecomputationtimetoretrievetheseanswers, needmanymoreHTTPrequeststoretrievesubsequentsolutions, aswellasmorehttprequests.Moreover,returningtheselengthy whichinturnmeansmoretimetocomputeanswers. pathsfirstdoesnotseemintuitivelyright,aswenormallywantto displaysimpler,shorterpathsfirst.Indeed,itisnothardtocontrive 3.2 Breadth-FirstSearch examplesinwhichthelengthofsolutionsincreasesmuchfaster ToalleviatethedrawbacksofDFS,onecouldconsiderinsteadusing thaninourexamples,evenwhenmanyshortersolutionsexist. Breadth-FirstSearch(BFS),anothercompletesearchalgorithmthat Whatweneedisagoodbalancebetweenexecutiontimeand isguaranteedtofindshortestpaths.BFSissimilartoDFSinmost solutionquality.Inourexample,asensiblewaytoproceedwould aspects:itkeepsasearchfrontier(i.e.theOpenlist)duringexecu- betotaketheIRIforthefirstpaper,lookatitsauthors,listthem, tionandineachiterationitextractsanodefromthefrontierand andthenproceedlikewisewiththesecondpaper. Thisbalance thenexpandsit.ThemostimportantdifferenceisthatBFS,instead hasbeenstudiedintheareaofHeuristicSearch,formanyyears, ofalwaysexpandingthedeepestnodeinthefrontier, italways producingalgorithmsthatareguidedbyaheuristicfunctionh,that expandstheshallowestone.Atthealgorithmiclevel,BFScanbe issuchthath(s)estimatesthecostofapathfromstoagoalnode. obtainedfromDFSbysimplychangingunderlyingdatastructure Expansionsaresignificantly(usually,exponentially)reducedasone forOpentoaFIFOqueueinsteadofastack.Assuch,successorsof improvesthe“quality”ofh.Nextwediscussthechallengesofusing anodearealwaysaddedattheendofthequeue,andthereforea ofusingheuristicsearchoverLinkedData. shallownodeisalwaysselectedforexpansion. However,BFSalsosuffersfromanimportantdrawbackinour 4 AISEARCHTOTHERESCUE context:BFShasthepotentialofneedingmanyiterationstofind afirstsolutiontotheproblem.Indeed,assumeonceagainthata A*isoneofthemostsimpleandwell-studiedheuristicalgorithms nodehasatmostbneighbours,andimaginethattheshortestpath capableofsolvingpathsearchproblemsliketheonewedescribed inthesearchgraphhaskedges.Then,allnodesthatarereachable intheprevioussections.Inthissectionwestudyhowtoapplyitto inlessthank edgesareaddedtoOpenwhichmeansthatO(bk) theproblemofansweringpropertypathsovertheWeb. iterationsareneededbeforesuchapathisfound. ThemaindifferencebetweenA*andthealgorithmsdescribed earlier is that the search frontier is a priority queue where the 3.3 IssueswithBFSandDFS priorityisgivenby f(s), afunctionthatestimatesthecostofa BothBFSandDFShaveissueswithsomequeries. Considerfor solutionthatpassesthroughs[16].Ahigh-leveldescriptionofA* examplethefollowingquery,startingwithM.Stonebraker: isasfollows.Atinitialisation,theinitialnodeisaddedtotheOpen queue.A*nowrepeatsthefollowingloop:first,itextractsanode (dc:creator−·dc:creator)∗·dc:creator−·rdfs:label, withthehighestpriorityfromOpen.Itreturnssifitisagoalstate; that is, intuitively we want to retrieve the papers written by a otherwise,itexpandsstoobtainitsneighbours,addsthemtoOpen co-authorofM.Stonebraker,orbyaco-authorofsomeofhisco- andcontinuesexecution.NextwegiveaformaldescriptionofA*. authors, and so on. Furthermore, take the realistic assumption ThesearchgraphofA*isimplicitlydescribedby(1)astartnode thattherearehundredsofIRIsconnectedviadc:creator−withM. sstart;(2)asetofactionsAct;(3)apartialsuccessorfunctionSucc, Stonebraker(indeed,Stonebraker’sDBLPentry,asofthewriting suchthatSucc(a,s),ifdefined,returnsasetSofsuccessornodes; ofthispaper,contains298papers). (4)agoalcondition,whichisabooleanfunctionovernodes—goal LetusnowfocusonwhatBFSdoeswiththisquery.Itwillfirst nodesarethosenodesforwhichthisfunctionreturnstrue;(5)anon- dereferencetheIRIforStonebraker,addingtheIRIofeachofhis negativecostfunctioncbetweensuccessornodes.Theobjectiveof 298paperstoOpen. Then,itwilldereferenceeachoftheseIRIs, thealgorithmistofindapathfromsstart toagoalnode. whichrequires298requestsoverthenetwork.Wheneachofthese AnadditionalargumentrequiredbyA*isaheuristicfunction IRIsareexpanded,weaddtoOpentheco-authorsofStonebraker. h,whichisanon-negativefunctionovernodessuchthath(s)is OnlyafteralltheIRIsforStonebraker’spapersareexpanded,itwill anestimateofthecostofapaththatstartsinsandreachesagoal expandtheIRIofoneStonebraker’scoauthors,and,immediately node.TheheuristiciskeytotheperformanceofA*.Anempirically willfindasolutionpath. well-knownfactisthatashismoreaccurate,timesavingscanbe Waitingfor298HTTPrequestsbeforeobtainingthefirstanswer verybigbecauseexpansionsaresignificantlyreduced. Itcanbe isnotsensible:inthiscaseonlythreerequestsareneededtofind proventhatwhenhisadmissible,thatis,foreverysitholdsthat ,, JorgeBaier,DietrichDaroch,JuanL.Reutter,DomagojVrgocˇ Algorithm2:TheA*Algorithm q1 1 procedureA* 2 Closed←emptyset dc:creator− dc:creator 34 дO(psestnar←t)←em0ptypriorityqueueorderedbyf attribute start q0 dc:creator− q2 5 f(sstart)←h(sstart) 67 IwnhseirletsOstparetnin(cid:44)to∅Odpoen Figure1:Anautomatonfindingpapersoftheco-authorsof 8 ExtractsfromOpen M.Stonebraker. 9 ifsisagoalnodethen 10 returnsoraddstolistofsolutions 11 Expand(s) thesuccessorsof(u,q)mustbeobtainedbydereferencinganIRI 12 procedureExpand(s) (using, forexample, thefunctionNeighboursfromAlgorithm1). 13 InsertsintoClosed Thisagainmeansthatthemostcostlyoperationistheexpansionof 14 foreachainActsuchthatSucc(a,s)isdefineddo 15 foreachs(cid:48)inSucc(a,s)do newsuccessornodes,andassuchanyimplementationofA*must 16 t←s(cid:48) trytheirbesttofindawayofreducingthisbottleneck.Weexplain 17 iftisnotinSeenthen 18 AddttoSeen howtodothisinSection5.1.Butbefore,letusseehowtochoosea 19 д(t)←∞ goodheuristicfunctioninourscenario. 20 cost←д(s)+c(s,t) 2212 ifcoдs(tt)<←д(ct)osthten 4.1 AHeuristicforNavigationalQueries 23 f(t)←д(t)+h(t) HeuristicfunctionsareessentialfortheperformanceofA*.Wealso 24 parent(t)←(cid:104)s,a(cid:105) wantA*tobeoptimal,soourheuristicmustbeadmissible,thatis, 25 iftisagoalandf(t)≤f(top(Open))then 26 returntoraddttolistofsolutions itshouldnotoverestimatethecostofpathtoagoalnode. 27 ift(cid:60)Openthen InserttinOpen LetAbeanautomatonoverI±.Ourheuristicforthisproblem 28 else UpdatepriorityoftinOpen isdefinedasfollows:forallnodes(v,q)overI×Qe,whereQe are thestatesofAe,wedefineh((v,q))astheminimumdistancefrom qtoafinalstateofAe (andas∞ifnopathfromqtoafinalstate exists).ToillustrateourheuristicconsiderFigure1,corresponding h(s)doesnotoverestimatethecostofanypathfroms toagoal totheautomatonofthequeryforpapersofthecoauthorreachofM. node,thenA*findsaminimum-costpathfromsstart toagoalnode. StonebrakerintroducedinSection3.3.Thenwedefineh(u,q1)=2, Algorithm2showsapseudo-codeforA*.Thepriorityfunction h(u,q0) = 1,andh(u,q2) = 0,foreveryu ∈ I. Usuallyh(u,q)is isdefinedas f(s) =д(s)+h(s),wherehistheheuristicfunction implementedasasimplelookupinatable.GivenanautomatonA definedaboveandд(s)isthecostofthebestpathfoundsofar wedenotetheheuristicdefinedasdescribedabovebyhA. towardss.InanimplementationofA*,ahashtableisusedtostore OurheuristichAe isadmissibleforeachpropertypathe,aslong nodesthathavebeengeneratedinanexpansion(cf.Line18),and asAeistheminimumNFAfore.Toseethis,notethattheminimum parent-,д-,h-,andf-valuesarestoredaspropertiesofs. numberofactionsrequiredtoreachagoalnodefromnode(u,q) AfinalandimportantobservationisthatA*canbeeasilymod- cannotexceedthenumberofedgesofashortestpathbetweenthe ifiedtoreturnasequenceofanswers,insteadofasingleone. In automatonstateqandafinalstate.Thisisbecauseeachsuccessor thiscase, wesimplymodifythereturnstatementinLine10by (u(cid:48),q(cid:48))ofnode(u,q)mustbesuchthatthereisanedgebetweenq somethingthataddsstoalist. andq(cid:48)intheautomaton’sgraph. Using A* for computing property paths. Let us show how 4.2 TheoreticalGuarantees weuseA*tosolvePP over the Web. Thatis, givenasinputs apropertypatheandastartingIRIu,welookforallvsuchthat Awell-knownpropertyofA*isthatisfindscost-optimal(i.e.,short- (u,v)∈ e GWeb.LetAe =(Qe,I±,qe0,Fe,δe)betheautomataover est)paths.Hereweprovideanoptimalityresultofthesamesort. I±that(cid:74)is(cid:75)equivalenttoe.Recallthat(seeLemma2.3andSection Now,becausethefunctionadoc(u)isnotnecessarilyguaranteed 3)wecanreducethisproblemtosearchingforallnodes(v,q ),for toreturnalltriplescontainingu,wecannotshowoptimalityover f anIRIvandastateqf ∈Fe,overthegraphGP ⊂GWeb×Ae such theentireWeb,butratheronlyoverthegraphwehavealready thatthereisapathfrom(u,q0)to(v,q ). Inturn,thisproblem discovered,thatwedenotebyGA*. can be seen as an A* descripteion wheref (1) the start nodesstart Formally,givenanexecutionofA*,thelabelledgraphGA* ⊂ is(u,q0);(2)thesetAct ofactionscorrespondstoIRIsinI±;(3) GWeb ×Ae contains the edge (s,a,s(cid:48)) iff (1)s ∈ Closed, and(2) the paretial successor function Succ corresponds to the edges of s(cid:48)∈Succ(a,s).NoticethatthiscorrespondtotheproductofAewith GWeb×Ae,thatis,ifs =(u,q),wesay(u(cid:48),q(cid:48))∈Succ(a,s)ifboth thegraphthatcontainsalltriplespresentinanyofthedocuments (u,a,u(cid:48)) ∈ adoc(u),and(q,a,q(cid:48))isatransitioninAe,orifboth thathavebeenretrievedsofarinourcomputation.Thefollowingis (u(cid:48),a,u)∈adoc(u),and(q,a−,q(cid:48))isatransitioninAe;(4)anode anoptimalityresultbothforBFSandA*runwithourproperty-path s =(v,q)isagoalifq ∈Fe;and(5)thecostfunctionis1foreach heuristic. pairofnodesconnectedbySucc. Theorem4.1. LetG bedefinedasabovefromarunofA*that A* Thereisanimportantsubtletythatdistinguishesouralgorithm hasreturnedN answerswitheitherh = hAe orh = 0. Letπk be fromclassicalA*applications.JustasinthecaseofBFSandDFS, pathfoundtothek-thsolutionfoundbyA*,foranyk ∈{1,...,N}. EvaluatingnavigationalRDFqueriesovertheWeb ,, Furthermore,letck bethelengthofthek-thshortestpathfromsstart weareusing.Unfortunately,itwasshownrepeatedlythatthisis toanygoalstateoverG .Thenthecostofπ isc . generallynotthecasewhenworkingwithLinkedData[22,23], A* k k whichcanleadtoincompleteanswerssincemanytriplescontaining sketch. LetGAi* denoteGA* right after thei-th solution has thedereferencedIRImightnotbereturned. Thisisparticularly beenreturned.Weprovebyinductionthatthei-thsolutionfound problematicwhenworkingwithinverselinks,asitisestimated byA*wouldbethefirstsolutionfoundbyA*ifweweretomark thatpublishersincludeonlyaboutahalfofthetripleswherethe asnon-goalsallsolutionsfoundpriortothei-thsolution.Nowwe requestedIRIappearsastheobject[23]. usethefactthattheheuristicisadmissibleandthusthesolution ManyLinkedDataprovidersalsosetuppublicSPARQLend- foundisthei-thoptimaloverGAi*. (cid:3) pointswhereuserscanquerythedataset,sowecanpartiallyalle- viatethelackofLinkedDatainfrastructurebyrelyingonpublic Inpractice,asweseelateron,BFSrunsslowerthanA*withour SPARQLendpointstogetherwithLinkedData.Whenevaluating heuristic.Interestingly,wecanprovethatA*isbetterinthesense propertypathsoverLinkedData,wecombinethetwoapproaches thatBFShastoexpandatleastasmanynodesasA*. and,eachtimewedereferenceanIRI,wealsoquerytheappropri- Theorem4.2. Let(u,e)beanIRIandapropertypath.Thenevery ateendpointinordertoobtainthetriplesmentioningthesaidIRI. nodeexpandedbyA*,usedwithhAe,isalsoexpandedbyBFS. Furthermore,wequerytheendpointonlyaskingforlinksinthe appropriatedirection. Forinstance,ifourpropertypathsneeds Proof. WeobservethathAe(s)>0foreverynon-goalstates. totraversetheauthoredgeforwardsstartingfromanIRIstart, TheresultnowfollowsfromTheorem7in[31]. (cid:3) weaskthequerySELECT ?x WHERE {start author ?x}tothe appropriateendpointandsimilarlyforthebackwardsedges. 5 OPTIMISINGQUERYEXECUTION Inthissectionweprovideseveraloptimisationstothebasealgo- 5.3 MinimisingNetworkRequests rithmspresentedinSection3andSection4.Westartbydescribing Wehavearguedabovethatdereferencingisanexpensiveoperation. howparallelexpansionscanbeusedinordertoreducetheexecu- WhenA*ismodifiedtofindmultipleanswers,asweproposedabove tiontimesofoursearchalgorithms.Wethenexplainwhatarethe (i.e.,bysimplyaddingsolutionstoalist),someexpansionsmaybe currentshortcomingsoftheLinkedDatainfrastructureandpro- carriedoutsoonerthanwewouldwant,leadingtounnecessary poseawaytoovercomethemusingendpoints.Finally,wediscuss dereferencing.Indeed,ourheuristicassignsthevalue0toanynode awayoftweakingtheheuristicusedinA*inordertobothavoid unnecessarynetworkrequestsandfindanswerssooner. oftheform(u,qf)withqf afinalstate(becausethedistancetoa finalstateis0).Assumingq hasoutgoingtransitions,thestandard f A*algorithmwouldprioritisetheexpansionsofthosenodesover 5.1 ParallelExpansions anyothernodewiththesamef value,anoperationthatintuitively Allofoursearchalgorithmsfunctioninsuchawaythattheyselect wouldtakeusfartherfromthegoal. asetofnodeswhichwillserveasthestartingpointinthenext Wecanpostponetheseexpansionsbyusingaslightlydifferent iteration,andthenstartthesearchfromthesenodesonebyone. Anissuewiththisisthatarequestoverthenetwork—whichon hTheuerpisattihc,mdaexfindeisdtaanscfeoldlˆo(qw,qs.(cid:48))LbeettAwe=en(Qtw,Σo,qstoa,tFe,sδi)sbdeefianneNdFaAs. averagetakesmorethanasecond—isneededpereachdereference. Insteadofexpandingonenodeatatime, ouralgorithmscan dˆ(q,q(cid:48)) = 1+minq|δ(q,a)isdefinedd(q,q(cid:48)),ifδ(q,a)isdefinedfor benefit greatly from expanding multiple onesin parallel. More atleastsomea∈Σ,ordˆ(q,q(cid:48))=∞otherwise;andwheredisthe specifically,wemodifythealgorithmtoextractuptokofnodesthat usualgraphdistancebetweenqandq(cid:48).Thenthepathmaxheuristic couldbeatthetopoftheOpen,andexpandtheminparallel.kisnow hˆA withrespecttoAisdefinedashˆA((u,q)) = minqf∈Fdˆ(q,qf), aparamenterofthealgorithmswhichcanbeunderstoodasadegree thatis,theminimumpathmaxdistancefromqtoanyfinalstateof ofparallelism.Toobtaink-BFSandk-DFS,wemodifyAlgorithm1 A.Thisisastandardtechniqueusedbysearchalgorithmsinwhich suchthatLine16dealswithuptoktop-valuednodesfromOpen, anodemayhavetobere-expanded[25,33].Interestingly,onecan andneighboursarecomputedfortheminparalell.Similarly,k-A*is seethatthepathmaxdistancedˆcoincideswiththeusualdistance obtainedbyextractinguptoknodeswiththehighestf-valuesfrom inallstatesoftheautomataexceptforthefinalstates.Weavoid OpeninLine8ofAlgorithm2,andexpandingthemallinparallel earlyre-expansionofnodeswithfinalstatesbecausetheheuristic inline11.Inall3algorithms,afterallsuccessorsarecomputed,we forthemnowcorrespondsto1plustheminimumoftheheuristic addthemalltogethertoOpen,inthesameorderthatwewould valueoftheneighboursofthisstate. have,hadthenodesbeenexpandedsequentially.Itisthennothard toseethatoptimality(Theorem4.1)stillholdsfork-A*(andk-BFS). 6 EXPERIMENTALEVALUATION InSection6weshowthat,dependingonthedegreeofparallelism, InthissectionweevaluatehowanimplementationofA*algorithm thecomputationisspeduptenfoldinsomeinstances. performswhenexecutingpropertypathqueriesoverarealWeb environment.Toestablishabaseline,wealsocomparetheperfor- 5.2 UsingTheEndpointInfrastructure manceofA*withBFSandDFS,theonlyotheralgorithmsproposed Theevaluationalgorithmspresentedinprevioussectionsrelyon sofarintheliterature.Webeginbypresentingourexperimental thedereferencingmechanismofLinkedDataandworkunderthe setup,thequeries,andthencomparehowA*faresagainstBFSand assumption that when a specific IRI is dereferenced, we obtain DFS,showingthatA*outperformstheothertwoalgorithmsona allthetriplesmentioningsuchanIRIwhichresideontheserver regularbasis.Nextweinvestigatetheimpactofparallelrequestson ,, JorgeBaier,DietrichDaroch,JuanL.Reutter,DomagojVrgocˇ q_Coauthor q_NATO q_Bacon 250 200 1000 A* A* DFS DFS 200 BFS 800 BFS 150 Answers110500 Answers100 Answers 460000 50 A* 50 200 DFS BFS 00 100 200 300 400 500 00 50 100 150 200 250 00 100 200 300 400 500 600 700 800 Requests Requests Requests Figure2:A*minimisestherequestsneededtoobtainanswersofq Coauthor,q NATOandq Bacon ouralgorithms.Aswesee,addingparallelismreducestotalruntime timeofcomputingourqueriesisessentiallygivenbythenumber forallthreealgorithms,withA*remainingthemostconsistent.In- ofrequestsperformedbythealgorithm. Thisalsorulesoutthe terestingly,allalgorithmstendtolookmorealikeasmoreandmore dependenceonparameterswhichwehavenocontrolover,suchas parallelrequestsareallowed.Finally,wediscusssomereal-world theInternettraffic,ortheavailabilityofserversprovidinguswith examplesofthepathsretrievedbyouralgorithms. data.Thus,byfocusingonrequestsweignorelatencydifferences thatmaypersistevenaftertakingseveralrunsofthesamequery. 6.1 ExperimentalSetup Allexperimentswererunwithoutanaccesstothedatalocally, relyingsolelyontheWebinfrastructuretoretrievethedataneeded Weselected11navigationalqueriesthatareinspiredbyprevious ateachstepofthecomputation. Theexperimentswererunona benchmarks(seee.g. [14,32]). Thesequeriesarerepresentative ManjaroLinuxmachinewithai5-4670quad-coreprocessorand ofseveraldifferentfeaturesofpropertypaths,rangingfromeasy, 4GBofRAM.Toavoidfloodingserverswithrequestsweonlyran fixed-depthonestoqueriesusingmultiplestaroperationswhich our search until we either found 1000 answers, retrieved more aremuchhardertoevaluate. Ourqueriestargetoneormoreof than100000triplesfromtheserver,orreacheda10-minutetime the following Linked Data domains: YAGO, a huge knowledge limit.Eachexperimentwasran10times,andsincetheresultswere baseextractingdatafromWikipediaandvariousothersources[29]; largelyequivalent,wereportthenumbersofthelatestexecution. DBPedia[6],oneofthecentraldatasetsoftheLinkedDatainitiative Thesourcecodeforrunningtheexperimentsisavailableat[1]. thatalsooriginatesfromWikipedia;LinkedMovieDatabase,the bestknownsemanticdatabaseformovieinformation[27];andthe LinkedDatadomainofDBLP.Ourimplementationswillalwaysuse 6.2 HeuristicSearchAgainstBFSandDFS theoptimisationtechniquespresentedinSection5.2andSection5.3, ThegeneralconclusionofourexperimentsisthatA*bothrequires whileweassessthebenefitsofparallelism(Section5.1)separately. fewerrequestsandisfasterthanBFSandDFS.Beforereporting Asanexample,beloware3ofthe11queriesweuse.Forcomplete ourresultsinfull,letusexaminetherunsofqueriesq Coauthor, detailsofthequeries,resultsofallourruns,andimplementation q Baconandq NATOpresentedabove.Figure2showsthenum- ofouralgorithms,pleaserefertoouronlineappendix[1]. berofrequestsneededtocomputeafractionofthetotalanswers q Coauthor:Thepropertypath(coauthor−·coauthor)∗inthe availableforthesequeries.Inparticular,weseethatbothA*and DBLPdataset,startingfromtheIRIofM.Stonebraker.Thisproperty DFSarethebestchoiceforthequeryq Coauthor,becausethey pathlooksfortheIRIofallauthorsthatarerelatedtoM.Stonebraker producemoreanswersusingfewerHTTPrequests(eventhough onDBLP,byaco-authorshippathofarbitrarylength. thequalityoftheanswersproducedbyA*isarguablybetter–see q NATO:Apropertypaththatselectsallplacesthathostanentity below).Ontheotherhand,BFSrequiresaround300expansionsto dealingwithaNATOmemberstate,accordingtoYAGO. startproducinganswers,whichresultsinamuchslowerthrough- q Bacon:ApropertypaththatlooksfortheIRIsofactorshaving putaltogether.Next,bothA*andBFSarethebestchoiceforthe afiniteBacon-number1,andthatnavigatesusinglinksand/orIRIs queryq NATO.Thisisagainexpected,becausethisqueryrequires presentinanyofYAGO,DBPediaorLinkedMovieDatabase.This lessnavigationandmoreshallowexploration.Itisinterestingto isaninterestingquery,ascurrentlytheonlywaytoevaluateitis seethatinthiscaseA*reallysimulatestheoptimalBFSsearch.On bymeansofourLinkedDataapproach(see[8]foradiscussion). theotherhand,DFSwastesalotoftimeexploringlongpathsand Toassessouralgorithmweusetwoindicators:thenumberof obtaining“deep”answers. Finally,inthecaseof q Bacon,A*is HTTPrequestsmadetocomputeafractionoftheanswers,andthe showntostrictlybeatbothBFSandDFS.InthecaseofBFS,thisis timeneededtocomputethem.Inbothcaseswewanttominimise mostlybecauseA*’sheuristicallowsafinercontrolonwhichlinks thenumberofrequests,ortheamountoftimeneededtoproduce toexplore,andthemaindetractorforDFSisthatitstartsexploring theanswers.Wenotethatthenumberofrequestsisamuchbetter initiallinkswhichoftenrequiremanyrequestsbeforeencountering indicatoronhowthealgorithmworks: becauseHTTPrequests asolution. takeconsiderablymoretimethanalltheotheroperations,thetotal Fullresults.Forreasonsofspace,wecannotreporttheremaining 8experimentswiththesamedetail,soinsteadwedothefollowing. 1AnactresshasBaconnumber1ifsheactedinthesamemovieasKevinBacon,and BaconnumbernifsheactedwithsomeonewithBaconnumbern−1. Foreachquery,weanalysethecompletebehaviouroftheanswers EvaluatingnavigationalRDFqueriesovertheWeb ,, vs.requestandanswersvs.timecurves.Wesaythatanalgorithm q_Coauthor dominatestheothersifitissuchthatitreturnsatleastasmany 600 A* answersasanyotherfor80%oftherangeofrequests(ortime) 500 A*-20p forwhichweevaluatethem.Forexample,inFigure2weseethat BFS A∗ dominatestheotheralgorithmsforqueriesq Coauthorand 400 BFS-20p DFS q Bacon,whileinthecaseofq NATObothA∗andBFSdominate. ers DFS-20p Thetotalnumberoftimeseachalgorithmdominates(outof11 nsw300 queries)isshownbelow,forboththeanswersvs.requestandan- A 200 swersvs.timecurves(fulldetailsarefoundinouronlineappendix [1]).Onceagain,A∗remainsthemostconsistentoption. 100 Measure A* BFS DFS 0 0 100 200 300 400 500 600 700 Requestsv/sAnswers 11 3 4 Time (s) Timev/sAnswers 11 3 4 Figure 3: Answers over time on q Coauthor. The parallel versions(inred)aremuchfasterthanthenon-parallelones. 6.3 TheEffectofParallelRequests Next,wetesttheeffectofallowingparallelrequestsinouralgo- dblpAuthor:Michael_Stonebraker ˆdc:creator dblpPub:conf/acm/MuthuswamyKZSPJ85 rithms,aspresentedinSection5.1.Thisoptimisationgoesalong dc:creator dblpAuthor:Matthias_Jarke way into tackling the slow latency of Web requests, one of the rdfs:label "Matthias Jarke" mainproblemsofqueryingovertheHTTPprotocol.Indeed,HTTP dblpAuthor:Michael_Stonebraker ˆdc:creator dblpPub:conf/dbvis/AikenCLSSW95 requestsaresuchanimportantbottleneckinouralgorithmthat dc:creator dblpAuthor:Mybrid_Spalding allowingparallelrequestsessentiallymeansparallelisingtheentire rdfs:label "Mybrid Spalding" algorithm. Issuingparallelrequestsalsosoftenuphighlatency dblpAuthor:Michael_Stonebraker ˆdc:creator dblpPub:conf/vldb/StonebrakerABCCFLLMOORTZ05 pocketsortemporarynetworkproblems. Moreover,wecanalso dc:creator dblpAuthors:Adam_Batkin expectthealgorithmstobeacceleratedevenfurtherwhenthenum- rdfs:label "Adam Batkin" berofallowedrequestsisincreased,simplybecausemorerequests Figure4:Pathsforthe10th,50th,and200thanswersfound essentiallymeansmoreparallelinstancesofouralgorithmandeven byA*onq Coauthor. moresofteningpower.Theotherinterestingobservationisthat,as weallowmoreparallelrequestsinouralgorithms,allofA*,BFS Maxparallelcalls A* BFS DFS andDFSstarttolookalike,andinfactitiseasytoseethatallthree 1 11 3 4 algorithmsareessentiallyequivalentinthelimitwhereweissue 10 7 3 3 aninfinitenumberofrequestsatthesametime. 20 7 3 3 Toempiricallytesttheseobservations,weissuednewliveruns 40 6 4 5 ofthe11queriesdescribedintheprevioussections,butthistime usingparallelversionsofA*,BFSandDFS.Toseetheimpactonthe 6.4 Returningpaths numberofparallelthreadsallowed,wereportexperimentswith Sofarwehaveonlytalkedaboutfindingpairsofnodesthatform amaximumof10, 20, and40parallelthreads. Beforereporting theanswerofapropertypathquery,asdictatedbytheSPARQL thefullresults,letusstartwithcomparingtheresultsofthealgo- standard. However, our search algorithms can also be used to rithmwithnoparallelismagainsttheonewith20parallelthreads. computetheentirepathbetweentwonodes,andinfactwecanget Figure3showsthetimeneededtocomputetheanswersofquery themataverymarginalcost:wealreadyneedtokeeptrackofall q Coauthor,forallthreealgorithmsontheirnon-parallelversion theexpansions,sowecanproducepathssimplybyreturningthe andontheirparallelversionwithamaximumof20threads. As IRIscorrespondingtoeachoftherequestsmadebyouralgorithm. wesee,thetimeneededtocomputethesameamountofanswers Pathscanbeusedasajustificationfortheanswers,ortocon- decreasesbyalmosttenfoldinallthreecases.Moreover,theparallel tinueextractingmoreinformationafterwards.Inthecaseofqueries versionofBFSnowbehavesalmostasA*andDFSwhencomputing overLinkedData,wecanevenusethepathsofqueriestogather thefirst300answers(itthenreachesastalematebecauseallshallow informationaboutthestructureoftheWebitself. Fortheserea- answershavealreadybeendiscovered). sonsreturningpathsisaverysought-afterfunctionalityofgraph Full results. As expected, the time taken to compute answers querylanguages,andispresentforexampleinthepopularNeo4j decreasesdrastically(thebehaviouristhesameasforq Coauthor). engine[34]. Unfortunately,alanguagecapableofreturning(all) Perhapsmoreinterestingly,wefocusonhowalgorithmschange paths,oreven(all)simplepaths,isboundtobeverycomplicatedto whenmoreparallelthreadsareallowed. Inordertodothis,we evaluate[5,28],andthisisthereasonwhytheSPARQLstandard repeatthesamereportsmadeintheprevioussubsection,butthis doesnotincludesuchafunctionality.Inourcasewehaveanatural timefordifferentlevelsofparallelism.Moreprecisely,foreachof workaroundforthisissue,asoursearchfocusesonshortestpaths, our11queriesand4differentthreadcountswereportwhichofA*, whichareknowntobeeasiertoevaluatethansimplepaths. BFSorDFSdominatesinthetimeneededtocomputetheanswers. Asanexampleoftheusefulnessofpaths,itwasbyanalysing Asweseeasmoreparallelismisallowedintothealgorithms,both pathsthatweinferredthatA*normallyproducesbetteranswers BFSandDFSstartbecomingmorecompetitivecomparedtoA*. thanDFS(becausethepathsareshorter).Asanillustration,Figure4 ,, JorgeBaier,DietrichDaroch,JuanL.Reutter,DomagojVrgocˇ presentspathswitnessingtheanswers10,50,and200ofarunofthe REFERENCES queryq CoauthorwithA*.Fromthequeryitselfallthatwecan [1] 2016.http://dvrgoc.ing.puc.cl/navigation/.(2016). sayisthatthesethreeresearchersareconnectedtoM.Stonebraker [2] S.Abiteboul,P.Buneman,andD.Suciu.1999.DataontheWeb:FromRelationsto byacoauthorshippathofarbitrarylength.However,bylooking SemistructuredDataandXML.MorganKauffman. [3] SergeAbiteboulandVictorVianu.1997.QueriesandComputationontheWeb. atthepathswenowknowthattheyaredirectcoauthors.Onthe InICDT’97.262–275. otherhand,thelengthofpathsretrievedbyDFSaregoingtobe [4] RenzoAnglesandClaudioGutie´rrez.2008.Surveyofgraphdatabasemodels. ACMComput.Surv.40,1(2008). muchhigher.Foronerunofq CoauthorwithDFSthelengthsof [5] MarceloArenas,Sebastia´nConca,andJorgePe´rez.2012. Countingbeyond theanswers10,50and200wererespectively14,74,312. aYottabyte,orhowSPARQL1.1propertypathswillpreventadoptionofthe standard.InWWW. [6] So¨renAuer,ChristianBizer,GeorgiKobilarov,JensLehmann,RichardCyganiak, 7 CONCLUSIONSANDFUTUREWORK andZacharyIves.2007.Dbpedia:Anucleusforawebofopendata.Springer. [7] PabloBarcelo´Baeza.2013.Queryinggraphdatabases.InPODS2013.175–188. Thispaperpresentsthefirstfundamentalstudyoftheproblemof [8] JorgeBaier,DietrichDaroch,JuanL.Reutter,andDomagojVrgocˇ.2016.Property computingpropertypathsovertheWeb.Weshowedhowtocast PathsoverLinkedData:CanItBeDoneandHowToStart?.InCOLD@ISWC 2016. queryansweringasanAIsearchproblem,andprovidedanoptimal [9] TimBerners-Lee,ChristianBizer,andTomHeath.2009.Linkeddata-thestory algorithmbasedontheclassicalA*algorithm.Weprovidestrong sofar.Int.JournalonSemanticWebandInformationSystems5,3(2009),1–22. theoreticalandpracticalevidencethatA*isabetteralternative [10] Blazegraph2016.Blazegraph.https://www.blazegraph.com/.(2016). [11] D.Calvanese,G.DeGiacomo,M.Lenzerini,andM.Y.Vardi.2000.Containment thanbothBFSandDFSinthecontextofLinkedData,andthiscan ofconjunctiveregularpathquerieswithinverse.InKR2000.176–185. bespedupevenfurtherbyallowingparallelexecutionthreads. [12] I.Cruz,A.O.Mendelzon,andP.Wood.1987.Agraphicalquerylanguagesup- Intermsoffuturework,weidentifythreemainchallengeswe portingrecursion.InSIGMOD’87.323–330. [13] ValeriaFionda,GiuseppePirro`,andClaudioGutierrez.2015.NautiLOD:AFormal planontackling. LanguagefortheWebofDataGraph.TWEB9,1(2015),5:1–5:43. [14] AndreyGubichev,SrikantaJ.Bedathur,andStephanSeufert.2013. Sparqling Using triple pattern fragments. As noted in Section 5, there kleene:fastpropertypathsinRDF-3X.InGRADES2013.14. aresomeissueswiththeLinkedDatainfrastructure;mostnotably, [15] SteveHarrisandAndySeaborne.2013.SPARQL1.1querylanguage.W3C(2013). [16] PeterE.Hart,NilsNilsson,andB.Raphael.1968.Aformalbasisfortheheuristic itdoesnotprovidealltheinformationonewouldexpectwhen determinationofminimalcostpaths.IEEETransactionsonSystemsScienceand dereferencingIRIs. Whileitispossibletoalleviatethisissueby Cybernetics4,2(1968). [17] OlafHartig.2012.SPARQLforaWebofLinkedData:Semanticsandcomputabil- usingendpoints,sincetheiruptimecanbeerratic,itwasrecently ity.InTheSemanticWeb:ResearchandApplications.Springer,8–23. suggestedthatamorelightweightinfrastructureoftriplepattern [18] OlafHartig,ChristianBizer,andJohann-ChristophFreytag.2009. Executing fragments [36] would be more appropriate for the task. In the SPARQLqueriesovertheweboflinkeddata.Springer. [19] OlafHartigandM.TamerO¨zsu.2016.WalkingWithoutaMap:Ranking-Based futureweplantotesthowusingtriplepatternfragmentsaffects TraversalforQueryingLinkedData.InISWC2016.305–324. theperformanceandaccuracyofouralgorithmswhencompared [20] OlafHartigandJorgePe´rez.2015. LDQL:AQueryLanguagefortheWebof tothestandardendpointinfrastructure. LinkedData.InISWC2015.Springer,73–91. [21] OlafHartigandGiuseppePirro`.2015.AContext-BasedSemanticsforSPARQL AnsweringNautiLODandLDQLquerieswithA*. NautiLOD PropertyPathsOvertheWeb.InESWC2015.71–87. [22] AidanHoganandClaudioGutierrez.2014.PathstowardstheSustainableCon- [13]isatraversal-basedlanguageproposedasanoptiontoSPARLQ sumptionofSemanticDataontheWeb.InAMW2014. whenqueryingLinkedData,inwhichonehasmorefinercontrol [23] AidanHogan,Ju¨rgenUmbrich,AndreasHarth,RichardCyganiak,AxelPolleres, onhowistheWebgoingtobetraversed.Inthesamespirit,LDQL andStefanDecker.2012.AnempiricalsurveyofLinkedDataconformance.J. WebSem.14(2012),14–44. [20]isanotherlanguageaimedatcontrollinghowdataistobe [24] Jena2015.ApacheJenaManual.http://jena.apache.org.(2015). retrieved,albeitmuchlesspowerfulthanNautiLOD.Theinteresting [25] RichardE.Korf.1993.Linear-SpaceBest-FirstSearch.ArtificialIntelligence62,1 observationisthatwecanalsocastthequeryevaluationproblem (1993),41–78.DOI:http://dx.doi.org/10.1016/0004-3702(93)90045-D [26] EgorVKostylev,JuanLReutter,MiguelRomero,andDomagojVrgocˇ.2015. fortheselanguagesasasearchproblem,andthusA*shouldalso SPARQLwithPropertyPaths.InISWC2015.Springer,3–18. provideoptimalqueryansweringalgorithms.Infact,thealgorithm [27] LMDB.Linkedmoviedatabase.http://linkedmdb.org/.(????). [28] KatjaLosemannandWimMartens.2012. Thecomplexityofevaluatingpath proposedin[13]isessentiallywhatwedefinehereask-DFS,soone expressionsinSPARQL.InPODS2012.101–112. cannaturallysuspectthatA*shouldprovideabetterbehaviour. [29] FarzanehMahdisoltani,JoannaBiega,andFabianSuchanek.2014. Yago3:A knowledgebasefrommultilingualwikipedias.InCIDR. A*inlocalcomputations.Althoughwebasedourinvestigation [30] FrankManola,EricMiller,andBrianMcBride.2014. RDF1.1primer. https: inthecontextofLinkedData,thereissomeevidencethatourap- //www.w3.org/TR/rdf11-primer/,W3C(2014). [31] JudeaPearl.1984.Heuristics:IntelligentSearchStrategiesforComputerProblem proachmighthavepotentialintheclassicalsettingwheredatais Solving.Addison-WesleyLongmanPublishingCo.,Inc.,Boston,MA,USA. availablelocally. Themainreasonisthefactthatthecurrently [32] JuanL.Reutter,Adria´nSoto,andDomagojVrgocˇ.2015.RecursioninSPARQL. availablepropertypathevaluationalgorithmsdemandalotofre- InISWC2015. [33] StuartJ.Russell.1992.EfficientMemory-BoundedSearchMethods.InECAI’92. sources,especiallywhendealingwithpropertypathsthatusethe 1–5. Kleenestaroperator,andcurrentsystemscannoteasilycopewith [34] TheNeo4jTeam.2016.TheNeo4jManualv3.0.http://neo4j.com.(2016). [35] Ju¨rgenUmbrich,AidanHogan,AxelPolleres,andStefanDecker.2015. Link theserequirements[8].Ontheotherhand,wehaveseenthatthe traversalqueryingforadiverseWebofData.SemanticWeb6,6(2015),585–624. memoryusageofanA*-basedalgorithmisdirectlydependanton [36] RubenVerborgh,OlafHartig,BenDeMeester,GeraldHaesendonck,LaurensDe Vocht,MielVanderSande,RichardCyganiak,PieterColpaert,ErikMannens,and theamountofanswersthatneedtobecomputed,andeachanswer RikVandeWalle.2014.QueryingDatasetsontheWebwithHighAvailability. requiresanalmostnegligibleamountofadditionalmemory.This InISWC2014.180–196. suggeststhat,inthosecaseswhenwedonotneedalltheanswers, [37] Virtuoso2015.OpenLinkVirtuoso.http://virtuoso.openlinksw.com/.(2015). [38] NikolayYakovets,ParkeGodfrey,andJarekGryz.2016. QueryPlanningfor anapproachbasedonA*mightbeabetteroption. EvaluatingSPARQLPropertyPaths.InSIGMOD2016.1875–1889.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.