16 Pages·2010·0.15 MB·English
SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples*

Aidan Hogan1, Jeff Z. Pan2, Axel Polleres1, and Stefan Decker1

1 Digital Enterprise Research Institute, National University of Ireland, Galway
[email protected]
2 Dpt. of Computing Science, University of Aberdeen
[email protected]

Abstract. In this paper, we discuss optimisations of rule-based materialisation approaches for reasoning over large static RDF datasets. We generalise and re-formalise what we call the "partial-indexing" approach to scalable rule-based materialisation: the approach is based on a separation of terminological data, which has been shown in previous and related works to enable highly scalable and distributable reasoning for specific rulesets; in so doing, we provide some completeness propositions with respect to semi-naïve evaluation. We then show how related work on template rules – T-Box-specific dynamic rulesets created by binding the terminological patterns in the static ruleset – can be incorporated and optimised for the partial-indexing approach. We evaluate our methods using LUBM(10) for RDFS, pD* (OWL Horst) and OWL 2 RL, and thereafter demonstrate pragmatic distributed reasoning over 1.12 billion Linked Data statements for a subset of OWL 2 RL/RDF rules we argue to be suitable for Web reasoning.

1 Introduction

More and more structured data is being published on the Web in conformance with the Resource Description Framework (RDF) for disseminating machine-readable information, forming what is often referred to as the "Web of Data". This data is no longer purely academic: in particular, the Linked Data community – by promoting pragmatic best-practices and applications – has overseen RDF exports from, for example, corporate bodies (e.g., BBC, New York Times, Freebase), community driven efforts (e.g., Wikipedia, GeoNames), the biomedical domain (e.g., DrugBank, Linked Clinical Trials) and governmental bodies (e.g., data.gov, data.gov.uk). At a conservative estimate, there now exist tens of billions of RDF statements on the Web.

Sitting atop RDF are the RDF Schema (RDFS) and Web Ontology Language (OWL) standards. Primarily, RDFS and OWL allow for defining the relationships between the classes and properties used to organise and describe entities, providing a declarative and extensible domain of discourse through use of rich formal semantics. One could thereafter view the Web of Data as a massive, heterogeneous, collaboratively edited knowledge-base amenable for reasoning: however, the prospect of applying reasoning over (even subsets of) the Web of Data raises unique challenges, the most obvious of which are the need for scale, and tolerance to noisy, conflicting and impudent data [6].

* The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2), by the EU MOST project, the EPSRC LITRO project, and by an IRCSET Scholarship.

Inspired by requirements for the Semantic Web Search Engine (SWSE) project [9] – which aims to offer search and browsing over Linked Data – in previous work we investigated pragmatic and scalable reasoning for Web data through work on the Scalable Authoritative OWL Reasoner (SAOR) [7,8]; we discussed the formulation and suitability of a set of rules inspired by pD* [16] for materialisation over Web data.
We gave particular focus to scalability and Web tolerance showing that by abandoning completeness, materialisation over a diverse Web dataset – in the order of a billion statements – is entirely feasible wrt. a significant fragment of OWL semantics. From the scalability perspective, we introduced a partial-indexing approach based on a separation of terminological data from assertional data in our rule execution model: terminological data – the most frequently accessed segment of the knowledge base for reasoning which in our scenario represents only a small fraction of the overall data [8] – is stored and indexed in-memory for fast access, whereas the bulk of (assertional) data is processed by file-scans. Related approaches have since appeared in the literature which use a separation of terminological data for applying distributed RDFS and pD* reasoning over datasets containing hundreds of millions, billions and hundreds of billions of statements [19,18,17]. However, each of these approaches has discussed completeness and implementation/optimisation based on the specific ruleset at hand.

In this paper, we reformulate the partial-indexing approach – generalising to arbitrary rulesets – and discuss when it is (i) complete with respect to standard rule closure; and (ii) appropriate and scalable. We then introduce generic optimisations based on "template rules" – where terminological data is bound by the rules prior to accessing the A-Box – and provide some initial evaluation over a small LUBM dataset for RDFS, pD*, and OWL 2 RL/RDF. Thereafter, we look to apply our optimisations for scalable and distributed Linked Data reasoning, initially reintroducing our authoritative reasoning algorithm which incorporates provenance, detailing distribution of our approach, and then providing evaluation for reasoning over 1.12b Web triples.

2 Preliminaries

Before we continue, we briefly introduce some concepts prevalent throughout the paper. We use notation and nomenclature as is popular in the literature (cf. [4,8]). Herein, we denote infinite sets by $\mathbf{S}$ and corresponding finite subsets by $S$.
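
2.1 RDF and Rules

RDF Constant. Given the set of URI references $\mathbf{U}$, the set of blank nodes $\mathbf{B}$, and the set of literals $\mathbf{L}$, the set of RDF constants is denoted by $\mathbf{C} := \mathbf{U} \cup \mathbf{B} \cup \mathbf{L}$.

RDF Triple. A triple $t := (s, p, o) \in (\mathbf{U} \cup \mathbf{B}) \times \mathbf{U} \times \mathbf{C}$ is called an RDF triple, where $s$ is called subject, $p$ predicate, and $o$ object. A triple $t := (s, p, o) \in \mathbf{G}$, $\mathbf{G} := \mathbf{C} \times \mathbf{C} \times \mathbf{C}$ is called a generalised triple, which allows any RDF constant in any triple position: henceforth, we assume generalised triples [2]. We call a finite set of triples $\mathcal{G} \subset \mathbf{G}$ a graph. (For brevity, we sometimes use r: for the RDFS namespace, o: for the OWL namespace, and f: for the well-known FOAF namespace; we use 'a' as a shortcut for rdf:type.)

Triple Pattern, Basic Graph Pattern. A triple pattern is a generalised triple where variables from the set $\mathbf{V}$ are allowed; i.e.: $t^v := (s^v, p^v, o^v) \in \mathbf{G}^{\mathbf{V}}$, $\mathbf{G}^{\mathbf{V}} := (\mathbf{C} \cup \mathbf{V}) \times (\mathbf{C} \cup \mathbf{V}) \times (\mathbf{C} \cup \mathbf{V})$. We call a set (to be read as conjunction) of triple patterns $\mathcal{G}^{\mathbf{V}} \subset \mathbf{G}^{\mathbf{V}}$ a basic graph pattern. We denote the set of variables in graph pattern $\mathcal{G}^{\mathbf{V}}$ by $V(\mathcal{G}^{\mathbf{V}})$.

Variable Bindings. Let $\mathbf{M}$ be the set of endomorphic variable binding mappings $\mathbf{V} \cup \mathbf{C} \to \mathbf{V} \cup \mathbf{C}$ which map every constant $c \in \mathbf{C}$ to itself and every variable $v \in \mathbf{V}$ to an element of the set $\mathbf{C} \cup \mathbf{V}$. A triple $t$ is a binding of a triple pattern $t^v := (s^v, p^v, o^v)$ iff there exists $\mu \in \mathbf{M}$ such that $t = \mu(t^v) = (\mu(s^v), \mu(p^v), \mu(o^v))$. A graph $\mathcal{G}$ is a binding of a graph pattern $\mathcal{G}^{\mathbf{V}}$ iff there exists a mapping $\mu \in \mathbf{M}$ such that $\bigcup_{t^v \in \mathcal{G}^{\mathbf{V}}} \mu(t^v) = \mathcal{G}$; we use the shorthand $\mu(\mathcal{G}^{\mathbf{V}}) = \mathcal{G}$. We use $\mathbf{M}(\mathcal{G}^{\mathbf{V}}, \mathcal{G}) := \{\mu \mid \mu(\mathcal{G}^{\mathbf{V}}) \subseteq \mathcal{G},\ \mu(v) = v \text{ if } v \notin V(\mathcal{G}^{\mathbf{V}})\}$ to denote the set of variable binding mappings for graph pattern $\mathcal{G}^{\mathbf{V}}$ in graph $\mathcal{G}$ which map variables outside $\mathcal{G}^{\mathbf{V}}$ to themselves.

Inference Rule. We define an inference rule $r$ as the pair $(\mathit{Ante}_r, \mathit{Con}_r)$, where the antecedent (or body) $\mathit{Ante}_r \subset \mathbf{G}^{\mathbf{V}}$ and the consequent (or head) $\mathit{Con}_r \subset \mathbf{G}^{\mathbf{V}}$ are basic graph patterns such that $V(\mathit{Con}_r) \subseteq V(\mathit{Ante}_r)$ (range restricted) – rules with empty antecedents model axiomatic triples. We write inference rules as $\mathit{Ante}_r \Rightarrow \mathit{Con}_r$.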
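
The notions of triple pattern and variable binding above can be illustrated with a minimal Python sketch. The representation is an assumption made purely for illustration (terms are strings, variables carry a leading `?`, and bindings map variables to constants only), and the helper names `match_pattern`, `bindings` and `substitute` are hypothetical rather than taken from the paper.

```python
from typing import Dict, Iterator, Optional, Tuple

Term = str                       # constant, or variable written with a leading '?'
Triple = Tuple[Term, Term, Term]
Binding = Dict[Term, Term]       # variable -> constant (a simplification of mu)


def is_var(term: Term) -> bool:
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple,
                  binding: Optional[Binding] = None) -> Optional[Binding]:
    """Extend `binding` so that the pattern maps onto the triple, else None."""
    result = dict(binding or {})
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A variable must be bound consistently across all positions.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            # Constants are mapped to themselves, so they must be equal.
            return None
    return result


def bindings(graph_pattern, graph) -> Iterator[Binding]:
    """Yield every binding that maps all patterns into triples of `graph`."""
    def extend(remaining, binding):
        if not remaining:
            yield binding
            return
        head, *tail = remaining
        for triple in graph:
            extended = match_pattern(head, triple, binding)
            if extended is not None:
                yield from extend(tail, extended)
    yield from extend(list(graph_pattern), {})


def substitute(pattern: Triple, binding: Binding) -> Triple:
    """Apply a binding to a triple pattern."""
    return tuple(binding.get(term, term) for term in pattern)


graph = {("f:Person", "r:subClassOf", "f:Agent"), ("ex:me", "a", "f:Person")}
antecedent = [("?c1", "r:subClassOf", "?c2"), ("?x", "a", "?c1")]
found = list(bindings(antecedent, graph))
inferred = substitute(("?x", "a", "?c2"), found[0])
```

Here the antecedent has a single binding in the example graph, and substituting it into the consequent pattern (?x, a, ?c2) gives the triple stating that ex:me is an f:Agent. The nested scan is deliberately naive; it is meant only to make the definitions concrete.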

Rule Application and Standard Closure. A rule application is the immediate consequences $\mathfrak{T}_r(\mathcal{G}) := \bigcup_{\mu \in \mathbf{M}(\mathit{Ante}_r, \mathcal{G})} (\mu(\mathit{Con}_r) \setminus \mu(\mathit{Ante}_r))$ of a rule $r$ on a graph $\mathcal{G}$; accordingly, for a ruleset $R$, $\mathfrak{T}_R(\mathcal{G}) := \bigcup_{r \in R} \mathfrak{T}_r(\mathcal{G})$. Now, let $\mathcal{G}_{i+1} := \mathcal{G}_i \cup \mathfrak{T}_R(\mathcal{G}_i)$ and $\mathcal{G}_0 := \mathcal{G}$; the exhaustive application of the $\mathfrak{T}_R$ operator on a graph $\mathcal{G}$ is then the least fixpoint (the smallest value for $n$) such that $\mathcal{G}_n = \mathfrak{T}_R(\mathcal{G}_n)$. We call $\mathcal{G}_n$ the closure of $\mathcal{G}$ wrt. ruleset $R$, denoted as $Cl_R(\mathcal{G})$, or succinctly $\overline{\mathcal{G}}$ where the ruleset is obvious.

The above closure takes a graph and a ruleset and recursively applies the rules over the union of the original graph and the inferences until a fixpoint. Usually, this would consist of indexing all input and inferred triples; however, the cost of indexing and performing query-processing over large graphs can become prohibitively expensive. Thus, in [7] we originally proposed an alternate method based on a separation of terminological data, which we now generalise and discuss.

3 Partial Indexing Approach: Separating Terminological Data

In the field of Logic Programming, the notion of a 'linear program' refers loosely to a ruleset where only one pattern in each rule is recursive [12]. Our partial indexing approach is optimised for linear rules, where the non-recursive segment of the data is identified, separated and prepared, and thereafter each recursive pattern can then be bound via a triple-by-triple stream: we cater for non-linear rules, but as the number of recursive rules, the amount of recursion, and the amount of recursive data involved increases, our approach performs worse than the "full-indexing" approach.

Specifically regarding RDFS and OWL, the terminological segment of the data presents itself as relatively small and 'non-recursive' (or at least, mostly only recursive within itself), which can be leveraged for partial indexing. Herein, we define our notion of RDF(S)/OWL terminological data.
(To generalise the following, the reader can consider terminological data as the RDFS/OWL archetype for any non-recursive and sufficiently small element of the data commonly required during rule application.)

Meta-class. We consider a meta-class as a class specifically of classes or properties; i.e., the members of a meta-class are themselves either classes or properties. Herein, we restrict our notion of meta-classes to the set defined in RDF(S) and OWL specifications, where examples include rdf:Property, rdfs:Class, owl:Restriction, owl:DatatypeProperty, owl:TransitiveProperty, etc.; rdfs:Resource, rdfs:Literal, e.g., are not meta-classes.

Meta-property. A meta-property is one which has a meta-class as its domain; again, we restrict our notion of meta-properties to the set defined in RDF(S) and OWL specifications, where examples include rdfs:domain, rdfs:subClassOf, owl:hasKey, owl:inverseOf, owl:oneOf, owl:onProperty, owl:unionOf, etc.; rdf:type, owl:sameAs, rdfs:label, e.g., do not have a meta-class as domain.

Terminological Triple. We define the set of terminological triples $\mathbf{T} \subset \mathbf{G}$ as the union of (i) triples with rdf:type as predicate and a meta-class as object; (ii) triples with a meta-property as predicate; (iii) triples forming a valid RDF list whose head is the object of a meta-property (e.g., a list used for owl:unionOf, etc.).

Terminological/Assertional Pattern. We refer to a terminological-triple/-graph pattern as one whose instance can only be a terminological triple or, resp., a set thereof. An assertional pattern is any pattern which is not terminological.

Given the above notions of terminological data/patterns, we now define a T-split inference rule where part of the rule body is strictly matched by terminological data.

Definition 1 (T-split inference rule). Given a rule $r := (\mathit{Ante}_r, \mathit{Con}_r)$, we define a T-split rule $r^\tau$ as the triple $(\mathit{Ante}^{\mathbf{T}}_{r^\tau}, \mathit{Ante}^{\mathbf{G}}_{r^\tau}, \mathit{Con})$ where $\mathit{Ante}^{\mathbf{T}}_{r^\tau}$ is the set of terminological patterns in $\mathit{Ante}_r$, and $\mathit{Ante}^{\mathbf{G}}_{r^\tau} := \mathit{Ante}_r \setminus \mathit{Ante}^{\mathbf{T}}_{r^\tau}$. We denote the set of all T-split rules by $\mathbf{R}^\tau$, and the mapping of a rule to its T-split version as $\tau : \mathbf{R} \to \mathbf{R}^\tau;\ r \mapsto r^\tau$. We additionally give the convenient sets

$\mathbf{R}^{\emptyset} := \{r^\tau \mid \mathit{Ante}^{\mathbf{T}}_{r^\tau} = \emptyset,\ \mathit{Ante}^{\mathbf{G}}_{r^\tau} = \emptyset\}$,
$\mathbf{R}^{\mathbf{T}\emptyset} := \{r^\tau \mid \mathit{Ante}^{\mathbf{T}}_{r^\tau} \neq \emptyset,\ \mathit{Ante}^{\mathbf{G}}_{r^\tau} = \emptyset\}$,
$\mathbf{R}^{\emptyset\mathbf{G}} := \{r^\tau \mid \mathit{Ante}^{\mathbf{T}}_{r^\tau} = \emptyset,\ \mathit{Ante}^{\mathbf{G}}_{r^\tau} \neq \emptyset\}$,
$\mathbf{R}^{\mathbf{TG}} := \{r^\tau \mid \mathit{Ante}^{\mathbf{T}}_{r^\tau} \neq \emptyset,\ \mathit{Ante}^{\mathbf{G}}_{r^\tau} \neq \emptyset\}$,
$\mathbf{R}^{\mathbf{G}} := \mathbf{R}^{\mathbf{TG}} \cup \mathbf{R}^{\emptyset\mathbf{G}}$ and $\mathbf{R}^{\mathbf{T}} := \mathbf{R}^{\mathbf{T}\emptyset} \cup \mathbf{R}^{\mathbf{TG}}$

as the set of all T-split rules with an empty antecedent, only terminological patterns, only assertional patterns, both types of patterns, some assertional patterns, and some terminological patterns respectively, where $\mathbf{R}^\tau = \mathbf{R}^{\emptyset} \cup \mathbf{R}^{\mathbf{T}\emptyset} \cup \mathbf{R}^{\emptyset\mathbf{G}} \cup \mathbf{R}^{\mathbf{TG}} = \mathbf{R}^{\emptyset} \cup \mathbf{R}^{\mathbf{T}} \cup \mathbf{R}^{\mathbf{G}}$. We also give the sets $\mathbf{R}^{\mathbf{G}1} := \{r^\tau \in \mathbf{R}^{\mathbf{G}} : |\mathit{Ante}^{\mathbf{G}}_{r^\tau}| = 1\}$ and $\mathbf{R}^{\mathbf{G}n} := \{r^\tau \in \mathbf{R}^{\mathbf{G}} : |\mathit{Ante}^{\mathbf{G}}_{r^\tau}| > 1\}$, denoting the set of linear and non-linear rules respectively. Given a T-split ruleset $R^\tau$, herein we may use, e.g., $R^{\mathbf{G}}$ to denote $R^\tau \cap \mathbf{R}^{\mathbf{G}}$.

Example 1. For the rule $r :=$ (?c1, r:subClassOf, ?c2) $\wedge$ (?x, a, ?c1) $\Rightarrow$ (?x, a, ?c2), $\mathit{Ante}^{\mathbf{T}} :=$ {(?c1, r:subClassOf, ?c2)} and $\mathit{Ante}^{\mathbf{G}} :=$ {(?x, a, ?c1)}. Underlining $\mathit{Ante}^{\mathbf{T}}$, we write $\tau(r) := r^\tau :=$ $\underline{\text{(?c1, r:subClassOf, ?c2)}}$ $\wedge$ (?x, a, ?c1) $\Rightarrow$ (?x, a, ?c2).

We then define our T-Box as the set of terminological triples in a given graph which are required by the terminological patterns of a given ruleset.

Definition 2 (T-Box/A-Box). Given a graph $\mathcal{G}$ and a T-split ruleset $R^\tau$, let $R^{\mathbf{T}} := R^\tau \cap \mathbf{R}^{\mathbf{T}}$ represent the subset of rules in $R^\tau$ which require terminological data; the T-Box of $\mathcal{G}$ wrt. $R^\tau$ is then

$$\mathcal{T}(\mathcal{G}, R^\tau) := \bigcup_{r^\tau \in R^{\mathbf{T}}}\ \bigcup_{t^v \in \mathit{Ante}^{\mathbf{T}}_{r^\tau}}\ \bigcup_{\mu \in \mathbf{M}(\{t^v\}, \mathcal{G})} \mu(t^v),$$

representing the subset of terminological triples in $\mathcal{G}$ which satisfy a terminological pattern of a rule antecedent ($\mathit{Ante}^{\mathbf{T}}_{r^\tau}$) in $R^\tau$; where ruleset and graph are obvious, we may abbreviate $\mathcal{T}(\mathcal{G}, R^\tau)$ to simply $\mathcal{T}$. Our A-Box is synonymous with $\mathcal{G}$: i.e., we also consider our T-Box as part of our A-Box in a form of unidirectional meta-modelling.
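
To make Definitions 1 and 2 more tangible, the following Python sketch splits a rule antecedent into terminological and assertional patterns and extracts a T-Box from a small graph. It is a sketch under stated assumptions: the sets `META_CLASSES` and `META_PROPERTIES` are small illustrative subsets, RDF lists (case (iii) of the terminological-triple definition) are ignored, and all function names are hypothetical.

```python
# Illustrative subsets only; the paper restricts these to the RDF(S)/OWL vocabularies.
META_CLASSES = {"rdfs:Class", "rdf:Property", "owl:TransitiveProperty"}
META_PROPERTIES = {"rdfs:subClassOf", "rdfs:domain", "owl:inverseOf"}


def is_var(term):
    return term.startswith("?")


def is_terminological(pattern):
    """A pattern whose every instance must be a terminological triple."""
    _s, p, o = pattern
    return p in META_PROPERTIES or (p == "rdf:type" and o in META_CLASSES)


def t_split(antecedent, consequent):
    """Return (Ante^T, Ante^G, Con) for a rule given as two lists of patterns."""
    ante_t = [tp for tp in antecedent if is_terminological(tp)]
    ante_g = [tp for tp in antecedent if not is_terminological(tp)]
    return ante_t, ante_g, list(consequent)


def is_linear(split_rule):
    """Linear rules have exactly one assertional pattern."""
    return len(split_rule[1]) == 1


def matches(pattern, triple):
    """True if the triple is a binding of the pattern."""
    seen = {}
    for p, t in zip(pattern, triple):
        if is_var(p):
            if seen.setdefault(p, t) != t:
                return False
        elif p != t:
            return False
    return True


def t_box(graph, split_rules):
    """Triples of the graph matched by some terminological antecedent pattern."""
    return {triple
            for triple in graph
            for ante_t, _ante_g, _con in split_rules
            for pattern in ante_t
            if matches(pattern, triple)}


sub_class_rule = t_split(
    [("?c1", "rdfs:subClassOf", "?c2"), ("?x", "rdf:type", "?c1")],
    [("?x", "rdf:type", "?c2")],
)
graph = {
    ("f:Person", "rdfs:subClassOf", "f:Agent"),
    ("ex:me", "rdf:type", "f:Person"),
}
extracted = t_box(graph, [sub_class_rule])
```

For the sub-class rule of Example 1, the split places the rdfs:subClassOf pattern in the terminological part and the rdf:type pattern in the assertional part, so the rule is linear and only the sub-class triple is extracted into the T-Box.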

Given the notion of a T-split rule and our T-Box, we can now define how T-split rules are applied, and how T-split closure is achieved wrt. a static T-Box.

Definition 3 (T-split rule application and closure). We define a T-split rule application for a T-split rule $r^\tau$ wrt. a graph $\mathcal{G}$ to be:

$$\mathfrak{T}_{r^\tau}(\mathcal{T}, \mathcal{G}) := \bigcup_{\mu_0 \in \mathbf{M}(\mathit{Ante}^{\mathbf{T}}_{r^\tau}, \mathcal{T})}\ \bigcup_{\mu_1 \in \mathbf{M}(\mu_0(\mathit{Ante}^{\mathbf{G}}_{r^\tau}), \mathcal{G})} (\mu_0 \circ \mu_1)(\mathit{Con}_{r^\tau}) \tag{1}$$

here formalising the notion that the terminological patterns of the rule are strictly instantiated from a separate T-Box $\mathcal{T}$. Again, for a T-split ruleset $R^\tau$, $\mathfrak{T}_{R^\tau}(\mathcal{T}, \mathcal{G}) := \bigcup_{r^\tau \in R^\tau} \mathfrak{T}_{r^\tau}(\mathcal{T}, \mathcal{G})$. Now, let $Ax$ denote the set of axiomatic triples given by $R^\tau$ (the same set as for $R$), and $\mathcal{T}_0 := \mathcal{T}(\mathcal{G} \cup Ax, R^\tau)$ be our initial T-Box derived from $\mathcal{G}$ and axiomatic triples, and $\mathcal{T}_{i+1} := \mathcal{T}_i \cup \mathcal{T}(\mathfrak{T}_{R^{\mathbf{T}\emptyset}}(\mathcal{T}_i, \emptyset), R^\tau)$; we define our closed T-Box as $\mathcal{T}_n$ for the least value of $n$ such that $\mathcal{T}_n = \mathcal{T}_n \cup \mathfrak{T}_{R^{\mathbf{T}\emptyset}}(\mathcal{T}_n, \emptyset)$, denoted $\mathcal{T}^\tau$, representing the closure of our initial T-Box wrt. rules requiring only terminological knowledge. Finally, let $\mathcal{G}^\tau_0 := \mathcal{G} \cup \mathcal{T}^\tau \cup Ax$ and $\mathcal{G}^\tau_{i+1} := \mathcal{G}^\tau_i \cup \mathfrak{T}_{R^{\mathbf{G}}}(\mathcal{T}^\tau, \mathcal{G}^\tau_i)$; we now define the exhaustive application of the $\mathfrak{T}_{R^\tau}$ operator on a graph $\mathcal{G}$ wrt. a static T-Box $\mathcal{T}$ as being up to the least fixpoint such that $\mathcal{G}^\tau_n = \mathfrak{T}_{R^{\mathbf{G}}}(\mathcal{T}^\tau, \mathcal{G}^\tau_n)$. We call $\mathcal{G}^\tau_n$ the T-split closure of $\mathcal{G}$ with respect to the T-split ruleset $R^\tau$, denoted as $Cl_{R^\tau}(\mathcal{T}, \mathcal{G})$ or simply $\overline{\mathcal{G}}^\tau$.

The T-split closure algorithm consists of two main steps: (i) deriving the closed T-Box from axiomatic triples, the input graph, and recursively applied $R^{\mathbf{T}\emptyset}$ rules; (ii) applying 'A-Box' reasoning for all triples wrt. the $R^{\mathbf{G}}$ rules and the static T-Box. We now give some propositions relating the T-split closure with the standard rule application closure described in the preliminaries; firstly, we must give an auxiliary proposition which demonstrates how mappings for sub-graph patterns can be combined to give the mappings for the entire graph pattern, which relates to the T-split rule application.

Proposition 1.
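For any graph $\mathcal{G}$ and graph pattern $\mathcal{G}^{\mathbf{V}} := \mathcal{G}^{\mathbf{V}}_a \cup \mathcal{G}^{\mathbf{V}}_b$, it holds that $\mathbf{M}(\mathcal{G}^{\mathbf{V}}, \mathcal{G}) = \{\mu_b \circ \mu_a \mid \mu_a \in \mathbf{M}(\mathcal{G}^{\mathbf{V}}_a, \mathcal{G}),\ \mu_b \in \mathbf{M}(\mu_a(\mathcal{G}^{\mathbf{V}}_b), \mathcal{G})\}$.

Proof. Firstly, $\mu_b \circ \mu_a \in \mathbf{M}$ since $\mu_a$ and $\mu_b$ are endomorphic. By definition, $(\mu_b \circ \mu_a)(c) = c$ for $c \in \mathbf{C}$. Next, we need to show that $(\mu_b \circ \mu_a)(v) = v$ if $v \notin V(\mathcal{G}^{\mathbf{V}})$: since by definition $\mu_a(v) = v$ if $v \notin V(\mathcal{G}^{\mathbf{V}}_a)$ and $\mu_b(v) = v$ if $v \notin V(\mu_a(\mathcal{G}^{\mathbf{V}}_b))$, and since $V(\mu_a(\mathcal{G}^{\mathbf{V}}_b)) \subseteq V(\mathcal{G}^{\mathbf{V}}_b)$ and $V(\mathcal{G}^{\mathbf{V}}) = V(\mathcal{G}^{\mathbf{V}}_a) \cup V(\mathcal{G}^{\mathbf{V}}_b)$, then $(\mu_b \circ \mu_a)(v) = v$ if $v \notin V(\mathcal{G}^{\mathbf{V}})$. By definition, $\mu_a(\mathcal{G}^{\mathbf{V}}_a) \subseteq \mathcal{G}$ and thus we have $V(\mu_a(\mathcal{G}^{\mathbf{V}}_a)) = \emptyset$, and $\mu_a(\mathcal{G}^{\mathbf{V}}_a) = (\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}_a)$; again by definition we have $(\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}_b) \subseteq \mathcal{G}$, and so $(\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}_a) \cup (\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}_b) = (\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}_a \cup \mathcal{G}^{\mathbf{V}}_b) = (\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}) \subseteq \mathcal{G}$. We now have $\mu_b \circ \mu_a \in \mathbf{M}(\mathcal{G}^{\mathbf{V}}, \mathcal{G})$ for every $\mu_a \in \mathbf{M}(\mathcal{G}^{\mathbf{V}}_a, \mathcal{G})$, $\mu_b \in \mathbf{M}(\mu_a(\mathcal{G}^{\mathbf{V}}_b), \mathcal{G})$, and need to show that for every $\mu \in \mathbf{M}(\mathcal{G}^{\mathbf{V}}, \mathcal{G})$, there exists a $(\mu_b \circ \mu_a)$ such that $(\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}) = \mu(\mathcal{G}^{\mathbf{V}})$; by definition, we know that there exists a $\mu_a$ such that $\mu_a(\mathcal{G}^{\mathbf{V}}_a) = \mu(\mathcal{G}^{\mathbf{V}}_a)$ for any $\mu$ as defined, and that for every such $\mu_a$ there exists a $\mu_b$ such that $(\mu_b \circ \mu_a)(\mathcal{G}^{\mathbf{V}}_b) = (\mu \circ \mu_a)(\mathcal{G}^{\mathbf{V}}_b) = \mu(\mathcal{G}^{\mathbf{V}}_b)$, and hence the proposition holds. □

Theorem 1 (Soundness). For any given ruleset $R \subset \mathbf{R}$, its T-split version $R^\tau := \tau(R)$, and any graph $\mathcal{G}$, it holds that $\overline{\mathcal{G}}^\tau \subseteq \overline{\mathcal{G}}$.

Proof.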
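Clearly, $Ax$ gives the same set of triples for $R^\tau$ and $R$, and thus $\mathcal{T}_0 \subseteq \overline{\mathcal{G}}$ since $\mathcal{T}(\mathcal{G} \cup Ax, R^\tau) \subseteq \mathcal{G} \cup Ax \subseteq \overline{\mathcal{G}}$. From Proposition 1, it follows that $\mathbf{M}(\mathit{Ante}_r, \mathcal{G}) = \mathbf{M}(\mathit{Ante}^{\mathbf{T}}_{r^\tau} \cup \mathit{Ante}^{\mathbf{G}}_{r^\tau}, \mathcal{G}) = \{\mu_0 \circ \mu_1 \mid \mu_0 \in \mathbf{M}(\mathit{Ante}^{\mathbf{T}}_{r^\tau}, \mathcal{G}),\ \mu_1 \in \mathbf{M}(\mu_0(\mathit{Ante}^{\mathbf{G}}_{r^\tau}), \mathcal{G})\}$; we can then show that $\mathfrak{T}_r(\mathcal{G}) = \mathfrak{T}_{r^\tau}(\mathcal{G}, \mathcal{G})$ by replacing $\mathcal{T}$ with $\mathcal{G}$ in Equation 1, from which follows $\mathfrak{T}_R(\mathcal{G}) = \mathfrak{T}_{R^\tau}(\mathcal{G}, \mathcal{G})$. Given that $\mathfrak{T}_{R^\tau_a}(\mathcal{G}, \mathcal{G}) \subseteq \mathfrak{T}_{R^\tau}(\mathcal{G}, \mathcal{G})$ if $R^\tau_a \subseteq R^\tau$, and $\mathfrak{T}_{R^\tau}(\mathcal{G}_a, \mathcal{G}_b) \subseteq \mathfrak{T}_{R^\tau}(\mathcal{G}, \mathcal{G})$ if $\mathcal{G}_a \subseteq \mathcal{G}$ and $\mathcal{G}_b \subseteq \mathcal{G}$ – i.e., that our rule applications are monotonic – we can show by induction that $\mathcal{T}^\tau \subseteq \overline{\mathcal{G}}$: given $\mathcal{T}_0 \subseteq \overline{\mathcal{G}}$ from above, we can say that $\mathcal{T}_{i+1} \subseteq \overline{\mathcal{G}}$ if $\mathcal{T}_i \subseteq \overline{\mathcal{G}}$, since $\mathfrak{T}_{R^{\mathbf{T}\emptyset}}(\mathcal{T}_i, \emptyset) \subseteq \mathfrak{T}_{R^\tau}(\mathcal{T}_i, \mathcal{T}_i) = \mathfrak{T}_R(\mathcal{T}_i) \subseteq \overline{\mathcal{G}}$. Now, clearly $\mathcal{G}^\tau_0 \subseteq \overline{\mathcal{G}}$, and if $\mathcal{G}^\tau_i \subseteq \overline{\mathcal{G}}$, then $\mathcal{G}^\tau_{i+1} \subseteq \overline{\mathcal{G}}$ since $\mathfrak{T}_{R^{\mathbf{G}}}(\mathcal{T}^\tau, \mathcal{G}^\tau_i) \subseteq \mathfrak{T}_{R^\tau}(\mathcal{G}^\tau_i, \mathcal{G}^\tau_i) = \mathfrak{T}_R(\mathcal{G}^\tau_i) \subseteq \overline{\mathcal{G}}$; by induction, $\overline{\mathcal{G}}^\tau \subseteq \overline{\mathcal{G}}$. □

Theorem 2 (Conditional Completeness). If $\mathcal{T}^\tau = \mathcal{T}(\overline{\mathcal{G}}^\tau, R^\tau)$, then $\overline{\mathcal{G}}^\tau = \overline{\mathcal{G}}$.

Proof. First, $\mathfrak{T}_{R^\tau}(\mathcal{T}(\mathcal{G}, R^\tau), \mathcal{G}) = \mathfrak{T}_{R^\tau}(\mathcal{G}, \mathcal{G})$ since by definition $\mathcal{T}(\mathcal{G}, R^\tau)$ only removes triples from $\mathcal{G}$ that cannot be bound by terminological patterns in $R^\tau$. Given the criteria $\overline{\mathcal{G}}^\tau = \overline{\mathcal{G}}^\tau \cup \mathfrak{T}_{R^{\mathbf{G}}}(\mathcal{T}^\tau, \overline{\mathcal{G}}^\tau)$ – or, rephrasing, $\mathfrak{T}_{R^{\mathbf{G}}}(\mathcal{T}^\tau, \overline{\mathcal{G}}^\tau) \subseteq \overline{\mathcal{G}}^\tau$ – we first know that $Ax \cup \mathcal{T}^\tau \cup \mathcal{G} = \mathcal{G}^\tau_0 \subseteq \overline{\mathcal{G}}^\tau$. Thus, $\mathfrak{T}_{R^\tau}(\mathcal{T}^\tau, \overline{\mathcal{G}}^\tau) = \mathfrak{T}_{R^{\mathbf{G}}}(\mathcal{T}^\tau, \overline{\mathcal{G}}^\tau) \subseteq \overline{\mathcal{G}}^\tau$. If $\mathcal{T}^\tau = \mathcal{T}(\overline{\mathcal{G}}^\tau, R^\tau)$, then $\mathfrak{T}_{R^\tau}(\mathcal{T}^\tau, \overline{\mathcal{G}}^\tau) = \mathfrak{T}_{R^\tau}(\mathcal{T}(\overline{\mathcal{G}}^\tau, R^\tau), \overline{\mathcal{G}}^\tau) = \mathfrak{T}_{R^\tau}(\overline{\mathcal{G}}^\tau, \overline{\mathcal{G}}^\tau) = \mathfrak{T}_R(\overline{\mathcal{G}}^\tau)$, which gives $\mathcal{G} \subseteq \overline{\mathcal{G}}^\tau \subseteq \overline{\mathcal{G}}$: i.e., $\overline{\mathcal{G}}^\tau$ is known to be the partial closure of $\mathcal{G}$. Given the fixpoint condition $\overline{\mathcal{G}} = \overline{\mathcal{G}} \cup \mathfrak{T}_R(\overline{\mathcal{G}})$, then $\overline{\mathcal{G}}^\tau$ must be the fixpoint: $\overline{\mathcal{G}}^\tau = \overline{\mathcal{G}}$. □

Proposition 2.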
A triple $t \in \mathcal{T}(\overline{\mathcal{G}}^\tau, R^\tau) \setminus \mathcal{T}^\tau$ can only be produced for $\overline{\mathcal{G}}^\tau$ through an inference for a rule in $R^{\mathbf{G}}$.

Proof. Any T-Box triples in the original graph, or T-Box triples produced by the 'closure' of $R^{\emptyset}$ rules are added to the initial T-Box $\mathcal{T}_0$. Any T-Box triples produced by the closure of $R^{\mathbf{T}\emptyset}$ rules over $\mathcal{T}_0$ are added to the closed T-Box $\mathcal{T}^\tau$. Since $R^\tau := R^{\emptyset} \cup R^{\mathbf{T}\emptyset} \cup R^{\mathbf{G}}$, the only new triples – terminological or not – that can arise in the computation of $\overline{\mathcal{G}}^\tau$ after deriving $\mathcal{T}^\tau$ are from rules in $R^{\mathbf{G}}$. □

We have shown that for an arbitrary ruleset and graph, the T-split closure is sound wrt. the standard closure, and that if no T-Box triples are produced by rules requiring assertional knowledge, then T-split closure is complete wrt. the standard closure. So, when are T-Box triples produced by $R^{\mathbf{G}}$ rules? Analysis must be applied per ruleset, but for RDFS, pD* and OWL 2 RL/RDF, we informally posit that by inspection, one can show that such a condition can only arise through so-called non-standard usage [8]: the assertion of terminological triples which use meta-classes and meta-properties in positions other than the object of rdf:type triples or predicate position respectively – e.g., my:subPropertyOf rdfs:subPropertyOf rdfs:subPropertyOf .

The T-split approach can be implemented through partial indexing using two scans of the data: the first separates and builds the T-Box and the second reasons over the A-Box – note that the first scan can be over a separate T-Box graph. Algorithm 1 details this approach, which largely follows the formalisms in Definition 3: the major variance consists of the application of rules in $R^{\mathbf{G}}$, which one can convince oneself is equivalent since all triples encountered are passed through every rule in $R^{\mathbf{G}}$. For brevity, we omit some implementational details such as partial duplicate detection implemented using an LRU locality cache. The "non-trivial" aspects of the implementation include the indexing of the T-Box $\mathcal{T}^\tau$, and the indexing of the A-Box $\mathcal{A}$.
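Again, as $\mathcal{A}$ is required to store more data, the two-scan approach becomes more inefficient than the "full-indexing" approach; in particular, a rule in $R^{\mathbf{G}n}$ with an open pattern – e.g., OWL 2 RL/RDF rule eq-rep-s: (?s, o:sameAs, ?s') $\wedge$ (?s, ?p, ?o) $\Rightarrow$ (?s', ?p, ?o) – will require indexing of all data, negating the benefits of the approach. Again, partial-indexing performs well if $\mathcal{A}$ remains small and performs best if $R^{\mathbf{G}n} = \emptyset$ – i.e., no rules require A-Box joins and thus A-Box indexing is not required.

Algorithm 1: Partial indexing approach for T-split closure

Required: $R$, $\mathcal{G}$
$R^\tau := \tau(R)$; $\mathcal{T}_0 := \mathcal{T}(Ax, R^\tau)$; $n := 0$;    /* get T-split rules & ax. T-Box triples */
for $t \in \mathcal{G}$ do $\mathcal{T}_0 := \mathcal{T}_0 \cup \mathcal{T}(\{t\}, R^\tau)$;    /* SCAN 1: extract T-Box from main data */
while $\mathcal{T}_{n+1} \neq \mathcal{T}_n$ do $\mathcal{T}_{n+1} := \mathcal{T}_n \cup \mathcal{T}(\mathfrak{T}_{R^{\mathbf{T}\emptyset}}(\mathcal{T}_n, \emptyset), R^\tau)$; $n$++;    /* do T-Box reasoning */
$\mathcal{T}^\tau := \mathcal{T}_{n+1}$; $\overline{\mathcal{G}}^\tau := \mathcal{G}^\tau_0 := \mathcal{G} \cup \mathcal{T}^\tau \cup Ax$; $\mathcal{A} := \emptyset$;    /* initialise A-Box structures */
for $t_I \in \mathcal{G}^\tau_0$ do    /* SCAN 2: A-Box reasoning over all data */
    $\mathcal{G}^I_0 := \emptyset$; $\mathcal{G}^I_1 := \{t_I\}$; $n := 1$;    /* initialise set to hold inferences from $t_I$ */
    while $\mathcal{G}^I_n \neq \mathcal{G}^I_{n-1}$ do    /* while we find new triples to reason over */
        for $t \in \mathcal{G}^I_n \setminus \mathcal{G}^I_{n-1}$ do    /* scan new triples */
            $\mathcal{G}^I_{n+1} := \mathcal{G}^I_n \cup \mathfrak{T}_{R^{\mathbf{G}1}}(\mathcal{T}^\tau, \{t\})$;    /* do all 'no A-Box join' rules for t */
            for $r \in R^{\mathbf{G}n}$ do    /* for each 'A-Box join' rule */
                for $t^v \in \mathit{Ante}^{\mathbf{G}}_r$ do    /* for each assertional pattern */
                    if $\exists \mu \in \mathbf{M} : \mu(t^v) = t$ then $\mathcal{A} := \mathcal{A} \cup \{t\}$;    /* index t if needed */
                $\mathcal{G}^I_{n+1} := \mathcal{G}^I_{n+1} \cup \mathfrak{T}_r(\mathcal{T}^\tau, \mathcal{A})$;    /* apply 'A-Box join' rule over index */
        $n$++;    /* recurse */
    $\overline{\mathcal{G}}^\tau := \overline{\mathcal{G}}^\tau \cup \mathcal{G}^I_n$;    /* write set of recursive inferences for $t_I$ to output */
Return: $\overline{\mathcal{G}}^\tau$

4 Template Rules

We now discuss optimisations for deriving T-split closure based on template rules, which are currently used by DLEJena [13] and also used in RIF for supporting OWL 2 RL/RDF [15]; however, instead of manually specifying a set of template rules, we leverage our general notion of terminological data to create a generic template function: after separating and closing the T-Box, we bind the T-Box patterns of rules before accessing the A-Box to create a set of new templated rules (or T-ground rules) which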
themselves 'encode' the T-Box, thus avoiding repetitive T-Box pattern bindings during the A-Box reasoning process. We now formalise these notions.

Definition 4 (Template Function). For a T-split rule $r^\tau$, the template function is given as $\alpha : \mathbf{R}^\tau \times 2^{\mathbf{G}} \to 2^{\mathbf{R}};\ (r^\tau, \mathcal{T}) \mapsto \{(\mu(\mathit{Ante}^{\mathbf{G}}_{r^\tau}), \mu(\mathit{Con}_{r^\tau})) \mid \mu \in \mathbf{M}(\mathit{Ante}^{\mathbf{T}}_{r^\tau}, \mathcal{T})\}$.

Example 2. Given a simple T-Box $\mathcal{T} :=$ {(f:Person, r:subClassOf, f:Agent)} and a rule $r^\tau :=$ $\underline{\text{(?c1, r:subClassOf, ?c2)}}$ $\wedge$ (?x, a, ?c1) $\Rightarrow$ (?x, a, ?c2), then the template function is given as $\alpha(r^\tau, \mathcal{T}) :=$ {(?x, a, f:Person) $\Rightarrow$ (?x, a, f:Agent)}.

Templated rule application is synonymous with standard rule application. We may use $\alpha$ as intuitive shorthand to map a set of T-split rules to the union of the set of resulting templated rules. We now (i) propose that applying a T-split rule gives the same result as applying the respective set of templated rules wrt. arbitrary graphs $\mathcal{T}$ & $\mathcal{G}$; (ii) describe the closure of a graph using templated rules; (iii) show that the templated-rule closure equals the T-split closure previously outlined.

Proposition 3. For any graphs $\mathcal{T}$, $\mathcal{G}$ and for any rule $r$ with a T-split rule $r^\tau = \tau(r)$, it holds that $\mathfrak{T}_{r^\tau}(\mathcal{T}, \mathcal{G}) = \mathfrak{T}_{\alpha(r^\tau, \mathcal{T})}(\mathcal{G})$.

Proof. $\mathfrak{T}_{r^\tau}(\mathcal{T}, \mathcal{G}) = \bigcup_{\mu_0 \in \mathbf{M}(\mathit{Ante}^{\mathbf{T}}_{r^\tau}, \mathcal{T})} \bigcup_{\mu_1 \in \mathbf{M}(\mu_0(\mathit{Ante}^{\mathbf{G}}_{r^\tau}), \mathcal{G})} (\mu_0 \circ \mu_1)(\mathit{Con}_{r^\tau}) = \bigcup_{r \in \alpha(r^\tau, \mathcal{T})} \bigcup_{\mu \in \mathbf{M}(\mathit{Ante}_r, \mathcal{G})} \mu(\mathit{Con}_r) = \mathfrak{T}_{\alpha(r^\tau, \mathcal{T})}(\mathcal{G})$. □

Definition 5 (Templated rule closure). Given a ruleset $R$, its T-split version $R^\tau := \tau(R)$, and a graph $\mathcal{G}$, let $\mathcal{T}^\tau$ represent the closed T-Box as derived in the T-split closure, and let $R^\alpha := \alpha(R^{\mathbf{G}}, \mathcal{T}^\tau)$. Again, let $\mathcal{G}^\alpha_0 := \mathcal{G} \cup \mathcal{T}^\tau \cup Ax$, but this time $\mathcal{G}^\alpha_{i+1} := \mathcal{G}^\alpha_i \cup \mathfrak{T}_{R^\alpha}(\mathcal{G}^\alpha_i)$; as before, the templated rule closure is $\mathcal{G}^\alpha_n$ for the smallest value of $n$ such that $\mathcal{G}^\alpha_n = \mathfrak{T}_{R^\alpha}(\mathcal{G}^\alpha_n)$, denoted as $Cl_{R^\alpha}(\mathcal{T}, \mathcal{G}^\alpha)$, or simply $\overline{\mathcal{G}}^\alpha$.

Theorem 3.
For any graph $\mathcal{G}$, and any ruleset $R \subset \mathbf{R}$, its T-split version $R^\tau$, and the respective templated ruleset $R^\alpha$, we can say that $\overline{\mathcal{G}}^\alpha = \overline{\mathcal{G}}^\tau$.

Proof. The only divergence between the T-split closure and templated-rule closure is in the fixpoint calculation: $\mathcal{G}^\alpha_{i+1} := \mathcal{G}^\alpha_i \cup \mathfrak{T}_{R^\alpha}(\mathcal{G}^\alpha_i)$ versus $\mathcal{G}^\tau_{i+1} := \mathcal{G}^\tau_i \cup \mathfrak{T}_{R^{\mathbf{G}}}(\mathcal{T}^\tau, \mathcal{G}^\tau_i)$. Using induction, by def. $\mathcal{G}^\tau_0 = \mathcal{G}^\alpha_0$; if $\mathcal{G}^\tau_i = \mathcal{G}^\alpha_i$, then $\mathcal{G}^\tau_{i+1} = \mathcal{G}^\alpha_i \cup \bigcup_{r^\tau \in R^{\mathbf{G}}} \mathfrak{T}_{r^\tau}(\mathcal{T}^\tau, \mathcal{G}^\alpha_i) = \mathcal{G}^\alpha_i \cup \bigcup_{r \in \alpha(R^{\mathbf{G}}, \mathcal{T}^\tau)} \mathfrak{T}_r(\mathcal{G}^\alpha_i) = \mathcal{G}^\alpha_i \cup \mathfrak{T}_{R^\alpha}(\mathcal{G}^\alpha_i) = \mathcal{G}^\alpha_{i+1}$. □

The templated rules can be applied in lieu of the $R^{\mathbf{G}}$ rules in Algorithm 1. Indeed, a large number of rules can be templated for a sufficiently complex T-Box, and naïve application of all such rules on all triples could worsen performance; however, the templated rules are more amenable to further optimisations, which we now discuss.

4.1 Merging Equivalent Template Rules

The templating procedure may result in rules with equivalent antecedents – which can be aligned by variable rewriting – being produced; these rules can subsequently be merged. We formalise such notions here.

Definition 6 (Equivalent Graph Patterns). Let $\mathbf{N}$ be the set of automorphic variable rewrite mappings containing all $\nu$ as follows:

$$\nu : \mathbf{V} \cup \mathbf{C} \to \mathbf{V} \cup \mathbf{C};\quad x \mapsto \begin{cases} x & \text{if } x \in \mathbf{C} \\ v \in \mathbf{V} & \text{otherwise} \end{cases} \tag{2}$$

(Note: $\mathbf{N} \subset \mathbf{M}$). We denote by $\sim_\nu$ an equivalence relation for graph patterns such that $\mathcal{G}^{\mathbf{V}}_i \sim_\nu \mathcal{G}^{\mathbf{V}}_j$ iff there exists a mapping $\nu \in \mathbf{N}$ such that $\nu(\mathcal{G}^{\mathbf{V}}_i) = \mathcal{G}^{\mathbf{V}}_j$.

Proposition 4. The relation $\sim_\nu$ is reflexive, symmetric and transitive.

Proof. Reflexivity is trivially given by the identity morphism $\nu(x) = x$, symmetry is given by the inverse morphism $\nu^{-1}(\mathcal{G}^{\mathbf{V}}_j)$ where $\nu^{-1} \in \mathbf{N}$ if $\nu \in \mathbf{N}$ since $\nu$ is automorphic, and transitivity is given by the presence of the composite morphism $(\nu_a \circ \nu_b)(\mathcal{G}^{\mathbf{V}})$ where again $\nu_a \circ \nu_b \in \mathbf{N}$ since $\nu_a$ and $\nu_b$ are automorphic. □
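
The equivalence of graph patterns under variable rewriting can be checked mechanically. The Python sketch below is only an illustration under simplifying assumptions: it searches for a one-to-one renaming between the variables of two small patterns by brute force over permutations (adequate for the handful of variables in a rule antecedent, but not an efficient procedure), and the function name `find_rewrite` is hypothetical.

```python
from itertools import permutations


def is_var(term):
    return term.startswith("?")


def variables(graph_pattern):
    """Sorted list of the variables occurring in a set of triple patterns."""
    return sorted({term for tp in graph_pattern for term in tp if is_var(term)})


def rename(graph_pattern, mapping):
    """Apply a variable renaming; constants are left untouched."""
    return frozenset(tuple(mapping.get(term, term) for term in tp)
                     for tp in graph_pattern)


def find_rewrite(source, target):
    """Return a one-to-one renaming nu with nu(source) == target, else None."""
    src_vars, tgt_vars = variables(source), variables(target)
    if len(src_vars) != len(tgt_vars) or len(set(source)) != len(set(target)):
        return None
    goal = frozenset(target)
    # Try every pairing of source variables with target variables.
    for perm in permutations(tgt_vars):
        nu = dict(zip(src_vars, perm))
        if rename(source, nu) == goal:
            return nu
    return None


nu = find_rewrite([("?s", "f:img", "?o")], [("?x", "f:img", "?y")])
no_match = find_rewrite([("?x", "p", "?x")], [("?x", "p", "?y")])
```

In this example the two single-pattern antecedents are equivalent under the renaming that sends ?s to ?x and ?o to ?y, whereas a pattern that repeats one variable is not equivalent to one using two distinct variables.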

Definition 7 (Rule Merge). Let $\sim_R$ be an equivalence relation – slightly abusing notation – which holds between two rules such that $r_i \sim_R r_j$ iff $\mathit{Ante}_{r_i} \sim_\nu \mathit{Ante}_{r_j}$. Given an equivalence class $[r]$ – a set of rules between which $\sim_R$ holds – select a canonical rule $r \in [r]$; we can now describe the merge of the equivalence class as $\beta([r]) := (\mathit{Ante}_r, \mathit{Con}_{[r]})$ where $\mathit{Con}_{[r]} := \bigcup_{r_i \in [r]} \nu_i(\mathit{Con}_{r_i})$ for some $\nu_i \in \mathbf{N}$ such that $\nu_i(\mathit{Ante}_{r_i}) = \mathit{Ante}_r$. Now letting $R/{\sim_R} := \{[r] \mid r \in R\}$ denote the quotient set of $R$ by $\sim_R$ – the set of all equivalent classes $[r]$ wrt. $\sim_R$ in $R$ – we can generalise the rule merge function for a set of rules as $\beta : 2^{\mathbf{R}} \to 2^{\mathbf{R}}$, $R \mapsto \{\beta([r]) \mid [r] \in R/{\sim_R}\}$.

Example 3. Taking the two templated rules: (?x, f:img, ?y) $\Rightarrow$ (?x, a, f:Person) and (?s, f:img, ?o) $\Rightarrow$ (?s, f:depicts, ?o); they can be merged by $\nu$ where $\nu$(?s) = ?x, $\nu$(?o) = ?y, giving (?x, f:img, ?y) $\Rightarrow$ (?x, a, f:Person) $\wedge$ (?x, f:depicts, ?y).

The choice of canonical rule is unimportant since $\nu$ is automorphic; we now show that the application of any ruleset and the respective merged ruleset are extensionally equal.

Proposition 5. For any graph $\mathcal{G}$ and $\sim_R$ equivalence class $[r]$, $\mathfrak{T}_{[r]}(\mathcal{G}) = \mathfrak{T}_{\beta([r])}(\mathcal{G})$; for any ruleset $R$, $\mathfrak{T}_R(\mathcal{G}) = \mathfrak{T}_{\beta(R)}(\mathcal{G})$; wrt. closure, $Cl_{R^\alpha}(\mathcal{T}, \mathcal{G}) = Cl_{\beta(R^\alpha)}(\mathcal{T}, \mathcal{G})$.

Proof. We denote $\beta([r])$ as $(\mathit{Ante}_\beta, \mathit{Con}_\beta)$. If $\mathcal{G}^{\mathbf{V}}_i \sim_\nu \mathcal{G}^{\mathbf{V}}_j$, then by def. $\nu(\mathcal{G}^{\mathbf{V}}_i) = \mathcal{G}^{\mathbf{V}}_j$, and for any graph $\mathcal{G}$ and any mapping $\mu \in \mathbf{M}$, $\mu(\nu(\mathcal{G}^{\mathbf{V}}_i)) = \mu(\mathcal{G}^{\mathbf{V}}_j)$; i.e., if $\mathcal{G}^{\mathbf{V}}_i \sim_\nu \mathcal{G}^{\mathbf{V}}_j$, $\mathbf{M}(\nu(\mathcal{G}^{\mathbf{V}}_i), \mathcal{G}) = \mathbf{M}(\mathcal{G}^{\mathbf{V}}_j, \mathcal{G})$. Thus we give $M_\beta := \{\mu \mid \mu(\mathit{Ante}_\beta) \subseteq \mathcal{G}\} = \{\mu \mid \mu(\nu_i(\mathit{Ante}_{r_i})) \subseteq \mathcal{G}\}$.
Let $M_i := \{\mu \mid \mu(\mathit{Ante}_{r_i}) \subseteq \mathcal{G}\}$; now, it follows that $\mathfrak{T}_{\beta([r])}(\mathcal{G}) = \bigcup_{\mu \in M_\beta} \mu(\mathit{Con}_\beta) = \bigcup_{r_i \in [r]} \bigcup_{\mu \in M_\beta} \mu(\nu_i(\mathit{Con}_{r_i})) = \bigcup_{r_i \in [r]} \bigcup_{\mu \in M_i} \mu(\mathit{Con}_{r_i}) = \mathfrak{T}_{[r]}(\mathcal{G})$. The rest of the proposition follows naturally. □

4.2 Rule Index

We have reduced the number of templated rules through merging; however, given a sufficiently complex T-Box, we may still have a prohibitive number of rules for efficient recursive application. We now look at the use of a rule index which maps a triple $t$ to rules containing an antecedent pattern which $t$ is a binding for, thus enabling the efficient identification and application of only relevant rules for a given triple.

Definition 8 (Rule Lookup). Given a triple $t$ and ruleset $R$, the rule lookup function is $\omega : \mathbf{G} \times 2^{\mathbf{R}} \to 2^{\mathbf{R}}$, $(t, R) \mapsto \{r \in R \mid \exists \mu \in \mathbf{M}, \exists t^v \in \mathit{Ante}_r : (\mu(t^v) = t)\}$.

Example 4. Given a triple $t :=$ (ex:me, a, f:Person), and a simple example ruleset $R :=$ {(?x, f:img, ?y) $\Rightarrow$ (?x, a, f:Person); (?x, a, f:Person) $\Rightarrow$ (?x, a, f:Agent); (?x, a, ?y) $\Rightarrow$ (?y, a, r:Class)}, $\omega(t, R)$ returns a set containing the latter two rules.

A triple pattern has $2^3 = 8$ possible forms: (?, ?, ?), (s, ?, ?), (?, p, ?), (?, ?, o), (s, p, ?), (?, p, o), (s, ?, o), (s, p, o). Thus, we require eight indices for antecedent triple patterns, and eight lookups to perform $\omega(t, R)$ – to find all relevant rules for a triple. We use seven in-memory hashtables storing the constants of the rule antecedent patterns as key, and a set of rules containing such a pattern as value; e.g., {(?x, a, f:Person)} is put into the (?, p, o) index with {a, f:Person} as key. Rules containing patterns without constants are stored in a set, as they are relevant to all triples.

4.3 Rule Dependency – Labelled Rule Graph

Within our rule index, there may exist rule dependencies: the application of one rule may/will lead to the application of another.
Thus, instead of performing lookups for rules for each recursively inferred triple, we can model dependencies in our rule index using a rule graph. In Logic Programming, a rule graph is defined as a directed graph $H := (R, \Omega)$ where $(r_i, r_j) \in \Omega$ (i.e., $r_i \Omega r_j$, read "$r_j$ follows $r_i$") iff there exists a mapping $\mu \in \mathbf{M}$ such that $\mu(t^v) \in \mathit{Con}_{r_i}$ for $t^v \in \mathit{Ante}_{r_j}$ (cf. [14]).

By building and encoding such a rule graph into our index, we can "wire" the recursive application of rules for a given triple. However, from the merge function (or otherwise) there may exist rules with large consequent sets. We therefore extend the notion of the rule graph to a directed labelled graph with inclusion of the labelling function $\lambda : R \times R \to 2^{\mathbf{G}^{\mathbf{V}}};\ (r_i, r_j) \mapsto \{t^v \in \mathit{Con}_{r_i} \mid \exists \mu \in \mathbf{M} : \mu^{-1}(t^v) \in \mathit{Ante}_{r_j}\}$; in simpler terms, $\lambda(r_i, r_j)$ gives the set of consequent triple patterns in $r_i$ that would be matched by patterns in the antecedent of $r_j$, labelling the edges $\Omega$ of the rule graph with the consequent patterns that give the dependency.

Example 5. For a rule $r_i :=$ (?x, f:img, ?y) $\Rightarrow$ (?x, a, f:Person) $\wedge$ (?y, a, f:Image), and a rule $r_j :=$ (?s, a, f:Person) $\Rightarrow$ (?s, a, f:Agent), we say that $r_i \Omega r_j$, where $\lambda(r_i, r_j) =$ {(?x, a, f:Person)}.

In practice, our rule index stores sets of elements of a linked list, where each element contains a rule and links to rules which are relevant for that rule's consequent patterns. Thus, for each input triple, we can retrieve all relevant rules for all eight possible patterns, apply those rules, and if successful, follow the respective labelled links to recursively find relevant rules without re-accessing the index until the next input triple.
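
The rule lookup of Section 4.2 and the labelled rule graph described above can be sketched together in Python. This is a hedged illustration rather than the SAOR implementation: it uses a single dictionary keyed by the constants of each antecedent pattern (with `None` standing in for variables) instead of separate hashtables per pattern form, the linked-list wiring is replaced by a plain dictionary of labelled edges, and the names `build_index`, `lookup` and `labelled_edges` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def is_var(term):
    return term.startswith("?")


def subsumes(general, specific):
    """True if some mapping of the variables in `general` yields `specific`."""
    mapping = {}
    for g, s in zip(general, specific):
        if is_var(g):
            if mapping.setdefault(g, s) != s:
                return False
        elif g != s:
            return False
    return True


def build_index(rules):
    """Key each antecedent pattern by its constants (None marks a variable)."""
    index = defaultdict(set)
    for name, (antecedent, _consequent) in rules.items():
        for pattern in antecedent:
            key = tuple(None if is_var(term) else term for term in pattern)
            index[key].add((name, pattern))
    return index


def lookup(triple, index):
    """Probe the 2^3 = 8 forms of the triple and collect the relevant rules."""
    found = set()
    for mask in product((True, False), repeat=3):
        key = tuple(term if keep else None for term, keep in zip(triple, mask))
        for name, pattern in index.get(key, ()):
            # Re-check the pattern to rule out repeated-variable mismatches.
            if subsumes(pattern, triple):
                found.add(name)
    return found


def labelled_edges(rules):
    """Map (r_i, r_j) to the consequent patterns of r_i matched by Ante of r_j."""
    edges = {}
    for i, (_ante_i, con_i) in rules.items():
        for j, (ante_j, _con_j) in rules.items():
            label = {c for c in con_i if any(subsumes(a, c) for a in ante_j)}
            if label:
                edges[(i, j)] = label
    return edges


rules = {
    "r_i": ([("?x", "f:img", "?y")], [("?x", "a", "f:Person"), ("?y", "a", "f:Image")]),
    "r_j": ([("?s", "a", "f:Person")], [("?s", "a", "f:Agent")]),
    "r_k": ([("?x", "a", "?y")], [("?y", "a", "r:Class")]),
}
relevant = lookup(("ex:me", "a", "f:Person"), build_index(rules))
edges = labelled_edges(rules)
```

With the three example rules, the lookup for (ex:me, a, f:Person) returns the two rules whose antecedents mention rdf:type, and the edge from r_i to r_j carries the single consequent pattern (?x, a, f:Person) as its label, mirroring Examples 4 and 5.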

4.4 Rule Saturisation

We very briefly describe one final and intuitive optimisation technique we investigated – which later evaluation demonstrates to be mostly disadvantageous – involving the saturisation of rules; we say that a subset of dependencies in the rule graph are strong dependencies, where the successful application of one rule will always lead to the successful application of another. For linear rules, we can saturate rules by pre-computing the recursive rule application of its dependencies; we give the gist with an example:

Example 6. Take rules $r_i :=$ (?x, f:img, ?y) $\Rightarrow$ (?x, a, f:Person) $\wedge$ (?y, a, f:Image), $r_j :=$ (?s, a, f:Person) $\Rightarrow$ (?s, a, f:Agent), $r_k :=$ (?x, a, ?y) $\Rightarrow$ (?y, a, r:Class). We can see that $r_i \Omega r_j$, $r_i \Omega r_k$, $r_j \Omega r_k$. We can remove the links from $r_i$ to $r_j$ and $r_k$ (and similarly from $r_j$ to $r_k$) by saturating $r_i$ to (?x, f:img, ?y) $\Rightarrow$ (?x, a, f:Person) $\wedge$ (?y, a, f:Image) $\wedge$ (?x, a, f:Agent) $\wedge$ (f:Person, a, r:Class) $\wedge$ (f:Image, a, r:Class) $\wedge$ (f:Agent, a, r:Class).

As we will see in Sections 4.6 & 5.2, saturisation produces more duplicates and thus puts more load on the duplicate-removal cache, negatively affecting performance.

4.5 Optimised Partial Indexing Approach using Template Rules

We now integrate the above notions as optimisations for the partial indexing approach, with the new procedure detailed in Algorithm 2. We no longer need to bind T-Box patterns during A-Box access; we mitigate the cost of extra templated rules by first merging rules, and instead of brute-force applying all rules to all triples in the A-Box reasoning scan, we use our linked rule index to retrieve only relevant rules for a given triple and to find recursively relevant rules. We now initially evaluate our methods.

4.6 Preliminary Performance Evaluation

In order to initially evaluate the above optimisations, we applied small-scale reasoning for RDFS (minus the infinite rdf:_n axiomatic triples [4]), pD* and OWL 2 RL/RDF over LUBM(10) [3], consisting of about 1.3m triples – note that we do exclude lg/gl rules for RDFS/pD* since we allow generalised triples [2].
All evaluation in this paper has been run on single-core 2.2 GHz Opteron x86-64 machine(s) with 4 GB of main memory. Table 1 gives the performance for the following partial-indexing configurations: (i) N: 'normal' T-split closure; (ii) NI: normal T-split closure with linked rule
