Robust and Scalable Linked Data Reasoning Incorporating Provenance and Trust Annotations

Piero A. Bonatti (a), Aidan Hogan (b), Axel Polleres (b), Luigi Sauro (a)

(a) Università di Napoli "Federico II", Napoli, Italy
(b) Digital Enterprise Research Institute, National University of Ireland, Galway

Abstract

In this paper, we leverage annotated logic programs for tracking indicators of provenance and trust during reasoning, specifically focussing on the use-case of applying a scalable subset of OWL 2 RL/RDF rules over static corpora of arbitrary Linked Data (Web data). Our annotations encode three facets of information: (i) blacklist: a (possibly manually generated) boolean annotation which indicates that the referent data are known to be harmful and should be ignored during reasoning; (ii) ranking: a numeric value derived by a PageRank-inspired technique, adapted for Linked Data, which determines the centrality of certain data artefacts (such as RDF documents and statements); (iii) authority: a boolean value which uses Linked Data principles to conservatively determine whether or not some terminological information can be trusted. We formalise a logical framework which annotates inferences with the strength of derivation along these dimensions of trust and provenance; we formally demonstrate some desirable properties of the deployment of annotated logic programming in our setting, which guarantees (i) a unique minimal model (least fixpoint); (ii) monotonicity; (iii) finitariness; and (iv) decidability.
In so doing, we also give some formal results which reveal strategies for scalable and efficient implementation of various reasoning tasks one might consider. Thereafter, we discuss scalable and distributed implementation strategies for applying our ranking and reasoning methods over a cluster of commodity hardware; throughout, we provide evaluation of our methods over 1 billion Linked Data quadruples crawled from approximately 4 million individual Web documents, empirically demonstrating the scalability of our approach, and how our annotation values help ensure a more robust form of reasoning. We finally sketch, discuss and evaluate a use-case for a simple repair of inconsistencies detectable within OWL 2 RL/RDF constraint rules using ranking annotations to detect and defeat the "marginal view", and in so doing, infer an empirical "consistency threshold" for the Web of Data in our setting.

Keywords: annotated programs; linked data; web reasoning; scalable reasoning; distributed reasoning; authoritative reasoning; owl 2 rl; provenance; pagerank; inconsistency; repair

Email addresses: [email protected] (Piero A. Bonatti), [email protected] (Aidan Hogan), [email protected] (Axel Polleres), [email protected] (Luigi Sauro).

Preprint submitted to Elsevier, 02/06/2011

1. Introduction

The Semantic Web is no longer purely academic: in particular, the Linking Open Data project has fostered a rich lode of openly available structured data on the Web, commonly dubbed the "Web of Data" [53]. Based on the Resource Description Framework (RDF), Linked Data emphasises four simple principles: (i) use URIs as names for things (and not just documents); (ii) make those URIs dereferenceable via HTTP; (iii) return useful and relevant RDF content upon lookup of those URIs; (iv) include links to other datasets. Since its inception four years ago, the Linked Data community has overseen exports from corporate bodies (e.g., the BBC, New York Times, Freebase, Linked Clinical Trials [linkedct.org], Thompson Reuters [opencalais.com], etc.), governmental bodies (e.g., UK Government [data.gov.uk], US Government [data.gov], US National Science Foundation, etc.), community driven efforts (e.g., Wikipedia-based exports [dbpedia.org], GeoNames, etc.), social-networking sites (e.g., MySpace [dbtune.org], flickr, Twitter [semantictweet.com], etc.) and scientific communities (e.g., DBLP, PubMed, UniProt), amongst others.¹

Interspersed with these voluminous exports of assertional (or instance) data are lightweight schemata/ontologies defined in RDFS/OWL, which we will collectively call vocabularies, comprising the terminological data which describe classes and properties and provide a formalisation of the domain of discourse. The assertional data describe things by assigning datatype (string) values for named attributes (datatype properties), named relations to other things (object properties), and named classifications (classes). The terminological data describe these classes and properties, with RDFS and OWL providing the formal mechanisms to render the semantics of these terms, in particular their inter-relation and prescribed intended usage. Importantly, best-practices encourage: (i) re-use of class and property terms by independent and remote data publishers; (ii) inter-vocabulary extension of classes and properties; (iii) dereferenceability of classes and properties, returning a formal RDFS/OWL description thereupon.

Applications are slowly emerging which leverage this rich vein of structured Web data; however, thus far (with of course a few exceptions) applications have been slow to leverage the underlying terminological data for reasoning. Loosely speaking, reasoning utilises the formal underlying semantics of the terms in the data to enable the derivation of new knowledge. With respect to Linked Data, sometimes there is only sparse re-use of (terminological and/or assertional) terms between sources, and so reasoning can be used to better integrate the data in a merged corpus as follows: (i) infer and support the semantics of ground equality (owl:sameAs relations between individuals) to unite knowledge fractured by insufficient URI re-use across the Web; (ii) leverage terminological knowledge to infer new assertional knowledge, possibly across vocabularies and even assertional documents; (iii) detect inconsistencies whereby one or more parties may provide conflicting data. Herein, we focus on the latter two reasoning tasks.

However, reasoning over (even subsets of) the Web of Data poses two major challenges, with serious implications for reasoning: (i) scalability, where one can expect Linked Data corpora containing in the order of billions or tens of billions of statements (for the moment); (ii) tolerance to noise and inconsistency, whereby data published on the Web are afflicted with naïve errors and disagreement, arbitrary redefinitions of classes/properties, etc. Traditional reasoning approaches are not well positioned to tackle such challenges, where the main body of literature in the area is focussed on considerations such as expressiveness, soundness, completeness and polynomial-time tractability², and makes some basic assumptions about the underlying data quality; for example, tableaux-based algorithms struggle with large bodies of assertional knowledge, and are tied by the principle of ex falso quodlibet (from contradiction follows anything) and thus struggle when reasoning over possibly inconsistent data.³

In previous works, we presented our Scalable Authoritative OWL Reasoner (SAOR) which applies a subset of OWL 2 RL/RDF rules over arbitrary Linked Data crawls: in particular, we abandon completeness in favour of conducting "sensible" inferencing which we argue to be suitable for the Linked Data use-case [36,38].⁴ We have previously demonstrated distributed reasoning over ∼1b Linked Data triples [38], and have also presented some preliminary formalisations of what we call "authoritative reasoning", which considers the source of terminological data during reasoning [36]. The SAOR system is actively used for materialising inferences in the Semantic Web Search Engine (SWSE) [37] which offers search and browsing over Linked Data.⁵

In this paper, we look to reformalise how the SAOR engine incorporates the notions of provenance and trust which form an integral part of its tolerance to noise, where we see the use of annotated logic programs [42] as a natural fit. We thus derive a formal logical framework for annotated reasoning in our setting, which encodes three indicators of provenance and trust and which thus allows us to apply robust materialisation over large, static, Linked Data corpora.

² In our scenario with assertional data from the Web in the order of billions of statements, even quadratic complexity is prohibitively expensive.
¹ See http://richard.cyganiak.de/2007/10/lod/ for a comprehensive graph of such datasets and their inter-linkage; however (and as the Linked Open Numbers project (see [64, Figure 1]) has aptly evinced, and indeed as we will see ourselves in later evaluation) not all of this current Web of Data is entirely "compelling".
³ There is ongoing research devoted to paraconsistent logics which tackle this issue (e.g., see a recent proposal for OWL 2 [51]), but the efficiency of such approaches is still an open question.
⁴ For a high-level discussion on the general infeasibility of completeness for scenarios such as ours, see [31].
⁵ http://swse.deri.org/

Further, in previous work we noted that many inconsistencies arise on the Web as the result of incompatible naming of resources, or naïve publishing errors [35]: herein, we look at a secondary use-case for our annotations which leverages OWL 2 RL/RDF constraint rules to detect inconsistencies, where subsequently we perform a simple "repair" of the Web knowledge-base using our annotations, particularly ranks, to identify and defeat the "marginal view" present in the inconsistency.

More specifically, we:
– introduce some necessary preliminaries (Section 2);
– discuss our proposed annotation values to represent provenance and trust (blacklisting, authority, and ranking), giving concrete instantiations for each (Section 3);
– describe a formal framework for annotated programs which incorporate the above three dimensions of provenance and trust, including formal discussion of constraint rules (Section 4);
– describe our experimental setup and our 1 billion triple Linked Data corpus (Section 5);
– describe our distributed (i) implementation and evaluation of links-based ranking (Section 6), (ii) annotated reasoning for a subset of OWL 2 RL/RDF rules (Section 7), and (iii) our inconsistency detection and repair use-case (Section 8);
– discuss issues relating to scalability and expressiveness (Section 9), render related work in the field (Section 10), and conclude (Section 11).

2. Preliminaries

In this section, we provide some necessary preliminaries relating to (i) RDF (Section 2.1); (ii) Linked Data principles and data sources (Section 2.2); (iii) rules and atoms (Section 2.3); (iv) generalised annotated programs (Section 2.4); (v) terminological data given by RDFS/OWL (Section 2.5); and (vi) OWL 2 RL/RDF rules (Section 2.6). We attempt to preserve notation and terminology as prevalent in the literature.

2.1. RDF

We briefly give some necessary notation relating to RDF constants and RDF triples; cf. [30].

RDF Constant. Given the set of URI references 𝐔, the set of blank nodes 𝐁,⁶ and the set of literals 𝐋, the set of RDF constants is denoted by 𝐂 := 𝐔 ∪ 𝐁 ∪ 𝐋. We also define the set of variables 𝐕 which range over 𝐂.

Herein, we use CURIEs [5] to denote URIs. Following Turtle syntax [2], we use a as a convenient shortcut for rdf:type. We denote variables with a '?' prefix.

⁶ We interpret blank-nodes as skolem constants, as opposed to existential variables. Also, we rewrite blank-node labels to ensure uniqueness per document, as prescribed in [30].

RDF Triple. A triple t := (s, p, o) ∈ (𝐔 ∪ 𝐁) × 𝐔 × 𝐂 is called an RDF triple, where s is called subject, p predicate, and o object. A triple t := (s, p, o) ∈ 𝐆, 𝐆 := 𝐂 × 𝐂 × 𝐂 is called a generalised triple [25], which allows any RDF constant in any triple position: henceforth, we assume generalised triples unless explicitly stated otherwise. We call a finite set of triples G ⊂ 𝐆 a graph.

2.2. Linked Data principles, Data Sources and Quadruples

In order to cope with the unique challenges of handling diverse and unverified Web data, many of our components and algorithms require inclusion of a notion of provenance: consideration of the source of RDF data found on the Web. Tightly related to such notions are the best practices of Linked Data [3], which give clear guidelines for publishing RDF on the Web. We briefly discuss Linked Data principles and notions relating to provenance.⁷

⁷ Note that in a practical sense, all HTTP-level functions {get, redir, redirs, deref} are set at the time of the crawl, and are bounded by the knowledge of our crawl.

Linked Data Principles. The four best practices of Linked Data are as follows [3]:
– (LDP1) use URIs to name things;
– (LDP2) use HTTP URIs so that those names can be looked up;
– (LDP3) provide useful structured information when a look-up on a URI is made (loosely, called dereferencing);
– (LDP4) include links using external URIs.

Data Source. We define the http-download function get : 𝐔 → 2^𝐆 as the mapping from a URI to an RDF graph (set of facts) it may provide by means of a given HTTP lookup [21] which directly returns status code 200 OK and data in a suitable RDF format; this function also performs a rewriting of blank-node labels (based on the input URI) to ensure uniqueness when merging RDF graphs [30]. We define the set of data sources 𝐒 ⊂ 𝐔 as the set of URIs 𝐒 := {s ∈ 𝐔 | get(s) ≠ ∅}.

RDF Triple in Context/RDF Quadruple. An ordered pair (t, c) with a triple t = (s, p, o), c ∈ 𝐒 and t ∈ get(c) is called a triple in context c. We may also refer to (s, p, o, c) as an RDF quadruple or quad q with context c.

HTTP Redirects/Dereferencing. A URI may provide a HTTP redirect to another URI using a 30x response code [21]; we denote this function as redir : 𝐔 → 𝐔, which may map a URI to itself in the case of failure (e.g., where no redirect exists). Note that we do not need to distinguish between the different 30x redirection schemes, and that this function would implicitly involve, e.g., stripping the fragment identifier of a URI [4]. We denote the fixpoint of redir as redirs, denoting traversal of a number of redirects (a limit may be set on this traversal to avoid cycles and artificially long redirect paths). We define dereferencing as the function deref := get ∘ redirs which maps a URI to an RDF graph retrieved with status code 200 OK after following redirects, or which maps a URI to the empty set in the case of failure.

2.3. Atoms and Rules

In this section, we briefly introduce some notation as familiar from the field of Logic Programming [48], which acts as a generalisation of the aforementioned RDF notation.

Atom. Atoms are of the form p(e_1, ..., e_n) where e_1, ..., e_n are terms (like Datalog, function symbols are disallowed) and where p is a predicate of arity n; we denote the set of all such atoms by Atoms. This is a generalisation of RDF triples, for which we employ a ternary predicate T where our atoms are of the form T(s, p, o); for brevity, we commonly omit the ternary predicate and simply write (s, p, o). An RDF atom of this form is synonymous with a generalised triple pattern (where variables of the set 𝐕 are allowed in any position). A ground atom, or simply a fact, is one which does not contain variables (e.g., a generalised triple); we denote the set of all facts by Facts, a generalisation of 𝐆. A (Herbrand) interpretation I is a finite subset of Facts, a generalisation of a graph.

Letting A and B be two atoms, we say that A subsumes B, denoted A ▷ B, if there exists a substitution θ of variables such that Aθ = B (applying θ to the variables of A yields B); we may also say that B is an instance of A; if B is ground, we say that it is a ground instance. Similarly, if we have a substitution θ such that Aθ = Bθ, we say that θ is a unifier of A and B; we denote by mgu(A, B) the most general unifier of A and B which provides the "minimal" variable substitution (up to variable renaming) required to unify A and B.

Rule. A rule R is given as follows:

$$H \leftarrow B_1, \ldots, B_n \quad (n \geq 0)$$

where H, B_1, ..., B_n are atoms, H is called the head (conclusion/consequent) and B_1, ..., B_n the body (premise/antecedent). We use Head(R) to denote the head H of R and Body(R) to denote the body B_1, ..., B_n of R. Our rules are range restricted, or safe [60]: like Datalog, the variables appearing in the head of each rule must also appear in the body.

The set of all rules that can be defined over atoms using an (arbitrary but fixed) infinite supply of variables will be denoted by Rules. A rule with an empty body is considered a fact; a rule with a non-empty body is called a proper-rule. We call a finite set of such rules a program P.

Like before, a ground rule is one without variables. We denote with Ground(R) the set of ground instantiations of a rule R and with Ground(P) the ground instantiations of all rules occurring in a program P.

Immediate Consequence Operator. We give the (classical) immediate consequence operator C_P of a program P under interpretation I as:

$$C_P : 2^{\mathit{Facts}} \to 2^{\mathit{Facts}}, \qquad I \mapsto \{\mathrm{Head}(R)\theta \mid R \in P \text{ and } \exists I' \subseteq I \text{ s.t. } \theta = \mathrm{mgu}(\mathrm{Body}(R), I')\}$$

Intuitively, the immediate consequence operator maps from a set of facts I to the set of facts it directly entails with respect to the program P. Note that C_P(I) will retain the facts in P since facts are rules with empty bodies and thus unify with any interpretation, and note that for our purposes C_P is monotonic: the addition of facts and rules to a program can only lead to a superset of consequences.

Since our rules are a syntactic subset of Datalog, C_P has a least fixpoint, denoted lfp(C_P), whereby further application of C_P will not yield any changes, and which can be calculated in a bottom-up fashion, starting from the empty interpretation ∆ and applying C_P iteratively [66] (here, convention assumes that P contains the set of input facts as well as proper rules). Define the iterations of C_P as follows: C_P ↑ 0 := ∆; for all ordinals α, C_P ↑ (α + 1) := C_P(C_P ↑ α); since our rules are Datalog, there exists an α such that lfp(C_P) = C_P ↑ α for α < ω, where ω denotes the least infinite ordinal; i.e., the immediate consequence operator will reach a fixpoint in countable steps [61], thereby giving all ground consequences of the program. We call lfp(C_P) the least model, which is given the more succinct notation lm(P).

2.4. Generalised annotated programs

In generalised annotated programs [42] the set of truth values is generalised to an arbitrary upper semilattice T, that may represent, say, fuzzy values, inconsistencies, validity intervals (i.e., time), or a confidence value, to name but a few.

(Generalised) Annotated rules. Annotated rules are expressions like

$$H{:}\rho \leftarrow B_1{:}\mu_1, \ldots, B_n{:}\mu_n$$

where each µ_i can be either an element of T or a variable ranging over T; ρ can be a function f(µ_1, ..., µ_n) over T. Programs, Ground(R) and Ground(P) are defined analogously to non-annotated programs.

Example: Consider the following simple example of a (generalised) annotated rule where T corresponds to a set of confidence values in the interval [0, 1] of real numbers:

$$\mathit{Father}(?x){:}(0.5 \times \mu) \leftarrow \mathit{Parent}(?x){:}\mu \qquad (1)$$

This rule intuitively states that something is a Father with (at least) half of the confidence for which it is a Parent.⁸

⁸ This example can be trivially converted to RDF by instead considering the ternary atoms (?x, a, ex:Parent) and (?x, a, ex:Father).

Restricted interpretations. So-called restricted interpretations map each ground atom to a member of T. A restricted interpretation I satisfies A:µ (in symbols, I ⊨ A:µ) iff I(A) ≥_T µ, where ≥_T is T's ordering. Now I satisfies a rule like the above iff either I satisfies the head or I does not satisfy some of the annotated atoms in the body.

Example: Take the annotated rule from Equation 1, and let's say that we have a restricted interpretation I which satisfies Parent(sam):0.6. Now, for I to satisfy the given rule, it must also satisfy Father(sam):0.3 such that I(Father(sam)) ≥ 0.3.

Restricted immediate consequences. In the generalised annotation framework, the restricted immediate consequence operator of an annotated program P is defined as follows, where σ ranges over substitutions:

$$R_P(I)(A) := \mathrm{lub}\{\rho \mid (A{:}\rho \leftarrow B_1{:}\mu_1, \ldots, B_n{:}\mu_n)\sigma \in \mathrm{Ground}(P) \text{ and } I \models (B_i{:}\mu_i)\sigma \text{ for } 1 \leq i \leq n\}$$

Example: Take a program P which comprises the annotated rule shown in Equation 1 and the annotated fact Parent(sam):0.6. Then, R_P(I)(Father(sam)) = 0.3. Instead, let's say that P also contains Father(sam):0.5. Then R_P(I)(Father(sam)) = lub{0.5, 0.3} = 0.5.

Importantly, various properties of R_P have been formally demonstrated for generalised annotated programs in [42], where we will reuse these results later for our own (more specialised) annotation framework in Section 4; for example, R_P has been shown to be monotonic, but not always continuous [42].

2.5. Terminological data: RDFS/OWL

As previously described, RDFS/OWL allow for providing terminological data which constitute definitions of classes and properties used in the data. A detailed discussion of RDFS/OWL is out of scope, but the distinction of terminological and assertional data, which we now describe, is important for our purposes. First, we require some preliminaries.⁹

⁹ As we are dealing with Web data, we refer to the OWL 2 Full language and the OWL 2 RDF-based semantics [18] unless explicitly stated otherwise. Note that a clean and intuitive definition of terminological data is somewhat difficult for RDFS and particularly OWL Full. We instead rely on a 'shibboleth' approach which identifies markers for what we consider to be RDFS/OWL terminological data.

Meta-class. We consider a meta-class as a class specifically of classes or properties; i.e., the members of a meta-class are themselves either classes or properties. Herein, we restrict our notion of meta-classes to the set defined in RDF(S) and OWL specifications, where examples include rdf:Property, rdfs:Class, owl:Class, owl:Restriction, owl:DatatypeProperty, owl:FunctionalProperty, etc. Note that rdfs:Resource, rdfs:Literal, e.g., are not meta-classes, since their members need not be classes or properties.

Meta-property. A meta-property is one which has a meta-class as its domain. Again, we restrict our notion of meta-properties to the set defined in RDF(S) and OWL specifications, where examples include rdfs:domain, rdfs:subClassOf, owl:hasKey, owl:inverseOf, owl:oneOf, owl:onProperty, owl:unionOf, etc. Note that rdf:type, owl:sameAs, rdfs:label, e.g., do not have a meta-class as domain, and are not considered meta-properties.

Terminological data. We define the set of terminological triples as the union of the following sets of triples:
(i) triples with rdf:type as predicate and a meta-class as object;
(ii) triples with a meta-property as predicate;
(iii) triples forming a valid RDF list whose head is the object of a meta-property (e.g., a list used for owl:unionOf, owl:intersectionOf, etc.);
(iv) triples which contribute to an all-disjoint-classes or all-disjoint-properties axiom.¹⁰

¹⁰ That is, triples with rdf:type as predicate and owl:AllDisjointClasses or owl:AllDisjointProperties as object, triples whose predicate is owl:members and whose subject unifies with the previous category of triples, and triples forming a valid RDF list whose head unifies with the object of such an owl:members triple.

Note that the last category of terminological data is only required for special consistency-checking rules called constraints: i.e., rules which check for logical contradictions in the data. For brevity, we leave this last category of terminological data implicit in the rest of the paper, where owl:AllDisjointClasses and owl:AllDisjointProperties can be thought of as "honorary meta-classes" included in category (i), owl:members can be thought of as an "honorary meta-property" included in category (ii), and the respective RDF lists included in category (iii).

Finally, we do not consider triples involving "user-defined" meta-classes or meta-properties as contributing to the terminology, where in the following example, the first triple is considered terminological, but the second triple is not.¹¹

(ex:inSubFamily, rdfs:subPropertyOf, rdfs:subClassOf)
(ex:Bos, ex:inSubFamily, ex:Bovinae)

¹¹ In particular, we require a data-independent method for distinguishing terminological data from purely assertional data, such that we only allow those meta-classes/-properties which are known a-priori.

T-split rule. A T-split rule R is given as follows:

$$H \leftarrow A_1, \ldots, A_n, T_1, \ldots, T_m \quad (n, m \geq 0) \qquad (2)$$

where the T_i, 1 ≤ i ≤ m, atoms in the body (T-atoms) are all those that can only have terminological ground instances, whereas the A_i, 1 ≤ i ≤ n, atoms (A-atoms) can have arbitrary ground instances. We use TBody(R) and ABody(R) to respectively denote the set of T-atoms and the set of A-atoms in the body of R. Herein, we presume that the T-atoms and A-atoms of our rules can be distinguished and referenced as defined above.

Example: Let R_EX denote the following rule

$$(?x, \texttt{a}, ?c_2) \leftarrow \underline{(?c_1, \texttt{rdfs:subClassOf}, ?c_2)}, (?x, \texttt{a}, ?c_1)$$

When writing T-split rules, we denote TBody(R_EX) by underlining: the underlined T-atom can only be bound by a triple with the meta-property rdfs:subClassOf as RDF predicate, and thus can only be bound by a terminological triple. The second atom in the body can be bound by assertional or terminological triples, and so is considered an A-atom.

T-ground rule. A T-ground rule is a set of rule instances for the T-split rule R given by grounding TBody(R). We denote the set of such rules for a program P and a set of facts I as Ground^T(P, I), defined as:

$$\mathrm{Ground}^T(P, I) := \{\mathrm{Head}(R)\theta \leftarrow \mathrm{ABody}(R)\theta \mid R \in P \text{ and } \exists I' \subseteq I \text{ s.t. } \theta = \mathrm{mgu}(\mathrm{TBody}(R), I')\}$$

The result is a set of rules whose T-atoms are grounded by the terminological data in I.

Example: Consider the T-split rule R_EX from the previous example. Now let I_EX := {(foaf:Person, rdfs:subClassOf, foaf:Agent), (foaf:Agent, rdfs:subClassOf, dc:Agent)}. Here, Ground^T({R_EX}, I_EX) = {(?x, a, foaf:Agent) ← (?x, a, foaf:Person); (?x, a, dc:Agent) ← (?x, a, foaf:Agent)}.

T-split program and least fixpoint. Herein, we give an overview of the computation of the T-split least fixpoint for a program P, which is broken up into two parts: (i) the terminological least fixpoint, and (ii) the assertional least fixpoint. Let P^F := {R ∈ P | Body(R) = ∅} be the set of facts in P,¹² let P^{T∅} := {R ∈ P | TBody(R) ≠ ∅, ABody(R) = ∅}, let P^{∅A} := {R ∈ P | TBody(R) = ∅, ABody(R) ≠ ∅}, and let P^{TA} := {R ∈ P | TBody(R) ≠ ∅, ABody(R) ≠ ∅}. Clearly, P = P^F ∪ P^{T∅} ∪ P^{∅A} ∪ P^{TA}. Now, let TP := P^F ∪ P^{T∅} denote the initial (terminological) program containing ground facts and T-atom only rules, and let lm(TP) denote the least model for the terminological program. Now, let AP := lm(TP) ∪ P^{∅A} ∪ Ground^T(P^{TA}, lm(TP)) denote the second (assertional) program containing all available facts and rules with empty or grounded T-atoms. Now, we can give the least model of the T-split program P as lm(AP) for AP derived from P as above; we more generally denote this by lm^T(P).

¹² Of course, P^F can refer to axiomatic facts and/or the initial facts given by an input knowledge-base.

In [38], we showed that the T-split least fixpoint is sound wrt. the standard variant (given that our rules are monotonic) and that the T-split least fixpoint is complete with respect to the standard fixpoint if rules requiring assertional knowledge do not infer unique terminological knowledge required by the T-split program (i.e., the assertional program AP does not generate new terminological facts not available to the initial program TP).

2.6. OWL 2 RL/RDF rules

OWL 2 RL/RDF [25] rules are a partial axiomatisation of the OWL 2 RDF-Based Semantics which is applicable for arbitrary RDF graphs, and thus is compatible with RDF Semantics [30]. The atoms of these rules comprise primarily ternary predicates encoding generalised RDF triples; some rules have a special head denoted false which indicates that an instance of the body is inconsistent. All such rules can be considered T-split where we use the aforementioned criteria for characterising terminological data and subsequently T-atoms.

As we will further discuss in Section 7, full materialisation wrt. the entire set of OWL 2 RL/RDF is infeasible in our use-case; in particular, given a large A-Box of arbitrary content, we wish to select a subset of the OWL 2 RL/RDF profile which is linear with respect to that A-Box; thus, we select a subset O2R⁻ of OWL 2 RL/RDF rules where |ABody(R)| ≤ 1 for all R ∈ O2R⁻ [36]; we provide the full ruleset in Appendix A. Besides ensuring that the growth of assertional inferences is linear wrt. the A-Box, and as we will see in later sections, our linear profile allows for near-trivial distribution of reasoning, as well as various other optimisation techniques (see [38,32]).

With respect to OWL 2 RL/RDF, in [32] we showed that the T-split least fixpoint is complete assuming (i) no non-standard usage, whereby rdf:type, rdf:first, rdf:rest and the RDFS/OWL meta-properties do not appear other than as a predicate in the data, and RDFS/OWL meta-classes do not appear in a position other than as the value for rdf:type; and (ii) that owl:sameAs does not affect constants in the terminology.¹³

¹³ Note that the OWL 2 RL/RDF eq-rep-* rules can cause incompleteness by condition (ii), but herein, we do not support these rules. Further, note that in other profiles (such as RDFS and pD*) axiomatic triples may be considered non-standard; however, none of the OWL 2 RL/RDF axiomatic triples (see Table A.1) are non-standard.

3. Annotation values

Naïvely conducting materialisation wrt. non-arbitrary rules over arbitrary, non-verified data merged from millions of sources crawled from the Web broaches many obvious dangers. In this section, we discuss the annotation values we have chosen to represent the various dimensions of provenance and trust we use for reasoning over Linked Data, which have been informed by our past experiences in reasoning over arbitrary Linked Data.

3.1. Blacklisting

Despite our efforts to create algorithms which automatically detect and mitigate noise in the input data, it may often be desirable to blacklist input data based on some criteria: for example, data from a certain domain may be considered likely to be spam, or certain triple patterns may constitute common publishing errors which hinder the reasoning process. We currently do not require the blacklisting function, and thus consider all triples to be not blacklisted. However, such an annotation has obvious uses for bypassing noise which cannot otherwise be automatically detected.

One particular use-case we have in mind for including the blacklisting annotation relates to the publication of void values for inverse-functional properties: the Friend Of A Friend (FOAF)¹⁴ vocabulary offers classes and properties for describing information about people, organisations, documents, and so forth: it is currently one of the most widely instantiated vocabularies in Linked Data [37, Appendix A]. FOAF includes a number of inverse-functional properties which allow for identifying (esp.) people in the absence of agreement upon URIs, e.g.: foaf:homepage, foaf:mbox, foaf:mboxsha1sum. However, FOAF exporters on the Web commonly do not respect the inverse-functional semantics of these properties; one particularly pathogenic case we encountered was exporters producing empty strings or values such as 'mailto:' for foaf:mbox when users omitted specifying their email. Similarly, we encountered many corresponding erroneous values for the foaf:mboxsha1sum property (representing a SHA1 encoded email value) referring to the SHA1 hashes of 'mailto:' and the empty string [34]. These values, caused by naïve publishing errors, lead to quadratic spurious inferences equating all users who omitted an email to each other. Although for reasons of efficiency we currently do not support the relevant OWL 2 RL/RDF rules which support the semantics of inverse-functional properties, the blacklisting annotation could be used to negate the effects of such pathogenic values; essentially, it serves as a pragmatic last resort.

¹⁴ http://xmlns.com/foaf/0.1/

3.2. Authoritative analysis

In our initial works on SAOR [36], a pragmatic reasoner for Linked Data, we encountered a puzzling deluge of inferences which we did not initially expect. We found that remote documents sometimes cross-define terms resident in popular vocabularies, changing the inferences authoritatively mandated for those terms. For example, we found one document¹⁵ which defines owl:Thing to be a member of 55 union classes; thus, materialisation wrt. OWL 2 RL/RDF rule cls-uni [25, Table 6] over any member of owl:Thing would infer 55 additional memberships for these obsolete and obscure union classes. We found another document¹⁶ which defines nine properties as the domain of rdf:type; again, anything defined to be a member of any class would be inferred to be a member of these nine properties. Even aside from "cross-defining" core terms, popular vocabularies such as FOAF were also affected [36]; such practices lead to the materialisation of an impractical bulk of arguably irrelevant data (which would subsequently burden the consumer application).

¹⁵ http://lsdis.cs.uga.edu/~oldham/ontology/wsag/wsag.owl
¹⁶ http://www.eiao.net/rdf/1.0

To counteract remote contributions about the semantics of terms, we introduced a more conservative form of reasoning called authoritative reasoning [36] which critically examines the source of terminological knowledge. We now re-introduce the concept of authoritative reasoning from [36], herein providing more detailed formalisms and updated discussion.

Our authoritative reasoning methods are based on the intuition that a publisher instantiating a vocabulary's term (class/property) thereby accepts the inferencing mandated by that vocabulary and recursively referenced vocabularies for that term. Thus, once a publisher instantiates a class or property from a vocabulary, only that vocabulary and its references should influence what inferences are possible through that instantiation.

Firstly, we must define the relationship between a class/property term and a vocabulary, and give the notion of term-level authority. We view a term as an RDF constant, and a vocabulary as a Web document. From Section 2.2, we recall the get mapping from a URI (a Web location) to an RDF graph it may provide by means of a given HTTP lookup and the redirs mapping for following the HTTP redirects of a URI. Also, let bnodes(G) denote the set of blank nodes appearing in the RDF graph G. Now, we denote a mapping from a source URI to the set of terms it speaks authoritatively for as follows:¹⁷

$$\mathrm{auth} : \mathbf{S} \to 2^{\mathbf{C}}, \qquad s \mapsto \{c \in \mathbf{U} \mid \mathrm{redirs}(c) = s\} \cup \mathrm{bnodes}(\mathrm{get}(s))$$

where a Web source is authoritative for URIs which dereference to it and the blank nodes it contains; e.g., the FOAF vocabulary is authoritative for terms in its namespace since it follows best-practices and makes its class/property URIs dereference to an RDF/XML document defining the terms. Note that no document is authoritative for literals.

¹⁷ Even pre-dating Linked Data, dereferenceable vocabulary terms were encouraged; cf. http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/

To negate the effects of non-authoritative terminological axioms on reasoning over Web data (as exemplified above), we add an extra condition to the T-grounding of a rule: in particular, we only require amendment to rules where both TBody(R) ≠ ∅ and ABody(R) ≠ ∅.

Authoritative T-ground rule. Let vars^{TA}(R) ⊂ 𝐕 denote the set of variables appearing in both TBody(R) and ABody(R). Now, we define the set of authoritative rule instances for a program P, RDF graph G, and source s as:¹⁸

$$\widehat{\mathrm{Ground}}^T(P, G, s) := \{\mathrm{Head}(R)\theta \leftarrow \mathrm{ABody}(R)\theta \mid R \in P \text{ and } \exists G' \subseteq G \text{ s.t. } \theta = \mathrm{mgu}(\mathrm{TBody}(R), G') \text{ and if } \mathrm{TBody}(R) \neq \emptyset \wedge \mathrm{ABody}(R) \neq \emptyset \text{ then } \exists v \in \mathrm{vars}^{TA}(R) \text{ s.t. } \theta(v) \in \mathrm{auth}(s)\}$$

where authoritative rule instances are synonymous with authoritatively T-ground rules and where the notion of authoritative rule instances for a program follows naturally. The additional condition for authoritativeness states that if a rule contains both T-atoms and A-atoms in the body (ABody(R) ≠ ∅ ∧ TBody(R) ≠ ∅), then the unifier must substitute at least one variable appearing in both ABody(R) and TBody(R) (a variable from the set vars^{TA}(R)) for an authoritative term from source s (a constant from the set auth(s)) for the resulting T-ground rule to be authoritative. This implies that the source s must speak authoritatively for at least one term that will appear in the body of each proper T-ground rule which its terminology generates, and so cannot create new assertional rules which could apply over arbitrary assertional data not mentioning any of its terms. We illustrate this with an example.

¹⁸ Here we favour RDF graph notation as authority applies only in the context of Linked Data (but could be trivially generalised through the auth function).

Example: Take the T-split rule R_EX as before where vars^{TA}(R_EX) = {?c1} representing the set of variables in both TBody(R_EX) and ABody(R_EX). Let I_EX be the graph from source s, where now for each substitution θ, there must exist v ∈ vars^{TA}(R_EX) such that s speaks authoritatively for θ(v). In this case,
– s must speak authoritatively for the URI foaf:Person, for which ?c1 is substituted, for the T-ground rule (?x, a, foaf:Agent) ← (?x, a, foaf:Person) to be authoritative,
– analogously, s must speak authoritatively for the URI foaf:Agent, again for which ?c1 is substituted, for the T-ground rule (?x, a, dc:Agent) ← (?x, a, foaf:Agent) to be authoritative.

… (?x, owl:hasValue, ?y), (?x, owl:onProperty, ?p), we would expect ?x to be ground by a blank-node skolem and thus expect the instance to come from one graph.)

3.3. Links-based ranking

There is a long history of links-based analysis over Web data, and in particular over hypertext documents, where links are seen as a positive vote for the relevance or importance of a given document. Seminal works exploiting the link structure of the Web for ranking documents include HITS [44] and PageRank [8]. Various approaches (e.g., [1,16,33,23,15,29]) look at incorporating links-based analysis techniques for ranking RDF data, with various end-goals in mind, most commonly, prioritisation of informational artefacts in user result-views; however, such analyses have been applied to other use-cases, including work by Guéret et al. [28] which uses betweenness centrality measures to identify potentially weak points in the Web of Data in terms of maintaining connectedness integrity.

Herein, we employ links-based analysis with the underlying premise that higher ranked sources contribute more "trustworthy" data: in our case, we would expect a correlation between the (Eigenvector) centrality of a source in the graph, and the quality of data that it provides. Inspired in particular by the work of Harth et al. [29] on applying PageRank to RDF, we implement a two-step process: (i) we create the graph of links between sources, and apply a standard PageRank calcu-
lation over said graph to derive source ranks; (ii) we Inotherwords,thesourcesservingtheT-factsinI EX propagatesourcerankstothetriplestheycontainusing must be the FOAF vocabulary for the above rules to authoritative. (cid:51) asimplesummationaggregation.Wenowdiscussthese twostepsinmoredetail. For reference, we highlight variables in varsTA(R) withboldfaceinTableA.4. 3.3.1. Creatingandrankingthesourcegraph (ItisworthnotingthatforruleswhereABody(R)and Creating the graph of interlinking Linked Data TBody(R) are both non-empty, authoritative instantia- sources is non-trivial, in that the notion of a hyperlink tionoftherulewillonlyconsiderunifiersforTBody(R) does not directly exist. Thus, we must extract a graph which come from one source: however, in practice for sympathetictoLinkedDataprinciplesandcurrentpub- OWL 2 RL/RDF this is not so restrictive: although lishingpatterns. TBody(R) may contain multiple atoms, in such rules Recalling the Linked Data principles enumerated in TBody(R)usuallyreferstoanatomicaxiomwhichre- Section 2.2, according to LDP4, links should be spec- quires multiple triples to represent—indeed, the OWL ified simply by using external URI names in the data. 
2 Structural Specification19 enforces usage of blank- These URI names should dereference to an RDF de- nodes and cardinalities on such constructs to ensure scriptionofthemselvesaccordingtoLDP2andLDP3re- thattheconstituenttriplesofthemulti-tripleaxiomap- spectively.LetD :=(V,E)representasimpledirected pearinonesource.Totakeanexample,fortheT-atoms graph where V ⊂ S is a set of sources (vertices), and E ⊂S×Sisasetofpairsofvertices(edges).Letting 19http://www.w3.org/TR/2009/ si,sj ∈V betwovertices,then(si,sj)∈Eiffsi (cid:54)=sj REC-owl2-syntax-20091027/ and there exists some u ∈ U such that redirs(u) = sj 9 and u appears in some triple t ∈ get(s ): i.e., an edge statingatripleshouldincreasetherankofthetriple.20 i extendsfroms tos ifftheRDFgraphreturnedbys Thus, for calculating the rank of a triple t, we use the i j i mentionsaURIwhichredirectstos . summation of the ranks of the sources it appears in as j Now, let E(s) denote the set of direct successors of follows: s (outlinks), let E denote the set of vertices with no ∅ (cid:88) outlinks(danglingnodes),andletE−(s)denotetheset r(t):= r(s ) t of direct predecessors of s (inlinks). The PageRank of st∈{s∈S|t∈get(s)} a vertex s in the directed graph D := (V,E) is then i givenasfollows[8]: 4. Thelogicalframework r(s ):= 1−d +d (cid:88) r(s∅) +d (cid:88) r(sj) In this section, we look at incorporating the above i |V| |V| |E(s )| three dimensions of trust and provenance—which we j s∅∈E∅ sj∈E−(si) will herein refer to as annotation properties—into an annotated logic programming framework which tracks where d is a damping constant (usually set to 0.85) this information during reasoning, and determines the which helps ensure convergence in the following it- annotations of inferences based on the annotations of erative calculation, and where the middle component the rule and the relevant instances, where the resultant splitstheranksofdanglingnodesevenlyacrossallother valuesoftheannotationpropertiescanbeviewedasde- nodes. 
Note also that the first and second components notingthestrengthofaderivation.Weproposeandfor- are independent of i, and constitute the minimum pos- maliseageneralannotationframework,introducesome sible rank of all nodes (ensures that ranks do not need high-level tasks it enables, and discuss issues relating tobenormalisedduringiterativecalculation). toscalabilityinourscenario. Nowletw := 1|V−d| representtheweightofauniversal We begin in Section 4.1 by formalising annotation (weaklink)givenbyallnon-danglingnodestoallother functions which abstract the mechanisms used for an- nodes—danglingnodessplittheirvoteevenlyandthus notating facts and rules. In Section 4.2, we propose an don’trequireaweaklink;wecanuseaweightedadja- annotatedprogramframeworkandassociatedsemantics cency matrix M as follows to encode the graph D := based on the previous work of Kifer et al. [42] (intro- (V,E): ducedpreviouslyinSection2.4).InSection4.3,wein- troducesomehigh-levelreasoningtasksthatthisframe- d work enables; in Section 4.4 we look at how each task |E(sj)| +w, if(sj,si)∈E scales in the general case, and in Section 4.5 we focus m := 1 on the scalability of the task required for our use-case i,j , ifs ∈E w|V,| othejrwise∅ owveerbroiuerflysedleisccteudssansonmoteataioltnerpnraotpiveertiseesm.aInntSicescttihoant4o.n6e, mightconsiderforourframework.WewrapupinSec- where this stochastic matrix can be thought of as a tion4.7byintroducingannotatedconstraintruleswhich Markov chain (dubbed the random-surfer model). The allowfordetectingandlabellinginconsistenciesinour ranks of all sources can be expressed algebraically as use-case. theprincipaleigenvectorofM,whichinturncanbees- timatedusingthepoweriterationmethodupuntilsome 4.1. Abstractingannotationfunctions terminationcriteria(fixednumberofiterations,conver- gencemeasures,etc.)isreached.Werefertheinterested Thefirststeptowardsaformalsemanticsofannotated readerto[8]formoredetail. 
logicprogramsconsistsingeneralisingtheannotations (suchasblacklisting,authoritativeness,andranking)re- calledintheprevioussections.Theannotationdomains 3.3.2. Calculatingtripleranks Based on the rank values for the data sources, we nowcalculatetheranksforindividualtriples.Weusea 20Note that one could imagine a spamming scheme where a large numberofspuriouslow-rankeddocumentsrepeatedlymakethesame simplemodelforrankingtriples,basedontheintuition assertionstocreateasetofhighly-rankedtriples.Infuture,wemay that triples appearing in highly ranked sources should revise this algorithm to take into account some limiting function benefit from that rank, and that each additional source derivedfromPLD-levelanalysis. 10
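The two ranking steps of Section 3.3 (source ranks by power iteration with dangling-node redistribution, then triple ranks by summation) can be illustrated with a minimal single-machine sketch. It is not the distributed implementation discussed in the paper: the function names `source_ranks` and `triple_ranks`, the in-memory dictionaries, the fixed iteration count, and the toy sources `a`, `b`, `c` are all assumptions made for illustration.

```python
from collections import defaultdict


def source_ranks(vertices, edges, d=0.85, iterations=50):
    """Estimate r(s_i) by power iteration (illustrative sketch only).

    vertices: iterable of source identifiers (the set V)
    edges: iterable of (s_i, s_j) pairs meaning s_i links to s_j (the set E)
    """
    vertices = list(vertices)
    n = len(vertices)
    out = {v: set() for v in vertices}
    for src, dst in edges:
        # Self-links are excluded from E; unknown vertices are ignored.
        if src != dst and src in out and dst in out:
            out[src].add(dst)

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank mass held by dangling nodes (no outlinks), split evenly.
        dangling = sum(rank[v] for v in vertices if not out[v])
        # First and second components of the formula: independent of i.
        base = (1.0 - d) / n + d * dangling / n
        nxt = {v: base for v in vertices}
        # Third component: each source shares its rank over its outlinks.
        for v in vertices:
            if out[v]:
                share = d * rank[v] / len(out[v])
                for target in out[v]:
                    nxt[target] += share
        rank = nxt
    return rank


def triple_ranks(source_triples, rank):
    """r(t): sum of the ranks of every source whose graph contains t."""
    result = defaultdict(float)
    for source, triples in source_triples.items():
        for t in set(triples):  # a source counts once per triple
            result[t] += rank[source]
    return dict(result)


# Toy data (hypothetical): a and c both link to b; b is a dangling node.
demo_ranks = source_ranks(["a", "b", "c"], [("a", "b"), ("c", "b")])
t1 = ("ex:x", "rdf:type", "foaf:Person")
t2 = ("ex:y", "rdf:type", "foaf:Agent")
demo_triples = triple_ranks({"a": [t1], "b": [t1, t2]}, demo_ranks)
```

Because the dangling mass is redistributed in every iteration, the ranks keep summing to one without a separate normalisation step, which mirrors the remark made about the first two components of the formula.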
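The auth mapping and the extra authoritativeness condition of Section 3.2 can likewise be sketched in a few lines. This is a simplified illustration under stated assumptions: terms are plain strings, blank nodes are recognised by a `_:` prefix, `redirs` and `graphs` are in-memory dictionaries standing in for HTTP lookups, and the source names `foaf-doc` and `remote-doc` are hypothetical.

```python
def bnodes(graph):
    """Blank nodes of a graph; assumes blank nodes are written as '_:label'."""
    return {term for triple in graph for term in triple
            if term.startswith("_:")}


def auth(source, redirs, graphs):
    """auth(s): URIs that redirect to s, plus the blank nodes s contains."""
    uris = {uri for uri, target in redirs.items() if target == source}
    return uris | bnodes(graphs.get(source, set()))


def is_authoritative(theta, vars_ta, t_body, a_body, auth_terms):
    """Check the extra condition for an authoritative T-ground rule.

    The condition only applies when the rule has both T-atoms and A-atoms;
    then some variable shared by both bodies must be bound to a term the
    source speaks authoritatively for.
    """
    if not t_body or not a_body:
        return True
    return any(theta.get(v) in auth_terms for v in vars_ta)


# Hypothetical lookup tables: both FOAF terms dereference to 'foaf-doc'.
redirs = {"foaf:Person": "foaf-doc", "foaf:Agent": "foaf-doc"}
axiom = ("foaf:Person", "rdfs:subClassOf", "foaf:Agent")
graphs = {
    "foaf-doc": {axiom},
    "remote-doc": {axiom, ("_:b1", "owl:onProperty", "ex:p")},
}

# T-split subclass rule: ?c1 occurs in both the T-body and the A-body.
t_body = [("?c1", "rdfs:subClassOf", "?c2")]
a_body = [("?x", "rdf:type", "?c1")]
theta = {"?c1": "foaf:Person", "?c2": "foaf:Agent"}

from_foaf = is_authoritative(theta, {"?c1"}, t_body, a_body,
                             auth("foaf-doc", redirs, graphs))
from_remote = is_authoritative(theta, {"?c1"}, t_body, a_body,
                               auth("remote-doc", redirs, graphs))
```

In this toy setting the axiom is accepted when served by the document the FOAF terms dereference to, and rejected when the same triple is served by the remote document, which is only authoritative for its own blank node.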