ebook img

A Flexible Framework for Defining, Representing and Detecting Changes on the Data Web PDF

0.54 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview A Flexible Framework for Defining, Representing and Detecting Changes on the Data Web

A Flexible Framework for Defining, Representing and Detecting Changes on the Data Web Yannis Roussakis Ioannis Chrysakis Kostas Stefanidis FORTH-ICS FORTH-ICS FORTH-ICS [email protected] [email protected] [email protected] Giorgos Flouris Yannis Stavrakas FORTH-ICS ATHENA-IMIS [email protected] [email protected] innovation.gr 5 1 0 ABSTRACT objectviapredicate. 2 Dynamicityisanindispensablepartofthecurrentweb;datasets ThedynamicnatureofWebdatagivesrisetoamultitudeofprob- n lemsrelatedtotheidentification,computationandmanagementof are constantly evolving for several reasons, such as the inclusion a the evolving versions and the related changes. In this paper, we of new experimental evidence or observations, or the correction J of erroneous conceptualizations [21]. As an example, consider considertheproblemofchangerecognitioninRDFdatasets, i.e., 2 the problem of identifying, and when possible give semantics to, the detected number of changes between the versions 12.07 and 1 the changes that led from one version of an RDF dataset to an- 13.05,and13.05and13.07ofAtlas,whichare879.5Mand801.2M other. DespiteourRDFfocus,ourapproachissufficientlygeneral triples, respectively, while between the versions 3.7 and 3.8, and ] 3.8 and 3.9 of the English DBpedia are 20.7M and 9.3M triples toengulfdifferentdatamodelsthatcanbeencodedinRDF,such B (Figure7). asrelationalormulti-dimensional. Infact,weproposeaflexible, D Thisconstantevolutionposesseveralresearchproblems,which extendible and data-model-independent methodology of defining arerelatedtotheidentification,computation,storageandmanage- . changes that can capture the peculiarities and needs of different s ment of the evolving versions. An important question is how to c datamodelsandapplications, whilebeingformallyrobustdueto support complex changes, whose constituent changes are seem- [ thesatisfactionofthepropertiesofcompletenessandunambiguity. ingly unrelated and may occur on disparate pieces of data, but Further,weproposeanontologyofchangesforstoringthedetected 1 together as a whole they have a semantically coherent meaning changesthatallowsautomatedprocessingandanalysisofchanges, v for an application domain. Clearly, understanding and detecting cross-snapshotqueries(spanningacrossdifferentversions),aswell 2 thechangesofevolvingdatasetsisaprerequisitetoaddressthese asqueriesinvolvingbothchangesanddata. Todetectchangesand 5 problems. In particular, finding the differences (deltas) between populatesaidontology,weproposeacustomizabledetectionalgo- 6 datasets has been proved to play a crucial role in various cura- rithm,whichisapplicabletodifferentdatamodelsandapplications 2 tiontasks,suchasthesynchronizationofautonomouslydeveloped requiring the detection of custom, user-defined changes. Finally, 0 datasetsversions[3],orthevisualizationoftheevolutionhistoryof we provide a proof-of-concept application and evaluation of our 1. frameworkfordifferentdatamodels. adataset[12].Deltasarealsonecessaryincertainapplicationsthat 0 requireaccesstopreviousversionsofadatasettosupporthistori- 5 calorcross-snapshotqueries(e.g.,[19]),inorder,forexample,to 1 1. INTRODUCTION identifypaststatesofthedataset,understandtheevolutionprocess, : ordetectthesourceoferrorsinthecurrentmodelling. v With the growing complexity of the WWW, we face a com- Ingeneral,therearetwowaystorecordtheoccurringchanges. i pletely different way for creating, disseminating and consuming X The first is by constantly monitoring the dataset and logging any big volumes of information. Large-scale corporate, government, change, whichrequiresaclosedandcontrolledsystem, whereall r or even user-generated data are published and become available a changes pass through a dedicated application that records every toawidespectrumofusers. DBpedia, Freebase, YAGOandAt- changeasithappens. Amoreflexiblemethodincludesdetecting las are, among many others, examples of large data repositories, a posteriori the changes that happened. This avoids the need to whichstoreinformationaboutvariousentities,includingtheirre- haveacontrolledsystem, andallowsremoteusersofadatasetto lationships. Typically,datainsuchdatasetsarerepresentedusing identifychanges,eveniftheyhavenoaccesstotheactualchange theRDFmodel[11],inwhichinformationisstoredintriplesofthe processorknowledgethatithappened.Here,wefocusonthelatter form(subject,predicate,object),meaningthatsubjectisrelatedto case,whichismorechallengingandinteresting,eventhoughmost oftheproposedsolutions(namely,thedefinitionofalanguageof changes and the representation of the changes) are applicable in bothscenarios. Recognitionofchangesisbasedon3mainpillars,namelydefin- Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare ing,representinganddetectingchanges. Thefirstpillar(defining notmadeordistributedforprofitorcommercialadvantageandthatcopies changes)isrelatedtothedefinitionofalanguageofchanges,which bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to isbasedonthesetofchangesthatareunderstandablebythesystem. republish,topostonserversortoredistributetolists,requirespriorspecific Definingthelanguageofchangesincludesidentifyingthetypesof permissionand/orafee. changesthatwillbedetected,theirformaldefinition(e.g.,theirse- Copyright20XXACMX-XXXXX-XX-X/XX/XX...$15.00. manticsandparameters),andtakesintoaccounttheneedforcom- orcomplexchangesused. Thecustomizationfortheactual plex,domain-specificchanges.Ithasbeenarguedthatthelanguage languageofchangesemployedisbasedonthedefinitionof ofchangesasawhole,shouldbewell-behavedinthesenseofsatis- appropriateSPARQLqueries(oneperchange)thatcomeas fyingcertainproperties,namelybeingcompleteandunambiguous, parameterstothealgorithm, therebyallowingnewchanges so as to allow the generation of unique deltas in a deterministic tobeeasilydefined.DetailsaregiveninSection5. manner[14].ThedefinitionofchangesisgiveninSection3. • An additional contribution is the application of this work The second pillar (representing changes) is related to the rep- fordatamodelsotherthanthepureRDFmodel,namelythe resentationschemeusedforstoringthedetectedchanges. Thisis multi-dimensional model that has been transformed to the necessary to allow a persistent representation and storage of the RDF Data Cube vocabulary, while it could be also applied detectedchanges,aswellastosupportnavigationamongversions, similarlytotherelationalmodel. Thisapplicationismeant analysis of the deltas, cross-snapshot or historic queries, and the as a proof-of-concept that the proposed approach is indeed raising of changes as first class citizens in a multi-version repos- capableofsupportingdiverseneeds,andisnotexhaustivein itory. Our approach is based on the definition of an ontology of termsofthetypesofapplicationsormodelsthatourframe- changes that satisfies these goals. The representation scheme of worksupports. theproposedchangesisgiveninSection4. The third pillar (detecting changes) is related to the algorith- • Finally,weprovideexperimentsshowingthatthemostcru- mic definition of the detection process. This process identifies cialparameterinthechangerecognitionprocessisthetotal the changes applied between any two versions, based on the ac- number of detected changes rather than the size of the ex- tualchangedefinitionsinthelanguageofchanges, andstoresthe amined datasets. We prove also that the number of simple results in the ontology of changes. Our approach for change de- changesisproportionaltothenumberoftripleswhichwill tectionisbasedontheexecutionofappropriatelydefinedSPARQL be inserted into the ontology. Except from the evaluation queries,whoseanswersdeterminethedetectedchanges.Detailson aspect of our framework, our experiments offer a detailed thedetectionprocessaregiveninSection5. lookontheevolutionofreal-world,bigdatasets,aswerecord Amajorchallengerelatedtochangedetectionanddeltasisthat changesandprovideananalysisoftheirform. nolanguageofchangesissuitableforalldifferentapplicationsand data models. There are two reasons for that. First, the different Theapproachproposedinthispaperextendsourpreviouswork typesofdataavailableontheWebareoftenrepresentedusingdif- onchangedetection[14],byprovidingamoregenericchangedef- ferentmodels,whichcallfordifferentchanges. Second,different initionframework;here,weexperiencesignificantlyimprovedper- uses (or users) of the data may require a different set of changes formance(about1orderofmagnitude), scalability, aswellasin- being reported, or the definition of special types of changes that creasedgeneralityandapplicability. happenoftenorareimportantfortheoperationofthedata-driven application. Therefore,thereisnoone-size-fits-allsolutionforthe 2. PRELIMINARIES problemofchangerecognition. WeconsidertwodisjointsetsU, L, denotingtheURIsandlit- Inthiswork,weproposeaflexibleframeworkforchangerecog- erals(weignorehereblanknodesthatcanbeavoidedwhendata nition that can be applied to different data models and applica- are published according to the Linked Data paradigm). An RDF tions.Ourintentionisnottoprovideasinglesolution,butageneric triple [11] is a tuple of the form (subject, predicate, object) and methodologyfordefiningthethreecomponentsofamulti-version assertsthefactthatsubjectisassociatedwithobjectthroughpredi- repository,throughwhichdifferentsolutions,suitablefordifferent cate.ThesetT=U×U×(U∪L)isthesetofallRDFtriples.An needs,canbedeveloped. Eventhoughourframeworkisdesigned RDFdatasetDconsistsofasetofRDFtriples. Inthefollowing, forRDF,ourapproachisapplicabletoanydatamodelrepresentable wedenotebyD ,D theoldandnewversions,respectively, inRDF,i.e.,RDFisusedasaunifyingunderlyingmodeltoachieve old new ofadatasetD. thisgenerality. SPARQL1.1[17]istheofficialW3Crecommendationlanguage Inoverall,ourcontributionsareasfollows: forqueryingRDFgraphs. ThebuildingblockofaSPARQLstate- • Regarding the definition of changes, the objective of gen- mentisatriplepatterntpthatislikeanRDFtriple,butmayhave erality is met by providing two different types of changes, avariable(prefixedwithcharacter?)inanyofitssubject, predi- namelysimpleandcomplex(seeSection3).Theformertype cate,orobjectpositions;variablesaretakenfromaninfinitesetof ismeanttocapturefine-grainedchangesforthedatamodel variablesV,disjointfromthesetsU,L,sothesetoftriplepatterns athand,andshouldmeettherequirementsofcompleteness is:TP=(U∪V)×(U∪V)×(U∪L∪V).SPARQLtriplepat- and unambiguity. The latter type is meant to capture more ternscanbecombinedintographpatternsgp,usingoperatorslike coarse-grained,orspecialized,changesthatareusefulforthe join(“.”),optional(OPTIONAL)andunion(UNION)[1]andmay application at hand; this allows a customized behaviour of alsoincludeconditions(usingFILTER).Inthiswork,weareonly thechangedetectionprocess,dependingontheactualneeds interested in SELECT SPARQL queries, which are of the form: of the application. Complex changes are totally dynamic, “SELECT v ,...,v WHERE gp”, where n > 0, v ∈ V 1 n i and can be defined even at run-time, greatly enhancing the andgpisagraphpattern. flexibilityofourapproach. FortheevaluationofSPARQLqueries,wefollowthesemantics discussedin[15,1]. Evaluationisbasedonmappings,whichare • Forthepurposesofrepresentation,theproposedontologyof partialfunctionsµ:V(cid:55)→U∪LthatassociatevariableswithURIs changes is designed to be generic enough, so as to be cus- orliterals(abusingnotation, µ(tp)isusedtodenotetheresultof tomizable for different languages of changes. This allows replacingthevariablesintpwiththeirassignedvaluesaccording ustomeettheaforementionedgeneralitygoal,asdetailedin to µ). Then, the evaluation of a SPARQL triple pattern tp on a Section4. datasetDreturnsasetofmappings(denotedby[[tp]]D)suchthat • Thedetectionprocessisdefinedinacustomizablemanner,as µ(tp) ∈ D for µ ∈ [[tp]]D. This idea is extended to graph pat- thecoredetectionalgorithmisagnostictothesetofsimple terns by considering the semantics of the various operators (e.g., [[tp UNION tp ]]D = [[tp ]]D ∪[[tp ]]D). GivenaSPARQL that are directly associated with said change and are assumed to 1 2 1 2 query“SELECT v ,...,v WHERE gp”,itsresultwhenap- becapturedbythesimplechange. Forexample,theabovechange 1 n pliedonDis(µ(v ),...,µ(v ))forµ∈[[gp]]D. Fortheprecise wouldbeassociatedwiththetriple(m,rdfs:range,t)(whichin- 1 n semanticsandfurtherdetailsontheevaluationofSPARQLqueries, dicates that t is the type of measure m); if said triple is found in thereaderisreferredto[15,1]. D ,butnotinD ,thenthisindicatesthatanewtype(t)has new old beenattachedtomeasurem(thus,Attach_Type_To_Measure(m,t) 3. DEFININGCHANGES shouldbedetected). Triplesofthistypearerequiredtobeinoneversionbutnotin Our purpose in this work is to provide a change recognition theother(i.e.,inlow-leveldelta)andaredirectlyassociated(con- method,which,giventwo(subsequent)datasetversionsD ,D , old new sumed)bythecorrespondingsimplechange. Consumptioninthis would produce their delta (∆), i.e., a formal description of the respectmeansthatthecorrespondinglow-levelchangeiscaptured changes that were made to get from D to D . A delta is old new (described)bysaidsimplechange. basedonalanguageofchanges(L),i.e.,asetofformaldefinitions A simple change can also have a number of logical conditions ofthechangesthatthedeltacouldcontainand, subsequently, the requiredfordetection. Forexample,thetriple(m,rdfs:range,t) change detection method understands and detects; these changes mayalsoindicatethatadatatype(t)isattachedtoadimensionm. shouldcorrespondtotheevolutionprimitivesofthedatamodelun- TodeterminewhethertheinclusionofsuchatripleinD corre- derconsideration. new spondstoanAttach_Type_To_Measure(m,t)change(asopposedto Alanguageofchanges,initssimplestform,consistsofadditions anAttach_Datatype_To_Dimension(m,t)change),weneedacon- and deletions of elements (i.e., triples for the RDF case) from a ditionthatwilldeterminewhethermisameasureoradimension. dataset,i.e.,changesoftheformAdd(t)/Del(t). Suchadeltais Inthiscase,theAttach_Type_To_Measure(m,t)wouldrequirethe calledalow-leveldelta[22]andcanbeeasilycomputedasfollows: presenceofthetriple(m,rdf :type,qb : measureProperty)in ∆(D ,D )={Add(t)|t∈D \D }∪{Del(t)|t∈ old new new old D . Notethat,ingeneral,conditionsmayrefertoboththeold D \D }. new old new andthenewversion. Unlikeconsumedtriples,thetriplesappear- Low-levellanguages(anddeltas)areeasytodefineanddetect, inginconditionsarenotnecessaryaddedordeletedtriples; their andhaveseveralniceproperties[22]; however,therepresentation presenceisnecessaryineither(orboth)oftheversionsandtheir of changes at the level of (added/deleted) triples, leads to a syn- roleistodisambiguatebetweensimilarchanges. tacticdelta,whichdoesnotcapturethesemanticsofachangeand Moreformally,asimplechangeisdefinedasfollows: generatesresultsthatarenotintuitiveenoughforthehumanuser. Forexample,intheRDFcontext,theplaindeletionofanindividual DEFINITION 1. Asimplechangec(p1,...,pn)isdefinedasa (classinstance)wouldcorrespondtoamultitudeoftripledeletions tupleoftheform(cid:104)δ+,δ−,φ ,φ (cid:105)where: old new (namely,allthetriplesthatcontainthisURI,suchaspropertyin- • cisthenameandp ,...,p ∈V,n≥0,aretheparameters stances);listingallthesechangesdoesnotimmediatelyconveythe 1 n ofthechange, message that an individual was deleted, and the human observer may find it hard to decipher the intent of the actual changes that • δ+,δ− ⊆TParecalledtheconsumedaddedandconsumed tookplace[14]. Thisproblemisevenmoreseriousforotherdata deletedtriples,respectively,andaresetsoftriplepatterns, models (e.g., multi-dimensional) that are translated into RDF; in thiscase,thelow-leveldeltas,whicharedescribedinRDFjargon, • φold,φnew aregraphpatterns,calledtheconditionsrelated needtobe“translatedback”intotheoriginaldatamodel,making toDold,Dnew,respectively. decipheringevenmoredifficult. Toaddressthisproblem,high-leveldeltashavebeenproposed[14], Inourrunningexample,Attach_Type_To_Measure(m,t)hastwo which aim to describe changes at a more intuitive level, in order parameters(m,t)andδ+ ={(m,rdfs:range,t)},δ− =∅,φ = old to make them more human-understandable. In the above exam- “”,φ =“(m,rdf :type,qb:measureProperty)”. new ple,theoutputwouldbea“deleteindividual”change,thatimme- The structure of Definition 1 is used for defining the changes diately conveys the message that all triples that include a given thatthelanguageofchangesaccepts,butanyactualdetectionwill URI (individual) were deleted. The main idea behind achieving givespecificvalues(URIsorliterals)totheparametersm,tofthe thisistogrouplow-levelchangesintohigh-levelones,undersome change(e.g.,Attach_Type_To_Measure(dm-measure:meas7v8t,dm- conditions. Forthepurposesofthispaper,weorganizehigh-level type:int)).Thisiscapturedwiththefollowingnotion: changesintwomajortypes, namelysimpleandcomplex, eachof whichrepresentschangeswithacertaingranularityandroleinthe DEFINITION 2. Consider a change c(p1,...,pn). Then, an assignmentx ,...,x ofURIs/literalstovariablesp ,...,p is model.Detailsonthesetwotypesaregivenbelow. 1 n 1 n calledaninstantiationofcanddenotedbyc(x ,...,x ). 1 n 3.1 SimpleChanges Now we are in position to formally define the detectability of Ingeneral,theroleofsimplechangesistodescribechanges(evo- a change instantiation. Intuitively, a change instantiation corre- lutionprimitives)specifictothedatamodelathand; e.g., forthe sponds to a certain assignment of (some of) the variables in δ+, multi-dimensionalmodel,wewouldexpectchangeslike“Add_Di- δ−,φ ,φ ;theassignmentshouldbesuchthattheconditions old new mension”or“Attach_Type_To_Measure”. Usingsimplechanges, (φ ,φ )aretrueintheunderlyingdatasets,andthetriplesthat old new theusercanabstractfromthesyntacticalandrepresentationalpecu- correspond to the triple patterns in δ+, δ− are found in the low- liaritiesoftheunderlyingdatamodel(includingitspossibletrans- leveldelta.Moreprecisely: lationtoRDFformat),therebymakingdeltasmoreintuitive. Asimplechange(e.g.,Attach_Type_To_Measure(m,t)),iscom- DEFINITION 3. Achangeinstantiationc(x1,...,xn)ofasim- posedofthechangename(i.e.,Attach_Type_To_Measure)andthe ple change c(p ,...,p ) is detectable for the pair D ,D 1 n old new changeparameters(i.e.,(m,t)).Itsmaincharacteristicisthetriples iff there is a µ ∈ [[φold]]Dold ∩[[φnew]]Dnew such that for all that would be added/deleted from the RDF representation of the tp ∈ δ+: µ(tp) ∈ D \D andforalltp ∈ δ−: µ(tp) ∈ new old datasetwhensuchachangeoccurs.Thesecorrespondtothetriples D \D andforalli:µ(p )=x . old new i i changescanalsobeusedtodefinechangesthatwerenotforeseen atdesign-time. Thiswouldgiveanextraflexibilitytotheuserto describemorespecialized,coarse-grainedchanges,whicharerele- vanttoacertainapplication,butnotgenericenoughtobeconsid- eredpartofthe“core”languageof(simple)changes. Thisessen- tiallyallowsextendingthelanguageofchangesatrun-time. Complex changes are defined in a way similar to simple ones; there are certain differences though. First, complex changes are notdirectlyassociatedwithlowlevelchanges;insteadofδ+,δ−, theyincludeasetofsimplechanges,denotedbyδs,whichcorre- sponds to the set of simple changes that should be detectable in order for said complex change to be detectable. This is neces- sary to make their definition more user-friendly, given that com- plex changes will be defined by the end-user who does not un- derstand low level changes. Second, as complex changes can be freely defined by the user, it would be unrealistic to assume that theywillhaveanyqualityguarantees,suchascompletenessorun- ambiguity. As a consequence, the detection process may lead to non-deterministicconsumptionofsimplechangesandconflicts;to Figure1:VisualizationofCompletenessandUnambiguity avoid this, complex changes are associated with a priority level, whichisusedtoresolvesuchconflicts. Furthermore, complex changes support associations, i.e., cor- Then,adetectablechangeconsumesthetriplesthatcorrespond tothetriplepatternsinδ+,δ−.Formally: respondences between URIs/literals that are necessary to support changes like renames and merge/splits [14]. In general, an asso- DEFINITION 4. Adetectablechangeinstantiationc(x1,...,xn) ciation α is a pair (X,Y) where X,Y are sets of URIs, literals ofasimplechangec(p ,...,p )consumest∈D \D (re- orvariables(calledURI/literal/variableassociation,respectively). 1 n new old spectively, t ∈ Dold \Dnew) iff there is a µ ∈ [[φold]]Dold ∩ Associationscanhaveoneofthefollowingforms(wherevi may [[φ ]]Dnew and a tp ∈ δ+ (respectively, tp ∈ δ−) such that beaURI,literalorvariable): new µ(tp)=tandforalli:µ(p )=x . i i • {v }(cid:32){v },v (cid:54)=v (rename) 1 2 1 2 The concept of consumption represents the fact that low-level changesare“assigned”tosimpleones,essentiallyallowingagroup- • {v0}(cid:32){v1,...,vn},vi (cid:54)=vj fori(cid:54)=j(split) ing(partitioning)oflow-levelchangesintosimpleones. Tofulfil itspurpose,this“partitioning”shouldbeperfect,inthesensethat • {v1,...,vn}(cid:32){v0},vi (cid:54)=vj fori(cid:54)=j(merge) all low-level changes should be associated to one, and only one, A set of URI and literal associations should be provided as an simplechange. Thisiscapturedbythepropertiesofcompleteness inputtothedetectionprocess.Suchassociationscouldbeprovided andunambiguity.Formally: manuallybytheuser,orautomaticallyidentifiedusing,e.g.,entity DEFINITION 5. ConsiderasetofsimplechangesC. Thisset matchingsoftware[20];detectingassociationsisoutofthescopeof iscalledcompleteiffforanypairofversionsD ,D andfor thispaper.Givenamappingµandavariableassociationα,wewill old new allt∈(D \D )∪(D \D ),thereisaninstantiation denotebyµ(α)theURI/literalassociationresultingfromreplacing new old old new c(x ,...,x )ofsomec∈Csuchthatc(x ,...,x )isdetectable allvariablesinαwiththeircorrespondingURI/literalaccordingto 1 n 1 n andconsumest. µ. Complexchangesareusefulinseveralapplications. Forexam- DEFINITION 6. ConsiderasetofsimplechangesC. Thisset ple,variousontologicalapplicationsrefrainfromdeletingclasses, iscalledunambiguousiffforanypairofversionsD ,D and old new but use a special subsumption relationship to a class “Obsolete” for all t ∈ (D \D )∪(D \D ), if c,c(cid:48) ∈ C and new old old new toindicatethatacertainclass(say, cl)shouldnolongerbeused. c(x ,...,x ),c(cid:48)(x(cid:48),...,x(cid:48) )aredetectableandconsumet,then 1 n 1 m ThisactioncouldbecapturedbythechangeMark_as_Obsolete(cl), c(x ,...,x )=c(cid:48)(x(cid:48),...,x(cid:48) ). 1 n 1 m whichwouldconsumethesimplechangeAdd_SuperClass(cl,obs), whereobs=geneontology:ObsoleteClass. Inanutshell,completenessguaranteesthatalllowlevelchanges Theformalconceptsassociatedwithcomplexchangesaresimi- areassociatedwithatleastonesimplechange,therebymakingthe lartotheonesrelatedtosimplechanges: reporteddeltacomplete(i.e.,notmissinganychange);unambigu- ity guarantees that no race conditions will emerge between sim- plechangesattemptingtoconsumethesamelowlevelchange(see DEFINITION 7. Acomplexchangec(p1,...,pn)isdefinedas atupleoftheform(cid:104)δs,φ ,φ ,A,P(cid:105)where: Figure1foravisualizationofthenotionsofcompletenessandun- old new ambiguity). The combination of these two properties guarantees • cisthenameandp ,...,p ∈V,n≥0,aretheparameters thatthedeltaisproducedinadeterministicmannerandthatitwill 1 n ofthechange, properlyreflectthechangesthatwereactuallyperformed. 3.2 ComplexChanges • δsisasetofsimplechangescalledtherelatedsimplechanges, Toguaranteecompletenessandunambiguity,simplechangesmust • φ ,φ aregraphpatterns,calledtheconditionsrelated old new be defined at design time, and remain immutable between differ- toD ,D ,respectively, old new entapplications;forchangescorrespondingtocustom,application- specific evolution primitives, we use complex changes. Complex • Aisasetofvariableassociations, • P isanumbercorrespondingtotheprioritylevelofthecom- plexchange. Inourrunningexample,Mark_as_Obsolete(cl)hasoneparam- eter(cl)andδs = {Add_SuperClass(cl,obs)}, φ =“obs= old geneontology:ObsoleteClass”,φ =“”,A=∅,P =2. new The definition of complex change instantiations is identical to Definition2.Fordetectability,wehaveaseriesofdefinitions,soas totakeintoaccountpriorities: DEFINITION 8. Achangeinstantiationc(x1,...,xn)ofacom- plexchangec(p ,...,p )isinitiallydetectableforthepairD , 1 n old DnewandtheassociationsADold,Dnewiffthereisaµ∈[[φold]]Dold∩ [[φ ]]Dnewsuchthatforallc(cid:48)(µ(p ),...,µ(p ))∈δs,c(cid:48)(µ(p ), new 1 n 1 ...,µ(pn))isdetectable,forallα∈A,µ(α)∈ADold,Dnew and Figure2:RepresentationofSimpleChanges foralli,µ(p )=x . i i DEFINITION 9. Aninitiallydetectablechangeinstantiationc(x1, appearattheinstancelevel,instantiatingthecorrespondingclasses. ...,x ) of a complex change c(p ,...,p ) consumes a simple Specifically,atschemalevel,weintroduceoneclassforeachsim- n 1 n changeinstantiationc(cid:48)(x(cid:48)1,...,x(cid:48)m)ofasimplechangec(cid:48)(p(cid:48)1,..., pleandcomplexchangec(p1,...,pn)thatisunderstoodandcon- p(cid:48)m) iff c(cid:48)(p(cid:48)1,...,p(cid:48)m) ∈ δs and there is a µ ∈ [[φold]]Dold ∩ sideredbythelanguageofchanges(L),andatinstancelevel,we [[φnew]]Dnew suchthatforallα ∈ A,µ(α) ∈ ADold,Dnew and introduceoneindividualforeachdetectablechangec(x1,...,xn) foralli,µ(p )=x ,µ(p(cid:48))=x(cid:48). ineachpairofversions. i i i i DEFINITION 10. Achangeinstantiationc(x1,...,xn)ofacom- 4.2 OntologyofChanges: SchemaLevel plex change c(p ,...,p ) is detectable for the pair D , D 1 n old new Forsimplechanges,thecorrespondingchangedefinitionismod- andtheassociationsADold,Dnew iffitisinitiallydetectableforthe elledasasubclassofthemainclassSimple_Change,andisas- pairDold,DnewandtheassociationsADold,Dnew andthereisno sociatedwithadequate propertiestorepresentits parameters (see initiallydetectablechangeinstantiationofanothercomplexchange Figure 2); each such property models the type of the parameter withahigherprioritythatconsumesthesamesimplechange. (e.g.,whetheritisaURIoraliteral,viatheclassesrdfs:Resource, rdfs:Literalrespectively)anditsname(whichisadescriptivename 4. REPRESENTINGCHANGES that allows a more intuitive interaction with the user, useful also duringtheconstructionofcomplexchangesthatconsumesaidsim- 4.1 MotivationfortheOntologyofChanges plechange).Alsoadescriptivename(cname)iscapturedforeach Wetreatchangesasfirst-classcitizensinordertobeabletoper- typeofchange.Notethatconditionsdonotneedtobestored. formqueriesanalysingtheevolutionofdatasets.Further,wearein- Forcomplexchanges,similarideasareused;however,notethat terestedinperformingcombinedqueries,inwhichboththedatasets theinformationrelatedtocomplexchangesisgeneratedonthefly andthechangesshouldbeconsideredtogetananswer.Toachieve atchangecreationtime(incontrasttosimplechanges,whicharein- this,therepresentationofthechangesthataredetectedonthedata builtintheontologyatdesigntime).Inaddition,moredetailedin- cannotbeseparatedfromthedataitself. formationrelatedtothechangeshouldbeavailableintheontology Forexample,considerthefollowingquery:“returnallcountries forthechangedetectionprocess,because,unlikesimplechanges, forwhichtheunemploymentrateoftheircapitalcityincreasedat thisinformationisnotknownatdesigntimeandcannotbeembed- a rate higher than the average increase of the country as a whole dedinthecode. betweenversionsD andD ”. Thisqueryrequiresaccessto Eachcomplexchangeismodelledasasubclassofthemainclass old new thedata(toidentifycountriesandcapitals)andtothechanges(to Complex_Change,andisassociatedwithadequatepropertiesto describetheactualincreaseintheratesofunemploymentforeach representitsparameters(seeFigure3). Again,parametershavea cityandcountry). Therefore, toanswerit, thechangesshouldbe typeandaname,whicharemodelledinthesamemannerasinsim- storedinastructuredformandtheirrepresentationshouldinclude plechanges. Inaddition,eachcomplexchangeisassociatedwith connectionswiththeactualentities(citiesorcapitals)thattheyre- thesimplechange(s)thatitconsumes,andincludesalsoproperties ferto. Notethatthedefinitionofacross-snapshotquerylanguage describingits(user-defined)descriptivenameanditspriority. thattreatschangesasfirst-classcitizensisoutsidethescopeofthis Finally, foreachcomplexchange, theSPARQLqueryusedfor paper, however this work provides a formalism on which such a itsdetection,isautomaticallygeneratedatchangedefinitiontime; querylanguagecanbebased. Therefore,weproposerepresenting thisisdoneforefficiency,toavoidhavingtogeneratethisqueryin changesasspecialentitiesinanRDFdataset,withconnectionsto everyrunofthedetectionprocess. TheSPARQLqueryencapsu- theactualdata,sothatadetectablechangecanbeassociatedwith latesalltherelevantinformationaboutthedetectionofthechange, thecorrespondingdataentitiesthatitrefersto. sofurtherinformationonthecomplexchange(namelyconditions Morespecifically,weproposeanontologyofchangesforstoring andassociations)neednottobestored. the detected changes, thereby allowing a supervisory look of the 4.3 OntologyofChanges: DataLevel detectedchangesandtheirassociationwiththeentitiestheyreferto intheactualdatasets,facilitatingtheformulationandtheanswering Theinstancelevelisusedtostorethedetectablesimpleandcom- ofqueriesthatrefertoboththedataandtheirevolution. plex changes. Each detectable change between any two versions Insaidontology,theschemadescribesthedefinitionofthechanges isrepresentedusingadifferentindividualassociatedthroughade- (c(p ,...,p )–seeDefinitions1,7),whereasinformationonde- quatepropertieswithallrelevantinformation,namely,theversions 1 n tectedchanges(whicharechangeinstantiations–seeDefinition2) betweenwhichthechangewasdetectedandtheexactvaluesofits Figure6:RepresentationofAssociations Figure3:RepresentationofComplexChanges association. Forexample,Figure6showstherepresentationofthe associations{x1} (cid:32) {x2}and{y1} (cid:32) {y2,y3},whichappear betweenversionsv1,v2. 5. DETECTINGANDSTORINGCHANGES Thechangedetectionprocessisresponsibleforthedetectionof simple and complex changes between two dataset versions (and their corresponding associations, which are a necessary input for the detection of complex changes), as well as for the enrichment oftheontologyofchangeswithinformationaboutthedetectable changes. Thus,theprocesscanbeconsideredtocompriseoftwo steps:thefirstistheidentificationofthedetectablechangesandthe correspondinginformation(triples)tobeinsertedintheontologyof changes(triplecreation),andthesecondistheactualingestionof Figure4:RepresentationofSimpleChangeDetection saidinformationintheontologyofchanges(tripleingestion). Todetectsimpleandcomplexchanges,werelyonplainSPARQL parameters(ofthechangeinstantiation).Forcomplexchanges,we queries,whicharegeneratedfromtheinformationdrawnfromthe additionallyneedtoincludeconsumptioninformation. Examples definition of the corresponding changes (Definitions 3, 10). For ofsuchinstantiationsforsimpleandcomplexchangesareshownin simple changes, this information is known at design time, so the Figures4and5,forthedetectablechangesAttach_Type_To_Mea- query is loaded from a configuration file, whereas for complex sure(dm−measure:meas7v8t,dm−type:int)and changes,thecorrespondingqueryisgeneratedonceatchange-creation Mark_as_Obsolete(efo:EFO_0004151),respectively. time (run-time) and is loaded from the ontology of changes (see Figure3).Theresultsofthegeneratedqueriesdeterminethechange 4.4 OntologyofChanges: Associations instantiations that are detectable; this information is further pro- Anotherusefulfeatureoftheontologyofchangesisthestorage cessedtodeterminetheactualtriplestobeinsertedintheontology ofassociations,whicharenecessaryforthedetectionofcomplex ofchanges. Recallthatthisdependsontheactualdetectedchange changes. Associationsarestoredasinstancesclassifiedunderthe anditstype(seeSection4). classAssociation,andrecordtheversionsbetweenwhichtheyare Morespecifically,theSPARQLqueriesusedfordetectingasim- applicable,aswellastheoldandnewvaluesinthecorresponding ple change are SELECT queries, whose returned values are the parameter values of the change; thus, for each parameter of the changedefinition,weputonevariableintheSELECTclause.Then, the WHERE clause of the query includes the triple patterns that should (or should not) be found in each of the versions in order for a change instantiation to be detectable; more specifically, the triplepatternsinδ+ mustbefoundinD butnotinD , the new old triplepatternsinδ− mustbefoundinD butnotinD ,and old new thegraphpatternsinφ ,φ shouldbeappliedinD ,D , old new old new respectively. Let’susethismethodologytoconstructtheSPARQLqueryfor thesimplechangeAttach_Type_To_Measure(m,t)inwhich:δ+ = {(m,rdfs:range,t)},δ− =∅,φ =“”,φ =“(m,rdf :type, old new qb:measureProperty)”.ThecorrespondingSPARQLqueryis: SELECT ?m ?t WHERE { GRAPH D { ?m rdf:type qb:MeasureProperty. new ?m rdfs:range ?t. } FILTER NOT EXISTS { GRAPH D { ?m rdfs:range ?t. } } old } Figure5:RepresentationofComplexChangeDetection Eachresultrowofthisquerycorrespondstotheparametervalues Extensibility: Theadditionofanewchange(simpleorcomplex) ofonedetectablechangeinstantiation;forexample,ifthequeryre- requiresjusttheadditionofthecorrespondingSPARQLquery, turnsthepair(dm-measure:meas7v8t,dm-type:int),thenthechange withnofurthermodificationsonthesourcecode. instantiationAttach_Type_To_Measure(dm-measure:meas7v8t,dm- DataModelInsensitivity: Theprocesscanbeappliedinanydata type:int)isdetectable. model (e.g., multi-dimensional) which can be expressed in ThegenerationoftheSPARQLqueriesforthecomplexchanges followsasimilarpattern. OnedifferenceisthattheWHEREclause RDF(seeSection6)ignoringitsinternalrepresentation;all peculiarities of the underlying data model are incorporated shouldalsoincludetherequiredassociations,whichareassumedto inthecorrespondingSPARQLqueries,therebyallowingthe bealreadystoredintheontologyofchanges. Asecondimportant codetobeversatile. difference is that complex changes check the existence of simple changes(namely,thosefoundintheδspartofthecomplexchange Cross-PlatformCharacter: TheprocessrequiresverygenericRDF definition)intheontologyofchanges,ratherthantriplesinthetwo datamanagementoperations(SPARQLsupportandRDFdata versions(asisthecasewithsimplechangesdetection);therefore, import),soitcanbeimplementedinanytriplestore. complexchangesshouldbedetectedafterthedetectionofsimple changesandtheirstorageintheontology.Notealsothattheconsid- 6. APPLICATIONOFTHEFRAMEWORK eredsimplechangesshouldnothavebeenmarkedas“consumed” byotherdetectablechangesofahigherpriority;thus,itisimpor- Asmentionedabove,ourframeworkisflexibleandgenericenough tantforqueriesassociatedwithcomplexchangestobeexecutedin tobeapplicabletoanydatamodel, whichistransformabletothe aparticularorder,asimpliedbytheirpriority. RDF format. Thus, to apply our framework to any given data Let’sillustratetheabovemethodologytoconstructtheSPARQL model,onefirsthastodeterminehowtotransformsaiddatamodel queryforthecomplexchangeMark_as_Obsolete(cl). Thischange in RDF. The second step is the definition of the simple changes has one parameter (cl) and δs = {Add_SuperClass(cl,obs)}, that would comprise the language of changes (L); note that said φ =“obs=geneontology:ObsoleteClass”,φ =“”,A=∅, changesshouldbecompleteandunambiguousforoptimalresults. old new P =2.ThecorrespondingSPARQLqueryis: DefiningthesimplechangesamountstoidentifyingtheSPARQL querythatisnecessaryfordetection,aswellastherepresentation SELECT ?cl WHERE { ofthesimplechangesintheschemaofthechangesontology;the GRAPH <changesOntology> { lattercanbeeasilyautomated. ?asc a co:Add_Superclass; Complexchangesneednotbedefinedatthispoint,astheyare co:asc_p1 ?cl; createdatrun-time;tofacilitatethedefinitionofcomplexchanges co:asc_p2 ?obs. onecouldcreateanadequateuser-friendlyinterface,thatcould(po- FILTER NOT EXISTS { ?cc co:consumes ?asc } tentially) sacrifice some of the expressive power of the complex FILTER (?obs = “geneontology:ObsoleteClass”). change definition (Definition 7) in favour of user-friendliness; at } anyrate,thedefinitionofsuchaninterfaceisbeyondthescopeof } thiswork. As with simple changes, each query result corresponds to the Indicatively,wedemonstrate,forvisualizationandexperimenta- parameter values of a detectable change instantiation; for exam- tionpurposes,theaboveprocessfortheRDFandmulti-dimensional ple, if the query finds the simple change Add_Superclass(efo: data models; a similar methodology could be used for other data EFO_0004151,geneontology:ObsoleteClass)whichis models,e.g.,relational,butfurtherapplicationsofourframework not consumed by any complex change, then it will return efo: areomittedduetospacelimitations. EFO_0004151,whichimpliesthatthechangeinstantiation Mark_as_Obsolete(efo:EFO_0004151)isdetectable. 6.1 RDFModel Followingdetection,theinformationaboutthedetectable(sim- TheRDFdatamodelisapopularmodelforexposing,sharing, pleorcomplex)changeinstantiationsmustbestoredintheontol- andconnectingpiecesofdata,information,andknowledgeonthe ogyofchanges. Todoso, wehavetoprocesseachresultrowto SemanticWeb. ScientistsfromvariousareasofexpertiseuseRDF create the corresponding triple blocks, as specified in Section 4. to publish their scientific observations and measurements on the Thisisdoneasaseparateprocessthatfirststoresthetripleblocks Web. inafile(ondisk)andsubsequentlyuploadstheminVirtuosousing The case of RDF is straightforward in the sense that no data Virtuoso’bulkloadingprocess(tripleingestion). transformation is required. For the definition of simple changes, Notethatthedetectionandstoringofchangescouldbedonein weusedasetverysimilartotheso-called“basicchanges”in[14]. onestep,ifoneusedanadequatelydefinedSPARQLupdatestate- Thefulllistofdefinedsimplechanges,alongwithdetailsontheir ment that identified the detectable change instantiations, created definitionandthecorrespondingSPARQLqueriesusedfordetec- thecorrespondingtripleblocksandinsertedthemintheontology tioncanbefoundinAppendixB. usingasinglestatement. However,thisapproachturnedouttobe slowerby1-2ordersofmagnitude,partlybecauseitdoesnotex- 6.2 Multi-dimensionalModel ploitbulkupdatesbasedonmultiplethreads,andalsobecausebulk Themulti-dimensionaldatamodelisusefulforcapturingstatis- loadingismuchfasterthaninsertingtriplesusingSPARQLupdates ticsandinformationondataitemsalongseveraldimensions. The inVirtuoso. bridgetoRDFisachievedthroughtheDataCubevocabulary1which Thedescribedprocessenjoysthecharacteristicsputforwardin was proposed as a W3C recommendation to enable the publica- theprevioussections,namely: tionofstatisticaldataflowsandothermulti-dimensionaldatasets overtheWeb.TheDataCubevocabularyallowsmulti-dimensional data to enjoy all the advantages of the LOD and RDF technolo- Expressiveness: The algorithm supports the entire semantics of gies(knowledgesharingandinterconnection)viaitspublicationto SPARQL,therebyallowingaveryexpressivemethodtode- finechanges. 1http://www.w3.org/TR/vocab-data-cube triple-store). VirtuosoishostedonamachinewhichusesanIntel Table1:EvaluatedDatasets:VersionsandSizes XeonE5-2630at2.30GHz,with384GBofRAMrunningDebian Dataset Version #Triples Linuxwheezeversion,withLinuxkernel3.16.4. Thesystemuses RDFdatasets 7TBRAID-5HDDconfigurations.Fromthetotalamountofmem- v12.07 457.951.940 ATLAS v13.05 422.144.126 ory,wededicated64GBforVirtuosoand5GBfortheimplemented v13.07 447.149.655 application. Moreover, takingintoaccountthatCPUprovides12 v3.7 48.898.490 coreswith2threadseach,wedecidedtouseamulti-threadedim- Dbpedia v3.8 63.126.304 plementation;specifically,wenoticedthattheuseof8threadsdur- v3.9 67.980.265 ingthecreationoftheRDFtriplesalongwiththeingestionprocess 24-03-2009 189.378 gaveusoptimalresultsforoursetting.Thiswasonemorereasonto GO 22-09-2009 195.125 selectVirtuosoforourimplementation,asitallowstheconcurrent 20-04-2010 210.076 Multi-dimensionaldatasets useofmultiplethreadsduringingestion.Toeliminatetheeffectsof v1 229.338.925 hot/coldstarts,cachedOSinformationetc.,eachchangedetection 4lqc v2 243.336.594 processwasexecuted10timesandtheaveragetimeswereconsid- v3 243.449.874 ered. v1 4.022.424 For our experimental evaluation, we used 3 RDF and 3 multi- 1zph v2 4.022.352 dimensionaldatasets. TheselectedRDFdatasetswereATLAS3,a v3 4.022.208 subsetoftheEnglishDbpedia4(consistingofarticlecategories,in- v1 259.125 1bu4 v2 270.049 stancetypes,labelsandmapping-basedproperties)andGO5.The v3 270.049 selectedmulti-dimensionaldatasetswereprovidedbyDatamarket andcontainvariousstatistics;inparticular,4lqc6containsweather measurements from the Global Historical Climatology Network, RDF,therebyallowingpublishersorthirdpartiestoannotateand 1buh7 containsnumbersofemployedpersonsbetween15and64 linktospecifieddata,whichcanbeflexiblycombinedacrossdif- yearsoldtakingtimeoffoverthelast12monthsforfamilysickness ferentdatasets. or emergencies and 1zph8 contains information about unemploy- To apply our framework to the multi-dimensional model, we mentbysex,age,durationofunemploymentandregistration. All adopt the transformation proposed by the Data Cube vocabulary. thesemeasurementsaretakenfromEurostat. Table1summarizes Thedefinitionofsimplechangesisbasedontheidentificationof thecorrespondingversionsandtheirsizes. all entities of the Data Cube vocabulary (i.e., fact tables, dimen- sions,observations,codelists,hierarchies,measures,attributes)in 7.2 DetectedChanges ordertoexpressspecialorgenericcasesofchangesthatappearin Thefirstpartofourevaluationstudiesthenumberandtypeof eachofthoseentities. Toachievecompletenessandunambiguity simple changes that appear in the evaluated datasets. The results inourdefinitionofsimplechanges,wetakeintoaccounthowdata are summarized in Figure 7, where we note the large number of cubeentitiesareconnectedandwhichsetsoftriplepatternsdenote changeswhichoccurredduringATLASevolutioncomparedtothe theexistence/characteristicsofthecorrespondingentities. Again, otherdatasets. ThisisexplainedbythefactthatATLAScontains thefulllistofthedefinedsimplechangesforthemulti-dimensional experimentalbiologicalresultsandmeasurementsthatchangeover model,alongwithdetailsontheirdefinitionandthecorresponding time, thus new versions are vastly different from previous ones. SPARQLqueriesusedfordetectioncanbefoundinAppendixC. Moreover,notethatthemajorityofchanges(inalldatasetsexcept GO)areappliedtothedatalevel(e.g.,Delete_Property_Instance), 7. EXPERIMENTALEVALUATION whereas in GO, we have also changes which are applied to the schema. Ourframeworkwasimplementedandappliedonthedatamod- Onthemulti-dimensionalmodel,themajorityofchangesinall elsdescribedinSection6. Theevaluationconsideredsomerepre- datasetsareoftypeAdd_Dimension_Value_to_ObservationandDe- sentativereal-worlddatasetsofvarioussizesfromthesetwodata lete_Dimension_Value_from_Observation.Forbothtypesofchan- models,anditsaimswerethefollowing: ges, we observe that almost the same number of changes is re- • Toidentifythenumberandtypeofsimplechangesthatusu- ported, whichimpliesthatthedatasets’evolutionmainlyconsists allyoccurinrealworldsettings. ofchangingdimensionvaluesuponobservations; thus,creatinga complexchangetocapturethispairofchangeswouldimmediately • To study the performance of our change detection process halvethenumberofreportedchanges. whendealingwithrealsettingsandquantifytheeffectofthe size of the compared versions and the number of detected 7.3 PerformanceofChangeDetection changesintheperformanceofthealgorithm. Thesecondpartofourevaluationexaminestheperformanceof thedetectionprocessfortheRDFandmulti-dimensionaltestbeds; Toourknowledge, thisisthefirsttimethatchangedetectionhas theresultsappearinTable2.Thereportedperformanceissplitinto been evaluated for datasets of this size and versatility (RDF and twoparts,namelytriplecreationandtripleingestion;asexplained multi-dimensional). in Section 5, the former includes the execution of the SPARQL 7.1 Setting 3http://www.ebi.ac.uk/gxa/home Forthemanagementoflinkeddata(e.g.,storageofdatasetsand 4http://dbpedia.org query execution), we worked with a scalable triple store, namely 5http://geneontology.org theopensourceversionofVirtuosoUniversalServer2,v7.10.3209 6https://datamarket.com/data/set/4lqc (notethat,ourworkisnotboundedtoanyspecificinfrastructureor 7https://datamarket.com/data/set/1buh 2http://virtuoso.openlinksw.com 8https://datamarket.com/data/set/1zph Figure7:DetectedChanges(RDF/Multi-dimensional) Table2:PerformanceofChangeDetection(RDF/Multi-dimensional) DatasetsandVersions #SimpleChanges #IngestedTriples TripleCreation(sec) TripleIngestion(sec) Duration(sec) RDFdatasets ATLAS:v12.07-v13.05 879.512.096 4.980.846.857 7.294,73 9.654,46 16.949,20 ATLAS:v13.05-v13.07 801.278.339 4.539.113.055 5.953,07 9.243,61 15.196,68 DBpedia:v3.7-v3.8 20.752.280 116.327.560 465,26 143,49 608,74 DBpedia:v3.8-v3.9 9.343.769 50.620.418 296,20 72,94 369,14 GO:24-03-2009-20-04-2010 11.949 59.179 0,80 0,50 1,31 GO:22-09-2009-20-04-2010 25.299 124.262 0,96 0,57 1,52 Multi-dimensionaldatasets 4lqc:v1-v2 164.557.348 978.012.302 1.651,03 994,85 2.645,87 4lqc:v2-v3 162.318.804 973.837.296 1.645,03 996,96 2.641,99 1zph:v1-v2 4.469.294 26.815.814 129,38 46,91 176,29 1zph:v2-v3 4.469.262 26.815.494 131,25 47,33 178,58 1bu4:v1-v2 475.666 2.642.549 7,42 6,60 14,02 1bu4:v2-v3 323.644 1.941.848 5,12 5,97 11,09 queriesfordetectionandtheidentificationofthetriplestobein- containingblanknodes.Alltheseworksresultinnon-concise,low- sertedintheontologyofchanges, whereasthelatteristheactual leveldeltas,whicharedifficultforahumantounderstand. enrichmentoftheontologyofchanges. High-level change detection approaches provide more human- ThemainconclusionfromTable2isthatthenumberofsimple readabledeltas. Althoughthereisnoagreed-uponlistofchanges changesisamorecrucialfactorforperformancethanthesizesof thatarenecessaryforanygivencontext,varioushigh-levelopera- thecomparedversions(seealsoTable1).Thisobservationismore tions,alongwiththeintuitionbehindthem,havebeenproposed[7, clearintheDBpediadataset,wheretheevolutionbetweenv3.7and 13,16]. However,theseapproachesdonotpresentformalseman- v3.8producesabouttwicethenumberofchangesthantheevolution ticsofsuchoperations,orofthecorrespondingdetectionprocess; betweenv3.8andv3.9;despitethefactthatinthesecondcase,we thus,nousefulformalpropertiescanbeguaranteed. havetocomparelargerdatasetversions,theexecutiontimeinthe In[7,13],afixed-pointalgorithmfordetectingchanges,imple- formercaseisalmosttwiceaslarge.Notethatthisconclusionholds mented in PromptDiff, is described. The algorithm incorporates forbothtriplecreationandingestion. heuristic-basedmatcherstodetectchangesbetweentwoversions, More importantly, the performance of our approach is about 1 thusintroducinguncertaintyintheresults,andobtainingarecallof order of magnitude faster than the performance reported by [14], 96%andaprecisionof93%. uponwhichthisworkwasbuilt,evenifthereportedsimplechanges [16]proposestheChangeDefinitionLanguage(CDL)asameans arecompatibleinbothworks, astheusedsimplechangesforthe todefinehigh-levelchanges. Achangeisdefinedanddetectedus- RDF model are similar to the so-called “basic changes” in [14]. ing temporal queries over a version log that contains recordings Forexample,whencomparingtheperformanceofchangedetection oftheappliedlow-levelchanges. Theversionlogmustbeupdated overtheGOversionsv22-09-2009andv20-04-2010,ourapproach wheneverachangeoccurs;thisoverrulestheuseofthisapproachin needs1,52sec,while[14]requires33,13sec. non-curatedordistributedenvironments.Inourwork,versionlogs arenotnecessaryforthedetection,asthedeltacanbeproduceda posteriori. [2]focusesondefiningaformalwaytorepresenthigh- 8. RELATEDWORK levelchangesassequencesoftriples,butdoesnotdescribeadetec- In general, approaches for change detection can be classified tionprocessoraspecificlanguageofchanges.Finally,[5]proposes intolow-levelandhigh-levelones, basedonthetypesofchanges aninterestinghigh-levelchangedetectionalgorithmthattakesinto theysupport.Low-levelchangedetectionapproachesreportsimple accountthesemanticsofOWL. add/delete operations, which are not concise or intuitive enough to human users, while focusing on machine readability. [4] dis- 9. CONCLUSIONS cussesalow-leveldetectionapproachforpropositionalKnowledge Bases (KBs), which can be easily extended to apply to KBs rep- Inthispaper,weproposedanapproachtocopewiththedynam- resentedunderanyclassicalknowledgerepresentationformalism. icityofWebdatasetsviathemanagementofchangesbetweenver- This work presents a number of desirable formal properties for sions. Weadvocatedinfavourofaflexible,extendibleandtriple- changedetectionlanguages,likedeltauniqueness,reversibilityof storeindependentapproach, whichissuitableforanydatamodel changesandtheabilitytomovebackwardsandforwardsinthehis- that is representable in RDF terms. Our approach prescribes the toryofversionsusingthedeltas.Similarpropertiesappearin[22], definition of data model-specific, as well as application-specific, wherealow-levelchangedetectionformalismforRDFSdatasets changes,andtheirmanagement(definition,storage,detection)ina ispresented,aswellasin[14],uponwhichthisworkbuilds. mannerthatensuresthesatisfactionofformalproperties(likecom- [8]describesalow-levelchangedetectionapproachfortheDe- pletenessandunambiguity),theflexibilityandcustomizationofthe scription Logic EL; there, the focus is on a concept-based de- consideredchanges(viacomplexchanges,whichcanbedefinedat scription of changes, and the returned delta is a set of concepts run-time),aswellastheeasyconfigurationofascalabledetection whosepositionintheclasshierarchychanged. [9]presentsafor- mechanism(viaagenericalgorithmthatbuildsonSPARQLqueries mal low-level change detection approach for DL-Lite ontologies, thatcanbeeasilygeneratedfromthechanges’definitions). which focuses on a semantical description of the changes. Re- Thisworkbuildsuponourpreviousworkonthetopicofchange cently,[6]introducedascalableapproachforreasoning-awarelow- detection[14],byprovidingamoregenericchangedefinitionframe- levelchangedetectionthatusesarelationaldatabasemanagement work,thatissignificantlymorescalableandversatilethantheone system,while[10]supportschangedetectionbetweenRDFdatasets proposedin[14],asshowninSection7.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.