A Flexible Framework for Defining, Representing and Detecting Changes on the Data Web Yannis Roussakis Ioannis Chrysakis Kostas Stefanidis FORTH-ICS FORTH-ICS FORTH-ICS [email protected] [email protected] [email protected] Giorgos Flouris Yannis Stavrakas FORTH-ICS ATHENA-IMIS [email protected] [email protected] innovation.gr 5 1 0 ABSTRACT objectviapredicate. 2 Dynamicityisanindispensablepartofthecurrentweb;datasets ThedynamicnatureofWebdatagivesrisetoamultitudeofprob- n lemsrelatedtotheidentification,computationandmanagementof are constantly evolving for several reasons, such as the inclusion a the evolving versions and the related changes. In this paper, we of new experimental evidence or observations, or the correction J of erroneous conceptualizations [21]. As an example, consider considertheproblemofchangerecognitioninRDFdatasets, i.e., 2 the problem of identifying, and when possible give semantics to, the detected number of changes between the versions 12.07 and 1 the changes that led from one version of an RDF dataset to an- 13.05,and13.05and13.07ofAtlas,whichare879.5Mand801.2M other. DespiteourRDFfocus,ourapproachissufficientlygeneral triples, respectively, while between the versions 3.7 and 3.8, and ] 3.8 and 3.9 of the English DBpedia are 20.7M and 9.3M triples toengulfdifferentdatamodelsthatcanbeencodedinRDF,such B (Figure7). asrelationalormulti-dimensional. Infact,weproposeaflexible, D Thisconstantevolutionposesseveralresearchproblems,which extendible and data-model-independent methodology of defining arerelatedtotheidentification,computation,storageandmanage- . changes that can capture the peculiarities and needs of different s ment of the evolving versions. An important question is how to c datamodelsandapplications, whilebeingformallyrobustdueto support complex changes, whose constituent changes are seem- [ thesatisfactionofthepropertiesofcompletenessandunambiguity. ingly unrelated and may occur on disparate pieces of data, but Further,weproposeanontologyofchangesforstoringthedetected 1 together as a whole they have a semantically coherent meaning changesthatallowsautomatedprocessingandanalysisofchanges, v for an application domain. Clearly, understanding and detecting cross-snapshotqueries(spanningacrossdifferentversions),aswell 2 thechangesofevolvingdatasetsisaprerequisitetoaddressthese asqueriesinvolvingbothchangesanddata. Todetectchangesand 5 problems. In particular, finding the differences (deltas) between populatesaidontology,weproposeacustomizabledetectionalgo- 6 datasets has been proved to play a crucial role in various cura- rithm,whichisapplicabletodifferentdatamodelsandapplications 2 tiontasks,suchasthesynchronizationofautonomouslydeveloped requiring the detection of custom, user-defined changes. Finally, 0 datasetsversions[3],orthevisualizationoftheevolutionhistoryof we provide a proof-of-concept application and evaluation of our 1. frameworkfordifferentdatamodels. adataset[12].Deltasarealsonecessaryincertainapplicationsthat 0 requireaccesstopreviousversionsofadatasettosupporthistori- 5 calorcross-snapshotqueries(e.g.,[19]),inorder,forexample,to 1 1. INTRODUCTION identifypaststatesofthedataset,understandtheevolutionprocess, : ordetectthesourceoferrorsinthecurrentmodelling. v With the growing complexity of the WWW, we face a com- Ingeneral,therearetwowaystorecordtheoccurringchanges. i pletely different way for creating, disseminating and consuming X The first is by constantly monitoring the dataset and logging any big volumes of information. Large-scale corporate, government, change, whichrequiresaclosedandcontrolledsystem, whereall r or even user-generated data are published and become available a changes pass through a dedicated application that records every toawidespectrumofusers. DBpedia, Freebase, YAGOandAt- changeasithappens. Amoreflexiblemethodincludesdetecting las are, among many others, examples of large data repositories, a posteriori the changes that happened. This avoids the need to whichstoreinformationaboutvariousentities,includingtheirre- haveacontrolledsystem, andallowsremoteusersofadatasetto lationships. Typically,datainsuchdatasetsarerepresentedusing identifychanges,eveniftheyhavenoaccesstotheactualchange theRDFmodel[11],inwhichinformationisstoredintriplesofthe processorknowledgethatithappened.Here,wefocusonthelatter form(subject,predicate,object),meaningthatsubjectisrelatedto case,whichismorechallengingandinteresting,eventhoughmost oftheproposedsolutions(namely,thedefinitionofalanguageof changes and the representation of the changes) are applicable in bothscenarios. Recognitionofchangesisbasedon3mainpillars,namelydefin- Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare ing,representinganddetectingchanges. Thefirstpillar(defining notmadeordistributedforprofitorcommercialadvantageandthatcopies changes)isrelatedtothedefinitionofalanguageofchanges,which bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to isbasedonthesetofchangesthatareunderstandablebythesystem. republish,topostonserversortoredistributetolists,requirespriorspecific Definingthelanguageofchangesincludesidentifyingthetypesof permissionand/orafee. changesthatwillbedetected,theirformaldefinition(e.g.,theirse- Copyright20XXACMX-XXXXX-XX-X/XX/XX...$15.00. manticsandparameters),andtakesintoaccounttheneedforcom- orcomplexchangesused. Thecustomizationfortheactual plex,domain-specificchanges.Ithasbeenarguedthatthelanguage languageofchangesemployedisbasedonthedefinitionof ofchangesasawhole,shouldbewell-behavedinthesenseofsatis- appropriateSPARQLqueries(oneperchange)thatcomeas fyingcertainproperties,namelybeingcompleteandunambiguous, parameterstothealgorithm, therebyallowingnewchanges so as to allow the generation of unique deltas in a deterministic tobeeasilydefined.DetailsaregiveninSection5. manner[14].ThedefinitionofchangesisgiveninSection3. • An additional contribution is the application of this work The second pillar (representing changes) is related to the rep- fordatamodelsotherthanthepureRDFmodel,namelythe resentationschemeusedforstoringthedetectedchanges. Thisis multi-dimensional model that has been transformed to the necessary to allow a persistent representation and storage of the RDF Data Cube vocabulary, while it could be also applied detectedchanges,aswellastosupportnavigationamongversions, similarlytotherelationalmodel. Thisapplicationismeant analysis of the deltas, cross-snapshot or historic queries, and the as a proof-of-concept that the proposed approach is indeed raising of changes as first class citizens in a multi-version repos- capableofsupportingdiverseneeds,andisnotexhaustivein itory. Our approach is based on the definition of an ontology of termsofthetypesofapplicationsormodelsthatourframe- changes that satisfies these goals. The representation scheme of worksupports. theproposedchangesisgiveninSection4. The third pillar (detecting changes) is related to the algorith- • Finally,weprovideexperimentsshowingthatthemostcru- mic definition of the detection process. This process identifies cialparameterinthechangerecognitionprocessisthetotal the changes applied between any two versions, based on the ac- number of detected changes rather than the size of the ex- tualchangedefinitionsinthelanguageofchanges, andstoresthe amined datasets. We prove also that the number of simple results in the ontology of changes. Our approach for change de- changesisproportionaltothenumberoftripleswhichwill tectionisbasedontheexecutionofappropriatelydefinedSPARQL be inserted into the ontology. Except from the evaluation queries,whoseanswersdeterminethedetectedchanges.Detailson aspect of our framework, our experiments offer a detailed thedetectionprocessaregiveninSection5. lookontheevolutionofreal-world,bigdatasets,aswerecord Amajorchallengerelatedtochangedetectionanddeltasisthat changesandprovideananalysisoftheirform. nolanguageofchangesissuitableforalldifferentapplicationsand data models. There are two reasons for that. First, the different Theapproachproposedinthispaperextendsourpreviouswork typesofdataavailableontheWebareoftenrepresentedusingdif- onchangedetection[14],byprovidingamoregenericchangedef- ferentmodels,whichcallfordifferentchanges. Second,different initionframework;here,weexperiencesignificantlyimprovedper- uses (or users) of the data may require a different set of changes formance(about1orderofmagnitude), scalability, aswellasin- being reported, or the definition of special types of changes that creasedgeneralityandapplicability. happenoftenorareimportantfortheoperationofthedata-driven application. Therefore,thereisnoone-size-fits-allsolutionforthe 2. PRELIMINARIES problemofchangerecognition. WeconsidertwodisjointsetsU, L, denotingtheURIsandlit- Inthiswork,weproposeaflexibleframeworkforchangerecog- erals(weignorehereblanknodesthatcanbeavoidedwhendata nition that can be applied to different data models and applica- are published according to the Linked Data paradigm). An RDF tions.Ourintentionisnottoprovideasinglesolution,butageneric triple [11] is a tuple of the form (subject, predicate, object) and methodologyfordefiningthethreecomponentsofamulti-version assertsthefactthatsubjectisassociatedwithobjectthroughpredi- repository,throughwhichdifferentsolutions,suitablefordifferent cate.ThesetT=U×U×(U∪L)isthesetofallRDFtriples.An needs,canbedeveloped. Eventhoughourframeworkisdesigned RDFdatasetDconsistsofasetofRDFtriples. Inthefollowing, forRDF,ourapproachisapplicabletoanydatamodelrepresentable wedenotebyD ,D theoldandnewversions,respectively, inRDF,i.e.,RDFisusedasaunifyingunderlyingmodeltoachieve old new ofadatasetD. thisgenerality. SPARQL1.1[17]istheofficialW3Crecommendationlanguage Inoverall,ourcontributionsareasfollows: forqueryingRDFgraphs. ThebuildingblockofaSPARQLstate- • Regarding the definition of changes, the objective of gen- mentisatriplepatterntpthatislikeanRDFtriple,butmayhave erality is met by providing two different types of changes, avariable(prefixedwithcharacter?)inanyofitssubject, predi- namelysimpleandcomplex(seeSection3).Theformertype cate,orobjectpositions;variablesaretakenfromaninfinitesetof ismeanttocapturefine-grainedchangesforthedatamodel variablesV,disjointfromthesetsU,L,sothesetoftriplepatterns athand,andshouldmeettherequirementsofcompleteness is:TP=(U∪V)×(U∪V)×(U∪L∪V).SPARQLtriplepat- and unambiguity. The latter type is meant to capture more ternscanbecombinedintographpatternsgp,usingoperatorslike coarse-grained,orspecialized,changesthatareusefulforthe join(“.”),optional(OPTIONAL)andunion(UNION)[1]andmay application at hand; this allows a customized behaviour of alsoincludeconditions(usingFILTER).Inthiswork,weareonly thechangedetectionprocess,dependingontheactualneeds interested in SELECT SPARQL queries, which are of the form: of the application. Complex changes are totally dynamic, “SELECT v ,...,v WHERE gp”, where n > 0, v ∈ V 1 n i and can be defined even at run-time, greatly enhancing the andgpisagraphpattern. flexibilityofourapproach. FortheevaluationofSPARQLqueries,wefollowthesemantics discussedin[15,1]. Evaluationisbasedonmappings,whichare • Forthepurposesofrepresentation,theproposedontologyof partialfunctionsµ:V(cid:55)→U∪LthatassociatevariableswithURIs changes is designed to be generic enough, so as to be cus- orliterals(abusingnotation, µ(tp)isusedtodenotetheresultof tomizable for different languages of changes. This allows replacingthevariablesintpwiththeirassignedvaluesaccording ustomeettheaforementionedgeneralitygoal,asdetailedin to µ). Then, the evaluation of a SPARQL triple pattern tp on a Section4. datasetDreturnsasetofmappings(denotedby[[tp]]D)suchthat • Thedetectionprocessisdefinedinacustomizablemanner,as µ(tp) ∈ D for µ ∈ [[tp]]D. This idea is extended to graph pat- thecoredetectionalgorithmisagnostictothesetofsimple terns by considering the semantics of the various operators (e.g., [[tp UNION tp ]]D = [[tp ]]D ∪[[tp ]]D). GivenaSPARQL that are directly associated with said change and are assumed to 1 2 1 2 query“SELECT v ,...,v WHERE gp”,itsresultwhenap- becapturedbythesimplechange. Forexample,theabovechange 1 n pliedonDis(µ(v ),...,µ(v ))forµ∈[[gp]]D. Fortheprecise wouldbeassociatedwiththetriple(m,rdfs:range,t)(whichin- 1 n semanticsandfurtherdetailsontheevaluationofSPARQLqueries, dicates that t is the type of measure m); if said triple is found in thereaderisreferredto[15,1]. D ,butnotinD ,thenthisindicatesthatanewtype(t)has new old beenattachedtomeasurem(thus,Attach_Type_To_Measure(m,t) 3. DEFININGCHANGES shouldbedetected). Triplesofthistypearerequiredtobeinoneversionbutnotin Our purpose in this work is to provide a change recognition theother(i.e.,inlow-leveldelta)andaredirectlyassociated(con- method,which,giventwo(subsequent)datasetversionsD ,D , old new sumed)bythecorrespondingsimplechange. Consumptioninthis would produce their delta (∆), i.e., a formal description of the respectmeansthatthecorrespondinglow-levelchangeiscaptured changes that were made to get from D to D . A delta is old new (described)bysaidsimplechange. basedonalanguageofchanges(L),i.e.,asetofformaldefinitions A simple change can also have a number of logical conditions ofthechangesthatthedeltacouldcontainand, subsequently, the requiredfordetection. Forexample,thetriple(m,rdfs:range,t) change detection method understands and detects; these changes mayalsoindicatethatadatatype(t)isattachedtoadimensionm. shouldcorrespondtotheevolutionprimitivesofthedatamodelun- TodeterminewhethertheinclusionofsuchatripleinD corre- derconsideration. new spondstoanAttach_Type_To_Measure(m,t)change(asopposedto Alanguageofchanges,initssimplestform,consistsofadditions anAttach_Datatype_To_Dimension(m,t)change),weneedacon- and deletions of elements (i.e., triples for the RDF case) from a ditionthatwilldeterminewhethermisameasureoradimension. dataset,i.e.,changesoftheformAdd(t)/Del(t). Suchadeltais Inthiscase,theAttach_Type_To_Measure(m,t)wouldrequirethe calledalow-leveldelta[22]andcanbeeasilycomputedasfollows: presenceofthetriple(m,rdf :type,qb : measureProperty)in ∆(D ,D )={Add(t)|t∈D \D }∪{Del(t)|t∈ old new new old D . Notethat,ingeneral,conditionsmayrefertoboththeold D \D }. new old new andthenewversion. Unlikeconsumedtriples,thetriplesappear- Low-levellanguages(anddeltas)areeasytodefineanddetect, inginconditionsarenotnecessaryaddedordeletedtriples; their andhaveseveralniceproperties[22]; however,therepresentation presenceisnecessaryineither(orboth)oftheversionsandtheir of changes at the level of (added/deleted) triples, leads to a syn- roleistodisambiguatebetweensimilarchanges. tacticdelta,whichdoesnotcapturethesemanticsofachangeand Moreformally,asimplechangeisdefinedasfollows: generatesresultsthatarenotintuitiveenoughforthehumanuser. Forexample,intheRDFcontext,theplaindeletionofanindividual DEFINITION 1. Asimplechangec(p1,...,pn)isdefinedasa (classinstance)wouldcorrespondtoamultitudeoftripledeletions tupleoftheform(cid:104)δ+,δ−,φ ,φ (cid:105)where: old new (namely,allthetriplesthatcontainthisURI,suchaspropertyin- • cisthenameandp ,...,p ∈V,n≥0,aretheparameters stances);listingallthesechangesdoesnotimmediatelyconveythe 1 n ofthechange, message that an individual was deleted, and the human observer may find it hard to decipher the intent of the actual changes that • δ+,δ− ⊆TParecalledtheconsumedaddedandconsumed tookplace[14]. Thisproblemisevenmoreseriousforotherdata deletedtriples,respectively,andaresetsoftriplepatterns, models (e.g., multi-dimensional) that are translated into RDF; in thiscase,thelow-leveldeltas,whicharedescribedinRDFjargon, • φold,φnew aregraphpatterns,calledtheconditionsrelated needtobe“translatedback”intotheoriginaldatamodel,making toDold,Dnew,respectively. decipheringevenmoredifficult. Toaddressthisproblem,high-leveldeltashavebeenproposed[14], Inourrunningexample,Attach_Type_To_Measure(m,t)hastwo which aim to describe changes at a more intuitive level, in order parameters(m,t)andδ+ ={(m,rdfs:range,t)},δ− =∅,φ = old to make them more human-understandable. In the above exam- “”,φ =“(m,rdf :type,qb:measureProperty)”. new ple,theoutputwouldbea“deleteindividual”change,thatimme- The structure of Definition 1 is used for defining the changes diately conveys the message that all triples that include a given thatthelanguageofchangesaccepts,butanyactualdetectionwill URI (individual) were deleted. The main idea behind achieving givespecificvalues(URIsorliterals)totheparametersm,tofthe thisistogrouplow-levelchangesintohigh-levelones,undersome change(e.g.,Attach_Type_To_Measure(dm-measure:meas7v8t,dm- conditions. Forthepurposesofthispaper,weorganizehigh-level type:int)).Thisiscapturedwiththefollowingnotion: changesintwomajortypes, namelysimpleandcomplex, eachof whichrepresentschangeswithacertaingranularityandroleinthe DEFINITION 2. Consider a change c(p1,...,pn). Then, an assignmentx ,...,x ofURIs/literalstovariablesp ,...,p is model.Detailsonthesetwotypesaregivenbelow. 1 n 1 n calledaninstantiationofcanddenotedbyc(x ,...,x ). 1 n 3.1 SimpleChanges Now we are in position to formally define the detectability of Ingeneral,theroleofsimplechangesistodescribechanges(evo- a change instantiation. Intuitively, a change instantiation corre- lutionprimitives)specifictothedatamodelathand; e.g., forthe sponds to a certain assignment of (some of) the variables in δ+, multi-dimensionalmodel,wewouldexpectchangeslike“Add_Di- δ−,φ ,φ ;theassignmentshouldbesuchthattheconditions old new mension”or“Attach_Type_To_Measure”. Usingsimplechanges, (φ ,φ )aretrueintheunderlyingdatasets,andthetriplesthat old new theusercanabstractfromthesyntacticalandrepresentationalpecu- correspond to the triple patterns in δ+, δ− are found in the low- liaritiesoftheunderlyingdatamodel(includingitspossibletrans- leveldelta.Moreprecisely: lationtoRDFformat),therebymakingdeltasmoreintuitive. Asimplechange(e.g.,Attach_Type_To_Measure(m,t)),iscom- DEFINITION 3. Achangeinstantiationc(x1,...,xn)ofasim- posedofthechangename(i.e.,Attach_Type_To_Measure)andthe ple change c(p ,...,p ) is detectable for the pair D ,D 1 n old new changeparameters(i.e.,(m,t)).Itsmaincharacteristicisthetriples iff there is a µ ∈ [[φold]]Dold ∩[[φnew]]Dnew such that for all that would be added/deleted from the RDF representation of the tp ∈ δ+: µ(tp) ∈ D \D andforalltp ∈ δ−: µ(tp) ∈ new old datasetwhensuchachangeoccurs.Thesecorrespondtothetriples D \D andforalli:µ(p )=x . old new i i changescanalsobeusedtodefinechangesthatwerenotforeseen atdesign-time. Thiswouldgiveanextraflexibilitytotheuserto describemorespecialized,coarse-grainedchanges,whicharerele- vanttoacertainapplication,butnotgenericenoughtobeconsid- eredpartofthe“core”languageof(simple)changes. Thisessen- tiallyallowsextendingthelanguageofchangesatrun-time. Complex changes are defined in a way similar to simple ones; there are certain differences though. First, complex changes are notdirectlyassociatedwithlowlevelchanges;insteadofδ+,δ−, theyincludeasetofsimplechanges,denotedbyδs,whichcorre- sponds to the set of simple changes that should be detectable in order for said complex change to be detectable. This is neces- sary to make their definition more user-friendly, given that com- plex changes will be defined by the end-user who does not un- derstand low level changes. Second, as complex changes can be freely defined by the user, it would be unrealistic to assume that theywillhaveanyqualityguarantees,suchascompletenessorun- ambiguity. As a consequence, the detection process may lead to non-deterministicconsumptionofsimplechangesandconflicts;to Figure1:VisualizationofCompletenessandUnambiguity avoid this, complex changes are associated with a priority level, whichisusedtoresolvesuchconflicts. Furthermore, complex changes support associations, i.e., cor- Then,adetectablechangeconsumesthetriplesthatcorrespond tothetriplepatternsinδ+,δ−.Formally: respondences between URIs/literals that are necessary to support changes like renames and merge/splits [14]. In general, an asso- DEFINITION 4. Adetectablechangeinstantiationc(x1,...,xn) ciation α is a pair (X,Y) where X,Y are sets of URIs, literals ofasimplechangec(p ,...,p )consumest∈D \D (re- orvariables(calledURI/literal/variableassociation,respectively). 1 n new old spectively, t ∈ Dold \Dnew) iff there is a µ ∈ [[φold]]Dold ∩ Associationscanhaveoneofthefollowingforms(wherevi may [[φ ]]Dnew and a tp ∈ δ+ (respectively, tp ∈ δ−) such that beaURI,literalorvariable): new µ(tp)=tandforalli:µ(p )=x . i i • {v }(cid:32){v },v (cid:54)=v (rename) 1 2 1 2 The concept of consumption represents the fact that low-level changesare“assigned”tosimpleones,essentiallyallowingagroup- • {v0}(cid:32){v1,...,vn},vi (cid:54)=vj fori(cid:54)=j(split) ing(partitioning)oflow-levelchangesintosimpleones. Tofulfil itspurpose,this“partitioning”shouldbeperfect,inthesensethat • {v1,...,vn}(cid:32){v0},vi (cid:54)=vj fori(cid:54)=j(merge) all low-level changes should be associated to one, and only one, A set of URI and literal associations should be provided as an simplechange. Thisiscapturedbythepropertiesofcompleteness inputtothedetectionprocess.Suchassociationscouldbeprovided andunambiguity.Formally: manuallybytheuser,orautomaticallyidentifiedusing,e.g.,entity DEFINITION 5. ConsiderasetofsimplechangesC. Thisset matchingsoftware[20];detectingassociationsisoutofthescopeof iscalledcompleteiffforanypairofversionsD ,D andfor thispaper.Givenamappingµandavariableassociationα,wewill old new allt∈(D \D )∪(D \D ),thereisaninstantiation denotebyµ(α)theURI/literalassociationresultingfromreplacing new old old new c(x ,...,x )ofsomec∈Csuchthatc(x ,...,x )isdetectable allvariablesinαwiththeircorrespondingURI/literalaccordingto 1 n 1 n andconsumest. µ. Complexchangesareusefulinseveralapplications. Forexam- DEFINITION 6. ConsiderasetofsimplechangesC. Thisset ple,variousontologicalapplicationsrefrainfromdeletingclasses, iscalledunambiguousiffforanypairofversionsD ,D and old new but use a special subsumption relationship to a class “Obsolete” for all t ∈ (D \D )∪(D \D ), if c,c(cid:48) ∈ C and new old old new toindicatethatacertainclass(say, cl)shouldnolongerbeused. c(x ,...,x ),c(cid:48)(x(cid:48),...,x(cid:48) )aredetectableandconsumet,then 1 n 1 m ThisactioncouldbecapturedbythechangeMark_as_Obsolete(cl), c(x ,...,x )=c(cid:48)(x(cid:48),...,x(cid:48) ). 1 n 1 m whichwouldconsumethesimplechangeAdd_SuperClass(cl,obs), whereobs=geneontology:ObsoleteClass. Inanutshell,completenessguaranteesthatalllowlevelchanges Theformalconceptsassociatedwithcomplexchangesaresimi- areassociatedwithatleastonesimplechange,therebymakingthe lartotheonesrelatedtosimplechanges: reporteddeltacomplete(i.e.,notmissinganychange);unambigu- ity guarantees that no race conditions will emerge between sim- plechangesattemptingtoconsumethesamelowlevelchange(see DEFINITION 7. Acomplexchangec(p1,...,pn)isdefinedas atupleoftheform(cid:104)δs,φ ,φ ,A,P(cid:105)where: Figure1foravisualizationofthenotionsofcompletenessandun- old new ambiguity). The combination of these two properties guarantees • cisthenameandp ,...,p ∈V,n≥0,aretheparameters thatthedeltaisproducedinadeterministicmannerandthatitwill 1 n ofthechange, properlyreflectthechangesthatwereactuallyperformed. 3.2 ComplexChanges • δsisasetofsimplechangescalledtherelatedsimplechanges, Toguaranteecompletenessandunambiguity,simplechangesmust • φ ,φ aregraphpatterns,calledtheconditionsrelated old new be defined at design time, and remain immutable between differ- toD ,D ,respectively, old new entapplications;forchangescorrespondingtocustom,application- specific evolution primitives, we use complex changes. Complex • Aisasetofvariableassociations, • P isanumbercorrespondingtotheprioritylevelofthecom- plexchange. Inourrunningexample,Mark_as_Obsolete(cl)hasoneparam- eter(cl)andδs = {Add_SuperClass(cl,obs)}, φ =“obs= old geneontology:ObsoleteClass”,φ =“”,A=∅,P =2. new The definition of complex change instantiations is identical to Definition2.Fordetectability,wehaveaseriesofdefinitions,soas totakeintoaccountpriorities: DEFINITION 8. Achangeinstantiationc(x1,...,xn)ofacom- plexchangec(p ,...,p )isinitiallydetectableforthepairD , 1 n old DnewandtheassociationsADold,Dnewiffthereisaµ∈[[φold]]Dold∩ [[φ ]]Dnewsuchthatforallc(cid:48)(µ(p ),...,µ(p ))∈δs,c(cid:48)(µ(p ), new 1 n 1 ...,µ(pn))isdetectable,forallα∈A,µ(α)∈ADold,Dnew and Figure2:RepresentationofSimpleChanges foralli,µ(p )=x . i i DEFINITION 9. Aninitiallydetectablechangeinstantiationc(x1, appearattheinstancelevel,instantiatingthecorrespondingclasses. ...,x ) of a complex change c(p ,...,p ) consumes a simple Specifically,atschemalevel,weintroduceoneclassforeachsim- n 1 n changeinstantiationc(cid:48)(x(cid:48)1,...,x(cid:48)m)ofasimplechangec(cid:48)(p(cid:48)1,..., pleandcomplexchangec(p1,...,pn)thatisunderstoodandcon- p(cid:48)m) iff c(cid:48)(p(cid:48)1,...,p(cid:48)m) ∈ δs and there is a µ ∈ [[φold]]Dold ∩ sideredbythelanguageofchanges(L),andatinstancelevel,we [[φnew]]Dnew suchthatforallα ∈ A,µ(α) ∈ ADold,Dnew and introduceoneindividualforeachdetectablechangec(x1,...,xn) foralli,µ(p )=x ,µ(p(cid:48))=x(cid:48). ineachpairofversions. i i i i DEFINITION 10. Achangeinstantiationc(x1,...,xn)ofacom- 4.2 OntologyofChanges: SchemaLevel plex change c(p ,...,p ) is detectable for the pair D , D 1 n old new Forsimplechanges,thecorrespondingchangedefinitionismod- andtheassociationsADold,Dnew iffitisinitiallydetectableforthe elledasasubclassofthemainclassSimple_Change,andisas- pairDold,DnewandtheassociationsADold,Dnew andthereisno sociatedwithadequate propertiestorepresentits parameters (see initiallydetectablechangeinstantiationofanothercomplexchange Figure 2); each such property models the type of the parameter withahigherprioritythatconsumesthesamesimplechange. (e.g.,whetheritisaURIoraliteral,viatheclassesrdfs:Resource, rdfs:Literalrespectively)anditsname(whichisadescriptivename 4. REPRESENTINGCHANGES that allows a more intuitive interaction with the user, useful also duringtheconstructionofcomplexchangesthatconsumesaidsim- 4.1 MotivationfortheOntologyofChanges plechange).Alsoadescriptivename(cname)iscapturedforeach Wetreatchangesasfirst-classcitizensinordertobeabletoper- typeofchange.Notethatconditionsdonotneedtobestored. formqueriesanalysingtheevolutionofdatasets.Further,wearein- Forcomplexchanges,similarideasareused;however,notethat terestedinperformingcombinedqueries,inwhichboththedatasets theinformationrelatedtocomplexchangesisgeneratedonthefly andthechangesshouldbeconsideredtogetananswer.Toachieve atchangecreationtime(incontrasttosimplechanges,whicharein- this,therepresentationofthechangesthataredetectedonthedata builtintheontologyatdesigntime).Inaddition,moredetailedin- cannotbeseparatedfromthedataitself. formationrelatedtothechangeshouldbeavailableintheontology Forexample,considerthefollowingquery:“returnallcountries forthechangedetectionprocess,because,unlikesimplechanges, forwhichtheunemploymentrateoftheircapitalcityincreasedat thisinformationisnotknownatdesigntimeandcannotbeembed- a rate higher than the average increase of the country as a whole dedinthecode. betweenversionsD andD ”. Thisqueryrequiresaccessto Eachcomplexchangeismodelledasasubclassofthemainclass old new thedata(toidentifycountriesandcapitals)andtothechanges(to Complex_Change,andisassociatedwithadequatepropertiesto describetheactualincreaseintheratesofunemploymentforeach representitsparameters(seeFigure3). Again,parametershavea cityandcountry). Therefore, toanswerit, thechangesshouldbe typeandaname,whicharemodelledinthesamemannerasinsim- storedinastructuredformandtheirrepresentationshouldinclude plechanges. Inaddition,eachcomplexchangeisassociatedwith connectionswiththeactualentities(citiesorcapitals)thattheyre- thesimplechange(s)thatitconsumes,andincludesalsoproperties ferto. Notethatthedefinitionofacross-snapshotquerylanguage describingits(user-defined)descriptivenameanditspriority. thattreatschangesasfirst-classcitizensisoutsidethescopeofthis Finally, foreachcomplexchange, theSPARQLqueryusedfor paper, however this work provides a formalism on which such a itsdetection,isautomaticallygeneratedatchangedefinitiontime; querylanguagecanbebased. Therefore,weproposerepresenting thisisdoneforefficiency,toavoidhavingtogeneratethisqueryin changesasspecialentitiesinanRDFdataset,withconnectionsto everyrunofthedetectionprocess. TheSPARQLqueryencapsu- theactualdata,sothatadetectablechangecanbeassociatedwith latesalltherelevantinformationaboutthedetectionofthechange, thecorrespondingdataentitiesthatitrefersto. sofurtherinformationonthecomplexchange(namelyconditions Morespecifically,weproposeanontologyofchangesforstoring andassociations)neednottobestored. the detected changes, thereby allowing a supervisory look of the 4.3 OntologyofChanges: DataLevel detectedchangesandtheirassociationwiththeentitiestheyreferto intheactualdatasets,facilitatingtheformulationandtheanswering Theinstancelevelisusedtostorethedetectablesimpleandcom- ofqueriesthatrefertoboththedataandtheirevolution. plex changes. Each detectable change between any two versions Insaidontology,theschemadescribesthedefinitionofthechanges isrepresentedusingadifferentindividualassociatedthroughade- (c(p ,...,p )–seeDefinitions1,7),whereasinformationonde- quatepropertieswithallrelevantinformation,namely,theversions 1 n tectedchanges(whicharechangeinstantiations–seeDefinition2) betweenwhichthechangewasdetectedandtheexactvaluesofits Figure6:RepresentationofAssociations Figure3:RepresentationofComplexChanges association. Forexample,Figure6showstherepresentationofthe associations{x1} (cid:32) {x2}and{y1} (cid:32) {y2,y3},whichappear betweenversionsv1,v2. 5. DETECTINGANDSTORINGCHANGES Thechangedetectionprocessisresponsibleforthedetectionof simple and complex changes between two dataset versions (and their corresponding associations, which are a necessary input for the detection of complex changes), as well as for the enrichment oftheontologyofchangeswithinformationaboutthedetectable changes. Thus,theprocesscanbeconsideredtocompriseoftwo steps:thefirstistheidentificationofthedetectablechangesandthe correspondinginformation(triples)tobeinsertedintheontologyof changes(triplecreation),andthesecondistheactualingestionof Figure4:RepresentationofSimpleChangeDetection saidinformationintheontologyofchanges(tripleingestion). Todetectsimpleandcomplexchanges,werelyonplainSPARQL parameters(ofthechangeinstantiation).Forcomplexchanges,we queries,whicharegeneratedfromtheinformationdrawnfromthe additionallyneedtoincludeconsumptioninformation. Examples definition of the corresponding changes (Definitions 3, 10). For ofsuchinstantiationsforsimpleandcomplexchangesareshownin simple changes, this information is known at design time, so the Figures4and5,forthedetectablechangesAttach_Type_To_Mea- query is loaded from a configuration file, whereas for complex sure(dm−measure:meas7v8t,dm−type:int)and changes,thecorrespondingqueryisgeneratedonceatchange-creation Mark_as_Obsolete(efo:EFO_0004151),respectively. time (run-time) and is loaded from the ontology of changes (see Figure3).Theresultsofthegeneratedqueriesdeterminethechange 4.4 OntologyofChanges: Associations instantiations that are detectable; this information is further pro- Anotherusefulfeatureoftheontologyofchangesisthestorage cessedtodeterminetheactualtriplestobeinsertedintheontology ofassociations,whicharenecessaryforthedetectionofcomplex ofchanges. Recallthatthisdependsontheactualdetectedchange changes. Associationsarestoredasinstancesclassifiedunderthe anditstype(seeSection4). classAssociation,andrecordtheversionsbetweenwhichtheyare Morespecifically,theSPARQLqueriesusedfordetectingasim- applicable,aswellastheoldandnewvaluesinthecorresponding ple change are SELECT queries, whose returned values are the parameter values of the change; thus, for each parameter of the changedefinition,weputonevariableintheSELECTclause.Then, the WHERE clause of the query includes the triple patterns that should (or should not) be found in each of the versions in order for a change instantiation to be detectable; more specifically, the triplepatternsinδ+ mustbefoundinD butnotinD , the new old triplepatternsinδ− mustbefoundinD butnotinD ,and old new thegraphpatternsinφ ,φ shouldbeappliedinD ,D , old new old new respectively. Let’susethismethodologytoconstructtheSPARQLqueryfor thesimplechangeAttach_Type_To_Measure(m,t)inwhich:δ+ = {(m,rdfs:range,t)},δ− =∅,φ =“”,φ =“(m,rdf :type, old new qb:measureProperty)”.ThecorrespondingSPARQLqueryis: SELECT ?m ?t WHERE { GRAPH D { ?m rdf:type qb:MeasureProperty. new ?m rdfs:range ?t. } FILTER NOT EXISTS { GRAPH D { ?m rdfs:range ?t. } } old } Figure5:RepresentationofComplexChangeDetection Eachresultrowofthisquerycorrespondstotheparametervalues Extensibility: Theadditionofanewchange(simpleorcomplex) ofonedetectablechangeinstantiation;forexample,ifthequeryre- requiresjusttheadditionofthecorrespondingSPARQLquery, turnsthepair(dm-measure:meas7v8t,dm-type:int),thenthechange withnofurthermodificationsonthesourcecode. instantiationAttach_Type_To_Measure(dm-measure:meas7v8t,dm- DataModelInsensitivity: Theprocesscanbeappliedinanydata type:int)isdetectable. model (e.g., multi-dimensional) which can be expressed in ThegenerationoftheSPARQLqueriesforthecomplexchanges followsasimilarpattern. OnedifferenceisthattheWHEREclause RDF(seeSection6)ignoringitsinternalrepresentation;all peculiarities of the underlying data model are incorporated shouldalsoincludetherequiredassociations,whichareassumedto inthecorrespondingSPARQLqueries,therebyallowingthe bealreadystoredintheontologyofchanges. Asecondimportant codetobeversatile. difference is that complex changes check the existence of simple changes(namely,thosefoundintheδspartofthecomplexchange Cross-PlatformCharacter: TheprocessrequiresverygenericRDF definition)intheontologyofchanges,ratherthantriplesinthetwo datamanagementoperations(SPARQLsupportandRDFdata versions(asisthecasewithsimplechangesdetection);therefore, import),soitcanbeimplementedinanytriplestore. complexchangesshouldbedetectedafterthedetectionofsimple changesandtheirstorageintheontology.Notealsothattheconsid- 6. APPLICATIONOFTHEFRAMEWORK eredsimplechangesshouldnothavebeenmarkedas“consumed” byotherdetectablechangesofahigherpriority;thus,itisimpor- Asmentionedabove,ourframeworkisflexibleandgenericenough tantforqueriesassociatedwithcomplexchangestobeexecutedin tobeapplicabletoanydatamodel, whichistransformabletothe aparticularorder,asimpliedbytheirpriority. RDF format. Thus, to apply our framework to any given data Let’sillustratetheabovemethodologytoconstructtheSPARQL model,onefirsthastodeterminehowtotransformsaiddatamodel queryforthecomplexchangeMark_as_Obsolete(cl). Thischange in RDF. The second step is the definition of the simple changes has one parameter (cl) and δs = {Add_SuperClass(cl,obs)}, that would comprise the language of changes (L); note that said φ =“obs=geneontology:ObsoleteClass”,φ =“”,A=∅, changesshouldbecompleteandunambiguousforoptimalresults. old new P =2.ThecorrespondingSPARQLqueryis: DefiningthesimplechangesamountstoidentifyingtheSPARQL querythatisnecessaryfordetection,aswellastherepresentation SELECT ?cl WHERE { ofthesimplechangesintheschemaofthechangesontology;the GRAPH <changesOntology> { lattercanbeeasilyautomated. ?asc a co:Add_Superclass; Complexchangesneednotbedefinedatthispoint,astheyare co:asc_p1 ?cl; createdatrun-time;tofacilitatethedefinitionofcomplexchanges co:asc_p2 ?obs. onecouldcreateanadequateuser-friendlyinterface,thatcould(po- FILTER NOT EXISTS { ?cc co:consumes ?asc } tentially) sacrifice some of the expressive power of the complex FILTER (?obs = “geneontology:ObsoleteClass”). change definition (Definition 7) in favour of user-friendliness; at } anyrate,thedefinitionofsuchaninterfaceisbeyondthescopeof } thiswork. As with simple changes, each query result corresponds to the Indicatively,wedemonstrate,forvisualizationandexperimenta- parameter values of a detectable change instantiation; for exam- tionpurposes,theaboveprocessfortheRDFandmulti-dimensional ple, if the query finds the simple change Add_Superclass(efo: data models; a similar methodology could be used for other data EFO_0004151,geneontology:ObsoleteClass)whichis models,e.g.,relational,butfurtherapplicationsofourframework not consumed by any complex change, then it will return efo: areomittedduetospacelimitations. EFO_0004151,whichimpliesthatthechangeinstantiation Mark_as_Obsolete(efo:EFO_0004151)isdetectable. 6.1 RDFModel Followingdetection,theinformationaboutthedetectable(sim- TheRDFdatamodelisapopularmodelforexposing,sharing, pleorcomplex)changeinstantiationsmustbestoredintheontol- andconnectingpiecesofdata,information,andknowledgeonthe ogyofchanges. Todoso, wehavetoprocesseachresultrowto SemanticWeb. ScientistsfromvariousareasofexpertiseuseRDF create the corresponding triple blocks, as specified in Section 4. to publish their scientific observations and measurements on the Thisisdoneasaseparateprocessthatfirststoresthetripleblocks Web. inafile(ondisk)andsubsequentlyuploadstheminVirtuosousing The case of RDF is straightforward in the sense that no data Virtuoso’bulkloadingprocess(tripleingestion). transformation is required. For the definition of simple changes, Notethatthedetectionandstoringofchangescouldbedonein weusedasetverysimilartotheso-called“basicchanges”in[14]. onestep,ifoneusedanadequatelydefinedSPARQLupdatestate- Thefulllistofdefinedsimplechanges,alongwithdetailsontheir ment that identified the detectable change instantiations, created definitionandthecorrespondingSPARQLqueriesusedfordetec- thecorrespondingtripleblocksandinsertedthemintheontology tioncanbefoundinAppendixB. usingasinglestatement. However,thisapproachturnedouttobe slowerby1-2ordersofmagnitude,partlybecauseitdoesnotex- 6.2 Multi-dimensionalModel ploitbulkupdatesbasedonmultiplethreads,andalsobecausebulk Themulti-dimensionaldatamodelisusefulforcapturingstatis- loadingismuchfasterthaninsertingtriplesusingSPARQLupdates ticsandinformationondataitemsalongseveraldimensions. The inVirtuoso. bridgetoRDFisachievedthroughtheDataCubevocabulary1which Thedescribedprocessenjoysthecharacteristicsputforwardin was proposed as a W3C recommendation to enable the publica- theprevioussections,namely: tionofstatisticaldataflowsandothermulti-dimensionaldatasets overtheWeb.TheDataCubevocabularyallowsmulti-dimensional data to enjoy all the advantages of the LOD and RDF technolo- Expressiveness: The algorithm supports the entire semantics of gies(knowledgesharingandinterconnection)viaitspublicationto SPARQL,therebyallowingaveryexpressivemethodtode- finechanges. 1http://www.w3.org/TR/vocab-data-cube triple-store). VirtuosoishostedonamachinewhichusesanIntel Table1:EvaluatedDatasets:VersionsandSizes XeonE5-2630at2.30GHz,with384GBofRAMrunningDebian Dataset Version #Triples Linuxwheezeversion,withLinuxkernel3.16.4. Thesystemuses RDFdatasets 7TBRAID-5HDDconfigurations.Fromthetotalamountofmem- v12.07 457.951.940 ATLAS v13.05 422.144.126 ory,wededicated64GBforVirtuosoand5GBfortheimplemented v13.07 447.149.655 application. Moreover, takingintoaccountthatCPUprovides12 v3.7 48.898.490 coreswith2threadseach,wedecidedtouseamulti-threadedim- Dbpedia v3.8 63.126.304 plementation;specifically,wenoticedthattheuseof8threadsdur- v3.9 67.980.265 ingthecreationoftheRDFtriplesalongwiththeingestionprocess 24-03-2009 189.378 gaveusoptimalresultsforoursetting.Thiswasonemorereasonto GO 22-09-2009 195.125 selectVirtuosoforourimplementation,asitallowstheconcurrent 20-04-2010 210.076 Multi-dimensionaldatasets useofmultiplethreadsduringingestion.Toeliminatetheeffectsof v1 229.338.925 hot/coldstarts,cachedOSinformationetc.,eachchangedetection 4lqc v2 243.336.594 processwasexecuted10timesandtheaveragetimeswereconsid- v3 243.449.874 ered. v1 4.022.424 For our experimental evaluation, we used 3 RDF and 3 multi- 1zph v2 4.022.352 dimensionaldatasets. TheselectedRDFdatasetswereATLAS3,a v3 4.022.208 subsetoftheEnglishDbpedia4(consistingofarticlecategories,in- v1 259.125 1bu4 v2 270.049 stancetypes,labelsandmapping-basedproperties)andGO5.The v3 270.049 selectedmulti-dimensionaldatasetswereprovidedbyDatamarket andcontainvariousstatistics;inparticular,4lqc6containsweather measurements from the Global Historical Climatology Network, RDF,therebyallowingpublishersorthirdpartiestoannotateand 1buh7 containsnumbersofemployedpersonsbetween15and64 linktospecifieddata,whichcanbeflexiblycombinedacrossdif- yearsoldtakingtimeoffoverthelast12monthsforfamilysickness ferentdatasets. or emergencies and 1zph8 contains information about unemploy- To apply our framework to the multi-dimensional model, we mentbysex,age,durationofunemploymentandregistration. All adopt the transformation proposed by the Data Cube vocabulary. thesemeasurementsaretakenfromEurostat. Table1summarizes Thedefinitionofsimplechangesisbasedontheidentificationof thecorrespondingversionsandtheirsizes. all entities of the Data Cube vocabulary (i.e., fact tables, dimen- sions,observations,codelists,hierarchies,measures,attributes)in 7.2 DetectedChanges ordertoexpressspecialorgenericcasesofchangesthatappearin Thefirstpartofourevaluationstudiesthenumberandtypeof eachofthoseentities. Toachievecompletenessandunambiguity simple changes that appear in the evaluated datasets. The results inourdefinitionofsimplechanges,wetakeintoaccounthowdata are summarized in Figure 7, where we note the large number of cubeentitiesareconnectedandwhichsetsoftriplepatternsdenote changeswhichoccurredduringATLASevolutioncomparedtothe theexistence/characteristicsofthecorrespondingentities. Again, otherdatasets. ThisisexplainedbythefactthatATLAScontains thefulllistofthedefinedsimplechangesforthemulti-dimensional experimentalbiologicalresultsandmeasurementsthatchangeover model,alongwithdetailsontheirdefinitionandthecorresponding time, thus new versions are vastly different from previous ones. SPARQLqueriesusedfordetectioncanbefoundinAppendixC. Moreover,notethatthemajorityofchanges(inalldatasetsexcept GO)areappliedtothedatalevel(e.g.,Delete_Property_Instance), 7. EXPERIMENTALEVALUATION whereas in GO, we have also changes which are applied to the schema. Ourframeworkwasimplementedandappliedonthedatamod- Onthemulti-dimensionalmodel,themajorityofchangesinall elsdescribedinSection6. Theevaluationconsideredsomerepre- datasetsareoftypeAdd_Dimension_Value_to_ObservationandDe- sentativereal-worlddatasetsofvarioussizesfromthesetwodata lete_Dimension_Value_from_Observation.Forbothtypesofchan- models,anditsaimswerethefollowing: ges, we observe that almost the same number of changes is re- • Toidentifythenumberandtypeofsimplechangesthatusu- ported, whichimpliesthatthedatasets’evolutionmainlyconsists allyoccurinrealworldsettings. ofchangingdimensionvaluesuponobservations; thus,creatinga complexchangetocapturethispairofchangeswouldimmediately • To study the performance of our change detection process halvethenumberofreportedchanges. whendealingwithrealsettingsandquantifytheeffectofthe size of the compared versions and the number of detected 7.3 PerformanceofChangeDetection changesintheperformanceofthealgorithm. Thesecondpartofourevaluationexaminestheperformanceof thedetectionprocessfortheRDFandmulti-dimensionaltestbeds; Toourknowledge, thisisthefirsttimethatchangedetectionhas theresultsappearinTable2.Thereportedperformanceissplitinto been evaluated for datasets of this size and versatility (RDF and twoparts,namelytriplecreationandtripleingestion;asexplained multi-dimensional). in Section 5, the former includes the execution of the SPARQL 7.1 Setting 3http://www.ebi.ac.uk/gxa/home Forthemanagementoflinkeddata(e.g.,storageofdatasetsand 4http://dbpedia.org query execution), we worked with a scalable triple store, namely 5http://geneontology.org theopensourceversionofVirtuosoUniversalServer2,v7.10.3209 6https://datamarket.com/data/set/4lqc (notethat,ourworkisnotboundedtoanyspecificinfrastructureor 7https://datamarket.com/data/set/1buh 2http://virtuoso.openlinksw.com 8https://datamarket.com/data/set/1zph Figure7:DetectedChanges(RDF/Multi-dimensional) Table2:PerformanceofChangeDetection(RDF/Multi-dimensional) DatasetsandVersions #SimpleChanges #IngestedTriples TripleCreation(sec) TripleIngestion(sec) Duration(sec) RDFdatasets ATLAS:v12.07-v13.05 879.512.096 4.980.846.857 7.294,73 9.654,46 16.949,20 ATLAS:v13.05-v13.07 801.278.339 4.539.113.055 5.953,07 9.243,61 15.196,68 DBpedia:v3.7-v3.8 20.752.280 116.327.560 465,26 143,49 608,74 DBpedia:v3.8-v3.9 9.343.769 50.620.418 296,20 72,94 369,14 GO:24-03-2009-20-04-2010 11.949 59.179 0,80 0,50 1,31 GO:22-09-2009-20-04-2010 25.299 124.262 0,96 0,57 1,52 Multi-dimensionaldatasets 4lqc:v1-v2 164.557.348 978.012.302 1.651,03 994,85 2.645,87 4lqc:v2-v3 162.318.804 973.837.296 1.645,03 996,96 2.641,99 1zph:v1-v2 4.469.294 26.815.814 129,38 46,91 176,29 1zph:v2-v3 4.469.262 26.815.494 131,25 47,33 178,58 1bu4:v1-v2 475.666 2.642.549 7,42 6,60 14,02 1bu4:v2-v3 323.644 1.941.848 5,12 5,97 11,09 queriesfordetectionandtheidentificationofthetriplestobein- containingblanknodes.Alltheseworksresultinnon-concise,low- sertedintheontologyofchanges, whereasthelatteristheactual leveldeltas,whicharedifficultforahumantounderstand. enrichmentoftheontologyofchanges. High-level change detection approaches provide more human- ThemainconclusionfromTable2isthatthenumberofsimple readabledeltas. Althoughthereisnoagreed-uponlistofchanges changesisamorecrucialfactorforperformancethanthesizesof thatarenecessaryforanygivencontext,varioushigh-levelopera- thecomparedversions(seealsoTable1).Thisobservationismore tions,alongwiththeintuitionbehindthem,havebeenproposed[7, clearintheDBpediadataset,wheretheevolutionbetweenv3.7and 13,16]. However,theseapproachesdonotpresentformalseman- v3.8producesabouttwicethenumberofchangesthantheevolution ticsofsuchoperations,orofthecorrespondingdetectionprocess; betweenv3.8andv3.9;despitethefactthatinthesecondcase,we thus,nousefulformalpropertiescanbeguaranteed. havetocomparelargerdatasetversions,theexecutiontimeinthe In[7,13],afixed-pointalgorithmfordetectingchanges,imple- formercaseisalmosttwiceaslarge.Notethatthisconclusionholds mented in PromptDiff, is described. The algorithm incorporates forbothtriplecreationandingestion. heuristic-basedmatcherstodetectchangesbetweentwoversions, More importantly, the performance of our approach is about 1 thusintroducinguncertaintyintheresults,andobtainingarecallof order of magnitude faster than the performance reported by [14], 96%andaprecisionof93%. uponwhichthisworkwasbuilt,evenifthereportedsimplechanges [16]proposestheChangeDefinitionLanguage(CDL)asameans arecompatibleinbothworks, astheusedsimplechangesforthe todefinehigh-levelchanges. Achangeisdefinedanddetectedus- RDF model are similar to the so-called “basic changes” in [14]. ing temporal queries over a version log that contains recordings Forexample,whencomparingtheperformanceofchangedetection oftheappliedlow-levelchanges. Theversionlogmustbeupdated overtheGOversionsv22-09-2009andv20-04-2010,ourapproach wheneverachangeoccurs;thisoverrulestheuseofthisapproachin needs1,52sec,while[14]requires33,13sec. non-curatedordistributedenvironments.Inourwork,versionlogs arenotnecessaryforthedetection,asthedeltacanbeproduceda posteriori. [2]focusesondefiningaformalwaytorepresenthigh- 8. RELATEDWORK levelchangesassequencesoftriples,butdoesnotdescribeadetec- In general, approaches for change detection can be classified tionprocessoraspecificlanguageofchanges.Finally,[5]proposes intolow-levelandhigh-levelones, basedonthetypesofchanges aninterestinghigh-levelchangedetectionalgorithmthattakesinto theysupport.Low-levelchangedetectionapproachesreportsimple accountthesemanticsofOWL. add/delete operations, which are not concise or intuitive enough to human users, while focusing on machine readability. [4] dis- 9. CONCLUSIONS cussesalow-leveldetectionapproachforpropositionalKnowledge Bases (KBs), which can be easily extended to apply to KBs rep- Inthispaper,weproposedanapproachtocopewiththedynam- resentedunderanyclassicalknowledgerepresentationformalism. icityofWebdatasetsviathemanagementofchangesbetweenver- This work presents a number of desirable formal properties for sions. Weadvocatedinfavourofaflexible,extendibleandtriple- changedetectionlanguages,likedeltauniqueness,reversibilityof storeindependentapproach, whichissuitableforanydatamodel changesandtheabilitytomovebackwardsandforwardsinthehis- that is representable in RDF terms. Our approach prescribes the toryofversionsusingthedeltas.Similarpropertiesappearin[22], definition of data model-specific, as well as application-specific, wherealow-levelchangedetectionformalismforRDFSdatasets changes,andtheirmanagement(definition,storage,detection)ina ispresented,aswellasin[14],uponwhichthisworkbuilds. mannerthatensuresthesatisfactionofformalproperties(likecom- [8]describesalow-levelchangedetectionapproachfortheDe- pletenessandunambiguity),theflexibilityandcustomizationofthe scription Logic EL; there, the focus is on a concept-based de- consideredchanges(viacomplexchanges,whichcanbedefinedat scription of changes, and the returned delta is a set of concepts run-time),aswellastheeasyconfigurationofascalabledetection whosepositionintheclasshierarchychanged. [9]presentsafor- mechanism(viaagenericalgorithmthatbuildsonSPARQLqueries mal low-level change detection approach for DL-Lite ontologies, thatcanbeeasilygeneratedfromthechanges’definitions). which focuses on a semantical description of the changes. Re- Thisworkbuildsuponourpreviousworkonthetopicofchange cently,[6]introducedascalableapproachforreasoning-awarelow- detection[14],byprovidingamoregenericchangedefinitionframe- levelchangedetectionthatusesarelationaldatabasemanagement work,thatissignificantlymorescalableandversatilethantheone system,while[10]supportschangedetectionbetweenRDFdatasets proposedin[14],asshowninSection7.