Front.Comput.Sci. DOI RESEARCHARTICLE A Survey of RDF Data Management Systems M.TamerÖZSU UniversityofWaterloo,CheritonSchoolofComputerScience,Canada 6 (cid:13)c HigherEducationPressandSpringer-VerlagBerlinHeidelberg2012 1 0 2 n Abstract RDF is increasingly being used to encode captured in the LOD cloud, is highly varied with many a J data for the semantic web and for data exchange. There different types of data from very many sources. 5 have been a large number of works that address RDF Although early (and still much of) RDF data sets are datamanagement. Inthispaperweprovideanoverview stationary, there is increasing interest in streaming RDF ] B oftheseworks. to the extent that W3C has now set up a community D interested in addressing problems of high velocity Keywords RDF,SPARQL,LinkedObjectData. . streaming data (www.w3.org/community/rsp/). s c RDF-encoded social network graphs that continually [ change is one example of streaming RDF. Even a 1 Introduction 1 benchmark has been proposed for this purpose [4]. v 7 The RDF (Resource Description Framework) is a W3C Finally, it is commonly accepted that RDF data is 0 standardthatwasproposedformodelingWebobjectsas “dirty”, meaning that it would contain inconsistencies. 7 partofdevelopingthesemanticweb. However,itsuseis This is primarily as a result of automatic extraction and 0 0 now wider than the semantic web. For example, Yago conversionofdataintoRDFformat; itisalsoafunction . 1 andDBPediaextractfactsfromWikipediaautomatically of the fact that, in the semantic web context, the same 0 and store them in RDF format to support structural data are contributed by different sources with different 6 queries over Wikipedia [1,2]; biologists encode their understandings. A line of research focuses on data 1 : experiments and results using RDF to communicate quality and cleaning of LOD data and RDF data in v i amongthemselvesleadingtoRDFdatacollections,such general[5,6]. X as Bio2RDF (bio2rdf.org) and Uniprot RDF r a (dev.isb-sib.ch/projects/ uniprot-rdf). Related to Because of these characteristics, RDF data semantic web, LOD (Linking Open Data) project builds management has been receiving considerable attention. a RDF data cloud by linking more than 3000 datasets, Inparticular,theexistenceofastandardquerylanguage, which currently have more than 84 billion triples1). A SPARQL, as defined by W3C, has given impetus to recentwork[3]showsthatthenumberofdatasourcesin works that focus on efficiently executing SPARQL LODhasdoubledwithinthreeyears(2011-2014). queries over large RDF data sets. In this paper, we RDFdatasetshaveallfouracceptedcharacteristicsof provide an overview of research efforts in management “big data”: volume, variety, velocity, and veracity. We of RDF data. We note that the amount of work in this hinted at increasing volumes above. RDF data, as areaisconsiderableanditisnotouraimtoreviewallof it. Wehopetohighlightthemainapproachesandreferto Receivedmonthdd,yyyy;acceptedmonthdd,yyyy some of the literature as examples. Furthermore, due to space considerations, we limit our focus on the E-mail:[email protected] 1)Thestatisticisreportedinhttp://stats.lod2.eu/. management of stationary RDF data, ignoring works on M.TamerÖZSU 2 otheraspects. Prefixes: mdb=http://data.linkedmdb.org/resource/geo=http://sws.geonames.org/ The organization of the paper is as follows. In the bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/ exvo=http://lexvo.org/id/ next section (Section 2) we provide a high level wp=http://en.wikipedia.org/wiki/ Subject Property Object overview of RDF and SPARQL to establish the mdb:film/2014 rdfs:label “TheShining” framework. In Section 3, the centralized approaches to mdb:film/2014 movie:initial_release_date “1980-05-23”’ mdb:film/2014 movie:director mdb:director/8476 RDFdatamanagementarediscussed,whereasSection4 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb:actor/30013 is devoted to a discussion of distributed RDF data mdb:film/2014 movie:music_contributor mdb:music_contributor/4110 management. Efforts in querying the LOD cloud is the mmddbb::fifillmm//22001144 fmooavf:ibea:rseelda_tendeBarook gbemo::02764335412647425 subjectofSection5. mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director_name “StanleyKubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “AClockworkOrange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” 2 RDFPrimer mdb:actor/29704 movie:actor_name “JackNicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “TheLastTycoon” In this section, we provide an overview of RDF and mdb:film/3418 movie:actor mdb:actor/29704 SPARQL. The objective of this presentation is not to mdb:film/3418 rdfs:label “ThePassenger” geo:2635167 gn:name “UnitedKingdom” cover RDF fully, but to establish a basic understanding geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United_Kingdom thatwillassistintheremainderofthepaper. Forafuller bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 treatment, we refer the reader to original sources of bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer RDF[7,8]. Thisdiscussionisbasedon[9]and[10]. lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA RDF models each “fact” as a set of triples (subject, lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn property (or predicate), object), denoted as (cid:104)s,p,o(cid:105), Fig.1 ExampleRDFdataset. Prefixesareusedtoidentifythedata where subject is an entity, class or blank node, a sources. property2) denotes one attribute associated with one entity, andobject isanentity, aclass, ablanknode, ora rdfs:Class and rdfs:subClassOf that are used to define a literalvalue. AccordingtotheRDFstandard,anentityis classandasubclass,respectively(anotherone,rdfs:label denoted by a URI (Uniform Resource Identifier) that isusedinourqueryexamplesbelow). Tospecifythatan refers to a named resource in the environment that is individual resource is an element of the class, a special being modelled. Blank nodes, by contrast, refer to property, rdf:type is used. For example, if we wanted to anonymous resources that do not have a name3). Thus, define a class called Movies and two subclasses each triple represents a named relationship; those ActionMovies and Dramas, this would be accomplished involving blank nodes simply indicate that “something inthefollowingway: withthegivenrelationshipexists,withoutnamingit”[7]. Moviesrdf:typerdfs:Class. It is possible to annotate RDF data with semantic ActionMoviesrdfs:subClassOfMovies. metadata using RDFS (RDF Schema) or OWL, both of Dramasrdfs:subClassOfMovies. which are W3C standards. This annotation primarily enablesreasoningovertheRDFdata(calledentailment), Definition 1 [RDF data set] Let U,B,L, and V denote that we do not consider in this paper. However, as we the sets of all URIs, blank nodes, literals, and variables, willseebelow,italsoimpactsdataorganizationinsome respectively.Atuple(s,p,o)∈(U∪B)×U×(U∪B∪L) cases, and the metadata can be used for semantic query isanRDFtriple. AsetofRDFtriplesformaRDFdata optimization. Weillustratethefundamentalconceptsby set. simple examples using RDFS, which allows the AnexampleRDFdatasetisshowninFigure1where definition of classes and class hierarchies. RDFS has the data comes from a number of sources as defined by built-inclassdefinitions–themoreimportantonesbeing theURIprefixes. RDFdatacanbemodeledasanRDFgraph,whichis 2)Inliterature,theterms“property”and“predicate”areusedinter- formallydefinedasfollows. changeably;inthispaper,wewilluse“property”consistently. 3)Inmuchoftheresearch,blanknodesareignored.Unlessexplicitly statedotherwise,wewillignoretheminthispaperaswell. Definition 2 [RDF graph] A RDF graph is a six-tuple SurveyofRDFDataMan.Syst. 3 G =(cid:104)V,L , f ,E,L , f (cid:105),where variables, respectively. A SPARQL expression is V V E E expressedrecursively 1. V = V ∪ V ∪ V is a collection of vertices that c e l correspond to all subjects and objects in RDF data, • Atriplepattern(U∪B∪V)×(U∪V)×(U∪B∪ where V , V , and V are collections of class L∪V)isaSPARQLexpression, c e l vertices, entity vertices, and literal vertices, • (optionally) If P is a SPARQL expression, then respectively. P FILTER R is also a SPARQL expression where 2. L isacollectionofvertexlabels. Risabuilt-inSPARQLfiltercondition, V 3. A vertex labeling function f : V → L is an • (optionally)If P and P areSPARQLexpressions, V V 1 2 bijectivefunctionthatassignstoeachvertexalabel. then P AND|OPT|OR P are also SPARQL 1 2 The label of a vertex u ∈ V is its literal value, and expressions. l thelabelofavertexu∈V ∪V isitscorresponding c e URI. A set of triple patterns is called basic graph pattern 4. E = {−u−−,−→u } is a collection of directed edges that (BGP)andSPARQLexpressionsthatonlycontainthese 1 2 connectthecorrespondingsubjectsandobjects. arecalledBGPqueries. Thesearethesubjectofmostof 5. L isacollectionofedgelabels. theresearchinSPARQLqueryevaluation. E 6. An edge labeling function f : E → L is an An example SPARQL query that finds the names of E E bijective function that assigns to each edge a label. the movies directed by “Stanley Kubrick” and have a The label of an edge e ∈ E is its corresponding relatedbookthathasaratinggreaterthan4.0isspecified property. asfollows: −−−−→ SELECT ?name An edge u ,u is an attribute property edge if u ∈ V; 1 2 2 l WHERE { otherwise,itisalinkedge. ?m rdfs: label ?name. ?m movie: director ?d. ?d movie:director_name "Stanley Kubrick". Figure 2 shows an example of an RDF graph. The ?m movie:relatedBook ?b. ?b rev: rating ?r. vertices that are denoted by boxes are entity or class FILTER(?r > 4.0) vertices,andtheothersareliteralvertices. } Inthisquery,thefirstthreelinesintheWHEREclause “ThePassenger” “TheLastTycoon” refs:label refs:label form a BGP consisting of five triple patterns. All triple mdb:film/3418 mdb:film/1267 patterns in this example have variables, such as “?m”, bm:offers/0743424425amazonOffer “UnitedKingdom” 62348447 “?name” and “?r”, and “?r” has a filter: FILTER(?r > scam:hasOffer movie:actormovie:actor gn:name gn:population bm:books/0743424425 geo:2635167 4.0). rev:rating mdb:actor/29704 A SPARQL query can also be represented as a query 4.7 movie:relatedBook foaf:based_near movie:actor_name movie:actor “JackNicholson”graph: “StanleyKubrick” “TheShining” refs:label mdb:film/2014 movie:director_name movie:director Definition4[SPARQLquerygraph]Aquerygraphisa movie:actor mdb:director/8476 movie:initial_release_date seven-tupleQ=(cid:104)VQ,LQ,EQ,LQ, fQ, fQ,FL(cid:105),where mdb:actor/30013 V E V E movie:director movie:director 1. VQ =VQ∪VQ∪VQ∪VQisacollectionofvertices mdb:film/2685 mdb:film/424 “1980-05-23” c e l p that correspond to all subjects and objects in a refs:label refs:label “AClockworkOrange” “Spartacus” SPARQL query, where VpQ is a collection of variable vertices (corresponding to variables in the Fig.2 RDFgraphcorrespondingtothedatasetinFigure1 query expression), and VQ and VQ and VQ are c e l collections of class vertices, entity vertices, and The W3C standard language for RDF is SPARQL, literalverticesinthequerygraphQ,respectively. whichcanbedefinedasfollows[11](foramoreformal 2. EQ isacollectionofedgesthatcorrespondtoprop- definition,werefertotheW3Cspecification[12]). ertiesinaSPARQLquery. Definition 3 [SPARQL query] Let U,B,L, and V 3. LQisacollectionofvertexlabelsinQandLQisthe V E denote the sets of all URIs, blank nodes, literals, and edgelabelsinEQ. M.TamerÖZSU 4 FILTER(?r>4.0) ?z ?x ?y rev:rating ?name ?b ?r “StanleyKubrick” ?b A B rdfs:label movie:relatedBook movie:director_name ??yy ???aaa ?m ?d movie:director A C Fig.3 SPARQLquerygraphcorrespondingtoqueryQ1 ?x ?z (a) Q1 (b) Q2 4. fQ :VQ → LQisabijectivevertexlabelingfunction V V thatassignstoeachvertexinQalabelfromLQ.The ?b ?c ?d V label of a vertex v ∈ VQ is the variable; that of a p D D D vertexv∈VQisitsliteralvalue;andthatofavertex v∈VQ∪VQl isitscorrespondingURI. ?h ????xxxx ???yyy c e 5. fQ :VQ → LQisabijectivevertexlabelingfunction E D A B D E E thatassignstoeachedgein Qalabelfrom LQ. An V ???fff ???aaa ?e edgelabelcanbeapropertyoranedgevariable. 6. FLareconstraintfilters. E C ?g ?z ThequerygraphforQ isgiveninFigure3. 1 (c) Q3 The semantics of SPARQL query evaluation can, therefore, be defined as subgraph matching using graph Fig.4 SampleSPARQLqueries homomorphismwherebyallsubgraphsofanRDFgraph G are found that are homomorphic to the SPARQL It is usual to talk about SPARQL query types based query graph Q. In this context, OPT represents the on the shape of the query graph. Typically, three query optionaltriplepatternsthatmaybematched. types are observed: (i) linear (Figure 4a), where the variable in the object field of one triple pattern appears Definition 5 [SPARQL graph match] Consider an RDF inthesubjectofanothertriplepattern(e.g.,?yinQ )(ii) 1 graph G and a query graph Q that has n vertices star-shaped(Figure4b), wherethevariableintheobject {v ,...,v }. A set of n distinct vertices {u ,...,u } inG is 1 n 1 n field of one triple pattern appears in the subject of said to be a match of Q, if and only if there exists a multiple other triple patterns (e.g., ?a in Q ), and (iii) bijectivefunction F,whereu = F(v)(1 (cid:54) i (cid:54) n),such 2 i i snowflake-shaped(Figure4c),whichisacombinationof that: multiplestarqueries. 1. Ifv isaliteralvertex,v andu havethesameliteral i i i value; 2. If v is an entity or class vertex, v and u have the i i i 3 DataWarehousingApproaches sameURI; 3. If v is a variable vertex, u should satisfy the filter In this section we consider the approaches that take a i i constraintoverparametervertexv ifany;otherwise, centralizedapproachwheretheentiredataismaintained i thereisnoconstraintoveru; in one RDF database. These fall into five categories: i 4. If there is an edge from v to v in Q, there is also those that map the RDF data directly into a relational i j an edge from u to u in G. If the edge label in Q system,thosethatusearelationalschemawithextensive i j is p(i.e.,property),theedgefromu tou inG has indexing (and a native storage system), those that i j thesamelabel. IftheedgelabelinQisaparameter, denormalize the triples table into clustered properties, theedgelabelshouldsatisfythecorrespondingfilter those that use column-store organization, and those that constraint;otherwise,thereisnoconstraintoverthe exploit the native graph pattern matching semantics of edgelabelfromu tou inG. SPARQL. i j SurveyofRDFDataMan.Syst. 5 Many of the approaches compress the long character Animmediateproblemthatcanbeobservedwiththis strings to integer values using some variation of approachisthehighnumberofself-joins–thesearenot dictionary encoding in order to avoid expensive string easy to optimize. Furthermore, in large data sets, this operations. Each string is mapped to an integer in a single triples table becomes very large, further mapping table, and that integer is then used in the RDF complicatingqueryprocessing. tripletable(s). Thisfacilitatesfastindexingandaccessto values, but involves a level of indirection through the 3.2 SingleTableExtensiveIndexing mapping table to get at the original strings. Therefore, someofthesesystems(e.g.,Jena[13])employencoding One alternative to the problems created by direct only for strings that are longer than a threshold. We relational mapping is to develop native storage systems ignoreencodinginthispaper,forclarityofpresentation, that allow extensive indexing of the triple table. andrepresentthedatainitsoriginalstringform. Hexastore [18] and RDF-3X [19,20] are examples of this approach. The single table is maintained, but extensively indexed. For example, RDF-3X creates 3.1 DirectRelationalMappings indexes for all six possible permutations of the subject, RDF triples have a natural tabular structure. A direct property,andobject: (spo,sop,ops,ops,sop,pos). Eachof approach to handle RDF data using relational databases these indexes are sorted lexicographically by the first is to create single table with three columns (Subject, column,followedbythesecondcolumn,followedbythe Property,Object)thatholdsthetriples(thereusuallyare thirdcolumn. Thesearethenstoredintheleafpagesofa additional auxiliary tables, but we ignore them here). clusteredB+-tree. TheSPARQLquerycanthenbetranslatedintoSQLand The advantage of this type of organization is that executed on this table. It has been shown that SPARQL SPARQL queries can be efficiently processed regardless 1.0 can be full translated to SQL [14,15]; whether the of where the variables occur (subject, property, object) same is true for SPARQL 1.1 with its added features is sinceoneoftheindexeswillbeapplicable. Furthermore, still open. This approach aims to exploit the it allows for index-based query processing that well-developed relational storage, query processing and eliminates some of the self-joins – they are turned into optimization techniques in executing SPARQL queries. rangequeriesovertheparticularindex. Evenwhenjoins Systems such as Sesame SQL92SAIL4) [16] and are required, fast merge-join can be used since each Oracle[17]followthisapproach. index is sorted on the first column. The obvious Assuming that the table given in Figure 1 is a disadvantages are, of course, the space usage, and the relational table, the example SPARQL query given overhead of updating the multiple indexes if data is earlier can be translated to the following SQL query dynamic. (where s,p,o correspond to column names: Subject, Property,Object): 3.3 PropertyTables SELECT T1.object FROM T as T1, T as T2, T as T3, Property tables approach exploits the regularity T as T4, T as T5 WHERE T1.p="rdfs:label" exhibited in RDF datasets where there are repeated AND T2.p="movie:relatedBook" AND T3.p="movie:director" occurrence of patterns of statements. Consequently, it AND T4.p="rev:rating" AND T5.p="movie:director_name" stores “related” properties in the same table. The first AND T1.s=T2.s system that proposed this approach is Jena [13]; IBM’s AND T1.s=T3.s AND T2.o=T4.s DB2RDF[21]alsofollowsthesamestrategy. Inbothof AND T3.o=T5.s AND T4.o > 4.0 these cases, the resulting tables are mapped to a AND T5.o="Stanley Kubrick" relational system and the queries are converted to SQL forexecution. 4) Sesameisbuilttointeractwithanystoragesystemsinceitim- Jena defines two types of property tables. The first plementsaStorageandInferenceLayer(SAIL)tointerfacewiththe type,whichcanbecalledclusteredpropertytable,group particularstoragesystemonwhichitsits. SQL92SAIListhespecific instantiationtoworkonrelationalsystems. togetherthepropertiesthattendtooccurinthesame(or M.TamerÖZSU 6 Subject Property Property ... Property Subject Spill Prop val Prop val ... Prop val 1 1 2 2 k k (a) (a) DPH Subject Property l_id value (b) (b) DS Subject Property Property ... Property Type Fig.7 DB2RDFtabledesign (c) by each subject, but instead of manually identifying “similar” properties, the table accommodates k property Fig.5 Clusteredpropertytabledesign columns, each of which can be assigned a different property in different rows. Each property column is, in similar) subjects. It defines different table structures for fact,twocolumns: onethatholdsthepropertylabel,and single-valued properties versus multi-valued properties. the other that holds the value. If the number of For single-valued properties, the table contains the properties for a given subject is greater than k, then the subject column and a number of property columns extrapropertiesarespilledontoasecondrowandthisis (Figure5(a)). Thevalueforagivenpropertymaybenull marked on the “spill” column. For multivalued if there is no RDF triple that uses the subject and that properties, a direct secondary hash (DSH) table is property. Each row of the table represents a number of maintained – the original property value stores a unique RDFtriples–thesamenumberasthenon-nullproperty identifierl_id,whichappearsintheDStablealongwith values. For these tables, the subject is the primary key. thevalues. For multi-valued properties, the table structure includes the subject and the multi-valued property (Figure 5(b)). DB2RDF accomplishes the mapping from the single EachrowofthistablerepresentsasingleRDFtriple;the triples table into the DPH and DS tables automatically; key of the table is the compound key (subject,property). theobjectiveistominimizethenumberofcolumnsthat The mapping of the single triple table to property tables are used in DPH while minimizing spills (since these is a database design problem that is done by a database cause expensive self-joins) that result from multiple administrator. propertiesbeingmappedtothesamecolumn. Notethat, Jenaalsodefinesapropertyclasstablethatclusterthe across all subjects, a property is always mapped to the subjectswiththesametypeofpropertyintooneproperty same column; however, a given column can contain table (Figure 5(c)). In this case, all members of a class more than one property in different rows. The objective (recallourdiscussionofclassstructurewithinthecontext of the mapping is to ensure that the columns can be ofRDFS)togetherinonetable.The“Type”columnisthe overloaded with properties that do not occur together, valueofrdf:typeforeachpropertyinthatrow. but that properties that occur together are assigned to TheexampledatasetinFigure1maybeorganizedto differentcolumns. create one table that includes the properties of subjects Theadvantageofpropertytableapproachisthatjoins that are films, one table for properties of directors, one instarqueries(i.e., subject-subjectjoins)becomesingle table for properties of actors, one table for properties of table scans. Therefore, the translated query has fewer books and so on. Figure 6 shows one of these tables joins. The disadvantages are that in either of the two corresponding to the film subject. Note that the “actor” forms discussed above, there could be a significant property is multi-valued (since there are two of them in number of null values in the tables (see the number of film/2014),soaseparatetableiscreatedforit. NULLs in Figure 6), and dealing with multivalued IBM DB2RDF [21] also follows the same strategy, properties requires special care. Furthermore, although but with a more dynamic table organization (Figure 7). starqueriescanbehandledefficiently,thisapproachmay Thetable,calleddirectprimaryhash(DPH)isorganized not help much with other query types. Finally, when SurveyofRDFDataMan.Syst. 7 Subject label initial_release_date director music_contributor based_near relatedBook language film/2014 “TheShining” “1980-05-23” director/8476 music_contributor/4110 2635167 0743424425 iso639-3/eng film/2685 “AClockworkOrange” NULL director/8476 NULL NULL NULL NULL film/424 “Spartakus” NULL director/8476 NULL NULL NULL NULL film/1267 “TheLastTycoon” NULL NULL NULL NULL NULL NULL film/3418 “ThePassenger” NULL NULL NULL NULL NULL NULL Subject actor film/2014 actor/29704 film/2014 actor/30013 film/1267 actor/29704 film/3418 actor/29704 Fig.6 Propertytableorganizationofsubject“mdb:film”fromtheexampledataset(prefixesareremoved) manual assignment is used, clustering “similar” Subject Object film/2014 “TheShining” properties is non-trivial and bad design decisions film/2685 “AClockworkOrange” exacerbatethenullvalueproblem. film/424 “Spartacus” film/1267 “TheLastTycoon” film/3418 “ThePassenger” iso639-3/eng “English” 3.4 BinaryTables (a) rdfs:label Binarytablesapproach[22,23]followscolumn-oriented Subject Object film/2014 actor/29704 database schema organization and defines a two-column film/2014 actor/30013 tableforeachpropertycontainingthesubjectandobject. film/1267 actor/29704 film/3418 actor/29704 This results in a set of tables each of which are ordered (b) movie:actor by the subject. This is a typical column-oriented database organization and benefits from the usual Fig. 8 Binary table organization of properties “refs:label” and advantages of such systems such as reduced I/O due to “movie:actor”fromtheexampledataset(prefixesareremoved) reading only the needed properties and reduced tuple length, compression due to redundancy in the column exampledatasetgiveninFigure1wouldcreateonetable values, etc. In addition, it avoids the null values that is foreachuniqueproperty–thereare18ofthem. Twoof experienced in property tables as well as the need for thesetablesareshowninFigure8. manual or automatic clustering algorithms for “similar” properties,andnaturallysupportsmultivaluedproperties 3.5 Graph-basedProcessing –eachbecomeaseparaterowasinthecaseofJena’sDS table. Furthermore, sincetablesareorderedonsubjects, Graph-basedRDFprocessingapproachesfundamentally subject-subject joins can be implemented using efficient implement the semantics of RDF queries as defined in merge-joins. The shortcomings are that the queries Section 2. In other words, they maintain the graph require more join operations some of which may be structure of the RDF data (using some representation subject-objectjoinsthatarenothelpedbythemerge-join suchasadjacencylists),converttheSPARQLquerytoa operation. Furthermore, insertions into the tables has query graph, and do subgraph matching using higher overhead since multiple tables need to be homomorphism to evaluate the query against the RDF updated. It has been argued that the insertion problem graph. Systems such as that proposed by Bönström et can be mitigated by batch insertions, but in dynamic al[25],gStore[9,26],andchameleon-db[27]followthis RDF repositories the difficulty of insertions is likely to approach. remain a significant problem. The proliferation of the Theadvantageofthisapproachisthatitmaintainsthe number of tables may have a negative impact on the originalrepresentationoftheRDFdataandenforcesthe scalability (with respect to the number of properties) of intendedsemanticsofSPARQL.Thedisadvantageisthe binarytablesapproach[24]. costofsubgraphmatching–graphhomomorphismisNP- For example, the binary table representation of the complete.Thisraisesissueswithrespecttothescalability M.TamerÖZSU 8 of this approach to large RDF graphs; typical database 00010 10000 01000000 10000000 00000100 techniquesincludingindexingcanbeusedtoaddressthis (a) QuerySignatureGraphQ∗ issue. In the remainder, we present the approach within thecontextofonesystem,gStore. 00001001 10011000 00010100 00101000 gStore is a graph-based triple store system that can 00100 01000 01000 answer different kinds of SPARQL queries – exact 00001 00010 01000 queries, queries with wildcards (i.e., where partial 01000001 10000001 00010001 information is known about a query object such as 10000 01000 knowingtheyearofbirthbutnotthefullbirthdate),and 01001000 aggregate queries that are included in SPARQL 1.1 – 00000100 over dynamic RDF data repositories. It uses adjacency 10000 10000 list representation of graphs. An important feature of 00001000 00000010 gStoreistoencodeeacheachentityandclassvertexinto (b) DataSignatureGraphG∗ a fixed length bit string. One motivation for this encoding is to deal with fixed length bit string rather Fig.9 Signaturegraphs than variable length character strings – this is similar to the dictionary encoding mentioned above. The second, over G∗. For this, gStore uses an index structure called and more important, motivation is to capture the VS∗-treethatisasummarygraphofG∗. VS∗-treeisused “neighborhood” information for each vertex in the to efficiently process queries using a pruning strategy to encoding that can be exploited during graph matching. reduce the search space for finding matches of Q∗ over This results in the generation of a data signature graph G∗. G∗, in which each vertex corresponds to a class or an entity vertex in the RDF graph. Specifically, G∗ is induced by all entity and class vertices in the original RDF graph G together with the edges whose endpoints 4 DistributedRDFProcessing are either entity or class vertices. Figure 9(b) shows the Intheprevioussectionwefocusedoncentralized,single data signature graph G∗ that corresponds to RDF graph machine approaches to RDF data management and G in Figure 2. An incoming SPARQL query is also SPARQLqueryprocessing. Inthissection, wefocuson representedasaquerygraphQthatissimilarlyencoded distributed approaches. This section is taken from [28]. intoaquerysignaturegraph Q∗. Theencodingofquery We identify and discus four classes of approaches: graph depicted in Figure 3 into a query signature graph cloud-based solutions, partitioning-based approaches, Q∗ isshowninFigure9(a). 2 federated SPARQL evaluation systems, and partial The problem now turns into finding matches of Q∗ evaluation-basedapproach. over G∗. Although both the RDF graph and the query graph are smaller as a result of encoding, the 4.1 Cloud-basedApproaches NP-completeness of the problem remains. Therefore, gStore uses a filter-and-evaluate strategy to reduce the There have been a number of works (e.g., [29–36]) search space over which matching is applied. The focusingonmanaginglargeRDFdatasetsusingexisting objective is to first use a false-positive pruning strategy cloud platforms; a very good survey of these is is tofindasetofcandidatesubgraphs(denotedasCL),and providedbySaoudiandManolescu[37]. Manyofthese then validate these using the adjacency list to find approaches follow the MapReduce paradigm; in answers (denoted as RS). Accordingly, two issues need particular they use HDFS, and store RDF triples in flat to be addressed. First, the encoding technique should files in HDFS. When a SPARQL query is issued, the guaranteethatRS ⊆ CL–theencodingdescribedabove HDFSfilesarescannedtofindthematchesofeachtriple provably achieves this. Second, an efficient subgraph pattern, which are then joined using one of the matching algorithm is required to find matches of Q∗ MapReduce join implementations (see [38] for more SurveyofRDFDataMan.Syst. 9 detailed description of these). The most important the graph corresponding to each decomposed subquery difference among these approaches is how the RDF should not be larger than N to enable subquery triplesarestoredinHDFSfiles; thisdetermineshowthe processing at each local site. WARP [42] uses some triplesareaccessedandthenumberofMapReducejobs. frequent structures in workload to further extend the In particular, SHARD [30] directly stores the data in a results of GraphPartition. Partout [43] extends the single file and each line of the file represents all triples concepts of minterm predicates in relational database associatedwithadistinctsubject. HadoopRDF[31]and systems, and uses the results of minterm predicates as PredicateJoin[32]furtherpartitionRDFtriplesbasedon the fragmentation units. Lee et. al. [44] define the the property and store each partition within one HDFS partition unit as a vertex and its neighbors, which they file. EAGRE [33] first groups all subjects with similar call a “vertex block”. The vertex blocks are distributed properties into an entity class, and then constructs a based on a set of heuristic rules. A query is partitioned compressed RDF graph containing only entity classes into blocks that can be executed among all sites in and the connections between them. It partitions the parallel and without any communication. TriAD uses compressed RDF graph using the METIS METIS [39] to divide the RDF graph into many algorithm[39]. EntitiesareplacedintoHDFSaccording partitions and the number of result partitions is much tothepartitionsetthattheybelongto. more than the number of sites. Each result partition is Besides the HDFS-based approaches, there are also considered as a unit and distributed among different someworksthatuseotherNoSQLdistributeddatastores sites. At each site, TriAD maintains six large, to manage RDF datasets. JenaHBase [29] and in-memory vectors of triples, which correspond to all H RDF [35, 36] use some permutations of subject, SPO permutations of triples. Meanwhile, TriAD 2 property, object to build indices that are then stored in constructs asummary graph tomaintain thepartitioning HBase (http://hbase.apache.org). Trinity.RDF [34] uses information. the distributed memory-cloud graph system Trinity [40] All of the above methods implement particular toindexandstoretheRDFgraph. Ituseshashingonthe partitioning and distribution strategies that align with vertexvaluestoobtainadisjointpartitioningoftheRDF their specific requirements. When there is freedom to graphthatisplacedonnodesinacluster. partitionanddistributethedata,thisworksfine,butthere These approaches benefit from the high scalability are circumstances when partitioning and distribution and fault-tolerance offered by cloud platforms, but may may be influenced by other requirements. For example, suffer lower performance due to the difficulties of in some applications, the RDF knowledge bases are adaptingMapReducetographcomputation. partitioned according to topics (i.e., different domains), or are partitioned according to different data 4.2 Partitioning-basedApproaches contributors; inothercases, theremaybeadministrative constraints on the placement of data. In these cases, The partition-based approaches [41–45] divide an RDF these approaches may not have the freedom to partition graph G into several fragments and place each at a the data as they require. Therefore, partition-tolerant different site in a parallel/distributed system. Each site SPARQLprocessingmaybedesirable. hostsacentralizedRDFstoreofsomekind. Atruntime, a SPARQL query Q is decomposed into several 4.3 FederatedSystems subqueries such that each subquery can be answered locally at one site, and the results are then agregated. Federated queries run SPARQL queries over multiple Each of these papers proposes its own data partitioning SPARQL endpoints. A typical example is linked data, strategy, and different partitioning strategies result in where different RDF repositories are interconnected, differentqueryprocessingmethods. providing a virtually integrated distributed database. InGraphPartition[41],anRDFgraphGispartitioned Federated SPARQL query processing is a very different into n fragments, and each fragment is extended by environment than what we target in this paper, but we including N-hop neighbors of boundary vertices. discussthesesystemsforcompleteness. According to the partitioning strategy, the diameter of A common technique is to precompute metadata for M.TamerÖZSU 10 each individual SPARQL endpoint. Based on the fragments (edges that cross fragments are replicated in metadata, the original SPARQL query is decomposed source and target fragments). Each site receives the full into several subqueries, where each subquery is sent to SPARQL query Q and executes it on the local RDF its relevant SPARQL endpoints. The results of graph fragment providing data parallel computation. In subqueries are then joined together to answer the this particular setting, the partial evaluation strategy is originalSPARQLquery. InDARQ[46],themetadatais appliedasfollows: eachsiteS treatsfragment F asthe i i called service description that describes which triple known input in the partial evaluation stage; the patterns (i.e., property) can be answered. In [47], the unavailableinputistherestofthegraph(G =G\F). i metadata is called Q-Tree, which is a variant of RTree. Therearetwoimportantissuestobeaddressedinthis Each leaf node in Q-Tree stores a set of source framework. Thefirstistocomputethepartialevaluation identifers, including one for each source of a triple results at each site S given a query graph Q – in other i approximated by the node. SPLENDID [48] uses words, addressingthegraphhomomorphismof Qof F; i Vocabulary of Interlinked Datasets (VOID) as the this is called the local partial match since it finds the metadata. HiBISCuS [49] relies on “capabilities” to matches internal to fragment F. Since ensuring edge i compute the metadata. For each source, HiBISCuS disjointness is not possible in vertex-disjoint defines a set of capabilities which map the properties to partitioning, , there will be crossing edges between their subject and object authorities. TopFed [50] is a graph fragments. The second task is the assembly of biological federated SPARQL query engine whose theselocalpartialmatchestocomputecrossingmatches. metadata comprises of an N3 specification file and a Two different assembly strategies are proposed: Tissue Source Site to Tumour (TSS-to-Tumour) hash centralizedassembly,wherealllocalpartialmatchesare table,whichisdevisedbasedonthedatadistribution. senttoasinglesite,anddistributedassembly,wherethe In contrast to these, FedX [51] does not require local partial matches are assembled at a number of sites preprocessing, but sends “SPARQL ASK” to collect the inparallel. metadata on the fly. Based on the results of “SPARQL ASK” queries, it decomposes the query into subqueries andassignsubquerieswithrelevantSPARQLendpoints. 5 QueryingLinkedData Global query optimization in this context has also As noted earlier, a major reason for the development of been studied. Most federated query engines employ RDF is to facilitate the semantic web. An important existingoptimizers,suchasdynamicprogramming[52], aspect of this enterprise is the Web of Linked Data for optimizing the join order of local queries. (WLD) that connects multiple web data sets encoded in Furthermore,DARQ[46]andFedX[51]discusstheuse RDF. In one sense, this is the web data integration, and of semijoins to compute a join between intermediate queryingtheWLDisanimportantchallenge. resultsatthecontrolsiteandSPARQLendpoints. WLDusesRDFtobuildthesemanticwebbyfollow- ingfourprinciples: 4.4 PartialQueryEvaluationApproaches 1. All web resources are locally identified by their Partial function evaluation is a well-known URIs. programming language strategy whose basic idea is the 2. Information about web resources/entities are following: givenafunction f(s,d),where sistheknown encodedasRDFtriples. Inotherwords,RDFisthe input and d is the yet unavailable input, the part of f’s semanticwebdatamodel. computation that depends only on s generates a partial 3. Connectionsamongdatasetsareestablishedbydata answer. This has been used for distributed SPARQL links. processingintheDistributedgStoresystem[28]. 4. Sites that host RDF data need to be able to service An RDF graph is partitioned using some graph HTTPrequeststoserveuplinkedresources. partitioning algorithm such as METIS [39] (the particular graph partitioning algorithm does not matter OurearlierreferencewastoLinkedOpenData(LOD), as the approach is oblivious to it) into vertex-disjoint whichenforcesafifthrequirementonWLD: