PRSP: A Plugin-based Framework for RDF Stream Processing Qiong Li1,3 Xiaowang Zhang1,3 Zhiyong Feng2,3 1SchoolofComputerScienceandTechnology,TianjinUniversity,Tianjin300350,P.R.China 2SchoolofComputerSoftware,TianjinUniversity,Tianjin300350,P.R.China 3TianjinKeyLaboratoryofCognitiveComputingandApplication,Tianjin300350,P.R.China ABSTRACT venientwayandcomparetheirperformanceunderaunifiedframe- 7 worknamelyPRSP.Anduserscanchoosethefavourablesystems 1 Inthispaper,weproposeaplugin-basedframeworkforRDFstream basedontheirallkindsofrequirements.Forexample,theyhavethe 0 processing named PRSP. Within this framework, we can employ needtohandlelarge-scaleRDFgraphs,thusdistributedenginesare 2 SPARQLqueryenginestoprocessC-SPARQLquerieswithmain- thebestchoice. tainingthehighperformanceofthoseenginesinasimpleway.Tak- n ingadvantageofPRSP,wecanprocesslarge-scaleRDFstreamsin 2. PRELIMINARIES a J adistributedcontextviadistributedSPARQLengines.Besides,we RDFstreamAnRDFstreamisdefinedasorderedsequencesof canevaluatetheperformanceandcorrectnessofexistingSPARQL 4 pairs,eachpairbeingmadeofanRDFtripleandatimestampT: query engines in handling RDF streams in a united way, which 1 amendstheevaluationofthemrangingfromstaticRDF(i.e.,RDF ((cid:104)Si,Pi,Oi(cid:105),T) graph)todynamicRDF(i.e.,RDFstream). Finally,withinPRSP, ] C-SPARQL query The continuous query is divided into three B weexperimentlyevaluatethecorrectnessandtheperformanceon parts:R ,S(t),Q ,anditisformallydefinedasfollows: D YABench. TheexperimentsshowthatPRSPcanstillmaintainthe query Q=[SRPARQ,LS(t),Q ] high performance of those engines in RDF stream processing al- query SPARQL s. thoughtherearesomeslightdifferencesamongthem. • Rqueryindicatestheregisteredqueryfromuserswhichiswait- ingtobeaddressed. c [ Keywords • S(t)istheRDFstreamregisteredbytheRSPsystems,which definesthewindowsizeandstepsize. 1 RDFStream;RSP;SPARQL;C-SPARQL • Q isastandardRDFquerylanguage,i.e.,SPARQL. SPARQL v 1. INTRODUCTION PROPOSITION 1. LetQbeaC-SPARQLquery. ForanyRDF 4 RDFstream,asanewtypeofdataset,canmodelreal-timeand streamSandanypresenttimet,thefollowingholds: 5 8 continuousinformationinawiderangeofapplications,e.g. envi- Q = Q . 3 ronmental monitoring, Smart City and so on. But data stream is (cid:74) (cid:75)(S,t) (cid:74) SPARQL(cid:75)Window(S,t) Proposition1ensuresthattheevaluationproblemofC-SPARQL 0 unboundedsequencesoftime-varyingdataelementanddifficultto queriesoverRDFstreamscanbeequivalenttotheevaluationprob- . store. 1 lemofSPARQLqueriesoverRDFgraphs. Moreover,Proposition Whatismore, thereisafewRSP[1](RDFStreamProcessing) 0 1canshowthattheevaluationproblemofC-SPARQLhasthesame systems,suchasC-SPARQL[2]andEP-SPARQL[3]implemented 7 computationalcomplexityasSPARQL[2]. for supporting RDF stream due to its complicacy in processing. 1 Considerthefollowingquery,asimpleexamplefromC-SPARQL. Ontheotherhand,therearemanypopularandefficientSRARQL : Line1matchingtheR ,tellstheRSPsystemtoregisterthecon- v queryenginessupportingonlystaticRDFgraphs,suchasthecen- query tinuousqueryofTestQuery. S(t),thatisthefollowinglistofline i tralizedengines,Jena[4],RDF-3XandgStore,anddistributedsys- X 3, indicatesthatstreamswithaslidingwindowof5secondsthat tems,TriAD,gStoreD[5]andsoon.HowtoemploythoseSRARQL r queryenginestoevaluatecontinuousqueriesbecomesaninterest- slidesevery5seconds,isthestreamdatawaitingtobeprocessed. a AndQ ,displayedinbothline2andline4,isthequerylan- ingproblem. sparql guageforRDF. In this paper, we provide a plugin-based framework for RDF streamprocessingnamedPRSP,whichmakesitpossibletousethe 1.REGISTERQUERYTestQueryAS high-performanceRDFenginesthat’svalidonlyforRDFgraphs, 2.SELECT?obs toprocessRDFstreams.Moreover,withinthisframework,wecan 3.FROMSTREAMstreams[RANGE5s STEP5s] employanyRDFqueryenginetoprocessRDFstreamsinacon- 4.WHERE{?obsobservedPropertyAirTemperature.} Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor 3. THEARCHITECTUREOFPRSP classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcita- PRSPisanextensionofSPARQLforqueryingbothRDFgraphs tiononthefirstpage. Copyrightsforcomponentsofthisworkownedbyothersthan and RDF streams shown in Figure1. Both continuous query and ACMmustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orre- RDFstreamsastheinputofPRSParetransformedbytheplugin publish,topostonserversortoredistributetolists,requirespriorspecificpermission and/[email protected]. QueryRewritingandDataTransformerinPRSP,respectively. Af- XXXXXX terthat,theoutputfromtheformerpluginsastheinputofSPARQL API,theresultsareproducedbyoneofSPARQLengines.Andthe (cid:13)c 2016ACM.ISBN0-12345-67-8/90/01...$15.00 DOI:http://dx.doi.org/10.1145/0000000.0000000 rightboxconsistsofanySPARQLqueryenginewhichisusedas RDF Result 104 104 Stream RDF Centralized Data Transformer SPARQL Engines 103 103 Window ms] ms] SelectorSPARQL API DIstributed Result Time[102 Time[102 Query Rewriting Query Engines Continuous Query Plugin in RSP 101 101 100 200 300 400 500 100 200 300 400 500 Figure1:PRSPArchitecture Thenumberofsensors Thenumberofsensors Jena RDF3X gStore gStoreD TriAD Jena RDF3X gStore gStoreD TriAD a black box for evaluating RDF graphs. Its architecture contains (a) TheresponsetimeofQ1 (b) TheresponsetimeofQ1’ three types of plugin: Data Transformer, Query Rewriting, and a SPARQLAPIconnectingwiththeformertwoplugins. Figure2:QueryingtimeindifferentscenarioswithinPRSP QueryRewriting. Continuousqueriesastheinputofqueryrewritingmode,apply 104 104 transformationmethodsinordertogeneratetwotypesofqueries, namely, SPARQL query and window operator, which can be ad- ms) 103 ms) 103 dressedinoneofSPARQLengineandDataTransformermodule, me( me( Ti Ti respectively.AfterrewrittingQ,wecanobtainQ . 102 SPARQL 102 DataTransformer. The data transformer module manages RDF streams specified 101 inthequeryviaEsperoranotherDSMS.AndittransformsRDF LT RT ET LT RT ET streamsintoRDFgraphsbasedonthewindowsizeandstepsize Jena RDF3X gStore gStoreD TriAD Jena RDF3X gStore gStoreD TriAD setbywindowoperator.AftertranformingSw.r.t.t,wecanobtain (a) s=100sensors (b) s=500sensors Windows(S,t). Figure3:RDFstreamforprocessingtimeinPRSP SPARQLAPI. PRSPdefinesaunifiedinterfaceforRDFengines,whichmakes Table1:Precision/Recallresults itpossibleandeasyforSPARQLenginestoprocessRDFstreams. Jena RDF3X gStore gStoreD TriAD InthecurrentversionofPRSP,wehaveextendedPRSPbyinclud- s=100 99% 93% 100% 100% 97% ingafewcentralizedengines,suchasJena,gStore,andRDF-3X, Precision s=300 97% 94% 100% 100% 93% andtwodistributedengines,i.e.,gStoreDandTriAD. s=500 85% 88% 100% 100% 100% s=100 95% 89% 75% 72% 95% 4. EXPERIMENTSANDEVALUATIONS Recall s=300 94% 91% 88% 76% 92% s=500 92% 79% 77% 63% 91% Experiments. Allcentralizedexperimentswerecarriedoutonamachinerun- 5. CONCLUSIONS ning Linux, which has 4 CPUs with 6 cores and 64GB memory, Inthispaper,wepresentPRSP,asapluginadaptableforSPARQL and5machineswiththesameperformancefordistributedexper- engines, toprocessRDFstreams, whichmakesitfeasibletoem- iments. Forevaluation,weutilizedYABenchRSPbenchmark[6], ployvariousenginestoprocesslarge-scaleRDFstreams.Inthefu- which uses a real world dataset about water temperature. In our ture,wewilloptimizePRSPfurthertoimproveitsperformanceand experiments, we performed sliding windows with a window size correctness. ThisworkissupportedbytheNationalKeyResearch andastepsizeof5seconds,respectively. Consideringthatsome and Development Program of China (2016YFB1000603) and the enginescannotsupportcomplexqueries,theexperimentsusedtwo NationalNaturalScienceFoundationofChina(61672377). BGPqueries,Q1andQ1(cid:48).Q1isaBGPquerywithfourforms(Q1) 6. REFERENCES fromYABench,andQ1(cid:48) istherewritingofQ1withthreetriples. [1] http://www.w3.org/community/rsp/. SinceRDF-3Xdidnotworkwhenthetheamountofstreamdata to42.000triples(i.e.,s = 500),wechosefiveloadscenario(i.e., [2] Barbieri,D.F.,Braga,D.,Ceri,S.,DellaValle,E.,and s=100/200/300/400/500sensors). Grossniklaus,M.QueryingRDFstreamswithC-SPARQL. Evaluations. ACMSIGMODRecord,2010,39(1):20-26. The performance of each engine under the five different input [3] AnicicD,FodorP,RudolphS,etal.EP-SPARQL:aunified loads for windows is shown in Fig. 2. When the load ranges languageforeventprocessingandstreamreasoning.In: froms = 100tos = 500,thequeryresponsetimeiswithvary- Proc.ofWWW’11,pp.635–644. ingdegreesofincreaseexceptforgStoreD.Fig. 3showsthetime [4] http://jena.sourceforge.net/ARQ. ofthreeprocesses,includingdataloadtime(LT),queryresponse [5] P.Peng,L.Zou,MT.Zsu,L.Chen,andD.Zhou.Processing time(RT), and engine execution time (ET) under s = 300 ob- SPARQLqueriesoverdistributedRDFgraphs.VLDBJ., tained from query Q1(cid:48). LT from RDF3X, gStore, and gStoreD 2016,25(2):243–268. occupiesalargepartofET,resultingintheirlowerefficiencyfor [6] KolchinM,WetzP,KieslingE,etal.YABench:A processingRDFstreams.Table1illustratestheresultsofprecision ComprehensiveFrameworkforRDFStreamProcessor and recall from the experiments under three load scenarios (i.e., CorrectnessandPerformanceAssessment.In:Proc.of s=100/300/500)inPRSP.Alongwithmoreinputloadforwin- ICWE’16,pp.280-298. dows,mostofthemenjoylowerrecallswithhighaccuracy.