High-Performance Holistic XML Twig Filtering Using GPUs Ildar Absalyamov Roger Moussalli Walid Najjar UCRiverside IBMT.J.WatsonResearch UCRiverside Riverside,CA,92507 Center Riverside,CA,92507 [email protected] YorktownHeights,NY,10598 [email protected] [email protected] Vassilis J. Tsotras UCRiverside Riverside,CA,92507 [email protected] ABSTRACT Early pub-sub implementations restricted subscriptions to pre-defined topics or channels, such as weather, world Current state of the art in information dissemination com- news, finance, among others. Subscribers would hence be prisesofpublishersbroadcastingXML-codeddocuments,in “spammed” with more (irrelevant) information than of in- turnselectivelyforwardedtointerestedsubscribers. Thede- terest. Thesecondgenerationofpub-subsevolvedbyallow- ploymentofXMLattheheartofthissetupgreatlyincreases ing predicate expressions; here, user profiles are described the expressive power of the profiles listed by subscribers, as conjunctions of (attribute, value) pairs. In order to add using the XPath language. On the other hand, with great furtherdescriptivecapabilitytotheuserprofiles,third-gene- expressivepowercomesgreatperformanceresponsibility: it ration pub-subs adopted the eXtensible Markup Language is becoming harder for the matching infrastructure to keep (XML) as the standard format for data exchange, due to up with the high volumes of data and users. Traditionally, its self-describing and extensible nature. Exchanged doc- general purpose computing platforms have generally been uments are now encoded with XML, while profiles are ex- favored over customized computational setups, due to the pressedwithXMLquerylanguages,suchasXPath[2]. Such simplifiedusabilityandsignificantreductionofdevelopment systemstakeadvantageofthepowerfulqueryingthatXML time. The sequential nature of these general purpose com- query languages offer: profiles can now describe requests puters however limits their performance scalability. In this not only on the document values but also on the structure work,weproposetheimplementationofthefilteringinfras- ofthemessages1allowingthematchingofcomplextwig-like tructure using the massively parallel Graphical Processing profiles within tree-structured messages. Units(GPUs). Weconsidertheholistic(nopost-processing) Currently,XML-basedPub-Subsystemshavebeenadopted evaluationofthousandsofcomplextwig-styleXPathqueries for the dissemination of Micronews feeds. These feeds are in a streaming (single-pass) fashion, resulting in a speedup typicallyshortfragmentsoffrequentlyupdatedinformation, over CPUs up to 9x in the single-document case and up to such as news stories and blog updates. The most promi- 4xforlargebatchesofdocuments. Athoroughsetofexper- nent XML-based format used is RSS. In this environment, iments is provided, detailing the varying effects of several the RSS feeds are accessed via HTTP through URLs and factors on the CPU and GPU filtering platforms. supported by client applications and browser plug-ins (also calledfeedreaders). Feedreaders(likeBloglinesandNews- 1. INTRODUCTION Gator), periodically check the contents of micronews feeds and display the returned results to the user. Publish-subscribesystems(orsimplypub-subs)aretimely The complex process of matching thousands of profiles asynchronous event-based dissemination systems consisting againstmassiveamountsofpublishedmessagesisperformed of three main components: publishers, who feed a stream in the filtering infrastructure. From a user/functionality of documents (messages) into the system, subscribers, who perspective, filtering consists of determining, for each pub- register their profiles (queries, i.e., subscription interests), lished message, which subscriptions match at least once. and a filtering infrastructure for matching subscriber inter- The novelty lies in exploring new efficient filtering algo- ests with published messages and delivering the matched rithms, as well as high-performance platforms on which to messages to the interested subscriber(s). implementalgorithmsandfurtheracceleratetheirexecution. Several fine-tuned CPU-based software filtering algorithms havebeenstudied[3, 12, 14, 19]. Thesememory-boundap- proaches,however,sufferfromtheVonNeumannbottleneck andareunabletohandlelargevolumeofinputstreams. Re- cently,variousworkshaveexploitedtheobviousembarrass- ingly parallel property of filtering, by evaluating all queries inparallelusingFieldProgrammableGateArrays(FPGAs) 1 Inthispaper,weusetheterms“profile”,“subscription”and“query” interchangeably;similarlyfortheterms“document”and“message”. 1 Path ::= Step | Path Step engines are fed a series of volatile documents, incoming in Step ::= Axis TagName | Step “[” Step “]” a streaming fashion while queries are static (and known Axis ::= “/” | “//” in advance). This type of workload prevents filtering sys- TagName ::= Name | “*” temsfromusingdocumentindexing,thusrequiringadiffer- ent paradigm to solve this problem. Moreover, since filter- Figure 1: Production rules to generate an XPath ing systems return only binary result (match or no match, expression. “Name” corresponds to an arbitrary al- whether a particular query was matched or not in a given phanumeric string. document), they should be able to process large number of queries. ThisisincontrasttoXMLqueryingengines,which need to report every document node that was matched by [22,24,25]. Byintroducingnovelparallel-hardware-tailored an incoming query. filteringalgorithms,FPGAshavebeenshowntobeparticu- SoftwarebasedXMLfilteringalgorithmscanbeclassified larly well suited for the stream processing of large amounts into several categories: (1) FSM-based, (2) sequence-based of data, where the temporary computational state is not and (3) others. XFilter [4] is the earliest work studying offloaded to (low-latency) off-chip memories. These algo- FSM-based filtering, which builds a single FSM for each rithms allowed the massively parallel streaming matching XPath query. Each query node is represented by an in- of complex path profiles that supports the /child:: axis dividual state at the FSM. Transitions in the automaton and /descendant-or-self:: axis 2, wildcard (‘*’) node tests are fired when an appropriate XML event is processed. An and accounts for recursive elements in the XML document. XPathquery(profile)isconsideredmatchedifanaccepting Using this streaming approach, inter-query and intra-query stateisreachedattheFSM.YFilter[12]leveragespathcom- parallelism were exploited, resulting in up to two orders of monalitiesbetweenindividualFSMs,creatingasingleNon- magnitude speed-up over the leading software approaches. DeterministicFiniteAutomaton(NFA)representationofall In [23] we presented a preliminary study on the mapping XPathqueriesandthereforereducingthenumberofstates. ofthepathmatchingapproachonGPUs,providingtheflex- Subsequent works [16, 29, 14] use a unified finite automa- ibilityofsoftwarealongsidethemassiveparallelismofhard- ton for XPath expressions, using lazy DFAs and pushdown ware. In this paper we detail the support of the more com- automata. plex twig queries on GPUs in a massively parallel manner. FiST [19] and it’s successors are the examples of the se- In particular, the novel contributions of this paper are: quence-based approach for XML Filtering, which implies a • The first design and implementation of XML twig fil- two-stepprocess: (i)thestreamingXMLdocumentandthe tering on GPUs, more so in a holistic fashion. XPathqueriesareconvertedintosequencesand(ii)asubse- quencematchalgorithmisusedtodetermineifaqueryhad • An extensiveperformanceevaluation ofthe above ap- a match in the XML document. proach is provided, with comparison to leading soft- Amongtheotherworks,[10]buildsaTrieindextoexploit ware implementations. querycommonalities. Anotherformofsimilarityisexplored intheAFilter[9]whichusesprefixaswellassuffixsharingto • A study on the effect of several query and document create an appropriate index data structure. [13] introduces factors on GPUs and the software implementations streamqueryingandfilteringalgorithmsLQandEQ,using through experimentation is depicted. a lazy and eager strategy, respectively. RecentadvancesintheGPGPUtechnologyopenedapos- The rest of the paper is organized as follows: in Section sibilitytospeedupmanytraditionalapplications,leveraging 2 we present related work in software and hardware XML the high degree of parallelism offered by GPUs. A large processing. Section 3 provides in depth description of the numberofresearchworksexploretheusageofGPUsforim- holistic XML twig filtering algorithm. Section 4 details the proving traditional database relational operations. [5] dis- implementation of the XML filtering on GPUs. Section 5 cussesimprovingtheSELECToperationperformance,while presents an experimental evaluation of the parallel GPU- [17] focuses on implementing efficient join algorithms. Re- based approach compared to the state of the art software cently[11]addressedallcommonlyusedrelationaloperators. counterparts. Finally conclusions and open problems for Therearealsoresearchworksconcentratedonimprovingin- further research appear in Section 6. dexingoperations. [18]introducesahierarchicalCPU-GPU architecture for efficient tree querying and bulk updating 2. RELATEDWORK and [7] proposes an abstraction for tree index operations. The rapid development of XML technology as a common There has also been much interest in using GPUs for format for data interchange together with the emergence XML related problems. [8] introduced the Data partition- ofinformationdisseminationthrougheventnotificationsys- ing,QuerypartitioningandHybridapproachestoparallelize tems,hasledtoincreasedinterestincontent-basedfiltering XPathqueriesonmulti-coresystems. [31]usedthesestrate- of XML data streams. Unlike the traditional XML query- gies to create a cost model, which decides how to process ingengines,filteringsystemsexperienceessentiallydifferent XPath queries on GPUs in parallel. [15] processed XPath type of workload. In a querying system, the assumption is queriesinawaysimilartoYfilter,bycreatingendexecuting that XML documents are fixed (known in advance), which a NFA on a GPU, whereas [30] studied twig query process- allows building efficient indexing mechanisms on them to ingonalargeXMLswiththehelpofGPUs. Alltheseworks facilitate the query process; whereas queries are adhoc and focus on the problem of indexing and querying XML docu- can have arbitrary structure. On contrary, XML filtering ments using XPath queries. However none of them address filtering XML documents, which an orthogonal problem to 2 In the rest of the paper we shall use ‘/’ and ‘//’ as shorthand to query processing, thus requiring a different implementation denotethe/child:: axis and/descendant-or-self:: axis,respectively. 2 ... ... ... ... ... 1 1 a ... a ... c ... c ... c ... 1 1 1 1 1 1 1 1 b ... b ... b ... b ... b ... 1 1 1 1 1 ... $ a b ... $ a b d ... $ c d d ... $ c d d ... $ c d XML Push XML Push XML Push XML Push XML Push tree Stack tree Stack tree Stack tree Stack tree Stack open(a) open(b) open(c) open(b) open(d) (a) Matching path {/a/b} (b) Matching path {/c//d} Figure 2: Step-by-step overview of stack updates for (a) parent-child and (b) ancestor-descendant relations. Each step corresponds to open(tag) event, with respective opened tag highlighted. As the XML document tree is traveled downwards, content is pushed onto the top of the stacks. The leftmost stack column is set to be always in a matched state, corresponding to a dummy root node (denoted by $). approach. XPath relate to the structure of XML documents, hence, As mentioned, our previous work on the XML filtering rely heavily on these open/close events. As will be clearer problem concentrated on using FPGAs. In [24] we pre- below, stacks are an essential structure of XML query pro- sented a new dynamic-programming based XML filtering cessing engines, used to save the state as the structure of algorithm, mapped to FPGA platforms. the tree is visited (push on open, pop on close). To support twig filtering, a post-processing step is re- Figure 1 shows the XPath grammar used to form twig quired to identify which path matches correspond to twig queries, consisting of nodes, connected with parent-child matches. In [25] we extended this approach to support the (“/”), ancestor-descendant (“//”) or nested path relation- holistic (no post-processing) matching of the considerably ships (“[]”). We denote by L the length of the twig query, more complex twig-style queries. While providing signifi- representingthetotalnumberofnodesinthetwig,inaddi- cant performance benefits, these FPGA-based approaches tion to a dummy start node at the beginning of each query were not without their disadvantages. Matching engines (the latter being essential for query processing on GPUs). running on the FPGA relied on custom implementations Such a framework allows us to filter the whole stream- of the queries in hardware and did not allow the update, ingdocumentinasinglepass,whilecapturingquerymatch addition, and deletion of user profiles on-the-fly. Although information in the stack. the query compilation to VHDL is fast, re-synthesis of the In [25] we show that matching a twig query is a process FPGA,whichincludestranslation,mappingandplace-and- consisting of 2 interleaved mechanisms: routeareexpensive(uptoseveralhours)andmustbedone off-line. Furthermore, solutions on FPGAs cannot scale be- • Matching individual root-to-leaf paths. Here, each yondavailableresources,wherethenumberofqueriesislim- such path is evaluated (matched) using a stack whose ited to the amount of real-estate on the chip. contents are updated on push (open(tag)) events. GPUarchitecturesdonothavetheselimitationsthusoffer • Carefully joining the matched results at query split apromisingapproachtotheXMLfilteringproblem. Inthis nodes. This task is performed using stacks that are paperweextendourpreviouswork[23]thatconsideredonly mainlyupdatedonpop(close(tag))events. Themain filteringlinearXPathqueriesonGPUsandprovideaholistic reasoningliesinthatamatchedpathreportsitsmatch filtering of complex twig queries. state to its ancestors using the pop stack. 3. PARALLEL HOLISTIC TWIG MATCH- Inbothmechanisms,thedepthofthestackisequaltothe maximumdepthofthestreamingXMLdocument,whilethe INGALGORITHM width of the stack is equal to the length of the query. The We proceed with the overview of the stack-based holis- top-of-stack pointer is referred to as TOS in the remainder tic twig filtering algorithm and respective modifications as of this paper. required for the mapping onto GPUs. Detailed descriptions of the operations and properties of these stacks are presented next. 3.1 FrameworkOverview 3.2 PushStackAlgorithm The structure of an XML-encapsulated document can be represented as a tree, where nodes are XML tags. Opening As noted earlier, an essential step in matching a twig andclosingtagsintheXMLdocumenttranslatetothetrav- query lies in matching all root-to-leaf paths of this query. eling(downandup)throughthetree. SAXparsersprocess Weachievethistaskusingpush stacks,namelystackshold- XML documents in a streaming fashion, while generating ing query matching state, updated solely on push events as open(tag) and close(tag) events. XML queries expressed in the XML document is parsed. 3 R0 a a b T0 T4 c ... c c d e d f e Twig query T1 T3 XML Document R n−1 Figure 4: Sample XML document tree represen- tation and twig query. The twig query can be broken down into two root-to-leaf paths, namely /a//c//d and /a//c/e. While each of these paths T isindividuallymatched(found)inthetree, thetwig 2 /a//c[//d]/e is not. Figure 3: Abstract view of the structure of an XML document tree, with root R , and several subtrees 0 • Columns corresponding to wildcard nodes allow the such as T ,T , etc. 0 1 propagation from a prefix column without checking whetherthecolumntagisthesameasthetagrespec- tive to the push event. Wedescribethematchingprocessasadynamicprogram- ming algorithm, where the push stack represents the dy- • Iftherelationshipbetweennodeanditsprefixisances- namic programming table. Each column in the push stack tor-descendant, then the diagonal propagation prop- represents a query node from the XPath expression; when erty applies, as in parent-child relationships. In addi- mapping a twig to the GPU, each stack column needs a tion,inordertosatisfyancestor-descendantsemantics pointer to its parent column (prefix column). A split node (the descendant lies potentially more than one node acts as a prefix for all its children nodes. apart from the parent), a match is carried in the pre- The intuition is that the root-to-leaf path with length L fix column vertically upwards, even if the tag, which can be in a matched state only if its prefix of length L-1 is triggered the push event does not correspond to the matched. This optimal substructure property can be triv- descendant’scolumntag. Usingthistechnique,allde- iallyprovedbyaninductionoverthematchedquerylength scendantsoftheprefixnodecanseethematchedstate (assuming a matched dummy root as base case). of the prefix. This matched info is popped with the The Lth column is matched if ‘1’ is stored on the top of prefix. thestackforthisparticularcolumn. Whenacolumn,corre- Figure 2b depicts the matching of the path /c//d, sponding to a leaf node in the twig query, stores ‘1’ on the where d is a descendant but not a child of c. Note top-of-stack, then the entire root-to-leaf path is matched. that upward match propagation for prefix node c will The following list summarizes the properties of the push continue even after d is matched. This process will stack: stop only after c will be popped out of the stack. • Only the entries in the top-of-stack are updated on • A match in an ancestor column will only be seen by each event. descendants. This is visualized through Figure 3, de- picting an abstract XML tree, with root R , and sev- 0 • The entries in the push stack can be modified only in eralsubtreessuchasT ,T ,etc. AssumingnodeR 0 1 n−1 response to a push event (open(tag)). is a matched ancestor, this match will be reported in child subtree T using push events. Subtrees T and 2 0 • On a pop event, the pointer to the stop-of-stack is T wouldnotseethematch,astheywouldbepopped 1 updated(decreased). Thepoppedstate(dataentries) bythetimeR isvisited. Similarly,subtreesT and n−1 3 is lost. T will not see the match, as R would be popped 4 n−1 prior to entering them. • Adummyrootnodeisinamatchedstateatthebegin- ning of the algorithm (columns corresponding to root Considering twigs: in the case of a split query node hav- nodesareassumedtobematchedevenbeforeanyop- ingbothancestor-descendantandparent-childbelowit,then eration on the stack occurs). twostackcolumnsareallocatedforthesplitnode;onewould allowverticalupwardspropagationof1’s(forancestor-descen- • Ifarelationshipbetweenanodeanditsprefixisparent- dantrelationships),andanotherwouldnot(forparent-child child,thena‘1’candiagonallypropagatefromthepar- relationships). These two separate columns will also later ent’scolumntothechildcolumn,onlyonapushevent be used to report matches back to root in the pop stack iftheopenedtagmatchesthatofthechild. Figure2a (Section 3.3 describes this situation in details). depictsthediagonalpropagationforaparent-childre- Recurrence equation, applied at each (binary) stack cell lationship. C on push events is shown on Figure 5. i,j 4 relationship between jth column and its prefix is “/” Ci,j = 10 ioftherwCOCiiiRs−−e11,,jj−=1 =1 A1NDANreDlationshAipNjOjbDttehhRtwccooelleuunmmjnnthttaacoggluwismawsnioladpncedanreidtdssiypnmrepbfiuoxslhi“se*v“”e/n/t” Figure 5: Recurrence relation for push stack cell C . 1≤i≤d,1≤j ≤l, where d - maximum depth of XML i,j document, l - length of the query. 3.3 PopStackAlgorithm RecallingFigure3,onlynodesR ,...,R wouldbe 0 n−1 Matching of all individual root-to-leaf paths in a twig reported about a matched ancestor. In this case sub- query is a necessary condition for a twig match, but it is tree T2 is not matched, since we report a match only notsufficient. Figure4showsthesituationwhenbothpaths during the pop event of the last matched path node /a//c//d and /a//c/e of the query /a//c[//d]/e report a Rn−1, which in turn occurs after T2 has been pro- match, but holistically the twig fails to match. In order to cessed. Subtrees T0 and T1 again do not observe the correctly match the twig query we need to check that the match,becauseitwasnotyetreportedatthemoment tworoot-to-leafpaths,branchingatsplitnodej reporttheir they are encountered. Although subtrees T3,T4 are matchtothesamenodej,bythesametimeitwillbepopped processedafterthepopevent,theystillcannotseethe out of stack. match, since the top of the stack has grown by pro- Inordertoreportaroot-to-leafmatchtothenearestsplit cessing additional push events. node in [25] we introduced a pop stack. As for the push stack, the reporting problem could be solved using a dy- • Sincethepurposeofthepopstackistoreportamatched namic programming approach, where the pop stack serves path back to the root, reporting starts at the twig asadynamicprogrammingtable. Columnsofthepopstack leafnodes. Stackcolumnscorrespondingtoleafnodes similarly represent query nodes of the original XPath ex- in the original query are matched only if they were pression. matched in the push stack. Column j would be considered as matched in pop stack if ‘1’ is stored on TOS for a particular column after a pop • A split node reports a match only if all of its children event. The optimal substructure property could again be have been matched. If the latter is true, relationship easily proved, but this time the induction step for a split propagation rules are applied so as to carry the split node should consider optimality of all its children paths, match further. rather than just one leaf-to-split-node path, in order to re- portamatch. Whenacolumn,correspondingtothedummy Figure 6 shows a recurrence relation, which is applied at root node, stores ‘1’ on the top of the pop stack, the entire each stack cell D on a pop event. i,j twig query is matched. Thesubsequentlistsummarizesthepopstackproperties: 4. FROMALGORITHMTOEFFICIENTGPU • Unlike the push stack, the pop stack is updated both IMPLEMENTATION on push and pop events: push always forces rewriting ThetypicalworkloadforanXMLbasedpublish-subscribe the value on the top of the pop stack to its default systemconsistsofalargenumberofuserprofiles(twigque- value, which is ‘0’. A pop event on the other hand ries), filtered through a continuous document data stream. reports a match, propagating ‘1’ downwards, but will Parallel architectures like GPUs offer an opportunity for not force ‘0’ to be carried in the same direction. muchperformanceimprovement,byexploitingtheinherent • If the relationship between a node and its prefix is parallelism of the filtering problem. Using GPUs, the im- parent-child, then ‘1’ is propagated diagonally down- plementationofourstack-basedfilteringapproachleverages wards from the node’s column to its parent. As with several levels of parallelism, namely: the push stack, propagation occurs only if the pop event that fired the TOS decrement, closes the tag, • Inter-query parallelism: All queries (stacks) are correspondingtothecolumn’staginthequery(unless processed independently in a parallel manner, as they the column tag is the wildcard). are mapped on different SMs (streaming multiproces- sors). • The propagation rule for the case when the relation- ship between a node and its prefix is ancestor-descen- • Intra-queryparallelism: Allquerynodes(stackco- dant,issimilartotheonedescribedforthepushstack. lumns)areupdatingtheirtop-of-stackcontentsinpar- This time however a match is propagated in the de- allel. ThemainGPUkernelconsistsofthequerynode scendantnodeanddownwards,ratherthaninitsprefix evaluation,andeachisallocatedaSP(streamingpro- and upwards. cessor). 5 C =1, if a node corresponding to jth column is a leaf in twig query 1 if D∀ci+∈1,{jchil=dr1enAofNjDthcolumOjn+R}D1tih+c1,oclu=m1n,tiafgawnaosdeclocoserdresinpopnodpinegvetnotjth column is a split node D = i+1,j+1 i,j 0 otherwODiRis+e1,j =1 AND relationshjip+b1etthwceoelnumjtnhtcaogluimsnwialdncdaritdsspyrmefibxolis“*“”//” Figure 6: Recurrence relation for pop stack cell D . 1≤j ≤l , 1≤i≤d, where d - maximum depth of XML i,j document, l - length of the query, C - a corresponding cell of push stack. i,j FIELD SIZE EveryGPUkernel,executedinathreadonaGPUStream- Event type (pop/push) 1 bit ing Processor (SP), logically represents a single query node Tag ID 7 bits (evaluatedusingonestackcolumn). Eachofthethreadup- dates the top-of-stack value of its column according to the Table 1: Encoding of XML events at parsing (pre- received XML stream event. GPU). The size of each encoded event is 1 byte. On the GPU side, multiple threads are grouped within a threadblock. Eachthreadblockisindividuallyscheduledby FIELD SIZE the GPU to run on a Streaming Multiprocessor (SM). The IsLeaf 1 bit latter have a small low-latency memory, which we use to Prefix relationship 1 bit store the stack and other relevant state. The deployed ker- Query children with “/” relationship 1 bit nelswithinablockperformfilteringofwholetwigqueriesby updating the top-of-stack information in a parallel fashion. Query children with “//” relationship 1 bit Algorithm 1 represents a simplified version of the GPU Prefix ID 10 bits kernel: the process of updating a value on the top-of-stack. ... 11 bits Tag name 7 bits Algorithm 1 GPU Kernel Table 2: GPU Kernel personality storage format. 1: level←0 This information is encoded using 4 bytes. Some 2: for all XML events in document stream do bits are used for padding, and do not encode any 3: if push event then information (depicted as “...”). 4: level++ 5: prefixMatch←pushStack[level−1][prefixID] 6: if prefixMatch propagates diagonally then • Inter-document parallelism: This type of paral- 7: pushStack[level][colID]→childMatch←1 lelism, provided by the Fermi NVidia GPU architec- 8: end if ture [27], is used to process several XML documents 9: if prefixMatch propagates upwards then in parallel, thus increasing the filtering throughput. 10: pushStack[level][colID]→descMatch←1 11: end if Inthefollowingsubsections,weprovideadetaileddescrip- 12: else tionoftheimplementationofourstackapproachonGPUs, 13: level−− alongsidewithhardware-specificoptimizationswedeployed. 14: prevMatch←popStack[level+1][colID] 15: if prevMatch propagates upwards then 4.1 XMLStreamEncoding 16: pop stack[level][colID]→descMatch←1 XMLparsingisorthogonaltofiltering,andhasbeenthor- 17: end if oughly studied in the literature [20, 28, 21]. In this work, 18: if prevMatch propagates diagonally then parsed XML documents are passed to the GPU as encoded 19: if node is leaf && pushStack[level+1][colID] open/close events, ready for filtering. In order to minimize then the information transfer between CPU and GPU (which is 20: popStack[level][colID] → childMatch ← heavilylimitedbythePCIeinterfacebandwidth),XMLdoc- 1 uments are compressed onto an efficient binary representa- 21: end if tion, that is easily processed on the GPU side. 22: end if Each open/close tag event is encoded as a 1 byte entry, 23: popStack[level][prefix] → childMatch ← whose most significant bit is reserved to encode the type of popStack[level][prefix]&&popStack[level][colID] event (push/open internally viewed as pop/close) while the 24: end if remaining 7 bits represent the tag ID (Table 1). Using this 25: end for encoding, the whole XML document is represented as an arrayofsuchentries,whichisthentransferredtotheGPU. Note that the ColID variable refers to the column ID in- dex,uniquewithinasinglethreadblock(threadId inCUDA 4.2 GPUStreamingKernel primitives). Similarly,prefix servesasapointertothecolID 6 of the prefix node, within the thread block (twigs are eval- FIELD SIZE uated within a single block). Value for children 1 bit The match state on lines 6 and 18 propagates diagonally with “/” relationship upwards and downwards respectively, if: Push stack Value for children 1 bit with “//” relationship • The relationship between node and its prefix is “/”. ... 2 bits ... 2 bits • ThechildMatchvalueoftheentryfromtherespective Value obtained from level in the push/pop stack is matched (lines 5 or 14 1 bit Pop stack descendant respectively). Value obtained from 1 bit • The(fixed)columntagcorrespondstotheopen/closed nested paths XML tag, or the column tag is a wildcard. Table 3: Merged push/pop stack entry. The size of The match on lines 9 and 15 propagates strictly upwards each entry is 1 byte. Padding bits are depicted as and downwards respectively, if: “...”. • The relationship between node and its prefix is “//”. Fermiarchitecture[27]themaximumnumberofthreadsal- • ThedescMatchvalueoftheentryfromtherespective lowed to be accommodated in a single thread block along level in the push/pop stack is matched (lines 5 or 14 one dimension is 1024, ten bits would be enough to encode respectively). a prefix reference. To address the case of a split node with children having Finally the last 7 bits represent the tag ID of the query different relationship types, the push stack needs to keep node. Tag ID is needed to determine if a match should be two separate fields: childMatch to carry the match for the carried diagonally on lines 6 and 18. children with parent-child relationship and descMatch for 4.4 ExploitedGPUParallelismLevels the ancestor-descendant case. The same occurs with the pop stack: descMatch will Since a GPU executes instructions in a SIMT (Single In- propagatematchfromthenode’sdescendants,andchildMa- struction Multiple Threads) manner, at each point in time tchwillcarrythematchfromallitsnestedpaths. Thelatter multiple threads are concurrently executed on SM. CUDA isespeciallymeaningfulforsplitnodes,astheyarematched programmingplatformexecutesthesameinstructioningroups only when all their split-node-to-leaf paths are matched. of 32 threads, called a warp. As each GPU kernel (thread) This reporting is done as the last step on processing of pop evaluatesonequerynode,intra-queryparallelismisachieved event on line 23. through the parallel execution of threads within a warp. Parrallelevaluationisbeneficialincomparisontoserialquery 4.3 KernelPersonalityEncoding execution not only when the number of split nodes is large, As every GPU kernel executes the same program code, but also in the case of most of the split nodes appearing in it requires a parameterization in the form of a single ar- the nodes which are close to the query root. gument which we call personality. A personality is created Threadsaregroupedintoblocks,whichareexecutedeach onceoffline,bythequeryparseronCPU,whichencodesall on one SM; having said that, the amount of parallelism is information about the query node. not bounded to the number of SMs. It could be further A personality is encoded using a 4 byte entry, whose at- improved by scheduling warps from different logical thread tribute map is shown in Table 2. blocks on one physical SM. This is done in order to fully Themostsignificantbitindicateswhetherthequerynode utilizetheavailablehardwareresources. Butthenumberof is a leaf node in the twig. This information is used in pop co-allocated blocks depends on available resources (shared event processing to start matching leaf-to-split-node paths memoryandregisters),whicharesharedbetweenallblocks on line 19 of Algorithm 1. executing on SM. Thefollowingbitencodesthetypeofrelationshipthatthe SMusessimultaneousmultithreadingtoissueinstructions querynodehaswithitsprefix,whichisneededtodetermine from multiple warps in a parallel manner. However, due to match is propagated on lines 6,9,15 and 18. limitations imposed by the GPU architecture and resource Consider the case where a split node has two children constraints,themaximumnumberofwarpsavailableforex- withdifferenttypesofrelationshipconnectingthemtotheir ecution is limited. The ratio between achieved number of prefix. Insteadofduplicatingthefollowingquerynodes,we active warps per SM and maximum allowed limit, is known usethe3rd and4th mostsignificantbitsofpersonalityentry as the GPU occupancy. It is always desirable to achieve to encode whether a node has children with parent-child, 100% occupancy, because a large number of warps, avail- ancestor-descendantrelationship,orbothrespectively. Note able to be scheduled on SM helps to mask memory latency, that for other types of query nodes (including split nodes, which is still a problem even if we use fast shared memory. which have several children, but all of them are connected There are two main ways to increase occupancy: increas- with one type of relationship) only one of those bits will ing the block size or co-scheduling more blocks that share be set. This information would be later used to propagate the same physical SM. The trade-off between these two ap- push value into either childMatch or descMatch fields for proachesisdeterminedbytheamountofresourcesconsumed ordinary node, or both of them is addressed special case on by individual threads within a block. lines 7 and 10. Inter-query parallelism is achieved by parallel processing Aquerynodealsoneedstoencodeitsprefixpointer,which ofthreadblocksonSMsandsharingthemintimemultiplex is the prefix ID in a contiguous thread block. Since in the fashion. 7 MEMORY TYPE SCOPE LOCATION whichmakesitagoodcandidateforstoringinsharedmem- Global All threads Off-chip ory. However, the XML event stream is cannot fit in the Shared One thread block On-chip size-limited shared memory. It is hence placed in global Register One thread On-chip memory and is read by all threads throughout execution. WeoptimizeaccessesbyreadingtheXMLeventstreaminto Table 4: Characteristics of memory blocks available shared memory in small pieces, looping over the document as part of the GPU architecture. The scope de- in a strided manner. scribes the range of accessibility to the data placed Experimentalevaluationshowedthatthemainfactorthat within the memory. limitsachievinghighGPUoccupancyistheSMsharedmem- ory. In order to save this resource we merged the pop and push stacks into a single data structure. This merged stack BecausetypicalqueriesinXMLfilteringsystemsarerela- contains compressed elements, each of which consumes of 1 tivelyshort,wepackquerynodesrelatedtodifferentqueries byte of shared memory. The most significant half of this intoasinglethreadblock. Thisisdoneinordertoavoidre- byte stores information needed for the push stack, and the sulting in a large number of small thread blocks, since the least significant half encodes the pop stack value. The de- GPU achitecture limits maximum number of thread blocks tailed encoding scheme is shown in Table 3. Both the push that could be scheduled per SM. Therefore packing maxi- and pop parts of the merged stack value contain the fields: mizes the number of used SPs within a SM. However de- childMatch and descMatch, described in Section 4.2. pending on query size some SPs could be unoccupied. We address this issue by using a simple greedy heuristic to as- 4.6 AdditionalOptimizations sign queries to blocks. Processing of pop event makes each node responsible for Finally, the Fermi architecture [27] opened a possibility reporting its match/mismatch to the prefix node through toleverageinter-documentparallelism,byexecutingseveral the prefix pointer, because the split node does not keep in- GPU kernels concurrently, every kernel processes a single formationaboutitschildren. Incaseofamassivelyparallel XML document. It is supported by using asynchronous architecture like a GPU, this reporting could lead to race memory copying and kernel execution operations together conditions between children, reporting their matches in an withfine-grainedsynchronizationsignalingevents,delivered unsynchronized order. In order to avoid these race condi- through GPU event streams. This feature allows us to in- tionstheCUDAToolkitprovidesasetofatomicoperations creaseGPUoccupancyevenincaseswhenthereisashortage [26],includingatomicAnd(),neededtoimplementthematch of queries, which in normal situation would lead to under- propagating logic, described in Section 4.2. utilization,hencelowerGPUthroughput. Benefitsfromthe Experimentalresultsshowedthattheusageofatomicop- concurrent kernel invocation are later discussed in the ex- erations heavily deteriorates performance, up to several or- perimental section 5. dersofmagnitudeincomparisontoimplementationshaving non-atomic counterparts of that operations. To overcome 4.5 Efficient Memory GPU Memory Hierar- this issue each merged stack entry is coupled with an ad- chyUsage ditional childrenMatch array of predefined size. All chil- TheGPUarchitectureprovidesseveralhierarchicalmem- drenreporttheirmatchesintodifferentelementswithinthis ory modules (summarized in Table 4), depicting varying array, therefore avoiding race conditions. After children re- characteristics and latencies. In order to fully benefit from portingisdone,thesplitnodecaniteratethroughthisarray thehighlyparallelarchitecture,goodperformanceisachieved and collect matches, “AND”-ing elements of the children- by carefully leveraging the trade-offs of the provided mem- Match array and finally reporting its own match according ory levels. to the obtained value. The size of childrenMatch array is GlobalmemoryisprimarilyusedtocopydatafromCPU statically set during compilation, such that the algorithmic to GPU and to save values, calculated during kernel execu- implementation is customized to the query set at hand. tion. Kernel personalities and XML document streams are This number should be chosen carefully, since large ar- examplesofsuchdatainourcase. Sincetheglobalmemory rayscouldsignificantlyincreasesharedmemoryusage. Each is located off the chip its latency penalty is very high. If a query is coupled with exactly the number of stack columns thread needs to read/write value from/in global memory it needed, depending on the number of respective children. isdesirabletoorganizethoseaccessesincoalesced patterns, Customizeddatastructuresarecompiled(once,offline)from when each thread within a warp accesses adjacent memory aninputsetofqueries. Thisisincontrasttoamoregeneric location, thuscombiningreads/writesintoa singlecontigu- approach which would greatly limit the scalability of the ous global memory transaction, avoiding memory bus con- proposed solution, where each query node would be asso- tention. ciated with a generic data structure, accommodating for a In our case, global memory accesses are minimized such max(ratherthancustom)setofproperties. Formostprac- that: tical cases, twig queries are not very “bushy”, hence this optimization could significantly decrease filtering execution • Thethreadreadsitspersonalityatthebeginningker- time, as opposed to the general solution leveraging atomic nel of execution, then stores it in registers. operation semantics. • Onlythreads,correspondingtorootnodes,writeback their match state at the end of execution. 5. EXPERIMENTALRESULTS Unlike the kernel personality array, the XML document Our performance evaluation was completed on two GPU stream is shared among all queries within a thread block, devices from different product families, namely: 8 • NVidiaTeslaC2075: Fermiarchitecture,14SMswith 3 32 SPs, 448 computational cores in total. Query length 5 450 Query length 10 • NVidia Tesla K20: Kepler architecture, 13 SMs with Query length 15 400 192 SPs each, 2496 compute cores. 350 nts/s Measurements for CPU-based approaches are produced MB/s 2 300 nd Eve f3strhe0oneGmtsaBttaaiovtdfeeu-RooaffAl-st6Moh-fce,to-warruraetnr2enY.-3ibFn0agiGslteoHednrzX[CI1Mn2etn]Le.tlOfiXSlteeo6rn.i4nEgL5mi-n2eu6tx3h.0oAdsessr,vaweerreuwpsiretedh- Throughput, 1 122505000 ughput, thousa o Wall-clock execution time is measured for the proposed 100 Thr approach running on the above GPUs, and for YFilter ex- 50 ecuted on the 12-core CPU. In the case of YFilter, parsing time is not measured, since a similar SAX-like approach is 0 32 64 128 256 512 1024 2048 utilized by the GPU; furthermore, the focus of this work is Number of queries on filtering. Execution time starts from having the parsed documents in the cache. With regards to the GPU setup, Figure 7: Tesla C2075 throughput (in MB/s and execution time includes sending the parsed documents, fil- thousands of XML Events/s) of filtering 1MB XML tering, and receiving back filtering results. document for queries of length 5,10 and 15. Documents of different sizes are used, obtained by trim- ming the original DBLP XML dataset [1] into chunks of differentlengths;also,syntheticXMLdocumentsaregener- querieswillresultinserializedexecution,whichresultsinan atedusingtheDBLPDTDschemawiththehelpofToXgene almostlinearrelationwiththeaddedqueries(e.g. a50%de- XML Generator [6]. Experiments were performed on single crease in throughput as the number of queries is doubled). documentsofsizesrangingfrom32kBto2MB.Furthermore, tocapturethestreamingnatureofpub-subs,anothersetof 5.2 SpeedupOverSoftware experiments was carried out, filtering batches of 500 and 1000 25kB synthetic XML documents. InordertocalculatethespeedupoffilteringoverYFilter, Several performance experiments while varying the block we measured the execution time of YFilter running on a sizewerecarriedout,withconclusionthatablocksizeof256 CPUandourGPUapproachrunningonTeslaC2075while threads is best, maximizing utilization hence performance filtering 256 queries of length 10. This particular query (data omitted for brevity). dataset was picked to study the speedup of a fully utilized Asfortheprofilecreation,uniquetwigqueriesaregener- GPU, because it corresponds to one of the breaking points ated using the XPath generator provided with YFilter [12]. shown on Figure 7. As mentioned earlier, the execution Inparticular,wemadeuseofafairlydiversequerydataset: times for both systems do not include the time spent on parsing the XML documents. • Number of queries: 32 to 2K. Figure 8 shows that the maximum speedup is achieved for documents of small size. As the size increases, speedup • Query size: 5, 10 and 15 nodes. gradually lowers and flattens out for documents ≥ 512kB. • Numberofsplitpoints: 1,3or6noderespectively(to Thiseffectcouldbeexplainedbythelatencyofglobalmem- the above query sizes). ory reads, since number of strided memory accesses grows with increasing document size. • Maximum number of children for each split node: 4 The existence of wildcard and // has a positive effect on nodes. the speedup: the execution time of the YFilter depends on size of the NFA automaton, which, in turn, grows with the • Probability of ‘*’ and ‘//’: 10%, 30%, 50% aforementioned probability. 5.1 GPUThroughput 5.3 BatchExperiments Figure7showsthethroughputofTeslaC2075GPUusing To study the effect of inter-document parallelism we per- a1MBXMLdocument,foranincreasingnumberofqueries formed experiments with batches of documents. These ex- with varying lengths. Throughput is given in MB/s and periments used synthetically generated XML documents. in thousands of XML Events/s, which is obtained assum- Since the YFilter implementation is single-threaded and ing that on average every 6.5 bytes of the XML document cannotbenefitfromthemulticore/multiprocessorsetupthat encodes push/pop event (due to document design). weusedforourexperiments,wealsoperformmeasurements Datafordifferentwildcardand//-probabilitiesandother for a “pseudo”-multicore version of YFilter. This version documentsizesisomittedforbrevity,sincetheydonotaffect equally splits document-load across multiple copies of the the total throughput. program, the number of copies being equal to the number Thecharacteristicsofthethroughputgrapharecorrelated of CPU cores. The query load is the same for each copy with the results reported in [23]: starting off as a constant, of the program. Note that the same technique cannot be throughput eventually reaches a point, where all computa- applied to split query load among multiple CPU cores in tional GPU cores are utilized. After this point, which we previous experiments, since this could possibly deteriorate refer to as breaking point, the amount of parallelism avail- YFilter performance due to splitting a single unified NFA ablebytheGPUarchitectureisexhausted. Evaluatingmore into several automata. 9 FACTOR CPU GPU FPGA No effect on the Document size Decreases Minimal effect filtering core Number of queries Slowly decreases No effect prior breaking point, decreases after Slowly decreases Query length Slowly decreases No effect prior breaking point, decreases after Minimal effect Decreases on 15% on average per ‘*’ and //-probability No effect No effect 10 % increase in probability No effect until maximum number of query Query bushyness Minimal effect Minimal effect node children exceeds predefined parameter Query dynamicity Minimal effect Minimal effect No support No effect on the Batch size Slowly decreases Minimal effect filtering core Table 5: Summary of factors and their effects on the throughput of CPU-, FPGA- and CPU-based XML filtering. Query dynamicity refers to the capability of updating user profiles (the query set). The FPGA analysis is added using results from our study in [25]. 10 * and //-probability %50 Tesla K20, query length 5 9 9 * and //-probability %30 32 Tesla C2075, query length 5 * and //-probability %10 Tesla K20, query length 10 8 8 Tesla C2075, query length 10 7 B/s 16 TeTsleas Cla2 K02705,, qquueerryy lleennggtthh 1155 67 Events/s Speedup 456 Throughput, M 248 345 hroughput, million 3 2 T 1 2 1 1 0 32 64 128 256 512 1024 2048 32 64 128 256 512 1024 2048 Document size, kB Number of queries Figure 8: Speedup of the GPU-based solution run- Figure 9: Batched throughput (500 documents) of ning on a Tesla C2075 for a fixed number of queries theGPU-basedsolutionrunningonTeslaC2075and (256)andquerysize(10). Dataisshownforqueries, Tesla K20 GPUs, with wildcard and //-probability having wildcard and //-probabilities equal to 10%, fixed at 50%. Data is shown for queries of length 5, 30% and 50 %. 10 and 15. Throughput is shown in MB/s as well as in millions of Events/s. Forthisexperimentwehavemeasuredthebatchedexecu- the increasing batch size. This effect could be expalined by tiontimenotonlyontheTeslaC2075GPU,butalsoonthe low global memory bus throughput due to irregular asyn- TeslaK20,whichhassixtimesmoreavailablecomputational chronous copying access patterns. cores. ThespeedupusingtheTeslaK20showninFigures10and Figure 9 shows the throughput plot for a batch of 500 11 is higher than that of the Tesla C2075; nonetheless, it is docuemnts(experimentswithotherbatchsizesyieldsimilar notassignificantastheratioofthenumberoftheircompu- results). Unlike Figure 7, the plot does not depict a break- tationalcores: theamountofconcurrentkerneloverlapping ing point. This happens because all available GPU cores is limited by the GPU architecture. are always occupied by concurrently executing kernels, re- Finally, the speedup of the GPU-based versions over the sultinginafulyutilizedGPU.Doublingthenumberqueries software-multithreaded version is, as expected, lower then orquerylengthrequirestwotimesmoreGPUcomputational the speedup over a single-thread YFilter. cores, thus decreasing throughput by a factor of two. Figures 10 and 11 respectively depict the effect of query 5.4 TheEffectofSeveralFactorsonDifferent lengthandnumberofqueriesonthespeedupforbatchesof FilteringPlatforms size500and1000. Speedupdropsinbothcasesalmostbya factor of two as query load is doubled, and eventually even Table5summarizesthevariousfactorsaffectingthesingle- leads to a perfomance worse than that of the CPU. Thus documentfilteringthroughputforGPUsaswellasforsoft- we can infer that the throughput of YFilter is only slightly wareapproaches(Yfilter)andFPGAs. TheFPGAanalysis affected by query length and number of queries. is added using results from our study in [25]. Both figures show that the performane of the GPU so- Querydynamicityreferstothecapabilityofupdatinguser lution (hence speedup over CPU) slightly deteriorates with profiles (the query set). 10
Description: