High-Performance Holistic XML Twig Filtering Using GPUs

Ildar Absalyamov, UC Riverside, Riverside, CA 92507, iabsa001@cs.ucr.edu
Roger Moussalli, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, rmoussal@us.ibm.com
Walid Najjar, UC Riverside, Riverside, CA 92507, najjar@cs.ucr.edu
Vassilis J. Tsotras, UC Riverside, Riverside, CA 92507, tsotras@cs.ucr.edu
ABSTRACT

The current state of the art in information dissemination comprises publishers broadcasting XML-coded documents, in turn selectively forwarded to interested subscribers. The deployment of XML at the heart of this setup greatly increases the expressive power of the profiles listed by subscribers, using the XPath language. On the other hand, with great expressive power comes great performance responsibility: it is becoming harder for the matching infrastructure to keep up with the high volumes of data and users. Traditionally, general purpose computing platforms have been favored over customized computational setups, due to the simplified usability and significant reduction of development time. The sequential nature of these general purpose computers, however, limits their performance scalability. In this work, we propose the implementation of the filtering infrastructure using massively parallel Graphical Processing Units (GPUs). We consider the holistic (no post-processing) evaluation of thousands of complex twig-style XPath queries in a streaming (single-pass) fashion, resulting in a speedup over CPUs of up to 9x in the single-document case and up to 4x for large batches of documents. A thorough set of experiments is provided, detailing the varying effects of several factors on the CPU and GPU filtering platforms.

1. INTRODUCTION

Publish-subscribe systems (or simply pub-subs) are timely asynchronous event-based dissemination systems consisting of three main components: publishers, who feed a stream of documents (messages) into the system; subscribers, who register their profiles (queries, i.e., subscription interests); and a filtering infrastructure for matching subscriber interests with published messages and delivering the matched messages to the interested subscriber(s).

Early pub-sub implementations restricted subscriptions to pre-defined topics or channels, such as weather, world news, and finance, among others. Subscribers would hence be "spammed" with more (irrelevant) information than of interest. The second generation of pub-subs evolved by allowing predicate expressions; here, user profiles are described as conjunctions of (attribute, value) pairs. In order to add further descriptive capability to the user profiles, third-generation pub-subs adopted the eXtensible Markup Language (XML) as the standard format for data exchange, due to its self-describing and extensible nature. Exchanged documents are now encoded with XML, while profiles are expressed with XML query languages, such as XPath [2]. Such systems take advantage of the powerful querying that XML query languages offer: profiles can now describe requests not only on the document values but also on the structure of the messages(1), allowing the matching of complex twig-like profiles within tree-structured messages.

(1) In this paper, we use the terms "profile", "subscription" and "query" interchangeably; similarly for the terms "document" and "message".

Currently, XML-based pub-sub systems have been adopted for the dissemination of micronews feeds. These feeds are typically short fragments of frequently updated information, such as news stories and blog updates. The most prominent XML-based format used is RSS. In this environment, the RSS feeds are accessed via HTTP through URLs and supported by client applications and browser plug-ins (also called feed readers). Feed readers (like Bloglines and NewsGator) periodically check the contents of micronews feeds and display the returned results to the user.

The complex process of matching thousands of profiles against massive amounts of published messages is performed in the filtering infrastructure. From a user/functionality perspective, filtering consists of determining, for each published message, which subscriptions match at least once. The novelty lies in exploring new efficient filtering algorithms, as well as high-performance platforms on which to implement the algorithms and further accelerate their execution. Several fine-tuned CPU-based software filtering algorithms have been studied [3, 12, 14, 19]. These memory-bound approaches, however, suffer from the Von Neumann bottleneck and are unable to handle large volumes of input streams. Recently, various works have exploited the obvious embarrassingly parallel property of filtering, by evaluating all queries in parallel using Field Programmable Gate Arrays (FPGAs)
Path ::= Step | Path Step
Step ::= Axis TagName | Step "[" Step "]"
Axis ::= "/" | "//"
TagName ::= Name | "*"

Figure 1: Production rules to generate an XPath expression. "Name" corresponds to an arbitrary alphanumeric string.

[22, 24, 25]. By introducing novel parallel-hardware-tailored filtering algorithms, FPGAs have been shown to be particularly well suited for the stream processing of large amounts of data, where the temporary computational state is not offloaded to (low-latency) off-chip memories. These algorithms allowed the massively parallel streaming matching of complex path profiles that support the /child:: axis and /descendant-or-self:: axis(2), wildcard ('*') node tests, and account for recursive elements in the XML document. Using this streaming approach, inter-query and intra-query parallelism were exploited, resulting in up to two orders of magnitude speed-up over the leading software approaches.

(2) In the rest of the paper we shall use '/' and '//' as shorthand to denote the /child:: axis and /descendant-or-self:: axis, respectively.

In [23] we presented a preliminary study on the mapping of the path matching approach onto GPUs, providing the flexibility of software alongside the massive parallelism of hardware. In this paper we detail the support of the more complex twig queries on GPUs in a massively parallel manner. In particular, the novel contributions of this paper are:

• The first design and implementation of XML twig filtering on GPUs, more so in a holistic fashion.

• An extensive performance evaluation of the above approach is provided, with comparison to leading software implementations.

• A study on the effect of several query and document factors on GPUs and the software implementations through experimentation is depicted.

The rest of the paper is organized as follows: in Section 2 we present related work in software and hardware XML processing. Section 3 provides an in-depth description of the holistic XML twig filtering algorithm. Section 4 details the implementation of the XML filtering on GPUs. Section 5 presents an experimental evaluation of the parallel GPU-based approach compared to the state-of-the-art software counterparts. Finally, conclusions and open problems for further research appear in Section 6.

2. RELATED WORK

The rapid development of XML technology as a common format for data interchange, together with the emergence of information dissemination through event notification systems, has led to increased interest in content-based filtering of XML data streams. Unlike traditional XML querying engines, filtering systems experience an essentially different type of workload. In a querying system, the assumption is that XML documents are fixed (known in advance), which allows building efficient indexing mechanisms on them to facilitate the query process, whereas queries are ad hoc and can have arbitrary structure. On the contrary, XML filtering engines are fed a series of volatile documents, incoming in a streaming fashion, while queries are static (and known in advance). This type of workload prevents filtering systems from using document indexing, thus requiring a different paradigm to solve this problem. Moreover, since filtering systems return only a binary result (match or no match, i.e., whether a particular query was matched or not in a given document), they should be able to process a large number of queries. This is in contrast to XML querying engines, which need to report every document node that was matched by an incoming query.

Software-based XML filtering algorithms can be classified into several categories: (1) FSM-based, (2) sequence-based and (3) others. XFilter [4] is the earliest work studying FSM-based filtering, which builds a single FSM for each XPath query. Each query node is represented by an individual state in the FSM. Transitions in the automaton are fired when an appropriate XML event is processed. An XPath query (profile) is considered matched if an accepting state is reached in the FSM. YFilter [12] leverages path commonalities between individual FSMs, creating a single Non-Deterministic Finite Automaton (NFA) representation of all XPath queries and therefore reducing the number of states. Subsequent works [16, 29, 14] use a unified finite automaton for XPath expressions, using lazy DFAs and pushdown automata.

FiST [19] and its successors are examples of the sequence-based approach to XML filtering, which implies a two-step process: (i) the streaming XML document and the XPath queries are converted into sequences and (ii) a subsequence match algorithm is used to determine if a query had a match in the XML document.

Among the other works, [10] builds a Trie index to exploit query commonalities. Another form of similarity is explored in AFilter [9], which uses prefix as well as suffix sharing to create an appropriate index data structure. [13] introduces stream querying and filtering algorithms LQ and EQ, using a lazy and an eager strategy, respectively.

Recent advances in GPGPU technology opened a possibility to speed up many traditional applications, leveraging the high degree of parallelism offered by GPUs. A large number of research works explore the usage of GPUs for improving traditional database relational operations. [5] discusses improving the SELECT operation performance, while [17] focuses on implementing efficient join algorithms. Recently [11] addressed all commonly used relational operators. There are also research works concentrating on improving indexing operations. [18] introduces a hierarchical CPU-GPU architecture for efficient tree querying and bulk updating, and [7] proposes an abstraction for tree index operations.

There has also been much interest in using GPUs for XML related problems. [8] introduced the Data partitioning, Query partitioning and Hybrid approaches to parallelize XPath queries on multi-core systems. [31] used these strategies to create a cost model which decides how to process XPath queries on GPUs in parallel. [15] processed XPath queries in a way similar to YFilter, by creating and executing an NFA on a GPU, whereas [30] studied twig query processing on large XML documents with the help of GPUs. All these works focus on the problem of indexing and querying XML documents using XPath queries. However, none of them addresses filtering of XML documents, which is an orthogonal problem to query processing, thus requiring a different implementation approach.
Figure 2: Step-by-step overview of stack updates for (a) parent-child and (b) ancestor-descendant relations: (a) matching path {/a/b}, (b) matching path {/c//d}. Each step corresponds to an open(tag) event, with the respective opened tag highlighted. As the XML document tree is traveled downwards, content is pushed onto the top of the stacks. The leftmost stack column is set to be always in a matched state, corresponding to a dummy root node (denoted by $).
As mentioned, our previous work on the XML filtering problem concentrated on using FPGAs. In [24] we presented a new dynamic-programming based XML filtering algorithm, mapped to FPGA platforms. To support twig filtering, a post-processing step is required to identify which path matches correspond to twig matches. In [25] we extended this approach to support the holistic (no post-processing) matching of the considerably more complex twig-style queries. While providing significant performance benefits, these FPGA-based approaches were not without their disadvantages. Matching engines running on the FPGA relied on custom implementations of the queries in hardware and did not allow the update, addition, and deletion of user profiles on-the-fly. Although the query compilation to VHDL is fast, re-synthesis of the FPGA, which includes translation, mapping and place-and-route, is expensive (up to several hours) and must be done off-line. Furthermore, solutions on FPGAs cannot scale beyond the available resources, where the number of queries is limited by the amount of real estate on the chip.

GPU architectures do not have these limitations and thus offer a promising approach to the XML filtering problem. In this paper we extend our previous work [23], which considered only the filtering of linear XPath queries on GPUs, and provide holistic filtering of complex twig queries.

3. PARALLEL HOLISTIC TWIG MATCHING ALGORITHM

We proceed with an overview of the stack-based holistic twig filtering algorithm and the respective modifications required for the mapping onto GPUs.

3.1 Framework Overview

The structure of an XML-encapsulated document can be represented as a tree, where nodes are XML tags. Opening and closing tags in the XML document translate to traveling (down and up) through the tree. SAX parsers process XML documents in a streaming fashion, generating open(tag) and close(tag) events. XML queries expressed in XPath relate to the structure of XML documents and, hence, rely heavily on these open/close events. As will become clearer below, stacks are an essential structure of XML query processing engines, used to save the state as the structure of the tree is visited (push on open, pop on close).

Figure 1 shows the XPath grammar used to form twig queries, consisting of nodes connected with parent-child ("/"), ancestor-descendant ("//") or nested path relationships ("[]"). We denote by L the length of the twig query, representing the total number of nodes in the twig, in addition to a dummy start node at the beginning of each query (the latter being essential for query processing on GPUs). Such a framework allows us to filter the whole streaming document in a single pass, while capturing query match information in the stack.

In [25] we show that matching a twig query is a process consisting of two interleaved mechanisms:

• Matching individual root-to-leaf paths. Here, each such path is evaluated (matched) using a stack whose contents are updated on push (open(tag)) events.

• Carefully joining the matched results at query split nodes. This task is performed using stacks that are mainly updated on pop (close(tag)) events. The main reasoning lies in that a matched path reports its match state to its ancestors using the pop stack.

In both mechanisms, the depth of the stack is equal to the maximum depth of the streaming XML document, while the width of the stack is equal to the length of the query. The top-of-stack pointer is referred to as TOS in the remainder of this paper. Detailed descriptions of the operations and properties of these stacks are presented next.
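Before detailing these operations, the per-query state described above can be pictured with the following minimal C sketch. This is purely illustrative; the type, field and constant names (QueryStacks, MAX_DOC_DEPTH, MAX_QUERY_LEN) are ours and not part of the actual implementation, and the sizes are example bounds.

#include <stdint.h>

#define MAX_DOC_DEPTH 64   /* maximum depth of the streaming XML document      */
#define MAX_QUERY_LEN 16   /* number of query nodes, including the dummy root  */

/* One row per document level, one column per query node. */
typedef struct {
    uint8_t push[MAX_DOC_DEPTH][MAX_QUERY_LEN]; /* updated on open(tag) events  */
    uint8_t pop [MAX_DOC_DEPTH][MAX_QUERY_LEN]; /* updated on close(tag) events */
    int     tos;                                /* top-of-stack (current depth) */
} QueryStacks;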
3.2 Push Stack Algorithm

As noted earlier, an essential step in matching a twig query lies in matching all root-to-leaf paths of this query. We achieve this task using push stacks, namely stacks holding query matching state, updated solely on push events as the XML document is parsed.
Figure 3: Abstract view of the structure of an XML document tree, with root R_0 and several subtrees such as T_0, T_1, etc.

Figure 4: Sample XML document tree representation and twig query. The twig query can be broken down into two root-to-leaf paths, namely /a//c//d and /a//c/e. While each of these paths is individually matched (found) in the tree, the twig /a//c[//d]/e is not.

We describe the matching process as a dynamic programming algorithm, where the push stack represents the dynamic programming table. Each column in the push stack represents a query node from the XPath expression; when mapping a twig to the GPU, each stack column needs a pointer to its parent column (prefix column). A split node acts as a prefix for all its children nodes.

The intuition is that a root-to-leaf path of length L can be in a matched state only if its prefix of length L-1 is matched. This optimal substructure property can be trivially proved by induction over the matched query length (assuming a matched dummy root as the base case). The Lth column is matched if '1' is stored on the top of the stack for this particular column. When a column corresponding to a leaf node in the twig query stores '1' on the top-of-stack, the entire root-to-leaf path is matched.

The following list summarizes the properties of the push stack:

• Only the entries in the top-of-stack are updated on each event.

• The entries in the push stack can be modified only in response to a push event (open(tag)).

• On a pop event, the pointer to the top-of-stack is updated (decreased). The popped state (data entries) is lost.

• A dummy root node is in a matched state at the beginning of the algorithm (columns corresponding to root nodes are assumed to be matched even before any operation on the stack occurs).

• If the relationship between a node and its prefix is parent-child, then a '1' can diagonally propagate from the parent's column to the child's column, only on a push event and only if the opened tag matches that of the child. Figure 2a depicts the diagonal propagation for a parent-child relationship.

• Columns corresponding to wildcard nodes allow the propagation from a prefix column without checking whether the column tag is the same as the tag respective to the push event.

• If the relationship between a node and its prefix is ancestor-descendant, then the diagonal propagation property applies, as in parent-child relationships. In addition, in order to satisfy ancestor-descendant semantics (the descendant lies potentially more than one node apart from the parent), a match is carried in the prefix column vertically upwards, even if the tag which triggered the push event does not correspond to the descendant's column tag. Using this technique, all descendants of the prefix node can see the matched state of the prefix. This matched info is popped with the prefix. Figure 2b depicts the matching of the path /c//d, where d is a descendant but not a child of c. Note that upward match propagation for prefix node c will continue even after d is matched. This process will stop only after c is popped out of the stack.

• A match in an ancestor column will only be seen by descendants. This is visualized through Figure 3, depicting an abstract XML tree with root R_0 and several subtrees such as T_0, T_1, etc. Assuming node R_{n-1} is a matched ancestor, this match will be reported in child subtree T_2 using push events. Subtrees T_0 and T_1 would not see the match, as they would be popped by the time R_{n-1} is visited. Similarly, subtrees T_3 and T_4 will not see the match, as R_{n-1} would be popped prior to entering them.

Considering twigs: in the case of a split query node having both ancestor-descendant and parent-child relationships below it, two stack columns are allocated for the split node; one would allow vertical upwards propagation of 1's (for ancestor-descendant relationships), and the other would not (for parent-child relationships). These two separate columns will also later be used to report matches back to the root in the pop stack (Section 3.3 describes this situation in detail).

The recurrence equation, applied at each (binary) stack cell C_{i,j} on push events, is shown in Figure 5.
C_{i,j} = 1, if ( C_{i-1,j-1} = 1 AND the relationship between the jth column and its prefix is "/",
                  OR C_{i-1,j} = 1 AND the relationship between the jth column and its prefix is "//" )
             AND ( the jth column tag was opened in the push event OR the jth column tag is the wildcard symbol "*" );
C_{i,j} = 0, otherwise.

Figure 5: Recurrence relation for push stack cell C_{i,j}, 1 ≤ i ≤ d, 1 ≤ j ≤ l, where d is the maximum depth of the XML document and l is the length of the query.
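Read procedurally, the recurrence of Figure 5 amounts to the following per-column update on a push event. This is a minimal single-threaded C sketch of ours (not the GPU kernel); prefix, rel and tag_ok are illustrative names, and the prefix pointer replaces the j-1 index of the figure.

#include <stdint.h>

/* prev_row = push-stack row at depth i-1; row = row at depth i (the new TOS).   */
/* rel[j]: 0 = parent-child ("/"), 1 = ancestor-descendant ("//").               */
/* tag_ok: the opened tag equals column j's tag, or column j is a wildcard.      */
void push_update(const uint8_t *prev_row, uint8_t *row, int j,
                 const int *prefix, const int *rel, int tag_ok)
{
    int diag = (rel[j] == 0) && prev_row[prefix[j]]; /* "/" : from the prefix column    */
    int vert = (rel[j] == 1) && prev_row[j];         /* "//": carried vertically upward */
    row[j] = (uint8_t)((diag || vert) && tag_ok);
}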
3.3 Pop Stack Algorithm

Matching all individual root-to-leaf paths of a twig query is a necessary condition for a twig match, but it is not sufficient. Figure 4 shows a situation where both paths /a//c//d and /a//c/e of the query /a//c[//d]/e report a match, but holistically the twig fails to match. In order to correctly match the twig query we need to check that the two root-to-leaf paths branching at split node j report their match to the same node j by the time it is popped out of the stack.

In order to report a root-to-leaf match to the nearest split node, in [25] we introduced a pop stack. As with the push stack, the reporting problem can be solved using a dynamic programming approach, where the pop stack serves as the dynamic programming table. Columns of the pop stack similarly represent query nodes of the original XPath expression.

Column j is considered matched in the pop stack if '1' is stored on the TOS for this particular column after a pop event. The optimal substructure property can again be easily proved, but this time the induction step for a split node should consider the optimality of all its children paths, rather than just one leaf-to-split-node path, in order to report a match. When the column corresponding to the dummy root node stores '1' on the top of the pop stack, the entire twig query is matched.

The following list summarizes the pop stack properties:

• Unlike the push stack, the pop stack is updated both on push and pop events: a push always forces rewriting the value on the top of the pop stack to its default value, which is '0'. A pop event, on the other hand, reports a match, propagating '1' downwards, but will not force '0' to be carried in the same direction.

• If the relationship between a node and its prefix is parent-child, then '1' is propagated diagonally downwards from the node's column to its parent. As with the push stack, propagation occurs only if the pop event that fired the TOS decrement closes the tag corresponding to the column's tag in the query (unless the column tag is the wildcard).

• The propagation rule for the case when the relationship between a node and its prefix is ancestor-descendant is similar to the one described for the push stack. This time, however, a match is propagated in the descendant node and downwards, rather than in its prefix and upwards. Recalling Figure 3, only nodes R_0,...,R_{n-1} would be informed about a matched ancestor. In this case subtree T_2 is not matched, since we report a match only during the pop event of the last matched path node R_{n-1}, which in turn occurs after T_2 has been processed. Subtrees T_0 and T_1 again do not observe the match, because it was not yet reported at the moment they are encountered. Although subtrees T_3 and T_4 are processed after the pop event, they still cannot see the match, since the top of the stack has grown by processing additional push events.

• Since the purpose of the pop stack is to report a matched path back to the root, reporting starts at the twig leaf nodes. Stack columns corresponding to leaf nodes in the original query are matched only if they were matched in the push stack.

• A split node reports a match only if all of its children have been matched. If the latter is true, relationship propagation rules are applied so as to carry the split match further.

Figure 6 shows the recurrence relation which is applied at each stack cell D_{i,j} on a pop event.
D_{i,j} = C_{i,j}, if the node corresponding to the jth column is a leaf of the twig query;
D_{i,j} = 1, if the node corresponding to the jth column is a split node AND ∀c ∈ {children of the jth column}: D_{i+1,c} = 1;
D_{i,j} = 1, if D_{i+1,j+1} = 1 AND the relationship between the (j+1)th column and its prefix is "/"
             AND ( the (j+1)th column tag was closed in the pop event OR the (j+1)th column tag is the wildcard symbol "*" ),
          OR D_{i+1,j} = 1 AND the relationship between the jth column and its prefix is "//";
D_{i,j} = 0, otherwise.

Figure 6: Recurrence relation for pop stack cell D_{i,j}, 1 ≤ i ≤ d, 1 ≤ j ≤ l, where d is the maximum depth of the XML document, l is the length of the query, and C_{i,j} is the corresponding cell of the push stack.
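In the same style as the push-stack sketch above, the pop-stack update for a single column on a close(tag) event could look roughly as follows. This is an illustrative, simplified C sketch of ours: it assumes the child of column j occupies column j+1 (as in a linear path segment), and it omits the further relationship propagation that follows a matched split node.

#include <stdint.h>

/* pop_below = pop-stack row at depth i+1 (being popped); pop_row = row at depth i; */
/* push_row = push-stack row at depth i. rel[k]: 0 = "/", 1 = "//".                 */
void pop_update(const uint8_t *pop_below, uint8_t *pop_row, const uint8_t *push_row,
                int j, int is_leaf, int is_split,
                const int *children, int num_children,
                const int *rel, int tag_ok)
{
    if (is_leaf) {                         /* leaves take their value from the push stack */
        pop_row[j] = push_row[j];
        return;
    }
    if (is_split) {                        /* split nodes require all children matched    */
        uint8_t all = 1;
        for (int k = 0; k < num_children; k++)
            all &= pop_below[children[k]];
        pop_row[j] = all;
        return;
    }
    int diag = (rel[j + 1] == 0) && pop_below[j + 1] && tag_ok; /* "/" : from the child column */
    int vert = (rel[j]     == 1) && pop_below[j];               /* "//": carried downwards     */
    pop_row[j] = (uint8_t)(diag || vert);
}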
4. FROM ALGORITHM TO EFFICIENT GPU IMPLEMENTATION

The typical workload for an XML-based publish-subscribe system consists of a large number of user profiles (twig queries), filtered over a continuous document data stream. Parallel architectures like GPUs offer an opportunity for much performance improvement, by exploiting the inherent parallelism of the filtering problem. Using GPUs, the implementation of our stack-based filtering approach leverages several levels of parallelism, namely:

• Inter-query parallelism: All queries (stacks) are processed independently in a parallel manner, as they are mapped on different SMs (streaming multiprocessors).

• Intra-query parallelism: All query nodes (stack columns) update their top-of-stack contents in parallel. The main GPU kernel consists of the query node evaluation, and each is allocated an SP (streaming processor).

• Inter-document parallelism: This type of parallelism, provided by the Fermi NVidia GPU architecture [27], is used to process several XML documents in parallel, thus increasing the filtering throughput.

In the following subsections, we provide a detailed description of the implementation of our stack approach on GPUs, alongside the hardware-specific optimizations we deployed.

4.1 XML Stream Encoding

XML parsing is orthogonal to filtering, and has been thoroughly studied in the literature [20, 28, 21]. In this work, parsed XML documents are passed to the GPU as encoded open/close events, ready for filtering. In order to minimize the information transfer between the CPU and the GPU (which is heavily limited by the PCIe interface bandwidth), XML documents are compressed into an efficient binary representation that is easily processed on the GPU side.

Each open/close tag event is encoded as a 1-byte entry, whose most significant bit is reserved to encode the type of event (open/push versus close/pop), while the remaining 7 bits represent the tag ID (Table 1). Using this encoding, the whole XML document is represented as an array of such entries, which is then transferred to the GPU.

FIELD                   SIZE
Event type (pop/push)   1 bit
Tag ID                  7 bits

Table 1: Encoding of XML events at parsing (pre-GPU). The size of each encoded event is 1 byte.
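The host-side packing of Table 1 can be sketched as follows (a minimal C illustration of ours; the polarity of the type bit is our choice and the function name is hypothetical):

#include <stdint.h>

/* Pack one SAX event into the 1-byte layout of Table 1:                     */
/* bit 7 = event type (here 1 = close/pop, 0 = open/push), bits 6..0 = tag.  */
uint8_t encode_event(int is_close, uint8_t tag_id)
{
    return (uint8_t)((is_close ? 0x80 : 0x00) | (tag_id & 0x7F));
}

/* On the GPU side the fields are recovered with the inverse masks:          */
/*   is_close = byte >> 7;   tag_id = byte & 0x7F;                           */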
4.2 GPU Streaming Kernel

Every GPU kernel, executed in a thread on a GPU Streaming Processor (SP), logically represents a single query node (evaluated using one stack column). Each of the threads updates the top-of-stack value of its column according to the received XML stream event.

On the GPU side, multiple threads are grouped within a thread block. Each thread block is individually scheduled by the GPU to run on a Streaming Multiprocessor (SM). The latter has a small low-latency memory, which we use to store the stack and other relevant state. The deployed kernels within a block perform filtering of whole twig queries by updating the top-of-stack information in a parallel fashion.

Algorithm 1 represents a simplified version of the GPU kernel: the process of updating a value on the top-of-stack.

Algorithm 1 GPU Kernel
1:  level ← 0
2:  for all XML events in document stream do
3:    if push event then
4:      level++
5:      prefixMatch ← pushStack[level−1][prefixID]
6:      if prefixMatch propagates diagonally then
7:        pushStack[level][colID] → childMatch ← 1
8:      end if
9:      if prefixMatch propagates upwards then
10:       pushStack[level][colID] → descMatch ← 1
11:     end if
12:   else
13:     level−−
14:     prevMatch ← popStack[level+1][colID]
15:     if prevMatch propagates upwards then
16:       popStack[level][colID] → descMatch ← 1
17:     end if
18:     if prevMatch propagates diagonally then
19:       if node is leaf && pushStack[level+1][colID] then
20:         popStack[level][colID] → childMatch ← 1
21:       end if
22:     end if
23:     popStack[level][prefix] → childMatch ← popStack[level][prefix] && popStack[level][colID]
24:   end if
25: end for
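For concreteness, the thread/block/shared-memory organization described above might be laid out in CUDA roughly as follows. This is a hedged structural sketch of ours, not the actual kernel: the names (filter_kernel, is_root_column, the constants) and the launch geometry are illustrative, the per-event match logic is left as comments pointing to Algorithm 1, and in practice the event stream is staged through shared memory rather than read directly.

#include <stdint.h>

#define MAX_DOC_DEPTH 32        /* assumed bound on document depth               */
#define COLS_PER_BLOCK 256      /* query-node columns packed into one block      */

__device__ int is_root_column(uint32_t p) { return p & 1u; /* placeholder flag   */ }

__global__ void filter_kernel(const uint8_t *events, int num_events,
                              const uint32_t *personalities, uint8_t *matches)
{
    int colID = threadIdx.x;                       /* one thread = one query node (column) */
    int gcol  = blockIdx.x * blockDim.x + colID;
    uint32_t personality = personalities[gcol];    /* read once, then kept in a register   */

    __shared__ uint8_t stack[MAX_DOC_DEPTH][COLS_PER_BLOCK]; /* merged push/pop stack      */
    for (int d = 0; d < MAX_DOC_DEPTH; d++) stack[d][colID] = 0;
    int level = 0;

    for (int e = 0; e < num_events; e++) {
        uint8_t ev = events[e];
        if ((ev >> 7) == 0) { level++; /* apply the push rules of Algorithm 1 here */ }
        else                { level--; /* apply the pop rules of Algorithm 1 here  */ }
        __syncthreads();               /* all columns advance event by event       */
    }
    /* Only columns that must report (e.g., query roots) write back to global memory. */
    if (is_root_column(personality))
        matches[gcol] = stack[0][colID];
}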
Note that the colID variable refers to the column ID index, unique within a single thread block (threadIdx in CUDA primitives). Similarly, prefix serves as a pointer to the colID of the prefix node within the thread block (twigs are evaluated within a single block).

The match state on lines 6 and 18 propagates diagonally upwards and downwards, respectively, if:

• The relationship between the node and its prefix is "/".

• The childMatch value of the entry from the respective level in the push/pop stack is matched (lines 5 or 14, respectively).

• The (fixed) column tag corresponds to the opened/closed XML tag, or the column tag is a wildcard.

The match on lines 9 and 15 propagates strictly upwards and downwards, respectively, if:

• The relationship between the node and its prefix is "//".

• The descMatch value of the entry from the respective level in the push/pop stack is matched (lines 5 or 14, respectively).

To address the case of a split node with children having different relationship types, the push stack needs to keep two separate fields: childMatch to carry the match for the children with a parent-child relationship, and descMatch for the ancestor-descendant case. The same holds for the pop stack: descMatch propagates the match from the node's descendants, and childMatch carries the match from all its nested paths. The latter is especially meaningful for split nodes, as they are matched only when all their split-node-to-leaf paths are matched. This reporting is done as the last step of processing a pop event, on line 23.

4.3 Kernel Personality Encoding

As every GPU kernel executes the same program code, it requires parameterization in the form of a single argument which we call a personality. A personality is created once, offline, by the query parser on the CPU, and encodes all information about the query node. A personality is encoded using a 4-byte entry, whose attribute map is shown in Table 2.

FIELD                                    SIZE
IsLeaf                                   1 bit
Prefix relationship                      1 bit
Query children with "/" relationship     1 bit
Query children with "//" relationship    1 bit
Prefix ID                                10 bits
...                                      11 bits
Tag name                                 7 bits

Table 2: GPU kernel personality storage format. This information is encoded using 4 bytes. Some bits are used for padding and do not encode any information (depicted as "...").

The most significant bit indicates whether the query node is a leaf node in the twig. This information is used in pop event processing to start matching leaf-to-split-node paths on line 19 of Algorithm 1. The following bit encodes the type of relationship that the query node has with its prefix, which is needed to determine how the match is propagated on lines 6, 9, 15 and 18.

Consider the case where a split node has two children with different types of relationships connecting them to their prefix. Instead of duplicating the following query nodes, we use the 3rd and 4th most significant bits of the personality entry to encode whether a node has children with a parent-child relationship, an ancestor-descendant relationship, or both, respectively. Note that for other types of query nodes (including split nodes which have several children, but all of them connected with one type of relationship) only one of those bits will be set. This information is later used to propagate the push value into either the childMatch or the descMatch field for an ordinary node, or into both of them in the special case addressed on lines 7 and 10.

A query node also needs to encode its prefix pointer, which is the prefix ID in a contiguous thread block. Since in the Fermi architecture [27] the maximum number of threads allowed to be accommodated in a single thread block along one dimension is 1024, ten bits are enough to encode a prefix reference. Finally, the last 7 bits represent the tag ID of the query node. The tag ID is needed to determine whether a match should be carried diagonally on lines 6 and 18.

4.4 Exploited GPU Parallelism Levels

Since a GPU executes instructions in a SIMT (Single Instruction Multiple Threads) manner, at each point in time multiple threads are concurrently executed on an SM. The CUDA programming platform executes the same instruction in groups of 32 threads, called a warp. As each GPU kernel (thread) evaluates one query node, intra-query parallelism is achieved through the parallel execution of threads within a warp. Parallel evaluation is beneficial in comparison to serial query execution not only when the number of split nodes is large, but also when most of the split nodes appear close to the query root.

Threads are grouped into blocks, each of which is executed on one SM; having said that, the amount of parallelism is not bounded by the number of SMs. It can be further improved by scheduling warps from different logical thread blocks on one physical SM. This is done in order to fully utilize the available hardware resources. However, the number of co-allocated blocks depends on the available resources (shared memory and registers), which are shared between all blocks executing on an SM.

An SM uses simultaneous multithreading to issue instructions from multiple warps in a parallel manner. However, due to limitations imposed by the GPU architecture and resource constraints, the maximum number of warps available for execution is limited. The ratio between the achieved number of active warps per SM and the maximum allowed limit is known as the GPU occupancy. It is always desirable to achieve 100% occupancy, because a large number of warps available to be scheduled on an SM helps to mask memory latency, which remains a problem even when fast shared memory is used. There are two main ways to increase occupancy: increasing the block size, or co-scheduling more blocks that share the same physical SM. The trade-off between these two approaches is determined by the amount of resources consumed by individual threads within a block.

Inter-query parallelism is achieved by processing thread blocks in parallel on the SMs and sharing them in a time-multiplexed fashion.
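Returning briefly to the personality layout of Table 2 (Section 4.3): the offline query parser could pack such a word along the following lines. This is a C sketch of ours; the exact bit positions, beyond the field order and widths given in Table 2, are an assumption.

#include <stdint.h>

/* Pack the fields of Table 2 into a 32-bit personality, MSB first:           */
/* [31] IsLeaf | [30] prefix rel. | [29] has "/" children | [28] has "//"     */
/* children | [27..18] prefix ID | [17..7] padding | [6..0] tag name.         */
uint32_t make_personality(int is_leaf, int prefix_rel,
                          int has_pc_children, int has_ad_children,
                          uint32_t prefix_id, uint32_t tag_id)
{
    return ((uint32_t)(is_leaf         & 1) << 31) |
           ((uint32_t)(prefix_rel      & 1) << 30) |
           ((uint32_t)(has_pc_children & 1) << 29) |
           ((uint32_t)(has_ad_children & 1) << 28) |
           ((prefix_id & 0x3FF)             << 18) |
           (tag_id & 0x7F);
}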
Because typical queries in XML filtering systems are relatively short, we pack query nodes related to different queries into a single thread block. This is done in order to avoid producing a large number of small thread blocks, since the GPU architecture limits the maximum number of thread blocks that can be scheduled per SM. Packing therefore maximizes the number of used SPs within an SM. However, depending on query size, some SPs could remain unoccupied. We address this issue by using a simple greedy heuristic to assign queries to blocks.

Finally, the Fermi architecture [27] opened the possibility to leverage inter-document parallelism by executing several GPU kernels concurrently, where every kernel processes a single XML document. This is supported by using asynchronous memory copying and kernel execution operations, together with fine-grained synchronization signaling events delivered through GPU event streams. This feature allows us to increase GPU occupancy even in cases where there is a shortage of queries, which would otherwise lead to under-utilization and hence lower GPU throughput. Benefits from concurrent kernel invocation are discussed later in the experimental Section 5.
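As an illustration of this inter-document overlap, the host side could drive one CUDA stream per in-flight document along these lines. This is a hedged sketch of ours (reusing the hypothetical filter_kernel from Section 4.2); host buffers are assumed to be pinned so that asynchronous copies can actually overlap with kernel execution.

#include <cuda_runtime.h>
#include <stdint.h>

void filter_batch(int num_docs, uint8_t **h_events, uint8_t **d_events,
                  const size_t *doc_bytes, const int *num_events,
                  const uint32_t *d_personalities,
                  uint8_t **h_matches, uint8_t **d_matches, size_t result_bytes,
                  int num_blocks, int block_size)
{
    for (int d = 0; d < num_docs; d++) {
        cudaStream_t s;
        cudaStreamCreate(&s);
        /* Copy-in, filter and copy-out for document d are queued on its own stream,   */
        /* so work from different documents may overlap on Fermi-class and later GPUs. */
        cudaMemcpyAsync(d_events[d], h_events[d], doc_bytes[d],
                        cudaMemcpyHostToDevice, s);
        filter_kernel<<<num_blocks, block_size, 0, s>>>(
            d_events[d], num_events[d], d_personalities, d_matches[d]);
        cudaMemcpyAsync(h_matches[d], d_matches[d], result_bytes,
                        cudaMemcpyDeviceToHost, s);
        cudaStreamDestroy(s);       /* destruction is deferred until the stream drains */
    }
    cudaDeviceSynchronize();        /* wait for the whole batch to finish              */
}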
4.5 Efficient GPU Memory Hierarchy Usage

The GPU architecture provides several hierarchical memory modules (summarized in Table 4), exhibiting varying characteristics and latencies. In order to fully benefit from the highly parallel architecture, good performance is achieved by carefully leveraging the trade-offs of the provided memory levels.

MEMORY TYPE   SCOPE              LOCATION
Global        All threads        Off-chip
Shared        One thread block   On-chip
Register      One thread         On-chip

Table 4: Characteristics of memory blocks available as part of the GPU architecture. The scope describes the range of accessibility of the data placed within the memory.

Global memory is primarily used to copy data from the CPU to the GPU and to save values calculated during kernel execution. Kernel personalities and XML document streams are examples of such data in our case. Since the global memory is located off the chip, its latency penalty is very high. If a thread needs to read/write a value from/to global memory, it is desirable to organize those accesses in coalesced patterns, in which each thread within a warp accesses an adjacent memory location, thus combining reads/writes into a single contiguous global memory transaction and avoiding memory bus contention. In our case, global memory accesses are minimized such that:

• A thread reads its personality at the beginning of kernel execution, then stores it in registers.

• Only threads corresponding to root nodes write back their match state at the end of execution.

Unlike the kernel personality array, the XML document stream is shared among all queries within a thread block, which makes it a good candidate for storing in shared memory. However, the XML event stream cannot fit in the size-limited shared memory. It is hence placed in global memory and is read by all threads throughout execution. We optimize accesses by reading the XML event stream into shared memory in small pieces, looping over the document in a strided manner.

Experimental evaluation showed that the main factor limiting high GPU occupancy is the SM shared memory. In order to save this resource, we merged the pop and push stacks into a single data structure. This merged stack contains compressed elements, each of which consumes 1 byte of shared memory. The most significant half of this byte stores the information needed for the push stack, and the least significant half encodes the pop stack value. The detailed encoding scheme is shown in Table 3. Both the push and pop parts of the merged stack value contain the fields childMatch and descMatch, described in Section 4.2.

            FIELD                                       SIZE
Push stack  Value for children with "/" relationship    1 bit
            Value for children with "//" relationship   1 bit
            ...                                         2 bits
Pop stack   Value obtained from descendant              1 bit
            Value obtained from nested paths            1 bit
            ...                                         2 bits

Table 3: Merged push/pop stack entry. The size of each entry is 1 byte. Padding bits are depicted as "...".

4.6 Additional Optimizations

The processing of a pop event makes each node responsible for reporting its match/mismatch to the prefix node through the prefix pointer, because the split node does not keep information about its children. On a massively parallel architecture like a GPU, this reporting could lead to race conditions between children reporting their matches in an unsynchronized order. In order to avoid these race conditions, the CUDA Toolkit provides a set of atomic operations [26], including atomicAnd(), needed to implement the match-propagating logic described in Section 4.2.

Experimental results showed that the usage of atomic operations heavily deteriorates performance, by up to several orders of magnitude in comparison to implementations having non-atomic counterparts of those operations. To overcome this issue, each merged stack entry is coupled with an additional childrenMatch array of predefined size. All children report their matches into different elements within this array, therefore avoiding race conditions. After the children's reporting is done, the split node can iterate through this array and collect the matches, "AND"-ing the elements of the childrenMatch array and finally reporting its own match according to the obtained value. The size of the childrenMatch array is statically set during compilation, such that the algorithmic implementation is customized to the query set at hand.

This number should be chosen carefully, since large arrays could significantly increase shared memory usage. Each query is coupled with exactly the number of stack columns needed, depending on the number of respective children. Customized data structures are compiled (once, offline) from an input set of queries. This is in contrast to a more generic approach, which would greatly limit the scalability of the proposed solution, where each query node would be associated with a generic data structure accommodating a maximal (rather than custom) set of properties. For most practical cases, twig queries are not very "bushy", hence this optimization can significantly decrease filtering execution time, as opposed to the general solution leveraging atomic operation semantics.
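The race-free reporting scheme can be sketched as follows (a CUDA sketch of ours; CHILDREN_SLOTS stands for the compile-time bound on children per split node mentioned above, and the slot array is assumed to live next to the merged stack entry in shared memory):

#include <stdint.h>

#define CHILDREN_SLOTS 4   /* compile-time bound on children per split node */

/* Every split-node column owns CHILDREN_SLOTS one-byte slots per stack level;  */
/* child k of that split node writes its pop-stack value into slot k only, so   */
/* the writes never collide and no atomicAnd() is required.                     */
__device__ uint8_t collect_children(const uint8_t slots[CHILDREN_SLOTS])
{
    uint8_t all = 1;
    for (int k = 0; k < CHILDREN_SLOTS; k++)
        all &= slots[k];      /* "AND" the children's reports                    */
    return all;               /* becomes the split node's own match value        */
}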
5. EXPERIMENTAL RESULTS

Our performance evaluation was completed on two GPU devices from different product families, namely:
• NVidia Tesla C2075: Fermi architecture, 14 SMs with 32 SPs each, 448 computational cores in total.

• NVidia Tesla K20: Kepler architecture, 13 SMs with 192 SPs each, 2496 compute cores in total.

Measurements for the CPU-based approaches are produced with the state-of-the-art YFilter [12] XML filtering software, run on a 2.30 GHz Intel Xeon E5-2630 server with 30 GB of RAM, running Linux (EL 6.4).

Wall-clock execution time is measured for the proposed approach running on the above GPUs, and for YFilter executed on the 12-core CPU. In the case of YFilter, parsing time is not measured, since a similar SAX-like approach is utilized by the GPU; furthermore, the focus of this work is on filtering. Execution time starts from having the parsed documents in the cache. With regards to the GPU setup, execution time includes sending the parsed documents, filtering, and receiving back the filtering results.

Documents of different sizes are used, obtained by trimming the original DBLP XML dataset [1] into chunks of different lengths; also, synthetic XML documents are generated using the DBLP DTD schema with the help of the ToXgene XML Generator [6]. Experiments were performed on single documents of sizes ranging from 32kB to 2MB. Furthermore, to capture the streaming nature of pub-subs, another set of experiments was carried out, filtering batches of 500 and 1000 25kB synthetic XML documents.

Several performance experiments varying the block size were carried out, with the conclusion that a block size of 256 threads is best, maximizing utilization and hence performance (data omitted for brevity).

As for the profile creation, unique twig queries are generated using the XPath generator provided with YFilter [12]. In particular, we made use of a fairly diverse query dataset:

• Number of queries: 32 to 2K.

• Query size: 5, 10 and 15 nodes.

• Number of split points: 1, 3 or 6 nodes, respectively (for the above query sizes).

• Maximum number of children for each split node: 4 nodes.

• Probability of '*' and '//': 10%, 30%, 50%.

5.1 GPU Throughput

Figure 7 shows the throughput of the Tesla C2075 GPU using a 1MB XML document, for an increasing number of queries with varying lengths. Throughput is given in MB/s and in thousands of XML Events/s, the latter obtained assuming that on average every 6.5 bytes of the XML document encode a push/pop event (due to document design); under this assumption, 1 MB/s corresponds to roughly 150 thousand events per second.

Figure 7: Tesla C2075 throughput (in MB/s and thousands of XML Events/s) of filtering a 1MB XML document for queries of length 5, 10 and 15.

Data for different wildcard and //-probabilities and other document sizes are omitted for brevity, since they do not affect the total throughput.

The characteristics of the throughput graph are correlated with the results reported in [23]: starting off as a constant, throughput eventually reaches a point where all computational GPU cores are utilized. After this point, which we refer to as the breaking point, the amount of parallelism available in the GPU architecture is exhausted. Evaluating more queries results in serialized execution, which yields an almost linear relation with the added queries (e.g., a 50% decrease in throughput as the number of queries is doubled).

5.2 Speedup Over Software

In order to calculate the speedup of filtering over YFilter, we measured the execution time of YFilter running on the CPU and our GPU approach running on the Tesla C2075 while filtering 256 queries of length 10. This particular query dataset was picked to study the speedup of a fully utilized GPU, because it corresponds to one of the breaking points shown in Figure 7. As mentioned earlier, the execution times for both systems do not include the time spent on parsing the XML documents.

Figure 8 shows that the maximum speedup is achieved for documents of small size. As the size increases, the speedup gradually lowers and flattens out for documents ≥ 512kB. This effect can be explained by the latency of global memory reads, since the number of strided memory accesses grows with increasing document size. The existence of wildcards and // has a positive effect on the speedup: the execution time of YFilter depends on the size of the NFA automaton, which, in turn, grows with the aforementioned probability.

5.3 Batch Experiments

To study the effect of inter-document parallelism we performed experiments with batches of documents. These experiments used synthetically generated XML documents.

Since the YFilter implementation is single-threaded and cannot benefit from the multicore/multiprocessor setup that we used for our experiments, we also perform measurements for a "pseudo"-multicore version of YFilter. This version equally splits the document load across multiple copies of the program, the number of copies being equal to the number of CPU cores. The query load is the same for each copy of the program. Note that the same technique cannot be applied to split the query load among multiple CPU cores in the previous experiments, since this could possibly deteriorate YFilter performance due to splitting a single unified NFA into several automata.
FACTOR                   CPU                                                        GPU                                                                                   FPGA
Document size            Decreases                                                  Minimal effect                                                                        No effect on the filtering core
Number of queries        Slowly decreases                                           No effect prior to breaking point, decreases after                                    Slowly decreases
Query length             Slowly decreases                                           No effect prior to breaking point, decreases after                                    Minimal effect
'*' and //-probability   Decreases by 15% on average per 10% increase in probability No effect                                                                            No effect
Query bushyness          Minimal effect                                             No effect until maximum number of query node children exceeds predefined parameter    Minimal effect
Query dynamicity         Minimal effect                                             Minimal effect                                                                        No support
Batch size               Slowly decreases                                           Minimal effect                                                                        No effect on the filtering core

Table 5: Summary of factors and their effects on the throughput of CPU-, FPGA- and GPU-based XML filtering. Query dynamicity refers to the capability of updating user profiles (the query set). The FPGA analysis is added using results from our study in [25].
Figure 8: Speedup of the GPU-based solution running on a Tesla C2075 for a fixed number of queries (256) and query size (10). Data is shown for queries having wildcard and //-probabilities equal to 10%, 30% and 50%.

Figure 9: Batched throughput (500 documents) of the GPU-based solution running on Tesla C2075 and Tesla K20 GPUs, with wildcard and //-probability fixed at 50%. Data is shown for queries of length 5, 10 and 15. Throughput is shown in MB/s as well as in millions of Events/s.

For this experiment we measured the batched execution time not only on the Tesla C2075 GPU, but also on the Tesla K20, which has six times more available computational cores.

Figure 9 shows the throughput plot for a batch of 500 documents (experiments with other batch sizes yield similar results). Unlike Figure 7, the plot does not exhibit a breaking point. This happens because all available GPU cores are always occupied by concurrently executing kernels, resulting in a fully utilized GPU. Doubling the number of queries or the query length requires two times more GPU computational cores, thus decreasing throughput by a factor of two.

Figures 10 and 11 respectively depict the effect of query length and number of queries on the speedup for batches of size 500 and 1000. Speedup drops in both cases almost by a factor of two as the query load is doubled, and eventually even leads to a performance worse than that of the CPU. Thus we can infer that the throughput of YFilter is only slightly affected by query length and number of queries.

Both figures show that the performance of the GPU solution (hence the speedup over the CPU) slightly deteriorates with increasing batch size. This effect can be explained by low global memory bus throughput due to irregular asynchronous copying access patterns.

The speedup using the Tesla K20 shown in Figures 10 and 11 is higher than that of the Tesla C2075; nonetheless, it is not as significant as the ratio of the number of their computational cores: the amount of concurrent kernel overlapping is limited by the GPU architecture.

Finally, the speedup of the GPU-based versions over the software-multithreaded version is, as expected, lower than the speedup over single-threaded YFilter.

5.4 The Effect of Several Factors on Different Filtering Platforms

Table 5 summarizes the various factors affecting the single-document filtering throughput for GPUs as well as for software approaches (YFilter) and FPGAs. The FPGA analysis is added using results from our study in [25]. Query dynamicity refers to the capability of updating user profiles (the query set).