ebook img

A Survey of XML Tree Patterns PDF

0.52 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview A Survey of XML Tree Patterns

TKDE,VOL.X,NO.X,SEPTEMBER2011 1 A Survey of XML Tree Patterns MarouaneHachichaand Je´roˆme Darmont, Member, IEEE ComputerSociety Abstract—WithXMLbecominganubiquitouslanguagefordatainteroperabilitypurposesinvariousdomains,efficientlyquerying XMLdataisacriticalissue.Thishasleadtothedesignofalgebraicframeworksbasedontree-shapedpatternsakintothetree- structureddatamodelofXML.Treepatternsaregraphicrepresentationsofqueriesoverdatatrees.Theyareactuallymatched againstaninputdatatreetoansweraquery.Sincetheturnofthetwenty-firstcentury,anastoundingresearchefforthasbeen focusingontreepatternmodelsandmatchingoptimization(aprimordialissue).Thispaperisacomprehensivesurveyofthese topics,inwhichweoutlineandcomparethevariousfeaturesoftreepatterns.Wealsoreviewanddiscussthetwomainfamilies 7 ofapproachesforoptimizingtreepatternmatching,namelypatterntreeminimizationandholisticmatching.Wefinallypresent 1 actualtreepattern-baseddevelopments,toprovideaglobaloverviewofthissignificantresearchtopic. 0 2 IndexTerms—XMLQuerying,DataTree,TreePattern,TreePatternQuery,TwigPattern,Matching,Containment,TreePattern Minimization,HolisticMatching,TreePatternMining,TreePatternRewriting. n a ✦ J 7 1 1 INTRODUCTION ] SINCE its inception in 1998, the eXtended Markup query output. More formally, a TP is matched against B Language(XML)[1]hasemergedasastandardfor a tree-structured database to answer a query [16]. D data representation and exchange over the Internet, Figure1exemplifiesthisprocess.Theupperleft-hand . asmany(mostlyscientific, butnotonly) communities side part of the figure features a simple XML docu- s adopted it for various purposes, e.g., mathematics ment (a book catalog), and the lower left-hand side c [ withMathML[2],chemistrywithCML[3],geography a sample XQuery that may be run against this doc- with GML [4] and e-learning with SCORM [5], just ument (and that returns the title and author of each 1 to name a few. As XML became ubiquitous, effi- book). The treerepresentationsof the XML document v 2 ciently querying XML documents quickly appeared and the associated query are featured on the upper 5 primordial and standard XML query languages were and lower right-hand sides of Figure 1, respectively. 6 developed, namely XPath [6] and XQuery [7]. Re- At the tree level, answering the query translates in 4 searchinitiatives also complemented these standards, matchingtheTPagainstthedatatree.Thisprocesscan 0 to help fulfill user needs for XML interrogation, e.g., beoptimizedandoutputsadatatreethatiseventually . 1 XML algebras such as TAX [8] and XML information translated back as an XML document. 0 retrieval [9]. 7 1 Efficiently evaluating path expressions in a tree- : structureddatamodelsuchasXML’siscrucialforthe v overall performance of any query engine [10]. Initial i X efforts that mapped XML documents into relational r databases queried with SQL [11], [12] induced costly a table joins. Thus, algebraic approaches based on tree- shaped patterns became popular for evaluating XML processing natively instead [13], [14]. Tree algebras indeedprovideaformalframeworkforqueryexpres- sion and optimization, in a way similar to relational algebra with respect to the SQL language [15]. In this context, a tree pattern (TP), also called pat- tern tree or tree pattern query (TPQ) in the literature, models a user query over a data tree. Simply put, a tree pattern is a graphic representation that provides Fig. 1. Tree representation of XML documents and aneasyandintuitivewayofspecifyingtheinteresting queries parts from an input data tree that must appear in Since the early 2000s, a tremendous amount of • Marouane Hachicha and Je´roˆme Darmont are from the Universite´ de research has been based on, focusing on, or exploit- Lyon (ERIC Lyon 2), 5 avenue Pierre Mende`s-France, 69676 Bron ing TPs for various purposes. However, few related Cedex,France. E-mails: [email protected], jerome.darmont@univ- reviews exist. Gou and Chirkova extensively survey lyon2.fr querying techniques over persistently stored XML data [17]. Although the intersection between their TKDE,VOL.X,NO.X,SEPTEMBER2011 2 paper and ours is not empty, both papers are com- 2.1.4 Datatreecollection plementary. We do not address approaches related An XML document considered as a set of fragments to the relational storage of XML data. By focusing maybe modeled asadatatreecollection (alsonamed on native XML query processing, we complement forest in TAX [8]), which is itself a data tree. Gou and Chirkova’s work with specificities such as TP structure, minimization approaches and sample 2.1.5 Datasubtree ′ ′ ′ ′ applications. Moreover, we cover the many match- Given a data tree t = (r,N,E), t = (r ,N ,E ) is a ing optimization techniques that have appearedsince subtree of t if and only if: ′ 2007.Otherrecentsurveysaremuchshorterandfocus 1) N ⊆N; ′ ′ onaparticularissue,i.e.,twigqueries[18]andholistic 2) let e ∈ E be an edge connecting two nodes ′ ′ ′ matching [19]. (n ,n )of t ; there exists anedge e∈E connect- i j Theaimofthispaperisthustoprovideaglobaland ing twonodes(n ,n )of t suchthatn =n′ and i j i i synthetic overviewofmore thantenyearsofresearch ′ n =n . j j about TPs and closely related issues. For this sake, we first formally define TPs and related concepts in 2.1.6 Treepattern Section2. Then, we presentanddiscuss variousalter- A tree pattern p is a pair p=(t,F) where: native TP structures in Section 3. Since the efficiency 1) t=(r,N,E) is a data tree; of TP matching against tree-structured data is central 2) F is a formula that specifies constraints on t’s in TP usage, we review the two main families of nodes. TP matching optimization methods (namely, TP min- Basically, a TP captures a useful fragment of imization and holistic matching approaches), as well XPath [20]. Thus, it may be seen as the translation as tangential but nonetheless interesting methods, in of a user query [21]. Translating an XML query plan Section 4. Finally, we briefly illustrate the use of TPs into a TP is not a simple operation. Some XQueries through actual TP-based developments in Section 5. are written with complex combinations of XPath and We conclude this paper and provide insight about FLWOR expressions, and imply more than one TP. future TP-related research in Section 6. Thus, such queries must be broken up into several TPs. Onlya single XPathexpressioncanbe translated 2 BACKGROUND into a single TP. The more complex a query is, the IN this section, we first formally define all the moreitstranslationintoTP(s)isdifficult[10].Starting concepts used in this paper (Section 2.1). We also from TPs to express user queries in a first stage, and introduce a running example that illustrates through- optimize them in a second stage is a very effective outthe paperhow the TPs andrelatedalgorithms we solution used in XML query optimization [21]. survey operate (Section 2.2). 2.1.7 Treepatternmatching 2.1 Definitions Matching a TP p against a data tree t is a function 2.1.1 XMLdocument f :p→tthatmapsnodesofptonodesoftsuchthat: XML is known to be a simple and very flexible text • structural relationships are preserved, i.e., if format.Itisessentiallyemployedtostoreandtransfer nodes (x,y) are related in p through a parent- text-type data. The content of an XML document child relationship, denoted PC for short or by a is encapsulated within elements that are defined by simpleedge/inXPath(respectively,anancestor- tags [1]. These elements can be seen as a hierarchy descendantrelationship,denotedADforshortor organized in a tree-like structure. byadoubleedge//inXPath),theircounterparts (f(x),f(y)) in t must be related through a PC 2.1.2 XMLfragment (respectively, an AD) relationship too; An XML document may be considered as an ordered • formula F of p is satisfied. set of elements. Any element may contain subele- The output of matching a TP against a data tree is ments that mayin turn contain subelements, etc. One termed a witness trees in TAX [8]. particular element, which contains all the others, is 2.1.8 Treepatternembedding the document’s root. Any element (and its contents) different from the root is termed an XML fragment. Embedding a TP p into a data tree t is a function g : AnXMLfragmentmaybemodeledasafinite rooted, p→tthatmapseachnodeofptonodesoftsuchthat labeled and ordered tree. structural relationships (PC and AD) are preserved. The difference between embedding and matching 2.1.3 Datatree is that embedding maps a TP against a data tree’s An XML document (or fragment) may be modeled as structure only, whereasmatching maps a TP against a a data tree t = (r,N,E), where N is a set of nodes, datatree’sstructureandcontents[22].Intheremainder r ∈N isthe rootoft,andE isasetofedgesstitching of this paper,we use the more generaltermmatching together couples of nodes (n ,n )∈N ×N. when referring to mapping TPs against data trees. i j TKDE,VOL.X,NO.X,SEPTEMBER2011 3 Fig.2. Sampledatatree(a),treepattern(b)andwitnesstree(c) 2.1.9 Booleantreepattern of its authors is Jill. If this author was Gill, the book A Boolean TP b yields a “true” (respectively “false”) would be output. resultwhen successfully (respectivelyunsuccessfully) matched against a data tree t. In other words, b has 3 TREE PATTERN STRUCTURES no output node(s) [23]: it yields a “true” result when WE review in this section the various TP struc- t structurally matches (embeds) p. tures found in the literature. Most have been proposed to support XML algebras (Section 3.1), but 2.2 Runningexample somehavealsobeenintroducedforspecificoptimiza- tion purposes (Section 3.2). We conclude this section Let us consider the data tree from Figure 2(a), which by discussing their features in Section 3.3. represents a collection of books. The root doc gathers books, each described by their title, author(s), editor, year of publication and summary. Data tree nodes are 3.1 Treepatternsinalgebraicframeworks connected by simple edges (/), i.e., PC relationships. The first XML algebras have appeared in 1999 [24], Books do not necessarily bear the same structure. For [25] in conjunction with efforts aiming to define a instance, a summary is not present in all books, and powerful XML query language [26], [27], [28], [29], some books are written by more than one author. [30]. Note that they have appeared before the first TheTPfromFigure2(b)selectsbooktitles,authors, specificationofXQuery,thenowstandardXMLquery and years. Moreover,formula F indicates that author language, which was issued in 2001 [7]. The aim of must be different from Jill. Generally, a formula is a an XML tree algebra is to feature a set of operators booleancombinationofpredicatesonnodetypes(e.g., to manipulateand querydatatrees.Queryresultsare $1.tag =book)and/orvalues(e.g.,$3.value!=“Jill”). also data trees. MatchingthisTPagainstthedatatreefromFigure2(a) outputsthedatatree(orwitnesstree)fromFigure2(c). 3.1.1 TAXtreepattern Only one book is selected from Figure 2(a), since the The Tree Algebra for XML (TAX) [8] is one of the other one (title= “A dummy for a computer”): most popular XML algebras. TAX’s TP preserves PC 1) does not contain a year element; andADrelationshipsfromaninputordereddatatree 2) is written by author Jill, which contradicts for- in output, and satisfies a formula that is a boolean mula F. combination of predicates applicable to nodes. The Finally, the AD relationship $1//$3 in Figure 2(b)’s example from Figure 2(b) corresponds to a TAX TP, TP is correctly taken into account. It is not the struc- save that node relationships are simple edges labeled ture of the book element with title= “A dummy for ADorPCinTAXinsteadofbeingexpressedassingle a computer” that disqualifies it, but the fact that one or double edges. The TAX TP is the most basic TP TKDE,VOL.X,NO.X,SEPTEMBER2011 4 used in algebraic contexts. It has thus been greatly Figure 4 shows an example of APT where the - extended and enhanced. option is employed to extract books written by one author only. This APT matches only the first book 3.1.2 Generalizedtreepattern from Figure 2(a) (title= “Maktub”), since the second has two authors. The idea behind generalized tree patterns (GTPs) is to associate more options with TP edges in order to enrich matching. In TAX, one absent pattern node in thematchedsubtreepreventsittoappearinoutput.A GTP extends the classical TAX TP by creating groups of nodes to facilitate their manipulation, and by en- richingedgestobeextractedbythemandatory/optional matching option [21]. Figure 3 shows anexample of GTP, where the edge connecting the year element to its parent book node Fig.4. Sampleannotatedpatterntree is dotted (i.e., optional), and title and author nodes are connected to the same parent book node with solid (i.e., mandatory) edges. This GTP permits to 3.1.4 Orderedannotatedpatterntree matchallbooksdescribedbytheir title andauthor(s), mandatorily, and that may be described by their year AnAPT,asabasicTAXTP,preservestheorderoutput of publication. Matching it against the data tree from of nodes from the input data tree, whatever node Figure 2(a) outputs both books (“Maktub” and “A order is specified in the TP. In other words, node dummy for a computer”), even though the second order in witness trees is always the same as that of one does not contain a year element, while ensuring the input datatree. No reorderingoption is available, that the title and author elements exist in the matched though it is essential when optimizing queries. To book subtrees. circumvent this problem, the APT used in TLC’s SelectandJoinoperatorsissupplementedwithan order parameter (ord) based on an order specification (O-Spec) [13]. Four cases may occur: 1) empty: output node order is unspecified; O-Spec is empty; 2) maintain: input node order is retained; O-Spec duplicates input node order; 3) list-resort: nodes are wholly sorted with respect Fig.3. Samplegeneralizedtreepattern to O-Spec; input node order is forsaken; 4) list-add: nodes are partially sorted with respect Finally, note that the GTP from Figure 3, unlike the to O-Spec. TP from Figure 2(b), does not include an author != For example, associating the O-Spec specification ”Jill” clause in its formula. Retaining this predicate [author, title] to the APT from Figure 4 permits to would not allow the second book (title = “A dummy select books ordered by author first, and then title. for a computer”) to be matched by this GTP, since Graphically, the author node would simply appear on the booleancombination of formulaelements (related thelefthandsideofthewitnesstreeandthetitlenode with and) cannot be verified. on the right hand side. 3.1.3 Annotatedtreepattern 3.2 Treepatternsusedinoptimizationprocesses A feature, more than a limitation, of the TAX TP is Many TP matching optimization approaches extend that a set of subelements from the input data tree the basic TP (Figure 2(b)) to allow a broader range may all appear in the output data tree. For example, of queries. In this section, we survey the TPs that a TP with a single author node can match against a introduce new, interesting features with respect to book subtree containing several author subelements. those already presented in Section 3.1. Annotated pattern trees (APTs) from the Tree Logical Class(TLC)algebra[31] solve thisproblembyassoci- 3.2.1 Globalquerypatterntree ating matching specifications to tree edges. Matching options are: A global query pattern tree (G-QPT) is constructed • +: one to many matches; from a set of possible ordered TPs proposed for the • -: one match only; samequery[32].First,arootiscreatedfortheG-QPT. • *: zero to many matches; Then, each TP is merged with the G-QPT as follows: • ?: zero or one match. • the TP root is merged with the G-QPT root; TKDE,VOL.X,NO.X,SEPTEMBER2011 5 Fig.5. Sampleconstructionofglobalquerypatterntree • TPnodesaremergedwith G-QPT nodeswith re- spect to node ordering and PC-AD relationships. For example, TPs p and p from Figure 5 together 1 2 answer query “select title, authors and year of publi- cation of books” and merge into G-QPT g. 3.2.2 Twigpattern Twig patterns are designed for directed and rooted graphs [33]. Here, XML documents are considered Fig.7. Sampletreepatternwithlogicaloperatornodes having a graph structure, thanks to ID references connecting nodes. Twig patterns respectPC-ADnode 3.2.4 Nodedegreeandoutputnodespecification relationshipsand expressnode constraints in formula F (called twig condition here) with operators such as InaTP,thedegreeofatreenodex,denoteddegree(x), equals to, contains, ≥, etc. In Figure 6, we represent represents its number of children [35]. Miklau and the twig pattern that is equivalent to the TP from Suciu define a maximal value for degree(x) within Figure 2(b). a path expression optimization algorithm. This ap- proach is mainly designed to check containment and equivalence of XPath fragments (Section 4.1.1). Then, translating XPath expressions into TPs requires the identification of an output node marked with a wild- card (∗) in the TP. For example, suppose that we are only interested in book titles from Figure 2(a). Then, we would simply associate a ∗ symbol with the title node in the TP from Figure 2(b). Note that the other nodesmustremainintheTPiftheyappearinformula Fig.6. Sampletwigpattern F and thus help specify the output node. Moreover, distance-bounding twigs (DBTwigs) ex- 3.2.5 Extendedformula tend twig patterns to limit the large number of an- Lakshmanan et al. further specify how formula F is swersresultingfrommatchingaclassicaltwigpattern expressed in a TP [22]. They define F as a combina- against a graph [33]. This method is called Filtering tion of tag constraints (TCs), value-based constraints + Ranking. Its goal is to filter data to be matched (VBCs) and node identity constraints (NICs). TCs by indicating the length of paths corresponding to specify constraints on tag names, e.g., node.tag = descendantedges.DBTwigsalsopermittoindicatethe “book”. VBCs specify selection and join constraints number (0, 1...) of nodes to be matched. All these pa- on tag attributes and values using the operators =, rametersareindicatedin the graphicalrepresentation 6=, ≤, ≥, > and <. NICs finally determine whether of DBTwigs. two TP nodes are the same (≈) or not. In addition to TCs, VBCs and NICs, wildcards (∗) are associated 3.2.3 Logicaloperatornodes with untagged nodes. Izadi et al. include in their formal definition of TPs For example, in Figure 8, we extend the TP from a set O of logical operator nodes [34]. ∧, ∨ and Figure 2(b) with a VBC indicating that books to be ⊕ represent the binary AND, OR, and XOR logical selected must be of type “Novel”. We suppose here operators, respectively. ¬ is the unary NOT operator. that a type is associated with each book node, that These operator nodes go further than GTPs’ manda- the first book (title = “Maktub”) is of type “Novel”, tory/optional edge annotations by specifying logical and that the second book (title = “A dummy for a relationships between their subnodes. For instance, computer”) is of type “Technical book”. Then, the Figure7featuresaTPthatselectsbooktitlesandeither output tree would of course only include the book a set of authors or an editor. entitled “Maktub”. TKDE,VOL.X,NO.X,SEPTEMBER2011 6 Note that node reordering could be classified as a matching capability, but the importance of ordering witness trees leads us to consider it separately. • Expressiveness:Expressivenessstateshowalogical TP, i.e., a TP under its schematic form, translates into the corresponding physical implementation. The physical form of a TP may be either ex- Fig. 8. Sample tree pattern with a value-based con- pressed with an XML query language (XQuery, straint generally), or an implementation within an XML database offering query possibilities. Using a TP 3.2.6 Extendedtreepattern for other purposes but querying, e.g., in opti- mization algorithms, is not accounted toward Extended TPs complement classical TPs with a nega- expressiveness. Note that a TP that cannot be tion function, wildcards and an order restriction [36]. expressed in physical form is usually considered For example, in Figure 9, the negative function (de- useless. noted ¬) helps specify that we look for an edited • Supported optimizations: TPs are an essential ele- book, i.e., with no author node. The wildcard node ment of XML querying [16], [37]. Hence, many ∗ can match any single node in a data tree. Note that optimization approaches translate XML queries the wildcard has a different meaning here than in into TPs, optimize them, and then translate them Section 3.2.4, where it denotes an output node, while back into optimized queries. Optimizing a TP output node names are underlined in extended TPs. increases its matching power. This criterion ref- Finally, the order restriction denoted by a < in a box erences the different kinds of optimizations sup- means that children of node book are ordered, i.e., ported by a given TP. title must come before the ∗ node. TP comparison with respect to matching power, node reordering capability, expressiveness and sup- ported optimizations follows (Sections 3.3.1, 3.3.2, 3.3.3 and 3.3.4, respectively), and is synthesized in Section 3.3.5. 3.3.1 Matchingpower ThemostbasicTPisTAX’s[8].Thus,matchingaTAX TP against a data tree collection is very simple: it is onlybasedontreestructureandnodevaluesspecified in formula F. If F is absent, a TAX TP matches all possible subtrees from the input data tree. Fig.9. Sampleextendedtreepattern Associating a mandatory/optional status to GTP edges [21] increases the number of matched subtrees in output. Only one absent edge in a TAX TP with respect to a subtree containing all the other edges 3.3 Discussion in this TP prevents matching, while the candidate In this section, we discuss and compare the TPs subtree is “almost perfect”. If this edge is labeled surveyed in Sections 3.1 and 3.2. For this sake, we as optional in a GTP, matching becomes successful. introduce four comparison criteria. Even more flexibility is achieved by logical operator • Matching power: Matching encompasses two di- nodes [34], which allow powerful matching options, mensions. Structural matching guarantees that especially when logical operators are combined. only subtrees of the input data tree that map the With APTs, matching precision is further improved TP are output. Matching by value is verifying through edge annotations [31]. Annotations in APTs formula F. We mean by matching power all the force the matching process to extract data according matching options (edge annotations, logical op- to their nature. For example, a classical TAX TP such erator nodes, formula extensions...) beyond these as the one fromFigure 2(b), extractingbooks by titles basics. Improving matching power helps filter andauthors,doesnotallowcontrollingthenumberof data more precisely. authors. APTs help select only books with, e.g., more • Node reordering capability: Order is important in than one author, as in the example from Figure 4. XML querying, thus modern TPs should be able Unfortunately, annotations only allow two maximum to alter it [13]. We mean by node reordering cardinalities: one or several. DBTwigs [33] comple- capability the ability of a TP to modify output ment APTs in this respect, by allowing to choose the nodeorderwhenmatchingagainstanydatatree. exact number of nodes to be matched. Node degree TKDE,VOL.X,NO.X,SEPTEMBER2011 7 (number of children) [35] or the negation function of 3.3.4 Supportedoptimizations extended TPs [36] can also be exploited for this sake. Approaches purposely proposing TPs (i.e., algebraic Finally, up to now, we focused on TP structure approaches) are much fewer than approaches using because few differences exist in formulas. Graphi- TPs to optimize XML querying. Moreover, even alge- cally, twig patterns associate constraints with nodes braic approaches such as TAX do address optimiza- directly in the schema [33] while in the TAX TP and tionissues.SinceXMLqueriesareeasytomodelwith its derivatives, they appear in formula F, but this TPs, researchers casually translate any XML query in is purely cosmetic. Only TCs, VBCs and NICs help TPs, which are then optimized and implemented in further structure formula F [22]. algorithms, XML queries or any other optimization framework. 3.3.2 Nodereorderingcapability Hence, the biggest interest of the TP community Despite many studies model XML documents as un- lies in enriching and optimizing matching. Matching ordered data trees, order is essential to XML query- opportunities offered by TAX TPs, optional edges of ing [13]. In XQuery, users can (re)order output nodes GTPs, annotations, ordering specification and dupli- simply through the Order by clause inherited from cate elimination of APTs, and extended TPs aim to SQL[15].Hence,modernTPsshouldincludeordering achievemoreresultsand/orbetterprecision.GTPand features [16], [37]. APT matching characteristicsprove their efficiencyin However,among TAXTPderivatives,only ordered TIMBER [38]. APTsandextendedTPsfeaturereorderingfeatures.In Finally, satisfiability is an issue related to contain- G-QTPs, preorders associated with ordered TP nodes ment, a concept used in minimization approaches. help determine output order [32]. Unfortunately, or- Satisfiability means that there exists a database, con- der is disregarded in all other TP proposals. sistent with the schema if one is available, on which 3.3.3 Expressiveness the user query, represented by a TP, has a non-empty answer [22], [40], [41], [42]. TAX TPs and their derivatives (GTPs and APTs) do not translate into an XML query language, but they 3.3.5 Synthesis are implemented, through the TLC physical algebra [31], in the TIMBER XML database management sys- WerecapitulateinTable1thecharacteristicsofallTPs tem [38]. TIMBER permits to store XML in native surveyedinthispaperwithrespecttoourcomparison format and offers a query interface supporting both criteria. classical XQuery fragments and TAX operators. Note Insummary,matchingoptionsinTPsarenumerous thatTAXoperatorsincludeaGroup byconstructthat and it would probably not be easy to set up a metric has no equivalent in XQuery. to rank them. The best choice, then, seems to select a TranslatingTAXTPsforXMLqueryingfollowsnine TP variant that is adapted to the user’s or designer’s steps: needs. However, designing a TP that pulls together 1) identify all TP elements in the FOR clause; most of the options we survey is certainly desirable 2) push formula F’s predicates into the WHERE to maximize matching power. clause; With respect to output node ordering, we con- 3) eliminate duplicates with the help of the sider the ord order parameter introduced in APTs DISTINCT keyword; the simplest and most efficient approach. A list of 4) evaluate aggregate expressions in LET clauses; ordered elements can indeed be associated with any 5) indicate tree variables to be joined (join condi- TP,whateverthenatureoftheinputdatatree(ordered tions) via the WHERE clause; or unordered). However, it would be interesting to 6) enforce any remaining constraint in the WHERE generalizetheordspecificationtoother,possiblymore clause; complex,operatorsbesideSelectandJoin,thetwo 7) evaluate RETURN aggregates; only operators benefiting from ordering in APTs. 8) order output nodes with the help of the ORDER Expressivenessisacomplexissue.TranslatingXML BY clause; queries into TPs is indeed easier than translating TPs 9) projectonthe elements indicatedin the RETURN back into an XML query plan. XQuery, although the clause. standard XML query language, suffers from limita- Similarly, Lakshmanan et al. test the satisfiability of tions suchasthe lackof a Group by construct. Thus, TPs translated from XPath expressions and XQueries, itismoreefficienttoimplementTPsandexploitthem and then express them back in XQuery and evaluate to enrich XML querying in an ad-hoc environment them within the XQEngine XQuery engine [39]. The such as TIMBER’s. We think that the richer the pat- other TPs we survey are used in various algorithms tern, with matching options, ordering specifications, (containment and equivalence testing, TP rewriting, possibility to associate with many operators (and frequent TP mining...). Hence, their expressiveness is other options if possible), the more efficient querying not assessed. is, in terms of user need satisfaction. TKDE,VOL.X,NO.X,SEPTEMBER2011 8 TABLE1 Synthesisoftreepatterncharacteristics TP Matching features Reordering Expressiveness Supportedoptimizations TAXTP[8] Basic No TIMBER Matching GTP[21] Mandatory/optionaledges No TIMBER Matching APT[31] Edgecardinality No TIMBER Matching Ordered APT[13] Orderspecification Yes TIMBER Matching G-QPT[32] SetofTPs Yes N/A Unspecified Twig pattern[33] Graphstructure No N/A XMLgraphquerying DBTwig [33] Filtering+Ranking No N/A XMLgraphquerying +Matchingcardinality Izadi Logicaloperatornodes No N/A Matching et al.’sTP[34] +Potentialtargetnode Miklau and Nodedegree No N/A Containmentand Suciu’s TP[35] +Outputnode equivalencetesting Lakshmanan Tagconstraints et al.’sTP[22] +Value-basedconstraints No XQuery Satisfiability testing +Nodeidentityconstraints ExtendedTP Negationfunction Yes N/A Matching [36] +Wildcards Finally, minimization, relaxation, containment, equivalent TP of the smallest size. Formally, given a equivalence and satisfiability issues lead the TP data tree t and a TP p of size (i.e., number of nodes) community to optimize these tasks. However, most n, let S = {p } be the set of TPs of size n contained i i TPsusedinthiscontextarebasic,unlike TPstargeted in p (p ⊆p and n ≤n ∀i). Minimizing p is finding a i i at matching optimization, which allows optimizing TP p ∈S of size n such that: min min XML queries wherein are translated optimized TPs. • pmin ≡p when matched against t; In short, the best-optimized TP must be minimal, • nmin ≤ni ∀i. satisfiable, and offer as many matching options as Moreover,asetC of integrityconstraints (ICs)may possible. be integrated into the problem. There are two forms of ICs: 4 TREE PATTERN MATCHING OPTIMIZATION 1) each node of type A (e.g., book) must have a child (respectively, descendant) of type B (e.g., THE aim of TPs is not only to provide a graphi- author), denoted A→B (respectively, A⇒B); cal representation of queries over tree-structured 2) each node of type A (e.g., book) must have a data, but also and primarily, to allow matching descendantof typeC (e.g.,name),knowing that queries against data trees. Hence, properly optimiz- C is a descendant of B (e.g, author), i.e., A⇒C ing matching is primordial to achieve good query knowing that B ⇒C. response time. Since TP minimization relies on the concepts of In this section, we present the two main families containment and equivalence, we first detail how of approachesfor optimizing matching, namely mini- containmentandequivalencearetested(Section4.1.1). mization methods (Section 4.1) and holistic matching Then, we review the approaches that address the approaches (Section 4.2). We also survey a couple of TP minimization problem without taking ICs into otherspecificapproaches(Section4.3),beforeglobally account (Section 4.1.2), and the approaches that do discussing and comparing all TP matching optimiza- (Section 4.1.3). tion methods (Section 4.4). 4.1.1 Containmentandequivalencetesting 4.1 Treepatternminimization Let us first further formalize the definitions of con- The efficiency of TP matching depends a lot on the tainment and equivalence. Containment of two TPs size of the pattern [16], [20], [43], [44]. It is thus p1 and p2 is defined as a node mapping relationship essential to identify and eliminate redundant nodes h:p1 →p2 such that [16], [35]: inthe patternanddosoasefficientlyaspossible [16]. • h preserves node type, i.e., ∀x ∈ p1, x and h(x) This process is called TP minimization. must be of the same type. Moreover, if x is an All research related to TP minimization is based output node, h(x) must be an output node too; on a pioneer paper by Amer-Yahia et al. [16], who • h preserves node relationships, i.e., if two nodes formulate the problem as follows: given a TP, find an (x,y) are linked through a PC (respectively, AD) TKDE,VOL.X,NO.X,SEPTEMBER2011 9 Fig.10. Simpleminimizationexample relationshipinp ,(f(x),f(y))mustalsobelinked from p that bear the same type as x, but are different 1 through a PC (respectively, AD) relationship in from x. For each leaf x ∈ p, CIM searches for nodes p . of the same type as x in the set of descendants 2 Note that function h is very similar to function of images(parent(x)), where parent(x) is x’s parent f that matches a TP to a data tree (Section 2.1.7), nodeinp.Ifsuchnodesarefound,xisredundantand but here, h is a homomorphism between two TPs thus deleted from p. CIM then proceeds similarly, in [45]. Moreover, containment may be tested between a bottom-up fashion, on parent(x), until all nodes in a TP fragment and this TP as a whole. h is then an p (exceptoutput nodes that must always be retained) endomorphism. have been checked for redundancy. Finally, equivalence between two TPs is simply con- For example, let us consider TP p from Figure 10. sidered as two-way containment [16], [20]. Formally, Without regarding node order, let us check leaf node letp andp betwoTPs.p ≡p ifandonlyifp ⊆p $5 for redundancy. parent($5) = $4,images($4) = 1 2 1 2 1 2 and p ⊆p . {$2},descendants($2) = {$3}, where descendants(x) 2 1 Miklau and Suciu show that containment testing is is the setof descendantsof node x. $3bearsthe same intractable in the general case [35]. They nonetheless type as $5 (author), thus $5 is deleted. Similarly, $4 is propose an efficient algorithm for significant particu- then found redundant and deleted, outputting pmin lar cases, namely TPs with AD edges and an output in Figure 10. Further testing $1, $2 and $3 does not node. This algorithm is based on tree automata, the detect any new redundant node. Moreover, deleting objectivebeingtoreducetheTPcontainmentproblem another node from pmin would breakthe equivalence to that of regular tree languages. In summary, given between p and pmin. Hence, pmin is indeed minimal. two TPs p and p , to test whether p ⊆p : All minimization algorithms subsequent to CIM 1 2 1 2 retain its principle while optimizing complexity. One • p1’s nodes and edges are matched to determin- strategyistoreplaceimagesbymoreefficientrelation- istic finite tree automaton A ’s states and transi- 1 ships, i.e., coverage [48], [49], also called simulation tions, respectively; [43]. Let p be the TP to minimize, and x,y ∈p two of • p2’s nodes and edges are matched to alternating its nodes. If y covers x (x6 y) [50]: finite tree automaton A ’s states and transitions, 2 respectively; • the types of x and y must be identical; • if lang(A1) ⊆ lang(A2), then p1 ⊆ p2, where • if x has a child (respectively descendant) node lang(A )isthelanguageassociatedwithautoma- x′, y must have a child (respectively descendant) i ton A (i={1,2}). node y′ such that x′6 y′. i Similar approaches further test TP containment cov(x) denotes the set of nodes that cover x. cov(x)= under Document Type Definition (DTD) constraints x if x is an output node. Then, the minimization [46], [47]. For instance, Wood exploits regular tree processtests node redundancyin a top-down fashion grammars(RTGs)toachievecontainmenttesting[47]. as follows: ∀x ∈ p, ∀x′ ∈ children(x) (respectively Let D be a DTD and G and G RTGs corresponding descendants(x)), if ∃x′′ ∈ children(x) (respectively 1 2 to p and p , respectively. Then p ⊆p if and only if descendants(x)) such that x′′ ∈ cov(x′), then x′ and 1 2 1 2 (D∩G )⊆(D∩G ). the subtree rooted in x′ are deleted. children(x) (re- 1 2 spectivelydescendants(x))isthesetofdirectchildren 4.1.2 Unconstrainedminimization (respectively, all descendants) of node x. Another strategy is to simply prune subtrees re- The first TP minimization algorithm, Constraint In- dependent Minimization (CIM) [16], eliminates re- cursively [51]. The subtree rooted at node x, denoted dundant nodes from TP p by exploiting the concept subtree(x), is minimized in two steps: of images. Let there be a node x ∈ p. Its list of 1) ∀x′,x′′ ∈ children(x), if subtree(x′) ⊆ images, denoted images(x), is composed of nodes subtree(x′′) then delete subtree(x′); TKDE,VOL.X,NO.X,SEPTEMBER2011 10 2) ∀x′ ∈ children(x) (remaining children of x), • each node of type A (e.g., book) that has a child minimize subtree(x′). of type B (e.g., editor) must also have a child of A variant proceeds similarly, by first searching in a type C (e.g, address), i.e., if A→B then A→C. TP p for any subtree p redundant with sp, where sp Under FBST constraints, several minimal TPs can be i is p stripped of its root [20]. Formally, the algorithm achieved (vs. one only under FT constraints), which tests whether p − sp ⊆ p . Redundant subtrees are allows further optimizations of the chase augmenta- i i removed. Then, the algorithm is recursively executed tion and simulation-based minimization processes. on unpruned subtrees sp . i 4.2 Holistictreepatternmatching 4.1.3 Minimizationunderintegrity constraints While TP minimization approaches (Section 4.1) Taking ICs into account in the minimization process wholly focus on the TP side of the matching process, is casually achieved as follows. holistic matching (also called holistic twig join [19]) algorithms mainly operate on minimizing access to 1) Augment the TP to minimize with nodes the input data tree when performing actual match- and edges that represent ICs. This is casually ing operations. The initialbinaryjoin-based approach achieved with the classical chase technique [52]. for matching proposed by Al-Khalifa et al. [54] in- For instance, if we had a book → author IC deed produceslargeintermediate results. Holistic ap- (nodes of type “book” must have a child of proaches casually optimize TP matching in two steps type “author”), we would add one child node [55]: of type author to each of the nodes $2 and $4 from Figure 10. Note that augmentation must 1) labeling: assign to each node x in the data tree t only apply to nodes from the original TP, and an integer label label(x) that captures the struc- not to nodes previously added by the chase. ture of t (Section 4.2.1); 2) Run any TP minimization algorithm, without 2) computing: exploit labels to match a twig pat- testing utilitarian nodes introduced in step #1 tern p against t without traversing t again (Sec- for redundancy, so that ICs hold. tion 4.2.2). 3) Delete utilitarian nodes introduced in step #1. Moreover, recent proposals aim at reducing data tree size by exploiting structural summaries in combina- This process has been applied onto the CIM algo- tion to labeling schemes. We review them in Sec- rithm, to produce ACIM (Augmented CIM) [16], as tion 4.2.3. well as on its coverage-based variants [43], [48], [53]. Let us finally highlight that, given the tremendous SincethesizeofanaugmentedTPcanbemuchlarger number of holistic matching algorithms proposed in than that of the original TP, an algorithm called Con- theliterature,itisquiteimpossibletoreviewthemall. straint Dependant Minimization (CDM) also helps Hence,we aiminthe following sectionsatpresenting identify and prune all TP nodes that are redundant themostinfluential.Theinterestedreadermayfurther under ICs [16]. CDM actually acts as a filter before ACIM is applied. CDM considers a TP leaf x′ of type refer to Grimsmo and Bjørklund’s survey [19], which T′ redundant and removes it if one of the following uniquely focuses on and introduces a nice history of holistic approaches. conditions holds. • parent(x′) = x (respectively ancestor(x′) = x) of 4.2.1 Labelingphase typeT andthereexistsanICT →T′(respectively The aim of datatree labeling schemes is to determine T ⇒T′). the relationship (i.e., PC or AD) between two nodes • parent(x′) = x (respectively ancestor(x′) = x) of a tree from their labels alone [55]. Many labeling of type T, ∃x′′/parent(x′′) = x (respectively schemes have been proposed in the literature. We ancestor(x′′) = x) of type T′′, and there exists particularly focus in this section on the region encod- an IC T′ = T′′ (respectively, there exists one of ing (or containment) and the Dewey ID (or prefix) the ICs T′′ ⇒T′ or T′ =T′′). labeling schemes that areused in holistic approaches. An alternative simulation-based minimization algo- However, other approaches do exist, based on a tree- rithm also includes a similar prefilter, along with an traversal order [56], prime numbers [57] or a combi- augmentation phase that adds to the nodes of cov(x) nation of structural index and inverted lists [58], for their ancestors instead of using the chase [43]. instance. Finally, thescopeof ICshasrecentlybeenextended Theregionencodingscheme[59]labelseachnodex to include not only forward and subtype (FT) con- inadatatreetwitha3-tuple(start,end,level),where straints (as defined in Section 4.1), but also backward start (respectively, end) is a counter from the root of and sibling (BS) constraints [44]: t until the start (respectively, the end) of x (in depth • each node of type A (e.g., author) must have first),andlevelisthedepthofxint.Forexample,the a parent (respectively, ancestor) of type B (e.g., datatreefromFigure2(a)islabeledinFigure11,with book), i.e., A←B (respectively, A⇐B); the region encoding scheme label indicated between

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.