1 Quasi-SLCA based Keyword Query Processing over Probabilistic XML Data JianxinLi, ChengfeiLiu, RuiZhou andJeffrey Xu Yu,Member, IEEE, Abstract—Theprobabilisticthresholdqueryisoneofthemostcommonqueriesinuncertaindatabases,wherearesultsatisfying thequerymustbealsowithprobabilitymeetingthethresholdrequirement.Inthispaper,weinvestigateprobabilisticthreshold keywordqueries(PrTKQ)overXML data,whichisnotstudiedbefore.Wefirstintroducethenotionofquasi-SLCAanduseit 3 to represent results for a PrTKQ with the consideration of possible world semantics. Then we design a probabilisticinverted 1 0 (PI)indexthatcanbeusedto quicklyreturnthequalifiedanswersandfilterouttheunqualifiedonesbasedonourproposed 2 lower/upperbounds. After that, we propose two efficient and comparable algorithms: Baseline Algorithm and PI index-based Algorithm.Toacceleratetheperformanceofalgorithms,wealsoutilizeprobabilitydensityfunction.Anempiricalstudyusingreal n andsyntheticdatasetshasverifiedtheeffectivenessandtheefficiencyofourapproaches. a J IndexTerms—ProbabilisticXML,ThresholdKeywordQuery,ProbabilisticIndex. 1 F 1 ] B D .1 INTRODUCTION or contents. XIRQL [14] supports keyword search in s XML based on structured queries. However, users cUncertainty is widespread in many web applications, [ may not have the knowledge of the structure of XML such as information extraction, information integra- data or the query language. As such, supporting 1tion, web data mining, etc. In uncertain database, pure keyword search in XML has attracted extensive vprobabilistic threshold queries have been studied ex- 2 research.The LCA-basedapproacheswill identifythe tensively where all results satisfying the queries with 6 LCA node first, which contains everykeyword under probabilities equal to or larger than the given thresh- 3 its subtree at least once [15], [16], [17], [18], [19], [20], 2old values are returned [1], [2], [3], [4], [5]. However, [21]. Since the LCA nodes sometimes are not very .all of these works were studied based on uncertain 1 specific to users’ query, Xu and Papakonstantinou relational data model. Because the flexibility of XML 0 [20] proposed the concept of SLCA (smallest lowest 3data model allows a natural representation of un- common ancestor), where a node v is regarded as an 1certain data, uncertain XML data management has v:become an important issue and lots of works have SLCA if (a) the subtree rooted at the node v, denoted as T (v), contains all the keywords, and (b) there ibeen done recently. For example, many probabilistic sub X does not exist a descendant node v′ of v such that XML data models were designed and analyzed [6], r[7],[8],[9],[10].Basedondifferentdatamodels,query Tsub(v′) contains all the keywords. In other words, if a anodeisanSLCA,thenitsancestorswillbedefinitely evaluation [7], [10], [11], [12], [13], algebraic manipu- excluded from being SLCAs. The SLCA semantics of lation[8]andupdates[6],[10]werestudied.However, model keyword searchresult on a deterministic XML mostoftheseworksconcentratedonstructuredquery tree are also applied [22], [16], [19]. processing, e.g., twig queries. In this paper, we pro- Based on the SLCA semantics, [23] discussed top-k pose and address a new interesting and challenging keyword search over a probabilistic XML document. problem of Probabilistic Threshold Keyword Query Given a keyword query q and a probabilistic XML (PrTKQ) over uncertain XML databases based on document (PrXML), [23] returned the top k most quasi-SLCAsemantics, which is not studied beforeas relevantSLCAresults(PrSLCAs)basedontheirprob- far as we know. abilities. Different from the SLCA semantics over de- In general, an XML document could be viewed as terministicXMLdocuments,anodevbeingaPrSLCA a rooted tree, where each node represents an element can only exclude its ancestors from being PrSLCAs by a probability. This probability can be calculated • Jianxin Li, Chengfei Liu and Rui Zhou are with the Faculty of by aggregating the probabilities of the deterministic Information & Technology, Swinburne University of Technology, Australia.{jianxinli, cliu,rzhou}@swin.edu.au documents (called possible worlds) W implied in the PrXML where v is an SLCA in each deterministic • Jeffrey Xu Yu is with the Department of Systems Engineering & document W. Engineering Management, The Chinese University of Hong Kong, ∈ [email protected] However, it is not suitable to directly utilize the PrSLCAsemanticsforevaluatingPrTKQsbecausethe PrSLCA semantics are too strong. In some applica- 2 tions, users tend to be confident with the results to probabilities up to 0.40. In this condition, no results be searched, so relatively high probability threshold can be returned to the users. values may be given. Consequently, it is very likely However, from Figure 1, we can see that if the that no qualified PrSLCA results will be returned. To probabilities of library and area could contribute to 2 solvethisproblem,weproposeandutilizeaso-called their parent nodes, area and sub region would be- 1 2 quasi-SLCAsemanticstodefinetheresultsofaPrTKQ come quasi-SLCA results. Unfortunately, the PrSLCA by relaxing the semantics of PrSLCA with regards to semantics exclude them from being results. This mo- a given threshold value, i.e., besides the probability tivatesustorelaxthe PrSLCAsemanticsto the quasi- of v being a PrSLCA in PrXML, the probability of a SLCA semantics. According to the quasi-SLCA se- node v beingaquasi-SLCAinPrXMLmayalsocount mantics, the probabilities of area and sub region 1 2 the probability of v′s descendants being PrSLCAs in being the quasi-SLCA results are 0.44 and 0.56 with PrXML if their probabilities are below the specified the contributions of their child nodes library and threshold value. In other words, a node v being a area , respectively. As such, area and sub region 2 1 2 quasi-SLCA will exclude its ancestors from being are deemed as the interesting places to be returned. quasi-SLCAs by a probability only when this prob- Given a PrTKQ, our problem is to quickly com- ability is no less than the given threshold; otherwise, puteallthequasi-SLCAnodeswiththeirprobabilities thisprobabilitywillbeincludedforcontributingtoits meeting the threshold requirement. For users issuing ancestors.ThisisdifferentfromthePrSLCAsemantics PrTKQs, they generally expect to see the complete that excludes the probability contribution from child quasi-SLCAanswersetasearlyaspossibleanddonot nodes. needtoknowtheaccurateprobabilityofeachanswer, which motivates us to design a Probabilistic Inverted a(cid:13)((cid:13) (cid:13)no(cid:13)(cid:13)i(cid:13)ge(cid:13)(cid:13)r(cid:13) )(cid:13) 1(cid:13) (PI)indexandPI-basedefficientalgorithmforquickly k(cid:13) (cid:13)-(cid:13) (cid:13)dr(cid:13)a(cid:13) (cid:13)za(cid:13)h(cid:13) (cid:13) 1(cid:13) k(cid:13) (cid:13)-(cid:13) (cid:13)gn(cid:13) (cid:13)i(cid:13)d(cid:13)l(cid:13)i(cid:13)ub(cid:13)(cid:13) identifying quasi-SLCA result candidates. 2(cid:13) 1D(cid:13)N(cid:13)I(cid:13)(cid:13) We summarize the contributions of this paper as no(cid:13)(cid:13)i(cid:13)ge(cid:13)(cid:13)r_(cid:13)b(cid:13)u(cid:13)s(cid:13)(cid:13) 0(cid:13).(cid:13)1(cid:13) 5(cid:13).(cid:13)0(cid:13) a(cid:13)((cid:13) 2(cid:13))(cid:13) 1(cid:13) 8(cid:13).(cid:13)0(cid:13) c(cid:13)((cid:13) (cid:13)da(cid:13)o(cid:13)(cid:13)r(cid:13) 9(cid:13))(cid:13) follows: no(cid:13)(cid:13)i(cid:13)ge(cid:13)(cid:13)r_(cid:13)b(cid:13)u(cid:13)s(cid:13)(cid:13) 2(cid:13)dr(cid:13)a(cid:13) (cid:13)za(cid:13)h(cid:13) (cid:13) (cid:13) .(cid:13) .(cid:13) .(cid:13) • Based on our proposed quasi-SLCA result defi- 2D(cid:13)N(cid:13)I(cid:13)(cid:13) a(cid:13)((cid:13) (cid:13) )(cid:13) 3(cid:13) nition, we study probabilistic threshold keyword 3D(cid:13)N(cid:13)I(cid:13)(cid:13) 0(cid:13).(cid:13)1(cid:13) 8(cid:13).(cid:13)0(cid:13) ae(cid:13)(cid:13)ra(cid:13) (cid:13) a(cid:13)((cid:13) (cid:13) )(cid:13) 8(cid:13).(cid:13)0(cid:13) 0(cid:13).(cid:13)1(cid:13) query over uncertain XML data, which satisfies 1(cid:13) 4g(cid:13)n(cid:13) (cid:13)i(cid:13)pp(cid:13)o(cid:13)h(cid:13)s(cid:13)(cid:13) ae(cid:13)(cid:13)ra(cid:13) (cid:13) a(cid:13)(2(cid:13)(cid:13) (cid:13) 5(cid:13))(cid:13) the possible world semantics. To the best of our c(cid:13)((cid:13) (cid:13)r(cid:13)e(cid:13)t(cid:13)ne(cid:13)(cid:13)c(cid:13) c4(cid:13)(cid:13))((cid:13)(cid:13) (cid:13)da(cid:13)o(cid:13)(cid:13)r(cid:13) 5(cid:13))(cid:13) knowledge, this problem has not been studied gn(cid:13) (cid:13)i(cid:13)d(cid:13)l(cid:13)i4(cid:13)Du(cid:13)Nb(cid:13)(cid:13)I(cid:13)(cid:13)(cid:13) (cid:13) .(cid:13) .(cid:13) .(cid:13) dr(cid:13)a(cid:13) (cid:13)za(cid:13)h(cid:13) (cid:13) (cid:13) .(cid:13) .(cid:13) .(cid:13) 5(cid:13).(cid:13)0(cid:13) 4(cid:13).(cid:13)0(cid:13) 5(cid:13).(cid:13)0(cid:13) XUM(cid:13)(cid:13)(cid:13) 1(cid:13).(cid:13)0(cid:13) before. 3(cid:13).(cid:13)0(cid:13) 3(cid:13).(cid:13)0(cid:13) • We design a probabilistic inverted (PI) index dr(cid:13)a(cid:13) (cid:13)za(cid:13)h(cid:13)c(cid:13) (cid:13)(cid:13)(.(cid:13)(cid:13) (cid:13).d(cid:13)a(cid:13).o(cid:13)(cid:13)(cid:13)rcd(cid:13)(cid:13)r(cid:13)((cid:13)a(cid:13) (cid:13)z(cid:13)ya(cid:13)(cid:13)rh(cid:13)a(cid:13)(cid:13) (cid:13)r(cid:13)1b(cid:13)(cid:13).)(cid:13)(cid:13)g(cid:13)i.n(cid:13)(cid:13)(cid:13)l(cid:13).i(cid:13)(cid:13)(cid:13)d(cid:13)l(cid:13)ci(cid:13)((cid:13)u(cid:13)b(cid:13) (cid:13)(cid:13) t(cid:13)(cid:13)e.2k(cid:13)(cid:13)(cid:13)).r(cid:13)(cid:13)(cid:13)a(cid:13)mg.(cid:13)(cid:13)n(cid:13)(cid:13)(cid:13)i(cid:13)d(cid:13)l(cid:13)i(cid:13)ub(cid:13)c(cid:13) (cid:13)3(cid:13)((cid:13))(cid:13).(cid:13)(cid:13) (cid:13)k.(cid:13)r(cid:13)c.a(cid:13)(cid:13)(cid:13)(p(cid:13)(cid:13)(cid:13) (cid:13)dtr(cid:13)(cid:13)ra(cid:13)o(cid:13)(cid:13)zp(cid:13)a(cid:13)(cid:13)h(cid:13)i6(cid:13)(cid:13)(cid:13)) l(cid:13)(cid:13)(cid:13)e.(cid:13)h(cid:13).(cid:13)(cid:13) .(cid:13)dcr(cid:13)(cid:13)a(cid:13)((cid:13)(cid:13)z ma(cid:13)(cid:13)uh(cid:13)(cid:13)(cid:13)(cid:13)t 7(cid:13)(cid:13)a(cid:13)).(cid:13)(cid:13)(cid:13)t(cid:13).s(cid:13)(cid:13).(cid:13) 8(cid:13))(cid:13) that can quickly compute the lower bound and gn(cid:13) (cid:13)i(cid:13)d(cid:13)l(cid:13)i(cid:13)ub(cid:13)(cid:13) gn(cid:13) (cid:13)i(cid:13)d(cid:13)l(cid:13)i(cid:13)ub(cid:13)(cid:13) upper bound for a threshold keyword query, by which lots of unqualified nodes can be pruned Fig.1. Aprobabilistic XMLdatatree and qualified nodes can be returned as early as possible.Tokeeptheeffectivenessofpruning,the probability density function is employed based Example 1: Consider an aircraft-monitored battle- on the assumption of Gaussian distribution. field application, where the useful information will be taken as Aerial photographies. Through analysing • We propose two algorithms, a comparable base- line algorithm and a PI-based Algorithm, to ef- the photographies, we canextractthe possible objects ficiently find all the quasi-SLCA results meeting (e.g., road, factory, airport, etc.) and attach some text the threshold requirement. description to them with probabilities, which can be stored in the formatof PrXML.Figure 1 is a snapshot • Experimental evaluation has demonstrated the efficiency and effectiveness of the proposed ap- of an aircraft-monitored battlefield XML data. By is- proaches. suing a keyword query hazard,building ,a military { } departmentwould find the potential areascontaining The rest of this paper is organized as follows. In hazard buildings above a probability threshold. Section 2, we introduce the probabilistic XML model BasedonthesemanticsofPrSLCA,anyofthenodes and the problem definition of probabilistic threshold library (probability=0.3),area (=0.14),sub region ( keywordquery.Section3showstheprocedureofeffi- 1 1 = 0.168), heliport( = 0.24), sub region ( = 0.32) and ciently finding quasi-SLCA results using an example. 2 region( = 0.088) can become an PrSLCA result. The Section 4 first presents the data structure of PI index, detailed procedure of calculating the probabilities of discusses the basic building operations and pruning results will be shown later. As we know, the users techniques of PI index, and provides the building generallyspecifyathresholdvalueσastheconfidence algorithmofPIindex.InSection5,weproposeacom- scorewith their issued query, e.g.,σ =0.40represent- parable baseline algorithm and a PI-based algorithm ing that the users preferto see the answerswith their tofindthequalifiedquasi-SLCAresults.Wereportthe 3 experimental results in Section 6. Section 7 discusses Example 2: Consider a small PrXML in related works and Section 8 concludes the paper. Figure 2.a and all generated possible worlds in Figure 2. b,c,d,e,f,g,h,i where the solid line { } 2 PROBABILISTIC DATA MODEL AND represents the existence of the edge while the dashed line represents the absence of the edge. Given a PROBLEM DEFINITION possible world, we cancompute itsglobalprobability Probabilistic Data Model: A PrXML document de- based on the existence/absence of the edges in the fines a probability distribution over a space of de- possibleworld,e.g.,Pr(w )=(1 0.5) 0.3 0.4=0.06. d − ∗ ∗ terministic XML documents. Each deterministic doc- Given a keyword query q = k ,k , we can com- 1 2 { } ument belonging to this space is called a possible putetheglobalprobabilityofc beingaPrSLCAw.r.t. 2 world. A PrXML document represented as a labelled q byusingPrG (q,c )=Pr(w )+Pr(w )+Pr(w )+ slca 2 b d f tree has ordinary and distributional nodes. Ordinary Pr(w ) = 0.06+0.06+0.09+0.09 = 0.30. Similarly, h nodesareregularXMLnodesandtheymayappearin we have the global probability of a being a PrSLCA 4 deterministic documents, while distributional nodes w.r.t. q by using PrG (q,a )=Pr(w )=0.14. slca 4 e are only used for defining the probabilistic process of The Semantics of quasi-SLCA in PrXML: generating deterministic documents and they do not Definition 1: Quasi-SLCA: Given a keyword query occur in those documents. q and a threshold value σ, a node v is called a quasi- Inthispaper,weadoptapopularprobabilisticXML SLCAifandonlyif(1)v oritsdescendantsareSLCAs model, PrXML{ind,mux} [12], [23], which was first in a set W of possible worlds; (2) the aggregated discussed in [7]. In this model, a PrXML document probabilityofvanditsdescendantstobeSLCAsinW is considered as a labelled tree where distributional is no less than σ; (3) no descendant nodes of v satisfy nodes have two types, IND and MUX. An IND node both of the above conditions in any set of possible has children that are independent of each other, while worlds that overlaps with W. thechildrenofaMUXnodearemutually-exclusive,that In other words, if a descendant node v of v is d is, at most one child can exist in a random instance a quasi-SLCA, then the probability of v has to be d document (called a possible world). A real number excluded from the probability of v being a quasi- from (0,1] is attached on each edge in the XML tree, SLCA. It means that the set of possible worlds that indicating the conditional probability that the child v appears does not overlap with the set of possible d node will appear under the parent node given the worlds that v or its other descendants appear. existenceof the parentnode. Anexampleof aPrXML Given a query q, we can compute PrL (q,v) in document is given in Fig. 1. Unweighted edges have slca a bottom-up manner, where PrL (q,v) stands for 1 as the default conditional probability. slca the local probability for v being an SLCA in the TheSemanticsofPrSLCAinPrXML:Accordingto probabilistic subtree rooted at v. For example, a in 4 the semantics of possible worlds, the global probabil- Figure 2(a) is a subtree of Figure 1. PrL (q,a ) can ityofanodev beingaPrSLCAwithregardtoagiven slca 4 beused tocompute the PrSLCAprobabilityof a and query q in the possible worlds is defined as follows: 2 a . From PrL (q,v), we can easily get PrG (q,v) 1 slca slca m by PrG (q,v) = Pr(path ) PrL (q,v) where PrsGlca(q,v)= {Pr(wi)|slca(q,v,wi)=true} (1) Pr(patshlcra→v) indicates there→xviste×nce pslrcoabability of v X i=1 in the possible worlds. It can be computed by mul- wherew ,...,w denotesthepossibleworldsimplied tiplying the conditional probabilities along the path 1 m by slca(q,v,w ) = true indicates that v is an SLCA from the root r to v. i in the possible world w for the query q. Pr(w ) is Now, we define quasi-SLCA based on PrSLCA and i i theexistenceprobabilityofthepossibleworldw .The the parent-child relationship. For an IND node v, we i symbolGmeansPrG (q,v)istheglobalprobabilityof have: slca anodev beinganSLCAw.r.t.q inallpossibleworlds. PrG (q,v)=PrG (q,v)+Pr(path ) quasi−slca slca r→v × (1 (1 PrL (q,v′))) a(cid:13)4(cid:13) a(cid:13)4(cid:13) a(cid:13)4(cid:13) a(cid:13)4(cid:13) a(cid:13)4(cid:13) −Qv′∈child(v)∧v′∈/Vquasi − slca (2) 4D(cid:13)N(cid:13)I(cid:13)(cid:13) ck(cid:13)1(cid:13)1(cid:13)(cid:13) k,c(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)k(cid:13)2(cid:13)ck(cid:13)3(cid:13)2(cid:13)(cid:13)ck(cid:13)1(cid:13)1(cid:13)(cid:13) k,c(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)k(cid:13)2(cid:13)ck(cid:13)3(cid:13)2(cid:13)(cid:13)ck(cid:13)1(cid:13)1(cid:13)(cid:13) k,c(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)k(cid:13)2(cid:13)ck(cid:13)3(cid:13)2(cid:13)(cid:13) ck(cid:13)1(cid:13)1(cid:13)(cid:13) k,c(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)k(cid:13)2(cid:13)ck(cid:13)3(cid:13)2(cid:13)(cid:13) wa hqueraesit-hSeLCchAildnondoed.ev′ ofv isanSLCAnode,butnot 5(cid:13).(cid:13)0(cid:13) 4(cid:13).(cid:13)0(cid:13)60(cid:13)(cid:13).(cid:13)0(cid:13) (cid:13) (cid:13))(cid:13)b(cid:13)((cid:13) 12(cid:13)(cid:13).(cid:13)0(cid:13) (cid:13) (cid:13))(cid:13)c(cid:13)((cid:13) 60(cid:13)(cid:13).(cid:13)0(cid:13) (cid:13) (cid:13))(cid:13)d(cid:13)((cid:13) 41(cid:13)(cid:13).(cid:13)0(cid:13) (cid:13) (cid:13))(cid:13)e(cid:13)((cid:13) 3(cid:13).(cid:13)0(cid:13) For MUX node v, we have: ck(cid:13)1(cid:13)1(cid:13)(cid:13) k,c(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)k(cid:13)2(cid:13)ck(cid:13)3(cid:13)2(cid:13)(cid:13) a(cid:13)4(cid:13) a(cid:13)4(cid:13) a(cid:13)4(cid:13) a(cid:13)4(cid:13) )(cid:13)a(cid:13)((cid:13) 90(cid:13)(cid:13).(cid:13)0(cid:13) (cid:13)c k(cid:13)(cid:13)1(cid:13))1(cid:13)(cid:13)(cid:13)f(cid:13)((cid:13)kc,(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)4k1(cid:13)(cid:13)2(cid:13).(cid:13)(cid:13)c0k(cid:13)3(cid:13)(cid:13) 2(cid:13)(cid:13)(cid:13) ck(cid:13)(cid:13)1)(cid:13)1(cid:13)(cid:13)g(cid:13)(cid:13)((cid:13) k1c,2(cid:13)(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13).(cid:13)(cid:13)k09(cid:13)(cid:13)2 0(cid:13)(cid:13)(cid:13)ck(cid:13) .(cid:13)(cid:13)3(cid:13)(cid:13)20)(cid:13)(cid:13)(cid:13)(cid:13) ic(cid:13)k(cid:13)(cid:13) (1(cid:13)1(cid:13)(cid:13)(cid:13))(cid:13)(cid:13)h(cid:13)((cid:13)kc,(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)k(cid:13)2(cid:13)ck(cid:13)3(cid:13)2(cid:13)(cid:13) ck(cid:13)1(cid:13)1(cid:13)(cid:13) kc,(cid:13)(cid:13)1(cid:13)2(cid:13)(cid:13)k(cid:13)2(cid:13)ck(cid:13)3(cid:13)2(cid:13)(cid:13) PPr{qGPuarssLil−casl(cqa,(vq′,)v|v)′=∈PchrisGllcda((vq),∧v)v+′ ∈/PVr(qpuaastih}r→v)×(3) Note,INDorMUXnodesarenormallynotallowed Fig.2. AsmallPrXMLanditspossibleworlds to be SLCA result nodes because they are only distri- butional nodes. As such, for the above IND or MUX 4 node v, we may use its parent node v (with v as a relationships. In other words, the different parts of p sole child) to represent the SLCA result node. terms in a node are mutual-exclusive (MUX), e.g., a 1 Example 3: Let’s consider Example 2 again. First and a in Figure 3 consists of three parts. 5 assume the specified threshold value is 0.40, then Given a keyword query and a threshold value, the global probability of a4 being a quasi-SLCA re- we first load the corresponding off-line computed sult can be calculated by using PrqGuasi−slca(q,a4) = probabilistic index w.r.t. the keyword query and then PrG (q,a ) + Pr(path ) * (1 (1 PrL (q,c ))) on-the-flycalculatetherangeofprobabilitiesofanode slca 4 r→a4 − − slca 2 = 0.14+ 0.30= 0.44because child c is an SLCAnode being a result of the keyword query using the pre- 2 but not a quasi-SLCA node w.r.t. the given threshold. computedprobabilisticindexinabottom-upstrategy. So c ’s SLCA probability contributes to its parent Here,the rangeof probabilitiescanbe representedby 2 node a . If the threshold is decreased to 0.30, then twoboundaryvalues:lowerboundandupperbound. 4 c will be taken as a qualified quasi-SLCA result and Bycomparingthelower/upperboundsofcandidates, 2 will not contribute to a . In this case, a cannot be- the qualified results can be efficiently identified. 4 4 come a quasi-SLCA result because PrqGuasi−slca(q,a4) The followed two examples briefly demonstrate =PrG (q,a )=0.14<0.30.Ifthethresholdisfurther how we calculate the lower/upper bounds based on slca 4 decreased to 0.14, both c and a are qualified quasi- a given keyword query and the off-line computed 2 4 SLCA results. probabilistic index, and how we apply the on-line Definition 2: Probabilistic Threshold Keyword computed lower/upper bounds to prune the unqual- Query: (PrTKQ) Given a keyword query q and a ified candidates and determine the qualified ones. threshold σ, the results of q over a probabilistic XML data T is a set R of quasi-SLCA nodes with a(cid:13)1k(cid:13)(cid:13)((cid:13)r6P(cid:13)4(cid:13)9(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)1(cid:13)(cid:13))(cid:13) 09(cid:13)8(cid:13)(cid:13).(cid:13)0=(cid:13)BL(cid:13)(cid:13)(cid:13) a(cid:13)3k(cid:13)(cid:13)((cid:13)rP(cid:13)46(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)1(cid:13)(cid:13))(cid:13) 23(cid:13)(cid:13).(cid:13)0=(cid:13)BL(cid:13)(cid:13)(cid:13) their probabilities equal to or larger than σ, i.e., 1DN(cid:13)(cid:13)I(cid:13)(cid:13) k(cid:13)((cid:13)0rP(cid:13)5(cid:13)9(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)2(cid:13)(cid:13))(cid:13) 05(cid:13)9(cid:13)(cid:13).(cid:13)0=(cid:13)BU(cid:13)(cid:13)(cid:13) 3DN(cid:13)(cid:13)I(cid:13)(cid:13) k(cid:13)((cid:13)rP(cid:13)4(cid:13)(cid:13).(cid:13)0=(cid:13)2(cid:13)(cid:13))(cid:13) 4(cid:13).(cid:13)0=(cid:13)BU(cid:13)(cid:13)(cid:13) PrG (q,v) σ for v R. 1(cid:13) 2(cid:13) 3(cid:13) 1(cid:13) 2(cid:13) 3(cid:13) Iqnuatshii−sswlcaork, w≥e areint∀ere∈sted in how to efficiently k(cid:13)1(cid:13) k(cid:13)2(cid:13) k(cid:13)1(cid:13) k(cid:13)2(cid:13) k(cid:13)1(cid:13) k(cid:13)2(cid:13) k(cid:13)1(cid:13) k(cid:13)2(cid:13) k(cid:13)1(cid:13) k(cid:13)2(cid:13) k(cid:13)1(cid:13) 73(cid:13)9(cid:13) (cid:13).(cid:13)0(cid:13)05(cid:13)9(cid:13) (cid:13).(cid:13)0(cid:13)64(cid:13)9(cid:13) (cid:13).(cid:13)0(cid:13)63(cid:13)9(cid:13) (cid:13).(cid:13)0(cid:13)04(cid:13)9(cid:13) (cid:13).(cid:13)0(cid:13)61(cid:13)9(cid:13) (cid:13).(cid:13)0(cid:13) 8(cid:13).(cid:13)0(cid:13)5(cid:13).(cid:13)06(cid:13)8(cid:13) (cid:13).(cid:13)0(cid:13) 3(cid:13).(cid:13)02(cid:13)8(cid:13)(cid:13).(cid:13)0(cid:13) computethequasi-SLCAanswersetforaPrTKQover a probabilistic XML data. a(cid:13)5(cid:13)k(cid:13)((cid:13)rP(cid:13)42(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)1(cid:13)(cid:13))(cid:13)27(cid:13)0(cid:13)(cid:13).(cid:13)0=(cid:13)BL(cid:13)(cid:13)(cid:13) 5a9(cid:13)(cid:13)25(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)BL(cid:13)(cid:13)(cid:13) k(cid:13)((cid:13)rP(cid:13)56(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)1(cid:13)(cid:13))(cid:13) XMU(cid:13)(cid:13)(cid:13)k(cid:13)((cid:13)rP(cid:13)42(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)2(cid:13)(cid:13))(cid:13) 42(cid:13)(cid:13).(cid:13)0=(cid:13)BU(cid:13)(cid:13)(cid:13) 2DN(cid:13)(cid:13)I(cid:13)(cid:13)56(cid:13)(cid:13).(cid:13)0=(cid:13)BU(cid:13)(cid:13)(cid:13) k(cid:13)((cid:13)6rP(cid:13)1(cid:13)9(cid:13)(cid:13)(cid:13).(cid:13)0=(cid:13)2(cid:13)(cid:13))(cid:13) 3 OVERVIEW OF THIS WORK 1(cid:13) 2(cid:13) 3(cid:13) 1(cid:13) k2(cid:13)(cid:13) k1(cid:13)(cid:13) k2(cid:13)(cid:13) k1(cid:13)(cid:13) k(cid:13)1(cid:13) k(cid:13)2(cid:13) A naive method to answer a PrTKQ is to enumerate 5(cid:13).(cid:13)0(cid:13) 3(cid:13).(cid:13)03(cid:13) (cid:13).(cid:13)0(cid:13)1(cid:13).(cid:13)0(cid:13) 56(cid:13) (cid:13).(cid:13)06(cid:13)1(cid:13)9(cid:13) (cid:13).(cid:13)0(cid:13) all possible worlds and apply the query to each possible world. Then, we can compute the overall 4DN(cid:13)(cid:13)aI7(cid:13)(cid:13)7(cid:13)(cid:13)483(cid:13)(cid:13)5(cid:13)(cid:13)(cid:13)..(cid:13)0(cid:13)0=(cid:13)=B(cid:13)BL(cid:13)U(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) 33c(cid:13)(cid:13)(cid:13).2.(cid:13)(cid:13)(cid:13)00==(cid:13)(cid:13)BBUL(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) 4422(cid:13)(cid:13)(cid:13)(cid:13)c..(cid:13)(cid:13)(cid:13)700(cid:13)==(cid:13)(cid:13)BBUL(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) probability of each quasi-SLCA result and return the 1(cid:13) 1(cid:13) 1(cid:13) results meeting the probability threshold. However, k(cid:13)1(cid:13) k(cid:13)2(cid:13) k(cid:13)1(cid:13) k(cid:13)2(cid:13) k(cid:13)1(cid:13) k(cid:13)2(cid:13) the naive method is inefficient due to the huge num- 56(cid:13) (cid:13).(cid:13)0(cid:13)85(cid:13) (cid:13).(cid:13)0(cid:13) 1(cid:13) 1(cid:13) 1(cid:13) 1(cid:13) ber of possible worlds over a probabilistic XML data. Another method is to extend the work in [23] to Fig. 3. PI index and Lower/Upper Bound for a query compute the probabilities of quasi-SLCA candidates. k1,k2 overthegivenPrXML { } Although it is much more efficient than the naive method, it needs to scan the keyword node lists and Figure 3 shows the lower/upper bounds of each calculate the keyword distributions for all relevant node in Figure 1 where the probability of each nodes. Therefore, that motivates our development of individual term is calculated offline while the efficientalgorithms which not only avoids generating lower/upper bounds are computed on-the-fly based possible worlds, but also prunes more unqualified onthegivenquerykeywords.Let’sfirstintroducethe nodes. related concepts briefly: the probability of a term in To accelerate query evaluation, in this paper we a node represents the total local probability of the propose a prune-based probabilistic threshold key- termappearinginall possible worlds to be generated wordqueryalgorithm,whichdeterminesthequalified for the probabilistic subtree rooted at the node, e.g., resultsandfilterstheunqualified candidatesbyusing Pr(k ,a ) = 0.65 and Pr(k ,a ) = 0.916; the lower 1 2 2 2 off-line computed probabilityinformation. To dothis, bound value represents the minimal total local prob- we need to first calculate the probability of each ability of the given query keywords appearing in all possible query term within a node, which is stored thepossibleworldsw.r.t.theprobabilisticsubtree,e.g., as an off-line computed probabilistic index. Within LB(k k ,a )=0.65*0.916=0.595;theupperboundvalue 1 2 2 a node, any two of its contained terms may appear represents the maximal total local probability of the in the IND or MUX ways. To precisely differentiate given query keywords appearing in all the possible IND and MUX, we utilize differentparts to represent worldsw.r.t.theprobabilisticsubtreebecausethekey- the probabilities of possible query terms appearing wordsmaybeindependentorco-occur,e.g.,UB(k k , 1 2 in MUX way, while the terms in each part hold IND a ) = min 0.65, 0.916 = 0.65 no matter whether they 2 { } 5 are independent. By multiplying the path probability, a keyword). Each inner list stores the ids of XML the local probability can be transformed into the nodes in which the given keyword occurs, and for global probability. For the nodes containing MUX eachnode, the frequenciesor the weight atwhich the semantics, we group the probabilities of its terms keywordappearsortakes.Inthis work,we introduce into different parts, any two of which are mutually- a probabilistic version of this structure, in which we exclusive as shown in a , a and a in Figure 3. The store for each keyword a list of node-ids. Along with 1 3 5 detailsof computing the lower/upper bounds for the each node-id, we store the probability values that the IND and MUX semantics in the following section. subtree rooted at the node may contain the given Example 4: Consider a PrTKQ k ,k with σ=0.40 keyword. The probability values in inner lists can be 1 2 { } again. a , c and c can be pruned directly without used to compute lower bound and upper bound on- 5 2 7 calculation because their upper bounds are all lower the-fly during PrTKQ evaluation. than 0.40. We need to check the rest nodes a , a , Figure 4 shows an example of a probabilistic in- 1 2 a and a . For a , after computation, the probabil- verted index of the data in Figure 1. At the base of 3 4 4 ity of a being a quasi-SLCA result is 0.44, which the structure is a list of keywords storing pointers to 4 is larger than the specified threshold value 0.40, so lists, corresponding to each term in the XML data T. a will be taken as a result. After that, the result Thisisaninvertedarraystoring,foreachterminT,a 4 of a can be used to update the lower bound and pointertoalistoftripletuples.Inthelistk .listcorre- 4 i upperboundofa2,(LB=0.595,UB=0.65)→(LB=0.155, sponding ki ∈ T, the triple (v id,Pr(pathr→v),{p1,...}) UB=0.21). As a consequence, a should be filtered records the node v 1, the conditional probability 2 due to UB(a ) = 0.21 < σ = 0.40. Similarly, a from the root to v, and the probability set that may 2 3 can be computed and selected as a result because its contain single probability value or multiple probability probability is 0.56. Since a and a having been the value. Single probability value represents that all the 3 4 quasi-SLCA results, the bounds of a can be updated keyword instances in the subtree can be considered 1 as (LB=0.890, UB=0.950) (LB=0.136, UB=0.196). As as independent in probability, e.g., the confidence of → such, a1 can be pruned because its upper bound is a2 containing k1 is 0.65 , while multple probability { } lower than 0.40. From this example, we can find that value means that the keyword instances belonging to many answers can be pruned or returned without differentsetsoccurmutually,e.g.,theconfidenceofa 3 the need to know their accurateprobabilities,and the containing k is a set 0.8, 0.86, 0.82 , that represents 1 { } effectiveness of pruning would be acceleratedgreatly the different possibilities of k occurring in a . 1 3 with the increase of users’ search confidence. As an acute reader, you may find that we have to 4.1 BasicOperationsofBuildingPIIndex compute the probability of a being a quasi-SLCA 4 TobuildPI index,weneedtotraversethegivenXML because it cannot determine whether or not a is 4 data tree once in a bottom-up method. During the a qualified result to be output only based on its datatraversal,wewillapplythefollowingoperations lower/upper bound values. To exactly calculate the that may be used solely or in their combinations. The probability of a being a quasi-SLCA, we have to 4 binary operation X ⊲⊳/ Y promotes the probability access its child/descendant nodes, e.g., c ,c ,c , al- 1 2 3 of Y to its parent node X. The binary operation X though c has been recognized as a pruned node 2 ⊲⊳sibling Y promotes the probabilities of two sibling before we start to process a . If an internal node 4 nodesXandYtotheirparentnode.Then-arycasecan depends on a larger number of pruned nodes, the be processed by calling for the corresponding binary effectiveness of pruning will be degraded to some cases one by one. extent. To fix this challenging problem, we will in- Assume v contains the keywords k , k , ..., k , ..., troduce Probability Density Function PDF that can be 1 { 1 2 i k and the conditional probability Pr(path ) used to approximately compute the probability of a m} vp−>v1 is λ ; and v contains the keywords k ,k ,...,k node, the result of which can be used to update the 1 2 { i i+1 mi} and the conditional probability Pr(path ) is λ . lower bound and upper bound of its ancestor nodes vp−>v2 2 Operator1-v 1sibling,IND v : If v and v are inde- further. The details are provided and discussed with 1 2 1 2 pendent sibling nodes, we can directly promote their algorithms later. probabilities to their parent v , then we have, p 4 PROBABILISTIC INVERTED INDEX λ Pr(k ,v ) j <i; 1 j 1 ∗ (IPnIt)hiinsdseexctisotrnu,cwtueredefsocrriebfeficoiuenrtPlyroebvaabliuliasttiincgInPvreTrKteQd Pr(kj,vp)= 1(1−(λ12−λP1r∗(kPjr,(vk2j),)v1)) i j m mi; − ∗ ≤ ≤ ≤ queries over probabilistic XML data. In keyword λ2 Pr(kj,v2) m j mi; search on certain XML data, inverted indexes are ∗ ≤ ≤ (4) popularstructures,e.g.,[16],[20].Thebasictechnique 1.Thesymbolv isusedtorepresentanode’snameoranode’s is to maintain a list of lists, where each element in idwithoutconfusionsinthefollowingsections.Here,vistheidof the outer list corresponds to a domain element (i.e., thenodev 6 ,(cid:13)0(cid:13).(cid:13)1(a(cid:13)(cid:13) (cid:13)(cid:13) ,(cid:13) ,(cid:13)3c(cid:13).(cid:13)(cid:13)0((cid:13)(cid:13) (cid:13) ,(cid:13) ,(cid:13)8(cid:13).(cid:13)0((cid:13)a(cid:13) (cid:13)(cid:13) ,(cid:13) 1(cid:13) }(cid:13)56(cid:13)(cid:13).(cid:13)0(cid:13){(cid:13) (cid:13) ,(cid:13)0(cid:13).(cid:13)1((cid:13)a(cid:13) (cid:13)(cid:13) ,(cid:13) }(cid:13)56(cid:13)(cid:13).(cid:13)0(cid:13){(cid:13)) (cid:13)(cid:13) ,(cid:13)0(cid:13).(cid:13)1((cid:13)a(cid:13) (cid:13)(cid:13) ,(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){(cid:13) (cid:13) ,(cid:13)5)(cid:13)(cid:13).(cid:13)0((cid:13)c(cid:13) (cid:13)(cid:13) ,(cid:13) )(cid:13) 2(cid:13) 3(cid:13) }(cid:13)04(cid:13)9(cid:13)(cid:13).(cid:13)0(cid:13) (cid:13) ,(cid:13)64(cid:13)9(cid:13)(cid:13).(cid:13)0(cid:13),(cid:13)73(cid:13)9(cid:13)(cid:13).(cid:13)0(cid:13){(cid:13) )(cid:13) 2(cid:13) 4(cid:13) 1(cid:13) }(cid:13)28(cid:13)(cid:13).(cid:13)}0(cid:13)0(cid:13) (cid:13)(cid:13).(cid:13)1,(cid:13)6(cid:13){8(cid:13)(cid:13)(cid:13).(cid:13)0(cid:13) )(cid:13)(cid:13),(cid:13)8(cid:13).(cid:13)0(cid:13){(cid:13) )(cid:13) k(cid:13) 1(cid:13) ,(cid:13)80(cid:13)(cid:13).(cid:13)0((cid:13)c(cid:13) (cid:13)(cid:13) ,(cid:13) ,(cid:13)5(cid:13).(cid:13)0((cid:13)c(cid:13) (cid:13)(cid:13) ,(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){(cid:13) (cid:13) ,(cid:13)8(cid:13)}.(cid:13)(cid:13)10((cid:13)(cid:13)c(cid:13). (cid:13)(cid:13)(cid:13)50(cid:13),(cid:13)(cid:13),(cid:13)3(cid:13).(cid:13)0(cid:13) (cid:13) ,(cid:13)0(cid:13) (cid:13)){(cid:13)(cid:13) (cid:13) ,(cid:13)8(cid:13).(cid:13)0((cid:13)a(cid:13) (cid:13)(cid:13)5(cid:13),(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){(cid:13) (cid:13) ,(cid:13)4)2(cid:13)(cid:13)(cid:13).(cid:13)0((cid:13)c(cid:13) (cid:13)(cid:13)7(cid:13),(cid:13) )(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){8(cid:13)(cid:13) )(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){9(cid:13)(cid:13) )(cid:13) /(cid:13) k(cid:13)2(cid:13) ,(cid:13)0(cid:13).(cid:13)1((cid:13)a(cid:13) (cid:13)(cid:13)1(cid:13),}(cid:13)(cid:13)61(cid:13)9(cid:13)(cid:13).(cid:13)0(cid:13){(cid:13) (cid:13) ,(cid:13)0(cid:13).(cid:13)1((cid:13)a(cid:13) (cid:13)(cid:13) ,(cid:13) }(cid:13)85(cid:13)(cid:13).(cid:13)0(cid:13){(cid:13) (cid:13) ,(cid:13)0)(cid:13)(cid:13).(cid:13)1((cid:13)a(cid:13) (cid:13)(cid:13) ,(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){(cid:13) (cid:13)),(cid:13)(cid:13)3c(cid:13).(cid:13)(cid:13)0((cid:13)(cid:13) (cid:13) ,(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){(cid:13) (cid:13)),(cid:13)(cid:13)4c(cid:13).(cid:13)(cid:13)0((cid:13)(cid:13) (cid:13) ,(cid:13) )(cid:13) ,(cid:13)8(cid:13).(cid:13)0((cid:13)a(cid:13) (cid:13)(cid:13)3(cid:13),(cid:13) }(cid:13)61(cid:13)9(cid:13)(cid:13).(cid:13)0(cid:13),(cid:13)63(cid:13)9(cid:13)(cid:13).(cid:13)0(cid:13),(cid:13)05(cid:13)9(cid:13)(cid:13).(cid:13)0(cid:13){(cid:13) )(cid:13) 2(cid:13) 4(cid:13) 2(cid:13) }3(cid:13)(cid:13)0(cid:13) (cid:13) ,(cid:13)3(cid:13).(cid:13)0(cid:13) (cid:13) ,(cid:13)5(cid:13).(cid:13)0(cid:13){(cid:13) (cid:13) )(cid:13) }(cid:13)0(cid:13) (cid:13) ,(cid:13)3(cid:13).(cid:13)0(cid:13) (cid:13) ,(cid:13)5(cid:13).(cid:13)0(cid:13){(cid:13) (cid:13) ,(cid:13)8(cid:13).(cid:13)0((cid:13)a(cid:13) (cid:13)(cid:13)5(cid:13),(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){(cid:13) (cid:13) ,)(cid:13)4(cid:13) (cid:13).(cid:13)0((cid:13)c(cid:13) (cid:13)(cid:13)6(cid:13),(cid:13) }(cid:13)0(cid:13).(cid:13)1(cid:13){(cid:13) (cid:13) ,(cid:13))4(cid:13)2(cid:13)(cid:13).(cid:13)0((cid:13)c(cid:13) (cid:13)(cid:13)7(cid:13),(cid:13) )(cid:13) /(cid:13) .(cid:13) .(cid:13) .(cid:13) .(cid:13) .(cid:13) .(cid:13) Fig.4. Aprobabilistic InvertedIndex Operator2-v1 1/,IND v2: If v2 is an independent k12 k1 2k2 k31 , as shown in Figure 3 - a5. Because c5 child of v , we can directly promote the probability 0.5 0.3 0.3 0.1 1 and a are independent sibling nodes, the operation of v to v , then we have, 5 2 1 c 1sibling,IND a can be called to compute the prob- 5 5 ability with regards to their parent a . To do this, we 3 Pr(kj,v1) j <i; apply c5 1sibling,IND parti for each parti a5 using Pr(kj,v1)= 1(1−−(λ12−∗PPrr((kkjj,,vv12)))) i≤j ≤m≤mi; AOfpteerratthoar1t,. wTheecarnesuinlstseratrke1s→how(an3,0in.8,F0i.g8u∈,r0e.863,0-.8a23). λ2∗Pr(kj,v2) m≤j ≤mi; (5) anIdfbko2t→hv(1aa3n,0d.8v,20c.5o,n0t.a3i,n0)aisnettooPfImiuntdueaxl.ly-exclusive Example 5: Let’s show the procedure of comput- parts, respectively, then we can do pairwise aggre- ing c11sibling,INDc21sibling,INDc3 in Figure 2 us- gations across the two sets of parts. Building PI in- ing Operator1 and Operator2. Firstly, we compute dex needs to scan the given probabilistic XML data c11sibling,INDc2 and promote the probability of key- only once. Assume that the probabilistic XML has words to their parenta4 by Operator1, i.e., Pr(k1,a4) been encoded using probabilistic Dewey codes. The = 1 - (1 - 0.5*1.0)*(1 - 0.3*1.0) = 0.65 and Pr(k2,a4) basic idea of building PI index is to progressively = 0.3. And then, we compute a4 1/,IND c3 using process the document nodes sorted by Dewey codes operator2, i.e., Pr(k2,a4) = 1 - (1 - 0.3)*(1 - 0.4) = in ascending order, i.e., the data can be loaded and 0.58 while Pr(k1,a4) do not change because c3 only processed in a streaming strategy. When a leaf node containsk2here.Andtheconditionalprobabilityfrom vl is coming, we will compute the probability of the root to a4 is 1.0. Therefore, k1 → (a4,1.0,0.65) each term in the leaf node vl. After that, the terms and k2 → (a4,1.0,0.58) will be inserted in PI index, with their probabilities in vl will be written into PI respectively. index. Next, we need promote the terms and their Operator3-v1 1sibling,MUX v2: If v1 and v2 are two probabilities of vl to the parent vp of vl based on mutually-exclusive sibling nodes and vp is their par- the operation types in Section 4.1. After the node ent, then we generatetwo partsin vp by vp 1/,IND v1 stream is scanned completely, the building algorithm and vp 1/,IND v2, respectively. of PI index will be terminated. We don’t provide the Operator4-v1 1/,MUX v2: If v2 is a mutually- detailed building algorithm in this paper. exclusive child node of v , then we can get the ag- 1 gregated probability by v 1/,IND v . 4.2 PruningTechniquesusingPIIndex 1 2 In the above four basic operators, we assume the In this subsection, we first show how to prune the terms independently appear in v1 and v2. When the unqualified nodes using the proposed lower/upper nodes v1 and v2 contain mutually-exclusive parts, we bounds. And then, we explain how to compute need to deal with each part using the four basic lower/upper bounds, and how to update the up- operators. per/lower bounds based on intermediate results dur- Giventwoindependentsiblingnodesv1 (λ1)andv2 ing the query evaluation. (λ )whereonlyv containsasetofmutually-exclusive By default, the node lists in PI index are sorted in 2 2 parts p ,p ,... with conditional probability λ . the document order. Pr(k ,v) represents the overall Inthis{cams1e,wme2can}applytheoperationv 1sibling,INmDi probability of a keyword ki in a node v. It is obvious 1 i p for each part p . The computed results are thattheoverallprobabilityofakeywordappearingin mi mi maintained in different parts in their parent v . a node is larger than or equal to that of the keyword p Example 6: Consider an independent node c and a appearing in its descendant nodes. And the overall 5 nodea consistingofc ,c andc inFigure1.Wefirst probability value for each keyword in a node can be 5 6 7 8 promotec ,c andc toa thatconsistsofthreeparts: computed and stored in PI index offline. 6 7 8 5 7 Consider a node v and a PrTKQ q containing a set we have for LB(q,v) σ, then PrG (q,v) σ. ≥ quasi−slca ≥ of keywords k ,k ,...,k . If all the terms in v are As such, v canbe returned asa quasi-SLCA result. 1 2 t { } independent, then we have, Example 8: Let’s continue Example 7. a can be 3 directly returned as a qualified answer for the given t threshold σ( = 0.4). This is because c , c and a 2 7 5 LB(q,v)= Pr(k ,v) (6) i are filtered due to their upper bound less than the Y i=1 threshold σ( = 0.4). To update the lower/upper bound values during UB(q,v)=min Pr(k ,v)1 i t (7) i { | ≤ ≤ } query evaluation, one way is to treat the different Most of the time, v consists of a set of parts types of nodes differently, by which the updated vp1,vp2,...,vpm that are mutually-exclusive. In this lower/upperboundsmayobtainbetterprecision.But { } case, the lower bound of v would be generated from the disadvantage of this way is to easily affect the a part vpj that gives the highest lower bound value efficiency of bound update. This is because, given a while the upperbound of v would begeneratedfrom current node having multiple quasi-SLCA nodes as another part vp that gives the highest upper bound its descendant nodes, it is required to know the de- i value, in which j may be equal to or not equal to i. tailedrelationships(INDorMUX)amongthemultiple quasi-SLCA nodes. To avoid the disadvantage, we t do not separate the different types of distributional LB(q,v)=max Pr(k ,vp ) (8) 1≤j≤m{ i j } nodes, under which the multiple quasi-SLCA nodes Y i=1 appear.Inother words,we unifythemintoauniform UB(q,v)=max1≤j≤m min Pr(ki,vpj)1 i t formula based on the following two properties. { { | ≤ ≤ }} (9) Property 3: Nomatternodev isanINDorordinary Where vpj must satisfy the criteria: (1) LB(q,v) > 0; or MUX node, we can update their upper bound (2) cannot find another part vp′ having values as follows: j t Pr(k ,vp′) > 0 and min Pr(k ,vp′)1 i=1 i j { i j | ≤ Qi t > min Pr(k ,vp )1 i t . Otherwise, m UB≤(q,v}) and LB{(q,v) wi illjbe| se≤t as z≤ero.} UB′(q,v)=UB(q,v)−1+ (1−PrqGuasi−slca(q,vci)) Y Example 7: Let’s consider a in Figure 3 as i=1 3 (10) an example. The first and second parts can Where PrG (q,v ) σ should be held. generate lower and upper bounds: Part 1 quasi−slca ci ≥ → Proof:Accordingtothedefinitionofupperbound, LB( k ,k ,a )=0.32, UB( k ,k ,a )=0.4; and Part { 1 2} 3 { 1 2} 3 UB(q,v)representsthemaximalprobabilityofv being 2 LB( k ,k ,a )=0.206, UB( k ,k ,a )=0.24. Be- → { 1 2} 3 { 1 2} 3 a quasi-SLCA node, which comes from the overall cause Part 1 can produce a higher upper bound than probabilityofaspecifickeyword.Therefore,theprob- Part 2, the lower and upper bounds of a will come 3 lem of updating upper bound can be alternatively from Part 1, which guarantees that a can be a quasi- 3 considered as the percentage of the probability of SLCA candidate with a higher probability. Since Part the keyword has been used for the v′ descendant 3 does not contain full keywords, i.e., missing k , it 2 nodes becoming qualified quasi-SLCA nodes. If we cannot generate lower and upper bounds. know there are m qualified descendant nodes of v as Property 1: [Upper Bound Usage] A node v can returned answers, then we can compute their aggre- be filtered if the overall probability Pr(ki,v) of any gatedprobabilitiesby1 m (1 PrG (q,v )). keyword ki ( ki ∈ v and ki ∈ q) is lower than the Therefore, the upper−bQoui=n1d −can qubaesi−usplcdaatedcias givenPrtohorfe:sholdSivnacleue σ, iP.er.,G∃ki, Pr((kqi,,vv))<σ. UB(q,v)−1+ mi=1(1−PrqGuasi−slca(q,vci)). quasi−slca ≤ Does the abQove update equation hold for MUX min Pr(k ,v)k q Pr( k q,v), we { i | i ∈ } ≤ ∀ i ∈ node? To answer this question, we utilize the prop- have min Pr(k ,v)k q as the upper bound { i | i ∈ } erties in [23], from which we can compute the aggre- probability of v becoming a qualified quasi-SLCA gated probability by using m PrG (q,v ). node. Therefore, if a node v holds the inequation Therefore, we have UB′(Pq,iv=)1 =quasiU−sBlc(aq,v)ci Pr(ki,v) < σ, then PrqGuasi−slca(q,v) must be lower m PrG (q,v ). The equation can be con−- than σ. As such, v can be filtered. i=1 quasi−slca ci vPertedintoUB(q,v) 1+[1 m PrG (q,v )]. Property 2: [Lower Bound Usage] The nodes v can Since m (1 −PrG −Pi=(1q,vqu)a)si−caslnca be ceix- be returned as required results if we have LB(q,v)≥ pressed aQsi=11 −m PquraGsi−slca (q,civ ) + ∆ where σ and UB(q,vd)<σ where vd is any child or descen- − i=1 quasi−slca ci ∆ is a positivPe value, i.e., 0, we can de- dant node of v. rive that 1 m PrG (≥q,v ) m (1 Proof: UB(q,vd) < σ means that all the keyword − i=1 quasi−slca ci ≤ i=1 − nodes in the subtree rooted at v will contribute their PrqGuasi−slca(q,vPci)). As a consequence, we cQan obtain ppnorroodbbeaaobbfiilliivttiyceosLuBtold(qnb,oved)aewqvi.ullIannsoio-tStbhLeeCrdAwedsoourcdtthse,edn.looTwhdeeerrceebfnoodruean,nidft t=UhBaUt(qBU,(vBq),′(vq),1v−+) =1 m+UB[(11(q−,vP)Pr−Gmi=P1Pmi=r1qGuPa(srqiqG−,uvsalscia−)()sq.lc,av(cqi),]vc≤i) − i=1 − quasi−slca ci Q 8 Therefore, UB′(q,v) = UB(q,v) - 1 + m (1 - 5 PRUNE-BASED PROBABILISTIC THRESH- i=1 PrqGuasi−slca (q,vci))holdsforIND,ordinaryaQndMUX OLD KEYWORD QUERY ALGORITHM nodes. A key challenge of answering a PrTKQ is to identify Property 4: Nomatternodev isanINDorordinary the qualified result candidates and filter the unquali- or MUX node, we can update their lower bound fiedonesassoonaspossible.Inthiswork,weaddress values as follows: this challenge with the help of our proposed proba- bilistic inverted (PI) index. Two efficient algorithms are proposed, a comparable Baseline Algorithm and m LB′(q,v)=LB(q,v) PrG (q,v ) (11) a PI-based Algorithm. − quasi−slca ci X i=1 5.1 BaselineAlgorithm Where PrqGuasi−slca(q,vci)≥σ should be held. In keyword search on certain XML data, it is popular Proof: For the lower bound update, we to use keyword inverted index retrieving the rele- need to deduct the confirmed probability vant keyword nodes, by which the keyword search [1 m (1 PrG (q,v ))] for IND results are generated based on different algorithms, nod−es Qoi=r1 −m PrGquasi−slca(q,vci) for MUX e.g., [20], [16], [24], [25]. In probabilistic XML data, i=1 quasi−slca ci nodes, from Pthe original lower bound LB(q,v). [23] proposed PrStack Algorithm to compute top-k According to the procedure of the above SLCA nodes. In this section, we propose an effective proof, we have m (1 PrG (q,v )) Baseline Algorithm that is similar the idea of PrStack 1 m PrG Qi(=q1,v )−. Conqsueaqsui−esnlctaly, wcei hav≥e Algorithm. To answer PrTKQ, we need to scan all th−e Pinie=q1uatiqouna,si−1slca mci (1 PrG (q,v )) the keyword inverted lists once. Firstly, the keyword- 1 (1 − Qim=1 Pr−G quasi(−qs,lvca)) ci= matched nodes will be read one by one based on ≤m PrG− (q−,v )P. i=T1hereqfuoarsei,−slicta is csiafe to their document order. After one node is processed, i=1 quasi−slca ci uPse the right side to update the lower bound values. we check if its probability can be up to the given threshold valueσ. Ifitistrue, the nodecanbeoutput as a quasi-SLCA node and its remaining keyword Example 9: Consider a that has been computed 4 distributions (i.e., containing partialquery keywords) and its probability is 0.44. Given threshold σ (=0.4), can be continuously promoted to its parent node. a is returned as a quasi-SLCA result. Consequently, 4 Otherwise, we promote its complete keyword distri- we can update the lower/upper bound values of its butions (i.e., containing both all keywords or partial ancestora ,i.e.,UB′( k ,k ,a )=0.65-1+(1-0.44) 2 1 2 2 { } keywords) to its parent node. After that, the node at = 0.21 and LB′( k ,k ,a ) = 0.595 - 0.44 = 0.155. 1 2 2 { } the top of the stack will be popped. Similarly, the Since UB′( k ,k ,a ) < σ, a can be filtered out 1 2 2 2 { } above procedures will be repeated until all nodes effectively without computation. are processed. The basic algorithm can be terminated Property3isusedtofiltertheunqualifiednodesby whenallnodesareprocessed.Thedetailedprocedure reducing the upper bound value while Property 4 is is shown in Algorithm 1. used to quickly find the qualified required results by Because Baseline Algorithm only needs to scan comparing the reduced lower bound value (for the the keyword node lists once, it is a fast and simple probability of the remaining quasi-SLCAs) with the algorithm. However, its core computation - keyword threshold value. distributioncomputationwouldconsumelotsoftime, Sometimes,weneedtocalculatetheprobabilitydis- which motivates us to propose the PI-based Algo- tributions of keywords in a node if the given thresh- rithm that can quickly identify the qualified or un- old σ is in the range (LB(q,v),UB(q,v)]. The basic qualified candidates using offline computed PI index computational procedure is similar to the PrStack al- and only compute keyword distributions for a few gorithm in [23]. Differentfromthe PrStackalgorithm, candidates. Here, Baseline Algorithm is taken as a we will introduce probability density function (PDF) comparable base to show the pruning performance to approximately calculate the probability for a node of the PI-based Algorithm described below. if the node depends on a large number of pruned descendentnodes. To decidewhento invoke the PDF 5.2 PI-basedAlgorithm while avoiding the risk of reducing precision signif- icantly, we would like to select and compute some To efficiently answer PrTKQ, the basic idea of PI- descendant nodes that may contribute large proba- based Algorithm is to read the nodes from keyword bilities to the node v. For the remaining descendant node lists one by one in a bottom-up strategy. For nodes, we may choose to invoke the PDF, by which each node, we quickly compute its lower bound and we can reduce the time cost while still guarantee the upperboundbyaccessingPIindex,whichisfarfaster precision to some extent. The detailed procedure will thancomputingthekeyworddistributionsofthenode be introduced in the next section. directly. After comparing its lower/upper bounds 9 Algorithm 1 Baseline Algorithm Algorithm 2 PI-based Algorithm input: a query q = {k1,k2,...,km} with threshold σ, key- input: a query q = {k1,k2,...,km} with threshold σ, key- word inverted (KI) index word inverted (KI) index, PI index output:a set R of quasi-SLCA nodes output:a set R of quasi-SLCA nodes 1: load keyword node lists L = {l1,l2,...,lm} from KI 1: load keyword node lists L = {l1,l2,...,lm} from KI index; index; 2: get the smallestDewey v from L; 2: load probability node lists PIL = 3: initiate a stack S1 using v; {PIL1,PIL2,...,PILm}; 4: while L6=φ do 3: get the smallestDewey v from L; 5: get the next smallest Dewey v from L; 4: initiate a stack S1 using v and an empty stack S2; 6: while (S1.top() ≺not v) do 5: while L6=φ do 7: x = S1.pop(); 6: get the next smallest Dewey v from L again; 8: if x contains full keywords and PrqGuasi−slca(x)≥ 7: while (S1.top() ≺not v) do σ then 8: x = S1.pop(); 9: output x into R; 9: UB(q,x) and LB(q,x) ← ComputeBound(x, 10: promote the rest keyword distributions of x to its {PILi(x)}); parent xp using CombineProb(x, xp); 10: if LB(q,x)≥σ then 11: S1.push(v); 11: output x into R; 12: while S1 6=φ do 12: UpdateBound({va ∈ S1|va ≺ x}, LB(q,x), 13: a new node v← S1.pop(); UB(q,x)); 14: if v contains full keywords and PrqGuasi−slca(v) ≥ σ 13: S2.pop(vd ∈S2|vd ≻x); then 14: else if UB(q,x) ≥σ > LB(q,x)then 15: output v into R; 15: Prob(x) ← ComputeProbDist(x,S2); 16: promote the rest keyword distributions of v to its 16: if Prob(x)≥σ then parent vp using CombineProb(v, vp); 17: output x into R; 17: return R; 18: UpdateBound({va ∈S1|va ≺x}, Prob(x)); 19: S2.pop(vd ∈S2|vd ≻x); 20: else 21: S2.push(x); with the given threshold value, we can decide if the 22: S1.push(v); node should beoutput asaqualified answer,skipped 23: while S1 6=φ do asanunqualifiedresult,orcachedasapotentialresult 24: a new node v← S1.pop(); 25: UB(q,v) and LB(q,v) ← ComputeBound(v, candidate. For example, if the current node’s lower {PILi(v)}); bound is larger than or equal to the threshold value, 26: processthe node v usingthe same codesinLine 10- then the node can be output directly without further Line 21; computation. This is because all its descendants have 27: return R; been checked according to the bottom-up strategy. If its upper bound is lower than the threshold value, then the node can be filtered out. Otherwise, it will next smallest node to be processed. We compare it be temporarily cached for further checking. Based on with the node x in stack S . If v is the descendant differentcases,differentoperationswouldbeapplied. 1 node of x, then v will be pushed into S and get the Only the nodes identified as potential result candi- 1 next smallest node from L. Otherwise, we pop out dates need to be computed. Compared with Baseline x from S and check if it is a qualified quasi-SLCA Algorithm, PI-basedalgorithmcanbeacceleratedsig- 1 answer. In Baseline Algorithm, it will compute the nificantlybecauseBaselineAlgorithm hastocompute keyworddistributionsofxandcombineitsremaining the keyword distributions for all nodes. The detailed distributions and the distribution of its parent based procedure has been shown in Algorithm 2. on promotion operations. Different from Baseline Al- gorithm, PI-based Algorithm will quickly compute 5.2.1 DetailedProcedureofPI-basedAlgorithm the upper bound UB(q,x) and lower bound LB(q,x) In Algorithm 2, Line 1-Line 4 show that the proce- using PIL, which is used to differentiate the nodes dures of initiating PI-based Algorithm. We first load as qualified nodes - output, unqualified nodes - filter thekeywordnodelistsLfromKIindexandprobabil- anduncertainnodes-tobefurtherchecked.Bydoing ity node lists PIL from PI index. And then we take this, only a few nodes need to be computed. Since the smallest node v from L to initiate a stack S1 that bound computation is far faster than computation of is set using the deweycodesof v. Another stackS2 is keyword distribution, lots of run time cost can be used to maintain the temporary filtered nodes. After saved in PI-based Algorithm. Line 10-Line 21 show that, the PI-based Algorithm is ready to start. the detailed procedures. If the lower bound LB(q,x) Next,weneedtocheckeachnodeinLindocument is larger than or equal to the given threshold value order. Different from Baseline Algorithm, we only σ, then x can be output as a qualified quasi-SLCA compute the keyword distribution probabilities for a answer without computation. At this moment, the few nodes that are first identified using the lower lower bound LB(q,x) can be taken as the temporary bound and upper bound in PIL. Consider v be the probability of x being a quasi-SLCA result because 10 the exact probability of x is delayed until we need to Generally, the probability density function of a calculateitsexactprobabilityvalue.Subsequently,the Gaussian distribution N(µ,σ2) of mean µ and vari- temporaryprobabilityvalueLB(q,x)andtheprobabil- ance σ2 is: ities of x′ descendantquasi-SLCAresults canbe used 1 toupdatethelower/upperboundsoftheancestorsof f(x)= e−(x−µ)2/(2σ2) (12) x in stack S based on Equation 11 and Equation 10, √2πσ2 1 respectively. If the lower bound LB(q,x) is lower than Addressing Challenge 1: The density function has σ while the upper bound UB(q,x) is larger than or a shape of a bell centered in the mean value µ equal to σ, then we need to compute the keyword with varianceσ2. Basedon the definition of Gaussian distributions of x using the cached descendant nodes distribution, the Gaussian distribution is often used in S2. Based on the computed probability Prob(x) of todescribe,atleastapproximately,measurementsthat x,itcanbedecidedtobeoutputasaqualifiedanswer tends to cluster around the mean. Therefore,consider or filtered as an unqualifed candidate. If the upper the mean µ be the partialcomputed probabilityvalue bound UB(q,x)is lower than σ, then xwill be pushed of v be aquasi-SLCAnode,which guaranteesthe real into S2 for the possible computation of its ancestors. probability value will not be significantly different There are two main functions in PI-based Algo- from the probability base that has already been cal- rithm. The first one is ComputeProbDist(v, S ) for culated based on promising descendant nodes. The 2 computing the probability of full keyword distribu- valueofthevarianceσ2 canbechosenfromtherange tion of v using the descendant nodes in S . The [1-#computed descendant nodes/#total descendant 2 second is UpdateBound( v v v S , LB(q,v) or nodes, 1] based on the visited/unvisited descendant a a 1 { ≺ | ∈ } Prob(q,v)) for updating the bounds of the nodes to be nodes in S . This is because the more the descen- 2 processed. dant nodes are actually computed, the higher the percentage of the values would be drawn within one standarddeviationσ awayfromthemean.Extremely, 5.2.2 FunctionComputeProbDist() if all descendant nodes are computed actually, 100% The function ComputeProbDist(v, S ) can be imple- 2 ofvaluescanbedrawnwithinonestardarddeviation. mentedintwoways,ExactComputationorApproximate Therefore, we select and compute a few descendant Computation. nodes of v from S , which can contribute relatively 2 Exact Computation is to actually calculate the prob- higher probabilities to make v a quasi-SLCA node. ability of v being a quasi-SLCA node by scanning all In this work, we use heuristic method to select a the nodes in the stack S2 that maintains the descen- few descendant nodes with the higher probabilities dant nodes of v. The processing strategy is similar to of single keywords in the descendant nodes of v. Baseline Algorithm in Section 5.1. In other words, it And then, we take the partiallycomputed probability needstovisitthenodesinS2onebyoneandcompute as the base of the probability density function of a thelocalkeyworddistributionof eachnode,andthen Gaussian distribution. promotes the intermediate results to its parent. After Consider v be anode tobe evaluatedand UB(q,v)2 allnodesinS2 areprocessed,the probabilityof v will be its current upper bound value. We have, be obtained because it aggregatesall the probabilities from its descendant nodes. UB(q,v) PrG,Gaussian(q,v)= f(x)dx (13) Approximate Computation is to approximately calcu- quasi−slca Z 0 late the probability of v being a quasi-SLCA node Aftersubstituting Equation12intoEquation13,we based on a partial set of nodes in the stack S that 2 get, maintainsthedescendantnodesofv.Theapproximate computation can be made according to different dis- tproiblyuntioomniatylsp,eps,oeis.sgo.,nudniisfotrrimbudtiiosntrsi,beuttci.onIns,tphiiescweworiske, PrqGu,aGsai−ussslicaan(q,v)=Z0UB(q,v) √21πσ2e−(x2−σµ2)2dx weconsidernormalorGaussiandistributionsinmore (14) detail. Whereµisthepartiallycomputedprobability,σ2isset as 1-#computed descendant nodes/#total descendant AsweknowGaussiandistributionisconsideredthe most prominent probability distribution in statistics. nodes. Addressing Challenge 2: To embody all the However, the PDF of Gaussian distribution cannot be appliedtoPrTKQoverprobabilisticXMLdatadirectly keyword variables in the PDF, we introduce the due to two main challenges. The first challenge is to joint/conditional Gaussian distribution based on the workin[26].AssumeaPrTKQcontainstwokeywords simulate the continuous distributions using discrete distributions based on the real conditions in order to kx and ky. We have the conditional PDF as follows. reduce the approximate errors as much as possible, 2.Note that UB(q,v) has been updated if v has descendant and the second is to embody the multiple keyword nodes that are qualified answers, i.e., it minus the probability variables in the PDF. contributionsofthequalifiedanswers.