Table Of Content

Aggregation in Probabilistic Databases via Knowledge Compilation Robert Fink and Larisa Han and Dan Olteanu DeptartmentofComputerScience,UniversityofOxford WolfsonBuilding,ParksRoad,OX13QDOxford,UK robert.fink,dan.olteanu @cs.ox.ac.uk, [email protected] { } ABSTRACT source of computational complexity: for example, already 2 decidingwhetherthereisapossibleworldinwhichtheSUM This paper presents a query evaluation technique for posi- 1 ofvaluesofanattributeequalsagivenconstantisNP-hard. tiverelationalalgebraquerieswithaggregatesonarepresen- 0 Existingapproachestoaggregatesinprobabilisticdatabases tation system for probabilistic data based on the algebraic 2 haveconsideredrestrictedinstancesoftheproblem: theyfo- structuresofsemiringandsemimodule. Thecoreofoureval- cus on aggregates over one probabilistic table of restricted n uation technique is a procedure that compiles semimodule expressiveness [4, 20, 16], or rely on expected values and a andsemiringexpressionsintoso-calleddecompositiontrees, J for which the computation of the probability distribution Monte-Carlo sampling [10, 12, 22]. Expected values can lead to unintuitive query answers, for instance when data 1 canbedoneintimelinearintheproductofthesizesofthe valuesandtheirprobabilitiesfollowskewedandnon-aligned 3 probability distributions represented by its nodes. We give distributions [19]. Abiteboul et al. investigate XML queries syntactic characterisations of tractable queries with aggre- with aggregates on probabilistic data [1]. An algebra pro- ] gatesbyexploitingtheconnectionbetweenquerytractabil- B ity and polynomial-time decomposition trees. posed by Koch represents annotations and data values as rings which enables efficient incremental view maintenance D Aprototypeofthetechniqueisincorporatedintheprob- in the presence of aggregations [13]. abilistic database engine SPROUT. We report on perfor- . Our approach considers the problem of exact probabil- s manceexperimentswithcustomdatasetsandTPC-Hdata. c ity computation for positive relational algebra queries with [ aggregates on pvc-tables. The core of our technique is a 1. INTRODUCTION procedurethatcompilesarbitrarysemimoduleandsemiring 1 This paper considers the evaluation problem for queries expressions over random variables into so-called decompo- v with aggregates on probabilistic databases. sition trees, for which the computation of the probability 9 The utility of aggregation has been argued for at length. distribution can be done in polynomial time in the size of 6 Inparticular,aggregatesarecrucialforOLAPanddecision the tree and of the distributions at its nodes. Decompo- 5 supportsystems. All22TPC-Hqueriesinvolveaggregation. sition trees are a knowledge compilation technique [5] that 6 Probabilistic databases are useful to represent and query reflects structural decompositions of expressions into inde- . 1 impreciseanduncertaindata,suchasdataacquiredthrough pendent and mutually exclusive sub-expressions. Flavours 0 measurements,integratedfrommultiplesources,orproduced of decomposition trees have been proposed as compilation 2 byinformationextraction[21]. Inthispaper,weusearep- targetforpropositionalformulasthatariseintheevaluation 1 resentation system for probabilistic data called pvc-tables. ofrelationalalgebraqueries(withoutaggregates)onproba- v: Itisbasedonthealgebraicstructuresofsemiringandsemi- bilistic c-tables [18]. It has been shown that more complex i moduletosupportamixedrepresentationofaggregatedval- tasks, such as conditioning probabilistic databases on given X uesandtupleannotationsfordifferentclassesofannotations constraints [14] and sensitivity analysis and explanation of r and aggregations [2]. The pvc-tables can represent any fi- query results [11], can benefit from decomposition trees. a nite probability distribution over relational databases. In addition,theresultsofquerieswithaggregatescanberepre- Example 1. Figure1showssixpvc-tables,amongstthem sentedaspvc-tablesofpolynomialsize. Thiscontrastswith the suppliers table S, the products tables P and P , and 1 2 main-stream representation systems such as pc-tables [21], the table PS pairing suppliers and products. They all have which can require an exponential-size overhead [15]. an annotation column Φ to hold expressions in a semiring The problem of query evaluation is #P-hard already for K generatedbyasetofindependentrandomvariables,with simple conjunctive queries [21]. Aggregates are a further operations sum (+) and product (), and neutral elements · 0 and 1 . Each valuation of the random variables into a k K semiring (e.g. integers orBooleans)canonically maps semi- Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare ring expressions into that semiring by interpreting + and · notmadeordistributedforprofitorcommercialadvantageandthatcopies asthecorrespondingoperationsofthatsemiring. Eachsuch bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to valuation defines a possible world of the database. republish,topostonserversortoredistributetolists,requirespriorspecific Figure 1d shows the result of the query Q that asks for 1 permissionand/orafee. Articlesfromthisvolumewereinvitedtopresent pricesofproductsavailableinshops. Theannotationsofthe theirresultsatThe38thInternationalConferenceonVeryLargeDataBases, resulttuplesareconstructedasfollows: Theannotationofa August27th-31st2012,Istanbul,Turkey. joinoftwotuplesistheproductoftheirannotations,andthe ProceedingsoftheVLDBEndowment,Vol.5,No.5 Copyright2012VLDBEndowment2150-8097/12/01...$10.00. annotation obtained from projection or union is the sum of 490 S PS P1 Q1=πshop,price[S1PS1(P1∪P2)] Q2=πshopσP≤50$shop;P←max(price)[Q1] sid shop Φ sidpidprice Φ pidweight Φ shop price Φ Shop Φ 12345 MMMGG&&&aappSSSxxxxx12345 112233445 121234131 151614161001050500 yyyyyyyyy112233445121234131 p12341idwPei764852ghtzzzzzE12345 MMMMMMGGG&&&&&&aaapppSSSSSS 614151611050010005 xxxxxxxxx452334121yyyyyyyyy452334121312341211z(zzz(z((zzzz323421111++++zzzz5555)))) MG&apS [[xxxxxxxxx141223345yyyyyyyyy141223345112213431((zz(zzz(zzzz223431111⊗⊗⊗⊗⊗++++56616zzzz000505555++++≤))))⊗⊗⊗⊗mmmm5aaaa01111xxxx]5010·+++≤Ψmmm15aaa0xxx]·Ψ2 (a) (b) (c) (d) (e) Ψ1=[x1y11(z1+z5)+x1y12z2+x2y21(z1+z5)+x2y22z2+x3y33z3+x3y34z46=0K] Ψ2=[x4y41(z1+z5)+x4y43z3+x5y51(z1+z5)6=0K] Figure 1: A database containing relations (a) Supplier S, (c) Products P , P , and (b) the many-to-many 1 2 relation ProductSupplied PS, and the results of the positive query Q and of the aggregate query Q . 1 2 theannotationsoftheparticipatingtuples[7]. Forinstance, Wedeviseatechniqueforcomputingtheexactproba- • thetuple M&S,10 hastheannotationx y (z +z ),whose bility distribution of query results based on a generic 1 11 1 5 h i probability distribution can be computed as a function of compilation procedure of arbitrary semimodule and probabilitydistributionsoftherandomvariablesx ,y ,z , semiringexpressionsintoso-calleddecompositiontrees, 1 11 1 and z [21]. forwhichthecomputationoftheprobabilitydistribu- 5 ConsiderthequeryQ fromFigure1ethatasksforshops tion can be done in time linear in the product of the 2 in which the maximal price for the products in P or P is sizes of the distributions represented by its nodes. 1 2 lessthan50. Aggregationisexpressedusingthe$operator, Wegiveasyntacticcharacterisationofaclassofaggre- which in this query groups by the column shop and applies • gate queries that are tractable on tuple-independent the aggregation MAX on price within each group. databases. Our query tractability result follows from The annotations of result tuples are built using semi- theobservationthatthesemiringandsemimoduleex- module expressions of the form Ψ v, where Ψ is a semi- ⊗ pressions in the result of our tractable queries admit ring expression and v is a data value. Such expressions polynomial size decomposition trees and polynomial can be “summed up” with respect to aggregation opera- size probability distributions at their nodes. tions: For MIN, the sum α+ β is min(α,β); for MAX, min α+ β =max(α,β);forSUM,α+ β =α+β. Thesums A prototype of our technique is incorporated into the max sum correspond to operations in commutative monoids. The an- • probabilistic database engine SPROUT. notation Φ of M&S in Q ’s result is constructed as follows. 2 Extensiveperformanceexperimentsusingourownsyn- This tuple represents a group of six tuples in the result of • thetic datasets and TPC-H data are discussed. Q , all with the M&S shop value. The annotation Φ then 1 expresses the conditions (1) that the sum of the price val- Besides exact computation, decomposition trees also al- ues of these six tuples in the MAX monoid is less than 50, low for approximate probability computation [18]. Due to and (2) that the group is not empty (as expressed by Ψ1). lack of space, we refer the reader to the MSc thesis of the DependingonthevaluationofthevariablesinΦ,thesecon- second author [9]. The pvc-tables can be extended to cope ditions can be true ( ) or false ( ), or, more generally, the withcontinuousprobabilitydistributions,similartotheex- > ⊥ additive or multiplicative neutral element of the semiring. tensions of pc-tables in the PIP system [12]. Forinstance,avaluationν thatmapsx ,x ,y ,y ,z , 1 1 2 11 21 1 z , z to and all other variables to satisfies Φ, since 2 5 > ⊥ 2. PRELIMINARIES ν1(Φ)≡[>⊗10 +max ⊥⊗50 +max >⊗11+max 2.1 InducedDiscreteProbabilitySpace 60 + 60 + 15 50] [1⊥0⊗+max m1a1x ⊥50⊗] [mamx(a1x0,⊥11⊗) 5≤0] ·>. 2 LetS beacountablesetandXbeafinitesetofS-valued ≡ ≤ ≡ ≤ ≡> independentrandomvariables. WedenotebyP thediscrete x probability distribution of a variable x X, and by P [s] If the variables in such expressions are random variables, x ∈ the probability that x takes value s S; we often specify then the expressions themselves can be interpreted as ran- ∈ P by the set of pairs of unique values with their non-zero dom variables. Moreover, the probability distributions of x probabilities, (s,P [s]) s S and P [s]>0 . The size of the obtained expressions reflect the probabilities of query x x { | ∈ } aprobabilitydistributionisthesizeofitssetrepresentation. answerstakingparticularvaluesinarandomlydrawnworld of the database. Our technique allows to efficiently com- Definition 1. Let Ω = ν : X S be the set of map- pute probabilities defined by such expressions by structural pings from X into S. A pro{bability→mas}s function decomposition. For example, an expression α = ab 10+ xy 20 can be decomposed in independent sub-expr⊗essions Pr(ν)= Px[ν(x)] ⊗ ab⊗10andxy⊗20thatdonotsharevariablesymbolsand xY∈X hence constitute independent random variables. for every sample ν Ω, and a probability measure Thestructureofthepaperfollowsthelistofcontributions: ∈ Pr(E)= Pr(ν) for all E Ω We present an evaluation framework for queries with ⊆ • ν E aggregates (SUM, PROD, COUNT, MIN, MAX) on X∈ pvc-tables, a representation system for probabilistic define a probability space (Ω,2Ω,Pr) that we call the proba- data based on semirings and semimodules. bility space induced by X. 491 The probability distribution of the sum of two indepen- Definition 3. A commutative semiring is a set S to- dentrandomvariablesistheconvolutionoftheirindividual getherwithoperations+, :S S S andneutralelements · × → distributions [8]. For instance, given two random variables 0,1 S such that (S,+,0) and (S, ,1) are commutative ∈ · x,y over positive integers, the probability that the sum of monoids and the following holds for all s ,s ,s S: 1 2 3 ∈ therandomvariablesequalsto4isthesumoftheprobabil- s (s +s )=(s s )+(s s ) itiesofxbeing0andy being4,ofxbeing1andy being3, 2· 2 3 1· 2 1· 3 andsoon. Theapplicabilityofconvolutiontodeterminethe (s1+s2) s3 =(s1 s3)+(s2 s3) · · · probabilitydistributionofafunctionofindependentrandom 0 s =s 0=0. 1 1 variables is not limited to sums of integers: · · Commutativesemiringsarethecanonicalalgebraicstruc- Proposition 1. GivensetsA,B,C,anA-valuedrandom turefortupleannotations[7]. AnnotationsfromtheBoolean variable x with probability distribution P , a B-valued ran- x semiringyieldsetsemantics,annotationsfromNcorrespond dom variable y with probability distribution P , x indepen- y to bag semantics, and annotations from the security semi- dent of y, and an operation :A B C, the probability • × → ringcanbeusedtoconstrainaccesstoqueryresultsdepend- distribution P of the C-valued random variable z=x y is the convolutxi•oyn of P and P with respect to : • ing on access rights to database tuples that contributed to x y • the result [2]. The most general semirings are those gener- P [c]= P [a]P [b] for all c C (1) ated over a set of variables. Intuitively, the carrier of such x y x y • ∈ semirings are syntactic expressions built from the variables (a,bc)=X∈aA×bB: andthemultiplicationandsumsymbols,whereelementsare • Example 2. TheformulafortheprobabilityP ofthe identified via the semiring laws. For example, given a set Φ Ψ disjunctionΦ ΨoftwoindependentBooleanrand∨omvari- X = x1,x2,x3 of variables, the elements of the semiring ables is a spec∨ial case of Eq. (1): PosBo{ol(X) are}positive Boolean expressions, e.g. x1 +x2 or x (x +x ). By the distributivity law in semirings, the 1 2 3 PΦ Ψ[ ]= PΦ[a]PΨ[b] expressionsx1(x2+x3)andx1x2+x1x3 areequal. Amore ∨ > general freely generated semiring is the ring of polynomials (a,bX=)∈aB×bB: (also called free commutative algebra) over a set X of vari- > ∨ =PΦ[ ]PΨ[ ]+PΦ[ ]PΨ[ ]+PΦ[ ]PΨ[ ] ables. Elements of generated semirings are called semiring > > ⊥ > > ⊥ =1 (1 P [ ])(1 P [ ]) 2 expressions which we denote by K throughout this paper. Φ Ψ − − > − > Remark 1. ThesuminEq.(1)isinvariantunderthere- Definition 4. Let (S,+S,0S, S,1S) be a commutative · strictiontothoseaandbforwhichP [a]>0andP [b]>0. semiring. An S-semimodule M consists of a commutative x y Hence,evenifthecardinalityofA B isinfinite,theconvo- monoid (M,+M,0M) and a binary operation :S M × ⊗ × → lution is a finite sum whenever only finitely many elements M such that for all s1,s2 S and m1,m2 M we have ∈ ∈ in A and B have non-zero probability. 2 s (m + m )=s m + s m 1 1 M 2 1 1 M 1 2 ⊗ ⊗ ⊗ 2.2 Monoids,SemiringsandSemimodule (s + s ) m =s m + s m 1 S 2 1 1 1 M 2 1 ⊗ ⊗ ⊗ Our representation system for probabilistic data is based (s s ) m =s (s m ) 1 S 2 1 1 2 1 on the notions of monoid, semiring, and semimodule. · ⊗ ⊗ ⊗ s 0 =0 m =0 1 M S 1 M ⊗ ⊗ Definition 2. A monoid is a set M with an operation 1 m =m . S 1 1 ⊗ +:M M M and a neutral element 0 M that satisfy the foll×owing→axioms for all m ,m ,m M∈: WewriteS M todenoteaS-semimoduleM,andwrite 1 2 3 ∈ for and +⊗for + ,+ whenever the meaning is unam-· S S M (m +m )+m =m +(m +m ) · 1 2 3 1 2 3 biguous. Semimodules combine monoids with semirings to 0+m1 =m1+0=m1 represent aggregation values conditioned on the value of a semiring expression. Analogous to the case of semiring ex- A monoid is commutative if m +m =m +m . 1 2 2 1 pressions that correspond to freely generated semirings, we Monoidsnaturallydescribemanyaggregationoperations. denote by semimodule expressions the elements of the K- Aggregation over a column of a relation fixes a domain of semimodulegeneratedbyagivenmonoid. Thesemimodules values,usuallyRorN,andabinaryoperation. Forinstance, weusefrequentlyareN⊗NandB⊗NforMIN,MAX,SUM, the MIN of a column with values v1,··· ,vn ∈R is annodthPaRveOtDheminotnuoiitdivs.eAsemseamntimicso;dtuhliesBre⊗fleNctosvtehreSwUeMll-kwnoouwldn min(v1, ,vn)=min(v1,min(v2, min(vn 1,vn) )). incompatibility of SUM aggregation with set semantics. ··· ··· − ··· The binary operation is commutative and associative, i.e., 2.3 QueryLanguage the value of the aggregation is invariant under the order in The query language under consideration is a restriction which it is computed. Furthermore, each aggregation oper- ofpositiverelationalalgebraextendedbyanoperator$ for ation has a neutral element, i.e., a value that does not con- aggregation and grouping. Given a relation R over schema tributetotheaggregation. Forexample,0 Ristheneutral ∈ Σ, the operator $ in the query element for SUM, and is the neutral element for MIN. ∞ Tpahretsiceualgagr:reSgUatMion=s(cNo,r+re,s0p)o,nMdItNo=co(mNm±u∞t,amtivine,m+on)o,idMs,AiXn $A¯;α1←AGG1(B1),...,αl←AGGl(Bl)(R), =(N±∞,max, ). COUNTisaspecialcaseofSU∞M.More where (A¯ B1,...,Bl ) Σ, groups by the attributes complicated ag−g∞regations (e.g., AVG) can conceptually be A¯ and take∪s{the aggrega}tio⊆ns AGG to AGG over the at- 1 l composed from simpler ones (e.g., SUM and COUNT), but tributes B to B respectively. The result has schema A¯ 1 l ∪ their treatment is out of the scope of this paper. α ,...,α , where α to α are aggregation attributes. We 1 l 1 l { } 492 SB SN,1 SN,2 Φ::= x Φ+Φ Φ Φ [αθα] [ΦθΦ] s sid shop Φ sid shop Φ sid shop Φ · α::= Φ⊗(cid:12)(cid:12) m {+o(cid:12)(cid:12)p Φ⊗m(cid:12)(cid:12)} m (cid:12)(cid:12) (cid:12)(cid:12) 12 MM&&SS⊥ 12 MM&&SS 21 12 MM&&SS 01 op::= min max count (cid:12)sum prod 3 M&S> 3 M&S 1 3 M&S 3 (cid:12) 4 Gap ⊥ 4 Gap 7 4 Gap 7 θ::= = (cid:12)= (cid:12) (cid:12)< >(cid:12) 5 Gap ⊥ 5 Gap 2 5 Gap 2 (cid:12)6 ≤(cid:12) ≥ (cid:12) (cid:12) > x::= A va(cid:12)riable(cid:12) symb(cid:12)ol x(cid:12) X (cid:12) (a) (b) (cid:12) (cid:12) (cid:12) ∈(cid:12) (cid:12) m::= A value from an aggregation monoid M Figure 3: Three possible worlds of pvc-table S in s::= A value from the semiring S Figure 1 under two different semirings: S is anno- B tated with expressions from the Boolean semiring, Figure 2: Grammar for semiring expressions Φ K and S and S with integers. ∈ N,1 N,2 and semimodule expressions α (K S) M. ∈ ∪ ⊗ The semantics of D is given by its possible worlds: consider SUM, PROD, COUNT, MAX, and MIN aggrega- ν(t) t T ,..., ν(t) t T ν Ω . 1 n tions, and restrict the queries such that projection, union { | ∈ } { | ∈ } ∈ and grouping are never applied to aggregation attributes. wherenth(cid:8)e mapping ν is applied to all expre(cid:9)ss(cid:12)(cid:12)ions inotuples t This restriction simplifies query rewriting, but is not requi- and is identity for constants. Each mapping(cid:12)ν Ω defines siteforourqueryevaluationmethod. Outofthe22TPC-H a possible world. ∈ queries, only query Q13 violates the restriction. Columnsthatholdsemimoduleexpressionsarecalledag- Definition 5. The query language considered in this gregation columns. Semimodule expressions over different Q paper consists of queries that are built using the relational monoidscanco-existinapvc-table. Theannotationcolumn operatorsδ,σ,π, , ,$andsatisfythefollowingconstraints: Φhostssemiringexpressionsthataregeneratedbythevari- × ∪ able set X, and conditional expressions representing com- 1. IanttrπiAb¯u(tQe)sainndA¯$aAr¯e;αn1←otAaGgGg1r(eBg1a)t,i..o.,nαla←ttArGibGult(Besl.)(Q), the parisons of expressions and constants. The other columns of a pvc-table host constants and semimodule expressions. 2. InQ Q , theattributesofQ andQ arenotaggre- A grammar for such expressions is given in Figure 2. All 1 2 1 2 gation∪attributes. thesetypesofexpressionscanoccurinpvc-tablesrepresent- ingtheresultofaggregatequeries;weshowhowtocompute these expressions for any -query in Section 4. Example 3. The TPC-H query Q1 has the structure Q SELECT A,SUM(B) FROM R GROUP BY A which is equivalent Example 4. Figure 1 shows six pvc-tables. The pvc- to $A;β SUM(B)(R). TPC-H query Q2 has the structure table S has annotations from a semiring over the set X of SELECT ←A FROM R WHERE B = (SELECT MIN(C) FROM S),or, variables x ,...,x . Figure 3 shows a few possible worlds 1 5 eseqcuToihnveadleqununteliryoy,nπRtAeσ∪rmB$=γiAs(cid:0);Rβa←×rSeU$lMa∅t(;iBγo←)n(MSw)INiti(hsC)nt(hoSte)(cid:1)ia.nggQre,gsaitniocentahte- fusoiabrtliteohnwissoporlvfdct-hSteabvialneriFfaoibrgluedsriffeXe3raiennhttoasstehpmerioBrbinoagobslie.laitFnyirsPsetm,ic[roinn]gsiPdBe.rP[voa]sl-- tπrAib(uRt)e∪βπaAnσdβh≥e5nc$eAv;iβo←laStUesMc(Bon)(sStr)ainist2aivnalDidefQ.5-.quHeorwy.eve2r, P⊥x,3>[⊥, ]th·Pexn4u[⊥mB]b·ePrxo5f[>p]o.sSsiibnlceewBorhladssoinsl2y5t.hexS1etcw⊥oon·delleyxm,2ce>onnts-· sider valuations of X into N. Figure 3b shows two possible (cid:0) (cid:1) 3. PVC-TABLES:AREPRESENTATION worlds, with respective probabilities P [2] P [1] P [1] x1 · x2 · x3 · P [7] P [2] and P [0] P [1] P [3] P [7] P [2]. 2 SYSTEMFORPROBABILISTICDATA x4 · x5 x1 · x2 · x3 · x4 · x5 Inthissectionweintroduceasuccinctandcompleterepre- Thepvc-tablesPS,P ,P ,andQ inFigure1containno 1 2 1 sentationsystemforprobabilisticdata,whichwecallproba- semimoduleexpressionsandareinfactpc-tables[21],asim- bilisticvalue-conditionedtables orpvc-tablesforshort. This plerrepresentationsystemwheretuplevaluesareconstants system is based on work in databases with provenance in- and annotations are expressions from semirings generated formation [2, 7, 3]. The reason for using pvc-tables, as op- by a set of random variables. In pvc-tables, however, semi- posed to main-stream representation systems such as pc- module expressions can appear as tuple values. tables (and special cases such as tuple-independent or BID tables) [21], is that pvc-tables fit naturally with aggregate Example 5. ConsideranaggregationAGGontheweight queries: answers to aggregate queries on pvc-tables or even column of relation P1 in Figure 1. The semimodule expres- on pc-tables are representable as pvc-tables of size polyno- sion representing the aggregated value is α=z1 4 +AGG ⊗ mial in the size of the input tables, while they may require z2 8 +AGG z3 7 +AGG z4 6,wherethe+AGG operator ⊗ ⊗ ⊗ anexponentialoverheadwhenrepresentedaspc-tables[15]. depends on the particular aggregation monoid AGG. 2 Aggregationonpvc-tablesishandledusingamixedrepre- sentation of tuple annotations and aggregated values using We next look at the semantics of expressions under valu- the algebraic structures of semirings and semimodules. ations of variables; this will be extended to a probabilistic interpretation in Section 5. Definition 6. A pvc-table T over a probability space Ω Semiring,Monoid,andSemimoduleHomomorphism. induced by a set X of variables is a relation with an anno- LetKbethesemiringgeneratedbyX;thevariablesinXare tation column Φ holding semiring expressions over X, and themselveselementsofK,i.e.X K. Givenanothersemi- ⊆ where the tuple values can be constants or semimodule ex- ringS,amappingν :X Softhevariablescanbeuniquely → pressions over X. A pvc-database D= T ,...,T is a set extended to a semiring homomorphism ν : K S that 1 n { } → of pvc-tables over the same probability space Ω. “evaluates” semiring expressions from K to elements in S. 493 Database Semantics S Probability Distributions each variable x X there is exactly one n N for which ∈ ∈ Deterministic Set B Px[ ]=1 or Px[ ]=1 Px[n] = 1. For probabilistic databases with set semantics, Deterministic Bag N n>N: Px[n]=⊥1 every answer tuple has a probability for being true or false, Probabilistic Set B ∃Px[∈],Px[ ] [0,1] while in the case of bag semantics, the pvc-table encodes a Probabilistic Bag N n>N: P⊥x[n∈] [0,1] probability distribution over the multiplicity of its tuples. ∀ ∈ ∈ Existingrepresentationsystemsforprobabilisticdataare Table 1: Database semantics for different semirings overtheBooleansemiring,i.e.,theirsemanticsisset-based. S and probability distributions. While this choice is justified in the absence of aggregation (or for MIN/MAX), it is not for COUNT/SUM/PROD ag- Asimilarconstructionholdsforsemimodule: Forsemimod- gregates, which require bag semantics. In the latter case, ule expressions W and a monoid M, a mapping ν :X S → weneedasemiringforwhichthereisahomomorphisminto is extended to a monoid homomorphism ν :W M. → N [2]. The set-based probabilistic database model is subsumed by the probabilistic bag-based model: Every x X Example 6. Consider the semimodule expression ∈ is an N-valued random variable, yet the probability distri- α=xy⊗5 +min (x+z)⊗10 butions Px are non-zero only for 0,1∈N. over the semiring generated by X = x,y,z and the monoid (N,+min, ). The map ν : X {N eva}luates variable 4. QUERY EVALUATION I: COMPUTING ∞ → symbols to the semiring of positive integers and induces a THETUPLESINTHEQUERYRESULT monoid homomorphism that maps α to positive integers. Our approach to query evaluation of -queries on pvc- For instance, the mapping ν :x 2,y 3,z 0 yields Q 7→ 7→ 7→ tables has two logical steps: In the first step we compute ν(α)=ν(x)ν(y) 5 + (ν(x)+ν(z)) 10 the tuples in the query result, and in the second step we min ⊗ ⊗ computetheirprobabilitydistributions. Wediscussthefirst =6 5 + 2 10=1 5 + ... + 1 5+ min min min min ⊗ ⊗ ⊗ ⊗ step in this section and the second step in the next section. 1 10 + ... + 1 10=5 + 10=5. ⊗ min min ⊗ min ThetuplesintheresultofaqueryQcanhavesemimodule Similarly, the value of α in Example 5 depends on the par- expressions as values and semiring expressions as annota- ticulartargetsemiringS andthevaluationofthevariables. tions. The construction of such expressions can be done by For instance, α 24 for SUM aggregation and semiring a query that can be statically inferred from Q [2]. Figure 4 Naggwrietghatzi1o,nz2an7→d7→2thaenBdozo3l,eza4n7→sem0i.rinAglsow,itαh 7→z 6 for ManINd gcoivmespuatetrtahnestlautpiolensin· thoef qQueqruyerreiessulitnst(owSeQchLosqeuSeQrieLshtehraet, 1 za2n,szw3e,rz4is7→0 >,.i.Ief.a0llinvacraiasebloefsSaUreMmaanpdpe+d tof0oSr,M7→thINe⊥.quer2y bcouutntthsefotrrajnoisnlattaionndtaJalKtregrentactaivneaulssoebofedQat)a.:TJhoeinrtewusreitoinfgdaatca- M ∞ (as in join) corresponds to multiplication in the annotation Conditional Expressions have the form [αθβ], where α semiring and alternative use of data (as in union or projec- and β are semimodule expressions for (possibly different) tion)correspondstosummationintheannotationsemiring. aggregationmonoidsandsemiringsgeneratedbyX,andθis Semimoduleexpressionsareconstructedfortheaggregation a binary relation. Such conditional expressions can appear and grouping operator $. The translation of the opera- as tuple annotations in pvc-tables. Given a valuation ν : torsrename,selection,projection,product,andunionisthe X S and(semiringorsemimodule)expressionsΦ,Ψ,the same as for pc-tables [21]. In case of $, the translation → semantics of [ΦθΨ] is defined by extending ν via depends on the presence of group-by attributes. The translation uses the custom SQL operators , , and ν([ΦθΨ])=(10SS,, iofthνe(Φrw)iθseν(Ψ) (2) ⊗ to construct annotation expressions. PAGG PK ·K Example 8. The query $ (P ) from Ex- ;α AGG(weight) 1 Comparisons operators , for θ only make sense for ample 5 is rewritten into sel∅ec←t R.Φ R.weight orderedcarriers. Intheligh≤to≥fEq.(2),wecanequivalently as α, 1 as Φ from (select * frAoGGm P ) ⊗R.Theanswer K 1 see[ΦθΨ]asabinaryoperationS S S orM M S. totherewrittenqueryisonetupleoPfvalu(cid:0)eα=z 4 + (cid:1) × → × → 1⊗ AGG z 8 + z 7 + z 6. The annotation of this Example 7. Theannotationsinthepvc-tableQ2 inFig- tu2p⊗le is tAhGeGcon3st⊗ant 1AG,Gstat4in⊗g that the tuple α is the ure 1 contain conditional expressions representing compar- K h i query answer in all possible worlds. The evaluation of α isons between semimodule expressions and a constant from may however yield different outcomes in different worlds. M and semiring expressions and the constant 0 . 2 K The query π σ $ (P ) is rewritten to 5 α ;α MIN(weight) 1 ∅ ≤ ∅ ← Set vs. Bag Semantics. Thepvc-tablessystemisgeneric select S.Φ [5 S.α] as Φ from (select R.Φ · ≤ (cid:0) (cid:1) MIN ⊗ enoughtoencodebothdeterministicandprobabilisticdata- R.weight as α, 1 as Φ from (select * from P ) R) K P (cid:0) 1 bases with set and bag semantics. Table 1 summarises how S. This Boolean query asks for the probability of the min- differentchoicesforthesemiringS andprobabilitydistribu- imum we(cid:1)ight of articles being larger than 5. The result tions for variables x X give rise to those semantics. of this query is a single empty tuple with annotation Φ = ∈ Under the Boolean semiring, annotation expressions are 1 [z 4 + z 8 + z 7 + z 6 5]. 2 K 1 min 2 min 3 min 4 evaluated to or , denoting tuples in or not in the data- · ⊗ ⊗ ⊗ ⊗ ≤ ⊥ > base instance, respectively. If in addition every variable As seen in the above example, for aggregation without x X has either P [ ]=1 or P [ ]=1, then we have the grouping the resulting semimodule value is annotated with x x ∈ ⊥ > caseofdeterministicdatabaseswithsetsemantics,i.e.there 1 ,i.e.,it“exists”ineverypossibleworld. Incaseofgroup- K is exactly one possible world with non-zero probability. ing, the resulting tuples are additionally annotated with a For deterministic bag semantics, tuple annotations are conditionalexpressiontoenforcethatthegroupisnotempty evaluated to N and represent tuple multiplicities and for and thus at least one tuple in the group has a annotation 494 R = selectR.*,R.ΦfromR δB A(Q) = selectR.*,R.AasB,R.ΦasΦfrom Q R ← J K JσAθB(Q)K= selectR.*,R.Φ·K [AθB]asΦfrom (cid:0)JQK(cid:1)R πA1J,...,An(Q)K= selectR.A1,...,R.An, K R.Φ a(cid:0)sJΦK(cid:1)from Q RgroupbyR.A1,...,R.An J Q1×Q2K= selectR.*,S.*,R.Φ·K SX.Φas(cid:0)Φfr(cid:1)om Q1 R(cid:0)J, KQ(cid:1)2 S Q1 Q2 = selectR.*, R.Φ asΦfrom selec(cid:0)t*fro(cid:1)m (cid:0)Q1 (cid:1)unionallselect*from Q2 RgroupbyR.* J ∪ K K J K J K $A1,...,An;α1←AGJG1(B1),...K,αl←AGGl(Bl)(QX) =(cid:0)selec(cid:1)tR.A1,...(cid:16),R.An,Γ1 asα(cid:0)J1,..K.(cid:1),Γl asαl, (cid:0)J K(cid:1)(cid:17) J K KR.Φ 6=0K asΦfrom Q RgroupbyR.A1,...,R.An $∅;α1←AGG1(B1),...,αl←AGGl(Bl)(Q) = selectΓ1 asα1,.h.(cid:16).,XΓl asαl,(cid:17)1K asΦi from Q (cid:0)JRK(cid:1) J whereΓKi=(PSAUGMGi(cid:0)RR.Φ.Φ⊗⊗1R.Bi(cid:1) iiffAAGGGGii==CMOINU,NMTA(cid:0)JX,KS(cid:1)UM,PROD P (cid:0) (cid:1) Figure 4: Recursive algorithm for rewriting a query to account for computation of semiring (K) and · Q semimodule expressions. We assume that R.*, S.* do not select column Φ. JK different from 0 in a possible world. This has been exem- Inparticular,pvc-tablescanbeexponentiallymoresuccinct K plified in the introduction for the query Q . However, this than pc-tables. The key difference is that pvc-tables allow 2 conditional expression is not always necessary: forvaluesandannotationstobeintertwinedinsemimodule expressions, which can encode exponentially many possible Example 9. Consider the query outcomes of an aggregation in polynomial space. In pc- tables,wewouldneedtoenumeratealltheseoutcomes[15]. Q02 =πshopσP 50$shop;P MIN(price)[Q1] ≤ ← similar to Q on the database in Figure 1. Under the Boo- 5. QUERYEVALUATIONII:PROBABILITY 2 lean semiring K, the variables can take values = 0 ⊥ K COMPUTATIONBYDECOMPOSITION and = 1 . In a possible world with x ,x ,x , K 1 2 3 > 7→ ⊥ This section presents a novel approach to computing prob- M&S is not an answer since there is no supplier for the h i abilitydistributionsforqueryresultsonpvc-tableswhichis shopM&SintheinputinstanceS. Itsannotationevaluates equivalent to computing probability distributions of the re- to [ 50] = = . Here, the conditional expres- ∞≤ ·⊥ ⊥·⊥ ⊥ sults’ semiring and semimodule expressions. Our approach sion Ψ from Figure 1 would not be necessary, since it is 1 isbasedoncompilingexpressionsintoatractableformcalled impliedbythefirstconditionalexpressionintheexpression decomposition trees, for which distributions can be com- Φ. In case of other monoids, such as MAX (as discussed in the introduction), SUM, or COUNT, Ψ is necessary. 2 putedefficiently. Akeypropertyofourapproachisitsgen- 1 erality,asitappliestodifferentsemiringsandsemimodules. Theconstraintsimposedonthequerylanguage byDef- Expressions as Random Variables. We first show how inition 5 simplify the rewriting : Since projecQtions and semiring and semimodule expressions — including condi- unionsareonlydoneonnon-aggre·gatecolumns,therewrit- tional expressions — in pvc-tables give rise to random vari- ingrulesforthoseoperatorscanassumetheinputtuplesto ables. Let S be a countable semiring and M a monoid, X JK be free of semimodule expressions. a set of S-valued random variables, and Ω the probability Closureofpvc-tablesunder queries. Sincepvc-tables spaceinducedbyX;K andK M arethesetsofsemiring generalisepc-tablesandthelattQerarecompleteinthesense and semimodule expressions ov⊗er X. A semiring expression thatanyfiniteprobabilitydistributionoverrelationaldata- Φ K can be seen as a random variable Φ : Ω S by ∈ → basescanberepresentedusingpc-tables[21],itfollowsthat letting Φ:ν ν(Φ) and with probability distribution 7→ pvc-tables also form a complete representation system. In P [s]=Pr ν Ω ν(Φ)=s = Pr(ν). (3) particular,theyareclosedunder queries,inthesensethat Φ { ∈ | } for any input pvc-database, theQresult of any Q query can (cid:0) (cid:1) νν(XΦ∈)Ω=:s be represented as a pvc-table. The pvc-tables representing Asemimoduleexpressionα K M isarandomvariable query results are also succinct (for a fixed query), since ∈ ⊗ Q α:Ω M with α:ν ν(α) and probability distribution thetranslation onlyconstructsSQLqueriesofsizelinear → 7→ inthesizeofthe·input queriesandtheevaluationofsuch Pα definedsimilartoEq.(3)forallm∈M. IfM isoverR, SQL queries is in polynQomial time data complexity. then α is an R-valued random variable. However, since the JK random variables X and hence the probability space Ω are countable, the set of values that α may take is countable. Theorem 1 (Completeness and Succinctness). ProbabilityDistributionsofIndependentExpressions. 1. Anyfiniteprobabilitydistributionoverrelationaldata- Twosemiringorsemimoduleexpressionsare(syntactically) base instances can be represented by pvc-databases. independent if their sets of variables are disjoint; independent expressions are independent random variables. 2. Given a pvc-database and a fixed query Q, the query result Q ( ) caDn be representeQd as a pvc-table Example 10. Let X= x,y,a,b,c , M =N, Φ=x+y with size polynomDial in the size of . andα=a(b+c) 10+c 20{. ThenΦa}ndαareindependent D randomvariable⊗ssincet⊗heirsetsofvariablesaredisjoint.2 J K 495 c←1 c c←2 x4←⊥ x4 x4←> F F ⊕ ⊕ ⊗ ⊕ 1 20 2 20 ⊗ ⊗ ⊗ ⊗ 1 10 a a (cid:12) ⊗ ⊗ ⊗ ⊗ ⊗ 1 10 1 10 x5 (cid:12) ⊕ ⊕ (cid:12) 1⊗60 ⊕ ⊗ ⊕ ⊗ b 1 b 2 y51 z1 z5 y43 z3 ⊕ ⊗ ⊗ Fthigeusreem5i:mAoddu-ltereNe foNr.αT=hea(nbo+dce)⊗10h+asco⊗n2e0cohvieldr z1 z5 y411⊗15 (cid:12) 1⊗10 for every k N wit⊗h non-zero probabcility for c k, x5 y51 i.e. two chil∈dren for c = 1,2 in the Fsetting of Ex←.12. Figure 6: D-tree for the semimodule expression The thick (blue) part is a d-tree for the semiring x y (z +z ) 15+ x y z 60+ x y (z +z ) 10 4 41 1 5 max 4 43 3 max 5 51 1 5 component a(b+c)+c, where is replaced by . (part of the⊗annotation of tu⊗ple Gap in Figure⊗1e) Theprobabilitydistributionofth⊗esumorproductof(cid:12)inde- overthesemimoduleB N. Thethhick(iblue)partisa pendentexpressionsisgivenbytheirconvolution(cf.Propo- d-treeforthesemiring⊗componentoftheexpression, sition1)withrespecttotheadditionandmultiplicationop- where is replaced by . erationofthesemiringandsemimodule. Givenindependent ⊗ (cid:12) expressions Φ,Ψ K and α,β K M with probability Decomposition Trees (d-trees) are a normal form for ∈ ∈ ⊗ distributions PΦ,PΨ,Pα,Pβ, we have for all s S,m M: semiringandsemimoduleexpressions. Wenextdefinethem ∈ ∈ andshowhowtocompilearbitraryexpressionsintod-trees. P [s]= P [s ] P [s ] (4) Φ+Ψ Φ 1 · Ψ 2 Definition 7. Let M be a monoid and K be a semiring s1,s2∈SX:s1+s2=s generated by an S-valued set X of variables and constants P [s]= P [s ] P [s ] (5) from S. A decomposition tree, or d-tree, is a tree, where ΦΨ Φ 1 Ψ 2 · · each inner node is one of , , , , or [θ] and each leaf s1,s2∈XS:s1·s2=s node is a variable in X or⊕a c(cid:12)ons⊗tant in S, M, or S M. Pα+β[m]= Pα[m1] Pβ[m2] (6) The five types of inner nodes have thFe following meani⊗ng. · m1,m2∈MX:m1+m2=m 1. A node ⊕ with children representing Φ and Ψ repre- P [m]= P [s ] P [m ] (7) sents the expression Φ+Ψ, where Φ and Ψ are independent Φ α Φ 1 α 1 ⊗ · expressions in K, or K M respectively. s1∈S,m1∈XM:s1⊗m1=m 2. A node with ch⊗ildren representing Φ and Ψ repre- P[αθβ][s]= Pα[m1]·Pβ[m2] (8) sents the expre(cid:12)ssion Φ·Ψ, where Φ and Ψ are independent m1∈M,m2∈XM:[m1θm2]=s semiring expressions in K. 3. A node with children representing Φ and α repre- P[ΦθΨ][s]= PΦ[s1]·PΨ[s2] (9) sents the expre⊗ssion Φ α, where Φ and α are independent s1∈S,s2∈XS:[s1θs2]=s expressions in K and K⊗ M, respectively. ⊗ Following Remark 1, the sums in the above equations are 4. A node [θ] with children representing Φ and Ψ repre- restricted to summands of non-zero probabilities. sents the expression [ΦθΨ], where Φ and Ψ are independent expressions in K or K M. Example 11. LetS andM bethesemiringandthemo- 5. Givenavariablex⊗X,aninnernode withchildren noid of natural numbers with standard addition and multi- representing Φ , ..∈., Φ for all thosxe s S with panlidcatαion=. Lyet Φ5 =bexawsiethmiPmxo=dul{e(0e,x0p.3r)e,s(s1io,n0.3w)i,t(h2,P0.4)=}, Px[si]6=0 repr|exs←ens1ts the exp|rxe←sssinon Φ. F i ∈ y ⊗ (1,0.4),(2,0.4),(3,0.2) . Then α is a M-valued random Justlikesemiringorsemimoduleexpressions,d-treesrep- { } variable with P = (5,0.4),(10,0.4), (15,0.2) . The prob- resent probability distributions. Direct implementations of α { } ability distribution of Φ α = x (y 5) = (x y) 5 Eq.(4)through(10)areefficientprocedurestocomputethe S ⊗ ⊗ ⊗ · ⊗ is given by Eq.(7). For example, P [10]=P [1]P [10]+ probabilitydistributionatanyinnernodeofad-tree,given Φ α x α P [2]P [5]. The convolution is finite⊗as the probability dis- theprobabilitydistributionsatitschildren: Eq.(4)and(6) x α tributions are non-zero for finitely many elements. Further applyto nodes,Eq.(5)appliesto nodes,Eq.(7)applies ⊕ (cid:12) possible outcomes for Φ α are 0, 5, 15, 20, 30. In case of to nodes, Eqs. (8), (9) apply to [θ], and Eq. (10) applies ⊗ ⊗ the Boolean semiring, S=B, possible outcomes are 0 and 5 to nodes. For leaves, we only have the trivial cases of x withP [5]=P [ ]P [ ]andP [0]=1 P [5]. 2 a variable x with probability distribution P , of constant Φ α x y Φ α Φ α x ⊗ > > ⊗ − ⊗ s FS or m M with distribution (s,1) and (m,1) , Partitioning into Mutually Exclusive Expressions. an∈d of consta∈nt semimodule expressio{ns of}the for{m s m} Given an expression Φ and a variable x X that occurs in with probability distribution (s m,1) . The probab⊗ility Φ, the probability distribution of Φ can∈be expressed using distributionoftheentired-tre{eis⊗thedist}ributionofitsroot probabilitydistributionsofsub-expressionsundervaluations andcanbecomputedbottom-upinonepassoverthed-tree. oofbtxa.inFedorfraonmy sΦ0 ∈bySr,epwleacdinengoetveebryy oΦc|cxu←rsr0entcheeoefxpxrebsysios0n. Example 12. Considerthed-treefromFigure 5inwhich ThentheprobabilitydistributionofΦcanbepartitionedby it is assumed that each variable x a,b,c has non-zero ∈ { } the probability distributions of expressions Φ : probabilities p and p¯ = 1 p for values 1 and 2, re- |x←s0 spectively. Wex first coxmpute−thex probability distribution PΦ[s]=sX0∈SPx[s0]·PΦ|x←s0[s]. (10) fnooridthSeUleMft.bTrahnechdiosftrtihbeutdio-tnrseeauren:de{r(t1h,epba)g,(g2r,epg¯ba)t}ionfomr ob-, 496 (1,1) for1, (2,pb),(3,p¯b) forb 1, (10,1) for1 10, Compile (Expression Φ) { } { } ⊕ { } ⊗ (20,pb),(30,p¯b) for (b 1) 10, (1,pa),(2,p¯a) for a, begin { } ⊕ ⊗ { } (20,papb),(30,pap¯b),(40,p¯apb),(60,p¯ap¯b) fora(b+1) 10, if Φ has no variables then { } ⊗ (20,1) for1 20, (40,papb),(50,pap¯b),(60,p¯apb),(80,p¯ap¯b) return Φ { } ⊗ { } for a(b+1) 10+1 20, and finally if independent Φ ,Φ s.t. Φ +Φ =Φ then ⊗ ⊗ ∃ 1 2 1 2 return Compile(Φ ) Compile(Φ ) (40,papbpc),(50,pap¯bpc),(60,p¯apbpc),(80,p¯ap¯bpc) 1 ⊕ 2 { } if independent Φ ,Φ s.t. Φ Φ =Φ then 1 2 1 2 ∃ · for the entire left branch. For the right branch one obtains return Compile(Φ ) Compile(Φ ) 1 2 (cid:12) if independent Φ ,Φ s.t. Φ Φ =Φ then {(70,papbp¯c),(80,pap¯bp¯c),(100,p¯apbp¯c),(120,p¯ap¯bp¯c)}. ∃return Compil1e(Φ21) Co1m⊗pil2e(Φ2) ⊗ The probability distribution of the entire d-tree is then: if independent Φ ,Φ s.t. [Φ θΦ ]=Φ then 1 2 1 2 ∃ return Compile(Φ ) [θ] Compile(Φ ) (40,p p p ),(50,p p¯ p ),(60,p¯ p p ),(70,p p p¯ ), 1 2 a b c a b c a b c a b c { Choose variable x X occurring in Φ (80,p¯ p¯ p +p p¯ p¯ ),(100,p¯ p p¯ ),(120,p¯ p¯ p¯ ) ∈ a b c a b c a b c a b c return ( s S,P [s]=0: Compile(Φ )) } x ∀ ∈ x 6 s In case of MIN aggregation, the distribution for both bran- F ches as well as for the entire d-tree is (10,1) . In case of Algorithm 1: Compilation of semimodule or semi- theBooleansemiringandMINaggregat{ion,the}distribution ring expressions into d-trees. for the left branch (c ) is (10,p p p¯ ), ( ,p p¯ p¯ + a b c a b c p((¯22a00p,,bppp¯¯¯aac+ppccp¯))a,}p,¯(bap¯nc,d)p}af,op¯frbop¯rtcht+he←eop¯rvai⊥geprhbapt¯lclb+rd{a-npt¯racehp¯eb(p¯{cc(←)10.,>p)aips∞b{p¯(c1+0,ppaappcc2)),, sφeimPisirmoaopdpoursoleidtueicoxtpnroef3s.sviaoLrneiatobαvlee=sratPhnedni=Sa1UlφlMiv⊗amrmiaobniloebiesdahNnavwmeh-neboroenu-nezadecerdoh ∞ } probability only for 0 and 1 . The probability distribution S S We now turn to analysing the complexity of computing Pd of a d-tree d for α can be computed in time (n2m2d). O theprobabilitydistributionforagivend-tree. Sinced-trees Notethateveryφ mayonlyevaluateto0 or1 andhence are binary, the distribution of a convolution node can be i S S thesumisboundedbytheproductofthenumberofterms, computed in time linear in each of the sizes of the distribu- n and m; the product n m is thus an upper bound for the tions of the children according to Eq.(1). The distribution · size of the probability distribution at each node. In partic- ofa -nodeiscomputedintimelinearinthenumberofits ular, COUNT aggregation corresponds to m = 1 for all i. children and their probability distributions. This implies: i F Thentheresultingprobabilitydistributionhasatmostsize n and can be computed in time (n2d). In a d-tree that Theorem 2. TheprobabilitydistributionPd ofad-treed O contains semimodule expressions α,β over different aggre- whose nodes have probability distributions p ,...,p can be 1 n gation monoids, the complexity of the sub-trees of α and β computed in time ( p ). i O | | is each according to Propositions 2 and 3. Q Compiling Expressions into d-trees. For the SUM monoid, the size of P can be exponential in d Algorithm 1 sketches the construction of a d-tree for an in- the number n of leaves, since there may be exponentially put expression Φ by repeatedly applying six decomposition many distinct sums out of n numbers. We analyse two rules. Theobtainedd-treeisequivalenttotheinputexpres- classes of d-trees for which we can obtain a better upper sion, i.e., they have the same probability distribution. bound on the time complexity of computing P . d The first four rules check whether the input expression Firstly, observe that the sum of two elements in the MIN can be partitioned and decomposed into two independent or MAX monoid is one of the elements itself (e.g. a+ b min expressions Φ and Φ . In the first rule, Φ and Φ are is either a or b). This implies that the size of the proba- 1 2 1 2 eitherbothsemimoduleorbothsemiringexpressions;inthe bilitydistributionofasemimoduleexpressionαisbounded secondrule,theyaresemiringexpressions. Inthethirdrule, by the number of monoid values occurring at the leaves of Φ is a semiring, and Φ is a semimodule expression. d. Moreover, if α’s variables are N-valued, α’s probabil- 1 2 Inordertofindindependentdecompositionsinpolynomial ity distribution is unaltered under the following reduction time, expressions are analysed at a syntactic level: We at- of its variables to B-valued variables: Px[⊥] = Px[0] and temptthefirstruleonlyifΦisasumandpartitionitby the P [ ]=1 P [ ]. Hence, one may equivalently consider a x > − x ⊥ connected components in its clause-dependency graph. In d-tree for B-valued variables obtained from this reduction: additiontosuchsyntacticmanipulations,thedecomposition Proposition 2. Given a semimodule expression α over in the second and third rules uses known polynomial-time MIN or MAX, the distribution P can be computed in time algorithms to recognise read-once expressions, i.e., expres- α linear in the size of α’s d-tree in which all variables are sions where each variable occurs once, and hence factorise considered by their reduction to B-valued variables. expressions based on algebraic rewritings such as the asso- ciativityandcommutativitylaws,e.g.,[6,18]. Inparticular, Secondly, we analyse SUM aggregation. Evaluating a thisapproachallowstofactorexpressionsintocomplexsub- SUM expression x m in which all x are independent expressionsandnotonlyintoonevariableandtheresidual. i i i ⊗ variables is hard [19]. Yet, instances in which the values The last rule decomposes Φ into sub-expressions Φ , for s m areconstrainePdaretractable: wesaythatasemimodule each s S. As explained for Eq. (10), each expression Φ i s ∈ expression φi mi overtheSUMmonoidNism-bounded is obtained in linear time by replacing x with the constant ⊗ ifthereisaconstantm Nsuchthat i:0 mi m. Ag- s in Φ. Many heuristics have been proposed for choosing gregationoPverrationals∈fromabounde∀dsetc≤anbe≤regarded the variable x, since good choices can make the difference asboundedifthenumberofdecimalplacesisfixed,e.g.for between polynomial and exponential size decision diagrams currencies with 2 decimal places. (such as ordered binary decision diagrams, d-DNNFs, or d- 497 trees). In our implementation, we choose a variable with expression Φ into a d-tree in polynomial time, then hard most occurrences [18]. problems such as satisfiability, tautology, and even probability computation can be solved in polynomial time for Φ. Remark 2. Therequirementthattheunderlyingalgebraic In practice, however, this rule can be quite effective, as we structuresbecommutativeandassociativeiscrucialforstruc- show in the experiments. In case we need only apply the turaldecomposition: withouttheseproperties,expressionde- firstfourrules,thecompilationfinishesinpolynomialtime. composition would be constrained to the fixed order defined This is already known for semiring expressions occurring in by the order of symbols and the nesting of the expression. the results of tractable relational algebra queries without repeating symbols [18]. In Section 6, we define a fragment Example 13. Figure5depictsad-treeforthesemimod- ofourquerylanguage consistingoftractablequeriesthat uleexpressionΦ=a(b+c) 10+c 20. Φcannotbesplit onlycreateexpressionsQcompilableusingthefirstfourrules. ⊗ ⊗ into independent sub-expressions since variable c occurs in Wefinishthissectionwithtwoobservationsaboutourcom- both summands. We choose to eliminate variable c to cre- pilationapproach,whichcannotbepresentedatlengthdue ate mutually exclusive events. (This is a good choice since to space constraints. it leads to independent sub-expressions.) We thus create a Pruning Conditional Expressions. The evaluation of node c withasmanychildrenasvaluationsofcthathave [αθβ]-expressionscanbeconsiderablyimprovedbyemploy- non-zero probability. Consider the case c 1 and hence ingpruningrulesthatprovepartsofαorβredundant. Con- Φle|fct←b1rFa=ncha(ibn+th1e)d⊗-t1r0ee+, a1nd⊗c2a0n. bTehdiseccoomrr←peosspeodndussintog tthhee swidheicrhtchaenebxeprdeescsoiomnpΦos=ed[iαnθtoβ]su=b-[exx⊗pr1e0ss+iomnsinαya⊗nd20β.≤T1h5e] first rule into its independent summands a(b+1) 10 and probabilityP [1 ]=1 P [0 ]isindependentofy;indeed, 1 20. Theformerexpressioncanbedecomposed⊗usingthe whencomputΦingSP one−maxysSafelyignorecomputingP [20] ⊗ α α third rule into independent expressions a and (b+1) 10. sinceitcannotcontributetoP [1 ]intheconvolutionof[θ]. ⊗ Φ S Theprocedurecontinuesuntilwecompletelydecomposethe Equivalently,wemayreplaceΦwithasimpleryetequivalent expressions into variables and semimodule constants. expressioninwhichredundanttermsarepruned. Examples Figure 6 shows a d-tree for the semimodule expression in of pruning rules are (similar and symmetric cases omitted): the first conditional expression in the annotation Φ of the resulttuple Gap inFigure1eoverthesemimoduleB N: h i ⊗ MIN: Φ m m Φ m m i i i i x4y41(z1+z5)⊗15+x4y43z3⊗60+x5y51(z1+z5)⊗10 "Xi ⊗ ≤ #≡"i:mXi≤m ⊗ ≤ # It cannot be partitioned into independent sub-expressions. Choosingvariablex4 inthelastrule,wecreateanode x4. SUM: " Φi⊗mi ≤m#≡1S if mi ≤m IptosseledftuscihnigldthΦe|xt4h←ir⊥dr=ulex5inyt5o1(xz51y+51(zz51)+⊗z15)0acnadn1be1d0e.cFTomhe- PruningisparXtiicularlyeffectivewhentheproXbiabilitydistri- ⊗ former is a read-once expression, on which the second rule butions of α and β have exponential size, such as in case can be applied twice to decompose it into x5 and the rest, of the SUM monoid; there, early pruning can avoid the full thentherestintoy51 andz1+z5. Thelatterisdecomposed materialisation of such probability distributions. using the first rule into z1 and z5. The right child Compiling Joint Probability Distributions. A tuple in the result of an aggregate query in may have several Φ|x4←> =y41(z1+z5)⊗15+y43z3⊗60+x5y51(z1+z5)⊗10 semimoduleexpressionsandbeannotateQdwithaconditional can be decomposed using a sequence of the first three rules expression; in such cases we are interested in finding the that exploit the independence of sub-expressions. joint probability distribution of these expressions. This can By construction of query results using , the semiring be accomplished by compiling the expressions into a sin- · expression in the second conditional expression in Φ is pre- gle d-tree by applying mutex decomposition until some of ciselythesemiringpartoftheabovesemimoduleexpression: the expressions become independent; the joint probability JK distribution of two independent random variables is simply x y (z +z )+x y z +x y (z +z ), 4 41 1 5 4 43 3 5 51 1 5 their product. For instance, given integer variables a,b,c We can thus use the very same compilation steps as above, with non-zero probabilities for 1,2 only, the mutex decom- with the simplification that only the first two rules, which position on a decomposes the joint expression a+b,a c h · i work on semirings, need be applied. The d-tree is the same intotwobranches 1+b,1 c and 2+b,2 c . Ineachbranch, h · i h · i asforthesemimoduleexpression,butwhere nodesarere- the two expressions are independent and can be considered placed by and without semimodule expres⊗sions at leaves. separately. For example, the probability for the value 3,2 Figure 6 sh(cid:12)ows this d-tree with thick (blue) edges. 2 can be worked out to be Pa[2]Pb[1]Pc[1]+Pa[1]Pb[2]Pch[2].i Proposition 4. Foranysemimoduleorsemiringexpres- 6. TRACTABLEQUERYEVALUATION sionΦ,Algorithm1constructsad-treedsuchthatP =P . Φ d This section describes the classes ind and hie of tract- Q Q Byvirtueofthisequivalence,wecancomputetheprobabil- ablerelationalalgebraquerieswithaggregation. Thechar- ity distribution of any expression by first compiling it into acterisationusesasbuildingblocksqueriesthatreturntuple- a d-tree d and then computing its probability distribution independent relations and queries reminiscent of hierarchi- P . The algorithm is complete since the last rule is always cal queries [21]. Class hie uses ind as a building block, d Q Q applicable until all variables are evaluated. However, ex- i.e. ind hie. Our tractability result follows from the Q ⊂ Q clusive application of this rule is not desirable: compiling discussion in this section: arbitraryexpressionsbyapplyingthelastruleonlycanlead to d-trees of size exponential in the number of variables. Theorem 3 (Tractable Queries). This limitation seems necessary: if we could compile any Every query in hie has polynomial-time data complexity. Q 498 Thecharacterisationisbasedonageneralisationofthehi- createsasannotationssumsofindependentexpressions;fur- erarchicalpropertyfornon-repeatingconjunctivequeries[21]. thermore, as γ A¯, the only expressions in result tuples 1 6∈ A query Q is non-repeating if every base relation occurs at are the annotation expressions. most once in Q; in this section we assume all queries to be For 8.2(b), it is known that non-repeating hierarchical non-repeating. Given a query Q = πA¯σφ Q1 ×···×Qn queries are tractable on tuple-independent relations [21]; and an attribute A, we denote by A∗ the set of attributes since in addition all variables in the projection list are root (cid:0) (cid:1) in Q that are transitively equated with A in φ; at(A∗) de- variables,theresultistuple-independent. In8.2(c),thecon- notesthesubsetoftherelationsymbolsQ , ,Q inwhich ditional expression obtained for the result tuple compares 1 n ··· an attribute from A∗ appears. A non-repeating query Q= two tractable and independent semimodule expressions. aπbA¯lσesφ(cid:0)AQ,1B×t·h·a·t×aQren(cid:1)noisthinieAr¯aracnhdicaarleifnfootreeqaucahttewdowoifthitsavcaorni-- SDeecofnindiltyi,owne9defin(CelQahsise ashfieo)ll.ows. stant,itholdsthatat(A∗)∩at(B∗)=∅orat(A∗)⊇at(B∗) The following are hie-querQies: or at(A∗) at(B∗). Given a query of the above form, an Q acottnrtiabiuntsesAo⊆miesaattrroiobtutaettfrriobmuteAi∗f.every relation Q1,...,Qn 1. iQf Q=1π,.A¯.$.,A¯Q;γn←∈AGQG(inCd),(cid:2)σψ(Q1×···×Qn)(cid:3) Definitions 8 and 9 rely on the following notions and as- and πA¯σψ(Q1×···×Qn) is hierarchical sumptions. Every relation that does not contain semimodule expressions and whose tuples are annotated with dis- 2. Q=πA¯σφ(Q1×···×Qn) if Q1,...,Qn ∈Qind, and Q is hierarchical. tinct variable symbols with given probability distributions Queriesin9.2arethewell-knownnon-repeatinghierarchi- iscalledtuple-independent. WeallowMINandMAXaggre- cal queries [21]. For queries in 9.1, first consider the query gation as well as bounded SUM and COUNT aggregation, see Proposition 3. Additionally, we assume that for every Q0 = πA¯σψ(Q1×···×Qn) which is by assumption hierar- aggregation monoid M occurring in a query, the constant chical. It follows that, given a tuple t ∈ Q0, its annotation Φ = φ is a read-once expression [17, 21]; moreover, Φ 0 does not occur in the database. i M can be compiled into a d-tree whose size is bounded by the We assume that in a selection σ , φ is a conjunction of φ numbPerofitsvariables. ByPropositions2and3,computing (1) equality predicates of the form A=B or A=c where A the probability distribution of such a d-tree – and thus the and B are non-aggregation attributes and c is a constant, probability distribution of Φ – can be done in time polyno- and(2)θ-comparisonsαθβ,αθc,orαθAwhereαandβ are mialinthesizeofΦ. ByProposition1,thesizeoftheinput aggregationattributesandAisanon-aggregationattribute. We give separate recursive definitions of the classes ind annotation is polynomial in the size of the input database, Q hence query evaluation is in polynomial-time. ofquerieswhoseresulttuplesarepairwiseindependent,and hie whose result tuples may be correlated. Now consider the query Q (9.1). Aggregation yields ex- Q pressionsoftheformα= φ v where–asabove– AGG i⊗ i Definition 8 (Class ind). each φi is a product of variables; since the query is non- Q repeating, each product hPas exactly n variables, one from 1. Every tuple-independent relation is a ind-query. eachrelation. Theaggregateddatavaluesv areallfromthe Q i 2. LTehteQn1t,h.e..fo,lQlonw∈inQgianrde,anidndQ-˜qiu=eri$esA¯:i;γi←AGGi(Ci)(Qi). sfraommetrhelaattiroenlaatniodnc.aTnhheenecxepbreesasisosnocαiatceadnwbiethcothmepvilaerdiaibnlteos Q the same d-tree as Φ, except that instead of the leaves for (a) Q=πA¯σφ(Q˜1), such that γ1 6∈A¯ tthheevfoarrmiabxlesxv.frTohmetfhoelloawgginreggeaxtaedmrpelleatililouns,triattheasstlheiasviedseoaf. (b) cQhi=caπlAa¯nσdφ(aQll1a×tt·r·ib·u×teQsni)n,As¯ucahrethroaottQvairsiahbielersar- Example⊗14. ConsiderthedatabaseinFigure1andthe tupCloensc(iencr)nthQineg=retsπhu∅elσtqγou1fθeγQr˜2iei(cid:0)sQa˜ru1en×pdaeQ˜rir2w8(cid:1).,i2ss,eufiicnrhdstethpoaebtnsdAe¯er1vnet=atAhn¯ad2t=ththa∅et. qotahfnuetneyroeyptxaepQt1liao=innna$oQtfi∅ohi;nitαes←.arIbSenUsouMvtlhet(ipstirsuiecpQex)la(cid:0)e0mσ=issphxπloep∅,y=σt0shMhe+o&pqx=Su00eMy(rSy&)+SQ10x(0SPcyo)Sn1(cid:1)s+iwdPxehSriye;cdhthi+ines the expressions γ have the form γ = x v where 1 11 1 12 2 21 2 22 the x are indepenident random variiables;AtGhGe anin⊗otaition of x3y33+x3y34whichisequivalenttotheread-onceexpression i x (y +y )+x (y +y )+x (y +y ). According to eachtupleinQ˜ isΦ=1 incaseofA¯P= andasumΦ= 1 11 12 2 21 22 3 33 34 i K i ∅ the rewriting rules in Figure 4, the annotation of the xj overtheannotationsofthetuplesparticipatingtothis result of Q is (x y )· 10+(x y ) 50+(x y ) 11+ group,otherwise. Considerthecase8.2(a): Theselectionσ 1 11 ⊗ 1 12 ⊗ 2 21 ⊗ φ (x y ) 60+(x y ) 15+(x y ) 40. Associatingtheprice Pmay compare γ1 with a constant c to yield the annotation va2lue2s2w⊗iththe3an3n3oJ⊗taKtionva3ri3a4bl⊗esy oftheirrelation,one Φ =Φ [γ θc]. Undertheassumptionthattheconstant0 ij t · 1 M obtains a read-once expression equivalent to the one above, isnotinthedatabase,itcanbeshownthatforanyvaluation exceptthattheleaveswithy variablesarenowsemimodule ν Ω it holds that ν(Φ) = 0 if and only if ν(γ ) = 0 , ij i.e∈. the correlation of the distSributions of Φ and1γ1 is iMn- eyxpres6s0io)n+sxyij(⊗y vi: x151(+y1y1⊗104+0)y.12⊗50)+x2(y21⊗11+2 dependent of the data. Starting from this observation, it 22⊗ 3 33⊗ 34⊗ fcoalnlowbys ethxpatretshseedpbroybcaobnilsiitdyerdinisgtrtibhuetpiornoboafbΦilitti=esΦof·Φ[γ1aθncd] inRwéhiecthaπl.cσoφn(sRid1erquerieRsn$)∅i;γs←hAieGrGar(cCh)iσcφa(lR[119×];·t·h·e×seRanre) [γ θc] separately: For instance, for MIN aggregation and θ subsumed ∅by hie×. F··o·r×such queries involving aggregation 1 Q is : P [0 ] = P [0 ] and P [1 ] = P [1 ]. For without grouping, the neutral element of the aggregation MI≤NaggΦrtegSationa[nγd1≤θc]isS : P [0Φt]=SP [0 []γ+≤cP] S [0 ] monoid may safely be in the database without jeopardising and P [1 ] = P [1 ]+P≥ Φ[1t ]S 1. SΦimSilar re[γsu1≥ltcs] aSre query tractability, because the annotation the query result obtainΦetd fSor MAXΦ aSnd SU[Mγ≥ca]ggSreg−ation. Since the result- isalways1 andhencethecorrelationofthedistributionsof K ing tuples are again independent, the final projection πA¯ 1K andγ istrivial,seealsothediscussionafterDefinition8. 499