Provenance for Aggregate Queries

Yael Amsterdamer (University of Pennsylvania and Tel Aviv University)
Daniel Deutch (University of Pennsylvania and Ben Gurion University)
Val Tannen (University of Pennsylvania)

arXiv:1101.1110v1 [cs.DB] 5 Jan 2011

ABSTRACT

We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested to capture provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values' computation. We realize this approach in a concrete construction, first for “simple” queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.

1. INTRODUCTION

The annotation of the results of database transformations with provenance information has quite a few applications [19, 6, 35, 10, 11, 38, 34, 25, 27, 39, 37, 2, 4]. Recent work [24, 17, 21] has proposed a framework of semiring annotations that allows us to state formally what is expected of such provenance information. These papers have developed the framework for the positive fragment of the relational algebra (as well as for Datalog, the positive Nested Relational Calculus, and some query languages on trees/XML). The main goal of this paper is to extend the framework to aggregate operations.

In the perspective promoted by these papers, provenance is a general form of annotation information that can be specialized for different purposes, such as multiplicity, trust, cost, security, or identification of “possible worlds” which in turn applies to incomplete databases, deletion propagation, and probabilistic databases. In fact, the introduction of the framework in [24] was motivated by the need to track trust and deletion propagation in the Orchestra system [23]. What makes such a diversity of applications possible is that each is captured by a different semiring, while provenance is represented by elements of a semiring of polynomials. One then relies on the property that any semiring-annotation semantics factors through the provenance polynomials semantics. This means that storing provenance polynomials allows for many other practical applications. For example, to capture access control, where the access to different tuples requires different security credentials, we can simply evaluate the polynomials in the security semiring, and propagate the security annotations through query evaluation (see Section 2.1), assigning security levels to query results.

Let us briefly illustrate deletion propagation as an application of provenance. Consider a simple example of an employee/department/salary relation $R$ shown in Figure 1(a). The variables $p_1, p_2, p_3, r_1, r_2$ can be thought of as tuple identifiers and in the framework of provenance polynomials [24] they are the “provenance tokens” or “indeterminates” out of which provenance is built. We denote by $\mathbb{N}[X]$ the set of provenance polynomials (here $X = \{p_1, p_2, p_3, r_1, r_2\}$). $R$ can be seen as an $\mathbb{N}[X]$-annotated relation; as defined in [24] the evaluation of a query, for example $\Pi_{Dept} R$, produces another $\mathbb{N}[X]$-annotated relation, in this example the one shown in Figure 1(b). Intuitively, in this simple example, the summation in the annotation of every result tuple is over the identifiers of its alternative origins¹.

(a) $R$

EmpId   Dept   Sal   (annotation)
1       d1     20    $p_1$
2       d1     10    $p_2$
3       d1     15    $p_3$
4       d2     10    $r_1$
5       d2     15    $r_2$

(b) $\Pi_{Dept} R$

Dept   (annotation)
d1     $p_1 + p_2 + p_3$
d2     $r_1 + r_2$

Figure 1: Projection on annotated relations

¹ We explain how the annotations of query results are computed in Section 2.1.

Now, the result of propagating the deletions of tuples with EmpId 3 and 5 in $R$ is obtained by simply setting $p_3 = r_2 = 0$ in the answer. We get the same two tuples in the query answer but their provenances change to $p_1 + p_2$ and $r_1$, respectively. If the tuple with EmpId 4 is also deleted from $R$ then we also set $r_1 = 0$, and the second tuple in the answer is deleted because its provenance has now become 0. This algebraic treatment of deletions is related to the counting algorithm for view maintenance [26], but is more general as it incrementally maintains not just the data but also the provenance.

An intuitive way of understanding what happens is that provenance-aware evaluation of queries conveniently “commutes” with deletions. In fact, in [24, 17] this intuition is captured formally by theorems that state that query evaluation commutes with semiring homomorphisms. The factorization through provenance relies on this and on the fact that the polynomial provenance semiring is “freely generated”. All applications of provenance polynomials we have listed, for trust, security, etc., are based on these theorems.

Thus, commutation with homomorphisms is an essential criterion for our proposed framework extension to aggregate operations. However, in Section 3.1 we prove that the framework of semiring-annotated relations introduced in [24] cannot be extended to handle aggregation while both satisfying commutation with homomorphisms and working as usual on set or bag relations.

If the semiring operations are not enough then perhaps we can add others? This is a natural idea so we illustrate it on the same $R$ in Figure 1(a) and we use again the necessity to support deletion propagation to guide the approach. Consider the query that groups by Dept and sums Sal. The result of the summation depends on which tuples participate in it. To provide enough information to obtain all the possible summation results for all possible sets of deletions, we could use the representation in Figure 2(a) where we add to the semiring operations a unary operation $\hat{\ }$ with the property that $\hat{p} = 1$ whenever $p = 0$. This will indeed satisfy the deletion criterion. For example when the tuple with Id 3 is deleted we get the relation in Figure 2(b). In fact, there exist semirings with the additional structure needed to define $\hat{\ }$. For example in the semiring of polynomials with integer coefficients, $\mathbb{Z}[X]$, we can take $\hat{p} = 1 - p$ while in the semiring of boolean expressions with variables from $X$, $BoolExp(X)$, we can take $\hat{p} = \neg p$². The latter is essentially the approach taken in [31]. However, whether we use $\mathbb{Z}[X]$ or $BoolExp(X)$, we still have, in the worst case, exponentially many different results to account for, at least in the case of summation (a lower bound recognized also in [31]). It follows that summation in particular (and therefore any uniform approach to aggregation) cannot be represented with a feasible amount of annotation as long as the annotation stays at the tuple level.

(a)

Dept   SalMass   (annotation)
d1     45        $p_1 p_2 p_3$
d1     30        $p_1 p_2 \hat{p}_3$
d1     35        $p_1 \hat{p}_2 p_3$
d1     25        $\hat{p}_1 p_2 p_3$
d1     20        $p_1 \hat{p}_2 \hat{p}_3$
d1     10        $\hat{p}_1 p_2 \hat{p}_3$
d1     15        $\hat{p}_1 \hat{p}_2 p_3$
···    ···       ···

(b)

Dept   SalMass   (annotation)
d1     30        $p_1 p_2$
d1     20        $p_1 \hat{p}_2$
d1     10        $\hat{p}_1 p_2$
···    ···       ···

Figure 2: A naive approach to aggregation

² Both of these also give natural semantics to relational difference. $\mathbb{Z}[X]$ is used in [20] following the use of $\mathbb{Z}$ in [22], while $BoolExp(X)$ is used in the seminal paper [28].

Instead, we will present a provenance representation for aggregation results that leads only to a poly-size increase, one that we believe is tractably implementable using methods similar to the ones used in Orchestra [23]. We achieve this via a more radical approach: we annotate with provenance information not just the tuples of the answer but also the manner in which the values in those tuples are computed.

We can gain intuition towards our representation from the particular case of bags, which are in fact $\mathbb{N}$-relations, i.e., relations whose tuples are annotated with elements of the semiring $(\mathbb{N}, +, \cdot, 0, 1)$. Assume that $R$ in Figure 1(a) is such a relation, i.e., $p_1, \ldots, r_2 \in \mathbb{N}$ are tuple multiplicities. Then, after sum-aggregation the value of the attribute SalMass in the tuple with Dept d1 is computed by $p_1 \times 20 + p_2 \times 10 + p_3 \times 15$. Now, if the multiplicities are, for example, $p_1 = 2, p_2 = 3, p_3 = 1$ then the aggregate value is 85. But what if $R$ is a relation annotated with provenance polynomials rather than multiplicities? Then, the aggregate value does not correspond to any number.

We will make $p_1 \times 20$ into an abstract construction, $p_1 \otimes 20$, and the aggregate value will be the formal expression $p_1 \otimes 20 + p_2 \otimes 10 + p_3 \otimes 15$. Intuitively, we are embedding the domain of sum-aggregates, i.e., the reals, $\mathbb{R}$, in a larger domain of formal expressions that capture how the sum-aggregates are computed from values annotated with provenance. We do the same for other kinds of aggregation, for instance min-aggregation gives $p_1 \otimes 20 \ \min\ p_2 \otimes 10 \ \min\ p_3 \otimes 15$. We call these annotated aggregate expressions.

In this paper we consider only aggregations defined by commutative-associative operations³. Specifically, our framework can accommodate aggregation based on any commutative monoid. For example the commutative monoid for summation is $(\mathbb{R}, +, 0)$ while the one for min is $(\mathbb{R}^{\infty}, \min, \infty)$⁴. To combine an aggregation monoid $M$ with an annotation commutative semiring $K$, in a way capturing aggregates over $K$-relations, we propose the use of the algebraic structure of $K$-semimodules (see Section 2.2). Semimodules are a way of generalizing (a lot!) the operations considered in linear algebra. Their “vectors” form only a commutative monoid (rather than an abelian group) and their “scalars” are the elements of $K$ which is only a commutative semiring (rather than a field).

³ As shown in [32], for list collections it also makes sense to consider non-commutative aggregations.
⁴ $K$-annotated relations with union (see Section 2.1) also form such a structure.

In general, a commutative monoid $M$ does not have an obvious structure of $K$-semimodule. To make it such we may need to add new elements corresponding to the scalar multiplication of elements of $M$ with elements from $K$, thus ending up with the formal expressions that represent aggregate computations, motivated above, as elements of a tensor product construction $K \otimes M$. We show that the use of tensor product expressions as a formal representation of aggregation results is effective in managing the provenance of “simple aggregate” queries, namely queries where the aggregation operators are the last ones applied.

We show that certain semirings are “compatible” with certain monoids, in the sense that the results of computation done in $K \otimes M$ may be mapped to $M$, faithfully representing the aggregation results. Interestingly, compatibility is aligned with common wisdom: it is known that some (idempotent) aggregation functions such as MIN and MAX work well for set relations, while SUM and PROD require the treatment of relations as bags. We show that non-idempotent monoids are compatible only with “bag” semirings, i.e. semirings from which there exists a homomorphism to $\mathbb{N}$.

In general, aggregation results may be used by the query as the input to further operators, such as value-based joins or selections. Here the formal representation of values leads to a difficulty: the truth value of a comparison operator on such formal expressions is undetermined! Consequently, we extend our framework and construct semirings in which formal comparison expressions between elements of the corresponding semimodule are elements of the semiring. This means that an expression like $[p_1 \otimes 20 = p_2 \otimes 10 + p_3 \otimes 15]$ may be used as part of the provenance of a tuple in the join result. This expression is simply treated as a new provenance token (with constraints), until $p_1, p_2, p_3$ are assigned values, e.g. from $\mathbb{B}$ or $\mathbb{N}$, in which case we can interpret
both sides of the equality as elements of the monoid and determine the truth value of the equality (see Section 4). We show in Section 4 that this construction allows us to manage provenance information for arbitrary queries with aggregation, while keeping the representation size polynomial in the size of the input database. Our construction is robust: if further queries are applied, the token $[p_1 \otimes 20 = p_2 \otimes 10 + p_3 \otimes 15]$ can be used as part of a more complex expression, just as any other provenance token.

The main result of this paper is providing, for the first time, a semantics for aggregation (including group by) on semiring-annotated relations that:

• Coincides with the usual semantics on set/bag relations for min/max/sum/prod.
• Commutes with homomorphisms of semirings (hence all the ensuing applications).
• Is representable with only poly-size overhead.

A second result of this paper is a new semantics for difference on relations annotated with elements from any commutative semiring. This is done via an encoding of relational difference using nested aggregation. The fact that such an encoding can be done is known (see e.g. [29, 9]), but combined with our provenance framework, the encoding gives a semantics for “provenance-aware” difference. Our new semantics for $R - S$ is a hybrid of bag-style and set-style semantics, in the sense that tuples of $R$ that appear in $S$ do not appear in $R - S$ (i.e. a boolean negative condition is used), while those that do not appear in $S$ appear in $R - S$ with the same annotation (multiplicity, if $K = \mathbb{N}$ is used) that they had in $R$. This makes the semantics different from the bag-monus semantics and its generalization to “monus-semirings” in [19] as well as from the “negative multiplicities” semantics in [22] (more discussion in Section 6). We examine the equational laws entailed by this new semantics, in contrast to those of previously proposed semantics for difference. In our opinion, this semantics is probably not the last word on difference of annotated relations, but we hope that it will help inform and calibrate future work on the topic.

Paper Organization. The rest of the paper is organized as follows. Section 2 describes and exemplifies the main mathematical ingredients used throughout the paper. Section 3 describes our proposed framework for “simple” aggregation queries, and this framework is extended in Section 4 to nested aggregation queries. We consider difference queries in Section 5. Related Work is discussed in Section 6, and we conclude in Section 7.

2. PRELIMINARIES

We provide in this section the algebraic foundations that will be used throughout the paper. We start by recalling the notion of semiring and its use in [24] to capture provenance for the SPJU algebra queries. We then consider aggregates, and show the new algebraic construction that is required to accurately support them.

2.1 Semirings and SPJU

We briefly review the basic framework introduced in [24]. A commutative monoid is an algebraic structure $(M, +_M, 0_M)$ where $+_M$ is an associative and commutative binary operation and $0_M$ is an identity for $+_M$. A monoid homomorphism is a mapping $h : M \to M'$ where $M, M'$ are monoids, and $h(0_M) = 0_{M'}$, $h(a + b) = h(a) + h(b)$. We will consider database operations on relations whose tuples are annotated with elements from commutative semirings. These are structures $(K, +_K, \cdot_K, 0_K, 1_K)$ where $(K, +_K, 0_K)$ and $(K, \cdot_K, 1_K)$ are commutative monoids, $\cdot_K$ is distributive over $+_K$, and $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$. A semiring homomorphism is a mapping $h : K \to K'$ where $K, K'$ are semirings, and $h(0_K) = 0_{K'}$, $h(1_K) = 1_{K'}$, $h(a + b) = h(a) + h(b)$, $h(a \cdot b) = h(a) \cdot h(b)$. Examples of commutative semirings are any commutative ring (of course) but also any distributive lattice, hence any boolean algebra. Examples of particular interest to us include the boolean semiring $(\mathbb{B}, \vee, \wedge, \bot, \top)$ (for usual set semantics), the natural numbers semiring $(\mathbb{N}, +, \cdot, 0, 1)$ (its elements are multiplicities, i.e., annotations that give bag semantics), and the security semiring $(\mathbb{S}, \min, \max, 0_S, 1_S)$ where $\mathbb{S}$ is the ordered set $1_S < C < S < T < 0_S$ whose elements have the following meaning when used as annotations: $1_S$: public (“always available”), $C$: confidential, $S$: secret, $T$: top secret, and $0_S$ means “never available”.

Certain semirings play an essential role in capturing provenance information. Given a set $X$ of provenance tokens which correspond to “atomic” provenance information, e.g., tuple identifiers, the semiring of polynomials $(\mathbb{N}[X], +, \cdot, 0, 1)$ was shown in [24] to adequately, and most generally, capture provenance for positive relational queries. The provenance interpretation of the semiring structure is the following. The $+$ operation on annotations corresponds to alternative use of data, the $\cdot$ operation to joint use of data, 1 annotates data that is always and unrestrictedly available, and 0 annotates absent data. The definition of the $K$-relational algebra (see below for union, projection and join) fits indeed this interpretation. Algebraically, $\mathbb{N}[X]$ is the commutative semiring freely generated by $X$, i.e., for any other commutative semiring $K$, any valuation of the provenance tokens $X \to K$ extends uniquely to a semiring homomorphism $\mathbb{N}[X] \to K$ (an evaluation in $K$ of the polynomials). We say that any semiring annotation semantics factors through the provenance polynomials semantics, which means that for practical purposes storing provenance information suffices for many other applications too. Other semirings can also be used to capture certain forms of provenance, albeit less generally than $\mathbb{N}[X]$ [24, 21]. For example, boolean expressions capture enough provenance to serve in the intensional semantics of queries on incomplete [28] and probabilistic data [18, 40].

To define annotated relations we use here the named perspective of the relational model [1]. Fix a countably infinite domain $D$ of values (constants). For any finite set $U$ of attributes a tuple is a function $t : U \to D$ and we denote the set of all such possible tuples by $D^U$. Given a commutative semiring $K$, a $K$-relation (with schema $U$) is a function $R : D^U \to K$ whose support, $supp(R) = \{t \mid R(t) \neq 0_K\}$, is finite. For a fixed set of attributes $U$ we denote by $K$-Rel (when $U$ is clear from the context) the set of $K$-relations with schema $U$. We also define a $K$-set to be a function $S : D \to K$, again of finite support. We then define:

Union  If $R_i : D^U \to K$, $i = 1, 2$ then $R_1 \cup_K R_2 : D^U \to K$ is defined by $(R_1 \cup_K R_2)(t) = R_1(t) +_K R_2(t)$. The definition of union of $K$-sets follows similarly.

We also define the empty $K$-relation ($K$-set) by $\emptyset_K(t) = 0_K$. It is easy to see that ($K$-Rel, $\cup_K$, $\emptyset_K$) is a commutative monoid⁵. Similarly, we get the commutative monoid of $K$-sets ($K$-Set, $\cup_K$, $\emptyset_K$).

⁵ In fact, it also has a semiring structure.

Given a named relational schema, $K$-databases are defined from $K$-relations just as relational databases are defined from usual relations, and in fact the usual (set semantics) databases correspond to the particular case $K = \mathbb{B}$. The (positive) $K$-relational algebra defined in [24] corresponds to a semantics on $K$-databases for the usual operations of the relational algebra. We have already defined the semantics of union above and we give here just two other cases, leaving the rest for Appendix A (for a tuple $t$ and an attribute set $U'$, $t|_{U'}$ is the restriction of $t$ to $U'$):

Projection  If $R : D^U \to K$ and $U' \subseteq U$ then $\Pi_{U'} R : D^{U'} \to K$ is defined by $(\Pi_{U'} R)(t) = \sum_K R(t')$ where the $+_K$ sum is over all $t' \in supp(R)$ such that $t'|_{U'} = t$.

Natural Join  If $R_i : D^{U_i} \to K$, $i = 1, 2$ then $R_1 \bowtie R_2 : D^{U_1 \cup U_2} \to K$ is defined by $(R_1 \bowtie R_2)(t) = R_1(t_1) \cdot_K R_2(t_2)$ where $t_i = t|_{U_i}$, $i = 1, 2$.

2.2 Semimodules and aggregates

We will consider aggregates defined by commutative monoids. Some examples are $SUM = (\mathbb{R}, +, 0)$ for summation⁶, $MIN = (\mathbb{R}^{\pm\infty}, \min, +\infty)$ for min, $MAX = (\mathbb{R}^{\pm\infty}, \max, -\infty)$ for max, and $PROD = (\mathbb{R}, \times, 1)$ for product.

⁶ COUNT is a particular case of summation and AVG is obtained from summation and COUNT.

In dealing with aggregates we have to extend the operation of a commutative monoid to operations on relations annotated with elements of semirings. This interaction will be captured by semimodules.

Definition 2.1. Given a commutative semiring $K$, a structure $(W, +_W, 0_W, *_W)$ is a $K$-semimodule if $(W, +_W, 0_W)$ is a commutative monoid and $*_W$ is a binary operation $K \times W \to W$ such that (for all $k, k_1, k_2 \in K$ and $w, w_1, w_2 \in W$):

$k *_W (w_1 +_W w_2) = k *_W w_1 +_W k *_W w_2$   (1)
$k *_W 0_W = 0_W$   (2)
$(k_1 +_K k_2) *_W w = k_1 *_W w +_W k_2 *_W w$   (3)
$0_K *_W w = 0_W$   (4)
$(k_1 \cdot_K k_2) *_W w = k_1 *_W (k_2 *_W w)$   (5)
$1_K *_W w = w$   (6)

In any (commutative) monoid $(M, +_M, 0_M)$ define for any $n \in \mathbb{N}$ and $x \in M$

$nx = x +_M \cdots +_M x$ ($n$ times)

in particular $0x = 0_M$. Thus $M$ has a canonical structure of $\mathbb{N}$-semimodule. Moreover, it is easy to check that a commutative monoid $M$ is a $\mathbb{B}$-semimodule if and only if its operation is idempotent: $x +_M x = x$. The $K$-relations themselves form a $K$-semimodule ($K$-Rel, $\cup_K$, $\emptyset_K$, $*_K$) where $(k *_K R)(t) = k \cdot_K R(t)$⁷.

⁷ In fact, it is the $K$-semimodule freely generated by $D^U$.

We now show, for any $K$-semimodule $W$, how to define $W$-aggregation of a $K$-set of elements from $W$. We assume that $W \subseteq D$ and that we have just one attribute, whose values are all from $W$. Consider the $K$-set $S$ such that $supp(S) = \{w_1, \ldots, w_n\}$ and $S(w_i) = k_i \in K$, $i = 1, \ldots, n$ (i.e., each $w_i$ is annotated with $k_i$). Then, the result of $W$-aggregating $S$ is defined as

$SetAgg_W(S) = k_1 *_W w_1 +_W \cdots +_W k_n *_W w_n \in W$

For the empty $K$-set we define $SetAgg_W(\emptyset_K) = 0_W$. Clearly, $SetAgg_W$ is a semimodule homomorphism⁸. Since all commutative monoids are $\mathbb{N}$-semimodules this gives the usual sum, prod, min, and max aggregations on bags. Since MIN and MAX are $\mathbb{B}$-semimodules this gives the usual min and max aggregation on sets⁹.

⁸ In fact, it is the free homomorphism determined by the identity function on $W$.
⁹ The fact that the right algebraic structure to use for aggregates is that of semimodules can be justified in the same way in which using semirings was justified in [24]: by showing how the laws of semimodules follow from desired equivalences between aggregation queries.

Note that SetAgg is an operation on sets, not an operation on relations. In the sequel we show how to extend it to one.

2.3 A tensor product construction

More generally, we want to investigate $M$-aggregation on $K$-relations where $M$ is a commutative monoid and $K$ is some commutative semiring. Since $M$ may not have enough elements to represent $K$-annotated aggregations we construct a $K$-semimodule in which $M$ can be embedded, by transferring to semimodules the basic idea behind a standard algebraic construction, as follows.

Let $K$ be any commutative semiring and $M$ be any commutative monoid. We start with $K \times M$, denote its elements $k \otimes m$ instead of $\langle k, m \rangle$ and call them “simple tensors”. Next we consider (finite) bags of such simple tensors, which, with bag union and the empty bag, form a commutative monoid. It will be convenient to denote bag union by $+_{K \otimes M}$, the empty bag by $0_{K \otimes M}$ and to abuse notation denoting singleton bags by the unique element they contain. Then, every non-empty bag of simple tensors can be written (repeating summands by multiplicity) $k_1 \otimes m_1 +_{K \otimes M} \cdots +_{K \otimes M} k_n \otimes m_n$. Now we define

$k *_{K \otimes M} \sum k_i \otimes m_i = \sum (k \cdot_K k_i) \otimes m_i$

Let $\sim$ be the smallest congruence w.r.t. $+_{K \otimes M}$ and $*_{K \otimes M}$ that satisfies (for all $k, k', m, m'$):

$(k +_K k') \otimes m \sim k \otimes m +_{K \otimes M} k' \otimes m$
$0_K \otimes m \sim 0_{K \otimes M}$
$k \otimes (m +_M m') \sim k \otimes m +_{K \otimes M} k \otimes m'$
$k \otimes 0_M \sim 0_{K \otimes M}$

We denote by $K \otimes M$ the set of tensors, i.e., equivalence classes of bags of simple tensors modulo $\sim$. We show in Appendix B that $K \otimes M$ forms a $K$-semimodule.

Lifting homomorphisms. Given a homomorphism of semirings $h : K \to K'$, and some commutative monoid $M$, we can “lift” $h$ to a homomorphism of monoids in a natural way. The lifted homomorphism is denoted $h^M : K \otimes M \to K' \otimes M$ and defined by:

$h^M(\sum k_i \otimes m_i) = \sum h(k_i) \otimes m_i$

3. SIMPLE AGGREGATION QUERIES

In this section we begin our study of the “provenance-aware” evaluation of aggregate queries, focusing on “simple” such queries, that is, queries in which aggregations are done last; for example, un-nested SELECT FROM WHERE GROUP BY queries. This avoids the need to compare values which are the result of annotated aggregations and simplifies the treatment. The restriction is relaxed in the more general framework presented in Section 4.

The section is organized as follows. We list the desired features of a provenance-aware semantics for aggregation, and first try to design a semantics with these features without using the tensor product construction, i.e. by simply working with $K$-relations as done in [24]. We show that this is impossible. Consequently, we turn to semantics that are based on combining aggregation with values via the tensor product construction. We propose such semantics that do satisfy the desired features, first for relational algebra with an additional AGG operator on relations (that allows aggregation of all values in a chosen attribute, but no grouping); and then for GROUP BY queries.

3.1 Semantic desiderata and first attempts

We next explain the desired features of a provenance-aware semantics for aggregation. To illustrate the difficulties and the need for a more complex construction, we will first attempt to define a semantics on $K$-relations, without using the tensor product construction of Section 2.3.

We consider a commutative semiring $K$ (e.g., $\mathbb{B}$, $\mathbb{N}$, $\mathbb{N}[X]$, $\mathbb{S}$, etc.) for tuple annotations and a commutative monoid $M$ (e.g., $SUM = (\mathbb{R}, +, 0)$, $PROD = (\mathbb{R}, \times, 1)$, $MAX = (\mathbb{R}^{-\infty}, \max, -\infty)$, $MIN = (\mathbb{R}^{\infty}, \min, \infty)$ etc.) for aggregation. We will assume that the elements of $M$ already belong to the database domain, $M \subseteq D$.

We have recalled the semantics of SPJU queries in Section 2.1. Now we wish to add an $M$-aggregation operation AGG on relations. We then denote by SPJU-A the restricted class of queries consisting of any SPJU-expression followed possibly by just one application of AGG. This corresponds to SELECT AGG(*) FROM WHERE queries (no grouping).

For the moment, we do not give a concrete semantics to $AGG_M(R)$, allowing any possible semantics where the result of $AGG_M(R)$ is a $K$-relation. We note that $AGG_M(R)$ should be defined iff $R$ is a $K$-relation with one attribute whose values are in $M$.

What properties do we expect of a reasonable semantics for SPJU-A (including, of course, a semantics for $AGG_M(R)$)? A basic sanity check is

Set/Bag Compatibility  The semantics coincides with the usual one when $K = \mathbb{B}$ (sets) and $M = MAX$ or $MIN$, and when $K = \mathbb{N}$ (bags) and $M = SUM$ or $PROD$.

Note that we associate min and max with sets and sum and product with bags. Min and max work fine with bags too, but we get the same result if we convert a bag to a set (eliminate duplicates) and then apply them. Sum and product (in the context of other operations such as projection) require us to use bag semantics in order to work properly. This is well-known, but our general approach sheds further light on the issue by discussing such “compatibility” for arbitrary semirings and monoids in Section 3.4.

As discussed in the introduction, a fundamental desideratum with many applications is commutation with homomorphisms. Note that a semiring homomorphism $h : K \to K'$ naturally extends to a mapping $h_{Rel}$ : $K$-Rel $\to$ $K'$-Rel via $h_{Rel}(R) = h \circ R$ (i.e. the homomorphism is applied on the annotation of every tuple), which then further extends to $K$-databases. With this, the second desired property is

Commutation with Homomorphisms  Given any two commutative semirings $K, K'$ and any homomorphism $h : K \to K'$, for any query $Q$, its semantics on $K$-databases and on $K'$-databases satisfy $h_{Rel}(Q(D)) = Q(h_{Rel}(D))$ for any $K$-database $D$.

It turns out that this property determines quite precisely the way in which tuple annotations are defined. We say that the semantics of an operation $\Omega$ on $K$-databases is algebraically uniform with respect to the class of commutative semirings if the annotations of the output $\Omega(D)$ are defined by the same (for all $K$) $\{+_K, \cdot_K, 0_K, 1_K\}$-expressions, where the elements in the expressions are the annotations of the input $D$. One can see that the definition of the SPJU-algebra is indeed algebraically uniform and was shown in [24, 17] to commute with homomorphisms. The connection between the two properties is general (proof deferred to the Appendix):

Proposition 3.1. A semantics commutes with homomorphisms iff it is algebraically uniform.

After stating two of the desired properties, namely set/bag compatibility and commutation with homomorphisms, we can already show that it is not possible to give a satisfactory semantics to the SPJU-A algebra within the framework used in [24] for the SPJU-algebra.

Proposition 3.2. There is no $K$-relation semantics for MAX- (or MIN-) aggregation that is both set-compatible and commutes with homomorphisms. Similarly, there is no $K$-relation semantics for SUM-aggregation that is both bag-compatible and commutes with homomorphisms.

Proof. Assume by contradiction the existence of such semantics. Consider the $\mathbb{N}[X]$-relation $R$ with one attribute and two tuples with values 10 and 20, with the corresponding tuple annotations being $x, y \in X$. Let $R'$ be $AGG_{MAX}(R)$ according to the assumed semantics; $R'$ is also an $\mathbb{N}[X]$-relation. Because a tuple $t$ with a value 10 is a possible answer to the MAX-aggregation (when we set $y = 0$) it must occur in $supp(R')$. Let $p \in \mathbb{N}[X]$ be the annotation of the tuple $t$ (having value 10) in $R'$. By algebraic uniformity the only variables that can occur in $p$ are $x$ and $y$, and we consequently denote it $p(x, y)$. Consider two homomorphisms $h', h'' : \mathbb{N}[X] \to \mathbb{B}$ defined by $h'(x) = h'(y) = \top$ and by $h''(x) = \top$, $h''(y) = \bot$. Applying $AGG_{MAX}$ to $h'_{Rel}(R)$ and $h''_{Rel}(R)$ should, by set-compatibility, work as usual. Hence, by commutation with homomorphisms, $h'(p) = \bot$ and $h''(p) = \top$. Functions on $\mathbb{B}$ defined by polynomials in $\mathbb{N}[X]$ are monotone in each variable. But $\bot = h'(p(x, y)) = p(h'(x), h'(y)) = p(\top, \top)$ and $\top = h''(p(x, y)) = p(h''(x), h''(y)) = p(\top, \bot)$, in contradiction to the monotonicity.

Alternatively, one may consider going beyond semirings, to algebraic structures with additional operations. We have briefly explored the use of “negative” information in the introduction. As we show there, one could use the ring structure on $\mathbb{Z}[X]$ (the additional subtraction operation) or the boolean algebra structure on $BoolExp(X)$ (the additional complement operation) but the use of a negative operation does not avoid the need to enumerate in separate tuples of the answer all the possible aggregation results given by subsets of the input. In the case of summation, at least, there are exponentially many such tuples. We reject such an approach and we state as an additional desideratum:

Poly-Size Overhead  For any query $Q$ and database $D$, the size of $Q(D)$, including annotations, should be only polynomial in the size of $D$.

We shall next show a semantics for the SPJU-A algebra that satisfies all three properties we have listed.

3.2 Annotations ⊗ values and SPJU-A

Let us fix a commutative monoid $M$ (for aggregation) and a commutative semiring $K$ (for annotation). The inputs of our queries are as before: $K$-databases whose domain $D$ includes the values $M$ over which we aggregate. However, the outputs are more complicated. The basic idea for the semantics of aggregation was already shown in Section 2.2 where it is assumed that the domain of aggregation has a $K$-semimodule structure. As we have shown in Section 2.3, we can give a tensor product construction that embeds $M$ in the $K$-semimodule $K \otimes M$ (note that this embedding is not always faithful, as discussed in Section 3.4).

For the output relations of our algebra queries, we thus need results of aggregation (i.e., the elements of $K \otimes M$) to also be part of the domain out of which the tuples are constructed. Thus for the output domain we will assume that $K \otimes M \subseteq D$, i.e. the result “combines annotations with values”. The elements of $M$ (e.g., real numbers for sum or max aggregation) are still present, but only via the embedding $\iota : M \to K \otimes M$ defined by $\iota(m) = 1_K \otimes m$.

Having annotations from $K$ appear in the values will change the way in which we apply homomorphisms to query results, so to emphasize the change we will call $(M, K)$-relations the $K$-annotated relations whose data domain $D$ includes $K \otimes M$. To summarize, the semantics of the SPJU-A algebra will map databases of $K$-relations (with $M \subseteq D$) to $(M, K)$-relations (with $K \otimes M \subseteq D$).

As we define the semantics of the SPJU-A algebra, we first note that for selection, projection, join and union the definition is the same as for the SPJU-algebra on $K$-databases. The last step of the query is aggregation, denoted $AGG_M(R)$, and is well-defined iff $R$ is a $K$-relation with one attribute whose values are in the $M$ subset of $D$. To apply the definition that uses the semimodule structure (shown in Section 2.2), we convert $R$ to an $(M, K)$-relation $\iota(R)$ by replacing each $m \in M$ with $\iota(m) = 1_K \otimes m \in K \otimes M$. Then, if $supp(R) = \{m_1, \ldots, m_n\}$ and $R(m_i) = k_i \in K$, $i = 1, \ldots, n$ (i.e., each $m_i$ is annotated with $k_i$) we define $AGG_M(R)$ as a one-attribute relation with one tuple, whose annotation is $1_K$ and whose content is $SetAgg_{K \otimes M}(\iota(R))$, which is equal to

$k_1 *_{K \otimes M} \iota(m_1) +_{K \otimes M} \cdots +_{K \otimes M} k_n *_{K \otimes M} \iota(m_n) = k_1 \otimes m_1 +_{K \otimes M} \cdots +_{K \otimes M} k_n \otimes m_n$

We define the annotation of the only tuple in the output of $AGG_M$ to be $1_K$, since this tuple is always available. However, the content of this tuple does depend on $R$. For example, even when $R$ is empty the output is not empty: by the semimodule laws, its content is $0_{K \otimes M} = \iota(0_M)$.

Commutation with Homomorphisms. We have explained in Section 2.3 how to lift a homomorphism $h : K \to K'$ to a homomorphism $h^M : K \otimes M \to K' \otimes M$. Via this we can lift $h$ to a homomorphism $h_{Rel}$ on $(M, K)$-relations: let $R$ be such a relation and recall that some values in $R$ are elements of $K \otimes M$, and the annotations of these tuples are elements of $K$. Then $h_{Rel}(R)$ denotes the relation obtained from $R$ by replacing every $k \in K$ with $h(k)$, and additionally replacing every $k \otimes m \in K \otimes M$ with $h^M(k \otimes m)$. All other values in $R$ stay intact. Applying $h_{Rel}$ on an $(M, K)$-database $D$ amounts to applying $h_{Rel}$ on each $(M, K)$-relation in $D$.

We can now state the main result for our SPJU-A algebra:

Theorem 3.3. Let $K, K'$ be semirings, $h : K \to K'$, $Q$ an SPJU-A query and let $M$ be a commutative monoid. For every $(M, K)$-database $D$, $Q(h_{Rel}(D)) = h_{Rel}(Q(D))$ if and only if $h$ is a semiring homomorphism.

The proof is by induction on the query structure, and is straightforward given that for the constructs of SPJU queries homomorphism commutation was shown in [24], while commutation for the new $AGG_M$ construct follows directly from the definition.

Example 3.4. Consider the following $\mathbb{N}[X]$-relation $R$:

Sal   (annotation)
20    $r_1$
10    $r_2$
30    $r_3$

Let $M$ be some commutative monoid; then $AGG_M(R)$ consists of a single tuple with value $r_1 \otimes 20 +_{K \otimes M} r_2 \otimes 10 +_{K \otimes M} r_3 \otimes 30$. The intuition is that this value captures multiple possible aggregation values, each of which may be obtained by mapping the $r_i$ annotations to $\mathbb{N}$, standing for the multiplicity of the corresponding tuple. The commutation with homomorphisms allows us to first evaluate the query and only then map the $r_i$'s, changing directly the expression in the query result. For example, if $M = SUM$ and we map $r_1$ to 1, $r_2$ to 0, $r_3$ to 2, we obtain $1 \otimes 20 +_{K \otimes M} 2 \otimes 30 = 1 \otimes 20 +_{K \otimes M} 1 \otimes 30 +_{K \otimes M} 1 \otimes 30 = 1 \otimes 80$ (which corresponds to the $M$ element 80). As another example, the commutation with homomorphisms allows us to propagate the deletion of the first tuple in $R$, by simply setting in the aggregation result $r_1 = 0$ (keeping the other annotations intact) and obtaining $2 \otimes 30 = (1 + 1) \otimes 30 = 1 \otimes 30 + 1 \otimes 30 = 1 \otimes (30 + 30) = 1 \otimes 60$.

We further demonstrate an application for security.

Example 3.5. Consider the following relation $R$, annotated by elements from the security semiring $\mathbb{S}$.

Sal   (annotation)
20    $S$
10    $1_S$
30    $S$

Recall (from Section 2.1) the order relation $1_S < C < S < T < 0_S$; a user with credentials $cred$ can only view tuples annotated with a security level equal to or less than $cred$. Now let $M = MAX$ and we obtain: $AGG_{MAX}(R) = S \otimes 20 +_{K \otimes M} 1_S \otimes 10 +_{K \otimes M} S \otimes 30 = S \otimes (20 +_{MAX} 30) + 1_S \otimes 10$ and we get $S \otimes 30 + 1_S \otimes 10$.

Assume now that we wish to compute the query results as viewed by a user with security credentials $cred$. A naive computation would delete from $R$ all tuples that require higher credentials, and re-evaluate the query (which in general may be complex). But observe that the deletion of tuples is equivalent to applying to $R$ a homomorphism that maps every annotation $t > cred$ to 0, and $t \leq cred$ to 1. Using homomorphism commutation we can do better by applying this homomorphism only on the result representation (namely $S \otimes 30 + 1_S \otimes 10$). For example, for a user with credentials $C$, we map $S$ to 0 and $1_S$ to 1, and obtain $0 \otimes 30 + 1 \otimes 10 = 1 \otimes 10$; similarly for a user with credentials $S$ we get $1 \otimes 30 + 1 \otimes 10 = 1 \otimes (30 +_{MAX} 10) = 1 \otimes 30$.

From the above definition of the semantics for aggregation, it is obvious that the poly-size overhead property is fulfilled. Indeed, consider the case of provenance for summation as in Example 3.4, and compare it to the naive representation provided in the Introduction. Instead of having to list all (exponentially many) options for the sum of salaries, we used an expression in $K \otimes SUM$ that is of linear size with respect to the input to the aggregation. As exemplified, the possible aggregate answers now correspond to different valuations for the provenance tokens, applied to this expression.

3.3 Group By

So far we have considered aggregation in a limited context, where the input relation contains a single attribute. In common cases, however, aggregation is used on arbitrary relations and in conjunction with grouping, so we next extend the algebra to handle such an operation. The idea behind the construction is quite simple: we separately group the tuples according to the values of their “group-by” attributes, and the aggregated values for each such group are computed similarly to the computation for the AGG operator. When considering the annotation of the aggregated tuple, we en- [...]

[...] and a query $GB_{\{Dept\},Sal} R$, where the monoid used is SUM. The result (denoted $R'$) is:

Dept   Sal                                            (annotation)
d1     $r_1 \otimes 20 +_{K \otimes SUM} r_2 \otimes 10$   $\delta_K(r_1 +_K r_2)$
d2     $r_3 \otimes 10$                                $\delta_K(r_3)$

Each aggregated value (for each department) is computed very similarly to the computation in Example 3.4. Consider the provenance annotation of the first tuple: intuitively, we expect it to be $1_K$ if at least one of the first two tuples of $R$ exists, i.e. if at least one out of $r_1$ or $r_2$ is non-zero.
Indeed 1 2 counter a technical difficulty: we want this annotation to the expression is δ(r + r ) and if we map r ,r to e.g. 2 1 K 2 1 2 be equal 1K if the input relation includes at least one tuple and 1 respectively, we obtain δN(3)=1. inthecorrespondinggroup,and0 otherwise(forintuition, K We use SPJU-AGB as the name for relational algebra consider the case of bag relations, in which the aggregated with the two new operators AGG and GB. We note that resultcanhaveatmostmultiplicity1);weconsequentlyen- the poly-size overhead property is still fulfilled for queries rich our structure to include an additional construct δ that in SPJU-AGB ; commutation with homomorphism also ex- will capturethat, as follows: tendsto SPJU-AGB(see proof in theAppendix). Recallthatanadditionaldesideratumfromthesemantics Definition 3.6. A (commutative) δ-semiring is an alge- was bag / set compatibility. Recall that sets and bags are braicstructure(K,+ ,· ,0 ,1 ,δ )where(K,+ ,· ,0 ,1 ) K K K K K K K K K modeled byK =N and K =B respectively. Wenext study isacommutativesemiringandδ :K →K isaunaryoper- K compatibilityinamoregeneralway,forarbitraryK andM. ationsatisfyingthe“δ-laws”δK(0K)=0K andδK(n1K)=1K foralln≥1. IfK andK′ areδ-semiringsthenahomomor- 3.4 Annotation-aggregation compatibility phism between them is a semiring homomorphism h:K → The first desideratum we listed was an obvious sanity K′, for which we have in addition h(δ (k))=δ (h(k)). K K′ check: whateversemanticswedefine,whenspecializedtothe familiar aggregates of max, min and summation, it should The δ-laws completely determine δB and δN. But they producefamiliar results. Sincewe had to takean excursion leavealotoffreedomforthedefinitionofδK inothersemir- through the tensor product K⊗M, this familiarity is not ings; in particular for the security semiring, a reasonable immediate. However, the following proposition holds (its choice for δS is theidentity function. correctness will follow from theorems 3.12 and 3.13). 
As with any equational axiomatization, we can construct Proposition 3.9. In the following constructions: B ⊗ thecommutativeδ-semiringfreely generated byasetX,de- noted N[X,δ], bytaking thequotientof theset of MAX, B⊗MIN, and N⊗SUM, ι : M → K ⊗M where ι(m)=1 ⊗m is a monoid isomorphism. {+,·,0,1,δ}-algebraic expressionsbythecongruencegener- K ated by the equational laws of commutative semirings and andthismeansoursemanticssatisfiestheset/bagcompati- the δ-laws. For example, if e and e′ are elements of N[X,δ] bility property because in thesecases computingin K⊗M (i.e., congruence classes of expressions given by some rep- exactly mirrors computingin M. resentatives) then e+N[X,δ]e′ is thecongruence class of the Butofcourse,wearealsointerestedinworkingwithother expression e+e′. The elements of N[X,δ] are not standard semirings, in particular the provenance semiring, for which polynomials but certain subexpressions can be put in poly- N[X]⊗MandM areingeneralnotisomorphic(inparticular, nomial form, for exampleδ(2+3xy2) or 3+2δ(x2+2y)z2. ι is not surjective and thus not an isomorphism). In fact, We are now ready to define the group by (denoted GB) the whole point of working in N[X]⊗MAX, for example, operation; subsequently we exemplify its use, including in is to add annotated aggregate computations to the domain particular therole of δ: of values. Most of these do not correspond to actual real numbers as e.g. ι(MAX) is a strict subset of N[X]⊗MAX Definition 3.7. LetRbeaK-relationonsetofattributes (andsimilarlyι(SUM)isastrictsubsetofN[X]⊗SUMetc.). U, let U′ ⊆ U be a subset of attributes that will be grouped However,whenprovenancetokensarevaluatedtoobtainset and U′′ ∈U be the subset of attributes with values in M (to (orbag)instances,wecangobackintoι(MAX)(orι(SUM) be aggregated). We assume that U′∩U′′ =∅. For a tuple t, etc.), and then we should obtain familiar results by“strip- we define T ={t′ ∈supp(R)|∀u∈U′ t′(u)=t(u)}. ping off”the ι. 
It turns out that this works correctly with We then define the aggregation result R′ = GBU′,U′′(R) N[X]butnotnecessarilywitharbitrarycommutativesemir- as follows: ingsK. Thereason isthatnotonlythatιisnotanisomor- phism,butingeneralitmaybebeunfaithful(notinjective). Indeedι:SUM→B⊗SUMis not injective: R′(t)=δK(Σt′∈TR(t′)) ∀Tu6=∈φU,′a′ndt(u)=Σt′∈TR(t′)⊗t′(u) ι(4)=ι(2+2)=ι(2)+K⊗M ι(2)=⊤⊗2+K⊗M ⊤⊗2= 0 Otherwise. =(⊤∨⊤)⊗2=⊤⊗2=ι(2)  This is not surprising, as it is related to the well-known Example 3.8. Consider the relation R: difficulty of making summation work properly with set se- mantics. Ingeneral,wethusdefinecompatibilityasfollows: Dept Sal d1 20 r1 Definition 3.10. We say that a commutative semiring d1 10 r2 K and a commutative monoid M are compatible if ι is in- d2 10 r3 jective. The point of the definition is that when there is compat- abilitytousetheidentitiesthatholdinSandtothusreduce ibility, we can work in K ⊗M and whenever the results the size of annotations in query results. We can do better belongtoι(M),wecansafelyreadthemasfamiliaranswers by taking the quotient of N[S] by the smallest congruence from M. Wegivethreetheoremsthatcapturesomegeneral containing thefollowing identities: conditions for compatibility. First, we note that if we work with a semiring in which • ∀s1,s2 ∈S s1 ≥s2=⇒s1·N[S]s2 =s1. +K isidempotent,suchasBorS,acompatiblemonoidmust • ∀c∈N,s∈S 0·N[S]s=c·N[S]0S =0. also beidempotent (e.g. MIN or MAX but not SUM): Proposition 3.11. LetK besomecommutativesemiring • ∀c∈N c·N[S]1S =c. suchthat+ isidempotent,andletM besomecommutative We will denote the resulting quotient semiring by SN. It K monoid. IfM iscompatiblewithK,then+M isidempotent. is easy to check that the faithfulness of the embeddings of Proof. ι(m)=1 ⊗m=(1 + 1 )⊗m=1 ⊗m+ N and S in N[S] is preserved by taking the quotient. Most K K K K K K⊗M importantly,SN is still homomorphicto N. Thus, 1 ⊗m=1 ⊗(m+ m)=ι(m+ m) K K M M Corollary 3.15. 
SN is compatible with any commuta- Nicely enough, idempotent aggregations are compatible tive monoid M. with every annotation semiring K that is positive with re- spect to +K. K is said to be positive with respect to +K Example 3.16. Consider the SUM monoid. Let R,S be if k +K k′ = 0K ⇒ k = k′ = 0K. For instance, B, N, S thefollowingS-relationswhichbytheembeddingofSwetake and N[X] are such semirings (but not (Z,+,·,0,1)). The as SN-relations: following theorem holds: A A Theorem 3.12. IfM isacommutativemonoidsuchthat 30 T 30 S +M is idempotent, then M is compatible with any commu- 10 1S tative semiring K which is positive with respect to + . K R Proof sketch. Wedefineh:K⊗M →M as S h(Pi∈Iki⊗mi) = Pj∈Jmj where J = {j ∈ I|kj 6= 0}. Consider the query: AGG(R ∪ΠS.A(S ⊲⊳ R)). Ignoring We can show that h is well-defined (details deferred to the the annotations, the expected result (under bag semantics) Appendix); since ∀m∈M h◦ι(m)=m, ι is injective and is70. Workingwithinthe(compatible) semantics definedby thusK and M are compatible. SN⊗SUM, the query result contains an aggregated value of (T·SNS+SNS)⊗30+S⊗10. Wecan further simplifythisto For general (and in particular non-idempotent) monoids T⊗30+S⊗30+S⊗10=T⊗30+S⊗40. Thismeans that (ine.gp.arStUicMula)rwhoeldidsefnotrifNy[aXs])u,ffithciaetntalcloowndsiftoiorncoomnpKati(bwilhiticyh: ea.ngd. fwoeracaunseursewitthhecirnedveernsteiaolsfTι tthoemquaepryitrteosuNltaisn1dSNob⊗ta7i0n, Theorem 3.13. Let K be a commutative semiring. If 70. Similarly, for a user with credentials S, the query result there exists a semiring homomorphism from K to N then is mapped to 40. These are indeed the expected results. K is compatible with all commutative monoids. Note that if we would have used in the above example S Proof sketch. Let h′ be a homomorphism from K to instead of SN we would have(T+SS)=S so (T+SS)⊗30 N, and M be an arbitrary commutative monoid. We define would be thesame as S⊗30. 
For a user with credentials T a mapping h:K⊗M → M by h(Σki⊗mi) =Σh′(ki)mi. we could either use this, leading to the result of 1S⊗40, or We show in the Appendix that h is well-defined and that use the same computation done in the example, to obtain h◦ι is theidentityfunction henceι is injective. 1S⊗70. Indeed,inS⊗SUM,wehave1S⊗40=1S⊗70. This is the same phenomenon demonstrated in the beginning of Corollary 3.14. Thesemiringofprovenance polynomi- thissubsection forB,where ιisnot injective,preventingus als N[X] is compatible with all commutative monoids. from stripping it away. Now consider the security semiring S. It is idempotent, Note also that if we would have used N[S] instead of SN then we could not havedone theillustrated simplifications. andthereforenotcompatiblewithnon-idempotentmonoids such as SUM. Still, we want to be able to use S and other 4. NESTED AGGREGATIONQUERIES non-idempotent semirings, while allowing the evaluation of aggregation queries with non-idempotent aggregates. This Sofarwehavestudiedonlyquerieswheretheaggregation would work if we could construct annotations that would operatoristhelastoneperformed. Inthissectionweextend allow us to use Theorem 3.13, in other words, if we could thediscussiontoqueriesthatinvolvecomparisonsonaggre- combine annotations from S, with multiplicity annotations gate values. We first demonstrate the difficulties that arise (i.e. annotationsfromN). Weexplainnexttheconstruction in designing an algebra for such queries, then explain how of such a semiring SN (for security-bag), and its compati- to extendtheconstruction toovercome these difficulties. bilitywithanycommutativemonoidM willfollowfromthe existence of a homomorphism h:SN→N. Note. Forsimplicity,allresultsandexamplesarepresented forqueriesinwhichthecomparisonoperatorisequality(=). Constructinga compatiblesemiring. We start with the Howevertheresultscaneasilybeextendedtoarbitrarycom- semiringofpolynomialsN[S],i.e. 
polynomialswhereinstead parison predicates, that can bedecided for elementsof M. ofindeterminates(variables)wehavemembersofS,andthe coefficientsarenaturalnumbers. AlreadyN[S]iscompatible 4.1 Difficulties with any commutative monoid M, as there exists a homo- We start by exemplifying where the algebra proposed for morphismh:N[S]→N;butifweworkwithN[S]welosethe restricted aggregation queries, fails here: Example 4.1. Reconsidertherelation(denotedR′)which Theright-hand-sideisamonotone,infact continuousw.r.t. is the result of aggregation query, depicted in Example 3.8. the usual set inclusion operator, hence this equation has a Further consider a query Q that selects from R′ all tu- set-theoretic least solution (no need for order-theoretic do- select ples for which the aggregated salary equals 20. The crux maintheory). Thesolutionalsohasanobviouscommutative is that deciding the truth value of the selection condition semiring structureinduced bythat of polynomials. The so- involves interpreting the comparison operator on symbolic lutionsemiringisK =(X,+Kb,·Kb,0Kb,1Kb),andwecontinue representation ofvaluesinR′; sofar,wehavenowayofin- bytakingthequotibentonKdefinedbythefollowingaxioms. terpreting the obtained comparison expression, for instance r ⊗20+r ⊗10“equals”20, and thus we cannot decide the Forall k1,k2 ∈K,c1,c2,bc3,c4 ∈K⊗M: 1 2 b existence of tuples in the selection result. 0Kb ∼ 0K Note that in the above example, the truth value of the 1Kb ∼ 1K comparison(andconsequentlythesetoftuplesinthequery k1+Kbk2 ∼ k1+K k2 rteuspullets)indetpheen(dosriigninaanl)oinn-pmuotnroetlaotniiocnwRa:ynoonteththeaetxiifstweencmeaopf k1·Kbk2 ∼ k1+K k2 r to 1 and r to 0 then the tuple with dept. d appears in [c1 =c3] ∼ [c2 =c4](if c1 =Kb⊗Mc2, c3 =Kb⊗Mc4) 1 2 1 the query result, but if we map both to 1, it does not. The challenge that this non-monotonicity poses is fundamental, and if K and M are such that ι defined by ι(m) = 1 ⊗m andisencounteredbyanyalgebraon(M,K)-relations. 
The K isanisomorphism(andlethbeitsinverse),wefurthertake following proposition, which is the counterpart of proposi- thequotient definedby: for all a,b∈K⊗M, tion 3.2, holds(proof deferred to theAppendix): Proposition 4.2. There is no (M,K)-relation seman- (*)[a=b] ∼ 1K(if h(a)=Mh(b)) tics for nested aggregation queries with MAX-(or MIN-)- [a=b] ∼ 0K(if h(a)6=Mh(b)) aggregation that is both set-compatible and commutes with homomorphisms. Similarly for SUM-aggregation and bag- Weuse KM to denotethesemiring obtained byapplying compatibility. theaboveconstructiononasemiringK andacommutative Consequently,amoreintricateconstructionisrequiredfor monoid M. A key property is that, when we are able to nested aggregation queries. interprettheequalitiesinM,KM collapsestoK. Formally, 4.2 An Extended Structure Proposition 4.4. IfK andM aresuch that K⊗M and We start with an example of our treatment of nested ag- M are isomorphic via ι then KM =K. gregation queries, then givetheformal construction. The proof (deferred to the Appendix) is by induction on Example 4.3. Reconsider example 4.1, and recall that thestructureof elementsin KM,showing that at each step thechallengeinqueryevaluationliesincomparingelements wecan“solve”anequalitysub-expression,andreplaceitwith of K⊗M with elements of M (or K⊗M, e.g. in case of 0 or 1 . K K joins). Our solution is to introduce to the semiring K new elements,oftheform[x=y]wherex,y∈K⊗M (ifweneed Lifting homomorphisms. To conclude the description of to compare with m ∈ M, we use ι(m) instead). The result the construction we explain how to lift a semiring homo- of evaluating the query in example 4.1 (using M = SUM) morphism from h : K → K′ to hM : KM → K′M, for any will then be captured by: commutative monoid M and semirings K,K′. 
hM is de- fined recursively on the structure of a ∈ KM: if a ∈ K we define hM(a) = h(a), otherwise a = [b⊗m =c⊗m ] for Dept Sal 1 2 some b,c ∈ KM and m ,m ∈ M and we define hM(a) = d r ⊗20 δ(r + r )· 1 2 1 1 1 K 2 K hM(b)⊗m =hM(c)⊗m . Note that the application of +K⊗Mr2⊗10 (cid:2)r1⊗20+K⊗M r2⊗10=1K ⊗20(cid:3) (cid:2)a homomorp1hism hM maps2(cid:3)equalityexpressions toequality d2 r3⊗10 δ(r3)·K[r3⊗10=1K ⊗20] expressions (in which elements in K′ appear instead of el- Intuitively, since we do not know which tuples will satisfy ements of K appeared before). If K′ and M are such that the selection criterion, we keep both tuples and multiply the their corresponding ι : M → K ⊗M defined by ι(m) = provenanceannotationofeachofthembyasymbolicequality 1K ⊗M is injective, then we may“resolve the equalities”, expression. These equality expressions are kept as symbols otherwise the(new) equality expression remains. until we can embed the values in M = SUM and decide 4.3 The Extended Semantics the equality (e.g. if K = N), in which case we “replace” it by 1 if it holds or 0 otherwise. For example, given a Theextendedsemiring constructionallows ustodesign a K K homomorphism h : N[X] → N, h(r ) = h(r ) = 1, then semantics for general aggregation queries. Intuitively,when 1 2 hM(r1⊗20+K⊗Mr2⊗10) = h(r1)⊗20+K⊗Mh(r2)⊗10 = the existence of a tuple in the result relies on the result of 1⊗306=1⊗20, thus the equality expression isreplaced with a comparison involving aggregate values(as in theresult of (i.e. mapped by the homomorphism to) 0 . applying selection or joins), we multiply the tuple annota- K tion bythecorresponding equation annotation. We next define the construction formally; the idea un- In the sequel we assume, to simplify the definition, that derlying the construction is to define a semiring whose ele- thequeryaggregates andcomparesonlyvaluesofKM ⊗M ments are polynomials, in which equation elements are ad- (avaluem∈M isfirstreplacedbyι(m)=1 ⊗m). 
Inwhat K ditional indeterminates. To achieve that, we introduce for follows,letR(R ,R )be(M,KM)-relationsonanattributes 1 2 any semiring K and any commutative monoid M, the“do- set U. Recall that for a tuple t, t(u) (where u ∈ U) is the main”equation K = N[K ∪{[c = c ] | c ,c ∈ K⊗M}]. valueof theattributeuin t;also for U′ ⊆U, recall that we 1 2 1 2 b b uset|U′ todenotetherestrictionofttotheattributesinU′. Given a homomorphism h : N[X] → N we can “solve” Last, we use (KM ⊗M)U to denote the set of all tuples on the equations, e.g. if h(r ) = 1, h(r ) = 0 and h(r ) = 1 2 3 attributessetU,withvaluesfromKM⊗M. Thesemantics 2, we obtain an aggregated value of 1⊗40. Note that the follows: aggregation value is not monotone in r ,r ,r : map r to 1 1 2 3 2 (and keep r ,r as before), to obtain 1⊗20. 1. empty relation: ∀t φ(t)=0. 1 3 2. union: (R ∪R )(t)= 5. DIFFERENCE 1 2 R (t′)· [t′(u)=t(u)] if t∈supp(R ) We next show that via our semantics for aggregation, we Pt′∈supp(R1) 1 Qu∈U 1 canobtainforthefirsttimeasemanticsforarbitraryqueries +Pt′∈supp(R2)R2(t′)·Qu∈U[t′(u)=t(u)] ∪supp(R2) with difference on K-relations. We describe the obtained semantics and study some of its properties. 0 Otherwise. 5.1 Semantics forDifference 3. projection: Let U′ ⊆ U, and let T = {t|U′ | t ∈ We first note that difference queries may be encoded as supp(R)}. Then ΠU′(t)= querieswithaggregation,usingthemonoidB=({⊥,⊤},∨,⊥) R(t′)· [t(u)=t′(u)] if t∈T (thefollowing encodingwas inspired by [29b, 9]): P0 t′∈Supp(R) Qu∈U′ Otherwise. R−S =Π⊲⊳aa11..,..a..na{n(cid:0)(GRB×{a⊥1,.b.).a}n.},b(R×⊥b∪S×⊤b)(cid:1)  ⊥ and⊤ arerelationsonasingleattributeb,containing 4. selection: If P is an equality predicate involving the b b asingletuple(⊥)and(⊤)respectively,withprovenance1 . equation of some attribute u∈U and a value m∈M K Using thesemantics of Section 4, we obtain a semantics for then (σP(R))(t)=R(t)·[t(u)=ι(m)]. thedifference operation. 5. 
valuebasedjoin: WeassumeforsimplicitythatR and Interestingly, we next show that the obtained semantics 1 R havedisjointsetsofattributes,U andU resp.,and canbecapturedbyasimpleandintuitiveexpression. First, 2 1 2 that the join is based on comparing a single attribute we note that since B is idempotent, every semiring K posi- of each relation. Let u′1 ∈ U1 and u′2 ∈ U2 be the tive with respect tob+K is compatible with B (see Theorem attributesto join on. Forevery t∈(KM ⊗M)U1∪U2: 3.12). The following proposition then holdsbfor every K,K′ (R1 ⊲⊳R1.u1=R2.u2 R2)(t)= and every two (B,K)-relations R,S (proof deferred to the R1(t|U1)·R2(t|U2)·K[t(u1)=t(u2)]. Appendix): b Proposition 5.1. Foreverytuplet,semiringsK,K′such Simple Variants. Natural join (when U1 and U2 are that K′bB ⊗ B is isomorphic to B via ι(m) = 1K ⊗ m, if not necessarily disjoint) is captured by a similar ex- h:K →K′bis a semiring homomborphism then: pression, with the equality sub-expression on the at- hbB([(R−S)(t)])=hbB([S(t)⊗⊤=0]· R(t)). tributes common to U and U ; join on multiple val- K 1 2 uesiscapturedbymultiplicationbythecorresponding The obtained provenance expression is thus“equivalent” multiple equality expressions; in the representation of (in the precise sense of Proposition 5.1) to [S(t) ⊗ ⊤ = cartesian product (denoted by ×) no equality expres- 0]·R(t). The following lemma helps us to understand the sions appear (only R1(t|U1)·R2(t|U2)). meaning of theobtained equalityexpression: 6. Aggregation: AGG (R)(t)= Lemma 5.2. ForeverysemiringK whichispositivew.r.t. M + and h :K → B, hM([S(t)⊗⊤=0]) = ⊤ iff h(S(t)) =  1 t(u)=Pt′∈supp(R)R(t′)∗K⊗Mt′(u) ⊥K.  Proof. Itisclearthatifh(S(t))=⊥,hM([S(t)⊗⊤=0])= 0 otherwise [h(S(t))⊗⊤=0]= [⊥⊗⊤=0]= ⊤. For the other direc-  7. Group By: Let U′ ⊆U be a subset of attributes that tion, assume that h(S(t)) = ⊤. Thus [h(S(t))⊗⊤=0] = will be grouped and u ∈ U\U′ be the aggregated at- [⊤⊗⊤=0]. SinceBandBarecompatible,ι:B→B⊗Bis tribute. 
Then for every t∈(KM ⊗M)U′∪{u}: injective;thusι(⊥)6=ι(⊤)b;consequentlyhM([S(t)⊗⊤=b0])= [⊤⊗⊤=0]=⊥. GBU′,uR(t)= Consequently,thesemanticscanbeinterpretedasfollows: δ((ΠU′R)(t|U′)) t(u)=Pt′∈supp(R)(R(t′)·K a tuple t appears in the result of R−S if it appears in R,  Qu∈U′[t′(u)=t(u)])∗K⊗Mt′(u) rbeustuldtooefsRn−otSa,piptecaarrriinesSit.sWorihgeinnatlhaenntuoptaletioanppfreoamrsRin. tIh.ee. 0 otherwise theexistence of t in S is used as a boolean condition. Example 5.3. Let R,S be the following relations, where It is straightforward to show that the algebra satisfies R contains employees andtheir departments andS contain- set/bag compatibility and poly-sizeoverhead;commutation ing departments that are designated to be closed: with homomorphism is provedin theAppendix. Example 4.5. Reconsider the relation in Example 4.3, ID Dep and let us perform another sum aggregation on Sal. The 1 d t Dep 1 1 value in the result now contains equation expressions: 2 d t d t 1 2 1 4 δ(r + r )· r ⊗20+ r ⊗10=1 ⊗20 2 d t 1 K 2 K(cid:2) 1 K⊗M 2 K (cid:3) 2 3 ∗ r ⊗20+ r ⊗10 S K⊗M(cid:0) 1 K⊗M 2 (cid:1) + δ(r )· [r ⊗10=1 ⊗20]∗ r ⊗10 R K⊗M 3 K 3 K K⊗M 3
