Enhancing Histograms by Tree-Like Bucket Indices⋆

Francesco Buccafurri,1 Gianluca Lax,1 Domenico Saccà,2 Luigi Pontieri,2 Domenico Rosaci1

1 DIMET Dept., University "Mediterranea" of Reggio Calabria, Italy
e-mail: {bucca,lax,domenico.rosaci}@unirc.it
2 DEIS Dept., University of Calabria, & ICAR-CNR, Rende, Italy
e-mail: [email protected], [email protected]

Abstract Histograms are used to summarize the contents of relations into a number of buckets for the estimation of query result sizes. Several techniques (e.g., MaxDiff and V-Optimal) have been proposed in the past for determining bucket boundaries which provide accurate estimations. However, while search strategies for optimal bucket boundaries are rather sophisticated, not much attention has been paid to estimating queries inside buckets, and all of the above techniques adopt naive methods for such an estimation. This paper focuses on the problem of improving the estimation inside a bucket once its boundaries have been fixed. The proposed technique is based on the addition, to each bucket, of 32 bits of additional information (organized into a 4-level tree index), storing approximate cumulative frequencies at 7 internal intervals of the bucket. Both theoretical analysis and experimental results show that, among a number of alternative ways to organize the additional information, the 4-level tree index provides the best frequency estimation inside a bucket. The index is later added to two well-known histograms, MaxDiff and V-Optimal, obtaining the non-obvious result that, despite the spatial cost of 4LT, which reduces the number of allowed buckets once the storage space has been fixed, the original methods are strongly improved in terms of accuracy.
Key words histograms – range query estimation – approximate OLAP

⋆ An abridged version of this paper appeared in the Proceedings of the International Conference on Data Engineering (ICDE 2002), IEEE Computer Society 2002, ISBN 0-7695-1531-2 [3].

1 Introduction

A histogram is a lossy compression technique used for efficiently representing a relation. It is based on the partition of one of the relation attributes into buckets and the storage, for each of them, of a small amount of summary information in place of the detailed data. Important application domains of histograms include the estimation of query selectivity [12,14,18,13,22], temporal databases, where histograms are used for improving join processing [20], and statistical databases, where histograms represent a method for approximating probability distributions [15]. Recently, histograms have received renewed interest, mainly because they can be effectively used for approximate query answering in order to reduce the query response time in on-line decision support systems and OLAP [17], as well as for the problem of reconstructing original data from aggregate information [2] and, finally, in the context of Data Streams [9,1,7,10].

For a given storage space reduction, the problem of determining the best histogram is crucial. Indeed, different partitions lead to dramatically different errors in reconstructing the original data distribution, especially for skewed data. To better explain the problem, consider a typical case of recovering original data from a histogram: the evaluation of range queries. Think of a histogram defined on the attribute X of a relation R as a set of non-overlapping intervals of X covering all values assumed by X in R. To each of these intervals, say B, is associated the number of occurrences in R (called frequency) whose value of X belongs to B; the interval and its frequency are stored in a data structure called bucket.
A range query, defined on an interval Q of X, evaluates the number of occurrences in R with value of X in Q. Thus, buckets embed a set of pre-computed disjoint range queries capable of covering the whole active domain of X in R (by active we mean the attribute values actually appearing in R). As a consequence, the histogram does not give, in general, the possibility of evaluating exactly a range query that does not correspond to one of the pre-computed embedded queries. In other words, while the contribution to the answer coming from the sub-ranges coinciding with entire buckets can be returned exactly, the contribution coming from the sub-ranges which partially overlap buckets can only be estimated, since the actual data distribution inside the buckets is not available.

It turns out that it is convenient to define the boundaries of buckets in such a way that the estimation error of the non-precomputed range queries is minimized (e.g., by avoiding large frequency differences inside a bucket). In other words, among all possible sets of pre-computed range queries, we look for the set which guarantees the best estimation of the other (non-precomputed) queries, once a technique for estimating such queries is defined. This issue has been investigated for some decades, and a large number of techniques for arranging histograms have been proposed [5,6,12,14,18,8,13].

All these techniques adopt simple methods for estimating non-precomputed queries (actually, their portions partially overlapping buckets). The most significant approaches are the continuous value assumption (often denoted in this paper by CVA) [19], where the estimation is made by linear interpolation on the whole domain of the bucket, and the uniform spread assumption (denoted by USA) [18], which assumes that values are located at equal distance from each other so that the overall frequency sum can be equally distributed among them.
An interesting problem is understanding whether, by exploiting information typically contained in histogram buckets, and possibly adding a small amount of summary information, the frequency estimation inside buckets, and hence the histogram accuracy, can be improved. This paper focuses on this problem. Starting from the limits of CVA and USA studied in [2], we propose to use some additional storage space (32 bits per bucket) in order to describe the distribution inside a bucket in an approximate yet very effective way.

The first step is studying how to use these 32 additional bits in order to maximize benefits in terms of accuracy. Our analysis shows that the trivial technique of partitioning the bucket into 8 equal-size parts and encoding each corresponding sum by 4 bits leads to high scaling errors, since each sum must be represented as a fraction of the overall sum of the bucket. Our proposal then relies on the idea of storing partial sums internal to the bucket in a hierarchical fashion, using a tree-like index (occupying 32 bits). This way, the sum contained in a given tree node can be represented as a fraction of the sum contained in the parent node, which is a value (reasonably) smaller than the overall sum of the bucket. It turns out that the encoding length may decrease as the level of the tree increases. The benefits we expect from this approach concern the scaling error. A crucial point, however, is deciding how to arrange the tree, that is, how deep the index should go. Of course, the higher the resolution, the larger the number of embedded precomputed range queries (internal to the buckets). Hence, we expect better accuracy as the resolution increases. However, increasing the resolution reduces the number of bits available for encoding nodes and thus amplifies scaling errors. We study this trade-off by considering the two tree indices with 32 bits that are possible from a practical point of view, which we call 3LT and 4LT, with depth 3 and 4, respectively.
The analysis leads to the conclusion that the 4LT index represents the best solution.

The next step is then understanding whether this improvement of accuracy for the estimation inside buckets can really give benefits in terms of accuracy of a histogram arranged by one of the existing techniques. This problem is not straightforward: to mention the most evident aspect, 4LT buckets use 32 bits more than CVA ones and therefore, for a fixed storage space, allow a smaller number of buckets. The last part of this paper is thus devoted to evaluating the effects of combining the 4LT technique with existing methods for building histograms. Through an extensive experimental comparative analysis conducted, for a fixed storage space, over several data sets, both synthetic and real-life, we show that 4LT significantly improves the accuracy of the considered histograms. Therefore this paper, besides giving the specific contribution of proposing a technique (i.e., 4LT) for accurately estimating range queries internal to buckets, proves the more general result that going beyond classical techniques (i.e., CVA and USA) for the estimation inside buckets may give concrete improvements of histogram accuracy.

It is worth noting that the choice of MaxDiff and V-Optimal histograms for testing our method does not limit the generality of the 4LT index, which is applicable to every bucket-based histogram¹. Nor does this choice limit the validity of our comparison, since MaxDiff and V-Optimal, despite their age, are still considered points of reference in this scientific community due to their accuracy [11].

The paper is organized as follows. In Section 2, we introduce some preliminary definitions. The comparison, both experimental and theoretical, among a number of techniques, including our tree-based methods (3LT and 4LT), for estimating range queries inside a bucket is reported in Section 3.
Therein, 3LT and 4LT are also presented. This analysis shows that 4LT has the best performance in terms of accuracy. Thus, 4LT can be combined with every bucket-based histogram to increase its accuracy. Section 4 presents a large set of experiments, conducted by applying 4LT to two well-known methods, MaxDiff and V-Optimal [18]. Results show high improvements in the estimation of range queries w.r.t. the original methods; of course, the comparisons are made at parity of storage consumption, so that the revised methods use fewer buckets to compensate for the additional storage required by the 4LT indices. The 4LT technique provides good results also when combined with the very simple method EquiSplit, which consists in dividing the histogram value domain into buckets of the same size so that the bucket boundaries need not be stored, thus obtaining a very high number of buckets at the same compression rate. We draw our conclusions in Section 5.

2 Basic Definitions

Given a relation R and an attribute X of R, a histogram for R on X is constructed as follows. Let $U = \{u_1, \ldots, u_m\}$ be the set of all possible values (the domain) of X and let $u_i < u_{i+1}$, for each i, $1 \le i < m$. The frequency set for X is the set $F = \{f(u_1), \ldots, f(u_m)\}$ such that for each i, $1 \le i \le m$, $f(u_i)$ is the number of occurrences of the attribute value $u_i$ in the relation R. The cumulative frequency set $S = \{s_1, \ldots, s_m\}$ contains the value $s_i = \sum_{j=1}^{i} f(u_j)$ for each attribute value $u_i$. The value set $V = \{u_i \in U \mid f(u_i) > 0\}$ is the active domain of X in R, as it consists of all attribute values actually occurring in the relation R (non-null values).

¹ There are histograms, like wavelet-based ones, that are not based on a set of buckets.
Given any $u_i$ in V, the spread $d_i$ of $u_i$, for $1 \le i < n$, is defined as 1 if $u_i$ is the last non-null value, or otherwise as the difference $u_j - u_i$, where $u_j$ is the first non-null value for which $u_j > u_i$ (i.e., $d_i$ is the distance from $u_i$ to the next non-null value).

A bucket B for R on X is a 4-tuple $\langle \mathit{inf}, \mathit{sup}, t, c \rangle$, where $u_{\mathit{inf}}$ and $u_{\mathit{sup}}$, $1 \le \mathit{inf} \le \mathit{sup} \le m$, are the boundaries of the domain range pertaining to the bucket, t is the number of non-null values occurring in the range, and $c = \sum_{i=\mathit{inf}}^{\mathit{sup}} f(u_i)$ is the sum of frequencies of all values in the range. We say that the bucket B is 1-biased if $u_{\mathit{sup}}$ is not null; if also $u_{\mathit{inf}}$ is not null, then we say that B is 2-biased.

A histogram H for R on X is an h-tuple $\langle B_1, B_2, \ldots, B_h \rangle$ of buckets such that: (1) for each $1 \le i < h$, the upper bound of $B_i$ precedes the lower bound of $B_{i+1}$, and (2) $u \in V$ implies $u \in B_i$, for some i, $1 \le i \le h$. Condition (1) guarantees that buckets do not overlap each other, and condition (2) enforces that every non-null value be hosted by some bucket. Classically, histograms have 2-biased buckets; sometimes, for storage optimization, 2-biased buckets are made 1-biased by replacing the lower bound of each bucket with the successor, in the domain, of the upper bound of the preceding bucket.

A classical problem on histograms is: given a histogram H and a (range) query of the form $u_j \le X \le u_i$, $1 \le j \le i \le m$, estimate the overall frequency $\sum_{k=j}^{i} f(u_k)$ in the range from $u_j$ to $u_i$.

3 Estimation Inside a Bucket

In this section we investigate in depth the problem of frequency estimation inside buckets. First of all, we present the two classical techniques (CVA and USA), discuss their limitations and propose some simple alternatives. Then we introduce a novel technique which is based on a 4-level tree index storing approximate representations of the partial sums of 7 fixed bucket intervals.
Later we evaluate the accuracy of the various techniques by performing both a theoretical analysis of errors and a number of experiments on some typical sample distributions.

3.1 Notations and Problem Formulation

Let $B = \langle \mathit{inf}, \mathit{sup}, t, c \rangle$ be a bucket on an attribute X of a relation R. Without loss of generality, we assume that $\mathit{inf} = 1$ and $\mathit{sup} = b$, so that we can represent the frequency set inside the bucket as a vector F with indexes ranging from 1 to b (frequency vector of B). Similarly, the cumulative frequencies are represented by a vector S with indexes from 1 to b (cumulative frequency vector of B). Hence, for each i, $1 \le i \le b$, $F[i] \ge 0$ is the frequency of the value $u_i$, while $S[i] = \sum_{j=1}^{i} F[j]$ is the cumulative frequency. Then $c = S[b]$ is the sum of all frequencies in the bucket; moreover, for notational convenience, we assume that $S[0] = 0$.

The problem of the estimation inside a bucket can be formulated as follows: given any pair i, j, $1 \le i \le j \le b$, such that $d = j - i + 1 < b$, estimate the range query $S[j] - S[i-1] = \sum_{k=i}^{j} F[k]$. We focus our attention on the basic problem of estimating $S[d]$ (that is, we assume $i = 1$).

We now introduce the following notation. Given $1 \le i \le j \le 8$, we denote by $\delta_{i/j}$ the sum $\sum_{k=x}^{y} F[k]$, where $x = 1 + \lceil b \cdot (i-1)/j \rceil$ and $y = \lceil b \cdot i/j \rceil$. $\delta_{i/j}$ represents the frequency sum of the i-th element of the partition of B into j equal-size sub-ranges. Thus, the frequency sum for a bucket is $\delta_{1/1}$; the frequency sums for the two halves are $\delta_{1/2}$ and $\delta_{2/2}$; the frequency sums for the 4 quarters are $\delta_{i/4}$, $1 \le i \le 4$; the frequency sums for the 8 eighths are $\delta_{i/8}$, $1 \le i \le 8$, and so on.

3.2 Estimation Techniques

Next we illustrate the existing approximation techniques and discuss some additional simple approaches.

Continuous Value Assumption (CVA). The estimation of $S[d]$ is computed as $\tilde{S}[d] = \frac{d}{b} \cdot c$. In words, the partial contribution of a bucket to a range query result is estimated by linear interpolation.
As pointed out in [4,2], the above estimation coincides with the expected value of $S[d]$ when it is considered a random variable over the population of all frequency distributions in the bucket for which the overall cumulative frequency is c.

Uniform Spread Assumption (USA). The estimation of $S[d]$ is given by $\tilde{S}[d] = \left(1 + \frac{(t-1) \cdot (d-1)}{b-1}\right) \cdot \frac{c}{t}$, where t is the number of non-null attribute values in the bucket. The uniform spread assumption assumes that such values are distributed at equal distance from each other and that the overall frequency sum is equally distributed among them. Obviously, in this case the information t is necessary. We stress that, as discussed in [2], this estimation is not supported by any unbiased probabilistic model, so the assumption is rather arbitrary.

1-Biased Estimation (1b). The possibly available information on the number t of non-null elements cannot be exploited in the estimation unless some further information on the frequency distribution is either available or assumed (as for the USA estimation). We next show how to exploit the fact that a bucket is often 1-biased (i.e., $u_b$ is not null) using the probabilistic approach proposed in [2]. This approach assumes that the query is a random variable on the population of all 1-biased frequency distributions having c as overall cumulative frequency. The estimation of the range query $S[d]$ for a 1-biased bucket is given by $\tilde{S}[d] = \frac{d}{b-1} \cdot \frac{t-1}{t} \cdot c$.

2-Split Estimation (2s). We split the bucket into two parts of the same size and store the cumulative frequency of the first part, say $\delta_{1/2} = S[b/2]$; we therefore need additional storage space (typically 32 bits). We call this method 2-split, or 2s for short. Following this approach, the estimation of the range query $S[d]$ is given by $\frac{2 \cdot d}{b} \cdot \delta_{1/2}$ if $d \le \frac{b}{2}$, and by $\delta_{1/2} + \frac{2 \cdot d - b}{b} \cdot (c - \delta_{1/2})$ otherwise. Thus we use the CVA technique for each of the two halves of the bucket.

4-Split Estimation (4s).
We split the bucket into 4 parts of the same size (quarts) and store the approximate value of the cumulative frequency of each part $\delta_{i/4}$, $1 \le i \le 4$. If the additional available space is 32 bits, we use 8 bits for each approximate value, which is therefore computed as $\tilde{\delta}_{i/4} = \langle \frac{\delta_{i/4}}{c} \times (2^8 - 1) \rangle$, where $\langle x \rangle$ stands for round(x). The frequency sum for an interval d is estimated by adding the approximate values of all first quarts that are fully contained in the interval plus the CVA estimation of the portion of the last quart that partially overlaps the interval. Obviously, in order to reduce the approximation error, in case $d > b/2$ it is convenient to derive the approximate value from the estimation of the cumulative frequency in the complementary interval from $d+1$ to b.

8-Split Estimation (8s). It is analogous to the 4-Split Estimation. The only difference is that the bucket is divided into 8 parts (eighths) and, for each of them, we use 4 bits for storing the cumulative frequency. Thus, the approximate value of the i-th eighth ($1 \le i \le 8$) is computed as $\tilde{\delta}_{i/8} = \langle \frac{\delta_{i/8}}{c} \times (2^4 - 1) \rangle$, where $\langle x \rangle$ stands for round(x).

3.3 The Tree Indices for Bucket Frequency Estimation

We now propose to use the 32 bits as tree indices providing an approximate description of the cumulative frequencies in the bucket; this index can easily be extended to the case in which more bits are available. To this end, we store the approximate value of the cumulative frequency in a suitable number of intervals inside the bucket. The first type of tree index is 3LT.

3 Level Tree index (3LT) The 3LT index uses 11 bits for approximating the value of $\delta_{1/2}$, and 10 bits each for approximating $\delta_{1/4}$ and $\delta_{3/4}$. Let $L_{1/2}$ be the 11-bit string corresponding to $\delta_{1/2}$, and let $L_{1/4}$ and $L_{3/4}$ be the 10-bit strings corresponding, respectively, to $\delta_{1/4}$ and $\delta_{3/4}$. The three L strings are constructed as follows:

$L_{1/2} = \langle \frac{\delta_{1/2}}{\delta_{1/1}} \cdot (2^{11} - 1) \rangle$; $L_{1/4} = \langle \frac{\delta_{1/4}}{\delta_{1/2}} \cdot (2^{10} - 1) \rangle$; $L_{3/4} = \langle \frac{\delta_{3/4}}{\delta_{2/2}} \cdot (2^{10} - 1) \rangle$

where, we recall, $\langle x \rangle$ stands for round(x). The approximate values for the partial sums are given by:

$\tilde{\delta}_{1/1} = \delta_{1/1} = c$
$\tilde{\delta}_{1/2} = \frac{L_{1/2}}{2^{11} - 1} \cdot \tilde{\delta}_{1/1}$; $\tilde{\delta}_{2/2} = \tilde{\delta}_{1/1} - \tilde{\delta}_{1/2}$
$\tilde{\delta}_{1/4} = \frac{L_{1/4}}{2^{10} - 1} \cdot \tilde{\delta}_{1/2}$; $\tilde{\delta}_{2/4} = \tilde{\delta}_{1/2} - \tilde{\delta}_{1/4}$; $\tilde{\delta}_{3/4} = \frac{L_{3/4}}{2^{10} - 1} \cdot \tilde{\delta}_{2/2}$; $\tilde{\delta}_{4/4} = \tilde{\delta}_{2/2} - \tilde{\delta}_{3/4}$

Observe that the 32-bit index refers to a 3-level tree whose nodes store, directly or indirectly, the approximate values of the cumulative frequencies for fixed intervals: the root stores the overall cumulative frequency c, the two nodes of the second level store the cumulative frequencies for the two halves of the bucket, and so on.

Fig. 1 The 3-level tree. Each node reports the exact sum followed, in parentheses, by its approximate value.
  Root (32 bits): $\delta_{1/1}$ = 8678 (8678)
  Level 2 (1x11 bits): $\delta_{1/2}$ = 5594 (5596), $\delta_{2/2}$ = 3084 (3082)
  Level 3 (2x10 bits): $\delta_{1/4}$ = 2834 (2833), $\delta_{2/4}$ = 2760 (2762), $\delta_{3/4}$ = 2818 (2819), $\delta_{4/4}$ = 266 (265)
  Frequencies of the values 1 to 16: 707 548 680 899 798 356 924 682 700 625 980 513 130 23 43 70

Example 1 Consider the 3-level tree in Figure 1. The 32 bits store the following approximate cumulative frequencies: $L_{1/2} = \langle \frac{5594}{8678} \cdot 2047 \rangle = 1320$, $L_{1/4} = \langle \frac{2834}{5594} \cdot 1023 \rangle = 518$, $L_{3/4} = \langle \frac{2818}{8678 - 5594} \cdot 1023 \rangle = 935$.

We are now ready to solve the frequency estimation inside the bucket B. Given d, $1 \le d < b$, let i be the integer for which $\lceil (i-1)/4 \cdot b \rceil \le d < \lceil i/4 \cdot b \rceil$. Then the approximate value of $S[d]$ is:

$\tilde{S}[d] = P(i) + P'(i) + \frac{d - \lceil (i-1)/4 \cdot b \rceil}{\lceil i/4 \cdot b \rceil - \lceil (i-1)/4 \cdot b \rceil} \cdot \tilde{\delta}_{i/4}$

where

$P(i) = \tilde{\delta}_{1/2}$ if $i > 2$, and $P(i) = 0$ if $i \le 2$;
$P'(i) = \tilde{\delta}_{1/4}$ if $i = 2$, $P'(i) = \tilde{\delta}_{3/4}$ if $i = 4$, and $P'(i) = 0$ otherwise.

Thus we use the interpolation based on the CVA only inside a segment of length $\lceil (1/4) \cdot b \rceil$. This component becomes zero at each distance $d = \lceil i \cdot b/4 \rceil$, $1 \le i < 4$.

The 32 bits may also be distributed in such a way that the granularity of the tree index increases w.r.t. 3LT. The 4LT index has 4 levels below the root's overall sum and uses 6 bits for the node of the first level, 5 bits for each node of the second level and 4 bits for each node of the last level.
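The hierarchical encoding described above can be sketched generically, with the code width of each level passed as a parameter: the widths (11, 10) give the 3LT strings, and (6, 5, 4) give those of the 4LT index. This is only an illustrative sketch, not the authors' implementation. The function names `round_half_up_ratio` and `encode_tree_index` are hypothetical, round(x) is assumed to round halves upwards, and a parent interval with zero sum is assumed to be encoded as 0. The two input vectors are the 16 frequencies shown in Figures 1 and 2.

```python
def round_half_up_ratio(num, den, scale):
    """Return round(num / den * scale), halves rounded up, using integers only."""
    if den == 0:
        return 0  # assumption: an empty parent interval is encoded as 0
    return (2 * num * scale + den) // (2 * den)


def encode_tree_index(freqs, bits_per_level):
    """Encode the left-child sums of a binary tree built over `freqs`.

    bits_per_level[k] is the code width used at depth k + 1 (the root, at
    depth 0, holds the exact bucket sum and needs no code). The result has
    one list per depth, holding the codes L of the left children
    (odd-numbered sub-ranges) from left to right.
    """
    leaves = 2 ** len(bits_per_level)
    if len(freqs) % leaves != 0:
        raise ValueError("bucket size must be a multiple of the number of leaves")
    codes = []
    for depth, bits in enumerate(bits_per_level, start=1):
        n = 2 ** depth                      # equal-size sub-ranges at this depth
        width = len(freqs) // n
        sums = [sum(freqs[k * width:(k + 1) * width]) for k in range(n)]
        scale = 2 ** bits - 1
        level = []
        for left in range(0, n, 2):         # each left child is scaled by its parent
            parent = sums[left] + sums[left + 1]
            level.append(round_half_up_ratio(sums[left], parent, scale))
        codes.append(level)
    return codes


fig1 = [707, 548, 680, 899, 798, 356, 924, 682,
        700, 625, 980, 513, 130, 23, 43, 70]
fig2 = [7, 5, 18, 0, 6, 10, 0, 6, 0, 6, 9, 5, 13, 0, 8, 7]

print(encode_tree_index(fig1, [11, 10]))    # 3LT codes: L_1/2, then L_1/4 and L_3/4
print(encode_tree_index(fig2, [6, 5, 4]))   # 4LT codes, one list per level
```

Because every code is scaled by the sum of its parent rather than by the whole bucket sum, the same number of bits covers a smaller range as the depth grows, which is the reason the code widths can shrink from level to level.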
4 Level Tree index (4LT) We reserve 4 bits to store the approximate value of each of the following 4 partial sums: $\delta_{1/8}$, $\delta_{3/8}$, $\delta_{5/8}$ and $\delta_{7/8}$; let $L_{i/8}$, $i = 1, 3, 5, 7$, denote these 4-bit strings. We then use the remaining 16 bits as follows: the partial sums $\delta_{1/4}$ and $\delta_{3/4}$ are approximated by the 5-bit strings $L_{1/4}$ and $L_{3/4}$, respectively, while the partial sum $\delta_{1/2}$ is approximated by a 6-bit string $L_{1/2}$. As a result, the larger the interval, the higher the number of bits used. The 7 L strings are constructed as follows:

$L_{1/2} = \langle \frac{\delta_{1/2}}{\delta_{1/1}} \cdot (2^6 - 1) \rangle$
$L_{i/4} = \langle \frac{\delta_{i/4}}{\delta_{j/2}} \cdot (2^5 - 1) \rangle$ for $(i=1 \wedge j=1), (i=3 \wedge j=2)$
$L_{i/8} = \langle \frac{\delta_{i/8}}{\delta_{j/4}} \cdot (2^4 - 1) \rangle$ for $(i=1 \wedge j=1), (i=3 \wedge j=2), (i=5 \wedge j=3), (i=7 \wedge j=4)$

where, we recall, $\langle x \rangle$ stands for round(x). The approximate values for the partial sums are eventually computed as:

$\tilde{\delta}_{1/1} = \delta_{1/1} = c$
$\tilde{\delta}_{1/2} = \frac{L_{1/2}}{2^6 - 1} \times \tilde{\delta}_{1/1}$
$\tilde{\delta}_{2/2} = \tilde{\delta}_{1/1} - \tilde{\delta}_{1/2}$
$\tilde{\delta}_{i/4} = \frac{L_{i/4}}{2^5 - 1} \times \tilde{\delta}_{j/2}$ for $(i=1 \wedge j=1), (i=3 \wedge j=2)$
$\tilde{\delta}_{i/4} = \tilde{\delta}_{j/2} - \tilde{\delta}_{(i-1)/4}$ for $(i=2 \wedge j=1), (i=4 \wedge j=2)$
$\tilde{\delta}_{i/8} = \frac{L_{i/8}}{2^4 - 1} \times \tilde{\delta}_{j/4}$ for $(i=1 \wedge j=1), (i=3 \wedge j=2), (i=5 \wedge j=3), (i=7 \wedge j=4)$
$\tilde{\delta}_{i/8} = \tilde{\delta}_{j/4} - \tilde{\delta}_{(i-1)/8}$ for $(i=2 \wedge j=1), (i=4 \wedge j=2), (i=6 \wedge j=3), (i=8 \wedge j=4)$

Similarly to the 3LT index, the 4LT index refers to a 4-level tree whose nodes store, directly or indirectly, the approximate values of the cumulative frequencies for fixed hierarchical intervals, starting from the root, which stores the overall cumulative frequency c.

Fig. 2 The 4-level tree. Each node reports the exact sum followed, in parentheses, by its approximate value.
  Root (32 bits): $\delta_{1/1}$ = 100 (100)
  Level 2 (1x6 bits): $\delta_{1/2}$ = 52 (52), $\delta_{2/2}$ = 48 (48)
  Level 3 (2x5 bits): $\delta_{1/4}$ = 30 (30), $\delta_{2/4}$ = 22 (22), $\delta_{3/4}$ = 20 (20), $\delta_{4/4}$ = 28 (28)
  Level 4 (4x4 bits): $\delta_{1/8}$ = 12 (12), $\delta_{2/8}$ = 18 (18), $\delta_{3/8}$ = 16 (16), $\delta_{4/8}$ = 6 (6), $\delta_{5/8}$ = 6 (7), $\delta_{6/8}$ = 14 (13), $\delta_{7/8}$ = 13 (14), $\delta_{8/8}$ = 15 (14)
  Frequencies of the values 1 to 16: 7 5 18 0 6 10 0 6 0 6 9 5 13 0 8 7

Example 2 Consider the 4-level tree in Figure 2. The 32 bits store the following approximate cumulative frequencies: $L_{1/2} = 33$, $L_{1/4} = 18$, $L_{3/4} = 13$, $L_{1/8} = 6$, $L_{3/8} = 11$, $L_{5/8} = 5$, $L_{7/8} = 7$.

Again, similarly to the 3LT index, the frequency estimation inside the bucket B can be obtained by exploiting the content of the nodes of the index. Given d, $1 \le d < b$, and the integer i for which $\lceil (i-1)/8 \times b \rceil \le d < \lceil i/8 \times b \rceil$, the approximate value of $S[d]$ is:

$\tilde{S}[d] = P(i) + P'(i) + P''(i) + \frac{d - \lceil (i-1)/8 \times b \rceil}{\lceil i/8 \times b \rceil - \lceil (i-1)/8 \times b \rceil} \times \tilde{\delta}_{i/8}$

where

$P(i) = \tilde{\delta}_{1/2}$ if $i > 4$, and $P(i) = 0$ if $i \le 4$;
$P'(i) = \tilde{\delta}_{1/4}$ if $i = 3, 4$, $P'(i) = \tilde{\delta}_{3/4}$ if $i = 7, 8$, and $P'(i) = 0$ otherwise;
$P''(i) = \tilde{\delta}_{(i-1)/8}$ if i is even, and $P''(i) = 0$ otherwise.

Thus we use the interpolation as in CVA only inside a segment of length $\lceil (1/8) \cdot b \rceil$. This component becomes zero at each distance $d = \lceil i \times b/8 \rceil$, $1 \le i < 8$. We call this estimation 4-level tree, or 4LT for short.

3.4 Worst-case Error Analysis

The approximation error for CVA, 1b, USA and 2s arises only from interpolation. On the contrary, for the other methods (i.e., 4s, 8s, 3LT and 4LT), the scaling error due to bit saving is added to the interpolation error. However, all methods but CVA, 1b and USA implement an equi-size division of the bucket, and 3LT and 4LT also provide an index over sub-buckets. We expect that such a division into sub-buckets improves the interpolation error, since sub-buckets increase the granularity of summarization. In addition, we expect that index-based methods (i.e., 3LT and 4LT) reduce the scaling error, since the hierarchical tree-like organization allows us to represent the sum inside a given sub-bucket, corresponding to a node of the tree, as a fraction of the sum contained in the parent node, instead of a fraction of the entire bucket sum (as happens for the "flat" methods 4s and 8s). The worst-case analysis confirms the above observations. In particular, we show that, while CVA, 1b and USA are equivalent from the worst-case point of view, 4LT outperforms the other methods.

Results of our analysis are summarized in the following theorem. Recall that, throughout the whole section, a bucket B of size b is given.
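Before the formal statement, the expected gain on the interpolation side can be illustrated numerically on the bucket of Figure 2. The sketch below decodes the seven 4LT codes of Example 2 into the eight approximate eighth-sums, evaluates the 4LT estimate of S[d] given in Section 3.3, and prints it next to the CVA estimate and the exact value. It is a minimal illustration under stated assumptions, not the authors' code: the names `ceil_div`, `decode_4lt`, `estimate_4lt` and `estimate_cva` are hypothetical, and the approximate sums are kept as floating-point values rather than rounded to integers.

```python
def ceil_div(a, n):
    """Integer ceiling of a / n."""
    return -(-a // n)


def decode_4lt(c, l12, l14, l34, l18, l38, l58, l78):
    """Rebuild the eight approximate eighth-sums from the seven 4LT codes."""
    half1 = l12 / 63 * c            # 6-bit code, relative to the bucket sum c
    half2 = c - half1
    q1 = l14 / 31 * half1           # 5-bit codes, relative to the parent half
    q3 = l34 / 31 * half2
    quarters = [q1, half1 - q1, q3, half2 - q3]
    eighths = []
    for code, quarter in zip((l18, l38, l58, l78), quarters):
        left = code / 15 * quarter  # 4-bit codes, relative to the parent quarter
        eighths += [left, quarter - left]
    return eighths


def estimate_4lt(d, b, eighths):
    """Estimate S[d], 1 <= d < b: whole eighths before d, plus CVA inside the next one."""
    for i in range(1, 9):
        lo, hi = ceil_div((i - 1) * b, 8), ceil_div(i * b, 8)
        if lo <= d < hi:
            # sum(eighths[:i - 1]) plays the role of P(i) + P'(i) + P''(i)
            return sum(eighths[:i - 1]) + (d - lo) / (hi - lo) * eighths[i - 1]
    raise ValueError("d must satisfy 1 <= d < b")


def estimate_cva(d, b, c):
    """Plain linear interpolation over the whole bucket."""
    return d / b * c


freqs = [7, 5, 18, 0, 6, 10, 0, 6, 0, 6, 9, 5, 13, 0, 8, 7]
b, c = len(freqs), sum(freqs)
eighths = decode_4lt(c, 33, 18, 13, 6, 11, 5, 7)

for d in (3, 4, 8, 12):
    exact = sum(freqs[:d])
    print(d, exact, round(estimate_cva(d, b, c), 2),
          round(estimate_4lt(d, b, eighths), 2))
```

At the boundaries of the eighths (for instance d = 4 or d = 8) the 4LT estimate involves no interpolation at all, so only the scaling error remains; between boundaries the interpolation is confined to a single eighth of the bucket.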
Theorem 1 Let F be the maximum frequency value occurring in B and assume that b mod 8 = 0. Then, the interpolation and scaling worst-case errors of CVA, 1b, USA, 2s, 4s, 8s, 3LT and 4LT are the following:

error/method    CVA     1b      USA     2s      4s        8s       3LT        4LT
interpolation   F·b/4   F·b/4   F·b/4   F·b/8   F·b/16    F·b/32   F·b/16     F·b/32
scaling         0       0       0       0       F·b/2^9   F·b/32   F·b/2^12   F·b/2^7
total           F·b/4   F·b/4   F·b/4   F·b/8   F·b/16    F·b/16   F·b/16     F·b/32

Proof Let $b_M$ be the size of the smallest sub-bucket produced by the method M, where M is either CVA, 1b, USA, 2s, 4s, 8s, 3LT or 4LT. Observe that $b_M = b$ for CVA, 1b and USA (since they do not produce sub-buckets), while $b_{2s} = \frac{b}{2}$, $b_M = \frac{b}{4}$ for M = 4s or M = 3LT, and $b_M = \frac{b}{8}$ otherwise. Consider first the interpolation error (by assuming that no scaling error occurs).

Interpolation error bounds. It can be easily verified that the worst case for a method M happens whenever both the following conditions hold: (1) there is a smallest sub-bucket, say $B_M$ (of size $b_M$), containing, in the first half, $\frac{b_M}{2}$ frequencies with value F and, in the second half, $\frac{b_M}{2}$ frequencies with value 0, and
