The History of Histograms (abridged) Yannis Ioannidis Department of Informatics and Telecommunications, University of Athens Panepistimioupolis, Informatics Buildings 157-84,Athens, Hellas (Greece) [email protected] wordthat wasoriginallyused in the Greeklanguage1. Theterm‘histogram’wascoinedbythefamousstatis- tician Karl Pearson2 to refer to a \common form of Abstract graphicalrepresentation". In the Oxford English Dic- tionaryquotesfrom\PhilosophicalTransactionsofthe Thehistoryofhistogramsislongandrich,full Royal Society of London" Series A, Vol. CLXXXVI, of detailed information in every step. It in- (1895) p. 399, it is mentioned that \[The word ‘his- cludes the course of histograms in di(cid:11)erent togram’ was] introduced by the writer in his lectures scienti(cid:12)c (cid:12)elds, the successes and failures of onstatisticsasatermforacommonformofgraphical histogramsinapproximatingandcompressing representation, i.e., by columns marking as areas the information, their adoption by industry, and frequency corresponding to the range of their base.". solutions that have been given on a great va- Stigler identi(cid:12)es the lectures as the 1892 lectures on riety of histogram-related problems. In this the geometry of statistics [69]. paper and in the same spirit of the histogram Theabovequotesuggeststhathistogramswereused techniques themselves, we compress their en- long before they received their name, but their birth tire history (including their \future history" date is unclear. Bar charts (i.e., histograms with an as currently anticipated) in the given/(cid:12)xed individual‘base’elementassociatedwitheachcolumn) spacebudget, mostlyrecordingdetails forthe most likely predate histograms and this helps us put periods, events, and results with the highest a lowerbound on the timing of their (cid:12)rst appearance. (personally-biased) interest. In a limited set Theoldestknownbarchartappearedinabookbythe ofexperiments,thesemanticdistancebetween Scottish political economist William Playfair3 titled thecompressedandthefullformofthehistory \The Commercial and Political Atlas (London 1786)" was found relatively small! andshowstheimportsandexportsofScotlandtoand from seventeencountries in 1781[74]. Although Play- fair was skeptical of the usefulness of his invention, it was adopted by many in the following years, includ- 1 Prehistory ing forexample,Florence Nightingale,who used them in 1859 to compare mortality in the peacetime army Theword‘histogram’isofGreekorigin,asitisacom- to that of civilians and through those convinced the posite of the words ‘isto-s’ ((cid:19)(cid:27)(cid:28)os) (= ‘mast’, also government to improve army hygiene. means ‘web’ but this is not relevant to this discus- Fromall the above,it is clearthat histogramswere sion) and ‘gram-ma’ ((cid:13)(cid:26)(cid:11)(cid:22)(cid:22)(cid:11)) (= ‘something writ- (cid:12)rstconceivedasavisualaidtostatisticalapproxima- ten’). Hence, it should be interpreted as a form of tions. Even today this point is still emphasized in the writingconsistingof‘masts’,i.e.,longshapesvertically commonconceptionofhistograms: Webster’sde(cid:12)nesa standing, or something similar. It is not, however, a 1To the contrary, the word ‘history’ is indeed part of the Permission to copy without fee all or part of this material is Greeklanguage(‘istoria’-(cid:19)(cid:27)(cid:28)o(cid:26)(cid:19)(cid:11))andinusesincetheancient grantedprovidedthatthecopiesarenotmadeordistributedfor times. Despiteitssimilarityto‘histogram’,however,itappears direct commercial advantage, the VLDB copyright notice and tohaveadi(cid:11)erentetymology,onethatisrelatedtotheoriginal the title of the publication and its date appear, and notice is meaningoftheword,whichwas‘knowledge’. giventhatcopyingisbypermissionoftheVeryLargeDataBase 2Hisclaimtofameincludes,amongothers,thechi-squaretest Endowment. To copy otherwise, or to republish, requires a fee forstatisticalsigni(cid:12)canceandtheterm‘standarddeviation’. and/or special permission from theEndowment. 3In addition to the bar chart, Playfair is probably the fa- Proceedings of the 29th VLDB Conference, ther of the pie chart and other extremely intuitive and useful Berlin, Germany, 2003 visualizationsthatweusetoday. histogram as \a bar graph of a frequency distribution The data distribution of X is the set of pairs T = i i inwhichthewidthsofthebarsareproportionaltothe f(v (1);f (1));(v (2);f (2));:::;(v (D );f (D ))g. i i i i i i i i classes into which the variable has been divided and The joint frequency f(k1;::;kn) of the value combi- the heights of the bars are proportional to the class nation < v1(k1);::;vn(kn) > is the number of tuples frequencies". Histograms,however,areextremelyuse- in R that contain v (k ) in attribute X , for all i. The i i i fulevenwhendisassociatedfromtheircanonicalvisual joint data distribution T1;::;n of X1;::;Xn is the entire representationandtreatedaspurelymathematicalob- set of (value combination, joint frequency) pairs. jectscapturingdatadistributionapproximations. This In the sequel, for 1-dimensional cases, we use the is precisely how we approach them in this paper. above symbols without the subscript i. Inthe pastfewdecades,histogramshavebeenused inseveral(cid:12)elds ofinformatics. Besidesdatabases,his- 2.2 Motivation for Histograms tograms have played a very important role primarily Datadistributionsareveryusefulindatabasesystems in image processing and computer vision. Given an but are usually too large to be stored accurately, so image(oravideo)andavisualpixelparameter,ahis- histogramscomeintoplayasanapproximationmech- togramcaptures for each possible value of the param- anism. The two most important applications of his- eter (Webster’s \classes") the number of pixels that togram techniques in databases have been selectivity have this value (Webster’s \frequencies"). Such a his- estimation and approximate query answering within togramisasummarythatischaracteristicoftheimage query optimization (for the former) or pre-execution andcanbeveryusefulinseveraltasks: identifyingsim- user-level query feedback (for both). Our discussion ilarimages, compressingthe image, and others. Color below focuses exactly on these two, especially range- histogramsarethemostcommonintheliterature,e.g., query selectivity estimation as this is the most pop- intheQBICsystem[21],butseveralotherparameters ular issue in the literature. It should not be forgot- have been proposed as well, e.g., edge density, tex- ten,however,thathistogramshaveprovedtobeuseful turedness,intensitygradient,etc. [61]. Ingeneral,his- in the context of several other database problems as togramsusedinimageprocessingandcomputervision well, e.g., load-balancing in parallel join query execu- areaccurate. For example, a colorhistogramcontains tion [65], partition-based temporal join execution [68] a separate and precise count of pixels for each possi- and others. ble distinct color in the image. The only element of approximationmight be in the number of bits used to 2.3 Histograms representdi(cid:11)erentcolors: fewerbitsimplythatseveral actualcolorsarerepresentedby one, which will be as- AhistogramonanattributeX isconstructedbyparti- sociated with the number of pixels that have any of tioning the data distribution of X into (cid:12) ((cid:21)1) mutu- the colors that are grouped together. Even this kind ally disjoint subsets called buckets and approximating of approximationis not common, however. thefrequenciesandvaluesineachbucketinsomecom- In databases, histograms are used as a mechanism mon fashion. This de(cid:12)nition leaves several degrees of forfull-(cid:13)edgedcompressionandapproximationofdata freedomindesigningspeci(cid:12)chistogramclassesasthere distributions. They(cid:12)rstappearedintheliteratureand are several possible choices for each of the following in systems in the 1980’sand have been studied exten- (mostly orthogonal)aspects of histograms [67]: sively since then at a continuously increasing rate. In Partition Rule: This is further analyzed into the this paper, we concentrate on the database notion of following characteristics: histograms, discuss the most important developments (cid:15) Partition Class: This indicates if there are any on the topic so far, and outline several problems that restrictions on the buckets. Of great importance webelieveareinterestingandwhosesolutionmayfur- isthe serial class,whichrequiresthatbucketsare ther expand their applicability and usefulness. non-overlapping with respect to some parameter (the next characteristic), and its subclass end- 2 Histogram De(cid:12)nitions biased, which requires at most one non-singleton bucket. 2.1 Data Distributions (cid:15) Sort Parameter: This is a parameter whose Consider a relation R with n numeric attributes X value for each element in the data distribution i (i = 1::n). The value set V of attribute X is is derived from the corresponding attribute value i i the set of values of X that are present in R. Let andfrequencies. Allserialhistogramsrequirethat i V = fv (k): 1(cid:20)k (cid:20)D g, where v (k) < v (j) when the sort parameter values in each bucket form a i i i i i k < j. The spread s (k) of v (k) is de(cid:12)ned as contiguous range. Attribute value (V), frequency i i s (k) = v (k + 1) (cid:0) v (k), for 1 (cid:20) k < D . (We (F),andarea(A)areexamplesofsortparameters i i i i take s (D ) = 1.) The frequency f (k) of v (k) is that have been discussed in the literature. i i i i the number of tuples in R with X = v (k). The (cid:15) Source Parameter: This captures the property i i area a (k) of v (k) is de(cid:12)ned as a (k) = f (k)(cid:2)s (k). of the data distribution that is the most critical i i i i i in an estimation problem and is used in conjunc- proposal was an immediate loan from statistics of the tion with the next characteristic in identifying a simplest form of histogram, with the value set being unique partitioning. Spread (S), frequency (F), divided into ranges of equal length, i.e., the so called and area(A) arethe most commonlyused source equi-width histograms. Hence, in terms of the tax- parameters. onomy of Section 2.3, the entry point for histograms (cid:15) Partition Constraint: This is a mathematical intotheworldofdatabaseswastheserialclassofequi- constraint on the source parameterthat uniquely sum(V,S),wheretheequi-sum partitionconstraintre- identi(cid:12)es a single histogram within its partition quires that the sums of the source-parameter values class. Severalpartitionconstraintshavebeenpro- (spreadsinthiscase)ineachbucketareequal. Within posed so far, e.g., equi-sum, v-optimal, maxdi(cid:11), eachbucket,valuesandfrequencieswereapproximated and compressed, which are de(cid:12)ned further below basedonthecontinuousvalueassumptionandtheuni- as they are introduced. Many of the more suc- form distribution assumption, respectively. cessful ones try to avoid groupingvastly di(cid:11)erent Equi-width histograms represented a dramatic im- source parameter values into a bucket. provement over the uniform distribution assumption forthe entirevalueset(i.e., essentiallyasingle-bucket Following [67], we use p(s,u) to denote a serial his- histogram), which was the state of the practice at the togramclasswith partitionconstraintp, sort parame- time. Hence, they were quickly adopted by the Ingres ter s, and source parameter u. DBMSinitscommercialversion,andlateronbyother Construction Algorithm: Given a particular DBMSs as well. partition rule, this is the algorithm that constructs histograms that satisfy the rule. It is often the case that, for the same histogram class, there are several First Alternative construction algorithms with di(cid:11)erent e(cid:14)ciency. A few years after Kooi’s thesis, the (cid:12)rst alternative Value Approximation: This captures how at- histogram was proposed, changing only the source tribute values are approximated within a bucket, parameter [62]. Instead of having buckets of equal- which is independent of the partition rule of a his- size ranges, the new proposal called for buckets with togram. The most common alternatives are the con- (roughly) the same number of tuples in each one, i.e., tinuous value assumption and the uniform spread as- the so called equi-depth or equi-height histograms. In sumption; both assumevalues uniformly placedin the terms of the taxonomy, these are the equi-sum(V,F) rangecoveredby the bucket, with the formerignoring histograms. Therewasampleevidencethatequi-depth thenumberofthesevaluesandthelaterrecordingthat histogramswereconsiderablymoree(cid:11)ectivethanequi- number inside the bucket. width histograms, hence, many commercial vendors Frequency Approximation: This captures how switched to those in the years following their intro- frequencies are approximated within a bucket. The duction. Equi-depth histograms were later presented dominantapproachismakingtheuniformdistribution in their multi-dimensional form as well [58]. assumption, where the frequencies of all elements in the bucket are assumed to be the same and equal to the average of the actual frequencies. Optimal Sort Parameter Error Guarantees: These are upper bounds on After several years of inactivity on the topic of his- the errors of the estimates a histogram generates, tograms, interest in it was renewed in the context of whichareprovidedbased oninformationthatthe his- studying how initial errorsin statistics maintained by togram maintains. thedatabasepropagateinestimatesofthesizeofcom- A multi-dimensional histogram on a set of at- plex query results [36]. In particular, it was shown tributes is constructed by partitioning the joint data that, under some rather general conditions, in the distribution of the attributes. They have the exact worstcase,errorspropagateexponentiallyinthequery same characteristics as 1-dimensional histograms, ex- size (i.e., in the number of joins), removing any hope cept that the partition rule needs tobe moreintricate for high-quality estimates for large multi-join queries. and cannot always be clearly analyzed into the four The (cid:12)rst results that led towards new types of his- other characteristics as before, e.g., there is no real tograms were derived in an e(cid:11)ort to obtain statistics sort parameter in this case, as there can be no order- that would be optimal in minimizing/containing the ing in multiple dimensions [66]. propagation of errors in the size of join results [37]. Thebasicmathematicaltoolsusedwereborrowedfrom 3 The Past of Histograms majorization theory [55]. The focus was on a rather restrictedclassof equalityjoinqueries,i.e., single-join First Appearance queries or multi-join queries with only one attribute Tothe bestof ourknowledge,the (cid:12)rstproposaltouse participatinginjoinsperrelation(moregenerally,with histograms to approximate data distributions within a 1-1 functional dependency between each pairof join a database system was in Kooi’s PhD thesis [47]. His attributes of each relation). For this query class, and under the assumption that the value set is known ac- values (in sort-parameter order) whose di(cid:11)erence is curately, it was formally proved that the optimal his- among the largest, and compressed, which puts the togram was serial and had frequency as the sort pa- highest source values in singleton buckets and parti- 4 rameter . tions the rest in equi-sum fashion. Overall, the new partition constraints (i.e., v-optimal, maxdi(cid:11), com- Ten years ago pressed)wereshowntobethemoste(cid:11)ectiveincurbing query-result-size estimation errors. Theaboveresultmighthavenothadtheimpactitdid Thesamee(cid:11)ortpointedtowardsseveralpossibilities if it had remained true only for the restricted query for the sort and source parameters,i.e., value, spread, classitwas(cid:12)rstprovedfor. Soonafterwards,however, frequency, area, cumulative frequency, etc., with fre- in VLDB’93, it was generalized for arbitrary equality quencyandareabeingthebestsourceparameters. In- join queries, giving a strong indication that the most terestingly, the best sort parameter proved to be the e(cid:11)ective histograms may be very di(cid:11)erent from those value and not the frequency, as the original optimal- that were used until that point [34]. ityresultswouldsuggest,indicatingthat, ifvaluesare To the best of our knowledge, histograms with fre- notknownaccurately,havingbucketswithoverlapping quency as the sort parameterrepresented the (cid:12)rst de- value ranges does not pay o(cid:11) for range queries. parturefromvalue-basedgroupingofbuckets,notonly The most e(cid:11)ective of these histograms have actu- within the areaof databases,but overallwithinmath- ally been adopted by industrial products (see Section ematics and statistics as well. Furthermore, their in- 4). Furthermore, in addition to selectivity estimation troduction essentiallygeneralizedsomecommonprac- forvariousrelationaland non-relationalqueries,these tices that were already in use in commercial systems histogramshaveprovedtobeverye(cid:11)ectiveinapprox- (e.g.,inDB2),wherethehighestfrequencyvalueswere imate query answering as well [39]. maintained individually and accurately due to their Since the speci(cid:12)cation of the above space of his- signi(cid:12)cant contribution to selectivity estimates. Such tograms, there have been several e(cid:11)orts that have apracticeisaninstanceofaspecialcaseofahistogram studied one or more of its characteristics and have intheend-biasedpartitionclass,withfrequencyasthe proposed alternative, improved approaches. For each sort parameter: the highest sort-parametervalues are characteristic, we outline some of the most notable maintained in singleton buckets. Although less accu- pieces of work on it in a separate subsection below. rate than general serial histograms, in several cases, Unless explicitly mentioning the opposite, the discus- end-biased histograms proved quite e(cid:11)ective. sion is about 1-dimensional histograms. New Partition Constraints Alternative Partition Constraints The results on the optimality of frequency as the sort In addition to the partition constraints that were in- parameter left open two important questions. First, troduced as part of the original histogram framework whichpartitionconstraintsarethe moste(cid:11)ective, i.e., [67], a few more have been proposed that attempt to which ones amongall possible frequency-basedbucke- approach the e(cid:11)ectiveness of v-optimal, usually hav- tizations? Second,whichhistogramsareoptimalwhen ing a more e(cid:14)cient construction cost. Among them, the value set is not accurately maintained but is ap- we note one that uses a simpli(cid:12)ed form of the opti- proximated in some fashion? mal knot placement problem[18]toidentifythebucket The answer to the (cid:12)rst question came in the form boundaries, which are where the ‘knots’ are placed of the v-optimal histograms, which partition the data [46]. The simpli(cid:12)cation consists of using only linear distribution so that (roughly) the variance of source- splinesthatarealsoallowedtobediscontinuousacross parametervalueswithineachbucketisminimized[38]. bucket boundaries. This is combined with interesting Unfortunately, the second question had no analyt- alternativesonthevalueandfrequencyapproximation ical answer, but extensive experimentation led to the within each bucket. formation of the space of histogram characteristics that we use as the basic framework for our discussion Multi-Dimensional Partition Rules inthispaper(Section2.3)[67]. Inadditiontotheequi- sumandv-optimal partitionconstraints,itintroduced The(cid:12)rstintroductionofmulti-dimensionalhistograms severalpossiblenewonesaswell,whichsimilarlytov- wasbyMuralikrishnaandDeWitt[58],whoessentially optimalhadasagoaltoavoidgroupingtogetherinthe described2-dimensionalequi-depthhistograms. Space same bucket vastly di(cid:11)erent source-parameter values. was divided in the same way it is done in a Grid- Among them, we distinguish maxdi(cid:11), which places (cid:12)le, i.e., recursively cutting the entire space into half- bucketboundariesbetweenadjacentsource-parameter spaces by using a value of one of the dimensions as a boundary each time, the dimension and the value be- 4Thesewerecalledsimplyserialhistogramsatthetime,but ingchoseninawayprespeci(cid:12)edatthebeginningofthe thetermwaslatergeneralizedtoimplynon-overlappingranges process[58]. Bucketswerenon-overlapping(themulti- ofany sortparameter, not justfrequency, whichishowwe use theterminthispaperaswell. dimensional version of the serial partition class) on the space of the multi-dimensional values (the multi- Asmallamountofinformationwithineachbucketim- dimensional version of value as the sort parameter), pliesgrosslocalapproximationsbutalsomorebuckets. the boundaries chosen with equi-sum as the partition Finding the right balance in this tradeo(cid:11) to optimize constraint and frequency as the source parameter. the overallapproximationof the datadistribution isa Itwasnotuntilseveralyearslaterthatanynewpar- key question. titionruleswereproposed[66],thistimetakingadvan- Withrespecttoapproximatingthesetofvaluesthat tageof the generalityof the histogramtaxonomy[67]. fall in a 1-dimensional bucket, there have been essen- The most e(cid:11)ective family of such rules was MHIST- tially two approaches. Under the traditional continu- 2, which starts from the entire joint data distribution ous value assumption, onemaintains theleastamount placed in a single bucket and, at each step, splits the ofinformation(justtheminandmaxvalue),butnoth- space captured by one of the buckets it has formed ing that would give some indication of how many val- into two subspaces, until it has exhausted its budget uesthereareorwheretheymightbe. Underthe more of buckets. The split is made in the bucket and along recentuniform spread assumption [67], one alsomain- the dimensionthat ischaracterizedasmost \critical", tains the number of values within each bucket and i.e., whose marginal distribution is the most in need approximates the actual value set by the set that is of partitioning, basedonthe (1-dimensional)partition formed by (virtually) placingthe same numberof val- constraintandsourceparameterused. Incombination uesatequaldistancesbetweentheminandmaxvalue. withthemoste(cid:11)ectivepartitionconstraintsandsource Adi(cid:11)erentversionofthathasalsobeenproposedthat parameters (i.e., v-optimal or maxdi(cid:11) with frequency does not record the actual average spread within a or area), MHIST-2 represented a dramatic improve- bucketbutonethatreducestheoverallapproximation ment over the original multi-dimensional equi-depth errorin range queriesby taking into accountthe pop- histograms. ularity of particular ranges within each bucket [46]. Since MHIST, there have been several other inter- There have been several studies that show each gen- estingpartitionrulesthathavebeenproposed. Oneof eraltechniquesuperiortotheother,anindicationthat themisGENHIST[31],whichwasoriginallyproposed there may be no universal winner. in the context of multi-dimensional real-valued data, The two main approaches mentioned above have but its applicabilityis broader. The main characteris- been extended for multi-dimensional buckets as well, ticof GENHIST isthatit allowsbucketstooverlapin maintainingthe min andmaxvalueof eachdimension the space of multi-dimensional values: the algorithm in the bucket. Under the continuous value assump- startsfromauniformgridpartitioningofthespaceand tion nothing more is required, but under the uniform then iterativelyenlargesthebucketsthatcontainhigh spread assumption, the problem arises of which dis- numbers of data elements. This has two e(cid:11)ects: (cid:12)rst, tinct (multi-dimensional) values are assumed to exist thedensityofdataineachbucketdecreases,thusmak- in the bucket. If d is the number of distinct values in i ing the overall density smoother; second, the buckets attribute X that are present in a bucket and v0(k) is i i end up overlapping,thus creating many more distinct the k’th approximate value in dimension i (obtained areasthantherearebucketsperse. Thedatadistribu- byapplyingtheuniformspreadassumptionalongthat tion approximation within each area is a combination dimension), then a reasonable approach is to assume of what all the overlapping bucket that form the area that all possible combinations < v10(k1);::;vn0(kn) >, indicate. This results in a small number of buckets 1(cid:20)k (cid:20)d , exist in the bucket [66]. i i producing approximationswith low errors. There has also been an interesting e(cid:11)ort that in- Another alternative is the STHoles Histogram [11], troduces the use of kernel estimation into the 1- which takes, in some sense, a dual approach to GEN- dimensional histogram world [10] to deal speci(cid:12)cally HIST: instead of the region covered by a bucket in- with real-valued data. Roughly, it suggests choosing creasing in size and overlapping with other buckets, the points of considerable change in the probability in STHoles, this region may decrease in size due to density function as the bucket boundaries (in a spirit the removal of a piece of it (i.e., opening a hole) that similar to the maxdi(cid:11) partition constraint) and then forms a separate, child bucket. This creates buckets applying the traditional kernel estimation method for that are not solid rectangles, and is therefore capable approximatingthevalueswithineachbucket. Thishas of capturing quite irregular data distributions. also been generalized for the multi-dimensional case Identifying e(cid:11)ective multi-dimensional partition [31]. rules is by no means a closed problem, with di(cid:11)erent approachesbeing proposed continuously [23]. Frequency Approximation WithinEach Bucket With respect to approximating the set of frequencies Value Approximation Within Each Bucket that fall in a bucket, almost all e(cid:11)orts deal with the Given a speci(cid:12)c amount of space for a histogram, one traditional uniform distribution assumption. Among of the main tradeo(cid:11)s is the number of buckets ver- the few exceptions is one that is combined with the sus the amount of information kept in each bucket. linearsplinepartitionconstraintmentionedaboveand uses a linear spline-based approximation for frequen- tograms (as opposed to individual histograms), based cies as well [46]. It records one additional data item onanexpectedworkload[40]. Thise(cid:11)ortfocusesonv- per bucket to capture linearly growing or shrinking optimal histograms, but is equally applicable to other frequencies at the expense of fewer buckets for a partition constraints as well. (cid:12)xed space budget. Likewise, another exception uses Even with the existence of e(cid:14)cient calculation al- equally small additional space within each bucket to gorithms, however, static histograms su(cid:11)er from in- store cumulative frequencies in a 4-level tree index creasing errors between calculations. Moreover, in [13]. Contrary to the previous e(cid:11)ort, however, it is a data stream environment, static histograms are combined with some of the established partition con- not an option at all, as there is no opportunity to straints, i.e., v-optimal and maxdi(cid:11). store the incoming data or examine it more than once. Hence, several works have proposed various ap- E(cid:14)cient and Dynamic Constructions proachestodynamic/adaptive/self-tuninghistograms, which change asthe data gets updated, while remain- Althoughestimatione(cid:11)ectivenessisprobablythemost ing competitive to their static counterparts. Among important property of histograms (or any other com- these, wenote onefor equi-depth and compressedhis- pression/estimation method for that matter), con- tograms [26], one for v-optimal histograms [27], and struction cost is also a concern. With respect to this one for (linear) spline-based histograms [46]. There aspect,histogramsmaybedividedintotwocategories: is also an e(cid:11)ort focusing on data streams, where a static histograms and dynamic/adaptive histograms. sketch onthe (joint) datadistribution ofthe streamis Static histograms are those that are traditionally maintained, from which an e(cid:11)ective multidimensional used in database systems: after they are constructed histogram may be constructed [72]; the STHoles his- (from the stored data or a sample of it), they remain togram is used for experimentation with the method, unchangedevenif the originaldatagetsupdated. De- butinprinciple,itcouldbeappliedtootherhistogram pending on the details of the updates, a static his- classes as well. togram eventually drifts away from what it is sup- Anotherapproachtodynamicconstructionthathas posedtoapproximate,andtheestimationsitproduces been examined in the past consists of query feed- may su(cid:11)er from increasingly larger errors. When this back mechanisms that take into account actual sizes happens, theadministratorsaskforarecalculation,at of query results to dynamically modify histograms so which point the old histogram is discarded and a new that their estimates are closer to reality. In essence, one is calculated afresh. An important consideration this is histogram adaptation at query time instead of for static histograms is the cost of each calculation at update time. The main representatives in this cat- itself, which is mostly a(cid:11)ected by the partition con- egory are the ST-histograms [2] and their descendant straint. Most such constraints (e.g., equi-sum, maxd- STHoleshistograms[11],whichemployasophisticated i(cid:11),compressed)havestraightforwardcalculationsthat partition rule as well. These techniques are indepen- are e(cid:14)cient. This is not the case, however, for what dent of the particular characteristicsof the initial his- has been shown to be the most e(cid:11)ective constraint, tograms, which may be constructed in any way, e.g., i.e., v-optimal, whosestraightforwardcalculation is in they could be equi-depth histograms. In addition to generalexponentialinthenumberofsource-parameter their dynamic nature, a key advantage of these ap- values. A key contribution in this direction has been proaches is their low cost. the proposal of a dynamic-programming based algo- The LEO system [70] generalizes these e(cid:11)orts as rithm that identi(cid:12)es the v-optimal histogram(for any it uses result sizes of much more complicated queries sort and source parameter) in time that is quadratic to modify its statistics, including join and aggregate inthenumberofsource-parametervaluesandlinearin queries, queries with user-de(cid:12)ned functions, and oth- the number of buckets, thus making these histograms ers. Interestingly, LEO does not update the statistics practical as well [42]. Subsequently, several (mostly in place, but puts all feedback information into sepa- theoretical) e(cid:11)orts have introduced algorithms that rate catalogs, which are used in combination with the have reduced the required running time for calculat- original histograms at estimation time. ing these optimal histograms, eventually bringing it down to linear overall and achieving similar improve- Error Guarantees ments for the required space as well [30]. Dynamic- programming algorithms have also been proposed for Most work on histograms deals with identifying those constructing the optimal histogramsfor (hierarchical) that exhibit low errors in some estimation problem, range queries in OLAP data [44]. For the multi- but not with providing, together with the estimates, dimensional case, optimal histogram identi(cid:12)cation is some informationon what those errorsmight be. The NP-hard,soseveralapproximatetechniqueshavebeen (cid:12)rst work to address the issue [42] suggests storing proposed [59]. in each bucket the maximum di(cid:11)erence between the Another interesting development has been the pro- actual and the approximate (typically, the average) posal of algorithms to identify optimal sets of his- frequency of a value in the bucket and using that to provide upper bounds on the error of any selectivity problematic places in the schema, approximating the estimates produced by the histogram for equality and distributions of parent ids for di(cid:11)erent elements. In range selection queries. An interesting alternative fo- the XPathLearnerapproach [49], ((cid:12)rst-order) Markov cusesonoptimizingtop-Nrangequeriesandstoresad- Histograms are used [1], where the frequencies of ditionalinformationonaper-histogramratherthanon the results of traversing all paths of length 2 are a per-bucket basis [20]. stored in two 2-dimensional histograms. The dimen- sions always represent the ‘from’ and ‘to’ nodes of Other Data Types the paths in the XML graph; in the (cid:12)rst histogram, both nodes/dimensions are for XML tags, whereas Asmentionedearlier,mostworkonhistogramshasfo- in the second histogram, the ‘from’ node/dimension cusedonapproximatingnumericvalues,inoneormul- is an XML tag and the ‘to’ node/dimension is a tiple dimensions (attributes). Nevertheless, the need value. Assuming enough memory, frequencies are to approximation is much broader, and several e(cid:11)orts maintained accurately for all tag-to-tag pairs (accu- have examined the use of histograms for other data rate histogram), as there are very few. To the con- types as well. trary, <tag,value> pairs are placed in a histogram With respect to spatial data, the canonical ap- that is based on a 2-dimensional version of the com- proaches to 2-dimensional histograms do not quite pressed partition constraint, with frequency as the workoutastheseareforpointdataanddonotextend source parameter. Another approach for estimating toobjectsthatare2-dimensionalthemselves. Further- XML-query result sizes builds position histograms on more,frequencyisusuallynotanissueinspatialdata, a 2-dimensional space as well, only here the two di- asspatialobjectsarenotrepeatedinadatabase. Sev- mensionsaredirectlyorindirectlyrelatedtothenum- eral interesting techniques have been presented to ad- beringofeachnodeinapreordertraversaloftheXML dress the additional challenges, which essentially are graph [79]. Finally, histogramshave also been used in related to the partition rule, i.e., how the spatial ob- combination with or as parts of other data structures jects are grouped into buckets. Some form buckets for XML approximations. The XSketch is quite an by generalizing conventional histogram partition con- e(cid:11)ective graph-based synopsis that tries to captured straintswhileothersdoitbyfollowingapproachesused both the structuraland thevalue characteristicsof an in spatial indices (e.g., R-trees). The MinSkew His- XML (cid:12)le [63, 64]. Histograms enter the picture as togram [5] is among the more sophisticated ones and they are used at various parts of an XSketch to cap- dividesthe spacebyusingbinarypartitionings(recur- ture statistical correlations of elements and values in sively dividing the space along one of the dimensions particular neighborhoods of the XSketch graph. eachtime)sothattheoverallspatialskewofallbuckets In addition to XML graphs, histograms have also is minimized. The latter captures the variance in the been proposed to capture the degrees of the nodes in density of objects within each bucket, so it follows, in general graphs as a way to compare graphs between somesense,thespiritofthev-optimalhistograms. The them and grade their similarity [60]. SQ-Histogram [3] is an interesting alternative, divid- ing the space according to the Quad-tree rule (which Unconventional Histograms ismorerestrictivethanarbitrarybinarypartitionings) and, in addition to spatial proximity, taking into ac- Throughout the years,there have been a few interest- count proximity in the size as well as the complexity ing pieces of work that do not quite follow the gen- (numberofvertices)ofthepolygonsthatareplacedin eral histogram taxonomy or histogram problem de(cid:12)- the same bucket. Both approaches are quite e(cid:11)ective, nitions. One of them suggests the use of the Discrete with SQ being probably the overall winner. Spatial CosineTransform(DCT) tocompressanentiremulti- histograms, i.e., MinSkew histograms, have also been dimensional histogram and store its compressed form extended to capture the velocity of object movement, [48]. Itemploysaverysimplemulti-dimensionalparti- thus becoming able to approximate spatio-temporal tionrule(auniformgridovertheentirespace),divides data as well [17]. the space into a large number of small buckets, and The recent interest in XML could not, of course, then compresses the bucket information using DCT. leaveuntouched XML (cid:12)le approximation,XML query Thisappearstosaveonspacebutalsoestimationtime, resultsizeestimation,andotherrelatedproblems. The as it is possible to recover the necessary information semi-structured nature of XML (cid:12)les does not lend through the integral of the inverse DCT function. itself to histogram-based approximation, as there is Thereisalsoapromisinglineofworkthatcombines no immediate multi-dimensional space that can be histograms with other techniques to produce higher- bucketized but one needs to be formed from some quality estimations than either technique could do numeric XML-(cid:12)le characteristics. In the StatiX ap- alone. In addition to several such combinations with proach [22], information in an XML Schema is used sampling, a particularly interesting technique tries to to identify potential sources of structural skew and overcome the ‘curse of dimensionality’ by identify- then 1-dimensional histograms are built for the most ing the critical areasof dependence and independence among dimensions in multi-dimensional data, captur- rathercomplexpredicateswheneverneeded(selections ing them with a statistical interaction model (e.g., and single-table functions are already available, while log-linear model) which can then form the basis for joins will be in the next release). It also takes into lower-dimensionalMHIST histogramsto approximate account the dependencies that exist between the at- the overalljoint data distribution [19]. tributesofacube’sdimensions’hierarchiesduringroll- Finally, there is a very interesting departure from up and provides estimates at the appropriate hierar- the conventionthat histogramsare built on base rela- chylevel. Finally,thenextreleasewillemploylearning tions and estimations of the data distributions of in- techniquestorememberselectivitiesofpastpredicates termediate query results are obtained by appropriate and use them in the future. manipulations of these base-relation histograms [12]. SQL Serveremploysmaxdi(cid:11) histogramswith value It discusses the possibility of maintaining histograms asthesortparameterandessentiallyaveragefrequency oncomplex queryresults,which provesto be quite ef- (within each bucket) as the source parameter [9]. It fective in some cases. This work uses the main SQL permits up to 199 buckets, storing within eachbucket Serverhistograms(essentiallymaxdi(cid:11) - see Section 4) the frequency of the max value and (essentially) the todemonstratetheproposedapproach,buttheoverall cumulative frequencyof all valuesless than that. His- e(cid:11)ort is orthogonal to the particular histogram class. togram construction is typically based on a sample As the number of potentialcomplex queryhistograms of the data. Composite indices are used in a similar is much larger than that of base-relation histograms, fashion as in the other systems for obtaining multi- the corresponding database design problem of choos- dimensional selectivity information. ing whichhistogramstoconstructisaccordinglymore NotethatallcommercialDBMSshaveimplemented di(cid:14)cult as well. Fortunately, a workload-based algo- strictly1-dimensionalhistograms. Exceptforsomein- rithm proves adequate for the task. cidentalindirectinformation,theyessentiallystillem- ploy the attribute value independence assumption and 4 Industrial Presence of Histograms havenotventuredo(cid:11)tomulti-dimensionalhistograms. Histograms have not only been the subject of much 5 Competitors of Histograms research activity but also the favorite approximation method of all commercial DBMSs as well. Essentially The main technique that has competed against his- all systems had equi-width histograms in the begin- tograms in the past decade is wavelets, which is very ning and then eventually moved to equi-depth his- important for image compression and has been intro- tograms. In this section, we brie(cid:13)y describe the cur- duced into the database world in the late 90’s [7, 56]. rently adopted histogram class for three of the most Wavelets have been used extensively for approximate popular DBMSs. answering of di(cid:11)erent query types and/or in di(cid:11)er- DB2 employs compressed histograms with value as entenvironments: multidimensionalaggregatequeries the sort parameter and frequency as the source pa- (range-sum queries) in OLAP environments [75, 76], rameter[50]. Usersmayspecifythenumbersofsingle- aggregate and non-aggregate relational queries with ton and non-singleton buckets desired for the most- computations directly on the stored wavelet coe(cid:14)- frequent values and the equi-depth part of a com- cients [14], and selection and aggregate queries over pressed histogram, respectively, with the default be- streams[28]. Aswithhistograms,therehavealsobeen ing 10 and 20. A departure from our general descrip- e(cid:11)orts to devise wavelet-based techniques whose ap- tions above is that DB2 stores cumulative frequencies proximatequeryanswersareprovidedwitherrorguar- within non-singleton buckets. Histogram construction antees [24], as well as to construct and maintain the is based on a reservoir sample of the data. DB2 ex- most important wavelet coe(cid:14)cients dynamically [57]. ploits multi-dimensional cardinality information from Sampling is not a direct competitor to histograms, indices on composite attributes (whenever they are as it is mostly a runtime technique, and furthermore, available) to obtain some approximate quanti(cid:12)cation the literature on sampling is extremely large, so it is of any dependence that may exist between the at- impossible to analyze the corresponding highlights in tributes, and uses this during selectivity estimation. the limited space of this paper. However, we should Otherwise,itassumesattributesareindependent. The emphasize that sampling is often a complementary learning capabilities of LEO [70] play a major role in technique to histograms, as static (and even several howallavailableinformationisbestexploitedforhigh- forms of dynamic) histogramsare usually constructed quality estimation. based on a sample of the original data [15, 26, 62]. Oraclestillemploysequi-depth histograms[78]. Its There are also several specialized techniques that basic approach to multi-dimensional selectivities is have been proposed and compete with histograms on similar to that of DB2, based on exploiting any avail- speci(cid:12)cestimationproblems. Theseincludetechniques able information from composite indices. In addition for selectivity estimation of select-join queries [71] or to that, however, it o(cid:11)ers dynamic sampling capabil- spatial queries [8], using query feedback to modify ities to obtain on-the-(cid:13)y dependence information for stored curve-(cid:12)tting/parametric information for bet- ter selectivity estimation [16], selectivity estimation applicability,andingeneral,theirrelativecharacteris- foralphanumeric/stringdatain1-dimensional[43,45] ticswhenmutuallycompared. Acomprehensivestudy and multi-dimensional environments [41, 77], identi(cid:12)- needs to be conducted that will include several more cationofquantiles[6,53,54]andtheirdynamicmain- techniquesthanthosementionedhere. The\NewJer- tenance with a priori guarantees [29], approximate sey Data Reduction Report" [7] has examined many query answering for aggregate join queries [4], select- techniques and has produced a preliminary compar- join queries [25], and within the general framework of ison of their applicability to di(cid:11)erent types of data. on-line aggregation [33,32,51], computingfrequencies It can serve as a good starting point for veri(cid:12)cation, of high-frequency items in a stream [52], and others. extrapolation, and further exploration, not only with Despitetheirsuboptimalitycomparedtosomeofthese respect to applicability, but also precise e(cid:11)ectiveness techniquesonthecorrespondingproblems,histograms trade-o(cid:11)s,e(cid:14)ciencyofthealgorithms,andotherchar- remain the method of choice, due to their overall ef- acteristics. fectiveness and wide applicability. Bucket Recognition and Representation 6 The Future of Histograms The goal of any form of (partition-based) approxima- Despite the success of histograms, there are several tion, e.g., histogram-based and traditional clustering, problems whose current solutions leave enough space istoidentifygroupsofelementssothatallthosewithin forsigni(cid:12)cantimprovementandseveralothersthatre- a group are similar with respect to a small number main wide open, whose solution would make the ap- of parameters that characterize them. By storing ap- plicability of histograms much wider and/or their ef- proximations of just these parameters, one is able to fectiveness higher. We haverecentlydiscussedvarious reconstruct an approximation of the entire group of problems of both types [35], some addressing speci(cid:12)c elements with little error. Note that, in the terms histogram characteristics from the existing taxonomy of the histogram taxonomy, these parameters should while othersbeing cast in a slightlymoregeneralcon- be chosen as the source parameter(s), to satisfy the text. In this section, we focus on three of the open proximity-expressingpartition constraint. problems, those that we believe are the most promis- How do we know which parameters are similar for ingandsenseasbeingthefurthestawayfromanypast elementssothatwecangroupthemtogetherandrep- or current work that we are aware of. resent them in terms of them? This is a typical ques- tionfortraditionalpatternrecognition[73], wherebe- Histogram Techniques and Clustering fore applying any clustering techniques, there is an Abstracting away the details of the problem of earlier stage where the appropriate dimensions of the histogram-based approximation, one would see some elements are chosen among a great number of possi- striking similarities with the traditional problem of bilities. There are several techniques that make such clustering [73]: the joint data distribution is parti- a choice with varying success depending on the case. tionedintobuckets,whereeachbucketcontainssimilar Itisimportant,however,toemphasizethat,inprin- elements. Similarityisde(cid:12)nedbasedonsomedistance ciple, these parameters may not necessarily be among functionthattakesintoaccountthevaluesofthe data theoriginaldimensionsofthedataelementspresented attributes and the value of the frequency if there is in the problem but may be derivatives of them. For any variation on it (e.g., if it is not equal to 1 for all example, in several histogram-based approximations dataelements). Thebucketsareessentiallyclustersin as we have described them above, proximity is sought the traditional sense, and for each one, a very short directlyforfrequenciesbutnotforattributevalues,as approximationof the elements that fall in it is stored. attention there is on their spreads. (Recall also the Despite the similarities, the techniques that have success of area as a source parameter, which is the been developed for the two problems are in general product of frequencywith spread.) Thefrequenciesin very di(cid:11)erent, with no well-documented reasoning for a bucket are assumed constant and require a smaller many of these di(cid:11)erences. Why can’t the histogram amount of information to be stored for their approxi- techniquesthathavebeendevelopedforselectivityes- mation than the attribute values, which are assumed timation be used for clustering or vice versa? From to follow a linear rule (equal spread). Hence, con- another perspective, why can’t the frequency in selec- ventional histogram-based approximation, under the tivity estimation be considered as another dimension uniformdistributionanduniformspreadassumptions, of thejointdatadistributionandhavethe problembe implies clustering in the derived space of frequency considered as traditional clustering? What would the and spread. In principle, however, not all data distri- impact be of using stored approximations developed butions are served best with such an approach. foroneproblemtosolveanother? Ingeneral,giventhe To increase the accuracy of histogram approxima- greatvarietyoftechniquesthatexistforthetwoprob- tions, there should be no (cid:12)xed, prede(cid:12)ned approxi- lems, it is crucial to obtain an understanding of the mation approach to the value dimensions and the fre- advantagesanddisadvantagesofeachone,itsrangeof quencies. It should not necessarily even be the same for di(cid:11)erent buckets. Histograms should be (cid:13)exible Returning to precise query answering, note that enough to use the optimal approximation for each di- typicallyindicesarebuiltassumingallvaluesorranges mension in each bucket, one that would produce the of values being equally important. Hence, having a best estimations for the least amount of information. balanced tree becomes crucial. There are often cases, Identifying what that optimal approximation is, is a however, where di(cid:11)erent values have di(cid:11)erent impor- hard problem and requires further investigation. tance and di(cid:11)erent frequency in the expected work- loads [46]. If this query frequency or some other such Histograms and Tree Indices parameter is used in conjunction with advanced his- togrambucketizationrules,someveryinterestingtrees The fact that thereis acloserelationshipbetween ap- wouldbegeneratedwhoseaveragesearchperformance proximate statistics kept in databases, especially his- might be much better than that of the B+ tree. tograms, and indices has been recognized in the past From the above, it is clear that the interaction be- in severalworks [7]. If one considers the root of a B+ tween histograms and indices presents opportunities tree, the values that appear in it essentially partition but also several technical challenges that need to be the attributeonwhichitisbuiltintobucketswith the investigated. The trade-o(cid:11) between hierarchical his- corresponding borders. Each bucket is then further togramsthatarebalancedtreeswithequi-depthbuck- subdivided into smaller buckets by the nodes of the etizationandthosethatareunbalancedwithmoread- subsequent level of the tree. One can imagine storing vanced bucketizations requires special attention. The the appropriateinformationnext toeachbucketspec- possibility of some completely new structures that i(cid:12)ed in a node, hence transforming the node into a wouldstrikeevenbettertrade-o(cid:11)s,combiningthebest histogram, and the entire index into a so called hi- of both worlds, cannot be ruled out either. erarchical histogram. This may adversely a(cid:11)ect in- dex search performance, of course, as it would reduce 7 Conclusions the out-degree of the node, possibly making the tree deeper. Nevertheless,althoughthisideaworksagainst Histograms have been very successful within the themainfunctionalityofanindex,itsbene(cid:12)tsarenon- databaseworld. Thereasonisthat,amongseveralex- negligibleaswell,soithasevenbeenincorporatedinto isting competing techniques, they probably represent some systems. the optimal point balancing the tradeo(cid:11)between sim- Webelievethathierarchicalhistogramsand,ingen- plicity, e(cid:14)ciency, e(cid:11)ectiveness, and applicability for eral,theinteractionbetweenapproximationstructures dataapproximation/compression. Research-wisemost and indices should be investigated further, as there of the basicproblemsaroundhistogramsseem tohave are several interesting issues that remain unexplored been solved, but we believe there are still better so- as analyzed below. Consider again a B+ tree whose lutions to be found for some of them. Moreover, as nodesarecompletelyfull. Inthatcase,the rootof the outlined in the previous section, there are some un- tree speci(cid:12)es a bucketization of the attribute domain touchedfoundationalproblemswhosesolutionmayre- thatcorrespondstoanequi-depthhistogram,i.e.,each quire signi(cid:12)cant changes in our overall perspective on bucket contains roughly an equal number of elements histograms. As much as the past ten years have been under it. Similarly, any node in the tree speci(cid:12)es an enjoyable and productive in deepening our collective equi-depthbucketizationoftherangeofvaluesitleads understandingofhistogramsandapplyingtheminthe to. real world, we believe the next ten will be even more The main issue with B+ trees being turned into exciting and really look forward to them! hierarchical equi-depth histograms is that the latter are far from optimal overall on selectivity estimation 8 Personal History [67]. Histograms like v-optimal and maxdi(cid:11) are much more e(cid:11)ective. What kind of indices would one get if Our personal history with histograms has been each node represented bucketizations following one of strongly in(cid:13)uenced by Stavros Christodoulakis. It all these rules? Clearly, the trees would be unbalanced. started during the \Query Optimization Workshop", Thiswouldmaketraditionalsearchlesse(cid:14)cientonthe which wasorganizedin conjunction with SIGMOD’89 average. On the other hand, other forms of searches inPortland,whenStavrosarguedthatoptimizingvery would be served more e(cid:11)ectively. In particular, in a large join queries did not make any sense, as the er- system that provides approximate answers to queries, rors in the selectivity estimates would be very large the root of such a tree would provide a higher-quality after a few joins. Wanting to prove him wrong due answer than the root of the corresponding B+ tree. to a personal interest in large query optimization, we Furthermore, the system may move in a progressive started collaborating with him on the error propaga- fashion, traversing the tree as usual and providing a tion problem, work that led to results that justi(cid:12)ed series of answers that are continuously improving in Stavros’ fears completely [36]. During this e(cid:11)ort, we quality, eventually reaching the leaves and the (cid:12)nal, were initiated by Stavros into the wonderful world of accurate result. majorizationtheory,Schurfunctions,andalltheother
Description: