Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes Owen Kaser Daniel Lemire Kamel Aouiche Dept. ofCSAS LICEF,UniversitéduQuébec LICEF,UniversitéduQuébec UniversityofNewBrunswick àMontréal àMontréal 100TuckerParkRoad 100SherbrookeWest 100SherbrookeWest SaintJohn,NB,Canada Montreal,QC,Canada Montreal,QC,Canada [email protected] [email protected] [email protected] ABSTRACT 9 histogram-oblivious. They ignore the number of attribute values 0 Bitmap indexes must be compressed to reduce input/output costs andtheirfrequencies. Otherheuristicsarehistogram-aware. They 0 includecolumnreorganizationsandfrequency-awareordering. On andminimizeCPUusage. Toacceleratelogicaloperations(AND, 2 larger data sets [2], we had considered histogram-oblivious row- OR, XOR) over bitmaps, we use techniques based on run-length orderingheuristics. Sortingbeforeindexingreducedthetotalcon- n encoding(RLE),suchasWord-AlignedHybrid(WAH)compres- struction time. Our main contribution is an evaluation of practi- a sion.Thesetechniquesaresensitivetotheorderoftherows:asim- J ple lexicographical sort can divide the index size by 9 and make calhistogram-awareheuristicstotheroworderingproblem. Sec- ondarycontributionsincludeguidelinesaboutwhen“unary”bitmap 9 indexes several times faster. We investigate reordering heuristics encodingispreferred,andanimprovementoverthenaivebitmap 1 basedoncomputedattribute-valuehistograms. Simplypermuting constructionalgorithm—itisnowpracticaltoconstructbitmapin- thecolumnsofthetablebasedonthesehistogramscanincreasethe dexesovertableswithhundredsofmillionsofrowsandmillionsof ] sortingefficiencyby40%. B attributevalues. D CategoriesandSubjectDescriptors Tofurtherreducethesizeofbitmapindexes,wecanbintheat- tribute values [7,12,17,20]. For range queries, different bitmap s. H.3.2[InformationStorageandRetrieval]:InformationStorage; encodingshavedifferentspace-performancetradeoffs[5,6]. c E.1[Data]:DataStructures [ GeneralTerms 2. BITMAPINDEXES 3 v Wefindbitmapindexesinseveraldatabasesystems,apparently Algorithms,Performance,Experimentation. 3 beginning with the MODEL 204 engine, commercialized for the 8 IBM370in1972. 1. INTRODUCTION 0 Thesimplestandmostcommonmethodofbitmapindexingas- 2 Bitmapindexesareamongthemostcommonlyusedindexesin sociates a bitmap with every attribute value v of every attribute . datawarehouses[3,8]. Withoutcompression,bitmapindexescan a; the bitmap represents the predicate a=v. For a table with 8 beimpracticallylargeandslow.Word-AlignedHybrid(WAH)[25] nrows(facts)andccolumns(attributes/dimensions),eachbitmap 0 isacompetitivecompressiontechnique:comparedtoLZ77[5]and haslengthn. Initially,allbitmapvaluesaresetto0. Forrow j,we 8 0 Byte-AlignedBitmapCompression(BBC)[1],WAHindexescan setthe jth componentofcbitmapsto1. Iftheith attributehasni : betentimesfaster[24]. possiblevalues,wehaveL=∑ci=1nibitmaps. v Run-lengthencoding(RLE)andsimilarencodingschemes(BBC Bitmap indexes are fast, because we find rows having a given Xi andWAH)makeitpossibletocomputelogicaloperationsbetween valuevforattributeabyreadingonlythebitmapcorrespondingto bitmapsintimeproportionaltothecompressedsizeofthebitmaps. valuev(andnottheotherbitmapsforattributea),andthereisonly ar However,theirefficiencydependsontheorderoftherows. While onebit(orless,withcompression)toprocessforeachrow. More computingthebestorderingisNP-hard[2],simpleheuristicssuch complexqueriesareachievedwithlogicaloperations(AND,OR, aslexicographicalsortareeffective. XOR,NOT)overbitmapsandcurrentmicroprocessorcanperform Pinaretal.[14],SharmaandGoyal[18],andCanahuateetal.[4] 32or64bitwiseoperationsinasinglemachineinstruction. usedGray-coderowsortingtoimproveRLEandWAHcompres- Forrow j,exactlyonebitmappercolumnwillhaveits jthentry sion. However,theirlargestbitmapindexcouldfituncompressed setto1.AlthoughtheentireindexhasnLbits,thereareonlync1’s; inRAMonaPC. formanytables,L(cid:29)candthusonaveragethetableisverysparse. Wedistinguishtwotypesofheuristicsforthisproblem. Heuris- Long(hencecompressible)runsof0’sareexpected. ticssuchaslexicographicalsort[2]orGray-codesorting[14]are Onecanalsoreducethenumberofbitmapsforlargedimensions. GivenLbitmaps,thereareL(L−1)/2pairsofbitmaps.So,instead Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor ofmappinganattributevaluetoasinglebitmap,wemapthemto personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare pairsofbitmaps(seeTable1). Werefertothistechniqueas2-of- notmadeordistributedforprofitorcommercialadvantageandthatcopies N encoding [23]; with it, we can use far fewer bitmaps for large bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to dimensions. For instance, with only 2,000 bitmaps, we can rep- republish,topostonserversortoredistributetolists,requirespriorspecific resentanattributewith2milliondistinctvalues. Buttheaverage permissionand/orafee. bitmapdensityismuchhigherwith2-of-Nencoding,andthuscom- DOLAP’08,October30,2008,NapaValley,California,USA. Copyright2008ACM978-1-60558-250-4/08/10...$5.00. pression may be less effective. More generally, k-of-N encoding Table1:Exampleof1-of-Nand2-of-Nencoding Montreal 100000000000000 110000 Paris 010000000000000 101000 Toronto 001000000000000 100100 NewYork 000100000000000 011000 Berlin 000010000000000 010100 allows L bitmaps to represent (cid:0)L(cid:1) distinct values; conversely, us- k ingL=(cid:100)kn1i/k(cid:101)bitmapsissufficienttorepresentnidistinctvalues. However,searchingforaspecifiedvaluevnolongerrequiresscan- ningasinglebitmap.Instead,thecorrespondingkbitmapsmustbe combinedwithabitwiseAND. Thereisatradeoffbetweenindex Figure1:EnhancedWord-AlignedHybrid(EWAH) sizeandtheindexspeed[2]. Forsmalldimensions,usingk-of-Nencodingmayfailtoreduce Algorithm1Constructingbitmaps. Forsimplicity,weassumethe thenumberofbitmaps,butstillreducetheperformance.Weapply numberofrowsismultipleofthewordsize. thefollowingheuristic.Anycolumnwithlessthan5distinctvalues Construct:B1,...,BL,Lcompressedbitmaps islimitedto1-of-N encoding(simpleorunarybitmap). Anycol- ωi←0for1≤i≤L. umnwithlessthan21distinctvalues,islimitedtok=1,2,andany c←1{rowcounter} columnwithlessthan85distinctvaluesislimitedtok=1,2,3. N ←0/{N recordsthedirtiedbitmaps} foreachtablerowdo foreachattributeintherowdo 3. COMPRESSION foreachbitmapicorrespondingtotheattributevaluedo RLE compresses efficiently when there are long runs of iden- settotruethe(cmodw)thbitofwordωi N ←N ∪{i} tical values: it works by replacing any repetition by the number ifcisamultipleofwthen ofrepetitionsfollowedbythevaluebeingrepeated. Forexample, foriinN do the sequence 11110000 becomes 4140. Current microprocessors addc/w−|Bi|−1cleanwords(0x00)toBi performoperationsoverwordsof32or64bitsandnotindividual addthewordωitobitmapBi bits. Hence, theCPUcostofRLEmightbelarge[19]. Bytrad- ωi←0 ingsomecompressionformorespeed,Antoshenkov[1]defineda N ←0/ c←c+1 RLEvariantworkingoverbytesinsteadofbits: theByte-Aligned foriin{1,2,...,L}do BitmapCompression(BBC).Tradingevenmorecompressionfor addc/w−|Bi|−1cleanwords(0x00)toBi evenmorespeed,Wuetal.[25]proposedtheWord-AlignedHybrid (WAH). Their scheme is made of two different types of words1. Naively,wecouldcomputelogicaloperationsbetween2bitmaps Thefirstbitofeveryworddistinguishesaverbatim(ordirty)31- inn/32bitwiseoperations. Instead,wecomputelogicaloperations bitwordfromarunningsequenceof31-bitcleanwords(0x00or (OR,AND,XOR)between2bitmapsintimeO(|B1|+|B2|)where 1x11). Runningsequencesarestoredusing1bittodistinguishbe- |Bi|isthesizeofthecompressedbitmap[2,25]. Finally, wecan tweenthetypeofword(0for0x00and1for1x11)and30bitsto boundthebitmapsizes:|ViBi|≤mini|Bi|and|WiBi|≤∑i|Bi|. representthenumberofconsecutivecleanwords.Hence,abitmap oflength62containingasingle1-bitatposition32wouldbecoded 4. SORTINGTOIMPROVECOMPRESSION asthewords100x01and010x00. Becausedirtywordsarestored inunitsof31bitsusing32bits,WAHcompressioncanexpandthe Sortingcanbenefitbitmapindexesatseverallevels.Wecansort data by 3%. We created our own WAH variant called Enhanced therowsofthetable. Thesortingorderdependsitselfontheorder Word-Aligned Hybrid (EWAH). Contrary to WAH compression, ofthetablecolumns. Andfinally, wecanallocatethebitmapsto EWAH may never (within 0.1%) generate a compressed bitmap theattributevaluesinsortedorder. largerthantheuncompressedbitmap. Italsousesonlytwotypes 4.1 Sortingrows ofwords(seeFig.1).Thefirsttypeisa32-bitverbatimword.The secondtypeofwordisamarkerword: thefirstbitisusedtoindi- Reorderingtherowsofacompressedbitmapindexcanimprove catewhichcleanwordwillfollow, 16bitstostorethenumberof compression. Whether using RLE, BBC, WAH or EWAH, the cleanwords,and15bitstostorethenumberofdirtywordsfollow- problemisNP-hardbyreductionfromtheHamiltonianpathprob- ingthecleanwords.EWAHbitmapsbeginwithamarkerword. lem[2,Theorems1and2]. Asimpleheuristicbeginswithanun- GivenLbitmapsandnrows,wecannaivelyconstructabitmap compressed index. Rows (binary vectors) are then rearranged to index in time O(nL) by appending a word to each compressed promote runs. In the process, we may also reorder the bitmaps. bitmapevery32or64rows. Wefoundthisapproachimpractically ThisistheapproachofCanahuateetal.[4],butitusesΩ(nL)time. slow when L was large—typically, with k=1. Instead, we con- Forthelargedimensionsandnumberofrowswehaveconsidered, struct bitmap indexes in time O(nck+L)=O(nck) [2] where ck itisinfeasible.Amorepracticalapproach[2]istoreorderthetable, is the number of true values per row (See Algorithm 1): within thenconstructthecompressedindexdirectly;wecanalsoreorder each block of 32 rows, we store the values of the bitmaps in a thetablecolumnspriortosorting. set—omitting any unsolicited bitmap, whose values are all false Threetypesoforderingcanbeusedfororderingrows. Wemay (0x00).Wepartitionthetablehorizontallyintoblocksindexedwith clusteridenticalrows,butitisnotacompetitiveheuristic[2]. compressedbitmapsusingafixedmemorybudget(256MiB).Each blockofbitmapsiswrittensequentiallyandprecededbyanarray • Inlexicographicorder,asequencea1,a2,...issmallerthan of4-byteintegerscontainingthelocationofeachbitmap. anothersequenceb1,b2,...ifandonlyifthereisa jsuchthat aj<bjandai=bifori< j.TheUnixsortcommandpro- 1Forsimplicity,welimitourexpositionto32bitwords. videsanefficientmeanofsortingflatfilesintolexicographic order;inunder10sourtestcomputer(seeSection6)sorteda 1 5-million-line,120MBfile.SQLsupportslexicographicsort viaORDERBY. 0.8 y • Gsasthepnreqadaruryesae-eniecxoc=GeidsCebtasi(1bG,fjioatCsr2u)v,ci.esh<.co.trtohtijirsa.nstgs2Avmi1alsagjladol=neerirdfitahnt1vhme2a⊕dn2ioanbv2s1eht⊕,riombwb.2e.i,st..⊕Ovh.e.o(acwmitfjo−iatrn1osn(,|d[cvb1o1oj4m|n,](cid:54)=:|lpvytaa2hr|ijee)f, dirtiness probabilit 000...246 reflected GC where|vi|isthenumberoftruevalueinbitvectorvi. Sort- binary Savage-Winkler GC ingtherowsofabitmapindexwithoutmaterializingtheun- random 0 compressed bitmap index is possible [2]: we implemented 5 10 15 20 25 30 an O(ncklogn)-time solution for k-of-N indexes using an distinct items in block of 32 external-memoryB-tree[10]. Unfortunately,itprovedtobe twoordersofmagnitudeslowerthanlexicographicsort. Figure2:Probabilitiesthatabitmapwillcontainadirtyword, whenseveral(x-axis)of1000possibleattributevaluesarefound in a 32-row chunk. Effects are shown for values with k-of-N Algorithm2Gray-codelesscomparatorbetweensparsebitvectors codesthatareadjacentinGCorder,adjacentinlexicographic INPUT:arraysaandbrepresentingthepositionoftheonesin order,orrandomlyselected. twobitvectors k>2, where bitmaps are denser. Selecting the codes randomly OUTPUT:whetherthebitvectorrepresentedbyaislessthan isdisastrous. Hence,sortingpartofacolumn—evenonewithout theonerepresentedbyb longrunsofidenticalvalues—improvescompressionfork>1. f ←true ForencodingslikeBBC,WAHorEWAH,GCsortingisnotopti- m←min(length(a),length(b)) mal,evenwhenallk-of-Ncodesarepresent.Forexampleconsider forpin1,2,...,mdo thesequenceofrows1001, 1010, 1100,0101, 0101,0110, 0110, return f ifap>bpand¬f ifap<bp 0011. Using 4-bit words, we see that a single bitmap contains a f ←¬f cleanword(0000)whereasbyexchangingthefifthandsecondrow, return¬f iflength(a)>length(b), f iflength(b)>length(a), wegettwocleanwords(0000and1111). andfalseotherwise 4.2 Sortingbitmapcodes ForRLE,thebestorderingoftherowsofabitmapindexmin- For a simple index, the map from attribute value to bitmaps is imizesthesumoftheHammingdistances: ∑ih(ri,ri+1)whereri inconsequential;fork-of-Nencodings,somebitmapallocationsare is the ith row, for h(x,y)=|{i|xi (cid:54)=yi}|. If all 2L different rows morecompressible:consideranattributewithtwooverwhelmingly arepresent,theGCsortwouldbeanoptimalsolutiontothisprob- frequent values and many other values that occur once each. If lem [14]. The following proposition shows that GC sort is also optimalifall(cid:0)N(cid:1)k-of-Ncodesarepresent.Thesameisnottrueof thetablerowsaregiveninrandomorder,thetwofrequentvalues k shouldhavecodesthatdifferaslittleaspossible. lexicographicorderwhenk>1: 0110immediatelyfollows1001 There are several ways to allocate the bitmaps. Firstly, the at- among2-of-4codes,buttheirHammingdistanceis4. tributevaluescanbevisitedinalphabeticalornumericalorder,or— for histogram-aware schemes—in order of frequency. Secondly, Proposition1 Wecanenumerate,inGCorder,allk-of-Ncodesin thebitmapcodescanbeusedindifferentorders. Weconsiderlex- timeO(k(cid:0)Nk(cid:1))(optimalcomplexity). Moreover, theHammingdis- icographicalordering(1100,1010,1001,0110,...) andGCorder tancebetweensuccessivecodesisminimal(=2). (1001,1010,1100,0101,...)ordering(seeproofofProposition1). Fordenselow-dimensionaltables, GCorderispreferable[2]and PROOF. Letabeanarrayofsizek indicatingthepositionsof itscompressioneffectsarecomparabletosortingtheindexrowsin theonesink-of-N codes. Astheexternalloop,varythevaluea1 GCorder. Meanwhile, itistechnicallyeasiertoimplementsince from1toN−k+1. Withinthisloop,varythevaluea2 fromN− wecansortthetablelexicographicallyandonlyuseGCordering k+2 down to a1+1. Inside this second loop, vary the value of duringthebitmapindexconstruction. a3 froma2+1uptoN−k+3,andsoon. Byinspection,wesee Alpha-Lex denotes sorting the table lexicographically and as- thatallpossiblecodesaregeneratedinincreasingGCorder.Tosee signingbitmapcodessothattheithattributegetsthelexicographi- thattheHammingdistancebetweensuccessivecodesis2,consider callyithsmallestbitmapcode. Gray-Lexissimilar,exceptthatthe whathappenswhenaicompletesaloop. Supposethatiisoddand ithattributegetstherank-ibitmapcodeinGCorder.Thesetwoap- greater than 1, then ai had value N−k+i and it will take value proachesarehistogramoblivious—theyignorethefrequenciesof ai−1+1.Meanwhile,byconstruction,ai+1(ifitexists)remainsat attributevalues. svoaloune.NT−heka+rgiu+m1enwthiserseimasilaair+i2frieimsaevinesna.tv(cid:50)alueN−k+i+2and Knowingthefrequencyofeachattributevaluecanimprovecode assignment when k>1. For instance, clustering dirty words in- creasesthecompressibility.Withinacolumn,Alpha-LexandGray- Foragivencolumn,supposethatinablockof32rows,wehave Lexorderrunsofidenticalvaluesirrespectiveofthefrequency:the j distinct attribute values. We computed the average number of sequence afcccadeaceabe may become aaaabccccdeeef. For bitmapsthatwouldhaveadirtyword(seeFig.2). Comparingk- better compression, we should order the attribute values—within of-NcodesthatwereadjacentinGCorderingagainstk-of-Ncodes a column—by their frequency (e.g., aaaacccceeebdf). Allocat- thatwerelexicographicallyadjacent, thedifferencewasinsignifi- ingthebitmapcodesinGCordertothefrequency-sortedattribute cant for k=2. However, GC ordering is substantially better for values, our Gray-Frequency sorts the table rows as follows. Let 2Thesymbol⊕istheXORoperator. f(ai) be the frequency of attribute ai. Instead of sorting the ta- n/32 550000 300000 k=1 k=2 500000 k=3 250000 450000 400000 200000 k=1 350000 k=2 150000 300000 k=3 k=4 250000 100000 200000 50000 150000 100000 0 0 5000 10000 15000 20000 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 (a) Uniform histograms with cardinalities 200, 400, Figure 3: Storage gain in words for sorting a given column 600,800 with 100,000 rows and various number of attribute values (2δ(kn,(cid:100)kni1/k(cid:101),n)−4ni). 240000 kk==12 220000 k=3 blerowsa1,a2,...,ad,welexicographicallysorttheextendedrows 200000 k=4 f(a1),a1,f(a2),a2,...,f(ad),ad bycomparingthefrequenciesby 180000 theirnumericalvalue.Thefrequencies f(ai)arediscardedpriorto 160000 indexing. 140000 120000 4.3 Choosingthecolumnorder 100000 Lexicographic table sorting uses the ith column as the ith sort 80000 key: it uses the first column as the main key, the second column 60000 tobreaktieswhentworowshavethesamefirstcomponent,andso 40000 on.Somecolumnorderingsleadtosmallerindexesthanothers[2]. 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 Wemodelthestoragecostofabitmapindexasthesumofthe (b)Zipfiandatawithskewparameters1.6,1.2,0.8and number of dirty words and the number of sequences of identical 0.4 cleanwords(1x11or0x00).IfasetofLbitmapshasxdirtywords, then there are at most L+x sequences of clean words; the stor- Figure4:Indexsizesinwordsforvariousdimensionorderson agecostisatmost2x+L. Thisboundwillbetighterforsparser synthetic data (100,000 rows). Zipfian columns have 100 dis- bitmaps.Becausethesimpleindexofacolumnhasatmostn1-bits, tinctvalues.Ordering“1234”indicatesorderingbydescending ithasatmostndirtywords,andthus,thestoragecostisatmost3n. skew(Zipfian)orascendingcardinality(uniform). Thenextpropositionshowsthatthestoragecostofasortedcolumn isboundedby5ni. sionsofdifferentsizes(seeFig.4(a))andusingsame-sizedimen- Proposition2 UsingGC-sortedk-of-Lcodes,asortedcolumnwith sions of different skew (see Fig. 4(b)). We then determined the nidistinctvalueshasnomorethan2nidirtywords,andthestorage Gray-Lexindexsizeforeachofthe4! differentdimensionorder- costisnomorethan4ni+(cid:100)kn1i/k(cid:101). isnhgosu.ldBabseeodrodnertehdesferormesuleltass,tftoormspoasrtseskienwdeexde,sa(nkd=fr1o)m,dsimmaelnlessiotntos Fork=1, Proposition2istrueirrespectiveoftheorderofthe largest;whereastheoppositeistruefork>1. values,aslongasidenticalvaluesappearsequentially.Anotherex- Asensibleheuristicmightbetosortcolumnsbyincreasingden- treme is to assume that all 1-bits are randomly distributed. Then sity (≈n−1/k). However, a very sparse column (n1/k (cid:29)w) will i i sparsebitmapindexeshave≈δ(r,L,n)=(1−(1− r )w)Ln dirty notbenefitfromsorting(seeFig.3)andshouldbeputlast.Hence, Ln w wordswhereristhenumberof1-bits,Listhenumberofbitmaps we use the following heuristic: columns are sorted in decreasing and w is the word length (w=32). Hence, we have an approxi- orderwithrespecttomin(n−1/k,(1−n−1/k)/(4w−1)): thisfunc- mate storage cost of 2δ+(cid:100)kn1i/k(cid:101). The gain of column C is the tion is maximum at densityi n−1/k =1i/(4w) and it goes down to difference between the expected storage cost of a randomly row- i zeroasthedensitygoesto1. InFig.4(a),thisheuristicmakesthe shuffledC,minusthestoragecostofasortedC. Weestimatethe bestchoiceforallvaluesofk.Weconsiderthisheuristicfurtherin gainby2δ(kn,(cid:100)kn1i/k(cid:101),n)−4ni (seeFig.3)forcolumnswithuni- Section6.3. formhistograms. Thegainismodal: itincreasesuntilamaximum isreachedandthenitdecreases. Themaximumgainisreachedat ≈(n(w−1)/2)k/(k+1):forn=100,000andw=32,themaximum 4.4 Avoidingcolumnorder isreachedat≈1,200fork=1andat≈13,400fork=2.Skewed As an alternative to lexicographic sort and column reordering, histogramshavealessergainforafixedcardinalityni. weintroduceFrequent-Componentsorting,whichuseshistograms Afterlexicographicsorting,theithcolumnisdividedintoatmost tohelpsortwithoutbiasfromafixeddimensionordering. Insort- n1n2···ni−1 sorted blocks. Hence, it has at most 2n1···ni dirty ing, we compare the frequency of the ith most frequent attribute words.Whenthedistributionsareskewed,theithcolumnwillhave valuesineachoftworowswithoutregard(exceptforpossibletie- blocksofdifferentlengthsandtheirorderingdependsonhowthe breaking)towhichcolumnstheycomefrom.Withappropriatepre- columns are ordered. To assess these effects, we generated data andpost-processing,itispossibletoimplementthisapproachusing with 4 independent columns: using uniformly distributed dimen- astandardsortingutilitysuchasUnixsort. 5. PICKINGTHERIGHTK-OF-N Table2:Characteristicsofdatasetsused. ChoosingkandN areimportantdecisions. Wechooseasingle rows cols ∑ini size k value for all dimensions3, leaving the possibility of varying k Census-Income 199523 42 103419 99.1MB bydimensionasfuturework. Largervaluesofktypicallyleadto 4-dprojection 199523 4 102609 2.96MB asmallerindexandafasterconstructiontime—althoughwehave DBGEN 13977980 16 4411936 1.5GB observedcaseswherek=2makesalargerindex. However,query 4-dprojection 13977980 4 402544 297MB timesincreasewithk:thereisaconstructiontime/speedtradeoff. Netflix 100480507 4 500146 2.61GB KJV-4grams 877020839 4 33553 21.6GB Largerkmakesqueriesslower. Wecanboundtheadditionalcostofqueries. Assume(cid:0)Lki(cid:1)=ni. icnoduelxdebseaarveomiduecdhifbwigegearllothwaend1l-aorfg-eNrNin,dbeexceasu.seThoenoerceotiucladllya,wtahyiss Agivenk-of-LibitmapistheresultofanORoperationoveratmost kni/Li≤3ni(k−1)/k unary bitmaps. Because |WiBi|≤∑i|Bi|, the awpopuelnddcarenaateddointieonmaolr1et(ocleevaenr)ybaitttmribaupteth’san1-tohfe-N1-coofd-Ne.iSnidnecxehthaiss, expectedsizeofsuchabitmapisnolargerthan3n(k−1)/ktimesthe this2-of-N indexwouldneverbemuchlargerthanthe1-of-N in- i expectedsizeofaunarybitmap. Aquerylookingforoneattribute dex. So, if N is unconstrained, we can see that there is never a value will have to AND together k of these denser bitmaps. The significantspaceadvantagetochoosingksmall. entireANDingoperationcanbedonebyk−1pairwiseANDsthat Nevertheless,themainadvantageofk>1isfewerbitmaps.We produce intermediate results whose EWAH sizes are increasingly chooseNassmallaspossible. small:2k−1bitmapsarethusprocessed.Hence,theexpectedtime complexity of an equality query on a dimension of size ni is no 6. EXPERIMENTALRESULTS k−1 morethan3(2k−1)nik timeshigherthantheexpectedcostofthe We present experiments to assess the effects of various factors samequeryonak=1index. (choicesofk,sortingapproaches,dimensionorderings)intermsof Foralesspessimisticestimateofthisdependence,observethat EWAH index sizes. These factors also affect index creation and indexesseldomincreaseinsizewhenkgrows. Wemayconserva- querytimes(wereportrealwall-clocktimes). tivelyassumethatindexsizeisunchangedwhenkchanges.There- foretheexpectedsizeofonebitmapgrowsas≈n−1/k/k,leading 6.1 Platform i toquerieswhosecostisproportionalto(2−1/k)n−1/k. Relative Ourtestprograms4 werewritteninC++andcompiledbyGNU i GCC4.0.2onanAppleMacProwithtwodouble-coreIntelXeon tothecostfork=1,whichisproportionalto1/ni,wecansaythat processors (2.66GHz) and 2GiB of RAM. Lexicographic sorts increasingkleadstoqueriesthatare(2−1/k)n(ik−1)/k timesmore of flat files were done using GNU coreutils sort version 6.9. For expensivethanonasimplebitmapindex. all tests involving k=1, we used the sparse implementation ap- Forexample,supposeni=100,goingfromk=1tok=2should proachedinSection3becausewithoutit,theGray-Lexindexcre- increasequerycostabout15foldbutnomorethan90fold.Insum- ationtimeswere20–100timesslower,dependingonthedataset. mary,themovefromk=1toanythinglargercanhaveadramatic negativeeffectonqueryspeeds. Onceweareatk=2,theincre- 6.2 Datasetsused mental cost of going to k=3, k=4 is not so high: whereas the Weprimarilyusedfourdatasets,whosedetailsaresummarized ratiok=2/k=1goesas√ni,theratiok=3/k=2goesasn1i/6. in Table 2: Census-Income [9], DBGEN [21], KJV-4grams, and Netflix[13]. DBGENisasyntheticdataset,whereasKJV-4grams Largerkmakesindexessmaller. isalargelist(includingduplicates)of4-tuplesofwordsobtained Considertheeffectofalength100runofvaluesv1,followedby fromtheversesintheKingJamesBible[16],afterstemmingwith 100repetitionsofv2,then100ofv3,etc.Regardlessofk,whenever thePorteralgorithm[15]andremovalofstemmedwordswiththree weswitchfromv1 tovi+1 atleasttwobitmapswillhavetomake orfewerletters.Occurrenceofroww1,w2,w3,w4indicatesthatthe transitions between 0 and 1. Thus, unless the transition appears firstparagraphofaversecontainswordsw1throughw4,inthisor- atawordboundary,wecreateatleast2dirtywordswheneveran der. Thisdataisascaled-upversionofwordco-occurrencecubes attribute changes from row to row. The best case, where only 2 usedtostudyanalogiesinnaturallanguage[11,22]. EachofKJV- dirtywordsarecreated,isachievedwhenk=1foranyassignment 4grams’ columns contains roughly 8 thousand distinct stemmed of bitmap codes to attribute values. For k>1 and N as small as words.TheNetflixtablehas4dimensions:UserID,MovieID,Date possible,itmaynotbepossibletoachievesofewdirtywords,orit andRating,havingcardinalities5,2182,17770,and480189.De- mayrequireaparticularassignmentofbitmapcodestovalues. tails of how it was obtained from the data downloaded are given Encodingswithk>1findtheirusewhenmany(e.g.15)attribute elsewhere[2]. values fall within a word-length boundary. In that case, a k=1 For some of our tests, we chose four dimensions with a wide index will have at least 15 bitmaps with transitions (and we can rangeofsizes. ForCensus-Income, wechoseage(d1), wageper anticipate15dirtywords).However,iftherewereonly45possible hour(d2),dividendsfromstocks(d3)andanumericalvalue5found valuesinthedimension,wewouldnotneedmorethan10bitmaps in the 25th position (d4). Their respective cardinalities were 91, with k=2. Hence, there would be at most 10 dirty words and 1240,1478and99800. ForDBGEN,weselecteddimensionsof maybelessifwehavesortedthedata(seeFig.2). cardinality7,11,2526and400000.Dimensionsarenumberedby increasingsize:column1hasfewerdistinctvalues. ChoosingN. Itseemsintuitive,havingchosenk,tochooseNtobeassmallas possible. Yet,wehaveobservedcaseswheretheresulting2-of-N 4http://code.google.com/p/lemurbitmapindex/. 3Exceptthatforcolumnswithsmallni,weautomaticallyadjustk 5Theassociatedmetadatasaysthiscolumnshouldbea10-valued downwardwhenitexceedsthelimitsnotedattheendofSection2. migrationcode. 6.3 ColumnOrdering Table4:SizesofEWAHindexes(32-bitwords)forvarioussort- Fig. 5 shows the Gray-Lex index sizes for each column order- ingmethods. ing.ThedimensionsofKJV-4gramsaretoosimilarfororderingto Lexunsorted Gray-Lex Gray-Freq beinteresting, andwehavethusomittedthem. Forsmalldimen- Census-Income k=1 8.49×105 4.87×105 4.87×105 sions, thevalueofk wasloweredusingtheheuristicpresentedin (4d) 2 9.12×105 4.52×105 4.36×105 Section2. Ourresultssuggestthattable-columnreorderinghasa 3 6.90×105 3.73×105 3.28×105 significanteffect(40%). Thisdoesnotcontradicttheobservation 4 4.58×105 2.17×105 1.98×105 byCanahuateet.al[4]thatbitmapreorderingdoesnotchangethe DBGEN 1 5.48×107 3.38×107 3.38×107 sizemuch. (4d) 2 7.13×107 2.76×107 2.74×107 Thevalueofkaffectswhichorderingleadstothesmallestindex: 3 5.25×107 1.50×107 1.50×107 good orderings for k=1 are frequently bad orderings for k>1, 4 3.24×107 1.21×107 1.19×107 Netflix 1 6.20×108 3.22×108 3.19×108 and vice versa. This is consistent with our earlier analysis (see 2 8.27×108 3.17×108 2.43×108 Figs.3and4).ForNetflixandDBGEN,wehaveomittedk=2for 3 5.73×108 1.97×108 1.49×108 legibility:itisinferiortok=1formostorderings. 4 3.42×108 1.37×108 1.14×108 Census-Income’slargestdimensionisverylarge(n4≈n/2);DB- KJV-4grams 1 6.08×109 6.68×108 6.68×108 GENhasalsoalargedimension(n4≈n/35). Sortingcolumnsin 2 8.02×109 9.93×108 7.29×108 decreasingorderwithrespecttomin(n−1/k,(1−n−1/k)/(4w−1)) 3 4.13×109 8.31×108 5.77×108 i i for k=1, we have that only for DBGEN the ordering “2134” is 4 2.52×109 6.39×108 5.01×108 suggested,otherwise,“1234”isrecommended. Thustheheuristic Table4showsindexsizesforourlargedatasets,usingGray-Lex provides nearly optimal recommendations. For k=3 and k=4, orderingsandGray-Frequency.Dimensionswereorderedfromthe the ordering “1234” is recommended for all data sets: for k=4 largest to the smallest (“4321”) except for Census-Income where and Census-Income, this recommendation is wrong. For k=2 weusedtheordering“3214”. WeobservedthatKJV-4gramsdid andCensus-Income,theordering“3214”isrecommended,another notbenefitinindexsizefork=2.Thisdatasethasmanyverylong wrongrecommendationforthisdataset. Hence, abettercolumn runsofidenticalattributevaluesinthefirsttwodimensions,andthe reordering heuristic is needed for k>1. The difficulty appears numberofattributevaluesismodest,comparedwiththenumberof to be fundamental: when we calculated the gain experimentally, rows.Thisisidealfor1-of-N. we found that the best orderings sometimes did not have the di- Gray-FrequencyyieldsthesmallestindexesinTable4.Frequent- mensionswithhighestgainfirst. Ourgreedyapproachmaybetoo Componentisnotshowninthetable.OnNetflixfork=1itoutper- simple,andititmaybenecessarytoknowthehistogramskews. formedtheotherapproachesby1%,andforDBGENitwasonly slightly worse than the others. But in all other case on DBGEN, Census-IncomeandNetflix,itleadtoindexes5–50%larger. 6.4 Sorting 6.5 Queries OnsomesyntheticZipfiantests,wefoundasmallimprovement (lessthan4%for2dimensions)byusingGray-Lexcodinginpref- Wetimedequalityqueriesagainstour4-dbitmapindexes, and erencetoAlpha-Lex[2,Fig.3].Onotherdatasets,Gray-Lexeither theresultsareshowninFig.6.Queriesweregeneratedbychoosing hadnoeffectorasmallpositiveeffect. Therefore,ourcurrentex- attributevaluesuniformlyatrandomandthefiguresreportaverage periments do not include Alpha-Lex, with the exception that we timesforsuchqueries. Wemade100randomchoicespercolumn experimentallyevaluatedhowsortingaffectstheEWAHcompres- for KJV-4grams when k>1. For DBGEN and Netflix, we had sionofindividualcolumns. Whereassortingtendstocreateruns 1,000randomchoicespercolumnand10,000randomchoiceswere ofidenticalvaluesinthefirstcolumns,thebenefitsofsortingare used for Census-Income and KJV-4grams (k=1). For each data farlessapparentinlatercolumns,exceptthose stronglycorrelated set, we give the results per column (leftmost tick is the column withthefirstfewcolumns.ForTable3,wehavesortedprojections usedastheprimarysortkey,nexttickisforthesecondarysortkey, ofCensus-IncomeandDBGENonto10dimensionsd1...d10with etc.). n1<...<n10.(Thedimensionsd1...d4inthisgrouparedifferent FromFig.6(b),weseethatsimplebitmapindexesalwaysyield fromthedimensionsd1...d4 discussedearlier.) Weseethatifwe the fastest queries. The difference caused by k is highly depen- sortfromthelargestcolumn(d10...d1),atmost3columnsbene- dent upon the data set and the particular column in the data set. fitfromthesort,whereas5ormorecolumnsbenefitwhensorting However, foragivendatasetandcolumn, withonlyafewsmall fromthesmallestcolumn(d1...d10). exceptions,querytimesincreasesignificantlywithk.ForDBGEN, the last two dimensions have size 7 and 11, whereas for Netflix, Lexicographicsorting. thelastdimensionhassize5,andtherefore,theywillneverusea Constructingasimplebitmapindex(usingGray-Lex)overKJV- k-valuelargerthan2:theirspeedismostlyoblivioustok. 4gramstookapproximately14,000secondsorlessthanfourhours. InSection5,wepredictedthatthequerytimewouldgrowwith Nearlyhalf(6,000s)ofthetimewasduetothesortutility,since kas≈(2−1/k)n−1/k:forthelargedimensionssuchasthelargest i thedatasetismuchlargerthanthemachine’smainmemory(2GiB). onesforDBGEN(400k)andNetflix(480k), querytimesaretwo Constructinganunsortedindexisfaster(approximately10,000s), ordersofmagnitudeslowerfork=2asopposedtok=1,andfour buttheindexisabout9timeslarger. ordersofmagnitudeslowerfork=4.Thus,ourmodelexaggerates Tostudyscaling,webuiltindexesfromprefixesofthefulldata thedifferencesbyaboutanorderofmagnitude.Themostplausible set.Wefoundconstructiontimesincreasedlinearlywithindexsize explanationisthatquerytimesarenotdirectlyproportionaltothe fork=1,whetherornotsortingwasused. For1≤k≤4,index bitmaploaded,butalsoincludeaconstantfactor. sizeincreasedlinearlywiththeprefixsizeforunsorteddata. Yet Fig.6(a)and6(b)showtheequalityquerytimespercolumnbe- withsorting,indexsizeincreasedsublinearly.Asnewdataarrives, foreandaftersortingthetables.Sortingimprovesquerytimesmost itisincreasinglylikelytofitintoexistingruns,oncesorted. forlargervaluesofk:forNetflix,sortingimprovedthequerytimes 650000 5.5e+07 5.5e+08 k=1 k=1 k=1 600000 k=2 5e+07 k=3 5e+08 k=3 k=3 k=4 k=4 550000 k=4 4.5e+07 4.5e+08 500000 4e+07 4e+08 450000 3.5e+07 3.5e+08 400000 3e+07 3e+08 350000 2.5e+07 2.5e+08 300000 250000 2e+07 2e+08 200000 1.5e+07 1.5e+08 150000 1e+07 1e+08 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 (a)Census-Income (b)DBGEN (c)Netflix Figure5:Indexsizes(words,yaxis)on4-ddatasetsforalldimensionorderings(xaxis). Table3: Numberof32-bitwordsusedfordifferentunaryindexeswhenthetablewassortedlexicographically(dimensionsordered bydescendingcardinality,d ...d ,orbyascendingcardinality,d ...d ). 10 1 1 10 Census-Income DBGEN cardinality unsorted d1...d10 d10...d1 cardinality unsorted d1...d10 d10...d1 d1 7 42427 32 42309 2 0.75×106 24 0.75×106 d2 8 36980 200 36521 3 1.11×106 38 1.11×106 d3 10 34257 1215 28975 7 2.58×106 150 2.78×106 d4 47 0.13×106 12118 0.13×106 9 0.37×106 1006 3.37×106 d 51 35203 17789 28803 11 4.11×106 10824 4.11×106 5 d 91 0.27×106 75065 0.25×106 50 13.60×106 0.44×106 1.42×106 6 d7 113 12199 9217 12178 2526 23.69×106 22.41×106 23.69×106 d8 132 20028 14062 19917 20000 24.00×106 24.00×106 22.12×106 d9 1240 29223 24313 28673 400000 24.84×106 24.84×106 19.14×106 d10 99800 0.50×106 0.48×106 0.30×106 984297 27.36×106 27.31×106 0.88×106 total - 1.11×106 0.64×106 0.87×106 - 0.122×109 0.099×109 0.079×109 byatmost2fork=1, atmost40fork=2andatmost140for 1000 nckoo=lulo3mn;nginsr,dukenxs=eos3fwiuidtsehunaktlil>cyalg1vebtaseluntehesfiet(bsfeersoetmSimusbopsrretoicnvtegiomenve4enn.t1sw)f.rho(eOmnntshtohererteinfiagrrs.e)t uality query 1 1000 kkkk====1234 SyoynndthiettsiclaDrgBeGfiErNstschooluwmend.nAosltihgonuifigchaNntestpfleixe,dulipkefrDomBGsoErNtin,gh,absea- B) per eq 0. 11 many-valuedcolumnfirst,itshowsabenefitfromsortingeveninits Mi ttmfhhasiaortnsd,tttchfhroeeolumsgemacitnnoh:neodinsnoctforhatl:ecutmwfi,rthnshi.teletTwthqhoiueredclroaicelrougslmeuosmnntstntahrbbaelneelng,aeeKsfistJtfcVsroo-ml4mugomrr2aen0mfartsrio,membetewssnoiefcrafietsittneaegdsr Size of bitmaps ( 0 .00 .000.000111 1e-05 (k=1)toalmost1500timesfaster(k=3). Census-Income DBGEN Netflix KJV-4grams Wecancomparethesetimeswiththeexpectedamountofdata Dimensions 1-4 for each data set scanned per query. This is shown in Figure 7, and we observe reasonablycloseagreementbetweenmostquerytimesandtheex- Figure7:Bitmapdataexaminedperequalityquery. pectedsizesofthebitmapsbeingscanned. Exceptionsincludethe 1,wecangainadditionalindexsizereductionatthecostoflonger firstdimensiononKJV-4gramsandsomecaseswherethebitmaps indexconstructionbyusingGray-FrequencyratherthanGray-Lex. are tiny. This discrepancy might be explained by the retrieval of Ifthetotalnumberofattributevaluesissmallrelativetothenum- therowIDsfromthecompressedbitmaps:longrunsof1x11clean berofrows,thenweshouldfirsttrythek=1index. Perhapsthe wordsmustbeconvertedtomanyrowIDs. data set resembles KJV-4grams. Besides yielding faster queries, thek=1indexmaybesmaller. 7. GUIDELINESFORK Ourexperimentsindicatethatsimple(k=1)bitmapencodingis 8. CONCLUSIONANDFUTUREWORK preferablewhenstoragespaceandindex-creationtimearelessim- portantthanfastequalityqueries. Thestorageandindex-creation Weshowedthatwhilesortingimprovesbitmapindexes,wecan penaltiesarekeptmodestbytablesortingandAlgorithm1. improvethemevenmore(30–40%)ifweknowthenumberofdis- Spacerequirementscanbereducedbychoosingk>1,although tinctvaluesineachcolumn.Fork-of-Nencodingswithk>1,even Tab.4showsthatthisapproachhasrisks(seeKJV-4grams).Fork> further gains (10–30%) are possible using the frequency of each uality query 1 10 kkkk====1234 uality query 1 10 kkkk====1234 nds) per eq 0 .00.11 nds) per eq 0 .00.11 me (seco 0.001 me (seco 0.001 Average ti 0 1.0e0-0015 Average ti 0 1.0e0-0015 Census-Income DBGEN Netflix KJV-4grams Census-Income DBGEN Netflix KJV-4grams Dimensions 1-4 for each data set Dimensions 1-4 for each data set (a)Querytimesoverunsorted indexes (b)Querytimesoversorted(Gray-Lex)indexes Figure6:Querytimesareaffectedbydimension,tablesortingandk. value.Regardingfuturework,theaccuratemathematicalmodelling [12] N.Koudas.Spaceefficientbitmapindexing.InCIKM’00, ofcompressedbitmapindexesremainsanopenproblem. pages194–201,2000. [13] Netflix,Inc.Nexflixprize.http://www.netflixprize.com Acknowledgements (checked2008-04-28),2007. [14] A.Pinar,T.Tao,andH.Ferhatosmanoglu.Compressing ThisworkissupportedbyNSERCgrants155967,261437andby bitmapindicesbydatareorganization.InICDE’05,pages FQRNTgrant112381. 310–321,2005. [15] M.F.Porter.Analgorithmforsuffixstripping.InReadings 9. REFERENCES ininformationretrieval,pages313–316.MorganKaufmann, [1] G.Antoshenkov.Byte-alignedbitmapcompression.InDCC 1997. ’95,page476,1995. [16] ProjectGutenbergLiteraryArchiveFoundation.Project [2] K.Aouiche,D.Lemire,andO.Kaser.Tridelatabledefaits Gutenberg.http://www.gutenberg.org/(checked etcompressiondesindexbitmapsavecalignementsurles 2007-05-30),2007. mots.availablefromhttp://arxiv.org/abs/0805.3339. [17] D.Rotem,K.Stockinger,andK.Wu.MinimizingI/Ocosts [3] L.Bellatreche,R.Missaoui,H.Necir,andH.Drias. ofmulti-dimensionalquerieswithbitmapindices.In Selectionandpruningalgorithmsforbitmapindexselection SSDBM’06,pages33–44,2006. problemusingdatamining.LNCS,4654:221,2007. [18] Y.SharmaandN.Goyal.Anefficientmulti-component [4] G.Canahuate,H.Ferhatosmanoglu,andA.Pinar.Improving indexingembeddedbitmapcompressionfordata bitmapindexcompressionbydatareorganization.http:// reorganization.InformationTechnologyJournal, hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf(checked 7(1):160–164,2008. 2008-05-30),2006. [19] K.Stockinger,K.Wu,andA.Shoshani.Strategiesfor [5] C.Y.ChanandY.E.Ioannidis.Bitmapindexdesignand processingadhocqueriesonlargedatawarehouses.In evaluation.InSIGMOD’98,pages355–366,1998. DOLAP’02,pages72–79,2002. [6] C.Y.ChanandY.E.Ioannidis.Anefficientbitmapencoding [20] K.Stockinger,K.Wu,andA.Shoshani.Evaluationstrategies schemeforselectionqueries.InSIGMOD’99,pages forbitmapindiceswithbinning.InDEXA2004,2004. 215–226,1999. [21] TPC.DBGEN2.4.0.http://www.tpc.org/tpch/ [7] R.Darira,K.C.Davis,andJ.Grommon-Litton.Heuristic (checked2007-12-4),2006. designofpropertymaps.InDOLAP’06,pages91–98,2006. [22] P.D.TurneyandM.L.Littman.Corpus-basedlearningof [8] K.DavisandA.Gupta.DataWarehousesandOLAP: analogiesandsemanticrelations.MachineLearning, Concepts,Architectures,andSolutions,chapterIndexingin 60(1–3):251–278,2005. DataWarehouses.IRMPress,2007. [23] H.K.T.Wong,H.F.Liu,F.Olken,D.Rotem,andL.Wong. [9] S.HettichandS.D.Bay.TheUCIKDDarchive. Bittransposedfiles.InVLDB85,pages448–457,1985. http://kdd.ics.uci.edu(checked2008-04-28),2000. [24] K.Wu,E.J.Otoo,andA.Shoshani.Aperformance [10] M.Hirabayashi.QDBM:Quickdatabasemanager. comparisonofbitmapindexes.InCIKM’01,pages http://qdbm.sourceforge.net/(checked2008-02-22), 559–561,2001. 2006. [25] K.Wu,E.J.Otoo,andA.Shoshani.Optimizingbitmap [11] O.Kaser,S.Keith,andD.Lemire.TheLitOLAPproject: indiceswithefficientcompression.ACMTransactionson Datawarehousingwithliterature.InCaSTA’06,2006. DatabaseSystems(TODS),31(1):1–38,2006.