ebook img

Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes PDF

0.26 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes Owen Kaser Daniel Lemire Kamel Aouiche Dept. ofCSAS LICEF,UniversitéduQuébec LICEF,UniversitéduQuébec UniversityofNewBrunswick àMontréal àMontréal 100TuckerParkRoad 100SherbrookeWest 100SherbrookeWest SaintJohn,NB,Canada Montreal,QC,Canada Montreal,QC,Canada [email protected] [email protected] [email protected] ABSTRACT 9 histogram-oblivious. They ignore the number of attribute values 0 Bitmap indexes must be compressed to reduce input/output costs andtheirfrequencies. Otherheuristicsarehistogram-aware. They 0 includecolumnreorganizationsandfrequency-awareordering. On andminimizeCPUusage. Toacceleratelogicaloperations(AND, 2 larger data sets [2], we had considered histogram-oblivious row- OR, XOR) over bitmaps, we use techniques based on run-length orderingheuristics. Sortingbeforeindexingreducedthetotalcon- n encoding(RLE),suchasWord-AlignedHybrid(WAH)compres- struction time. Our main contribution is an evaluation of practi- a sion.Thesetechniquesaresensitivetotheorderoftherows:asim- J ple lexicographical sort can divide the index size by 9 and make calhistogram-awareheuristicstotheroworderingproblem. Sec- ondarycontributionsincludeguidelinesaboutwhen“unary”bitmap 9 indexes several times faster. We investigate reordering heuristics encodingispreferred,andanimprovementoverthenaivebitmap 1 basedoncomputedattribute-valuehistograms. Simplypermuting constructionalgorithm—itisnowpracticaltoconstructbitmapin- thecolumnsofthetablebasedonthesehistogramscanincreasethe dexesovertableswithhundredsofmillionsofrowsandmillionsof ] sortingefficiencyby40%. B attributevalues. D CategoriesandSubjectDescriptors Tofurtherreducethesizeofbitmapindexes,wecanbintheat- tribute values [7,12,17,20]. For range queries, different bitmap s. H.3.2[InformationStorageandRetrieval]:InformationStorage; encodingshavedifferentspace-performancetradeoffs[5,6]. c E.1[Data]:DataStructures [ GeneralTerms 2. BITMAPINDEXES 3 v Wefindbitmapindexesinseveraldatabasesystems,apparently Algorithms,Performance,Experimentation. 3 beginning with the MODEL 204 engine, commercialized for the 8 IBM370in1972. 1. INTRODUCTION 0 Thesimplestandmostcommonmethodofbitmapindexingas- 2 Bitmapindexesareamongthemostcommonlyusedindexesin sociates a bitmap with every attribute value v of every attribute . datawarehouses[3,8]. Withoutcompression,bitmapindexescan a; the bitmap represents the predicate a=v. For a table with 8 beimpracticallylargeandslow.Word-AlignedHybrid(WAH)[25] nrows(facts)andccolumns(attributes/dimensions),eachbitmap 0 isacompetitivecompressiontechnique:comparedtoLZ77[5]and haslengthn. Initially,allbitmapvaluesaresetto0. Forrow j,we 8 0 Byte-AlignedBitmapCompression(BBC)[1],WAHindexescan setthe jth componentofcbitmapsto1. Iftheith attributehasni : betentimesfaster[24]. possiblevalues,wehaveL=∑ci=1nibitmaps. v Run-lengthencoding(RLE)andsimilarencodingschemes(BBC Bitmap indexes are fast, because we find rows having a given Xi andWAH)makeitpossibletocomputelogicaloperationsbetween valuevforattributeabyreadingonlythebitmapcorrespondingto bitmapsintimeproportionaltothecompressedsizeofthebitmaps. valuev(andnottheotherbitmapsforattributea),andthereisonly ar However,theirefficiencydependsontheorderoftherows. While onebit(orless,withcompression)toprocessforeachrow. More computingthebestorderingisNP-hard[2],simpleheuristicssuch complexqueriesareachievedwithlogicaloperations(AND,OR, aslexicographicalsortareeffective. XOR,NOT)overbitmapsandcurrentmicroprocessorcanperform Pinaretal.[14],SharmaandGoyal[18],andCanahuateetal.[4] 32or64bitwiseoperationsinasinglemachineinstruction. usedGray-coderowsortingtoimproveRLEandWAHcompres- Forrow j,exactlyonebitmappercolumnwillhaveits jthentry sion. However,theirlargestbitmapindexcouldfituncompressed setto1.AlthoughtheentireindexhasnLbits,thereareonlync1’s; inRAMonaPC. formanytables,L(cid:29)candthusonaveragethetableisverysparse. Wedistinguishtwotypesofheuristicsforthisproblem. Heuris- Long(hencecompressible)runsof0’sareexpected. ticssuchaslexicographicalsort[2]orGray-codesorting[14]are Onecanalsoreducethenumberofbitmapsforlargedimensions. GivenLbitmaps,thereareL(L−1)/2pairsofbitmaps.So,instead Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor ofmappinganattributevaluetoasinglebitmap,wemapthemto personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare pairsofbitmaps(seeTable1). Werefertothistechniqueas2-of- notmadeordistributedforprofitorcommercialadvantageandthatcopies N encoding [23]; with it, we can use far fewer bitmaps for large bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to dimensions. For instance, with only 2,000 bitmaps, we can rep- republish,topostonserversortoredistributetolists,requirespriorspecific resentanattributewith2milliondistinctvalues. Buttheaverage permissionand/orafee. bitmapdensityismuchhigherwith2-of-Nencoding,andthuscom- DOLAP’08,October30,2008,NapaValley,California,USA. Copyright2008ACM978-1-60558-250-4/08/10...$5.00. pression may be less effective. More generally, k-of-N encoding Table1:Exampleof1-of-Nand2-of-Nencoding Montreal 100000000000000 110000 Paris 010000000000000 101000 Toronto 001000000000000 100100 NewYork 000100000000000 011000 Berlin 000010000000000 010100 allows L bitmaps to represent (cid:0)L(cid:1) distinct values; conversely, us- k ingL=(cid:100)kn1i/k(cid:101)bitmapsissufficienttorepresentnidistinctvalues. However,searchingforaspecifiedvaluevnolongerrequiresscan- ningasinglebitmap.Instead,thecorrespondingkbitmapsmustbe combinedwithabitwiseAND. Thereisatradeoffbetweenindex Figure1:EnhancedWord-AlignedHybrid(EWAH) sizeandtheindexspeed[2]. Forsmalldimensions,usingk-of-Nencodingmayfailtoreduce Algorithm1Constructingbitmaps. Forsimplicity,weassumethe thenumberofbitmaps,butstillreducetheperformance.Weapply numberofrowsismultipleofthewordsize. thefollowingheuristic.Anycolumnwithlessthan5distinctvalues Construct:B1,...,BL,Lcompressedbitmaps islimitedto1-of-N encoding(simpleorunarybitmap). Anycol- ωi←0for1≤i≤L. umnwithlessthan21distinctvalues,islimitedtok=1,2,andany c←1{rowcounter} columnwithlessthan85distinctvaluesislimitedtok=1,2,3. N ←0/{N recordsthedirtiedbitmaps} foreachtablerowdo foreachattributeintherowdo 3. COMPRESSION foreachbitmapicorrespondingtotheattributevaluedo RLE compresses efficiently when there are long runs of iden- settotruethe(cmodw)thbitofwordωi N ←N ∪{i} tical values: it works by replacing any repetition by the number ifcisamultipleofwthen ofrepetitionsfollowedbythevaluebeingrepeated. Forexample, foriinN do the sequence 11110000 becomes 4140. Current microprocessors addc/w−|Bi|−1cleanwords(0x00)toBi performoperationsoverwordsof32or64bitsandnotindividual addthewordωitobitmapBi bits. Hence, theCPUcostofRLEmightbelarge[19]. Bytrad- ωi←0 ingsomecompressionformorespeed,Antoshenkov[1]defineda N ←0/ c←c+1 RLEvariantworkingoverbytesinsteadofbits: theByte-Aligned foriin{1,2,...,L}do BitmapCompression(BBC).Tradingevenmorecompressionfor addc/w−|Bi|−1cleanwords(0x00)toBi evenmorespeed,Wuetal.[25]proposedtheWord-AlignedHybrid (WAH). Their scheme is made of two different types of words1. Naively,wecouldcomputelogicaloperationsbetween2bitmaps Thefirstbitofeveryworddistinguishesaverbatim(ordirty)31- inn/32bitwiseoperations. Instead,wecomputelogicaloperations bitwordfromarunningsequenceof31-bitcleanwords(0x00or (OR,AND,XOR)between2bitmapsintimeO(|B1|+|B2|)where 1x11). Runningsequencesarestoredusing1bittodistinguishbe- |Bi|isthesizeofthecompressedbitmap[2,25]. Finally, wecan tweenthetypeofword(0for0x00and1for1x11)and30bitsto boundthebitmapsizes:|ViBi|≤mini|Bi|and|WiBi|≤∑i|Bi|. representthenumberofconsecutivecleanwords.Hence,abitmap oflength62containingasingle1-bitatposition32wouldbecoded 4. SORTINGTOIMPROVECOMPRESSION asthewords100x01and010x00. Becausedirtywordsarestored inunitsof31bitsusing32bits,WAHcompressioncanexpandthe Sortingcanbenefitbitmapindexesatseverallevels.Wecansort data by 3%. We created our own WAH variant called Enhanced therowsofthetable. Thesortingorderdependsitselfontheorder Word-Aligned Hybrid (EWAH). Contrary to WAH compression, ofthetablecolumns. Andfinally, wecanallocatethebitmapsto EWAH may never (within 0.1%) generate a compressed bitmap theattributevaluesinsortedorder. largerthantheuncompressedbitmap. Italsousesonlytwotypes 4.1 Sortingrows ofwords(seeFig.1).Thefirsttypeisa32-bitverbatimword.The secondtypeofwordisamarkerword: thefirstbitisusedtoindi- Reorderingtherowsofacompressedbitmapindexcanimprove catewhichcleanwordwillfollow, 16bitstostorethenumberof compression. Whether using RLE, BBC, WAH or EWAH, the cleanwords,and15bitstostorethenumberofdirtywordsfollow- problemisNP-hardbyreductionfromtheHamiltonianpathprob- ingthecleanwords.EWAHbitmapsbeginwithamarkerword. lem[2,Theorems1and2]. Asimpleheuristicbeginswithanun- GivenLbitmapsandnrows,wecannaivelyconstructabitmap compressed index. Rows (binary vectors) are then rearranged to index in time O(nL) by appending a word to each compressed promote runs. In the process, we may also reorder the bitmaps. bitmapevery32or64rows. Wefoundthisapproachimpractically ThisistheapproachofCanahuateetal.[4],butitusesΩ(nL)time. slow when L was large—typically, with k=1. Instead, we con- Forthelargedimensionsandnumberofrowswehaveconsidered, struct bitmap indexes in time O(nck+L)=O(nck) [2] where ck itisinfeasible.Amorepracticalapproach[2]istoreorderthetable, is the number of true values per row (See Algorithm 1): within thenconstructthecompressedindexdirectly;wecanalsoreorder each block of 32 rows, we store the values of the bitmaps in a thetablecolumnspriortosorting. set—omitting any unsolicited bitmap, whose values are all false Threetypesoforderingcanbeusedfororderingrows. Wemay (0x00).Wepartitionthetablehorizontallyintoblocksindexedwith clusteridenticalrows,butitisnotacompetitiveheuristic[2]. compressedbitmapsusingafixedmemorybudget(256MiB).Each blockofbitmapsiswrittensequentiallyandprecededbyanarray • Inlexicographicorder,asequencea1,a2,...issmallerthan of4-byteintegerscontainingthelocationofeachbitmap. anothersequenceb1,b2,...ifandonlyifthereisa jsuchthat aj<bjandai=bifori< j.TheUnixsortcommandpro- 1Forsimplicity,welimitourexpositionto32bitwords. videsanefficientmeanofsortingflatfilesintolexicographic order;inunder10sourtestcomputer(seeSection6)sorteda 1 5-million-line,120MBfile.SQLsupportslexicographicsort viaORDERBY. 0.8 y • Gsasthepnreqadaruryesae-eniecxoc=GeidsCebtasi(1bG,fjioatCsr2u)v,ci.esh<.co.trtohtijirsa.nstgs2Avmi1alsagjladol=neerirdfitahnt1vhme2a⊕dn2ioanbv2s1eht⊕,riombwb.2e.i,st..⊕Ovh.e.o(acwmitfjo−iatrn1osn(,|d[cvb1o1oj4m|n,](cid:54)=:|lpvytaa2hr|ijee)f, dirtiness probabilit 000...246 reflected GC where|vi|isthenumberoftruevalueinbitvectorvi. Sort- binary Savage-Winkler GC ingtherowsofabitmapindexwithoutmaterializingtheun- random 0 compressed bitmap index is possible [2]: we implemented 5 10 15 20 25 30 an O(ncklogn)-time solution for k-of-N indexes using an distinct items in block of 32 external-memoryB-tree[10]. Unfortunately,itprovedtobe twoordersofmagnitudeslowerthanlexicographicsort. Figure2:Probabilitiesthatabitmapwillcontainadirtyword, whenseveral(x-axis)of1000possibleattributevaluesarefound in a 32-row chunk. Effects are shown for values with k-of-N Algorithm2Gray-codelesscomparatorbetweensparsebitvectors codesthatareadjacentinGCorder,adjacentinlexicographic INPUT:arraysaandbrepresentingthepositionoftheonesin order,orrandomlyselected. twobitvectors k>2, where bitmaps are denser. Selecting the codes randomly OUTPUT:whetherthebitvectorrepresentedbyaislessthan isdisastrous. Hence,sortingpartofacolumn—evenonewithout theonerepresentedbyb longrunsofidenticalvalues—improvescompressionfork>1. f ←true ForencodingslikeBBC,WAHorEWAH,GCsortingisnotopti- m←min(length(a),length(b)) mal,evenwhenallk-of-Ncodesarepresent.Forexampleconsider forpin1,2,...,mdo thesequenceofrows1001, 1010, 1100,0101, 0101,0110, 0110, return f ifap>bpand¬f ifap<bp 0011. Using 4-bit words, we see that a single bitmap contains a f ←¬f cleanword(0000)whereasbyexchangingthefifthandsecondrow, return¬f iflength(a)>length(b), f iflength(b)>length(a), wegettwocleanwords(0000and1111). andfalseotherwise 4.2 Sortingbitmapcodes ForRLE,thebestorderingoftherowsofabitmapindexmin- For a simple index, the map from attribute value to bitmaps is imizesthesumoftheHammingdistances: ∑ih(ri,ri+1)whereri inconsequential;fork-of-Nencodings,somebitmapallocationsare is the ith row, for h(x,y)=|{i|xi (cid:54)=yi}|. If all 2L different rows morecompressible:consideranattributewithtwooverwhelmingly arepresent,theGCsortwouldbeanoptimalsolutiontothisprob- frequent values and many other values that occur once each. If lem [14]. The following proposition shows that GC sort is also optimalifall(cid:0)N(cid:1)k-of-Ncodesarepresent.Thesameisnottrueof thetablerowsaregiveninrandomorder,thetwofrequentvalues k shouldhavecodesthatdifferaslittleaspossible. lexicographicorderwhenk>1: 0110immediatelyfollows1001 There are several ways to allocate the bitmaps. Firstly, the at- among2-of-4codes,buttheirHammingdistanceis4. tributevaluescanbevisitedinalphabeticalornumericalorder,or— for histogram-aware schemes—in order of frequency. Secondly, Proposition1 Wecanenumerate,inGCorder,allk-of-Ncodesin thebitmapcodescanbeusedindifferentorders. Weconsiderlex- timeO(k(cid:0)Nk(cid:1))(optimalcomplexity). Moreover, theHammingdis- icographicalordering(1100,1010,1001,0110,...) andGCorder tancebetweensuccessivecodesisminimal(=2). (1001,1010,1100,0101,...)ordering(seeproofofProposition1). Fordenselow-dimensionaltables, GCorderispreferable[2]and PROOF. Letabeanarrayofsizek indicatingthepositionsof itscompressioneffectsarecomparabletosortingtheindexrowsin theonesink-of-N codes. Astheexternalloop,varythevaluea1 GCorder. Meanwhile, itistechnicallyeasiertoimplementsince from1toN−k+1. Withinthisloop,varythevaluea2 fromN− wecansortthetablelexicographicallyandonlyuseGCordering k+2 down to a1+1. Inside this second loop, vary the value of duringthebitmapindexconstruction. a3 froma2+1uptoN−k+3,andsoon. Byinspection,wesee Alpha-Lex denotes sorting the table lexicographically and as- thatallpossiblecodesaregeneratedinincreasingGCorder.Tosee signingbitmapcodessothattheithattributegetsthelexicographi- thattheHammingdistancebetweensuccessivecodesis2,consider callyithsmallestbitmapcode. Gray-Lexissimilar,exceptthatthe whathappenswhenaicompletesaloop. Supposethatiisoddand ithattributegetstherank-ibitmapcodeinGCorder.Thesetwoap- greater than 1, then ai had value N−k+i and it will take value proachesarehistogramoblivious—theyignorethefrequenciesof ai−1+1.Meanwhile,byconstruction,ai+1(ifitexists)remainsat attributevalues. svoaloune.NT−heka+rgiu+m1enwthiserseimasilaair+i2frieimsaevinesna.tv(cid:50)alueN−k+i+2and Knowingthefrequencyofeachattributevaluecanimprovecode assignment when k>1. For instance, clustering dirty words in- creasesthecompressibility.Withinacolumn,Alpha-LexandGray- Foragivencolumn,supposethatinablockof32rows,wehave Lexorderrunsofidenticalvaluesirrespectiveofthefrequency:the j distinct attribute values. We computed the average number of sequence afcccadeaceabe may become aaaabccccdeeef. For bitmapsthatwouldhaveadirtyword(seeFig.2). Comparingk- better compression, we should order the attribute values—within of-NcodesthatwereadjacentinGCorderingagainstk-of-Ncodes a column—by their frequency (e.g., aaaacccceeebdf). Allocat- thatwerelexicographicallyadjacent, thedifferencewasinsignifi- ingthebitmapcodesinGCordertothefrequency-sortedattribute cant for k=2. However, GC ordering is substantially better for values, our Gray-Frequency sorts the table rows as follows. Let 2Thesymbol⊕istheXORoperator. f(ai) be the frequency of attribute ai. Instead of sorting the ta- n/32 550000 300000 k=1 k=2 500000 k=3 250000 450000 400000 200000 k=1 350000 k=2 150000 300000 k=3 k=4 250000 100000 200000 50000 150000 100000 0 0 5000 10000 15000 20000 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 (a) Uniform histograms with cardinalities 200, 400, Figure 3: Storage gain in words for sorting a given column 600,800 with 100,000 rows and various number of attribute values (2δ(kn,(cid:100)kni1/k(cid:101),n)−4ni). 240000 kk==12 220000 k=3 blerowsa1,a2,...,ad,welexicographicallysorttheextendedrows 200000 k=4 f(a1),a1,f(a2),a2,...,f(ad),ad bycomparingthefrequenciesby 180000 theirnumericalvalue.Thefrequencies f(ai)arediscardedpriorto 160000 indexing. 140000 120000 4.3 Choosingthecolumnorder 100000 Lexicographic table sorting uses the ith column as the ith sort 80000 key: it uses the first column as the main key, the second column 60000 tobreaktieswhentworowshavethesamefirstcomponent,andso 40000 on.Somecolumnorderingsleadtosmallerindexesthanothers[2]. 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 Wemodelthestoragecostofabitmapindexasthesumofthe (b)Zipfiandatawithskewparameters1.6,1.2,0.8and number of dirty words and the number of sequences of identical 0.4 cleanwords(1x11or0x00).IfasetofLbitmapshasxdirtywords, then there are at most L+x sequences of clean words; the stor- Figure4:Indexsizesinwordsforvariousdimensionorderson agecostisatmost2x+L. Thisboundwillbetighterforsparser synthetic data (100,000 rows). Zipfian columns have 100 dis- bitmaps.Becausethesimpleindexofacolumnhasatmostn1-bits, tinctvalues.Ordering“1234”indicatesorderingbydescending ithasatmostndirtywords,andthus,thestoragecostisatmost3n. skew(Zipfian)orascendingcardinality(uniform). Thenextpropositionshowsthatthestoragecostofasortedcolumn isboundedby5ni. sionsofdifferentsizes(seeFig.4(a))andusingsame-sizedimen- Proposition2 UsingGC-sortedk-of-Lcodes,asortedcolumnwith sions of different skew (see Fig. 4(b)). We then determined the nidistinctvalueshasnomorethan2nidirtywords,andthestorage Gray-Lexindexsizeforeachofthe4! differentdimensionorder- costisnomorethan4ni+(cid:100)kn1i/k(cid:101). isnhgosu.ldBabseeodrodnertehdesferormesuleltass,tftoormspoasrtseskienwdeexde,sa(nkd=fr1o)m,dsimmaelnlessiotntos Fork=1, Proposition2istrueirrespectiveoftheorderofthe largest;whereastheoppositeistruefork>1. values,aslongasidenticalvaluesappearsequentially.Anotherex- Asensibleheuristicmightbetosortcolumnsbyincreasingden- treme is to assume that all 1-bits are randomly distributed. Then sity (≈n−1/k). However, a very sparse column (n1/k (cid:29)w) will i i sparsebitmapindexeshave≈δ(r,L,n)=(1−(1− r )w)Ln dirty notbenefitfromsorting(seeFig.3)andshouldbeputlast.Hence, Ln w wordswhereristhenumberof1-bits,Listhenumberofbitmaps we use the following heuristic: columns are sorted in decreasing and w is the word length (w=32). Hence, we have an approxi- orderwithrespecttomin(n−1/k,(1−n−1/k)/(4w−1)): thisfunc- mate storage cost of 2δ+(cid:100)kn1i/k(cid:101). The gain of column C is the tion is maximum at densityi n−1/k =1i/(4w) and it goes down to difference between the expected storage cost of a randomly row- i zeroasthedensitygoesto1. InFig.4(a),thisheuristicmakesthe shuffledC,minusthestoragecostofasortedC. Weestimatethe bestchoiceforallvaluesofk.Weconsiderthisheuristicfurtherin gainby2δ(kn,(cid:100)kn1i/k(cid:101),n)−4ni (seeFig.3)forcolumnswithuni- Section6.3. formhistograms. Thegainismodal: itincreasesuntilamaximum isreachedandthenitdecreases. Themaximumgainisreachedat ≈(n(w−1)/2)k/(k+1):forn=100,000andw=32,themaximum 4.4 Avoidingcolumnorder isreachedat≈1,200fork=1andat≈13,400fork=2.Skewed As an alternative to lexicographic sort and column reordering, histogramshavealessergainforafixedcardinalityni. weintroduceFrequent-Componentsorting,whichuseshistograms Afterlexicographicsorting,theithcolumnisdividedintoatmost tohelpsortwithoutbiasfromafixeddimensionordering. Insort- n1n2···ni−1 sorted blocks. Hence, it has at most 2n1···ni dirty ing, we compare the frequency of the ith most frequent attribute words.Whenthedistributionsareskewed,theithcolumnwillhave valuesineachoftworowswithoutregard(exceptforpossibletie- blocksofdifferentlengthsandtheirorderingdependsonhowthe breaking)towhichcolumnstheycomefrom.Withappropriatepre- columns are ordered. To assess these effects, we generated data andpost-processing,itispossibletoimplementthisapproachusing with 4 independent columns: using uniformly distributed dimen- astandardsortingutilitysuchasUnixsort. 5. PICKINGTHERIGHTK-OF-N Table2:Characteristicsofdatasetsused. ChoosingkandN areimportantdecisions. Wechooseasingle rows cols ∑ini size k value for all dimensions3, leaving the possibility of varying k Census-Income 199523 42 103419 99.1MB bydimensionasfuturework. Largervaluesofktypicallyleadto 4-dprojection 199523 4 102609 2.96MB asmallerindexandafasterconstructiontime—althoughwehave DBGEN 13977980 16 4411936 1.5GB observedcaseswherek=2makesalargerindex. However,query 4-dprojection 13977980 4 402544 297MB timesincreasewithk:thereisaconstructiontime/speedtradeoff. Netflix 100480507 4 500146 2.61GB KJV-4grams 877020839 4 33553 21.6GB Largerkmakesqueriesslower. Wecanboundtheadditionalcostofqueries. Assume(cid:0)Lki(cid:1)=ni. icnoduelxdebseaarveomiduecdhifbwigegearllothwaend1l-aorfg-eNrNin,dbeexceasu.seThoenoerceotiucladllya,wtahyiss Agivenk-of-LibitmapistheresultofanORoperationoveratmost kni/Li≤3ni(k−1)/k unary bitmaps. Because |WiBi|≤∑i|Bi|, the awpopuelnddcarenaateddointieonmaolr1et(ocleevaenr)ybaitttmribaupteth’san1-tohfe-N1-coofd-Ne.iSnidnecxehthaiss, expectedsizeofsuchabitmapisnolargerthan3n(k−1)/ktimesthe this2-of-N indexwouldneverbemuchlargerthanthe1-of-N in- i expectedsizeofaunarybitmap. Aquerylookingforoneattribute dex. So, if N is unconstrained, we can see that there is never a value will have to AND together k of these denser bitmaps. The significantspaceadvantagetochoosingksmall. entireANDingoperationcanbedonebyk−1pairwiseANDsthat Nevertheless,themainadvantageofk>1isfewerbitmaps.We produce intermediate results whose EWAH sizes are increasingly chooseNassmallaspossible. small:2k−1bitmapsarethusprocessed.Hence,theexpectedtime complexity of an equality query on a dimension of size ni is no 6. EXPERIMENTALRESULTS k−1 morethan3(2k−1)nik timeshigherthantheexpectedcostofthe We present experiments to assess the effects of various factors samequeryonak=1index. (choicesofk,sortingapproaches,dimensionorderings)intermsof Foralesspessimisticestimateofthisdependence,observethat EWAH index sizes. These factors also affect index creation and indexesseldomincreaseinsizewhenkgrows. Wemayconserva- querytimes(wereportrealwall-clocktimes). tivelyassumethatindexsizeisunchangedwhenkchanges.There- foretheexpectedsizeofonebitmapgrowsas≈n−1/k/k,leading 6.1 Platform i toquerieswhosecostisproportionalto(2−1/k)n−1/k. Relative Ourtestprograms4 werewritteninC++andcompiledbyGNU i GCC4.0.2onanAppleMacProwithtwodouble-coreIntelXeon tothecostfork=1,whichisproportionalto1/ni,wecansaythat processors (2.66GHz) and 2GiB of RAM. Lexicographic sorts increasingkleadstoqueriesthatare(2−1/k)n(ik−1)/k timesmore of flat files were done using GNU coreutils sort version 6.9. For expensivethanonasimplebitmapindex. all tests involving k=1, we used the sparse implementation ap- Forexample,supposeni=100,goingfromk=1tok=2should proachedinSection3becausewithoutit,theGray-Lexindexcre- increasequerycostabout15foldbutnomorethan90fold.Insum- ationtimeswere20–100timesslower,dependingonthedataset. mary,themovefromk=1toanythinglargercanhaveadramatic negativeeffectonqueryspeeds. Onceweareatk=2,theincre- 6.2 Datasetsused mental cost of going to k=3, k=4 is not so high: whereas the Weprimarilyusedfourdatasets,whosedetailsaresummarized ratiok=2/k=1goesas√ni,theratiok=3/k=2goesasn1i/6. in Table 2: Census-Income [9], DBGEN [21], KJV-4grams, and Netflix[13]. DBGENisasyntheticdataset,whereasKJV-4grams Largerkmakesindexessmaller. isalargelist(includingduplicates)of4-tuplesofwordsobtained Considertheeffectofalength100runofvaluesv1,followedby fromtheversesintheKingJamesBible[16],afterstemmingwith 100repetitionsofv2,then100ofv3,etc.Regardlessofk,whenever thePorteralgorithm[15]andremovalofstemmedwordswiththree weswitchfromv1 tovi+1 atleasttwobitmapswillhavetomake orfewerletters.Occurrenceofroww1,w2,w3,w4indicatesthatthe transitions between 0 and 1. Thus, unless the transition appears firstparagraphofaversecontainswordsw1throughw4,inthisor- atawordboundary,wecreateatleast2dirtywordswheneveran der. Thisdataisascaled-upversionofwordco-occurrencecubes attribute changes from row to row. The best case, where only 2 usedtostudyanalogiesinnaturallanguage[11,22]. EachofKJV- dirtywordsarecreated,isachievedwhenk=1foranyassignment 4grams’ columns contains roughly 8 thousand distinct stemmed of bitmap codes to attribute values. For k>1 and N as small as words.TheNetflixtablehas4dimensions:UserID,MovieID,Date possible,itmaynotbepossibletoachievesofewdirtywords,orit andRating,havingcardinalities5,2182,17770,and480189.De- mayrequireaparticularassignmentofbitmapcodestovalues. tails of how it was obtained from the data downloaded are given Encodingswithk>1findtheirusewhenmany(e.g.15)attribute elsewhere[2]. values fall within a word-length boundary. In that case, a k=1 For some of our tests, we chose four dimensions with a wide index will have at least 15 bitmaps with transitions (and we can rangeofsizes. ForCensus-Income, wechoseage(d1), wageper anticipate15dirtywords).However,iftherewereonly45possible hour(d2),dividendsfromstocks(d3)andanumericalvalue5found valuesinthedimension,wewouldnotneedmorethan10bitmaps in the 25th position (d4). Their respective cardinalities were 91, with k=2. Hence, there would be at most 10 dirty words and 1240,1478and99800. ForDBGEN,weselecteddimensionsof maybelessifwehavesortedthedata(seeFig.2). cardinality7,11,2526and400000.Dimensionsarenumberedby increasingsize:column1hasfewerdistinctvalues. ChoosingN. Itseemsintuitive,havingchosenk,tochooseNtobeassmallas possible. Yet,wehaveobservedcaseswheretheresulting2-of-N 4http://code.google.com/p/lemurbitmapindex/. 3Exceptthatforcolumnswithsmallni,weautomaticallyadjustk 5Theassociatedmetadatasaysthiscolumnshouldbea10-valued downwardwhenitexceedsthelimitsnotedattheendofSection2. migrationcode. 6.3 ColumnOrdering Table4:SizesofEWAHindexes(32-bitwords)forvarioussort- Fig. 5 shows the Gray-Lex index sizes for each column order- ingmethods. ing.ThedimensionsofKJV-4gramsaretoosimilarfororderingto Lexunsorted Gray-Lex Gray-Freq beinteresting, andwehavethusomittedthem. Forsmalldimen- Census-Income k=1 8.49×105 4.87×105 4.87×105 sions, thevalueofk wasloweredusingtheheuristicpresentedin (4d) 2 9.12×105 4.52×105 4.36×105 Section2. Ourresultssuggestthattable-columnreorderinghasa 3 6.90×105 3.73×105 3.28×105 significanteffect(40%). Thisdoesnotcontradicttheobservation 4 4.58×105 2.17×105 1.98×105 byCanahuateet.al[4]thatbitmapreorderingdoesnotchangethe DBGEN 1 5.48×107 3.38×107 3.38×107 sizemuch. (4d) 2 7.13×107 2.76×107 2.74×107 Thevalueofkaffectswhichorderingleadstothesmallestindex: 3 5.25×107 1.50×107 1.50×107 good orderings for k=1 are frequently bad orderings for k>1, 4 3.24×107 1.21×107 1.19×107 Netflix 1 6.20×108 3.22×108 3.19×108 and vice versa. This is consistent with our earlier analysis (see 2 8.27×108 3.17×108 2.43×108 Figs.3and4).ForNetflixandDBGEN,wehaveomittedk=2for 3 5.73×108 1.97×108 1.49×108 legibility:itisinferiortok=1formostorderings. 4 3.42×108 1.37×108 1.14×108 Census-Income’slargestdimensionisverylarge(n4≈n/2);DB- KJV-4grams 1 6.08×109 6.68×108 6.68×108 GENhasalsoalargedimension(n4≈n/35). Sortingcolumnsin 2 8.02×109 9.93×108 7.29×108 decreasingorderwithrespecttomin(n−1/k,(1−n−1/k)/(4w−1)) 3 4.13×109 8.31×108 5.77×108 i i for k=1, we have that only for DBGEN the ordering “2134” is 4 2.52×109 6.39×108 5.01×108 suggested,otherwise,“1234”isrecommended. Thustheheuristic Table4showsindexsizesforourlargedatasets,usingGray-Lex provides nearly optimal recommendations. For k=3 and k=4, orderingsandGray-Frequency.Dimensionswereorderedfromthe the ordering “1234” is recommended for all data sets: for k=4 largest to the smallest (“4321”) except for Census-Income where and Census-Income, this recommendation is wrong. For k=2 weusedtheordering“3214”. WeobservedthatKJV-4gramsdid andCensus-Income,theordering“3214”isrecommended,another notbenefitinindexsizefork=2.Thisdatasethasmanyverylong wrongrecommendationforthisdataset. Hence, abettercolumn runsofidenticalattributevaluesinthefirsttwodimensions,andthe reordering heuristic is needed for k>1. The difficulty appears numberofattributevaluesismodest,comparedwiththenumberof to be fundamental: when we calculated the gain experimentally, rows.Thisisidealfor1-of-N. we found that the best orderings sometimes did not have the di- Gray-FrequencyyieldsthesmallestindexesinTable4.Frequent- mensionswithhighestgainfirst. Ourgreedyapproachmaybetoo Componentisnotshowninthetable.OnNetflixfork=1itoutper- simple,andititmaybenecessarytoknowthehistogramskews. formedtheotherapproachesby1%,andforDBGENitwasonly slightly worse than the others. But in all other case on DBGEN, Census-IncomeandNetflix,itleadtoindexes5–50%larger. 6.4 Sorting 6.5 Queries OnsomesyntheticZipfiantests,wefoundasmallimprovement (lessthan4%for2dimensions)byusingGray-Lexcodinginpref- Wetimedequalityqueriesagainstour4-dbitmapindexes, and erencetoAlpha-Lex[2,Fig.3].Onotherdatasets,Gray-Lexeither theresultsareshowninFig.6.Queriesweregeneratedbychoosing hadnoeffectorasmallpositiveeffect. Therefore,ourcurrentex- attributevaluesuniformlyatrandomandthefiguresreportaverage periments do not include Alpha-Lex, with the exception that we timesforsuchqueries. Wemade100randomchoicespercolumn experimentallyevaluatedhowsortingaffectstheEWAHcompres- for KJV-4grams when k>1. For DBGEN and Netflix, we had sionofindividualcolumns. Whereassortingtendstocreateruns 1,000randomchoicespercolumnand10,000randomchoiceswere ofidenticalvaluesinthefirstcolumns,thebenefitsofsortingare used for Census-Income and KJV-4grams (k=1). For each data farlessapparentinlatercolumns,exceptthose stronglycorrelated set, we give the results per column (leftmost tick is the column withthefirstfewcolumns.ForTable3,wehavesortedprojections usedastheprimarysortkey,nexttickisforthesecondarysortkey, ofCensus-IncomeandDBGENonto10dimensionsd1...d10with etc.). n1<...<n10.(Thedimensionsd1...d4inthisgrouparedifferent FromFig.6(b),weseethatsimplebitmapindexesalwaysyield fromthedimensionsd1...d4 discussedearlier.) Weseethatifwe the fastest queries. The difference caused by k is highly depen- sortfromthelargestcolumn(d10...d1),atmost3columnsbene- dent upon the data set and the particular column in the data set. fitfromthesort,whereas5ormorecolumnsbenefitwhensorting However, foragivendatasetandcolumn, withonlyafewsmall fromthesmallestcolumn(d1...d10). exceptions,querytimesincreasesignificantlywithk.ForDBGEN, the last two dimensions have size 7 and 11, whereas for Netflix, Lexicographicsorting. thelastdimensionhassize5,andtherefore,theywillneverusea Constructingasimplebitmapindex(usingGray-Lex)overKJV- k-valuelargerthan2:theirspeedismostlyoblivioustok. 4gramstookapproximately14,000secondsorlessthanfourhours. InSection5,wepredictedthatthequerytimewouldgrowwith Nearlyhalf(6,000s)ofthetimewasduetothesortutility,since kas≈(2−1/k)n−1/k:forthelargedimensionssuchasthelargest i thedatasetismuchlargerthanthemachine’smainmemory(2GiB). onesforDBGEN(400k)andNetflix(480k), querytimesaretwo Constructinganunsortedindexisfaster(approximately10,000s), ordersofmagnitudeslowerfork=2asopposedtok=1,andfour buttheindexisabout9timeslarger. ordersofmagnitudeslowerfork=4.Thus,ourmodelexaggerates Tostudyscaling,webuiltindexesfromprefixesofthefulldata thedifferencesbyaboutanorderofmagnitude.Themostplausible set.Wefoundconstructiontimesincreasedlinearlywithindexsize explanationisthatquerytimesarenotdirectlyproportionaltothe fork=1,whetherornotsortingwasused. For1≤k≤4,index bitmaploaded,butalsoincludeaconstantfactor. sizeincreasedlinearlywiththeprefixsizeforunsorteddata. Yet Fig.6(a)and6(b)showtheequalityquerytimespercolumnbe- withsorting,indexsizeincreasedsublinearly.Asnewdataarrives, foreandaftersortingthetables.Sortingimprovesquerytimesmost itisincreasinglylikelytofitintoexistingruns,oncesorted. forlargervaluesofk:forNetflix,sortingimprovedthequerytimes 650000 5.5e+07 5.5e+08 k=1 k=1 k=1 600000 k=2 5e+07 k=3 5e+08 k=3 k=3 k=4 k=4 550000 k=4 4.5e+07 4.5e+08 500000 4e+07 4e+08 450000 3.5e+07 3.5e+08 400000 3e+07 3e+08 350000 2.5e+07 2.5e+08 300000 250000 2e+07 2e+08 200000 1.5e+07 1.5e+08 150000 1e+07 1e+08 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 432143124231421341324123342134123241321431423124243124132341231421432134143214231342132412431234 (a)Census-Income (b)DBGEN (c)Netflix Figure5:Indexsizes(words,yaxis)on4-ddatasetsforalldimensionorderings(xaxis). Table3: Numberof32-bitwordsusedfordifferentunaryindexeswhenthetablewassortedlexicographically(dimensionsordered bydescendingcardinality,d ...d ,orbyascendingcardinality,d ...d ). 10 1 1 10 Census-Income DBGEN cardinality unsorted d1...d10 d10...d1 cardinality unsorted d1...d10 d10...d1 d1 7 42427 32 42309 2 0.75×106 24 0.75×106 d2 8 36980 200 36521 3 1.11×106 38 1.11×106 d3 10 34257 1215 28975 7 2.58×106 150 2.78×106 d4 47 0.13×106 12118 0.13×106 9 0.37×106 1006 3.37×106 d 51 35203 17789 28803 11 4.11×106 10824 4.11×106 5 d 91 0.27×106 75065 0.25×106 50 13.60×106 0.44×106 1.42×106 6 d7 113 12199 9217 12178 2526 23.69×106 22.41×106 23.69×106 d8 132 20028 14062 19917 20000 24.00×106 24.00×106 22.12×106 d9 1240 29223 24313 28673 400000 24.84×106 24.84×106 19.14×106 d10 99800 0.50×106 0.48×106 0.30×106 984297 27.36×106 27.31×106 0.88×106 total - 1.11×106 0.64×106 0.87×106 - 0.122×109 0.099×109 0.079×109 byatmost2fork=1, atmost40fork=2andatmost140for 1000 nckoo=lulo3mn;nginsr,dukenxs=eos3fwiuidtsehunaktlil>cyalg1vebtaseluntehesfiet(bsfeersoetmSimusbopsrretoicnvtegiomenve4enn.t1sw)f.rho(eOmnntshtohererteinfiagrrs.e)t uality query 1 1000 kkkk====1234 SyoynndthiettsiclaDrgBeGfiErNstschooluwmend.nAosltihgonuifigchaNntestpfleixe,dulipkefrDomBGsoErNtin,gh,absea- B) per eq 0. 11 many-valuedcolumnfirst,itshowsabenefitfromsortingeveninits Mi ttmfhhasiaortnsd,tttchfhroeeolumsgemacitnnoh:neodinsnoctforhatl:ecutmwfi,rthnshi.teletTwthqhoiueredclroaicelrougslmeuosmnntstntahrbbaelneelng,aeeKsfistJtfcVsroo-ml4mugomrr2aen0mfartsrio,membetewssnoiefcrafietsittneaegdsr Size of bitmaps ( 0 .00 .000.000111 1e-05 (k=1)toalmost1500timesfaster(k=3). Census-Income DBGEN Netflix KJV-4grams Wecancomparethesetimeswiththeexpectedamountofdata Dimensions 1-4 for each data set scanned per query. This is shown in Figure 7, and we observe reasonablycloseagreementbetweenmostquerytimesandtheex- Figure7:Bitmapdataexaminedperequalityquery. pectedsizesofthebitmapsbeingscanned. Exceptionsincludethe 1,wecangainadditionalindexsizereductionatthecostoflonger firstdimensiononKJV-4gramsandsomecaseswherethebitmaps indexconstructionbyusingGray-FrequencyratherthanGray-Lex. are tiny. This discrepancy might be explained by the retrieval of Ifthetotalnumberofattributevaluesissmallrelativetothenum- therowIDsfromthecompressedbitmaps:longrunsof1x11clean berofrows,thenweshouldfirsttrythek=1index. Perhapsthe wordsmustbeconvertedtomanyrowIDs. data set resembles KJV-4grams. Besides yielding faster queries, thek=1indexmaybesmaller. 7. GUIDELINESFORK Ourexperimentsindicatethatsimple(k=1)bitmapencodingis 8. CONCLUSIONANDFUTUREWORK preferablewhenstoragespaceandindex-creationtimearelessim- portantthanfastequalityqueries. Thestorageandindex-creation Weshowedthatwhilesortingimprovesbitmapindexes,wecan penaltiesarekeptmodestbytablesortingandAlgorithm1. improvethemevenmore(30–40%)ifweknowthenumberofdis- Spacerequirementscanbereducedbychoosingk>1,although tinctvaluesineachcolumn.Fork-of-Nencodingswithk>1,even Tab.4showsthatthisapproachhasrisks(seeKJV-4grams).Fork> further gains (10–30%) are possible using the frequency of each uality query 1 10 kkkk====1234 uality query 1 10 kkkk====1234 nds) per eq 0 .00.11 nds) per eq 0 .00.11 me (seco 0.001 me (seco 0.001 Average ti 0 1.0e0-0015 Average ti 0 1.0e0-0015 Census-Income DBGEN Netflix KJV-4grams Census-Income DBGEN Netflix KJV-4grams Dimensions 1-4 for each data set Dimensions 1-4 for each data set (a)Querytimesoverunsorted indexes (b)Querytimesoversorted(Gray-Lex)indexes Figure6:Querytimesareaffectedbydimension,tablesortingandk. value.Regardingfuturework,theaccuratemathematicalmodelling [12] N.Koudas.Spaceefficientbitmapindexing.InCIKM’00, ofcompressedbitmapindexesremainsanopenproblem. pages194–201,2000. [13] Netflix,Inc.Nexflixprize.http://www.netflixprize.com Acknowledgements (checked2008-04-28),2007. [14] A.Pinar,T.Tao,andH.Ferhatosmanoglu.Compressing ThisworkissupportedbyNSERCgrants155967,261437andby bitmapindicesbydatareorganization.InICDE’05,pages FQRNTgrant112381. 310–321,2005. [15] M.F.Porter.Analgorithmforsuffixstripping.InReadings 9. REFERENCES ininformationretrieval,pages313–316.MorganKaufmann, [1] G.Antoshenkov.Byte-alignedbitmapcompression.InDCC 1997. ’95,page476,1995. [16] ProjectGutenbergLiteraryArchiveFoundation.Project [2] K.Aouiche,D.Lemire,andO.Kaser.Tridelatabledefaits Gutenberg.http://www.gutenberg.org/(checked etcompressiondesindexbitmapsavecalignementsurles 2007-05-30),2007. mots.availablefromhttp://arxiv.org/abs/0805.3339. [17] D.Rotem,K.Stockinger,andK.Wu.MinimizingI/Ocosts [3] L.Bellatreche,R.Missaoui,H.Necir,andH.Drias. ofmulti-dimensionalquerieswithbitmapindices.In Selectionandpruningalgorithmsforbitmapindexselection SSDBM’06,pages33–44,2006. problemusingdatamining.LNCS,4654:221,2007. [18] Y.SharmaandN.Goyal.Anefficientmulti-component [4] G.Canahuate,H.Ferhatosmanoglu,andA.Pinar.Improving indexingembeddedbitmapcompressionfordata bitmapindexcompressionbydatareorganization.http:// reorganization.InformationTechnologyJournal, hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf(checked 7(1):160–164,2008. 2008-05-30),2006. [19] K.Stockinger,K.Wu,andA.Shoshani.Strategiesfor [5] C.Y.ChanandY.E.Ioannidis.Bitmapindexdesignand processingadhocqueriesonlargedatawarehouses.In evaluation.InSIGMOD’98,pages355–366,1998. DOLAP’02,pages72–79,2002. [6] C.Y.ChanandY.E.Ioannidis.Anefficientbitmapencoding [20] K.Stockinger,K.Wu,andA.Shoshani.Evaluationstrategies schemeforselectionqueries.InSIGMOD’99,pages forbitmapindiceswithbinning.InDEXA2004,2004. 215–226,1999. [21] TPC.DBGEN2.4.0.http://www.tpc.org/tpch/ [7] R.Darira,K.C.Davis,andJ.Grommon-Litton.Heuristic (checked2007-12-4),2006. designofpropertymaps.InDOLAP’06,pages91–98,2006. [22] P.D.TurneyandM.L.Littman.Corpus-basedlearningof [8] K.DavisandA.Gupta.DataWarehousesandOLAP: analogiesandsemanticrelations.MachineLearning, Concepts,Architectures,andSolutions,chapterIndexingin 60(1–3):251–278,2005. DataWarehouses.IRMPress,2007. [23] H.K.T.Wong,H.F.Liu,F.Olken,D.Rotem,andL.Wong. [9] S.HettichandS.D.Bay.TheUCIKDDarchive. Bittransposedfiles.InVLDB85,pages448–457,1985. http://kdd.ics.uci.edu(checked2008-04-28),2000. [24] K.Wu,E.J.Otoo,andA.Shoshani.Aperformance [10] M.Hirabayashi.QDBM:Quickdatabasemanager. comparisonofbitmapindexes.InCIKM’01,pages http://qdbm.sourceforge.net/(checked2008-02-22), 559–561,2001. 2006. [25] K.Wu,E.J.Otoo,andA.Shoshani.Optimizingbitmap [11] O.Kaser,S.Keith,andD.Lemire.TheLitOLAPproject: indiceswithefficientcompression.ACMTransactionson Datawarehousingwithliterature.InCaSTA’06,2006. DatabaseSystems(TODS),31(1):1–38,2006.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.