Finite Length Analysis on Listing Failure Probability of Invertible Bloom Lookup Tables

Daichi Yugawa and Tadashi Wadayama
Department of Computer Science and Engineering, Nagoya Institute of Technology, Nagoya, Japan
Email: [email protected], [email protected]
arXiv:1301.7503v1 [cs.IT] 31 Jan 2013

Abstract: The Invertible Bloom Lookup Table (IBLT) is a data structure that supports insertion, deletion, retrieval, and listing of key-value pairs. The IBLT can be used to realize efficient set reconciliation for database synchronization. Its most notable feature is the complete listing of the stored key-value pairs by an algorithm similar to the peeling algorithm for low-density generator-matrix (LDGM) codes. In this paper, we present a stopping set (SS) analysis for the IBLT that reveals the finite length behavior of the listing failure probability. The key to the analysis is the enumeration of stopping matrices of a given size, for which we derive a novel recursive formula that allows computationally efficient enumeration. An upper bound on the listing failure probability based on the union bound accurately captures the error floor behavior. It is shown that, in the error floor region, the dominant SS have size 2. We propose a simple modification of the hash functions, called SS avoiding hash functions, that prevents the occurrence of SS of size 2.

I. INTRODUCTION

The Invertible Bloom Lookup Table (IBLT) is a recently developed data structure that supports insertion, deletion, retrieval, and listing of key-value pairs [7]-[10]. The IBLT can be seen as a natural extension of the Bloom filter [1]-[6], which handles set membership queries. The most notable feature of the IBLT is the complete listing of the stored key-value pairs by an algorithm similar to the peeling algorithm [12] for low-density generator-matrix (LDGM) codes.

The listing operation enables us to use the IBLT as the basis of an efficient set reconciliation algorithm that requires only a small amount of communication. Set reconciliation is the process of synchronizing the contents of two sets held at two distinct locations; it can be used for database synchronization, memory synchronization, and an implementation of the Biff codes [10]. The implementation of the IBLT is fairly simple and scales naturally to multiple servers, which is a desirable feature for data sets of extremely large size.

The paper by Goodrich and Mitzenmacher [7] provides a detailed analysis of the IBLT, such as the optimization of the number of hash functions to minimize the retrieval failure probability. They also presented asymptotic thresholds for accurate recovery by using known results on 2-cores of random hypergraphs. Furthermore, some fault tolerance features of the IBLT are studied extensively.

For designing practical applications, it is beneficial to know not only the asymptotic behavior of the listing process but also its finite length performance. In particular, predicting the error floor of the listing failure probability is required to guarantee the accuracy of a listing process. It is known that stopping sets [11] dominate the finite length performance of LDGM codes over erasure channels. Stopping sets are equally important in the case of the IBLT. In this paper, we present a stopping set analysis for the IBLT that unveils the finite length behavior of the listing failure probability.

This manuscript is organized as follows. Section II introduces the notation and definitions required for this paper, together with a brief review of the IBLT. Section III provides an upper bound on the listing failure probability. An enumeration method for stopping matrices based on a recursive formula is the heart of the efficient evaluation of the upper bound. Section IV presents some results of computer experiments. It is shown that, in the error floor region, the stopping sets of size 2 become dominant. In Section V, a class of hash functions, the SS avoiding hash functions, is proposed to eliminate the stopping sets of size 2 and thereby lower the error floor.
II. PRELIMINARIES

A. Bloom Filter

Before going into the details of the IBLT, we explain the structure of the original Bloom filter (BF), which is the basis of the IBLT. Assume that we have a binary array $T$ and $k$ hash functions $h_1, \ldots, h_k$. The binary array $T$ is initially set to all zero. When an item $x$ is inserted, we set $T[h_i(x)] = 1$ for $i \in [1,k]$, where the notation $[\alpha,\beta]$ means the set of consecutive integers from $\alpha$ to $\beta$. This process is called the Insert(x) operation. The set membership query on $y$ checks whether $y$ is in the BF or not. The LookUp(y) operation returns YES if $T[h_i(y)] = 1$ for all $i \in [1,k]$; otherwise it returns NO. The operations Insert(x) and LookUp(y) can be carried out in $O(k)$ time. Note that the LookUp(y) operation may yield a false positive; i.e., it may return YES when $y$ is not in the BF. The minimization of this false positive probability in terms of the number of hash functions is an important topic in studies of the BF [1], [6]. An appropriately designed BF provides a highly space efficient set membership query system with a reasonably small false positive probability.
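The Insert and LookUp operations of the Bloom filter can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `BloomFilter`, the helper `_h`, the 0-based cell indices, and the use of SHA-256 to build the $k$ hash functions are assumptions made for the example, not choices stated in the paper.

```python
import hashlib


class BloomFilter:
    """Binary array T of m cells with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.T = [0] * m  # the array is initially all zero

    def _h(self, i: int, x: bytes) -> int:
        # Hypothetical h_i: SHA-256 of the index i followed by the item,
        # reduced modulo m. Any family of k hash functions could be used.
        digest = hashlib.sha256(i.to_bytes(4, "big") + x).digest()
        return int.from_bytes(digest, "big") % self.m

    def insert(self, x: bytes) -> None:
        # Insert(x): set T[h_i(x)] = 1 for every i.
        for i in range(self.k):
            self.T[self._h(i, x)] = 1

    def lookup(self, y: bytes) -> bool:
        # LookUp(y): YES only if all k cells are set (may be a false positive).
        return all(self.T[self._h(i, y)] == 1 for i in range(self.k))
```

Both methods touch exactly $k$ cells, which matches the $O(k)$ running time noted above. An inserted item is always reported as present, while an item that was never inserted is reported as present only when all of its $k$ cells happen to be set by other items.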
B. IBLT and its Operations

As in the case of the BF, $k$ hash functions $h_1, \ldots, h_k$ are used in the IBLT. Instead of a binary array, the IBLT utilizes an array of cells $T[1], \ldots, T[m]$. A cell $T[i]$ consists of three fields called Count, KeySum, and ValueSum, denoted by $T[i].\mathrm{Count}$, $T[i].\mathrm{KeySum}$, and $T[i].\mathrm{ValueSum}$. An input to the IBLT is a key-value pair (Key, Value). The Count field represents the number of entries inserted into the cell. The KeySum (resp. ValueSum) field stores the exclusive OR of the keys (resp. values) of the inserted entries. The contents of all the cells are initialized to zero.

The IBLT allows four operations: Insert(x, y), Delete(x, y), Get(x), and ListEntries(). The operation Insert(x, y) stores a key-value pair $(x, y)$ in the IBLT. In an insertion process, the key $x$ (resp. value $y$) is added (over $\mathbb{F}_2$) to the KeySum (resp. ValueSum) field of $T[h_i(x)]$ for $i \in [1,k]$; namely, $T[h_i(x)].\mathrm{KeySum} = T[h_i(x)].\mathrm{KeySum} \oplus x$ and $T[h_i(x)].\mathrm{ValueSum} = T[h_i(x)].\mathrm{ValueSum} \oplus y$. At the same time, the Count field of $T[h_i(x)]$ is incremented as $T[h_i(x)].\mathrm{Count} = T[h_i(x)].\mathrm{Count} + 1$. The operation Delete(x, y) removes the key-value pair $(x, y)$ from the IBLT. The process is the same as that of Insert(x, y) except that the counter is decremented. The operation Get(x) retrieves the value corresponding to the key $x$. This operation is realized as follows. If there exists $i \in [1,k]$ satisfying $T[h_i(x)].\mathrm{Count} = 1$, then Get(x) returns $T[h_i(x)].\mathrm{ValueSum}$. Otherwise, Get(x) declares the failure of the operation.

The last operation, ListEntries(), outputs all the key-value pairs in the IBLT by sequentially removing from the table the entries stored in cells with counter value equal to one. The details of the process are as follows. We first look for $i \in [1,m]$ satisfying $T[i].\mathrm{Count} = 1$. If there exists $i^*$ satisfying the condition $T[i^*].\mathrm{Count} = 1$, the key-value pair $(T[i^*].\mathrm{KeySum}, T[i^*].\mathrm{ValueSum})$ is registered in the output list and then Delete($T[i^*].\mathrm{KeySum}$, $T[i^*].\mathrm{ValueSum}$) is executed. This process is iterated until no cell with counter value equal to one can be found. It should be remarked that, in some cases, ListEntries() fails to list all the entries in the IBLT. This happens when the IBLT is still non-empty but every non-empty cell has a counter value larger than one. This failure event is called a listing failure. It is desirable to design an IBLT so that listing failures occur as rarely as possible.

C. Probabilistic Model

It is clear that the probability of the listing failure event, which is called the listing failure probability, depends on the probabilistic model for keys and hash functions. In this paper (except for Section V), we assume the following model. The hash functions $h_1, \ldots, h_k$ have domain $\{0,1\}^b$, and the keys of the entries to be stored are independent random variables uniformly distributed over $\{0,1\}^b$. The number of entries is assumed to be $n$. The hash functions are assumed to be uniform in the sense that $h_i(x)$ is uniformly distributed over the range of $h_i$ when $x \in \{0,1\}^b$ obeys the uniform distribution. The $m$ cells are split into $k$ subtables, each of size $m/k$, and each hash function uniformly selects a cell in its own subtable. In other words, the range of $h_i$ is $[(i-1)(m/k)+1,\ i(m/k)]$.
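The four IBLT operations, with one subtable per hash function, can be sketched as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: the names `Cell`, `IBLT`, `_h`, and `_update` are hypothetical, keys and values are modelled as Python integers, cell indices are 0-based, and the hash functions are built from SHA-256 purely for illustration. Following the text, `get` assumes the queried key is actually stored, and `list_entries` assumes that only insertions of distinct keys have taken place.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    count: int = 0
    key_sum: int = 0
    value_sum: int = 0


class IBLT:
    def __init__(self, m: int, k: int) -> None:
        assert m % k == 0, "m must be a multiple of k"
        self.k = k
        self.ell = m // k  # number of cells per subtable
        self.T = [Cell() for _ in range(m)]

    def _h(self, i: int, x: int) -> int:
        # Hypothetical h_i (0-based i): selects one cell inside subtable i.
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return i * self.ell + int.from_bytes(digest, "big") % self.ell

    def _update(self, x: int, y: int, delta: int) -> None:
        for i in range(self.k):
            cell = self.T[self._h(i, x)]
            cell.count += delta
            cell.key_sum ^= x    # addition over F_2 is XOR
            cell.value_sum ^= y

    def insert(self, x: int, y: int) -> None:
        self._update(x, y, +1)

    def delete(self, x: int, y: int) -> None:
        self._update(x, y, -1)

    def get(self, x: int) -> Optional[int]:
        # Assumes x is stored; returns None to signal failure.
        for i in range(self.k):
            cell = self.T[self._h(i, x)]
            if cell.count == 1:
                return cell.value_sum
        return None

    def list_entries(self) -> Tuple[List[Tuple[int, int]], bool]:
        """Peel cells with Count == 1 (destructive). Returns (entries, complete)."""
        out = []
        pending = [j for j, c in enumerate(self.T) if c.count == 1]
        while pending:
            cell = self.T[pending.pop()]
            if cell.count != 1:
                continue  # the cell changed since it was queued
            x, y = cell.key_sum, cell.value_sum
            out.append((x, y))
            self.delete(x, y)
            # Cells touched by the deletion may now have Count == 1.
            pending.extend(self._h(i, x) for i in range(self.k))
        complete = all(c.count == 0 for c in self.T)
        return out, complete
```

A table with a single cell per subtable (`IBLT(3, 3)`) holding two distinct keys illustrates a listing failure: every cell has Count equal to 2, so `list_entries()` returns an empty list together with `False`.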
III. UPPER BOUND ON LISTING FAILURE PROBABILITY

In this section, we derive an upper bound on the listing failure probability. The listing failure event occurs when a stopping set [11], which is a combinatorial substructure of a matrix, appears. In order to evaluate the listing failure probability, we need to enumerate the stopping matrices of a given size. A stopping matrix is a matrix with no row of weight one, which corresponds to the case where no cell with counter value equal to one exists.

A. Enumeration of Stopping Matrix

The state matrix $B$ of an IBLT can be represented by an $m \times n$ binary matrix, where $m = \ell k$. A row of the matrix $B$ corresponds to a cell and a column corresponds to an entry. The matrix $B$ can be divided into $k$ disjoint blocks of size $\ell \times n$. If the $(s,t)$-element of the $u$-th block of $B$ is one, the $t$-th entry is hashed to the $s$-th cell of the $u$-th subtable by the $u$-th hash function. Suppose that a sub-matrix $M'$ consisting of several columns of $B$ has no row of weight one. In such a case, ListEntries() fails to list all the entries in the table because $M'$ cannot be resolved in the peeling process. If a binary matrix $M'$ does not have a row of weight one, $M'$ is said to be a stopping matrix. The existence of a stopping matrix in $B$ is a necessary and sufficient condition for the failure of the peeling process [11], [12].

In our case, the state matrix $B$ is divided into $k$ subblocks corresponding to the subtables. It is therefore reasonable to consider stopping matrices in a single subblock before discussing the probability of the event that $B$ includes a stopping matrix.

Let $S^{(\ell,n)}$ be the set of $\ell \times n$ binary matrices with column weight one; i.e.,

$$S^{(\ell,n)} \triangleq \{(m_1, \ldots, m_n) \in \{0,1\}^{\ell \times n} \mid \mathrm{wt}(m_i) = 1,\ i \in [1,n]\},$$

where $\mathrm{wt}(\cdot)$ represents the Hamming weight function. From this definition, it is evident that the cardinality of $S^{(\ell,n)}$ is $\ell^n$. The number of stopping matrices in $S^{(\ell,n)}$ is denoted by $z(\ell,n)$, which can be written as

$$z(\ell,n) \triangleq \#\{M \in S^{(\ell,n)} \mid M \text{ is a stopping matrix}\}. \tag{1}$$

By convention, $z(0,0)$ is defined to be 1.

The following recursive formula plays a key role in enumerating $z(\ell,n)$, which is required for evaluating the upper bound on the listing failure probability.

Theorem 1 (Recursive formula on $z(\ell,n)$): The recursive relation

$$z(\ell,n) = \ell^n - \sum_{c=1}^{\min(\ell,n)} c! \binom{\ell}{c} \binom{n}{c} z(\ell-c,\, n-c) \tag{2}$$

holds for $\ell \ge 1$ and $n \ge 1$.
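(Proof) Let $a(\ell,n)$ be the number of non-stopping matrices, $a(\ell,n) \triangleq \ell^n - z(\ell,n)$. In the following, we enumerate $a(\ell,n)$ by using a recursive relation. For a given $M \in S^{(\ell,n)}$, a pair $(i,j) \in [1,\ell] \times [1,n]$ is said to be a pivot of $M$ if $M_{i,j} = 1$ and the Hamming weight of the $i$-th row of $M$ is 1. The set of pivots of $M$ is denoted by

$$\mathrm{piv}(M) \triangleq \{(i,j) \in [1,\ell] \times [1,n] \mid (i,j) \text{ is a pivot of } M\}.$$

Note that $M$ is a stopping matrix if and only if $\mathrm{piv}(M)$ is empty. The number of non-stopping matrices $a(\ell,n)$ can be represented by

$$a(\ell,n) = \sum_{c=1}^{\min(\ell,n)} \# T_c^{(\ell,n)}, \tag{3}$$

where

$$T_c^{(\ell,n)} \triangleq \{M \in S^{(\ell,n)} \mid \#\mathrm{piv}(M) = c\}, \quad c \in [0, \min(\ell,n)]. \tag{4}$$

This is because the set of non-stopping matrices can be partitioned into the disjoint sets $T_c^{(\ell,n)}$ for $c \in [1, \min(\ell,n)]$. In the following, we prove the equality

$$\# T_c^{(\ell,n)} = c! \binom{\ell}{c} \binom{n}{c} z(\ell-c,\, n-c) \tag{5}$$

for $c \in [1, \min(\ell,n)]$. Assume that $M \in T_c^{(\ell,n)}$ is given. By removing from $M$ all the rows and columns corresponding to $\mathrm{piv}(M)$, we obtain an $(\ell-c) \times (n-c)$ matrix $M'$; namely, we delete the $i$-th row and the $j$-th column from $M$ if $(i,j) \in \mathrm{piv}(M)$. Since $M \in T_c^{(\ell,n)}$, the resulting matrix $M'$ must be a stopping matrix in $T_0^{(\ell-c,n-c)}$, and the size of $T_0^{(\ell-c,n-c)}$ is $z(\ell-c, n-c)$. Therefore, the size of $T_c^{(\ell,n)}$ is the product of the number of possible ways to choose $\mathrm{piv}(M)$ and $z(\ell-c, n-c)$. The pivots occupy $c$ distinct rows and $c$ distinct columns, so $\mathrm{piv}(M)$ is determined by choosing the $c$ rows, choosing the $c$ columns, and matching the rows to the columns, which can be done in $c! \binom{\ell}{c} \binom{n}{c}$ ways. As a result, we have the equality (5). Combining (3) and (5), the claim of the theorem is obtained.

For some special combinations of $\ell$ and $n$, $z(\ell,n)$ has a simple expression:

$$z(\ell,1) = 0, \quad \ell \ge 1 \tag{6}$$
$$z(\ell,2) = \ell, \quad \ell \ge 1 \tag{7}$$
$$z(\ell,3) = \ell, \quad \ell \ge 1 \tag{8}$$
$$z(1,n) = 1, \quad n \ge 2. \tag{9}$$

These expressions can be easily proved from the definitions of the stopping matrix and of $S^{(\ell,n)}$, and they can be used as boundary conditions for a recursive evaluation process. The recursive formula (2) enables us to evaluate the value of $z(\ell,n)$ efficiently.

Table I presents the values of $z(\ell,n)$ for $(\ell,n) \in [1,10]^2$. These values are computed with the recursive formula (2). Note that $S^{(\ell,n)}$ contains $10^{10}$ matrices when $\ell = n = 10$. A naive enumeration scheme that generates all the matrices in $S^{(\ell,n)}$ may be computationally difficult even for such small parameters.

TABLE I
VALUES OF z(ℓ,n): NUMBER OF STOPPING MATRICES IN S^(ℓ,n)

ℓ\n   1   2   3    4    5      6      7       8        9        10
1     0   1   1    1    1      1      1       1        1         1
2     0   2   2    8   22     52    114     240      494      1004
3     0   3   3   21   63    243    969    3657    12987     43959
4     0   4   4   40  124    664   3196   15712    79228    396616
5     0   5   5   65  205   1405   7425   44385   271205   1666925
6     0   6   6   96  306   2556  14286  100176   691146   4916436
7     0   7   7  133  427   4207  24409  196105  1471519  11773699
8     0   8   8  176  568   6448  38424  347712  2775032  24547664
9     0   9   9  225  729   9369  56961  573057  4794633  46341081
10    0  10  10  280  910  13060  80650  892720  7753510  81163900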
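The recursion of Theorem 1 translates directly into a memoized function. The sketch below is illustrative only: the function name `z` mirrors the notation of the text, and the boundary values $z(\ell,0) = 1$ and $z(0,n) = 0$ for $n \ge 1$ are assumptions read off from the definition of a stopping matrix (the text itself states only $z(0,0) = 1$).

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def z(ell: int, n: int) -> int:
    """Number of stopping matrices in S(ell, n), evaluated with recursion (2)."""
    if n == 0:
        # Assumed boundary value: a matrix without columns has no row of
        # weight one. This covers the stated convention z(0, 0) = 1.
        return 1
    if ell == 0:
        # Assumed boundary value: a column of weight one needs at least one row.
        return 0
    total = ell ** n
    for c in range(1, min(ell, n) + 1):
        total -= factorial(c) * comb(ell, c) * comb(n, c) * z(ell - c, n - c)
    return total
```

With these boundary values the function agrees with the special cases (6) to (8) and with entries of Table I checked by hand, for example `z(3, 4) == 21`, `z(4, 4) == 40`, and `z(2, 10) == 1004`. Memoization keeps the number of distinct subproblems at most $(\ell+1)(n+1)$.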
B. Listing Failure Probability and its Bound

The set of all state matrices is defined as

$$B^{(\ell,n,k)} \triangleq \{(M_1, \ldots, M_k)^T \mid M_i \in S^{(\ell,n)},\ i \in [1,k]\}. \tag{10}$$

The cardinality of $B^{(\ell,n,k)}$ is $\ell^{nk}$. According to the probabilistic model of Section II-C, we define a probability space by assigning the equal probability $1/\ell^{nk}$ to each element of $B^{(\ell,n,k)}$.

Let $P_f(\ell,n,k)$ denote the listing failure probability, which is the probability that the ListEntries() operation fails to list all the entries in the IBLT. The next theorem provides an upper bound on $P_f(\ell,n,k)$.

Theorem 2 (Upper bound on listing failure probability): For given $\ell \ge 1$, $n \ge 1$, $k \ge 1$, the listing failure probability $P_f(\ell,n,k)$ can be upper bounded by

$$P_f(\ell,n,k) \le \sum_{i=2}^{n} \binom{n}{i} \left( \frac{z(\ell,i)}{\ell^i} \right)^k. \tag{11}$$

(Proof) The peeling process of ListEntries() fails to recover all the entries in the IBLT if and only if $B \in B^{(\ell,n,k)}$ contains a stopping matrix as a sub-matrix. Thus, $P_f(\ell,n,k)$ can be characterized as

$$P_f(\ell,n,k) = \Pr[B \text{ includes a stopping matrix}]. \tag{12}$$

For an index set $I \in 2^{[1,n]}$, let $B_I$ be the sub-matrix of $B$ consisting of the columns of $B$ with indices in $I$. If $B_I$ is a stopping matrix, then the index set $I$ is said to be a stopping set. The probability $P_f(\ell,n,k)$ can be upper bounded as follows:

$$P_f(\ell,n,k) = \Pr[B \text{ includes a stopping matrix}] = \Pr\Big[ \bigcup_{I \in 2^{[1,n]} \setminus \{\emptyset\}} \{B_I \text{ is a stopping matrix}\} \Big] \le \sum_{I \in 2^{[1,n]} \setminus \{\emptyset\}} \Pr[I \text{ is a stopping set}]. \tag{13}$$

The last inequality is due to the union bound. The matrix $B_I$ has no row of weight one if and only if each of its $k$ blocks, which are independent and uniformly distributed over $S^{(\ell,\#I)}$, is a stopping matrix. From the definition of the probability space on $B^{(\ell,n,k)}$, the probability that $I$ is a stopping set is therefore given by

$$\Pr[I \text{ is a stopping set}] = \left( \frac{z(\ell,\#I)}{\ell^{\#I}} \right)^k. \tag{14}$$

By using this equality, we have the following upper bound:

$$P_f(\ell,n,k) \le \sum_{I \in 2^{[1,n]} \setminus \{\emptyset\}} \Pr[I \text{ is a stopping set}] = \sum_{i=1}^{n} \sum_{I:\, \#I = i} \Pr[I \text{ is a stopping set}] = \sum_{i=2}^{n} \binom{n}{i} \left( \frac{z(\ell,i)}{\ell^i} \right)^k. \tag{15}$$

In the last equality, we used the fact $z(\ell,1) = 0$.
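The upper bound of Theorem 2 can be evaluated numerically once the stopping matrix counts are available. The following sketch repeats a memoized evaluation of the recursion for $z(\ell,n)$ so that it is self-contained; the function names `z` and `upper_bound`, the boundary values $z(\ell,0) = 1$ and $z(0,n) = 0$, and the use of exact rational arithmetic are assumptions made for illustration.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def z(ell: int, n: int) -> int:
    """Stopping matrix count via the recursive formula on z(ell, n)."""
    if n == 0:
        return 1  # assumed boundary value (includes z(0, 0) = 1)
    if ell == 0:
        return 0  # assumed boundary value
    total = ell ** n
    for c in range(1, min(ell, n) + 1):
        total -= factorial(c) * comb(ell, c) * comb(n, c) * z(ell - c, n - c)
    return total


def upper_bound(ell: int, n: int, k: int) -> float:
    """Union bound: sum over i = 2..n of C(n, i) * (z(ell, i) / ell**i) ** k."""
    total = Fraction(0)
    for i in range(2, n + 1):
        # Exact rationals avoid overflow and rounding in the large terms.
        total += comb(n, i) * Fraction(z(ell, i), ell ** i) ** k
    return float(total)
```

For $n = 2$ only the term $i = 2$ remains, so `upper_bound(4, 2, 3)` equals $1/4^3$. Being a union bound, the returned value can exceed 1 when the table is small relative to the number of entries, in which case it carries no information.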
IV. COMPUTER EXPERIMENTS

In this section, we present several results of computer experiments and of the numerical evaluation of the upper bound presented in the previous section.

In order to examine the tightness of the bound, Figure 1 presents curves of the listing failure probability obtained by computer experiments (dashed line) and of the upper bound (solid line). These curves are plotted as functions of the number of cells $m$. The number of entries is n=210 and the symbol size of the key is $b = 32$. In the computer experiments, the number of trials is $10^6$. SHA-1 [13] was used as the hash function, and the number of hash functions is $k = 3$. We used pseudorandom 32-bit numbers as key-value pairs. It can be observed that the upper bound becomes a fairly tight estimate as the number of cells $m$ increases. As in the case of LDPC codes, the error curve in Figure 1 exhibits both waterfall and error floor phenomena. This result indicates that the upper bound precisely captures the error floor behavior of the listing failure probability.

Fig. 1. Comparison of the listing failure probability: experimental values and upper bound (n=210, k=3, b=32).

From the upper bound, it is possible to see a tradeoff between the waterfall and the error floor. Figure 2 presents the upper bounds for $k \in [3,6]$. The number of entries is $n = 100$. Each curve of the upper bound is plotted as a function of the number of cells $m$. We can observe that the listing failure probability in the error floor region decreases as the number of hash functions $k$ increases. On the other hand, increasing $k$ pushes the waterfall to the right.

Fig. 2. Comparison of the upper bound on listing failure probability: 3 hashes, 4 hashes, 5 hashes and 6 hashes (n=100).

From the upper bound and some experimental results, we see that stopping sets of size 2 dominate the error floor behavior. Figure 3 presents the upper bound, the asymptote $P_2(\ell,n,k)$ defined by

$$P_2(\ell,n,k) \triangleq \binom{n}{2} \left( \frac{z(\ell,2)}{\ell^2} \right)^k = \binom{n}{2} \frac{1}{\ell^k}, \tag{16}$$

and the experimental values of the listing failure probability. The result suggests that the probability of occurrence of stopping sets of size 2 determines the depth of the error floor.

Fig. 3. Comparison of the listing failure probability: experimental values, upper bound and asymptote $P_2(\ell,n,k)$ (n=210, k=3, b=32).
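A small Monte Carlo experiment in the spirit of this section can be sketched as follows. The sketch is a simplification and not the experimental setup of the paper: instead of hashing keys with SHA-1, it samples the cell of each entry in each subtable directly and uniformly, which corresponds to the probabilistic model of Section II-C, and it tracks only which entries occupy which cells. The function names `listing_fails` and `estimate_failure` are hypothetical.

```python
import random


def listing_fails(ell: int, n: int, k: int, rng: random.Random) -> bool:
    """Sample one state matrix and run the peeling process on it."""
    # cells[t] holds the k cells of entry t, one per subtable (0-based).
    cells = [[u * ell + rng.randrange(ell) for u in range(k)] for _ in range(n)]
    members = [set() for _ in range(ell * k)]  # entries currently in each cell
    for t, hit in enumerate(cells):
        for j in hit:
            members[j].add(t)
    pending = [j for j, s in enumerate(members) if len(s) == 1]
    remaining = n
    while pending:
        j = pending.pop()
        if len(members[j]) != 1:
            continue  # the cell changed since it was queued
        (t,) = members[j]  # the unique entry in a cell with count one
        for j2 in cells[t]:
            members[j2].discard(t)
            if len(members[j2]) == 1:
                pending.append(j2)
        remaining -= 1
    # Peeling stalls with entries left exactly when a stopping set exists.
    return remaining > 0


def estimate_failure(ell: int, n: int, k: int, trials: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    failures = sum(listing_fails(ell, n, k, rng) for _ in range(trials))
    return failures / trials
```

Two degenerate cases serve as sanity checks: with a single cell per subtable and two entries the estimate is exactly 1, and with a single entry it is exactly 0. Resolving an error floor at small probabilities requires a correspondingly large number of trials.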
V. SS AVOIDING HASH FUNCTION

We have seen that stopping sets of size 2 dominate the behavior of the listing failure probability in the error floor region. A stopping set of size 2 occurs when the $k$ hash values of two distinct keys collide; i.e.,

$$(h_1(a), h_2(a), \ldots, h_k(a)) = (h_1(b), h_2(b), \ldots, h_k(b)) \tag{17}$$

for $a \ne b$. If this type of collision can be prevented, the error floor performance is expected to improve.

The SS avoiding hash functions defined here are designed so that the collisions (17) are avoided. In the following discussion, we further assume the uniqueness of the keys registered in the IBLT. Namely, neither the insertion of distinct entries with the same key nor the repeated insertion of the same key-value pair is allowed. This assumption is natural for most applications, such as set reconciliation.

Let a hash function $h$ be a bijective map from $\{0,1\}^b$ to $\{0,1\}^{sk}$, where $b = sk$. The SS avoiding hash functions $(h_1, \ldots, h_k)$ are defined simply by partitioning the output $sk$-tuple of $h$ into $k$ binary $s$-tuples; i.e., $h_i(x)$ is given by

$$h_i(x) = q_i + (i-1) 2^s + 1, \quad i \in [1,k], \tag{18}$$

where $(q_1, \ldots, q_k) = h(x)$ with $q_i \in \{0,1\}^s$, and each $q_i$ is read as an integer in $[0, 2^s - 1]$. Note that $m/k = 2^s$ holds; i.e., each subtable contains $2^s$ cells. Since $h$ is bijective and the keys in the IBLT are unique, it is evident that a collision (17) does not occur. This means that the occurrence of stopping sets of size 2 is completely prevented. Note that the use of the SS avoiding hash functions introduces a restriction on the system parameters; i.e., $b = sk$. This inflexibility can be considered the price to be paid for lowering the error floor.
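The construction (18) can be sketched with the identity map as the bijection $h$. In the sketch below, the function name `ss_avoiding_hashes`, the representation of keys as integers in $[0, 2^{sk})$, and the convention that $q_1$ is the most significant $s$-bit block are assumptions made for illustration; any fixed way of splitting the $sk$ bits into $k$ blocks would serve equally well.

```python
from typing import List


def ss_avoiding_hashes(x: int, s: int, k: int) -> List[int]:
    """Cell indices h_1(x), ..., h_k(x) of (18), with h taken as the identity."""
    b = s * k
    assert 0 <= x < (1 << b), "the key must fit in b = s * k bits"
    mask = (1 << s) - 1
    cells = []
    for i in range(1, k + 1):
        # q_i: the i-th s-bit block of h(x) = x, counted from the
        # most significant end (an assumed bit ordering).
        q_i = (x >> (s * (k - i))) & mask
        # 1-based cell index inside the i-th subtable of 2**s cells.
        cells.append(q_i + (i - 1) * (1 << s) + 1)
    return cells
```

For $s = 2$ and $k = 3$, the key 0 maps to the cells `[1, 5, 9]` and the key 63 maps to `[4, 8, 12]`. Because the $k$ blocks together determine the key, two distinct keys always differ in at least one cell index, which is exactly the property that rules out the collision (17).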
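It should be remarked that the probabilistic model assumed in Section II cannot be directly applied to the system presented in this section. This is because the assumption on the uniqueness of the keys introduces weak correlations between the stored entries. Although these distinctions must be kept in mind, the analysis presented in the previous sections may still be useful for predicting the performance of ListEntries() with the SS avoiding hash functions if $b$ is large enough.

Figure 4 presents the results of a computer experiment on the SS avoiding hash functions. The identity map was used as the bijective map. Two curves of the listing failure probability are plotted; the first corresponds to a conventional hash function and the second corresponds to the SS avoiding hash functions, where the symbol size of the key is $b = 3s$. In both cases, the number of entries is n=210 and the number of hash functions is $k = 3$. We can observe that the SS avoiding hash functions reduce the listing failure probability in the error floor region. Furthermore, the upper bound almost captures the error floor behavior of the listing failure probability in this setting.

Fig. 4. Comparison of the listing failure probability: conventional hash function and SS avoiding hash function (n=210, k=3).

VI. CONCLUSION

In this paper, we presented a finite length performance analysis of the listing failure probability, which may be useful for designing a system or an algorithm that includes the IBLT as a building component. The recursive formula presented in Section III is a useful tool for finite length analysis. In Section IV, we have seen that the error floor performance can be improved by increasing the number of hash functions, but that this degrades the waterfall performance. From the results shown in Section V, we can expect that appropriately designed SS avoiding hash functions improve the error floor performance without sacrificing the waterfall performance.

REFERENCES

[1] B. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.
[2] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, "Beyond Bloom filters: From approximate membership checks to approximate state machines," ACM SIGCOMM Computer Communication Review, vol. 36, no. 4, pp. 326, 2006.
[3] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, "An improved construction for counting Bloom filters," in Proceedings of the European Symposium on Algorithms (ESA), vol. 4168 of LNCS, pp. 684-695, 2006.
[4] A. Broder and M. Mitzenmacher, "Network applications of Bloom filters: A survey," Internet Mathematics, vol. 1, no. 4, pp. 485-509, 2004.
[5] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, "The Bloomier filter: an efficient data structure for static support lookup tables," in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 30-39, 2004.
[6] M. Mitzenmacher, "Compressed Bloom filters," IEEE/ACM Transactions on Networking, vol. 10, no. 5, pp. 613-620, 2002.
[7] M. Goodrich and M. Mitzenmacher, "Invertible bloom lookup tables," in Proceedings of the 49th Allerton Conference, pp. 792-799, 2011.
[8] F. Putze, P. Sanders, and J. Singler, "Cache-, hash-, and space-efficient Bloom filters," ACM Journal of Experimental Algorithms, vol. 14, pp. 4.4-4.18, 2009.
[9] D. Eppstein and M. T. Goodrich, "Straggler identification in round-trip data streams via Newton's identities and invertible Bloom filters," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 2, pp. 297-306, 2011.
[10] M. Mitzenmacher and G. Varghese, "Biff (Bloom filter) codes: fast error correction for large data sets," Information Theory Proceedings (ISIT), IEEE International Symposium on, pp. 483-487, 2012.
[11] C. Di, D. Proietti, I. E. Telatar, T. Richardson, and R. Urbanke, "Finite-length analysis of low-density parity-check codes on the binary erasure channel," IEEE Transactions on Information Theory, vol. 48, no. 6, pp. 1570-1579, 2002.
[12] T. Richardson and R. Urbanke, Modern Coding Theory, Cambridge University Press, 2008.
[13] National Institute of Standards and Technologies, Secure Hash Standard, Federal Information Processing Standards Publication, FIPS-180, 1993.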