ebook img

Recursive n-gram hashing is pairwise independent, at best PDF

0.19 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Recursive n-gram hashing is pairwise independent, at best

Recursive n-gram hashing is pairwise independent, at best DanielLemirea,∗,OwenKaserb aLICEF,Universite´duQue´beca`Montre´al(UQAM),100SherbrookeWest,Montreal,QC,H2X3P2Canada bDept.ofCSAS,UniversityofNewBrunswick,100TuckerParkRoad,SaintJohn,NB,Canada 9 0 0 2 Abstract g Manyapplicationsusesequencesofnconsecutivesymbols(n-grams). Hashingthese u n-grams can be a performance bottleneck. For more speed, recursive hash families A computehashvaluesbyupdatingpreviousvalues. Weprovethatrecursivehashfam- 9 ilies cannot be more than pairwise independent. While hashing by irreducible poly- 1 nomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing ] by cyclic polynomials pairwise independent by ignoring n−1 bits. Experimentally, B weshowthathashingbycyclicpolynomialsistwiceasfastashashingbyirreducible D polynomials.WealsoshowthatrandomizedKarp-Rabinhashfamiliesarenotpairwise . s independent. c [ Keywords: RollingHashing,Rabin-KarpHashing,HashingStrings 6 v 6 1. Introduction 7 6 Ann-gramisaconsecutivesequenceofnsymbolsfromanalphabetΣ. Ann-gram 4 hashfunctionhmapsn-gramstonumbersin[0,2L).Thesefunctionshaveseveralappli- . 5 cationsfromfull-textmatching[1–3],patternmatching[4],orlanguagemodels[5–11] 0 toplagiarismdetection[12]. 7 Toprovethatahashingalgorithmmustworkwell, wetypicallyneedhashvalues 0 : tosatisfysomestatisticalproperty. Indeed,ahashfunctionthatmapsalln-gramstoa v singleintegerwouldnotbeuseful. Yet,asinglehashfunctionisdeterministic: itmaps i X ann-gramtoasinglehashvalue. Thus,wemaybeabletochoosetheinputdatasothat r thehashvaluesarebiased. Therefore,werandomlypickafunctionfromafamilyH a offunctions[13]. SuchafamilyH isuniform(overL-bits)ifallhashvaluesareequiprobable. That is,consideringhselecteduniformlyatrandomfromH,wehaveP(h(x)=y)=1/2L foralln-gramsxandallhashvaluesy. Thisconditionisweak;thefamilyofconstant functions(h(x)=c)isuniform1. ∗Correspondingauthor.Tel.:00+1+514987-3000ext.2835;fax:00+1+514843-2160. Emailaddresses:[email protected](DanielLemire),[email protected](OwenKaser) 1 Weomitfamiliesuniformoveranarbitraryinterval[0,b)—notoftheform[0,2L). Indeed,several applications[14,15]requireuniformityoverL-bits. PreprintsubmittedtoElsevier December5,2009 Intuitively,wewouldwantthatifanadversaryknowsthehashvalueofonen-gram, itcannotdeduceanythingaboutthehashvalueofanothern-gram. Forexample,with the family of constant functions, once we know one hash value, we know them all. The family H is pairwise independent if the hash value of n-gram x is independent 1 from the hash value of any other n-gram x . That is, we have P(h(x )=y∧h(x )= 2 1 2 z)=P(h(x )=y)P(h(x )=z)=1/4L for all n-grams x , x , and all hash values y, 1 2 1 2 z with x (cid:54)=x . Pairwise independence implies uniformity. We refer to a particular 1 2 hashfunctionh∈H as“uniform”or“apairwiseindependenthashfunction”whenthe familyinquestioncanbeinferredfromthecontext. Moreover,theideaofpairwiseindependencecanbegeneralized: afamilyofhash functionsH isk-wiseindependentifgivendistinctx1,...,xk andgivenhselecteduni- formly at random from H, then P(h(x1)=y1∧···∧h(xk)=yk)=1/2kL. Note that k-wise independence implies k−1-wise independence and uniformity. (Fully) inde- pendentfamiliesarek-wiseindependentforarbitrarilylargek. Forapplications,non- independent families may fare as well as fully independent families if the entropy of thedatasourceissufficientlyhigh[16]. Ahashfunctionhisrecursive[17]—orrolling[18]—ifthereisafunctionF com- puting the hash value of the n-gram x2...xn+1 from the hash value of the preceding n-gram(x1...xn)andthevaluesofx1andxn+1. Thatis,wehave h(x ,...,x )=F(h(x ,...,x ),x ,x ). 2 n+1 1 n 1 n+1 Ideally,wecouldcomputefunctionFintimeO(L)andnot,forexample,intimeO(Ln). Themaincontributionsofthispaperare: • aproofthatrecursivehashingisnomorethanpairwiseindependent(§3); • aproofthatrandomizedKarp-Rabincanbeuniformbutneverpairwiseindepen- dent(§5); • aproofthathashingbyirreduciblepolynomialsispairwiseindependent(§7); • aproofthathashingbycyclicpolynomialsisnotevenuniform(§9); • aproofthathashingbycyclicpolynomialsispairwiseindependent—afterignor- ingn−1consecutivebits(§10). Weconcludewithanexperimentalsectionwhereweshowthathashingbycyclicpoly- nomials is faster than hashing by irreducible polynomials. Table 1 summarizes the algorithmspresented. 2. Trailing-zeroindependence Some randomized algorithms [14, 15] merely require that the number of trailing zeroes be independent. For example, to estimate the number of distinct n-grams in a large document without enumerating them, we merely have to compute maximal numbers of leading zeroes k among hash values [19]. Na¨ıvely, we may estimate that if a hash value with k leading zeroes is found, we have ≈2k distinct n-grams. Such Table1:Asummaryofthehashingfunctionpresentedandtheirproperties.ForGENERALandCYCLIC,we requireL≥n.TomakeCYCLICpairwiseindependent,weneedtodiscardsomebits—theresultingscheme isnotformallyrecursive.RandomizedKarp-Rabinisuniformundersomeconditions. name costpern-gram independence memoryuse non-recursive3-wise(§4) O(Ln) 3-wise O(nL|Σ|) RandomizedKarp-Rabin(§5) O(LlogL2O(log∗L)) uniform O(L|Σ|) GENERAL(§7) O(Ln) pairwise O(L|Σ|) RAM-BufferedGENERAL(§8) O(L) pairwise O(L|Σ|+L2n) CYCLIC(§9) O(L+n) pairwise(§10) O((L+n)|Σ|) estimatesmightbeusefulbecausethenumberofdistinctn-gramsgrowslargewithn: Shakespeare’sFirstFolio[20]hasover3milliondistinct15-grams. Formally, let zeros(x) return the number of trailing zeros (0,1,...,L) of x, where zeros(0)=L. We say h is k-wise trailing-zero independent if P(zeros(h(x ))≥ j ∧ 1 1 zeros(h(x2))≥ j2∧...∧zeros(h(xk))≥ jk)=2−j1−j2−···−jk,for ji=0,1,...,L. Ifhisk-wiseindependent, itisk-wisetrailing-zeroindependent. Theconverseis nottrue.Ifhisak-wiseindependentfunction,considerg◦hwheregmakeszeroallbits beforetherightmost1(e.g.,g(0101100)=0000100).Hashg◦hisk-wisetrailing-zero independentbutnotevenuniform(considerthatP(g=0001)=8P(g=1000)). 3. Recursivehashfunctionsarenomorethanpairwiseindependent Notonlyarerecursivehashfunctionslimitedtopairwiseindependence: theycan- notbe3-wisetrailing-zeroindependent. Proposition1 There is no 3-wise trailing-zero independent hashing function that is recursive. PROOF Considerthe (n+2)-gramanbb. Supposehisrecursiveand3-wisetrailing- zeroindependent,then (cid:16) ^ P zeros(h(a,...,a))≥L ^ (cid:17) zeros(h(a,...,a,b))≥L zeros(h(a,...,a,b,b))≥L (cid:16) ^ ^ (cid:17) = P h(a,...,a)=0 F(0,a,b)=0 F(0,a,b)=0 (cid:16) ^ (cid:17) = P h(a,...,a)=0 F(0,a,b)=0 (cid:16) ^ (cid:17) = P zeros(h(a,...,a))≥L zeros(h(a,...,a,b))≥L = 2−2L bytrailing-zeropairwiseindependence (cid:54)= 2−3L asrequiredbytrailing-zero3-wiseindependence. Hence,wehaveacontradictionandnosuchhexists. 4. Anon-recursive3-wiseindependenthashfunction Atrivialwaytogenerateanindependenthashistoassignarandomintegerin[0,2L) toeachnewvaluex. Unfortunately,thisrequiresasmuchprocessingandstorageasa completeindexingofallvalues. However,inamultidimensionalsettingthisapproachcanbeputtogooduse. Sup- pose that we have tuples in K1×K2×···×Kn such that |Ki| is small for all i. We canconstructindependenthashfunctionshi:Ki→[0,2L)foralliandcombinethem. The hash function h(x1,x2,...,xn)=h1(x1)⊕h2(x2)⊕···⊕hn(xn) is then 3-wise in- dependent(⊕isthe“exclusiveor”function,XOR). IntimeO(∑ni=1|Ki|),wecancon- structthehashfunctionbygenerating∑ni=1|Ki|randomnumbersandstoringthemina look-uptable. Withconstant-timelook-up,hashingann-gramthustakesO(Ln)time. Algorithm1isanapplicationofthisideaton-grams. Algorithm1The(non-recursive)3-wiseindependentfamily. Require: nL-bithashfunctionsh1,h1,...,hnoverΣfromanindependenthashfamily 1: s←emptyFIFOstructure 2: foreachcharactercdo 3: appendctos 4: iflength(s)=nthen 5: yieldh1(s1)⊕h2(s2)⊕...⊕hn(sn) {Theyieldstatementreturnsthevalue,withoutterminatingthealgorithm.} 6: removeoldestcharacterfroms 7: endif 8: endfor Thisnewfamilyisnot4-wiseindependentforn>1. Considerthen-gramsac,ad, bc, bd. The XOR of their four hash values is zero. However, the family is 3-wise independent. Proposition2 The family of hash functions h(x) = h (x )⊕h (x )⊕...⊕h (x ), 1 1 2 2 n n wheretheL-bithashfunctionsh ,...,h aretakenfromanindependenthashfamily,is 1 n 3-wiseindependent. PROOF Consider any 3 distinct n-grams: x(1) = x1(1)...xn(1), x(2) = x1(2)...xn(2), and x(3) =x1(3)...xn(3). Because the n-grams are distinct, at least one of two possibilities holds: CaseA Forsomei∈{1,...,n},thethreevaluesxi(1),xi(2),xi(3)aredistinct. Writeχj= hi(xi(j))for j=1,2,3. Forexample,considerthethree1-grams: a,b,c. CaseB (Uptoareorderingofthethreen-grams.)Therearetwovaluesi,j∈{1,...,n} suchthatx(1)isdistinctfromthetwoidenticalvaluesx(2),x(3),andsuchthatx(2) i i i j is distinct from the two identical values xi(1),xi(3). Write χ1 = hi(xi(1)), χ2 = hj(x(j2)),andχ3=hi(xi(3)). Forexample,considerthethree2-grams: ad,bc,bd. RecallthattheXORoperationisinvertible: a⊕b=cifandonlyifa=b⊕c. Weprove3-wiseindependenceforcasesAandB. CaseA. Write f(i)=h(x(i))⊕χi fori=1,2,3. Wehavethatthevaluesχ1,χ2,χ3 are mutuallyindependent,andtheyareindependentfromthevalues f(1),f(2),f(3)2: (cid:32) (cid:33) (cid:32) (cid:33) 3 3 3 3 P ^χ =y ∧^ f(i)=y(cid:48) =∏P(χ =y)P ^ f(i)=y(cid:48) i i i i i i i=1 i=1 i=1 i=1 forallvaluesyi,y(cid:48)i. Hence,wehave P(cid:16)h(x(1))=z(1)^h(x(2))=z(2)^h(x(3))=z(3)(cid:17) = P(cid:16)χ =z(1)⊕f(1))^χ =z(2)⊕f(2)^χ =z(3)⊕f(3)(cid:17) 1 2 3 = ∑ P(cid:16)χ =z(1)⊕η^χ =z(2)⊕η(cid:48)^χ =z(3)⊕η(cid:48)(cid:48)(cid:17)× 1 2 3 η,η(cid:48),η(cid:48)(cid:48) P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 1 = ∑ P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 23L η,η(cid:48),η(cid:48)(cid:48) 1 = . 23L Thus,inthiscase,thehashvaluesare3-wiseindependent. Case B. Write f(1) = h(x(1))⊕χ , f(2) = h(x(2))⊕χ ⊕χ , f(3) = h(x(3))⊕χ . 1 2 3 3 Again,thevaluesχ ,χ ,χ aremutuallyindependent,andindependentfromthevalues 1 2 3 f(1),f(2),f(3). Wehave P(cid:16)h(x(1))=z(1)^h(x(2))=z(2)^h(x(3))=z(3)(cid:17) = P(cid:16)χ =z(1)⊕f(1))^χ ⊕χ =z(2)⊕f(2)^χ =z(3)⊕f(3)(cid:17) 1 2 3 3 = P(cid:16)χ =z(1)⊕f(1))^χ =z(2)⊕f(2)⊕z(3)⊕f(3)^χ =z(3)⊕f(3)(cid:17) 1 2 3 = ∑ P(cid:16)χ =z(1)⊕η^χ =z(2)⊕z(3)⊕η(cid:48)⊕η(cid:48)(cid:48)^χ =z(3)⊕η(cid:48)(cid:48)(cid:17)× 1 2 3 η,η(cid:48),η(cid:48)(cid:48) P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 1 = ∑ P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 23L η,η(cid:48),η(cid:48)(cid:48) 1 = . 23L 2Thevalues f(1),f(2),f(3)arenotnecessarilymutuallyindependent. Thisconcludestheproof. 5. RandomizedKarp-Rabinisnotindependent Oneofthemostcommonrecursivehashfunctionsiscommonlyassociatedwiththe Karp-Rabin string-matching algorithm [21]. Given an integer B, the hash value over the sequence of integers x1,x2,...,xn is ∑ni=1xiBn−i. A variation of the Karp-Rabin hashmethodis“HashingbyPower-of-2IntegerDivision”[17],whereh(x1,...,xn)= ∑ni=1xiBn−i mod2L. Inparticular,thehashcodemethodoftheJavaStringclassuses this approach, with L=32 and B=31 [22]. A widely used textbook [23, p. 157] recommendsasimilarInteger-DivisionhashfunctionforstringswithB=37. Since such Integer-Division hash functions are recursive, quickly computed, and widely used, it is interesting to seek a randomized version of them. Assume that h 1 isarandomhashfunctionoversymbolsuniformin[0,2L),thendefineh(x1,...,xn)= Bn−1h1(x1)+Bn−2h1(x2)+···+h1(xn)mod2L forsomefixedinteger B. Wechoose B=37(callingtheresultingrandomizedhash“ID37;”seeAlgorithm2).Ouralgorithm computeseachhashvalueintimeO(M(L)),whereM(L)isthecostofmultiplyingtwo L-bitintegers. (WeprecomputethevalueBn mod2L.) Inmanypracticalcases,Lbits can fit into a single machine word and the cost of multiplication can be considered constant. Ingeneral,M(L)isinO(LlogL2O(log∗L))[24]. Algorithm2TherecursiveID37family(RandomizedKarp-Rabin). Require: anL-bithashfunctionh overΣfromanindependenthashfamily 1 1: B←37 2: s←emptyFIFOstructure 3: x←0(L-bitinteger) 4: z←0(L-bitinteger) 5: foreachcharactercdo 6: appendctos 7: x←Bx−Bnz+h1(c)mod2L 8: iflength(s)=nthen 9: yieldx 10: removeoldestcharacteryfroms 11: z←h1(y) 12: endif 13: endfor TherandomizedInteger-Divisionfunctionsmappingn-gramsto[0,2L)arenotpair- wiseindependent. However,forBodd,theyareuniform. Proposition3 Randomized Integer-Division hashing with B odd is not uniform for n-grams,ifniseven. Otherwise,itisuniform,butnotpairwiseindependent. PROOF ForBodd,weseethatP(h(a2k)=0)>2−Lsinceh(a2k)=h1(a)(B0(1+B)+ B2(1+B)+···+B2k−2(1+B))mod2Landsince(1+B)iseven,wehaveP(h(a2k)= 0)≥P(h (x )=2L−1∨h (x )=0)=1/2L−1. Hence,forBoddandneven,wedonot 1 1 1 1 haveuniformity. For the rest of the result, we begin with n = 2 and B even. If x (cid:54)= x , then 1 2 P(h(x ,x ) = y) = P(Bh (x )+h (x ) = ymod2L) = ∑ P(h (x ) = y−Bzmod 1 2 1 1 1 2 z 1 2 2L)P(h (x )=z)=∑ P(h (x )=y−Bzmod2L)/2L=1/2L,whereasP(h(x ,x )= 1 1 z 1 2 1 1 y)=P((B+1)h (x )=ymod2L)=1/2Lsince(B+1)x=ymod2Lhasauniqueso- 1 1 lution x when B is even. Therefore h is uniform. This argument can be extended for anyvalueofnandfornodd,Beven. Toshowitisnotpairwiseindependent,firstsupposethatBisodd. Foranystring β of length n−2, consider n-grams w = βaa and w = βbb for distinct a,b ∈ Σ. 1 2 Then P(h(w )=h(w ))=P(B2h(β)+Bh (a)+h (a)=B2h(β)+Bh (b)+h (b)mod 1 2 1 1 1 1 2L) = P((1+B)(h (a)−h (b))mod2L = 0) ≥ P(h (a)−h (b) = 0)+P(h (a)− 1 1 1 1 1 h1(b)=2L−1).Becauseh1isindependent,P(h1(a)−h1(b)=0)=∑c∈[0,2L)P(h1(a)= c)P(h1(b) = c) = ∑c∈[0,2L)1/4L = 1/2L. Moreover, P(h1(a)−h1(b) = 2L−1) > 0. Thus, we have that P(h(w ) = h(w )) > 1/2L which contradicts pairwise inde- 1 2 pendence. Second, if B is even, a similar argument shows P(h(w ) = h(w )) > 3 4 1/2L, where w =βaa and w =βba. P(h(a,a)=h(b,a))=P(Bh (a)+h (a)= 3 4 1 1 Bh (b)+h (a)mod2L)=P(B(h (a)−h (b))mod2L=0)≥P(h (a)−h (b)=0)+ 1 1 1 1 1 1 P(h (a)−h (b)=2L−1)>1/2L. This argument can be extended for any value of B 1 1 andn. A weaker condition than pairwise independence is 2-universality: a family is 2- universalifP(h(x )=h(x ))≤1/2L[16].Asaconsequenceofthisproof,Randomized 1 2 Integer-Divisionisnoteven2-universal. These results also hold for any Integer-Division hash where the modulo is by an evennumber,notnecessarilyapowerof2. 6. GeneratinghashfamiliesfrompolynomialsoverGaloisfields ApracticalformofhashingusingthebinaryGaloisfieldGF(2)iscalled“Recursive Hashing by Polynomials” and has been attributed to Kubina by Cohen [17]. GF(2) containsonlytwovalues(1and0)withtheaddition(andhencesubtraction)definedby XOR,a+b=a⊕bandthemultiplicationbyAND,a×b=a∧b.GF(2)[x]isthevector spaceofallpolynomialswithcoefficientsfromGF(2).Anyintegerinbinaryform(e.g., c=1101) can thus be interpreted as an element of GF(2)[x] (e.g., c=x3+x2+1). If p(x)∈GF(2)[x], then GF(2)[x]/p(x) can be thought of as GF(2)[x] modulo p(x). As an example, if p(x)=x2, then GF(2)[x]/p(x) is the set of all linear polynomials. Forinstance,x3+x2+x+1=x+1modx2 since,inGF(2)[x],(x+1)+x2(x+1)= x3+x2+x+1. Asasummary,wecomputeoperationsoverGF(2)[x]/p(x)—where p(x)isofde- greeL—asfollows: • thepolynomial∑Li=−01qixiisrepresentedastheL-bitinteger∑Li=−01qi2i; • subtractionoradditionoftwopolynomialsistheXORoftheirL-bitintegers; Table2:SomeirreduciblepolynomialsoverGF(2)[x] degree polynomial 10 1+x3+x10 15 1+x+x15 20 1+x3+x20 25 1+x3+x25 30 1+x+x4+x6+x30 • multiplicationofapolynomial∑Li=0qixi bythemonomialxisrepresentedeither as ∑Li=−01qixi+1 if qL−1 =0 or as p(x)+∑Li=−01qixi+1 otherwise. In other words, if the value of the last bit is 1, we merely apply a binary left shift, otherwise, we apply a binary left shift immediately followed by an XOR with the integer representing p(x). Ineithercase,wegetanL-bitinteger. Hence,merelywiththeXORoperation,thebinaryleftshift,andawaytoevaluatethe valueofthelastbit,wecancomputeallnecessaryoperationsoverGF(2)[x]/p(x)using integers. Considerahashfunctionh overcharacterstakenfromsomeindependentfamily. 1 Interpreting h hash values as polynomials in GF(2)[x]/p(x), and with the condition 1 that degree(p(x))≥n, we define a hash function as h(a1,a2,···,an)=h1(a1)xn−1+ h1(a2)xn−2+···+h1(an). Itisrecursiveoverthesequenceh1(ai). Thecombinedhash canbecomputed byreusingprevioushashvalues: h(a ,a ,...,a )=xh(a ,a ,...,a )−h (a )xn+h (a ). 2 3 n+1 1 2 n 1 1 1 n+1 Depending on the choice of the polynomial p(x) we get different hashing schemes, includingGENERALandCYCLIC,whicharepresentedinthenexttwosections. 7. Recursivehashingbyirreduciblepolynomialsispairwiseindependent We can choose p(x) to be an irreducible polynomial of degree L in GF(2)[x]: an irreducible polynomial cannot be factored into nontrivial polynomials (see Table 2). TheresultinghashiscalledGENERAL(seeAlgorithm3). Themainbenefitofsetting p(x) to be an irreducible polynomial is that GF(2)[x]/p(x) is a field; in particular, it isimpossiblethat p (x)p (x)=0mod p(x)unlesseither p (x)=0or p (x)=0. The 1 2 1 2 fieldpropertyallowsustoprovethatthehashfunctionispairwiseindependent. Lemma1 GENERALispairwiseindependent. PROOF If p(x)isirreducible,thenanynon-zeroq(x)∈GF(2)[x]/p(x)hasaninverse, noted q−1(x) since GF(2)[x]/p(x) is a field. Interpret hash values as polynomials in GF(2)[x]/p(x). Firstly, we prove that GENERAL is uniform. In fact, we show a stronger re- sult: P(q1(x)h1(a1)+q2(x)h1(a2)+···+qn(x)h1(an)=y)=1/2L for any polyno- mials qi where at least one is different from zero. The result follows by induction Algorithm3TherecursiveGENERALfamily. Require: anL-bithashfunctionh overΣfromanindependenthashfamily; anirre- 1 duciblepolynomial pofdegreeLinGF(2)[x] 1: s←emptyFIFOstructure 2: x←0(L-bitinteger) 3: z←0(L-bitinteger) 4: foreachcharactercdo 5: appendctos 6: x←shift(x) 7: z←shiftn(z) 8: x←x⊕z⊕h1(c) 9: iflength(s)=nthen 10: yieldx 11: removeoldestcharacteryfroms 12: z←h1(y) 13: endif 14: endfor 1: functionshift 2: inputL-bitintegerx 3: shiftxleftby1bit,storingresultinanL+1-bitintegerx(cid:48) 4: ifleftmostbitofx(cid:48)is1then 5: x(cid:48)←x(cid:48)⊕p 6: endif 7: {leftmostbitofx(cid:48)isthusalways0} 8: returnrightmostLbitsofx(cid:48) on the number of non-zero polynomials: it is clearly true where there is a single non-zero polynomial qi(x), since qi(x)h1(ai)=y ⇐⇒ q−i 1(x)qi(x)h1(ai)=q−i 1(x)y. Suppose it is true up to k−1 non-zero polynomials and consider a case where we have k non-zero polynomials. Assume without loss of generality that q (x)(cid:54)=0, we 1 haveP(q1(x)h1(a1)+q2(x)h1(a2)+···+qn(x)h1(an)=y)=P(h1(a1)=q−11(x)(y− q2(x)h1(a2)−···−qn(x)h1(an))) = ∑y(cid:48)P(h1(a1) = q−11(x)(y−y(cid:48)))P(q2(x)h1(a2)+ ···+qn(x)h1(an)=y(cid:48))=∑y(cid:48) 21L21L = 21L by the induction argument. Hence the uni- formityresultisshown. Consider two distinct sequences a1,a2,...,an and a(cid:48)1,a(cid:48)2,...,a(cid:48)n. Write Ha = h(a1,a2,...,an) and Ha(cid:48) = h(a(cid:48)1,a(cid:48)2,...,a(cid:48)n). We have that P(Ha = y∧Ha(cid:48) = y(cid:48)) = P(Ha =y|Ha(cid:48) =y(cid:48))P(Ha(cid:48) =y(cid:48)). Hence, to prove pairwise independence, it suffices toshowthatP(Ha=y|Ha(cid:48) =y(cid:48))=1/2L. Suppose that ai =a(cid:48)j for some i,j; if not, the result follows since by the (full) independenceofthehashingfunctionh1,thevaluesHaandHa(cid:48) areindependent. Write q(x)=−(∑k|ak=aixn−k)(∑k|a(cid:48)k=a(cid:48)jxn−k)−1,thenHa+q(x)Ha(cid:48) isindependentfromai= a(cid:48)j (andh1(ai)=h1(a(cid:48)j)). In Ha+q(x)Ha(cid:48), only hashed values h1(ak) for ak (cid:54)= ai and h1(a(cid:48)k) for a(cid:48)k (cid:54)= a(cid:48)j remain: label them h1(b1),...,h1(bm). The result of the substitution can be written Ha+q(x)Ha(cid:48) =∑kqk(x)h1(bk) where qk(x) are polynomials in GF(2)[x]/p(x). All qk(x)arezeroifandonlyifHa+q(x)Ha(cid:48) =0forallvaluesofh1(a1),...,h1(an)and h1(a(cid:48)1),...,h1(a(cid:48)n)(butnoticethatthevalueh1(ai)=h1(a(cid:48)j)isirrelevant);inparticular, it must be true when h1(ak)=1 and h1(a(cid:48)k)=1 for all k, hence (xn+···+x+1)+ q(x)(xn...+x+1)=0⇒q(x)=−1. Thus,allqk(x)arezeroifandonlyifHa=Ha(cid:48) for all values of h1(a1),...,h1(an) and h1(a(cid:48)1),...,h1(a(cid:48)n) which only happens if the sequencesaanda(cid:48)areidentical. Hence,notallqk(x)arezero. Write Hy(cid:48),a(cid:48) =(∑k|a(cid:48)k=a(cid:48)jxn−k)−1(y(cid:48)−∑k|a(cid:48)k(cid:54)=a(cid:48)jxn−kh1(a(cid:48)k)). On the one hand, the condition Ha(cid:48) = y(cid:48) can be rewritten as h1(a(cid:48)j) = Hy(cid:48),a(cid:48). On the other hand, Ha+ q(x)Ha(cid:48) = y+q(x)y(cid:48) is independent from h1(a(cid:48)j) = h1(ai). Because P(h1(a(cid:48)j) = Hy(cid:48),a(cid:48)) = 1/2L irrespective of y(cid:48) and h1(a(cid:48)k) for k ∈ {k|a(cid:48)k (cid:54)= a(cid:48)j}, then P(h1(a(cid:48)j) = Hy(cid:48),a(cid:48)|Ha+q(x)Ha(cid:48)=y+q(x)y(cid:48))=P(h1(a(cid:48)j)=Hy(cid:48),a(cid:48))whichimpliesthath1(a(cid:48)j)=Hy(cid:48),a(cid:48) andHa+q(x)Ha(cid:48) =y+q(x)y(cid:48)areindependent. Hence,wehave P(Ha=y|Ha(cid:48) =y(cid:48)) = P(Ha+q(x)Ha(cid:48) =y+q(x)y(cid:48)|h1(a(cid:48)j)=Hy(cid:48),a(cid:48)) = P(Ha+q(x)Ha(cid:48) =y+q(x)y(cid:48)) = P(∑q (x)h (b )=y+q(x)y(cid:48)) k 1 k k and by the earlier uniformity result, this last probability is equal to 1/2L. This con- cludestheproof. 8. Tradingmemoryforspeed: RAM-BufferedGENERAL Unfortunately,GENERAL—ascomputedbyAlgorithm3—requiresO(nL)timeper n-gram. Indeed,shiftingavaluentimesinGF(2)[x]/p(x)requiresO(nL)time. How- ever, if we are willing to trade memory usage for speed, we can precompute these shifts. WecalltheresultingschemeRAM-BufferedGENERAL. Lemma2 Pickany p(x)inGF(2)[x]. Thedegreeof p(x)isL. Representelementsof GF(2)[x]/p(x)aspolynomialsofdegreeatmostL−1. GivenanyhinGF(2)[x]/p(x). wecancomputexnhinO(L)timegivenanO(L2n)-bitmemorybuffer. PROOF Write h as ∑Li=−01qixi. Divide h into two parts, h(1) =∑Li=−0n−1qixi and h(2) = ∑Li=−L1−nqixi,sothath=h(1)+h(2).Thenxnh=xnh(1)+xnh(2).Thefirstpart,xnh(1)isa polynomialofdegreeatmostL−1sincethedegreeofh(1)isatmostL−1−n. Hence, xnh(1) as an L-bit value is just qL−n−1qL−n−2...q00...0. which can be computed in time O(L). So, only the computation of xnh(2) is possibly more expensive than O(L) time,buth(2) hasonlyntermsasapolynomial(sincethefirstL−ntermsarealways zero). Hence, if we precompute xnh(2) for all 2n possible values of h(2), and store them in an array with O(L) time look-ups, we can compute xnh as an L-bit value in O(L)time. Whennislarge,thisprecomputationrequiresexcessivespaceandprecomputation time. Fortunately, we can trade back some speed for memory. Consider the proof of

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.