Recursive n-gram hashing is pairwise independent, at best DanielLemirea,∗,OwenKaserb aLICEF,Universite´duQue´beca`Montre´al(UQAM),100SherbrookeWest,Montreal,QC,H2X3P2Canada bDept.ofCSAS,UniversityofNewBrunswick,100TuckerParkRoad,SaintJohn,NB,Canada 9 0 0 2 Abstract g Manyapplicationsusesequencesofnconsecutivesymbols(n-grams). Hashingthese u n-grams can be a performance bottleneck. For more speed, recursive hash families A computehashvaluesbyupdatingpreviousvalues. Weprovethatrecursivehashfam- 9 ilies cannot be more than pairwise independent. While hashing by irreducible poly- 1 nomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing ] by cyclic polynomials pairwise independent by ignoring n−1 bits. Experimentally, B weshowthathashingbycyclicpolynomialsistwiceasfastashashingbyirreducible D polynomials.WealsoshowthatrandomizedKarp-Rabinhashfamiliesarenotpairwise . s independent. c [ Keywords: RollingHashing,Rabin-KarpHashing,HashingStrings 6 v 6 1. Introduction 7 6 Ann-gramisaconsecutivesequenceofnsymbolsfromanalphabetΣ. Ann-gram 4 hashfunctionhmapsn-gramstonumbersin[0,2L).Thesefunctionshaveseveralappli- . 5 cationsfromfull-textmatching[1–3],patternmatching[4],orlanguagemodels[5–11] 0 toplagiarismdetection[12]. 7 Toprovethatahashingalgorithmmustworkwell, wetypicallyneedhashvalues 0 : tosatisfysomestatisticalproperty. Indeed,ahashfunctionthatmapsalln-gramstoa v singleintegerwouldnotbeuseful. Yet,asinglehashfunctionisdeterministic: itmaps i X ann-gramtoasinglehashvalue. Thus,wemaybeabletochoosetheinputdatasothat r thehashvaluesarebiased. Therefore,werandomlypickafunctionfromafamilyH a offunctions[13]. SuchafamilyH isuniform(overL-bits)ifallhashvaluesareequiprobable. That is,consideringhselecteduniformlyatrandomfromH,wehaveP(h(x)=y)=1/2L foralln-gramsxandallhashvaluesy. Thisconditionisweak;thefamilyofconstant functions(h(x)=c)isuniform1. ∗Correspondingauthor.Tel.:00+1+514987-3000ext.2835;fax:00+1+514843-2160. Emailaddresses:[email protected](DanielLemire),[email protected](OwenKaser) 1 Weomitfamiliesuniformoveranarbitraryinterval[0,b)—notoftheform[0,2L). Indeed,several applications[14,15]requireuniformityoverL-bits. PreprintsubmittedtoElsevier December5,2009 Intuitively,wewouldwantthatifanadversaryknowsthehashvalueofonen-gram, itcannotdeduceanythingaboutthehashvalueofanothern-gram. Forexample,with the family of constant functions, once we know one hash value, we know them all. The family H is pairwise independent if the hash value of n-gram x is independent 1 from the hash value of any other n-gram x . That is, we have P(h(x )=y∧h(x )= 2 1 2 z)=P(h(x )=y)P(h(x )=z)=1/4L for all n-grams x , x , and all hash values y, 1 2 1 2 z with x (cid:54)=x . Pairwise independence implies uniformity. We refer to a particular 1 2 hashfunctionh∈H as“uniform”or“apairwiseindependenthashfunction”whenthe familyinquestioncanbeinferredfromthecontext. Moreover,theideaofpairwiseindependencecanbegeneralized: afamilyofhash functionsH isk-wiseindependentifgivendistinctx1,...,xk andgivenhselecteduni- formly at random from H, then P(h(x1)=y1∧···∧h(xk)=yk)=1/2kL. Note that k-wise independence implies k−1-wise independence and uniformity. (Fully) inde- pendentfamiliesarek-wiseindependentforarbitrarilylargek. Forapplications,non- independent families may fare as well as fully independent families if the entropy of thedatasourceissufficientlyhigh[16]. Ahashfunctionhisrecursive[17]—orrolling[18]—ifthereisafunctionF com- puting the hash value of the n-gram x2...xn+1 from the hash value of the preceding n-gram(x1...xn)andthevaluesofx1andxn+1. Thatis,wehave h(x ,...,x )=F(h(x ,...,x ),x ,x ). 2 n+1 1 n 1 n+1 Ideally,wecouldcomputefunctionFintimeO(L)andnot,forexample,intimeO(Ln). Themaincontributionsofthispaperare: • aproofthatrecursivehashingisnomorethanpairwiseindependent(§3); • aproofthatrandomizedKarp-Rabincanbeuniformbutneverpairwiseindepen- dent(§5); • aproofthathashingbyirreduciblepolynomialsispairwiseindependent(§7); • aproofthathashingbycyclicpolynomialsisnotevenuniform(§9); • aproofthathashingbycyclicpolynomialsispairwiseindependent—afterignor- ingn−1consecutivebits(§10). Weconcludewithanexperimentalsectionwhereweshowthathashingbycyclicpoly- nomials is faster than hashing by irreducible polynomials. Table 1 summarizes the algorithmspresented. 2. Trailing-zeroindependence Some randomized algorithms [14, 15] merely require that the number of trailing zeroes be independent. For example, to estimate the number of distinct n-grams in a large document without enumerating them, we merely have to compute maximal numbers of leading zeroes k among hash values [19]. Na¨ıvely, we may estimate that if a hash value with k leading zeroes is found, we have ≈2k distinct n-grams. Such Table1:Asummaryofthehashingfunctionpresentedandtheirproperties.ForGENERALandCYCLIC,we requireL≥n.TomakeCYCLICpairwiseindependent,weneedtodiscardsomebits—theresultingscheme isnotformallyrecursive.RandomizedKarp-Rabinisuniformundersomeconditions. name costpern-gram independence memoryuse non-recursive3-wise(§4) O(Ln) 3-wise O(nL|Σ|) RandomizedKarp-Rabin(§5) O(LlogL2O(log∗L)) uniform O(L|Σ|) GENERAL(§7) O(Ln) pairwise O(L|Σ|) RAM-BufferedGENERAL(§8) O(L) pairwise O(L|Σ|+L2n) CYCLIC(§9) O(L+n) pairwise(§10) O((L+n)|Σ|) estimatesmightbeusefulbecausethenumberofdistinctn-gramsgrowslargewithn: Shakespeare’sFirstFolio[20]hasover3milliondistinct15-grams. Formally, let zeros(x) return the number of trailing zeros (0,1,...,L) of x, where zeros(0)=L. We say h is k-wise trailing-zero independent if P(zeros(h(x ))≥ j ∧ 1 1 zeros(h(x2))≥ j2∧...∧zeros(h(xk))≥ jk)=2−j1−j2−···−jk,for ji=0,1,...,L. Ifhisk-wiseindependent, itisk-wisetrailing-zeroindependent. Theconverseis nottrue.Ifhisak-wiseindependentfunction,considerg◦hwheregmakeszeroallbits beforetherightmost1(e.g.,g(0101100)=0000100).Hashg◦hisk-wisetrailing-zero independentbutnotevenuniform(considerthatP(g=0001)=8P(g=1000)). 3. Recursivehashfunctionsarenomorethanpairwiseindependent Notonlyarerecursivehashfunctionslimitedtopairwiseindependence: theycan- notbe3-wisetrailing-zeroindependent. Proposition1 There is no 3-wise trailing-zero independent hashing function that is recursive. PROOF Considerthe (n+2)-gramanbb. Supposehisrecursiveand3-wisetrailing- zeroindependent,then (cid:16) ^ P zeros(h(a,...,a))≥L ^ (cid:17) zeros(h(a,...,a,b))≥L zeros(h(a,...,a,b,b))≥L (cid:16) ^ ^ (cid:17) = P h(a,...,a)=0 F(0,a,b)=0 F(0,a,b)=0 (cid:16) ^ (cid:17) = P h(a,...,a)=0 F(0,a,b)=0 (cid:16) ^ (cid:17) = P zeros(h(a,...,a))≥L zeros(h(a,...,a,b))≥L = 2−2L bytrailing-zeropairwiseindependence (cid:54)= 2−3L asrequiredbytrailing-zero3-wiseindependence. Hence,wehaveacontradictionandnosuchhexists. 4. Anon-recursive3-wiseindependenthashfunction Atrivialwaytogenerateanindependenthashistoassignarandomintegerin[0,2L) toeachnewvaluex. Unfortunately,thisrequiresasmuchprocessingandstorageasa completeindexingofallvalues. However,inamultidimensionalsettingthisapproachcanbeputtogooduse. Sup- pose that we have tuples in K1×K2×···×Kn such that |Ki| is small for all i. We canconstructindependenthashfunctionshi:Ki→[0,2L)foralliandcombinethem. The hash function h(x1,x2,...,xn)=h1(x1)⊕h2(x2)⊕···⊕hn(xn) is then 3-wise in- dependent(⊕isthe“exclusiveor”function,XOR). IntimeO(∑ni=1|Ki|),wecancon- structthehashfunctionbygenerating∑ni=1|Ki|randomnumbersandstoringthemina look-uptable. Withconstant-timelook-up,hashingann-gramthustakesO(Ln)time. Algorithm1isanapplicationofthisideaton-grams. Algorithm1The(non-recursive)3-wiseindependentfamily. Require: nL-bithashfunctionsh1,h1,...,hnoverΣfromanindependenthashfamily 1: s←emptyFIFOstructure 2: foreachcharactercdo 3: appendctos 4: iflength(s)=nthen 5: yieldh1(s1)⊕h2(s2)⊕...⊕hn(sn) {Theyieldstatementreturnsthevalue,withoutterminatingthealgorithm.} 6: removeoldestcharacterfroms 7: endif 8: endfor Thisnewfamilyisnot4-wiseindependentforn>1. Considerthen-gramsac,ad, bc, bd. The XOR of their four hash values is zero. However, the family is 3-wise independent. Proposition2 The family of hash functions h(x) = h (x )⊕h (x )⊕...⊕h (x ), 1 1 2 2 n n wheretheL-bithashfunctionsh ,...,h aretakenfromanindependenthashfamily,is 1 n 3-wiseindependent. PROOF Consider any 3 distinct n-grams: x(1) = x1(1)...xn(1), x(2) = x1(2)...xn(2), and x(3) =x1(3)...xn(3). Because the n-grams are distinct, at least one of two possibilities holds: CaseA Forsomei∈{1,...,n},thethreevaluesxi(1),xi(2),xi(3)aredistinct. Writeχj= hi(xi(j))for j=1,2,3. Forexample,considerthethree1-grams: a,b,c. CaseB (Uptoareorderingofthethreen-grams.)Therearetwovaluesi,j∈{1,...,n} suchthatx(1)isdistinctfromthetwoidenticalvaluesx(2),x(3),andsuchthatx(2) i i i j is distinct from the two identical values xi(1),xi(3). Write χ1 = hi(xi(1)), χ2 = hj(x(j2)),andχ3=hi(xi(3)). Forexample,considerthethree2-grams: ad,bc,bd. RecallthattheXORoperationisinvertible: a⊕b=cifandonlyifa=b⊕c. Weprove3-wiseindependenceforcasesAandB. CaseA. Write f(i)=h(x(i))⊕χi fori=1,2,3. Wehavethatthevaluesχ1,χ2,χ3 are mutuallyindependent,andtheyareindependentfromthevalues f(1),f(2),f(3)2: (cid:32) (cid:33) (cid:32) (cid:33) 3 3 3 3 P ^χ =y ∧^ f(i)=y(cid:48) =∏P(χ =y)P ^ f(i)=y(cid:48) i i i i i i i=1 i=1 i=1 i=1 forallvaluesyi,y(cid:48)i. Hence,wehave P(cid:16)h(x(1))=z(1)^h(x(2))=z(2)^h(x(3))=z(3)(cid:17) = P(cid:16)χ =z(1)⊕f(1))^χ =z(2)⊕f(2)^χ =z(3)⊕f(3)(cid:17) 1 2 3 = ∑ P(cid:16)χ =z(1)⊕η^χ =z(2)⊕η(cid:48)^χ =z(3)⊕η(cid:48)(cid:48)(cid:17)× 1 2 3 η,η(cid:48),η(cid:48)(cid:48) P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 1 = ∑ P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 23L η,η(cid:48),η(cid:48)(cid:48) 1 = . 23L Thus,inthiscase,thehashvaluesare3-wiseindependent. Case B. Write f(1) = h(x(1))⊕χ , f(2) = h(x(2))⊕χ ⊕χ , f(3) = h(x(3))⊕χ . 1 2 3 3 Again,thevaluesχ ,χ ,χ aremutuallyindependent,andindependentfromthevalues 1 2 3 f(1),f(2),f(3). Wehave P(cid:16)h(x(1))=z(1)^h(x(2))=z(2)^h(x(3))=z(3)(cid:17) = P(cid:16)χ =z(1)⊕f(1))^χ ⊕χ =z(2)⊕f(2)^χ =z(3)⊕f(3)(cid:17) 1 2 3 3 = P(cid:16)χ =z(1)⊕f(1))^χ =z(2)⊕f(2)⊕z(3)⊕f(3)^χ =z(3)⊕f(3)(cid:17) 1 2 3 = ∑ P(cid:16)χ =z(1)⊕η^χ =z(2)⊕z(3)⊕η(cid:48)⊕η(cid:48)(cid:48)^χ =z(3)⊕η(cid:48)(cid:48)(cid:17)× 1 2 3 η,η(cid:48),η(cid:48)(cid:48) P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 1 = ∑ P(f(1)=η∧f(2)=η(cid:48)∧f(3)=η(cid:48)(cid:48)) 23L η,η(cid:48),η(cid:48)(cid:48) 1 = . 23L 2Thevalues f(1),f(2),f(3)arenotnecessarilymutuallyindependent. Thisconcludestheproof. 5. RandomizedKarp-Rabinisnotindependent Oneofthemostcommonrecursivehashfunctionsiscommonlyassociatedwiththe Karp-Rabin string-matching algorithm [21]. Given an integer B, the hash value over the sequence of integers x1,x2,...,xn is ∑ni=1xiBn−i. A variation of the Karp-Rabin hashmethodis“HashingbyPower-of-2IntegerDivision”[17],whereh(x1,...,xn)= ∑ni=1xiBn−i mod2L. Inparticular,thehashcodemethodoftheJavaStringclassuses this approach, with L=32 and B=31 [22]. A widely used textbook [23, p. 157] recommendsasimilarInteger-DivisionhashfunctionforstringswithB=37. Since such Integer-Division hash functions are recursive, quickly computed, and widely used, it is interesting to seek a randomized version of them. Assume that h 1 isarandomhashfunctionoversymbolsuniformin[0,2L),thendefineh(x1,...,xn)= Bn−1h1(x1)+Bn−2h1(x2)+···+h1(xn)mod2L forsomefixedinteger B. Wechoose B=37(callingtheresultingrandomizedhash“ID37;”seeAlgorithm2).Ouralgorithm computeseachhashvalueintimeO(M(L)),whereM(L)isthecostofmultiplyingtwo L-bitintegers. (WeprecomputethevalueBn mod2L.) Inmanypracticalcases,Lbits can fit into a single machine word and the cost of multiplication can be considered constant. Ingeneral,M(L)isinO(LlogL2O(log∗L))[24]. Algorithm2TherecursiveID37family(RandomizedKarp-Rabin). Require: anL-bithashfunctionh overΣfromanindependenthashfamily 1 1: B←37 2: s←emptyFIFOstructure 3: x←0(L-bitinteger) 4: z←0(L-bitinteger) 5: foreachcharactercdo 6: appendctos 7: x←Bx−Bnz+h1(c)mod2L 8: iflength(s)=nthen 9: yieldx 10: removeoldestcharacteryfroms 11: z←h1(y) 12: endif 13: endfor TherandomizedInteger-Divisionfunctionsmappingn-gramsto[0,2L)arenotpair- wiseindependent. However,forBodd,theyareuniform. Proposition3 Randomized Integer-Division hashing with B odd is not uniform for n-grams,ifniseven. Otherwise,itisuniform,butnotpairwiseindependent. PROOF ForBodd,weseethatP(h(a2k)=0)>2−Lsinceh(a2k)=h1(a)(B0(1+B)+ B2(1+B)+···+B2k−2(1+B))mod2Landsince(1+B)iseven,wehaveP(h(a2k)= 0)≥P(h (x )=2L−1∨h (x )=0)=1/2L−1. Hence,forBoddandneven,wedonot 1 1 1 1 haveuniformity. For the rest of the result, we begin with n = 2 and B even. If x (cid:54)= x , then 1 2 P(h(x ,x ) = y) = P(Bh (x )+h (x ) = ymod2L) = ∑ P(h (x ) = y−Bzmod 1 2 1 1 1 2 z 1 2 2L)P(h (x )=z)=∑ P(h (x )=y−Bzmod2L)/2L=1/2L,whereasP(h(x ,x )= 1 1 z 1 2 1 1 y)=P((B+1)h (x )=ymod2L)=1/2Lsince(B+1)x=ymod2Lhasauniqueso- 1 1 lution x when B is even. Therefore h is uniform. This argument can be extended for anyvalueofnandfornodd,Beven. Toshowitisnotpairwiseindependent,firstsupposethatBisodd. Foranystring β of length n−2, consider n-grams w = βaa and w = βbb for distinct a,b ∈ Σ. 1 2 Then P(h(w )=h(w ))=P(B2h(β)+Bh (a)+h (a)=B2h(β)+Bh (b)+h (b)mod 1 2 1 1 1 1 2L) = P((1+B)(h (a)−h (b))mod2L = 0) ≥ P(h (a)−h (b) = 0)+P(h (a)− 1 1 1 1 1 h1(b)=2L−1).Becauseh1isindependent,P(h1(a)−h1(b)=0)=∑c∈[0,2L)P(h1(a)= c)P(h1(b) = c) = ∑c∈[0,2L)1/4L = 1/2L. Moreover, P(h1(a)−h1(b) = 2L−1) > 0. Thus, we have that P(h(w ) = h(w )) > 1/2L which contradicts pairwise inde- 1 2 pendence. Second, if B is even, a similar argument shows P(h(w ) = h(w )) > 3 4 1/2L, where w =βaa and w =βba. P(h(a,a)=h(b,a))=P(Bh (a)+h (a)= 3 4 1 1 Bh (b)+h (a)mod2L)=P(B(h (a)−h (b))mod2L=0)≥P(h (a)−h (b)=0)+ 1 1 1 1 1 1 P(h (a)−h (b)=2L−1)>1/2L. This argument can be extended for any value of B 1 1 andn. A weaker condition than pairwise independence is 2-universality: a family is 2- universalifP(h(x )=h(x ))≤1/2L[16].Asaconsequenceofthisproof,Randomized 1 2 Integer-Divisionisnoteven2-universal. These results also hold for any Integer-Division hash where the modulo is by an evennumber,notnecessarilyapowerof2. 6. GeneratinghashfamiliesfrompolynomialsoverGaloisfields ApracticalformofhashingusingthebinaryGaloisfieldGF(2)iscalled“Recursive Hashing by Polynomials” and has been attributed to Kubina by Cohen [17]. GF(2) containsonlytwovalues(1and0)withtheaddition(andhencesubtraction)definedby XOR,a+b=a⊕bandthemultiplicationbyAND,a×b=a∧b.GF(2)[x]isthevector spaceofallpolynomialswithcoefficientsfromGF(2).Anyintegerinbinaryform(e.g., c=1101) can thus be interpreted as an element of GF(2)[x] (e.g., c=x3+x2+1). If p(x)∈GF(2)[x], then GF(2)[x]/p(x) can be thought of as GF(2)[x] modulo p(x). As an example, if p(x)=x2, then GF(2)[x]/p(x) is the set of all linear polynomials. Forinstance,x3+x2+x+1=x+1modx2 since,inGF(2)[x],(x+1)+x2(x+1)= x3+x2+x+1. Asasummary,wecomputeoperationsoverGF(2)[x]/p(x)—where p(x)isofde- greeL—asfollows: • thepolynomial∑Li=−01qixiisrepresentedastheL-bitinteger∑Li=−01qi2i; • subtractionoradditionoftwopolynomialsistheXORoftheirL-bitintegers; Table2:SomeirreduciblepolynomialsoverGF(2)[x] degree polynomial 10 1+x3+x10 15 1+x+x15 20 1+x3+x20 25 1+x3+x25 30 1+x+x4+x6+x30 • multiplicationofapolynomial∑Li=0qixi bythemonomialxisrepresentedeither as ∑Li=−01qixi+1 if qL−1 =0 or as p(x)+∑Li=−01qixi+1 otherwise. In other words, if the value of the last bit is 1, we merely apply a binary left shift, otherwise, we apply a binary left shift immediately followed by an XOR with the integer representing p(x). Ineithercase,wegetanL-bitinteger. Hence,merelywiththeXORoperation,thebinaryleftshift,andawaytoevaluatethe valueofthelastbit,wecancomputeallnecessaryoperationsoverGF(2)[x]/p(x)using integers. Considerahashfunctionh overcharacterstakenfromsomeindependentfamily. 1 Interpreting h hash values as polynomials in GF(2)[x]/p(x), and with the condition 1 that degree(p(x))≥n, we define a hash function as h(a1,a2,···,an)=h1(a1)xn−1+ h1(a2)xn−2+···+h1(an). Itisrecursiveoverthesequenceh1(ai). Thecombinedhash canbecomputed byreusingprevioushashvalues: h(a ,a ,...,a )=xh(a ,a ,...,a )−h (a )xn+h (a ). 2 3 n+1 1 2 n 1 1 1 n+1 Depending on the choice of the polynomial p(x) we get different hashing schemes, includingGENERALandCYCLIC,whicharepresentedinthenexttwosections. 7. Recursivehashingbyirreduciblepolynomialsispairwiseindependent We can choose p(x) to be an irreducible polynomial of degree L in GF(2)[x]: an irreducible polynomial cannot be factored into nontrivial polynomials (see Table 2). TheresultinghashiscalledGENERAL(seeAlgorithm3). Themainbenefitofsetting p(x) to be an irreducible polynomial is that GF(2)[x]/p(x) is a field; in particular, it isimpossiblethat p (x)p (x)=0mod p(x)unlesseither p (x)=0or p (x)=0. The 1 2 1 2 fieldpropertyallowsustoprovethatthehashfunctionispairwiseindependent. Lemma1 GENERALispairwiseindependent. PROOF If p(x)isirreducible,thenanynon-zeroq(x)∈GF(2)[x]/p(x)hasaninverse, noted q−1(x) since GF(2)[x]/p(x) is a field. Interpret hash values as polynomials in GF(2)[x]/p(x). Firstly, we prove that GENERAL is uniform. In fact, we show a stronger re- sult: P(q1(x)h1(a1)+q2(x)h1(a2)+···+qn(x)h1(an)=y)=1/2L for any polyno- mials qi where at least one is different from zero. The result follows by induction Algorithm3TherecursiveGENERALfamily. Require: anL-bithashfunctionh overΣfromanindependenthashfamily; anirre- 1 duciblepolynomial pofdegreeLinGF(2)[x] 1: s←emptyFIFOstructure 2: x←0(L-bitinteger) 3: z←0(L-bitinteger) 4: foreachcharactercdo 5: appendctos 6: x←shift(x) 7: z←shiftn(z) 8: x←x⊕z⊕h1(c) 9: iflength(s)=nthen 10: yieldx 11: removeoldestcharacteryfroms 12: z←h1(y) 13: endif 14: endfor 1: functionshift 2: inputL-bitintegerx 3: shiftxleftby1bit,storingresultinanL+1-bitintegerx(cid:48) 4: ifleftmostbitofx(cid:48)is1then 5: x(cid:48)←x(cid:48)⊕p 6: endif 7: {leftmostbitofx(cid:48)isthusalways0} 8: returnrightmostLbitsofx(cid:48) on the number of non-zero polynomials: it is clearly true where there is a single non-zero polynomial qi(x), since qi(x)h1(ai)=y ⇐⇒ q−i 1(x)qi(x)h1(ai)=q−i 1(x)y. Suppose it is true up to k−1 non-zero polynomials and consider a case where we have k non-zero polynomials. Assume without loss of generality that q (x)(cid:54)=0, we 1 haveP(q1(x)h1(a1)+q2(x)h1(a2)+···+qn(x)h1(an)=y)=P(h1(a1)=q−11(x)(y− q2(x)h1(a2)−···−qn(x)h1(an))) = ∑y(cid:48)P(h1(a1) = q−11(x)(y−y(cid:48)))P(q2(x)h1(a2)+ ···+qn(x)h1(an)=y(cid:48))=∑y(cid:48) 21L21L = 21L by the induction argument. Hence the uni- formityresultisshown. Consider two distinct sequences a1,a2,...,an and a(cid:48)1,a(cid:48)2,...,a(cid:48)n. Write Ha = h(a1,a2,...,an) and Ha(cid:48) = h(a(cid:48)1,a(cid:48)2,...,a(cid:48)n). We have that P(Ha = y∧Ha(cid:48) = y(cid:48)) = P(Ha =y|Ha(cid:48) =y(cid:48))P(Ha(cid:48) =y(cid:48)). Hence, to prove pairwise independence, it suffices toshowthatP(Ha=y|Ha(cid:48) =y(cid:48))=1/2L. Suppose that ai =a(cid:48)j for some i,j; if not, the result follows since by the (full) independenceofthehashingfunctionh1,thevaluesHaandHa(cid:48) areindependent. Write q(x)=−(∑k|ak=aixn−k)(∑k|a(cid:48)k=a(cid:48)jxn−k)−1,thenHa+q(x)Ha(cid:48) isindependentfromai= a(cid:48)j (andh1(ai)=h1(a(cid:48)j)). In Ha+q(x)Ha(cid:48), only hashed values h1(ak) for ak (cid:54)= ai and h1(a(cid:48)k) for a(cid:48)k (cid:54)= a(cid:48)j remain: label them h1(b1),...,h1(bm). The result of the substitution can be written Ha+q(x)Ha(cid:48) =∑kqk(x)h1(bk) where qk(x) are polynomials in GF(2)[x]/p(x). All qk(x)arezeroifandonlyifHa+q(x)Ha(cid:48) =0forallvaluesofh1(a1),...,h1(an)and h1(a(cid:48)1),...,h1(a(cid:48)n)(butnoticethatthevalueh1(ai)=h1(a(cid:48)j)isirrelevant);inparticular, it must be true when h1(ak)=1 and h1(a(cid:48)k)=1 for all k, hence (xn+···+x+1)+ q(x)(xn...+x+1)=0⇒q(x)=−1. Thus,allqk(x)arezeroifandonlyifHa=Ha(cid:48) for all values of h1(a1),...,h1(an) and h1(a(cid:48)1),...,h1(a(cid:48)n) which only happens if the sequencesaanda(cid:48)areidentical. Hence,notallqk(x)arezero. Write Hy(cid:48),a(cid:48) =(∑k|a(cid:48)k=a(cid:48)jxn−k)−1(y(cid:48)−∑k|a(cid:48)k(cid:54)=a(cid:48)jxn−kh1(a(cid:48)k)). On the one hand, the condition Ha(cid:48) = y(cid:48) can be rewritten as h1(a(cid:48)j) = Hy(cid:48),a(cid:48). On the other hand, Ha+ q(x)Ha(cid:48) = y+q(x)y(cid:48) is independent from h1(a(cid:48)j) = h1(ai). Because P(h1(a(cid:48)j) = Hy(cid:48),a(cid:48)) = 1/2L irrespective of y(cid:48) and h1(a(cid:48)k) for k ∈ {k|a(cid:48)k (cid:54)= a(cid:48)j}, then P(h1(a(cid:48)j) = Hy(cid:48),a(cid:48)|Ha+q(x)Ha(cid:48)=y+q(x)y(cid:48))=P(h1(a(cid:48)j)=Hy(cid:48),a(cid:48))whichimpliesthath1(a(cid:48)j)=Hy(cid:48),a(cid:48) andHa+q(x)Ha(cid:48) =y+q(x)y(cid:48)areindependent. Hence,wehave P(Ha=y|Ha(cid:48) =y(cid:48)) = P(Ha+q(x)Ha(cid:48) =y+q(x)y(cid:48)|h1(a(cid:48)j)=Hy(cid:48),a(cid:48)) = P(Ha+q(x)Ha(cid:48) =y+q(x)y(cid:48)) = P(∑q (x)h (b )=y+q(x)y(cid:48)) k 1 k k and by the earlier uniformity result, this last probability is equal to 1/2L. This con- cludestheproof. 8. Tradingmemoryforspeed: RAM-BufferedGENERAL Unfortunately,GENERAL—ascomputedbyAlgorithm3—requiresO(nL)timeper n-gram. Indeed,shiftingavaluentimesinGF(2)[x]/p(x)requiresO(nL)time. How- ever, if we are willing to trade memory usage for speed, we can precompute these shifts. WecalltheresultingschemeRAM-BufferedGENERAL. Lemma2 Pickany p(x)inGF(2)[x]. Thedegreeof p(x)isL. Representelementsof GF(2)[x]/p(x)aspolynomialsofdegreeatmostL−1. GivenanyhinGF(2)[x]/p(x). wecancomputexnhinO(L)timegivenanO(L2n)-bitmemorybuffer. PROOF Write h as ∑Li=−01qixi. Divide h into two parts, h(1) =∑Li=−0n−1qixi and h(2) = ∑Li=−L1−nqixi,sothath=h(1)+h(2).Thenxnh=xnh(1)+xnh(2).Thefirstpart,xnh(1)isa polynomialofdegreeatmostL−1sincethedegreeofh(1)isatmostL−1−n. Hence, xnh(1) as an L-bit value is just qL−n−1qL−n−2...q00...0. which can be computed in time O(L). So, only the computation of xnh(2) is possibly more expensive than O(L) time,buth(2) hasonlyntermsasapolynomial(sincethefirstL−ntermsarealways zero). Hence, if we precompute xnh(2) for all 2n possible values of h(2), and store them in an array with O(L) time look-ups, we can compute xnh as an L-bit value in O(L)time. Whennislarge,thisprecomputationrequiresexcessivespaceandprecomputation time. Fortunately, we can trade back some speed for memory. Consider the proof of