Iterative Universal Hash Function Generator for Minhashing

Fabrício Olivetti de França
Center of Mathematics, Computing and Cognition (CMCC), Universidade Federal do ABC (UFABC) – Santo André, SP, Brazil.
Email address: [email protected] (Fabrício Olivetti de França)

arXiv:1401.6124v1 [cs.LG] 23 Jan 2014

Abstract

Minhashing is a technique used to estimate the Jaccard Index between two sets by exploiting the probability of collision in a random permutation. In order to speed up the computation, a random permutation can be approximated by using a universal hash function such as the h_{a,b} function proposed by Carter and Wegman. A better estimate of the Jaccard Index can be achieved by using many of these hash functions, created at random. In this paper a new iterative procedure to generate a set of h_{a,b} functions is devised that eliminates the need for a list of random values and avoids the multiplication operation during the calculation. The properties of the generated hash functions remain those of a universal hash function family. This is possible due to the random nature of feature occurrence on sparse data sets. Results show that the uniformity of hashing the features is maintained while obtaining a speedup of up to 1.38 compared to the traditional approach.

Keywords:

1. Introduction

Many tasks in Machine Learning require the calculation of the similarity for every pair of objects in the data set. Some examples of such tasks are data clustering and nearest neighbors searching. The need for such a calculation may be demanding for large data sets with many features, leading to the so-called curse of dimensionality [1]. This occurs when the objects of a data set are described in a high dimension, typically with hundreds of features, adding one more dimension to the computational complexity, i.e., O(n) becomes O(n · d), where n is the number of objects and d the number of features. Also, a higher dimension may lead to a loss of precision when calculating the similarity between two objects, since the similarity may only be perceived on a small subset of features and thus be masked by all the others, which could be considered random noise.

One way to deal with such a problem is to transform the data into a low dimension representation while preserving enough information so that similar objects remain similar after the transformation. Some methods used to reduce the dimensionality of a data set are Feature Selection [2], Feature Extraction [3] and Probabilistic Dimension Reduction [4].

The Feature Selection approach tries to find the subset of the features that is most relevant to the task being performed. This can be done by removing redundant (highly correlated) variables and irrelevant variables that bring no information whatsoever. These are usually measured by Entropy, Information Gain and Correlation measures. Since this is a combinatorial problem, it is mostly solved with meta-heuristic approaches [5, 6].

Feature Extraction refers to the creation of a smaller feature set based on the (non-)linear combination of the original features. This is usually performed by using matrix decomposition techniques, creating a map between the original feature set and the transformed one. More recently there has been a focus on Neural Network Feature Extraction through Deep Learning techniques [7].

Probabilistic Dimension Reduction is performed by exploiting the probability of two objects being very similar. This is done by means of specially designed hash functions that have a high probability of collision when hashing objects with similarity above a certain threshold. One of these techniques is called Locality-Sensitive Hashing [8, 9], and it was successfully applied to different applications including text and image retrieval. The main advantage of this technique is that the computational complexity approximates O(n).

One such technique is called Minhashing [10], and it was specifically crafted to approximate the Jaccard Index between two sets. It is based on the fact that, given a random permutation of the feature set, the probability that the first features of two objects are the same is equal to the Jaccard Index between those two objects. In order to become computationally feasible, the random permutation is approximated by using a universal hash function.

Since a good estimation through probabilistic procedures requires a sample of considerable size, there is a need to automatically create universal hash functions at random. One such approach, devised by Carter and Wegman [11], requires that two random values be drawn from a uniform distribution for each function. Given that N hash functions are created, there will be a total of 2N random values. This leads to a requirement of an additional O(2N) space complexity and N multiplication operations.

The need for an optimization is justified since this procedure is often used on very large data sets [12] and with algorithms that require a very large number of those hash functions [13], where a small speedup might imply savings of hours of processing.

In this paper an iterative procedure is described for generating any number of hash functions without the need to generate random numbers. This is done by exploiting the randomness of the feature occurrence on sparse data sets. The speedup of this method is experimentally measured, as well as its uniformity properties. Section 2 explains how minhashing works and presents the general algorithm. Section 3 explains the proposed modification, describes the pseudo-algorithm and shows some initial results regarding the uniformity of this approach. The performance and properties of this modification are analysed and explored in the experimental setup of Section 4. Finally, Section 5 gives some concluding remarks.
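To make the cost of the random approach concrete before turning to minhashing itself, here is a minimal C++ sketch of the Carter–Wegman construction (an illustration, not the paper's original code; the name make_hash_family and the sampling range [1, P−1] for both a and b are our assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One Carter-Wegman universal hash function h(x) = (a*x + b) mod P.
// Assumes a*x fits in 64 bits, which holds for small primes such as
// the P = 7,757 used in the experiments later in the paper.
struct HashAB {
    std::uint64_t a, b, P;
    std::uint64_t operator()(std::uint64_t x) const { return (a * x + b) % P; }
};

// Build N functions: 2N uniform random draws in total, plus one
// multiplication per evaluation; these are exactly the costs the
// iterative procedure of Section 3 removes.
std::vector<HashAB> make_hash_family(std::size_t N, std::uint64_t P,
                                     std::mt19937_64& rng) {
    std::uniform_int_distribution<std::uint64_t> dist(1, P - 1);
    std::vector<HashAB> family;
    family.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        std::uint64_t a = dist(rng);  // first random value
        std::uint64_t b = dist(rng);  // second random value
        family.push_back({a, b, P});
    }
    return family;
}
```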
2. Min-wise Independent Permutations Locality Sensitive Hashing

When the features of the studied data set can be described as sets, as is the case for much semi-structured data (e.g., text documents), the similarity between two objects can be calculated through the Jaccard Index, or Jaccard similarity:

    J = |O_1 ∩ O_2| / |O_1 ∪ O_2|,    (1)

where O_1 and O_2 are the two objects being compared; it returns a value between 0 and 1, with the latter meaning the two objects are equal.

Let us illustrate this concept by using an example of comparing two text documents:

    d1 = "The cat ate the mouse"
    d2 = "The mouse ate the cheese"

One data structure used to represent these documents is an ordered set of the terms contained in a document, or the bag-of-words representation:

    d1 = {ate, cat, mouse, the}
    d2 = {ate, cheese, mouse, the}

The Jaccard similarity between these two sets will be:

    J(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2| = 3/5.    (2)

Since the intersection and union operations can be costly to compute, it can become very time consuming to perform pairwise similarity computation on a large data set with objects represented as a large set of features. As such, it is desirable to find a way to estimate the Jaccard Index and reduce the objects' representation to a smaller dimension.
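As a quick sanity check of Eq. (2), a minimal C++ sketch (ours, not part of the paper) computing the Jaccard index of the two example documents:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

// Jaccard index |A ∩ B| / |A ∪ B| of two bag-of-words sets, as in Eq. (1).
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    std::set<std::string> inter, uni;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(inter, inter.begin()));
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(uni, uni.begin()));
    return static_cast<double>(inter.size()) / uni.size();
}

int main() {
    std::set<std::string> d1 = {"ate", "cat", "mouse", "the"};
    std::set<std::string> d2 = {"ate", "cheese", "mouse", "the"};
    std::cout << jaccard(d1, d2) << "\n";  // prints 0.6, i.e., 3/5 of Eq. (2)
}
```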
The technique known as Min-wise independent permutations locality sensitive hashing (MinHash) [8] allows one to quickly estimate the similarity of two sets by approximating the Jaccard Index. It is based on the probability that, given a random permutation of two sets, the first element of both sets will be the same with a probability equal to their Jaccard Index.

Following the same example, suppose the ordered set of terms is shuffled following a permutation π and the documents are represented by:

    d1 = {the, mouse, cat, ate}
    d2 = {the, mouse, cheese, ate}

The probability that the first elements of these sets are the same is equal to the ratio between the number of common elements and the number of total distinct elements, or:

    P(d1_{π_0} = d2_{π_0}) = |d1 ∩ d2| / |d1 ∪ d2| = 3/5 = J(d1, d2).    (3)

This probability can be estimated by reshuffling the ordered set with N different permutations and counting the number N_equal of times that the first elements of both sets are equal:

    P(d1_{π_0} = d2_{π_0}) ≈ N_equal / N ≈ J(d1, d2).    (4)

So it is possible to estimate the Jaccard Index between two objects, described by a total of M features, by calculating N Minhashes, with N different permutations, for each document. Notice that the permutations must follow a uniform distribution.

Given that N < M, the cost of comparing two documents is reduced by a factor of N/M. Since the cost of generating a permutation is O(M), the total computational cost will be O(N · (M + D · M̄)), where D is the number of documents and M̄ is the average number of features per document. One way to avoid this additional cost is by using a universal hash function such as one of those proposed in [11]:

    h(x) = (a · x + b) mod P,    (5)

where a and b are randomly chosen with uniform distribution, and P is a large prime number. The variable x is the value to be hashed, i.e., a number associated with a given feature. The prime number should be at least as large as the number of features. This hash function will map each feature to an index in the range [0, P[. The random values will ensure that those indices generate a random permutation of the feature set.

Notice that with this function the application of MinHash is straightforward. For each object o_j, simply find, for each hash function h_i, the feature x that has the minimum value for h_i(x):

    mh_i(o_j) = argmin_x { h_i(x) | x ∈ o_j }.    (6)

Since the probability that the minimum indices of two objects are the same, given a hash function, also equals the Jaccard Index, the complexity with this approach becomes O(N · D · M̄), with usually M̄ ≪ M on sparse data sets.
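Combining Eqs. (4)–(6), a hedged C++ sketch of this random approach (our illustration; minhash_signature and estimate_jaccard are hypothetical names, HashAB is the (a·x + b) mod P function of Eq. (5), and features are assumed to be encoded as integer ids as described above):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// The universal hash function of Eq. (5).
struct HashAB {
    std::uint64_t a, b, P;
    std::uint64_t operator()(std::uint64_t x) const { return (a * x + b) % P; }
};

// Eq. (6): for every hash function, keep the minimum hash value over the
// object's features. The N-entry signature replaces the full feature set.
std::vector<std::uint64_t> minhash_signature(
        const std::vector<std::uint64_t>& features,
        const std::vector<HashAB>& family) {
    std::vector<std::uint64_t> sig(family.size(),
                                   std::numeric_limits<std::uint64_t>::max());
    for (std::size_t i = 0; i < family.size(); ++i)
        for (std::uint64_t x : features)
            sig[i] = std::min(sig[i], family[i](x));
    return sig;
}

// Eq. (4): the fraction of agreeing signature positions estimates J.
double estimate_jaccard(const std::vector<std::uint64_t>& s1,
                        const std::vector<std::uint64_t>& s2) {
    std::size_t equal = 0;
    for (std::size_t i = 0; i < s1.size(); ++i)
        if (s1[i] == s2[i]) ++equal;
    return static_cast<double>(equal) / s1.size();
}
```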
3. Iterative Procedure

By inspecting Eq. 5, the purpose of the random values a and b is to ensure uniformity of the permutations given an equal sequence of feature indices. In other words, if we have M features, in M different permutations each feature is expected to have the minimum hash value once.

In many real world applications the set of features describing each object is sparse (a small percentage of the full feature set) and presents a randomness following a probability distribution. The randomness of feature occurrence can be exploited to avoid the random values calculation. Let us first define the i-th hash function, of a sequence of N functions, as:

    h_i = ((a + i) · x + i · b) mod P.    (7)

Notice that this is still the same universal hash function as before, holding the same properties. If we subtract h_{i−1} from h_i in modulus P, we obtain:

    Δh = h_i − h_{i−1}
       = [((a + i) · x + i · b) mod P − ((a + i − 1) · x + (i − 1) · b) mod P]
       = (x + b) mod P,

when P > a, b.

So, given a value x to be hashed and the value of the first hash function h_0(x), we can obtain any number of hash functions by sequentially summing up Δh. This procedure is described in Alg. 1.

Algorithm 1: Iterative universal hash function generator.
    Data: the value x to be hashed, initial values for a and b, a large prime number P and the number N of hash functions.
    Result: vector H with the N hash values of x.
    H[0] = a · x mod P;
    Δh = (x + b) mod P;
    for i ← 1 to N − 1 do
        H[i] = (H[i − 1] + Δh) mod P;

The main difference between the random approach and the iterative one is that the latter avoids the multiplication operation. This reduces the computational complexity by about O(N · b · log b · log log b), although this varies depending on the multiplication algorithm used. Also, as stated before, the space complexity regarding the list of random numbers is reduced from O(2N) to O(2).

Notice that this iterative procedure can also be simplified to a function H(i) = (a · x + i · Δh) mod P, so we can still parallelize the calculation of each H[i] if required. Additionally, when the algorithm is run on a distributed framework (e.g., MapReduce), the overhead of sending a vector of random numbers of size 2N to all of the machines is eliminated. If not for a significant speedup, then for more concise and clearer code.

Since the rationale for the proof of universality of this hash function is the same as for the random approach, the next section provides some empirical experiments in order to test the validity of these claims. The random and the iterative approaches are compared with respect to the uniformity of the bucket distribution, the uniformity of choosing each feature independently through the Minhashing procedure, the Jaccard estimation error, and the computational time to generate a set of either hash functions.
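A direct C++ rendering of Alg. 1 (a minimal sketch under the paper's assumptions; the name iterative_hashes is ours) makes the single addition per hash value explicit:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Alg. 1: generate the N hash values of x iteratively. Since Eq. (7)
// gives h_i = (h_{i-1} + (x + b)) mod P, the loop needs no per-function
// random values and no multiplications: one addition and one modulo each.
// Assumes N >= 1 and that a * x fits in 64 bits.
std::vector<std::uint64_t> iterative_hashes(std::uint64_t x,
                                            std::uint64_t a, std::uint64_t b,
                                            std::uint64_t P, std::size_t N) {
    std::vector<std::uint64_t> H(N);
    H[0] = (a * x) % P;              // h_0, Eq. (7) with i = 0
    std::uint64_t dh = (x + b) % P;  // the constant increment Δh
    for (std::size_t i = 1; i < N; ++i)
        H[i] = (H[i - 1] + dh) % P;
    return H;
}
```

If parallel evaluation is preferred, each entry can instead be computed independently as (a * x + i * dh) % P, matching the simplified form H(i) noted above.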
4. Experimental Results

This section is divided into two subsections: the first tests the uniformity of the distribution of hashed values over m buckets and whether the probability of a value x being chosen as a Minhash is uniform; the second shows that the estimation error of the Jaccard Index is similar between both approaches, with a slight advantage for the iterative procedure, and, finally, reports the obtained speedup.

4.1. Uniformity of Distribution

The uniformity of a universal hash function states that, given a hash function with m buckets, the probability of a value x being assigned to a bucket by a random hash function h is proportional to 1/m. To verify whether the family of Iterative Hash functions has this property, 100 independent experiments were performed, each with a different value x randomly assigned in the range [0, P[, where P = 7,757 was the prime number chosen, to be hashed into one of 100 different buckets. To evaluate the uniformity, a χ-squared test was performed for each experiment with α = 0.05 and 1,000 hash functions. The choice of this number of hash functions is due to a recommendation that the expected count for each bucket should be 5 to 10 in order for the χ-squared test to return an accurate response.

Similarly, for each experiment 100 random key values were generated and the Minhash for each of the 1,000 hash functions was calculated. The uniformity of the probability that a given key will be chosen as the Minhash was also tested with the χ-squared test with α = 0.05. Table 1 reports the percentage of the 100 experiments where the χ-squared test pointed to a likely uniformity.

Table 1: Percentage of likely results from a χ-squared test for the uniformity of distribution of the Random and Iterative hash functions with 1,000 hashes and 100 buckets and features.

                       Random   Iterative
    Uniformity Test      92%       92%
    Minhashing Test      95%       93%

From these results we can see that both families of hash functions are likely to generate a uniform distribution over the buckets and a uniform distribution of choices for the Minhash. In order to simulate a more realistic scenario, where there can be thousands of possible features (e.g., text mining), the same tests were performed but with 5,000 buckets and features and 50,000 hash functions. The results are reported in Table 2.

Table 2: Percentage of likely results from a χ-squared test for the uniformity of distribution of the Random and Iterative hash functions with 50,000 hashes and 5,000 buckets and features.

                       Random   Iterative
    Uniformity Test     100%      100%
    Minhashing Test      93%      100%

As we can see, these results just confirm the previous experiment: both sets of hash functions likely have the uniformity property and belong to the universal family of hash functions.

4.2. Hash Function Performance

Next, the processing times of both approaches were compared in order to quantify the speedup obtained with the iterative hash function. To this end, 100 independent experiments were performed generating and applying 100,000 hash functions to a subset of 500 documents from the 20 Newsgroups data set [14] and 10,000 hash functions to a subset of 3,891 documents from the Classic data set [15]. The code was written in C++ with the Boost library and compiled with g++ 4.7.3 using just the optimization flag -O2, on an Intel i5 2.5 GHz using a single core. The operating system used was the Mint 15 Linux distribution, and the source code for all of the reported experiments can be found at https://github.com/folivetti/HBLCoClust/.

Table 3: Average and standard deviation of the time, in seconds, obtained for each experiment.

                       Random        Iterative
    20-Newsgroup    26.27 ± 1.59   18.94 ± 0.56
    Classic         12.61 ± 0.24   10.10 ± 0.20

In Table 3 the average and standard deviation of the time taken by every experiment are reported. It is possible to see that there is a speedup of 1.38, on average, for the 20-Newsgroup data set and of 1.25 for the Classic data set. Also, a paired t-test was performed with α = 0.05, which confirmed that the difference is statistically significant for both experiments. As the results point out, this is a significant improvement on the computational time that can chiefly benefit the computation of data sets with millions of objects. If, for example, an experiment using the traditional random approach takes 10 hours on a very large data set, with the iterative procedure we can expect about 7 hours of processing.

Finally, in order to see if the Jaccard coefficient estimations of both hash approaches are equivalent, the Minhashing estimation was performed on the same data set with each object being represented by 5, 10 and 15 hashes, and the Mean Absolute Error was calculated between each estimation and the real Jaccard Index.

Table 4: Mean Absolute Error of the estimations obtained by the Random and the Iterative approaches.

    # of hashes       Random              Iterative
    5             0.0455 ± 0.0368     0.0456 ± 0.0368
    10            0.0430 ± 0.0273     0.0421 ± 0.0255
    15            0.0386 ± 0.0231     0.0387 ± 0.0230

The results from Table 4 show a small difference in the mean absolute error obtained by each approach. It should be noticed that a t-test performed on each experiment revealed that the p-values for the experiments with 10 and 15 hashes were below 0.05, which means that, although just slightly different, the means are unlikely to be the same. The p-value obtained on the 5 hashes experiments was much higher than 0.05, though.
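For reference, the uniformity check of Section 4.1 amounts to a Pearson χ-squared goodness-of-fit test over the bucket counts. A minimal sketch (ours, not the paper's code; the critical value for 100 buckets, i.e., 99 degrees of freedom at α = 0.05, is hard-coded rather than taken from a statistics library):

```cpp
#include <cstddef>
#include <vector>

// Pearson chi-squared statistic of the bucket counts against a uniform
// expectation; counts[k] is how many hash values landed in bucket k.
bool looks_uniform(const std::vector<std::size_t>& counts, std::size_t total) {
    const double expected = static_cast<double>(total) / counts.size();
    double chi2 = 0.0;
    for (std::size_t c : counts) {
        const double diff = static_cast<double>(c) - expected;
        chi2 += diff * diff / expected;
    }
    // Approximate critical value for df = 99 (100 buckets) at alpha = 0.05.
    const double critical = 123.225;
    return chi2 < critical;  // true: uniformity hypothesis is not rejected
}
```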
5. Conclusion

This paper proposed an optimization of the random universal hash function generator commonly used to estimate the Jaccard coefficient between two sets. The proposal eliminates the need to generate two large lists of random values and the repetition of a multiplication operation throughout the calculation. This is done by changing the function creation to an iterative procedure where a given hash function h_i(x) depends on the previously generated function plus an increment.

As a first concern, it was tested whether this iterative procedure keeps the universality properties of the random hash functions. The experiments indicate that this hash function has uniformity of bucket allocation and of Minhashing distribution; as such, it can be used as a universal hash function.

Next, the speedup obtained with this procedure was measured, and the average gain was between 1.25 and 1.38 relative to the time measured with the random approach. Additionally, it was shown with another experiment that the estimation error regarding the Jaccard Index remained practically the same, with a tiny advantage for the iterative approach.

These results show that the iterative procedure can have a significant impact when mining very large data sets and when using hundreds of hash functions. And, since it is just a simple modification, it can be easily adapted into many existing source codes that already use such hash functions.

References

[1] J. H. Friedman, On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Mining and Knowledge Discovery 1 (1) (1997) 55–77.
[2] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, The Journal of Machine Learning Research 3 (2003) 1157–1182.
[3] I. Guyon, Feature Extraction: Foundations and Applications, Vol. 207, Springer, 2006.
[4] N. D. Lawrence, Dimensionality reduction the probabilistic way, ICML 2008 Tutorial.
[5] P. A. Castro, F. J. Von Zuben, Feature subset selection by means of a Bayesian artificial immune system, in: Hybrid Intelligent Systems, 2008. HIS'08. Eighth International Conference on, IEEE, 2008, pp. 561–566.
[6] S. M. Vieira, L. F. Mendonça, G. J. Farinha, J. M. Sousa, Metaheuristics for feature selection: application to sepsis outcome prediction, in: Evolutionary Computation (CEC), 2012 IEEE Congress on, IEEE, 2012, pp. 1–8.
[7] Y. Bengio, A. C. Courville, P. Vincent, Unsupervised feature learning and deep learning: A review and new perspectives, CoRR abs/1206.5538.
[8] S. Har-Peled, P. Indyk, R. Motwani, Approximate nearest neighbor: Towards removing the curse of dimensionality, Theory of Computing 8 (1) (2012) 321–350.
[9] A. Andoni, P. Indyk, H. L. Nguyen, I. Razenshteyn, Beyond locality-sensitive hashing, arXiv preprint arXiv:1306.1547.
[10] A. Broder, On the resemblance and containment of documents, in: Compression and Complexity of Sequences 1997. Proceedings, 1997, pp. 21–29. doi:10.1109/SEQUEN.1997.666900.
[11] J. Carter, M. N. Wegman, Universal classes of hash functions, Journal of Computer and System Sciences 18 (2) (1979) 143–154.
[12] R. Szmit, Locality sensitive hashing for similarity search using MapReduce on large scale data, in: Language Processing and Intelligent Information Systems, Springer, 2013, pp. 171–178.
[13] F. O. de França, Scalable overlapping co-clustering of word-document data, in: Machine Learning and Applications (ICMLA), 2012 11th International Conference on, Vol. 1, IEEE, 2012, pp. 464–467.
[14] K. Lang, Newsweeder: Learning to filter netnews, in: Proceedings of the Twelfth International Conference on Machine Learning, 1995, pp. 331–339.
[15] Classic3 and Classic4 data sets (Oct. 2013). URL ftp://ftp.cs.cornell.edu/pub/smart
