Parallel mining of time-faded heavy hitters

Massimo Cafaro, Marco Pulimeno, Italo Epicoco
University of Salento, Lecce, Italy
arXiv:1701.03004v1 [cs.DS] 11 Jan 2017

Abstract

We present PFDCMSS, a novel message-passing based parallel algorithm for mining time-faded heavy hitters. The algorithm is a parallel version of the recently published FDCMSS sequential algorithm. We formally prove its correctness by showing that the underlying data structure, a sketch augmented with a Space Saving stream summary holding exactly two counters, is mergeable. Whilst mergeability of traditional sketches derives immediately from theory, we show that merging our augmented sketch is non-trivial. Nonetheless, the resulting parallel algorithm is fast and simple to implement. To the best of our knowledge, PFDCMSS is the first parallel algorithm solving the problem of mining time-faded heavy hitters on message-passing parallel architectures. Extensive experimental results confirm that PFDCMSS retains the extreme accuracy and error bound provided by FDCMSS whilst providing excellent parallel scalability.

Keywords: message-passing, heavy hitters, time fading model, sketches

1. Introduction

In this paper we deal with the problem of mining in parallel time-faded heavy hitters (also called frequent items), and we present PFDCMSS, a novel message-passing based parallel algorithm which is a parallel version of the recently published FDCMSS sequential algorithm [4].

Mining of heavy hitters in a data stream has been thoroughly studied, and the problem is regarded as one of the most important in the streaming algorithms literature. Depending on the particular application, the problem is reported in the literature as hot list analysis [19], market basket analysis [2] and iceberg query [17], [1].

Even though there are many possible applications, we recall here some of the most important contexts to which the problem has been successfully applied: network traffic analysis [14], [16], [27], analysis of web logs [7], computational and theoretical linguistics [18].
All of the algorithms for detecting heavy hitters can be classified as being either counter or sketch based, the difference being that counter-based algorithms rely on a set of counters which are used to keep track of stream items, whilst sketch-based algorithms monitor the data stream by using a sketch data structure, often a bi-dimensional array containing a counter in each cell. Stream items are mapped by hash functions to corresponding cells in the sketch. The former algorithms (counter-based) are deterministic, whilst the latter (sketch-based) are probabilistic.

Regarding counter-based algorithms, the first sequential algorithm was designed by Misra and Gries [26]. Their algorithm was rediscovered, independently, about twenty years later by Demaine et al. [14] (this algorithm is known in the literature as the Frequent algorithm) and Karp et al. [22]. Among the developed counter-based algorithms we recall here Sticky Sampling and Lossy Counting [24], and Space Saving [25]. Sketch-based solutions include CountSketch [7], Group Test [12], Count-Min [11] and hCount [21].

Relevant parallel algorithms include [6], [3] and [5], which are message-passing based parallel versions of the Frequent and Space Saving algorithms. Shared-memory algorithms have been designed as well, including a parallel version of Frequent [31], a parallel version of Lossy Counting [30], and parallel versions of Space Saving [28], [13]. Additional shared-memory parallel algorithms for heavy hitters were recently proposed in [29]. Finally, accelerator-based algorithms exploiting a GPU (Graphics Processing Unit) include [20] and [15]. Regarding related work, i.e., parallel algorithms specifically designed to solve the problem of mining time-faded heavy hitters, we are not aware of any other algorithm: to the best of our knowledge, ours is the first parallel algorithm solving the problem on message-passing parallel architectures.

* Corresponding author. Email addresses: [email protected] (Massimo Cafaro), [email protected] (Marco Pulimeno), [email protected] (Italo Epicoco). Preprint submitted to Elsevier, January 12, 2017.
In this paper, we are concerned with the problem of detecting in parallel heavy hitters in a stream with the additional constraint that recent items must be weighted more than former items. The underlying assumption is that, in some applications, recent data is certainly more useful and valuable than older, stale data. Therefore, each item in the stream has an associated timestamp that will be used to determine its weight. In practice, instead of estimating items' frequencies, we are required to estimate items' decayed frequencies.

This paper is organized as follows. We recall in Section 2 preliminary definitions and concepts that will be used in the rest of the manuscript. We present in Section 3 our PFDCMSS algorithm and formally prove in Section 4 its correctness. Next, we provide extensive experimental results in Section 5, showing that PFDCMSS retains the extreme accuracy and error bound provided by the sequential FDCMSS whilst providing excellent parallel scalability. Finally, we draw our conclusions in Section 6.

2. Preliminary definitions

In this Section we introduce preliminary definitions and the notation used throughout the paper. We deal with an input data stream σ consisting of a sequence of n items drawn from a universe U; without loss of generality, let m be the number of distinct items in σ, i.e., let U = {1, 2, ..., m}, which we will also denote as [m]. Let f_i be the frequency of the item i ∈ U (i.e., its number of occurrences in σ), and denote the frequency vector by f = (f_1, ..., f_m). Moreover, let 0 < φ < 1 be a support threshold, 0 < ε < 1 a tolerance such that ε < φ, and denote the 1-norm of f (which represents the total number of occurrences of all of the stream items) by ||f||_1.

In this paper, we are concerned with the problem of detecting in parallel frequent items in a stream with the additional constraint that recent items must be weighted more than former items. The underlying assumption is that, in some applications, recent data is certainly more useful and valuable than older, stale data.
Therefore, each item in the stream has an associated timestamp that will be used to determine its weight. In practice, instead of estimating frequencies, we are required to estimate decayed frequencies. Two different models have been proposed in the literature: the sliding window and the time fading model. PFDCMSS works in the latter model. Furthermore, even though the basic ideas underlying the algorithm are also appropriate for an online distributed setting, here we are assuming that the entire dataset is available for offline processing.

The time fading model [23], [9], [8] does not use a window sliding over time; freshness of more recent items is instead emphasized by fading the frequency count of older items. This is achieved by computing the item's decayed frequency through the use of a decay function that assigns greater weight to more recent occurrences of an item than to older ones: the older an occurrence is, the lower its decayed weight.

Definition 1. Let w(t_i, t) be a decay function which computes the decayed weight at time t for the occurrence of item i arrived at time t_i. A decay function must satisfy the following properties:

1. w(t_i, t) = 1 when t_i = t, and 0 ≤ w(t_i, t) ≤ 1 for all t > t_i;
2. w is a monotone non-increasing function as time t increases, i.e., t′ ≥ t ⟹ w(t_i, t′) ≤ w(t_i, t).

Related work has mostly exploited backward decay functions, in which the weight of an item is a function of its age a, where the age at time t > t_i is simply a = t − t_i. In this case, w(t_i, t) is given by w(t_i, t) = h(a)/h(0) = h(t − t_i)/h(0), where h is a positive monotone non-increasing function. The term backward decay stems from the aim of measuring from the current time back to the item's timestamp. Prior algorithms and applications have been using backward exponential decay functions such as h(a) = e^(−λa), with λ > 0 as decaying factor.

In our algorithm, we use instead a forward decay function, defined as follows (see [10] for a detailed description of the forward decay approach).
Under forward decay, the weight of an item is computed on the amount of time between the arrival of an item and a fixed point L, called the landmark time, which, by convention, is some time earlier than the timestamps of all of the items. The idea is to look forward in time from the landmark to see an item, instead of looking backward from the current time.

Definition 2. Given a positive monotone non-decreasing function g, and a landmark time L, the forward decayed weight of an item i with arrival time t_i > L measured at time t ≥ t_i is given by w(t_i, t) = g(t_i − L) / g(t − L).

The denominator is used to normalize the decayed weight so that w(t_i, t) is always less than or equal to 1, as requested by Definition 1.

Definition 3. The decayed frequency of an item v in the input stream σ, computed at time t, is given by the sum of the decayed weights of all the occurrences of v in σ: f_v(t) = ∑_{i: v_i = v} w(t_i, t).

Definition 4. The decayed count at time t, C(t), of a stream σ of n items is the sum of the decayed weights of all the items occurring in the stream: C(t) = ∑_{i=1}^{n} w(t_i, t).

The Approximate Time-Faded Heavy Hitters (ATFHH) problem is formally stated as follows.

Problem 1. Approximate Time-Faded Heavy Hitters. Given a stream σ of items with an associated timestamp, a threshold 0 < φ < 1 and a tolerance 0 < ε < 1 such that ε < φ, and letting g be a decay function used to determine the decayed frequencies and t be the query time, return the set of items F, so that:

• F contains all of the items v with decayed frequency at time t f_v(t) > φC(t) (decayed frequent items);
• F does not contain any item v such that f_v(t) ≤ (φ − ε)C(t).

In the following, when clear from the context, the query time shall be considered an implicit parameter, so we write f_v and C instead of f_v(t) and C(t). The algorithm presented makes use of a Count-Min sketch data structure augmented by a Space Saving summary associated to each sketch cell. In the following, we recall the main properties of the Count-Min and the Space Saving algorithms in the case of non-decaying frequencies, but the same properties also hold in a time-fading context.
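As an illustration of Definitions 2-4, the following sketch computes forward decayed weights, decayed frequencies and the decayed count on a toy stream. The quadratic decay function g(n) = n² and the landmark L = 0 are our own illustrative choices, not prescribed by the paper.

```python
def g(n):
    """Positive monotone non-decreasing decay function (assumed: g(n) = n^2)."""
    return n * n

def forward_weight(t_i, t, L=0.0):
    """Forward decayed weight w(t_i, t) = g(t_i - L) / g(t - L) (Definition 2)."""
    return g(t_i - L) / g(t - L)

def decayed_frequency(stream, v, t, L=0.0):
    """f_v(t): sum of the decayed weights of all occurrences of v (Definition 3)."""
    return sum(forward_weight(t_i, t, L) for item, t_i in stream if item == v)

def decayed_count(stream, t, L=0.0):
    """C(t): sum of the decayed weights of all items in the stream (Definition 4)."""
    return sum(forward_weight(t_i, t, L) for _, t_i in stream)

# A toy stream of (item, timestamp) pairs, queried at time t = 4.
stream = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("a", 4.0)]
t = 4.0
# Every weight lies in [0, 1] and the newest occurrence has weight 1 (Definition 1).
assert forward_weight(4.0, 4.0) == 1.0
assert all(0.0 <= forward_weight(ti, t) <= 1.0 for _, ti in stream)
# The per-item decayed frequencies sum to the decayed count, by construction.
total = decayed_frequency(stream, "a", t) + decayed_frequency(stream, "b", t)
assert abs(total - decayed_count(stream, t)) < 1e-12
```

Note how the most recent occurrence of "a" contributes weight 1, while the occurrence at time 1 contributes only g(1)/g(4) = 1/16.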
Count-Min is based on a sketch whose dimensions are derived from the input parameters ε, the error, and δ, the probability of failure. In particular, for Count-Min d = ⌈ln(1/δ)⌉ is the number of rows in the sketch and w = ⌈e/ε⌉ is the number of columns. Every cell in the sketch is a counter, which is updated by hash functions. By using this data structure, the algorithm solves with probability greater than or equal to 1 − δ the frequency estimation problem for arbitrary items. The algorithm may also be extended to solve the approximate frequent items problem as well, by using an additional heap data structure which is updated each time a cell is updated. Since in Count-Min the frequencies stored in the cells overestimate the true frequencies, a point query for an arbitrary item simply inspects all of the d cells to which the item is mapped by the corresponding hash functions and returns the minimum of those d counters.

Space Saving is a counter-based algorithm solving the heavy hitters problem. It makes use of a stream summary data structure composed of a given number of counters k ≪ n, n being the length of the stream. Each counter monitors an item in the stream and tracks its frequency. A substitution strategy is used when the algorithm processes an item not already monitored and all of the counters are occupied.

Let σ be the input stream and denote by S the summary data structure of k counters used by the Space Saving algorithm. Moreover, denote by |S| the sum of the counters in S, by f_v the exact frequency of an item v and by f̂_v its estimated frequency, and let f̂^min be the minimum frequency in S. If there exists at least one counter not monitoring any item, f̂^min is zero. Finally, denote by f = (f_1, ..., f_m) the frequency vector. The following relations hold (as shown in [25]):

|S| = ||f||_1,  (1)

f̂_v − f̂^min ≤ f_v ≤ f̂_v,  v ∈ S,  (2)

f_v ≤ f̂^min,  v ∉ S,  (3)

f̂^min ≤ ⌊||f||_1 / k⌋.  (4)

Therefore, it holds that

f̂_v − f_v ≤ f̂^min ≤ ⌊||f||_1 / k⌋,  v ∈ U.  (5)
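To make the substitution strategy and the bounds (1)-(5) concrete, here is a minimal Space Saving sketch in Python. It is our own illustrative rendering with plain (non-decayed) unit counts; names and the toy stream are ours.

```python
from collections import Counter

def space_saving(stream, k):
    """Minimal Space Saving summary: at most k counters, {item: estimated count}."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1          # already monitored: increment
        elif len(counters) < k:
            counters[item] = 1           # a free counter is available
        else:
            # Substitution: evict the item with minimum count; the new item
            # inherits that count, incremented by one.
            victim = fmin = min(counters, key=counters.get), None
            victim = min(counters, key=counters.get)
            fmin = counters.pop(victim)
            counters[item] = fmin + 1
    return counters

stream = list("abracadabraaa")
k = 2
summary = space_saving(stream, k)
exact = Counter(stream)
fmin = min(summary.values()) if len(summary) == k else 0
# Property (1): the counters sum to the stream length ||f||_1.
assert sum(summary.values()) == len(stream)
# Property (2): f_hat - f_hat_min <= f <= f_hat for every monitored item.
assert all(fhat - fmin <= exact[v] <= fhat for v, fhat in summary.items())
# Properties (3) and (4): unmonitored items have f <= f_hat_min <= floor(n/k).
assert all(exact[v] <= fmin for v in exact if v not in summary)
assert fmin <= len(stream) // k
```

Since "a" occurs 7 > ⌊13/2⌋ times in the toy stream, property (5) guarantees it ends up monitored.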
3. The algorithm

In this section, we start by recalling our sequential algorithm FDCMSS [4]. The key data structure is an augmented Count-Min sketch D, whose dimensions d (rows) and w (columns) are derived from the input parameters ε, the error, and δ, the probability of failure. Whilst every cell in an ordinary CM sketch contains a counter used for frequency estimation, in our case a cell holds a Space Saving stream summary with exactly two counters. The idea behind the augmented sketch is to monitor the time-faded items that the sketch hash functions map to the corresponding cells by an instance of Space Saving with two counters, so that for a given cell we are able to determine a majority item candidate with regard to the sub-stream of items falling in that cell.

Indeed, by using a data structure S with two counters in each cell, and letting C_{i,j} denote the total decayed count of the items falling in the cell D[i][j], the majority item is, if it exists, the item whose decayed frequency is greater than C_{i,j}/2. The corresponding majority item candidate in the cell is the item monitored by the Space Saving counter whose estimated decayed frequency is maximum. We have proved that, with high probability, if a time-faded item is frequent, then, in at least one of the sketch cells where it is mapped, it is a majority item with regard to the sub-stream of items falling in the same cell. Therefore, our algorithm will detect it.

Theorem 1. If an item i is frequent, then it appears as a majority item candidate in at least one of the d cells in which it falls, with probability greater than or equal to 1 − (1/(2φw))^d.

Regarding the error bound of our algorithm, let f_i be the exact decayed frequency of item i in the stream σ and f̂_i be the estimated decayed frequency of item i returned by FDCMSS. Let C be the total decayed count of all of the items in the stream. We have proved the following error bound.

Theorem 2. ∀u ∈ [m], f̂_u estimates the exact decayed count f_u of u at query time with error less than εC and probability greater than 1 − δ.

The proofs of the aforementioned theorems can be found in [4].
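For a numeric sense of the Theorem 1 bound, one can plug sample parameters into it; the values of φ, ε and δ below are our own, while d and w follow the FDCMSS derivation recalled in the initialization described next.

```python
from math import ceil, e, log

phi, eps, delta = 0.01, 0.002, 0.01            # sample support, tolerance, failure prob.
d = ceil(log(1.0 / delta))                     # sketch rows, d = ceil(ln 1/delta)
w = ceil(e / (2.0 * eps))                      # sketch columns, w = ceil(e/(2*eps))
p_detect = 1.0 - (1.0 / (2.0 * phi * w)) ** d  # Theorem 1 lower bound
print(d, w, p_detect)                          # d = 5, w = 680, p_detect > 0.99999
```

Even for this fairly small sketch, a frequent item is reported as a majority item candidate in at least one cell with overwhelming probability.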
The algorithm’s initialization requires as input parameters (cid:15), the error; δ, the probability of failure; and φ, the support threshold. The initialization returns a sketch D. The procedure starts deriving d = (cid:100)ln1/δ(cid:101), the number of rowsinthesketchandw = (cid:100)e(cid:101),thenumberofcolumnsinthesketch. Then,foreachofthed∗wcellsavailablein 2(cid:15) thesketchDweallocateadatastructureSwithtwoSpaceSavingcountersc andc . Givenacounterc , j=1,2,we 1 2 j denotebyc .iandc .f respectivelythecounter’sitemanditsestimateddecayedfrequency. Finally,wesetthesupport j j thresholdtoφ,selectdpairwiseindependenthashfunctionsh ,...,h : [m] → [w],mappingmdistinctitemsintow 1 d cells,andinitializethecountvariable,representingthetotaldecayedcountofalloftheitemsinthestream,tozero. Updating the sketch upon arrival of a stream item i with timestamp t, shown in pseudo-code as Algorithm 1, i requirescomputingx,whichisthenonnormalizedforwarddecayedweightoftheitem,andincrementingcountbyx. Then,weupdatethedcellsinwhichtheitemismappedtobythecorrespondinghashfunctionsh (x), j=1,...,dby j usingtheSpaceSavingitemupdateprocedure. LetSdenotetheSpaceSavingstreamsummarydatastructurewithtwocounterscorrespondingtothecelltobe updated. UpdatingSuponarrivalofanitemworksasfollows. Whenprocessinganitemwhichisalreadymonitored byacounter,itsestimatedfrequencyisincrementedbythenonnormalizedweightx. Whenprocessinganitemwhich isnotalreadymonitoredbyoneoftheavailablecounters,therearetwopossibilities. Ifacounterisavailable,itwill beinchargeofmonitoringtheitem,anditsestimatedfrequencyissettothenonnormalizedweight x. Otherwise,if allofthecountersarealreadyoccupied(theirfrequenciesaredifferentfromzero),thecounterstoringtheitemwith minimum frequency is incremented by the non normalized weight x. Then, the monitored item is evicted from the 4 Algorithm1Process Require: i,anitem;t,timestampofitemi;D,sketchdatastructure i Ensure: updateofsketchrelatedtoitemi;updatethelocaltotaldecayedcount. 
1: procedureprocess(i,ti,D) 2: x←g(t−ti) (cid:46)computethenonnormalizeddecayedweightofitemi 3: lCount←lCount+x (cid:46)updatelocaltotaldecayedcount 4: for j=1toddo 5: S← D[j][hj(i)] 6: SpaceSavingUpdate(S,i,x) (cid:46)updatethesketch 7: endfor 8: endprocedure counterandreplacedbythenewitem. Thishappenssinceanitemwhichisnotmonitoredcannothaveafrequency greaterthantheminimalfrequency. PFDCMSS, the parallel version of our sequential algorithm, works as follows. We assume the offline setting in whichthestreamitemshavebeenstoredasastaticdatasetalongwiththecorrespondingtimestamps.Itisworthnoting hereimmediatelythatouralgorithmworksinthestreaming(online)settingaswell.Indeed,intheformercase(offline setting)wepartitiontheinputdatasetandtimestampsusingasimple1Dblock-baseddomaindecompositionamong theavailable pprocessesandthenprocessinparallelthesub-streamsassignedtotheprocessesusingAlgorithm1. In the latter case (online setting), we have instead p distributed sites, each handling a different stream σ,i = 1,...,p i processedagainusingAlgorithm1. Intheparallelversion,oncethesub-streamshavebeenprocessed,oneoftheprocessesisinchargeofdetermining the time–faded heavy hitters. In order to do so, all of the processes engage in a parallel reduction in which their sketches are merged into a global sketch which preserves all of the information stored in the local sketches. This sketchisthenqueriedandthetime–fadedheavyhittersarereturned. In the distributed setting, one of the sites may act as a centralized coordinator or there can be another different site taking this responsibility. The coordinator broadcasts, when required, a ”query” message to the p sites, which thentemporarilystopprocessingtheirsub-streams, andengageinthesketchmergeprocedure. 
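Putting the initialization and Algorithm 1 together, a minimal single-process Python rendering might look as follows. Class and variable names are ours, and Python's salted `hash` merely stands in for the d pairwise independent hash functions.

```python
import random

class AugmentedSketch:
    """Illustrative d x w sketch; each cell is a two-counter Space Saving
    summary, modeled here as a dict {item: decayed frequency} of size <= 2."""

    def __init__(self, d, w, g, L=0.0, seed=42):
        self.d, self.w, self.g, self.L = d, w, g, L
        self.cells = [[{} for _ in range(w)] for _ in range(d)]
        rnd = random.Random(seed)
        # Stand-ins for the pairwise independent hash functions h_1..h_d.
        self.salts = [rnd.getrandbits(64) for _ in range(d)]
        self.lcount = 0.0   # local (non-normalized) total decayed count

    def _h(self, j, item):
        return hash((self.salts[j], item)) % self.w

    def process(self, item, t_i):
        x = self.g(t_i - self.L)       # non-normalized forward decayed weight
        self.lcount += x
        for j in range(self.d):
            S = self.cells[j][self._h(j, item)]
            if item in S:              # already monitored: add the weight
                S[item] += x
            elif len(S) < 2:           # a free counter is available
                S[item] = x
            else:                      # evict the minimum; new item inherits it
                victim = min(S, key=S.get)
                fmin = S.pop(victim)
                S[item] = fmin + x

sk = AugmentedSketch(d=4, w=16, g=lambda n: n * n)
for item, t in [("a", 1.0), ("b", 2.0), ("a", 3.0)]:
    sk.process(item, t)
print(sk.lcount)  # g(1) + g(2) + g(3) = 1 + 4 + 9 = 14.0
```

Note that only the non-normalized weights g(t_i − L) are accumulated; as in the paper, normalization by g(t − L) is deferred to query time.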
We can imagine the distributed sites as being multi-threaded processes, in which one thread executes Algorithm 1, temporarily stops when a query message is received from the coordinator, creates a copy of its local sketch and then resumes stream processing, whilst another thread engages in the distributed sketch merging procedure using the sketch copy.

In order to retrieve the time-faded heavy hitters, a query can be posed when needed. The query, shown in pseudo-code as Algorithm 2, starts by determining the global decayed count for the whole stream σ. This requires a parallel reduction in which the local decayed counts are summed. It is worth noting here that the global decayed count is still non-normalized; the normalization occurs by dividing by g(t − L), where t is the query time and L denotes the landmark time. Then, we build, through a user-defined parallel reduction, a global sketch G which is obtained by merging the local sketches. To do so, each process invokes a parallel reduction using the MergeSketch operator shown in pseudo-code as Algorithm 3.

The sketches are reduced as follows: for every corresponding cell in two sketches to be merged, the hosted Space Saving summaries are merged following the steps described in [5], i.e., building a temporary summary S_C consisting of all of the items monitored by both S_1 and S_2. To each item in S_C is assigned a decayed frequency computed as follows: if an item is present in both S_1 and S_2, its frequency is the sum of its corresponding frequencies in each summary; if the item is present only in one of either S_1 or S_2, its frequency is incremented by the minimum frequency of the other summary. At last, in order to derive the merged summary, we take only the 2 items in S_C with the greatest frequencies and discard the others.

It is worth noting here that the sum of the counters in the stream summary data structure S related to a given cell D[i][j] is equal to the value that the Count-Min sketch-based algorithm would store in the counter variable corresponding to that cell, i.e., the 1-norm of the frequency vector corresponding to the sub-stream falling in the cell through the pairwise independent hash functions.
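The per-cell combine-and-purge step just described (sum the frequencies of common items, add the other summary's minimum otherwise, keep only the top two) can be sketched as follows; this is our own illustrative code, with summaries modeled as dicts of at most two counters.

```python
def merge_summaries(s1, s2):
    """Merge two two-counter Space Saving summaries, as in Algorithm 3."""
    # Minimum counter frequency of each summary (0 if a counter is free).
    m1 = min(s1.values()) if len(s1) == 2 else 0.0
    m2 = min(s2.values()) if len(s2) == 2 else 0.0
    combined = {}
    for item, f in s1.items():
        # Common items get the sum; items only in s1 get s2's minimum added.
        combined[item] = f + (s2[item] if item in s2 else m2)
    for item, f in s2.items():
        if item not in s1:
            combined[item] = f + m1    # items only in s2 get s1's minimum added
    # Purge: keep the two counters with the greatest frequencies.
    top2 = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:2]
    return dict(top2)

s1 = {"a": 10.0, "b": 4.0}
s2 = {"a": 6.0, "c": 3.0}
merged = merge_summaries(s1, s2)
# Before the purge the counters sum to |S1| + |S2| + x*delta = 14 + 9 + 7 = 30;
# after dropping the minimum (7) they sum to 23 = |S1| + |S2|, in line with the
# 1-norm preservation proved in Section 4.
print(merged)  # 'a' keeps 16.0; one of the items tied at 7.0 survives
```

On this toy input the dropped counter equals x·δ exactly, which is precisely the k = 2 special case established by Theorem 3 below.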
An augmented sketch is thus equivalent, from this perspective, to a Count-Min sketch, and this property is preserved by the merge procedure. From now on we will call this property 1-norm equivalence.

Algorithm 2 Query
Require: t, query time; D, process' sketch
Ensure: set of frequent items
1: procedure query(t)
2:   gCount ← ParallelReduction(lCount, Sum)
3:   gCount ← gCount / g(t − L)
4:   G ← ParallelReduction(D, MergeSketch)
5:   F = ∅
6:   for each S_ij ∈ G do
7:     let c_1 and c_2 be the counters in S_ij
8:     c_m ← argmax(c_1, c_2)        ▷ c_m, the counter with maximum decayed count
9:     if c_m.f / g(t − L) > φ · gCount then
10:      p ← PointEstimate(c_m.i, t)
11:      if p > φ · gCount then
12:        F ← F ∪ {(c_m.i, p)}
13:      end if
14:    end if
15:  end for
16:  return F
17: end procedure

However, merging Count-Min sketches simply requires adding the corresponding cells' counters. Indeed, via linearity, the sum of sketches is equal to the sketch of the sums. Instead, in our case, we need an ad hoc procedure in order to correctly merge the two Space Saving stream summaries hosted by the corresponding cells so that the 1-norm equivalence property is preserved. Nonetheless, the augmented sketch which results from our parallel merge reduction is 1-norm equivalent to the Count-Min sketch obtained by summing the Count-Min sketches corresponding to our augmented sketches which are the input of the parallel merge reduction.

Once the global sketch is obtained, the query procedure initializes F, an empty set, and then it inspects each of the d·w cells in the sketch D. For a given cell, we determine c_m, the counter in the data structure S with maximum decayed count. We normalize the decayed count stored in c_m by dividing by g(t − L), and then compare this quantity with the threshold given by φ · gCount (gCount being the normalized global decayed count). If the normalized decayed frequency is greater, we pose a point query for the item c_m.i, shown in pseudo-code as Algorithm 4. If p, the returned value, is greater than the threshold φ · gCount, then we insert in F the pair (c_m.i, p).

The point query for an item j returns its estimated decayed frequency.
After initializing the answer variable to infinity, we inspect each of the d cells to which the item is mapped by the corresponding hash functions, to determine the minimum decayed frequency of the item. In each cell, if the item is stored by one of the Space Saving counters, we set answer to the minimum between answer and the corresponding counter's decayed frequency. Otherwise (none of the two counters monitors the item j), we set answer to the minimum between answer and the minimum decayed frequency stored in the counters. Since the frequencies stored in all of the counters of the sketch are not normalized, we return the normalized frequency, answer divided by g(t − L). At the end of the query procedure the set F is returned.

4. Correctness

Here, we prove that our algorithm correctly merges two FDCMSS sketches. The merge procedure preserves all of the properties of the sketch, including the fact that, considering the sum of the Space Saving counters in each sketch cell, an FDCMSS sketch is 1-norm equivalent to the classical Count-Min sketch.

It is worth noting here that we would obtain a correct result by using the merge procedure presented in [5] to combine the Space Saving summaries stored in the corresponding sketch cells, but we also want to impose 1-norm equivalence, i.e., the additional condition that the sum of the counters' values in each merged cell always reflects the total decayed count of the items which fell in the corresponding cells.

Algorithm 3 MergeSketch
Require: D_1, D_2: sketches to be merged.
Ensure: G, the merged sketch
1: procedure MergeSketch(D_1, D_2)
2:   for each S1_ij ∈ D_1, S2_ij ∈ D_2 do
3:     m_1 ← min(c1_1.f, c1_2.f)    ▷ m_1, the minimum of the counters' frequencies in S1_ij
4:     m_2 ← min(c2_1.f, c2_2.f)    ▷ m_2, the minimum of the counters' frequencies in S2_ij
5:     for each cs_1 ∈ S1 do
6:       cs_2 ← Find(S2, cs_1.i)
7:       if cs_2 then
8:         cs_c.f ← cs_1.f + cs_2.f
9:         Delete(S2, cs_2)
10:      else
11:        cs_c.f ← cs_1.f + m_2
12:      end if
13:      cs_c.i ← cs_1.i
14:      Insert(S_C, cs_c)
15:    end for
16:    for each cs_2 ∈ S2 do
17:      cs_c.i ← cs_2.i
18:      cs_c.f ← cs_2.f + m_1
19:      Insert(S_C, cs_c)
20:    end for
21:    Purge(S_C)    ▷ S_C now contains the 2 counters with the greatest frequencies
22:    G[i][j] ← S_C
23:  end for
24:  return G
25: end procedure

Algorithm 4 PointEstimate
Require: j, an item; t, query time
Ensure: estimation of item j's decayed count
1: procedure pointestimate(j, t)
2:   answer ← ∞
3:   for i = 1 to d do
4:     S ← G[i][h_i(j)]    ▷ let c_1 and c_2 be the counters in S
5:     if j == c_1.i then
6:       answer ← min(answer, c_1.f)
7:     else if j == c_2.i then
8:       answer ← min(answer, c_2.f)
9:     else
10:      m ← min(c_1.f, c_2.f)
11:      answer ← min(answer, m)
12:    end if
13:  end for
14:  return answer / g(t − L)
15: end procedure

Indeed, in [5] we showed how to merge Space Saving stream summaries in parallel. However, we have proved that our merge procedure satisfies the Space Saving properties described by eqs. (2)-(5), and the following relaxed version of eq. (1):

|S| ≤ ||f||_1.  (6)

As shown in Theorem 3, which is the main result of this section, it turns out that k = 2 counters (i.e., majority item mining) is a special case: when the Space Saving summaries to be merged hold two counters, then the property in eq. (1) holds for the merged summary in its original form, that is |S| = ||f||_1, without modifying the merge procedure designed in [5].

Theorem 3. The parallel merge algorithm provides an augmented sketch that preserves all of the properties of an FDCMSS sketch.

Proof.
The correctness of the parallel FDCMSS sketch merge algorithm derives from the correctness of the Space Saving merge procedure, already shown in [5]. It remains to show that, when looking at the sum of the Space Saving counters associated to each cell, the merged augmented sketch is still 1-norm equivalent to a Count-Min sketch, that is, the sum of the counters' values is equal to the decayed count of all the items that fell in that cell.

Let us recall the merge algorithm for Space Saving summaries introduced in [5]. We will use the multiset notation, thus let us rewrite the properties of a Space Saving summary stated in equations (1)-(4), this time with reference to multisets. Indeed, we model the input stream as a multiset (also called a bag), which essentially is a set where the duplication of elements is allowed. We shall use a calligraphic capital letter to denote a multiset, and the corresponding capital Greek letter to denote its underlying set. In particular, we extend the traditional notion of multiset as follows. Instead of considering an indicator function which returns the multiplicity of an item, we use a function providing the decayed frequency of that item. Therefore, summing over all of the items we obtain the total decayed count in place of the cardinality of the multiset.

Definition 5. A decayed multiset N = (N, f_N) is a pair where N is some set, called the underlying set of elements, and f_N : N → R is a function which provides the decayed frequency for each x ∈ N according to Definition 3.

The decayed count of the multiset N is expressed by

|N| := ∑_{x∈N} f_N(x),  (7)

whilst the cardinality of the underlying set N is

|N| := Card(N) = ∑_{x∈N} 1.  (8)

From now on, when referring to either the exact or estimated frequency of an item, we shall mean the item's exact or estimated decayed frequency. Recall that our Space Saving stream summary data structure uses exactly k = 2 counters, and let N = (N, f_N) be the input decayed multiset and S = (Σ, f̂_S) the decayed multiset of all of the monitored items and their respective counters at the end of the sequential Space Saving algorithm's execution, i.e., the algorithm's summary data structure.
Let |S| be the sum of the frequencies stored in the counters, f_N(e) the exact frequency of an item e, f̂_S(e) its estimated frequency and f̂_S^min the minimum frequency in S, where f̂_S^min = 0 when |Σ| < 2. Indeed, even though a summary data structure has exactly 2 counters, it may monitor less than 2 items, since an item is actually monitored if and only if its counter's frequency is different from zero. The following relations hold, for each item e ∈ N:

|S| = |N|,  (9)

f̂_S(e) − f̂_S^min ≤ f_N(e) ≤ f̂_S(e),  e ∈ Σ,  (10)

f_N(e) ≤ f̂_S^min,  e ∉ Σ,  (11)

f̂_S^min ≤ ⌊|N| / 2⌋.  (12)

Now, let S_1 = (Σ_1, f̂_S1) and S_2 = (Σ_2, f̂_S2) be two summaries related respectively to the input sub-arrays N_1 = (N_1, f_N1) and N_2 = (N_2, f_N2), with N = N_1 ⊎ N_2 = (N, f_N). Let S_M = (Σ_M, f̂_SM) be the final merged summary.

Theorem 3 in [5] states that if eqs. (10)-(12) hold for S_1 and S_2 and a relaxed version of eq. (9) is verified, i.e., it holds that

|S_i| ≤ |N_i|,  i = 1, 2,  (13)

then these properties continue to be true also for S_M (it is worth noting here that eq. (13) also holds for summaries produced by the sequential Space Saving algorithm). The authors show that this is enough to guarantee the correctness of the merge operation, but, in general, |S_M| ≤ |N|.

In order to obtain S_M, we start by combining S_1 and S_2 to obtain S_C, and then, if |Σ_C| > 2, we take the two counters with the greatest frequency values in S_C in order to build S_M; otherwise we return S_M = S_C.

We can express the combine operation as shown by the following equation:

f̂_SC(e) = f̂_S1(e) + f̂_S2(e),     e ∈ Σ_1 ∩ Σ_2,
f̂_SC(e) = f̂_S1(e) + f̂_S2^min,    e ∈ Σ_1 \ Σ_2,    (14)
f̂_SC(e) = f̂_S2(e) + f̂_S1^min,    e ∈ Σ_2 \ Σ_1.

In the special case of stream summaries holding exactly k = 2 counters, it holds that, for i = 1, 2, |Σ_i| ≤ 2, and |Σ_C| ≤ 4. Now, suppose that |S_i| = |N_i| (this is true when S_i is produced by the sequential Space Saving algorithm), and let δ = f̂_S1^min + f̂_S2^min and x = |Σ_C| − 2. Furthermore, suppose that the entries in S_C are sorted in ascending order with regard to the counters' frequencies.
As proved in [5], it holds that:

|S_C| = |S_1| + |S_2| + xδ = |N_1| + |N_2| + xδ,  (15)

|S_M| = |N_1| + |N_2| + xδ − ∑_{i=1}^{x} f̂_SC(e_i),  (16)

where the sum is extended over the first x entries.

We have to show that the difference xδ − ∑_{i=1}^{x} f̂_SC(e_i) is always equal to zero when k = 2, so that |S_M| = |N|.

When x ≤ 0, xδ = 0. In that case, S_M = S_C and |S_C| = |N|.

When x > 0, the first x counters of S_C have values equal to δ. To see this, consider the two cases x = 1 and x = 2.

When x = 2, that is, the two summaries to be merged contain different items and |Σ_C| = 4, this is easily seen by simple computations: in fact, δ is the minimum value a counter in S_C can assume, and there are at least two counters with this value in S_C, obtained by combining the two counters with minimum value in S_1 and S_2. As a consequence, these counters are the first two, and xδ − ∑_{i=1}^{x} f̂_SC(e_i) = 0.

When x = 1, one of the following cases arises:

1. one of the summaries (without loss of generality, let us suppose it is S_1) contains two counters, the other summary (S_2) contains only one counter, and no item is in common between the summaries. In this case, δ is equal to the minimum counter in S_1 since f̂_S2^min = 0, but it is also the minimum counter in S_C, hence it holds that δ − f̂_SC(e_1) = 0;
2. both summaries contain two counters and they have exactly one item in common. In this case we further have to distinguish three cases:

(a) the item in common has the minimum frequency in both summaries. The combined frequency of this item will be equal to δ, which is the sum of the minimum frequencies of the two summaries. Its combined frequency is also the minimum in S_C, hence it holds that δ − f̂_SC(e_1) = 0;
(b) the item in common has the maximum frequency in both summaries. Its combined frequency is also the maximum value in S_C, and S_C contains two distinct items with combined frequency equal to δ, which is also the minimum in S_C, hence δ − f̂_SC(e_1) = 0;
(c) the item in common appears with minimum frequency in one summary (without loss of generality, let us suppose in S_1) and with maximum frequency in the other summary (S_2).
The combined frequency of the item which appears with minimum frequency in S_2 is equal to δ, which again is the minimum frequency of the counters in S_C, hence δ − f̂_SC(e_1) = 0.

Taking into account that in all of the cases when x = 1 the summary S_C contains at least one item whose combined frequency is equal to δ, it holds that xδ − ∑_{i=1}^{x} f̂_SC(e_i) = 0.

We have shown that all of the properties of a Space Saving summary of two counters are preserved by the merge procedure introduced in [5]. This suffices to guarantee that all of the properties stated for an FDCMSS sketch continue to hold after the parallel merge procedure depicted in the algorithm presented. In particular, property (1) holds, which guarantees that a merged FDCMSS sketch continues to be 1-norm equivalent to a Count-Min sketch.

5. Experimental results

In this section, we report experimental results on synthetic datasets. Here, we thoroughly test our algorithm using an exponential decay function. All of the experiments have been carried out on the Galileo cluster machine kindly provided by CINECA in Italy. This machine is a Linux CentOS 7.0 NeXtScale cluster with 516 compute nodes; each node is equipped with 2 2.40 GHz octa-core Intel Xeon E5-2630 v3 CPUs, 128 GB RAM and 2 16 GB Intel Xeon Phi 7120P accelerators (available on 384 nodes only). High-performance networking among the nodes is provided by Intel QDR (40 Gb/s) Infiniband. All of the codes were compiled using the Intel C++ compiler v17.0.0.

Let f be the true frequency of an item and f̂ the corresponding frequency reported by an algorithm; then the Relative Error is defined as Δf = |f − f̂| / f, and the Average Relative Error is derived by averaging the Relative Errors over all of the measured frequencies.

Precision, a metric defined as the total number of true heavy hitters reported over the total number of candidate items, quantifies the number of false positives reported by an algorithm in the output stream summary. Recall is the total number of true heavy hitters reported over the number of true heavy hitters given by an exact algorithm.
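These accuracy metrics can be computed directly from the exact and reported frequencies; the following small helpers and toy values are our own illustration, not the paper's experimental data.

```python
def average_relative_error(exact, estimated):
    """Mean of |f - f_hat| / f over all measured items (ARE)."""
    errors = [abs(exact[v] - estimated.get(v, 0.0)) / exact[v] for v in exact]
    return sum(errors) / len(errors)

def precision_recall(true_hh, reported):
    """Precision: true heavy hitters reported / candidates reported.
    Recall: true heavy hitters reported / true heavy hitters."""
    true_hh, reported = set(true_hh), set(reported)
    hits = len(true_hh & reported)
    return hits / len(reported), hits / len(true_hh)

exact = {"a": 100.0, "b": 50.0, "c": 10.0}
estimated = {"a": 101.0, "b": 50.0, "c": 12.0}
are = average_relative_error(exact, estimated)      # (0.01 + 0 + 0.2) / 3 = 0.07
p, r = precision_recall(true_hh={"a", "b"}, reported={"a", "b", "c"})
print(are, p, r)  # one false positive ("c") lowers precision; recall stays 1.0
```

With one false positive among three candidates, precision drops to 2/3 while recall remains 100%, mirroring how the two metrics are used in the plots below.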
In all of the results we obtained 100% recall, even on a tiny sketch of size 4 x 800 (recall may be less than 100%, but this happens only when the sketch size is really minimal). For this reason, to avoid wasting space, we do not show recall plots here. Rather, we present Precision, Absolute Error, Average Relative Error (ARE), Updates/ms and runtime/performance plots, since we are interested in understanding the error behavior and the algorithm's scalability when we use an increasing number of cores of execution. Table 1 reports the experiments carried out. For each different metric under examination, we varied n, the stream size in billions of items; ρ, the skew of the zipfian distribution; φ, the threshold; and w, the number of sketch columns. All of the other parameters are fixed when varying one of the previous ones, and we show, on top of each plot, the fixed parameters' values.

Finally, we also present, for the metrics of interest, the results obtained by fixing the stream size and varying the number of cores utilized from 1 to 512. We conclude this section with a comparison between strong and weak scalability.

With experiment 1 we aim at measuring the algorithm's accuracy; experiment 2 aims at measuring how the parallelization affects the algorithm's accuracy; finally, experiment 3 is meant to measure the computational performance of the parallel algorithm, measuring both strong and weak scalability.
