Streaming similarity self-join

Gianmarco De Francisci Morales          Aristides Gionis
Qatar Computing Research Institute      Aalto University
[email protected]                         [email protected]

arXiv:1601.04814v2 [cs.DB] 8 Mar 2016

ABSTRACT

We introduce and study the problem of computing the similarity self-join in a streaming context (sssj), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the sssj problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.

Figure 1: Timestamped documents arrive as a stream. The documents on top (in red) have similar content. Among all pairs of similar documents, we are interested in those that arrive close in time. In the example, out of all 4-choose-2 pairs only two pairs are reported (shown with blue arrows).

1. INTRODUCTION

Similarity self-join is the problem of finding all pairs of similar objects in a given dataset. The problem is related to the similarity join operator [4, 10], which has been studied extensively in the database and data-mining communities. The similarity self-join is an essential component in several applications, including plagiarism detection [11, 16], query refinement [21], document clustering [8, 9, 15], data cleaning [10], community mining [22], near-duplicate record detection [28], and collaborative filtering [12].

The similarity self-join problem is inherently quadratic. In fact, the brute-force O(n^2) algorithm that computes the similarity between all pairs is the best one can hope for, in the worst case, when exact similarity is required and when dealing with arbitrary objects. In practice, it is possible to obtain scalable algorithms by leveraging structural properties of specific problem instances. A typical desideratum here is to design output-sensitive algorithms.^1

In many real-world applications objects are represented as sparse vectors in a high-dimensional Euclidean space, and similarity is measured as the dot-product of unit-normalized vectors, or equivalently, cosine similarity. The similarity self-join problem asks for all pairs of objects whose similarity is above a predefined threshold θ. Efficient algorithms for this setting are well-developed, and rely on pruning based on inverted indexes, as well as a number of different geometric bounds [3, 7, 10, 28]; details on those methods are presented in the following sections.

Computing similarity self-join is a problem that is germane not only for static data collections, but also for data streams. Two examples of real-world applications that motivate the problem of similarity self-join in the streaming setting are the following.

Trend detection: existing algorithms for trend detection in microblogging platforms, such as Twitter, rely on identifying hashtags or other terms whose frequency suddenly increases.
A more focused and more granular trend-detection approach would be to identify a set of posts whose frequency increases, and which share a certain fraction of hashtags or terms. For such a trend-detection method it is essential to be able to find similar pairs of posts in a data stream.

Near-duplicate item filtering: consider again a microblogging platform, such as Twitter. When an event occurs, users may receive multiple near-copies of posts related to the event. Such posts often appear consecutively in the feed of users, thus cluttering their information stream and degrading their experience. Grouping these near-copies or filtering them out is a simple way to improve the user experience.

Surprisingly, the problem of streaming similarity self-join has not been considered in the literature so far, possibly because the similarity self-join operator in principle requires unbounded memory: one can never forget a data item, as it may be similar to another one that comes far in the future.

To the best of our knowledge, this paper is the first to address the similarity self-join problem in the streaming setting. We overcome the inherent unbounded-memory bottleneck by introducing a temporal factor to the similarity operator. Namely, we consider two data items similar only if they have arrived within a short time span. More precisely, we define the time-dependent similarity of two data items to be their content-based cosine similarity multiplied by a factor that decays exponentially with the difference in their arrival times. This time-dependent factor allows us to drop old items, as they cannot be similar to any item that arrives after a certain time horizon τ. The concept is illustrated in Figure 1.

The time-dependent similarity outlined above is very well-suited for both our motivating applications, trend detection and near-duplicate filtering: in both cases we are interested in identifying items that not only are content-wise similar, but have also arrived within a short time interval.

Akin to previous approaches for the similarity self-join problem, our method relies on indexing techniques. Previous approaches use different types of index filtering to reduce the number of candidates returned by the index for full similarity evaluation.

^1 https://en.wikipedia.org/wiki/Output-sensitive_algorithm
Following this terminology, we use the term time filtering to refer to the property of the time-dependent similarity that allows us to drop old items from the index.

We present two different algorithmic frameworks for the streaming similarity self-join problem, both of which rely on the time-filtering property. Both frameworks can be instantiated with indexing schemes based on existing ones for the batch version of the problem. The first framework, named MiniBatch (MB), uses existing indexing schemes off-the-shelf. In particular, it uses two indexes in a pipeline, and it drops the older one when it becomes old enough. The second framework we propose, named Streaming (STR), modifies existing indexing schemes so that time filtering is incorporated internally in the index.

One of our contributions is a new index, L2, which combines state-of-the-art bounds for index pruning from L2AP [3] in a way that is optimized for stream data. The superior performance of L2 stems from the fact that it uses bounds that (i) are effective in reducing the number of candidate pairs, (ii) do not require collecting statistics over the data stream (which would need to be updated as the stream evolves), (iii) lead to very lightweight index maintenance, and (iv) allow dropping data as soon as they become too old to be similar to any item currently read. Our experimental evaluation demonstrates that the L2 index, incorporated in the STR framework, is the method of choice for all datasets we try, across a wide range of parameters.

In brief, our contributions can be summarized as follows.
• We introduce the similarity self-join problem in data streams.
• We propose a novel time-dependent similarity measure, which allows us to forget data items when they become old.
• We show how to solve the proposed problem within two different algorithmic frameworks, MB and STR: for MB, any state-of-the-art indexing scheme can be used as a black box; for STR, we adapt and extend the existing schemes to be well-suited for streaming data.
• We perform an extensive experimental evaluation, and test the proposed algorithms on datasets with different characteristics and for a large range of parameters.

Finally, we note that all the code^2 we developed and all datasets^3 used for validation are publicly available.

^2 Code: http://github.com/gdfm/sssj
^3 Data: http://research.ics.aalto.fi/dmg/sssj-datasets/

2. RELATED WORK

The problem of streaming similarity self-join has not been studied previously. Work related to this problem can be classified into two main areas: similarity self-join (also known as all-pairs similarity), and streaming similarity.

Similarity self-join. The similarity self-join problem has been studied extensively. It was introduced by Chaudhuri et al. [10], and many works have followed up [4, 5, 7, 10, 27, 28], including approaches that use parallel computation [1, 2, 6, 13, 24]. Discussing all these related papers is out of the scope of this work. The most relevant approaches are those by Bayardo et al. [7], which significantly improved the scalability of the previous indexing-based methods, and by Anastasiu and Karypis [3], which represents the current state-of-the-art for sequential batch algorithms. Our work relies heavily on these two papers, and we summarize their common filtering framework in Section 5. For a good overview, refer to the work of Anastasiu and Karypis [3].

Streaming similarity. The problem of computing similarities in a stream of vectors has received surprisingly little attention so far. Most of the work on streaming similarity focuses on time series [14, 17, 19].

Lian and Chen [20] study the similarity join problem for uncertain streams of vectors in the sliding-window model. They focus on uncertain objects moving in a low-dimensional Euclidean space (d ∈ [2, 5]). Similarity is measured by the Euclidean distance, and therefore the pruning techniques are very different from the ones employed in our work; e.g., they use space-partitioning techniques, and use temporal correlation of samples to improve the efficiency of their algorithm.

Wang and Rundensteiner [26] study the more general problem of multi-way join with expensive user-defined functions (UDFs) in the sliding-window model. They present a time-slicing approach, which has some resemblance to the time-based pruning presented in this paper. However, their focus is on distributing the computation on a cluster via pipelined parallelism, and on ensuring that the multiple windows of the multi-way join align correctly. Differently from this paper, their method treats UDFs as a black box.

Valari and Papadopoulos [23] focus on graphs, rather than vectors, and study the case of Jaccard similarity of nodes on a sliding window over an edge stream. They apply a count-based pruning on edges by keeping a fixed-size window, and leverage an upper bound on the similarity within the next window. Instead, in this work we consider time-based pruning, without any assumption on the frequency of arrival of items in the stream. Furthermore, the arrival of edges changes the similarity of the nodes, and therefore different indexing techniques need to be applied; e.g., nodes can never be pruned.
3. PROBLEM STATEMENT

We consider data items represented as vectors in a d-dimensional Euclidean space R^d. In real-world applications, the dimensionality d is typically high, while the data items are sparse vectors. Given two vectors x = ⟨x_1 ... x_d⟩ and y = ⟨y_1 ... y_d⟩, their similarity sim(x,y) is the dot-product

  sim(x,y) = dot(x,y) = x · y = Σ_{j=1..d} x_j y_j .

We assume that all vectors x are normalized to unit length, i.e., ||x||_2 = 1, which implies that the dot-product of two vectors is equal to their cosine similarity: cos(x,y) = dot(x,y).

In the standard all-pairs similarity search problem (apss), also known as similarity self-join, we are given a set of vectors and a similarity threshold θ, and the goal is to find all pairs of vectors (x,y) for which sim(x,y) ≥ θ.

In this paper we assume that the input items arrive as a data stream. Each vector x in the input stream is timestamped with the time of its arrival t(x), and the stream is denoted by S = ⟨..., (x_i, t(x_i)), (x_{i+1}, t(x_{i+1})), ...⟩.

We define the similarity of two vectors x and y by considering not only their coordinates (x_j, y_j), but also the difference in their arrival times in the input stream, ∆t_xy = |t(x) − t(y)|. For fixed coordinates, the larger the arrival-time difference of two vectors, the smaller their similarity. In particular, given two timestamped vectors x and y, we define their time-dependent similarity sim_∆t(x,y) as

  sim_∆t(x,y) = dot(x,y) · e^(−λ|t(x)−t(y)|),

where λ is a time-decay parameter.

This time-dependent similarity reverts to the standard dot-product (or cosine) similarity when ∆t_xy = 0 or λ = 0, and it goes to zero as ∆t_xy approaches infinity, at an exponential rate modulated by λ. As in the standard apss problem, given a similarity threshold θ, two vectors x and y are called similar if their time-dependent similarity is above the threshold, i.e., sim_∆t(x,y) ≥ θ.

We can now define the streaming version of the apss problem, called the streaming similarity self-join problem (sssj).

Problem 1 (sssj). Given a stream of timestamped vectors S, a similarity threshold θ, and a time-decay factor λ, output all pairs of vectors (x,y) in the stream such that sim_∆t(x,y) ≥ θ.

Time filtering. The main challenge of computing a self-join in a data stream is that, in principle, unbounded memory is required, as an item can be similar to any other item that will arrive arbitrarily far in the future. By adopting the time-dependent similarity measure sim_∆t, we introduce a forgetting mechanism, which not only is intuitive from an application point of view, but also allows us to overcome the unbounded-memory requirement. In particular, since for ℓ2-normalized vectors dot(x,y) ≤ 1, we have

  sim_∆t(x,y) = dot(x,y) · e^(−λ∆t_xy) ≤ e^(−λ∆t_xy).

Thus, ∆t_xy > λ^(−1) log θ^(−1) implies sim_∆t(x,y) < θ, and as a result a given vector cannot be similar to any vector that arrived more than

  τ = (1/λ) log(1/θ)

time units earlier. Consequently, we can safely prune vectors that are older than τ. We call τ the time horizon.

Parameter setting. The sssj problem, defined in Problem 1, requires two parameters: a similarity threshold θ and a time-decay factor λ. The time-filtering property suggests a simple methodology to set these two parameters:
1. Select θ as the lowest value of the similarity between two simultaneously-arriving vectors that are deemed similar.
2. Select τ as the smallest difference in arrival times between two identical vectors that are deemed dissimilar.
3. Set λ = τ^(−1) log θ^(−1).
Setting θ and τ in steps 1 and 2 depends on the application.

Additional notation. When referring to the non-streaming setting, we consider a dataset D = {x_1, ..., x_n} consisting of n vectors in R^d.

Following Anastasiu and Karypis, we use the notation x′ = x′_p = ⟨x_1, ..., x_{p−1}, 0, ..., 0⟩ to denote the prefix of a vector x. For a vector x, we denote by vm_x its maximum coordinate, by Σx = Σ_j x_j the sum of its coordinates, and by |x| the number of its non-zero coordinates.

Given a static dataset D, we use m_j to refer to the maximum value along the j-th coordinate over all vectors in D. All values of m_j together compose the vector m. Our methods use indexing schemes which build an index incrementally, vector-by-vector. We use m̂ to refer to the vector m restricted to the dataset that is already indexed, and we write m̂_j for its j-th coordinate. Furthermore, we use m̂^λ (and m̂^λ_j for its j-th coordinate) to denote a time-decayed variant of m̂, whose precise definition is given in Section 5.3.
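For illustration, the time-dependent similarity and the derived time horizon can be sketched in a few lines of Python. This is a minimal sketch of the definitions in this section (sparse vectors as plain dicts; function names are our own, not from the paper's code):

```python
import math

def time_dependent_similarity(x, y, tx, ty, lam):
    """sim_dt(x, y) = dot(x, y) * exp(-lam * |t(x) - t(y)|).
    Vectors are sparse dicts {coordinate: value}, assumed unit-normalized."""
    d = sum(v * y.get(j, 0.0) for j, v in x.items())
    return d * math.exp(-lam * abs(tx - ty))

def time_horizon(theta, lam):
    """tau = (1/lam) * log(1/theta): beyond this arrival-time gap,
    sim_dt < theta regardless of content, so older items can be pruned."""
    return math.log(1.0 / theta) / lam
```

For instance, two identical unit vectors arriving at the same time have similarity exactly 1, while the same pair separated by more than τ falls below the threshold, as the time-filtering property guarantees.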
4. OVERVIEW OF THE APPROACH

In this section we present a high-level overview of our main algorithms for the sssj problem.

We present two different algorithmic frameworks: MB-IDX (MiniBatch-IDX) and STR-IDX (Streaming-IDX), where IDX is an indexing method for the apss problem on static data. To make the presentation of our algorithms clearer, we first give an overview of the indexing methods for the static apss problem, as introduced in earlier papers [3, 7, 10].

All indexing schemes are based on building an inverted index. The index is a collection of d posting lists, I = {I_1, ..., I_d}, where the list I_j contains all pairs (ι(x), x_j) such that the j-th coordinate of vector x is non-zero, i.e., x_j ≠ 0. Here ι(x) denotes a reference to vector x.

Some methods optimize computation by not indexing the whole dataset. In these cases, the un-indexed part is still needed in order to compute exact similarities. Thus, we assume that a separate part of the index I contains the un-indexed part of the dataset, called the residual direct index R.

All schemes build the index incrementally, while also computing similar pairs. In particular, we start with an empty index, and we iteratively process the vectors in D. For each newly-processed vector x, we compute similar pairs (x,y) for all vectors y that are already in the index. Thereafter, we add (some of) the non-zero coordinates of x to the index.

Building the index and computing similar pairs can be seen as a three-phase process:
index construction (IC): adds new vectors to the index I.
candidate generation (CG): uses the index I to generate candidate similar pairs. The candidate pairs may contain false positives but no false negatives.
candidate verification (CV): computes true similarities between candidate pairs, and reports true similar pairs, while dismissing false positives.
More concretely, for an indexing scheme IDX we assume that the following three primitives are available, which correspond to the three phases outlined above:

(I, P) ← IndConstr-IDX(D, θ): Given a dataset D consisting of n vectors, and a similarity threshold θ, the function IndConstr-IDX returns in P = {(x,y)} all similar pairs (x,y), with x, y ∈ D. Additionally, IndConstr-IDX builds an index I, which can be used to find similar pairs between the vectors in D and another query vector z.

C ← CandGen-IDX(I, x, θ): Given an index I, built on a dataset D, a vector x, and a similarity threshold θ, the function CandGen-IDX returns a set of candidate vectors C = {y}, which is a superset of all vectors that are similar to x.

P ← CandVer-IDX(I, x, C, θ): Given an index I, built on a dataset D, a vector x, a set of candidate vectors C, and a similarity threshold θ, the function CandVer-IDX returns the set P = {(x,y)} of true similar pairs.

Both proposed frameworks, MB-IDX and STR-IDX, rely on an indexing scheme IDX, and adapt it to the streaming setting by augmenting it with the time-filtering property. The difference between the two frameworks is in how this adaptation is done. MB-IDX uses IDX as a "black box": it uses time filtering in order to build independent instances of IDX indexes, and drops them when they become obsolete. Conversely, for STR-IDX we opportunely modify the indexing method IDX by directly applying time filtering.

The presentation of the STR-IDX framework is deferred to the next section, after we present the details of the specific indexing methods IDX we use. The MB-IDX framework is presented below, as we only need to know the specification of an index IDX, not its internal workings. In particular, we only assume that the three primitives, IndConstr-IDX, CandGen-IDX, and CandVer-IDX, are available.

The MB-IDX framework works in time intervals of duration τ. During the k-th time interval it reads all vectors from the stream, and stores them in a buffer W. At the end of the time interval, it invokes IndConstr-IDX to report all similar pairs in W, and to build an index I on W. During the (k+1)-th time interval, the buffer W is reset and used to store the new vectors from the stream. At the same time, MB-IDX queries I with each vector x read from the stream, to find similar pairs between x and vectors in the previous time interval. Similar pairs are computed by invoking the functions CandGen-IDX and CandVer-IDX. At the end of the (k+1)-th time interval, the index I is replaced with a new index on the vectors in the (k+1)-th time interval.

MB-IDX guarantees that all pairs of vectors whose time difference is smaller than τ are tested for similarity. Thus, due to the time-filtering property, it returns the complete set of similar pairs. Algorithm 1 shows pseudocode for MB-IDX.

Algorithm 1: MB-IDX (MiniBatch with index IDX)
input : Data stream S, threshold θ, decay λ
output: All pairs x, y ∈ S s.t. sim_∆t(x,y) ≥ θ
1  I ← ∅
2  t0 ← 0; t ← 0
3  τ ← λ^(−1) log θ^(−1)
4  while true do
5    W ← ∅
6    t0 ← t0 + τ
7    while t ≤ t0 + τ do
8      x ← read(S)
9      t ← t(x)
10     C ← CandGen-IDX(I, x, θ)
11     P ← CandVer-IDX(I, x, C, θ)
12     report ApplyDecay(P, λ)
13     W ← W ∪ {x}
14   (I, P) ← IndConstr-IDX(W, θ)
15   report ApplyDecay(P, λ)

One drawback of MB-IDX is that it reports some similar pairs with a delay. In particular, all similar pairs that span across two time intervals are reported after the end of the first interval. In applications that require reporting similar pairs as soon as both vectors are present, this behavior is undesirable. Moreover, to guarantee correctness, MB-IDX needs to test pairs of vectors whose time difference is as large as 2τ, which can be pruned only after they are reported by the indexing scheme IDX, thus wasting computational power.
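The windowed structure of the MiniBatch framework can be sketched in Python. This is a simplified illustration, not the paper's MB-IDX: the black-box index is replaced here by a brute-force stand-in (`BruteIndex`, a hypothetical name), and pairs within a window are reported only when the window closes, which mirrors the reporting delay discussed above:

```python
import math

def dot(x, y):
    """Dot-product of two sparse vectors given as {coordinate: value} dicts."""
    return sum(v * y.get(j, 0.0) for j, v in x.items())

class BruteIndex:
    """Stand-in for a black-box apss index: candidate generation and
    verification are done by exact brute force, for illustration only."""
    def __init__(self, window, theta):
        self.window = list(window)          # [(id, vector, timestamp)]
        self.theta = theta
    def query(self, q):
        return [(yid, dot(q, y), ty) for yid, y, ty in self.window
                if dot(q, y) >= self.theta]

def minibatch_self_join(stream, theta, lam):
    """Sketch of the MB loop: buffer vectors for a window of duration tau;
    on window close, report within-window pairs and rebuild the index;
    meanwhile, query the previous window's index for cross-window pairs.
    Reported similarities include the time-decay factor."""
    tau = math.log(1.0 / theta) / lam
    results, prev_index, window, w_end = [], None, [], tau
    for vid, x, t in stream:
        while t > w_end:                     # close the current window
            for i, (aid, a, ta) in enumerate(window):
                for bid, b, tb in window[i + 1:]:
                    s = dot(a, b) * math.exp(-lam * abs(ta - tb))
                    if s >= theta:
                        results.append((aid, bid, s))
            prev_index = BruteIndex(window, theta)
            window = []
            w_end += tau
        if prev_index is not None:           # pairs spanning two windows
            for yid, s, ty in prev_index.query(x):
                sd = s * math.exp(-lam * abs(t - ty))
                if sd >= theta:
                    results.append((yid, vid, sd))
        window.append((vid, x, t))
    return results
```

Note how a pair whose raw cosine similarity exceeds θ can still be discarded once the decay factor for its arrival-time gap is applied, exactly the pruning that the time-dependent similarity enables.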
5. FILTERING FRAMEWORK

We now review the main indexing schemes used for the apss problem. For each indexing scheme IDX, we describe its three phases (IC, CG, CV), and discuss how to adapt it to the streaming setting, giving the STR-IDX algorithm. One of our main contributions is the adaptation of well-known algorithms for the batch case to the streaming setting. To make the paper self-contained, in each of the following sub-sections we first present the indexing schemes in the static case (state of the art) and how they are used in the MB framework, and then we describe their adaptation to the STR framework (the contribution of this paper). Furthermore, Section 5.4 describes our improved ℓ2-based indexing scheme.

For all the indexing schemes we present, except INV, we provide pseudocode. Due to lack of space, and also to highlight the differences among the indexing schemes, we present our pseudocode using a color convention:
− L2AP index: all lines are included (black, red, and green).
− AP index: red lines are included; green lines excluded.
− L2 index: green lines are included; red lines excluded.
5.1 Inverted index

The first scheme is a simple inverted index (INV) with no index-pruning optimizations. As with all indexing schemes, the main observation is that for two vectors to be similar, they need to have at least one common coordinate. Thus, two similar vectors can be found together in some list I_j.

In the IC phase, for each new vector x, all its coordinates x_j are added to the index. In the CG phase, given the query vector x, we use the index I to retrieve candidate vectors similar to x. In particular, the candidates y are all vectors in the posting lists where x has non-zero coordinates.

The result of the CG phase is the exact similarity score between x and each candidate vector y. Therefore, the CV phase simply applies the similarity threshold θ and reports the true similar pairs.

MB framework (MB-INV): The functions IndConstr-INV, CandGen-INV, and CandVer-INV, needed to specify the MB-INV algorithm, follow directly from the discussion above. Since they are rather straightforward, we omit further details.
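The incremental INV scheme described above can be sketched compactly: for each incoming vector, the posting lists of its non-zero coordinates already yield exact accumulated dot-products, so candidate verification reduces to applying the threshold. A minimal sketch (our own function name, not the paper's code):

```python
from collections import defaultdict

def inv_self_join(vectors, theta):
    """Static all-pairs similarity self-join with a plain inverted index
    (the INV scheme). For each vector: CG accumulates dot-products against
    earlier vectors via the posting lists, CV applies the threshold, and
    IC adds the vector's non-zero coordinates to the index."""
    index = defaultdict(list)         # j -> [(vector id, value)]
    results = []
    for xid, x in enumerate(vectors):
        scores = defaultdict(float)   # CG: partial dot-products (here exact)
        for j, xj in x.items():
            for yid, yj in index[j]:
                scores[yid] += xj * yj
        for yid, s in scores.items(): # CV: threshold only, scores are exact
            if s >= theta:
                results.append((yid, xid, s))
        for j, xj in x.items():       # IC: index every non-zero coordinate
            index[j].append((xid, xj))
    return results
```

Since INV indexes every coordinate and verifies every vector sharing a coordinate with the query, it generates no false negatives but does no pruning; the schemes that follow reduce both the index size and the candidate set.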
STR framework (STR-INV): The description of INV given above considers the apss problem, i.e., a static dataset D. We now consider a data stream S, where each vector x in the stream is associated with a timestamp t(x). We apply the INV scheme by adding the data items to the index in the order they appear in the stream. The lists I_j of the index keep pairs (ι(x), x_j) ordered by timestamps t(x). Maintaining a time-respecting order in the lists is easy: since we process vectors in timestamp order, we can just append new vector coordinates at the end of the lists.

Before adding a new item x to the index, we use the index to retrieve all the earlier vectors y that are ∆t-similar to x. Due to the time-filtering property, vectors that are older than τ cannot be similar to x. This observation has two implications: (i) we can stop retrieving candidate vectors from a list I_j upon encountering a vector that is older than τ; and (ii) when encountering such a vector, the part of the list preceding it can be pruned away.

Algorithm 2: IndConstr-L2AP
input : Dataset D, threshold θ
output: Similarity index I on D, and set of pairs P = {(x,y) | sim(x,y) ≥ θ}
1  I ← ∅
2  P ← ∅
3  foreach x ∈ D do
4    C ← CandGen-L2AP(I, x, θ)
5    P ← P ∪ CandVer-L2AP(I, x, C, θ)
6    b1 ← 0
7    b2 ← 0; bt ← 0
8    foreach j = 1...d s.t. x_j > 0 do
9      pscore ← min{b1, b2}
10     b1 ← b1 + x_j · min{m_j, vm_x}
11     bt ← bt + x_j^2; b2 ← √bt
12     if min{b1, b2} ≥ θ then
13       if R[ι(x)] = ∅ then
14         R[ι(x)] ← x′_j
15       Q[ι(x)] ← pscore
16       I_j ← I_j ∪ {(ι(x), x_j, ||x′_j||)}
17 return (I, P)

Algorithm 3: CandGen-L2AP
input : Index I, vector x, threshold θ
output: Accumulated score array C for candidate vectors (candidate set is C = {y ∈ D | C[ι(y)] > 0})
1  C ← ∅
2  sz1 ← θ / vm_x
3  rs1 ← dot(x, m̂)
4  rs2 ← 1; rst ← 1
5  foreach j = d...1 s.t. x_j > 0 do            // reverse order
6    foreach (ι(y), y_j, ||y′_j||) ∈ I_j do
7      remscore ← min{rs1, rs2}
8      if |y| · vm_y ≥ sz1 then
9        if C[ι(y)] > 0 or remscore ≥ θ then
10         C[ι(y)] ← C[ι(y)] + x_j · y_j
11         l2bound ← C[ι(y)] + ||x′_j|| · ||y′_j||
12         if l2bound < θ then
13           C[ι(y)] ← 0
14   rs1 ← rs1 − x_j · m̂_j
15   rst ← rst − x_j^2; rs2 ← √rst
16 return C

Algorithm 4: CandVer-L2AP
input : Index I, vector x, candidate vector array C, threshold θ
output: Set of pairs P = {(x,y) | sim(x,y) ≥ θ}
1  P ← ∅
2  foreach y s.t. C[ι(y)] > 0 do
3    ps1 ← C[ι(y)] + Q[ι(y)]
4    ds1 ← C[ι(y)] + min{vm_x · Σy′, vm_y′ · Σx}
5    sz2 ← C[ι(y)] + min{|x|, |y′|} · vm_x · vm_y′
6    if ps1 ≥ θ and ds1 ≥ θ and sz2 ≥ θ then
7      s ← C[ι(y)] + dot(x, y′)
8      if s ≥ θ then
9        P ← P ∪ {(x, y, s)}
10 return P

5.2 All-pairs indexing scheme

The AP [7] scheme improves over the simple INV method by reducing the size of the index. When using AP, not all the coordinates of a vector x need to be indexed, as long as it can be guaranteed that x and all its similar vectors y share at least one common coordinate in the index I.

Similarly to the INV scheme, AP incrementally builds an index by processing one vector at a time. In the IC phase (function IndConstr-AP, shown as Algorithm 2, including red lines and excluding green), for each new vector x, the algorithm scans its coordinates in a predefined order. It keeps a score pscore, which represents an upper bound on the similarity between a prefix of x and any other vector in the dataset. To compute the upper bound, AP uses the vector m, which keeps the maximum of each coordinate in the dataset. As long as pscore is smaller than the threshold θ, given that the similarity of x to any other vector cannot exceed θ, the coordinates of x scanned so far can be omitted from the index without the danger of missing any similar pair. As soon as pscore exceeds θ, the remaining suffix of x is added to the index, and the prefix x′ is saved in the residual direct index R.

In the CG phase (function CandGen-AP, shown as Algorithm 3, including red lines and excluding green), AP uses a lower bound sz1 for the size of any vector y that is similar to x, so that vectors that have too few non-zero entries can be ignored (line 8). Additionally, it uses a variable rs1 to keep an upper bound on the similarity between x and any other vector y. This upper bound is computed by using the residual direct index R and the already-accumulated dot-product, and is updated as the algorithm processes the different posting lists. When the upper bound becomes smaller than θ, vectors that have not already been added to the set of candidates can be ignored (line 9). The array C holds the candidate vectors, together with the partial dot-product that is due to the indexed part of the vectors.

Finally, in the CV phase (function CandVer-AP, shown as Algorithm 4), we compute the final similarities by using the residual index R, and report the true similar pairs.

The streaming versions of AP, in both the MB and STR frameworks, are not efficient in practice, and thus, we omit further details. Instead, we proceed by presenting the next scheme, the L2AP index, which is a generalization of AP.
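The prefix-filtering split that AP performs in the IC phase can be illustrated in isolation. The sketch below (a hypothetical helper, using only the b1-style bound and assuming the coordinate maxima m of the dataset are known in advance) decides, coordinate by coordinate, which part of a vector stays in the residual and which part must be indexed:

```python
def ap_index_split(x, m, theta):
    """Prefix-filtering split for one sparse vector x = {j: x_j}.
    Scanning coordinates in a fixed order, accumulate the bound
    b1 += x_j * min(m_j, vm_x); a coordinate must be indexed once the
    bound reaches theta, since only then could a similar vector overlap
    x exclusively on the scanned prefix. A sketch of the AP (red-line)
    variant only, without the l2 bound."""
    vmx = max(x.values())
    b1 = 0.0
    indexed, residual = {}, {}
    for j in sorted(x):                    # predefined coordinate order
        b1 += x[j] * min(m.get(j, 0.0), vmx)
        if b1 >= theta:
            indexed[j] = x[j]              # suffix: goes to posting lists
        else:
            residual[j] = x[j]             # prefix: kept only in R
    return indexed, residual
```

Any vector similar to x must share at least one indexed coordinate: a pair overlapping only on the un-indexed prefix has similarity bounded by b1 < θ, which is exactly the invariant the m vector protects (and the invariant that breaks in the streaming setting, as discussed in Section 5.3).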
It thecandidatevectors,togetherwiththepartialdot-product keeps a score pscore, which represents an upper bound on that is due to the indexed part of the vectors. the similarity between a prefix of x and any other vector Finally, in the CV phase (CandVer-AP function, shown as in the dataset. To compute the upper bound, AP uses the Algorithm4),wecomputethefinalsimilaritiesbyusingthe vector m, which keeps the maximum of each coordinate in residual index R, and report the true similar pairs. thedataset. Aslongaspscoreissmallerthanthethreshold The streaming versions of AP, in both MB and STR frame- θ, given that the similarity of x to any other vector cannot works,arenotefficientinpractice,andthus,weomitfurther exceedθ,thecoordinatesofxscannedsofarcanbeomitted details. Instead,weproceedpresentingthenextscheme,the from the index without the danger of missing any similar L2AP index, which is a generalization of AP. pair. As soon as pscore exceeds θ, the remaining suffix of 5 5.3 L2-basedindexingscheme Algorithm 5: STR-IDX (Streaming with index IDX) L2AP [3] is the state-of-the-art for computing similarity input : Data stream S, threshold θ, decay λ self-join. The scheme uses tighter (cid:96)2-bounds, which reduce output: All pairs x,y∈S s.t. sim∆t(x,y)≥θ thesizeoftheindex,butalsothenumberofgeneratedcan- 1 I←∅ didates and the number of fully-computed similarities. 2 P←∅ TheL2APschemeprimarilyleveragestheCauchy-Schwarz 3 while true do inequality,whichstatesthatdot(x,y)≤||x||||y||. Thesame 4 x ← read(S) boundapplieswhenconsideringaprefixofaqueryvectorx. 5 (I,P)←IndConstr-IDX-STR(I,x,θ,λ) Since we consider unit-normalized vectors (||y||=1), 6 report P dot(x(cid:48),y)≤||x(cid:48)||||y||≤||x(cid:48)||. at time t. Its j-th coordinate mλ(t) is given by (cid:98)j The previous bound produces a tighter value for pscore, (cid:110) (cid:111) whichisusedintheICphasetoboundthesimilarityofthe mλ(t)= max x e−λ|t−t(x)| . 
(cid:98)j j vector currently being indexed, to the rest of the dataset. x∈S t(x)≤t In particular, L2AP sets pscore=min{pscore ,||x(cid:48)||}. Additionally, the L2AP index stores the valAuPe of pscore To simplify the notation, we omit the dependency from computedwhenxisindexed. L2APkeepsthesevaluesinan time when obvious from the context, and write simply m(cid:98)λ. array Q, index by ι(x), as shown in line 15 of Algorithm 2. An upper bound on the time-dependent cosine similarity The L2AP index also stores in the index the magnitude of of any newly arrived vector x at time t can be obtained by the prefix of each vector x, up to coordinate j. That is, d the entries of the posting lists are now triples of the type remscore(t)=dot(x,mλ)=(cid:88)x mλ. (cid:98) j (cid:98)j (ι(x),x ,||x(cid:48)||). j j j=1 Both pieces of additional information, the array Q and Conversely,thevectormisusedintheICphasetodecide the values ||x || stored in the posting lists, are used during j which coordinates to add to the index. Its purpose is to the CG phase to reduce the number of candidates. guarantee that any two similar vectors in the stream share In the CG phase, for a given query vector x, we scan its atleastacoordinate. Assuch,itneedstolookatthefuture. coordinatesbackwards,i.e.,inreverseorderwithrespectto Inastaticsetting,andinMB,maximumvaluesinthewhole theoneusedduringindexing,andweaccumulatesimilarity dataset can easily be accumulated beforehand. However, scores for suffixes of x. We keep a remscore bound on the in a streaming setting these maximum values need to be remaining similarity score, which combines the rs bound 1 keptonline. Thisdifferenceimpliesthatwhenthevectorm usedinAPandanew(cid:96) -basedrs boundthatusestheprefix 2 2 magnitude values (||y(cid:48)||) stored in the posting lists. The changes,theinvariantonwhichtheAPindexisbuiltislost, j and we need to restore it. We call this process re-indexing. 
remscore bound is an upper bound on the similarity of the At a first glance, it might seem straightforward to use prefix of the current query vector and any other vector in sim todecide whatto index, i.e., addingthedecayfactor the index, thus as long as remscore is smaller than θ, the ∆t toline10inAlgorithm2. However,thedecayfactorcauses algorithm can prune the candidate. mtochangerapidly,whichinturnleadstoalargernumber Pseudocode for the three phases of L2AP, IC, CG, and CV, of re-indexings. Re-indexing, as explained next, is an ex- is shown in Algorithms 2, 3, and 4, respectively, including pensive operation. Therefore, we opt to avoid applying the both red and green lines. More details on the scheme can time decay when computing the b bound with m. befoundintheoriginalpaperofAnastasiuandKarypis[3]. 1 Re-indexing (IC). For each coordinate x of a newly ar- MBframework(MB-L2AP): Asbefore,theMB-L2APalgorithm j rived vector x, one of three cases may occurr. isadirectinstantiationofthegenericAlgorithm1,usingthe − x =0: no action is needed. functionsthatimplementthethreedifferentphasesofL2AP. j − 0<x ≤m : the posting list I can be pruned via time STR framework (STR-L2AP): Todescribethemodifications j j j filtering, as explained next. required for adapting the L2AP scheme in the streaming framework, we need to introduce some additional notation. − xj >mj: theupperboundvectormneedstobeupdated First, we assume that the input is a stream S of vectors. withthenewvaluejustread,i.e.,mj←xj. Inthiscase, Themaindifferenceisthatnowbothmandmarefunction parts of the pruned vectors may need to be re-indexed. (cid:98) of time, as the maximum values in the stream evolve over When m is updated (x > m ) the invariant of prefix j j time. Weadapttheuseofthetwovectorstothestreaming filteringdoesnotholdanymore. That is, theremaybevec- case differently. tors similar to x that are not matched in I. 
To restore the invariant, we re-scan R (the un-indexed parts of vectors within horizon τ). If the similarity between a vector y ∈ R and m has increased, we will reach the threshold θ while scanning the prefix y′ (before reaching its end). Therefore, we just need to index those coordinates y_{p′} < y_j ≤ y_p, where p′ is the newly computed boundary.

Given that the similarity can increase only for vectors that have a non-zero value in the dimensions of m that were updated, we can keep an inverted index of R to avoid scanning every vector: we use the updated dimensions of the max vector m to select the possible candidates for re-indexing, and then scan only those from the residual index R.

Recall that the vector m̂ is used in the CG phase to generate candidate vectors. Given that we are looking at vectors that are already indexed (i.e., in the past), it is possible to apply the definition of sim_∆t between the query vector x and m̂. In particular, given that the coordinates of m̂ originate from different vectors in the index, we can apply the decay factor to each coordinate of m̂ separately; m̂λ(t) is thus the worst-case indexed vector at time t, that is, the representation of the vector in the index that is most ∆t-similar to any vector in S.

Note that re-indexing inserts older vectors into the index. As the posting lists always append items at the end, re-indexing introduces out-of-order items in the lists. As explained in Section 6, the loss of the time-ordered property hinders one of the optimizations for pruning the index based on time filtering (explained next).

Time filtering (CG). To avoid continuously scanning the index, we adopt a lazy approach. The posting lists I_j are (partially) sorted by time, i.e., the newest items are always appended at the tail of the lists. Thus, a simple linear scan from the head of the lists is able to prune the lists lazily while accumulating the similarity in the CG phase. More specifically, when a new vector x arrives, we scan all the posting lists I_j such that x_j ≠ 0, in order to generate its candidates. While scanning the lists, we drop any item (ι(y), y_j, ||y′_j||) such that |t(x) − t(y)| > τ.

The main loop of STR-L2AP is shown in Algorithm 5. The algorithm simply reads each item x in the stream, adds x to the index, and computes the vectors that are similar to it, by calling IndConstr-L2AP-STR (Algorithm 6, including both red and green lines).

Algorithm 6: IndConstr-L2AP-STR
  input : Index I, vector x, threshold θ, decay λ
  output: Updated similarity index I including x, and set of pairs P = {(x, y) | sim_∆t(x, y) ≥ θ}
  1   b1 ← +∞; b2 ← +∞
  2   b1 ← 0
  3   b2 ← 0; bt ← 0
  4   C ← CandGen-L2AP-STR(I, x, θ, λ)
  5   P ← CandVer-L2AP-STR(I, x, C, θ, λ)
  6   foreach j = 1 … d s.t. x_j > 0 do
  7       pscore ← min{b1, b2}
  8       b1 ← b1 + x_j · min{m_j, vm_x}
  9       bt ← bt + x_j²; b2 ← √bt
  10      if min{b1, b2} ≥ θ then
  11          if R[ι(x)] = ∅ then
  12              R[ι(x)] ← x′_j
  13          I_j ← I_j ∪ {(ι(x), x_j, ||x′_j||)}
  14  Q[ι(x)] ← pscore
  15  return (I, P)

Algorithm 7: CandGen-L2AP-STR
  input : Index I, vector x, threshold θ, decay λ
  output: Accumulated score array C for candidate vectors (the candidate set is C = {y ∈ D | C[ι(y)] > 0})
  1   C ← ∅
  2   rs1 ← +∞; rs2 ← +∞
  3   rs1 ← dot(x, m̂λ)
  4   rs2 ← 1; rst ← 1
  5   foreach j = d … 1 s.t. x_j > 0 do        // reverse order
  6       foreach (ι(y), y_j, ||y′_j||) ∈ I_j s.t. ∆t_xy ≤ τ do
  7           remscore ← min{rs1, rs2 · e^(−λ∆t_xy)}
  8           if C[ι(y)] > 0 or remscore ≥ θ then
  9               C[ι(y)] ← C[ι(y)] + x_j · y_j
  10              l2bound ← C[ι(y)] + ||x′_j|| · ||y′_j|| · e^(−λ∆t_xy)
  11              if l2bound < θ then
  12                  C[ι(y)] ← 0
  13      rs1 ← rs1 − x_j · m̂λ_j
  14      rst ← rst − x_j²; rs2 ← √rst
  15  return C

Algorithm 8: CandVer-L2AP-STR
  input : Index I, vector x, candidate array C, threshold θ, decay λ
  output: Set of pairs P = {(x, y) | sim_∆t(x, y) ≥ θ}
  1   P ← ∅
  2   foreach y s.t. C[ι(y)] > 0 do
  3       ps1 ← (C[ι(y)] + Q[ι(y)]) · e^(−λ∆t_xy)
  4       ds1 ← (C[ι(y)] + min{vm_x · Σ_y′, vm_y′ · Σ_x}) · e^(−λ∆t_xy)
  5       sz2 ← (C[ι(y)] + min{|x|, |y′|} · vm_x · vm_y′) · e^(−λ∆t_xy)
  6       if ps1 ≥ θ and ds1 ≥ θ and sz2 ≥ θ then
  7           s ← C[ι(y)] + dot(x, y′)
  8           if s ≥ θ then
  9               P ← P ∪ {(x, y, s)}
  10  return P

(In the listings, the red lines are the AP-based bounds and the green lines are the ℓ2-based bounds referenced throughout; the colors are not reproduced in this rendering.)

5.4 Improved L2-based indexing scheme

The last indexing scheme we present, L2, is an adaptation of L2AP optimized for streaming data. The idea behind the improved index is based on the observation that L2AP combines a number of different bounds: the bounds inherited from the AP scheme (e.g., b1 for index construction, and rs1 for candidate generation), and the new ℓ2-based bounds introduced by Anastasiu and Karypis [3] (e.g., b2 for index construction, and rs2 and l2bound for candidate generation).

We have observed that, in general, the ℓ2-based bounds are more effective than the AP-based bounds: in almost all cases, the ℓ2-based bounds are the ones that trigger. This observation is also verified by the results of Anastasiu and Karypis [3]. Furthermore, as can easily be seen by inspecting the red and green lines of Algorithms 2, 3, and 4, while the AP bounds use statistics of the data in the index, the ℓ2-based bounds depend only on the vector being indexed. This implies that by using only the ℓ2-based bounds, one does not need to maintain the worst-case vector m(t), and thus no re-indexing is required.

Hence, the L2 index uses only the ℓ2-based bounds and discards the AP-based ones. The static version of L2 (used in MB) is shown in Algorithms 2, 3, and 4, including the green lines and excluding the red ones. The main loop for the streaming case is IndConstr-L2-STR, shown as Algorithm 6, including the green lines and excluding the red ones.

6. IMPLEMENTATION

This section presents additional details and optimizations for both of our algorithmic frameworks, MB and STR.

6.1 Minibatch

Recall that in the MB framework, we construct the index for the set of data items that arrive within horizon τ. However, the L2AP scheme requires that we know not only the data
used to construct the index, but also the data used to query it, e.g., see Algorithm 2, line 10. This assumption does not hold for MB, as shown in Algorithm 1, because the data used to query the index arrive in the stream after the index has been constructed.

To address this problem, we modify the MB framework as follows. At any point in time we consider two windows, W_{k−1} and W_k, each of size τ, where W_k is the current window and W_{k−1} is the previous one. The input data are accumulated in the current window W_k, and the vector m is maintained. When we arrive at the end of the current window, we compute the global vector m, defined over both W_{k−1} and W_k, by combining the m vectors of the two windows. We then use the data in W_{k−1} to build the index, and the data in W_k to query the index built over W_{k−1}. When moving to the next window, W_{k+1} becomes the current one, W_k becomes the previous one, and W_{k−1} is dropped.

6.2 Streaming

Next, we discuss implementation issues related to the streaming framework.

Variable-size lists. The streaming framework introduces posting lists of variable size. The size of the posting lists not only increases but, due to time filtering, may also decrease. In order to avoid many small memory (de)allocations, we implement posting lists using a circular byte buffer. When the buffer becomes full we double its capacity, while when its size drops below 1/4 of the capacity we halve it.

Time filtering. In the INV and L2 schemes it is easy to maintain the posting lists in time-increasing order. This ordering enables a simple optimization: by scanning the posting lists backwards — from the newest to the oldest item — during candidate generation, it is possible to stop scanning and truncate the posting list as soon as we find the first item beyond the horizon τ. Truncating the circular buffer requires constant time (if no shrinking occurs). In L2AP it is not possible to keep the posting lists in time order, as re-indexing may introduce out-of-order items. Thus, L2AP scans the lists forward and needs to go through all the expired items to prune the posting list.

Data structures. The residual direct index R and the Q array in L2AP and L2 need to be continuously pruned. To support the required operations for these data structures, we implement them using a linked hash-map, which combines a hash-map for fast retrieval with a linked list for sequential access. The sequential access follows the order in which the data items are inserted in the data structure, which is also the time order. Maintaining these data structures requires amortized constant time, and memory linear in the number of vectors that arrive within a time interval τ.

Applying decay-factor pruning (λ). Central to adapting the indexing methods to the streaming setting is the use of the decay factor λ for pruning. In order to make decay-factor pruning effective, we apply the following principles.

Index construction: As the decay factor is used to prune candidates from the data seen in the past, decay-factor pruning is never applied during the index-construction phase.

Candidate generation: Decay-factor pruning is applied during candidate generation when computing the remscore bound (line 7 of Algorithm 7), as well as when applying the early ℓ2 pruning (line 10). For L2AP, it is also applied when computing rs1 (line 13), with a different decay factor for each coordinate (as specified in the definition of m̂λ).

Candidate verification: In the candidate-verification phase, decay-factor pruning is easily applied when computing all of the bounds (lines 3–5 of Algorithm 8).

As a final remark, we note that the ℓ2 bounds depend only on the candidate and query vectors, not on the data seen earlier in the stream. This property makes L2 very well suited to the streaming setting.

7. EXPERIMENTAL EVALUATION

The methods presented in the previous sections combine four indexing schemes with two algorithmic frameworks (mini-batch vs. streaming). Altogether, they lead to a large number of trade-offs between pruning power and index maintenance. In order to better understand the behavior and evaluate the efficiency of each method, we perform an extensive experimental study. The objective of our evaluation is to answer the following questions experimentally:

Q1: Which framework performs better, STR or MB?
Q2: How effective is L2, compared to L2AP and INV?
Q3: What are the effects of the parameters λ and θ?

Datasets. We test the proposed algorithms on several real-world datasets. RCV1 is the Reuters Corpus Volume 1 dataset of newswires [18], WebSpam is a corpus of spam web pages [25], Blogs is a collection of one month of WordPress blog posts crawled in June 2015, and Tweets is a sample of tweets collected in June 2009 [29]. All datasets are available online in text format,4 while for the experiments we use a more compact and faster-to-read binary format; the text-to-binary converter is also included in the source code we provide. These datasets exhibit a wide variety in their characteristics, as summarized in Table 1, and allow us to evaluate our methods under very different scenarios. Specifically, we see that the density of the four datasets varies greatly. With respect to timestamps, items in WebSpam and RCV1 are assigned an artificial timestamp, sampled from a Poisson and a sequential arrival process, respectively. For Blogs and Tweets the publication time of each item is available, and we use it as the timestamp.

Table 1: Datasets used in the experimental evaluation. n: number of vectors; m: number of dimensions; Σ|x|: total number of non-zero coordinates; ρ = Σ|x|/(nm): density; |x| = Σ|x|/n: average number of non-zero coordinates; and type of timestamps.

  Dataset    n         m        Σ|x|    ρ (%)   |x|      Timestamps
  WebSpam    350000    680715   1305M   0.55    3728.00  poisson
  RCV1       804414    43001    61M     0.18    75.72    sequential
  Blogs      2532437   356043   356M    0.04    140.40   publishing date
  Tweets     18266589  1048576  173M    0.001   9.46     publishing date

Algorithms. We test the two algorithmic frameworks, STR and MB, with the three index variants, INV, L2AP, and L2, on all four datasets. For the similarity threshold θ we explore a range of values in [0.5, 0.99], which is typical for the apss problem [3, 7], while for the time-decay factor λ we use exponentially increasing values in the range [10^-4, 10^-1]. Our code also includes an implementation of AP for MB, but in preliminary experiments we found it much slower than L2AP, so we omit it from the set of indexing strategies under study. Some configurations are very expensive in terms of time or memory, and we are unable to run some of the algorithms. We set a timeout of 3 hours for each experiment, and we abort the run if this timeout is exceeded, or if the JVM crashes because of lack of memory. In all cases of failure during our experiments, MB fails due to the timeout, while STR fails because of its memory requirements.

4 All datasets will be available at time of publication.

Setting. We run all the experiments on a computer with a 3.30GHz CPU, 32GB of RAM (of which 16GB are allocated to the JVM heap), and a 500GB SATA disk. All experiments run sequentially on a single core of the CPU. We warm up the disk cache for each dataset by performing one run of the algorithm before taking running times. Times are averaged over three runs.

Table 2: Fraction of the 24 configurations of (θ, λ) that successfully terminate within the allowed time budget (closer to 1.00 is better).

                     MB                   STR
  Dataset    INV    L2AP   L2      INV    L2AP   L2
  WebSpam    1.00   1.00   1.00    1.00   0.83   0.96
  RCV1       1.00   1.00   1.00    1.00   0.96   1.00
  Blogs      0.25   0.25   0.25    1.00   0.96   1.00
  Tweets     0.25   0.25   0.25    1.00   0.96   0.96

[Figure 2 here: Entries(STR)/Entries(MB) as a function of τ (from 10^0 to 10^4), for WebSpam and RCV1.]
Figure 2: Ratio of index entries traversed during CG for STR compared to MB. For small τ, the ratio tends to one, while for larger horizons τ, STR needs to traverse only 65% of the entries compared to MB.

7.1 Results

Q1 (MB vs. STR): Of the four datasets we use, MB successfully runs with all configurations on only two of them (RCV1 and WebSpam). As can be seen from Table 1, Blogs and Tweets are the largest datasets, and MB is not able to scale to these sizes. On the other hand, STR is able to run with all configurations on all four datasets. The overhead of MB becomes too large when the horizon τ becomes small enough, as a new index needs to be initialized too frequently. Table 2 shows a summary of the outcomes of the experimental evaluation for the various configurations. Therefore, we restrict this comparison to RCV1 and WebSpam.

Most of the time of STR and MB, according to our profiling results, is spent in the CG phase while scanning the posting lists. Hence, we first compare the algorithms on the number of posting entries traversed. Figure 2 shows that STR usually does less total work, and traverses around 65% of the entries traversed by MB. The figure shows the ratio for L2, but the other indexing strategies show the same trend (plots are omitted for brevity).

We now turn our attention to running time. Figures 3 and 4 compare the running times of STR and MB on RCV1 and WebSpam, for all configurations on which both are able to run. The decay factor λ varies with the columns of the grid, and the indexing scheme with the rows. The two datasets present a different picture.

[Figure 3 here: a 3×4 grid of plots — rows: indexing scheme ∈ {INV, L2AP, L2}; columns: λ ∈ {0.0001, 0.001, 0.01, 0.1}; running time (s) vs. θ ∈ [0.5, 0.99], one curve each for MB and STR.]
Figure 3: Time taken by the MB and STR algorithms as a function of the similarity threshold θ, on the RCV1 dataset.

[Figure 4 here: same grid layout as Figure 3.]
Figure 4: Time taken by the MB and STR algorithms as a function of the similarity threshold θ, on the WebSpam dataset.

Clearly, for RCV1 (Figure 3) algorithm STR is faster than MB in most cases. More aggressive pruning reduces the difference, and in some cases with low θ (useful for recommender systems) algorithm STR can be up to 4 times faster than MB. L2AP is the exception, and for smaller τ its performance is subpar, as detailed next.

Conversely, for WebSpam (Figure 4) the MB algorithm has the upper hand in most cases, especially for larger decay factors (i.e., shorter horizons). The different behavior is caused by the higher density of WebSpam, whose average number of non-zero coordinates per vector is almost two orders of magnitude larger than that of RCV1 (see Table 1). For STR, this high density renders the lazy update approach inefficient, as a large number of posting lists need to be updated and pruned for each vector, especially for short horizons. MB can simply throw away old indexes, rather than mending them.

Summarizing our findings related to the MB-vs.-STR study, we conclude that STR is more efficient in most cases, as it is able to run on all datasets, while MB becomes too slow for larger datasets. Under the very specific conditions of high density, MB has a small advantage over STR.

Q2 (Indexing schemes): Now that we have established that STR scales better than MB, we focus on the former algorithm: the rest of the experiments are performed on STR. We turn our attention to the effectiveness of pruning, and compare L2 with INV and L2AP. For brevity, we show results for RCV1 only; the other datasets follow the same trends.

Figure 5 shows the comparison in time. Several interesting patterns emerge. First, L2 is almost always the fastest indexing scheme. Second, INV works well for short horizons, where the overheads incurred by pruning are not compensated by a large reduction in candidates. Instead, for larger horizons (low θ and λ) the gains of pruning are more pronounced, thus making INV a poor choice. Finally, L2AP is slightly slower than L2 in most cases, and much slower when the horizon is short. Even though L2 uses a subset of the bounds of L2AP, the overhead of computing the AP bounds offsets any possible gain in pruning.

In addition, L2AP may need to re-index residuals of vectors, and a shorter horizon causes more frequent re-indexings. Therefore, the two rightmost plots show an upward trend in L2AP, which is due to the re-indexing overhead. The overhead is so large that we were not able to run L2AP for θ = 0.99 and λ = 0.1. While there are possible practical workarounds (e.g., using a more lax bound to decrease the frequency of re-indexing), L2 always provides a better choice and does not require tinkering with the algorithm.

In general, a shorter horizon makes the other pruning strategies less relevant compared to time filtering, as can be seen from the progressive flattening of the curves from left to right as λ increases.

Figure 6 shows the comparison in the number of index entries traversed. Clearly, INV usually has the largest amount, due to the absence of pruning. The relative effectiveness of pruning for L2 is almost constant throughout the range of λ; however, the number of entries traversed decreases, and so does the importance of filtering the index (compared to the time filtering). This is in line with the behavior exhibited in Figure 5, where the difference in time decreases for larger λ.
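For reference, the semantics that all of these configurations compute can be stated in a few lines: report every pair of stream items whose time-decayed cosine sim_∆t(x, y) = dot(x, y) · e^(−λ∆t) reaches θ, where for unit-norm vectors no pair farther apart than τ = ln(1/θ)/λ can ever qualify. Below is a naive quadratic sketch in Python (the names are ours; this is the problem definition, not one of the indexed algorithms benchmarked above):

```python
import math

def sim_dt(x, y, dt, lam):
    """Time-decayed cosine of two unit-norm sparse vectors (dicts)."""
    dot = sum(v * y.get(k, 0.0) for k, v in x.items())
    return dot * math.exp(-lam * dt)

def naive_sssj(stream, theta, lam):
    """Quadratic reference for sssj: stream is a list of (t, vec) in
    arrival order; returns (t_old, t_new, score) for each similar pair."""
    tau = math.log(1.0 / theta) / lam   # beyond tau, e^(-lam*dt) < theta
    out, window = [], []
    for t, x in stream:
        # time filtering: drop items outside the horizon tau
        window = [(ty, y) for ty, y in window if t - ty <= tau]
        for ty, y in window:
            s = sim_dt(x, y, t - ty, lam)
            if s >= theta:
                out.append((ty, t, s))
        window.append((t, x))
    return out
```

The indexed algorithms (MB and STR with INV, L2AP, or L2) produce exactly this output while avoiding most of the quadratic work — which is what the entries-traversed and running-time comparisons above quantify.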