Streaming Algorithms for k-Means Clustering with Fast Queries

Yu Zhang*, Kanat Tangwongsan†, Srikanta Tirthapura‡

January 17, 2017

* Department of Electrical and Computer Engineering, Iowa State University, [email protected]
† Computer Science Program, Mahidol University International College, [email protected]
‡ Department of Electrical and Computer Engineering, Iowa State University, [email protected]

Abstract

We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. When compared with the current state-of-the-art, our methods provide a substantial improvement in the time to answer a query for cluster centers, while retaining the desirable properties of provably small approximation error and low space usage. Our algorithms are based on a novel idea of "coreset caching" that reuses coresets (summaries of data) computed for recent queries in answering the current clustering query. We present both provable theoretical results and detailed experiments demonstrating their correctness and efficiency.

1 Introduction

Clustering is a fundamental method for understanding and interpreting data. The goal of clustering is to partition input objects into groups or "clusters" such that objects within a cluster are similar to each other, and objects in different clusters are not. A popular formulation of clustering is k-means clustering. Given a set of points $S$ in a Euclidean space and a parameter $k$, the goal is to partition $S$ into $k$ "clusters" in a way that minimizes a cost metric based on the $\ell_2$ distance between points. The k-means formulation is widely used in practice.

We consider streaming k-means clustering, where the inputs to clustering are not all available at once, but arrive as a continuous, possibly unending sequence. The algorithm needs to maintain enough state to be able to incrementally update the clusters as more tuples arrive. When a query is posed, the algorithm is required to return $k$ cluster centers, one for each cluster within the data observed so far.

While there has been substantial prior work on streaming k-means clustering (e.g., [1, 2, 3, 4]), the major focus of prior work has been on optimizing the memory used by the streaming algorithm. In this respect, these works have been successful, achieving a provable guarantee on the approximation quality of clustering while using space polylogarithmic in the size of the input stream [1, 2]. However, for all these algorithms, when a query for cluster centers is posed, an expensive computation is needed at query time. This can be a serious problem for applications that need answers in (near) real time, such as network monitoring and sensor data analysis. Our work aims at designing a streaming clustering algorithm that significantly improves the clustering query runtime compared to the current state-of-the-art, while maintaining other desirable properties enjoyed by current algorithms, such as provable accuracy and limited memory.

To understand why current solutions have a high query runtime, let us review the framework used in current solutions for streaming k-means clustering. At a high level, the incoming data stream $S$ is divided into smaller "chunks" $S_1, S_2, \ldots$. Each chunk is summarized using a "coreset" (for example, see [5]). The resulting coresets may still not all fit into the memory of the processor, so multiple coresets are further merged recursively into higher-level coresets, forming a hierarchy of coresets, or a "coreset tree". When a query arrives, all active coresets in the coreset tree are merged together, and a clustering algorithm such as k-means++ [6] is applied on the result, outputting $k$ cluster centers. The query runtime is proportional to the number of coresets that need to be merged together. In prior algorithms, the total size of all these coresets could be as large as the memory of the processor itself, and hence the query runtime can be very high.
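To make this query-time bottleneck concrete, the following minimal Python sketch shows the query path of the merge-and-reduce framework just described. It is an illustration only: the weighted k-means++ seeding is a stand-in for the full algorithm of [6], and the coreset objects (with `points` and `weights` arrays) are our own assumed representation, not code from the paper.

```python
import numpy as np

def kmeans_pp_seed(points, weights, k, rng=np.random.default_rng(0)):
    # Stand-in for weighted k-means++ [6]: D^2-sampling of k centers.
    centers = [points[rng.integers(len(points))]]
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        p = weights * d2
        centers.append(points[rng.choice(len(points), p=p / p.sum())])
    return np.array(centers)

def answer_query_merge_all(active_coresets, k):
    # Standard framework: merge EVERY active coreset in the tree,
    # then cluster the merged summary. The merge touches the total
    # size of all active coresets -- potentially as large as the
    # algorithm's whole memory -- which is the query-time bottleneck.
    pts = np.vstack([c.points for c in active_coresets])
    wts = np.concatenate([c.weights for c in active_coresets])
    return kmeans_pp_seed(pts, wts, k)
```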
1.1 Our Contributions

We present algorithms for streaming k-means whose query runtime is substantially smaller than the current state-of-the-art, while maintaining the desirable properties of a low memory footprint and provable approximation guarantees on the result. Our algorithms are based on the idea of "coreset caching" which, to our knowledge, has not been used before in streaming clustering. The idea in coreset caching is to reuse coresets that have been computed during previous (recent) queries to speed up the computation of a coreset for the current query. This way, when a query arrives, it is not necessary to combine all coresets currently in memory; it is sufficient to merge only a coreset from a recent query (stored in the coreset cache) along with coresets of points that arrived after this query. We show that this leads to substantial savings in query runtime.

Name | Query cost (per point) | Update cost (per point) | Memory used | Coreset level returned at query after N batches
Coreset Tree (CT) | $O\left(\frac{kdm}{q} \cdot \frac{r\log N}{\log r}\right)$ | $O(kd)$ | $O\left(\frac{mdr\log N}{\log r}\right)$ | $\log_r N$
Cached Coreset Tree (CC) | $O\left(\frac{kdm}{q} \cdot r\right)$ | $O(kd)$ | $O\left(\frac{mdr\log N}{\log r}\right)$ | $2\log_r N$
Recursive Cached Coreset Tree (RCC) | $O\left(\frac{kdm\log\log N}{q}\right)$ | $O(kd\log\log N)$ | $O(md\,N^{1/8})$ | $O(1)$
Online Coreset Cache (OnlineCC) | usually $O(1)$, worst case $O\left(\frac{kdm}{q} \cdot r\right)$ | $O(kd)$ | $O\left(\frac{mdr\log N}{\log r}\right)$ | $2\log_r N$

Table 1: The accuracy and query cost of different clustering methods. $k$ is the number of centers desired, $d$ is the dimension of a data point, $m$ is the size of a coreset (in practice, this is a constant factor times $k$), $r$ is a parameter used for CC and CT denoting the "merge degree" of the coreset tree, and $q$ is the number of points between two queries. The "level" of a coreset is indicative of the number of recursive merges of prior coresets used to arrive at this coreset. The smaller the level, the more accurate the coreset. For example, a batch algorithm that sees the entire input can return a coreset at level 0.

Let $n$ denote the number of points observed in the stream so far, and let $N = \frac{n}{m}$, where $m$ is the size of a coreset, a parameter that is independent of $n$. Our main contributions are as follows:

• We present an algorithm "Cached Coreset Tree" (CC) whose query runtime is a factor of $O(\log N)$ smaller than the query runtime of a state-of-the-art method, "Coreset Tree" (CT),¹ while using similar memory and maintaining the same quality of approximation.

¹ CT is essentially the streamkm++ algorithm of [1] and [2], except that it has a more flexible rule for merging coresets.

• We present a recursive version of CC, "Recursive Cached Coreset Tree" (RCC), which provides more flexible tradeoffs among the memory used, the quality of approximation, and the query cost. For instance, it is possible to improve the query runtime by a factor of $O(\log N / \log\log N)$ and improve the quality of approximation, at the cost of greater memory.

• We present an algorithm OnlineCC, a combination of CC and a simple sequential streaming clustering algorithm (due to [7]), which provides further improvements in clustering query runtime while still maintaining provable clustering quality, as in RCC and CC.

• For all algorithms, we present proofs showing that the $k$ centers returned in response to a query form an $O(\log k)$ approximation to the optimal k-means clustering cost.
In other words, the quality is comparable to what we would obtain if we simply stored all the points seen so far in memory and ran an (expensive) batch k-means++ algorithm at query time.

• We present a detailed experimental evaluation. It shows that when compared with streamkm++ [1], a state-of-the-art method for streaming k-means clustering, our algorithms yield substantial speedups (5x-100x) in query runtime and in total time, and match the accuracy, for a broad range of query arrival frequencies.

Our theoretical results are summarized in Table 1.

1.2 Related Work

In the batch setting, when all input is available at the start of computation, Lloyd's algorithm [8], also known as the k-means algorithm, is a simple iterative algorithm for k-means clustering that has been widely used for decades. However, it does not have a provable approximation guarantee on the quality of the clusters. k-means++ [6] presents a method to determine the starting configuration for Lloyd's algorithm that yields a provable guarantee on the clustering cost. [9] proposes a parallelization of k-means++ called k-means||.

The earliest streaming clustering method, Sequential k-means (due to [7]), maintains the current cluster centers and applies one iteration of Lloyd's algorithm for every new point received. Because it is fast and easy to implement, Sequential k-means is commonly used in practice (e.g., Apache Spark MLlib [10]). However, it cannot provide any approximation guarantees [11] on the cost of clustering. BIRCH [12] is a streaming clustering method based on a data structure called the "CF Tree"; it returns cluster centers through agglomerative hierarchical clustering on the leaf nodes of the tree. CluStream [13] constructs "microclusters" that summarize subsets of the stream, and further applies a weighted k-means algorithm on the microclusters. STREAMLS [3] is a divide-and-conquer method based on repeated application of a bicriteria approximation algorithm for clustering. A similar divide-and-conquer algorithm based on k-means++ is presented in [2]. However, these methods have a high cost of query processing and are not suitable for continuous maintenance of clusters or for frequent queries. In particular, at query time they require merging multiple data structures, followed by an extraction of cluster centers, which is expensive.

[5] presents coresets of size $O\left(\frac{k\log n}{\epsilon^d}\right)$ for summarizing $n$ points for k-means, and also shows how to use the merge-and-reduce technique based on the Bentley-Saxe decomposition [14] to derive a small-space streaming algorithm using coresets. Further work [15, 16] has reduced the size of a k-means coreset to $O\left(\frac{kd}{\epsilon^6}\right)$. streamkm++ [1] is a streaming k-means clustering algorithm that uses the merge-and-reduce technique along with k-means++ to generate a coreset. Our work improves on streamkm++ with respect to query runtime.

Roadmap. We present preliminaries in Section 2, background for streaming clustering in Section 3, and then the algorithms CC, RCC, and OnlineCC in Section 4, along with their proofs of correctness and quality guarantees. We then present experimental results in Section 5.

2 Preliminaries

2.1 Model and Problem Description

We work with points from the d-dimensional Euclidean space $\mathbb{R}^d$ for integer $d > 0$. A point can have a positive integral weight associated with it; if unspecified, the weight of a point is assumed to be 1. For points $x, y \in \mathbb{R}^d$, let $D(x, y) = \|x - y\|$ denote the Euclidean distance between $x$ and $y$. For a point $x$ and a point set $\Psi \subseteq \mathbb{R}^d$, the distance of $x$ to $\Psi$ is defined as $D(x, \Psi) = \min_{\psi \in \Psi} \|x - \psi\|$.
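These two distances translate directly into code; a minimal numpy sketch (the function names are ours, not the paper's):

```python
import numpy as np

def dist(x, y):
    # D(x, y) = ||x - y||, the Euclidean distance in R^d
    return float(np.linalg.norm(x - y))

def dist_to_set(x, Psi):
    # D(x, Psi) = min over psi in Psi of ||x - psi||
    return min(dist(x, psi) for psi in Psi)
```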
Definition 1 (k-means clustering problem) Given a set $P \subseteq \mathbb{R}^d$ of $n$ points with an associated weight function $w: P \to \mathbb{Z}^+$, find a point set $\Psi \subseteq \mathbb{R}^d$, $|\Psi| = k$, that minimizes the objective function

$$\phi_\Psi(P) = \sum_{x \in P} w(x) \cdot D^2(x, \Psi) = \sum_{x \in P} \min_{\psi \in \Psi} \left( w(x) \cdot \|x - \psi\|^2 \right).$$

Streams: A stream $S = e_1, e_2, \ldots$ is an ordered sequence of points, where $e_i$ is the $i$-th point observed by the algorithm. For $t > 0$, let $S(t)$ denote the prefix $e_1, e_2, \ldots, e_t$. For $0 < i \le j$, let $S(i, j)$ denote the substream $e_i, e_{i+1}, \ldots, e_j$. Let $n$ denote the total number of points observed so far, and define $S = S(1, n)$ to be the whole stream observed until $e_n$.

We have written our analysis as if a query for cluster centers arrives every $q$ points. This parameter $q$ captures the query rate in the most basic terms; we note that our results on the amortized query processing time still hold as long as the average number of points between two queries is $q$. The reason is that the cost of answering each query does not depend on when the query arrives, and the total query cost is simply the number of queries times the cost of each query. Suppose instead that queries arrive according to a probability distribution such that the expected interval between two queries is $q$ points. Then the same results hold in expectation.

In the theoretical analysis of our algorithms, we measure performance in terms of both computational runtime and memory consumed. The computational cost can be divided into two parts: the query runtime, and the update runtime, which is the time to update internal data structures upon receiving new points. We typically consider the amortized processing cost, which is the average per-point processing cost, taken over the entire stream. We express the memory cost in terms of words, assuming that each point in $\mathbb{R}^d$ can be stored in $O(d)$ words.

2.2 The k-means++ Algorithm

Our algorithm uses as a subroutine the k-means++ algorithm [6], a batch algorithm for k-means clustering with provable guarantees on the quality of the objective function. Its properties are summarized below.

Theorem 2 (Theorem 3.1 in [6]) On an input set of $n$ points $P \subseteq \mathbb{R}^d$, the k-means++ algorithm returns a set $\Psi$ of $k$ centers such that $\mathbb{E}[\phi_\Psi(P)] \le 8(\ln k + 2) \cdot \phi_{OPT}(P)$, where $\phi_{OPT}(P)$ is the optimal k-means clustering cost for $P$. The time complexity of the algorithm is $O(nkd)$.

2.3 Coresets

Our clustering method builds on the concept of a coreset, a small-space representation of a weighted set of points that (approximately) preserves certain properties of the original set of points.

Algorithm 1: Stream Clustering Driver

    def StreamCluster-Update(H, p):    ▷ insert points into D in batches of size m
        H.n ← H.n + 1
        add p to H.C
        if |H.C| = m then
            H.D.Update(H.C, H.n/m)
            H.C ← ∅

    def StreamCluster-Query():
        C₁ ← H.D.Coreset()
        return k-means++(k, C₁ ∪ H.C)

Definition 3 (k-means Coreset) For a weighted point set $P \subseteq \mathbb{R}^d$, integer $k > 0$, and parameter $0 < \epsilon < 1$, a weighted set $C \subseteq \mathbb{R}^d$ is said to be a $(k, \epsilon)$-coreset of $P$ for the k-means metric if, for any set $\Psi$ of $k$ points in $\mathbb{R}^d$, we have

$$(1 - \epsilon) \cdot \phi_\Psi(P) \le \phi_\Psi(C) \le (1 + \epsilon) \cdot \phi_\Psi(P).$$

When $k$ is clear from the context, we simply say an $\epsilon$-coreset. In this paper we use the term "coreset" to mean a k-means coreset. For integer $k > 0$, parameter $0 < \epsilon < 1$, and weighted point set $P \subseteq \mathbb{R}^d$, we use the notation $\mathrm{coreset}(k, \epsilon, P)$ to mean a $(k, \epsilon)$-coreset of $P$. We use the following observations from [5].

Observation 4 ([5]) If $C_1$ and $C_2$ are each $(k, \epsilon)$-coresets for disjoint multisets $P_1$ and $P_2$ respectively, then $C_1 \cup C_2$ is a $(k, \epsilon)$-coreset for $P_1 \cup P_2$.

Observation 5 ([5]) If $C_1$ is a $(k, \epsilon)$-coreset for $C_2$, and $C_2$ is a $(k, \delta)$-coreset for $P$, then $C_1$ is a $(k, (1+\epsilon)(1+\delta) - 1)$-coreset for $P$.
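For concreteness, the objective of Definition 1, and the two-sided bound a coreset must satisfy (Definition 3), can be written in a few lines of numpy. This is a sketch with our own names; note that the check below can only refute the coreset property for one candidate $\Psi$, since Definition 3 quantifies over all size-$k$ sets $\Psi$:

```python
import numpy as np

def kmeans_cost(P, w, Psi):
    # phi_Psi(P) = sum over x in P of w(x) * D^2(x, Psi)   (Definition 1)
    d2 = np.min([((P - psi) ** 2).sum(axis=1) for psi in Psi], axis=0)
    return float((w * d2).sum())

def violates_coreset_bound(P, wP, C, wC, Psi, eps):
    # Tests (1-eps)*phi_Psi(P) <= phi_Psi(C) <= (1+eps)*phi_Psi(P)
    # for ONE candidate Psi; a (k, eps)-coreset must pass for all Psi.
    base = kmeans_cost(P, wP, Psi)
    return not ((1 - eps) * base <= kmeans_cost(C, wC, Psi) <= (1 + eps) * base)
```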
While our algorithms can work with any method for constructing coresets, one concrete construction due to [16] provides the following guarantees.

Theorem 6 (Theorem 15.7 in [16]) Let $0 < \delta < \frac{1}{2}$, and let $n$ denote the size of the point set $P$. There exists an algorithm to compute $\mathrm{coreset}(k, \epsilon, P)$ with probability at least $1 - \delta$. The size of the coreset is $O\left(\frac{kd + \log(1/\delta)}{\epsilon^6}\right)$, and the construction time is $O\left(ndk + \log^2\left(\frac{1}{\delta}\right)\log^2 n + |\mathrm{coreset}(k, \epsilon, P)|\right)$.

3 Streaming Clustering and Coreset Trees

To provide context for how the algorithms in this paper will be used, we describe a generic "driver" algorithm for streaming clustering. We also discuss the coreset tree (CT) algorithm. This is both an example of how the driver works with a specific implementation and a quick review of an algorithm from prior work that our algorithms build upon.

Algorithm 2: Coreset Tree Update

    ▷ input: bucket b
    def CT-Update(b):
        append b to Q₀
        j ← 0
        while |Q_j| ≥ r do
            U ← coreset(k, ε, ∪_{B ∈ Q_j} B)
            append U to Q_{j+1}
            Q_j ← ∅
            j ← j + 1

    def CT-Coreset():
        return ∪_j { ∪_{B ∈ Q_j} B }

3.1 Driver Algorithm

The "driver" algorithm (presented in Algorithm 1) is initialized with a specific implementation of a clustering data structure D and a batch size $m$. It internally keeps state inside an object $H$. It groups arriving points into batches of size $m$ and inserts them into the clustering data structure at the granularity of a batch. $H$ stores additional state: the number of points received so far in the stream, $H.n$, and the current batch of points, $H.C$. Subsequent algorithms in this paper, including CT, are implementations of the clustering data structure D.

3.2 CT: r-way Merging Coreset Tree

The r-way coreset tree (CT) turns a traditional batch algorithm for coreset construction into a streaming algorithm that works in limited space. Although the basic ideas are the same, our description of CT generalizes the coreset tree of Ackermann et al. [1], which is the special case when $r = 2$.

The Coreset Tree: A coreset tree Q maintains buckets at multiple levels. The buckets at level 0 are called base buckets, which contain the original input points. The size of each base bucket is specified by a parameter $m$. Each bucket above level 0 is a coreset summarizing a segment of the stream observed so far. In an r-way CT, level $\ell$ has between 0 and $r - 1$ (inclusive) buckets, each a summary of $r^\ell$ base buckets.

Initially, the coreset tree is empty. After observing $n$ points in the stream, there will be $N = \lfloor n/m \rfloor$ base buckets (level 0). Some of these base buckets may have been merged into higher-level buckets. The distribution of buckets across levels obeys the following invariant:

If $N$ is written in base $r$ as $N = (s_q s_{q-1} \cdots s_1 s_0)_r$, with $s_q$ being the most significant digit (i.e., $N = \sum_{i=0}^{q} s_i r^i$), then there are exactly $s_i$ buckets in level $i$.

How is a base bucket added? The process to add a base bucket is reminiscent of incrementing a base-r counter by one, where merging is the equivalent of transferring the carry from one column to the next. More specifically, CT maintains a sequence of sequences $\{Q_j\}$, where $Q_j$ is the set of buckets at level $j$. To incorporate a new bucket into the coreset tree, CT-Update, presented in Algorithm 2, first adds it at level 0. When the number of buckets at any level $i$ of the tree reaches $r$, these buckets are merged, using the coreset algorithm, to form a single bucket at level $i + 1$, and the process is repeated until there are fewer than $r$ buckets at all levels of the tree. An example of how the coreset tree evolves after the addition of base buckets is shown in Figure 1.

How to answer a query? The algorithm simply unions all the (active) buckets together, specifically $\bigcup_j \bigcup_{B \in Q_j} B$. Notice that the driver will combine this with a partial base bucket before deriving the k-means centers.
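Read alongside Algorithm 2, a compact Python rendering of the r-way coreset tree might look as follows. This is a sketch under the assumption that `reduce_to_coreset` wraps some coreset construction (e.g., the one from Theorem 6); the class and variable names are ours:

```python
class CoresetTree:
    """r-way merging coreset tree (CT). levels[j] plays the role of Q_j.
    Adding a base bucket mirrors incrementing a base-r counter: when a
    level fills up with r buckets, they merge into one bucket one level up."""

    def __init__(self, r, reduce_to_coreset):
        self.r = r
        self.reduce = reduce_to_coreset  # assumed: list of buckets -> coreset
        self.levels = [[]]

    def update(self, bucket):            # CT-Update(b)
        self.levels[0].append(bucket)
        j = 0
        while len(self.levels[j]) >= self.r:
            merged = self.reduce(self.levels[j])  # coreset of the union
            if j + 1 == len(self.levels):
                self.levels.append([])
            self.levels[j + 1].append(merged)
            self.levels[j] = []
            j += 1

    def coreset(self):                   # CT-Coreset(): all active buckets
        return [b for level in self.levels for b in level]
```

With `r = 2`, this specializes to the merging rule of the coreset tree of Ackermann et al. [1] described above.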
We present lemmas stating the properties of the CT algorithm, and we use the following definition in proving clustering guarantees.

Definition 7 (Level-$\ell$ Coreset) For $\ell \in \mathbb{Z}_{\ge 0}$, a $(k, \epsilon, \ell)$-coreset of a point set $P \subseteq \mathbb{R}^d$, denoted $\mathrm{coreset}(k, \epsilon, \ell, P)$, is as follows:

• The level-0 coreset of $P$ is $P$ itself.
• For $\ell > 0$, a level-$\ell$ coreset of $P$ is a coreset of the $C_i$'s (i.e., $\mathrm{coreset}(k, \epsilon, \cup_{i=1}^{t} C_i)$), where each $C_i$ is a level-$\ell_i$ coreset, $\ell_i < \ell$, of $P_i$ such that $\{P_j\}_{j=1}^{t}$ forms a partition of $P$.

We first determine the number of levels in the coreset tree after observing $N$ base buckets. Let the maximum level of the tree be denoted $\ell(Q) = \max\{j \mid Q_j \neq \emptyset\}$.

Lemma 8 After observing $N$ base buckets, $\ell(Q) \le \frac{\log N}{\log r}$.

Proof: As was pointed out earlier, for each level $j \ge 0$, a bucket in $Q_j$ is a summary of $r^j$ base buckets. Let $\ell^* = \ell(Q)$. After observing $N$ base buckets, the coverage of a bucket at level $\ell^*$ cannot exceed $N$, so $r^{\ell^*} \le N$, which means $\ell(Q) = \ell^* \le \frac{\log N}{\log r}$. □

Lemma 9 For a point set $P$, parameter $\epsilon > 0$, and integer $\ell \ge 0$, if $C = \mathrm{coreset}(k, \epsilon, \ell, P)$ is a level-$\ell$ coreset of $P$, then $C = \mathrm{coreset}(k, \epsilon', P)$, where $\epsilon' = (1+\epsilon)^{\ell} - 1$.

Proof: We prove this by induction, using the proposition $\mathcal{P}(\ell)$: for a point set $P$, if $C = \mathrm{coreset}(k, \epsilon, \ell, P)$, then $C = \mathrm{coreset}(k, \epsilon', P)$, where $\epsilon' = (1+\epsilon)^{\ell} - 1$.

For the base case $\ell = 0$, observe that, by definition, $\mathrm{coreset}(k, \epsilon, 0, P) = P$ and $\mathrm{coreset}(k, 0, P) = P$.

Now consider an integer $L > 0$. Suppose that for each non-negative integer $\ell < L$, $\mathcal{P}(\ell)$ holds; the task is to prove $\mathcal{P}(L)$. Suppose $C = \mathrm{coreset}(k, \epsilon, L, P)$. Then there must be a partition of $P$ into sets $P_1, P_2, \ldots, P_q$ such that $\cup_{i=1}^{q} P_i = P$. For $i = 1, \ldots, q$, let $C_i = \mathrm{coreset}(k, \epsilon, \ell_i, P_i)$ for $\ell_i < L$. Then $C$ must be of the form $\mathrm{coreset}(k, \epsilon, \cup_{i=1}^{q} C_i)$.

By the inductive hypothesis, we know that $C_i = \mathrm{coreset}(k, \epsilon_i, P_i)$, where $\epsilon_i = (1+\epsilon)^{\ell_i} - 1$. By the definition of a coreset and using $\ell_i \le L - 1$, it is also true that $C_i = \mathrm{coreset}(k, \epsilon'', P_i)$, where $\epsilon'' = (1+\epsilon)^{L-1} - 1$. Let $C' = \cup_{i=1}^{q} C_i$. From Observation 4 and using $P = \cup_{i=1}^{q} P_i$, it must be true that $C' = \mathrm{coreset}(k, \epsilon'', P)$. Since $C = \mathrm{coreset}(k, \epsilon, C')$, using Observation 5 we get $C = \mathrm{coreset}(k, \gamma, P)$, where $\gamma = (1+\epsilon)(1+\epsilon'') - 1$. Simplifying, $\gamma = (1+\epsilon)\left(1 + (1+\epsilon)^{L-1} - 1\right) - 1 = (1+\epsilon)^{L} - 1$. This proves the inductive case $\mathcal{P}(L)$, which completes the proof. □

The accuracy of a coreset is given by the following lemma, since it is clear that a level-$\ell$ bucket is a level-$\ell$ coreset of its responsible range of base buckets.

Lemma 10 Let $\epsilon = (c \log r)/\log N$, where $c$ is a small enough constant. After observing stream $S = S(1, n)$, a clustering query StreamCluster-Query returns a set of $k$ centers $\Psi$ of $S$ whose clustering cost is an $O(\log k)$-approximation of the optimal clustering cost for $S$.

Proof: After observing $N$ base buckets, Lemma 8 shows that all coresets in Q are at a level no greater than $\log N / \log r$. By Lemma 9, the maximum-level coreset in Q is an $\epsilon'$-coreset, where

$$\epsilon' = \left(1 + \frac{c\log r}{\log N}\right)^{\frac{\log N}{\log r}} - 1 \le e^{\left(\frac{c\log r}{\log N}\right) \cdot \frac{\log N}{\log r}} - 1 < 0.1.$$
Consider that StreamCluster-Query computes k-means++ on the union of two sets: one is the result of CT-Coreset, and the other is the partially-filled base bucket $H.C$. Hence, $\Theta = \left(\cup_j \cup_{B \in Q_j} B\right) \cup H.C$ is the coreset union that is given to k-means++. Using Observation 4, $\Theta$ is an $\epsilon'$-coreset of $S$. Let $\Psi$ be the final $k$ centers generated by running k-means++ on $\Theta$, and let $\Psi_1$ be the set of $k$ centers that achieves the optimal k-means clustering cost for $S$. From the definition of a coreset, when $\epsilon' < 0.1$, we have

$$0.9\,\phi_\Psi(S) \le \phi_\Psi(\Theta) \le 1.1\,\phi_\Psi(S) \quad (1)$$

$$0.9\,\phi_{\Psi_1}(S) \le \phi_{\Psi_1}(\Theta) \le 1.1\,\phi_{\Psi_1}(S) \quad (2)$$

Let $\Psi_2$ denote the set of $k$ centers that achieves the optimal k-means clustering cost for $\Theta$. Using Theorem 2, we have

$$\mathbb{E}\left[\phi_\Psi(\Theta)\right] \le 8(\ln k + 2) \cdot \phi_{\Psi_2}(\Theta) \quad (3)$$

Since $\Psi_2$ is the optimal set of $k$ centers for the coreset $\Theta$, we have

$$\phi_{\Psi_2}(\Theta) \le \phi_{\Psi_1}(\Theta) \quad (4)$$

Using Equations 2, 3, and 4, we get

$$\mathbb{E}\left[\phi_\Psi(\Theta)\right] \le 9(\ln k + 2) \cdot \phi_{\Psi_1}(S) \quad (5)$$

Using Equations 1 and 5,

$$\mathbb{E}\left[\phi_\Psi(S)\right] \le 10(\ln k + 2) \cdot \phi_{\Psi_1}(S) \quad (6)$$

We conclude that $\Psi$ is a factor-$O(\log k)$ approximation of the optimal clustering centers for $S$. □

The following lemma quantifies the memory and time cost of CT.

Lemma 11 Let $N$ be the number of buckets observed so far. Algorithm CT, including the driver, takes amortized $O(kd)$ time per point, using $O\left(\frac{mdr\log N}{\log r}\right)$ memory. The amortized cost of answering a query is $O\left(\frac{kdm}{q} \cdot \frac{r\log N}{\log r}\right)$ per point.

Proof: First, the cost of arranging $n$ points into level-0 buckets is trivially $O(n)$, resulting in $N = n/m$ buckets. For $j \ge 1$, a level-$j$ bucket is created for every $m r^j$ points, so the number of level-$j$ buckets ever created is $N/r^j$. Hence, across all levels, the total number of buckets created is $\sum_{j=1}^{\ell} \frac{N}{r^j} = O(N/r)$. Furthermore, when a bucket is created, CT merges $rm$ points into $m$ points. By Theorem 6, the total cost of creating these buckets is $O\left(\frac{N}{r} \cdot \left(kdrm + \log^2(rm) + dk\right)\right) = O(nkd)$, hence $O(kd)$ amortized time per point. In terms of space, each level must have fewer than $r$ buckets, each with $m$ points. Therefore, across $\ell \le \frac{\log N}{\log r}$ levels, the space required is $O\left(\frac{\log N}{\log r} \cdot mdr\right)$. Finally, when answering a query, the union of all the buckets has at most $O\left(mr \cdot \frac{\log N}{\log r}\right)$ points, computable in the same time as the size. Therefore, k-means++, run on these points plus at most one base bucket, takes $O\left(kdrm \cdot \frac{\log N}{\log r}\right)$ time. The amortized bound immediately follows. This proves the theorem. □

Figure 1: Illustration of Algorithm CC, showing the states of the coreset tree and the cache after batches 1, 2, 3, 4, 7, 8, 15, and 16. The notation [l, r] denotes a coreset of all points in buckets l through r, both endpoints inclusive. The coreset tree consists of a set of coresets, each of which is a base bucket or has been formed by merging multiple coresets. Whenever a coreset is merged into another coreset (in the tree) or discarded (in the cache), the coreset is marked with an "X". We suppose that a clustering query arrives after seeing each batch, and describe the actions taken to answer this query (1) if only CT was used, or (2) if CC was used along with CT.
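The CT invariant makes the scenarios of Figure 1 easy to reproduce: the base-r digits of $N$ say how many buckets sit at each level, so the number of coresets a CT query merges is the digit sum. A small sketch (ours, for illustration only):

```python
def ct_buckets_per_level(N, r):
    # Digits of N in base r, least significant first: by the CT
    # invariant, digit s_i is the number of buckets at level i.
    counts, level = [], 0
    while N:
        N, s = divmod(N, r)
        counts.append((level, s))
        level += 1
    return counts

# After batch 15 with r = 2 (cf. Figure 1): 15 = 1111 in binary, so one
# bucket at each of levels 0..3, and a CT query merges 4 coresets
# ([15,15], [13,14], [9,12], [1,8]), while CC merges just [1,14] and [15,15].
print(ct_buckets_per_level(15, 2))   # [(0, 1), (1, 1), (2, 1), (3, 1)]
```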
As evident from the above lemma, answering a query using CT is expensive compared to the cost of adding a point. More precisely, when queries are made rather frequently (every $q$ points, with $q < O(rm \cdot \log_r N) = \tilde{O}(rkd \cdot \log_r N)$), the cost of query processing is asymptotically greater than the cost of handling point arrivals. We address this issue in the next section.

4 Clustering Algorithms with Fast Queries

This section describes algorithms for streaming clustering with an emphasis on query time.

4.1 Algorithm CC: Coreset Tree with Caching

The CC algorithm uses the idea of "coreset caching" to speed up query processing by reusing coresets that were constructed during prior queries. In this way, it can avoid merging a large number of coresets at query time. When compared with CT, the CC algorithm can answer queries faster, while maintaining nearly the same processing time per element.

In addition to the coreset tree CT, the CC algorithm maintains an additional coreset cache, cache, that stores a subset of coresets that were previously computed. When a new query has to be answered, CC avoids the cost of merging coresets from multiple levels in the coreset tree. Instead, it reuses previously cached coresets and retrieves a small number of additional coresets from the coreset tree, thus leading to less computation at query time.

However, the level of the resulting coreset increases linearly with the number of merges a coreset is involved in. For instance, suppose we recursively merged the current coreset with each newly arriving batch to get a new coreset, and so on, for $N$ batches. The resulting coreset would have a level of $\Theta(N)$, which can lead to very poor clustering accuracy. Additional care is needed to ensure that the level of a coreset is controlled while caching is used.

Details: Each cached coreset is a summary of base buckets 1 through some number $u$. We call this number $u$ the right endpoint of the coreset and use it as the key/index into the cache. We call the interval $[1, u]$ the "span" of the bucket. To explain which coresets are cached by the algorithm, we introduce the following definitions.

For integers $n > 0$ and $r > 0$, consider the unique decomposition of $n$ according to powers of $r$ as $n = \sum_{i=0}^{j} \beta_i r^{\alpha_i}$, where $0 \le \alpha_0 < \alpha_1 < \cdots < \alpha_j$ and $0 < \beta_i < r$ for each $i$. The $\beta_i$'s can be viewed as the non-zero digits in the representation of $n$ as a number in base $r$. Let $\mathrm{minor}(n, r) = \beta_0 r^{\alpha_0}$, the smallest term in the decomposition, and $\mathrm{major}(n, r) = n - \mathrm{minor}(n, r)$. Note that when $n$ is of the form $\beta r^{\alpha}$, where $0 < \beta < r$ and $\alpha \ge 0$, $\mathrm{major}(n, r) = 0$.

For $\kappa = 1, \ldots, j$, let $n_\kappa = \sum_{i=\kappa}^{j} \beta_i r^{\alpha_i}$. $n_\kappa$ can be viewed as the number obtained by dropping the $\kappa$ smallest non-zero digits in the representation of $n$ as a number in base $r$. The set $\mathrm{prefixsum}(n, r)$ is defined as $\{n_\kappa \mid \kappa = 1, \ldots, j\}$. When $n$ is of the form $\beta r^{\alpha}$, where $0 < \beta < r$, $\mathrm{prefixsum}(n, r) = \emptyset$.

For instance, suppose $n = 47$ and $r = 3$. Since $47 = 1 \cdot 3^3 + 2 \cdot 3^2 + 2 \cdot 3^0$, we have $\mathrm{minor}(47, 3) = 2$, $\mathrm{major}(47, 3) = 45$, and $\mathrm{prefixsum}(47, 3) = \{27, 45\}$. We have the following fact about the prefixsum.

Fact 12 Let $r \ge 2$. For each $N \in \mathbb{Z}^+$, $\mathrm{prefixsum}(N+1, r) \subseteq \mathrm{prefixsum}(N, r) \cup \{N\}$.

Proof: There are three cases.

Case I: $N \not\equiv (r-1) \pmod{r}$. Consider $N$ in $r$-ary representation, and let $\delta$ denote the least significant digit. Since $\delta < r - 1$, in going from $N$ to $N+1$ the only digit changed is the least significant digit, which changes from $\delta$ to $\delta + 1$, and no carry propagation takes place. Every element $y \in \mathrm{prefixsum}(N+1, r)$ is then also in $\mathrm{prefixsum}(N, r)$.
The only exception is when $N \equiv 0 \pmod{r}$, in which case one element of $\mathrm{prefixsum}(N+1, r)$ is $N$ itself. In this case, it is still true that $\mathrm{prefixsum}(N+1, r) \subseteq \mathrm{prefixsum}(N, r) \cup \{N\}$.

Case II: $N \equiv (r-1) \pmod{r}$, and in the $r$-ary representation of $N$ all digits are $r-1$. In this case, $N+1$ must be a power of $r$ and can be represented as a single term $r^{\alpha}$ with $\alpha \ge 0$; then $\mathrm{prefixsum}(N+1, r)$ is the empty set, so the claim holds.

Case III: $N \equiv (r-1) \pmod{r}$, but Case II does not hold. Consider the $r$-ary representation of $N$. There must exist at least one digit less than $r-1$. Adding one to $N$ changes a streak of $(r-1)$-digits to 0, starting from the least significant digit, until it hits the first digit that is not $r-1$, which is therefore less than $r-1$. We refer to this digit as $\beta_k$. $N$ can be expressed in $r$-ary form as $N = (\beta_j \beta_{j-1} \cdots \beta_{k+1} \beta_k \beta_{k-1} \cdots \beta_1 \beta_0)_r$. Correspondingly, $N+1 = (\beta_j \beta_{j-1} \cdots \beta_{k+1} (1+\beta_k) 0 0 \cdots 0 0)_r$. Comparing the prefixsum of $N+1$ with that of $N$, the prefix of digits $\beta_j \beta_{j-1} \cdots \beta_{k+1}$ remains unchanged; thus $\mathrm{prefixsum}(N+1, r) \subset \mathrm{prefixsum}(N, r)$. □

CC caches every coreset whose right endpoint is in $\mathrm{prefixsum}(N, r)$. When a query arrives after $N$ batches, the task is to compute a coreset whose span is $[1, N]$. CC partitions $[1, N]$ as $[1, N_1] \cup [N_1 + 1, N]$, where $N_1 = \mathrm{major}(N, r)$. Of these two intervals, $[1, N_1]$ is already available in the cache, and $[N_1 + 1, N]$ is retrieved from the coreset tree through the union of no more than $r - 1$ coresets. This needs a merge of no more than $r$ coresets in total. This is in contrast with CT, which may need to merge as many as $r - 1$ coresets at each level of the tree, resulting in a merge of up to $(r-1)\log_r N$ coresets at query time.
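The decomposition that drives the cache is straightforward to compute. A direct Python transcription of minor, major, and prefixsum (checked against the paper's $n = 47$, $r = 3$ example; the helper name is ours):

```python
def nonzero_terms(n, r):
    # The terms beta_i * r^alpha_i of n in base r, smallest first.
    terms, alpha = [], 0
    while n > 0:
        n, beta = divmod(n, r)
        if beta:
            terms.append(beta * r ** alpha)
        alpha += 1
    return terms

def minor(n, r):
    return nonzero_terms(n, r)[0]

def major(n, r):
    return n - minor(n, r)

def prefixsum(n, r):
    # n_kappa for kappa = 1..j: drop the kappa smallest non-zero digits.
    terms = nonzero_terms(n, r)
    return {sum(terms[kappa:]) for kappa in range(1, len(terms))}

# 47 = 1*3^3 + 2*3^2 + 2*3^0:
assert minor(47, 3) == 2 and major(47, 3) == 45
assert prefixsum(47, 3) == {27, 45}
```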
