DTIC ADA447010: Clustering Large Datasets in Arbitrary Metric Spaces (PDF, 0.2 MB, English)
Report Documentation Page (Standard Form 298): report date 2006; approved for public release, distribution unlimited.

Clustering Large Datasets in Arbitrary Metric Spaces

Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke (*) (†)
Computer Sciences Department, University of Wisconsin-Madison

Allison Powell (‡), James French (§)
Department of Computer Science, University of Virginia, Charlottesville

(*) The first three authors were supported by Grant 2053 from the IBM Corporation.
(†) Supported by an IBM Corporate Fellowship.
(‡) Supported by NASA GSRP NGT5-50062.
(§) This work was supported in part by DARPA contract N66001-97-C-8542.

Abstract

Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function, along with the collection of objects, describes a distance space. In a distance space, the only operation possible on data objects is the computation of the distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects.

We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm, BUBBLE, is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm, BUBBLE-FM, improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real-life dataset.

1. Introduction

Data clustering is an important data mining problem [1, 8, 9, 10, 12, 17, 21, 26]. The goal of clustering is to partition a collection of objects into groups, called clusters, such that "similar" objects fall into the same group. Similarity between objects is captured by a distance function.

In this paper, we consider the problem of clustering large datasets in a distance space,(1) in which the only operation possible on data objects is the computation of a distance function that satisfies the triangle inequality. In contrast, objects in a coordinate space can be represented as vectors. The vector representation allows various vector operations, e.g., addition and subtraction of vectors, to form condensed representations of clusters and to reduce the time and space requirements of the clustering problem [4, 26]. These operations are not possible in a distance space, thus making the problem much harder.

    (1) A distance space is also referred to as an arbitrary metric space.

The distance function associated with a distance space can be computationally very expensive [5], and may dominate the overall resource requirements. For example, consider the domain of strings, where the distance between two strings is the edit distance.(2) Computing the edit distance between two strings of lengths m and n requires O(mn) comparisons between characters. In contrast, computing the Euclidean distance between two n-dimensional vectors in a coordinate space requires just O(n) operations. Most algorithms in the literature have paid little attention to this particular issue when devising clustering algorithms for data in a distance space.

    (2) The edit distance between two strings is the number of simple edit operations required to transform one string into the other.

In this work, we first abstract out the essential features of the BIRCH clustering algorithm [26] into the BIRCH* framework for scalable clustering algorithms. We then instantiate BIRCH*, resulting in two new scalable clustering algorithms for distance spaces: BUBBLE and BUBBLE-FM.

The remainder of the paper is organized as follows. In Section 2, we discuss related work on clustering and some of our initial approaches. In Section 3, we present the BIRCH* framework for fast, scalable, incremental clustering algorithms. In Sections 4 and 5, we instantiate the framework for data in a distance space, resulting in our algorithms BUBBLE and BUBBLE-FM. Section 6 evaluates the performance of BUBBLE and BUBBLE-FM on synthetic datasets. We discuss an application of BUBBLE-FM to a real-life dataset in Section 7 and conclude in Section 8.

We assume that the reader is familiar with the definitions of the following standard terms: metric space, L_p norm of a vector, radius, and centroid of a set of points in a coordinate space. (See the full paper [16] for the definitions.)

2. Related Work and Initial Approaches

In this section, we discuss related work on clustering, and three important issues that arise when clustering data in a distance space vis-a-vis clustering data in a coordinate space.

Data clustering has been extensively studied in the Statistics [20], Machine Learning [12, 13], and Pattern Recognition literature [6, 7]. These algorithms assume that all the data fits into main memory, and typically have running times super-linear in the size of the dataset. Therefore, they do not scale to large databases.

Multidimensional scaling (MDS) is a technique for distance-preserving transformations [25]. The input to an MDS method is a set S of N objects, a distance function d, and an integer k; the output is a set of N k-dimensional image vectors in a k-dimensional coordinate space (also called the image space), one image vector for each object, such that the distance between any two objects is equal (or very close) to the distance between their respective image vectors. MDS algorithms do not scale to large datasets for two reasons. First, they assume that all objects fit in main memory. Second, most MDS algorithms proposed in the literature compute distances between all possible pairs of input objects as a first step, thus having complexity at least O(N^2) [19]. Recently, Lin et al. developed a scalable MDS method called FastMap [11].
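To make the cost asymmetry from Section 1 concrete, here is a minimal sketch (ours, not the paper's code) of the two distance functions it contrasts: the O(mn) dynamic-programming edit distance on strings versus the O(n) Euclidean distance on vectors.

```python
# Illustration of the cost gap discussed in Section 1: edit distance on
# strings of lengths m and n fills an m-by-n dynamic-programming table
# (O(mn) character comparisons), while Euclidean distance between two
# n-dimensional vectors is a single O(n) pass.
import math

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n]

def euclidean(x, y):
    """O(n) distance between two n-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, `edit_distance("kitten", "sitting")` performs 6 x 7 cell updates to return 3, while `euclidean((0, 0), (3, 4))` touches each coordinate once and returns 5.0.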
FastMap preserves distances approximately in the image space while requiring only a fixed number of scans over the data. Therefore, one possible approach for clustering data in a distance space is to map all N objects into a coordinate space using FastMap, and then cluster the resultant vectors using a scalable clustering algorithm for data in a coordinate space. We call this approach the Map-First option and empirically evaluate it in Section 6.2. Our experiments show that the quality of clustering thus obtained is not good.

Recently, clustering has received attention as an important data mining problem [8, 9, 10, 17, 21, 26]. CLARANS [21] is a medoid-based method which is more efficient than earlier medoid-based algorithms [18], but it has two drawbacks: it assumes that all objects fit in main memory, and the result is very sensitive to the input order [26]. Techniques to improve CLARANS's ability to deal with disk-resident datasets by focusing only on relevant parts of the database using R*-trees were also proposed [9, 10]. But these techniques depend on R*-trees, which can only index vectors in a coordinate space. DBSCAN [8] uses a density-based notion of clusters to discover clusters of arbitrary shapes. Since DBSCAN relies on the R*-tree for speed and scalability in its nearest-neighbor search queries, it cannot cluster data in a distance space. BIRCH [26] was designed to cluster large datasets of n-dimensional vectors using a limited amount of main memory. But the algorithm relies heavily on vector operations, which are defined only in coordinate spaces. CURE [17] is a sampling-based hierarchical clustering algorithm that is able to discover clusters of arbitrary shapes. However, it too relies on vector operations and therefore cannot cluster data in a distance space.

Three important issues arise when clustering data in a distance space versus data in a coordinate space. First, the concept of a centroid is not defined. Second, the distance function could potentially be computationally very expensive, as discussed in Section 1. Third, the domain-specific nature of clustering applications places requirements that are tough to meet with just one algorithm.

Many clustering algorithms [4, 17, 26] rely on vector operations, e.g., the calculation of the centroid, to represent clusters and to improve computation time. Such algorithms cannot cluster data in a distance space. Thus one approach is to map all objects into a k-dimensional coordinate space while preserving distances between pairs of objects, and then cluster the resulting vectors.

Applications of clustering are domain-specific, and we believe that a single algorithm will not serve all requirements. A pre-clustering phase, to obtain a data-dependent summarization of large amounts of data into sub-clusters, was shown to be very effective in making more complex data analysis feasible [4, 24, 26]. Therefore, we take the approach of developing a pre-clustering algorithm that returns condensed representations of sub-clusters. A domain-specific clustering method can further analyze the sub-clusters output by our algorithm.

3. BIRCH*

In this section, we present the BIRCH* framework, which generalizes the notion of a cluster feature (CF) and a CF-tree, the two building blocks of the BIRCH algorithm [26]. In the BIRCH* family of algorithms, objects are read from the database sequentially and inserted into incrementally evolving clusters, which are represented by generalized cluster features (CF*s). A new object read from the database is inserted into the closest cluster, an operation which potentially requires an examination of all existing CF*s. Therefore BIRCH* organizes all clusters in an in-memory index, a height-balanced tree, called a CF*-tree. For a new object, the search for an appropriate cluster now requires time logarithmic in the number of clusters, as opposed to a linear scan.

In the remainder of this section, we abstractly state the components of the BIRCH* framework. Instantiations of these components generate concrete clustering algorithms.

3.1. Generalized Cluster Feature

Any clustering algorithm needs a representation for the clusters detected in the data. The naive representation uses all objects in a cluster. However, since a cluster corresponds to a dense region of objects, the set of objects can be treated collectively through a summarized representation. We call such a condensed, summarized representation of a cluster its generalized cluster feature (CF*).

Since the entire dataset usually does not fit in main memory, we cannot examine all objects simultaneously to compute CF*s of clusters. Therefore, we incrementally evolve clusters and their CF*s, i.e., objects are scanned sequentially and the set of clusters is updated to assimilate new objects. Intuitively, at any stage, the next object is inserted into the cluster "closest" to it as long as the insertion does not deteriorate the "quality" of the cluster. (Both concepts are explained later.) The CF* is then updated to reflect the insertion. Since objects in a cluster are not kept in main memory, CF*s should meet the following requirements:

- Incremental updatability whenever a new object is inserted into the cluster.
- Sufficiency to compute distances between clusters, and quality metrics (like the radius) of a cluster.

CF*s are efficient for two reasons. First, they occupy much less space than the naive representation. Second, calculation of inter-cluster and intra-cluster measurements using the CF*s is much faster than calculations involving all objects in clusters.

3.2. CF*-Tree

In this section, we describe the structure and functionality of a CF*-tree.

A CF*-tree is a height-balanced tree structure similar to the R-tree [3]. The number of nodes in the CF*-tree is bounded by a pre-specified number M. Nodes in a CF*-tree are classified into leaf and non-leaf nodes according to their position in the tree. Each non-leaf node contains at most B entries of the form (CF_i*, child_i), i in {1,...,B}, where child_i is a pointer to the i-th child node and CF_i* is the CF* of the set of objects summarized by the subtree rooted at the i-th child. A leaf node contains at most B entries, each of the form [CF_i*], i in {1,...,B}; each leaf entry is the CF* of a cluster. Each cluster at the leaf level satisfies a threshold requirement T, which controls its tightness or quality.

The purpose of the CF*-tree is to direct a new object O to the cluster closest to it. The functionality of non-leaf entries and leaf entries in the CF*-tree is different: non-leaf entries exist to "guide" new objects to appropriate leaf clusters, whereas leaf entries represent the dynamically evolving clusters. For a new object O, at each non-leaf node on the downward path, the non-leaf entry "closest" to O is selected to traverse downwards. Intuitively, directing O to the child node of the closest non-leaf entry is similar to identifying the most promising region and zooming into it for a more thorough examination. The downward traversal continues till O reaches a leaf node. When O reaches a leaf node L, it is inserted into the cluster C in L closest to O if the threshold requirement T is not violated due to the insertion. Otherwise, O forms a new cluster in L. If L does not have enough space for the new cluster, it is split into two leaf nodes and the entries in L are redistributed: the set of leaf entries in L is divided into two groups such that each group consists of "similar" entries. A new entry for the new leaf node is created at its parent. In general, all nodes on the path from the root to L may split. We omit the details of the insertion of an object into the CF*-tree because it is similar to that of BIRCH [26].

During the data scan, existing clusters are updated and new clusters are formed. The number of nodes in the CF*-tree may increase beyond M before the data scan is complete, due to the formation of many new clusters. Then it is necessary to reduce the space occupied by the CF*-tree, which can be done by reducing the number of clusters it maintains. The reduction in the number of clusters is achieved by merging close clusters to form bigger clusters. BIRCH* merges clusters by increasing the threshold value T associated with the leaf clusters and re-inserting them into a new tree. The re-insertion of a leaf cluster into the new tree merely inserts its CF*; all objects in leaf clusters are treated collectively. Thus a new, smaller CF*-tree is built. After all the old leaf entries have been inserted into the new tree, the data scan resumes from the point of interruption.

Note that the CF*-tree insertion algorithm requires distance measures between the "inserted entries" and node entries to select the closest entry at each level. Since insertions are of two types, insertion of a single object and insertion of a leaf cluster, the BIRCH* framework requires distance measures to be instantiated between a CF* and an object, and between two CF*s (or clusters).

In summary, CF*s, their incremental maintenance, the distance measures, and the threshold requirement are the components of the BIRCH* framework which have to be instantiated to derive a concrete clustering algorithm.

4. BUBBLE

In this section, we instantiate BIRCH* for data in a distance space, resulting in our first algorithm, called BUBBLE. Recall that CF*s at leaf and non-leaf nodes differ in their functionality. The former incrementally maintain information about the output clusters, whereas the latter are used to direct new objects to appropriate leaf clusters. Sections 4.1 and 4.2 describe the information in a CF* (and its incremental maintenance) at the leaf and non-leaf levels.
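The descend-and-insert procedure of Section 3.2 can be sketched in a few lines. This is our own minimal illustration, not the paper's code: the `Node` class, the stand-in use of a cluster's first object as its center, and the flat threshold test are simplifications, and node splits, CF* maintenance, and tree rebuilding are omitted.

```python
# Minimal sketch of the BIRCH*-style insertion walk: descend from the root
# by always following the entry "closest" to the new object, then either
# absorb the object into the closest leaf cluster (threshold T respected)
# or start a new cluster. Splits and rebuilds are omitted.

class Node:
    def __init__(self, children=None, clusters=None):
        self.children = children or []   # (representative, Node) pairs
        self.clusters = clusters or []   # leaf level: lists of objects

    @property
    def is_leaf(self):
        return not self.children

def insert(root, obj, d, T):
    node = root
    while not node.is_leaf:
        # follow the non-leaf entry closest to obj
        _, node = min(node.children, key=lambda e: d(obj, e[0]))
    if node.clusters:
        closest = min(node.clusters, key=lambda c: d(obj, c[0]))
        if d(obj, closest[0]) <= T:      # c[0] stands in for the clustroid
            closest.append(obj)
            return
    node.clusters.append([obj])          # obj forms a new cluster

# toy usage in a 1-d distance space
d = lambda x, y: abs(x - y)
root = Node()
for x in [1, 2, 50, 51, 100]:
    insert(root, x, d, T=5)
# root.clusters is now [[1, 2], [50, 51], [100]]
```

The walk mirrors the text: non-leaf entries only guide the descent, while the threshold test at the leaf decides between absorbing the object and creating a new cluster.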
4.1. CF*s at the Leaf Level

4.1.1 Summary statistics at the leaf level

For each cluster discovered by the algorithm, we return the following information (which is used in further processing): the number of objects in the cluster, a centrally located object in it, and its radius. Since a distance space, in general, does not support the creation of new objects using operations on a set of objects, we assign an actual object in the cluster as the cluster center. We define the clustroid Ô of a set of objects O, which is the generalization of the centroid to a distance space.(3) We now introduce the RowSum of an object O with respect to a set of objects O, and the concept of an image space IS(O) of a set of objects O in a distance space. Informally, the image space of a set of objects is a coordinate space containing an image vector for each object, such that the distance between any two image vectors is the same as the distance between the corresponding objects.

In the remainder of this section, we use (S, d) to denote a distance space, where S is the domain of all possible objects and d : S x S -> R is a distance function.

Definition 4.1 Let O = {O_1,...,O_n} be a set of objects in a distance space (S, d). The RowSum of an object O in O is defined as RowSum(O) = sum_{j=1}^{n} d^2(O, O_j). The clustroid Ô is defined as the object Ô in O such that, for all O in O, RowSum(Ô) <= RowSum(O).

Definition 4.2 Let O = {O_1,...,O_n} be a set of objects in a distance space (S, d). Let f : O -> R^k be a function. We call f an (S, d)-distance-preserving transformation if, for all i, j in {1,...,n}, d(O_i, O_j) = ||f(O_i) - f(O_j)||, where ||X - Y|| is the Euclidean distance between X and Y in R^k. We call R^k the image space of O under f (denoted IS(O)). For an object O in O, we call f(O) the image vector of O under f. We define f(O) = {f(O_1),...,f(O_n)}.

The existence of a distance-preserving transformation is guaranteed by the following lemma.

Lemma 4.1 [19] Let O be a set of objects in a distance space (S, d). Then there exists a positive integer k (k < |O|) and a function f : O -> R^k such that f is an (S, d)-distance-preserving transformation.

For example, three objects x, y, z with the inter-object distance distribution [d(x,y) = 3, d(y,z) = 4, d(z,x) = 5] can be mapped to the vectors (0,0), (3,0), (0,4) in the 2-dimensional Euclidean space (e.g., y -> (0,0), x -> (3,0), z -> (0,4)). This is one of many possible mappings.

The following lemma shows that, under any (S, d)-distance-preserving transformation f, the clustroid of O is the object O in O whose image vector f(O) is closest to the centroid of the set of image vectors f(O). Thus, the clustroid is the generalization of the centroid to distance spaces. Following the generalization of the centroid, we generalize the definitions of the radius of a cluster, and the distance between clusters, to distance spaces.

Lemma 4.2 Let O = {O_1,...,O_n} be a set of objects in a distance space (S, d) with clustroid Ô, and let f : O -> R^k be an (S, d)-distance-preserving transformation. Let F be the centroid of f(O). Then the following holds:

    for all O in O: ||f(Ô) - F|| <= ||f(O) - F||

Definition 4.3 Let O = {O_1,...,O_n} be a set of objects in a distance space (S, d) with clustroid Ô. The radius r(O) of O is defined as

    r(O) = sqrt( (sum_{i=1}^{n} d^2(O_i, Ô)) / n )

Definition 4.4 We define two different inter-cluster distance metrics between cluster features. Let O_1 = {O_11,...,O_1n1} and O_2 = {O_21,...,O_2n2} be two clusters with clustroids Ô_1 and Ô_2 respectively. We define the clustroid distance D0 as

    D0(O_1, O_2) = d(Ô_1, Ô_2)

and the average inter-cluster distance D2 as

    D2(O_1, O_2) = sqrt( (sum_{i=1}^{n1} sum_{j=1}^{n2} d^2(O_1i, O_2j)) / (n1 * n2) )

Both BUBBLE and BUBBLE-FM use D0 as the distance metric between leaf-level clusters, and as the threshold requirement T, i.e., a new object O_new is inserted into a cluster O with clustroid Ô only if d(Ô, O_new) <= T. (The use of D2 is explained later.)

    (3) The medoid of a set of objects O is sometimes used as a cluster center [18]. It is defined as the object O_m in O that minimizes the average dissimilarity sum_{i=1}^{n} d(O_i, O_m) to all objects in O. But it is not possible to motivate the heuristic maintenance of the medoid. However, we expect similar heuristics to work even for the medoid.

4.1.2 Incremental maintenance of leaf-level CF*s

In this section, we describe the incremental maintenance of CF*s at the leaf levels of the CF*-tree. Since the sets of objects we are concerned with in this section are clusters, we use C (instead of O) to denote a set of objects.

The incremental maintenance of the number of objects in a cluster C is trivial. So we concentrate next on the incremental maintenance of the clustroid Ĉ. Recall that, for a cluster C, Ĉ is the object in C with the minimum RowSum value. As long as we are able to keep all the objects of C in main memory, we can maintain Ĉ incrementally under insertions by updating the RowSum values of all objects and then selecting the object with minimum RowSum value as the clustroid. But this strategy requires all objects in C in main memory, which is not a viable option for large datasets. Since exact maintenance is not possible, we develop a heuristic strategy which works well in practice while significantly reducing main memory requirements. We classify insertions into clusters into two types, Type I and Type II, and motivate heuristics for each type of insertion. A Type I insertion is the insertion of a single object or, equivalently, a cluster containing only one object. Each object in the dataset causes a Type I insertion when it is read from the data file, making it the most common type of insertion. A Type II insertion is the insertion of a cluster containing more than one object. Type II insertions occur only when the CF*-tree is being rebuilt. (See Section 3.)

Type I insertions: In our heuristics, we make the following approximation: under any distance-preserving transformation f into a coordinate space, the image vector of the clustroid is the centroid of the set of all image vectors, i.e., f(Ĉ) = F, where F denotes the centroid. From Lemma 4.2, we know that this is the best possible approximation. In addition to the approximation, our heuristic is motivated by the following two observations.

Observation 1: Consider the insertion of a new object O_new into a cluster C = {O_1,...,O_n}, and assume that only a subset R of C is kept in main memory. Let f : C union {O_new} -> R^k be a distance-preserving transformation, and let F = (1/n) sum_{i=1}^{n} f(O_i) be the centroid of f(C). Then

    RowSum(O_new) = sum_{j=1}^{n} d^2(O_new, O_j)
                  = sum_{j=1}^{n} ||f(O_new) - f(O_j)||^2
                  = sum_{j=1}^{n} ||f(O_j) - F||^2 + n ||F - f(O_new)||^2
                  ~ n r^2(C) + n d^2(Ĉ, O_new)

(The last step uses the approximation f(Ĉ) ~ F.) Thus, we can calculate RowSum(O_new) approximately using only Ĉ, and significantly reduce the main memory requirements.

Observation 2: Let C = {O_1,...,O_n} be a leaf cluster in the CF*-tree and O_new an object which is inserted into C. Let Ĉ and Ĉ* be the clustroids of C and C union {O_new} respectively. Let D0 be the distance metric between leaf clusters, and T the threshold requirement of the CF*-tree under D0. Then

    d(Ĉ, Ĉ*) <= eps, where eps = T / (n + 1)

An implication of Observation 2 is that, as long as we keep a set R of objects consisting of all objects in C within a distance eps of Ĉ, we know that Ĉ* is in R. However, when the clustroid changes due to the insertion of O_new, we have to update R to consist of all objects within eps of Ĉ*. Since we cannot assume that all objects in the dataset fit in main memory, we have to retrieve objects in C from the disk. Repeated retrieval of objects from the disk, whenever a clustroid changes, is expensive. Fortunately (from Observation 2), if n is large then the new set of objects within eps of Ĉ* is almost the same as the old set R, because Ĉ* is very close to Ĉ.

Observations 1 and 2 motivate the following heuristic maintenance of the clustroid. As long as |C| is small (smaller than a constant p), we keep all the objects of C in main memory and compute the new clustroid exactly. If |C| is large (larger than p), we invoke Observation 2 and maintain a subset R of C of size p. These objects have the lowest RowSum values in C and hence are closest to Ĉ. If the RowSum value of O_new is less than the highest of the p values, say that of O_p, then O_new replaces O_p in R. Our experiments confirm that this heuristic works very well in practice.

Type II insertions: Let C_1 = {O_11,...,O_1n1} and C_2 = {O_21,...,O_2n2} be two clusters being merged. Let f be a distance-preserving transformation of C_1 union C_2 into R^k. Let X_ij be the image vector of object O_ij under f. Let X (X_1, X_2) be the centroid of f(C_1 union C_2) (f(C_1), f(C_2)). Let Ĉ_1, Ĉ_2 be the clustroids, and r(C_1), r(C_2) the radii. Let Ĉ* be the new clustroid of C_1 union C_2. The new centroid X of f(C_1 union C_2) lies on the line joining X_1 and X_2; its exact location on the line depends on the values of n_1 and n_2. Since D0 is used as the threshold requirement for insertions, the distance between X and X_1 is bounded as shown below:

    (X - X_1)^2 = n2^2 (X_1 - X_2)^2 / (n1 + n2)^2
                ~ n2^2 d^2(Ĉ_1, Ĉ_2) / (n1 + n2)^2
                < n2^2 T^2 / (n1 + n2)^2

The following two assumptions motivate the heuristic maintenance of the clustroid under Type II insertions.

(i) C_1 and C_2 are non-overlapping but very close to each other. Since C_1 and C_2 are being merged, the threshold criterion is satisfied, implying that C_1 and C_2 are close to each other. We expect the two clusters to be almost non-overlapping because they were two distinct clusters in the old CF*-tree.

(ii) n_1 ~ n_2. Due to lack of any prior information about the clusters, we assume that the objects are uniformly distributed in the merged cluster. Therefore, the values of n_1 and n_2 are close to each other in Type II insertions.

For these two reasons, we expect the new clustroid Ĉ* to be midway between Ĉ_1 and Ĉ_2, which corresponds to the periphery of either cluster. Therefore we maintain a few objects (p in number) on the periphery of each cluster in its CF*. Because they are the farthest objects from the clustroid, they have the highest RowSum values in their respective clusters.

Thus we overall maintain 2p objects for each leaf cluster C, which we call the representative objects of C; the value 2p is called the representation number of C.
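The leaf-level statistics of Section 4.1 can be transcribed almost directly into code. The following is our own naive illustration of Definitions 4.1-4.4 (full recomputation on every call, no incremental maintenance or heuristics), for a toy one-dimensional distance space.

```python
# Naive transcription of Definitions 4.1-4.4: RowSum, clustroid, radius,
# and the inter-cluster distances D0 and D2, computed by brute force.
import math

def rowsum(o, objects, d):
    """RowSum(o) = sum of squared distances from o to every object."""
    return sum(d(o, oj) ** 2 for oj in objects)

def clustroid(objects, d):
    """The object of the cluster with minimum RowSum value."""
    return min(objects, key=lambda o: rowsum(o, objects, d))

def radius(objects, d):
    """r(O) = sqrt(sum_i d^2(O_i, clustroid) / n)."""
    o_hat = clustroid(objects, d)
    return math.sqrt(sum(d(oi, o_hat) ** 2 for oi in objects) / len(objects))

def D0(c1, c2, d):
    """Clustroid distance between two clusters."""
    return d(clustroid(c1, d), clustroid(c2, d))

def D2(c1, c2, d):
    """Average inter-cluster distance."""
    s = sum(d(a, b) ** 2 for a in c1 for b in c2)
    return math.sqrt(s / (len(c1) * len(c2)))

d = lambda x, y: abs(x - y)          # a toy 1-d distance function
c1, c2 = [0, 1, 2], [10, 11, 12]
# clustroid(c1, d) == 1 and D0(c1, c2, d) == 10
```

In BUBBLE the clustroid is of course never recomputed from scratch like this; the point of Section 4.1.2 is precisely to approximate these quantities from the 2p representative objects alone.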
CStoring the 2p C representativeobjectsenablestheapproximateincremental timeweupdatethesampleobjectswe incuracertaincost. maintenanceoftheclustroid.Theincrementalmaintenance Thuswehavetostrikeabalancebetweenthecostofupdat- oftheradiusof issimilartothatofRowSumvalues;de- ingthesampleobjectsandtheircurrency. tailsaregiveninCthefullpaper[16]. Because asplit at of NL causesredistributionof Summarizing,wemaintainthefollowinginformationin its entriesbetween childaind the newnode , we theCF of aleaf cluster : (i) thenumberof objectsin , havetoupdatesampclehsilSd(iNL )andS(NL )cahteilndtrki+es1NL (cid:3) (ii)theclustroidof , (iiiC) representativeobjects, (iCv) andNL oftheparent(weiactuallycreka+te1samplesforthei theRowSumvaluesoCfthere2p(cid:1)repsentativeobjects,and(v)the newentkr+y1NL ). However,toreflectchangesinthedis- radius of the cluster. All these statistics are incrementally tributionsatalkl+ch1ildrennodesweupdatethesampleobjects maintainable asdescribedabove astheclusterevolves. atallentriesofNLwheneveroneofitschildrensplits. | | 4.2.CF satNon-leafLevel 4.2.3 Distancemeasuresatnon-leaflevels (cid:3) Let be a new object inserted into the CF -tree. Inthissection,weinstantiatetheclusterfeaturesatnon- The Odinsetwance between and NL is defined(cid:3)to be leaflevelsoftheBIRCH(cid:3) frameworkanddescribetheirin- S(NL ) . SOinnceew i is meaning- crementalmaintenance. Dles2s(,fOweneewngs;ure thiat)each nonD-l2e(affOennetrwyg;h;a)s at least one sampleobjectfromitschildduringtheselectionofsample 4.2.1 SampleObjects objects. LetL representthe leafentryofaleafnode . In the BIRCH framework, the functionality of a CF at th (cid:3) (cid:3) Thedistancebietween andLi isdefinedtobetheclustroLid a non-leaf entry is to guide a new object to the sub-tree distance L . C i whichcontainsitsprospectivecluster. 
Therefore,theclus- TheinDst0a(nCti;atiio)nofdistancemeasurescompletesthein- terfeatureofthe non-leafentryNL ofanon-leafnode th stantiationofBIRCH derivingBUBBLE.Weomitthethe NLsummarizesthiedistributionofallcliustersinthesubtree (cid:3) cost analysis of BUBBLE because it is similar to that of rootedat NL . In AlgorithmBUBBLE, this summary, the BIRCH. CF , is repreisented by a set of objects; we call these ob- (cid:3) jectsthesampleobjectsS(NL )ofNL andtheunionofall 5.BUBBLE-FM sampleobjectsatalltheentrieisthesamipleobjectsS(NL)of Whileinsertinganewobject ,BUBBLEcomputes NL. distancesbetween andallOthneewsampleobjectsateach We now describe the procedure for selecting the sam- non-leafnodeonitOsndeowwnwardpathfromtheroottoaleaf ple objects. Let child be the child nodes at NL with chiledn1tr;i:e:s:;respecktively. Let S(NL ) de- node.Thedistancefunction maybecomputationallyvery notethesent1o;f::sa:m;npkleobjectscollectedfrom anidas- expensive (e.g., the edit distdance on strings). We address sociated with NL . S(NL) is the union of sacmhipldlei objects this issue in our second algorithm BUBBLE-FM which at all entries of NiL. The number of sample objects to be improves upon BUBBLE by reducing the numbe|r of in- vocationsof using FastMap [11]. We first givea brief collectedatanynon-leafnodeis upperboundedbya con- overviewofFda|stMapandthendescribeBUBBLE-FM. stantcalledthesamplesize(SS).Thenumber S(NL ) con- j i j tributed by is . 
The restric- 5.1.OverviewofFastMap ni(cid:3)SS childi MAX( k ;1) tionthateachchildnodehave(cid:22)atlej=a1stnojn(cid:23)erepresentativein Givenaset of objects,adistancefunction ,andan integer , FastMOapNquickly(intimelinearin )cdomputes S(NL) is placed so that the disPtribution of the sample ob- vectokrs(calledimagevectors), oneforeacNhobject,ina jectsis representativeof allits children,and isalso neces- N-dimensionalEuclideanimagespacesuchthatthedistance sary to definedistance measuresbetween anewly inserted kbetweentwoimagevectorsisclosetothedistancebetween objectandanon-leafcluster. If isaleafnode,then thesampleobjectsS(NL )arerancdhoimldliypickedfromallthe the corresponding two objects. Thus, FastMap is an “ap- clustroidsoftheleafclusitersat . Otherwise,theyare proximate” k-distance-preserving transformation. Each randomlypickedfrom ’sscahmilpdlieobjectsS ). of the axeRs is defined by the line joining two objects.4 childi (childi The kobjectsarecalledpivotobjects. Thespacedefined by th2ek axes is the fastmapped image space IS of 4.2.2 UpdatestoSampleObjects . Theknumberof calls to made by FastMap tfomm(Oap) TheCF -treeevolvesgraduallyasnewobjectsareinserted (cid:3) Oobjectsis , where isdaparameter(typicallysettoN1 intoit. Theaccuracyofthesummarydistributioncaptured or2). 3Nkc c by sample objects at a non-leaf entry depends on how re- An important feature of FastMap that we use in centlythesampleobjectsweregathered.Theperiodicityof BUBBLE-FMisitsfastincrementalmappingability.Given updates to these samples, and when these updates are ac- tually triggered,affects thecurrencyof the samples. Each SeeLinet.al.fordetails[11]. 4 anewobject , FastMap projectsitontothe coordi- FastMap does not accurately reflect the relative dis- nate axes of IOSnew to compute a -dimensionkal vector tances between clustroids; the inaccuracy causes er- for inISfm(O)withjust cakllsto . 
5.2. Description of BUBBLE-FM

BUBBLE-FM differs from BUBBLE only in its usage of sample objects at a non-leaf node. In BUBBLE-FM, we first map, using FastMap, the set of all sample objects at a non-leaf node into an "approximate" image space. We then use the image space to measure distances between an incoming object and the CF*s. Since CF*s at non-leaf entries function merely as guides to appropriate children nodes, an approximate image space is sufficient. We now describe the construction of the image space and its usage in detail.

Consider a non-leaf node NL. Whenever S(NL) is updated, we use FastMap to map S(NL) into a k-dimensional coordinate space IS_fm(NL); k is called the image dimensionality of NL. FastMap returns a vector for each object in S(NL). The centroid of the image vectors of S(NL_i) is then used as the centroid of the cluster represented by NL_i while defining distance metrics.

Let fm : S(NL) → IS_fm(NL) be the distance-preserving transformation associated with FastMap that maps each sample object s ∈ S(NL) to a k-dimensional vector fm(s) ∈ IS_fm(NL). Let S̄(NL_i) be the centroid of the set of image vectors of S(NL_i), i.e., S̄(NL_i) = (1/|S(NL_i)|) Σ_{s ∈ S(NL_i)} fm(s).

The non-leaf CF* in BUBBLE-FM consists of (1) S(NL_i) and (2) S̄(NL_i). In addition, we maintain the image vectors of the 2k pivot objects returned by FastMap; the 2k pivot objects define the axes of the k-dimensional image space constructed by FastMap. Let O_new be a new object. Using FastMap, we incrementally map O_new to V_new ∈ IS_fm(NL). We define the distance between O_new and NL_i to be the Euclidean distance between V_new and S̄(NL_i). Formally,

    D(O_new, NL_i) := ||V_new − S̄(NL_i)||

Similarly, the distance between two non-leaf entries NL_i and NL_j is defined to be ||S̄(NL_i) − S̄(NL_j)||. Whenever |S(NL)| ≤ 2k, BUBBLE-FM measures distances at NL in the distance space, as in BUBBLE.

5.2.1 An alternative at the leaf level

We do not use FastMap at the leaf levels of the CF*-tree for the following reasons.

1. Suppose FastMap were used at the leaf levels also. The approximate image space constructed by FastMap does not accurately reflect the relative distances between clustroids; the inaccuracy causes erroneous insertions of objects into clusters, deteriorating the clustering quality. Similar errors at non-leaf levels merely cause new entries to be redirected to wrong leaf nodes, where they will form new clusters. Therefore, the impact of these errors is on the maintenance costs of the CF*-tree, but not on the clustering quality, and hence they are not so severe.

2. If IS_fm(L) has to be maintained accurately under new insertions, then it should be reconstructed whenever any clustroid in the leaf node L changes. In this case, the overhead of repeatedly invoking FastMap offsets the gains due to measuring distances in IS_fm(L).

5.2.2 Image dimensionality and other parameters

The image dimensionalities of non-leaf nodes can be different because the sample objects at each non-leaf node are mapped into independent image spaces. The problem of finding the right dimensionality of the image space has been studied well [19]. We set the image dimensionalities of all non-leaf nodes to the same value; any technique used to find the right image dimensionality can be incorporated easily into the mapping algorithm.

Our experience with BUBBLE and BUBBLE-FM on several datasets showed that the results are not very sensitive to small deviations in the values of the parameters: the representation number and the sample size. We found that a value of 10 for the representation number works well for several datasets, including those used for the experimental study in Section 6. An appropriate value for the sample size SS depends on the branching factor BF of the CF*-tree; we observed that a value of 5·BF works well in practice.
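The non-leaf distance computation of Section 5.2 can be sketched as follows: an incoming object's image vector is compared against the per-entry centroids S̄(NL_i) of the sample-object image vectors. The function names and the flat list-of-entries representation are illustrative assumptions; CF*-tree bookkeeping and the |S(NL)| ≤ 2k fallback are omitted.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def entry_centroid(image_vectors):
    """S-bar(NL_i): centroid of the image vectors of the entry's sample objects."""
    k = len(image_vectors[0])
    m = len(image_vectors)
    return [sum(vec[t] for vec in image_vectors) / m for t in range(k)]

def nearest_entry(v_new, entries):
    """entries: list of (entry_id, image vectors of S(NL_i)) pairs.
    Returns the entry minimizing D(O_new, NL_i) = ||V_new - S-bar(NL_i)||."""
    best, best_d = None, float("inf")
    for entry_id, vecs in entries:
        dist = euclid(v_new, entry_centroid(vecs))
        if dist < best_d:
            best, best_d = entry_id, dist
    return best, best_d
```

Only Euclidean arithmetic on short vectors is performed here; the expensive metric d is confined to the 2k pivot-distance calls needed to produce v_new in the first place.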
6. Performance Evaluation

In this section, we evaluate BUBBLE and BUBBLE-FM on synthetic datasets. Our studies show that BUBBLE and BUBBLE-FM are scalable, high-quality clustering algorithms. We introduce some notation before describing the evaluation metrics.

Let A_1, ..., A_K be the actual clusters in the dataset and C_1, ..., C_K be the set of clusters discovered by BUBBLE or BUBBLE-FM. Let Ā_i (C̄_i) be the centroid of cluster A_i (C_i). Let Ĉ_i be the clustroid of C_i. Let n(C) denote the number of points in the cluster C. We use the following metrics, some of which are traditionally used in the Statistics and the Pattern Recognition communities [6, 7], to evaluate the clustering quality and speed.

- The distortion (Σ_{j=1}^{K} Σ_{X ∈ C_j} (X − C̄_j)²) of a set of K clusters indicates the tightness of the clusters.

- The clustroid quality (CQ = Σ_i ||Ā_i − Ĉ_j|| / K) is the average distance between the actual centroid Ā_i of a cluster and the clustroid Ĉ_j that is closest to Ā_i.

- The number of calls to d (NCD) and the time taken by the algorithm indicate the cost of the algorithm. NCD is useful to extrapolate the performance for computationally expensive distance functions.

6.1. Datasets and Evaluation Methodology

To compare with the Map-First option, we use two datasets DS1 and DS2. Both DS1 and DS2 have 100000 2-dimensional points distributed in 100 clusters [26]. However, the cluster centers in DS1 are uniformly distributed on a 2-dimensional grid; in DS2, the cluster centers are distributed on a sine wave. These two datasets are also used to visually observe the clusters produced by BUBBLE and BUBBLE-FM.

We also generated k-dimensional datasets as described by Agrawal et al. [1]. The k-dimensional box [0,10]^k is divided into 2^k cells by halving the range [0,10] over each dimension. A cluster center is randomly placed in each of K cells chosen randomly from the 2^k cells, where K is the number of clusters in the dataset. In each cluster, N/K points are distributed uniformly within a radius randomly picked from [0.5, 1.0]. A dataset containing N k-dimensional points and K clusters is denoted DSkd.Kc.N. Even though these datasets consist of k-dimensional vectors, we do not exploit the operations specific to coordinate spaces, and treat the vectors in the dataset merely as objects; the distance between any two objects is returned by the Euclidean distance function.

We now describe the evaluation methodology. The clustroids of the sub-clusters returned by BUBBLE and BUBBLE-FM are further clustered using a hierarchical clustering algorithm [20] to obtain the required number of clusters. To minimize the effect of hierarchical clustering on the final results, the amount of memory allocated to the algorithm was adjusted so that the number of sub-clusters returned by BUBBLE or BUBBLE-FM is very close (not exceeding the actual number of clusters by more than 5%) to the actual number of clusters in the synthetic dataset. Whenever the final cluster is formed by merging sub-clusters, the clustroid of the final cluster is the centroid of the clustroids of the sub-clusters merged. The other parameters to the algorithm, the sample size (SS), the branching factor (B), and the representation number (2p), are fixed at 75, 15, and 10 respectively (unless otherwise stated), as they were found to result in good clustering quality. The image dimensionality for BUBBLE-FM is set to be equal to the dimensionality of the data. The dataset D is scanned a second time to associate each object O ∈ D with a cluster whose representative object is closest to O.

6.2. Comparison with the Map-First option

We mapped DS1, DS2, and DS20d.50c.100K into an appropriate k-dimensional space (k = 2 for DS1 and DS2, and 20 for DS20d.50c.100K) using FastMap, and then used BIRCH to cluster the resulting k-dimensional vectors. The clustroids of clusters obtained from BUBBLE and BUBBLE-FM on DS2 are shown in Figures 1 and 2 respectively, and the centroids of clusters obtained from BIRCH are shown in Figure 3. From the distortion values (Table 1), we see that the quality of clusters obtained by BUBBLE or BUBBLE-FM is clearly better than the Map-First option.

    Dataset           Map-First    BUBBLE     BUBBLE-FM
    DS1               195146       129798     122544
    DS2               1147830      125093     125094
    DS20d.50c.100K    2.214x10^6   21127.5    21127.5

    Table 1. Comparison with the Map-First option
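The two quality metrics defined at the start of this section can be transcribed directly; the sketch below assumes Euclidean points as in the synthetic datasets, and the function names are illustrative.

```python
import math

def distortion(clusters):
    """Sum over clusters C_j of the squared distances of the points
    X in C_j to the cluster centroid (the tightness metric)."""
    total = 0.0
    for C in clusters:
        dim = len(C[0])
        centroid = [sum(p[t] for p in C) / len(C) for t in range(dim)]
        total += sum(sum((p[t] - centroid[t]) ** 2 for t in range(dim))
                     for p in C)
    return total

def clustroid_quality(actual_centroids, clustroids):
    """CQ: average distance from each actual centroid A-bar_i to the
    closest discovered clustroid C-hat_j."""
    def euclid(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return sum(min(euclid(a, c) for c in clustroids)
               for a in actual_centroids) / len(actual_centroids)
```

Because CQ takes, for each actual centroid, the distance to the nearest data object serving as a clustroid, it is bounded below by the average centroid-to-nearest-point distance of the dataset, which is the 0.212 floor cited for DS20d.50c.100K.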
6.3. Quality of Clustering

In this section, we use the dataset DS20d.50c.100K. To place the results in the proper perspective, we mention that the average distance between the centroid Ā_i of each cluster and an actual point in the dataset closest to Ā_i is 0.212. Hence the clustroid quality (CQ) cannot be less than 0.212. From Table 2, we observe that the CQ values are close to the minimum possible value (0.212), and the actual and computed distortion values match almost exactly. Also, we observed that all the points except a few (fewer than 5) were placed in the appropriate clusters. (The quality of the result from BIRCH was shown to be independent of the input order [26]. Since BUBBLE and BUBBLE-FM are instantiations of the BIRCH* framework, which is abstracted out from BIRCH, we do not present more results on order-independence here.)

    Algorithm     CQ       Actual Distortion    Computed Distortion
    BUBBLE        0.289    21127.4              21127.5
    BUBBLE-FM     0.294    21127.4              21127.5

    Table 2. Clustering Quality

    Figure 1. DS2: BUBBLE (SS=75, B=15, 2p=10, #Clusters=50)
    Figure 2. DS2: BUBBLE-FM (SS=75, B=15, 2p=10, #Clusters=50)
    Figure 3. DS2: BIRCH (SS=75, B=15, 2p=10, #Points=200K)
    Figure 4. Time vs #points
    Figure 5. NCD vs #points
    Figure 6. Time vs #clusters

6.4. Scalability

To study scalability characteristics with respect to the number of points in the dataset, we fixed the number of clusters at 50 and varied the number of data points from 50000 to 500000 (i.e., DS20d.50c.*). Figures 4 and 5 plot the time and NCD values for BUBBLE and BUBBLE-FM as the number of points is increased. We make the following observations. (i) Both algorithms scale linearly with the number of points, which is as expected. (ii) BUBBLE consistently outperforms BUBBLE-FM. This is due to the overhead of FastMap in BUBBLE-FM. (The distance function in the fastmapped space as well as the original space is the Euclidean distance function.) However, the constant difference between their running times suggests that the overhead due to the use of FastMap remains constant even though the number of points increases. The difference is constant because the overhead due to FastMap is incurred only when the nodes in the CF*-tree split; once the distribution of clusters is captured, the nodes do not split that often any more. (iii) As expected, BUBBLE-FM has smaller NCD values. Since the overhead due to the use of FastMap remains constant, as the number of points is increased the difference between the NCD values increases. (NCD versus number of clusters is in the full paper [16].)

To study scalability with respect to the number of clusters, we varied the number of clusters between 50 and 250 while keeping the number of points constant at 200000. The results are shown in Figure 6. The plot of time versus number of clusters is almost linear.

7. Data Cleaning Application

When different bibliographic databases are integrated, different conventions for recording bibliographic items such as author names and affiliations cause problems. Users familiar with one set of conventions will expect their usual forms to retrieve relevant information from the entire collection when searching. Therefore, a necessary part of the integration is the creation of a joint authority file [2, 15] in which classes of equivalent strings are maintained. These equivalence classes can be assigned a canonical form. The process of reconciling variant string forms ultimately requires domain knowledge and inevitably a human in the loop, but it can be significantly speeded up by first achieving a rough clustering using a metric such as the edit distance. Grouping closely related entries into initial clusters that act as representative strings has two benefits: (1) Early aggregation acts as a "sorting" step that lets us use more aggressive strategies in later stages with less risk of erroneously separating closely related strings. (2) If an error is made in the placement of a representative, only that representative need be moved to a new location. Also, even the small reduction in the data size is valuable, given the cost of the subsequent detailed analysis involving a domain expert. (Examples and more details are given in the full paper.)

Applying edit distance techniques to obtain such a "first
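As an illustration of the kind of rough, edit-distance-based grouping described above, the sketch below pairs the standard dynamic-programming edit distance (the O(mn) computation mentioned in the introduction) with a greedy one-pass grouping. The greedy pass is an illustrative stand-in, not the paper's BUBBLE-based procedure, and the threshold is a hypothetical parameter.

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming: O(mn) character
    comparisons for strings of lengths m and n."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitution
        prev = cur
    return prev[n]

def rough_clusters(strings, threshold):
    """Greedy 'first-cut' grouping: each string joins the first
    representative within `threshold` edit operations, else it starts
    a new group with itself as representative."""
    reps, groups = [], []
    for s in strings:
        for r, g in zip(reps, groups):
            if edit_distance(s, r) <= threshold:
                g.append(s)
                break
        else:
            reps.append(s)
            groups.append([s])
    return groups
```

In the authority-file setting, each group's representative would then be the candidate canonical form handed to the domain expert, and a misplaced representative can be moved without revisiting the whole group.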
