Clustering Large Datasets in Arbitrary Metric Spaces

Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke
Computer Sciences Department, University of Wisconsin-Madison

Allison Powell, James French
Department of Computer Science, University of Virginia, Charlottesville
Abstract

Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space.[1] In a distance space, the only operation possible on data objects is the computation of distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects.

We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm BUBBLE is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm BUBBLE-FM improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real-life dataset.

1. Introduction

Data clustering is an important data mining problem [1, 8, 9, 10, 12, 17, 21, 26]. The goal of clustering is to partition a collection of objects into groups, called clusters, such that "similar" objects fall into the same group. Similarity between objects is captured by a distance function.

In this paper, we consider the problem of clustering large datasets in a distance space in which the only operation possible on data objects is the computation of a distance function that satisfies the triangle inequality. In contrast, objects in a coordinate space can be represented as vectors. The vector representation allows various vector operations, e.g., addition and subtraction of vectors, to form condensed representations of clusters and to reduce the time and space requirements of the clustering problem [4, 26]. These operations are not possible in a distance space, thus making the problem much harder.

The distance function associated with a distance space can be computationally very expensive [5], and may dominate the overall resource requirements. For example, consider the domain of strings where the distance between two strings is the edit distance.[2] Computing the edit distance between two strings of lengths m and n requires O(mn) comparisons between characters. In contrast, computing the Euclidean distance between two n-dimensional vectors in a coordinate space requires just O(n) operations. Most algorithms in the literature have paid little attention to this particular issue when devising clustering algorithms for data in a distance space.

In this work, we first abstract out the essential features of the BIRCH clustering algorithm [26] into the BIRCH* framework for scalable clustering algorithms. We then instantiate BIRCH*, resulting in two new scalable clustering algorithms for distance spaces: BUBBLE and BUBBLE-FM.

The remainder of the paper is organized as follows. In Section 2, we discuss related work on clustering and some of our initial approaches. In Section 3, we present the BIRCH* framework for fast, scalable, incremental clustering algorithms. In Sections 4 and 5, we instantiate the framework for data in a distance space, resulting in our algorithms BUBBLE and BUBBLE-FM. Section 6 evaluates the performance of BUBBLE and BUBBLE-FM on synthetic datasets. We discuss an application of BUBBLE-FM to a real-life dataset in Section 7 and conclude in Section 8.

[*] The first three authors were supported by Grant 2053 from the IBM Corporation.
[†] Supported by an IBM Corporate Fellowship.
[‡] Supported by NASA GSRP NGT5-50062.
[§] This work supported in part by DARPA contract N66001-97-C-8542.

[1] A distance space is also referred to as an arbitrary metric space. We use the term distance space to emphasize that only distance computations are possible between objects. We call an n-dimensional space a coordinate space to emphasize that vector operations like centroid computation, sum, and difference of vectors are possible.
[2] The edit distance between two strings is the number of simple edit operations required to transform one string into the other.
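As a concrete illustration of the O(mn) edit-distance computation discussed in Section 1 (footnote 2), the standard dynamic program fills an m-by-n table of prefix distances. This sketch is ours, not the paper's code:

```python
# Minimal O(mn) edit (Levenshtein) distance between strings s and t.
# Only two table rows are kept, so space is O(n) while time stays O(mn).
def edit_distance(s, t):
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from s[:0] to every prefix of t
    for i in range(1, m + 1):
        cur = [i] + [0] * n            # distance from s[:i] to the empty prefix
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete from s
                         cur[j - 1] + 1,     # insert into s
                         prev[j - 1] + cost) # substitute (or match)
        prev = cur
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```

Each of the m*n cells costs one character comparison, which is exactly the per-call expense that BUBBLE-FM tries to amortize.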
We assume that the reader is familiar with the definitions of the following standard terms: metric space, L_p norm of a vector, radius, and centroid of a set of points in a coordinate space. (See the full paper [16] for the definitions.)

2. Related Work and Initial Approaches

In this section, we discuss related work on clustering, and three important issues that arise when clustering data in a distance space vis-a-vis clustering data in a coordinate space.

Data clustering has been extensively studied in the Statistics [20], Machine Learning [12, 13], and Pattern Recognition literature [6, 7]. These algorithms assume that all the data fits into main memory, and typically have running times super-linear in the size of the dataset. Therefore, they do not scale to large databases.

Recently, clustering has received attention as an important data mining problem [8, 9, 10, 17, 21, 26]. CLARANS [21] is a medoid-based method which is more efficient than earlier medoid-based algorithms [18], but has two drawbacks: it assumes that all objects fit in main memory, and the result is very sensitive to the input order [26]. Techniques to improve CLARANS's ability to deal with disk-resident datasets by focussing only on relevant parts of the database using R*-trees were also proposed [9, 10]. But these techniques depend on R*-trees, which can only index vectors in a coordinate space. DBSCAN [8] uses a density-based notion of clusters to discover clusters of arbitrary shapes. Since DBSCAN relies on the R*-tree for speed and scalability in its nearest-neighbor search queries, it cannot cluster data in a distance space. BIRCH [26] was designed to cluster large datasets of n-dimensional vectors using a limited amount of main memory. But the algorithm relies heavily on vector operations, which are defined only in coordinate spaces. CURE [17] is a sampling-based hierarchical clustering algorithm that is able to discover clusters of arbitrary shapes. However, it relies on vector operations and therefore cannot cluster data in a distance space.

Three important issues arise when clustering data in a distance space versus data in a coordinate space. First, the concept of a centroid is not defined. Second, the distance function could potentially be computationally very expensive, as discussed in Section 1. Third, the domain-specific nature of clustering applications places requirements that are tough to be met by just one algorithm.

Many clustering algorithms [4, 17, 26] rely on vector operations, e.g., the calculation of the centroid, to represent clusters and to improve computation time. Such algorithms cannot cluster data in a distance space. Thus one approach is to map all objects into a k-dimensional coordinate space while preserving distances between pairs of objects, and then cluster the resulting vectors.

Multidimensional scaling (MDS) is a technique for distance-preserving transformations [25]. The input to an MDS method is a set S of N objects, a distance function d, and an integer k; the output is a set of N k-dimensional image vectors in a k-dimensional coordinate space (also called the image space), one image vector for each object, such that the distance between any two objects is equal (or very close) to the distance between their respective image vectors. MDS algorithms do not scale to large datasets for two reasons. First, they assume that all objects fit in main memory. Second, most MDS algorithms proposed in the literature compute distances between all possible pairs of input objects as a first step, thus having complexity at least O(N^2) [19]. Recently, Lin et al. developed a scalable MDS method called FastMap [11]. FastMap preserves distances approximately in the image space while requiring only a fixed number of scans over the data. Therefore, one possible approach for clustering data in a distance space is to map all N objects into a coordinate space using FastMap, and then cluster the resultant vectors using a scalable clustering algorithm for data in a coordinate space. We call this approach the Map-First option and empirically evaluate it in Section 6.2. Our experiments show that the quality of clustering thus obtained is not good.

Applications of clustering are domain-specific, and we believe that a single algorithm will not serve all requirements. A pre-clustering phase, to obtain a data-dependent summarization of large amounts of data into sub-clusters, was shown to be very effective in making more complex data analysis feasible [4, 24, 26]. Therefore, we take the approach of developing a pre-clustering algorithm that returns condensed representations of sub-clusters. A domain-specific clustering method can further analyze the sub-clusters output by our algorithm.

3. BIRCH*

In this section, we present the BIRCH* framework, which generalizes the notion of a cluster feature (CF) and a CF-tree, the two building blocks of the BIRCH algorithm [26]. In the BIRCH* family of algorithms, objects are read from the database sequentially and inserted into incrementally evolving clusters which are represented by generalized cluster features (CF*s). A new object read from the database is inserted into the closest cluster, an operation which potentially requires an examination of all existing CF*s. Therefore BIRCH* organizes all clusters in an in-memory index, a height-balanced tree, called a CF*-tree. For a new object, the search for an appropriate cluster then requires time logarithmic in the number of clusters as opposed to a linear scan.

In the remainder of this section, we abstractly state the components of the BIRCH* framework. Instantiations of these components generate concrete clustering algorithms.
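The scalability argument of Section 2 can be made concrete by counting invocations of the distance function d: an all-pairs MDS preprocessing step makes a quadratic number of calls, while FastMap makes 3Nkc calls (Section 5.1). A small illustrative comparison (our sketch, not the paper's code):

```python
# Compare the number of calls to the distance function d needed to embed
# N objects: all-pairs MDS versus FastMap's linear 3*N*k*c calls.
def mds_distance_calls(N):
    # classical MDS computes every pairwise distance once as a first step
    return N * (N - 1) // 2

def fastmap_distance_calls(N, k, c=1):
    # FastMap maps N objects into k dimensions with about 3*N*k*c calls
    return 3 * N * k * c

print(mds_distance_calls(100000))           # 4999950000
print(fastmap_distance_calls(100000, k=3))  # 900000
```

For N = 100,000 objects and k = 3, the all-pairs step needs roughly five billion distance computations versus under a million for FastMap, which is why the Map-First option and BUBBLE-FM build on the latter.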
3.1. Generalized Cluster Feature

Any clustering algorithm needs a representation for the clusters detected in the data. The naive representation uses all objects in a cluster. However, since a cluster corresponds to a dense region of objects, the set of objects can be treated collectively through a summarized representation. We will call such a condensed, summarized representation of a cluster its generalized cluster feature (CF*).

Since the entire dataset usually does not fit in main memory, we cannot examine all objects simultaneously to compute CF*s of clusters. Therefore, we incrementally evolve clusters and their CF*s, i.e., objects are scanned sequentially and the set of clusters is updated to assimilate new objects. Intuitively, at any stage, the next object is inserted into the cluster "closest" to it as long as the insertion does not deteriorate the "quality" of the cluster. (Both concepts are explained later.) The CF* is then updated to reflect the insertion. Since objects in a cluster are not kept in main memory, CF*s should meet the following requirements:

- Incremental updatability whenever a new object is inserted into the cluster.
- Sufficiency to compute distances between clusters, and quality metrics (like radius) of a cluster.

CF*s are efficient for two reasons. First, they occupy much less space than the naive representation. Second, calculation of inter-cluster and intra-cluster measurements using the CF*s is much faster than calculations involving all objects in clusters.

3.2. CF*-Tree

In this section, we describe the structure and functionality of a CF*-tree.

A CF*-tree is a height-balanced tree structure similar to the R*-tree [3]. The number of nodes in the CF*-tree is bounded by a pre-specified number M. Nodes in a CF*-tree are classified into leaf and non-leaf nodes according to their position in the tree. Each non-leaf node contains at most B entries of the form (CF*_i, child_i), i in {1,...,B}, where child_i is a pointer to the i-th child node, and CF*_i is the CF* of the set of objects summarized by the sub-tree rooted at the i-th child. A leaf node contains at most B entries, each of the form [CF*_i], i in {1,...,B}; each leaf entry is the CF* of a cluster. Each cluster at the leaf level satisfies a threshold requirement T, which controls its tightness or quality.

The purpose of the CF*-tree is to direct a new object O to the cluster closest to it. The functionality of non-leaf entries and leaf entries in the CF*-tree is different: non-leaf entries exist to "guide" new objects to appropriate leaf clusters, whereas leaf entries represent the dynamically evolving clusters. For a new object O, at each non-leaf node on the downward path, the non-leaf entry "closest" to O is selected to traverse downwards. Intuitively, directing O to the child node of the closest non-leaf entry is similar to identifying the most promising region and zooming into it for a more thorough examination. The downward traversal continues till O reaches a leaf node. When O reaches a leaf node L, it is inserted into the cluster C in L closest to O if the threshold requirement T is not violated due to the insertion. Otherwise, O forms a new cluster in L. If L does not have enough space for the new cluster, it is split into two leaf nodes and the entries in L redistributed: the set of leaf entries in L is divided into two groups such that each group consists of "similar" entries. A new entry for the new leaf node is created at its parent. In general, all nodes on the path from the root to L may split. We omit the details of the insertion of an object into the CF*-tree because it is similar to that of BIRCH [26].

During the data scan, existing clusters are updated and new clusters are formed. The number of nodes in the CF*-tree may increase beyond M before the data scan is complete due to the formation of many new clusters. Then it is necessary to reduce the space occupied by the CF*-tree, which can be done by reducing the number of clusters it maintains. The reduction in the number of clusters is achieved by merging close clusters to form bigger clusters. BIRCH* merges clusters by increasing the threshold value T associated with the leaf clusters and re-inserting them into a new tree. The re-insertion of a leaf cluster into the new tree merely inserts its CF*; all objects in leaf clusters are treated collectively. Thus a new, smaller CF*-tree is built. After all the old leaf entries have been inserted into the new tree, the data scan resumes from the point of interruption.

Note that the CF*-tree insertion algorithm requires distance measures between the "inserted entries" and node entries to select the closest entry at each level. Since insertions are of two types, insertion of a single object and insertion of a leaf cluster, the BIRCH* framework requires distance measures to be instantiated between a CF* and an object, and between two CF*s (or clusters).

In summary, CF*s, their incremental maintenance, the distance measures, and the threshold requirement are the components of the BIRCH* framework, which have to be instantiated to derive a concrete clustering algorithm.

4. BUBBLE

In this section, we instantiate BIRCH* for data in a distance space, resulting in our first algorithm called BUBBLE. Recall that CF*s at leaf and non-leaf nodes differ in their functionality. The former incrementally maintain information about the output clusters, whereas the latter are used to direct new objects to appropriate leaf clusters. Sections 4.1 and 4.2 describe the information in a CF* (and its incremental maintenance) at the leaf and non-leaf levels.
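The descend-and-insert logic of Section 3.2 can be sketched as follows. This is a simplified illustration with hypothetical names, not the authors' code: each subtree is summarized by a single representative object, the first object of a cluster stands in for its clustroid, and node capacity, splits, and rebuilds are ignored.

```python
# Schematic CF*-tree descent: follow the "closest" non-leaf entry downward,
# then join the closest leaf cluster only if the threshold T is respected.
class Node:
    def __init__(self, children=None, clusters=None, rep=None):
        self.children = children   # non-leaf: list of child Nodes
        self.clusters = clusters   # leaf: list of clusters (lists of objects)
        self.rep = rep             # representative object summarizing the subtree

def insert(root, obj, d, T):
    node = root
    # downward traversal: at each non-leaf node pick the closest entry
    while node.children is not None:
        node = min(node.children, key=lambda ch: d(ch.rep, obj))
    # leaf level: join the closest cluster if the threshold holds
    if node.clusters:
        best = min(node.clusters, key=lambda c: d(c[0], obj))
        if d(best[0], obj) <= T:   # c[0] stands in for the clustroid
            best.append(obj)
            return
    node.clusters.append([obj])    # otherwise obj forms a new cluster

leaf = Node(clusters=[])
for x in [0.0, 0.4, 5.0, 5.2]:
    insert(leaf, x, lambda a, b: abs(a - b), 1.0)
print(leaf.clusters)  # [[0.0, 0.4], [5.0, 5.2]]
```

In BUBBLE itself the role of `rep` is played by sets of sample objects (Section 4.2), and the leaf-level comparison uses the true clustroid.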
4.1. CF*s at the leaf level

4.1.1 Summary statistics at the leaf level

For each cluster discovered by the algorithm, we return the following information (which is used in further processing): the number of objects in the cluster, a centrally located object in it, and its radius. Since a distance space, in general, does not support creation of new objects using operations on a set of objects, we assign an actual object in the cluster as the cluster center. We define the clustroid Ô of a set of objects O, which is the generalization of the centroid to a distance space.[3] We now introduce the RowSum of an object with respect to a set of objects O, and the concept of an image space IS(O) of a set of objects O in a distance space. Informally, the image space of a set of objects is a coordinate space containing an image vector for each object such that the distance between any two image vectors is the same as the distance between the corresponding objects.

[3] The medoid O_m of a set of objects O is sometimes used as a cluster center [18]. It is defined as the object O_m in O that minimizes the average dissimilarity to all objects in O (i.e., sum_{i=1}^{n} d(O_i, O) is minimum when O = O_m). But it is not possible to motivate the heuristic maintenance, a la the clustroid, of the medoid. However, we expect similar heuristics to work even for the medoid.

In the remainder of this section, we use (S, d) to denote a distance space, where S is the domain of all possible objects and d: S x S -> R is a distance function.

Definition 4.1 Let O = {O_1, ..., O_n} be a set of objects in a distance space (S, d). The RowSum of an object O in O is defined as RowSum(O) =def sum_{j=1}^{n} d^2(O, O_j). The clustroid Ô is defined as the object Ô in O such that for all O in O: RowSum(Ô) <= RowSum(O).

Definition 4.2 Let O = {O_1, ..., O_n} be a set of objects in a distance space (S, d). Let f: O -> R^k be a function. We call f a (S, d) -> R^k distance-preserving transformation if for all i, j in {1, ..., n}: d(O_i, O_j) = ||f(O_i) - f(O_j)||, where ||X - Y|| is the Euclidean distance between X and Y in R^k. We call R^k the image space of O under f (denoted IS(O)). For an object O in O, we call f(O) the image vector of O under f. We define f(O) =def {f(O_1), ..., f(O_n)}.

The existence of a distance-preserving transformation is guaranteed by the following lemma.

Lemma 4.1 [19] Let O be a set of objects in a distance space (S, d). Then there exists a positive integer k (k < |O|) and a function f: O -> R^k such that f is a k-distance-preserving transformation.

For example, three objects x, y, z with the inter-object distance distribution [d(x,y) = 3, d(y,z) = 4, d(z,x) = 5] can be mapped to the vectors (0,0), (3,0), (0,4) in the 2-dimensional Euclidean space. This is one of many possible mappings.

The following lemma shows that under any k-distance-preserving transformation f, the clustroid of O is the object O in O whose image vector f(O) is closest to the centroid of the set of image vectors f(O). Thus, the clustroid is the generalization of the centroid to distance spaces. Following the generalization of the centroid, we generalize the definitions of the radius of a cluster, and the distance between clusters, to distance spaces.

Lemma 4.2 Let O = {O_1, ..., O_n} be a set of objects in a distance space (S, d) with clustroid Ô, and let f: O -> R^k be a k-distance-preserving transformation. Let Ō be the centroid of f(O). Then the following holds:

    for all O in O: ||f(Ô) - Ō|| <= ||f(O) - Ō||

Definition 4.3 Let O = {O_1, ..., O_n} be a set of objects in a distance space (S, d) with clustroid Ô. The radius r(O) of O is defined as

    r(O) =def sqrt( sum_{i=1}^{n} d^2(O_i, Ô) / n ).

Definition 4.4 We define two different inter-cluster distance metrics between cluster features. Let O_1 = {O_11, ..., O_1n1} and O_2 = {O_21, ..., O_2n2} be two clusters consisting of objects with clustroids Ô_1 and Ô_2 respectively. We define the clustroid distance D0 as

    D0(O_1, O_2) =def d(Ô_1, Ô_2),

and the average inter-cluster distance D2 as

    D2(O_1, O_2) =def ( sum_{i=1}^{n1} sum_{j=1}^{n2} d^2(O_1i, O_2j) / (n1 * n2) )^(1/2).

Both BUBBLE and BUBBLE-FM use D0 as the distance metric between leaf-level clusters, and D0 as the threshold requirement T, i.e., a new object Onew is inserted into a cluster O with clustroid Ô only if D0(O, {Onew}) = d(Ô, Onew) <= T. (The use for D2 is explained later.)

4.1.2 Incremental maintenance of leaf-level CF*s

In this section, we describe the incremental maintenance of CF*s at the leaf levels of the CF*-tree. Since the sets of objects we are concerned with in this section are clusters, we use C (instead of O) to denote a set of objects.

The incremental maintenance of the number of objects in a cluster C is trivial. So we concentrate next on the incremental maintenance of the clustroid Ĉ. Recall that for a cluster C, Ĉ is the object in C with the minimum RowSum value. As long as we are able to keep all the objects of C in main memory, we can maintain Ĉ incrementally under insertions by updating the RowSum values of all objects and then selecting the object O in C with minimum RowSum value as the clustroid. But this strategy requires all objects in C in main memory, which is not a viable option for large datasets. Since exact maintenance is not possible, we develop a heuristic strategy which works well in practice while significantly reducing main memory requirements.
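Definition 4.1 can be worked out on the three-object example above (an illustrative sketch of ours, with d given by the 3-4-5 distance distribution):

```python
# RowSum and clustroid (Definition 4.1) for a small distance space.
def rowsum(o, objects, d):
    # RowSum(O) = sum of squared distances from O to all objects
    return sum(d(o, x) ** 2 for x in objects)

def clustroid(objects, d):
    # the clustroid minimizes RowSum over the set
    return min(objects, key=lambda o: rowsum(o, objects, d))

pair = {("x", "y"): 3.0, ("y", "z"): 4.0, ("z", "x"): 5.0}
def d(a, b):
    if a == b:
        return 0.0
    return pair.get((a, b)) or pair[(b, a)]

objs = ["x", "y", "z"]
print({o: rowsum(o, objs, d) for o in objs})  # {'x': 34.0, 'y': 25.0, 'z': 41.0}
print(clustroid(objs, d))                     # y
```

Consistently with Lemma 4.2: in any valid 2-dimensional mapping of this example, y's image is the right-angle vertex (0,0), which is exactly the image vector closest to the centroid (1, 4/3) of the three image vectors.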
We classify insertions in clusters into two types, Type I and Type II, and motivate heuristics for each type of insertion. A Type I insertion is the insertion of a single object or, equivalently, a cluster containing only one object. Each object in the dataset causes a Type I insertion when it is read from the data file, making it the most common type of insertion. A Type II insertion is the insertion of a cluster containing more than one object. Type II insertions occur only when the CF*-tree is being rebuilt. (See Section 3.)

Type I Insertions: In our heuristics, we make the following approximation: under any distance-preserving transformation into a coordinate space, the image vector of the clustroid is the centroid of the set of all image vectors, i.e., f(Ĉ) = f̄(C). From Lemma 4.2, we know that this is the best possible approximation. In addition to the approximation, our heuristic is motivated by the following two observations.

Observation 1: Consider the insertion of a new object Onew into a cluster C = {O_1, ..., O_n}, and assume that only a subset R of C is kept in main memory. Let f: C union {Onew} -> R^k be a k-distance-preserving transformation from C union {Onew} into R^k, and let f̄(C) = (1/n) sum_{i=1}^{n} f(O_i) be the centroid of f(C) in R^k. Then,

    RowSum(Onew) = sum_{j=1}^{n} d^2(Onew, O_j) = sum_{j=1}^{n} (f(Onew) - f(O_j))^2
                 = sum_{j=1}^{n} (f(O_j) - f̄(C))^2 + n * (f̄(C) - f(Onew))^2
                 ≈ n * r^2(C) + n * d^2(Ĉ, Onew)

Thus, we can calculate RowSum(Onew) approximately using only Ĉ and significantly reduce the main memory requirements.

Observation 2: Let C = {O_1, ..., O_n} be a leaf cluster in the CF*-tree and Onew an object which is inserted into C. Let Ĉ and Ĉ* be the clustroids of C and C union {Onew} respectively. Let D0 be the distance metric between leaf clusters and T the threshold requirement of the CF*-tree under D0. Then

    d(Ĉ, Ĉ*) <= ε, where ε = T / (n + 1).

An implication of Observation 2 is that as long as we keep a set of objects R, a subset of C consisting of all objects in C within a distance of ε from Ĉ, we know that Ĉ* is in R. However, when the clustroid changes due to the insertion of Onew, we have to update R to consist of all objects within ε of Ĉ*. Since we cannot assume that all objects in the dataset fit in main memory, we have to retrieve objects in C from the disk. Repeated retrieval of objects from the disk, whenever a clustroid changes, is expensive. Fortunately (from Observation 2), if n is large then the new set of objects within ε of Ĉ* is almost the same as the old set R, because Ĉ* is very close to Ĉ.

Observations 1 and 2 motivate the following heuristic maintenance of the clustroid. As long as |C| is small (smaller than a constant p), we keep all the objects of C in main memory and compute the new clustroid exactly. If |C| is large (larger than p), we invoke Observation 2 and maintain a subset R of C of size p. These objects have the lowest RowSum values in C and hence are closest to Ĉ. If the RowSum value of Onew is less than the highest of the p values, say that of O_p, then Onew replaces O_p in R. Our experiments confirm that this heuristic works very well in practice.

Type II Insertions: Let C_1 = {O_11, ..., O_1n1} and C_2 = {O_21, ..., O_2n2} be two clusters being merged. Let f be a distance-preserving transformation of C_1 union C_2 into R^k. Let X_ij be the image vector of object O_ij under f. Let X̄_1 (X̄_2) be the centroid of {X_11, ..., X_1n1} ({X_21, ..., X_2n2}). Let Ĉ_1, Ĉ_2 be their clustroids, and r(C_1), r(C_2) be their radii. Let Ĉ* be the clustroid of C_1 union C_2. The new centroid X̄ of f(C_1 union C_2) lies on the line joining X̄_1 and X̄_2; its exact location on the line depends on the values of n_1 and n_2. Since D0 is used as the threshold requirement for insertions, the distance between X̄ and X̄_1 is bounded as shown below:

    (X̄ - X̄_1)^2 = n_2^2 (X̄_1 - X̄_2)^2 / (n_1 + n_2)^2 ≈ n_2^2 d^2(Ĉ_1, Ĉ_2) / (n_1 + n_2)^2 < n_2^2 T^2 / (n_1 + n_2)^2

The following two assumptions motivate the heuristic maintenance of the clustroid under Type II insertions.

(i) C_1 and C_2 are non-overlapping but very close to each other. Since C_1 and C_2 are being merged, the threshold criterion is satisfied, implying that C_1 and C_2 are close to each other. We expect the two clusters to be almost non-overlapping because they were two distinct clusters in the old CF*-tree.

(ii) n_1 ≈ n_2. Due to lack of any prior information about the clusters, we assume that the objects are uniformly distributed in the merged cluster. Therefore, the values of n_1 and n_2 are close to each other in Type II insertions.

For these two reasons, we expect the new clustroid Ĉ* to be midway between Ĉ_1 and Ĉ_2, which corresponds to the periphery of either cluster. Therefore we maintain a few objects (p in number) on the periphery of each cluster in its CF*. Because they are the farthest objects from the clustroid, they have the highest RowSum values in their respective clusters.

Thus we overall maintain 2*p objects for each leaf cluster C, which we call the representative objects of C; the value 2p is called the representation number of C.
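The equality step in Observation 1 is the standard "spread plus shift" identity for squared Euclidean distances about the centroid. It can be sanity-checked numerically in a plain 2-dimensional coordinate space, where f is the identity (our illustration, not the paper's code):

```python
# Check: sum_j d^2(Onew, Oj) = sum_j ||Oj - mean||^2 + n * ||mean - Onew||^2
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new = (3.0, 2.0)
n = len(pts)

def sq(a, b):
    # squared Euclidean distance
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

mean = (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
rowsum_exact = sum(sq(new, p) for p in pts)   # sum_j d^2(Onew, Oj)
spread = sum(sq(p, mean) for p in pts)        # n * r^2 term (about the centroid)
shift = n * sq(mean, new)                     # n * d^2(centroid, Onew) term
print(rowsum_exact, spread + shift)           # 36.0 36.0
```

Replacing the centroid by the clustroid, the nearest actual object, turns this exact identity into the approximation RowSum(Onew) ≈ n·r^2(C) + n·d^2(Ĉ, Onew) that BUBBLE uses when only a subset of the cluster is memory-resident.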
Storing the representative objects enables the approximate incremental maintenance of the clustroid. The incremental maintenance of the radius of C is similar to that of the RowSum values; details are given in the full paper [16].

Summarizing, we maintain the following information in the CF* of a leaf cluster C: (i) the number of objects in C, (ii) the clustroid of C, (iii) 2*p representative objects, (iv) the RowSum values of the representative objects, and (v) the radius of the cluster. All these statistics are incrementally maintainable, as described above, as the cluster evolves.

4.2. CF*s at Non-leaf Level

In this section, we instantiate the cluster features at non-leaf levels of the BIRCH* framework and describe their incremental maintenance.

4.2.1 Sample Objects

In the BIRCH* framework, the functionality of a CF* at a non-leaf entry is to guide a new object to the sub-tree which contains its prospective cluster. Therefore, the cluster feature of the i-th non-leaf entry NL_i of a non-leaf node NL summarizes the distribution of all clusters in the subtree rooted at NL_i. In Algorithm BUBBLE, this summary, the CF*, is represented by a set of objects; we call these objects the sample objects S(NL_i) of NL_i, and the union of all sample objects at all the entries the sample objects S(NL) of NL.

We now describe the procedure for selecting the sample objects. Let child_1, ..., child_k be the child nodes at NL with n_1, ..., n_k entries respectively. Let S(NL_i) denote the set of sample objects collected from child_i and associated with NL_i. S(NL) is the union of sample objects at all entries of NL. The number of sample objects to be collected at any non-leaf node is upper bounded by a constant called the sample size (SS). The number |S(NL_i)| contributed by child_i is

    MAX( ceil( n_i * SS / sum_{j=1}^{k} n_j ), 1 ).

The restriction that each child node have at least one representative in S(NL) is placed so that the distribution of the sample objects is representative of all its children, and is also necessary to define distance measures between a newly inserted object and a non-leaf cluster. If child_i is a leaf node, then the sample objects S(NL_i) are randomly picked from all the clustroids of the leaf clusters at child_i. Otherwise, they are randomly picked from child_i's sample objects S(child_i).

4.2.2 Updates to Sample Objects

The CF*-tree evolves gradually as new objects are inserted into it. The accuracy of the summary distribution captured by sample objects at a non-leaf entry depends on how recently the sample objects were gathered. The periodicity of updates to these samples, and when these updates are actually triggered, affects the currency of the samples. Each time we update the sample objects we incur a certain cost. Thus we have to strike a balance between the cost of updating the sample objects and their currency.

Because a split at child_i of NL causes redistribution of its entries between child_i and the new node child_{k+1}, we have to update samples S(NL_i) and S(NL_{k+1}) at entries NL_i and NL_{k+1} of the parent (we actually create samples for the new entry NL_{k+1}). However, to reflect changes in the distributions at all children nodes, we update the sample objects at all entries of NL whenever one of its children splits.

4.2.3 Distance measures at non-leaf levels

Let Onew be a new object inserted into the CF*-tree. The distance between Onew and NL_i is defined to be D2({Onew}, S(NL_i)). Since D2({Onew}, {}) is meaningless, we ensure that each non-leaf entry has at least one sample object from its child during the selection of sample objects. Let L_i represent the i-th leaf entry of a leaf node L. The distance between Onew and L_i is defined to be the clustroid distance D0({Onew}, C_i).

The instantiation of distance measures completes the instantiation of BIRCH*, deriving BUBBLE. We omit the cost analysis of BUBBLE because it is similar to that of BIRCH.

5. BUBBLE-FM

While inserting a new object Onew, BUBBLE computes distances between Onew and all the sample objects at each non-leaf node on its downward path from the root to a leaf node. The distance function d may be computationally very expensive (e.g., the edit distance on strings). We address this issue in our second algorithm, BUBBLE-FM, which improves upon BUBBLE by reducing the number of invocations of d using FastMap [11]. We first give a brief overview of FastMap and then describe BUBBLE-FM.

5.1. Overview of FastMap

Given a set O of N objects, a distance function d, and an integer k, FastMap quickly (in time linear in N) computes N vectors (called image vectors), one for each object, in a k-dimensional Euclidean image space such that the distance between two image vectors is close to the distance between the corresponding two objects. Thus, FastMap is an "approximate" k-distance-preserving transformation. Each of the k axes is defined by the line joining two objects.[4] The 2k objects are called pivot objects. The space defined by the k axes is the fastmapped image space IS_fm(O) of O. The number of calls to d made by FastMap to map N objects is 3Nkc, where c is a parameter (typically set to 1 or 2).

[4] See Lin et al. for details [11].

An important feature of FastMap that we use in BUBBLE-FM is its fast incremental mapping ability.
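The incremental mapping rests on FastMap's cosine-law projection: the coordinate of an object o on the axis through pivot objects a and b needs only the three pairwise distances involved. The sketch below follows Lin et al. [11]; the function names are ours, and in a coordinate space, where d is the true Euclidean distance, the projection recovers the exact coordinate along the pivot axis:

```python
# FastMap's axis projection via the cosine law:
# x = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 * d(a,b))
def project(d, a, b, o):
    dab = d(a, b)
    return (d(a, o) ** 2 + dab ** 2 - d(b, o) ** 2) / (2.0 * dab)

# sanity check with Euclidean distances in the plane
def euclid(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

print(project(euclid, (0.0, 0.0), (4.0, 0.0), (1.0, 2.0)))  # ≈ 1.0
```

Because each of the k axes is anchored by a fixed pivot pair, projecting one new object costs only a constant number of calls to d per axis, which is what makes the incremental mapping cheap.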
Given a new object Onew, FastMap projects it onto the k coordinate axes of IS_fm(O) to compute a k-dimensional vector fm(Onew) for Onew in IS_fm(O) with just 2k calls to d. The distance between Onew and any object O in O can now be measured through the Euclidean distance between their image vectors.

5.2. Description of BUBBLE-FM

BUBBLE-FM differs from BUBBLE only in its usage of sample objects at a non-leaf node. In BUBBLE-FM, we first map, using FastMap, the set of all sample objects at a non-leaf node into an "approximate" image space. We then use the image space to measure distances between an incoming object and the CF*s. Since CF*s at non-leaf entries function merely as guides to appropriate children nodes, an approximate image space is sufficient. We now describe the construction of the image space and its usage in detail.

Consider a non-leaf node NL. Whenever S(NL) is updated, we use FastMap to map S(NL) into a k-dimensional coordinate space IS_fm(NL); k is called the image dimensionality of NL. FastMap returns a vector for each object in S(NL). The centroid of the image vectors of S(NL_i) is then used as the centroid of the cluster represented by NL_i while defining distance metrics.

Let f_fm: S(NL) -> IS_fm(NL) be the distance-preserving transformation associated with FastMap that maps each sample object s in S(NL) to a k-dimensional vector f_fm(s) in IS_fm(NL). Let S̄(NL_i) be the centroid of the set of image vectors of S(NL_i), i.e.,

1. FastMap does not accurately reflect the relative distances between clustroids; the inaccuracy causes erroneous insertions of objects into clusters, deteriorating the clustering quality. Similar errors at non-leaf levels merely cause new entries to be redirected to wrong leaf nodes, where they will form new clusters. Therefore, the impact of these errors is on the maintenance costs of the CF*-tree, but not on the clustering quality, and hence they are not so severe.

2. If IS_fm(L) has to be maintained accurately under new insertions then it should be reconstructed whenever any clustroid in the leaf node L changes. In this case, the overhead of repeatedly invoking FastMap offsets the gains due to measuring distances in IS_fm(L).

5.2.2 Image dimensionality and other parameters

The image dimensionalities of non-leaf nodes can be different because the sample objects at each non-leaf node are mapped into independent image spaces. The problem of finding the right dimensionality of the image space has been studied well [19]. We set the image dimensionalities of all non-leaf nodes to the same value; any technique used to find the right image dimensionality can be incorporated easily into the mapping algorithm.

Our experience with BUBBLE and BUBBLE-FM on several datasets showed that the results are not very sensitive to small deviations in the values of the parameters: the
i S(NLi) = representationnumberandthesamplesize. We foundthat
s2S(NLi)fm(s). avalueof10fortherepresentationnumberworkswellfor
TPhejnSo(nN-Lleia)jfCF in BUBBLE-FM consists of (1) S(NL ) several datasets including those used for the experimental
(cid:3)
and(2) .Inaddition,wemaintaintheimagevectoris studyinSection6. Anappropriatevalueforthesamplesize
ofthe S(pNivLoti)objectsreturnedbyFastMap. dependson the branching factor of the CF -tree. We
The2k pivotobjectsdefinetheaxesofthek-dimensional observedthatavalueof wBoFrkswellinp(cid:3)ractice.
images2pkace constructedby FastMap. Let be a new 5(cid:2)BF
object. Using FastMap, we incrementallyOmneawp to
IS . WedefinethedistancebetweeOnnew
VanndewN2L tofmb(eNthLe)Euclidean distance between Onaenwd
i. Formally, Vnew
S(NLi)
def
D(Onew;S(NLi)) = jjVnew(cid:0)S(NLi)jj
Similarly,thedistancebetweentwonon-leafentriesNL
andNL isdefinedtobe . Wheneveir
S(NL)j , BUBBLEj-jFSM(NmLeia)s(cid:0)urSes(NdiLstja)njcjesatNL in
jthedistjan(cid:20)ce2skpace,asinBUBBLE.
5.2.1 Analternativeattheleaflevel
WedonotuseFastMapattheleaflevelsoftheCF -treefor
(cid:3)
thefollowingreasons.
1. Suppose FastMap were used at the leaf levels
also. The approximate image space constructed by
timeto associateeachobject with a clusterwhose
representativeobjectisclosestOto2 D.
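To make the two FastMap operations used above concrete, here is a hedged Python sketch: a pivot-based embedding of N objects into R^k, plus the incremental mapping of a new object with 2 distance calls per axis. The function names and the farthest-point pivot heuristic are our own simplifications; the published FastMap algorithm [11] differs in bookkeeping details.

```python
import math
import random

def fastmap(objects, dist, k, c=2):
    """Sketch of a FastMap-style embedding: maps len(objects) objects into
    R^k using only calls to the distance function dist.  Returns the image
    vectors and, for each axis, its pivot pair and their residual distance."""
    n = len(objects)
    images = [[0.0] * k for _ in range(n)]

    def residual(i, j, t):
        # distance between objects i and j after projecting out axes 0..t-1
        d2 = dist(objects[i], objects[j]) ** 2
        for s in range(t):
            d2 -= (images[i][s] - images[j][s]) ** 2
        return math.sqrt(max(d2, 0.0))

    pivots = []
    for t in range(k):
        # heuristic pivot search: jump to the farthest object c times,
        # which tends to yield a well-separated pivot pair
        a = random.randrange(n)
        b = a
        for _ in range(c):
            b = max(range(n), key=lambda j: residual(a, j, t))
            a, b = b, a
        d_ab = residual(a, b, t)
        pivots.append((a, b, d_ab))
        if d_ab == 0.0:
            continue  # residual space has collapsed; leave coordinates at 0
        for i in range(n):
            # cosine-law projection of object i onto the pivot line
            images[i][t] = (residual(a, i, t) ** 2 + d_ab ** 2
                            - residual(b, i, t) ** 2) / (2.0 * d_ab)
    return images, pivots

def fastmap_query(o, objects, dist, images, pivots):
    """Incrementally map a new object o, mirroring the 2k-call
    incremental mapping described in the text (2 calls per axis)."""
    coords = []
    for t, (a, b, d_ab) in enumerate(pivots):
        if d_ab == 0.0:
            coords.append(0.0)
            continue
        def res2(i, d_oi):
            # squared residual distance from o to pivot object i at level t
            r = d_oi ** 2
            for s in range(t):
                r -= (coords[s] - images[i][s]) ** 2
            return max(r, 0.0)
        ra2 = res2(a, dist(o, objects[a]))
        rb2 = res2(b, dist(o, objects[b]))
        coords.append((ra2 + d_ab ** 2 - rb2) / (2.0 * d_ab))
    return coords
```

With this sketch, a non-leaf distance such as D(O_new, S(NL_i)) reduces to a Euclidean distance between `fastmap_query(...)` and a stored centroid of image vectors; only the pivot images and per-axis residual scales need to be kept in the CF*.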
6. Performance Evaluation

In this section, we evaluate BUBBLE and BUBBLE-FM on synthetic datasets. Our studies show that BUBBLE and BUBBLE-FM are scalable, high-quality clustering algorithms.

6.1. Datasets and Evaluation Methodology

To compare with the Map-First option, we use two datasets DS1 and DS2. Both DS1 and DS2 have 100000 2-dimensional points distributed in 100 clusters [26]. However, the cluster centers in DS1 are uniformly distributed on a 2-dimensional grid; in DS2, the cluster centers are distributed on a sine wave. These two datasets are also used to visually observe the clusters produced by BUBBLE and BUBBLE-FM.

We also generated k-dimensional datasets as described by Agrawal et al. [1]. The k-dimensional box [0,10]^k is divided into 2^k cells by halving the range [0,10] over each dimension. A cluster center is randomly placed in each of K cells chosen randomly from the 2^k cells, where K is the number of clusters in the dataset. In each cluster, N/K points are distributed uniformly within a radius randomly picked from [0.5, 1.0]. A dataset containing N k-dimensional points and K clusters is denoted DSkd.Kc.N. Even though these datasets consist of k-dimensional vectors, we do not exploit the operations specific to coordinate spaces, and treat the vectors in the dataset merely as objects. The distance between any two objects is returned by the Euclidean distance function.

We now describe the evaluation methodology. The clustroids of the sub-clusters returned by BUBBLE and BUBBLE-FM are further clustered using a hierarchical clustering algorithm [20] to obtain the required number of clusters. To minimize the effect of hierarchical clustering on the final results, the amount of memory allocated to the algorithm was adjusted so that the number of sub-clusters returned by BUBBLE or BUBBLE-FM is very close (not exceeding the actual number of clusters by more than 5%) to the actual number of clusters in the synthetic dataset. Whenever the final cluster is formed by merging sub-clusters, the clustroid of the final cluster is the centroid of the clustroids of the sub-clusters merged. Other parameters to the algorithm, the sample size (SS), the branching factor (BF), and the representation number (2p), are fixed at 75, 15, and 10 respectively (unless otherwise stated) as they were found to result in good clustering quality. The image dimensionality for BUBBLE-FM is set equal to the dimensionality of the data. The dataset D is scanned a second time to associate each object O ∈ D with a cluster whose representative object is closest to O.

We introduce some notation before describing the evaluation metrics. Let A_1, ..., A_K be the actual clusters in the dataset and C_1, ..., C_K be the set of clusters discovered by BUBBLE or BUBBLE-FM. Let Ā_i (C̄_i) be the centroid of cluster A_i (C_i). Let Ĉ_i be the clustroid of C_i. Let n(C) denote the number of points in the cluster C. We use the following metrics, some of which are traditionally used in the Statistics and the Pattern Recognition communities [6, 7], to evaluate the clustering quality and speed.

• The distortion (Σ_{j=1}^{K} Σ_{X ∈ C_j} ||X − C̄_j||²) of a set of clusters indicates the tightness of the clusters.

• The clustroid quality (CQ = (1/K) Σ_{i=1}^{K} ||Ā_i − Ĉ_j||) is the average distance between the actual centroid Ā_i of a cluster A_i and the clustroid Ĉ_j that is closest to Ā_i.

• The number of calls to d (NCD) and the time taken by the algorithm indicate the cost of the algorithm. NCD is useful to extrapolate the performance for computationally expensive distance functions.

6.2. Comparison with the Map-First option

We mapped DS1, DS2, and DS20d.50c.100K into an appropriate k-dimensional space (k = 2 for DS1 and DS2, and k = 20 for DS20d.50c.100K) using FastMap, and then used BIRCH to cluster the resulting k-dimensional vectors.

The clustroids of clusters obtained from BUBBLE and BUBBLE-FM on DS2 are shown in Figures 1 and 2 respectively, and the centroids of clusters obtained from BIRCH are shown in Figure 3. From the distortion values (Table 1), we see that the quality of clusters obtained by BUBBLE or BUBBLE-FM is clearly better than the Map-First option.

Dataset          Map-First    BUBBLE    BUBBLE-FM
DS1              195146       129798    122544
DS2              1147830      125093    125094
DS20d.50c.100K   2.214×10^6   21127.5   21127.5

Table 1. Comparison with the Map-First option

6.3. Quality of Clustering

In this section, we use the dataset DS20d.50c.100K. To place the results in the proper perspective, we mention that the average distance between the centroid Ā_i of each cluster and an actual point in the dataset closest to Ā_i is 0.212. Hence the clustroid quality (CQ) cannot be less than 0.212. From Table 2, we observe that the CQ values are close to the minimum possible value (0.212), and the distortion values match almost exactly. Also, we observed that all the points except a few (less than 5) were placed in the appropriate clusters.

Algorithm   CQ      Actual Distortion   Computed Distortion
BUBBLE      0.289   21127.4             21127.5
BUBBLE-FM   0.294   21127.4             21127.5

Table 2. Clustering Quality

5 The quality of the result from BIRCH was shown to be independent of the input order [26]. Since BUBBLE and BUBBLE-FM are instantiations of the BIRCH* framework, which is abstracted out from BIRCH, we do not present more results on order-independence here.
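For concreteness, the two quality metrics above can be computed as in the following sketch (helper names are ours; points are plain tuples, and clusters are lists of points):

```python
import math

def centroid(points):
    # component-wise mean of a list of equal-length point tuples
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distortion(clusters):
    """Sum over all clusters of squared point-to-centroid distances."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(euclidean(p, c) ** 2 for p in pts)
    return total

def clustroid_quality(actual_clusters, clustroids):
    """Average distance from each actual cluster's centroid to the
    nearest discovered clustroid (the CQ metric)."""
    total = 0.0
    for pts in actual_clusters:
        c = centroid(pts)
        total += min(euclidean(c, ch) for ch in clustroids)
    return total / len(actual_clusters)
```

Note that CQ takes, for each actual centroid, the closest clustroid; this is why CQ is bounded below by the average centroid-to-nearest-point distance quoted in the text.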
[Figures 1-3 (scatter plots of cluster representatives on DS2): Figure 1. DS2: BUBBLE; Figure 2. DS2: BUBBLE-FM (both with SS=75, B=15, 2p=10, #Clusters=50); Figure 3. DS2: BIRCH (SS=75, B=15, 2p=10, #Points=200K).]

[Figures 4-6 (performance plots for BUBBLE and BUBBLE-FM): Figure 4. Time vs #points; Figure 5. NCD vs #points; Figure 6. Time vs #clusters.]
6.4. Scalability

To study scalability characteristics with respect to the number of points in the dataset, we fixed the number of clusters at 50 and varied the number of data points from 50000 to 500000 (i.e., DS20d.50c.*).

Figures 4 and 5 plot the time and NCD values for BUBBLE and BUBBLE-FM as the number of points is increased. We make the following observations. (i) Both algorithms scale linearly with the number of points, which is as expected. (ii) BUBBLE consistently outperforms BUBBLE-FM. This is due to the overhead of FastMap in BUBBLE-FM. (The distance function in the fastmapped space as well as the original space is the Euclidean distance function.) However, the constant difference between their running times suggests that the overhead due to the use of FastMap remains constant even though the number of points increases. The difference is constant because the overhead due to FastMap is incurred only when the nodes in the CF*-tree split. Once the distribution of clusters is captured, the nodes do not split that often any more. (iii) As expected, BUBBLE-FM has smaller NCD values. Since the overhead due to the use of FastMap remains constant, as the number of points is increased the difference between the NCD values increases.

To study scalability with respect to the number of clusters, we varied the number of clusters between 50 and 250 while keeping the number of points constant at 200000. The results are shown in Figure 6. The plot of time versus number of clusters is almost linear.7

7. Data Cleaning Application

When different bibliographic databases are integrated, different conventions for recording bibliographic items such as author names and affiliations cause problems. Users familiar with one set of conventions will expect their usual forms to retrieve relevant information from the entire collection when searching. Therefore, a necessary part of the integration is the creation of a joint authority file [2, 15] in which classes of equivalent strings are maintained. These equivalence classes can be assigned a canonical form. The process of reconciling variant string forms ultimately requires domain knowledge and inevitably a human in the loop, but it can be significantly sped up by first achieving a rough clustering using a metric such as the edit distance. Grouping closely related entries into initial clusters that act as representative strings has two benefits: (1) Early aggregation acts as a "sorting" step that lets us use more aggressive strategies in later stages with less risk of erroneously separating closely related strings. (2) If an error is made in the placement of a representative, only that representative need be moved to a new location. Also, even the small reduction in the data size is valuable, given the cost of the subsequent detailed analysis involving a domain expert.6

Applying edit distance techniques to obtain such a "first

6 Examples and more details are given in the full paper.
7 NCD versus number of clusters is in the full paper [16].
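As a minimal, hedged illustration of this first-pass grouping (a simple leader-style stand-in of our own devising, not the BUBBLE pipeline itself), variant author-name strings can be bucketed by edit distance as follows. `edit_distance` is the classic O(mn) dynamic program mentioned in the introduction; `rough_cluster` and its threshold are hypothetical:

```python
def edit_distance(s, t):
    """Levenshtein distance via the classic O(m*n) dynamic program,
    keeping only two rows of the DP table."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitution
        prev = cur
    return prev[n]

def rough_cluster(strings, threshold):
    """First-pass grouping: each string joins the first representative
    within `threshold` edits, otherwise it starts a new group."""
    reps, groups = [], []
    for s in strings:
        for r, g in zip(reps, groups):
            if edit_distance(s, r) <= threshold:
                g.append(s)
                break
        else:
            reps.append(s)
            groups.append([s])
    return groups
```

The representatives play the role of the "representative strings" above: a placement error moves only one representative, and later, more aggressive reconciliation stages operate per group rather than over the whole collection.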