A Complex Networks Approach for Data Clustering

Francisco A. Rodrigues∗
Departamento de Matemática Aplicada e Estatística, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo - Campus de São Carlos, Caixa Postal 668, 13560-970 São Carlos, SP

Guilherme Ferraz de Arruda and Luciano da Fontoura Costa
Instituto de Física de São Carlos, Universidade de São Paulo, Av. Trabalhador São Carlense 400, Caixa Postal 369, CEP 13560-970, São Carlos, São Paulo, Brazil

Many methods have been developed for data clustering, such as k-means, expectation maximization and algorithms based on graph theory. In this latter case, graphs are generally constructed by taking into account the Euclidean distance as a similarity measure, and partitioned using spectral methods. However, these methods are not accurate when the clusters are not well separated. In addition, it is not possible to automatically determine the number of clusters. These limitations can be overcome by taking into account network community identification algorithms. In this work, we propose a methodology for data clustering based on complex networks theory. We compare different metrics for quantifying the similarity between objects and take into account three community finding techniques. This approach is applied to two real-world databases and to two sets of artificially generated data. By comparing our method with traditional clustering approaches, we verify that the proximity measures given by the Chebyshev and Manhattan distances are the most suitable metrics to quantify the similarity between objects. In addition, the community identification method based on greedy optimization provides the smallest misclassification rates.

PACS numbers: 89.75.Hc, 89.75.-k, 89.75.Kd

INTRODUCTION

Classification is one of the most intrinsic activities of human beings, used to facilitate the handling and organization of the huge amount of information that we receive every day. As a matter of fact, the brain is able to recognize objects in scenes and also to provide a categorization of objects, persons, or events. This classification is performed in order to cluster objects that are similar with respect to common attributes. Indeed, humans have by now classified almost all known living species and materials on Earth. Due to the importance of the classification task, it is fundamental to develop methods able to perform it automatically. Many methods for categorization have accordingly been developed, with applications to the life sciences (biology, zoology), medical sciences (psychiatry, pathology), social sciences (sociology, archaeology), earth sciences (geography, geology), and engineering [1, 2].

The process of classification can be performed in two different ways: supervised classification, where previously known classes of objects are provided as prototypes for classifying additional objects; and unsupervised classification, where no previous knowledge about the classes is provided. In the latter case, the categorization is performed so as to maximize the similarity between the objects in each class while minimizing the similarity between objects in different classes. In the current work, we introduce a method for unsupervised classification based on complex networks.

Unsupervised classification may be found under different names in different contexts, such as clustering (in pattern recognition), numerical taxonomy (in ecology) and partition (in graph theory). In the current work, we adopt the term "clustering". Clustering can be used in many tasks, such as data reduction, performed by grouping data into clusters and processing each cluster as a single entity; hypothesis generation, when there is no information about the analyzed data; hypothesis testing, i.e. verification of the validity of a particular hypothesis; and prediction based on classes, where the obtained clusters are based on the characteristics of the respective patterns. As a matter of fact, clustering is a fundamental tool for many research fields, such as machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics [2, 3].
Many methods have been developed for data clustering [4], many of which are based on graph theory [5]. Graph-based clustering methods employ algorithms related to minimum spanning trees [6], regions of influence (e.g. [7]), directed trees [8] and spectral analysis [8]. These methods are able to detect clusters of various shapes, at least when the clusters are well separated. However, these algorithms present some drawbacks. Spectral clustering, for instance, only divides the graph into two groups, not into an arbitrary number of clusters. Division into more than two groups can be achieved by repeated bisection, but there is no guarantee of reaching the best division into three groups [4]. Also, these methods give no hint about how many clusters should be identified. On the other hand, methods for community identification in networks are able to handle these drawbacks [9]. Moreover, such methods provide more accurate partitions than traditional graph-based methods such as spectral partitioning [9]. In fact, methods based on complex networks can be regarded as improvements of graph-based clustering approaches.

Only recently has a method been developed for data clustering based on complex networks concepts [10]. In that work, the authors proposed a clustering method based on graph partitioning and the Chameleon algorithm [11]. Although this method is able to detect clusters of different shapes, it presents some drawbacks. The authors considered a very particular community identification method, which does not provide the most accurate network division [12]. In addition, they considered only a single metric, the Euclidean distance, to establish the connections between every pair of objects. The method introduced in the current work overcomes these limitations. We adopt the most accurate community identification methods and use the most traditional metrics to define the similarity between objects, including the Euclidean, Manhattan, Chebyshev, Fu and Tanimoto distances [4]. The accuracy of our methodology is evaluated on artificial data as well as on two real-world databases. Moreover, we compare our methodology with some traditional clustering algorithms, i.e. k-means, cobweb, expectation maximization and farthest first, and verify that our approach provides the smallest error rates. We therefore conclude that complex networks theory provides tools and concepts able to improve graph-based clustering methods, potentially outperforming the most traditional clustering methods.

CONCEPTS AND METHODS

Complex networks

Complex networks are graphs with non-trivial topological features, whose connections are typically distributed as a power law [13]. An undirected network can be represented by its adjacency matrix A, whose elements a_{ij} are equal to one whenever there is a connection between the vertices i and j, and equal to zero otherwise. A more general representation takes weighted connections into account, where each edge (i, j) has an associated weight or strength ω(i, j).
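As an illustration of these two representations, the sketch below builds the adjacency matrix A and a weighted counterpart W for a small, hypothetical four-vertex network (the example network and its weights are ours, not taken from the paper). All code sketches in this document use Python with NumPy.

```python
import numpy as np

# Adjacency matrix of a hypothetical undirected 4-vertex network:
# a_ij = 1 iff vertices i and j are connected.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Weighted counterpart: each edge (i, j) carries a strength w(i, j),
# here chosen arbitrarily for illustration.
W = np.array([[0.0, 0.9, 0.7, 0.0],
              [0.9, 0.0, 0.8, 0.0],
              [0.7, 0.8, 0.0, 0.1],
              [0.0, 0.0, 0.1, 0.0]])

# Undirected networks have symmetric matrices.
assert np.array_equal(A, A.T) and np.allclose(W, W.T)
```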
Different measures have been developed to characterize the topology of network structures, such as the clustering coefficient, distance-related measurements and centrality metrics [14]. By allowing different network properties to be quantified, these measures have revealed that most real-world networks are far from purely random [15].

In addition to this highly intricate topological organization, complex networks also tend to present modular structure. In this case, the modules are clusters whose vertices play similar roles, as in the brain of mammals, where cortical modules are associated with brain functions [16]. Communities follow the same principle as clusters in pattern recognition research. In this way, the algorithms developed for community identification can also be used to partition graphs and find clusters.

Different methods have been developed to find communities in networks. Basically, these methods can be grouped into spectral methods (e.g. [17]), divisive methods (e.g. [18]), agglomerative methods (e.g. [19]), and local methods (e.g. [20]). The choice of the best method depends on the specific application, including the network size and the number of connections. This is due to the fact that the most precise methods, such as the extremal optimization algorithm, are quite time expensive. Here, we take three different methods that provide accurate results but have different time complexities. These methods are described in the next section.

The quality of a particular network division can be evaluated in terms of the modularity measure. This metric allows the number of communities to be determined automatically according to the best network partition. For a network partitioned into c communities, a c × c matrix E is constructed whose elements e_{ij} represent the fraction of connections between communities i and j. The modularity Q is calculated as

    Q = \sum_i [ e_{ii} - ( \sum_j e_{ij} )^2 ] = Tr E - ||E^2||,    (1)

where ||X|| denotes the sum of the elements of the matrix X. The highest value of modularity is obtained for the best network division. In particular, networks that present high values of Q have modular structure, implying that clusters are identified with high accuracy [9, 15].
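The sketch below computes Q directly from Equation (1) for a weighted undirected network given as a symmetric matrix W together with a community label per vertex; building E from edge-weight fractions is our reading of the definition above, not code from the paper.

```python
import numpy as np

def modularity(W, labels):
    """Modularity Q of Eq. (1) for a symmetric weight matrix W and an
    integer community label per vertex."""
    labels = np.asarray(labels)
    comms = np.unique(labels)
    E = np.zeros((len(comms), len(comms)))
    total = W.sum()  # counts every edge twice; the fractions still normalize
    for a, ca in enumerate(comms):
        for b, cb in enumerate(comms):
            # e_ab: fraction of connection weight between communities a and b
            E[a, b] = W[np.ix_(labels == ca, labels == cb)].sum() / total
    # Q = Tr E - ||E^2||, with ||X|| the sum of the elements of X
    return np.trace(E) - (E @ E).sum()

# Sanity check: two disconnected triangles split into their natural
# communities give Q = 0.5.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
print(modularity(W, [0, 0, 0, 1, 1, 1]))  # -> 0.5
```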
Clustering based on networks

In the literature there are many definitions of clusters [4], such as that provided by Everitt et al. [3], where clusters are understood as continuous regions of the feature space containing a high density of points, separated from other high-density regions by low-density regions. This definition is similar to that of network communities: a community is topologically defined as a subset of highly inter-connected vertices which are relatively sparsely connected to nodes in other communities [12].

Let each object (also denominated pattern) be represented by a feature vector x = [x_1, x_2, ..., x_n]. These features, x_i, are scalar numbers that quantify the properties of the objects. For instance, in the case of the Iris database, the objects are flowers and the attributes are the length and the width of the sepal and the petal, in centimeters [21]. The clustering approach consists of grouping the feature vectors into m clusters, C_1, C_2, ..., C_m, in such a way that objects belonging to the same cluster exhibit higher similarity with each other than with objects in other groups.

The process of clustering based on networks involves the definition of the following concepts:

1. Proximity measure: each object is represented as a node, and each pair of nodes is connected according to their similarity. These connections are weighted so as to quantify how similar each pair of vertices is in terms of their feature vectors. In this way, the most similar objects are connected by the strongest edges.

2. Clustering criterion: modularity is the most traditional measure used to quantify the quality of a network division [12] (see Equation 1). Here, we adopt this metric to automatically choose the best cluster partition. In problems in which the number of clusters is known, it is not necessary to consider the modularity.

3. Clustering algorithms: complex networks theory provides many algorithms for community identification, which act as the clustering algorithms [12]. The choice of the most suitable method for a particular application should take into account the error rate and the execution time.

4. Validation of the results: the validation of clustering methods based on networks can be performed in two different ways: (i) by considering databases in which the clusters are known (or at least expected), such as the Iris database [21], and (ii) by taking into account artificial data with cluster organization, which allows the level of the data's modular organization to be controlled.

Proximity measures can be classified into two types: similarity measures, for which s(x, y) = s_0 only if x = y and −∞ < s(x, y) ≤ s_0 < +∞; and dissimilarity measures, for which d(x, y) = d_0 only if x = y and −∞ < d_0 ≤ d(x, y) < +∞. To construct networks, it is more natural to adopt similarity measures, since the edges with the strongest weights are expected to occur between the vertices with the most similar feature vectors. In this way, we adopt the following similarity measures to develop the network-based clustering approach [4] (a code sketch of two of these measures follows the list):

1. Inverse of the Euclidean distance:

    D_E^{-1}(x, y) = 1 / d_2(x, y),    (2)

where d_2(x, y) is the traditional Euclidean distance. This metric results in values in the interval [0, ∞).

2. Exponential of the Euclidean distance:

    S_E(x, y) = α exp(−α d_2(x, y)),    (3)

which results in values in the interval [0, α].

3. Inverse of the Manhattan distance:

    D_M^{-1}(x, y) = ( \sum_{i=1}^{n} |x_i − y_i| )^{-1},    (4)

which assumes values in [0, ∞).

4. Exponential of the Manhattan distance:

    S_M(x, y) = α exp(−α D_M),    (5)

assuming values in the interval [0, α].

5. Inverse of the Chebyshev distance:

    D_C^{-1}(x, y) = 1 / max_{i=1,...,n} |x_i − y_i|.    (6)

This metric results in values in [0, ∞).

6. Exponential of the Chebyshev distance:

    S_C(x, y) = α exp(−α D_C),    (7)

assuming values in the interval [0, α].

7. Metric proposed by Fu:

    F(x, y) = 1 − d_2(x, y) / (||x|| + ||y||).    (8)

This metric results in values in the interval [0, 1].

8. Exponential of the metric proposed by Fu:

    S_F(x, y) = α exp(−α (1 − F) / 2).    (9)

If F(x, y) = 1, then S_F(x, y) = α; if F(x, y) = 0, then S_F(x, y) = α exp(−α/2). Therefore, S_F assumes values in the limited interval [α exp(−α/2), α].

9. Exponential of the Tanimoto measure:

    S_T = α exp(−α (1 − T) / 2),    (10)

where

    T(x, y) = x^T y / (||x||^2 + ||y||^2 − x^T y).    (11)

The Tanimoto measure assumes values in (−∞, 1]. Therefore, if T(x, y) = 1 then S_T(x, y) = α, and if T(x, y) → −∞ then S_T(x, y) → 0, so that S_T assumes values in (0, α].
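As a minimal sketch of how these measures translate into code, the snippet below implements two of them, the inverse of the Chebyshev distance (Eq. 6) and the exponential of the Manhattan distance (Eq. 5), together with a helper that assembles the fully connected weighted similarity network; the helper function and its names are our own convenience, not part of the paper.

```python
import numpy as np

def inv_chebyshev(x, y):
    # Eq. (6); note it diverges for identical feature vectors,
    # so duplicate patterns need special handling.
    return 1.0 / np.max(np.abs(x - y))

def exp_manhattan(x, y, alpha=1.0):
    # Eq. (5): values in (0, alpha]
    return alpha * np.exp(-alpha * np.sum(np.abs(x - y)))

def similarity_network(X, sim):
    """Weight matrix of the fully connected similarity network:
    one vertex per pattern (row of X), edge weights given by `sim`."""
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = sim(X[i], X[j])
    return W
```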
In order to divide the networks into communities and thereby obtain the clusters, we adopt three methods: modularity maximization based on the greedy algorithm [19], here called the fastgreedy algorithm; the extremal optimization approach [22]; and the walktrap method [23]. In the two former methods, communities i and j are joined according to the increase in the modularity Q of the network. Thus, starting with each vertex disconnected and considered as a community on its own, we repeatedly join communities together in pairs, choosing at each step the merging that results in the greatest increase (or smallest decrease) of the modularity Q. The best division corresponds to the partition that results in the highest value of Q. The difference between these two methods lies in the choice of the optimization algorithm. The walktrap method, in turn, is based on random walks; its community identification uses a metric that considers the probability transition matrix [23]. The execution time of the walktrap method scales as O(N² log N). While the fastgreedy method is believed to be the fastest one, running in O(N log² N), extremal optimization provides the most accurate division [24]. On the other hand, the extremal optimization method is not particularly fast, scaling as O(N² log N).
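This step can be sketched with the python-igraph library, which happens to provide implementations of both the fastgreedy [19] and walktrap [23] algorithms (extremal optimization [22] is not available there and would require a separate implementation); the library choice is ours, not the paper's. Cutting the resulting dendrogram at the level of maximum modularity is what fixes the number of clusters automatically.

```python
import igraph as ig

def network_communities(W, method="fastgreedy"):
    """Community detection on a weighted similarity network W
    (a symmetric matrix, e.g. from similarity_network above)."""
    n = len(W)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if W[i][j] > 0]
    weights = [float(W[i][j]) for i, j in edges]
    g = ig.Graph(n=n, edges=edges)
    if method == "fastgreedy":
        dendrogram = g.community_fastgreedy(weights=weights)
    elif method == "walktrap":
        dendrogram = g.community_walktrap(weights=weights)
    else:
        raise ValueError("unknown method: %s" % method)
    # With no argument, as_clustering() cuts the dendrogram at the
    # partition of maximum modularity; as_clustering(k) imposes a known k.
    return dendrogram.as_clustering()
```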
The validation of the network-based clustering method is performed with respect to artificial (i.e. computer-generated) and real-world databases. In the case of artificial data, we use two different configurations: (i) two separated clouds of points with a Gaussian distribution in a two-dimensional space, and (ii) two semi-circles with varying density of points, as presented in Figure 1. In the former case, the validation set consists of two sets of points (clusters) generated according to a Gaussian distribution with covariance matrix equal to the identity (Σ = I). The center of one set of points is moved from the origin (0, 0) to (0, 15), in steps of 0.75, while the other cluster remains fixed at the origin. In this way, the distance between the clusters is varied from d = 0 to d = 15. Figures 1(a) to (c) show three cases, with distances d = 0, 3 and 15. Observe that as d increases, the cluster identification becomes easier.

The second artificial database corresponds to a classic problem in pattern recognition [4]. It consists of two sets of points uniformly generated in two limited semi-circular areas. In this case, the density of points, i.e. the number of points per unit area, defines the cluster resolution, with higher densities producing more defined clusters. In our analysis, this density is varied from 1 to 32, in steps of 1.6. Figures 1(d) to (f) show three configurations of this artificial database, generated with the densities ρ = 1, 6.4 and 14.4.

FIG. 1: Examples of the artificial databases for the evaluation of the clustering error. The first set of points is generated according to a Gaussian distribution, where the two sets are separated by distances (a) d = 0, (b) d = 3 and (c) d = 15. In the case of the second set, a higher density of points provides more defined clusters, as shown in the examples (d) ρ = 1.0, (e) ρ = 6.4 and (f) ρ = 14.4.

With respect to real-world databases, we take into account two datasets: the Iris database [21] and the Breast Cancer Wisconsin database [25]. The Iris database is composed of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Each class consists of 50 samples, with four features measured for each sample, i.e. the length and the width of the sepal and the petal, in centimeters. The cancer database, in turn, is composed of features of digitized images of a fine needle aspirate from a breast mass, where 30 real-valued features are computed for each cell nucleus [25]. This database comprises 699 cells, of which 241 are malignant and 458 are benign.
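The two artificial databases can be reproduced along the following lines. The paper specifies the Gaussian configuration precisely, but not the sample counts or the exact geometry of the semi-circular regions, so those details below (100 points per Gaussian cluster, the radii and the offset of the second half-ring) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_clouds(d, n=100):
    """Two Gaussian clusters with identity covariance: one fixed at the
    origin, the other centered at (0, d). (n per cluster is an assumption.)"""
    a = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(n, 2))
    b = rng.normal(loc=(0.0, d), scale=1.0, size=(n, 2))
    return np.vstack([a, b]), np.repeat([0, 1], n)

def two_semicircles(rho, r_in=1.0, r_out=1.5):
    """Two interleaved semi-circular regions filled uniformly with
    density rho (points per unit area); the geometry is an assumption."""
    area = np.pi * (r_out**2 - r_in**2) / 2.0
    n = max(int(rho * area), 4)

    def half_ring(theta_lo, theta_hi, center):
        theta = rng.uniform(theta_lo, theta_hi, n)
        r = np.sqrt(rng.uniform(r_in**2, r_out**2, n))  # uniform in area
        return np.c_[r * np.cos(theta), r * np.sin(theta)] + center

    upper = half_ring(0.0, np.pi, (0.0, 0.0))
    lower = half_ring(np.pi, 2.0 * np.pi, (r_in + (r_out - r_in) / 2.0, 0.35))
    return np.vstack([upper, lower]), np.repeat([0, 1], n)
```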
RESULTS AND DISCUSSION

The accuracy of the clustering method based on networks is compared with four traditional clustering methods, namely k-means, cobweb, farthest first and expectation maximization (EM) [26]. These methods have different properties, such as the tendency of k-means to find spherical clusters [4]. Moreover, we consider three methods for community identification, namely fastgreedy, extremal optimization and walktrap [12]. However, since the fastgreedy and extremal optimization methods result in the same error rates for all considered databases, we discuss only the results of the fastgreedy method, which is faster than the extremal optimization approach.

We start our analysis with the Iris and the Breast Cancer Wisconsin databases. As a preliminary visualization, we project the patterns into a two-dimensional space by principal component analysis. Figure 2 shows the projections; clearly, there is no sharp separation between the clusters in either database.

FIG. 2: Projection of (a) the Iris and (b) the Breast Cancer Wisconsin databases by principal component analysis.

Since the attributes in the Iris data have different ranges, with values such as 0.1 for the petal width and 7.2 for the sepal length, it is necessary to apply a feature standardization procedure [4]. In this case, each attribute is transformed so as to have mean equal to zero and standard deviation equal to one. This transformation, called standardization, is performed as

    y_f = (x_f − \bar{x}_f) / σ_{x_f},    (12)

where \bar{x}_f and σ_{x_f} are the average and the standard deviation of the values of attribute f, respectively.

The results obtained with the four traditional clustering algorithms are presented in Table I. EM and k-means exhibit the smallest errors among the traditional classifiers. Note, however, that this performance is obtained when the number of clusters is known; EM yields an error of 40% when the number of clusters is unknown. This is a limitation of these methods, since in most cases the information about the number of classes is not available.

Table I also presents the results for the clustering approaches based on complex networks. Only the combinations of metric and community algorithm which result in the smallest error rates are shown. The smallest error was obtained with the inverse of the Chebyshev distance and the fastgreedy community identification algorithm, amounting to 4.7%. The second best performance is obtained with the inverse of the Euclidean distance and the fastgreedy or the walktrap algorithms, which provide an error of 6%. In addition to the smallest error rates, network-based clustering presents another important feature: it is not necessary to specify the number of clusters present in the database. Indeed, the maximum value of the modularity suggests the most accurate partition. Nevertheless, for some proximity measures the modularity is not able to determine the best partition. In such cases, knowledge of the number of clusters implies a reduction of the error rates, as for the exponential of the Tanimoto distance, where the error is reduced from 33.3% to 6%, and the exponential of the Chebyshev distance, where the error is reduced from 33.3% to 7.3%. Therefore, for the Iris data, these metrics are not appropriate for network-based clustering. We also analyze the clustering error without standardization. In that case, the error rates are larger than those obtained with the normalization for some configurations; for the best results, however, the errors are similar in both cases. Figure 3 presents the dendrogram obtained for the best separation, i.e. with the inverse of the Chebyshev distance and the fastgreedy community identification method. Observe that the best partition is obtained for the highest value of the modularity.

TABLE I: Clustering errors for the Iris database considering the cases in which the number of classes k is known (k = 3) or unknown (k = ?). EM and k-means are the only methods that need the number of clusters k to be specified.

    Method                   % error (k = ?)   % error (k = 3)
    k-means                        –               11.3
    cobweb                        33.3              –
    farthest first                 –               14.0
    EM                            40.0              9.3
    D_E^{-1} - fastgreedy          6.0              6.0
    D_E^{-1} - walktrap            6.0              6.0
    S_E - walktrap                33.3             14.7
    S_T - walktrap                33.3              6.0
    S_F - walktrap                33.3              7.3
    D_M^{-1} - fastgreedy         33.3              6.0
    D_M^{-1} - walktrap           33.3              6.0
    S_M - walktrap                33.3              6.0
    D_C^{-1} - fastgreedy          4.7              4.7
    D_C^{-1} - walktrap            9.3              9.3
    S_C - walktrap                33.3              7.3

FIG. 3: Dendrogram obtained with the inverse of the Chebyshev distance and the fastgreedy community identification algorithm for the Iris data. The cut in the dendrogram results in three classes, where the error rate is equal to 4.7%.
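The error rates in Tables I and II require matching the detected communities to the known classes. The paper does not spell out its matching rule; a common convention, assumed in the sketch below, assigns each detected cluster to the class that is most frequent among its members.

```python
import numpy as np

def clustering_error(true_labels, membership):
    """Fraction of misclassified patterns under majority-vote matching
    of detected clusters to true classes (an assumed convention)."""
    true_labels = np.asarray(true_labels)
    membership = np.asarray(membership)
    correct = 0
    for c in np.unique(membership):
        in_cluster = true_labels[membership == c]
        correct += np.bincount(in_cluster).max()  # majority class count
    return 1.0 - correct / len(true_labels)

# Example: one of six patterns lands in the wrong community -> ~16.7%
print(clustering_error([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```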
The cancer database also needs to be pre-processed by standardization. The obtained clustering errors are presented in Table II; again, only the combinations of metric and community algorithm which result in the smallest error rates are shown. In this case, the smallest clustering error, 7.2%, is obtained by the k-means method, while the complex networks-based method using the inverse of the Manhattan distance with the walktrap algorithm for community identification provides an error rate of 7.9%. Observe that when the number of clusters is known, all methods result in smaller error rates. Nevertheless, the D_M^{-1}-walktrap combination produces the same error rate of 7.9% even when k = 2, so for this method the highest value of the modularity accounts for the separation. Although our proposed method results in a higher error than the k-means methodology, it has the advantage of not requiring the number of clusters to be known. In this way, our methodology is also more suitable for determining the clusters of the Breast Cancer Wisconsin database.

TABLE II: Clustering errors for the Breast Cancer Wisconsin database considering the cases in which the number of classes k is known (k = 2) or unknown (k = ?). EM and k-means are the only methods that need the number of clusters k to be specified.

    Method                   % error (k = ?)   % error (k = 2)
    k-means                        –                7.2
    cobweb                        37.2              –
    farthest first                 –               35.3
    EM                            75.9              8.8
    D_E^{-1} - walktrap           52.9              9.8
    S_F - fastgreedy              17.6             17.6
    D_M^{-1} - walktrap            7.9              7.9
    D_C^{-1} - fastgreedy         50.8             15.3
    S_C - walktrap                15.9             15.9

In order to provide a more comprehensive evaluation of the proposed complex networks-based clustering approach, we generated two sets of artificial data in a two-dimensional space, as discussed in the previous section. These artificial data allow the cluster separability of the generated databases to be controlled. Initially, we consider two clusters of points with Gaussian distribution in a two-dimensional space separated by a distance d. Figure 4 presents the best results obtained by the complex networks-based approach for the different proximity measures. In all cases, the number of clusters is determined automatically by the maximum value of the modularity. Note that the error rate goes to zero for d ≥ 5. Figure 4(f) compares the traditional clustering method which produced the best results, i.e. the cobweb, with the best complex networks approach. In this case, the method based on the exponential of the Chebyshev distance and the fastgreedy algorithm provides the smallest error rate. Observe that the variation of the error rate is also small for this method compared with the cobweb.

FIG. 4: Error rates obtained according to the separation of two clusters composed of sets of points with Gaussian distribution in a two-dimensional space. We show the best results for each proximity measure, i.e. (a) Euclidean, (b) Tanimoto, (c) Fu, (d) Manhattan and (e) Chebyshev distances. The number of clusters is determined automatically by the maximum value of the modularity in all cases. In (f) the most accurate network-based method is compared with the best traditional clustering approach. Each point is an average over 10 simulations.

The k-means algorithm cannot be used in the comparison where the number of clusters k is unknown. Thus, we also consider the case where k is provided to all methods. Figure 5 presents the obtained results. In all cases, the error rate goes to zero for d ≥ 5. As in the case of an unknown number of clusters, the method based on the exponential of the Chebyshev distance and the fastgreedy algorithm provides the smallest error rate. Among the traditional algorithms, k-means yields the most accurate results; in fact, for d ≥ 3, both k-means and the network-based clustering method provide similar error rates. Accurate results were expected for the k-means method on this database, since the clusters are symmetric around their means and the points are equally distributed between the two clusters. Observe that the other variants of the complex networks-based method also yield small error rates. Comparing with the case in which the number of clusters is unknown (Figure 4), the error rates are similar for both approaches. Therefore, the network-based methods result in the most accurate cluster partitions, with the advantage that it is not necessary to know the number of clusters.

FIG. 5: Error rates obtained according to the separation of two clusters composed of sets of points with Gaussian distribution in a two-dimensional space for the case where the number of clusters is known. We show the best results for each proximity measure, i.e. (a) Euclidean, (b) Tanimoto, (c) Fu, (d) Manhattan and (e) Chebyshev distances. The number of clusters is set to k = 2 for all methods. In (f) the most accurate network-based method is compared with the best traditional clustering approach. Each point is an average over 10 simulations.
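Putting the preceding sketches together, a hypothetical end-to-end run on the Iris data could look as follows. We use scikit-learn only as a convenient source of the database; the library choices and the majority-vote error convention are our assumptions, so the printed figure will not necessarily reproduce the 4.7% reported above.

```python
import numpy as np
import igraph as ig
from sklearn.datasets import load_iris  # convenient copy of the Iris data

X, y = load_iris(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardization, Eq. (12)

# Fully connected network weighted by the inverse Chebyshev distance, Eq. (6)
n = len(X)
edges, weights = [], []
for i in range(n):
    for j in range(i + 1, n):
        d = np.max(np.abs(X[i] - X[j]))
        if d > 0:                               # identical patterns get no edge
            edges.append((i, j))
            weights.append(1.0 / d)

g = ig.Graph(n=n, edges=edges)
clusters = g.community_fastgreedy(weights=weights).as_clustering()

# Majority-vote matching of communities to species
membership = np.asarray(clusters.membership)
correct = sum(np.bincount(y[membership == c]).max()
              for c in np.unique(membership))
print("clusters: %d  error: %.1f%%" % (len(clusters), 100 * (1 - correct / n)))
```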
The second artificial database used to evaluate the classification error rates is given by the two semi-circles with varying density of points (see Figures 1(d)–(f)). Figure 6 presents the results obtained for an unknown number of clusters with the fastgreedy algorithm; only the best results are shown. The higher the density of points, the smaller the error rate, since the clusters become more defined. The error rate fails to tend to zero only for the inverse of the Chebyshev distance (Figure 6(c)). The most accurate clustering is obtained with the exponential of the Manhattan distance (Figure 6(b)).

FIG. 6: The smallest clustering errors obtained for the complex networks-based methods applied to the second artificial dataset (see Figures 1(d)–(f)). The error rates are determined according to the density of points. The adopted proximity measures are (a) the exponential of the Euclidean distance, (b) the exponential of the Manhattan distance, (c) the inverse of the Chebyshev distance and (d) the exponential of the Chebyshev distance. The number of clusters is obtained automatically by the maximum value of the modularity.

Figure 7 presents the errors obtained when the number of clusters is known, i.e. k = 2. Again, the complex networks-based method based on the exponential of the Manhattan distance produces the smallest error. It is interesting to note that the traditional clustering methods, i.e. k-means and cobweb, result in higher error rates than the methods based on complex networks. In addition, for these traditional methods the error does not tend to zero as the density of points increases. Figure 8 presents an example of the best clustering for the k-means and the complex networks-based methods; observe that k-means cannot identify the correct clusters.

FIG. 7: The smallest clustering errors obtained for the complex networks-based methods applied to the second artificial dataset (see Figures 1(d)–(f)). The error rates are determined according to the density of points. The adopted proximity measures are (a) the exponential of the Euclidean distance, (b) the exponential of the Manhattan distance, (c) the inverse of the Chebyshev distance and (d) the exponential of the Chebyshev distance. The number of clusters is fixed as k = 2. The k-means (e) and cobweb (f) are the traditional clustering methods that produce the smallest errors.

FIG. 8: Example of the best performance for (b) the k-means and the complex networks-based method using (c) the best modularity value and (d) fixing k = 2. The original data are shown in (a).
CONCLUSION

In this work, we study different proximity measures to represent a data set as a graph and then adopt community detection algorithms to perform the respective clustering. Our results suggest that complex networks theory has the tools to improve graph-based clustering methodologies, since this new area provides more accurate algorithms for community identification. In fact, compared with traditional clustering methods, the network-based approach finds clusters with the smallest error rates for both real-world and artificial databases. In addition, this methodology allows the number of clusters to be identified automatically through the maximum value of the modularity measurement. Among the considered proximity measures, the inverse of the Chebyshev distance and the inverse of the Manhattan distance are the most suitable metrics for the considered real-world databases. With respect to the artificial databases, the exponential of the Chebyshev and the exponential of the Manhattan distances produce the smallest error rates. Therefore, metrics based on the Chebyshev and Manhattan distances are the most suitable to quantify the similarity between objects in terms of their feature vectors. Among the community identification algorithms, the fastgreedy revealed itself to be the most suitable, due to its accuracy and its small processing time.

The analysis proposed in this work can be extended to other real-world databases as well as to other approaches for generating artificial clusters. The application to different areas, such as medicine, biology, physics and economics, constitutes another promising research possibility.

Acknowledgement

Luciano da F. Costa thanks CNPq (301303/06-1) and FAPESP (05/00587-5) for sponsorship.

∗ Electronic address: [email protected]

[1] M. Anderberg, Cluster Analysis for Applications (1973).
[2] A. Jain, M. Murty, and P. Flynn, ACM Computing Surveys 31, 264 (1999), ISSN 0360-0300.
[3] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis (Arnold, 2001).
[4] S. Theodoridis and K. Koutroumbas, Pattern Recognition (Academic Press, 2003).
[5] A. Jain, M. Murty, and P. Flynn, ACM Computing Surveys (CSUR) 31, 264 (1999), ISSN 0360-0300.
[6] C. Zahn, IEEE Transactions on Computers 100, 68 (2006), ISSN 0018-9340.
[7] R. Urquhart, Pattern Recognition 15, 173 (1982), ISSN 0031-3203.
[8] W. Koontz, P. Narendra, and K. Fukunaga, IEEE Transactions on Computers 100, 936 (2006), ISSN 0018-9340.
[9] M. Newman and M. Girvan, Physical Review E 69, 026113 (2004), ISSN 1550-2376.
[10] T. de Oliveira, L. Zhao, K. Faceli, and A. de Carvalho, in IEEE Congress on Evolutionary Computation, 2008 (CEC 2008, IEEE World Congress on Computational Intelligence) (2008), pp. 2121–2126.
[11] G. Karypis, E. Han, and V. Kumar, IEEE Computer 32, 68 (2002), ISSN 0018-9162.
[12] S. Fortunato, Physics Reports 486, 75 (2010), ISSN 0370-1573.
[13] R. Albert and A.-L. Barabási, Reviews of Modern Physics 74, 48 (2002).
[14] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas, Advances in Physics 56, 167 (2007).
[15] M. E. J. Newman, Networks: An Introduction (Oxford University Press, 2010), ISBN 0199206651.
[16] E. Bullmore and O. Sporns, Nature Reviews Neuroscience 10, 186 (2009), ISSN 1471-003X.
[17] M. Newman, Proceedings of the National Academy of Sciences 103, 8577 (2006).
[18] M. Girvan and M. Newman, Proceedings of the National Academy of Sciences of the United States of America 99, 7821 (2002).
[19] A. Clauset, M. E. J. Newman, and C. Moore, Physical Review E 70, 066111 (2004).
[20] A. Clauset, Physical Review E 72, 026132 (2005).
[21] R. A. Fisher, Annals of Eugenics 7, 12 (1936).
[22] J. Duch and A. Arenas, Physical Review E 72, 027104 (2005).
[23] P. Pons and M. Latapy, Computer and Information Sciences – ISCIS 2005, pp. 284–293 (2005).
[24] L. Danon, J. Duch, A. Arenas, and A. Díaz-Guilera, Journal of Statistical Mechanics: Theory and Experiment, P09008 (2005).
[25] W. Wolberg, W. Street, and O. Mangasarian, Cancer Letters 77, 163 (1994), ISSN 0304-3835.
[26] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 2005), ISBN 0120884070.
