ebook img

Handbook of Cluster Analysis PDF

711 Pages·2016·10.17 MB·english
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Handbook of Cluster Analysis

CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2016 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business Version Date: 20151012 International Standard Book Number-13: 978-1-4665-5189-3 (eBook - PDF) Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com Contents Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1. ClusterAnalysis:AnOverview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 ChristianHennigandMarinaMeila 2. ABriefHistoryofClusterAnalysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 FionnMurtagh SectionI OptimizationMethods 3. QuadraticErrorandk-Means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 BorisMirkin 4. K-MedoidsandOtherCriteriaforCrispClustering . . . . . . . . . . . . . . . . . . . . . 55 DouglasSteinley 5. FoundationsforCenter-BasedClustering:Worst-CaseApproximationsand ModernDevelopments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 PranjalAwasthiandMariaFlorinaBalcan SectionII Dissimilarity-Based Methods 6. HierarchicalClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 PedroContrerasandFionnMurtagh 7. SpectralClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 MarinaMeila SectionIII MethodsBasedonProbabilityModels 8. MixtureModelsforStandard p-DimensionalEuclideanData . . . . . . . . . . . . . . 145 GeoffreyJ.McLachlanandSurenI.Rathnayake 9. LatentClassModelsforCategoricalData . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 G.CeleuxandGérardGovaert 10. DirichletProcessMixturesandNonparametricBayesianApproaches toClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 VinayakRao 11. FiniteMixturesofStructuredModels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 MarcoAlfóandSaraViviani 12. Time-SeriesClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 JorgeCaiado,ElizabethAnnMaharaj,andPierpaoloD’Urso 13. ClusteringFunctionalData . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 DavidB.HitchcockandMarkC.Greenwood 14. MethodsBasedonSpatialProcesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 LisaHandl,ChristianHirsch,andVolkerSchmidt 15. SignificanceTestinginClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 HanwenHuang,YufengLiu,DavidNeilHayes,AndrewNobel,J.S.Marron,and ChristianHennig 16. Model-BasedClusteringforNetworkData . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 ThomasBrendanMurphy SectionIV MethodsBasedonDensityModesandLevelSets 17. AFormulationinModalClusteringBasedonUpperLevelSets . . . . . . . . . . . . 361 AdelchiAzzalini 18. ClusteringMethodsBasedonKernelDensityEstimators: Mean-ShiftAlgorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 MiguelÁ.Carreira-Perpiñán 19. Nature-InspiredClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419 JuliaHandlandJoshuaKnowles SectionV SpecificCluster andDataFormats 20. Semi-SupervisedClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443 AnilJain,RongJin,andRadhaChitta 21. ClusteringofSymbolicData . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469 PaulaBrito 22. ASurveyofConsensusClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497 JoydeepGhoshandAyanAcharya 23. Two-ModePartitioningandMultipartitioning . . . . . . . . . . . . . . . . . . . . . . . . . 519 MaurizioVichi 24. FuzzyClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545 PierpaoloD’Urso 25. RoughSetClustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 IvoDüntschandGüntherGediga SectionVI ClusterValidation andFurtherGeneralIssues 26. Method-IndependentIndicesforClusterValidationandEstimatingthe NumberofClusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 MariaHalkidi,MichalisVazirgiannis,andChristianHennig 27. CriteriaforComparingClusterings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619 MarinaMeila 28. ResamplingMethodsforExploringClusterStability . . . . . . . . . . . . . . . . . . . . 637 FriedrichLeisch 29. RobustnessandOutliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 L.A.García-Escudero,A.Gordaliza,C.Matrán,A.Mayo-Iscar,and ChristianHennig 30. VisualClusteringforDataAnalysisandGraphicalUserInterfaces. . . . . . . . . . 679 SébastienDéjeanandJosianeMothe 31. ClusteringStrategyandMethodSelection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703 ChristianHennig Preface This Handbook intends to give a comprehensive, structured, and unified account of the centraldevelopmentsincurrentresearchonclusteranalysis. Thebookisaimedatresearchersandpractitionersinstatistics,andallthescientistsand engineers who are involved in some way in data clustering and have a sufficient back- ground in statistics and mathematics. Recognizing the interdisciplinary nature of cluster analysis, most parts of the book were written in a way that is accessible to readers from various disciplines. How much background is required depends to some extent on the chapter.Familiaritywithmathematicalandstatisticalreasoningisveryhelpful,butanaca- demicdegreeinmathematicsorstatisticsisnotrequiredformostparts.Occasionallysome knowledgeofalgorithmsandcomputationwillhelptomakemostofthematerial. Sincewewantedthisbooktobeimmediatelyusefulinpractice,theclusteringmethods we present are usually described in enough detail to be directly implementable. In addi- tion,eachchaptercomprisesthegeneralideas,motivation,advantagesandpotentiallimits ofthemethodsdescribed,signpoststosoftware,theoryandapplications,andadiscussion ofrecentresearchissues. For those already experienced with cluster analysis, the book offers a broad and struc- tured overview. For those starting to work in this field, it offers an orientation and an introduction to the key issues. For the many researchers who are only temporarily or marginally involved with cluster analysis problems, the book chapters contain enough algorithmic and practical detail to give them a working knowledge of specific areas of clustering.Furthermore,thebookshouldhelpscientists,engineers,andotherusersofclus- teringmethodstomakeinformedchoicesofthemostsuitableclusteringapproachfortheir problem,andtomakebetteruseoftheexistingclusteranalysistools. Cluster analysis, also sometimes known as unsupervised classification, is about find- inggroupsinasetofobjectscharacterizedbycertainmeasurements.Thistaskhasavery wide range of applications such as delimitation of species in biology, data compression, classificationofdiseasesormentalillnesses,marketsegmentation,detectionofpatternsof unusualInternetuse,delimitationofcommunities,orclassificationofregionsorcountries foradministrativeuse.Unsupervisedclassificationcanbeseenasabasichumanlearning activity,connectedtoissuesasbasicasthedevelopmentofstableconceptsinlanguage. Formal cluster analysis methodology has been developed, among others, by mathe- maticians,statisticians,computerscientists,psychologists,socialscientists,econometrists, biologists, and geoscientists. Some of these branches of development existed indepen- dently for quite some time. As a consequence, cluster analysis as a research area is very heterogeneous. This makes sense, because there are also various different relevant con- cepts of what constitutes a cluster. Elements of a cluster can be connected by being very similar to each other and distant from nonmembers of the cluster, by having a particular characterizationintermsoffew(potentiallyoutofmany)variables,bybeingappropriately representedbythesamecentroidobject,byconstitutingsomekindofdistinctiveshapeor pattern,orbybeinggeneratedfromacommonhomogeneousprobabilisticprocess. Cluster analysis is currently a very popular research area and its popularity can be expected to grow more connected to the growing availability and relevance of data col- lected in all areas of life, which often come in unstructured ways and require some processing in order to become useful. Unsupervised classification is a central technique tostructuresuchdata. Research on cluster analysis faces many challenges. Cluster analysis is applied to ever new data formats; many approaches to cluster analysis are computer intensive and their application to large databases is difficult; there is little unification and standardization in thefieldofclusteranalysis,whichmakesitdifficulttocomparedifferentapproachesina systematicmanner.Eventheinvestigationofpropertiessuchasstatisticalconsistencyand stabilityoftraditionalelementaryclusteranalysistechniquesisoftensurprisinglyhard. Cluster analysis as a research area has grown so much in recent years that it is all but impossibletocovereverythingthatcouldbeconsideredrelevantinahandbooklikethis. Wehavechosentoorganizethisbookaccordingtothetraditionalcoreapproachestoclus- ter analysis, tracing them from the origins to recent developments. The book starts with an overview of approaches (Chapter 1), followed by a quick journey through the history of cluster analysis (Chapter 2). The next four sections of the book are devoted to four majorapproachestowardclusteranalysis,allofwhichgobacktothebeginningsofclus- ter analysis in the 1950s and 1960s or even further. (Probably Pearson’s paper on fitting Gaussian mixtures in 1894, see Chapter 2, was the first publication of a method covered in this Handbook, although Pearson’s use of it is not appropriately described as “cluster analysis.”) Section I is about methods that aim at optimizing an objective function that describes howwelldataisgroupedaroundcentroids.Themostpopularofthesemethodsandprob- ablythemostpopularclusteringmethodingeneralisK-means.Theefficientoptimization oftheK-meansandotherobjectivefunctionsofthiskindisstillahardproblemandatopic ofmuchrecentresearch. Section II is concerned with dissimilarity-based methods, formalizing the idea that objectswithinclustersshouldbesimilarandobjectsindifferentclustersshouldbedissim- ilar. Chapters treat the traditional hierarchical methods such as single linkage, and more recentapproachestoanalyzedissimilaritydatasuchasspectralclusteringandgraph-based approaches. SectionIIIcoversthebroadfieldofclusteringmethodsbasedonprobabilitymodelsfor clusters,thatis,mixturemodelsandpartitioningmodels.Suchmodelshavebeenanalyzed for many different kinds of data, including standard real-valued vector data, categorical, ordinalandmixeddata,regression-typedata,functionalandtime-seriesdata,spatialdata, and network data. A related issue, treated in Chapter 15, is to test for the existence of clusters. SectionIVdealswithclusteringmethodsinspiredbynonparametricdensityestimation. Insteadofsettingupspecificmodelsfortheclustersinthedata,theseapproachesidentify clusterswiththe“islands”ofhighdensityinthedata,nomatterwhatshapethesehave,or theyaimatfindingthemodesofthedatadensity,whichareinterpretedas“attractors”or representatives for the rest of the points. Most of these methods also have a probabilistic background,buttheirnatureisnonparametric;theyformalizeaclusterbycharacterizing intermsofthedensityordistributionofpointsinsteadofsettingitup. Section V collects a number of further approaches to cluster analysis, partly analyzing specific data types such as symbolic data and ensembles of clusterings, partly presenting specificproblemssuchasconstrainedandsemi-supervisedclusteringandtwo-modeand multipartitioning,fuzzyandroughsetclustering. Byandlarge,SectionsIthroughVareaboutmethodsforclustering.Buthavingaclus- teringmethodisnotallthatisneededinclusteranalysis.SectionVItreatsfurtherrelevant issues,manyofwhichcanbegroupedundertheheadline“clustervalidation,”evaluating thequalityofaclustering.Aspectsincludeindexestomeasureclustervalidity(whichare oftenalsousedforchoosingthenumberofclusters),comparingdifferentclusterings,mea- suringclusterstabilityandrobustnessofclusteringmethods,clustervisualization,andthe generalstrategyincarryingoutaclusteranalysisandthechoiceofanappropriatemethod. Given the limited length of the book, there are a number of topics that some readers mayexpectintheHandbookofClusterAnalysis,butthatarenotcovered.Weseemostofthe presentedmaterialasessential;somedecisionsweremotivatedbyindividualpreferences and the chosen focus, some by the difficulty of finding good authors for certain topics. Much of what is missing are methods for further types of data (such as text clustering), some more recent approaches that are currently used by rather limited groups of users, some of the recent progress in computational issues for large data sets including some of the clustering methods, of which the main motivation is to be able to deal with large amountsofdata,andsomehybridapproachesthatpiecetogethervariouselementaryideas from clustering and classification. We have weighted an introduction to the elementary approaches (on which there is still much research and that still confronts us with open problems)higherthanthecoverageofasmanybranchesaspossibleofcurrentspecialized cutting-edgeresearch,althoughsomeofthiswasincludedbychapterauthors,allofwhom areactiveandwell-distinguishedresearchersinthearea. Ithasbeenalongprocesstowritethisbookandweareverygratefulforthecontinuous supportbyChapman&Hall/CRCandparticularlybyRobertCalver,whogaveusalotof encouragementandpusheduswhennecessary. 1 Cluster Analysis: An Overview ChristianHennigandMarinaMeila CONTENTS Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Dimensions(Dichotomies)ofClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 ByTypeofClustering:Hardvs.Soft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.2 ByTypeofClustering:Flatvs.Hierarchical . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.3 ByDataTypeorFormat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.4 ByClusteringCriterion:(Probabilistic)ModelBasedvs.Cost-Based. . . . . . . 4 1.2.5 ByRegime:Parametric(K IsInput)vs.Nonparametric (SmoothnessParameterIsInput). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 TypesofClusterings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 DataFormats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 ClusteringApproaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5.1 Centroid-BasedClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5.2 AgglomerativeHierarchicalMethods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.5.3 SpectralClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.5.4 MixtureProbabilityModels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.5.5 Density-BasedClustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.5.6 OtherProbabilityModels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.5.7 FurtherClusteringApproaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.6 ClusterValidationandFurtherIssues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.6.1 ApproachesforClusterValidation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.6.2 NumberofClusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.6.3 VariableSelection,DimensionReduction,BigDataIssues . . . . . . . . . . . . . 15 1.6.4 GeneralClusteringStrategyandChoiceofMethod . . . . . . . . . . . . . . . . . . 16 1.6.5 ClusteringIsDifferentfromClassification . . . . . . . . . . . . . . . . . . . . . . . . . 17 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Abstract This chapter gives an overview of the basic concepts of cluster analysis, including some references to aspects not covered in this Handbook. It introduces general definitions of a clustering, for example, partitions, hierarchies, and fuzzy clusterings. It distinguishes objects × variables data from dissimilarity data and the parametric and nonparamet- ric clustering regimes. A general overview of principles for clustering data is given, comprising centroid-based clustering, hierarchical methods, spectral clustering, mixture 2 HandbookofClusterAnalysis model and other probabilistic methods, density-based clustering, and further methods. The chapter then reviews methods for cluster validation, that is, assessing the quality of a clustering, which includes the decision about the number of clusters. It then briefly discusses variable selection, dimension reduction, and the general strategy of cluster analysis. 1.1 Introduction Informally speaking, clustering means finding groups in data. Aristotle’s classification of living things was one of the first-known clusterings, a hierarchical clustering. The knowledge of biology has grown, yet the schematic organization of all known species remains in the form of a (hierarchical) clustering. Doctors defining categories of tumors by their properties, astronomers grouping galaxies by their shapes, companies observing that users of their products group according to behavior, programs that label the pixels of an image by the object they belong to, other programs that segment a video stream intoscenes,recommendersystemsthatgroupproductsintocategories,allareperforming clustering. Cluster analysis has been developed in several different fields with very diverse appli- cationsinmind.Therefore,thereisawiderangeofapproachestoclusteranalysisandan even wider range of styles of presentation and notation in the literature that may seem ratherbewildering.Inthischapter,wetrytoprovidethereaderwithasystematicifsome- whatsimplifiedviewofclusteranalysis.ReferencesaregiventotheHandbookchaptersin whichthepresentedideasaretreatedindetail. More formally, ifwearegiven asetofobjects, alsocalledadataset,D ={x , x , ... ,x } 1 2 n containingndatapoints,thetaskofclusteringistogroupthemintoK disjointsubsetsofD, denoted by C ,C ,...,C . A clustering is, in first approximation, the partition obtained, 1 2 K that is, C ={C ,C ,...,C }. If a data point x belongs to cluster C , then we say that the 1 2 K i k labelofx isk.Hence,labelsrangefrom1toK,thenumberofclusters.Obviously,notevery i suchC qualifiesasa“good”or“useful”clustering,andwhatisdemandedofa“good”C candependontheparticularapproachonetakestoclusteranalysis. As we shall soon see, this first definition can be extended in various ways. In fact, the reader should be aware that clustering, or grouping of data, can mean different things in differentcontexts,aswellasindifferentareasofdataanalysis.Thereisnouniquedefinition of what a cluster is, or what the “best” clustering of an abstract D should be. Hence, the cornerstoneofanyrigorousclusteranalysisisanappropriateandcleardefinitionofwhat a “good” clustering is in the specific context. For a practitioner, cluster analysis should start with the question: which of the existing clustering paradigms best fits my intuitive knowledgeandmyneeds? Researchinclusteringhasproducedovertheyearsavastnumberofdifferentparadigms, approaches, and methods for clustering to choose from. This book is a guide to the field, which, rather than being exhaustive, aims at being systematic, explaining and describing the main topics and tools. Limits on the size of the book combined withthe vast amount of existing work on cluster analysis (Murtagh and Kurtz (2015) count more than 404,000 documents related to cluster analysis in the literature) meant that some results or issues couldnotmaketheirwayintothespecializedHandbookchapters.Insuchcases,wegive referencestotheliterature.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.