Data Clustering: Theory, Algorithms, and Applications

489 Pages·2007·5.428 MB·English

Preview Data Clustering: Theory, Algorithms, and Applications

ASA-SIAM Series on Statistics and Applied Probability

The ASA-SIAM Series on Statistics and Applied Probability is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics. The series consists of a broad spectrum of books on topics in statistics and applied probability. The purpose of the series is to provide inexpensive, quality publications of interest to the intersecting membership of the two societies.

Editorial Board

Martin T. Wells, Cornell University, Editor-in-Chief
Lisa LaVange, University of North Carolina
H. T. Banks, North Carolina State University
David Madigan, Rutgers University
Douglas M. Hawkins, University of Minnesota
Mark van der Laan, University of California, Berkeley
Susan Holmes, Stanford University

Books in the series

Gan, G., Ma, C., and Wu, J., Data Clustering: Theory, Algorithms, and Applications
Hubert, L., Arabie, P., and Meulman, J., The Structural Representation of Proximity Matrices with MATLAB
Nelson, P. R., Wludyka, P. S., and Copeland, K. A. F., The Analysis of Means: A Graphical Method for Comparing Means, Rates, and Proportions
Burdick, R. K., Borror, C. M., and Montgomery, D. C., Design and Analysis of Gauge R&R Studies: Making Decisions with Confidence Intervals in Random and Mixed ANOVA Models
Albert, J., Bennett, J., and Cochran, J. J., eds., Anthology of Statistics in Sports
Smith, W. F., Experimental Design for Formulation
Baglivo, J. A., Mathematica Laboratories for Mathematical Statistics: Emphasizing Simulation and Computer Intensive Methods
Lee, H. K. H., Bayesian Nonparametrics via Neural Networks
O’Gorman, T. W., Applied Adaptive Statistical Methods: Tests of Significance and Confidence Intervals
Ross, T. J., Booker, J. M., and Parkinson, W. J., eds., Fuzzy Logic and Probability Applications: Bridging the Gap
Nelson, W. B., Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications
Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications
Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy
Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide
Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry, Student Edition
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry
Barlow, R., Engineering Reliability
Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement

Data Clustering
Theory, Algorithms, and Applications

Guojun Gan
York University
Toronto, Ontario, Canada

Chaoqun Ma
Hunan University
Changsha, Hunan, People’s Republic of China

Jianhong Wu
York University
Toronto, Ontario, Canada

Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania
American Statistical Association, Alexandria, Virginia

The correct bibliographic citation for this book is as follows: Gan, Guojun, Chaoqun Ma, and Jianhong Wu, Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2007.

Copyright © 2007 by the American Statistical Association and the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher.
For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are intended in an editorial context only; no infringement of trademark is intended.

Library of Congress Cataloging-in-Publication Data

Gan, Guojun, 1979-
Data clustering : theory, algorithms, and applications / Guojun Gan, Chaoqun Ma, Jianhong Wu.
p. cm. -- (ASA-SIAM series on statistics and applied probability ; 20)
Includes bibliographical references and index.
ISBN: 978-0-898716-23-8 (alk. paper)
1. Cluster analysis. 2. Cluster analysis--Data processing. I. Ma, Chaoqun, Ph.D. II. Wu, Jianhong. III. Title.
QA278.G355 2007
519.5’3--dc22
2007061713

SIAM is a registered trademark.

Contents

List of Figures
List of Tables
List of Algorithms
Preface

I Clustering, Data, and Similarity Measures

1 Data Clustering
  1.1 Definition of Data Clustering
  1.2 The Vocabulary of Clustering
    1.2.1 Records and Attributes
    1.2.2 Distances and Similarities
    1.2.3 Clusters, Centers, and Modes
    1.2.4 Hard Clustering and Fuzzy Clustering
    1.2.5 Validity Indices
  1.3 Clustering Processes
  1.4 Dealing with Missing Values
  1.5 Resources for Clustering
    1.5.1 Surveys and Reviews on Clustering
    1.5.2 Books on Clustering
    1.5.3 Journals
    1.5.4 Conference Proceedings
    1.5.5 Data Sets
  1.6 Summary

2 Data Types
  2.1 Categorical Data
  2.2 Binary Data
  2.3 Transaction Data
  2.4 Symbolic Data
  2.5 Time Series
  2.6 Summary

3 Scale Conversion
  3.1 Introduction
    3.1.1 Interval to Ordinal
    3.1.2 Interval to Nominal
    3.1.3 Ordinal to Nominal
    3.1.4 Nominal to Ordinal
    3.1.5 Ordinal to Interval
    3.1.6 Other Conversions
  3.2 Categorization of Numerical Data
    3.2.1 Direct Categorization
    3.2.2 Cluster-based Categorization
    3.2.3 Automatic Categorization
  3.3 Summary

4 Data Standardization and Transformation
  4.1 Data Standardization
  4.2 Data Transformation
    4.2.1 Principal Component Analysis
    4.2.2 SVD
    4.2.3 The Karhunen-Loève Transformation
  4.3 Summary

5 Data Visualization
  5.1 Sammon’s Mapping
  5.2 MDS
  5.3 SOM
  5.4 Class-preserving Projections
  5.5 Parallel Coordinates
  5.6 Tree Maps
  5.7 Categorical Data Visualization
  5.8 Other Visualization Techniques
  5.9 Summary

6 Similarity and Dissimilarity Measures
  6.1 Preliminaries
    6.1.1 Proximity Matrix
    6.1.2 Proximity Graph
    6.1.3 Scatter Matrix
    6.1.4 Covariance Matrix
  6.2 Measures for Numerical Data
    6.2.1 Euclidean Distance
    6.2.2 Manhattan Distance
    6.2.3 Maximum Distance
    6.2.4 Minkowski Distance
    6.2.5 Mahalanobis Distance
    6.2.6 Average Distance
    6.2.7 Other Distances
  6.3 Measures for Categorical Data
    6.3.1 The Simple Matching Distance
    6.3.2 Other Matching Coefficients
  6.4 Measures for Binary Data
  6.5 Measures for Mixed-type Data
    6.5.1 A General Similarity Coefficient
    6.5.2 A General Distance Coefficient
    6.5.3 A Generalized Minkowski Distance
  6.6 Measures for Time Series Data
    6.6.1 The Minkowski Distance
    6.6.2 Time Series Preprocessing
    6.6.3 Dynamic Time Warping
    6.6.4 Measures Based on Longest Common Subsequences
    6.6.5 Measures Based on Probabilistic Models
    6.6.6 Measures Based on Landmark Models
    6.6.7 Evaluation
  6.7 Other Measures
    6.7.1 The Cosine Similarity Measure
    6.7.2 A Link-based Similarity Measure
    6.7.3 Support
  6.8 Similarity and Dissimilarity Measures between Clusters
    6.8.1 The Mean-based Distance
    6.8.2 The Nearest Neighbor Distance
    6.8.3 The Farthest Neighbor Distance
    6.8.4 The Average Neighbor Distance
    6.8.5 Lance-Williams Formula
  6.9 Similarity and Dissimilarity between Variables
    6.9.1 Pearson’s Correlation Coefficients
    6.9.2 Measures Based on the Chi-square Statistic
    6.9.3 Measures Based on Optimal Class Prediction
    6.9.4 Group-based Distance
  6.10 Summary

II Clustering Algorithms

7 Hierarchical Clustering Techniques
  7.1 Representations of Hierarchical Clusterings
    7.1.1 n-tree
    7.1.2 Dendrogram
    7.1.3 Banner
    7.1.4 Pointer Representation
    7.1.5 Packed Representation
    7.1.6 Icicle Plot
    7.1.7 Other Representations
  7.2 Agglomerative Hierarchical Methods
    7.2.1 The Single-link Method
    7.2.2 The Complete Link Method
    7.2.3 The Group Average Method
    7.2.4 The Weighted Group Average Method
    7.2.5 The Centroid Method
    7.2.6 The Median Method
    7.2.7 Ward’s Method
    7.2.8 Other Agglomerative Methods
  7.3 Divisive Hierarchical Methods
  7.4 Several Hierarchical Algorithms
    7.4.1 SLINK
    7.4.2 Single-link Algorithms Based on Minimum Spanning Trees
    7.4.3 CLINK
    7.4.4 BIRCH
    7.4.5 CURE
    7.4.6 DIANA
    7.4.7 DISMEA
    7.4.8 Edwards and Cavalli-Sforza Method
  7.5 Summary

8 Fuzzy Clustering Algorithms
  8.1 Fuzzy Sets
  8.2 Fuzzy Relations
  8.3 Fuzzy k-means
  8.4 Fuzzy k-modes
  8.5 The c-means Method
  8.6 Summary

9 Center-based Clustering Algorithms
  9.1 The k-means Algorithm
  9.2 Variations of the k-means Algorithm
    9.2.1 The Continuous k-means Algorithm
    9.2.2 The Compare-means Algorithm
    9.2.3 The Sort-means Algorithm
    9.2.4 Acceleration of the k-means Algorithm with the kd-tree
    9.2.5 Other Acceleration Methods
  9.3 The Trimmed k-means Algorithm
  9.4 The x-means Algorithm
  9.5 The k-harmonic Means Algorithm
  9.6 The Mean Shift Algorithm
  9.7 MEC
  9.8 The k-modes Algorithm (Huang)
    9.8.1 Initial Modes Selection
  9.9 The k-modes Algorithm (Chaturvedi et al.)
  9.10 The k-probabilities Algorithm
  9.11 The k-prototypes Algorithm
  9.12 Summary

10 Search-based Clustering Algorithms
  10.1 Genetic Algorithms
  10.2 The Tabu Search Method
  10.3 Variable Neighborhood Search for Clustering
  10.4 Al-Sultan’s Method
  10.5 Tabu Search–based Categorical Clustering Algorithm
  10.6 J-means
  10.7 GKA
  10.8 The Global k-means Algorithm
  10.9 The Genetic k-modes Algorithm
    10.9.1 The Selection Operator
    10.9.2 The Mutation Operator
    10.9.3 The k-modes Operator
  10.10 The Genetic Fuzzy k-modes Algorithm
    10.10.1 String Representation
    10.10.2 Initialization Process
    10.10.3 Selection Process
    10.10.4 Crossover Process
    10.10.5 Mutation Process
    10.10.6 Termination Criterion
  10.11 SARS
  10.12 Summary

11 Graph-based Clustering Algorithms
  11.1 Chameleon
  11.2 CACTUS
  11.3 A Dynamic System–based Approach
  11.4 ROCK
  11.5 Summary

12 Grid-based Clustering Algorithms
  12.1 STING
  12.2 OptiGrid
  12.3 GRIDCLUS
  12.4 GDILC
  12.5 WaveCluster
  12.6 Summary

13 Density-based Clustering Algorithms
  13.1 DBSCAN
  13.2 BRIDGE
  13.3 DBCLASD
