ebook img

Unsupervised Classification: Similarity Measures, Classical and Metaheuristic Approaches, and Applications PDF

270 Pages·2013·4.981 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Unsupervised Classification: Similarity Measures, Classical and Metaheuristic Approaches, and Applications

Unsupervised Classification Sanghamitra Bandyopadhyay (cid:2) Sriparna Saha Unsupervised Classification Similarity Measures, Classical and Metaheuristic Approaches, and Applications SanghamitraBandyopadhyay SriparnaSaha MachineIntelligenceUnit Dept.ofComputerScience IndianStatisticalInstitute IndianInstituteofTechnology Kolkata,WestBengal,India Patna,India ISBN978-3-642-32450-5 ISBN978-3-642-32451-2(eBook) DOI10.1007/978-3-642-32451-2 SpringerHeidelbergNewYorkDordrechtLondon LibraryofCongressControlNumber:2012953901 ACMComputingClassification(1998): I.2,J.3,H.2,H.3 ©Springer-VerlagBerlinHeidelberg2013 Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpartof thematerialisconcerned,specificallytherightsoftranslation,reprinting,reuseofillustrations,recitation, broadcasting,reproductiononmicrofilmsorinanyotherphysicalway,andtransmissionorinformation storageandretrieval,electronicadaptation,computersoftware,orbysimilarordissimilarmethodology nowknownorhereafterdeveloped.Exemptedfromthislegalreservationarebriefexcerptsinconnection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’slocation,initscurrentversion,andpermissionforusemustalwaysbeobtainedfromSpringer. PermissionsforusemaybeobtainedthroughRightsLinkattheCopyrightClearanceCenter.Violations areliabletoprosecutionundertherespectiveCopyrightLaw. Theuseofgeneraldescriptivenames,registerednames,trademarks,servicemarks,etc.inthispublication doesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevant protectivelawsandregulationsandthereforefreeforgeneraluse. Whiletheadviceandinformationinthisbookarebelievedtobetrueandaccurateatthedateofpub- lication,neithertheauthorsnortheeditorsnorthepublishercanacceptanylegalresponsibilityforany errorsoromissionsthatmaybemade.Thepublishermakesnowarranty,expressorimplied,withrespect tothematerialcontainedherein. Printedonacid-freepaper SpringerispartofSpringerScience+BusinessMedia(www.springer.com) To myparentsSatyendra NathBanerjeeand Bandana Banerjee,myhusband Prof.Ujjwal Maulik,andson Utsav SanghamitraBandyopadhyay To myhusband Dr.AsifEkbal,mydaughter Spandana, and myparentsSankarPrasad Saha andSilaSaha Sriparna Saha Preface Classification is an integral part of any pattern recognition system. Depending on whether a set of labeled training samples is available or not, classification can be eithersupervisedorunsupervised.Clusteringisanimportantunsupervisedclassifi- cationtechniquewhereanumberofdatapointsaregroupedintoclusterssuchthat pointsbelongingtothesameclusteraresimilarinsomesenseandpointsbelonging todifferentclustersaredissimilarinthesamesense.Clusteranalysisisacomplex problem as a variety of similarity and/or dissimilarity measures exist in the litera- turewithoutanyuniversaldefinition.Inacrispclusteringtechnique,eachpatternis assigned to exactly one cluster, whereas in the case of fuzzy clustering, each pat- ternisgivenamembershipdegreetoeachclass.Fuzzyclusteringisinherentlymore suitableforhandlingimpreciseandnoisydatawithoverlappingclusters. Forpartitioningadataset,onehastodefineameasureofsimilarityorproximity basedonwhichclusterassignmentsaredone.Themeasureofsimilarityisusually datadependent.Itmaybenotedthat,ingeneral,oneofthefundamentalfeaturesof shapesandobjectsissymmetry,whichisconsideredtobeimportantforenhancing their recognition. Examples of symmetry abound in real life, such as the human face, human body, flowers, leaves, and jellyfish. As symmetry is so common, it maybeinterestingtoexploitthispropertywhileclusteringadataset.Basedonthis observation,inrecentyearsalargenumberofsymmetry-basedsimilaritymeasures havebeenproposed.Thisbookisfocusedondifferentissuesrelatedtoclustering, withparticularemphasisonsymmetry-basedandmetaheuristicapproaches. Theaimofaclusteringtechniqueistofindasuitablegroupingoftheinputdata set so that some criteria are optimized. Hence, the problem of clustering can be posed as an optimization problem. The objective to be optimized may represent differentcharacteristicsoftheclusters,suchascompactness,symmetricalcompact- ness, separation between clusters, connectivity within a cluster, etc. A straightfor- wardwaytoposeclusteringasanoptimizationproblemistooptimizesomecluster validityindexthatreflectsthegoodnessoftheclusteringsolutions.Allpossibleval- uesofthechosenoptimizationcriterion(validityindex)definethecompletesearch space.Traditionalpartitionalclusteringtechniques,suchasK-meansandfuzzyC- means,employagreedysearchtechniqueoverthesearchspaceinordertooptimize vii viii Preface thecompactnessoftheclusters.Althoughthesealgorithmsarecomputationallyef- ficient,theysufferfromsomedrawbacks.Theyoftengetstuckatsomelocaloptima dependingonthechoiceoftheinitialclustercenters.Theyarenotabletodetermine the appropriate number of clusters from data sets and/or are capable of detecting clustersofsomespecificshapeonly. Toovercometheproblemofgettingstuckatlocaloptima,severalmetaheuristic optimization tools, such as genetic algorithms (GAs), simulated annealing (SA), differentialevolution(DE),etc.,havebeenwidelyusedtoreachtheglobaloptimum valueofthechosenvaliditymeasure.Thesetechniquesperformmultimodalsearch incomplexlandscapesandprovidenear-optimalsolutionsfortheobjectiveorfitness functionofanoptimizationproblem.Theyhaveapplicationsinfieldsasdiverseas patternrecognition,imageprocessing,neuralnetworks,machinelearning,jobshop scheduling,andverylarge-scaleintegration(VLSIdesign),tomentionjustafew. Thetwofundamentalquestionsthatneedtobeaddressedinanytypicalclustering scenarioare:(i)howmanyclustersareactuallypresentinthedataand(ii)howreal or goodis the clusteringitself.That is,whatevertheclusteringtechnique,onehas to determine the number of clusters and also the validity of the clusters formed. Themeasureofvalidityofclustersshouldbesuchthatitwillbeabletoimposean ordering of the clusters in terms of their goodness. Several cluster validity indices have been proposed in the literature, e.g., the Davies-Bouldin (DB) index, Dunn’s index,Xie-Beni(XB)index, I-index,CS-index,etc.,tonamejustafew.Inrecent years, several symmetry-based cluster validity indices have also been developed whichareabletodetectanykindofsymmetricclusterfromdatasets.Inthisbook, wediscuss in detailseveralexistingwell-knownclustervalidityindicesas wellas somerecentlyproposedsymmetry-basedversions. ConventionalGA-basedclusteringtechniques,aswellasrelatedapproaches,use somevaliditymeasureforoptimization.Mostoftheexistingclusteringalgorithms are able to determine hyperspherical/ellipsoidal-shaped clusters depending on the distance norm used. In recent years, some symmetry-based clustering techniques have been developed which can determine the appropriate number of clusters and the appropriate partitioning from data sets having any type of clusters, irrespec- tiveoftheirgeometricalshapeandoverlappingnature,aslongastheypossessthe characteristic of symmetry. A major focus of this book is on GA-based clustering techniques,whichusesymmetryasasimilaritymeasure.Someoftheseclustering techniquesarealsoabletodetectthenumberofclustersautomatically. In many real-life situations one may need to optimize several objectives si- multaneously.Theseproblemsareknownasmultiobjectiveoptimizationproblems (MOOPs).Inthisregard,amultitudeofmetaheuristicsingle-objectiveoptimization techniques such as genetic algorithms, simulated annealing, differential evolution, and their multiobjective versions have been developed. In this book we discuss in detailsomeexistingsingle-andmultiobjectiveoptimizationtechniques.Moreover, a newly developed multiobjective simulated annealing-based technique is elabo- ratelydescribedanditseffectivenessforsolvingseveralbenchmarktestproblemsis shown. Preface ix Intherealmofclustering,simultaneousoptimizationofmultiplevalidityindices, capturing different characteristics of the data, is expected to provide improved ro- bustnesstodifferentdataproperties.Hence,itisusefultoapplysomemultiobjective optimization(MOO)techniquetosolvetheclusteringproblem.InMOO,searchis performedoveranumberofobjectivefunctions.Incontrasttosingle-objectiveopti- mization,whichyieldsasinglebestsolution,inMOOthefinalsolutionsetcontains anumberofPareto-optimalsolutions,noneofwhichcanbefurtherimprovedwith regardtoanyoneobjectivewithoutdegradinganother.Inapartofthisbooksome recentlydevelopedmultiobjectiveclusteringtechniquesarediscussedindetail. Amajorfocusofthisbookisonseveralreal-lifeapplicationsofclusteringtech- niques in domains such as remote sensing satellite images, magnetic resonance (MR) brain images, and face detection. Analysis of remote sensing satellite im- ageshassignificantutilityindifferentareassuchasclimatestudies,assessmentof forestresources,examiningmarineenvironments,etc.Animportanttaskinremote sensingapplicationsistheclassificationofpixelsintheimagesintohomogeneous regions,eachofwhichcorrespondstosomeparticularlandcovertype.Thisproblem has often been modeled as a segmentation problem, and clustering methods have beenusedtosolveit.However,sinceitisdifficulttohaveaprioriinformationabout thenumberofclustersinsatelliteimages,theclusteringalgorithmsshouldbeable to automatically determine this value. Moreover, in satellite images it is often the casethatsomeregionsoccupyonlyafewpixels,whiletheneighboringregionsare significantlylarge.Thus,automaticallydetectingregionsorclustersofsuchwidely varying sizes presents a challenge in designing segmentation algorithms. In this book,weexploretheapplicationsofsomesymmetry-basedclusteringalgorithmsto classifyremotesensingimageryinordertodemonstratetheireffectiveness. Automaticallyclassifyingbraintissuesfrommagneticresonanceimages(MRI) hasamajorroleinresearchandclinicalstudyofneurologicalpathology.Addition- ally, regional volume calculations may provide even more useful diagnostic infor- mation.AutomaticsegmentationofbrainMRimagesisacomplexproblem.Cluster- ingapproacheshavebeenwidelyusedforsegmentationofMRbrainimages.Apart of this book is dedicated to the applications of some symmetry-based clustering algorithmstoclassifyMRbrainimagesinordertodemonstratetheireffectiveness. Inrecentyearstheproblemofhumanfacerecognitionhasgainedhugepopular- ity.Therearewideapplicationsoffacerecognitionsystemsincludingsecureaccess controlandfinancialtransactions.Thefirstimportantstepoffullyautomatichuman facerecognitionishumanfacedetection.Facedetectionisthetechniquetoautomat- icallydeterminethelocationsandsizesoffacesinagiveninputimage.Inthisbook aproceduretodetecthumanfacesbasedonsymmetryisalsodescribedindetail. Thisbookaimstoprovideatreatiseonclusteringinametaheuristicframework, withextensiveapplicationstoreal-lifedatasets.Chapter1providesanintroduction topatternrecognition,machinelearning,classification,clustering,andrelatedareas, alongwithdifferentapplicationsofpatternrecognition.Chapter2describesindetail existingmetaheuristic-basedsingle-andmultiobjectiveoptimizationtechniques.An elaborate discussion on a recently developed multiobjective simulated annealing- based technique, archived multiobjective simulated annealing (AMOSA), is also x Preface provided in this chapter. The utility of this technique to solve several benchmark testproblemsisshown.Chapter3mainlydescribesthedifferenttypesofsimilarity measures developed in the literature for handling binary, categorical, ordinal, and quantitativevariables.Italsocontainsadiscussionondifferentnormalizationtech- niques.Chapter4givesabroadoverviewoftheexistingclusteringtechniquesand theirrelativeadvantagesanddisadvantages.Italsoprovidesadetaileddiscussionof severalrecentlydevelopedsingle-andmultiobjectivemetaheuristicclusteringtech- niques.Chapter5presentssymmetry-baseddistancesandageneticalgorithm-based clusteringtechniquethatusesthissymmetry-baseddistanceforassignmentofpoints todifferentclustersandforfitnesscomputation.InChap.6,somesymmetry-based cluster validity indices are described. Elaborate experimental results are also pre- sentedinthischapter.Applicationofthesesymmetry-basedclustervalidityindices tosegmentremotesensingsatelliteimagesisalsopresented.InChap.7,someauto- maticclusteringtechniquesbasedonsymmetryarepresented.Thesetechniquesare abletodeterminetheappropriatenumberofclustersandtheappropriatepartitioning fromdatasetshavingsymmetricclusters.Theeffectivenessoftheseclusteringtech- niquesisshownformanyartificialandreal-lifedatasets,includingMRbrainimage segmentation.Chapter8dealswithanextensionoftheconceptofpointsymmetryto line symmetry-based distances, and genetic algorithm-based clustering techniques usingthesedistances.Resultsarepresentedforsomeartificialandreal-lifedatasets. Atechniqueusingthelinesymmetry-baseddistanceforfacedetectionfromimages isalsodiscussedindetail.Somemultiobjectiveclusteringtechniquesbasedonsym- metry are described in detail in Chap. 9. Three different clustering techniques are discussed here. The first one assumes the number of clusters a priori. The second andthirdclusteringtechniquesareabletodetecttheappropriatenumberofclusters from data sets automatically. The third one, apart from using the concept of sym- metry, also uses the concept of connectivity. A method of measuring connectivity betweentwopointsinaclusterisdescribedforthispurpose.Aconnectivity-based clustervalidityindexisalsodiscussedinthischapter.Extensiveexperimentalresults illustratingthegreatereffectivenessofthethreemultiobjectiveclusteringtechniques overthesingle-objectiveapproachesarepresentedforseveralartificialandreal-life datasets. This book contains an in-depth discussion on clustering and its various facets. Inparticular,itconcentratesonmetaheuristicclusteringusingsymmetryasasimi- laritymeasurewithextensivereal-lifeapplicationsindatamining,satelliteremote sensing, MR brain imaging, gene expression data, and face detection. It is, in this sense,acompletebookthatwillbeequallyusefultothelaymanandbeginnerasto anadvancedresearcherinclustering,beingvaluableforseveralreasons. Itincludesdiscussionsontraditionalaswellassomerecentsymmetry-basedsim- ilaritymeasurements.Existingwell-knownclusteringtechniquesalongwithmeta- heuristic approaches are elaborately described. Moreover, some recent clustering techniquesbasedonsymmetryaredescribedindetail.Multiobjectiveclusteringis anotheremergingtopicinunsupervisedclassification.Amultiobjectivedatacluster- ingtechniqueisdescribedelaboratelyinthisbookalongwithextensiveexperimen- talresults.Achapterofthisbookiswhollydevotedtodiscussingsomeexistingand Preface xi newmultiobjectiveclusteringtechniques.Extensivereal-lifeapplicationsinremote sensing satellite imaging, MR brain imaging, and face detection are also provided in the book. Theoretical analyses of some recent symmetry-based clustering tech- niquesareincluded. The book will be useful to graduate students and researchers in computer sci- ence, electrical engineering, system science, and information technology as both text and reference book for some parts of the curriculum. Theoretical and experi- mentalresearchers willbenefitfrom the discussionsin this book.Researchers and practitioners in industry and R & D laboratories working in fields such as pattern recognition,datamining,softcomputing,remotesensing,andbiomedicalimaging willalsobenefit. Theauthorsgratefullyacknowledgetheinitiativeandsupportfortheprojectpro- vided by Mr. Ronan Nugent of Springer. Sanghamitra Bandyopadhyay acknowl- edgesthesupportprovidedbytheSwarnajayantiProjectGrant(No.DST/SJF/ET- 02/2006-07) of the Department of Science and Technology, Government of India. TheauthorsacknowledgethehelpofMr.MalayBhattacharyya,seniorresearchfel- lowofISIKolkata,fordrawingsomeofthefiguresofChap.5.SriparnaSahaalso acknowledgesthehelpofherstudentsMridula,Abhay,andUtpalforproofreading thechapters. Kolkata,India SanghamitraBandyopadhyay SriparnaSaha Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 DataTypeswithExamples. . . . . . . . . . . . . . . . . . . . . . 3 1.3 PatternRecognition:Preliminaries . . . . . . . . . . . . . . . . . 4 1.3.1 DataAcquisition . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.2 FeatureSelection . . . . . . . . . . . . . . . . . . . . . . 7 1.3.3 Classification. . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3.5 DistanceMeasuresinClustering . . . . . . . . . . . . . . 9 1.3.6 ModelSelection . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.7 ModelOrderSelection. . . . . . . . . . . . . . . . . . . . 10 1.4 RobustnesstoOutliersandMissingValues . . . . . . . . . . . . . 11 1.5 FuzzySet-TheoreticApproach:RelevanceandFeatures . . . . . . 12 1.6 ApplicationsofPatternRecognitionandLearning . . . . . . . . . 13 1.7 SummaryandScopeoftheBook . . . . . . . . . . . . . . . . . . 14 2 SomeSingle-andMultiobjectiveOptimizationTechniques . . . . . . 17 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Single-ObjectiveOptimizationTechniques . . . . . . . . . . . . . 18 2.2.1 OverviewofGeneticAlgorithms . . . . . . . . . . . . . . 19 2.2.2 SimulatedAnnealing . . . . . . . . . . . . . . . . . . . . 23 2.3 MultiobjectiveOptimization . . . . . . . . . . . . . . . . . . . . . 25 2.3.1 MultiobjectiveOptimizationProblems . . . . . . . . . . . 26 2.3.2 VariousMethodstoSolveMOOPs . . . . . . . . . . . . . 28 2.3.3 RecentMultiobjectiveEvolutionaryAlgorithms . . . . . . 29 2.3.4 MOOPsandSA . . . . . . . . . . . . . . . . . . . . . . . 33 2.4 An Archive-Based Multiobjective Simulated Annealing Technique:AMOSA . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.2 ArchivedMultiobjectiveSimulatedAnnealing(AMOSA) . 37 2.4.3 ArchiveInitialization . . . . . . . . . . . . . . . . . . . . 38 xiii

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.