Reverse Clustering: Formulation, Interpretation and Case Studies PDF

117 Pages·2021·3.727 MB·English

Preview Reverse Clustering: Formulation, Interpretation and Case Studies

Studies in Computational Intelligence 957

Jan W. Owsiński · Jarosław Stańczak · Karol Opara · Sławomir Zadrożny · Janusz Kacprzyk

Reverse Clustering: Formulation, Interpretation and Case Studies

Studies in Computational Intelligence, Volume 957

Series Editor: Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.

Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/7092

Jan W. Owsiński, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Jarosław Stańczak, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Karol Opara, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Sławomir Zadrożny, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Janusz Kacprzyk, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

ISSN 1860-949X
ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-69358-9
ISBN 978-3-030-69359-6 (eBook)
https://doi.org/10.1007/978-3-030-69359-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

We witness nowadays an explosive growth and development of methods and techniques related to data analysis, this growth being conditioned, on the one hand, by the rapidly expanding availability of data in virtually all domains of human activity and, on the other hand, by the very substantial progress in technical and scientific capabilities of dealing with the increasing volumes of data. All this amounts to a dramatic change, especially in quantitative terms.

Yet, as researchers and practitioners involved in work on the methodological side of data analysis know very well, many of the fundamental substantive problems in this domain still require solutions, or at least better solutions than those available now. This concerns, in particular, such fundamental areas as clustering, classification, rule extraction, and so on. The primary issue here is the opposition between precision or accuracy and speed or computational cost (when the problem at hand is already truly well-defined). Nor can one forget the very strong data dependence of the effectiveness and efficiency of many of the methodologies being applied nowadays, which makes the situation even more difficult.
The present book addresses this nexus of issues, aiming, in this case, apparently at the interface of clustering and classification, but, in fact, being relevant to a much broader domain, with much broader implications in terms of applicability and interpretation. Namely, it describes the paradigm of “reverse clustering”, introduced by the present authors. The paradigm concerns the situation in which we are given a certain data set, composed of entities, observations, objects…, which is usual for the data analysis situation, and, at the same time, we are given, or we consider, a certain partition of this data set. We do not assume a priori anything about the data set, nor about the partition, nor, most importantly, about the relation between the data set and the partition. Thus, the partition may be the result of a definite kind of analysis of the given data set, but may, as well, result from quite a different mechanism (e.g. a division of the set of objects according to some variable or criterion not contained in the data set at hand).

Under these circumstances, with the data set and the partition being given, we try to reconstruct the partition on the basis of the data set, using cluster analysis. We try to find the entire clustering procedure that will yield, for this given data set, a partition that is as close to the given one as possible. Thus, the result of the procedure is both the clustering procedure, defined by a number of attributes (clustering method, its parameters, variable selection, distance definition, …) and the concrete partition found.

It is obvious that the paradigm borders upon classification (for a very specific formulation/interpretation of the situation faced), but extends to a much broader domain, in which the perception of the problem itself and the meaning of solutions can vary very widely. This is, in particular, shown in the present book.
In the current stage of work, the results obtained and largely contained in this book pertain mainly to the substantive aspect of the paradigm, while the technical aspects of the respective algorithms are, as of now, left to future research.

The reverse clustering paradigm constitutes a new perspective on quite a broad spectrum of problems in data analysis, and, as the book shows, it can provide very interesting, instructive and significant results, under a wide variety of interpretational assumptions. We sincerely hope, therefore, that this book not only gives the Readers new material and fresh insight into some problems of data analysis, but may also provoke them to deeper studies in the direction here indicated.

Warsaw, Poland
Jan W. Owsiński
Jarosław Stańczak
Karol Opara
Sławomir Zadrożny
Janusz Kacprzyk

Introduction

This book is devoted to an approach or a paradigm, developed by the authors and applied to a series of cases of diverse character, mostly based on real-life data; the approach (or paradigm) belongs to the broadly understood domain of data analysis, more precisely: classification and cluster analysis. We call the approach “reverse clustering” because of its logic, which is formulated as follows:

Assume we have at our disposal a set of data, X, composed of n objects or observations, indexed i, i = 1, …, n, each of these being described by a vector of m features or variables, indexed k, the respective vector being denoted x_i = {x_i1, …, x_ik, …, x_im}. At the same time, assume we have at our disposal a partition of the set X of objects into subsets, this partition being denoted P_A. For these data, we try to obtain a partition P_B that is as close to P_A as possible, by applying clustering algorithms to the set X. Thereby, we find both the partition P_B that is as close as possible to P_A and the concrete clustering procedure, with all its parameters, which yields the partition P_B.

The above does not explicitly state the purpose of the exercise (to say nothing of the technical details), but it can easily be deduced that what is aimed at is closely related to the notion of classification. While the close relation with classification is not only obvious, but definitely true, the paradigm has a much wider spectrum of applications and meanings, as is explained in Chap. 2 of the book, following the more precise presentation of this paradigm, given in Chap. 1.

The paradigm is constituted, first, by the above statement of the problem, which then has to be expressed in pragmatic technical terms, involving

(1) the space of clustering algorithms with its granularity (what algorithms are accounted for and what parameters, defining the entire clustering procedure, are subject to the search for P_B);
(2) the measure of similarity between the partition of the set X given at the outset, i.e. P_A, and the partitions obtained from the clustering algorithms, this measure being maximised (or the measure of distance between them, being minimised); and
(3) the technique of search for P_B, given the data of the concrete problem.

This paradigm is, however, also, and perhaps even more importantly, constituted by the interpretation of the entire setting, and by the particular instances of this interpretation, which, as mentioned, are treated at length in Chap. 2. This is important insofar as it places the paradigm against the background of the data analysis domain, with special emphasis on classification and related fields. These various interpretation instances are associated primarily with the status of the partition P_A, namely its source, the degree of credibility we assign to it, as well as its actual or presumed connection with the data set X. Depending on these, and on the results obtained, the status of the obtained partition P_B, including validity and applicability, will also vary significantly. Owing to this variety of interpretations, the paradigm may find application in a broad spectrum of analytic, but also cognitive, situations.
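The three technical ingredients listed above (a space of clustering procedures, a similarity measure between partitions, and a search technique) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the book: the procedure space is a naive agglomerative clustering with a choice of linkage, number of clusters, Minkowski exponent and variable subset; the similarity measure is the Rand index; the search is an exhaustive grid. The function names and the toy data are hypothetical.

```python
from itertools import combinations, product


def minkowski(a, b, p, features):
    """Minkowski distance restricted to the selected variable indices."""
    return sum(abs(a[k] - b[k]) ** p for k in features) ** (1.0 / p)


def agglomerate(X, n_clusters, linkage, p, features):
    """Naive agglomerative clustering; returns one cluster label per object."""
    clusters = [[i] for i in range(len(X))]
    agg = min if linkage == "single" else max  # single vs complete linkage
    while len(clusters) > n_clusters:
        # Find and merge the pair of clusters with the smallest linkage distance.
        _, a, b = min(
            (agg(minkowski(X[i], X[j], p, features) for i in ca for j in cb), ia, ib)
            for (ia, ca), (ib, cb) in combinations(enumerate(clusters), 2)
        )
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(X)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(P, Q):
    """Share of object pairs on which two partitions agree (assumed measure)."""
    pairs = list(combinations(range(len(P)), 2))
    agree = sum((P[i] == P[j]) == (Q[i] == Q[j]) for i, j in pairs)
    return agree / len(pairs)


def reverse_clustering(X, P_A):
    """Grid search (assumed search technique) for the procedure yielding P_B closest to P_A."""
    m = len(X[0])
    feature_sets = [fs for r in range(1, m + 1) for fs in combinations(range(m), r)]
    grid = product(("single", "complete"), (2, 3, 4), (1, 2), feature_sets)
    best = None
    for linkage, k, p, features in grid:
        P_B = agglomerate(X, k, linkage, p, features)
        score = rand_index(P_A, P_B)
        if best is None or score > best[0]:
            params = {"linkage": linkage, "n_clusters": k, "p": p, "features": features}
            best = (score, params, P_B)
    return best


# Hypothetical toy data: P_A follows the first variable; the second is noise.
X = [(0.0, 5.0), (0.2, 0.1), (0.1, 9.0), (5.0, 0.3), (5.2, 8.7), (5.1, 4.9)]
P_A = [0, 0, 0, 1, 1, 1]

score, params, P_B = reverse_clustering(X, P_A)
print(score, params, P_B)
```

The search returns both the partition P_B and the parameters of the procedure that produced it, mirroring the dual result described in the formulation. On this toy data the given partition is recovered exactly using the first variable alone, which hints at how the selected procedure can itself be read as information about P_A.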
The subsequent chapters of the book, starting with the third one, are devoted precisely to the presentation of the cases treated, which definitely differ not only as to their subject matter (the domain from which the data come), but, largely, as to the interpretation of the actual problem and the results obtained. The implication is that the paradigm can be used in many data analytic circumstances for diverse purposes, whenever the structuring of the data set into groups is appropriate.

The paradigm of reverse clustering has been presented already in several papers by the same team of authors, e.g. in Owsiński et al. (2017a, b), Owsiński, Stańczak and Zadrożny (2018). The present book aims at a more complete presentation of the paradigm and its interpretations. The book does not go into the computational and numerical issues and details, which are, of course, of very high importance. Rather, the main purpose of the book is to present the approach and its capacities in terms of various kinds of situations, problems and interpretations of the respective results. We do indeed hope it conveys the intended message in an effective and interesting manner.

The book is structured in the following manner: first, Chap. 1 presents the scheme of the approach, characterised, in particular, as it has been used in the cases illustrated in this book, along with the notation used. Then, Chap. 2 outlines the context of reverse clustering, starting with other approaches which concern similar kinds of problems related to data analysis, including also an ample reference to the very general idea of reverse engineering, as well as explainable artificial intelligence or data analysis. Then, the context is briefly analysed in terms of more detailed specific problems, arising in connection with both the reverse clustering procedure and data analytic methods in a more general perspective (e.g. selection of variables, or definitions of distance). This chapter also contains a very important section on the potential interpretations of the reverse clustering paradigm and its results.
Chapter 3 constitutes a very short introduction to the cases studied and illustrated in the book, which are then presented in the consecutive chapters: Chap. 4 is devoted to the motorway traffic data, Chap. 5 to environmental contamination data, Chaps. 6 and 7 to two separate cases of typologies or classifications of administrative units in Poland, and, finally, Chap. 8 to some more academic exercises. The book closes with Chap. 9, summarising the work done and proposing some new vistas.

This book is intended to offer the Readers truly interesting and novel perspectives in data analysis, regarding the diverse ways of formulating and approaching problems, and understanding the results, and we shall be very satisfied if it does so at least to a perceptible degree.

Jan W. Owsiński
Jarosław Stańczak
Karol Opara
Sławomir Zadrożny
Janusz Kacprzyk
