Studies in Computational Intelligence 807

Sarah Vluymans

Dealing with Imbalanced and Weakly Labelled Data in Machine Learning using Fuzzy and Rough Set Methods

Studies in Computational Intelligence, Volume 807

Series editor: Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]

The series "Studies in Computational Intelligence" (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution, which enable both wide and rapid dissemination of research output. The books of this series are submitted for indexing to Web of Science, EI-Compendex, DBLP, SCOPUS, Google Scholar and SpringerLink.

More information about this series at http://www.springer.com/series/7092

Sarah Vluymans
Department of Applied Mathematics, Computer Science and Statistics
Ghent University
Gent, Belgium

ISSN 1860-949X    ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-04662-0    ISBN 978-3-030-04663-7 (eBook)
https://doi.org/10.1007/978-3-030-04663-7

Library of Congress Control Number: 2018961733

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

To my family

Foreword

The number of application domains in which large quantities of data need to be processed has increased manifold over the past decades. Typical examples include text mining, bioinformatics, image processing, knowledge management, etc.
This evolution has triggered a growing need for intelligent techniques in machine learning in general and in classification in particular. In this book, hybrid models called fuzzy rough sets, designed to handle uncertainty in data by combining both gradual properties (fuzziness) and indiscernibility (roughness), are invoked for this purpose.

Rough set theory, proposed by Zdzisław Pawlak in 1982, has proven its merits as an elegant and powerful framework for data analysis. Based on the approximation of decision classes, it makes it possible to infer data dependencies that are useful for feature selection and decision model construction. A core notion in rough set theory is that of discernibility: the ability to distinguish between objects based on their attribute values. The observation that in many cases objects resemble each other to a certain extent motivated the hybridization of rough sets with fuzzy sets, introduced by Lotfi Zadeh in 1965. Indeed, using fuzzy relations that model a gradual notion of discernibility between objects makes it possible to apply the rough set methodology to real-valued data without the need for discretization.

Fuzzy rough sets were shown to perform remarkably well in lazy learning approaches, either serving as a preprocessing tool for nearest neighbour classifiers or effectively replacing the latter by assessing test objects' membership to the lower and/or upper approximation of decision classes. They also demonstrated their use for imbalanced classification, i.e. classification involving data where one or several classes are under-represented.

At the same time, different noise-tolerant fuzzy rough set models were proposed to make classification more robust. In particular, the ordered weighted average (OWA) approach was identified as a promising one from both a theoretical and a practical angle. In order to determine whether an instance possibly or certainly belongs to a concept, this approach carries out a weighted evaluation of the instance's neighbours in the training data. A key asset of the model is its association of weights with ordered positions of elements rather than with the elements themselves, which makes it possible to express, for example, that an object certainly belongs to a concept if most of its neighbours do, where 'most' is modelled by an appropriate OWA weight vector (a short illustrative sketch of this weighting idea is given below).

In this book, Sarah Vluymans effectively takes the research on OWA fuzzy rough sets to the next level, at once providing clear and effective guidelines on how to use them in practice and expanding their application radius to a wide range of challenging classification problems, including imbalanced, semi-supervised, multi-instance and multi-label classification.

In Chap. 3, she first focuses on the general OWA model and demonstrates that a thoughtful selection of weight vectors, keeping a number of simple dataset characteristics in mind, can significantly improve classification. Moreover, she shows that the provided guidelines also carry over to other applications of OWA fuzzy rough sets like prototype selection, making them a sound base for machine learning practitioners to work with.

Chapter 4 addresses the difficult problem of multi-class imbalanced classification. By a skilful combination of the binary (two-class) fuzzy rough classifier IFROWANN and the one-versus-one (OVO) decomposition scheme, again involving the adaptive selection of OWA weight vectors, the author manages to outperform state-of-the-art approaches in terms of both balanced accuracy and mean AUC.

Chapter 5 evaluates the OWA model in a semi-supervised learning (SSL) context, where labels are known for only a minority of training samples and unlabelled instances are used for improving generalization.
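The weighting idea behind the OWA fuzzy rough model can be illustrated with a short sketch. The following is a minimal, hypothetical example rather than the IFROWANN classifier or the exact configuration studied in this book: it assumes Łukasiewicz connectives and a linearly decreasing ('additive') weight vector, both common choices in the fuzzy rough set literature, and all function names are purely illustrative.

```python
import numpy as np

def owa(values, weights):
    """OWA aggregation: the weights attach to ordered positions
    (largest value first), not to the values themselves."""
    v = np.sort(values)[::-1]          # sort the arguments in descending order
    return float(np.dot(v, weights))

def additive_weights(n, reverse=False):
    """Linearly decreasing weight vector that sums to one; reversing it
    puts the emphasis on the smallest values instead of the largest."""
    w = np.array([2.0 * (n - i) / (n * (n + 1)) for i in range(n)])
    return w[::-1] if reverse else w

def owa_lower(similarities, memberships):
    """Soft 'certainly belongs': aggregate the implications
    I(R(x,y), A(y)) over all training objects y, giving the most weight
    to the smallest (most pessimistic) values."""
    impl = np.minimum(1.0, 1.0 - similarities + memberships)   # Lukasiewicz implicator
    return owa(impl, additive_weights(len(impl), reverse=True))

def owa_upper(similarities, memberships):
    """Soft 'possibly belongs': aggregate the conjunctions
    T(R(x,y), A(y)), giving the most weight to the largest values."""
    conj = np.maximum(0.0, similarities + memberships - 1.0)    # Lukasiewicz t-norm
    return owa(conj, additive_weights(len(conj)))

# Example: membership of a test instance to a concept, given its fuzzy
# similarity to five training objects and their crisp concept memberships.
sims = np.array([0.9, 0.8, 0.6, 0.3, 0.1])
labs = np.array([1.0, 1.0, 0.0, 1.0, 0.0])
print(owa_lower(sims, labs), owa_upper(sims, labs))
```

Because the weights are tied to positions in the ordered list of values rather than to particular training objects, no single noisy neighbour can determine the lower or upper approximation membership on its own, which is precisely the robustness argument made above.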
As a remarkable conclusion, the author reveals that the popular self-labelling technique does not improve the traditional OWA model, which employs only labelled data, and that the latter even outperforms existing SSL approaches based on self-labelling.

In multi-instance learning (MIL), a data sample is described by a bag of feature vectors called instances; the class labels of the instances are not known, only those of the bags, and the goal is to predict the label of new bags. In Chap. 6, fuzzy and fuzzy rough classifiers are assembled to handle MIL data; the resulting methods are competitive with and, in the case of imbalanced data, even superior to state-of-the-art approaches.

Finally, Chap. 7 deals with multi-label learning (MLL), a setting orthogonal to MIL where more than one label can be associated with a single data sample. Using a customized label set similarity relation within an OWA fuzzy rough nearest-neighbour based consensus approach, the author once more manages to come up with an efficient and competitive proposal.

Summarizing, through the many contributions of this book, Sarah Vluymans has managed not only to considerably widen the application scope of OWA fuzzy rough sets but also to make a convincing case for their practical use and appeal. Special praise is warranted for the meticulous and comprehensive experimental evaluation that accompanies each chapter and serves as a shining light of best practice in machine learning.

We hope that her work will inspire other researchers to continue efforts along these lines and to further foster the paradigm of OWA fuzzy rough sets as a useful tool for machine learning.

Gent, Belgium
September 2018

Chris Cornelis
Yvan Saeys
Ghent University

Preface

This book is based on my Ph.D. dissertation completed at Ghent University (Belgium) and the University of Granada (Spain) in June 2018. It focuses on classification. The goal is to predict the class label of elements (that is, assign them to a category) based on a previously provided dataset of known observations. Traditionally, a number of features are measured for all observations, such that they can be described by a feature vector (collecting the values for all features) and an associated outcome, if the latter is known. In the classic iris dataset, for example, each observation corresponds to an iris plant and is described by its values for four features representing biological properties of the flower. The associated class label is the specific family of irises the sample belongs to, and the prediction task is to categorize a plant to the correct family based on its feature values. A classification algorithm does so based on its training set of labelled instances, that is, a provided set of iris flowers for which both the feature values and class labels are known. One of the most intuitive classifiers is the nearest neighbour algorithm. To classify a new element, this method locates the most similar training instance (the nearest neighbour) and assigns the target to the class to which this neighbour belongs (a minimal sketch of this rule is given below). Other methods build an explicit classification model from the training set, for example, in the format of a decision tree.

Few real-world datasets are perfect; that is, it is usually not possible to make entirely accurate class predictions based on the training set. One natural issue is the uncertainty present in any dataset.
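Before turning to these imperfections, the nearest neighbour rule described above can be written down in a few lines. This is a minimal sketch assuming Euclidean distance on the measured features; the data values and labels are merely illustrative of iris-style data, not taken from the book.

```python
import numpy as np
from collections import Counter

def nearest_neighbour_predict(X_train, y_train, x_new, k=1):
    """Assign x_new to the (majority) class of its k most similar
    training instances, measured by Euclidean distance."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    neighbour_idx = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in neighbour_idx)
    return votes.most_common(1)[0][0]

# Illustrative iris-style training set: four measured features per flower,
# each labelled with the family the plant belongs to.
X_train = np.array([[5.1, 3.5, 1.4, 0.2],
                    [7.0, 3.2, 4.7, 1.4],
                    [6.3, 3.3, 6.0, 2.5]])
y_train = np.array(["setosa", "versicolor", "virginica"])

# The new flower most resembles the first training instance, so it
# receives that instance's label.
print(nearest_neighbour_predict(X_train, y_train, np.array([5.0, 3.4, 1.5, 0.2])))
```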
Mathematics provides us with frameworks to model such data imperfections from different viewpoints. Fuzzy set theory extends traditional set theory by allowing a partial membership of elements to a set, in the form of a real-valued membership degree between zero and one. In this way, vague and subjective concepts as well as graded relationships between observations can be represented. Data incompleteness or indiscernibility is another common obstacle and refers to the situation where the measured features are insufficient to provide a precise definition for, or sharply delineate, a concept. Rough set theory resolves this by approximating the concept with a lower (conservative) and an upper (liberal) approximation. Fuzzy and rough set theory have been combined into fuzzy rough set theory. A graded similarity between observations is incorporated and the
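For reference, the lower and upper approximations mentioned above can be written out explicitly. The first pair of formulas is Pawlak's classical definition for a crisp concept; the second is the implicator/t-norm formulation of fuzzy rough sets commonly used in the literature, in which the OWA-based models described in the Foreword soften the infimum and supremum by OWA aggregations.

```latex
% Classical rough approximations of a crisp concept A, where [x]_R denotes
% the equivalence class of x under the indiscernibility relation R:
\[
  \underline{R}A = \{\, x \mid [x]_R \subseteq A \,\}, \qquad
  \overline{R}A  = \{\, x \mid [x]_R \cap A \neq \emptyset \,\}.
\]

% Fuzzy rough approximations of a fuzzy set A in a universe U, based on a
% fuzzy similarity relation R, an implicator I and a t-norm T:
\[
  (R\!\downarrow\! A)(x) = \inf_{y \in U} \mathcal{I}\bigl(R(x,y),\, A(y)\bigr),
  \qquad
  (R\!\uparrow\! A)(x) = \sup_{y \in U} \mathcal{T}\bigl(R(x,y),\, A(y)\bigr).
\]
```

The lower approximation expresses how certainly x belongs to the concept (all observations similar to x should belong to it), while the upper approximation expresses how possibly it does (at least one similar observation belongs to it).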