Intelligent Systems Reference Library 144

Boris Kovalerchuk

Visual Knowledge Discovery and Machine Learning

Intelligent Systems Reference Library
Volume 144

Series editors
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]
Lakhmi C. Jain, University of Canberra, Canberra, Australia; Bournemouth University, UK; KES International, UK
e-mail: [email protected]; [email protected]
URL: http://www.kesinternational.org/organisation.php

The aim of this series is to publish a Reference Library, including novel advances and developments in all aspects of Intelligent Systems in an easily accessible and well structured form. The series includes reference works, handbooks, compendia, textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains well integrated knowledge and current information in the field of Intelligent Systems. The series covers the theory, applications, and design methods of Intelligent Systems. Virtually all disciplines such as engineering, computer science, avionics, business, e-commerce, environment, healthcare, physics and life science are included.

The list of topics spans all the areas of modern intelligent systems such as: Ambient intelligence, Computational intelligence, Social intelligence, Computational neuroscience, Artificial life, Virtual society, Cognitive systems, DNA and immunity-based systems, e-Learning and teaching, Human-centred computing and Machine ethics, Intelligent control, Intelligent data analysis, Knowledge-based paradigms, Knowledge management, Intelligent agents, Intelligent decision making, Intelligent network security, Interactive entertainment, Learning paradigms, Recommender systems, Robotics and Mechatronics including human-machine teaming, Self-organizing and adaptive systems, Soft computing including Neural systems, Fuzzy systems, Evolutionary computing and the Fusion of these paradigms, Perception and Vision, Web intelligence and Multimedia.
More information about this series at http://www.springer.com/series/8578

Boris Kovalerchuk

Visual Knowledge Discovery and Machine Learning

Boris Kovalerchuk
Central Washington University
Ellensburg, WA
USA

ISSN 1868-4394
ISSN 1868-4408 (electronic)
Intelligent Systems Reference Library
ISBN 978-3-319-73039-4
ISBN 978-3-319-73040-0 (eBook)
https://doi.org/10.1007/978-3-319-73040-0
Library of Congress Control Number: 2017962977

© Springer International Publishing AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my family

Preface

The emergence of Data Science has placed knowledge discovery, machine learning, and data mining in multidimensional data at the forefront of a wide range of current research and application activities in computer science and in many domains far beyond it.

Discovering patterns in multidimensional data using a combination of visual and analytical machine learning means is an attractive visual analytics opportunity. It allows the injection of unique human perceptual and cognitive abilities directly into the process of discovering multidimensional patterns. While this opportunity exists, the long-standing problem is that we cannot see n-D data with the naked eye. Our cognitive and perceptual abilities are perfected only in the 3-D physical world. We need enhanced visualization tools (“n-D glasses”) to represent n-D data in 2-D completely, without loss of information, which is important for knowledge discovery. While multiple visualization methods for n-D data have been developed and successfully used for many tasks, many of them are non-reversible and lossy. Such methods do not represent the n-D data fully and do not allow the n-D data to be restored completely from their 2-D representation. Consequently, our ability to discover n-D data patterns from such incomplete 2-D representations is limited and potentially error-prone. The number of available approaches to overcoming these limitations is itself quite limited. Parallel Coordinates and Radial/Star Coordinates are today the most powerful reversible and lossless n-D data visualization methods, although they suffer from occlusion. There is a need to extend the class of reversible and lossless n-D data visual representations for knowledge discovery in n-D data.
A new class of such representations, called General Line Coordinates (GLC), and several of their specifications are the focus of this book. This book describes GLCs and their advantages, demonstrated in analyzing data on the Challenger disaster, world hunger, semantic shift in humorous texts, image processing, medical computer-aided diagnostics, the stock market, and currency exchange rate prediction. Reversible methods for visualizing n-D data have advantages as cognitive enhancers of the human ability to discover n-D data patterns. This book reviews the state of the art in this area, outlines the challenges, and describes solutions in the framework of General Line Coordinates.

This book expands the methods of visual analytics for knowledge discovery by presenting visual and hybrid methods, which combine analytical machine learning and visual means. New approaches are explored from both theoretical and experimental viewpoints, using modeled and real data. The inspiration for a new large class of coordinates is twofold. The first is the marvelous success of Parallel Coordinates, pioneered by Alfred Inselberg. The second is the absence of a “silver bullet” visualization that is perfect for pattern discovery in all possible n-D datasets. Multiple GLCs can serve as a collective “silver bullet.” This multiplicity of GLCs increases the chances that humans will reveal the hidden n-D patterns in these visualizations.

The topic of this book is related to the prospects of both super-intelligent machines and super-intelligent humans, which can far surpass current human intelligence, significantly lifting human cognitive limitations. This book is about a technical way of reaching some aspects of super-intelligence that are beyond current human cognitive abilities.
The aim is to overcome the inability to analyze large amounts of abstract, numeric, high-dimensional data and to find complex patterns in these data with the naked eye, supported by the analytical means of machine learning. New algorithms are presented for reversible GLC visual representations of high-dimensional data and for knowledge discovery. The advantages of GLCs are shown both mathematically and on different datasets. These advantages form a basis for future studies in this super-intelligence area.

This book is organized as follows. Chapter 1 presents the goal, motivation, and approach. Chapter 2 introduces the concept of General Line Coordinates, which is illustrated with multiple examples. Chapter 3 provides rigorous mathematical definitions of the GLC concepts along with mathematical statements of their properties. A reader interested only in the applied aspects of GLC can skip this chapter. A reader interested in implementing GLC algorithms may find Chap. 3 useful. Chapter 4 describes methods for simplifying visual patterns in GLCs for better human perception. Chapter 5 presents several GLC case studies on real data, which show the GLC capabilities. Chapter 6 presents the results of experiments in which multiple participants discovered visual features in GLCs, with an analysis of human shape perception capabilities with over a hundred dimensions. Chapter 7 presents linear GLCs combined with machine learning, including hybrid, automatic, interactive, and collaborative versions of linear GLC, with data classification applications ranging from medicine to finance and image processing. Chapter 8 demonstrates a hybrid visual and analytical knowledge discovery and machine learning approach to investment strategy with GLCs. Chapter 9 presents a hybrid visual and analytical machine learning approach in text mining for discovering incongruity in humor modeling.
Chapter 10 describes the capabilities of GLC visual means to enhance the evaluation of the accuracy and errors of machine learning algorithms. Chapter 11 shows how GLC visualization benefits the exploration of the multidimensional Pareto front in multi-objective optimization tasks. Chapter 12 outlines the vision of a virtual data scientist and super-intelligence with visual means. Chapter 13 concludes this book with a comparison and fusion of methods and a discussion of future research.

The final note is on topics that are outside the scope of this book. These topics are “goal-free” visualizations that are not related to the specific knowledge discovery tasks of supervised and unsupervised learning and Pareto optimization in n-D data. The author’s Web site for this book is located at http://www.cwu.edu/*borisk/visualKD, where additional information and updates can be found.

Ellensburg, USA
Boris Kovalerchuk

Acknowledgements

First of all, thanks to my family for supporting this endeavor for years. My great appreciation goes to my collaborators: Vladimir Grishin, Antoni Wilinski, Michael Kovalerchuk, Dmytro Dovhalets, Andrew Smigaj, and Evgenii Vityaev. This book is based on a series of conference and journal papers written jointly with them. These papers are listed in the reference section in Chap. 1 under the respective names. This book would not have been possible without their effort and the effort of the graduate and undergraduate students James Smigaj, Abdul Anwar, Jacob Brown, Sadiya Syeda, Abdulrahman Gharawi, Mitchell Hanson, Matthew Stalder, Frank Senseney, Keyla Cerna, Julian Ramirez, Kyle Discher, Chris Cottle, Antonio Castaneda, Scott Thomas, and Tommy Mathan, who have been involved in writing the code and in the computational explorations.
Over 70 Computer Science students from Central Washington University (CWU) in the USA and the West Pomeranian Technical University (WPTU) in Poland participated in visual pattern discovery and the experiments described in Chap. 6. Visual pattern discovery demonstrated its universal nature when students at CWU in the USA, WPTU in Poland, and Nanjing University of Aeronautics and Astronautics in China were able to discover visual patterns in GLC visualizations of n-D data during my lectures and challenged me with interesting questions. Discussions of the work of the students involved in GLC development with my colleagues Razvan Andonie, Szilard Vajda, and Donald Davendra also helped in writing this book.

I would like to thank Andrzej Piegat and the anonymous reviewers of our journal and conference papers for their critical readings of those papers. I owe much to William Sumner and Dale Comstock for their critical readings of multiple parts of the manuscript. The remaining errors are mine, of course.

My special appreciation goes to Alfred Inselberg for his role in developing Parallel Coordinates and for his personal kindness in our communications, which inspired me to work on this topic and book. The importance of his work is in developing Parallel Coordinates as a powerful tool for reversible n-D data visualization and establishing their mathematical properties. It is a real marvel in its