Data Analysis and Classification: Methods and Applications

346 Pages·2021·9.18 MB·English
Studies in Classification, Data Analysis, and Knowledge Organization

Krzysztof Jajuga, Krzysztof Najman, Marek Walesiak (Editors)

Data Analysis and Classification: Methods and Applications

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This volume presents the papers from the 29th Conference of Section of Classification and Data Analysis of Polish Statistical Society held at the University of Gdansk on September 7–9, 2020. The papers presented refer to a set of studies addressing a wide range of recent methodological aspects and applications of classification and data analysis tools in micro and macroeconomic problems. In the final selection, we accepted 19 of the papers that were presented at the conference. Each of the submissions has been reviewed by two anonymous referees, and the authors have subsequently revised their original manuscripts and incorporated the comments and suggestions of the referees. The selection criteria were based on the contribution of the papers to the theory and applications of modern classification and data analysis.

The chapters have been organized along the major fields and themes in classification and data analysis: Methodology, Application in Finance, Application in Economics, Application in Social Issues, and Application with COVID-19 Data.

The part on Methodology contains five papers. The paper by Dudek focuses on the new algorithm from the spectral clustering family and its applications in the analysis of large data sets. The author conducted a comparative analysis with other approaches. Rozmus’s article focuses on the analysis of the number of clusters and stability indicators. The aim of the article is to compare classical indexes and stability measures in terms of the number of groups they indicate as correct. The paper by Majkowska, Migdał-Najman, Najman, and Raca attempts to characterize words commonly used in the messages published by Twitter users. Text mining methods and techniques were used to carry out the research, which was mainly focused on the analysis of individual words and collocations occurring in the users’ tweets. Bryś’s paper, based on research covering 1446 selected publications, provides insights on classification algorithms applied to information security tasks, their popularity, and the algorithm selection challenges. The paper by Najman and Zieliński investigates the usefulness of isolation forests in outlier detection. The results of simulations and empirical studies on selected data sets are presented. The assessment takes into account the impact of individual characteristics of big data sets on the effectiveness of the analyzed methods.

The part on Application in Finance contains two papers.
Batóg and Wawrzyniak’s study was carried out on the basis of selected financial ratios, which in the literature are considered to be nominants with a recommended range of values, with the assumption that the situation of the examined object is better when the values of the indicator-nominant are above the upper limit of the recommended range of values (right-handed asymmetrical nominant) or below the lower limit of this range (left-handed asymmetrical nominant). Trzpiot, in her article, considers whether the standard risk estimation procedures are in line with investors’ expectations. The article presents the assumptions of Gini regression, the selected estimation method, and its application to systematic risk assessment. The application part models assets listed on the Warsaw Stock Exchange.

The part on Application in Economics contains five papers. The paper by Raca presents an overview of the definitions of the term dark data, a proposal of its interpretation, and a classification of data in a company with regard to usability, availability, and quality. As part of the research, four universal features of dark data sets have been indicated (unavailability, unawareness, uselessness, and costliness). Cieraszewska, Hamerska, Lula, and Zembura present the results of research including the analysis of abstracts of scientific articles in the field of economics, prepared in English by authors from 36 European countries and registered in the Scopus database in the years 2011–2020. The ontology-based approach is used for identification of concepts related to medical science and economics. The paper also presents the results of research on the relationship between the interdisciplinary nature of research in the field of economics and the number and degree of internationalization of authors’ teams.
The aim of Putek-Szeląg and Gdakowicz’s article is to present selected methods of duration analysis to assess the probability of exit from the real estate sale offer system, taking into account various types of competing risk (the year of submitting the property for sale). In the survey, the calculation of the offer duration takes into account the properties that have been sold and those that are still current (on the day of the end of the survey). Słupik and Trzęsiok’s work aims to identify and characterize electricity users in terms of their attitudes toward energy saving. The authors of the article based their analysis on the results of the proprietary research conducted among households in the Silesian Province in Poland, in 2018, and on a review of the literature on profiling individual energy consumers. In the article, the authors also characterize the obtained segments and identify fundamental factors influencing the respondents’ behavior toward saving energy. Wolak’s paper presents a study of selected linear ordering algorithms to build a ranking of districts in the Lesser Poland Province in terms of tourist attractiveness using techniques considering potential spatial relationships.

The part on Application in Social Issues contains four papers. Bieszk-Stolorz, in her paper, assesses the impact of gender of unemployed people on the duration of registered unemployment and on the duration of staying out of the office’s register, taking into account different reasons for de-registration. Due to censored observations, i.e., observations not completed with an event in the analyzed period, the author decided to use selected methods of survival analysis. The purpose of Grzenda’s paper is to indicate the possibility of using the Cox regression model to determine direct adjusted probabilities of finding a job by the unemployed depending on their individual characteristics in the context of long-term unemployment risk. The study is based on LFS data from 2017 and 2018 for Poland.
Przybysz, Stanimir, and Wasiak proposed to use the methods of multidimensional comparative analysis to assess the level of implementation of the Europe 2020 strategy, indicating areas important for the quality of life of seniors and identifying changes in the assessment of the implementation of this strategy by this generation. The study showed the existence of a very large diversity of seniors in terms of their life quality and their assessment of the strategy. Kos-Łabędowicz and Trzęsiok present two classifications of the elderly in Poland in terms of their preferences regarding means of transport: one prepared on the basis of literature research and expert knowledge, the other with the use of a selected taxonomic method. The aim of the article is to test the agreement between the obtained classifications and thus to verify the validity of the proposed expert segmentation which reflects Polish society specifically.

The part on Application with COVID-19 Data contains three papers. Nojszewska and Sielska analyze the similarities of European countries during the COVID-19 pandemic in terms of the following indicators: economic sentiment indicator (ESI) and employment expectations indicator (EEI) from the beginning of 2020. The research shows that after the collapse in March/April 2020, the values of variables reflecting the condition of economies started to increase in most of the identified groups of countries. Salamaga studied a question regarding the influence of the corona crisis on global foreign investment in the near future, especially in the investment market of the Visegrad Group countries. The main purpose of Landmesser’s paper is to analyze the patterns of COVID-19 evolution in a group of 27 EU countries. First, the author applies the concept of dynamic time warping (DTW) to identify groups of EU countries affected to varying degrees by the COVID-19 pandemic. Further, within the selected groups, the structure of the time series for infected and deceased COVID-19 patients was analyzed using ARIMA models.

We wish to thank all the authors for making their studies available for our volume. Their scholarly efforts and research inquiries made this volume possible. We are also indebted to the anonymous referees for providing insightful reviews with many useful comments and suggestions. In spite of our intention to address a wide range of problems pertaining to classification and data analysis theory, there are issues that still need to be researched. We hope that the studies included in our volume will encourage further research and analyses in modern data science.

Wroclaw, Poland        Krzysztof Jajuga
Sopot, Poland          Krzysztof Najman
Jelenia Góra, Poland   Marek Walesiak
January 2021

Contents

Methodology

Evaluation of Two-Step Spectral Clustering Algorithm for Large Untypical Data Sets
Andrzej Dudek

Determining the Number of Groups in Cluster Analysis Using Classical Indexes and Stability Measures—Comparison of Results
Dorota Rozmus

Identification of the Words Most Frequently Used by Different Generations of Twitter Users
Agata Majkowska, Kamila Migdał-Najman, Krzysztof Najman, and Katarzyna Raca

Classification Algorithms Applications for Information Security on the Internet: A Review
Michał Bryś

Outlier Detection with the Use of Isolation Forests
Krzysztof Najman and Krystian Zieliński

Application in Finance

Propositions of Transformations of Asymmetrical Nominants into Stimulants on the Example of Chosen Financial Ratios
Barbara Batóg and Katarzyna Wawrzyniak

Gini Regression in the Capital Investment Risk Assessment—Sensitivity Risk Measures in Portfolio Analysis
Grażyna Trzpiot

Application in Economics

Enterprise Dark Data
Katarzyna Raca

The Significance of Medical Science Issues in Research Papers Published in the Field of Economics
Urszula Cieraszewska, Monika Hamerska, Paweł Lula, and Marcela Zembura

Application of Duration Analysis Methods in the Study of the Exit of a Real Estate Sale Offer from the Offer Database System
Ewa Putek-Szeląg and Anna Gdakowicz

Is Society Ready for Long-Term Investments?—Profiles of Electricity Users in Silesia
Sylwia Słupik and Joanna Trzęsiok

The Use of the Spatial Taxonomic Measure of Development to Assess the Tourist Attractiveness of Districts of the Lesser Poland Province
Jacek Wolak

Application in Social Issues

Models of Competing Events in Assessing the Effects of the Transition of Unemployed People Between the States of Registration and De-Registration
Beata Bieszk-Stolorz

Direct Adjusted Survival Probabilities in the Analysis of Finding a Job by the Unemployed Depending on Their Individual Characteristics
Wioletta Grzenda

Europe 2020 Strategy—Objective Evaluation of Realization and Subjective Assessment by Seniors as Beneficiaries of Social Assumptions
Klaudia Przybysz, Agnieszka Stanimir, and Marta Wasiak

Do Seniors Get to the Disco by Bike or in a Taxi?—Classification of Seniors According to Their Preferred Means of Transport
Joanna Kos-Łabędowicz and Joanna Trzęsiok

Application with COVID-19 Data

The Impact of the COVID-19 Pandemic on the Economies of European Countries in the Period January–September 2020 Based on Economic Indicators
Ewelina Nojszewska and Agata Sielska

