Studies in Classification, Data Analysis,
and Knowledge Organization
Krzysztof Jajuga
Krzysztof Najman
Marek Walesiak Editors
Data Analysis and
Classification
Methods and Applications
Studies in Classification, Data Analysis,
and Knowledge Organization
Managing Editors
Wolfgang Gaul, Karlsruhe, Germany
Maurizio Vichi, Rome, Italy
Claus Weihs, Dortmund, Germany
Editorial Board
Daniel Baier, Bayreuth, Germany
Frank Critchley, Milton Keynes, UK
Reinhold Decker, Bielefeld, Germany
Edwin Diday, Paris, France
Michael Greenacre, Barcelona, Spain
Carlo Natale Lauro, Naples, Italy
Jacqueline Meulman, Leiden, The Netherlands
Paola Monari, Bologna, Italy
Shizuhiko Nishisato, Toronto, Canada
Noboru Ohsumi, Tokyo, Japan
Otto Opitz, Augsburg, Germany
Gunter Ritter, Passau, Germany
Martin Schader, Mannheim, Germany
More information about this series at http://www.springer.com/series/1564
Krzysztof Jajuga
Krzysztof Najman
Marek Walesiak
Editors
Data Analysis and Classification
Methods and Applications
Editors
Krzysztof Jajuga
Department of Financial Investments and Risk Management
Wroclaw University of Economics and Business
Wroclaw, Poland
Krzysztof Najman
Department of Statistics
University of Gdańsk
Sopot, Poland
Marek Walesiak
Department of Econometrics and Computer Science
Wroclaw University of Economics and Business
Jelenia Góra, Poland
ISSN 1431-8814 ISSN 2198-3321 (electronic)
Studies in Classification, Data Analysis, and Knowledge Organization
ISBN 978-3-030-75189-0 ISBN 978-3-030-75190-6 (eBook)
https://doi.org/10.1007/978-3-030-75190-6
Mathematics Subject Classification: 62H25, 62H30, 62H86, 62-09, 68U20, 62P12, 62P20, 62P25
© The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Nature
Switzerland AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This volume presents the papers from the 29th Conference of the Section of
Classification and Data Analysis of the Polish Statistical Society, held at the University
of Gdańsk on September 7–9, 2020. The papers presented refer to a set of studies
addressing a wide range of recent methodological aspects and applications of
classification and data analysis tools in micro and macroeconomic problems. In the
final selection, we accepted 19 of the papers that were presented at the conference.
Each of the submissions has been reviewed by two anonymous referees, and the
authors have subsequently revised their original manuscripts and incorporated the
comments and suggestions of the referees. The selection criteria were based on the
contribution of the papers to the theory and applications of modern classification
and data analysis.
The chapters have been organized according to the major fields and themes in
classification and data analysis: Methodology, Application in Finance, Application
in Economics, Application in Social Issues, and Application with COVID-19 Data.
The part on Methodology contains five papers. The paper by Dudek focuses on
a new algorithm from the spectral clustering family and its applications in the
analysis of large data sets. The author conducted a comparative analysis with other
approaches. Rozmus's article focuses on the analysis of the number of clusters and
stability indicators. The aim of the article is to compare classical indexes and
stability measures in terms of the number of groups they indicate as correct. The
paper by Majkowska, Migdał-Najman, Najman, and Raca attempts to characterize words
commonly used in the messages published by Twitter users. Text mining methods
and techniques were used to carry out the research, which focused mainly on
the analysis of individual words and collocations occurring in the users' tweets.
Bryś in his paper reviews 1446 selected publications and provides insights
on classification algorithms applied to information security tasks, their popularity,
and the algorithm selection challenges. The paper by Najman and Zieliński
investigates the usefulness of isolation forests in outlier detection. The
results of simulations and empirical studies on selected data sets are presented. The
assessment takes into account the impact of individual characteristics of big data
sets on the effectiveness of the analyzed methods.
The part on Application in Finance contains two papers. Batóg and
Wawrzyniak's study was carried out on the basis of selected financial ratios, which
in the literature are considered to be nominants with a recommended range of
values, under the assumption that the situation of the examined object is better when
the values of the indicator-nominant are above the upper limit of the recommended
range of values (right-handed asymmetrical nominant) or below the lower limit of
this range (left-handed asymmetrical nominant). Trzpiot in her article considers
whether the standard risk estimation procedures are in line with investors'
expectations. The article presents the assumptions of Gini regression, the
selected estimation method, and its application to systematic risk assessment.
The application part models assets listed on the Warsaw Stock Exchange.
The part on Application in Economics contains five papers. The paper by Raca
presents an overview of the definitions of the term dark data, a proposal of its
interpretation, and a classification of data in a company with regard to usability,
availability, and quality. As part of the research, four universal features of dark data
sets have been indicated (unavailability, unawareness, uselessness, and costliness).
Cieraszewska, Hamerska, Lula, and Zembura present the results of research
including the analysis of abstracts of scientific articles in the field of economics,
prepared in English by authors from 36 European countries and registered in the
Scopus database in the years 2011–2020. The ontology-based approach is used for
identification of concepts related to medical science and economics. The paper also
presents the results of research on the relationship between the interdisciplinary
nature of research in the field of economics and the number and degree of
internationalization of authors' teams. The aim of Putek-Szeląg and Gdakowicz's
article is to present selected methods of duration analysis to assess the probability of
exit from the real estate sale offer system, taking into account various types of
competing risk (the year of submitting the property for sale). In the survey, the
calculation of the offer duration takes into account the properties that have been
sold and those that are still current (on the day of the end of the survey). Słupik and
Trzęsiok's work aims to identify and characterize electricity users in terms of their
attitudes toward energy saving. The authors of the article based their analysis on the
results of their own research conducted among households in the Silesian Province in
Poland, in 2018, and on a review of the literature on profiling individual energy
consumers. In the article, the authors also characterize the obtained segments and
identify fundamental factors influencing the respondents' behavior toward saving
energy.
Wolak in his paper presents a study of selected linear ordering algorithms used to
build a ranking of districts in the Lesser Poland Province in terms of tourist
attractiveness, using techniques that consider potential spatial relationships.
The part on Application in Social Issues contains four papers. Bieszk-Stolorz in
her paper assesses the impact of the gender of unemployed people on the duration of
registered unemployment and on the duration of staying out of the office's register,
taking into account different reasons for de-registration. Due to censored
observations, i.e., observations not completed with an event in the analyzed period, the
author decided to use selected methods of survival analysis. The purpose of Grzenda's
paper is to indicate the possibility of using the Cox regression model to determine
direct adjusted probabilities of finding a job by the unemployed depending on their
individual characteristics in the context of long-term unemployment risk. The study
is based on LFS data from 2017 and 2018 for Poland. Przybysz, Stanimir, and
Wasiak propose using the methods of multidimensional comparative analysis to
assess the level of implementation of the Europe 2020 strategy, indicating areas
important for the quality of life of seniors and identifying changes in the assessment
of the implementation of this strategy by this generation. The study showed the
existence of a very large diversity of seniors in terms of their life quality and their
assessment of the strategy. Kos-Łabędowicz and Trzęsiok present two
classifications of the elderly in Poland in terms of their preferences regarding means of
transport: one prepared on the basis of literature research and expert knowledge, the
other with the use of a selected taxonomic method. The aim of the article is to test
the agreement between the obtained classifications and thus to verify the validity
of the proposed expert segmentation which reflects Polish society specifically.
The part on Application with COVID-19 Data contains three papers. Nojszewska
and Sielska analyze the similarities of European countries during the COVID-19
pandemic in terms of the following indicators: the economic sentiment indicator (ESI)
and the employment expectations indicator (EEI) from the beginning of 2020. The research
shows that after the collapse in March/April 2020, the values of variables reflecting
the condition of economies started to increase in most of the identified groups of
countries. Salamaga studied the question of how the corona crisis will influence
global foreign investment in the near future, especially in the investment market
of the Visegrad Group countries. The main purpose of Landmesser's paper is to
analyze the patterns of COVID-19 evolution in a group of 27 EU countries. First, the
author applies the concept of dynamic time warping (DTW) to identify groups of EU
countries affected to varying degrees by the COVID-19 pandemic. Further, within
the selected groups, the structure of the time series for infected and deceased
COVID-19 patients was analyzed using ARIMA models.
We wish to thank all the authors for making their studies available for our
volume. Their scholarly efforts and research inquiries made this volume possible.
We are also indebted to the anonymous referees for providing insightful reviews
with many useful comments and suggestions.
In spite of our intention to address a wide range of problems pertaining to
classification and data analysis theory, there are issues that still need to be
researched. We hope that the studies included in our volume will encourage further
research and analyses in modern data science.
Wroclaw, Poland Krzysztof Jajuga
Sopot, Poland Krzysztof Najman
Jelenia Góra, Poland Marek Walesiak
January 2021
Contents
Methodology
Evaluation of Two-Step Spectral Clustering Algorithm for Large
Untypical Data Sets
Andrzej Dudek
Determining the Number of Groups in Cluster Analysis Using
Classical Indexes and Stability Measures—Comparison of Results
Dorota Rozmus
Identification of the Words Most Frequently Used by Different
Generations of Twitter Users
Agata Majkowska, Kamila Migdał-Najman, Krzysztof Najman,
and Katarzyna Raca
Classification Algorithms Applications for Information Security
on the Internet: A Review
Michał Bryś
Outlier Detection with the Use of Isolation Forests
Krzysztof Najman and Krystian Zieliński
Application in Finance
Propositions of Transformations of Asymmetrical Nominants into
Stimulants on the Example of Chosen Financial Ratios
Barbara Batóg and Katarzyna Wawrzyniak
Gini Regression in the Capital Investment Risk
Assessment—Sensitivity Risk Measures in Portfolio Analysis
Grażyna Trzpiot
Application in Economics
Enterprise Dark Data
Katarzyna Raca
The Significance of Medical Science Issues in Research Papers
Published in the Field of Economics
Urszula Cieraszewska, Monika Hamerska, Paweł Lula,
and Marcela Zembura
Application of Duration Analysis Methods in the Study of the Exit
of a Real Estate Sale Offer from the Offer Database System
Ewa Putek-Szeląg and Anna Gdakowicz
Is Society Ready for Long-Term Investments?—Profiles of Electricity
Users in Silesia
Sylwia Słupik and Joanna Trzęsiok
The Use of the Spatial Taxonomic Measure of Development
to Assess the Tourist Attractiveness of Districts of the Lesser
Poland Province
Jacek Wolak
Application in Social Issues
Models of Competing Events in Assessing the Effects of the Transition
of Unemployed People Between the States of Registration
and De-Registration
Beata Bieszk-Stolorz
Direct Adjusted Survival Probabilities in the Analysis of Finding a Job
by the Unemployed Depending on Their Individual Characteristics
Wioletta Grzenda
Europe 2020 Strategy—Objective Evaluation of Realization
and Subjective Assessment by Seniors as Beneficiaries
of Social Assumptions
Klaudia Przybysz, Agnieszka Stanimir, and Marta Wasiak
Do Seniors Get to the Disco by Bike or in a Taxi?—Classification
of Seniors According to Their Preferred Means of Transport
Joanna Kos-Łabędowicz and Joanna Trzęsiok
Application with COVID-19 Data
The Impact of the COVID-19 Pandemic on the Economies
of European Countries in the Period January–September 2020 Based
on Economic Indicators
Ewelina Nojszewska and Agata Sielska