Table Of ContentThomas A. Runkler
Data Analytics
Models and Algorithms for
Intelligent Data Analysis
2nd Edition
Data Analytics
Thomas A. Runkler
Data Analytics
Models and Algorithms for
Intelligent Data Analysis
2nd Edition
ThomasA.Runkler
SiemensAG
München,Germany
ISBN:978-3-658-14074-8 ISBN:978-3-658-14075-5(eBook)
DOI10.1007/978-3-658-14075-5
LibraryofCongressControlNumber:2016942272
SpringerVieweg
©SpringerFachmedienWiesbaden2012,2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting,reproductiononmicrofilmsorinanyotherphysicalway,andtransmissionorinformationstorage
andretrieval,electronicadaptation,computersoftware,orbysimilarordissimilarmethodologynowknownor
hereafterdeveloped.
Theuseofgeneraldescriptivenames,registerednames,trademarks,servicemarks,etc.inthispublicationdoes
notimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevantprotective
lawsandregulationsandthereforefreeforgeneraluse.
Thepublisher,theauthorsandtheeditorsaresafetoassumethattheadviceandinformationinthisbookare
believedtobetrueandaccurateatthedateofpublication.Neitherthepublishernortheauthorsortheeditors
giveawarranty,expressorimplied,withrespecttothematerialcontainedhereinorforanyerrorsoromissions
thatmayhavebeenmade.
Printedonacid-freepaper
SpringerViewegisabrandofSpringerNature
TheregisteredcompanyisSpringerFachmedienWiesbadenGmbH
Preface
The information in the world doubles every 20 months. Important data sources are
business and industrial processes, text and structured databases, images and videos,
and physical and biomedical data. Data analytics allows to find relevant information,
structures, and patterns, to gain new insights, to identify causes and effects, to predict
future developments, or to suggest optimal decisions. We need models and algorithms
to collect, preprocess, analyze, and evaluate data, from various fields such as statistics,
machine learning, pattern recognition, system theory, operations research, or artificial
intelligence. With this book, you will learn about the most important methods and
algorithmsfordataanalytics.Youwillbeabletochooseappropriatemethodsforspecific
tasks and apply these in your own data analytics projects. You will understand the basic
conceptsofthegrowingfieldofdataanalytics,whichwillallowyoutokeeppaceandto
activelycontributetotheadvancementofthefield.
This text is designed for undergraduate and graduate courses on data analytics for
engineering, computer science, and math students. It is also suitable for practitioners
working on data analytics projects. The book is structured according to typical practical
data analytics projects. Only basic mathematics is required. This material has been used
for more than ten years in numerous courses at the Technical University of Munich,
Germany, in short courses at several other universities, and in tutorials at international
scientific conferences. Much of the content is based on the results of industrial research
anddevelopmentprojectsatSiemens.
Historyofthebookversions:
(cid:129) DataAnalytics,secondedition,2016,English
(cid:129) DataMining,secondedition,2015,German
(cid:129) DataAnalytics,2012,English
(cid:129) DataMining,2010,German
(cid:129) InformationMining,2000,German
v
vi Preface
Thanks go to everybody who has contributed to this material, in particular to the
reviewers and my students for suggesting improvements and pointing out errors and to
theeditorialandpublisherteamfortheirprofessionalcollaboration.
Munich,Germany ThomasA.Runkler
April2016
Contents
1 Introduction ......................................................................... 1
1.1 It’sAllAboutData ............................................................ 1
1.2 DataAnalytics,DataMining,andKnowledgeDiscovery................... 2
References............................................................................ 3
2 DataandRelations ................................................................. 5
2.1 TheIrisDataSet............................................................... 5
2.2 DataScales..................................................................... 8
2.3 SetandMatrixRepresentations............................................... 10
2.4 Relations....................................................................... 11
2.5 DissimilarityMeasures........................................................ 12
2.6 SimilarityMeasures........................................................... 14
2.7 SequenceRelations............................................................ 16
2.8 SamplingandQuantization ................................................... 18
Problems.............................................................................. 21
References............................................................................ 22
3 DataPreprocessing ................................................................. 23
3.1 ErrorTypes .................................................................... 23
3.2 ErrorHandling................................................................. 26
3.3 Filtering........................................................................ 27
3.4 DataTransformation .......................................................... 32
3.5 DataIntegration ............................................................... 35
Problems.............................................................................. 35
References............................................................................ 36
4 DataVisualization .................................................................. 37
4.1 Diagrams....................................................................... 37
4.2 PrincipalComponentAnalysis ............................................... 39
4.3 MultidimensionalScaling..................................................... 43
4.4 SammonMapping............................................................. 47
vii
viii Contents
4.5 Auto-Associator ............................................................... 51
4.6 Histograms..................................................................... 51
4.7 SpectralAnalysis.............................................................. 54
Problems.............................................................................. 57
References............................................................................ 57
5 Correlation........................................................................... 59
5.1 LinearCorrelation............................................................. 59
5.2 CorrelationandCausality..................................................... 61
5.3 Chi-SquareTestforIndependence............................................ 62
Problems.............................................................................. 65
References............................................................................ 65
6 Regression ........................................................................... 67
6.1 LinearRegression ............................................................. 67
6.2 LinearRegressionwithNonlinearSubstitution.............................. 71
6.3 RobustRegression............................................................. 72
6.4 NeuralNetworks............................................................... 73
6.5 RadialBasisFunctionNetworks.............................................. 77
6.6 Cross-Validation............................................................... 79
6.7 FeatureSelection .............................................................. 81
Problems.............................................................................. 82
References............................................................................ 82
7 Forecasting........................................................................... 85
7.1 FiniteStateMachines ......................................................... 85
7.2 RecurrentModels.............................................................. 86
7.3 AutoregressiveModels........................................................ 88
Problems.............................................................................. 89
References............................................................................ 89
8 Classification ........................................................................ 91
8.1 ClassificationCriteria ......................................................... 91
8.2 NaiveBayesClassifier ........................................................ 95
8.3 LinearDiscriminantAnalysis................................................. 97
8.4 SupportVectorMachine ...................................................... 99
8.5 NearestNeighborClassifier................................................... 102
8.6 LearningVectorQuantization................................................. 103
8.7 DecisionTrees................................................................. 104
Problems.............................................................................. 107
References............................................................................ 108
Contents ix
9 Clustering............................................................................ 111
9.1 ClusterPartitions .............................................................. 111
9.2 SequentialClustering.......................................................... 113
9.3 Prototype-BasedClustering................................................... 115
9.4 FuzzyClustering............................................................... 117
9.5 RelationalClustering.......................................................... 123
9.6 ClusterTendencyAssessment ................................................ 127
9.7 ClusterValidity................................................................ 128
9.8 Self-OrganizingMap.......................................................... 129
Problems.............................................................................. 130
References............................................................................ 131
AppendixA BriefReviewofSomeOptimizationMethods....................... 133
A.1 OptimizationwithDerivatives................................................ 133
A.2 GradientDescent .............................................................. 134
A.3 LagrangeOptimization........................................................ 135
References............................................................................ 137
Solutions.................................................................................. 139
Index...................................................................................... 143