Data analytics PDF

167 Pages·2020·3.958 MB·English

Preview Data analytics

Thomas A. Runkler
Data Analytics
Models and Algorithms for Intelligent Data Analysis
Third Edition

Thomas A. Runkler
München, Germany

ISBN 978-3-658-29778-7
ISBN 978-3-658-29779-4 (eBook)
https://doi.org/10.1007/978-3-658-29779-4

© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2012, 2016, 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer Vieweg imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH.
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany

Preface

The information in the world doubles every 20 months. Important data sources are business and industrial processes, text and structured databases, images and videos, and physical and biomedical data.
Data analytics allows us to find relevant information, structures, and patterns, to gain new insights, to identify causes and effects, to predict future developments, or to suggest optimal decisions. We need models and algorithms to collect, preprocess, analyze, and evaluate data. This involves methods from various fields such as statistics, machine learning, pattern recognition, systems theory, operations research, or artificial intelligence. With this book, you will learn about the most important methods and algorithms for data analytics. You will be able to choose appropriate methods for specific tasks and apply these in your own data analytics projects. You will understand the basic concepts of the growing field of data analytics, which will allow you to keep pace and to actively contribute to the advancement of the field.

This text is designed for undergraduate and graduate courses on data analytics for engineering, computer science, and math students. It is also suitable for practitioners working on data analytics projects. The book is structured according to typical practical data analytics projects. Only basic mathematics is required. This material has been used for many years in courses at the Technical University of Munich, Germany, and at many other universities. Much of the content is based on the results of industrial research and development projects at Siemens.

The following list shows the history of the book versions:

• Data Analytics, third edition, 2020, English
• Data Analytics, second edition, 2016, English
• Data Mining, second edition, 2015, German
• Data Analytics, 2012, English
• Data Mining, 2010, German
• Information Mining, 2000, German

Thanks to everybody who has contributed to this material, in particular to the reviewers and students for suggesting improvements and pointing out errors and to the editorial and publisher team for their professional collaboration.

Munich, Germany
Thomas A. Runkler
April 2020

Contents

1 Introduction
  1.1 It’s All About Data
  1.2 Data Analytics, Data Mining, and Knowledge Discovery
  References
2 Data and Relations
  2.1 The Iris Data Set
  2.2 Data Scales
  2.3 Set and Matrix Representations
  2.4 Relations
  2.5 Dissimilarity Measures
  2.6 Similarity Measures
  2.7 Sequence Relations
  2.8 Sampling and Quantization
  Problems
  References
3 Data Preprocessing
  3.1 Error Types
  3.2 Error Handling
  3.3 Filtering
  3.4 Data Transformation
  3.5 Data Integration
  Problems
  References
4 Data Visualization
  4.1 Diagrams
  4.2 Principal Component Analysis
  4.3 Multidimensional Scaling
  4.4 Sammon Mapping
  4.5 Auto-encoder
  4.6 Histograms
  4.7 Spectral Analysis
  Problems
  References
5 Correlation
  5.1 Linear Correlation
  5.2 Correlation and Causality
  5.3 Chi-Square Test for Independence
  Problems
  References
6 Regression
  6.1 Linear Regression
  6.2 Linear Regression with Nonlinear Substitution
  6.3 Robust Regression
  6.4 Neural Networks
  6.5 Radial Basis Function Networks
  6.6 Cross-Validation
  6.7 Feature Selection
  Problems
  References
7 Forecasting
  7.1 Finite State Machines
  7.2 Recurrent Models
  7.3 Autoregressive Models
  Problems
  References
8 Classification
  8.1 Classification Criteria
  8.2 Naive Bayes Classifier
  8.3 Linear Discriminant Analysis
  8.4 Support Vector Machine
  8.5 Nearest Neighbor Classifier
  8.6 Learning Vector Quantization
  8.7 Decision Trees
  Problems
  References
9 Clustering
  9.1 Cluster Partitions
  9.2 Sequential Clustering
  9.3 Prototype-Based Clustering
  9.4 Fuzzy Clustering
  9.5 Relational Clustering
  9.6 Cluster Tendency Assessment
  9.7 Cluster Validity
  9.8 Self-organizing Map
  Problems
  References
A Brief Review of Some Optimization Methods
  A.1 Optimization with Derivatives
  A.2 Gradient Descent
  A.3 Lagrange Optimization
  References
Solutions
Index

List of Symbols

∀x ∈ X                              For each x in X
∃x ∈ X                              There exists an x in X
⇒                                   If ... then ...
⇔                                   If and only if
$\int_a^b f\,dx$                    Integral of f from x = a to x = b
$\frac{\partial f}{\partial x}$     Partial derivative of f with respect to x
∧                                   Conjunction
∨                                   Disjunction
∩                                   Intersection
∪                                   Union
¬                                   Complement
\                                   Set difference
⊂, ⊆                                Inclusion
·                                   Product, inner product
×                                   Cartesian product, vector product
{}                                  Empty set
[x, y]                              Closed interval from x to y
(x, y], [x, y)                      Half-bounded intervals from x to y
(x, y)                              Open interval from x to y
|x|                                 Absolute value of x
|X|                                 Cardinality of the set X
‖x‖                                 Norm of vector x
⌈x⌉                                 Smallest integer a ≥ x
⌊x⌋                                 Largest integer a ≤ x
$\binom{n}{m}$                      Vector with the components n and m, binomial coefficient
∞                                   Infinity
a ≪ b                               a is much less than b
a ≫ b                               a is much greater than b
α(t)                                Time variant learning rate
argmin X                            Index of the minimum of X
