
Guide to Intelligent Data Science: How to Intelligently Make Use of Real Data (PDF)

427 pages · 2020 · 8.698 MB · English
by Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, and Rosaria Silipo

Preview: Guide to Intelligent Data Science: How to Intelligently Make Use of Real Data

Texts in Computer Science

Michael R. Berthold · Christian Borgelt · Frank Höppner · Frank Klawonn · Rosaria Silipo

Guide to Intelligent Data Science: How to Intelligently Make Use of Real Data, Second Edition

Series Editors: David Gries, Orit Hazzan

Titles in this series now included in the Thomson Reuters Book Citation Index!

"Texts in Computer Science" (TCS) delivers high-quality instructional content for undergraduates and graduates in all areas of computing and information science, with a strong emphasis on core foundational and theoretical material but inclusive of some prominent applications-related content. TCS books should be reasonably self-contained and aim to provide students with modern and clear accounts of topics ranging across the computing curriculum. As a result, the books are ideal for semester courses or for individual self-study in cases where people need to expand their knowledge. All texts are authored by established experts in their fields, reviewed internally and by the series editors, and provide numerous examples, problems, and other pedagogical tools; many contain fully worked solutions.

The TCS series is comprised of high-quality, self-contained books that have broad and comprehensive coverage and are generally in hardback format and sometimes contain color. For undergraduate textbooks that are likely to be more brief and modular in their approach, require only black and white, and are under 275 pages, Springer offers the flexibly designed Undergraduate Topics in Computer Science series, to which we refer potential authors.

More information about this series at www.springer.com/series/3191

Authors:
Michael R. Berthold, Department of Computer and Information Science, University of Konstanz, Konstanz, Germany
Christian Borgelt, Department of Computer Sciences, University of Salzburg, Salzburg, Austria
Frank Höppner, Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbüttel, Germany
Frank Klawonn, Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbüttel, Germany
Rosaria Silipo, KNIME AG, Zurich, Switzerland

Series Editors:
David Gries, Department of Computer Science, Cornell University, Ithaca, NY, USA
Orit Hazzan, Faculty of Education in Technology and Science, Technion – Israel Institute of Technology, Haifa, Israel

ISSN 1868-0941 (print), ISSN 1868-095X (electronic)
Texts in Computer Science
ISBN 978-3-030-45573-6 (print), ISBN 978-3-030-45574-3 (eBook)
https://doi.org/10.1007/978-3-030-45574-3

© Springer Nature Switzerland AG 2010, 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

The interest in making sense of data has become central to pretty much every business and project in the past decade. When we wrote the first edition of this volume, it was aimed at the people whose task it was to analyze real-world data, often in a side department, prepared to solve problems given to them. In the meantime, data science use has spread from those isolated groups to the entire organization. This fits our goal from the first edition even more: providing different levels of detail for each section of the book. Readers more interested in the concepts and how to apply the methods can focus on the first half of each chapter. Readers also interested in the underpinnings of the presented methods and algorithms find the theory in the middle. And for both audiences, the end of most chapters shows practical examples of how to apply these methods to real data.

Still, given the advances in research, we decided it was worth revising, updating, and expanding the content to also cover select new methods and tools that have been invented in the meantime. We also adjusted to the new way of thinking about data analysis and how many organizations now see it in the bigger context of Data Science, as is visible in the many universities adding Data Science curricula or even entire Data Science departments. As the original volume covered the entire process from data ingestion to deployment and management (which we expanded substantially in this edition), we decided to follow this trend and use Data Science as the overarching umbrella term. This also made it obvious that having someone with more real-world data science experience on board would be beneficial, so we were glad when Rosaria agreed to share the (re)writing duty and add decades of industrial data science experience to the team.

And finally, we decided to put the focus of the practical exercises more on KNIME Analytics Platform, not because there is only one tool out there but because visual workflows are easier to explain and therefore lend themselves to being included in printed material to illustrate how the methods are used in practice. On the book's website, www.datascienceguide.org, you can still find the R examples from the first edition, and we have also added examples in Python as well as all of the KNIME workflows described in the book. We are also providing teaching material and will continuously update the site in the coming years.

There are many people to be thanked, and we will not attempt to list them all. However, Iris Adä and Martin Horn deserve mentioning for all their help with the first edition. For the second round, we owe thanks to Satoru Hayasaka, Kathrin Melcher, and Emilio Silvestri, who have spent many hours proofreading and updating or creating workflows.

Konstanz, Germany: Michael R. Berthold
Salzburg, Austria: Christian Borgelt
Braunschweig, Germany: Frank Höppner and Frank Klawonn
Zurich, Switzerland: Rosaria Silipo

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Data and Knowledge
    1.1.2 Tycho Brahe and Johannes Kepler
    1.1.3 Intelligent Data Science
  1.2 The Data Science Process
  1.3 Methods, Tasks, and Tools
  1.4 How to Read This Book
  References
2 Practical Data Science: An Example
  2.1 The Setup
  2.2 Data Understanding and Pattern Finding
  2.3 Explanation Finding
  2.4 Predicting the Future
  2.5 Concluding Remarks
3 Project Understanding
  3.1 Determine the Project Objective
  3.2 Assess the Situation
  3.3 Determine Analysis Goals
  3.4 Further Reading
  References
4 Data Understanding
  4.1 Attribute Understanding
  4.2 Data Quality
  4.3 Data Visualization
    4.3.1 Methods for One and Two Attributes
    4.3.2 Methods for Higher-Dimensional Data
  4.4 Correlation Analysis
  4.5 Outlier Detection
    4.5.1 Outlier Detection for Single Attributes
    4.5.2 Outlier Detection for Multidimensional Data
  4.6 Missing Values
  4.7 A Checklist for Data Understanding
  4.8 Data Understanding in Practice
    4.8.1 Visualizing the Iris Data
    4.8.2 Visualizing a Three-Dimensional Data Set on a Two-Coordinate Plot
  References
5 Principles of Modeling
  5.1 Model Classes
  5.2 Fitting Criteria and Score Functions
    5.2.1 Error Functions for Classification Problems
    5.2.2 Measures of Interestingness
  5.3 Algorithms for Model Fitting
    5.3.1 Closed-Form Solutions
    5.3.2 Gradient Method
    5.3.3 Combinatorial Optimization
    5.3.4 Random Search, Greedy Strategies, and Other Heuristics
  5.4 Types of Errors
    5.4.1 Experimental Error
    5.4.2 Sample Error
    5.4.3 Model Error
    5.4.4 Algorithmic Error
    5.4.5 Machine Learning Bias and Variance
    5.4.6 Learning Without Bias?
  5.5 Model Validation
    5.5.1 Training and Test Data
    5.5.2 Cross-Validation
    5.5.3 Bootstrapping
    5.5.4 Measures for Model Complexity
    5.5.5 Coping with Unbalanced Data
  5.6 Model Errors and Validation in Practice
    5.6.1 Scoring Models for Classification
    5.6.2 Scoring Models for Numeric Predictions
  5.7 Further Reading
  References
6 Data Preparation
  6.1 Select Data
    6.1.1 Feature Selection
    6.1.2 Dimensionality Reduction
    6.1.3 Record Selection
  6.2 Clean Data
    6.2.1 Improve Data Quality
    6.2.2 Missing Values
    6.2.3 Remove Outliers
  6.3 Construct Data
    6.3.1 Provide Operability
    6.3.2 Assure Impartiality
    6.3.3 Maximize Efficiency
  6.4 Complex Data Types
  6.5 Data Integration
    6.5.1 Vertical Data Integration
    6.5.2 Horizontal Data Integration
  6.6 Data Preparation in Practice
    6.6.1 Removing Empty or Almost Empty Attributes and Records in a Data Set
    6.6.2 Normalization and Denormalization
    6.6.3 Backward Feature Elimination
  6.7 Further Reading
  References
7 Finding Patterns
  7.1 Hierarchical Clustering
    7.1.1 Overview
    7.1.2 Construction
    7.1.3 Variations and Issues
  7.2 Notion of (Dis-)Similarity
  7.3 Prototype- and Model-Based Clustering
    7.3.1 Overview
    7.3.2 Construction
    7.3.3 Variations and Issues
  7.4 Density-Based Clustering
    7.4.1 Overview
    7.4.2 Construction
    7.4.3 Variations and Issues
  7.5 Self-organizing Maps
    7.5.1 Overview
    7.5.2 Construction
  7.6 Frequent Pattern Mining and Association Rules
    7.6.1 Overview
    7.6.2 Construction
    7.6.3 Variations and Issues
  7.7 Deviation Analysis
    7.7.1 Overview
    7.7.2 Construction
    7.7.3 Variations and Issues
  7.8 Finding Patterns in Practice
    7.8.1 Hierarchical Clustering
    7.8.2 k-Means and DBSCAN
    7.8.3 Association Rule Mining
  7.9 Further Reading
  References
8 Finding Explanations
  8.1 Decision Trees
    8.1.1 Overview
    8.1.2 Construction
    8.1.3 Variations and Issues
  8.2 Bayes Classifiers
    8.2.1 Overview
    8.2.2 Construction
    8.2.3 Variations and Issues
  8.3 Regression
    8.3.1 Overview
    8.3.2 Construction
    8.3.3 Variations and Issues
    8.3.4 Two-Class Problems
    8.3.5 Regularization for Logistic Regression
  8.4 Rule Learning
    8.4.1 Propositional Rules
    8.4.2 Inductive Logic Programming or First-Order Rules
  8.5 Finding Explanations in Practice
    8.5.1 Decision Trees
    8.5.2 Naïve Bayes
    8.5.3 Logistic Regression
  8.6 Further Reading
  References
9 Finding Predictors
  9.1 Nearest-Neighbor Predictors
    9.1.1 Overview
    9.1.2 Construction
    9.1.3 Variations and Issues
  9.2 Artificial Neural Networks
    9.2.1 Overview
    9.2.2 Construction
    9.2.3 Variations and Issues
  9.3 Deep Learning
    9.3.1 Recurrent Neural Networks and Long-Short Term Memory Units
    9.3.2 Convolutional Neural Networks
    9.3.3 More Deep Learning Networks: Generative-Adversarial Networks (GANs)
  9.4 Support Vector Machines
