A General Introduction to Data Analytics

João Mendes Moreira
University of Porto

André C. P. L. F. de Carvalho
University of São Paulo

Tomáš Horváth
Eötvös Loránd University in Budapest
Pavol Jozef Šafárik University in Košice

This edition first published 2019
© 2019 John Wiley & Sons, Inc.

Library of Congress Cataloging-in-Publication Data
Names: Moreira, João, 1969– author. | Carvalho, André Carlos Ponce de Leon Ferreira, author. | Horváth, Tomáš, 1976– author.
Title: A general introduction to data analytics / by João Mendes Moreira, André C. P. L. F. de Carvalho, Tomáš Horváth.
Description: Hoboken, NJ : John Wiley & Sons, 2019. | Includes bibliographical references and index. |
Identifiers: LCCN 2017060728 (print) | LCCN 2018005929 (ebook) | ISBN 9781119296256 (pdf) | ISBN 9781119296263 (epub) | ISBN 9781119296249 (cloth)
Subjects: LCSH: Mathematical statistics – Methodology. | Electronic data processing. | Data mining.
Classification: LCC QA276.4 (ebook) | LCC QA276.4 .M664 2018 (print) | DDC 519.50285–dc23
LC record available at https://lccn.loc.gov/2017060728

Printed in the United States of America.
Set in 10/12pt Warnock by SPi Global, Pondicherry, India

Contents

Preface
Acknowledgments
Presentational Conventions
About the Companion Website

Part I Introductory Background

1 What Can We Do With Data?
  1.1 Big Data and Data Science
  1.2 Big Data Architectures
  1.3 Small Data
  1.4 What is Data?
  1.5 A Short Taxonomy of Data Analytics
  1.6 Examples of Data Use
    1.6.1 Breast Cancer in Wisconsin
    1.6.2 Polish Company Insolvency Data
  1.7 A Project on Data Analytics
    1.7.1 A Little History on Methodologies for Data Analytics
    1.7.2 The KDD Process
    1.7.3 The CRISP-DM Methodology
  1.8 How this Book is Organized
  1.9 Who Should Read this Book

Part II Getting Insights from Data

2 Descriptive Statistics
  2.1 Scale Types
  2.2 Descriptive Univariate Analysis
    2.2.1 Univariate Frequencies
    2.2.2 Univariate Data Visualization
    2.2.3 Univariate Statistics
    2.2.4 Common Univariate Probability Distributions
  2.3 Descriptive Bivariate Analysis
    2.3.1 Two Quantitative Attributes
    2.3.2 Two Qualitative Attributes, at Least one of them Nominal
    2.3.3 Two Ordinal Attributes
  2.4 Final Remarks
  2.5 Exercises
3 Descriptive Multivariate Analysis
  3.1 Multivariate Frequencies
  3.2 Multivariate Data Visualization
  3.3 Multivariate Statistics
    3.3.1 Location Multivariate Statistics
    3.3.2 Dispersion Multivariate Statistics
  3.4 Infographics and Word Clouds
    3.4.1 Infographics
    3.4.2 Word Clouds
  3.5 Final Remarks
  3.6 Exercises
4 Data Quality and Preprocessing
  4.1 Data Quality
    4.1.1 Missing Values
    4.1.2 Redundant Data
    4.1.3 Inconsistent Data
    4.1.4 Noisy Data
    4.1.5 Outliers
  4.2 Converting to a Different Scale Type
    4.2.1 Converting Nominal to Relative
    4.2.2 Converting Ordinal to Relative or Absolute
    4.2.3 Converting Relative or Absolute to Ordinal or Nominal
  4.3 Converting to a Different Scale
  4.4 Data Transformation
  4.5 Dimensionality Reduction
    4.5.1 Attribute Aggregation
      4.5.1.1 Principal Component Analysis
      4.5.1.2 Independent Component Analysis
      4.5.1.3 Multidimensional Scaling
    4.5.2 Attribute Selection
      4.5.2.1 Filters
      4.5.2.2 Wrappers
      4.5.2.3 Embedded
      4.5.2.4 Search Strategies
  4.6 Final Remarks
  4.7 Exercises
5 Clustering
  5.1 Distance Measures
    5.1.1 Differences between Values of Common Attribute Types
    5.1.2 Distance Measures for Objects with Quantitative Attributes
    5.1.3 Distance Measures for Non-conventional Attributes
  5.2 Clustering Validation
  5.3 Clustering Techniques
    5.3.1 K-means
      5.3.1.1 Centroids and Distance Measures
      5.3.1.2 How K-means Works
    5.3.2 DBSCAN
    5.3.3 Agglomerative Hierarchical Clustering Technique
      5.3.3.1 Linkage Criterion
      5.3.3.2 Dendrograms
  5.4 Final Remarks
  5.5 Exercises
6 Frequent Pattern Mining
  6.1 Frequent Itemsets
    6.1.1 Setting the min_sup Threshold
    6.1.2 Apriori – a Join-based Method
    6.1.3 Eclat
    6.1.4 FP-Growth
    6.1.5 Maximal and Closed Frequent Itemsets
  6.2 Association Rules
  6.3 Behind Support and Confidence
    6.3.1 Cross-support Patterns
    6.3.2 Lift
    6.3.3 Simpson’s Paradox
  6.4 Other Types of Pattern
    6.4.1 Sequential patterns
    6.4.2 Frequent Sequence Mining
    6.4.3 Closed and Maximal Sequences
  6.5 Final Remarks
  6.6 Exercises
7 Cheat Sheet and Project on Descriptive Analytics
  7.1 Cheat Sheet of Descriptive Analytics
    7.1.1 On Data Summarization
    7.1.2 On Clustering
    7.1.3 On Frequent Pattern Mining
  7.2 Project on Descriptive Analytics
    7.2.1 Business Understanding
    7.2.2 Data Understanding
    7.2.3 Data Preparation
    7.2.4 Modeling
    7.2.5 Evaluation
    7.2.6 Deployment

Part III Predicting the Unknown

8 Regression
  8.1 Predictive Performance Estimation
    8.1.1 Generalization
    8.1.2 Model Validation
    8.1.3 Predictive Performance Measures for Regression
  8.2 Finding the Parameters of the Model
    8.2.1 Linear Regression
      8.2.1.1 Empirical Error
    8.2.2 The Bias-variance Trade-off
    8.2.3 Shrinkage Methods
      8.2.3.1 Ridge Regression
      8.2.3.2 Lasso Regression
    8.2.4 Methods that use Linear Combinations of Attributes
      8.2.4.1 Principal Components Regression
      8.2.4.2 Partial Least Squares Regression
  8.3 Technique and Model Selection
  8.4 Final Remarks
  8.5 Exercises
9 Classification
  9.1 Binary Classification
  9.2 Predictive Performance Measures for Classification
  9.3 Distance-based Learning Algorithms
    9.3.1 K-nearest Neighbor Algorithms
    9.3.2 Case-based Reasoning
  9.4 Probabilistic Classification Algorithms
    9.4.1 Logistic Regression Algorithm
    9.4.2 Naive Bayes Algorithm
  9.5 Final Remarks
  9.6 Exercises
10 Additional Predictive Methods
  10.1 Search-based Algorithms
    10.1.1 Decision Tree Induction Algorithms
    10.1.2 Decision Trees for Regression
      10.1.2.1 Model Trees
      10.1.2.2 Multivariate Adaptive Regression Splines
  10.2 Optimization-based Algorithms
    10.2.1 Artificial Neural Networks
      10.2.1.1 Backpropagation
      10.2.1.2 Deep Networks and Deep Learning Algorithms
    10.2.2 Support Vector Machines
      10.2.2.1 SVM for Regression
  10.3 Final Remarks
  10.4 Exercises
11 Advanced Predictive Topics
  11.1 Ensemble Learning
    11.1.1 Bagging
    11.1.2 Random Forests
    11.1.3 AdaBoost
  11.2 Algorithm Bias
  11.3 Non-binary Classification Tasks
    11.3.1 One-class Classification
    11.3.2 Multi-class Classification
    11.3.3 Ranking Classification
    11.3.4 Multi-label Classification
    11.3.5 Hierarchical Classification
  11.4 Advanced Data Preparation Techniques for Prediction
    11.4.1 Imbalanced Data Classification
    11.4.2 For Incomplete Target Labeling
      11.4.2.1 Semi-supervised Learning
      11.4.2.2 Active Learning
  11.5 Description and Prediction with Supervised Interpretable Techniques
  11.6 Exercises
12 Cheat Sheet and Project on Predictive Analytics
  12.1 Cheat Sheet on Predictive Analytics
  12.2 Project on Predictive Analytics
    12.2.1 Business Understanding
    12.2.2 Data Understanding
    12.2.3 Data Preparation
    12.2.4 Modeling
    12.2.5 Evaluation
    12.2.6 Deployment

Part IV Popular Data Analytics Applications

13 Applications for Text, Web and Social Media
  13.1 Working with Texts
    13.1.1 Data Acquisition
    13.1.2 Feature Extraction
      13.1.2.1 Tokenization
      13.1.2.2 Stemming
      13.1.2.3 Conversion to Structured Data
      13.1.2.4 Is the Bag of Words Enough?
    13.1.3 Remaining Phases
    13.1.4 Trends
      13.1.4.1 Sentiment Analysis
      13.1.4.2 Web Mining
  13.2 Recommender Systems
    13.2.1 Feedback
    13.2.2 Recommendation Tasks
    13.2.3 Recommendation Techniques
      13.2.3.1 Knowledge-based Techniques
      13.2.3.2 Content-based Techniques
      13.2.3.3 Collaborative Filtering Techniques
    13.2.4 Final Remarks
  13.3 Social Network Analysis
    13.3.1 Representing Social Networks
    13.3.2 Basic Properties of Nodes
      13.3.2.1 Degree
      13.3.2.2 Distance
      13.3.2.3 Closeness
      13.3.2.4 Betweenness
      13.3.2.5 Clustering Coefficient
    13.3.3 Basic and Structural Properties of Networks
      13.3.3.1 Diameter
      13.3.3.2 Centralization
      13.3.3.3 Cliques
      13.3.3.4 Clustering Coefficient
      13.3.3.5 Modularity
    13.3.4 Trends and Final Remarks
  13.4 Exercises

Appendix A: Comprehensive Description of the CRISP-DM Methodology
References
Index

Preface

We are living in a period of history that will certainly be remembered as one where information began to be instantaneously obtainable, services were tailored to individual criteria, and people did what made them feel good (if it did not put their lives at risk). Every year, machines are able to do more and more things that improve our quality of life. More data is available than ever before, and will become even more so. This is a time when we can extract more information from data than ever before, and benefit more from it.

In different areas of business and in different institutions, new ways to collect data are continuously being created. Old documents are being digitized, new sensors count the number of cars passing along motorways and extract useful information from them, our smartphones are informing us where we are at each moment and what new opportunities are available, and our favorite social networks register to whom we are related or what things we like.

Whatever area we work in, new data is available: data on how students evaluate professors; data on the evolution of diseases and the best treatment options per patient; data on soil, humidity levels and the weather, enabling us to produce more food with better quality; data on the macro economy, our investments and stock market indicators over time, enabling fairer distribution of wealth; and data on things we purchase, allowing us to purchase more effectively and at lower cost.

Students in many different domains feel the need to take advantage of the data they have. New courses on data analytics have been proposed in many different programs, from biology to information science, from engineering to economics, from social sciences to agronomy, all over the world.

The first books on data analytics that appeared some years ago were written by data scientists for other data scientists or for data science students. The majority of the people interested in these subjects were computing and statistics students. The books on data analytics were written mainly for them.

Nowadays, more and more people are interested in learning data analytics. Students of economics, management, biology, medicine, sociology, engineering, and some other subjects are willing to learn about data analytics. This book intends not only to provide a new, more friendly textbook for computing and statistics students, but also to open data analytics to those students who may know nothing about computing or statistics, but want to learn these subjects in a simple way. Those who have already studied subjects such as statistics will recognize some of the content described in this book, such as descriptive statistics. Students from computing will be familiar with pseudocode.

After reading this book, it is not expected that you will feel like a data scientist with the ability to create new methods, but it is expected that you might feel like a data analytics practitioner, able to drive a data analytics project, using the right methods to solve real problems.

João Mendes Moreira
University of Porto, Porto, Portugal

André C. P. L. F. de Carvalho
University of São Paulo, São Carlos, Brazil

Tomáš Horváth
Eötvös Loránd University in Budapest
Pavol Jozef Šafárik University in Košice

October, 2017