A General Introduction to Data Analytics
João Mendes Moreira
University of Porto
André C. P. L. F. de Carvalho
University of São Paulo
Tomáš Horváth
Eötvös Loránd University in Budapest
Pavol Jozef Šafárik University in Košice
This edition first published 2019
© 2019 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or
otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this
title is available at http://www.wiley.com/go/permissions.
The right of João Moreira, André de Carvalho, and Tomáš Horváth to be identified as the author(s) of
this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products
visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content
that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and
the constant flow of information relating to the use of experimental reagents, equipment, and devices,
the reader is urged to review and evaluate the information provided in the package insert or instructions
for each chemical, piece of equipment, reagent, or device for, among other things, any changes in
the instructions or indication of usage and for added warnings and precautions. While the publisher and
authors have used their best efforts in preparing this work, they make no representations or warranties
with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation any implied warranties of merchantability or fitness for a
particular purpose. No warranty may be created or extended by sales representatives, written sales materials
or promotional statements for this work. The fact that an organization, website, or product is referred to
in this work as a citation and/or potential source of further information does not mean that the publisher
and authors endorse the information or services the organization, website, or product may provide or
recommendations it may make. This work is sold with the understanding that the publisher is not engaged
in rendering professional services. The advice and strategies contained herein may not be suitable for
your situation. You should consult with a specialist where appropriate. Further, readers should be aware
that websites listed in this work may have changed or disappeared between when this work was written
and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other
commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Moreira, João, 1969– author. | Carvalho, André Carlos Ponce de Leon Ferreira, author. |
Horváth, Tomáš, 1976– author.
Title: A general introduction to data analytics / by João Mendes Moreira, André C. P. L. F. de Carvalho,
Tomáš Horváth.
Description: Hoboken, NJ : John Wiley & Sons, 2019. | Includes bibliographical references and index. |
Identifiers: LCCN 2017060728 (print) | LCCN 2018005929 (ebook) | ISBN
9781119296256 (pdf) | ISBN 9781119296263 (epub) | ISBN 9781119296249 (cloth)
Subjects: LCSH: Mathematical statistics–Methodology. | Electronic data processing. | Data mining.
Classification: LCC QA276.4 (ebook) | LCC QA276.4 .M664 2018 (print) | DDC 519.50285–dc23
LC record available at https://lccn.loc.gov/2017060728
Cover image: © agsandrew/Shutterstock
Cover design by Wiley
Printed in the United States of America.
Set in 10/12pt Warnock by SPi Global, Pondicherry, India
To the women at home that make my life better: Mamã, Yá and Yé – João
To my family, Valeria, Beatriz, Gabriela and Mariana – André
To my wife Danielle – Tomáš
Contents
Preface
Acknowledgments
Presentational Conventions
About the Companion Website
Part I Introductory Background
1 What Can We Do With Data?
1.1 Big Data and Data Science
1.2 Big Data Architectures
1.3 Small Data
1.4 What is Data?
1.5 A Short Taxonomy of Data Analytics
1.6 Examples of Data Use
1.6.1 Breast Cancer in Wisconsin
1.6.2 Polish Company Insolvency Data
1.7 A Project on Data Analytics
1.7.1 A Little History on Methodologies for Data Analytics
1.7.2 The KDD Process
1.7.3 The CRISP-DM Methodology
1.8 How this Book is Organized
1.9 Who Should Read this Book
Part II Getting Insights from Data
2 Descriptive Statistics
2.1 Scale Types
2.2 Descriptive Univariate Analysis
2.2.1 Univariate Frequencies
2.2.2 Univariate Data Visualization
2.2.3 Univariate Statistics
2.2.4 Common Univariate Probability Distributions
2.3 Descriptive Bivariate Analysis
2.3.1 Two Quantitative Attributes
2.3.2 Two Qualitative Attributes, at Least one of them Nominal
2.3.3 Two Ordinal Attributes
2.4 Final Remarks
2.5 Exercises
3 Descriptive Multivariate Analysis
3.1 Multivariate Frequencies
3.2 Multivariate Data Visualization
3.3 Multivariate Statistics
3.3.1 Location Multivariate Statistics
3.3.2 Dispersion Multivariate Statistics
3.4 Infographics and Word Clouds
3.4.1 Infographics
3.4.2 Word Clouds
3.5 Final Remarks
3.6 Exercises
4 Data Quality and Preprocessing
4.1 Data Quality
4.1.1 Missing Values
4.1.2 Redundant Data
4.1.3 Inconsistent Data
4.1.4 Noisy Data
4.1.5 Outliers
4.2 Converting to a Different Scale Type
4.2.1 Converting Nominal to Relative
4.2.2 Converting Ordinal to Relative or Absolute
4.2.3 Converting Relative or Absolute to Ordinal or Nominal
4.3 Converting to a Different Scale
4.4 Data Transformation
4.5 Dimensionality Reduction
4.5.1 Attribute Aggregation
4.5.1.1 Principal Component Analysis
4.5.1.2 Independent Component Analysis
4.5.1.3 Multidimensional Scaling
4.5.2 Attribute Selection
4.5.2.1 Filters
4.5.2.2 Wrappers
4.5.2.3 Embedded
4.5.2.4 Search Strategies
4.6 Final Remarks
4.7 Exercises
5 Clustering
5.1 Distance Measures
5.1.1 Differences between Values of Common Attribute Types
5.1.2 Distance Measures for Objects with Quantitative Attributes
5.1.3 Distance Measures for Non-conventional Attributes
5.2 Clustering Validation
5.3 Clustering Techniques
5.3.1 K-means
5.3.1.1 Centroids and Distance Measures
5.3.1.2 How K-means Works
5.3.2 DBSCAN
5.3.3 Agglomerative Hierarchical Clustering Technique
5.3.3.1 Linkage Criterion
5.3.3.2 Dendrograms
5.4 Final Remarks
5.5 Exercises
6 Frequent Pattern Mining
6.1 Frequent Itemsets
6.1.1 Setting the min_sup Threshold
6.1.2 Apriori – a Join-based Method
6.1.3 Eclat
6.1.4 FP-Growth
6.1.5 Maximal and Closed Frequent Itemsets
6.2 Association Rules
6.3 Behind Support and Confidence
6.3.1 Cross-support Patterns
6.3.2 Lift
6.3.3 Simpson’s Paradox
6.4 Other Types of Pattern
6.4.1 Sequential patterns
6.4.2 Frequent Sequence Mining
6.4.3 Closed and Maximal Sequences
6.5 Final Remarks
6.6 Exercises
7 Cheat Sheet and Project on Descriptive Analytics
7.1 Cheat Sheet of Descriptive Analytics
7.1.1 On Data Summarization
7.1.2 On Clustering
7.1.3 On Frequent Pattern Mining
7.2 Project on Descriptive Analytics
7.2.1 Business Understanding
7.2.2 Data Understanding
7.2.3 Data Preparation
7.2.4 Modeling
7.2.5 Evaluation
7.2.6 Deployment
Part III Predicting the Unknown
8 Regression
8.1 Predictive Performance Estimation
8.1.1 Generalization
8.1.2 Model Validation
8.1.3 Predictive Performance Measures for Regression
8.2 Finding the Parameters of the Model
8.2.1 Linear Regression
8.2.1.1 Empirical Error
8.2.2 The Bias-variance Trade-off
8.2.3 Shrinkage Methods
8.2.3.1 Ridge Regression
8.2.3.2 Lasso Regression
8.2.4 Methods that use Linear Combinations of Attributes
8.2.4.1 Principal Components Regression
8.2.4.2 Partial Least Squares Regression
8.3 Technique and Model Selection
8.4 Final Remarks
8.5 Exercises
9 Classification
9.1 Binary Classification
9.2 Predictive Performance Measures for Classification
9.3 Distance-based Learning Algorithms
9.3.1 K-nearest Neighbor Algorithms
9.3.2 Case-based Reasoning
9.4 Probabilistic Classification Algorithms
9.4.1 Logistic Regression Algorithm
9.4.2 Naive Bayes Algorithm
9.5 Final Remarks
9.6 Exercises
10 Additional Predictive Methods
10.1 Search-based Algorithms
10.1.1 Decision Tree Induction Algorithms
10.1.2 Decision Trees for Regression
10.1.2.1 Model Trees
10.1.2.2 Multivariate Adaptive Regression Splines
10.2 Optimization-based Algorithms
10.2.1 Artificial Neural Networks
10.2.1.1 Backpropagation
10.2.1.2 Deep Networks and Deep Learning Algorithms
10.2.2 Support Vector Machines
10.2.2.1 SVM for Regression
10.3 Final Remarks
10.4 Exercises
11 Advanced Predictive Topics
11.1 Ensemble Learning
11.1.1 Bagging
11.1.2 Random Forests
11.1.3 AdaBoost
11.2 Algorithm Bias
11.3 Non-binary Classification Tasks
11.3.1 One-class Classification
11.3.2 Multi-class Classification
11.3.3 Ranking Classification
11.3.4 Multi-label Classification
11.3.5 Hierarchical Classification
11.4 Advanced Data Preparation Techniques for Prediction
11.4.1 Imbalanced Data Classification
11.4.2 For Incomplete Target Labeling
11.4.2.1 Semi-supervised Learning
11.4.2.2 Active Learning
11.5 Description and Prediction with Supervised Interpretable Techniques
11.6 Exercises
12 Cheat Sheet and Project on Predictive Analytics
12.1 Cheat Sheet on Predictive Analytics
12.2 Project on Predictive Analytics
12.2.1 Business Understanding
12.2.2 Data Understanding
12.2.3 Data Preparation
12.2.4 Modeling
12.2.5 Evaluation
12.2.6 Deployment
Part IV Popular Data Analytics Applications
13 Applications for Text, Web and Social Media
13.1 Working with Texts
13.1.1 Data Acquisition
13.1.2 Feature Extraction
13.1.2.1 Tokenization
13.1.2.2 Stemming
13.1.2.3 Conversion to Structured Data
13.1.2.4 Is the Bag of Words Enough?
13.1.3 Remaining Phases
13.1.4 Trends
13.1.4.1 Sentiment Analysis
13.1.4.2 Web Mining
13.2 Recommender Systems
13.2.1 Feedback
13.2.2 Recommendation Tasks
13.2.3 Recommendation Techniques
13.2.3.1 Knowledge-based Techniques
13.2.3.2 Content-based Techniques
13.2.3.3 Collaborative Filtering Techniques
13.2.4 Final Remarks
13.3 Social Network Analysis
13.3.1 Representing Social Networks
13.3.2 Basic Properties of Nodes
13.3.2.1 Degree
13.3.2.2 Distance
13.3.2.3 Closeness
13.3.2.4 Betweenness
13.3.2.5 Clustering Coefficient
13.3.3 Basic and Structural Properties of Networks
13.3.3.1 Diameter
13.3.3.2 Centralization
13.3.3.3 Cliques
13.3.3.4 Clustering Coefficient
13.3.3.5 Modularity
13.3.4 Trends and Final Remarks
13.4 Exercises
Appendix A: Comprehensive Description of the CRISP-DM Methodology
References
Index