DATA MINING
FOR BUSINESS ANALYTICS
Concepts, Techniques, and Applications in R
Galit Shmueli
Peter C. Bruce
Inbal Yahav
Nitin R. Patel
Kenneth C. Lichtendahl, Jr.
This edition first published 2018
© 2018 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
law. Advice on how to obtain permission to reuse material from this title is available at
http://www.wiley.com/go/permissions.
The right of Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl Jr. to be
identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at
www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The publisher and the authors make no representations or warranties with respect to the accuracy or completeness
of the contents of this work and specifically disclaim all warranties; including without limitation any implied
warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not
engaged in rendering professional services. The advice and strategies contained herein may not be suitable for
every situation. In view of on-going research, equipment modifications, changes in governmental regulations, and
the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader
is urged to review and evaluate the information provided in the package insert or instructions for each chemical,
piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of
usage and for added warnings and precautions. The fact that an organization or website is referred to in this work
as a citation and/or potential source of further information does not mean that the author or the publisher
endorses the information the organization or website may provide or recommendations it may make. Further,
readers should be aware that websites listed in this work may have changed or disappeared between when this
work was written and when it is read. No warranty may be created or extended by any promotional statements
for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication Data applied for
Hardback: 9781118879368
Cover Design: Wiley
Cover Image: © Achim Mittler, Frankfurt am Main/Gettyimages
Set in 11.5/14.5pt Bembo Std by Aptara Inc., New Delhi, India
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
The beginning of wisdom is this:
Get wisdom, and whatever else you get, get insight.
– Proverbs 4:7
Contents
Foreword by Gareth James
Foreword by Ravi Bapna
Preface to the R Edition
Acknowledgments
PART I PRELIMINARIES
CHAPTER 1 Introduction
1.1 What Is Business Analytics?
1.2 What Is Data Mining?
1.3 Data Mining and Related Terms
1.4 Big Data
1.5 Data Science
1.6 Why Are There So Many Different Methods?
1.7 Terminology and Notation
1.8 Road Maps to This Book
    Order of Topics
CHAPTER 2 Overview of the Data Mining Process
2.1 Introduction
2.2 Core Ideas in Data Mining
    Classification
    Prediction
    Association Rules and Recommendation Systems
    Predictive Analytics
    Data Reduction and Dimension Reduction
    Data Exploration and Visualization
    Supervised and Unsupervised Learning
2.3 The Steps in Data Mining
2.4 Preliminary Steps
    Organization of Datasets
    Predicting Home Values in the West Roxbury Neighborhood
    Loading and Looking at the Data in R
    Sampling from a Database
    Oversampling Rare Events in Classification Tasks
    Preprocessing and Cleaning the Data
2.5 Predictive Power and Overfitting
    Overfitting
    Creation and Use of Data Partitions
2.6 Building a Predictive Model
    Modeling Process
2.7 Using R for Data Mining on a Local Machine
2.8 Automating Data Mining Solutions
    Data Mining Software: The State of the Market (by Herb Edelstein)
Problems
PART II DATA EXPLORATION AND DIMENSION REDUCTION
CHAPTER 3 Data Visualization
3.1 Uses of Data Visualization
    Base R or ggplot?
3.2 Data Examples
    Example 1: Boston Housing Data
    Example 2: Ridership on Amtrak Trains
3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots
    Distribution Plots: Boxplots and Histograms
    Heatmaps: Visualizing Correlations and Missing Values
3.4 Multidimensional Visualization
    Adding Variables: Color, Size, Shape, Multiple Panels, and Animation
    Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering
    Reference: Trend Lines and Labels
    Scaling up to Large Datasets
    Multivariate Plot: Parallel Coordinates Plot
    Interactive Visualization
3.5 Specialized Visualizations
    Visualizing Networked Data
    Visualizing Hierarchical Data: Treemaps
    Visualizing Geographical Data: Map Charts
3.6 Summary: Major Visualizations and Operations, by Data Mining Goal
    Prediction
    Classification
    Time Series Forecasting
    Unsupervised Learning
Problems
CHAPTER 4 Dimension Reduction
4.1 Introduction
4.2 Curse of Dimensionality