HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS

“Great introduction to the real-world process of data mining. The overviews, practical advice, tutorials, and extra DVD material make this book an invaluable resource for both new and experienced data miners.”
Karl Rexer, Ph.D. (President and Founder of Rexer Analytics, Boston, Massachusetts, www.RexerAnalytics.com)

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
H. G. Wells (1866–1946)

“Today we aren’t quite to the place that H. G. Wells predicted years ago, but society is getting closer out of necessity. Global businesses and organizations are being forced to use statistical analysis and data mining applications in a format that combines art and science–intuition and expertise in collecting and understanding data in order to make accurate models that realistically predict the future that lead to informed strategic decisions thus allowing correct actions ensuring success, before it is too late ... today, numeracy is as essential as literacy. As John Elder likes to say: ‘Go data mining!’ It really does save enormous time and money. For those with the patience and faith to get through the early stages of business understanding and data transformation, the cascade of results can be extremely rewarding.”
Gary Miner, March, 2009

ROBERT NISBET
Pacific Capital Bankcorp N.A., Santa Barbara, CA

JOHN ELDER
Elder Research, Inc., Charlottesville, VA

GARY MINER
StatSoft, Inc., Tulsa, Oklahoma

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
525 B Street, Suite 1900, San Diego, California 92101-4495, USA
84 Theobald’s Road, London WC1X 8RR, UK

Copyright © 2009, Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: [email protected]. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data
Nisbet, Robert, 1942-
Handbook of statistical analysis and data mining applications / Robert Nisbet, John Elder, Gary Miner.
p. cm.
Includes index.
ISBN 978-0-12-374765-5 (hardcover : alk. paper)
1. Data mining–Statistical methods. I. Elder, John F. (John Fletcher) II. Miner, Gary. III. Title.
QA76.9.D343N57 2009
006.3'12–dc22
2009008997

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-12-374765-5

For information on all Academic Press publications visit our Web site at www.elsevierdirect.com

Printed in Canada
09 10 9 8 7 6 5 4 3 2 1

Table of Contents

Foreword 1 xv
Foreword 2 xvii
Preface xix
Introduction xxiii
List of Tutorials by Guest Authors xxix

I. HISTORY OF PHASES OF DATA ANALYSIS, BASIC THEORY, AND THE DATA MINING PROCESS

1. The Background for Data Mining Practice
  Preamble 3
  A Short History of Statistics and Data Mining 4
  Modern Statistics: A Duality? 5
  Assumptions of the Parametric Model 6
  Two Views of Reality 8
  Aristotle 8
  Plato 9
  The Rise of Modern Statistical Analysis: The Second Generation 10
  Data, Data Everywhere ... 11
  Machine Learning Methods: The Third Generation 11
  Statistical Learning Theory: The Fourth Generation 12
  Postscript 13

2. Theoretical Considerations for Data Mining
  Preamble 15
  The Scientific Method 16
  What Is Data Mining? 17
  A Theoretical Framework for the Data Mining Process 18
  Microeconomic Approach 19
  Inductive Database Approach 19
  Strengths of the Data Mining Process 19
  Customer-Centric Versus Account-Centric: A New Way to Look at Your Data 20
  The Physical Data Mart 20
  The Virtual Data Mart 21
  Householded Databases 21
  The Data Paradigm Shift 22
  Creation of the Car 22
  Major Activities of Data Mining 23
  Major Challenges of Data Mining 25
  Examples of Data Mining Applications 26
  Major Issues in Data Mining 26
  General Requirements for Success in a Data Mining Project 28
  Example of a Data Mining Project: Classify a Bat’s Species by Its Sound 28
  The Importance of Domain Knowledge 30
  Postscript 30
  Why Did Data Mining Arise? 30
  Some Caveats with Data Mining Solutions 31

3. The Data Mining Process
  Preamble 33
  The Science of Data Mining 33
  The Approach to Understanding and Problem Solving 34
  CRISP-DM 35
  Business Understanding (Mostly Art) 36
  Define the Business Objectives of the Data Mining Model 36
  Assess the Business Environment for Data Mining 37
  Formulate the Data Mining Goals and Objectives 37
  Data Understanding (Mostly Science) 39
  Data Acquisition 39
  Data Integration 39
  Data Description 40
  Data Quality Assessment 40
  Data Preparation (A Mixture of Art and Science) 40
  Modeling (A Mixture of Art and Science) 41
  Steps in the Modeling Phase of CRISP-DM 41
  Deployment (Mostly Art) 45
  Closing the Information Loop (Art) 46
  The Art of Data Mining 46
  Artistic Steps in Data Mining 47
  Postscript 47

4. Data Understanding and Preparation
  Preamble 49
  Activities of Data Understanding and Preparation 50
  Definitions 50
  Issues That Should be Resolved 51
  Basic Issues That Must Be Resolved in Data Understanding 51
  Basic Issues That Must Be Resolved in Data Preparation 51
  Data Understanding 51
  Data Acquisition 51
  Data Extraction 53
  Data Description 54
  Data Assessment 56
  Data Profiling 56
  Data Cleansing 56
  Data Transformation 57
  Data Imputation 59
  Data Weighting and Balancing 62
  Data Filtering and Smoothing 64
  Data Abstraction 66
  Data Reduction 69
  Data Sampling 69
  Data Discretization 73
  Data Derivation 73
  Postscript 75

5. Feature Selection
  Preamble 77
  Variables as Features 78
  Types of Feature Selections 78
  Feature Ranking Methods 78
  Gini Index 78
  Bi-variate Methods 80
  Multivariate Methods 80
  Complex Methods 82
  Subset Selection Methods 82
  The Other Two Ways of Using Feature Selection in STATISTICA: Interactive Workspace 93
  STATISTICA DMRecipe Method 93
  Postscript 96

6. Accessory Tools for Doing Data Mining
  Preamble 99
  Data Access Tools 100
  Structured Query Language (SQL) Tools 100
  Extract, Transform, and Load (ETL) Capabilities 100
  Data Exploration Tools 101
  Basic Descriptive Statistics 101
  Combining Groups (Classes) for Predictive Data Mining 105
  Slicing/Dicing and Drilling Down into Data Sets/Results Spreadsheets 106
  Modeling Management Tools 107
  Data Miner Workspace Templates 107
  Modeling Analysis Tools 107
  Feature Selection 107
  Importance Plots of Variables 108
  In-Place Data Processing (IDP) 113
  Example: The IDP Facility of STATISTICA Data Miner 114
  How to Use the SQL 114
  Rapid Deployment of Predictive Models 114
  Model Monitors 116
  Postscript 117

II. THE ALGORITHMS IN DATA MINING AND TEXT MINING, THE ORGANIZATION OF THE THREE MOST COMMON DATA MINING TOOLS, AND SELECTED SPECIALIZED AREAS USING DATA MINING

7. Basic Algorithms for Data Mining: A Brief Overview
  Preamble 121
  STATISTICA Data Miner Recipe (DMRecipe) 123
  KXEN 124
  Basic Data Mining Algorithms 126
  Association Rules 126
  Neural Networks 128
  Radial Basis Function (RBF) Networks 136
  Automated Neural Nets 138
  Generalized Additive Models (GAMs) 138
  Outputs of GAMs 139
  Interpreting Results of GAMs 139
  Classification and Regression Trees (CART) 139
  Recursive Partitioning 144
  Pruning Trees 144
  General Comments about CART for Statisticians 144
  Advantages of CART over Other Decision Trees 145
  Uses of CART 146
  General CHAID Models 146
  Advantages of CHAID 147
  Disadvantages of CHAID 147
  Generalized EM and k-Means Cluster Analysis: An Overview 147
  k-Means Clustering 147
  EM Cluster Analysis 148
  Processing Steps of the EM Algorithm 149
  V-fold Cross-Validation as Applied to Clustering 149
  Postscript 150

8. Advanced Algorithms for Data Mining
  Preamble 151
  Advanced Data Mining Algorithms 154
  Interactive Trees 154
  Multivariate Adaptive Regression Splines (MARSplines) 158
  Statistical Learning Theory: Support Vector Machines 162
  Sequence, Association, and Link Analyses 164
  Independent Components Analysis (ICA) 168
  Kohonen Networks 169
  Characteristics of a Kohonen Network 169
  Quality Control Data Mining and Root Cause Analysis 169
  Image and Object Data Mining: Visualization and 3D-Medical and Other Scanning Imaging 170
  Postscript 171

9. Text Mining and Natural Language Processing
  Preamble 173
  The Development of Text Mining 174
  A Practical Example: NTSB 175
  Goals of Text Mining of NTSB Accident Reports 184
  Drilling into Words of Interest 188
  Means with Error Plots 189
  Feature Selection Tool 190
  A Conclusion: Losing Control of the Aircraft in Bad Weather Is Often Fatal 191
  Summary 194
  Text Mining Concepts Used in Conducting Text Mining Studies 194
  Postscript 194

10. The Three Most Common Data Mining Software Tools
  Preamble 197
  SPSS Clementine Overview 197
  Overall Organization of Clementine Components 198
  Organization of the Clementine Interface 199
  Clementine Interface Overview 199
  Setting the Default Directory 201
  SuperNodes 201
  Execution of Streams 202
  SAS-Enterprise Miner (SAS-EM) Overview 203
  Overall Organization of SAS-EM Version 5.3 Components 203
  Layout of the SAS-Enterprise Miner Window 204
  Various SAS-EM Menus, Dialogs, and Windows Useful During the Data Mining Process 205
  Software Requirements to Run SAS-EM 5.3 Software 206
  STATISTICA Data Miner, QC-Miner, and Text Miner Overview 214
  Overall Organization and Use of STATISTICA Data Miner 214
  Three Formats for Doing Data Mining in STATISTICA 230
  Postscript 234

11. Classification
  Preamble 235
  What Is Classification? 235
  Initial Operations in Classification 236
  Major Issues with Classification 236
  What Is the Nature of Data Set to Be Classified? 236
  How Accurate Does the Classification Have to Be? 236
  How Understandable Do the Classes Have to Be? 237
  Assumptions of Classification Procedures 237
  Numerical Variables Operate Best 237
  No Missing Values 237
  Variables Are Linear and Independent in Their Effects on the Target Variable 237
  Methods for Classification 238
  Nearest-Neighbor Classifiers 239
  Analyzing Imbalanced Data Sets with Machine Learning Programs 240
  CHAID 246
  Random Forests and Boosted Trees 248
  Logistic Regression 250
  Neural Networks 251
  Naïve Bayesian Classifiers 253
  What Is the Best Algorithm for Classification? 256
  Postscript 257

12. Numerical Prediction
  Preamble 259
  Linear Response Analysis and the Assumptions of the Parametric Model 260
  Parametric Statistical Analysis 261
  Assumptions of the Parametric Model 262
  The Assumption of Independency 262
  The Assumption of Normality 262
  Normality and the Central Limit Theorem 263
  The Assumption of Linearity 264
  Linear Regression 264
  Methods for Handling Variable Interactions in Linear Regression 265
  Collinearity among Variables in a Linear Regression 265
  The Concept of the Response Surface 266
  Generalized Linear Models (GLMs) 270
  Methods for Analyzing Nonlinear Relationships 271
  Nonlinear Regression and Estimation 271
  Logit and Probit Regression 272
  Poisson Regression 272
  Exponential Distributions 272
  Piecewise Linear Regression 273
  Data Mining and Machine Learning Algorithms Used in Numerical Prediction 274
  Numerical Prediction with C&RT 274
  Model Results Available in C&RT 276
  Advantages of Classification and Regression Trees (C&RT) Methods 277
  General Issues Related to C&RT 279
  Application to Mixed Models 280
  Neural Nets for Prediction 280
  Manual or Automated Operation? 280
  Structuring the Network for Manual Operation 280
  Modern Neural Nets Are “Gray Boxes” 281
  Example of Automated Neural Net Results 281
  Support Vector Machines (SVMs) and Other Kernel Learning Algorithms 282
  Postscript 284

13. Model Evaluation and Enhancement
  Preamble 285
  Introduction 286
  Model Evaluation 286
  Splitting Data 287