
Data Mining and Predictive Analytics PDF

827 Pages·2015·6.28 MB·English

Preview Data Mining and Predictive Analytics

DATA MINING AND PREDICTIVE ANALYTICS

WILEY SERIES ON METHODS AND APPLICATIONS IN DATA MINING
Series Editor: Daniel T. Larose

Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition / Daniel T. Larose and Chantal D. Larose
Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data / Darius M. Dziuda
Knowledge Discovery with Support Vector Machines / Lutz Hamel
Data-Mining on the Web: Uncovering Patterns in Web Content, Structure, and Usage / Zdravko Markov and Daniel T. Larose
Data Mining Methods and Models / Daniel T. Larose
Practical Text Mining with Perl / Roger Bilisoly
Data Mining and Predictive Analytics / Daniel T. Larose and Chantal D. Larose

DATA MINING AND PREDICTIVE ANALYTICS
Second Edition
DANIEL T. LAROSE
CHANTAL D. LAROSE

Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:
Larose, Daniel T.
Data mining and predictive analytics / Daniel T. Larose, Chantal D. Larose.
pages cm. – (Wiley series on methods and applications in data mining)
Includes bibliographical references and index.
ISBN 978-1-118-11619-7 (cloth)
1. Data mining. 2. Prediction theory. I. Larose, Chantal D. II. Title.
QA76.9.D343L3776 2015
006.3′12–dc23 2014043340

Set in 10/12pt Times by Laserwords Private Limited, Chennai, India
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
2 2015

To those who have gone before us,
And to those who come after us,
In the Family Tree of Life…

CONTENTS

PREFACE
ACKNOWLEDGMENTS

PART I DATA PREPARATION

CHAPTER 1 AN INTRODUCTION TO DATA MINING AND PREDICTIVE ANALYTICS
1.1 What is Data Mining? What is Predictive Analytics?
1.2 Wanted: Data Miners
1.3 The Need for Human Direction of Data Mining
1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM
    1.4.1 CRISP-DM: The Six Phases
1.5 Fallacies of Data Mining
1.6 What Tasks Can Data Mining Accomplish
    1.6.1 Description
    1.6.2 Estimation
    1.6.3 Prediction
    1.6.4 Classification
    1.6.5 Clustering
    1.6.6 Association
The R Zone
R References
Exercises

CHAPTER 2 DATA PREPROCESSING
2.1 Why do We Need to Preprocess the Data?
2.2 Data Cleaning
2.3 Handling Missing Data
2.4 Identifying Misclassifications
2.5 Graphical Methods for Identifying Outliers
2.6 Measures of Center and Spread
2.7 Data Transformation
2.8 Min–Max Normalization
2.9 Z-Score Standardization
2.10 Decimal Scaling
2.11 Transformations to Achieve Normality
2.12 Numerical Methods for Identifying Outliers
2.13 Flag Variables
2.14 Transforming Categorical Variables into Numerical Variables
2.15 Binning Numerical Variables
2.16 Reclassifying Categorical Variables
2.17 Adding an Index Field
2.18 Removing Variables that are not Useful
2.19 Variables that Should Probably not be Removed
2.20 Removal of Duplicate Records
2.21 A Word About ID Fields
The R Zone
R Reference
Exercises

CHAPTER 3 EXPLORATORY DATA ANALYSIS
3.1 Hypothesis Testing Versus Exploratory Data Analysis
3.2 Getting to Know the Data Set
3.3 Exploring Categorical Variables
3.4 Exploring Numeric Variables
3.5 Exploring Multivariate Relationships
3.6 Selecting Interesting Subsets of the Data for Further Investigation
3.7 Using EDA to Uncover Anomalous Fields
3.8 Binning Based on Predictive Value
3.9 Deriving New Variables: Flag Variables
3.10 Deriving New Variables: Numerical Variables
3.11 Using EDA to Investigate Correlated Predictor Variables
3.12 Summary of Our EDA
The R Zone
R References
Exercises

CHAPTER 4 DIMENSION-REDUCTION METHODS
4.1 Need for Dimension-Reduction in Data Mining
4.2 Principal Components Analysis
4.3 Applying PCA to the Houses Data Set
4.4 How Many Components Should We Extract?
    4.4.1 The Eigenvalue Criterion
    4.4.2 The Proportion of Variance Explained Criterion
    4.4.3 The Minimum Communality Criterion
    4.4.4 The Scree Plot Criterion
4.5 Profiling the Principal Components
4.6 Communalities
    4.6.1 Minimum Communality Criterion
4.7 Validation of the Principal Components
4.8 Factor Analysis
4.9 Applying Factor Analysis to the Adult Data Set
4.10 Factor Rotation
4.11 User-Defined Composites

Description:
Learn methods of data analysis and their application to real-world data sets. This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified
