ebook img

Data mining methods and models PDF

335 Pages·2006·4.718 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Data mining methods and models

SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 DATA MINING METHODS AND MODELS DANIEL T. LAROSE DepartmentofMathematicalSciences CentralConnecticutStateUniversity AJOHNWILEY&SONS,INCPUBLICATION iii SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 Copyright(cid:1)C 2006byJohnWiley&Sons,Inc.Allrightsreserved. PublishedbyJohnWiley&Sons,Inc.,Hoboken,NewJersey. PublishedsimultaneouslyinCanada. Nopartofthispublicationmaybereproduced,storedinaretrievalsystem,ortransmittedinanyform or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permittedunderSection107or108ofthe1976UnitedStatesCopyrightAct,withouteithertheprior writtenpermissionofthePublisher,orauthorizationthroughpaymentoftheappropriateper-copyfee totheCopyrightClearanceCenter,Inc.,222RosewoodDrive,Danvers,MA01923,978-750-8400,fax 978-646-8600,oronthewebatwww.copyright.com.RequeststothePublisherforpermissionshouldbe addressedtothePermissionsDepartment,JohnWiley&Sons,Inc.,111RiverStreet,Hoboken,NJ07030, (201)748–6011,fax(201)748–6008oronlineathttp://www.wiley.com/go/permission. LimitofLiability/DisclaimerofWarranty:Whilethepublisherandauthorhaveusedtheirbesteffortsin preparingthisbook,theymakenorepresentationsorwarrantieswithrespecttotheaccuracyorcompleteness ofthecontentsofthisbookandspecificallydisclaimanyimpliedwarrantiesofmerchantabilityorfitness foraparticularpurpose.Nowarrantymaybecreatedorextendedbysalesrepresentativesorwrittensales materials.Theadviceandstrategiescontainedhereinmaynotbesuitableforyoursituation.Youshould consultwithaprofessionalwhereappropriate.Neitherthepublishernorauthorshallbeliableforanyloss ofprofitoranyothercommercialdamages,includingbutnotlimitedtospecial,incidental,consequential, orotherdamages. ForgeneralinformationonourotherproductsandservicespleasecontactourCustomerCareDepartment withintheU.S.at877-762-2974,outsidetheU.S.at317-572-3993orfax317-572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however,maynotbeavailableinelectronicformat.FormoreinformationaboutWileyproducts,visitour websiteatwww.wiley.com LibraryofCongressCataloging-in-PublicationData: Larose,DanielT. Dataminingmethodsandmodels/DanielT.Larose. p. cm. Includesbibliographicalreferences. ISBN-13978-0-471-66656-1 ISBN-100-471-66656-4(cloth) 1.Datamining. I.Title. QA76.9.D343L3782005 005.74–dc22 2005010801 PrintedintheUnitedStatesofAmerica 10 9 8 7 6 5 4 3 2 1 iv SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 DEDICATION Tothosewhohavegonebefore, includingmyparents,ErnestLarose(1920–1981) andIreneLarose(1924–2005), andmydaughter,EllyrianeSoleilLarose(1997–1997); Forthosewhocomeafter, includingmydaughters,ChantalDanielleLarose(1988) andRavelRenaissanceLarose(1999), andmyson,TristanSpringLarose(1999). v SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 CONTENTS PREFACE xi 1 DIMENSIONREDUCTIONMETHODS 1 NeedforDimensionReductioninDataMining 1 PrincipalComponentsAnalysis 2 ApplyingPrincipalComponentsAnalysistotheHousesDataSet 5 HowManyComponentsShouldWeExtract? 9 ProfilingthePrincipalComponents 13 Communalities 15 ValidationofthePrincipalComponents 17 FactorAnalysis 18 ApplyingFactorAnalysistotheAdultDataSet 18 FactorRotation 20 User-DefinedComposites 23 ExampleofaUser-DefinedComposite 24 Summary 25 References 28 Exercises 28 2 REGRESSIONMODELING 33 ExampleofSimpleLinearRegression 34 Least-SquaresEstimates 36 CoefficientofDetermination 39 StandardErroroftheEstimate 43 CorrelationCoefficient 45 ANOVATable 46 Outliers,HighLeveragePoints,andInfluentialObservations 48 RegressionModel 55 InferenceinRegression 57 t-TestfortheRelationshipBetweenxandy 58 ConfidenceIntervalfortheSlopeoftheRegressionLine 60 ConfidenceIntervalfortheMeanValueofyGivenx 60 PredictionIntervalforaRandomlyChosenValueofyGivenx 61 VerifyingtheRegressionAssumptions 63 Example:BaseballDataSet 68 Example:CaliforniaDataSet 74 TransformationstoAchieveLinearity 79 Box–CoxTransformations 83 Summary 84 References 86 Exercises 86 vii SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 viii CONTENTS 3 MULTIPLEREGRESSIONANDMODELBUILDING 93 ExampleofMultipleRegression 93 MultipleRegressionModel 99 InferenceinMultipleRegression 100 t-TestfortheRelationshipBetweenyandxi 101 F-TestfortheSignificanceoftheOverallRegressionModel 102 ConfidenceIntervalforaParticularCoefficient 104 ConfidenceIntervalfortheMeanValueofyGivenx1,x2,...,xm 105 PredictionIntervalforaRandomlyChosenValueofyGivenx1,x2,...,xm 105 RegressionwithCategoricalPredictors 105 AdjustingR2:PenalizingModelsforIncludingPredictorsThatAre NotUseful 113 SequentialSumsofSquares 115 Multicollinearity 116 VariableSelectionMethods 123 PartialF-Test 123 ForwardSelectionProcedure 125 BackwardEliminationProcedure 125 StepwiseProcedure 126 BestSubsetsProcedure 126 All-Possible-SubsetsProcedure 126 ApplicationoftheVariableSelectionMethods 127 ForwardSelectionProcedureAppliedtotheCerealsDataSet 127 BackwardEliminationProcedureAppliedtotheCerealsDataSet 129 StepwiseSelectionProcedureAppliedtotheCerealsDataSet 131 BestSubsetsProcedureAppliedtotheCerealsDataSet 131 Mallows’CpStatistic 131 VariableSelectionCriteria 135 UsingthePrincipalComponentsasPredictors 142 Summary 147 References 149 Exercises 149 4 LOGISTICREGRESSION 155 SimpleExampleofLogisticRegression 156 MaximumLikelihoodEstimation 158 InterpretingLogisticRegressionOutput 159 Inference:ArethePredictorsSignificant? 160 InterpretingaLogisticRegressionModel 162 InterpretingaModelforaDichotomousPredictor 163 InterpretingaModelforaPolychotomousPredictor 166 InterpretingaModelforaContinuousPredictor 170 AssumptionofLinearity 174 Zero-CellProblem 177 MultipleLogisticRegression 179 IntroducingHigher-OrderTermstoHandleNonlinearity 183 ValidatingtheLogisticRegressionModel 189 WEKA:Hands-onAnalysisUsingLogisticRegression 194 Summary 197 SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 CONTENTS ix References 199 Exercises 199 5 NAIVEBAYESESTIMATIONANDBAYESIANNETWORKS 204 BayesianApproach 204 MaximumaPosterioriClassification 206 PosteriorOddsRatio 210 BalancingtheData 212 Na˙ıveBayesClassification 215 NumericPredictors 219 WEKA:Hands-onAnalysisUsingNaiveBayes 223 BayesianBeliefNetworks 227 ClothingPurchaseExample 227 UsingtheBayesianNetworktoFindProbabilities 229 WEKA:Hands-OnAnalysisUsingtheBayesNetClassifier 232 Summary 234 References 236 Exercises 237 6 GENETICALGORITHMS 240 IntroductiontoGeneticAlgorithms 240 BasicFrameworkofaGeneticAlgorithm 241 SimpleExampleofaGeneticAlgorithmatWork 243 ModificationsandEnhancements:Selection 245 ModificationsandEnhancements:Crossover 247 MultipointCrossover 247 UniformCrossover 247 GeneticAlgorithmsforReal-ValuedVariables 248 SingleArithmeticCrossover 248 SimpleArithmeticCrossover 248 WholeArithmeticCrossover 249 DiscreteCrossover 249 NormallyDistributedMutation 249 UsingGeneticAlgorithmstoTrainaNeuralNetwork 249 WEKA:Hands-onAnalysisUsingGeneticAlgorithms 252 Summary 261 References 262 Exercises 263 7 CASESTUDY:MODELINGRESPONSETODIRECTMAILMARKETING 265 Cross-IndustryStandardProcessforDataMining 265 BusinessUnderstandingPhase 267 DirectMailMarketingResponseProblem 267 BuildingtheCost/BenefitTable 267 DataUnderstandingandDataPreparationPhases 270 ClothingStoreDataSet 270 TransformationstoAchieveNormalityorSymmetry 272 StandardizationandFlagVariables 276 SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 x CONTENTS DerivingNewVariables 277 ExploringtheRelationshipsBetweenthePredictorsandtheResponse 278 InvestigatingtheCorrelationStructureAmongthePredictors 286 ModelingandEvaluationPhases 289 PrincipalComponentsAnalysis 292 ClusterAnalysis:BIRCHClusteringAlgorithm 294 BalancingtheTrainingDataSet 298 EstablishingtheBaselineModelPerformance 299 ModelCollectionA:UsingthePrincipalComponents 300 OverbalancingasaSurrogateforMisclassificationCosts 302 CombiningModels:Voting 304 ModelCollectionB:Non-PCAModels 306 CombiningModelsUsingtheMeanResponseProbabilities 308 Summary 312 References 316 INDEX 317 SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 PREFACE WHATISDATAMINING? Dataminingistheanalysisof(oftenlarge)observationaldatasetstofind unsuspectedrelationshipsandtosummarizethedatainnovelwaysthatareboth understandableandusefultothedataowner. —DavidHand,HeikkiMannila,andPadhraicSmyth,PrinciplesofDataMining, MITPress,Cambridge,MA,2001 Data mining is predicted to be “one of the most revolutionary developments of the nextdecade,”accordingtotheonlinetechnologymagazineZDNETNews(February 8,2001).Infact,theMITTechnologyReviewchosedataminingasoneof10emerging technologiesthatwillchangetheworld. Because data mining represents such an important field, Wiley-Interscience andIhaveteameduptopublishanewseriesondatamining,initiallyconsistingof threevolumes.Thefirstvolumeinthisseries,DiscoveringKnowledgeinData:An IntroductiontoDataMining,appearedin2005andintroducedthereadertothisrapidly growingfield.Thesecondvolumeintheseries,DataMiningMethodsandModels, explores the process of data mining from the point of view of model building: the developmentofcomplexandpowerfulpredictivemodelsthatcandeliveractionable resultsforawiderangeofbusinessandresearchproblems. WHYISTHISBOOKNEEDED? DataMiningMethodsandModelscontinuesthethrustofDiscoveringKnowledgein Data,providingthereaderwith: (cid:1) Modelsandtechniquestouncoverhiddennuggetsofinformation (cid:1) Insightintohowthedataminingalgorithmsreallywork (cid:1) Experienceofactuallyperformingdataminingonlargedatasets “WHITE-BOX”APPROACH:UNDERSTANDING THEUNDERLYINGALGORITHMICANDMODEL STRUCTURES The best way to avoid costly errors stemming from a blind black-box approach to data mining is to instead apply a “white-box” methodology, which emphasizes an xi SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 xii PREFACE understandingofthealgorithmicandstatisticalmodelstructuresunderlyingthesoft- ware. DataMiningMethodsandModelsappliesthewhite-boxapproachby: (cid:1) Walkingthereaderthroughthevariousalgorithms (cid:1) Providing examples of the operation of the algorithm on actual large data sets (cid:1) Testingthereader’slevelofunderstandingoftheconceptsandalgorithms (cid:1) Providinganopportunityforthereadertodosomerealdataminingonlarge datasets AlgorithmWalk-Throughs DataMiningMethodsandModelswalksthereaderthroughtheoperationsandnu- ancesofthevariousalgorithms,usingsmallsampledatasets,sothatthereadergets atrueappreciationofwhatisreallygoingoninsidethealgorithm.Forexample,in Chapter 2 we observe how a single new data value can seriously alter the model results.Also,inChapter6weproceedstepbysteptofindtheoptimalsolutionusing theselection,crossover,andmutationoperators. ApplicationsoftheAlgorithmsandModelstoLargeDataSets DataMiningMethodsandModelsprovidesexamplesoftheapplicationofthevar- iousalgorithmsandmodelsonactuallargedatasets.Forexample,inChapter3we analyticallyunlocktherelationshipbetweennutritionratingandcerealcontentusing areal-worlddataset.InChapter1weapplyprincipalcomponentsanalysistoreal- worldcensusdataaboutCalifornia.Alldatasetsareavailablefromthebookseries Website:www.dataminingconsultant.com. ChapterExercises:CheckingtoMakeSureThat YouUnderstandIt DataMiningMethodsandModelsincludesover110chapterexercises,whichallow readerstoassesstheirdepthofunderstandingofthematerial,aswellashavingalittle funplayingwithnumbersanddata.TheseincludeClarifyingtheConceptexercises, which help to clarify some of the more challenging concepts in data mining, and WorkingwiththeDataexercises,whichchallengethereadertoapplytheparticular dataminingalgorithmtoasmalldatasetand,stepbystep,toarriveatacomputation- allysoundsolution.Forexample,inChapter5readersareaskedtofindthemaximum aposterioriclassificationforthedatasetandnetworkprovidedinthechapter. Hands-onAnalysis:LearnDataMiningbyDoingDataMining Chapters1to6providethereaderwithhands-onanalysisproblems,representingan opportunityforthereadertoapplyhisorhernewlyacquireddataminingexpertiseto SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0 SOFTWARE xiii solvingrealproblemsusinglargedatasets.Manypeoplelearnbydoing.DataMining MethodsandModelsprovidesaframeworkbywhichthereadercanlearndatamining bydoingdatamining.Forexample,inChapter4readersarechallengedtoapproach areal-worldcreditapprovalclassificationdataset,andconstructtheirbestpossible logisticregressionmodelusingthemethodslearnedinthischaptertoprovidestrong interpretive support for the model, including explanations of derived and indicator variables. CaseStudy:BringingItAllTogether Data Mining Methods and Models culminates in a detailed case study, Modeling ResponsetoDirectMailMarketing.Herethereaderhastheopportunitytoseehow everythingthatheorshehaslearnedisbroughtalltogethertocreateactionableand profitablesolutions.Thecasestudyincludesover50pagesofgraphical,exploratory data analysis, predictive modeling, and customer profiling, and offers different so- lutions,dependingontherequisitesoftheclient.Themodelsareevaluatedusinga custom-builtcost/benefittable,reflectingthetruecostsofclassificationerrorsrather than the usual methods, such as overall error rate. Thus, the analyst can compare modelsusingtheestimatedprofitpercustomercontacted,andcanpredicthowmuch moneythemodelswillearnbasedonthenumberofcustomerscontacted. DATAMININGASAPROCESS DataMiningMethodsandModelscontinuesthecoverageofdataminingasaprocess. TheparticularstandardprocessusedistheCRISP–DMframework:theCross-Industry Standard Process for Data Mining. CRISP–DM demands that data mining be seen asanentireprocess,fromcommunicationofthebusinessproblem,throughdatacol- lectionandmanagement,datapreprocessing,modelbuilding,modelevaluation,and finally,modeldeployment.Therefore,thisbookisnotonlyforanalystsandmanagers butalsofordatamanagementprofessionals,databaseanalysts,anddecisionmakers. SOFTWARE Thesoftwareusedinthisbookincludesthefollowing: (cid:1) Clementinedataminingsoftwaresuite (cid:1) SPSSstatisticalsoftware (cid:1) Minitabstatisticalsoftware (cid:1) WEKAopen-sourcedataminingsoftware Clementine (http://www.spss.com/clementine/), one of the most widelyuseddataminingsoftwaresuites,isdistributedbySPSS,whosebasesoftware isalsousedinthisbook.SPSSisavailablefordownloadonatrialbasisfromtheir

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.