SPH SPH JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0

DATA MINING METHODS AND MODELS

DANIEL T. LAROSE
Department of Mathematical Sciences
Central Connecticut State University

A JOHN WILEY & SONS, INC PUBLICATION

Copyright © 2006 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008 or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format. For more information about Wiley products, visit our website at www.wiley.com

Library of Congress Cataloging-in-Publication Data:

Larose, Daniel T.
  Data mining methods and models / Daniel T. Larose.
    p. cm.
  Includes bibliographical references.
  ISBN-13 978-0-471-66656-1
  ISBN-10 0-471-66656-4 (cloth)
  1. Data mining. I. Title.
QA76.9.D343L378 2005
005.74–dc22
2005010801

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

DEDICATION

To those who have gone before,
including my parents, Ernest Larose (1920–1981)
and Irene Larose (1924–2005),
and my daughter, Ellyriane Soleil Larose (1997–1997);

For those who come after,
including my daughters, Chantal Danielle Larose (1988)
and Ravel Renaissance Larose (1999),
and my son, Tristan Spring Larose (1999).

CONTENTS

PREFACE

1 DIMENSION REDUCTION METHODS
    Need for Dimension Reduction in Data Mining
    Principal Components Analysis
    Applying Principal Components Analysis to the Houses Data Set
    How Many Components Should We Extract?
    Profiling the Principal Components
    Communalities
    Validation of the Principal Components
    Factor Analysis
    Applying Factor Analysis to the Adult Data Set
    Factor Rotation
    User-Defined Composites
    Example of a User-Defined Composite
    Summary
    References
    Exercises

2 REGRESSION MODELING
    Example of Simple Linear Regression
    Least-Squares Estimates
    Coefficient of Determination
    Standard Error of the Estimate
    Correlation Coefficient
    ANOVA Table
    Outliers, High Leverage Points, and Influential Observations
    Regression Model
    Inference in Regression
    t-Test for the Relationship Between x and y
    Confidence Interval for the Slope of the Regression Line
    Confidence Interval for the Mean Value of y Given x
    Prediction Interval for a Randomly Chosen Value of y Given x
    Verifying the Regression Assumptions
    Example: Baseball Data Set
    Example: California Data Set
    Transformations to Achieve Linearity
    Box–Cox Transformations
    Summary
    References
    Exercises

3 MULTIPLE REGRESSION AND MODEL BUILDING
    Example of Multiple Regression
    Multiple Regression Model
    Inference in Multiple Regression
    t-Test for the Relationship Between y and x_i
    F-Test for the Significance of the Overall Regression Model
    Confidence Interval for a Particular Coefficient
    Confidence Interval for the Mean Value of y Given x_1, x_2, ..., x_m
    Prediction Interval for a Randomly Chosen Value of y Given x_1, x_2, ..., x_m
    Regression with Categorical Predictors
    Adjusting R^2: Penalizing Models for Including Predictors That Are Not Useful
    Sequential Sums of Squares
    Multicollinearity
    Variable Selection Methods
    Partial F-Test
    Forward Selection Procedure
    Backward Elimination Procedure
    Stepwise Procedure
    Best Subsets Procedure
    All-Possible-Subsets Procedure
    Application of the Variable Selection Methods
    Forward Selection Procedure Applied to the Cereals Data Set
    Backward Elimination Procedure Applied to the Cereals Data Set
    Stepwise Selection Procedure Applied to the Cereals Data Set
    Best Subsets Procedure Applied to the Cereals Data Set
    Mallows' C_p Statistic
    Variable Selection Criteria
    Using the Principal Components as Predictors
    Summary
    References
    Exercises

4 LOGISTIC REGRESSION
    Simple Example of Logistic Regression
    Maximum Likelihood Estimation
    Interpreting Logistic Regression Output
    Inference: Are the Predictors Significant?
    Interpreting a Logistic Regression Model
    Interpreting a Model for a Dichotomous Predictor
    Interpreting a Model for a Polychotomous Predictor
    Interpreting a Model for a Continuous Predictor
    Assumption of Linearity
    Zero-Cell Problem
    Multiple Logistic Regression
    Introducing Higher-Order Terms to Handle Nonlinearity
    Validating the Logistic Regression Model
    WEKA: Hands-on Analysis Using Logistic Regression
    Summary