Table Of ContentSPH SPH
JWDD006-FM JWDD006-Larose November23,2005 14:49 CharCount=0
DATA MINING
METHODS AND
MODELS
DANIEL T. LAROSE
Department of Mathematical Sciences
Central Connecticut State University
A JOHN WILEY & SONS, INC PUBLICATION
Copyright © 2006 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee
to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax
978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030,
(201) 748–6011, fax (201) 748–6008 or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or completeness
of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness
for a particular purpose. No warranty may be created or extended by sales representatives or written sales
materials. The advice and strategies contained herein may not be suitable for your situation. You should
consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss
of profit or any other commercial damages, including but not limited to special, incidental, consequential,
or other damages.
For general information on our other products and services please contact our Customer Care Department
within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic format. For more information about Wiley products, visit our
web site at www.wiley.com
Library of Congress Cataloging-in-Publication Data:
Larose, Daniel T.
Data mining methods and models / Daniel T. Larose.
p. cm.
Includes bibliographical references.
ISBN-13 978-0-471-66656-1
ISBN-10 0-471-66656-4 (cloth)
1. Data mining. I. Title.
QA76.9.D343L378 2005
005.74–dc22
2005010801
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
DEDICATION
To those who have gone before,
including my parents, Ernest Larose (1920–1981)
and Irene Larose (1924–2005),
and my daughter, Ellyriane Soleil Larose (1997–1997);
For those who come after,
including my daughters, Chantal Danielle Larose (1988)
and Ravel Renaissance Larose (1999),
and my son, Tristan Spring Larose (1999).
CONTENTS
PREFACE
1 DIMENSION REDUCTION METHODS
Need for Dimension Reduction in Data Mining
Principal Components Analysis
Applying Principal Components Analysis to the Houses Data Set
How Many Components Should We Extract?
Profiling the Principal Components
Communalities
Validation of the Principal Components
Factor Analysis
Applying Factor Analysis to the Adult Data Set
Factor Rotation
User-Defined Composites
Example of a User-Defined Composite
Summary
References
Exercises
2 REGRESSION MODELING
Example of Simple Linear Regression
Least-Squares Estimates
Coefficient of Determination
Standard Error of the Estimate
Correlation Coefficient
ANOVA Table
Outliers, High Leverage Points, and Influential Observations
Regression Model
Inference in Regression
t-Test for the Relationship Between x and y
Confidence Interval for the Slope of the Regression Line
Confidence Interval for the Mean Value of y Given x
Prediction Interval for a Randomly Chosen Value of y Given x
Verifying the Regression Assumptions
Example: Baseball Data Set
Example: California Data Set
Transformations to Achieve Linearity
Box–Cox Transformations
Summary
References
Exercises
3 MULTIPLE REGRESSION AND MODEL BUILDING
Example of Multiple Regression
Multiple Regression Model
Inference in Multiple Regression
t-Test for the Relationship Between y and x_i
F-Test for the Significance of the Overall Regression Model
Confidence Interval for a Particular Coefficient
Confidence Interval for the Mean Value of y Given x_1, x_2, ..., x_m
Prediction Interval for a Randomly Chosen Value of y Given x_1, x_2, ..., x_m
Regression with Categorical Predictors
Adjusting R²: Penalizing Models for Including Predictors That Are Not Useful
Sequential Sums of Squares
Multicollinearity
Variable Selection Methods
Partial F-Test
Forward Selection Procedure
Backward Elimination Procedure
Stepwise Procedure
Best Subsets Procedure
All-Possible-Subsets Procedure
Application of the Variable Selection Methods
Forward Selection Procedure Applied to the Cereals Data Set
Backward Elimination Procedure Applied to the Cereals Data Set
Stepwise Selection Procedure Applied to the Cereals Data Set
Best Subsets Procedure Applied to the Cereals Data Set
Mallows’ Cp Statistic
Variable Selection Criteria
Using the Principal Components as Predictors
Summary
References
Exercises
4 LOGISTIC REGRESSION
Simple Example of Logistic Regression
Maximum Likelihood Estimation
Interpreting Logistic Regression Output
Inference: Are the Predictors Significant?
Interpreting a Logistic Regression Model
Interpreting a Model for a Dichotomous Predictor
Interpreting a Model for a Polychotomous Predictor
Interpreting a Model for a Continuous Predictor
Assumption of Linearity
Zero-Cell Problem
Multiple Logistic Regression
Introducing Higher-Order Terms to Handle Nonlinearity
Validating the Logistic Regression Model
WEKA: Hands-on Analysis Using Logistic Regression
Summary