WY045-FM September 24, 2004 10:2

DISCOVERING KNOWLEDGE IN DATA
An Introduction to Data Mining

DANIEL T. LAROSE
Director of Data Mining
Central Connecticut State University

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2005 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose
p. cm.
Includes bibliographical references and index.
ISBN 0-471-66657-2 (cloth)
1. Data mining. I. Title.
QA76.9.D343L38 2005
006.3'12—dc22
2004003680

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Dedication

To my parents,
And their parents,
And so on...

For my children,
And their children,
And so on...

2004 Chantal Larose

CONTENTS

PREFACE

1 INTRODUCTION TO DATA MINING
What Is Data Mining?
Why Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP–DM
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
    Description
    Estimation
    Prediction
    Classification
    Clustering
    Association
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
Case Study 3: Mining Association Rules from Legal Databases
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis
References
Exercises

2 DATA PREPROCESSING
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
    Min–Max Normalization
    Z-Score Standardization
Numerical Methods for Identifying Outliers
References
Exercises
Description: