MAKING SENSE OF DATA I

A Practical Guide to Exploratory Data Analysis and Data Mining

Second Edition

GLENN J. MYATT
WAYNE P. JOHNSON

Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Myatt, Glenn J., 1969–
[Making sense of data]
Making sense of data I : a practical guide to exploratory data analysis and data mining / Glenn J. Myatt, Wayne P. Johnson. – Second edition.
pages cm
Revised edition of: Making sense of data. c2007.
Includes bibliographical references and index.
ISBN 978-1-118-40741-7 (paper)
1. Data mining. 2. Mathematical statistics. I. Johnson, Wayne P. II. Title.
QA276.M92 2014
006.3′12–dc23 2014007303

Printed in the United States of America

ISBN: 9781118407417

10 9 8 7 6 5 4 3 2 1

CONTENTS

PREFACE

1 INTRODUCTION
  1.1 Overview
  1.2 Sources of Data
  1.3 Process for Making Sense of Data
  1.4 Overview of Book
  1.5 Summary
  Further Reading

2 DESCRIBING DATA
  2.1 Overview
  2.2 Observations and Variables
  2.3 Types of Variables
  2.4 Central Tendency
  2.5 Distribution of the Data
  2.6 Confidence Intervals
  2.7 Hypothesis Tests
  Exercises
  Further Reading

3 PREPARING DATA TABLES
  3.1 Overview
  3.2 Cleaning the Data
  3.3 Removing Observations and Variables
  3.4 Generating Consistent Scales Across Variables
  3.5 New Frequency Distribution
  3.6 Converting Text to Numbers
  3.7 Converting Continuous Data to Categories
  3.8 Combining Variables
  3.9 Generating Groups
  3.10 Preparing Unstructured Data
  Exercises
  Further Reading

4 UNDERSTANDING RELATIONSHIPS
  4.1 Overview
  4.2 Visualizing Relationships Between Variables
  4.3 Calculating Metrics About Relationships
  Exercises
  Further Reading

5 IDENTIFYING AND UNDERSTANDING GROUPS
  5.1 Overview
  5.2 Clustering
  5.3 Association Rules
  5.4 Learning Decision Trees from Data
  Exercises
  Further Reading

6 BUILDING MODELS FROM DATA
  6.1 Overview
  6.2 Linear Regression
  6.3 Logistic Regression
  6.4 k-Nearest Neighbors
  6.5 Classification and Regression Trees
  6.6 Other Approaches
  Exercises
  Further Reading

APPENDIX A ANSWERS TO EXERCISES

APPENDIX B HANDS-ON TUTORIALS
  B.1 Tutorial Overview
  B.2 Access and Installation
  B.3 Software Overview
  B.4 Reading in Data
  B.5 Preparation Tools
  B.6 Tables and Graph Tools
  B.7 Statistics Tools
  B.8 Grouping Tools
  B.9 Models Tools
  B.10 Apply Model
  B.11 Exercises

BIBLIOGRAPHY

INDEX