
Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications PDF

298 Pages·2009·8.58 MB·English

Preview Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications

MAKING SENSE OF DATA II
A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications

GLENN J. MYATT
WAYNE P. JOHNSON

Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:
Myatt, Glenn J., 1969
Making sense of data II: a practical guide to data visualization, advanced data mining methods, and applications / Glenn J. Myatt, Wayne P. Johnson.
p. cm.
Making sense of data 2
Includes bibliographical references and index.
ISBN 9780470222805 (pbk.)
1. Data mining. 2. Information visualization. I. Johnson, Wayne P. II. Title. III. Title: Making sense of data 2.
QA76.9.D343M93 2008
005.74 dc22 2008024103
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

CONTENTS

PREFACE

1 INTRODUCTION
  1.1 Overview
  1.2 Definition
  1.3 Preparation
    1.3.1 Overview
    1.3.2 Accessing Tabular Data
    1.3.3 Accessing Unstructured Data
    1.3.4 Understanding the Variables and Observations
    1.3.5 Data Cleaning
    1.3.6 Transformation
    1.3.7 Variable Reduction
    1.3.8 Segmentation
    1.3.9 Preparing Data to Apply
  1.4 Analysis
    1.4.1 Data Mining Tasks
    1.4.2 Optimization
    1.4.3 Evaluation
    1.4.4 Model Forensics
  1.5 Deployment
  1.6 Outline of Book
    1.6.1 Overview
    1.6.2 Data Visualization
    1.6.3 Clustering
    1.6.4 Predictive Analytics
    1.6.5 Applications
    1.6.6 Software
  1.7 Summary
  1.8 Further Reading

2 DATA VISUALIZATION
  2.1 Overview
  2.2 Visualization Design Principles
    2.2.1 General Principles
    2.2.2 Graphics Design
    2.2.3 Anatomy of a Graph
  2.3 Tables
    2.3.1 Simple Tables
    2.3.2 Summary Tables
    2.3.3 Two-Way Contingency Tables
    2.3.4 Supertables
  2.4 Univariate Data Visualization
    2.4.1 Bar Chart
    2.4.2 Histograms
    2.4.3 Frequency Polygram
    2.4.4 Box Plots
    2.4.5 Dot Plot
    2.4.6 Stem and Leaf Plot
    2.4.7 Quantile Plot
    2.4.8 Quantile-Quantile Plot
  2.5 Bivariate Data Visualization
    2.5.1 Scatterplot
  2.6 Multivariate Data Visualization
    2.6.1 Histogram Matrix
    2.6.2 Scatterplot Matrix
    2.6.3 Multiple Box Plot
    2.6.4 Trellis Plot
  2.7 Visualizing Groups
    2.7.1 Dendrograms
    2.7.2 Decision Trees
    2.7.3 Cluster Image Maps
  2.8 Dynamic Techniques
    2.8.1 Overview
    2.8.2 Data Brushing
    2.8.3 Nearness Selection
    2.8.4 Sorting and Rearranging
    2.8.5 Searching and Filtering
  2.9 Summary
  2.10 Further Reading

3 CLUSTERING
  3.1 Overview
  3.2 Distance Measures
    3.2.1 Overview
    3.2.2 Numeric Distance Measures
    3.2.3 Binary Distance Measures
    3.2.4 Mixed Variables
    3.2.5 Other Measures
  3.3 Agglomerative Hierarchical Clustering
    3.3.1 Overview
    3.3.2 Single Linkage
    3.3.3 Complete Linkage
    3.3.4 Average Linkage
    3.3.5 Other Methods
    3.3.6 Selecting Groups
  3.4 Partitioned-Based Clustering
    3.4.1 Overview
    3.4.2 k-Means
    3.4.3 Worked Example
    3.4.4 Miscellaneous Partitioned-Based Clustering
  3.5 Fuzzy Clustering
    3.5.1 Overview
    3.5.2 Fuzzy k-Means
    3.5.3 Worked Examples
  3.6 Summary
  3.7 Further Reading

4 PREDICTIVE ANALYTICS
  4.1 Overview
    4.1.1 Predictive Modeling
    4.1.2 Testing Model Accuracy
    4.1.3 Evaluating Regression Models' Predictive Accuracy
    4.1.4 Evaluating Classification Models' Predictive Accuracy
    4.1.5 Evaluating Binary Models' Predictive Accuracy
    4.1.6 ROC Charts
    4.1.7 Lift Chart
  4.2 Principal Component Analysis
    4.2.1 Overview
    4.2.2 Principal Components
    4.2.3 Generating Principal Components
    4.2.4 Interpretation of Principal Components
  4.3 Multiple Linear Regression
    4.3.1 Overview
    4.3.2 Generating Models
    4.3.3 Prediction
    4.3.4 Analysis of Residuals
    4.3.5 Standard Error
    4.3.6 Coefficient of Multiple Determination
    4.3.7 Testing the Model Significance
    4.3.8 Selecting and Transforming Variables
  4.4 Discriminant Analysis
    4.4.1 Overview
    4.4.2 Discriminant Function
    4.4.3 Discriminant Analysis Example
  4.5 Logistic Regression
    4.5.1 Overview
    4.5.2 Logistic Regression Formula
    4.5.3 Estimating Coefficients
    4.5.4 Assessing and Optimizing Results
  4.6 Naive Bayes Classifiers
    4.6.1 Overview
    4.6.2 Bayes Theorem and the Independence Assumption
    4.6.3 Independence Assumption
    4.6.4 Classification Process
  4.7 Summary
  4.8 Further Reading

5 APPLICATIONS
  5.1 Overview
  5.2 Sales and Marketing
  5.3 Industry-Specific Data Mining
    5.3.1 Finance
    5.3.2 Insurance
    5.3.3 Retail
    5.3.4 Telecommunications
    5.3.5 Manufacturing
    5.3.6 Entertainment
    5.3.7 Government
    5.3.8 Pharmaceuticals
    5.3.9 Healthcare
  5.4 microRNA Data Analysis Case Study
    5.4.1 Defining the Problem
    5.4.2 Preparing the Data
    5.4.3 Analysis
  5.5 Credit Scoring Case Study
    5.5.1 Defining the Problem
    5.5.2 Preparing the Data
    5.5.3 Analysis
    5.5.4 Deployment
  5.6 Data Mining Nontabular Data
    5.6.1 Overview
    5.6.2 Data Mining Chemical Data
    5.6.3 Data Mining Text
  5.7 Further Reading

APPENDIX A MATRICES
  A.1 Overview of Matrices
  A.2 Matrix Addition
  A.3 Matrix Multiplication
  A.4 Transpose of a Matrix
  A.5 Inverse of a Matrix

APPENDIX B SOFTWARE
  B.1 Software Overview
    B.1.1 Software Objectives
    B.1.2 Access and Installation
    B.1.3 User Interface Overview
  B.2 Data Preparation
    B.2.1 Overview
    B.2.2 Reading in Data
    B.2.3 Searching the Data
    B.2.4 Variable Characterization
    B.2.5 Removing Observations and Variables
    B.2.6 Cleaning the Data
    B.2.7 Transforming the Data
    B.2.8 Segmentation
    B.2.9 Principal Component Analysis
  B.3 Tables and Graphs
    B.3.1 Overview
    B.3.2 Contingency Tables
    B.3.3 Summary Tables
    B.3.4 Graphs
    B.3.5 Graph Matrices
  B.4 Statistics
    B.4.1 Overview
    B.4.2 Descriptive Statistics
    B.4.3 Confidence Intervals
    B.4.4 Hypothesis Tests
    B.4.5 Chi-Square Test
    B.4.6 ANOVA
    B.4.7 Comparative Statistics
  B.5 Grouping
    B.5.1 Overview
    B.5.2 Clustering
    B.5.3 Associative Rules
    B.5.4 Decision Trees
  B.6 Prediction
    B.6.1 Overview
    B.6.2 Linear Regression
    B.6.3 Discriminant Analysis
    B.6.4 Logistic Regression
    B.6.5 Naive Bayes
    B.6.6 kNN
    B.6.7 CART
    B.6.8 Neural Networks
    B.6.9 Apply Model

BIBLIOGRAPHY
INDEX

PREFACE

The purpose of this book is to outline a diverse range of commonly used approaches to making and communicating decisions from data, using data visualization, clustering, and predictive analytics. The book relates these topics to how they can be used in practice in a variety of ways. First, the methods outlined in the book are discussed within the context of a data mining process that starts with defining the problem and ends with deployment of the results. Second, each method is outlined in detail, including a discussion of when and how they should be used. Third, examples are provided throughout to further illustrate how the methods operate. Fourth, there is a detailed discussion of applications in which these approaches are being applied today. Finally, software called Traceis™, which can be used with the examples in the book or with data sets of interest to the reader, is available for downloading from a companion website.

The book is aimed towards professionals in any discipline who are interested in making decisions from data in addition to understanding how data mining can be used. Undergraduate and graduate students taking courses in data mining through a Bachelors, Masters, or MBA program could use the book as a resource. The approaches have been outlined to an extent that software professionals could use the book to gain insight into the principles of data visualization and advanced data mining algorithms in order to help in the development of new software products.

The book is organized into five chapters and two appendices.

• Chapter 1 Introduction: The first chapter reviews the material in the book within the context of the overall data mining process. Defining the problem, preparing the data, performing the analysis, and deploying any results are critical steps. When and how each of the methods described in the book can be applied to this process are described.
• Chapter 2 Data Visualization: The second chapter reviews principles and methods for understanding and communicating data through the use of data visualizations. The chapter outlines ways of visualizing single variables, the relationships between two or more variables, groupings in the data, along with dynamic approaches to interacting with the data through graphical user interfaces.
• Chapter 3 Clustering: Chapter 3 outlines in detail common approaches to clustering data sets and includes a detailed explanation of methods for determining the distance between observations and techniques for clustering observations. Three popular clustering approaches are discussed: agglomerative hierarchical clustering, partitioned-based clustering, and fuzzy clustering.

Description:
A hands-on guide to making valuable decisions from data using advanced data mining methods and techniques This second installment in the Making Sense of Data series continues to explore a diverse range of commonly used approaches to making and communicating decisions from data. Delving into more tec
