
Data Mining Algorithms, Explained using R - Wiley PDF

714 Pages·2016·6.38 MB·English

Preview Data Mining Algorithms, Explained using R - Wiley

Trim size: 170mm x 244mm   Cichosz   ffirs.tex   V3 - 11/04/2014   10:23 A.M.   Page ii

Data Mining Algorithms: Explained Using R

Paweł Cichosz
Department of Electronics and Information Technology
Warsaw University of Technology
Poland

This edition first published 2015
© 2015 by John Wiley & Sons, Ltd

Registered office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Cichosz, Pawel, author.
Data mining algorithms : explained using R / Pawel Cichosz.
pages cm
Summary: “This book narrows down the scope of data mining by adopting a heavily modeling-oriented perspective” – Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-33258-0 (hardback)
1. Data mining. 2. Computer algorithms. 3. R (Computer program language) I. Title.
QA76.9.D343C472 2015
006.3′12–dc23
2014036992

A catalogue record for this book is available from the British Library.

ISBN: 9781118332580

Typeset in 10/12pt Times by Laserwords Private Limited, Chennai, India

Contents

Acknowledgements
Preface
References

Part I Preliminaries

1 Tasks
  1.1 Introduction
    1.1.1 Knowledge
    1.1.2 Inference
  1.2 Inductive learning tasks
    1.2.1 Domain
    1.2.2 Instances
    1.2.3 Attributes
    1.2.4 Target attribute
    1.2.5 Input attributes
    1.2.6 Training set
    1.2.7 Model
    1.2.8 Performance
    1.2.9 Generalization
    1.2.10 Overfitting
    1.2.11 Algorithms
    1.2.12 Inductive learning as search
  1.3 Classification
    1.3.1 Concept
    1.3.2 Training set
    1.3.3 Model
    1.3.4 Performance
    1.3.5 Generalization
    1.3.6 Overfitting
    1.3.7 Algorithms
  1.4 Regression
    1.4.1 Target function
    1.4.2 Training set
    1.4.3 Model
    1.4.4 Performance
    1.4.5 Generalization
    1.4.6 Overfitting
    1.4.7 Algorithms
  1.5 Clustering
    1.5.1 Motivation
    1.5.2 Training set
    1.5.3 Model
    1.5.4 Crisp vs. soft clustering
    1.5.5 Hierarchical clustering
    1.5.6 Performance
    1.5.7 Generalization
    1.5.8 Algorithms
    1.5.9 Descriptive vs. predictive clustering
  1.6 Practical issues
    1.6.1 Incomplete data
    1.6.2 Noisy data
  1.7 Conclusion
  1.8 Further readings
  References

2 Basic statistics
  2.1 Introduction
  2.2 Notational conventions
  2.3 Basic statistics as modeling
  2.4 Distribution description
    2.4.1 Continuous attributes
    2.4.2 Discrete attributes
    2.4.3 Confidence intervals
    2.4.4 m-Estimation
  2.5 Relationship detection
    2.5.1 Significance tests
    2.5.2 Continuous attributes
    2.5.3 Discrete attributes
    2.5.4 Mixed attributes
    2.5.5 Relationship detection caveats
  2.6 Visualization
    2.6.1 Boxplot
    2.6.2 Histogram
    2.6.3 Barplot
  2.7 Conclusion
  2.8 Further readings
  References

Part II Classification

3 Decision trees
  3.1 Introduction
  3.2 Decision tree model
    3.2.1 Nodes and branches
    3.2.2 Leaves
    3.2.3 Split types
  3.3 Growing
    3.3.1 Algorithm outline
    3.3.2 Class distribution calculation
    3.3.3 Class label assignment
    3.3.4 Stop criteria
    3.3.5 Split selection
    3.3.6 Split application
    3.3.7 Complete process
  3.4 Pruning
    3.4.1 Pruning operators
    3.4.2 Pruning criterion
    3.4.3 Pruning control strategy
    3.4.4 Conversion to rule sets
  3.5 Prediction
    3.5.1 Class label prediction
    3.5.2 Class probability prediction
  3.6 Weighted instances
  3.7 Missing value handling
    3.7.1 Fractional instances
    3.7.2 Surrogate splits
  3.8 Conclusion
  3.9 Further readings
  References

4 Naïve Bayes classifier
  4.1 Introduction
  4.2 Bayes rule
  4.3 Classification by Bayesian inference
    4.3.1 Conditional class probability
    4.3.2 Prior class probability
    4.3.3 Independence assumption
    4.3.4 Conditional attribute value probabilities
    4.3.5 Model construction
    4.3.6 Prediction
  4.4 Practical issues
    4.4.1 Zero and small probabilities
    4.4.2 Linear classification
    4.4.3 Continuous attributes
    4.4.4 Missing attribute values
    4.4.5 Reducing naïvety
  4.5 Conclusion
  4.6 Further readings
  References

5 Linear classification
  5.1 Introduction
  5.2 Linear representation
    5.2.1 Inner representation function
    5.2.2 Outer representation function
    5.2.3 Threshold representation
    5.2.4 Logit representation
  5.3 Parameter estimation
    5.3.1 Delta rule
    5.3.2 Gradient descent
    5.3.3 Distance to decision boundary
    5.3.4 Least squares
  5.4 Discrete attributes
  5.5 Conclusion
  5.6 Further readings
  References

6 Misclassification costs
  6.1 Introduction
  6.2 Cost representation
    6.2.1 Cost matrix
    6.2.2 Per-class cost vector
    6.2.3 Instance-specific costs
  6.3 Incorporating misclassification costs
    6.3.1 Instance weighting
    6.3.2 Instance resampling
    6.3.3 Minimum-cost rule
    6.3.4 Instance relabeling
  6.4 Effects of cost incorporation
  6.5 Experimental procedure
  6.6 Conclusion
  6.7 Further readings
  References

7 Classification model evaluation
  7.1 Introduction
    7.1.1 Dataset performance
    7.1.2 Training performance
    7.1.3 True performance
  7.2 Performance measures
    7.2.1 Misclassification error
    7.2.2 Weighted misclassification error
    7.2.3 Mean misclassification cost
    7.2.4 Confusion matrix
    7.2.5 ROC analysis
    7.2.6 Probabilistic performance measures
  7.3 Evaluation procedures
    7.3.1 Model evaluation vs. modeling procedure evaluation
    7.3.2 Evaluation caveats
    7.3.3 Hold-out
    7.3.4 Cross-validation
    7.3.5 Leave-one-out
    7.3.6 Bootstrapping
    7.3.7 Choosing the right procedure
    7.3.8 Evaluation procedures for temporal data
  7.4 Conclusion
  7.5 Further readings
  References

Part III Regression

8 Linear regression
  8.1 Introduction
  8.2 Linear representation
    8.2.1 Parametric representation
    8.2.2 Linear representation function
    8.2.3 Nonlinear representation functions
  8.3 Parameter estimation
    8.3.1 Mean square error minimization
    8.3.2 Delta rule
    8.3.3 Gradient descent
    8.3.4 Least squares
  8.4 Discrete attributes
  8.5 Advantages of linear models
  8.6 Beyond linearity
    8.6.1 Generalized linear representation
    8.6.2 Enhanced representation
    8.6.3 Polynomial regression
    8.6.4 Piecewise-linear regression
  8.7 Conclusion
  8.8 Further readings
  References

9 Regression trees
  9.1 Introduction
  9.2 Regression tree model
    9.2.1 Nodes and branches
    9.2.2 Leaves
    9.2.3 Split types
    9.2.4 Piecewise-constant regression
  9.3 Growing
    9.3.1 Algorithm outline
    9.3.2 Target function summary statistics
    9.3.3 Target value assignment
    9.3.4 Stop criteria
    9.3.5 Split selection
    9.3.6 Split application
    9.3.7 Complete process
  9.4 Pruning
    9.4.1 Pruning operators
    9.4.2 Pruning criterion
    9.4.3 Pruning control strategy
  9.5 Prediction
  9.6 Weighted instances
  9.7 Missing value handling
    9.7.1 Fractional instances
    9.7.2 Surrogate splits
  9.8 Piecewise linear regression
    9.8.1 Growing
    9.8.2 Pruning
    9.8.3 Prediction
  9.9 Conclusion
  9.10 Further readings
  References

10 Regression model evaluation
  10.1 Introduction
    10.1.1 Dataset performance
    10.1.2 Training performance
    10.1.3 True performance
  10.2 Performance measures
    10.2.1 Residuals
    10.2.2 Mean absolute error
    10.2.3 Mean square error
    10.2.4 Root mean square error
    10.2.5 Relative absolute error
    10.2.6 Coefficient of determination
    10.2.7 Correlation
    10.2.8 Weighted performance measures
    10.2.9 Loss functions
  10.3 Evaluation procedures
    10.3.1 Hold-out
    10.3.2 Cross-validation
