
Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery PDF

487 Pages·2015·3.798 MB·English

Preview Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery

Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery

Walter W. Piegorsch
University of Arizona, USA

This edition first published 2015
© 2015 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Piegorsch, Walter W.
Statistical data analytics: foundations for data mining, informatics, and knowledge discovery / Walter W. Piegorsch.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-61965-0 (cloth: alk. paper)
1. Data mining–Mathematics. 2. Mathematical statistics. I. Title.
QA76.9.D343P535 2015
006.3′12--dc23
2015015327

A catalogue record for this book is available from the British Library.

Typeset in 10/12pt Times LT Std by SPi Global, Chennai, India

1 2015

To Karen

Contents

Preface

Part I Background: Introductory Statistical Analytics

1 Data analytics and data mining
  1.1 Knowledge discovery: finding structure in data
  1.2 Data quality versus data quantity
  1.3 Statistical modeling versus statistical description

2 Basic probability and statistical distributions
  2.1 Concepts in probability
    2.1.1 Probability rules
    2.1.2 Random variables and probability functions
    2.1.3 Means, variances, and expected values
    2.1.4 Median, quartiles, and quantiles
    2.1.5 Bivariate expected values, covariance, and correlation
  2.2 Multiple random variables∗
  2.3 Univariate families of distributions
    2.3.1 Binomial distribution
    2.3.2 Poisson distribution
    2.3.3 Geometric distribution
    2.3.4 Negative binomial distribution
    2.3.5 Discrete uniform distribution
    2.3.6 Continuous uniform distribution
    2.3.7 Exponential distribution
    2.3.8 Gamma and chi-square distributions
    2.3.9 Normal (Gaussian) distribution
    2.3.10 Distributions derived from normal
    2.3.11 The exponential family

3 Data manipulation
  3.1 Random sampling
  3.2 Data types
  3.3 Data summarization
    3.3.1 Means, medians, and central tendency
    3.3.2 Summarizing variation
    3.3.3 Summarizing (bivariate) correlation
  3.4 Data diagnostics and data transformation
    3.4.1 Outlier analysis
    3.4.2 Entropy∗
    3.4.3 Data transformation
  3.5 Simple smoothing techniques
    3.5.1 Binning
    3.5.2 Moving averages∗
    3.5.3 Exponential smoothing∗

4 Data visualization and statistical graphics
  4.1 Univariate visualization
    4.1.1 Strip charts and dot plots
    4.1.2 Boxplots
    4.1.3 Stem-and-leaf plots
    4.1.4 Histograms and density estimators
    4.1.5 Quantile plots
  4.2 Bivariate and multivariate visualization
    4.2.1 Pie charts and bar charts
    4.2.2 Multiple boxplots and QQ plots
    4.2.3 Scatterplots and bubble plots
    4.2.4 Heatmaps
    4.2.5 Time series plots∗

5 Statistical inference
  5.1 Parameters and likelihood
  5.2 Point estimation
    5.2.1 Bias
    5.2.2 The method of moments
    5.2.3 Least squares/weighted least squares
    5.2.4 Maximum likelihood∗
  5.3 Interval estimation
    5.3.1 Confidence intervals
    5.3.2 Single-sample intervals for normal (Gaussian) parameters
    5.3.3 Two-sample intervals for normal (Gaussian) parameters
    5.3.4 Wald intervals and likelihood intervals∗
    5.3.5 Delta method intervals∗
    5.3.6 Bootstrap intervals∗
  5.4 Testing hypotheses
    5.4.1 Single-sample tests for normal (Gaussian) parameters
    5.4.2 Two-sample tests for normal (Gaussian) parameters
    5.4.3 Wald tests, likelihood ratio tests, and ‘exact’ tests∗
  5.5 Multiple inferences∗
    5.5.1 Bonferroni multiplicity adjustment
    5.5.2 False discovery rate

Part II Statistical Learning and Data Analytics

6 Techniques for supervised learning: simple linear regression
  6.1 What is “supervised learning?”
  6.2 Simple linear regression
    6.2.1 The simple linear model
    6.2.2 Multiple inferences and simultaneous confidence bands
  6.3 Regression diagnostics
  6.4 Weighted least squares (WLS) regression
  6.5 Correlation analysis
    6.5.1 The correlation coefficient
    6.5.2 Rank correlation

7 Techniques for supervised learning: multiple linear regression
  7.1 Multiple linear regression
    7.1.1 Matrix formulation
    7.1.2 Weighted least squares for the MLR model
    7.1.3 Inferences under the MLR model
    7.1.4 Multicollinearity
  7.2 Polynomial regression
  7.3 Feature selection
    7.3.1 R_p^2 plots
    7.3.2 Information criteria: AIC and BIC
    7.3.3 Automated variable selection
  7.4 Alternative regression methods∗
    7.4.1 Loess
    7.4.2 Regularization: ridge regression
    7.4.3 Regularization and variable selection: the Lasso
  7.5 Qualitative predictors: ANOVA models

8 Supervised learning: generalized linear models
  8.1 Extending the linear regression model
    8.1.1 Nonnormal data and the exponential family
    8.1.2 Link functions
  8.2 Technical details for GLiMs∗
    8.2.1 Estimation
    8.2.2 The deviance function
    8.2.3 Residuals
    8.2.4 Inference and model assessment
  8.3 Selected forms of GLiMs
    8.3.1 Logistic regression and binary-data GLiMs
