Predictive Statistics
Analysis and Inference beyond Models

All scientific disciplines prize predictive success. Conventional statistical analyses, however, treat prediction as secondary, instead focusing on modeling and hence on estimation, testing, and detailed physical interpretation, tackling these tasks before the predictive adequacy of a model is established. This book outlines a fully predictive approach to statistical problems based on studying predictors; the approach does not require that predictors correspond to a model although this important special case is included in the general approach. Throughout, the point is to examine predictive performance before considering conventional inference. These ideas are traced through five traditional subfields of statistics, helping readers to refocus and adopt a directly predictive outlook. The book also considers prediction via contemporary ‘blackbox’ techniques and emerging data types and methodologies, where conventional modeling is so difficult that good prediction is the main criterion available for evaluating the performance of a statistical method. Well-documented open-source R code in a Github repository allows readers to replicate examples and apply techniques to other investigations.

BERTRAND S. CLARKE is Chair of the Department of Statistics at the University of Nebraska, Lincoln. His research focuses on predictive statistics and statistical methodology in genomic data. He is a fellow of the American Statistical Association, serves as editor or associate editor for three journals, and has published numerous papers in several statistical fields as well as a book on data mining and machine learning.

JENNIFER L. CLARKE is Professor of Food Science and Technology, Professor of Statistics, and Director of the Quantitative Life Sciences Initiative at the University of Nebraska, Lincoln. Her current interests include statistical methodology for metagenomics and also prediction, statistical computation, and multitype data analysis.
She serves on the steering committee of the Midwest Big Data Hub and is Co-Principal Investigator on an award from the NSF focused on data challenges in digital agriculture.

CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS

Editorial Board
Z. Ghahramani (Department of Engineering, University of Cambridge)
R. Gill (Mathematical Institute, Leiden University)
F. P. Kelly (Department of Pure Mathematics and Mathematical Statistics, University of Cambridge)
B. D. Ripley (Department of Statistics, University of Oxford)
S. Ross (Department of Industrial and Systems Engineering, University of Southern California)
M. Stein (Department of Statistics, University of Chicago)

This series of high-quality upper-division textbooks and expository monographs covers all aspects of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, optimization, and mathematical programming. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.

A complete list of books in the series can be found at www.cambridge.org/statistics.
Recent titles include the following:

20. Random Graph Dynamics, by Rick Durrett
21. Networks, by Peter Whittle
22. Saddlepoint Approximations with Applications, by Ronald W. Butler
23. Applied Asymptotics, by A. R. Brazzale, A. C. Davison and N. Reid
24. Random Networks for Communication, by Massimo Franceschetti and Ronald Meester
25. Design of Comparative Experiments, by R. A. Bailey
26. Symmetry Studies, by Marlos A. G. Viana
27. Model Selection and Model Averaging, by Gerda Claeskens and Nils Lid Hjort
28. Bayesian Nonparametrics, edited by Nils Lid Hjort et al.
29. From Finite Sample to Asymptotic Methods in Statistics, by Pranab K. Sen, Julio M. Singer and Antonio C. Pedrosa de Lima
30. Brownian Motion, by Peter Mörters and Yuval Peres
31. Probability (Fourth Edition), by Rick Durrett
33. Stochastic Processes, by Richard F. Bass
34. Regression for Categorical Data, by Gerhard Tutz
35. Exercises in Probability (Second Edition), by Loïc Chaumont and Marc Yor
36. Statistical Principles for the Design of Experiments, by R. Mead, S. G. Gilmour and A. Mead
37. Quantum Stochastics, by Mou-Hsiung Chang
38. Nonparametric Estimation under Shape Constraints, by Piet Groeneboom and Geurt Jongbloed
39. Large Sample Covariance Matrices and High-Dimensional Data Analysis, by Jianfeng Yao, Shurong Zheng and Zhidong Bai
40. Mathematical Foundations of Infinite-Dimensional Statistical Models, by Evarist Giné and Richard Nickl
41. Confidence, Likelihood, Probability, by Tore Schweder and Nils Lid Hjort
42. Probability on Trees and Networks, by Russell Lyons and Yuval Peres
43. Random Graphs and Complex Networks (Volume 1), by Remco van der Hofstad
44. Fundamentals of Nonparametric Bayesian Inference, by Subhashis Ghosal and Aad van der Vaart
45. Long-Range Dependence and Self-Similarity, by Vladas Pipiras and Murad S. Taqqu
46. Predictive Statistics, by Bertrand S. Clarke and Jennifer L. Clarke

Predictive Statistics
Analysis and Inference beyond Models

Bertrand S. Clarke
University of Nebraska, Lincoln

Jennifer L. Clarke
University of Nebraska, Lincoln

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107028289
DOI: 10.1017/9781139236003

© Bertrand S. Clarke and Jennifer L. Clarke 2018

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2018
Printed in the United States of America by Sheridan Books, Inc.
A catalogue record for this publication is available from the British Library.
ISBN 978-1-107-02828-9 Hardback
Additional resources for this publication at www.cambridge.org/predictivestatistics

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Expanded Contents
Preface

Part I The Predictive View
1 Why Prediction?
2 Defining a Predictive Paradigm
3 What about Modeling?
4 Models and Predictors: A Bickering Couple

Part II Established Settings for Prediction
5 Time Series
6 Longitudinal Data
7 Survival Analysis
8 Nonparametric Methods
9 Model Selection

Part III Contemporary Prediction
10 Blackbox Techniques
11 Ensemble Methods
12 The Future of Prediction

References
Index

Expanded Contents

Preface

Part I The Predictive View

1 Why Prediction?
  1.1 Motivating the Predictive Stance
  1.2 Some Examples
    1.2.1 Prediction with Ensembles rather than Models
    1.2.2 Hypothesis Testing as Prediction
    1.2.3 Predicting Classes
  1.3 General Issues

2 Defining a Predictive Paradigm
  2.1 The Sunrise Problem
  2.2 Parametric Families
    2.2.1 Frequentist Parametric Case
    2.2.2 Bayesian Parametric Case
    2.2.3 Interpretation
  2.3 The Abstract Version
    2.3.1 Frequentism
    2.3.2 Bayes Approach
    2.3.3 Survey Sampling
    2.3.4 Predictivist Approach
  2.4 A Unified Framework for Predictive Analysis

3 What about Modeling?
  3.1 Problem Classes for Models and Predictors
  3.2 Interpreting Modeling
  3.3 The Dangers of Modeling
  3.4 Modeling, Inference, Prediction, and Data
  3.5 Prequentialism

4 Models and Predictors: A Bickering Couple
  4.1 Simple Nonparametric Cases
  4.2 Fixed Effects Linear Regression
  4.3 Quantile Regression
  4.4 Comparisons: Regression
  4.5 Logistic Regression
  4.6 Bayes Classifiers and LDA
  4.7 Nearest Neighbors
  4.8 Comparisons: Classification
  4.9 A Look Ahead to Part II

Part II Established Settings for Prediction

5 Time Series
  5.1 Classical Decomposition Model
  5.2 Box–Jenkins: Frequentist SARIMA
    5.2.1 Predictor Class Identification
    5.2.2 Estimating Parameters in an ARMA(p,q) Process
    5.2.3 Validation in an ARMA(p,q) Process
    5.2.4 Forecasting
  5.3 Bayes SARIMA
  5.4 Computed Examples
  5.5 Stochastic Modeling
  5.6 Endnotes: Variations and Extensions
    5.6.1 Regression with an ARMA(p,q) Error Term
    5.6.2 Dynamic Linear Models

6 Longitudinal Data
  6.1 Predictors Derived from Repeated-Measures ANOVA
  6.2 Linear Models for Longitudinal Data
  6.3 Predictors Derived from Generalized Linear Models
  6.4 Predictors Using Random Effects
    6.4.1 Linear Mixed Models
    6.4.2 Generalized Linear Mixed Models
    6.4.3 Nonlinear Mixed Models
  6.5 Computational Comparisons
  6.6 Endnotes: More on Growth Curves
    6.6.1 A Fixed Effect Growth Curve Model
    6.6.2 Another Fixed Effect Technique

7 Survival Analysis
  7.1 Nonparametric Predictors of Survival
    7.1.1 The Kaplan–Meier predictor
    7.1.2 Median as a Predictor
    7.1.3 Bayes Version of the Kaplan–Meier Predictor
    7.1.4 Discrimination and Calibration
    7.1.5 Predicting with Medians
  7.2 Proportional Hazards Predictors
    7.2.1 Frequentist Estimates of h0 and β in PH Models
    7.2.2 Frequentist PH Models as Predictors
    7.2.3 Bayes PH Models
    7.2.4 Continuing the Example
  7.3 Parametric Models
  7.4 Endnotes: Other Models
    7.4.1 Accelerated Failure Time (AFT) Models
    7.4.2 Competing Risks

8 Nonparametric Methods
  8.1 Predictors Using Orthonormal Basis Expansions
  8.2 Predictors Based on Kernels
    8.2.1 Kernel Density Estimation
    8.2.2 Kernel Regression: Deterministic Designs
    8.2.3 Kernel Regression: Random Design
  8.3 Predictors Based on Nearest Neighbors
    8.3.1 Nearest Neighbor Density Estimation
    8.3.2 Nearest Neighbor Regression
    8.3.3 Beyond the Independence Case
  8.4 Predictors from Nonparametric Bayes
    8.4.1 Polya Tree Process Priors for Distribution Estimation
    8.4.2 Gaussian Process Priors for Regression
  8.5 Comparing Nonparametric Predictors
    8.5.1 Description of the Data, Methods, and Results
    8.5.2 M-Complete or M-Open?
  8.6 Endnotes
    8.6.1 Smoothing Splines
    8.6.2 Nearest Neighbor Classification
    8.6.3 Test-Based Prediction

9 Model Selection
  9.1 Linear Models
  9.2 Information Criteria
  9.3 Bayes Model Selection
  9.4 Cross-Validation
  9.5 Simulated Annealing
  9.6 Markov Chain Monte Carlo and the Metropolis–Hastings Algorithm
  9.7 Computed Examples: SA and MCMC–MH
  9.8 Endnotes
    9.8.1 DIC
    9.8.2 Posterior Predictive Loss
    9.8.3 Information-Theoretic Model Selection Procedures
    9.8.4 Scoring Rules and BFs Redux

Part III Contemporary Prediction

10 Blackbox Techniques
  10.1 Classical Nonlinear Regression
  10.2 Trees
    10.2.1 Finding a Good Tree
    10.2.2 Pruning and Selection
    10.2.3 Bayes Trees
  10.3 Neural Nets
    10.3.1 ‘Fitting’ a Good NN
    10.3.2 Choosing an Architecture for an NN
    10.3.3 Bayes NNs
    10.3.4 NN Heuristics
    10.3.5 Deep Learning, Convolutional NNs, and All That
  10.4 Kernel Methods
    10.4.1 Bayes Kernel Predictors
    10.4.2 Frequentist Kernel Predictors
  10.5 Penalized Methods
  10.6 Computed Examples
    10.6.1 Doppler Function Example
    10.6.2 Predicting a Vegetation Greenness Index
  10.7 Endnotes
    10.7.1 Projection Pursuit
    10.7.2 Logic Trees
    10.7.3 Hidden Markov Models
    10.7.4 Errors-in-Variables Models

11 Ensemble Methods
  11.1 Bayes Model Averaging
  11.2 Bagging
  11.3 Stacking
  11.4 Boosting
    11.4.1 Boosting Classifiers
    11.4.2 Boosting and Regression
  11.5 Median and Related Methods
    11.5.1 Different Sorts of ‘Median’
    11.5.2 Median and Other Components
    11.5.3 Heuristics
  11.6 Model Average Prediction in Practice
    11.6.1 Simulation Study
    11.6.2 Reanalyzing the Vegout Data
    11.6.3 Mixing It Up
  11.7 Endnotes
    11.7.1 Prediction along a String
    11.7.2 No Free Lunch

12 The Future of Prediction
  12.1 Recommender Systems
    12.1.1 Collaborative Filtering Recommender Systems
    12.1.2 Content-Based (CB) Recommender Systems
    12.1.3 Other Methods
    12.1.4 Evaluation
  12.2 Streaming Data
    12.2.1 Key Examples of Procedures for Streaming Data
    12.2.2 Sensor Data
    12.2.3 Streaming Decisions
  12.3 Spatio-Temporal Data
    12.3.1 Spatio-Temporal Point Data
    12.3.2 Remote Sensing Data
    12.3.3 Spatio-Temporal Point Process Data