Springer Series in Statistics
Series Editors: P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg, I. Olkin, N. Wermuth, S. Zeger
Wolfgang Härdle, Marlene Müller, Stefan Sperlich, Axel Werwatz

Nonparametric and Semiparametric Models

Wolfgang Härdle
CASE – Center for Applied Statistics and Economics
Wirtschaftswissenschaftliche Fakultät
Humboldt-Universität zu Berlin
10178 Berlin
Germany

Marlene Müller
Fraunhofer ITWM
Gottlieb-Daimler-Straße
67663 Kaiserslautern
Germany

Stefan Sperlich
Departamento de Economía
Universidad Carlos III de Madrid
C./Madrid, 126
28903 Getafe (Madrid)
Spain

Axel Werwatz
DIW Berlin
Königin-Luise-Straße 5
14195 Berlin
Germany

Cataloging-in-Publication Data applied for
A catalog record for this book is available from the Library of Congress.
Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at <http://dnb.ddb.de>.
Mathematics Subject Classification (2000): 62G07, 62G08, 62G20, 62G09, 62G10

ISBN 978-3-642-62076-8
ISBN 978-3-642-17146-8 (eBook)
DOI 10.1007/978-3-642-17146-8

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

springeronline.com
© Springer-Verlag Berlin Heidelberg 2004
Originally published by Springer-Verlag Berlin Heidelberg New York in 2004
Softcover reprint of the hardcover 1st edition 2004

The use of general descriptive names, registered names, trademarks etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: design & production, Heidelberg
Typesetting by the authors
Printed on acid-free paper

Preface

The concept of smoothing is a central idea in statistics. Its role is to extract structural elements of variable complexity from patterns of random variation. The nonparametric smoothing concept is designed to simultaneously estimate and model the underlying structure.
This involves high dimensional objects, like density functions, regression surfaces or conditional quantiles. Such objects are difficult to estimate for data sets with mixed, high dimensional and partially unobservable variables. The semiparametric modeling technique reconciles the two aims, flexibility and simplicity of statistical procedures, by introducing partial parametric components. These (low dimensional) components allow one to match structural conditions such as linearity in some variables and may be used to model the influence of discrete variables. The flexibility of semiparametric modeling has made it a widely accepted statistical technique.

The aim of this monograph is to present the statistical and mathematical principles of smoothing with a focus on applicable techniques. The necessary mathematical treatment is easily understandable and a wide variety of interactive smoothing examples are given. This text is an e-book; it is a downloadable entity (http://www.i-xplore.de) which allows the reader to recalculate all arguments and applications without reference to a specific software platform. This new technique for proliferation of methods and ideas is specifically designed for the beginner in nonparametric and semiparametric statistics. It is based on the XploRe quantlet technology, developed at Humboldt-Universität zu Berlin.

The text has evolved out of the courses "Nonparametric Modeling" and "Semiparametric Modeling" that the authors taught at Humboldt-Universität zu Berlin, ENSAE Paris, Charles University Prague, and Universidad de Cantabria, Santander. The book divides itself naturally into two parts:

• Part I: Nonparametric Models
  histogram, kernel density estimation, nonparametric regression
• Part II: Semiparametric Models
  generalized regression, single index models, generalized partial linear models, additive and generalized additive models.
The first part (Chapters 2–4) covers the methodological aspects of nonparametric function estimation for cross-sectional data, in particular kernel smoothing methods. Although our primary focus will be on flexible regression models, a closely related topic to consider is nonparametric density estimation. Since many techniques and concepts for the estimation of probability density functions are also relevant for regression function estimation, we first consider histograms (Chapter 2) and kernel density estimates (Chapter 3) in more detail. Finally, in Chapter 4 we introduce several methods of nonparametrically estimating regression functions. The main part of this chapter is devoted to kernel regression, but other approaches such as splines, orthogonal series and nearest neighbor methods are also covered.

The first part is intended for undergraduate students majoring in mathematics, statistics, econometrics or biometrics. It is assumed that the audience has a basic knowledge of mathematics (linear algebra and analysis) and statistics (inference and regression analysis). The material is easy to utilize since the e-book character of the text allows maximum flexibility in learning (and teaching) intensity.

The second part (Chapters 5–9) is devoted to semiparametric regression models, in particular extensions of the parametric generalized linear model. In Chapter 5 we summarize the main ideas of the generalized linear model (GLM). Typical examples are the logit and probit models. Nonparametric extensions of the GLM consider either the link function (single index models, Chapter 6) or the index argument (generalized partial linear models, additive and generalized additive models, Chapters 7–9). Single index models focus on the nonparametric error distribution in an underlying latent variable model. Partial linear models take the pragmatic approach of fixing the error distribution but let the index be of non- or semiparametric structure. Generalized additive models concentrate on a (lower dimensional) additive structure of the index with fixed link function. This model class balances the difficulty of high-dimensional smoothing with the flexibility of nonparametrics.
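As a rough orientation, the model classes named in the preceding paragraph can be written side by side. The notation is an assumption made here for illustration and may differ from that of the chapters themselves: G denotes a known link function, g, m and g_j unknown smooth functions, β a parameter vector and c a constant.

```latex
\begin{align*}
\text{GLM:}  \quad & E(Y \mid X) = G(X^\top \beta), && G \text{ known} \\
\text{SIM:}  \quad & E(Y \mid X) = g(X^\top \beta), && g \text{ unknown} \\
\text{GPLM:} \quad & E(Y \mid U, T) = G\{U^\top \beta + m(T)\}, && G \text{ known},\ m \text{ unknown} \\
\text{GAM:}  \quad & E(Y \mid X) = G\Big\{c + \sum_{j=1}^{d} g_j(X_j)\Big\}, && G \text{ known},\ g_j \text{ unknown}
\end{align*}
```

Read this way, the single index model relaxes the link while keeping a linear index, whereas the partial linear and additive models keep the link fixed and relax the index.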
In addition to the methodological aspects, the second part also covers computational algorithms for the considered models. As in the first part we focus on cross-sectional data. It is intended to be used by Master and PhD students or researchers.

This book would not have been possible without substantial support from many colleagues and students. It has benefited at several stages from useful remarks and suggestions of our students at Humboldt-Universität zu Berlin, ENSAE Paris and Charles University Prague. We are grateful to Lorens Helmchen, Stephanie Freese, Danilo Mercurio, Thomas Kühn, Ying Chen and Michal Benko for their support in text processing and programming, Caroline Condron for language checking and Pavel Čížek, Zdeněk Hlávka and Rainer Schulz for their assistance in teaching. We are indebted to Joel Horowitz (Northwestern University), Enno Mammen (Universität Heidelberg) and Helmut Rieder (Universität Bayreuth) for their valuable comments on earlier versions of the manuscript. Thanks go also to Clemens Heine, Springer Verlag, for being a very supportive and helpful editor.

Berlin/Kaiserslautern/Madrid, February 2004

Wolfgang Härdle
Marlene Müller
Stefan Sperlich
Axel Werwatz

Contents

Preface
Notation

1 Introduction
  1.1 Density Estimation
  1.2 Regression
    1.2.1 Parametric Regression
    1.2.2 Nonparametric Regression
    1.2.3 Semiparametric Regression
  Summary

Part I Nonparametric Models

2 Histogram
  2.1 Motivation and Derivation
    2.1.1 Construction
    2.1.2 Derivation
    2.1.3 Varying the Binwidth
  2.2 Statistical Properties
    2.2.1 Bias
    2.2.2 Variance
    2.2.3 Mean Squared Error
    2.2.4 Mean Integrated Squared Error
    2.2.5 Optimal Binwidth
  2.3 Dependence of the Histogram on the Origin
  2.4 Averaged Shifted Histogram
  Bibliographic Notes
  Exercises
  Summary

3 Nonparametric Density Estimation
  3.1 Motivation and Derivation
    3.1.1 Introduction
    3.1.2 Derivation
    3.1.3 Varying the Bandwidth
    3.1.4 Varying the Kernel Function
    3.1.5 Kernel Density Estimation as a Sum of Bumps
  3.2 Statistical Properties
    3.2.1 Bias
    3.2.2 Variance
    3.2.3 Mean Squared Error
    3.2.4 Mean Integrated Squared Error
  3.3 Smoothing Parameter Selection
    3.3.1 Silverman's Rule of Thumb
    3.3.2 Cross-Validation
    3.3.3 Refined Plug-in Methods
    3.3.4 An Optimal Bandwidth Selector?!
  3.4 Choosing the Kernel
    3.4.1 Canonical Kernels and Bandwidths
    3.4.2 Adjusting Bandwidths across Kernels
    3.4.3 Optimizing the Kernel
  3.5 Confidence Intervals and Confidence Bands
  3.6 Multivariate Kernel Density Estimation
    3.6.1 Bias, Variance and Asymptotics
    3.6.2 Bandwidth Selection
    3.6.3 Computation and Graphical Representation
  Bibliographic Notes
  Exercises
  Summary

4 Nonparametric Regression
  4.1 Univariate Kernel Regression
    4.1.1 Introduction
    4.1.2 Kernel Regression
    4.1.3 Local Polynomial Regression and Derivative Estimation
  4.2 Other Smoothers
    4.2.1 Nearest-Neighbor Estimator
    4.2.2 Median Smoothing
    4.2.3 Spline Smoothing
    4.2.4 Orthogonal Series
  4.3 Smoothing Parameter Selection
    4.3.1 A Closer Look at the Averaged Squared Error
    4.3.2 Cross-Validation
    4.3.3 Penalizing Functions
  4.4 Confidence Regions and Tests
    4.4.1 Pointwise Confidence Intervals
    4.4.2 Confidence Bands
    4.4.3 Hypothesis Testing
  4.5 Multivariate Kernel Regression
    4.5.1 Statistical Properties
    4.5.2 Practical Aspects
  Bibliographic Notes
  Exercises
  Summary

Part II Semiparametric Models

5 Semiparametric and Generalized Regression Models
  5.1 Dimension Reduction
    5.1.1 Variable Selection in Nonparametric Regression
    5.1.2 Nonparametric Link Function
    5.1.3 Semi- or Nonparametric Index
  5.2 Generalized Linear Models
    5.2.1 Exponential Families
    5.2.2 Link Functions