TheAnnalsofStatistics 1995,Vol.23,No.6,1865]1895 THE 1994 NEYMAN MEMORIAL LECTURE SMOOTHING SPLINE ANOVA FOR EXPONENTIAL FAMILIES, WITH APPLICATION TO THE WISCONSIN EPIDEMIOLOGICAL STUDY OF DIABETICRETINOPATHY1 BY GRACE WAHBA,2 YUEDONG WANG,3 CHONGGU,4 RONALDKLEIN5 ANDBARBARAKLEIN6 UniversityofWisconsin]Madison,UniversityofMichigan,Purdue University,UniversityofWisconsin]MadisonandUniversityof Wisconsin]Madison Let yi, is1,...,n, be independentobservationswith the density of yi of the form h(cid:14)yi,fi.sexpwyifiyb(cid:14)fi.qc(cid:14)yi.x, where b and c are given functions and b is twice continuously differentiable and bounded awayfrom0.Let fisf(cid:14)t(cid:14)i..,wherets(cid:14)t1,...,td.gTT(cid:14)1.m???mTT(cid:14)d.sTT, the TT(cid:14)a. are measurable spaces of rather general form and f is an unknown function on TT with some assumed ‘‘smoothness’’ properties. Given (cid:20)yi,t(cid:14)i., is1,...,n4, it is desired to estimate f(cid:14)t. for t in some region of interest contained in TT. We develop the fitting of smoothing spline ANOVA models to this data of the form f(cid:14)t.sCq(cid:7)afa(cid:14)ta.q (cid:7)a-bfab(cid:14)ta,tb.q???. The components of the decomposition satisfy side conditions which generalize the usual side conditions for parametric ANOVA.Theestimateof f isobtainedastheminimizer,inanappropriate function space, of LL(cid:14)y, f.q(cid:7)alaJa(cid:14)fa.q(cid:7)a-blabJab(cid:14)fab.q???, where LL(cid:14)y,f. is the negative log likelihood of ys(cid:14)y1,...,yn.9 given f, the J ,J ,... arequadraticpenaltyfunctionalsandtheANOVAdecom- a ab position is terminated in some manner. There are five major parts re- quiredtoturnthisprogramintoapracticaldataanalysistool:(cid:14)1.methods fordecidingwhichtermsintheANOVAdecompositiontoinclude(cid:14)model selection.,(cid:14)2.methodsforchoosinggoodvaluesofthesmoothingparame- ters l ,l ,..., (cid:14)3. methods for making confidence statements concern- a ab ingtheestimate,(cid:14)4.numericalalgorithmsforthecalculationsand,finally, (cid:14)5. public software. In this paper we carry out this program, relying on earlierworkandfillinginimportantgaps.Theoverallschemeisapplied ReceivedJanuary1995;revisedMay1995. 1This work formed the basis for the Neyman Lecture at the 57th Annual Meeting of the Institute of MathematicalStatistics, Chapel Hill, North Carolina, June 23, 1994, presented by GraceWahba. 2ResearchsupportedinpartbyNIHGrantEY09946andNSFGrantDMS-91-21003. 3ResearchsupportedinpartbyNIHGrantsEY09446,P60-DK20572andP30-HD18258. 4ResearchsupportedbyNSFGrantDMS-93-01511. 5ResearchsupportedbyNIHGrantEY03083. 6ResearchsupportedbyNIHGrantEY03083. AMS 1991 subject classifications. Primary 62G07, 92C60, 68T05, 65D07, 65D10, 62A99, 62J07;secondary41A63,41A15,62M30,65D15,92H25,49M15. Key words and phrases. Smoothing spline ANOVA, nonparametric regression, exponential families,riskfactorestimation. 1865 1866 G. WAHBA, ET AL. to Bernoulli data from the Wisconsin Epidemiologic Study of Diabetic Retinopathytomodeltheriskofprogressionofdiabeticretinopathyasa function of glycosylated hemoglobin, duration of diabetes and body mass index.Itisbelievedthattheresultshavewidepracticalapplicationtothe analysisofdatafromlargeepidemiologicstudies. 1. Introduction. We,alongwithmanyothers,areinterestedinbuilding flexible statistical models for prediction (cid:14)a.k.a. multivariate function estima- tion.. Desirable features of such models include the ability to simultaneously handle continuous variables on various domains, ordered categorical vari- ablesandunorderedcategoricalvariables.Acrucialfeatureistheavailability of a set of methods for adaptively controlling the complexity or degrees of freedom of the model (cid:14)sometimes called the bias]variance tradeoff. and for comparing different candidate models in the same or related families of models. Other desirable features include the reduction to simple parametric models if the data suggest that such models are adequate, readily inter- pretable estimates even when several predictor variables are involved, rea- sonable accuracy statements after the model has been fitted and publicly available software. Smoothing spline ANOVA (cid:14)SS-ANOVA. models, which are the subject of this paper, are endowed with all of these features to a greater or lesser extent, although the development of both theory and practice is by no means complete. Briefly, these models represent a function f(cid:14)t., ts(cid:14)t1,...,td. of d variables as f(cid:14)t.sCq(cid:7)afa(cid:14)ta.q(cid:7)a-bfab(cid:14)ta,tb.q???, where the compo- nents satisfy side conditions which generalize the usual side conditions for parametric ANOVA to function spaces, and the series is truncated in some manner. Independent observations yi, is1,...,n, are assumed to be dis- tributed as h(cid:14)y, f(cid:14)t(cid:14)i... with parameter of interest f(cid:14)t(cid:14)i.., and f(cid:14)?. is as- i sumed to be ‘‘smooth’’ in some sense; f is estimated as the minimizer, in an appropriate function space, of LL(cid:14)y, f.q(cid:7)alaJa(cid:14)fa.q(cid:7)a-blabJab(cid:14)fab. q???, where LL(cid:14)y, f. is the negative log likelihoodof (cid:14)y1,..., yn. given f, the J , J ,... are quadraticpenaltyfunctionalsand the l ,l ,... are smooth- a ab a ab ing parameters to be chosen. These models have been developed extensively for Gaussian data, and the ds1 special case has been developedfor exponentialfamilies.Our goal here is to extend this work to the d)1 case for exponential families and to demonstrate its usefulness by analyzing data from the Wisconsin Epidemio- logicStudyofDiabeticRetinopathy(cid:14)WESDR..WebuildanSS-ANOVAmodel to estimate the risk of progression of diabetic retinopathy, an important cause of blindness, at followup, given values of the predictor variables glycosylated hemoglobin, duration of diabetes and body mass index at base- line, and the response (cid:14)progression of retinopathy or not. at followup. From thedatasetanalyzedherewehavebeenabletodescribeinterestingrelations that were not found using more traditional methods. CART wBreiman, Friedman, Olshen and Stone (cid:14)1984.x, MARS wFriedman (cid:14)1991.x, projection pursuit wFriedman and Steutzle (cid:14)1981.x and thep method SMOOTHING SPLINE ANOVA 1867 wBreiman (cid:14)1991.x are some of the more popular methods that have been proposed in the statistical literature for multivariate function estimation. Certain supervised machine learning methods, in particular feedforward neural nets and radial basis functions, are also used for this purpose. See Geman,BienenstockandDoursat(cid:14)1992., Ripley(cid:14)1994., ChengandTittering- ton (cid:14)1994., Wahba (cid:14)1992, 1995. and references therein, for a discussion of relationships between neural nets and statistical nonparametric regression methods. The popular additive models of Hastie and Tibshirani (cid:14)1990., when fitted with smoothing splines, are a special case of the smoothing spline ANOVA model.The varying-coefficientmodels wHastie and Tibshirani(cid:14)1993.x are a very interesting subfamily. Roosen and Hastie (cid:14)1994. have taken a further interesting step by combining the additive spline models with projec- tion pursuit. Two basic reference works for smoothing splines are Eubank (cid:14)1988. and Green and Silverman (cid:14)1994.. Stone (cid:14)1994. has recently studied theoreticalpropertiesofsumsoftensorproductsofpolynomialsplinesusedin a regression approach (cid:14)as opposed to the smoothing approach here. to esti- matecomponentsofanANOVAdecompositionofatargetfunction.InStone’s regression context numbers of basis functions play the role of smoothing parameters. They are chosen there theoretically, although it would be possi- ble to choose them, as well as the knot locations, adaptively. It is tantalizing to conjecture the circumstances under which Stone’s convergence rates could be obtained in the smoothing context employed in the present paper. See Chen (cid:14)1991., Gu and Qiu (cid:14)1994. and references cited there. SS-ANOVA models for Gaussian data are described in some (cid:14)but not complete. generality in Wahba w(cid:14)1990., Chapter 10x, where references to the previousliteratureare given.GCV (cid:14)generalizedcross-validation., UBR (cid:14)unbi- asedrisk.andGML(cid:14)generalizedmaximumlikelihood.arealldiscussedthere for choosingthe smoothingparametersin the Gaussiancase.SeeCravenand Wahba (cid:14)1979. and Li (cid:14)1985, 1986. for properties of GCV and UBR estimates. Gu,Bates,ChenandWahba(cid:14)1989.,Chen,GuandWahba(cid:14)1989.,Gu(cid:14)1992b., Gu and Wahba (cid:14)1991a, b, 1993a, b., Chen (cid:14)1991, 1993. and others discuss further various aspects of these models. The code RKPACK wGu (cid:14)1989., available from [email protected] will fit specified SS-ANOVA models given Gaussian data. O’Sullivan(cid:14)1983. and O’Sullivan,Yandell and Raynor (cid:14)1986., in the ds1 case, proposed penalized log-likelihood estimates with spline penalties for data from general exponential families. Methods for choosing a single smoothing parameter in the ds1 non-Gaussian case have been a matter of lively activity. O’Sullivan, Yandell and Raynor (cid:14)1986., Green and Yandell (cid:14)1985., Yandell (cid:14)1986., Cox and Chang (cid:14)1990., Wahba (cid:14)1990., Moody (cid:14)1991., Liu (cid:14)1993., Gu (cid:14)1990, 1992a, c. and Xiang and Wahba (cid:14)1995. have addressed this issue, all considering methods related to ordinary leaving out one, GCV or UBR adapted to the non-Gaussian case. Wong (cid:14)1992. has examined the existence of exactly unbiased estimates for the expected Kullback]Leibler information distance as well as predictive mean square error in several non-Gaussian cases. See also Hudson (cid:14)1978.. One can conclude from Wong’s 1868 G. WAHBA, ET AL. work that there is no exact unbiased risk estimate of the Kullback]Leibler informationdistanceintheBernoullicase.Itisclear,however,thatfordense data sets and smooth unknown true functions, good approximations must exist. This may explain why no unique, completely definitive result is avail- able in the Bernoulli case. We will use the approach in Wang (cid:14)1994., which represents a multiple smoothing parameter extension of Gu’s (cid:14)1992a. exten- sion of the UBR estimate originally obtained for Gaussian data with known variance wMallows (cid:14)1973., Craven and Wahba (cid:14)1979.x. Bayesian ‘‘confidence intervals’’ were proposed for the cross-validated smoothing spline with Gaussian data by Wahba (cid:14)1983. and their properties were studied by Nychka (cid:14)1988, 1990.. Generalization to the componentwise case in SS-ANOVA appears in Gu and Wahba (cid:14)1993b.. Gu (cid:14)1992c. discussed their extensionto the singlesmoothingparameternon-Gaussiancase.In this work,wedevelopandemploythecomponentwisegeneralizationofGu(cid:14)1992c. to the non-Gaussian componentwise SS-ANOVA case. Model selectionin the contextof non-GaussianSS-ANOVAhas many open questions. The first model selection question might be: will the parametric model which is built into the SS-ANOVA as a special case do as well as a model which contains nonparametric terms? A method for answering this question in the Gaussian case from a hypothesis testing point of view was given by Cox, Koh, Wahba and Yandell (cid:14)1988. and by Xiang and Wahba (cid:14)1995. in the Bernoulli case. In the general case where one is comparing one nonparametric model with another, the problem is more complicated. Chen (cid:14)1993. proposed an approximate hypothesis testing procedure in the general Gaussian case. Gu (cid:14)1992b. proposed cosine diagnostics as an aid in model selection. The use of componentwise confidence intervals to eliminate terms was suggested in Gu and Wahba (cid:14)1993b.. Of course model selection from a hypothesis testing point of view (cid:14)i.e., is a simple model correct?. is not the same as model selection from a prediction point of view (cid:14)i.e., no model is correct,whichmodel is likelyto predictbest?.. In our analysisof the WESDR data we carry out informal model selection procedures including deletion of terms small enough to be of no practical significance, and examination of the componentwise Bayesian ‘‘confidence intervals.’’ We will discuss a number of open questions related to model selection in this context from a prediction point of view. It is clear that the existence of user-friendly software is essential for this and any other sophisticatednonparametricregression method to be useful.A computer code GRKPACK wWang (cid:14)1995.x, which calls RKPACK as a subrou- tine, has been developed to carry out the SS-ANOVA analysis for Bernoulli and other non-Gaussian data. We use GRKPACK to carry out the WESDR data analysis. In Section 2 we review penalized GLIM models with a single smoothing parameter, and then review the SS-ANOVA decomposition of a function and established methods for fitting SS-ANOVA models in the Gaussian case. Although this review is fairly detailed, the presentation of this detail eases greatly the exposition of the generalization of the fitting of these models in SMOOTHING SPLINE ANOVA 1869 the non-Gaussian case. In Section 3 we describe the extension of SS-ANOVA models to the non-Gaussian exponential family no nuisance parameter case, including a numerical algorithm and methods for choosing the smoothing parameters. In Section 4 Bayesian ‘‘confidence intervals’’ are extended to the componentwise exponential family case and a procedure for computing them isdescribed.InSection5wediscussmodelselectionandinSection6wecarry out the WESDR data analysis. Section 7 discusses some computational considerations and Section 8 gives some conclusions. 2. PenalizedGLIMand GaussianSS-ANOVAmodels. For simplicity of notation, we will be primarily concerned with data from a member of an exponential family with no nuisance parameter and semiparametric general- izations of the generalized linear models (cid:14)GLIM’s. introduced by Nelder and Wedderburn (cid:14)1972.; see also McCullagh and Nelder (cid:14)1989.. Our method can also deal with overrunderdispersion situations; see Wang (cid:14)1994. for details. We consider random variables y with density h(cid:14)y, f . of the form i i i (cid:14)2.1. h(cid:14)yi, fi. sexp yifiyb(cid:14) fi. qc(cid:14)yi. , where b and c are given functions with b twice continuously differentiable and uniformly bounded away from 0. This includes binomial, Poisson and other random variables as well as normal random variables with variance 1. Letting t be a vector of predictor variables taking values in some fairly arbitrary index set TT, we observe pairs (cid:20)yi,t(cid:14)i., is1,...,n4, where the yi are independent observations with distribution h(cid:14)y, f(cid:14)t(cid:14)i.... Our goal is to i estimate f(cid:14)t. for t in some region in the space TT of interest. GLIM models represent f as a linear combination of simple parametric functions of the components of t, typically as low degree polynomials. Usually the unknown coefficientsarethenestimatedbyminimizingthenegativeloglikelihood,that is, by minimizing n (cid:14)2.2. LL(cid:14)y, f. sy (cid:7) yif(cid:14)t(cid:14)i.. yb(cid:14) f(cid:14)t(cid:14)i... . is1 O’Sullivan, Yandell and Raynor (cid:14)1986. replaced the parametric assump- tion on f by the assumption that f is a member of some ‘‘smooth’’ class of functions of t, and they estimated f as the minimizer, in an appropriate functionspacewreproducingkernelHilbertspace(cid:14)RKHS.xof LL(cid:14)y, f.qlJ(cid:14)f., where J is a roughnesspenalty.An SS-ANOVAmodel provides a decomposi- tion of f of the form (cid:14)2.3. f(cid:14)t1,...,td. smq (cid:7)fa(cid:14)ta. q (cid:7)fab(cid:14)tab.q??? a ab and the penaltylJ(cid:14)f. is replaced by (cid:7)alaJa(cid:14)fa.q(cid:7)ablabJab(cid:14)fab.q???. The SS-ANOVA model with Gaussian data has the form (cid:14)2.4. yisf(cid:14)t1(cid:14)i.,...,td(cid:14)i..q«i, is1,...,n, where «s(cid:14)«1,...,«n.9;N(cid:14)0,s2In=n., tagTT(cid:14)a., where TT(cid:14)a. is a measur- able space, as1,...,d, (cid:14)t1,...,td.stgTTsTT(cid:14)1.m???mTT(cid:14)d. and s2 may 1870 G. WAHBA, ET AL. beunknown.For f satisfyingsomemeasurabilityconditionsauniqueANOVA decompositionoftheform(cid:14)2.3.canalwaysbedefinedasfollows.Let dm bea a probability measure on TT(cid:14)a. and define the averaging operator EE on TT by a (cid:14)2.5. (cid:14)EEaf.(cid:14)t. sH f(cid:14)t1,...,td. dma(cid:14)ta.. TT(cid:14)a. Then the identity is decomposed as Is (cid:3)(cid:14)EEaq(cid:14)IyEEa.. a (cid:14)2.6. s (cid:3)EEaq (cid:7)(cid:14)IyEEa. (cid:3) EEb a a b/a q (cid:7) (cid:14)IyEEa.(cid:14)IyEEb. (cid:3) EEgq???q(cid:3)(cid:14)IyEEa.. a-b g/a,b a The components of this decomposition generate the ANOVA decomposition of f of the form (cid:14)2.3. by Cs(cid:14)(cid:3)aEEa.f, fas(cid:14)(cid:14)IyEEa.(cid:3)b/aEEb.f, fabs (cid:14)(cid:14)IyEEa.(cid:14)IyEEb.(cid:3)g/a,bEEg.f and so forth. Efron and Stein (cid:14)1981. discuss this kind of ANOVA decomposition in a different context. Further details in the RKHS context may be found in Gu and Wahba (cid:14)1993a, b.. The idea behind SS-ANOVA is to construct an RKHS HH of functions on TT sothatthecomponentsoftheSS-ANOVAdecompositionrepresentanorthog- onal decomposition of f in HH. Then RKHS methods can be used to explicitly imposesmoothnesspenaltiesoftheform(cid:7)alaJa(cid:14)fa.q(cid:7)ablabJab(cid:14)fab.q???, where, however, the series will be truncated at some point. This is done as follows. Let HH (cid:14)a. be an RKHS of functions on TT (cid:14)a. with HTT(cid:14)a.fa(cid:14)ta.dmas0 for fa(cid:14)ta.gHH (cid:14)a. and let w1(cid:14)a.x be the one-dimensional space of constant functions on TT (cid:14)a.. Construct HH as d HHs (cid:3) (cid:14)(cid:20)w1(cid:14)a.x4[(cid:20)HH (cid:14)a.4. (cid:14)2.7. as1 sw1x [ (cid:7) HH (cid:14)a.[ (cid:7) wHH (cid:14)a.mHH (cid:14)b.x [???, a a-b where w1x denotes the constant functions on TT. With some abuse of notation, factors of the form w1(cid:14)a.x are omitted whenever they multiply a term of a different form. Thus HH (cid:14)a. is shorthand for w1(cid:14)1.xm???mw1(cid:14)ay1.xmHH (cid:14)a.m w1(cid:14)aq1.xm???mw1(cid:14)d.x (cid:14)which is a subspace of HH .. The components of the ANOVA decomposition are now in mutually orthogonal subspaces of HH. Note that the components will depend on the measure dm , and these should be a chosen in a specific application so that the fitted mean, main effects, two factor interactions and so forth have reasonable interpretations. Next, HH (cid:14)a. is decomposed into a parametric part and a smooth part, by letting HH (cid:14)a.sHHp(cid:14)a.[HHs(cid:14)a., where HHp(cid:14)a. is finite dimensional(cid:14)the ‘‘paramet- ric’’ part. and HH (cid:14)a. (cid:14)the ‘‘smooth’’ part. is the orthocomplement of HH (cid:14)a. in s p HH (cid:14)a.. Elements of HH (cid:14)a. are not penalized through the device of letting p Ja(cid:14)fa.s5Ps(cid:14)a.fa52,where Ps(cid:14)a. istheorthogonalprojectoronto HHs(cid:14)a.;wHH (cid:14)a.m SMOOTHING SPLINE ANOVA 1871 HH (cid:14)b.x is now a direct sum of four orthogonal subspaces: wHH (cid:14)a.mHH (cid:14)b.xs wHHp(cid:14)a.mHHp(cid:14)b.x[wHHp(cid:14)a.mHHs(cid:14)b.x[wHHs(cid:14)a.mHHp(cid:14)b.x[wHHs(cid:14)a.mHH s (cid:14)b.x. By con- vention the elements of the finite-dimensional space wHHp(cid:14)a.mHHp(cid:14)b.x will not be penalized. Continuing this way results in an orthogonal decomposition of HH into sums of products of unpenalized finite-dimensional subspaces, plus main effects ‘‘smooth’’ subspaces, plus two-factor interaction spaces of the form parametricmsmooth wHHp(cid:14)a.mHHs(cid:14)b.x, smoothmparametric wHHs(cid:14)a.m HHp(cid:14)b.x and smoothmsmooth wHHs(cid:14)a.mHHs(cid:14)b.x and similarly for the three and higher factor subspaces. Now suppose that we have selected the model MM; that is, we have decided which subspaces will be included. Collect all of the included unpenalized subspaces into a subspace, call it HH 0, of dimension M, and relabel the other subspaces as HHb, bs1,2,..., p; HHb may stand for a subspace HHs(cid:14)a., or for one of the three subspaces in the decomposition of wHH (cid:14)a.mHH (cid:14)b.x which contains at least one ‘‘smooth’’ component, or for a higher order subspace withatleastone‘‘smooth’’component.Collectingthesesubspacesas MMsHH 0 [(cid:7)bHHb, the estimation problem in the Gaussian case becomes: find f in MMsHH 0[(cid:7)bHHb to minimize 1 n p (cid:14)2.8. n (cid:7) (cid:14)yiyf(cid:14)t(cid:14)i...2ql(cid:7) uby15Pbf52, is1 bs1 where Pb is the orthogonal projector in MM into HHb. The overparameteriza- tion luby1slb is convenient for both expository and computational purposes wsee Gu (cid:14)1989. and Gu and Wahba(cid:14)1991b.x and is accountedfor in RKPACK. The minimizer f of (cid:14)2.8. is known to have a representationwWahba (cid:14)1990., l,u Chapter10x intermsofabasis(cid:20)f4 for HH0 andthereproducingkernels(cid:14)RK’s. n (cid:20)R (cid:14)s,t.4 for the HHb. Letting b p (cid:14)2.9. Qu(cid:14)s,t. s (cid:7) ubRb(cid:14)s,t., bs1 it is M n (cid:14)2.10. fl,u(cid:14)t. s (cid:7) dnfn(cid:14)t. q (cid:7) ciQu(cid:14)t(cid:14)i.,t. sf(cid:14)t.9dqj(cid:14)t.9c, ns1 is1 where f9(cid:14)t. s(cid:14)f1(cid:14)t.,...,fM(cid:14)t.., j9(cid:14)t. s(cid:14)Qu(cid:14)t(cid:14)1.,t.,...,Qu(cid:14)t(cid:14)n.,t... c and d are vectors of coefficientswhich satisfy n=1 M=1 (cid:14)2.11. (cid:14)QuqnlI.cqSdsy, S9cs0, where here and below we are letting Q be the n=n matrix with ijth entry u Q (cid:14)t(cid:14)i.,t(cid:14)j.. and S be the n=M matrix with inth entry f(cid:14)t(cid:14)i... This u n system will have a unique solution for anyl)0 provided S is of full column 1872 G. WAHBA, ET AL. rank, which we will always assume. This condition on S is equivalent to the uniquenessofleastsquaresregressionontospan(cid:20)f4.SincetheRKofatensor n product space is the product of the RK’s of the component spaces, the computation of the R ’s is straightforward.For example, the RK correspond- b ing to the subspace HHp(cid:14)a. m HHs(cid:14)b. is (cid:14)in an obvious notation. R (cid:14)s ,t .R (cid:14)s ,t ..Ofcourseanypositive-definitefunctionmayinprin- HHp(cid:14)a. a a HHs(cid:14)b. b b ciple play the role of a reproducing kernel here. Special properties of RK’s related to splines are noted in Wahba (cid:14)1990.. Conditionally positive-definite functionsasoccurinthinplatesplineswWahbaandWendelberger(cid:14)1980.x can be accommodated; see Gu and Wahba (cid:14)1993a. and references cited therein. Examples on the sphere can be found in Wahba (cid:14)1981, 1982. and Weber and Talkner (cid:14)1993.; examples on a discrete index set, as might occur in large contingency tables, can be found in Gu and Wahba (cid:14)1991a.. It is not hard to modify reproducing kernels so that a given particular set of functions plays the role of a spanning set for HH (cid:14)a.; see Wahba w(cid:14)1978., Section 3x. Arbitrary p functions including functions containing breaks and jumps and indicator functionsmaybeaddedto HH 0;seeShiau,WahbaandJohnson(cid:14)1986.,Wahba (cid:14)1990., Wahba, Gu, Wang and Chappell (cid:14)1995.. Assuming the model (cid:14)2.4., the smoothing parameters l,u may be chosen by generalized cross-validation(cid:14)GCV. (cid:14)s2 unknown. or unbiased risk (cid:14)UBR. (cid:14)s2 known.. The GCV and UBR estimates are the minimizers of V and U, respectively, given by 1rn5(cid:14)IyA(cid:14)l,u..y52 (cid:14)2.12. V(cid:14)l,u. s (cid:14)1rn.tr(cid:14)IyA(cid:14)l,u.. 2 and 1 2 (cid:14)2.13. U(cid:14)l,u. s 5(cid:14)IyA(cid:14)l,u..y52q s2 tr A(cid:14)l,u., n n where A(cid:14)l,u. satisfies (cid:14)2.14. (cid:14)fl,u(cid:14)t(cid:14)1..,..., fl,u(cid:14)t(cid:14)n...9sA(cid:14)l,u.y. The properties of GCV and UBR estimates in the Gaussian case are well known; see Wahba (cid:14)1990. and the references cited therein, especially Li (cid:14)1985.. Loosely speaking, under appropriate assumptions they provide good estimates of the l,u which minimize (cid:7)nis1(cid:14)fl,u(cid:14)t(cid:14)i..yf(cid:14)t(cid:14)i...2. The code RKPACK wGu (cid:14)1989.x may be used to compute the GCV and UBR estimates of the luy1, along with f and the components of f in the ANOVA b l,u l,u decomposition. Of course to estimate luy1 the component matrices with b i, jth entry R (cid:14)t(cid:14)i.,t(cid:14)j.. must be ‘‘sufficiently distinguishable.’’ One way to b quantify this would be to examine the Fisher information matrix for the u b based on the associated Bayes model wWahba (cid:14)1978.x and (cid:14)4.1. below. 3. SS-ANOVA for general exponential families. The generalization ofanANOVAestimateforGaussiandatatothegeneralexponentialfamilyis obtained by replacing (cid:14)1rn.(cid:7)(cid:14)yiyf(cid:14)t(cid:14)i...2 by (cid:14)1rn.LL(cid:14)y, f., where LL(cid:14)y, f. SMOOTHING SPLINE ANOVA 1873 is given by (cid:14)2.2., and then solving the variational problem: find f in MM to minimize n p (cid:14)3.1. LL(cid:14)y, f. q 2l(cid:7) uby15Pbf52. bs1 The minimizer of (cid:14)3.1. is also known to have a representation of the form (cid:14)2.10. wO’Sullivan, Yandell and Raynor (cid:14)1986., Wahba (cid:14)1990.x and it is well known that now c and d are the minimizers of n n (cid:14)3.2. I(cid:14)c,d. sy (cid:7) li(cid:14)f9(cid:14)t(cid:14)i..dqj9(cid:14)t(cid:14)i..c. q 2lc9Quc, is1 where li(cid:14)fi.syifiyb(cid:14)fi. with fisf9(cid:14)t(cid:14)i..dqj9(cid:14)t(cid:14)i..c and where Qu is as in (cid:14)2.11.. Since the l ’s are not quadratic, (cid:14)3.2. cannot be minimized directly. i Ifall l (cid:14)f .’sarestrictlyconcave,wecanuseaNewton]Raphsonprocedureto i i compute c and d for fixed l and u. Let uisydlirdfi, u9s(cid:14)u1,...,un., wisyd2lirdfi2, Wsdiag(cid:14)w1,...,wn. and Ss(cid:14)f(cid:14)t(cid:14)1..,...,f(cid:14)t(cid:14)n...9. Note that from the properties of the exponential family, the vector u and the diagonal entries of the matrix W contain the means and variances for the distributions with parameter fi. We have ›Ir›csQuuqnlQuc, ›Ir›ds S9u, ›2Ir›c›c9sQuWQuqnlQu, ›2Ir›c›d9sQuWS and ›2Ir›d›d9s S9WS. The Newton]Raphson iteration satisfies the linear system (cid:14)3.3. (cid:15)QuWyQuqnlQu QuWyS/ (cid:15) cycy / s (cid:15)yQuuyynlQucy /, S9WyQu S9WyS dydy yS9uy where the subscript minus indicates quantities evaluated at the previous Newton]Raphsoniteration;seeGu(cid:14)1990..Withsomeabuseofnotationwhen the meaning is clear, we will here let f stand for the vector (cid:14)f ,..., f .9. 1 n Then, as in Gu (cid:14)1990., fsSdqQuc is always unique as long as S is of full column rank. So only a solution of (cid:14)3.3. is needed. If Q is nonsingular, (cid:14)3.3. u is equivalent to the system (cid:14)3.4. (cid:14)WyQuqnlI.cqWySds(cid:14)Wyfyyuy., S9cs0. If Q is singular, any solution to (cid:14)3.4. is also a solution to (cid:14)3.3.. Let u QuW.y.,TuhsenW(cid:14)y13r.42Q. buWecyo1rm2,escWysWyy1r2c, SWysWy1r2S and ˜˜ysWyy1r2(cid:14)Wyfyy y (cid:14)QW ,uqnlI.cW qSW ds˜˜y, (cid:14)3.5. y y y SWX cs0; y compare (cid:14)2.11.. So far, the smoothing parameters lbsluby1 are fixed. We now consider their automatic choice. It is easy to see that the solution of (cid:14)3.4. gives the 1874 G. WAHBA, ET AL. minimizer of n n p (cid:7) (cid:14)˜˜yiywi1r2fi.2q 2l(cid:7) uby15Pbf52 is1 bs1 (cid:14)3.6. n n p s (cid:7) wiy(cid:14) ˜yiyfi.2q 2l(cid:7) uby15Pbf52, is1 bs1 where ˆyisfiyyuiyrwiy and the ˜˜yiswi1r2˜yi are the components of ˜˜y defined before (cid:14)3.5.. The ˜y’s are called the pseudo-data. The Newton] i Raphson procedure iteratively reformulates the problem to estimate the f ’s i from the pseudo-data by weighted penalized least squares. See Wang (cid:14)1994. for further details. Wang (cid:14)1994. proved the following lemma, which shows that the pseudo-data approximately have the usual data structure if f is the canonical parameter and f is not far from f. y LEMMA 1. Supposethatbof (cid:14)2.1. hastwocontinuousderivativesandb0 is uniformly bounded away from 0. If <fiyyfi<so(cid:14)1. uniformly in i, then ˜yisfiq«iqop(cid:14)1., where « has mean 0 and variance wy1. i i See also Gu (cid:14)1990.. Wahba w(cid:14)1990., Section 9.2x suggested (cid:14)in the single smoothing parameter case. that lbe chosen by minimizing the generalized cross-validation (cid:14)GCV. score 1rn5(cid:14)IyA(cid:14)l,u..˜˜y52 (cid:14)3.7. V(cid:14)l,u. s , (cid:14)1rn.tr(cid:14)IyA(cid:14)l,u.. 2 where A(cid:14)l,u. satisfies (cid:14)3.8. (cid:14)w11yr2fl,u(cid:14)t(cid:14)1..,...,wn1yr2fl,u(cid:14)t(cid:14)n...9sA(cid:14)l,u.˜˜y and f (cid:14)t(cid:14)i..’s are computed from the solution of (cid:14)3.5.. She suggested that l l,u be fixed, the Newton]Raphson iteration (cid:14)3.3. be run to convergence, V(cid:14)l. be evaluated, a newlbe chosen, the new V(cid:14)l. be evaluated at convergence and then lbe chosen to minimize the V(cid:14)l. so obtained. Gu (cid:14)1992a. provided an argument why a better estimate would result from carrying out one step of the Newton]Raphson iteration, minimizing the GCV score, carrying out a second iteration with the new value of land iterating to convergence. See also Yandell (cid:14)1986.. Inthecase li(cid:14)fi.syifiyb(cid:14)fi.,thedispersionparameteris1and u2irwis (cid:14)yiyEyi.2rvar(cid:14)yi.. As a result, Gu (cid:14)1992a. suggested that V be replaced by
Description: