Smoothing Spline ANOVA for Multivariate Bernoulli
Observations, With Application to
Ophthalmology Data
Fangyu Gao, Grace Wahba, Ronald Klein, and Barbara Klein

Fangyu Gao is Senior Economist, Freddie Mac, McLean, VA 22102 (E-mail: fangyu_gao@freddiemac.com). Grace Wahba is Bascom Professor, Department of Statistics, University of Wisconsin, Madison, WI 53706 (E-mail: wahba@stat.wisc.edu). Ronald Klein, MD, is Professor, and Barbara Klein, MD, is Professor, Department of Ophthalmology, University of Wisconsin, Madison, WI 53705 (E-mail: kleinr@epi.ophth.wisc.edu; kleinb@epi.ophth.wisc.edu).
We combine a smoothing spline analysis of variance (SS-ANOVA) model and a log-linear model to build a partly flexible model for multivariate Bernoulli data. The joint distribution conditioning on the predictor variables is estimated. The log odds ratio is used to measure the association between outcome variables. A numerical scheme based on the block one-step successive over relaxation SOR–Newton-Raphson algorithm is proposed to obtain an approximate solution for the variational problem. We extend the generalized approximate cross validation (GACV) and the randomized GACV for choosing smoothing parameters to the case of multivariate Bernoulli responses. The randomized version is fast and stable to compute and is used to adaptively select smoothing parameters in each block one-step SOR iteration. Approximate Bayesian confidence intervals are obtained for the flexible estimates of the conditional logit functions. Simulation studies are conducted to check the performance of the proposed method, using the comparative Kullback–Leibler distance as a yardstick. Finally, the model is applied to two-eye observational data from the Beaver Dam Eye Study, to examine the association of pigmentary abnormalities and various covariates.

KEYWORDS: Log-linear models; Multivariate responses; Odds ratio; Penalized likelihood; Repeated measurements; Representers; Reproducing kernel Hilbert space; Risk factor estimation; Semiparametric regression; Smoothing spline analysis of variance.
1. INTRODUCTION

Correlated Bernoulli outcomes may come from many applications. One motivation of this study is to develop a flexible model for analyzing typical data from ophthalmological studies. Usually, paired observations are available for both eyes of the same person. Both person-specific and eye-specific covariates may be available as predictor variables. The outcomes for the same person are expected to be correlated even after adjustment for the available predictor variables. This association reflects the consequence of unmeasured or unmeasurable genetic, behavioral, or other risk factors. Other examples involving correlated outcomes include two-period cross-over designs (Jones and Kenward 1989), twin studies (Cessie and Houwelingen 1994), and typical longitudinal studies (Diggle, Liang, and Zeger 1994). Sometimes it is also of interest to model several closely related endpoints simultaneously. For example, Liang, Zeger, and Qaqish (1992) considered two endpoints from the Indonesian Children's Study, respiratory and diarrheal infections, in the same model.

We are interested in finding the relation between the outcome variables and the predictor variables, including conditional correlations between the outcome variables. Due to the complexity of biological processes, linear parametric assumptions on some scale, or even quadratic or cubic assumptions, might not be adequate. When such an assumption is far away from the truth, the results obtained under it may even be misleading. Hence we are interested in building a flexible statistical model. A nonparametric model of the type considered in this article can also serve as an automated diagnostic tool for parametric fitting. The model should also have readily interpretable results for multivariate function estimates and a reasonable assessment of accuracy after the model has been fitted. This property is especially important for medical researchers, as the investigators are usually interested in understanding the cause of certain outcomes.

It is not enough to simply estimate the marginal distribution separately for the individual outcome variables. The dependence structure can be useful for the efficient estimation of the mean values, or it can be of direct scientific interest. This is a very active research topic, with numerous schemes proposed to study it. For example, Cox (1972) expressed the likelihood function in terms of the multivariate exponential family distribution, and Qu, Williams, Beck, and Goormastic (1987) considered conditional logistic models. Lipsitz, Laird, and Harrington (1991), Williamson, Kim, and Lipsitz (1995), and Heagerty and Zeger (1996) considered marginal models and used the (global) odds ratio as a measure of association. Liang et al. (1992) discussed the difference between log-linear and marginal models. Molenberghs and Ritter (1996) proposed a likelihood-based marginal model and established the connection with the second-order generalized estimating equations (GEEs). McCullagh and Nelder (1989) and Glonek and McCullagh (1995) proposed a multivariate marginal logistic regression model. Other related work has been done by Zhao, Prentice, and Self (1992), Fitzmaurice and Laird (1993), and Carey, Zeger, and Diggle (1993). Katz, Zeger, and Liang (1994) specifically discussed approaches to account for the association between fellow eyes.

Researchers have already realized the merit of a nonparametric approach to model correlated data. We note that additive smoothing splines with fixed smoothing parameters have been used by Wild and Yee (1996) and Yee and Wild (1996). These authors gave a nonparametric extension to both GEE and likelihood approaches. Heagerty and Zeger (1998) used the log odds ratio as a measurement of dependence and smoothing splines with fixed degrees of freedom to estimate it. Their model was fitted using GEEs. Lin and Zhang (1999) proposed a generalized additive mixed effects model and used smoothing splines to estimate the additive fixed effects term. Berhane and Tibshirani (1998) also provided a generalized additive model for the GEE method.

In this article we explore how to combine smoothing spline analysis of variance (SS-ANOVA) and log-linear models to build a partly flexible model for multivariate Bernoulli observations. We also propose a method for choosing the smoothing parameters adaptively in this multivariate outcome case. Classical log-linear models have been widely used to estimate joint conditional probabilities (see Bishop, Fienberg, and Holland 1975). SS-ANOVA provides a general framework for multivariate nonparametric function estimation that allows both main effects and interaction terms. The popular additive spline models discussed by Hastie and Tibshirani (1990) and implemented in S (Chambers and Hastie 1992) are special cases of SS-ANOVA models restricted to main effects. Hastie and Tibshirani (1990) also commented on interaction spaces.

SS-ANOVA models have been studied extensively for Gaussian data. Recently, Y. Lin (2000) obtained some general convergence results for the tensor product space ANOVA model and showed that the SS-ANOVA model achieves the optimal convergence rate. Wahba, Wang, Gu, Klein, and Klein (1995) gave a general setting for applying SS-ANOVA to data from exponential families. They successfully applied their method to analyze demographic medical data with univariate Bernoulli outcomes. X. Lin (1998) proposed using SS-ANOVA to analyze data with polychotomous responses. Wang (1998a) developed a mixed effects smoothing spline model for correlated Gaussian data. Brumback and Rice (1998) proposed smoothing spline models for correlated curves (see also Wang 1998b; Wang and Brown 1996). Interesting connections between SS-ANOVA models and graphical models, as discussed by Whittaker (1990), Jordan (1998), and others, also may be observed.

It is of particular interest to us to explore the nonlinearity of the conditional logit functions. We use the log odds ratio to model the association among multivariate Bernoulli outcomes. We restrict the log odds ratios to simple parametric forms and estimate them using maximum likelihood estimation. However, the methods that we propose here can be easily generalized to estimate log odds ratios nonparametrically, and to other cases where the log-likelihood can be fully specified. An extension of the generalized approximate cross-validation (GACV) proposed by Xiang and Wahba (1996) to multivariate responses is derived for choosing the smoothing parameters. It is derived starting with an approximate leaving-out-one-subject argument, in contrast to the leaving-out-one-observation argument used by Xiang and Wahba (1996) in the original derivation of the GACV. Then the randomized trace technique for computing the GACV obtained by Wahba, Lin, Gao, Xiang, Klein, and Klein (1999) and Lin, Wahba, Xiang, Gao, Klein, and Klein (2000) is generalized to the extension of the GACV just derived. An efficient numerical approximation scheme and iterative algorithm involving a tailored block one-step successive over relaxation SOR–Newton-Raphson algorithm is proposed to compute the conditional logit functions and log odds ratios along with the smoothing parameter estimates. Simulation studies are presented to demonstrate the efficacy of the methods, and finally, the results are applied to two-eye observational data from the Beaver Dam Eye Study to examine the association of pigmentary abnormalities and various covariates.

2. LOG-LINEAR MODEL FOR MULTIVARIATE BERNOULLI OBSERVATIONS

We first present the log-linear model for multivariate Bernoulli data. Assuming that there are $J$ different endpoints and $K_j$ repeated measurements for the $j$th endpoint, let $Y_{jk}$ denote the $k$th measurement of the $j$th endpoint. For example, in ophthalmological studies, we have two repeated measurements for each disease: left eye and right eye. In a typical longitudinal study, we have repeated measurements over time. $Y = (Y_{jk},\, j=1,\ldots,J,\, k=1,\ldots,K_j)$ is a multivariate Bernoulli outcome variable. Let $X_{jk} = (X_{jk1}, X_{jk2},\ldots,X_{jkD})$ be a vector of predictor variables ranging over the subset $\mathcal{X}$ of $\mathbb{R}^D$, where $X_{jkd}$ denotes the $d$th predictor variable for the $k$th measurement of the $j$th endpoint. Some predictor variables may take different values for different measurements, whereas others may be the same for all $Y_{jk}$'s. For example, in ophthalmology studies both person-specific predictors and eye-specific predictors may be present. The person-specific predictors are the same for each person. For the eye-specific predictors, the set of predictor variables is the same, but the variables may take different values for the left and right eyes. We can treat observations from both eyes as correlated repeated measurements in our model. Let $X = (X_{jk},\, j=1,\ldots,J,\, k=1,\ldots,K_j)$. Then $(X,Y)$ is a pair of random vectors. For a response vector $y = (y_{jk},\, j=1,\ldots,J,\, k=1,\ldots,K_j)$, its joint probability distribution conditioning on the predictor variables $X$ can be written as

$P(Y=y \mid X) = \exp\Big\{ \sum_{j=1}^{J}\sum_{k=1}^{K_j} f_{jk} y_{jk} + \sum_{j=1}^{J}\sum_{k_1<k_2} \theta_{jk_1,jk_2}\, y_{jk_1} y_{jk_2} + \sum_{j_1<j_2}\sum_{k_1,k_2} \theta_{j_1k_1,j_2k_2}\, y_{j_1k_1} y_{j_2k_2} + \cdots + \theta_{11,12,\ldots,JK_J}\, y_{11} y_{12} \cdots y_{JK_J} - b(f,\theta) \Big\},$   (1)

where

$b(f,\theta) = \log\Big( 1 + \sum_{j,k} e^{f_{jk}} + \sum e^{f_{j_1k_1}+f_{j_2k_2}+\theta_{j_1k_1,j_2k_2}} + \cdots + e^{(\sum_{\mathrm{all}} f + \sum_{\mathrm{all}} \theta)} \Big),$   (2)

with one term for each of the possible outcome vectors.
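To make the parameterization concrete, here is a minimal numerical sketch (ours, with arbitrary illustrative values of $f$ and $\theta$, not values from the article) of the saturated model (1)-(2) for the smallest interesting case $M=2$; it also checks the interpretations given in (3) and (4) below.

```python
import numpy as np
from itertools import product

# Illustrative parameters for M = 2 correlated Bernoulli outcomes
# (one endpoint, two eyes): conditional logits f_11, f_12 and the
# pairwise log odds ratio theta_{11,12}.
f = np.array([0.3, -0.5])
theta = 0.8

def log_weight(y):
    # exponent in (1) before subtracting the normalizer b(f, theta)
    return f @ y + theta * y[0] * y[1]

ys = [np.array(y) for y in product([0, 1], repeat=2)]
b = np.log(sum(np.exp(log_weight(y)) for y in ys))       # normalizer (2)
probs = {tuple(y): np.exp(log_weight(y) - b) for y in ys}

assert abs(sum(probs.values()) - 1.0) < 1e-12
# f_jk is the logit of P(Y_jk = 1 | all other outcomes = 0); cf. (3)
logit_1 = np.log(probs[(1, 0)] / probs[(0, 0)])
# theta is the conditional log odds ratio; cf. (4) with (6)
log_or = np.log(probs[(1, 1)] * probs[(0, 0)]
                / (probs[(1, 0)] * probs[(0, 1)]))
print(logit_1, log_or)   # recovers f[0] = 0.3 and theta = 0.8
```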
Let $M = \sum_{j=1}^{J} K_j$ be the length of the vector $Y$. There are a total of $2^M - 1$ parameters: $(f,\theta) = (f_{11}, f_{12},\ldots,f_{JK_J}, \theta_{11,12},\ldots,\theta_{11,12,\ldots,JK_J})$, which may depend on $X$. The parameter space is unconstrained. The parameters have straightforward interpretations in terms of conditional probabilities; for example,

$f_{jk} = \mathrm{logit}\{P(Y_{jk}=1 \mid Y_{(-jk)}=0, X)\}$   (3)

is the conditional logit function;

$\theta_{j_1k_1,j_2k_2} = \log \mathrm{OR}(Y_{j_1k_1}, Y_{j_2k_2} \mid Y_{(-j_1k_1,-j_2k_2)}=0, X)$   (4)

is the conditional log odds ratio, which is a meaningful way to measure pairwise association; and

$\theta_{j_1k_1,j_2k_2,j_3k_3} = \log \mathrm{OR}(Y_{j_1k_1}, Y_{j_2k_2} \mid Y_{j_3k_3}=1,\, Y_{(-j_1k_1,-j_2k_2,-j_3k_3)}=0, X) - \log \mathrm{OR}(Y_{j_1k_1}, Y_{j_2k_2} \mid Y_{j_3k_3}=0,\, Y_{(-j_1k_1,-j_2k_2,-j_3k_3)}=0, X)$   (5)

measures three-way association. Here $Y_{(-jk)}$ denotes the subset of the vector $Y$ except $Y_{jk}$, and

$\mathrm{logit}(p) = \log\frac{p}{1-p}, \qquad \mathrm{OR}(v,w) = \frac{P(v=1,w=1)\,P(v=0,w=0)}{P(v=1,w=0)\,P(v=0,w=1)}.$   (6)
Now assume that we have $n$ independent observations $(x_i, y_i)$, $i=1,\ldots,n$, where $y_i = (y_{i11}, y_{i12},\ldots,y_{iJK_J})$ and $x_i = (x_{i11}, x_{i12},\ldots,x_{iJK_J})$. Here $y_{ijk}$ and $x_{ijk} = (x_{ijk1}, x_{ijk2},\ldots,x_{ijkD})$ are the outcome variable and predictor vector for the $k$th measurement of the $j$th endpoint of the $i$th subject. From now on we use $f_i$ and $\theta_i$ to denote the parameters for the $i$th subject, while $y = (y_1,\ldots,y_n)$, $f = (f_1,\ldots,f_n)$, and $\theta = (\theta_1,\ldots,\theta_n)$. We can write the log-likelihood function based on the observed data as

$\mathcal{L}(y,f,\theta) = \sum_{i=1}^{n}\Big[ \sum_{j=1}^{J}\sum_{k=1}^{K_j} f_{ijk} y_{ijk} + \sum_{j=1}^{J}\sum_{k_1<k_2} \theta_{ijk_1,ijk_2}\, y_{ijk_1} y_{ijk_2} + \sum_{j_1<j_2}\sum_{k_1,k_2} \theta_{ij_1k_1,ij_2k_2}\, y_{ij_1k_1} y_{ij_2k_2} + \cdots + \theta_{i11,i12,\ldots,iJK_J}\, y_{i11} y_{i12}\cdots y_{iJK_J} - b(f_i,\theta_i) \Big].$   (7)

We call (7) the log-linear model for multivariate logistic regression. Here $f_{ijk}$ is the conditional logit function for the $k$th measurement of the $j$th endpoint of the $i$th subject. Scientifically, except for the possibility that the $f_{ijk}$ may take different predictor values from measurement to measurement, there is little reason to believe that they will take different functional forms for the same endpoint. Hence we can assume that $f_{ijk} = f_j(x_{ijk})$. The same reasoning applies to the association terms; for example, we can assume that $\theta_{ij_1k_1,ij_2k_2} = \theta_{j_1j_2}(x_{ij_1k_1}, x_{ij_2k_2})$.

In practice, the number of parameters to be estimated can be reduced in many ways. For example, in many situations scientific interest is focused primarily on the conditional logit function $f_{ijk}$ and the log odds ratio $\theta_{ij_1k_1,ij_2k_2}$, which measures pairwise association. The existence of three-way associations $\theta_{ij_1k_1,ij_2k_2,ij_3k_3}$ and higher-order associations is usually difficult to verify in practical situations and may attract less scientific interest. Hence it is possible to set all higher-order associations to 0 and to fit only a parsimonious model instead of the saturated one described in (7). The reduced model is a member of the quadratic exponential model of Zhao and Prentice (1990).

3. PENALIZED MULTIVARIATE LOGISTIC REGRESSION USING SMOOTHING SPLINE ANALYSIS OF VARIANCE

To simplify notation, we consider a parsimonious model here, by setting all associations of higher order than pairwise to 0 in (7). Thus the negative log-likelihood function simplifies to

$\mathcal{L}(y,f,\theta) = -\sum_{i=1}^{n} l\big(f(x_i),\theta(x_i)\big) = -\sum_{i=1}^{n}\Big[ \sum_{j=1}^{J}\sum_{k=1}^{K_j} f_j(x_{ijk})\, y_{ijk} + \sum_{j=1}^{J}\sum_{k_1<k_2} \theta_{jj}(x_{ijk_1},x_{ijk_2})\, y_{ijk_1} y_{ijk_2} + \sum_{j_1<j_2}\sum_{k_1,k_2} \theta_{j_1j_2}(x_{ij_1k_1},x_{ij_2k_2})\, y_{ij_1k_1} y_{ij_2k_2} - b(f_i,\theta_i) \Big],$   (8)

where

$b(f_i,\theta_i) = \log\Big[ 1 + \sum_{j,k} \exp\{f_j(x_{ijk})\} + \sum \exp\{f_{j_1}(x_{ij_1k_1}) + f_{j_2}(x_{ij_2k_2}) + \theta_{j_1j_2}(x_{ij_1k_1},x_{ij_2k_2})\} + \cdots + \exp\Big\{ \sum_{j,k} f_j(x_{ijk}) + \sum \theta_{j_1j_2}(x_{ij_1k_1},x_{ij_2k_2}) \Big\} \Big].$   (9)

We are interested in relaxing the parametric assumptions to build a flexible log-linear model. We are particularly interested in exploring the nonlinearity of the conditional logit functions $f$. We also could model the $\theta$ nonparametrically, but because a larger number of observations would be needed to estimate and tune many multivariate smooth functions nonparametrically, we assume in this article that the $\theta$ are of simple parametric form possibly depending on some set of parameters $\beta$, and leave more general $\theta$ for future study. We use the penalized likelihood method to model the $f_j$ nonparametrically.
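For the paired-eye case ($J=1$, $K_1=2$), the normalizer in (9) has just four terms, and the negative log-likelihood (8) can be written in a few lines. The following sketch is ours (vectorized over subjects, with the log odds ratio held constant), not code from the article:

```python
import numpy as np

# Simplified negative log-likelihood (8)-(9) for one endpoint with two
# correlated measurements; f1, f2 hold f(x_i1), f(x_i2) for all subjects.
def neg_log_lik(f1, f2, theta, y1, y2):
    # b(f_i, theta) in (9): log-sum over the four possible outcomes
    b = np.logaddexp.reduce(
        np.stack([np.zeros_like(f1), f1, f2, f1 + f2 + theta]), axis=0)
    return -np.sum(y1 * f1 + y2 * f2 + theta * y1 * y2 - b)
```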
To relax the parametric assumptions, the penalized likelihood method (O'Sullivan 1983) assumes that the function to be estimated is smooth in some sense. This is done by assuming that the function to be estimated is in a given reproducing kernel Hilbert space of "smooth" functions, where "roughness" is measured by the size of some quadratic functional, typically a square norm or seminorm, and a roughness penalty is imposed in obtaining the estimate. Reproducing kernel Hilbert space theory (see Aronszajn 1950; Kimeldorf and Wahba 1971) can then be used to characterize the solutions of very general penalized likelihood problems. We assume then that $f_j \in \mathcal{H}^j$, where $\mathcal{H}^j$ is a given reproducing kernel Hilbert space of functions on $\mathcal{X}$. Hence $f = (f_1, f_2,\ldots,f_J) \in \mathcal{H}^1 \oplus \cdots \oplus \mathcal{H}^J$. The penalized multivariate logistic regression estimate of $f$ and $\theta = (\theta_{11}, \theta_{12},\ldots,\theta_{JJ})$ is the minimizer of the variational problem

$\mathcal{L}_\lambda(y,f,\theta) = -\sum_{i=1}^{n} l\big(f(x_i),\theta(x_i)\big) + \frac{n}{2} J_\lambda(f_1,\ldots,f_J),$   (10)

where the first part is the negative log-likelihood, the second part is the roughness penalty, and $\lambda = (\lambda_1,\ldots,\lambda_J)$ are the smoothing parameters that control the smoothness of the estimated $f_j$'s. We assume an additive form of the penalty functional for simplicity and easy interpretation,

$J_\lambda(f_1,\ldots,f_J) = \sum_{j=1}^{J} J^j_{\lambda_j}(f_j).$   (11)

We consider the orthogonal decomposition $\mathcal{H}^j = \mathcal{H}^j_0 \oplus \mathcal{H}^j_1$. Here $\mathcal{H}^j_0$ is finite dimensional (the "parametric" part, usually polynomials), and $\mathcal{H}^j_1$ (the "smooth" part) is the orthocomplement of $\mathcal{H}^j_0$ in $\mathcal{H}^j$. The penalty functional will be related only to the smooth part of the function, $J^j_{\lambda_j}(f_j) = \|P^j_1 f_j\|^2_{\lambda_j}$, where $P^j_1$ is the orthogonal projection operator in $\mathcal{H}^j$ onto $\mathcal{H}^j_1$. The penalized likelihood has the expression

$\mathcal{L}_\lambda(y,f,\theta) = -\sum_{i=1}^{n} l\big(f(x_i),\theta(x_i)\big) + \frac{n}{2}\sum_{j=1}^{J} \|P^j_1 f_j\|^2_{\lambda_j}.$   (12)

The following theorem shows the existence and uniqueness of the solution to the variational problem (10). Letting $\mathcal{H}_0 = \mathcal{H}^1_0 \oplus \cdots \oplus \mathcal{H}^J_0$ denote the null space of $\mathcal{H}^1 \oplus \cdots \oplus \mathcal{H}^J$ with respect to the penalty function $J_\lambda$, we have:

Theorem 1. If the minimizer of (12) exists in $\mathcal{H}_0$, then it uniquely exists in $\mathcal{H}^1 \oplus \cdots \oplus \mathcal{H}^J$.

Proof. See Appendix A.
If $\phi_\nu$, $\nu = 1,\ldots,M$, span $\mathcal{H}_0$, then this corresponds to the identifiability of the ordinary parametric model, which is a linear combination of the $\phi_\nu$. In the later examples, the $\phi_\nu$ are low-degree polynomials.

3.1 Smoothing Spline Analysis of Variance Models for $f_j$

Here we briefly review some of the well-known properties of SS-ANOVA models. For any $j$ ($1 \le j \le J$), let $g = f_j$. Here $g$ is assumed to be a function of $D$ variables, $g = g(x_1,\ldots,x_D)$ with $x_d \in \mathcal{X}^{(d)}$, $d = 1,\ldots,D$; thus $x = (x_1,\ldots,x_D) \in \mathcal{X} = \mathcal{X}^{(1)} \times \cdots \times \mathcal{X}^{(D)}$. Given probability measures $d\mu_d$ on $\mathcal{X}^{(d)}$, define the averaging operators $\mathcal{E}_d$ on $\mathcal{X}$ as

$(\mathcal{E}_d g)(x) = \int_{\mathcal{X}^{(d)}} g(x_1, x_2,\ldots,x_D)\, d\mu_d(x_d).$   (13)

These averaging operators define a unique (functional ANOVA) decomposition of $g$ as

$g(x_1,\ldots,x_D) = \mu + \sum_{d=1}^{D} g_d(x_d) + \sum_{d_1<d_2} g_{d_1,d_2}(x_{d_1},x_{d_2}) + \cdots + g_{1,\ldots,D}(x_1,\ldots,x_D),$   (14)

where $\mu = (\prod_d \mathcal{E}_d)g$, $g_d = ((I-\mathcal{E}_d)\prod_{d_1 \ne d}\mathcal{E}_{d_1})g$, $g_{d_1d_2} = ((I-\mathcal{E}_{d_1})(I-\mathcal{E}_{d_2})\prod_{d_3 \ne d_1,d_2}\mathcal{E}_{d_3})g$, and so forth. The averaging operators and norms on $\mathcal{H}^j$ can be chosen so that the components of this decomposition are projections of $g$ onto orthogonal subspaces of $\mathcal{H}^j$ (see Gu and Wahba 1993; Wahba 1990; Wahba et al. 1995). In practice, the ANOVA decomposition in (14) will be truncated at some point. Assuming that we have already decided which subspaces (equivalently, components of $g$) will be included in our model $\mathcal{M} \subseteq \mathcal{H}^j$, and suppressing the superscript $j$ on the subspaces for the rest of this section, we can regroup and write the model space as

$\mathcal{M} = \mathcal{H}_0 \oplus (\mathcal{H}_{1,1} \oplus \cdots \oplus \mathcal{H}_{1,q}) = \mathcal{H}_0 \oplus \mathcal{H}_1.$   (15)

Usually we will let $\mathcal{H}_0$ be a finite-dimensional space containing functions that will not be penalized. The norms on the composite $\mathcal{H}_{1,l}$, $1 \le l \le q$, are the tensor product norms induced by the norms on the component subspaces, and $\|g\|^2_\lambda = \|P_0 g\|^2 + \sum_{l=1}^{q} \lambda_l \|P_l g\|^2$, where $P_l$ is the orthogonal projector in $\mathcal{M}$ onto $\mathcal{H}_{1,l}$. Now we can use reproducing kernel Hilbert space methods to explicitly impose roughness penalties. The penalty functional in (11) can be further rewritten as

$J^j_{\lambda_j}(f_j) = \sum_{l=1}^{q} \lambda_{jl} \|P^j_l f_j\|^2.$   (16)

It is well known (see, e.g., Kimeldorf and Wahba 1971; Wahba 1990) that the minimizer of the variational problem (12) is in a particular finite-dimensional subspace spanned by a set of basis functions obtained from the reproducing kernel associated with the model space. It is known that the minimizers $f_j$ of (12) have the form

$f_j(\cdot) = \phi^j(\cdot)^T d^j + \xi^j(\cdot)^T c^j,$   (17)

where $c^j$ and $d^j$ are vectors of coefficients to be found. Here $\{\phi^j_\nu\}_{\nu=1}^{p_j}$ is a set of $p_j$ basis functions spanning the null space $\mathcal{H}^j_0$, $\phi^j(\cdot)^T = (\phi^j_1(\cdot),\ldots,\phi^j_{p_j}(\cdot))$, and $\xi^j(\cdot)^T = (\xi_{ijk}(\cdot),\, 1 \le i \le n,\, 1 \le k \le K_j)$, where $\xi_{ijk}(\cdot)$ is the representer of the evaluation functional at $x_{ijk}$ in $\mathcal{H}^j_1$. It is not necessary to be familiar with the concept of the representer of an evaluation functional in a reproducing kernel Hilbert space to follow the rest of the article, if one is willing to accept the result that our estimates will be of the form (17) and that the penalty functionals are quadratic forms in the $c^j$, as given following (19) below. (Interested readers can find definitions in Wahba 1990, pp. 1-2.) Descriptions of the particular $\phi^j_\nu$ and $\xi_{ijk}$ used in this article are given in Section 5.
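Assuming a reproducing kernel `kernel` for the smooth subspace (the cubic-spline choice used in this article is spelled out in Section 5), an estimate of the form (17) can be evaluated as in the following sketch; the function names and the one-dimensional setting on $[0,1]$ are illustrative assumptions, not the article's code:

```python
import numpy as np

def eval_f(x, basis_pts, c, d, kernel):
    # phi(x)^T d: unpenalized null-space part, here spanned by 1 and u - 1/2
    phi = np.column_stack([np.ones_like(x), x - 0.5])
    # xi(x)^T c: representers of evaluation at the chosen basis points
    xi = np.array([[kernel(u, v) for v in basis_pts] for u in x])
    return phi @ d + xi @ c

# The roughness penalty is then the quadratic form c^T Q c with
# Q[v, w] = kernel(basis_pts[v], basis_pts[w]); see (18)-(20) below.
```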
The smoothing parameters $\lambda_j = (\lambda_{j1},\ldots,\lambda_{jq})$ are incorporated into $\xi^j(\cdot)^T$. Details have been given by Gu and Wahba (1993), Wahba (1990, chap. 10), and Wahba et al. (1995). The penalty functionals are known quadratic forms in the $c^j$. We give examples later. Given the $\{\lambda_{jl}\}$ and the result (17), the numerical problem is to obtain the minimizer of (10).

3.2 The Fitting Algorithm

For computational reasons, only an approximate minimizer of (12) will be obtained, using only a subset of the basis functions $\{\xi_{ijk}(\cdot),\, 1 \le i \le n,\, 1 \le k \le K_j\}$. Usually, when the design points are close, their representers are also very close. Hence when the dataset is large, it is very likely that many of the basis functions are nearly linearly dependent. On the other hand, if by some "prior" knowledge the structure of the true $f_j$ is thought to be not very complicated, then it may be well approximated by a small number of basis functions. As a result, if we select a subset of the design points having maximum separation, then their corresponding representers are expected to have less correlation. We choose a fixed $V$ and use the SAS procedure FASTCLUS to cluster the $nK_j$ values of the design points $\{x_{ijk},\, 1 \le i \le n,\, 1 \le k \le K_j\}$ into $V$ clusters. Within the $v$th cluster, $v = 1,\ldots,V$, we randomly select one design point and call it $x_{j,v}$. The representer of evaluation in $\mathcal{H}^j_1$ at $x_{j,v}$, called $\xi_{j,v}$, is used to form the approximating subspace. An approximate solution is of the form $f_j(\cdot) = \phi^j(\cdot)^T d^j + \xi^j_V(\cdot)^T c^j_V$, where $\phi^j(\cdot)^T$ is as before and $\xi^j_V(\cdot)^T = (\xi_{j,v}(\cdot),\, v = 1,\ldots,V)$, and is computed from (12) by a block one-step SOR–Newton-Raphson algorithm (see Ortega and Rheinboldt 1970). When the number of basis functions $V$ increases, the approximate solution converges to the exact solution. We have found that for moderately large $V$, it is not important which point in each cluster is chosen.
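The clustering step can be sketched as follows; the article uses SAS FASTCLUS, and the k-means routine below is a stand-in for it, not the original implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def select_basis_points(design_pts, V):
    # design_pts: (n*K_j, D) array of covariate vectors x_ijk
    labels = KMeans(n_clusters=V, n_init=10,
                    random_state=0).fit(design_pts).labels_
    # randomly pick one design point from each of the V clusters
    idx = [rng.choice(np.flatnonzero(labels == v)) for v in range(V)]
    return design_pts[np.array(idx)]
```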
Recall that the $\theta$ could depend on a set of parameters $\beta$. The iterative algorithm for the approximate solution is given in Table 1.

Table 1. Iterative Algorithm for Approximate Spline

  V <- initial value
  do
    Cluster the data points into V groups
    Randomly select one data point from each group
    Generate the corresponding basis functions
    f_j <- initial values, j = 1, 2, ..., J
    beta <- initial values
    do
      do j = 1 to J
        f_j <- updated values in the approximating subspace
      end
      beta <- Newton-Raphson update for beta
    until (convergence)
    V <- 2V
  until ||f_new - f_old|| / ||f_old|| < prec1 and ||beta_new - beta_old|| / ||beta_old|| < prec2

Here prec1 and prec2 are prespecified thresholds, which were set to $10^{-7}$ and $10^{-8}$ in the examples to follow. We suggest that the initial value for $V$ be at least 25. Combined with the iterative method for choosing smoothing parameters proposed in the next section, the foregoing algorithm usually converges very rapidly. From our experience, for medical data, 50-100 basis functions usually yield a very good approximation.

Next, we discuss in detail how to use the block one-step SOR algorithm to obtain an approximate solution for fixed $V$. Numerically, we need to solve a large nonlinear system. To speed up the computation, we break down the nonlinear system into several "blocks." When updating one "block" in the nonlinear system, we fix all other "blocks" at the most recently updated values. The natural breakdown in this application is to update the $f_j$'s and $\beta$ iteratively. To update $\beta$, we use the Newton-Raphson algorithm and iterate until convergence. However, the updating step for $f_j$ is much more computationally intensive. The block nonlinear SOR algorithm requires that the Newton-Raphson algorithm be run until convergence. However, we use only a one-step updating formula, sacrificing the overall convergence rate a bit to reduce the computational complexity in each iteration. Let $S_j$ be the $nK_j \times p_j$ matrix with entries $\phi^j_\nu(x_{ijk})$. Returning to our fixed $V$ and the selected design points $x_{j,v}$, $v = 1,\ldots,V$, and the associated representers $\xi_{j,v}(\cdot)$, we define

$Q_{j,V} = \begin{pmatrix} \xi_{j,1}(x_{1j1}) & \xi_{j,2}(x_{1j1}) & \cdots & \xi_{j,V}(x_{1j1}) \\ \xi_{j,1}(x_{1j2}) & \xi_{j,2}(x_{1j2}) & \cdots & \xi_{j,V}(x_{1j2}) \\ \vdots & \vdots & \ddots & \vdots \\ \xi_{j,1}(x_{njK_j}) & \xi_{j,2}(x_{njK_j}) & \cdots & \xi_{j,V}(x_{njK_j}) \end{pmatrix}_{nK_j \times V}$

and

$\tilde{Q}_{j,V} = \begin{pmatrix} \xi_{j,1}(x_{j,1}) & \xi_{j,2}(x_{j,1}) & \cdots & \xi_{j,V}(x_{j,1}) \\ \vdots & \vdots & \ddots & \vdots \\ \xi_{j,1}(x_{j,V}) & \xi_{j,2}(x_{j,V}) & \cdots & \xi_{j,V}(x_{j,V}) \end{pmatrix}_{V \times V}.$   (18)

Let $c^j_V = (c_{j,1}, c_{j,2},\ldots,c_{j,V})^T$. Our goal is to compute the approximate solution with the representation

$f_j(\cdot) = \phi^j(\cdot)^T d^j + \xi^j_V(\cdot)^T c^j_V,$   (19)

which involves the use of $S_j$, $Q_{j,V}$, and $\tilde{Q}_{j,V}$. It is easy to verify that the penalty for $f_j$ has the quadratic form $\|P^j_1 f_j\|^2_{\lambda_j} = c^{jT}_V \tilde{Q}_{j,V} c^j_V$. Therefore, to update $f_j$, provided all other estimated values are fixed at the solutions from the previous iteration, and recalling that $f_{ijk} = f_j(x_{ijk})$, the variational problem is to minimize

$I_{j,V} = -\sum_{i=1}^{n}\Big[ \sum_{k=1}^{K_j} f_{ijk} y_{ijk} - b(f_i,\theta_i) \Big] + \frac{n}{2}\, c^{jT}_V \tilde{Q}_{j,V} c^j_V.$   (20)
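A sketch of assembling the two matrices in (18) for a generic kernel; the names are ours:

```python
import numpy as np

def make_gram(design_pts, basis_pts, kernel):
    # Q evaluates the V chosen representers at all n*K_j design points;
    # Q_tilde evaluates them at the selected points themselves.
    Q = np.array([[kernel(x, b) for b in basis_pts] for x in design_pts])
    Q_tilde = np.array([[kernel(a, b) for b in basis_pts] for a in basis_pts])
    return Q, Q_tilde   # penalty in (20) is (n/2) * c^T Q_tilde c
```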
Denote $b_i = b(f_i,\theta_i)$. To update $f_j$, we need the following, where $l_j$ is defined in Lemma C.1:

$\mu_{ijk} = \frac{\partial b_i}{\partial f_{ijk}} = E Y_{ijk} = \Big( e^{f_{ijk}} + \sum_{k_3 \ne k} e^{f_{ijk}+f_{ijk_3}+\theta_{ijk,ijk_3}} + \cdots + e^{\sum_{j_3,k_3} f_{ij_3k_3} + \sum_{j_3,k_3}\sum_{j_4,k_4} \theta_{ij_3k_3,ij_4k_4}} \Big) \Big( 1 + \sum_{j_3,k_3} e^{f_{ij_3k_3}} + \cdots + e^{\sum_{j_3,k_3} f_{ij_3k_3} + \sum_{j_3,k_3}\sum_{j_4,k_4} \theta_{ij_3k_3,ij_4k_4}} \Big)^{-1},$

$w_{ijk,ijk} = \frac{\partial^2 b_i}{\partial f_{ijk}^2} = \mathrm{var}\, Y_{ijk} = \mu_{ijk}(1-\mu_{ijk}),$

$w_{ijk_1,ijk_2} = \frac{\partial^2 b_i}{\partial f_{ijk_1}\partial f_{ijk_2}} = \mathrm{cov}(Y_{ijk_1}, Y_{ijk_2}) = E(Y_{ijk_1} Y_{ijk_2}) - EY_{ijk_1} EY_{ijk_2} = \frac{\partial b_i}{\partial \theta_{ijk_1,ijk_2}} - \mu_{ijk_1}\mu_{ijk_2} = \Big( e^{f_{ijk_1}+f_{ijk_2}+\theta_{ijk_1,ijk_2}} + \cdots + e^{\sum_{j_3,k_3} f_{ij_3k_3} + \sum_{j_3,k_3}\sum_{j_4,k_4} \theta_{ij_3k_3,ij_4k_4}} \Big) \Big( 1 + \sum_{j_3,k_3} e^{f_{ij_3k_3}} + \cdots \Big)^{-1} - \mu_{ijk_1}\mu_{ijk_2},$

$u_{ijk} = -\frac{\partial l_j(y_{ij}, f_{ij})}{\partial f_{ijk}} = -y_{ijk} + \mu_{ijk},$

$u_j = (u_{1j1}, u_{1j2},\ldots,u_{1jK_j}, u_{2j1},\ldots,u_{njK_j})^T,$

$W_{ij} = \begin{pmatrix} w_{ij1,ij1} & w_{ij1,ij2} & \cdots & w_{ij1,ijK_j} \\ w_{ij2,ij1} & w_{ij2,ij2} & \cdots & w_{ij2,ijK_j} \\ \vdots & \vdots & \ddots & \vdots \\ w_{ijK_j,ij1} & w_{ijK_j,ij2} & \cdots & w_{ijK_j,ijK_j} \end{pmatrix}_{K_j \times K_j},$

and

$W_j = \mathrm{diag}(W_{1j}, W_{2j},\ldots,W_{nj}).$
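For the paired-eye case ($J=1$, $K=2$), these moments have closed forms, sketched below under the same illustrative setup as before (ours, not the article's code):

```python
import numpy as np

# Mean and covariance of (Y_ij1, Y_ij2) when
# b_i = log(1 + e^{f1} + e^{f2} + e^{f1+f2+theta}).
def moments(f1, f2, theta):
    e1, e2, e12 = np.exp(f1), np.exp(f2), np.exp(f1 + f2 + theta)
    denom = 1.0 + e1 + e2 + e12
    mu1, mu2 = (e1 + e12) / denom, (e2 + e12) / denom   # mu_ijk = E Y_ijk
    w11 = mu1 * (1.0 - mu1)                             # var Y_ij1
    w22 = mu2 * (1.0 - mu2)                             # var Y_ij2
    w12 = e12 / denom - mu1 * mu2                       # cov(Y_ij1, Y_ij2)
    W = np.array([[w11, w12], [w12, w22]])              # W_ij
    return np.array([mu1, mu2]), W
```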
Therefore, the one-step updating formula is to solve

$\begin{pmatrix} Q^T_{j,V} W_{j-} Q_{j,V} + n\tilde{Q}_{j,V} & Q^T_{j,V} W_{j-} S_j \\ S^T_j W_{j-} Q_{j,V} & S^T_j W_{j-} S_j \end{pmatrix} \begin{pmatrix} c^j_V - c^j_{V-} \\ d^j - d^j_- \end{pmatrix} = \begin{pmatrix} -Q^T_{j,V} u_{j-} - n\tilde{Q}_{j,V} c^j_{V-} \\ -S^T_j u_{j-} \end{pmatrix},$   (21)

where the subscript minus indicates the quantities evaluated at the latest update.

With some abuse of notation, we use $u_{ij}$ to denote $(u_{ij1},\ldots,u_{ijK_j})^T$, $f_{ij}$ to denote $(f_{ij1},\ldots,f_{ijK_j})^T$, and so on. Here $\tilde{y}_{ij} = f_{ij-} - W^{-1}_{ij-} u_{ij-}$ are called the pseudo-data. It is easy to see that the solution of (21) gives the minimizer of

$\frac{1}{n}\sum_{i=1}^{n} (\tilde{y}_{ij} - f_{ij})^T W_{ij-} (\tilde{y}_{ij} - f_{ij}) + c^{jT}_V \tilde{Q}_{j,V} c^j_V.$   (22)

The block one-step SOR–Newton-Raphson procedure iteratively reformulates the problem to estimate $f_j$ from the pseudo-data by weighted penalized least squares. The pseudo-data are equivalent to the adjusted dependent vector of Yee and Wild (1996). The pseudo-data can be shown to have the usual (mean and covariance) data structure implicit in (22) if the $f_{j-}$ are not far away from $f_j$. A theorem is given in Appendix B. We use this result later to construct approximate Bayesian confidence intervals for the $f_j$.

The preceding discussion assumes no special structure in the design points. The algorithm is specifically designed to handle the unstructured case. However, when special structure is available, the foregoing algorithm can be simplified. One common case is the presence of person-specific covariates only. Hence $x_{ijk} = x_{ij}$ for all $k = 1,\ldots,K_j$. Similarly, $f_{ijk} = f_j(x_{ijk}) = f_j(x_{ij}) = f_{ij}$. To update $f_j$, the part of the penalized likelihood that needs to be minimized has the simplified form

$I_{j,V} = -\sum_{i=1}^{n}\Big[ \Big(\sum_{k=1}^{K_j} y_{ijk}\Big) f_{ij} - b_i \Big] + \frac{n}{2}\, c^{jT} \tilde{Q}_{j,V} c^j.$   (23)

Now redefine

$Q_{j,V} = \begin{pmatrix} \xi_{j,1}(x_{1j}) & \xi_{j,2}(x_{1j}) & \cdots & \xi_{j,V}(x_{1j}) \\ \xi_{j,1}(x_{2j}) & \xi_{j,2}(x_{2j}) & \cdots & \xi_{j,V}(x_{2j}) \\ \vdots & \vdots & \ddots & \vdots \\ \xi_{j,1}(x_{nj}) & \xi_{j,2}(x_{nj}) & \cdots & \xi_{j,V}(x_{nj}) \end{pmatrix}_{n \times V}.$   (24)

Denote $\bar{y}_{ij} = \sum_{k=1}^{K_j} y_{ijk}$, $\mu_{ij} = E(\sum_{k=1}^{K_j} Y_{ijk})$, $W_{ij} = \mathrm{var}(\sum_{k=1}^{K_j} Y_{ijk})$, $u_{ij} = -\sum_{k=1}^{K_j} y_{ijk} + \mu_{ij}$, and $u_j = (u_{1j}, u_{2j},\ldots,u_{nj})^T$.
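A minimal sketch of one block update (21), assuming the matrices have been assembled as above; `np.linalg.solve` stands in for whatever decomposition a production implementation would reuse across iterations:

```python
import numpy as np

def one_step_update(c, d, Q, Q_tilde, S, W, u, n):
    # All inputs are evaluated at the latest iterate (the "minus" quantities):
    # Q (nK_j x V), Q_tilde (V x V), S (nK_j x p), W block-diagonal weights,
    # u = -y + mu stacked over subjects and measurements.
    WQ, WS = W @ Q, W @ S
    A = np.block([[Q.T @ WQ + n * Q_tilde, Q.T @ WS],
                  [S.T @ WQ,               S.T @ WS]])
    rhs = np.concatenate([-Q.T @ u - n * (Q_tilde @ c), -S.T @ u])
    step = np.linalg.solve(A, rhs)          # (c - c_minus, d - d_minus)
    V = Q_tilde.shape[0]
    return c + step[:V], d + step[V:]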
Except for the aforementioned changes, all of the previous formulas and discussions remain true.

Finally, by the SOR–Newton theorem of Ortega and Rheinboldt (1970) and corollary 3.1 of X. Lin (1998), the local convergence property holds if we use the block one-step SOR–Newton-Raphson method to find the minimizer of a twice-differentiable convex function. Thus there exists an open ball around the minimizer, and when the starting point is in this open ball, the algorithm will converge to the minimizer for fixed smoothing parameters.

In response to a referee's query, we note that the backfitting algorithm popularized by Hastie and Tibshirani (1990) is a form of block SOR algorithm. A block SOR algorithm was also used by Wahba et al. (1995), who gave a detailed discussion of the relation between the block SOR they used and the backfitting algorithm of Hastie and Tibshirani. However, the way in which we have assigned blocks in this article is different from either of these forms of block SOR.

4. ADAPTIVE CHOICE OF THE SMOOTHING PARAMETERS

So far, all smoothing parameters are considered fixed. Now we consider an automated data-driven method to choose the smoothing parameters. The risk function or "target" to be minimized is the comparative Kullback–Leibler (CKL) distance of the estimate from the "truth." Because the truth is not observable, except in simulation studies, the CKL also is not observable. Thus a computable proxy for the CKL whose minimizer is a good estimate of the minimizer of the CKL is desired. The GACV was proposed by Xiang and Wahba (1996) as a method for choosing a smoothing parameter for penalized likelihood estimates when the data are from an exponential family with no nuisance parameters. Those authors showed via Monte Carlo methods with Bernoulli data that its minimizer provides a good estimate of the minimizer of the CKL in relatively small samples. The derivation of the GACV score begins with a leave-out-one argument. Direct calculation of the GACV score becomes unstable in the large-sample case due to the existence of the inverse of a generally ill-conditioned matrix in the formula (see Xiang and Wahba 1996). Later, Lin et al. (2000) and Wahba et al. (1999) demonstrated that the GACV score can be computed and minimized over multiple smoothing parameters with the use of a randomized trace technique (ranGACV), which provides a stable calculation. In this article, we extend the GACV, and then the ranGACV, to the case of correlated Bernoulli observations. Instead of using a leave-out-one-observation argument to start the derivation, we leave out one person. Following the derivation, in the next section we demonstrate via a small simulation study the properties of the ranGACV for providing an estimate of the minimizer of the CKL in the correlated Bernoulli case.

To choose the smoothing parameters for $f_j$, we let all of the other conditional logit functions $f_\ell$ ($\ell \ne j$) and the $\theta$ be fixed at the estimated values from the previous iteration. Let $f^{[-i]}_{ijk}$ denote the estimated conditional logit function $f_j$ evaluated at $x_{ijk}$, with the $i$th subject left out for the estimation procedure. The ordinary leave-out-one-subject cross validation function for choosing $\lambda_j$, $CV(\lambda_j)$, is defined as

$CV(\lambda_j) = \frac{1}{n}\sum_{i=1}^{n}\Big[ -\sum_{k=1}^{K_j} y_{ijk} f^{[-i]}_{ijk} + b(f_{ij}) \Big] = \frac{1}{n}\sum_{i=1}^{n}\Big[ -\sum_{k=1}^{K_j} y_{ijk} f_{ijk} + b(f_{ij}) \Big] + \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K_j} y_{ijk}\big(f_{ijk} - f^{[-i]}_{ijk}\big).$   (25)

Because $y_{ijk}$ and $f^{[-i]}_{ijk}$ are independent, and because for large sample sizes $f^{[-i]}_{ijk}$ is expected to be close to $f_{ijk}$, CV is expected to provide an approximately unbiased estimate of the CKL distance, given $\theta$ and $f^{(-j)}$ fixed. First, let us assume that the exact solution of the variational problem can be computed. Let $y_j = (y_{ijk},\, 1 \le i \le n,\, 1 \le k \le K_j)^T$. In each updating step for $f_j$, the minimizer of

$I_{\lambda_j}(f_j, y_j) = -\sum_{i=1}^{n}\Big[ \sum_{k=1}^{K_j} y_{ijk} f_{ijk} - b(f_{ij}) \Big] + \frac{n}{2} J^j_{\lambda_j}(f_j)$   (26)

is obtained. By using several first-order Taylor expansions, we derive an approximate cross validation (ACV) score. The GACV is a generalization of the ACV score.

First, we define the generalized average of submatrices as a generalization of the trace of a matrix. For any matrix $A$ with submatrices $A_{ii}$, $1 \le i \le n$, on the diagonal, $A_{ii} = [a_{i,k_1k_2}]_{K \times K}$, $1 \le k_1, k_2 \le K$: if $K = 1$, the generalized average of the $A_{ii}$ is simply the trace of $A$ divided by $n$, $S_A = \mathrm{tr}(A)/n$. When $K \ge 2$, let

$\delta = \frac{1}{nK}\sum_{i=1}^{n}\sum_{k=1}^{K} a_{i,kk} = \frac{\mathrm{tr}(A)}{nK}, \qquad \gamma = \frac{1}{nK(K-1)}\sum_{i=1}^{n}\sum_{k_1 \ne k_2} a_{i,k_1k_2}.$   (27)

The generalized average of the $A_{ii}$'s is defined as

$S_A = (\delta - \gamma) I_{K \times K} + \gamma\, ee^T = \begin{pmatrix} \delta & \gamma & \cdots & \gamma \\ \gamma & \delta & \cdots & \gamma \\ \vdots & \vdots & \ddots & \vdots \\ \gamma & \gamma & \cdots & \delta \end{pmatrix},$   (28)

where $e = (1,1,\ldots,1)^T$ is the unit vector. When $K \ge 3$, this form is suitable whenever it is reasonable to assume that associations between different pairs of repeated observations are similar. Other generalization forms are possible. The properties of the $K \ge 3$ case are a matter for future research.
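A direct transcription of (27)-(28) as a sketch (ours):

```python
import numpy as np

def generalized_average(A, n, K):
    # Generalized average of the diagonal K x K blocks of an nK x nK matrix.
    blocks = [A[i*K:(i+1)*K, i*K:(i+1)*K] for i in range(n)]
    delta = sum(np.trace(B) for B in blocks) / (n * K)
    if K == 1:
        return np.array([[delta]])
    off = sum(B.sum() - np.trace(B) for B in blocks)
    gamma = off / (n * K * (K - 1))
    return (delta - gamma) * np.eye(K) + gamma * np.ones((K, K))
```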
Let $H^j$ denote the inverse Hessian of (26) with respect to $f_j$. $H^j$ is an $nK_j \times nK_j$ matrix with $H^j_{ii}$, $i = 1,\ldots,n$, as $K_j \times K_j$ submatrices on the diagonal. Let $G^j_{ii} = I - W_{ij} H^j_{ii}$, and let $H^j_S$ and $G^j_S$ denote the generalized averages of the $H^j_{ii}$ and $G^j_{ii}$.
We now define the GACV as

$GACV(\lambda_j) = \frac{1}{n}\sum_{i=1}^{n}\Big[ -\sum_{k=1}^{K_j} y_{ijk} f_{ijk} + b(f_{ij}) \Big] + \frac{1}{n}\sum_{i=1}^{n} (y_{ij1},\ldots,y_{ijK_j})\, H^j_S (G^j_S)^{-1} \begin{pmatrix} y_{ij1} - \mu_{ij1} \\ \vdots \\ y_{ijK_j} - \mu_{ijK_j} \end{pmatrix}.$   (29)

Details of the derivation of the GACV are given in Appendix C. Direct computation of the GACV for large sample sizes is slow and tends to be unstable (see Xiang and Wahba 1996, sec. 3.1). This explicit calculation can be avoided by using a technique in the spirit of the randomized trace method, provided that a solution (either exact or approximate) of the variational problem can be obtained at a lower cost. We propose a one-step randomized estimate of the GACV that is fast and cheap to calculate.

Given a square matrix $A$ with $A_{ii}$ ($1 \le i \le n$) being the $K \times K$ submatrices on the diagonal, we discuss how to obtain a randomized estimate of $S_A$. First, $n$ iid vectors of $K$ iid random variables distributed as N(0,1) are generated. Let $\epsilon_i = (\epsilon_{i1},\ldots,\epsilon_{iK})^T$ be the $i$th such vector and $\epsilon = (\epsilon_1^T,\ldots,\epsilon_n^T)^T$. Then $\delta = \mathrm{tr}(A)/(nK)$ can be estimated by $(\epsilon^T A \epsilon)/(nK)$. On the other hand, if $K > 1$, then $\gamma = (\sum_i \sum_{k_1 \ne k_2} a_{i,k_1k_2})/(nK(K-1))$. To estimate $\sum_i \sum_{k_1 \ne k_2} a_{i,k_1k_2}$, let $\bar{\epsilon}_i = (1/\sqrt{K})\sum_{k=1}^{K} \epsilon_{ik}$ and $\bar{\epsilon} = (\bar{\epsilon}_1,\ldots,\bar{\epsilon}_1, \bar{\epsilon}_2,\ldots,\bar{\epsilon}_n)^T$. Here $\bar{\epsilon}$ is a column vector with $K$ replicates of $\bar{\epsilon}_i$ for each $1 \le i \le n$. Note that $E\bar{\epsilon}^T A \bar{\epsilon} = \sum_i \sum_{k_1,k_2} a_{i,k_1k_2}$. Thus we can estimate $\gamma$ by $(\bar{\epsilon}^T A \bar{\epsilon} - \epsilon^T A \epsilon)/(nK(K-1))$ and obtain a randomized estimate of $S_A$.
S
In practice,therandomizedestimateofGACV iscalculated ation. To speed up the algorithm, however, only a one-step
by solvingthe nonlinearsystem on the perturbed data Y + … update will be calculated.We remark that all favorable prop-
j
and Y + …. Let fYj denote the solution of (26) using the erties of GACV and ranGACV are preserved for the block
originajl daNta andåljet fYj+… denote the solution using the one-stepSORalgorithmandapproximatesplineestimate.It is
åj veryeasytocarryoutthecomputation,asnoadditionalmatrix
perturbeddata.If we takefYj astheinitialvaluetoaNewton- decompositionis required.We iterativelyminimizeranGACV
åj
Ralphsoncalculationof fYj+… and run the iteration only once ineachstep oftheblockone-stepSOR iteration.Thisis done
by using all matrix decoåmj positions that have already been repeatedlyuntilsomeprespeci(cid:142)edconvergencecriteriaismet,
performed for calculatingfYj in the last step, then we obtain or the number of iterationsexceedsa prespeci(cid:142)ed limit.
the one-step solution fYj+…1å1.j Because 4¡I =¡f 54fYj1Y 5=0 We remark that it is a straightforward generalization to
åj åj j åj j extend the the algorithm and the ranGACV to observations
and 4¡2I =¡fT¡f 54fYj1Y 5=4¡2I =¡fT¡f 54fYj1Y + …5, we that come from other exponential families with no nuisance
åj j j åj j åj j j åj j
observethesimplerelation parameter. In fact, the original derivation of the GACV did
¡2I 1¡I not use the fact that the data were Bernoulli. However, the
fYj+…11=fYj åj fYj1Y +… ƒ åj fYj1Y +… GACV hasnotyetbeen testedondatafromotherexponential
åj åjƒ ¡fT¡f åj j ¡f åj j
j j j families.
¡2I 1 ¡I
=fYj åj fYj1Y ƒ …+ åj fYj1Y 5. MONTE CARLO SIMULATIONS
åjƒ ¡fT¡f åj j ƒ ¡f åj j
j j j In this section we demonstrate results from some Monte
=fYj+ Hj…0 (30) Carlosimulationstoevaluatetheperformanceoftheproposed
åj method. The comparative Kullback–Leibler distance (CKL)
Hence we have is used to measure the performance of the estimated values.
In our experience,the CKL distance provides a measure that
fYj+…11 fYj =Hj…0 (31)
åj ƒ åj agreesverywellwithhumanintuitionwheneyeballingthe(cid:142)t-
ted surfaces.
Thus …T4fåYjj+…11 ƒfåYjj5 = …THj… and …NT4fåYjj+…N11 ƒfåYjj5 = In all models presented here and in next section, we used
…SNTimHijla…Nr,lya,n…dTGwje…=can…To…bta…inTWa r4afnYdj+o…m11izefdYje5s,tiamndat…eToGfj…HS=j. rbelporcokdsufcoirngthkeertnenelssoraspsroocdiautcetdswpaicthecruepbriocdsupcliinnegskaesrnbeulisldtihnagt
ƒ j åj ƒ åj N N
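Combining (31) with the relations for $G^j$, the randomized quantities entering (32) require only extra calls to the one-step solver. A sketch (ours), where `one_step` is a placeholder for the "black box" described above:

```python
import numpy as np

def rangacv_traces(y, f_hat, one_step, W, n, K, rng):
    # f_hat: unperturbed one-step fit; W: block-diagonal weight matrix W_j
    eps = rng.standard_normal(n * K)
    diff = one_step(y + eps) - f_hat      # equals H^j eps by (31)
    tHt = eps @ diff                      # randomized eps^T H^j eps
    tGt = eps @ eps - eps @ (W @ diff)    # eps^T G^j eps, since G = I - W H
    return tHt, tGt
```

The same two lines applied with the replicated vector $\bar{\epsilon}$ give the $\bar{\epsilon}$ versions, from which the randomized generalized averages $\tilde{H}^j_S$ and $\tilde{G}^j_S$ in (32) are assembled.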
5. MONTE CARLO SIMULATIONS

In this section we demonstrate results from some Monte Carlo simulations to evaluate the performance of the proposed method. The comparative Kullback–Leibler (CKL) distance is used to measure the performance of the estimated values. In our experience, the CKL distance provides a measure that agrees very well with human intuition when eyeballing the fitted surfaces.

In all models presented here and in the next section, we used reproducing kernels associated with cubic splines as building blocks for the tensor product spaces that generate the $\xi$'s. (This construction was given in Wahba 1990, sec. 10.2, for general SS-ANOVA models.) We briefly note some details. Let $u \in [0,1]$; the null space of the cubic spline penalty functional ($\int (f'')^2$) is spanned by the linear functions $\phi_1(u) = 1$ and $\phi_2(u) = u - 1/2$. Define the reproducing kernel $r_1(u,v) = k_2(u)k_2(v) - k_4([u-v])$, where $u, v \in [0,1]$, $[u-v]$ is the fractional part of $u-v$, and $k_\ell(u) = B_\ell(u)/\ell!$, with $B_\ell(\cdot)$ the $\ell$th Bernoulli polynomial. In the case where $f_j$ is a function on $[0,1]$, the $\phi$'s are the linear functions above and $\xi_{j,\nu}(u) = r_1(x_{j,\nu}, u)$. When $f_j$ is a function on the unit square or a higher dimensional unit cube, the $\phi$'s are built up from tensor sums and products of $\phi_1$ and $\phi_2$, and the reproducing kernels that generate the $\xi_{j,\nu}$ are built up from tensor sums and products of $r_1$ and $r_0(u,v) = \phi_2(u)\phi_2(v)$, as given by Wahba (1990). Particular details for the model in (37) below were also given by Lin et al. (2000). In the experiments and data analysis that follow, the ranGACV is used to choose the smoothing parameters.
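A sketch of this kernel in code, with $B_2$ and $B_4$ written out explicitly:

```python
import numpy as np

def k2(u):
    return (u**2 - u + 1.0/6.0) / 2.0                  # B_2(u)/2!

def k4(u):
    return (u**4 - 2*u**3 + u**2 - 1.0/30.0) / 24.0    # B_4(u)/4!

def r1(u, v):
    # cubic-spline reproducing kernel on [0, 1]
    return k2(u) * k2(v) - k4(np.mod(u - v, 1.0))

# Null space of the cubic-spline penalty: phi_1(u) = 1, phi_2(u) = u - 1/2.
```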
5.1 Repeated Measurements for the Same Endpoint

The first example concerns the single smoothing parameter situation. We try to mimic the characteristics of possible ophthalmology data. Each subject has one endpoint of interest and paired observations. There is one observation-specific covariate, $X_{ik}$ ($k = 1, 2$). The $X_{i1}$'s are assumed to be uniformly distributed on the interval $(0.05, 0.95)$, $X_{i2} = X_{i1} + \epsilon_i$, and the $\epsilon_i$'s are uniformly distributed on $(-0.05, 0.05)$. The true conditional logit function is assumed to be $\mathrm{logit}\{P(Y_{ik}=1 \mid Y_{i(-k)}=0, x_i)\} = 2[\exp\{-30(x_{ik}-.25)^2\} + \sin(\pi x_{ik}^2)] - 2$, and the conditional log odds ratio is $\theta = \log \mathrm{OR}(Y_{i1}, Y_{i2} \mid x_i) = .8$. Three different sample sizes are used in this simulation: $n = 125, 250, 500$. For each sample size, 100 independent datasets are randomly generated according to the true joint distribution.

Figure 1 shows histogram plots of the estimated $\hat\theta$ for the three different sample sizes. The dotted lines represent the true value of .8. The fitted values appear to converge to the truth as the sample size increases. From the histograms, the estimator of $\theta$ appears to be approximately unbiased and normally distributed.

[Figure 1. Histograms of $\hat\theta$ for Three Different Sample Sizes. The dotted lines represent the true value of $\theta = .8$. The means of the estimated values are .7926, .7937, and .7686.]

Figure 2 plots the true conditional probability function and the estimated curves for each sample size, $P(Y_{ik}=1 \mid Y_{i(-k)}=0, x_{ik}) = e^{f(x_{ik})}/(1+e^{f(x_{ik})})$. For each sample size, the 100 fits are ranked according to the CKL distances between the estimated joint distributions and the truth. The 5th, 25th, 50th, 75th, and 95th best fits are plotted for each sample size. The true conditional logit function is a bimodal function. The trend is clear that when the sample size increases, the estimated curves become increasingly accurate. However, for a parametric model, there might be no prior knowledge about the bimodal nature of the truth. Hence a linear or even quadratic form will miss the true curve no matter how large the sample size.

[Figure 2. True (Dashed Line) and Estimated (Solid Line) Conditional Probability Functions.]
In the next experiment, we compare the proposed new "multivariate" method to the "univariate" fit. In ophthalmology studies, one question of interest is to estimate the probability of at least one eye developing a certain disease given the values of the predictor variables for a person. Assume that there is no eye-specific covariate. The $X_i$'s are uniformly distributed on $[0,1]$, and for each subject there are paired observations $(Y_{i1}, Y_{i2})$. We want to estimate the probability

$P(Y_{i1}=1 \text{ or } Y_{i2}=1 \mid x_i) = \frac{2e^{f_i} + e^{2f_i+\theta}}{1 + 2e^{f_i} + e^{2f_i+\theta}}$

from the observed data.

For this experiment, we assume that $p(x_i) = P(Y_{i1}=1 \text{ or } Y_{i2}=1 \mid x_i) = .8\sin(2.7 x_i^2) + .1$. Figure 3 plots the true $p(x)$. Four different values are used for $\theta$: 0, .4, .8, and 1.2; $\theta = 0$ corresponds to the case where $Y_{i1}$ and $Y_{i2}$ are independent. However, we pretend that this fact is unknown, and so estimate $\theta$ by the proposed algorithm. Straightforward calculation yields the following formula to compute $f_i$ for given $\theta$ and $P(Y_{i1}=1 \text{ or } Y_{i2}=1 \mid x_i)$:

$f_i = \log\frac{(p(x_i)-1) + \sqrt{(1-p(x_i))^2 + e^\theta p(x_i)(1-p(x_i))}}{e^\theta (1-p(x_i))}.$   (34)

[Figure 3. True $p(x) = P(Y_{i1}=1$ or $Y_{i2}=1 \mid x_i)$ Used for the Simulation Study.]
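A one-line transcription of (34) as a sketch; it solves the quadratic $e^{\theta}(1-p)u^2 + 2(1-p)u - p = 0$ for $u = e^{f_i}$:

```python
import numpy as np

def f_from_p(p, theta):
    # recover f_i from theta and p = P(Y_i1 = 1 or Y_i2 = 1 | x_i); cf. (34)
    q = 1.0 - p
    u = ((p - 1.0) + np.sqrt(q**2 + np.exp(theta) * p * q)) / (np.exp(theta) * q)
    return np.log(u)
```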
The experiment is conducted as follows. First, for the "univariate" fit, the only information needed is $Y_{Li}$, which is defined to be 0 when $Y_{i1} = Y_{i2} = 0$ and 1 otherwise. Here $P(Y_{Li}=1 \mid x_i) = p(x_i)$. We generate 100 datasets according to the true distribution and fit the data using the "univariate" penalized logistic regression based only on the derived data $Y_{Li}$'s. For the bivariate fit, we first calculate the true joint distribution of $(Y_{i1}, Y_{i2})$ according to the previous formula. For each value of $\theta$, 100 datasets are randomly generated and the joint distribution is estimated by the proposed multivariate method. Afterward, the probability $P(Y_{i1}=1 \text{ or } Y_{i2}=1 \mid x_i)$ can be derived from the estimated joint distribution. For every run, the CKL distance between the estimated $\hat{p}(x_i)$ and $p(x_i)$ is calculated.

The foregoing procedure is performed for three different sample sizes: $n = 100, 200, 400$. Figure 4 shows histograms

[Figure 4. Histograms of Estimated $\hat\theta$'s for $n=100$, $n=200$, and $n=400$. The dotted lines represent the true values of $\theta$. From left to right, the means of the estimated values are .0214, .3743, .7596, and 1.1624 for $n=400$; .1028, .3161, .7058, and 1.1025 for $n=200$; and .0619, .3719, .7746, and 1.1775 for $n=100$.]