Electronic Journal of Statistics Vol.0(2009) ISSN:1935-7524 Likelihood Inference in Exponential Families and Directions of Recession Charles J. Geyer 9 0 School of Statistics, Universityof Minnesota 0 313 Ford Hall, 224 Church St. SE 2 Minneapolis, MN 55455 USA e-mail: [email protected] n url: www.stat.umn.edu/geyer/gdor a J Abstract: Wheninafullexponential familythemaximumlikelihoodes- 5 timate(MLE)doesnotexist,theMLEmayexistintheBarndorff-Nielsen completionofthefamily.Weproposeapracticalalgorithm forfindingthe ] MLE in the completion based on repeated linear programming using the T R contributed package rcdd and illustrate it with two generalized linear S modelexamples.WhentheMLEforthenullhypothesisliesinthecomple- . tion,likelihoodratiotestsofmodelcomparisonarealmostunchangedfrom h theusualcase.Onlythedegreesoffreedomneedtobeadjusted.Whenthe at MLE lies in the completion, confidence intervals are changed much more m fromtheusualcase.TheMLEofthenaturalparametercanbethoughtof ashavinggonetoinfinityinacertaindirection,whichwecallagenericdi- [ rectionofrecession.Weproposeanewone-sidedconfidenceintervalwhich says how close to infinity the natural parameter may be. This maps to 1 one-sided confidence intervals for mean values showing how close to the v boundaryoftheirsupporttheymaybe. 5 5 AMS 2000 subject classifications:Primary62F99; secondary52B55. 4 Keywords and phrases:exponential family,existenceofmaximumlike- 0 lihoodestimate,Barndorff-Nielsencompletion. . 1 0 Contents 9 0 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 : v 2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 i X 2.1 A Binomial Example . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 A Logistic Regression Example . . . . . . . . . . . . . . . . . . . 2 r a 2.3 A Contingency Table Example . . . . . . . . . . . . . . . . . . . 3 2.3.1 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . 6 2.3.2 Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . 7 3 Theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1 Exponential Families . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2 Tangent Cone, Normal Cone, and Convex Support . . . . . . . . 8 3.3 Directions of Recession and Constancy . . . . . . . . . . . . . . . 8 3.4 Limits in Directions of Recession . . . . . . . . . . . . . . . . . . 10 3.5 Convex Polyhedra . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.6 Generic Directions of Recession . . . . . . . . . . . . . . . . . . . 12 3.7 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 0 imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 1 3.8 Comparison with Pre-Existing Theory . . . . . . . . . . . . . . . 13 3.9 Natural Affine Submodels . . . . . . . . . . . . . . . . . . . . . . 13 3.10 Tangent Cones of Models and Affine Submodels . . . . . . . . . . 14 3.11 Tangent Cones of Saturated Models . . . . . . . . . . . . . . . . 14 3.12 Calculating the Linearity . . . . . . . . . . . . . . . . . . . . . . 14 3.13 Calculating Generic Directions of Recession . . . . . . . . . . . . 15 3.14 Calculating Maximum Likelihood Estimates . . . . . . . . . . . . 16 3.14.1 In the Original Family . . . . . . . . . . . . . . . . . . . . 16 3.14.2 In the Completion . . . . . . . . . . . . . . . . . . . . . . 16 3.15 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . 17 3.16 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.16.1 In the Limiting Conditional Model . . . . . . . . . . . . . 18 3.16.2 In the Original Model . . . . . . . . . . . . . . . . . . . . 19 3.16.3 The Cartesian Product Case . . . . . . . . . . . . . . . . 19 3.16.4 Calculating the Constancy Space . . . . . . . . . . . . . . 21 3.17 Multinomial Sampling . . . . . . . . . . . . . . . . . . . . . . . . 21 4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 A Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1. Introduction The problem addressedin this article is widespread. Many users of statistics have run into it, although they may not have been aware of it. The problem has been well understood for thirty years,but until now convenientsoftware to handle the problem has not been available. Inadiscreteexponentialfamily,forexample,logisticregressionorcategorical dataanalysis,itcanhappenthatthemaximumlikelihoodestimate(MLE)does not exist in the conventional sense. This is often not detected by software, so usersmaybeunawareofthesituation.Whethertheyareawareornot,available softwareprovidesnosupportforvalidstatisticalinferenceinthissituation.The aster contributed package for R (6) has a check for near-singularity of the Fisher information matrix, which indicates either nonexistence of the MLE or nearcollinearityofpredictors,andthischeckhasrevealedanumberofinstances where nonexistence of the MLE arose in actual applications. When the MLE does not exist in the conventional sense, it may exist in the Barndorff-Nielsen completion of the family (2; 3; 5, and references cited at the beginning of Chapter 2 of the latter). A practical algorithm for finding the MLE in the Barndorff-Nielsen completion of an exponential family using repeated linear programming was proposed in the author’s unpublished thesis (5) and used in (10). Now we propose a different method, also using repeated linear programming with the R contributed package rcdd (9). We also propose methods forhypothesis tests andconfidenceintervalsvalidwhen the MLEdoes not exist in the conventionalsense.Our methods only work for full exponential imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 2 families but are simpler than the methods of (5) which worked for non-full convex families. Our examples are chosen to be good examples for teaching statisticians how to use our methods. All the analysis discussed in this article is carried out in full in the accompanying technical report (7), which is produced using the R function Sweave so all results in the report are actually produced by the code shown therein and hence are fully reproducible by anyone who has R. 2. Examples 2.1. A Binomial Example The binomial distribution provides the simplest example. Suppose x is bino- mial with sample size n and success probability p. The MLE for p is pˆ= x/n. So far, so good, but three things are different about the cases pˆ=0 and pˆ=1. First,thenaturalparameterisθ =logit(p),andθˆ=logit(pˆ)doesnotexistwhen pˆ=0 or pˆ=1. Second, the probability distribution corresponding to the MLE is degenerate, the binomial distribution with success probability p=0 or p=1 beingconcentratedatx=0andx=n,respectively.Third,theelementary95% confidence interval pˆ±1.96 pˆ(1−pˆ)/n does not work when pˆ=0 or pˆ=1. p 2.2. A Logistic Regression Example The same kind of problems arise in multiparameter exponential families but the relevant multidimensional geometry is much harder to visualize. To intro- duce ideas,consider what is still a relativelysimple example,which is a logistic regression.Supposeweobserveavectory whosecomponentsareBernoulliwith meansformingavectorp.Thenaturalparameterisθ =logit(p),wherelogitop- erates componentwise θ =logit(p ). Suppose we also have one covariate vector i i x and we want to fit a quadratic model θ =β +β x +β x2. i 1 2 i 3 i Finally, suppose x takes the values 1, ..., 30 and y =0 for x ≤12 or x ≥24 i i i i and y =1 otherwise. i If we try to fit these data using the R function glm, it complains “algorithm did not converge, fitted probabilities numerically 0 or 1 occurred.” In fact, the MLE for β does not exist. Define the vector δ =(−587,72,−2) (1) then we say δ is a generic direction of recession(GDOR) because there exists a βˆ such that lim l(βˆ+sδ)= sup l(β), (2) s→∞ β∈R3 imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 3 where l is the log likelihood function. Although we write βˆ here, this does not denote the MLE — the MLE does not exist — we can consider the MLE for β to be βˆ sent to infinity in the direction δ. Neither βˆ nor δ satisfying (2) are unique.Inthisexample,(2)actuallyholdsforallβˆ∈R3 whenδ isgivenby (1). If we consider what happens to the mean value parameter vector p, we find that p(βˆ+sδ)→y as s→∞. Thus MLE for the mean value parameter vector doesexistandisequaltotheobserveddatavector.ThisMLEisstrangebecause the distribution it goes with is degenerate, concentrated at the observed data. The MLE distribution says we can never observe data other than what we did observe; other data values occur with probability zero. Thisdegeneracyneednotcauseproblemsforstatisticalinference.Thesample is not the population, and estimates are not parameters. What we need is a confidence interval, necessarily one-sided, that says how close β is to infinity and how close p is to y. Fix a choice of βˆ, say βˆ = 0, and consider for all real s the probability distributionhavingparameterβˆ+sδ.Assgoesfrom−∞to+∞theprobability ofseeingthedatavalueactuallyobservedgoesfromzerotoone.Findtheunique s that makes this probability 0.05, call it sˆ, then [sˆ,∞) is a 95% confidence interval for the scalar parameter s. Figure 1 shows the mean value parameter values corresponding to the ends of this confidence interval (sˆ and ∞). These intervals are not the best we can do; Figure 2 shows improved intervals that require more computation. Wesummarizethisexample,explainingwhichfeaturesaregeneralandwhich are not. First, the GDOR notion is perfectly general. When the MLE for the naturalparameterdoes notexist,then under conditionsof Brown(3) thathold in all practical examples, there is a GDOR, and the likelihood is maximized by goingtoinfinityinthatdirection.Second,theMLEforthemeanvalueparame- ter forthis example correspondsto a completely degeneratedistribution,which is not general. Usually it will be only partially degenerate, some components of the response being fixed at an extreme value but other components being randomundertheMLEdistribution.Third,theMLEforthenaturalparameter “is” any βˆ plus infinity in the direction of the GDOR, which is not general. Usually only some βˆ will work in (2). Fourth, Figure 1 shows most of what is important about this example, but in general there is no such figure. Usually, hypothesis tests and confidence intervals are all that can be reported. 2.3. A Contingency Table Example Thisexampleisa2×2×···×2contingencytablewithsevendimensionshence 27 = 128 cells. The file http://www.stat.umn.edu/geyer/gdor/catrec.txt presents the data as eight vectors, seven categorical predictors v , ..., v that 1 7 specify the cells of the contingency table and one response y that gives the cell counts. Tosimplify thediscussion,wetakethe cellcountsto be independentPoisson random variables so we have a generalized linear model. The R statement imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 4 0 1. 8 0. 6 of y 0. n a e m 4 0. 2 0. 0 0. 0 5 10 15 20 25 30 x Fig1.Quickanddirtyconfidenceintervalsforthemeanvalueparameter.Outerpointshaving valueszeroorone are MLEforthemean valueparameter, whichare alsothe components of the observed data vector. Inner points are mean values corresponding to βˆ+sˆδ, where sˆis thelowerboundfora95%one-sidedconfidenceintervalforsandwhereβˆ=0andδ isgiven by (1). Compare Figure 2. imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 5 Table1 Generic direction of recession forcontingency table example. Only nonzero components shown. coefficient direction intercept −1 v1 1 v2 1 v3 1 v5 1 v1:v2 −1 v1:v3 −1 v1:v5 −1 v2:v3 −1 v2:v5 −1 v3:v5 −1 v1:v2:v3 1 v1:v3:v5 1 v2:v3:v5 1 out3 <- glm(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^3, family = poisson, data = dat, x = TRUE) fits the model with all three-way interactions but no higher-order interactions, assuming the data have been read in as data frame dat. Rdoesnotcomplaininfittingthismodel,eventhoughitshould.Thecorrect MLE is “at infinity.” Thismodelhas64parameters(foratablewith128cells).Onemightsaythis istoomanyparameterschasingtoo little data,butatestcomparingthis model to the model with only two-way interactions says the three-way model fits the data much better (P ≈ 10−17). Whether one likes this model or not, it should be possible for statisticians to analyze it, and we will use it for an example. With64parameterswedonotshowthewholeGDOR,onlyitsnonzerocom- ponents,which areshownin Table1.It is hardto visualize as a 64-dimensional vector. Much easier to understand is what it does to the mean cell counts. Sixteen cells have mean zero at the MLE in the Barndorff-Nielsen completion. They are shown in Table 2. One other cell has observed data value zero but MLE mean value nonzero. The next step in the analysis is to find a βˆ such that (2) is satisfied. It has long been known that such a βˆis determined by finding the MLE in the family conditioned on certain cells being zero, in this case those in Table 2. This is easily accomplished by removing those rows from the data. The following R statements fit this model dat.cond <- dat[linear, ] out3.cond <- glm(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^3, family = poisson, data = dat.cond) where linear indicates the cells that have nonzero MLE mean values and dat is the data frame containing the original data. We do not show the results imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 6 Table2 Cells withMLE mean value zero and 95% confidence intervals (lower and upper are lower and upper confidence bounds). v1 v2 v3 v4 v5 v6 v7 lower upper 0 0 0 0 0 0 0 0 0.2863 0 0 0 1 0 0 0 0 0.1408 1 1 0 0 1 0 0 0 0.2200 1 1 0 1 1 0 0 0 0.4210 0 0 0 0 0 1 0 0 0.0895 0 0 0 1 0 1 0 0 0.0938 1 1 0 0 1 1 0 0 0.1930 1 1 0 1 1 1 0 0 0.2887 0 0 0 0 0 0 1 0 0.1063 0 0 0 1 0 0 1 0 0.1141 1 1 0 0 1 0 1 0 0.0913 1 1 0 1 1 0 1 0 0.2646 0 0 0 0 0 1 1 0 0.0667 0 0 0 1 0 1 1 0 0.1548 1 1 0 0 1 1 1 0 0.1410 1 1 0 1 1 1 1 0 0.3239 of this fit — it is shown in the accompanying technical report (7) — partly becausethereare64regressioncoefficients,butmainlybecausethereisnothing new in this fit or its use to make inference about the cell mean values or the corresponding linear predictor values for the cells of the table involved (those with linear equal to TRUE). The R function predict.glm will produce valid confidence intervals for these cells using this fit. We only remark that βˆ is a little strange in that it has one nonestimable coefficient (which R indicates as being NA) because the model matrix is not full rank.Theoriginalmodelmatrixwasfullrank,butwhenweremovesixteenofits rowsthe resultis notfullrank.Ris smartenoughto dropone ormorecolumns (in this example just one) of the model matrix producing a new model matrix that is full rank and has the same column space as the one it was given, which is equivalent to setting the coefficients corresponding to the dropped columns to zero. Although R reports these coefficients as NA, we must take them to be zero. 2.3.1. Confidence Intervals To obtain one-sided confidence intervals we use the same procedure used in the other example. Consider for all real s the probability distribution having parameterβˆ+sδ.As s goes from−∞ to +∞ the probability ofobserving data having zeros in the 16 cells listed in Table 2 goes from zero to one. Find the unique s that makes this probability 0.05, call it sˆ, then [sˆ,∞) is a 95% confi- dence interval for the scalar parameter s. Calculate the mean value parameter vectorcorrespondingtoβˆ+sˆδ.This givesthe upper bounds for95%confidence intervals shown in Table 2. imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 7 2.3.2. Hypothesis Tests WhentheMLEdoesnotexistintheconventionalsenseforthenullhypothesis of a likelihood ratio test, the usual asymptotics do not hold and the usual test is invalid. However, the usual asymptotics may apply when the test is done conditionally, the conditioning event being the same as the one used in determining βˆ. In this example, if the null hypothesis is all three-way but no higher-way interactions, one merely proceeds using the data frame data.cond producedabovetofitboththenullandalternativemodelsanddothelikelihood ratio test. This idea was suggested by S. Fienberg (personal communication). Of course, one does not have to believe the asymptotics on which this test is based.One caninstead calculate a P-value based on a parametric bootstrap, using as the null distribution the one fitted in out3.cond.Since this procedure has nothing to do with the concerns of this article, being the same thing one would do whenever one distrusts asymptotics, we will say no more about it. 3. Theory WeredevelopthetheoryofBarndorff-Nielsencompletionofexponentialfam- ilies (2, Sections 9.3 and 9.4; 3, pp. 191–202; 5, Chapters 2 and 4) so that it is useful for calculation, particularly calculation using the R contributed package rcdd. 3.1. Exponential Families An exponential family of distributions (2; 3; 5) is a statistical model having log likelihood l(θ)=hy,θi−c(θ), (3) wherey =(y ,...,y )isavectorstatistic,θ =(θ ,...,θ )isavectorparameter, 1 p 1 p and p hy,θi= y θ . i i i=1 X A statistic y and parameter θ that give a log likelihood of this form are called natural. The function c is called the cumulant function of the family. The distribution with parameter value θ has a density with respect to the distribution with parameter value ψ of the form f (ω)=ehY(ω),θ−ψi−c(θ)+c(ψ). (4) θ The requirement that this integrate to one determines the function c up to an additive constant c(θ)=c(ψ)+logE ehY,θ−ψi . (5) ψ (cid:0) (cid:1) imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 8 We take (5) to be valid for all θ in Rp, defining c(θ) = ∞ for θ such that the expectation in (5) is infinite. Define Θ={θ∈Rp :c(θ)<∞}. (6) The exponential family is full if its natural parameter space is (6). We shall be interested only in full families. By convention, the cumulant function and log likelihood are defined for all θ ∈ Rp not just at valid parameter values. We have c(θ) = ∞ and l(θ) = −∞ for θ ∈/ Θ, so such θ can never maximize the likelihood. 3.2. Tangent Cone, Normal Cone, and Convex Support The tangent cone of a convex set C at a point y ∈C is T (y)=cl{s(w−y):w ∈C and s≥0}, (7) C where cl denotes the closure operation (14, Theorem 6.9). The normal cone of a convex set C in Rp at a point y ∈C is N (y)={δ ∈Rp :hw−y,δi≤0 for all w∈C}. (8) C (14, Theorem 6.9). Tangent and normal cones are polars of each other (14, Theorem 6.9 and Corollary 6.30). Each determines the other. Theconvex support ofanexponentialfamilyisthesmallestclosedconvexset thatcontainsthe naturalstatistic withprobabilityone under some distribution inthefamily,inwhichcasethisistrueforalldistributionsinthefamily,because the distributions are mutually absolutely continuous (2, pp. 90 and 111–112). 3.3. Directions of Recession and Constancy Directions of recession and constancy of convex and concave functions are defined by Rockefellar (13, p. 69). We apply these notions to log likelihoods of full exponential families. Proofs of all theorems and corollaries are given in the appendix. Theorem 1. For some vector δ and for a full exponential family with log like- lihood (3), natural parameter space Θ, convex support C, natural statistic Y, and observed value of the natural statistic y such that y ∈ C, the following are equivalent. (a) There exists a θ ∈ Θ such that s 7→ l(θ +sδ) is not a strictly concave function on the interval where it is finite. (b) For all θ ∈Θ the function s7→l(θ+sδ) is constant on R. (c) The parameter values θ and θ + sδ correspond to the same probability distribution for some θ ∈Θ and some s6=0. (d) The parameter values θ and θ + sδ correspond to the same probability distribution for all θ ∈Θ and all real s. imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009 C. J. Geyer/Directions of Recession 9 (e) hY −y,δi=0 almost surely for some distribution in the family. (f) hY −y,δi=0 almost surely for all distributions in the family. (g) δ ∈N (y) and −δ ∈N (y). C C (h) hw,δi=0, for all w ∈T (y). C Anyvectorδthatsatisfiesanyoneoftheconditionsofthetheorem(andhence all of them) is called a direction of constancy of the log likelihood. The set of all directions of constancy is called the constancy space of the log likelihood. It is clear from (e) or (h) of the theorem that the constancy space is a vector subspace. Corollary 2. For a full exponential family, suppose θˆ and θˆ are maximum 1 2 likelihood estimates. Then θˆ −θˆ is a direction of constancy. 1 2 Fromthecorollaryand(d)ofthetheorem,weseethatdirectionsofconstancy do not cause any problem for statistical inference, because all maximum likeli- hood estimates correspond to the same probability distribution. Thus we have uniqueness where it is important. Nonuniqueness of the MLE for the natural parameter is, at worst, merely a computational nuisance. A family is said to be minimal if it has no directions of constancy. This can alwaysbe arrangedby reparametrization(2, pp. 111–116;3, pp. 13–16;see also 5, Section 1.5). The R function glm always uses a minimal parametrization, dropping predictors to obtain a full rank model matrix. However, insisting on minimality can complicate theoretical issues. Better to keep options open and allow for directions of constancy. Theorem 3. For some vector δ and for a full exponential family with log like- lihood (3), natural parameter space Θ, convex support C, natural statistic Y, and observed value of the natural statistic y such that y ∈ C, the following are equivalent. (a) There exists a θ ∈Θ such that the function s7→l(θ+sδ) is nondecreasing on R. (b) For all θ ∈Θ the function s7→l(θ+sδ) is nondecreasing on R. (c) hY −y,δi≤0 almost surely for some distribution in the family. (d) hY −y,δi≤0 almost surely for all distributions in the family. (e) δ ∈N (y). C (f) hw,δi≤0, for all w ∈T (y). C Anyvectorδthatsatisfiesanyoneoftheconditionsofthetheorem(andhence all of them) is called a direction of recession of the log likelihood. From now on we will simply say direction of recession or constancy to refer to directions of recession or constancy of the log likelihood. Note that every direction of constancy is a direction of recession. Theorem 4. For a full exponential family with convex support C and observed value of the natural statistic y such that y ∈C, the following are equivalent. (a) The MLE exists. (b) Every direction of recession is a direction of constancy. imsart-ejs ver. 2008/08/29 file: ejs_2008_349.tex date: January 5, 2009