The Epic Story of Maximum Likelihood - Project Euclid PDF

Statistical Science 2007, Vol. 22, No. 4, 598–620
DOI: 10.1214/07-STS249
© Institute of Mathematical Statistics, 2007

The Epic Story of Maximum Likelihood

Stephen M. Stigler

Abstract. At a superficial level, the idea of maximum likelihood must be prehistoric: early hunters and gatherers may not have used the words “method of maximum likelihood” to describe their choice of where and how to hunt and gather, but it is hard to believe they would have been surprised if their method had been described in those terms. It seems a simple, even unassailable idea: Who would rise to argue in favor of a method of minimum likelihood, or even mediocre likelihood? And yet the mathematical history of the topic shows this “simple idea” is really anything but simple. Joseph Louis Lagrange, Daniel Bernoulli, Leonard Euler, Pierre Simon Laplace and Carl Friedrich Gauss are only some of those who explored the topic, not always in ways we would sanction today. In this article, that history is reviewed from back well before Fisher to the time of Lucien Le Cam’s dissertation. In the process Fisher’s unpublished 1930 characterization of conditions for the consistency and efficiency of maximum likelihood estimates is presented, and the mathematical basis of his three proofs discussed. In particular, Fisher’s derivation of the information inequality is seen to be derived from his work on the analysis of variance, and his later approach via estimating functions was derived from Euler’s Relation for homogeneous functions. The reaction to Fisher’s work is reviewed, and some lessons drawn.

Key words and phrases: R. A. Fisher, Karl Pearson, Jerzy Neyman, Harold Hotelling, Abraham Wald, maximum likelihood, sufficiency, efficiency, superefficiency, history of statistics.

Stephen M. Stigler is the Ernest DeWitt Burton Distinguished Service Professor, Department of Statistics, University of Chicago, Chicago, Illinois 60637, USA (e-mail: [email protected]).

1. INTRODUCTION

In the 1860s a small group of young English intellectuals formed what they called the X Club. The name was taken as the mathematical symbol for the unknown, and the plan was to meet for dinner once a month and let the conversation take them where chance would have it. The group included the Darwinian biologist Thomas Henry Huxley and the social philosopher-scientist Herbert Spencer. One evening about 1870 they met for dinner at the Athenaeum Club in London, and that evening included one exchange that so struck those present that it was repeated on several occasions. Francis Galton was not present at the dinner, but he heard separate accounts from three men who were, and he recorded it in his own memoirs. As Galton reported it, during a pause in the conversation Herbert Spencer said, “You would little think it, but I once wrote a tragedy.” Huxley answered promptly, “I know the catastrophe.” Spencer declared it was impossible, for he had never spoken about it before then. Huxley insisted. Spencer asked what it was. Huxley replied, “A beautiful theory, killed by a nasty, ugly little fact” (Galton, 1908, page 258).

Huxley’s description of a scientific tragedy is singularly appropriate for one telling of the history of Maximum Likelihood. The theory of maximum likelihood is very beautiful indeed: a conceptually simple approach to an amazingly broad collection of problems. This theory provides a simple recipe that purports to lead to the optimum solution for all parametric problems and beyond, and not only promises an optimum estimate, but also a simple all-purpose assessment of its accuracy. And all this comes with no need for the specification of a priori probabilities, and no complicated derivation of distributions. Furthermore, it is capable of being automated in modern computers and extended to any number of dimensions. But as in Huxley’s quip about Spencer’s unpublished tragedy, some would have it that this theory has been “killed by a nasty, ugly little fact,” most famously by Joseph Hodges’s elegant simple example in 1951, pointing to the existence of “superefficient” estimates (estimates with smaller asymptotic variances than the maximum likelihood estimate). See Figure 1. And then, just as with fatally wounded slaves in the Roman Colosseum, or fatally wounded bulls in a Spanish bullring, the theory was killed yet again, several times over by others, by ingenious examples of inconsistent maximum likelihood estimates.

Joe Hodges’s Nasty, Ugly Little Fact (1951)

$$T_n = \begin{cases} \bar{X}_n & \text{if } |\bar{X}_n| \ge \dfrac{1}{n^{1/4}}, \\[4pt] \alpha \bar{X}_n & \text{if } |\bar{X}_n| < \dfrac{1}{n^{1/4}}. \end{cases}$$

Then $\sqrt{n}\,(T_n - \theta)$ is asymptotically $N(0,1)$ if $\theta \neq 0$, and asymptotically $N(0,\alpha^2)$ if $\theta = 0$. $T_n$ is then “super-efficient” for $\theta = 0$ if $\alpha^2 < 1$.

FIG. 1. The example of a superefficient estimate due to Joseph L. Hodges, Jr. The example was presented in lectures in 1951, but was first published in Le Cam (1953). Here $\bar{X}_n$ is the sample mean of a random sample of size $n$ from a $N(\theta,1)$ population, with $n\,\mathrm{Var}(\bar{X}_n) = 1$ all $n$, all $\theta$ (Bahadur, 1983; van der Vaart, 1997).

The full story of maximum likelihood is more complicated and less tragic than this simple account would have it. The history of maximum likelihood is more in the spirit of a Homeric epic, with long periods of peace punctuated by some small attacks building to major battles; a mixture of triumph and tragedy, all of this dominated by a few characters of heroic stature if not heroic temperament. For all its turbulent past, maximum likelihood has survived numerous assaults and remains a beautiful, if increasingly complicated theory. I propose to review that history, with a sketch of the conceptual problems of the early years and then a closer look at the bold claims of the 1920s and 1930s, and at the early arguments, some unpublished, that were devised to support them.

2. THE EARLY HISTORY OF MAXIMUM LIKELIHOOD

By the mid-1700s it seems to have become a commonplace among natural philosophers that problems of observational error were susceptible to mathematical description. There was essential agreement upon some elements of that description: errors, for want of a better assumption, were supposed equally able to be positive and negative, and large errors were expected to be less frequently encountered than small. Indeed, it was generally accepted that their frequency distribution followed a smooth symmetric curve. Even the goal of the observer was agreed upon: while the words employed varied, the observer sought the most probable position for the object of observation, be it a star declination or a geodetic location. But in the few serious attempts to treat this problem, the details varied in important ways. It was to prove quite difficult to arrive at a precise formulation that incorporated these elements, covered useful applications, and also permitted analysis.

There were early intelligent comments related to this problem already in the 1750s by Thomases Simpson and Bayes and by Johann Heinrich Lambert in 1760, but the first serious assault related to our topic was by Joseph Louis Lagrange in 1769 (Stigler, 1986, Chapter 2; 1999, Chapter 16; Sheynin, 1971; Hald, 1998, 2007). Lagrange postulated that observations varied about the desired mean according to a multinomial distribution, and in an analytical tour de force he showed that the probability of a set of observations was largest if the relative frequencies of the different possible values were used as the values of the probabilities. In modern terminology, he found that the maximum likelihood estimates of the multinomial probabilities are the sample relative frequencies. He concluded that the most probable value for the desired mean was then the mean value found from these probabilities, which is the arithmetic mean of the observations. It was only then, and contrary to modern practice, that Lagrange introduced the hypothesis that the multinomial probabilities followed a symmetric curve, and so he was left with only the problem of finding the probability distribution of the arithmetic mean when the error probabilities follow a curve. This he solved for several examples by introducing and using “Laplace Transforms.” By introducing restrictions in the form of the curve only after deriving the estimates of probabilities, Lagrange’s analysis had the curious consequence of always arriving at method of moment estimates, even though starting with maximum likelihood! (Lagrange, 1776; Stigler, 1999, Chapter 14; Hald, 1998, page 48.)

At about the same time, Daniel Bernoulli considered the problem in two successively very different ways. First, in 1769 he tried using the hypothesized curve as a weight function, in order to weight, then iteratively reweight and average the observations. This was very much like some modern robust M-estimates. Second, in 1778 (possibly after he had seen a 1774 memoir of Laplace’s with a Bayesian analytical formulation), Bernoulli changed his view dramatically and used the same curve as a density for single observations. He multiplied these densities together, and he sought as the true value for the observed quantity, that value that made the product a maximum (Bernoulli, 1769, 1778; Stigler, 1999, Chapter 14; Laplace, 1774).

These and the other attempts of that time were primarily theoretical explorations, and did not attract many practical applications or further development. And while they all used phrases that could easily be translated into modern English as “Maximum Likelihood,” and in some cases even be defended as maximum likelihood, in no case was there a reasoned defense for them or their performance. The most that was to be found was the superficial invocation that the value derived was “most probable” because it made the only probability in sight (the probability of the observed data) as large as possible.

The philosophically most cogent of these early treatments was that of Gauss, in his first publication on least squares in 1809 (Gauss, 1809). Gauss, like Daniel Bernoulli in 1778, adopted Laplace’s analytical formulation, but unlike Bernoulli, Gauss explicitly invoked Laplace’s Bayesian perspective using a uniform prior distribution for the unknowns. Where Laplace had then sought (and found) the posterior median (which minimized the posterior expected error), Gauss chose the posterior mode. In accord with modern maximum likelihood with normally distributed errors, this led Gauss to the method of least squares. The simplicity and tractability of the analysis made this approach very popular over the nineteenth century. By the end of that century this was sometimes known as the Gaussian method, and the approach became the staple of many textbooks, often without the explicit invocation of a uniform prior that Gauss had seen as needed to justify the procedure.

3. KARL PEARSON AND L. N. G. FILON

Over the 19th century, the theory of estimation generally remained around the level Laplace and Gauss left it, albeit with frequent retreats to lower levels. With regard to maximum likelihood, the most important event after Gauss’s publication of 1809 occurred only on the eve of a new century, with a long memoir by Karl Pearson and Louis Napoleon George Filon, published in the Transactions of the Royal Society of London in 1898 (Pearson and Filon, 1898). The memoir has a place in history, more for what in the end it seemed to suggest, rather than for what it accomplished. The two authors considered a very general setting for the estimation problem: a set of multivariate observations with a distribution depending upon a potentially large array of constants to be determined. They did not refer to the constants as parameters, but it would be hard for a modern reader to view them in any other light, even though a close reading of the memoir shows that it lacked the parametric view Fisher was to introduce more than 20 years later (Stigler, 2007).

The main result of Pearson and Filon (expressed in modern terminology) came from taking a likelihood ratio (a ratio of the frequency distribution of the observed data and the frequency distribution evaluated for the same data, but with the constants slightly perturbed), expanding its logarithm in a multivariate Taylor’s expansion, then approximating the coefficients by their expected values and claiming that the resulting expression gave the frequency distribution of the errors made in estimating the constants. They erred in taking the limit of the coefficients, in effect using a procedure that did not at all depend upon the method of estimation used and would at most be valid for maximum likelihood estimates, a fact they failed to recognize. Their last step employed an implicit Bayesian step in the manner of Gauss. When cubic and higher order terms were neglected, their formula would give a multivariate normal posterior distribution (extending results of Laplace a century earlier), although Pearson and Filon cautioned against doing this with skewed frequency distributions. A modern reader would recognize their resulting distribution as the normal distribution sometimes used to approximate the distribution of maximum likelihood estimates, but Pearson and Filon made no such restriction in the choice of estimate and applied it heedlessly to all manner of estimates, particularly to method of moments estimates.

The result may in hindsight be seen to be a mess, not even applying to the examples presented, and the approach was soon to be abandoned by Pearson himself. But it led to some correct results for the bivariate normal correlation coefficient, and it was bold and surely highly suggestive to a reader like Ronald Fisher, to whom I now turn. I have recently published a detailed study (Stigler, 2005) of how Fisher was led to write his 1922 watershed work on “The Mathematical Foundations of Theoretical Statistics,” so I will only briefly review the main points leading to that memoir.

4. R. A. FISHER

At Cambridge Fisher had studied the theory of errors and even published in 1912 a short piece commending the virtues of the Gaussian approach to estimation, particularly of the standard deviation of a normally distributed sample. He had been so taken by the invariance of the estimates so derived, how (for example) the estimate of the square of a frequency constant was the square of the estimate of the constant, that he termed the criterion “absolute” (Fisher, 1912). But his approach at that time was superficial in most respects, tacitly endorsing the naïve Bayesian approach Gauss had used, without noticing the lurking inconsistency in even the example he considered, in that the estimate of the squared standard deviation based upon the distribution of the data, namely $\frac{1}{n}\sum (x_i - \bar{x})^2$, did not agree with that found applying the same principle to the distribution of $\frac{1}{n}\sum (x_i - \bar{x})^2$ alone.

Four years later, Fisher sent to Pearson for possible publication a short, equally superficial critique of a Biometrika article by Kirstine Smith advocating the minimum chi-square approach to estimation (Smith, 1916). Pearson’s thoughtful rejection letter to Fisher focused on the lack of a clear and convincing rationale for the method of choosing constants to maximize the frequency function, and Pearson even stated that he now thought the Pearson–Filon paper was remiss on the same count. He called particular attention to a perceptive footnote in Smith’s paper that argued the case against the Gaussian method: the probability being maximized was not a probability but rather a probability density, an infinitesimal probability, and of what force was such meager evidence in defense of a choice? At least the minimum chi-square method optimized with respect to an actual metric. Two more years passed, and in 1918 Fisher discovered sufficiency in the context of estimating the normal standard deviation (Fisher, 1920); he recalled Pearson’s challenge to produce a rationale for the method, and he was off to the races, quickly setting to work on the monumental paper on the theory of statistics that he read to the Royal Society in November 1921 and published in 1922.

5. FISHER’S FIRST PROOF

By my reconstruction, Fisher’s discovery of sufficiency was quickly followed by the development of a short argument that he gave in that great 1922 paper; indeed it was the first mathematical argument in the paper. The essence of the argument in modern notation is the following. Suppose you have two candidates as estimates for a parameter $\theta$, denoted by $S$ and $T$. Suppose that $T$ is a sufficient statistic for $\theta$. Since generally both $S$ and $T$ are approximately normal with large samples, let us (anticipating a species of argument Wald was to develop rigorously in 1943) follow Fisher in considering that $S$ and $T$ actually have a bivariate normal distribution, both with expectation $= \theta$, and with standard deviations $\sigma_S$ and $\sigma_T$ and correlation $\rho$. Then the standard facts of the bivariate normal distribution tell us that $E(S \mid T = t) = \theta + \rho(\sigma_S/\sigma_T)(t - \theta)$. Since $T$ is sufficient, this cannot depend upon $\theta$, which is only possible if $\rho(\sigma_S/\sigma_T) = 1$, or if $\sigma_T = \rho\sigma_S \le \sigma_S$. Thus $T$ cannot have a larger mean squared error than any other such estimate $S$, and so must be optimum according to a clear metric criterion, expected squared error! In one stroke Fisher had (if one accepts the substitution of exact for approximate normality) the simple and powerful result:

Sufficiency implies optimality, at least when combined with consistency and asymptotic normality.

The question was, how general is this result? Neither Fisher nor much of posterity thought of consistency and asymptotic normality as major restrictions. After all, who would use an inconsistent estimate, and while there are noted exceptions, is not asymptotic normality the general rule? Indeed, Fisher clearly knew the result was stronger than this, that a sufficient estimate captured all the information in the data in even stronger senses; the argument was only to present the claim in terms of a specific criterion, minimum standard error. But what about sufficiency?

At this point Fisher appears to have made an interesting and highly productive mistake. He quickly explored a number of other parametric examples and came to the conclusion that maximizing the likelihood always led to an estimate that was a function of a sufficient statistic! When he read the paper to the Royal Society in November 1921, his abstract, as printed in Nature (November 24, 1921) emphatically stated, “Statistics obtained by the method of maximum likelihood are always sufficient statistics.” And from this it would follow, with the minor quibble that perhaps consistency and asymptotic normality may be needed, that maximum likelihood estimates are always optimum. A truly beautiful theory was born, after over a century and a half in gestation.

Even as the paper was being readied for press, doubts occurred to the one person best equipped to understand the theory, Fisher himself. The bold claim of the abstract does not appear in the published version; neither does its denial. He expressed himself in this way:

“For the solution of problems of estimation we require a method which for each particular problem will lead us automatically to the statistic by which the criterion of sufficiency is satisfied. Such a method is, I believe, provided by the Method of Maximum Likelihood, although I am not satisfied as to the mathematical rigour of any proof which I can put forward to that effect. Readers of the ensuing pages are invited to form their own opinion as to the possibility of the method of maximum likelihood leading in any case to an insufficient statistic. For my own part I should gladly have withheld publication until a rigourously complete proof could be formulated; but the number and variety of new results which the method discloses press for publication, and at the same time I am not insensible of the advantage which accrues to Applied Mathematics from the co-operation of the Pure Mathematician, and this co-operation is not infrequently called forth by the very imperfections of writers on Applied Mathematics” (Fisher, 1922, page 323).

The 1922 paper did present several related arguments in addition to the Waldian one I reported above. It stated less boldly a converse of the statement in the 1921 abstract that, “it appears that any statistic which fulfils the condition of sufficiency must be a solution obtained by the method of the optimum [e.g. maximum likelihood]” (page 331). But Fisher did not now claim that a sufficient statistic need always exist. Instead Fisher gave an improved non-Bayesian version of the Pearson–Filon argument for asymptotic normality, expanding the likelihood function about the true value and pointing out how and why the argument requires maximum likelihood estimates (and that it would not apply to moment estimates), and how it could be used to assess the accuracy of maximum likelihood estimates (pages 328–329). And there, in a long footnote, he called Karl Pearson to task for not earlier calling attention himself to the error in the 1898 paper. Fisher noted that in 1903 Pearson had published correct standard errors for moment estimates, even while citing the 1898 paper without noting that the standard errors given in 1898 for several examples were wrong. In the 1922 paper Fisher also pointedly included a section illustrating the use of maximum likelihood for Pearson’s Type-III distributions (gamma distributions), contrasting his results with the erroneous ones Pearson and Filon had given in 1898 for the same family.

6. THREE YEARS LATER

By 1925 Fisher’s earlier optimism had faded somewhat, and he prepared a revised version of his theory for presentation to the Cambridge Philosophical Society. At some point in the interim he had recognized that sufficient statistics of the same dimension as the parameter did not always exist. What led to this realization? Fisher did not say, although in a 1935 discussion he wrote, “I ought to mention that the theorem that if a sufficient statistic exists, then it is given by the method of maximum likelihood was proved in my paper of [1922].... It was this that led me to attach especial importance to this method. I did not at that time, however, appreciate the cases in which there is no sufficient statistic, or realize that other properties of the likelihood function, in addition to the position of its maximum, could supply what was lacking” (Fisher, 1935, page 82). I speculate that he learned this in considering a problem where no sufficient statistic exists, namely the problem that figured prominently in the 1925 paper, the estimation of a location parameter for a Cauchy distribution. In any event, in that 1925 paper Fisher did not dwell on this discovery of insufficiency; quite the contrary. The possibility that sufficient statistics need not exist was only casually noted as a fact 14 pages into the paper, and a reader of both the 1922 and 1925 papers might not even notice the subtle shift in emphasis that had taken place.

Where in 1922 Fisher started with consistency and sufficiency, in 1925 he began with efficiency. Writing of consistent and asymptotically normal estimates, he stated, “The criterion of efficiency requires that the fixed value to which the variance of a statistic (of the class of which we are speaking) multiplied by n, tends, shall be as small as possible. An efficient statistic is one for which this criterion is satisfied” (page 703). With this in mind, his main claim now was (page 707), “We shall see that the method of maximum likelihood will always provide a statistic which, if normally distributed in large samples with variance falling off inversely to the sample number, will be an efficient statistic.”

Thus in 1925 the theory said that if there is an efficient statistic, then the maximum likelihood estimate is efficient. When a sufficient and consistent estimate exists, it will also be maximum likelihood, but that is not necessary for efficiency. He granted that more than one efficient estimate could exist, but he repeated a proof he had already given in 1924 (Fisher, 1924a) that any two efficient estimates are correlated with correlation that approaches 1.0 as $n$ increases.

7. THE 1925 “ANOVA” PROOF

What did Fisher offer by way of proof of this new efficiency-based formulation? His 1922 treatment had leaned crucially on sufficiency, but that was no longer generally available. In its place he depended upon a new and limited but mathematically rather clever proof that I will call the “analysis of variance proof.” The proof was clearly based upon a probabilistic version of the analysis of variance breakdown of a sum of squares that Fisher was developing separately at about the same time for agricultural field trials. Fisher’s own 1925 presentation of the argument is fairly opaque and does not explain clearly its underlying logic; in 1935 he gave an improved presentation that helps some (Fisher, 1935, pages 42–44). The mathematical details of the proof have been clearly re-presented by Hinkley (1980) at some length. I will be content to offer only a sketch emphasizing the essence of the argument, what I believe to be the logical development Fisher had in mind. It will help the historical discussion to divide his 1925 argument into two parts, just as Fisher did in the 1935 version.

Let $f(x;\theta)$ be the density of a single observation, and let $\phi$ be the likelihood function for a sample of $n$ independent observations, so that $\log\phi = \sum \log f$. Following Fisher, let $X = \frac{1}{\phi}\frac{\partial\phi}{\partial\theta} = \frac{\partial}{\partial\theta}\log\phi$, what we now sometimes refer to as the score function. Fisher was only concerned here with situations where the maximum likelihood estimate could be found from solving the equation $X = 0$ for $\theta$. The first part of the argument was really more of a restatement of what he had shown in 1922: from expanding the score function in a Taylor series, he had that the score function was approximately a linear function of the maximum likelihood estimate; as he put it, $X = -nA(\theta - \hat{\theta})$ “if $\theta - \hat{\theta}$ is a small quantity of order $n^{-1/2}$,” where his $-nA$ denoted what we now call the Fisher Information in a sample, $I(\theta)$. Since under fairly general regularity conditions

$$E(X) = \int \frac{1}{\phi}\frac{\partial\phi}{\partial\theta}\,\phi = \int \frac{\partial\phi}{\partial\theta} = \frac{\partial}{\partial\theta}\int \phi = \frac{\partial}{\partial\theta}\,1 = 0,$$

we also have $\mathrm{Var}(X) = I(\theta)$. As Fisher noted, $I(\theta)$ may be found from any of the alternative expressions

$$I(\theta) = -E\left[\frac{\partial^2 \log\phi}{\partial\theta^2}\right] = E\left[\left(\frac{\partial \log\phi}{\partial\theta}\right)^2\right] = -nE\left[\frac{\partial^2 \log f}{\partial\theta^2}\right] = nE\left[\left(\frac{\partial \log f}{\partial\theta}\right)^2\right].$$

Fisher did not discuss conditions under which the linear approximation would prove adequate; he was content to exploit it as a simple route to the asymptotic distribution of the maximum likelihood estimate, namely $N(\theta, 1/I(\theta))$. Thus far he had not gone beyond the 1922 argument.

The part of the argument that was novel in 1925, the “ANOVA proof,” then went as follows: Let $T$ be any estimate of $\theta$, assumed to be consistent and asymptotically normal $N(\theta, V)$. In the proof Fisher used this as the exact distribution of $T$, and further treated $V$ as not depending upon $\theta$, as would approximately be the case for “reasonable” estimates $T$ in what we now call “regular” parametric problems. Fisher considered the score function $X$ as a function of the sample and looked at its variation over different samples in two ways. The first was to consider the total variation of $X$ over all samples, namely its variance $\mathrm{Var}(X) = I(\theta)$. And for the second, he evaluated $\mathrm{Var}(X \mid T)$, the conditional variation in $X$ given the value of $T$ for the sample (i.e., the variance of $X$ among all samples that give the same value for $T$). From this he computed $E[\mathrm{Var}(X \mid T)]$, which he found equal to $\mathrm{Var}(X) - 1/V$. Since $\mathrm{Var}(X) = E[\mathrm{Var}(X \mid T)] + \mathrm{Var}[E(X \mid T)]$ (this is the ANOVA-like breakdown I refer to), this would give $\mathrm{Var}[E(X \mid T)] = 1/V$. But $\mathrm{Var}(X \mid T) \ge 0$ always, which implies that necessarily $E[\mathrm{Var}(X \mid T)] \ge 0$, and so $\mathrm{Var}(X) - 1/V \ge 0$. This gave $\frac{1}{V} \le I(\theta)$, or $V \ge \frac{1}{I(\theta)}$ for any such $T$, with equality for efficient estimates: what we now refer to as the information inequality. Thus if the maximum likelihood estimate indeed has asymptotic variance $1/I(\theta)$, he had established efficiency.

The logic of the proof, and the likely route that led Fisher to it, seems clear. If there were a sufficient statistic $S$, then the factorization theorem (which Fisher had recognized in 1922, at least in part) would give $\phi = C \cdot h(S;\theta)$, where the proportionality factor $C$ may depend upon the sample but not on $\theta$. By sufficiency, $X$ would then depend upon the sample only through $S$, and so $\mathrm{Var}(X \mid S) = 0$ for all values of $S$, and consequently $E[\mathrm{Var}(X \mid S)] = 0$ also. Also, if $S$ is sufficient, the maximum likelihood estimate (found through solving $X = 0$ for $\theta$) is a function of $S$. The failure of $T$ to capture all of the information in the sample is then reflected through the variation in the values of $X$ given $T$, namely through $\mathrm{Var}(X \mid T)$ and thus $E[\mathrm{Var}(X \mid T)]$. This latter quantity plays the role of a residual sum of squares and measures the loss of efficiency of $T$ over $S$ (or at least over what would have been achievable had there been a sufficient statistic).

What is more, this interpretation gave Fisher a target to pursue in trying to measure the amount of lost information, or even to determine how one might recover it, just as in an analysis of variance one can advance the analysis by introducing factors that lead to a decrease in the residual sum of squares. In the remainder of the 1925 paper Fisher pursued just such courses. He introduced both the term and the concept of an ancillary statistic, in effect as a covariate designed to reduce the residual sum of squares toward its theoretical minimum achievable value. He gave particular attention to multinomial problems and focused on a study of the information loss when no sufficient estimate existed, and the loss in information in using an estimate that was efficient but not maximum likelihood (e.g., a minimum chi-square estimate). He found the latter difference tended to a finite limit, a measure of what C. R. Rao (1961, 1962) was later to term “second-order efficiency.”

By 1935 Fisher evidently had come to see the first part of the argument (the part establishing that the maximum likelihood estimate actually achieved the lower bound $1/I(\theta)$) as unsatisfactory, and he offered in its place a different argument to show the bound was achieved. That argument (Fisher, 1935, pages 45–46) was derived from what I will call his third proof; I shall comment on it later in that connection.

Fisher’s 1925 work was conceptually deep and has been the subject of much fruitful modern discussion, particularly by Efron (1975, 1978, 1982, 1998), Efron and Hinkley (1978) and Hinkley (1980).

8. AFTER 1925: CORRESPONDENCE WITH HOTELLING

Fisher’s beautiful theory had become more complicated but was still quite attractive. The proofs Fisher offered in 1925 were not such as would satisfy the Pure Mathematician he had referred to in 1922, nor would they withstand the challenges that would come a quarter century later. Were they all that he could offer? To answer this, it would help us to listen in on a dialogue between Fisher and a nonhostile, highly intelligent party. Many in the audience in England who were interested in this question had axes to wield, and Fisher’s transparent digs at Karl Pearson, even though they came in the form of legitimately pointing out major errors in Pearson’s previous work, just set those axes a-grinding. But there was one reader who approached Fisher’s level as a mathematician and was so distant both geographically (he was in California) and scientifically (he was working on crop estimating at that time) that he was able to engage in just such a dialogue. I refer to Harold Hotelling.

Hotelling received his Ph.D. from Princeton University in 1924, for a dissertation in point set topology. In that same year he joined the Food Research Institute at Stanford University, where he worked on agricultural problems. Soon after, he discovered Fisher through Fisher’s 1925 book, Statistical Methods for Research Workers. Hotelling reviewed that book for JASA; in fact he reviewed each of the first seven editions and the first three of these were volunteered reviews, not requested by the Editor (Hotelling, 1951). He started up a correspondence with Fisher, and tried unsuccessfully to get Fisher to visit Stanford in 1928 and 1929 (Stigler, 1999a). After several friendly exchanges of letters, on October 15, 1928, Fisher (who had had several requests from others for detailed mathematical proofs) wrote, asking Hotelling, “Now I want your considered opinion as to the utility of collecting such scraps of theory as are needed to prove just what is wanted for my practical methods.” Hotelling replied December 8, strongly encouraging such a work as valuable for mathematics generally, and stated that “a knowledge of the grounds for belief in a theory helps to dispel the absurd notions which tend to cluster even about sound doctrines.” Fisher’s Christmas Eve 1928 reply proposed that they collaborate:

24 Dec ’28

Dear Prof. Hotelling

Your letter has arrived on Christmas Eve, and has given me plenty to think about for the holidays. You will not expect too much of my answer, as you see that I am writing first and thinking afterwards; but I can see already that I have a great deal to thank you for.

After a few hours consideration I believe my right course is to send you a draft contents, to be pulled to pieces or recast as much as you like, and to say I will do my best to fill the bill if you will be joint author and be responsible for the pure mathematics. If you consent to this and to taking the first decision, like an editor, as to inclusion or exclusion, on the clear understanding that either of us may throw it up as soon as we think it is not worth while, I will start sending stuff in. It will be mostly new as many of the proofs can be done much better than in my old publications.

Have you all my old stuff? I believe you have, but if not I will try to find anything still lacking.

It seems a monstrous lot of work, but I will not grumble if I need not think too much about arrangement.

Yours sincerely
R. A. Fisher
[Hotelling Papers Box 3]

Fisher’s draft table of contents is given as Appendix 1 below. That work was never to be completed. There was no apparent split between the two, but as the project went on, Fisher’s increasing focus on genetics as his 1930 book The Genetical Theory of Natural Selection went through the press, and Hotelling’s move in 1931 to the Department of Economics at Columbia University, were likely causes for the drop in interest. By February 1930 Fisher was writing, “It is a grind getting anything serious done in the way of a text book; I hope you will stick to yours, though; as well as developing the purely mathematical developments.” Nonetheless, Hotelling spent nearly six months at Rothamsted over the last half of 1929 and saw quite a bit of Fisher over that time. Hotelling returned to the United States in late December in time to submit a paper to the American Mathematical Society (AMS) and present it at their meeting in Des Moines, December 31. That paper was entitled, “The consistency and ultimate distribution of optimum statistics”; that is, on the consistency and asymptotic normality of maximum likelihood estimates. It was published in the October 1930 issue of the Transactions of the AMS.

It is a reasonable guess that the approach taken in the paper reflected Fisher’s views to some degree, coming directly after the long visit with Fisher, although Fisher apparently played no direct role in the writing. At any rate, when Fisher wrote to Hotelling on the 7th of January in 1930 to thank him for a copy, Fisher’s only complaint was that the definition of “consistency” Hotelling gave was slightly different from Fisher’s. Fisher wrote,

“It is worth noting to avoid future confusion that you are using consistency in a somewhat different sense from mine. To me a statistic is inconsistent if it tends to the wrong limit as the sample is increased indefinitely. I do not think I have ever attempted to ap[...] whereas you call them all inconsistent. Thus I should not call the mean of a sample from

$$\frac{1}{\pi}\,\frac{dx}{1+(x-m)^2}$$

an inconsistent statistic, though you would. Congratulations on a very fine paper.”

Hotelling’s paper is little referred to today, which seems a shame. It is beautifully written, as was most of Hotelling’s work, and among other things he explained Fisher’s own work on this topic more clearly than Fisher ever did. He reviewed Fisher’s proof of asymptotic normality (the one based upon the Pearson–Filon approach), and he gently noted that “it is not clear what conditions, particularly of continuity, are necessary in order that the proofs which have been given shall be valid.” To repair this omission Hotelling offered two explicit proofs for the case of one continuous variable, stating overconfidently that “the extensions to any number of variables are perfectly obvious; and the corresponding theorems for discrete variables follow immediately....” The problem is, as Hotelling’s clear exposition makes apparent to a modern reader, the proof does not work. He simplified the problem by transforming the parameter space to a finite interval (if necessary) by an arc tangent transformation, and discretized the observed variable by grouping in a finite number of small intervals, and did not realize that the two combined do not ensure the uniformity he would need to achieve the desired result for other than discrete distributions with bounded parameter sets. The error evidently came to Hotelling’s attention by 5 December 1931, when he circulated a list of 37 “Outstanding Problems in the Theory of Statistics.” Problem #16 on the list was, “Prove the validity of the double limiting process used in the proof of (Hotelling, 1930), for as general a situation as possible.”

9. THE GEOMETRIC SHADOW OF A NASTY LITTLE FACT

To this point there had been not even a hint of the future appearance of any nasty, ugly little fact that might sully the beautiful theory. But then, on November 15, 1930, Hotelling wrote to Fisher with some pointed questions. The letter reflected a geometric view of the inference problem that Hotelling seems to have found in Fisher’s work by 1926 and developed further after their conversations at Rothamsted.
Hotelling gave one ply the distinction of consistency or incon- statement of the view in his 1930 paper (which must sistency to statistics which tend to no limit, have been drafted at Rothamsted), and he restated it 606 S.M.STIGLER FIG.2. AreconstructionofHotelling’sgeometricviewofthemultinomialestimationproblem,circaFall1929.Herexrepresentsamultino- mialobservedrelativefrequencyvectorinthesimplex,andthecurvef(p)thepotentialvaluesofthemultinomialprobabilityvector;thetrue valueoftheparameterp(p0)isshown,asistheMLEandacontourofthelikelihoodsurface. inhisNovemberletterindifferentbutequivalentnota- exact meaning of the theorem. One of sev- tion.TheessenceiscapturedbyFigure2,drawntodis- eral questions is whether the variance of a playwhatHotellingconveyedinwordsandsymbols. statistic or its mean square deviation from Hotelling considered a parameterized multinomial the true value should be used as a measure ofaccuracy. problemwithmcells,wheretheobservationsareavec- tor of relative frequencies of counts x =(x ,...,x ) Denotingbypˆ theoptimumestimateofa 1 m parameter p, whose true value is p , can it taking values in the m-dimensional simplex 0 (cid:2) be said that the variance of pˆ, assuming pˆ m x =1, x ≥0 all t. Let the probabilities of the t=1 t t normallydistributed,islessthanthatofany cells f(p) = (f (p),...,f (p)) depend upon a pa- 1 m other function of the same observations? rameter p; this describes a curve in the simplex as p Obviouslynotwithoutfurtherqualification, varies. Let p =p denote the true value of the para- 0 since a function of the observations can be meter,letpˆ bethemaximumlikelihoodestimateofp, definedhavinganarbitrarilysmallvariance. and let f(p0) and f(pˆ) be the points on the curve We must therefore restrict the comparison corresponding to these two values. 
In his 1930 paper, to a special class of functions suitable for Hotelling stated further, “The likelihood L is constant estimating p,butthedefinitionofthisclass over a system of approximately spherical hypersur- must not involve p . How should the class 0 faces about [x]. The point [f(pˆ)] is the point of the bedefined?Astheclassofconsistentstatis- curve which lies on the smallest of the approximate tics? If so, the following difficulty must be spheres meeting the curve, and is therefore approxi- faced. matelythenearestpointonthecurveto[x]”(Hotelling, Consider a distribution of frequency amongafinitenumbermofclasses,involv- 1930). ing a parameter p. In a sample of n, let x HerethenishowHotellingraisedhisquestionincor- t be the number [Hotelling evidently means respondence, in the context of what must have been a relative frequency] falling in the tth class. sharedframeofdiscoursetheyhadadoptedatRotham- Letf (p)betheprobabilityofanindividual sted. t falling into this class. Ifwe take x ,...,x 1 m DearDr.Fisher: ascoordinatesinm-space,theequations Thankyouverymuchforyourrecentlet- x =f (p) (t =1,...,m) t t ter,withgraphanddata. represent a curve with p as parameter. The I have been examining various problems points corresponding to samples will form inMaximumLikelihoodoflate;Iwonderif a “globular cluster” (as you so well put it you can enlighten me as to the conditions in 1915)1 about that point on the curve for under which your proof holds good regard- ing the minimum variance of statistics ob- 1HereHotellingevidentlyreferstoFisher’suseinFisher(1924, tained by this method, or rather, as to the atpage101)oftheevocativeastronomicalterm“globularcluster” THEEPICSTORYOFMAXIMUMLIKELIHOOD 607 which p = p . The method of maximum I have two students working on the opti- 0 likelihood corresponds approximately, for mumestimatesofmfortheabovecurveand large samples, to taking for pˆ the para- for the Type III case you treated. 
Failing to meter of the point of the curve nearest to getanythingofconsequenceforsmallsam- thatrepresentingthesample;i.e.,toproject- ples by purely mathematical methods, they ing orthogonally. Now consider some other willprobablysoonresorttoexperiment.2 method of projecting sample points upon Cordiallyyours, the curve; for example an orthogonal pro- HaroldHotelling jection followed by an alternate stretching Hotelling’s letters posed a challenging question in a andcontracting alongthecurve.Thenif p 0 direct but nonconfrontational way. Clearly, Hotelling happens to give a point in one of the re- said, some more constraints on the class of estimates gions of condensation [i.e. high density], would be needed; the geometric view they had ev- this method of estimation will, for suffi- idently shared at Rothamsted suggested that consis- ciently large samples, yield a statistic with tency alone was not enough. There is no obvious smallervariancethanthatbythemethodof guarantee that the curve f(p) and the contours of the maximum likelihood. To be sure, its vari- likelihood are such that improvement over maximum ance will be larger if the true value p lies 0 likelihood is not possible. What would be needed to in a region of rarefaction [i.e. low density], prevent this, or at least to convince a reader such as and averaging for different possible values Hotelling that the worry was groundless? Hotelling’s of p mightindicate agreateraverage vari- 0 hypothetical improvements were certainly vague. A ance than that of the optimum statistic. But modern reader might be tempted to see them as fore- such an averaging would seem to be of a shadowing Hodges’s estimate or even shrinkage via piecewith“Bayes’Theorem,”insupposing Steinestimation,buteventhoughtheyfallshortofthat, equalaprioriprobabilities. theypresentedaclearchallengetoFisher. Hotelling went on to state that even in the particu- lar case of symmetric beta densities, maximum likeli- 10. 
FISHER’SREPLY:ATHIRDPROOFOFTHE hoodfailedtobeoptimum,buthisderivationtherewas EFFICIENCYOFMAXIMUMLIKELIHOOD marred by a simple error in differentiation. Before he By 1930 Fisher was no stranger to challenges by receivedFisher’sreplyofthe28thofNovember,1930, skeptical readers. His general reaction to one from a Hotellingwroteagain,onDecember12,correctinghis friendly source was to state clearly what he was pre- own error with regard to the beta estimation problem pared to say, while avoiding speaking directly to the and enlarging on his other comment, to the point of point raised. Without addressing the criticism, much ratherclearlyspeculatingonthepossibilityofsuperef- less admitting its validity, he would move directly to ficientestimates. a new and improved position, often not giving any in- Thegeneralquestionoftheexactcircum- dicationthatitrepresentedthestrongeststatementthat stances in which optimum statistics have could be made and perhaps even hinting otherwise or minimum variance...is extremely interest- at least allowing the reader to speculate so. Such was ing. That the property is not perfectly gen- thecasehere. eral seems clear from a consideration of Fisher’s reply to the first of Hotelling’s letters was someofthedistributionshavingdiscontinu- brief, but it included one enclosure (A) that outlined a ities; and also from the fact that, if the true new proof that illuminated Fisher’s views, as well as value were known, a system of estimation a second short note (B) correcting Hotelling’s error in could be devised which would give it with differentiatingthebetadensity. arbitrarilysmallvariance;andsuchasystem 28November1930 of estimation might happen to be adopted DearHotelling, evenifthetruevaluewereunknown. IenclosetwonotesAandBonthepoints you raise. The first brings in the general todescribeapointcloud.Fisher(1924)usedtheterminsummariz- ingthemultipledimensionalspaceapproachhehadtakeninFisher (1915). 
Fisher did not use the term in Fisher (1915), although it 2Fromcommentselsewhereinthecorrespondenceitisclearthat wouldhavebeenappropriatetherealso. by“experiment”Hotellingmeanssimulationwithdiceorcards.
