The Annals of Statistics
2013, Vol. 41, No. 6, 2979–2993
DOI: 10.1214/13-AOS1176
© Institute of Mathematical Statistics, 2013
arXiv:1401.0609v1 [math.ST] 3 Jan 2014

NOTE ON DISTRIBUTION FREE TESTING FOR DISCRETE DISTRIBUTIONS

By Estate Khmaladze

Victoria University of Wellington

The paper proposes a one-to-one transformation of the vector of components $\{Y_{in}\}_{i=1}^m$ of Pearson's chi-square statistic,
\[
Y_{in}=\frac{\nu_{in}-np_i}{\sqrt{np_i}},\qquad i=1,\dots,m,
\]
into another vector $\{Z_{in}\}_{i=1}^m$, which, therefore, contains the same "statistical information," but is asymptotically distribution free. Hence any functional/test statistic based on $\{Z_{in}\}_{i=1}^m$ is also asymptotically distribution free. Natural examples of such test statistics are traditional goodness-of-fit statistics from partial sums $\sum_{i\le k}Z_{in}$.

The supplement shows how the approach works in a problem of independent interest: the goodness-of-fit testing of power-law distributions with the Zipf law and the Karlin–Rouault law as particular alternatives.

Received August 2012; revised September 2013.
AMS 2000 subject classifications. Primary 62D05, 62E20; secondary 62E05, 62F10.
Key words and phrases. Components of chi-square statistics, unitary transformations, parametric families of distributions, projections, power-law distributions, Zipf's law.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2013, Vol. 41, No. 6, 2979–2993. This reprint differs from the original in pagination and typographic detail.

1. Introduction. The main driver for this work was the need for a class of distribution-free tests for discrete distributions. The basic step, reported in Section 2 below, could have been made long ago, maybe even soon after the publication of the classical papers of Pearson (1900) and Fisher (1922, 1924). However, the tradition of using the chi-square goodness-of-fit statistic became so widely spread, and the point of view that, for discrete distributions, other statistics "have to" have their asymptotic distributions dependent on the individual probabilities, became so predominant and "evident," that it required a strong impulse to examine the situation again. It came, in this case, in the form of a question from Professor Ritei Shibata, "Why is the theory of distribution-free tests for discrete distributions so much more narrow than for continuous distributions?" If it is true that sometimes a question is half of the answer, then this is one such case.

We recall that for continuous distributions, the idea of the time transformation $t=F(x)$ of Kolmogorov (1933), along with subsequent papers of Smirnov (1937) and Wald and Wolfowitz (1939), was always associated with a class of goodness-of-fit statistics. The choice of statistics invariant under this time transformation, at least since the paper of Anderson and Darling (1952), became an accepted principle in goodness-of-fit theory for continuous distributions. For discrete distributions, however, everything is locked on a single statistic, the chi-square goodness-of-fit statistic. It certainly is true that in cases like the maximum likelihood statistic for multinomial distributions [see, e.g., Kendall and Stuart (1963)] or like the empirical likelihood [see, e.g., Einmahl and McKeague (1999) and Owen (2001)], the chi-square statistic appears as a natural asymptotic object. Yet most of the time the choice of this statistic comes as a deliberate choice of one particular asymptotically distribution-free statistic.

The idea of a class of asymptotically distribution-free tests, to the best of our knowledge, was never considered in any serious and systematic way. This is a pity, because unlike the transformation $t=F(x)$, which is basically a tool for one-dimensional time $x$, if we do not digress onto the transformation of Rosenblatt (1952) or spatial martingales of Khmaladze (1993), the idea behind Pearson's chi-square test is applicable to any measurable space.
The potential of its generalization seems, therefore, worth investigation. We will undertake one such investigation in this paper. Namely, we will obtain a transformation of the vector $Y_n$ of components of Pearson's chi-square statistic (see below) into a vector $Z_n$, which will be shown to be asymptotically distribution free. Therefore, any functional based on $Z_n$ can be used as a statistic of an asymptotically distribution-free test for the corresponding discrete distribution. Thus the paper demonstrates, we hope, that the geometric insight behind the papers of Pearson (1900) or Fisher (1924) goes considerably further than one goodness-of-fit statistic.

In the remaining part of this Introduction we present a typical result of this paper. General results and other, perhaps more convenient, forms of the transformation are given in the appropriate sections later on.

Let $p_1,\dots,p_m$ be a discrete probability distribution; all $p_i>0$ and $\sum_{i=1}^m p_i=1$. Denote by $\nu_{1n},\dots,\nu_{mn}$ the corresponding frequencies in a sample of size $n$, and consider the vector $Y_n$ of components of the chi-square statistic
\[
Y_{in}=\frac{\nu_{in}-np_i}{\sqrt{np_i}},\qquad i=1,\dots,m.
\]
Let $X=(X_1,\dots,X_m)^T$ denote a vector of $m$ independent $N(0,1)$ random variables. As $n\to\infty$ the vector $Y_n$ has a limit distribution of the zero-mean Gaussian vector $Y=(Y_1,\dots,Y_m)^T$ such that
\[
(1)\qquad Y=X-\langle X,\sqrt{p}\rangle\sqrt{p},
\]
where $\sqrt{p}$ denotes the vector $\sqrt{p}=(\sqrt{p_1},\dots,\sqrt{p_m})^T$. Here and below we use the notation $\langle a,b\rangle$ for the inner product of vectors $a$ and $b$ in $\mathbb{R}^m$: $\langle a,b\rangle=\sum_{i=1}^m a_ib_i$.

According to (1) the vector $Y$ is an orthogonal projection of $X$ parallel to $\sqrt{p}$. Of course its distribution depends on $\sqrt{p}$; it is only the sum of squares $\langle Y,Y\rangle$ which is chi-square distributed and hence has a distribution free from $\sqrt{p}$. It is for this reason that we do not have any other asymptotically distribution-free goodness-of-fit test for discrete distributions except the chi-square statistic
\[
\langle Y_n,Y_n\rangle=\sum_{i=1}^m\frac{(\nu_{in}-np_i)^2}{np_i}.
\]
In particular, the asymptotic distribution of partial sums based on $Y_{in}$, like
\[
\sum_{i=1}^k\frac{\nu_{in}-np_i}{\sqrt{np_i}}\qquad\text{or}\qquad\sum_{i=1}^k\frac{\nu_{in}-np_i}{\sqrt{n}},\qquad k=1,2,\dots,m,
\]
which would be discrete time analogues of the empirical process, will certainly depend on $\sqrt{p}$, as will the asymptotic distribution of statistics based on them. Here we would like to refer to the paper of Henze (1996), which advances the point of view that goodness-of-fit tests for discrete distributions should be thought of as based on empirical processes in discrete time, that is, on the partial sums on the right. In the same vein, Choulakian, Lockhart and Stephens (1994) considered quadratic functionals based on these partial sums, as direct analogues of (weighted) omega-square statistics. We refer also to Goldstein, Morris and Yen (2004), where tables for some quantiles of Kolmogorov–Smirnov statistics from the partial sums are calculated in the parametric problem, described in the supplementary material [Khmaladze (2013)]. These papers illustrate the dependence on the hypothetical distribution $p$ very clearly.

We do not know of many attempts to construct distribution-free tests for discrete distributions, but one such, suggested in Greenwood and Nikulin (1996), stands out for its simplicity and clarity: any discrete distribution function $F_0$ can be replaced by a piecewise linear distribution function $\tilde F_0$ with the same values as $F_0$ at the (nowhere dense) jump points of the latter; this opens up the possibility to use the time transformation $t=\tilde F_0(x)$ and thus obtain distribution-free tests. However, without inquiring about the consequences of implied additional randomization between the jump points, this approach remains a one-dimensional tool.

In this paper we introduce a vector $Z_n=\{Z_{in}\}_{i=1}^m$ as follows: let $r$ be the unit length "diagonal" vector with all coordinates $1/\sqrt{m}$, and put
\[
(2)\qquad Z_n=Y_n-\langle Y_n,r\rangle\frac{1}{1-\langle\sqrt{p},r\rangle}(r-\sqrt{p}).
\]
More explicitly,
\[
Z_{in}=\frac{\nu_{in}-np_i}{\sqrt{np_i}}-\frac{1}{\sqrt{m}}\sum_{j=1}^m\frac{\nu_{jn}-np_j}{\sqrt{np_j}}\,\frac{1}{1-\sum_{j=1}^m\sqrt{p_j/m}}\biggl(\frac{1}{\sqrt{m}}-\sqrt{p_i}\biggr).
\]
We will see that the following statement for $Z_n$ is true:

Proposition. Let $I=(1,\dots,1)^T$ denote the vector with all $m$ coordinates equal to 1. The asymptotic distribution of $Z_n$ is that of another, standard orthogonal projection
\[
Z\stackrel{d}{=}X-\langle X,r\rangle r=X-\frac{1}{m}\langle X,I\rangle I
\]
and therefore any statistic based on $Z_n$ is asymptotically distribution free. The transformation of $Y_n$ to $Z_n$ is one-to-one.

Thus the problem of testing $p$ is translated into the problem of testing the uniform discrete distribution of the same dimension $m$. In particular, partial sums
\[
\sum_{i=1}^k Z_{in},\qquad k=1,2,\dots,m,
\]
will asymptotically behave as a discrete time analog of the standard Brownian bridge. On the other hand, since the transformation from $Y_n$ to $Z_n$ is one-to-one, $Z_n$ carries the same amount of statistical information as $Y_n$.

For the proof of the proposition, see Theorem 1 below. We will see that this is not an isolated result, but one of several possible results, and it follows from one particular point of view, which is explained in the next section. We carry it on to the parametric case in Section 3.

2. Pertinent unitary transformation. The idea behind the transformation (2) can be explained as follows: the problem with the vector $Y$ is that it projects a standard vector $X$ parallel to a specific vector, the vector $\sqrt{p}$. This vector changes and with it changes the distribution of $Y$. However, using an appropriate unitary operator, which incorporates $\sqrt{p}$, one can "turn" $Y$ so that the result will be an orthogonal projection parallel to a standard vector. One such standard vector can be the vector $(1/\sqrt{m})I$ above.

Slightly more generally, let $q$ and $r$ be two vectors of unit length in $m$-dimensional space $\mathbb{R}^m$. Apart from the obvious particular choice of $r=(1/\sqrt{m})I$ and $q=\sqrt{p}=(\sqrt{p_1},\dots,\sqrt{p_m})^T$, we will consider other choices later on as well.
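As a numerical illustration of the transformation (2) of the Introduction, the following Python sketch computes $Y_n$ from observed counts and turns it into $Z_n$ for the choice $r=(1/\sqrt{m})I$, $q=\sqrt{p}$. It is only a sketch under stated assumptions: NumPy is assumed to be available, the function names `chi_square_components` and `transform` are hypothetical, and the counts are made-up data, none of which comes from the paper.

```python
import numpy as np

def chi_square_components(counts, p):
    """Components Y_in = (nu_in - n p_i) / sqrt(n p_i) of Pearson's statistic."""
    counts = np.asarray(counts, dtype=float)
    p = np.asarray(p, dtype=float)
    n = counts.sum()
    return (counts - n * p) / np.sqrt(n * p)

def transform(y, p):
    """Transformation (2) with r = I / sqrt(m) and q = sqrt(p)."""
    y = np.asarray(y, dtype=float)
    q = np.sqrt(np.asarray(p, dtype=float))
    r = np.full(q.size, 1.0 / np.sqrt(q.size))
    denom = 1.0 - q @ r
    if np.isclose(denom, 0.0):
        # p is uniform, so q = r and Y_n is already orthogonal to r
        return y.copy()
    return y - (y @ r) * (r - q) / denom

# Hypothetical data: m = 3 cells, n = 100 observations.
p = np.array([0.5, 0.3, 0.2])
counts = np.array([48, 33, 19])
y = chi_square_components(counts, p)
z = transform(y, p)

# Partial sums of Z_n: the discrete-time analogue of a bridge.
partial_sums = np.cumsum(z)
```

Because $Y_n$ is exactly orthogonal to $\sqrt{p}$, the vector `z` is orthogonal to $r$ (its coordinates sum to zero, so the last partial sum vanishes up to rounding) and has the same sum of squares as `y`, that is, the same value of the chi-square statistic.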
Denote by $\mathcal{L}=\mathcal{L}(q,r)$ the 2-dimensional subspace of $\mathbb{R}^m$, generated by the vectors $q$ and $r$, and by $\mathcal{L}^*$ its orthogonal complement in $\mathbb{R}^m$. In the lemma below we write $q_{\perp r}$ for the part of $q$ orthogonal to $r$, and $r_{\perp q}$ for the part of $r$ orthogonal to $q$:
\[
q_{\perp r}=q-\langle q,r\rangle r,\qquad r_{\perp q}=r-\langle q,r\rangle q
\]
and let $\mu=\|q_{\perp r}\|=\|r_{\perp q}\|$. Obviously, vectors $r$ and $q_{\perp r}/\mu$ form an orthonormal basis of $\mathcal{L}$ and vectors $q$ and $r_{\perp q}/\mu$ form another orthonormal basis. Consider
\[
U=rc^T+q_{\perp r}d^T/\mu
\]
with some $c,d\in\mathcal{L}$, as a linear operator in $\mathcal{L}$.

Lemma 1. (i) The operator $U$ is unitary if and only if the vectors $c$ and $d$ are orthonormal, $\|c\|=\|d\|=1$, $\langle c,d\rangle=0$.

(ii) The unitary operator $U$ maps $q$ to $r$, $Uq=r$, if and only if $c=q$ and $d=\pm r_{\perp q}/\mu$.

Altogether
\[
U=rq^T\pm\frac{1}{\mu^2}q_{\perp r}r_{\perp q}^T
\]
is the unitary operator in $\mathcal{L}$, which maps vector $q$ to vector $r$. It also maps vector $r_{\perp q}$ to vector $\pm q_{\perp r}$.

Remark. In what follows in this section we will choose the sign $+$.

It is clear that if vector $x$ is orthogonal to $q$ and $r$, then $Ux=0$. In other words, $U$ annihilates $\mathcal{L}^*$. Denote by $I_{\mathcal{L}^*}$ the projection operator parallel to $\mathcal{L}$, so that it is the identity operator on $\mathcal{L}^*$ and annihilates the subspace $\mathcal{L}$. Then the operator $I_{\mathcal{L}^*}+U$ is a unitary operator on $\mathbb{R}^m$. We use it to obtain our first result.

Suppose vector $Y$ is the projection of $X$, parallel to the vector $q$,
\[
Y=X-\langle X,q\rangle q.
\]

Theorem 1. (i) The vector
\[
(3)\qquad X'=(I_{\mathcal{L}^*}+U)X=X-\langle X,q\rangle(q-r)-\langle X,r_{\perp q}\rangle\frac{1}{1-\langle q,r\rangle}(r-q)
\]
is also a vector with independent $N(0,1)$ coordinates.

(ii) The vector
\[
(4)\qquad Z=(I_{\mathcal{L}^*}+U)Y=Y-\langle Y,r\rangle\frac{1}{1-\langle q,r\rangle}(r-q)
\]
is the projection of $X'$ parallel to $r$,
\[
Z=X'-\langle X',r\rangle r.
\]

Proof. (i) By its definition, vector $Y$ is the orthogonal projection of $X$, parallel to $q$. Therefore, if we project it further as
\[
R=Y-\langle Y,r_{\perp q}\rangle\frac{1}{\mu^2}r_{\perp q}=X-\langle X,q\rangle q-\langle X,r_{\perp q}\rangle\frac{1}{\mu^2}r_{\perp q},
\]
we will obtain the vector $R$ orthogonal to both $q$ and $r$, that is, a vector in $\mathcal{L}^*$.
If we apply operator $I_{\mathcal{L}^*}$ to $R$ it will not change, while $U$ will annihilate it, and thus
\[
\begin{aligned}
(I_{\mathcal{L}^*}+U)X&=R+U\biggl(\langle X,q\rangle q+\langle X,r_{\perp q}\rangle\frac{1}{\mu^2}r_{\perp q}\biggr)\\
&=R+\langle X,q\rangle r+\langle X,r_{\perp q}\rangle\frac{1}{\mu^2}q_{\perp r}\\
&=X-\langle X,q\rangle(q-r)-\langle X,r_{\perp q}\rangle\frac{1}{\mu^2}(r_{\perp q}-q_{\perp r}).
\end{aligned}
\]
Noting that
\[
r_{\perp q}-q_{\perp r}=(r-q)(1+\langle q,r\rangle)\qquad\text{and}\qquad\mu^2=1-\langle q,r\rangle^2,
\]
we obtain the right-hand side of (3). Coordinates of $X'$ are independent $N(0,1)$ random variables if the covariance matrix $EX'X'^T$ is the identity matrix on $\mathbb{R}^m$. We have
\[
\begin{aligned}
EX'X'^T&=(I_{\mathcal{L}^*}+U)\,EXX^T\,(I_{\mathcal{L}^*}+U)^T=(I_{\mathcal{L}^*}+U)(I_{\mathcal{L}^*}+U^T)\\
&=I_{\mathcal{L}^*}+UU^T=I_{\mathcal{L}^*}+rr^T+\frac{1}{\mu^2}q_{\perp r}q_{\perp r}^T=I.
\end{aligned}
\]

(ii) Note that the orthogonality property of $Y$, $\langle Y,q\rangle=0$, implies that $\langle X,r_{\perp q}\rangle=\langle Y,r\rangle$, and rewrite (3) as
\[
X'=(I_{\mathcal{L}^*}+U)X=Y-\langle Y,r\rangle\frac{1}{1-\langle q,r\rangle}(r-q)+\langle X,q\rangle r.
\]
Also note that
\[
\langle X',r\rangle=\langle(I_{\mathcal{L}^*}+U)X,r\rangle=\langle X,(I_{\mathcal{L}^*}+U)^Tr\rangle=\langle X,q\rangle,
\]
so that $Z$ is indeed the projection of $X'$ parallel to $r$:
\[
Z=X'-\langle X',r\rangle r=Y-\langle Y,r\rangle\frac{1}{1-\langle q,r\rangle}(r-q).\qquad\Box
\]

The second statement of this theorem, together with the classical statement $Y_n\stackrel{d}{\to}Y$, and the choice of $r=(1,\dots,1)^T/\sqrt{m}$ and $q=\sqrt{p}$, proves the proposition of the Introduction. The nature of the transformation and the proof given above do not depend on a particular choice of the vector $r$ and are correct for any $r$ of unit length. For example, we can choose $r=(1,0,\dots,0)^T$. Then the transformed vector $Z_n$ will have coordinates
\[
(5)\qquad Z_{in}=\frac{\nu_{in}-np_i}{\sqrt{np_i}}-\frac{\nu_{1n}-np_1}{\sqrt{np_1}}\,\frac{1}{1-\sqrt{p_1}}(\delta_{1i}-\sqrt{p_i})
\]
or, since $\delta_{1i}=0$ for $i\ge2$,
\[
Z_{1n}=0,\qquad Z_{in}=\frac{\nu_{in}-np_i}{\sqrt{np_i}}+\frac{\nu_{1n}-np_1}{\sqrt{np_1}}\,\frac{1}{1-\sqrt{p_1}}\sqrt{p_i},\qquad i=2,\dots,m.
\]
As a corollary of the previous theorem we obtain a vector with very simple asymptotic behavior.

Corollary 2. If $Y_n\stackrel{d}{\to}Y=X-\langle X,\sqrt{p}\rangle\sqrt{p}$, then for the vector $Z_n$ defined in (5) we have
\[
Z_n\stackrel{d}{\to}(0,X_2,\dots,X_m)^T.
\]

Finding the asymptotic distribution of statistics based on this choice of $Z_n$ may be more convenient than in the previous case. Yet the relationship between the two is one-to-one.
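The statements of Lemma 1 and Theorem 1 are easy to check numerically. The sketch below builds the matrix of $I_{\mathcal{L}^*}+U$ (with the sign $+$) for the choice $r=(1,0,\dots,0)^T$, $q=\sqrt{p}$, and compares its action on a projection $Y$ with formula (4). NumPy, the function name `unitary_operator`, the probabilities and the random seed are assumptions made for this illustration only.

```python
import numpy as np

def unitary_operator(q, r):
    """Matrix of I_{L*} + U (sign '+') for unit vectors q, r that are not collinear."""
    q = np.asarray(q, dtype=float)
    r = np.asarray(r, dtype=float)
    c = q @ r                        # <q, r>
    q_perp_r = q - c * r             # part of q orthogonal to r
    r_perp_q = r - c * q             # part of r orthogonal to q
    mu2 = 1.0 - c**2                 # mu^2 = |q_perp_r|^2 = |r_perp_q|^2
    u = np.outer(r, q) + np.outer(q_perp_r, r_perp_q) / mu2
    # Projector onto L = span(q, r), from the orthonormal basis r, q_perp_r / mu.
    proj_l = np.outer(r, r) + np.outer(q_perp_r, q_perp_r) / mu2
    return np.eye(q.size) - proj_l + u

rng = np.random.default_rng(0)            # arbitrary seed
p = np.array([0.1, 0.2, 0.3, 0.4])        # hypothetical probabilities
q = np.sqrt(p)
r = np.array([1.0, 0.0, 0.0, 0.0])

v = unitary_operator(q, r)
x = rng.standard_normal(p.size)
y = x - (x @ q) * q                       # projection of X parallel to q

z_matrix = v @ y                                       # (I_{L*} + U) Y
z_formula = y - (y @ r) * (r - q) / (1.0 - q @ r)      # right-hand side of (4)
```

Here `v` is orthogonal and maps `q` to `r`, the two expressions for $Z$ coincide, and the first coordinate of $Z$ is zero, in agreement with (5).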
It is often the case that the probabilities $p_1,\dots,p_m$ depend on a parameter, which has to be estimated from observed frequencies. This case needs additional consideration which we defer to the next section. However, there are also cases when the hypothetical probabilities are fixed, or the value of the parameter is estimated from previous samples, and therefore needs to be treated as given. In these cases Theorem 1 is directly applicable.

One important case of this type is the two-sample problem. Namely, let events, labeled by $i=1,2,\dots,m$, be basically as above, and let $\nu'_{1n'},\dots,\nu'_{mn'}$ and $\nu''_{1n''},\dots,\nu''_{mn''}$ be frequencies of these events in two independent samples of size $n'$ and $n''$, respectively. Let $\mu_1,\dots,\mu_m$ denote the frequencies in the pooled sample of size $n=n'+n''$. Then the normalized differences
\[
Y'_{in}=\frac{\nu'_{in'}-n'\mu_i/n}{\sqrt{n'\mu_i/n}},\qquad i=1,\dots,m,
\]
are the components of the two-sample chi-square statistic: the sum of their squares is the statistic. Conditions which guarantee convergence of the vector $Y'_n$ of these differences in distribution to the vector $Y$ are well known; see, for example, Rao (1965), or Einmahl and Khmaladze (2001) and references therein. Then it follows from Theorem 1 that under these conditions the vector $Z'_n$ with coordinates
\[
Z'_{in}=\frac{\nu'_{in'}-n'\mu_i/n}{\sqrt{n'\mu_i/n}}-\frac{1}{\sqrt{m}}\sum_{j=1}^m\frac{\nu'_{jn'}-n'\mu_j/n}{\sqrt{n'\mu_j/n}}\,\frac{1}{1+\sum_{j=1}^m\sqrt{\mu_j/(nm)}}\biggl(\frac{1}{\sqrt{m}}+\sqrt{\frac{\mu_i}{n}}\biggr)
\]
converges in distribution to the vector $X-\langle X,I\rangle I/m$ and, hence, is asymptotically distribution free. To show this result one needs only to choose as $q$ the vector $-(\sqrt{\mu_1/n},\dots,\sqrt{\mu_m/n})^T$ in Theorem 1 above; reversing the sign of $q$ does not change the projection $Y$. Corollary 2 suggests another choice of the transformed vector, with coordinates
\[
Z_{in}=\frac{\nu'_{in'}-n'\mu_i/n}{\sqrt{n'\mu_i/n}}-\frac{\nu'_{1n'}-n'\mu_1/n}{\sqrt{n'\mu_1/n}}\,\frac{1}{1+\sqrt{\mu_1/n}}\sqrt{\frac{\mu_i}{n}},\qquad i=2,\dots,m,
\]
which also has simple asymptotic behavior.

3. The case of estimated parameters.
We will now see that the pivotal property of $Y_n$, to behave asymptotically as an orthogonal projection of $X$, remains true for components of the chi-square statistic with estimated parameter.

Indeed, if the hypothetical probabilities depend on a $\kappa$-dimensional parameter, $p_i=p_i(\theta)$, which is estimated via maximum likelihood or minimum chi-square, then the statistic
\[
\sum_{i=1}^m\frac{(\nu_{in}-np_i(\hat\theta_n))^2}{np_i(\hat\theta_n)}
\]
has, asymptotically, a chi-square distribution with $m-1-\kappa$ degrees of freedom; see the extensive review of this matter in Stigler (1999), Chapter 19. Notwithstanding the great convenience of this result, note, however, that the asymptotic distribution of the vector $\hat Y_n$ itself, with
\[
(6)\qquad\hat Y_{in}=\frac{\nu_{in}-np_i(\hat\theta_n)}{\sqrt{np_i(\hat\theta_n)}},
\]
depends, under the hypothesis, not only on the probabilities $p_i(\theta)$ at the true value of $\theta$, but also on their derivatives in $\theta$. Therefore, the limit distribution of statistics from $\hat Y_n$ in general will depend on the hypothetical parametric family and on the value of the parameter.

At the same time, it has long been known [see, e.g., Cramér (1946), Chapter 20; a modern treatment can be found in van der Vaart (1998)] that under mild assumptions the maximum likelihood (and minimum chi-square) estimator possesses an asymptotic expansion of the form
\[
\sqrt{n}(\hat\theta_n-\theta)=\Gamma^{-1}\sum_{i=1}^m Y_{in}\frac{\dot p_i(\theta)}{\sqrt{p_i(\theta)}}+o_P(1),
\]
where $\dot p_i(\theta)$ denotes the $\kappa$-dimensional vector of derivatives of $p_i(\theta)$ in $\theta$ and
\[
\Gamma=\sum_{i=1}^m\frac{\dot p_i(\theta)\dot p_i(\theta)^T}{p_i(\theta)}
\]
denotes the $\kappa\times\kappa$ Fisher information matrix. At the same time, the expansion
\[
\hat Y_{in}=Y_{in}-\frac{\dot p_i(\theta)^T}{\sqrt{p_i(\theta)}}\sqrt{n}(\hat\theta_n-\theta)+o_P(1)
\]
is also true. Combining these two expansions, one obtains
\[
(7)\qquad\hat Y_{in}=Y_{in}-\frac{\dot p_i(\theta)^T}{\sqrt{p_i(\theta)}}\Gamma^{-1}\sum_{j=1}^m Y_{jn}\frac{\dot p_j(\theta)}{\sqrt{p_j(\theta)}}+o_P(1).
\]
Use the notation
\[
\hat q_i=\Gamma^{-1/2}\frac{\dot p_i(\theta)}{\sqrt{p_i(\theta)}},\qquad i=1,\dots,m,
\]
and remember that
\[
\sum_{i=1}^m\frac{\dot p_i(\theta)^T}{\sqrt{p_i(\theta)}}\sqrt{p_i(\theta)}=0,
\]
that is, each of the $\kappa$ vectors in $i$ that make up $\dot p/\sqrt{p}$ is orthogonal to the vector $\sqrt{p}$.
Therefore all $\kappa$ coordinates of $\hat q_i$ form, in $i$, vectors which are orthonormal and orthogonal to the vector $\sqrt{p(\theta)}$. Together with (1) this implies the convergence in distribution of $\hat Y_n$ to the Gaussian vector
\[
(8)\qquad\hat Y=X-\langle X,\sqrt{p}\rangle\sqrt{p}-\langle X,\hat q\rangle\hat q.
\]
It is easily seen that expression (8) describes $\hat Y$ as an orthogonal projection of $X$ parallel to the vectors $\sqrt{p}$ and $\dot p/\sqrt{p}$; see Khmaladze (1979) for an analogous description of empirical processes. Using this description, we can extend the method of Section 2 to the present situation.

Indeed, let us assume from now on that $\kappa=1$, which will make the presentation more transparent. Having two vectors, $q=\sqrt{p(\theta)}$ and $\hat q$, which determine the asymptotics of $\hat Y_n$, let us choose now a standard vector $r$ of unit length and another vector, $\hat r$, also of unit length and orthogonal to $r$. Heuristically, one may think of it as a normalized "score function" for some "standard" family around $r$. For example, choose $r=(1/\sqrt{m})I$ and choose any unit vector $\hat r$ such that $\sum_{i=1}^m\hat r_i=0$. Two such choices, we think, will be particularly useful: for $m$ even,
\[
\frac{1}{\sqrt{m}}(1,\dots,1,-1,\dots,-1)^T
\]
or
\[
\frac{1}{\sqrt{m}}(1,\dots,1,-1,\dots,-1,1,\dots,1)^T
\]
with the "plateau" of $-1$s taken $m/2$-long, and for $m$ odd put, say, the last coordinate equal to 0.

Whatever the choice of $\hat r$, suppose we chose and fixed it. It is obvious that the vector
\[
(9)\qquad\hat Z=X-\langle X,r\rangle r-\langle X,\hat r\rangle\hat r
\]
has a distribution totally unconnected with, and hence free from, the parametric family $p(\theta)$. Consider now the subspace $\hat{\mathcal{L}}=\mathcal{L}(q,\hat q,r,\hat r)$. We do not need to insist that it is a 4-dimensional subspace, but typically it is, at least, as far as we have freedom in $\hat r$. Let $\hat{\mathcal{L}}^*$ denote the orthogonal complement of $\hat{\mathcal{L}}$ in $\mathbb{R}^m$. Two bases of the space $\hat{\mathcal{L}}$ will be useful: one is formed by $r,\hat r,b_3,b_4$, where $b_3$ and $b_4$ are re-arrangements of $q$ and $\hat q$, which are orthonormal and orthogonal to $r$ and $\hat r$; the other is formed by $q,\hat q,a_3,a_4$, where $a_3$ and $a_4$ are re-arrangements of $r$ and $\hat r$, which are orthonormal and orthogonal to $q$ and $\hat q$.
We will consider particular forms of these vectors later on.

Lemma 2. The operator
\[
\hat U=rq^T+\hat r\hat q^T+b_3a_3^T+b_4a_4^T
\]
is a unitary operator on $\hat{\mathcal{L}}$ and such that
\[
\hat Uq=r,\qquad\hat U\hat q=\hat r.
\]

Theorem 3. Under convergence in distribution of the vector $\hat Y_n$ with coordinates (6) to the Gaussian vector $\hat Y$ given by (8), the vector
\[
(10)\qquad\hat Z_n=\hat Y_n-\langle\hat Y_n,a_3\rangle(a_3-b_3)-\langle\hat Y_n,a_4\rangle(a_4-b_4)
\]
converges in distribution to the Gaussian vector $\hat Z$ given by (9). Therefore, any statistic based on $\hat Z_n$ is asymptotically distribution free.
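To make the construction of Theorem 3 concrete, here is a minimal Python sketch for $\kappa=1$. Everything specific in it is an assumption made for illustration and is not taken from the paper: the parametric family (cells of a binomial distribution with $k=5$ trials, so $m=6$), the plug-in of $\hat\theta_n$ into $q$ and $\hat q$, the use of Gram–Schmidt orthogonalization to obtain $a_3,a_4$ and $b_3,b_4$ (the paper's own particular forms of these vectors are not shown in this excerpt), the names `binomial_cells`, `complete` and `z_hat`, and the counts.

```python
import math
import numpy as np

def binomial_cells(theta, k=5):
    """Hypothetical family: p_i(theta) = C(k,i) theta^i (1-theta)^(k-i), i = 0..k."""
    i = np.arange(k + 1)
    coef = np.array([math.comb(k, j) for j in i], dtype=float)
    p = coef * theta**i * (1.0 - theta)**(k - i)
    dp = p * (i / theta - (k - i) / (1.0 - theta))   # derivative of p_i in theta
    return p, dp

def complete(vectors, fixed):
    """Gram-Schmidt: orthonormalise `vectors` against the orthonormal list `fixed`."""
    basis = list(fixed)
    out = []
    for v in vectors:
        w = v - sum((v @ b) * b for b in basis)
        w = w / np.linalg.norm(w)
        basis.append(w)
        out.append(w)
    return out

def z_hat(counts, k=5):
    """Sketch of (10) for the hypothetical binomial family; m = k + 1 must be even."""
    counts = np.asarray(counts, dtype=float)
    n, m = counts.sum(), counts.size
    theta = (np.arange(m) @ counts) / (k * n)        # MLE for this family
    p, dp = binomial_cells(theta, k)
    y = (counts - n * p) / np.sqrt(n * p)            # components (6)
    q = np.sqrt(p)
    q_hat = dp / np.sqrt(p)
    q_hat = q_hat / np.linalg.norm(q_hat)            # Gamma^{-1/2} is a scalar when kappa = 1
    r = np.full(m, 1.0 / np.sqrt(m))
    r_hat = np.concatenate([np.ones(m // 2), -np.ones(m - m // 2)]) / np.sqrt(m)
    a3, a4 = complete([r, r_hat], [q, q_hat])        # orthonormal, orthogonal to q and q_hat
    b3, b4 = complete([q, q_hat], [r, r_hat])        # orthonormal, orthogonal to r and r_hat
    z = y - (y @ a3) * (a3 - b3) - (y @ a4) * (a4 - b4)   # formula (10)
    return y, z, r, r_hat

# Made-up counts for n = 100 observations in m = 6 cells.
y, z, r, r_hat = z_hat([8, 26, 35, 22, 8, 1])
```

With the maximum likelihood estimate plugged in, $\hat Y_n$ is exactly orthogonal to both $q$ and $\hat q$ for this family, so the resulting `z` is orthogonal to `r` and `r_hat` and keeps the same sum of squares as `y`. Any orthonormal completion of the two bases would serve in Lemma 2; Gram–Schmidt is simply the most direct one to code.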
