Nonparametric estimation of the stationary density and the transition density of a Markov chain 8 0 0 Claire Lacour 2 n MAP5, Universit´e Paris Descartes, CNRS UMR 8145, 45 rue des Saints-P`eres a 75270 Paris Cedex 06, France J [email protected] 9 ] T S Abstract . h t a In this paper, we study first the problem of nonparametric estimation of the sta- m tionary density f of a discrete-time Markov chain (X ). We consider a collection i [ of projection estimators on finite dimensional linear spaces. We select an estimator 2 among the collection by minimizing a penalized contrast. The same technique en- v ablestoestimatethedensityg of(X ,X )andsotoprovideanadaptiveestimator 5 i i+1 4 of the transition density π = g/f. We give bounds in L2 norm for these estimators 6 and weshow that they are adaptive in the minimax senseover a large class of Besov 1 spaces. Some examples and simulations are also provided. 1 6 0 Key words: Adaptive estimation, Markov Chain, Stationary density, Transition / h density, Model selection, Penalized contrast, Projection estimators t a m : v i X 1 Introduction r a Nonparametric estimation is now a very rich branch of statistical theory. The case of i.i.d. observations is the most detailed but many authors are also in- terested in the case of Markov processes. Early results are stated by Roussas (1969), who studies nonparametric estimators of the stationary density and the transition density of a Markov chain. He considers kernel estimators and assumes that the chain satisfies the strong Doeblin’s condition (D ) (see Doob 0 (1953) p.221). He shows consistency and asymptotic normality of his estima- tor. Several authors tried to consider weaker assumptions than the Doeblin’s condition. Rosenblatt (1970) introduces an other condition, denoted by (G ), 2 and he gives results on the bias and the variance of the kernel estimator of the invariant density in this weaker framework. Yakowitz (1989) improves Preprint submitted to Elsevier 2 February 2008 also the result of asymptotic normality by considering a Harris-condition. The study of kernel estimators is completed by Masry and Gy¨orfi (1987) who find sharp rates for this kind of estimators of the stationary density and by Basu and Sahoo (1998) who prove a Berry-Esseen inequality under the con- dition (G ) of Rosenblatt. Other authors are interested in the estimation of 2 the invariant distribution and the transition density in the non-stationary case: Doukhan and Ghind`es (1983) bound the integrated risks for any initial distribution. InHern´andez-Lerma et al.(1988),recursive estimators for anon- stationary Markov chain are described. Liebscher (1992) gives results for the invariant density in this non-stationary framework using a condition denoted by (D ) derived from the Doeblin’s condition but weaker than (D ). All the 1 0 above papers deal with kernel estimators. Among those who are not interested in such estimators, let us mention Bosq (1973) who studies an estimator of the stationary density by projection on a Fourier basis, Prakasa Rao (1978) who outlines a new estimator for the stationary density by using delta-sequences and Gillert and Wartenberg (1984) who present estimators based on Hermite bases or trigonometric bases. The recent work of Cl´emenc¸on (1999) allows to measure the performance of all these estimators since he proves lower bounds for the minimax rates and gives thus the optimal convergence rates for the estimation of the stationary density and the transition density. Cl´emenc¸on also provides an other kind of estimator for the stationary density and for the transition density, that he obtains by projection on wavelet bases. He presents an adaptive procedure which is ”quasi-optimal” in the sense that the procedure reaches almost the optimal rate but with a logarithmic loss. He needs other conditions than those we cited above and in particular a minoration condition derived from Num- melin’s (1984) works. In this paper, we will use the same condition. The aim of this paper is to estimate the stationary density of a discrete-time Markov chain and its transition density. We consider an irreducible positive recurrent Markov chain (X ) with a stationary density denoted by f. We n suppose that the initial density is f (hence the process is stationary) and ˜ we construct an estimator f from the data X ,...,X . Then, we study the 1 n mean integrated squared error E f˜ f 2 and its convergence rate. The same k − k2 technique enables to estimate the density g of (X ,X ) and so to provide an i i+1 estimator of the transition density π = g/f, called the quotient estimator. An adaptative procedure is proposed for the two estimations and it is proved that both resulting estimators reach the optimal minimax rates without ad- ditive logarithmic factor. We will use here some technical methods known as the Nummelin splitting technique(seeNummelin(1984),Meyn and Tweedie(1993)orH¨opfner and L¨ocherbach (2003)). This method allows to reduce the general state space Markov chain 2 theory to the countable space theory. Actually, the splitting of the original chain creates an artificial accessible atom and we will use the hitting times to this atom to decompose the chain, as we would have done for a countable space chain. To build our estimator of f, we use model selection via penalization as de- ˆ scribed in Barron et al. (1999). First, estimators by projection denoted by f m are considered. The index m denotes the model, i.e. the subspace to which the estimator belongs. Then the model selection technique allows to select ˆ ˆ automatically an estimator f from the collection of estimators (f ). The mˆ m estimator of g is built in the same way. The collections of models that we con- sider here include wavelets but also trigonometric polynomials and piecewise polynomials. This paper is organized as follows. In Section 2, we present our assumptions on the Markov chain and on the collections of models. We give also examples of chains and models. Section 3 is devoted to estimation of the stationary density and in Section 4 the estimation of the transition density is explained. Some simulations are presented in Section 5. The proofs are gathered in the last section, which contains also a presentation of the Nummelin splitting technique. 2 The framework 2.1 Assumptions on the Markov chain We consider an irreducible Markov chain (X ) taking its values in the real n line R. We suppose that (X ) is positive recurrent, i.e. it admits a station- n ary probability measure µ (for more details, we refer to Meyn and Tweedie (1993)). We assume that the distribution µ has a density f with respect to the Lebesgue measure and it is this quantity that we want to estimate. Since the number of observations is finite, f is estimated on a compact set only. Without loss of generality, this compact set is assumed to be equal to [0,1] and, from now, f denotes the transition density multiplied by the indicator function of [0,1] f1 . More precisely, the Markov process is supposed to [0,1] satisfy the following assumptions: A1. (X ) is irreducible and positive recurrent. n A2. ThedistributionofX isequaltoµ,thusthechainis(strictly)stationary. 0 A3. The stationary density f belongs to L∞([0,1]) i.e. sup f(x) < x∈[0,1]| | ∞ A4. The chain is strongly aperiodic, i.e. it satisfies the following minorization condition: there is some function h : [0,1] [0,1] with hdµ > 0 and a 7→ R 3 positive distribution ν such that, for all event A and for all x, P(x,A) h(x)ν(A) ≥ where P is the transition kernel of (X ). n A5. The chainis geometrically ergodic,i.e.thereexists afunction V > 0 finite and a constant ρ (0,1) such that, for all n 1 ∈ ≥ Pn(x,.) µ V(x)ρn TV k − k ≤ where . is the total variation norm. TV k k We can remark that condition A3 implies that f belongs to L2([0,1]) where 1 L2([0,1]) = t : R R,Supp(t) [0,1] and t 2 = t2(x)dx < . { 7→ ⊂ k k ∞} Z0 Notice that, if the chain is aperiodic, condition A4 holds, at least for some m-skeleton (i.e. a chain with transition probability Pm) (see Theorem 5.2.2 in Meyn and Tweedie (1993)). This minorization condition is used in the Num- melin splitting technique and is also required in Cl´emenc¸on (1999). Thelastassumption,whichiscalledgeometricregularitybyCl´emenc¸on(2000), means that the convergence of the chain to the invariant distribution is ge- ometrically fast. In Meyn and Tweedie (1993), we find a slightly different condition (replacing the total variation norm by the V-norm). This condi- tion, which is sufficient for A5, is widely used in Monte Carlo Markov Chain literature because it guarantees central limit theorems and enables to sim- ulate laws via a Markov chain (see for example Jarner and Hansen (2000), Roberts and Rosenthal (1998) or Meyn and Tweedie (1994)). The following subsection gives some examples of Markov chains satisfying hypotheses A1–A5. 2.2 Examples of chains 2.2.1 Diffusion processes We consider the process (X ) where ∆ > 0 is the observation step and i∆ 1≤i≤n (X ) is defined by t t≥0 dX = b(X )dt+σ(X )dW t t t t where W is the standard Brownian motion, b a is a locally bounded Borelian function and σ is a uniformly continuous function such that: (1) there exists λ , λ such that x = 0, 0 < λ < σ2(x) < λ , − + − + ∀ 6 (2) there exists M ,α 0 and r > 0 such that x M ,xb(x) r x α+1. 0 0 ≥ ∀| | ≥ ≤ − | | 4 Then,ifX followsthestationarydistribution,Proposition1inPardoux and Veretennikov 0 (2001) shows that the discretized process (X ) satisfies Assumptions i∆ 1≤i≤n A1–A5. 2.2.2 Nonlinear AR(1) processes Let us consider the following process X = ϕ(X )+ε n n−1 Xn−1,n where ε has a positive density l with respect to the Lebesgue measure, x,n x whichdoesnotdependonn.Wesupposethatϕisboundedonanycompactset and that there exist M > 0 and ρ < 1 such that, for all x > M, ϕ(x) < ρ x . Mokkadem(1987)provesthatifthereexistss > 0sucht|h|atsup |E ε |s <| |, x | x,n| ∞ then the chain is geometrically ergodic. If we assume furthermore that l has x a lower bound then the chain satisfies all the previous assumptions. 2.2.3 ARX (1,1) models The nonlinear process ARX(1,1) is defined by X = F(X ,Z )+ξ n n−1 n n where F is bounded and (ξ ), (Z ) are independent sequences of i.i.d. random n n variableswithE ξ < .WesupposethatthedistributionofZ hasapositive n n | | ∞ density l with respect to the Lebesgue mesure. Assume that there exist ρ < 1, a locally bounded and mesurable function h : R R+ such that Eh(Z ) < n 7→ ∞ and positive constants M,c such that (u,v) > M F(u,v) < ρ u +h(v) c and sup F(x) < . ∀| | | | | | − | | ∞ |x|≤M Then the process (X ) satisfies Assumptions A1–A5 (see Doukhan (1994) n p.102). 2.2.4 ARCH process The considered model is X = F(X )+G(X )ε n+1 n n n+1 where F and G are continuous functions and for all x, G(x) = 0. We suppose 6 that the distribution of ε has a positive and continuous density with respect n to the Lebesgue measure and that there exists s 1 such that E ε s < . n ≥ | | ∞ 5 The chain (X ) satisfies Assumptions A1–A5 if (see Doukhan (1994) p.106): i F(x) + G(x) (E ε s)1/s n lim sup | | | | | | < 1. x |x|→∞ | | 2.3 Assumptions on the models In order to estimate f, we need to introduce some collections of models. The assumptions on the models are the following: M1. EachS isalinearsubspace of(L∞ L2)([0,1])withdimensionD √n m m ∩ ≤ M2. Let 1 t ∞ φ = sup k k m √D t mt∈Sm\{0} k k There exists a real r such that for all m, φ r . 0 m 0 ≤ This assumption (L2-L∞ connexion) is introduced by Barron et al. (1999) and can be written: t S t r D t . (1) m ∞ 0 m ∀ ∈ k k ≤ k k q We get then a set of models (S ) where = m, D √n . We m m∈Mn Mn { m ≤ } neednowalast assumptionregarding thewholecollection, which ensures that, for m and m′ in , S +S′ belongs to the collection of models. Mn m m M3. The models are nested, that is for all m, Dm Dm′ Sm Sm′. ≤ ⇒ ⊂ 2.4 Examples of models WeshowherethattheassumptionsM1-M3arenottoorestrictive.Indeed,they are verified for the models spanned by the following bases (see Barron et al. (1999)): • Histogram basis: Sm =< ϕ1,...,ϕ2m > with ϕj = 2m/21[j2−m1,2jm[ for j = 1,...,2m. Here D = 2m, r = 1 and = 1,..., lnn/2ln2 where m 0 n M { ⌊ ⌋} x denotes the floor of x, i.e. the largest integer less than or equal to x. ⌊ ⌋ Trigonometric basis: S =< ϕ ,...,ϕ > with ϕ (x) = 1 (x), ϕ = m 0 m−1 0 [0,1] 2j • √2 cos(2πjx)1 (x), ϕ = √2sin(2πjx)1 (x) for j 1. For this [0,1] 2j−1 [0,1] ≥ model D = m and r = √2 hold. m 0 Regular piecewise polynomial basis: S is spanned by polynomials of degree m • 0,...,r (where r is fixed) on each interval [(j 1)/2D,j/2D[,j = 1,...,2D. − In this case, m = (D,r), D = (r + 1)2D and = (D,r), D = m n M { 1,..., log (√n/(r+1)) .We can put r = √r +1. ⌊ 2 ⌋} 0 6 Regular wavelet basis: S =< ψ ,j = 1,...,m,k Λ(j) > where ψ m jk −1,k • − ∈ points out the translates of the father wavelet and ψ (x) = 2j/2ψ(2jx k) jk − whereψ isthemotherwavelet.Weassumethatthesupportofthewaveletsis included in [0,1] and that ψ = ϕ belongs to the Sobolev space Wr. In this −1 2 framework Λ(j) = 0,...,K2j 1 (for j 0) where K is a constant which { − } ≥ depends on the supports of ϕ and ψ: for example for the Haar basis K = 1. We have then D = m Λ(j) = Λ( 1) +K(2m+1 1). Moreover m j=−1| | | − | − P ψ + m 2j/2 ψ φ k| −1,k| j=0 k| j,k| m≤ √D P P m P ϕ ψ (1+ m 2j/2) ϕ ψ k k∞ ∨k k∞ j=0 k k∞ ∨k k∞ =: r 0 ≤ (K Λ( 1)P)2m+1 ≤ K Λ( 1) ∧| − | ∧| − | q 3 Estimation of the stationary density 3.1 Decomposition of the risk for the projection estimator Let 1 n γ (t) = [ t 2 2t(X )]. (2) n i n k k − i=1 X Notice that E(γ (t)) = t f 2 f 2 and therefore γ (t) is the empirical n n version of the L2 distancke b−etwkeen−tkankd f. Thus, fˆ is defined by m ˆ f = argminγ (t) (3) m n t∈Sm where S is a subspace of L2 which satisfies M2. Although this estimator m depends on n, no index n is mentioned in order to simplify the notations . It is also the case for all the estimators in this paper. ˆ A more explicit formula for f is easy to derive: m 1 n ˆ ˆ ˆ f = β ϕ , β = ϕ (X ) (4) m λ λ λ λ i n λ∈Λ i=1 X X where (ϕ ) is an orthonormal basis of S . Note that λ λ∈Λ m E(fˆ ) = < f,ϕ > ϕ , m λ λ λ∈Λ X which is the projection of f on S . m In order to evaluate the quality of this estimator, we now compute the mean integrated squared error E f fˆ 2 (often denoted by MISE). m k − k 7 Proposition 1 Let X be a Markov chain which satisfiesAssumptions A1–A5 n and S be a subspace of L2 with dimension D n. If S satisfies condition m m m ˆ ≤ M2, then the estimator f defined by (3) satisfies m D E f fˆ 2 d2(f,S )+C m m m k − k ≤ n where C is a constant which does not depend on n. To compute the bias term d(f,S ), we assume that f belongs to the Besov m space Bα ([0,1]). We refer to DeVore and Lorentz (1993) p.54 for the def- 2,∞ inition of Bα ([0,1]). Notice that when α is an integer, the Besov space 2,∞ Bα ([0,1]) contains the Sobolev space Wα (see DeVore and Lorentz (1993) 2,∞ 2 p.51–55). Hence, we have the following corollary. Corollary 2 Let X be a Markov chain which satisfies Assumptions A1–A5. n Assume that the stationary density f belongs to Bα ([0,1]) and that S is one 2,∞ m of the spaces mentioned in Section 2.4 (with the regularity of polynomials and 1 wavelets larger than α 1). If we choose Dm = n2α+1 , then the estimator − ⌊ ⌋ defined by (3) satisfies E f fˆm 2 = O(n−2α2α+1) k − k We can notice that we obtain the same rate than in the i.i.d. case (see Donoho et al. (1996)). Actually, Cl´emenc¸on (1999) proves that n−2α2α+1 is the optimal rate in the minimax sense in the Markovian framework. With very different theoretical tools, Tribouley and Viennet (1998) show that this rate is also reached in the case of the univariate density estimation of β-mixing random variables by using a wavelet estimator. 1 However, the choice Dm = n2α+1 is possible only if we know the regularity ⌊ ⌋ α of the unknown f. But generally, it is not the case. It is the reason why we construct an adaptive estimator, i.e. an estimator which achieves the optimal rate without requiring the knowledge of α. 3.2 Adaptive estimation Let (S ) be a collection of models as described in Section 2.3. For each m m∈Mn ˆ S , f is defined as above by (3). Next, we choose mˆ among the family m m n M such that ˆ mˆ = argmin[γ (f )+pen(m)] n m m∈Mn 8 ˜ ˆ where pen is a penalty function to be specified later. We denote f = f and mˆ we bound the L2-risk E f f˜ as follows. k − k Theorem 3 Let X be a Markov chain which satisfies Assumptions A1–A5 n and (S ) be a collection of models satisfying Assumptions M1–M3. Then m m∈Mn the estimator defined by ˜ ˆ ˆ f = f where mˆ = argmin[γ (f )+pen(m)], (5) mˆ n m m∈Mn with D m pen(m) = K for some K > K (6) 0 n (where K is a constant depending on the chain) satisfies 0 C E f˜ f 2 3 inf d2(f,S )+pen(m) + 1 m k − k ≤ m∈Mn{ } n where C does not depend on n. 1 Remark 4 The constant K in the penalty depends only on the distribution 0 of the chain and can be chosen equal to max(r2,1)(C + C f ) where C 0 1 2k k∞ 1 and C are theoretical constants provided by the Nummelin splitting technique. 2 The number r is known and depends on the chosen base (see subsection 2.3). 0 The mention of f in the penalty term seems to be a problem, seeing that f ∞ k k ˆ ˆ is unknown. Actually, we could replace f by f with f an estimator of ∞ ∞ k k k k f. This method of random penalty is successfully applied in Birg´e and Massart (1997) or Comte (2001) for example. But we choose not to use this method here, since the constants C and C in K are not computable either. Notice 1 2 0 that Cl´emen¸con(2000) handle with the same kind of unknown quantities in the threshold of his nonlinear wavelet estimator. Actually it is the price to pay for dealing with dependent variables (see also the mixing constant in the threshold in Tribouley and Viennet (1998)). But this annoyance can be circumvented for practical purposes. Indeed,for the simulationsthe computation of the penalty is hand-adjusted. Some techniques of calibration can be found in Lebarbier (2005) in the context of multiple change point detection. In a Gaussian framework the practical choice of the penalty for implementation is also discussed in Section 4 of Birg´e and Massart (2007). Corollary 5 Let X be a Markov chain which satisfies Assumptions A1–A5 n and (S ) be a collection of models mentioned in Section 2.4 (with the m m∈Mn regularity of polynomials and wavelets larger than α 1). If f belongs to − Bα ([0,1]), with α > 1/2, then the estimator defined by (5) and (6) satisfies 2,∞ E f˜ f 2 = O(n−2α2α+1) k − k Remark 6 When α > 1, Bα ([0,1]) C[0,1] (where C[0,1] is the set of 2 2,∞ ⊂ 9 the continuous functions with support in [0,1]) and then the assumption A3 f < is superfluous. ∞ k k ∞ We have already noticed that it is the optimal rate in the minimax sense (see the lower bound in Cl´emenc¸on (1999)). Note that here the procedure reaches thisratewhatevertheregularityoff,withoutneedingtoknowα.Thisresultis thusaimprovement oftheoneofCl´emenc¸on (1999),whoseadaptiveprocedure 2α achieves only the rate (log(n)/n)2α+1. Moreover, our procedure allows to use more bases (not only wavelets) and is easy to implement. 4 Estimation of the transition density We now suppose that the transition kernel P has a density π. In order to estimate π, we remark that π can be written g/f where g is the density of (X ,X ). Thus we begin with the estimation of g. As previously, g and π are i i+1 estimated on a compact set which is assumed to be equal to [0,1]2, without loss of generality. 4.1 Estimation of the joint density g We need now a new assumption. A3’. π belongs to L∞([0,1]2). Notice that A3’ implies A3. We consider now the following subspaces. S(2) = t L2([0,1]2), t(x,y) = α ϕ (x)ϕ (y) m { ∈ λ,µ λ µ } λ,µX∈Λm where (ϕ ) is an orthonormal basis of S . Notice that, if we set λ λ∈Λm m 1 t φ(2)= sup k k∞, m D t mt∈Sm(2)\{0} k k hypothesis M2 implies that φ(2) is bounded by r2. The condition M1 must be m 0 replaced by the following condition: M1’. EachS(2) isalinearsubspaceof(L∞ L2)([0,1]2)withdimensionD2 √n. m ∩ m ≤ 10