Consistency of the Plug-In Estimator of the Entropy Rate for Ergodic Processes Łukasz De˛bowski Institute of Computer Science Polish Academy of Sciences 01-248 Warszawa, Poland Email: [email protected] Abstract—A plug-in estimator of entropy is the entropy of Having denoted the entropy of a discrete distribution the distribution where probabilities of symbols or blocks have been replaced with their relative frequencies in the sample. H(p)=− p(w)logp(w), (3) Consistencyandasymptoticunbiasednessoftheplug-inestimator 6 w:pX(w)>0 can be easily demonstrated in the IID case. In this paper, we 1 ask whether the plug-in estimator can be used for consistent let the block entropy be 0 estimation of the entropy rate h of a stationary ergodic process. 2 The answer is positive if, to estimate block entropy of order k, H(k)=H(pk) (4) r we use a sample longer than 2k(h+ǫ), whereas it is negative if p we use a sample shorter than 2k(h−ǫ). In particular, if we do with the associated entropy rate A not know the entropy rate h, it is sufficient to use a sample of h= lim H(k)/k. (5) length (|X|+ǫ)k where |X| is the alphabet size. The result is n→∞ 6 derived usingk-block coding. As a by-product of ourtechnique, 2 As shown in [4], for the variational distance we also show that the block entropy of a stationary process is ] bounded above by a nonlinear function of the average block |p−q|:= |p(w)−q(w)|, (6) T entropy of its ergodic components. This inequality can be used w foranalternativeproofoftheknownfactthattheentropyratea X I . stationary process equals the average entropy rate of its ergodic we have s components. c lim p −p (·,Xn(k)) =0, (7) [ k k 1 k→∞ I. RESULTS (cid:12) (cid:12) 2 if we put n(k) ≥ 2k(cid:12)(h+ǫ) for IID pro(cid:12)cesses as well as for (cid:12) (cid:12) v Nonparametric entropy estimation is a task that requires a irreducibleMarkovchains,forfunctionsofirreducibleMarkov 4 very large amount of data. This problem has been studied chains, for ψ-mixing processes, and for weak Bernoulli pro- 1 mostly in the IID case, see a review of literature in [1]. cesses. This result suggests that sample size n(k) ≥ 2k(h+ǫ) 0 6 Moreover, the novel results of [1] state that it is impossible may be sufficient for estimation of block entropy H(k). 0 to estimate entropy of a distribution with a support size S Let us state our problemformally.The plug-inestimator of . using an IID sample shorter than of order S/logS, whereas 1 the block entropy is it is possible for a sample longer than of order S/logS, and 0 6 a practical estimator achieving this boundhas been exhibited. H(k,Xn)=H(p (·,Xn)), (8) 1 k 1 1 Earlier,in[2], anotherentropyestimatorwasproposedwhich, : for a finite alphabet, has a bias exponentially decreasing with as considered e.g. by [5]. Since v Xi thhoewesvaemr,plteooleslnogwth.toTbheeatetxhpeoSn/enlotigalSdseacmaypleofbotuhned.bias is, Epk(w,X1n)=P(Xii++1k =w) (9) r In this paper we would like to pursue the more difficult then, applying the Jensen inequality, we obtain a andlessrecognizedquestionofentropyestimationforgeneral EH(k,Xn)≤H(k) (10) stationary ergodic processes, cf. [3]. For a stationary process 1 (X )∞ over a finite alphabet X, consider the blocks of sotheplug-inestimatorisabiasedestimatorofH(k).Thebias i i=−∞ consecutive random symbols Xl =(X ) . Consider then of the plug-inestimator can be quite large since by inequality k i k≤i≤l the true block distribution p (w,Xn)≥⌊n/k⌋−1 for p (w,Xn)>0 we also have k 1 k 1 p (w)=P(Xi+k =w) (1) H(k,Xn)≤log⌊n/k⌋. (11) k i+1 1 and the empirical distribution Theplug-inestimatorH(k,Xn)dependsontwoarguments: 1 ⌊n/k⌋ the block length k and the sample Xn. If we fix the block 1 1 pk(w,X1n)= ⌊n/k⌋ 1 Xii(kk−1)+1 =w . (2) lengthk andletthesamplesizen tendtoinfinity,weobtaina Xi=1 n o consistent and asymptotically unbiased estimator of the block entropy H(k). Namely, for a stationary ergodic process, considered for proving Theorem1 in Section II, cf. [7]. 2) Random lower bounds, such as the plug-in estimator lim H(k,Xn)=H(k) a.s. (12) n→∞ 1 H(k,X1n): As we have seen, for a stationary process, we have (10), whereas for a stationary ergodic process, by the ergodic theorem and hence we have (15)–(17). lim EH(k,Xn)=H(k) (13) n→∞ 1 Both quantities K(X1n) and H(k,X1n) can be used for esti- by inequality (10) and the Fatou lemma. These results gener- mation of the entropy rate h. alize what is known for the IID case [5]. When applying H(k,Xn(k)) for the estimation of entropy 1 Nowthequestionariseswhatn(k)weshouldchoosesothat rate,wearesupposednottoknowtheexactvalueofh.There- H(k,Xn(k))/k be a consistent estimator of the entropy rate. fore, the choice of minimal admissible n(k) is not so trivial. Using a1technique based on source coding, which is different Accordingto Theorem1,we mayputn(k)=(|X|+ǫ)k. This than used in [4], we may establish some positive result in a bound is, however, pessimistic, especially for processes with more general case than considered in [4]: a vanishing entropy rate h = 0, cf. [10]. Having a random oveTrheaofirenmite1a:lpLheatb(eXt Xi).∞i=F−o∞r abnye aǫ>sta0tioannadry ergodic process nup(kpe)r=bo2uEnKd(Xof1k)t+hekǫ.block entropy K(X1k), we may also put AquestionariseswhetherwecanimproveTheorem1.Thus, n(k)≥2k(h+ǫ), (14) let us state three open problems: we have 1) Does the equality lim EH(k,Xn(k))/k=h, (15) lim H(k,Xn(k))/k=h a.s. (24) k→∞ 1 k→∞ 1 liminfH(k,Xn(k))/k=h a.s., (16) hold true in some cases? In other words, is the plug- 1 k→∞ in estimator H(k,Xn(k))/k an almost surely consistent ∀ lim P H(k,Xn(k))/k−h>η =0. (17) 1 η>0 1 estimator of the entropy rate? k→∞ (cid:16) (cid:17) 2) What happens for lim k−1logn(k) = h? In par- Accordingto Theorem1, forthe sample size (14) the plug- k→∞ ticular, can Theorem 1 be strengthened by setting n(k) inestimatorH(k,Xn(k))/k oftheentropyratehisconsistent 1 equal to some random stopping time, such as in probability. In contrast, applying inequality (11) for ǫ > 0 and n(k)=2K(X1k), (25) n(k)≤2k(h−ǫ) (18) where K(Xk) is a length of a universal code for Xk? 1 1 3) The plug-inestimator isnotoptimalin the IID case [1]. yields Can we propose a better estimator of the entropy rate limsupH(k,Xn(k))/k≤h−ǫ a.s. (19) also for an arbitrary stationary ergodic process? 1 k→∞ Another class of less clearly stated problems concerns Hencethesamplesize(18)isinsufficienttoobtainaconsistent comparing the entropy estimates K(Xk) and H(k,Xn(k)). estimate of the entropy rate h using the plug-in estimator. 1 1 Although the gap between these estimates is closing when Let us observe that in general there are two different kinds divided by k, i.e., of random entropy bounds for stationary processes: 1) Random upper bounds K(X1n) based on universal cod- kl→im∞E K(X1k)−H(k,X1n(k)) /k=0, (26) ing[6],[7]oruniversalprediction[8]:Forthesebounds, h i fwoer haavsetaKtioranfatriynepqruoacleitsys, wxen1 2h−avKe(xtn1h)e≤so1u.rTcehecreofdoirneg, tThoe sdeifefeirte,nlceetKus(Xn1okt)e−thHa(tki,nXeq1nu(akl)i)tiecsanEbeKa(rXbiktr)ar≥ilyHlar(gke). 1 inequality EK(Xn)≥HP(n) and the Barron inequality and H(k) ≥ EH(k,Xn) hold for any stationary process, 1 1 regardless whether it is ergodic or not. Hence, by the er- P (K(X1n)+logP(X1n)≤−m)≤2−m (20) godic decomposition [11], we have EK(Xk) ≥ H(Xk) 1 1 [9, Theorem 3.1]. Moreover, for a stationary ergodic and H(X1k|I) ≥ EH(k,X1n), where I is the shift-invariant process, we have algebraofa stationaryprocess(Xi)∞i=−∞, H(X1k)=H(k) is the entropy of Xk, and H(Xk|I) is the conditional entropy nl→im∞EK(X1n)/n=h, (21) of X1k given I. C1onsequently,1 nl→im∞K(X1n)/n=h a.s. (22) E K(X1k)−H(k,X1n) ≥H(X1k)−H(X1k|I) In particular, these conditions hold for (cid:2) (cid:3)=I(X1k;I), (27) K(X1n)=minK(k,X1n), (23) where I(X1k;I) is the mutual information between block X1k k and the shift-invariant algebra I. In fact, for an arbitrary whereK(k,Xn)isthelengthofthecodewhichwillbe stationaryprocess,the mutualinformationI(Xk;I)can grow 1 1 as fast as any sublinear function, cf. [12], [10]. length is ≤(klog|X|+2log⌊n/k⌋)D(k,Xn)), 1 Whereas there is no universal sublinear upper bound for 4) what the Shannon-Fano code words for block Xk⌊n/k⌋ 1 mutual informationI(Xk;I)=H(Xk)−H(Xk|I), we may are (description length ≤⌊n/k⌋(H(k,Xn)+1)), 1 1 1 1 ask whether there is an upper bound for entropy H(Xn) in 5) what the remaining block Xn is (description 1 k⌊n/k⌋+1 terms of a function of conditional entropy H(Xk|I) for an length ≤klog|X|). 1 arbitrary stationary process and n ≥ k. Using the code from Hence quantity the proof of Theorem 1, we can provide this bound: n Theorem 2: For a stationary process (X )∞ , natural 2logk+ (H(k,Xn)+1)+ i i=−∞ k 1 numbers p and k, n=pk, and a real number m≥1, n + klog|X|+2log (D(k,Xn)+1) (31) H(Xn) H(Xk|I) 2 2 k 1 1 − 1 ≤ + logk+3log|X|× (cid:16) (cid:17) n k k n is an upper bound for the length of the k-block code. 1 1 n k For our application, the k-block code has a deficiency that × + 1− σ mH(Xk|I)−log + , (28) m m 1 k n very rare blocks have too long codewords, which leads to (cid:18) (cid:18) (cid:19) (cid:16) (cid:17) (cid:19) an unwanted explosion of term 2logn in the upper bound where σ(y)=min(exp(y),1). k of the code length for n → ∞. Hence let us modify the k- Theorem 2 states that the block entropy of a stationary block code so that a k-block is Shannon-Fano coded if and process is bounded above by a nonlinear function of the only if its Shannon-Fano code word is shorter than klog|X|, averageblock entropyof its ergodiccomponents.We suppose whereas it is left uncoded otherwise. In the coded sequence, that this inequality can be strengthenedif there exists a better to distinguish between these two cases, we have to add some estimator of the block entropy than the plug-in estimator. A flag, say 0 before the Shannon-Fano code word and 1 before simple corollary of Theorem 2 is that the uncodedblock. In this way, to fully describe Xn in terms 1 lim H(Xk|I)/k =h, (29) of the modified k-block code we have to specify: 1 k→∞ 1) what k is (description length 2logk), a fact usually proved by the ergodic decomposition [11, 2) what the number of used distinct Shannon-Fano code Theorem 5.1]. To derive (29) from (28), we first put n→∞ words is (description length klog|X|), and next m→∞ and k →∞. 3) whatthecodebookis(we havetospecifytheShannon- Inthefollowing,inSectionIIweproveTheorem1,whereas Fano code word for each coded k-block, hence the in Section III we prove Theorem 2. description length is ≤3klog|X|D(k,Xn)), 1 II. PROOF OF THEOREM1 4) what the sequence of code words for block X1k⌊n/k⌋ is (description length ≤⌊n/k⌋(H(k,Xn)+2)), Our proof of Theorem 1 applies source coding. To be 1 5) what the remaining block Xn is (description precise, it rests on a modification of the simplistic universal k⌊n/k⌋+1 length ≤klog|X|). code by Neuhoff and Shields [7]. The Neuhoff-Shields code In view of this, quantity is basically a k-block code with parameter k depending on the string X1n. In the following, we will show that the plug- K(k,Xn)=2logk+ n(H(k,Xn)+2)+ in estimator H(k,Xn) multiplied by n/k is the dominating 1 k 1 1 term in the length of a modified k-block code for Xn by the +3klog|X|(D(k,Xn)+1) (32) 1 1 results of [7], [13]. This length cannot be shorter than nh so isanupperboundforthelengthofthemodifiedk-blockcode. the expectation of H(k,Xn)/k must tend to h. 1 Since the k-block code is an instantaneous code, the The idea of a k-block code is that we first describe a code upper bound for its length satisfies Kraft inequality bkocookn,tai.ien.e,dwienethneumcoemrapteretshseedcostlrleincgtioXnnoafnbdlothckeisrwfreoqfuleenncgieths k,xn1 2−K(k,xn1) ≤ 1. Therefore, we have EK(k,X1n) ≥ 1 H(n), whereas by the Barron inequality npk(w,X1n), and then we apply the Shannon-Fano coding to P Xn partitionedintoblocksfromthecodebook.LetD(k,Xn) P(K(k,Xn)+logP(Xn)≤−m)≤2−m (33) 1 1 1 1 be the number of distinct blocks of length k contained in the compressed string Xn. Formally, [9,Theorem3.1],theBorel-Cantellilemma,andtheShannon- 1 McMillan-Breiman theorem D(k,X1n)= w ∈Xk :∃i∈1,...,⌊n/k⌋X(iik−1)k+1 =w . lim [−logP(X1n)] =h a.s. (34) (cid:12)n o(cid:12)(30) n→∞ n (cid:12) (cid:12) To fully describ(cid:12)e Xn in terms of a k-block code we hav(cid:12)e to [14], we obtain 1 specify, cf. [7]: K(k,Xn(k)) −logP(X1n(k)) 1) what k is (description length 2logk), liminf 1 ≥ lim =h a.s. 2) what D(k,Xn) is (description length log⌊n/k⌋), k→∞ n(k) k→∞h n(k) i 1 (35) 3) whatthecodebookis(wehaveto specifytheShannon- Fano code wordfor each k-block,hencethe description Accordingto[13,Theorem2],foreachδ >0almostsurely there exists k such that for all k ≥k and n>2kh we have by the definition of the entropy rate. Second, 0 0 D(k,X1n)≤2k(h+δ)+ nkδ. (36) lim E H(k,X1n(k)) − K(k,X1n(k)) − =0 (43) Hence for n(k)≥2k(h+ǫ) and δ <ǫ, we have almost surely k→∞ " k n(k) # since K(k,Xn(k)) h≤liminf 1 k k→∞ n(k) kl→im∞n(k)ED(k,X1n(k))=0 (44) 3k H(k,Xn(k)) =liminf log|X|D(k,Xn(k))+ 1 by the result of [7, Eq. (8)]. Third, k→∞ n(k) 1 k ! 1 ≤liminf 3log|X| k2−k(ǫ−δ)+δ + H(k,X1n(k)) nl→im∞nE[K(k,X1n)+logP(X1n)]− =0 (45) k→∞ k ! since (cid:16) (cid:17) H(k,Xn(k)) ∞ =3log|X|δ+liminf 1 . (37) E[K(k,Xn)+logP(Xn)]− ≤ m2−m <∞ (46) k→∞ k 1 1 m=0 X Since δ can be chosen arbitrarily small then by the Barron inequality (33). Fourth, likm→i∞nf H(k,Xk1n(k)) ≥h a.s. (38) nl→im∞E [−lognP(X1n)] −h − =0 (47) (cid:20) (cid:21) In contrast, inequality (10) implies since the Shannon-McMillan-Breiman theorem (34) implies convergence in probability, i.e., for all ǫ>0, we have EH(k,Xn(k)) limsup 1 ≤h. (39) [−logP(Xn)] k lim P 1 −h<−ǫ =0. (48) k→∞ n→∞ n Hence, by the Fatou lemma and inequality (38), we have (cid:18) (cid:19) Hence h=Eliminf H(k,X1n(k)) = lim EH(k,X1n(k)), (40) E [−logP(X1n)] −h − k→∞ k k→∞ k n (cid:20) (cid:21) i.e.,equality(15)isestablished.By inequality(38)andequal- [−logP(Xn)] ity (40), we also obtain equality (16). ≤hP 1 −h<−ǫ n (cid:18) (cid:19) The proofof statement(17) requiresa few additionalsteps. +ǫP [−logP(X1n)] −h≥−ǫ (49) Denoting X+ = X1{X >0} and X− = −X1{X <0}, we n (cid:18) (cid:19) obtainfromMarkovinequality,inequality(10),andinequality tends to a value smaller than ǫ, where ǫ is arbitrarily small. (X +Y)− ≤X−+Y− that n(k) n(k) + H(k,X ) H(k,X ) ηP 1 −h>η ≤E 1 −h k k ! " # − H(k,Xn(k)) H(k,Xn(k)) =E 1 −h +E 1 −h " k # " k # III. PROOF OF THEOREM2 − H(k) H(k,Xn(k)) K(k,Xn(k)) ≤ −h +E 1 − 1 + k k n(k) (cid:20) (cid:21) " # − +E K(k,X1n(k)) + logP(X1n(k)) + For the code from the proof of Theorem 1, we have n(k) n(k) " # H(Xn) H(Xk|I) K(k,Xn) H(k,Xn) − 1 − 1 ≤E 1 − 1 −logP(Xn(k)) n k n k +E 1 −h . (41) 2 2 3k (cid:20) (cid:21) h n(k) i = + logk+ log|X|(ED(k,Xn)+1). (50) k n n 1 Then, following the idea of [3, Theorem 1], we may express Now we will show that all four terms on the RHS of (41) the number of distinct k-blocks as tend to 0, which is sufficient to establish (17). First, n/k H(k) D(k,Xn)= 1 1 Xik =w ≥1 . (51) lim −h =0 (42) 1 (i−1)k+1 k→∞(cid:20) k (cid:21) wX∈Xk Xi=1 n o Hence by the Markov inequality, E(D(k,Xn)|I) 1 n/k = P 1 Xik =w ≥1 I (i−1)k+1 (cid:12) wX∈Xk Xi=1 n o (cid:12)(cid:12) n/k (cid:12)(cid:12) ≤ min 1,E 1 Xi+k =(cid:12)w I (i−1)k+1 (cid:12) wX∈Xk Xi=1 n o(cid:12)(cid:12) n (cid:12) = min 1, P(Xk =w|I) (cid:12) k 1 (cid:12) wX∈Xk h i n n −1 = E min P(Xk|I) ,1 I . (52) k k 1 (cid:18) (cid:18)h i (cid:19)(cid:12)(cid:12) (cid:19) In consequence, (cid:12) (cid:12) k n ED(k,Xn)≤Eσ −logP(Xk|I)−log , (53) n 1 1 k (cid:16) (cid:17) where E −logP(Xk|I) = H(Xk|I). Therefore, using 1 1 another Markov inequality (cid:2) (cid:3) 1 P −logP(Xk|I)≥mH(Xk|I) ≤ (54) 1 1 m for m≥1, w(cid:0)e further obtain from (53) that(cid:1) k 1 1 n ED(k,Xn)≤ + 1− σ mH(Xk|I)−log . n 1 m m 1 k (cid:18) (cid:19) (cid:16) (5(cid:17)5) Inserting (55) into (50) yields the requested bound. REFERENCES [1] J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distributions,” IEEE Trans. Inform. Theory, vol.61,pp.2835–2885, 2015. [2] Z.Zhang,“EntropyestimationinTuring’sperspective,”NeuralComput., vol.24,pp.1368–1389, 2012. [3] Ł. De˛bowski, “Estimation of entropy from subword complexity,” in Challenges in Computational Statistics and Data Mining, S. Matwin andJ.Mielniczuk, Eds. Springer, 2016,pp.53–70. [4] K.MartonandP.C.Shields,“Entropyandtheconsistentestimationof jointdistributions,” Ann.Probab.,vol.22,pp.960–977,1994. [5] Z. Zhang and X. Zhang, “A normal law for the plug-in estimator of entropy,” IEEETrans.Inform.Theory,vol.58,pp.2745–2747, 2012. [6] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,”IEEETrans.Inform.Theory,vol.23,pp.337–343,1977. [7] D.NeuhoffandP.C.Shields,“Simplisticuniversalcoding,”IEEETrans. Inform.Theory,vol.IT-44,pp.778–781,1998. [8] B. Ryabko, “Applications of universal source coding to statistical analysis oftime series,” in Selected Topics in Information and Coding Theory, ser.Series onCoding andCryptology, I.Woungang, S.Misra, andS.C.Misra,Eds. WorldScientific Publishing, 2010. [9] A.R.Barron,“Logicallysmoothdensityestimation,”Ph.D.dissertation, StanfordUniversity, 1985. [10] Ł.De˛bowski,“RegularHilbergprocesses:Anexampleofprocesseswith avanishing entropy rate,”2015,http://arxiv.org/abs/1508.06158. [11] R. M. Gray and L. D. Davisson, “The ergodic decomposition of stationary discrete random processses,” IEEE Trans. Inform. Theory, vol.20,pp.625–636, 1974. [12] Ł.De˛bowski, “Mixing, ergodic, andnonergodic processes withrapidly growing information between blocks,” IEEE Trans. Inform. Theory, vol.58,pp.3392–3401, 2012. [13] D. S. Ornstein and B. Weiss, “How sampling reveals a process,” Ann. Probab.,vol.18,pp.905–930, 1990. [14] P. H. Algoet and T. M. Cover, “A sandwich proof of the Shannon- McMillan-Breimantheorem,”Ann.Probab.,vol.16,pp.899–909,1988.