First- and Second-Order Coding Theorems for Mixed Memoryless Channels with General Mixture

Hideki Yagi, Te Sun Han, and Ryo Nomura

arXiv:1501.05887v3 [cs.IT] 6 May 2016

Abstract

This paper investigates the first- and second-order maximum achievable rates of codes with/without cost constraints for mixed channels whose channel law is characterized by a general mixture of (at most) uncountably many stationary and memoryless discrete channels. These channels are referred to as mixed memoryless channels with general mixture and include, as a special case, the class of mixed memoryless channels of finitely or countably many memoryless channels. For mixed memoryless channels with general mixture, the first-order coding theorem, which gives a formula for the $\varepsilon$-capacity, is established, and then a direct part of the second-order coding theorem is provided. A subclass of mixed memoryless channels whose component channels can be ordered according to their capacity is introduced, and the first- and second-order coding theorems are established. It is shown that the established formulas reduce to several known formulas for restricted scenarios.

I. INTRODUCTION

Investigation of the maximum achievable rate of codes whose probability of decoding error does not exceed $\varepsilon \in [0,1)$ for various coding systems has been one of the major research topics in information theory. The first-order optimum rate for channel codes with such a property is referred to as the $\varepsilon$-capacity. Inspired by the recent results on second-order coding theorems given, for example, by Hayashi [6] and Polyanskiy, Poor, and Verdú [11] for stationary memoryless channels, this research topic has become of greater importance from both theoretical and practical viewpoints.

It is well known that stationary memoryless channels with finite input and/or output alphabets have the so-called strong converse property, and the $\varepsilon$-capacity coincides with the channel capacity ($\varepsilon$-capacity with $\varepsilon = 0$) [19].
On the other hand, allowing a decoding error probability up to $\varepsilon$, the maximum achievable rate may be improved for non-stationary and/or non-ergodic channels. The simplest example is a class of mixed channels [5], also referred to as averaged channels [1], [8] or decomposable channels [18], whose probability distribution is characterized by a mixture of multiple stationary memoryless channels. This channel is stationary but non-ergodic and is of theoretical importance when extensions of coding theorems for ergodic channels are addressed.

For general channels including mixed channels, a general formula for the $\varepsilon$-capacity has been given by Verdú and Han [14]. This formula, however, involves limit operations with respect to code length $n$, and thus is infeasible to compute in general. On the other hand, for mixed channels of uncountably many stationary and memoryless discrete channels, which will be called general mixed memoryless channels, a single-letter characterization of the channel capacity has been given by Ahlswede [1] for the case without cost constraints and by Han [5] for the case with cost constraints. These characterizations are of importance because the channel capacity may be computed with complexity independent of $n$. Recently, Yagi and Nomura [20] have provided a single-letter characterization of the $\varepsilon$-capacity with/without cost constraints for mixed channels of at most countably many stationary memoryless channels. Regarding the $\varepsilon$-capacity for mixed memoryless channels with general mixture, however, no characterizations have been given in the literature. The regular decomposable channel, which consists of memoryless channels [18], is one of a few examples for which a single-letter characterization of the $\varepsilon$-capacity is known. In addition, the second-order optimum rate has been characterized only for a few classes of mixed memoryless channels such as the mixed channel of two memoryless additive channels [12], the mixed channel of finitely many stationary and memoryless discrete channels which can be ordered according to their capacities [21], and block fading channels characterized as the mixed channel consisting of additive Gaussian noise channels [22].

H. Yagi is with the Dept. of Communication Engineering and Informatics, The University of Electro-Communications, Tokyo, Japan (email: h.yagi@uec.ac.jp).
T. S. Han is with the National Institute of Information and Communications Technology (NICT), Tokyo, Japan (email: han@is.uec.ac.jp).
R. Nomura is with School of Network and Information, Senshu University, Kanagawa, Japan (email: nomu@isc.senshu-u.ac.jp).

This paper first gives a single-letter characterization of the $\varepsilon$-capacity with/without cost constraints for mixed memoryless channels with general mixture (Theorem 1). The established formula reduces to the one for the channel capacity given by [1] and [5] when $\varepsilon$ is zero. The achievability and converse proofs of Theorem 1 proceed in a parallel manner: (i) the upper or lower bound on the error probability is characterized by the type (empirical distribution) of codewords, and (ii) the convergence of a subsequence of types to a certain probability distribution is discussed. Next, a direct coding theorem (achievability) is given for the second-order optimum rate (Theorem 2). In the proof of Theorem 2, an upper bound on the error probability is derived based on the random coding argument of a fixed type, and the key is to specify the type of codewords so that the mutual information computed by this type converges to the target first-order coding rate fast enough (cf. Equation (98)). For a fixed code, on the other hand, we cannot guarantee that such mutual information converges to the target first-order coding rate fast enough, and this fact has prevented us from establishing the converse part of the second-order coding theorem.
In order to circumvent this problem, we will introduce a subclass of mixed memoryless channels with general mixture, called well-ordered mixed memoryless channels, whose component channels can be ordered as discussed in [21]. For this channel class, the first- and second-order coding theorems are established. It is shown that the established formulas reduce to several known formulas for restricted scenarios.

All coding theorems are proved based on the information spectrum methods (cf. [5], [17]). In particular, we use a proof technique for the converse part such that the proof proceeds based on an arbitrarily chosen converging subsequence of types of codewords, which may simplify even the proof of the second-order coding theorem for stationary memoryless channels such as in [6].

This paper is organized as follows: The problem addressed in this paper is stated in Sect. II. We next establish the first-order coding theorem in Sect. III-A and a direct part of the second-order coding theorem in Sect. III-B for mixed memoryless channels with general mixture. These theorems are proved in Sect. IV; several lemmas used to prove the theorems are first provided in Sect. IV-A, and then proofs of the coding theorems are given in Sect. IV-B and IV-C, respectively. Section V discusses well-ordered mixed memoryless channels, introduced in Sect. V-A, and the first- and second-order coding theorems are stated in Sect. V-B along with the proofs in Sect. V-C and V-D. Some concluding remarks are given in Sect. VI.

II. PROBLEM FORMULATION

A. Mixed Memoryless Channel under General Mixture

Consider a channel $W^n : \mathcal{X}^n \to \mathcal{Y}^n$, without any assumption on the memory structure, which stochastically maps an input sequence $\mathbf{x} \in \mathcal{X}^n$ of length $n$ into an output sequence $\mathbf{y} \in \mathcal{Y}^n$ of length $n$. Here, $\mathcal{X}$ and $\mathcal{Y}$ denote finite input and output alphabets, respectively. A sequence $\mathbf{W} := \{W^n\}_{n=1}^{\infty}$ of channels $W^n$ is referred to as a general channel [5].
We consider a mixed channel$^{1}$ with a general probability measure [5, Sect. 3.3]. Let $\Theta$ be an arbitrary probability space and assign a general channel $\mathbf{W}_\theta = \{W_\theta^n\}_{n=1}^{\infty}$ to each $\theta \in \Theta$; these channels are called component channels or simply components. Here, we assume that each $\mathbf{W}_\theta$ has the same input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$. With an arbitrary probability measure $w$ on $\Theta$, we define a mixed channel $\mathbf{W} = \{W^n\}_{n=1}^{\infty}$ with the conditional probability distribution given by
\[
W^n(\mathbf{y}|\mathbf{x}) = \int_{\Theta} W_\theta^n(\mathbf{y}|\mathbf{x})\,dw(\theta) \qquad (\forall n = 1,2,\cdots;\ \forall \mathbf{x} \in \mathcal{X}^n,\ \forall \mathbf{y} \in \mathcal{Y}^n). \tag{1}
\]

$^{1}$Mixed channels are also referred to as averaged channels [8] or decomposable channels [18].

In this paper, we focus on the case where the component channels are stationary memoryless discrete channels. Then, a component channel can be denoted simply by $\mathbf{W}_\theta = \{W_\theta : \mathcal{X} \to \mathcal{Y}\}$. A mixed channel given by (1) with stationary memoryless discrete channels $\mathbf{W}_\theta = \{W_\theta\}$ is referred to as a general mixed memoryless channel for simplicity.

Let $\mathcal{C}_n$ be a code of length $n$ with the number of codewords $|\mathcal{C}_n| = M_n$. We denote the codeword corresponding to message $i \in \{1,2,\ldots,M_n\}$ by $\mathbf{u}_i$, i.e., $\mathcal{C}_n = \{\mathbf{u}_1,\mathbf{u}_2,\ldots,\mathbf{u}_{M_n}\}$. We assume that the decoding region $D_i$ of $\mathbf{u}_i$ satisfies
\[
\bigcup_{i=1}^{M_n} D_i = \mathcal{Y}^n \quad \text{and} \quad D_i \cap D_j = \emptyset \ \ (i \neq j). \tag{2}
\]
The average probability of decoding error over $\mathbf{W}$ is defined as
\[
\varepsilon_n := \frac{1}{M_n} \sum_{i=1}^{M_n} W^n(D_i^c|\mathbf{u}_i), \tag{3}
\]
where $D_i^c$ denotes the complement set of $D_i$ in $\mathcal{Y}^n$. Such a code $\mathcal{C}_n$ is referred to as an $(n, M_n, \varepsilon_n)$ code.

We consider a cost function $c_n(\cdot)$ for $\mathbf{x} = (x_1,x_2,\ldots,x_n) \in \mathcal{X}^n$, defined as
\[
c_n(\mathbf{x}) := \frac{1}{n} \sum_{i=1}^{n} c(x_i), \tag{4}
\]
where $c : \mathcal{X} \to [0,\infty)$. A sequence $\mathbf{x}$ is said to satisfy cost constraint $\Gamma$ if
\[
c_n(\mathbf{x}) \le \Gamma, \tag{5}
\]
and an $(n, M_n, \varepsilon_n)$ code $\mathcal{C}_n$ is said to satisfy cost constraint $\Gamma$ if every codeword $\mathbf{u}_i \in \mathcal{C}_n$ satisfies cost constraint $\Gamma$.

Remark 1: If $\Gamma \ge \max_{x \in \mathcal{X}} c(x)$, then (5) holds for any $\mathbf{x} \in \mathcal{X}^n$. This case corresponds to the coding system without cost constraints, which is indicated simply by $\Gamma = +\infty$. ✷

B. Optimum Coding Rates

Definition 1: A first-order coding rate $R \ge 0$ is said to be $(\varepsilon|\Gamma)$-achievable if there exists a sequence of $(n, M_n, \varepsilon_n)$ codes satisfying cost constraint $\Gamma$ such that
\[
\limsup_{n\to\infty} \varepsilon_n \le \varepsilon \quad \text{and} \quad \liminf_{n\to\infty} \frac{1}{n} \log M_n \ge R. \tag{6}
\]
The supremum of all $(\varepsilon|\Gamma)$-achievable rates is called the first-order $(\varepsilon|\Gamma)$-capacity and is denoted by $C_\varepsilon(\Gamma)$. We also write $C_\varepsilon = C_\varepsilon(+\infty)$ for simplicity. ✷

Set $\Gamma_0 := \min_{x \in \mathcal{X}} c(x)$. If $\Gamma < \Gamma_0$, then obviously $C_\varepsilon(\Gamma) = 0$ because no sequences $\mathbf{x} \in \mathcal{X}^n$ satisfy cost constraint $\Gamma$, and hence no $R > 0$ is $(\varepsilon|\Gamma)$-achievable.

Let $M^*_{n,\varepsilon}$ denote the maximum size of codes of length $n$ and error probability less than or equal to $\varepsilon$ satisfying cost constraint $\Gamma$. The first-order $(\varepsilon|\Gamma)$-capacity indicates that $M^*_{n,\varepsilon}$ behaves as
\[
\log M^*_{n,\varepsilon} = n C_\varepsilon(\Gamma) + o(n)
\]
for sufficiently large $n$. For coding systems whose first-order capacity has been characterized, our next target may be to characterize the second-order term of $\log M^*_{n,\varepsilon}$. This motivates us to introduce the second-order coding rates, whose maximum value, denoted by $D_\varepsilon(R|\Gamma)$, with respect to the first-order coding rate $R = C_\varepsilon(\Gamma)$ roughly satisfies the relation
\[
\log M^*_{n,\varepsilon} \simeq n C_\varepsilon(\Gamma) + \sqrt{n}\, D_\varepsilon(R|\Gamma) + o\left(\sqrt{n}\right)
\]
for sufficiently large $n$. Second-order achievable rates and their optimum value are now formally defined as follows.

Definition 2: A second-order coding rate $S$ is said to be $(\varepsilon,R|\Gamma)$-achievable if there exists a sequence of $(n, M_n, \varepsilon_n)$ codes satisfying cost constraint $\Gamma$ such that
\[
\limsup_{n\to\infty} \varepsilon_n \le \varepsilon \quad \text{and} \quad \liminf_{n\to\infty} \frac{1}{\sqrt{n}} \log \frac{M_n}{e^{nR}} \ge S. \tag{7}
\]
The supremum of all $(\varepsilon,R|\Gamma)$-achievable rates is called the second-order $(\varepsilon,R|\Gamma)$-capacity and is denoted by $D_\varepsilon(R|\Gamma)$. We also write $D_\varepsilon(R) = D_\varepsilon(R|+\infty)$ for simplicity. ✷

Remark 2: It is easily verified from the definition of capacities that if $R < C_\varepsilon(\Gamma)$ then $D_\varepsilon(R|\Gamma) = +\infty$ for all $\varepsilon \in [0,1)$. Also, if $R > C_\varepsilon(\Gamma)$ then $D_\varepsilon(R|\Gamma) = -\infty$ for all $\varepsilon \in [0,1)$. Therefore, only the case $R = C_\varepsilon(\Gamma)$ is of our main interest. ✷
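To make definitions (1), (4), and (5) concrete, the following minimal Python sketch evaluates the mixed-channel probability for a finite mixture, in which case the integral in (1) becomes a weighted sum, and checks a cost constraint. The function names, the two binary symmetric components, the weights, and the cost values are illustrative assumptions and do not come from the paper.

```python
import math

def memoryless_prob(W, x, y):
    """W_theta^n(y|x) = prod_i W_theta(y_i|x_i) for one stationary memoryless component."""
    return math.prod(W[xi][yi] for xi, yi in zip(x, y))

def mixed_prob(components, weights, x, y):
    """Finite-mixture version of (1): sum over theta of w(theta) * W_theta^n(y|x)."""
    return sum(w * memoryless_prob(W, x, y) for W, w in zip(components, weights))

def average_cost(c, x):
    """c_n(x) = (1/n) * sum_i c(x_i), as in (4)."""
    return sum(c[xi] for xi in x) / len(x)

def bsc(p):
    """Binary symmetric channel matrix W[x][y] with crossover probability p (assumed example)."""
    return [[1 - p, p], [p, 1 - p]]

# Hypothetical two-component mixture with equal weights.
components = [bsc(0.1), bsc(0.2)]
weights = [0.5, 0.5]
x = (0, 1, 1)
y = (0, 1, 0)

prob = mixed_prob(components, weights, x, y)   # W^n(y|x) under the mixture
cost = average_cost([0.0, 1.0], x)             # c(0) = 0, c(1) = 1 (assumed cost function)
satisfies = cost <= 0.7                        # cost constraint (5) with Gamma = 0.7
```

Note that `mixed_prob` generally does not factor into a product over the positions $i$, which reflects the fact that the mixed channel has memory even though every component is memoryless.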
III. CODING THEOREMS FOR GENERAL MIXED MEMORYLESS CHANNEL

A. First-Order Coding Theorems

The following theorem gives a single-letter characterization for the first-order $(\varepsilon|\Gamma)$-capacity of mixed memoryless channels with general mixture.

Theorem 1: Let $\mathbf{W}$ be a general mixed memoryless channel with measure $w$. For any fixed $\varepsilon \in [0,1)$ and $\Gamma \ge \Gamma_0$, the first-order $(\varepsilon|\Gamma)$-capacity is given by
\[
C_\varepsilon(\Gamma) = \sup_{P:\,\mathrm{E}c(X_P) \le \Gamma}\ \sup\left\{ R \,\middle|\, \int_{\{\theta\,|\,I(P,W_\theta) < R\}} dw(\theta) \le \varepsilon \right\}, \tag{8}
\]
where $X_P$ indicates the input random variable subject to distribution $P$ on $\mathcal{X}$, and $I(P,W_\theta)$ denotes the mutual information with input $P$ and channel $W_\theta : \mathcal{X} \to \mathcal{Y}$ (cf. Csiszár and Körner [3]). ✷

The proof of this theorem is given in Sect. IV.

Remark 3: If $\Theta$ is a singleton, Theorem 1 reduces to the well-known formula
\[
C_\varepsilon(\Gamma) = \sup_{P:\,\mathrm{E}c(X_P) \le \Gamma} I(P,W) \qquad (\forall\, 0 \le \varepsilon < 1), \tag{9}
\]
which means that the strong converse holds in this case (cf. [3], [19]), unlike in the general case $|\Theta| > 1$. For $\Theta$ which is a finite or countably infinite set, formula (8) of the first-order capacity $C_\varepsilon(\Gamma)$ reduces to the formula given by Yagi and Nomura [20]. For mixed memoryless channels with general mixture, on the other hand, in the special case of $\varepsilon = 0$, formula (8) reduces to
\[
C_0(\Gamma) = \sup_{P:\,\mathrm{E}c(X_P) \le \Gamma} w\text{-ess.inf}\ I(P,W_\theta), \tag{10}
\]
which coincides with the formula given by Han [5, Theorem 3.6.5], where $w$-ess.inf denotes the essential infimum of $I(P,W_\theta)$ with respect to the probability measure $w$. ✷

When $\Theta$ is a singleton, it is known that $C_\varepsilon(\Gamma)$ is concave in $\Gamma$ and is strictly increasing over the range $\Gamma_0 \le \Gamma \le \Gamma^*$, where $\Gamma^*$ denotes the smallest $\Gamma$ at which $C_\varepsilon(\Gamma)$ coincides with $C_\varepsilon$ (without cost constraints) (cf. Blahut [2]). For the case of $|\Theta| > 1$, $C_\varepsilon(\Gamma)$ is indeed non-decreasing, but there are examples of mixed memoryless channels for which $C_\varepsilon(\Gamma)$ is not strictly increasing in $\Gamma_0 \le \Gamma \le \Gamma^*$. This also indicates that $C_\varepsilon(\Gamma)$ need not be concave in $\Gamma$.
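As a numerical illustration of formula (8), the following Python sketch approximates $C_\varepsilon(\Gamma)$ for a finite mixture with a binary input alphabet, where the integral in (8) becomes a sum of weights and the outer supremum is replaced by a grid search over $P$. This is only a sketch under these assumptions: the function names, the grid resolution, and the two binary symmetric components are hypothetical choices, and rates are measured in nats.

```python
import math

def mutual_information(P, W):
    """I(P, W) in nats for input distribution P and channel matrix W[x][y]."""
    ny = len(W[0])
    PW = [sum(P[x] * W[x][y] for x in range(len(P))) for y in range(ny)]
    total = 0.0
    for x, px in enumerate(P):
        for y in range(ny):
            if px > 0 and W[x][y] > 0:
                total += px * W[x][y] * math.log(W[x][y] / PW[y])
    return total

def inner_sup(P, components, weights, eps):
    """sup{R : w({theta : I(P, W_theta) < R}) <= eps} for a finite mixture.

    Sorting the components by mutual information, the supremum is the first
    value at which the accumulated weight exceeds eps.
    """
    pairs = sorted((mutual_information(P, W), w) for W, w in zip(components, weights))
    acc = 0.0
    for info, w in pairs:
        acc += w
        if acc > eps:
            return info
    return pairs[-1][0]  # only reached through rounding when the weights sum to 1

def eps_capacity(components, weights, eps, cost, gamma, grid=200):
    """Grid-search approximation of (8) for a binary input alphabet."""
    best = 0.0
    for k in range(grid + 1):
        P = [k / grid, 1 - k / grid]
        if P[0] * cost[0] + P[1] * cost[1] <= gamma:   # E c(X_P) <= Gamma
            best = max(best, inner_sup(P, components, weights, eps))
    return best

def bsc(p):
    """Binary symmetric channel with crossover probability p (assumed example)."""
    return [[1 - p, p], [p, 1 - p]]

components = [bsc(0.1), bsc(0.2)]
weights = [0.5, 0.5]
no_cost = [0.0, 0.0]                      # zero cost: no effective cost constraint
c_low = eps_capacity(components, weights, 0.25, no_cost, 0.0)
c_high = eps_capacity(components, weights, 0.5, no_cost, 0.0)
```

With these illustrative parameters, `c_low` agrees with the capacity of the noisier component, since for $\varepsilon$ below the weight $0.5$ both components must support the rate, whereas `c_high` agrees with the capacity of the less noisy component, since the noisier component may then be given up entirely.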
In the case without cost constraints, Theorem 1 reduces to the following corollary.

Corollary 1: Let $\mathbf{W}$ be a general mixed memoryless channel with measure $w$. For any fixed $\varepsilon \in [0,1)$, the first-order $\varepsilon$-capacity is given by
\[
C_\varepsilon = \sup_{P}\ \sup\left\{ R \,\middle|\, \int_{\{\theta\,|\,I(P,W_\theta) < R\}} dw(\theta) \le \varepsilon \right\}, \tag{11}
\]
where $\sup_P$ denotes the supremum over the set $\mathcal{P}(\mathcal{X})$ of all probability distributions on $\mathcal{X}$. ✷

Remark 4: The direct part of formula (11) was first demonstrated by Han [5, Lemma 3.3.3]. In the special case of $\varepsilon = 0$, we have an alternative formula of $C_0$ as in (10) (by replacing the supremum over $\{P\,|\,\mathrm{E}c(X_P) \le \Gamma\}$ with the supremum over $\mathcal{P}(\mathcal{X})$), which coincides with the formula given by Ahlswede [1]. See also [5, Remark 3.3.3] for the equivalence between these characterizations. ✷

B. Second-Order Coding Theorems

We now turn to analyzing second-order coding rates. Let $\Psi_{\theta,P}$ denote the Gaussian cumulative distribution function with zero mean and variance
\[
V_{\theta,P} := \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x) W_\theta(y|x) \left( \log \frac{W_\theta(y|x)}{PW_\theta(y)} - D(W_\theta(\cdot|x)\,\|\,PW_\theta) \right)^2, \tag{12}
\]
that is,
\[
\Psi_{\theta,P}(z) := G\!\left( \frac{z}{\sqrt{V_{\theta,P}}} \right), \qquad G(z) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{t^2}{2}}\,dt, \tag{13}
\]
where
\[
PW_\theta(y) := \sum_{x} P(x) W_\theta(y|x) \tag{14}
\]
denotes the output distribution on $\mathcal{Y}$ due to the input distribution $P$ on $\mathcal{X}$ via channel $W_\theta$, and $D(W_\theta(\cdot|x)\,\|\,PW_\theta)$ denotes the divergence between $W_\theta(\cdot|x)$ and $PW_\theta$. It is known that there are stationary memoryless channels $W_\theta$ for which $V_{\theta,P} = 0$ for some $P \in \mathcal{P}(\mathcal{X})$ (cf. [11], [14]). In such a case, with an abuse of notation, we interpret $\Psi_{\theta,P}(z) = G(z/\sqrt{V_{\theta,P}})$ as the step function which takes the value zero for $z < 0$ and one otherwise.

For the second-order coding rate, we have the following direct theorem (achievability).
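As a brief numerical aside on (12) and (13), the following Python sketch computes $V_{\theta,P}$ for a single component given as a channel matrix and evaluates $\Psi_{\theta,P}$, including the step-function convention for $V_{\theta,P} = 0$. The function names and the example channels are assumptions made for illustration; logarithms are natural. For a binary symmetric channel with crossover probability $p$ and uniform input, the result can be compared against the closed form $p(1-p)\log^2\frac{1-p}{p}$.

```python
import math

def dispersion(P, W):
    """V_{theta,P} of (12), in nats^2, for input distribution P and channel matrix W[x][y]."""
    nx, ny = len(W), len(W[0])
    PW = [sum(P[x] * W[x][y] for x in range(nx)) for y in range(ny)]  # output distribution (14)
    V = 0.0
    for x in range(nx):
        if P[x] == 0:
            continue
        # D(W(.|x) || PW): divergence between the row for input x and the output distribution
        D = sum(W[x][y] * math.log(W[x][y] / PW[y]) for y in range(ny) if W[x][y] > 0)
        for y in range(ny):
            if W[x][y] > 0:
                V += P[x] * W[x][y] * (math.log(W[x][y] / PW[y]) - D) ** 2
    return V

def G(z):
    """Standard Gaussian cumulative distribution function, as in (13)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Psi(z, V):
    """Psi_{theta,P}(z) = G(z / sqrt(V)); step function at zero when V = 0."""
    if V == 0:
        return 0.0 if z < 0 else 1.0
    return G(z / math.sqrt(V))

# Assumed example: binary symmetric channel with crossover 0.1 and uniform input.
V_bsc = dispersion([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])
# Assumed example: noiseless binary channel, for which the variance vanishes.
V_noiseless = dispersion([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```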
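Theorem 2 (Direct Part): Let $\mathbf{W}$ be a general mixed memoryless channel with measure $w$. For $\varepsilon \in [0,1)$, $\Gamma \ge \Gamma_0$, and $R \ge 0$, it holds that
\[
D_\varepsilon(R|\Gamma) \ge \sup_{P:\,\mathrm{E}c(X_P) \le \Gamma}\ \sup\left\{ S \,\middle|\, G_w(R,S|P) \le \varepsilon \right\} =: \underline{D}_\varepsilon(R|\Gamma), \tag{15}
\]
where
\[
G_w(R,S|P) := \int_{\{\theta\,|\,I(P,W_\theta) < R\}} dw(\theta) + \int_{\{\theta\,|\,I(P,W_\theta) = R\}} \Psi_{\theta,P}(S)\,dw(\theta). \tag{16}
\]
✷

The proof of this theorem is given in Sect. IV.

Remark 5: The two terms on the right-hand side of (16) can be summarized into the following single term:
\[
\int_{\Theta} dw(\theta) \lim_{n\to\infty} \Psi_{\theta,P}\left( \sqrt{n}\,(R - I(P,W_\theta)) + S \right), \tag{17}
\]
which is called the canonical representation (cf. Nomura and Han [9], [10]). Let us here focus on the crucial case of $R = C_\varepsilon(\Gamma)$. In view of formula (8) for the $\varepsilon$-capacity $C_\varepsilon(\Gamma)$, it is not difficult to check that, for any $P$ such that $\mathrm{E}c(X_P) \le \Gamma$,
\[
\int_{\{\theta\,|\,I(P,W_\theta) < C_\varepsilon(\Gamma)\}} dw(\theta) \le \varepsilon, \tag{18}
\]
\[
\int_{\{\theta\,|\,I(P,W_\theta) \le C_\varepsilon(\Gamma)\}} dw(\theta) \ge \varepsilon \tag{19}
\]
hold. Thus, we may consider the following canonical equation for $S$:
\[
\int_{\Theta} dw(\theta) \lim_{n\to\infty} \Psi_{\theta,P}\left( \sqrt{n}\,(C_\varepsilon(\Gamma) - I(P,W_\theta)) + S \right) = \varepsilon. \tag{20}
\]
Notice here, in view of (18) and (19), that equation (20) always has a solution. Let $S_P(\varepsilon)$ denote the solution of this equation, where $S_P(\varepsilon) = +\infty$ if the solution is not unique (notice that this case occurs if $\int_{\{\theta\,|\,I(P,W_\theta) = C_\varepsilon(\Gamma)\}} dw(\theta) = 0$, which equivalently means that the second term on the right-hand side in (16) is zero). Then, $\underline{D}_\varepsilon(C_\varepsilon(\Gamma)|\Gamma)$ (i.e., $R = C_\varepsilon(\Gamma)$) in (15) can be rewritten in a simpler form as
\[
\underline{D}_\varepsilon\left( C_\varepsilon(\Gamma)\,|\,\Gamma \right) = \sup_{P:\,\mathrm{E}c(X_P) \le \Gamma} S_P(\varepsilon). \tag{21}
\]
We sometimes prefer this simple expression to the one in (15). ✷

Remark 6: Denote the right-hand side of (15) again by $\underline{D}_\varepsilon(R|\Gamma)$. If $\Theta$ is a singleton, it can be easily verified that
\[
\underline{D}_\varepsilon(R|\Gamma) =
\begin{cases}
-\infty & \text{if } R > C_\varepsilon(\Gamma), \\[4pt]
\displaystyle \sup_{\substack{P:\,\mathrm{E}c(X_P) \le \Gamma \\ I(P,W) = R}} \sup\left\{ S \,\middle|\, \Psi_P(S) \le \varepsilon \right\} & \text{if } R = C_\varepsilon(\Gamma), \\[4pt]
+\infty & \text{if } R < C_\varepsilon(\Gamma),
\end{cases} \tag{22}
\]
where, setting the singleton set $\Theta$ as $\Theta = \{\theta_0\}$, we use $\Psi_P$ instead of $\Psi_{\theta_0,P}$.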
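As a numerical aside on (16) and its singleton special case (22), the following Python sketch evaluates $G_w(R,S|P)$ for a finite mixture with a fixed input distribution $P$ and finds $\sup\{S \,|\, G_w(R,S|P) \le \varepsilon\}$ by bisection, using the fact that $G_w$ is non-decreasing in $S$. The function names, the tolerance used to decide $I(P,W_\theta) = R$, the finite search bracket, and the numerical inputs are all assumptions made for illustration; the mutual informations and variances are taken as given numbers.

```python
import math
from statistics import NormalDist

def G_w(R, S, infos, variances, weights, tol=1e-12):
    """Finite-mixture version of (16) for a fixed input distribution P.

    infos[k] = I(P, W_k), variances[k] = V_{k,P}, weights[k] = w(k).
    """
    total = 0.0
    for info, V, w in zip(infos, variances, weights):
        if info < R - tol:                       # component with I(P, W_k) < R
            total += w
        elif abs(info - R) <= tol:               # component with I(P, W_k) = R
            if V == 0:
                total += w * (0.0 if S < 0 else 1.0)
            else:
                total += w * NormalDist().cdf(S / math.sqrt(V))
    return total

def sup_S(R, eps, infos, variances, weights, lo=-50.0, hi=50.0, iters=200):
    """Bisection for sup{S : G_w(R, S | P) <= eps} on the assumed bracket [lo, hi]."""
    if G_w(R, lo, infos, variances, weights) > eps:
        return -math.inf                         # no S in the bracket is feasible
    if G_w(R, hi, infos, variances, weights) <= eps:
        return math.inf                          # every S in the bracket is feasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if G_w(R, mid, infos, variances, weights) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical singleton example: I(P, W) = 0.3 nats, V_P = 0.4, eps = 0.2.
s_equal = sup_S(0.3, 0.2, [0.3], [0.4], [1.0])   # case R = I(P, W)
s_above = sup_S(0.4, 0.2, [0.3], [0.4], [1.0])   # case R > I(P, W)
s_below = sup_S(0.2, 0.2, [0.3], [0.4], [1.0])   # case R < I(P, W)
```

In the singleton case with $R = I(P,W)$, the bisection result can be compared with $\sqrt{V_P}\,G^{-1}(\varepsilon)$, while the other two calls return $-\infty$ and $+\infty$ in line with the first and third cases of (22). The finite bracket is a numerical truncation, so boundary cases such as $\varepsilon = 0$ are not handled faithfully by this sketch.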
In particular, if
\[
R = C_\varepsilon(\Gamma) = \sup_{\substack{P:\,I(P,W) = R \\ \mathrm{E}c(X_P) \le \Gamma}} I(P,W), \tag{23}
\]
then it follows from Theorem 4 with $|\Theta| = 1$ later in Sect. V that
\[
D_\varepsilon\left( C_\varepsilon(\Gamma)\,|\,\Gamma \right) = \underline{D}_\varepsilon\left( C_\varepsilon(\Gamma)\,|\,\Gamma \right) =
\begin{cases}
\sqrt{V_{\max}}\; G^{-1}(\varepsilon) & \text{if } \varepsilon \ge \frac{1}{2}, \\[4pt]
\sqrt{V_{\min}}\; G^{-1}(\varepsilon) & \text{if } \varepsilon < \frac{1}{2},
\end{cases} \tag{24}
\]
where
\[
V_{\max} := \max_{\substack{P:\,I(P,W) = C_\varepsilon(\Gamma) \\ \mathrm{E}c(X_P) \le \Gamma}} V_P, \tag{25}
\]
\[
V_{\min} := \min_{\substack{P:\,I(P,W) = C_\varepsilon(\Gamma) \\ \mathrm{E}c(X_P) \le \Gamma}} V_P \tag{26}
\]
by using $V_P$ instead of $V_{\theta_0,P}$. Formula (24) is due to Hayashi [6] (with cost constraint), Polyanskiy, Poor, and Verdú [11] (without cost constraints), and Strassen [14] (without cost constraints under the maximum error probability criterion). ✷

Similarly to the first-order coding theorem, Theorem 2 reduces to the following corollary in the case where there are no cost constraints.

Corollary 2: Let $\mathbf{W}$ be a general mixed memoryless channel with measure $w$. For $\varepsilon \in [0,1)$ and $R \ge 0$, it holds that
\[
D_\varepsilon(R) \ge \sup_{P}\ \sup\left\{ S \,\middle|\, G_w(R,S|P) \le \varepsilon \right\}. \tag{27}
\]
✷

IV. PROOFS OF THEOREMS 1 AND 2

A. Lemmas

We state several lemmas which are used to prove Theorems 1 and 2. We first provide error bounds for codes of fixed length, which hold for any general channel.

Lemma 1 (Feinstein's Upper Bound [4]): For any input variable $X^n$ with values in $\mathcal{X}^n$, there exists an $(n, M_n, \varepsilon_n)$ code such that
\[
\varepsilon_n \le \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \frac{1}{n} \log M_n + \eta \right\} + e^{-n\eta}, \tag{28}
\]
where$^{2}$ $Y^n$ is the output variable due to $X^n$ via channel $W^n$ and $\eta > 0$ is an arbitrary positive number. ✷

The following lemma was first established in [7, Lemma 4] in the context of quantum channel coding. The proof for the classical version is stated in [6, Sect. IX-B]$^{3}$.

Lemma 2 (Hayashi-Nagaoka's Lower Bound [7]): Let $Q^n$ be an arbitrary probability distribution on $\mathcal{Y}^n$.
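Every $(n, M_n, \varepsilon_n)$ code $\mathcal{C}_n$ satisfies
\[
\varepsilon_n \ge \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y^n|X^n)}{Q^n(Y^n)} \le \frac{1}{n} \log M_n - \eta \right\} - e^{-n\eta}, \tag{29}
\]
where $X^n$ denotes the random variable subject to the uniform distribution on $\mathcal{C}_n$, $Y^n$ denotes the output variable due to $X^n$ via channel $W^n$, and $\eta > 0$ is an arbitrary positive number. ✷

We next state lemmas for mixed channels. We first arrange a so-called expurgated parameter space which possesses a useful property and is still asymptotically dominant over the whole parameter space. Given a set of arbitrary i.i.d. product probability distributions $Q_\theta^n = Q_\theta^{\times n}$ on $\mathcal{Y}^n$, let $Q^n$ be given as
\[
Q^n(\mathbf{y}) := \int_{\Theta} Q_\theta^n(\mathbf{y})\,dw(\theta) \qquad (\forall \mathbf{y} \in \mathcal{Y}^n), \tag{30}
\]
and define
\[
\Theta(\mathbf{y}) := \left\{ \theta \in \Theta \,\middle|\, Q_\theta^n(\mathbf{y}) \le e^{\sqrt[4]{n}}\, Q^n(\mathbf{y}) \right\} \qquad (\forall \mathbf{y} \in \mathcal{Y}^n) \tag{31}
\]
and
\[
\tilde{\Theta}(\mathbf{x},\mathbf{y}) := \left\{ \theta \in \Theta \,\middle|\, W_\theta^n(\mathbf{y}|\mathbf{x}) \le e^{\sqrt[4]{n}}\, W^n(\mathbf{y}|\mathbf{x}) \right\} \qquad (\forall (\mathbf{x},\mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n). \tag{32}
\]
Let $S_k$, $k = 1,2,\cdots,N_n$, denote a type (empirical distribution) on $\mathcal{Y}^n$, where $N_n$ is the number of all distinct types. Let $\tilde{S}_k$, $k = 1,2,\cdots,\tilde{N}_n$, denote a joint type on $\mathcal{X}^n \times \mathcal{Y}^n$, where $\tilde{N}_n$ is the number of all distinct joint types. Since $Q_\theta^n$ is an i.i.d. product probability distribution, the subset $\Theta(\mathbf{y})$ depends only on the type $S_k$ of $\mathbf{y}$, and therefore it can be denoted as $\Theta(S_k)$ instead of $\Theta(\mathbf{y})$. Likewise, since $W_\theta^n(\mathbf{y}|\mathbf{x})$ is stationary and memoryless, the subset $\tilde{\Theta}(\mathbf{x},\mathbf{y})$ depends only on the joint type $\tilde{S}_k$ of $(\mathbf{x},\mathbf{y})$, and therefore it can be denoted as $\tilde{\Theta}(\tilde{S}_k)$ instead of $\tilde{\Theta}(\mathbf{x},\mathbf{y})$. Using
\[
\Theta_n := \bigcap_{k=1}^{N_n} \Theta(S_k) \quad \text{and} \quad \tilde{\Theta}_n := \bigcap_{k=1}^{\tilde{N}_n} \tilde{\Theta}(\tilde{S}_k), \tag{33}
\]
we define another set
\[
\Theta_n^* := \Theta_n \cap \tilde{\Theta}_n. \tag{34}
\]

$^{2}$For random variables $U$ and $V$, we let $P_U$ denote the probability distribution of $U$ and $P_{U|V}$ denote the conditional probability distribution of $U$ given $V$.

$^{3}$Later, we shall generalize this lemma to the mixed channel consisting of general component channels in Lemma 7, whose proof is given in Appendix D.

Lemma 3: Let $\mathbf{W}$ be a general mixed memoryless channel with measure $w$. Given a set of arbitrary i.i.d. product probability distributions $Q_\theta^n$ on $\mathcal{Y}^n$, let $Q^n$ be defined by (30).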
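Then, it holds that
\[
\int_{\Theta_n^*} dw(\theta) \ge 1 - 2(n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|}\, e^{-\sqrt[4]{n}}. \tag{35}
\]
(Proof) See Appendix A. ✷

The following lemmas play a key role in proving the coding theorems for mixed channels.

Lemma 4 (Upper Decomposition Lemma): Let $\mathbf{W}$ be a general mixed memoryless channel with measure $w$. Then, it holds that
\[
\Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le z_n \right\}
\le \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|X^n)}{P_{Y_\theta^n}(Y_\theta^n)} \le z_n + \frac{\gamma}{\sqrt{n}} + \frac{1}{\sqrt[4]{n^3}} \right\} + e^{-\sqrt{n}\gamma}
\qquad (\forall \theta \in \Theta_n^*), \tag{36}
\]
where $\gamma > 0$ and $z_n > 0$ are arbitrary numbers, and $Y_\theta^n$ indicates the output variable due to the input $X^n$ via channel $W_\theta^n$.
(Proof) See Appendix B. ✷

Lemma 5 (Lower Decomposition Lemma): Let $\mathbf{W}$ be a general mixed memoryless channel with measure $w$. Given a set of arbitrary i.i.d. product probability distributions $Q_\theta^n$ on $\mathcal{Y}^n$, let $Q^n$ be defined by (30). Then, it holds that
\[
\Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{Q^n(Y_\theta^n)} \le z_n \right\}
\ge \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|X^n)}{Q_\theta^n(Y_\theta^n)} \le z_n - \frac{\gamma}{\sqrt{n}} - \frac{1}{\sqrt[4]{n^3}} \right\} - e^{-\sqrt{n}\gamma}
\qquad (\forall \theta \in \Theta_n^*), \tag{37}
\]
where $\gamma > 0$ and $z_n > 0$ are arbitrary numbers, and $Y_\theta^n$ indicates the output variable due to the input $X^n$ via channel $W_\theta^n$.
(Proof) See Appendix C. ✷

Remark 7: As we shall show in the proof of Theorem 1 in the next subsection, there exists an interesting duality between the achievability proof and the converse proof based on Lemmas 4 and 5. Using the Upper/Lower Decomposition Lemma has been the standard technique in the analysis of the optimum coding rate in various problems in information theory such as source coding [5, Sect. 1.4], [10], random number generation [9], and hypothesis testing [5, Sect. 4.2] for mixed sources. The proof of Theorem 1 demonstrates that we may also use this standard technique for mixed memoryless channels. Later, we shall also demonstrate in Sect. V-D that Lemma 7 can be used as a powerful alternative to Lemmas 2 and 5, and it saves several steps of the converse proof. ✷

B. Proof of Theorem 1

(Proof of Direct Part) Define
\[
\underline{C}_\varepsilon(\Gamma) := \sup_{P:\,\mathrm{E}c(X_P) \le \Gamma}\ \sup\left\{ R \,\middle|\, \int_{\{\theta\,|\,I(P,W_\theta) < R\}} dw(\theta) \le \varepsilon \right\}, \tag{38}
\]
and then for any small $\delta > 0$ there exists an input distribution $P_0 \in \mathcal{P}(\mathcal{X})$ such that $\mathrm{E}c(X_{P_0}) \le \Gamma$ and
\[
\sup\left\{ R \,\middle|\, \int_{\{\theta\,|\,I(P_0,W_\theta) < R\}} dw(\theta) \le \varepsilon \right\} \ge \underline{C}_\varepsilon(\Gamma) - \delta. \tag{39}
\]
We fix such a $P_0$ and show that
\[
R = \underline{C}_\varepsilon(\Gamma) - 4\delta \tag{40}
\]
is $(\varepsilon|\Gamma)$-achievable.

Without loss of generality, we assume that the elements in $\mathcal{X} = \{1,2,\ldots,|\mathcal{X}|\}$ are indexed so that $c(1) \ge c(2) \ge \cdots \ge c(|\mathcal{X}|)$. We define the type $P_n$ on $\mathcal{X}^n$ so that
\[
P_n(x) = \frac{\lfloor n P_0(x) \rfloor}{n} \qquad (x = 1,2,\ldots,|\mathcal{X}|-1), \tag{41}
\]
\[
P_n(|\mathcal{X}|) = 1 - \sum_{x=1}^{|\mathcal{X}|-1} P_n(x). \tag{42}
\]
Then, it is readily shown that
\[
\sum_{x \in \mathcal{X}} P_n(x) c(x) \le \Gamma, \tag{43}
\]
\[
|P_n(x) - P_0(x)| \le \frac{|\mathcal{X}|}{n} \qquad (\forall x \in \mathcal{X}), \tag{44}
\]
and
\[
\lim_{n\to\infty} P_n(x) = P_0(x) \qquad (\forall x \in \mathcal{X}), \tag{45}
\]
where (43) follows because $P_0$ satisfies $\sum_{x \in \mathcal{X}} P_0(x) c(x) \le \Gamma$ and the rounding in (41) and (42) moves probability mass only to the symbol $|\mathcal{X}|$ of smallest cost.

Let $T_n$ be the set of all sequences $\mathbf{x} \in \mathcal{X}^n$ of type $P_n$, and consider the input random variable $X^n$ uniformly distributed on $T_n$. Using Lemma 1 with $\frac{1}{n} \log M_n = R$ and $\eta = \frac{\gamma}{\sqrt{n}}$, where $\gamma > 0$ is an arbitrary positive number, we obtain the following chain of expansions:
\[
\begin{aligned}
\limsup_{n\to\infty} \varepsilon_n
&\le \limsup_{n\to\infty} \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y^n|X^n)}{P_{Y^n}(Y^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\} \\
&= \limsup_{n\to\infty} \int_{\Theta} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\} \\
&= \limsup_{n\to\infty} \left[ \int_{\Theta_n^*} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\} \right. \\
&\qquad\qquad\quad \left. + \int_{\Theta - \Theta_n^*} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\} \right] \\
&\le \limsup_{n\to\infty} \int_{\Theta_n^*} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\} \\
&\quad + \limsup_{n\to\infty} \int_{\Theta - \Theta_n^*} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\} \\
&\le \limsup_{n\to\infty} \int_{\Theta_n^*} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\} + \limsup_{n\to\infty} \int_{\Theta - \Theta_n^*} dw(\theta) \\
&= \limsup_{n\to\infty} \int_{\Theta_n^*} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W^n(Y_\theta^n|X^n)}{P_{Y^n}(Y_\theta^n)} \le R + \frac{\gamma}{\sqrt{n}} \right\}.
\end{aligned} \tag{46}
\]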
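Here, we have used
\[
\int_{\Theta - \Theta_n^*} dw(\theta) \le 2(n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|}\, e^{-\sqrt[4]{n}} \tag{47}
\]
(cf. Lemma 3) to obtain (46). We apply Lemma 4 with $z_n = R + \frac{\gamma}{\sqrt{n}}$ to (46) to obtain
\[
\begin{aligned}
\limsup_{n\to\infty} \varepsilon_n
&\le \limsup_{n\to\infty} \int_{\Theta_n^*} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|X^n)}{P_{Y_\theta^n}(Y_\theta^n)} \le R + \frac{2\gamma}{\sqrt{n}} + \frac{1}{\sqrt[4]{n^3}} \right\} \\
&\le \limsup_{n\to\infty} \int_{\Theta} dw(\theta) \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|X^n)}{P_{Y_\theta^n}(Y_\theta^n)} \le R + \frac{2\gamma}{\sqrt{n}} + \frac{1}{\sqrt[4]{n^3}} \right\} \\
&\le \int_{\Theta} dw(\theta) \limsup_{n\to\infty} \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|X^n)}{P_{Y_\theta^n}(Y_\theta^n)} \le R + \frac{2\gamma}{\sqrt{n}} + \frac{1}{\sqrt[4]{n^3}} \right\},
\end{aligned} \tag{48}
\]
where the last inequality in (48) is due to Fatou's lemma. Now notice that
\[
\begin{aligned}
P_{Y_\theta^n}(\mathbf{y})
&= \frac{1}{|T_n|} \sum_{\mathbf{x} \in T_n} W_\theta^n(\mathbf{y}|\mathbf{x}) \\
&\le (n+1)^{|\mathcal{X}|} \sum_{\mathbf{x} \in T_n} e^{-nH(P_n)}\, W_\theta^n(\mathbf{y}|\mathbf{x}) \\
&= (n+1)^{|\mathcal{X}|} \sum_{\mathbf{x} \in T_n} \prod_{i=1}^{n} P_n(x_i) W_\theta(y_i|x_i) \\
&\le (n+1)^{|\mathcal{X}|}\, (P_n W_\theta)^{\times n}(\mathbf{y}) \qquad (\forall \mathbf{y} \in \mathcal{Y}^n),
\end{aligned} \tag{49}
\]
where the last step enlarges the sum from $T_n$ to all of $\mathcal{X}^n$, and $(P_n W_\theta)^{\times n}$ denotes the $n$-fold product distribution of
\[
P_n W_\theta(y) := \sum_{x \in \mathcal{X}} P_n(x) W_\theta(y|x) \qquad (\forall y \in \mathcal{Y}). \tag{50}
\]
Plugging inequality (49) into (48), we obtain
\[
\begin{aligned}
\limsup_{n\to\infty} \varepsilon_n
&\le \int_{\Theta} dw(\theta) \limsup_{n\to\infty} \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|X^n)}{(P_n W_\theta)^{\times n}(Y_\theta^n)} \le R + \frac{2\gamma}{\sqrt{n}} + \frac{1}{\sqrt[4]{n^3}} + \frac{|\mathcal{X}|}{n} \log(n+1) \right\} \\
&\le \int_{\Theta} dw(\theta) \limsup_{n\to\infty} \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|X^n)}{(P_n W_\theta)^{\times n}(Y_\theta^n)} \le R + \delta \right\}.
\end{aligned} \tag{51}
\]
Inequality (51) implies that there exists $\mathbf{x}_n \in \mathcal{X}^n$ of type $P_n$ such that
\[
\limsup_{n\to\infty} \varepsilon_n \le \int_{\Theta} dw(\theta) \limsup_{n\to\infty} \Pr\left\{ \frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|\mathbf{x}_n)}{(P_n W_\theta)^{\times n}(Y_\theta^n)} \le R + \delta \,\middle|\, X^n = \mathbf{x}_n \right\}. \tag{52}
\]
Now, we can write
\[
\frac{1}{n} \log \frac{W_\theta^n(Y_\theta^n|\mathbf{x}_n)}{(P_n W_\theta)^{\times n}(Y_\theta^n)} = \frac{1}{n} \sum_{i=1}^{n} \log \frac{W_\theta(Y_{\theta,i}|x_i)}{P_n W_\theta(Y_{\theta,i})}, \tag{53}
\]
where
\[
\mathbf{x}_n = (x_1,x_2,\cdots,x_n), \qquad Y_\theta^n = (Y_{\theta,1},Y_{\theta,2},\cdots,Y_{\theta,n}).
\]
Notice here that $Y_{\theta,1},Y_{\theta,2},\ldots,Y_{\theta,n}$ are conditionally independent random variables given $X^n = \mathbf{x}_n$ (under the conditional distribution $W_\theta^n(\cdot|\mathbf{x}_n)$), and therefore the right-hand side of (53) is a sum of conditionally independent random variables given $X^n = \mathbf{x}_n$ with conditional mean
\[
\mathrm{E}\left\{ \frac{1}{n} \sum_{i=1}^{n} \log \frac{W_\theta(Y_{\theta,i}|x_i)}{P_n W_\theta(Y_{\theta,i})} \,\middle|\, X^n = \mathbf{x}_n \right\} = I(P_n,W_\theta) \tag{54}
\]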

