
Information Geometry Approach to Parameter Estimation in Markov Chains

Masahito Hayashi$^1$ and Shun Watanabe$^2$

$^1$ Graduate School of Mathematics, Nagoya University, Japan, and Centre for Quantum Technologies, National University of Singapore, Singapore.
$^2$ Department of Information Science and Intelligent Systems, University of Tokushima, Japan, and Institute for System Research, University of Maryland, College Park.

arXiv:1401.3814v4 [math.ST] 5 May 2015

Abstract: We consider the parameter estimation of a Markov chain when the unknown transition matrix belongs to an exponential family of transition matrices. We show that the sample mean of the generator of the exponential family is an asymptotically efficient estimator. Further, we define a curved exponential family of transition matrices. Using a transition-matrix version of the Pythagorean theorem, we give an asymptotically efficient estimator for a curved exponential family.

Primary 62M05.
Keywords and phrases: exponential family, natural parameter, expectation parameter, relative entropy, Fisher information matrix, asymptotically efficient estimator.

1. Introduction

Information geometry, established by Amari and Nagaoka [4], is an elegant method for statistical inference. It provides a very general approach to statistical parameter estimation. Within this framework, one readily finds that an efficient estimator can be computed with low complexity for exponential families and curved exponential families in the independent and identically distributed case. We can therefore expect a similar structure for Markov chains.

The preceding studies [9, 10, 11, 12, 13, 14, 15, 16] introduced the concept of exponential families of transition matrices.
However, under their definition, although the maximum likelihood estimator is asymptotically efficient, i.e., attains the Cramér-Rao bound asymptotically, it cannot necessarily be computed with low complexity. That is, the maximum likelihood estimator has a complicated form and requires a long calculation time in their model. Further, it is quite difficult to calculate the Cramér-Rao bound, even its asymptotic first-order coefficient, because these papers focused only on the limit of the inverse of the Fisher information. From a practical viewpoint, the asymptotic first-order coefficient needs to be calculated. Hence, two problems must be resolved for the estimation of a Markovian process: (1) to give an asymptotically efficient estimator that requires little calculation, and (2) to derive a formula for the asymptotic Cramér-Rao bound that requires little calculation. The purpose of this paper is to answer these two problems.

For this purpose, we turn to another type of exponential family of transition matrices, introduced by Nakagawa and Kanaya [2] and Nagaoka [5], who defined the Fisher information matrix in their sense. On the other hand, in the estimation of probability distributions, the class of curved exponential families plays an important role as a wider class of distribution families than the class of exponential families. That is, when the unknown distribution belongs to a curved exponential family, an asymptotically efficient estimator can be treated in the information-geometrical framework. Therefore, to deal with these problems in a wider class of families of transition matrices, we introduce a curved exponential family of transition matrices as a subset of an exponential family of transition matrices in the sense of [2, 5].
Since any exponential family of transition matrices is a curved exponential family, the class of curved exponential families is larger than the class of exponential families. Our purpose is to resolve the above two problems for a curved exponential family as well as for an exponential family. Since any smooth parametric subfamily of transition matrices on a finite-size system forms a curved exponential family, our treatment of curved exponential families is widely applicable to the estimation of Markovian processes. This is the reason why we adopt the definition of an exponential family given in [2, 5].

Firstly, we show that, for an exponential family of transition matrices in the sense of [2, 5], an estimator of a simple form asymptotically attains the Cramér-Rao bound, which is given as the inverse of the Fisher information matrix. That is, the estimator for the expectation parameter is asymptotically efficient and is written as the sample mean of $n+1$ observations. Since it requires only a small amount of calculation, problem (1) is resolved. Problem (2) is also resolved for an exponential family of transition matrices because the Fisher information matrix is computable.

To show the above items, we discuss the behavior of the sample mean of $n+1$ observations. While the existing papers [7, 6] derived the form of the asymptotic variance, this paper shows that the asymptotic variance can be written using the second derivative of the potential function of the generated exponential family. Using this relation, we show that the sample mean asymptotically attains the Cramér-Rao bound for the expectation parameter.

Next, we define the Fisher information matrix for a curved exponential family in a computable form.
Then, using a transition-matrix version of the Pythagorean theorem, we give an asymptotically efficient estimator for a curved exponential family, in which the estimator is given as a function of the above estimator in the larger exponential family. Since the asymptotic mean square error is the inverse of the Fisher information matrix, problems (1) and (2) are resolved jointly. In this way, we resolve the problems that were left unsolved in the existing papers [9, 10, 11, 12, 13, 14, 15, 16]. Further, as a by-product of this derivation, we obtain a notable evaluation of the variance of the sample mean, which is summarized in Subsection 2.1.

For the above discussion, we need a description of an exponential family of transition matrices. Since the information-geometrical structure of probability distributions plays important roles in several topics in information theory as well as in statistics, it is better to describe the information geometry of transition matrices so that it can be easily applied to these topics. In fact, the authors applied it to finite-length evaluations of the tail probability, the error probability in simple hypothesis testing, source coding, channel coding, and random number generation in Markov chains, as well as to the estimation error of a parametric family of transition matrices [17, 18]. Thus, we revisit the exponential family of transition matrices [2, 5] in a manner consistent with the above purpose by using the Bregman divergence [21, 20]. In particular, the relative Rényi entropy for transition matrices plays an important role in the finite-length analysis; we define the relative entropy for transition matrices so that it is a special case of the relative Rényi entropy, which differs from the definitions in the literature [2, 5].
Although some of the results in this paper have already been stated in [5] (without detailed proofs), we restate them and give proofs because the logical order of the arguments differs from [5] and we want to keep the paper self-contained. In particular, although [5] is written in differential-geometric terminology, e.g., Christoffel symbols, this paper uses only the terminology of convex functions and linear algebra.

The remainder of this paper is organized as follows. Section 2 gives a brief summary of the obtained results, which is crucial for understanding the structure of this paper. In Section 3, we define the relative entropy and the relative Rényi entropy between two transition matrices. In Section 4, we revisit an exponential family of transition matrices and its properties. In Section 5, we focus on the joint distribution when a transition matrix is given as an element of a one-parameter exponential family and the input distribution is given as the stationary distribution. Then, we characterize the quantities given in Sections 3 and 4 by using the joint distribution. In Section 6, we proceed to the Markov process with $n+1$ observations when the initial distribution is the stationary distribution. Then, we show that the sample mean of the generator is an unbiased and asymptotically efficient estimator under a one-parameter exponential family. In Section 7, we proceed to the Markov process with $n+1$ observations when the initial distribution is a non-stationary distribution, and show a similar fact in this case. Section 8 extends a part of these results to the multi-parameter case and to the case of a curved exponential family. In the appendix, we address the relations with existing results by Nakagawa and Kanaya [2], Nagaoka [5], and Natarajan [1].

2. Summary of results

Here, we prepare notations and definitions. For two given transition matrices $W_{\mathcal{X}}$ and $W_{\mathcal{Y}}$ over $\mathcal{X}$ and $\mathcal{Y}$, we define
\[ W_{\mathcal{X}} \times W_{\mathcal{Y}}(x,y|x',y') := W_{\mathcal{X}}(x|x')\, W_{\mathcal{Y}}(y|y'), \]
\[ W^{\times n}(x_n,x_{n-1},\ldots,x_1|x') := W(x_n|x_{n-1})\, W(x_{n-1}|x_{n-2}) \cdots W(x_1|x'), \]
and $W^{n}(x|x') := \sum_{x_{n-1},\ldots,x_1} W^{\times n}(x,x_{n-1},\ldots,x_1|x')$. For a given distribution $P$ on $\mathcal{X}$ and a transition matrix $V$ from $\mathcal{X}$ to $\mathcal{Y}$, we define $V \times P(y,x) := V(y|x)P(x)$ and $VP(y) := \sum_{x} V \times P(y,x)$.

A non-negative matrix $W$ is called irreducible when, for each $x,x' \in \mathcal{X}$, there exists a natural number $n$ such that $W^{n}(x|x') > 0$ [27]. An irreducible matrix $W$ is called ergodic when there are no input $x'$ and no integer $n' \ge 2$ such that $W^{n}(x'|x') = 0$ unless $n$ is divisible by $n'$ [27]. The irreducibility and the ergodicity depend only on the support $\mathcal{X}^2_W := \{(x,x') \in \mathcal{X}^2 \mid W(x|x') > 0\}$ of a non-negative matrix $W$ over $\mathcal{X}$. Hence, we say that $\mathcal{X}^2_W$ is irreducible and ergodic when the non-negative matrix $W$ is irreducible and ergodic, respectively. Indeed, when a subset of $\mathcal{X}^2_W$ is irreducible and ergodic, the set $\mathcal{X}^2_W$ is also irreducible and ergodic, respectively. It is known that the output distribution $W^{n}P$ converges to the stationary distribution of $W$ for a given ergodic transition matrix $W$ [7, 3, 27]. Although the main result is asymptotic estimation for an exponential family and a curved exponential family, we also have additional results, given as Subsections 2.1 and 2.2.

2.1. Asymptotic behavior of sample mean

Assume that the random variable $X_n$ obeys the Markov process with the irreducible and ergodic transition matrix $W(x|x')$. In this paper, for an arbitrary two-input function $g(x,x')$, we focus on the sample mean $S_n := \frac{g^n(X^{n+1})}{n}$, where $g^n(X^{n+1}) := \sum_{i=1}^{n} g(X_{i+1},X_i)$ and $X^{n+1} := (X_{n+1},\ldots,X_1)$. This is because a two-input function $g(x,x')$ is closely related to an exponential family of transition matrices. Indeed, the simple sample mean can be treated in this formulation by choosing $g(x,x')$ as $x$ or $x'$. Since the function $g(x,x')$ can be chosen arbitrarily, the following discussion can handle the sample mean of a hidden Markov process.
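
The definition of $S_n$ can be made concrete with a short simulation. The following is a minimal Python sketch, not taken from the paper: the two-state matrix `W_EXAMPLE`, the function name `sample_mean`, and the choice $g(x,x') = x$ are illustrative assumptions. It stores $W(x|x')$ as `W[x][xp]`, so that each column sums to one, as in the notation above.

```python
import random

# Hypothetical two-state chain: W_EXAMPLE[x][xp] = W(x | xp); columns sum to 1.
W_EXAMPLE = [[0.7, 0.2],
             [0.3, 0.8]]


def sample_mean(W, g, n, x1, rng):
    """Simulate X_1, ..., X_{n+1} and return S_n = (1/n) * sum_i g(X_{i+1}, X_i)."""
    states = list(range(len(W)))
    prev = x1
    total = 0.0
    for _ in range(n):
        # Draw the next state from the column of W indexed by the current state.
        weights = [W[x][prev] for x in states]
        nxt = rng.choices(states, weights=weights)[0]
        total += g(nxt, prev)
        prev = nxt
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    # g(x, x') = x recovers the simple sample mean of the visited states.
    print(sample_mean(W_EXAMPLE, lambda x, xp: float(x), 100_000, 0, rng))
```

For this particular chain the stationary probability of state 1 is $0.3/(0.3+0.2) = 0.6$, so for large $n$ the printed value should be close to $0.6$.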

Then, the expectation $\mathsf{E}[S_n]$ and the variance $\mathsf{V}[S_n]$ are characterized as follows. We denote the normalized Perron-Frobenius eigenvector of $W(x|x')$ by $P_W$ and define the limiting expectation $\mathsf{E}[g(X,X')] := \sum_{x,x'} g(x,x')\, W(x|x')\, P_W(x')$. We denote the Perron-Frobenius eigenvalue of $W(x|x')e^{\theta g(x,x')}$ by $\lambda_\theta$ and define the cumulant generating function $\phi(\theta) := \log \lambda_\theta$. Then, when the transition matrix $W$ is irreducible and ergodic, the relation
\[ \mathsf{E}[S_n] \to \mathsf{E}[g(X,X')] \tag{2.1} \]
is known. In Sections 6 and 7 of this paper, we show
\[ n\mathsf{V}[S_n] \to \frac{d^2\phi}{d\theta^2}(0) \tag{2.2} \]
while the existing papers [7, 6] characterized the asymptotic variance by using the fundamental matrix. (See [17, Section 6].)

In particular, when the initial distribution is the stationary distribution $P_W$, we have $\mathsf{E}[S_n] = \mathsf{E}[g(X,X')]$. Then, in Section 6, using a constant $C$, we show that
\[ \frac{d^2\phi}{d\theta^2}(0)\Bigl(1-\frac{C}{\sqrt{n}}\Bigr)^2 \le n\mathsf{V}[S_n] \le \frac{d^2\phi}{d\theta^2}(0)\Bigl(1+\frac{C}{\sqrt{n}}\Bigr)^2 \tag{2.3} \]
for the stationary case. The concrete form of $C$ is also given in Section 6. This analysis is obtained via the evaluations of Fisher information given in Sections 5, 6, and 7.

2.2. Cramér-Rao bound and asymptotically efficient estimator

Firstly, for simplicity, we summarize our results for the one-parameter case, although this paper addresses a multi-parameter exponential family. In Section 4, for a given two-input function $g(x,x')$ and an irreducible and ergodic transition matrix $W$, we define the potential function $\phi(\theta)$ and the exponential family of transition matrices $\{W_\theta\}$ with the generator $g(x,x')$. We also define its Fisher information matrix $\frac{d^2\phi}{d\theta^2}(\theta)$ and the expectation parameter $\eta(\theta) := \frac{d\phi}{d\theta}(\theta)$. Then, we focus on the distribution family of Markov chains generated by the family of transition matrices $\{W_\theta\}$ with arbitrary initial distributions. In Section 7, we show that the Fisher information of the expectation parameter under this distribution family is asymptotically equal to $n\,\frac{d^2\phi}{d\theta^2}(\theta(\eta))^{-1} + o(n)$ even for the non-stationary case.
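
The quantities $\phi(\theta) = \log\lambda_\theta$ and $\frac{d^2\phi}{d\theta^2}(0)$ in (2.2) are easy to evaluate numerically for a small chain. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it uses plain power iteration for the Perron-Frobenius eigenvalue (which assumes the tilted matrix is irreducible and aperiodic) and a central finite difference in place of an exact second derivative. The names `perron_eigenvalue`, `phi`, `asymptotic_variance`, `W_EXAMPLE`, and `g_next_state` are hypothetical.

```python
import math

# Hypothetical two-state chain: W_EXAMPLE[x][xp] = W(x | xp); columns sum to 1.
W_EXAMPLE = [[0.7, 0.2],
             [0.3, 0.8]]


def g_next_state(x, xp):
    """Illustrative generator g(x, x') = x."""
    return float(x)


def perron_eigenvalue(M, iters=2000):
    """Largest eigenvalue of a non-negative, irreducible, aperiodic matrix
    by power iteration (an assumption of this sketch, not of the paper)."""
    d = len(M)
    v = [1.0 / d] * d
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[x][xp] * v[xp] for xp in range(d)) for x in range(d)]
        lam = sum(w)              # sum(v) == 1, so sum(M v) estimates lambda
        v = [wi / lam for wi in w]
    return lam


def phi(W, g, theta):
    """phi(theta) = log of the Perron-Frobenius eigenvalue of W(x|x') e^{theta g(x,x')}."""
    d = len(W)
    tilted = [[W[x][xp] * math.exp(theta * g(x, xp)) for xp in range(d)]
              for x in range(d)]
    return math.log(perron_eigenvalue(tilted))


def asymptotic_variance(W, g, h=1e-3):
    """Central-difference estimate of the second derivative of phi at theta = 0."""
    return (phi(W, g, h) - 2.0 * phi(W, g, 0.0) + phi(W, g, -h)) / (h * h)


if __name__ == "__main__":
    print(asymptotic_variance(W_EXAMPLE, g_next_state))
```

For this two-state example the second derivative can also be worked out by hand from the closed-form eigenvalue of a $2\times 2$ matrix, giving $0.72$, so the finite-difference estimate should be close to that value.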

Then, we show that the random variable $S_n$ is an asymptotically efficient estimator, i.e., its mean square error is $\frac{d^2\phi}{d\theta^2}(\theta(\eta))/n + o(1/n)$. In Section 6, we give a more detailed analysis for the stationary case. To derive the results in Sections 6 and 7, we prepare evaluations of Fisher information in Section 5.

Now, we address the multi-parameter case. In Section 4, we also define a multi-parameter exponential family $\{W_{\vec{\theta}}\}$ of transition matrices, and show the Pythagorean theorem. Then, we show the asymptotic efficiency of the sample mean in the multi-parameter case in Subsections 8.1 and 8.2. We also show that the set of all positive transition matrices on a finite-size system forms an exponential family in Example 1. Further, we define a curved exponential family of transition matrices, and give its asymptotically efficient estimator in Subsection 8.3. Since any smooth parametric family of transition matrices on a finite-size system forms a curved exponential family, this result has wide applicability. These results require the technical preparations given in Sections 3, 4, and 5.

2.3. Relative entropy and relative Rényi entropy

In this paper, given two transition matrices $W$ and $V$, we define the relative entropy $D(W\|V)$ and the relative Rényi entropy $D_{1+s}(W\|V)$ in Section 3. In Subsection 8.3, the relative entropy $D(W\|V)$ plays a crucial role in our estimator for a curved exponential family. We also show that the Fisher information is given as the limit of the relative entropy and of the relative Rényi entropy, which plays an important role in the proof of the asymptotic efficiency of our estimator for a curved exponential family in Subsection 8.3. Also, as discussed in [17], the relative Rényi entropy $D_{1+s}(W\|V)$ plays a central role in simple hypothesis testing, as does the relative entropy $D(W\|V)$.
Further, these information quantities play a central role in random number generation, data compression, and channel coding [18]. In Section 3, we also give their properties that are useful in the above applications.

For these applications, we need to address the relative entropy $D(W\|V)$ and the relative Rényi entropy $D_{1+s}(W\|V)$ in a unified way. More precisely, the relative entropy $D(W\|V)$ needs to be defined as the limit of the relative Rényi entropy $D_{1+s}(W\|V)$. Indeed, the existing paper [5] defined the relative entropy $D(W\|V)$ in a different way. However, the definition in [5] cannot yield the definition of the relative Rényi entropy in a unified way. Appendix A summarizes the detailed relation between the results in this part and existing results.

3. Relative entropy and relative Rényi entropy

In this section, in order to investigate the geometric structure of transition matrices, we define the relative entropy and the relative Rényi entropy. For this purpose, we prepare the following lemma, which is shown after Lemma 5.2.

Lemma 3.1. Consider an irreducible transition matrix $W$ over $\mathcal{X}$ and a real-valued function $g$ on $\mathcal{X}\times\mathcal{X}$. Define $\phi(\theta)$ as the logarithm of the Perron-Frobenius eigenvalue of the matrix
\[ \overline{W}_\theta(x|x') := W(x|x')\, e^{\theta g(x,x')}. \tag{3.1} \]
Then, the function $\phi(\theta)$ is convex. Further, the following conditions are equivalent.
(1) No real-valued function $f$ on $\mathcal{X}$ satisfies $g(x,x') = f(x) - f(x') + c$ for any $(x,x') \in \mathcal{X}^2_W$ with a constant $c \in \mathbb{R}$.
(2) The function $\phi(\theta)$ is strictly convex, i.e., $\frac{d^2\phi}{d\theta^2}(\theta) > 0$ for any $\theta$.
(3) $\frac{d^2\phi}{d\theta^2}(\theta)\big|_{\theta=0} > 0$.

Using Lemma 3.1, given two distinct transition matrices $W$ and $V$, we define the relative entropy $D(W\|V)$ and the relative Rényi entropy $D_{1+s}(W\|V)$ as follows. For this purpose, we denote the logarithm of the Perron-Frobenius eigenvalue of the matrix $W(x|x')^{1+s} V(x|x')^{-s}$ by $\varphi(1+s)$ under the condition given below.
When $\mathcal{X}^2_W \subset \mathcal{X}^2_V$ and $\mathcal{X}^2_W$ is irreducible, we define
\[ D(W\|V) := \frac{d\varphi}{ds}(1), \qquad D_{1+s}(W\|V) := \frac{\varphi(1+s)}{s} \tag{3.2} \]
for $s > 0$. The relative Rényi entropy $D_{1+s}(W\|V)$ with $s \in (-1,0)$ is defined by (3.2) when $\mathcal{X}^2_W \cap \mathcal{X}^2_V$ is irreducible, which is a weaker assumption. When $\mathcal{X}^2_W \cap \mathcal{X}^2_V$ is irreducible and the condition $\mathcal{X}^2_W \subset \mathcal{X}^2_V$ does not hold, the relative entropy $D(W\|V)$ and the relative Rényi entropy $D_{1+s}(W\|V)$ with $s > 0$ are regarded as infinity. Note that the limit $\lim_{s\to 0} D_{1+s}(W\|V)$ equals $D(W\|V)$. When $\mathcal{X}^2_W \subset \mathcal{X}^2_V$ and $\mathcal{X}^2_W$ is irreducible, the function $\log\frac{W(x|x')}{V(x|x')}$ satisfies the condition for the function $g$ in Lemma 3.1 because $W$ and $V$ are distinct. Hence, the function $s \mapsto sD_{1+s}(W\|V)$ is strictly convex. So, the relative Rényi entropy $D_{1+s}(W\|V)$ is strictly monotone increasing with respect to $s$.

From the property of the Perron-Frobenius eigenvalue, we immediately obtain the following lemma.

Lemma 3.2. Given two transition matrices $W_{\mathcal{X}}$ and $V_{\mathcal{X}}$ ($W_{\mathcal{Y}}$ and $V_{\mathcal{Y}}$) on $\mathcal{X}$ ($\mathcal{Y}$), respectively, we have
\[ D(W_{\mathcal{X}}\|V_{\mathcal{X}}) + D(W_{\mathcal{Y}}\|V_{\mathcal{Y}}) = D(W_{\mathcal{X}}\times W_{\mathcal{Y}}\,\|\,V_{\mathcal{X}}\times V_{\mathcal{Y}}), \]
\[ D_{1+s}(W_{\mathcal{X}}\|V_{\mathcal{X}}) + D_{1+s}(W_{\mathcal{Y}}\|V_{\mathcal{Y}}) = D_{1+s}(W_{\mathcal{X}}\times W_{\mathcal{Y}}\,\|\,V_{\mathcal{X}}\times V_{\mathcal{Y}}) \]
for $s \in (-1,0)\cup(0,\infty)$.

Theorem 3.3. Transition matrices $W_1$, $W_2$, and $W$ satisfy
\[ pD(W_1\|W) + (1-p)D(W_2\|W) \ge D(pW_1 + (1-p)W_2\,\|\,W), \tag{3.3} \]
\[ pD(W\|W_1) + (1-p)D(W\|W_2) \ge D(W\,\|\,pW_1 + (1-p)W_2) \tag{3.4} \]
for $p \in (0,1)$.

(3.3) can be directly shown from Lemma 4.5 given later. The proof of (3.4) will be given after (5.5).

4. Information geometry for transition matrices

4.1. Exponential family

In the following, we treat only irreducible transition matrices. Hence, an irreducible transition matrix is simply called a transition matrix. We define an exponential family for transition matrices. We focus on a transition matrix $W(x|x')$ from $\mathcal{X}$ to $\mathcal{X}$.
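Then, a set of real-valued functions $\{g_j\}$ on $\mathcal{X}\times\mathcal{X}$ is called linearly independent under the transition matrix $W(x|x')$ when any non-zero linear combination of $\{g_j\}$ satisfies the condition in Lemma 3.1. For $\vec{\theta} = (\theta^1,\ldots,\theta^d)$ and linearly independent functions $\{g_j\}$, we define the matrix $\overline{W}_{\vec{\theta}}(x|x')$ from $\mathcal{X}$ to $\mathcal{X}$ in the following way:
\[ \overline{W}_{\vec{\theta}}(x|x') := W(x|x')\, e^{\sum_{j=1}^{d}\theta^j g_j(x,x')}. \tag{4.1} \]
Using the Perron-Frobenius eigenvalue $\lambda_{\vec{\theta}}$ of $\overline{W}_{\vec{\theta}}$, we define the potential function $\phi(\vec{\theta}) := \log\lambda_{\vec{\theta}}$.

Note that, since the value $\sum_{x}\overline{W}_{\vec{\theta}}(x|x')$ generally depends on $x'$, we cannot make a transition matrix by simply multiplying the matrix $\overline{W}_{\vec{\theta}}$ by a constant. To make a transition matrix from the matrix $\overline{W}_{\vec{\theta}}$, we recall that a non-negative matrix $V$ from $\mathcal{X}$ to $\mathcal{X}$ is a transition matrix if and only if the vector $(1,\ldots,1)^T$ is an eigenvector of the transpose $V^T$. In order to resolve this problem, we focus on the structure of the matrix $\overline{W}_{\vec{\theta}}$. We denote the Perron-Frobenius eigenvectors of $\overline{W}_{\vec{\theta}}$ and its transpose $\overline{W}_{\vec{\theta}}^{T}$ by $\overline{P}^{2}_{\vec{\theta}}$ and $\overline{P}^{3}_{\vec{\theta}}$. Then, similar to [2, (16)] and [5, (2)], we define the matrix $W_{\vec{\theta}}(x|x')$ as
\[ W_{\vec{\theta}}(x|x') := \lambda_{\vec{\theta}}^{-1}\, \overline{P}^{3}_{\vec{\theta}}(x)\, \overline{W}_{\vec{\theta}}(x|x')\, \overline{P}^{3}_{\vec{\theta}}(x')^{-1}. \tag{4.2} \]
The matrix $W_{\vec{\theta}}(x|x')$ is a transition matrix because the vector $(1,\ldots,1)^T$ is an eigenvector of the transpose $W_{\vec{\theta}}^{T}$. The stationary distribution of the transition matrix $W_{\vec{\theta}}$ is its normalized Perron-Frobenius eigenvector, which is given as
\[ \overline{P}^{1}_{\vec{\theta}}(x) := \frac{\overline{P}^{3}_{\vec{\theta}}(x)\,\overline{P}^{2}_{\vec{\theta}}(x)}{\sum_{x''}\overline{P}^{3}_{\vec{\theta}}(x'')\,\overline{P}^{2}_{\vec{\theta}}(x'')} \tag{4.3} \]
because
\[ \sum_{x'} W_{\vec{\theta}}(x|x')\,\overline{P}^{1}_{\vec{\theta}}(x') = \frac{\lambda_{\vec{\theta}}^{-1}\,\overline{P}^{3}_{\vec{\theta}}(x)}{\sum_{x''}\overline{P}^{3}_{\vec{\theta}}(x'')\,\overline{P}^{2}_{\vec{\theta}}(x'')} \sum_{x'}\overline{W}_{\vec{\theta}}(x|x')\,\overline{P}^{2}_{\vec{\theta}}(x') = \frac{\overline{P}^{3}_{\vec{\theta}}(x)\,\overline{P}^{2}_{\vec{\theta}}(x)}{\sum_{x''}\overline{P}^{3}_{\vec{\theta}}(x'')\,\overline{P}^{2}_{\vec{\theta}}(x'')} = \overline{P}^{1}_{\vec{\theta}}(x). \]

In the following, we call the family of transition matrices $\mathcal{E} := \{W_{\vec{\theta}}\}$ an exponential family of transition matrices generated by $W$ with the generator $\{g_1,\ldots,g_d\}$. Since the generator $\{g_1,\ldots,g_d\}$ is linearly independent, due to Lemma 3.1, $\sum_{i,j} c_i c_j \frac{\partial^2\phi}{\partial\theta^i\partial\theta^j} = \frac{d^2\phi(\vec{c}\,t)}{dt^2}$ is strictly positive for an arbitrary non-zero vector $\vec{c} = (c_1,\ldots,c_d)$.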
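That is, the Hesse matrix $H_{\vec{\theta}}[\phi] = \bigl[\frac{\partial^2\phi}{\partial\theta^i\partial\theta^j}\bigr]_{i,j}$ is strictly positive (positive definite).

Using the potential function $\phi(\vec{\theta})$, we formally discuss several concepts for transition matrices based on Lemma 3.1. We call the parameter $(\theta^1,\ldots,\theta^d)$ the natural parameter, and the parameter $\eta_j(\vec{\theta}) := \frac{\partial\phi}{\partial\theta^j}(\vec{\theta})$ the expectation parameter. For $\vec{\eta} = (\eta_1,\ldots,\eta_d)$, we define $\theta^1(\vec{\eta}),\ldots,\theta^d(\vec{\eta})$ by $\eta_j(\theta^1(\vec{\eta}),\ldots,\theta^d(\vec{\eta})) = \eta_j$.

For a given transition matrix $W$, we define a linear subspace $\mathcal{N}_W(\mathcal{X}^2)$ of the space $\mathcal{G}(\mathcal{X}^2)$ of all two-input functions as the set of functions of the form $f(x) - f(x') + c$. Then, we obtain the following lemma.

Lemma 4.1. The following are equivalent for the generator $\{g_j\}$ and the transition matrix $W$.
(1) The functions $\{g_j\}$ are linearly independent in the quotient space $\mathcal{G}(\mathcal{X}^2)/\mathcal{N}_W(\mathcal{X}^2)$.
(2) The map $\vec{\theta} \to \vec{\eta}(\vec{\theta})$ is one-to-one.
(3) The Hesse matrix $H_{\vec{\theta}}[\phi]$ is strictly positive for any $\vec{\theta}$, which implies the strict convexity of the potential function $\phi(\vec{\theta})$.
(4) The Hesse matrix $H_{\vec{\theta}}[\phi]\big|_{\vec{\theta}=0}$ is strictly positive.
(5) The parametrization $\vec{\theta} \mapsto W_{\vec{\theta}}$ is faithful for any $\vec{\theta}$.

Proof. Applying Lemma 3.1 to $\phi(\vec{c}\,t)$ for an arbitrary non-zero vector $\vec{c} = (c_1,\ldots,c_d)$, we obtain the equivalence among (1), (3), and (4). (3) $\Rightarrow$ (2) is trivial.

Now, we show (2) $\Rightarrow$ (1) by showing the contraposition. If (1) does not hold, there exists a non-zero vector $\vec{c} = (c_1,\ldots,c_d)$ such that $\sum_i c_i g_i(x,x') = f(x) - f(x') + C$. Hence, we have $\frac{d^2\phi(\vec{c}\,t)}{dt^2} = 0$, and (2) does not hold.

Now, we show (1) $\Rightarrow$ (5) by showing the contraposition. When $W_{\vec{\theta}'} = W_{\vec{\theta}}$, considering the logarithm, there exist a function $f$ and a constant $C$ such that
\[ \sum_j \theta'^{j} g_j(x,x') - \sum_j \theta^{j} g_j(x,x') = f(x) - f(x') + C \quad\text{for } (x,x') \in \mathcal{X}^2_W. \]

Now, we show (5) $\Rightarrow$ (1) by showing the contraposition. If a set of real-valued functions $\{g_j\}$ on $\mathcal{X}\times\mathcal{X}$ is not linearly independent, there exist a function $f$ and a constant $C$ such that
\[ \sum_j \theta'^{j} g_j(x,x') - \sum_j \theta^{j} g_j(x,x') = f(x) - f(x') + C. \]
In this case, $\overline{W}_{\vec{\theta}'}(x|x') = \overline{W}_{\vec{\theta}}(x|x')\, e^{f(x)-f(x')+C}$ on $\mathcal{X}^2_W$, so choosing $\overline{P}^{3}_{\vec{\theta}'}(x) = \overline{P}^{3}_{\vec{\theta}}(x)\, e^{-f(x)}$ and $\lambda_{\vec{\theta}'} = \lambda_{\vec{\theta}}\, e^{C}$, we find that $\overline{P}^{3}_{\vec{\theta}'}$ and $\lambda_{\vec{\theta}'}$ are the Perron-Frobenius eigenvector of the transpose $\overline{W}_{\vec{\theta}'}^{T}$ and the Perron-Frobenius eigenvalue of $\overline{W}_{\vec{\theta}'}$. Substituting these into (4.2), the factors $e^{\pm f}$ and $e^{\pm C}$ cancel, and we have $W_{\vec{\theta}'} = W_{\vec{\theta}}$.

Now, we introduce the notation $\mathcal{W}_{\mathcal{X},W} := \{V \mid V \text{ is a transition matrix and } \mathcal{X}^2_V = \mathcal{X}^2_W\}$. Any element $W' \in \mathcal{W}_{\mathcal{X},W}$ can be written as $W'(x|x') = W(x|x')\, e^{g(x,x')}$ by using an element $g \in \mathcal{G}(\mathcal{X}^2)$ because $\log\frac{W'(x|x')}{W(x|x')} \in \mathcal{G}(\mathcal{X}^2)$. Hence, if and only if the set of two-input functions $\{g_j\}$ forms a basis of the quotient space $\mathcal{G}(\mathcal{X}^2)/\mathcal{N}_W(\mathcal{X}^2)$, the set $\mathcal{W}_{\mathcal{X},W}$ coincides with the exponential family generated by $W$ with the generator $\{g_j\}$. This fact shows that $\mathcal{W}_{\mathcal{X},W}$ is an exponential family.

In particular, when $W$ is a positive transition matrix, the subspace $\mathcal{N}_W(\mathcal{X}^2)$ does not depend on $W$ and is abbreviated to $\mathcal{N}(\mathcal{X}^2)$. In this case, $\mathcal{W}_{\mathcal{X},W}$ is the set of positive transition matrices. It then does not depend on $W$, and is abbreviated to $\mathcal{W}_{\mathcal{X}}$.

We define the Fisher information matrix for the natural parameter by the Hesse matrix $H_{\vec{\theta}}[\phi] := \bigl[\frac{\partial^2\phi}{\partial\theta^i\partial\theta^j}(\vec{\theta})\bigr]_{i,j}$. The Fisher information matrix for the expectation parameter is given as $H_{\vec{\theta}}[\phi]^{-1}$. Further, for fixed values $\theta_o^{k+1},\ldots,\theta_o^{d}$, we call the subset $\{W_{\vec{\theta}} \in \mathcal{E} \mid \vec{\theta} = (\theta^1,\ldots,\theta^k,\theta_o^{k+1},\ldots,\theta_o^{d})\}$ an exponential subfamily of $\mathcal{E}$. The following are examples of an exponential family.

Example 1. Assume that $\mathcal{X} = \{0,1,\ldots,m\}$ and $W$ is a positive transition matrix, i.e., $\mathcal{X}^2_W = \mathcal{X}^2$. Define $g_{i,j}(x,x') = \delta_{x,i}\delta_{x',j}$ for $i = 1,\ldots,m$ and $j = 0,1,\ldots,m$. Then, the $m^2+m$ functions $g_{i,j}$ form a basis of the quotient space $\mathcal{G}(\mathcal{X}^2)/\mathcal{N}(\mathcal{X}^2)$. Therefore, the set of positive transition matrices forms an exponential family with the above choice of $g_{i,j}$.

Example 2. For a given subset $\mathcal{S} \subset \mathcal{X}^2$ with $\mathcal{X} = \{0,1,\ldots,m\}$, we choose a transition matrix $W$ whose support is $\mathcal{S}$. Define the subset $\tilde{\mathcal{S}}$ as $\{(i,j) \in \mathcal{S} \mid i \text{ is not the minimum integer satisfying } (i,j) \in \mathcal{S} \text{ for the fixed } j\}$. We define $g_{i,j}(x,x') = \delta_{x,i}\delta_{x',j}$ for $(i,j) \in \tilde{\mathcal{S}}$.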
Then, the set $\mathcal{W}_{\mathcal{X},W}$ is an exponential family generated by $\{g_{i,j}\}_{(i,j)\in\tilde{\mathcal{S}}}$. However, the set $\mathcal{W}_{\mathcal{X},W}$ is not an exponential subfamily of the set of positive transition matrices because it is not included in the set of positive transition matrices.

Remark 1. The above-defined exponential families contain exponential families of distributions as follows. For a given exponential family of distributions $P_\theta$ on $\mathcal{X}$ with the generator $f(x)$, we define the transition matrix $W(x|x')$ as $P_0(x)$ and the generator $g(x,x')$ as $f(x)$. Then, the exponential family $W_\theta(x|x')$ is $P_\theta(x)$. The potential function and the expectation parameter (defined in the next subsection) are the same as those for the exponential family of distributions $\{P_\theta\}$.

Remark 2. The papers [9, 10, 11, 12] called a family of transition matrices $\{W_\theta(x|x')\}$ an exponential family when $W_\theta(x|x')$ has the form
\[ W_\theta(x|x') = e^{C(x,x') + \theta g(x,x') - \psi(\theta,x')}. \tag{4.4} \]
The papers [14, 15, 16] extended the above definition to the continuous-time case. However, our exponential family is written as [5]
\[ W_\theta(x|x') = e^{C(x,x') + \theta g(x,x') + \psi(\theta,x) - \psi(\theta,x') - \phi(\theta)} \tag{4.5} \]
by choosing $C(x,x')$ and $\psi(\theta,x)$ as $\log W(x|x')$ and $\log\overline{P}^{3}_{\theta}(x)$, respectively. So, the traditional definition (4.4) is different from ours. The advantage of our model over their model is explained in Remark 3.

4.2. Mixture family

In the following, we assume that the functions $\{g_j\}$ satisfy the condition of Lemma 4.1. For fixed values $\eta_{o,1},\ldots,\eta_{o,k}$, we call the subset $\{W_{\vec{\theta}} \in \mathcal{E} \mid \vec{\eta}(\vec{\theta}) = (\eta_{o,1},\ldots,\eta_{o,k},\eta_{k+1},\ldots,\eta_d)\}$ a mixture subfamily of $\mathcal{E}$. Given a transition matrix $W$, real-valued functions $g_j$ on $\mathcal{X}^2$, and real numbers $b_j$, we say that the set $\{V \in \mathcal{W}_{\mathcal{X},W} \mid \sum_{x,x'} g_j(x,x')\,V(x|x')\,P_V(x') = b_j \ \forall j\}$ is a mixture family on $\mathcal{X}^2_W$ generated by the constraints $\{g_j = b_j\}$.
Note that a mixture family on $\mathcal{X}^2_W$ does not necessarily contain $W$ because its definition depends on the real numbers $b_j$. When $W$ is a positive transition matrix, it is simply called a mixture family generated by the constraints $\{g_j = b_j\}$ because $\mathcal{W}_{\mathcal{X},W}$ is the set of positive transition matrices. For a given transition matrix $W$ and two mixture families $\mathcal{M}_1$ and $\mathcal{M}_2$ on $\mathcal{X}^2_W$, the intersection $\mathcal{M}_1\cap\mathcal{M}_2$ is also a mixture family on $\mathcal{X}^2_W$.

Lemma 4.2. The intersection of the mixture family on $\mathcal{X}^2_W$ generated by the constraints $\{g_j = b_j\}_{j=1,\ldots,k}$ and the exponential family $\mathcal{W}_{\mathcal{X},W}$ is the mixture subfamily $\{W_{\vec{\theta}} \in \mathcal{W}_{\mathcal{X},W} \mid \vec{\eta}(\vec{\theta}) = (b_1,\ldots,b_k,\eta_{k+1},\ldots,\eta_d)\}$ of the exponential family $\mathcal{W}_{\mathcal{X},W}$.

Lemma 4.2 will be shown after Lemma 5.1 in Section 5. Here, we give examples for mixture families.
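
As a numerical companion to the definitions of Sections 4.1 and 4.2, the following Python sketch builds one member of a one-parameter exponential family of transition matrices. Its normalization is the standard one: divide by the Perron-Frobenius eigenvalue and conjugate by the Perron-Frobenius eigenvector of the transpose. It then checks that the stationary expectation $\sum_{x,x'} g(x,x')\,W_\theta(x|x')\,P_{W_\theta}(x')$ that appears in the mixture-family constraint agrees with the expectation parameter $\eta(\theta) = \frac{d\phi}{d\theta}(\theta)$. This is only an illustration under stated assumptions, not the authors' code: the three-state matrix `W3`, the generator `g_stay`, all function names, the use of power iteration (which assumes a positive or at least primitive matrix), and the finite-difference derivative are hypothetical choices.

```python
import math

# Hypothetical positive three-state chain: W3[x][xp] = W(x | xp); columns sum to 1.
W3 = [[0.5, 0.2, 0.3],
      [0.3, 0.6, 0.1],
      [0.2, 0.2, 0.6]]


def g_stay(x, xp):
    """Illustrative generator: indicator that the chain stays in place."""
    return 1.0 if x == xp else 0.0


def _power_iteration(M, iters=5000):
    """Perron-Frobenius eigenvalue and sum-normalized eigenvector of M."""
    d = len(M)
    v = [1.0 / d] * d
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = sum(w)
        v = [wi / lam for wi in w]
    return lam, v


def exponential_family_member(W, g, theta):
    """Return (phi(theta), W_theta, stationary distribution of W_theta)."""
    d = len(W)
    tilted = [[W[x][xp] * math.exp(theta * g(x, xp)) for xp in range(d)]
              for x in range(d)]
    lam, right = _power_iteration(tilted)            # eigenvector of the tilted matrix
    transposed = [[tilted[xp][x] for xp in range(d)] for x in range(d)]
    _, left = _power_iteration(transposed)           # eigenvector of its transpose
    # Normalization: divide by lam and conjugate by the left eigenvector.
    W_theta = [[left[x] * tilted[x][xp] / (lam * left[xp]) for xp in range(d)]
               for x in range(d)]
    z = sum(left[x] * right[x] for x in range(d))
    stationary = [left[x] * right[x] / z for x in range(d)]
    return math.log(lam), W_theta, stationary


def expectation_parameter(W, g, theta):
    """Stationary expectation of g under W_theta (left side of the mixture constraint)."""
    _, W_theta, p = exponential_family_member(W, g, theta)
    d = len(W)
    return sum(g(x, xp) * W_theta[x][xp] * p[xp]
               for x in range(d) for xp in range(d))


if __name__ == "__main__":
    theta, h = 0.7, 1e-5
    eta = expectation_parameter(W3, g_stay, theta)
    dphi = (exponential_family_member(W3, g_stay, theta + h)[0]
            - exponential_family_member(W3, g_stay, theta - h)[0]) / (2 * h)
    print(eta, dphi)  # the two numbers should agree up to discretization error
```

Read this way, fixing the value $b$ of the constraint $\sum_{x,x'} g(x,x')V(x|x')P_V(x') = b$ and restricting to the exponential family amounts to fixing $\eta(\theta) = b$, which is the content of Lemma 4.2 in the one-parameter case.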
