Information-geometrical characterization of statistical models which are statistically equivalent to probability simplexes

Hiroshi Nagaoka
Graduate School of Informatics and Engineering
The University of Electro-Communications
Chofu, Tokyo 182-8585, Japan
Email: [email protected]

arXiv:1701.07736v2 [cs.IT] 8 Feb 2017

Abstract: The probability simplex is the set of all probability distributions on a finite set and is the most fundamental object in finite probability theory. In this paper we give a characterization of statistical models on finite sets which are statistically equivalent to probability simplexes in terms of $\alpha$-families, including exponential families and mixture families. The subject has a close relation to some fundamental aspects of information geometry such as $\alpha$-connections and autoparallelity.

I. AN INTRODUCTORY EXAMPLE

Let $\mathcal{X} = \{0,1,2\}$ and let $M = \{p_\lambda \mid 0 < \lambda < 1\}$ be the set of probability distributions on $\mathcal{X}$ of the form

\[ p_\lambda = (p_\lambda(0), p_\lambda(1), p_\lambda(2)) = (\lambda, (1-\lambda)/2, (1-\lambda)/2). \]

The statistical model $M$ has the following three properties. Firstly, it is a mixture family since

\[ p_\lambda = \lambda\,(1,0,0) + (1-\lambda)\,(0,1/2,1/2). \]

Secondly, it is an exponential family since

\[ \log p_\lambda = \theta F - \psi(\theta), \]

where $\theta = \log(2\lambda/(1-\lambda))$, $(F(0),F(1),F(2)) = (1,0,0)$ and $\psi(\theta) = -\log((1-\lambda)/2) = \log(2 + e^{\theta})$. Lastly, $M$ is statistically equivalent to the 1-dimensional open probability simplex $\mathcal{P}_1 = \{(\lambda, 1-\lambda) \mid 0 < \lambda < 1\}$ in the sense that there exist a channel $V$ from $\{0,1\}$ to $\mathcal{X}$ and a channel $W$ from $\mathcal{X}$ to $\{0,1\}$ such that $M$ is the set of output distributions of $V$ for input distributions in $\mathcal{P}_1$ and that $V$ is invertible by $W$. The matrix representations of these channels are given by

\[ V = \begin{bmatrix} 1 & 0 \\ 0 & 1/2 \\ 0 & 1/2 \end{bmatrix}, \qquad W = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}. \]

Note that the invertibility $WV = I$ holds.

Our aim is to show the equivalence between the first two properties and the last one.

II. STATEMENT OF THE MAIN RESULT

We begin by giving some basic definitions which are necessary to state our problem.

For an arbitrary finite set $\mathcal{X}$, let $\bar{\mathcal{P}}(\mathcal{X})$ and $\mathcal{P}(\mathcal{X})$ be the sets of probability distributions and of strictly positive probability distributions on $\mathcal{X}$:

\[ \bar{\mathcal{P}}(\mathcal{X}) := \Bigl\{ p \Bigm| p : \mathcal{X} \to [0,1],\ \sum_{x} p(x) = 1 \Bigr\}, \]
\[ \mathcal{P}(\mathcal{X}) := \Bigl\{ p \Bigm| p : \mathcal{X} \to (0,1),\ \sum_{x} p(x) = 1 \Bigr\}. \]

In particular, for an arbitrary positive integer $d$, let

\[ \bar{\mathcal{P}}_d := \bar{\mathcal{P}}(\{0,1,\ldots,d\}), \qquad \mathcal{P}_d := \mathcal{P}(\{0,1,\ldots,d\}), \]

which we call the $d$-dimensional (closed and open) probability simplexes.

A mapping $\Gamma : \bar{\mathcal{P}}(\mathcal{X}) \to \bar{\mathcal{P}}(\mathcal{Y})$, where $\mathcal{X}$ and $\mathcal{Y}$ are finite sets, is called a Markov map when there exists a channel $W(y|x)$ from $\mathcal{X}$ to $\mathcal{Y}$ such that, for any $p \in \bar{\mathcal{P}}(\mathcal{X})$,

\[ \Gamma(p) = \sum_{x} W(\cdot\,|x)\, p(x), \]

i.e., $\Gamma(p)$ is the output distribution of the channel $W$ corresponding to the input distribution $p$. Note that a Markov map is affine: $\Gamma(\lambda p + (1-\lambda) q) = \lambda \Gamma(p) + (1-\lambda)\Gamma(q)$ for $\forall p, q \in \bar{\mathcal{P}}(\mathcal{X})$ and $0 \le \forall\lambda \le 1$.

Let $M$ and $N$ be smooth submanifolds (statistical models) of $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$, respectively. When there exists a pair of Markov maps $\Gamma : \bar{\mathcal{P}}(\mathcal{X}) \to \bar{\mathcal{P}}(\mathcal{Y})$ and $\Delta : \bar{\mathcal{P}}(\mathcal{Y}) \to \bar{\mathcal{P}}(\mathcal{X})$ such that their restrictions $\Gamma|_M$ and $\Delta|_N$ are bijections between $M$ and $N$ and are the inverse mappings of each other, we say that $M$ and $N$ are Markov equivalent or statistically equivalent and write $M \simeq N$.

The aim of this paper is to give a characterization of statistical models which are statistically equivalent to probability simplexes. The main result is as follows.

Theorem 1 For an arbitrary smooth submanifold $M$ of $\mathcal{P}(\mathcal{X})$, the following conditions are mutually equivalent.
(i) $M \simeq \mathcal{P}_d$, where $d = \dim M$.
(ii) $M$ is an exponential family and is a mixture family.
(iii) $\exists \alpha \neq \exists \beta$, $M$ is an $\alpha$-family and is a $\beta$-family.
(iv) $\forall \alpha$, $M$ is an $\alpha$-family.

Explanation of exponential family, mixture family and $\alpha$-family for arbitrary $\alpha \in \mathbb{R}$, as well as the proof of the theorem, will be presented in subsequent sections. Here we only give a few remarks on condition (i). Firstly, (i) is equivalent to the condition that $\exists d'$, $M \simeq \mathcal{P}_{d'}$, since if $M \simeq \mathcal{P}_{d'}$ then $M$ and $\mathcal{P}_{d'}$ must be diffeomorphic, so that $\dim M = \dim \mathcal{P}_{d'} = d'$. Secondly, (i) is equivalent to the condition $\bar{M} \simeq \bar{\mathcal{P}}_d$, where $\bar{M}$ denotes the topological closure of $M$, and means that $\bar{M}$ is the set of output distributions of an invertible (error-free) channel.

III. SOME FACTS ABOUT CONDITION (i)

From the definition of the relation $\simeq$, condition (i) implies that there exist $\Gamma : \bar{\mathcal{P}}(\mathcal{X}) \to \bar{\mathcal{P}}_d$ and $\Delta : \bar{\mathcal{P}}_d \to \bar{\mathcal{P}}(\mathcal{X})$ satisfying $\Gamma \circ \Delta = \mathrm{id}$ (the identity on $\bar{\mathcal{P}}_d$). Let $\{q_0, q_1, \ldots, q_d\} \subset \bar{\mathcal{P}}(\mathcal{X})$ be defined by

\[ \Delta(\delta_i) = q_i, \quad \forall i \in \{0,1,\ldots,d\}, \tag{1} \]

where $\delta_i$ is the delta distribution on $\{0,1,\ldots,d\}$ concentrated on $i$. Then it is easy to see, as is shown in Lemma 9.5 and its "Supplement" of [1] where our $\Delta$ is called a congruent embedding (of $\bar{\mathcal{P}}_d$ into $\bar{\mathcal{P}}(\mathcal{X})$), that the supports $A_i := \mathrm{supp}(q_i)$ constitute a partition of $\mathcal{X}$ in the sense that

\[ A_i \cap A_j = \emptyset \ \text{ if } i \neq j, \quad\text{and}\quad \bigcup_{i=0}^{d} A_i = \mathcal{X}, \tag{2} \]

and the left inverse $\Gamma$ of $\Delta$ is represented as

\[ \Gamma(p) = \sum_{i=0}^{d} p(A_i)\, \delta_i, \quad \forall p \in \bar{\mathcal{P}}(\mathcal{X}), \tag{3} \]

where $p(A_i) := \sum_{x \in A_i} p(x)$. In addition, condition (i) implies $M = \Delta(\mathcal{P}_d) := \{\Delta(\lambda) \mid \lambda \in \mathcal{P}_d\}$, so that from (1) we have

\[ M = \Bigl\{ \sum_{i=0}^{d} \lambda_i q_i \Bigm| (\lambda_0, \ldots, \lambda_d) \in \mathcal{P}_d \Bigr\}. \tag{4} \]

Conversely, if a statistical model $M \subset \mathcal{P}(\mathcal{X})$ is represented in the form (4) by a collection of $d+1$ distributions $\{q_i\}$ on $\mathcal{X}$ whose supports $\{A_i\}$ constitute a partition of $\mathcal{X}$, then we see that $M$ satisfies condition (i) by defining $\Delta$ and $\Gamma$ by (1) and (3). Thus a necessary and sufficient condition for (i) is obtained, which will be used in later arguments to prove the theorem.

IV. $\alpha$-FAMILY, e-FAMILY AND m-FAMILY

Following the way developed in [5] (see also [3], [4]), we give the definition of $\alpha$-family, which includes that of exponential family and mixture family as special cases.

For an arbitrary $\alpha \in \mathbb{R}$, define a function $L^{(\alpha)} : \mathbb{R}^{+} (= (0,\infty)) \to \mathbb{R}$ by (see Footnote 1)

\[ L^{(\alpha)}(u) = \begin{cases} u^{\frac{1-\alpha}{2}} & (\alpha \neq 1) \\ \log u & (\alpha = 1). \end{cases} \tag{5} \]

The function $L^{(\alpha)}$ is naturally extended to a mapping $(\mathbb{R}^{+})^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ ($f \mapsto L^{(\alpha)}(f)$) by

\[ \bigl(L^{(\alpha)}(f)\bigr)(x) = L^{(\alpha)}(f(x)). \tag{6} \]

For a submanifold $M$ of $\mathcal{P}(\mathcal{X})$, its denormalization $\tilde{M}$ is defined by

\[ \tilde{M} := \{ \tau p \mid p \in M \text{ and } \tau \in \mathbb{R}^{+} \}, \tag{7} \]

where $\tau p$ denotes the function $\mathcal{X} \ni x \mapsto \tau p(x) \in \mathbb{R}^{+}$. The denormalization is an extended manifold obtained by relaxing the normalization constraint $\sum_x p(x) = 1$. Obviously, $\tilde{M}$ is a submanifold of $\tilde{\mathcal{P}}(\mathcal{X})$, and $\tilde{\mathcal{P}}(\mathcal{X}) = (\mathbb{R}^{+})^{\mathcal{X}}$ is an open subset of $\mathbb{R}^{\mathcal{X}}$. When the image

\[ L^{(\alpha)}(\tilde{M}) = \bigl\{ L^{(\alpha)}(\tau p) \bigm| p \in M \text{ and } \tau \in \mathbb{R}^{+} \bigr\} \]

forms an open subset of an affine subspace, say $Z$, of $\mathbb{R}^{\mathcal{X}}$, $M$ is called an $\alpha$-family. In this paper, it is assumed for simplicity that $M$ is maximal in the sense that

\[ L^{(\alpha)}(\tilde{M}) = Z \cap L^{(\alpha)}\bigl((\mathbb{R}^{+})^{\mathcal{X}}\bigr). \tag{8} \]

Since it follows from the definition (5) of $L^{(\alpha)}$ that

\[ L^{(\alpha)}\bigl((\mathbb{R}^{+})^{\mathcal{X}}\bigr) = \begin{cases} (\mathbb{R}^{+})^{\mathcal{X}} & (\alpha \neq 1) \\ \mathbb{R}^{\mathcal{X}} & (\alpha = 1), \end{cases} \]

(8) is written as

\[ L^{(\alpha)}(\tilde{M}) = \begin{cases} Z \cap (\mathbb{R}^{+})^{\mathcal{X}} & (\alpha \neq 1) \\ Z & (\alpha = 1). \end{cases} \tag{9} \]

Note that, as is pointed out in section 2.6 of [4], an affine subspace $Z$ satisfying (9) must be a linear subspace when $\alpha \neq 1$. Note also that $\mathcal{P}(\mathcal{X})$ is an $\alpha$-family for $\forall \alpha \in \mathbb{R}$, corresponding to the case when $Z = \mathbb{R}^{\mathcal{X}}$.

Footnote 1: $L^{(\alpha)}(u)$ can be replaced with $a L^{(\alpha)}(u) + b$ by arbitrary constants $a \neq 0$ and $b$, possibly depending on $\alpha$. In [3], [4], [5], these constants are properly chosen so that the $\pm\alpha$-duality and the limit of $\alpha \to 1$ can be treated in a convenient way.

When $\alpha = 1$, the notion of $\alpha$-family is equivalent to that of exponential family, whose general form is $M = \{p_\theta \mid \theta = (\theta^1, \ldots, \theta^d) \in \mathbb{R}^d\}$ such that

\[ p_\theta(x) = \exp\Bigl[ C(x) + \sum_{i=1}^{d} \theta^i F_i(x) - \psi(\theta) \Bigr], \tag{10} \]

where $C, F_1, \ldots, F_d$ are functions on $\mathcal{X}$ and $\psi$ is a function on $\mathbb{R}^d$ defined by

\[ \psi(\theta) = \log \sum_{x} \exp\Bigl[ C(x) + \sum_{i=1}^{d} \theta^i F_i(x) \Bigr]. \tag{11} \]

When $\alpha = -1$, on the other hand, the notion of $\alpha$-family is equivalent to that of mixture family, whose general form is $M = \{p_\theta \mid \theta = (\theta^1, \ldots, \theta^d) \in \Theta\}$ such that

\[ p_\theta(x) = C(x) + \sum_{i=1}^{d} \theta^i F_i(x), \tag{12} \]

where $F_1, \ldots, F_d$ are functions on $\mathcal{X}$ satisfying $\sum_x F_i(x) = 0$ and $\Theta := \{\theta \in \mathbb{R}^d \mid \forall x,\ p_\theta(x) > 0\}$.

When $\alpha \neq 1$, the general form of $\alpha$-family $M = \{p_\theta \mid \theta = (\theta^1, \ldots, \theta^d) \in \Theta\}$ is

\[ p_\theta(x) = \Bigl\{ \sum_{j=0}^{d} \xi^j(\theta) F_j(x) \Bigr\}^{\frac{2}{1-\alpha}}. \tag{13} \]

See §2.6 of [4] for further details.

V. PROOF OF (i) ⇒ (iv)

Assume (i), which implies that there exists a collection of $d+1$ probability distributions $\{q_i\} \subset \bar{\mathcal{P}}(\mathcal{X})$ whose supports $\{A_i\}$ constitute a partition of $\mathcal{X}$ and that $M$ is represented as (4). Then the denormalization $\tilde{M}$ is represented as

\[ \tilde{M} = \Bigl\{ \sum_{i=0}^{d} \lambda_i q_i \Bigm| (\lambda_0, \ldots, \lambda_d) \in (\mathbb{R}^{+})^{d+1} \Bigr\}. \tag{14} \]

Let $\alpha$ be an arbitrary real number such that $\alpha \neq 1$. Since $L^{(\alpha)}(0) = 0$ in this case, it follows from the disjointness of the supports of $\{q_i\}$ that

\[ L^{(\alpha)}\Bigl( \sum_i \lambda_i q_i \Bigr) = \sum_i \lambda_i^{\frac{1-\alpha}{2}} L^{(\alpha)}(q_i) \]

for any $(\lambda_0, \ldots, \lambda_d) \in (\mathbb{R}^{+})^{d+1}$. From this we have

\[ \begin{aligned} L^{(\alpha)}(\tilde{M}) &= \Bigl\{ \sum_{i=0}^{d} \lambda_i^{\frac{1-\alpha}{2}} L^{(\alpha)}(q_i) \Bigm| (\lambda_0, \ldots, \lambda_d) \in (\mathbb{R}^{+})^{d+1} \Bigr\} \\ &= \Bigl\{ \sum_{i=0}^{d} \xi_i L^{(\alpha)}(q_i) \Bigm| (\xi_0, \ldots, \xi_d) \in (\mathbb{R}^{+})^{d+1} \Bigr\} \\ &= Z \cap (\mathbb{R}^{+})^{\mathcal{X}}, \end{aligned} \]

where $Z$ is the $(d+1)$-dimensional linear subspace of $\mathbb{R}^{\mathcal{X}}$ spanned by $L^{(\alpha)}(q_i)$, $i \in \{0,1,\ldots,d\}$. This proves that $M$ is an $\alpha$-family for any $\alpha \neq 1$.

Let $\alpha = 1$. For any $x \in \mathcal{X}$, we have

\[ \begin{aligned} L^{(1)}\Bigl( \sum_i \lambda_i q_i \Bigr)(x) &= \log\Bigl( \sum_i \lambda_i q_i(x) \Bigr) \\ &= \log(\lambda_j q_j(x)) \\ &= \log \lambda_j + \log q_j(x) \\ &= \sum_i (\log \lambda_i + \log q_i(x))\, 1_{A_i}(x), \end{aligned} \]

where $j$ denotes the element of $\{0,1,\ldots,d\}$ such that $x \in A_j$. Letting $C \in \mathbb{R}^{\mathcal{X}}$ be defined by $C(x) = \sum_i (\log q_i(x))\, 1_{A_i}(x)$, we have

\[ \begin{aligned} L^{(1)}(\tilde{M}) &= \Bigl\{ C + \sum_{i=0}^{d} (\log \lambda_i)\, 1_{A_i} \Bigm| (\lambda_0, \ldots, \lambda_d) \in (\mathbb{R}^{+})^{d+1} \Bigr\} \\ &= \Bigl\{ C + \sum_{i=0}^{d} \xi_i 1_{A_i} \Bigm| (\xi_0, \ldots, \xi_d) \in \mathbb{R}^{d+1} \Bigr\}, \end{aligned} \]

which is an affine subspace of $\mathbb{R}^{\mathcal{X}}$. This proves that $M$ is a 1-family (an exponential family).

The implication (i) ⇒ (iv) has thus been proved.

VI. EQUIVALENCE OF (ii), (iii) AND (iv)

The implications (iv) ⇒ (ii) ⇒ (iii) are obvious. To see (iii) ⇒ (iv), some results of information geometry are invoked.

Remark 1: The notion of affine connections appears only in this section. Since the implication (ii) ⇒ (i) will be proved in the next section without using affine connections (at least explicitly), we do not need them in proving the equivalence of the conditions of Theorem 1 except for (iii).

We first introduce some concepts from general differential geometry. Let $S$ be a smooth manifold and denote by $\mathcal{T}(S)$ the set of smooth vector fields on $S$. Here, by a vector field on $S$ we mean a mapping, say $X$, such that $X : S \ni p \mapsto X_p \in T_p(S)$, where $T_p(S)$ denotes the tangent space of $S$ at $p$. An affine connection on $S$ is represented by a mapping $\nabla : \mathcal{T}(S) \times \mathcal{T}(S) \ni (X, Y) \mapsto \nabla_X Y \in \mathcal{T}(S)$, which is called a covariant derivative, satisfying certain conditions. Let $M$ be a smooth submanifold of $S$. Then $\nabla$ is naturally defined on $\mathcal{T}(M) \times \mathcal{T}(M)$, so that $\nabla_X Y$ is defined for any vector fields $X, Y$ on $M$. However, the value $\nabla_X Y$ in this case is a mapping $M \ni p \mapsto (\nabla_X Y)_p \in T_p(S)$ in general and is not a vector field on $M$ (i.e., $\nabla_X Y \notin \mathcal{T}(M)$) unless

\[ (\nabla_X Y)_p \in T_p(M), \quad \forall p \in M. \tag{15} \]

When (15) holds for $\forall X, Y \in \mathcal{T}(M)$, $M$ is said to be autoparallel w.r.t. $\nabla$ or $\nabla$-autoparallel in $S$.

Let $\nabla$, $\nabla'$ and $\nabla''$ be affine connections on $S$ for which there exists a real number $a$ satisfying (see Footnote 2)

\[ \nabla'' = a \nabla + (1-a) \nabla'. \tag{16} \]

If a submanifold $M$ is $\nabla$-autoparallel and $\nabla'$-autoparallel, then it is also $\nabla''$-autoparallel. This implication is obvious from $(\nabla''_X Y)_p = a (\nabla_X Y)_p + (1-a)(\nabla'_X Y)_p$ and the autoparallelity condition (15), and will be invoked later.

Footnote 2: For arbitrary affine connections $\nabla$ and $\nabla'$, their affine combination $a\nabla + (1-a)\nabla'$ always becomes an affine connection.

As was independently introduced by Čencov [1] and Amari [2], a one-parameter family of affine connections, which are called the $\alpha$-connections ($\alpha \in \mathbb{R}$), is defined on a manifold of probability distributions. After Amari's notation, the $\alpha$-connection is written in the form of affine combination

\[ \nabla^{(\alpha)} = \frac{1+\alpha}{2} \nabla^{(1)} + \frac{1-\alpha}{2} \nabla^{(-1)}, \tag{17} \]

which implies that

\[ \nabla^{(\gamma)} = \frac{\gamma - \beta}{\alpha - \beta} \nabla^{(\alpha)} + \frac{\alpha - \gamma}{\alpha - \beta} \nabla^{(\beta)} \tag{18} \]

for any $\alpha, \beta, \gamma \in \mathbb{R}$ such that $\alpha \neq \beta$.

When a submanifold $M$ of $S$ is autoparallel w.r.t. the $\alpha$-connection in $S$, we say that $M$ is $\alpha$-autoparallel in $S$. Since (18) is of the form (16), it follows that if $M$ is $\alpha$-autoparallel and $\beta$-autoparallel in $S$ for some $\alpha \neq \beta$, then it is $\gamma$-autoparallel in $S$ for all $\gamma \in \mathbb{R}$. On the other hand, it was shown in [5] (see also section 2.6 of [4]) that, for any submanifold $M$ of $\mathcal{P}(\mathcal{X})$ and for any real number $\alpha$, $M$ is an $\alpha$-family if and only if $M$ is $\alpha$-autoparallel in $\mathcal{P}(\mathcal{X})$. Combination of these two results proves (iii) ⇒ (iv).

Remark 2: Since the e-connection and the m-connection are dual w.r.t. the Fisher information metric [3], [4], [5], condition (ii) is a special case of doubly autoparallelity introduced by Ohara; see [6], [7] and the references cited therein. It is pointed out in [7] that the $\alpha$-autoparallelity for all $\alpha$ follows from that for $\alpha = \pm 1$.

VII. PROOF OF (ii) ⇒ (i)

Assume (ii), which means that there exist two affine subspaces $Z^{(e)}$ and $Z^{(m)}$ of $\mathbb{R}^{\mathcal{X}}$ such that

\[ L^{(e)}(\tilde{M}) = \{ \log \mu \mid \mu \in \tilde{M} \} = Z^{(e)}, \tag{19} \]
\[ L^{(m)}(\tilde{M}) = \tilde{M} = Z^{(m)} \cap (\mathbb{R}^{+})^{\mathcal{X}}, \tag{20} \]

where $L^{(e)} := L^{(1)}$ and $L^{(m)} := L^{(-1)}$. Let $V^{(e)}$ and $V^{(m)}$ be the linear spaces of translation vectors of $Z^{(e)}$ and $Z^{(m)}$, respectively, so that we have $Z^{(e)} = f + V^{(e)}$ for any $f \in Z^{(e)}$ and $Z^{(m)} = g + V^{(m)}$ for any $g \in Z^{(m)}$ (see Footnote 3).

Footnote 3: Actually, $Z^{(m)}$ is a linear space as mentioned in section IV, and therefore $Z^{(m)} = V^{(m)}$.

Lemma 1 $V^{(e)}$ is closed w.r.t. multiplication of functions; i.e., $a, b \in V^{(e)} \Rightarrow ab \in V^{(e)}$, where the product $ab$ is defined by $(ab)(x) = a(x) b(x)$.

Proof. The map

\[ \Phi := L^{(e)}|_{\tilde{M}} : \tilde{M} \ni \mu \mapsto \log \mu \in Z^{(e)} \]

is a diffeomorphism from $\tilde{M} = Z^{(m)} \cap (\mathbb{R}^{+})^{\mathcal{X}}$, which is an open subset of $Z^{(m)}$, onto $Z^{(e)}$. The differential map of $\Phi$ at a point $\mu \in \tilde{M}$ is defined by

\[ (d\Phi)_\mu \Bigl( \frac{d\mu(t)}{dt}\Big|_{t=0} \Bigr) = \frac{d\Phi(\mu(t))}{dt}\Big|_{t=0} \]

for any smooth curve $\mu(t)$ in $\tilde{M}$ with $\mu(0) = \mu$, and is represented as

\[ (d\Phi)_\mu : V^{(m)} \ni f \mapsto \frac{f}{\mu} \in V^{(e)}. \]

This gives a linear isomorphism from $V^{(m)}$ onto $V^{(e)}$. Therefore, for any two points $\mu, \nu \in \tilde{M}$, we can define

\[ (d\Phi)_\nu \circ (d\Phi)_\mu^{-1} : V^{(e)} \ni a \mapsto \frac{\mu a}{\nu} \in V^{(e)}. \]

This means that, for any $a \in V^{(e)}$ and any $\mu, \nu \in \tilde{M}$, we have $\frac{\mu a}{\nu} \in V^{(e)}$. For arbitrary $a \in V^{(e)}$ and $\nu \in \tilde{M}$, let us define a map $\Psi_{a,\nu}$ by

\[ \Psi_{a,\nu} : \tilde{M} \ni \mu \mapsto \frac{\mu a}{\nu} \in V^{(e)}. \]

Then its differential at a point $\mu \in \tilde{M}$ is given by

\[ (d\Psi_{a,\nu})_\mu : V^{(m)} \ni g \mapsto \frac{g a}{\nu} \in V^{(e)}. \]

Composing this map with the inverse of

\[ (d\Phi)_\nu : V^{(m)} \ni g \mapsto \frac{g}{\nu} \in V^{(e)}, \]

we have

\[ (d\Psi_{a,\nu})_\mu \circ (d\Phi)_\nu^{-1} : V^{(e)} \ni b \mapsto ab \in V^{(e)}. \]

This proves that $a, b \in V^{(e)} \Rightarrow ab \in V^{(e)}$.

Lemma 2 $V^{(e)}$ contains the constant functions on $\mathcal{X}$.

Proof. From the definition (7) of $\tilde{M}$, for any $\mu \in \tilde{M}$ and any positive constant $\tau = e^{c}$, we have $\tau\mu \in \tilde{M}$. This implies that both $\log \mu$ and $\log(\tau\mu)$ belong to $Z^{(e)}$, and hence the translation $\log(\tau\mu) - \log\mu = \log\tau = c$ belongs to $V^{(e)}$.

These two lemmas state that $V^{(e)}$ is a subalgebra of the commutative algebra $\mathbb{R}^{\mathcal{X}}$ with the unit element $1$ (the constant function $x \mapsto 1$) of $\mathbb{R}^{\mathcal{X}}$ contained in $V^{(e)}$. From a well known result on such subalgebras (see Footnote 4), it is concluded that there exists a partition $\{A_i\}_{i=0}^{d}$ of $\mathcal{X}$ such that

\[ V^{(e)} = \Bigl\{ \sum_{i=0}^{d} c_i 1_{A_i} \Bigm| (c_0, \ldots, c_d) \in \mathbb{R}^{d+1} \Bigr\}. \tag{21} \]

Footnote 4: Although various mathematical extensions of this result including infinite-dimensional and/or noncommutative versions are known, the author of the present paper could find no appropriate reference describing the result for the finite-dimensional commutative case with an elementary proof. So, we give a proof in the appendix for the readers' sake.

Let an element $p_0$ of $M$ ($\subset \tilde{M}$) be arbitrarily fixed. Then we have

\[ Z^{(e)} = \log p_0 + V^{(e)}. \tag{22} \]

From (19), (21) and (22) and the disjointness of $\{A_i\}$, we have

\[ \begin{aligned} \tilde{M} &= \{ \mu \mid \log\mu \in Z^{(e)} \} \\ &= \{ \mu \mid \log\mu - \log p_0 \in V^{(e)} \} \\ &= \Bigl\{ \mu \Bigm| \exists (c_0, \ldots, c_d) \in \mathbb{R}^{d+1},\ \log\mu = \log p_0 + \sum_{i=0}^{d} c_i 1_{A_i} \Bigr\} \\ &= \Bigl\{ p_0 \sum_{i=0}^{d} e^{c_i} 1_{A_i} \Bigm| (c_0, \ldots, c_d) \in \mathbb{R}^{d+1} \Bigr\} \\ &= \Bigl\{ \sum_{i=0}^{d} \lambda_i q_i \Bigm| (\lambda_0, \ldots, \lambda_d) \in (\mathbb{R}^{+})^{d+1} \Bigr\}, \end{aligned} \]

where

\[ q_i := \frac{1}{p_0(A_i)}\, p_0 1_{A_i}, \quad i \in \{0, \ldots, d\}. \]

Then $\{q_i\}$ are probability distributions on $\mathcal{X}$ whose supports are $\mathrm{supp}(q_i) = A_i$, and

\[ M = \tilde{M} \cap \mathcal{P}(\mathcal{X}) = \Bigl\{ \sum_{i=0}^{d} \lambda_i q_i \Bigm| (\lambda_0, \ldots, \lambda_d) \in \mathcal{P}_d \Bigr\}. \]

Since this is the same form as (4), condition (i) has been derived.

VIII. CONCLUSION

We have shown Theorem 1, which gives an information-geometrical characterization of statistical models on finite sample spaces which are statistically equivalent to open probability simplexes $\mathcal{P}_d$. The statistical equivalence (also called the Markov equivalence) to probability simplexes played a crucial role in Čencov's pioneering work [1] on information geometry, where the notions of Fisher information metric and the $\alpha$-connections were characterized in terms of the statistical equivalence. The present work sheds new light on the relation between the statistical equivalence and information geometry.

ACKNOWLEDGMENT

This work was partially supported by JSPS KAKENHI Grant Number 16K00012.

REFERENCES

[1] N. N. Čencov (Chentsov), Statistical Decision Rules and Optimal Inference, AMS, 1982 (original Russian edition: Nauka, Moscow, 1972).
[2] S. Amari, "Differential geometry of curved exponential families - curvature and information loss", The Annals of Statistics, 10, 357-385, 1982.
[3] S. Amari, Differential-Geometrical Methods in Statistics, Springer, Lecture Notes in Statistics 28, 1985.
[4] S. Amari and H. Nagaoka, Methods of information geometry, AMS & OUP, 2000.
[5] H. Nagaoka and S. Amari, "Differential geometry of smooth families of probability distributions", Technical Report METR 82-7, Dept. of Math. Eng. and Instr. Phys, Univ. of Tokyo, 1982. (http://www.keisu.t.u-
[6] A. Ohara, "Information geometric analysis of an interior point method for semidefinite programming", Geometry in Present Day Science (eds. O. E. Barndorff-Nielsen and E. B. V. Jensen), pp. 49-74, World Scientific, 1999.
[7] A. Ohara, "Geodesics for dual connections and means on symmetric cones", Integr. equ. oper. theory, 50, 537-548, 2004.

APPENDIX

Proposition Let $\mathcal{X}$ be a finite set and $V$ be a subalgebra of $\mathbb{R}^{\mathcal{X}}$ containing the constant functions. Then there exists a partition $\{A_i\}_{i=1}^{n}$ of $\mathcal{X}$ such that

\[ V = \Bigl\{ \sum_{i=1}^{n} c_i 1_{A_i} \Bigm| (c_1, \ldots, c_n) \in \mathbb{R}^{n} \Bigr\}. \tag{23} \]

Proof. Let

\[ \mathcal{B} := \{ f^{-1}(\lambda) \mid \lambda \in \mathbb{R},\ f \in V \} \subset 2^{\mathcal{X}}, \tag{24} \]

which is the totality of the level sets $f^{-1}(\lambda) = \{x \mid f(x) = \lambda\} \subset \mathcal{X}$ of functions in $V$. We first show that, for any $B \subset \mathcal{X}$,

\[ B \in \mathcal{B} \iff 1_B \in V. \tag{25} \]

Since $\Leftarrow$ is obvious, it suffices to show $\Rightarrow$. Assume $B \in \mathcal{B}$, so that $B = f^{-1}(\lambda)$ for some $f \in V$ and $\lambda \in \mathbb{R}$. When $B$ is the empty set $\emptyset$, we have $1_B = 0 \in V$. So we assume $B \neq \emptyset$, which means that $\lambda \in f(\mathcal{X})$. Let the elements of $f(\mathcal{X})$ be $\lambda_0, \lambda_1, \ldots, \lambda_k$, where $\lambda_0 = \lambda$ and $\lambda_i \neq \lambda_j$ if $i \neq j$, and let $B_i := f^{-1}(\lambda_i)$. Then we have $f = \sum_{i=0}^{k} \lambda_i 1_{B_i}$ with $B_0 = B$. Let $a(t) = a_0 t^{k} + a_1 t^{k-1} + \cdots + a_k$ be a polynomial satisfying $a(\lambda_0) = 1$ and $a(\lambda_i) = 0$ for any $i \neq 0$. Explicitly, $a(t)$ is expressed as

\[ a(t) = \prod_{i=1}^{k} \frac{t - \lambda_i}{\lambda_0 - \lambda_i}. \]

It follows that

\[ a(f) = \sum_{i=0}^{k} a(\lambda_i) 1_{B_i} = 1_{B_0} = 1_B. \]

In addition, $a(f) = a_0 f^{k} + a_1 f^{k-1} + \cdots + a_k$ belongs to $V$ since $V$ is a subalgebra of $\mathbb{R}^{\mathcal{X}}$ with $1 \in V$. Hence we have $1_B \in V$.

Using (25), we see that

\[ \mathcal{X} \in \mathcal{B}, \tag{26} \]
\[ B \in \mathcal{B} \Rightarrow B^{c} \in \mathcal{B}, \tag{27} \]
\[ B_1, B_2 \in \mathcal{B} \Rightarrow B_1 \cap B_2 \in \mathcal{B} \tag{28} \]

as

\[ 1_{\mathcal{X}} = 1 \in V \Rightarrow \mathcal{X} \in \mathcal{B}, \tag{29} \]
\[ B \in \mathcal{B} \Rightarrow 1_B \in V \Rightarrow 1_{B^{c}} = 1 - 1_B \in V \Rightarrow B^{c} \in \mathcal{B}, \tag{30} \]
\[ B_1, B_2 \in \mathcal{B} \Rightarrow 1_{B_1}, 1_{B_2} \in V \Rightarrow 1_{B_1 \cap B_2} = 1_{B_1} 1_{B_2} \in V \Rightarrow B_1 \cap B_2 \in \mathcal{B}. \tag{31} \]

Properties (26)-(28) imply that $\mathcal{B}$ is an additive class of sets ($\sigma$-algebra) on the finite entire set $\mathcal{X}$. Therefore, $\mathcal{B}$ is generated by a partition $\{A_1, \ldots, A_n\}$ of $\mathcal{X}$ in the sense that every element of $\mathcal{B}$ is the union of some (or no) elements of $\{A_1, \ldots, A_n\}$. Recalling the definition (24) of $\mathcal{B}$, we conclude (23).
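
The interpolation step in the proof of the Proposition can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes the subalgebra $V$ is given by a finite list of generating functions on $\{0, \ldots, n-1\}$ (plus the constants), represents each function as a tuple of values, and uses exact rational arithmetic. The names `indicator_via_polynomial` and `partition_from_generators` are hypothetical. The second function relies on the observation that, for such a generated algebra, the atoms of $\mathcal{B}$ are the joint level sets of the generators.

```python
from fractions import Fraction


def indicator_via_polynomial(f, lam):
    """Evaluate a(f), where a is the interpolation polynomial of the proof:
    a(lam) = 1 and a(mu) = 0 for every other value mu taken by f.
    The result is the indicator function of the level set f^{-1}(lam)."""
    others = sorted(set(f) - {lam})
    result = []
    for value in f:
        acc = Fraction(1)
        for mu in others:
            # One factor (t - mu) / (lam - mu) of the polynomial a(t), at t = value.
            acc *= Fraction(value - mu) / Fraction(lam - mu)
        result.append(acc)
    return tuple(result)


def partition_from_generators(generators, n):
    """Group the points of {0, ..., n-1} by the values all generators take on them.
    Assumption: V is the algebra generated by `generators` and the constants,
    so these joint level sets play the role of the partition {A_i} in (23)."""
    blocks = {}
    for x in range(n):
        key = tuple(g[x] for g in generators)
        blocks.setdefault(key, []).append(x)
    return sorted(blocks.values())


if __name__ == "__main__":
    f = (2, 5, 5, 7)          # a function on X = {0, 1, 2, 3}
    g = (0, 0, 1, 1)          # a second, hypothetical generator
    print(indicator_via_polynomial(f, 5))
    print(partition_from_generators([f], 4))
    print(partition_from_generators([f, g], 4))
```

With a single generator the partition has one block per distinct value; adding a second generator can only refine it, which mirrors the closure of $\mathcal{B}$ under intersections in (28).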
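
As a further numerical illustration, the three properties claimed for the introductory example of Section I can be verified directly. This is a sketch under stated assumptions: channels are stored as lists of rows with `V[x][i]` standing for $V(x|i)$ and `W[i][x]` for $W(i|x)$, and the helper names `matmul`, `apply_channel`, `p_lambda` and `exponential_form` are introduced here for illustration only.

```python
import math
from fractions import Fraction as Fr

# Channel matrices of Section I (assumed layout: one row per output symbol).
V = [[Fr(1), Fr(0)], [Fr(0), Fr(1, 2)], [Fr(0), Fr(1, 2)]]
W = [[Fr(1), Fr(0), Fr(0)], [Fr(0), Fr(1), Fr(1)]]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def apply_channel(channel, p):
    """Output distribution of a channel for the input distribution p."""
    return [sum(row[j] * p[j] for j in range(len(p))) for row in channel]


def p_lambda(lam):
    """The model element p_lambda = (lam, (1-lam)/2, (1-lam)/2)."""
    return [lam, (1 - lam) / 2, (1 - lam) / 2]


def exponential_form(lam):
    """Evaluate exp(theta * F - psi(theta)) with theta = log(2 lam / (1 - lam)),
    F = (1, 0, 0) and psi(theta) = log(2 + e^theta), in floating point."""
    lam = float(lam)
    theta = math.log(2 * lam / (1 - lam))
    psi = math.log(2 + math.exp(theta))
    F = (1, 0, 0)
    return [math.exp(theta * F[x] - psi) for x in range(3)]


if __name__ == "__main__":
    lam = Fr(3, 10)
    print(matmul(W, V))                          # invertibility W V = I
    print(apply_channel(V, [lam, 1 - lam]))      # V maps (lam, 1-lam) into M
    print(apply_channel(W, p_lambda(lam)))       # W recovers (lam, 1-lam)
    print(exponential_form(lam))                 # exponential-family form of p_lambda
```

Here `W` is the map $\Gamma$ of (3) for the partition $A_0 = \{0\}$, $A_1 = \{1, 2\}$, and the columns of `V` are the distributions $q_0$, $q_1$ of (4).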