A maximum entropy theorem with applications to the measurement of biodiversity

Tom Leinster∗

Abstract

This is a preliminary article stating and proving a new maximum entropy theorem. The entropies that we consider can be used as measures of biodiversity. In that context, the question is: for a given collection of species, which frequency distribution(s) maximize the diversity? The theorem provides the answer. The chief surprise is that although we are dealing with not just a single entropy, but a one-parameter family of entropies, there is a single distribution maximizing all of them simultaneously.

Contents

1 Statement of the problem
2 Preparatory results
3 The main theorem
4 Corollaries and examples

This article is preliminary. Little motivation or context is given, and the proofs are probably not optimal. I hope to write this up in a more explanatory and polished way in due course.

∗School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8QW, UK; [email protected]. Supported by an EPSRC Advanced Research Fellowship.

1 Statement of the problem

Basic definitions  Fix an integer $n \geq 1$ throughout. A similarity matrix is an $n \times n$ symmetric matrix $Z$ with entries in the interval $[0,1]$, such that $Z_{ii} = 1$ for all $i$. A (probability) distribution is an $n$-tuple $p = (p_1, \ldots, p_n)$ with $p_i \geq 0$ for all $i$ and $\sum_{i=1}^n p_i = 1$.

Given a similarity matrix $Z$ and a distribution $p$, thought of as a column vector, we may form the matrix product $Zp$ (also a column vector), and we denote by $(Zp)_i$ its $i$th entry.

Lemma 1.1  Let $Z$ be a similarity matrix and $p$ a distribution. Then $p_i \leq (Zp)_i \leq 1$ for all $i \in \{1, \ldots, n\}$. In particular, if $p_i > 0$ then $(Zp)_i > 0$.

Proof  We have
$$(Zp)_i = \sum_{j=1}^n Z_{ij} p_j = p_i + \sum_{j \neq i} Z_{ij} p_j \geq p_i$$
and $(Zp)_i \leq \sum_{j=1}^n 1 \cdot p_j = 1$. $\Box$

Let $Z$ be a similarity matrix and let $q \in [0, \infty)$. The function $H_q^Z$ is defined on distributions $p$ by
$$H_q^Z(p) = \begin{cases} \dfrac{1}{q-1} \biggl( 1 - \displaystyle\sum_{i \colon p_i > 0} p_i (Zp)_i^{q-1} \biggr) & \text{if } q \neq 1 \\[2ex] -\displaystyle\sum_{i \colon p_i > 0} p_i \log (Zp)_i & \text{if } q = 1. \end{cases}$$
Lemma 1.1 guarantees that these definitions are valid. The definition in the case $q = 1$ is explained by the fact that $H_1^Z(p) = \lim_{q \to 1} H_q^Z(p)$ (easily shown using l'Hôpital's rule). We call $H_q^Z$ the entropy of order $q$.

Notes on the literature  The entropies $H_q^Z$ were introduced in this generality by Ricotta and Szeidl in 2006, as an index of the diversity of an ecological community [RS]. Think of $n$ as the number of species, $Z_{ij}$ as indicating the similarity of the $i$th and $j$th species, and $p_i$ as the relative abundance of the $i$th species. Ricotta and Szeidl used not similarities $Z_{ij}$ but dissimilarities or 'distances' $d_{ij}$; the formulas above become equivalent to theirs on putting $Z_{ij} = 1 - d_{ij}$.

The case $Z = I$ goes back further. Something very similar to $H_q^I$, using logarithms to base 2 rather than base $e$, appeared in information theory in 1967, in a paper of Havrda and Charvát [HC]. Later, the entropies $H_q^I$ were discovered in statistical ecology, in a 1982 paper of Patil and Taillie [PT]. Finally, they were rediscovered in physics, in a 1988 paper of Tsallis [Tsa].

Still in the case $Z = I$, certain values of $q$ give famous quantities. The entropy $H_1^I$ is Shannon entropy (except that Shannon used logarithms to base 2). The entropy $H_2^I$ is known in ecology as the Simpson or Gini–Simpson index; it is the probability that two individuals chosen at random are of different species.

For general $Z$, the entropy of order 2 is known as Rao's quadratic entropy [Rao]. It is usually stated in terms of the matrix with $(i,j)$-entry $1 - Z_{ij}$, that is, the matrix of dissimilarities mentioned above.
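To make the definition concrete, here is a minimal numerical sketch (our addition, not part of the paper; the function name `entropy` and the use of Python/NumPy are our own choices):

```python
import numpy as np

def entropy(Z, p, q):
    """Entropy H_q^Z(p) of order q, following the definition above.
    Z is an n x n similarity matrix; p is a length-n distribution.
    Sums run only over the indices i with p_i > 0, as in the text."""
    Zp = Z @ p        # the column vector Zp; (Zp)_i is its i-th entry
    i = p > 0         # by Lemma 1.1, (Zp)_i > 0 here, so logs and powers are safe
    if q == 1:
        return float(-np.sum(p[i] * np.log(Zp[i])))
    return float((1 - np.sum(p[i] * Zp[i] ** (q - 1))) / (q - 1))

Z = np.eye(3)                       # the special case Z = I
p = np.array([0.5, 0.25, 0.25])
print(entropy(Z, p, 1))             # Shannon entropy, base e: about 1.0397
print(entropy(Z, p, 2))             # Gini-Simpson index: 1 - sum(p_i^2) = 0.625
```

With $Z = I$ we have $(Zp)_i = p_i$, so the classical quantities mentioned above are recovered.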
One way to obtain a similarity matrix is to start with a finite metric space $\{a_1, \ldots, a_n\}$ and put $Z_{ij} = e^{-d(a_i, a_j)}$. Matrices of this kind are investigated in depth in [Lei2] and other papers cited therein. Here, metric spaces will only appear in two examples (4.4 and 4.5).

The maximum entropy problem  Let $Z$ be a similarity matrix and let $q \in [0, \infty)$. The maximum entropy problem is this:

For which distribution(s) $p$ is $H_q^Z(p)$ maximal, and what is the maximum value?

The solution is given in Theorem 3.2. The terms used in the statement of the theorem will be defined shortly. However, the following striking fact can be stated immediately:

There is a distribution maximizing $H_q^Z$ for all $q$ simultaneously.

So even though the entropies of different orders rank distributions differently, there is a distribution that is maximal for all of them.

For example, this fully explains the numerical coincidence noted in the Results section of [AKB].

Restatement in terms of diversity  Let $Z$ be a similarity matrix. For each $q \in [0, \infty)$, define a function $D_q^Z$ on distributions $p$ by
$$D_q^Z(p) = \begin{cases} \bigl( 1 - (q-1) H_q^Z(p) \bigr)^{\frac{1}{1-q}} & \text{if } q \neq 1 \\[1ex] e^{H_1^Z(p)} & \text{if } q = 1 \end{cases} \;=\; \begin{cases} \biggl( \displaystyle\sum_{i \colon p_i > 0} p_i (Zp)_i^{q-1} \biggr)^{\frac{1}{1-q}} & \text{if } q \neq 1 \\[2ex] \displaystyle\prod_{i \colon p_i > 0} (Zp)_i^{-p_i} & \text{if } q = 1. \end{cases}$$
We call $D_q^Z$ the diversity of order $q$. As for entropy, it is easily shown that $D_1^Z(p) = \lim_{q \to 1} D_q^Z(p)$.

These diversities were introduced informally in [Lei1], and are explained and developed in [LC]. The case $Z = I$ is well known in several fields: in information theory, $\log D_q^I$ is called the Rényi entropy of order $q$ [Rén]; in ecology, $D_q^I$ is called the Hill number of order $q$ [Hill]; and in economics, $1/D_q^I$ is the Hannah–Kay measure of concentration [HK].

The transformation between $H_q^Z$ and $D_q^Z$ is invertible and order-preserving (increasing). Hence the maximum entropy problem is equivalent to the maximum diversity problem:

For which distribution(s) $p$ is $D_q^Z(p)$ maximal, and what is the maximum value?

The solution is given in Theorem 3.1. It will be more convenient mathematically to work with diversity rather than entropy. Thus, we prove results about diversity and deduce results about entropy.

When stated in terms of diversity, a further striking aspect of the solution becomes apparent:

There is a distribution maximizing $D_q^Z$ for all $q$ simultaneously. The maximum value of $D_q^Z$ is the same for all $q$.

So every similarity matrix has an unambiguous 'maximum diversity', the maximum value of $D_q^Z$ for any $q$.

A similarity matrix may have more than one maximizing distribution, but the collection of maximizing distributions is independent of $q > 0$. In other words, a distribution that maximizes $D_q^Z$ for some $q$ actually maximizes $D_q^Z$ for all $q$ (Corollary 4.1).

The diversities $D_q^Z$ are closely related to generalized means [HLP], also called power means. Given a finite set $I$, positive real numbers $(x_i)_{i \in I}$, positive real numbers $(p_i)_{i \in I}$ such that $\sum_{i \in I} p_i = 1$, and $t \in \mathbb{R}$, the generalized mean of $(x_i)_{i \in I}$, weighted by $(p_i)_{i \in I}$, of order $t$, is
$$\begin{cases} \biggl( \displaystyle\sum_{i \in I} p_i x_i^t \biggr)^{1/t} & \text{if } t \neq 0 \\[2ex] \displaystyle\prod_{i \in I} x_i^{p_i} & \text{if } t = 0. \end{cases}$$
For example, if $p_i = p_j$ for all $i, j \in I$ then the generalized means of orders 1, 0 and $-1$ are the arithmetic, geometric and harmonic means, respectively.

Given a similarity matrix $Z$ and a distribution $p$, take $I = \{ i \in \{1, \ldots, n\} \mid p_i > 0 \}$. Then $1/D_q^Z(p)$ is the generalized mean of $((Zp)_i)_{i \in I}$, weighted by $(p_i)_{i \in I}$, of order $q - 1$.
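The following sketch (ours, with our own helper names `power_mean` and `diversity`) illustrates this description of $1/D_q^Z(p)$ as a generalized mean:

```python
import numpy as np

def power_mean(x, w, t):
    """Generalized mean of order t of the x_i, weighted by w (sum(w) = 1)."""
    if t == 0:
        return float(np.prod(x ** w))     # the order-0 mean is the geometric mean
    return float(np.sum(w * x ** t) ** (1 / t))

def diversity(Z, p, q):
    """D_q^Z(p): the reciprocal of the generalized mean of order q - 1 of the
    values (Zp)_i over the support of p, weighted by p."""
    Zp, i = Z @ p, p > 0
    return 1 / power_mean(Zp[i], p[i], q - 1)

Z = np.eye(3)
p = np.array([0.5, 0.25, 0.25])
print(diversity(Z, p, 0))   # 3.0: with Z = I, order 0 counts the species present
print(diversity(Z, p, 2))   # 1 / sum(p_i^2), about 2.667: the inverse Simpson index
```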
We deduce the following.

Lemma 1.2  Let $Z$ be a similarity matrix and $p$ a distribution. Then:

i. $D_q^Z(p)$ is continuous in $q \in [0, \infty)$;

ii. if $(Zp)_i = (Zp)_j$ for all $i, j$ such that $p_i, p_j > 0$ then $D_q^Z(p)$ is constant over $q \in [0, \infty)$; otherwise, $D_q^Z(p)$ is strictly decreasing in $q \in [0, \infty)$;

iii. $\lim_{q \to \infty} D_q^Z(p) = 1 / \max_{i \colon p_i > 0} (Zp)_i$.

Proof  All of these assertions follow from standard results on generalized means. Continuity is clear except perhaps at $q = 1$, where it follows from Theorem 3 of [HLP]. Part (ii) follows from Theorem 16 of [HLP], and part (iii) from Theorem 4. $\Box$

In the light of this, we define
$$D_\infty^Z(p) = 1 \Big/ \max_{i \colon p_i > 0} (Zp)_i.$$
There is no useful definition of $H_\infty^Z$, since $\lim_{q \to \infty} H_q^Z(p) = 0$ for all $Z$ and $p$.

2 Preparatory results

Here we make some definitions and prove some lemmas in preparation for solving the maximum diversity and entropy problems. Some of these definitions and lemmas can also be found in [Lei2] and [LW].

Convention: for the rest of this work, unlabelled summations $\sum$ are understood to be over all $i \in \{1, \ldots, n\}$ such that $p_i > 0$.

Weightings and magnitude

Definition 2.1  Let $Z$ be a similarity matrix. A weighting on $Z$ is a column vector $w \in \mathbb{R}^n$ such that
$$Zw = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}.$$
A weighting $w$ is non-negative if $w_i \geq 0$ for all $i$, and positive if $w_i > 0$ for all $i$.

Lemma 2.2  Let $w$ and $x$ be weightings on $Z$. Then $\sum_{i=1}^n w_i = \sum_{i=1}^n x_i$.

Proof  Write $u$ for the column vector $(1 \ \cdots \ 1)^t$, where $(\;)^t$ means transpose. Then
$$\sum_{i=1}^n w_i = u^t w = (Zx)^t w = x^t (Zw) = x^t u = \sum_{i=1}^n x_i,$$
using symmetry of $Z$. $\Box$

Definition 2.3  Let $Z$ be a similarity matrix on which there exists at least one weighting. Its magnitude is $|Z| = \sum_{i=1}^n w_i$, for any weighting $w$ on $Z$.

For example, if $Z$ is invertible then there is a unique weighting $w$ on $Z$, and $w_i$ is the sum of the $i$th row of $Z^{-1}$. So then
$$|Z| = \sum_{i,j=1}^n (Z^{-1})_{ij},$$
the sum of all $n^2$ entries of $Z^{-1}$. This formula also appears in [SP], [Shi] and [POP], for closely related reasons to do with diversity and its maximization.

Weight distributions

Definition 2.4  Let $Z$ be a similarity matrix. A weight distribution for $Z$ is a distribution $p$ such that $(Zp)_1 = \cdots = (Zp)_n$.

Lemma 2.5  Let $Z$ be a similarity matrix.

i. If $Z$ admits a non-negative weighting then $|Z| > 0$.

ii. If $w$ is a non-negative weighting on $Z$ then $w/|Z|$ is a weight distribution for $Z$, and this defines a one-to-one correspondence between non-negative weightings and weight distributions.

iii. If $Z$ admits a weight distribution then $Z$ admits a weighting and $|Z| > 0$.

iv. If $p$ is a weight distribution for $Z$ then $(Zp)_i = 1/|Z|$ for all $i$.

Proof
i. Let $w$ be a non-negative weighting. Certainly $|Z| = \sum_{i=1}^n w_i \geq 0$. Since we are assuming that $n \geq 1$, the vector $0$ is not a weighting, so $w_i > 0$ for some $i$. Hence $|Z| > 0$.

ii. The first part is clear. To see that this defines a one-to-one correspondence, take a weight distribution $p$, writing $(Zp)_i = K$ for all $i$. Since $\sum p_i = 1$, we have $p_i > 0$ for some $i$, and then $K = (Zp)_i > 0$ by Lemma 1.1. The vector $w = p/K$ is then a non-negative weighting. The two processes (passing from a non-negative weighting to a weight distribution, and vice versa) are easily shown to be mutually inverse.

iii. Follows from the previous parts.

iv. Follows from the previous parts. $\Box$

The first connection between magnitude and diversity is this:

Lemma 2.6  Let $Z$ be a similarity matrix and $p$ a weight distribution for $Z$. Then $D_q^Z(p) = |Z|$ for all $q \in [0, \infty]$.

Proof  By continuity, it is enough to prove this for $q \neq 1, \infty$. In that case, using Lemma 2.5(iv),
$$D_q^Z(p) = \Bigl( \sum p_i (Zp)_i^{q-1} \Bigr)^{\frac{1}{1-q}} = \Bigl( \sum p_i |Z|^{1-q} \Bigr)^{\frac{1}{1-q}} = |Z|,$$
as required. $\Box$

Invariant distributions

Definition 2.7  Let $Z$ be a similarity matrix. A distribution $p$ is invariant if $D_q^Z(p) = D_{q'}^Z(p)$ for all $q, q' \in [0, \infty]$.
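As a small illustration (our addition; it assumes $Z$ is invertible, so that the weighting is unique and NumPy's linear solver applies), weightings and magnitude can be computed by solving the linear system $Zw = (1, \ldots, 1)^t$:

```python
import numpy as np

def weighting(Z):
    """The weighting on an invertible Z: the solution w of Zw = (1,...,1)^t."""
    return np.linalg.solve(Z, np.ones(Z.shape[0]))

def magnitude(Z):
    """|Z|: the sum of the entries of a weighting (well defined by Lemma 2.2).
    For invertible Z this equals the sum of all n^2 entries of Z^{-1}."""
    return float(weighting(Z).sum())

Z = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
w = weighting(Z)
print(w, magnitude(Z))          # w = [2/3, 2/3, 1], |Z| = 7/3
# This weighting is non-negative, so w/|Z| is a weight distribution
# (Lemma 2.5(ii)), and every entry of Z(w/|Z|) equals 1/|Z| (Lemma 2.5(iv)):
print(Z @ (w / magnitude(Z)))   # [3/7, 3/7, 3/7]
```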
Soon we will classify the invariant distributions. To do so, we need some more notation and a lemma.

Given a similarity matrix $Z$ and a subset $B \subseteq \{1, \ldots, n\}$, let $Z_B$ be the matrix $Z$ restricted to $B$, so that $(Z_B)_{ij} = Z_{ij}$ ($i, j \in B$). If $B$ has $m$ elements then $Z_B$ is an $m \times m$ matrix, but it will be more convenient to index the rows and columns of $Z_B$ by the elements of $B$ themselves than by $1, \ldots, m$.

We will also need to consider distributions on subsets of $\{1, \ldots, n\}$. A distribution on $B \subseteq \{1, \ldots, n\}$ is said to be invariant, a weight distribution, etc., if it is invariant, a weight distribution, etc., with respect to $Z_B$. Similarly, we will sometimes speak of 'weightings on $B$', meaning weightings on $Z_B$. Distributions are understood to be on $\{1, \ldots, n\}$ unless specified otherwise.

Lemma 2.8  Let $Z$ be a similarity matrix, let $B \subseteq \{1, \ldots, n\}$, and let $r$ be a distribution on $B$. Write $p$ for the distribution obtained by extending $r$ by zero. Then $D_q^{Z_B}(r) = D_q^Z(p)$ for all $q \in [0, \infty]$. In particular, $r$ is invariant if and only if $p$ is.

Proof  For $i \in B$ we have $r_i = p_i$ and $(Z_B r)_i = (Zp)_i$. The result follows immediately from the definition of diversity of order $q$. $\Box$

By Lemma 2.6, any weight distribution is invariant, and by Lemma 2.8, any extension by zero of a weight distribution is also invariant. We will prove that these are all the invariant distributions there are.

For a distribution $p$ we write $\mathrm{supp}(p) = \{ i \in \{1, \ldots, n\} \mid p_i > 0 \}$, the support of $p$.

Let $Z$ be a similarity matrix. Given $\emptyset \neq B \subseteq \{1, \ldots, n\}$ and a non-negative weighting $w$ on $Z_B$, let $\tilde{w}$ be the distribution obtained by first taking the weight distribution $w/|Z_B|$ on $B$, then extending by zero to $\{1, \ldots, n\}$.

Proposition 2.9  Let $Z$ be a similarity matrix and $p$ a distribution. The following are equivalent:

i. $p$ is invariant;

ii. $(Zp)_i = (Zp)_j$ for all $i, j \in \mathrm{supp}(p)$;

iii. $p$ is the extension by zero of a weight distribution on a nonempty subset of $\{1, \ldots, n\}$;

iv. $p = \tilde{w}$ for some non-negative weighting $w$ on some nonempty subset of $\{1, \ldots, n\}$.

Proof  (i⇒ii): Follows from Lemma 1.2.

(ii⇒iii): Suppose that (ii) holds, and write $B = \mathrm{supp}(p)$. The distribution $p$ on $\{1, \ldots, n\}$ restricts to a distribution $r$ on $B$. This $r$ is a weight distribution on $B$, since for all $i \in B$ we have $(Z_B r)_i = (Zp)_i$, which by (ii) is constant over $i \in B$. Clearly $p$ is the extension by zero of $r$.

(iii⇒i): Suppose that $p$ is the extension by zero of a weight distribution $r$ on a nonempty subset $B \subseteq \{1, \ldots, n\}$. Then for all $q \in [0, \infty]$,
$$D_q^Z(p) = D_q^{Z_B}(r) = |Z_B|$$
by Lemmas 2.8 and 2.6 respectively; hence $p$ is invariant.

(iii⇔iv): Follows from Lemma 2.5. $\Box$

There is at least one invariant distribution on any given similarity matrix. For we may choose $B$ to be a one-element subset, which has a unique non-negative weighting $w = (1)$, and this gives the invariant distribution $\tilde{w} = (0, \ldots, 0, 1, 0, \ldots, 0)$.

Maximizing distributions

Definition 2.10  Let $Z$ be a similarity matrix. Given $q \in [0, \infty]$, a distribution $p$ is $q$-maximizing if $D_q^Z(p) \geq D_q^Z(p')$ for all distributions $p'$. A distribution is maximizing if it is $q$-maximizing for all $q \in [0, \infty]$.

It makes no difference to the definition of 'maximizing' if we omit $q = \infty$; nor does it make a difference to either definition if we replace diversity $D_q^Z$ by entropy $H_q^Z$.

We will eventually show that every similarity matrix has a maximizing distribution.
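Proposition 2.9 suggests a recipe for producing invariant distributions, sketched below (our code, with the hypothetical name `invariant_from_subset`; it assumes $Z_B$ is invertible with a non-negative weighting):

```python
import numpy as np

def diversity(Z, p, q):
    """D_q^Z(p), as defined in Section 1 (finite q)."""
    Zp, i = Z @ p, p > 0
    if q == 1:
        return float(np.prod(Zp[i] ** -p[i]))
    return float(np.sum(p[i] * Zp[i] ** (q - 1)) ** (1 / (1 - q)))

def invariant_from_subset(Z, B):
    """The distribution written w-tilde in the text: take the weight
    distribution w/|Z_B| on B, then extend by zero to {1,...,n}."""
    B = list(B)
    w = np.linalg.solve(Z[np.ix_(B, B)], np.ones(len(B)))  # weighting on Z_B
    assert (w >= 0).all()          # needed for w/|Z_B| to be a distribution
    p = np.zeros(Z.shape[0])
    p[B] = w / w.sum()
    return p

Z = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.2],
              [0.2, 0.2, 1.0]])
p = invariant_from_subset(Z, [0, 1])          # p = (1/2, 1/2, 0)
print((Z @ p)[p > 0])                         # equal entries: condition (ii)
print([round(diversity(Z, p, q), 6) for q in (0, 1, 2, 5)])  # constant in q
```

All four printed diversities equal $|Z_B| = 4/3$, as the proof of (iii⇒i) predicts.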
Lemma 2.11  Let $Z$ be a similarity matrix and $p$ an invariant distribution. Then $p$ is 0-maximizing if and only if it is maximizing.

Proof  A maximizing distribution is 0-maximizing by definition. Conversely, suppose that $p$ is 0-maximizing. Then for all $q \in [0, \infty]$ and all distributions $p'$,
$$D_q^Z(p) = D_0^Z(p) \geq D_0^Z(p') \geq D_q^Z(p'),$$
using invariance in the first step and Lemma 1.2 in the last. $\Box$

Lemma 2.12  Let $Z$ be a similarity matrix and $B \subseteq \{1, \ldots, n\}$. Suppose that $\sup_r D_0^{Z_B}(r) \geq \sup_p D_0^Z(p)$, where the first supremum is over distributions $r$ on $B$ and the second is over distributions $p$ on $\{1, \ldots, n\}$. Suppose also that $Z_B$ admits an invariant maximizing distribution. Then so does $Z$.

(In fact, $\sup_r D_0^{Z_B}(r) \leq \sup_p D_0^Z(p)$ in any case, by Lemma 2.8. So the '$\geq$' in the statement of the present lemma could equivalently be replaced by '$=$'.)

Proof  Let $r$ be an invariant maximizing distribution on $Z_B$. Define a distribution $p$ on $\{1, \ldots, n\}$ by extending $r$ by zero. By Lemma 2.8, $p$ is invariant. Using Lemma 2.8 again,
$$D_0^Z(p) = D_0^{Z_B}(r) = \sup_{r'} D_0^{Z_B}(r') \geq \sup_{p'} D_0^Z(p'),$$
so $p$ is 0-maximizing. Then by Lemma 2.11, $p$ is maximizing. $\Box$

Decomposition  Let $Z$ be a similarity matrix. Subsets $B$ and $B'$ of $\{1, \ldots, n\}$ are complementary (for $Z$) if $B \cup B' = \{1, \ldots, n\}$, $B \cap B' = \emptyset$, and $Z_{ii'} = 0$ for all $i \in B$ and $i' \in B'$. For example, there exist nonempty complementary subsets if $Z$ can be expressed as a nontrivial block sum
$$\begin{pmatrix} X & 0 \\ 0 & X' \end{pmatrix}.$$

Given a distribution $p$ and a subset $B \subseteq \{1, \ldots, n\}$ such that $p_i > 0$ for some $i \in B$, let $p|_B$ be the distribution on $B$ defined by
$$(p|_B)_i = \frac{p_i}{\sum_{j \in B} p_j}.$$

Lemma 2.13  Let $Z$ be a similarity matrix, and let $B$ and $B'$ be nonempty complementary subsets of $\{1, \ldots, n\}$. Then:

i. For any weightings $v$ on $Z_B$ and $v'$ on $Z_{B'}$, there is a weighting $w$ on $Z$ defined by
$$w_i = \begin{cases} v_i & \text{if } i \in B \\ v'_i & \text{if } i \in B'. \end{cases}$$

ii. For any invariant distributions $r$ on $B$ and $r'$ on $B'$, there exists an invariant distribution $p$ on $\{1, \ldots, n\}$ such that $p|_B = r$ and $p|_{B'} = r'$.

Proof
i. For $i \in B$, we have
$$(Zw)_i = \sum_{j \in B} Z_{ij} v_j + \sum_{j \in B'} Z_{ij} v'_j = \sum_{j \in B} (Z_B)_{ij} v_j = (Z_B v)_i = 1.$$
Similarly, $(Zw)_i = 1$ for all $i \in B'$. So $w$ is a weighting.

ii. By Proposition 2.9, $r = \tilde{v}$ for some non-negative weighting $v$ on some nonempty subset $C \subseteq B$. Similarly, $r' = \tilde{v}'$ for some non-negative weighting $v'$ on some nonempty $C' \subseteq B'$. By (i), there is a non-negative weighting $w$ on the nonempty set $C \cup C'$ defined by
$$w_i = \begin{cases} v_i & \text{if } i \in C \\ v'_i & \text{if } i \in C'. \end{cases}$$
Let $p = \tilde{w}$, a distribution on $\{1, \ldots, n\}$, which is invariant by Proposition 2.9. For $i \in C$ we have
$$(p|_B)_i = \frac{p_i}{\sum_{j \in B} p_j} = \frac{w_i / |Z_{C \cup C'}|}{\sum_{j \in C} w_j / |Z_{C \cup C'}|} = \frac{w_i}{\sum_{j \in C} w_j} = \frac{v_i}{\sum_{j \in C} v_j} = r_i.$$
For $i \in B \setminus C$ we have
$$(p|_B)_i = \frac{p_i}{\sum_{j \in B} p_j} = 0 = r_i.$$
Hence $p|_B = r$, and similarly $p|_{B'} = r'$. $\Box$

Lemma 2.14  Let $Z$ be a similarity matrix, let $B$ and $B'$ be complementary subsets of $\{1, \ldots, n\}$, and let $p$ be a distribution on $\{1, \ldots, n\}$ such that $p_i > 0$ for some $i \in B$ and $p_i > 0$ for some $i \in B'$. Then
$$D_0^Z(p) = D_0^{Z_B}(p|_B) + D_0^{Z_{B'}}(p|_{B'}).$$

Proof  By definition,
$$D_0^Z(p) = \sum \frac{p_i}{(Zp)_i} = \sum_{i \in B \colon p_i > 0} \frac{p_i}{(Zp)_i} + \sum_{i \in B' \colon p_i > 0} \frac{p_i}{(Zp)_i}.$$
Now for $i \in B$,
$$p_i = \Bigl( \sum_{j \in B} p_j \Bigr) (p|_B)_i$$
by definition of $p|_B$, and
$$(Zp)_i = \sum_{j \in B} Z_{ij} p_j + \sum_{j \in B'} Z_{ij} p_j = \sum_{j \in B} (Z_B)_{ij} p_j = \Bigl( \sum_{j \in B} p_j \Bigr) (Z_B \, p|_B)_i.$$
Similar equations hold for $B'$, so
$$D_0^Z(p) = \sum_{i \in B \colon p_i > 0} \frac{(p|_B)_i}{(Z_B \, p|_B)_i} + \sum_{i \in B' \colon p_i > 0} \frac{(p|_{B'})_i}{(Z_{B'} \, p|_{B'})_i} = D_0^{Z_B}(p|_B) + D_0^{Z_{B'}}(p|_{B'}),$$
as required. $\Box$

Proposition 2.15  Let $Z$ be a similarity matrix and let $B$ and $B'$ be nonempty complementary subsets of $\{1, \ldots, n\}$. Suppose that $Z_B$ and $Z_{B'}$ each admit an invariant maximizing distribution. Then so does $Z$.
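Lemma 2.14 can be checked numerically on a block sum; here is a short sketch (ours; `d0` is our own helper computing $D_0^Z$):

```python
import numpy as np

def d0(Z, p):
    """D_0^Z(p) = sum over supp(p) of p_i / (Zp)_i."""
    Zp, i = Z @ p, p > 0
    return float(np.sum(p[i] / Zp[i]))

# A nontrivial block sum, so B = {0,1} and B' = {2,3} are complementary for Z.
X  = np.array([[1.0, 0.4], [0.4, 1.0]])
Xp = np.array([[1.0, 0.7], [0.7, 1.0]])
Z  = np.block([[X, np.zeros((2, 2))], [np.zeros((2, 2)), Xp]])

p = np.array([0.1, 0.3, 0.2, 0.4])
pB, pBp = p[:2] / p[:2].sum(), p[2:] / p[2:].sum()   # p|_B and p|_{B'}

print(d0(Z, p))                    # about 2.4943
print(d0(X, pB) + d0(Xp, pBp))     # the same value, as Lemma 2.14 asserts
```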