ebook img

Two results on expected values of imbalance indices of phylogenetic trees PDF

0.12 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Two results on expected values of imbalance indices of phylogenetic trees

Two results on expected values of imbalance indices of phylogenetic trees Arnau Mir and Francesc Rossell´o Research Instituteof Health Science (IUNICS)and Department of Mathematics and Computer 2 Science, University of theBalearic Islands, E-07122 Palma deMallorca, Spain 1 {arnau.mir,cesc.rossello}@uib.es 0 2 n a Abstract. We compute an explicit formula for the expected value of the Colless J indexofaphylogenetictreegeneratedundertheYulemodel,andanexplicitformula 8 for the expected value of the Sackin index of a phylogenetic tree generated under 1 the uniform model. ] E 1 Introduction P . o i A phylogenetic tree is a representation of the shared evolutionary history of a set b of extant species. From the mathematical point a view, it is a leaf-labeled rooted - q tree, with its leaves representing the extant species under study, its internal nodes [ representing common ancestors of some of them, the root representing the most 1 recentcommonancestorofallofthem,andthearcsrepresentingdirectdescendance v throughmutations.Inthispaperweonlyconsiderbinary phylogenetictrees,where 4 8 every internal node has exactly two children. 8 There are several stochastic models of the evolutionary processes that produce 3 . phylogenetic trees. Two of the most popular are the Yule and the uniform models. 1 In the Yule, or Equal-Rate Markov, model [8,23], starting with a node, at every 0 2 step a leaf is chosen randomly and uniformly, and it is replaced by a cherry (a 1 phylogenetic tree consisting only of a root and two leaves). Finally, the labels : v are assigned randomly and uniformly to the leaves once the desired number of i X leaves is reached. Under this model of evolution, different trees with the same r numberofleaves may havedifferentprobabilities.Morespecifically, ifT isabinary a phylogenetic tree with n leaves, and for every internal node z we denote by κ (z) T the number of its descendant leaves, then the probability of T under the Yule model is [4,21] 2n−1 1 P (T)= . Y n! κ (v) 1 v inYternal T − On the other hand, the main feature of the uniform, or Proportional to Distin- guishable Arrangements, model [17] is that all phylogenetic trees with the same number of leaves have the same probability. From the point of view of tree growth, thiscorrespondstoaprocesswhere,startingwithanodelabeled1,atthek-thstep a new pendant arc, ending in the leaf labeled k+1, is added either to a new root or to some edge (with all possible locations of this new pendant arc equiprobable). Notice that this is not an explicit model of evolution, only of tree growth. Thestudyoftheprobabilisticdistributionsofindicesassociatedtophylogenetic trees under different stochastic models of phylogenetic tree growth has received a lotofinterestinthelastdecades[12,19].Theultimategoalofthislineofresearchis to be able to take as null model some stochastic model of phylogenetic tree growth and evaluate against it the indices of a sample of phylogenetic trees reconstructed from data. Two of the most popular indices used in this connection, measuring the degree of symmetry, or balance, of a tree, are Sackin’s [18] and Colless’ [6] indices, which we define later in the main body of this paper, but there are many other measures associated to phylogenetic trees that have been used in this context, like forinstance other imbalance indices[7,Chap.33] orthenumberofcherries of trees [20]. Several properties of the distributions of Sackin’s S and Colless’ C indices have been studied in the literature under different models [3,9,13,14,15,16,21]. In particular, their expected values have been studied under the Yule and the uni- form model. The results published so far on these expected values have been the following. Let S and C be the random variables defined by choosing a binary n n phylogenetictreeT withnleaves andcomputingS(T)orC(T),respectively.Then: – Under the Yule model, n E (S ) = 2n 1/j [9]. Y n • jP=2 E (C ) = nlog(n) + (γ 1 log(2))n + o(n) [3], where γ is the Euler Y n • − − constant. – Under the uniform model, E (S ) √πn3/2 [22]. U n • ∼ E (C ) √πn3/2 [3]. U n • ∼ And, for instance, these are the formulas used by the R package apTreeshape [1] to compute the expected value of these indices for a given numberof leaves. Let us alsomention thatRogers [15]foundarecursiveformulaforthemoment-generating functions of C and S, which allowed him to compute E (C ) and E (C ) for Y n U n n = 1,...,50, but he did not obtain any explicit formula for them. In this paper we obtain explicit formulas for E (C ) and E (S ). Namely, Y n U n ⌊n/2⌋ 1 E (C )= n +δ (n), Y n odd j Xj=2 where δ (n) = 1 if n is odd, and δ (n)= 0 if n is even, and odd odd n 2, 2, 2 n EU(Sn) = 2n 33F2(cid:18)1, 4 2−n ;2(cid:19), − − where F is a hypergeometric function [2] that can be directly computed with 3 2 many software systems, like Mathematica or R. These formulas thus contribute to our knowledge of the probability distributions of these indices, and yield precise values which can be used is tests. 2 Preliminaries and notations In this paper, by a phylogenetic tree on a set S of taxa we mean a binary rooted tree with its leaves bijectively labeled in the set S. To simplify the language, we shall always identify a leaf of a phylogenetic tree with its label. We shall also use the term phylogenetic tree with n leaves to refer to a phylogenetic tree on the set 1,...,n . We shall denote by L(T) the set of leaves of a phylogenetic tree T and { } by V (T) its set of internal nodes. int Let (S) be the set of isomorphism classes of phylogenetic trees on a set S of T taxa, and set = ( 1,...,n ). It is well known [7, Ch.3] that = 1 and, for n 1 T T { } |T | every n > 2, = (2n 3)!! = (2n 3)(2n 5) 3 1. n |T | − − − ··· · Whenever there exists a path from u to v in a phylogenetic tree T, we shall say that v is a descendant of u. The cluster of a node v in T is the set C (v) of its T descendant leaves, an we shall denote by κ (v) the cardinal C (v), that is, the T T | | number of descendant leaves of v. The depth δ (v) of a node v in a phylogenetic T tree T is the length (number of arcs) of the unique path from the root r to v. Given two phylogenetic trees T ,T on disjoint sets of taxa S ,S , respectively, 1 2 1 2 we shall denote by T T the tree on S S obtained by connecting the roots of 1 2 1 2 ∪ T and T to a (new) common parent r. Every tree in is obtained as T T , 1 2 n k n−k for some subset S b1,...,n with k elements (withT1 6 k 6 n 1), some tree k T on S and some⊆tr{ee T o}n Sc = 1,...,n S : actually, if−we perfborm in k k n−k k { }\ k this order the choices necessary to produce a tree T in this way, we obtain n ∈ T every tree in twice. n T An ordered m-forest on a set S is an ordered sequence of m phylogenetic trees (T ,T ,...,T ), each T on a set S of taxa, such that these sets S are pairwise 1 2 m i i i disjoint and their union is S.Let bethe set of isomorphismclasses of ordered m,n F m-forests on 1,...,n . It is known (see, for instance, [11, Lem. 1]) that for every { } n > m > 1, (2n m 1)!m = − − . |Fm,n| (n m)!2n−m − 3 Expected value of the Colless index under the Yule model Let T be a phylogenetic tree. For every v V (T), the balance value of v is int ∈ bal (v) = κ (v ) κ (v ), where v and v are its children. The Colless index T T 1 T 2 1 2 | − | [6] of a phylogenetic tree T is n ∈ T C(T)= bal (v). T X v∈Vint(T) Lemma 1. If T and T , then k k n−k n−k ∈ T ∈ T (a) C(T T ) = C(T )+C(T )+ 2n k k n−k k n−k | − | 2 (b) P (Tb T ) = P (T )P (T ) Y k n−k (n 1) n Y k Y n−k − k b (cid:0) (cid:1) where P denotes the probability of a phylogenetic tree under the Yule model. Y Proof. Assertion (a) is well known, and a direct consequence of the definition of C. Assertion (b) is a direct consequence of the explicit probabilities of T , T k n−k and T T under the Yule model, and the fact that V (T T ) = V (T ) k n−k int k n−k int k ∪ V (T ) r , these unions being disjoint. int n−k b ∪{ } b Lemma 2. Let I : R be a mapping such that, for every phylogenetic n∈NTn → trees T1,T2 on disjoiSnt sets of taxa S1,S2, respectively, I(T T ) = I(T )+I(T )+f(S , S ) 1 2 1 2 1 2 | | | | b for some mapping f : N N R. For every n > 1, let I be the random variable n × → that chooses a tree T and computes I(T ), and let E (I ) be its expected n n n Y n ∈ T value under the Yule model. Then, n−1 n−1 1 E (I )= 2 E (I )+ f(k,n k) Y n Y k n 1 − − (cid:16) Xk=1 Xk=1 (cid:17) Proof. We compute E (I ) using its very definition and Lemma 1.(b): Y n E (I ) = I(T ) p (T ) Y n n Y n · TXn∈Tn n−1 = I(T T ) p (T T ) k n−k Y k n−k · Xk=1 Sk⊂X{1,...,n} Tk∈XT(Sk)Tn−kX∈T(Sk) |Sk|=k b b n−1 1 n = (I(T )+I(T ) k n−k 2 (cid:18)k(cid:19) Xk=1 TXk∈TkTn−Xk∈Tn−k 2 +f(k,n k)) P (T )P (T ) − · (n 1) n Y k Y n−k − k 1 n−1 (cid:0) (cid:1) = (I(T )+I(T )+f(k,n k))P (T )P (T ) k n−k Y k Y n−k n 1 − − Xk=1XTk TXn−k n−1 1 = I(T )P (T )P (T ) k Y k Y n−k n 1 − Xk=1(cid:16)XTk TXn−k + I(T )P (T )P (T ) n−k Y k Y n−k XTk TXn−k + f(k,n k)P (T )P (T ) Y k Y n−k − XTk TXn−k (cid:17) n−1 1 = I(T )P (T )+ I(T )P (T )+f(k,n k) k Y k n−k Y n−k n 1 − − Xk=1(cid:16)XTk TXn−k (cid:17) n−1 1 = (E (I )+E (I )+f(k,n k)) Y k Y n−k n 1 − − Xk=1 n−1 n−1 1 = 2 E (I )+ f(k,n k) Y k n 1 − − (cid:16) Xk=1 Xk=1 (cid:17) as we claimed. Mappings I satisfying the hypothesis in the previous lemma are a special case of binary recursive tree shape statistics in the sense of [10]. Theorem 1. Let C be the random variable that chooses a tree T and com- n n ∈ T putes its Colless index C(T ). Its expected value under the Yule model is n ⌊n/2⌋ 1 E (C )= n +δ (n), Y n odd j Xj=2 where δ (n) = 1 if n is odd, and δ (n) = 0 if n is even. odd odd Proof. To simplify the notations, we shall denote E (C ) simply by E . By Lem- Y n n mas 1.(a) and 2, n−1 n−1 1 E = 2 E + n 2k . n k n 1 | − | − (cid:16) Xk=1 Xk=1 (cid:17) Now a simple computation shows that n(n 2) n−1 − if n is even n 2k =  2 Xk=1| − | (n−1)2 if n is odd 2   and therefore n(n 2) n−1 − if n is even 2 E = E + 2(n 1) n n−1 Xk=1 k n−−1 if n is odd 2   In order to obtain a recurrence of order one from this expression, we distinguish the case when n is even from the case when n is odd. – When n is even n−1 n−2 2 n(n 2) 2 n 2 E = E + − , E = E + − n k n−1 k n 1 2(n 1) n 2 2 − Xk=1 − − Xk=1 and then n−2 2 2 n(n 2) E = E + E + − n n−1 k n 1 n 1 2(n 1) − − Xk=1 − n−2 2 n 2 2 n 2 n 2 n 2 = E + − E + − − + − n−1 k n 1 n 1 · n 2 n 1 · 2 n 1 − − − Xk=1 − − 2 n 2 n 2 = E + − E + − n−1 n−1 n 1 n 1 n 1 −n n−2 − = E + − n−1 n 1 n 1 − − – When n is odd n−1 n−2 2 n 1 2 (n 1)(n 3) E = E + − , E = E + − − n k n−1 k n 1 2 n 2 2(n 2) − Xk=1 − Xk=1 − and then n−2 2 2 n 1 E = E + E + − n n−1 k n 1 n 1 2 − − Xk=1 n−2 2 n 2 2 n 2 (n 1)(n 3) = E + − E + − − − +1 n−1 k n 1 n 1 · n 2 n 1 · 2(n 2) − − − Xk=1 − − 2 n 2 = E + − E +1 n−1 n−1 n 1 n 1 −n − = E +1 n−1 n 1 − So, in summary, n 2 n − if n is even En = En−1+n 1 n 1 1 − if n is odd −  In particular, if n is even, n n 2 n n 1 n 2 E = E + − = − E +1 + − n n−1 n−2 n 1 n 1 n 1 n 2 n 1 −n − − (cid:16) − (cid:17) − = E +2 n−2 n 2 − Setting x = E /n, this equation becomes n n 2 x = x + n n−2 n whose solution (for even numbered terms) with x = E /2 = 0 is 2 2 n/2 1 x = . n i Xi=2 Therefore, when n is even, n/2 1 E = n , n i Xi=2 and when n is odd, (n−1)/2 ⌊n/2⌋ n n 1 1 E = E +1 = (n 1) +1 = 1+n n n−1 n 1 n 1 · − i i − − Xi=2 Xi=2 as we claimed. 4 Expected value of the Sackin index under the uniform model The Sackin index [18] of a phylogenetic tree T is defined as the sum of the n ∈ T depths of its leaves: n S(T) = δ (i). T Xi=1 Alternatively, S(T) = κ (v). T X v∈Vint(T) Let S be the random variable that chooses a tree T and computes its n n ∈ T Sackin indexS(T).Since,undertheuniformmodel,alltrees in haveprobability n T 1/((2n 3)!!), the expected value of S under the uniform model is n − S(T) T∈Tn . P(2n 3)!! − So, we need to compute the numerator in this fraction. Now, for every k = 1,...,n 1, let − c = T δ (1) = k . k,n n T |{ ∈ T | }| n−1 Lemma 3. For every n> 3, S(T) =n k c k,n · TX∈Tn Xk=1 Proof. Notice that, for every 16 i 6 n, T δ (i) = k = T δ (1) = k . n T n T |{ ∈ T | }| |{ ∈T | }| Then n n S(T)= δ (i) = δ (i) T T TX∈Tn TX∈TnXi=1 Xi=1TX∈Tn n n−1 = k T δ (i) = k n T ·|{ ∈ T | }| Xi=1Xk=1 n n−1 n−1 = k T δ (1) = k = n k c n T k,n ·|{ ∈ T | }| · Xi=1Xk=1 Xk=1 Lemma 4. For every n> 2 and k = 1,...,n 1, − (2n k 3)!k c = − − . k,n (n k 1)!2n−k−1 − − Proof. To compute c for k > 1, notice that every tree T such that δ(1) = k k,n n ∈ T will have the form described in Fig. 1. Therefore, it is determined by the ordered k-forest T ,T ,...,T on 2,...,n , and thus 1 2 k { } (2n k 3)!k c = = − − . k,n |Fk,n−1| (n k 1)!2n−k−1 − − .. T1 . T 2 1 T k Fig.1. The structure of a tree T with δT(1)=k. Now, recall that the (generalized) hypergeometric function F is defined [2] as p q F a1,...,ap;z = (a1)k···(ap)k zk, p q(cid:18)b1,..., bq (cid:19) Xk>0 (b1)k···(bq)k · k! where (a) := a (a + 1) (a + k 1). Many popular software systems, like k · ··· − Mathematica or R, have implementations of these functions. Theorem 2. The expected value of the random variable S under the uniform n model is n 2, 2, 2 n EU(Sn) = 2n 33F2(cid:18)1, 4 2−n ;2(cid:19) − − Proof. As we have already mentioned, S(T) n n−1 n n−1 (2n k 3)!k2 E (S ) = T∈Tn = k c = − − U n P(2n 3)!! (2n 3)!! · k,n (2n 3)!! (n k 1)!2n−k−1 − − Xk=1 − Xk=1 − − Now nk2(2n k 3)! nk2(2n k 3)!2n−2(n 2)! − − = − − − (2n 3)!!(n k 1)!2n−k−1 (2n 3)!(n k 1)!2n−k−1 − − − − − − nk2(2n k 3)!2n−2(n 2)!k! = − − − (2n 3)!(n k 1)!2n−k−1k! − − − nk22k−1 n−1 = k (n 1) (cid:0)2n−3(cid:1) − k (cid:0) (cid:1) and thus n−1 n−1 n E (S )= k22k−1 k U n n 1 · (cid:0)2n−3(cid:1) − Xk=1 k n n−1 k22k−1((cid:0)n 1(cid:1))(n 2)(n 3) (n k) = − − − ··· − n 1 (2n 3)(2n 4)(2n 5) (2n k 2) − Xk=1 − − − ··· − − n n−1 k22k−1(n 2)(n 3) (n k) = − − ··· − 2n 3 (2n 4)(2n 5) (2n k 2) − Xk=1 − − ··· − − n n−1k22k−1(2 n)(2 n+1) ( n+k) = − − ··· − 2n 3 (4 2n)(4 2n+1) (2 2n+k) − Xk=1 − − ··· − n n−2(k+1)22k(2 n)(2 n+1) (1 n+k) = − − ··· − 2n 3 (4 2n)(4 2n+1) (3 2n+k) − Xk=0 − − ··· − n ((k+1)!)2(2 n)(2 n+1) (1 n+k) 2k = − − ··· − · 2n 3 (k!)2(4 2n)(4 2n+1) (3 2n+k) − Xk>0 − − ··· − n (2)k(2)k(2 n)k 2k n 2, 2, 2 n = 2n 3 (1) ((4 −2n) · k! = 2n 33F2(cid:18)1, 4 2−n ;2(cid:19) − Xk>0 k − k − − as we claimed. 5 Conclusion InthispaperwehaveobtainedexplicitformulasfortheexpectedvalueoftheSackin index underthe uniform modeland the Colless index underthe Yule model. These results add up to the already known expected value of the Sackin index under the Yulemodel[9].For any n,theseexpected values areeasily computeddirectly using for instance the software system R, and can be used instead of their estimations in packages like apTreeshape [1] or SymmeTREE [5]. It remains open the problem of finding an explicit formula for the expected value of the Colless index under the uniform model. Acknowledgements. The research reported in this paper has been partially supported by the Spanish government and the UE FEDER program, through projects MTM2009-07165 and TIN2008-04487-E/TIN. We thank G. Cardona and M. Llabr´es for several comments on this work. References 1. N. Bortolussi, E. Durand, M. Blum, O. Franc¸ois, apTreeshape: statistical analysis of phylo- genetic tree shape. Bioinformatics 22 (2006), 363–364. 2. W.N. Bayley, Generalized Hypergeometric Series. Cambridge Tracts in Mathematics and Mathematical Physics 32, Stechert-HafnerServiceInc. (1964).

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.