JOURNAL OF COMPUTATIONAL BIOLOGY
Research Articles
Volume 24, Number 9, 2017
© Mary Ann Liebert, Inc.
Pp. 831–850
DOI: 10.1089/cmb.2016.0159

Enumeration of Ancestral Configurations for Matching Gene Trees and Species Trees

FILIPPO DISANTO and NOAH A. ROSENBERG

ABSTRACT

Given a gene tree and a species tree, ancestral configurations represent the combinatorially distinct sets of gene lineages that can reach a given node of the species tree. They have been introduced as a data structure for use in the recursive computation of the conditional probability under the multispecies coalescent model of a gene tree topology given a species tree, the cost of this computation being affected by the number of ancestral configurations of the gene tree in the species tree. For matching gene trees and species trees, we obtain enumerative results on ancestral configurations. We study ancestral configurations in balanced and unbalanced families of trees determined by a given seed tree, showing that for seed trees with more than one taxon, the number of ancestral configurations increases for both families exponentially in the number of taxa $n$. For fixed $n$, the maximal number of ancestral configurations tabulated at the species tree root node and the largest number of labeled histories possible for a labeled topology occur for trees with precisely the same unlabeled shape. For ancestral configurations at the root, the maximum increases with $k_0^n$, where $k_0 \approx 1.5028$ is a quadratic recurrence constant. Under a uniform distribution over the set of labeled trees of given size, the mean number of root ancestral configurations grows with $\sqrt{3/2}\,(4/3)^n$ and the variance with $\sim 1.4048\,(1.8215)^n$. The results provide a contribution to the combinatorial study of gene trees and species trees.

Keywords: combinatorics, gene trees, phylogenetics, species trees.
Department of Biology, Stanford University, Stanford, California.

1. INTRODUCTION

Investigations of the evolution of genomic regions along species tree branches have generated new combinatorial structures that can assist in studying gene trees and species trees (Maddison, 1997; Degnan and Salter, 2005; Than and Nakhleh, 2009; Degnan et al., 2012; Wu, 2012). Among these structures are ancestral configurations, structures that for a given gene tree topology and species tree topology describe the possible sets of gene lineages that can reach a given node of the species tree (Wu, 2012).

Ancestral configurations represent the set of objects over which recursive computations are performed in a fundamental calculation for inference of species trees from information on multiple genetic loci: the evaluation of gene tree probabilities conditional on species trees (Wu, 2012). Because of the appearance of ancestral configurations in sets over which sums are computed [e.g., Eq. (7) of Wu (2012)], solutions to enumerative problems involving ancestral configurations contribute to an understanding of the computational complexity of phylogenetic calculations.

Under the assumption that a gene tree and a species tree have a matching labeled topology $t$, we examine the number of ancestral configurations that can appear at the nodes of the species tree. Extending results of Wu (2012), whose appendix reported the number of ancestral configurations for caterpillar species trees and established a lower bound for completely balanced species trees, we study the number of ancestral configurations when $t$ belongs to families of trees characterized by a balanced or unbalanced pattern and a seed tree. As a special case, we derive upper and lower bounds on the number of ancestral configurations possessed by matching gene trees and species trees of given size. Finally, we study the mean and the variance of the number of ancestral configurations when $t$ is a random labeled tree of given size selected under a uniform distribution.
2. PRELIMINARIES

We study ancestral configurations for rooted binary labeled trees. We start with some definitions and preliminary results. In Section 2.1, we recall basic properties of rooted binary labeled trees. In Section 2.2, we recall properties of generating functions that will be used to derive some of our enumerative results. Following Wu (2012), in Section 2.3, we define ancestral configurations, and we determine a recursive procedure to compute their number for matching gene trees and species trees at a given species tree node. We then relate the total number of ancestral configurations in a tree to the number of ancestral configurations at the root of the tree.

2.1. Labeled topologies

A labeled topology, or tree for short, of size $|t| = n$ is a bifurcating rooted tree with $n$ labeled taxa (Fig. 1A). We assume without loss of generality a linear (alphabetical) order $a \prec b \prec c \prec \cdots$ among the set $\{a, b, c, \ldots\}$ of possible labels for the taxa of a tree. A tree of size $n$ has leaves labeled using the first $n$ labels in the order $\prec$. Given two trees $t_1$ and $t_2$, we write $t_1 \cong t_2$ and say that $t_1$ is isomorphic to $t_2$ when, removing labels at their taxa, $t_1$ and $t_2$ share the same unlabeled topology. The set of trees of size $n$ is denoted by $T_n$, and $T = \bigcup_{n \geq 1} T_n$ denotes the set of all trees of any size. The number of trees of size $n \geq 2$ can be computed as $|T_n| = (2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ (Felsenstein, 1978), which can be rewritten for $n \geq 1$ as

\[ |T_n| = \frac{(2n-2)!}{2^{n-1}(n-1)!} = \frac{(2n)!}{2^n (2n-1)\, n!}. \tag{1} \]

FIG. 1. A gene tree and a species tree with a matching labeled topology $t$. (A) A tree $t$ of size 6 isomorphic to the gene tree and species tree depicted in (B, C). Tree $t$ is characterized by its shape and by the labeling of its taxa.
It is convenient to label the internal nodes of $t$. We identify each lineage (edge) of $t$ by its immediate descendant node, so, for example, lineage $g$ results from the coalescence of lineages $a$ and $b$. (B) A possible realization $R_1$ of the gene tree in (A) (dotted lines) in the species tree with a matching topology (solid lines). The ancestral configuration at species tree node $\ell$ is $\{h, e, f\}$. The configuration at node $m$ is $\{a, b, \ell\}$. (C) A different realization $R_2$ of the gene tree in (A) in the matching species tree. The configurations at species tree nodes $\ell$ and $m$ are $\{c, d, i\}$ and $\{g, h, i\}$, respectively.

The exponential generating function associated with the sequence $|T_n|$ is defined as

\[ T(z) = \sum_{t \in T} \frac{z^{|t|}}{|t|!} = \sum_{n=1}^{\infty} \frac{|T_n| z^n}{n!} = z + \frac{z^2}{2} + \frac{3z^3}{6} + \frac{15z^4}{24} + \ldots, \tag{2} \]

and it is given by (Flajolet and Sedgewick, 2009, Example II.19)

\[ T(z) = 1 - \sqrt{1 - 2z}. \tag{3} \]

Throughout the article, most of our results are purely combinatorial. Where a probability distribution on the set of labeled topologies of a given size is needed, we assume a uniform probability distribution over the set of trees of given size.

2.2. Exponential growth and analytic combinatorics

Following Flajolet and Sedgewick (2009), a sequence of non-negative numbers $a_n$ is said to have exponential growth $k^n$ or, equivalently, to be of exponential order $k$ when

\[ \limsup_{n \to \infty} \left[(a_n)^{1/n}\right] = \lim_{n \to \infty} \left[ \sup_{m \geq n} \left[(a_m)^{1/m}\right] \right] = k. \]

This relationship can be rephrased as $a_n = k^n s(n)$, where $s$ is a subexponential factor, that is, $\limsup_{n \to \infty} [s(n)^{1/n}] = 1$. By these definitions, a sequence $a_n$ grows exponentially in $n$ when its exponential order strictly exceeds 1.
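As a quick numerical check of Equations (1)–(3), the following Python sketch computes $|T_n|$ both from the double factorial and from the closed form, and recovers the coefficients of $T(z)$ from the identity $T = z + T^2/2$, which follows by squaring $1 - T(z) = \sqrt{1-2z}$. The function names are illustrative assumptions and do not come from the article.

```python
from fractions import Fraction
from math import factorial


def num_labeled_topologies(n):
    """|T_n| = (2n-3)!! = 1*3*5*...*(2n-3); the empty product gives 1 for n = 1, 2."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


def closed_form(n):
    """First expression in Equation (1): (2n-2)! / (2^(n-1) (n-1)!)."""
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))


def egf_coefficients(n_max):
    """Coefficients [z^n] T(z) for n = 0..n_max, obtained from T = z + T^2 / 2."""
    t = [Fraction(0), Fraction(1)]
    for n in range(2, n_max + 1):
        t.append(sum(t[i] * t[n - i] for i in range(1, n)) / 2)
    return t


if __name__ == "__main__":
    coeffs = egf_coefficients(8)
    for n in range(1, 9):
        # Both counting formulas agree, and [z^n] T(z) equals |T_n| / n!.
        assert num_labeled_topologies(n) == closed_form(n)
        assert coeffs[n] == Fraction(num_labeled_topologies(n), factorial(n))
    print([num_labeled_topologies(n) for n in range(1, 7)])
```

The ratio of consecutive terms of $|T_n|/n!$ is $(2n-1)/(n+1)$, which tends to 2, so this sequence has exponential order 2 in the sense defined above.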
The exponential order of a sequence gives basic information about its speed of growth and enables comparisons with other sequences. In particular, from the definition, it follows that if $(a_n)$ has exponential order $k_a$ and $(b_n)$ has exponential order $k_b < k_a$, then the sequence of ratios $b_n/a_n$ converges to 0 exponentially fast as $(k_b/k_a)^n$. If two sequences $(a_n)$ and $(b_n)$ have the same exponential growth, then we write $a_n \bowtie b_n$.

We are interested in the exponential growth of several increasing sequences of non-negative integers. Several results will be obtained through techniques of analytic combinatorics [see Sections IV and VI of Flajolet and Sedgewick (2009)]. The entries of a sequence of integers $(a_n)_{n \geq 0}$ can be interpreted as the coefficients of the power series expansion $A(z) = \sum_{n=0}^{\infty} a_n z^n$ at $z = 0$ of a function $A(z)$, the generating function of the sequence. Considering $z$ as a complex variable, under suitable conditions, there exists a general correspondence between the singular expansion of the generating function $A(z)$ near its dominant singularity (the one nearest to the origin) and the asymptotic behavior of the associated coefficients $a_n$. In particular, the exponential order of the sequence $(a_n)$ is given by the inverse of the modulus of the dominant singularity of $A(z)$. For instance, the exponential order of the sequence $|T_n|/n!$, with $|T_n|$ as in Equation (1), is 2 because $1/2$ is the dominant singularity of the associated generating function [Eq. (3)]. In other words, $|T_n|/n!$ increases with a subexponential multiple of $2^n$ as $n$ becomes large.

2.3. Gene trees, species trees, and ancestral configurations

In this section, we define the object on which our study focuses: the ancestral configurations of a gene tree $G$ in a species tree $S$. Ancestral configurations have been introduced by Wu (2012). In our framework, where exactly one gene lineage has been selected from each species, we assume $G$ and $S$ to have the same labeled topology $t$.

2.3.1. Ancestral configurations.
Suppose $R$ is a realization of a gene tree $G$ in a species tree $S$, where we focus on the case of $G = S = t$ (Fig. 1). In other words, $R$ is one of the evolutionary possibilities for the gene tree $G$ on the matching species tree $S$. Viewed backward in time, for a given node $k$ of $t$, consider the set $C(k, R)$ of gene lineages (edges of $G$) that are present in $S$ at the point right before node $k$. As in Wu (2012), the set $C(k, R)$ is called the ancestral configuration of the gene tree at node $k$ of the species tree. Taking the tree $t$ depicted in Figure 1A and considering the realization $R_1$ of the gene tree $G = t$ in the species tree $S = t$ as given in Figure 1B, we see that the gene lineages $a$, $b$, and $\ell$ are those present in the species tree at the point right before the root node $m$. The set $C(m, R_1) = \{a, b, \ell\}$ is thus the ancestral configuration of the gene tree at node $m$ of the species tree. Similarly, the ancestral configuration of the gene tree at node $\ell$ of the species tree is the set of gene lineages $C(\ell, R_1) = \{h, e, f\}$. In Figure 1C, where a different realization $R_2$ of the same gene tree is depicted, the ancestral configuration at the root $m$ of the species tree is the set of gene lineages $C(m, R_2) = \{g, h, i\}$. The ancestral configuration at node $\ell$ is $C(\ell, R_2) = \{c, d, i\}$.

Let $\mathcal{R}(G, S)$ be the set of possible realizations of the gene tree $G = t$ in the species tree $S = t$. For a given node $k$ of $t$, by considering all possible elements $R \in \mathcal{R}(G, S)$, we define the set

\[ C(k) = \{ C(k, R) : R \in \mathcal{R}(G, S) \} \tag{4} \]

and the number

\[ c(k) = |C(k)|. \tag{5} \]

Thus, $c(k)$ corresponds to the number of different ways the gene lineages of $G$ can reach the point right before node $k$ in $S$, when all possible realizations of the gene tree $G$ in the species tree $S$ are considered. For instance, taking $t$ as in Figure 1A, we have $C(g) = \{\{a, b\}\}$, $C(\ell) = \{\{c, d, e, f\}, \{h, e, f\}, \{c, d, i\}, \{h, i\}\}$, and

\[ C(m) = \{ \{g, \ell\}, \{a, b, \ell\}, \{g, c, d, e, f\}, \{a, b, c, d, e, f\}, \{g, h, e, f\}, \{a, b, h, e, f\}, \{g, c, d, i\}, \{a, b, c, d, i\}, \{g, h, i\}, \{a, b, h, i\} \}. \tag{6} \]

Note that for two different realizations $R_1, R_2 \in \mathcal{R}(G, S)$ and an internal node $k$, we do not necessarily have $C(k, R_1) \neq C(k, R_2)$.
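The sets listed in Equation (6) can be reproduced by a short enumeration. The Python sketch below is an illustration rather than code from the article. It assumes that the tree of Figure 1A is ((a,b)g,((c,d)h,(e,f)i)ℓ)m, as implied by the configurations listed above, and that just before a node each side contributes either the single lineage of the child or one of the configurations at that child, with no coalescence between the two sides. The tuple encoding and the function names are hypothetical.

```python
from itertools import product

# Figure 1A as nested tuples: a leaf is its label, an internal node is
# (label, left_subtree, right_subtree). "l" stands in for the node written as ℓ.
G_NODE = ("g", "a", "b")
L_NODE = ("l", ("h", "c", "d"), ("i", "e", "f"))
M_NODE = ("m", G_NODE, L_NODE)


def label(node):
    """Label of a leaf (a string) or of an internal node (first tuple entry)."""
    return node if isinstance(node, str) else node[0]


def configurations(node):
    """Return C(node) as a set of frozensets of lineage labels (empty for a leaf)."""
    if isinstance(node, str):
        return set()
    _, left, right = node
    choices = []
    for child in (left, right):
        # Lineages arriving from one side: the child lineage itself,
        # or any configuration found at the child.
        choices.append({frozenset([label(child)])} | configurations(child))
    return {a | b for a, b in product(*choices)}


if __name__ == "__main__":
    for config in sorted(configurations(M_NODE), key=lambda s: (len(s), sorted(s))):
        print(sorted(config))
```

Under these assumptions the enumeration yields one configuration at $g$, four at $\ell$, and ten at $m$, in agreement with the sets displayed above.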
For each internal node $k$, our definition of ancestral configuration specifically excludes as a possibility the case in which all gene tree lineages descended from node $k$ have coalesced at species tree node $k$ so that $\{k\} \notin C(k)$. Each configuration at node $k$ is considered at the point right before node $k$ in the species tree, and there is thus no time for the gene lineages from the left subtree of $k$ to coalesce with those from the right subtree of $k$. Our definition is identical to that of Wu (2012), with the exception that we say that a leaf or 1-taxon tree has 0 ancestral configurations, whereas Wu assigns these cases 1 ancestral configuration.

Because we assume gene tree $G$ and species tree $S$ have the same labeled topology $t$, the set $C(k)$ and the quantity $c(k)$ defined in Equations (4) and (5) depend only on node $k$ and tree $t$. In what follows, we use the term configuration at node $k$ of $t$ to denote an element of $C(k)$. The next result provides a recursive procedure for calculating the number $c(k)$ at a given node $k$ of $t$.

Proposition 1 Given a tree $t$ with $|t| > 1$, the number $c(r)$ of possible configurations at the root $r$ of $t$ can be recursively computed as

\[ c(r) = 1 + c(r_\ell) + c(r_r) + c(r_\ell)\, c(r_r) = [c(r_\ell) + 1][c(r_r) + 1], \tag{7} \]

where $r_\ell$ (resp. $r_r$) denotes the left (resp. right) child of $r$ and $c(r)$ is set to 0 when $|t| = 1$.

Proof. If $A$ and $B$ are two sets of sets, we define $A \otimes B = \{a \cup b : a \in A, b \in B\}$. The set $C(r)$ of configurations at internal node $r$ can be decomposed as

\[ C(r) = \{\{r_\ell, r_r\}\} \cup [C(r_\ell) \otimes \{\{r_r\}\}] \cup [\{\{r_\ell\}\} \otimes C(r_r)] \cup [C(r_\ell) \otimes C(r_r)], \tag{8} \]

where the set unions are disjoint because, as already noted, $\{r_\ell\} \notin C(r_\ell)$ and $\{r_r\} \notin C(r_r)$. We immediately obtain Equation (7), as $c(r) = |C(r)|$. □

We reiterate that for Equation (7) to apply for all $t$ with $|t| > 1$, we must set to 0 the number of configurations at a species tree leaf and at the root of the 1-taxon tree. For the tree depicted in Figure 1A, each configuration in $C(m)$ [Eq.
(6)] can be obtained as described in Equation (8) from the configurations in $C(g)$ and $C(\ell)$. Note indeed that $c(m) = 10 = (1+1)(4+1) = [c(g)+1][c(\ell)+1]$, as determined by Equation (7).

2.3.2. Total configurations and root configurations. Let $K(t)$ be the set of nodes of a tree $t$. The number of nodes $|K(t)|$ satisfies $|K(t)| = 2|t| - 1 < 2|t|$. Define the total number of configurations in $t$ as the sum

\[ c = \sum_{k \in K(t)} c(k). \]

Let $c(r)$ be the number of configurations at the root $r$ of $t$, or root configurations for short. As is shown in Appendix 1, $c(r)$ satisfies the bound

\[ c(r) \leq 2^{|t|}/2. \tag{9} \]

Furthermore, because $c(r) \geq c(k)$ for each node $k$ of $t$, we have

\[ c(r) \leq c \leq 2|t|\, c(r). \tag{10} \]

This result indicates that the total number of configurations $c$ and the number of root configurations $c(r)$ are equal up to a factor that is at most polynomial in the tree size $|t|$. A consequence is that in measuring $c(r)$ for a family $(t_i)$ of trees of increasing size, an exponential growth of the form $c(r) \bowtie k^{|t|}$ for the number of root configurations translates into the same exponential growth for the total number of configurations in $t$:

\[ c(r) \bowtie k^{|t|} \;\Rightarrow\; c \bowtie k^{|t|}, \tag{11} \]

where, by virtue of Equation (9), $k \leq 2$.

An equivalent result holds when we consider the expected value of the total number of configurations $E_n[c]$ in a random labeled tree topology of given size $n$. Indeed, when a tree of size $n$ is selected at random from the set of labeled topologies, Equation (10) gives $E_n[c(r)] \leq E_n[c] \leq 2n\, E_n[c(r)]$. Thus, the exponential growth of $E_n[c]$ with respect to $n$ can be recovered from the exponential growth of $E_n[c(r)]$,
\[ E_n[c] \bowtie E_n[c(r)]. \tag{12} \]

Similarly, for the second moment $E_n[c^2]$, we have $E_n[c(r)^2] \leq E_n[c^2] \leq 4n^2 E_n[c(r)^2]$, and thus

\[ E_n[c^2] \bowtie E_n[c(r)^2]. \tag{13} \]

Using these results, in Sections 3 and 5 we will determine the exponential growth of $c(r)$ and $c$ with respect to size $|t|$ when $t$ is considered in different settings. In Section 3, $t$ belongs to families of unbalanced or balanced trees, whereas in Section 5, we perform our analysis considering $t$ as a random labeled topology of given size.

2.4. Root configurations in small trees

For small values of $n$, Equation (7) enables the exhaustive computation of the number of root configurations $c(r)$ for representative labelings of each of the unlabeled topologies of size $n$. In Figure 2, each dot corresponds to the logarithm of the number of root configurations for a certain tree shape of size determined by its x-coordinate. The dots associated with the largest values of $c(r)$ are connected by the top line, whose growth is linear in $n$. Indeed, as was shown by Wu (2012), there exist families of trees for which the growth of the number of root configurations is exponential in the tree size. From Equation (9), it follows that the growth of the sequence of the largest number of root configurations in trees of size $n$ must be exponential in $n$ as well.

FIG. 2. Natural logarithm of the number of root configurations for all possible tree shapes of size $2 \leq n \leq 10$. The value for $n = 1$, $\log(0)$, is omitted. Dots corresponding to the largest and smallest number of root configurations for each $n$ are connected by the top and bottom lines, respectively.

The tree shapes whose labeled topologies possess the largest number of root configurations among trees of fixed size appear in Figure 3 together with their number of root configurations $c(r)$. Starting with $n = 4$,
each shape in the sequence can be seen to be produced by connecting two smaller shapes also in the sequence (possibly the same shape) to a shared root.

FIG. 3. Tree shapes of size $2 \leq n \leq 10$ whose labeled topologies have the largest number of root configurations among trees of size $n$. The number of root configurations $c(r)$ is indicated for each tree: $c(r) = 1, 2, 4, 6, 10, 15, 25, 35, 55$. In each tree displayed, the two root subtrees each maximize the number of root configurations among trees of their size.

The tree shape that minimizes the number of root configurations is the caterpillar topology. The number of root configurations in the caterpillar of size $n$ is $n - 1$ (Wu, 2012). The bottom line in Figure 2, which connects dots corresponding to the smallest number of root configurations for a tree with $n$ taxa, grows with $\log(n-1)$.

These observations show that tree topology can have a considerable impact on the number of ancestral configurations that are possible for a given tree size. Indeed, the next section investigates the effect of tree balance on the number of root configurations in a tree. Figure 2 suggests that for random labeled topologies of a specified size, we can expect the variance of the number of root configurations to be large. We will confirm this claim in Section 5. We will also show that although there exist tree families (e.g., caterpillars) for which the growth of the number of root configurations is polynomial in the tree size, the expected number of root configurations in a random labeled topology of given size $n$ grows exponentially in $n$.

3.
ROOT CONFIGURATIONS FOR UNBALANCED AND BALANCED FAMILIES OF TREES

In this section, we study the number of root configurations for particular families of trees, extending beyond two cases considered by Wu (2012): the caterpillar case, which was studied exactly, and the completely balanced case, for which a loose lower bound of $\sqrt{2}^{\,n}$ was reported. As balance is an important tree property that influences ancestral configurations, we study unbalanced and balanced families generated by different seed trees. Upper and lower bound results on the number of root configurations for trees of specified size appear in Section 4.

FIG. 4. Unbalanced and balanced families of trees defined from a given seed tree $s$. (A) The unbalanced family $u_h = u_h(s)$ is defined by $u_0 = s$, setting $u_{h+1}$ as the tree of size $|u_{h+1}| = |u_h| + |s| = (h+2)|s|$ obtained by appending $u_h$ and $s$ to a shared root node. (B) The balanced family $b_h = b_h(s)$ is defined by $b_0 = s$, setting $b_{h+1}$ as the tree of size $|b_{h+1}| = 2|b_h| = 2^{h+1}|s|$ obtained by appending two copies of $b_h$ to a shared root node.

For a given seed tree $s$, we consider the unbalanced family $(u_h(s))$ (Fig. 4A) and the balanced family $(b_h(s))$ (Fig. 4B) defined as follows:

\[ u_0(s) = s; \quad u_{h+1}(s) = (u_h(s), s) \tag{14} \]
\[ b_0(s) = s; \quad b_{h+1}(s) = (b_h(s), b_h(s)), \tag{15} \]

where $(t_1, t_2)$ is the tree shape obtained by appending trees $t_1$ and $t_2$ to a shared root node. Note that the family of caterpillar trees is obtained as $(u_h(s))$ when $|s| = 1$. For the same seed tree of size 1, $(b_h(s))$ is the family of completely balanced trees. When $|s| = 2$, $(u_h(s))$ resembles the lodgepole family $(k_h)$, which is defined recursively by setting $k_0$ as the 1-taxon tree, and $k_{h+1} = (k_h, s)$ (Disanto and Rosenberg, 2015). The only difference is that in $u_h(s)$, each leaf is in a cherry, whereas $k_h$ has a unique leaf that is not in a cherry.
For each family, it is understood that we consider an arbitrary labeling of each unlabeled shape in the family.

3.1. Unbalanced families

Fix a seed tree $s$ and consider the family $u_h = u_h(s)$ as defined in Equation (14). Let $c = c_0$ be the number of root configurations in $s = u_0$, and define $c_h$ as the number of root configurations in $u_h$. If $s$ is the 1-taxon tree, then as noted earlier, the number of root configurations $c$ is set to 0. From Proposition 1, we obtain the recursion

\[ c_{h+1} = 1 + c + c_h (c + 1), \tag{16} \]

starting with $c_0 = c$. As shown in Appendix 2, the generating function

\[ U_c(z) = \sum_{h=0}^{\infty} c_h z^h \]

is described by

\[ U_c(z) = \frac{z + c}{(1 - z)(1 - z - cz)}. \tag{17} \]

For $c \geq 0$, the dominant singularity of $U_c$ (the singularity nearest to the origin) is the solution $z_0 = 1/(c+1) \leq 1$ of the equation $1 - z - cz = 0$. Applying Theorem IV.7 of Flajolet and Sedgewick (2009) yields the exponential growth of the sequence $(c_h)$ with respect to the index $h$ as

\[ c_h \bowtie \left( \frac{1}{z_0} \right)^h = (c+1)^h. \tag{18} \]

Because $u_h$ has $|u_h| = (h+1)|s|$ leaves, substituting $h = |u_h|/|s| - 1$ in Equation (18), we obtain the next proposition.

Proposition 2 In the unbalanced family $(u_h)$, the exponential growth of the number of root configurations in the size $|u_h|$ is

\[ \left[ (c+1)^{1/|s|} \right]^{|u_h|}, \tag{19} \]

where $|s|$ is the size of the seed tree and $c$ is its number of root configurations. The total number of configurations in the family $(u_h)$ has the same exponential growth.

In other words, for values of the number of leaves $n$ at which a member of the unbalanced family exists, the number of root configurations in the unbalanced family grows with $[(c+1)^{1/|s|}]^n$.
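The recursion in Equation (16), the generating function in Equation (17), and the growth rate in Proposition 2 can be cross-checked numerically. The Python sketch below is a hedged illustration with hypothetical function names. It extracts the coefficients of $U_c(z)$ from the linear recurrence implied by the expanded denominator $1 - (c+2)z + (c+1)z^2$, and it estimates the growth constant as $c_h^{1/|u_h|}$ with $|u_h| = (h+1)|s|$.

```python
def unbalanced_root_counts(c, h_max):
    """c_0..c_{h_max} from Equation (16): c_{h+1} = 1 + c + c_h (c + 1)."""
    counts = [c]
    for _ in range(h_max):
        counts.append(1 + c + counts[-1] * (c + 1))
    return counts


def series_coefficients(c, h_max):
    """Coefficients of U_c(z) = (z + c) / ((1 - z)(1 - z - c z)) up to z^{h_max}.

    The denominator equals 1 - (c + 2) z + (c + 1) z^2, so each coefficient is the
    numerator coefficient plus (c + 2) a_{h-1} minus (c + 1) a_{h-2}.
    """
    numerator = [c, 1]
    coeffs = []
    for h in range(h_max + 1):
        value = numerator[h] if h < len(numerator) else 0
        if h >= 1:
            value += (c + 2) * coeffs[h - 1]
        if h >= 2:
            value -= (c + 1) * coeffs[h - 2]
        coeffs.append(value)
    return coeffs


def growth_estimate(c, seed_size, h):
    """c_h ** (1 / |u_h|), which approaches (c + 1) ** (1 / |s|) as h grows."""
    return unbalanced_root_counts(c, h)[h] ** (1.0 / ((h + 1) * seed_size))


if __name__ == "__main__":
    print(unbalanced_root_counts(0, 5))   # caterpillar family, seed of size 1
    print(unbalanced_root_counts(1, 4))   # seed tree is a cherry, |s| = 2, c = 1
    print(growth_estimate(1, 2, 200))     # compare with 2 ** 0.5
```

For the 1-taxon seed ($c = 0$), the recursion gives $c_h = h$, the caterpillar count $n - 1$ with $n = h + 1$ leaves. For the cherry seed, the estimate converges slowly toward $\sqrt{2}$ because of the subexponential factor.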
When the seed tree is the 1-taxon tree, so that $c = 0$ and $(u_h)$ is the sequence of caterpillar trees, Equation (19) gives the exponential growth $1^{|u_h|} = 1$. Indeed, the number of root configurations in the caterpillar family grows like a polynomial function of the size, as immediately follows from Equation (16) [see also Wu (2012)]. Taking $|s| > 1$, the number of root configurations in $u_h(s)$ becomes exponential in the tree size. Table 1 illustrates that for unbalanced families defined by small seed trees of size greater than one, root configurations in $n$-taxon trees (provided that a tree with $n$ taxa is in the family) have exponential growth in the range $1.3^n$ to $1.5^n$.

Table 1. Approximate Values of the Constants That When Raised to the Power $n$ Describe the Exponential Growth with the Number of Taxa $n$ of the Number of Ancestral Configurations in Unbalanced and Balanced Families for Small Seed Trees

Each row is one seed tree shape $s$, identified here by $|s|$ and $c$.

|s|    c     (c+1)^{1/|s|} (unbalanced)    (k_c)^{1/|s|} (balanced)
1      0     1                             1.503
2      1     1.414                         1.503
3      2     1.442                         1.469
4      3     1.414                         1.425
4      4     1.495                         1.503
5      4     1.380                         1.385
5      5     1.431                         1.435
5      6     1.476                         1.479
6      5     1.348                         1.351
6      6     1.383                         1.385
6      7     1.414                         1.416
6      8     1.442                         1.444
6      10    1.491                         1.492
6      9     1.468                         1.469

Each constant is obtained to three decimal places by numerically evaluating Equation (20).

3.2. Balanced families

The results change when we consider balanced families. For a fixed seed tree $s$, consider the family $b_h = b_h(s)$ as defined in Equation (15). Let $c = c_0$ be the number of root configurations in seed tree $s = b_0$, and define $c_h$ as the number of root configurations in $b_h$. If $|s| = 1$, then $c$ is 0. From Proposition 1, we obtain

\[ c_{h+1} = (c_h + 1)^2, \]

with $c_0 = c$. Defining the sequence $x_{h+1} = x_h^2 + 1$, with $x_0 = c + 1$, it is straightforward to show that $c_h = x_h - 1$. Sequence $x_h$ can be studied as in Aho and Sloane (1973, Section 3 and Example 2.2).
For $h \geq 1$, a constant $k_c$ exists for which

\[ x_h = \lfloor k_c^{(2^h)} \rfloor, \]

where $\lfloor k \rfloor$ is the floor function for $k$. The constant $k_c$ can be approximated using the recursive definition of $x_h$, summing terms in a series:

\[ k_c = (c+1) \exp\left[ \sum_{h=0}^{\infty} 2^{-h-1} \log\left( 1 + \frac{1}{x_h^2} \right) \right]. \tag{20} \]

Switching back to $c_h$, for $h \geq 1$, we obtain

\[ c_h = x_h - 1 = \lfloor k_c^{(2^h)} \rfloor - 1. \tag{21} \]

Thus, because $c_h$ grows with $\lfloor k_c^{(2^h)} \rfloor$, to determine the exponential growth of the number of root configurations, it remains to evaluate the constant $k_c$. Rescaling Equation (21) to consider the number of leaves $|b_h| = 2^h |s|$ as a parameter, we obtain the next proposition.

Proposition 3 In the balanced family $(b_h)$, the exponential growth of the number of root configurations in the size $|b_h|$ is

\[ \left[ (k_c)^{1/|s|} \right]^{|b_h|}, \tag{22} \]

where $|s|$ is the size of the seed tree. The constant $k_c$ can be computed as in Equation (20) and bounded by

\[ c + 1 < k_c < (c+1) + \frac{1}{c+1}. \tag{23} \]

The total number of configurations in the family $(b_h)$ has the same exponential growth.

In other words, for values of the number of leaves $n$ at which a member of the balanced family exists, the number of root configurations in the balanced family grows with $[(k_c)^{1/|s|}]^n$.

Proof. It remains only to prove the bound [Eq. (23)]. The lower bound follows quickly from Equation (20), as the exponent is positive. The upper bound is obtained by observing that the sequence $x_h = x_{h-1}^2 + 1$ is increasing, and thus $\log(1 + 1/x_0^2) \geq \log(1 + 1/x_h^2)$ for each $h \geq 0$. Therefore, from Equation (20) and the fact that $x_0 = c + 1$, we have

\[ k_c < (c+1) \exp\left[ \log\left( 1 + \frac{1}{x_0^2} \right) \sum_{h=0}^{\infty} 2^{-h-1} \right] = (c+1) \left( 1 + \frac{1}{(c+1)^2} \right). \]
□

Comparing the number of root configurations in balanced families with those in unbalanced families (Table 1), we see that the exponential order for balanced families is greater than in unbalanced families, although typically still in the range $1.3^n$ to $1.5^n$.

3.3.
Comparing unbalanced and balanced families

For a given seed tree $s$, the quantities $u(s) = (c+1)^{1/|s|}$ and $b(s) = (k_c)^{1/|s|}$ determine the exponential orders of the sequences considered in Propositions 2 and 3, respectively. We observe three facts.

(i) Applying the lower bound in Equation (23), $c + 1 < k_c$, for a fixed seed tree $s$, we always have

\[ u(s) < b(s). \tag{24} \]

Therefore, the growth of the number of ancestral configurations in the family $b_h(s)$ is exponentially faster than the growth in the family $u_h(s)$. When $s$ is not small, however, $u(s)$ can become close to $b(s)$. For large $s$, $c$ is also large. Owing to the upper bound in Equation (23), although $c + 1 < k_c$, $k_c$ only slightly exceeds $c + 1$. Furthermore, the exponent $1/|s|$ in the expressions for $u(s)$ and $b(s)$ further reduces the difference between them. For instance, if $s$ is the caterpillar tree with 10 leaves, we have $c = 9$, $u(s) = (10)^{1/10} \approx 1.2589$, and $1.2589 \approx (10)^{1/10} < b(s) < (10.1)^{1/10} \approx 1.2601$. In this case, $b(s) - u(s)$ is bounded above by a constant near $10^{-3}$. The increasing similarity of $u(s)$ and $b(s)$ is already evident in Table 1, as their values for 6-taxon seed trees are substantially closer to each other than for the smaller 1-, 2-, and 3-taxon seed trees.

(ii) The choice of the seed tree can play an important role in the relative values of $b(s)$ and $u(s)$ as taking two different seed trees can flip the inequality in Equation (24). In fact, if $s_1$ and $s_2$ are two seed trees of the same size $|s_1| = |s_2| = |s|$ for which $c_1 > c_2$, then

\[ u(s_1) > b(s_2). \tag{25} \]

To obtain this result, we note that $|s| \log u(s_1) = \log(c_1 + 1) \geq \log[(c_2 + 1) + 1] \geq \log[(c_2 + 1) + 1/(c_2 + 1)] > \log k_{c_2} = |s| \log b(s_2)$, where the latter inequality follows from the upper bound [Eq. (23)]. The result is observable in Table 1, where at fixed $|s|$ of 4, 5, or 6, $u(s)$ for some of the shapes exceeds $b(s)$ for other shapes.
Ce (iii) When the seed tree s is chosen as the 1-taxon tree with jsj=1, the constant b(s)=k0 determines an al upperboundforthenumberofrootconfigurationsthatatreeofgivensizecanhave.Thisresultisshownin c edi more detail in the following section. The value of k0 can be computed numerically from Equation (20): M y k0 (cid:2)1:502836801: (26) sit pffiffiffi er This constant provides the exact value for which 2(cid:2)1:4142, reported by Wu (2012), provided a lower v ni bound. U d r o f n 4. SMALLEST AND LARGEST NUMBERS OF ROOT a St CONFIGURATIONS FOR TREES OF FIXED SIZE y b d e We have seen that the number of root configurations for caterpillar trees grows polynomially and that the d oa numberofrootconfigurationsinunbalancednoncaterpillarfamiliesandbalancedfamiliesgrowsexponentially. wnl Intheexampleswehaveconsidered,theexponentialgrowthproceedswith1:3n to1:503n.Wenowshowthat Do thecaterpillartreeshavethesmallestnumberofrootconfigurationsandthattheconstantk0[Eq.(26)],infact, provides an upper bound on the exponential growth of the number of root configurations as n increases. We characterize the labeled topologies that possess the largest number of root configurations at fixed n. 4.1. Smallest number of root configurations Forthecaterpillartreeofsizen,thenumberofrootconfigurationsisn-1.Weshowthatthisvalue,n-1, is the smallest number of root configurations for a tree of size n. 840 DISANTO AND ROSENBERG Letc(r)denotethenumberofrootconfigurationsoftreet.Letm (r)=min c(r).Supposewehave t n ft:jtj=ng t shown for each i with 1(cid:7)i(cid:7)n-1 that m(r)=i-1: (27) i The claim clearly holds for i=1‚2‚3, for each of which the sole tree t has i-1 root configurations. For n(cid:5)2, we use induction to prove Equation (27) for i=n. Suppose t0 is a tree of size n such that c (r)=m (r). The number of root configurations of t0 is given by Proposition 1 as the product y. 
t0 n se onl crot0o(rt)c=o[ncfit‘0g(ru)ra+ti1o]n[cs,tr0(tr0‘)a+n1d],t0rwmheursett0‘seapnadrat0rtealryepthoessreososttshuebtmreiensimofatl0.nBumecbaeurseoft0rhoaostthceonmfiignuimraatilonnusmabmeornogf u treesoftheirsize.Wecanthenwritec (r)=m(r)andc (r)=m (r),where,withoutlossofgenerality,iis onal a certain value with 1(cid:7)i(cid:7)ºn=2ß. tT0‘ herefoire, ct0(r)t0rhas then-fiorm ct0(r)=[mi(r)+1][mn-i(r)+1]. It is s determined from the minimum r e p For mn(r)=ct0(r)=minfi:1(cid:7)i(cid:7)ºn=2ßg[mi(r)+1][mn-i(r)+1]: (28) 7. 1 Applying the inductive hypothesis [Eq. (27)], we obtain m (r)=min i(n-i). In the permissible 8/ n fi:1(cid:7)i(cid:7)ºn=2ßg 0 range for i, the product i(n-i) reaches its minimum value at i=1, equaling n-1 as desired. 9/ 0 By induction, we have shown that Equation (27) holds for each i(cid:5)1. Furthermore, the fact that the m at product [mi(r)+1][mn-i(r)+1] in Equation (28) is minimal only at i=1 also demonstrates that those tree o shapesofsizenwiththesmallestnumberofrootconfigurationscanberecursivelyobtainedbyappendingthe c b. 1-taxontreeandthetreeshapeofsizen-1withthesmallestnumberofrootconfigurationstoasharedroot u p node. Trees resulting from this recursive construction are exactly those having a caterpillar shape. rt e b e e.li 4.2. Largest number of root configurations n nli For the largest number of root configurations, we denote Mn(r)=maxft:jtj=ngct(r). Similarly to Equation o m (28),weseektoidentifythetreestthatproducethemaximuminthefollowingequationandtoevaluatethat o maximum: r f e g M (r)=max [M(r)+1][M (r)+1]: (29) a n fi:1(cid:7)i(cid:7)ºn=2ßg i n-i k c a P er Note that M (r)=0. 
Taking $\tilde{M}_n = M_n(r) + 1$, we have the recursion

\[ \tilde{M}_n = 1 + \max_{\{i : 1 \leq i \leq \lfloor n/2 \rfloor\}} \tilde{M}_i \tilde{M}_{n-i}, \]

starting with $\tilde{M}_1 = 1$. The sequence $\tilde{M}_n$ was studied by de Mier and Noy (2012, Theorems 1 and 2), where it was shown

(i) taking $d = d(n)$ as the power of 2 nearest to $n/2$, we have $\tilde{M}_n = 1 + \tilde{M}_d \tilde{M}_{n-d}$, so that $M_n(r) = [M_d(r) + 1][M_{n-d}(r) + 1]$;

(ii) for all $n \geq 10$, $k_0^{n - 1/4} < \tilde{M}_n < k_0^n$, that is,

\[ k_0^{n - 1/4} - 1 < M_n(r) < k_0^n - 1, \tag{30} \]

where the constant $k_0$ has been already computed in Equation (26).

For small $n$, the labeled topologies with the largest numbers of root configurations appear in Figure 3. Collecting the results for the smallest and largest number of root configurations, we can state the following facts.

Proposition 4 (i) For each $n \geq 1$, the smallest number of root configurations in a tree of size $n$ is $m_n(r) = n - 1$. The caterpillar tree shape of size $n$ has exactly $m_n(r)$ root configurations. (ii) For each $n \geq 10$, the largest number of root configurations in a tree of size $n$, $M_n(r)$, can be bounded as in Equation (30). For $n \geq 2$, if $d = d(n)$ denotes the power of 2 nearest to $n/2$, then $M_n(r)$ is the number of root configurations in the tree shape $t_n$ recursively defined as $|t_1| = 1$, $t_n = (t_d, t_{n-d})$. When $n = 2^h$ for integers $h$, $t_n$ is the completely balanced tree of depth $h$ and $M_n(r) = \lfloor k_0^n \rfloor - 1$ [Eq. (21)].

As a corollary, we obtain the following result, the proof of which appears in Appendix 3.