ebook img

Taxon Size Distribution in a Time Homogeneous Birth and Death Process PDF

0.15 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Taxon Size Distribution in a Time Homogeneous Birth and Death Process

Taxon Size Distribution in a Time-Homogeneous Birth and Death Process. 9 0 Panagis Moschopoulos Max Shpak 0 2 Dept. of Mathematical Sciences Dept. of Biological Sciences Univ. of Texas at El Paso Univ. of Texas at El Paso n a El Paso TX, 79968 El Paso TX, 79968 J [email protected] [email protected] 8 January 9, 2009 ] P A . Abstract t a t The number of extant individuals within a lineage, as exemplified by s countsofspeciesnumbersacrossgenerainahighertaxonomiccategory,is [ known to bea highly skewed distribution. Because thesublineages (such 1 as genera in a clade) themselves follow a random birth process, deriving v thedistributionoflineagesizesinvolvesaveragingthesolutionstoabirth 6 and death process over the distribution of time intervals separating the 6 origin of thelineages. In this paper, we show that the resulting distribu- 0 tions can berepresentedbyhypergeometric functionsof the second kind. 1 We also provide approximations of these distributions up to the second . 1 order, and compare theseresults to theasymptotic distributions and nu- 0 merical approximations used in previous studies. For two limiting cases, 9 one with a relatively high rate of lineage origin, one with a low rate, the 0 cumulative probability densities and percentiles are compared to show : v that the approximations are robust over a wide rane of parameters. It i is proposed that the probability density distributions of lineage size may X haveanumberofrelevantapplications tobiological problemssuchasthe r a coalescence of genetic lineages andin predicting thenumberof species in living and extinct higher taxa, as these systems are special instances of the underlyingprocess analyzed in this paper. Running Heading Family Size Distribution Keywords birth-death process, taxon size, hypergeometric function, phylogeny, genealogy 1 1. Introduction The number of individuals in a genetic lineage (family size) and the formally equivalent problemof the number of species within genera or other higher taxa has been of wide interest to demographers and biologists alike (e.g. Watson and Galton 1875,Yule, 1924,Anderson 1974,Burlando 1990). Empirically, the distributions of family sizes in humans and taxon sizes in a number of organ- ismshavebeendocumentedtobeconsistentwiththedistributionspredictedby simple stochastic models. The most straightforward approximation to the distribution of lineage or taxon sizes is the pure birth process, which entails a certain probability of an individual giving birth or one species giving rise to another per unit time, ignoring death or extinction. In its details, this is clearly an unrealistic model, butitdoesprovideareasonablefirstorderapproximationforthediversification of higher taxa during their early history, when extinction is infrequent, or in lineages(suchassmallpopulationsofbacteriagrowingonarichmedium)where the death rates are very small compared to birth rates. A basic model (Yule 1924, Feller 1950, Bailey 1968) represents a pure birth processwithbirthprobabilityλperunittime. Inthismodel,theinstantaneous rate of change in the probability of observing n individuals is given by the following differential equation dp n =λ(n−1)pn−1−λnpn . (1.1) dt Undertheinitialconditionp1(0)=1(asingleancestorforthelineageattime zero) and 0 for all n>1, this differential equation has the solution p (t)=eλt(1−eλt)n−1 . (1.2) n More realistically, if there is an intrinsic death rate (or, for species and higher taxa, extinction rate) µ, then one has a time-homogeneous birth and death process defined by the differential equation, e.g. Kendall (1948), Darwin (1956), Keiding (1975), dp n =λ(n−1)pn−1−(λ+µ)npn+µ(n+1)pn+1 for n>0 dt dp0 =µp1 . (1.3) dt The solutions for λ6=µ given p1(0)=1 are given by (λ−µ)2 exp[−(λ−µ)t] λ−λexp[−(λ−µ)t] n−1 p (t)= for n>0 n (λ−µexp[−(λ−µ)t])2 λ−µexp[−(λ−µ)t] (cid:18) (cid:19) µ−µexp[−(λ−µ)t] p0(t)= . (1.4) λ−µexp[−(λ−µ)t] 2 In the special case where λ=µ, the solutions simplify to: (λt)n−1 p (t)= for n>0 n (λt+1)n+1 λt p0(t)= . (1.5) λt+1 For the purposes ofmanyempiricalstudies,in whichone doesnotknow the precise number of once-extant lineages which are currently extinct (i.e. n =0 individualsinthelineageorspeciesinthecladeattimet,withanindeterminate number in previous time intervals), it is useful to work with the truncated distributions describing the probability densities for strictly positive values of n, i.e. p (t) n P (t)= for n≥1 n 1−p0(t) which, using the expressions in (1.4), evaluates to n−1 (λ−µ)exp[−(λ−µ)t] λ−λexp[−(λ−µ)t] P (t)= for n≥1 n λ−µexp[−(λ−µ)t] λ−µexp[−(λ−µ)t] (cid:18) (cid:19) (1.6) while for λ=µ, (λt)n−1 P (t)= for n≥1. (1.7) n (λt+1)n Strictly speaking, in order to compare the distributions of family sizes to those predicted in (1.4) or (1.6), or to infer birth and death rates using maxi- mum likelihood estimators, the lineage ages must be known. This requirement becomes particularly problematic when one compares the number of individu- als in sublineages, for example the number of species in different genera within a taxonomic family or order. Because the sublineages arose at different times (i.e. obviously, not all genera within a clade originated at the same moment), comparisonsof the number of species in a genus or any other subclade must be weighted by the differences in subclade ages. When the age of a lineage is known, as is the case for phylogenetic trees withbranchlengthsthathavebeencallibratedusing fossildataand"molecular clocks" (sensu Zuckerkandl and Pauling 1965, Kimura 1980), the number of individuals or species in each lineage can be compared to the predictions of equation (1.4) by applying the appropriate time values. For phylogenies with knownbranchlengthsandresolvedsplittimes,Harveyetal(1994)andNeeetal (1994, 1995) presented a method whereby the origination and extinction rates λ and µ could be estimated using maximum likelihood methods. A similar approach, not requiring conditioning on the age of the first split, is given by Rannala (1997) and in Felsenstein (2004, pgs. 564-9). 3 Such detailed data is not reliably available for the majority of phylogenetic trees or gene genealogies, and when it is, the confidence intervals on the age estimates can be very wide. To address the problem of rate estimation when molecular clock estimates are absent or unreliable, Reed and Hughes (2002) derived distributions of genera sizes weighted by the time intervals separating generic branching events. Under the assumption that within a lineage (clade), subclades such as genera within families arise at a rate ρ, the authors used an exponential approximation for the distribution of the generic ages. This representationisnotlimitedtoanytaxonomicgroup,asitappliestoanylineage inwhichconstituentsublineagesaredefinedby someunique characters,genetic markers,or in the case of humans, family names. The time at which a genus within a clade of age τ arises is given by ρe−ρt f(t)= (1.8) 1−e−ρτ which, for τ →∞, gives f(t)=ρe−ρt. (1.9) Using(1.9)toweighttheprobabilitydistributionsp attimet,ReedandHughes n defined τ q (τ)= ρp (t)e−ρtdt . (1.10) n n 0 Z ReedandHughesderivedageneratingfunctionandanalgorithmforcalculation of q (τ) recursively from a limiting case (when n→ ∞). It will be shown that n given certain choices of changes in variables, a closed form expression for this integral can be evaluated, and a number of useful approximations to the q (τ) n distributioncanbemade,ofwhichthe asymptoticapproximationcalculatedby Reed and Hughes is an important limiting case. 2. Results Pure Birth Process Wewillbeginwiththedistributionforapurebirthprocess,whichisareasonable approximationwhen λ≫µ. τ q (τ)= ρ exp[−(λ+ρ)t](1−exp[−λt])n−1dt . (2.1) n Z0 Under the assumption that the lineage is ancient relative to the age of any sublineage, we take the limit as τ → ∞. Then, applying a change of variables y =eλt, the resulting integral is 1 q =ρ yρ/λ(1−y)n−1dy (2.2) n Z0 4 which, for n≥1, evaluates to Γ ρ +1 Γ(n) q =Beta(ρ/λ+1,n)= λ . (2.3) n Γ n+ ρ +1 (cid:0) λ(cid:1) Using an asymptotic expansion of the gamma(cid:0) ratios inv(cid:1)olving n above (e.g. Exton, 1978, Luke 1969), we get the following: This can be further simplified to an expansion of the ratios of gamma functions, up to the order n−2, Γ(ρ/λ+1) ρ(1+ ρ) q = 1− λ λ n nρ/λ+1 2n (cid:26) +(1+λρ)(2+λρ) 3 ρ +1 2+ ρ +O(n−3) . 24n2 λ λ h (cid:0) (cid:1) i (cid:27) It follows from the above that, as n→∞, the probability can be approximated by q =Γ(ρ/λ+1)n−ρ/λ−1. (2.4) n Birth and Death Processes As in the analysesofReed andHughes (2002),we considerthree differentbirth and death models. The first, with λ>µ (birth rate higher than death rate), is perhapsthemostbiologicallysignificant,becausethisistheonlybirthanddeath process in which a lineage has a nonzero probability of persisting indefinitely. WewillevaluatethefunctionQ forthetruncatedsolutionstothebirthand n death process given by equation (1.6), because most of the relevant empirical datainvolvesn>0. Forconvenienceweusetheparametersω=λ−µandθ=µ/λ, so thatwhen the birth rate is greaterthan the deathrate,ω >0 andθ <1. The integral over the truncated distribution can be written as ∞ ρω exp[−(ω+ρ)t] 1− exp[−ωt] n−1 Q = dt. n 1−θ exp[ωt] 1−θ exp[−ωt] Z0 (cid:18) (cid:19) Applying the changes of variable 1−exp[−ωt] (1−θ)dy y(t)= , dt= 1−θexp[−ωt] ω(1−y)(1−θy) and noting that the integral limits are y(0)= 0 and y(∞)=1, we obtain 1 Q =ρ(1−θ) yn−1(1−y)ρ/ω(1−θy)−ρ/ω−1dy. (2.5) n Z0 If we do the same for Equation (1.4) by integrating over the part of the distri- bution where n≥1, an expression of similar form is derived 5 ρ(1−θ) 1 q = yn−1(1−y)ρ/ω(1−θy)−ρ/ωdy . (2.6) n ω Z0 Thetwointegralsgivenaboveevaluatetoconstantmultiplesofahypergeometric function (Exton 1978, Luke 1969), which has the integral representation 1 Γ(c−a)Γ(a) ya−1(1−y)c−a−1(1−θy)−bdy = 2F1(a,b,c,θ) (2.7) Γ(c) Z0 and 2F1 is the Gauss hypergeometric function that is defined by the series ∞ (a) (b) 2F1(a,b,c,θ)= k kθk (2.7a) (c) k=0 k X where Γ(a+k) (a) =a(a+1)...(a+k−1)= . k Γ(a) (Exton 1978, Luke 1969). In equations (2.5) and (2.6), we have a = n and b=1+ρ/ω. In the first instance, c = n+ρ/ω+1, and c = n+ρ/ω+2 in the second. Using (2.7) and (2.7a) in (2.5) we obtain ∞ ρ ρ θk Q = ρ(1−θ)Γ 1+ A 1+ B . (2.8) n ω ω k k! (cid:16) (cid:17) Xk=0(cid:16) (cid:17) The coefficients A=A(ρ,ω,n) and B =B(ρ,ω,k,n) are given by Γ(n) A(ρ,ω,n)= , Γ n+ ρ +1 ω (n) (cid:0) Γ(n+k(cid:1))Γ(n+ρ/ω+1) k B(ρ,ω,k,n)= = . (n+ρ/ω+1) Γ(n)Γ(n+k+ρ/ω+1) k ItiswellknownthattheratiosofGammafunctionscanbeexpandedinterms of 1/n, see for example Luke (1969). The expansion is in terms of Bernoulli polynomials, see also Moschopoulos and Mudholkar (1983) for similar asymp- toticexpansionsofhypergeometricfunctions. The approximationstofolloware lengthy and rather tedious, but nevertheless straightforward. Second order approximations to the terms A and B are given by: A=n−(1+ωρ) 1− ωρ 1+ ωρ + 1+ ωρ 2+ ωρ 3 ωρ 2+ ωρ × 1+O(n−3)  (cid:0)2n (cid:1) (cid:0) (cid:1)(cid:0) 24(cid:1)n(cid:16)2 (cid:0) (cid:1) (cid:17) (cid:0) (cid:1)   To solve for B, we use the following: 6 Γ(n+k) k(k−1) k(k−1) 3(k−1)2−k−1 =nk 1+ + ) × 1+O(n−3) Γ(n) 2n 24n2 (cid:0) (cid:1) ! (cid:0) (cid:1) Γ n+ ωρ +k+1 =nk 1+ k k+ 2ωρ +1 + k(k−1) 3 k+ 2ωρ +1 2−k−1 (cid:0)Γ n+ ωρ +1 (cid:1)  (cid:0) 2n (cid:1) (cid:16) (cid:0) 24n2 (cid:1) (cid:17) (cid:0) (cid:1)   × 1+O(n−3) (cid:0) (cid:1) Γ n+ ρ +k+1 ω = Γ n+ ρ +1 (cid:0) ω (cid:1) (cid:0) (cid:1) nk 1+ k k+ 2ωρ +1 + k(k−1) 3 k+ 2ωρ +1 2−k−1 × 1+O(n−3)  (cid:0) 2n (cid:1) (cid:16) (cid:0) 24n2 (cid:1) (cid:17) (cid:0) (cid:1)   which allows one to rewrite B as a series expansion: k(k−1) k(k−1) 3(k−1)2−k−1 B = 1+ + 2n 24n2 (cid:0) (cid:1)! −1 k(k+ 2ρ +1 k(k−1)(3(k+ 2ρ +1)2−k−1) × 1+ ω + ω × 1+O(n−3) 2n 24n2 ! (cid:0) (cid:1) k ρ 1 =1− 1+ + φ(ρ,ω,k)+O(n−3) n ω n2 (cid:16) (cid:17) where ρ2 3ρ ρ2 ρ φ(ρ,ω,k)= k2 + +1 −k + . 2ω2 2ω 2ω2 2ω (cid:18) (cid:18) (cid:19) (cid:18) (cid:19)(cid:19) Collecting terms, we obtain the following approximation up to order n−2 Qn =ρ(1−θ)Γ 1+ ωρ n−1−ωρ 1− ωρ (cid:0)12+n ωρ(cid:1) + ωρ (cid:0)1+ ωρ(cid:1)2(cid:16)43n(cid:0)2ωρ(cid:1)2+ ωρ(cid:17) (cid:16) (cid:17)   ∞ ρ k ρ φ(ρ,ω,k)θk × 1+ 1− 1+ + × 1+O(n−3) . (2.9) ω k n ω n2 k! Xk=0(cid:16) (cid:17) (cid:20) (cid:16) (cid:17) (cid:21) (cid:0) (cid:1) Thetermsintheaboveapproximationunderscorethefactthattheshapeofthe distributionQ isdeterminedbythemagnitudeofρ/ω,theratiooftheratesat n 7 whichnew lineages(genera)ariseto the net birthrate ofindividuals orspecies. This is because in the limit as τ → ∞, the absolute magnitudes of the two rateparametersdonotcontributetotheratiooflineagenumbertolineagesize. Basically, a small value of ρ/ω implies that over sufficient time, there will be a comparatively low number of lineages rich in individuals, while a large value results in many lineages with relatively few individuals. It is also remarked that the series terms ∞ ∞ ρ θk ρ θk k 1+ , k2 1+ ω k k! ω k k! kX=0 (cid:16) (cid:17) kX=0 (cid:16) (cid:17) have to be computed numerically, just as the hypergeometric function must be estimated using numerical methods. As there is no closed-form expression for equation (2.9), the value of these approximations is to show the rate of convergence of first and second order terms (as functions of n−1) to the exact solution for different birth and death rates. In contrast, a closed form expression does exist for the asymptotic limit, as n → ∞. Unlike the terms involving higher powers of k in (2.9), the first series term within the parenthesis converges to ∞ ρ θk 1+ =(1−θ)−1−ρ/ω. ω k k! Xk=0(cid:16) (cid:17) By ignoring all factors of higher order than n−1, we have ρ Q ≈ρ(1−θ)1−ρ/ωΓ 1+ n−ρ/ω−1 (2.10) n ω (cid:16) (cid:17) as in Reed and Hughes (2002), apart from the constant factor of 1/ω for the truncateddistribution. Itis noteworthythat inequation(2.10)the termθ only appears as a constant factor, while in (2.10) its magnitude determines the rate at which the series term converges. The efficacy of both the asymptotic and n−2 approximations will be investigated numerically in the next section, where they are compared to the exact hypergeometric function solutions. Wenextdirectourattentiontothecaseofa”declining”linagewhereλ<µ, i.e. when the birth or speciation rate is less than the death or extinction rate. In this scenario, the parameter values ω <0 and θ <1. Using the same changes of variables as in the derivation of equation (2.6), the integral defining the probability density is 1/θ Q =ρ(1−θ) yn−1(1−y)ρ/ω(1−θy)−ρ/ω−1dy . (2.11) n 0 Z With a further change of variables, z=θy, we again derive an expression with integral limits of 0 and 1, 1 z ρ/ω Q =ρ(1−θ)θ1−n zn−1 1− (1−z)−ρ/ω−1dz . (2.12) n θ Z0 (cid:16) (cid:17) 8 The solution to this integral is a hypergeometric function of a form similar to thatofequation(2.6). Here,theparametersarea=n,b=-ρ/ω,andc=n-ρ/ω+1. In place of θ in the series expansion, we have 1/θ, so that (2.11) ultimately evaluates to ∞ ρ ρ θk Q =ρ(1−θ)θ1−n Γ − A − B . (2.13) n ω ω k k! (cid:16) (cid:17) kX=0(cid:16) (cid:17) Note that because ω <0, −ρ/ω >0. For the above distribution, the coefficients A and B are Γ(n) (n) Γ(n+k)Γ(n− ρ +1) A= , B(ρ,ω,k,n)= k = ω . Γ n− ρ n− ρ +1 Γ(n)Γ(n+k− ρ +1) ω ω k ω (cid:0) (cid:1) (cid:0) (cid:1) An approximation in terms of leading orders of 1/n is possible for (2.13) using the same approach as was used in (2.9) i.e. by rewriting the Gamma functions as truncated polynomial ratios. For the A, the second order approximation is: A=nωρ 1− ωρ ωρ +1 + ωρ −1 ωρ −2 3 1+ ωρ 2− ωρ −1 × 1+O(n−3)  (cid:0)2n (cid:1) (cid:0) (cid:1)(cid:0) (cid:1)(cid:16)24n(cid:0)2 (cid:1) (cid:17) (cid:0) (cid:1)   To evaluate B, we have the same ratio Γ(n+k)/Γ(n) in the numerator as in the case of λ>µ, while the denominator term Γ( 1-ρ/ω+k)/ Γ(n-ρ/ω) is: nk 1+ k k− 2ωρ + (k−1)k 3 k− 2ωρ −1 2−k−1 × 1+O(n−3)  (cid:0) 2n (cid:1) (cid:16) (cid:0) 24n2 (cid:1) (cid:17) (cid:0) (cid:1)   UsingtheseriesexpansionofcoefficientsAandB,weobtainthefollowingsecond order approximation for equation (2.13), Qn =ρ(1−θ)θ1−n Γ −ρ nωρ ω (cid:16) (cid:17) × 1− ωρ ωρ +1 + ωρ −1 ωρ −2 3 1+ ωρ 2− ωρ −1  (cid:0)2n (cid:1) (cid:0) (cid:1)(cid:0) (cid:1)2h4n(cid:0)2 (cid:1) i   ∞ × −ρ 1+ k ρ +1 + φ(ρ,ω,k) θk +O(n−3). (2.14) ω k n ω n2 k! Xk=0(cid:16) (cid:17) (cid:20) (cid:16) (cid:17) (cid:18) (cid:19)(cid:21) Recall that ω <0, so that the coefficients -ρ/ω are positive. The asymptotic approximation to this distribution, calculated for n → ∞, is given by the following equation: 9 ρ Q ≈ρθn(1−θ)1−ρ/ωΓ 1+ n−ρ/ω−1 . (2.15) n ω Thespecialcasewhereλ=µ(exactlyzeron(cid:16)etgrow(cid:17)thrate)isnotlikelytooccur in biological systems, therefore we do not analyze this scenario in detail. In those rare instances where net growth rate is close to zero, the distributions can be approximated by using the equations for the cases where growth rate is positive and negative and taking limits as µ approaches λ. A Numerical Assessment of the Approximations Theaccuracyofthesecondorder(2.9)andasymptoticapproximations(2.10)are assessedbycomparingtheir probabilitydensitiesto thoseofthe exactsolutions (2.8) for a range of parameters values. The examples considered here are the biologically most relevant case where the birth rate is higher than the death rate, such that ω >0 and θ <1. Attention is focused on two parameters: ρ/ω and θ. The first is the ratio of therateatwhichnewlineagesarisetothenetbirthrateofindividualsorspecies withinthatlineage. As wasdiscussedaboveinconnectionwiththe distribution (2.9), the shape of the probability density function depends principally on the ratio ρ/ω. When ρ/ω ≪1, the distribution will be characterized by very few lineages with large numbers of individuals, while as ρ/ω →1, the distribution willbeskewedtotheleftduetothepresenceofmanylineageswithfew(perhaps only a single) individual. The accuracy of the approximation is examined for differentvaluesofthisratio,sothatinoneinstanceρandωareofapproximately the same magnitude, while in another ρ and ω differ by an order of magnitude. It is expected from the summation term in equation (2.9) that the second order approximation should be most accurate when θ ≪1, corresponding to a low birth rate relative to the death rate, because the series converges rapidly when θ is near zero. To compare the accuracy of the approximations as a function of the rate parameters, we will consider cases where of θ=0.01, 0.1, and 0.4. The predictions on the accuracy of the respective approximations are givenqualitativesupportbyFigures1and2,whichplotthecumulativedensity functions n Q N N=1 X for the exactsolutionQ givenby equation(2.8), the ordern−2 approximation n of equation (2.9), and the asymptotics in equation (2.10). In the first set of figures, the rates at which new lineages (e.g. "genera") are born is of the same order of magnitude as the net birth rate, i.e. ρ=0.02, ω=0.05, while in the second set, there is a tenfold difference between the birth rates of lineages and individuals, with ρ=0.01 and ω=0.1. IIINNNSSSEEERRRTTT FFFIIIGGGUUURRREEESSS 111aaa---ccc aaannnddd 222aaa---ccc HHHEEERRREEE 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.