The external lengths in Kingman's coalescent

Svante Janson* and Götz Kersting†

17 January, 2011

arXiv:1004.5011v2 [math.PR]

Abstract

In this paper we prove asymptotic normality of the total length of external branches in Kingman's coalescent. The proof uses an embedded Markov chain, which can be described as follows: Take an urn with $n$ black balls. Empty it in $n$ steps according to the rule: In each step remove a randomly chosen pair of balls and replace it by one red ball. Finally remove the last remaining ball. Then the numbers $U_k$, $0 \le k \le n$, of red balls after $k$ steps exhibit an unexpected property: $(U_0,\ldots,U_n)$ and $(U_n,\ldots,U_0)$ are equal in distribution.

MSC 2000 subject classifications. 60K35, 60F05, 60J10
Key words and phrases. Coalescent, external branch, time reversibility, urn model

* Department of Mathematics, Uppsala University, PO Box 480, SE-751 06 Uppsala, Sweden.
† Fachbereich Informatik und Mathematik, Universität Frankfurt, Fach 187, D-60054 Frankfurt am Main, Germany.

1 Introduction and results

Our main result in this paper is that the total length $L_n$ of all external branches in Kingman's coalescent with $n$ external branches is asymptotically normal for $n \to \infty$.

Kingman's coalescent (1982) consists of two components. First there are the coalescent times $T_1 > T_2 > \cdots > T_n = 0$. They are such that

$$\binom{k}{2}(T_{k-1} - T_k), \qquad k = 2, \ldots, n,$$

are independent, exponential random variables with expectation 1. Second there are partitions $\pi_1 = \{\{1,\ldots,n\}\}, \pi_2, \ldots, \pi_n = \{\{1\},\ldots,\{n\}\}$ of the set $\{1,\ldots,n\}$, where the set $\pi_k$ contains $k$ disjoint subsets of $\{1,\ldots,n\}$ and $\pi_{k-1}$ evolves from $\pi_k$ by merging two randomly chosen elements of $\pi_k$. Moreover, $(T_n,\ldots,T_1)$ and $(\pi_n,\ldots,\pi_1)$ are independent. For convenience we put $\pi_0 := \emptyset$.

As is customary the coalescent can be represented by a tree with $n$ leaves labelled from 1 to $n$. Each of these leaves corresponds to an external branch of the tree. The other node of the branch with label $i$ is located at level

$$\rho(i) := \max\{k \ge 1 : \{i\} \notin \pi_k\}$$

within the coalescent. The length of this branch is $T_{\rho(i)}$. The total external length of the coalescent is given by

$$L_n := \sum_{i=1}^n T_{\rho(i)}.$$

This quantity is of a certain statistical interest. Coalescent trees have been introduced by Kingman as a model for the genealogical relationship of $n$ individuals, down to their most recent common ancestor. Mutations can be located everywhere on the branches. Then mutations on external branches affect only single individuals. This fact was used by Fu and Li (1993) in designing their D-statistic and providing a test of whether or not data fit Kingman's coalescent.

Otherwise mainly single external branches have been studied in the literature. The asymptotic distribution of $T_{\rho(i)}$ has been obtained by Caliebe et al (2007), using a representation of its Laplace transform due to Blum and François (2005). We address this issue in Section 6 below. Freund and Möhle (2009) investigated the external branch length of the Bolthausen–Sznitman coalescent, and Gnedin et al (2008) that of the $\Lambda$-coalescent.

Here is our main result.

Theorem 1. As $n \to \infty$,

$$\frac{1}{2}\sqrt{\frac{n}{\log n}}\,\bigl(L_n - 2\bigr) \;\xrightarrow{d}\; N(0,1).$$

The proof will show that the limiting normal distribution originates from the random partitions and not from the exponential waiting times. A second glance at this result reveals a peculiarity: the normalization of $L_n$ is carried out using its expectation, but only half of its variance.
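Before analyzing this peculiarity, the definitions above translate directly into a Monte Carlo experiment. The following minimal Python sketch is our own illustration, not part of the paper; the name total_external_length is ours, and NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(1)

def total_external_length(n):
    """One sample of L_n = sum_i T_{rho(i)} for Kingman's n-coalescent."""
    singleton = [True] * n    # is block i still external (a singleton)?
    t = 0.0                   # runs through the coalescent times T_{k-1}
    L = 0.0
    for k in range(n, 1, -1):
        # binom(k,2)(T_{k-1} - T_k) is standard exponential, so the
        # holding time while k blocks are present has mean 2 / (k (k-1))
        t += rng.exponential(2.0 / (k * (k - 1)))
        i, j = rng.choice(k, size=2, replace=False)
        # a singleton merging here has rho = k - 1 and branch length T_{k-1} = t
        L += t * (singleton[i] + singleton[j])
        singleton[min(i, j)] = False      # the merged block is internal
        del singleton[max(i, j)]
    return L

sims = [total_external_length(1000) for _ in range(2000)]
print(np.mean(sims))   # close to E(L_n) = 2, recalled below
```

Averaging a few thousand runs for $n = 1000$ gives values close to 2, matching the expectation recalled next; the $\sqrt{\log n / n}$ scale of the fluctuations in Theorem 1 already hints that the normal approximation sets in very slowly.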
The expectation and the variance of $L_n$ have been determined by Fu and Li (1993) (with a correction given by Durrett (2002)). They obtained

$$E(L_n) = 2, \qquad \operatorname{Var}(L_n) = \frac{8n h_n - 16n + 8}{(n-1)(n-2)} \sim \frac{8\log n}{n}$$

with $h_n := 1 + \frac{1}{2} + \cdots + \frac{1}{n}$, the $n$-th harmonic number. Below we derive a more general result.

To uncover this peculiarity we shall study the external lengths in more detail. First we look at the point processes $\eta_n$ on $(0,\infty)$, given by $\eta_n = \sum_{i=1}^n \delta_{\sqrt{n}\,T_{\rho(i)}}$, i.e.

$$\eta_n(B) := \#\{i : \sqrt{n}\,T_{\rho(i)} \in B\} \qquad (1)$$

for Borel sets $B \subseteq (0,\infty)$.

Theorem 2. As $n \to \infty$ the point process $\eta_n$ converges in distribution, as point processes on $(0,\infty]$, to a Poisson point process $\eta$ on $(0,\infty)$ with intensity measure $\lambda(dx) = 8x^{-3}\,dx$.

We use $(0,\infty]$ in the statement of Theorem 2 instead of $(0,\infty)$ since it is stronger, including for example $\eta_n(a,\infty) \xrightarrow{d} \eta(a,\infty)$ for every $a > 0$. The significance is that, as $n \to \infty$, there will be points clustering at 0 but not at $\infty$. (Below in the proof we recall the definition of convergence in distribution of point processes.)

Theorem 2 permits a first orientation. Since $\sqrt{n}\,L_n = \int x\,\eta_n(dx)$, one is tempted to resort to infinitely divisible distributions. However, the intensity measure $\lambda(dx)$ is slightly outside the range of the Lévy–Khinchin formula. Roughly speaking this means that small points of $\eta_n$ have a dominant influence on the distribution of $L_n$ and we are within the domain of the normal distribution.

Thus let us look at the external lengths in more detail and focus on

$$L_n^{\alpha,\beta} := \sum_{n^\alpha \le \rho(i) < n^\beta} T_{\rho(i)}, \qquad 0 \le \alpha < \beta \le 1,$$

which is the total length of those external branches having their internal nodes between level $\lceil n^\alpha \rceil$ and $\lceil n^\beta \rceil$ within the coalescent. Obviously $L_n = L_n^{0,1}$.

Proposition 3. For $0 \le \alpha < \beta \le 1$

$$E(L_n^{\alpha,\beta}) = \frac{2}{n(n-1)}\bigl(\lceil n^\beta \rceil - \lceil n^\alpha \rceil\bigr)\bigl(2n + 1 - \lceil n^\beta \rceil - \lceil n^\alpha \rceil\bigr)$$

and

$$\operatorname{Var}(L_n^{\alpha,\beta}) \sim 8(\beta - \alpha)\,\frac{\log n}{n},$$

as $n \to \infty$.

In particular $E(L_n^{1-\varepsilon,1}) \sim E(L_n^{0,1})$, whereas $\operatorname{Var}(L_n^{1-\varepsilon,1}) \sim \varepsilon \operatorname{Var}(L_n^{0,1})$. Thus the proposition indicates that the systematic part of $L_n$ and its fluctuations arise in different regions of the coalescent tree, the former close to the leaves and the latter closer to the root. Still this proposition gives an inadequate impression.

Theorem 4. For $0 \le \alpha < \beta < 1/2$

$$P(L_n^{\alpha,\beta} = 0) \to 1$$

as $n \to \infty$. Moreover

$$\sqrt{n}\,L_n^{0,1/2} \xrightarrow{d} \int_2^\infty x\,\eta(dx)$$

and for $1/2 \le \alpha < \beta \le 1$

$$\frac{L_n^{\alpha,\beta} - E(L_n^{\alpha,\beta})}{\sqrt{\operatorname{Var}(L_n^{\alpha,\beta})}} \xrightarrow{d} N(0,1).$$

In addition $L_n^{\alpha,\beta}$ and $L_n^{\gamma,\delta}$ are asymptotically independent for $\alpha < \beta \le \gamma < \delta$.

This result implies Theorem 1: in $L_n = L_n^{0,1/2} + L_n^{1/2,1}$ the first summand is of order $1/\sqrt{n}$, while the fluctuations of the second are of order $\sqrt{\log n / n}$, such that in the limit the second, asymptotically normal component dominates. To this end, however, $n$ has to become exponentially large; otherwise the few long branches, which make up $L_n^{0,1/2}$, cannot be neglected and may produce extraordinarily large values of $L_n$. Thus the normal approximation of the distribution of $L_n$ seems of little use for practical purposes. One expects a fat right tail compared to the normal distribution. Indeed $\int_2^\infty x\,\eta(dx)$ has finite mean but infinite variance.

This is illustrated by the following two histograms from 10000 values of $L_n$, where the length of the horizontal axis to the right indicates the range of the values.

[Histograms of 10000 simulated values of $L_n$ for $n = 50$ (top) and $n = 1000$ (bottom); horizontal range 0 to 8.]

The heavy tails to the right are clearly visible. Also very large outliers appear: for $n = 50$ the simulated values of $L_n$ range from 0.685 to 8.38, and for $n = 1000$ from 1.57 to 7.87.

Also it turns out that the approximation of the variance in Proposition 3 is good only for very large $n$. This can be seen already from the formula of Fu and Li.
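The last remark is easy to quantify. A short numerical comparison of the exact Fu–Li variance with its $8\log n/n$ approximation (a sketch; the function names are ours):

```python
from math import log

def var_exact(n):
    """Fu and Li's Var(L_n), with Durrett's (2002) correction."""
    h_n = sum(1.0 / j for j in range(1, n + 1))   # n-th harmonic number
    return (8 * n * h_n - 16 * n + 8) / ((n - 1) * (n - 2))

def var_asymptotic(n):
    return 8 * log(n) / n

for n in (50, 1000, 10**6):
    print(n, var_exact(n) / var_asymptotic(n))
# ratios are roughly 0.68, 0.80, 0.90: the relative error decays
# only like 1 / log n, so the approximation needs very large n
```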
To get an exact formula for the variance we look at a somewhat different quantity, namely

$$\hat L_n^{\alpha,\beta} := \sum_{i=1}^n \bigl(T_{\rho(i)} \wedge T_{\lfloor n^\alpha \rfloor} - T_{\rho(i)} \wedge T_{\lfloor n^\beta \rfloor}\bigr)$$

with $0 \le \alpha < \beta \le 1$, which is the portion of the external length between level $\lfloor n^\alpha \rfloor$ and $\lfloor n^\beta \rfloor$ within the coalescent.

Proposition 5. For $0 \le \alpha \le 1$ with $m := \lfloor n^\alpha \rfloor$

$$E(\hat L_n^{\alpha,1}) = 2\,\frac{n-m}{n-1}$$

and

$$\operatorname{Var}(\hat L_n^{\alpha,1}) = \frac{8(h_{n-1} - h_{m-1})(n + 2m - 2)}{(n-1)(n-2)} - \frac{4(n-m)(4n + m - 5)}{(n-1)^2(n-2)}.$$

For $\alpha = 0$ we recover the formula of Fu and Li. A similar expression holds for $\hat L_n^{\alpha,\beta}$.

Proposition 3 and Theorem 4 carry over to $\hat L_n^{\alpha,\beta}$, up to a change in expectation and with the limit $\sqrt{n}\,\hat L_n^{0,1/2} \xrightarrow{d} \int_2^\infty (x-2)\,\eta(dx)$. The following histogram from a random sample of length 10000 shows that already for $n = 50$ the distribution of $\hat L_n^{1/2,1}$ fits the normal distribution well when using the values for expectation and variance given in Proposition 5.

[Histogram of 10000 simulated values of $\hat L_{50}^{1/2,1}$; horizontal range 0 to 3.]

Our main tool for the proofs is a representation of $L_n$ by means of an imbedded Markov chain $U_0, U_1, \ldots, U_n$, which is of interest in its own right. We shall introduce it as an urn model. The relevant fact is that this model possesses an unexpected hidden symmetry, namely it is reversible in time. This is our second main result. For the proof we use another urn model, which allows reversal of time in a simple manner.

The urn models are introduced and studied in Section 2. Proposition 3 is proven in Section 3, Theorems 2 and 4 are derived in Section 4 and Proposition 5 in Section 5. In Section 6 we complete the paper by considering the length of an external branch chosen at random.

2 The urn models

Take an urn with $n$ black balls. Empty it in $n$ steps according to the rule: In each step remove a randomly chosen pair of balls and replace it by one red ball. In the last step remove the last remaining ball. Let

$$U_k := \text{number of red balls in the urn after } k \text{ steps}.$$

Obviously $U_0 = U_n = 0$, $U_1 = U_{n-1} = 1$ and $1 \le U_k \le \min(k, n-k)$ for $2 \le k \le n-2$. $U_0, \ldots, U_n$ is a Markov chain with transition probabilities

$$P(U_{k+1} = u' \mid U_k = u) = \begin{cases} \binom{u}{2} \big/ \binom{n-k}{2}, & \text{if } u' = u - 1, \\[2pt] u(n-k-u) \big/ \binom{n-k}{2}, & \text{if } u' = u, \\[2pt] \binom{n-k-u}{2} \big/ \binom{n-k}{2}, & \text{if } u' = u + 1. \end{cases}$$

We begin our study of the model by calculating expectations and covariances.

Proposition 6. For $0 \le k \le l \le n$

$$E(U_k) = \frac{k(n-k)}{n-1}, \qquad \operatorname{Cov}(U_k, U_l) = \frac{k(k-1)(n-l)(n-l-1)}{(n-1)^2(n-2)}.$$

Proof. Imagine that the black balls are numbered from 1 to $n$. Let $Z_{ik}$ be the indicator variable of the event that the black ball with number $i$ is not yet removed after $k$ steps. Then $U_k = n - k - \sum_{i=1}^n Z_{ik}$ and consequently

$$E(U_k) = n - k - n E(Z_{1k})$$

and for $k \le l$, in view of $Z_{1l} \le Z_{1k}$,

$$\operatorname{Cov}(U_k, U_l) = \sum_{i=1}^n \sum_{j=1}^n \operatorname{Cov}(Z_{ik}, Z_{jl}) = n(n-1) E(Z_{1k} Z_{2l}) + n E(Z_{1l}) - n^2 E(Z_{1k}) E(Z_{1l}).$$

Also

$$P(Z_{1k} = 1) = \frac{\binom{n-1}{2}}{\binom{n}{2}} \cdots \frac{\binom{n-k}{2}}{\binom{n-k+1}{2}} = \frac{(n-k)(n-k-1)}{n(n-1)}$$

and for $k \le l$

$$P(Z_{1k} = 1, Z_{2l} = 1) = \frac{\binom{n-2}{2}}{\binom{n}{2}} \cdots \frac{\binom{n-k-1}{2}}{\binom{n-k+1}{2}} \cdot \frac{\binom{n-k-1}{2}}{\binom{n-k}{2}} \cdots \frac{\binom{n-l}{2}}{\binom{n-l+1}{2}} = \frac{(n-k-1)(n-k-2)(n-l)(n-l-1)}{n(n-1)^2(n-2)}.$$

Our claim now follows by careful calculation. □
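Proposition 6 can also be checked by simulating the urn directly; a sketch (urn_path is our own name; NumPy's hypergeometric sampler draws the composition of the removed pair):

```python
import numpy as np

rng = np.random.default_rng(2)

def urn_path(n):
    """One path (U_0, ..., U_n) of the urn chain."""
    U = [0]
    black, red = n, 0
    for _ in range(n - 1):
        # number of black balls in a uniformly chosen pair
        pair_black = rng.hypergeometric(black, red, 2)
        black -= pair_black
        red -= 2 - pair_black
        red += 1                 # the pair is replaced by one red ball
        U.append(red)
    U.append(0)                  # the last remaining (red) ball is removed
    return U

n, runs = 30, 50_000
paths = np.array([urn_path(n) for _ in range(runs)])
k, l = 10, 20
print(paths[:, k].mean(), k * (n - k) / (n - 1))              # E(U_10): ~6.90
emp_cov = np.cov(paths[:, k], paths[:, l])[0, 1]
print(emp_cov, k*(k-1)*(n-l)*(n-l-1)/((n-1)**2*(n-2)))        # Cov: ~0.344
```

The sample moments agree with the formulas to within Monte Carlo error.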
Note that these expressions for expectations and covariances are invariant under the transformation $k \mapsto n-k$, $l \mapsto n-l$. This is not by coincidence:

Theorem 7. $(U_0, U_1, \ldots, U_n)$ and $(U_n, U_{n-1}, \ldots, U_0)$ are equal in distribution.

Proof. Leaving aside $U_0 = U_n = 0$ we have $U_k \ge 1$ a.s. for the other values of $k$. Instead we shall look at $U'_k = U_k - 1$ for $1 \le k \le n-1$. It turns out that for this process one can specify a different dynamics, which is more lucid and amenable to reversing time.

Consider the following alternative box scheme: There are two boxes A and B. At the beginning A contains $n-1$ black balls whereas B is empty. The balls are converted in $2n-2$ steps into $n-1$ red balls lying in B. Namely, in the steps number $1, 3, \ldots, 2n-3$ a randomly drawn ball from A is shifted to B, and in the steps number $2, 4, \ldots, 2n-2$ a randomly chosen black ball (whether from A or B) is recolored to a red ball. These $2n-2$ operations are carried out independently. For $1 \le k \le n-1$ let

$$U'_k := \text{number of red balls in box A after } 2k-1 \text{ steps},$$

that is, at the moment after the $k$-th move and before the $k$-th recoloring. Obviously the sequence is a Markov chain; also $U'_1 = 0$.

As to the transition probabilities note that after $2k-1$ steps there are $n-k$ black balls in all and $n-k-1$ balls in A. Thus given $U'_k = r$ there are $r$ red and $n-k-r-1$ black balls in A, and the remaining $r+1$ black balls belong to B. Then $U'_{k+1} = r+1$ occurs only if in the next step the ball recolored from black to red belongs to A and subsequently the ball shifted from A to B is black. Thus

$$P(U'_{k+1} = r+1 \mid U'_k = r) = \frac{n-k-r-1}{n-k} \cdot \frac{n-k-r-2}{n-k-1} = \binom{n-k-r-1}{2} \Big/ \binom{n-k}{2}.$$

Similarly $U'_{k+1} = r-1$ occurs if the recolored ball belongs to B and next the ball shifted from A to B is red. The corresponding probability is

$$P(U'_{k+1} = r-1 \mid U'_k = r) = \frac{r+1}{n-k} \cdot \frac{r}{n-k-1} = \binom{r+1}{2} \Big/ \binom{n-k}{2}.$$

Since $U_1 = 1 = U'_1 + 1$ and in view of the transition probabilities of $(U_k)$ and $(U'_k)$ we see that $(U_1, \ldots, U_{n-1})$ and $(U'_1 + 1, \ldots, U'_{n-1} + 1)$ indeed coincide in distribution.

Next note that $U'_{n-1} = 0$. Therefore $U'_k$ can be considered as a function not only of the first $2k-1$ but also of the last $2n-2k-1$ shifting and recoloring steps. Since the steps are independent, the process backwards is equally easy to handle. Taking into account that backwards the order of moving and recoloring balls is interchanged, one may just repeat the calculations above to obtain reversibility.

But this repetition can be avoided as well. Let us put our model more formally: Label the balls from 1 to $n-1$ and write the state space as

$$S := \bigl\{ \bigl( (L_1, c_1), \ldots, (L_{n-1}, c_{n-1}) \bigr) \;\big|\; L_i \in \{A, B\},\; c_i \in \{b, r\} \bigr\},$$

where $L_i$ is the location of ball $i$ and $c_i$ its color. Then in our model the first and second coordinates are changed in turn from A to B and from b to r. This is done completely at random, starting within the first coordinates. Clearly we may interchange the role of the first and second coordinates. Thus our box model is equivalent to the following version:

Again initially A contains $n-1$ black balls whereas B is empty. Now in the steps number $1, 3, \ldots, 2n-3$ a randomly chosen black ball is recolored to a red ball, and in the steps number $2, 4, \ldots, 2n-2$ a randomly drawn ball from A is shifted to B. Again these $2n-2$ operations are carried out independently. Here we consider

$$U''_k := \text{number of black balls in box B after } 2k-1 \text{ steps}.$$

Then from the observed symmetry it is clear that $(U'_1, \ldots, U'_{n-1})$ and $(U''_1, \ldots, U''_{n-1})$ are equal in distribution. If we finally interchange both colors and boxes as well, then we arrive at the dynamics of the backward process. This finishes the proof. □
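The first step of this proof, the equality in distribution of $(U_1, \ldots, U_{n-1})$ and $(U'_1 + 1, \ldots, U'_{n-1} + 1)$, invites a quick empirical check via a direct simulation of the box scheme (a sketch; box_path is our name):

```python
import numpy as np

rng = np.random.default_rng(3)

def box_path(n):
    """(U'_1, ..., U'_{n-1}): red balls in box A after each odd step."""
    in_A = list(range(n - 1))          # labels of the balls still in A
    red = set()                        # labels of the balls already red
    black = list(range(n - 1))         # labels of the balls still black
    out = []
    for _ in range(n - 1):
        # odd step: shift a random ball from A to B
        in_A.pop(rng.integers(len(in_A)))
        out.append(sum(ball in red for ball in in_A))
        # even step: recolor a random black ball (from A or B)
        red.add(black.pop(rng.integers(len(black))))
    return out

n, runs = 20, 50_000
paths = np.array([box_path(n) for _ in range(runs)]) + 1       # U'_k + 1
for k in (1, 5, 10, 15):
    print(k, paths[:, k - 1].mean(), k * (n - k) / (n - 1))    # vs E(U_k)
```

The sample means of $U'_k + 1$ reproduce $E(U_k) = k(n-k)/(n-1)$ from Proposition 6.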
There is a variant of our proof which makes the reversibility of $(U'_k)$ manifest in a different manner. Let again the balls be labelled from 1 to $n-1$, and denote

$\nu_m$ := the instance between 1 and $n-1$ at which ball $m$ is colored red,
$\sigma_m$ := the instance between 1 and $n-1$ at which ball $m$ is shifted to box B.

Then from our construction it is clear that $\nu = (\nu_m)$ and $\sigma = (\sigma_m)$ are two independent random permutations of the numbers $\{1, \ldots, n-1\}$. Moreover, at instance $k$ (i.e. after $2k-1$ steps) ball number $m$ is red and belongs to box A if it was colored before and shifted afterwards, i.e. $\nu_m < k < \sigma_m$. Thus we obtain the formula

$$U'_k = \#\{1 \le m \le n-1 : \nu_m < k < \sigma_m\} \qquad (2)$$

and we may conclude the following result.

Corollary 8. Let $\nu$ and $\sigma$ be two independent random permutations of $\{1, \ldots, n-1\}$. Then $(U_1, \ldots, U_{n-1})$ is equal in distribution to the process

$$\bigl(\#\{1 \le m \le n-1 : \nu_m < k < \sigma_m\} + 1\bigr)_{1 \le k \le n-1}.$$

Certainly this representation implies Theorem 7 again. Also it contains additional information. For example, it is immediate that $U_k - 1$ has a hypergeometric distribution with parameters $n-1$, $k-1$, $n-k-1$.
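The permutation representation of Corollary 8 gives an immediate way to sample the chain without running the urn at all; a sketch (the name U_via_permutations is ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def U_via_permutations(n):
    """(U_1, ..., U_{n-1}) from two independent permutations, per Corollary 8."""
    nu = rng.permutation(np.arange(1, n))     # nu_m: instance ball m is colored
    sigma = rng.permutation(np.arange(1, n))  # sigma_m: instance ball m is shifted
    return np.array([np.count_nonzero((nu < k) & (k < sigma)) + 1
                     for k in range(1, n)])

n, runs = 25, 50_000
samples = np.array([U_via_permutations(n) for _ in range(runs)])
k = 12
# U_k - 1 should be hypergeometric with parameters n-1, k-1, n-k-1,
# whose mean is (k-1)(n-k-1)/(n-1)
print(samples[:, k - 1].mean() - 1, (k - 1) * (n - k - 1) / (n - 1))
```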
The next example contains a first application of Theorem 7 to our original urn model.

Example. Let us consider $\tau_n = \max\{k \ge 1 : U_{n-k} = k\}$, the number of red balls in the urn after the last black ball has been removed. From reversibility $\tau_n$ has the same distribution as the moment $\tau'_n = \max\{k \ge 1 : U_k = k\}$ before the first red ball is taken away from the urn. Thus

$$P(\tau_n \ge k) = \frac{\binom{n-2}{2}}{\binom{n-1}{2}}\,\frac{\binom{n-4}{2}}{\binom{n-2}{2}} \cdots \frac{\binom{n-2k+2}{2}}{\binom{n-k+1}{2}} = \frac{(n-k)\cdots(n-2k+1)}{(n-1)\cdots(n-k)}.$$

It follows for $t \ge 0$ that

$$P\Bigl(\frac{\tau_n}{\sqrt{n}} \ge t\Bigr) \to \exp(-t^2),$$

as $n \to \infty$.

More generally the dynamics of our urn looks as follows: Clearly, if $n$ is large, then in the beginning always two black balls are removed from the urn. The rare moments when red balls are taken away appear with increasing rate. Indeed it is not difficult to see that in the limit $n \to \infty$, and after a $\sqrt{n}$-scaling of time, these instances build up a Poisson process with linearly increasing rate. As we have seen, the picture remains the same after reversal of time. This will be made more precise in Section 4.

We conclude this section by imbedding our urn model into the coalescent. Let

$$V_k := k - \#\{i : \rho(i) < k\}, \qquad (3)$$

and $U_k := V_{n-k}$, $0 \le k \le n$. Thus $V_k$ is the number of internal branches among the $k$ branches after the $(n-k)$-th coalescing event, and $U_k$ is the number of internal branches among the $n-k$ branches after the $k$-th coalescing event. The coalescing mechanism takes two random branches and combines them into one internal branch. If we code the external branches by black balls and the internal branches by red, this completely conforms to our urn model; thus $(U_0, \ldots, U_n)$ is as above. By Theorem 7, $(V_0, \ldots, V_n)$ has the same distribution as $(U_0, \ldots, U_n)$. In the next sections we make use of the Markov chain $V_0, \ldots, V_n$ and its properties.

Remark. For a different interpretation of the process $(U_k)$, suppose that we have $n-1$ pairs of (different) shoes, and that all left shoes are mixed in one pile and all right shoes in another. We sort the shoes by taking first a left shoe (at random), then a right shoe (also at random), then another left shoe, and so on. As soon as we take a shoe that matches one that we already have picked, we put away the pair; otherwise we put the shoe on the table in front of us.

If the pairs are numbered and $\nu_m$ is the time right shoe $m$ is picked, and $\sigma_m$ the time left shoe $m$ is picked, then right shoe $m$ is on the table when the $k$-th left shoe has been picked if and only if $\nu_m < k < \sigma_m$. So by (2), the number of right shoes remaining on the table when the $k$-th left shoe has been picked is $U'_k$, $1 \le k \le n-1$. The number of left shoes remaining on the table at the same time is $U'_k + 1 = U_k$, so the total number of shoes on the table is $2U_k - 1$.

This is a variation of the sock-sorting process studied in Steinsaltz (1999) and Janson (2009), Section 8, which is similar except that there is no difference between left and right; we obtain it if we mix all shoes in one pile and pick from it at random. (See Janson (2009) for other interpretations, including priority queues, and further references.) It is not surprising that we have the same asymptotic behaviour of $U_k$ and $\max_k U_k$ as for the sock-sorting problem. In particular, we mention the following Gaussian process limit result, cf. Theorem 8.2 in Janson (2009). (This result is not used in the sequel.)

Theorem 9. As $n \to \infty$, the stochastic process $n^{-1/2}\bigl(U_{\lfloor nt \rfloor} - nt(1-t)\bigr)$ converges in $D[0,1]$ to a continuous Gaussian process $Z(t)$ with mean $E(Z(t)) = 0$ and covariance function $E\bigl(Z(s)Z(t)\bigr) = s^2(1-t)^2$, $0 \le s \le t \le 1$.

Sketch of proof. Note first that $E(U_{\lfloor nt \rfloor}) = nt(1-t) + O(1)$ by Proposition 6. It is easily seen that

$$E(U_{k+1} \mid U_k) = U_k - \frac{2U_k}{n-k} + 1 = \frac{n-k-2}{n-k}\,U_k + 1,$$

and it follows that

$$M_k := \frac{U_k - E(U_k)}{(n-k)(n-k-1)} = \frac{U_k}{(n-k)(n-k-1)} - \frac{k}{(n-1)(n-k-1)},$$

$k = 0, 1, \ldots, n-2$, is a martingale.

Consider in the sequel only $k \le (1-\delta)n$ for some fixed $\delta > 0$. Then $\operatorname{Var}(M_k) \le (n-k-1)^{-4}\operatorname{Var}(U_k) = O(n^{-3})$, and it follows from Doob's inequality that $\max_k |U_k - E(U_k)| = O_P(n^{1/2})$. (Using Theorem 7 we see that this extends to $0 \le k \le n$.)

A straightforward computation of the conditional quadratic variation $\langle M, M \rangle_m := \sum_{k<m} E\bigl((M_{k+1} - M_k)^2 \mid U_k\bigr)$ shows that, uniformly in $0 \le t \le 1-\delta$,

$$n^3 \langle M, M \rangle_{\lfloor nt \rfloor} \xrightarrow{p} \frac{t^2}{(1-t)^2},$$

which implies, see Theorem VIII.3.11 in Jacod and Shiryaev (1987), that $n^{3/2} M_{\lfloor nt \rfloor} \xrightarrow{d} \hat Z(t)$ in $D[0, 1-\delta]$, where $\hat Z(t)$ is a Gaussian martingale given by $\hat Z(t) = W(t^2/(1-t)^2)$ for a standard Brownian motion $W(t)$. The result follows, for $t \in [0, 1-\delta]$, with $Z(t) = (1-t)^2 \hat Z(t)$.

Since $\delta > 0$ is arbitrary, this yields convergence in $D[0,1)$. By time-reversal and Theorem 7, we also have convergence in $D(0,1]$, and together these imply convergence in $D[0,1]$, see e.g. the proof in Janson (2009). □

3 Proof of Proposition 3

We use the representation

$$L_n^{\alpha,\beta} = \sum_{n^\alpha \le k < n^\beta} T_k X_k,$$

where $X_k := \#\{i : \rho(i) = k\}$, $1 \le k < n$. In view of the coalescing procedure $X_k$ takes only the values 0, 1, 2, and from the definition (3) of $V_k$

$$X_k = 1 + V_k - V_{k+1}. \qquad (4)$$

From (4), $V_k = U_{n-k}$ and Proposition 6 we obtain after simple calculations

$$E(X_k) = \frac{2k}{n-1}, \qquad \operatorname{Var}(X_k) = \frac{2k(n-k-1)(n-3)}{(n-1)^2(n-2)} \qquad (5)$$

and for $k < l$

$$\operatorname{Cov}(X_k, X_l) = -\frac{4k(n-l-1)}{(n-1)^2(n-2)}. \qquad (6)$$

Also from $T_k = \sum_{j=k+1}^n (T_{j-1} - T_j)$ we have $E(T_k) = 2\sum_{j=k+1}^n \frac{1}{(j-1)j}$ and $\operatorname{Var}(T_k) = 4\sum_{j=k+1}^n \frac{1}{(j-1)^2 j^2}$; thus

$$E(T_k) = 2\Bigl(\frac{1}{k} - \frac{1}{n}\Bigr), \qquad \operatorname{Var}(T_k) \le \frac{c}{k^3} \qquad (7)$$

for a suitable $c > 0$, independent of $n$. Thus from independence

$$E(L_n^{\alpha,\beta}) = \sum_{n^\alpha \le k < n^\beta} 2\Bigl(\frac{1}{k} - \frac{1}{n}\Bigr)\frac{2k}{n-1}.$$
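As a sanity check, evaluating this sum numerically reproduces the closed form stated in Proposition 3 (a sketch; the function names are ours):

```python
from math import ceil

def mean_sum(n, alpha, beta):
    """E(L_n^{alpha,beta}) by summing the display above term by term."""
    a, b = ceil(n**alpha), ceil(n**beta)
    return sum(2 * (1 / k - 1 / n) * 2 * k / (n - 1) for k in range(a, b))

def mean_closed(n, alpha, beta):
    """The closed form stated in Proposition 3."""
    a, b = ceil(n**alpha), ceil(n**beta)
    return 2 * (b - a) * (2 * n + 1 - b - a) / (n * (n - 1))

n = 1000
for alpha, beta in ((0.0, 1.0), (0.5, 1.0), (0.9, 1.0)):
    print(alpha, beta, mean_sum(n, alpha, beta), mean_closed(n, alpha, beta))
# for alpha = 0, beta = 1 both give exactly 2, the Fu-Li mean E(L_n) = 2
```

For $\alpha = 0$, $\beta = 1$ both expressions give exactly $E(L_n) = 2$.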
