ebook img

Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics PDF

13 Pages·2006·0.22 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics

1 Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics Yun S. Song, Rune Lyngsø, and Jotun Hein Abstract— Given a set D of input sequences, a genealogy for D can be constructed backwards in time, using such evolutionary events as mutation, coalescent and recombination. An ancestral configuration (AC) can be regarded as the multiset of all sequences present at a particular point in time in a possible genealogy for D. The complexity of computing the likelihood of observing D depends heavily on the total number of distinct ACs of D, and therefore it is of interest to estimate that number. For D consisting of binary sequences of finite length,weconsidertheproblemofenumeratingexactlyalldistinctACs.Weassumethattherootsequencetypeisknownandthatmutation process is governed by the infinite-sites model. When there is no recombination, we construct a general method of obtaining closed-form formulas for the total number of ACs. The enumeration problem becomes much more complicated when recombination is involved. In that case, we devise a method of enumeration based on counting contingency tables and construct a dynamic programming algorithm for the approach. Lastly, we describe a method of counting the number of ACs that can appear in genealogies with less than or equal toagivennumberRofrecombinations.OfparticularinterestisthecaseinwhichRisclosetotheminimumnumberofrecombinationsforD. Index Terms—Ancestral configurations, coalescent, recombination, contingency table, enumeration 1 INTRODUCTION leads to a new AC in the genealogy, but a coalescent or a recombination event may lead to an AC that has already been ONEofthestandardproblemsinmathematicalpopulation encountered in the genealogy. geneticsistocomputethelikelihoodP(D)ofobserving In the absence of recombination, genealogies can be rep- a given data set D under the assumed model of evolution. resented by time-ordered binary trees, whereas a case with The likelihood P(D) can be defined as a formal sum over recombination requires a more complicated graphical repre- all possible genealogies consistent with D. A genealogy, in sentation called the ancestral recombination graph (ARG) [5]. turn, can be viewed as a sequence of Markov states, and the It is well known that the coalescent with recombination is likelihoodP(D)canbedetermined,inprinciple,bysumming considerablymoredifficulttostudythantheclassiccoalescent. productsoftransitionprobabilitiesoverallpossiblesequences One obvious reason for that contrast is that there are many of states in a Markov chain with given initial and final states. moreinequivalentARGsthantrees.Asmanycomputationsof There are recursion relations for computing P(D) exactly[1], interest—computing P(D), for example—involve studying a [2], [15], but that approach quickly becomes infeasible as setofgenealogiesconsistentwiththeobserveddata,perhapsit data size grows, and one must then resort to Monte Carlo is not so surprising that including recombination in the model methods [3], [5], [6], [7], [11], [12], [16]. Given a set D of evolution poses many challenges. One of our goals here is of input sequences, a genealogy for D can be constructed to illustrate more precisely why recombination is difficult to backwardsintime,usingsuchevolutionaryeventsasmutation, study, by comparing the total number of ACs when there is coalescentandrecombination.Anancestralconfiguration(AC) recombinationwiththatwhenitisabsent.Ithasbeenknownto can be regarded as the multiset of all sequences present at a many, if not most people in population genetics that there can particular point in time in a possible genealogy for D. For be many more ACs when there is recombination than when bothdeterministicandstochasticapproaches,theefficiencyof it is absent, but we are not aware of any work that tried to a method of computing P(D) depends heavily on the total examine exactly by how much. number of distinct ACs of D. Asmentionedabove,whentherearemanyACs,solvingre- In this paper, we address the problem of enumerating all cursion relations to compute P(D) exactly is often infeasible. distinct ACs for a given data set D, consisting of binary In such a case, we are interested in asking whether it would sequencesoffinitelength.Thatis,weareinterestedincomput- bepossibletoapproximatethetruerecursionbya“collapsed” ingthetotalnumberofdistinctACsinallpossiblegenealogies recursion, in which ACs are lumped together, such that we thatcouldhavegeneratedD.IfanACappearsintwoormore just need to consider transitions, with appropriately defined different genealogies, it is counted only once. We consider probability,betweenlumpedACstoestimateP(D).Weshow both the classic coalescent [9], [10], in which recombination in this paper that this idea of lumping ACs works at least is absent, and its extension where recombination is allowed. in the context of our enumeration problem. What we achieve When a mutation event occurs in a genealogy, then it always herecanbeviewedasasmallsteptowardsmeetingourdesired goal. Y.S.iswiththeDepartmentofComputerScience,UniversityofCalifornia atDavis,Davis,CA95616,USA.E-mail:[email protected] Inourwork,weassumetheinfinitesitesmodelofmutation, R.L. and J.H. are with the Department of Statistics, University of which means that there can be at most one mutation per Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK. E-mail: lyngsoe, [email protected] site. Hence, the input data can be regarded as defining a 2 set of binary sequences. For ease of discussion, we assume trees consistent with τ. (Recall that, unlike coalescent trees, that the root sequence is known, and use 0 to denote the a perfect phylogeny may be non-binary and that relative time ancestral type at the root and 1 to denote a mutant type. ordering of two of its interior vertices is not defined if one We remark that the techniques described in this paper can vertex is not a descendant of the other.) be generalized to the root-unknown case. When there is no recombination, we construct a general method of obtaining closed-form formulas for the total number of ACs; these for- 2.1 Definition of an ancestral configuration mulasarepolynomialsinthemultiplicityofdistinctsequences In our algorithm for counting ACs, we partition the in- in D. We have implemented this method in Mathematica formation contained in a configuration into two parts, one and the program is called aceTrees (short for “ancestral encoding sequence types and the other their multiplicity. We configuration enumeration for trees”). As expected, the enu- use x ,...,x to denote d distinct finite binary sequences 1 d meration problem becomes much more complicated when (allele types) x of some fixed length—for example, x = i 1 recombination is involved. In that case, we discuss a method 0100,x =0010, and x =1100, with d being 3. We define 2 3 based on counting non-negative integer valued tables with T = (x ,...,x ) and n = (n ,...n ), where n > 0, 1 d 1 d i fixedrowandcolumnsums,commonlyknownascontingency for all i ∈ {1,...,d}, denote their multiplicity. We say tables.Fortwosites,weshowhowaclosed-formformulacan that (T,n) and (T0,n0) are equivalent, denoted (T,n) ∼ be obtained. For an arbitrary number of sites, we construct a (T0,n0), if there exists a permutation σ ∈ S such that d dynamic programming algorithm for counting ACs. We have (T ,n ) = (T0,n0), where T = (x ,...,x ) and σ σ σ σ(1) σ(d) implemented this algorithm in C++; the software is called n = (n ,...,n ). By a configuration, we mean an σ σ(1) σ(d) aceARGs (short for “ancestral configuration enumeration for equivalence class [T,n]∈{(T,n)}/∼. ARGs”). Lastly, we discuss a method of counting the number As we describe presently, every coalescent tree defines a of ACs that can appear in ARGs with less than or equal to sequence of configurations. Consider a data set D = [T,n], a given number R of recombinations. Of particular interest where T = (x ,...,x ) and n = (n ,...,n ). A con- 1 d 1 d is the case in which R is (or is close to) the minimum figuration is said to be ancestral to D if it “appears” in number of recombinations for D. We have implemented this at least one possible coalescent tree that derives D. (Note algorithm in C; the software is called greven. This method that D is ancestral to itself according to this definition.) interpolatesbetweenthecasewherethereisnorecombination Let us illustrate this definition through a specific example. andthecasewhereanarbitrarynumberofrecombinationsare Consider the simple example D = [(1,0),(2,2)], consisting allowed. As it should, greven agrees with aceTrees and of four binary sequences of length one. There are several aceARGs in those two respective limits. As a further check, coalescent trees that can give rise to D under the infinite-sites we have made alternative, independent implementations of all model of mutation. Three of them are shown in Figure 1. A threeprogramsinPython.Allourprogramsimplementingthe cross-section, corresponding to a particular time slice, of a methods discussed in this paper are available upon request. coalescent tree defines an AC. Going backwards in time, con- Theorganizationofthispaperisasfollows.InSection2,we figuration changes whenever either a mutation or a coalescent considertheproblemofenumeratingACswhenthereisnore- event occurs. combination,inwhichcasegenealogiescanberepresentedby For D = [(1,0),(2,2)], the seven configurations shown in coalescent trees. Mitochondria data from [17] are considered Figure 1 form a complete set of configurations ancestral to there as an example. By counting the total number of ACs, D; i.e. any configuration appearing in a coalescent tree for D we illustrate why an exact computation of the likelihood is must be one of the seven configurations. The corresponding difficultforthatdata.Theaforementionedmethodofcounting Markov chain transition graph C , where vertices correspond D ACs in unconstrained recombination graphs is discussed in totheACsforD andedgescorrespondtoallowedtransitions, Section 3. We compare the number of ACs in coalescent trees is as shown in Figure 2. This graph shows that the sequence withthatinARGsandhighlighttheirdifferences.InSection4, of ACs arising in any coalescent tree for D must be one of weconsiderenumeratingACsthatcanappearinARGswithat the following: most R recombinations. We conclude in Section 5 with some general remarks on our work. ψ →ψ →ψ →ψ →ψ , 1 2 4 6 7 ψ →ψ →ψ →ψ →ψ , 1 2 5 6 7 2 ANCESTRAL CONFIGURATIONS IN COALES- ψ1 →ψ3 →ψ5 →ψ6 →ψ7. CENT TREES In general, every coalescent tree compatible with any given In this section, we focus on the case in which the evolution data set D starts in the configuration corresponding to D under the infinite-sites model can be represented by a tree. and ends in the configuration consisting of a single all-zero A given data set D is compatible with a tree if it passes the sequence(correspondingtothemostrecentcommonancestor). three gamete test: for every pair of sites, not all of the allele Let P denote some appropriately-defined transition prob- i,j types 01, 10 and 11 appear in the data. If the three gamete ability corresponding to the transition ψ → ψ . Then the i j test is passed, then there exists a unique perfect phylogeny τ probability of the sequence ψ → ψ → ... → ψ can be i1 i2 ik for D [4], [8], with at most one mutation per site, but it is defined as P P ···P . In [1], Ethier and Griffiths i1,i2 i2,i3 ik−1,ik important to note that in general there are many coalescent showed that the probability of observing D can be obtained 3 ψ =D=[(1,0),(2,2)] 1 ψ2 =[(1,0),(1,2)] ψ7 ψ7 ψ7 ψ3 =[(1,0),(2,1)] ψ6 ψ6 ψ6 ψ4 =[(0),(3)] ψ4 ψ5 ψ5 Time ψ5 =[(1,0),(1,1)] ψ2 ψ2 ψ3 ψ6 =[(0),(2)] ψ1 ψ1 ψ1 ψ7 =[(0),(1)] 1 1 0 0 1 1 0 0 1 1 0 0 Fig.1. Coalescenttreesandancestralconfigurationsψ1,...,ψ7.Across-section,correspondingtoaparticulartimeslice,ofatreedefinesanAC. ψ2 ψ4 type obtained from y after the 1 at site a mutates to a 0, i define M (T,µ) as a ψ ψ ψ 1 6 7 ψ3 ψ5 (cid:26) ((DRi(;yTai)(,TD),(Rµi+;1(eµ)))),, iiff yyaia ∈=/ {yy1∈,.{.y.,y,.k.}.,,y }. i i j i j 1 k Fig.2. MarkovchaintransitiongraphCD forD=[(1,0),(2,2)].Dashed lines denote mutation events, while solid lines denote coalescent events. This directed graph is acyclic and contains 3 distinct paths from the initial Since a singlet site a uniquely determines i, we omit i in configurationψ1=D totheabsorbingconfigurationψ7. writing Ma(T,µ). In the example T = (100,101,010) con- sidered above, the unique allele type associated with a=2 is y ,andy2 =(000)isnotin{y ,y ,y }.Fora=3,theasso- 3 3 1 2 3 by summing such probabilities over all possible sequences of ciateduniquealleletypeisy ,andy3 =(100),whichisequal 2 2 ACs. In the example D =[(1,0),(2,2)] considered above, to y . Hence, M (T,µ) = ((100,101,000),(µ ,µ ,1)) and 1 2 1 2 M (T,µ)=((100,010),(µ +1,µ )). P(D) = P P P P +P P P P 3 1 3 1,2 2,4 4,6 6,7 1,2 2,5 5,6 6,7 +P P P P . 1,3 3,5 5,6 6,7 2.2.2. Coalescent and mutation events Ethier and Griffiths[1] formulated a recursion for evaluating P(D).Ingeneral,computingP(D)exactlyusingtherecursion Consider a data set D = [T,n]. The multiplicity n of an i becomes infeasible when sample size is large, and one must alleletypex decreasesbyexactlyonewhentwosequencesof i then resort to Monte Carlo methods (see [6], [7]). In what thatalleletypecoalesce.ThenumberofACswithalleletypes follows, we develop a method of counting exactly the total T is therefore equal to π(n), which is equal to 1 (for [T,n] number of inequivalent ACs of an arbitrary data set D. This itself) plus the number of configurations that can be reached kind of enumeration should prove useful for studying when from[T,n]viacoalescenteventsonly.Wheneveryalleletype exact computation of P(D) becomes infeasible. has coalesced completely, we end up with the configuration [T,1 ]. d In T = (x ,...,x ), suppose that x is the only allele 2.2 Enumeration of ACs: A general solution 1 d i type with a 1 at site a. Consider a configuration [T,m] We now describe a general, efficient method of obtaining with m = 1, reached by some coalescent events starting i closed-formformulasforthetotalnumberofinequivalentACs. from [T,n]. Now, a mutation event can occur to change the The actual ACs themselves can easily be extracted from our characterofx atsiteafrom1to0,making[T,m]jumptoa i method. differentconfiguration[T0,m0],whereT0 6=T.Letxa denote i the allele type obtained after such a mutation event. There are 2.2.1. Notation two possible cases. First, if xa is equal to one of the alleles i For ease of discussion, we first introduce some useful in T, say x , then the configuration after the mutation event j notation. We use 1 to denote a k-tuple of 1s, and e to is [D (T),D (m+e )]. Note that the multiplicity of x has k i i i j j denoteavectorwitha1attheithentryand0selsewhere;the increased by one. How many configurations are there with length of e will be clear from the context of its usage. For allele types D (T)? For k 6= i and k 6= j, the multiplicity i i n = (n ,...,n ), define π(n) := Qk n . Let D denote of x can be any integer between n and 1; independently 1 k i=1 i i k k a deletion operator which, when acting on a vector of length of the mutation event, coalescent events within allele type x k k ≥ i, deletes the ith entry of the vector, thus changing its can reduce the multiplicity from n to 1. Further, since the k lengthtok−1.LetR denoteareplacementoperatorwhich multiplicity of x increases by 1 after the mutation event, it i;z j replaces the ith entry of a vector with z. canrangefromn +1to1.Hence,thetotalnumberofpossible j Let T = (y ,...,y ) and µ = (µ ,...,µ ). As usual, ACs with allele types D (T) is π(D (n+e )). Second, if xa 1 k 1 k i i j i y ,...,y are k distinct binary sequences. In T, a site is is not equal to any allele in T, then the configuration after 1 k called a singlet if there is exactly one allele type yi ∈ the mutation event is [Ri;xai(T),m]. Note that we still have {y ,...,y }withvalue1atthatsite.Here,notethatµ need m =1. How many configurations are there with allele types 1 k i i not be 1. In T = (100,101,010), for example, sites two and Ri;xa(T)? Similar to the first case, since m is bounded by three are singlets. Let a be a singlet site and y the unique R (in)and1 ,thenumberofconfigurationswithalleletypes i i;1 d allele type with a 1 at that site. With yai denoting the allele Ri;xai(T) is π(Ri;1(n)). 4 2.2.3. A graph construction algorithm v = ((100,110,001),(n ,n ,n )) 1 1 2 3 Our method of counting ACs is to deal with coalescent v2 = ((100,001),(n1+1,n3)) eventsforeachalleletypeconfigurationT combinatoriallyand v3 = ((100,110,000),(n1,n2,1)) findallpossiblealleletypeconfigurationsusingasimplegraph v4 = ((100,000),(n1+1,1)) construction algorithm. More precisely, the general idea goes v5 = ((000,001),(1,n3)) as follows. For a given input data set D, suppose that there v6 = ((000),(2)) are ACs with T as allele type configuration. If the maximum v v 2 4 multiplicity of T over all such ACs is µ, then, as discussed above,asimplecoalescentargumentshowsthatthetotalnum- v v berofACswithalleletypeconfigurationT isπ(µ).Hence,if 1 6 we know all possible allele type configurations T ,T ,...,T 1 2 p and their maximum multiplicity µ ,µ ,...,µ , respectively, 1 2 p v v 3 5 thenwecaneasilydeterminethetotalnumberofACsofDby summingoverπ(µj).Ourgraphconstructionalgorithmbelow Fig. 3. Application of our enumeration algorithm to D = finds such T1,T2,...,Tp and µ1,µ2,...,µp. [(100,110,001),(n1,n2,n3)]. Shown on the right hand side is the final We wish to construct a directed graph G where a vertex graphGD constructedbyouralgorithm. D vj is labeled by (Tj,nj), to be determined by the following TABLE1 iterative procedure: Mitochondriadatafrom[17],consistingof63sequencesand18segregating 1. Given a data set D = [T,n], let G initially be a sites.Thereare413,243,616ACsforthisdataset. D graph with no edges and with v = (T,n) as its only 1 vertex. Here, (T,n) is an arbitrary representative of the Haplotype Multiplicity equivalence class [T,n]. 100101000000010000 2 100101000100010000 2 2. Let V denote the set of all vertices in G with out- 0 D 010000000000000001 1 degree zero. For all v = (T ,n ) ∈ V , determine the 001000001000000000 3 j j j 0 set S of singlet sites (defined above) in T . 000101000000010000 19 vj j 000111000000010000 1 3. If S = ∅ for all v = (T ,n ) ∈ V , terminate vj j j j 0 000000000011000001 1 the procedure. Otherwise, arbitrarily order V and se- 000000000011100001 1 0 quentiallycarryoutthefollowingstepsforv satisfying 000000000000000111 4 j 000000000000000101 8 S 6= ∅: Determine M (T ,n ) for all a ∈ S . If vj a j j vj 000000000000000000 5 M (T ,n )=v ∈G , draw a directed edge from v 000000000000000001 4 a j j k D j to v . If not, then add v = M (T ,n ) to G and 000000010000000000 3 k k a j j D 000000100000001000 1 draw a directed edge from v to v . j k 4. Go back to step 2. The graph G compactly encodes all possible ACs. In a D 2.3 A toy example vertex v = (T ,n ) ∈ G , T captures the set of distinct j j j D j binary strings, whereas n determines the range of possible Consider the data set D = [(100,110,001),(n1,n2,n3)], j multiplicity of the strings; i.e. if there are r distinct binary where n1,n2, and n3 are some arbitrary positive integers. sequencesinT ,thenthemultiplicitycanbeanythingbetween Applying the algorithm from Section 2.2 leads to the graph j n and1 .Further,itisstraightforwardtoshowthatifv and GD illustrated in Figure 3. From (2.1), the total number of j r j v are two distinct vertices in G , then T 6= T . The total configurations ancestral to D is thus given by k D j k number α(D) of configurations ancestral to D is therefore 6 X given by α(D) = π(n ) i X i=1 α(D)= π(ni), (2.1) = n1n2n3+n3(n1+1)+n1n2+(n1+1)+n3+2 i∈V(GD) = n1(n2+1)(n3+1)+2n3+3. whereV(G )istheindexsetoftheverticesinG .Wehave D D 2.4 Mitochondria DNA data implementedtheabovealgorithminMathematica.Foragiven data set D = [(x ,...,x ),(n ,...,n )], our program gen- We now consider the data set from Ward et al.[17], which 1 d 1 d erates closed-form formulas for α(D) in terms of n ,...,n . consists of sequences from the control region of mitochondria 1 d DNA (mtDNA), sampled from 63 individuals in a single (REMARK: We have an independent program, to be discussed Amerindian tribe. There are 18 segregating sites in the data, later, that can exhaustively search through evolutionary histo- shown in Table 1. riestocomputethenumberofACs.Thatprogramcanbeused Computing the likelihood of observing a given data set toanalyzesmallsamplesizes(usuallylessthan30sequences). is important—for example, for parameter estimation—but, as We have checked that it produces the same answers as does mentioned in Section 2.1, doing it exactly is infeasible when our combinatorial method described above.) samplesizeislarge.ThesamemtDNAdatasetwasconsidered 5 001∗∗∗∗ ∗∗∗1011 0∗10∗∗∗ ∗∗∗∗∗∗1 sequences which carry additional non-ancestral material (de- noted by “∗”s). This concept is illustrated in Figure 4, where possible coalescent events are also described. To simplify things, we assume that recombination breakpoints occur at 0011011 0∗10∗∗1 the midpoints of consecutive sites in D. Hence, if D consists (a) (b) of some segregating sites in a region X, the distribution of ancestral material between two consecutive segregating sites i 0111011 01100∗0 and i+1 in X is completely determined by the configuration at sites i and i+1. Of particular interest is a class of ARGs in which, for any recombinationevent,boththeprefixandthesuffixinvolvedin 0111011 ∗∗∗1011 ∗1100∗0 0∗10∗∗∗ a recombination event contain some ancestral material. Such (c) (d) a class of ARGs was the main focus of [5], and we also limit our attention to such ARGs in this paper. Hence, for Fig.4. Illustrationofpossibleevents.Timeflowsfromtoptobottom.(a)and (b)areexamplesofrecombinationevents.Lets[i:j]denotethesubstringofs our purpose, a sequence appearing in an ARG is generally a in-betweeniandj,inclusive.Goingbackwardsintime,whenarecombination string over {0,1,∗}, but the all-∗ string is not allowed. As event is encountered, the lineage of a sequence s[1:L] breaks up into two before, 0 denotes the ancestral type at the root and 1 denotes parts,onecorrespondingtothelineageofasequences1whoseprefixs1[1:k] isidenticaltos[1:k]andtheothercorrespondingtothelineageofasecond a mutant type. A configuration is defined as in Section 2.1, sequences2whosesuffixs2[k+1:L]isidenticaltos[k+1:L].Thesuffix except that now strings are defined over {0,1,∗}. To every s1[k+1:L]andtheprefixs2[1:k]carrynon-ancestralmaterial,denotedby ARG, there corresponds a sequence of configurations, which “∗”s.(c)and(d)areexamplesofcoalescentevents.Twosequencess1ands2 can coalesce if their characters are identical at common ancestral positions. can be viewed as generating the ARG backwards in time. Thefinalsequencecontainstheunionoftheancestralmaterialins1 ands2. GivenadatasetD,aconfigurationissaidtobeancestraltoD ifitappearsinatleastonepossiblesequenceofconfigurations that corresponds to some ARG consistent with D. Shown in by Griffiths and Tavare´ [6], who developed a Markov chain Figure 5 is a sequence of ACs and its corresponding ARG. Monte Carlo method to analyze the data. Our goal here is It is important to note that there are many—in fact, infinitely to compute the total number of inequivalent ACs of the data, many—distinct ARGs that can derive the same initial data, thus illustrating why computing the likelihood exactly using and that the total number of ACs can be immensely large. recursions is difficult. This is the main point that we wish to illustrate in this paper. Let T = (x ,...,x ) denote the 14 distinct haplotypes 1 14 ForthesimpleexampleD =[(10,11,01),(1,1,2)]considered shown in Table 1, with x being the ith row. Applying i in Figure 5, there are only 220 ACs in total. We shall soon our algorithm from Section 2.2 to D = [T,(n ,...,n )] 1 14 see that, as the number of sites and the number of sequences generates a directed graph G with 12,896 vertices, and D increase, the total number of ACs grows extremely fast. one can easily obtain a closed-form formula, which is too long to write down here, for the total number α(D) of ACs. For n = (1,1,...,1), α(D) = 128,640. For n = (2,2,1,3,19,1,1,1,4,8,5,4,3,1),whichbeingthemultiplic- 3.2 A warm-up example ity of the actual data, α(D)=413,243,616. Before we plunge into the core of our enumeration work, let us consider a simple example for which it 3 ANCESTRAL CONFIGURATIONS IN UNCON- is not too difficult to obtain a complete set of ACs STRAINED ARGS by hand. The reader not too familiar with the ARG is In this section, we turn to enumerating ancestral con- recommended to go through this example. Consider the figurations in unconstrained ancestral recombination graphs initial data D = [(00,11),(1,1)]. If recombinations are not (ARGs)[5]. This case is much more complicated than the allowed, then it is clear that there are only 5 ACs, namely classic coalescent case, and, in general, it is difficult to obtain [(00,11),(1,1)],[(00,01),(1,1)],[(00,10),(1,1)],[(00),(2)], closed-formformulasforthenumberofACs.Belowwetrans- and [(00),(1)]. In the presence of recombination, there are late the problem of counting ACs into counting contingency 30 ACs, shown in Figure 6. These ACs can be generated tables and provide a dynamic programming algorithm. iteratively by asking what happens to a given AC under a mutation, a coalescent or a recombination event; note that some of these events may not be possible, depending on 3.1 Definition of an AC in ARGs the AC. On the right hand side of Figure 6 is a Markov Intheabsenceofrecombination,iftheinputdataDcontains chain transition graph C , depicting the possible transitions D nsequences,thenanyconfigurationancestraltoD containsat between ACs. Note that the graph C contains directed D mostnsequences.Thisisnolongertruewhenrecombinations cycles. AC 1 is the initial configuration, and AC 21 is called are allowed. Going backwards in time, when a recombination the grand common ancestor. To any ARG with the grand event is encountered, the lineage of a sequence breaks up into common ancestor as the root, there corresponds a unique path two parts, distributing its ancestral material to two different from AC 1 to AC 21. 6 00 8 7 Ancestral Configurations m 6 2 1 : [(10,11,01),(1,1,2)] 4 : [(10,01),(1,2)] 7 : [(00),(2)] 5 m 2 : [(10,1∗,∗1,01),(1,1,1,2)] 5 : [(00,01),(1,2)] 8 : [(00),(1)] 4 1 Time 3 3 : [(10,1∗,01),(1,1,2)] 6 : [(00,01),(1,1)] 2 1 10 11 01 01 Fig. 5. A sequence of ACs for D = [(10,11,01),(1,1,2)] and its corresponding ARG. Filled circles denote mutation events. Open circle denotes a recombination event with breakpoint between the first and the second sites. Mutation events at the first and the second sites are denoted by m1 and m2, respectively.NotethatthereareotherARGsthatcouldhavegeneratedD.Intotal,thereare220ACsofD. Ancestral Configurations 12 1:[(00,11),(1,1)] 11:[(00),(2)] 21:[(00),(1)] 13 22 2:[(0∗,∗0,11),(1,1,1)] 12:[(0∗,10,∗1),(1,1,1)] 22:[(01,10),(1,1)] 6 14 3:[(00,1∗,∗1),(1,1,1)] 13:[(∗0,01,1∗),(1,1,1)] 23:[(0∗,∗0),(2,2)] 23 4:[(00,10),(1,1)] 14:[(0∗,∗0,∗1),(2,1,1)] 24:[(0∗,∗0,∗1),(1,1,1)] 2 7 15 24 29 5:[(00,01),(1,1)] 15:[(0∗,∗0,1∗),(1,2,1)] 25:[(0∗,∗0,1∗),(1,1,1)] 3 8 16 6:[(0∗,∗0,1∗,∗1),(1,1,1,1)] 16:[(01,∗0),(1,1)] 26:[(00,∗0),(1,1)] 1 25 30 4 9 17 7:[(0∗,∗0,01),(1,1,1)] 17:[(0∗,∗0,00),(1,1,1)] 27:[(00,0∗),(1,1)] 26 8:[(0∗,∗0,10),(1,1,1)] 18:[(0∗,10),(1,1)] 28:[(0∗,∗0),(1,1)] 5 10 18 27 9:[(00,0∗,∗1),(1,1,1)] 19:[(00,∗1),(1,1)] 29:[(0∗,∗0),(2,1)] 11 19 10:[(00,1∗,∗0),(1,1,1)] 20:[(00,1∗),(1,1)] 30:[(0∗,∗0),(1,2)] 28 20 21 Fig. 6. Configurations ancestral to D = [(00,11),(1,1)] and the Markov chain transition graph CD. Mutation events are denoted by dashed arrows. Coalescentandrecombinationeventsaredenotedbysolidarrows.Notethatsomesolidarrowsarebidirectional,onedirectionforrecombinationandtheother forcoalescent.Unidirectionalarrowsdenotecoalescentevents. 3.3 Possible number of 1s and 0s in an AC There are n (n +1)+1 possible values of (p,q) in the first 0 1 case, n+1 in the second, and n in the last. There exists no Let n denote the total number of binary sequences in D, witheachsequencebeingoflengthL.Letni (resp.ni)denote AC with r01 0s and r11 1s if (r01,r11)∈/ I(n10,n11). Now comes 0 1 thetotalnumberof0s(resp.1s)atsitei.Notethatni+ni =n a crucial point. Even for L > 1, if an arbitrary number of 0 1 for all i∈{1,...,L}. (For D =[(10,01,11),(2,3,1)], n1 = recombination events are allowed, there exists at least one 3,n1 =3 for the first site and n2 =2,n2 =4 for the sec0ond AC for any combination of (r01,r11),...,(r0L,r1L) satisfying site.1) Let ri and ri denote the to0tal num1ber of 0s and of 1s, (r0i,r1i) ∈ I(ni0,ni1), for all i ∈ {1,...,L}. There exists no 0 1 AC with ri 0s and ri 1s at site i if (ri,ri) ∈/ I(ni,ni). respectively, at site i in an arbitrary AC ψ. The infinite-sites 0 1 0 1 0 1 One way to see these points is as follows. Starting from model of mutation imposes constraints on possible values of (ri,ri). For ni 6= 0, a mutation event at site i can occur in the input data D, use recombination events to distribute the 0 1 1 ACψ onlyifri =1.Ifri =1andthemutationeventoccurs, ancestral material of the first sequence to L new sequences, 1 1 then AC ψ jumps to a new AC ψ0 with (ri+1) 0s and no 1s. each carrying exactly one ancestral site. Do the same for all 0 To determine the possible range of ri and ri, let us first othersequencesinD,sothatweendupwithn·Lsequences, 0 1 eachcarryingexactlyoneancestralsite.Then,determiningthe considertheL=1case.Sinceeverysequenceisoflengthone, range of ri and ri for each site i reduces to the L=1 case. it follows from the above discussion that there exists exactly 0 1 one AC for any pair (r1,r1)∈I(n1,n1), where I(n ,n ) is We stress that the above statement would not hold for L>1 0 1 0 1 0 1 in the absence of recombination (or if too few recombination defined as the set of all non-negative integer pairs (p,q) such events are allowed). For example, consider D = [(01),(2)]. that if n 6=0 and n 6=0, then 0 1 In the absence of recombination, there is no AC with r1 =1 0 either (i) 1≤p≤n0 and 1≤q ≤n1, (3.2) and r12 =2. or (ii) 1≤p≤n +1 and q =0; 0 if n0 =0 and n1 =n, then 3.4 Important observations either (i) p=0 and 1≤q ≤n, We now highlight a couple of important facts. First, the set (3.3) or (ii) p=1 and q =0; of ACs does not depend on the order of 0s and 1s within a site when arbitrary number of recombinations are allowed. and, if n =n and n =0, then 0 1 For instance, [(01,10),(1,1)] has the same set of ACs as 1≤p≤n and q =0. (3.4) the example [(00,11),(1,1)] considered in Section 3.2 (see 7 Figure 6). This fact can easily be explained as follows. Let D number of strings in ψ and the number of non-∗ characters andD0 betwodatasetsofthesamesize,suchthattheithsite at site k. A moment’s thought leads to the conclusion that of D is equal to the ith site of D0 up to some rearrangement max(c−r −r ,0) ≤ r ≤ c. Similarly, the number c of 0 1 ∗ ∗ of elements within the site. Then, D can be transformed into all-∗ prefixes of length k−1 satisfies max(r +r −c,0)≤ 0 1 D0, or vice versa, using a series of appropriate recombination c ≤ r + r . Note that c and r must satisfy c + c = ∗ 0 1 ∗ ∗ ∗ andcoalescentevents.ItthereforefollowsthatD andD0 have r +r +r for consistency. In Figures 7b-d, the ACs shown 0 1 ∗ the same set of ACs. have (r ,r ) = (4,0) ∈ I(3,1) and r = 1 at site three, and 0 1 ∗ Second, if two input data sets D and D0 differ by some eachAChasexactlyonesequencecontaining“∗∗”asaprefix permutation σ of their sites, then they have the same number of length 2 (i.e., c =1). As required, c+c =r +r +r . ∗ ∗ 0 1 ∗ of ACs. Any AC of D can be transformed into an AC of D0 3.5.2. Restricted contingency tables via σ, and vice versa. For D a data set containing n sequences of length L, Asabove,letψ beanACforthefirstk−1sitescontaining let X(D) = {(n1,n1),...,(nL,nL)} denote the multiset of d distinct strings. The pairing of a length-(k−1) string in ψ 0 1 0 1 ordered pairs (ni,ni), where ni and ni are the total number or a length-(k−1) all-∗ string with a character at site k can 0 1 0 1 of 0s and 1s, respectively, at site i of D. Then, the above two be concisely summarized by a table of the following form: facts together imply that two data sets D and D0 have the row sums same number of ACs if X(D)=X(D0) as multisets. A ··· A A r 0,1 0,d 0,∗ 0 A ··· A A r 1,1 1,d 1,∗ 1 3.5 Main idea: Counting restricted contingency tables A ··· A 0 r ∗,1 ∗,d ∗ We here describe our method of counting ACs. Intuitively, column sums: c ··· c c 1 d ∗ our approach is to construct ACs moving along the sites, say For j = 1,...,d, a particular entry A corresponds to the from left to right. Given that we have constructed ACs for i,j multiplicity of a length-k string obtained from appending the first k − 1 sites, we construct ACs for the first k sites character i to the right of a length-(k−1) string of type j in by appending additional material (i.e., 0,1 or ∗) to the right. ψ, whereas A corresponds to the multiplicity of a length-k We iterate this procedure until the last site is processed. As i,∗ string obtained from appending character i to the right of an we describe below, this procedure of extending ACs can be all-∗stringoflengthk−1.Toavoidgeneratinganall-∗string translated into constructing contingency tables, where column of length-k, we impose the condition A = 0. A column sumsarepartiallydeterminedbythemultiplicityofthestrings ∗,∗ sum c , for j =1,...,d, is given by the multiplicity of string intheACthatisbeingextended(i.e.,anACforthefirstk−1 j type j in ψ, and row sums r and r , respectively denoting sites) and row sums are partially determined by the allowed 0 1 the number of 0s and 1s at site k in the corresponding new numbers of 0s and 1s at site k (c.f., Section 3.3). AC, satisfy (r ,r ) ∈ I(nk,nk) (see Section 3.3). Further, 0 1 0 1 3.5.1. Extending an AC as described above, the number c of all-∗ prefixes of length ∗ Suppose that ψ is an AC for the first k−1 sites. Further k−1 and the number r of ∗s at site k satisfy max(c−r − ∗ 0 suppose that ψ contains d distinct strings, with multiplicity r ,0) ≤ r ≤ c, max(r +r −c,0) ≤ c ≤ r +r , and 1 ∗ 0 1 ∗ 0 1 c ,c ,...,c satisfying c + c + ··· + c = c. Since ψ c +···+c +c =r +r +r .Thenumberofnon-zeroentries 1 2 d 1 2 d 1 d ∗ 0 1 ∗ is an AC for the first k − 1 sites, it does not contain any in the above contingency table gives the number of distinct all-∗ string of length k −1. However, a string s in an AC length-k strings in the corresponding new AC. In the next for the first k sites may contain the all-∗ string of length iteration, non-zero A will appear as possible column sums i,j k − 1 as its prefix, if the character at the kth site of s whenconstructingcontingencytablesforsitek+1.Returning is either 0 or 1, but not ∗. For illustration, consider D = to our example, contingency tables corresponding to the ACs [(0001,0000,1101,1110),(1,1,1,1)].AnACforthefirsttwo discussed before are shown in Figures 7b-d. In Figure 7b, sites is shown in Figure 7a, where we simply list strings to all strings in the AC are distinct, and the corresponding represent the AC. It has c=4 and the multiplicity of “00” is contingency table contains only 1s as non-zero entries. In 2, while every other sequence type has multiplicity 1. In any Figures 7c and 7d, the string 000 has multiplicity 2, and the AC for the first two sites, the string “∗∗” is not allowed, but entryA ,whichcorrespondstothemultiplicityofthelength- 0,1 “∗∗0” or “∗∗1” may appear in an AC for the first three sites. 3 string obtained from appending 0 to the right of 00, is 2. Shown in Figures 7b-d are three examples of ACs for the first In summary, there is a one-to-one correspondence be- three sites; they each contain “∗∗0”. The all-∗ string “∗∗∗” tween non-negative integer valued contingency tables de- is not allowed in any AC for the first three sites. scribed above and ACs for the first k sites that can be created In an AC for the first k sites, how many 0s and 1s can be by appending 0s, 1s and ∗s to the right of the strings in an present at site k? We have already answered that question in AC for the first k−1 sites. Hence, we can enumerate ACs by Section 3.3; i.e. the number r of 0s and the number r of counting contingency tables. 0 1 1s satisfy (r ,r )∈I(nk,nk), where nk and nk respectively Since the table described above has A = 0, we call it a 0 1 0 1 0 1 ∗,∗ denote the number of 0s and 1s at site k in the input data restricted contingency table with row sums r = (r ,r ,r ) 0 1 ∗ D. How about the number r of ∗s at site k? Since an all-∗ and column sums c = (c ,...,c ,c ). If the entry A ∗ 1 d ∗ ∗,∗ string of length-k is not allowed, r is bounded from above, were not restricted to be zero, we would have a standard ∗ while the minimum value of r is determined by the total (or an unrestricted) contingency table with row sums r and ∗ 8 (a) 00 (b) 00∗ 1 1 1 1 4 (c) 000 2 0 1 1 4 (d) 000 2 1 0 1 4 000 000 000 00 c1=2 1∗0 0 0 0 0 0 1∗∗ 0 0 0 0 0 1∗0 0 0 0 0 0 1∗ c2=1 ∗10 1 0 0 0 1 ∗10 0 1 0 0 1 ∗1∗ 0 0 1 0 1 ∗1 c3=1 ∗∗0 2 1 1 1 ∗∗0 2 1 1 1 ∗∗0 2 1 1 1 Fig.7. ConstructingACsforD=[(0001,0000,1101,1110),(1,1,1,1)]byextendingstrings.Forclarityofillustration,ACsarerepresentedaslistsof strings inside [ ]; a string is read horizontally. (a) An AC for the first two sites. (b)-(d) ACs and their corresponding restricted contingency tables obtained fromextendingtheACin(a)byanadditionalcolumnwithr0=4,r1=0,r∗=1.Notethatc∗=1in(b)-(d),thussatisfyingc+c∗=r0+r1+r∗. columnsumsc.Restrictedandunrestrictedcontingencytables has only 4 degrees of freedom; once the entries a,b,c,d are related as follows. Let N (r,c) denote the number of are chosen, the other entries get fixed by given row and u unrestricted contingency tables with column sums c and row column sums. Moreover, a,b,c,d satisfy the following set of sumsr.Similarly,letN (r,c)denotethenumberofrestricted constraints: r contingency tables with row sums r and column sums c. 0≤a+b≤r , For r = (r ,r ,r ) and c = (c ,...,c ,c ), define r0 = 0 0 1 ∗ 1 d ∗ 0≤c+d≤r , (r ,r ,r − 1) and c0 = (c ,...,c ,c − 1). Then, it is 1 0 1 ∗ 1 d ∗ 0≤a+c≤c , straightforward to show that 0 0≤b+d≤c . (cid:26) N (r,c)−N (r0,c0) if r ≥1 and c ≥1, 1 N (r,c)= u u ∗ ∗ r N (r,c) otherwise. These constraints imply u (3.5) a∈ A ={0,...,max(c ,r )}, 0 0 We have not yet discussed how to obtain the total number b∈ B ={0,...,min(c ,r −a)}, 1 0 of ACs. The case with two sites is amenable to analytic c∈ C ={0,...,min(r ,c −a)}, 1 0 techniques. For more than two sites, we propose a solution d∈ D ={0,...,min(r −c,c −b)}, 1 1 via dynamic programming. These topics are subsequently discussed in the next two subsections. and therefore (3.6) can be written as X X X X X β(D)= 1. 3.6 Counting ACs for two sites If there are only two sites in D, we only need to consider (c0,c1)∈I(n10,n11), a∈A b∈B c∈C d∈D contingency tables of size 3×3. This simplifies the problem (r0,r1)∈I(n20,n21) considerably,anditispossibletosumoverallnecessary3×3 This sum can be performed explicitly. For example, for n1 = 0 contingency tables explicitly to obtain closed-form formulas. n1 = n2 = n2 = n/2, for n even, the total number β(D) of 1 0 1 We describe this analytical result below. ACs is given by the following degree-8 polynomial: It follows from the discussion in the previous section that 179 2339 151 3851 the number of ACs with c 0s and c 1s at the first site and β(D) = 2+ n+ n2+ n3+ n4 0 1 84 1260 160 11520 r 0s and r 1s at the second site is given by the following 0 1 31 259 11 11 sum of restricted contingency table numbers: + n5+ n6+ n7+ n8. 384 23040 13440 430080 r0+rX1+c0+c1 (cid:0) (cid:1) (3.7) N (r ,r ,j−r −r ),(c ,c ,j−c −c ) , r 0 1 0 1 0 1 0 1 j=max(r0+r1,c0+c1) Notethatthedegree8isequaltothenumber3L−1ofdistinct where the sum over j accounts for the allowed values of c strings over {0,1,∗}, except for the all-∗ string, of length ∗ and r . By using (3.5), this sum can be shown to be equal to L=2 ∗ N ((r ,r ,c +c ),(c ,c ,r +r )).Hence,thetotalnumber u 0 1 0 1 0 1 0 1 β(D) of ACs for the case with two sites is 3.7 Counting ACs for arbitrary number of sites X N ((r ,r ,c +c ),(c ,c ,r +r )), For more than two sites, we adopt a dynamic programming u 0 1 0 1 0 1 0 1 (c0,c1)∈I(n10,n11), approach for counting restricted contingency tables. As men- (r0,r1)∈I(n20,n21) tioned in Section 3.5, our algorithm progresses sequentially (3.6) along the sites, from left to right. There also exists a more where,asbefore,I(ni,ni)determinestheallowednumberof efficient, albeit somewhat more complicated, dynamic pro- 0 1 0s and 1s at site i. gramming formulation in terms of counting unrestricted con- The key observation is that the expression shown in (3.6) tingency tables, but we will not discuss that here. Further, we can be summed explicitly. A 3×3 unrestricted contingency remarkthatweconsideredanalternativeapproachtocounting table of the form ACs that is based on counting lattice points bounded by poly- row sums topes.Computingthenumberofsuchlatticepointsisawidely- a b v r studied problem in mathematics (for example, see [13] and 0 c d w r referencestherein).Moreover,thereisapublicsoftwarecalled 1 x y z c +c LattE (available at http://www.math.ucdavis.edu/∼latte/) that 0 1 column sums: c c r +r enumerates lattice points, but we found that it is significantly 0 1 0 1 9 slower than our own program for enumerating ACs via count- For each c1 ∈C , the multiplicity of c1 is defined as 1 ing restricted contingency tables; it seems that L = 2 is the (cid:12) (cid:12) µ(c1):=(cid:12)(cid:8)(r ,r )∈I(n1,n1) Λ(r ,r )=c1(cid:9)(cid:12). only case that LattE can handle. (cid:12) 0 1 0 1 0 1 (cid:12) Recall that both contingency tables shown in Figures 7c For the kth site, where k >1, C is recursively defined as k and 7d have 2,1,1,1 as non-zero entries, although the two tables have different corresponding ACs. In each AC, as the (cid:26) M ∈R(ck−1,rk), where (cid:27) C := Λ(M) . corresponding contingency table captures, there are 4 distinct k ck−1 ∈Ck−1 and rk ∈I(nk0,nk1) types of length-3 strings, with exactly one type being of Note that only distinct Λ(rk) need to be considered when multiplicity2,whileeveryothertypeisofmultiplicity1.Now, constructing C . For every ck−1 ∈ C and every ck ∈ C , k k−1 k consider extending the ACs in Figures 7c and 7d to create wecountthefollowingnumberofrestrictedcontingencytables ACs for the first four sites. For both cases, we need to find M with Λ(M)=ck: contingency tables of the following form: w(ck−1,ck):= X (cid:12)(cid:12)(cid:8)M ∈R(ck−1,rk) Λ(M)=ck(cid:9)(cid:12)(cid:12). row sums A0,1 A0,2 A0,3 A0,4 A0,∗ r0 rk∈I(nk0,nk1) A1,1 A1,2 A1,3 A1,4 A1,∗ r1 This can then be used to define the multiplicity of ck ∈Ck as A A A A 0 r ∗,1 ∗,2 ∗,3 ∗,4 ∗ X column sums: 2 1 1 1 c∗ µk(ck)= µk−1(ck−1)w(ck−1,ck). where(r ,r )∈I(n4,n4).Asaconsequence,thesamesetof ck−1∈Ck−1 0 1 0 1 contingencytableswillbegenerated,althoughthecorrespond- In practice, Ck and µk(·) can be determined concurrently. ing set of ACs will be different for the two cases. (In the ACs The total number β(D) of ACs for D is obtained by obtainedfromextendingtheACinFigure7c,possibleprefixes summing the multiplicity of column sums for the last site: oflength-3are“000”,“1∗∗”,“∗10”,“∗∗0”and“∗∗∗”,whereas X β(D)= µ (cL). (3.8) L in the ACs obtained from extending the AC in Figure 7d, possible prefixes of length-3 are “000”, “1∗0”, “∗1∗”, “∗∗0” cL∈CL and “∗∗∗”.) Hence, since we are only interested in counting More generally, the total number of ACs for D[1:k], restric- ACs, and not in constructing the actual ACs themselves, we tion of D to the first k sites, is do not need to consider extensions of the ACs in Figures 7c β(D[1:k])= X µ (ck). k and 7d separately. At each iteration of our algorithm, we only keeptrackofdistinctmultiplicityconfigurationsandhowoften ck∈Ck We have written a C++ program that implements the dynamic each configuration appears. Then, we construct contingency programming algorithm described above. For two sites, we tablesforeachdistinctmultiplicityconfiguration.Thismethod have checked numerically that (3.8) gives the same answers considerably reduces the number of contingency tables to as does (3.6). construct. Wenowdescribeourenumerationalgorithm.Givenamatrix M, let Λ(M) denote the descending array of positive integers 3.8 Some explicit enumeration of ACs in M. For example, if For n even, ni = ni = n/2 for every site i leads to the 0 1  3 2 0 4  largest number of ACs. For n odd, the largest number of ACs M = 1 0 2 0 , is achieved if ni0 = bn/2c +1 and ni1 = bn/2c for every 0 0 1 0 site i, where b·c denotes the floor function. The left hand side of Table 2 shows such largest possible number of ACs for then Λ(M) = (4,3,2,2,1,1). Suppose that M and M0 are small number n of sequences and small number L of sites. It contingency tables that arise while considering site k. When is quite striking that the number of ACs grows so fast as the constructing contingency tables for site k+1, we do not need data size increases. These numbers were obtained using our to distinguish M from M0 if Λ(M) = Λ(M0); we just need softwarethatimplementsthedynamicprogrammingalgorithm to keep track of how many tables for site k have Λ(M) as described in Section 3.7. For L = 2 and n even, our closed- descending array of positive integers. (Note that Λ(M) are form formula (3.7) agrees with the numerical values shown in used as column sums when constructing contingency tables Table 2. for site k+1.) ForfixedL,thetotalnumberofACsisboundedfromabove Given a k-tuple v = (v ,v ,...,v ) of non-negative 1 2 k by a polynomial in n of degree 3L−1. There are 3L − 1 integers, we define |v| = Pk v . Let R(c,r) be the set i=1 i non-negativeintegervaluedvariablesm ,...,m ,andthe of all restricted contingency tables with column sums (c,c ) 1 3L−1 ∗ maximum value that any variable can take is n. Hence, the and row sums (r,r∗), where max(|r|−|c|,0) ≤ c∗ ≤ |r| total number of ACs is O(n3L−1). As this simple analysis and max(|c|−|r|,0)≤r ≤|c|. As usual, (ni,ni) denote ∗ 0 1 shows, the total number of ACs depends more crucially on the number of 0s and 1s at site i in the input data D. the number of sites than on the number of sequences. This For each site i, we associate a set C of column sums. To i qualitative behavior of growth is also apparent in Table 2. initialize the algorithm, we define Intheabsenceofrecombination,thereare2L distinctbinary C :=(cid:8)Λ(r ,r ) (r ,r )∈I(n1,n1)(cid:9). sequences of length L, but that the genealogy is given by 1 0 1 0 1 0 1 10 TABLE2 TotalnumberofACsfordataDwith,foreverysitei,ni0=a0 andni1=a1,wherea0=a1=bn/2cforneven,anda0=bn/2c+1,a1=bn/2cfor nodd.Undertheseconditions,themaximumvalueofα(D)is1+a0(a1+2L−1).Thesevaluesareshownontherighthandside.Shownontheleft handsideisthetotalnumberofACswhenanarbitrarynumberofrecombinationsareallowed. β(D)(WithRecombination) max(α(D))(WithoutRecombination) L L n 2 3 4 5 n 2 3 4 5 2 30 573 16875 689175 2 5 9 17 33 3 108 6286 743387 149861079 3 9 17 33 65 4 330 62589 32482009 35523729489 4 11 19 35 67 5 866 445137 893479326 4938627635669 5 16 28 52 100 6 2143 3302506 29521615942 962962451049968 6 19 31 55 103 7 4611 17409443 568860072916 91812561254804105 7 25 41 73 137 a tree puts tight restriction on which sequences can appear 4.1 Examples revisited together in an AC. As in the above case, for every site i, As mentioned in Section 3.4, when arbitrary number of let ni = a and ni = a , where a = a = n/2 if n is 0 0 1 1 0 1 recombinations are allowed, the set of ACs does not depend even, and a0 = bn/2c+1,a1 = bn/2c if n is odd. Under on the order of 0s and 1s within a column. For example, D = these conditions, the number α(D) of ACs is maximum if the [(00,11),(1,1)] and D0 =[(01,10),(1,1)] have the same set input data contains a all-0 sequences and a all-1 sequences. 0 1 of ACs, shown in Figure 6, when arbitrary number of recom- One can use the method described in Section 2.2 to show that binations are allowed. That no longer holds true if there is a max(α(D)) = 1+a (a +2L−1), where L is the number 0 1 restrictiononthenumberofallowedrecombinations.Thisfact of sites. Some numerical values of this function are shown on iseasiesttoseeintheabsenceofrecombination,inwhichcase the right hand side of Table 2. These numbers are negligibly the set of ACs for D is {[(00,11),(1,1)], [(00,01),(1,1)], small, compared to the value of β(D) shown on the left hand [(00,10),(1,1)], [(00),(2)], [(00),(1)]}, whereas that for side of Table 2. D0 is {[(01,10),(1,1)], [(01,00),(1,1)], [(00,10),(1,1)], For L sites, let BL(k) denote the number of ACs with [(00),(2)],[(00),(1)]}. Similarly, when only a single recom- exactly k strings, and let SL(k) denote the cumulative count bination is allowed, it is straightforward to show that D and Pki=1BL(i), which gives the number of ACs with at most k D0 have different sets of ACs, although the two sets are both strings. Table 3 shows BL(k) and SL(k) for data sets with of size 19. n=5 and (ni0,ni1)=(3,2) for every site i. The largest value When there is a restriction on the number of recombina- of SL(k) corresponds to β(D) for n=5 in Table 2. tions, even the number of ACs may depend on the order of 0s and 1s within a column. Consider the two data sets 4 ANCESTRAL CONFIGURATIONS IN D =[(00,11),(1,2)] and D0 =[(01,10,11),(1,1,1)]. In the CONSTRAINED ARGS absenceof recombination, D has 6 ACs,while thereexistsno coalescent tree for D0 (c.f. the three gamete test mentioned in In this section, we enumerate ACs in ARGs with at most Section 2). The minimum number of recombinations for D0 R recombinations. This study is motivated by the problem is one, and there are 21 inequivalent ACs in the set of ARGs of approximating the likelihood P(D) of the input data D for D0 with exactly one recombination. In contrast, there are when the recombination rate is low, by summing over the 26 ACs for D when one recombination is allowed. probability of ARGs with small number of recombinations. In Section 3.4, we observed that, when arbitrary number Further, the work described here can be used to enumerate all of recombinations are allowed, two data sets differing by ARGs that can generate D using at most R recombinations, some permutation of their columns have the same number of under the infinite-sites model of mutation. The constraint on ACs. This also is no longer true, in general, if the number thenumberofrecombinationsfurthercomplicatestheproblem of recombinations is constrained. For example, consider the of enumerating ACs, and combinatorial analysis is more two data sets D = [(001,110,111),(1,1,1)] and D0 = difficult to carry out. We provide a dynamic programming [(010,101,111),(1,1,1)]. While D has an ARG with only algorithm that enumerates ACs by explicitly tracking all one recombination, no valid ARG with less than two recom- possible evolutionary histories backwards in time. binations exists for D0. (CAUTIONARYREMARK: GivenadatasetD,anACψ thatis R recombinations away from D may not appear in any ARG 4.2 The search algorithm for D with at most R recombinations; further recombinations ForthecasewithrestrictednumberRofrecombinations,the may be required for ψ to reach the grand common ancestor. approachwetakeistotraceexplicitlyallpossibleevolutionary Hence, the number of ACs in ARGs for D with at most R histories backwards in time. For any particular AC, we try all recombinations is in general smaller than the number of ACs possibleevents—thatis,recombination,coalescentormutation that can be reached from D using at most R recombinations. events—that can happen to that AC. Possible recombination In what follows, we are interested in counting the former.) andcoalescenteventsareasillustratedinFigure4.Wedonot

Description:
[2], [15], but that approach quickly becomes infeasible as data size . (Recall that, unlike coalescent trees, sequence (corresponding to the most recent common ancestor). Let Pi,j strings inside [ ]; a string is read horizontally.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.