ebook img

The Imperfect Ancestral Recombination Graph Reconstruction Problem: Upper Bounds for ... PDF

16 Pages·2010·0.41 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview The Imperfect Ancestral Recombination Graph Reconstruction Problem: Upper Bounds for ...

JOURNALOF COMPUTATIONAL BIOLOGY Research Articles Volume 17,Number6, 2010 #MaryAnnLiebert, Inc. Pp. 767–781 DOI:10.1089/cmb.2009.0249 The Imperfect Ancestral Recombination Graph Reconstruction Problem: Upper Bounds for Recombination and Homoplasy FUMEI LAM, RYAN TARPINE, and SORIN ISTRAIL ABSTRACT One of the central problems in computational biology is the reconstruction of evolutionary histories. While models incorporating recombination and homoplasy have been studied separately, a missing component in the theory is a robust and flexible unifying model which incorporates both of these major biological events shaping genetic diversity. In this article, we introduce the first such unifying model and develop algorithms to find the optimal ancestral recombination graph incorporating recombinations and homoplasy events. The powerofourframeworkistheconnectionbetweenourformulationandtheDirectedSteiner Arborescence Problem in combinatorial optimization. We implement linear programming techniquesas wellas heuristicsfortheDirectedSteinerArborescenceProblem,anduseour methods to construct evolutionary histories for both simulated and real data sets. Key words: recombination, phylogeny, homoplasy, genetic variation, haplotypes, sequence analysis. 1. INTRODUCTION One of the central problems in computational biology is the problem of reconstructing evolu- tionaryhistories.Manyvariantsoftheproblemhavebeenstudied,butwiththegrowingrepositoriesof variation data, there is increased demand for new tools for analysis. Prior studies have established that in order to accurately represent complete evolutionary histories, the underlying model must incorporate hy- bridizationevents,whichcorrespondtothemixingofgeneticmaterialofancestralsequencespassedtotheir descendents.Nordborg(2001)states,‘‘Intheeraofgenomicpolymorphismdata,theimportanceofmodeling recombination can hardly be overemphasized.’’ Another important set of events to consider in the con- struction of evolutionary histories are homoplasy events, which extend beyond the infinite sites model to includebackandrecurrentmutations. The goal of our work is to create a model for constructing evolutionary histories that extends current models by incorporating both evolutionary processes of recombination and homoplasy. Such analyses are applicabletootherresearch,includingthestudyoflinkagedisequilibriumandrecombinationhotspots,and the search for genetic predictors of disease. The mathematical and computational challenge is to develop methods that are robust and rigorous. Department ofComputer Science, BrownUniverisity, Providence,RhodeIsland. 767 768 LAM ET AL. 2. SINGLE EVENT MODELS In the phylogenetic tree reconstruction models under consideration, the input is a matrix with rows representing individuals (e.g., haplotype data) and columns representing sites. Throughout this article, we willassume thatthe input isbinary,with values 0and 1.Importantexamplesofsuch instancesarise from single nucleotide polymorphism (SNP) data; a SNP is a position in the genome with at least two different bases present in the population, each with a frequency above a certain threshold. It is the most abundant type of polymorphism and in practice is found to be largely binary. A vast amount of such data has been collected in the International HapMap project of the International HapMap Consortium (2005). The two important evolutionary forces we consider in this work are mutation and recombination. A mutation is a change in a single site of a sequence, either from value 0 to 1 or from 1 to 0. In population genetics, the infinite sites model assumes that at most one mutation occured at each site throughout evolutionaryhistory.Aphylogenetictreeisadirectedtreewitheachedgelabeledbyanintegerbetween1 andmandeachnodelabeledbyabinarysequence.Ifedgee¼(u,v)islabeledbysitei,thenthesequence vcanbeobtainedfromthesequenceubychangingthevalueatsiteiofufromvaluextovalue1(cid:1)x.We denotethisbyv¼u.i.AphylogenetictreeTdisplayinginputIsatisfiesthepropertythateachrowinIlabels anodeintreeT.Inwhatfollows, weinterchangeablyrefertoanodeintheancestral recombinationgraph (ARG) and the binary sequence labeling it. If binary input I can be displayed in a phylogenetic tree such that each label from 1,2,…m labels at most one edge, the resulting tree is called a perfect phylogeny. Theothereventweconsiderismeiotic,orcrossoverrecombination,whichisoneofthedominantforces impacting genetic diversity. A crossover recombination occurs when two chromosomes of equal length exchangematerialtoformadescendent,whichcontainsaprefixofthefirstchromosomeandasuffixofthe second. In this work, the term recombination refers to such events. ItiswellknownthatasetofequallengthbinarysequencesIcanbedisplayedinaperfectphylogenyif and only if it passes the four gamete test (Gusfield, 1991, 1999). Much of the current haplotype data fails theperfectphylogenytestandthuscannotbeexplainedbyaperfectphylogeny.Inthefollowingsections, wediscusstwopreviouslystudiedmodelsforreconstructingevolutionaryhistoriesforsuchdata:ancestral recombination graphs and imperfect phylogenetic trees. 2.1. Ancestral recombination graphs An ancestral recombination graph or phylogenetic network is a directed acyclic graph in which nodes correspond to binary sequences of length m, and edges correspond to either mutation or recombination events.Thereisauniquerootvertexwithindegree0,andeverynodeotherthantheroothasindegreeeither one or two. Nodes with indegree two are recombination nodes; if e is an edge that is directed into a recombinationnode,theneisarecombinationedge.Ifsisarecombinationnode,thenitcanbeobtainedby its two parents by combining a prefix of one parent with a suffix of the second parent. Otherwise, e is a mutation edge and is labeled by a site i2f1,2, ...mg; for a mutation edge (u,v) labeled by site i, the sequences satisfy v¼u.i.Ahomoplasyeventoccursifthereexistsasiteiwithtwoormoreedgeslabeled by i. Homoplasy events are either recurrent mutations (x2f0,1g mutates to 1(cid:1)x two or more times at a sitei)orbackmutations(inwhichsiteimutatesfromx2f0,1gto1(cid:1)xandthenfrom1(cid:1)xbacktox).An ARGGdisplaysinputIifthereisanodeinGcorrespondingtoeachrowinI.Thefollowingproblemhas beenthesubjectofintensiveresearch(BordewichandSemple,2007;GriffithsandMarjoram,1997;Hein, 1990, 1993; Hudson and Kaplan, 1985; Lyngs et al., 2005; Meyers and Griffiths, 2003; Song et al., 2005; Song, 2006; Song and Hein, 2004; Wang et al., 2001). AncestralRecombinationGraph(ARG)ReconstructionProblem:GivenasetofnbinarysequencesIeachof lengthm,findtheancestralrecombinationgraphdisplayingIwiththeminimumnumberofrecombinationnodes R (I)under theinfinitesitesmodel. min This problem was first considered by Hudson and Kaplan (1985). As this problem is APX-hard (and therefore NP-hard) in the general case (Bordewich and Semple, 2007), there has been increased focus to develop efficient methods to compute lower bounds for R (I). Meyers and Griffiths (2003) develop min methodstoobtaingloballowerboundsbycombininglocallowerbounds.SongandHein(2004)introduce lower bounding methods based on set theoretic conditions and use tree operations to generate optimal ARGs. Song et al. (2005) compute both lower and upper bounds on the minimum number of recombi- nations needed to construct the evolutionary history. Their lower and upper bounds are shown to often IMPERFECT ANCESTRAL RECOMBINATION GRAPH RECONSTRUCTION 769 coincide in practice, and in such cases, their algorithm solves the parsimonious ARG reconstruction problem to optimality. 2.2. Imperfect phylogenetic trees Thesecondmethodtoaddresssequencesthatcannotbedisplayedonaperfectphylogenyistheimperfect phylogenyreconstructionmethod.Animperfectphylogenyisaphylogenetictreethatviolatesinfinitesites by allowing back and recurrent mutations. The following problem has also been the subject of a vast literature(AgarwalaandFernandez-Baca,1994;Bandelt,1991;Damaschke,2004;Ganapathyetal.,2003; Semple and Steel, 2003; Sridhar et al., 2006, 2007b). Imperfect Phylogeny Reconstruction Problem: Given a set of n binary sequences I each of length m, find a phylogenetic treeexplainingI withtheminimumnumberof homoplasyevents. BecausetheproblemisNP-hardingeneral(FouldsandGraham,1982),animportantproblemistoisolate parametersthatcapturethecomplexityoftheprobleminthehopeoffindinganalgorithmthatispolynomial intheremainingparameters.InBlellochetal.(2006)andSridharetal.(2007a),itwasshownthatforbinary sequences,theimperfectphylogenyproblemcanbesolvedinfixedparametertractabletimeinparameterq, where q denotes the number of homoplasy events needed to explain the input sequences. 3. HIERARCHY OF IMPERFECT ANCESTRAL RECOMBINATION GRAPHS Whilemodelsforrecombinationandimperfectionhavebeenstudiedseparately,amissingcomponentin thetheoryofconstructingevolutionaryhistoriesisarobustunifyingmodelincorporatingbothphenomena. The following characteristics should be satisfied by any model incorporating both recombination and homoplasy events. 1. Robustness: incorporates both single-event models (ARG reconstruction and imperfect phylogeny reconstruction) as special cases 2. Flexibility: model input incorporates a weighted set of parameters based on information about the relative rates of recurrent mutation and recombination events for the input sequences. 3. Computational effectiveness: allows algorithms on real data AnimperfectancestralrecombinationgraphoninputIisanARGdisplayingIwhichallowshomoplasy events.TosatisfyProperty(2),wewouldliketobeabletoinputcostparametersassociatedtothemutation andrecombinationeventsobtainedfromseparateanalysis.TheweightedcostofanimperfectARGAisthe sum of the costs associated to the recombination and mutation events occuring in A. The imperfect ARG reconstruction problem is the following. Imperfect Ancestral Recombination Graph (Imperfect ARG) Reconstruction Problem: Given a set of n binary sequences I each of length m, a common ancestral sequence r, and weights w, find an Imperfect ARG displayingIwith minimumweighted cost. This is an important problem, as it combines the two major biological events shaping genetic diversity intoasingleframework.Bychoosing asufficientlylargevalueforthecostofrecombination,theproblem includes the imperfect phylogeny reconstruction problem as a special case. Similarly, by choosing a sufficientlylargevalueforthecostofmutation,theproblemincludestheARGreconstructionproblemasa special case. It follows that the imperfect ARG reconstruction problem is also APX-hard (and therefore NP-hard) in the general case. OurcontributionistodevelopalgorithmsfortheimperfectARGreconstructionproblem.Moreover,our model has the advantage that it uses the same representation for both recombination and mutation events, rather than using cycles to represent recombination and edges to represent mutation. We believe this uniform treatment of the two events is advantageous in the development of algorithms. The central idea for the construction of the graph theoretic representations we will describe is the trans- formationofrecombinationcyclesintosimplergraphstructures.Suchatransformationispowerfulbecause, with cycles no longer present, the problem of constructing a minimum ARG can be formulated as a well- knowncombinatorialoptimizationproblemknownastheMinimumDirectedSteinerArborescence(MDSA) problem (for a survey and applications of the MDSA problem, see Hwang et al., 1992, and Winter, 1987). 770 LAM ET AL. 3.1. Directed Steiner arborescence problem GivenaconnecteddirectedgraphGwithedgeweightsw ,arootvertexr,andasetofterminalvertices e V ,adirectedSteinerarborescenceisasubgraphofGthatcontainsadirectedpathfromrtoeachterminal T inV .ThecostofadirectedSteinerarborescenceDisthesumofitsedgeweights.Thefollowingisawell- T studied problem in combinatorial optimization. Minimum Directed Steiner Arborescence (MDSA) Problem Input:ConnecteddirectedgraphGwithedgeweightsw ,rootvertexrandasetofterminalverticesV e T Objective: Find a directed Steiner arborescence of minimum weight in G. The MDSA problem is NP-hard (Karp, 1972) and as hard to approximate as the Set Cover Problem (Guhaand Khuller, 1996). Therefore, there is no constant factor approximationalgorithm for the problem (unless P¼NP). 3.2. Informal description of main result Webeginbygivinganinformaloutlineofourmethodandmainresults.Throughoutthediscussion,the inputisassumedtobeasetofnbinarysequences,eachoflengthm.ForinputI,considerfirstthesetAof all ancestral recombination graphs displaying I. We will construct a set of auxiliary graphs, called hier- archy graphs (and denoted HG), together with a sequence of transformations that maps each ancestral recombination graph to a hierarchy graph. To informally describe one level of hierarchy graphs, we will first utilize two pebbles (one labeled a and one labeled b) and describe a sequence of moves for these pebbles. We begin by detailing how the transformation turns a single recombination cycle into a directed path.WethenextendthisideatotransformmoregeneralARGs(containingmultiplerecombinationcycles) into directed arborescences. IfvisarecombinationnodeinanARG,thenbytracingtwopathsfromtheparentsofvbackintime,the pathseventuallymeetatacommonancestral,orcoalescentnode.Anypairofsuchdirectedpathstogether form a recombination cycle. Note that a recombination node v may give rise to several different recom- binationcycles,possiblywithdifferentcoalescentnodes(Fig.1).ForafixedrecombinationcycleC,denote thecoalescentnodeofCbycoal(C)andtherecombinationnodeofcycleCbyrec(C)(Fig.1).Aninternal node of recombination cycle C is any node in C\{coal(C), rec(C)}. For recombination cycle C, imagine placing two pebbles (labeled by a and b) on the coalescent node coal(C).Considerthetwonodedisjointpathsfromthecoalescentnodetotherecombinationnode(Fig.2). Wethinkofpebbleaastravelingalongonepathandpebblebastravelingalongthesecondpathuntilthey meet at the recombination node. Note that, for a fixed recombination cycle C, the only nodes simulta- neouslyoccupiedbybothpebblesarethecoalescentnodeandrecombinationnode.Weenforcethat,ateach coal(C) coal(C¢) C C¢ rec(C) rec(C¢) FIG. 1. Recombination cycles. IMPERFECT ANCESTRAL RECOMBINATION GRAPH RECONSTRUCTION 771 coal(C) = 00000 2 4 [00000, 00000] [2, *] 01000 00010 [01000 , 00000] [*, 4] 5 [01000, 00010] [5, *] 01001 1 2 1 [01001, 00010] [*, 1] 3 [01001, 10010] [3, *] 01101 10010 [01101, 10010] R(2.5) R > (2.5) R< (2.5) [10101, 10101] rec(C) = 10101 FIG. 2. Transforming a recombination cycle intoa directed path. time step, only one pebble is in motion, unless both pebbles move together to meet at the recombination node. We record the journey of the pebbles in a sequence of ordered pairs, where the first component of each ordered pair represents the position of the first pebble throughout its journey, and the second com- ponent represents the position of the second pebble throughout its journey. From the sequence of moves, it is possible to construct a graph, with ordered pairs of binary sequences as nodes and ordered pairs ofmutations/recombinationslabelingtheedges.Notethatforanyrecombinationcycle,theremaybemany associated directed paths, depending on the choice of ordering for the steps taken by the pebbles. For example, pebble a could move first and then pebble b, or vice versa, resulting in two different paths. However, while there may be multiple paths associated with C, the lengths of all such paths are equal. TheideaoutlinedabovecanbeextendedtotransformmoregeneralARGs(possiblycontainingmultiple recombination cycles) into directed arborescences. We will construct a hierarchy of representations (de- noted HG ,HG ,HG , ...) which will satisfy the following: 1 2 3 1. EverydirectedSteinerarborescenceinahierarchygraphHG canbemappedtoanARGofthesame i cost. 2. Thesetofhierarchygraphsareorganizedintolevels,witheachhigherlevelrepresentingalargerset of ARGs. 3. EveryARGinAcanbetransformedtoadirectedSteinerarborescenceinHG ofthesamecostfora l suitable level l. Furthermore, it is possible to compute an upper bound on this level l. 4. Bysolvingasequenceofdirected Steinerarborescenceproblems,itispossibletofindasequence of upper bounds for the minimum imperfect ARG problem that converges to the exact solution. 3.3. Hierarchy graph: the first two levels We first describe in complete detail the first two levels of the hierarchy, HG and HG . We assume 1 2 throughout that the input is given as an n(cid:2)m binary input matrix I, with rows representing sequences and columns representing varying sites. The vertices in HG correspond to binary sequences of length m, and there is a directed edge between 1 two vertices u andu if the two sequences corresponding tothese vertices differ at exactly onesite. HG 1 2 1 corresponds to the hypercube in dimension m, where each undirected edge is replaced by two directed 772 LAM ET AL. edges,oneineachdirection.HierarchyHG modelshomoplasyevents(butnotrecombinationevents),and 1 we have the following lemma. Lemma 3.1. The Directed Steiner Arborescence Problem in HG is equivalent to the Imperfect 1 Phylogeny Reconstruction Problem. Inleveltwoofthehierarchy,theverticesofthegraphHG (I)areorderedpairs(v ,v ),wherev andv 2 1 2 1 2 bothrangeoverbinarysequencesoflengthm.Thereisadirectededgefromorderedpair(v ,v )to(w ,w ) 1 2 1 2 if one of the following properties is satisfied. (I) v and w differ in exactly one site (site i) and v ¼w . The edge between (v , v ) and (w , w ) is 1 1 2 2 1 2 1 2 labeled [i,*]. (II) v and w differ in exactly one site (site i) and v ¼w . The edge between (v , v ) and (w , w ) is 2 2 1 1 1 2 1 2 labeled [*, i]. (III) w ¼w and w can be obtained from v and v by combining the firstbkcsites of v with the last 1 2 1 1 2 1 n(cid:1)bkcsites of v . The edge between (v , v ) and (w , w ) is labeled R(k). 2 1 2 1 2 (IV) w ¼w and w can be obtained from v and v by combining the firstbkcsites of v with the last 1 2 1 1 2 2 n(cid:1)bkcsites of v . The edge between (v , v ) and (w , w ) is labeled R(k). 1 1 2 1 2 Edges of type (I) and (II) are called level 2 mutation edges, and edges of type (III) and (IV) are called level 2 recombination edges. Associated with the edges of the level 2 graph is a weight function w :E(HG(I))!R , which is specified as part of the input and indicates the correspondingcosts for the e (cid:3)0 recombination and mutation events. By convention, the recombination point k will always be chosen to be half-integral (e.g., k¼2.5 cor- responds to a recombination event between sites 2 and 3), except in the following case. In order to incorporate certain types of branching events, we allow the events R(0) and R(0), which indicate that the recombination node inherits the complete sequence of one of its parents. Event R(0) indicates that the recombinationnodeagreeswithparentvandeventR(0)indicatesthattherecombinationnodeagreeswith parent w. The weight of the edges corresponding to these recombination events will always be equal to zero. LetR bethenodeinHG correspondingtotheorderedpair(r,r).ForeachrowsofI,addtheordered 2 2 pair(s,s)tothelistofterminalverticesV .GivenasolutionTtothedirectedSteinerarborescenceproblem T on graph HG with root R , weights w, and terminal vertices V , we now construct an imperfect ancestral 2 2 T recombination graph A displaying I of the same weight as T. For any node (u, v) in T, let Fþ(u, v) denote the set of outgoing edges from (u, v) in arborescence T. Initialize Y¼{(r,r)} While Y=;,let(u,v)2Y I.WhileFþ(u,v)=;,let e¼((u,v),(u0,v0))2Fþ(u,v) Ifthelabelof edgee ismutation [i,*],adda directed edgefrom sequence u tosequence u0¼u.ito A Elseifthelabelof edgee ismutation [*,i],add adirected edge from sequence vtosequence v0¼v.ito A ElseifthelabelofedgeeisrecombinationR(k),thenu0¼v0isthesequenceagreeingwithuinpositionslessthan kand agreeingwith vin positionsgreater thank.In A,add directed edgesfrom u tou0 andfrom vto u0 ElseifthelabeloftheedgeisrecombinationR(k),thenu0¼v0 isthesequence agreeingwithvinpositionsless thankand agreeingwith uin positionsgreater thank.In A,add directededges from uto u0 from vtou0. Add(u0,v0)toYand remove efrom Fþ(u,v) II.Remove (u,v)from Y. The result is an imperfect ARG A on input sequences I with common ancestor r (Fig. 3). The con- structionshowsthatforeachSteinerarborescence,itispossibletoobtainacorrespondingimperfectARG of the same cost. However, we will see in the next section that the reverse transformation is not always possible. 3.4. Crowned trees In this section, we establish sufficient conditions on an imperfect ARG to correspond to a directed Steinerarborescenceinhierarchylevel2ofthesamecost.AcrownedtreedisplayingIisanimperfectARG displaying I which satisfies the following conditions. IMPERFECT ANCESTRAL RECOMBINATION GRAPH RECONSTRUCTION 773 (00000000, 00000000) [*,2] r = 00000000 [*,1] (00000000,01000000) (01000000,10000000) [*,5] 1 2 R(1.5) (00000000,01001000) c=10000000 C 6 7 1 (11000000,11000000) [*,3] [7,*] [6,*] (00000100,01001000) R< (1.5) R > (1.5) 5a=0 1C0020000 00000100 00000010 (00000000,01R1(001)000) (00000010,01001000) R(5.5(0)1001100,01001100) 11000000 3 b=01001000 R< (C6.53) R> (6.5) (01101000,01101000) R(6.5) 01101000 R< (5.5) R> (5.5) 01001100 01001010 [8,*] 8 4 (01001010,01001010) (01101001,01101000) 01101001 C4 01111000 [*,4] R (5.5) R (5.5) > < (01101001,01111000) 01111001 R(5.5) (01111001,01111001) FIG. 3. Transformation fromdirected Steiner arborescence toimperfect ancestral recombination graph. Condition1. LetCandC0betworecombinationcyclesintheimperfectARG.IfCandC0shareanode that is an internal node in at least one of C or C0, then the two recombination cycles have the same coalescent node (coal(C)¼coal(C0)). Condition2. ForanyfixedrecombinationcycleC,letS(C)denotethesetofcyclesC0 sharingatleast one node with C which appears as an internal node in C. Then there is a directed path P contained in C C such that i. coal(C)2P C ii. thesetofsharededgesbetweenCandanycycleC0 2S(C)formsadirectedpathandiscontainedin path P . C Note that the recombination cycles C and C0 in Figure 4 violate both conditions. A consequence of Condition 1 is that an internal node in a recombination cycle C cannot be a coalescent node for any other cycleC0.ForrecombinationcycleC,letlast(C)denotethefinalnodeinpathP ,i.e.,theuniquenodeinP C C 00000 1 4 10000 00010 5 C 3 10001 00110 R(3.5) C¢ 2 10010 01110 R(2.5) 10110 FIG. 4. Anexample ofan imperfect ancestral recombination graphthat isnota crowned tree. 774 LAM ET AL. withindegree 1and outdegree 0. By condition(2), the descendantsof last(C)in C cannot be containedin anyotherrecombinationcycleinS(C).LetP denotetheunionofdirectedpathsP overallrecombination C cyclesC.NotethatP isaunionofvertexdisjointpaths,i.e.,novertexiscontainedinmorethanonepath P .Also,observethatforeachrecombinationcycleC,thelabelingofthetwodirectedpathsr andr inC C 1 2 from coal(C) to rec(C) can be chosen arbitrarily. If C contains any edges in P, then these edges must belong to exactly one of the paths r or r in C; by convention, we will always label this path by r . 1 2 2 In Figure 3, P is the edge (r,a) and P ¼P is the directed path on edges (r,a) and (a,b). This C C C 1 2 3 establishesthepathsr forcyclesC ,C ,andC asthepathsineachofthesecyclescontainingedge(r,a); 2 1 2 3 however,thepathsr andr ofcycleC arenotdetermined.Also,last(C )¼a,last(C )¼last(C )¼b,and 1 2 4 1 2 3 last(C )¼;. 4 Animportantsubsetofthesetofcrownedtreesisthesetofgalledtrees,whichareconstrainedARGsin whichanypairofrecombinationcyclesareedgedisjoint.ThisclassofgraphswasintroducedinWangetal. (2001) and further studied in Gusfield (2005), Gusfield et al. (2003, 2004a,b), and Song (2006). The importanceofsuchtreeswasestablishedbyGusfieldetal.(2003),whodevelopapolynomialalgorithmfor the parsimonious ARG reconstruction problem over galled trees. Therefore, the set of crowned trees captures an important subfamily of ancestral recombination graphs. WenowdemonstratethatanycrownedtreeAdisplayingIcorrespondstoadirectedSteinerarborescence Tofthesamecostinthelevel2hierarchygraphHG .TheideaforbuildingTfromcrownedtreeAwillbe 2 tovisitsequencesinthecrownedtreeinawelldefinedorder,whichwillallowthepebblestoproperlykeep track of the positions as the sequences of the crowned tree are visited. Inouralgorithm,thesecondpebbleperformsadepthfirstsearchonedgesinP.Forexample,inFigure3, the second pebble first travels edge (r,a)2P. When the second pebble reaches a node last(C) for some recombination cycle C, then the first pebble for cycle C becomes activated and travels down path r of 1 recombinationcycleC.Sincea¼last(C )inFigure3,thefirstpebbleforcycleC becomesactivatedafter 1 1 thesecondpebble’smoveandtravelsthepathcontainingedges(r,c)and(c,rec(C )).Theintuitionforthe 1 algorithm isthat conditions (1) and(2) enable thefirst pebbletoremainonthe coalescentnode ofacycle whilethesecondpebbleistravelinganyedgescontainedinthecommonpathsetP.Furthermore,thefirst pebbleisabletovisitalledgesindirectedpathr inCandrecombinewiththesecondpebblesincenoother 1 recombination cycle intersects the vertices of r . The following algorithm describes the transformation 1 from crowned tree to directed Steiner arborescence. 1. Consider the sequence r corresponding to the root of crowned tree A and initialize Y¼{(r,r)}. 2. While Y=; For y¼(s,t)2Y (a) for any cycle C such that t¼last(C), activate s, add the set of edges in path r to T, and add 1 rec(C) and all nodes adjacent to edges in r to Y 1 (b) ifthereisanoutgoingmutationedgefromtinAthatiscontainedinpathsetP,supposethisedge haslabeli.Thencreateamutationedgefromytoy0¼(s,t.i)withedgelabel[*,i]andaddy0toY. (c) for each outgoing mutation edge from s (t) in A that is not contained in path set P with label i, create amutation edge from ytoy0¼(s.i, t)(y0¼(s,t.i))withedge label [i,*](label [*, i]), and add y0 to Y (d) if s (t) has an outgoing mutation edge in A labeled i whose other endpoint is a coalescent node correspondingtosequences.i(t.i),createamutationedgefromytoy0¼(s.i,t)(y0¼(s,t.i)),anda recombination edge from y0 to y00¼(s.i, s.i) labeled R(0) (from y0 to y@¼(t.i, t.i) labeled R(0)). Add y@ to Y 3. Remove y¼(s,t) from Y NotethatifthephylogeneticnetworkAdoesnotsatisfyCondition(1),itwouldbeimpossibleforpebble 1tosimultaneouslyremainonthecoalescentnodeforallrecombinationcyclesC.Similarly,ifAdoesnot satisfy Condition (2), it would be impossible for pebble 2 to simultaneously travel all the paths necessary before recombining with pebble 1. 3.5. Higher level hierarchy graphs We now extend beyond the first two levels to construct a hierarchy of representations, with each suc- cessivelevelinthehierarchyrepresentinglargerandlargersubsetsofARG.Thekthlevelofthehierarchy IMPERFECT ANCESTRAL RECOMBINATION GRAPH RECONSTRUCTION 775 will consist of an underlying graph HG , whose vertices are k-dimensional vectors (v ,v , ...v )2(Zm)k k 1 2 k 2 where each coordinate v ranges over all binary sequences of length m. In the context of our informal i presentation based on pebble motions, each coordinate corresponds to a different pebble and pairs of coordinates correspond to pairs of pebbles a and b that are allowed to traverse paths which form recom- binationcycles.ThedirectededgesofHG willdetailthepossibletransitionstepsthepebblesareallowed k tomakeastheytravelthroughtheimperfectARG.Foravectorv2Rk andasubsetC(cid:4)[k]ofcoordinates, let vj denote the restriction of vector v onto coordinates in C. We will consider vj as a set (and not a C C multi-set),sothatthesetvj hassizeoneifallofthecoordinatesofvinpositionsCarethesame.Letlbe C eitherzeroorahalfintegralvaluebetween1andm(cid:1) 1.Fortwobinarysequencesaandboflengthm,let 2 2 (a,b) denotethesequenceobtainedbycombiningthefirstblcsitesofawiththefinalm(cid:1)blcsitesofb(in l the case l¼0, (a,b) ¼b). To describe the directed edges in hierarchy graph HG , we will need to define 0 k the following operations. (i)Foravectorv,supposethereisasubsetofcoordinatesC(cid:4)[k]containingthesamebinarysequencea. Then the coordinated mutation transition at site i on C results in vector M (v)2(Zm)k, whose jth C,i 2 coordinate is (cid:1) a:i if j2C M (v) ¼ C,i j v if j62C j This move corresponds to modifying vector v by taking the sequences with coordinates in C and mutating site i in each of these sequences. Note that M (v)j ¼{a.i}, and furthermore, there may be C,i C additional coordinates in [k]\C for which v contains sequence a and remains unchanged by this transition. (ii)Consider any twodisjointsetsof coordinatesC ,C (cid:5)[k]such that vj ¼{a}andvj ¼{b}(where 1 2 C C a=b)ConsidertwosubsetsC0 (cid:4)C andC0 (cid:4)C ,one(butnotboth)ofwh1ichmaybethe2emptysetand 1 1 2 2 let l be either 0 or a half integral value between 1 and m(cid:1) 1. Then the jth coordinate of recombination 2 2 transitions R (v) and R (v) are defined as C1,C2,C10,C20,l C1,C2,C10,C20,l (cid:1)(a,b) if j2C0 [C0 R (v) ¼ l 1 2 C1,C2,C10,C20,l j vj otherwise (cid:1)(b,a) if j2C0 [C0 R (v) ¼ l 1 2 C1,C2,C10,C20,l j vj otherwise: The transitions (i) and (ii) allow us to define the edges of HG . k (E1) For a vector of binary sequences v2(Zm)k, for each integer i2[m], and for each subset C(cid:4)[k] 2 such that jvj j¼1, there is a directed edge from v to M (v). C C,i (E2) For a vector of binary sequences v2(Zm)k,l2f0, 1, 3, ...m(cid:1) 1g, and for each C ,C (cid:4)[k] and 2 2 2 2 1 2 subsets C0 (cid:4)C ,C0 (cid:4)C such that C \C ¼; and jvj j¼jvj j¼1, there is a directed edge from v to 1 1 2 2 1 2 C1 C2 R (v) and from v to R (v). C1,C2,C10,C20,l C1,C2,C10,C20,l Edgesoftype(E1)arecalledhierarchygraphmutationedges,andedgesoftype(E2)arecalledhierarchy graph recombination edges. Note that different choices of subsets C and C and values l in the recom- 1 2 bination transitions may in fact give rise to the same hierarchy graph recombination edge. Together with the edges of the hierarchy graph is an associated weight function w :E(HG )!R , which is specified e k (cid:3)0 as part of the input and indicates the corresponding costs for the recombination and mutation events. In general, the weight function can be site-dependent or chosen according to existing information about recombinationandmutationfrequenciesinthepopulation.Intheuniformmodel,allrecombinationevents (exceptrecombinationeventscorrespondingtol¼0)havethesamecosta andallmutationeventshavethe r same cost a . This corresponds toassigningweight a to all hierarchy graph recombination edges, weight m r a to all hierarchy graph mutation edges, and weight zero to the remaining edges. m Now, suppose D is a solution to the minimum Steiner arborescence problem on graph HG with root k Root , weights w, and terminal vertices V . We will describe a map F which constructs from D an k T k imperfect ancestral recombination graph A¼F (D). For any node u in D, let Fþ(u) denote the set of k outgoing edges from u in arborescence D. The transformation describes a breadth first search through the set of vertices in D, with each explored edge giving rise to a set of edges in the imperfect ARG A. In the followingdescription,YwilldenotetheverticesinDwhichhaveoutgoingedgesremainingtobeexplored. For each k, the map F transforms each Steiner arborescence in HG to an imperfect ARG. k k 776 LAM ET AL. Definition 1. An imperfect ARG A is representable at level k of the ARG hierarchy if there exists a Steiner arborescence D in HG such that F (D)¼A. The ARG-width of A is the smallest k such that A is k k representable at hierarchy level k. Input: DirectedSteiner arborescenceD2HG k Output: ImperfectAncestral GraphA¼F (D) withvertexset V(A)and edgeset E(A) k InitializeV(A)¼frg,Y¼f(r,r, ...r)g WhileY=;,letu2Y I.While Fþ(u)=;,lete¼(u,v)2Fþ(u) IfthelabelofedgeeishierarchymutationedgecorrespondingtocoordinatesCandsitei,addM (u)toV(A), C,i andadd directed edgefrom sequence u tosequence u0¼u.i toE(A) Else if edge e is hierarchy recombination edge labeled by RC1,C2,C10,C20,l(v), where ujC1¼{a} and ujC2¼{b}, thenaddsequence (a,b) toV(A),and adddirected edges froma to(a,b) and fromb to(a,b) toE(A) l l l Else if edge e is hierarchy recombination edge labeled by RC1,C2,C10,C20,l(v), where ujC1¼{a} and ujC2¼{b}, thenadd(b,a) toV(A),and adddirected edgesfrom a to(b,a) and fromb to(b,a) toE(A) l l l Addv toYand remove e fromFþ(u) II.Remove u from Y. By Section 3.4, we have the following lemma. Lemma 3.2. The set of crowned trees has ARG-width equal to two. We now study the structure of representable imperfect ancestral recombination graphs. Lemma3.3. IfanimperfectARGAisrepresentableatlevelk,thenitisrepresentableatlevelk0forall k0(cid:3)k. Proof. SinceAisrepresentableatlevelk,thereexistsanarborescenceD 2HG suchthatF (D )¼A. k k k k Let Root ¼(r,r, ...r)2HG and let Root ¼(r,r, ...r)2HG . Let D be the directed Steiner arbo- k k k0 k0 k0 |fflfflfflfflffl{zfflfflfflfflffl} |fflfflfflfflffl{zfflfflfflfflffl} k k0 rescenceinHG obtainedbyappendingk0(cid:1)kcopiesofbinarysequencertoeachvertexinD .Now,ifsis k0 k either the root of D or an input row in I, the vector v(s)¼(s,s, ...s,r, ...,r) |fflfflfflfflffl{zfflfflfflfflffl}|fflfflfflfflffl{zfflfflfflfflffl} k k0(cid:1)k appears in D . Furthermore, for each input sequence s, the path in D between Root and s gives rise to a k0 k k pathinD betweenRoot andv(s)(witheachvertexinthepathhavingvaluerinthefinalk0(cid:1)kentires). k0 k0 Now, for each input sequence s, create a trivial recombination edge between v(s) and the vector R (v(s))¼(s,s, ...s)2HG : ([k],[k0](cid:1)[k],[k],[k0](cid:1)[k],0) k0 |fflfflfflfflffl{zfflfflfflfflffl} k0 TheresultinggraphisaSteinerarborescenceD0 inHG .Sincetrivialrecombinationedgeshaveweight k0 zero, the directed Steiner arborescence D0 has the same cost as the directed Steiner arborescence D. Furthermore (cid:1) (D )¼A, implying that A is representable at level k0. & k0 k0 For fixed values of the mutation and recombination parameters, let fD(cid:6)g denote a sequence of k k(cid:3)1 solutions to the MDSA problem on the sequence of hierarchy graphs fHG g . We apply the transfor- k k(cid:3)1 mationsF tothesearborescencestoobtainasequenceofimperfectARGsf(cid:1) (D(cid:6))g .Thefollowingisa k k k k(cid:3)1 corollary of Lemma 3.3. Corollary 3.4. For any k(cid:3)1, cost(D(cid:6))(cid:3)cost(D(cid:6) ). k kþ1 For an imperfect ARG A, let R(A) denote the number of recombination events in A. Note that R(A) simplycountsthenumberofrecombinationevents(nottheweightedcostofrecombinations)anddoesnot takeintoaccountanyhomoplasyeventsinA.ThefollowingtheoremboundstheARG-widthofanyARG. Theorem3.5. ForanyimperfectancestralrecombinationgraphAwithR(A)(cid:3)1,theARG-widthofA is at most 2R(A).

Description:
ancestral recombination graph incorporating recombinations and The software implementing our algorithm, iARG, is available for download at
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.