Downloaded from genome.cshlp.org on January 2, 2023 - Published by Cold Spring Harbor Laboratory Press

Breakpoint Graphs and Ancestral Genome Reconstructions

Max A. Alekseyev and Pavel A. Pevzner
Department of Computer Science and Engineering
University of California at San Diego, U.S.A.
{maxal,ppevzner}@cs.ucsd.edu

Classification: Genome Rearrangements, Ancestral Genome Reconstruction, Molecular Evolution

Corresponding author: Pavel Pevzner
email: [email protected]
Mail address: 9500 Gilman Dr., La Jolla, CA 92093-0404, U.S.A.
Phone: 1-310-4976941
Fax: 1-858-5347029

Abstract

Recently completed whole genome sequencing projects marked the transition from gene-based phylogenetic studies to phylogenomics analysis of entire genomes. We developed an algorithm MGRA for reconstructing ancestral genomes and used it to study the rearrangement history of seven mammalian genomes: human, chimpanzee, macaque, mouse, rat, dog, and opossum. MGRA relies on the notion of multiple breakpoint graphs to overcome some limitations of the existing approaches to ancestral genome reconstructions. MGRA also generates the rearrangement-based characters guiding the phylogenetic tree reconstruction when the phylogeny is unknown.

INTRODUCTION

The first attempts to reconstruct the genomic architecture of ancestral mammals predated the era of genomic sequencing and were based on the cytogenetic approaches (Wienberg and Stanyon, 1997). The rearrangement-based phylogenomic studies were pioneered by Sankoff and co-authors (Sankoff et al., 1992; Sankoff and Blanchette, 1998; Blanchette et al., 1997) and were based on analyzing the breakpoint distances. Moret et al. (2001) further optimized this approach and developed the popular GRAPPA software for rearrangement analysis. MGR, another genome rearrangement tool (Bourque and Pevzner, 2002), uses the genomic distances instead of breakpoint distances for ancestral reconstructions.
Since genomic distances lead to more accurate ancestral reconstructions (Moret et al., 2002; Tang and Moret, 2003), GRAPPA has been modified for genomic distances as well. While MGR has been used in a number of phylogenomic studies (Bourque et al., 2005; Murphy et al., 2005; Pontius et al., 2007; Bulazel et al., 2007; Xia et al., 2007; Deuve et al., 2008; Cardone et al., 2008), both MGR and GRAPPA have limited ability to distinguish reliable from unreliable rearrangements and to address the "weak associations" problem in ancestral reconstructions (Bourque et al., 2004, 2005; Froenicke et al., 2006; Bourque et al., 2006).

Recently, Ma et al. (2006) made an important step towards reliable reconstruction of the ancestral genomes. In contrast to MGR and GRAPPA (which analyze both reliable and unreliable rearrangements), they have chosen to focus on the reliable breakpoint reconstruction in the ancestral genomes and to avoid assignments in the case of weak associations (complex breakpoints). This proved to be a valuable approach since, as it turned out, most breakpoints in the ancestral mammalian genomes can be reliably reconstructed. However, there are some limitations (discussed in Rocchi et al. (2006)) that this approach has to overcome to scale to large sets of genomes. First, the inferCARs algorithm of Ma et al. (2006) assumes that the phylogeny is known, yet the phylogeny remains a subject of enduring debate even in the case of the primate–rodent–carnivore split (which is assumed to be resolved in Ma et al. (2006)). With the increase in the number of species, the reliability of the phylogeny will become an even bigger concern, thus raising the question of devising an approach that does not assume a fixed phylogeny but instead uses rearrangements as new characters for constructing phylogenetic trees (see Chaisson et al. (2006)). While MGR does not assume a fixed phylogeny, its heuristically derived weak associations are less reliable. The challenge then is to integrate the reliability of inferCARs with the flexibility of MGR.
Another avenue to improve the inferCARs algorithm is to find out how to deal with complex breakpoints that create gaps in reconstructions.

Note that the Ma et al. (2006) approach focuses on the reliable ancestor reconstruction rather than on the specific rearrangements that happened in the course of evolution. These are related but different problems that can both benefit from being incorporated into a single computational framework. Indeed, Ma et al. (2006) consider individual breakpoints and do not distinguish between particular types of rearrangements that generated a breakpoint of interest. In reality, the reversals and translocations operate on pairs of dependent breakpoints rather than individual breakpoints. Some rearrangements (and synteny associations) cannot be inferred from the analysis of single breakpoints but become tractable via analyzing the breakpoint graph.¹ As a result, while MGR constructs provably optimal scenarios in the absence of breakpoint re-use, it is not clear whether the same result holds for inferCARs.

Recently, Zhao and Bourque (2007) developed the EMRAE algorithm, which reconstructs both reliable rearrangements and ancestors, thus addressing the shortcomings of both MGR (difficulty in distinguishing between reliable and putative rearrangement events) and inferCARs (ancestor reconstruction only). However, EMRAE (in contrast to MGR) does not attempt to reconstruct the phylogenetic tree and is limited to unichromosomal genomes. Below we address some limitations of MGR, EMRAE and inferCARs by developing the Multiple Genome Rearrangements and Ancestors (MGRA) algorithm (available from http://www.cs.ucsd.edu/users/ppevzner/software.html).

¹ The breakpoint graphs represent a popular technique for the rearrangement analysis since they reveal pairs of breakpoints representing footprints of the rearrangement events. See Chapter 10 of Pevzner (2000) for background information on genome rearrangements and breakpoint graphs.
In particular,

• MGRA constructs provably optimal scenarios even when there is some breakpoint re-use and when other tools do not guarantee optimality.
• MGRA is suitable for ancestral reconstructions of multichromosomal genomes (in contrast to EMRAE).
• MGRA is conceptually simpler and orders of magnitude faster than MGR.
• MGRA is not limited to reconstructing ancestral genomes in the case of known phylogeny (like inferCARs and EMRAE). Instead, it can guide the rearrangement-based reconstruction of phylogenetic trees.
• MGRA does not require prior information about the approximate lengths of the branches of the phylogenetic trees (in contrast to inferCARs).

To evaluate the performance of MGRA, we compared ancestral reconstructions generated by MGRA and inferCARs. Despite the fact that MGRA and inferCARs are very different algorithms, their reconstructions turned out to be remarkably similar (98.5% of synteny associations are identical). We further analyzed some differences between MGRA, inferCARs, and the cytogenetics approach.

METHODS

1 From Pairwise to Multiple Breakpoint Graphs

We start with the analysis of rearrangements in circular genomes (i.e., genomes consisting of circular chromosomes) and later extend it to genomes with linear chromosomes. We assume that each genome is formed by the same set of synteny blocks, which are arranged differently in different genomes. We will find it convenient to represent a chromosome formed by synteny blocks $b_1, \dots, b_n$ as a cycle with $n$ directed labeled edges (corresponding to blocks) alternating with $n$ undirected unlabeled edges (connecting adjacent blocks). The directions of the edges correspond to signs (strands) of the blocks. We label the tail and head of a directed edge $b_i$ as $b_i^t$ and $b_i^h$ respectively (Fig. 1) and represent a genome as a set of disjoint cycles (one for each chromosome). The edges in each cycle alternate between two colors: one color (e.g., "black") is used for undirected edges while the other color (traditionally called "obverse") is used for directed edges.
Let $P$ be a genome represented as a collection of alternating black-obverse cycles (a cycle is alternating if the colors of its edges alternate). For any two black edges $(x_1, x_2)$ and $(y_1, y_2)$ in the genome (graph) $P$ we define a 2-break rearrangement (first introduced as DCJ rearrangement in Yancopoulos et al. (2005) and recently studied in Bergeron et al. (2006); Lin and Moret (2008)) as replacement of these edges with either a pair of edges $(x_1, y_1)$, $(x_2, y_2)$, or a pair of edges $(x_1, y_2)$, $(x_2, y_1)$ (Fig. 2a,b). In the case of circular genomes, 2-breaks correspond to the standard rearrangement operations of reversals, fissions, or fusions/translocations (Fig. 2).²

Let $P$ and $Q$ be genomes on the same set of blocks $B$. The (pairwise) breakpoint graph $G(P,Q)$ is simply the superposition of genomes (graphs) $P$ and $Q$ (Fig. 1c). Formally, the breakpoint graph $G(P,Q)$ is defined on the set of vertices $V = \{b^t, b^h \mid b \in B\}$ with edges of three colors: obverse (connecting vertices $b^t$ and $b^h$), black (connecting adjacent blocks in $P$), and green (connecting adjacent blocks in $Q$). The black and green edges form the black-green alternating cycles that play an important role in analyzing rearrangements (Bafna and Pevzner, 1996). From now on we will ignore the obverse edges in the breakpoint graph so that it becomes simply a collection of (black-green) cycles (Fig. 1).

² In this paper we use the term reversal (common in bioinformatics literature) instead of the term inversion (common in biology literature). For circular chromosomes, fusions and translocations are not distinguishable, i.e., every fusion of circular chromosomes can be viewed as a translocation, and vice versa.
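The black-edge representation and the 2-break operation defined above can be sketched in a few lines of Python. This is only an illustrative sketch and not the MGRA implementation: the string encoding of blocks (`"+a"`, `"-c"`), the vertex names (`"at"`, `"ah"`), and the helper names `ends`, `black_edges`, and `two_break` are assumptions made for this example.

```python
def ends(block):
    """Return (entry, exit) vertices of a signed block such as '+a' or '-b'."""
    name = block[1:]
    tail, head = name + "t", name + "h"
    # A '+' block is traversed tail -> head, a '-' block head -> tail.
    return (tail, head) if block[0] == "+" else (head, tail)


def black_edges(genome):
    """Black (adjacency) edges of a circular genome.

    `genome` is a list of circular chromosomes, each a list of signed blocks.
    """
    edges = set()
    for chrom in genome:
        # Pair every block with its cyclic successor on the chromosome.
        for cur, nxt in zip(chrom, chrom[1:] + chrom[:1]):
            edges.add(frozenset((ends(cur)[1], ends(nxt)[0])))
    return edges


def two_break(edges, e1, e2, cross=False):
    """Replace black edges e1=(x1,x2), e2=(y1,y2) by (x1,y1),(x2,y2),
    or by (x1,y2),(x2,y1) when cross=True."""
    (x1, x2), (y1, y2) = e1, e2
    old = {frozenset(e1), frozenset(e2)}
    if len(old) != 2 or not old.issubset(edges):
        raise ValueError("both edges must be distinct black edges of the genome")
    new = [(x1, y2), (x2, y1)] if cross else [(x1, y1), (x2, y2)]
    return (edges - old) | {frozenset(e) for e in new}


# Genome P = (+a +b -c) from Fig. 1.
P = black_edges([["+a", "+b", "-c"]])
reversal = two_break(P, ("ah", "bt"), ("bh", "ch"))
fission = two_break(P, ("ah", "bt"), ("bh", "ch"), cross=True)
print(reversal == black_edges([["+a", "-b", "-c"]]))      # reversal of block b
print(fission == black_edges([["+a", "-c"], ["+b"]]))     # fission into two circles
```

The same pair of black edges yields a reversal with one reconnection and a fission with the other, which mirrors the two outcomes listed in the definition.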
Figure 1: a) Unichromosomal genome $P = (+a\ +b\ -c)$ represented as a black-obverse cycle. b) Unichromosomal genome $Q = (+a\ -b\ +c)$ represented as a green-obverse cycle. c) The breakpoint graph $G(P,Q)$ with and without obverse edges.

The 2-break distance $d_2(P,Q)$ between genomes $P$ and $Q$ is defined as the minimum number of 2-breaks required to transform one genome into the other. In contrast to the Genomic Distance Problem (Hannenhalli and Pevzner, 1995; Tesler, 2002a; Ozery-Flato and Shamir, 2003) (for linear multichromosomal genomes), the 2-Break Distance Problem for circular multichromosomal genomes has a trivial solution (Yancopoulos et al., 2005; Alekseyev and Pevzner, 2007): $d_2(P,Q) = b(P,Q) - c(P,Q)$, where $b(P,Q) = |B|$ is the number of synteny blocks in $P$ and $Q$, and $c(P,Q)$ is the number of black-green cycles in $G(P,Q)$.

A linear genome is a collection of linear chromosomes represented as sequences of signed synteny blocks. Each linear chromosome on $n$ blocks is represented as a path of $n$ directed obverse edges (encoding blocks and their direction) alternating with $n - 1$ undirected black edges (connecting adjacent blocks). In addition, we introduce an extra vertex $\infty$ and connect it by an undirected (irregular) black edge with every vertex representing a chromosomal end (hence, the degree of vertex $\infty$ is twice the number of linear chromosomes). A linear chromosome is an alternating path of black and obverse edges, starting and ending at the vertex $\infty$, and a linear genome is a collection of such paths. The 2-breaks involving irregular edges model the rearrangements affecting the chromosome ends (Fig. 2c,d).
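As an illustration of the distance formula for circular genomes, $d_2(P,Q) = b(P,Q) - c(P,Q)$, the following Python sketch counts the black-green cycles of $G(P,Q)$ by walking alternately along black and green edges. The genome encoding and the helper names (`ends`, `adjacency_edges`, `partner_map`, `two_break_distance`) are hypothetical choices for this example and are not taken from the MGRA software.

```python
def ends(block):
    """Return (entry, exit) vertices of a signed block such as '+a' or '-b'."""
    name = block[1:]
    tail, head = name + "t", name + "h"
    return (tail, head) if block[0] == "+" else (head, tail)


def adjacency_edges(genome):
    """Adjacency edges of a circular genome (list of lists of signed blocks)."""
    edges = set()
    for chrom in genome:
        for cur, nxt in zip(chrom, chrom[1:] + chrom[:1]):
            edges.add(frozenset((ends(cur)[1], ends(nxt)[0])))
    return edges


def partner_map(edges):
    """Map each vertex to the vertex joined to it by an edge of one color."""
    partner = {}
    for edge in edges:
        u, v = tuple(edge)
        partner[u], partner[v] = v, u
    return partner


def two_break_distance(p, q):
    """d_2(P, Q) = (number of blocks) - (number of black-green cycles)."""
    black = partner_map(adjacency_edges(p))
    green = partner_map(adjacency_edges(q))
    if black.keys() != green.keys():
        raise ValueError("genomes must share the same synteny blocks")
    seen, cycles = set(), 0
    for start in black:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:      # follow black, then green, until the cycle closes
            u = black[v]
            seen.update((v, u))
            v = green[u]
    return len(black) // 2 - cycles   # two vertices per synteny block


P = [["+a", "+b", "-c"]]   # genome P of Fig. 1
Q = [["+a", "-b", "+c"]]   # genome Q of Fig. 1
print(two_break_distance(P, Q))
```

For the genomes of Fig. 1 the breakpoint graph is a single black-green cycle on six vertices, so the sketch reports $3 - 1 = 2$, matching the two reversals (of $b$, then of $c$) that turn $P$ into $Q$.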
Analyzing reversals, translocations, fusions, and fissions in linear genomes poses additional algorithmic challenges as compared to analyzing 2-breaks in circular genomes. However, rearrangement scenarios in linear genomes are well approximated by 2-break scenarios in circular genomes (Alekseyev, 2008). Hence, we use 2-breaks as a single substitute for reversals, translocations, fusions, and fissions, admitting that 2-breaks may violate linearity of the genomes by creating circular chromosomes.

While previous rearrangement studies (e.g., MGR) were limited to analyzing the pairwise breakpoint graphs, MGRA uses multiple breakpoint graphs (Caprara, 1999b), which simplify the rearrangement analysis. Let $P_1, \dots, P_k$ be genomes on the same set of synteny blocks $B$. Similarly to the pairwise breakpoint graph, the (multiple) breakpoint graph $G(P_1, \dots, P_k)$ is simply the superposition of genomes (graphs) $P_1, \dots, P_k$ on the same vertex set $V = \{b^t, b^h \mid b \in B\} \cup \{\infty\}$ (Fig. S20 and Fig. 3a,b). Fig. 4 shows the breakpoint graph on 1357 synteny blocks³ of six mammalian genomes: M (mouse), R (rat), D (dog), Q (macaque), H (human), and C (chimpanzee).

Figure 2: a) A 2-break on edges $(x_1, x_2)$ and $(y_1, y_2)$ from the same chromosome corresponds to either a reversal or a fission. b) A 2-break on edges $(x_1, x_2)$ and $(y_1, y_2)$ from different chromosomes corresponds to a translocation/fusion. c) A 2-break on edges $(y_1, y_2)$ and $(x_1, \infty)$ of a linear chromosome corresponds to a reversal affecting a chromosome end $x_1$ and creating a new chromosome end $y_1$. d) A 2-break on edges $(x_1, \infty)$ and $(y_1, \infty)$ from different chromosomes models a fusion. Fissions can be modeled as 2-breaks operating on an irregular loop edge $(\infty, \infty)$ and an arbitrary regular edge in the genome.

A vertex in the breakpoint graph is regular if it is different from $\infty$. Similarly, an edge is regular if both its endpoints are regular, and irregular otherwise. The edges of $G(P_1, \dots, P_k)$ are represented by undirected edges from the genomes $P_1, \dots, P_k$ of $k$ different colors (hence, the degree of each regular vertex is $k$). To simplify the notation, we will use $P_1, \dots, P_k$ also to refer to the colors of edges in the multiple breakpoint graph, and denote the set of all colors $C = \{P_1, \dots, P_k\}$. Furthermore, any non-empty subset of $C$ is called a multicolor. All edges connecting vertices $x$ and $y$ in the (multiple) breakpoint graph form the multi-edge $(x, y)$ of the multicolor represented by the colors of these edges (e.g., the multi-edge $(e^h, f^h)$ in Fig. 3b has multicolor $\{P_3, P_4\}$ shown as red and yellow edges). The number of multi-edges incident to a vertex (also equal to the number of adjacent vertices) is called the multidegree (note that the multidegree of a vertex may be smaller than its degree, e.g., the vertex $e^h$ in Fig. 3b has degree 4 and multidegree 3). Multi-edges correspond to adjacent synteny blocks that are conserved across multiple species and thus represent valuable phylogenetic characters (Sankoff and Blanchette, 1998).

A breakpoint in the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ is a vertex of multidegree greater than 1. A multiple breakpoint graph without breakpoints is an identity breakpoint graph $G(X, \dots, X)$ of some genome $X$. Alternatively, the identity breakpoint graph can be characterized as a breakpoint graph consisting of complete multi-edges (i.e., multi-edges of the multicolor $C$) that correspond to the synteny block adjacencies in $X$.

³ The detailed information about synteny blocks and assembly builds is provided in the Supplementary File. Out of 1360 synteny blocks (kindly provided by Jian Ma), three synteny blocks represent intermixed segments of the chromosome X and other chromosomes (the mouse chromosome 7 and the rat chromosomes 15 and 20). Since these blocks are short (16, 47, and 17 KB respectively), we have discarded them to simplify the chromosome X analysis below. For better illustration of the breakpoint graphs, the vertex $\infty$ is shown in multiple copies as black dots, each connected by a single multi-edge to regular vertices.

Figure 3: a) A phylogenetic tree $T$ with four linear genomes $P_1, P_2, P_3, P_4$ (represented as green, blue, red, and yellow graphs respectively) at the leaves: $P_1 = (+a\ -c\ -b)(+d\ +e\ +f)$, $P_2 = (+d\ +e\ +b\ +c)(+a\ +f)$, $P_3 = (+a\ -d)(-c\ -b\ +e\ -f)$, $P_4 = (+d\ -a\ -c\ -b\ +e\ -f)$. The obverse edges are not shown. b) The multiple breakpoint graph $G(P_1, P_2, P_3, P_4)$ is a superposition of graphs representing genomes $P_1, P_2, P_3, P_4$. The multidegrees of regular vertices vary from 1 (e.g., vertex $b^h$) to 3 (e.g., vertex $e^h$). c) The same phylogenetic tree $T$ with all intermediate genomes specified, $Q_1 = (+a\ -d\ -c\ -b\ +e\ +f)$, $Q_2 = (+a\ -d\ -c\ -b\ +e\ -f)$, $Q_3 = (+a\ +b\ +c)(+d\ +e\ +f)$, and a genome $X = (+a\ +b\ +c\ +d\ +e\ +f)$ selected as a root. A T-consistent transformation of $X$ into $P_1, P_2, P_3, P_4$ can be viewed as a transformation of the quadruple $(X, X, X, X)$ into the quadruple $(P_1, P_2, P_3, P_4)$ where a rearrangement at each step is applied to some copies of the same genome in the quadruple. A particular such transformation takes the following steps: $(X,X,X,X) \xrightarrow{r_1} (X,X,Q_1,Q_1) \xrightarrow{r_2} (Q_3,Q_3,Q_1,Q_1) \xrightarrow{r_3} (Q_3,Q_3,Q_2,Q_2) \xrightarrow{r_4} (Q_3,Q_3,P_3,Q_2) \xrightarrow{r_5} (Q_3,Q_3,P_3,P_4) \xrightarrow{r_6} (P_1,Q_3,P_3,P_4) \xrightarrow{r_7} (P_1,P_2,P_3,P_4)$, where $r_1$ is a reversal in two copies of $X$; $r_2$ is a fission in two copies of $X$; $r_3$ is a reversal in both copies of $Q_1$; $r_4$ is a fission in one copy of $Q_2$, $r_5$ is a reversal in the other copy of $Q_2$; $r_6$ is a reversal in one copy of $Q_3$, $r_7$ is a translocation in the other copy of $Q_3$.
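To make the notions of multi-edge, multicolor, and multidegree concrete, the following Python sketch builds the multiple breakpoint graph of the four linear genomes of Fig. 3 and inspects the vertices $e^h$ and $b^h$ mentioned in the text. It is a minimal sketch under stated assumptions: the text encoding of genomes, the string `"oo"` standing for the vertex $\infty$, and all function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

INF = "oo"  # stands for the extra vertex "infinity" of linear genomes


def ends(block):
    """Return (entry, exit) vertices of a signed block such as '+a' or '-b'."""
    name = block[1:]
    tail, head = name + "t", name + "h"
    return (tail, head) if block[0] == "+" else (head, tail)


def parse(text):
    """Parse '+a -c -b | +d +e +f' into a list of linear chromosomes."""
    return [chrom.split() for chrom in text.split("|")]


def linear_edges(genome):
    """Regular and irregular adjacency edges of a linear genome."""
    edges = []
    for chrom in genome:
        # INF, entry_1, exit_1, entry_2, exit_2, ..., exit_n, INF
        path = [INF] + [v for block in chrom for v in ends(block)] + [INF]
        # Adjacency edges join positions (0,1), (2,3), ... of this path.
        edges += [frozenset(path[i:i + 2]) for i in range(0, len(path), 2)]
    return edges


def multi_edges(genomes):
    """Map every pair of adjacent vertices to its multicolor (set of genome names)."""
    result = defaultdict(set)
    for name, genome in genomes.items():
        for edge in linear_edges(genome):
            result[edge].add(name)
    return result


def degree(me, v):
    return sum(len(colors) for edge, colors in me.items() if v in edge)


def multidegree(me, v):
    return sum(1 for edge in me if v in edge)


genomes = {
    "P1": parse("+a -c -b | +d +e +f"),
    "P2": parse("+d +e +b +c | +a +f"),
    "P3": parse("+a -d | -c -b +e -f"),
    "P4": parse("+d -a -c -b +e -f"),
}
me = multi_edges(genomes)
print(degree(me, "eh"), multidegree(me, "eh"))    # degree vs. multidegree of e^h
print(sorted(me[frozenset(("eh", "fh"))]))        # multicolor of multi-edge (e^h, f^h)
regular = {v for edge in me for v in edge if v != INF}
print(sorted(v for v in regular if multidegree(me, v) > 1))   # breakpoints
```

A dictionary keyed by unordered vertex pairs keeps each multi-edge as a single entry, so the multidegree is simply the number of keys containing the vertex, while the degree sums the sizes of their multicolors.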
2 Multiple Genome Rearrangement Problem

The key observation in studies of pairwise genome rearrangements is that every 2-break transformation of a "black" genome $P$ into a "green" genome $Q$ corresponds to a transformation of the breakpoint graph $G(P,Q)$ into the identity breakpoint graph $G(Q,Q)$ (Fig. S21) with 2-breaks on pairs of black edges (black 2-breaks). MGR (Bourque and Pevzner, 2002) implicitly applies a similar observation and attempts to come up with rearrangements that bring the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ closer to the identity multiple breakpoint graph $G(P_i, P_i, \dots, P_i)$ for $i$ varying from 1 to $k$. However, this approach does not allow one to utilize the internal edges of the phylogenetic tree for finding reliable rearrangements. Below we formalize the Multiple Genome Rearrangement Problem in terms of multiple breakpoint graphs. The key element of MGRA is finding a shortest transformation of the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ into an arbitrary identity multiple breakpoint graph $G(X, X, \dots, X)$ for some a priori unknown genome $X$. We first illustrate this concept with pairwise breakpoint graphs.

Let $G(P_1, P_2) \to G(X, X)$ be an $m$-step transformation of $G(P_1, P_2)$ into $G(X, X)$ by either black or green 2-breaks (in contrast to the standard breakpoint graph analysis based on black 2-breaks only).⁴ It is easy to see that every such transformation corresponds to a transformation $P_1 \to X \to P_2$ that uses $m$ black 2-breaks.

⁴ Switching from black rearrangements to a mixture of black and green rearrangements is a simple but powerful paradigm that proved to be useful in previous studies (Bafna and Pevzner, 1998; Tannier and Sagot, 2004).

Figure 4: The breakpoint graph $G(M, R, D, Q, H, C)$ (obverse edges are not shown) of six mammalian genomes: Mouse (red edges), Rat (blue edges), Dog (green edges), macaQue (violet edges), Human (orange edges), and Chimpanzee (yellow edges). The graph has $1357 \cdot 2 = 2714$ vertices whose colors represent 23 human chromosomes (chromosome colors: 1 to 22 and X).
Therefore, instead of searching for a shortest transformation $G(P_1, P_2) \to G(P_2, P_2)$, one can search for a shortest transformation of $G(P_1, P_2)$ into any identity breakpoint graph $G(X, X)$ without knowing $X$ in advance.

In the case of $k \ge 2$ genomes $P_1, P_2, \dots, P_k$, 2-breaks can be applied to multi-edges in the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ of as many as $2^k - 2$ different multicolors formed by proper subsets of $C$. However, not every series of such 2-breaks makes sense in terms of ancestral genome reconstructions. A basic property of ancestral genome reconstructions is that 2-breaks on multi-edges of multicolor $Q \subset C$ can be applied only when all genomes corresponding to colors in $Q$ are merged into a single genome. We give an alternative definition of this property as follows: a transformation (series of 2-breaks) $S$ of the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ is strict if for any 2-breaks $\rho_1, \rho_2 \in S$ operating on multi-edges of multicolors $Q_1 \subsetneq Q_2$, $\rho_1$ precedes $\rho_2$ in $S$. The Multiple Genome Rearrangement Problem is reformulated as follows:

Multiple Genome Rearrangement Problem (MGRP). Given genomes $P_1, \dots, P_k$, find a shortest strict series of 2-breaks that transforms the breakpoint graph $G(P_1, \dots, P_k)$ into an identity breakpoint graph.

Let $T$ be an (unrooted) phylogenetic tree of the genomes $P_1, \dots, P_k$ (Fig. 3a). The tree $T$ consists of $k$ leaf nodes (or simply leaves), $k - 2$ internal nodes, and $2k - 3$ branches connecting pairs of nodes, so that the degree of each leaf is 1 while the degree of each internal node is 3.

Removing a branch from $T$ breaks it into two subtrees, each of which is induced by the set of its own leaves. A multicolor consisting of all colors (leaves) of either of these induced subtrees is called T-consistent. Let G be the set of all T-consistent multicolors. Note that if a multicolor $Q$ is T-consistent then its complement $\overline{Q} = C \setminus Q$ is also T-consistent.
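The definitions of T-consistent multicolors and of strict transformations can be illustrated with a short Python sketch. The tree below is the six-genome tree with leaves M, R, D, Q, H, C; the internal node names (`"mr"`, `"mrd"`, `"qhc"`, `"hc"`), the edge-list encoding of the tree, and the function names are assumptions made for this example only. A transformation is represented here just by the sequence of multicolors its 2-breaks operate on.

```python
from collections import defaultdict


def leaves_on_side(adj, leaves, start, banned):
    """Leaves reachable from `start` without crossing the branch to `banned`."""
    seen, stack, found = {banned, start}, [start], set()
    while stack:
        node = stack.pop()
        if node in leaves:
            found.add(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(found)


def t_consistent_multicolors(branches, leaves):
    """All multicolors obtained by removing one branch of the unrooted tree."""
    adj = defaultdict(set)
    for u, v in branches:
        adj[u].add(v)
        adj[v].add(u)
    result = set()
    for u, v in branches:
        result.add(leaves_on_side(adj, leaves, u, v))
        result.add(leaves_on_side(adj, leaves, v, u))
    return result


def is_strict(multicolor_sequence):
    """True if no 2-break follows a 2-break on a proper superset multicolor."""
    seq = list(multicolor_sequence)
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            # seq[j] a proper subset of an earlier seq[i] violates strictness.
            if seq[j] != seq[i] and seq[j].issubset(seq[i]):
                return False
    return True


leaves = set("MRDQHC")
branches = [("M", "mr"), ("R", "mr"), ("mr", "mrd"), ("D", "mrd"),
            ("mrd", "qhc"), ("Q", "qhc"), ("qhc", "hc"), ("H", "hc"), ("C", "hc")]
consistent = t_consistent_multicolors(branches, leaves)
print(len(consistent))                                    # two multicolors per branch
print(frozenset("MR") in consistent, frozenset("MQ") in consistent)
print(is_strict([frozenset("M"), frozenset("MR"), frozenset("MRD")]))
print(is_strict([frozenset("MR"), frozenset("M")]))
```

With $k = 6$ leaves the tree has $2k - 3 = 9$ branches, so the sketch finds 18 T-consistent multicolors, and each of them appears together with its complement.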
Therefore, there is a one-to-one correspondence between the pairs of complementary $T$-consistent multicolors and the branches of $T$ (Fig. 5).

Figure 5: The phylogenetic tree $T$ of six mammalian genomes: Mouse (red), Rat (blue), Dog (green), macaQue (violet), Human (orange), and Chimpanzee (yellow) with a root $X$ on the MRD+QHC branch. The branches are directed towards $X$ and labeled with the corresponding pairs of complementary $T$-consistent multicolors. The $\vec{T}$-consistent multicolor from each pair also labels the starting node of the corresponding directed branch. Note that the tree orientation may not necessarily correlate with the time scale and the root genome $X$ may not necessarily be a common ancestor of the leaf genomes.

When a phylogenetic tree is given, MGRA addresses a restricted version of MGRP where 2-breaks are applied only to multicolors consistent with the phylogenetic tree.

Tree-Consistent Multiple Genome Rearrangement Problem (TCMGRP). Given genomes $P_1,\dots,P_k$ at the leaves of a phylogenetic tree $T$, find a shortest strict series of $T$-consistent 2-breaks transforming the breakpoint graph $G(P_1,\dots,P_k)$ into an identity breakpoint graph.

Note that the MGRP and TCMGRP problems in the case of three unichromosomal genomes correspond to the median problem, which is NP-complete (Caprara, 1999a; Tannier et al., 2008). While the existence of exact polynomial algorithms for solving MGRP and TCMGRP is unlikely, we describe a heuristic approach to "eliminating" breakpoints in $G(P_1,\dots,P_k)$ that uses reliable rearrangements. In particular, MGRA optimally solves these problems in the case of semi-independent rearrangement scenarios with some breakpoint re-uses (see below).

We will find it convenient to fix a branch $X$ of the tree $T$ and assume that this branch contains a root $X$ (viewed as yet another node), the precise location of which is to be determined later.
The choice of $X$ defines directions "towards" $X$ on all branches of the tree $T$ (Fig. 5). We label every leaf node $P_i$ of the directed tree $\vec{T}$ with the corresponding singleton multicolor $\{P_i\}$, and then recursively label each internal node with the union of the multicolors of the starting nodes of all incoming branches (e.g., in Fig. 5 the common endpoint of the branches coming from the leaf nodes M and R is labeled MR). The multicolors forming node labels of the tree $\vec{T}$ are called $\vec{T}$-consistent. Alternatively, $\vec{T}$-consistent multicolors can be defined as $T$-consistent multicolors whose induced subtrees do not contain $X$. Note that exactly one of the multicolors in each pair of complementary $T$-consistent multicolors is $\vec{T}$-consistent, and it labels the starting node of the corresponding directed branch in $\vec{T}$ (except for the multicolors corresponding to the branch $X$, both of which are $\vec{T}$-consistent).

MGRA transforms the genomes $P_1,\dots,P_k$ into $X$ along the directed branches of $\vec{T}$, using 2-breaks on $\vec{T}$-consistent multicolors ($\vec{T}$-consistent 2-breaks). In terms of breakpoint graphs, MGRA eliminates breakpoints in $G(P_1,P_2,\dots,P_k)$ with $\vec{T}$-consistent 2-breaks and transforms it into the identity breakpoint graph $G(X,\dots,X)$.⁵ This transformation defines a reverse transformation of the genome $X$ into the genomes $P_1,\dots,P_k$ by $\vec{T}$-consistent 2-breaks (such as in Fig. 3c). MGRA keeps track of the rearrangements applied to the breakpoint graph $G(P_1,\dots,P_k)$ during its transformation into an identity breakpoint graph $G(X,\dots,X)$. The recorded rearrangements (in the reverse order) define a reverse transformation that passes through every internal node of the tree $T$ and, thus, can be used to reconstruct the ancestral genomes at the internal nodes of $T$.
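The recursive labeling described above can be illustrated with a minimal Python sketch. It is not MGRA's implementation: the edge list encodes the tree of Fig. 5 as read from its branch labels, the internal node names "a" to "d" and the function names `directed_labels` and `t_consistent` are hypothetical, and the root $X$ is represented only by the branch it lies on.

```python
from collections import defaultdict

# Unrooted tree of Fig. 5. Internal node names "a".."d" are hypothetical labels.
EDGES = [("M", "a"), ("R", "a"), ("a", "b"), ("D", "b"), ("b", "c"),
         ("Q", "c"), ("c", "d"), ("H", "d"), ("C", "d")]


def directed_labels(edges, root_branch):
    """Label every node with its directed-tree multicolor.

    A leaf gets its singleton multicolor; an internal node gets the union of
    the labels of the nodes whose branches point into it (away from the root).
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    labels = {}

    def label(node, parent):
        # Children are the neighbours on the side facing away from the root X.
        children = [n for n in adj[node] if n != parent]
        if not children:
            labels[node] = frozenset([node])
        else:
            labels[node] = frozenset().union(*(label(c, node) for c in children))
        return labels[node]

    # The root X sits on root_branch, so both endpoints point towards it.
    u, v = root_branch
    label(u, v)
    label(v, u)
    return labels


def t_consistent(labels):
    """All T-consistent multicolors: the directed labels and their complements."""
    all_colors = frozenset().union(*labels.values())
    directed = set(labels.values())
    return directed | {all_colors - q for q in directed}


# Root X on the MRD+QHC branch, i.e. the branch between nodes "b" and "c".
labels = directed_labels(EDGES, root_branch=("b", "c"))
```

For $k = 6$ this yields $2k - 2 = 10$ directed labels (one per node) and $2(2k - 3) = 18$ $T$-consistent multicolors (two per branch). Moving the root to another branch changes the directed labels but leaves the set of $T$-consistent multicolors unchanged.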
While the initial steps in the transformation of the breakpoint graph $G(P_1,\dots,P_k)$ into an identity breakpoint graph usually correspond to reliable rearrangements, sooner or later one needs to employ less reliable heuristic arguments in order to complete the transformation. However, sometimes it is preferable to stop after reaching a certain level of reliability even if the transformation is not complete (and the TCMGRP problem is not solved). In this case we stop short of reconstructing the ancestral genomes since the transformation has not resulted in an identity breakpoint graph. In Supplement C we describe an alternative method (not requiring a solution of the TCMGRP problem) for reliable reconstruction of (parts of) ancestral genomes (similar to CARs from Ma et al. (2006)) at internal nodes of the phylogenetic tree.

RESULTS

3 MGRA Algorithm

Supplement A introduces the notions of independent (no breakpoint re-uses), semi-independent (breakpoint re-uses may occur only within single branches of the phylogenetic tree), and weakly independent rearrangements (breakpoint re-uses are limited to adjacent branches of the phylogenetic tree). MGRA optimally solves the MGRP problem in the case of semi-independent 2-breaks and uses heuristics to move beyond the semi-independent assumption. Below we show that most 2-breaks in mammalian evolution are either independent, semi-independent, or weakly independent, resulting in reliable ancestral reconstructions.

⁵ The use of $\vec{T}$-consistent 2-breaks here is motivated by an important property: every $\vec{T}$-consistent transformation can be turned into a strict $\vec{T}$-consistent transformation by changing the order of 2-breaks. Therefore, we do not directly address the strictness requirement in MGRA, which first produces a $\vec{T}$-consistent transformation of the genomes $P_1,P_2,\dots,P_k$ into the genome $X$ and then reorders it into a strict transformation.
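The strictness condition that footnote 5 refers to can be made concrete with a short Python sketch. It is a simplification under stated assumptions: each 2-break is represented only by the multicolor it operates on (the affected edges are not modeled), and the names `is_strict` and `reorder_by_size` are hypothetical. Sorting by multicolor size is one ordering that satisfies the precedence condition; that the reordered 2-breaks remain applicable rests on the property stated in the footnote and is not checked here.

```python
def is_strict(multicolors):
    """Check the strictness condition on a series of 2-breaks.

    `multicolors` lists, in order of application, the multicolor (a frozenset
    of genome names) of each 2-break. The series is strict if a 2-break on Q1
    always precedes a 2-break on Q2 whenever Q1 is a proper subset of Q2.
    """
    for i, earlier in enumerate(multicolors):
        for later in multicolors[i + 1:]:
            if later < earlier:  # a proper subset applied after its superset
                return False
    return True


def reorder_by_size(multicolors):
    """Stable sort by multicolor size.

    A proper subset is always smaller than its superset, so the result
    satisfies the precedence condition tested by `is_strict`.
    """
    return sorted(multicolors, key=len)


# Hypothetical series: the 2-break on MR is applied before the one on M.
series = [frozenset("MR"), frozenset("M"), frozenset("HC"), frozenset("H")]
reordered = reorder_by_size(series)
```

Here `is_strict(series)` is false because the 2-break on M follows the one on MR, while `is_strict(reordered)` is true.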