Ducheminetal.BMCGenomics2015,16(Suppl10):S9 http://www.biomedcentral.com/1471-2164/16/S10/S9 RESEARCH Open Access Yersinia pestis Reconstruction of an ancestral genome and comparison with an ancient sequence Wandrille Duchemin1, Vincent Daubin1, Eric Tannier1,2* From 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics Frankfurt, Germany. 4-7 October 2015 Abstract Background: We propose the computational reconstruction of a whole bacterial ancestral genome at the nucleotide scale, and its validation by a sequence of ancient DNA. This rare possibility is offered by an ancient sequence of the late middle ages plague agent. It has been hypothesized to be ancestral to extant Yersinia pestis strains based on the pattern of nucleotide substitutions. But the dynamics of indels, duplications, insertion sequences and rearrangements has impacted all genomes much more than the substitution process, which makes the ancestral reconstruction task challenging. Results: We use a set of gene families from 13 Yersinia species, construct reconciled phylogenies for all of them, and determine gene orders in ancestral species. Gene trees integrate information from the sequence, the species tree and gene order. We reconstruct ancestral sequences for ancestral genic and intergenic regions, providing nearly a complete genome sequence for the ancestor, containing a chromosome and three plasmids. Conclusion: The comparison of the ancestral and ancient sequences provides a unique opportunity to assess the quality of ancestral genome reconstruction methods. But the quality of the sequencing and assembly of the ancient sequence can also be questioned by this comparison. Background isolatedgenes,ancestralgenomereconstructionsarenow Extantspeciesarederivedfromaprocessofevolutionand robustatascalelargerthanthegene,forfragmentswhere diversificationfromspeciesnowdisappeared.Thesespe- no rearrangement have occurred [4]. Methods for infer- ciesarecalledancientingeneralandancestraliftheylefta ring ancestral gene ordershave alsobeen explored [5-8]. descendant. Ancestral genomic sequences can be esti- Together,thesemethodsopenthewaytothereconstruc- matedthroughcomputationfromasetofextantsequences tion of complete ancestral genomes, including their relatedbyaphylogenyandamodelofevolution[1],while sequences. ancientgenomicsequencesingeneralcanbesequenced Obtaining ancestral sequences can allow, through the fromtheremainsofdeadorganisms[2]. study of physical properties of the reconstructed mole- cules,theinferenceofthepaleoenvironnementsinwhich Ancestral genome reconstruction these molecules evolved [9]. These methods also allow Ancestralgenomereconstructioncanconsistinpredicting access to an oriented and ordered view of molecular agenecontentinancestralspecies[3],and foreachgene eventsalongthehistoryoflife.Moreover,theyofferabet- itssequence[1].Whileoriginallyusedtostudyproteinsor ter understanding of this history and can further our knowledge of the mechanismslinking organic sequences totheirfunctions[10]. *Correspondence:[email protected] Despite this, ancestral sequence reconstruction suffers 1LaboratoiredeBiométrieetBiologieÉvolutive,LBBE,UMRCNRS5558, from several limits. Along with the study of molecular UniversityofLyon1,43boulevarddu11novembre1918,69622 Villeurbanne,France evolution, it relies on the validity of models and their Fulllistofauthorinformationisavailableattheendofthearticle ©2015Ducheminetal.;ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionLicense (http://creativecommons.org/licenses/by/4.0),whichpermitsunrestricteduse,distribution,andreproductioninanymedium,provided theoriginalworkisproperlycited.TheCreativeCommonsPublicDomainDedicationwaiver(http://creativecommons.org/ publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle,unlessotherwisestated. Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page2of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 fundamental hypothesis. Furthermore, given that we are separatingthemmaketheirancestoragoodcandidatefor interested in a phenomenon often distant in time, it is an ancestral reconstructionincludingboth sequence and at best difficult to obtain proofs validating proposed pre- gene organization along the chromosome and the plas- dictions. Thus, the validation of ancestral reconstruction mids.Howeverdespitetheshortevolutionarytime,while methods is often limited to robustness tests, or simula- substitutions are quite rare [2], there is a very active tions that themselves rely on the validity of the models dynamicsofrearrangements,insertionsequencespropaga- of evolution [1]. tion, duplications, copy number variation (see Figure 2), whichmakestheproblemchallenging. Ancient genome sequencing The late-medieval ancient genome, likely close to that AncientDNAsequencesisanotherwaytohaveanaccess ancestor, offers a validation opportunity for the ancestral tothepasthistoryoflivingorganisms.Undercertaincon- reconstruction method. We achieve here this recon- ditionsitispossibletoobtaingeneticmaterialthroughthe struction and perform the comparison. sequencingoftheremainsofan organism.AncientDNA Notethatasequenceofthesamegenomewasproposed sequencingbeganinthemiddleofthe80swiththeclon- recentlybyRajaramanetal.[30],butwasnotissuedfrom ingandsequencingoffragmentsofmitochondrialDNAin ancestralreconstruction.The contigs of theancientgen- a museum specimen of Equus quagga, an extinct equine ome werescaffolded witha methodincluding the phylo- species that disappeared in the XIXth century [11]. The genyofrelatives,andsomepartsoftheassemblycouldbe adventofPCRmethods[12]andhigh-throughputsequen- corrected,butwhatwepresenthereisnotusingatallthe cing [13] followed by what is called third generation ancient sequence in the reconstruction phase, it is done sequencing[14]allowedthesequencingofseveralextinct onlyfromindependentextantdata. animals [15-17], ancient unicellular eukaryotes [18,19], bacteria[2,20,21],metagenome[22],orvirome[23]. Methods The ancient sequences disclose a new source of infor- An overview of the method, including species tree con- mation concerning the evolution of lineages of interest. struction, gene tree construction and reconciliation, They have already been used, among other things, to gene order inference and gene tree corrections accord- understand the dynamic of extant populations of the ing to this gene order, and eventually genic and inter- genus Homo [24-26], or other animals [27], to correct genic sequence prediction, is illustrated on Figure 3. and recalibrate phylogenies [17], or to better understand past pandemics [18,19,2,20,21]. Data set However, along with the problems specific to sequen- The data consists in 13 Yersinia annotated genomes cing technologies, ancient DNA sequencing islimited by (Figure 1) from which we extract 3772 homologouspro- thepost-mortemchemicaldegradationofDNAmolecules teingenefamiliescontainingatleasttwogenes,usingthe throughouttime.Thus,likefossils,ancientsequencesare HOGENOM database [31]. Of these, 1971 have exactly scarcewhile,unlikethem,limitedtorecenttimes. onecopyperextantstrain.Thisstepcorrespondstopart AinFigure3. Yersinia pestis ClassifiedamongEnterobacteriaceae,Yersiniapestisisthe Species tree bacteriumthoughttoberesponsibleforthebubonicpla- Using Muscle [32] (default parameters), we aligned the gue and the pneumonic plague. It diverged from the 1971families,concatenatedthevariablesitesofallalign- Yersinia pseudotuberculosis lineage, in part through the mentsandobtainedaphylogenetictreeusingPhyML[33] acquisitionoftwoplasmids[28].Ithasbeendemonstrated (100 bootstraps, otherwise default parameters) that we that strains of Yersinia pestis caused the black death of rootedbyseparatingthepestisfromthepseudotuberculosis 1347-1353 AD that is thought to have killed between a clades, according to a consensus in the literature. In our thirdandhalfoftheEuropeanpopulationatthattimeand tree the branch separating the two clades is well sup- persistedinEuropeuntilthemiddleoftheXVIIIthcentury ported,aswell as the branchessurroundingthe ancestor [29]. An ancient genome has been extracted and thatwewishtoreconstruct(seeFigure1).Thisstepcorre- sequenced[2].Itwasthefirstwholeancientbacterialgen- spondstopartBinFigure3. ome.Basedonasubstitutionpatterncomparedtoextant Yersiniaspecies,ithasbeenhypothesizedtotakeplaceon Gene trees the extant species phylogeny in the vicinity of a known AllgenefamiliessequenceswerethenalignedusingPrank speciation node leading to two set of extant, sequenced [34] and one gene tree per family was computed using andannotatedstrainsofthebacterium(seeFigure1). PhyML (100 bootstraps, otherwise default parameters). Theexistenceofseveralsequencedandannotatedextant Because we are aligning recently diverged strains of the genomes as well as the relatively short evolutionary time same organisms [35], the sequences often have not Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page3of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 Figure1Yersiniapestisandpseudotuberculosisphylogeny.Treeobtainedusinga1971universalgenefamiliesconcatenate.Bootstrap valuesarefiguredonthebranches.Forreadability,thefiguredbranchlengthistheinverseoftheten-logarithmoftherealbranch-length.The ancestralspeciesofinteresttousisfiguredasareddiamond.Thelatemedievalancientgenomehypotheticalpositionisfiguredingrayand dashed. Figure2DotplotbetweenthesequenceoftwoextantstrainsofYersiniapestis:CO92andKM10.Bothstrainsaredescendantsofthe ancestorwefocuson.DatawasobtainedbyaligningthesequenceofstrainKIM10onthesequenceofstrainCO92usingmegablast(default parameters,onlyhitswithalength>102.5werekept). Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page4of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 Figure3ProtocolusedtoobtaintheancestralgeneorderandsequenceofaYersiniapestisancestor.A)Extractionandfilteringofgene familiesfromextantgenomesandalignment.B)Reconstructionofthespeciestreeusingaconcatenateofthevariantpositionsof1971universal genefamilies.C)MLreconstructionofgenetreesfollowedbythecollapseofanynon-supportedbranch(bootstrap<99)andtheresolutionof thecreatedpolytomiesusingthespeciestreeasaguide.D)InferenceofancestralgeneadjacenciesusingDeCo.E)Detectionandcorrectionof wronglyinferredgenetreesbasedontheancestraladjacencygraphlinearity.F)Reconstructionoftheancestralsequenceofeachgene adjacencyfromtheirextantdescendants.G)Alignmentoftheconsecutiveancestraladjacencysequencestoassembletheancestralgenome. Similarcolorsindicateshomology.Dotsrepresentageneasanodeinanadjacencygraphwhileorientedsegmentsrepresentageneasa sequence. Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page5of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 diverged enough to allow an unambiguous tree recon- betweenA′andB.Soextantgenomesaresetsofdisjoints struction. So we collapsed all branches with a support cyclesinagraph,modelingchromosomesandplasmids. lower than 99 and then used ProfileNJ [36] to solve the Gene extremities can be clustered into families, inher- created polytomies. ProfileNJ reconstructs species tree ited from gene families, and also inherit the reconciled branches instead of collapsed branches and chooses genetree. amongseveralsolutionswithaNeighbor-Joiningformula. Distances for the Neighbor-Joining part were computed Ancestral gene order with bppdist, a Bio++ suite software [37] (GTR + Γ(4) Ancestral adjacencies between gene extremities were model). inferred using DeCo [7]. It models the evolution of an ProfileNJ also roots the gene trees according to “Last adjacencybetweentwogeneextremitiesfollowingaparsi- Common Ancestor” reconciliation method, annotating mony principle, i.e. minimizingthe numberofgainsand internal nodes with duplications or speciations, and breakages ofadjacencies,duetorearrangements. It takes choosing a root minimizing the number of duplications. asinputthespeciestree,allgenetrees,andextantadjacen- Reconciled gene trees depict the history of the gene cies,and proposesa set ofancestral adjacencies between family, including all ancestral genes, uniquely defined by ancestralgeneextremitiesdefinedbythereconciledgene the reconciliation. trees.ThisstepcorrespondstopartDinFigure3. This step corresponds to part C in Figure 3. DeCo assumes that adjacencies evolve independently. This means in particular that ancestral gene extremities Gene families filtering can be involved in an arbitrary number of adjacencies. From the 3772 gene families, some were discarded Ancestralgeneextremitiesandadjacenciesarenotneces- because they showed signal of a process that we do not sarily made of cycles as extant genomes, so we call this handle well in our pipeline, gene transfer. Transfer was object an adjacency graph. Figure 4 shows the obtained suspected when a branch in the reconciled gene tree adjacencygraphatthisstep.Whilemostofitshowsalin- would correspond to at least 4 independent losses in the ear orcircularstructure,there aresomegeneextremities species tree. We also removed the families with more withtoomanyadjacencies,otherswithnotenough. than 5 genes in the black death ancestor, suspecting Therecanbeseveralreasonsfortheadjacencygraphnot insertion sequences, which are poorly handled by the tobeacollectionofpathsandcycles,aswewouldexpect method. We also removed families containing genes ifthedataandmethodswereperfect.Incorrectgenetrees fully included in other genes: as we model the evolution of gene orders, these would be difficult to handle. We eventually removed families when the reconciled gene tree did not contain a gene in the ancestor we want to reconstruct. The final data set contained 3656 families. Note that when removing gene families from the study, we do not necessarily give up the reconstruction of parts of the ancestral sequence. We just define the removed parts as intergenic. As we also reconstruct intergenic sequences, this simply modifies the resolution at which we are able to detect rearrangements. Extant gene order and adjacencies Eachgeneisasegmentofachromosomeoraplasmidand has a start and an end position on it. We identify these positions as the extremities of the gene. A start position may be greater than an end position: the order of the extremitiesdefinestheorientationofthegene.Wemodel each genome by a graph, whose nodesare gene extremi- ties of genes in that genome. We put an edge, called an Figure4 Ancestral adjacencygraph obtainedusing DeCo on adjacency between pairs of extremities of a same gene. thesetof3656genefamilies.Eachnodeiscoloredaccordingto AdditionallyifgenesAA′andBB′areconsecutive(Aand itsnumberofneighbors:greenfortwo(ideal,linearcase),turquoise A′ are the extremities of the first gene, appearing in that forone(whereoneadjacencyhasbeenlost),orangeforthreeand order on the chromosome or plasmid, and B, B′ are the grayforfour(whenanerrorinthenumberofancestralcopies createsconflictintheancestralgeneorder). extremities of the second gene), we put an adjacency Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page6of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 areprobablythemajorsourceofsuchdiscrepancies,while turnitintoaspeciationnodebygivingittwodescendant othersmaycomefrom uncertaintiesinadjacency history nodes h and h , and assigning its descendant leaves to 1 2 inference. eitheroneofthem,dependingonwhethertheyaregenes Wetransformtheadjacencygraphintoagenome(i.e.an from descendants of s1or s . Then subtrees rootedath 2 1 adjacency graph that is a collection of paths and cycles), andh arereconstructedusingProfileNJ. 2 firstbycorrectinggenetrees,byoperationswecallzipping Figure 5B gives an example of a zipping operation on andunzipping,then byremovingaminimumnumber of the ancestral adjacency graph and on the gene tree. adjacenciessothattheremaininggraphisagenome. Zippingproducesanewancestralgeneh insteadoftwo d paralogueshandh′.Wepropagatethesameoperationto Correcting gene trees the neighbors of the ancestral gene h in the adjacency d This step corresponds to part E in Figure 3 and a more graphiftheyarethemselvessupernumeraryparalogues. detailed picture is on Figure 5. Note that for zipping and unzipping, the propagation Unzipping mechanism allows the treatment of several consecutive Each ancestral gene extremity of a gene g should have at nodes,suchthatalargesegmentalduplicationcontaining most two adjacencies. If one has more than two, a first multiple genes can bedealtwith as long asthere existsa hypothesis can be that in the real ancestral genome, the nodetostarttheunzippingmove(e.g.atoneextremityof gene g was duplicated in two copies, and each copy thesegmentalduplication). would carry some of the adjacencies of g. Cutting If in one extant species, there are two homologous Zipping and unzipping are tested independently for each copies of the gene g, and their extremities share the ancestral node with more than two neighbors. Each of homologs of the adjacencies attributed to an extremity them should decrease the number of gene extremities of g, then we perform the unzipping operation. with more than two adjacencies. The operation that It consists in making two genes out of g by modifying decreases it the most is kept. the gene tree T of the gene family containing g. Only If none of zipping and unzipping succeeds in removing the subtree rooted at g is changed, into a subtree rooted all such supernumerary adjacencies (it is possible that at a new duplication node with two descendants: g and none of the hypotheses applies), then we remove as few a new gene g′. Then the two subtrees rooted at g and g′ adjacencies as possible so that only gene extremities with are reconstructed, first by assigning all leaves to g or g′ at most two adjacencies remain. This is achieved using a according to their neighborhood; Then by constructing maximum matching technique described in [39]. subtrees on these leaves using ProfileNJ. In the case where some leaves can’t be assigned to either g or g′ Ancestral sequence reconstruction using their neighborhood (i.e. their extant neighbors are Ancestral sequences have to be reconstructed by pieces, not descendant of any of the ancestral neighbors), then because they need a multiple alignment free of rearran- leaves are assigned to one of the two set of leaves gements. The pieces have to be glued together, and in according to their mean phylogenetic distances with order to avoid between pieces border problems, pieces them. Where there is a tie (for instance if all sequences have to overlap. This is why we reconstruct an ancestral are identical, all distances are null), the leaf is randomly sequence for all pairs of genes which are connected by assigned to one of the two leaf-set. an adjacency. Then pairs are aligned together on their Figure 5A gives an example of an unzipping operation common gene, and merged. on the ancestral adjacency graph and on the gene tree. Weorienteachadjacentgenepairwithafirstandasec- If the unzipping procedure increases the number of ond gene, each gene should be once the first gene of a adjacenciesincidenttoageneextremityofagenehinthe pair, and once the second in another pair. We use the immediateneighborhoodofgintheadjacencygraph,then genetreeofthefirstgeneasaguide,toconstructamulti- theunzippingprocedureisappliedtohaswell,andthen ple sequence alignment with the extant sequences that toitsneighbors,untiltheregionislinearized. contain this adjacent pair (thus, the sequences contains Zipping bothgenesandthesequencebetweenthemwhentheyare Another possible reason for a gene g to be involved in neighborsin an extant species, andonly the first gene of morethantwoadjacenciesisthattwooftheseadjacencies the adjacency when they aren’t), and the ancestral ghandgh′concerntwoparalogshandh′whichinreality sequenceusingPrank[34]. should form only one gene. In that case we perform a Genesequencesattheendsofcontigsarereconstructed zippingoperation,similartotheonedescribedin[38]. alone using their own tree. In consequence each inter- Leth bethelastcommonancestor ofhandh′intheir gene sequence is reconstructed once and each gene d gene tree. Suppose itisassigned to speciess, whose des- sequenceisreconstructedtwiceandatleastoncewithits cendants are s and s . It is a duplication node, and we own tree.We assembletheobtainedancestral sequences 1 2 Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page7of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 Figure5Illustrationoftheunzippingandzippingongenetreesandadjacencygraphs.A)Priortolinearization(leftoftheblackarrow),the genegexistsinonecopyintheancestor(verticalgraylineonthetree)andtwoindependentduplicationsoccursinitsdescendants(greenhollow squares).Intheancestraladjacencygraphaboveeachofgextremitiesdisplaystwoneighbors.Unzipping(rightoftheblackarrow)modifiesthetree sothattherearetwoancestralcopiesgandg′eachcorrespondingtoadifferentpathintheancestraladjacencygraph(lossesinthetreeare displayedasredcrosses).B)Priortolinearization(leftoftheblackarrow),twoancestralcopiesofthesamegenehandh′existintheancestor(vertical graylineonthetree;lossesinthetreearedisplayedasredcrosses).Intheancestraladjacencygraphabovetheextremitiesofhandh′eachsharea neighbor,forminganon-linearpattern.Zipping(rightoftheblackarrow)modifiesthetreesothatthereonlyoneancestralcopyh followedby d independentduplicationsinitsdescendants(greenhollowedsquares),formingonelinearpathintheancestraladjacencygraph. Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page8of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 by aligning (using Smith & Waterman’s algorithm) the genome and two extant ones, while a comparison ones sharing a gene and then making the consensus between the two extant ones is shown on Figure 6A. The sequenceofthatalignment,favoringthesequencerecon- isolated dots on the dotplots of Figure 6B and C are structedwiththetreeofthealignedgene. probably reconstruction errors. While they could be For instance, consider the ancestral path ABC (where explained as small rearrangements, they probably are arti- A,B and C are genes), we reconstruct the ancestral facts of the adjacency graph linearization method, like a sequence of A using its own tree, AB using A’s tree, BC leaf falsely associated to a subtree in an unzipping event using B’s tree and C using its own tree. Afterward the for instance. ancestral sequence of A is aligned with the ancestral TheancestralsequencesoftheplasmidspCD,pMTand sequence AB, favoring the sequence of A when comput- pPCP were entirely reconstructed, for a total of respec- ing the consensus. Then the sequence AB is aligned tively100.1kb,67.7kband9.6kb.Concerningtheances- with the sequence BC, favoring the sequence BC in the tralchromosome,atotalof4.7Mbofancestral sequence consensus (as both sequences align on gene B and BC wasreconstructed,whichisclosetothesizeoftheextant used B’s tree for the reconstruction). Finally, the chromosomesofYersiniapestisstrains(e.g.4.7Mbforthe sequence ABC is aligned with the sequence C, favoring strainAntiqua).Alackofsignalinextantgenomesdueto C in the consensus. convergentrearrangements,preventedthereconstruction A graphical view of these steps are given in Figure 3, offourancestral adjacencies.Becauseofthese,theances- parts F and G. tral chromosome sequence is actually composed of four Note that, as stated before, the ancestral sequence disjoint fragments (their sizes are respectively 3.44 Mb, reconstructionneedsamultiplealignmentfreeofrearran- 0.67Mb,0.40Mband0.19Mb). gements. This means that the size of the recombination The reconstructed ancestral sequences are avalaible in events that can be taken into account for ancestral Additional file 1. sequences reconstruction depends on the density of the markers(here,thegeneextremitiesof3656genefamilies) Comparison to the ancient genome usedintheancestralorderreconstructionstep. UsingMegablast[41]wealignedthe2134ancientYersinia pestiscontigsobtainedbyBosetal.[2](avalaibleathttp:// Results paleogenomics.irmacs.sfu.ca/FPSAC/,lastaccessed19june The shape of the ancestral genome 2015) against the obtained ancestral genome, including We perform the whole process of ancestral gene order chromosomeandplasmids. reconstructionforthreedatasets:thewholesetoffiltered We examine 2179 hits of length >102.5bp from 2087 families,thesetofDfreefamilies,withoutduplicationand contigs(seeAdditionalfile2forthebimodaldistribution theDLfreefamilies,withoutduplicationnorloss. of hit lengths which justifies this threshold). The others Ancestral gene order is computed with the whole set, arefullofrepeatedelements,makingthecomparisondiffi- butitgivesfragmentedpathsintheadjacencygraph.The cult.Asaconsequencetheexaminedhitsallmatchtothe fragments are progressively assembled using the D free chromosomeandnonetotheplasmids. andDLfreegeneorders. Gene order The ancestral gene order was reconstructed for the These hits show a quasi-total congruence between the chromosome (3342 genes) and the three plasmids (pCD: organization of the ancient and ancestral sequence. 74 genes, pMT: 87 genes, pPCP: 5 genes). The plasmids Figure 7representsthecorrespondence between thetwo pCD and pPCP were obtained as circular elements in intheformofadotplot,wherecontigsoftheancientgen- the adjacency graph, while the plasmid pMT was repre- omeareconcatenatedaccordingtotheancestralsequence. sented by one linear fragment. The chromosome was Three isolated dotsdeviate fromthe central line. Two of obtainedas three linearcomponents. Tojointhesecom- them concern large repeated regions, that is, the whole ponents,weranDeCoontheirsixextremitiesusingagra- contigs match at several places. Only one seems to be a dientofadjacencygain/losscostsratio(from1/10to10/1) realdiscordancebetweenthetwogenomes.Twocontigu- and scored each potential adjacency by the number of ous regions of the contig hit on two different ancestral timesitwasobserved.Wethenappliedaweightedmaxi- sequence fragments. This chimeric contig (number 8335 mummatchingtechnique[40]toextractthebestpossible in[2])hadalreadybeenobservedbyRajaramanetal.[30] order between the fragments (only one optimal solution intheirscaffoldingoftheancient genome.Thisstretches remained). the proximity and the differences between the two The ancestral gene order is different from all extant approaches. Indeed, the latter, called FPSAC, takes as genomes. For example it is an intermediary between the input the ancient contigs and the extant genomes, frag- two extant strains CO92 and KIM10. Figure 6 B and C mentsthecontigsaccordingtotheiralignmentstoextant show the gene order comparison between the ancestral genomes, and orders fragments. Here we don’t use at all Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page9of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 Figure 6 Dotplot between the ancestral genome and two extant strains of Yersinia pestis: CO92 and KIM10. Both strains are descendantsoftheancestorwefocuson.DatawasobtainedusingtheextantadjacencygraphsofstrainsKIM10andCO92andconcernsgenes order.Verticalandhorizontallinesseparatethedifferentmolecules(herethechromosomeandtheplasmids).A)dotplotbetweenthegene ordersofthetwoextantstrainsKIM10andCO92.B)dotplotbetweenthegeneordersoftheancestralgenomeandtheextantstrainCO92.C) dotplotbetweenthegeneordersoftheancestralgenomeandtheextantstrainKIM10. theancientcontigsandstartfromextantgenes.Soweare case there is no comparison point to infer some independentoftheextractionandassemblymethodology bases, and some are inferred differently than in the fortheancientsequence,andwecancomparetoit.More- ancient sequence. over,alloursequencesarecomputationallyreconstructed, (cid:129) Lack of a good model of evolution at an intermedi- whichwasnotthecaseofthoseobtainedwithFPSAC. ary scale, like duplication of small elements. They So at a large scale, there is only one difference which are here included in alignments and indel models, can be an assembly error in the ancient sequence or a which do not account for repetitions. derived mutation of the ancient bacteria, because the (cid:129) Assembly errors in the ancient sequence. ancient configuration is not supported by extant genomes. Consider for example the ancient contig number 497 Sequences where a mismatch occurs when aligned with the ances- At a finer scale, differences are more numerous. tral sequence. The mismatch is situated in an intergenic Approximately 81% of the 2084 contigs with a hit are region of the ancestral genome that is present in one exact matches to the ancestral genome. We examined descendant of the reconstructed ancestor and two out- some of the remaining and found that the differences group Yersinia pestis species. Consequently, the ances- could be explained by three kinds of error sources in tral sequence was reconstructed using a tree where the the ancestral or ancient sequences: node of interest was along a branch, missing a compari- son point (i.e. another descendant) to choose between (cid:129) Lack of sufficient data for ancestral reconstruction: its descendant allele and the outgroup allele. it is the case if only one of the two children which Consider also the ancient contig number 8849 which branches off the ancestor, in addition to an out- aligns with one mismatch to the reconstructed ancestor. group, support the presence of a sequence. In that At the position of the mismatch, all extant (group and Ducheminetal.BMCGenomics2015,16(Suppl10):S9 Page10of13 http://www.biomedcentral.com/1471-2164/16/S10/S9 chimeric contig but this discrepancy is independent). Around position 1860, that ancient contig displays one occurrence of a 20-mer. However, the reconstructed ancestral sequence has two consecutive occurrences of that 20-mer. This region is situated in an intergenic region, so it has been reconstructed by an alignment of an adjacency with its two flanking genes. The extant species (descendant of the reconstructed ancestor or not) which have this gene adjacency all display two occurrences (in favor of the ancestral reconstruction) at the exception of Yersinia pestis strain CO92, the Yersi- nia pestis reference genome which was used to map the ancient reads in [2]. While the fact that we did not use the raw reads obtained in [2] prevents us to draw any definitive conclusion, this appears to be an error in the ancient sequence assembly, caused by a derived muta- tion in the genome used as a reference. Conversely, it happens that similar patterns are better explained by errors in the reconstructed ancestral sequence. Such a case occurs on the locus where the ancient contig number 5613 maps. The situation is also Figure 7 Dotplot between the late medieval Yersinia pestis similartoFigure8A.Twocontiguousregionshitatadis- genomesandthereconstructedancestralsequence.The reconstructedchromosomewasalignedtothe2134ancient tanceof1315bponthereconstructedancestralsequence. contigs,usingmegablast(defaultparameters,onlyhitswitha The sequence separating the two hits in the ancestor is length>102.5werekept).Contigsareconcatenatedaccordingto only supported by one extant descendant (Nepal strain) thereconstructedsequence,sotheagreementispartlyduetothe andtheotherextantdescendantsmatchtheancientcontig fragmentednatureoftheancientsequence.Thecontigswithhits inonlyonelonghit.Thisseemstobeanerrorduetothe departingfromthediagonalarecircledinred. absenceofanevolutionarymodelallowingbiginsertions. Prankmodelsindelsbut1315bpisnotreallyanindelbut outgroup species) sequences bear the same allele and isratheraninsertionofwhatshouldperhapshavebeenan thus the reconstructed ancestral sequence bears it too. evolutionary unit. It seems that the indel model prefers However, the ancient contig bears another allele at that losingseveraltimessuchalongDNAsegmentratherthan position. If we consider the ancient contig as correct, insertingitonceinaterminalbranchofthephylogeny.So then this difference would be an original mutation on we can expect a small number of such false additions in the ancient strain. Such an hypothesis could be checked theancestralsequence. by mapping the ancient reads to their contigs in order to assess the validity of that specific allele. However, we Discussion note that the original study [2] that used read data to A complete reconstruction of an ancestral genome at call SNPs did not detect any that were specific to the the nucleotide level requires to take into account evolu- ancient strain. tionary events at several scales: nucleotide substitutions, There are also differences that are more structural in indels, duplications, losses, recombinations, transfers, kind. For example 43 contigs show some structural dif- transposable elements propagation, rearrangements. ferences with the ancestral genome. On 39 of them, the Each level is handled by dedicated bioinformatics tools ancient contig displays two contiguous or slightly over- which are rarely used together. lapping hits that are more distant on the ancestral gen- We associated here gene content/sequence/order tools ome (on 21 occasions, they are more than 300 bp apart inordertoattemptthereconstructionofawholeancestral in the ancestral sequence), as in Figure 8A. On 4 ancient bacterialgenome,includingachromosomeandthreeplas- contigs, contiguous regions are shown as overlapping in mids. We chose an organism from the Yersinia pestis the ancestral genomes, as in Figure 8B. clade because of a recently published ancient sequence. Such discrepancies can sometimes be explained by Despite being relatively recent at the evolutionary scale errors in the ancient sequence, especially in regions (650years),theevolutionatalllevels,andinparticularin where repetitions occur. For instance, the case illu- genome structure and organization, makes the problem strated on Figure 8A, is seen on the contig number difficult. The difficulty can come from numerous events 8335 obtained by Bos et al.[2] (which is also the (rearrangements, insertion sequence dynamics), but also
Description: