
Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the ... PDF



Wolf et al. Biology Direct 2012, 7:46
http://www.biology-direct.com/content/7/1/46

RESEARCH Open Access

Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer

Yuri I Wolf*, Kira S Makarova, Natalya Yutin and Eugene V Koonin

Abstract

Background: Collections of Clusters of Orthologous Genes (COGs) provide indispensable tools for comparative genomic analysis, evolutionary reconstruction and functional annotation of new genomes. Initially, COGs were made for all complete genomes of cellular life forms that were available at the time. However, with the accumulation of thousands of complete genomes, construction of a comprehensive COG set has become extremely computationally demanding and prone to error propagation, necessitating the switch to taxon-specific COG collections. Previously, we reported the collection of COGs for 41 genomes of Archaea (arCOGs). Here we present a major update of the arCOGs and describe evolutionary reconstructions to reveal general trends in the evolution of Archaea.

Results: The updated version of the arCOG database incorporates 91% of the pangenome of 120 archaea (251,032 protein-coding genes altogether) into 10,335 arCOGs. Using this new set of arCOGs, we performed maximum likelihood reconstruction of the genome content of archaeal ancestral forms and gene gain and loss events in archaeal evolution. This reconstruction shows that the last Common Ancestor of the extant Archaea was an organism of greater complexity than most of the extant archaea, probably with over 2,500 protein-coding genes. The subsequent evolution of almost all archaeal lineages was apparently dominated by gene loss resulting in genome streamlining. Overall, in the evolution of Archaea as well as a representative set of bacteria that was similarly analyzed for comparison, gene losses are estimated to outnumber gene gains at least 4 to 1.
Analysis of specific patterns of gene gain in Archaea shows that, although some groups, in particular Halobacteria, acquire substantially more genes than others, on the whole, gene exchange between major groups of Archaea appears to be largely random, with no major ‘highways’ of horizontal gene transfer.

Conclusions: The updated collection of arCOGs is expected to become a key resource for comparative genomics, evolutionary reconstruction and functional annotation of new archaeal genomes. Given that, in spite of the major increase in the number of genomes, the conserved core of archaeal genes appears to be stabilizing, the major evolutionary trends revealed here have a chance to stand the test of time.

Reviewers: This article was reviewed by (for complete reviews see the Reviewers’ Reports section): Dr. PLG, Prof. PF, Dr. PL (nominated by Prof. JPG).

Keywords: Archaea, Orthologs, Horizontal gene transfer

*Correspondence: [email protected]
National Center for Biotechnology Information, NLM, National Institutes of Health, Bethesda, MD 20894, USA

© Wolf et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

A genome-wide evolutionary classification of genes is essential for the entire enterprise of genomics including both functional annotation and evolutionary reconstruction. The construction of such a classification for a large set of diverse genomes is never an easy task due to the complexity of evolutionary relationships between genes to which gene duplication, gene loss and horizontal gene transfer (HGT) all make major contributions. The interplay of all these evolutionary processes makes accurate delineation of orthologous and paralogous relationships between genes extremely complicated [1-3]. Accurate identification of orthologs and paralogs is central to functional characterization of genomes because orthologs typically occupy the same functional niche in different organisms whereas paralogs undergo functional diversification after duplication via the processes of neofunctionalization and subfunctionalization [3-5]. Clear differentiation between orthologs and paralogs is equally important for the reconstruction of evolutionary scenarios [6-9].

In principle, orthologous and paralogous relationships between genes have to be disentangled by means of comprehensive phylogenetic analysis of entire families of homologous genes in the compared genomes [2,10-13]. However, for the case of numerous, diverse genomes, such comprehensive phylogenomic analysis remains both an extremely labor-intensive and an error-prone process. Accordingly, several methods have been developed that aim at the identification of sets of likely orthologs without performing comprehensive phylogenetic analysis; benchmark comparisons indicate that some of these methods perform as well as, if not, in some cases, better than phylogenomic approaches [1,14-16]. Generally, these non-phylogenomic approaches in orthology inference are based on partitioning graphs of genome-specific best hits for all genes (typically, compared in the form of protein sequences) from the analyzed set of genomes. The key underlying assumption of this approach is that the sequences of orthologous genes are more similar to each other than to the sequences of any other genes from the compared genomes.

The best hit graph approach, supplemented by additional procedures for detecting co-orthologous gene sets and for treating genes encoding multidomain proteins, was first implemented in the Clusters of Orthologous Groups (COGs) of proteins [17]; the acronym COG has been subsequently reinterpreted to simply denote Clusters of Orthologous Genes [3]. The original COG set of 1997 included only 7 complete genomes, all that were available at the time [17]. The latest comprehensive COG collection released in 2003 incorporated ~70% of the protein-coding genes from 69 genomes of prokaryotes and unicellular eukaryotes [18]. The COGs have been extensively used for functional annotation of new genomes (e.g., [19,20]), comparative analysis of gene neighborhoods [21-23] and other connections between genes, as implemented in the popular STRING tool [24]; target selection in structural genomics (e.g., [25]); and various genome-wide evolutionary analyses [6,8]. Subsequently, the COGs have been employed as the seed for the EggNOG database that was constructed using improved algorithms for graph-based automatic construction of orthologous gene clusters [26,27].

The methods for the construction of COGs and other, similar clusters of putative orthologous genes cannot guarantee correct identification of the orthologous and paralogous relationships between genes due to the aforementioned complexity of the evolutionary processes. The original COG analysis of small numbers of genomes involved the final step of manual curation that was important for detecting and resolving problems that were not adequately addressed by the automatic procedure. This step ceased to be feasible with the rapid increase in the number of sequenced genomes whereas the computational cost of the analysis has steeply increased. Therefore, along with the development of improved, lower complexity algorithms for identification of orthologous gene clusters [1,15,16], several smaller scale projects have been conducted in which COGs were constructed, annotated and analyzed in detail for compact groups of bacteria such as the Thermus-Deinococcus group [28], Cyanobacteria [29], and Lactobacillales [19]. Along these lines, we have delineated the set of COGs for 41 genomes of archaea [30]; this data set that we denoted arCOGs has become an important tool for archaeal genome analysis [31-34].

Here we present a major update of the arCOGs that includes 120 archaeal genomes and use it for evolutionary reconstructions that seem to provide insights into the major trends of archaeal evolution.

Results and discussion

Update of archaeal COG database

The updated arCOG database includes protein sequences from 120 completely sequenced genomes. Altogether, 251,032 protein-coding genes (91% of the total gene complement) were assigned to 10,335 clusters. The coverage of individual genomes by arCOGs ranged from 99% (strains of Sulfolobus islandicus and Methanococcus maripaludis with abundant close relatives in the set) to 73% (Nanoarchaeum equitans, the sole sequenced representative of the phylum Nanoarchaeota). In the current set of archaeal genomes, 129 arCOGs are strictly ubiquitous and 32 more arCOGs are ubiquitous to the exclusion of N. equitans; in the original version of the arCOGs, the corresponding numbers were 166 and 50 [30]. With the addition of new genomes, the size of the strictly universal gene set inevitably decreases due to lineage-specific gene losses and possibly also annotation errors [35,36]. In reality, however, the sets of conserved archaeal genes could be stabilizing. Indeed, analysis of the commonality distribution [37,38] for the new arCOG collection gives estimates for the size of the “core” (highly conserved) and the “shell” (moderately conserved) components of the archaeal pangenome that are almost unchanged since 2007 (current estimates of ~220 and ~2,200 vs. ~230 and ~2,200, respectively, for the 2007 arCOG set). By contrast, addition of the new genomes substantially increased (from ~5,200 to ~7,400) the repertoire of rare archaeal genes that belong to the variable “cloud” (Figure 1).

Figure 1 A commonality plot for Archaeal protein-coding genes. Diamonds show the number of arCOGs that include the given number of distinct genomes. Dashed red lines, decomposition of the data into three exponents (“cloud”, “shell” and “core” [37,38]); solid red line: the sum of the three components. (Horizontal axis: no. of genomes; vertical axis: no. of arCOGs, logarithmic scale.)

One of the immediate applications of clusters of orthologous genes is phylogenomic reconstruction (i.e. identification of patterns of lineage-specific gain and loss of genes) for the respective group of organisms, in the case of arCOGs, the Archaea. For this reconstruction, a ‘species tree’ is required as a template. We employed the arCOGs of ribosomal proteins to construct a maximum likelihood tree from a concatenated alignment of ribosomal proteins (Figure 2, Additional file 1). Generally, the tree agrees well with the archaeal taxonomy and with the recently published results of phylogenetic analysis including the monophyly of the ‘TACK superphylum’, a large assemblage of archaeal phyla that includes Thaumarchaeota, Aigarchaeota (with the single current representative, Candidatus Caldiarchaeum subterraneum), Crenarchaeota and Korarchaeota [39,40].

Figure 2 Phylogeny of universal Archaeal ribosomal proteins. The approximate Maximum Likelihood tree was reconstructed using FastTree [51,52].

Phyletic patterns of arCOGs

The original 1997 study of the COGs [17] included the first analysis of phyletic patterns, i.e. patterns of presence-absence of genes from a given COG in the genomes of the analyzed organisms. Subsequently, phyletic patterns proved useful in describing the evolutionary history of lineages and functional relationships between genes [41-46]. The current set of 10,335 arCOGs includes 6,736 phyletic patterns of which 5,998 (89%) are unique. Similar to our 2007 observations, we found that the most common patterns represent genes that are conserved in well-defined archaeal clades, such as all 120 Archaea, 4 Thaumarchaeota or 3 species of the Methanosarcina genus (Table 1 and Figure 3).

Table 1 Phyletic patterns in arCOGs
Pattern frequency    Pattern description                  No. of species
163                  all Thaumarchaeota                   4
159                  all Methanosarcina                   3
129                  all Archaea                          120
116                  all Pyrobaculum + Thermoproteus      7
114                  all Halobacterium                    2
100                  all Halobacteriales                  16

Analysis of the current arCOG set reveals the fundamental limitation of phyletic patterns as the basis for phylogenomic analysis. The early, small genomic data sets were readily amenable to direct pattern counting (e.g. in the original 1997 COGs that included 7 species, about 1/3 of the possible patterns with 3 or more species were found in the actual data); by contrast, the current set of 120 archaeal species allows for ~10^36 possible patterns of which only a tiny fraction (~1/10^32) are actually observed. An overwhelming majority of the observed patterns are unique and even the most frequent patterns represent at most 1.6% of the arCOGs. Even under unrealistically restrictive models of gene content evolution that prohibit horizontal gene transfer, random loss of non-essential genes alone results in an exponential decrease in the number of non-unique phyletic patterns with the number of genomes in the data set.

The rapidly increasing proportion of unique phyletic patterns calls for a more coarse-grained comparison whereby non-identical but similar patterns are treated as members of the same group. However, standard clustering and ordination techniques perform poorly on phyletic pattern data because of the difficulty of objectively assessing the similarity between the observed phyletic patterns of (ar)COGs. A proper similarity measure must take into account relationships that go beyond simple distances between 120-dimensional binary vectors primarily because different dimensions are not independent or correlated in a simple manner, but are connected by a complex network of phylogenetic and environmental relationships between the corresponding organisms. As a step toward a more biologically meaningful analysis of phyletic patterns, we compared evolutionary scenarios that are implied by these patterns.

Gains, losses and ancestral states in arCOGs

We reconstructed the posterior probabilities of gene presence in ancestral nodes and the probabilities of the associated gene gain and loss events using the Count method of Csűrös and Miklós [9]. The reconstruction was based on binary patterns (i.e. ignoring paralogs), and the topology of the ribosomal protein tree (Figure 2) was used as the guide. The results of the present reconstruction (Figure 4) that was based on 10,335 arCOGs represented in 120 species of archaea generally agree with the earlier observations made with the original arCOG collection using maximum parsimony [30] as well as the outcome of the ML reconstruction reported by Csűrös and Miklós [9]. The genome of the last archaeal common ancestor (LACA) is inferred to have contained genes from at least ~1725 arCOGs of which 970 could be identified with high confidence (posterior probability >0.9). Taking into account the characteristic level of paralogy and the number of genes in the transient “cloud” (Figure 1), the genome size of LACA can be estimated at around 2600 genes, which puts this ancestral form on the high end of genomic complexity by the standards of the extant archaeal genomes (Additional file 2). The history of most archaeal lineages appears to have involved gradual, moderate genome growth or genomic stasis in the deep branches followed by extensive gene loss (genome streamlining) during the diversification of recent family-level groups (Figure 4).

Gene gains that are associated with the ancestor of a clade conceivably reflect the innovations that led to the diversification and define the biological characteristics that differentiate the given clade from other groups. Among the Archaea, the maximum number of gene gains was detected at LACA (Table 2 and Figure 3); these gains necessarily include both genes inherited from the last universal common ancestor of cellular life and genes acquired during the stem phases of archaeal evolution. Among the internal branches of the archaeal tree, the branch leading to the common ancestor of Halobacteriales is associated with >1600 gene gains, bringing the inferred size of the ancestral halobacterial genome to >3000 genes. Other branches characterized by high rates of gene gain include Sulfolobales, Thermococcales, Methanomicrobia and others. Of special interest is the gain of 431 genes assigned to the common ancestor of the proposed TACK superphylum. Although the inference of gene gain depends on tree topology and therefore cannot be construed as direct evidence of the monophyly of any group, such a high number of gains indicates a strong signal of shared gene content among the archaeal phyla that constitute the TACK superphylum.

Figure 3 Phyletic patterns and inferred gene gain patterns in Archaea. The most frequent phyletic patterns are shown to the right of the tree as blocks of genomes where the gene is present. The inferred number of gene gains is indicated for the tree branches. The most frequent gain patterns are shown as green dots associated with tree branches. The pattern within the Methanosarcinales clade refers to the Methanosarcina genus. The fuzzy pattern to the left of the tree root refers to the “zero-gain” inference without a confident assignment to any particular clade.
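The counting argument made in the section on phyletic patterns of arCOGs is easy to check numerically. The following Python sketch uses only figures quoted in the text and in Table 1 (120 genomes, 10,335 arCOGs, 6,736 observed patterns, 5,998 unique patterns, 163 arCOGs sharing the most frequent pattern). The variable and function names are illustrative and are not part of the arCOG distribution; this is a back-of-the-envelope check, not a reproduction of the authors' analysis.

```python
from math import comb, log10

# Figures quoted in the text and in Table 1.
N_GENOMES = 120
N_ARCOGS = 10_335
N_OBSERVED_PATTERNS = 6_736
N_UNIQUE_PATTERNS = 5_998
TOP_PATTERN_ARCOGS = 163  # the "all Thaumarchaeota" pattern

# Each genome either has or lacks a member of a given arCOG, so the space
# of binary presence-absence patterns contains 2**N_GENOMES elements.
possible_patterns = 2 ** N_GENOMES
log10_possible = log10(possible_patterns)

# Share of the pattern space that is actually observed.
observed_fraction = N_OBSERVED_PATTERNS / possible_patterns

# Share of observed patterns that occur in exactly one arCOG.
unique_share = N_UNIQUE_PATTERNS / N_OBSERVED_PATTERNS

# Share of all arCOGs covered by the single most frequent pattern.
top_pattern_share = TOP_PATTERN_ARCOGS / N_ARCOGS


def patterns_with_min_species(n_species: int, min_present: int) -> int:
    """Count binary patterns with at least `min_present` species present."""
    return sum(comb(n_species, k) for k in range(min_present, n_species + 1))


if __name__ == "__main__":
    print(f"log10(possible patterns) = {log10_possible:.1f}")
    print(f"observed fraction        = {observed_fraction:.1e}")
    print(f"unique share             = {unique_share:.1%}")
    print(f"top pattern share        = {top_pattern_share:.1%}")
    print(f"7 species, >=3 present   = {patterns_with_min_species(7, 3)}")
```

For the 7-genome COG set of 1997, `patterns_with_min_species(7, 3)` gives the size of the pattern space against which the "about 1/3" estimate in the text is measured.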
Comparison of the rates of gene gain and loss in Archaea and Bacteria

For comparison with Archaea, we analyzed the phyletic patterns of 50 bacterial species in the 2003 COG data set [18]. The evolutionary scenarios were reconstructed using the Count software with the same parameters as used for Archaea and the ribosomal protein phylogeny [39] as the guide tree topology (Table 3). Despite the difference in sampling density and breadth (the COG set of 2003 covers a relatively small fraction of the bacterial diversity), the estimated numbers of gene losses per COG and the gain-loss ratio are very similar for the two prokaryotic domains of life. Strikingly, Archaea and Bacteria display essentially the same, four-fold excess of losses over gains. For Bacteria, the reconstruction yielded 20-25% more secondary gene gains (in addition to one default gain in the respective last common ancestor) per COG (0.87 in bacteria vs. 0.71 in archaea) and a higher fraction of COGs with multiple (estimated total number >1.5) gene gains (47% in bacteria vs. 39% in archaea). Thus, although the overall modes of genome-scale evolution are similar in both prokaryotic domains, intra-domain gene exchange seems to have played a greater role in the evolution of bacteria. This finding does not appear surprising given the typical greater complexity of bacterial compared to archaeal communities.

Patterns of gene gain in Archaea

Phyletic patterns form in the course of evolution by gene gain (largely via HGT), vertical transmission through speciation, and gene loss. Of these processes, gene gain seems to be of greatest interest. As shown above, gene loss is much more frequent, appears to be largely random in terms of which clades are affected and generally appears to occur at approximately constant rate over long spans of evolution (a form of molecular clock) [6].
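The derived rows of Table 3, on which the archaeal-bacterial comparison rests, follow directly from the raw totals of families, gains and losses. The Python sketch below recomputes them. It assumes that the "acquisitions/family" row equals gains per family minus the one default gain at the origin of each family; this is an interpretation of the text that the reported values happen to match, not something the table states. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconstructionTotals:
    """Raw totals from Table 3 for one domain (hypothetical container)."""

    families: int
    gains: int
    losses: int

    @property
    def gains_per_family(self) -> float:
        return self.gains / self.families

    @property
    def secondary_gains_per_family(self) -> float:
        # Assumption: one gain per family is the default origin of the family.
        return self.gains_per_family - 1.0

    @property
    def losses_per_family(self) -> float:
        return self.losses / self.families

    @property
    def loss_gain_ratio(self) -> float:
        return self.losses / self.gains


# Raw counts as reported in Table 3.
ARCHAEA = ReconstructionTotals(families=10_335, gains=17_680, losses=74_690)
BACTERIA = ReconstructionTotals(families=4_149, gains=7_769, losses=32_355)

# Relative excess of secondary gains per family in bacteria over archaea.
secondary_gain_excess = (
    BACTERIA.secondary_gains_per_family / ARCHAEA.secondary_gains_per_family - 1.0
)

if __name__ == "__main__":
    for name, totals in (("Archaea", ARCHAEA), ("Bacteria", BACTERIA)):
        print(
            f"{name}: gains/family={totals.gains_per_family:.2f}, "
            f"losses/family={totals.losses_per_family:.2f}, "
            f"loss/gain={totals.loss_gain_ratio:.2f}"
        )
    print(f"secondary gain excess in bacteria: {secondary_gain_excess:.1%}")
```

Under the stated assumption, the recomputed excess of secondary gains falls inside the 20-25% range quoted in the text, and both loss/gain ratios round to the values in the table.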
Thus, we focus here on the patterns of gene gain, ignoring possible subsequent gene loss.

Figure 4 Inferred ancestral genomes in Archaea. The square boxes at the bases of clades indicate the number of families in the inferred ancestral genomes; the rectangles at the tips of clades indicate the number of families in the extant genomes within the clade. Square boxes at the tips indicate single genomes; rectangular boxes indicate multiple genomes. (Legend bins, number of families: <1000, 1000-1400, 1400-1800, 1800-2200, 2200-2600, 2600-3000, >3000.)

Table 2 Inferred gene gains in Archaea
No. of gains    Clade
1,725           LACA
1,642           Halobacteriales
737             Sulfolobales
705             Thermococcales
704             deep Euryarchaeota (Eury- without Thermococcales)
494             Thaumarchaeota
439             Methanomicrobia
431             TACK superphylum
420             Methanosarcina
416             Thermoplasmatales

Table 3 Comparative analysis of gene gains and losses in Archaea and Bacteria
                        arCOGs     bac. COGs (2003)
families                10,335     4,149
species                 120        50
gains                   17,680     7,769
gains/family            1.71       1.87
acquisitions/family     0.71       0.87
losses                  74,690     32,355
losses/family           7.23       7.80
loss/gain ratio         4.22       4.16
single-gain (<1.5)      61%        53%

Unlike parsimony-based methods that reconstruct binary evolutionary scenarios, Count produces profiles of posterior probabilities of events. To convert these into patterns, we use the simple definition of a “likely event”: if a gene gain probability is >0.5 for a particular branch, we mark it as a likely gain. Thus, instead of 120-dimensional (given 120 analyzed genomes) binary presence-absence patterns, we obtain 238-dimensional (the tree has a total of 238 branches, including the branch leading to the root) binary gain patterns.

Remarkably, despite the near doubling of the dimensionality, the likely gain patterns provide considerably more coarse-grained material for comparisons. The 10,335 arCOGs form 1,878 gain patterns, of which 77% are unique (compared to 6,736 presence-absence patterns with 89% of them being unique). The most common (apart from the obvious overall winner, the single gain in LACA) gain pattern (Figure 3 and Table 4), a single gain in the Halobacteriales ancestor, is inferred for 1,189 arCOGs. With one exception, all frequent (>300) gain patterns involve a single gain. The only exception is the “zero-gain” pattern that was assigned to 901 arCOGs. These genes are scattered among archaeal clades in such a way that COUNT could not confidently assign a gain to any particular branch (e.g. arCOG11374 is found in Methanobacterium sp. AL-21 and in Methanosphaera stadtmanae DSM 3091 that are neither sister nor very distant species; informally, it appears likely that this gene has been transferred from bacteria to some Methanobacteriaceae ancestor).

Table 4 Inferred gene gain patterns in Archaea
Pattern frequency    Pattern description              No. of species
1,189                Halobacteriales                  16
1,147                LACA                             120
901                  scattered (no confident gains)   N/A
433                  Sulfolobales                     13
341                  Thermococcales                   12
321                  Methanosarcina                   3

The frequency distribution of the number of gains per arCOG, plotted either in the discrete form for likely gains (Figure 5a) or in the continuous form for the sums of posterior probabilities (Figure 5b), indicates that the single-gain patterns represent 61-67% of the data and the number of arCOGs with multiple acquisitions declines exponentially. The excellent fit of the number of gene gains to the exponential decay function implies that horizontal transfer of genes of each arCOG occurs in a random fashion.

Figure 5 Distribution of the gain patterns by the number of gains. a. Number of confidently predicted gains in a pattern (p>0.5). b. The sum of posterior gain probabilities in a pattern (kernel-smoothed probability density). Dotted line, the best-fitting exponent. (Horizontal axes: number of gains; vertical axes: frequency in panel a, log p.d.f. in panel b.)

Multiple gene gains in archaea

The inferred multiple gene gains in the same arCOG could provide insights into the history of intra-domain gene exchange. Altogether there are 2,495 arCOGs with ≥2 likely (posterior probability >0.5) gains (excluding those with a gene gain in LACA). These families are implicated in a total of 6,190 gene acquisitions.

Over 20% of the multiple gains involve the ancestral branch of the Halobacteriales (Table 5) which is the absolute leader among Archaea with respect to involvement in gene exchange as either the donor or the acceptor. Other prominent gene exchange participants are Thermococcales, Thermoplasmatales and Sulfolobales. Generally, the number of multi-gain events on archaeal tree branches strongly correlates with the overall number of gains (rS=0.83, p<0.0001 for the entire tree; rS=0.93, p<0.0001 for the internal branches; 36% of the internal branch gains involve multiple branches) (Additional file 3). This observation, again, is compatible with a largely random gene exchange. Nevertheless, some branches appear to significantly deviate from the expected behavior: for example, 94 of the 109 arCOGs likely acquired by the common ancestor of Thaumarchaeota and Aigarchaeota were also acquired elsewhere, which is more than twice the expected number; conversely, on the Pyrobaculum-Thermoproteus clade, the number of multiple-event gains was almost half the expected number (54 out of 265).

Table 5 Inferred gene gains in Archaea
No. of gains    Clade
565             Halobacteriales
296             Thermococcales
247             Thermoplasmatales
245             Sulfolobales
204             Thaumarchaeota
199             deep Euryarchaeota
166             Archaeoglobales
123             TACK superphylum
104             Methanobrevibacter ruminantium
98              Methanosarcina

Routes of gene exchange in Archaea

To explore the preferred routes of intra-archaeal gene exchange, we focus on the 1,267 arCOGs with 2 predicted gains not involving LACA. This pattern likely indicates a single gene exchange within archaea (or alternatively, two independent acquisitions by different archaeal clades from outside of the domain; however, this is a distinctly less parsimonious solution). The most frequently occurring 2-gain pattern (Table 6, Figure 6, Additional file 4) is the Thermoplasmatales-Sulfolobales pair, which was found in 16 arCOGs; an unexpectedly large number of shared genes between Sulfolobus and Thermoplasma, suggestive of preferential HGT, has been reported previously [47]. Not surprisingly, many pairs of clades frequently exchanging genes involve Halobacteriales, the archaeal group that seems to be generally most prone to HGT (see above).

Table 6 Inferred gene gain patterns involving 2 Archaeal clades
Pattern frequency    Pattern description
16                   Thermoplasmatales–Sulfolobales
14                   deep Euryarchaeota–Thaumarchaeota
13                   Halobacteriales–Thaumarchaeota
13                   deep Euryarchaeota–Methanomicrobiales
12                   Thermoplasmata–Thermococcales
12                   Halobacteriales–Methanocella
12                   Halobacteriales–Thermoplasmatales
12                   Halobacteriales–Methanohalobium evestigatum
11                   Halobacteriales–Archaeoglobus fulgidus
11                   Halobacteriales–Archaeoglobales
10                   Halobacteriales–Thaum- + Aigarchaeota

Figure 6 The byways of horizontal gene transfer among Archaea. Lines connect the clades that form the most frequent phyletic patterns with two inferred gains (different colors are used for visual differentiation only). One of the two clades is the likely origin of the respective arCOG and the other is the likely acceptor of the HGT.

If the paths of gene exchange are random, the number of exchanges between a pair of clades is expected to be proportional to the product of the numbers of gains on these clades. We found that these variables were indeed correlated, relatively weakly but significantly (rS=0.39, p<0.0001 for 196 pairs occurring more than once). However, the most frequently observed exchanges between clade pairs are an order of magnitude more frequent than expected by chance, an overwhelmingly unlikely fluctuation for the majority of these patterns (Additional file 4). Taken together, these findings indicate that, although HGT between the lineages was not completely random, no major “highways” [48] of intra-domain gene exchange existed in the history of archaea. At best, there seem to exist weakly preferred “byways” of HGT.

Conclusions

The updated version of the arCOG collection incorporates 91% of the pangenome of 120 archaea into 10,335 arCOGs. This new set of arCOGs is expected to become a key resource for comparative genomics, evolutionary reconstruction and functional annotation of archaeal genomes that undoubtedly will be appearing at an increasing pace. Notably, between this new arCOG collection and the original, 2007 version, the conserved gene core of Archaea has not substantially shrunk, suggesting that the present composition of this core could be close to definitive. We describe here some results of the ongoing work on reconstruction of the genome content of archaeal ancestral forms and gene gain and loss events. This reconstruction clearly indicates that the last common ancestor of the extant Archaea was a complex organism, most likely with over 2,500 genes, and that the principal trend in subsequent evolution of almost all archaeal lineages was gene loss leading to genome streamlining. Overall, in the evolution of Archaea as well as a representative set of bacteria that we analyzed for comparison, gene losses are estimated to outnumber gene gains at least four to one. We further investigated the specific patterns of gene gain in Archaea and found that, although some archaeal groups, in particular Halobacteria, are more prone to gene acquisition than others, on the whole, gene exchange within Archaea appears to be largely random, so that there are no major ‘highways’ of gene transfer.

The arCOG data is available at <ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/arCOG/>.

Methods

Construction of archaeal COGs

Protein sets for 120 completely sequenced genomes of Archaea were downloaded from the NCBI FTP site. New members of the previously established arCOGs were identified using the PSI-BLAST search with core arCOG alignments as the PSSMs. New arCOGs were constructed largely as previously described [30]. Briefly, the procedure involved the following steps:

- Initial clusters based on triangles of symmetrical best hits were constructed using a modified COG algorithm [1] on the results of all-against-all BLAST [49] search.
- Multiple alignments of the initial cluster members constructed using the MUSCLE program [50] were used as PSSMs for a PSI-BLAST search [49] against the database of Archaea proteins; significantly similar proteins (domains) were added to the corresponding original clusters.
- The clusters with approximately complementary phyletic patterns and high inter-cluster sequence similarity were merged.

Phylogenetic analysis

The maximum likelihood phylogenetic tree of the concatenated alignment of ribosomal proteins universal for archaea was constructed using the FastTree program [51,52] as previously described [39].
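The first step of the arCOG construction procedure listed above, clustering by triangles of symmetrical best hits, can be illustrated on toy data. The Python sketch below is not the authors' modified COG algorithm: it assumes the classic scheme in which triangles drawn from three different genomes are merged whenever they share a side, and it takes precomputed best hits as input rather than running BLAST. Gene identifiers of the form `genome|gene`, all function names and the toy data are hypothetical.

```python
from itertools import combinations


def genome_of(gene: str) -> str:
    """Return the genome part of a 'genome|gene' identifier."""
    return gene.split("|", 1)[0]


def symmetric_edges(best_hit: dict) -> set:
    """Keep gene pairs that are each other's best hit in the partner genome.

    `best_hit` maps (query_gene, target_genome) to the best-hit gene.
    """
    edges = set()
    for (gene, _target_genome), hit in best_hit.items():
        if best_hit.get((hit, genome_of(gene))) == gene:
            edges.add(frozenset((gene, hit)))
    return edges


def triangles(edges: set) -> set:
    """Find triples of mutually connected genes from three distinct genomes."""
    neighbors = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    found = set()
    for edge in edges:
        a, b = tuple(edge)
        for c in neighbors[a] & neighbors[b]:
            if len({genome_of(a), genome_of(b), genome_of(c)}) == 3:
                found.add(frozenset((a, b, c)))
    return found


def merge_triangles(tris: set) -> list:
    """Merge triangles that share a side (two genes) into clusters."""
    parent = {t: t for t in tris}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    by_side = {}
    for t in tris:
        for side in combinations(sorted(t), 2):
            by_side.setdefault(side, []).append(t)
    for group in by_side.values():
        for other in group[1:]:
            parent[find(other)] = find(group[0])

    clusters = {}
    for t in tris:
        clusters.setdefault(find(t), set()).update(t)
    return sorted(sorted(c) for c in clusters.values())


def toy_best_hits() -> dict:
    """Toy input: one four-genome family plus one isolated reciprocal pair."""
    hits = {}
    family = ["A|x", "B|x", "C|x", "D|x"]
    for query in family:
        for target in family:
            if query != target:
                hits[(query, genome_of(target))] = target
    hits[("A|y", "B")] = "B|y"
    hits[("B|y", "A")] = "A|y"
    return hits


clusters = merge_triangles(triangles(symmetric_edges(toy_best_hits())))
```

In this toy input the four `x` genes form four triangles that collapse into a single cluster, whereas the reciprocal pair `A|y`/`B|y` stays unclustered because it never closes a triangle. In the procedure described above, proteins missed at this stage can still be recruited by the subsequent PSI-BLAST step.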
Reconstruction of gene gain and loss during the evolu- tion of Archaea was performed using the program Authors’response:Wecannotandshouldnotaddressissuesofformal COUNT [53]. Maximum likelihood inference for gene taxonomyinthisarticle.Whatreallymattersisthecladisticpattern;which cladesaregiventhestatusof“phylum”andwhicharesubsumedwithinother presence, gain and loss at all branches of the phylogen- phylamattersonlyoperationally. etictreewasobtainedfor2-categoryevolutionarymodel. -Theauthorsdetectahigherintra-domaintransferinbacteriathanin archaea,anddonotfindit“surprisinggiventhetypicalgreatercomplexityof Additional files bacterialcomparedtoarchaealcommunities”.However,itisnotself-evident thatgreatercommunitycomplexitynecessarilycorrelateswithmore Additionalfile1:Thearchaealribosomephylogeny.The horizontalgenetransfer.Mostlikely,thereisanimportantphylogenetic phylogenetictreereconstructedforconcatenatedalignmentsof effect,withsomelineagesbeingmorepronethanothersevenwithin ribosomalproteinsusingFastTreeprogram,Newickformat. bacteria.Theobservationofahigherintra-domaintransferinbacteriacould bebiasedbythetaxonomicsamplingandalsothemorelimitedsampling Additionalfile2:TheinferredLACAgenomecontent.Posterior sizeforbacteria.Canthesepossibilitiesbediscarded? probabilitiesforthepresenceofanarCOGinLACAgenome. Additionalfile3:Gainsandmultiplegainsonthearchaealtree Authors’response:Thepossibilitythattheresultsofanycomparativegenomic branches.ThesumsofposteriorgainprobabilitiesforallarCOGsand analysisareaffectedbytaxonomicsamplingbiaseffectivelycanneverbe arCOGsinvolvedinmultiplegaineventsforthearchaealtreebranches discardedgiventhattheexistingsamplesrepresentatinysliceoftheentire (nodenumberscorrespondtolabelsintheAdditionalfile1). 
existingbiodiversity.Nevertheless,someofthesamplingbiaseffectscanbe Additionalfile4:High-frequencytwo-gainpatterns.Theobserved mitigatedbyusingthetree-basedprobabilisticinferenceofancestralstatesand andexpectednumbersofthetwo-gainpatterns(nodenumbers evolutionaryevents(e.g.thepotentialeffectofthesamplingbiasintroducedby correspondtolabelsintheAdditionalfile1;p-valuecomputedusing inclusionof7strainsofSulfolobusislandicusinourdatasetisreducedbythe expectationunderthePoissondistribution). factthattheirgenecomplementsarenotcountedindependentlybutarelargely sharedbythecommonbranchofS.islandicusandS.solfataricus).Regardless, wetrustthatthiscommentandresponsewillremindthereaderofinevitable Competinginterests incompletenessofcomparativegenomicanalyses. Theauthorsdeclarethattheyhavenocompetinginterests. -ThenumberofinferredancestralgenesforLACA,butalsoforthelast Authors’contributions commonbacterialancestorisrelativelyhighandthenumberofgenelosses KSMandYIWcollectedthedata;KSM,NY,EVKandYIWanalyzedthedata;YIW outnumbersgenegainsbyatleastafactorof4.Giventheimportanceof andEVKwrotethemanuscriptthatwasreadandapprovedbyallauthors. horizontalgenetransferinprokaryoticevolutionandthefactthatmany genesevolvebyduplicationandneofunctionalization,theideaofagene- Reviewers’reports richancestralgenomeevolvingbyimportantgenelossseemsabitodd.Is thisrealordoesthisresultfromourincapacitytoidentifygenetransfer Reviewer1:Dr.PurificacionLopez-Garcia,CentreNationaldela eventsand/orfailuretoidentifysometypeofgenecoalescentprocessesin RechercheScientifique,France thepast?Whatarethepotentialbiasesofthisinferredvalue? 
Thismanuscriptpresentsanupdateofthearchaealclusteroforthologous Authors’response:Theideaofadeepancestorthatwasmorecomplexthan genes(arCOGs)basedon120archaealgenomesequenceswhich,although anaveragemodernmemberofthedescendantcladeprobablyis“abitodd”(or stilllimitedfacetothewidearchaealdiversity,isalargeimprovement evenmorethanabit)tomanybiologists.Wesubmit,however,thatthisoddity comparedtotheinitialsetofgenomesusedtodefinethefirstarCOGs.The isprimarilyavestigeofthetraditionalbeliefintheincreaseofcomplexitywith arCOGsbecomethereforemuchmorecomprehensiveand,onthissole evolutionarytime,anotionthatmightbeintuitivebutisnotactuallyborneby basis,theworkisalreadyimportantanduseful.Inaddition,theauthorstake anysubstantialevidence[54].Giventheimportanceofgenomestreamlining, advantageofthisanalysistomakesomeinferencesaboutarchaealgenome especiallyintheevolutionofprokaryotes,thereisnoreasontoaprioridismiss evolution.Theauthorsestimateinthiswaythenumberofgenesatancestral thepossibilitythatgenelossonaverageoutweighsgenegaincausedbyHGTas nodes(includingthelastcommonarchaealancestor,LACA)andthe wellasduplicationfollowedbyneo/subfunctionalization(thelatterprocess estimatednumberofgenegainsatdifferentnodesonaguidephylogenetic beingrelativelylessimportantinprokaryotes).Theinferenceofthecomplex treebasedonhighlyconservedribosomalproteins.Thenumberofgenes ancestoris“real”inthesensethatthisisthemaximumlikelihoodestimate sharedbydistantarchaealgroupsisusedtoevaluatethelevelofintra- basedonthebestpossibleestimationoftheparametersoftheevolutionary domaingenetransferandcomparedwiththatobservedina(smaller)setof process(genegainandlossrates)fromtheavailabledataonphyleticpatterns. bacterialgenomes.Overall,Ifindthearticleinterestinganduseful. 
Thus,HGTistakenintoaccountimplicitlyhere,underthegenegainrate.What Ihaveafewcommentsthattheauthorsmightwanttoaddress: ismeantby“sometypeofgenecoalescentprocessesinthepast”isnotquite cleartous.Admittedly,thecurrentinferencemethodologyissomewhatcrude, -WolfetalreconstructanarchaealphylogenybasedonthearCOGsof beingbasedonphyleticpatternsandnottakingintoaccountthephylogenies ribosomalproteinsasguideforsubsequentinferencesofgenomeevolution. ofindividualgenes.Suchacomprehensiveanalysisisbeyondthecapabilitiesof Accordingtotheauthorsthetreeagreeswithknowntaxonomicschemes thecurrentlyaccessiblemaximumlikelihoodmodels.However,giventhe andpointstothemonophylyofthe“TACK”supergroup(Thaumarchaeota– apparent(approximate)randomnessoftheHGTprocesses,itappearshighly Aigarchaeota–Crenarchaeota-Korarchaeota).Thisisdifficulttosayfroman unlikelythatifandwhensuchanalysisbecomesfeasible,itchangesthe unrootedtree,becausetherootmightliesomewhereelseinthetree,e.g. conclusionsqualitatively.Withregardtothepossiblebiases,thisagaingoes betweentheKorarchaeotaandtheremaining“TAC”orelsewhere,inthe backtothetaxonomicsamplinganditspotentialincompleteness.Iftomorrow presenceofanappropriateoutgroup.Atanyrate,theTACKmonophyly adeepbranchofArchaeawithverysmallgenomesisdiscovered,thiswillresult appearstobesupportedbyalargenumberofsharedgenes(431).Atthis inadownwardreassessmentofthecomplexityofLACA. point,itmaybetimelytoconsideramoresystematics-orientedcomparison intermsofthescaleofgenelossandgainandgenomeevolutionbetween Theinferenceofcomplexdeepancestorsisneitherentirelynewnorlimitedto theEuryarchaeotaandthe“TACK”super-phylum.Areweinfrontof theevolutionofArchaea.Manyevolutionaryreconstructionsandtheoretical comparableevolutionaryscalesintermsofphylogenomicbreadth?Ifthatis modelsofgenomeevolutionpointtoextensivegenelossasoneofthekey thecase,couldtheTACKsuper-phylumbeassimilatedwithCrenarchaeota evolutionaryprocesses[55-57].Perhapsparticularlyimpressiveare
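
The first step of arCOG construction described in the Methods (initial clusters from triangles of symmetrical best hits) can be illustrated with a minimal Python sketch. It is a simplified illustration of the general idea only and not the authors' modified COG algorithm, whose details are not given here. The `genome|gene` identifier format, the `best_hit` dictionary layout and the function names are assumptions made for this example; deriving best hits from real BLAST scores is omitted.

```python
from collections import defaultdict
from itertools import combinations


def genome_of(gene):
    """Hypothetical ID convention: 'genome|gene', e.g. 'A|1'."""
    return gene.split("|")[0]


def build_clusters(best_hit):
    """Cluster genes by merging triangles of symmetrical best hits.

    best_hit maps (query_gene, target_genome) -> best-scoring gene
    of the query in that target genome (assumed precomputed).
    """
    # 1. Keep only symmetrical (reciprocal) best hits between genomes.
    edges = set()
    for (query, target_genome), hit in best_hit.items():
        if genome_of(hit) != target_genome or genome_of(query) == target_genome:
            continue
        if best_hit.get((hit, genome_of(query))) == query:
            edges.add(frozenset((query, hit)))

    neighbours = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)

    # 2. A triangle is three genes, pairwise linked by symmetrical best hits.
    triangles = set()
    for edge in edges:
        a, b = tuple(edge)
        for c in neighbours[a] & neighbours[b]:
            triangles.add(frozenset((a, b, c)))

    # 3. Merge triangles that share a side, using a small union-find.
    parent = {t: t for t in triangles}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    by_side = defaultdict(list)
    for t in triangles:
        for side in combinations(sorted(t), 2):
            by_side[frozenset(side)].append(t)
    for group in by_side.values():
        root = find(group[0])
        for t in group[1:]:
            parent[find(t)] = root

    clusters = defaultdict(set)
    for t in triangles:
        clusters[find(t)] |= t
    return sorted(sorted(c) for c in clusters.values())


# Toy data: four genomes (A-D); all values are invented for illustration.
reciprocal_pairs = [("A|1", "B|1"), ("A|1", "C|1"), ("B|1", "C|1"),
                    ("B|1", "D|1"), ("C|1", "D|1"), ("A|2", "B|2")]
example_best_hit = {}
for a, b in reciprocal_pairs:
    example_best_hit[(a, genome_of(b))] = b
    example_best_hit[(b, genome_of(a))] = a
example_best_hit[("D|1", "A")] = "A|1"  # one-directional hit, ignored

clusters = build_clusters(example_best_hit)
```

In this toy input the triangles A|1, B|1, C|1 and B|1, C|1, D|1 share the side B|1, C|1 and are merged into one cluster, whereas the pair A|2, B|2 has no third partner and forms no cluster. The later steps in the Methods (PSI-BLAST expansion and merging of clusters with complementary phyletic patterns) are not covered by this sketch.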
