ebook img

Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation ... PDF

13 Pages·2014·4.64 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation ...

Chaietal.BMCEvolutionaryBiology2014,14:207 http://www.biomedcentral.com/1471-2148/14/207 RESEARCH ARTICLE Open Access Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam Juanjuan Chai1, Guruprasad Kora1, Tae-Hyuk Ahn1, Doug Hyatt2,3 and Chongle Pan1,2* Abstract Background: Phylogeneticstudieshaveprovideddetailedknowledgeontheevolutionarymechanismsofgenesand speciesinBacteriaandArchaea.However,theevolutionofcellularfunctions,representedbymetabolicpathwaysand biologicalprocesses,hasnotbeensystematicallycharacterized.Manycladesintheprokaryotictreeoflifehavenow beencoveredbysequencedgenomesinGenBank.Thisenablesalarge-scalefunctionalphylogenomicsstudyofmany computationallyinferredcellularfunctionsacrossallsequencedprokaryotes. Results:Atotalof14,727GenBankprokaryoticgenomeswerere-annotatedusinganewproteinfamilydatabase,UniFam, toobtainconsistentfunctionalannotationsforaccuratecomparison.Thefunctionalprofileofagenomewasrepresented bythebiologicalprocessGeneOntology(GO)termsinitsannotation.TheGOtermenrichmentanalysisdifferentiatedthe functionalprofilesbetweenselectedarchaealtaxa.706prokaryoticmetabolicpathwayswereinferredfromthese genomesusingPathwayToolsandMetaCyc.Theconsistencybetweenthedistributionofmetabolicpathwaysinthe genomesandthephylogenetictreeofthegenomeswasmeasuredusingparsimonyscoresandretentionindices.The ancestralfunctionalprofilesattheinternalnodesofthephylogenetictreewerereconstructedtotrackthegainsand lossesofmetabolicpathwaysinevolutionaryhistory. Conclusions:Ourfunctionalphylogenomicsanalysisshowsdivergentfunctionalprofilesoftaxaandclades.Such function-phylogenycorrelationstemsfromasetofclade-specificcellularfunctionswithlowparsimonyscores.Onthe otherhand,manycellularfunctionsaresparselydispersedacrossmanycladeswithhighparsimonyscores.Thesedifferent typesofcellularfunctionshavedistinctevolutionarypatternsreconstructedfromtheprokaryotictree. Keywords:Prokaryotes,Cellularfunction,Pathway,Genomes,Evolution,Phylogenomics Background arethetwoprimaryfactorsthatshapethefunctionalprofile Bacterialandarchaealmicroorganismsarecapableof very ofamicrobialspecies.Thephylogenydefinestheevolution- diverse cellular functions. Some examples are nitrogen aryhistoryofthespeciesanddictatesthecellularfunctions fixation, organic matter degradation, and antibiotic resist- available from inheritance. The environment,including the ance. The full set of cellular functions, or the functional physicalandchemicalconditionsofthehabitatandthebio- profile, of a microorganism is encoded in its genome. For logicalcohortsofthemicrobialcommunities,appliesselect- a microbial species, some essential cellular functions are ive pressure on the species to acquire and retain beneficial maintainedviaverticalgenetransfer,somebeneficialones cellular functions and remove deleterious or dispensable are gained via horizontal gene transfer or evolutionary ones. innovation,andsomedispensableonesarelostinorderto Decades of biochemistry research have uncovered the maintain a compact genome. Phylogeny and environment molecular implementation of many cellular functions in terms of biological processes and metabolic pathways. *Correspondence:[email protected] Enzymes that carry out the reactions in pathways and 1ComputerScienceandMathematicsDivision,OakRidgeNationalLaboratory, processes have been identified and linked with genes in OakRidge,TN,USA model organisms in databases, such as MetaCyc, KEGG, 2BioSciencesDivision,OakRidgeNationalLaboratory,OakRidge,TN,USA Fulllistofauthorinformationisavailableattheendofthearticle and others [1-7]. As a result, many cellular functions in a ©2014Chaietal.;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative CommonsAttributionLicense(http://creativecommons.org/licenses/by/4.0),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginalworkisproperlycredited.TheCreativeCommonsPublicDomain Dedicationwaiver(http://creativecommons.org/publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle, unlessotherwisestated. Chaietal.BMCEvolutionaryBiology2014,14:207 Page2of13 http://www.biomedcentral.com/1471-2148/14/207 microorganism can now be inferred computationally tree. On one hand, some functions are concentrated in from its genome in three steps. First, genes are pre- selectedclades.Asaresult,manycladeshavedistinctfunc- dicted from the genome sequence. Then, the molecular tionalprofileswithenrichmentofcertainfunctions.Onthe functions of the genes are inferred by sequence analysis other hand, some other functions are sparsely dispersed and database searching. Finally, a cellular function is across distant clades, which would make it difficult to consideredtobepresentinthemicroorganismifthees- predict the presence or absence of these functions in a sential enzymes for the cellular function are encoded in microorganism based on the functions of its phylogenetic its genome. Owing to the progress in DNA sequencing neighbors.Thefunctionalphylogenomicsanalysisprovided technology, a large number of genomes (>15,000) have the parsimony scores of cellular functions as an empirical beensequencedanddepositedinGenBank[8],covering measure of their consistency with the phylogeny, and gen- considerable phylogenetic diversity of the bacterial and erated the aggregate functional profiles of taxa at different archaeal domains. Characterization of cellular functions taxonomicranks. consistently predicted from the sequenced microbial genomes in a phylogenetic context can shed light on Results and discussion their evolution in Bacteria and Archaea. This is referred Phylogenyreconstructionoftheprokaryoticgenomes toasfunctionalphylogenomicsanalysis. We downloaded from GenBank (Oct 2013) more than Evolution in the bacterial and archaeal domains has 15,000 bacterial and archaeal genomes, henceforth been studied extensively based on species trees and gene referred to as the prokaryotic genomes. After discarding trees [9,10]. Functional phylogenomics studies can un- genomescontainingmorethan1000contigsandgenomes covertheevolutionofcellularfunctionsatanintermediate with less than 400 protein-coding genes, the remaining levelbetweengenesandspecies.Genesmayevolveinpar- 14,727 prokaryotic genomes were analyzed in this study allel with species or in a divergent manner through hori- and their metadata was provided in Additional file 1. A zontal gene transfers and gene losses. Likewise, a cellular phylogenetic tree of these genomes (see Figure 1) was function may evolve in parallel with both its host species built using PhyloPhlAn [21] and FastTree [22] based on anditsconstituentgenes.Duringthisprocess,somegenes proteins found by Prodigal [23]. Prodigal was chosen in maybereplacedwiththeirnon-homologuesofequivalent order to standardize gene calling across all genomes, as function, or conferred to a species by horizontal gene GenBankgenepredictionswereshowntohavemorevari- transfer, gene sharing, or evolutionary innovation [11-13]. ationasaresultofthedifferentmethodsusedbythesub- Reconstruction of a cellular function tree is difficult due mitters [24]. Prodigal predictions were also found to have to the lack of an evolutionary model to account for all moreconsistentidentificationoftranslationinitiationsites these types of events. Instead, the evolution of cellular than GenBank records, as in Burkholderia genomes [25]. functionscanbestudiedbyinferringtheirancestralstates ThephylogenetictreeintheNewickformatisavailablein acrosstheprokaryoticphylogenetictreeandtrackingtheir the Additional file 2. The full lineages of the genomes gains and losses along the evolutionary timeline. More were extracted from the NCBI Taxonomy database and broadly used cellular functions should have earlier initial their phyla are color-coded in Ring 1 of Figure 1. Most appearances in the evolutionary history. We hypothesize phylaweremonophyleticandoccupiedcontiguoussectors that the propagation of a cellular function through the of the Ring 1, except that Proteobacteria was separated prokaryotic tree may follow the diversification of selected into two clades. This indicated a general agreement clades, passlaterallyacrossdistant species, or discontinue between the taxonomic classification and phylogenetic atcertainlineages.Ourfunctionalphylogenomicsanalysis distribution of these genomes. Due to the errors in either providedaquantitativetestofthesehypothesesforalarge taxonomic classification or phylogeny reconstruction, numberofcellularfunctionsinthephylogenetictreeofall there was also occasional inconsistency, indicated by sequencedprokaryoticgenomesinGenBank. distinct colored lines within some contiguous phylum Microbiologists have long observed the association of sectionsinRing1. certain functions with certain clades of the phylogenetic Tocomparethegene callingresultsfromProdigalpre- tree [14]. Recently, the PICRUSt algorithm [15] was devel- dictions and GenBank annotations, we built two phylo- oped to predict metagenome functional profiles of micro- genetic trees for the 10,075 genomes that had GenBank bial community members from 16S rRNA marker genes. annotations. One tree was based on proteins called by However, many genomics studies have also shown the Prodigal and the other was based on proteins provided prevalence of horizontal gene transfer among bacteria in GenBank (see Additional file 3). The Robinson-Foulds [16-20], which would disrupt the association of a cellular [26] distance between the two trees, defined as the total function with some closely related taxa. Our functional number of partitions of tips implied by one tree but not phylogenomics analysis measured the phylogeny-function the other, was 8494. This was 45% of 18,916 total parti- correlationformanycellularfunctionsovertheprokaryotic tions in the two trees. The branch score [27] defined by Chaietal.BMCEvolutionaryBiology2014,14:207 Page3of13 http://www.biomedcentral.com/1471-2148/14/207 Phylum (Ring 1) 1 2 3 4 5 6 Figure1Overviewoftheprokaryoticgenomes.Thephylogenetictreecontains14,727genomeswithtipscoloredaccordingtotheirphylum classification.Rings:(1)Phylumclassificationofthegenomes.Phylawithlessthan5genomesareinthe“Others”category.(2)Completionstatus ofthegenomeswithblackforfinishedgenomesandwhitefordraftgenomes.(3)Numberofcontigsineachgenome.(4)Numberofproteinsin eachgenome.(5)PercentageofproteinsannotatedbyUniFamineachgenome.(6)Numberofpathwaysinferredforeachgenome. the square root of sum of squared differences between height of its corresponding bar in Ring 3. A few phyla branches in the two trees was 2.24, comparing to the had very high percentages of finished genomes, includ- total internal branch length of 1074.04 in the two trees. ing Chlamydiae (71% of 146 genomes),Tenericutes (58% These two distance measures indicated that there were of133genomes)inthebacterialdomain,asdidthemost many discordant internal branches between the two phyla inthe archaeal domain. trees,but thelengthsofthesebrancheswere verysmall. The taxonomic classification and the phylogenetic tree Re-annotationoftheprokaryoticgenomes (see Figure1)exhibited highly uneven distribution of the Theprokaryoticgenomeswereannotatedbytheirsubmit- sequenced genomes among different phyla. 85% of the ters to GenBank. We observed a significant degree of in- genomes belonged to the three most sequenced phyla consistencybetweentheGenBankannotationsofdifferent (7120 genomes in Proteobacteria, 4137 in Firmicutes, genomes in the following aspects: (i) terminology: ge- and 1294 in Actinobacteria), whereas 17 phyla have 3 or nomes have different names for orthologous proteins, (ii) fewer sequenced genomes. The completion status of the ontology: some genomes have extensive annotations in genomes is indicated in Ring 2 of Figure 1 (finished GO terms and EC numbers and some others do not, and genomes in black lines and draft genomes in white lines) (iii)annotationcoverage:closelyrelatedgenomeshavedif- and the number of contigs in a genome is shown by the ferent percentages of genes with assigned functions. The Chaietal.BMCEvolutionaryBiology2014,14:207 Page4of13 http://www.biomedcentral.com/1471-2148/14/207 inconsistency probably stemmed from the submitters’ annotation of UniFam. Finally, the annotationofa micro- different annotation procedures that may involve a var- bialgenometakesapproximately4CPUhoursonaverage iety of algorithms and databases for inferring functions. using UniFam, which made it computationally feasible to In addition, the inconsistency also existed between re-annotate 14,727 genomes. UniFam will be updated in genomes deposited by the same submitter at different synchronization with the UniProt database. The annota- times, because the algorithms and databases used for tionoftheUniFamfamilieswillimprovewiththeongoing annotation are continuously improved, but the genome SwissProt curation effort. The sequence diversity of the annotation in GenBank was rarely updated after initial UniFam families will increase with the expansion of the submission. Because such inconsistency in the existing TrEMBLdatabase. GenBankannotationsmaypreventaccuratecomparison The UniFam annotation was compared with annota- of cellular functions across genomes, we re-annotated tions by RAST [28] and Prokka [29] on five representa- the 14,727 prokaryotic genomes in three steps: gene tive genomes: two Escherichia coli K-12 genomes (a calling, protein functional annotation, and metabolic finished W3110 genome and a draft MG1655 genome), pathwayinference. two Pseudomonas aeruginosa genomes (a finished M18 Prodigal [23] was used to predict protein-coding genes genome and a draft SJTD1 genome), and one finished in the genomes. The numbers of predicted genes (shown archaeal genome Methanocaldococcus jannaschii DSM by the height of green bars in Ring 4 of Figure 1) were 2661.Thesame proteinspredicted byProdigalwerepro- comparable between neighboring genomes in the phylo- vided to the three annotation pipelines. The annotation genetic tree. The number of genes in a genome was not results in gene names, gene product names and EC correlatedwiththecompletionstatusofthegenomeorits numbers from the three pipelines are available in numberofcontigs.Thissuggestedthat,althoughdraftge- Additional file 4. Only UniFam annotations provided GO nomes were likely fragmented by many repeat sequences, terms. UniFam annotated more proteins in the two E. mostofthemcouldstillprovidesufficientcoverageofthe coli genomes and the M. jannaschii genome, but less in genecontentofthegenomes. the two P. aeruginosa genomes, than Prokka and RAST The predicted protein-coding genes were re-annotated (see Additional file 5). The consistency between the gen- using UniFam_Prok. The UniFam protein family database ome annotations were measured based on EC numbers (freely available at http://unifam.omicsbio.org) was con- (see Additional file 6), because two EC numbers can be structed from the UniProt database as described in the compared exactly, and the EC number annotation is a Materialand Methods. Themanually curated annotations primary information source for pathway inference by of proteins in SwissProt were used as the source of func- Pathway Tools. The EC number consistency between the tion information in standardized terminology. Protein three pipelines was highest in E. coli, followed by P. aer- families were built around the SwissProt proteins with uginosa, and finally by M. jannaschii. The agreement their close homologs in TrEMBL. Substantial sequence between UniFam and each of the other two methods was diversitywasobtainedformanyproteinfamilies,owingto better than that between Prokka and RAST. UniFam theenormoussequencespaceofTrEMBL.Incomparison annotation was more computationally expensive than withmanyexistingproteinfamilydatabases,UniFampro- Prokka and RAST. The P. aeruginosa M18 genome was vided the following advantages for genome annotation. annotated by UniFam using 4.2 CPU hours, by Prokka First, UniFam provided a comprehensive coverage to using 1.3 CPU hours, and by RAST using 0.2 CPU alleviate the need for searching multiple databases and hours. combining the results. Ring 5 of Figure 1 shows the Metabolic pathways wereinferred for allgenomes with percentagesofannotatedproteinsforallthegenomes.On Pathway Tools [30] based on UniFam annotations. In average,70%ofthepredictedproteinswereannotatedina comparison with pathway inference tools such as KEGG genome. 10,992 out of 14,727 genomes had more than [31] and others [32], Pathway Tools provides a compre- 65% of their proteins annotated by UniFam. Second, hensive coverage of metabolic reactions and pathways becauseonlythemanuallycuratedannotationsofproteins [33] and allows automatic processing of a large number in SwissProt were used as the source of function informa- of genomes. The pathway inference was facilitated by tion,theUniFamfamilieswereassociatedwithstandardized the extensive EC number and GO term annotation by protein product names, extensive GO term annotations, UniFam. 706 prokaryotic pathways in the MetaCyc data- andECnumbers.Third,becausetheUniFamfamilieswere base [34] were found in the studied genomes. The num- built from proteins with stringent whole sequence align- bers of predicted pathways across genomes are shown in ment requirement, we believe the genome annotation by Ring 6 of Figure 1. Genomes in the Proteobacteria UniFam is reliable. Although it is difficult to rigorously phylum had more pathways inferred than other phyla, benchmarktheaccuracyofgenomeannotation,our man- whereas the archaeal genomes had much fewer inferred ual analysis of selected genomes indicated high quality pathways.Thenumber ofinferredpathwaysinagenome Chaietal.BMCEvolutionaryBiology2014,14:207 Page5of13 http://www.biomedcentral.com/1471-2148/14/207 correlated with numbers of predicted and annotated and 1 for presence of the 706 pathways in this genome. proteins of the genome. In addition, the distribution of The metabolic pathway profiles of individual genomes in curated pathways in MetaCyc among different clades a genus were merged into an aggregate genus pathway also affected the number of inferred pathways in a profile (a pathway profile of a genus pan-genome) to genome. reduce the effect of uneven genera representation on the The genome annotation results are provided at http:// analysis. The 1206 genera were hierarchically clustered unifam.omicsbio.org and will be regularly updated to based ontheir metabolic pathway profilesand 706meta- keep pace with the growth of GenBank and newer re- bolicpathwayswere clusteredbasedontheirdistribution leasesofProdigal,UniFamandMetaCyc.Althoughevery across the genera (see Figure 2). There were some annotation step was performed with stringent settings to universal pathways clustered in the right section, which minimize errors, we cannot obtain the ground truth or were detected in almost all genera. Examples included perform extensive manual curation for such a large set tRNA charging, UTP and CTP dephosphorylation I, of prokaryotic genomes. Thus, it was not the goal of this adenosineribonucleotidesde novobiosynthesis,andglu- study to examine the cellular functions of individual or- tamine degradation I. Lack of these pathways in the very ganisms. Instead, the functional phylogenomics analysis few genera is likely due to the incompleteness of the ge- was focused on the aggregate results of a large number nomes. There were also rare pathways that existed only of diverse genomes to minimize the adverse effect of oc- in very few genera, such as quinate degradation II (13 casional errors in the phylogenetic tree reconstruction, genomesin2genera:CorynebacteriumandKetogulonici- genomeannotation,andpathwayinference. genium) andmyrcenedegradation(4genomesin2genera: GordoniaandNocardia). Associationofcellularfunctionswiththetaxonomic The hierarchical clustering of the genera was based on classification the similarity between their pathway profiles without The overall functional profile of a taxon can be repre- using the taxonomic or phylogenetic information. The sented by the collection of the biological process GO domain and phylum classifications of the genera are terms of the proteins encoded in its genomes. The marked in the left and right vertical strips, respectively, differentiation between the functional profiles of taxa at in Figure 2. The archaeal genera are clearly clustered a given taxonomic rank was measured using GO term together. They were distinguished by some Archaea- enrichment analysis [35]. Because there were too many specific pathways, such as methanogenesis from CO , 2 bacterial genomes for enrichment analysis and result tRNA splicing, starch degradation V, selenocysteine bio- presentation,archaealgenomeswere chosentoshowcase synthesis II, and chitin degradation I. They also lacked the differentiation at the phylum and family levels (see some pathways universally found in Bacteria, such as Additional file 7). The enrichment analysis was con- tRNA processing and thiazole biosynthesis I. The clus- ducted only on taxa containing sufficient genomes: ≥ 10 tering of genera was also relatively consistent with their genomes for a phylum and≥5 for a family. Additional phylum classification, in that the clusters formed con- file 7 lists the top ten most enriched GO terms in each tinuous blocks of genera from the same phyla. This was phylum compared to the average in the whole archaeal a result of many taxa-specific pathways that were com- domain, and the top five most enriched GO terms of monly found in only a few selected taxa. On the other each family compared to the average in the correspond- hand, genera from the same phylum were also often ingphylum. separated into several disjoint clusters, indicating diver- The Crenarchaeota phylum contains many thermo- gence of the functional profiles of genera in the same philic organisms. The enrichment of the biological phylum. processes in defense response to viruses indicated that Crenarchaeota has heavier genomic investment in virus Associationofcellularfunctionswiththe defense than the other phyla, which may suggest preva- phylogenetictree lent viral infection in its environments. Euryarchaeota Themetabolicpathwayprofilesofgenomesweresuperim- was specialized in methanogenesis as expected, but it posed on their phylogenetic tree to assess the phylogeny- also appearedto havemore “intelligence-related” cellular function correlation. Nine representative pathways are functions, such as signal transduction and transcription shown in Figure 3. The tRNA charging pathway (Ring 1) regulation. Out of eight families in Euryarchaeota, the was found in 99.9% of the genomes and, therefore, was five known methanogen families were found to be likely to have been passed from an ancient ancestor to all specialized in methanogenesis along with different sets the descendants. The absence of this pathway in 3 ofotherbiologicalprocesses. genomes can be attributed to the low genome quality, The metabolic pathway profile of a genome was repre- instead of biological effect. The mercury detoxification sented by a 0–1 vector of length 706 with 0 for absence pathway (Ring 6) existed in only 15.7% of the genomes, Chaietal.BMCEvolutionaryBiology2014,14:207 Page6of13 http://www.biomedcentral.com/1471-2148/14/207 Clustering of Pathways Actinobacteria Bacteroidetes Crenarchaeota Cyanobacteria Euryarchaeota Firmicutes Others Bacteria Planctomycetes Archaea Proteobacteria Synergistetes a r e n e G f o g n ri e st u Cl Figure2Hierarchicalclusteringofgeneraandmetabolicpathways.Theheatmaprepresentsthepresence(red)andabsence(green)ofall 706pathways(columns)inall1206genera(rows).Thedendrogramstotheleftandonthetopoftheheatmaprepresenttheclusteringresultsof thegeneraandthepathways,respectively.Highertaxonomicclassificationsofthegeneraaremarkedonthetwocoloredstrips:theBacteria/ Archaeadomainclassificationontheleftstripandthephylumclassificationontherighttrip. but dispersed among many distant clades of the tree. The the minimum number of state changes (from presence phylogeneticdistributionofthispathwaysuggestedexten- to absence or vice versa) needed along the branches to sive horizontal gene transfer events across the bacterial produce the observed presence/absence pattern of the domain, probably as a result of the selection force of pathway at the tips. Retention index (RI) [37], which potentially mercury-contaminated environments and the normalizes the parsimony score of a pathway with the high transferability of this cellular function. In contrast, number of its occurrences in the genomes, was also the arsenate detoxification pathway (Ring 7) found in calculated to measure how well a character conforms to 14.8% of the genomes was much more concentrated on the phylogenetic tree. A higher RI indicates a better fit selected clades. It was not clear why the two heavy-metal of a character to the tree. For example, aerobic respir- detoxification pathways have such different levels of ation (cytochrome C) (Ring 2) was found in 9383 ge- correlationwiththephylogeny. nomes with parsimony score 290 and RI 0.95, whereas A pathway can be viewed as a binary character methylotrophy (Ring 5)wasfoundin1353genomeswith observed at the tips of the phylogenetic tree, with its a parsimony score 592 and RI 0.56. This is in line with presence and absence in a genome as the two states. the previous observation that methylotrophy has under- The parsimony score [36] for a pathway on the tree is gone extensive horizontal gene transfer, while aerobic Chaietal.BMCEvolutionaryBiology2014,14:207 Page7of13 http://www.biomedcentral.com/1471-2148/14/207 Pathways 1 2 3 4 5 6 7 8 9 Figure3Distributionofselectedpathwaysacrosstheprokaryoticgenomes.Eachringrepresentsthepresencepattern(colored)ofa pathwayonthephylogenetictreetips.Thepathwaysandtheirringcolorsarelistedinthelegend. respiration is consistent with the phylogeny [20]. The of 7120 genomes), Firmicutes (1343 out of 4137), Actino- parsimony scores and the genome occurrence frequen- bacteria (434 out of 1294), Bacteroidetes (170 out of 488), cies of the 706 pathways are shown in Figure 4. Path- and Cyanobacteria (69 out of 173). Nitrogen fixation was ways were categorized into consistent pathways with anexampleofinconsistentpathways(Ring5).Itwasfound RI>0.9andinconsistentpathwayswithRI<0.7.Outof in 1121 genomes with parsimony score 409 and RI 0.64. the 706 pathways in Bacteria and Archaea, 31% were This pathway is known to have been horizontally trans- consistent pathways and 26% were inconsistent pathways. ferred between many species [20]. Although both phos- The complete list of the 706 pathways with their parsi- phate acquisition and nitrogen fixation provide essential monyscores,occurrencefrequenciesin the genomes, and nutrients for microorganisms, they have very different oc- RIsisavailableinAdditionalfile8. currencefrequenciesandhorizontalgenetransferbehaviors Phosphateacquisitionwasanexampleofconsistentpath- inthebacterialgenomes. ways (Ring 4). This pathway was found in 6005 genomes TwobiosynthesispathwaysarealsoshowninFigure3. withparsimonyscore528andRI0.91.Itwasconcentrated Isopenicillin N biosynthesis (Ring 8) occurred in 2236 in 16 bacterial phyla and 2 archaeal phyla. Of the bacterial genomes with parsimony score 387 and RI 0.83. Lysine genomes, it was mostly found in Proteobacteria (3882 out biosynthesis I (Ring 9), which involves succinylated Chaietal.BMCEvolutionaryBiology2014,14:207 Page8of13 http://www.biomedcentral.com/1471-2148/14/207 Inconsistent Inconsistent 0 00 Pathways Pathways 1 0 0 8 e or c S 0 y 60 n o m si Par 00 4 0 0 2 Consistent Pathways 0 0 5000 10000 15000 Occurrence Frequencies Figure4Distributionsofparsimonyscoresandoccurrencefrequenciesofallpathways.Thepathwaysareclassifiedintoconsistent pathwayswithRI>0.9(coloredinblue)andinconsistentpathwayswithRI<0.7(coloredinred). intermediates, was found in 8372 genomes with parsi- transfer. The first pair (5A and B) displayed very distinct mony score 149 and RI 0.97, showing much higher evolutionary patterns. Aerobic respiration had very early prevalence and consistency. The two pathways were first appearance and was well maintained in most clades, both mostly found in the phyla of Proteobacteria, Acti- whereas methylotrophy was a more recent innovation and nobacteria and Bacteroidetes. However, isopenicillin N underwent manychanges. The second pair (Figure 5C and biosynthesis had a very sparse distribution, while lysine D) had similar parsimony scores (528 and 409), but very biosynthesisIhadanearlyuniform distributionin these different genome occurrence frequencies (6005 and 1121). phyla. The gains of phosphate acquisition were concentrated in From the presence/absence pattern of a pathway at the onlya few clades, whilethe gains ofnitrogen fixation were tips of the phylogenetic tree, the ancestral states of the spreadacrossmanyclades.Thethirdpair(Figure5EandF) pathway at the internal nodes of the tree were inferred showstwoheavy-metaldetoxificationpathways.Thearsen- with the parsimony method [36,38,39]. The divergence ateoneshowedasmallnumberoflossesandgainsinafew time of the internal nodes of the tree was also estimated clades, while the mercury one had a few ancient gains and [40]witha scaledreferenceageof100fortheroot.Com- losses followed by considerable recent gains and losses in biningtheancestralstatesofapathwayandthedivergence manyclades.Forthetwobiosynthesispathways(Figure5G time at internal nodes, the state changes (gain or loss) of and H), isopenicillin N biosynthesis went through frequent thepathwaywereclockedonthephylogenetictree. gainsandlosses,probablyinresponsetoselectionpressure Figure 5 shows four pairs of subtrees for the representa- in specific environments; whereas lysine biosynthesis I was tive pathways examined in Figure 3. Without considering maintained stably with few losses inalmost all clades since horizontalgenetransferfromeukaryotes,thefirstrednode theirancestorsfirstgainedthispathway. from the root on the subtree for a pathway marks the first occurrence of the pathway in nature through evolutionary Conclusions innovation. The subsequent red nodes may result from The phylogenetic tree and consistent annotations of independent evolutionary innovation or horizontal gene 14,727prokaryoticgenomesprovidedthefoundationfora Chaietal.BMCEvolutionaryBiology2014,14:207 Page9of13 http://www.biomedcentral.com/1471-2148/14/207 Figure5Subtreesofselectedpathways.Thesubtreeofapathwayisreducedfromtheclockedphylogenetictreeofallgenomesbycollapsingthe entirecladeswithoutthispathwayintotips.Therootiscoloredredforthepathway’spresenceandblueforitsabsence.Thecolorsofnon-rootnodes markthepathway’sstatuschangesfromtheirimmediateancestralnodes:redforgains,blueforlosses,andnonefornochange.Thebranches descendingfromnodescontainingthepathwayarecoloredgreen.Thetotalnumberofblueandrednodesinapathway’ssubtreeequalsthe parsimonyscoreofthepathway.(A)Aerobicrespiration(cytochromeC)withparsimonyscore290from9383genomes.(B)Methylotrophywith parsimonyscore592from1353genomes.(C)Phosphateacquisitionwithparsimonyscore528from6005genomes.(D)Nitrogenfixationwith parsimonyscore409from1121genomes.(E)Arsenatedetoxificationwithparsimonyscore196from2182genomes.(F)Mercurydetoxification withparsimonyscore944from2319genomes.(G)IsopenicillinNbiosynthesiswithparsimonyscore387from2236genomes.(H)LysinebiosynthesisI withparsimonyscore149from8372genomes. Chaietal.BMCEvolutionaryBiology2014,14:207 Page10of13 http://www.biomedcentral.com/1471-2148/14/207 functional phylogenomics analysis across Bacteria and clusters, or from MSAs for non-singleton clusters, with Archaea.Toourbestknowledge,thisisthefirststudythat defaultoptions.ThedistributionoftheHMMlengthswas systematically examined a wide variety of consistently centered at 250 residues with a long tail for long HMMs inferred cellular functions across prokaryotic genomes in (seeAdditionalfile9B).Thesetwostepswereexecutedon a phylogenetic context. The analysis systematically mea- the supercomputer, Titan, at the Oak Ridge Leadership sured the correlation of the cellular function evolution ComputingFacility. with the organism evolution in terms of function enrich- The following fields from the annotations of the Swis- ment in taxa, clustering of functional profiles and genera, sProt proteins in a cluster were extracted and used for and parsimony score distribution of metabolic pathways. the annotations of the corresponding UniFam family: Different evolution patterns of cellular function were recommended full name, EC number, gene name, and showcased using thephylogeneticdistributionsand subtrees GO terms. Database-specific annotations, such as Pfam ofselectedmetabolicpathways. IDs [44] and COG categories [45], were not used, be- cause they should be best assigned by searching those Methods specific databases. If there was more than one SwissProt ConstructionoftheUniFamdatabase protein in a cluster and their annotations were not the All proteins between 30 and 5000 residues long were same for an annotation field, the union was taken for collected from UniProt(535,849 proteins from SwissProt that field as the annotation for the cluster. We observed and ~42 million proteins from TrEMBL, accessed in very consistent annotations between SwissProt proteins October 2013). The 43 million proteins were sorted by in a cluster, because of the high sequence identity length in decreasing order and then clustered into ~13 betweenproteinsinaclusterandthereliabilityofSwissProt million clusters using 64-bit USEARCH v7.0.1001 [41] annotation. onaLinuxworkstationwith130GBofmemory.Theclus- Two sub-databases were created: UniFam_Prok (159,895 tering required every sequence in a cluster to have >80% families) for annotating prokaryotic proteins and UniFa- global alignment identity with the centroid sequence of m_Euk (107,777 families) for annotating eukaryotic pro- thecluster,whichguaranteed>60%globalalignmentiden- teins. The smaller sub-databases allowed faster database tity between any two sequences in a cluster. And the searching. Families with bacterial and archaeal proteins shortest sequencewas required tobe>80% as longas the were combined to form UniFam_Prok and those with longest sequence in a cluster. These two requirements eukaryotic proteins formed UniFam_Euk. The few families wereusedtopreventclusteringofproteinsthatwereonly that had proteins from both prokaryotes and eukaryotes locally similar. Default values for other parameters of were included in both sub-databases. However, only the USEARCHwereusedforclustering. annotations for proteins from prokaryotic organisms were 267,579 clusters containing at least one SwissProt pro- transferred to UniFam_Prok and likewise for UniFam_Euk. tein were used to construct UniFam families. The Uni- To illustrate the functional coverage of UniFam, the distri- Fam clusters included all SwissProt proteins and ~23% butions of the EC number assignment to UniFam families of all proteins from UniProt. Additional file 9A shows wereplottedforUniFam_Euk(seeAdditionalfile10A)and the distribution of the numbers of UniProt proteins in UniFam_Prok(seeAdditionalfile10B). the UniFam clusters. The average size of the UniFam clusters was 37 proteins and the biggest cluster con- tained 27,467 proteins. Singletons constituted 27% of the Re-annotationandpathwayreconstructionofprokaryotic clusters, but covered only 0.7% of all proteins included genomes in UniFam. 33,940 singletons were from Bacteria, 28,953 All bacterial and archaeal genomes, in finished or draft from Eukaryota, 6138 from Archaea, and 2026 from completion status, were downloaded from GenBank [8] Viruses. The organelle distribution of the singletons (Oct 2013). Highly fragmented genomes with more than was 1703 in plastids, 652 in plasmids, 466 in mitochon- 1000contigswerediscarded.Everygenomewasannotated dria, 7 in nucleomorphs, 4 in hydrogenosomes, and the inthefollowingsteps.First,Prodigal[23]wasusedtofind remaininginchromosomes. protein-coding genes and predict their protein sequences. For non-singleton clusters, MAFFT v7.113b [42] was Genomes with less than 400 predicted genes were also usedtocreatetheirmultiplesequencealignments(MSAs). discarded. The proteins were then searched against Uni- MAFFT was set to automatically select an appropriate Fam_Prok using HMMER 3.1b1 [43]. The top matches strategy according to the cluster size. Because of the high (ranked by whole-sequence e-value) of proteins were pairwise global alignment identities between sequences in selected by requiring the whole-sequence e-value to be a cluster, even large MSAs were observed to have low lower than1.00E-3and the aligned regions (notnecessar- alignment uncertainty. HMMs were built using HMMER ily contiguous) to cover at least 50% of both the matched 3.1b1 [43] either from single sequences for singleton HMM and the query protein. Proteins were annotated

Description:
physical and chemical conditions of the habitat and the bio- logical cohorts of the microbial This research used resources of the Oak. Ridge Leadership Holt JG: Bergey's Manual of Determinative Bacteriology. Philadelphia, PA:.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.