Downloaded from genome.cshlp.org on February 22, 2023 - Published by Cold Spring Harbor Laboratory Press

Methods

Reconstructing ancestral gene content by coevolution

Tamir Tuller,1,2,3,4,5 Hadas Birin,1,4 Uri Gophna,2 Martin Kupiec,2 and Eytan Ruppin1,3

1 School of Computer Sciences, Tel Aviv University, Ramat Aviv 69978, Israel; 2 Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Ramat Aviv 69978, Israel; 3 School of Medicine, Tel Aviv University, Ramat Aviv 69978, Israel

Inferring the gene content of ancestral genomes is a fundamental challenge in molecular evolution. Due to the statistical nature of this problem, ancestral genomes inferred by the maximum likelihood (ML) or the maximum-parsimony (MP) methods are prone to considerable error rates. In general, these errors are difficult to abolish by using longer genomic sequences or by analyzing more taxa. This study describes a new approach for improving ancestral genome reconstruction, the ancestral coevolver (ACE), which utilizes coevolutionary information to improve the accuracy of such reconstructions over previous approaches. The principal idea is to reduce the potentially large solution space by choosing a single optimal (or near optimal) solution that is in accord with the coevolutionary relationships between protein families. Simulation experiments, both on artificial and real biological data, show that ACE yields a marked decrease in error rate compared with ML or MP. Applied to a large data set (95 organisms, 4873 protein families, and 10,000 coevolutionary relationships), some of the ancestral genomes reconstructed by ACE were remarkably different in their gene content from those reconstructed by ML or MP alone (more than 10% in some nodes). These reconstructions, while having almost similar likelihood/parsimony scores as those obtained with ML/MP, had markedly higher concordance with the coevolutionary information. Specifically, when ACE was implemented to improve the results of ML, it added a large number of proteins to those encoded by LUCA (last universal common ancestor), most of them ribosomal proteins and components of the F0F1-type ATP synthase/ATPases, complexes that are vital in most living organisms. Our analysis suggests that LUCA appears to have been bacterial-like and had a genome size similar to the genome sizes of many extant organisms.

[Supplemental material is available online at http://www.genome.org.]

4 These authors contributed equally to this work.
5 Corresponding author. [email protected]; fax 972-3-640-9357.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.096115.109.
Genome Research 20:122–132 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org

The problem of reconstructing ancestral states is as old as the field of molecular evolution, pioneered by Fitch around 40 yr ago (Fitch 1971). This first algorithm assumed a binary alphabet and was based on the maximum parsimony (MP) criterion, i.e., find the labels to the internal nodes of a tree that minimize the number of changes or mutations along the tree edges. Over the years this basic algorithm was generalized in many ways. Sankoff (1975) showed how to efficiently solve versions of the MP problem with a nonbinary alphabet and with multiple edge weights. Algorithms for inferring ancestral sequences based on the maximum likelihood (ML) principle (instead of MP) were suggested more than 15 yr later (Barry and Hartigan 1987; Felsenstein 1993; Pagel 1999; Pupko et al. 2000; Krishnan et al. 2004; Elias and Tuller 2007), aiming to identify the ancestral sequences that are most likely, given the current data in probabilistic terms. In a manner analogous to inferring an ancestral sequence such as a protein or gene (Pupko et al. 2000; Blanchette et al. 2004; Ma et al. 2006), many studies have aimed to reconstruct ancestral genomes (genomic content or gene content; e.g., see Boussau et al. 2004; Ouzounis et al. 2006; Putnam et al. 2007). Most existing phylogenetic reconstruction algorithms can be adapted to reconstructing ancestral gene content (e.g., see Felsenstein 1993; Yang 1997; Swofford 2002).

The main problem related to reconstructing ancestral sequences and gene inventories is that, in practice, the reconstructed sequences often contain a large number of errors. A major source of this phenomenon is the existence of multiple local and/or global maxima (i.e., solutions that are different but have the same scores of "goodness") in the solution space searched by both the ML and the MP approaches (e.g., see Fig. 1; Chor et al. 2000). Furthermore, due to the statistical nature of the problem, and as both ML/MP assume that different sites and different genes/proteins evolve independently, increasing the amount of information used (the lengths of the sequences and the number of organisms) (Li et al. 2008) does not guarantee a decrease in the error rate. Thus, in many cases, the confidence that we may assign to the most likely or most parsimonious reconstructed ancestral state is not very high.

In this work we describe a novel approach for improving the accuracy of reconstructed ancestral genomes. Our approach is based on utilizing information embedded in the coevolution of interacting proteins. The potential utility of this strategy is intuitively simple: If the ancestral reconstruction of protein A is ambiguous (e.g., different solutions achieve similar scores), but the ancestral reconstruction of protein B is unambiguous, then knowing that A and B have coevolved (usually due to some functional interaction) can help us disambiguate the ancestral reconstruction of A. A simple illustration of this idea is described in Figure 1. As demonstrated further on, our approach can successfully distinguish between solutions with similar parsimony/likelihood scores.

Importantly, coevolutionary relations between proteins are quite ubiquitous and have been traced and reported in numerous studies (Pazos et al. 1997; Chen and Dokholyan 2006; Marino-Ramirez et al. 2006; Barker et al. 2007; Wapinski et al. 2007; Felder and Tuller 2008; Tuller et al. 2009). One such source is protein–protein interactions (PPIs): Several studies have already demonstrated a link between coevolution and physical interactions. These investigations used the fact that interacting proteins tend to coevolve in order to successfully predict physical interactions (e.g., see Wu et al. 2003; Sato et al. 2005; Juan et al. 2008). Here, we take the opposite approach and use data on physical interactions between proteins as a proxy for their coevolution. Large-scale PPI datasets currently exist only for relatively few organisms, but as physical interactions tend to be conserved across species, at least to some extent (e.g., see Wu et al. 2003; Liang et al. 2006; Hirsh and Sharan 2007), it is plausible to assume that existing PPIs testify to coevolutionary relations across the ancestral tree. Another source of coevolution information that we utilize is metabolic adjacency: Following the studies of Spirin et al. (2006) and Zhao et al. (2007) we assume that metabolic enzymes that catalyze consecutive reactions in the same pathway tend to coevolve. We also incorporate additional sources of coevolution information, including genomic proximity, coexpression, and gene fusion, which have all been shown to be good predictors of coevolution (Lee et al. 2004; Chen and Dokholyan 2006; Jensen et al. 2009; Tuller et al. 2009). The coevolution data from all of these sources are combined to reconstruct the evolutionary history of 95 unicellular bacteria, archaea, and eukaryotes.

Results

The ancestral coevolver algorithm (ACE)

The ancestral coevolver method incorporates information about the coevolutionary relationships between pairs of proteins and the evolutionary distance between organisms (an evolutionary tree[s]) to improve the accuracy of ancestral protein content reconstruction. In this section we briefly describe our approach, model, and definitions. More formal definitions appear in Supplementary Note 1. For simplicity, we will refer interchangeably to proteins and the genes that encode them.

A phylogenetic tree is a rooted binary tree (i.e., the input degree of each node is one and the output degree is two) together with leaf labels. In this work, we assume that each node in a phylogenetic tree corresponds to a different organism (i.e., this is a species tree) and one can use this phylogenetic tree to describe the evolution of each protein family. In the simplest binary case, each label is either "0" (the gene does not appear in the genome of the organism) or "1" (the protein is encoded in the genome of an organism). In the nonbinary case each label is a natural number that denotes the number of paralogous copies of a gene in the corresponding genome. The leaves in a phylogenetic tree correspond to extant species, while the internal nodes correspond to ancestral organisms.

A coevolving forest is a set of phylogenetic trees with identical topology that correspond to the same organisms, each tracking the evolution of a different gene/protein. The forest includes an additional set of coevolutionary edges that connect pairs of nodes, each node in an edge belonging to a different phylogenetic tree. Each such coevolutionary edge connects a pair of nodes that correspond to the same organism (Fig. 1). The edges that are part of the evolutionary tree (the standard evolutionary transition edges) are termed "tree edges" here. For example, Figure 1 illustrates a coevolving forest spanning two gene/protein trees (the coevolutionary edges are dashed wavy arrows, while the tree edges are continuous). In this work we assume that pairs of coevolutionary edges that connect the roots of two evolutionary trees are "inherited" by the other pairs of nodes of the evolutionary trees (as described in Fig. 1), i.e., our mathematical model is capable of dealing with cases where coevolutionary edges appear/disappear during evolution. However, overall, we assume that the events where new coevolutionary edges appear or disappear are relatively rare.

Figure 1. A simple example demonstrating how coevolution can be used to improve ancestral sequence reconstruction. This example includes a coevolving forest with two trees; the tree edges are solid lines, while the coevolutionary edges are dashed wavy arrows; the ancestral states (e.g., the labels at the internal nodes x1 and y1) are smaller, while the known nonancestral states are larger and in italics (for formal definitions, see Supplementary Note 1). (A) The reconstructed ancestral states of two proteins (protein x and protein y); for protein x there are two parsimonious solutions: In one solution, all of the labels of the three internal nodes, t1, t2, t3, are "0"; in the second solution, all of the labels of the three internal nodes are "1." (B) By taking into consideration the coevolution of protein x with protein y, the "all '1'" solution is chosen for x, thus resolving the ambiguity in the labels of x by using information about protein y with which it interacts, and which has less ambiguous labels.

A coevolutionary forest additionally includes a weight table for each coevolutionary edge and each tree edge. These weight tables include the cost of each pair of labels at the two ends of the edge. In the case of tree edges, these weights reflect the probability of gaining/losing a copy number along the edge (Methods). For coevolutionary edges, these weights reflect the distribution of mutual occurrences of the labels of the nodes connected by the edge. As we discuss later, these tables can be further weighted to reflect our confidence in these two types of information (the evolutionary transitions vs. coevolution information).

Another basic term needed for the description of our method is the notion of a coevolutionary graph, which is an undirected graph that describes the coevolutionary relationships in the coevolutionary forest. In such a graph, each node corresponds to a tree in the coevolutionary forest, and two nodes are connected by an edge if there is at least one coevolutionary edge between their corresponding trees. For example, the coevolutionary graph corresponding to the coevolutionary forest in Figure 1A includes two nodes (as there are two evolutionary trees in the forest), and one edge that connects these nodes (as there are coevolutionary edges between the two evolutionary trees). A connected component in the coevolutionary forest is a subset of trees whose corresponding nodes in the coevolutionary graph induce a connected component (i.e., a set of nodes such that there is a path between each pair of nodes in the set; for example, in the coevolutionary forest that appears in Figure 1A the two trees induce a connected component). The coevolutionary edges used in this work are drawn from various sources of information: protein interactions, coexpression, proximity in the metabolic networks, and other measures of cofunctionality (see Methods). As it is not clear how to model the evolution of a network with many types of interactions, we decided to use the generalized MP as an objective function.

This leads us to the definition of the computational problem we are concerned with, the ancestral coevolution problem: Given a coevolutionary forest (a set of trees, a set of coevolutionary edges that connect pairs of nodes in these trees, and weight tables for each tree edge and each coevolutionary edge) we want to find labels for the internal nodes of all of the trees in the coevolutionary forest, such that the sum of the corresponding weights along all of the tree edges and the coevolutionary edges (the total cost) is minimal.

We designed an algorithm for solving this problem, the ancestral coevolver (ACE). The goal of ACE is to find labels for all of the nodes of all of the trees in a coevolutionary forest by optimizing the ancestral coevolution problem.

The sum of coevolutionary weights that are induced by choosing labels (a solution) for all of the internal nodes in a coevolutionary forest composes the coevolutionary score. In this work we normalize this score by dividing it by the minimal possible coevolutionary score. Similarly, the sum of tree weights that are induced by choosing labels (a solution) for all of the internal nodes in a coevolutionary forest is named the maximum likelihood or the maximum parsimony score (for a ML/MP problem, correspondingly). Also, in this case we normalized this score by dividing it by the minimal possible ML/MP score. By changing the weights assigned to the coevolutionary edges relative to those of the tree edges we can control the comparative influence of these two sources of information on the inferred labels. For sufficiently small weights of the coevolutionary edges, ACE will choose one of the solutions obtained by the common ML/MP approaches, solving the problem of multiple optima of ML/MP (the extreme case of the ancestral coevolution problem without coevolutionary edges describes a conventional MP/ML problem). On the other hand, using very large coevolutionary weights will result in a solution that is mostly influenced by the coevolutionary relations (see Supplementary Notes 1 and 2; Methods). We call such a weighting procedure either ML/CE weighting or MP/CE weighting (depending on the algorithm used).

In the next section we demonstrate three different ways for choosing the ML/CE weighting (see more details in Supplementary Notes 1, 2; Methods): The first is called the "conservative way": choose small enough weights for the coevolutionary edges such that the solution will be one of the ML/MP solutions. This approach should be used if one strongly believes in the conventional ML/MP approach and the corresponding tree edge weight tables (this approach is demonstrated in section The Simple Parsimony Case: A Classical Example of Multiple Maxima). The second is called the "weighted average": Choose the weights of the coevolutionary edges such that the weighted average of the coevolutionary score and the ML/MP score is optimized. If one has similar beliefs in the two sources of information, a simple average should be used; otherwise, a weighted average, reflecting the relative confidence in each of the information sources, should be used (this approach is demonstrated in section The ML Case: A Detailed Biological Case Study). The third approach is called the learning approach: In this case we "flip" a small fraction of the sites at the leaves and try to correct them by using the entire model. We chose the weights with the best performances (more details in the section The ACE Reconstructs Missing Values at the Leaves of the Evolutionary Tree Better Than ML/MP).

The general steps of the algorithm appear in Figure 2. The preprocessing steps required for generating the input to the ACE algorithm include (Fig. 2A): reconstructing a phylogenetic tree, inferring groups of orthologs, gathering functional/physical interactions between orthologs, and computing edge weights. Next, ACE is implemented (Fig. 2B). The algorithm has four main steps: (1) the coevolutionary forest is partitioned into smaller subforests; (2) an optimal labeling is inferred for each of these subforests; (3) the solutions are merged; and (4) a final step of greedy optimization is performed. A detailed description of the algorithm is presented in the Methods section, in Supplemental Figure 1, and in Supplementary Note 1. Note that an ancestral genome may include more genes than each of the genomes at the leaves (Supplementary Note 3).

Figure 2. (A) The preprocessing step: generating an input for ACE. The input includes an evolutionary tree with tree edge weights as well as coevolutionary edges with corresponding weights (see more details in Methods). (B) The ACE algorithm has four main steps: (1) partitioning of the coevolutionary graph to smaller subgraphs, (2) finding the optimal ancestral states in these subgraphs, (3) merging these subgraphs, (4) improving the solution greedily (see a detailed description in the Methods section and in Supplementary Note 1).

Here, we show an application of ACE within the MP framework. However, the optimization problem of inferring the ancestral states of a phylogenetic tree when the optimization criterion used is ML (e.g., see Pupko et al. 2000) under independently and identically distributed (i.i.d.) probabilistic models such as Jukes and Cantor (1969) (JC), Neyman (1971), or the model of Yang et al. (1995) can be formalized as a MP problem for a nonbinary alphabet with multiple edge weights (Sankoff 1975) (Supplementary Note 4). Thus, the ACE algorithm can be used for improving both ML and MP algorithms.

A comparative simulation study of ACE vs. MP/ML

To demonstrate the performance of ACE we designed a simulation incorporating coevolution, where the genes/proteins evolve along the corresponding evolutionary trees, but also coevolve with each other (Supplementary Note 5; Supplemental Fig. 2). We examined the performance of the ACE algorithm in comparison to the standard MP/ML approaches while varying the following parameters: (1) the amount of coevolutionary edges: the number of edges per node in the coevolutionary graph; (2) the introduction of errors in the weights of the tree edges and the coevolutionary graph; and (3) varying the probability of mutations along the tree edges (corresponding to the weights of the tree edges; Supplementary Note 5; Supplemental Fig. 2). The aims of the simulation are: (1) to demonstrate the robustness of the ACE algorithm to various parameters; (2) to compare the ACE approach to the conventional approaches (MP/ML) that do not use coevolutionary information (currently, there are no previous approaches that use coevolutionary information to infer genome content that can be compared to the ACE).

A summary of the simulation results is provided in Figure 3. First we investigated the effect of modifying coevolution levels, quantified by the mean number of edges per node in the coevolutionary graph. Figure 3 clearly demonstrates that ACE outperforms the MP/ML approaches, even at low levels of coevolution. Both the performance of ACE (Fig. 3A), as measured by the error rate (the frequency of ancestral states inferred incorrectly), and its superiority over the other approaches (Fig. 3C), increase with the level of coevolution. The error rate of ACE undergoes a decrease of more than 50% (to an error rate of just a few percent) when the level of coevolution increases from 0 to 5 (Fig. 3A). Note that this level of coevolution is reasonable in real biological terms: For example, there are more than five PPI interactions for each protein, on average, in the yeast PPI network (which includes 39,396 interactions and around 6400 open reading frames [ORFs]; see Methods). If we incorporate a number of sources of coevolutionary information (see above), the average degree in the resulting coevolutionary graph turns out to be much higher (e.g., the mean node degree [connections per protein] in the coevolutionary graph underlying the biological case analyzed in the next section was larger than 14).

Figure 3. Summary of the simulation results, reporting the mean over 10 runs using coevolutionary forests with 100 trees, each with 20 leaves and branch lengths in the range from 0.1 to 0.4. (A) The error rate (frequency of sites inferred with error, y-axis) vs. the level of coevolution (normalized by the number of trees, x-axis). The inset shows the typical behavior at very low levels of coevolution. (B) The error rate vs. errors in the weights of the tree edges. (C) Difference between the error rate of ACE and the error rate of MP or ML (in %) vs. level of coevolution. (D) The error rate vs. error in the coevolutionary graph, the latter denoting the mean number of coevolutionary edges changed between a node (representing an organism) and its descendant in the evolutionary trees; the ACE was based on the coevolutionary graph in one of the nodes. The inset shows the difference (in %) between the error rates vs. the error in coevolutionary edges. Note that the gap between ML and ACE decreases when the error in coevolutionary edges increases. (E) The error rate vs. mean weight of the tree edges. The inset shows the difference (in %). Note that the gap between ML/MP and ACE increases when the mean weight of the tree edges (the probability to gain/lose a protein) increases.

As can be seen in Figure 3B, ACE is more robust than the standard MP/ML approaches to errors in the weights of tree edges. ACE performs better at all error rates, but predominantly when the error level is particularly high. Moreover, ACE also performs better
than MP/ML when coevolutionary edges disappear/appear in evolution (this process can also reflect errors in the topology of the assumed coevolutionary graph, another possible source of error) (Fig. 3D; D'haeseleer and Church 2004). Finally, Figure 3E shows that a higher probability of gene gain/loss events along the tree edges (Supplementary Note 5) improves the performance of ACE compared with the other methods. Thus, we conclude that ACE outperforms the other methods, especially in the presence of some unreliable/noisy data, as typically found in large biological datasets.

Applying ACE to biological data

Using ACE we reconstructed ancestral genomes from a large set of organisms from the three domains of life, Bacteria, Archaea, and Eukarya. Figure 2 summarizes the different steps in generating the biological inputs (Methods). We used the 95 genomes of unicellular organisms represented in the COG database (Tatusov et al. 2003; Jensen et al. 2009) (see Supplemental Fig. 3). We studied a binary case where "1" represents the presence of a gene/protein and "0" represents its absence. Using the 4873 groups of orthologs that appear in COG, we represented each of the 95 genomes by a binary string having a length of 4873 characters. We inferred the ancestral genomes by using two main versions of edge weights (Methods): (1) binary MP; (2) binary ML (details about the results on additional versions of MP appear in Supplementary Notes 6 and 7 and Supplemental Fig. 4). In each case, we examined a range of ML/CE or MP/CE weightings and chose one of the MP/CE weightings according to the approaches mentioned before. As mentioned earlier, the two extreme cases are when the weight of the coevolutionary edges is very small relative to the tree edges and vice versa, when the weight of the tree edges is very small relative to the weight of the coevolutionary edges.

ACE reconstructs missing values at the leaves of the evolutionary tree better than ML/MP

At the first stage, to further demonstrate the advantages of ACE over conventional ML/MP we performed the following procedure:

(1) We randomly flipped 3%–5% of the values with coevolutionary relations in 3%–5% of the genomes, from absence to presence, or vice versa.
(2) We reconstructed the ancestral states of the coevolutionary forest based on the altered genomic contents.
(3) We then "fixed" the values at the leaves that were flipped by choosing labels that optimize the score of the coevolutionary network given the states inferred in step 2.
(4) Steps 1–3 were repeated seven times.

The mean error rates of the inferred values at the leaves for the above procedure and for different MP/CE or ML/CE weightings, MP and ML, as well as for different percents of flipping are provided in Table 1.

Table 1. Error rates for different ML/CE or MP/CE weightings, when used for correcting flipped labels at the leaves of the coevolutionary forest

                                ML/MP     ML/CE or MP/CE weightings                          Only
                                (i.i.d)   1       2          3       4       5               coevolution
Mean error rate; ML; 3% × 3%    0.072     0.072   0.037(a)   0.032   0.026   0.024(b)        0.0283
Mean error rate; ML; 5% × 5%    0.084     0.084   0.04(a)    0.034   0.03    0.026(b)        0.0305
Mean error rate; ML; 7% × 7%    0.107     0.107   0.054(a)   0.045   0.039   0.035           0.032(b)
Mean error rate; MP; 3% × 3%    0.067     0.058   0.049      0.043   0.037   0.026(a,b)      0.0283
Mean error rate; MP; 5% × 5%    0.096     0.08    0.06(a)    0.051   0.044   0.03(b)         0.0305
Mean error rate; MP; 7% × 7%    0.105     0.086   0.064(a)   0.057   0.048   0.031(b)        0.032

(a) The maximal decrease in the error rate.
(b) The minimal error rate.

As can be seen, using the coevolutionary information reduces the error rate by more than 50%, where usually most of the improvement is achieved in the case of the second weighting (whose ML/MP score is also relatively high). This analysis further supports the use of coevolutionary information in addition to conventional ML/MP approaches. Additionally, as evident, the approach described here can be used for selecting an optimal ML/CE weighting.

The simple parsimony case: A classical example of multiple global maxima

To study the effect of integrating coevolution data on MP gene content reconstruction, we began with the simple parsimony case, in which the penalty for losing/gaining a protein is identical along all tree edges. In this case, there are many possible solutions, each having the same parsimony score. This case illustrates how ACE can be utilized for choosing one optimum by adding the coevolutionary biological constraints (see Fig. 4A). Out of all possible solutions, ACE chose one whose score (33,525) is identical to the best score obtained using the MP algorithm. However, the inclusion of coevolutionary constraints resulted in a biologically different solution: The number of inferred proteins was either similar or higher in most of the internal nodes, and ACE added 34 proteins to the LUCA (see Fig. 4B). As MP does not consider the fact that in some of the tree edges the probability to lose/gain a protein is low, it tends to remove ancestral proteins; ACE partially fixes this problem by adding proteins to most of the internal nodes. The functions of the protein families inferred by ACE and not by MP in LUCA appear in Figure 4C. Gene families whose function is related to translation are overrepresented in this set (hypergeometric P-value = 1.97 × 10^-5; Methods); gene families whose function is related to amino acid transport and metabolism are underrepresented in this set (hypergeometric P-value = 0.0065; Methods).

Similar results were obtained for the other version of the inputs (e.g., the nonbinary and nonsymmetric versions; see Methods; Supplementary Notes 6, 7; Supplemental Fig. 4), demonstrating that ACE can be successfully applied to different versions of the input. As is evident, in all the cases ACE added dozens of new protein families to LUCA, suggesting that common ML/MP approaches tend to underestimate the number of proteins in LUCA.

Figure 4. (A) The coevolutionary score (dashed) and the simple maximum parsimony score (continuous) for different MP/CE weightings. The scores are normalized to the minimal solution (i.e., "1" denotes a minimal/optimal solution): By definition, at the first MP/CE weighting, the simple parsimony score is optimal/minimal, while at the fifth MP/CE weighting the coevolutionary score is optimal/minimal. (B) The difference between the number of proteins that ACE inferred (using the first MP/CE weighting) and the number of proteins inferred by simple MP for each internal node. (C) The function of the protein families inferred by ACE and not by MP in LUCA. Gene families whose function is related to translation are overrepresented in this set (hypergeometric P-value = 1.97 × 10^-5; Methods); gene families whose function is related to amino acid transport and metabolism are underrepresented in this set (hypergeometric P-value = 0.0065; Methods).

The ML case: A detailed case study

Figure 5 includes a summary of the results for the binary ML case. As can be seen in Figure 5A, the optimal labels inferred for each ML/CE weighting correspond to solutions that are very close (not more than 20% higher) both to the optimal coevolutionary solution and to the ML solution (for comparison, with 100 random labelings, the mean ML score is 1547% higher than the optimal score, and the mean coevolutionary score is 518% higher than the optimal coevolutionary score; P-value < 0.01 in both cases).

In the ML case, since there were many optima points with similar but not identical scores (as in the case of MP), the ACE solution for the first ML/CE weighting was identical to the ML solution. Thus, in this case, we examined the ML/CE weighting that optimized the sum of the two scores (short-dashed line in Fig. 5A). At a weighting level of 2, the two scores (CE and ML) are within 4% from their respective optima. This ML/CE weighting also seems the most appropriate for additional reasons: This ML/CE weighting reconstructed the gene content of LUCA to be an intermediate between the solution that only relies on coevolution and the ML solution (it adds 57 protein families to the ML reconstruction of LUCA, while the solution with extreme ML/CE weighting adds 98 to this reconstruction); thus, it represents a midway point between the two sources of information. Second, remarkably, this weighting gave the largest reduction of the error rate at the leaves of the biological data tree (see previous section).

Figure 5B depicts the difference between the number of proteins inferred by ACE and the number of proteins inferred by ML in each internal node of the evolutionary tree (see Supplemental Fig. 3). As evident, ACE added proteins to some of the nodes (e.g., LUCA and the ancestor of the Fungi), removed proteins from others (e.g., the subtree of the Archaea and the Firmicutes), and kept some nodes unchanged (e.g., the ancestor of the Cyanobacteria and the ancestor of the alpha-proteobacteria).

Figure 5. (A) The coevolutionary score (solid line), the maximum likelihood score (long-dashed line), and their mean (short-dashed line) for different ML/CE weightings. In all of the ML/CE weightings the two scores are <120% higher than their optimum/minimum. The second ML/CE weighting optimizes the mean of these two scores (see short-dashed line). (B) The difference between the number of proteins inferred by ACE and the number of proteins inferred by maximum likelihood for each internal node, using the optimal weighting. (C) The function of the protein families inferred by ACE and not by ML in LUCA. Gene families whose function is related to translation or intracellular trafficking are overrepresented in this set (hypergeometric P-values = 8 × 10^-5 and 0.02, respectively; Methods); gene families whose function is related to amino acid transport and metabolism and transcription are underrepresented in this set (hypergeometric P-value = 0.0001 and 0.038, respectively; Methods).

We turned to analyze the reconstructed gene content of LUCA, which has been the subject of much controversy in recent years (see Glansdorff et al. 2008 for a recent survey of about 200 papers studying LUCA, e.g., Penny and Poole 1999; Koonin 2003). One major open question concerning LUCA is the size of its genome. ACE finds a total of 743 inferred proteins in LUCA, which is still less than the mean number of protein families in the extant organisms analyzed (1265), but is considerably higher than the number of proteins inferred by the conventional MP method (only 587) or the ML method (686). More specifically, the inferred genome size of LUCA was larger than the genomes of 16% (15/95) of the organisms in the analyzed data set. Koonin (2003) suggested that LUCA had 500–600 genes, less than the great majority (96%) of the organisms in the analyzed dataset, while Ouzounis et al. (2006) suggested that LUCA had a relatively rich genome. Our results are more in line with the suggestions of Ouzounis et al. (2006), namely, that LUCA had a relatively rich genome that resembles the gene contents of extant organisms.

Fifty-seven gene families were added to LUCA by ACE compared with ML (Supplemental Table 1). Figure 5C depicts the functionality of the new proteins found by ACE. Functions that are underrepresented in the set of new proteins added by ACE are related to "amino acid transport and metabolism" and "transcription." On the other hand, most of the new protein families are related to "energy production" and to "translation and ribosome structure"; gene families from this last category are overrepresented in this set (hypergeometric P-value = 8 × 10^-5; Methods). The proteins from the first group are mainly composed of components of the F0F1-type ATP synthase, and the proteins from the second group are mainly ribosomal proteins. Interestingly, while F0F1-type ATP synthases are presently found primarily in aerobic bacteria and mitochondria, they are believed to be descended from ancient anaerobic enzymes, and thus their antiquity is highly plausible (Cross and Muller 2004). Strikingly, the additional ribosomal proteins that are conserved in extant bacteria, but absent in archaea and eukaryotes, were inferred to be encoded by LUCA. This may imply a bacterial-like LUCA from which archaea later diverged (Gogarten et al. 1989; Iwabe et al. 1989), rather than a parallel emergence of the two prokaryotic domains (Woese 1998). Furthermore, the presence of the methionyl-tRNA formyltransferase in LUCA implies that the bacterial-like ancestor already used formyl-methionine to initiate translation, and that this formylation was subsequently lost in Archaea and Eukarya as their translation system diverged from the bacterial one. Another interesting observation is that the preprotein translocase subunits SecF and SecD, which are absent in some bacteria and nonessential in others (Wooldridge 2009), are predicted to be ancestral, as is the YidC translocase subunit, which is involved in integration of membrane proteins (Serek et al. 2004).

Analysis under extreme ML/CE weightings

In this subsection we repeated the analysis reported in the previous section for the case where the ML/CE weighting is maximal (i.e., the weights of the coevolutionary tables are dominant). In this case, ACE finds a total of 784 inferred proteins in LUCA, more than have been reported in the previous cases; 98 (14%) of the proteins are added to the result of the conventional ML; see details in Supplemental Figure 5.

In this case, the prominent functions overlap those that have been reported in the previous subsection: Proteins related to translation are overrepresented in both cases, while protein families related to metabolism are underrepresented in both cases, showing that our previous conclusions are robust to the ML/CE weighting. A few additional proteins unique to the extreme weighting are nevertheless notable. These include, for example, several proteins involved in the synthesis of the bacterial peptidoglycan cell wall and many central proteins of the DNA replication and DNA repair machineries, including DNA polymerases, ligases, helicases, and a single-strand binding protein. Thus, if one were inclined to trust the extreme ML/CE weighting, one would conclude that LUCA already possessed advanced bacterial-like surface components and robust DNA repair and replication machineries. These conclusions further reinforce the view of LUCA as an organism not unlike extant bacterial species, both in the number of genes and in its physiological capabilities.

Discussion

In this study, we describe a new approach for reconstructing ancestral genomes. Our approach captures coevolutionary dependencies between different proteins and uses this information to disambiguate the gene contents of the reconstructed ancestral genomic contents. ACE is geared to tackle an inherent problem that plagues the MP and ML approaches, whose solution space tends to be populated with multiple maxima (e.g., see Fig. 1; Chor et al. 2000). We demonstrate the superiority of the new approach over existing methods in a simulated scenario where coevo-

orthologs physically or functionally interact according to the three sources of information described above (see more details below). (2) The two orthologs exhibit a pattern of co-occurrence that is significantly different from random in the genome of the organisms studied (in our case, the ratio between the highest and lowest probability in the co-occurrence distribution table is at least 4.25).
Theinitialsetofcoevolutionaryedgesincludedmorethan70,000 lutionary information is available: Using ACE with biologically edges,andafterthefilteringitincluded10,000edges. plausible levels of coevolutionary information reduces the error Theweightsinthetablesofthecoevolutionary edgeswere rateoftheancestralgenomereconstructionbymorethan50%. computed according to the co-occurrence probabilities of the WhenACEwasappliedtothestudyofancestralgenomes,our correspondingpairsofproteins. analysisshowedthatitcanfindsolutionswhoselikelihood/par- The following subsections include more details about the simony score is very similar or identical to the ML/MP scores. threesourcesofcoevolutionaryinformationusedinthiswork. Thesesolutions,however,areoftensignificantlydifferentfromthe ML/MP solutions and, since they incorporate coevolution data, aremoreplausiblefromabiologicalstandpoint.Accordingly,ACE TheproteininteractionnetworksofS.cerevisiaeandE.coli findsdozensofadditionalproteinsinLUCA,manyofwhichare TheproteininteractionnetworkofS.cerevisiaewasobtainedfrom ribosomalproteinsorpartoftheATPsynthase,complexesthatare recentlypublishedmanuscripts(Gavinetal.2006;Kroganetal. essentialforlife. 2006;Regulyetal.2006)andfrompublicdatabases(Xenariosetal. Our coevolution-based approach is presented here in the 2002; Christie et al. 2004). High-throughput mass spectrometry frameworkofancestralgenomereconstructionduetoitsimpor- data (Gavin et al. 2006; Krogan et al. 2006) was translated into tance for evolutionary biology and because coevolutionary in- binaryprotein–proteininteractionsusingthespokemodel(Bader formation can be readily obtained at the gene/protein level. and Hogue 2002). The final yeast protein–protein network in- However, its potential scope goes far beyond inferring ancestral cluded 39,396 interactions. The E. 
coli network was generated genecontent:ACEcanbeusedtotackleanygeneralproblemof basedonMorietal.(2000),Xenariosetal.(2002),andArifuzzaman ancestralsequencereconstruction,includingthereconstructionof et al. (2006). The final E. coli protein–protein network included different sites or domains in proteins, or the reconstruction of 16,756interactions.Ascoevolutionaryedges,weonlyconsidered individualsitesinDNAorRNAsequences(seeNollerandWoese edgesthatappearedinthetwoproteininteractionnetworks.E.coli andS.cerevisiaearetheonlyorganismsinourdatasetwhosePPI 1981; Gutell et al. 1986; Knudsen and Hein 1999; Lockless and networks have been reconstructed on a large scale. These in- Ranganathan1999;Pedersenetal.2006;YeangandHaussler2007; teractionsweremappedtothecorrespondingCOGsbymappingto Yeangetal.2007).Thesuccessofsuchfutureapplicationsdepends E. coli genes. In the final set of coevolutionary edges, 256 were on the existence of reliable coevolutionary information at the basedonthissourceofinformation. position/site level. Coevolution information may be obtained fromproximityinthethree-dimensionalstructureoftheprotein itself, from information on coevolution of amino acids sharing Themetabolicnetworksoftheorganismsanalyzed specificbindingsite(s)orinformationaboutbindingsitesshared WeusedtherepresentationofMaandZeng(2003)forconstructing betweeninteractingproteins.Finally,asalreadypartiallydemon- themetabolicnetworks,i.e.,twoenzymesareconnectedwithan stratedhere,theACEapproachcanbegeneralizedinthefuture edgeiftheycatalyzesuccessive(orthesame)stepsinametabolic to more complex reconstruction models, e.g., using nonbinary pathway(seetheIntroductionforthemotivationforusingadja- alphabets,dependencybetweenadjacentsites,andvariousprob- cencyasevidenceforcoevolution).Eachmetabolicnetworkwas abilisticmodels(Akerborgetal.2009). 
reconstructed by the following stages: First we parsed all data Additionally,weintendtodesignalgorithms(e.g.,algorithms pertainingreactions,compounds,andenzymesfromKEGGrelease thatarebasedonthebeliefpropagationapproach)thatmayim- 46(Kanehisa2002)andcreatedalistoftheexistingenzymesin provetheaccuracyandrunningtimeofthebasicACEalgorithm eachspeciesinourcollection,thereactionstheycatalyze,there- describedinthiswork. actionproductsandsubstrates,andtheirdirectionality. Themetabolicnetworkofeachorganismwasgeneratedfrom itslistofenzymesasfollows:Eachenzymeisrepresentedasanode Methods inthenetwork.LetE1=[e11,e11,..,e1n]denotethesetofenzymes thatcatalyzereactionR1,andE2=[e21,e21,..,e2n]denotethesetof Coevolutionaryinformation enzymesthatcatalyzereactionR2.IfaproductofR1isasubstrate The coevolutionary edges used here were gathered from three ofR2,thenundirectededgesareassignedbetweenallnodesofE1 sources of information: (1) Proximity in the protein interaction andallnodesofE2.EdgesarealsoassignedwithinE1nodesand networkofE.coliandS.cerevisiae(forexample,Wuetal.[2003], withinE2nodes.Ascoevolutionaryedges,weconsideredenzymes Satoetal.[2005],andJuanetal.[2008]reportedarelationbetween thatareconnectedinatleast10%ofthemetabolicnetworksin coevolutionandproteininteractions).(2)Proximityinthemeta- KEGGthatarelargerthan50%oftheE.coli’smetabolicnetwork. bolicnetworksoftheanalyzedorganisms(forexample,Spirinetal. ThesetofenzymesweremappedtoCOGgroupsbymappingtoE. [2006] and Zhao et al. [2007] reported the relation between co- coli and S. cerevisiae genes (these data were downloaded from evolution and proximity in metabolic networks). (3) Various KEGG)(Kanehisa2002).Inthefinalsetofcoevolutionaryedges, physicalandfunctionalinteractionsthatweredownloadedfrom 250werebasedonthissourceofinformation. String (Jensen et al. 
2009) (http://string.embl.de/; for example, ChenandDokholyan[2006]andTulleretal.[2009]reportedthe Othertypesofcofunctionality(theStringdatabase) relationbetweencoevolutionandsimilarfunctionality). Acoevolutionaryedgewasaddedtothemodelifthefollow- As an additional source of coevolutionary edges we used in- ing twoconditionswere satisfied:(1) The corresponding pairof formation downloaded from the String database (Jensen et al. GenomeResearch 129 www.genome.org Downloaded from genome.cshlp.org on February 22, 2023 - Published by Cold Spring Harbor Laboratory Press Tuller et al. 2009).Thisdatabaseincludesvarioussourcesofcoevolutionary/ Theancestralcoevolver(ACE)algorithm functional relations that are different from the two types of in- The input to ACE is a set of phylogenetic trees (a tree for each formation mentioned above. Specifically, it includes informa- protein)withthesametopologyandwithcoevolutionary edges tionongenomicneighborhood,coexpression,genefusion,coex- betweenpairsofinternalnodesofthetrees(correspondingtopairs pression,andmore.Eachcoevolutionaryrelationinthisdataset ofproteinthatcoevolve).ACEhasthreemainsteps(Fig.2;Sup- wasbasedonacompositescorethatisaweightedaverageofthese plemental Fig. 1): (1) By removing some of the coevolutionary sourcesofinformation(moredetailsaboutthedifferentcompo- edges, the input (set of trees and edges between them) is parti- nentofascorewerenotavailable).Mostoftheedges(98%)inthe tionedintosmallergroupsoftreessuchthatthereareedgesonly finalfilesetofcoevolutionaryedgeswerebasedonthissourceof betweenpairsoftreesthatareinthesamegroup.(2)Optimallabels information. (notconsideringtheremovededges)areassignedtotheinternal nodesofeachofthesesmallercoevolutionaryforestsbyanalgo- Thephylogenetictree rithmthatisageneralizationofalgorithmsforfindingancestral states by maximum likelihood or maximum parsimony (Fitch Thephylogenetictreeisbasedonthefollowingsourcesofinfor- 1971;Sankoff1975;Pupkoetal.2000);intotal,theselabelsare mation. 
The subtree related to alpha-proteobacteria was down- anapproximatesolutionfortheinputcoevolutionaryforest.(3) loadedfromBoussauetal.(2004).ThesubtreerelatedtoCyano- Finally,thesolutionisfurtherimprovedinagreedymanner. bacteria was downloaded from Shi and Falkowski (2008). The Webeginbybrieflydescribingstage2ofthealgorithmand subtree related to fungi was downloaded from Wapinski et al. thenexplainwhystage1isnecessary.Similarlytomanyalgorithms (2007). The reconstruction of other parts of the tree and the forcomputingtheoptimallabelsofinternaltreenodes(byMPor merging of all these trees were based on maximum likelihood MLcriterion)(Fitch1971;Sankoff1975;Pupkoetal.2000),oural- (DaganandMartin2007)andiTOL(LetunicandBork2007).The gorithmhastwophases:Inthefirstphase,ittraversesthecoevo- finalphylogenetictreeincluded95organismsfromthethreedo- lutionaryforestfromtheleavestotheroot;inthesecondphase, mainsoflife:eukaryotes(18organisms),prokaryotes(65organisms), ittraversesthecoevolutionaryforestfromtheroottotheleaves. andarchaea(12organisms).Thelistoforganismnamesandtheir However,ouralgorithmisperformedjointlyforallthetreesin taxai.d.appearinSupplementalTable2. eachconnectedcomponent. WeusedNeyman’stwo-statemodel(Neyman1971),aversion Lete=(i,j)denoteacoevolutionaryedgeoratreeedge.Let ofJukesCantor(JC)model(JukesandCantor1969)forinferring Wc =Wc denote the cost corresponding to assigning theedgelengthsofthetreebymaximumlikelihood.Thiswasdone e;ða;bÞ ði;jÞ;ða;bÞ aatnodei,andbatnodejwhere(i,j)isacoevolutionaryedge. by PAML (Yang 1997). These edge lengths correspond to the Similarly,ife=(i,j)isatreeedge,weuseWb =Wb tode- probabilities that a protein family will appear/vanish along the e;ða;bÞ ði;jÞ;ða;bÞ notethecostofassigningaandbatthetwonodesofe. correspondinglineage. 
Let S ðv(cid:2)Þ denote the optimal cost (the minimal sum of x(cid:2) weights)ofthesub-coevolutionaryforestthatcorrespondstothe Clustersoforthologs coevolutionarysubforestwhoserootsarev(cid:2),suchthatallthenodes insuchavectorofrootscorrespondtothesameancestralorganism Clusters of orthologs were gathered from various sources. COG mappingtomostoftheorganisms(83outof95)weredownloaded in the phylogenetic trees, and when assigning (cid:2)x in these roots. fromtheStringdatabase(Jensenetal.2009).Someofthefungiand Forexample,considerFigure1,theoptimalcostwhenconsider- thecyanobacteria(seeSupplementalTable2)donotappearinthe ing the coevolutionary forest whose roots are (cid:2)v=½x1;y1(cid:3), and Stringdatabase;theclustersinthemissingfungiweregeneratedin when assigning (cid:2)x=½1;1(cid:3) in these roots is 0.51 (due to the co- the following way: (1) We downloaded the data set of fungal evolutionaryedge). clustersoforthologsfromWapinskietal.(2007)andtheclusterof Inthefirstphase,thealgorithmtraversesthecoevolutionary cyanobacterial orthologs from Shi and Falkowski (2008). (2) We forestfromtheleavestotherootandcomputesthecostSx(cid:2)ðv(cid:2)Þfor consideredthefungi/cyanobacteriathatappearintheCOGdata eachsetofinternalnodes,v(cid:2),suchthatallofthesenodescorre- set and mapped each fungal/cyanobacterial cluster to the COG spondtothesameancestralorganisminallofthephylogenetic thathasatleast90%overlapwithit.(3)Afterthemapping,we trees,afterthecostsofthetwosetsofinternalnodesthatarede- reconstructedorthologclustersintheorganismthataremissingin scendantofv(cid:2)werecomputed(seeSupplementalFig.1B;Supple- COGfromtheircorrespondingclusterinthefungal/cyanobacte- mentaryNote1).Inthesecondphase,thealgorithmtraversesthe rialdataset.Thesmallestgenome(M.genitalium)intheanalyzed subforestfromtherootstotheleaves,andchoosesoptimallabels datasetincluded393genes(sumofcopynumbersfromeachCOG foreachsetofinternalnodes,giventheoptimallabelsofitscor- family), while the largest genome (B. 
japonicum) included 6162 respondingsetofparents(seeSupplementalFig.1B;Supplemen- genes(SupplementalTable3includestheCOGsdistributionsinall taryNote1). theanalyzedgenomes). Astherunningtimeofthealgorithmisexponentialwiththe sizeofthecoevolutionaryforest,thealgorithmhasaninitialstage (stage1),wheretheinputgraphispartitionedintosmallenough AnnotationofancestralproteinsandenrichmentP-values connectedcomponents.Thealgorithmusedatthisstage,Partite, TheannotationofancestralproteinswasbasedonCOGannota- recursively clusters the trees to groups according to the edge tions(Tatusovetal.2003)andtheGeneOntology(GO)annota- weights between them (thus minimizing the number of edges tions of S. cerevisiae and E. coli (http://www.geneontology.org/ betweendifferentclusters),suchthatthenumberoftreesineach index.shtml). clusteris smallenough(less thana predetermined parameterK; EnrichmentP-valuesfortheproteinsthatwereaddedbythe Supplemental Fig. 1C; Supplementary Note 8). We used hierar- ACEwerecomputedaccordingtotheCOGannotations.Foreach chical k-means (MacQueen 1967) for clustering (but obviously function,wecomputedanhypergeometricP-valuebasedonthe otherclusteringalgorithmscanbeusedtoo). totalnumberofCOGfamiliesthathavethisfunctioninLUCA,the TheinputtothePartitealgorithmisaweightedgraphwhose totalnumberofCOGfamiliesaddedbytheACE,andthenumber edgescorrespondtotheedgesinthecoevolutionary graph.The ofCOGfamilieswiththisfunctionthatwereaddedbytheACE. 
The weights of the graph edges can be any measure that represents the strength of the coevolution between the corresponding proteins (trees/subtrees) in a certain part of the evolution. We used the ratio between the maximal and the minimal values of the edges in the weight tables. The clustering parameter K induces a trade-off between accuracy and speed: A larger maximal cluster/subgraph size (K) increases the running time, but also increases the accuracy of the solution.

The final stage of the ancestral coevolver algorithm is a greedy stage (Supplemental Fig. 1F; Supplementary Note 1). In each step of the greedy algorithm, it considers all of the edges; then, it chooses an edge (tree edge or coevolutionary edge) and new labels for its ends in a way that improves the cost of the coevolutionary forest. In each step, the new labels are updated; the algorithm stops when it does not find new labels that improve the cost of the coevolutionary forest.

The total running time of the algorithm is exponential in the size of the largest connected component, but behaves quite linearly with the other parameters (including the number of trees). For detailed explanations see Supplementary Note 1. The typical running time for an input with several hundred trees and a largest connected component of eight trees is up to a few minutes. For inputs with several thousand trees and a largest connected component of size 9, the running time is a few hours. So, as long as one takes care to partition the coevolutionary graph into connected components of sufficiently small size in the preprocessing phase (which can always be done by a corresponding choice of a clustering algorithm), the algorithm can be run within a reasonable time frame.

The models of evolution used in the analysis of the biological inputs

We checked several models for the weights of the tree edges. In the binary parsimony case, there are two possible states for each character (i.e., for each COG): "0" denotes that the COG is not encoded in a genome, and "1" denotes that the COG is encoded in the genome. The penalty to switch from "1" to "0" and vice versa is identical (e.g., 1) in all the branches.

In the binary likelihood case, as in the binary parsimony case, there are two possible states for each character (COG). However, in this case, each tree edge, $(a, b)$, has a different length (gain/loss probability, $p_{a,b}$) that induces a different weight table (see the previous section about how $p_{a,b}$ can be computed). The penalty (the corresponding entry in the weight matrix) for a change from "0" to "1" or vice versa, in this case, is $-\log(p_{a,b})$; if there is no change, the cost is $-\log(1 - p_{a,b})$. These models appear in the main text. We decided to use models where the penalty for gain or loss of a gene is identical following the work of Mirkin et al. (2003), who showed that this is the most reasonable/suitable penalty in our context.

We also checked additional models whose results are reported in Supplemental Figure 4 and Supplementary Notes 6 and 7: In the nonbinary parsimony case, there are a few possible states for each character (COG). Each state is a positive integer that denotes the number of genes from a COG in a certain genome (note that only 6% of the states in the analyzed biological input are larger than "1"). Let $C_x$ denote the number of genes from a COG $C$ in node $x$. In the symmetric nonbinary case, the penalty for a change in the number of genes corresponding to a COG $C$ along the tree edge $(a, b)$ is $-\log(p_{a,b}) \cdot |C_a - C_b|$. In the nonbinary nonsymmetric case, the penalty for losing a gene from a COG $C$ along the tree edge $(a, b)$, where $a$ is the ancestor of $b$, is $-2\log(p_{a,b}) \cdot |C_a - C_b|$, while the penalty for gaining a COG is $-\log(p_{a,b}) \cdot |C_a - C_b|$.

In the nonbinary cases, we use binary coevolutionary edge weight tables ($C_x > 0$ or $C_x = 0$). This was done for two main reasons: (1) The coevolution between a pair of COGs usually corresponds to the relation(s) between any protein from one COG to any protein from the other COG. (2) Considering the limited number of analyzed organisms, a too-large coevolutionary table would include too many parameters (weights) that need to be estimated.

Weighting of the edge weights in the analysis of the biological inputs

In the biological analysis, we checked five specific values of the ML/CE weighting and MP/CE weighting. These values are described in Supplementary Note 2.

Acknowledgments

T.T. was supported by the Edmond J. Safra Bioinformatics program at Tel Aviv University and by the Yeshaya Horowitz Association through the Center for Complexity Science, and is currently a Koshland Scholar at the Weizmann Institute. U.G., M.K., and E.R. are supported by the James S. McDonnell Foundation. U.G. is also supported by the Bi-national Science Foundation. M.K. and E.R. were also supported by grants from the Israel Science Foundation and the Israeli Ministry of Science and Technology (M.K.).

References

Akerborg O, Sennblad B, Arvestad L, Lagergren J. 2009. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci 106: 5714–5719.
Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, Ara T, Nakahigashi K, Huang HC, Hirai A, et al. 2006. Large-scale identification of protein–protein interaction of Escherichia coli K-12. Genome Res 16: 686–691.
Bader GD, Hogue CW. 2002. Analyzing yeast protein–protein interaction data obtained from different sources. Nat Biotechnol 20: 991–997.
Barker D, Meade A, Pagel M. 2007. Constrained models of evolution lead to improved prediction of functional linkage from correlated gain and loss of genes. Bioinformatics 23: 14–20.
Barry D, Hartigan J. 1987. Statistical analysis of hominoid molecular evolution. Stat Sci 2: 191–210.
Blanchette M, Green ED, Miller W, Haussler D. 2004. Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res 14: 2412–2423.
Boussau B, Karlberg EO, Frank AC, Legault BA, Andersson SG. 2004. Computational inference of scenarios for alpha-proteobacterial genome evolution. Proc Natl Acad Sci 101: 9722–9727.
Chen Y, Dokholyan NV. 2006. The coordinated evolution of yeast proteins is constrained by functional modularity. Trends Genet 22: 416–419.
Chor B, Hendy MD, Holland BR, Penny D. 2000. Multiple maxima of likelihood in phylogenetic trees: An analytic approach. Mol Biol Evol 17: 1529–1541.
Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE, et al. 2004. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res 32: D311–D314.
Cross RL, Muller V. 2004. The evolution of A-, F-, and V-type ATP synthases and ATPases: Reversals in function and changes in the H+/ATP coupling ratio. FEBS Lett 576: 1–4.
D'haeseleer P, Church GM. 2004. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf 2004: 216–223.
Dagan T, Martin W. 2007. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci 104: 870–875.
Elias I, Tuller T. 2007. Reconstruction of ancestral genomic sequences using likelihood. J Comput Biol 14: 216–237.
Felder Y, Tuller T. 2008. Discovering local patterns of co-evolution. In RECOMB-CG (ed. CE Nelson, S Vialette), pp. 55–71. Springer-Verlag, Heidelberg, Germany.
Felsenstein J. 1993. PHYLIP (phylogeny inference package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle, WA.
Fitch WM. 1971. Toward defining the course of evolution: Minimum change for a specified tree topology. Syst Zool 20: 406–416.
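The tree-edge penalties given in the Methods for the different models of evolution can be written out as small functions. The following is a sketch that follows the formulas stated in the text (binary parsimony, binary likelihood, and the symmetric and nonsymmetric nonbinary cases); the function names and the argument order (ancestor first, descendant second) are assumptions made for the illustration, and `p` stands for the gain/loss probability $p_{a,b}$ of the edge.

```python
import math


def _check_probability(p):
    """Edge gain/loss probabilities must lie strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")


def binary_parsimony_penalty(parent_state, child_state, change_cost=1.0):
    """Binary parsimony: a fixed cost for any 0<->1 switch, zero otherwise."""
    return change_cost if parent_state != child_state else 0.0


def binary_likelihood_penalty(parent_state, child_state, p):
    """Binary likelihood: -log(p) for a change, -log(1 - p) for no change."""
    _check_probability(p)
    if parent_state != child_state:
        return -math.log(p)
    return -math.log(1.0 - p)


def nonbinary_symmetric_penalty(parent_count, child_count, p):
    """Symmetric nonbinary case: -log(p) * |C_a - C_b|."""
    _check_probability(p)
    return -math.log(p) * abs(parent_count - child_count)


def nonbinary_asymmetric_penalty(parent_count, child_count, p):
    """Nonsymmetric nonbinary case: a loss costs twice as much as a gain."""
    _check_probability(p)
    factor = 2.0 if child_count < parent_count else 1.0
    return -factor * math.log(p) * abs(parent_count - child_count)


if __name__ == "__main__":
    p_edge = 0.1  # hypothetical gain/loss probability for one tree edge
    print(binary_likelihood_penalty(0, 1, p_edge))
    print(nonbinary_asymmetric_penalty(3, 1, p_edge))
```

Because every penalty is a negative log-probability, summing the penalties along the tree edges and minimizing the total corresponds to maximizing the product of the edge probabilities.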
