ebook img

Reversing Gene Erosion—Reconstructing Ancestral Bacterial Genomes from Gene-Content and ... PDF

13 Pages·2012·0.22 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Reversing Gene Erosion—Reconstructing Ancestral Bacterial Genomes from Gene-Content and ...

Reversing Gene Erosion(cid:151)Reconstructing Ancestral Bacterial Genomes from Gene-Content and Order Data 1 2 1 JoelV.Earnest-DeYoung ,EmmanuelleLerat ,andBernardM.E.Moret 1 Dept.ofComputerScience,Univ.ofNewMexico,Albuquerque,NM87131,USA, joeled,[email protected] 2 Dept.ofEcologyandEvolutionaryBiology,Univ.ofArizona,Tucson,AZ85721,USA, [email protected] Abstract. Inthelastfewyears,ithasbecomeroutinetousegene-orderdatato reconstructphylogenies,bothintermsofedgedistances(parsimonioussequences of operations that transform one end point of the edge into the other) and in termsofgenomesatinternalnodes,onsmall,duplication-freegenomes.Current gene-order methodsbreakdown,though,whenthegenomescontainmorethan afewhundredgenes,possesshighcopynumbersofduplicatedgenes,orcreate edge lengths in the tree of over one hundred operations. We have constructed aseriesofheuristicsthatallowustoovercome theseobstacles and reconstruct edgesdistancesandgenomesatinternalnodesforgroupsoflarger,morecomplex genomes.Wepresentresultsfromtheanalysisofagroupofthirteenmodern(cid:13)- proteobacteria,aswellasfromsimulateddatasets. 1 Introduction Althoughphylogeny,theevolutionaryrelationshipsbetweenrelatedspeciesortaxa,is a fundamentalbuildingblockin muchof biology,it has beensurprisinglydif(cid:2)cultto automate the process of inferring these evolutionary relationships from modern data (usually molecular sequence data). These relationships include both the evolutionary distanceswithinagroupofspeciesandthegeneticformoftheircommonancestors. Inthelastdecade,anewformofmoleculardatahasbecomeavailable:gene-content andgene-orderdata;thesenewdatahaveprovedusefulinsheddinglightonthesere- lationships [1(cid:150)4].The orderandthe orientationin which genes lie ona chromosome changesveryslowly,inevolutionaryterms,andthustogetherprovidearichsourceof informationforreconstructingphylogenies.Until recently,however,algorithmsusing suchdatarequiredthatallgenomeshaveidenticalgenecontentwithnoduplications,re- strictingapplicationstoverysimplegenomes(suchasorganelles)orforcingresearchers toreducetheirdatabyequalizingthegenecontent(deletingallgenesnotpresentinev- erygenomeandall(cid:147)copies(cid:148)ofeachgene,e.g.,usingtheexemplar strategy[5]).The formerwas frustratingto biologistswantingtostudymorecomplexorganisms,while thelatterresultedindatalossandconsequentlossofaccuracyinreconstruction[6]. Ourgrouprecentlydevelopedamethodtocomputethedistancebetweentwonearly arbitrarygenomes[7,?]andanothertoreconstructphylogeniesbasedongene-content and gene-orderin the presence of mildly unequal gene content [6]. In this paper, we bringtogetherthesemethodsinaframeworkthatenablesustoreconstructthegenomes 2 Pasteurella multocida Haemophilus influenzae Yersinia pestis−CO92 Yersinia pestis−KIM Salmonella typhimurium Escherichia coli Wigglesworthia brevipalpis Buchnera aphidicola Vibrio cholerae Pseudomonas aeruginosa Xylella fastidiosa Xanthomonas axonopodis Xanthomonas campestris Fig.1.The13gamma-proteobacteriaandtheirreferencephylogeny[8]. ofthecommonancestorsofthe13modernbacteriashowninFig.1(from[8]).Gamma- proteobacteriaareanancient groupofbacteria,at least 500millionyearsold[9];the groupincludesendosymbiotic,commensal,andpathogenicspecies,withmanyspecies playinganimportantmedicaloreconomicrole.Theevolutionaryhistoryofthegroupis quitecomplex,includinghighlevelsofhorizontalgenetransfer[10(cid:150)12]and,inthecase ofB.aphidicolaandW.brevipalpis,massivelevelsofgeneloss.Thesefactorsmakea phylogeneticanalysisofthisgroupbothinterestingandchallenging. Therestofthispaperisorganizedasfollows.Section2presentstheproblem.Sec- tion 3 summarizes prior work on phylogeneticreconstructionfrom gene-content and gene-orderdata.Section4presentsourframeworkfortacklingtheproblemofancestral genomereconstructiongivenareferencephylogeny;itisitselfdividedintothreesub- sections,oneon eachofourthreemaintools: median-(cid:2)nding,contentdetermination, andgeneclustering.Section5discussesourapproachtothetestingofourframework: giventhatwehaveonlyonedatasetandthatancestralgenomesforthatdatasetareen- tirelyunknown,ourtestingwas ofnecessitybasedonsimulations.Section6 presents theresultsofthistesting. 2 The Problem Wephrasethereconstructionproblemintermsofaparsimonycriterion: Giventhegeneordersofagroupofgenomesandgivenarootedtreewiththese genomesat theleaves,(cid:2)ndgeneordersforthe internalnodesofthetreethat minimizethesumofalledgelengthsinthetree. Thelengthofanedgeisde(cid:2)nedintermsofthenumberofevolutionaryevents(permis- sibleoperations)neededtotransformthegenomeatoneendoftheedgeintothegenome attheotherend.Thepermissibleoperationsinourcaseareinversions,insertions(and duplications),anddeletions;alloperationsaregiventhesamecostincomputingedge lengths.Restrictingrearrangementstoinversionsfollowsfrom(cid:2)ndingsthattheinver- sion phylogenyis robust evenwhenotherrearrangements,such as transpositions,are 3 usedincreatingthedata[13].Ourassignmentofunitcoststoalloperationssimplyre- (cid:3)ectsinsuf(cid:2)cientbiologicalknowledgeabouttherelativefrequencyoftheseoperations. Inoursetting, one insertionmay addanarbitrarynumberof genesto a single lo- cationandonedeletionmayremoveacontiguousrunofgenesfromasinglelocation, a convention consistent with biological reality. Gene duplications are treated as spe- cialized insertionsthat onlyinsert repeats.Finally,oneach edgea genecan either be insertedordeleted,butnotboth;thesameholdsformultiplecopiesofthesamegene. Allowingdeletionandinsertionofthesamegenesonthesameedgewouldleadtobio- logicallyridiculousresultssuchasdeletingtheentiresourcegenomeandtheninserting theentiretargetgenomeinjusttwooperations. Finding internal labels that minimize edge distances over the tree has been ad- dressed byourgroup(cid:151)thisis the mainoptimizationperformedbyour softwaresuite GRAPPA [14]. However, even the most recent version of GRAPPA [6] is limited to relativelysmallgenomes(typicallyoforganellarsize,withfewerthan200genes),with modestly unequal content and just a few duplications. In stark contrast, the bacterial genomes in our dataset contain 3,430 different genes and range in size from 540 to 2,987genes,withsevencontainingover2,300genes;moreover,thesegenomescontain alargenumberofduplications,rangingfrom3%to30%ofthegenome.Thus,inour model,mostpairwisegenomicdistancesareverylarge:asimplepairwisecomparison alongthetreeofFig.1indicatesthatsomeedgesofthetreemustrepresentatleast300 events.SuchlengthsareatleastanorderofmagnitudelargerthanGRAPPAcanhandle. Thelargegenomesize,vastlyunequalgenecontent,largenumberofduplications,and largeedgelengthsallcombinetomakethisdatasetordersofmagnitudemoredif(cid:2)cult toanalyzethanpreviouslyanalyzedgenomesets. 3 PriorWork Athoroughrecentreviewofthecurrentworkinphylogeneticreconstructionbasedon genecontentandgeneorderappearsin[15];wereviewonlytherelevantpointshere. TheGRAPPAsoftwarepackage[16]computesinternallabelsintwophases.First, itinitializesinternallabelsofthetreebysomemethod.Thenititerativelyre(cid:2)neslabels until convergence:each newly labeled (or relabeled) node is pushed on a queue and, while the queue is not empty, the node at the head of the queue is removed, a new labelcomputedforit(bycomputingthemedianofitsthreeneighbors),and,ifthenew labelreducesthetotaldistancetothethreeneighbors,theexistinglabelisreplacedwith the improved label and the three neighbors are placed on the queue. Thus GRAPPA relies on the computation of the median of three genomes, that is, a fourth genome whichminimizesthesumofthenumberofoperationsneededtoconvertitintoeachof thethreegivengenomes.GRAPPA(cid:2)ndsoptimalinversionmedianswithanalgorithm that runs in worst-case exponential time, but (cid:2)nishes quickly when the edge lengths aresmall (10to40operationsperedge)[6,17].GRAPPA treats groupsofgenesthat occur in the same order and orientation in all genomes as a single genetic unit; this condensationstepreducescomputationalcostsanddoesnotaffectthe(cid:2)nalresult[18]. Ourgroupdevelopedamethodto(cid:2)ndthedistancebetweentwogenomeswithar- bitrarygenecontent[7,?];thismethodreliesonaduplication-renamingheuristicthat 4 matches multiple copies of genes between genomes and renames each pair and each unmatchedcopytoanew,unusedgenenumber.Thusarbitrarygenomesareconverted intoduplication-freegenomes.Weprovedthat,giventwogenomeswithunequalgene contentandnoduplications,anyoptimalsortingsequencecanberearrangedtocontain (cid:2)rstallinsertions,thenallinversions,and(cid:2)nallyalldeletions(cid:151)atypeofnormalform for edit sequences [7].(Deletions here are genes uniqueto the source genome,while insertionsaregenesfoundonlyinthetargetgenome.) Usingthegenomesproducedby the duplication-renamingmethod,an optimalinversionsequencecanbe calculatedin timequadraticinthesizeoftheconsensusgenomes[19,20].Thenumberofdeletions iscalculatedbycountingthenumberofHannenhalli-Pevznercyclesthatcontaindele- tions,asdescribedin[21].Finally,thenumberofinsertionsisestimatedbycalculating allpossiblepositionsinthesourcegenometowhichtheinversionsequencecouldmove insertions,thenchoosingthe(cid:2)nalpositionforeachinsertionthatminimizesthenumber ofgroupsofinsertedgenes. Insome genomes,especiallybacterial ones,geneswith similar functionareoften locatedtogetherononestrandofachromosome;thesefunctionalunitsarecalledoper- ons.Inbacteria,atleast, whiletheorderofgenesin anoperonmaychange,thegene contentoftheoperonis muchless likelytodoso[22].Ingene-orderdata,anoperon appearsas a cluster of genenumberswith thesame sign, with content,but notorder, preservedacrossgenomes.HeberandStoyedevelopedalinear-timecluster-(cid:2)ndingal- gorithmtoidentifytheseoperon-likeclusterswithinequal-contentgenomes[23]. McLysaght et al. [4] reconstructed ancestral genomes for a group of poxviruses; shedeterminedgenecontentbyassumingthatthephylogenetictreecontainedasingle pointoforiginforeachgenefamilyinthemoderngenomes.Eachpointoforiginwas assignedtothatinternalnodewhichminimizedthenumberoflosseventsnecessaryto achievethegenecontentoftheleafgenomes. 4 DesigninganAlgorithmicFramework Toaddresstheproblemofreconstructingancestralgenomesatthelevelofcomplexity ofgamma-proteobacteria,weusecondensationofgeneclustersinordertoreducethe sizeofthegenomes,describeaproceduresimilartothatofMcLysaghtetal.todeter- minethegenecontentofeveryinternalnode,andpresenta newheuristic to compute themedianofthreeverydifferentgenomes. 4.1 Medians Weusethequeue-basedtree-labelingheuristicdescribedinSection3.Sinceleavescon- taintheonlylabelsknowntobecorrect,weupdatethenodesinorderoftheirdistance fromtheleaves,as showninFig.2.Theheartofthetop-levelheuristicis themedian computation.Exactmedian-(cid:2)ndingalgorithmsarelimitedtosmallgenomes,smalledge lengthsinthetree,andfewchangesincontent(cid:151)andnoneofthesepropertiesholdsin ourproblem.Wethereforepursueasimpleheuristicinspiredbygeometry.Themedian ofatriangleintheplanecanbefoundbydrawingalinefromonevertextothemiddle oftheoppositesegment,thenmovingtwothirdsofthewayalongthisline.Byanalogy, 5 R 4 3 2 1 1 1 1 L L L L L L L L L Fig.2.Internalnodesorderedbytheirdistancefromtheleaves.Nodeswithlowerindiceswillbe labeled(cid:2)rst;nolabelisgeneratedfortheroot. wegenerateasortingsequencefromonegenometoanother(anedgeofthetriangle), thenchooseagenomehalfwayalongthissortingsequenceandgenerateanewsorting sequencefromittothethirdgenome,stoppingone-thirdalongtheway. We extend the method of Marron et al. [7] to enumerate all possible positions, orientations,andorderingsofgenesaftereachoperation.Deletedgenesattheendpoint ofaninversionaremovedtotheotherendpointifdoingsoavoids(cid:147)trapping(cid:148)thedeleted genes between two consensus genes that are adjacent in the target genome. Inserted genesaremovedsoastoremainadjacenttooneofthetwoconsensusgenesbetween which they lie in the target genome.We can thus generate the genomesproducedby (cid:147)running(cid:148)a portionofthe sortingsequence,thenuse these intermediategenomesfor themedianheuristicjustdescribed,allinpolynomialtime. Thishandlingofinsertedgenesleadstoanoverestimateoftheeditdistance,which Marron et al. showed at most doubles the number of operations [7]. Their original method calculates all possible positions in the source genome to which the inversion sequence could moveinsertions and chooses the (cid:2)nal position (for each insertion)to minimizethenumberofgroupsofinsertedgenes;itmayunderestimatetheeditdistance becausethegroupingofinsertedgenesmayrequireaninversiontojoininsertedgenes andsimultaneouslysplit deleted genes,which is not possible. We comparedpairwise distancesproducedbyourmethodandbytheirstogetanupperboundontheoveresti- mation:averageandmaximumdifferencesbetweentheoverestimateandunderestimate were11.3%and24.1%,respectively. 4.2 GeneContent Wepredeterminethegenecontentofeveryinternalnodebeforecomputinganymedian: oncethegenecontentofaninternalnodeisassigned,itremainsunchanged.Sincethe tree is rooted,we knowthe directionoftime (cid:3)owoneach treeedge;we also assume thatdeletionsarefarmorelikelythaninsertions,Thenumberofcopiesofeachgeneg isdecidedindependentlyofallothergenes;atinternalnodei,itissettothemaximum numberofcopiesofgfoundinanyoftheleavesini’ssubtreeif:(i)thereareleavesboth insideandoutsidei’ssubtreethatcontaing;or(ii)thereareleavescontaininggineach halfofi’ssubtree.Otherwisethenumberofcopiesofgeneginnodeiissettozero. ThisvaluecanbecalculatedinO(NG) time,whereN is thenumberofnodesin thetreeandGisthenumberofdistinctgenesinalltheleaves,asfollows.Foreachnode inthetree,wedeterminethemaximumnumberofcopiesofeachgenefromamongthe 6 0 0 2 0 1 0 0 0 2 1 0 Fig.3.Thenumberofcopiesofageneininternalnodes. leavesofthatnode’ssubtree,usingasingledepth-(cid:2)rsttraversal.Weuseaseconddepth- (cid:2)rsttraversaltosettheactualnumberofcopiesofeachgeneateachinternalnode.If eitheroftheroot’schildrenhasasubtreemaximumofzero,thenwesettheroot’sactual numbertozeroaswell.Foreachinternalnodeotherthantheroot,ifitsparent’sactual numberofcopiesiszeroandatleastoneofitstwochildren’ssubtreemaximumiszero, thenwesetthenumberofcopiesforthegenetozero;otherwisewesetthenumberof copiestothenode’ssubtreemaximumforthegene. Internalnodes will thus possess at least as manycopies ofa geneas the majority consensus of their neighbors’ gene contents. An internal node will always possess a copy of a gene if two or more of its neighbors do. (We consider the two children of the rootto be neighbors.) Moreover,if a nodeis the nearest commonancestorof all genomespossessingthegene,itmayhavemorecopiesofthegenethanitsparentand oneofitschildren,asinthecaseoftheblacknodeinFig.3.Thegenecontentofinter- mediategenomesalongsortingsequencesisaunionofthegenecontentsofthestarting genomes,becausethesortingsequenceofoperationsthatweusealwaysinvolves(cid:2)rst insertions,theninversions,and(cid:2)nallydeletions.Therefore,whencalculatingmedians fromsortingsequences,weface threecases in whichthenumberofcopiesofa gene differbetweentheintermediategenome,themediangenome,andthemedian’sparent(cid:151) see Fig.4. InFig.4a,the intermediategenomehas thesame numberofcopies as the median,butfewerthantheparent,aswiththeblacknode’srightchildinFig.3.Each copy in the parentthat is not matchedby the duplication-renamingalgorithmwill be excludedfromthe mediangenome.The case of Fig. 4b onlyarises whenthe median genome is the nearest common ancestor of all genomes containingthe gene in ques- tion,aswiththeblacknodeinFig.3.Genomesalongtheintermediatesequencehave thesamenumberofcopiesasthemedian,whiletheparentofthemediancontainsno >m 0 0. PSfragreplacements PSfragreplacements PSfragreplacements m m 0 m m (cid:20)m m m (cid:20)m 0 >0 >0 0 (a) (b) (c) Fig.4.Caseswherethemediananditsneighborshavedifferentnumbersofcopiesofagene.Solid 1 1 linesaretreeedges;dashedanddottedlinesarefractions( and ,resp.)ofsortingsequences. 2 3 7 copyatall.Finally,thecaseofFig.4ccanonlyarisewhentherightchildofthemedian isthenearestcommonancestorofallgenomescontainingthegene,aswiththeparent oftheblacknodeinFig.3. Biologically,thisprocessof(cid:2)ndingwhichduplicatestoincludeinthemediancor- responds to matching orthologous duplicates of each gene between genomes and to discarding unmatchedparalogousduplicates. Since the original nucleotidesequences are abstracted away beforethe analysis begins,this orthologmatchingis decided en- tirely on the basis of which other genes are located next to the different homologs. Fortunately, orthologs and paralogs that can be distinguished by a nucleotide-based analysisareassigneddifferentgenenumbersbeforeouranalysisbegins.Therefore,our methodrepresentsareasonablewaytointegratebothnucleotideandgene-orderdatain differentiatingorthologousandparalogoushomologsofgenes. 4.3 ClusterCondensation Toextractinformationfromlargerandmorecomplexbiologicaldatasets,weneedfast algorithmswithfastimplementations;fasterprocessingalsoenablesa morethorough analysisandthusproducesresultsofhigherquality.Thekeyfactorhereisthesizeof thegenomes(cid:151)theirnumberisamuchsmallerissue.Wethusdevelopedatechniqueto identifyandcondensegeneclustersinordertoreducethesizeofthegenomes.Ourap- proachgeneralizesthatusedingenomeswithequalcontent[23];incontrast,GRAPPA onlycondensesidenticalsubsequencesofgenes,becauseitaimstopreservetheidentity ofeditsequences.Ourmethodallowsthecondensationofclustersbasedonlyoncontent (notorder,atleastaslongasgenesstayonthesamestrand)andalsohandlesthedif(cid:2)cult casesthatariseoutofunequalgenecontent(suchasaninsertionwithinacluster). Toidentifyclusters,we(cid:2)rstusetheduplication-renamingtechniqueofMarronetal. tocreateduplication-freegenomes.Afterrenaming,weremoveanygenesnotpresent in all of the genomes under examination.This step creates a group of genomes with equalgenecontent.Wethenusethecluster-(cid:2)ndingalgorithmofHeberandStoye[23] to (cid:2)nd equivalent clusters of genes within the equal-content genomes. Once clusters are identi(cid:2)ed, each one is condensed out of the original genomes and replaced with a single marker (as if it were a single gene). In a set of genomes with unequal gene content, there can be genes inside a cluster that are not present in the corresponding equal-contentgenomes.Wedealwiththesegenesinoneoftwoways.Ifeveryoccur- rence of that gene is located inside the cluster in each of the genomes that possesses thegene,thenthegeneiscondensedalongwiththerestofthecluster.Otherwise,the extrageneismovedtoonesideoftheclusterandtheclustercondensed.Whena me- diangenomeiscomputed,amedianforeachclusterisalsocomputedandeachcluster’s marker in the median genome is replaced with the cluster’s median. At this point, if anyextragenesmovedtothesideoftheclusterarestillbesideit,theyaremovedback insidetheclustertoapositionsimilartotheiroriginalone. 4.4 PuttingItAllTogether Ancestral genomereconstructionsare performedusing these three main components. Initialization of the internal nodes of the tree is done from the leaves up by taking 8 eitherthemidpointoroneofthetwoendpoints(alongtheinversionportionofanedit sequence)ofaninternalnode’stwochildrenanddiscardinganygenesnotallowedby themediangenecontent.ThismethodaccountsforallthreeofthecasesinFig.4and produceslabelswiththedesiredgenecontent.Newmediansarecomputedlocallynode by node in a postorder traversal of the tree, so as to propagate information from the leaves towards the root. Whenever a median is found that reduces the local score at a node, it immediately replaces the previous label at that node; that node and all its neighborsarethenmarkedforfurtherupdate. 5 Testing Weusedourlabelreconstructionmethodonthebacterialdatasetaswellasonsimulated datasets. With simulated datasets, we know the true labels for the internal nodes as wellastheexactevolutionaryeventsalongeachedge,sothatwecantesttheaccuracy of the reconstruction. The goal of our experiments was to generate datasets roughly comparabletoourbiologicaldatasetsothatourexperimentalresultswouldenableus topredictarangeofaccuracyfortheresultsonthebiologicaldataset. ThesimulateddatawerecreatedusingthetreeofFig.1;edgelengthswereassigned tothetreebasedonourbestestimateoftheedgelengthsforthebacterialgenomes.To keepthedataconsistent,edgelengthswereinterpretedasthenumberofoperationsper generatherthanasanabsolutenumber,allowingustousethesamerelativevaluefor genomesof differentsizes. The tree was labeled by (cid:2)rst constructinga root genome. Thenumberofgenesgandthetotalsizenoftherootgenomeweresetasvariableuser parameters.Oneofeachgenefrom1tog wasaddedtotherootgenome,afterwhich n(cid:0)gadditionalgeneswerechosenuniformlyatrandomintherange[1;g]andadded totherootgenome.Therootgenomewasthenrandomlypermutedandeachgeneas- signedarandomsign.Theothernodeswerethenlabeledfromtherootbyevolvingthe genomesdownthetreeaccordingtotheprescribednumberofoperations.Theallowed operationswereinsertions,deletions,andinversions.Althoughthetotalnumberofop- erationswas(cid:2)xed,theproportionofeachofthethreetypesofoperationswasleftasa variableparameterbysettingtheratioofinversionstoinsertionstodeletions.Thismix ofoperationswasusedoveralledgesofthetree. Thecharacteristicsofeachtypeofoperationweredeterminedseparately.Thelength ofeachinversionwas chosenuniformlyat randombetween1andhalfthesize ofthe genome, with a start point chosen uniformly at random from the beginning of the genometo the size of the genomeminus the lengthof the inversion.The average in- sertionlengthwassetviaauserparameterasaportionofthesizeoftherootgenome andwasusedunchangedovertheentiretree,whiletheactuallengthofeachinsertion wasdrawnfromaPoissondistributionwiththisexpectationanditslocationwaschosen uniformlyatrandomfromthebeginningtotheendofthegenome.Inmovingfromthe roottotheleaves,itwasassumedthataparticulargenecouldonlybeinsertedalongone edgeofthetree(cid:151)multipleinsertionsofthesamegene,evenalongseparatepaths,were notallowed.Theaveragedeletionlengthwaschosenasauser-speci(cid:2)edportionofthe genomefromwhichgeneswerebeingdeleted,thusvaryingfromedgetoedgeaswell asalongeachedgewitheachsuccessivedeletion,whiletheactualsizeofeachdeletion 9 was drawnfroma Poisson distributionwith this expectationandwith a start location chosenuniformlyatrandomfromthebeginningofthegenometothesizeofthegenome minusthelengthofthedeletion.Withtheconstantexpectedinsertionlength,genomes growlinearlyintheabsenceofdeletions,while,withaproportionalexpecteddeletion length,genomesshrinkexponentiallyintheabsenceofinsertions.Whenbothinsertions anddeletionsareused,genomesfartherfromtheroottendtowardsastablesize.Along eachedge,theprescribednumberofinsertionsareperformed(cid:2)rst,theninversions,and (cid:2)nallydeletions.Onceall nodes havebeenbeenassignedgenomes,theresultingleaf genomesare fed into our reconstructionprocedure.The results of the reconstruction, intermsofgenecontentandgeneorderateachinternalnode,arecomparedwiththe (cid:147)true(cid:148)tree,i.e.,thatgeneratedinthesimulation. Weconstructedtreesusing(cid:2)vedifferentmodels:an(cid:147)inversion-only(cid:148)model,a(cid:147)no- deletions(cid:148) modelwith a 6:1inversion-to-insertionratio,a (cid:147)no-insertions(cid:148)modelwith a 6:1 inversion-to-deletionratio, a (cid:147)low-insertion/deletion(cid:148)model with a 40:4:1 ratio of inversions to deletions to insertions, and a (cid:147)high-insertion/deletion(cid:148) model with a 30:10:3ratioofinversionstodeletionstoinsertions.Theaverageinsertionlengthwas setto2%oftherootgenomeandtheaveragedeletionlengthto3%ofthelocalgenome. In order to test the ef(cid:2)cacy of cluster condensation, we tested the technique on triples amongthe bacterial genomesthat lie close to each otheron the tree in Fig. 1. Tripleswerechosenbyselectinginternalnodes,then,foreachofthethreeedgesleading outfromtheinternalnode,bychoosinganearbyleafreachablebyfollowingtheedge. Foreachsetofthreegenomes,wemeasuredthesumofthelengthsofallclustersthat werefound. 6 Results Our discussion and summaries of results refer to Fig. 5. Reconstruction of ancestral genomesforthebacterialgenomestakesaround24hoursonatypicaldesktopcomputer. Themidpoint-initializationprovedquitestrong:theonlygenomestobeupdatedinthe 14 6 Pasteurella multocida 7 15 Haemophilus influenzae 16 10 Yersinia pestis−CO92 4 8 17 3 8 Yersinia pestis−KIM 5 18 11 Salmonella typhimurium 7 9 19 2 4 Escherichia coli 2 20 9 Wigglesworthia brevipalpis 10 21 1 Buchnera aphidicola 5 Vibrio cholerae 3 1 Pseudomonas aeruginosa 12 Xylella fastidiosa 6 13 22 Xanthomonas axonopodis 11 23 Xanthomonas campestris Fig.5.Thebacterialtreewithnumberededgesandinternalnodes. 10 Table1.ErrorPercentageinTreeScores AvgerrorMinerrorMaxerror Inversiononly 63.2% 57.3% 67.4% Nodeletions 62.6% 54.8% 70.7% Noinsertions 45.2% 37.6% 54.3% Lowinsertion/deletion 56.4% 46.7% 64.8% Highinsertion/deletion 34.9% 25.1% 46.4% subsequentlocalimprovementprocedurewerethetwochildrenoftheroot(nodes1and 6inFig.5),theonlyneighboringgenomesinwhichoneneighborwasnotusedtocre- atetheother.Whenweusedendpoint-initialization,threeinternalnodeswereupdated (nodes3,4,and6inFig.5)andthescoreoftheentiretreewasloweredby2.8%.This (cid:2)ndingmayindicatethattheinitializationisverygood,butitmayalsore(cid:3)ectthelarge numbersoflocaloptimainthesearchspace(cid:151)asimilar(cid:2)ndingwasreportedforthesim- plerGRAPPA[18].Itshouldbenotedthat,whencalculatingmedians,onlyfourdiffer- entmidpointsinthechild-to-childsortingsequenceareused;fromeachofthesemid- points,onlythreemidpointsinthesortingsequencefromtheintermediategenometo theparentaretested.Thusweonlyperformaveryshallowsearchandcouldeasilymiss a better solution. Interestingly,though,when we did a slightly more thoroughsearch withtenmidpointsfromchildtochildandfourmidpointsfromintermediatetoparent, using endpoint-initialization, the tree score was slightly worse than in the shallower analysis,althoughthesearch,whichtookabout3.5timeslonger,updatedthesamethree internalnodes.Ofcourse,thislargersearchremainsveryshallow;goingbeyonditwill requireamuchmoreef(cid:2)cientimplementationoftheduplication-renamingheuristicof Marronetal.[7](cid:151)inourcurrentversion,itusesupover90%ofthecomputingtime. We simulated 100 labelings of the tree with a root genomesize of 200 genes for eachofthe(cid:2)vepreviouslydescribedscenarios.Endpoint-initializationwasusedinall scenarios.Theleafgenomesproducedinoursimulationsrangedinsizefrom70genes to 400 genes.We comparedthepredictedgenecontentof the internalnodes with the actual gene content.As expected(due to our restriction on generation),the predicted genecontentalwaysmatched,exceptwhenagenecopythatwaspresentataninternal nodewaslostinallleaves.Failuretodetectthiskindofmissinggeneisunavoidablein ananalysissincethedeletionfromallleavesmeansthatnohistoricalrecordis leftto attestthepresenceofthatgeneinancestralgenomes.Whenwecomparedthenumber of operations over all edges in reconstructed trees versus the original simulated tree, thescoreforthetreewasfairlyinaccurate,consistentlyoverestimatingthetruescore, asillustratedinTable1.Therathertightdistributionforthetreeindicatesthattheerror isnotarandomprocess,butaresultofsomeaspectofourreconstructionmethod,one thatmaylenditselftoreversemapping. We compared edge lengths in the reconstructed trees with those in the true trees bycalculatingtheratioofthelengthsforeachedge(Fig. 6).Aperfectreconstruction wouldgivearatioof1.0;asthe(cid:2)gureshows,mostratiosarehigher,withedgesfurther fromleaveshavinglargerratios(andalsolargervariances).Abouthalfofthe23edges arewithinafactoroftwoandanotherquarterarewithinafactoroffour.

Description:
Reversing Gene Erosion—Reconstructing Ancestral. Bacterial Genomes 4th Int'l IEEE Conf. on Bioengineering and Bioinformatics. BIBE'04, IEEE
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.