ebook img

Reconstruction of ancestral RNA sequences under multiple structural constraints PDF

12 Pages·2016·1.16 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Reconstruction of ancestral RNA sequences under multiple structural constraints

TheAuthor(s)BMCGenomics2016,17(Suppl10):862 DOI10.1186/s12864-016-3105-4 RESEARCH OpenAccess Reconstruction of ancestral RNA sequences under multiple structural constraints OlivierTremblay-Savard1,2,VladimirReinharz1andJérômeWaldispühl1* From14thAnnualResearchinComputationalMolecularBiology(RECOMB)ComparativeGenomicsSatelliteWorkshop Montreal,Canada.11-14October2016 Abstract Background: Secondary structures formthescaffoldofmultiplesequencealignmentofnon-codingRNA(ncRNA) families.AnaccuratereconstructionofancestralncRNAsmustusethisstructuralsignal.However,theinferenceof ancestorsofasinglencRNAfamilywithasingleconsensusstructuremaybiastheresultstowardssequenceswith highaffinitytothisstructure,whicharefarfromthetrueancestors. Methods: Inthispaper,weintroduceachARNement,amaximumparsimonyapproachthat,giventwoalignments ofhomologousncRNAfamilieswithconsensussecondarystructuresandaphylogenetictree,simultaneously calculatesancestralRNAsequencesforthesetwofamilies. Results: Wetestourmethodologyonsimulateddatasets,andshowthatachARNementoutperformsclassical maximumparsimonyapproachesintermsofaccuracy,butalsoreducesbyseveralordersofmagnitudethenumber ofcandidatesequences.Toconcludethisstudy,weapplyouralgorithmsontheGlmclanandtheFinP-traJclanfrom theRfamdatabase. Conclusions: Ourresultsshowthatourmethodsreconstructsmallsetsofhigh-qualitycandidateancestorswith betteragreementtothetwotargetstructuresthanwithclassicalapproaches.Ourprogramisfreelyavailableat: http://csb.cs.mcgill.ca/acharnement. Keywords: RNA,Secondarystructure,Ancestorreconstruction,Evolution,Phylogeny,Algorithm Background For a long time, most of the attention has been With the development of sequencing technologies given to the reconstruction of ancient protein and DNA emerged the need to elucidate the relationship between sequences, while RNA molecules remained relatively sequences from various organisms. The reconstruction overlooked. Nonetheless, in the last 20 years, the dis- of ancestral sequences, which aims to reveal the chain covery of the breadth of catalytic and regulatory func- of events that led to the diversity of sequences observed tions carried by RNA molecules revived our interest for today,becamenaturallyoneofthecorechallengesinthis the RNA world hypothesis [5], and resulted in increas- field of research. Since the first attempts to rigorously ingeffortstowardabetterunderstandingoftheintricate solve this problem [1], the methods and quality of the natureofmutationalpatternsinRNAs[6–11]. data have considerably improved, to the point where the The reconstruction of non-coding RNA (ncRNA) reconstructionofancientgenomesisnowfeasible[2–4]. sequences is particularly challenging. Indeed, ncRNA functions are typically carried out by specific molecu- lar structures, and consequently sequences are generally *Correspondence:[email protected] lessconservedthanstructures[12].Thisimpliesthatded- 1SchoolofComputerScience,McGillUniversity,H3A0E9Montreal,Canada icated frameworks must be developed to capture this Fulllistofauthorinformationisavailableattheendofthearticle structuralinformation. ©TheAuthor(s).2016OpenAccessThisarticleisdistributedunderthetermsoftheCreativeCommonsAttribution4.0 InternationalLicense(http://creativecommons.org/licenses/by/4.0/),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedyougiveappropriatecredittotheoriginalauthor(s)andthesource,providealinktothe CreativeCommonslicense,andindicateifchangesweremade.TheCreativeCommonsPublicDomainDedicationwaiver (http://creativecommons.org/publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle,unlessotherwisestated. TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page176of186 RNAfoldingishierarchical.Secondarystructuresform single family will have a tendency to produce ancestors rapidly and act as a scaffold for the slower formation near the core of this network. This bias may result in of tertiary structures [13]. It follows that the stability ancestral sequences potentially far from the first ances- of secondary structures provides us a relatively accurate tor who acquired the function (i.e. the structure). In signature of the molecular function [14], and thus can other words, this first ancestor is likely to be a worse be used to guide the reconstruction of ancestral ncRNA fittothefunctionalstructurethansequencesatthecore sequences. of the neutral network. By contrast, in this paper we To date, the most promising approach to infer ncRNA adoptaradicallydifferentapproach.Weproposehereto ancestorshasbeenproposedin2009byD.BradleyandI. solvethisproblemsimultaneouslyfortwoncRNAfamilies Holmes,whointroducedanalgorithmtocalculateances- that share a common ancestor (See Fig. 1). This strategy tral RNA secondary structures from an alignment [15], enables us to make a better estimation of the location of andusethesestructurestoinferancestralsequencesusing the duplication event at the origin of the two families in amaximum-likelihoodapproachonstochasticgrammars the sequence landscape, hence to make a more accurate [16].Still,thetimecomplexityofinferringancestralstruc- inferenceoftheancestorsofeachfamily. tures can be prohibitive, and the specificity of the func- Our approach is as follows. Given two alignments of tional structure may not accommodate sufficiently large homologous ncRNA families with consensus secondary variationsofthis(secondary)structuretotakeadvantage structures and a phylogenetic tree, we design a max- ofthismodel. imum parsimony algorithm to simultaneously compute Covariationmodelsarepowerfulframeworkstomodel ancestral RNA sequences for both families. We test this families of structured RNA sequences [17–19], allowing methodology on simulated data sets, and compare the us to capture dependencies between distant sites. Nev- results to classical (structure-free) maximum parsimony ertheless, we argue that the reconstruction of ancestral approaches [21, 22], as well as to a customized maxi- RNA sequences of a single ncRNA family with a sin- mum parsimony algorithm integrating the constraints of glesecondarystructureusingacovariationmodelcanbe asinglestructure.Finally,weapplyourtechniquestothe hazardous.Indeed,currentsequencesaremostlikelyuni- reconstructionofancestralsequencesoftwoClans(Glm formly distributed on the entire neutral network of the [23] and FinP-traJ [24]) from the Rfam database [25]. functionalstructure[20](i.e.regionsofthesequenceland- ClansareRNAfamiliesthat“shareacommonancestorbut scapewithagoodaffinitytothefunctionalstructure),and aretoodivergenttobereasonablyaligned”[26],andthus a strategy aiming to accommodate constraints within a illustratewellthesignalweaimtocapture. Fig.1Ourapproach.Left:Theredandblueareasrepresentregionsofthesequencelandscapeofsequenceswith“good”affinity(i.e.sufficientto carrytheassociatedfunction)tothetargetstructuresS(red)andS(cid:2)(blue).Here,αandα(cid:2)areparalogoussequences,aswellasβandβ(cid:2),γ andγ(cid:2) andδandδ(cid:2).Usingclassicalreconstructionapproaches,Awouldbetheinferredancestoroftheorthologoussequencesα,β,γ andδ,andA(cid:2) wouldbetheinferredancestoroftheorthologoussequencesα(cid:2),β(cid:2),γ(cid:2)andδ(cid:2).Shadedtreesrepresenttheclassicalancestralreconstructions performedseparately,whilethemaintreerootedatAA(cid:2)representsthesimultaneousancestralreconstructionapproachintroducedinthis contribution.Therationaleofthisworkisthatancestorsinferredfromasinglefamilyandstructuremayhaveatendencytobelocatedinthecoreof theaffinityregions,andmightendupwithancestralsequencesthatwouldbehardtoreconcile.Bycontrast,asimultaneousreconstructionof orthologousfamiliesensuresthecoherencyoftheprocessandabetterinferenceoftheancestors(whicharenotnecessarilylocatedinthecoreof theaffinityregions).Right:AnexampleofaspeciestreeT(dashedlines)offourspeciesA,B,(cid:6)and(cid:7)correspondingtotheneutralnetworksshown ontheleft.Aduplicationeventisshownattheroot,creatingthetwoncRNAfamilies(representedbycoloredlines).Eachnodeofthespeciestree containsacopyofeachncRNAfamily(onered,oneblue).AttheleavesofthespeciestreeT,wefindthetwoextantncRNAsforwhichwehavethe sequenceandthestructureinformation.ThelineargradientGisalsoshown:itrepresentstheweightthatisgiventoeachstructurewhen calculatingthecosts(Gforonestructureand100%-Gfortheother) TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page177of186 Ourresultsonsimulateddatasetsshowthatourstrat- evolution. Since we have only access to the 2D struc- egyimprovestheaccuracyofthereconstruction.Onreal ture of the two extant families (and not the ancestral data sets, our approach compares favorably to PAML, a 2D structure), our model considers both of these struc- state-of-the-artmaximumlikelihoodmethodthatconsid- tures during the inference process. Near the root of the ersonestructureatatime,andcustomizedversionsofthe speciestree,ourmodelsuggeststhatthesequenceswere FitchandSankoffalgorithms.Inparticular,ourdatashows stilllikelytobeabletofoldintobothstructures.However, thatoursolutionshaveabetteragreementtothetwotar- astimepasses,eachncRNAstartstospecializeintoonly get structures than the sequences inferred with previous onestructureandlosesaffinitytotheother.Werepresent methods. Importantly, we achieve all these results with thatgradualtransitionintoourmodelusingagradientG a significantly smaller set of candidate ancestors, which which varies from 50 % (near the root) to 100 % (near improvestheinterpretabilityofourdata. theleaves).Thisgradientisgoingtobeusedinouralgo- Our algorithms have been implemented in a software rithmtocalculatethe“weight”thateachofthestructures named achARNement and are freely available at http:// must have in the global score of the inferred ancestral csb.cs.mcgill.ca/acharnement. sequences. We developed two novel algorithms, implemented in Methods a package called achARNement and freely available at Inputdata http://csb.cs.mcgill.ca/acharnement. For the algorithms presented in this paper, we assume thatwehavetwonon-codingRNAfamiliesthathavebeen Algorithms identifiedasaclan[26].ForeachofthetwoncRNAfam- Weproposeanewtool,achARNement,composedoftwo ilies, we have the consensus 2D structure it folds into. exact algorithms (CalculateScores-1struct and Wealsohavea setofspeciesthateachpossessonecopy CalculateScores-2structs) based on the Fitch of both ncRNAs (one of each family), and a species tree [21] and Sankoff [22] parsimony methods for the infer- T thatrepresentsthespeciationhistoryoftheorganisms enceofancestralsequencesinaphylogeny(notethatwe considered.WehavethesequencesofthetwoncRNAsfor arefocusingontheinferenceofancestralsequences,and eachofthestudiedspecies.Figure1illustratesanexample not only on the calculation of parsimony scores). Our ofaspeciestree. algorithms use a three-step approach (see Fig. 2): (i) a bottom-upstepinwhichminimalcostsforeverypossible Problemstatement nucleotide at every site are calculated, (ii) a middle step Given the input data described in the previous subsec- wherewelinktheminimalcostmatricesforbothfamilies tion, the problem is to infer a most parsimonious set of attherootofthephylogeny,and(iii)atop-downstepthat ancestral sequences for each of the two ncRNAs at each enumeratesalltheoptimalsequencesbasedonthecalcu- ancestral node of the input species tree. Although this lated costs. Our algorithms have the same running-time is a very classical problem in comparative genomics, our complexity than the Sankoff algorithm (O(Nk), where N goal is to achieve that using a new evolutionary model is the number of nodes andk is the sequence length); thatsimultaneouslyconsiderssequenceand2Dstructure theonlydifferencebeingaconstantnumberofadditional information,asdescribedpreviously. calculations that depends on the basepairs in the two structures. Evolutionarymodel For the substitutions, we use a cost matrix that has a Our evolutionary model assumes that the two ncRNA different weight for transitions and transversions, since families are the result of an ancient duplication of an transversions normally occur less frequently than transi- ancestral ncRNA that was able to fold into two different tions (see Table 1). In addition to the substitution cost, structures.Followingtheduplication,asubfunctionaliza- we also consider a basepair cost, as shown in Table 2. tion process took place: a series of neutral mutations The basepair cost is 0 for the G-C basepairs and 0.001 occurred and gave rise to both extant families that can for the A-U basepairs, that are not as strong as G-C only fold into one specific structure (see Fig. 1). Here, basepairs. Compared to an A-U basepair, a G-U pair weassumethattheancestorofallstudiedspeciesalready costs twice as much, while all the others are penalized possessed both ncRNAs, but that the duplication event three times as much. We have also experimented with occurred not too long before that (near the root of the a more complex scoring system for the basepairs, one speciestreeT representingthestudiedorganisms).Only that reflects the geometry and isostericity of the base- point mutations are allowed in our evolutionary model pairs.WeperformedtestsusingtheIsoDiscrepancyIndex (noindels). (IDI) table, as described in [27]. However, since this As mentioned earlier, ncRNA sequences are more tablerepresentsatransitionfromtheinitialbasepairtoa constrainedbytheirstructurethantheirsequenceduring mutatedbasepair,morecalculationswererequiredbyour TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page178of186 Fig.2GraphicalrepresentationofthealgorithmCalculateScores-2structs.Inthisexample,wehavefourspecies(A,B,CandD)andfor eachspecies,wehavetwoextantRNAs(forfamily1,inred,andfamily2,inblue).Thethreemajorstepsofthealgorithmarepresented.1)The bottom-upstep,whereminimumscoresarecalculatedateverynodeofthetreeforeachfamily.Thescorestakeintoaccountthesubstitutions,but alsothebasepaircostforthecurrentfamily,andfortheotherfamily.2)Themiddlestep.Herewelinktheminimumscorematricesforfamilies1and 2bydoingasimpleFitchonthetwomatrices.Thisallowsustoreconstructtheoriginalancestralsequences(beforetheduplication),takinginto accountbothfamilies.3)Thetop-downstep,wherewestartfromtherootandselectthenucleotidesofminimumcostateverypositionand constructtheoptimalsequences CalculateScores-2structsalgorithm.Theresults consider50%ofthestructureoffamilyf and50%ofthe ¯ obtainedwiththeIDItableandoursimplertable(Table2) structure of family f. We use a linear gradient to set the were very similar (results not shown), but at the cost of valuesofGonthedifferentdepthsofthetree(from50to a 4-fold increase in computation time. Consequently, we 100%). decidedtousethesimplertable. Forspacereasons,thefulldescriptionofthealgorithms Thedifferencebetweenthetwoalgorithmswepropose wasplacedintheAdditionalfile1.Inthefollowingpara- (CalculateScores-1struct and Calculate- graphs,anoverviewofthealgorithmswillbepresented. Scores-2structs) resides in the first step (bottom- Bottom-upstep up), where we calculate the minimal costs for every The first step of the algorithms consists of doing a possible nucleotide at every site. Let f be one of the ¯ post-order traversal of the species tree (as shown in two families and f be the other one. When calculating Algorithm 1, Additional file 1), to calculate the most the costs for family f at the internal node a, algorithm parsimoniouscostsforeachpossiblenucleotideateverysite. CalculateScores-1struct considers only the In the following paragraphs, we explain the dif- structure associated with family f. On the other hand, ferences between CalculateScores-1struct and algorithm CalculateScores-2structs considers CalculateScores-2structs in the calculation of both structures, but with a weight G that varies along thosecosts. the depth of the tree. For example, at the level of the leaves,forthefamilyf,weconsider100%ofthestructure CalculateScores-1struct: Let a be the ¯ i f and 0 % of the structure f. At the level of the root, we nucleotide at position i in the parent (ancestral) node, Table1Nucleotidesubstitutionmatrix Table2Basepaircostmatrix A C G U A C G U A 0 2 1 2 A 0.003 0.003 0.003 0.001 C 2 0 2 1 C 0.003 0.003 0 0.003 G 1 2 0 2 G 0.003 0 0.003 0.002 U 2 1 2 0 U 0.001 0.003 0.002 0.003 EachcellCi,jrepresentsthecostofmutatingthenucleotideiintoj EachcellCi,jrepresentsthecostofhavingthebasepairi-j TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page179of186 a¯ be the nucleotide that is paired with a in the current Sincethatposition(e.g.position#6inFig.3a)isnotnec- i i structure,l (resp.r)bethenucleotideatpositioniinthe essarily fixed, we consider an average basepair cost over i i left (resp. right) child, and l¯ (resp. r¯) be the nucleotide allpossiblenucleotidesatthatposition. i i thatispairedwithl (resp.r¯)intheleft(resp.right)child Themorecomplexcaseiswhenpositioniispaired. i i inthecurrentstructure. Inthiscase,wealsohavetocheckforthepositionpaired ¯ In the case that the position i is part of a basepair withiintheotherstructure(seeFig.3bandc). in the current structure, the cost of having a specific Moreprecisely,usingthesamedefinitionsasabovefor dinucleotidea,a¯ isequalto: CalculateScores-1struct, and considering that i i ¯ positioniispairedinbothstructures(andpositionitoo), min {c(li)+s(li,ai)+c(l¯i)+s(l¯i,a¯i)+bpc(ai,a¯i)} thecostofhavingaspecificdinucleotideai,a¯iisequalto: l+iandl¯i∈{Am,C,iGn,U} {c(ri)+s(ri,ai)+c(r¯i)+s(r¯i,a¯i)+bpc(ai,a¯i)} Eq.(1)(cid:2)we⎛ightingbpc(ai,a¯i)byG(cid:3)+(1−G) ⎞ riandr¯i∈{A,C,G,U} (cid:6) (cid:6) (1) ∗⎝ bpc(ai,nc)/4+ bpc(a¯i,nc)/4⎠ nc∈{A,C,G,U} nc∈{A,C,G,U} where c(x) is the previously calculated optimal cost of (2) having the nucleotide x, s(x,y) is the cost of substituting nucleotide x for y and bpc(x,y) is the cost of having the Notethatitispossibletogetcyclesof“interdependent” basepair (x,y). In the other case where i is not part of a positions when considering both structures. As you can basepair,wesimplycalculatethesubstitutioncosts. observeinFig.3c,positions4and6arepairedtogetherin fam1.Infam2,position4ispairedwith3,whichispaired CalculateScores-2structs: As mentioned ear- withposition7infam1.Finally,position6ispairedwith lier,CalculateScores-2structstakesintoaccount position8infam2.Thusallofthosepositionsare“interde- bothstructures,usingaweightG.Calculatingthecostson pendent”.Tosimplifythealgorithm,insteadofconsidering theleftandrightbranchesisalittlebitdifferentdepending thecompletecycles,wechosetostopatone“level”,thatis onifwearedealingwithapairedpositionoranunpaired lookingonlyatonepairedpositionintheotherstructure one. The general idea is that for each position i, we are foreachpositioninthefirststructure. going to measure the cost of the basepair formed with Oncethecostsarecalculatedforeverysiteateverynode ¯ positioni(ifitexists)inthestructureofthecurrentfam- ofthespeciestree,wecansimplydothemiddleandtop- ily,andwearealsogoingtoconsiderthepositionspaired downsteps. ¯ with both i and i in the other structure. Note that each position can be paired to two different positions in the Middleandtop-downsteps twostructures;wewillfocusonthatcasehere,becauseif Thetop-downstepisthepartwherewestartfromtheroot thebasepairsarethesameinbothstructures,thenwedo ofthetree,weselectthenucleotidesofminimumcostat notneedthegradientGandsimplyconsider100%ofthe everypositionandconstructtheoptimalsequences.Once basepaircost.Figure3showsthreeexamples. all the optimal sequences are enumerated at an internal The simpler case is when the position i is unpaired. node of the tree, we go down in the tree and enumer- Then, only the position paired with i in the other struc- ate the optimal sequences that gave rise to them in the ture needs to be considered, if it is paired (see Fig. 3a). child nodes and so on. Algorithm 6 (Additional file 1) a b c Fig.3Threeexamplesofthepositionsthatneedtobeconsideredwhenusinginformationfrombothstructures.Notethatinthoseexamples,we considerthatweareworkingonthesequenceoffamily1,andfam1andfam2representthe2Dstructuresoffamily1and2respectively.aThe easiercasewhentheposition(8here)isnotpairedinfam1,andweonlyhavetoconsiderthepositionpairedwithitinfam2.bThecasewhereonly oneofthetwopairedpositionsoffam1ispairedinfam2.cThecasewherebothpairedpositionsoffam1arepairedinfam2 TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page180of186 describesthisprocess.Notethatbeforestartingtoselect Thesequencesforeachinternalnodearegeneratedina the nucleotides at the root, we do a simple Fitch (algo- top-downfashion.Givenanodew,itssequences(w1,w2), rithm not shown here) on both cost matrices of family itstwochildnodesc1,c2,amutationprobabilityα,anda 1 and 2. This middle step is necessary to make the link substitution matrix β. From(w 1,s ) (resp (w2,s )), a set 1 2 between the two families, i.e reconnect both matrices W ofathousandmutantsofw1isgeneratedasfollows. of optimal scores, and reconstruct the original ancestral Each sequence w in W is created by applying a prob- i sequences(beforetheduplication). ability of mutation α to each position in the original sequencew1. Generalizingtomorethantwofamilies Each nucleotide x can be substituted to {A,C,G,U} \ This problem can easily be generalized to F > 2 struc- {x} following the distribution β(x). We used forβ : turefamilies,aslongaswemaintainthesameassumption P(A↔G) = P(C↔U) = 50%, all others are set thatallancestorsrepresentedinthetreepossessonecopy to 25%. We used those probabilities for the mutational ofeachF numberofncRNAs.Theonlypartofthealgo- events based on the observation [30] that transitions are rithmthatwouldchangeisthebottom-upstep:itwould morefrequentthantransversions. be similar to CalculateScores-2structs, except We define a free energy E(w,s) as the base pair dis- that we would be considering the basepairs in all the F i tancebetweentheminimalfreeenergystructureofw and structuresinsteadofjusttwo.ThegradientGwouldalso i s,i.e.(cid:7)(MFE(w),s).TheMFEandbasepairdistanceare be different: it would range from 100% (near the root, i F computedwithRNAfoldandRNAdistance[28]. wherethesameweightisgiventoallstructures)to100% A Boltzmann distribution is induced such that the (neartheleaves). weightofanysequencew is i ResultsandDiscussion B(wi,s)=e−ER(wTi,s) Simulateddatageneration To evaluate our method, we generated in silico twenty whereRistheBoltzmannconstantandTthetemperature different phylogenetic trees for three different pairs of inKelvin.ThepartitionfunctionZs isobtainedbysum- W secondary structures as follows. First, three secondary mingtheweightsofallsequencesw ∈W andwedefined i structures of size 100 were randomly designed such that the Boltzmann probability of each sequence Ps (w) W i thetwofirsthaveasimilarshape,andthelast,adifferent suchthat one.Thosestructuresarethefollowing (cid:6) B(w,s) Zs = B(w,s) and Ps (w)= i . W i W i Zs 0: ...(..((((((((.(.(((.(((((((.((.((. wi∈W (((..(((....((....))..)))..))))).)) Wesampletwosequencesfromthisdistributiontopop- ..))))))).))).).))))))))..)... ulatec1 andc2 (resp.c1 andc2).Were-applyrecursively. 1 1 2 2 1: ...........((((((.(((((((((((((((.(( Thegeneratorwasimplementedinpythonandisbundled ..((((..((........)).))))..)).))))))) withourachARNementpackage. )))..))))).)))))).......... Evaluationonsimulateddata 2: .(((.((((.......))))))).((((((...((( We first evaluated achARNement using simulated data, ......)))))))))...((((((((((((((((.(( asdescribedinSec.Simulateddatageneration.Themuta- ((....)))))))))))))).)))))) tional rates of bacterias (bacterial genomes are studied in Sec. Evaluation on biological data) are known to vary ThebasepairdistancesevaluatedwithRNAdistance greatly between species and it is difficult to find indis- [28]betweenthestructures0and1,0and2,and1 and2 putable reference points to evaluate them [31]. We thus arerespectively40,96and86. approximate many generations in each step (i.e level of For every pair of secondary structures, a set of twenty the tree) by using as the mutation rate α three values: bi-stablesequenceswasgeneratedwithFrnakenstein {1%,5%,10%}.Thisenablesustoobtaindiverseenough [29], such that the best scoring sequence of each run sequences at the leaves of a complete binary tree of was kept. For each pair of structures (s1,s2), and each depth6. sequence z designed on these structures, a complete For every pair of structures and mutation rate, twenty binarytreeT ofdepth6waspopulated.TherootrofT is treesweregenerated.InFig.4,weshowtheaverageerror initializedwith(r1,r2) ≡ (z,z).EachinternalnodenofT percentage over all optimal sequences inferred for both iscomposedofapairofsequences,(n1,n2),suchthatthe families in all nodes of the trees. We divided the results sequencen1isassociatedwiththestructures1andn2with by structural features; the first row is the average error thestructures2. percentageforpositionsinvolvedinaninteraction,while TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page181of186 Fig.4Theaverageerrorpercentageofalloptimalsequencesforbothfamiliesinatree.Eachcolumnrepresentsapairofsecondarystructures.The firstrowisforpositionsinstructuredregions,andthesecondrowforunstructuredregions.Forthreemutationrates:1%,5%and10% the second row is for unstructured positions. Each col- step when solutions from both families are merged. umn represents a different pair of secondary structures, The higher quality in unstructured regions when using annotated 01, 02and 12followingthe notationdefined CalculateScores-2structs was expected because in Sec. Simulated data generation. For each sequence of wealwaysconsiderstructuresfrombothfamilies,andone a familyfam , we consider a position to be in a struc- unstructured positioninonefamilycanbestructured in turedregion,ifthestructureoffamhasabasepairatthat theother.Finally,althoughthetwostructures0and1are position. muchclosertoeachotherthanto2,thebasepairdistance A first observation is that CalculateScores- doesnotseemtoaffectthequalityoftheresults. 2structs always performs the best, followed by We then examine the number of optimal solutions, CalculateScores-1struct, and the Fitch and for each pair of secondary structures and mutation Sankoffalgorithmswhoseperformancesareequivalent.In rate α. As can be observed in Figs. 5 and 6, the aver- allcases,achARNementmethodsalwaysperformbetter, age number of optimal sequences inferred both in eveninunstructuredregions. the whole tree and for the root only is always smaller For CalculateScores-1struct, although the for algorithms CalculateScores-1struct and other structure is ignored during the parallel ancestral CalculateScores-2structs, compared to Fitch reconstructions, some constraints from the other struc- and Sankoff. In the case of the pair of structures 01, ture are implicitly taken into account during the middle the average number of optimal sequences is even Fig.5Averagenumberofoptimalsequencesinthetree,y-axislogscale.Eachcolumnrepresentsadifferentpairofsecondarystructures.Forthree mutationrates:1%,5%and10% TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page182of186 Fig.6Averagenumberofoptimalsequencesintheroot,y-axislogscale.Eachcolumnrepresentsadifferentpairofsecondarystructures.Forthree mutationrates:1%,5%and10% several orders of magnitude lower for our two algo- and CalculateScores-2structs to infer the rithms. An important observation is that, in every ancestral sequences at the root of the species tree. case, all sequences at the root reconstructed by Both Fitch and Sankoff inferred the same set of CalculateScores-2structs are a subset of the 786432 sequences at the root of the species tree, optimal sequences obtained with the classical Sankoff whereas CalculateScores-1struct inferred algorithm (i.e. CalculateScores-Sankoff). This 393216 and CalculateScores-2structs 196608. shows that the additional structural constraints defined The ancestral sequences reconstructed by our meth- in our method help to reduce the initial solution space ods are subsets of the ones produced by Fitch and producedbytraditionalapproaches. Sankoff: CalculateScores-1struct cut the Running times for the four methods are shown in solution space in half and CalculateScores- Additionalfile1Sec.Runningtimes. 2structs by another half. Running times were of 19 seconds for both Fitch and Sankoff, and 14 sec- Evaluationonbiologicaldata onds for both CalculateScores-1struct and WeanalyzedtheGlmandFinP-traJclansfromtheRfam CalculateScores-2structs. The lower running database. A clan contains two RNA families, that are timesforachARNementmethodscouldbeexplainedby homologous but functionally and structurally distinct thesmallernumbersofancestralsequencesinferred. [26].Theseclans,withtheirtwofunctionalfamilieswith Welookattwodifferentmeasurestoevaluatethequal- distinctconsensusstructures,constitutegoodcandidates ity of the ancestral sequences. First, we simply look at totestouralgorithms. thepercentageofallstructuredpositions,foreachfamily, that can actually form canonical basepairs in the ances- Glmclan tral sequences. The goal is to see if the reconstructed The Glm clan contains two bacterial small non-coding sequencescanformtherequiredbasepairsinbothstruc- RNAs, GlmY and GlmZ, that are homologous but func- tures.Second,wecomputetheharmonicmean(H-mean) tionally distinct. They act in a hierarchical manner to betweenthefrequenciesintheensembleofstructuresfor activate the translation of the glmS mRNA [23]. We eachstructurefamily(representingGlmYandGlmZ).Ina selected 74 bacterial genomes for which Rfam align- statisticalphysicsframework,anRNAsequencecanadopt ments were available for both families (see the com- allstructuresanditsfrequencyrepresentsthefractionof plete list in Additional file 1 Sec. Biological Data). timethatthesequenceadoptsaparticularstructure.The The phylogeny of the 74 studied bacterial strains was harmonicmeanisdefinedas takenfromthePathosystemsRessourceIntegrationCen- ter (PATRIC) [32], and Rfam seed alignements of both FreqS1×FreqS2 families were aligned together with CARNA [33]. The 2· FreqS1+FreqS2 sequences in the full Rfam alignments were then re- alignedtothealignmentobtainedwithCARNAsimplyby andismaximizedwhenbothfrequenciesareat0.5,given mapping their corresponding positions. The sequences that the structures are different. Thus the H-mean will and structures were subsequently trimmed to remove be equal to 0.5 if the two structures are different and the gapped columns, and if only one side of an inter- share the complete structure space. Another important action was removed the other position was marked as featureistheenergyofasequenceinaparticularconfig- unstructured. uration. Although that sequence could have other more We used the basic Fitch and Sankoff methods, favorable structures, it gives another idea of the stability and our algorithms CalculateScores-1struct ofaparticularconfiguration. TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page183of186 Inordertocalculatethismeanforasequence,wecom- reconstructionprogramPAML[34].Forclarity,weremind putethefreeenergyofthesequencewhenfoldedinthe2 thereaderthatPAMLconsidersonlyonefamilyatatime, differentstructuresandtheirfrequenciesusingRNAfold, and returns one ancestor per node. We generated the and the non-canonical base pairs are ignored for these ancestorsusingbothfamiliesseparately,GlmZandGlmY, computations. andcomparedthetwopredictedancestorspercentageof To compare the different ensembles of solu- canonicalbasepairsandH-meanwiththoseobtainedwith tions, we sampled 200000 distinct sequences from the other methods, as shown in Table 3. The two ances- each of them. We present in the first 6 lines of tral sequences produced by PAML have percentages of Table 3 the maximum and average values of the canonicalbasepairs(andH-meanofPAMLGlmY)thatare percentage of canonical basepairs and H-mean for: significantlylowerthanthebestandaveragevaluesofall sequences inferred by Sankoff (or Fitch) only, those theothermethods. inferred by CalculateScores-1struct but not CalculateScores-2structs,andthoseinferredby FinP-traJclan CalculateScores-2structs only. We also present We also ran the experiment on the FinP-traJ clan. FinP thevaluesofenergyandfrequenciesinregardstothesec- is an antisense ncRNA that can bind to the 5’ UTR ondarystructureofeachfamily.Thestandarddeviations region of the traJ mRNA. The binding of those two areshownintheAdditionalfile1:Tab.7. RNAsrepressesthethetranslationoftraJ,whichinturn We observe that, on average, the percentages of represses F-plasmid transfer [35]. Similarly to the Glm canonical basepairs are all the same on the GlmZ struc- clan, we selected bacterial genomes for which the Rfam ture (99.1 %), but it is 1.5 % higher for the solutions alignmentswereavailableforbothfamilies(54genomes; of CalculateScores-2structs on the GlmY see the complete list in Additional file 1 Sec. Biological structure. Although this is not a huge difference, the Data) and we did the same preprocessing to prepare the fact that we get more canonical basepairs on average alignments.Thephylogenyforthe54bacterialstrainswas by inferring a lot less ancestral sequences is interest- alsotakenfromPATRIC[32]. ing. As for the maximums, in all subsets of solutions Noticeably, both families in this clan are sequentially we get sequences that have 100 % of the canoni- and structurally more different than with the Glm clan. cal basepairs for both structures. The average (resp. TheFitch,Sankoff,andCalculateScores-1struct max) H-means for the distinct sets of ancestors pro- methods inferred the same ensemble of 12582912 duced by Sankoff, CalculateScores-1struct sequencesattheroot.Incontrast,CalculateScores- and CalculateScores-2structs are roughly 2structs inferred a strict subset (4x smaller) similar, indicating that by cutting the solution of 3145728 sequences. Running times were of 17 space with CalculateScores-1struct and seconds for both Fitch and Sankoff, and 19 sec- CalculateScores-2structs, we do not lose onds for both CalculateScores-1struct and sequencesthathavesignificantlybetterfoldingproperties CalculateScores-2structs. inregardstobothstructures. As with Glm, we sampled 200000 distinct sequences Wethenproceededtodoacomparisonofourmethod to compare the two sets. We present in the first two with the state-of-the-art maximum likelihood ancestral rows of Table 4 the maximum and average results for Table3MaximumandaverageresultsfortheGlmClan %Z %Y H-mean EnSZ FreqSZ EnSY FreqSY Sankoff 100 100 1.40e-03 -18.7 1.40e-03 -18.7 1.40e-03 average 99.1 93.0 6.60e-06 -15.5 4.09e-06 -17.8 1.55e-04 1struct 100 100 8.76e-03 -19.3 8.76e-03 -19.3 8.76e-03 average 99.1 93.0 9.02e-05 -16.1 5.54e-05 -17.4 5.68e-04 2struct 100 100 2.45e-03 -19.3 1.27e-03 -21.3 3.26e-02 average 99.1 94.5 7.45e-06 -16.1 3.78e-06 -19.4 9.27e-04 PAMLGlmZ 92.9 90.6 1.53e-05 -18.1 7.66e-06 -22.6 1.14e-02 PAMLGlmY 92.9 90.6 1.98e-07 -17.5 9.90e-08 -22.5 3.30e-04 The%Z(resp.%Y)columnshowsthepercentageofallstructuredpositionsintheGlmZ(resp.GlmY)familyforwhichtheancestralsequencescanformcanonicalbasepairs. TheH-meancolumnrepresentstheharmonicmean.TheEnSZcolumn(resp.EnSY)showstheenergyofthesequencewhenfoldedinthesecondarystructureofthefamily GlmZ(resp.GlmY).TheFreqSZcolumn(resp.FreqSY)showsthefrequencyintheensembleofthesecondarystructureofGlmZ(resp.GlmY).Thefirstsixrowsshowmaximum andaverageresultsforSankoff,CalculateScores-1structandCalculateScores-2structsalgorithms.Thelasttworowsrepresentvaluesobtainedfor thePAMLrootancestralsequencereconstructedontheGlmZfamilyandontheGlmYfamily TheAuthor(s)BMCGenomics2016,17(Suppl10):862 Page184of186 Table4MaximumandaverageresultsfortheFinP-traJClan %F %t H-mean EnSF FreqSF EnSt FreqSt Others 94.7a 100a 6.25e-01 -23.7 6.25e-01 -23.7 6.25e-01 average 86.8 88.2 3.17e-03 -27.5 1.97e-02 -24.4 3.65e-03 2struct 94.7a 100a 5.58e-01 -23.7 5.58e-01 -23.7 5.58e-01 average 85.5 91.2 3.86e-03 -27.0 9.62e-03 -25.3 1.39e-02 PAMLFinP 100 82.4 7.13e-08 -28.3 2.59e-03 -21.4 3.56e-08 PAMLtraJ 78.9 100 2.61e-07 -26.3 2.84e-07 -26.2 2.42e-07 The%F(resp.%t)columnshowsthepercentageofallstructuredpositionsintheFinP(resp.traJ)familyforwhichtheancestralsequencescanformcanonicalbasepairs.The H-meancolumnrepresentstheharmonicmean.TheEnSFcolumn(resp.EnSt)showstheenergyofthesequencewhenfoldedinthesecondarystructureofthefamilyFinP (resp.traJ).TheFreqSFcolumn(resp.FreqSt)showsthefrequencyintheensembleofthesecondarystructureofFinP(resp.traJ).Thefirstfourrowsshowmaximumand averageresultsforthefirstthreealgorithms(Others)andCalculateScores-2structs.ThelasttworowsrepresentvaluesobtainedforthePAMLrootancestral sequencereconstructedontheFinPfamilyandonthetraJfamily. aThemaximumvaluesshowedaretheonesmaximizingthebasepairsintraJ;theonesmaximizingFinPare100and94.1for%Fand%trespectively the sampled sequences in Others, the set inferred by conservationofstructuresoversequences,wearguethat Fitch, Sankoff, and CalculateScores-1struct but achARNementsolutionsarebetterancestralcandidates. not by CalculateScores-2structs. The following tworowspresentthosesampledinthesubsetinferredby Conclusions CalculateScores-2structs. In this paper, we presented two novel maximum par- We show the results with their standard deviation in simony algorithms, implemented in achARNement, to Additionalfile1:Tab.8. solve the simultaneous ancestral reconstruction of two We observe that on average, the solutions from the ncRNAfamiliessharingacommonancestor.Wefirsteval- “others” group can form 86.8 % of the basepairs of uated our methods on simulated data, as described in the FinP structure and 88.2 % of the ones of traJ. On Sec.Simulateddatageneration,thenontwoRfamclans, the other hand, the subset of ancestors produced by theGlmandFinP-traJclan(Sec.Evaluationonbiological CalculateScores-2structs can form on average data). 85.5 % (1.3 % less) of the basepairs of the FinP structure Wefirstshowedthatonsimulateddata,achARNement and 91.2 % (3 % more) of the ones of traJ, which, over- produces smaller sets of ancestral sequences with fewer all, seems to be a better compromise. Note that this was errorsonaveragethantheclassicalFitchandSankoffalgo- achievedbyinferring4timesfewerancestors. rithms. Since all the ancestral sequences reconstructed Regarding the H-mean, the samples taken from the at the root by achARNement are included in those smaller subset of ancestral sequences reconstructed by produced by the Sankoff algorithm, it indicates that CalculateScores-2structs show similar results considering the secondary structures does not gener- for the maximum and slightly better for the average H- atesuperfluousmutations.Mostimportantly,considering mean than the bigger sets inferred by the other algo- bothstructuresinCalculateScores-2structspro- rithms, which tends to show that our method is not ducesordersofmagnitudesfewersequenceswhilealways discardingsequenceswithbetterfoldingproperties. improvingontheotheralgorithmsintermsoftheaverage WealsocomparedourresultswithPAML,foreachfam- percentageoferrors. ilyseparately.Weobserveastarkcontrastwithourresults The biological data cannot be validated directly, yet when comparing the percentage of canonical basepairs some interesting observations can be made. To the best for both families. While PAML can get 100 % on the of our knowledge, achARNement has the first imple- consideredstructure,itgetsonlyabout80%ofthebase- mentations of complete parsimonious models able to pairsoftheotherstructure.Whenlookingatthestability reconstructancestralsequencesoflargealignmentswith of the functional structures of the two families on the multiple structures. On both the Glm and FinP-traj reconstructed ancestral sequences, we observe that our clans,CalculateScores-2structshasbeenshown solutionsofferabettertrade-off(i.e.theaverageharmonic toinfersmallersetsofancestralsequencesthanFitchand meanisseveraldegreesofmagnitudebetterthattheones Sankoff, while offering a better compromise in terms of obtainedbyPAML). thepercentageofcanonicalbasepairsforbothstructures Theseresultssuggestthatourmethodsareindeedcapa- (withoutpenalizingthefoldingproperties,asshownwith ble to retrieve ancestral sequences with better fitness thesimilarvaluesofH-mean).Also,thecomparisonwith to both functional structures of the homologous RNA PAML highlights the benefits of our approach, especially families. Since RNA families are known to favour the ontheFinP-traJclan,whereitisclearthatweareableto

Description:
the affinity regions, and might end up with ancestral sequences that would be hard to reconcile. By contrast, a simultaneous reconstruction of.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.