Statistical Evaluation of the Rodin–Ohno Hypothesis: Sense/Antisense Coding of Ancestral Class I and II Aminoacyl-tRNA Synthetases Srinivas Niranj Chandrasekaran,1 Galip Gu¨rkan Yardimci,z,1 Ozgu¨n Erdogan,1 Jeffrey Roach,1 and Charles W. Carter Jr*,1 1DepartmentofBiochemistryandBiophysics,UniversityofNorthCarolina zPresentaddress:DepartmentofComputerScience,DukeUniversity,Durham,NC *Correspondingauthor:E-mail:[email protected]. Associateeditor:AndrewRoger Abstract D o w WetestedtheideathatancestralclassIandIIaminoacyl-tRNAsynthetasesaroseonoppositestrandsofthesamegene. nlo a Weassembledexcerpted94-residueUrgenesforclassItryptophanyl-tRNAsynthetase(TrpRS)andclassIIHistidyl-tRNA d e sstyrnutchteutraesset(hHaitsRpSo)siftrioonmthaedmiveorsstehgigrhoulypcoonfsseprevceide,s,abctyivied-esnitteifryeinsigduaensd. Tchateencoadtionngmthirdedeleb-bloacskespcaoidriinnggffroerquseecnocnydwaarys d from 0.35±0.0002inall-by-allsense/antisensealignmentsfor211TrpRSand207HisRSsequences,comparedwithfrequencies http between0.22±0.0009and0.27±0.0005foreightdifferentrepresentationsofthenullhypothesis.Clusteringalgorithms s demonstratefurtherthatprofilesofmiddle-basepairinginthesynthetaseantisensealignmentsarecorrelatedalongthe ://a c a sequencesfromonespecies-pairtoanother,whereasthisisnotthecaseforsimilaroperationsonsetsrepresentingthe d e m null hypothesis. Most probable reconstructed sequences for ancestral nodes of maximum likelihood trees show that ic middle-basepairingfrequencyincreasestoapproximately0.42±0.002asbacterialtreesapproachtheirroots;ancestral .o u p nodesfromtreesincludingarchaealsequencesshowalesspronouncedincrease.Thus,contemporaryandreconstructed .c o sequencesallvalidateimportantbioinformaticpredictionsbasedondescentfromoppositestrandsofthesameancestral m /m gene. They further provide novel evidence for the hypothesis that bacteria lie closer than archaea to the origin of b e translation. Moreover, the inverse polarity of genetic coding, together with a priori a-helix propensities suggest that /a in-framecodingonoppositestrandsleadstosimilarsecondarystructureswithoppositepolarity,asobservedinTrpRS rtic le andHisRScrystalstructures. -a b s Key words: sense/antisense double open reading frames, origin of translation, aminoacyl-tRNA synthetases, protein tra c modularity,multiplesequencealignment,multiplestructurealignment,ancestralgenereconstruction. t/3 0 /7 /1 Introduction 5 A andpreferwater.Thus,ancestralclassIandclassIIenzymes 88 r Aminoacyl tRNA synthetases (aaRS) occur in either of two appear to have been required to create, respectively, /97 t 3 ic structurallyunrelatedclasses,IorII,accordingtotheamino the cores and solvent interfaces of primordial globular 415 l acidtheyactivate(Cusacketal.1990;Erianietal.1990;Carter proteins. Genetic linkage implied by sense/antisense coding b e y 1993). Rodin and Ohno (1995) proposed that these two may then have ensured that activated amino acids of both g u unrelated enzyme superfamilies descended from the same typeswouldbeproducedatthesametimeandplace,increas- es gene,oneancestorcodedbyeachcomplementaryancestral ing the likelihood of producing and selecting viable gene t on strand. Although the evidence on which Rodin and Ohno products. 30 based their proposal was quite strong, the concept has tRNA aminoacylation is the defining reaction in codon- M a nevertheless proven difficult to embrace, because genetic dependent translation. Primordial enzymes enabling the rch complementarity between coding sequences severely con- process were likely among the earliest catalysts. By virtue 20 1 strainsthesequencespacesthatcanbeexploredwhilesimul- ofthatprivilegedposition,itisalsolikelythattheircontem- 9 taneously optimizing gene products translated from both porary descendants include large portions of the proteome. strands. It therefore seems of paramount importance to assess the Thedecisiveselectiveadvantageofsense/antisensecoding validity of the hypothesis. For, if the hypothesis is correct, (Pham et al. 2007) appears from the fact that amino acid establishedparadigmsmustberevisitedinlightofthepossi- specificities of the two aaRS classes are significantly skewed. bility of tracing so much of life as we know it to a single Class I aaRS substrates have a favorable median free energy ancestralgene. oftransferfromwatertocyclohexane,whereassubstratesof The Rodin–Ohno hypothesis makes testable biochemical class II aaRS are less favorable by approximately 4kcal/mol and bioinformatic predictions. Segments corresponding to (cid:2)TheAuthor2013.PublishedbyOxfordUniversityPressonbehalfoftheSocietyforMolecularBiologyandEvolution. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionNon-CommercialLicense(http:// creativecommons.org/licenses/by-nc/3.0/),whichpermitsnon-commercialre-use,distribution,andreproductioninanymedium, Open Access providedtheoriginalworkisproperlycited.Forcommercialre-use,[email protected] 1588 Mol.Biol.Evol.30(7):1588–1604 doi:10.1093/molbev/mst070 AdvanceAccesspublicationApril10,2013 MBE StatisticalTestsofSense/AntisenseAARSCoding . doi:10.1093/molbev/mst070 regions implicated in sense/antisense coding should exhibit descendantsderivedfrom ancestralaaRSgenes on opposite catalyticactivitiescharacteristicofthefull-lengthenzymes.To strands would be residual middle- or second-base pairing testthisbiochemicalprediction,wedevelopedproceduresfor betweencodonsspecifyingthesecondarystructuresthatpo- stabilizing and expressing such constructs. We call them sitionactive-siteresidues.Werefertothismetricas<MBP>, Urzymes, from Ur=primitive, original, earliest plus enzyme. forthe“Middle(second)codon-BasePairing”frequency.We A 130-residue class Itryptophanyl-tRNAsynthetase (TrpRS) examineheretheoverall<MBP>inmultiplesense/antisense Urzymefromwhichtheanticodon-bindingdomainandthe alignments derived from diverse contemporary TrpRS long connecting peptide (CP1) separating the HIGH and and HisRS sequences, its profile along the sequences, and KMSKS active-site signatures had been removed (Pham itsbehaviorinreconstructedancestralsequences. et al. 2007, 2010) accelerated tryptophan activation 109- Globalanalysisofpossiblesense/antisensecodingrelation- fold. We subsequently demonstrated that 120–140-residue ships between class I and II aminoacyl-tRNA synthetase active-site fragments derived from the implicated region sequencesisadauntingtask,becauseinsertionsanddeletions of class II histidyl-tRNA synthetase (HisRS; Li et al. 2011) have been possible along all independent trajectories D o have comparable catalytic activity, and that both Urzymes throughout the biological era. We address here the more w n catalyzeacylationofcognatetRNAs(LiL,FrancklynCS,Carter modest goal of assessing statistical evidence for ancestral lo a d CWJr,inpreparation). sense/antisensecodingderivedfromextantcodingsequences e d These catalytic activities make Urzymes important re- onlyoftheTrpRSandHisRSUrzymes. fro sources for both experimental and bioinformatic study of The TrpRS(cid:2)HisRS pair is not an obvious choice for com- m h putative steps in the foundational evolution of catalytic ac- parison, for several reasons. TrpRS uses TIGN, an unusual ttp s tivityandspecificityfrombeforetheevolutionaryerathatis variantoftheclassIHIGHsignature.TrpRSbelongstoclass ://a accessible through ancestral gene resurrection (Thornton Ic,whereasHisRSbelongstoclassIIa;thus,theyareprobably ca d 2004; Bridgham et al. 2006; Benner et al. 2007; Gaucher moredistantlyrelatedandmayretainaweakertraceoftheir e m et al. 2008). The class I and II Urzymes both correspond to common complementary ancestry. Finally, both contempo- ic .o themosthighlyconservedandhencemostancientactive-site rarysynthetasesbindtRNAinthemajorgroove(Yangetal. u p components. Further, they are compatible with antiparallel 2006),whereasmostclassIaaRSuseaminorgroovebinding .co m alignmentoftheircodingsequences,asenvisionedbyRodin mode,suggestingthattheancestralstatesofclassIandIIaaRS /m and Ohno (Pham et al. 2007). Their high, and comparable, likely bound to opposite grooves of the tRNA 30 acceptor be catalytic activities imply that they are both stable and stem(RibasdePouplanaandSchimmel2001). /artic globular,atleastinthepresenceofsubstrates,andtherefore Forvariousreasons,weneverthelessfocushereoncoding le afford unexpectedly strong support for the Rodin–Ohno properties of the TrpRS(cid:2)HisRS pair. Most importantly, these -ab s hypothesis. are the two aaRS for whichwe have demonstrated catalyti- tra c This surprising biochemical validation invites further bio- cally active Urzymes. Class Ia enzymes generally have both t/3 0 informatictestsforpossiblesense/antisensecodingrelation- longer CP1 and additional insertions within the Rossmann /7 ships. Urzyme construction relies primarily on tertiary fold.Thus,asthesmallestoftheclassIaaRS,TrpRSposesthe /15 8 structuralalignmentandproteindesign,andhenceisdistinct fewest difficulties in Urzyme design. Further, outside the 8 /9 from the study of nucleic acid sequences that might have TrpRS and HisRS families, gaps and insertions introduce 7 3 4 encoded extinct enzyme precursors. Nonetheless, the com- additional difficulties in deciding which coding sequences 1 5 parablelengthsandapproximateantiparallelalignmentofthe from the class I and II superfamilies should be structurally b y TrpRS and HisRS Urzymes do suggest how to extend the aligned to evaluate the <MBP> metric in aligned codons. gu e bioinformatic approach described by Rodin and Ohno Finally, although it uses the TIGN catalytic signature, TrpRS s t o (1995) to include, in addition to their catalytic signatures, contains the correct number of amino acids between the n 3 thesecondarystructuralelementsnecessarytoorientthem. conservedprolineandTIGNtoalignwithbothhydrophobic 0 M Althoughthoseauthorswereabletodetectsignificantsense/ andchargedresiduesintheclassIImotif2(RodinandOhno a rc antisense relationships in the full codons of the catalytic 1995). h 2 signatures, such information has long been stripped by Pham et al. (2007) described a step toward a putative 0 1 9 gene duplication and speciation from the remaining parts Urgene by aligning three large contemporary coding blocks of the presumably extinct ancestral aaRS sequences. Thus, fromcatalyticdomainsof BacillusstearothermophilusTrpRS thesense/antisenselinkagebetweenthetwoenzymesuper- and Escherichia coli HisRS sense and antisense opposite one families, if it did exist, was abolished very early, possibly another.Thatalignmentrevealedthatmiddlebasesin44%of well before the genetic code even reached a canonical codons throughout a 94-residue sense/antisense alignment twenty amino acids with variable presence of synthetases spanningtheactivesites(fig.1)were base-pairedwiththeir for the amides, asparagine, and glutamine (Woese et al. complements. A naive random probability based simply on 2000), let alone the idiosyncratic variation seen today with oneinfourbasesisapproximately0.25.Wepreviouslyfound the extensions to selenocysteine and pyrrolysine (Ibba and thatthemiddle-basepairingfrequencyinalignmentsofsim- Soll2004). ulatedtwo-codonhexanucleotidesencodingrandomdipep- Owing to the degeneracy of the genetic code and the tideswas0.27±0.044.Theuncertaintyofthisestimateofthe wobblehypothesis(Crick1966),wereasonedthatthemost randomfrequencyimpliedthatthestatisticalsignificanceof persistent base-pairing remnant in sequences coding for the frequency (0.44) observed in the B. stearothermophilus 1589 MBE Chandrasekaranetal. . doi:10.1093/molbev/mst070 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/3 0 /7 /1 5 8 8 /9 7 3 4 1 5 b y g u e s t o n 3 0 M a rc h 2 0 1 9 FIG.1. ModularityinclassIandIIaminoacyl-tRNAsynthetaseUrzymes.Complementarymodularitywithinaminoacyl-tRNAsynthetaseactivesites.(A) Summary of information derived from the respective TrpRS and HisRS multiple sequence alignments. TheBacillus stearothermophilus TrpRS and Escherichia coli HisRS sequences are aligned antisense to each other, consistent with the Rodin–Ohno hypothesis. The three modular fragments comprisingtheUrgeneareindicatedbyathincoloredstripbetweensequenceandsecondarystructurecodes.Yellowblocksindicatethemotifsusedto anchorthethreecodingblocks,whichwereassembledend-to-endtoformasingle“gene.”Codonmiddlebasepairsaremagenta,unpairedmiddlebases (continued) 1590 MBE StatisticalTestsofSense/AntisenseAARSCoding . doi:10.1093/molbev/mst070 TrpRS:E.coliHisRSalignmentwassupportedbyaPvalueof with Z scores of approximately 5.7–8.8 (Rodin and Ohno 0.007(Phametal.2007). 1995). Consensus HIGH/motif 2 and KMSKS/motif 1 sense/ Weanalyzeheresense/antisensealignmentsderivedfrom antisensehomologiesofthesesequencescompriseonly9and amuchbroadersampleofcontemporarymultiplesequence 11 amino acids, respectively, and constitute only approxi- alignments for TrpRS and HisRS, to investigate further the mately20%ofthelengthoftheTrpRSUrzymedescribedin statisticalsignificanceofthe94-residueantiparallelconstruc- figure4ofPhametal.(2007)andonly5–6%ofthecontem- tioninfigure1.Thestatisticalevidenceformiddle-basepair- poraryfull-lengthTrpRSandHisRSenzymes. ingfrequencyisrobust,andextendstoprofilesofmiddle-base Tooptimizecorrespondencebetweenanenhancedsetof pairingversusresiduenumberandtoincreasedfrequenciesin contiguous, antiparallel codons, and following Pham et al ancestralsequencesreconstructedfrommostprobablephy- (2007), we built “Urgenes” from segments reliably linked to logenetic trees. These results provide strong bioinformatic theclass-definingsequencesalreadyidentifiedinthefigure1 supportfortheRodin–Ohnohypothesis. ofRodinandOhno.Threecodingblocksdefiningtherespec- tiveactivesiteswereassembledbyasemi-automatedproce- D New Approaches o w durefrom multiplesequencealignmentsfordiversespecies. n This work entails several novelties, arising from the unusual Twooftheseblocksincludedaminoacidscorrespondingto loa d purpose of presenting bioinformatic evidence for ancestral theclassIHIGH-(andclassIIMotif2;46residues,bluefrag- ed sense/antisense genetic coding. First, we examine the ment) and KMSKS- (class II Motif 1; 18 residues, magenta fro m statistical behavior of relationships between contemporary fragment) containing segments. A third was provided by a h sequencesoftwodistinctproteinfamilies,toinfercharacter- linkingsegmentinvolvedinaminoacidspecificityintheclass ttp s istics of ancestral genes from an era close to the advent of IUrzyme(30residues,amberfragment).Thisamberfragment ://a c genetic coding. Second, owing to the pervasive problem of is also associated with a consensus motif; we previously ad indels, we use three-dimensional structure superposition to identified the comparably conserved sequence GxDQ in em identifyandassemblemosaic“Urgenes”encodingconserved this segment in class I active sites (Carter 1993; Pham et al. ic.o u secondarystructures,alongwithhighlyconservedactive-site 2007).Thismotifandthecorrespondingantisensesequence p .c residues,forbothfamilies.Third,wedevelopandexaminea in HisRS, FKRA, provide an anchor for the central segment. om varietyofdatasetsrepresentingthenullhypothesisthatthe Anchoring motifs and structural superposition helped in /m b ancestralsequenceswerenotcomplementary.Fourth,weuse adjusting for insertions and deletions in all three coding e/a k-means clustering and a two-dimensional clustering vector blocksandinassuringappropriatealignments. rtic basedonaveragecomplementarityanditspositionsensitivity le CodingandtranslatedaminoacidsequencesforB.stear- -a along the sequence to distinguish between the sense/anti- b othermophilusTrpRSandE.coliHisRSarealignedinopposite s sense and null hypotheses. Finally, we estimate the comple- directionsinfigure1A.Therelativelyhighersequenceidentity trac mentarity metric in the time domain by reconstructing within a single aaRS family, together with structural consid- t/30 ancestral genes from most-probable phylogenetic trees of /7 erations, increase our confidence that all alignments corre- /1 middle- (second-) codon base sequences. We find that nei- 5 spondtothatconstructedpreviously(Phametal.2007)with 8 8 ther estimated divergence times (Hedges et al. 2006) nor respecttothree-dimensionalstructures.ClassIandIImultiple /9 7 node-heights are free of complications arising, probably, 3 sense/antisense sequence alignments are uncorrelated by 4 fromhorizontalgenetransfer.Thisproblemisdiscussedfur- 15 severalcriteria,alsoillustratedinfigure1.Sequenceentropies, b therinthesupplementarysectionC,SupplementaryMaterial P y online. Sj,position= i,j,aminoacids(pi,j)(cid:3)log(pi,j),derivedfrommultiple gu sequence alignments show that the relative conservation e s Results ofaminoacidsincontemporarycodingsequencesisuncor- t on relatedtothatontheotherstrand(fig.1A;R=0.11).Noris 30 The Putative Urgene Described Here (fig. 1) Is thereevidentcorrelationbetweensitesonoppositestrandsat M a Excerpted from the Intact Urzymes which insertions are observed, indicated by the arrows, in a rch TheoriginalobservationofRodinandOhno(1995)wasthat smallnumber(<20%inthebluefragmentand<5%inthe 20 1 codingsequencesforthehighlyconserved,class-defining“sig- amberfragment)ofthesequences. 9 nature” sequencesofthetwoaaRSclasseshadastatistically The resulting Urgene increased by 5-fold the number of significantsense/antisenserelationship,basedonJumbletests consecutive codons subject to statistical testing over the FIG.1.Continued areblue.SequenceentropiesofclassI(darkershading)andclassII(lightershading)areshownashistogramsaboveandbelowthesequences.Conserved positions have low sequence entropies. Significant numbers of inserted residues in the respective MSAs occur at sites indicated by black arrows. Secondarystructuresfromthecrystalstructuresofthefull-lengthenzymesareencodedas:H=a-helix,E=extended(b),T=turn,andS=loop.(B) Decompositionoftertiarystructuresinferredfromcrystalstructuresoftheintactenzymesintothethreemodularcodingblocks,coloredasin(A). Aminoacyl-50AMPintermediatesaredenotedbyspheres.NotethattheclassI(TrpRS)activesiteenclosestheindolemoietycompletely,whilethe imidazolemoietyofhistidinehasextensivesolvent-exposedsurfacearea.(C)RelationshipbetweentheUrgenein(A)andtheactiveUrzymes.Gaps between the excerpted fragments (dark gray additions) represent the remaining puzzle of reconstructing an alignment coding two intact active enzymes(seeDiscussion). 1591 MBE Chandrasekaranetal. . doi:10.1093/molbev/mst070 number examined by Rodin and Ohno, at the expense of Initial tests of the distribution under the null hypothesis examining onlythe codon middle bases.We initiallyassem- weregeneratedfromtheTrpRSandHisRSsequencesthem- bled 94-residue Urgenes for 98 TrpRS and 99 HisRS coding selves by accumulating <MBP> distributions for the same sequences.MultiplesequencealignmentsforthetwoUrgenes alignments shown in figure 1, but with +1 and (cid:4)1 frame- fromcontemporaryTrpRSandHisRSsequencesintheinitial shifts and by randomizing one of the two sequences. These data set gave rise to approximately 9,700 different homolo- distributionsallhavemeanscloseto0.25withstandarderrors gouscomparisons,whichwereanalyzedindetail.Thisinitial similartothoseofotherdistributions. database strongly emphasized bacterial sequences for both Figure3showsdistributionsforsense/antisensealignments enzymes.Wethereforeenlargedthedatabasetoincludead- underthehypothesis(H;9702alignments)andfortwomock ditional archaeal/eukaryotic species to ensure more diverse sense/antisensecodingsequencealignmentsunderN (CII 1,b multiple sequence alignments of 211 TrpRS and 207 HisRS vs.CII0;4950alignments,fig.3A;andN sequenceseitherfor 0,b sequences (fig. 2) and approximately 44,000 homologous random peptides (N , not shown) or for Lectin C (pfam 0,z sense/antisense comparisons. Species names and phyloge- PF00059)versusPDZdomains(pfamPF00595~20000align- D o netictreesinNewickformatforfullalignmentsofbothfam- ments,fig.3C).ThedistributionfortheLectinC:PDZdomain w n ilies, which include representative bacteria, archaea, and alignments effectively rules out the possibility that high loa d eukarya, are included in the supplementary material, <MBP> in the TrpRS:HisRS sense/antisense alignment e d SupplementaryMaterialonline. arisesfromsequenceconservationwithinthetworespective fro families,whicharecomparableinthecaseofHandN .All m 0,b h Contemporary Coding Sequences Exhibit threedistributionshavesimilarwidths,consistentwithcom- ttps <MBP>=0.35±0.0002 parablesamplingdiversity.Distributionsofmiddle-basecom- ://a plementarityaresummarizedintable1. ca Totestthehypothesisthatancestralcodingsequencesofthe The H and N frequencies differ by approximately 200 de i m two proteins were complementary and arose on opposite times the root mean squared standard error, suggesting ic strandsofthesamegene,weusedtheinitial(smaller;98vs. that they are very different. We performed one-sided, two- .ou p 99)MSAstoconstructall-by-allsense/antisensealignmentsof samplettestsforequalmeansbutdifferentvariancesforH .co ((HDN; CAI)vcso.dCiInI0g).sTehqeuemncidedslefr-obmaseeapcahiricnlgassfreaqguaiennsctyt,h<eMoBthPe>r, evaecrshuscaNse0,,aa,bnadnidnHfacvte,rwsuesreNs1m,a,abl.lePrvthalaunesthweecreom(cid:5)pu1t0a(cid:4)ti4onfoarl m/mbe waliagsnmtheenntsc,otmabpleute1d, afnodr euascehdotfotchoen9st,7ru02ctstehnese/haisnttoisgernamse rreogurnedssiniognermroor.dSeilms iflaorrPmvaidludeles-wbaesreeocbotmaipnleedmwehnetanrimtyulwtieprlee /article shown in figure 3B and in the clustering analysis described built using CI/CII0 sense/antisense alignment (H) and self -ab later,whichwasnotrepeatedforthelargerdataset. sense/antisensealignment(N1,a,b)aspredictors.Evenconsid- stra The<MBP>valueofthesmallercompilationis0.376;the ering the substantial homologies in the two multiple ct/3 standard deviation (SD) is 0.0224. However, as a relatively sequencealignments,thedistributionunderHisclearlydis- 0/7 large number of different comparisons contribute to this tinct from all distributions representing the null hypothesis. /15 value, we quote the standard error (SE) in estimating the Despite the fact that the larger data set produced a lower 88 /9 meanvalue,SE=0.0005.Alignmentsfortheoveralldatabase <MBP>,thedifferencebetweenHandNhypothesesactu- 7 3 (table2;211vs.207sequences;43,677sense/antisensealign- allydifferbyapproximately350timestherootmeansquare 41 5 ments)gaveacorresponding<MBP> of0.35±0.0002.This standarderror. b y lowervaluelikelyresultsfromtheincreasedrepresentationof Tovalidateconfidencefurther,wetestedforequalmedians gu e archaealandeukaryoticsequences,asdiscussedlater. withthenonparametricWilcoxonranksumtestforHversus s t o N andHversusN andthePvalues(unsurprisingly)were n Distributions under the Null Hypothesis All Have (cid:5)01,b0(cid:4)6.Thelasttest1,paerformedwasatwo-sampleone-sided 30 M <MBP> Close to 0.25±0.0005 Kolmogorov–Smirnovtestwiththenullhypothesisthatthe arc twosamplescomefromthesamedistributionagain,withthe h Thestatisticalsignificanceofmiddle-basepairingobservedin 2 H versusN andHversusN asthetwopairs.Again,we 0 pairs of coding sequences aligned in opposite directions re- obtainedPv0,aalues(cid:5)10(cid:4)3.Thu1s,,aweconcludewithaveryhigh 19 quires an estimate for the corresponding probability under degree of confidence that nonrandom effects lift the fre- thenull-hypothesis,N,thatcontemporarycodingsequences quency of middle base pairs in the TrpRS HisRS sense/ areunrelatedandmiddlebasespairrandomly.Wereportfive antisenseUrgenealignmentwellabovethoseofdistributions levelsoftestsforthedistributionof<MBP>undertheNull foreitherN andN ,distributionsrepresentingthenull hypothesisbytesting: 0,a,b 1,a,b hypothesis.Table2updatestestsofthenullhypothesisbased i) After frame shifting and/or randomization of one of onthelargerdatasetof211TrpRSand207HisRSsequences: thetwosequences. N ,N ,togetherwithN andN ,fromframeshiftingoneof 1a 1b + (cid:4) ii) Self-complementarityofeachofthetwoaaRSMSAs. thetwosequences,andN forthealignmentofTrpRSand s/s iii) PeptidesdrawnrandomlyfromthePDB. HisRSsequencesinthesameorientation. iv) HomologousMSAsofrandompairsofproteinfamilies For a more discriminating assessment of <MBP> values fromthePDB. arising from using paralogous MSAs, we considered the v) MSAsparalogousfamilies. behavior of distant (putative) and consensus paralogous 1592 MBE StatisticalTestsofSense/AntisenseAARSCoding . doi:10.1093/molbev/mst070 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/3 0 /7 /1 5 8 8 /9 7 3 4 1 5 b y g u e s t o n 3 0 M a rc h 2 0 1 9 FIG.2. Phylogeniesoftheabstracted94-basecodingfragmentsderivedfromTrpRS(A)andHisRS(B).AlignmentswerecompiledbyMUSCLE(Edgar 2004a,2004b)andthetreesdeterminedbyJModelTest(GuindonandGascuel2003;Darribaetal.2012)andrenamedusingTaxnameconvert(Schmidt 2004; http://www.cibiv.at/software/taxnameconvert/, last accessed April 22, 2013) and drawn as cladograms by FigTree (Rambaut 2010). Trees in Newick format are provided in the supplementary material, Supplementary Material online. Phyla are indicated around the circumference and highlightedbydifferentcolors.Bluecolorsdenotebacteria;amberandreddenotearchaeaandeukaryotes,respectively. 1593 MBE Chandrasekaranetal. . doi:10.1093/molbev/mst070 Table 1. Statistics for Distributions of Middle Base Pairs for the Initial Alignment. Hypothesis Distribution N <MBP> SD SE H TrpRS vs. HisRS0 9,702 0.376 0.049 0.00049 N Random peptides vs. random peptides0 30,000 0.238 0.042 0.00024 0,a N Lectin C vs. PDZ0 20,001 0.215 0.055 0.00039 0,b N TrpRS vs. TrpRS0 4,753 0.257 0.047 0.00068 1,a N HisRS vs. HisRS0 4,851 0.255 0.037 0.00053 1,b sense/antisense alignments between a small set of Toprim domains and the HisRS MSA. Crystal structures of six unique Toprim domains were aligned with the TrpRS Urzyme (supplementary fig. S2, Supplementary Material D o online) using the POSA server (Ye and Godzik 2005), and w n optimal choices were made for equivalent residues in the lo a d “blue”and“amber”fragments(Toprimdomainslackthema- e d genta fragment). Corresponding coding sequences were re- fro m trieved and used to form sense/antisense alignments with h corresponding blue and amber fragments from all 207 ttp s HisRS sequences. The <MBP> for this alignment was ://a 0.265±0.0015, which is in the range of the other tests of ca d the null hypothesis. Thus, either Toprim domains are unre- em latedtotheclassIUrzymesorthesequencedivergenceistoo ic .o greattopreserveevidenceofsense/antisenseancestry. up .c There are sufficient numbers of sequences for the class I o m (TyrRS) and class II (ProRS) outgroups, used to root phylo- /m b genetic trees for TrpRS and HisRS, to afford a test of the e /a <MBP> behaviorofclose,consensusparalogs(supplemen- rtic tarysectionA,SupplementaryMaterialonline).Theoutgroup le -a Urgenemiddlebasesshareonly44%sequenceidentitywith b s those of their homologs, which themselves have more than tra c 60%identity.Havingidentifiedthethreeexcerptsofthetwo t/3 0 Urgenes from these sequences, we carried out all-by-all /7 /1 <MBP> calculations for the four possible class I/class II 5 8 alignments: TrpRS–HisRS, TrpRS–ProRS, TyrRS–HisRS, and 8/9 7 TyrRS–ProRS. The resulting <MBP> values are shown in 3 4 supplementary figure S1, Supplementary Material online. 15 b The four alignments have essentially equivalent high y g <MBP> values(0.336±0.005),andthestudentttestprob- u e s ability for the difference between sense/sense and sense/ t o antisense alignments within this group, which is derived n 3 0 FIG.3. Middle-basepairingfrequencydistributions.Thefrequenciesof fromfourindependentaaRS,waslessthan0.0001. M middlebasepairsinall-by-allsense/antisensecodingalignmentsforclass Theonlyconditionthatyieldsauniquedistributionwith arc I(TrpRS)andclassII(HisRS)Urzymegenesagainsteachother(CIvs. unexpectedly elevated <MBP> values is that proposed by h 2 CII0;center)arecomparedwithdistributionsfortworepresentationsof Rodin andOhno(1995). The middle-base pairing frequency 01 9 thenullhypothesis,inwhichthereisnoevidenceforsense/antisense for aligned TrpRS and HisRS active-site coding sequences, coding.Thefirst(CIIvs.CII0;top)isobtainedbyaligningeachclassII and indeed that for all four class I/class II comparisons, far sequenceantisense,inturn,againstallotherclassIIsequencesinthe exceeds what is expected under the null hypothesis. The sensedirection.Thesecond(Lectinvs.PDZ;bottom)isasimilarcom- three-block sense/antisense alignment described by (Pham parisonofthecodingsequencesforidenticallengthsofthegenesfora et al. 2007) is therefore almost certainly drawn from a lectin(pfamPF00059)againstthoseforPDZdomains(pfamPF00595). Eachisfittedtoanormaldistribution(solidline).Themeanmiddle-base unique protein-coding subset consistent with a sense/ pairing frequency for all three distributions is shown by the vertical antisense Urgene. arrow. High <MBP> Values Occur at Consistent Positions sequences derived from other aaRS families. Close extant across the MSAs structural relatives of the TrpRS Urzyme are found in the Toprim domains that occur in primases, topoisomer- If high middle-base pairing is nonrandom, there should be ases, and recombinases. We examined <MBP> in somecorrelationbetweenMBPvaluesatthesamepositionin 1594 MBE StatisticalTestsofSense/AntisenseAARSCoding . doi:10.1093/molbev/mst070 Table 2. Statistics for Distributions of Middle Base Pairs for the Larger Data Set. Hypothesis Distribution N <MBP> SD SE H TrpRS vs. HisRS0 43,677 0.349 0.0545 0.00026 N TrpRS vs. TrpRS0 22,366 0.264 0.0428 0.00029 1,a N HisRS vs. HisRS0 21,528 0.254 0.0443 0.00030 1,b N +1 Frame shifting 43,677 0.258 0.0664 0.00032 + N (cid:4)1 Frame shifting 43,677 0.261 0.0474 0.00023 (cid:4) N TrpRS vs. HisRS 43,677 0.252 0.0361 0.00017 s/s different sequence pairs. To investigate this possibility with Table 3. One- and Two-Dimensional Clustering. the smaller data set, we examined the ability of clustering Clustering 1D 2D Do algorithms to separate the 9,702 instances generated for (<MBP>) (<MBP>, <cc>) w n TrpRS/HisRSalignments,H,fromeither4,851instancesgen- lo H N H N a 1 1 d erated under N or 4,950 instances from N . If <MBP> e 1,a 1,b H against N d valuesarewell-enoughseparated,k-meansclusteringshould H 1,a 8,314 1,388 9,661 41 fro m groupvaluesaccuratelyintotwodifferentclusters,according tothehypothesisunderwhichtheyweregenerated.Weas- Na/1S,apecificity 01.18443 40,.686577 0.0304 40,.894986 https sumed two initial centroids to cluster the <MBP> values. b/Sensitivity 0.038 0.962 0.001 0.999 ://a ThealgorithmwasrunseparatelyforsetsHversusN1,aandH H against N1,b cad versusN1,b.Theoverallaccuracywassimilarforbothrepre- H 6,667 3,025 9,696 6 em sentationsofthenullhypothesis,91%forHversusN and ic 79%forHversusN1,b.Clusteringonthebasisonlyof<1M,aBP> Na/1S,bpecificity 05.30152 40,.264888 0.0001 40,.795939 .oup.c (table 3) yields higher sensitivity (fewer false negatives) but b/Sensitivity 0.106 0.894 0.000 1.000 om lowerspecificity(morefalsepositives). /m b Fornonrandommiddle-basepairingacrossapairofMSAs, e /a informationaboutmiddle-basepairingalongeachsequence <MBP> observed in the nine entries of figure 5A plus rtic shouldcomplementthatinthe<MBP> andenhanceclus- le the full data set can be explained by the kingdom of -a tering. For each group of alignments, we therefore con- b origin. Student t test probabilities are computed on the s structed a two-dimensional clustering vector by taking the tra basis of standard errors based on 6 degrees of freedom. c average of all base pairs in four blocks, each with 25% of Weconstructedoptimalphylogenetictreesforthe94-el- t/30 the alignments. Each profile consisted of the mean residue- /7 ement middle base sequences from the TrpRS and HisRS /1 by-residue correlation coefficient, <cc>, for the remaining 5 sequences. Phylogenies derived for the abstracted 94-mer 8 75%ofthealignedsequences.The<cc>valuesforallthree alignments are shown in figure 2. Despite the fact that the 8/9 7 profiles were then averaged, implementing a re-sampling 3 trees are based on middle bases of the excerpted Urgene 4 cross-validation (Picard and Cook 1984). Two-dimensional 15 (fig. 1), they are similar, to each other and to phylogenies b clustering using (<MBP>, <cc>) vectors increased both y derivedfromotherkindsofMSA.Finally,althoughtheirover- g sensitivity and specificity of clustering essentially to unity u allappearancesareapproximatelysimilar,comparisonofthe e s (table3).Positionalinformationaboutthehighmiddle-base twotreesrevealsthatittakesmorestepsfortheHisRStreeto t o pairing therefore contributes significantly to the clustering reachtherootfromthecontemporarysequences,compared n 30 powerofthetwo-dimensionalvector. with the TrpRS tree. This is consistent with the consensus M a (Brown et al. 2001; O’Donoghue and Luthey-Schulten 2003; rch <MBP> Increases for Multiple Nodes of Ancestral AndamandGogarten2011;Fournieretal.2011)thatTrpRS 20 1 Sequence Reconstruction separatedfromTyrRS(usedastheoutgrouptorootthetree) 9 The behavior of <MBP> values in the time domain sub- more recently than HisRS separated from other ProRS and stantially strengthens our conclusion. Even in the contem- otherclassIIaaaRS. porary sequences, bacterial TrpRS and HisRS sequences Maximum-likelihood ancestral sequence reconstruction exhibit significantly higher <MBP> than archaeal se- using the Lazarus (Hanson-Smith et al. 2010) interface to quences, which in turn have higher <MBP> than eukary- PAML (Yang 2007a) for the larger database produced 210 otic sequences (fig. 5A and table 4). Student t test and 206 reconstructed TrpRS and HisRS sequences, respec- probabilities are <0.0001 for the contribution to <MBP> tively. More than 75% of the reconstructed sequences had ofeukarya((cid:4)0.044),andaresignificant(0.01–0.007)forthe posterior probabilities more than 0.98, all were more than contributionsoftheTrpRSbacterial(+0.017),andthejoint 0.89, and low values appeared randomly distributed. When presence of both TrpRS and HisRS bacterial sequences these internal nodes were aligned sense/antisense, the (+0.038), relative to the intercept (0.35). The regression <MBP>, 0.357±0.002, was higher than that from contem- shows that approximately 0.97 of the variation in poraryalignmentsby23timesthestandarderror. 1595 MBE Chandrasekaranetal. . doi:10.1093/molbev/mst070 Divergence times are difficult to assign consistently in randomization, and from actual protein coding sequences our case because the time domain is both more extensive forbothrandomandmultiplesequencealignmentsforpro- and hence somewhat more ill-defined than usual, and tein families that exhibit no evidence for coding sequence because we are comparing results derived for two different complementarity. Distant putative paralogs (Toprim do- proteinfamilies,whosetreesarenonisomorphous.Datingthe mains) show no evidence of sense/antisense ancestry with ancestral nodes with node-height=1 using the fraction class II aaRS, whereas analysis of closely paralogous class I (~45%) of species represented in the TimeTree database and II aaRS outgroups exhibit essentially the same, elevated (Hedges et al. 2006) gave divergence times between 2 and <MBP>. 3,200Ma(supplementarysectionC,supplementarytableS1, Differences between the class I/II sense antisense align- and fig. S3, Supplementary Material online). These recon- ments and all similar tests representing the null hypothesis structednodesareinferredfromthemostsimilarsequences, areallmorethan100timesthestandarderrorsofthemeans. and should therefore be most recent, and yet the species Thus, they imply with considerable certainty that sense/ from which they are derived range over the entire time antisense alignments of TrpRS andHisRS Urgene sequences D o periodof cellular biology. Horizontal gene transfer therefore aredrawnfromapopulationdistinctfromthoseformedfrom w n appearstobesignificantacrosstheentiredatasetinwayswe most pairs of naturally occurring proteins, and hence that loa d cannottrack. they are somehow linked. The positional sensitivity and e d Under these circumstances, and as approximately half of robustincreasein<MBP> withthenode-heightforrecon- fro m thereconstructednodescannotbeassigneddivergencetimes structed ancestral nodes (fig. 5B) strongly reinforce this h withthepresentTimeTree database,itseemsreasonable to conclusion in qualitatively different ways. The simplest ttp s usecladogramnode-heights(i.e.,thenumberof nodescon- explanation for the unique coding patterns illustrated ://a nectinganodetoacontemporarysequence)asasurrogate in figure 3B is that the two classes of aaRS retain high ca d for divergence times, which is most nearly valid under the middle-base complementarity because they actually did em assumptionofaconstantmolecularclock.Usingnodeheights arise on opposite strands of the same gene, as proposed ic.o to form 10 bins related to the divergence times (fig. 4), we by Rodin and Ohno (1995). Our results therefore provide up found that up to the most remote and least well-defined compelling bioinformatic evidence that the null statement .co m sequences, reconstructions more distant from the contem- corresponding to the Rodin–Ohno hypothesis should be /m porarysequencesexhibitedhigher<MBP>(fig.5B).Further, rejected. be /a <MBP> values for internal nodes of the bacterial trees in- rtic creaseto0.42±0.002.Thus,bothbydivergencetimeandby le Bacteria Appear Uniformly Closer than Eukarya or -a node-height,theindependentancestralsequencereconstruc- b Archaea to the Origin of Translation s tions of TrpRS and HisRS internal nodes suggest that both tra treesmovetowardprogressivelyhigher<MBP>inancestral Carefulre-examinationofthedivergenceofTrpRSfromTyrRS ct/3 (Brownetal.1997)revealedthatreciprocallyrootedphylog- 0 nodes,andoverevolutionarytimehavedivergedfromsense/ /7 enies of TrpRS and TyrRS sequences suggested a closer /1 antisensecomplementarity. 5 8 evolutionary relationship of Archaea to eukaryotes, placing 8 /9 Discussion the root of the universal tree in the Bacteria. Curiously, the 7 3 highest <MBP> values between TrpRS and HisRS occur 41 Questions posed here involve subtle distinctions between 5 in bacterial sense/antisense alignments, leading to the b the Urzymes themselves and the coding blocks aligned in approximate (diagonal) symmetry of figure 5A and the y g u the putative Urgene. It should be noted that in this work, e significantly higher <MBP> for reconstructed ancestral s we have not aligned the entire coding sequences of the sequences (fig. 5B). Were the structural data for archaeal t on twoactiveUrzymes,onlythreeconsecutiveblocksspanning 3 TrpRSs and HisRSs insufficient to correctly condition our 0 theiractivesites.Theseblocksencodeallessentialactive-site M choice of active-site fragments, so that unidentified errors a definingresiduesandsecondarystructures,buttheyeliminate in the resulting 94-residue Urgenes are responsible for rch some portions of each Urzyme (fig. 1B and C). Conversely, 2 their significantly lower <MBP> values? If not, the higher 0 neitherhaveweshownthatproductsoftheputativeUrgene <MBP> of bacterial sequences affords novel evidence 19 arecatalyticallyactive. that bacteria are closer than other kingdoms to the origin of enzymatic aminoacylation, and hence to the origin of Status of the Rodin–Ohno Sense/Antisense Coding translationitself. Hypothesis for Class I and II aaRS Ourresultsalsoseemstronglytocontradicttheproposal Commonancestrysubstantiallyreducesthenumberofinde- raised by a recent detailed structural comparison of TrpRSs pendent observations (Felsenstein 1984). For this reason, andTyrRSsfrombacteria,archaea,andeukarya(Dongetal. Student t testing overestimates statistical significance. 2010)thatTrpRSdivergedfromTyrRSafterarchaeadiverged Moreover, the detailed procedures suggested by Felsenstein frombacteriaandthatallbacterialTrpRSsarosebyhorizontal to estimate the number of independent observations are genetransfer.Itseems unlikelythat genestransmittedfrom unwieldy.Toestablishappropriate estimatesof middle-base archaeatobacteriawouldhaveexperiencedselectivepressure pair frequency expected under the null hypothesis, we to increase their coding complementarity with class II aaRS. compiled such distributions after frame-shifting and Curiously, the TrpRS tree (fig. 2A) places eukaryotes within 1596 MBE StatisticalTestsofSense/AntisenseAARSCoding . doi:10.1093/molbev/mst070 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/3 0 /7 /1 5 8 8 /9 7 3 4 1 5 b y g u e s t o n 3 0 M a rc h 2 0 1 9 FIG.4. BinassignmentforTrpRS(A)andHisRS(B)ancestralnodes.Thesamephylogeneticinformationrepresentedinfigure2isreproducedherein horizontalcladograms,toindicateassignmentsofnodesgroupedintoeachbinforpurposesofplottinginfigure5.Nodeheightsassociatedwitheach binareincludedatthebottomofeachcoloredrectangle.Boldlinesdenoteoutgroups. 1597
Description: