GC-Biased Gene Conversion Impacts Ribosomal DNA Evolution in Vertebrates, Angiosperms, and Other Eukaryotes Juan S. Escobar,*(cid:1),1 Sylvain Gle´min,1 and Nicolas Galtier1 1Institut des Sciences de l’Evolution, Unite´ Mixte de Recherche 5554 Centre National de la Recherche Scientifique, Universite´ Montpellier II, Montpellier, France (cid:1)Present address: Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada. *Corresponding author: E-mail: [email protected]. Associate editor: Manolo Gouy Abstract Ribosomal DNA (rDNA) is one of the most conserved genes in eukaryotes. The multiples copies of rDNA in the genome D evolve in a concerted manner, through unequal crossing over and/or gene conversion, two mechanisms related to Row homologous recombination. Recombination increases local GC content in several organisms through a process known as enlo sa GC-biasedgeneconversion(gBGC).gBGChasbeenwellcharacterizedinmammals,birds,andgrasses,butitsphylogenetic ede distribution across the tree of life is poorly understood. Here, we test the hypothesis that recombination affects the ard fro evolutionofbasecompositionin18SrDNAandexaminethereliabilityofthisthoroughlystudiedmoleculeasamarkerof cm gBGCineukaryotes.Phylogeneticanalysesof18SrDNAinvertebratesandangiospermsrevealsignificantheterogeneityin h http tohfetehveol1u8tSionrDoNfAba,sceocnosmistpeonstitwiointhacprroesvsiobuosthggernooumpse.-Mwiadmemanaalsl,ybseirsd.sI,nanadddgirtaiossne,swexepeorbiesenrcveeinincrceraesaessedinGthCecGoCntceonnttsenint ars://ac ta Ostariophysiray-finnedfishesandcommelinidmonocots(i.e.,thecladeincludinggrasses),suggestingthatthegenomesof ide cm these two groups havebeen affected by gBGC. Polymorphism analyses inrDNA confirm that gBGC, not mutation bias,is lic e.o the most plausible explanation for these patterns. We also find that helix and loop sites of the secondary structure of u p ribosomalRNAdonotevolveatthesamepace:loopsevolvefasterthanhelices,whereashelicesareGCricherthanloops. .c o m We extend analyses to major lineages of eukaryotes and suggest that gBGC might have also affected base composition in /m Giardia (Diplomonadina), nudibranch gastropods (Mollusca), and Asterozoa (Echinodermata). b e /a Key words: base composition, concerted evolution, isochors, recombination, ribosome, secondary structure. rtic le -a b s Introduction evolutionofrDNAleadstohighredundancyofribosomes, tra Genes encoding ribosomal RNA (rRNA) are among the whichispresumablybeneficialtotheorganismasallribo- ct/2 most utilized and conserved genes in eukaryotes. Conser- somalsubunitsareequallycompatiblewithothercompo- 8/9 vation has been demonstrated in both linear sequences nentsofthecellulartranslationalmachinery(Averbeckand /256 Eickbush 2005). Concerted evolution of rDNA occurs in 1 andsecondarystructures,whichsuggeststhatrRNAisun- /1 derstrongpurifyingselectivepressure.Eukaryotegenomes many species, including angiosperms (Flavell and O’Dell 01 0 organize the single rRNA of the small ribosomal subunit 1976; Wendel et al. 1995; Franzke and Mummenhoff 15 3 (18S)andtwooftherRNAsofthelargeribosomalsubunit 1999; Fuertes Aguilar et al. 1999; Lim et al. 2000; Koch b y (5.8S and 28S RNA) into a shared transcription unit. et al. 2003; Kovarik et al. 2004, 2005; Rauscher et al. gu e Transcriptionunitsformmultigenefamiliesinlongtandem 2004) and vertebrates (Brown et al. 1972; Arnheim et al. s t o arrays, the ribosomal DNA (rDNA) loci, carried by one or 1980, 1982; Hillis et al. 1991). n 1 a small number of chromosomes (Dover 1994; Gonzalez Twoclassesofmechanismshavebeenputforwardto 5 A and Sylvester 2001; Eickbush TH and Eickbush DG 2007). explain the concerted evolution of rDNA (reviewed in p Variable numbers of rDNA loci occur in different species Kupriyanova 2000; Eickbush TH and Eickbush DG ril 2 0 (fromafewtenstomorethan50,000copies),sothat,de- 2007): unequal crossing over between homologous 19 pending on the considered species, rDNA can contribute rDNA units (Brown et al. 1972; Petes 1980; Szostak a substantial percentage of the nuclear genome (Long and Wu 1980; Endow and Komma 1986; Schlotterer and Dawid 1980; Rogers and Bendich 1987). and Tautz 1994) and gene conversion, that is, the copy OneofthemostremarkablefeaturesofrDNAlociisthe and paste of one genomic copy onto another, whether homogeneityinsequenceamongtranscriptionunitswithin they are orthologous or not (Hillis et al. 1991; Gangloff a genome. This ability to change sequences in a highly et al. 1996; Benevolenskaya et al. 1997; Fuertes Aguilar orchestrated manner, that is, to spread or eliminate new etal.1999;Liao2000;Limetal.2000).Today,itiswidely mutationsarrivinginoneunittoadjacentunits,isknown acceptedthatconcertedevolutioninrDNAresultsfrom asconcertedevolution(Dover1994;ElderandTurner1995; a combination of both these mechanisms (Eickbush TH Liao1999;EickbushTHandEickbushDG2007).Concerted and Eickbush DG 2007), the two being mechanistically ©TheAuthor2011.PublishedbyOxfordUniversityPressonbehalfoftheSocietyforMolecularBiologyandEvolution.Allrightsreserved.Forpermissions,please e-mail:[email protected] Mol. Biol. Evol. 28(9):2561–2575. 2011 doi:10.1093/molbev/msr079 Advance Access publication March 28, 2011 2561 MBE Escobar et al. · doi:10.1093/molbev/msr079 related to the molecular process of homologous in 18S rDNA to various groups of eukaryotes to pinpoint recombination. potential gBGC events across the eukaryotic tree. Interestingly, recombination is associated with GC bias inanumberoforganisms(reviewedinMarais2003;Duret Materials and Methods and Galtier 2009). Sucha biasmaybe producedby muta- Interspecific Data Sets in Vertebrates and tion, selection favoring G and C bases or GC-biased gene Angiosperms conversion (gBGC). gBGC is a bias in the cellular DNA re- Sequences of the 18S rDNA of vertebrates and angio- pairmachinerythatresultsinameioticsegregationdistor- spermswereobtainedfromtheSILVAdatabase(Pruesse tionfavoringGandCoverAandTalleles,henceincreasing et al. 2007). We downloaded 1,655 sequences (526 spe- GC content in the long term. gBGC affects single-copy cies) of vertebrates and 6,085 sequences (2,186 species) genes(Romiguieretal.2010)andmulticopy genesunder- ofangiosperms.Sequenceswereeditedandalignedwith goingconcertedevolution(Galtier2003;Kudlaetal.2004) ARB (Ludwig et al. 2004). Edition consisted in filtering in mammals (Galtier et al. 2001; Montoya-Burgos et al. D out short and low-quality sequences. In the vertebrate o 2003; Meunier and Duret 2004; Spencer 2006; Duret and w Arndt 2008), birds (Webster et al. 2006), yeasts (Birdsell data set, we first selected sequences(cid:1)1,200 nt (nuc_ge- nlo 2002; Mancera et al. 2008), and grasses (Gle´min et al. ne_slv parameter) and (cid:1)95% alignment quality (align_- ade e2x0t0e6n;tHDaurodsroypehtilaal.(2G0a0l8ti;eErsceotbaal.r2e0t0a6l.;2H0a1d0d)railnldettoal.a2lo0w07e)r. qseuqaulietny_cselsv bpyarpaemrceetenrt)a.geThoefn,quwaelityor(dseeqre_dquaselilteyc_tseldv d from and pintail_slv parameters), next by alignment quality, h gBGC is a recently discovered evolutionary force, which ttp andfinallybysequencelength.Inthisway,weassembled s not only impacts GC-content dynamics but also affects 287 high-quality sequences (287 species). Angiosperm ://a functional components of the genome by impeding c a sequences were selected in a similar way. However, be- d the action of natural selection (the Achilles’ heel e cause there were much more sequences of angiosperms m hypothesis; Galtier et al. 2009). Birdsell (2002) and Lynch ic than vertebrates, our criteria could be more restrictive: .o (2007) suggest that gBGC could be a ubiquitous u mechanism resulting from the generalized AT bias of (cid:1)1,700ntand(cid:1)98%alignmentqualityforthefirstfilter. p.c o Then, we ordered and selected sequences as above and m themutationprocess.Therecentdiscoveryofwidespread randomlychoseonesequencepergenus.Ourdatasetof /m AT-mutation bias and GC-fixation bias in bacteria b e (wHheartshcboenrfigramnsdtPheitsrovvie2w0.1I0n;Heuildkaerbyroatneds,egtBaGl.C20h1a0s)ssoomfaer- a(1n,g0i4o9spsepremcisesc)o.nsisted of 1,049 high-quality sequences /article been studied in a handful of taxa for which genome-wide Selected sequences were aligned using the ARB aligner, -ab which uses an 18S rRNA secondary structure backbone. s comparative data are available. Its phylogenetic tra Three alignments were obtained in each data set: 1) the c duinsdtreibrsuttoioodnadcreospssittehetthreeepooftliefen,ttiahleriemfoproe,rtisanocnelyvoafgutheliys complete alignment (i.e., all sites); 2) the alignment of t/28/9 process in molecular evolution. paired double-strand regions (i.e., helix sites); and 3) the /2 5 alignmentofunpairedsingle-strandregions(i.e.,loopsites). 6 rDNAisanidealtargettostudytheimpactofgBGCin 1 theevolutionofeukaryoticgenomesbecauseithasavery Ambiguouslyalignedsites,aswellassitesincludinggapsin /10 1 specificevolutionarydynamics(i.e.,long-termhighrecom- more than 50% of sequences, were excluded of the final 01 5 binationratethankstoconcertedevolution)andbecauseit alignment with Gblocks 0.91b (Castresana 2000) set to 3 b hasbeensequencedinaverylargenumberofspecies.We thefollowingparameters:minimumnumberofsequences y g reasoned that the analysis of GC-content variations in foraconservedposition5n/2þ1(wheren5numberof ue s rDNA could give insight into the prevalence of gBGC in taxa); minimum number of sequences for a flanking posi- t o n various lineages of the eukaryotic tree. In this paper, we tion50.85(cid:2)n;maximumnumberofcontiguousnoncon- 1 5 analyzetheevolutionofbasecompositioninthe18SrDNA servedpositions58;minimumlengthofablock52;and A p andexaminethereliabilityofthismoleculeasamarkerof allowedgappositions5withhalf.Thefinalalignmentsof ril 2 gBGC. We first analyze vertebrates and angiosperms be- all,helix,andloopsitesinthevertebratedatasetcontained 01 9 cause these two groups have been thoroughly studied in 1,342,894,and480nt,respectively.Intheangiospermdata terms of base composition (e.g., Eyre-Walker and Hurst set,alignmentsconsistedof1,400,969,and454nt,respec- 2001;Wangetal.2004)andcontaincladesinwhichgBGC tively. Note that the sum of helix and loop sites does not has been documented (e.g., mammals, birds, and grasses). correspond to the number of aligned positions at all sites Thesegroupsserveaspositivecontrolsinourtests,thatis, because each of the three alignments was separately trea- if rDNA is an appropriate marker of gBGC, we expect to ted with Gblocks. find signatures of GC enrichment in, at least, these three TheCpGmethylation-deaminationprocessleadstohy- groups. Because the secondary structure of the 18S rRNA permutability in vertebrates and angiosperms (Bird 1980; iswellknown,wequantifytheeffectofgBGCinhelixand Kovarik et al. 1997; Arndt et al. 2003). To account for loop sites of the molecule. Single nucleotide polymor- the potential influence of CpG sites in our results, we phisms(SNPs)dataareusedtoconfirmthatgBGC(orse- cleaned up the raw alignments of all sites by eliminating lectionforGCcontent),notmutationbiases,isatworkin sites involved in CpG, TpG, or CpA pairs in at least 50% rDNA.Finally,weextendtheanalysisofbasecomposition of the sequences of the alignment. We only analyzed 2562 MBE GC-Biased Gene Conversion in Ribosomal DNA · doi:10.1093/molbev/msr079 alignmentsincludingallsitesbecausesitesinhelixorloop We also tested heterogeneity in the GC content using alignments were not always physically adjacent. CpG-fil- nonhomogeneous models of sequence evolution and the tered alignments were treated with Gblocks as described phylogeniesofvertebratesandangiosperms.Thesemodels above.ThealignmentwithoutCpGconsistedof1,102sites were fitted with BPPML (Dutheil and Boussau 2008) and in vertebrates and 1,096 sites in angiosperms. NHML (Galtier and Gouy 1998), which use a maximum All alignments are available for download at http:// likelihoodapproachandaccountfornonstationary(ances- mbb.univ-montp2.fr/MBB/subsection/data.php?section5 tral and current GC content can differ) and nonhomoge- 2&from_dts55&nb_dts55. neity(branchescanhavedistinctGC)inbasecomposition acrossthephylogeny.WeestimatedtheGCcontentatany node(orgroupsofnodes)andtheequilibriumGCcontent Vertebrate and Angiosperm Phylogenies (GC*) at any branch (or group of branches) of the phylo- We inferred phylogenetic trees of vertebrates and angio- genetic tree. GC* is defined as: spermswithPhyML3.0(GuindonandGascuel2003)using D the alignments including all sites. We used a general time AT/GC o GC(cid:3) 5 ; ð1Þ w reversible(GTR)þIþGmodelofsequenceevolution,four AT/GC þ GC/AT nlo categories for the gamma distribution, parsimony starting a whereAT/GCreferstothesubstitutionratefromAorTto de trees,andsubtreepruningandregraftingbranchswapping. d Generally, species belonging to most terminal clades G or C bases, and GC / AT holds for the inverse (Sueoka fro 1962).GC*isamoreappropriatemeasureoftheevolutionary m gclraoduepsedwetroegentohterw,ealllthreosuoglvhedr,elcaotniosnisstheinpts waimthonpgrevdieoeups rdeytnaanmdicAsrtnhdant 2th0e08c)u.rrentGC(MeunierandDuret2004;Du- https reports (Hasegawa and Hashimoto 1993; Soltis et al. We fitted hierarchical models of sequence evolution to ://a c 2000; Winchell et al. 2002; Mallatt and Winchell 2007; testwhetherbranchesunderwentsimilarevolutionofbase ad e SwallaandSmith2008;TheAngiospermPhylogenyGroup composition.Weusedlikelihood-ratiotests(LRT)toassess m ic 2009). For this reason, we adjusted the trees to be whethermorecomplexmodelsprovidedasignificantlyim- .o u congruentwiththemostrecentandacceptedphylogenies proved fit compared with simpler models. In vertebrates, p.c of vertebrates and angiosperms, whereas relationships GCandGC*wereestimatedusingeighthierarchicalnested om amongmorerecentlyderivedgroupswerekeptasinferred models(seetable1):1)oneestimateforallbranchesofthe /mb byPhyML.Invertebrates,weusedthebackbonephylogeny phylogeny(homogeneous);2)cyclostomesandtheremain- e/a proposed by Alfaro et al. (2009) and manually adjusted it ing clades (jawed vertebrates); 3) we distinguished sharks rtic following other publications (van Tuinen et al. 2000; andEuteleostomiamongthejawedvertebrates;4)wedis- le-a V20e0n3k;aMtesuhrateateatl.a2l.020010;3H; Burdienlkomt aentnael.t2a0l.0230;0M4;iyDaouetzearyl. tleinogstuoismhei;d5A)ctwineopdtiesrtyinggiiuaisnhdedSarCcooepltaecrayngtiihsa,mlounnggfiEsuhtees-, bstrac and Huchon 2004; Inoue et al. 2004; Townsend et al. t/2 and tetrapods among Sarcopterygii; 6) we distinguished 8 2004; Lavoue´ et al. 2005; Sullivan et al. 2006; Hugall amphibians and amniotes among tetrapods; 7) terminal /9/2 et al. 2007; Mallatt and Winchell 2007). In angiosperms, clades (as shown in fig. 2); and 8) each branch of the tree 561 we used the phylogeny suggested by The Angiosperm hasitsownGCandGC*.Notethatinmodels3–7,different /10 Phylogeny Group (2009) and adjusted it (Olmstead et al. GCandGC*wereestimatedininternalbranches.Inangio- 101 2000; Karehed 2001; Lundberg 2001; Tamura et al. 2004; 5 sperms,sixhierarchicalnestedmodelswerefitted(seetable 3 Nickrent et al. 2005; Gonza´lez et al. 2007; Wanga et al. b 2009; Worberg et al. 2009). The main analyses presented 1tr)o:b1a)ilheoyamleosg,emneaognuos;li2id)s–ACmhbloorraenllathleasl,esN,ymmopnhoaceoatless,,CAeuras-- y gue here were obtained with the modified trees. However, to tophyllales, and eudicots have different GC and GC*; 3) st o test the robustness of our results, we also performed n we distinguished commelinids and non-commelinids 1 analyses using unmodified trees. 5 among monocots; 4) we distinguished basal eudicots (Ra- A p Analyses of Heterogeneity in Base Composition nunculales, Sabiaceae, Proteales, Buxales, and Trochoden- ril 2 We estimated the GC content at all, helix, and loop sites. drales) and core eudicots among eudicots; 5) terminal 01 9 Wefirstperformedanalysisofvariance(ANOVA)usingthe clades (as shown in fig. 3); and 6) each branch of the tree proportionofGC(dataarcsinsquareroottransformed)in hasitsownGCandGC*.Asabove,inmodels2–5different each of these three alignments as variable and clades as GC and GC* were estimated in internal branches. factors with R 2.9.1 (R Development Core Team 2009). Analyses in vertebrates and angiosperms were per- Cladesweredelimitedinordertoworkattaxonomiclevels formedusingall,helix,andloopsitesseparately.Addition- similar to those in which gBGC has been documented ally, we performed analyses using the phylogenetic trees (mammals,birds,andgrasses).Invertebrates,cladeswithin inferred with the 18S rDNA and the alignments in which Actinopterygii (ray-finned fishes) generally corresponded CpG sites were removed. In all cases, we analyzed align- to taxonomic orders and in Sarcopterygii (coelacanths, ments in which invariable sites were removed. This is be- lungfishes, and tetrapods) to taxonomic classes (fig. 2). causeinvariablesitesarepresumablyunderstrongselective Inangiosperms,cladeswereconsistentwithtaxonomicor- constraintsandwereuselessforouranalysesoftheevolu- dersorsuperordersofAPGIII(TheAngiospermPhylogeny tionofbasecomposition.Asexpected,removinginvariable Group 2009) (fig. 3). sites affected GC but not GC* estimates. 2563 MBE Escobar et al. · doi:10.1093/molbev/msr079 Table 1.Hierarchical Models ofSequenceEvolution ofthe 18SRibosomal DNA. Model 2lnL Dev. df Pvalue Vertebrates 1. Homogeneous 13568.60 2. Cyclostoma1Jawedvertebrates 13567.59 2.01 1 0.1558 3. Cyclostoma1Chondrichthyes 1Euteleostomi 13554.27 26.64 2 1.6431026 4. Cyclostoma1Chondrichthyes 1Actinopterygii1Sarcopterygii 13535.99 36.56 1 1.48310-9 5. Cyclostoma1Chondrichthyes1Actinopterygii 1Coelacanth1Dipnoi1Tetrapoda 13532.88 6.21 1 0.0127 6. Cyclostoma1Chondrichthyes1Actinopterygii 1Coelacanth1Dipnoi1Amphibia1Amniota 13527.21 11.35 1 0.0008 7. Terminalclades(asshowninfig.2) 13481.32 91.77 31 6.3231028 8. OneGC*perbranch 13202.85 556.95 533 0.2287 D o Angiosperms w n 1. Homogeneous 11598.04 lo a 2. Amborellales1Nymphaeales1Austrobaileyales d e 11MCeargantooplihdys–llCalhelso1ranEtuhdailceost1s Monocots 11540.53 115.03 10 5.14310-20 d fro m 3. Amborellales1Nymphaeales1Austrobaileyales h 1Magnolids–Chloranthales1commelinids ttp s 4. Am1bonroenll-acleosm1meNliynmidpsh1aeaCleersa1topAhuysltlarloebsa1ileyEauldesicots 11537.41 6.24 2 0.0441 ://ac a 1Magnolids–Chloranthales1commelinids d e 1non-commelinids1Ceratophyllales m 1basaleudicots1coreeudicots 11537.00 0.82 1 0.3660 ic.o 5. Terminalclades(asshowninfig.3) 11472.39 129.22 64 2.6231026 u p 6. OneGC*perbranch 11335.82 273.13 162 1.0831027 .c o m NOTE.—lnL:loglikelihood;Dev.:residualdeviance;degreesoffreedom(df):residualdegreesoffreedom.Analysesperformedinallsitesofthemolecule. /m b e /a Becausethedatasetinangiospermswastoobigforphy- GC* per terminal clade (models 7 in vertebrates and 5 in rticle logenetic analyses of base composition (1,049 sequences), angiosperms)andonefreeGC*foreachinternalbranch.In -a b s we reduced the number of sequences to two per taxo- thisanalysis,branchlengthswerefixed(tovaluesestimated tra nomic order (according to The Angiosperm Phylogeny withPhyMLusingaGTRþIþGmodel)tobesurethatthe ct/2 Group2009)whenpossible.Forthis,weestimatedtheper- detected heterogeneity between loop and helix sites re- 8/9 centage of identity among sequences of each order and flects variations in GC-content dynamics, not in lineage- /2 5 6 randomly selected pairs of sequences among those with specific evolutionary rate. 1 /1 level of identity equal to the median value of the order. 0 1 Wepreferredtoselectpairsofsequencesshowingmedian Polymorphisms in rDNA 01 5 ratherthanmaximaldivergencetoavoidpickingsequences SNPsfromrDNAlociwereretrievedfromthePolymorphix2 3 b with peculiar evolutionary dynamics (e.g., pseudogenes) database(Bazinetal.2005).Thesedatacontainedpartialor y g u that would have remained undetected along the filtering complete sequences of any of the following: 5S, 5.8S, 18S, e s process described above. The data set of angiosperms 28S,internaltranscribedspacers(ITS-1andITS-2),andinter- t o n on which we fitted the nonhomogeneous models of se- genic spacers (IGS).Inmany cases,alignmentsconsistedof 1 5 quenceevolutionconsistedof121sequences(121species). contigs of these regions (e.g., partial 18S, completes ITS-1, A p Totestwhetherbasecompositionevolveddifferentlyin 5.8S and ITS-2, and partial 28S). In angiosperms, 311 align- ril 2 helixandloopsites,weperformedanLRTtocomparethe ments (267 species) containing five sequences ofthe same 01 9 loglikelihoodofamodelincludingallsites,andthesumof speciesor morewereretrieved.Invertebrates,asfewas18 loglikelihoodsofmodelsincludinghelixandloopsitesan- alignments (15 species) containing four sequences of the alyzedseparately.Thistestusesthetotalnumberofparam- same species or more were available in the Polymorphix etersestimatedineachmodelasthenumberofdegreesof 2 database. Because the last update of this database was freedom.IftheGCcontentevolveddifferentlybetweenhe- in November 2004, we complemented the SNP data set licesandloops,weexpectedthatthesumofloglikelihoods in vertebrates with data from GenBank (requested on 21 ofmodelsseparatingloopsandheliceswouldsignificantly May2010).WeperformedsearchesofnuclearrDNAsusing improve the fit relative to the model considering all sites. thefollowingcommandline:(vertebrate*NOTmitochond*) Analyses were performed in vertebrates and angiosperms AND(5SOR5.8SOR18SOR28SORspacer).Thefinaldata separately using variable and invariable positions of the set in vertebrates consisted of 51 alignments (42 species) alignments.Thealignmentofallsiteswasobtainedbycon- containing four sequences of the same species or more. catenatinghelixandloopalignmentstreatedwithGblocks. IntraspecificrDNAsequenceswerealignedwithPrankv. Weestimatedloglikelihoodsusingamodelassumingone 100311 (Loytynoja and Goldman 2005). The resulting 2564 MBE GC-Biased Gene Conversion in Ribosomal DNA · doi:10.1093/molbev/msr079 alignments were filtered with Gblocks using the same pa- Nonhomogeneous models of sequence evolution, ex- rametersdescribedabove.NotethatallourSNPalignments plicitlytakingintoaccountphylogeneticrelationships,con- consisted of sequences from just one species, hence SNPs firm heterogeneity in base composition among major were not oriented. We deliberately analyzed nonoriented clades in vertebrates and angiosperms (table 1). In verte- SNPs given the rarity of appropriate outgroups across brates,almostallcladeswithinActinopterygiiandSarcop- the various alignments. Polymorphism analyses were per- terygii have increased GC content relative to their formed using a homemade program that calculates the ancestors. Ostariophysi (i.e., Siluriformes, Clupeiformes, numberofpolymorphicAT4GCsitesand,amongthese and Cypriniformes), mammals, and birds are remarkable sites, the number of sites with GC frequency .0.5, ,0.5, inthisrespect.GCcontentintheancestorofOstariophysi and 50.5. SNP counts were pooled by clade (as shown in hasincreased;5%relativetotheancestorofActinopter- figs.2and3)andtwo-sidedbinomialtestswereperformed ygii.Consistently,GC*atthebaseofthiscladeishigh(0.79 totestformutationalAT4GCequilibrium.Atmutational overall sites), although GC* values are lower in more re- equilibrium, one expects AT/ GC 5 GC / AT, that is, centlyderivedbranches(fig.2).Interestingly,GC*isgreater D o equal amounts of SNPs in which GC is the most frequent in helix (0.81) than loop (0.73) sites in this branch. Mam- wn allele and SNPs in which AT is the most frequent allele. malsandbirdshaveexperiencedanincreaseof;4%inthe loa We analyzed a subset of sufficiently annotated align- GCcontentrelativetotheancestralSarcopterygii.Analysis ded ments of vertebrates and angiosperms to determine ofGC*suggeststhattheincreasetookplaceatthebaseof fro whetherforcesshapingbasecompositionactdifferently m amniotes(GC*overallsites50.89)andaffectedloop(1.00) h in the differentfragmentsof rDNAloci(asdocumented more than helix (0.79) sites (fig. 2). Unlike Clupeiformes, ttp s for noncoding polymorphisms in the X chromosome of Cypriniformes, and Siluriformes, GC* is high in extant ://a D. simulans; Haddrill and Charlesworth 2008). We esti- c mammals(0.69overallsites),birds(0.85),andLepidosauria a d mated the mean per site polymorphism rate (number e (0.86), although low in the turtle (0.51) and crocodiles m of SNPs/alignment length) of each alignment and com- ic (0.49)—but the numbers of sequences analyzed in these .o paredfragments(5S,5.8S,18S,28S,ITS-1,ITS-2,andIGS) u two clades were low (one and three, respectively). On p usingnonparametricKruskal–WallistestswithR2.9.1(R .c theotherhand,afewcladesofvertebratesshowadecrease om Development Core Team 2009). in GC content (low GC*), including Chondrostei, Coela- /m b Interspecific Data Set in Other Eukaryotes canth, and Dipnoi (lungfishes) (fig. 2). Importantly, the e/a Wjoreldinoewangelosaodfeeduskeaqruyeontecsesfroofmthteh1e8SSIrLDVNAAdoaftamboasset.mSea-- GinCvdeyrtneabmraitcesssi(gLnRiTfic:avn242tl5yv1a8r5y:8b2e,twPe5en5h.3e5lix(cid:2)an1d0l(cid:4)o2o0p).sites rticle-ab Amongangiosperms,monocots,especiallycommelinids s quenceswereeditedasdescribedaboveusingthelength (grasses, palm trees, gingers, bananas, arrowroots, and al- trac andqualitythresholdsgivenintable4.Thefinaldataset lied),showthemostimportantincreaseinGCcontentrel- t/2 oerfae,uakllarliynoetaegsecsocnosnisftoeudnodfed11),2an59dsienqculuednecdes2(,560,668V9igriedni-- ative to the ancestor of all angiosperms but Amborellales 8/9/2 5 (fig.3).Incommelinids,thisincreaseisofabout3%forhelix 6 plantae,1,993Fungi,and5,211Metazoa.Wedetermined 1 and 7% for loop sites. GC* are high in commelinids for all /1 thedistributionofGCcontentpergenus(acrossspecies) 0 1 sites,helix,andloopsitesandhigherinhelicesthanloops. 0 within each eukaryotic lineage. 1 Indeed,GC*inloopsitesinthiscladearethehighestofall 53 b angiosperms. GC* values in the ancestor of magnoliids, y Results g Chloranthales, monocots, and Eudicots (MCME) are also u e s Heterogeneity in Base Composition of the 18S high, although most lineages have apparently engaged in t o rDNA in Vertebrates and Angiosperms GC erosion (note the lower GC* values of most terminal n 1 5 Base composition of the 18S rDNA significantly varies clades relative to more internal branches; fig. 3). Indeed, A p amongtaxonomicgroupsinbothvertebratesandangio- mostEudicotshavelowGC*,especiallyGunnerales,Sabia- ril 2 sperms(P,2.20(cid:2)10(cid:4)16inallANOVAs).Thisistruefor caeae, Santalales, and Trochodendrales. This suggests that 01 alignments including all sites, helix, and loop sites. anevolutionaryforceincreasingtheGCcontentwasactive 9 Noteworthy, helices are GC richer than loops (fig. 1), at some point in the evolution of angiosperms, and is no whereas loops evolve faster (vertebrate tree lengths: longeractiveinmostgroups(especiallyEudicots),withthe helix 5 1.67; loop 5 2.92; angiosperm tree lengths: notable exception of commelinids. As in vertebrates, the helix 5 20.60; loop 5 31.02; differences in tree length GC dynamics significantly varies between helix and loop between vertebrates and angiosperms essentially reflect sitesinangiosperms(LRT:v2 5299:05,P52.08(cid:2)10(cid:4)36). 52 differences in the number of analyzed species). Among We were concerned about the effects of CpG hyper- vertebrates, four GC-rich clades stand out from the mutable sites and the phylogenetic trees on our results. remaining groups: birds, Cypriniformes, mammals, and Asexpected,thereisareductionintheGCcontentofse- Siluriformes (fig. 1A). In angiosperms, heterogeneity in quenceswithoutCpGsites.Nevertheless,thepatternsob- base composition islessremarkablethanin vertebrates. served in vertebrates and angiosperms are robust and do However, monocots, especially commelinids, are GC notdependonCpGsites(withtheexceptionofbirds;sup- richer than all other flowering plants (fig. 1B). plementaryfigs.S1andS2,SupplementaryMaterialonline) 2565 MBE Escobar et al. · doi:10.1093/molbev/msr079 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 8 /9 /2 5 6 1 /1 0 1 0 1 5 3 b y g u e s t o n 1 5 A p ril 2 0 FIG.1.DistributionoftheGCcontentof18SribosomalDNAinvertebrates(A)andangiosperms(B).Numbersaboveboxplotsarethenumber 19 ofanalyzedgenera.Openboxes:helixsites;grayboxes:allsites;andfilledboxes:loopsites.Horizontaldottedlinesrepresentmeansacrossall species foreachtypeofsites(up: helix; middle: all;and low:loop). oronthephylogenetictrees(supplementaryfigs.S3andS4, at testing the null hypothesis of mutational equilibrium, Supplementary Material online). which predicts equal average population frequencies for ATandGCalleles,whereasdirectionalprocesses(gBGCor GCContent andPolymorphisms inVertebrates and selection)predicthigherfrequenciesforGCalleles(Duret Angiosperms etal.2002;Galtieretal.2006).Notethatnonequilibrium WeperformedSNPanalysestodiscriminatebetweenthe mutationdynamicswithanincreasingbiastowardGC/ potential evolutionary forces underlying rDNA GC- AT mutations also predicts higher GC allele frequencies. content evolution, namely gBGC (or selection for GC Because the multigenic nature of rDNA, we are not sure content) versus mutation biases. Specifically, we aimed that we analyzed true intralocus polymorphism or 2566 MBE GC-Biased Gene Conversion in Ribosomal DNA · doi:10.1093/molbev/msr079 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 8 /9 /2 5 6 1 /1 0 FIG.2.EvolutionoftheGCcontentof18SribosomalDNAacrossthephylogenyofvertebrates.Onlyresultsofalignmentswithvariablesitesare 1 0 shown.ValuescorrespondtocurrentGC(inparenthesesinterminalbranches),ancestralGC(inparenthesesininternalnodes),orGC*(red 1 5 characters along branches) at all/helix/loop sites; GC* are shown only for the most representative internal branches. Colors in terminal 3 b branches represent average GC content (blue: lowest GC; red: highest GC). Note that the color scale is relative to the data set and is not y g directly comparablewithfigure 3. Branch lengths are givenin unitsofper site substitution rate. ue s t o n between-locidifferences.However,inamultigenicfamily, Chondrostei,andCypriniformes(table2).Theaveragefre- 1 5 both intralocus and interlocus biased-gene conversion quencyofGCallelesinthesecladesspans0.58–0.67.Inan- A p increases fixation rates as compared with the neutral giosperms, the data set of polymorphisms in rDNA loci is ril 2 unbiased case (Walsh 1985). Without gBGC, we also ex- much bigger than in vertebrates. It contained 3,425 poly- 01 9 pectthat,onaverage,thenumberofgenecopiesfixedfor morphic sites of a total of 164,127 aligned sites (2.09% of GC equals the number of gene copies fixed for AT. Pre- polymorphism).Analysisofthisdatasetrevealsthat,over- dictionsarethereforesimilarforintralocusandbetween- allcladesandspecies,GCallelesinrDNAsegregateathigh- locivariation,sothatourapproachappearsrobusttothe erfrequencythanATalleles(0.63and 0.37,respectively;P uncertainty about the nature of polymorphism data—in ,2.20(cid:2)10(cid:4)16).SignificantGC-biasedspectraarefoundin neithercasedoesoneexpectasymmetricalSNPfrequency asterids, Caryophyllales, monocots, Ranunculales, and ro- spectra in absence of gBGC/selection. sids (table 2). The frequency of GC SNPs in these clades Invertebrates,673polymorphicsitesinrDNAlociwere spans 0.62–0.71. foundoutofatotalof60,586alignedsites(1.11%ofpoly- Notsurprisingly, wefoundsignificantdifferences inthe morphism).Overallcladesandspecies,GCallelesinrDNA persitepolymorphismrateamongfragmentsoftherDNA segregateathigherfrequencythanATalleles(0.61and0.39, loci in vertebrates (Kruskal–Wallis test: v2527:51, P , 6 respectively;P51.02(cid:2)10(cid:4)8).GCallelessegregateatsig- 0.0001) and angiosperms (v2564:80, P , 0.0001) (table 6 nificantly higher frequencies than AT alleles in Amphibia, 3). As expected, regions that are not directly involved in 2567 MBE Escobar et al. · doi:10.1093/molbev/msr079 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 8 /9 /2 5 6 1 /1 0 1 0 1 5 3 b y g u e s FIG.3.EvolutionoftheGCcontentof18SribosomalDNAacrossthephylogenyofangiosperms.Onlyresultsofalignmentswithvariablesites t on areshown.ValuescorrespondtocurrentGC(inparenthesesinterminalbranches),ancestralGC(inparenthesesininternalnodes),orGC*(red 15 characters along branches) at all/helix/loop sites; GC* are shown only for the most representative internal branches. Colors in terminal A p branches represent average GC content (blue: lowest GC; red: highest GC). Note that the color scale is relative to the data set and is not ril 2 directlycomparablewithfigure2.Branchlengthsaregiveninunitsofpersitesubstitutionrate.MCME:Magnoliids–Chloranthales–Monocots– 0 1 Eudicots. 9 theribosomestructure(IGS,ITS-1,andITS-2)areprobably Differencesarehighlysignificantevenafterexcludingapo- lessconstrainedbyselection,hencemorelikelytoaccumu- tential outlier (Diplomonadina; table 4, fig. 4A). We late polymorphisms. Polymorphism rate is high in 5S se- obtained the distribution of GC content in the 18S rDNA quences too, although this gene codes for a functional acrosseukaryoticgeneraandselectedthe5%exhibitingthe RNA. This is possibly because 5S sequences are highest values (336 genera). These extreme values usually poorly annotated and frequently correspond to the correspond to 1 Diplomonadina (Giardia), 330 Metazoa actual5Sfragmentaswellaspartoftheuntranscribedspacer. (2 annelids, 30 arthropods, 1 cephalochordate, 190 Heterogeneity in Base Composition of rDNA vertebrates,56echinoderms,3hemichordates,47mollusks, Among Eukaryotes and 1 nematode), and 5 Viridiplantae (2 Ulvophyceae, 1 TheGCcontentinrDNAvariessignificantlyamongmajor bryophyte, and 2 commelinid monocots) (supplementary lineagesofeukaryotes(F 5302.47,P,2.20(cid:2)10(cid:4)16). table S1, Supplementary Material online). 11,6677 2568 MBE GC-Biased Gene Conversion in Ribosomal DNA · doi:10.1093/molbev/msr079 Table 2.Polymorphism in Ribosomal DNAin Vertebrates andAngiosperms. Clade N N AT>0.5 GC>0.5 Pvalue GC/(GC1AT) a sp Vertebrates Chondrostei 14 12 75 154 1.9531027 0.67 Cypriniformes 16 11 76 103 0.05 0.58 Percomorpha 14 13 58 51 0.56 0.47 Amphibia 7 6 53 103 7.6731025 0.66 Angiosperms Monocots 62 52 344 640 2.20310216 0.65 Ranunculales 9 9 20 50 4.4031024 0.71 Trochodendrales 1 1 29 25 0.68 0.46 Asterids 96 80 230 454 2.62310216 0.66 Caryophyllales 10 9 32 69 2.9631024 0.68 Rosids 128 111 581 898 2.20310216 0.61 Saxifragales 5 5 29 24 0.58 0.45 D o w NOTE.—Na:numberofalignments;Nsp:numberofspecies;AT.0.5:SNPsinwhichAorTisthemajorityallele;GC.0.5:SNPsinwhichGorCisthemajorityallele.Clades nlo aresortedbyphylogeneticproximity. a d e d fro WealsoperformedseparateanalysesinMetazoa,Viridi- etal.2008).Inangiosperms,basecompositioninrDNAhas m h plantae, and Fungi. In Metazoa, genera showing the 5% been less considered, and there is no study to compare ttp highest GC content in 18S rDNA are represented by 1 with. s://a Polychaeta(Annelida),3insects,3Hemichordata,13Aster- Withinvertebrates,thehighestGCcontentinrDNAwas c a oideaechinoderms,35mollusks(1Aplacophora,2Bivalvia, found in mammals, birds, and Ostariophysi (ray-finned de m and 32 Doridina nudibranch), and 123 vertebrates (1 fishes). In the latter, Siluriformes, Cypriniformes, and Clu- ic Elopomorpha,7Percomorpha,18Ostariophysi,2Polypter- peiformes exhibit high current GC content,although they .ou p iformes, 2 Amphibia, 34 birds, 20 Lepidosauria, and 39 do not display high GC*. However, GC* is high in the an- .c o mammals) (fig. 4B). In Viridiplantae, the highest GC cestral branch leading to Ostariophysi, suggesting that an m /m content corresponds to 2 genera of Ulvophyceae, 1 of episodeincreasingtheGCcontenttookplaceintheances- b e Chlorophyceae, 6 of Trebouxiophyceae, 2 of Briophyta, tryofthisgroup.ThiswouldexplainthehighGCobserved /a and82ofTracheophyta,including1fern,1conifer,9rosids, incurrentsequences.Amongangiosperms,monocots,and rtic le 3magnolids,6non-commelinidmonocots,and62comme- especially commelinids, Dioscoreales, Liliales, and Panda- -a b linids (supplementary fig. S5, Supplementary Material on- nales, display the highest GC content (and high GC*) of stra line). In fungi, the highest GC content corresponds to 38 all flowering plants. Interestingly, high GC* was inferred ct/2 Saccharomyceta (2 Dothideomycetes, 20 Eurotiomycetes, in the branch grouping magnoliids–Chloranthales and 8 /9 1Lecanoromycetes,1Lichinomycetes,and14Sordariomy- monocots and in the branch at the base of Eudicots. /2 5 cetes)and4Agaromyceta(2Boletales,1Hymenochaetales, 61 and 1 Polyporales) (supplementary fig. S6, Supplementary Evolution of Base Composition in the rRNA /10 1 Material online). Secondary Structure 01 5 According to our analyses, rDNA loops evolve 1.75 times 3 b Discussion faster thanhelices invertebrates and 1.50inangiosperms. y g u Our estimates are quantitatively consistent with previous e Heterogeneity in Base Composition of the 18S s results in eukaryotes (e.g., 1.37-fold faster in Smit et al. t o rDNA n 2007) and qualitatively with past observations in angio- 1 Weshowsignificantheterogeneityinbasecompositionin 5 sperms (Soltis et al. 1997; Soltis PS and Soltis DE 1998). A rDNA across the vertebrate and angiosperm phylogenies. p The average GC level observed among vertebrates in this IthasbeensuggestedthatratesofevolutioninrDNAvary ril 2 withthedistancefromfunctionallyimportantpartsofthe 01 studyis54.0%,veryclosetopreviousreports(55.4%inXia 9 ribosome,suchasthetransferRNApathandthepeptidyl- etal.2003;54.3%inWangetal.2006;and54.8%inVarriale transferase center (Smit et al. 2007). Although loops may engageintertiaryinteractions(Dutheiletal.2010)orjunc- Table 3.Average Persite Polymorphism Rate (andthe tions that link several helices together (Smit et al. 2006), correspondingstandarderrormean)inDifferentFragmentsofthe they appear generally less constrained than helices. This Ribosomal DNALoci. is likely because loops are free to evolve by substitutions Fragment Vertebrates Angiosperms that do not change the secondary structure, hence may 5S 0.0601(0.0097) 0.0396(0.0028) beneutralandfixbydrift,whereashelicesneedmuchrarer IGS 0.0686(0.0496) 0.0247(0.0074) compensatorymutationstomaintainstabilityandhighfit- ITS-1 0.0362(0.0207) 0.0250(0.0026) ITS-2 0.0156(0.0081) 0.0227(0.0023) ness (Gavrilets 2004; Meer et al. 2010). 18S 0.0085(0.0069) 0.0183(0.0085) On the other hand, helices were GC richer than loops 5.8S 0.0071(0.0055) 0.0124(0.0023) (respectively 59.8% and 43.9% in vertebrates; 63.1% and 28S 0.0021(0.0015) 0.0075(0.0028) 46.1%inangiosperms),whereasloopsdisplayedaveryhigh 2569 MBE Escobar et al. · doi:10.1093/molbev/msr079 Table 4.GCContent in 18SRibosomal DNAin MajorLineages ofEukaryotes. Lineage N Length(nt) Quality(%) MeanGC6SD(range) Alveolata 371 1,700 98 0.4660.013(0.41–0.49) Amoebozoa 45 1,200 95 0.5260.004(0.50–0.52) Choanoflagellida 5 1,600 85 0.4660.004(0.45–0.47) Cryptophyta 75 1,200 95 0.4660.008(0.45–0.48) Diplomonadina 6 1,300 80 0.7160.023(0.68–0.75) Euglenozoa 46 1,700 98 0.5060.003(0.49–0.51) Fungi 1,993 1,700 98 0.4760.013(0.41–0.52) Haptophyceae 117 1,700 95 0.4960.005(0.48–0.50) Heterokont 463 1,700 98 0.4660.015(0.41–0.50) Metazoa 5,211 1,700 95 0.5060.024(0.40–0.60) Rhodophyta 421 1,200 98 0.5060.011(0.47–0.52) Viridiplantae 2,506 1,700 98 0.4960.013(0.44–0.54) Alleukaryotes 11,259 — — 0.4960.025(0.40–0.75) D o w NOTE.—N:numberofsequences;lengthandqualityrefertothethresholdsusedtofiltershortandlow-qualitysequences;nt:nucleotides;SD:standarddeviationcalculated nlo acrosssequences. a d e d fro content of adenine relative to uracil (respectively, 39.6% Concerted evolution should enhance gBGC too because m h and7.0%invertebrates;43.3%and10.7%inangiosperms). it is formally equivalent to selection (Nagylaki 1983). All ttp s Thispatternofbasecompositioninhelixandloopregions inall,theexistenceofmultiplerDNAlocievolvinginacon- ://a seemstobeuniversal:ithasbeenobservedinprokaryotes certedwaypresumablyexacerbatestheeffectsofrecombi- c a d (GaltierandLobry1997;WangandHickey2002;Wangetal. nation, including gBGC, as compared with a single-locus e m 2006) and eukaryotes (Wang et al. 2006; Smit et al. 2006, sequence. ic .o 2007). There is evidence that adenine contributes to the We hypothesized that variations in the GC content of u p stability of the single-stranded regions of the RNA (Gutell rDNAcouldserveasanindicatorofthegBGC(orselection) .c o m etal.2000).Also,becausetherearethreehydrogenbonds processbecausethismoleculeissubmittedtohighrecom- /m inGCpairsandtwoinATpairs,ithasbeensuggestedthat bination.Twoaspectsofourresultsareinagreementwith b e GC pairs stabilize double-helix structures (Marmur and thishypothesis.First,wefindthattheheterogeneityinbase /a Doty 1959; Wada and Suyama 1986; Galtier and Lobry composition of rDNA coincide with the documented oc- rticle 1997). currence of gBGC: mammals, birds, and grasses, in which -a b s Faster loop evolution seemed associated with enrich- significant gBGC has been detected from genome-wide tra mentinGandCbasesinseveralclades,especiallyPolypter- analyses (Gle´min et al. 2006; Webster et al. 2006; Haudry ct/2 iformes and tetrapods among vertebrates, and etal.2008;DuretandGaltier2009),showdistinctivelyhigh 8/9 Austrobaileyales among angiosperms (see the higher rDNA GC content and GC* (grasses make the majority of /2 5 6 GC* relative to current GC in loop sites in figs. 2 and 3). commelinidspeciesinthisdataset).However,comprehen- 1 /1 This difference in the GC enrichment of the secondary siveanalysesofbasecompositionhaveonlybeenmadein 0 1 structure, which was found highly significant by LRT, is a few species, and we lack information on, for example, 01 5 likely due to their baseline difference in GC content: be- sharks, ray-finnedfishes,non-grassmonocots,oreudicots. 3 b cause loops are AT richer than helices, AT / GC substi- For this reason, results in Ostariophysi and commelinids y g u tutions are more likely to occur in the former than the cannot be compared with previous reports because these e s latter. Nevertheless, GC enrichment in helix sites was ob- groups were overlooked in the past. t o n served in Polypteriformes, Holostei, Osteoglossiformes, Thesecondlineofevidencecomesfrompolymorphism 1 5 mammals, birds, and Lepidosauria among vertebrates data analyses. Allele frequency spectra were found signifi- A p andBerberidopsidalesamongangiosperms(seethehigher cantlyGCbiasedin8of11cladesofvertebratesorangio- ril 2 GC* relative to current GC in helix sites in figs. 2 and 3). sperms,andunbiasedinthree,forwhichlimitedamountof 0 1 9 datawasavailable.Hence,thesedatarevealastrongexcess, Mutation Bias or gBGC/Selection for GC? amongATversusGCpolymorphicsites,ofSNPsforwhich rDNAsequenceshaveonecharacteristicthatmakesthem GorCisthemajorityallele(table2). Remarkably,thisex- particular relative to most genes in the genome and that cess is detected in groups showing increasing (e.g., mono- makes them especially interesting for this broad phyloge- cots, Siluriformes), stable (e.g., asterids, rosids), or netic study: besides they are under strong purifying selec- decreasing(Chondrostei)rDNAGCcontent.Wearguethat tive pressure (which guarantee that sequences are not suchapattern isconsistentwiththehypothesisofgBGC/ saturated), the different paralogous copies of the rDNA selection-drivenevolutionofrDNAGCcontentandrejects undergo more frequent recombination events than most the hypothesis of a GC-biased mutation process. other genesinthegenomethanks toconcertedevolution Because we analyzed nonoriented polymorphisms, the (eitherthroughgeneconversion,unequalcrossingoveror SNP frequency spectrum has a U form: SNPs in which both).Inaddition,concertedevolutionhasbeenshownto AT is the majority allele are found at one extreme and enhance selection efficacy (Mano and Innan 2008). at the other extreme are GC alleles. In the absence of 2570
Description: