CG-content log-ratio distributions of Caenorhabditis elegans and Drosophila melanogaster mirtrons

Denise Fagundes-Lima
Departamento de Ciências Biológicas, Universidade Federal de Ouro Preto, 35400-000 Ouro Preto, MG, Brazil and Departamento de Física, Universidade Federal de Minas Gerais, 31270-901 Belo Horizonte, MG, Brazil

Gerald Weber
Departamento de Física, Universidade Federal de Minas Gerais, 31270-901 Belo Horizonte, MG, Brazil

arXiv:1301.6099v1 [q-bio.GN] 25 Jan 2013

Mirtrons are a special type of pre-miRNA which originate from intronic regions and are spliced directly from the transcript instead of being processed by Drosha. The splicing mechanism is better understood for the processing of mRNA, for which it was established that there is a characteristic CG content around splice sites. Here we analyse the CG-content ratio of pre-miRNAs and mirtrons and compare them with their genomic neighbourhood in an attempt to establish key properties which are easy to evaluate and to understand their biogenesis. We propose a simple log-ratio of the CG-content comparing the precursor sequence and its flanking region. We discovered that Caenorhabditis elegans and Drosophila melanogaster mirtrons, so far without exception, have smaller CG-content than their genomic neighbourhood. This is markedly different from usual pre-miRNAs, which mostly have larger CG-content when compared to their genomic neighbourhood. We also analysed some mammalian and primate mirtrons which, in contrast to the invertebrate mirtrons, have higher CG-content ratio.

INTRODUCTION

During the last decade, a wealth of small RNAs was discovered and with them new classes of biological regulators emerged. Among those, microRNAs (or miRNAs), due to their crucial role in genomic regulation, are perhaps the most intensively studied. miRNAs are involved in the regulation of numerous cellular processes including differentiation, development, apoptosis, proliferation and the stress response, and they change the expression of genes in several human diseases such as diabetes, cancer and neuromuscular dystrophy [1–3].

miRNAs are non-coding RNAs first identified in 1993 in the nematode Caenorhabditis elegans [4]. Canonical miRNAs are derived from primary miRNA transcripts (pri-miRNA), usually long nucleotide sequences that form specific hairpin-shaped stem–loop secondary structures. A pri-miRNA may originate one or more hairpins, typically 55–70 nucleotides (nt) in length. In animals, pri-miRNAs are cleaved by the nuclear Drosha RNase III enzyme to release precursor miRNA (pre-miRNA) hairpins. These are then transported to the cytoplasm by Exportin-5 (Exp5) and cleaved by the Dicer RNase III enzyme to generate a very short miRNA/miRNA* duplex [5]. One of the strands, called mature miRNA (22–25 nt), is incorporated into a RISC complex (RNA induced silencing complex) and guides the complex to the target mRNA to regulate gene expression, while the other strand seems to take on other biological functions [5–7]. In animals, most of the miRNA functions are related to down-regulation of genes.

Ruby et al. [8] showed the existence of intronic pre-miRNAs in Drosophila melanogaster and C. elegans that bypass Drosha processing, providing an alternative pathway for miRNA biogenesis [9]. These pre-miRNAs were called 'mirtrons', and the main difference between them and canonical miRNAs is that intronic sequences form lariats and the mirtrons are originated by splicing [8–10]. Flynt et al. [11] reclassified a subset of mirtrons in D. melanogaster as "tailed mirtrons", which have substantial 3′ overhangs and are targets of exosome-mediated 3′−5′ trimming, which allows functional pre-miRNA to be generated. The existence of mirtrons in mammals (human, macaque, chimpanzee, rat and/or mouse) was reported by Berezikov et al. [10], who identified, using computational and experimental strategies, 3 well conserved mirtrons expressed in diverse mammals, 16 primate specific mirtrons, and 46 candidate mirtrons in primates.

For mRNA, which is processed by splicing, Zhang et al. [12] determined that there is a characteristic CG content around splice sites. Also, it was shown that alternative splicing is promoted by the secondary RNA structure [13], which is strongly determined by CG content [14]. MicroRNAs are co-expressed with mRNAs [15, 16] and, in particular, mirtrons are seemingly not processed by the Drosha microprocessor but by splicing only. With splicing being dependent on thermodynamic stability, could there be some characteristic CG content which would set aside mirtrons from ordinary pre-miRNAs? Of special interest would be properties which would help to understand the splicing mechanism proposed for mirtrons [9].

Here we set out to characterise precursor sequences of miRNAs and mirtrons in terms of CG-content and also Gibbs free energies for D. melanogaster and C. elegans. We found that the CG-content shows marked differences for both types of small RNA. We also performed the same analysis for mammalian mirtrons reported by Berezikov et al. [10]; again our results show important differences, albeit opposite to those for the two invertebrates.

METHODS

To characterise the small RNAs we compare the CG-content of the precursor sequences which originate the pre-miRNAs and mirtrons to the CG-content of their neighbouring regions. The rationale for this approach is that if the neighbouring DNA sequence has an important difference in thermodynamic stability when compared to the precursor sequence, there should be telltale signs of it in the CG-content fractions. We define the CG-content fraction as

\[ f = \frac{\text{number of C and G nucleotides}}{\text{total number of nucleotides}} \tag{1} \]

Two types of CG-content fractions are used. One, fP, is related to either the precursor miRNA or the precursor mirtron. The other, fN, accounts for the total CG-content of the 150 base pairs downstream and upstream of the precursor sequence, which form the neighbourhood of the precursor. The flanking sequence length was chosen to be of the same order of magnitude as the length of canonical pre-miRNAs. We performed the same analysis with longer flanking sequences (up to 250 nt, not presented), but found no difference from the results reported in this work. Both CG-contents are combined to form a log-ratio between the precursor and its neighbourhood

\[ R = \log_2\left(\frac{f_P}{f_N}\right) \tag{2} \]

A positive ratio means that the CG-content fP of the precursor sequence is larger than that of its neighbours. Since CG-content is related to thermodynamic stability, we may infer that R > 0 generally means that the flanking DNA region is less stable than the precursor region. To ease the notation we use

R+ → R > 0, precursor region has larger CG-content, fP > fN
R− → R < 0, flanking region has larger CG-content, fN > fP

To evaluate the statistical significance of our findings we use the Kolmogorov-Smirnov test [17]. Even though this statistical test is well established, given the question which is posed in this work it is perhaps more intuitive and simpler to quantify the significance by using simple combinatorial probabilities. Therefore, we also calculate the probability p− of drawing k pre-miRNAs, all with R−, purely by chance

\[ p^- = \left(\frac{n^-}{n^- + n^+}\right)^k \tag{3} \]

where n− and n+ are the numbers of known pre-miRNAs with R− and R+, respectively.

The precursor miRNAs and mirtrons of D. melanogaster and C. elegans were obtained from miRBase version 16 [18, 19], which is one of the main on-line repositories for microRNA sequences. For extracting the flanking sequences we used the complete genome file of D. melanogaster version r5.34 [20] and version WS223 for C. elegans [21]. We retrieved the precursor miRNA/mirtron sequences by searching for an exact match within the complete genome files. For each sequence four types of matches were performed: the original sequence, the reversed sequence, the complementary sequence and the reversed-complementary sequence.

The mirtrons reported by Berezikov et al. [10] were collected from the supplemental data; the neighbouring sequences for each of these mirtrons were obtained from the Ensembl API and databases [22].

To complete our analysis we also calculated the average Gibbs free energies of mirtrons and ordinary pre-miRNA. In this work we use the RNAfold program from the Vienna package [23] with default parameters to obtain the Gibbs free energies, ΔG.

RESULTS AND DISCUSSION

FIG. 1. CG-content ratio R distribution (frequency versus R) for a) 151 canonical miRNAs and b) 18 mirtrons of D. melanogaster and c) 170 canonical miRNAs and d) 5 mirtrons of C. elegans.

In Fig. 1 we show the distribution of the CG-content log-ratio R, defined in Methods, for canonical pre-miRNAs and mirtrons of D. melanogaster and C. elegans. The CG-content log-ratio R for canonical pre-miRNAs, Fig. 1a, is roughly gaussian with a peak around R = 0. This means that for this type of pre-miRNA there appears to be no strong preferential ratio for CG-content between the precursor sequence and its neighbours, although a bias towards R+ is clearly noticeable. In stark contrast, all 18 mirtrons of D. melanogaster have R < 0 (R−), as shown in Fig. 1b. Even though the number of reported mirtrons is still small, the probability of picking 18 small RNAs with R− by chance alone, considering the distribution for canonical pre-miRNA, is p− = 9.1×10−9, see Eq. (3). The Kolmogorov-Smirnov distribution test yields p− = 7.8×10−9, which essentially confirms the simple combinatorial probability. Therefore, the occurrence of 18 R− pre-miRNAs entirely by chance is very unlikely.

Some authors describe mirtrons as tightly packed between exons [3], but in our analysis we have found that this is not the case. Most mirtrons are surrounded by intronic sequences, not exons. This seems consistent if one considers that intronic regions of D. melanogaster are about 750 to 1000 nt in length on average [24] and that mirtrons are typically 60 nt in length. Therefore, R− means that the immediate flanking region, which is also intronic, is more stable than the precursor region. One possible explanation for the predominance of R− would be if intronic regions were of highest CG-content. However, the intronic regions of D. melanogaster have one of the smallest CG-contents in this genome: 0.4 as compared to 0.52 for coding regions. The fact that the surrounding region of mirtrons has a higher CG content, which is unusual for intronic regions, suggests a role in the processing of these special types of miRNAs. Therefore, we may speculate that R− may play a role in the mirtron splicing mechanism in a similar fashion to what happens for messenger RNA [12].

For canonical C. elegans pre-miRNAs we observe a similar gaussian shaped distribution of the ratio R (Fig. 1c) but with a strong bias towards R+. Tab. I shows that the n+ : n− ratio is about three R+ pre-miRNAs for each R− pre-miRNA. To date there are only five mirtrons reported and they all show R− (Fig. 1d), similar to the mirtrons of D. melanogaster. Even though this number is very small, it is still intriguing given the strong bias toward R+ in canonical pre-miRNAs. Indeed, the probability of picking 5 pre-miRNAs all with R− is small, the combinatorial probability being p− = 6.3×10−4. Again, the Kolmogorov-Smirnov test provides 5.5×10−4, in agreement with the combinatorial probability.

TABLE I. CG-content ratio R and free energy ΔG characteristics of canonical pre-miRNAs and mirtrons of invertebrates. Also shown are the number of sequences n± with R±, the average CG-content ratio ⟨R⟩, the average free energy ⟨ΔG⟩ and the average precursor length ⟨N⟩.

Organism         RNA type               total  n+   n−  n+:n−  ⟨R⟩        ⟨ΔG⟩ (kcal/mol)  ⟨N⟩ (nt)
C. elegans       mirtrons               5      0    5   -      0.81±0.08  −20.28±4.21      62.22±6.72
D. melanogaster  mirtrons               18     0    18  -      0.76±0.10  −22.03±6.55      69.05±16.68
D. melanogaster  tailed mirtrons        7      2    5   1:1.8  0.84±0.24  −20.36±10.19     93.00±42.24
C. elegans       canonical miRNAs (a)   170    131  39  3.3:1  1.18±0.23  −35.10±9.09      91.78±14.26
D. melanogaster  canonical miRNAs (a)   151    97   54  1.8:1  1.08±0.21  −33.89±8.25      94.00±18.50

(a) mirtrons excluded.

To complete our comparative analysis of mirtrons and canonical pre-miRNAs, we also calculated the Gibbs free energies. Clearly, given the R− nature of the mirtrons, one would expect these to be generally less stable than the average pre-miRNAs. In Fig. 2 we show the distribution of free energy ΔG for both invertebrates, and detailed quantities are also given in Tab. I. With one notable exception, all mirtrons show ΔG larger than −30 kcal/mol, confirming their instability. In contrast, canonical pre-miRNAs are distributed over a much larger range of energies. Certainly, the fact that mirtrons are much shorter than canonical pre-miRNAs, 60 nt compared to 90 nt on average, largely accounts for this. But is a free energy larger than −30 kcal/mol sufficient to result in R−? To answer this, we isolated all canonical pre-miRNAs with R− and recalculated their ΔG distribution, which is shown as red bars in Figs. 2a and 2c and summarised in Tab. II. Essentially, we find a considerable number of R+ pre-miRNAs with ΔG > −30 kcal/mol. In other words, a pre-miRNA with ΔG > −30 kcal/mol is not necessarily R−. Therefore, the free energy distribution alone does not explain why all mirtrons of D. melanogaster and C. elegans are R−.

TABLE II. CG-content ratio R and free energy ΔG characteristics of canonical miRNAs considering only those with R−.

Organism         RNA type          n−  ⟨R−⟩       ⟨ΔG⟩ (kcal/mol)  ⟨N⟩ (nt)
C. elegans       canonical miRNAs  39  0.90±0.08  −30.17±8.72      90.95±17.50
D. melanogaster  canonical miRNAs  54  0.87±0.10  −30.0±5.40       90.40±11.78

The next question is whether other types of reported mirtrons, such as primate and mammalian mirtrons, show the same R distribution as those of D. melanogaster and C. elegans. As shown in Tab. III, in terms of CG-content ratio R and average free energy ΔG these mirtrons appear not to be biased to any particular value. Berezikov et al. [10] found that the GC content of mammalian mirtrons was much higher than that of invertebrate miRNAs but, in comparison with their neighbouring regions, we found that they tend generally to R+ (precursor region has larger CG-content), see Fig. 3. We have not attempted to generate the distribution of mammalian pre-miRNAs due to the number of large genomes which would have to be processed.

TABLE III. CG-content ratio R and free energy ΔG characteristics of specific vertebrate mirtrons.

Organism     RNA type            total   n+  n−  n+:n−  ⟨R⟩        ⟨ΔG⟩ (kcal/mol)  ⟨N⟩ (nt)
mammalians   putative mirtrons   13      8   5   1.5:1  0.98±0.16  −30.02±13.20     87.92±9.52
primates     specific mirtrons   16      13  3   4.3:1  1.12±0.11  −30.32±11.69     83.62±24.36
primates     candidate mirtrons  45 (b)  40  5   8:1    1.11±0.09  −41.18±12.39     90.44±21.94

(b) Ref. 10 reports 46 candidate mirtrons for primates, yet the supplementary tables only show 45 sequences.

CONCLUSIONS

We have introduced the concept of the CG-content log-ratio of precursor sequences and flanking regions and discovered that all D. melanogaster and C. elegans mirtrons are R−. This can be explained neither by the CG-content of the intronic region nor by the fact that mirtrons are generally shorter and less stable than pre-miRNAs.
Usual pre-miRNAs of these organisms only show a moderate bias towards R+. This finding appears to support the notion that mirtrons are spliced in a similar fashion to mRNA instead of being processed by Drosha. For mammalian mirtrons we have found no such bias, but we noticed that these also display several important differences when compared to the invertebrate mirtrons which were considered in this work, such as differences in length and free energy.

FIG. 2. Average free energy ΔG distribution (frequency versus ΔG in kcal/mol) for a) 151 canonical pre-miRNAs and b) 18 mirtrons of D. melanogaster, and c) 170 canonical miRNAs and d) 5 mirtrons of C. elegans. Red bars are for R− miRNAs.

FIG. 3. CG-content ratio R distribution (frequency versus R) for a) 13 putative mammalian mirtrons, b) 16 specific primate mirtrons and c) 46 candidate primate mirtrons.

ACKNOWLEDGEMENTS

We are grateful to J. M. Ortega for helpful suggestions. Funding: CNPq, Fapemig and National Institute of Science and Technology for Complex Systems.

[1] Zhang B, Pan X, Cobb GP, Anderson TA: microRNAs as oncogenes and tumor suppressors. Developmental Biology 2007, 302:1–12.
[2] Iorio M, Croce C: MicroRNA dysregulation in cancer: diagnostics, monitoring and therapeutics. A comprehensive review. EMBO Molecular Medicine 2012.
[3] Hussain M: Micro-RNAs (miRNAs): genomic organisation, biogenesis and mode of action. Cell and Tissue Research 2012, :1–9.
[4] Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 1993, 75(5):843–854.
[5] Kim V, Han J, Siomi M: Biogenesis of small RNAs in animals. Nature Reviews Molecular Cell Biology 2009, 10(2):126–139.
[6] Okamura K, Chung WJ, Lai E: The long and short of inverted repeat genes in animals: microRNAs, mirtrons and hairpin RNAs. Cell Cycle 2008, 7(18):2840.
[7] Bhayani M, Calin G, Lai S: Functional relevance of miRNA* sequences in human disease. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 2011.
[8] Ruby J, Jan C, Bartel D: Intronic microRNA precursors that bypass Drosha processing. Nature 2007, 448(7149):83–86.
[9] Okamura K, Hagen JW, Duan H, Tyler DM, Lai EC: The mirtron pathway generates microRNA-class regulatory RNAs in Drosophila. Cell 2007, 130:89–100.
[10] Berezikov E, Chung W, Willis J, Cuppen E, Lai EC: Mammalian mirtron genes. Molecular Cell 2007, 28(2):328–336.
[11] Flynt AS, Greimann JC, Chung WJ, Lima CD, Lai EC: MicroRNA Biogenesis via Splicing and Exosome-Mediated Trimming in Drosophila. Molecular Cell 2010, 38(6):900–907.
[12] Zhang J, Kuo C, Chen L: GC content around splice sites affects splicing through pre-mRNA secondary structures. BMC Genomics 2011, 12:90.
[13] Shepard P, Hertel K: Conserved RNA secondary structures promote alternative splicing. RNA 2008, 14(8):1463.
[14] Weber G, Haslam N, Whiteford N, Prügel-Bennett A, Essex JW, Neylon C: Thermal equivalence of DNA duplexes without melting temperature calculation. Nature Physics 2006, 2:55–59.
[15] Morlando M, Ballarino M, Gromak N, Pagano F, Bozzoni I, Proudfoot N: Primary microRNA transcripts are processed co-transcriptionally. Nature Structural & Molecular Biology 2008, 15(9):902–909.
[16] Shomron N, Levy C: MicroRNA-biogenesis and pre-mRNA splicing crosstalk. Journal of Biomedicine and Biotechnology 2009, 2009.
[17] Press W, Teukolsky S, Vetterling W, Flannery B: Numerical recipes in C. Cambridge Univ. Press, Cambridge 1992.
[18] Griffiths-Jones S, Grocock R, van Dongen S, Bateman A, Enright A: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research 2006, 34:D140–D144.
[19] Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucl. Acids. Res. 2011, 39(suppl 1):D152.
[20] Drysdale RA, Crosby MA: FlyBase: genes and gene models. Nucleic Acids Research 2005, 33(suppl 1):D390.
[21] Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J: WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Research 2001, 29:82.
[22] Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al.: The Ensembl genome database project. Nucleic Acids Research 2002, 30:38.
[23] Hofacker IL: Vienna RNA secondary structure server. Nucl. Acids. Res. 2003, 31:3429–3431.
[24] Presgraves D: Intron length evolution in Drosophila. Mol. Biol. Evol. 2006, 23(11):2203–2213.
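
As an illustration of the CG-content fraction f of Eq. (1) and the log-ratio R of Eq. (2) defined in Methods, the following is a minimal Python sketch. The function names, the 0-based `start`/`end` coordinates and the choice to pool the upstream and downstream flanks into a single fN are assumptions made for this example; the paper does not describe its own implementation.

```python
import math


def cg_fraction(seq: str) -> float:
    """Eq. (1): number of C and G nucleotides over total number of nucleotides."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("C") + seq.count("G")) / len(seq)


def cg_log_ratio(genome: str, start: int, end: int, flank: int = 150) -> float:
    """Eq. (2): R = log2(fP / fN) for the precursor genome[start:end].

    Assumption: fN pools `flank` nucleotides upstream and downstream.
    Raises an error if either region contains no C or G at all.
    """
    precursor = genome[start:end]
    upstream = genome[max(0, start - flank):start]
    downstream = genome[end:end + flank]
    f_p = cg_fraction(precursor)
    f_n = cg_fraction(upstream + downstream)
    return math.log2(f_p / f_n)


# Toy example: CG-rich flanks around a precursor with 50% CG content.
toy_flank = "GC" * 75            # 150 nt, f = 1.0
toy_precursor = "ATGC" * 15      # 60 nt, f = 0.5
toy_genome = toy_flank + toy_precursor + toy_flank
print(cg_log_ratio(toy_genome, 150, 210))  # -1.0, i.e. an R- sequence
```

A negative result corresponds to R− (the flanking region has the larger CG-content) and a positive one to R+.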
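
The combinatorial probability of Eq. (3) is simple enough to check by hand. The sketch below evaluates it with the n− and n+ counts of canonical miRNAs listed in Table I; the function name is hypothetical and the code is only meant as a worked instance of the formula.

```python
def chance_probability(n_minus: int, n_plus: int, k: int) -> float:
    """Eq. (3): probability that k draws all have R-, given n- and n+ known cases."""
    total = n_minus + n_plus
    if total <= 0 or k < 0:
        raise ValueError("need at least one known pre-miRNA and k >= 0")
    return (n_minus / total) ** k


# Counts of canonical miRNAs from Table I.
p_dmel = chance_probability(n_minus=54, n_plus=97, k=18)   # 18 D. melanogaster mirtrons
p_cele = chance_probability(n_minus=39, n_plus=131, k=5)   # 5 C. elegans mirtrons
print(f"D. melanogaster: {p_dmel:.2e}")
print(f"C. elegans:      {p_cele:.2e}")
```

The two values come out close to the p− of about 9.1×10−9 and 6.3×10−4 quoted in Results and Discussion.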
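
The exact-match search described in Methods tries four orientations of each precursor: original, reversed, complementary and reversed-complementary. A possible sketch is given below, assuming the genome is held as a single in-memory DNA string and that precursor sequences written with U are converted to T before matching; both the helper names and these conventions are assumptions for this example, not the authors' code.

```python
# Complement table for upper-case DNA.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orientations(seq: str) -> dict:
    """Return the four sequence variants searched for in the genome."""
    s = seq.upper().replace("U", "T")  # assumption: RNA input is mapped to DNA
    comp = s.translate(_COMPLEMENT)
    return {
        "original": s,
        "reversed": s[::-1],
        "complementary": comp,
        "reversed-complementary": comp[::-1],
    }


def find_exact(genome: str, seq: str) -> list:
    """List (orientation, 0-based position) for every exact match in the genome."""
    genome = genome.upper()
    hits = []
    for name, variant in orientations(seq).items():
        pos = genome.find(variant)
        while pos != -1:
            hits.append((name, pos))
            pos = genome.find(variant, pos + 1)
    return hits


# Toy example: the precursor lies on the opposite strand of the toy genome.
print(find_exact("TTTTACGGATTTT", "AUCCGU"))
```

The reported position can then be used to cut out the flanking regions needed for fN.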