Published online 1 February 2013 Nucleic Acids Research, 2013, Vol. 41, No. 5 3201–3216 doi:10.1093/nar/gkt017 Hyper conserved elements in vertebrate mRNA 30-UTRs reveal a translational network of RNA-binding proteins controlled by HuR Erik Dassi1, Paola Zuccotti2, Sara Leo1, Alessandro Provenzani3, Michael Assfalg4, Mariapina D’Onofrio4, Paola Riva2 and Alessandro Quattrone1,* 1Laboratory of Translational Genomics, Centre for Integrative Biology, University of Trento, Trento, via delle Regole, 101 38123 Mattarello (TN) Italy, 2Department of Medical Biotechnology and Translational Medicine, University of Milan, Milan, via Viotti, 3/5 20133 Milano, Italy, 3Laboratory of Genomic Screening, Centre for Integrative Biology, University of Trento, Trento, via delle Regole, 101 38123 Mattarello (TN) Italy and 4Department of Biotechnology, University of Verona, Province of Verona, Ca’ Vignal 1, Strada Le Grazie 15 37134 Verona, Italy Received August 31, 2012; Revised December 20, 2012; Accepted December 26, 2012 ABSTRACT INTRODUCTION Little is known regarding the post-transcriptional The 30-untranslated region (30-UTR) of mRNAs is a networksthatcontrolgeneexpressionineukaryotes. fundamentalmediatoroftheprocessesaffectingpost-tran- scriptional regulation of gene expression (1,2). Typically, Additionally, we still need to understand how these the influences of the 30-UTR are mediated by interactions networksevolve,andtherelativeroleplayedinthem with RNA-binding proteins (RBPs) and non-coding by their sequence-dependent regulatory factors, RNAs (ncRNAs). Although a subclass of ncRNAs, the non-coding RNAs (ncRNAs) and RNA-binding microRNAs (miRNAs), bind the mRNA 30-UTR in a proteins (RBPs). Here, we used an approach that ribonucleoprotein complex with AGO proteins to mostly relied on both phylogenetic sequence sharing and negatively control target mRNAs (3–5), 30-UTR-interact- conservation in the whole mapped 30-untranslated ing RBPs can exert complex effects, influencing mRNA regions(30-UTRs)ofvertebratespeciestogainknow- transport, localization, polyadenylationstate, rate ofdeg- ledge on core post-transcriptional networks. The radation and finally rate of translation through regulated identified human hyper conserved elements (HCEs) assembly/disassembly of actively recycling polysomes (6). In this way, RBPs behave as topological controllers of were predicted to be preferred binding sites for gene expression and can influence expression both nega- RBPs and not for ncRNAs, namely microRNAs and tively and positively. long ncRNAs. We found that the HCE map identified Mechanistic studies have helped to identify dozens of a well-known network that post-transcriptionally single cis-elements in 30-UTRs bound by specific RBPs regulates histone mRNAs. We were then able to and miRNAs (7,8), sometimes with defined consequences discover and experimentally confirm a translational on gene expression and cell phenotypes. In vitro (9–11) or networkcomposedofRNARecognitionMotif(RRM)- in vivo (12–15) high-throughput approaches are also type RBP mRNAs that are positively controlled by starting to provide transcriptome-wide maps of RBP and HuR, another RRM-type RBP. HuR shows a prefer- miRNA regions of interaction with mRNAs, allowing us ence for these RBP mRNAs bound in stem–loop to trace the first mRNA–protein complex (mRNP) networks in yeast (16–18) and vertebrates (19–21). motifs, confirming its role as a ‘regulator of Trans-factorsbindtomRNAUTRsinshortcontinuous regulators’. Analysis of the transcriptome-wide HCE regions,oftencorrespondingtoadefinedsecondarystruc- distribution revealed a profile of prevalently small tureandarecurrentconsensussequence.Ifsharedamong clusters separated by unconserved intercluster species, these trans-factor footprints should determine RNA stretches, which predicts the formation of a local increase in sequence similarity. On the assump- discrete small ribonucleoprotein complexes in the tion that in a purifying (negative) selection context, 30-UTRs. highly conserved non-coding sequences in orthologous *To whom correspondence should be addressed. Tel:+39 046 128 3665; Fax:+39 0461 283937; Email: [email protected] (cid:2)TheAuthor(s)2013.PublishedbyOxfordUniversityPress. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionNon-CommercialLicense(http://creativecommons.org/licenses/ by-nc/3.0/),whichpermitsunrestrictednon-commercialuse,distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited. 3202 NucleicAcidsResearch,2013,Vol.41,No.5 protein-coding genes would point to elements potentially MATERIALS AND METHODS endowed with regulatory activity, it would be possible to HCE identification pipeline obtain information regarding the core networks involved in mRNA regulation by isolating the regions bearing an Human 30-UTR sequences were fetched from the hg18 high degree of sequence evolutionary conservation in assembly at the UCSC database (34), and all UTRs UTRs. This holds also because no selective pressure for shorter than five bases were filtered, as they are likely to proteinfunctionalityappliestoUTRs, whicharethusun- derive from annotation errors. The Sequence Conser- constrained to change sequence or structure in order to vation Score (SCS) for each base of the UTRs, as fulfill their regulatory purpose. pre-computed by means of phastCons [see (35) for On a genomic scale, the identification of putative func- details on how this value was calculated], was retrieved, tional elements on the basis of evolutionary conservation for the relevant regions of the genome, from the UCSC hasbeenmostlybasedonthecomparisonbetweenhuman, database (34) along with the 44-way alignment in MAF ratandmousegenomes,withthedefinitionoftheso-called format; the alignment is performed on 44 vertebrate UltraConservedRegions(UCRs)as200bpidenticalDNA species. We computed the Branch Length Score (BLS) stretches. This procedure selects for mostly non-exonic (36) as the fraction of the length of the total phylogenetic portions of the genome (22–25), now collected in a tree branches covered by the alignment of each exon database (26). Only a limited number of these UCRs lie composing an UTR, taking the lowest BLS of all exons in mRNA UTRs. The same approach has been recently as the BLS for the whole UTR. The final conservation applied to the transcriptome (27), as defined by a library score, which we term hyper conservation score (HCS), of expressed sequence tags. The identified 3096 sequences was computed for each base of the UTRs as the clustered in 96 segments, of which 23 were fully contained weighted average of SCS and BLS. Weight for both com- in the coding sequence (CDS) and 80 overlapped or were ponents was set at 0.5 (see Supplementary Methods): this entirely in UTRs. Out of UCRs, specific mining of UTRs correspondstocomputingtheHCSasthearithmeticmean forregionsofhighconservationhasbeenpioneeredalmost of SCS and BLS. Nevertheless, our pipeline allows 10years ago (28) by identifying conserved motif cores and changing these weights to obtain a different combination extendingthemuptoadefinedthreshold,orbycomputing of the two features. A schematic view of the pipeline can a motif conservation degree based on pairwise alignment be found in Figure 1A. homologyfrequency(29).Ineachofthesetwostudies,four A threshold was then set on HCS under which se- mammalianspecieswerecomparedforthesmallnumberof quences should not be considered as hyper conserved. UTRs known at that time. Genome-wide multiple align- The threshold was chosen to be 0.85, as, by weighting ments of several species has been rendered possible in SCS and BLS equally, that would require one part of recent years by the increased sequencing capabilities the score to be at least 0.7 when the other part is 1.0 (30,31), but they have seldom been applied to specifically and vice versa. This stringent constraint guarantees that address the identification of potentially functional sites in onlythemostconservedUTRregionsareactuallyselected UTRs,andalwaysbyanalysingalimitednumberofmam- as HCEs. By its composition, the HCS does not impose a malianspecies(32).Invertebrates,30-UTRsarelongerand threshold on the number of species that must be included less conserved than 50-UTRs, and surprisingly they are for a region to be considered an HCE; nevertheless, a modestly variable in length between species with respect region aligning on a low number of species will be to the observed intraspecies length distribution (33). This assignedalowHCSandthusnotconsideredasconserved. could suggest the existence of unknown phylogenetic con- HCEs were then identified in 30-UTRs by means of a straints acting on their length, like long-range interactions two-step algorithm, implemented in Python: among functional elements. We introduce here an approach for identifying hyper (i) First, a search was run in every UTR for five-base conserved elements (HCEs) in 30-UTRs of mRNAs, seeds that had an almost complete conservation weighting sequence conservation information and phylo- sequence-wise (SCS(cid:2)0.95) and in which average genetic distance on 44 vertebrate species, from human to HCS was not <0.85. lamprey. This approach does not require the assumption (ii) Next, these seeds were extended upstream and of an a priori sequence length, takes limited computation downstream into the UTR, one base at a time, for timeandcanbeusedforanydesiredreferencespeciesand as long as the average HCS of the HCE did not fall speciessubgroup.Itsapplicationtohuman30-UTRsledus below the preset threshold or both ends of the UTR to the mapping of >3000 HCEs, which occupy <0.5% of were reached. the total 30-UTR sequence space. These regions have Resulting HCEs were eventually merged to remove peculiarproperties, including aclustered patternofrecur- overlaps and duplicates, which could occur in the case rence and show a potential to localize functional cis of high conservation spanning a substantial part, if elements belonging to highly conserved mRNA control not the whole, UTR. A view of the algorithm is in networks. To demonstrate the usefulness of HCEs in Figure 1A. prioritizing sequences for further analysis, we used the HCEs to identify a network of mRNAs coding for RBPs Construction of the non-HCEs data sets that possess 30-UTRs bound by the HuR RBP, and we showed this network to be functional in translational TocompareHCEspropertiestonon-HCEUTRportions, regulation of gene expression. we generated 1000 data sets composed by an equal NucleicAcidsResearch,2013,Vol.41,No.5 3203 Figure 1. HCEsareshort,scatteredandhighlystructured.TheoverallHCEidentificationpipelineisshownin(A),withthelowerpartdetailingthe algorithmusedsearchingforseedsandextendingthemtoleadtothefinalHCEs.(B–G)highlightsthemostrelevantfeaturesoftheHCEs.(B)shows thelengthdistributionofHCEsand(C)theirpercentcoverageof30-UTRs;(D)displaysthepredominanceofAUbasepairscontentoverCGbase pairs in HCE bases composition and (E) the prevalence of highly structured HCEs, as indicated by the shown distribution of secondary structure density in HCEs. (F) displays the distribution of distances between HCEs on the same 30-UTRs and (G) the HCE distance distribution from the 30-UTR start, indicated in percent over the 30-UTR length. number of non-HCE sequence elements. Via a Python also intersected with miRNA-binding sites predicted by script, we randomly chose UTR and start position; the three popular tools, miRanda (37), PicTar (38) and region length was drawn from the HCE length distribu- PITA (39). The content of lncRNAdb (40) was down- tion, in order to mimic the HCEs size ranges. loaded from the website and filtered to keep only human lncRNAs. A BLAST (41) database was built with these HCE intersection with ncRNAs-binding sites sequences and a search was performed with HCEs as query, with the BLAST ‘task’ parameter set as Experimentally validated miRNAs-binding sites were ex- ‘blastn-short’; only matches with a maximum e-value of tractedfromtheSQLversionofAURA(19),availableon 0.05 were considered as true positives. the download page of the website. The data set contained 15560 binding sites regarding a total of 88 distinct HCE intersection with RBP position–frequency matrices miRNAs. Coordinates of these sites were intersected with HCEs, and only sites falling completely inside an Position–frequency matrices (PFMs) for 69 RBPs were HCE were considered. HCEs and non-HCEs sites were extracted from the RBPDB database (42). HCE and 3204 NucleicAcidsResearch,2013,Vol.41,No.5 non-HCEsequenceswerematchedagainstthesePFMsvia HuR overexpression and silencing the BioPython functions dedicated to this task. Only Human breast cancer MCF-7 cells were transiently trans- matrices longer than four bases (for a total of 29 fected using Lipofectamine2000 (Invitrogen, Carlsbad, matrices) were retained, and all matches with score CA, USA) with a pT-REX mammalian expression <80% were filtered. vector coding for human HuR (48) and with the mock empty vector as control. The same cells were infected HCE intersection with the mRNA–protein occupancy with lentiviral transduction particles bearing shRNAs profiles (Sigma Aldrich, Mission shRNA) against the HuR T>CconversionprofilesweredownloadedfromtheGEO sequence, following the manufacturer protocol and database (series GSE38355) and filtered to include only testingfourdifferentshRNAsequences.Non-targettrans- bases falling into 30-UTRs. HCE and non-HCE bases duction particles were used to infect MCF-7 cells as were intersected with the conversion profiles, quantiles negative controls. Stably silenced clones were selected were computed and distributions of scores were tested with puromycin. The most effective pool, KD1, was for significant differences by means of a t-test. For the derived from the TRCN0000017273 shRNA. Sequences non-HCE case, the iteration giving the best results was are: used to compare with the distribution of HCE scores. TRCN0000017274 [CCGGGAGAACGAATTTGATC GTCAACTCGAGTTGACGATCAAATTCGTTCTCT TTTT], TRCN0000017273 Over-representation analysis [CCGGCGTGGATCAGACTACAGGTTTCTCGAG All genes which UTRs contained at least one HCEs were AAACCTGTAGTCTGATCCACGTTTTT], extracted and input to the DAVID Functional Anno- TRCN0000017277 tation tool (43) to identify by a modified Fisher test the [CCGGGCAGCATTGGTGAAGTTGAATCTCGAG overrepresentation of functional terms contained in ATTCAACTTCACCAATGCTGCTTTTT] and TRCN0 various ontologies [selected resources were Gene 000017275 Ontology (GO) Molecular Function, Biological Process [CCGGACCATGACAAACTATGAAGAACTCGAG and Cellular Component; IPR; SMART; PFAM, TTCTTCATAGTTTGTCATGGTTTTTT]. SP_PIR_keywords, Biocarta, KEGG and OMIM disease].Estimation fortheterms P-value wasBonferroni Cell culture and treatments corrected,andonlytermsforwhichtheP-valuewas<0.05 MCF-7andMCF-7shHuRcellswereculturedinDMEM were included in the final results; terms were grouped ac- with 10% FBS, 100 U/ml penicillin–streptomycin and cording to their similarity via the DAVID Functional Clustering tool, using high-stringency clustering criteria. 0.01mM L-glutamine (all media ingredients were obtained from Sigma-Aldrich, St. Louis, MI, USA). Culturesweremaintainedat37(cid:3)Cina5%CO2incubator. Identification of the SLBP-binding sites Puromycin(final concentration2.5mg/ml)was usedforse- HCEsequencesbelongingtogenesannotatedtobepartof lection and maintenance of stable short hairpin RNA the‘chromosomeassembly’functionalgroupwerealigned (shRNA) transfectants. All reagents were purchased by means of ClustalW2 (44), along with the canonical from Sigma. 1.5(cid:4)106 MCF-7 and MCF-7 shHuR cells Stem–loop-binding protein (SLBP) binding motif, to were seeded into two 10-cm Petri dishes for polysomal detect whether these HCEs actually contained the latter. RNA extractions and into one 10-cm Petri dish for total Themultiplealignmentalgorithmwasrunwithitsdefault RNA extractions. Total RNA and polysomal RNA ex- parameters. tractionswereperformed72hafterseeding;alltheexperi- ments were run in biological and technical triplicate. Sequence motif search RNA–Protein pull-down assay Sequence motif search inside HCEs was performed by RNA probes for HuR (AUGUAUUGUUUAUACAU) means of the Weeder algorithm (45). Motif length was and Degenerated (AUGUAUNNNNNAUACAU), set to be 6, 8, 10 or 12 nucleotides, and the minimum Dblmut1 (AUGUAUGGUUGAUACAU), Dblmut2 (A occurrence frequency of the motif was set to 25% of the UGUAUUCUUAAUACAU), YB1 (AUGUAUGGUC sequences composing the data set. We considered to be UGCAUACAU) and PTB (AUGUAUCUUUCUUAU relevant all the motifs reported by Weeder as the highest ACAU) probes have been synthesized by Sigma using ranking. 0.05mmol Synthesis Scale and HPLC purification with a 50biotinylatedDNApolyClinker.Theirpredictedsecond- Secondary structure motif search ary structure folding is shown in Figure 3. Biotin ThesecondarystructurefoldingoftheHCEscontainedin pull-down assays were performed by incubating 40mg the RNA Recognition Motif (RRM)-type RBP mRNA of MCF-7 cell lysates with 1mg of biotinylated probes group were predicted via the RNAfold program of the for 1h at room temperature. The complexes were Vienna RNA package (46). Motifs were searched over isolated using 100ml of paramagnetic streptavidin- these structures by means of the RNAforester tool (47), conjugated Dynabeads (Dynal(cid:3), Invitrogen, Carlsbad, run in the local multiple alignment mode. CA, USA), and bound proteins in the pull-down NucleicAcidsResearch,2013,Vol.41,No.5 3205 material were analysed by Western blotting, using performed in triplicate using an iQ5 RealTime PCR de- antibodies recognizing HuR (Santa Cruz, CA, USA), tection system (Bio-Rad, Hercules, CA, USA). YB1 (Abcam, Cambridge, UK) and PTB (Santa Cruz, CA, USA). After secondary-antibody incubations, the Ribonucleoprotein immunoprecipitation signals were visualized by chemiluminescence Ribonucleoprotein immunoprecipitation (RIP) was per- (Amersham Biosciences, Little Chalfont, UK). formed using human HuR overexpressing MCF-7 cell line lysates. Cell extracts were resuspended in NT2 Total RNA extraction buffer [50mM Tris HCl pH=7.5, 150mM NaCl, 1mM Total RNAs from treated and non-treated cells were MgCl , 0.05% NP40, 1U/ml Ribolock (Fermentas, Glen 2 isolated using the TRIzol reagent (Invitrogen, Carlsbad, Burnie, MD, USA), 2mM DTT, 30mM EDTA supple- CA, USA), according to the manufacturer’s instructions. mented with a protease inhibitor cocktail (P8340, Sigma, Purity of RNAs (A260/A280 value of 1.8–2.1) and con- St. Louis, MI, USA], chilled at 4(cid:3)C. Cell lysates were centration were measured using the Nanodrop spectro- added to the Protein G Dynabeads (Dynal(cid:3), Invitrogen, photometer. To eliminate DNA contamination, total Carlsbad, CA, USA) at 50ml beads/250ml lysate. Beads RNA was treated with DNase I (RNase-Free DNase were previously incubated with cell extracts and then Set, Qiagen, Hilden, Germany) and then purified with bound with 5mg of mouse monoclonal anti-HuR the RNeasy kit (Qiagen, Hilden, Germany). antibody (Santa Cruz, sc-71290, CA, USA) or mouse IgG (Millipore, NI03-100UG, Billerica, MA, USA). The Polysomal RNA extraction bound RNA was extracted using phenol:chloro- form:isoamyl alcohol (25:24:1) and precipitated with MCF-7cellstreatedasdescribedabovewereincubatedfor 3 min with 0.01mg/ml cycloheximide at 37(cid:3)C, then the ethanol. RNA pellets were resuspended in 10ml RNA- grade water and, after DNAse treatment (Fermentas, plates were put on ice. The media were removed, and Glen Burnie, MD, USA), cDNA was obtained from the cells were washed twice with cold phosphate buffer each samples as previously detailed. Real Time quantita- saline (PBS)+cycloheximide 0.01mg/ml. Cells were tive PCR was performed in duplicate using the C1000 directly lysed on the plate with 300ml cold lysis buffer (Bio-Rad, Hercules, CA, USA) thermal cycler for 40 [10mM NaCl, 10mM MgCl , 10mM Tris–HCl pH 7.5, 2 cycles, and results were evaluated by cycle threshold (Ct) 1% Triton X-100, 1% sodium deoxycholate, 0.2U/ml values. Cyclin A mRNA was quantified as positive RNase inhibitor (Fermentas, Burlington, CA), 1mM control, being a known HuR target, while RPL0 was dithiothreitol and 0.01mg/ml cycloheximide], scraped quantified as negative control, as it was not identified as and transferred to an Eppendorf tube. The extracts were centrifuged for 5min at 12000g at 4(cid:3)C. The supernatant a HuR target in either of two recent studies (49,50). was frozen in liquid nitrogen and stored at (cid:5)80(cid:3)C or Obtained data represent the average of at least three in- dependent experiments. loaded directly onto a 15–50% linear sucrose gradient containing 30mM Tris–HCl, pH 7.5, 100mM NaCl, NMR RNA analysis 10mM MgCl , and centrifuged in an SW41 rotor for 2 100min at 180000g. Fractions (polysomal and The SL and NF RNA oligonucleotides probes were sub-polysomal) were collected monitoring the absorbance purchased as purified powders (Integrated DNA Techno- at 254nm and treated directly with 0.1mg/ml proteinase logies, Coralville, IA, USA). Samples were dissolved to a K for 2h at 37(cid:3)C. After phenol–chloroform extraction finalconcentrationof0.1mMin10mMphosphatebuffer, andisopropanolprecipitation,polysomalRNAwasresus- pH 7.0, 10mM NaCl, in 93% 1H O/7% 2H O. NMR ex- 2 2 pended in 30ml of RNAse-free water and then repurified perimentswereperformedwithaBrukerAvanceIIIspec- with the RNeasy kit (Qiagen, Hilden, Germany). trometer operating at 600.13MHz proton Larmor frequency,equippedwithacryogenicprobeincorporating Quantitative RT-PCRs a z-axis gradient. 1H-1D experiments were run acquiring 32768complexdatapointsonaspectralwindowof12019 FormRNAquantification,atwo-stepTaq-Manreal-time or 13227Hz, using a relaxation delay of 3s, and 64–1024 PCRanalysiswasperformed,usingprobesobtainedfrom scans. Water signal suppression was carried out by exci- Applied Biosystems (Foster, CA, USA). cDNA was tation sculpting. Temperature calibration was performed synthesized from total and polysomal RNA (1mg) in usingstandardsamples.SpectrawereFouriertransformed 20ml reactions, using the iScript cDNA Synthesis Kit after application of a 1Hz exponential window multipli- from BioRad (Hercules, CA, USA). The reverse tran- cation and baseline corrected using Topspin 2.1 (Bruker, scriptase reaction was performed by incubating the samples at 25(cid:3)C for 5min, 42(cid:3)C for 30 min and 85(cid:3)C Karlsruhe, Germany). for 5min. The PCR reactions (10ml) were performed on Construction of the HuR / RRM-type RBP mRNA 20ng of cDNA, the mix were prepared with 5(cid:4) KAPA network FAST probe (Kapa Biosystems, Boston, MA, USA) and the 20(cid:4) appropriate Taq-Man probe. The PCR mixtures HuR-binding sites, as identified in HEK293 cells by two wereincubatedat95(cid:3)Cfor3min,followedby39cyclesof recent photoactivatable-ribonucleoside-enhanced cross- 95(cid:3)Cfor30s and 60(cid:3)Cfor20s and 72(cid:3)Cfor60s.mRNA linking and immunoprecipitation (PAR-CLIP) studies levels were calculated based on the (cid:2)CT method, using (49,50), were downloaded from GEO, accession number RPL0 and HPRT1 as reference genes. All PCRs were GSE29943 and GSE29780, respectively. Sites were 3206 NucleicAcidsResearch,2013,Vol.41,No.5 intersected with UCSC 30-UTR coordinates (hg18 the 30-UTRs has interesting properties: when multiple assembly) and extracted along with the genes mapping HCEs are present on an UTR, these have a clear to these 30-UTRs. Enrichment was computed by tendency to localize in clusters, as indicated by the small counting the number of genes for each domain found in inter-HCEdistance,25basesorless(Figure1F),andtobe the resulting gene list and by performing a Fisher exact distributed along the 30-UTR with a preference for its be- test by means of the R statistical environment. The HuR ginning, with 25% of the HCEs starting on the 30-UTR RRM-typeRBP/mRNAnetworkwasbuiltbyaddingall 10% initial bases (Figure 1G). To provide a snapshot on RRM-type RBP mRNAs found to be bound by HuR in HCE architecture diversity, we distributed HCE-bearing the PAR-CLIP study to the HCE-containing 23 30-UTRsintofourclasses,dependingontheirnumberand RRM-type RBP mRNAs. An edge was reported coverage.TheclassesreportedinFigure2Aefficientlyrep- between HuR and its target mRNAs to indicate the regu- resent this diversity. We then focused on the HCE clus- latory relationship. Intersections between the 89 tered pattern because it could be an effect of a higher PAR-CLIP-derived, our 23 and the 6 RRM-type RBP order structure of trans-factors. We thus computed the mRNAs validated by us were computed and highlighted amount of HCEs lying in clusters with intracluster in different colours and node shapes, as shown in Figure distances (maximum distance between two contiguous 5B and C. HCEs in a cluster) ranging from 5–40 bases. As shown inFigure2B,aplateaustartsat20bases,settingtherefore a threshold. At this distance, 81% of the HCEs belong to RESULTS clusters of two or more elements (the figure already HCEs in the mRNA 30-UTRs are rare, short, highly excluded the 577 HCEs that are unique on their 30- structured and organized in clusters UTR). We thus propose a model, reported in Figure 2C, for which 30-UTRs contain clusters of binding sites We identified regions of exceptional evolutionary conser- separated by each other, possibly delineating a scenario vation in the 30-UTRs of the human exome by a seed ex- in which groups of trans-factors interact with each other tension strategy (Figure 1A). We took advantage of the in complexes spaced by unconserved regions of unbound 44-way vertebrate UCSC alignment (34) from which we 30-UTR. derivedthephastConsSCS(35)andtheBLS(36).Wefirst extracted fully conserved five-basesseeds (SCS(cid:2)0.95 and 30-UTR HCEs contain putative binding sites for RBPs BLS(cid:2)0.85), which were then extended until they reached and not for ncRNAs apresetthreshold(0.85)ontheconservationmeasure,the HCS (computed as the equally weighted average of SCS The main question now was what types of potentially and BLS; changing these weights would change the functional cis-acting elements are found in 30-UTR relative importance of one of the two features, see HCEs. To test for miRNAs, we compared the 30-UTR Supplementary Methods). After preliminary filtering, the HCEs with a set of 15560 experimentally determined 30- data set obtained from the UCSC database contained UTRmiRNA-bindingsites(producedby88miRNAsand 55444 30-UTRs, each one corresponding to a different involving 2232 30-UTRs) extracted from the AURA transcript (including all annotated mRNA splicing database (19). Out of the total 3149 HCEs, only 51 variants). The 30-UTR HCE identification algorithm (1.6%) of them was found to contain miRNA-binding gave 3149 HCEs (listed in Supplementary Dataset 1), be- sites, which were 60 and were bound by 33 different longing to 1010 30-UTRs, which correspond to 877 genes. miRNAs. These data resulted in whole 30-UTRs being At least one 30-UTR HCE is present in only 1.8% of the more enriched in miRNA-binding sites than HCEs total human 30-UTRs, and collectively HCEs cover only (Fisher’s exact test P-value=2.37E-10). To verify 0.47% of the 30-UTR space, making them extremely rare. whether this small number was close to random occur- 30-UTR HCEs have an average length of 100 bases, but rence, we performed the same procedure on 1000 sets of theirlengthdistribution(Figure1B)issuchthat>77%of randomly derived 30-UTR segments, which we call their total number is shorter, being only 4.5% of them non-HCEs, with the same length distribution and of the >500 bases. The subset of HCEs shorter than 100 bases same size as the HCEs. The maximum of the distribution have an average length of 23 bases, with 25% of them at of these iterations gave 40 unique miRNA-binding sites most 8 bases long. Their UTR coverage (Figure 1C) is involving 47 different miRNAs, which confirms our hy- instead prevalently low ((cid:6)25% of each 30-UTR) or high pothesis. We eventually proceeded to predict miRNA- ((cid:2)75% of the 30-UTR). No significant correlation was binding sites in HCEs and non-HCEs by means of three found between 30-UTR length and HCE coverage popular prediction tools [miRanda (37), PicTar (38), (Pearson coefficient (cid:5)0.12). Together, these distributions PITA (39)]. Compared with the best non-HCE iteration, showthat30-UTRHCEsarerelativelyshortandthatthey the number of miRNA-binding sites in HCEs is always either occupy a small portion of a 30-UTR or the most of heavily depleted (a Fisher’s exact test reports enrichment it.TheseelementsaremuchricherinAUthaninGCbases of non-HCEs sites with P-value lower than 2.2E-16 in all (Figure 1D, P-value 2.2E-16), and are by far more highly three cases). To check also for other ncRNA types, we structured than random 30-UTR sequences of the same intersected the 30-UTR HCEs with the sequences found length, structural density being defined by the fraction in lncRNAdb (40), a catalogue of eukaryotic long of unpaired bases in the HCE secondary structure non-coding RNAs (lncRNAs). A BLAST search yielded (Figure 1E, P-value 1.2E-13). Also their localization in 151 statistically significant putative binding sites at least NucleicAcidsResearch,2013,Vol.41,No.5 3207 Figure 2. HCEs can be classified according to their pattern of occurrence in 30-UTRs and are organized in clusters. (A) shows the classification of 30-UTRsin fourclasses,according totheir HCEcontent(on theleft). Numbers below eachclass boxare the number ofHCE-containing30-UTRs belongingtotheclass.Ontheright,asampleofsixHCE-containing30-UTRs:HCEsaremappedontotheir30-UTRandrepresentedasyellowareas, a grey rectangle being the full-length 30-UTR. Arrows from class boxes to UTRs indicate which UTR belongs to which class. (B) displays the increasingpercentageofclusteredHCEswhenincreasingthemaximumintraclusterdistanceallowedforanHCEtobeconsideredpartofacluster. Wespanfrom5to40bases,andat20baseswecanobservethebeginningofaplateau.Wethereforechose20basesasthemaximumintracluster distance to consider. The graph is drawn excluding the 577 HCEs that are unique on their respective 30-UTR. (C) Graphical representation of the proposedmodeloftrans-factorbindingto30-UTRs,assumingthatHCEsarebindingsitesofoneormoretrans-factors.Clustersofcloselyoccurring HCEs (composed by 3–4 HCEs on average) are separated by intercluster RNA stretches of variable length (from 20 to 1419 bases), suggesting a coordinated action on the 30-UTR. 3208 NucleicAcidsResearch,2013,Vol.41,No.5 12 bases long, involving 132 unique HCEs (4.2%) and 32 translation (52). Alternative to polyadenylation, this different lncRNAs. Again among the 1000 non-HCEs it- mechanism is conserved over a wide evolutionary erations, the BLAST search yielded, for the iteration distance (53). We therefore hypothesized that the HCEs giving the best results, 209 statistically significant inthehistone30-UTRsharbouredSLBP-bindingsites.To putative lncRNA-binding sites at least 12 bases long, verifythisconjecture,wealignedtheknownSLBPbinding involving 167 unique non-HCEs (5.30%) and 39 different motif (54) to these HCEs and found that a considerable lncRNAs. Therefore, HCEs are unlikely to be preferred fraction of the HCEs (75 out of 127) contain a close, if sites for miRNAs and lncRNAs. not perfect, match to the SLBP motif (Supplementary We then scanned the HCE and the non-HCE lists for Figure S2). Therefore, the metrics we devised to select matches with the PFMs extracted from the RBPDB for HCEs precisely identifies cis-elements involved in a resource (42), which collects the known experimental conserved and well demonstrated post-transcriptional consensi for RBP binding to mRNAs. Considering only regulatoryprocess.Weassumedthisfindingasaneffective matches with a minimum score of 80% and a matrix benchmark for the ability of 30-UTR HCEs to point to length greater than four (leaving us with 29 matrices), circuitries of phylogenetically old post-transcriptional we consistently obtained at least 1.8 times more matches control. intheHCEthaninthenon-HCEsets(17173matchesfor Thesecondhighlysignificantgenesetisaboutthebroad HCEs versus 9443 matches for the best iteration of activity of transcription and mainly composed by genes non-HCE sequences). HCE matches are listed in Supple- involved in its repression. The 137 identified genes mentaryDataset2.EnrichmentofRBPsitesinHCEswith suggest that transcription factors as EPC1, TFAP2D respect to non-HCEs is also suggested by the Fisher’s and YY1 and co-transcriptional repressors as FOXP2, exact test (P-value=5.85E-11). If really 30-UTR HCEs MEIS2 and EZH2 can be heavily controlled post-tran- identify mainly RBP-binding sites, they should at least scriptionally, their 30-UTRs being almost entirely highly partiallyspanexperimentallydeterminedRBPmRNAoc- conserved.Finally,thethirdemerginggenesetcamefrom cupancyprofiles. A recent study defines, as T>C conver- the protein domain annotation, giving the RRM. sion scores (15), contact sites for RBPs in the mRNA We also divided the HCEs on the basis of the four transcriptome of proliferating HEK293 cells (51). This classes identified, to see again whether they had a prefer- catalogue was obtained by crosslinking 4SU-labelled ential representation of themes. We found that the ‘chro- cells, selecting mRNA–protein complexes by oligo-dT matinstructure’themeisenrichedonlyinthe‘loneisland’ beads precipitation and protein extraction, and category (Figure 1H), further confirming that it emerges determining contact sites by RNA sequencing. The distri- from the histone mRNA SLBP-binding site (53). butionsofT>Cconversionscoresforeachbasefallingin Transcriptional regulation terms appear instead enriched 30-UTR HCEs and non-HCEs were tested against each in the ‘sparse frequent’ and ‘fully covered’ groups, while other for statistically significant differences. Indeed, both the ‘dense frequent’ and ‘fully covered’ 30-UTR HCEs were found to have a significantly higher level of groups, i.e. those mostly HCE-rich, point to a significant T>C scores than non-HCEs, with the performed t-test over-representation (P-value=1.09E-05) for mRNA- producing a P-value lower than 2.2E-16, and with a relatedactivities(GOterms:‘RNAbinding’,‘mRNApro- median T>C score of HCEs of 5.5 versus 4.5 of cessing’; domains: KH, RRM). non-HCEs. These findings suggest that 30-UTR HCEs are enriched for experimentally identified RBP-binding A hyperconserved motif in the 30-UTR of 19 RRM-type sites. RBP mRNAs bound by HuR Giventherecurrenttendencyoftheenrichmentanalysisto 30-UTR HCEs identify the ancient control mediating select the mRNAs of RRM-type RBPs as preferred sites histone mRNA fate for30-UTRHCEs,wefurtherexploredthesemRNAs.The To appreciate the spectrum of biological functions ex- RRM is the evolutionarily most successful among the so- pressed by 30-UTR HCE-containing genes, we performed lutionsappearedtomediateinteractionbetweenRNAand an ontological enrichment on the 877 genes bearing at proteins (55). Of the 23 enriched genes whose mRNA least one HCE in the 30-UTR. We identified three gene bears at least one HCE and whose protein product groups endowed with high significance (Supplementary contains RRM domains, 17 were experimentally verified Figure S1). The first group is composed by 78 genes RBPs and 16 had an RRM-only architecture (Supple- involved in chromatin structure (terms ‘nucleosome’, mentary Table S1). Their mRNAs are characterized by ‘chromatin assembly’, ‘DNA packaging’), including 51 30-UTRs of all four types, with a prevalence of ‘fully (53.6%) of the 95 histone genes present in the human covered’ (66.7%) and ‘sparse frequent’ (19%) types, with genome. This wide histone component of the signature is ‘lone island’ and ‘dense frequent’ types representing re- that producing the strongest over-representation signal, spectively just 9.5% and 4.7% of the 30-UTRs. RBPs becausethetermsremainhighlysignificantevenwhenper- have been shown in the yeast to be nodes of highly inter- forming the ontological enrichment after having removed connected networks of post-transcriptional regulation the non-histone genes. It is well known that all histone (15,17), but little is known about vertebrate RBP gene mRNAs have a short 30-UTR lacking a poly(A) networks. We therefore focused on the mRNA 30-UTR tail, which is bound by the SLBP to stabilize these HCEsofthisproteingroup,topredictRBPsco-regulating mRNAs and mediate their nuclear processing and their them.WescannedtheHCEsforhiddencommonelements NucleicAcidsResearch,2013,Vol.41,No.5 3209 by the Weeder algorithm (45), searching for 6–12-bases- theNMRexperiment(25(cid:3)C)andinthepull-downexperi- long motifs, with the tolerance of 1–4 mismatches, which ment(roomtemperature)arestillnotphysiological,taken are observed in at least 25% of the HCEs. The scan all together these results indicate that the SL RNA produced as best score two 12 bases sequences that can sequence likely folds into a defined secondary structure. beconsideredvariantsofthesamesequencemotif,asthey The SL instances in the mapped 30-UTRs of the 18 genes differ only in two positions. We speculated that this resulted to be up to four per 30-UTR, with 13 of them sequence motif could represent an RBP-binding site, as harbouring only one instance (Figure 3C). We then anumberoftheseproteinsareknowntohaveapreference noticed that the SL motif had a sequence quite similar for short unstructured sequences or loops in stem–loop to an already known binding site for the HuR secondarystructures(55).Wethensearchedforsecondary (ELAVL1) protein (12). To verify that SL was effectively structure motifs in the same 30-UTR HCEs with the interacting with HuR, we performed a protein pull-down RNAfold (46) and the RNAforester (47) algorithms. assay, followed by a western blot with an anti-HuR This analysis resulted in a 17-base structural motif in the antibody. Along with the putative HuR motif, we form of a stem–loop, whose core loop had a good corres- adopted two positive controls (the YB1 and PTB known pondence (7 out of 12 bases for both sequence motifs; 9 binding sites) (12), two probes mutated in the loop (one out of 12 bases for sequence motif 2) with the previously withan eptameric loop with two mismatches and another identified sequence motifs. Combining the results of one with a pentameric loop with different sequence) and both sequence and structure motif searches produced a one degenerated again in the loop, for assay specificity. remarkable concordance, as shown by the alignment in The probe design is exemplified in Figure 4A. As Figure 3A, eventually leading us to define, by taking the reported in Figure 4B, HuR indeed binds to the probe consensus sequence motif produced by the secondary corresponding to the SL motif. Mutated and degenerated structure search algorithm, a stem–loop motif shared by probes show little recovery of HuR, suggesting that the 18 out of the 23 RRM genes and reported in Figure 3B. interactionisspecificanddependingontheloopsequence The secondary structure of this motif, that we called SL and size. The positive controls, testifying the correctness for stem–loop, was predicted by three different folding of the procedure, are shown in Figure 4C and D. algorithms, namely RNAfold (45), MFold (56) and Sfold (48), all of them leading to the same result. To test HuR positively controls a translational network of thepossibilitythatthemotifsecondarystructureisindeed RRM-type RBPs astem–loop,weperformeda1H-NMRinvestigationusing twosyntheticRNAscorrespondingtothissequencemotif With the motif confirmed to be recognized by HuR, we (SL)comparedwithamotifbearingfourmutationsinthe nextsoughttounderstandwhetherHuRreallyhadapref- sequenceincorrespondencewiththestemregion,whichis erenceforRRM-containingRBPmRNAs,withrespectto predicted to lose folding (no folding, NF). The 1D mRNAs of RBPs bearing other types of RNA-binding spectrum of SL, recorded at 5(cid:3)C in aqueous solution, domains and to mRNAs of proteins bearing the most shows four intense peaks distinct from those of NF and frequent domains in the genome. To calculate enrich- at chemical shift positions typical of imino proton reson- ments,wetookadvantageofaHuRPAR-CLIP,therefore ances [(57) Supplementary Figure S3]. The observation of theoreticallyunbiased,dataset(49).WeextractedallHuR signals in this spectral region is a clear indication of the 30-UTR binding sites from this data set and derived the presenceofstrongH-bonds,whichpreventrapidexchange correspondinggenes.Wethencomputed,bymeansofthe with water protons. The spectrum recorded on a sample Fisher’s exact test, the enrichment in this gene set for: (i) designednottoformanybasepairshowsnosignalsinthe proteins containing the most common experimentally region 10–15ppm (Supplementary Figure S3). The imino verified RNA-binding domains (zinc finger C2H2, KH, protonpeaksofSLcanalsobeobservedat15(cid:3)Cand,with SAM, RRM); (ii) proteins containing the three absolute muchlowerintensity,at25(cid:3)C(SupplementaryFigureS4), most frequent domains in the human genome (IG-like, an expected behaviour for proton signals establishing GPCR superfamily and serine threonine kinase); (iii) the H-bonds. At low temperature, three more peaks can be complete set of annotated RBPs irrespective of the RNA- observedintheregion10–11ppm(SupplementaryFigures bindingdomain.Figure5AshowsthattheRRM-contain- S3 and S4), attributable to imino protons involved in ing gene set resulted to be the only one significantly weaker H-bonds. Altogether, the data indicate the enriched.ThisconfirmsthatHuRhasamarkedpreference presence of 4 strong and 3 weak H-bond producing base for binding to the 30-UTR of RRM-bearing mRNAs. We pairs.Bytypicalfrequencypositions,peak4couldbeten- then plotted all 30-UTR HuR targets identified by the tatively assigned to the guanine base forming a G:C pair, PAR-CLIPstudyalongwithourgroupofRRMs,tohigh- whilepeaks1–3couldbeattributedtouracilbasesinA:U light overlapping and unique genes of the two sets. The pairs, four A:U pairs being present in the SL stem–loop. resulting intersection is shown in the Venn diagram of To investigate whether the base pairs referred to were Figure5BandinthenetworkinFigure5C,whichdiscrim- intra- or interchain, experiments were repeated in condi- inates between gene categories by means of shapes and tionsofhigherSLconcentration.Theobservationthatthe colours. Fourteen out of the 23 HCE-containing 1H-NMR spectrum is invariant upon a 10-fold change in mRNAs for RRM-type proteins are identified as HuR sampleconcentration(SupplementaryFigureS5)suggests binding, and in particular 4 of them are among the ones that the signals refer to imino protons involved in intra- we checked by quantitative RNA immunoprecipitation molecularH-bonds.Despitethehighesttemperaturesetin followed by PCR (RIP-PCR) (58), see Figure 6A. We 3210 NucleicAcidsResearch,2013,Vol.41,No.5 Figure 3. HCEsinmRNAsencodingRRM-typeRBPsshareasequenceandsecondarystructuremotif.HCEscontainedinthegroupofRRM-type RBPgenes30-UTRswerescannedforbothsequenceandsecondarystructuremotifs.Thefirstsearchreturnedtwo,almostidentical,12-basesmotifs; the second one produced a 17-bases hairpin which, after multiple alignment, emerged to contain a 12-bases core markedly similar to previously identified sequence motifs. This core represents the loop part of the hairpin which, as the two searches are quite concordant on it, may indeed represent a binding motif for the trans-factor of the regulatory network we are trying to identify. (A) shows the alignment between sequence and secondary structure motifs. Light-yellow background highlights column sequence identity. (B) shows the secondary structure motif and its bidimensional structure. The green circle represents the biotinylated DNA polyC linker added to the RNA. (C) Motif instances (yellow areas) mappedonthefulllength30-UTR(greyrectangle)ofthe19RRM-typeRBPmRNAs.HGNCgenenamesareontheleft,UCSCUTRnamesareon the right in parenthesis. Figure 4. HuRisatrans-factorbindinginvitrototheHCEmotifsharedbymRNAsencodingRRM-typeRBPs.ThedifferentRNAprobesusedfor theproteinpull-downexperimentareshownin(A).HuRpull-downprobe:thisprobewasdesignedbyusingthesecondarystructuremotifreported inFigure3,slightlymodifyingthelowestpartofthehairpinsoastomakeitfoldcorrectlywhennotincontext.Theloopwasdesignedbyselecting themostprobablebasesofthesequenceinthealigmentandthemostprobablestructuremotifs.Positivecontrolsprobesaretheknownbindingsites fortheYB1andPTBRBPs,experimentallyobtained(11).Again,thelowestpartofthestemwasslightlymodifiedsoastomakeitfoldasdesired. NegativecontrolsHuRprobesareDbl-Mut1,Dbl-Mut2andDegenerate.TheDegenerateprobewassynthesizedbyallowingallfournucleotidesto bepresentateachloopposition,soastoobtainamixtureofprobesbearingallthepossibleeptamericloops.TheDbl-Mut1andDbl-Mut2probes wereobtainedrespectivelybymutatingtwobasesoftheoriginalprobeloopinawaytopreserveitandbymutatingonebaseanddeletingtwoothers inawaytoobtainapentamericloopinsteadofaeptamericloop.Inallprobes,the50circlerepresentsthebiotinylatedDNApolyClinker.(B)shows the HuR pull-down western blots. From the leftmost band to the rightmost: Input, HuR probe, Dbl-Mut1, Dbl-Mut2, Degenerate probe and denaturatedbeadsbands.Ascanbereadilyseen,thestem–loopprobesbindtoHuRwithamarkedspecificityforthecorrectone.(CandD)YB1 andPTBRBPspull-down.Fromtheleftmostbandtotherightmost:input,YB1/PTBprobeanddenaturatedbeads.Asshownbywesternblotting, the stem–loop probes bind to PTB and YB1 respectively, thus confirming that the pull-down protocol works as expected. eventually intersected the HCE sites on the 23 RRMs identified network, both in structural (Figure 6A) and in 30-UTR with sites identified by this plus another recent functional (Figure 6B and C) terms. We used HuR HuR PAR-CLIP study (49,50). Overlap with HCEs was overexpressing MCF-7 cells, already used for high ashighas55%and47%ofthetotalfortheLebedeva(49) throughput studies on HuR (59,60), firstly to perform and the Mukherjee (50) studies, respectively. Figure 6 five quantitative RIP-PCR assays on the MSI2, RBM15, reports the results of a validation sampling of the SRFS11, HNRNPA3 RBP mRNAs (among those