MANUSCRIPTCATEGORY: ORIGINALPAPER BIOINFORMATICS ORIGINAL PAPER Vol.28no.62012,pages763–770 doi:10.1093/bioinformatics/bts024 Genome analysis AdvanceAccesspublicationJanuary16,2012 ReLA, a local alignment search tool for the identification of distal and proximal gene regulatory regions and their conserved transcription factor binding sites Santi González1†, Bàrbara Montserrat-Sentís1†, Friman Sánchez2, Montserrat Puiggròs1,3, Enrique Blanco4, Alex Ramirez2 and David Torrents1,5∗ 1JointIRB-BSCprogramonComputationalBiology,BSCc/JordiGirona29,2DepartmentofComputerArchitecture, Campus Nord UPC D6-117, Jordi Girona 1-3, 08034 Barcelona, 3Computational Bioinformatics, National Institute of Bioinformatics, 4Departament de Genètica i Institut de Biomedicina (IBUB), Universitat de Barcelona, Diagonal 643, 08028 Barcelona, Catalonia and 5Institució Catalana de Recerca i Estudis Avançats (ICREA) Pg. Lluís Companys 23, 08010 Barcelona, Catalonia AssociateEditor:JohnQuackenbush ABSTRACT (Abeeletal.,2008;Goñietal.,2007).However,theincorporation Motivation: Thepredictionandannotationofthegenomicregions ofnovelbiologicalknowledgeintotheseprogramsisnotnecessarily involvedingeneexpressionhasbeenlargelyexplored.Mostofthe improving the quality of their predictions, which still contain a energy has been devoted to the development of approaches that substantialfractionoffalsepositives. detecttranscriptionstartsites,leavingtheidentificationofregulatory Currently,methodsthatdenovodetectandcharacterizeproximal regionsandtheirfunctionaltranscriptionfactorbindingsites(TFBSs) regulatory regions show specificity levels <50% (Van Loo and largely unexplored and with important quantitative and qualitative Marynen,2009).Eventhoughphylogeneticfootprintingusingpre- methodologicalgaps. alignedhomologousregulatoryregionsofferspromisingresultsin Results: We have developed ReLA (for REgulatory region Local the identification of Conserved Regulatory Modules (CRMs) of Alignment tool), a unique tool optimized with the Smith–Waterman transcriptionfactorbindingsites(TFBSs)(BlanchetteandTompa, algorithmthatallowslocalsearchesofconservedTFBSclustersand 2003; Blanco et al., 2007; Pavesi et al., 2007; Sebestyen et al., thedetectionofregulatoryregionsproximaltogenesandenhancer 2009; Tokovenko et al., 2009; Tonon et al., 2010), they cannot regions. ReLA’s performance shows specificities of 81 and 50% define the regulatory region itself in most real scenarios because whentestedonexperimentallyvalidatedproximalregulatoryregions they are based on global alignment strategies and require that all andenhancers,respectively. matchingbindingsitesacrossallcomparedsequencesarelocatedin Availability: ThesourcecodeofReLA’sisfreelyavailableandcan thesame(orsimilar)positionwithineachsequence,i.e.theyrequire beremotelyusedthroughourwebserverunderhttp://www.bsc.es/ predefinedandpre-alignedregulatoryregions.Asaresult,inspiteof cg/rela. theexistingmethodologies,stillthemostcommonandreliableway Contact: [email protected] toidentifyproximalregulatoryregionsingenomesistheanalysis Supplementary information: Supplementary data are available at ofafewnucleotides(typicallyupto1000)immediatelyupstreamof Bioinformaticsonline. annotated transcription start sites (TSSs), which likely constitutes the proximal promoter. But the annotation of gene starts is still Received on September 28, 2011; revised on December 19, 2011; unsolved,particularlyfornon-humanspecies.Forexample,asimple acceptedonJanuary9,2012 searchintheENSEMBLdatabase(Hubbardetal.,2009)identified (cid:3) substantial fractions of vertebrate genes without annotated 5UTR 1 INTRODUCTION (from 17% in mouse to 91% in opossum, 42% for human). This result is even more dramatic within invertebrates. Other problems The identification of the genomic regions that control the that constitute a barrier for the automatic inference of promoters transcriptionofgenesstillremainsachallengedespitetherecentand (eveninhumanormouse)aretheabundanceandoverpredictionof continuous development of new experimental and computational alternativetranscripts.Takentogether,mostcomputationalmethods methodologies(Tompaetal.,2005).Multipleautomaticapproaches that detect or align promoters strongly depend on or are coupled havebeenproposed,rangingfromthosethatsearchforphylogenetic withtheannotationofuntranslatedgeneregions,whichisgenerally conservation of sequence or transcription factor binding motifs in insufficientforthispurpose(Guigoetal.,2006). non-transcribed DNA regions (Blanchette and Tompa, 2003; Van Ontheotherhand,thecomputationalidentificationofenhancers LooandMarynen,2009)totheanalysisofDNAphysicalproperties is even more complex. These regulatory regions that work characteristic of regions expected to bind transcription factors in cooperation with promoters throughout multiple structural ∗Towhomcorrespondenceshouldbeaddressed. constraints are, apparently, delocalized relative to the genes that †Theauthorswishittobeknownthat,intheiropinion,thefirsttwoauthors are controlling (Arnosti and Kulkarni, 2005). Therefore, their shouldberegardedasjointFirstAuthors. identification through computational methods requires strategies ©TheAuthor(s)2012. PublishedbyOxfordUniversityPress. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionNon-CommercialLicense(http://creativecommons.org/licenses/ by-nc/3.0),whichpermitsunrestrictednon-commercialuse,distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited. [18:5625/2/2012Bioinformatics-bts024.tex] Page:763 763–770 MANUSCRIPTCATEGORY: ORIGINALPAPER S.Gonzálezetal. based on local alignments. Some existing methods, like rVISTA, ofconservedTFBSs)foreachofthecomparisonsperformedbetweenthe look for conserved TFBS clusters between regions that have referencesequencewitheachoftheothers(seepseudocodeinSupplementary been pre-aligned with local alignment tools, such as BLASTz in Material). The overall stringency of the searches, the reliability of the rVISTA(Loots and Ovcharenko, 2004), while others use directly resultingpredictionsandtheconservationoftheTFBSsbetweentheinput sequencescanproducedifferentpredictions.Insteadofselectingauniversal localalignment-basedsearchstrategies,liketheEnhancerElement andfixedsetofparametersforeachofthesearches,whichwouldyieldone Locator(EEL)thatusestheSmith–Watermanalgorithm(Palinetal., unique prediction, we chose to run recursively each pairwise comparison 2006).Thesetoolshaveshowngoodpredictionratesonenhancers, (reference sequence against each of the other input sequences) with a andalsoshowimportantlimitationsregardingthenumberofspecies differentsetofparametersgeneratingacollectionofpreliminarypredictions. thattheycananalyse,therequiredparametrizationandtheaccuracy A set of posterior selection filters (see below) is then applied on these oftheprediction. preliminarypredictionstocomeupwithafinalpredictionofthepromoter Toovercometheselimitations,andtoprovidenovelandimproved region. solutionstothepredictionofregulatoryregions,wehavedeveloped Eachofthesepair-wisecomparisonsiscarriedoutintwodifferentscoring ReLA, a public local-based alignment tool that is capable of scenarios (10/−1 and 20/−1 match/mismatch scores, both with an open detecting promoters and enhancers by identifying clusters of and extension gap penalties of −2 and with two overlapping thresholds to remove redundant sites (using a window of three or five nucleotides, regulatorymotifsconservedinanypositionwithinlargehomologous seeabove);i.e.atotaloffourcomparisonsareperformedoneachpairof DNAregions(i.e.independentlyofgeneannotation).Considering sequencesandeachsetofmatricesdefined.Thesespecificcombinationsof thewiderangeofpotentialusersofthistool,wehavealsodeveloped parametersweredeterminedbymonitoringandmaximizingtherelationship auser-friendlywebserverforremotepredictions.Thesourcecode betweensensitivityandspecificityusingacollectionof10knownpromoters ofReLAisdistributedalsoasastandaloneprogramthatcanbeused oftheABSdatabase(Blancoetal.,2006a)(seeSupplementaryMaterialand locallyinUnix-basedcomputationalplatforms. www.bsc.es/cg/rela/downloads).These10regionswereexcludedfromthe performanceevaluation.Wealsoobservedthatthebestpredictionsobtained during the benchmarking were those with sizes between 200 and 600nt. 2 MATERIALS AND METHODS Preliminarypredictionscoveringshorterregionsusuallyinvolvedtoofew conserved TFBSs, while larger predictions tend to connect distant and, 2.1 FromDNAtoTFBSsequences apparently,unrelatedbindingsites.Forthisreason,preliminarypredictions The first step of our method consists on the mapping of putative TFBSs outsidethisrangeofsizesarenotconsideredduringthegenerationofthe sequences along the input sequences according to a certain catalogue of finalprediction. positionfrequencymatrices(PFMs)(inthedocumentationincludedwiththe programandinthewebserver,weprovideguidanceforselectingasetof 2.3 Outputgeneration homologoussequences).UserslocallyrunningReLAshouldprovidetheir As part of the results, ReLAgenerates a graph showing the distribution own PFM files (information about accepted file formats is also provided ofallacceptedpreliminarypredictionsonthereferencesequence.Fromthe with the program and in the web server). It has been shown that the analysisoftheoverlapamongthesepreliminarypredictions,wegeneratethe selectionofparticularcollectionsofmatricesyieldsslightlydifferentresults finalpredictionbyselectingtheregion,between200and1000ntlongthat (Blancoetal.,2006b).Afterevaluatingdifferentoptions(datanotshown), containsthehighestnumberofpreliminarycandidatepredictions.Together weobtainedtheoptimalresultsbyusingPFMsfromTRANSFAC(Matys with the final prediction on the reference sequence, additional consensus etal.,2006).Weclassifiedthiscollectionofmodelsintothreesubsets:whole regionsarealsodefinedineachoftheothersequences,whichcorrespondto collectionofTRANSFACPFM,thefirst600andthefirst400PFMsranked those(ifany)thatmatchthefinalpredictedpromoterregion.Wealsoprovide by their information content.These three sets are included as the default thelistofallconservedTFBSs.Fromthislist,asubsetofthemostconserved optioninthewebserver.TheidentificationofpotentialTFBSsisperformed TFBSs(specifically,thosewithinthetop10%ofconservation)isselected using the MATSCAN software (Blanco et al., 2006b) at high levels of andusedtogenerateamultiplealignmentingraphicalformat. stringency: 80% of similarity threshold. For this study, we calculate the similarityscoreusingSS=[(current–min)/(max–min)],where,‘current’ istheactualmatchingscore,‘min’istheminimumpossiblematchingscore 2.4 Webserver and‘max’,themaximumpossiblematchingscoreofaparticularPFM,as We have implemented ReLA as a web server that can be accessed at described elsewhere (Kel et al., 2003). Since the next Smith–Waterman http://www.bsc.es/cg/rela.Theunderlyingsearchengineisdistributedalso steprequiresasinglesequenceofTFBSs,andbecauseMATSCANresults asastandaloneprogram.WehavedesignedtheReLAwebsitetomeetthe usually contain PFM that overlap in all possible ways, we next simplify requirements of non-expert users. The web version provides a graphical thisoutput.Weremovethisoverlapbyslidingawindowofnbp(n=3or representation of the putative promoter region predicted in all the input 5,bothconditionsareincludedintheglobalsearch,seebelow).Foreach sequences(Fig.5)andaplaintextdescriptionoftheresults.Asmanyas overlappingPFMstartingatthefirstpositionofeachofthe3or5ntsliding twosuboptimalsolutionscanbeprovidedthroughtheweboneachsetof window,wesystematicallykeptthemostinformativeone,maximizingthe inputsequencestopotentiallypredictalternativepromoters. overallinformationcontentofthesequence.Topreservetherelativedistance between motifs during the comparison, we insert a ‘spacer box’every n 2.5 Selectionofdatabasesforevaluation consecutive nucleotides in regions free of predictions. In summary, we convert each input DNAsequence into a sequence of highly informative To validate the results obtained and to compare our method with other non-overlappingTFBS,whichisusednextinthecomparativesearches. existingprogramsinsimilarsearchingconditions,threedifferentworking subsetsorreportedregulatoryregionsweregeneratedfromthreedifferent databases:AnnotatedregulatoryBindingSitesdatabase[ABS;(Blancoetal., 2.2 Comparativesearches 2006a)],EukaryoticPromoterDatabase[EPD;(Schmidetal.,2006)]and For our searches, we have modified the classical local alignment Smith– VistaEnhancerDatabaseBrowser[VISTA;(Viseletal.,2007)]Tofollow Watermanalgorithm(SmithandWaterman,1981)todealwithsequences common criteria and to be consistent with the annotation of ABS (as ofTFBSs(associatingauniquethree-lettercombinationtoeachTFBS)and 500ntpromoters),wetransformedtheseTSSintoregionsbyconsidering to provide the best scoring local alignment (i.e with the highest density aspromoterregion500bpupstreamfromtheEPDTSS.VISTAEnhancer 764 [18:5625/2/2012Bioinformatics-bts024.tex] Page:764 763–770 MANUSCRIPTCATEGORY: ORIGINALPAPER ReLA Browserisadatabaseofregionscontainingexperimentallyvalidatedhuman the same strategies that have been used for fast local sequence andmouseenhancerstestedintransgenicmice. comparisons of protein sequences. In particular, we adapted the Theworkingsubsetsweregeneratedaccordingtothreedifferentfiltersto Smith–Watermanalgorithm(SmithandWaterman,1981)tomakeit facilitatetheautomationofthevalidationprocessandtoensurereliability capableofcomparinganddetectingthebestoptimallocalalignment of the evaluation protocol: (i) genes associated to selected regulatory of regions with similar sequences of TFBSs. In our procedure, regions must have at least three orthologous one-to-one genes according each TFBS is internally transformed into symbols of an arbitrary toENSEMBLorthologydata;(ii)thepromoterfragmentsselectedshould alphabet, as if they were amino acid or nucleotides in traditional notoverlapwithcodingregions;and(iii)theyhavetobeinourscanned protein–proteinandDNA–DNAcomparativesearches.Thissearch region:asdescribedintheSection3,forABSandEPDitis500bpupstream ofthefirstcodonofthegene,andforVista,50000bpupstreamoftheclosest algorithm is the core of a pipeline, referred from now on as gene(seebelow). ReLA(forREgulatoryregionLocalAlignmenttool).Thecomplete Applyingthesethreefiltersweobtained75humanandmousepromoters procedurecanbedividedintothreemajorsteps.First,inputDNA fromABS,and740fromEPD.Inbothcases,5000bpupstreamfromthefirst sequences are transformed into sequences of TFBSs by mapping methioninewereusedtocheckandcomparetheaccuracyofthemethod. withtheMATSCANsoftware(Blancoetal.,2006b)allthePFMs FromVISTAEnhancerBrowser,weendedupwith44enhancerslayingin provided; secondly, the resulting TFBS sequences are compared the5000bpupstreamofaknowngene. with each other to identify conserved groups of TFBSs using the modified Smith–Waterman algorithm under different scoring 2.6 Evaluationprotocol scenarios. Finally, all the resulting preliminary alignments are For the evaluation of the promoter prediction programs, we followed a evaluatedtoproducethefinalpredictionofthepromoterregion. modifiedversionoftheDistance-basedvalidationevaluationprotocolfrom Abeeletal.(2009).Takingintoaccountthatwewereevaluatingpromoter 3.2 EvaluationofReLA’sperformance genesandwewereconsideringdistances,wecalculatedrecallandprecision We first applied ReLAto a collection of experimentally validated valuesas promoter and enhancer regions, both to define its internal search Precision=correctpredictions/totalpredictions parametersandtoevaluateitsperformance.Despiteongoingefforts Recall=discoveredgenes/totalgenes of acquiring experimental data, on functional TFBS, still the vast majority of detailed and reliable data can only be retrieved Forthoseprogramsthatprovidesinglepositionsasoutputs(TSSpredictors, from the literature. In this direction, the ABS database (Blanco ARTS,EponineandPromoterExplorer),weconsideredthesequence±500 et al., 2006a) is the result of one of the few initiatives to gather, fromTSSfortheevaluation.InthecaseofTFM,weobtainedconserved from the literature, promoters with two or more of their TFBSs binding sites as result and considered the fragment between the first and experimentallyvalidated.Forthisreasonweusedthisdatabasefor thelastoneforevaluation.Acorrectpredictionwasconsiderediftherewas ReLA’sevaluation.Weselectedthesubsetof73(35humanand38 anoverlapbetweenthepredictionandthe500bpupstreamofthedefined mouse) promoters from this database that showed, in ENSEMBL TSS. For all programs, we considered the unique or the best prediction, exceptforPromoterExplorerthatdoesnotranktheresults.Sincewewere (Hubbard et al., 2009), one-to-one orthologous relationship with usingalreadyfilteredpromotergenesinsteadofbigDNAfragments,wedid at least three out of seven chosen vertebrate species (human or not discarded any prediction further of 500nt from theTSS as it is done mouse,rat,horse,dog,cow,opossumandchicken).Byreproducinga elsewhere(Abeeletal.,2009).Forallprogramsofourevaluation,weused commonandrealisticsearchscenario,wheretheTSSand5(cid:3)UTRsof defaultsettingsdefinedbythecorrespondingdevelopers.ForEELruns,we queryhomologousregionsarenotknown,wecollectedtheputative usedthemouseandhumansequencesforeachoftheorthologousgroups upstreamregionofthesegenesandtheircorrespondingorthologous andappliedtheparametersdescribedforthisspeciespairelsewhere(Palin regions.Theseupstreamregionscomprise5000bpupstreamDNA, etal.,2006).Allthedatausedforthevalidationprocedureareavailableat fromthefirstannotatedcodonaccordingtotheencodedENSEMBL www.bsc.es/cg/rela/downloads. (cid:3) protein.Fromthemeasurementofthelengthof5UTRsregionsof ‘known’ENSEMBLgenes, we previously had estimated that this 3 RESULTS selectionof5000bpissufficienttocapturetheproximalpromoters for>85%ofknownvertebrategenes(datanotshown).Inaddition, 3.1 RationaleandunderlyingsearchstrategyofReLA wehavealsocomparedtheresultingperformanceofReLAwiththe Thegoalofthisstudyistodevelopanovelmethodologythatwould prediction ratio of other reported TFBS-based search tools: TFM overcomethecurrentlimitationsmentionedabovebyfocusingon: (Tonon et al., 2010) and rVISTA(Loots and Ovcharenko, 2004), (i) the detection of conserved TFBS; (ii) the use of local search as well as with that of TSS predictors:ARTS (Sonnenburg et al., strategies;(iii)simplicityofuse;and(iv)alowcomputationalcost 2006);Eponine(DownandHubbard,2002)andPromoterExplorer to perform genome-wide searches. For this, we decided to use (Xieetal.,2006),allofthemrunonthesameregions(Table1). Table1. PerformanceresultsonABSpromoters ReLA TFM rVISTA(1) PromoterExplorer Eponine ARTS Recall 0.81 0.6 0.37 0.51 0.2 0.14 Precision 0.81 0.61 0.46 0.69 0.21 0.14 Predictiontype DefinedregionswithconservedTFBS ConservedTFBS TSS 765 [18:5625/2/2012Bioinformatics-bts024.tex] Page:765 763–770 MANUSCRIPTCATEGORY: ORIGINALPAPER S.Gonzálezetal. Table2. PerformanceresultsonEPDTSSs ReLA TFM PromoterExplorer Eponine ARTS Recall 0.56 0.49 0.78 0.23 0.17 Precision 0.56 0.51 0.67 0.27 0.17 Following the same strategy, we alternatively evaluated ReLA using740regionsderivedfromtheEukaryoticPromoterDatabase [EPD; (Schmid et al., 2006)], which, despite not being ideal for thispurpose,setsourtoolintothecontextofpreviousevaluations of these other existing search strategies: ARTS, Eponine and PromoterExplorer against which we have also compared ReLA’s predictions(Table2). From all resulting predictions, we calculated different performance parameters, such as recall and precision by adapting an evaluation protocol used for the comparison of a large number ofTSSspredictingmethods(Abeeletal.,2009).Thisadaptationis necessarybecausethedifferentmethodsweusedprovidedifferent Fig. 1. Prediction of the proximal regulatory region of the Cyclin E1 type of outputs, e.g. ARTS, Eponine, PromoterExplorer provide (CCNE1)geneinsevenvertebratespecies.Typicalsearchscenariowhere single TSS positions, TFM and rVISTAlists of conserved TFBS, theTSSformostofthespeciescomparedisnotknownormissannotated: andReLAdelimitedregionsofconservedTFBSs. no TSS information was available for cow, dog and opossum, whereas From the results of this evaluation, we observe that the in chicken is wrongly placed. The predicted promoter for these species appearsindifferentlocationwithintheinputsequence.Dashedlinesshow overallperformanceisdifferentbetweenthedifferentmethodsand thedistributionofallprimarypredictionsalongtheseregions.Consensus databases:whileReLA’sperformancewasthebestonABSentries, predictions are delimited by the first and the last coloured box (each PromoterExplorer on EPD regions outperformed it. Interestingly, box corresponds to a conservedTFBS). Red horizontal lines indicate the ReLA’s precision values for EPD regions are lower than those experimentally characterized regulatory region. Initial fragments of each shown with ABS. To discard a possible bias in the performance transcriptareshownontheright:non-codingregionsinblue,andcoding of ReLAtowards specific promoter types that are more abundant exonsingreen.Thenumbersindicatethepositionofthecodingexonsinthe in the ABS database, we divided all EPD and ABS regions in humanmRNA.ENSEMBLTSSsareindicatedwitharrows. different promoter classes (Section 2) and calculated the same precisionvaluesforeachpromotertypeandeachpredictionmethod (cid:3) separately (Supplementary Tables S1 and S2). This analysis has observed when the annotation of orthologous gene 5UTRs and shown that, despite ABS appears to be enriched in TATA-box firstexonsispracticallyabsent,asoccursformostoftheavailable containing promoters (a 42% versus a 20% in EPD), ReLA’s genomes. These results highlight the potential of using ReLAfor performanceisnotaffectedbythis,asprecisionvaluesaresimilar thesystematicidentificationandannotationofregulatoryregionsin among most of the promoter types present inABS and EPD. It is non-modelorganisms,suchaschicken,cow,dog,opossumandany also worth noticing that predictors based onTFBSs show a better otherthathasincompletegeneandcDNAdata. performanceontheABS,whichisalsobasedonTFBS,thanwiththe TSS-basedEPDentries,whereTSSpredictorstendtodobetter.We 3.3 Predictionofmultiplealternativepromoters didnotfindanysignificantdifferencewhencomparingperformance valuesforHumanorMouseentries(datanotshown). During the evaluation of ReLA, we also observed that, in InordertohaveasenseoftheTFBSconservationlevels,upon some cases, the distribution of preliminary predictions along which ReLA is able to build predictions, we have also analysed the reference sequence highlighted two regions with similar thedistributionofthenumberofconservedboxesdetectedwithin scores, which could correspond to alternative promoters. From allABS and EPD results. This analysis shows that, indeed, there these two options, ReLA selects the one that generated more is a wide range of TFBS conservation, both in number and in preliminary predictions (Section 2). Suboptimal solutions, i.e. composition(SupplementaryFig.S1).Similarly,fromtheanalysis potentialalternativepromoters,canbeobtainedbysimplymasking of the contribution of each of the species in ReLA’s performance, the previously predicted regions in the reference sequence and weobservethatalltheothervertebratesusedinthisstudycontribute running ReLA again. For a number of such cases, we confirmed substantiallytothefinalpredictioninhuman:forexample,cowand the presence of twoTSSs through the analysis of ESTs or known dog contribute to ∼60% of the predictions whereas opossum and alternatively transcribed full-length mRNAs. For this reason, we chickento36and32%,respectively(SupplementaryTableS3). have implemented this option in the web server, where the user AdetailedinspectionofReLA’sresultsonABSentriesuncovered can launch a second search run to find suboptimal solutions. some interesting features. Often, the promoters that we identified Figure2showsthebesttwopredictionsofregulatoryregionslocated on each of the species present different locations within the input upstream from SLC7A7, an amino acid transporter gene, which 5000bp region (Fig. 1). This typical scenario, which must be hasbeenexperimentallyproventohavetwoalternativepromoters necessarily solved with local-based comparative approaches, is (Puomilaetal.,2007). 766 [18:5625/2/2012Bioinformatics-bts024.tex] Page:766 763–770 MANUSCRIPTCATEGORY: ORIGINALPAPER ReLA Fig.2. Predictionofalternativeknownpromoterregionsofthesolutecarrier Fig.3. PredictionoftheproximalregulatoryregionoftheE2Ftranscription family 7 member 7 (SLC7A7). Searches were done on 15000bp region factor1(E2F1)usingthesequenceofthewholegene.Predictedregulatory upstreamofthefirstaminoacidannotatedintheENSEMBLdatabasefor region along a highly conserved region of 15722 bp that includes the thehumanSLC7A7gene.Dashedlinesshowthedistributionofpreliminary E2F1 complete gene transcript and 5000bp upstream of the first amino predictionsinthefirstandsecondrun.Finalfirstandsecondpredictionsare acid annotated in ENSEMBL. Dashed line shows the distribution of all enclosedinaboxanddelimitedbythefirstandthelastconservedTFBS(each preliminarypredictionsalongthisregion.Aschematicrepresentationofthe designedbyacolouredbox).Initialfragmentsofeachtranscriptareshown structureofthisgeneisshown:non-codingregionsareinblue,andcoding ontheright:non-codingregionsinblue,andcodingexonsingreen.The exonsingreen.Thenumbersdesignatethepositionofthecodingexonsin numbersindicatethepositionofthecodingexonsinthemRNA.ENSEMBL the mRNA, according to human. Data related to DNAconservation from TSSsareindicatedwitharrows.Distancesarenotdrawnatrealscale. UCSCarealsoincluded[http://genome.ucsc.edu;(Kentetal.,2002)]. ThefindingoftwohighscoringregionsinanyReLAprediction theidentificationofenhancers,whichareoftenlocateddistantfrom couldsuggest,insteadofthepresenceofanalternativepromoter,the other functional elements. In order to test ReLA’s capabilities in existenceofhighlyconservedcodingexons,whichwouldconstitute enhancerdetectionwehavegatheredacollectionofexperimentally a false positive prediction. Thus, the identification of regulatory validatedhumanenhancersfromtheVISTAdatabasewithactivity regions with ReLAwould be based only on the level of sequence assessed on transgenic mice (Visel et al., 2007). To test ReLA, conservationandthepresenceofhighlyconservednon-regulatory weselected44enhancersthatarelocatedwithinthefirst50000bp DNA, like coding exons, could negatively influence the results. upstream of a known gene. In order to search for each of these To discard this, we studied how this scenario can affect ReLA distal regulatory regions we extracted up to 50000bp from the predictions. The example in Figure 3 shows a positive promoter most upstream TSS annotated for the closest gene in human and predictionwhenallsevenorthologousinputsequencesincludethe from each of the corresponding one-to-one orthologous genes in complete region of the E2F1 gene and the additional upstream mouse,rat,horse,dog,cow,opossumandchicken.Thefirstrunof untranslated regions (a total of 15000nt each). In this case, the ReLA on these 44 regions generated 40 predictions, from which distribution of hits along the human sequence shows two high 11(28%) overlapped with the annotated enhancer. Considering scoringregionsthatappeartosharesimilarconservationlevelsof that the regions selected for the search theoretically contain other nucleotides. One of these fragments constitutes the third exon of unknown regulatory regions (promoters, for instance) that could (cid:3) this gene, while the other matches the 5UTR, the TSS and the match with the first prediction, we performed a second run on corepromoter.ReLAisabletosuccessfullydiscriminatethecorrect the remaining 29 cases, which yielded nine other positive hits. In promoter region, including sites that have been experimentally total, with two iterative runs, ReLAshowed a positive predictive proven (Blanco et al., 2006a). In particular, the two TFBS that valueof50%ofthescreenedsubsetofannotatedhumanenhancers. ReLA scores highest in conservation among input sequences are Asimilarpredictionrate(49%)isobtainedoverthesameenhancer preciselydescribedintheABSdatabaseastwoE2F1factorbinding benchmark set when using a specific enhancer locator tool, EEL sitesnecessaryforself-regulationduringthetransitionfromG1toS (Palinetal.,2006)thatalsoreliesonlocalsearchstrategies(EEL phaseinthecellcycle(Johnsonetal.,1994).Interestingly,thethird searchesimpliedonlyhumanandmousesequences,asitdoesnot TFBS following the conservation ranking corresponds to ADF1, acceptmorethantwosequencespersearch).Itisworthmentioning (cid:3) whichwaslocatedwithinthe5UTRandisknowntobindthesame that an important difference between both methods is that ReLA motifsrecognizedbytheE2F1factorinmice(Hsiaoetal.,1994) provides more precise results, as the regions predicted are shorter Despite these results, we cannot discard the possibility that exons (up to 750nt long, with an average of 485nt) than those coming arewronglypredictedaspromotersincertainsituations.Therefore, fromEEL(upto11563nt,withanaverageof2644nt). we recommend performing preliminary evaluations of the coding These results indicate that ReLA is capable of searching large potential within the input sequences, for example, by comparing genomic DNA fragments and identifying multiple proximal and them against protein sequence databases with BLASTX (Altschul distal regulatory regions, which makes this tool suitable for et al., 1997). Putative coding regions should be preferentially genome-wide screenings and across several genomes (see an maskedfromtheinputsequencestoensurethecorrectprediction. exampleinFig.4). Tofurtherexemplifythisfeature,wealsoperformedagenome- wide analysis on a 109kb long ENCODE region [ENm011; 3.4 Identificationofenhancers chr11:1858751-1968592; (Birney et al., 2007)] that includes six The local nature of the underlying search engine and the capacity genes coding for, at least, 11 transcripts, with their corresponding to compare large DNAsequences makes ReLAa suitable tool for intergenic regions. By using SYT8 and MRPL23 flaking genes 767 [18:5625/2/2012Bioinformatics-bts024.tex] Page:767 763–770 MANUSCRIPTCATEGORY: ORIGINALPAPER S.Gonzálezetal. 4 DISCUSSION Taking into account the available methods to in silico recognize generegulatoryregions,asubstantialimprovementisnecessaryto accuratelyannotategenesandpromotersequencesinmostgenomes. HerewereportthedevelopmentofReLA,acomputationaltoolto identify such regulatory regions using genome-wide comparisons. ReLAis distributed as a standalone program and as a web server. OurapproachismostlybasedinanadaptationofthepopularSmith– Watermanalgorithmthatisabletorapidlyidentifycoincidencesof TFBSsbetweentwosequences(conceptuallysimilartotraditional protein–protein comparisons). ReLAis able to efficiently process longsequencesinstandardcomputationalplatforms(e.g.lessthan aminutetoobtaintheresultsshowninFig.5).Wehaveevaluated theaccuracyofReLA,firstinadatasetofexperimentallyvalidated humanandmousepromoters,onanextensivecollectionofvalidated TSS from the Eukaryotic Promoter Database, as well as on an experimentallyvalidatedcollectionofrVISTAenhancers.Wehave reached maximums of 0.81 of recall and precision levels onABS sequencesOntheotherhand,andsurprisingly,ReLA’sperformance resultslowerwhenusingEPDTSSentries.Apossibleexplanation forthisobservationcouldbethatReLAperformsbetteroncertain typesofpromoterregions.But,afterweclassifiedallABSandEPD entriesintodifferentpromotertypesaccordingtotheircomposition and evaluated their associated performance obtained with all the methods used here for the validation, we observed that ReLA’s Fig.4. PredictionoftheSALL1enhancerinsixvertebrategenomes.The accuracy is similar among most of the identified promoter types. upperpanelshowstheSALL1geneanditscorresponding50kbupstream We cannot discard though that other uncontrolled biases present regionforsixvertebratespecies.Coordinatesandstrandoftheseregionsare: either inside the underlying search methodology of each of the Chicken(chr11:6148098–6213605,−),Mouse(chr8:91551143–91618061, protocolsusedhereorintheuseddatabasescouldactuallyexplain +),Rat(chr19:19227337–19293298,−),Dog(chr2:67080056–67144522, the different behaviour observed. It is worth noticing that overall, −), Cow (chr18: 18688671–1875278, +) and Human (chr16: 51169886– TFBS-basedpredictionmethodsperformbetterontheTFBS-based 51235181,+).Ineachline,wedisplaythestructureofthegene(greenboxes ABSdatabasethanontheTSS-basedEPD,whereTSSpredictorsare arecodingexons,whileblueareuntranslated).KnownandpredictedTSSs doingbetter.Inanycase,thelevelsofprecisionandrecallobtained arealsoshown.ReLA’spredictionsareshownforeachspeciesasgroupsof withReLAaresufficienttoprovidereliablepredictionsthatguide coloredboxes.Please,notethattheseregionsarenotdrowntoscale.ReLA’s posteriorexperimentalvalidation.Thisstudyalsodemonstratesthe predicted enhancer regions expanded from 223 and 233bp for dog and mouse,301bpforcow,to359,360and366bpforrat,humanandchicken, benefitsofusingtheSmith–Watermanalgorithmtodirectlysearch respectively.Thelocationsoftheexperimentallyprovenregions(asshown for conserved binding sites, as it outperforms other methods like inrVISTAdb)aredisplayedasgreenboxes.Thebottomlineofthispanel rVISTAandTFM, that are based on pre-aligned DNAand global showsthesequenceconservationprofile(accordingtohumancoordinates; search strategies. See the example in Figure 1, which cannot be http://genome.ucsc.edu).Inthebottompanel,wedisplaythealignmentof solved using global alignment approaches. Please note that other theconservedTFBSdetectedwithineachofthepredictedenhancerregions. methods based on similar strategies could not be included in the TFBSarelabelled(inTRANSFACformat)anddifferentiatedusingarbitrary comparison, as they did not provide results on our benchmark set shapes and colours. Coordinates shown here indicate the position of the becauseoflimitationsinthesize(MMETA)andonthenumberof predictedenhancerwithinthe50kbinputsequence. sequences[Conreal(Berezikovetal.,2005)]. Furthermore, we also show that ReLA is able of predicting as anchors, we identified and characterized the corresponding alternative promoters and even enhancer regions, dealing with orthologous regions for mouse, rat, dog, cow and chicken. The multiple suboptimal solutions in most cases. Our approach is complete analysis consisted in 10 iterative ReLA searches and suitableforintegratingacomputationalannotationpipelineinwhich implied the screening ofTFBS sequences in >600kb of genomic otherpredictivemethodssuchashomologysearches(e.g.BLAST DNA.Inordertoobtainanestimationoftheperformanceonthese againstproteindatabases)canassistintheimprovementofthefinal genome-wide conditions, we have taken as positive predictions predictions. those that match ChIP-Seq transcription evidences (Birney et al., Insummary,webelievethatthedevelopmentofReLAconstitutes 2007),aswellasthosefallingimmediatelyupstreamofannotated asignificantstepforwardinthefieldofthepredictionofregulatory gene starts. This count shows that 8 out of 10 predictions have regions, as it shows the highest predictive power reported so far. evidence of expression or regulation (Supplementary Fig. S2). ReLA is able to locally compare multiple large genomic regions Pleasenotethatwecannotdiscardthatadditionalrunswouldprovide and identify non-alignable conservation events across different other overlooked regulatory regions and, at some point, also false genomes.Thisisrelevantifweconsiderthatthelimitedinformation positives. regarding regulatory regions in eukaryotes is restricted to human 768 [18:5625/2/2012Bioinformatics-bts024.tex] Page:768 763–770 MANUSCRIPTCATEGORY: ORIGINALPAPER ReLA acrosssequencedvertebrategenomes,weareactivelysearchingfor ways of improving further its predictive power by, for example, applying more sophisticated scoring systems and accepting even largerDNAregionswithlowadditionalcomputationalcosts. ACKNOWLEDGEMENTS We thank Mar Albà, Steven Laurie and our entire group for constructive feedback during the development of this work and during the writing of the manuscript.We also thank Jan andAina SagristàfordesigningReLA’slogo. Funding: Ministerio Español de Ciencia e Innovación (BIO2006- 15036). ConflictofInterest:nonedeclared. REFERENCES Abeel,T.etal.(2008)Genericeukaryoticcorepromoterpredictionusingstructural featuresofDNA.GenomeRes.,18,310–323. Abeel,T. et al. (2009) Toward a gold standard for promoter prediction evaluation. Bioinformatics,25,i313–i320. Altschul,S.F.etal.(1997)GappedBLASTandPSI-BLAST:anewgenerationofprotein databasesearchprograms.NucleicAcidsRes.,25,3389–3402. Arnosti,D.N. and Kulkarni,M.M. (2005) Transcriptional enhancers: Intelligent enhanceosomesorflexiblebillboards?J.Cell.Biochem.,94,890–898. Berezikov,E.etal.(2005)CONREALwebserver:identificationandvisualizationof conservedtranscriptionfactorbindingsites.NucleicAcidsRes.,33,W447–W450. Birney,E.etal.(2007)Identificationandanalysisoffunctionalelementsin1%ofthe humangenomebytheENCODEpilotproject.Nature,447,799–816. Blanchette,M.andTompa,M.(2003)FootPrinter:Aprogramdesignedforphylogenetic footprinting.NucleicAcidsRes.,31,3840–3842. Blanco,E.etal.(2006a)ABS:adatabaseofAnnotatedregulatoryBindingSitesfrom orthologouspromoters.NucleicAcidsRes.,34,D63–D67. Blanco,E.etal.(2006b)Transcriptionfactormapalignmentofpromoterregions.PLoS Comput.Biol.,2,e49. Blanco,E.etal.(2007)Multiplenon-collinearTF-mapalignmentsofpromoterregions. BMCBioinformatics,8,138. Down,T.A. and Hubbard,T.J. (2002) Computational detection and location of transcriptionstartsitesinmammaliangenomicDNA.GenomeRes.,12,458–461. Goñi,J.R.etal.(2007)DeterminingpromoterlocationbasedonDNAstructurefirst- principlescalculations.GenomeBiol.,8,R263. Guigo,R.etal.(2006)EGASP:thehumanENCODEGenomeAnnotationAssessment Project.GenomeBiol.,7(Suppl.1),S21–31. Hsiao,K.M.etal.(1994)MultipleDNAelementsarerequiredforthegrowthregulation ofthemouseE2F1promoter.GenesDev.,8,1526–1537. Hubbard,T.J.etal.(2009)Ensembl2009.NucleicAcidsRes.,37,D690–D697. Johnson,D.G.etal.(1994)AutoregulatorycontrolofE2F1expressioninresponseto positiveandnegativeregulatorsofcellcycleprogression.GenesDev.,8,1514–1525. Kel,A.E.etal.(2003)MATCH:Atoolforsearchingtranscriptionfactorbindingsites Fig.5. SnapshotoftheReLAwebserveroutput.Graphicalrepresentation inDNAsequences.NucleicAcidsRes.,31,3576–3579. oftheputativepromotersandalignmentsofTFBSsofthehumanE2F1gene, Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, aswellasthelistsofpredictedregionsandconservedbindingmotifs.See 996–1006. Sections2and3foracompleteinterpretationofeachoftheresultsprovided. Loots,G.G.andOvcharenko,I.(2004)rVISTA2.0:evolutionaryanalysisoftranscription factorbindingsites.NucleicAcidsRes.,32,W217–W221. Matys,V.etal.(2006)TRANSFACanditsmoduleTRANSCompel:transcriptionalgene and mouse, e.g. from 2540 vertebrate entries in the Eukaryotic regulationineukaryotes.NucleicAcidsRes.,34,D108–D110. Promoter Database (Schmid et al., 2006), 2067 (81%) belong Palin,K.etal.(2006)Locatingpotentialenhancerelementsbycomparativegenomics to these two species. Thus, with this tool in hand we can now, usingtheEELsoftware.Nat.Protocols,1,368–374. Pavesi,G.etal.(2007)WeederH:analgorithmforfindingconservedregulatorymotifs not only fill missing gaps in the annotation of the genomes of andregionsinhomologoussequences.BMCBioinformatics,8,46. model organisms, mostly with the identification of enhancers and Puomila,K.etal.(2007)Twoalternativepromotersregulatetheexpressionoflysinuric alternative promoters, but also to start a reliable and consistent proteinintolerancegeneSLC7A7.Mol.Genet.Metab.,90,298–306. annotationofconservedpromotersthroughouttherestofgenomes Schmid,C.D. et al. (2006) EPD in its twentieth year: towards complete promoter thathavelittleornoinformationregarding5(cid:3)UTRsandoftenfirst coverageofselectedmodelorganisms.NucleicAcidsRes.,34,D82–D85. Sebestyen,E.etal.(2009)DoOPSearch:aweb-basedtoolforfindingandanalysing coding exons (Fig. 1). Beyond the current performance of ReLA commonconservedmotifsinthepromoterregionsofdifferentchordateandplant and,asweareplanningagenome-widesearchofregulatoryregions genes.BMCBioinformatics,10(Suppl.6),S6. 769 [18:5625/2/2012Bioinformatics-bts024.tex] Page:769 763–770 MANUSCRIPTCATEGORY: ORIGINALPAPER S.Gonzálezetal. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular Tonon,L. et al. (2010) TFM-Explorer: mining cis-regulatory regions in genomes. subsequences.J.Mol.Biol.,147,195–197. NucleicAcidsRes.,38(WebServerIssue),W286–W292. Sonnenburg,S.etal.(2006)ARTS:accuraterecognitionoftranscriptionstartsinhuman. VanLoo,P.andMarynen,P.(2009)Computationalmethodsforthedetectionofcis- Bioinformatics,22,e472–e480. regulatorymodules.Brief.Bioinform.,10,509–524. Tokovenko,B. et al. (2009) COTRASIF: conservation-aided transcription-factor- Visel,A.etal.(2007)VISTAEnhancerBrowser–adatabaseoftissue-specifichuman bindingsitefinder.NucleicAcidsRes.,37,e49. enhancers.NucleicAcidsRes.,35,D88–D92. Tompa,M.etal.(2005)Assessingcomputationaltoolsforthediscoveryoftranscription Xie,X.etal.(2006)PromoterExplorer:aneffectivepromoteridentificationmethod factorbindingsites.Nat.Biotechnol.,23,137–144. basedontheAdaBoostalgorithm.Bioinformatics,22,2722–2728. 770 [18:5625/2/2012Bioinformatics-bts024.tex] Page:770 763–770