Table Of Content

BIOINFORMATICS ORIGINAL PAPER Vol.29no.52013,pages614–621 doi:10.1093/bioinformatics/btt016 Gene expression AdvanceAccesspublicationJanuary17,2013 Grape RNA-Seq analysis pipeline environment David G. Knowles1,2,*, Maik Ro¨der1,2, Angelika Merkel1,2 and Roderic Guigo´1,2,* 1Bioinformatics and Genomics Group, Centre for Genomic Regulation (CRG) and 2Universitat Pompeu Fabra (UPF), Dr Aiguader 88, 08003 Barcelona, Spain AssociateEditor:IvoHofacker ABSTRACT 2008). RNA-Seq has proven particularly powerful on tasks Motivation:Theavalancheofdataarrivingsincethedevelopmentof suchasidentifyingnovelgenesandnovelspliceforms,detecting NGS technologies have prompted the need for developing fast, lowabundancetranscriptsandfindingsequencevariations,such accurateandeasilyautomatedbioinformatictoolscapableofdealing as SNPs (Marguerat et al., 2008). It is gradually substituting with massive datasets. Among the most productive applications of microarrays as the technology of choice for transcriptome ana- NGS technologies is the sequencing of cellular RNA, known as lyses, providing access to a greater dynamic range of RNA ex- RNA-Seq. Although RNA-Seq provides similar or superior dynamic pressionlevels(Marionietal.,2008). rangethanmicroarraysatsimilarorlowercost,thelackofstandard ThethroughputofNGStechnologiesiscontinuouslyacceler- anduser-friendlypipelinesisabottleneckpreventingRNA-Seqfrom ating,andthecostpersequencednucleotideisrapidlyfalling.As becomingthestandardfortranscriptomeanalysis. aconsequence,anunprecedentedamountofdataarebeingpro- Results:Inthisworkwepresentapipelineforprocessingandanalyz- duced,pressingforthedevelopmentoffastandefficientmethods ing RNA-Seq data, that we have named Grape (Grape RNA-Seq ofanalysis,asthebottleneckforscientificdiscoveryisgradually Analysis Pipeline Environment). Grape supports raw sequencing shiftingfromdataproductiontodataanalysis. reads produced by a variety of technologies, either in FASTA or IntheanalysisofNGSdata,processingefficiencyisgoverned FASTQformat,orasprealignedreadsinSAM/BAMformat.Aminimal predominantlybytwofactors:first,thesheersizeoftheindivid- Grapeconfigurationconsistsofthefilelocationoftherawsequencing ualdatasets,andsecond,thevastnumberofdatasetstobeana- reads, the genome of the species and the corresponding gene and lyzed. The current generation of sequencing machines, for transcriptannotation. example, Illumina HighSeq, can produce the equivalent of 20(cid:2) Grapefirstrunsasetofqualitycontrolsteps,andthenalignsthereads the coverage of the human genome in a single run, delivering to the genome, a step that is omitted for prealigned read formats. (cid:3)600millionreadswithalengthof4100nt.Thishasspawneda Grapenextestimatesgeneandtranscriptexpressionlevels,calculates newgenerationofalignersoptimizedforaligningshortsequences exoninclusionlevelsandidentifiesnoveltranscripts. to the genome, for example, Bowtie (Langmead et al., 2009), Grapecanberunonasinglecomputerorinparallelonacomputer BWA(Li andDurbin, 2009),BFAST (Homeretal.,2009) and cluster.Itisdistributedwithspecificmappingandquantificationtools, GEM (Marco-Sola et al., 2012), which are much faster than butgivenitsmodulardesign,anytoolsupportingpopulardatainter- previous tools such as Basic Local Alignment Search Tool, changeformatscanbeintegrated. FASTA or SSEARCH (see Trapnell and Salzberg, 2009, for a Availability: Grape can be obtained from the Bioinformatics and short review). Additionally, an increasing number of projects Genomicswebsiteat:http://big.crg.cat/services/grape. involve large sample sizes to be analyzed, often under complex Contact:[email protected]@crg.eu experimentaldesigns. ReceivedonJuly23,2012;revisedonNovember7,2012;accepted InthespecificcaseofRNA-Seq,mappingofreadsisonlythe onJanuary8,2013 firststepofacomplexdataprocessingschema,thefinalgoalof whichistoproduceaccurategeneandtranscriptquantifications, andtodelineatenoveltranscriptstructures.Thelackofeasy-to- 1 INTRODUCTION use pipelines to perform such a processing out of the box in a The development of ultrasequencing technologies during the transparentandstreamlinedfashionisactuallyabottleneckthat recentyearshasstartedamajorrevolutioninBiology.Theabil- prevents the expansion of RNA-Seq, and prompts users with ity to directly survey the cell’s RNA content by applying NGS little access to sophisticated bioinformatic resources to prefer technologies to cDNA sequencing (‘RNA-Seq’) has provided microarrays, for which standard ready-to-use processing pipe- insights of unprecedented depth on the transcription landscape linesanduser-friendlybioinformaticanalysistoolsexist. ofmanyspecies(Wangetal.,2009)suchasHomosapiens(Wang To address this need, specific pipelines have recently been et al., 2008; Sultan et al., 2008; Montgomery et al., 2010; developed for analyzing RNA-Seq data, such as the pipeline Trapnell et al., 2010), Mus musculus (Mortazavi et al., 2008; developed by Goncalveset al. (Goncalves et al., 2011), an ana- Guttman et al., 2010), Arabidopsis thaliana (Lister et al., 2008), lysis pipeline developed in R within the context of the Saccharomyces cerevisiae (Nagalakshmi et al., 2008; Yassour ArrayExpressDatabase.Theseandothertoolsfacilitatetheana- et al., 2009) and Schizosaccharomyces pombe (Castle et al., lysis of RNA-Seq data, but still require the user to have a sig- nificant bioinformatics background, specifically when complex *Towhomcorrespondenceshouldbeaddressed. experimentswithalargenumberofdatasetsneedtobeanalyzed. (cid:2) The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. GrapeRNA-Seqanalysis Here we describe Grape, a workflow for the analysis of TwoMySQLdatabasesstoretheinformationgeneratedbythe RNA-Seq data that automates all the steps from RNA-Seq individualstepsinthepipeline,allowingforefficientstorageand readstotranscriptquantificationanddiscovery.Itsuser-friendly retrievalofdata.Oneoftheseisusedtostorethemetadata(cell interface provides thenecessary overview ofall thedata,and is type,RNAextract,etc)aswellasinformationthatisindepend- therefore particularly suited for processing numerous files, pro- entofthereads(junctionlibraries,indiceslocation,etc),andthe duced in complex experimental setups. The results, both inter- otherfortheresultsproducedforeachofthedatasets. mediate and final, are stored in an MySQL database, allowing GrapeusestheBuildoutpackage(http://www.buildout.org)to foraccessthroughthedatabaseinterface,butcanalsobevisua- set up all necessary components for running the analysis, and lizedthroughRaisin,auser-friendlywebapplication.Thesame Raisintovisualizethedatathroughawebbrowser. htmlinterfacethatisusedforstandalonelocalanalysisiseasily BuildoutisawidelyusedPythonpackageforassemblingand deployedonawebserverforremoteaccessincollaborativepro- deployingapplicationsfrommultipleparts.Inourcase,Buildout jects,ordatadissemination. installsalltheprogramsusedbyGrape,likeGEMandtheFlux Grape takes three types of input files. First, the read files, Capacitor,andmakestheGrapepipelinescriptsavailabletothe which may be aligned or not, the reference genome sequence individually configuredRNA-Seqanalysis pipelines.Eachpipe- file and the corresponding gene annotation file. Alternatively, lineislinkedtooneofthedatasetsintheproject,andhasitsown Grape can also take read alignment files as input, rather than independentdirectorystructure.Thepipelinesareruninparallel, rawreadfiles.Grapeproducesanumberofoutputfilesintabu- andcanbeinspectedthroughlogfilestracingallexecutionsteps. lar format, so they can be easily loaded into most statistical TheonlyrequirementfortheBuildoutisasetofconfiguration packages. While this is the case for quantifications, which are filescontaininginformationonthereadfilesandparametersto madeavailableindividuallyatthegene,transcriptandexonlevel, beusedduringtheanalysis,suchasannotation,numberofmis- theBAMformatisusedasanoutputformatforalignments. matches allowed during mapping, location of data files, etc. Morespecifically,processingofRNA-SeqreadsbyGrapein- Theseconfigurationfilesalsoserveasawaytotrackallanalysis volvesthefollowingsteps:(i)sequencereadevaluation,andtrim- performedforaprojectandthemetadataassociatedtoaproject. ming if required; (ii) mapping to the genome, and the Most importantly, they allow future reproducibility of the re- transcriptome; (iii) assembly of reads in absence of a reference sults. Internally, we manage all project configurations within genome, or beforegenomemapping;(iv)identificationofnovel the version control system Subversion (SVN), so they can be exons,andsplicejunctionsandmodelingoftranscriptstructures easilyupdatedandtraced. and (v) gene and transcript quantification. Grape pays special Buildout installs two central scripts that are needed for run- attention to quality controls (QCs), which trace quality scores ning the individual pipelines. The first is the start.sh, which in- and uncalled bases along the reads, look for biased nucleotide serts the pipeline information into the metadata MySQL composition and inspect the distribution of reads along tran- database. The second is the execute.sh script that runs the scripts. These can be used for QC, although Grape leaves any actual pipeline and stores all its results in the results MySQL decision to modify the data, such as trimming and filtering, to database.Theanalysesareexecuted as so-called‘rules’through theuser. a template file (similar to a Makefile). Grape can be run on a Grape is based on the PIP pipeline management system single dataset or simultaneously on several datasets. The first (Conery et al., 2005). The current Grape distribution uses common database holds all metadata concerning the pipeline SAMtools (Li et al., 2009), the GEM mapper (Marco-Sola runs, such as information regarding the input data. It is also etal.,2012),theFluxCapacitor(Montgomeryetal.,2010)and usedasapersistentcacheforsomeshareddata,likethelibrary Cufflinks (Trapnell et al., 2010), but any tool compliant with ofsplicejunctioncoordinatesretrievedfromtheannotation.The popular data interchange formats like GFF, BAM/SAM and second database contains all information from the individual BED can be used. Grape can be run locally or on a computer runs thatarespecifictoeachdataset,andincludes thestatistics cluster.Speed-upoftheanalysesisachievedbyparallelizingcer- neededforsubsequentanalysis.Finally,itstoresthetimestamp tain steps and taking advantage of multithreading where pos- ofeachexecutedruleofadataset,whichmakesGrapecapableof sible. The Grape implementation conserves a copy of the exact resumingworkseamlesslyafterinterruption. softwareandconfigurationusedforagivensetofanalyses,guar- Usingarule-basedapproachofferstheflexibilityofbeingable anteeingforwardreproducibility. to select only a subset of steps to execute, while automatically devotingresourcesonlytotheinferredanalysissteps.Eachstep contains a set of prerequisite steps that need to be completed 2 APPROACH before the step itself can be executed, and a set of instructions thatwillbeperformedbythestep.Thisinformation,encodedin 2.1 Grape description atemplatefile,isreadbytheexecutionscript.Thisscriptbuildsa Grape is an automated workflow integrating the management, graph of dependencies between each of the different steps and analysis and visualization of RNA-Seq data. GRAPE can map executes thosenecessarytoreach therequiredstep. This allows the reads to the genome and/or transcriptome, and it can also Grape to perform all analysis steps in an ordered and easily workwithsingleorpairedendreads,bothstrandedornot. reproducible manner, and at the same time, to store results ItorganizestheRNA-Seqdatasetsinprojects,thatis,setsof and keep track of successfully completed steps This is particu- datasets, that are generated within the same study and are all larlyimportantwhentheaimistoanalyzemanylargedatasetsin analyzed using the same annotation files and reference genome a consistent manner. An additional advantage is that the tem- sequence. plate file contains all the steps executed by the pipeline in a 615 D.G.Knowlesetal. human readable way. By editing the template file, steps can be removed,addedormodified. Grape results include expression levels provided as reads per kilobasepermillionand/oruniquereadcountsfordifferentfea- tures, suchas genes, transcripts,exonsand junctions. Theycan beaccessed throughthe command line bydirectlyquerying the Grape databases or through the Raisinweb-application.Raisin isalsobasedonBuildout,andusesthesameconfigurationfiles used for producing the pipelines. Raisin accesses the results MySQLdatabasecreatedbyGrape,and displaysand summar- izesinformationproducedatthemajorstepsintheanalysispipe- line(seelaterinthetext). TheRaisinwebservercomeswithtwoconfigurations,onefor accessingGrape locallyona workstation, andanother,for dis- seminationof the results ona local Intranet orthe Internet for Fig.1. OverviewofthethreemaincomponentsoftheGrapeRNA-Seq collaborativeprojects. analysis pipeline environment. Buildout: performs the initial configur- Currently, RNA-Seq data processed by Grape from ation.MySQL:twodatabasesareused,onecontaininginformationspe- several public projects, such as the Illumina Body Map Project cifictothedatasetsanalyzedsuchasquantifications-detectedelements, (HBM) or ENCODE (ENCODE Consortium 2004, 2011) etc.Anothercontainsinformationdependentongenomeandannotation, can be browsed at http://rnaseq.crg.es (http://rnaseq.crg.es/pro as well as meta-informationthat allows for the linkingofthe different ject/HBM/tab/experiments and http://rnaseq.crg.es/project/ datasets.Raisinwebapplication:allowsthevisualizationoftheanalysis ENCODE/tab/experiments/). summariesusingawebbrowser organizedaccordingtothemetadatagivenintheBuildoutcon- 2.2 Analysis implemented in Grape figurationfiles. Next, Grape produces some basic statistics, and checks the Here we describe the analysis steps implemented in Grape and qualityof the RNA-Seqdata byverifying the data format, cal- thetoolsusedtoperform them. Thesetoolsareincludedinthe culating the distribution of ambiguous bases (bases where the standard distribution of Grape. However, alternative tools can sequencer was unable to call the base correctly and assigned it be used for the analysis, provided they comply with popular an N) and tracing quality scores along the reads. Figures 2a, b input/output formats. To illustrate the Grape analyses, we use and d illustrate the Raisin interface to some of these QC steps. an RNA-Seq dataset from the Illumina Human Body Map These initial QC steps contribute to assess whether additional (HBM) project (Additional information on the sample and the preprocessing, suchastrimming and/or filtering ofthereads, is raw data can be obtained from http://www.ncbi.nlm.nih.gov/), necessary. However, Grape only provides the QC information, with http://rnaseq.crg.es/project/ENCODE/tab/experiments/ as anditisuptotheusertodecidewhetheradditionalpreprocess- anexample.Here,PolyAþ/randomprimedRNAfromamixture ingisrequired.Grape’sarchitectureallowsforconvenientlyrun- of tissues was interrogated. Three different isolation methods ningitinstages.Forexample,Grapemayberunfirstuptothe wereused:PolyAþ, PolyAþ normalizing the RNA tomoreef- QC step, and restarted from there after completion of the QC ficiently sample lowly expressed transcripts and Ribominusin- analysis. stead of PolyAþ to remove the ribosomal RNA. The three Mapping: Grape’s next step is the alignment of the short se- samples were sequenced using the Illumina HiSeq 2000 at a quencereadstothereferencegenome.Thisstepiscrucialforthe readlengthof100nt.Eachofthesesequencingexperimentspro- RNA-Seqanalysis,andparticularcarehastobetaken,asitwill duced,onaverage,400millionreads.TheRaisininterfacetothe conditionanydownstreamanalyses.Grapealignmentmoduleis outputproducedbytheGrapeanalysisofthisdataisavailableat complex, performing several alignments, and then combines http://rnaseq.crg.es/project/HBM/tab/experiments/. themintoafinalmapping,fromwhichaBAMfileisproduced. ThemainstepsintheGrapeanalysis(illustratedinFig.1)are By default, Grape uses GEM (Marco-Sola et al., 2012), an ex- asfollows: haustiveshort-readmapper,allowingmismatchesandindels(but (cid:4) Preprocessingandqualitychecks otheralignerscouldbeusedaslong astheoutputformatis,or canbe,convertedtostandardBAM/SAM). (cid:4) Mapping AsmostNGSaligners,GEMrequiresan‘indexcreation’step (cid:4) Post-mapping before the actual read mapping. This step preprocesses the ref- (cid:4) Transcriptquantification erence sequence, creating a data structure that can be searched (cid:4) Discoveryanddelineationofnoveltranscribedelements fast and efficiently. The choice of indices determines which reference sequences the raw reads will be searched against. The (cid:4) Summarystatistics most obvious index to be used is the one corresponding to the Preprocessingandqualitychecks:Beforeprocessingthereads, genome sequence of the species investigated. However, when Grapecreatesanumberoffilesanddatabasetablesrequiredfor examiningtranscriptomedata,readsmappingacrosssplicejunc- performingtheanalysesandstoringtheresults.Experimentsare tionswillnotmatchthegenomesequence,anditisconvenientto 616 GrapeRNA-Seqanalysis Fig.3. OverviewofGrape’s mappingstrategy. The initial genome and junctionsmappingisfollowedbyaroundofsplitmapping.Remaining unmapped reads are aligned with additional mismatches, and the still remaining ones are iteratively trimmed. The mappings resulting from thedifferentstepsarecombinedintoafinalmergedmapping mismatches,anewgenomemapping(includingsplit-mapping)is performed. Second, remaining unmappedreads are successively trimmed by a set number of nucleotides (10 by default), and a Fig.2. RaisinvisualizationofGrape’sQCstep.Panelsaandbshowthe newgenome(andsplit)mappingperformedaftereachtrimming. distributionofqualityscoresandambiguousnucleotidesalongthelength Theaggressivenessofthetrimmingcanbecontrolledbytheuser. ofthereads.Panelcanddshowsummariesofthenumberofreadsinthe datasetaswellasthefractionofreadswithnoambiguousbasesandthe A parameter sets the shortest length to which the read can be numberofuniquesequencesasapercentageofthetotal trimmed.Ifthislengthisequaltothereadlength,thennotrim- mingisperformed. Thisiterativemappingendswhenallreadsaremapped,orthe create a specific index corresponding to the splice junctions. lengthofthereadsfallsbelowacertainthreshold(25nt,byde- Thus,inadditiontothegenomeindex,Grapegeneratesajunc- fault). A diagram describing it can be seen in Figure 3. This tionsindexthatcontainsannotatedsplicejunctions,plusallpos- approach is similar to the one described in Cloonan et al. sible junctions that can be obtained by biologically legal (2009).The Raisin output, providing summary statistics of the combinationsofexonswithineachlocus.Thistypeofapproach Grapemappings,isshowninFigure4. hasalreadybeenshowntodistinguishdifferentalternativesplice Post-mapping:Thereadsthatalignintheinitialroundofmap- formswhenusedtodesignsplicingmicroarrays(Johnsonetal., ping (genome, junction and split-mapping) are examined and 2003). divided into those reads that map to one location better than Grapealignsreadsagainstboththegenomeandthejunction toany other(uniquemaps)andthosethataligntoatleasttwo references.Junctionmappingsarefilteredtoremovethosereads positionsequallywell(multi-maps). that donot spanthesplicesite. Next,theremaining unmapped This is an important filtering step because a unique map reads are mapped using the GEM split-mapper (Marco-Sola implies higher reliability in the alignment than a multi-map. et al., 2012). This tool will attempt to divide the read into two However, excluding all multi-maps from an analysis results in parts and align each fragment independently to the genome. the potential loss of information. The eventual use of one read Onlyalignmentsmatchingconsensussplicesitesarefurthercon- alignment type versus the other will depend on the type of the sidered. This allows Grape to recover the reads mapping to analysis. For example, for tasks such as the identification of unannotated‘bonafide’splicesites. novel genes, or the detection of low abundance transcripts, the Afterthisstep,theremaystillbesomeunmappedreads—cor- confidenceoftheresultsincreasesifonlyuniquelymappedreads respondingtounsequencedregionsofthegenome,genomecon- areconsidered.Othertasks,likethecalculationofgenome/tran- taminants,largenumberofmismatcheswiththereferences,etc. scriptomecoverage,mayuseallthemappedreads. Grape follows an aggressive strategy in an attempt to reliably Those reads that can be aligned to a unique position in the mapasmanyreadsaspossible,andtowardthatend,itperforms genome and also to a unique position in the junctions are also additionalroundsofmapping.First,itsuccessivelyincreasesby consideredmulti-maps,andare,inmanycases,indicatorsofthe one (by default) the number of allowed mismatches up to a presence of a pseudogene or processed copy of a gene in the numberthatisproportionaltothelengthoftheread(bydefault, genome.Inthecaseofpairedendreads,ifoneofthemembers upto1mismatchper25ntofreadlength).Foreachnumberof of the pair can be aligned uniquely to a certain position, the 617 D.G.Knowlesetal. Fig.5. Distributionofuniquelymappingreadsalongtheannotatedtran- scripts.Thisallowsustoidentifybiasesthatmaybecausedbyissuessuch asRNAdegradation Fig.4. RaisinvisualizationofGrape’smappingstep.Panelashowsthe overallmappingresultsaswellastheinformationonthegenomeanno- tationandnumberofmismatchesusedforthealignments.Panelbshows thefractionofreadsalignedinthefinalmergedmapping.Panelsc,dand eshowthesametypeofinformationforthedifferentcomponentsofthe mappingprocess corresponding mate can sometimes be rescued if it is a multi-map.However,inthecaseofRNA-Seqreads,incontrast togenomicDNAreads,theinsertsizecannotalwaysbeusedto identifythecorrectmate,asthepresenceofintronscanalterthe distancebetweenmates.Grapechoosestheclosestmappingpos- itiontotheuniquelymappingmateinthesecases.Inthecaseof the example analyzed here, Grape assigns, on average, 52% of the reads to a unique position, whereas 25% of the reads are Fig.6. Raisinvisualizationofthetranscriptquantificationstep.Panela multi-mapping(Fig.4a). showsthedistributionofgeneexpression.Panelsbandcshow,respect- Theresultsofthedifferentmappingstepsarecombinedintoa ively,thetopgenesandtranscriptsdetectedinthedifferentlanesofthe final mapping results fileinGFF, BEDorSAM/BAM format. analyzedsamples This file can be uploaded to genome browsers for visualization andcomparisonorbeanalyzed byother programs.Grapeuses thisfilespecificallyforthequantificationofgenesandtranscripts reads completely included in two or more overlapping exons andforthedelineationoftranscriptstructures(seenext). will be counted separately for each exon. Exon quantifications Transcript quantification: Grape uses the mapping results to inthesecaseswillbeoverestimated. producequantifications of the abundance of a number of tran- To produce quantifications of individual transcripts, we use scribedelements:exons,splicejunctions,genesandtranscripts,as theFluxCapacitor(Montgomeryetal.,2010)TheFluxCapacitor wellasinclusionindicesforexons. converts the transcript structure of each annotated locus into a Before the quantification, Grape investigates the distribution splicing graph, where junctions are represented as nodes and of mapped reads along transcripts to identify potential 30 to 50 exonsasedges.Themappingofthereadsintothegraphimposes biases. Toward that end, the annotated transcripts are binned a number of constraints that the FluxCapacitor represents as a accordingtotheirlength(seeFig.5). system of linear equations, which can be solved using linear We use two strategies for quantifying exons and genes: (i) programming. simpleoverlap,and(ii)readdeconvolution.Underthefirstscen- Raisin plots the distribution of expression of all genes ario, exonsandgenesarequantifiedbysimply summing allthe (Fig. 6a), and lists the top 20 highly expressed transcripts uniquelymappingreadsthatarefullyincludedwithinthebound- (Fig. 6c) and genes (Fig. 6b), and from the Raisin interface, it ariesoftheexonorgene.Resultsaregiveninreadsperkilobase ispossibletonavigatetotheexpressionvaluesofalltranscripts per million mapped reads (Mortazavi et al., 2008). Note that andgenes(inhtml,csvandexcelformats). 618 GrapeRNA-Seqanalysis Fig.8. RaisinoverviewofGrape’sanalysisresults canbeusediftheinputispairedendreads.Here,bothmatesof allreadpairsmappinguniquely(withuptothespecifiednumber ofmismatches) tothe transcriptomeareevaluated. Ifthey map todifferenttranscripts,theyareclassifiedasunannotatedsplice variants if both of these transcripts belong to the same gene. Otherwise, they map to transcripts from different annotated genes,andareclassifiedasputativechimericRNAsorfusions. Summarystatistics:Apageincludingsummarystatisticsfrom Fig. 7. Raisin visualizationofGrape’s splicing analysis. Panela shows allthedifferentanalysisstepsinGrapeisproduced,anditcanbe thesummarytableforthedetectedsplicesites.Panelbcontainsatable accessedthroughRaisin(Fig.8).Fromthispageitispossibleto withthetopincludedexonsinthesamplesexamined,andpanelcshows thedistributionoftheinclusionvaluesoverallinternalexons navigatetoeachspecificanalysis. Additional analysis: Additional analyses can be implemented within Grape by simply adding an entry to the workflow tem- Splicejunctionsarequantifiedusingthenumberofreadsspan- plate file specifying the computations to be performed in the ning the junction. In this case, no need for feature length nor- additionalstep.Thenewentryisasequenceofshellcommands malizationisrequired,andthechoiceofnormalizationbasedon that specify the computations that need to be performed. The numberofreadsindifferentsamplesislefttothecriteriaofthe sequence of commands is preceded by a line that contains the user (Fig. 7a). The detected splice sites are further classified: nameof thenewstepandits dependencies.Thesedependencies ‘Known’ are junctions that appear in the annotation. ‘Novel’ arejustthenamesofthoseotherstepsthatneedtobeexecuted are junctions formed between annotated exons from the same before to generate the input required by the new step. For in- gene, but not present in the annotation. ‘Novel from unanno- stance,intheentry tatedexons’aresplit-mapjunctions,inwhichatleastoneofthe two connected ‘exons’ is not annotated. In this last group, we preprocess:start alsoincludeanyjunctiondetectedbetweenexonsfromdifferent $BIN/preprocess.RNAseq.pl genes. mvpreprocess.RNAseq.log$LOG Grape also computes an exon inclusion ratio for each anno- tatedinternalexon.Exoninclusioniscomputedastheratioofall Thefirstlinecontainsthenameofthestep(‘preprocess’)and reads supporting the inclusion of the exon (reads mapping to itsdependencies,inthiscase,astepnamedstart,andthefollow- exon junctions that include it and the exon itself) to all reads ing lines,thecommandstoberun.The variables set by Grape, supporting its exclusion. These are reads mapping to junctions suchasBINandLOG,arelistedatthestartofthisfileandcan fromthatgenethatskiptheexon.SeeFigure7c. beusedbyanyofthecommandsinthetemplatefile. Discoveryofnoveltranscribedelements:Graperunsanumber of analyses to identify novel transcribed elements. First, RNA-Seq clusters are built from uniquely mapping reads, 3 DISCUSSION using either genomic mappings, junctions maps or split reads. Thereisnothresholdforthenumberofreadsthatmakeaclus- Here we presented and discussed Grape, an architecture for a ter, but Grape provides the number of reads, and of staggered computationalpipelinefortheanalysisofmillionsorbillionsof readsmakingeachcluster.Grapealsodetectsnovelsplicejunc- short reads obtained from (potentially many) high-throughput tionsthroughthesplitmappingofreads.Thesearethejunctions RNA-Seqexperiments.Thisisageneralandflexiblepipelinethat classified as novel from unannotated exons (see previously). combinescontributionsofpreviousstudieswithourownexperi- Finally,Cufflinks(Trapnelletal.,2010)isusedtoinfertranscript encedealingwithRNA-Seqdata.Grapeattemptstoaddressthe structures. challenges both from the processing and management stand- Grapealsoimplementsasimpleproceduretoidentifychimeric points associated to the analysis of sheer amounts of data. It RNAs independently of those cases found by Cufflinks, which automates the processing and analysis steps, while at the same 619 D.G.Knowlesetal. time providing an organizational framework that simplifies the technological reach. We believe that Grape may contribute to managementandsummarizingoftheanalysis. fillsuchavoid. The fullworkflow ofthe pipelinefitsina singletext filespe- cifying the dependency graph of the pipeline’s rules. Adding a new analysis step in Grape is as simple as defining the parent ACKNOWLEDGEMENT stepsthathavetobeexecutedbeforehand,andtheprogramsand TheauthorsthankGiovanniBussotti,Jia-MingChang,Andrea scriptsthatneedtobeexecutedbythenewstep. Tanzer and Darek Dedra for their patience in testing the first Grape’s objective is to produce quantifications of transcript implementations as well as their helpful comments and sugges- abundances (and of the abundances of other transcriptional tions. Also, Jens Steuck (Bioinformatik, University of Leipzig) elements: genes, exons, splice junctions, etc). Grape does not fortestingit. perform statistical analysis and comparisons of expression or Funding:Theresearchleadingtotheseresultshasreceivedfund- splicing usage between and/or across samples. From Grape’s ing from the European UnionSeventhFrameworkProgramme output quantifications, however, other methods and tools can (FP7/2007-2013) under grant agreement 282510. This work has beusedtoperformsuchanalysis[forinstance,Rstatisticalpack- been carried under grants BIO2011-26205 from Ministerio de ages within Bioconductor (Gentleman et al. 2004; http://pypi- Economa y Competitividad (Spain) and INB GNV-1 and ranking.info/module/zc.buildou) for expression analyses, etc]. RETICS RD07/0067/0012 from PN de I+D+i, ISCIII— IntegratingGrape’s quantificationwithanalysis of geneexpres- Subdireccin General de Evaluacin y Fomento de la sionandalternativesplicing isamong thefurther developments Investigacin—(Spain) and cofunded by FEDER. This work planned within Grape’s roadmap. In the very short-term, we reflects only the author s views and the European Community plan to incorporate a method to assess data reproducibility. is not liable for any use that may be made of the information The method, called Irreproducible Discovery Rate (Li et al., containedtherein. 2011), has been pioneered in the framework of the ENCODE project,anditcanbeappliedtoexperimentswithtworeplicates. ConflictofInterest:Nonedeclared. Other developments include transcript assembly and quantifications in absence of a genome of reference. This could be particularly useful for species with transcriptome data, but not REFERENCES sequencedgenome.NotethatGrapecanalreadybedirectlyused Castle,J.etal.(2008)Expressionof24,426humanalternativesplicingeventsand toproducequantificationsifanindexfromareferencetranscrip- predictedcisregulationin48tissuesandcelllines.Nat.Genet.,40,1416–1425. tome—independently assembled—is generated. Current efforts Cloonan,N. et al. (2009) RNA-MATE: a recursive mapping strategy for are,however,mostlyfocusedtostreamlinethepipelinetomaxi- high-throughputRNA-sequencingdata.Bioinformatics,25,2615–2616. Conery,J.S. et al. (2005) Rule-based workflow management for bioinformatics. mizespeedandminimizememoryusage,tosimplifyinstallation VLDB,14,318–329. ofthepackage,toprovidegraphicalsupportforinteractiveusage ENCODEConsortium(2004)TheENCODE(ENCyclopediaOfDNAElements) andto improve graphical reporting ofthe results.Currently, as Project.Science,306,636–640. of July 2012, Grape is on version 1.6, and we provide regular ENCODEProjectConsortium(2011)Auser’sguidetotheencyclopediaofDNA upgrades. elements(ENCODE).PLoSBiol.,9,e1001046. Gentleman,R.C.etal.(2004)Bioconductor:opensoftwaredevelopmentforcom- We have so far used Grape successfully for the in-house putationalbiologyandbioinformatics.GenomeBiol.,5,R80. analysis of more than a thousand large RNA-Seq datasets Goncalves,A.etal. (2011)A pipeline forRNA-seq data processingand quality [from projects such as ENCODE (ENCODE Consortium, assessment.Bioinformatics,27,867–869. 2004, 2011), Illumina Body Map (rnaseq.crg.es/project Guttman,M.etal.(2010)Abinitioreconstructionofcelltype-specifictranscrip- tomesinmouserevealstheconservedmulti-exonicstructureoflincRNAs.Nat. /ENCODE/tab/experiments/), GTEx (www.genome.gov/gtex), Biotechnol.,28,503–510. ICGC (Puente et al., 2011), Geuvadis (www.geuvadis.org), Homer,N.etal.(2009)BFAST:analignmenttoolforlargescalegenomeresequen- Quantomics(www.quantomics.eu)andothers]. cing.PLoSOne,4,e7767. Given the unique role of RNA as both a proxy and a deter- Johnson,J.M.etal.(2003)Genome-widesurveyofhumanalternativepre-mRNA splicingwithexonjunctionmicroarrays.Science,302,2141–2144. minant of the cellular and organism phenotype, profiling of Langmead,B.etal.(2009)Ultrafastandmemory-efficientalignmentofshortDNA RNA by RNA-Seq will spread beyond basic research, in medi- sequencestothehumangenome.GenomeBiol.,10,R25. cine,agriculture,biotechnologyandothertechnicalapplications Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. ofbiology.RNA-Seqcouldbeused,forinstance,forcontinuous Bioinformatics,25,2078–2079. Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with ambulatorymonitoringoftumorresponsetotreatment.Itcould Burrows-Wheelertransform.Bioinformatics,25,1754–1760. become a standard component of blood tests, a single assay Li,Q.etal.(2011)Measuringreproducibilityofhigh-throughputexperiments.Ann. monitoring many more variables than current biomarker Appl.Stat.,5,1752–1779. assays.SuchapplicationsofRNA-Seqrequire,however,analysis Lister,R.etal.(2008)Highlyintegratedsingle-baseresolutionmapsoftheepigen- omeinArabidopsis.Cell,133,523–536. tools(mapping,quantification,etc)thatareorderofmagnitude Marco-Sola,S.etal.(2012)TheGEMmapper:fast,accurateandversatilealignment moreefficientthancurrentlyexistingones.Theyrequire,inadd- byfiltration.Nat.Methods.,9,1185–1188. ition, robust, efficient and scalable software systems for Marguerat,S.etal.(2008)Next-generationsequencing:applicationsbeyondgen- RNA-Seq data storage, organization and analysis. The lack of omes.Biochem.Soc.Trans.,35,1091–1096. Marioni,J.etal.(2008)RNA-seq:anassessmentoftechnicalreproducibilityand such systems is often seriously limiting the utility of RNA-Seq comparisonwithgeneexpressionarrays.GenomeRes.,18,1509–1517. data, and may prevent researchers to embark in medium- to Montgomery,S.B. et al. (2010) Transcriptome genetics using second generation large-scale RNA-Seq projects—which are otherwise within sequencinginaCaucasianpopulation.Nature,464,773–777. 620 GrapeRNA-Seqanalysis Mortazavi,A.etal.(2008)Mappingandquantifyingmammaliantranscriptomesby Trapnell,C.etal.(2010)TranscriptassemblyandquantificationbyRNA-Seqre- RNA-Seq.Nat.Methods,5,621–628. vealsunannotatedtranscriptsandisoformswitchingduringcelldifferentiation. Nagalakshmi,U.etal.(2008)Thetranscriptionallandscapeoftheyeastgenome Nat.Biotechnol.,28,511–515. definedbyRNAsequencing.Science,1344–1349. Wang,E.T.etal.(2008)Alternativeisoformregulationinhumantissuetranscrip- Puente,X.S.etal.(2011)Whole-genomesequencingidentifiesrecurrentmutationsin tomes.Nature,456,470–476. chroniclymphocyticleukaemia.Nature,475,101–105. Wang,Z.etal.(2009)RNA-Seq:arevolutionarytoolfortranscriptomics.Nat.Rev. Sultan,M.etal.(2008)Aglobalviewofgeneactivityandalternativesplicingby Genet.,10,57–63. deepsequencingofthehumantranscriptome.Science,321,956–960. Yassour,M.etal.(2009)Abinitioconstructionofaeukaryotictranscriptomeby Trapnell,C.andSalzberg,S.(2009)Howtomapbillionsofshortreadsontogen- massively parallel mRNA sequencing. Proc. Natl Acad. Sci. USA, 106, omes.Nat.Biotechnol.,27,455–457. 3264–3269. 621