Flexible expressed region analysis for RNA-seq with derfinder Citation Collado-Torres, Leonardo, Abhinav Nellore, Alyssa C. Frazee, Christopher Wilks, Michael I. Love, Ben Langmead, Rafael A. Irizarry, Jeffrey T. Leek, and Andrew E. Jaffe. 2017. “Flexible expressed region analysis for RNA-seq with derfinder.” Nucleic Acids Research 45 (2): e9. doi:10.1093/nar/ gkw852. http://dx.doi.org/10.1093/nar/gkw852. Published Version doi:10.1093/nar/gkw852 Permanent link http://nrs.harvard.edu/urn-3:HUL.InstRepos:31731874 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA Share Your Story The Harvard community has made this article openly available. Please share how this access benefits you. Submit a story . Accessibility Publishedonline29September2016 NucleicAcidsResearch,2017,Vol.45,No.2 e9 doi:10.1093/nar/gkw852 Flexible expressed region analysis for RNA-seq with derfinder Leonardo Collado-Torres1,2,3, Abhinav Nellore1,2,4, Alyssa C. Frazee1,2, Christopher Wilks2,4, Michael I. Love5,6, Ben Langmead1,2,4, Rafael A. Irizarry5,6, Jeffrey T. Leek1,2,* and Andrew E. Jaffe1,2,3,7,* 1DepartmentofBiostatistics,JohnsHopkinsBloombergSchoolofPublicHealth,Baltimore,MD21205,USA,2Center forComputationalBiology,JohnsHopkinsUniversity,Baltimore,MD21205,USA,3LieberInstituteforBrain Development,JohnsHopkinsMedicalCampus,Baltimore,MD21205,USA,4DepartmentofComputerScience, JohnsHopkinsUniversity,Baltimore,MD21218,USA,5DepartmentofBiostatistics,HarvardT.H.ChanSchoolof PublicHealth,Boston,MA02115,USA,6Dana-FarberCancerInstitute,HarvardUniversity,Boston,MA02215,USA and7DepartmentofMentalHealth,JohnsHopkinsUniversity,Baltimore,MD21205,USA ReceivedMay16,2016;RevisedAugust25,2016;AcceptedSeptember15,2016 ABSTRACT aspowerfulasfeature-levelmethodswhentheanno- tationiscomplete. Differential expression analysis of RNA sequenc- derfinder analysis using expressed region-level ing (RNA-seq) data typically relies on reconstruct- and single base-level approaches provides a com- ingtranscriptsorcountingreadsthatoverlapknown promise between full transcript reconstruction and gene structures. We previously introduced an inter- feature-levelanalysis.Thepackageisavailablefrom mediate statistical approach called differentially ex- Bioconductor at www.bioconductor.org/packages/ pressed region (DER) finder that seeks to identify derfinder. contiguousregionsofthegenomeshowingdifferen- tialexpressionsignalatsinglebaseresolutionwith- out relying on existing annotation or potentially in- INTRODUCTION accuratetranscriptassembly. TheincreasedflexibilityofRNAsequencing(RNA-seq)has Wepresentthederfindersoftwarethatimproves madeitpossibletocharacterizethetranscriptomesofadi- our annotation-agnostic approach to RNA-seq anal- verse range of experimental systems, including human tis- ysis by: (i) implementing a computationally efficient sues (1–3), cell lines (4,5) and model organisms (6,7). The bump-hunting approach to identify DERs that per- goal of many experiments involves identifying differential mits genome-scale analyses in a large number of expression with respect to disease, development or treat- samples, (ii) introducing a flexible statistical mod- ment. In experiments using RNA-seq, RNA is sequenced eling framework, including multi-group and time- togenerateshort‘reads’(36–200+basepairs).Thesereads course analyses and (iii) introducing a new set of are aligned to a reference genome, and this alignment in- data visualizations for expressed region analysis. formation is used to quantify the transcriptional activity WeapplythisapproachtopublicRNA-seqdatafrom ofbothannotated(presentindatabaseslikeEnsembl)and noveltranscriptsandgenes. theGenotype-TissueExpression(GTEx)projectand Theabilitytoquantitativelymeasureexpressionlevelsin BrainSpan project to show that derfinder permits regionsnotpreviouslyannotatedingenedatabases,partic- the analysis of hundreds of samples at base reso- ularlyintissuesorcelltypesthataredifficulttoascertain, lution in R, identifies expression outside of known isonekeyadvantageofRNA-seqoverhybridization-based gene boundaries and can be used to visualize ex- assays like microarray technologies. As complicated tran- pressed regions at base-resolution. In simulations, scriptstructuresaredifficulttocompletelycharacterizeus- ourbaseresolutionapproachesenablediscoveryin ingshortreadsequencingtechnologies(8),themostmature thepresenceofincompleteannotationandisnearly statistical methods used for RNA-seq analysis rely on ex- isting annotation for defining regions of interest––such as genesorexons––andcountingreadsthatoverlapthosere- *Towhomcorrespondenceshouldbeaddressed.Tel:+14432876864;Fax:+14109551044;Email:[email protected] CorrespondencemayalsobeaddressedtoJeffreyT.Leek.Tel:+14109551166;Fax:+14109550958;Email:[email protected] (cid:2)C TheAuthor(s)2016.PublishedbyOxfordUniversityPressonbehalfofNucleicAcidsResearch. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionLicense(http://creativecommons.org/licenses/by-nc/4.0/),which permitsnon-commercialre-use,distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited.Forcommercialre-use,pleasecontact [email protected] e9 NucleicAcidsResearch,2017,Vol.45,No.2 PAGE2OF13 gions(9).Thesecountsarethenusedasmeasuresofgeneex- pressionabundancefordownstreamdifferentialexpression analysis (10–18). Unfortunately, the gene annotation may be incorrect or incomplete, which can affect downstream modeling of the number of reads that cross these defined features. We previously proposed an alternative statistical model forfindingdifferentiallyexpressedregions(DERs)thatfirst identifies regions that show differential expression signal and then annotates these regions using previously anno- tated genomic features (19). This analysis framework first proposed using coverage tracks (i.e. the number of reads alignedtoeachbaseinthegenome)toidentifydifferential expressionsignalateachindividualbaseandmergesadja- centbaseswithsimilarsignalintocandidateregions.How- ever,thesoftwareforourfirstversionwaslimitedtosmall samplesizes,theabilitytointerrogatetargetedgenomicloci andcomparisonsbetweenonlytwogroups. Here,weexpandtheDERfinderframeworktopermitthe analysisoflargersamplesizeswithmoreflexiblestatistical models across the genome. This paper introduces a com- prehensivesoftwarepackagecalledderfinderbuiltupon base-resolution analysis, which performs coverage calcu- lation, preprocessing, statistical modeling, region annota- tionanddatavisualization.Thissoftwarepermitsdifferen- tialexpressionanalysisatboththesinglebaselevel,result- ingindirectcalculationofDERs(20),andafeaturesum- marizationweintroduceherecall‘expressedregion’(ER)- level analysis. We show that ER analysis allows us to per- formbaseresolutionanalysisonlargerscaleRNA-seqdata setsusingtheBrainSpanproject(21)andGenotype-Tissue Expression (GTEx) project data (3) to demonstrate that derfinder can identify differential expression signal in regionsoutsideofknownannotationwithoutassembly.We usetheseDERstoillustratethepost-discoveryannotation capabilitiesofderfinderandlabeleachDERasexonic, Figure1. Anoverviewofthederfindersuite.Thederfindersoft- intronic,intergenicorsomecombinationofthoselabels.We warepackageincludesfunctionsforprocessingandnormalizingcoverage persample,performingstatisticalteststoidentifydifferentiallyexpressed show that some of these DERs we identify are outside of regions(DERs),labelingthoseregionswithknownannotationandvisu- annotatedproteincodingregionsandwouldnothavebeen alizingtheresultsacrossgroups. identifiedusinggeneorexoncountingapproaches. In the GTEx data, we identify DERs that differentiate heart (left ventricle), testis and liver tissues for eight sub- derfindercanbesubstantiallymorepowerfulthanmeth- jects. There are many potential reasons for this observed odsthatrelyonacompleteannotation. intronicexpressionincludingintronretention,background levelsofmis-transcriptionorincompleteprotein-codingan- notation.AsubsetofthesestrictlyintronicERsareassoci- MATERIALSANDMETHODS atedwithtissuedifferences,evenconditionalontheexpres- OverviewofRimplementation sionofthenearestannotatedprotein-codingregion.How- ever, we point out that intronic expression may be artifac- We chose to implement derfinder entirely in the R sta- tualandourpackagepermitsvisualizationanddiscoveryof tistical environment www.R-project.org/. Our software in- potentialexpressionartifactsnotpossiblewithotherpack- cludes upstream pre-processing of BAM and/or BigWig ages. files into base-resolution coverage. At this stage the user Finally, using simulated differentially expressed tran- canchoosetosummarizethebaseresolutioncoverageinto scripts, we demonstrate that when transcript annotation feature-levelcountsandapplypopularfeature-levelRNA- is correct, derfinder is nearly as powerful as exon- seqdifferentialexpressionanalysistoolslikeDESeq2(14), count based approaches with statistical tests performed edgeR-robust(13),limma(15,16)andvoom(17). by limma (16) (or DESeq2 (14), edgeR-robust (13)) and derfindercanbeusedtoidentifyregionsofdifferen- ballgown (22) after summarizing the information using tial expression agnostic to existing annotation (Figure 1). Rsubread(13)andStringTie(23),respectively.Finally, This can be done with either the expressed regions (ER)- we also demonstrate that when annotation is incomplete, levelor single base-level approaches, described in detail in the following subsection and Supplementary Section 2.1. PAGE3OF13 NucleicAcidsResearch,2017,Vol.45,No.2 e9 The resulting regions can then be visualized to identify Dataprocessingforresultsinmainmanuscript novelregionsandfilteroutpotentialartifacts. BrainSpandata. BigWigfilesforall487samplesacross16 After differential expression analysis, derfinder can brain regions were downloaded from the BrainSpan web- plot DERs using base-resolution coverage data by access- site(21).ThesamplesforHSB169.A1C,HSB168.V1Cand ing the raw reads within DERs for posthoc analysis like HSB168.DFCweredroppedduetoqualityissues.Basedon clustering and sensitivity analyses. We have also created exploratoryanalysesthecoveragewasassumedtobereads- a lightweight annotation function for quickly annotating per-million mapped reads in this data set. We set the cov- DERsbasedonexistingtranscriptomeannotation,includ- erage filter to 0.25 for both the single base-level and ER- ing the UCSC knownGene hg19, Ensembl p12 and Gen- levelderfinderapproaches.Sincethecoverageisalready codev19databasesaswellasnewerversions. adjustedtoreadspermillionmappedreads,wedidnotin- Vignettes with detailed instructions and examples are cludealibrarysizeadjustmentterminthesinglebase-level availablethroughtheBioconductorpagesforderfinder derfinder analysis (see Supplementary Section 2.1 for and derfinderPlot. The main functions for the ex- details on this adjustment term). The details for the sin- pressedregionandsinglebase-levelapproachesarefurther glebase-levelderfinderanalysisaredescribedfurtherin describedinSupplementarySection1.1. SupplementarySection2.2.FortheER-levelapproachwe onlyconsideredregionslongerthan5base-pairs. We sought to identify differences in expression across Expressedregionlevelanalysis brainregion(neocorticalregions:DFC,VFC,MFC,OFC, In the expressed region approach, we compute the mean M1C,S1C,IPC,A1C,STC,ITC,V1Candnon-neocortical coverage for all base pairs from all the samples and filter regions:HIP,AMY,STR,MDandCBC)anddevelopmen- out those below a user specified cutoff. Contiguous bases talstage(fetalversuspostnatal).Wethereforefitthefollow- passing this filtering step are then considered a candidate ingregion-by-stageinteractionalternativemodel,whichin- region(Figure2A).Thenforeachsample,wesumthebase- cluded main effects for fetal versus postnatal (binary) and levelcoverageforeachsuchregioninordertocreateanex- categoricalbrainregionvariable(15regionindicators,rel- pression matrix with one row per region and one column ative to A1C), and interaction terms for each brain re- persample.Thismatrixcanthenbeusedwithfeature-level gionanddevelopmentalstage.Thisresultedinatotalof32 RNA-seq differential expression analysis tools. The statis- termsinthemodel(intercept;16maineffects,15interaction ticalmodelusedforthedifferentialexpressionisageneral terms).InEquation(2),y isthescaledlog meanbase-level ij 2 F-statistic model as shown in Figure 1, Step 4 and Equa- coveragefortheexpressedregioniandsamplejasinEqua- tion (1). In Equation (1), yij is the scaled log2 mean base- tion(1);thatisyij=log2(meancoverageij+1).Themodelis levelcoveragefortheexpressedregioniandsamplej.The completedbyaninterceptterm(cid:2),aindicatorvariablefor i modeliscompletedbyaninterceptterm(cid:2)i,ngroupeffects fetalstatus(cid:3)i,mindicatorsvariables(cid:4) forthebrainregion (cid:3)i,madjustmentvariableeffects(cid:4)iandmeasurementerror andminteractionvariables(cid:5) betweenfetalstatusandbrain ε.TheF-statisticisderivedfromcomparingthisalternative region.Thetermε representsresidualerror. ij modelagainstanullmodelwithoutthe(cid:3) termsasinSup- i (cid:2)m plementaryMethods2.1.Themodelcanallowtime-course y =α +β Fetal + γ Region + analyses, two group comparisons or multi-group compar- ij i i j iq jq isons, thus making derfinder flexible. Examples of the q=1 lattercaseareshowninMethodsSections2.4.1and2.4.2. (cid:2)m (cid:2)n (cid:2)m ζiqFetalj∗ Regionjq+(cid:5)ij (2) yij =αi+ βipXjp+ γiqZjq+(cid:5)ij (1) q=1 p=1 q=1 Wecomparedtheabovemodeltoanintercept-onlymodel whereusingthelmFitfunctionfromlimma(15,16).The P-valuesfortheER-levelDERswereadjustedviatheBon- Annotationand‘genomicstate’objects ferronimethodandthosewithadjustedP-valueslessthan Wehaveimplementeda‘genomicstate’frameworktoeffi- 0.05weredeterminedtobesignificant.Wethencalculated cientlyannotateandsummarizeresultingregions,whichas- the mean coverage for each significant expressed region signseachbaseinthegenometoexactlyonestate:exonic, DERs in each sample, resulting in a mean coverage ma- intronicorintergenic,basedonanyexistingoruser-defined trix(DERsbysamples)andweperformedprincipalcompo- annotation(e.g.UCSC,Ensembl,Gencode).Ateachbase, nentanalysis(PCA)onthislog -transformedmatrix(after 2 weprioritizeexon>intron>unannotatedacrossallanno- addinganoffsetof1). tatedtranscripts. Once the DERs were identified, we identified which Overlappingexonsofdifferentlengthsbelongingtodif- of them overlap ENCODE blacklisted regions of the ferenttranscriptsarereducedintoasingle‘exonic’region, genome (4) using the file at http://hgdownload.cse.ucsc. while retaining merged transcript annotations. We have a edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/ secondimplementationthatfurtherdefinespromotersand wgEncodeDacMapabilityConsensusExcludable.bed.gz. dividesexonicregionsintocodinganduntranslatedregions For identifying which DERs overlap lincRNAs, we used (UTRs)thatmaybeusefulfortheusertomorespecifically EnsDb.Hsapiens.v75 (24), which can also be used for annotate regions – this implementation prioritizes coding a variety of transcript types. We then performed the gene exon>UTR>promoter>intron>unannotated. ontology(GO)analysisfortheDERsusingGOstats(25) e9 NucleicAcidsResearch,2017,Vol.45,No.2 PAGE4OF13 Figure2. Findingregionsviaexpressedregion-levelapproachonchromosome5withBrainSpandataset.(A)Meancoveragewithsegmentspassingthe meancutoff(0.25)markedasregions.(B)Rawcoveragecurvessuperimposedwiththecandidateregions.Coveragecurvesarecoloredbybrainregionand developmentalstage(NCX:Neocortex:Non-NCX:Non-neocortex,CBC:cerebellum,F:fetal,P:postnatal).(C)Knownexons(darkblue)andintrons (lightblue)bystrandforgenesandsubsequenttranscriptsinthelocus.TheDERsbestsupporttheGABRA6transcriptwitharedstar,indicatingthe presenceofadifferentiallyexpressedtranscript. using as background all genes that are within 5 kb of an For the conditional expression analysis, we found the ER. nearest exonic ER for each intronic ER using the dis- tanceToNearestfunctionfromGenomicRanges(27). ForeachintronicERwefittedtwolinearregressionmodels GTEx data. We selected samples from individuals that for the log -transformed coverage matrix (after adding an had data from heart (left ventricle), liver and testis tissues 2 offsetof1).Forthealternativemodelweusedascovariates withRNAIntegrityNumber(RIN)valuesgreaterthan7. two tissue indicator variables (Heart as the reference) and Eightsubjectsmatchedthiscriteriaandweselectedonly1 the coverage from the nearest strictly exonic ER as shown sampleiftheirtissuewasanalyzedmorethanonce,leaving in Equation (3) for ER i and sample j. For the null model uswith24samples.ThedatawerealignedusingRail-RNA weonlyusedthecoveragefromthenearestexonicER.We (26)version0.2.1withthecodeasdescribedatwww.github. calculatedanF-statisticusingtheanovafunctionthattests com/nellore/runs. We created a normalized mean BigWig whether(cid:3) or(cid:3) areequalto0andusedaBonferroniad- fileforthese4samplesadjustedforlibrarysizesof40mil- 1i 2i justedP-valuecutoffof0.05toidentifywhichintronicERs lion reads. We then identified the ERs using a cutoff of 5 haddifferentialexpressionadjustingforthecoverageatthe usingthefunctionrailMatrixfromderfinderversion nearestexonicER. 1.5.19. Foreachexpressedregiongreaterthan9bp,weassigned y =α +β Testis +β Liver + ij i 1i j 2i j itsannotationstatusbyusingagenomicstateobjectcreated γ ExonicCoverage +(cid:5) (3) withtheEnsemblGRCh38.p5database.Wethenperformed i j ij PCAonthelog -transformedmatrix(afteraddinganoffset 2 of1)separatelyforstrictlyexonicandstrictlyintronicERs. Simulated data. We simulated 100 bp paired-end reads Using limma (15,16) functionslmFit, ebayes we fit an (250bpfragments,sd=25)withpolyester(28)fortwo intercept-onlynullmodelandanalternativemodelwithco- groups with five samples per group from human chromo- efficientsfortissuedifferences.ForeachERwecalculateda some17withuniformerrorrateof0.005andreplicatedthis F-statisticanddeterminedwhetheritwasdifferentiallyex- processthreetimes.One-sixthofthetranscriptsweresetto pressedbytissueusingaBonferroniadjustedP-valuecutoff havehigherexpression(2x)ingroup2,asixthtohavelower of0.05. expressioningroup2(1/2x)andtheremainingtwo-thirds PAGE5OF13 NucleicAcidsResearch,2017,Vol.45,No.2 e9 to be equally expressed in both groups. Given a RNA-seq ineachsample.Alternatively,thesoftwareprovidesoptions experimentwith40millionpaired-endreads,assumingthat forcountingexonsorgenesforuseinmorestandardanal- alltranscriptsareequallyexpressedwewouldexpect1989 ysispipelines. 247ofthemtobefromchromosome17basedonthelength Next,derfindercanbeusedtoperformstatisticaltests ofallexonsusingtheknowntranscriptsUCSCknownGene on the region level expression matrix. These tests can be hg19 annotation. We used this information and the tran- carriedoutusinganystandardpackagefordifferentialex- scriptlengthtoassignthenumberofreadspertranscriptin pressionofRNA-seqdataincludingedgeR(10,12),DESeq chromosome17andgeneratedthenumberofreadswiththe (11),DESeq2(14)orlimma-voom(17). NB function from polyester with mean (cid:6) and size (see derfindercanthenbeusedtoannotatetheDERs.We thernbinomfunctionfromthestatspackage)equalto 1μ. have developed functions that label each region according 3 This resulted in an average of 2 073 682 paired-end reads towhetheritfallsentirelyinapreviouslyannotatedprotein persample.Foreachsimulationreplicate,paired-endreads codingexon(exonic),entirelyinsideapreviouslyannotated were aligned to the hg19 reference genome using HISAT intronic region (intronic) or outside of any previously an- version0.1.6-beta(29)andRail-RNAversion0.2.2b(26). notatedgene(intragenic).Thesoftwarealsowillreportany We created a GTF file using all known transcripts from regionthatoverlapsanycombinationofthosetypesofre- chromosome17aswellasonewith20%ofthetranscripts gions. missing(8.28%ofexonsmissing).UsingthesetwoGTFfiles Finally, data from an expressed region analysis can be we performed transcript quantification with StringTie visualized using different visualization approaches. While version 1.2.1 (23) as well as exon counting allowing mul- region-levelsummariescanbeplottedversusknownpheno- tiple overlaps with the featureCounts function from types,derfinderalsoprovidesfunctionstoplotbaseres- Rsubreadversion1.21.4(13).ERsweredeterminedwith olution coverage tracks for multiple samples, labeled with derfinderversion1.5.19functionsregionMatrixand coloraccordingtophenotype. railMatrix, respectively, from the HISAT BAM and Wenowprovidemoredetailoneachofthesesteps. Rail-RNABigWigoutputusingameancutoffof5forli- brariesadjustedto80millionsingle-endreads.Countmatri- FindingERs cesresultingfromfeatureCountsandderfinderwere analyzed with limma (16), DESeq2 (14) and edgeR-robust ThefirststepinaderfinderanalysisistoidentifyERs. (18) controlling the false discovery rate (FDR) at 5% and Readsshouldbealignedusinganysplicingawarealignment testing for differences between the two groups of samples. toolsuchasTopHat2(30),HISAT(29)orRail-RNA(26). We used ballgown version 2.2.0 (22) to perform differ- Baseresolutioncoverageinformationcanbereaddirectly entialexpressiontestsusingcoverageatthetranscriptand from the BAM files that are produced by most alignment exonlevels,controllingtheFDRat5%. software(26,29,30).Thisprocesscanbeparallelizedacross The3900transcriptsfromchromosome17arecomposed multiple cores to reduce computational time. An alterna- intotalby39338exons(15033unique).Toavoidambigu- tiveistoreadbigWig(31)coveragefiles.Recentalignment ous truth assignments, we used only the 3 868 that over- software such as Rail-RNA (26) produces these files di- lap only 1 transcript and assigned the truth status based rectly, or they can be created using samtools (32) or pro- on whether that transcript was set to have a high or low ducedusingthederfinderpackage.ReadingBigWigfiles expression on group 2 for the replication replicate under canproducesignificantcomputationalandmemoryadvan- evaluation.Weassessedthedifferentpipelinesbychecking tagesoverreadingfromBAMfiles. if these 3 868 exons overlapped at least one differentially Thecoverageinformationrepresentsthenumberofreads expressedunit:exons(featureCountsandballgown), thatcoverseachgenomicbaseineachsample.derfinder transcripts (ballgown) and ERs (derfinder), respec- first filters out bases that show low levels of expression tively.Wethencalculatedtheempiricalpower,falsediscov- across all samples. Since most genomic bases are not ex- eryrateandfalsepositiverate. pressed, this filtering step can reduce the number of bases that must be analyzed by up to 90%, reducing both CPU and memory usage. We originally proposed performing a RESULTS statisticaltestforeverybaseinthegenome(19)andthisap- proach is still supported by the derfinder package for Overviewofthederfinderpackage backwardscompatibility(SupplementarySection1.3). The derfinder package includes functions for several Here, we focus on a new approach based on the bump- stagesintheanalysisofdatafromanRNA-sequencingex- huntingmethodologyforregionlevelgenomicanalysis(33) periment(Figure1). (Figure 2). This approach first calculates ERs across the First,derfinderincludesfunctionsforpre-processing setofobservedsamples.Foreachbase,theaverage,poten- coveragedatafromBAMfilesorbigWigcoveragefiles.The tiallylibrarysize-adjusted,coverageiscalculatedacrossall base-levelcoveragedataformultiplesamplescanbeloaded samples in the data set. This generates a vector of (nor- andfilteredsincemostbaseswillshowzeroorverylowcov- malized) mean level expression measurements across the erage across most samples. Then, the software allows for genome.Thenanaverage-coveragecutoffisappliedtothis definition of contiguous regions that show average cover- meancoveragevectortoidentifybasesthatshowminimum age levels above a certain threshold. These ERs are non- levelsofexpression.Anexpressedregionisanycontiguous overlappingsubsetsofthegenomethatcanthenbecounted setofbasesthathasexpressionabovethemeanexpression toarriveatamatrixwithanexpressionvalueforeachregion cutoff. e9 NucleicAcidsResearch,2017,Vol.45,No.2 PAGE6OF13 Thenextstepistocountthenumberofreads(including AnnotatingDERs fractionsofreads)thatoverlapeachexpressedregion.Aswe The DERs can be annotated to their nearest gene or havepointedoutpreviously(19)thatcountingexpressionin known feature using bumphunter (33). The basic ap- genesandexonsiscomplicatedbyoverlappingannotation. proach is to overlap DERs genomic coordinates with the ERsarenon-overlapping,sothismeansthateachreadcan genomic coordinates of known genomic features. By de- beunambiguouslyassignedtotheappropriateregion. fault, derfinder labels each identified region as exonic, intronic,intragenicorsomecombinationofthosethreela- bels. ERlevelstatisticaltests Aregionmayoverlapmultiplegenomicfeatures(sayan The result of the ER step is a coverage matrix with each exonandtheadjacentintron).Usingthisinformation,can- rowcorrespondingtooneERandeachcolumncorrespond- didateDERscanfurtherbecomparedtoknowngeneanno- ingtoonesample.Thiscountmatrixcanthenbeanalyzed tation tables (Methods Section 2.3) to identify potentially using statistical models that have been developed for gene novel transcription events. Using this information, visual- orexoncountssuchaslimma(15,16),voom(17),edgeR- izationsofspecificlociforoverlapwithannotationcanbe robust (18) and DESeq2 (14). We emphasize that unlike madewithderfinderPlot.Theregionscanbeexported other feature-level counting approaches, our approach is toCSVfilesorotherfileformatsforfollow-upanddown- annotation-agnostic:ERsaredefinedempiricallyusingthe streamanalyses.Wehavealsodevelopedacomplementary observed sample data and coverage threshold. So if there R package for creating reproducible reports incorporating issufficientexpressioninaregionoutsideofpreviouslyan- the annotation and visualization steps of the derfinder notated genes, it will be quantified and analyzed with our pipelinecalledregionReport(38). approach. Application: large-scale expression analysis at base resolu- VisualizingDERs tion After statistical modeling, derfinder produces a set of We used derfinder to detect regions that were differen- DERswithsummarystatisticsperregion.Theyarestored tiallyexpressedacrossthelifespaninthehumanbrain.We as a GRanges object (27) and can be visualized with a applied derfinder to the BrainSpan RNA-seq coverage range of packages from the Bioconductor suite. We have data (Methods Section 2.4.1), a publicly available data set also developed several visualization tools specific to the consisting of 484 postmortem samples across 16 brain re- derfinderapproach. gionsfrom40uniqueindividualsthatcollectivelyspanthe Theseplotscanbemadeatdifferentlevelsofsummariza- fullcourseofhumanbraindevelopment(21).Weusedthe tion. First, the derfinder and derfinderPlot pack- expressedregionapproachdescribedaboveforthisanalysis. agesprovidearangeofvisualizationsofcoveragetracksat Forcomparisonweappliedthesingle-basedresolutionap- single base resolution. These plots can be used to identify proachpreviouslyutilizedonindependentdorsolateralpre- coveragepatternsthatmaydivergefromannotatedprotein- frontalcortexRNA-seqdata(20)(SupplementarySection coding regions. For example, using the GTEx example we 1.4). canvisualizegenesthathaveconsistentlyhighintronicex- We identified 174 610 ERs across the 484 samples with pressionasshowninFigure3.Weshowseveralexamplesof mean across-sample normalized coverage > 0.25, which genesknowntobefunctionallyimportantinheart––LBD3 constituted 34.57 megabases of expressed sequence. The and MYOZ2 (Figure 3A and B) (34,35) and liver––HGD majority(81.7%)oftheseERswerelabeledasstrictlyexonic and UPB1 (Figure 3C and D) (36,37). The coverage pro- whileonlyasmallsubset(5.4%)werestrictlynon-exonicby files can provide additional insight into transcription, and Ensembl annotation. These ERs largely distinguished the wellaspotentialtechnicalartifacts,beyondthelevelofan- fetalandpostnatalsamplesusingPCA–thefirstprincipal notated genes, exons and transcripts, which we include in component explained 40.6% of the variance of the mean ourbase-resolutionplots. coverage levels and separated these developmental stages DERs can be grouped into larger regions by distance, acrossallbrainregions.Thisseparationwasconsistentre- whichcanbeusefultoidentifypotentiallysystematicarti- gardlessoftheannotationstatusoftheDERsincludingin facts such as coverage dips (Figure 4), perhaps due to se- the strictly intronic regions (Figure 5 and Supplementary quencecomposition.Visualizingthebase-levelcoveragefor FigureS1).Theseparationbetweenbrainregionsinintronic asetofnearbycandidateDERscanrevealpatternsthatex- regionsmaybeduetonoisyorincorrectsplicing(39)ormay plain why one DER is sometimes fragmented into two or beduetomissingannotation(19)ormistakensequencingof more shorter DERs. Coverage dips (Figure 4), spikes and pre-mRNA.Thebaseresolutionvisualizationsavailableas data quality in general can affect the borders of the can- partofderfinderandderfinderPlotmakeitpossi- didate DERs. Some artifacts can be discarded, like candi- bletoexploretodetermineifitisbiologyorartifactsdriving dateDERsinsiderepetitiveregions.Base-pairsinsiderepet- theseexpressiondifferences. itiveregionsavailableinrepeatmaskertrackscanbeflagged The PCA plots also appear to show patterns consistent andfilteredoutfromtheanalysis.Otherknownpotentially withpotentialartifactssuchasbatcheffects(40)(Figure5). problematicregionsofthegenome,likethosewithextreme Regardless,thenewERapproachwepresenthereprovides GC content or mappability issues can also be filtered out, optionsforanalystswhowishtodiscoverpatternsofexpres- eitherbeforeidentifyingcandidateDERsorpost-hoc. sionoutsideofknownannotationonhundredsofsamples PAGE7OF13 NucleicAcidsResearch,2017,Vol.45,No.2 e9 Figure3. CoverageplotsfortheaveragecoveragelevelsfortheGTExexample.Averagecoverageprofileforheart(blue),liver(red)andtestis(green)from theGTExexampleneargenes:(A)LDB3,(B)MYOZ2,(C)HGDand(D)UPB1. 129 278 ERs (74%) were differentially expressed by brain regionand/ordevelopmentalstageattheER-levelcontrol- ling the family-wise error rate (FWER) at < 5% via Bon- ferronicorrection.WecontrolledtheFWERinsteadofthe FDR due to the expected large effects between the devel- opmental stages and/or brain regions. The 129 278 ER- levelDERsoverlappedatotalof17525Ensemblgenes(13 016withgenesymbols),representingalargeportionofthe knowntranscriptome.OfthesignificantER-levelDERs,93 355 (72.2%) overlapped at least 1 significant single base- level DER (Supplementary Section 1.4). Lack of overlap resultsfromalmosthalf(45.2%)ofsinglebase-levelDERs having an average coverage lower than the expression cut- offdeterminingERs(0.25).Forexample,therewashighex- pression only in the samples from a few brain regions, or onlyonedevelopmentperiod. Decreasing the cutoff that defines the ERs from 0.25 to Figure4. Exampleofacoveragedip.Meancoveragepergroupforthe 0.1resultsinalargernumberofregions(217085)thathavea BrainSpandatasetforaregionthatresultsintwoDERsforasingleexon higherproportionofnon-exonicsequence(12.1%),suggest- due to a coverage dip. The genome segment shown corresponds to the ing that the choice of this expression cutoff requires some DERsclusterranked15thintermsofoverallsignalbythesinglebase-level approachappliedtotheBrainSpandataset. initial exploratory data analysis as shown in Supplemen- tary Section1.5. Increasing the cutoffreducesthe number ofERs(SupplementaryFigureS4A)andtheirlengths(Sup- plementaryFigureS4B).Withincreasingcutoffs,thefrac- –ananalysisofthisscopeandscalewasunfeasiblewithear- tionofknownexonspresentintheERsisreduced(Supple- lierversionsofoursinglebaseresolutionsoftware(19). mentary Figure S4C) while increasing the percent of ERs Usingstatisticalmodelswhereexpressionlevelswereas- thatoverlapknownexons(SupplementaryFigureS4D).We sociated with developmental stage (fetal versus postnatal) and/orbrainregion(MethodsSection2.4.1),wefoundthat recommendusingacutoffthatbalancesthesefactors(Sup- e9 NucleicAcidsResearch,2017,Vol.45,No.2 PAGE8OF13 Figure5. PrincipalcomponentsanalysisrevealsclustersofsamplesintheBrainSpandataset.(Left)Firsttwoprincipalcomponents(PCs)withsamples coloredbysampletype(F:FetalorP:Postnatal)andshapegivenbybrainregionusingonlythestrictlyintronicERs.AnalysisofothersubsetsofERs producesimilarresults(SupplementaryFigureS1).(Right)BoxplotsforPCs1and2bybrainregion(NCX:neocortex,HIP:hippocampus,AMY:amyg- dala,STR:striatum,MD:thalamus,CBC:cerebellum)andsampletypewithnon-neocortexbraindecomposedintoitsspecificregions.Usingthesingle base-levelapproach(SupplementaryFigureS2)producessimilarresultsasshowninSupplementaryFigureS3. plementaryFigureS4),suchas0.25inthisparticulardata value 1.4e-12) among other terms associated to neuronal set. development. We highlight the utility of the ER-level analysis (using the original 0.25 cutoff) to identify regions differentially expressed within subsets of the data by analyzing brain IdentificationofERsthatdifferentiatetissuesusingasubset regions within a single developmental period. We identi- oftheGTExdata fied1170ERsthatweredifferentiallyexpressedcomparing striatumversushippocampussamplesinthefetaldevelop- WeselectedasubsetofsubjectsfromtheGTExproject(3) mental stage. These DERs mapped to 293 unique genes. thathadRNA-seqdatafromheart(leftventricle),liverand GenesmorehighlyexpressedinthestriatumincludeARPP- testis, specifically the eight subjects with samples that had 21, previously shown to localize in the basal ganglia (41), RINsgreater7,givenRIN’simpactontranscriptquantifi- anddopaminereceptorgenesDRD1andDRD2(42).Genes cation (46). Using only one sequencing library from each morehighlyexpressedinthehippocampusinfetallifewere subject aligned with Rail-RNA (26), we applied the ER- strongly enriched for neurodevelopmental genes including level derfinder approach with a cutoff of 5 normalized FZD7 (43), ZBTB18 (44) and NEUROD1 (45). The ER- reads(afternormalizingcoveragetolibrariesof40million level analysis therefore permits subgroup analysis without reads).Wefoundatotalof163674ERswithlengthsgreater the need to rerun the full derfinder single base-level than 9 base-pairs. Figure 6A shows that 118 795 (72.6%) pipeline – another improvement over previous versions of of the ERs only overlapped known exonic regions of the singlebaseresolutionanalysissoftware(19). genomeusingtheEnsemblGRCh38.p5database(47). DERsarenon-standardinthesensethattheydon’tnec- WeperformedPCAonthelog2adjustedcoveragematrix essarily match with known exons. Depending on the ap- usingjustthe118795strictlyexonicERs(Figure6B).Here, plication, you might be interested in filtering out DERs thefirsttwoPCsexplain56.8%and21.6%ofthevariance, that overlap problematic regions of the genome. This can respectively,andshowthreedistinctclustersofsamplesthat be done prior to defining the ERs or once the candidate correspondtothetissueofthesample.Wefoundthatthe16 DERs have been identified. In the BrainSpan application, 985 (10.4%) ERs (Figure 6A) that only overlap annotated only0.086%ofthe129278DERsoverlapENCODEblack- introns can also differentiate tissues using PCA, as shown listed regions (4) and 1.58% overlap lincRNAs. Similarly in Figure 6C. The total percent of variance explained by onecancheckiftheDERsoverlapotherknownfeaturesof the first two principal components is slightly lower (44.4 interest.ThegenesoverlappedbytheDERsareenrichedfor + 26.6% = 71% versus 56.8 + 21.6% = 78.4%) when us- GO terms such as neuron differentiation (GO:0030182, P- ingonlythestrictlyintronicERsversusthestrictlyexonic value4.13e-15),neurogenesis(GO:0022008,P-value4.62e- ERs.Thismayrepresentadifferentbiologicalsignaland/or 14) and neuron projection development (GO:0031175, P- potentiallynoisysplicing(asinFigure3B),butweusethis exampletoillustratethepotentialtousederfindertoex- ploreregionsoutsideofknownannotation. PAGE9OF13 NucleicAcidsResearch,2017,Vol.45,No.2 e9 Figure6. GTExERsanalysisusing24samplesfromtheheart(leftventricle),liverandtestisfor8subjects.(A)ERs(longerthan9bp)overlappingknown annotationbasedonGRGh38.p5(hg38).72.6%oftheERsonlyoverlapknownexons(strictlyexonic)while10.4%onlyoverlapknownintrons(strictly intronic).(B)FirsttwoPCswithsamplescoloredbysampletype(red:liver,blue:heart,green:testis)usingonlythestrictlyexonicERs.(C)FirsttwoPCs withsamplescoloredbysampletypeusingonlythestrictlyintronicERs.Thesignchangeofthesecondprincipalcomponentissimplyarotationandthe resultsareconsistentbetweenthestrictlyexonicandstrictlyintronicERs. Using limma (15,16) to test for differential expression 1 at fold changes of 2x and 1x, respectively. Reads were 2 betweentissues(SupplementaryMethodsSection2.4.2)we simulated from chromosome 17 using polyester (28) foundthat42880(36.1%)ofthestrictlyexonicERsand4 withthetotalnumberofreadsmatchingtheexpectednum- 401 (25.9%) of the strictly intronic ERs were differentially bergivenpaired-endlibrarywith40millionreads(Supple- expressed(FWERof5%viaBonferronicorrection).Over- mentary Methods Section 2.4.3). We used HISAT (29) to all59776(36.5%)oftheERsweredifferentiallyexpressed align the simulated reads and summarized them using ei- betweentissues.Giventhesimilarglobalpatternsofexpres- ther featureCounts from the Rsubread package (13) sionbetweenannotatedandunannotatedERs,weconsid- or StringTie (23) and performed the statistical tests on eredthescenariothatthestrictlyintronicERswerediffer- the resulting coverage matrices using limma and ball- entially expressed between tissues in the same pattern as gown,(22)respectively.Weperformedtheballgownsta- thenearestexonicERsduetopossiblerun-offtranscription tistical test at the exon-level as well as the transcript-level. events.Toassessthisscenariowefittedaconditionalregres- Weperformedthefeature-levelanalysesusingthecomplete sionforeachstrictlyintronicERadjustingforthecoverage annotation and with an annotation set missing 20% ran- oftheneareststrictlyexonicER.Atotalof749(4.4%)ofthe domly selected transcripts (8.28% unique exons missing). strictlyintronicERsdifferentiatetissueswhileadjustingfor We then used derfinder to find the ERs from the same the coverage at the nearest exonic ER at a FWER of 5%. HISATalignmentsaswellasfromRail-RNA(26)output Figure7AandBshowsanexamplewheretheexpressionis andperformedthestatisticaltestwithlimma.Forallsta- similarbetweentissuesinthenearestexonicERbutthereis tisticaltests,wecontrolledtheFDRat5%andrepeatedthe acleartissuedifferenceintheintronicERwithtestishaving simulationthreetimes. higherexpressionthantheothertwotissues.Figure7Cand Table1showstherangeoftheempiricalpower,falseposi- Dshowsdifferentpatternsbetweentheintronicandexonic tiverate(FPR)andFDRforallthesemethodsbasedonthe ERswhereintheexonicERtheexpressionislowestinthe threesimulationreplicates.derfinder’sexpressedregion heart,higherinliverandslightlyhigheratthetestis.How- approach resulted in overlapping empirical power ranges ever,intheintronicER,liveristhetissuethathasthelowest to the exon-level methods that are supplied the complete expression.Theseresultssuggestthatexpressionatunanno- annotation. The exon-level methods had a 18 to 27% loss tatedsequencecouldhavebiologicalrelevancebeyondlocal in power when using the incomplete annotation set com- annotatedexonicsequence. pared to the complete set even though only 8.28% of the uniqueexonsweremissing.derfinder,beingannotation- agnostic,doesnotrelyonhavingthecompleteannotation Simulationresults but did show increased FPR and FDR compared to the Welastlyperformedasimulationstudytoevaluatethesta- exon-levelmethods.Werecommendperformingsensitivity tistical properties of derfinder with and without com- analyses of the cutoff parameter used for defining ERs or pleteannotation.Tocomparederfinderagainstfeature- the FDR control in the statistical method used to deter- levelalternatives,wesimulatedreadsfor2groups,10sam- mine which ERs are differentially expressed (i.e. DERs). ples in total (5 per group) with 1 of the transcripts hav- Transcript-levelanalyseshadthelowestFPRandFDRbut 6 also the lowest power. Note that we only performed tran- inghigherand 1 lowerexpressioningroup2versusgroup 6
Description: