Table Of Content

Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 403497, 11 pages http://dx.doi.org/10.1155/2015/403497 Research Article DNAseq Workflow in a Diagnostic Context and an Example of a User Friendly Implementation BeatWolf,1,2PierreKuonen,1ThomasDandekar,2andDavidAtlan3 1UniversityofAppliedSciencesandArtsofWesternSwitzerland,Perolles80,1700Fribourg,Switzerland 2UniversityofWu¨rzburg,AmHubland,97074Wu¨rzburg,Germany 3PhenosystemsSA,137RuedeTubize,1440BraineleChateau,Belgium CorrespondenceshouldbeaddressedtoBeatWolf;[email protected] Received13March2015;Revised10May2015;Accepted18May2015 AcademicEditor:HongLu Copyright©2015BeatWolfetal.ThisisanopenaccessarticledistributedundertheCreativeCommonsAttributionLicense,which permitsunrestricteduse,distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited. Overrecentyearsnextgenerationsequencing(NGS)technologiesevolvedfromcostlytoolsusedbyveryfew,toamuchmore accessible and economically viable technology. Through this recently gained popularity, its use-cases expanded from research environmentsintoclinicalsettings.Butthetechnicalknow-howandinfrastructurerequiredtoanalyzethedataremainanobstacle for a wider adoption of this technology, especially in smaller laboratories. We present GensearchNGS, a commercial DNAseq softwaresuitedistributedbyPhenosystemsSA.ThefocusofGensearchNGSistheoptimalusageofalreadyexistinginfrastructure, whilekeepingitsusesimple.Thisisachievedthroughtheintegrationofexistingtoolsinacomprehensivesoftwareenvironment,as wellascustomalgorithmsdevelopedwiththerestrictionsoflimitedinfrastructuresinmind.Thisincludesthepossibilitytoconnect multiplecomputerstospeedupcomputingintensivepartsoftheanalysissuchassequencealignments.WepresentatypicalDNAseq workflowforNGSdataanalysisandtheapproachGensearchNGStakestoimplementit.Thepresentedworkflowgoesfromraw dataqualitycontroltothefinalvariantreport.Thisincludesfeaturessuchasgenepanelsandtheintegrationofonlinedatabases, likeEnsemblforannotationsorCafeVariomeforvariantsharing. 1.Introduction the increasing number of documented variations in well- known disease related genes, this type of analysis became Nextgenerationsequencing(NGS)technologiessawarapid verycommon.Increasingly,othertypesofanalysesarealso progressionoverrecentyearsandimprovedinmanyaspects used in diagnostics, such as the detection of structural since their introduction. The data quality, the length of the variations,whichbecamepossiblewiththeincreasedquality sequences, and the speed at which the data is generated of sequencing data. The sequencing data, which can come improved massively all while decreasing the costs associ- from various sequencing technologies, can be either tar- ated with the technology [1]. Cost reductions and quality geted, such as targeted gene panel sequencing and whole improvements now allow the technology to be used in exomesequencing(WES),ornontargetedlikewholegenome a diagnostic setting, replacing older technologies such as sequencing (WGS). Those different approaches to sequence Sangersequencing[2].Anincreasingnumberoflaboratories the patient’s genome produce a varying amount of data [3] are either considering or already implementing NGS for andaresuitedfordifferenttypesofanalyses.Thecomplexity routinediagnosticsprocedures.Nextgenerationsequencing oftheanalysisincreaseswiththeamountofdatatoprocess, has many use-cases in diagnostics and, depending on the bothfromacomputationalpointofviewandfromahuman analysis, different types of information are searched for in resources point of view. The general tendency today is to the sequencing data. A very common type of analysis is moveaway fromtargetedsequencingtowardsWGS,asthis thesearchforSNPs(single-nucleotidepolymorphisms)and removes the need for targeted amplicon libraries, as well as smallindels(insertionsanddeletions)inpatientDNA.With makingiteasiertoreanalyseasampleatalaterdatewithout 2 BioMedResearchInternational the need to resequence it. This tendency to increase the positiononthereferencegenome.Whatthemostfittingposi- amount of sequenced data creates increasing challenges for tionwilldependonthealgorithmandparametersused.This smallerdiagnosticslaboratories.Theyoftendonothavethe process is specially complicated when aligning sequences requiredcomputinginfrastructureortechnicalknow-howto fromrepeatedregionsorcontaininginsertionsanddeletions. handle this type of data. This is especially true in emerging The alignment process can be done by using one of many marketswhichdonotalwayshavetherequiredfinancialand availabletools[6].Manyofthosetoolsareactivelydeveloped humanresourcesaswellastheextensiveknow-howrequired andpossesstheirindividualqualities,dependingontheuse- tosetuptheirownanalysispipeline. case and sequence data to be aligned. Some of the more Today many types of NGS data analysis exist, all of commonlyused tools for sequence alignment are BWA [7], themansweringdifferentsortsofquestionsaboutasample. Bowtie 2 [8], or BLAST [9]. This list is not exhaustive and One example being copy number variation (CNV) analy- furtherinformationonthesubjectcanbefoundin[6]. sis suited to replace traditional MLPA (multiplex ligation- dependent probe amplification) analysis. Another one is de (C)Realignment.Inadditiontothealignment,toolsexistto novo sequence alignment to detect large changes in the improve the quality of an existing alignment, for example, genome, for example, in cancer patients. But ultimately the with local indel realignment. Tools like GATK [10] and the detection of SNPs and small indels in a sample based on CompleteGenomicsrealigner[11]providethisfunctionality. a standard reference sequence remains the most prevalent This and the previous Alignment analysis step are some of typeofanalysis.Thisiswhy,inthecontextofthispaper,we the heaviest in terms of computing power and require a limitourstudytothistypeofanalysis:thedetectionofSNPs good computing infrastructure to be performed in a short and small indels in targeted or nontargeted resequencing. timeframe.Thisproblemisaccentuatedwiththeincreasing Althoughthistypeofanalysiscanbeapproachedinvarious amount of produced raw data and expected qualities of the ways,itcanbedescribedasasetofgenericstepsthatremain alignment,notablyforindels. common to most approaches. The next chapter describes thecommonlyusedstepsintheanalysisprocess,providing (D) QC. Another round of quality control is usually per- examplesofexistingtoolsthatcanbeusedtoperformthem. formedafterthefinalalignmenthasbeencreated.Itisused to verify that the sample has been sequenced correctly, in NGS Data Analysis. This chapter describes a common NGS particularintheregionsofinterest,likethegenesassociated data analysis workflow used in diagnostics. Figure 1 shows with the patients phenotype. Several tools allow the extrac- an overview of the described workflow, using a Unified tionoftheinformationneededforthisstep,suchassamtools ModelingLanguage(UML)diagram. [12] or bedtools 2 [13]. They allow the user to get metrics suchasthenumberofalignedreadsandthecoverageofthe regionsofinterestoutofthealignmentfile.Ifthequalityof (A)QC/Filtering.Theanalysisstartswiththequalitycontrol thealignmentisnotsufficientatthisstage,itcanbedecidedto of the raw sequencing data to determine if the sequencing resequencethesample.Thisstepofqualitycontroliscritical processadherestothedefinedqualityrequirements.Different in a diagnostic setting where it is necessary to show how metrics, such as the amount of reads sequenced, the length well the regions of interest for a specific sample have been distribution of the sequenced reads, and their quality, are covered.Itisespeciallyimportantforcaseswherenovariants verifiedandcomparedtotheexpectedvaluesofthespecific of interest can be found in the specified zones and thus a sequencer technology that has been used. If the quality of negativefindingsreportneedstobejustified. thesequenceddatadoesnotmatchtherequiredquality,the sampleisusuallyresequenced.Toverifythequalityoftheraw (E)VariantCalling.Aftertherawsequencingdatahasbeen data,onecommonlyusedtoolisFastQC[4],whichgenerates alignedanditsqualityverified,SNPsandindelscanbecalled. a comprehensive list of statistics of the raw data. During Theprocessofvariantcallingcomparesthealignedsequences this initial quality control stage, the raw sequencing data is tothereferencesequenceinordertocreatealistofvariants oftenfiltered,discardingreadsthatdonotmatchthedefined present in the sample. This list can be filtered on different quality criteria. This raw data filtering is commonly done criteria,suchasminimumcoverageatthevariantpositionor withtheFASTXtoolkit[5],whichallowsremovingreadsthat thefrequencyofthevariantalleleinthesample.Commonly donotmatchtherequiredquality,byremovinglowquality usedtoolsareGATK[10],samtools[12],andVarscan2[14]. readsorreadsthataretooshort.Simultaneouslytofilterthe Differentvariantcallersfollowdifferentunderlyingmodels, raw data based on its quality, it is increasingly common to resulting in differences in the called variants [15]. Variant preprocessthedata,forexample,toremovebarcodesattached callingusuallyyieldsalargeamountofvariants,evenwhen tothesequencedreads.Barcodesareaddedtothesequenced restrictingthecallingtoonlyoutputvariantswithaminimum reads when multiple samples are sequenced together and qualityandprobabilitynottobeasequencingartifact. needtobeseparatedagainbeforetheanalysis.Thissplitting basedonbarcodescanalsobedoneusingatoollikeFASTX. (F) Annotation. This is why in the next step the list of the variantsisannotatedwithmoreinformationthatispertinent (B) Alignment. The verified raw sequencing data is then to the analysis being done on the sample. This includes aligned against the human reference genome. During this the detection of variants present in public databases, such process every sequenced read is placed at the most fitting as dbSNP [16], or the information about which genes are BioMedResearchInternational 3 Data preparation Data analysis Validation and reporting Samtools, (D) QC bedtools 2 FastQC, (A) QC/filtering FASTX GATK, (E) Variant calling samtools, (I) Sanger validation Varscan2 BWA, Bowtie2, dbSNP, BLAST, (B) Alignment Phenomizer and so forth (F) Annotation VEP, (J) Report generation and so forth Samtools GATK, (G) Variant filtering complete genomics (C) Realignment realigner (H) Visualization IGV, Tablet, genome view UCSC browser, and so forth Figure1:UMLactivitydiagramofaNGSdiagnosticsworkflowseparatedintodifferentsteps.Acollectionoftheavailabletoolsforevery stepismentioned. affected and their predicted effects on those genes. Various (I) Sanger Validation. Before submitting the final findings tools exist to annotate variants, either locally or through a to the clinician, it is still common practice to validate webservice.SomeofthemorepopularonesaretheVariant the identified variants through Sanger sequencing. While Effect Predictor (VEP) from Ensembl [17] and Phenomizer future advances in NGS technologies might make this step [18] that add clinically relevant information to the detected redundant,itisstillconsideredthegoldstandardforvariant variants.Oncetheannotatedlistofvariantshasbeencreated, validation. thephenotyperelevantvarianthastobeidentified. (J)ReportGeneration.Aftervisuallyinspectingandvalidating (G) Variant Filtering. Analysis specific filters allow the user the variants, the identified variants or the lack thereof are to filter the number of variants down to a smaller number. documentedinafinalreportthatissentofftotheclinician One example is to only keep variants affecting genes with toallowhimtotakedecisionsonthecase. phenotypesmatchingtheonesofthepatientandthathavea The described process can be done by taking several predictedconsequenceonthegenefunction.Theexacttype approaches. A common approach is to use a selection of of filters to reduce the amount of candidate variants varies the previously described tools to perform the individual from case to case, depending on the available information. analysis steps. The different tools are usually connected The filtering is often done using bcftools, which is part together through a series of automated scripts, performing of the samtools project [12] or vcftools [19]. If during this the required analysis. The creation of the resulting pipeline step no clinically relevant variant related to the current is a complex technical task, as it requires insight in the analysis can be found, depending on the analysis protocol, working of the different tools and the ability to chain them the filters may be relaxed. This allows the user to include, together.Thisisanontrivialtask,asnotalltoolsfollowthe for example, variants on genes where the relation with the same standards in terms of data file formats and handling. currentphenotypeislesscertain,orvariantswithuncertain Nonetheless,thisisaverycommonapproachwhichresulted consequencesonaknowndiseasegene. in a variety of custom made NGS analysis pipelines, either privateorpublic[25]. (H)Visualization.Oncealistofcandidatevariantshasbeen Complete software packages like CLCBio [26], identified,itiscommontovisuallyverifytheminagenome NextGENe[27],ortheweb-basedGalaxy[28]alsoexist.They browser. The genome browser displays the sequenced reads allow the user to combine many of the mentioned analysis around the variant and allows the user to verify the quality tools,withouthavingtoworrytoomuchonhowtoconnect ofthedataaroundthatposition,allowingtheusertoidentify the different tools to perform the analysis. Those tools anddiscardvariantsoriginatingfromsequencingerrors.This are focused on giving the user the possibility to automate is done usingone of the variousgenome browsers installed the analysis process as much as possible, abstracting the locally such as IGV [20], Tablet [21], or Genome view [22]. complexity of the analysis. GensearchNGS belongs to this There also exist web-based genome browsers, such as the family of tools, with its own approach and focus on the UCSC[23]orEnsembl[24]genomebrowsers. issuesofsmallerlaboratories.Itfocuseslessonthepossibility 4 BioMedResearchInternational to generate a custom automated pipeline but more on an and favoring an interactive analysis approach instead of intuitive interface to handle the data-files to perform the a fully automated analysis pipeline. The user interface is analysisinapredefinedpipelineoptimisedfordiagnostics. focused on providingthe functionalityrequiredto perform The features of GensearchNGS are limited to those thepreviouslydescribedanalysis,withoutoverwhelmingthe neededtoperformthepresentedNGSdiagnosticsworkflow. userwithrarelyusedfeatures. Thismeanstheoverallfeaturesetissmallerthanthatofother complete software packages, especially CLCBio or Galaxy, 2.MaterialsandMethods which both offer a wide variety of functionalities. Having a limited feature set allows for a cleaner and more focused GensearchNGShasbeendevelopedasamultiplatformappli- user interface, one of the major goals of GensearchNGS, cation,runningonWindows,Linux,andOSX,usingJava6+. something that is highly appreciated by current users of Theapplicationisalocallyinstalledsoftwareandisnotweb GensearchNGS. based. In order to perform NGS analysis steps, Gensearch- Aswillbediscussedinthefollowingchapters,Gensearch- NGSincludesseveralexistingexternaltoolsthroughplugins. NGS uses either external applications through plugins or Thosetoolsaremainlyusedduringthesequencealignment customimplementationsofstandardalgorithmsforthedif- phase,givingtheuserthechoicebetweendifferentalignment ferentanalysissteps.Theresultinganalysisqualityistherefore algorithms which may have specific advantages related to comparable to the one observed when using those tools the data to be analysed. The external alignment algorithms separatelybyhandorthroughasoftwarepackagelikeGalaxy. supported through plugins are BWA [7] and Bowtie 1 + 2 [8] as well as Stampy [29]. GensearchNGS can either use locallyinstalledversionsofthoseexternaltoolsordownload Approach.Theissuesfacedbysmalllaboratorieswhenusing and update them automatically. If the tools are installed the previously mentioned tools, which are often limited by the user he can update them when desired; otherwise in terms of computing and human resources, are various. thesameupdatemechanismusedtoupdateGensearchNGS Based on our experience, the biggest issue faced by smaller itself is used to update the external tools. For sequencing laboratoriesisthelackoftechnicalknow-howanddedicated data,GensearchNGS supportsvariousinputandoutputfile bioinformaticians. The usage and configuration of many of formats. For raw sequencing data, the standard FASTQ thedescribeddataanalysistoolsrequireadedicatedemployee formataswellasFASTA,SFF,andFNAissupported.Allof whichnotavailableineverylaboratory.Thisisespeciallytrue themareconvertedtothestandardFASTQfileformatduring inemergingmarketswheremanymonetaryrestrictionsarein thedataimportphase.AsFASTQfilesrequireaqualityscore place.TheapproachofsolutionslikeGalaxyandCLCBioto foreverynucleotideinthesequencedreads,adefaultquality thisproblemistoabstracttheusageoftheunderlyinganalysis scoreisassumedforallthenucleotidesifthesourceformat, toolsandprovideawayfortheusertoautomatetheanalysis like FASTA, is missing this information. For aligned data, process.Whiletheabstractionoftheunderlyingtoolsreduces bothSAMandBAMfilesaresupported,whereasSAMfiles thetechnicalcomplexityoftheanalysisprocess,thisisoften are converted to BAM during import. BAM and SAM file notenoughtomaketheanalysisprocessmoreaccessible.The are accessed through the HTSJDK library. HTSJDK is an citedtoolscanbeusedformanyothertypesofanalysesthan open source Java library, developed as part of the samtools theoneslookedatinthispaper,whichleadstoamuchmore project[12].ItprovidesastandardizedAPItoaccessvarious complexuserinterfaceandoftenoverwhelmstheuserwith NGS data file formats. The library can be downloaded at manyoptions.Thecitedtoolsalsofocusmoreontheability http://samtools.github.io/htsjdk/. Annotations can be pro- to create automated analysis pipelines, leading to very little videdthroughtheBED,WIG,andGFFfilesandtheVCFfile interactivitybetweentheuserandthedata. formatisusedtoimportorexportvariants. Thesecondissueoftenencounteredisthelackofasuitable To benefit from the knowledge stored in various online computing infrastructure to perform the computationally databases, GensearchNGS has been connected to several of intensive parts of the analysis. Sequence alignment in par- them to annotate the NGS data on various levels, from the ticularisverydemandinginthatregard.Eventhoughmany genesandtheirtranscriptsdowntotheindividualvariants. laboratoriesoutsourcethesequencingandalignmentprocess, For most of the annotation data, GensearchNGS accesses the possibility to realign the raw sequencing data is often the Ensembl [24] web service through its biomart API, required.Toperformthedescribedanalysisstepsandmore, givingtheapplicationaccesstogeneandvariantinformation. GensearchNGS providesa complete software suite focusing Ensemblprovidesaunifiedwaytoaccessseveralotheronline onoptimizedalgorithmsandusageofdistributedcomputing databases,likedbSNP[16]orOMIM[30],throughasingle for optimal usage of the existing infrastructure. All the onlineservice.InadditiontoEnsembl,GensearchNGSinte- analysisstepsmakeusageofmultiprocessorsystemstospeed gratesotheronlineservicesforadditionaldataandfeatures: uptheanalysis,whereasonlysequencealignmentusingthe Cafe Variome [31] provides a standardized way to share customGensearchNGSalignmentalgorithmusesdistributed variants between various laboratories, UCSC [32] provides computing. The entire functionality is provided through accesstothehumanreferencesequence,andHPO[18]pro- an intuitive user interface to lower the technical know- videsinformationaboutphenotypesassociatedwithgenes.In how needed to perform NGS data analysis. GensearchNGS situations with a direct conflict in the information between provides its own way on how the analysis is approached thedifferentdatabases,wherepossible,theinformationwith by providing the user with a patient centric user interface the most severe clinical impact is displayed. An example of BioMedResearchInternational 5 thisiscontradictorySIFT[33]andPolyPhen2[34]scoresfor variants.Bydefaultthemostseverewillbeshowntotheuser, withthepossibilitytomanuallygetmoreinformationwhen requested. GensearchNGS also integrates the possibility to talk to other software installed by the user, such as Alamut [35], allowing the user to easily open a variant found by GensearchNGSinsidealocalAlamutinstallation. 3.Results This chapter discusses how GensearchNGS implements the differentanalysisstepsandhowittriestolowerasmuchas possiblethetechnicalknow-howandcomputinginfrastruc- Figure2:MaininterfaceofGensearchNGS,showingapatientwith turerequiredforNGSdataanalysis. itsvalidatedvariants. 3.1.SoftwareSuite. GensearchNGSprovidestheuserwiththe possibilitytocreatedifferentprojectstogrouptheanalysisof variouspatientsbasedontheirtypeortheirrelation.While ACMG variant classification guidelines [36], with the user many other NGS analysis tools provide the possibility to requestedcategoriesArtefact andFalsereferenceadded.The organizethedatainprojects,GensearchNGSputsthenotion savedvariantsareusedforanyfuturevariantdetectioninnew ofpatientsandtheirassociatedmetainformationinthecenter patients,immediatelytellingtheuserifavarianthasalready of the project management. Every project contains one or beenfoundinadifferentproject.TheadditionoftheArtefact morepatientandeachcanhaveseveralrawsequencingdata andFalsereferencecategoriesallowstheusertoquicklyfilter associated.Therawsequencingdatacaninturnhaveseveral outcommonsequencingerrorsinhisparticularsequencing alignments associated, for example, using different aligners setup.Thehierarchicalorganizationofthedataisintendedto oralignmentconfigurations.Projectspecificsettingssuchas guidetheuserthroughtheanalysisprocesswhilekeepingit specificregionsofintereststobeusedduringtheanalysiscan clear where the data comes from. An example of this is the alsobeconfigured.ThoseregionscanbedescribedbyBED sequencealignment.Itisdisplayedasachildelementofthe filesto,forexample,limittheanalysistoregionsrelevantto raw sequencing data, making the relation between the raw thetypeofanalysisperformedinthecurrentproject.Those dataandthealignmentimmediatelyobvious. BEDfilescaneitherbeprovidedbytheuserorautomatically GensearchNGS does not provide the user with a single generated based on a list of gene names. Projects also buttonwhichgoesfromtherawsequencingdatatothefinal associate the specific reference sequence to be used during report.Insteadoffullyautomatingthepipeline,Gensearch- the analysis. The user can automatically download them NGS leads the user from one processing step to the next. fromUCSC,whichprovidesreferencesequencesofthemost Before each step the user is given the choice to change commonhumanreferenceassemblies,hg18,hg19,andhg38. parametersorusethepreviouslyused(orstandard)parame- Theusageofcustomreferencessequencesisalsosupported. ters.Aftertheinitialimportoftherawsequencingdata,the The patient centric database integrated into GensearchNGS user no longer has to handle the data files. GensearchNGS allows the user to associate variants with patients, a feature performs the required data file handling and conversions very important in diagnostics. The variants associated with betweentheprocessingsteps. patients are shared between projects, allowing the user to While there is no fully automated way to go from the identify variants found in different projects on previously rawsequencingdatatothefinalvariantreport,thereisstill analysedpatients. awaytoautomateabigpartoftheanalysis.Whenimporting To perform the NGS data analysis, GensearchNGS uses rawsequencingdata,theuserhasthepossibilitytoperform thegenesdescribedinEnsemblasthegenemodel.Inapro- sequencealignmentandvariantcallingautomatically.Those cesstransparenttotheuser,thegenemodelcorrespondingto two additional analysis steps are performed using either thereferencesequencedefinedfortheanalysisisdownloaded previouslyusedconfigurationparametersorthedefaultones. automaticallyfromEnsembl.Additionalmetainformationis Only the final variant validation has then to be performed attached to the genes, such as the associated phenotypes by the user to create the final report. Furtherdetails of this as described by HPO [18] or OMIM [30]. Figure 2 shows processfollowinthenextchapter. the main interface of GensearchNGS, with the patient list ontheleftandtheinformationaboutthecurrentlyselected patient in the center. The central area of the image shows 3.2.DataAnalysis. Toperformtheinitialqualitycontrolof the information about the currently selected patient. This the raw sequencing data, GensearchNGS integrates its own includes his name, internal ID, associated phenotypes, and tools allowing the user to import the raw data available allvariantswhichhavemanuallybeenconfirmedbytheuser. in different file formats. The compatible data formats are Whenvalidatingvariantstheusercanclassifythemmanually describedinSection2.Thevariousformatsareconvertedin into different categories. The categories are based on the thestandardFASTQformatduringthedataimportstep.For 6 BioMedResearchInternational Figure3:Overviewofvariousstatisticsthattheusercaninspecttoverifythedataquality. the quality control of the raw sequencing data, the user is overrepresented K-mers at the start and the end of the presentedwithvariousstatisticsasseeninFigure3. sequencedreads. The statistics allow the user to judge the quality of the Thisrawdataimportprocessreplacesexistingtoolslike sequencing data. This includes statistics such as the length, FastQC to check the quality, FASTX to filter the data, and averagequality,andGCcontentdistributionofthesequenced the various utilities that transform different sequencing file reads, as well as the distribution of the sequenced bases formatsintothestandardFASTQfileformat.Transparently and their quality for every read position and detection of for the user, the custom implementations of those tools, BioMedResearchInternational 7 whichproducethesameresults,havebeenintegratedintoa be inspected, giving detailed coverage information of the singleprocessingstep.Thisincludesthepossibilitytoremove previously defined regions of interest. The collected quality adapter sequences and split the raw sequencing data into controlinformationcanbeexportedasareportforarchiving multiple files if they were produced through multiplexed purposesusingoneofvariousformatssuchasPDF,HTML, sequencing. Based on the reported quality information, the orPNG. usercanchoosetofiltertherawdatafollowingseveralfilters, WhileinitiallyGensearchNGSintegratedVarscan2[14] such as maximum and minimum length of the sequences asitsvariantcaller,areplacementwascreatedtoreducethe 󸀠 and their minimal average quality. He can also trim the 5 external dependencies and to improve the performance of 󸀠 and 3 parts of all reads or trim their excess length if they variant calling. A variant scanner based on the model of arelongerasauserdefinedlength.Thetrimmingisdonein Varscan2hasbeenimplementedandisintheprocessofbeing additiontotheadapterandprimerremoval.Thepossibilityto publishedasaseparatetoolcalledGNATY,availableforfree splittherawsequencingdatabasedonbarcodesintheread, fornoncommercialusage(http://gnaty.phenosystems.com/). allowingtheimportofrawdatafrommultiplesamplesthat Our custom variant caller shows important performance were sequenced at the same time, is also integrated in the advantagesovertheexistingVarscan2implementationofthe importprocess. samevariantcallingmodel,whileproducingthesameanaly- Toalignthesequenceddataagainstareferenceassembly, sisresults.Indeed,GensearchNGSshowsspeedimprovement GensearchNGSprovidestheuserwithmanyoptions.Several of up to 18 times for variant calling compared to Varscan existingalignmentalgorithmsareintegratedintotheapplica- 2, providing a similar range of features, such as filters that tionthroughplugins,allowingtheirusagewithouttheneed excludesequencingartifacts. of advanced technical know-how. The currently integrated To be able to further filter down the detected variants alignment algorithms are BWA [7], Bowtie 1 + 2 [8], and based on their relevancy to the sample’s phenotype, several Stampy[29]. sourcesareusedtoannotatethemwithvariousinformation. Most of those alignment algorithms are developed in a This includes the prediction of the variant effect on the waythatdonotallowthemtobeusedonmultipleplatforms. different transcripts it affects, a feature closely modeled Infact,mostofthemonlyrunonLinux,whichisproblematic aftertheVariantEffectPredictor(VEP)fromEnsembl[17]. foramultiplatformapplicationlikeGensearchNGS.Forthis GensearchNGS implements its own version of the the VEP reason, a custom alignment algorithm written in Java has algorithm to be able to perform the effect prediction in a beenintegrateddirectlyintoGensearchNGS.Thisallowsthe fractionofthetimecomparedtotheonlineVEPservice.This usertoperformthefullNGSdataanalysis,evenonoperating isismainlyachievedbydoingallcalculationslocallywithout systems,likeWindows,notsupportedbymostbioinformatics having to transfer the data to a remote server. Predictions tools. suchastheadditionorremovalofastopcodoninanaffected Thealignersupportspairedendreadsandgappedalign- transcript, if a splice site is affected and many others, can ment and has been optimized to use less than 5GB of be made with this algorithm. Additional information, such RAM to perform a full genome alignment. This makes it asifthevariantisknowninanypublicdatabase,associated suitable for most of today’s workspace desktops. It is fully clinical significance, and SIFT [33] and PolyPhen 2 [34] multithreaded to support all cores of the machine and is predictions are retrieved through the Ensembl web service. able to offload calculations, as described in Section 3.4, to Other services, like the Human Phenotype Ontology [18], multiple computers. The aligner uses a hash based index, providetheinformationaboutthephenotypeassociatedwith a concept made popular through the BLAST aligner [9], aspecificgene,acrucialinformationindiagnostics. withacustomheuristicbasedcandidatealignmentpositions To filter the annotated variants, GensearchNGS takes selection [37]. The final alignment is being done using the an interactive approach which allows the user to modify Gotoh [38] algorithm, a dynamic programming algorithm differentbasicfilters,suchasminimumormaximumsample usedforgappedalignment. frequency,coverage,readbalance,orquality.Therearealso The custom alignment algorithm being implemented in advanced filters, such as the predicted consequence of the Javadoesnotachievequitethesameperformanceasnatively variant, as determined by the annotation process, or if a runningalignerslikeBWA.ForWGSdatasetsitisgenerally certainphenotypeisassociatedwiththeaffectedgenes.This 50% slower. The advantage of the GensearchNGS aligner includes the possibility to filter by population frequency of over external aligners is that it is available on all supported avariant.Thefilteringbypopulationfrequencydoesnotyet platforms, allowing the user to perform the complete NGS takeintoaccounttheethnicityofthepatient.Figure4shows data analysis on all platforms. But when running on a anexampleofthislist,withseveraluserselectedinteractive platform which supports the external aligners and speed is filtersapplied. adecidingfactor,theusersareencouragedtousethealigners Additional features are available to perform a more in- availableasplugins. depthanalysisofvariants,suchasfamilyanalysiswherethe Thequalityofresultingalignmentcanbeverifiedthrough variants of multiple patients can be compared to find those various metrics by the user. He can easily verify how many only present in a certain group of patients. The usage of sequences have been discarded during the alignment pro- pedigreeinformationisofgrowingimportanceinNGSdata cess and how many sequences have been aligned to the analysis, as it allows the user to filter sequencing artifacts differentchromosomesinthereferencesequence.Advanced and noncausative variants. This is a very important feature, statistics,similartotheonescreatedbybedtools2,canalso especiallywhenusingWGSwheretheamountofvariantscan 8 BioMedResearchInternational Figure 4: Integrated variants list with several interactive filters applied. Figure 5: Example view of the genome browser, showing several availabletracks. be very large. Thanks to the usage of standard file formats likeVCF,externaltoolslikeexomewalker[39]canbeused to perform more advanced family analysis for features not withallrelevantinformationwhenneeded.Theusercanopen covered in GensearchNGS. In contrast to many other NGS external resources linked from many features in the visual- data analysis software packages, GensearchNGS does not ized data, such as genes, transcripts, or variants, to retrieve focusontheautomationofeveryaspectoftheanalysis.While additionalinformation.Thegenomebrowsertakesadvantage certainpartscanbeautomated,suchasperformingsequence oftheintegrateddatasources,liketheEnsemblgenemodel, alignment and variant calling on raw sequencing data, the to display and interactively update encoded proteins, based philosophy of our software is to let the user to control the on the variants present in the visualized data. The genome different analysis steps. This gives him enough flexibility to browser is tightly integrated into GensearchNGS, allowing adapt to the data he is currently analysing without slowing the user to easily visualize the alignment data belonging to himdown,byhavingprojectspecificanalysisoptionswhich featuresofthesample. arepreselectedforeveryanalysisstep.Thismakesitpossible to perform most types of analysis with the project specific 3.4. Distributed Computing. The lack of computing power defaultoptionsthroughasimplepressofabutton. to perform NGS data analysis is a very common issue in manylaboratories.Toaddressthisproblem,GensearchNGS 3.3. Genome Browser. A central tool of GensearchNGS is doesnotonlyoptimizethealgorithmsforspeedandmemory the integrated genome browser to visualize the aligned usagebutalsotrytoexploitthealreadyexistingcomputing sequencingdata.Thegenomebrowser,asshowninFigure5, infrastructure. One way to do this is to use distributed allows to interactively browse the sequence alignments. It computing, a technique which allows the user to combine is possible to display the aligned sequences and additional multiple computers to perform computing intensive tasks. genomic data, such as the gene locations and the encoded Those computers can be in the same local network or at a proteinsbydifferenttranscripts. remote location. One of the most demanding steps in NGS The genome browser allows the user to display sev- dataanalysisissequencealignment.Forthisreasonitisthe eral tracks, containing information which helps to analyse firstfeaturetotakeadvantageofthistechniqueinGensearch- the sequencing data. The predefined tracks, which can be NGS,allowingtheusertojoinmultiplecomputersinthesame activated or deactivated, allow the user to display coverage networktoperformsequencealignment.Theapproachused information, GC content of the reference sequence, the byGensearchNGSistotransparently,onrequestbytheuser, qualityofthesequencedreads,orcustomannotationfilesin useallcomputerswithinthesamenetworkwhereGensearch- standardfileformatssuchasBED,WIG,andGFF.Thefocus NGSisinstalled.Additionally,theuserhasthepossibilityto was put on optimizing the display of the data, allowing the startupGensearchNGSinstancesinacloud.Currently,only usertovisualizelargedatafileswithoutrequiringapowerful theamazoncloudissupported.Thisprocessworksregardless computer.Thisisachievedbystreamingthedataondemand, of the operating system used by the different installations only loading the parts of the data currently needed for of GensearchNGS. By using an automatic service discovery the visualization.The informationabout which variants are protocol,everyinstallationofGensearchNGSlocatedinthe presentinthealignmentisavailabletotheuserinthesame same network is discovered without the need for manual formasthevariantlistinthemainapplication.Itcanbeseen configuration.Thisallowstheusertousemultiplecomputers inFigure4,allowingtheusertoquicklyjumptothevariants at the same time or to completely offload calculations to a of interest. The genome browser, like all of GensearchNGS, differentcomputer,makingitpossibletousecomputersnot is heavily connected with external data sources (Ensembl, suitedforfullgenomealignmentsasananalysisworkstation. dbSNP,Entrez,OMIM,Blast,andHPO)providingtheuser The distributed sequence aligner used in GensearchNGS BioMedResearchInternational 9 currentlyonlysupportstheGensearchNGSaligneralgorithm NGS data analysis pipeline. We then presented Gensearch- andnotexternalalignerssuchasBWAorBowtie2.Itusesa NGS, a software suite integrating various algorithms to streambasedapproach,wheretheclientmachinestreamsthe perform the discussed workflow. GensearchNGS is a com- datatobealignedtoallcomputersinvolvedinthealignment mercially available software, distributed by Phenosystems process.Thisincludestheclientmachinewhichreceivesthe SA. A demo version of GensearchNGS is freely available aligned sequences back and combines them into the final athttp://www.phenosystems.com/.Thetwomainissues,the alignment file. The process of distributing the alignments technical complexity of the analysis and the computing overmultiplecomputersandthecloudistransparenttothe infrastructurerequirements,havebeenaddressedasfollows. user, but it is optional, because the usage of an external To reduce the technical complexity of the analysis, the cloudisnotcompatiblewiththesecurityguidelinesofsome userisprovidedwithanintuitiveuserinterfacewhichguides laboratories.GensearchNGSisnottheonlyNGSdataanalysis himthroughtheprocessofthedataanalysis.Heisprovided software using distributed sequence alignment. It is this withtherequiredtoolstogofromrawsequencingdatathethe ability to combine the computing resources of the local final visualization and report generation for the sequenced computer,multiplecomputersinthesamenetworkaswellas samples, without having to learn complex bioinformatics thecloud,whichmakesitsapproachuniqueandveryflexible. tools. The interface abstracts the technical complexity from theuserwhilekeepingenoughflexibilitytoadapttodifferent 4.Discussion analysis protocols. By lowering the technical complexity of NGS data analysis compared to existing individual analysis GensearchNGSaddressesproblemsthatsmallerlaboratories tools,GensearchNGSexpandsthepotentialuserbaseofthe are facing in various ways. The issue of the technical com- software. plexityoftheanalysisisapproachedthroughanintuitiveuser To address the issue of the computing infrastructure interface. The user interface abstracts the complexity of the requirements, GensearchNGS focuses on the optimal usage underlying NGS data analysis toolchain, which is a major of existing computing resources. This is primarily achieved usability improvement over the usage of individual tools. through the development of analysis algorithms that run Nevertheless,itgivestheuserfullcontrolovertheindividual on a minimal amount of resources while producing the analysis steps, instead of focusing on the creation of auto- sameanalysisresultsasstandardtools.Anexamplehasbeen matedanalysispipelines.Italsoprovidesanoriginalapproach describedwiththevariantcallingalgorithmwhichproduces by providing a patient centric user interface. This user the same results as Varscan 2 but is 18 times faster. The friendlyinterfaceallowsusersunfamiliarwiththeunderlying secondapproachistheintegrationofdistributedcomputing, software tools to fine tune the analysis parameters, without allowing the distribution of computing intensive tasks like overwhelmingthemwithfeaturesorrarelyusedoptions.This alignmenttobespeedupbytheusageofmultiplecomputers. increases the number of people able to perform NGS data With a solid framework built for DNAseq analysis, analysis. This is particularly interesting for geneticists that GensearchNGSwillexpandintofurthertypesofanalysisin mightnothavespecialbioinformaticstraining.Theissueof thefuture.OneexampleistheintegrationofRNAseqanalysis, increasinginfrastructurerequirementsforNGSdataanalysis as the combination of OMICS data for a single sample is is being addressed twofold. First, more efficient algorithms of increasing importance to understand rare diseases [44]. arebeingdeveloped,liketheheavilyoptimizedvariantcalling In fact, many parts of GensearchNGS, such as the genome algorithm included in GensearchNGS. Secondly, the ability browser,havealreadybeenadaptedtosupportRNAseqdata. to distribute heavy calculations over several computers or Theintegrationofmultipledatasourceswillnotonlyallow even a cloud service has been developed. GensearchNGS theusertointegratedifferenttypesofOMICSdatabutalso hasauniqueapproachtothisproblembyenablingtheuser allow multiple views on the same sample. One example of to combine the various distribution methods to perform this is the integration of Sanger sequencing to NGS data, sequence alignment. It provides the flexibility needed to a very useful feature for diagnostics where it is common choose the best compromise between data privacy and practice to validate variants identified through NGS with alignment speed depending on the laboratory policies and Sangersequencing. needs. The resulting software enabled various laboratories Further optimizations of the algorithms used in the to perform NGS data analysis and publish their results in analysiswillalsobeevaluated,includingoptimizationforthe various fields. The topics range from muscular dystrophies [40,41]tohearingloss[42]andmalignanthyperthermia[43] underlyinglibraries.OneexampleisHTSJDK,whereseveral but also yet unpublished work about CNV analysis and the performance patches have been submitted and included, detectionofsomaticmutationsincancerscreening. improvingNGSdataanalysisperformancenotonlyforusers of GensearchNGS, but also forthe complete bioinformatics community. 5.Conclusion In this paper we reviewed the workflow used to analyze Disclosure NGS data in a diagnostics environment. The reviewed workflow includes the detection of SNPs and short indels PierreKuonenhasanonpaidadvisorypositionatPhenosys- from panel, WES or WGS DNAseq data. We also pre- temsSA.DavidAtlanisworkingfulltimeforPhenosystems sentedissuesfacedbysmallerlaboratoriestryingtosetupa SA. 10 BioMedResearchInternational ConflictofInterests [15] J.O’Rawe,T.Jiang,G.Sunetal.,“Lowconcordanceofmultiple variant-callingpipelines:practicalimplicationsforexomeand ThomasDandekarhasnoconflictofinterestsregardingthe genomesequencing,”GenomeMedicine,vol.5,no.3,article28, publicationofthispaper. 2013. [16] S.T.Sherry,M.-H.Ward,M.Kholodovetal.,“dbSNP:theNCBI Acknowledgments databaseofgeneticvariation,”NucleicAcidsResearch,vol.29,pp. 308–311,2001. GensearchNGS is a commercially available software dis- [17] W. McLaren, B. Pritchard, D. Rios, Y. Chen, P. Flicek, and F. tributed by Phenosystems SA. Beat Wolf’s Ph.D. thesis is Cunningham,“Derivingtheconsequencesofgenomicvariants withtheEnsemblAPIandSNPEffectPredictor,”Bioinformatics, partiallyfinanciallysupportedbyPhenosystemsSA. vol.26,no.16,ArticleIDbtq330,pp.2069–2070,2010. [18] P. N. Robinson and S. Mundlos, “The human phenotype References ontology,”ClinicalGenetics,vol.77,no.6,pp.525–534,2010. [19] P.Danecek,A.Auton,G.Abecasisetal.,“Thevariantcallformat [1] E. L. van Dijk, H. Auger, Y. Jaszczyszyn, and C. Thermes, and VCFtools,” Bioinformatics, vol. 27, no. 15, pp. 2156–2158, “Tenyearsofnext-generationsequencingtechnology,”Trends 2011. inGenetics,vol.30,no.9,pp.418–426,2014. [20] J.T.Robinson,H.Thorvaldsdo´ttir,W.Winckleretal.,“Integra- [2] B.Sikkema-Raddatz,L.F.Johansson,E.N.deBoeretal.,“Tar- tivegenomicsviewer,”NatureBiotechnology,vol.29,no.1,pp. getednext-generationsequencingcanreplacesangersequenc- 24–26,2011. inginclinicaldiagnostics,”HumanMutation,vol.34,no.7,pp. [21] I. Milne, G. Stephen, M. Bayer et al., “Using tablet for visual 1035–1042,2013. explorationofsecond-generationsequencingdata,”Briefingsin [3] H. P. J. Buermans and J. T. den Dunnen, “Next generation Bioinformatics, vol. 14, no. 2, Article ID bbs012, pp. 193–202, sequencingtechnology:advancesandapplications,”Biochimica 2013. etBiophysicaActa—MolecularBasisofDisease,vol.1842,no.10, [22] T.Abeel,T.vanParys,Y.Saeys,J.Galagan,andY.vandePeer, pp.1932–1941,2014. “GenomeView: a next-generation genome browser,” Nucleic [4] FastQC,“Aqualitycontroltoolforhighthroughputsequence AcidsResearch,vol.40,no.2,articlee12,2012. data,”BabrahamBioinformaticsWebsite,http://www.bioinfor- [23] W.J.Kent,C.W.Sugnet,T.S.Fureyetal.,“Thehumangenome matics.babraham.ac.uk/projects/fastqc/. browser at UCSC,” Genome Research, vol. 12, no. 6, pp. 996– [5] W.R.Pearson,T.Wood,Z.Zhang,andW.Miller,“Comparison 1006,2002. ofDNAsequenceswithproteinsequences,”Genomics,vol.46, [24] F.Cunningham,M.RidwanAmode,D.Barrelletal.,“Ensembl no.1,pp.24–36,1997. 2015,” Nucleic Acids Research, vol. 43, no. 1, pp. D662–D669, [6] H. Li and N. Homer, “A survey of sequence alignment algo- 2015. rithms for next-generation sequencing,” Briefings in Bioinfor- [25] NGS Data Analysis Software List, SEQanswers, 2012, http:// matics,vol.11,no.5,pp.473–483,2010. seqanswers.com/wiki/Software/list. [7] H.LiandR.Durbin,“Fastandaccuratelong-readalignment [26] CLCGenomicsWorkbench,CLCBio,Aarhus,Denmark,2014, withBurrows-Wheelertransform,”Bioinformatics,vol.26,no. http://www.clcbio.com. 5,pp.589–595,2010. [27] NextGENe, Softgenetics, State College, Pa, USA http://www [8] B.LangmeadandS.L.Salzberg,“Fastgapped-readalignment .softgenetics.com. withBowtie2,”NatureMethods,vol.9,no.4,pp.357–359,2012. [28] J. Goecks, A. Nekrutenko, J. Taylor et al., “Galaxy: a com- [9] S. F. Altschul, T. L. Madden, A. A. Scha¨ffer et al., “Gapped prehensive approach for supporting accessible, reproducible, BLASTandPSI-BLAST:anewgenerationofproteindatabase and transparent computational research in the life sciences,” search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. GenomeBiology,vol.11,no.8,articleR86,2010. 3389–3402,1997. [29] G.LunterandM.Goodson,“Stampy:astatisticalalgorithmfor [10] M. A. DePristo, E. Banks, R. Poplin et al., “A framework for sensitiveandfastmappingofIlluminasequencereads,”Genome variationdiscoveryandgenotypingusingnext-generationDNA Research,vol.21,no.6,pp.936–939,2011. sequencingdata,”NatureGenetics,vol.43,no.5,pp.491–498, [30] Online Mendelian Inheritance in Man, OMIM, McKusick- 2011. NathansInstituteofGeneticMedicine,JohnsHopkinsUniver- [11] P.Carnevali,J.Baccash,A.L.Halpernetal.,“Computational sity,Baltimore,Md,USA,2015,http://omim.org/. techniques for human genome resequencing using mated [31] CafeVariome,http://www.cafevariome.org. gappedreads,”JournalofComputationalBiology,vol.19,no.3, [32] TheGenomeSequencingConsortium,“Initialsequencingand pp.279–292,2012. analysisofthehumangenome,”Nature,vol.409,pp.860–921, [12] H. Li, B. Handsaker, A. Wysoker et al., “The sequence align- 2001. ment/mapformatandSAMtools,”Bioinformatics,vol.25,no.16, [33] P.Kumar,S.Henikoff,andP.C.Ng,“Predictingtheeffectsof pp.2078–2079,2009. codingnon-synonymousvariantsonproteinfunctionusingthe [13] A. R. Quinlan and I. M. Hall, “BEDTools: a flexible suite of SIFTalgorithm,”NatureProtocols,vol.4,no.7,pp.1073–1082, utilities for comparing genomic features,” Bioinformatics, vol. 2009. 26,no.6,pp.841–842,2010. [34] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and [14] D.C.Koboldt,Q.Zhang,D.E.Larsonetal.,“VarScan2:somatic server for predicting damaging missense mutations,” Nature mutation and copy number alteration discovery in cancer by Methods,vol.7,no.4,pp.248–249,2010. exomesequencing,”GenomeResearch,vol.22,no.3,pp.568– [35] AlamutVisual,InteractiveBiosoftware,Rouen,France,http:// 576,2012. www.interactivebiosoftware.com/alamut-visual.

Description:

Modeling Language (UML) diagram. GensearchNGS inside a local Alamut installation. 3. the genes described in Ensembl as the gene model.

Research Article DNAseq Workflow in a Diagnostic Context and an Example of a User Friendly ... PDF

12 Pages·2015·3.18 MB·English

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Research Article DNAseq Workflow in a Diagnostic Context and an Example of a User Friendly ...

Description:

Modeling Language (UML) diagram. GensearchNGS inside a local Alamut installation. 3. the genes described in Ensembl as the gene model.

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.