ZOOLOGICAL RESEARCH

AutoSeqMan: batch assembly of contigs for Sanger sequences

Jie-Qiong Jin1, Yan-Bo Sun1,*

1 State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming Yunnan 650223, China

ABSTRACT

With the wide application of DNA sequencing technology, DNA sequences are still increasingly generated through the Sanger sequencing platform. SeqMan (in the LaserGene package) is an excellent program with an easy-to-use graphical user interface (GUI) employed to assemble Sanger sequences into contigs. However, with increasing data size, larger sample sets and more sequenced loci make contig assembly complicated due to the considerable number of manual operations required to run SeqMan. Here, we present the ‘autoSeqMan’ software program, which can automatically assemble contigs using the SeqMan scripting language. There are two main modules available, namely, ‘Classification’ and ‘Assembly’. Classification first undertakes preprocessing work, whereas Assembly generates a SeqMan script to consecutively assemble contigs for the classified files. Through comparison with manual operation, we showed that autoSeqMan saved substantial time in the preprocessing and assembly of Sanger sequences. We hope this tool will be useful for those with large sample sets to analyze, but with little programming experience. It is freely available at https://github.com/Sun-Yanbo/autoSeqMan.

Keywords: Batch processing; Sanger sequences; Contig assembly; SeqMan

Received: 21 December 2017; Accepted: 08 February 2018
Foundation items: The development of this package was supported by the National Natural Science Foundation of China (31671326) and the Youth Innovation Promotion Association, Chinese Academy of Sciences
*Corresponding author, E-mail: [email protected]
DOI: 10.24272/j.issn.2095-8137.2018.027
Science Press, Zoological Research 39(2): 123-126, 2018

INTRODUCTION

DNA sequencing technology has experienced a revolutionary shift from automated Sanger sequencing (Sanger et al., 1977) to next-generation sequencing (NGS; reviewed by Shendure & Ji (2008) and Shendure et al. (2004)). Although NGS has dominated due to its high throughput (Schuster, 2008), it is not suitable for many population studies due to high costs and other limiting factors. For example, errors are often introduced in final assembly and/or annotation results using NGS data (Bickhart et al., 2017), and thus variations detected in high-throughput analyses require validation by Sanger sequencing (Wall et al., 2014). Furthermore, for some present population genomic studies, error rates have been found to increase with increasing depth of coverage for Illumina data, and thus caution is needed when interpreting the results of next-generation sequencing-based association studies (Wall et al., 2014). As such, Sanger sequencing technology is still widely used in many research fields, including evolutionary taxonomy based on short DNA sequences (Chen et al., 2017), evolutionary history study of wild animals (Yuan et al., 2016), biodiversity estimates and influencing factors (Zhou et al., 2017), and validation of mutations identified from high-throughput analyses (Sun et al., 2013).

Further, with the wide application of DNA sequencing technology, e.g., DNA barcoding, which uses short and standardized DNA sequences for individual identification of organisms (Hajibabaei et al., 2007; Savolainen et al., 2005), Sanger sequencing data continue to accumulate among evolutionary taxonomists and others. Thus, batch manipulation of these Sanger sequences has become an important task before downstream analyses, especially for those who do not have a programming or bioinformatics background or experience. Although several general-purpose sequence manipulation packages have been published previously, including MEGA (Kumar et al., 2016), EMBOSS (Rice et al., 2000), and FasParser (Sun, 2017), these packages all operate on assembled contigs (a consensus region of overlapping DNA segments), and none gives particular consideration to the batch assembly of Sanger sequences.

SeqMan is a popular program in the LaserGene software package (DNAStar, Inc., Madison, WI, USA), which is used for assembling Sanger sequences into contigs and has been widely applied in a great number of studies. It can handle two to thousands of Sanger sequences at one time but requires a considerable number of manual operations (e.g., mouse actions, Figure 1) to run. Hence, it is complicated and time-consuming for those with large sets of samples to assemble. Fortunately, since the release of Version 7, SeqMan has provided a scripting language, including commands for opening, naming, saving, and closing projects, and a single script may be used to execute multiple assemblies consecutively without manual intervention.

Here, we developed a program called autoSeqMan, which provides a simple way to automatically classify Sanger sequences and then consecutively assemble them on a personal computer. It is mainly designed for researchers with large sets of samples with one or more loci sequenced.

Figure 1 Overview of assembling tasks for Sanger sequences
Common tasks for assembling Sanger sequences include (1) classifying sequence files into corresponding folders (always completed manually) and (2) opening SeqMan to add the ab1 files that belong to the same sample and then assembling them (needing considerable mouse operations). The autoSeqMan program was designed based on these steps, with the ‘Classification’ and ‘Assembly’ functions corresponding to steps 2 and 3, respectively.

IMPLEMENTATION AND REQUIREMENTS

autoSeqMan was developed as a standalone Windows desktop application (compiled and tested in Windows 7/10). It involves two modules, ‘Classification’ and ‘Assembly’, corresponding to steps 2 and 3 in Figure 1, respectively. Each module can handle multiple files and needs the user to select the directory containing either the raw Sanger sequence files (*.ab1 files) or the classified sub-folders created by ‘Classification’. Theoretically, there is no limit to the number of files that can be analyzed.

This tool requires the sequence files to be named in a specialized format, in which the sample ID is present at the beginning of the file name. The Classification module will recognize the sample ID by the appropriate delimiter and then create sub-folders (see below). For convenience, autoSeqMan also provides a “Rename” tool to help users rename the ab1 files for the analyses described below.

CLASSIFICATION

This function is designed to automatically create sub-folders according to the sample ID and/or sequenced locus. All downstream analyses are performed in the corresponding sub-folders, where all analyzed results are also saved. According to our laboratory experience, this is an efficient and convenient way to manage and query laboratory samples (Chen et al., 2017; Zhou et al., 2017).

The only input is the name of the directory that contains the raw ab1 files. There are two input prerequisites for Classification. First, all files must be stored in the same directory. Second, all files must be named according to a certain pattern, i.e., “sample-locus-others”. For example, the file name “YPX24212_16S-2215_TSS20171122-0871-1171_H02.ab1” denotes that it is a DNA sequence of 16S and the sample number is “YPX24212”. The program will automatically parse the file name according to the user-specified delimiter and then create a sub-folder “YPX24212_16S” in the main output folder. The delimiter can be “-”, “_”, or another character. After classification, the program will list all sub-folders, and the user can look at the files classified into each sub-folder by simply clicking the folder name (Figure 2).

Figure 2 Overview of ‘Classification’ function in autoSeqMan
To perform this function, the user needs to tell autoSeqMan the full path of a directory which contains all raw ab1 files. The sample name should be present at the beginning of the file name, which will be recognized and extracted by autoSeqMan according to the specified delimiter.

ASSEMBLY

This function will automatically assemble the classified sequence files. It will first read the list of classified sub-folders created by the ‘Classification’ function, and then generate a SeqMan script for consecutively assembling the sequences in each sub-folder. To perform this function, the user must first install the LaserGene package (version 7 or higher), and then tell autoSeqMan the full path of the SeqMan program (which can always be recognized automatically by autoSeqMan), after which the program will complete all assembly tasks and save the assembly results automatically. The default script will generate all SQD, FAS, and SEQ results (Figure 3).
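The Classification step described above can be illustrated with a short Python sketch. This is not the autoSeqMan implementation, which is a Windows desktop application whose source is not shown here. The function names `subfolder_name` and `classify` are hypothetical. The parsing rule is also an assumption: the sub-folder name is taken to be the text before the first occurrence of the user-specified delimiter, which is consistent with the “YPX24212_16S” example when the delimiter is “-”.

```python
from collections import defaultdict
from pathlib import Path
import shutil


def subfolder_name(filename: str, delimiter: str = "-") -> str:
    """Return the text before the first delimiter.

    Assumption (not confirmed by the paper): this prefix is the
    sample/locus ID that names the sub-folder.
    """
    stem = Path(filename).stem  # drop the ".ab1" extension
    prefix = stem.split(delimiter, 1)[0]
    if not prefix:
        raise ValueError(f"no sample ID before {delimiter!r} in {filename!r}")
    return prefix


def classify(directory: str, delimiter: str = "-") -> dict[str, list[str]]:
    """Move every *.ab1 file in `directory` into a sub-folder named by its prefix.

    Returns a mapping from sub-folder name to the file names moved into it.
    """
    root = Path(directory)
    groups: dict[str, list[str]] = defaultdict(list)
    # Materialize the listing first so that moving files does not disturb it.
    for trace in sorted(root.glob("*.ab1")):
        target = root / subfolder_name(trace.name, delimiter)
        target.mkdir(exist_ok=True)
        shutil.move(str(trace), str(target / trace.name))
        groups[target.name].append(trace.name)
    return dict(groups)
```

With the delimiter “-”, `subfolder_name("YPX24212_16S-2215_TSS20171122-0871-1171_H02.ab1")` returns `"YPX24212_16S"`, so forward and reverse reads of the same sample and locus end up in the same sub-folder, ready for one assembly each.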
Figure 3 Default SeqMan script for assembling Sanger sequences
This script will be generated automatically by autoSeqMan. There is no need for users to edit it. By default, all SQD, SEQ, and FAS outputs are generated.

PERFORMANCE

The main aim of autoSeqMan is to save manual operation in preparing files and running the SeqMan program. To evaluate its performance, we applied this tool to our laboratory data (Chen et al., 2017; Zhou et al., 2017). In this test, one hundred samples were used, each of which had two ab1 files available. Results showed that the Classification operation created sub-folders (named by sample ID, as well as locus name if provided) and moved the appropriate files into the sub-folders within 8 s, substantially less than the time used for manual operation (about 1 h, as tested by our colleagues). Performance of the Assembly operation greatly depended on the running efficiency of SeqMan. In this test, the Assembly module required 64 s to consecutively assemble contigs for the classified sequences, also substantially less than the ~2 h required for manual operation, suggesting the significance of autoSeqMan in dealing with large data sets.

LIMITATIONS

It is important to note that autoSeqMan does not undertake any filtration manipulation on the sequence data, even though poor-quality sequence ends are always present. Thus, after running autoSeqMan, users should undertake quality control measures of the final assembly with SeqMan. In addition, the output Fasta files will have very long IDs, which might introduce some errors in subsequent sequence analyses. If necessary, users can use the “Sort & Rename” function of FasParser (Sun, 2017) to shorten these IDs.

COMPETING INTERESTS

The authors declare that they have no competing interests.

AUTHORS’ CONTRIBUTIONS

Y.B.S. and J.Q.J. designed the study and wrote the manuscript. Y.B.S. wrote the software. J.Q.J. evaluated the performance of autoSeqMan. All authors read and approved the final manuscript.

ACKNOWLEDGEMENTS

Special thanks to our colleagues, Hong-Man Chen, Yu-Qi He, and Fang Yan, for their helpful suggestions on the improvement of autoSeqMan.

REFERENCES

Bickhart DM, Rosen BD, Koren S, Sayre BL, Hastie AR, Chan S, Lee J, Lam ET, Liachko I, Sullivan ST, Burton JN, Huson HJ, Nystrom JC, Kelley CM, Hutchison JL, Zhou Y, Sun JJ, Crisà A, De León FAP, Schwartz JC, Hammond JA, Waldbieser GC, Schroeder SG, Liu GE, Dunham MJ, Shendure J, Sonstegard TS, Phillippy AM, Van Tassell CP, Smith TP. 2017. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nature Genetics, 49(4): 643–650.

Chen JM, Zhou WW, Poyarkov NA, Jr., Stuart BL, Brown RM, Lathrop A, Wang YY, Yuan ZY, Jiang K, Hou M, Chen HM, Suwannapoom C, Nguyen SN, van Duong T, Papenfuss TJ, Murphy RW, Zhang YP, Che J. 2017. A novel multilocus phylogenetic estimation reveals unrecognized diversity in Asian horned toads, genus Megophrys sensu lato (Anura: Megophryidae). Molecular Phylogenetics & Evolution, 106: 28–43.

Hajibabaei M, Singer GA, Hebert PD, Hickey DA. 2007. DNA barcoding: how it complements taxonomy, molecular phylogenetics and population genetics. Trends in Genetics, 23(4): 167–172.

Kumar S, Stecher G, Tamura K. 2016. MEGA7: Molecular evolutionary genetics analysis version 7.0 for bigger datasets. Molecular Biology and Evolution, 33(7): 1870–1874.

Rice P, Longden I, Bleasby A. 2000. EMBOSS: the European molecular biology open software suite. Trends in Genetics, 16(6): 276–277.

Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 74(12): 5463–5467.

Savolainen V, Cowan RS, Vogler AP, Roderick GK, Lane R. 2005. Towards writing the encyclopedia of life: an introduction to DNA barcoding. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462): 1805–1811.

Schuster SC. 2008. Next-generation sequencing transforms today’s biology. Nature Methods, 5(1): 16–18.

Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nature Biotechnology, 26(10): 1135–1145.

Shendure J, Mitra RD, Varma C, Church GM. 2004. Advanced sequencing technologies: methods and goals. Nature Reviews Genetics, 5(5): 335–344.

Sun YB. 2017. FasParser: a package for manipulating sequence data. Zoological Research, 38(2): 110–112.

Sun YB, Zhou WP, Liu HQ, Irwin DM, Shen YY, Zhang YP. 2013. Genome-wide scans for candidate genes involved in the aquatic adaptation of dolphins. Genome Biology and Evolution, 5(1): 130–139.

Wall JD, Tang LF, Zerbe B, Kvale MN, Kwok PY, Schaefer C, Risch N. 2014. Estimating genotype error rates from high-coverage next-generation sequence data. Genome Research, 24(11): 1734–1739.

Yuan ZY, Zhou WW, Chen X, Poyarkov NA, Jr., Chen HM, Jang-Liaw NH, Chou WH, Matzke NJ, Iizuka K, Min MS, Kuzmin SL, Zhang YP, Cannatella DC, Hillis DM, Che J. 2016. Spatiotemporal diversification of the true frogs (Genus Rana): A historical framework for a widely studied group of model organisms. Systematic Biology, 65(5): 824–842.

Zhou WW, Jin JQ, Wu J, Chen HM, Yang JX, Murphy RW, Che J. 2017. Mountains too high and valleys too deep drive population structuring and demographics in a Qinghai-Tibetan Plateau frog Nanorana pleskei (Dicroglossidae). Ecology and Evolution, 7(1): 240–252.