RESOURCE
Worm 1:1, 15–21; January/February/March 2012; © 2012 Landes Bioscience

WormBase: Annotating many nematode genomes

Kevin Howe,1,* Paul Davis,1 Michael Paulini,1 Mary Ann Tuli,1 Gary Williams,1 Karen Yook,2 Richard Durbin,3 Paul Kersey1 and Paul W. Sternberg2,*

1European Bioinformatics Institute; Wellcome Trust Genome Campus; Hinxton, Cambridge UK; 2California Institute of Technology; Division of Biology; Pasadena, CA USA; 3Wellcome Trust Sanger Institute; Wellcome Trust Genome Campus; Hinxton, Cambridge UK

*Correspondence to: Paul W. Sternberg and Kevin Howe; Email: [email protected]@wormbase.org
Submitted: 11/02/11; Revised: 02/02/12; Accepted: 02/02/12
http://dx.doi.org/10.4161/worm.19574

Keywords: Caenorhabditis elegans, nematode, genome, annotation, model organism database, community resource, sequence curation, parasitic nematode

Abbreviations: modENCODE, Model Organism ENCyclopedia Of DNA Elements; EST, Expressed Sequence Tag; cDNA, complementary DNA; RNASeq, RNA sequencing by 2nd generation technologies; C., Caenorhabditis; INSDC, International Nucleotide Sequence Database Collaboration

WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase’s role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.

Introduction

WormBase seeks to present an integrative view of nematode biology by in-depth curation of the research on C. elegans and other members of this animal family. To this end we integrate genomic sequences and annotations with curated data from genetic, developmental, physiological, behavioral and evolutionary studies. We provide multiple streams of access to the data, including the main website portal (www.wormbase.org), genome browsers, sequence search services, and application programming interfaces. WormBase aims to be the central repository and portal for nematode genomic data.

The activities of the WormBase consortium can be broadly classified into three groups: (1) curation of C. elegans literature and associated research and development; (2) user interface design, development and maintenance and (3) genome sequence annotation, analysis and comparative genomics. The volume of nematode data has exploded in recent years, and WormBase has had to respond accordingly in all three of these areas.1,2 For example, as the volume and variety of information has increased, its presentation to the community in a clear and accessible way requires new forms of display. We have responded to this challenge by completely redesigning the WormBase web-interface (Harris et al., manuscript in preparation). In this article, we focus on our remit to provide integrated, coherent genome annotation for a large (and growing) collection of nematode genome sequences and strains. We also summarize our release production cycle and analysis pipelines, and describe how they affect the timeline between data submission and its subsequent public release.

Integrating and Annotating Multiple Nematode Genomes

WormBase now hosts genomic data for nearly 20 nematodes (see Table 1, and refs. 3–14), representing species of evolutionary, biomedical and agricultural interest. Recent additions include the parasitic nematodes Trichinella spiralis,3 Ascaris suum4 and Bursaphelenchus xylophilus.5 The maturity of genome sequence and annotation in WormBase varies widely between species. At one end of the spectrum is the C. elegans genome, which was completed over a number of years using traditional physical mapping and clone-by-clone sequencing and finishing,6 and which has highly curated annotation. More recently we have seen a number of genome sequences generated by new high-throughput low-cost technologies and many of these genomes are inevitably fragmented and incomplete; additionally, there is relatively little published functional information about many of these species.
Table 1. Nematode genomes in WormBase (WS230 status)

Species | Clade(a) | Mode of reproduction(b) | Reference strain sequenced | Integrated into WormBase | Assembly version(e) | Sequenced genome size (Mb) | Top-level scaffolds/fragments | N50 | CDS models (distinct loci) | Gene set status
C. elegans | V | androdioecious | Bristol N2 | WS1 | WS230 | 100.3 | 6 | 17493793 | 25634 (20517) | Curated
C. briggsae | V | androdioecious | AF16 | WS132 | CAAC03000000 | 108.4 | 367 | 17485439 | 21961 (21936) | Curated
C. remanei | V | gonochoristic | PB4641 | WS185 | AAGD02000000 | 145.5 | 3670 | 461060 | 31476 (31471) | Curated
Brugia malayi | III | gonochoristic | TRS | WS185 | AAQA01000000 | 95.8 | 27210 | 37841 | 21332 (18348) | External(g)
Pristionchus pacificus | V | hermaphroditic | PS312 | WS194 | v2 (Sep. 2010) | 172.5 | 18083 | 1244534 | 24217 (24216) | External
C. japonica | V | gonochoristic | DF5080 | WS195 | ABLE03000000 | 166.3 | 18817 | 94149 | 36105 (29962) | Curated
C. brenneri | V | gonochoristic | PB2801 | WS196 | ABEG02000000 | 190.4 | 3305 | 368319 | 30670 (30667) | Curated
Meloidogyne hapla | IV | gonochoristic | VW9 | WS204 | ABLG01000000 | 53.0 | 3452 | 84000 | 13072 (13072) | External
Meloidogyne incognita | IV | gonochoristic | Morelos | WS205 | CABB01000000 | 82.1 | 9538 | 83000 | - | External(h)
Haemonchus contortus | V | gonochoristic | MHco3(ISE) | WS208 | v1 (Aug. 2008) | 298.0 | 59707 | 13338 | 6201 (6201) | WormBase predicted
C. angaria | V | gonochoristic | PS1010 | WS218 | AEHI01000000 | 79.8 | 33559 | 9453 | 26265 (22622) | External
Trichinella spiralis | I | gonochoristic | ISS195 | WS225 | ABIR02000000 | 63.5 | 6863 | 6373445 | 16380 (16380) | External
C. sp9 | V | gonochoristic | JU1422 | WS226 | v1 (June 2011) | 204.3 | 7636 | 196652 | 45167 (45167) | WormBase predicted
C. sp11 | V | androdioecious | JU1373 | WS226 | AEKS01000000 | 79.3 | 665 | 20921866 | 27721 (22326) | External
Strongyloides ratti | IV | gonochoristic and parthenogenetic(c) | ED321 Heterogonic | WS226 | CACX01000000 | 52.6 | 2184 | 359029 | 8188 (8077) | WormBase predicted
Ascaris suum | III | gonochoristic | Natural isolate | WS229 | v1 (Aug. 2011) | 272.8 | 29831 | 407899 | 18449 (18449) | External
Bursaphelenchus xylophilus | IV | gonochoristic | Ka4C1 | WS229 | CADV01000000 | 74.6 | 5527 | 1158000 | 18074 (18074) | External
Heterorhabditis bacteriophora | V | gonochoristic and hermaphroditic(c,d) | M31e | WS229 | ACKM01000000 | 77.0 | 1240 | 312328 | - | External(h)
C. sp5 | V | gonochoristic | DRD-2008 JU800 | WS230 | v1 (Jan. 2012) | 131.8 | 15261 | 25228 | 46280 (34696) | External

Notes: (a) ref. 15; (b) refs. 16–25; (c) heterogonic; (d) sex also determined by the environment; (e) INSDC assembly accession where available; (f) http://www.broadinstitute.org/crd/wiki/index.php/…; (g) author gene-set extended by additional isoform predictions from WormBase; (h) awaiting submission of gene set; (i) an […] C. angaria assembly will be available in WS231.

WormBase undertakes different responsibilities for each of these species, which can include (1) administration of the genome sequence; (2) curation of gene models and other sequence features; (3) curation of non-sequence-based data from the literature and (4) tracking of identifiers forward through different versions of the genome sequence and annotation. The specific way in which we manage the data for a species depends (primarily) on whether we curate gene models and other features for it. It is therefore useful for the sake of discussion to classify the species into two groups: core (WormBase curated gene models) and non-core. As of release WS230, the core species are C. elegans, C. briggsae, C. remanei, C. brenneri and C. japonica.

Analyzing and presenting data for an ever-increasing number of nematode genomes requires methods that scale well. We deploy a standard automatic analysis pipeline to annotate all the species we house (core and non-core), including repeat prediction, cDNA alignments, the determination of homology relationships, and protein domain identification. If a genome sequence for a non-core species is submitted without a gene-set, we also run an in-house gene prediction pipeline that uses CEGMA26 to accurately identify a small, universally conserved set of gene models. These are then used to train parameters for AUGUSTUS,27 which we then apply using protein homologies and any available RNASeq and other transcript data as supporting evidence. In some cases, these internally-produced gene predictions are later replaced by a canonical set of models provided by the submitters.

Updating an existing species in WormBase with a new assembly and/or gene-set presents additional challenges, because users rely on stable identifiers to track their entities of interest, which must be propagated forward to corresponding features in subsequent releases. For core species, identifiers are actively managed and tracked using our own curation software infrastructure. For non-core species, we use the Ensembl28 stable-identifier mapping software for this task.

The principal way in which we draw information from multiple species together is by connecting genes via orthology and paralogy relationships to genes in other species (both nematode and other model organisms such as human, mouse and fly). As of WS230, we include relationships published by the following projects and resources: InParanoid29 (version 7); TreeFam30 (version 7); the Orthologous Matrix Project31 (OMA, August 2009/08 version); OrthoMCL;32 PantherDB33,34 (version 7); and Ensembl28,35 (version 65). In addition, we curate orthology calls from the literature (e.g., Hillier et al., ref. 8) and direct submissions. We also use data in eggNOG36 (version 3.0) to cluster genes into functionally characterized homologous groups.

These resources are inevitably based on snapshots of the gene models, taken at various times. For our core species however, particularly C. elegans, the gene models are in a state of flux, being revised and improved on the basis of the latest evidence. In order to infer up-to-date nematode homology relationships for the latest gene models, we run the Ensembl Compara GeneTree pipeline35 as part of the preparation for every WormBase release. The resulting gene trees are used to infer additional current orthology relationships to those obtained by import from the third-party resources and direct submission.

One way in which we use the orthology relationships internally is to project WormBase-approved gene names37 onto orthologous gene(s) of other nematode species. For this a conservative approach is adopted: each proposed gene name is required to be supported by an unambiguous one-to-one orthology connection according to the majority of available source analyses.

We also use the Ensembl Compara DNA pipeline38 to produce whole-genome multiple alignments of all genomes in WormBase and derived genome conservation tracks (using GERP39). However, as the genetic diversity of the species collection in WormBase continues to increase, a single multiple alignment for all nematodes becomes less appropriate. We therefore propose to replace it with a series of pairwise alignments, providing multiple alignments only for selected subsets of species.

Sequence Curation

WormBase adopts an anomaly-driven approach to curation, whereby discrepancies between current gene models and alignment data are identified and flagged as curation targets. We have implemented a software application (CurationTool) that identifies these discrepancies and scores them according to their degree of discordance, presenting the results to the curator using a graphical user interface. An in-depth discussion of CurationTool and our anomaly-driven curation is presented elsewhere.40

For protein-coding genes, WormBase curates only the protein-coding portion (CDS) of the full transcript. For our core species, we use the high-confidence subset of cDNA alignments overlaying the curated CDS models to infer a set of full-length transcripts (including 5' and 3' untranslated regions), using a custom algorithm (unpublished). In the past, the accuracy of this process has been sensitive to artifacts such as alignment errors or chimeric cDNAs, but we have recently improved the algorithm to take these factors into account.

The primary line of evidence for gene model curation is transcript data. In addition to cDNAs deposited in the nucleotide archives, we draw data from numerous resources, publications and direct submissions. We also align all RNASeq data deposited in the Short Read Archive (SRA) to our core species using TopHat,41 and infer gene expression estimates for a variety of life stages and environmental conditions using Cufflinks.42

WormBase is committed to act as the ultimate repository for data coming from the nematode half of the modENCODE43,44 project. Most data sets have been accessible via the genome browser since the summer of 2010. To extract the maximum utility from the data, it is integrated fully into our database, by extending the data models where necessary and adding full cross-referencing and connectivity with existing WormBase objects. To date, the focus for full integration has been on data sets with high impact on gene model and other sequence feature curation, namely: trans-splice sites;45 poly-A cleavage sites and untranslated regions;44,46 large-scale EST sets (P. Green; data retrieved from nucleotide archives); mass-spectrometry peptide sequences;44 and RNASeq transcripts, and derived gene-predictions.44
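To make the anomaly-driven idea concrete, the sketch below flags one kind of discrepancy: introns seen in spliced alignments that fall inside a curated CDS model but are absent from it. It is not CurationTool, whose method is described in ref. 40. The data layout, the name `find_intron_anomalies`, and the use of the supporting-alignment count as a stand-in for "degree of discordance" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Intron = Tuple[int, int]  # (start, end), 1-based inclusive coordinates


def find_intron_anomalies(
    curated: Dict[str, Tuple[int, int, Set[Intron]]],
    aligned_introns: List[Intron],
) -> List[Tuple[str, Intron, int]]:
    """Return (model, intron, score) for aligned introns missing from a model.

    `curated` maps a model name to (cds_start, cds_end, curated introns).
    The score is the number of alignments supporting the intron; this is a
    hypothetical scoring scheme, not the one used by CurationTool.
    """
    support = Counter(aligned_introns)
    anomalies = []
    for name, (cds_start, cds_end, introns) in curated.items():
        for intron, count in support.items():
            start, end = intron
            inside = cds_start <= start and end <= cds_end
            if inside and intron not in introns:
                anomalies.append((name, intron, count))
    # Most strongly supported discrepancies first, as curation targets.
    anomalies.sort(key=lambda a: (-a[2], a[0], a[1]))
    return anomalies


# Hypothetical model with two curated introns and four spliced alignments.
models = {"geneA": (100, 900, {(200, 300), (500, 600)})}
alignments = [(200, 300), (500, 650), (500, 650), (700, 750)]
for name, intron, score in find_intron_anomalies(models, alignments):
    print(name, intron, score)
```

In this toy input the intron (500, 650) is reported first because two alignments support it, while (200, 300) is not reported because it already agrees with the curated model.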
The data of highest impact for curation has been the RNASeq transcriptome, and this has been used in a number of different ways. First, the modENCODE “genelets” (fragmentary gene models constructed using RNASeq data from 14 life stages) have been used to produce a new anomaly type for CurationTool that highlights potential cases where adjacent genes could be merged. To date, over three hundred cases displaying this anomaly have been scrutinized, of which approximately 35% resulted in a merge, and a further 10% some other change (for example the movement of an exon from one gene to another). Second, we have re-visited the source RNASeq data and analyzed it using the TopHat/Cufflinks pipeline41,42 to identify candidate “RNASeq-splice” features. These can be used both to confirm introns already part of curated gene models, and also to suggest changes to existing gene models or new isoforms. Third, the strand bias characteristic of the modENCODE RNASeq alignments47 has been extremely useful for curators to resolve ambiguities in the definition of the 5' and 3' ends of genes. Finally, the modENCODE RNASeq data has allowed us to make corrections to the C. elegans reference genome itself. By taking proposed errors and verifying them using data from a private submission of high-throughput-sequencing (J. Ahringer and M. Berriman, pers. comm.), we have been able to make 156 genome sequence corrections (110 insertions, 44 deletions and 2 substitutions), resulting in the correction of 100 gene models.

Additionally, since the data from modENCODE began to become available from the project Data Co-ordination Centre, the following data sets have been subjected to rigorous internal quality control and fully integrated into the database: ~300 Highly Occupied Target (HOT) regions;44 ~7,000 non-coding RNA genes;44 the probable parent for ~1,000 pseudogenes;44 and ~21,000 three-prime UTRs from the UTRome project.46 We will prioritise the incorporation of the transcription-factor binding site and chromatin accessibility data as soon as the final versions of these data sets are made available.

We have also worked with groups performing their own analysis of the modENCODE data. For example, a study of the modENCODE RNASeq reads (T. Blumenthal, pers. comm.) has resulted in significant improvements to the operon data set. This has involved identifying cases where fewer than 5% of the trans-splice leader reads for “internal” genes (i.e., genes other than the first) were SL2 type, and modifying the gene content of the operons accordingly.

In addition to modENCODE, we continue to draw in data from the scientific literature and direct submissions, often combining different data sources to assist in making correct predictions. The modENCODE poly-A site data has been supplemented with a corresponding data set from an independent study.48 These two data sets have only 25% redundancy, and over 80% of coding genes now have an annotated polyA site in WormBase. Gene predictions by genBlastG49 based on BLAST homologies to C. elegans proteins have also proved valuable for the curation of C. briggsae, C. brenneri, and C. remanei.

We can assess gene-model accuracy in the presence of fragmentary transcript evidence by measuring the proportion of curated introns that are confirmed by spliced cDNA evidence. For WS230, the proportion of C. elegans curated CDS introns confirmed by traditional cDNA, modENCODE RNASeq and mass-spectrometry evidences is 83%, 88% and 14% respectively. Overall, 93% of curated introns are confirmed and 82% of CDS models have all of their introns confirmed by at least one of these three lines of evidence; the corresponding measurements for the final release prior to modENCODE (WS200, February 2009) were 74% and 56%, demonstrating the value of the project in increasing the accuracy and confidence of C. elegans gene models.

Intraspecies Variation

Similar to many other resources, WormBase captures within-species variation as differences (insertions, deletions and substitutions) with respect to the genome sequence of the reference strain. We expect variation data for many nematode species in the future, but at present almost all the data we house is for C. elegans.

Historically, the majority of variation data we have processed has been from laboratory-manipulated strains. We maintain close working relationships and established data exchange protocols with the Caenorhabditis Genetics Center (CGC; www.cbs.umn.edu/CGC), the C. elegans Gene Knockout Consortium (GKC; www.celeganskoconsortium.omrf.org), and the National BioResource Project of Japan (NBRP; www.shigen.nig.ac.jp/c.elegans/index.jsp). We also curate variation data from individual user submissions, which, although time-consuming to process, are often biologically important.

There has recently been a rapid growth of C. elegans variation data generated by whole genome sequencing projects (refs. 50–54; Andersen et al., manuscript in preparation; Moerman and Waterston, manuscript in preparation). These data sets include an increasing number of variations from naturally-occurring wild-isolate strains. Motivated by community feedback, we have increased the clarity of our representation and display of this information. Every variation object processed by WormBase is assigned a unique, stable identifier with prefix “WBVar.” For laboratory-induced variations, we also assign a more directly informative public name comprised of a project/laboratory prefix (supplied by J. Hodgkin, pers. comm.) and a numerical suffix. For naturally occurring variations, the public name defaults to the WBVar identifier, making the distinction between these objects and the laboratory induced variations obvious and immediate.

We now also collect non-sequence-based information for wild isolate strains (http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/wild_isolate.cgi). Compared with laboratory-manipulated strains, there is additional information to capture about the wild isolates, such as isolation location, the condition in which it was found, and details of how it was isolated. Many wild isolates are not stocked at the CGC, and WormBase acts as the central data repository for these strains.

WormBase does not have a mandate to act as a permanent repository for variation data, and as the volume of these data sets continues to rapidly increase, we become less adequately resourced to perform this function. Projects are therefore encouraged to submit their data to the NCBI’s Database of Short Genetic Variations (dbSNP),55 an established archive for variation data. We act as a submission broker in cases where a laboratory lacks the technical resources to conform to the dbSNP submission protocols. To date, data from six projects have been integrated into WormBase and submitted to dbSNP. WormBase adds value to these data sets by performing additional analysis and placing them into context with other data types (e.g., Gene).
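The variation naming convention described in this section (a stable "WBVar" identifier for every variation, plus a laboratory prefix and numerical suffix as the public name for laboratory-induced variations) can be sketched as follows. The `Variation` class, the eight-digit zero padding, the `serial` counter and the example prefix `xy` are assumptions for illustration only; the text specifies just the "WBVar" prefix and the fallback rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variation:
    serial: int                       # internal counter (assumed)
    lab_prefix: Optional[str] = None  # project/laboratory prefix, if lab-induced
    lab_number: Optional[int] = None  # numerical suffix, if lab-induced

    @property
    def identifier(self) -> str:
        # Stable identifier with the "WBVar" prefix; the padding width is assumed.
        return f"WBVar{self.serial:08d}"

    @property
    def public_name(self) -> str:
        # Laboratory-induced variations get prefix + number; naturally
        # occurring variations fall back to the stable identifier.
        if self.lab_prefix is not None and self.lab_number is not None:
            return f"{self.lab_prefix}{self.lab_number}"
        return self.identifier


lab_induced = Variation(serial=1, lab_prefix="xy", lab_number=42)
wild_isolate = Variation(serial=2)
print(lab_induced.identifier, lab_induced.public_name)
print(wild_isolate.identifier, wild_isolate.public_name)
```

Keeping the public name as a derived property, rather than a stored field, is one way to guarantee that natural variations can never carry a name that looks laboratory-assigned.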
Variations are most often submitted to WormBase as a molecular change at a given location in a specific version of the reference genome sequence. As part of the curation, we capture and record a short flanking sequence either side of the variation feature, disassociating it from a specific version of the reference genome. Each release, we re-map all variations and re-calculate potential consequences of the molecular changes (e.g., non-sense, mis-sense or silent protein-coding mutation) on the latest gene models.

Release Cycle and Database Build

WormBase is released every two months, with the preparation for a release beginning three months in advance. This release cycle can give rise to variability in the time between a curator transaction (e.g., the update of a gene name, correction of an error, or the import of a new data set) and its availability on the WormBase website. The delay can be as short as three months (if the change is made immediately before we start building the release) and as long as five months (if made immediately after, in which case it will not be public until the following release).

Building a WormBase database release is a complicated process, the broad stages of which can be described as: (1) data freeze, where each contributing consortium partner takes a snap-shot of the database(s) in which their curation data are stored; (2) data collation, where the curation database snap-shots are brought together into a single database; (3) submission of updated annotation on core species to the International Nucleotide Sequence Database Collaboration,56 to ensure that the representation of core nematode data in the nucleotide and protein archives is up-to-date; (4) mapping of sequence data (e.g., cDNAs, microarray probes, sequence features, variations) to the genome; (5) establishing connections between objects of different types (e.g., RNAi to Gene), usually via genomic location; (6) the large-scale computational analyses discussed earlier, such as homology detection and whole-genome alignment; and (7) quality control and assurance.

For the more complicated parts of the build process, we deploy two components of the Ensembl system for the management and tracking of computational pipelines: ensembl-pipeline57 for homology analysis and eHive58 for comparative analysis. The key features of these systems are (1) automatic re-run of tasks that have failed; and (2) user-definition of a sub-task dependency graph for a process, allowing complex pipelines to be run with minimal user intervention. These systems are critical in enabling us to produce the database in a regular and timely manner.

Each stage of the database production is subject to a suite of integrity checks to ensure that it has completed cleanly and without error. For example, we compare the number of objects in each data class with the count at the corresponding stage in the previous release. Major discrepancies are flagged for investigation. This mechanism has proved to be extremely effective in catching errors and process failures as soon as they occur.

Summary

WormBase is facing a deluge of data from many nematode genome sequencing projects, and we have prepared for this by putting into place annotation and integration pipelines and workflows that will allow the data to be analyzed and presented in a timely and consistent manner. As ever, we welcome feedback and ideas from our user-base as part of the continued development of the resource. We are currently particularly interested in suggestions on how we can maximise the utility of housing a broad representation of the nematode phylum, and what comparative genomics services and views users would find most useful. Users can contact the developers at [email protected] with their suggestions.

Acknowledgments

This work is supported by the US National Institutes of Health (Grant no. P41 HG02223); US National Human Genome Research Institute (Grant no. P41-HG02223); and British Medical Research Council (Grant no. G070119); P.W.S. is an investigator with the Howard Hughes Medical Institute. Funding for open access charge: US National Human Genome Research Institute (Grant no. P41-HG02223).
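As a rough illustration of the integrity check described under Release Cycle and Database Build, where object counts per data class are compared with the previous release, the sketch below flags classes whose count has changed by more than a tolerance. The 10% tolerance, the function name `flag_count_discrepancies` and the example counts are assumptions; the text does not state what makes a discrepancy "major".

```python
from typing import Dict, List, Tuple


def flag_count_discrepancies(
    previous: Dict[str, int],
    current: Dict[str, int],
    tolerance: float = 0.10,  # assumed threshold for a "major" discrepancy
) -> List[Tuple[str, int, int]]:
    """Return (data_class, previous_count, current_count) needing investigation."""
    flagged = []
    for data_class in sorted(set(previous) | set(current)):
        before = previous.get(data_class, 0)
        after = current.get(data_class, 0)
        if before == 0:
            # A class that appears from nothing cannot be scored relatively.
            changed = after != 0
        else:
            changed = abs(after - before) / before > tolerance
        if changed:
            flagged.append((data_class, before, after))
    return flagged


# Hypothetical counts at the same build stage in two consecutive releases.
previous_release = {"Gene": 1000, "Variation": 500, "RNAi": 200}
current_stage = {"Gene": 1010, "Variation": 350}
for row in flag_count_discrepancies(previous_release, current_stage):
    print(row)
```

Here the small growth in Gene objects passes, while the missing RNAi class and the 30% drop in Variation objects are both reported for investigation.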
References

1. Yook K, Harris TW, Bieri T, Cabunoc A, Chan J, Chen WJ, et al. WormBase 2012: more genomes, more data, new website. Nucleic Acids Res 2012; 40(Database issue):D735-41; PMID:22067452; http://dx.doi.org/10.1093/nar/gkr954
2. Harris TW, Antoshechkin I, Bieri T, Blasiar D, Chan J, Chen WJ, et al. WormBase: a comprehensive resource for nematode research. Nucleic Acids Res 2010; 38(Database issue):D463-7; PMID:19910365; http://dx.doi.org/10.1093/nar/gkp952
3. Mitreva M, Jasmer DP, Zarlenga DS, Wang Z, Abubucker S, Martin J, et al. The draft genome of the parasitic nematode Trichinella spiralis. Nat Genet 2011; 43:228-35; PMID:21336279; http://dx.doi.org/10.1038/ng.769
4. Jex AR, Liu S, Li B, Young ND, Hall RS, Li Y, et al. Ascaris suum draft genome. Nature 2011; 479:529-33; PMID:22031327; http://dx.doi.org/10.1038/nature10553
5. Kikuchi T, Cotton JA, Dalzell JJ, Hasegawa K, Kanzaki N, McVeigh P, et al. Genomic insights into the origin of parasitism in the emerging plant pathogen Bursaphelenchus xylophilus. PLoS Pathog 2011; 7:e1002219; PMID:21909270; http://dx.doi.org/10.1371/journal.ppat.1002219
6. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998; 282:2012-8; PMID:9851916; http://dx.doi.org/10.1126/science.282.5396.2012
7. Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, et al. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol 2003; 1:E45; PMID:14624247; http://dx.doi.org/10.1371/journal.pbio.0000045
8. Hillier LW, Miller RD, Baird SE, Chinwalla A, Fulton LA, Koboldt DC, et al. Comparison of C. elegans and C. briggsae genome sequences reveals extensive conservation of chromosome organization and synteny. PLoS Biol 2007; 5:e167; PMID:17608563; http://dx.doi.org/10.1371/journal.pbio.0050167
9. Ross JA, Koboldt DC, Staisch JE, Chamberlin HM, Gupta BP, Miller RD, et al. Caenorhabditis briggsae recombinant inbred line genotypes reveal inter-strain incompatibility and the evolution of recombination. PLoS Genet 2011; 7:e1002174; PMID:21779179; http://dx.doi.org/10.1371/journal.pgen.1002174
10. Ghedin E, Wang S, Spiro D, Caler E, Zhao Q, Crabtree J, et al. Draft genome of the filarial nematode parasite Brugia malayi. Science 2007; 317:1756-60; PMID:17885136; http://dx.doi.org/10.1126/science.1145406
11. Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, Dinkelacker I, et al. The Pristionchus pacificus genome provides a unique perspective on nematode lifestyle and parasitism. Nat Genet 2008; 40:1193-8; PMID:18806794; http://dx.doi.org/10.1038/ng.227
12. Abad P, Gouzy J, Aury JM, Castagnone-Sereno P, Danchin EG, Deleury E, et al. Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita. Nat Biotechnol 2008; 26:909-15; PMID:18660804; http://dx.doi.org/10.1038/nbt.1482
13. Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, Cohn J, et al. Sequence and genetic map of Meloidogyne hapla: A compact nematode genome for plant parasitism. Proc Natl Acad Sci USA 2008; 105:14802-7; PMID:18809916; http://dx.doi.org/10.1073/pnas.0805946105
14. Mortazavi A, Schwarz EM, Williams B, Schaeffer L, Antoshechkin I, Wold BJ, et al. Scaffolding a Caenorhabditis nematode genome with RNA-seq. Genome Res 2010; 20:1740-7; PMID:20980554; http://dx.doi.org/10.1101/gr.111021.110
15. Blaxter ML, De Ley P, Garey JR, Liu LX, Scheldeman P, Vierstraete A, et al. A molecular evolutionary framework for the phylum Nematoda. Nature 1998; 392:71-5; PMID:9510248; http://dx.doi.org/10.1038/32160
16. Haag S. The evolution of nematode sex determination: C. elegans as a reference point for comparative biology (December 29 2005). In: The C. elegans Research Community ed. WormBook. http://www.wormbook.org
17. Kiontke KC, Félix MA, Ailion M, Rockman MV, Braendle C, Pénigault JB, et al. A phylogeny and molecular barcodes for Caenorhabditis, with numerous new species from rotting fruits. BMC Evol Biol 2011; 11:339; PMID:22103856; http://dx.doi.org/10.1186/1471-2148-11-339
18. Mayer WE, Herrmann M, Sommer RJ. Phylogeny of the nematode genus Pristionchus and implications for biodiversity, biogeography and the evolution of hermaphroditism. BMC Evol Biol 2007; 7:104; PMID:17605767; http://dx.doi.org/10.1186/1471-2148-7-104
19. Redman E, Grillo V, Saunders G, Packard E, Jackson F, Berriman M, et al. Genetics of mating and sex determination in the parasitic nematode Haemonchus contortus. Genetics 2008; 180:1877-87; PMID:18854587; http://dx.doi.org/10.1534/genetics.108.094623
20. Bird DM, Williamson VM, Abad P, McCarter J, Danchin EG, Castagnone-Sereno P, et al. The genomes of root-knot nematodes. Annu Rev Phytopathol 2009; 47:333-51; PMID:19400640; http://dx.doi.org/10.1146/annurev-phyto-080508-081839
21. Ciche T. The biology and genome of Heterorhabditis bacteriophora (February 20 2007). In: The C. elegans Research Community ed. WormBook. http://www.wormbook.org
22. Viney ME. A genetic analysis of reproduction in Strongyloides ratti. Parasitology 1994; 109:511-5; PMID:7800419; http://dx.doi.org/10.1017/S0031182000080768
23. Pires-daSilva A. Evolution of the control of sexual identity in nematodes. Semin Cell Dev Biol 2007;
24. Boag PR, Newton SE, Gasser RB. Molecular aspects of sexual development and reproduction in nematodes and schistosomes. Adv Parasitol 2001; 50:153-98; PMID:11757331; http://dx.doi.org/10.1016/S0065-308X(01)50031-7
25. Hasegawa K, Mota MM, Futai K, Miwa J. Chromosome structure and behaviour in Bursaphelenchus xylophilus (Nematoda: Parasitaphelenchidae) germ cells and early embryo. Nematology 2006; 8:425-34; http://dx.doi.org/10.1163/156854106778493475
26. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 2007; 23:1061-7; PMID:17332020; http://dx.doi.org/10.1093/bioinformatics/btm071
27. Stanke M, Schöffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 2006; 7:62; PMID:16469098; http://dx.doi.org/10.1186/1471-2105-7-62
28. Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, et al. Ensembl 2011. Nucleic Acids Res 2011; 39(Database issue):D800-6; PMID:21045057; http://dx.doi.org/10.1093/nar/gkq1064
29. Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, et al. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 2010; 38(Database issue):D196-203; PMID:19892828; http://dx.doi.org/10.1093/nar/gkp931
30. Li H, Coghlan A, Ruan J, Coin LJ, Hériché JK, Osmotherly L, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res 2006; 34(Database issue):D572-80; PMID:16381935; http://dx.doi.org/10.1093/nar/gkj118
31. Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C. OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 2011; 39(Database issue):D289-94; PMID:21113020; http://dx.doi.org/10.1093/nar/gkq1238
32. Chen F, Mackey AJ, Stoeckert CJ Jr., Roos DS. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006; 34(Database issue):D363-8; PMID:16381887; http://dx.doi.org/10.1093/nar/gkj123
33. Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, et al. PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res 2003; 31:334-41; PMID:12520017; http://dx.doi.org/10.1093/nar/gkg115
34. Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD. PANTHER version 7: improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic Acids Res 2010; 38(Database issue):D204-10; PMID:20015972; http://dx.doi.org/10.1093/nar/gkp1019
35. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 2009; 19:327-35; PMID:19029536; http://dx.doi.org/10.1101/gr.073585.107
36. Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, et al. eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res 2010; 38(Database issue):D190-5; PMID:19900971; http://dx.doi.org/10.1093/nar/gkp951
37. Horvitz HR, Brenner S, Hodgkin J, Herman RK. A uniform genetic nomenclature for the nematode Caenorhabditis elegans. Mol Gen Genet 1979; 175:129-33; PMID:292825; http://dx.doi.org/10.1007/BF00425528
38. Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs.
39. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A, et al.; NISC Comparative Sequencing Program. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 2005; 15:901-13; PMID:15965027; http://dx.doi.org/10.1101/gr.3577405
40. Williams GW, Davis PA, Rogers AS, Bieri T, Ozersky P, Spieth J. Methods and strategies for gene structure curation in WormBase. Database (Oxford) 2011; 2011:baq039; PMID:21543339; http://dx.doi.org/10.1093/database/baq039
41. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009; 25:1105-11; PMID:19289445; http://dx.doi.org/10.1093/bioinformatics/btp120
42. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010; 28:511-5; PMID:20436464; http://dx.doi.org/10.1038/nbt.1621
43. Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, et al.; modENCODE Consortium. Unlocking the secrets of the genome. Nature 2009; 459:927-30; PMID:19536255
44. Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, et al.; modENCODE Consortium. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 2010; 330:1775-87; PMID:21177976; http://dx.doi.org/10.1126/science.1196914
45. Allen MA, Hillier LW, Waterston RH, Blumenthal T. A global analysis of C. elegans trans-splicing. Genome Res 2011; 21:255-64; PMID:21177958; http://dx.doi.org/10.1101/gr.113811.110
46. Mangone M, Manoharan AP, Thierry-Mieg D, Thierry-Mieg J, Han T, Mackowiak SD, et al. The landscape of C. elegans 3’UTRs. Science 2010; 329:432-5; PMID:20522740; http://dx.doi.org/10.1126/science.1191244
47. Hillier LW, Reinke V, Green P, Hirst M, Marra MA, Waterston RH. Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res 2009; 19:657-66; PMID:19181841; http://dx.doi.org/10.1101/gr.088112.108
48. Jan CH, Friedman RC, Ruby JG, Bartel DP. Formation, regulation and evolution of Caenorhabditis elegans 3’UTRs. Nature 2011; 469:97-101; PMID:21085120; http://dx.doi.org/10.1038/nature09616
49. She R, Chu JS, Uyar B, Wang J, Wang K, Chen N. genBlastG: using BLAST searches to build homologous gene models. Bioinformatics 2011; 27:2141-3; PMID:21653517; http://dx.doi.org/10.1093/bioinformatics/btr342
50. Zuryn S, Le Gras S, Jamet K, Jarriault S. A strategy for direct mapping and identification of mutations by whole-genome sequencing. Genetics 2010; 186:427-30; PMID:20610404; http://dx.doi.org/10.1534/genetics.110.119230
51. Sarin S, Bertrand V, Bigelow H, Boyanov A, Doitsidou M, Poole RJ, et al. Analysis of multiple ethyl methanesulfonate-mutagenized Caenorhabditis elegans strains by whole-genome sequencing. Genetics 2010; 185:417-30; PMID:20439776; http://dx.doi.org/10.1534/genetics.110.116319
52. Flibotte S, Edgley ML, Chaudhry I, Taylor J, Neil SE, Rogula A, et al. Whole-genome profiling of mutagenesis in Caenorhabditis elegans. Genetics 2010; 185:431-41; PMID:20439774; http://dx.doi.org/10.1534/genetics.110.116616
53. Sarin S, Prabhu S, O’Meara MM, Pe’er I, Hobert O. Caenorhabditis elegans mutant allele identification by whole-genome sequencing. Nat Methods 2008; 5:865-7; PMID:18677319; http://dx.doi.org/10.1038/
Genome Res nmeth.1249 18:362-70; PMID:17306573; http://dx.doi.org/10. 2008; 18:1814-28; PMID:18849524; http://dx.doi.org/ 1016/j.semcdb.2006.11.014 10.1101/gr.076554.108 20 Worm Volume1Issue1 54. Hillier LW, Marth GT, Quinlan AR, Dooling D, 56. Karsch-Mizrachi I, Nakamura Y, Cochrane G; Inter- 58. SeverinJ,BealK,VilellaAJ,FitzgeraldS,SchusterM, FewellG,BarnettD,etal.Whole-genomesequencing nationalNucleotideSequenceDatabaseCollaboration. GordonL,etal.eHive:anartificialintelligenceworkflow andvariantdiscoveryinC.elegans.NatMethods2008; The International Nucleotide Sequence Database systemforgenomicanalysis.BMCBioinformatics2010; 5:183-8;PMID:18204455;http://dx.doi.org/10.1038/ Collaboration. Nucleic Acids Res 2012; 40(Database 11:240; PMID:20459813; http://dx.doi.org/10.1186/ nmeth.1179 issue):D33-7; PMID:22080546; http://dx.doi.org/10. 1471-2105-11-240 55. SherryST,WardMH,KholodovM,BakerJ,PhanL, 1093/nar/gkr1006 Smigielski EM, et al. dbSNP: the NCBI database of 57. PotterSC,ClarkeL,CurwenV,KeenanS,MonginE, geneticvariation.NucleicAcidsRes2001;29:308-11; Searle SM, et al. The Ensembl analysis pipeline. PMID:11125122; http://dx.doi.org/10.1093/nar/29.1. Genome Res 2004; 14:934-41; PMID:15123589; 308 http://dx.doi.org/10.1101/gr.1859804 © 2012 Landes Bioscience. Do not distribute. www.landesbioscience.com Worm 21