METHODS IN MOLECULAR BIOLOGY™

Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK

For other titles published in this series, go to www.springer.com/series/7651

Bioinformatics for DNA Sequence Analysis

Edited by
David Posada
Departamento de Genética, Bioquímica e Inmunología
Facultad de Biología
Universidad de Vigo
Vigo, Spain
[email protected]

ISSN 1064-3745
e-ISSN 1940-6029
ISBN 978-1-58829-910-9
e-ISBN 978-1-59745-251-9
DOI 10.1007/978-1-59745-251-9
Library of Congress Control Number: 2008941278

© Humana Press, a part of Springer Science+Business Media, LLC 2009

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

springer.com

To Mónica and Lucas

Preface

The recent accumulation of information from genomes, including their sequences, has resulted not only in new attempts to answer old questions and solve longstanding issues in biology, but also in the formulation of novel hypotheses that arise precisely from this wealth of data.
The storage, processing, description, transmission, connection, and analysis of these data have prompted bioinformatics to become one of the most relevant applied sciences for this new century, walking hand-in-hand with modern molecular biology and clearly impacting areas like biotechnology and biomedicine.

Bioinformatics skills have now become essential for many scientists working with DNA sequences. With this idea in mind, this book aims to provide practical guidance and troubleshooting advice for the computational analysis of DNA sequences, covering a range of issues and methods that unveil the multitude of applications and relevance that bioinformatics has today. The analysis of protein sequences has been purposely excluded to gain focus. Individual book chapters are oriented toward the description of the use of specific bioinformatics tools, accompanied by practical examples, a discussion on the interpretation of results, and specific comments on strengths and limitations of the methods and tools. In a sense, chapters could be seen as enriched task-oriented manuals that will direct the reader in completing specific bioinformatics analyses.

The target audience for this book is biochemists and molecular and evolutionary biologists who want to learn how to analyze DNA sequences in a simple but meaningful fashion. Readers do not need a special background in statistics, mathematics, or computer science, just a basic knowledge of molecular biology and genetics. All the tools described in the book are free, and all of them can be downloaded or accessed through the web. Most chapters could be used for practical advanced undergraduate or graduate-level courses in bioinformatics and molecular evolution.

The book could not start anywhere other than with a description of one of the most widespread bioinformatics tools: BLAST (Basic Local Alignment Search Tool). Indeed, one of the first steps in the analysis of DNA sequences is their collection. Therefore, Chapter 1 guides the reader through the recognition of similar sequences using BLAST.
Next, the use of OrthologID for understanding the nature of this similarity is described in Chapter 2, followed by Chapter 3, which covers one of the most important stages in most bioinformatics pipelines, the alignment, and shows the basis and the application of the program MAFFT. The next set of chapters is intimately related to the study of molecular evolution. Indeed, the DNA sequences that we see today are the result of this process. In Chapter 4, SeqVis is used to detect compositional changes in DNA sequences through time, while Chapter 5 is focused on the selection of models of nucleotide substitution using jModelTest. The use of these models for phylogenetic reconstruction is described in Chapter 6, which centers on the estimation of maximum likelihood phylogenetic trees with PhyML. Indeed, the estimation of phylogenies is often the first step in many evolutionary analyses. How to combine multiple trees in a single supertree is the basis of Chapter 7, which explains the use of the program Clann. Next, Chapters 8 and 9 are centered on the characterization of two key evolutionary processes acting on DNA sequences. The use of the server Datamonkey for the detection of selection is described in Chapter 8, while Chapter 9 shows the nuts and bolts of the detection of recombination using RDP3.

The study of codon usage, which has provided many important insights at the genomic scale, is explored in Chapter 10 using CodonExplorer, an interactive database, while Chapter 11 explains how differences in the genetic code can be detected using GenDecoder. The next chapters are related to the annotation of genomes, an essential requisite for many other analyses. In Chapter 12, we learn how to predict genes using GeneID, while in Chapter 13 the identification of regulatory motifs with A-GLAM is described. Chapter 14 then explains the use of the UCSC genome browser and its applications, for example, to characterize a gene or to explore conserved elements.
The discovery of single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs) with the bioinformatics tools SNPServer, dbSNP, and SSR Taxonomy Tree is the subject of Chapter 15, and Chapter 16 highlights the use of Censor and RepeatMasker for the detection and characterization of transposable element sequences in eukaryotic genomes. To end the book, Chapter 17 explains how to make the most of DnaSP for the analysis of DNA sequences in populations.

I am very grateful to all the authors, the fundamental piece of this book, who have put a lot of effort into replying patiently to all my queries. Hopefully, the result is a set of clear and useful chapters that will be of help to other scientists. I want to thank all of them for sharing their time, wisdom, and expertise. Finally, I want to thank John Walker, the editor of the series, for his continuous advice.

Vigo, July 2008
David Posada

Contents

Preface
Contributors
1. Similarity Searching Using BLAST
   Kit J. Menlove, Mark Clement, and Keith A. Crandall
2. Gene Orthology Assessment with OrthologID
   Mary Egan, Ernest K. Lee, Joanna C. Chiu, Gloria Coruzzi, and Rob DeSalle
3. Multiple Alignment of DNA Sequences with MAFFT
   Kazutaka Katoh, George Asimenos, and Hiroyuki Toh
4. SeqVis: A Tool for Detecting Compositional Heterogeneity Among Aligned Nucleotide Sequences
   Lars Sommer Jermiin, Joshua Wing Kei Ho, Kwok Wai Lau, and Vivek Jayaswal
5. Selection of Models of DNA Evolution with jMODELTEST
   David Posada
6. Estimating Maximum Likelihood Phylogenies with PhyML
   Stéphane Guindon, Frédéric Delsuc, Jean-François Dufayard, and Olivier Gascuel
7. Trees from Trees: Construction of Phylogenetic Supertrees Using Clann
   Christopher J. Creevey and James O. McInerney
8. Detecting Signatures of Selection from DNA Sequences Using Datamonkey
   Art F. Y. Poon, Simon D. W. Frost, and Sergei L. Kosakovsky Pond
9. Recombination Detection and Analysis Using RDP3
   Darren P. Martin
10. CodonExplorer: An Interactive Online Database for the Analysis of Codon Usage and Sequence Composition
   Jesse Zaneveld, Micah Hamady, Noboru Sueoka, and Rob Knight
11. Genetic Code Prediction for Metazoan Mitochondria with GenDecoder
   Federico Abascal, Rafael Zardoya, and David Posada
12. Computational Gene Annotation in New Genome Assemblies Using GeneID
   Enrique Blanco and Josep F. Abril
13. Promoter Analysis: Gene Regulatory Motif Identification with A-GLAM
   Leonardo Mariño-Ramírez, Kannan Tharakaraman, John L. Spouge, and David Landsman
14. Analysis of Genomic DNA with the UCSC Genome Browser
   Jonathan Pevsner
15. Mining for SNPs and SSRs Using SNPServer, dbSNP and SSR Taxonomy Tree
   Jacqueline Batley and David Edwards
16. Analysis of Transposable Element Sequences Using CENSOR and RepeatMasker
   Ahsan Huda and I. King Jordan
17. DNA Sequence Polymorphism Analysis Using DnaSP
   Julio Rozas
Index

Contributors

FEDERICO ABASCAL • Departamento de Genética, Bioquímica e Inmunología, Facultad de Biología, Universidad de Vigo, Vigo, Spain
JOSEP F. ABRIL • Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Spain
GEORGE ASIMENOS • Department of Computer Science, Stanford University, Stanford, CA, USA
JACQUELINE BATLEY • Australian Centre for Plant Functional Genomics, School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
ENRIQUE BLANCO • Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Spain
JOANNA C. CHIU • Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, NJ, USA
MARK CLEMENT • Department of Computer Science, Brigham Young University, Provo, UT, USA
GLORIA CORUZZI • Department of Biology, New York University, New York, NY, USA
KEITH A. CRANDALL • Department of Biology, Brigham Young University, Provo, UT, USA
CHRISTOPHER J. CREEVEY • EMBL Heidelberg, Heidelberg, Germany
FRÉDÉRIC DELSUC • Institut des Sciences de l’Evolution de Montpellier (ISEM), UMR 5554-CNRS, Université Montpellier II, Montpellier, France
ROB DESALLE • Sackler Institute of Comparative Genomics, American Museum of Natural History, New York, NY, USA
JEAN-FRANÇOIS DUFAYARD • Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), UMR 5506-CNRS, Université Montpellier II, Montpellier, France
DAVID EDWARDS • Australian Centre for Plant Functional Genomics, School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
MARY EGAN • Department of Biology, Montclair State University, Montclair, NJ, USA
SIMON D. W. FROST • Antiviral Research Center, Department of Pathology, University of California San Diego, La Jolla, CA, USA
OLIVIER GASCUEL • Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), UMR 5506-CNRS, Université Montpellier II, Montpellier, France
STÉPHANE GUINDON • Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), UMR 5506-CNRS, Université Montpellier II, Montpellier, France; Department of Statistics, University of Auckland, Auckland, New Zealand
MICAH HAMADY • Department of Computer Science, University of Colorado, Boulder, CO, USA