Methods in Molecular Biology 1079

Multiple Sequence Alignment Methods

Edited by David J. Russell
Department of Electrical Engineering, University of Nebraska–Lincoln, Lincoln, NE, USA

Series Editor: John M. Walker, School of Life Sciences, University of Hertfordshire, Hatfield, Hertfordshire, AL10 9AB, UK
For further volumes: http://www.springer.com/series/7651

ISSN 1064-3745
ISSN 1940-6029 (electronic)
ISBN 978-1-62703-645-0
ISBN 978-1-62703-646-7 (eBook)
DOI 10.1007/978-1-62703-646-7
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013947475
© Springer Science+Business Media, LLC 2014

Chapter 4 was created within the capacity of an US governmental employment. US copyright protection does not apply.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper
Humana Press is a brand of Springer
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Introduction

Multiple sequence alignment has become one of the indispensable tools of bioinformatics, in fact of biology, as scientists try to make sense of the rapidly increasing flood of sequence information. Multiple sequence alignments are fundamental to tasks such as homology searches, genomic annotation, protein structure prediction, and the areas of computational evolutionary biology, gene regulation networks, and functional genomics. Over the last 25 years, and increasingly over the last 10 years, a number of different multiple sequence alignment algorithms and implementations have been developed. As many of these implementations are well on their way to becoming standard laboratory tools, there was a need for a single source that would provide an in-depth introduction to, and analysis of, the various algorithms being used. And who better to describe these algorithms and their nuances than the people who developed them? Hence this handbook.
Who Might Find This Handbook of Use

This volume is intended as both a multiple sequence alignment textbook and as a reference book; it begins at a level suitable for those with no previous exposure to the problem of performing sequence alignment and carries the reader through to a reasonable degree of proficiency at understanding how most industry-standard alignment algorithms achieve their results. The people who might find this handbook of use are computational biologists in general, people involved in tasks and areas that use multiple sequence alignments in particular, and students embarking on a study of computational biology.

For novices to the field, the chapters presented in the first part of the handbook introduce the fundamental concepts necessary for understanding how and why sequence alignment algorithms function the way they do. Because the results of multiple sequence alignments have such a direct impact on our understanding and interpretation of the information contained in biological sequences, it is important to understand the workings and limitations of different algorithms. This is especially true for practitioners in the field. Treating multiple sequence alignment as a black box which delivers results to be used unquestioningly could be a recipe for disaster. The chapters in the second part of the handbook describe detailed practical procedures for the most commonly used multiple sequence alignment algorithms available today. Additionally, extensive practical detail for each algorithm's implementation is provided such that a competent scientist who is unfamiliar with the method can carry out the technique successfully by simply following the detailed, practical procedures presented.

Organization

The first set of five chapters deals with issues common to all multiple sequence alignment algorithms. At the heart of many multiple sequence alignment schemes is the idea of dynamic programming.
This is a solution strategy in which the problem to be solved is broken up into overlapping subproblems which are solved and then combined to provide a solution to the overall problem. Chapter 1 provides a thorough description of the dynamic programming approach and its application to the pairwise sequence alignment problem. Because the generation of multiple sequence alignments is an NP-complete problem, there is a need for heuristic strategies. Chapter 2 details the various heuristic approaches currently used to generate multiple alignments. When using a heuristic approach, it is necessary to find objective scoring techniques to select between possible multiple alignments. Chapter 3 provides a survey of different scoring schemes that can be used during multiple sequence alignment. Given the number of different multiple alignment algorithms available, an important issue is performance comparison. This is usually done using benchmarks. Different benchmarking strategies are studied in Chapter 4, considering desirable properties of benchmarks. Multiple sequence alignments assume that the sequences being aligned are homologous. The process of selecting homologous sequences using the BLAST and FASTA packages is detailed in Chapter 5.

Each of the 13 chapters in the second set deals with a particular multiple sequence alignment algorithm or package. The most widely used algorithms for multiple sequence alignment have been the Clustal algorithms. The latest version of Clustal, Clustal Omega, is described in detail in Chapter 6. Almost as well known as the Clustal algorithms are the T-Coffee (Tree-Based Consistency Objective Function for Alignment Evaluation) algorithms. As the authors point out, the T-Coffee algorithms incorporate structural, evolutionary, and experimental evidence to reach a more meaningful and accurate multiple sequence alignment. Chapter 7 provides a practical overview of various T-Coffee implementations. Both the Clustal and T-Coffee algorithms are progressive algorithms.
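As a concrete illustration of the dynamic programming strategy described above for the pairwise sequence alignment problem, the following minimal sketch computes a global alignment score in the Needleman-Wunsch style. The function name and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration only; they are not taken from Chapter 1.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Return the best global alignment score of a and b.

    The scoring values are illustrative assumptions, not values
    prescribed by the handbook.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j];
    # each cell is one of the overlapping subproblems.
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Combine the three neighbouring subproblem solutions into each cell.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + substitution  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap                     # gap inserted in b
            left = score[i][j - 1] + gap                   # gap inserted in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]
```

Filling the table takes time proportional to the product of the two sequence lengths. Extending the same table to many sequences at once grows exponentially with the number of sequences, which is one way to see why the heuristic strategies of Chapter 2 are needed.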
The MAFFT algorithm, which has been gaining in popularity in recent years, uses an iterative refinement approach to provide a fast alignment algorithm. The MAFFT algorithm is described in Chapter 8 along with the MUSCLE algorithm. The chapter contains detailed instructions in the use of several different popular options in the MAFFT package. Probcons is a well-known example of an algorithm that uses Hidden Markov Models (HMMs) to provide sequence alignment. Probcons and Probalign, which uses a partition function approach, are described in Chapter 9. One of the primary applications of multiple sequence alignment is in phylogenetic analysis. PRANK is a phylogeny-aware alignment algorithm which uses phylogenetic information to distinguish between alignment gaps caused by insertions and deletions. This determination can be used to provide the inferred ancestral sequences and mark the alignment gaps differently depending on their origin in insertion or deletion events. Chapter 10 provides a detailed description of PRANK and provides practical advice for using PRANK for evolutionary analysis. Chapter 11 describes GramAlign, an alignment algorithm that uses a grammar-based relative complexity distance metric to determine the alignment order, the benefit being a computationally efficient and scalable program useful for managing the increasing amount and size of biological data made available due to the continuing advancements in sequencing technology. Detection of local homologies is another major application of multiple sequence alignment. The DIALIGN algorithms construct multiple alignments from local pairwise sequence similarities, thus making them particularly useful for discovering conserved functional regions in sequences that share only local homologies but are otherwise unrelated.
The different DIALIGN algorithms are described in Chapter 12. Another algorithm that focuses on local similarities is PicXAA, a nonprogressive, greedy algorithm that uses regions of high local similarity to drive the initial alignment, which can then be iteratively refined. The PicXAA algorithm, as well as its implementation and usage, is described in Chapter 13. The computational cost of multiple sequence alignments can be defrayed in part by the intelligent use of the multiple cores with which most current computers are equipped. MSAprobs, described in Chapter 14, is a progressive alignment method which, along with various other improvements, has been parallelized using multithreading for use on multicore CPUs. Phylogeny inference often includes a paradox in which the accuracy of an inferred phylogeny depends on the accuracy of a multiple sequence alignment which depends on the accuracy of the inter-sequence distance metric. Many alignment techniques use a phylogeny to specify these distances, and so each inference relies on the accuracy of the other. Chapter 15 presents SATé, an iterative method for simultaneously estimating accurate multiple sequence alignments and phylogenetic trees. The PRALINE toolkit is described in Chapter 16. The algorithms in PRALINE use progressive alignment; however, instead of using a pre-determined guide tree, they continuously reevaluate at each stage which alignment will be optimal, thus generating an adaptive guide tree on the fly. As reflected in its name (Profile ALIgNmEnt), PRALINE uses various global, local, and homology-extended profile preprocessing protocols to address the problems caused by the greediness of a progressive alignment method. The algorithms described in the last two chapters, PROMALS3D and MSACompro, both focus on protein sequences. The PROMALS3D algorithm, described in Chapter 17, uses a multipronged approach including fast sequence alignment and the utilization of side information.
Fast sequence alignment methods align similar sequences, while additional information such as structure-based constraints from alignments of three-dimensional structures, for relatively dissimilar sequences, is used to construct multiple sequence alignments. The MSACompro algorithm described in Chapter 18 makes use of predicted secondary structure, relative solvent accessibility, and residue–residue contact information to improve the accuracy of multiple sequence alignments, deriving the structural information from the sequence itself, rather than from an external database.

The various multiple sequence alignment algorithms presented in this handbook give a flavor of the broad range of choices available for multiple sequence alignment generation. Their diversity is a reflection of the complexity of the multiple sequence alignment problem and the amount of information that can be obtained from multiple sequence alignments. Each of these chapters not only describes the algorithm it covers but also presents instructions and tips on using its implementation. This handbook will hopefully provide a readily available resource which will allow practitioners to experiment with different algorithms and find the particular algorithm that is of most use in their application.

Lincoln, NE, USA
David J. Russell

Contents

Preface
Contributors

PART I THEORY

1 Dynamic Programming
  Ö. Ufuk Nalbantoğlu
2 Heuristic Alignment Methods
  Osamu Gotoh
3 Objective Functions
  Haluk Doğan and Hasan H. Otu
4 Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment
  Stefano Iantorno, Kevin Gori, Nick Goldman, Manuel Gil, and Christophe Dessimoz
5 BLAST and FASTA Similarity Searching for Multiple Sequence Alignment
  William R. Pearson

PART II ALIGNMENT TECHNIQUES

6 Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences
  Fabian Sievers and Desmond G. Higgins
7 T-Coffee: Tree-Based Consistency Objective Function for Alignment Evaluation
  Cedrik Magis, Jean-François Taly, Giovanni Bussotti, Jia-Ming Chang, Paolo Di Tommaso, Ionas Erb, José Espinosa-Carrasco, and Cedric Notredame
8 MAFFT: Iterative Refinement and Additional Methods
  Kazutaka Katoh and Daron M. Standley
9 Multiple Sequence Alignment Using Probcons and Probalign
  Usman Roshan
10 Phylogeny-aware alignment with PRANK
  Ari Löytynoja
11 GramAlign: Fast alignment driven by grammar-based phylogeny
  David J. Russell
12 Multiple Sequence Alignment with DIALIGN
  Burkhard Morgenstern
13 PicXAA: A Probabilistic Scheme for Finding the Maximum Expected Accuracy Alignment of Multiple Biological Sequences
  Sayed Mohammad Ebrahim Sahraeian and Byung-Jun Yoon