
Multiple Sequence Alignments (2022) [Sperlea] [9783662644720] PDF

106 Pages·3.478 MB·English

Preview Multiple Sequence Alignments (2022) [Sperlea] [9783662644720]

Theodor Sperlea

Multiple Sequence Alignments
Which Program Fits My Data?

Theodor Sperlea
Rostock, Germany

ISBN 978-3-662-64472-0
ISBN 978-3-662-64473-7 (eBook)
https://doi.org/10.1007/978-3-662-64473-7

© Springer-Verlag GmbH Germany, part of Springer Nature 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer-Verlag GmbH, DE part of Springer Nature.
The registered company address is: Heidelberger Platz 3, 14197 Berlin, Germany

To my wife Johanna

Preface

Sometimes misunderstandings, mistakes, and frustration arise when doing research interdisciplinarily. Different disciplines speak different languages; as a result, when they talk about the same issue, they sometimes talk past each other. For different disciplines, the same word can mean different things, increasing the possibility of misunderstanding.

Bioinformatics is an interdisciplinary field par excellence. Here, experts in biology, computer science, medicine, pharmacy, chemistry, algorithmics, mathematics, and engineering come together to tackle scientific questions in collaboration. To make matters worse, bioinformatics is a fairly young field but has quickly gained enormous importance. Sequence analyses, modeling, and database searches have become an integral part of everyday biological and medical research. It is therefore all the more tragic when incorrect results are generated or interesting findings are overlooked due to incorrect use of these tools.

Some aspects of bioinformatics are almost predestined for such misunderstandings. Take multiple sequence alignments (MSAs): although this class of methods is commonly used for a wide range of biological questions, they are quite rarely explicitly discussed in the curricula of wet-lab biology students. The programs that compose these MSAs from single sequences often have an interdisciplinary history: the evolutionary biology questions they are supposed to answer were initially tackled by rather theory-heavy algorithmists. The resulting programs are black boxes for most of their users: magical objects that are used but not understood. Furthermore, every year new programs for generating MSAs are written that are more accurate or at least exciting on an algorithmic level. But for the most part, these developments miss their target audience, because as a biologist, you almost never go with the state of the art, because you use the MSA program you have always used and that has always done the job.

This book was written to help biologists to prevent misunderstandings when talking about or working with MSA programs. It is structured and written in such a way that it can explain to people with varying levels of prior knowledge how the programs that were written to produce MSAs work. The first part of this book deals with the background of MSAs: Chap. 1 is intended as an introduction both to the issues that can be addressed by MSAs and to the formats that are commonly used to store and share those same issues. Chapter 2 then embarks on a journey through the algorithmic background of the programs and aims to pass on the basics of the computer science behind them. Finally, Chap. 3 describes in greater detail a list of currently and historically important MSA programs in the second section of this chapter. Recommendations are made as to which MSA program is suitable for which problem. For this purpose, a benchmark analysis was performed, in which many different programs were applied to standardized test datasets. The methodology and technical background of this can be found in Chap. 4 and the results in Chap. 5.

The chapter structure of this book is thus designed to provide a layer-by-layer, ever-deepening introduction to the background and use of MSAs or MSA programs. This book is intended as a reference work in which the sections relevant to a given situation can be quickly identified. Some readers will be able to skip the descriptions of MSA formats because they are mainly interested in how a specific program works in detail. Nevertheless, I hope that the chapters are arranged in such a way that a generally interested person will find them an almost exciting read.

Finally, it remains for me to thank all the people without whom this book would not have been possible in this form. For the analysis in the back of the book, computations were performed on the MaRC2 high-performance computer at Philipps-Universität Marburg. For the installation and maintenance of the programs used, I would like to thank Mr. Sitt of the Hessian Competence Center for High Performance Computing, funded by the Hessian Ministry of Science and Art. I would also like to thank Prof. Dr. Torsten Waldminghaus, Prof. Dr. Dominik Heider, and Carlo Klein, who encouraged me to tackle this project. I would like to thank Johanna Sperlea for her constant encouragement and proofreading. I would like to thank my project management contacts Stefanie Schmoll and Meike Barth for their constant support and great help with style and content. Special thanks go to Sarah Koch in her role as Publishing Editor at Springer, who gave me a great leap of faith when she offered me this book project and whose ideas were very helpful in finding the concept.

Rostock, Germany
Theodor Sperlea
August 2021

Contents

Part I Background

1 Multiple Sequence Assignments: An Introduction
  1.1 Introduction
  1.2 Areas of Application of MSAs
    1.2.1 Preserved Sequence Sections: Motifs and Domains
    1.2.2 Prediction of Function and Structure
    1.2.3 Phylogeny
    1.2.4 MSAs as Everyday Tools
  1.3 Representation Formats of MSAs
    1.3.1 FASTA
    1.3.2 Clustal
    1.3.3 MSF
    1.3.4 PHYLIP
    1.3.5 NEXUS
    1.3.6 Tree Formats
    1.3.7 Graphical Visualizations

2 How Do MSA Programs Work?
  2.1 Introduction
  2.2 Pairwise Sequence Alignments
    2.2.1 A Naive Method
    2.2.2 Dynamic Programming
    2.2.3 Gaps and the Similarity Matrix
    2.2.4 Global and Local Alignments
  2.3 Multiple Sequence Alignments
    2.3.1 The Central Problem
    2.3.2 Solution 1: The Progressive Method
    2.3.3 Solution 2: The Iterative Method
    2.3.4 Solution 3: The Consistency-Based Method
    2.3.5 Solution 4: The Probabilistic Method
    2.3.6 Solution 5: The Meta or Ensemble Method
    2.3.7 Methods of the Future
    2.3.8 Special Cases Require Special Methods
  2.4 Further Topics
    2.4.1 Structure-Based MSAs
    2.4.2 BLAST and Co.
    2.4.3 Alignment-Free Methods
    2.4.4 Genome and Chromosome Alignments

3 Overview of Current MSA Programs
  3.1 Introduction
  3.2 List of Common MSA Programs
    3.2.1 DFalign
    3.2.2 Clustal
    3.2.3 ClustalW
    3.2.4 SAGA
    3.2.5 PRRP
    3.2.6 DIALIGN
    3.2.7 DIALIGN2
    3.2.8 T-Coffee
    3.2.9 MAFFT
    3.2.10 POA
    3.2.11 PRALINE
    3.2.12 Align-M
    3.2.13 MUSCLE
    3.2.14 3D-Coffee
    3.2.15 Kalign
    3.2.16 DIALIGN-T
    3.2.17 ProbCons
    3.2.18 M-Coffee
    3.2.19 R-Coffee and RM-Coffee
    3.2.20 DIALIGN-TX
    3.2.21 PRALINE™
    3.2.22 PRANK
    3.2.23 PSI-Coffee and TM-Coffee
    3.2.24 MSAProbs
    3.2.25 PicXAA
    3.2.26 AlignMe
    3.2.27 Clustal Omega
    3.2.28 ReformAlign
    3.2.29 DECIPHER

Part II Which Program Fits My Data?

4 Details of the Analysis
  4.1 Benchmarking
    4.1.1 A Marathon for MSA Programs
    4.1.2 The Crux of MSA Benchmarking
  4.2 Benchmark Datasets Used (Table 4.1)
    4.2.1 BAliBASE
    4.2.2 BRAliBASE
    4.2.3 Bench
    4.2.4 Artificial Benchmarks: Rose, IRMBASE and DIRMBASE
    4.2.5 HOMSTRAD
  4.3 Scores
    4.3.1 Sum-of-Pairs
    4.3.2 The Column Score
    4.3.3 Modeler Score
    4.3.4 Cline’s Shift Score
    4.3.5 Transitive Consistency Score

5 Decision Aid
  5.1 Introduction
  5.2 Implementation
  5.3 Results
    5.3.1 RNA
    5.3.2 DNA
    5.3.3 Proteins

Glossary
Literature
Index
