Theodor Sperlea
Multiple
Sequence
Alignments
Which Program Fits My Data?
Theodor Sperlea
Rostock, Germany
ISBN 978-3-662-64472-0 ISBN 978-3-662-64473-7 (eBook)
https://doi.org/10.1007/978-3-662-64473-7
© Springer-Verlag GmbH Germany, part of Springer Nature 2022
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer-Verlag GmbH, DE part of Springer
Nature.
The registered company address is: Heidelberger Platz 3, 14197 Berlin, Germany
To my wife Johanna
Preface
Sometimes misunderstandings, mistakes, and frustration arise when doing research
interdisciplinarily. Different disciplines speak different languages; as a result, when
they talk about the same issue, they sometimes talk past each other. For different
disciplines, the same word can mean different things, increasing the possibility of
misunderstanding.
Bioinformatics is an interdisciplinary field par excellence. Here, experts in
biology, computer science, medicine, pharmacy, chemistry, algorithmics, mathematics,
and engineering come together to tackle scientific questions in collaboration. To
make matters worse, bioinformatics is a fairly young field but has quickly gained
enormous importance. Sequence analyses, modeling, and database searches have
become an integral part of everyday biological and medical research. It is therefore
all the more tragic when incorrect results are generated or interesting findings are
overlooked due to incorrect use of these tools.
Some aspects of bioinformatics are almost predestined for such
misunderstandings. Take multiple sequence alignments (MSAs): although this
class of methods is commonly used for a wide range of biological questions, it
is quite rarely discussed explicitly in the curricula of wet-lab biology students. The
programs that compose these MSAs from single sequences often have an interdisciplinary
history: the evolutionary biology questions they are supposed to answer were
initially tackled by rather theory-heavy algorithmists. The resulting programs are
black boxes for most of their users: magical objects that are used but not understood.
Furthermore, every year new programs for generating MSAs are written that are
more accurate or at least exciting on an algorithmic level. But for the most part, these
developments miss their target audience: as a biologist, you almost never go
with the state of the art, because you use the MSA program you have always used
and that has always done the job.
This book was written to help biologists to prevent misunderstandings when
talking about or working with MSA programs. It is structured and written in such a
way that it can explain to people with varying levels of prior knowledge how the
programs that were written to produce MSAs work. The first part of this book deals
with the background of MSAs: Chap. 1 is intended as an introduction both to the
issues that can be addressed by MSAs and to the formats that are commonly used to
store and share them. Chapter 2 then embarks on a journey through the
algorithmic background of the programs and aims to pass on the basics of the
computer science behind them. Finally, the second section of Chap. 3 describes in
greater detail a list of currently and historically important MSA programs.
Recommendations are then made as to which MSA program is suitable for which
problem. For this purpose, a benchmark analysis was performed, in which many
different programs were applied to standardized test datasets. The methodology and
technical background of this analysis can be found in Chap. 4 and the results in Chap. 5.
The chapter structure of this book is thus designed to provide a layer-by-layer,
ever-deepening introduction to the background and use of MSAs or MSA programs.
This book is intended as a reference work in which the sections relevant to a given
situation can be quickly identified. Some readers will be able to skip the descriptions
of MSA formats because they are mainly interested in how a specific program works
in detail. Nevertheless, I hope that the chapters are arranged in such a way that a
generally interested person will find them an almost exciting read.
Finally, it remains for me to thank all the people without whom this book would
not have been possible in this form. For the analysis in the back of the book,
computations were performed on the MaRC2 high-performance computer at
Philipps-Universität Marburg. For the installation and maintenance of the programs
used, I would like to thank Mr. Sitt of the Hessian Competence Center for High
Performance Computing, funded by the Hessian Ministry of Science and Art. I
would also like to thank Prof. Dr. Torsten Waldminghaus, Prof. Dr. Dominik Heider,
and Carlo Klein, who encouraged me to tackle this project. I would like to thank
Johanna Sperlea for her constant encouragement and proofreading. I would like to
thank my project management contacts Stefanie Schmoll and Meike Barth for their
constant support and great help with style and content. Special thanks go to Sarah
Koch in her role as Publishing Editor at Springer, who took a great leap of faith
when she offered me this book project and whose ideas were very helpful in finding
the concept.
Rostock, Germany Theodor Sperlea
August 2021
Contents
Part I Background
1 Multiple Sequence Alignments: An Introduction
1.1 Introduction
1.2 Areas of Application of MSAs
1.2.1 Preserved Sequence Sections: Motifs and Domains
1.2.2 Prediction of Function and Structure
1.2.3 Phylogeny
1.2.4 MSAs as Everyday Tools
1.3 Representation Formats of MSAs
1.3.1 FASTA
1.3.2 Clustal
1.3.3 MSF
1.3.4 PHYLIP
1.3.5 NEXUS
1.3.6 Tree Formats
1.3.7 Graphical Visualizations
2 How Do MSA Programs Work?
2.1 Introduction
2.2 Pairwise Sequence Alignments
2.2.1 A Naive Method
2.2.2 Dynamic Programming
2.2.3 Gaps and the Similarity Matrix
2.2.4 Global and Local Alignments
2.3 Multiple Sequence Alignments
2.3.1 The Central Problem
2.3.2 Solution 1: The Progressive Method
2.3.3 Solution 2: The Iterative Method
2.3.4 Solution 3: The Consistency-Based Method
2.3.5 Solution 4: The Probabilistic Method
2.3.6 Solution 5: The Meta or Ensemble Method
2.3.7 Methods of the Future
2.3.8 Special Cases Require Special Methods
2.4 Further Topics
2.4.1 Structure-Based MSAs
2.4.2 BLAST and Co.
2.4.3 Alignment-Free Methods
2.4.4 Genome and Chromosome Alignments
3 Overview of Current MSA Programs
3.1 Introduction
3.2 List of Common MSA Programs
3.2.1 DFalign
3.2.2 Clustal
3.2.3 ClustalW
3.2.4 SAGA
3.2.5 PRRP
3.2.6 DIALIGN
3.2.7 DIALIGN2
3.2.8 T-Coffee
3.2.9 MAFFT
3.2.10 POA
3.2.11 PRALINE
3.2.12 Align-M
3.2.13 MUSCLE
3.2.14 3D-Coffee
3.2.15 Kalign
3.2.16 DIALIGN-T
3.2.17 ProbCons
3.2.18 M-Coffee
3.2.19 R-Coffee and RM-Coffee
3.2.20 DIALIGN-TX
3.2.21 PRALINETM
3.2.22 PRANK
3.2.23 PSI-Coffee and TM-Coffee
3.2.24 MSAProbs
3.2.25 PicXAA
3.2.26 AlignMe
3.2.27 Clustal Omega
3.2.28 ReformAlign
3.2.29 DECIPHER
Part II Which Program Fits My Data?
4 Details of the Analysis
4.1 Benchmarking
4.1.1 A Marathon for MSA Programs
4.1.2 The Crux of MSA Benchmarking
4.2 Benchmark Datasets Used (Table 4.1)
4.2.1 BAliBASE
4.2.2 BRAliBASE
4.2.3 Bench
4.2.4 Artificial Benchmarks: Rose, IRMBASE and DIRMBASE
4.2.5 HOMSTRAD
4.3 Scores
4.3.1 Sum-of-Pairs
4.3.2 The Column Score
4.3.3 Modeler Score
4.3.4 Cline's Shift Score
4.3.5 Transitive Consistency Score
5 Decision Aid
5.1 Introduction
5.2 Implementation
5.3 Results
5.3.1 RNA
5.3.2 DNA
5.3.3 Proteins
Glossary
Literature
Index