Methods in Molecular Biology 1851

Computational Methods in Protein Evolution

Edited by Tobias Sikosek
GlaxoSmithKline, Cellzome - a GSK company, Meyerhofstrasse 1, Heidelberg, Baden-Württemberg, Germany

Series Editor
John M. Walker
School of Life and Medical Sciences, University of Hertfordshire, Hatfield, Hertfordshire AL10 9AB, UK

For further volumes: http://www.springer.com/series/7651

ISSN 1064-3745
ISSN 1940-6029 (electronic)
ISBN 978-1-4939-8735-1
ISBN 978-1-4939-8736-8 (eBook)
https://doi.org/10.1007/978-1-4939-8736-8
Library of Congress Control Number: 2018954227
© Springer Science+Business Media, LLC, part of Springer Nature 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.

Preface

Proteins are the most versatile kind of molecule that we know and the result of a long evolutionary process. During this process, countless rearranging, mutating, and replicating strands of DNA have managed both to encode and conserve the proteins that allow them to replicate and stay intact and, on the other hand, to let their proteins change and ultimately help them replicate more than other strands of DNA. All cells make proteins in their protein factories called ribosomes, where the information in the DNA of a gene is translated according to the ancient genetic code into strings of amino acids, which follow the laws of thermodynamics and molecular forces to fold up into specific wobbly three-dimensional shapes. Protein evolution happens whenever an accidental “typo” (a mutation) in the gene is translated into a modified protein, and that protein is released into the busy commotion within the cell, packed within a dense soup of other molecules in water. Whatever this new protein does differently from its predecessor can determine the fate of that mutation, making it either an essential innovation, a terrible mistake that gets erased, or something that just stays around for a while without being noticed, maybe to play a role in the distant future.

This book is a compilation of methods that can be applied to various problems related to protein sequence and structure. It is a diverse collection of approaches ranging from the broadly conceptual (“protein space”) to very specific applications (“antibody modeling”). The term “evolution” is used slightly differently in various fields of science.
While evolutionary biologists think about the natural process of Darwinian evolution (and other post-Darwinian forms of evolution of organisms living in populations and environments), biochemists take a more design-oriented approach to evolution, using the evolutionary process in vitro or in silico to make proteins with certain desired properties. Physicists, on the other hand, use the term evolution to describe a continuous process in time that changes a system from one state to another. While physics plays a significant role in this book, it is the first two notions of evolution that will be described in the following chapters.

Evolutionary research has made extensive use of computers. While the result of evolution can be readily studied at the macroscopic, phenotypic level, evolutionary biology has always had a strong theoretical component, since for a long time the actual process could rarely be observed directly. The underlying patterns of inheritance and the interplay between geography and population dynamics have been described in mathematical terms, and this work has always accompanied the progress made in the molecular biology of cells that eventually elucidated the core mechanisms of inheritance: the information stored in DNA and how it is replicated and passed on, imperfectly, to future generations. The field of Bioinformatics was born as soon as the first sequences of genes and proteins had been published in large enough quantity to be amenable to direct sequence-to-sequence comparisons. The fields of Molecular Evolution and Phylogenetics were close companions of this development, in which mathematical models and computational algorithms were combined to reconstruct the most likely evolutionary history given the observed DNA sequences. Protein sequences have been a free giveaway due to the ready translatability of the amino acid sequence from DNA based on the almost universal genetic code.
DNA sequences became the main source material of molecular evolution research for quite a while, further spurred by the Human Genome Project and later the advent of the next-generation sequencing data explosion. Evolutionary relationships within populations and among species were revealed in ever greater detail.

Still, no matter how much genetic sequence data has become available, many aspects of how genetics translates to observable (phenotypic) changes cannot be understood at that level of description. Network science is another toolkit rooted in math and computation that is used to study evolution at the interface between genotype and phenotype. There are networks representing physical and chemical molecular interactions within a cell, the flow of information and cell-level “computation” and communication, as well as more abstract networks describing the relationships and similarities between gene and protein sequences, including the entire “universe” of known proteins. While biological network science, often called systems biology, comes close to providing a working model of the cellular phenotype, the real “gap” in understanding where a mutation in the DNA sequence makes a difference to the survival and fitness of an entire organism is how physical interactions, the “edges” or connections in systems biology networks, result from biophysical properties of proteins, which can be altered by mutations. It is this point, where changes of DNA translate into altered protein structure and function, that most of the methods in this book are focused on.

While Molecular Evolution was a backward-facing, almost historical, discipline in its early days, it has increasingly matured into an “applicable” science due to its intersections with Biochemistry and Biophysics. Protein evolution is therefore much more than just the description of evolutionary relationships based on sequence differences.
It has become a powerful tool for interfering with the evolution of pathogens, for devising therapies against mutation-based diseases such as cancers, and for designing novel enzymes with properties that can go beyond naturally evolved functions. Methods from evolution can be easily applied whenever genetic variation is at play, and this variation is what makes all humans unique and sometimes even determines why diseases and infections affect each of us differently.

While each chapter in this book is the unique work of its authors and there is no predefined “narrative” to this book, some common themes become apparent.

The first theme is that of mutations of single amino acids, i.e., point mutations. Predicting their effect on the physical structure of a protein is an important capability that links the abundance of sequence information with the comparatively few known structures (Chapters 1 and 2). Other mutational mechanisms lead to gene duplication (Chapter 3) and even de novo emergence of new genes (Chapter 4).

Likewise, the understanding of pairwise correlated mutations can be used to reveal structure information where none is available, because the fates of spatially close (and physically interacting) amino acids are evolutionarily linked and coevolve (Chapters 5, 6, and 7).

Going back into evolutionary history, the structure and function of ancestral proteins can be reconstructed and used productively, since these may bear similar functions to their extant descendants yet also may have some new functional properties (Chapters 8 and 9). Many formerly sequence-based methods such as sequence alignments and phylogenies can be improved by applying a more structural and biophysical viewpoint (Chapters 10 and 11).
Instead of exploring similar proteins along evolutionary time, one can of course also compare existing proteins based on their similarity in sequence and structure. A number of classification schemes for organizing all known proteins exist, and it is possible to explore an entire “protein universe,” often by breaking full proteins into even smaller building units called domains (Chapters 12, 13, 14, 15, and 16). Homology modeling makes use of these similarities by fitting the sequences of proteins without known structure to the known structures of proteins with similar sequence (Chapter 17). This structure prediction can also be extended to protein-protein interactions (Chapter 18), and even some structural properties of proteins lacking a fixed structure, i.e., disordered/unstructured proteins, can be predicted (Chapter 19). Another important aspect related to disorder is the intrinsic dynamic nature of folded proteins, which always exist as an ensemble of conformations, some of which become favored or disfavored with evolutionary changes (Chapter 20).

Finally, evolutionary principles are at work in shaping such versatile proteins as antibodies or enzymes, which can also be designed to have certain properties in silico by applying directed evolution, i.e., where the evolutionary endpoint, but not its path, is determined by the researcher (Chapters 21 and 22).

The book covers a wide range of computational approaches, including the dynamic programming techniques of sequence alignments, the clustering methods of phylogenies, physics-based approaches such as molecular dynamics simulations, and a range of statistical, graph-based, and machine learning methods. While the authors take the time to give some background and references in the introductory sections, this book is not a textbook, and more detailed descriptions of underlying theory and algorithms may have to be found elsewhere. Nevertheless, I think that there is a lot to be learned from this book for an interdisciplinary readership.
I sincerely hope that this book offers many useful workflows and techniques that help many researchers and students working with proteins computationally. I also strongly encourage the reader to go beyond the individual protocol and mix and match the different methods to come up with new innovative solutions. That’s what evolution would do.

Heidelberg, Germany
Tobias Sikosek

Contents

Preface
Contributors
1 Predicting the Effect of Mutations on Protein Folding and Protein-Protein Interactions
Alexey Strokach, Carles Corbi-Verge, Joan Teyra, and Philip M. Kim
2 Accurate Calculation of Free Energy Changes upon Amino Acid Mutation
Matteo Aldeghi, Bert L. de Groot, and Vytautas Gapsys
3 Protocols for the Molecular Evolutionary Analysis of Membrane Protein Gene Duplicates
Laurel R. Yohe, Liang Liu, Liliana M. Dávalos, and David A. Liberles
4 Computational Prediction of De Novo Emerged Protein-Coding Genes
Nikolaos Vakirlis and Aoife McLysaght
5 Coevolutionary Signals and Structure-Based Models for the Prediction of Protein Native Conformations
Ricardo Nascimento dos Santos, Xianli Jiang, Leandro Martínez, and Faruck Morcos
6 Detecting Amino Acid Coevolution with Bayesian Graphical Models
Mariano Avino and Art F. Y. Poon
7 Context-Dependent Mutation Effects in Proteins
Frank J. Poelwijk
8 High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function
Kelsey Aadland, Charles Pugh, and Bryan Kolaczkowski
9 Ancestral Sequence Reconstruction as a Tool for the Elucidation of a Stepwise Evolutionary Adaptation
Kristina Straub and Rainer Merkl
10 Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information
Joseph L. Herman
11 The Influence of Protein Stability on Sequence Evolution: Applications to Phylogenetic Inference
Ugo Bastolla and Miguel Arenas
12 Navigating Among Known Structures in Protein Space
Aya Narunsky, Nir Ben-Tal, and Rachel Kolodny
13 A Graph-Based Approach for Detecting Sequence Homology in Highly Diverged Repeat Protein Families
Jonathan N. Wells and Joseph A. Marsh
14 Exploring Enzyme Evolution from Changes in Sequence, Structure, and Function
Jonathan D. Tyzack, Nicholas Furnham, Ian Sillitoe, Christine M. Orengo, and Janet M. Thornton
15 Identification of Protein Homologs and Domain Boundaries by Iterative Sequence Alignment
Dustin Schaeffer and Nick V. Grishin
16 A Roadmap to Domain Based Proteomics
Carsten Kemena and Erich Bornberg-Bauer
17 Modeling of Protein Tertiary and Quaternary Structures Based on Evolutionary Information
Gabriel Studer, Gerardo Tauriello, Stefan Bienert, Andrew Mark Waterhouse, Martino Bertoni, Lorenza Bordoli, Torsten Schwede, and Rosalba Lepore
18 Interface-Based Structural Prediction of Novel Host-Pathogen Interactions
Emine Guven-Maiorov, Chung-Jung Tsai, Buyong Ma, and Ruth Nussinov
19 Predicting Functions of Disordered Proteins with MoRFpred
Christopher J. Oldfield, Vladimir N. Uversky, and Lukasz Kurgan
20 Exploring Protein Conformational Diversity
Alexander Miguel Monzon, Maria Silvina Fornasari, Diego Javier Zea, and Gustavo Parisi
21 High-Throughput Antibody Structure Modeling and Design Using ABodyBuilder
Jinwoo Leem and Charlotte M. Deane
22 In Silico-Directed Evolution Using CADEE
Beat Anton Amrein, Ashish Runthala, and Shina Caroline Lynn Kamerlin
Index

Contributors

KELSEY AADLAND • Department of Microbiology & Cell Science, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, FL, USA
MATTEO ALDEGHI • Max Planck Institute for Biophysical Chemistry, Computational Biomolecular Dynamics Group, Göttingen, Germany
BEAT ANTON AMREIN • Associate Scientist, Tecan Schweiz AG, Männedorf, Switzerland
MIGUEL ARENAS • Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain
MARIANO AVINO • Department of Pathology and Laboratory Medicine, Western University, London, Canada
UGO BASTOLLA • Centre for Molecular Biology, Severo Ochoa (CSIC-UAM), Madrid, Spain
NIR BEN-TAL • Department of Biochemistry and Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
MARTINO BERTONI • Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
STEFAN BIENERT • Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
LORENZA BORDOLI • Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
ERICH BORNBERG-BAUER • Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
CARLES CORBI-VERGE • Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
LILIANA M. DÁVALOS • Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY, USA
CHARLOTTE M. DEANE • Department of Statistics, University of Oxford, Oxford, UK
MARIA SILVINA FORNASARI • Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, Bernal, Argentina
NICHOLAS FURNHAM • London School of Hygiene and Tropical Medicine, London, UK
VYTAUTAS GAPSYS •
Max Planck Institute for Biophysical Chemistry, Computational Biomolecular Dynamics Group, Göttingen, Germany
NICK V. GRISHIN • Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
BERT L. DE GROOT • Max Planck Institute for Biophysical Chemistry, Computational Biomolecular Dynamics Group, Göttingen, Germany
EMINE GUVEN-MAIOROV • Cancer and Inflammation Program, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, National Cancer Institute, Frederick, MD, USA
JOSEPH L. HERMAN • Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
KRISTINA STRAUB • Institute of Biophysics and Physical Biochemistry, University of Regensburg, Regensburg, Germany