ebook img

Dependence of Mutations in Beta-Lactamase TEM-1 Open Access PDF

13 Pages·2015·0.97 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Dependence of Mutations in Beta-Lactamase TEM-1 Open Access

Coevolutionary Landscape Inference and the Context- Dependence of Mutations in Beta-Lactamase TEM-1 Matteo Figliuzzi,1,2,3 Herve(cid:2) Jacquier,4,5 Alexander Schug,6 Oliver Tenaillon,4 and Martin Weigt*,2,3 1UPMC,InstitutdeCalculetdelaSimulation,SorbonneUniversite(cid:2)s,Paris,France 2ComputationalandQuantitativeBiology,UPMC,UMR7238,SorbonneUniversite(cid:2)s,Paris,France 3ComputationalandQuantitativeBiology,CNRS,UMR7238,Paris,France 4Infection,Antimicrobials,Modelling,Evolution,INSERM,Universite(cid:2) DenisDiderotParis7,UMR1137,SorbonneParisCite(cid:2),Paris, France 5ServicedeBacte(cid:2)riologie-Virologie,GroupeHospitalierLariboisie(cid:2)re-FernandWidal,AssistancePublique-Ho^pitauxdeParis(AP-HP), Paris,France 6SteinbuchCentreforComputing,KarlsruheInstituteforTechnology,Eggenstein-Leopoldshafen,Germany *Correspondingauthor:E-mail:[email protected]. D o Associateeditor:ClausWilke w n lo a d Abstract ed fro The quantitative characterization of mutational landscapes is a task of outstanding importance in evolutionary and m h medical biology: It is, for example, of central importance for our understanding of the phenotypic effect of mutations ttp relatedtodiseaseandantibioticdrugresistance.Herewedevelopanovelinferenceschemeformutationallandscapes, s://a whichisbasedonthestatisticalanalysisoflargealignmentsofhomologsoftheproteinofinterest.Ourmethodisableto c a d capture epistatic couplings between residues, and therefore to assess the dependence of mutational effects on the e m sequence context where they appear. Compared with recent large-scale mutagenesis data of the beta-lactamase TEM- ic .o 1, a protein providing resistance against beta-lactam antibiotics, our method leads to an increase of about 40% in u p explicative power as compared with approaches neglecting epistasis. We find that the informative sequence context .c o extendstoresiduesatnativedistancesofabout20A˚ fromthemutatedsite,reachingthusfarbeyondresiduesindirect m /m physicalcontact. b e /a Keywords:mutationallandscape,genotype–phenotypemapping,epistasis,coevolution,statisticalinference. rtic le -a b Introduction s depends on the genetic background in which it appears tra c Protein mutational landscapes are genotype-to-phenotype (Weinreich et al. 2006; Chou et al. 2011; Khan et al. 2011). t/3 3 mappings quantifying how mutations affect the biological For instance, in the field of human genetic diseases, is the /1 A functionality of a protein. They are closely related to fitness presenceofamutationenoughtopredictapathologyordo /26 8 r landscapesdescribingthereplicativecapacityofanorganism we have to know the whole genotype to make that asser- /25 t as a function of its genotype (Wright 1932). Their compre- tion? In a more formal way, this question is equivalent to 79 i 5 c hensiveandaccuratecharacterizationisataskofoutstanding quantifying how epistasis, that is, the interaction between 3 2 le importanceinevolutionaryandmedicalbiology:Ithasakey mutations through fitness, is shaping the mutational land- by roleinourunderstandingofmutationalpathwaysaccessible scape. At the protein level, a destabilizing mutation might gu e in the course of evolution (Kauffman and Levin 1987; haveanegligiblephenotypiceffectinaverystableprotein, st o Weinreich et al. 2006; Poelwijk et al. 2007), it can lead to but a large one in an unstable protein (Bloom et al. 2005; n 1 theidentificationofgeneticdeterminantsofcomplexdiseases Jacquieretal.2013).Ifthisdestabilizingmutationincreases, 1 A bguasideedtoonwraarrdevthareiaunntsde(CrsitraunlldiainngdoGfotlhdestfeuinnc2t0io1n0a),lacnodntitricbaun- fsotarbelxeapmroptleei,nt,haenedndzeylmetaetriicouacstiinviatyn,uitnwstiallbbleeobneen(ecfifc.iHalarinmas pril 2 0 tionofmolecularalterationstooncogenesis(Revaetal.2011). andThornton2013).Hence,themutationisexpectedtobe 19 Inthecontextofantibioticresistance,oneofthemostchal- context dependent. Moreover, once a mutation has fixed, lengingproblemsinmodernmedicine,theunderstandingof furthermutationswillbuilduponthespecificityofthatfocal the association between genetic variation and phenotypic mutation,therebycreatinganewgeneticbackgroundwith effectscan helptounveilpatternsof adaptivemutationsof itsspecificinteractionsandinterdependencies(Pollocketal. thepathogenstogaindrugresistance,andtherebyhopefully 2012). There are ample proofs of the existence of epistasis guide toward the discovery of new therapeutic strategies and condition-dependent effects (Breen etal. 2012; Harms (Fergusonetal.2013). and Thornton 2013; Schenk et al. 2013; de Visser and One key issue in the description of a mutational land- Krug 2014; Podgornaia and Laub 2015). Yet, it is not scapeistounderstandhowmuchtheeffectofamutation totallyclearwhethersuchinteractionshaveadominantor (cid:2)TheAuthor2015.PublishedbyOxfordUniversityPressonbehalfoftheSocietyforMolecularBiologyandEvolution. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionNon-CommercialLicense (http://creativecommons.org/licenses/by-nc/4.0/),whichpermitsnon-commercialre-use,distribution,andreproductioninany Open Access medium,providedtheoriginalworkisproperlycited.Forcommercialre-use,[email protected] 268 Mol.Biol.Evol.33(1):268–280 doi:10.1093/molbev/msv211 AdvanceAccesspublicationOctober6,2015 MBE CoevolutionaryLandscapeInference . doi:10.1093/molbev/msv211 a minor effect in determining a mutation’s phenotypic correlatedpatternsofaminoacidoccurrences.Inversely,cor- impact. relatedresiduesarenotnecessarilyincontact,ascorrelations Recent technological advances have made it possible to are inflated by indirect effects. Two residues, both being in simultaneouslyquantifytheeffectsofthousandstohundreds contacttoathirdresidue,willcoevolveeveniftheyarenotin ofthousandsofmutantsthrougheithergrowthcompetition direct contact. The Direct-Coupling Analysis (DCA) (Weigt (Deng et al. 2012; McLaughlin Jr et al. 2012; Melamed et al. etal.2009;Morcosetal.2011)hasbeenproposedtodisen- 2013; Roscoe etal. 2013;Podgornaia and Laub2015) or iso- tangle such indirect effects from direct (i.e., epistatic) lated allele experiments (Jacquier et al. 2013; Firnberg et al. couplings, which in turn have been observed to accurately 2014; Romero et al. 2015). Experimental resolution can be predict residue–residue contacts. DCA and closely related goodenoughtodetecteventheeffectsofsynonymousmu- methods thereby guide tertiary (Hopf et al. 2012; tations (Firnberg et al. 2014). Despite the development of Marks et al. 2012; Nugent and Jones 2012; Sulkowska et al. such high-throughput methods, measured genotypes cover 2012) and quaternary (Schug et al. 2009; Dago et al. 2012; onlyatinyfractionofsequencespace:Thenumberofpossible Hopf et al. 2014; Ovchinnikov et al. 2014) protein structure D mutantsgrowsexponentiallywiththenumberofsinglemu- prediction;andshedlightonspecificityandcrosstalkinbac- ow n tations, suchthatcheckingtheviabilityof allpossiblegeno- terialsignaltransduction(Procaccinietal.2011;Chengetal. lo a types further than one or two mutations away from a d 2014). e d reference sequence becomes infeasible, even for short poly- Inthisstudy,weproposeavariantofDCAwhichassigns fro peptides.Moreprecisely,thenumberofdistinctsingle-residue m toeachmutantsequenceastatisticalscore,whichinanext mutantsfortypicalproteinsisintherangeof103(cid:2)104.The step is used for predicting the phenotype of the mutant http Anultmhobuegrhoftahlilsdnouumblbeemruistnanottsyreetacehxepsertihmeernantaglelyoafc1c0e6ss(cid:2)ibl1e0,8it. stheqeuaepnpceroareclha,tiwveettoaktehethweilEds-cthyepreichsieaquceonlicbe.etTao-laecvtaamluaatsee s://aca isneededtoaccuratelyassesstheimportanceofepistasis.It TEM-1,amodelenzymeinbiochemistrywhichprovidesre- dem hficaisenbteefnoraragcuceudrattheatlaenxdisstcianpgemruegtaregsesnioensis(dOattwainaroewnskoitasunfd- slaisntdansccaepethoas bbeeetna-lqauctaanmtitatiavnetlyibicohtaicrsa.cteIrtiszedmmuetaastuiorinnagl ic.oup Plotkin 2014). Novel computational approaches exploring .c the minimum inhibitory concentration (MIC) of the o alternative data—in our case distant homologs—are m antibiotic (Davison et al. 2000; Jacquier et al. 2013; /m thus urgently needed to gain a comprehensive picture of b Firnberg et al. 2014). This abundance of mutagenesis data, e mutationallandscapes.Inthiscontext,thegrowingamount /a ofmutagenesisdataoffersthepossibilitytorigorouslyevalu- the rich homology information, and its well-defined rtic 3D structure make it a well-suited system for le ate the performance of in silico models of mutational -a testing any computational model of protein mutational b landscapes. landscapes. stra Several computational methods for predicting We will show that coevolutionary models for mutational ct/3 mutational effects on protein function have been pro- 3 landscapes do not only provide quantitative predictions of /1 posedovertheyears.Afirstclassrelieson“structural”infor- /2 mutational effects but, more importantly, they are able to 6 mstaabtiiolitny, m(Coarperiportetciiseetly aoln. 2ch00an5;geCshinentgheetthearl.mo20d0y6n;amNigc capture the context dependence of these effects. In this 8/257 way, the new approach manages to clearly outperform 9 and Henikoff 2006; Capra and Singh 2007; Lonquety et al. 5 3 state-of-the-art approaches such as SIFT (Ng and Henikoff 2 2009;Dehoucketal.2011),whichhavebeenarguedtoplay b 2003) and PolyPhen-2 (Adzhubei et al. 2010), which are y a key role in determining mutational effects (Bloom and g Glassman 2009; Wylie and Shakhnovich 2011; Serohijos and based on independent-site models (even if, like in the ues Shakhnovich 2014; Echave et al. 2015). A second class case of PolyPhen-2, additional structural information is in- t o n (NgandHenikoff2003;Adzhubeietal.2010)relieson“evo- tegrated into the prediction of mutational effects), which 11 lutionary” information extracted from independently evolv- themselves outperform predictors based on structural sta- Ap ing homologous proteins, showing variable amino acid bility. The approach is broadly applicable, as illustrated in a ril 2 sequences but conserved structure and function. Evolution small set of completely different systems: An RNA recog- 01 9 providesamultitudeofinformative“experiments”onmuta- nition motif (Melamed et al. 2013), the glucosidase enzyme tional landscapes. Critically important residues tend to be (Romero et al. 2015), and a PDZ domain (McLaughlin Jr conserved, whereas unfavorable residues are observed less et al. 2012). In the last system, positions most sensitive to frequently. mutation had been shown previously to fall into clusters of None of these methods is able to model the effects of coevolving residues termed sectors (Halabi et al. 2009): epistasis and sequence-context dependence of mutational Appling statistical inference we are able to get a more effects. To overcome this limitation, we take inspiration “quantitative” prediction of the impact of single point from a recent development in structural biology. It has mutations in the domain. These findings illustrate been recognized that coevolutionary information contained the potential of coevolutionary landscape models in in large families of homologous proteins allows to extract biomedical applications, through the in silico prediction accurate structural information from sequences alone of mutational effects related to not only antibiotic drug (de Juan et al. 2013): Residues in contact in a protein’sfold, resistance but also the role of mutations in rare diseases even if distant along the primary sequence, tend to show and cancer. 269 MBE Figliuzzietal. . doi:10.1093/molbev/msv211 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p FIG.1. Pipelineofthemutational-landscapeprediction:ThehomologousPfamfamilycontainingtheproteinofinterest(theBeta-lactamase2family .co m PF13354inthecaseofTEM-1)isusedtoconstructaglobalstatisticalmodelusingtheDCA.Thismodelallowstoscoremutationsbydifferencesinthe /m inferredgenotype-to-phenotypemappingbetweenthemutantandthewild-typeaminoacidsequence.Thisscore,whichisexpectedtoincorporate b e (co-)evolutionaryconstraintsactingacrosstheentirefamily,isusedasapredictorofthephenotypiceffectsofsingle(orfew)aminoacidsubstitutionsin /a theproteinofinterest. rtic le -a b s Results The simplest nontrivial parameterization assumes posi- tra c Evolutionary Modeling of Diverged Beta-Lactamase tion-specificbutindependentcontributionsofeachresidue, t/33 Sequences to Predict Mutational Effects of Single- XL /1/2 (cid:2) ða ;:::;a Þ¼ ’ðaÞ: ð2Þ 6 Residue Mutations in TEM-1 IND 1 L i i 8/2 i¼1 5 Thepipelineofourapproachisillustratedinfigure1. 7 9 Thecontribution’ðaÞmeasuringthecontributionofamino 5 In technical terms, a mutational landscape is given as a i i 32 genotype-to-phenotype mapping. To each possible amino acidaiinpositionicanbeeasilyestimatedfromamultiple- by gacaipds(sLeqdueennocteesðtah1e;:a:l:i;gnamLÞecnotnwsiisdttinhg),aofquLanatmitiantoiveapcihdesnoor- sfsrpeaqemucieefinwccoewrkaeliiggnhomtf emnpatrot(rMfiicleeSsA))m(ocfofh.doeMlmsaotle(oragilasolosusapncradollteeMdinestuhpsooidnsisgtiotfhnoer- guest o type(cid:2)ða ;:::;a Þisassigned.Thephenotypiceffectofamu- n wtaittihonamsui1nbostiatcuitdiLnbgitshmeewaisldu-rteydpbeyamthienodifafceirdenacieastcpooresition i dWameittiahniliosn).atcPhiodissssibumlbyosdtieetxuliintsitgoinngscshiemepmpislteiafi,teicsthfeeroffmesccotsreeqaurfaeotrionneag(le1sci)ntegtdloe. 11 April 2 0 (cid:3)(cid:2)ða !bÞ¼(cid:2)ða ;:::;a ;b;a ;:::;a Þ (cid:3)(cid:2) ða !bÞ¼’ðbÞ(cid:2)’ðaÞ. It becomes immediately 1 i 1 i(cid:2)1 iþ1 L IND i i i i 9 evidentthattheindependent-residuemodelisunabletocap- (cid:2)(cid:2)ða ;:::;a ;a;a ;:::;a Þ ð1Þ turethecontextdependenceofmutations,thesubstitution 1 i(cid:2)1 i iþ1 L a !bispredictedtohaveidenticaleffectsifintroducedinto i betweenthemutantandthewild-typesequence.Thisfunc- different sequence backgrounds. The score of a double mu- tion(cid:2)has,however,20L parameters,anastronomicnumber tationissimplygivenbythesumofthe(cid:3)(cid:2)-valuesofthetwo being far beyond any possibility of inference from data. single-residuemutations. Simplified parameterizations of (cid:2) reducing the number of The relation between statistically derived scores (cid:3)(cid:2) and parameters are needed. In general, a simple model can be theexperimentalMICvaluesmaybenonlinear.Thediscrete inferred more robustly from limited data, but it risks to natureofthelatterintroducessaturationeffects,inparticular miss important effects. Even if these might be captured in forstronglydeleteriousmutationswithMICvaluesbelowthe morecomplexmodels,theselatterrisktosufferfromunder- lowest measured antibiotic concentration. To address these sampling and thus overfitting effects. One of our aims is to issues,wehavedesignedarobustmappingof(cid:3)(cid:2) ða !bÞ IND i findagoodcompromisebetweenthesetwolimitations. to predicted MIC values (cid:3)^ ða !bÞ (cf. Materials and IND i 270 MBE CoevolutionaryLandscapeInference . doi:10.1093/molbev/msv211 Methods) and compared them with the experimental MIC positionionallotherresiduesinthewild-typesequences, values(cid:3) ða !bÞbylinearcorrelation.Adirectmeasure- exp i (cid:3)(cid:2) ða !b j a ;:::;a ;a ;::a Þ mentofSpearmanrankcorrelationsbetween(cid:2) and(cid:3) DCA i 1 i(cid:2)1 iþ1 L IND exp leads to numerically very similar, but slightly less robust L results. X(cid:2) (cid:3) ¼’ðbÞ(cid:2)’ðaÞþ ’ ðb;aÞ(cid:2)’ ða;aÞ : ð4Þ TheMICpredictionsusingtheindependent-residuemodel i i i ij j ij i j j¼1 showaPearsoncorrelationofR=0.63withtheexperimental MICmeasurementsofsingle-residuesubstitutionsinTEM-1. Inasecondstep,thisdifferencescoreismappedtopredicted About R2 ’39% of the variability of the experimental re- MIC values (cid:3)^ ða !bÞ and compared with the experi- DCA i sults is thusexplainable by an independent-site model built mentalvalues(cid:3) ða !bÞbylinearcorrelation. exp i onthesequencevariabilitybetweenhomologoussequences. Resultingpredictionsoutperformtheindependent-residue Very similar correlations (R2 ¼0:37) are found when com- modeling. DCA-predicted MIC values show a correlation of paring experimental results and the probabilities of being R=0.74withtheexperimentalMICmeasurementsofsingle- D o tolerated as predicted by SIFT, which, like most state-of- residue substitutions in TEM-1, that is, about R2 ’55% of w n the-art methods, is based on conservation profiles in thevariabilityoftheexperimentalresultsisexplainedbythe lo a d sequence alignments. Higher accuracy is found for DCA-inferredmutationallandscape,seefigure3,ascompared e d PolyPhen-2 (R2 ¼0:48): Its improved performance results with the 39% reported before for the independent-residue fro fromtheintegrationofaprofile-basedscorewithstructural model (IND) model. We find that DCA even outperforms m h featuresandaminoacidproperties. theintegrativemodelingofPolyPhen-2combiningsequence ttp s However, all these predictions are based on the profileswithstructuralandotherpriorbiologicalknowledge, ://a assumption that epistasis between mutations and context demonstrating the power of DCA in capturing epistatic ef- ca d dependence can be neglected. The simplest model to fectsintheTEM-1mutationallandscape. em challenge this assumption takes into account “pairwise epi- ApplyingthesameproceduretothedataofFirnbergetal. ic.o static interactions” between different residue positions in (2014), which are highly correlated with the data from up theMSA, Jacquier et al. (2013) (R=0.94), but slightly more precise .co m than that, the correlation is slightly higher /m (cid:2)DCAða1;:::;aLÞ¼XLi¼1’iðaiÞ;þX1(cid:3)i(cid:3)j(cid:3)L’ijðai;aðj3ÞÞ d(Rat¼a w0h:7ic6h;Rd2is¼pla0y:5la8r)g.eEdxicslcurdeipnagncfireosmbetthweeeanntahlyesitswtohoesxe- be/artic periments(suchdiscrepanciescouldbeeitherduetoexper- le (cf. Materials and Methods). The terms ’ ða;aÞ imental errors or due to antibiotic-specific effects) -ab parameterize the epistatic couplings between aminioj aicidjs correlations between our computational score and both stra c a anda inalignedpositionsiandj;iftheywouldbesetto data sets rise above R2 ¼0:65 (cf. supplementary fig. S2, t/3 i j 3 zero the model would reduce to the independent-site SupplementaryMaterialonline). /1 mwiothdienl t(cid:2)hIeNDD. CTAhisofmreosdideluehacsoebvoeeluntiornecewnitthly tihnetroadimucetdo queWnceecaolingcnlmudeenttshoaftdsiesqtaunetnhceomvaorloiagbsiliistyhiignhltyhienfPoframmatisvee- /268/25 infer contacts between residues from sequence information aboutthe localmutationallandscapeof TEM-1, despite the 795 lowtypicalsequenceidentityofonlyabout20%betweenthe 3 alone,andtoenablethepredictionoftertiaryandquaternary 2 pthroisteairnticslter)u.ctures(cf.thereferencesinthe“Introduction”of hpoenmdoelnocgeshanasdaTEcrMuc-1ia.lMimopreaocvteor,nacthcoeuanctciunrgacfoyrocfoannteexvtodlue-- by gue Estimating parameters from aligned sequences is a com- tion-basedapproach,andthatglobalinferencemethodssuch st o asDCAcanefficientlycapturesuchdependencies. n putationally hard task, but over the last years a number of 1 1 accurate and computationally efficient approximate algo- A p rithms have been developed (Weigt et al. 2009; Morcos Assessing the Context Dependence of Mutational ril 2 et al. 2011; Ekeberg et al. 2013; Baldassi et al. 2014). Here, Effects 0 1 9 we extend the mean-field scheme of Morcos et al. (2011) Toquantifymorepreciselytherangeofcontextdependence, (cf.MaterialsandMethods).ForTEM-1,standardDCAaccu- we apply DCA to reduced MSA. These MSAs contain the rately predicts tertiary contacts (cf. supplementary fig. S1, residue position carrying the mutation of interest, and all SupplementaryMaterialonline):Morethan60nontrivialres- residues, which are, in a representative TEM-1 crystal struc- idue–residuecontacts(minimumseparationoffiveresidues ture(PDB:1M40;Minasovetal.2002),withinadistanced max along the sequence) are predicted without error, and more (we use the minimal distance between heavy atoms as the than200ataprecisionof80%. interresiduedistance).Whenusingaverysmalld (cid:3)1:2A˚, max Havingestimated(cid:2) fromtheMSA,wecanfollowthe the mutatedresidue isconsideredon itsown, when d is DCA max samestrategyasintheindependent-residuecase.First,amu- chosentobelargerthanthemaximumdistance46.9A˚ exist- tationalscoreisintroducedasthedifferenceofthe(cid:2)-valuesof ing within the PDB structure, we are back to the full DCA the mutated and the wild-type sequences (cf. eq. 1). The modelingoftheprevioussection.Intermediated interpo- max inclusion of epistatic couplings leads to an “explicit context latebetweenthetwoextremecases.Doingso,werunDCA dependence”ofthestatisticalscoreofamutationa !bin on subalignments of residues, which are not necessarily i 271 MBE Figliuzzietal. . doi:10.1093/molbev/msv211 A B direct all residue i contacts pairs i 0.55 3D structure MSA i 2R 0.5 1 n o i acti e fr0.5 0.45 du esi r D o w i 00 25 50 nlo a 0.4 d i 0 10 20 30 40 50 ed cutoff distance fro m tFhIGe.m2.utCaotinotneaxltedffeepcetsnodfenthceeroefsimduuetaotfioinntaelreefsftec(ltasb:e(Ale)dPirionctehdeufireguorfei)n.cTlhuidsinlegadaslltroesriedsuideusew-sitpheicnifiacmsuabxaimliganlmnaetnivtse,wdihstiachncceodnmsisatxoinftcootluhmepnrse,dwihcticiohnaoref http s notnecessarilyconsecutive,butconnectedin3D.Theresultsaregivenin(B).ThemainfigureshowsthecorrelationR2betweenMICdataandour ://a predictions,as afunctionofthecutoffdistancedmax. Theinsetshowsthe averagefractionofresiduesincludedinto thereduced MSAs, againin cad dependenceofd . e max m ic .o u p as the number of model parameters grows quadratically in .co m sequence length. The inset of figure 2B shows the average /m fractionofresiduesincludedintothesub-MSA.At20A˚ itis be /a slightlyhigherthan50%,thatis,theinformativecontextofa rtic mutationisgivenbymorethanhalfofthetotalnumberof le -a residuesintheprotein. b s ItisinterestingtoobservethattheINDmodelmakesmore tra c predictionswithverylargedeviationsfromtheexperimental t/3 3 datathantheDCAmodel:Thereisanincreasednumberof /1 /2 mutations,whichareeitherpredictedtobestronglydelete- 6 8 riouseveniftheyareclosetoneutral,orviceversa.Manyof /2 5 7 thesestrongerrorsareatleastpartiallycorrectedbytheDCA 9 5 3 landscape model (cf. supplementary tables S1–S3, 2 b Supplementary Material online). By the definition of the in- y g dependentmodelintermsoffrequencycountsinindividual ue s FIG. 3. R2 between experimental fitness and predicted fitness for the MSAsequences(cf.MaterialsandMethods),amutationwith t on followingfeatures:Independent-residuemodel(IND),DCA,SIFT(SIFT), a low predicted IND score leads from a more frequent to a 1 1 Polyphen-2 (Poly), PoPMuSiC (PoP), I-Mutant2.0(sequence+structure) rareaminoacidintheconcernedMSAcolumn.However,in A p (Imut+), MUpro (MUpro), I-Mutant2.0 (Imut), molecular simulations the mutagenesis experiments some of these mutations are ril 2 (SIM), relative solvent accessibility (RSA), and Blosum62 substitution found to be admissible in the specific sequence context of 0 1 matrix(BLO). TEM-1,thatis,theyareactuallyfoundtobeclosetoneutral, 9 examplesbeing G52A, E61V, T112M,N152Y, A183V,T186P, consecutive in the primary sequence but connected in the D207V,D250Y(alltargetaminoacidsarepresentinfewtens native fold (cf. the illustration of the procedure in fig. 2A). of sequences in the MSA out of the about 2,500 functional Figure2BshowstheresultingcorrelationsbetweenMICdata homologoussequences).Forallofthesecases,DCAisableto andstatisticalpredictions,infunctionofthecutoffdistance correctatleastpartiallythestatisticalprediction.Onthecon- d .Weobservearapidincreaseinpredictivepowerwhena trary,theindependent-sitemodelpredictsthatanymutation max structural neighborhood is taken into account, but the in- between two amino acids of similar frequency in the corre- crease in correlation extends well beyond the directly con- sponding MSA column is close to neutral. Looking to the tacting residues (d ’6A˚). The maximum correlation experimental MIC, substitutions D177N, A235D, I243N, and max (R2 ’0:57) is reached around d ’20 A˚, followed by a G248Eallpredictedtobeclosetoneutralhavestronglydel- max shallowdecreasewhenincludingalsomoredistantresidues. eteriouseffects(MIC(cid:3)25).DCAcorrectsthemispredictions Thissmalldecreaseresultsprobably from overfittingeffects, byatleasttwo,onaveragebythreeMICclasses. 272 MBE CoevolutionaryLandscapeInference . doi:10.1093/molbev/msv211 Thereisasmallsetofninemutationsbadlypredictedby where(cid:4) istherelativesolventaccessiblesurfacearea(RSA)of i DCA.Innoneofthesecases,theindependentmodelingsig- residue a in position i. We use Michel Sanner’s Molecular i nificantly ameliorates predictions. Interestingly, six of these Surface algorithm (Sanner et al. 1996) applied to the PDB ninemutationsfallintothehighlygappedpartoftheMSA: structure 1M40 to estimate surface accessible surface areas DCA displays a significant loss of predictive power in the (SAS), normalized by the maximum accessibilities given in highlygappedpositionsoftheMSA,andcorrelationbetween Tien et al. (2013). We find that R2 ¼0:20 of the variability predictedandexperimental MICincreasesabove R2 ¼0:75 of the experimental fitness is explainable through RSA. whendisregardingmutationsinthisregion(seesupplemen- In general, we find that different accessibility estimates pro- taryfig.S3,SupplementaryMaterialonline). videverysimilarresults,includingtheabsoluteSAS(cf.sup- plementarymaterial,SupplementaryMaterialonline).Indeed, a simple binary classifier roughly distinguishing buried from Structural-Stability Predictions Show Lower exposed residues is almost as informative as RSA and SAS Correlations to MIC Changes than Sequence-Based Modeling values (supplementary fig. S5, Supplementary Material D o online). Note that the score (cid:3)(cid:2) does not depend on w Ithasbeenproposedbeforethattheroleofmostresiduesis RSA n thetargetaminoacidb,butonlyonthewild-typestructure. lo to make the protein properly fold, and that mutations on Note also that this R2 value, while been greater than those ade these sites mainly alter protein stability and not its activity d (WylieandShakhnovich2011):Hence,anaccurateestimation asmchaiellevredthtahnroaulglhstamtiostleiccaullasreqsuimenucleatisocnosr,esisdseuribvsetdanftrioalmly from of the change in protein stability (cid:3)(cid:3)G(cid:4)(cid:3)Gmut(cid:2)(cid:3)Gwt h homologs. ttp shouldbeabletoaccountforalargefractionofmutational s effects. Thefailureofstability-basedpredictionsofmutationalef- ://a fects may result from strong-effect mutations in or close to c a Many bioinformatic programs have been developed for d theactivesite,whosephenotypiceffectisunrelatedtoprotein e estimatingproteinstabilitychangeuponmutation:Among m stability. To assess this effect, we have repeated our analysis ic them MUpro (Cheng et al. 2006) and I-Mutant2.0 .o includingonly111mutationsfallingintotheextendedactive u p (Capriotti et al. 2005), which take the sole sequence as site (cf. the supplementary fig. S6, Supplementary Material .c o input, PoPMuSiC (Dehouck et al. 2011) and online, for details). The R2 values for both statistical models m/m I-Mutant2.0(sequence+structure) (Capriotti et al. 2005), (IND and DCA) go up strongly (R2 ¼0:52;R2 ¼0:67), b wmheitchhodcsonshsiodwer ibnocothhesreeqnutepnrceediacntidonsstriunctbueretw. eAesntehaecshe wathaelrl.eTashitshdeesmtrounctsutrraet-ebsastehdatperveodliucINttoiDornsasrhyoiwnfolitrtmDleCaAotironnoacgcauin- e/article other (cf. supplementary fig. S4, Supplementary Material ratelypredictstheeffectsof mutation fallingintothe active -ab online),wecomplementthembyextensiveforce-fieldmo- s site,andstructuralinformationdoesnot. tra lecular simulations at all-atom resolution to estimate pro- c Being grounded on complementary sources of informa- t/3 tein stability changes (cid:3)(cid:3)G induced by single point tion, predictions by evolution- and structure-based 3/1 mutations (cf. Materials and Methods for details). A score methodsarenotstronglycorrelated,asshowninsupplemen- /26 canbeassignedtoanysubstitutionofaminoacida inpo- 8 i taryfigureS4,SupplementaryMaterialonline.Alinearcom- /2 sitionibyaminoacidb, bination of DCA with structural predictors, however, yields 57 9 5 (cid:3)(cid:2)stabðai !bÞ(cid:4)(cid:2)(cid:3)(cid:3)Gðai !bÞ; ð5Þ oexnplyerliimttleenitnaclrdeaastae ignetcsotrorel0a:t6io0n(cid:2): T0h:e61exwphlaeinnedpevrfaorriamnicnegoaf 32 b y and then mapped to predicted MIC values (cid:3)^stabðai !bÞ bivariate linear regression between DCA scores and ei- gu e using the before-mentioned scheme. Pearson correlations ther solvent accessibility or Polyphen-2 predictions, as dis- s between predicted and experimental MIC are calculated: played in supplementary figure S7, Supplementary Material t on 1 We find that, although those methods which consider online. 1 A not only sequence but also structural information p (R2 ¼0:13 for PoPMuSiC and R2 ¼0:14 I-Mutant2.0 (se- ril 2 DCA Landscape Modeling Spots Stabilizing Mutations 0 quence+structure)) largely outperform those who do not 1 9 and Captures Protein-Specific Substitution Scores (R2~0:02 for MUpro and I-Mutant), one gets only a modestfurtherimprovementlettingthemutatedpolypep- The TEM-1 beta-lactamase has been the subject of intense tiderelaxthroughmolecularsimulations(R2 ¼0:17formo- studieswithregardtoproteinstructure,function,andevolu- lecularsimulations,seefig.3). tion, and a number of structurally stabilizing substitutions Itiswellknownthatresiduesburiedintheproteincoreare have been identified (Raquet et al. 1995; Wang et al. 2002; importantdeterminantsofproteinstability.Mutationaffect- Katheretal.2008;Dengetal.2012):P62S,V80I,G92D,R120G, ing these sites tends to be highly destabilizing (Ponder and E147G, H153R, M182T (strongly stabilizing), L201P, I208M, Richards1987;Bustamanteetal.2000;FranzosaandXia2009; A184V, A224V, I247V, T265M, R275L/Q, and N276D (posi- Abriata et al. 2015). Therefore, we test also to what extent tions are indicated using standard Ambler numbering; solvent accessibility explains the experimental mutation Ambleretal. 1991). Some of them were foundtoinfluence effects.Upondefining theresistancephenotype(Salverdaetal.2010).Notably,the fivehighestDCAscores(cid:3)(cid:2) outofallconsideredmutants DCA (cid:3)(cid:2)RSAðai !bÞ¼(cid:4)i; ð6Þ belong to this set: M182T, H153R, E147G, L201P, and G92D 273 MBE Figliuzzietal. . doi:10.1093/molbev/msv211 D o w n lo a d e d FIG.4. Statisticalscoresandthermodynamicstabilities.(A)Scatterplotofthelogoddratio(cid:3)(cid:2)DCA versustheexperimentalfitness(cid:3)exp,stabilizing fro mutationsmentionedinthetextarehighlightedinred.ThehighestscoringmutationsareM182T,G92D,H153R,L201PandE147G,allreportedas m h stabilizing. (B) (cid:3)(cid:2)DCA for a smaller set of single mutations is now plotted versus the change in Gibbs Free Energy relative to wild type ttp (cid:3)et(cid:3)alG.2(cid:4)002(cid:3),aGnmdutD(cid:2)en(cid:3)gGetwta,l.a2s0m12e,arseusrpeedctbivyelfyo)u.r independent studies (Ref1, Ref2, Ref3 and Ref4 are Kather et al. 2008, Raquet et al. 1995, Wang s://a c a d e m ic .o (with a large gap separating the likelihood of the strongly Discussion up stabilizing M182T from the scores of the other four, cf. .co Thecentralaimofthisarticleistheaccuratecomputational m fig. 4). More quantitatively, we found that the Gibbs Free /m Energy change relative to wild-type (cid:3)(cid:3)G of a different, inference of protein mutational landscapes to predict the be phenotypic effect of mutations. This is exemplified in the /a small set of mutations (most of which not affecting caseoftheTEM-1proteinofEscherichiacoli,abeta-lactamase rtic Amoxicillin resistance) characterized by four independent le providing antibiotic drug resistance against beta-lactams, -a studies (Raquet et al. 1995; Wang et al. 2002; Kather et al. b 2008;Dengetal.2012)ishighlycorrelatedwithDCAscores suchaspenicillin,amoxicillinorampicillin. stra Toreachthisaim,wehaveextractedinformationabouta c (RDCA=0.81) but less correlated when using independent protein and its potential mutants, which is hidden in the t/33 model(R =0.62). /1 IND sequence variability of “diverged but functional” homologs /2 We further investigate whether the statistical analysis of 6 of this protein. The central ingredient of our analysis is a 8 homologous sequences is able to capture “protein-specific /2 carefulmodelingofresiduecoevolutionbyDCA,thatis,the 57 amino acid substitution effects,” that is, if the effect of a 9 modeling includes pairwise epistasis between residues. This 5 3 specificaminoacidsubstitution(averagedoverallsequence 2 approach,initially developedin the contextofstructuralbi- b positionswherethismutationappears)isbetterdescribedby y ology to predict residue–residue contacts from sequences, g our statistical model than it would be by Blosum matrices, u has been used to define a score for each mutation, which e s which are estimated from many distinct aligned protein se- wasfoundtoexplain55%,respectively,58%ofthephenotypic t o n quences.Tothisaim,amatrixofaveragesubstitutionscoresis variabilityinthetwocorrespondingexperimentalTEM-1data 1 1 builtfromthesetofexperimentalMICvalues(cf.fig.5).We sets (Jacquier et al. 2013; Firnberg et al. 2014). This value is A p also construct an analogous matrix for the DCA-predicted substantially higher than what can be obtained by a more ril 2 MIC values of the same set of mutations, and quantify cor- 0 standard modeling approach based on sequence profiles 1 9 relationsbetweenpredictedandexperimentalaverageeffects (39%ofvariabilityexplained),whichdoesnotincludeepista- computing a Pearson correlation weighting each term with sis, or on changes in structural stability. Furthermore, our thesquarerootofthenumberofmeasuredmutationsfalling coevolutionary approach clearly outperforms state-of-the- in the related class. We find a very large correlation art approaches such as SIFT and PolyPhen-2, which are (R2 ¼0:72) between average experimental and predicted basedonnonepistaticmodels. substitution matrices. This value has to be compared with However, epistatic effects are not equally important for the substantially lower correlation found when comparing allresidues,whichmayexplainthatsomeauthorsdisagree the mutational effects in TEM-1 with the Blosum62 matrix onthecontributionofthesequencecontexttomutational (R2 ¼0:34), which provides amino acid substitution scores effect (Pollock et al. 2012; Ashenberg et al. 2013; Zou and averaged over many proteins. All other inference methods Zhang2015).Therelevantcontextdeterminingtheeffectof show substitution scores with correlations to MIC, which amutationofaresidueisnotonlygivenbyitsdirectphysical are comparable to or lower than the correlations between neighbors, but extends to a distance of about 20A˚. The MICandBlosum62. informativecontextthusincludes,on average,roughly half 274 MBE CoevolutionaryLandscapeInference . doi:10.1093/molbev/msv211 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/3 3 /1 /2 6 8 /2 5 7 9 5 FIG.5. Protein-specificaminoacidsubstitutioneffectsinTEM-1:Aminoacidsubstitutioneffects,averagedoverexperimentalmeasurements(A),DCA 32 predictions(B),andextractedfromBLOSUM62(C).Bluesquarescorrespondtonearlyneutralmutations(logMIC45.3),whereasyellowsquares by correspondtohighlydeleteriousmutations(logMIC<2.6).Whitesquaresareusedforunobservedsubstitutions.Thehistogramin(D)showsR2 gu e betweenaveragedcomputationalandexperimentalaminoacidsubstitutioneffects. s t o n 1 1 A p of all residues in the aligned TEM-1 sequence. This result model performs much worse than the DCA model ril 2 agreeswiththefindingthatinteractionsfromsecondshell (R2 (cid:2)R2 ’26%). Concentrating on mutations from 0 DCA IND 19 and beyond might be important for protein function amino acids of given physicochemical characteristics (hy- (Drawz et al. 2009). Having a look to the physicochemical drophobicity, charge, volume) toward a target amino acid propertiesofthewild-typeandthemutantaminoacids,we ofeitherdifferent(e.g.,hydrophobictohydrophilic)orcon- observe,for example, thatmutations substituting a hydro- servedcharacteristics(e.g.,hydrophobictohydrophobic)we phobic residue with a hydrophilic one are almost equally find that the DCA predictions are stable, with R2 values welldescribedbytheDCAandbytheindependentmodel between 49% and 64%, whereas the ones of the IND (R2 (cid:2)R2 ’5%),duetothestructurallyhighlydisrup- model vary much more strongly (25–55%). In none of the DCA IND tiveeffectofahydrophilicresidueinaburiedsite,andthus considered cases, the independent model was able toout- the absence of hydrophilic residues in the corresponding performthecoevolutionaryone. column of the sequence alignment. On the contrary, the Ourfindingsdemonstratethatthe“localmutationalland- moremoderateeffectofreplacingasmallbyalargeamino scape” dictating the mutational effects in TEM-1 is closely aciddependsstronglywhetherthecontextisabletoaccom- related to the (co-)evolutionary pressures acting globally modate this change or not, and thus the independent across the entire homologous protein family. This result is 275 MBE Figliuzzietal. . doi:10.1093/molbev/msv211 quiteremarkable:Despitealowtypicalsequenceidentityof S9–S11, Supplementary Material online). DCA predictions about20%betweenhomologousbeta-lactamasesandTEM- systematically outperform independent-site models neglect- 1,theirsequencestatisticsprovidesquantitativeinformation ing epistasis and all other tested methods. Only PolyPhen-2 abouttheeffectofsingle-residuesubstitutionsinTEM-1.We reaches, in two cases out of four, comparable performance. are thus able to infer landscapes and predict quantitatively Despitethisencouragingfinding,correlationsbetweenexper- mutationaleffectsevenincases,wheremutagenesisdataare iment and computation are numerically smaller than those notsufficientlynumerous(cf.OtwinowskiandPlotkin2014). observedforTEM-1.Weexpectthisreductiontoresultfrom Thiscomplementsrecentfindings,thatpatternsofpolymor- discrepancies between the measured phenotypes (e.g., pro- phism and covariation in patient derived (and thus highly tein stability, binding affinity) and those under evolutionary similar)HIVsequencesareinformativeabouttheirreplicative selection (fitness); MIC is without doubt a better proxy for capacities (Shekhar et al. 2013; Mann et al. 2014), thanks to fitnessthanmostmolecularphenotypes.However,tosystem- highmutationratesintheHIVvirus.Furthermore,coevolu- aticallysupportthisidea,large-scaleexperimentsassessingthe tionarypatternsinproteinfamilieswererecentlyfoundtobe impact of mutations on multiple phenotypic traits in the D closely related to protein energetics and folding landscapes same protein would be necessary. In summary, despite not ow n (LuiandTiana2013;Morcosetal.2014). representingacomprehensivesurvey,currentlyavailabledata lo a WeexpectthatthemodelingapproachthroughDCAcan d suggest a large potential for coevolutionary models in bio- e d be improved along several lines. First, prediction accuracy medical applications, through the in silico prediction of the fro depends critically on the quality and size of the training m MSA. As we have shown, the prediction for gapped (and roleofmutationsinrarediseasesandcancer. http tthypanicatlhlyeloenssewfoerllu-anligganpepde)dp(othsiutisonbesttisersaulbigsntaanblteia)lloynewso(rRse2 Materials and Methods s://aca valuesrangingfrom30%to78%fromthemosttotheleast de m gappedpositions).Wethereforeexcludedgappedsequences Data ic fromthetrainingalignment,butthisprocedurereducesthe MutationalData .ou p sequence number and thus the statistics for the ungapped Theoriginaldataset(Jacquieretal.2013)wasuseddirectlyat .c o positions. the translated amino acid level. It contains 8,621 (4,094 dis- m /m Second,thecurrentDCAapproachispurelystatisticaland tinct)measurementsofamoxicillinMIC.Amongthese8,112 b e basedonevolutionaryinformation.Itdoesnottakeintoac- donotincludestopcodons,2,440arerepeatedmeasuresof /a count any complementary knowledge about the protein thewild-typesequence,3,129(Nmultiple=2,051distinct)have rticle understudy.Wehave,however,observedthattheintegration allmutationsinsidethepartofthesequencecoveredbythe -a b of structuralknowledge helpstoincreasethe prediction ac- Pfamdomain(i.e.,subjecttothepresentedstatisticalanalysis). stra cfruormactyh.eFimttiuntgattehdermesoiddueel,otnhleyRfo2rvraelusieduraeissewsistlhigihntalybboyutab2o0uA˚t Fsiinngallely,maumtaotniognt.hEeaclahttmereasesut,rethmeernetazrei faNllssingilne=n7in4e2ddiissctrinetcet ct/33/1 2%.TheeffectofintegratingtheDCA-scoreandthesolvent- classes:12:5;25;50;125;250;500;1;000,2,000,4,000(mg/l) /26 accessiblesurfaceareaisevenlarger,leadingtoagaininR2of (no single point mutation has z 4 1,000). For a given phe- 8/2 mcoomrebitnhianng6D%C.AAwvietrhyPsoimlyiPlahreinn-c2r,etahseel(a7t%te)risbeoibntgaibnueidltwuphoenn notype where amino acid ai in position i is replaced with 5795 aminoacidb,wehavedefinedauniqueexperimentalfitness 3 2 aareprboafisleedmoondealsaimndplsetrluincetuarrarleignrfeosrsmioantisocnh.eTmheesweitinhcr3e-afoselds (cid:3)expðai !bÞtakingthelogarithmicaverageonallmeasure- by g ments(whenevermultiplemeasurementswereavailable): u crossvalidation:Itwillbeinterestingtoexploremoresophis- es ticated approaches, for example, integrating prior structural 1 NðXai!bÞ t on kthneowstlaetdisgteictahl-rionufegrhenacBeapyreosicaenduinrfee.renceschemedirectlyinto (cid:3)expðai !bÞ¼Nðai !bÞ i¼1 logðziÞ; ð7Þ 11 Ap Eveniftheintegrationofcomplementaryinformationmay whereNða !bÞisthenumberofmeasurementsofmuta- ril 2 i 0 substantiallyimproveourpredictionaccuracy,themostim- 1 tiona !b. 9 i portant contribution is, however, coming from the careful inclusion of epistatic effects into our modeling approach to HomologousSequencesandPreprocessingoftheTrainingSet mutationallandscapes,asshownbyapartial-correlationanal- ThegenomicmodelwaslearnedfromanMSAofsequences ysis in supplementary figure S8, Supplementary Material online. belonging to the Pfam Beta-lactamase2 family (PF13354) From a computational point of view, the approach is (Finn et al. 2013). We have used HMMer (Mistry et al. widelyapplicablebeyondthespecificcaseofTEM-1andan- 2013) to search against the Uniprot protein sequence data- tibiotic drug resistance. To check this practically, we have base(versionupdatedtoMarch2015).TheresultingMSAis analyzed further systems in the supplementary material, L=197siteslong,andcontains5,119distinctsequences.After Supplementary Material online: A PDZ domain removing all sequences with more than five gaps, 2,462 se- (McLaughlin Jr et al. 2012), an RNA recognition motif quencesareretainedandusedforthestatisticalanalysis.They (Melamed et al. 2013), and the glucosidase enzyme haveanaveragesequenceidentityapproximately20%with (Romero et al. 2015) (cf. supplementary text S1 and figs. theTEM-1wild-typesequence. 276 MBE CoevolutionaryLandscapeInference . doi:10.1093/molbev/msv211 Statistical Sequence Modeling withd beingtheManhattandistance(numberofmis- mm0 IndependentModel—SequenceProfile matches) between sequences m and m0 and (cid:5) being the The basic assumption of the independent model (2) is the Heaviside step function whose value is 0 for negative ar- additivity of the mutational effects of different positions in gument and 1 for positive argument. The reweighting the amino acid sequence. In terms of statistical sequence threshold is set to #¼0:8 as usually done in DCA models, this corresponds to a “sequence profile model,” (Morcos et al. 2011). whichassignstoeachsequencethefactorizedprobability Duetofinitesampling,thestatisticsoftheMSAhastobe regularizedintroducingpseudocounts: L Y PINDða1;:::;aLÞ¼ fiðaiÞ ð8Þ (cid:6) 1(cid:2)(cid:6)XM i¼1 fi(cid:6)jða;bÞ¼q2þ Meff m¼1wm(cid:7)ami ;a(cid:7)amj ;b; ð13Þ withfðaÞbeingthefrequencyofaminoacidaincolumniof i theMSA,seebelowforaprecisedefinitionofthisfrequency. The factorized form of this expression suggests to use log- f(cid:6)ðaÞ¼(cid:6)þ1(cid:2)(cid:6)XM w (cid:7) ð14Þ Dow probabilities as a computational predictor of the genotype- i q Meff m¼1 m ami ;a nloa to-phenotypemapping, d e M d (cid:2)INDða1;:::;aLÞ¼logPINDða1;:::;aLÞ: ð9Þ withMeff ¼mP¼1wmand(cid:7)theKronecker’sdeltawhosevalue from Thisleadstoanexplicitexpressionofthephenotypiccontri- is 1 if the variables are equal, and 0 otherwise. We have in- http butionofaminoacidainsitei:’iðaÞ¼logfiðaÞ. celpuidsteadticpsecuoduopcloinugnstswatetwhoavleeveulss:eFdirstl,afrogrethpeseinufderoecnocuenotsf s://ac EpistaticModel—DCA ((cid:4)2 ¼0:5), needed to correct for systematic biases intro- ade Followinglastparagraph’sideatoidentifythecomputational m ducedbytheMeanField(MF)approximation(Bartonetal. ic predictor of the genotype-to-phenotype mapping with the 2014), .ou log-probabilityofastatisticalmodelinferredfromanMSAof p.c TEM-1homologs,thelattertakestheform ’ijða;bÞ¼½C(cid:2)1(cid:5)ijða;bÞ; ð15Þ om /m PDCAða1;:::;aLÞ¼Z1exp(cid:4)(cid:2)DCAða1;::;aLÞ(cid:5); ð10Þ Cijða;bÞ¼fi(cid:4)j 2ða;bÞ(cid:2)fi(cid:4)2ðaÞfj(cid:4)2ðbÞ ð16Þ be/artic le where forallaminoacidsaandb.FollowingTanaka(1998),alsodi- -a b (cid:2) ða ;:::;a Þ¼XL ’ðaÞþX ’ ða;aÞ agonalterms(cid:2)iiða;bÞ¼½C(cid:2)1(cid:5)ijða;bÞareincluded.Couplings stra DCA 1 L i¼1 i i 1(cid:3)i(cid:3)j(cid:3)L ij i j with gaps are set to zero, ’ijða;(cid:2)Þ¼’ijð(cid:2);aÞ¼0 (cf. ct/3 ð11Þ Morcosetal.2011). 3/1 Smaller pseudocounts of Bayesian size ((cid:4) ¼ 1 ) have /2 isgiveninequation(3),andtheso-calledpartitionfunction 1 Meff 68 P been used in the regularization of single site frequencies to /2 Z¼ expf(cid:2) ða ;::;a Þgisanormalizationfactor. 5 a1;:::;aL DCA 1 L inferthefields: 79 ThestatisticalmodelP thustakestheformofageneralized 5 PTohtetssammoedemloord,eelqwuiavsaDlienCnAtrtolyd,uacpeadiriwnistehMe DarCkAovorfanredsoidmuefieclod-. ’iðaÞ¼log ff(cid:4)i(cid:4)11ðð(cid:2)aÞÞ!(cid:2)X’ijða;bÞfj(cid:4)1ðbÞ: ð17Þ 32 by gu evolution (Weigt et al. 2009; Morcos et al. 2011). Inferring i j;b e s model parameters ’ from the MSA is a computationally Thesamesmallregularization(cid:4) ¼ 1 hasbeenadoptedin t on hardtask,wethereforefollowthemean-fieldapproximation 1 Meff 1 theindependent-sitemodel. 1 introduced in Morcos et al. (2011). In this context, the epi- A p static couplings can be determined by inversion of the em- MappingScorestoMICValues ril 2 pirical covariance matrix Cijða;bÞ for the co-occurrence of To compare computational predictions with experimental 01 9 aminoacidsaandbinpositionsiandjofthesameprotein MIC values, we map computational scores (cid:3)(cid:2)ða !bÞ i sequence. Once the model parameters are determined, the into predicted MIC (cid:3)^ða !bÞ, by first sorting them and i context-dependentmutationaleffectscanbeestimatedusing then associating with the n highest score (cid:3)(cid:2) the n equation(4). th nth th highestexperimentalMICvalue(cid:3) ðn Þ, exp th DetailsofStatisticalInference (cid:3)^ð(cid:3)(cid:2) Þ¼(cid:3) ðn Þ: ð18Þ To take into account phylogenetic correlations and sam- nth exp th plingbiasesinthetrainingset,eachsequenceðam;:::;amÞ; 1 L Wesubsequentlycomputelinearcorrelationsbetweenthe m¼1;:::;M;oftheMSAappearsinthestatisticswiththe predictedMIC(cid:3)^ andtheexperimentalone(cid:3) ,resultingin followingweight, exp nonlinear rank correlations between experimental fitnesses !(cid:2)1 andrawcomputationalscores(cid:3)(cid:2). X wm ¼ 1þ (cid:5)ðdm;m0 (cid:2)#LÞ ; ð12Þ This procedure has proved to be more robust than the m6¼m0 standardSpearmanrankcorrelations,becauseofthepeculiar 277

Description:
show a Pearson correlation of R= 0.63 with the experimental chemical determinants shape sequence constraints in a functional enzyme. PLoS One Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH. 2005.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.