On the optimal contact potential of proteins Akira R. Kinjo Institute for Protein Research, Osaka University, Suita, Osaka, 565-0871, Japan∗ Sanzo Miyazawa Faculty of Technology, Gunma University, Kiryu, Gunma 376-8515, Japan. (Dated: February 1, 2008) We analytically derive the lower bound of the total conformational energy of a protein structure by assuming that the total conformational energy is well approximated by the sum of sequence- dependent pairwise contact energies. The condition for the native structure achieving the lower boundleadstothecontactenergymatrixthatisascalarmultipleofthenativecontactmatrix,i.e., 8 the so-called G¯o potential. We also derive spectral relations between contact matrix and energy 0 matrix,and approximations related toone-dimensional protein structures. Implications for protein 0 structureprediction are discussed. 2 Keywords protein structureprediction; spectral relations; one-dimensional structures n a PACSnumbers: 87.15.Cc,87.15.-v,87.14.Ee J 3 INTRODUCTION THEORY ] M Lower bound of contact energy B . o Our fundamental assumption is that the conforma- Proteins’ biological functions are made possible by i tional energy of a protein can be somehow expressed b theirprecisethree-dimensional(3D)structures,andeach in terms of a contact matrix. Now let us assume that - 3D structure is determined by its amino acid sequence q the total energy of a protein can be well approximated [ through the laws of thermodynamics [1]. Therefore, predicting protein structures from their amino acid se- by the sum of pairwise contact energies between amino 2 acid residues, and that each pairwise contact energy can quences is important not only for inferring proteins’ bi- v be decomposed into a sequence-dependent term and a ological functions, but also for understanding how 3D 6 conformation-dependent term. The sequence-dependent 4 structures are encoded in such one-dimensionalinforma- term is expressed as a matrix E(S) = (E ) which we 3 tion as amino acid sequence. The problem of protein ij 0 structure prediction is naturally cast as an optimization call the contact energy matrix, or E-matrix for short. 9. problem where a potential function is minimized. Given Each element Eij of the E-matrix represents the energy between the residues i and j when they are in contact. 0 an appropriate potential function, conformational opti- 7 This form of the E-matrix is a very general one: Each mization should yield the native structure as the unique 0 element,E ,maydependontheentiresequence,S,orit global minimum conformation of the potential function. ij : may depend only on the types of the interacting amino v Thus,theproblemhasbeentraditionallydividedintotwo i acid residues, i and j, as in the conventionalcontact po- X sub-problems: One is to establish an appropriate poten- tentials. The conformation-dependent term is expressed tial function [2], and the other is to develop the meth- r as another matrix ∆(C) = (∆ ) which we call the con- a ods to efficiently searchthe vast conformationalspace of ij tact matrix, or C-matrix. Each element ∆ of the C- a protein [3]. Among various forms of effective energy ij matrix assumes a value of either 1 or 0, depending on functions, statistical contact potentials [4, 5] have been the residues i and j are in contact or not, respectively. widely used. In this Letter, we exclusively treat a class Hence the total energy E(C,S) of a protein of sequence ofsuchcontactpotentials,neglectingothercontributions S of N residues and having conformation C is given by suchaselectrostaticsandlocalinteractions. Accordingly, a protein conformation is represented as a contact ma- N N 1 trix in which the (i,j) element is 1 if the residues i and E(C,S) = XXEij(S)∆ij(C) (1) j are in contact in space, otherwise it is 0. Although 2 i=1j=1 the contact matrix is a coarse-grained representation of 1 protein conformation, it has been known that the con- = [E(S),∆(C)] (2) 2 tactmatrixcontainssufficientinformationtorecoverthe three-dimensional (native) structure of proteins [6]. It is where [·,·] denotes the Frobenius inner product between notedthat,forthelatticemodelofproteins[7],theserep- two matrices [8, 9]. Based on this assumption, we de- resentationsofproteinconformationandenergyfunction rive the lower bound for the conformational energy and are exact. the conditions for the native structure and E-matrix to 2 achieve the bound. temperature. It should be noted that a native structure The Frobenius inner product leads to the matrix l2 can be the unique GMEC without achieving the lower norm defined as, for a matrix M, kMk ≡ [M,M]1/2 = bound of Eq. (4). Such a case is made possible either ( M2)1/2. In the case of C-matrix, since ∆ =0 or bythelimitationoftheconformationalspaceimposedby Pi,j ij ij 1, we have otherstericfactorssuchaschainconnectivityorexcluded volumes, or by inherent inconsistencies of the E-matrix 2 k∆(C)k =2Nc(C) (3) so that no plausible conformations are allowed to satisfy the lower bounds. where N ≡ (1/2) ∆ is the total number of c Pi,j ij contacts. As for any inner products, the Frobenius inner product satisfies the Cauchy-Schwarz inequality Spectral relations (|[A,B]|≤kAkkBk) from which we have To examine more closely how the lower bound can be [E,∆]≥−kEkk∆k (4) achieved,wenextderiveamoregenerouslowerboundin amorerestrictedcase. First,theC-matrixisdecomposed where the equality holds if and only if as E =ε∆ (5) N ∆= XσαuαvαT (6) for some scalar ε < 0. Although the inequality (Eq. 4) α=1 holds for any pair of matrices, we now regard it as the where σ is the α-th singular value and u and v are lower bound for conformational energy for a given E- α α α the correspondingleft andrightsingular vectors,respec- matrix. For simplicity, we first consider the energy min- imization problem for conformations with k∆(C)k fixed tively. U = (u1,··· ,uN) and V = (v1,··· ,vN) are or- thogonal matrices. The singular components are sorted tothevalueofthenativeconformation. Itisdesirablefor in decreasing order of the singular values: σ1 ≥ ··· ≥ the native conformation to satisfy the lower bound and σ (≥ 0). Since ∆ is real symmetric, the singular values hence its condition Eq. (5). If the native conformation N are the absolute values of the eigenvalues of ∆, and the indeed satisfies the condition Eq. (5), then the elements singular vectors are such that u =±v where the sign of the E-matrix is either 0 or ε so that only the contacts α α corresponds to that of the respective eigenvalue. Next, presentinthe native conformationarestabilizing. Thus, the E-matrix is decomposed in the same manner as the native conformation satisfying Eq. (5) is actually a GMECamonganyconformationswitharbitraryvaluesof N k∆(C)k. AnE-matrixthatsatisfiesEq. (5)forthenative E = XταxαyαT (7) C-matrix is a kind of the so-called G¯o potential [10, 11] α=1 which has been essentialfor studying the protein folding where τ aresingularvalues, andx and y areleft and problem. At this point, it is still possible that the na- α α α right singular vectors, respectively. Since E is also real tive structure is not the unique GMEC. For example, if symmetric,thesingularcomponentshavethesameprop- a conformation contains all the native contacts together erties as the C-matrix ∆. Noting that [E,∆]=tr(E∆T), withsomeothercontacts,thisconformationhasthesame von Neumann’s trace theorem [9] leads to the following energy as the native conformation. In order for a native inequality: conformationtobetheuniqueGMEC,itisrequiredthat the total number of contacts of the native conformation N is larger than that of any other conformations that con- [E,∆]≥−Xσατα (8) tain all the native contacts. From the relation Eq. (3), α=1 maximizing the totalnumber ofcontactsis equivalent to where the equality holds if and only if maximizingthe normoftheC-matrix,whichinturnim- plies the minimization of the right-hand side of Eq. (4). (uTx )(vTy )=−δ (9) α β α β α,β To summarize, for a given E-matrix, E(S), of a protein, its native conformation,C , achievesthe lower bound in for all α and β with non-zero singular values σ and τ n α β Eq. (4)ifandonlyifE(S)=ε∆(C )forsomeε<0,and (δ is Kronecker’s delta). We now regard this inequal- n α,β such native structure is the unique GMEC if and only if ity as a lower bound for the conformational energy for k∆(C )k is the maximum of all possible conformations a given E-matrix. For a fixed set of the singular values n thatcontainallthenativecontacts. Notethattheformer σ (α=1,··· ,N), if and only ifthere exists sucha con- α condition is a relation between E-matrix and C-matrix formation that satisfies the condition in Eq. (9), then whereasthe latterisaconditionforanativestructureto that conformation is the lowest possible energy confor- satisfy. The magnitude of ε is not specified here, but it mation. Let λ and ε (α = 1,··· ,N) be the eigenval- α α shouldbedeterminedbyotherfactorssuchasthefolding ues of the C-matrix and E-matrix, respectively, sorted 3 in the decreasing order of their absolute values. Then MJcontactpotentialisindeedhighlycorrelatedwiththe σ = |λ | and τ = |ε | for α = 1,··· ,N, and u and principal eigenvector of the native contact matrices [14]. α α α α α x are the eigenvectors of the corresponding matrices. Bastollaetal.[15]obtainedasimilarresult,buttheyalso α Thus,intermsofeigenvaluesandeigenvectors,the lower showedthattakingtheaverageofsuchx1 overevolution- bound in Eq. (8) is equal to λ ε with λ ε ≤0 for arily related proteins greatly improved the correlation. Pα α α α α α = 1,··· ,N. In addition to the condition Eq. (9) for Since the rank of the contact matrix is in general not 1, the lower bound of Eq. (8), if ∆ and E are of the same Eq. (10) does not hold and the equality in Eq. (4) can- rank, then the numbers of positive, negative, and zero notbesatisfied. Consequently,thereareattractiveinter- eigenvalues of ∆ and −E are the same and uα = ±xα. actions between non-native contacts even when x1 =u1 Thus, from Sylvester’s law of inertia [8], there exists a holdsexactly. Nevertheless,Portoetal.[16]havedemon- real non-singular matrix S such that stratedthattheknowledgeofu1aloneispracticallysuffi- cientforreconstructingthenativecontactmatrixofsmall E =−S∆ST, (10) single-domain proteins. Therefore, construction of effec- tive rank-1E-matrices is ofgreatinterest[17]. Basedon ∗ i.e., the E-matrix is congruent to the C-matrix. If the the Porto et al.’s result, it is tempting to postulate that conformation that satisfy the condition Eq. (10) is the thesatisfactionofthe lowerboundby arank-1E-matrix native structure, the E-matrix is consistent in the sense is sufficient for the native conformationto be the unique that the contributions from all the eigencomponents are GMEC.Atpresent,however,thereisnoclearconnection stabilizing the native structure (λ ε ≤ 0). Since the α α between the present formulation (energy minimization) matrix S is non-singular, we can “predict” the native and the Porto et al.’s combinatorial algorithm. structure from the E matrix as ∆ = −S−1ES−T (if we Anotherapproximationisakindofmean-fieldapproxi- can construct the appropriate matrix S). At this point, mationsinwhichthematrixelementE isreplacedbyits ij hsionwceevoetrh,etrhceonnfaotrimveatsiotrnuscwtuitrhe amdaiyffenroetntbseetthofesGinMguElaCr average over column hEi•i ≡ PNj=1Eij/N. Let us define values may have lower energies. e = (hE1•i,··· ,hEN•i)T and n = (n1,··· ,nN)T where n ≡ N ∆ is the contactnumber ofthe i-thresidue. Inordertocomparetheenergiesofconformationswith i Pj=1 ij Then,wehavethefollowingapproximationandthelower different sets of singular values, we use another inequal- bound: ity [9]: 1 N E(C,S) ≈ eTn (12) 2 −Xσατα ≥−kEkk∆k (11) 1 α=1 ≥ − kekknk (13) 2 where the lower bound is the same as that in Eq. (4). We note that, in terms of singular values, the matrix wheretheequalityin(13)holdsifandonlyifthecolumn- norms are expressed as k∆k = ( σ2)1/2 and kEk = averagedE-matrixisanti-paralleltothecontactnumber Pα α ( τ2)1/2. Hence, it is clear that the equality in Eq. vector,that is, e=εn for some ε<0. This lower bound Pα α (11) holds if and only if, in addition to the condition condition is analogous to Eq. (5), and can be regarded as another kind of the G¯o potential for one-dimensional in Eq. (9), there exists a scalar constant c such that τ = cσ for all α = 1,··· ,N. These conditions are protein structure. It has been suggested that contact α α equivalent to Eq. (5). number vector can significantly constrain the conforma- tional space [18]. Together with other one-dimensional structures, contact number vector is also used for recov- One-dimensional approximations ering the native structures [19], and can be accurately predicted [20, 21, 22, 23, 24]. It has been pointed out that the contact number vector is highly correlatedwith To connect the present results with previous studies, the principal eigenvector of the C-matrix [16, 19], which we next introduce two approximations. First, we con- suggests that this mean-field approximation is qualita- sider the case where the E-matrix is well approximated by its principal eigencomponent, that is, E ≈ ε1x1xT1. tively similar to the principal eigenvectorapproximation introduced above. This approximationis motivated by the eigenvalue anal- ysisofthe Miyazawa-Jernigan(MJ)contactpotential[4] performed by Li et al. [12], and has been employed by others [13, 14, 15]. In this case, the lower bound Eq. DISCUSSION (8) is achieved if and only if x1 = ±u1 and ε1λ1 < 0. This resultwaspreviouslyderivedby Caoetal.[13]who Using a more restricted, but conventional,form of the subsequently showed that the vector x1 constructed by E-matrix where each element Eij depends only on the using the components of the principal eigenvector of the types of i-th and j-th residues (e.g., the MJ potential), 4 Vendruscolo et al. [25, 26] have shown that it is im- contact prediction [28] as well as recent studies on local possible for such E-matrices to stabilize all the native interactions [30, 31] suggest that the latter may be the structures in a database. The conventional E-matrices case. Nevertheless, the present results may be useful for such as those they studied do not take into account the evaluating the optimality of potential functions in either sequence-dependence beyond a summation of the con- case. tributions from residue pairs. In the present study, we assumed a more general form for the E-matrix, allow- ing each element E to depend on the whole amino acid ij sequence. Inpracticalsituationsofproteinstructurepre- ∗ Electronic address: [email protected] diction, we want to optimize an energy function so that [1] C.B.Anfinsen,Principlesthatgovernthefoldingofpro- the native conformations of arbitrary proteins achieve tein chains, Science 181 (1973) 223–230. the lower bound. Now let us impose this as a requisite [2] M.-H. Hao, H. A. Scheraga, Designing potential energy for the E-matrix. Then, there should exist a function, functions for protein folding, Curr. Opin. Struct. Biol. 9 namely E, that maps each amino acid sequence to the (1999) 184–188. corresponding optimal E-matrix, that is, the G¯o poten- [3] A. Mitsutake, Y. Sugita, Y. Okamoto, Generalized- tial. Thus,theproblemofstructurepredictionbecomesa ensemblealgorithmsformolecularsimulationsofbiopoly- mers, Biopolymers 60 (2001) 96–123. trivial matter. Currently, most efforts for developing en- [4] S. Miyazawa, R. L. Jernigan, Estimation of effective in- ergyfunctionsseemtobefocusedonaccurateestimation terresidue contact energies from protein crystal struc- ofafixedsetofparametersforagivenfunctionalform[2]. tures: quasi-chemicalapproximation,Macromolecules18 The present analysis suggests that inferring the function (1985) 534–552. E that can generate the G¯o-like E-matrices from amino [5] S. Miyazawa, R. L. Jernigan, Residue-residue potentials acid sequences is essential if a contact potential is used. with a favorable contact pair term and an unfavorable The lowerboundinequality(Eq. 4)andits conditionfor high packing density term for simulation and threading, J. Mol. Biol. 256 (1996) 623–644. the equality (Eq. 5) will serve as the guiding principle [6] M.Vendruscolo,E.Kussell,E.Domany,Recoveryofpro- for inferring such a function. This approachto structure tein structure from contact maps, Fold. Des. 2 (1997) prediction is apparently similar to machine-learning ap- 295–306. proachesto contactmatrix prediction[17, 27]. Although [7] H. Taketomi, Y. Ueda, N. G¯o, Studies on protein fold- conventional machine-learning methods are not directly ing, unfolding and fluctuations by computer simulation. targetedattheoptimizationofthe formofEq. (4),their I. the effect of specific amino acid sequence represented predictionaccuracyshouldbeindicativeofthepossibility by specific inter-unit interactions, Int. J. Pept. Protein Res. 7 (1975) 445–459. for identifying the function E. [8] R. A. Horn, C. R. Johnson, Matrix analysis, Cambridge Intheprecedingparagraph,wehaveassumedtheexis- University Press, Cambridge, U.K., 1985. tence of the function E to construct the optimal contact [9] R. A. Horn, C. R. Johnson, Topics in matrix analysis, potential from a given amino acid sequence. What if, Cambridge University Press, Cambridge, U.K., 1991. however, there is no such function? In fact, the limited [10] N.G¯o,Theoreticalstudiesofproteinfolding,Annu.Rev. successofcurrentcontactmatrixprediction[28]strongly Biophys. Bioeng. 12 (1983) 183–210. [11] S. Takada, G¯o-ing for the prediction of protein folding suggests that this is more likely the case. Such a case mechanisms, Proc. Natl. Acad. Sci. U.S.A. 96 (1999) implies either that there areproteins forwhichthe lower 11698–11700. bound energy cannot be achieved, or that the total en- [12] H. Li, C. Tang, N. S. Wingreen, Nature of driving force ergy cannot be sufficiently accurately approximated by forproteinfolding: Aresultfromanalyzingthestatistical Eq. (1). The former case indicates that some proteins potential, Phys. Rev.Lett. 79 (1997) 765–768. are inherently frustrated, but to a good approximation [13] H. B. Cao, Y. Ihm, C. Z. Wang, M. Su, D. Dobbs, such proteins should be rather exceptional for natural K.M.Ho,Three-dimensionalthreadingapproachtopro- tein structurerecognition, Polymers 45 (2004) 687–697. proteins[10,11]. Thelattercasemayindicatethatmulti- [14] H. B. Cao, C. Z. Wang, D. Dobbs, Y. Ihm, K. M. body contact interactions [29] and/or other energy com- Ho,Codabilitycriterionforpickingproteinlikestructures ponents than contact energies are more important. from random three-dimensional configurations, Phys. In summary, we have shown that the requirement for Rev. E74 (2006) 031921. thenativestructuretoachievethelowerboundnaturally [15] U. Bastolla, M. Porto, H. E. Roman, M. Vendruscolo, leads to the G¯o potential and the requirementfor such a Principal eigenvector of contact matrices and hydropho- conformationtobetheuniqueGMECleadstothenative bicity profiles in proteins, Proteins 58 (2005) 22–30. [16] M. Porto, U. Bastolla, H. E. Roman, M. Vendruscolo, conformation being the most compact one among those Reconstructionofproteinstructuresfromavectorialrep- containing all the native contacts. These results suggest resentation, Phys. Rev.Lett. 92 (2004) 218101. that proteinstructure prediction shouldbe possible sim- [17] A. Vullo, I. Walsh, G. Pollastri, A two-stage approach ply by constructing the optimal energy matrices or that for improved prediction of residue contact map, BMC the contact potential alone is not suitable for the prob- Bioinformatics 7 (2006) 180. lem. Although not yet definitive, the current state of [18] A. Kabak¸cioˇglu, I. Kanter, M. Vendruscolo, E. Domany, 5 Statisticalpropertiesofcontactvectors,Phys.Rev.E65 [25] M.Vendruscolo,E.Domany,Pairwise contactpotentials (2002) 041904. are unsuitable for protein folding, J. Chem. Phys. 109 [19] A. R. Kinjo, K. Nishikawa, Recoverable one- (1998) 11101–11108. dimensional encoding of three-dimensional protein [26] M. Vendruscolo, R. Najmanovich, E. Domany, Can a structures, Bioinformatics 21 (2005) 2167–2170, pairwise contact potential stabilize native protein folds doi:10.1093/bioinformatics/bti330. against decoys obtained by threading?, Proteins 38 [20] A. R. Kinjo, K. Horimoto, K. Nishikawa, Predicting (2000) 134–148. absolute contact numbers of native protein structure [27] J. Cheng, P. Baldi, Improved residue contact prediction from amino acid sequence, Proteins 58 (2005) 158–165, using support vector machines and a large feature set, doi:10.1002/prot.20300. BMC Bioinformatics 8 (2007) 113. [21] A. R. Kinjo, K. Nishikawa, Predicting secondary struc- [28] O. Gran˜a, D. Baker, R. M. MacCallum, J. Meiler, tures, contact numbers, and residue-wise contact orders M.Punta,B.Rost,M.L.Tress, A.Valencia,CASP6as- ofnativeproteinstructurefrom aminoacidsequenceus- sessment of contact prediction, Proteins Suppl. 7 (2005) ingcriticalrandomnetworks,BIOPHYSICS1(2005)67– 214–224. 74, doi:10.2142/biophysics.1.67. [29] P.J.Munson,R.K.Singh,Statisticalsignificanceofhier- [22] Z.Yuan,Betterpredictionofproteincontactnumberus- archical multi-body potentials based on delaunay tessel- ing a support vector regression analysis of amino acid lation and their application in sequence-structure align- sequence,BMC Bioinformatics 6 (2005) 248. ment, Protein Sci. 6 (1997) 1467–1481. [23] T.Ishida,S.Nakamura,K.Shimizu,Potentialforassess- [30] G. Chikenji, Y. Fujitsuka, S. Takada, Shaping up the ingqualityofproteinstructurebasedoncontactnumber protein folding funnelbylocal interaction: lesson from a prediction, Proteins 64 (2006) 940–947. structurepredictionstudy,Proc.Natl.Acad.Sci.U.S.A. [24] A.R.Kinjo, K.Nishikawa,CRNPRED:Highly accurate 103 (2006) 3141–3146. predictionofone-dimensionalproteinstructuresbylarge- [31] P.J.Fleming,H.Gong,G.D.Rose,Secondarystructure scale critical random networks, BMC Bioinformatics 7 determinesproteintopology,ProteinSci.15(2006)1829– (2006) 401. 1834.