Quadratic Programming Approach to Fit Protein Complexes into Electron Density Maps Roman Pogodin Skolkovo Institute of Science and Technology, Nobel St., 3, Moscow, 143026, Russia Moscow Institute of Physics and Technology, Institutskiy Lane 9, Dolgoprudny, Moscow, 141700, Russia [email protected] 7 Alexander Katrutsa 1 0 Moscow Institute of Physics and Technology, Institutskiy Lane 9, Dolgoprudny, Moscow, 141700, Russia 2 Skolkovo Institute of Science and Technology, Nobel St., 3, Moscow, 143026, Russia n [email protected] a J Sergei Grudinin 9 University of Grenoble Alpes, LJK, F-38000 Grenoble, France ] CNRS, LJK, F-38000 Grenoble, France C O Inria, F-38000 Grenoble, France [email protected] . h t a m [ Abstract Twoapproachestotheproblemarenoticeable. The firstone [2] uses a genetic algorithmthat discoversand 1 v The paper investigates the problem of fitting protein then recombines good solutions which fit the density 2 complexes into electron density maps. They are rep- map. This parallel approach increases efficiency, but 9 resented by high-resolution cryoEM density maps con- the accuracydecreaseswith the number ofcomponents 1 verted into overlapping matrices and partly show a inside the complex and the map’s resolution. The sec- 2 structureof acomplex. Thegeneral purposeis todefine ondsolution[3]uses acryoEMmapdirectlyandusesa 0 . positionsofallproteinsinsideit. Thisproblemisknown set of predefined possible positions. It divides a set of 1 to beNP-hard, since it lays in the field of combinatorial fittingvariablesintouncoupledsubsets,solvescombina- 0 7 optimization overasetofdiscretestatesofthecomplex. torial optimization problems independently and finally 1 We introduce quadratic programming approaches to the gather all the solutions into the global minimum. This : problem. To findan approximate solution, weconvert a approachreducesthesizeoftheproblemfromexponen- v i densitymapintoanoverlapping matrix,whichisgener- tial in the number of all components to exponential in X ally indefinite. Since the matrix is indefinite, the opti- the number of components of the largest subset. How- r mization problem for the corresponding quadratic form ever, it is sensitive to the accuracy of the component a is non-convex. To treat non-convexity of the optimiza- models and clustering into the subsets. Methods in- tionproblem, weusedifferentconvex relaxationstofind vestigated in the current paper use a set of predefined which setofproteins minimizes thequadraticform best. positionslikeinthelastpaper. However,theminimiza- Keywords: cryoEM, electron microscopy fitting, tionproblemoverabinarysetisrelaxedintoaproblem quadratic programming,protein structure prediction overa continuous set. The problem is then solvedwith continuous optimization methods, which are more effi- cient that discrete ones. The solution is then rounded 1. Introduction to a binary one. For continuous optimization methods, quadratic The problem of proteins fitting into cryoEM den- programming approaches and a stochastic algorithm sity maps of protein complexes remains important for are used in this paper. General overview of quadratic biophysical studies of cell processes. Some examples of programming, main ideas and results of convex opti- its importance canbe found in [1], whichpresents EM- mizationarepresentedin[4].Forourpurposes,the first DataBank, an online database for electron microscopy. ideaisconvexrelaxation. Thepaper[5]isafulfillingre- 1 viewofabasicmethodssemidefinite(SDP).Thesecond minimizationoneforallfourcases. Thesescoresallows ideais sequentialquadraticprogrammingwhichfinds a us to introduce a matrix where each element considers local minimum of a problem [6]. The stochastic algo- overlapping between two positions of proteins. rithm is called Simulated Annealing which also finds a local minimum of the problem [7]. Definition 2 Let Q∈Rn×n be an overlapping matrix Efficiency of the method is tested in two princi- that corresponds to a density map, where n=m·N. ple ways. Firstly, it is testing against artificial cryoEM maps, which are based on a known structure of a pro- Eachmatrixelementqabshowsoverlappingbetweeni-th tein. Thismethodallowsonetotestefficiencyatdiffer- and j-th components of the complex, which are in k-th entmapresolutionsandwasusedin[2,3,8–11]. Thesec- and l-th positions respectively, so a=(i−1)·N+k and ondwayisto use experimentaldensitymaps ofprotein b=(j−1)·N+l. Overlappingbetweentwopositionsof complexes which structure we know, e.g. in [2,3,9,10]. a single protein is set to zero. Besidesrelativepositionsofproteins,theirfittingto 2. Problem statement the density mapshouldbe considered. It meansthat it ismoreimportanttofitaprotein’smaptothecomplex’ map contour rather than its volume, because we need AproteincomplexconsistsofmproteinsandhasN to achieve the original position within the complex. computed spatial positions for each protein, which are differentfordifferentproteins. Topredictthe structure of the protein, we look for positions of a given set of Definition 3 Relevance of a protein’s position is a proteins whichfit into the density mapbest. Introduce measure of quality of it’s fit into a density map’s con- correlation parameters for proteins’ density maps and tour. state the prediction problem formally. Relevance can be measured with the LAP, since it can be used for contour matching. Definition 1 Overlapping of two proteins’ positions is Withthat,arelevancevector,whichdescribeseach overlapping of corresponding electron density maps. possible position, can be introduced. To measure overlapping between two density maps, we use the cross-correlationfunction [12], CCF: Definition 4 Letb∈Rn beacomponentrelevancevec- tor. Each element b shows relevance of i-thcomponent a CCF=(cid:229) r 1r 2, (1) to k-th position in the complex, where a=(i−1)·N+k. i i i Further, the problem’s variable shows taken posi- where r i1 and r i2 are densities of i-th element of two tions for each protein and is defined as following. maps. Laplacian-filtered CCF (LAP) allows one to comparematchingofmaps’edgesratherthanthewhole Definition 5 Let x ∈ {0,1}n be a binary vector that volumes [12]. It can be achieved with modifying both represents the proteins positions in the complex: maps with Laplacian filter before computing the CCF. Motivation of its usage and the filter kernel are de- 1, if i-th protein is in the k-th position, scribed in [13]. xk= The ideas of CCF and LAP allows us to introduce i (0, otherwise, fouroverlappingscores. Firstly,CCFitselfcanbecom- where a=(i−1)·N+k. puted. Then overlapping shows how incompatible are the positions of two proteins. For example, if they are The vector x is divided into m subvectors for the pro- too closeor evenhas twoatoms ina same position,the teins, each with length of N for possible positions. CCF will be big. This approach is denoted as CCF, Since the best set of positions should have mini- and the goal is to minimize it. mumoverlappingbetween proteinsand maximumrele- Secondly,bothmapscanbefilteredwiththeLapla- vancebetweeneachproteinandthemap,formulatethe cianfiltertofindthebestmatchoftheircontours. This problemasaconstrainedbinaryquadraticoptimization scoreiscalledtheContactscore,andthegoalistomax- problem. It considers the overlapping in the quadratic imize it. term and the relevance in the linear term: The lasttwoapproachesimply applyingthe Lapla- cian filter to only one map, hence these scores shows x∗=argmin(xTQx−bTx), howthecontourofonemapfitsthevolumeoftheother. x∈{0,1}n (2) These scores arecalled Skin-Coreand Core-Skinscores s.t. Ax=1 , m and must be maximized to find the best match. AllscoresexceptCCFarecomputedandthenmul- where1 isa vectorofonesandA∈Rm×Nm is a matrix m tiplied by −1 to write the optimization problem as the ensures that each protein within the complex takes a 2 single position. Hence, it has the following structure: Semidefinite relaxation (SDP). To introduce the semidefinite relaxation, rewrite the problem (4) as 1 ··· 1 0 ··· 0 ··· 0 ··· 0 0 ··· 0 1 ··· 1 ··· 0 ··· 0 y∗=argmin(Tr(QY)−bTy), A= ... ... ... ... ... ... ··· ... ... ... . (3) y∈[0,1]n 0 ··· 0 0 ··· 0 ··· 1 ··· 1 s.t. Ay=1m, (7) Y=yyT. N N N Since matrix Q is indefinite in general case, the problem|(2){iszno}n-co|nve{xz. M}oreover|, th{ezpro}blem (2) To get a lower bond of the solution, relax the last is NP-hard. Because of that, relax integer constraints constraint from equalities to inequalities, so now it is of the problem into continuous variables. The prob- positive semidefinite: lem then can be solved with continuous optimization methods. The last approach will be discussed later, Y−yyT(cid:23)0. but reformulation in a continuous form can be written But the initial binary program implies now as y∗=argmin(yTQy−bTy), diag(xxT)=x. y∈[0,1]n (4) s.t. Ay=1 . Hence, we use an additional constraint m To define which proteins takes which place, return diag(Y)=y to the binary vector. To do that, replace the biggest elementineachofN positionsubvectorsby1andothers to bound the problem. The relaxed problem is now by 0: convex and can be written as 1, if k=argmaxya, y∗=argmin(Tr(QY)−bTy), i=1,...,m: xa= k=1,...,N y∈[0,1]n 0, otherwise s.t. Ay=1m, (8) for a=(i−1)·N+k. Y−yyT(cid:23)0, diag(Y)=y. 2.1. Convex relaxations Sequential quadratic programming (SQP). The Since(4)isanon-convexproblemoveraconvexset, basicideasofsequentialquadraticprogrammingarede- it can not be solved directly with guarantees of global scribed in Chapter 18 of the book [6]. This approach minimum. However, an approximate solution can be finds a local minimum for a non-convex problem and found with relaxing the problem into a convex one. In implies solving a quadratic subproblem at each iter- thenextsectionsspectrumshiftandsemidefiniterelax- ation. The subproblem is a convex second-order ap- ationsoftheproblemareintroduced. Theseapproaches proximation of the Lagrangian function of the (4), i.e. helpinfindinganapproximatesolution,usingsolutions it involves a positive-semidefinite approximationof the of the relaxed problems. Hessian. We use the implementation of an SQP algo- rithm from MATLAB Optimization Toolbox [14]. The Spectrum shift relaxation (Shift). We shift initial point for the algorithm is obtained with solving the spectrum of matrix Q to achieve positive- the linear part of the problem (4): semidefiniteness and hence make the problem convex. The corresponding transformation is y∗=argmin(−bTy), Qˆ =Q−l I, (5) y∈[0,1]n (9) min s.t. Ay=1 . m where l is the smallest eigenvalue of Q and I is an min identity matrix of the size of Q. Then the problem (4) In this work, the solution of the problem (9) can be can be rewritten with the new matrix Qˆ as treated as an approximation that only fits the density map, but does not consider overlapping between pro- y∗=argmin yTQˆy−bTy , teins. y∈[0,1]n (6) (cid:0) (cid:1) s.t. Ay=1 . m Simulated annealing (SA). The simulatedanneal- The problem (6) can now be easily solved as a convex ing method [7] is a probabilistic global optimization on. However,thismethoddoesnotguaranteetheglobal method, implemented in MATLAB Global Optimiza- minimum of the initial problem (2). tion Toolbox [14]. This approach simulates a physical 3 processofheatingandthen slowloweringthe tempera- 1. Gettheatomicstructurefromfittedproteinsusing tureofamaterialtodecreasedefects. Ateachiteration, the discrete solution x∗. it generates a new point near the current one, with a 2. Impose a 3D grid with voxel size of 1 ˚A. uniformly random direction and step length equals the currenttemperature. Ifthenewpointisbetterthanthe 3. For every non-hydrogen atom increase the density current one, the algorithm accepts it. If not, it accepts valueofthe nearestvoxelbythe atomicnumberof the point with probability the atom. D −1 4. Apply the GaussianFourier filter to blur the map. P(accept x )= 1+exp , (10) k+1 T The recommended in [12] sigma is 0.187× resolu- (cid:18) (cid:18) k(cid:19)(cid:19) tion. TheGaussiankernelsizeis2·⌈2s ⌉+1,where where D = f(x )− f(x ) for an objective function f, ⌈x⌉ is the smallest integer greater than or equal to k+1 k currentandnewpointsx andx respectivelyandthe x. k k+1 current temperature T , which changes as k 5. Resample the grid using Fourier method to match T =0.95·T. the sampling of the target map. k+1 k When the probe map is obtained, compare the origi- For this work, the method can not be directly imple- nal (target) map and the probe one with the scoring mented for the problem (4) because it has linear con- functions describing above. strains, but the method is designed for unconstrained For 10˚A resolution or less the authors propose the andbound-constrainedproblems. However,constraints Laplacian-filteredcross-correlationfunction. Thecross- canbe implemented asapenalty functiontothe objec- correlation function itself is described above (1). For tive one, so the problem is converted as the LAP, modify both target and probe maps with y∗=argmin(yTQy−bTy+wkAy−1 k ), Laplacian filter before computing the CCF. For other m 1 (11) y∈[0,1]n cases, the mutual information score (MI) is proposed. The scoring function is wherew∈Risapenaltyweightandkgk denotesthel - 1 1 normofavectorg. Theinitialpointforthealgorithmis (cid:229) (cid:229) p(x,y) I(X,Y)= p(x,y)log . (13) obtained by solving (9). Moreover,since the algorithm p(x)p(y) x∈Xy∈Y has two parameters, the initial temperature T and the 0 penalty weight w, the method is denoted as SA(T ,w). Here, X and Y correspond to the density values in 0 the probe and target maps. Functions p(x) and p(y) 2.2. Scoring functions to measure quality of are the percentage of values in maps equal to x and y. fit The aligned maps are maps where elements with equal coordinates represent one point in space. For aligned As the simplest scoring function which implies targetandprobemaps, p(x,y)ispercentageofelements knowledge of the real structure of the protein, one can with value x in the probe map and y in the target one. use root-mean-square deviation (RMSD) [15]. It mea- Becauseofwiderangeofvaluesandnoise,X andY have sures the distance d between pairs of i-th atoms of a limited number of values, e.g. 20 in [12]. i protein in two positions, one in predicted position and one in the native position. Both atoms in a pair take 3. Dataset the same place in a corresponding protein. With M pairs of such atoms, it can be written as Allmethodsweretestedonsimulatedmaps,sinceit allowsonetocomparethemethodsfordifferentprotein RMSD= 1 (cid:229)M d 2. (12) complexes that are tested in the same conditions, i.e. sM i the same resolution of the map. The maps were gen- i=1 erated as described below from 7 protein complexes. Thisapproachhelpstomeasurequalityofothercriteria Their PDB entries are 1e6v [17], 1gte [16], 1tyq [18], on test data, but can not be used with direct determi- 1z5s [19], 2p4n [20], 4a6j [21], 4bij [22]. The map res- nation of an unknown structure. olution is 10 ˚A, a voxelsize is 1 ˚A. Map’s generationis Another approach is quality criteria that use only similar to creating a map form a solution and includes information about the density map. For future work, following steps. we propose two scoring functions recommended in the 1. Impose a 3D grid with voxel size of 1 ˚A. paper [12]. Both of them use electron density maps of aninitialstructureandonefromthesolution. Awayto 2. For every non-hydrogen atom increase the density produceaprobedensitymapfromthediscretesolution valueofthe nearestvoxelbythe atomicnumberof x∗ is described in [12]. It includes following steps. the atom. 4 1.0 3. Apply the Gaussian Fourier filter to blur the map. The recommended in [12] sigma is 0.187×resolution. The Gaussian kernel size is 0.8 2·⌈2s ⌉+1. o ati 4. Computational experiment wers r0.6 s n a QualityoffitischaracterizedbyachievedRMSDof ect 0.4 eachproteininacomplex(2). Thesolutionistreatedas Corr correctifeachproteinhas RMSD≤10˚A. Everyresults 0.2 is characterized by correct answers ratio N b = c, 0.0 N SDP Shift SQP SA Linear where N is a number of correctly determined protein’s c positionsandN isanumberofproteinswithinthecom- Figure 1: Averaged for all complexes and all map scores b plex. Results for simulated annealing method are pre- sented for following parameters: initial temperature is 100, penalty weight is 1. Results for other parame- 5. Conclusion terswerethesameforallsimulations(datanotshown), which is connected with the quality of the linear solu- tion described below. Moreover, only RMSD is used The paper investigatesquadratic programmingap- as a scoring function, since the purpose of the current proachfortheproblemofdeterminingproteins’position workistotesttheproposedoptimizationapproach,and inside a complex by its EM density map. The mathe- RMSD completely represents how precise an obtained maticaloptimizationproblemis formulatedusing over- solution is. lappingofproteinsincomputedpositionsbetweeneach other,whichformsanoverlappingmatrix,andwiththe Comparison of quality of fit and the objective given math, which gives a relevance vector. function’s value. Results presented in Table (1) for We tested semidefinite relaxation, spectrum shift the complex 1gte show that the highest b corresponds relaxation, sequential quadratic programming as to the lowest approximate objective value. Moreover, quadratic programming approaches and simulated an- the SQP and SA methods find a continuous solution, nealing and linear approximation for comparison with which gives almost the same objective value as the bi- quadratic methods. The datasets were formed from nary solution. simulated density maps. The best performance was shown by SQP, SA and linear approximation methods, Comparison of the methods with the linear which shows the importance of fitting a protein to the approximation. Table (2) with results for 1e6v, given map rather than looking for the best positions 1gte, 1z5s shows that the highest b was obtained by with relate to other proteins in the complex. simulated annealing and sequential programming ap- proaches. However, the initial point for both methods 6. Acknowledgements was obtained from the solution of the linear problem (9),andb forSQP,SAandthelinearproblemisalmost This work is supported by RFBR, grant 16-37- the same. Hence, correct solutions for these datasets 00485. can be achieved using only the linear approach. References Overall performance. Despite for some complexes mentioned above the linear approach leads to the cor- [1] C.L. Lawson, M.L. Baker, C. Best, C. Bi, rectsolution(andtothecorrectsolutionwithSQPand M. Dougherty, P. Feng, G. Van Ginkel, B. Devkota, SA), in general it is not true. Moreover, SQP and SA I.Lagerstedt,S.J.Ludtke,R.H.Newman,T.J.Oldfield, sometimes make the linear solution even worse, which I.Rees,G.Sahni,R.Sala,S.Velankar,J.Warren,J.D. can be observed on the following Figure 1, that shows Westbrook, K.Henrick, G.J. Kleywegt, H.M. Berman, averageb forallcomplexesandmapscores. Therefore, and W. Chiu. Emdatabank.org: Unified data re- inthesecasesfittingtothegivenmapismoreimportant sourceforcryoem. NucleicAcidsResearch,39(SUPPL. than arranging proteins’ positions among each other. 1):D456–D464, 2011. 5 Table 1: Objective function’s optimal values for 1gte, Contact, CCF, Skin-Core and Core-Skin scores. Contact CCF Method Continuous Binary b Continuous Binary b SDP -62711135 23648296 0.50 -2261626 563438 0.25 Shift 1990252 1483316 0.00 78968 43276 0.00 SQP -3707363 -3707363 1.00 1086 1086 0.00 SA(100,1) -3707354 -3707363 1.00 205462 205462 1.00 Linear -80 -3707363 1.00 -80 205462 1.00 Skin-Core Core-Skin Method Continuous Binary b Continuous Binary b SDP -26429514 9343938 0.50 -26196482 2055248 0.25 Shift 801317 499195 0.00 789572 210241 0.00 SQP -2382769 -2382769 1.00 -2352385 -2352385 1.00 SA(100,1) -2382764 -2382769 1.00 -2352380 -2352385 1.00 Linear -80 -2382769 1.00 -80 -2352385 1.00 Table 2: Results for three complexes. Each column represents a method, each value is a correct answers ratio b . 1e6v Scoring SDP Shift SQP SA Linear Contact 0.33 0.17 0.67 0.67 0.67 CCF 0.33 0.00 0.33 0.67 0.67 Skin-Core 0.50 0.00 0.67 0.67 0.67 Core-Skin 0.67 0.17 0.67 0.67 0.67 1gte Scoring SDP Shift SQP SA Linear Contact 0.50 0.00 1.00 1.00 1.00 CCF 0.25 0.00 0.00 1.00 1.00 Skin-Core 0.50 0.00 1.00 1.00 1.00 Core-Skin 0.25 0.00 1.00 1.00 1.00 1z5s Scoring SDP Shift SQP SA Linear Contact 0.25 0.00 0.50 1.00 1.00 CCF 0.25 0.00 0.25 1.00 1.00 Skin-Core 0.25 0.00 0.75 1.00 1.00 Core-Skin 0.25 0.00 0.75 1.00 1.00 6 [2] A.P. Pandurangan, D. Vasishtan, F. Alber, and withboundATPorADP.Proc.Natl.Acad.Sci.Usa,101, M.Topf. g-tempy: Simultaneousfittingofcomponents 15627–15632. in 3d-em maps of their assembly using a genetic algo- [19] Reverter, D. and Lima, C. D. (2005). Insights into E3 rithm. Structure, 2015. ligase activity revealed by a SUMO-RanGAP1-Ubc9- [3] K.Lasker, M. Topf, A.bSali, and H.J. Wolfson. Infer- Nup358 complex. Nature, 435, 687–692. ential optimization for simultaneous fittingof multiple [20] Sindelar,C.V.,andDowning,K.H.(2007).Thebegin- componentsintoacryoemmapoftheirassembly.Jour- ning of kinesin’sforce-generating cycle visualized at 9- nal of Molecular Biology, 2009. A resolution. Journal of Cell Biology, 177(3), 377–385. [4] Stephen Boyd and Lieven Vandenberghe. Convex Op- [21] Gayathri, P.,Fujii, T., Moller-Jensen, J., vandenEnt, timization. Cambridge University Press, 2004. F.,Namba,K.,andLowe,J.(2012).ABipolarSpindle [5] Alexandre d’Aspremont and Stephen Boyd. Relax- of Antiparallel ParM Filaments Drives Bacterial Plas- ations and randomized methods for nonconvex qcqps. mid Segregation. Science, 338(6112), 1334–1337. EE392o Class Notes, Stanford University, 2003. [22] Daud´en, M. I., Martin´-Benito, J., S´anchez-Ferrero, J. [6] Jorge Nocedal and Stephen J Wright. Numerical op- C., Pulido-Cid, M., Valpuesta, J. M., and Carrascosa, timization, second edition. Numerical optimization, J.L.(2013).Largeterminaseconformationalchangein- pages 497–528, 2006. ducedbyconnectorbindinginbacteriophageT7.Jour- [7] L.Ingber.Adaptivesimulatedannealing(asa): Lessons nal of Biological Chemistry, 288(23), 16998–17007. learned. Control and Cybernetics, 25(1):32–54, 1996. [8] F. DiMaio, M.D. Tyka, M.L. Baker, W. Chiu, and D. Baker. Refinement of protein structures into low- resolution density maps using rosetta. Journal of Molecular Biology, 392(1):181–190, 2009. [9] A.P. Pandurangan and M. Topf. Finding rigid bod- ies in protein structures: Application to flexible fit- ting into cryoem maps. Journal of Structural Biology, 177(2):520–531, 2012. [10] M. Topf, K. Lasker, B. Webb, H. Wolfson, W. Chiu, and A. Sali. Protein structure fitting and refinement guided by cryo-em density. Structure, 16(2):295–307, 2008. [11] S. Zhang, D. Vasishtan, M. Xu, M. Topf, and F. Al- ber. A fast mathematical programming procedure for simultaneous fitting of assembly components into cry- oem density maps. Bioinformatics, 26(12):i261–i268, 2010. [12] D. Vasishtan and M. Topf. Scoring functions for cry- oem density fitting. Journal of Structural Biology, 174(2):333–343, 2011. [13] P.Chac´onandW.Wriggers. Multi-resolution contour- based fitting of macromolecular structures. Journal of Molecular Biology, 317(3):375–384, 2002. [14] The MathWorks, Inc., Natick, Massachusetts, United States. MATLAB and Optimization Toolbox, Global Optimization Toolbox Release 2015b. [15] V.N.Maiorov and G.M. Crippen. Significance of root- mean-squaredeviationincomparingthree-dimensional structures of globular proteins. Journal of Molecular Biology, 235(2):625–634, 1994. [16] Dobritzsch, D., Ricagno, S., Schneider, G., Schnack- erz,K.D.andLindqvist,Y.(2002).CrystalStructureof theProductiveTernaryComplexofDihydropyrimidine Dehydrogenase with Nadph and 5-Iodouracil. Implica- tions for Mechanism of Inhibition and Electron Trans- fer. J.Biol.Chem., 277, 13155. [17] Grabarse, W., Mahlert, F., Shima, S., Thauer, R.K. and Ermler, U. (2000). Comparison of Three Methyl- Coenzyme M Reductases from Phylogenetically Dis- tant Organisms: Unusual Amino Acid Modification, Conservation and Adaptation. J.Mol.Biol., 303, 329. [18] Nolen,B.J.,Littlefield,R.S.andPollard,T.D.(2004). Crystalstructuresofactin-relatedprotein2/3complex 7