Comparative analysis of protein structure using multiscale additive functionals

Marconi Soares Barbosa^1, Rinaldo Wander Montalvão^2, Tom Blundell^2 and Luciano da Fontoura Costa^1

^1 Institute of Physics at São Carlos, University of São Paulo, São Carlos, SP, PO Box 369, 13560-970, Phone +55 16 3373 9858, FAX +55 162 71 3616, Brazil, [email protected]
^2 Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, Phone +44 1223 333 628, FAX +44 1223 766 082, United Kingdom

(Dated: February 9, 2008)
arXiv:q-bio/0701019v1 [q-bio.QM] 14 Jan 2007

This work reports a new methodology aimed at describing characteristics of protein structural shapes, and suggests a framework in which to resolve or classify such structures automatically into known families. This new approach to protein structure characterization is based on elements of integral geometry, using biologically relevant measurements of shape and considering them on a multi-scale representation, which aligns the proposed methodology with the recently reported tube picture of a protein structure as a minimal representation model. The method has been applied with good results to a subset of protein structures known to be especially challenging to revert into families, confirming the potential of the proposed method for accurate structure classification.

PACS numbers:

I. INTRODUCTION

Evolution has produced a huge number of protein families and super-families whose members possess similar sequences and three-dimensional structures. Restraints on evolutionary divergence are mainly related to the protein function, and therefore selective pressure tends to operate on the three-dimensional structure [1]. HOMSTRAD [2] is an example of a database of protein structures organized into homologous families. As a consequence of the global proteomic effort, the number of known structures is growing at an impressive rate and has passed the total of 39000 structures. This is remarkable progress but, on the other hand, it also introduces an overwhelming amount of data to be manually classified in those databases. With more than 400 structures solved every month, the challenge for automatic protein structural comparison and classification is greater than ever. Most protein comparison methods depend mainly upon structural alignment and RMSD measures, and therefore are not completely reliable [3]. While RMSD is a good measure of structure similarity for almost identical proteins, it cannot be used to judge dissimilarity since it violates the triangle inequality. This means that any system based on RMSD alone is unable to cluster structures and is consequently incapable of classifying them into families. In addition, the reliance on sequence alignments introduces a drawback because it is virtually impossible to avoid errors during the alignment construction.

In this paper we investigate the potential of an algorithm adapted to automatically classify proteins into HOMSTRAD families. This algorithm is based on concepts of Integral Geometry [4], known as Morphological Image Analysis (MIA), which has recently been applied to a series of problems due to its simplicity in design and implementation. Fields as diverse as Neuroscience [5] and Materials Sciences [6] have benefited from this approach.

II. ADDITIVE SHAPE FUNCTIONALS

We start by describing the mathematical aspects of the adopted procedure. The Minkowski functionals of a body $K$ in the plane are proportional to the familiar geometric quantities of area $A(K)$, perimeter $U(K)$ and the connectivity or Euler number $\chi(K)$. The usual definition of the connectivity from algebraic topology in two dimensions is the difference between the number of connected components $n_c$ and the number of holes $n_h$, $\chi(K) = n_c - n_h$. In three-dimensional Euclidean space, there are two kinds of holes to consider. First, we have the pure hole, a completely closed region of white voxels surrounded by black voxels. Second, we have the tunnels. The Euler characteristic is consequently given as $\chi(K) = n_c - n_t + n_h$, where $n_t$ is the number of tunnels and $n_h$ is the number of pure holes. There is an additional geometric quantity to consider in three-dimensional space, namely the mean curvature or breadth $B(K)$. By exploring the additivity of the Minkowski functionals, their determination reduces to counting the multiplicity of basic building blocks that disjointly compose the object. For example, a voxel can be decomposed as a disjoint set of 8 vertices, 12 edges, 6 faces and one open cube. The same process can be applied to any object in a lattice. For a three-dimensional space, which is our interest regarding protein structures, see [5, 6], we have

\begin{equation}
\begin{aligned}
V(P) &= n_3, & S(P) &= -6n_3 + 2n_2, \\
2B(P) &= 3n_3 - 2n_2 + n_1, & \chi(P) &= -n_3 + n_2 - n_1 + n_0,
\end{aligned}
\end{equation}

where $n_3$ is the number of interior cubes, $n_2$ is the number of open faces, $n_1$ is the number of edges and $n_0$ is the number of vertices. So, the procedure to calculate the Minkowski functionals of a pattern $P$ can be reduced to counting the number of elementary bodies of each type (cubes, faces, edges and vertices) that compose the voxels belonging to $P$.

III. PROTEIN STRUCTURE, TUBE PICTURE AND MULTISCALE SIGNATURES

The protein structure in our approach is defined essentially by the geometrical/topological nature of its backbone. All α-carbon atom coordinates are identified from a .pdb file and an interpolation scheme is used to connect neighboring atoms by a straight path. This design procedure attaches a variable resolution to the method, as the highly refined atomic scale data has to be truncated during the process.

In our analysis the calculation of the Minkowski functionals is incorporated into a multiscale framework. In such a scheme, all four quantities are calculated as a function of a control parameter as some transformation is made on the structure of interest. In this paper we consider this transformation to be the process of exact dilations and the control parameter to be the dilation radius. Our choice is particularly suited, as the exact dilation procedure fits naturally into what has been described as the tube picture for protein structure analysis [7], a minimalist biophysical reasoning of the protein model. While the intrinsic aspects of the geometry/topology are accounted for at each spatial scale by the Minkowski functionals, the space surrounding the backbone is probed by performing the dilation of the structure, and this information is condensed in what we henceforth call multiscale signatures. The behavior of such signatures, particularly the topologically related ones, can be discontinuous. For example, the process of dilation may abruptly change the number of pure holes or tunnels at particular scales, and these facts are registered for all scales in the multiscale signature for the connectivity or Euler number (characteristic).

IV. RESULTS AND DISCUSSION

Figure 1 shows the signatures of all considered functionals for a set of 71 proteins which were chosen specifically because of their similarities. The range of scales shown in these graphs encompasses the initial structure and the final filled volume without holes and tunnels ($\chi = 1$). There are both similarities and striking differences whose subtleties, until now, have been handled only by more complex algorithms.

[Figure 1 panels: Area, Perimeter, Mean breadth and Euler characteristic, each plotted against the dilation radius.]

For each of those signatures in Figure 1 we select three features in an attempt to globally characterize the structure and, by doing so, minimize the amount of data needed for future classification based on Minkowski functionals. For the signatures of Area and Perimeter, we evaluated the standard deviation, the integral of the curve, and the scale at which the integral of the curve reaches half of the total value.
For the signatures of the Connectivity and of the Mean Breadth we measured the standard deviation, the integral of the curve and the monotonicity index given by $i = (i_s + i_d + i_p)/i_s$, where $i_{s,d,p}$ are the counts of the number of times the curve increases, decreases or stays constant, respectively. Table I shows the numeric results obtained by classical discriminant analysis [8] based on the twelve global measures above and quantifies the classification potential of the proposed framework. Such a discriminant analysis projects the measurements in such a way as to optimize their separation, expressed in terms of high interclass and low intraclass dispersions. It is remarkable that, although the structures were specially chosen to make a reduction into families difficult, this approach managed to perfectly classify four out of the five families. A mistake was made in class C, where 1 out of 13 structures was misclassified. It is worthwhile to note that, although they exhibit different foldings, alpha plus beta in class C and all alpha in class E, their average length and topological properties in general are quite similar. Figure 2 shows a two-dimensional section of the complete feature space defined by measures from the mean breadth and perimeter only. It provides a more economical discriminating clustering, albeit with overlaps.

FIG. 1: Multiscale signatures associated with the four Minkowski functionals in the Euclidean space.

              A    B    C    D    E    Error      Posterior.Error
A (asp)      13    0    0    0    0    0.0000000   0.0000477
B (ghf13)     0   11    0    0    0    0.0000000   0.0000025
C (ghf22)     0    0   12    0    1    0.0769231   0.1051906
D (kinase)    0    0    0   16    0    0.0000000   0.0247150
E (phoslip)   0    0    0    0   18    0.0000000  -0.0104105
Overall                                0.0140845   0.0221997

TABLE I: The result of a classical discriminant analysis for the 12 features extracted from the multiscale signatures.

[Figure 2 axes: Integral of mean breadth curve (vertical) against Perimeter standard deviation (horizontal); legend: classes A, B, C, D, E.]

FIG. 2: A scatter plot derived from the mean breadth and the perimeter alone leads to a discriminative feature space.

V. CONCLUSION

In this paper we have assessed the potential of the multi-scale Minkowski functionals for protein morphological characterization and structural analysis. We found that these functionals are potentially suited to this kind of analysis, as substantiated by the results obtained for a distinct set of structures known to have highly similar topological features. For all but one family of structures, namely the glycosyl hydrolase family 22, the classification through classical discriminant analysis yielded fully accurate results. These results are comparable with the best approach so far [9], which uses considerably more parameters and is based on a complex concept. In addition to the classification result, it is important to emphasize the simplicity of the algorithm and the clear relationship between the quantities used for the characterization and familiar geometrical, topological and biological concepts. This direct relation to familiar measurements, combined with the simplicity of implementing the MIA approach, suggests that this kind of analysis is a particularly useful tool for classifying the shape of protein structures.

[1] M. Bajaj and T. Blundell. Evolution and the tertiary structure of proteins. Ann. Rev. Biophys. Bioeng., 13:453-92, 1984.
[2] K. Mizuguchi, C. M. Deane, T. L. Blundell, and J. P. Overington. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7(11):2469-2471, 1998.
[3] P. Koehl. Protein structure similarities. Curr. Opin. Struct. Biol., 11:348-353, 2001.
[4] D. Stoyan, W. S. Kendall, and J. Mecke. Stochastic Geometry and its Applications. Wiley, West Sussex, England, 1995.
[5] M. S. Barbosa, L. da F. Costa, E. S. Bernardes, G. Ramakers, and J. van Pelt. Characterizing neuromorphologic alterations with additive shape functionals. European Physical Journal B, 2004.
[6] K. Michielsen and H. de Raedt. Integral-geometry morphological image analysis. Physics Reports, 347:461-538, 2001.
[7] Jayanth R. Banavar and Amos Maritan. Geometrical approach to protein folding: a tube picture. Reviews of Modern Physics, 75(1), 2003.
[8] G. J. McLachlan. Discriminant analysis and statistical pattern recognition. Wiley, 1992.
[9] Peter Rogen and Boris Fain. Automatic classification of protein structure by using Gauss integrals. PNAS, 100:119-124, 2003.