This document has been downloaded from Tampub – The Institutional Repository of University of Tampere Publisher's version Authors: Ahola Virpi, Aittokallio Tero, Vihinen Mauno, Uusipaikka Esa A statistical score for assessing the quality of multiple sequence Name of article: alignments Year of 2006 publication: Name of journal: BMC Bioinformatics Volume: 7 Number of issue: 484 Pages: 1-19 ISSN: 1471-2105 Discipline: Medical and Health sciences / Medical biotechnology Language: en School/Other Unit: Institute of Biomedical Technology URL: http://www.biomedcentral.com/1471-2105/7/484 URN: http://urn.fi/urn:nbn:uta-3-623 DOI: http://dx.doi.org/10.1186/1471-2105-7-484 All material supplied via TamPub is protected by copyright and other intellectual property rights, and duplication or sale of all part of any of the repository collections is not permitted, except that material may be duplicated by you for your research use or educational purposes in electronic or print form. You must obtain permission for any other use. Electronic or print copies may not be offered, whether for sale or otherwise to anyone who is not an authorized user. BMC Bioinformatics BioMed Central Methodology article Open Access A statistical score for assessing the quality of multiple sequence alignments Virpi Ahola*1,2, Tero Aittokallio3,6, Mauno Vihinen4,5 and Esa Uusipaikka2 Address: 1Biotechnology and Food Research, MTT Agrifood Research Finland, Jokioinen, Finland, 2Department of Statistics, University of Turku, Turku, Finland, 3Department of Mathematics, University of Turku, Turku, Finland, 4Institute of Medical Technology, University of Tampere, Tampere, Finland, 5Research Unit, Tampere University Hospital, Tampere, Finland and 6Systems Biology Unit, Institut Pasteur, Paris, France Email: VirpiAhola*[email protected]; [email protected]; [email protected]; [email protected] * Corresponding author Published: 03 November 2006 Received: 11 April 2006 Accepted: 03 November 2006 BMC Bioinformatics 2006, 7:484 doi:10.1186/1471-2105-7-484 This article is available from: http://www.biomedcentral.com/1471-2105/7/484 © 2006 Ahola et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Multiple sequence alignment is the foundation of many important applications in bioinformatics that aim at detecting functionally important regions, predicting protein structures, building phylogenetic trees etc. Although the automatic construction of a multiple sequence alignment for a set of remotely related sequences cause a very challenging and error-prone task, many downstream analyses still rely heavily on the accuracy of the alignments. Results: To address the need for an objective evaluation framework, we introduce a statistical score that assesses the quality of a given multiple sequence alignment. The quality assessment is based on counting the number of significantly conserved positions in the alignment using importance sampling method in conjunction with statistical profile analysis framework. We first evaluate a novel objective function used in the alignment quality score for measuring the positional conservation. The results for the Src homology 2 (SH2) domain, Ras-like proteins, peptidase M13, subtilase and β-lactamase families demonstrate that the score can distinguish sequence patterns with different degrees of conservation. Secondly, we evaluate the quality of the alignments produced by several widely used multiple sequence alignment programs using a novel alignment quality score and a commonly used sum of pairs method. According to these results, the Mafft strategy L-INS-i outperforms the other methods, although the difference between the Probcons, TCoffee and Muscle is mostly insignificant. The novel alignment quality score provides similar results than the sum of pairs method. Conclusion: The results indicate that the proposed statistical score is useful in assessing the quality of multiple sequence alignments. Background for understanding the structure and function of these mol- A wealth of molecular data concerning the linear structure ecules. The results of annotation of gene/protein of proteins and nucleic acids is available in the form of sequences, prediction of protein structures or building of DNA, RNA and protein sequences. Multiple sequence phylogenetic trees, for instance, are critically dependent alignment has become an essential and widely used tool on the quality of the given alignment. It has been recog- Page 1 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 nized that the automatic construction of a multiple account the relative symbol frequencies in the column, sequence alignment for a set of remotely related and (iii) their stereo-chemical properties. Additional sequences can be a very demanding task. Therefore, there requirements for a good conservation score include the is a need for an objective approach to evaluate the align- possibility to incorporate (iv) the effect of gaps and (v) ments produced by alignment programs. sequence weighting into (vi) a simple scoring strategy. Two popular measures for scoring entire multiple align- Existing positional scoring approaches can be roughly ments are the sum of pairs (SP) score and the column divided into two categories with respect to the second and score (CS) [1]. These scores can, however, only be used if third criteria. In the first category, the positional conserva- a reference alignment of the same sequences is available. tion is characterized based on the symbol frequencies The SP score calculates the proportion of identically only. Such frequency-based methods include, for aligned residue pairs in the test and the reference align- instance, the information-content score that quantify the ments, whereas the CS score measures the fraction of variability among the observed symbols at a particular identically aligned positions. Several modifications have position by means of Shannon's entropy [18,19]. A popu- been made to the SP score [2,3]. The APDB (Analyze lar variation of the information-content (IC) score meas- alignments with PDB) quality measure evaluates the qual- ures the Kullback-Leibler distance (relative entropy) ity of an alignment by using available tertiary structures of between the observed symbol distribution and a back- the sequences in the alignment [4]. The recently intro- ground distribution of a priori symbol probabilities [20]. duced multiple overlap score (MOS) is a promising The background probability of an individual symbol may approach, which does not need a reference alignment [5]. be calculated from the complete alignment, possibly sup- The MOS searches for identically aligned regions in many plemented with symbol-dependent pseudo-counts [21]. alignments and presumes that the alignment with the Alternatively, a priori distribution can be determined highest number of such residues also has the highest qual- using overall relative frequencies of symbols within the ity. sequences of the organism or protein family under inves- tigation. We introduce a statistical alignment quality score which first quantifies the degree of conservation at each align- In the second category of scoring approaches, the posi- ment position and then counts the number of signifi- tional conservation is characterized based on both symbol cantly conserved positions over the alignment. For frequencies and their similarity properties. Such similar- measuring the degree of conservation, we use a type of Z- ity-based scores address the fact that some symbol combi- score that is based on profile analysis [6]. After deriving nations occur more frequently than others mainly because the maximum Z-score for positional conservation, the sta- of the chemical and physical properties. The most tistical significance of an observed score value is estimated straightforward strategy is to group all the symbols using the importance sampling method [7]. The full align- according to their physicochemical properties before ment quality score is defined in terms of positional signif- applying a particular scoring scheme. For instance, Taylor icance levels, where the multiple comparison problem is presented a classification of amino acids based on their addressed with false discovery rates (FDR) [8]. The practi- synthesis in the Dayhoff mutation data matrix [22,23]. cal performance of the maxZ score is demonstrated using Subsequently, the degree of positional conservation with the SH2 domain, Ras-like proteins, peptidase M13, subti- respect to each overlapping group of symbols can be lase and β-lactamase families. The alignment quality score quantified using any frequency-based scoring approach, is finally applied to evaluate the alignment programs such as the information content [24]. Different conserva- Clustal [9], TCoffee [10], Dialign2 [11], Probcons [12], tion scores accounting for the stereochemical sensitivity Muscle [13], and Mafft [14,15]. can be obtained using different symbol properties [25]. Related work In general, the symbol properties can be considered by Several approaches have been proposed for the conserva- predefining an appropriate matrix where entries represent tion analysis of multiple sequence alignments to quantify the similarity or dissimilarity between a symbol pair. Fre- the degree of conservation at each aligned position using quently used symbol scoring matrices for amino acids column-specific score values [16]. Valdar reviewed a wide include the BLOSUM and Gonnet series of substitution range of such score types developed during the last two matrices and PAM distance matrices [26-28]. Perhaps the decades for protein sequence analysis [17]. He also intro- most widely used scoring approach, 'sum-of-pairs', char- duced the following three criteria that a positional conser- acterizes the positional conservation by calculating the vation score should fulfill: (i) the score should be a sum of all pairwise similarities between the symbols in the mathematical mapping from an alignment position into a particular column [29]. It should be noted, that this 'sum bounded interval of real values which (ii) takes into of pairs' score is different from the SP score mentioned Page 2 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 earlier in the Background section. The SP score in [1] is sequence identity. Sequences within each subgroup were used to measure alignment quality with respect to the ref- aligned against the profile, then the groups were aligned, erence alignment, whereas the score by Carillo and Lip- excluding positions with low sequence identity. The posi- man [29] is more generally applicable. In this work, we tions with gaps were also excluded from the final align- only use the reference alignment-based SP score. A similar ment. We used only the first sequence of each subgroup in but more complex mean distance (MD) score is used as an order to avoid over-representation of profiles with many objective function in the multiple alignment software very similar sequences. This was necessary because the Clustal [9]. This normalized MD score also considers the current maxZ score does not take the pairwise identity of fraction of gaps [30]. A number of variations can be made the sequences into account or otherwise weight the by using different similarity matrices on symbols or sequences. The alignment of Ras-like proteins consists of weighting schemes on sequences [31]. 334 sequences. The present work is a continuation of our previous work Upper panels of the Figures 1, 2, 3 illustrate parts of the on a statistical (Dunn-Sidak) framework for detecting alignments of the Ras-like proteins, SH2 domain, pepti- conserved residues in the positions of a multiple sequence dase M13, subtilase and β-lactamase families. The com- alignment [32]. Here, we allow for the incorporation of plete alignments of the Ras-like proteins and SH2 domain any symbol similarity matrix into the framework that was can be found as additional files (Additional files 1, 2, 3, 4, based on simple frequency-based scoring function. We 5, 6, 7, 8, 9). The figures were generated using MultiDisp have previously demonstrated the usefulness of this score graphics program developed to visualize multiple in the automatic detection of the conserved residues in a sequence alignments [37] (Riikonen et al., in prepara- multiple sequence alignment, and compared its results on tion). The lower parts of the alignments include the maxZ, the SH2 domain with functionally and structurally impor- MD and IC score values. The Blosum62 and grouping of tant positions of the alignment [32]. Another application amino acids were used as a scoring matrix in the maxZ of the conservation scores includes the improvement of score. the reliability of HMMs in the sequence similarity search by decreasing the number of false positive search results Effect of the scoring matrices [33]. In the present study, the emphasis is on positional One advantage of the maxZ score is that it can consider conservation rather than on individual residues with the the physicochemical relationships of amino acids. The aim of assessing the quality of full alignment. user is able to choose an arbitrary scoring matrix or classi- fication of the amino acids, which can be incorporated Results into the calculation of the maxZ score. In addition to the Evaluating the maxZ score for positional conservation identity matrix, we demonstrate the use of three different In this section, we study the practical performance of the scoring matrices: Blosum62, Gonnet250 and PAM250 maxZ score in SH2 domain, Ras-like proteins, peptidase [26-28]. Additionally, we classify amino acids into six M13, subtilase and β-lactamase familes. We first demon- physicochemically related groups as follows: hydropho- strate the effect of five different scoring matrices and then bic {V, I, L, F, M, W, Y, C}, negatively charged {D, E}, pos- we compare the performance of maxZ score with those of itively charged {R, K}, conformational {G, P }, polar {N, information content (IC) and Mean Distance (MD) score Q, S} and {A, T}. This classification has been used, for [20,9]. Finally, we demonstrate how the maxZ score can example, by Shen and Vihinen [38]. Figure 1 shows the be used to generate a consensus sequence. scaled -log(p)-values for the Ras-like proteins using the five different scoring schemata. Multiple sequence alignments We used the multiple sequence alignments of the SH2 The residue positions in the alignment of Ras-like proteins domains, Ras-like proteins, peptidase M13, subtilase and were divided into five groups according to the entropy β-lactamase families to evaluate the maxZ score. The and variability [35]. The parameter values of the classifica- alignments for the SH2 domain, peptidase M13, subtilase tion algorithm were chosen such that the groups represent and β-lactamase families were obtained from the Pfam the known structural and/or functional roles of the resi- database [34]. The seed alignments of the SH2 domain, due positions. A rough overview of the categories is the peptidase M13, subtilases and β-lactamases consist of 58, following: 24, 45 and 128 sequences, respectively. These alignments also include gaps. The sequence alignment of the Ras-like • Box 11 contains positions with low entropy and varia- proteins was downloaded from the web page of an article bility. The positions in this group form a main functional by Oliveira et al. [35]. The alignment was build with a site. two-step alignment procedure [36]. First they classified sequences into groups with approximately 90% pairwise Page 3 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 FMiuglutirDei s1p visualization of part of the Ras-like proteins (upper) and the corresponding scaled -log(p)-values (lower) MultiDisp visualization of part of the Ras-like proteins (upper) and the corresponding scaled -log(p)-values (lower). The curves show the p-values calculated using (red) Blosum62, (green) Gonnet250, (black) PAM250, (magenta) iden- tity scoring matrices and (blue) classification of the amino acids for the Ras-like proteins. Page 4 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 a b c βB2 βB5 βD3 βD4 βD5 βD6 βF3 αB5 αB6 αB8 αB9 MsFeiurgvlutairDteioi sn2p svciosuraelsiz (altoiowne ro)f the a) βB-stand, b) βD-stand and c) αB-helix of the SH2 domain (upper) and the corresponding con- MultiDisp visualization of the a) βB-stand, b) βD-stand and c) αB-helix of the SH2 domain (upper) and the cor- responding conservation scores (lower). The curves show (red) the scaled -log(p)-values, (blue) Mean Distance and (green) Information content scores for the alignment. Consensus sequence for the alignment positions in c) is F P S L P E L V E H Y. (cid:129) Box 12 consists of positions with low variability and For a more detailed description of the categories, see the moderate entropy. These positions are located in the core original paper [35]. Table 1 shows the median (lower and of the structure next to the residues in Box 11. upper quartile) values of the -log(p)-values of the maxZ scores with different scoring matrices, along with MD and (cid:129) Box 22 contains positions with moderate entropy and IC scores in each of the five groups. As expected, all con- variability. These residue positions are located in the core servation scores decreased gradually when moving from structure but are not adjacent to the residues in the Box the positions with low entropy and variability to those 11. The positions are involved in the structure of the pro- with high entropy and variability. The performance of the tein, but also in signal transmission between the modula- MD and maxZ scores was very similar. The maxZ score tors and the main functional site. with groups of amino acids distinguished slightly better than the other scores the moderately conserved positions (cid:129) Box 23 consists of the positions with high entropy and (Boxes 12–23) from the highly conserved positions (Box moderate variability. These positions are located at the 11) and unconserved ones (Box 33) (Table 1, Figure 1). surface or in the core of the protein and are involved in modulator interaction. In both Ras-like protein and SH2 domain examples, all the scoring schemes tend to provide very similar results (cid:129) Box 33 contains highly variable positions with high (see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9 for Blosum62 entropy. These positions are mainly located at the surface and grouping of amino acids). The results with Blosum, of the protein. Gonnet and PAM matrices all rely heavily on the diagonal Page 5 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 a) b) c) 1 2 3 4 63 64 65 66 67 68 69 70 71 72 73 147 148 149 150 151 MD 39 43 7 9 MD 23 21 100 100 57 14 100 0 0 7 20 MC 92 38 27 37 56 IC 39 38 30 44 IC 27 35 84 67 41 32 84 70 62 23 45 IC 63 43 31 39 50 maxZ 38 36 40 48 maxZ 25 30 100 71 42 22 100 5 0 15 51 maxZ 68 40 32 48 54 d) e) f) 215 216 217 218 219 220 221 222 223 28 29 30 31 254 255 256 257 MD 2 9 74 27 38 8 28 14 41 MD 90 0 34 89 MD 92 100 74 31 IC 44 25 84 41 52 18 26 27 59 IC 59 66 34 47 IC 84 49 52 63 maxZ 22 17 100 25 84 2 28 19 51 maxZ 86 0 42 70 maxZ 100 71 90 100 g) h) i) 687 688 689 690 691 692 693 120 121 122 123 306 307 308 MD 100 92 100 26 75 59 100 MD 91 20 23 93 MD 21 27 34 IC 49 60 55 58 40 29 73 IC 51 17 26 59 IC 51 21 30 maxZ 71 100 68 68 56 29 100 maxZ 79 25 33 100 maxZ 100 27 53 FhM)iu gIl utairDnedi s 3ip) vIIi smuaoltizifast oiofn t hoef tβh-ela ac)t aIm, ba)s IeI , fca)m IiIlIi easn da ndd) tIVhe m toabtilfes ooff tthhee pcoepntsiedravsaet iMon1 3s,c oe)r eI,s f) II and g) III motifs of the subtilase, and MultiDisp visualization of the a) I, b) II, c) III and d) IV motifs of the peptidase M13, e) I, f) II and g) III motifs of the subtilase, and h) I and i) II motifs of the β-lactamase families and the table of the conservation scores. MD = mean distance, IC = information content scores and maxZ = scaled -log(p)-values for the alignment. Page 6 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 Table 1: Median (lower and upper quartiles) of the -log(p)-values with different residue scoring schema together with the MD and IC scores. Score Box11 Box12 Box22 Box23 Box33 LogP Blosum62 708 (708, 708) 611 (198, 708) 208 (161, 547) 120 (99, 177) 75 (47, 123) LogP Gonnet 708 (708, 708) 190 (164, 708) 158 (131, 189) 98 (78, 136) 64 (56, 106) LogP Indep 708 (708, 708) 202 (158, 708) 171 (108, 202) 75 (63, 113) 57 (35, 96) LogP PAM 708 (212, 708) 201 (166, 708) 153 (125, 201) 94 (81, 133) 66 (56, 105) LogP 6 groups 644 (631, 683) 312 (300, 333) 279 (241, 341) 216 (77, 240) 43 (26, 91) MD 92 (86, 97) 43 (29, 55) 34 (24, 42) 24 (19, 31) 20 (15, 25) IC 57 (55, 59) 39 (34, 48) 31 (27, 35) 21 (19, 23) 13 (10, 19) Box11 and Box33 represent positions with low and high entropy and variability, respectively. The three middle columns represent the moderately conserved positions. More detailed description of the categories can be found in Oliveira et al. [35]. values of the scoring matrices. For instance, a position amino acids [41]. All known SH2 domains share the same with highly or moderately conserved leucine obtains a rel- architecture, consisting of a central antiparallel β-sheet atively low maxZ score (Figure 1), whereas a position with flanked by two α-helices. The central β-sheet (strands B, C an unconserved cysteine may be also assigned as highly and D) forms the core of the structure and includes most conserved. This is especially critical when the Gonnet of the conserved residues. scoring matrix is used. The results with six amino acid groups differed most from the other scoring schemes since All scores consider the positions forming the binding this calculates the maxZ score for the amino acid classes pocket as highly conserved (> 0.4). These include invari- instead of single residues. The grouping of amino acids ant βB5, which interacts with phosphotyrosine, and βD4 tends to give high scores for the positions where the and αA2 (data not shown), which form the binding majority of the residues belong to the same class. The use pocket for the phosphotyrosine [42] (Figure 2ab). Posi- of the identity matrix corresponds to the special case tion βD6, which is also involved in forming the binding where similarities among the symbols are ignored, and pocket, obtains lower conservation score values (≈ 0.2) the amino acids are handled as if they where unrelated. indicating moderate conservation. The binding pockets The corresponding score is thus based solely on the rela- for phosphotyrosine-following residues are formed by the tive frequencies of the residues and background probabil- αB-helix, especially positions αB5 – 6 are involved in ities. The scoring based on the identity matrix shows quite forming the hydrophobic core for residue +3 [43]. Posi- similar results with the Blosum62 and Gonnet matrices. tions βB2, αB9 and βF3 are occupied with aromatic resi- For some positions, however, the identity matrix fails to dues. The MaxZ and IC scores determine these five detect the conserved positions. Similar behavior was seen positions as highly conserved, whereas the MD score (0.2 with the PAM matrix (Figure 1, position 10). – 0.4) determines positions αB9, and βF3 as moderately conserved (Figure 2c). The binding site for ligand residue Comparisons with other scores +1 includes positions βD3 and βD5 [42]. While the maxZ The results of the maxZ score were compared with those and IC scores determine position βD5 as moderately con- of the MD and IC. Figures 2 and 3 show the MD and IC served, the MD score (< 0.2) rather considers that position scores together with the -log(p)-values of the maxZ scores as unconserved (Figure 2b). for the SH2 domain, peptidase M13, subtilase and β-lacta- mase family sequences. Scaling of the -log(p)-values was Peptidase family M13 Peptidase family M13, also known performed using zero as a minimum. The maximum value as neprilysin family, consists of type II integral transmem- was obtained by calculating the -log(p)-values for each brane proteins with short N-terminal cytoplasmic possible invariant position and defining the 5% percentile domain, a hydrophobic transmembrane region, and a value to be the maximum. Blosum62 was used as a scor- large ectodomain containing a active site [44]. Three con- ing matrix in the maxZ score. The default multiple served motifs characterize all known M13 endopeptidases sequence alignment parameters of ClustalX were used to (the numbers are Pfam alignment positions): I:0vNAfY4, calculate the MD score. II:63XXHEXXH- -XX73, III:147EXXXD151 (Figures 3abc). Additionally IV:217HXXXXXR223 is conserved in nepri- SH2 domain SH2 domains are binding modules recog- lysins (Figure 3d). nizing phosphotyrosines and surrounding residues in polypeptides and proteins [39,40]. Many SH2 domains All measures scored as highly conserved the residues H65, recognize especially residues +1 and +3 following the H69, E147 which are ligands for Zn2+, and E66 and H217, phosphotyrosine and form binding pockets for these which are involved in catalysis (Figure 3bcd). The maxZ Page 7 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 score values varied from 0.68 to 1 in the invariant posi- tions, the maxZ and IC scores were dependent on the side- tions occupied with different amino acids, whereas the chain. Nevertheless, the maxZ score determined all the corresponding MD score values were more stable. This invariant positions as highly conserved (0.68–1), whereas was due to different diagonal values of the scoring matrix. the IC score obtained somewhat lower scorings (0.49– The similar behavior was found in the position 219 of the 0.73). motif IV, where proline was the most frequent residue. The maxZ score determined that position as highly con- β-Lactamases β-Lactamase family of Pfam contains served (0.84), whereas the other scores only considered it sequences from many different groups including D-ala- as moderately conserved (0.38 and 0.52). For the other nyl-D-alanine carboxypeptidase B, aminopeptidase, alka- important side-chains of N1, A2, D215, H217 and R223, line D-peptidase, animal D-Ala-D-Ala carboxypeptidase which have a role in substrate binding, the behavior of the homologues, the class A and C β-lactamases and eukaryo- three scores was mostly very similar (Figure 3ad). The only tic β-lactamase homologs. The family is very diverse out- exception was the position D215, which was considered side the SXXK motif, S being the active side residue. For as moderately conserved by the maxZ and IC scores (0.22 the sequences belonging to the S12 peptidase (D-Ala-D- and 0.44), while the MD score considered it as uncon- Ala carboxypeptidase B) family in the MEROPS database, served. Another difference between the scores was in the the active site motif is I:120SXTK123. It also has another positions 70 and 71 of the motif II, where the IC score motif: II:306YXN308 (Figure 3hi). could not determine these positions as inserts, but obtained considerably high conservation score values. All the scores determined the active site serine residue as highly conserved and provided a very similar conservation Subtilisins Pfam subtilase is a family of serine proteases profile for the first motif (Figure 3h). In the second motif, consisting of S8 and S53 peptidase families of the the maxZ and IC score correctly determined the highly MEROPS database. The S8 peptidases are divided into two conserved position 306 with tyrosine/serine residues and subfamilies: S8A (e.g. subtilisin) and S8B (e.g. kexin). The considered the other residues as moderately conserved, sequences in the S8 family have a catalytic triad Asp/His/ while the MD score, on the contrary, failed to detect the Ser. In the subfamily S8A, the active site residues occur (in highly conserved position 306, where it gave only 0.21 as the Pfam alignments) in the motifs I:28D-T/SG31, a score value for that position (Figure 3i). II:254HGTH257 and III:687GTSMAXP693, and in the sub- family S8B in the motifs I:28D-DG31, II:254HGTR257, These results on the example families suggest that there III:687GTSA/VA/SXP693 (Figure 3efg). are three main differences between the maxZ, MD and IC conservation scores. Firstly, since the maxZ score is All positions of the catalytic triad Asp/His/Ser were con- strongly affected by the diagonal value of the scoring sidered as highly conserved by each of the conservation matrix used, it obtains slightly lower values for the posi- scores. In the first motif, the maxZ and MD scores tions occupied with very frequently occurring amino acids obtained high conservation score values (0.70–0.90) for and slightly higher value for more rarely occurring amino the aspartic acid and glycine residues (Figure 3e). The acids than the other scores. For very frequently occurring middle position had three possible side-chains in the first amino acids, see for example position αB5 of the SH2 motif, and hence, all scores determined that position as domain (Figure 2c) with highly conserved leucine or posi- moderately conserved (0.34–0.42). In the second motif, tions III:G687 and III:S689 of the subtilase family (Figure there were more differences between the conservation 3g), which obtain a much lower maxZ than MD score. In scores: the maxZ score determined all the positions as the opposite case, at positions αB8 – 9 of the SH2 highly conserved (0.71–1), the MD score determined the domain, the high diagonal values of the scoring matrix for first three positions as highly conserved (0.74–1), whereas histidine and tryptophan offer a much more reliable scor- the fourth position obtained much lower score (0.31) ing than the MD score (Figure 2c). Similarly, the result of (Figure 3f). The IC score determined only the first position the maxZ score for the position 11:306 of the β-lactamase as highly conserved (0.84), whereas the other positions family indicates that the position may be functionally obtained a slightly lower (0.49–0.63) conservation score important, whereas the result of the MC score indicates values. Hence, only the maxZ score considered the whole the contrary (Figure 3i). The maxZ score also determines motif as highly conserved. The MD score, on the contrary, different values for invariant positions with different obtained rather low conservation score values for the amino acids, whereas MD score always gives the score of position 257, where subgroups S8A and S8B are con- 1 for invariant positions. Secondly, as the maxZ score is served in different amino acids. The third motif was a entirely determined by the residue obtaining the greatest good example of the behavior of the different scores in the Z-score value, it is not affected by the other residues whose invariant positions (Figure 3g). While the MD score proportions may be very low, but a single conserved resi- obtained the highest score value 1 in all the invariant posi- due can already define a position as conserved (see posi- Page 8 of 19 (page number not for citation purposes) BMC Bioinformatics 2006, 7:484 http://www.biomedcentral.com/1471-2105/7/484 tion βF3 in Figure 2c or position S/Y306 in Figure 3i). AQ and SP scores for the Mafft alignments (L-INS-i strat- Hence, the maxZ score may find important positions of egy) in different reference sets. The Spearman rank corre- the alignment, which were not found by the other scores. lation coefficient between the AQ and SP scores was 0.53 Thirdly, the maxZ and MD scores also consider gaps for the L-INS-i alignments. The range of the correlation resulting in zero or very low scores for the insert positions coefficient in the 7 alignments was from 0.53 to 0.67. Fig- of the alignment. The IC score, on the other hand, fails to ure 4 shows a clear relationship between the quality detect the insert positions. scores. The three of the four outlying alignments on the lower right corner of Figure 4 are from the reference set 40. Taken together, all three scores behaved in a rather similar In these alignments, the SP scores also dramatically dif- manner. The IC score does not take into account gaps, and fered from the Column score values (CS = 0 in these align- thus its use is relevant only when the alignment does not ments). include gaps. The maxZ and MD scores differ in some positions, which generally depend on the similarity Alignment quality assessment matrix or grouping used with the maxZ score. For the Ras- We compared the performance of the 7 alignment pro- like proteins, the maxZ score with groups of amino acids grams using five reference sets of the BAliBASE database. distinguishes slightly better than the other scores both the The first two reference sets of the BAliBASE include equi- moderately conserved positions (Boxes 12–23) from the distant sequences whose identity is less than 20% (ref 11) highly conserved positions (Box 11) and unconserved or between 20 and 40% (ref 12). According to the AQ ones (Box 33). However, the results of Table 1 cannot be score, the results on the reference set 11 indicate that used to evaluate the IC score since entropy was used in the Probcons was the best method aligning on the average classification. For the SH2 domains, on the one hand, the 80% of the conserved residues correctly (Figure 5). The L- maxZ score determines the positions forming the binding INS-i strategy of Mafft and Muscle also performed well pocket for the phosphotyrosine and surrounding mole- obtaining quality scores only 5–7% lower than that of the cules mostly as highly conserved, but on the other hand, Probcons. In the reference set 12, all the tested programs it also correctly determines the more variable loops performed rather well (Figure 5). The Probcons, Muscle, between the α-helices and β-stands as unconserved. The L-INS-i and TCoffee obtained the highest alignment qual- MD and IC scores also perform well, but sometimes the ity score values (94–96%). These methods did not differ MD score fails to detect the important positions, and the from each other, but they differed from all the other meth- IC score is not capable in detecting the loops between the ods (Table 2). The quality score was the worst in the align- conserved structures. ments produced by Dialign, Clustal, or FFT-NS-2 strategy of Mafft showing 41–54% (ref 11) and 12–16% (ref 12) Consensus sequence divergence from the reference alignment. The result of the As a by-product, the maxZ score also produces the consen- SP score was very similar. The only relevant difference was sus sequence for the multiple sequence alignment. the Probcons showing significant difference to the other According to formula (8), the consensus residue at each programs in the both reference sets 11 and 12, even if the alignment position is defined as the residue with the absolute difference between the methods was very low: greatest Z-score value. The legend of Figure 2 shows the the SP score of the Probcons and L-INS-i, for instance, dif- consensus sequence for the part of the SH2 domain. fered from each other only 2%. The absolute CS scores were in all programs approximately 20% (ref 11) and Evaluating the AQ score for alignment quality 10% (ref 12) lower than that of the AQ and SP scores. In In this section, we evaluate the output of the alignment the reference set 12, the Probcons differed significantly programs using alignment quality (AQ) based on the from the other methods. In the reference set 11, the Prob- maxZ score and compare it to the sum of pairs (SP) and cons showed significant difference from all the other pro- the column score (CS) quality scores [1]. First, we study grams except the L-INS-i. the relationship between the individual AQ and SP scores. Then we compare the quality scores of 7 alignment meth- The aim of the reference set 20 is to test the ability of pro- ods using BAliBASE database [45]. Since the divergence grams to align the sequence families having disrupted by from the reference values was substantially constant over an "orphan" sequence. The reference set 30 consists of different false discovery rate (FDR) values, the results are subgroups of sequences whose residue identities between presented at FDR = 0.05. the subgroups are less than 25%. According to the AQ and SP measures, the quality of all alignments was very high in Comparison of individual AQ and SP scores the reference sets 20 and 30 (Figure 5). In the reference set We build 7 test alignments for each set of sequences in the 20, the median scores varied from 87% to 96% and from BAliBASE database and compared the results of the AQ 92% to 97% in the AQ and SP scorings, respectively, and SP scores. Figure 4 shows the scatterplot between the whereas the CS score obtained clearly lower scores varying Page 9 of 19 (page number not for citation purposes)
Description: