
Hui et al. BMC Bioinformatics 2013, 14:27
http://www.biomedcentral.com/1471-2105/14/27

RESEARCH ARTICLE Open Access

Predicting PDZ domain mediated protein interactions from structure

Shirley Hui1,2, Xiang Xing1,2 and Gary D Bader1,2,3*

Abstract

Background: PDZ domains are structural protein domains that recognize simple linear amino acid motifs, often at protein C-termini, and mediate protein-protein interactions (PPIs) in important biological processes, such as ion channel regulation, cell polarity and neural development. PDZ domain-peptide interaction predictors have been developed based on domain and peptide sequence information. Since domain structure is known to influence binding specificity, we hypothesized that structural information could be used to predict new interactions compared to sequence-based predictors.

Results: We developed a novel computational predictor of PDZ domain and C-terminal peptide interactions using a support vector machine trained with PDZ domain structure and peptide sequence information. Performance was estimated using extensive cross validation testing. We used the structure-based predictor to scan the human proteome for ligands of 218 PDZ domains and show that the predictions correspond to known PDZ domain-peptide interactions and PPIs in curated databases. The structure-based predictor is complementary to the sequence-based predictor, finding unique known and novel PPIs, and is less dependent on training–testing domain sequence similarity. We used a functional enrichment analysis of our hits to create a predicted map of PDZ domain biology. This map highlights PDZ domain involvement in diverse biological processes, some only found by the structure-based predictor. Based on this analysis, we predict novel PDZ domain involvement in xenobiotic metabolism and suggest new interactions for other processes including wound healing and Wnt signalling.
Conclusions: We built a structure-based predictor of PDZ domain-peptide interactions, which can be used to scan C-terminal proteomes for PDZ interactions. We also show that the structure-based predictor finds many known PDZ mediated PPIs in human that were not found by our previous sequence-based predictor and is less dependent on training–testing domain sequence similarity. Using both predictors, we defined a functional map of human PDZ domain biology and predict novel PDZ domain function. Users may access our structure-based and previous sequence-based predictors at http://webservice.baderlab.org/domains/POW.

*Correspondence: [email protected]
1 The Donnelly Centre, University of Toronto, Toronto, ON, Canada
2 Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
Full list of author information is available at the end of the article

© 2013 Hui et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

PSD95/DlgA/Zo-1 (PDZ) domains are modular peptide recognition domains that are generally found in eukaryotic signalling pathways, often in scaffolding proteins that are responsible for regulating protein complex assembly and localization to specialized sites in the cell, especially at membranes [1]. Their importance in higher organisms is highlighted by their increasing abundance from yeast to human (with only 2 in yeast and over 250 encoded in the human genome) and association with diseases such as cystic fibrosis and schizophrenia, and pathogens, such as human papillomavirus [2-4]. PDZ domains fold into a globular structure consisting of six β strands and two α helices (Figure 1) and often bind their targets through the recognition of hydrophobic C-termini. Canonical interactions occur between the target peptide side chains and a hydrophobic binding pocket formed between domain β2 strand and α2 helix, though other binding modes are known. The binding specificity of PDZ domains has been categorized into two main classes, where class I domains prefer to bind C-terminal motifs X[S/T]XΦ and class II domains prefer to bind XΦXΦ (where X is any amino acid and Φ is a hydrophobe) [5]. More recent studies have found that the PDZ domain can be specific up to seven residues [6,7].

Recent high throughput experiments have resulted in the availability of large data sets of PDZ domain-peptide interactions [7,8]. As a result, several computational methods have been developed to predict PDZ domain-peptide interactions using sequence-based information only [8-12]. Previously, we developed a sequence-based predictor to scan proteomes of multiple organisms for binders of PDZ domains [10]. Although this predictor is more accurate and precise at proteome scanning compared to previous sequence-based predictors, like others, it performs better on sequences similar to those in the training set. It is known that structure features within the domain binding pocket play important roles in determining binding specificity [13-15]. Since domain structure features capture different information about binding compared to sequence features, we hypothesized that training with such features would result in a predictor that is complementary to the sequence-based predictor. In particular, such a predictor would be less dependent on sequence similarity and would predict additional interactions not predicted by the sequence-based predictor. This would expand the coverage of PDZ domain C-terminal peptide interactions that can currently be predicted by sequence-based predictors alone.

Structure-based predictors have been developed to more generally predict protein-protein interactions. For instance, Hue et al. used a support vector machine (SVM) to predict PPIs using a structure kernel [16]. Methods utilizing structure information to more specifically predict PPIs mediated by peptide recognition domains have also been developed. Sanchez et al. used an empirical force field to calculate structure-based energy functions for human SH2 domain interactions [17]. Fernandez-Ballester et al. constructed position weight matrices of all possible SH3-ligand complexes in yeast using homology modelling [18]. Smith et al. used protein backbone sampling to predict binding specificity for 85 human PDZ domains [19]. Kaufmann et al. developed an optimized energy function to predict the binding specificity of PDZ domain-peptide interactions for 12 PDZ domains [20].

In this paper, we present a structure-based predictor for PDZ domain-peptide interactions that can be used for proteome scanning. Our predictor uses a variety of different structure features that are known to play roles in protein structure stability and facilitating PPIs. Through leave 12% of domains out cross validation, we showed that the structure-based predictor depends less on training–testing domain sequence similarity compared to our previous sequence-based predictor. Based on human proteome scanning results, we also show that the structure-based predictions correspond to known experimentally determined PDZ domain-peptide interactions and known PPIs involving PDZ domain containing proteins. A substantial number of the structure-based predictions correspond to known PPIs not previously predicted by the sequence-based predictor (48% increase), confirming that the structure-based predictor finds different interactions than the sequence-based predictor. Using predictions from both methods, we created a functional map using all predicted human PDZ mediated PPIs and identify xenobiotic metabolism as a novel biological process enriched in PDZ interactors.

Finally, we developed a website called POW! PDZ domain-peptide interaction prediction website (http://webservice.baderlab.org/domains/POW), which enables users to run our sequence-based and structure-based predictors online in human, mouse, fly and worm.

Figure 1 3D structure of a bound PDZ domain. The PDZ domain folds into a structure consisting of six β strands and two α helices. Canonical interactions occur through C-terminal target side chain interactions and the hydrophobic domain binding pocket formed between domain strand β2 and helix α2. The ten core domain binding sites are highlighted in blue and the bound peptide (RRETQV) is in orange. PDB: 2OQS (NMR first model) [74].

Methods

Domain binding site definition

A number of positions in the PDZ domain that are in close contact with the peptide are important for binding [7,8]. Following previous work, we defined the binding site using ten domain positions (core positions) that are in close contact with the peptide ligand (<4.5 angstroms) across nine PDZ domain structures. In total, 218 out of 267 human PDZ domains could be used because they didn't have gaps in their binding sites based on a PDZ family multiple sequence alignment (8 structures), and we could obtain structures and compute features for them (41 structures). For mouse, fly and worm, respectively, 178 of 237, 85 of 117 and 64 of 81 known PDZ domains were supported with 11, 14 and 7 of the remaining domains containing gaps. All PDZ domains were defined by HMMER 3.0 [21] against UniProt defined PDZ proteins as of Apr 2012.

The minimum QMEAN score for our training models is
0.520 (average 0.836). Please see Additional file 2: Table S1 for details on all training domains.

Overall, the structure-based predictor supports the majority of PDZ domains (i.e. 82%, 74%, 73% and 79% of known PDZ domains) for human, mouse, fly and worm, respectively.

Although previous studies used a binding site definition of 16 domain positions (a superset of the ten we use), these positions were identified from only a single PDZ domain-peptide complex structure [9,10] and many domains contain gaps using this larger 16-position binding site definition (based on a multiple sequence alignment with other PDZ domains). A comparison of cross validation performance (see section on Predictor Performance Evaluation) using ten versus 16 binding site positions showed that the ten positions were adequate for achieving good predictor performance (see Additional file 1: Table S1).

Domain structure data

The initial set of PDZ domain structures consists of one

Domain-peptide interaction data

PDZ domain-peptide interactions were collected from published high throughput phage display and protein microarray experiments for human and mouse, respectively [7,8]. Since the phage display data consisted of only positive interactions (of which many could be non-genomic, meaning not similar to any genomic peptide), we used an established protocol to filter the interactions to enrich for genomic interactions and to generate artificial negative interactions [10]. Briefly, this protocol involves creating a position weight matrix for a given training domain using its experimentally determined binders (positives) and then using the matrix to scan a pool of C-terminal peptides (last 5 positions) for low scoring binders (negatives).
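The negative-generation step just described can be illustrated with a minimal sketch. It is not the protocol of [10]: the uniform background, the pseudocount, the log-odds scoring and the choice to keep the n lowest-scoring peptides are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(binders, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length binder peptides.

    Assumes a uniform background (1/20 per residue) and an additive pseudocount.
    """
    length = len(binders[0])
    total = len(binders) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        counts = Counter(peptide[pos] for peptide in binders)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq * len(AMINO_ACIDS))
        pwm.append(column)
    return pwm


def pwm_score(pwm, peptide):
    """Sum the per-position log-odds scores of a peptide."""
    return sum(column[aa] for column, aa in zip(pwm, peptide))


def pick_negatives(pwm, pool, n):
    """Return the n lowest-scoring peptides in the pool as artificial negatives."""
    return sorted(pool, key=lambda peptide: pwm_score(pwm, peptide))[:n]


# Toy data, hypothetical: four class I style binders and a small peptide pool.
binders = ["ETSLV", "ETQLV", "DTSLV", "ETSIV"]
pool = ["ETSLV", "KKKKK", "ETSIV", "GGPGG"]
pwm = build_pwm(binders)
negatives = pick_negatives(pwm, pool, 2)
```

In this toy pool the two peptides that share no residue with any binder score lowest and are selected, while peptides resembling the binders are left out.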
We adopted a minor modification of this procedure to allow for the inclusion of additional class II type PDZ domains to increase coverage of the PDZ family: the minimum number of genomic peptides required for inclusion was relaxed from ten to four. Only domains with both positive and negative interaction data were used for predictor training.

NMR and 17 X-ray structures for human collected from the Protein Data Bank (PDB) [22] with corresponding interaction data from phage display or protein microarray experiments [7,8]. Five NMR structures were collected from the PDB for mouse. For NMR structures, only the first model was used. Homology models were used to increase the number of structures available for domain structure feature encoding. In total, 11 human and 54 mouse PDZ domain models were modelled by SWISS-MODEL [23] (downloaded Feb-Sep 2011) through the Protein Model Portal, which is a website providing access to structure models generated by different protein structure resources [24].

The quality of the homology models was estimated by computing the number of identical residues between the target and template sequence (i.e. template sequence identity). It has been shown that target-template sequence identity is positively correlated with model quality. In particular, state-of-the-art algorithms can always build high quality models (RMSD<2Å) if the target-template sequence identity is higher than 35-40%. Furthermore, there is no significant variation in model quality for targets with sequence similarity between 40-70%. If the similarity is 35%, there is no correlation [25,26]. All training models have greater than 50% sequence similarity to their template structure (average 90%). At this threshold, models are expected to have the correct fold with most inaccuracies arising from structural variation in templates and incorrect reconstruction of loops [25,26]. We also computed the QMEAN score which is a scoring function measuring multiple geometrical aspects of protein structure including torsion angle potential, secondary structure-specific interaction potentials and solvation exposure potential [27]. This score ranges from zero to one with scores closer to one indicating more reliable models.

Domain structure feature encoding

Structure features across the entire PDZ domain structure were computed and values corresponding to the ten core binding site positions were extracted from the larger list of features computed for all domain positions. Four types of structure features (detailed below) involved in protein folding and stability were computed to describe the PDZ domain structure (Figure 1). Three-dimensional geometric descriptors were investigated but were not included because they resulted in inferior cross validation performance (see Additional file 1: Figure S1). In total, the PDZ domain structure as defined by the core positions was represented by a vector of length 240 features. Each value in the feature vector was scaled to lie between zero and one. Details regarding software parameters used to compute the following structure features are available in Additional file 1, section A.

Solvent accessibility, hydrogen bonding and positive phi angle properties

The first feature type consists of five values describing protein structure and were computed using the JOY web server [28]. Solvent accessibility indicates whether the protein surface in the area at the given core residue position is available to interact with ligands. Therefore, the first value indicates whether a given residue is solvent accessible or inaccessible. Patterns of hydrogen bonding
are important in forming protein secondary and tertiary structure and are known to be important for canonical C-terminal peptide binding to the PDZ domain. The next three values indicate if there is a residue side chain hydrogen bonded to a main chain amide, carbonyl or another side chain. Finally, since positive main chain phi angles may restrict what types of residues may be accommodated at a given position, the last value indicates if the residue has a positive phi angle. These binary features (i.e. absence is 0, presence is 1) were computed for each core residue position resulting in a binary vector of length 50 (5 features x 10 core positions).

Solvent accessible area

The second feature type is a single value indicating how much surface (i.e. area) for a core residue is available for binding to a ligand residue. This feature was computed using the SURFV software [29] for each residue resulting in a numeric vector of length 10 (1 feature x 10 core positions).

Electrostatic potential and hydrophobicity

Protein-protein interactions are facilitated by the electrostatic and hydrophobic complementarity of molecular surfaces. Therefore, the third and fourth feature types describe the electrostatic potential and hydrophobicity along the surface of the domain. At each core residue position, nine values were sampled from the surface resulting in a total of 90 electrostatic and 90 hydrophobicity values (9 features x 10 core positions). These features were generated by the VASCo software [30].

Peptide sequence feature encoding

Peptides were encoded using a sparse binary vector encoding, as described in previous work [10]. Briefly, each residue in a peptide of length five was represented using a binary vector of length 20 with each bit corresponding to an amino acid type. The vectors were concatenated to form the final feature vector of length 100.

Support vector machine

We used the support vector machine (SVM) binary machine learning method for our predictor [31,32]. Given interaction training data $(x_1, y_1), \ldots, (x_m, y_m)$ where $m$ is the number of samples, $x_i$ is a feature vector for domain $d_i$ and peptide $p_i$ and $y_i$ is a class label such that $y_i = \{-1, +1\}$ [33], the SVM assigns a class label of +1 if a given interaction feature vector encodes a positive interaction or −1 otherwise. The decision function is evaluated to assign the binary label:

$$f(x) = \operatorname{sgn}(w \cdot x + b)$$

where sgn(0) = +1, otherwise −1. The weight vector $w$ and bias term $b$ describe a maximum margin hyperplane $(w, b)$ that separates positive and negative training examples. For such a hyperplane:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i$$

where the $\alpha_i$'s are positive real numbers that maximize the following objective function:

$$\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to the constraints $0 \le \alpha_i \le C$ for all $i = 1, \ldots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$, where $K(x_i, x_j)$ can be thought of as describing the similarity between two feature vectors, and $C$ is a cost parameter that penalizes training errors. We used the radial basis function (RBF) kernel, defined as:

$$K(x_i, x_j) = e^{-\gamma \lVert x_i - x_j \rVert^2}$$

A grid search was used to find locally optimal values for $\gamma$ and $C$ [34]. Instead of explicitly balancing the positive and negative training examples, weighted costs were used according to $C_{+} = (n_{+}/n_{-})\, C_{-}$, where $n_{+}$ is the number of positive training interactions and $n_{-}$ is the number of negative training interactions. The LibSVM software library was used to build the SVM [35].

Semi supervised negative training set expansion

An initial predictor was built using the data for 88 PDZ domains described above. A preliminary assessment of the predictor's proteome scanning performance was performed by scanning the human proteome (defined by genome assembly Ensembl:37.64) for each domain in the training set. This initial predictor returned a large number of hits (1000 or more) for over half of the domains with an average number of predictions returned per domain of over 2000 (see Additional file 1: Figure S2, left boxplot). Since previous phage display experiments detected fewer than a hundred binders per domain among billions of random peptides, the majority of these initial predictions are likely false positives. We surmised that the initial negative training data did not adequately cover the negative proteomic interaction space. Therefore, we used a semi supervised learning approach similar to a method previously used to expand negative training data sets when there are no negatives initially available [36]. This predictor was used to scan the human proteome for interactors of training domains as we did for the initial predictor. We found that adding negatives reduced the number of hits returned per domain. The final predictor was trained using a total of 942 positive and 1843 negative interactions involving 83 PDZ domains and 872 peptides (Table 1). When scanning the human proteome again, the final predictor predicted 1000 or more hits for only five out of 83 training domains (approximately 6% of training domains). The average number of predictions per domain returned by the final predictor was approximately 400 (see Additional file 1: Figure S2, right boxplot). Please see Additional file 1, section E for more details.

Table 1 Summary of the training data

                                   Domain          Interactions
Organism   Source                  #Pos   #Neg     #Pos   #Neg
Mouse      Protein microarray      58     53       527    1026
Mouse      SVM Negatives           -      24       -      210
Human      Phage Display           25     -        415    -
Human      PWM Negatives           -      25       -      407
Human      SVM Negatives           -      20       -      200
Totals                             83     -        942    1843

Predictor performance evaluation

We carried out multiple cross validation strategies to provide an estimate of predictor performance. First we performed ten fold cross validation which involves partitioning the training data into ten randomly selected interaction sets, independently holding out each set for testing against a predictor trained using the remainder

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives and FN is the number of false negatives. The overall performance was summarized by computing the area under the receiver operating characteristic (ROC) curves and Precision/Recall (PR) curves [37,38].

Functional enrichment analysis

A gene function enrichment analysis was performed on the predicted sequence-based and structure-based gene targets using Gene Ontology (GO) biological process terms [39]. The BiNGO (Biological Network Gene Ontology tool) software library [40] was used to determine the enriched terms. The hypergeometric test was used to compute a p-value assessing the GO term enrichment for a given set of predicted genes. Multiple testing correction was performed using the Benjamini and Hochberg False Discovery Rate (FDR) correction. GO v1.2 (downloaded Dec 7, 2011) and human GO annotations (downloaded Dec 7, 2011) were used. Only gene-sets with between five and 300 genes were used from the GO ontology (defined by the GMT file dated Dec 6, 2011 and available at http://www.baderlab.org/Data/StructurePDZProteomeScanning). A list of enriched terms (p-value<0.05 and FDR<0.1) with more than one gene interactor and associated with more than two domains were retained.
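To make the enrichment statistics concrete, the following is a minimal sketch of a one-sided hypergeometric test followed by Benjamini and Hochberg adjustment, using only the Python standard library. It is not BiNGO's implementation; the function names and the toy counts are hypothetical, and the filter at the end mirrors only the two thresholds stated above (p-value<0.05 and FDR<0.1).

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term.

    k: predicted genes annotated with the term
    n: size of the predicted gene set
    K: genes annotated with the term in the background
    N: size of the background
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example, hypothetical p-values for four GO terms.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
retained = [i for i, (p, q) in enumerate(zip(pvals, qvals)) if p < 0.05 and q < 0.1]
```

With these toy values the first three terms pass both thresholds and the fourth is dropped.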
To better interpret the structure-based and sequence-based enrichment results, we created an enrichment map, a network-based visual representation of enriched terms that groups similar terms and eases identification of functional themes. We used the Enrichment Map Cytoscape plugin software to create the enrichment map [41,42], using the parameters p-value<0.05, FDR Q value<0.1 and "Jaccard+overlap similarity" cutoff=0.517.

of the data, and computing average performance across all ten runs. Following previous prediction methods and to better compare our results with previous work, we held out 12% of the domains (to estimate performance dependence on specific sets of domains), 8% of the peptides (to estimate predictor performance dependence on specific sets of peptides) and both 12% of the domains and 8% of the peptides (to estimate predictor performance dependence on specific sets of domains and peptides) and tested on the rest, again repeating this ten times [9]. In general, the training domain features are more similar to each other (average 0.85 using normalized dot product similarity), compared to the peptide features (average 0.13). Thus, we also performed leave 12% of domains out cross validation with training set filtering based on domain sequence similarity and compared the performance of the structure-based predictor to our previously published sequence-based predictor. This involved holding out all data for 12% of domains for testing and training with only remaining domains and their interactions that had sequence similarity less than a given threshold to all testing domains.

We computed the following statistics to measure predictor performance:

• Sensitivity or Recall: TP/(TP+FN)
• Specificity: TN/(TN+FP)
• Precision: TP/(TP+FP)

Results

The structure-based predictor achieves high cross validation results

To estimate the generality of the predictor, we ran multiple cross validation tests and plotted the ROC and PR curves to summarize the performance. The predictor achieves high ROC and PR area under the curve (AUC) scores compared to random performance AUCs over all cross validation strategies. In particular the ten fold cross validation ROC and PR AUCs were 0.96 and 0.936, respectively (random ROC AUC 0.5, PR AUC 0.253). The leave 8% of peptides out cross validation ROC and PR AUCs were 0.935 and 0.909 respectively (random ROC AUC 0.5, PR AUC 0.358). The leave 12% of domains and 8% of peptides out cross validation ROC and PR AUCs were 0.927 and 0.886 respectively (random ROC AUC 0.5, PR AUC 0.347). Finally, slightly lower AUCs were obtained for the leave 12% of domains out cross validations, which achieved 0.872 and 0.785 respectively (random ROC AUC 0.5, PR AUC 0.33) (Figure 2). Like our previously published sequence-based predictor, the cross validation results were lower for strategies that involved leaving sets of domains out. A one-tailed t-test showed that the mean AUC scores were significantly higher for the structure-based predictor compared to those of the sequence-based predictor (p-value<0.025) (Table 2). Blind testing results on a small number of genomic mouse, worm and fly interactions suggest that the predictor is able to correctly predict interactions in different organisms. However since these data sets are small, additional data is required to verify this. Please see Additional file 1, section H for blind testing results.

The structure-based predictor is less dependent on training–testing domain sequence similarity

In previous work, we showed that the performance of the sequence-based predictor depends on how similar in binding site sequence a given testing domain is to its nearest training domain. In particular, as the domain binding site sequence similarity decreases so does the predictor's average performance until it is comparable to that of a naïve nearest neighbour sequence predictor [10]. To more rigorously compare structure-based and sequence-based predictor performance as training–testing domain sequence similarity varies, we performed a leave 12% of domains out cross validation with domain sequence similarity-based training set filtering for each predictor. For each fold, 12% of domains and their interactions were held out, and of the remaining domains, only those and their corresponding interactions were retained for training if the domain sequence similarity was less than a given threshold for all testing domains. All training sets had no more than 500 interactions. Ten folds were executed and repeated ten times for a total of 100 runs. For each run, the ROC and PR AUCs were computed and plotted as box plots according to the similarity threshold (Figure 3). A one-tailed t-test showed that the mean ROC and PR AUC scores were significantly higher for the structure-based predictor when training–testing domain sequence similarity is <0.7 (p-value<0.029). These results show that on average, the structure-based predictor is less dependent on training–testing domain sequence similarity compared to the sequence-based predictor at lower similarity thresholds.

Structure-based predictions are validated by known PDZ domain-peptide interactions

We used the predictor to scan the human C-terminal proteome (defined by genome assembly Ensembl:GRCh37.64) [43] for binders of 45 PDZ domains with known interactions in PDZBase that we could obtain structures and compute features for. For each domain, this involved scanning 43827 unique C-termini of length five (including splice variants). Structures for these domains were obtained from the PDB or were homology modelled and are at least 35% sequence similar (average over 80%) to their template structures. The minimum QMEAN score for these models is 0.36 (average 0.78). Please see Additional file 2: Table S3 for more details.

The structure-based predictor has a true positive rate (TPR) of 0.36 and precision of 0.0033 and correctly predicted interactions for 22 of the 45 domains. For these domains approximately 73% of known PDZ domain-peptide interactions in PDZBase, an independent data source not used for training, were predicted (see Additional file 2: Table S4). The sequence-based predictor had a higher TPR of 0.46 and correctly predicted interactions for 28 out of 45 domains. For these domains, 65% of known PDZ interactions were predicted and the precision was 0.0024. Although the sequence-based predictor has a higher TPR than the structure-based predictor, its precision and coverage of known PDZ domains are lower. This is likely because the sequence-based predictor predicts on average more interactions per domain than the structure-based predictor (average 426.89 and 239.71 per domain respectively). The low precision for both predictors is due to the few known interactions per domain that are available from PDZBase (average 2.2 interactions per domain).

We also tested the false positive rate (FPR) of the predictor using two real negative data sets for human, which were used in a recent study [44] to benchmark a sequence-based predictor developed by Chen et al. [9].

[Figure 2 image not reproduced. Panels: ROC (TPR versus FPR) and Precision Recall (precision versus recall). Legend AUCs, ROC: 0.96 (10 Fold), 0.872 (Domain), 0.935 (Peptide), 0.927 (Domain+Peptide); Precision Recall: 0.936 (10 Fold), 0.785 (Domain), 0.909 (Peptide), 0.886 (Domain+Peptide).]

Figure 2 Predictor performance estimation using cross validation. Predictor performance measured using ten fold (red), leave 12% of domains out (blue), leave 8% of peptides out (green), leave 12% of domains and 8% of peptides out (black) cross validation.

Table 2 Structure-based predictor achieves better cross validation results than the sequence-based predictor (p-value<0.025)

                   ROC                               PR
                   Structure        Sequence         Structure        Sequence
10 Fold            0.96             0.939            0.936            0.896
(95% CI)           (0.957~0.962)    (0.936~0.941)    (0.932~0.940)    (0.890~0.900)
Domain             0.872            0.851            0.785            0.764
(95% CI)           (0.860~0.882)    (0.839~0.862)    (0.765~0.805)    (0.747~0.779)
Peptide            0.935            0.893            0.909            0.838
(95% CI)           (0.929~0.941)    (0.883~0.902)    (0.898~0.918)    (0.825~0.850)
Domain+Peptide     0.927            0.87             0.886            0.794
(95% CI)           (0.919~0.934)    (0.862~0.877)    (0.875~0.896)    (0.783~0.804)
The first Although the sequence-based predictor has a higher TPR data set consists of 466 experimentally validated nega- thanthestructure-basedpredictor,itsprecisionandcover- tive interactions involving peptides that contain a PDZ ageofknownPDZdomainsislower.Thisislikelybecause binding motif found from the literature. The second data the sequence-based predictor predicts on average more setconsistsof133negativeliterature-describedinteractions interactions per domains than the structure-based pre- involving peptides with a non-binding PDZ motif caused dictor (average 426.89 and 239.71 per domain respect- by a mutation. The structure-based predictor made ively). The low precision for both predictors is due to the predictions for 410 negative interactions from the first ROC AUC PR AUC 0 0 1. 1. 9 8 0. 0. 8 6 0. 0. 7 4 0. 0. 6 2 0. 0. 5 0 0. 0. <=1.0 <0.9 <0.8 <0.7 <0.6 <0.5 <0.4 <=1.0 <0.9 <0.8 <0.7 <0.6 <0.5 <0.4 Mean ROC AUC Mean PR AUC 9 0. 7 0. 7 0. 0.5 5 3 0. 0. <=1.0 <0.9 <0.8 <0.7 <0.6 <0.5 <0.4 <=1.0 <0.9 <0.8 <0.7 <0.6 <0.5 <0.4 Training-Testing Domain Similarity Training-Testing Domain Similarity Structure Sequence Figure3Predictorperformancedependenceontraining–testingdomainsequencesimilarity.Leave12%ofdomainsoutcrossvalidation wasperformedwithdomainsretainedfortrainingineachfoldiftheirsequencesimilaritytoalltestingdomainswaslessthanagiventhreshold. Thiswasperformedforstructure-based(blue)andsequence-basedpredictors(magenta).ROCandPRAUCscoreswerecomputedforeachrun anddisplayedinboxplotsaccordingtotraining–testingdomainsequencesimilaritythreshold(topleftandright).Basedonsignificancetesting usingaone-tailedt-test,themeanstructure-basedpredictorROCandPRAUCscoresaresignificantlyhigherthanthesequence-basedpredictors scoreswhentraining–testingdomainsequencesimilarityis<0.7(p-value<0.029).ThemeanAUCscoresforstructure-based(blue)andsequence- based(magenta)predictorsareplottedagainstsequencesimilaritythreshold(bottomleftandright). 
data set and 126 negative interactions from the second data set, which resulted in an FPR of 0.145 and 0.0, respectively. The sequence-based predictor had an FPR of 0.09 and 0.0, and made predictions for 421 and 128 negative interactions for the first and second data sets, respectively. Compared to our structure-based and sequence-based predictors, the Chen sequence-based predictor has a much higher FPR of 0.482 and 0.256 for the first and second data sets, respectively [44] (see Additional file 2: Table S5).

Many structure-based predictions correspond to known PDZ domain containing protein-protein interactions

To determine how many structure-based predicted interactions correspond to known PPIs, we scanned the human proteome to predict interactions for 218 human PDZ domains with known PPIs (that we could obtain structures and compute structure features for). Known PPIs were retrieved from iRefIndex [33], which is a database integrating interactions from different databases including BIND [45], BioGRID [46], CORUM [47], DIP [48], HPRD [49], IntAct [50] and MINT [51]. In total, 61 XRAY and nine NMR structures (only the first models used) were obtained from the PDB and 148 homology models were created. All models had a template sequence similarity of at least 22% (average 72%) and QMEAN score of at least 0.36 (average 0.78). Please see Additional file 2: Table S3 for more details.

In total, 88 domains had predicted interactions that corresponded to known PPIs, with an average of greater than 21% of known PPIs being correctly predicted per domain. The number of PPIs successfully predicted per domain was significant (p-value<0.05, Fisher’s exact test) for all but ten domains. A caveat of this result is that PDZ domain containing proteins may contain multiple PDZ domains and other domains, so it is not possible to uniquely assign a PPI to a PDZ domain. This could result in erroneous false negative or true positive statistics for the above tests. However, the results still serve as an estimate of predictor performance and show that the predictor is able to predict many known human PPIs.

The structure-based predictor is complementary to the sequence-based predictor

We next compared the structure-based predictor’s proteome scanning predictions to the ones obtained using our previously published sequence-based predictor [10]. In total, the results for 221 domains where both predictors were able to make predictions were compared. A total of 172 out of 925 known PPIs were predicted by both methods, 116 were unique to the sequence predictor and 56 were unique to the structure-based predictor (Figure 4). Thus the sequence and structure-based predictors both predict unique known PPIs and are complementary.

To better understand how unique predictions are made, we compared the results in more detail. The unique structure-based predictions arise for different reasons. Some domains (43 domains) are more challenging for the sequence-based predictor, which returns a low number of hits per domain (ten or less) with none corresponding to known PPIs (see Additional file 2: Table S8) (e.g. APBA1-1, CNKSR2-1, IL16-1, IL16-3). The structure predictor fares better for nine of these domains (ARHGEF11-1, IL16-1, IL16-3, MPDZ-12, MPP6-1, PDZD2-3, PDZD2-5, RAPGEF6-1, SCRIB-3) and is able to predict many more hits per domain (on average approximately 510 hits) with on average approximately three known hits per domain. On the other hand, the structure-based predictor has difficulty predicting hits for 19 domains (e.g. DLG5-3, MPDZ-6, MPDZ-8), of which four are better predicted by the sequence-based predictor (MLLT4-1, MPDZ-8, MPP3-1, PDZD2-2; average 383 hits) with on average one known PPI hit per domain. In another scenario, two domains may have identical binding sites at the sequence level (e.g. DLG1-1 and DLG2-1), but be different at the structure level. The sequence-based predictor cannot distinguish between the two domains in this case, even though the domains may actually bind different proteins. While the structure-based predictor uses features corresponding to ten core positions, these features are computed by considering the entire domain structure. Therefore, even if two domains have the same binding site residues, the resulting features will be different if their whole domain structures are different. The structure-based predictor’s ability to distinguish between domains with highly similar binding site sequences helps explain why it is able to predict more unique interactions than the sequence-based predictor. Overall, these results demonstrate situations where the structure-based predictor can be used to make predictions for domains that otherwise could not be easily predicted by the sequence-based predictor and thus show that both methods are complementary.

Structure-based predicted binding specificities recapitulate experimental binding specificities

Since validation data is limited, we more generally assessed the results of proteome scanning by comparing predicted binding specificities to those known from phage display. We constructed position weight matrices to summarize the domain’s amino acid binding preference at each position in the ligand, using all predicted interacting peptides from C-terminal proteome scanning. Sequence logos were then used to visually represent the binding specificities. In total, 26 domains could be compared (i.e. they had
less than four genomic peptides from phage display experiments), covering known PDZ domain binding classes I and II (see Additional file 1: Figure S3). For 14 domains, the structure-based predicted binding specificity is more similar to the phage display determined binding specificity than the sequence-based predicted binding specificity, and better recapitulates the preference of residues at specific positions. For example, the structure-based method better predicts the preference for hydrophobic residue Val at position 0 for ERBB2IP-1, for hydrophilic residues such as Gly or Thr at position −2 for DVL2-1 and for polar residues at position −4 and a Thr or Ser at position −1 for TIAM2-1 (position numbering counted backwards from the zero C-terminal position) (Figure 5). Three domains, APBA3-1, TJP1-3 and TJP2-3, had both structure-based and sequence-based predicted binding specificity similarities much lower than the average (less than 0.5). This seems to be caused by poor representation of these domains in the training set (Figure 5). More validation data should be used to more reliably compare the binding specificities for these domains in the future. Furthermore, since phage display experiments select optimal binders and cellular interactions may not be optimal (e.g. to aid interaction regulation), we expect some differences between phage display and proteome scanning-based profiles. In general, the similarity between the structure-based predicted and experimentally determined binding specificities is high (0.636).

[Figure 4 plot omitted. Panel A: Number of Predictions for Domains with Validated Hits; x-axis: Combined number of predictions (0 to 2500); legend: Structure only, Sequence only, Both; domains shown: APBA3-1, APBA3-2, CASK-1, DLG1-1, DLG1-2, DLG1-3, DLG2-1, DLG2-2, DLG2-3, DLG3-1, DLG3-2, DLG3-3, DLG4-1, DLG4-2, DLG4-3, ERBB2IP-1, GIPC1-1, GRIP1-5, GRIP1-6, GRIP2-5, IL16-1, LIN7A-1, LIN7B-1, LIN7C-1, LRRC7-1, MAGI1-2, MAGI1-3, MAGI1-4, MAGI1-5, MAGI1-6, MAGI2-1, MAGI2-2, MAGI2-4, MAGI2-5, MAGI2-6, MAGI3-2, MAGI3-4, MAGI3-5, MAGI3-6, MAST2-1, MPDZ-1, MPDZ-10, MPDZ-12, MPDZ-13, MPDZ-2, MPDZ-3, MPDZ-4, MPDZ-5, MPDZ-7, MPDZ-9, MPP3-1, MPP6-1, PDZD3-1, PDZD3-3, PDZK1-1, PDZK1-2, PDZK1-3, PDZK1-4, PTPN3-1, RGS12-1, SCRIB-1, SCRIB-2, SCRIB-4, SHANK2-1, SLC9A3R1-1, SLC9A3R1-2, SLC9A3R2-2, SNTA1-1, SNTB1-1, SNTB2-1, SNTG2-1, SYNJ2BP-1, TJP1-1, TJP1-2, TJP1-3, TJP2-1, TJP2-3, TJP3-1. Panel B: Number of Known PPIs Predicted (172 out of 925): Structure (56), Both (48), Sequence (68).]

Figure 4 Summary of predictions for domains with hits validated by known PPIs. (A) Breakdown of the number of proteome scanning predictions per domain made by the structure-based predictor only (blue), sequence-based predictor only (pink), and both predictors (yellow). Only domains with hits matching known PPIs (physical and experimental interactions) in iRefIndex are shown. (B) Pie chart of the number of validated hits predicted by the structure-based predictor only (blue), sequence-based predictor only (pink), both predictors (yellow).

Predicted binding specificities are supported by known structural determinants of PDZ domain binding specificity

As noted above, there are many cases where the structure-based predicted binding specificity is closer to the experimental binding specificity than the sequence-based predicted binding specificity. For some examples, the structure-based predicted binding specificity better predicts the experimental binding specificity at certain positions (e.g. MLLT4-1, TJP1-1 and DVL2-1). To examine if this is caused by specific structural features used by the structure-based predictor, we searched the literature to find known structure determinants influencing these specific amino acid preferences and compared them to our results. For MLLT4-1, the structure-based predictions indicate a preference for a hydrophilic Thr residue at position −2. The preference for a hydrophilic Thr residue at position −2 is explained by the findings of Chen et al. [15]. Their work showed that the Thr preference at position −2 is due to its interaction with Gln at position α2-1 of the domain, which forms a hydrophilic binding site pocket at position −2. This preference is reflected in the structure-based predicted binding specificity, whereas a completely different preference for a hydrophobic Ile residue at this position is predicted by the sequence-based predictor (Figure 5). The domain TJP1-1 is another example where the predicted structure and sequence-based binding specificities are very different (Figure 5). Appleton et al. showed that this domain has a bi-specific preference for Trp or Tyr at position −1 [13]. The Trp preference is accommodated through main chain interactions with β2 and β3 strands, while the Tyr preference is accomplished through hydrogen bonding with Asp at position β3-5 of the domain. The bi-specific preference for a Trp or Tyr at position −1 is reflected in the structure-based binding specificity, while only a preference for Tyr is indicated in the sequence-based binding specificity. Finally, the predicted binding specificities for domain DVL2-1 are very different (Figure 5). Zhang et al. found that the −2 binding site of the domain actually accommodates a Gly-Tyr pair [52]. The preference for a Gly at position −2 is reflected in the predicted structure-based binding specificity whereas there is no obvious preference in the predicted sequence-based binding specificity. Since the binding specificities for these examples are determined by specific domain structure features, this helps explain why the structure-based predictor can better predict their binding preferences than the sequence-based predictor.

A functional map of PDZ domain biology highlights PDZ involvement in a variety of biological processes

To identify gene functions better predicted by sequence or structure-based methods, we performed GO-based gene function enrichment analysis on all predicted hits. The results were visualized using an enrichment map, which groups related gene function terms to ease identification of functional themes (Figure 6). Enrichment results from both sequence and structure-based predictions were plotted on the same map to ease identification of overlapping or unique themes, with sequence-based enrichment scores corresponding to node centre colour and structure-based scores corresponding to node border colour. For example, a number of themes are enriched in hits from both methods, such as ‘photoreceptor cell maintenance’, ‘hippo signalling’ and ‘cell junction assembly’ (i.e. node centre and border are red). Other themes are only enriched in sequence-based (i.e. border is grey, node centre is red) or structure-based predictions (i.e. border is red, node centre is grey). For example, ‘neuron projection morphogenesis,’ ‘regulation of cytokinesis’, and ‘innate immune response signalling’ themes contain terms only enriched in structure-based predictions, while ‘actin movement’, ‘membrane fusion’ and ‘nuclear transport’ are enriched only in sequence-based predictions.

We also compared the themes from our predictions to those from 1249 known PDZ mediated PPIs in the iRefIndex database [53]. Some themes were enriched only in known interactions (e.g. ‘DNA damage checkpoint’, ‘negative regulation of angiogenesis’), however many known themes were covered by our predictors (e.g. ‘cell
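
As an illustration of the position weight matrices used in the binding specificity comparison described earlier, the following is a minimal sketch that tallies residue frequencies at each ligand position, numbered backwards from the zero C-terminal position. It is not the authors' implementation: the function name build_pwm, the default of five positions, the use of plain frequencies (no pseudocounts or log-odds scaling) and the example peptides are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids; other symbols are ignored in the frequencies.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(peptides, length=5):
    """Return {position: {amino_acid: frequency}} for the C-terminal residues.

    Positions follow the convention in the text: 0 is the C-terminal
    residue, -1 the residue before it, and so on. `length` (hypothetical
    default of 5) is the number of C-terminal positions summarized.
    """
    # Keep only the last `length` residues of peptides that are long enough.
    tails = [p[-length:] for p in peptides if len(p) >= length]
    if not tails:
        raise ValueError("no peptide has at least `length` residues")

    pwm = {}
    for offset in range(length):
        # offset 0 reads the last residue (position 0), offset 1 position -1, ...
        counts = Counter(tail[-1 - offset] for tail in tails)
        pwm[-offset] = {aa: counts.get(aa, 0) / len(tails) for aa in AMINO_ACIDS}
    return pwm


if __name__ == "__main__":
    # Hypothetical predicted binders for one domain (not data from the paper).
    example_peptides = ["KETSV", "RETSV", "AETDL", "GQTSV"]
    pwm = build_pwm(example_peptides)
    for position in sorted(pwm, reverse=True):
        best = max(pwm[position], key=pwm[position].get)
        print(position, best, pwm[position][best])
```

A matrix of this kind can be drawn as a sequence logo or compared column by column with a matrix built from phage display peptides; the similarity measure used for that comparison is not specified in this section, so it is left out of the sketch.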
