Comment on Yu et al., “High Quality Binary Protein Interaction Map of the Yeast Interactome Network.” Science 322, 104 (2008).

A. Clauset(1,∗)
(1) Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM, 87501, USA
∗Electronic address: [email protected]

arXiv:0901.0530v1 [q-bio.MN] 5 Jan 2009

We test the claim by Yu et al., presented in Science 322, 104 (2008), that the degree distribution of the yeast (Saccharomyces cerevisiae) protein-interaction network is best approximated by a power law. Yu et al. consider three versions of this network. In all three cases, however, we find the most likely power-law model of the data is distinct from and incompatible with the one given by Yu et al. Only one network admits good statistical support for any power law, and in that case, the power law explains only the distribution of the upper 10% of node degrees. These results imply that there is considerably more structure present in the yeast interactome than suggested by Yu et al., and that these networks should probably not be called “scale free.”

Protein-interaction networks, where nodes are natural proteins and links represent non-trivial binding affinity between two proteins, hold great promise for pushing forward our understanding of cellular processes. The wide interest in these interactome networks stems mainly from the observation that while modern genetic methods allow us to identify which genes code for proteins, the set of these genes only amounts to a “parts list” of a cell. A more complete understanding of cellular processes requires knowing something about the functional roles and dynamic interactions of these parts [2]. Thus, by considering the patterns of protein interactions, i.e., their network, we can pose and answer meaningful biological questions about complex cellular processes, and their evolution.

Determining the protein-interaction network for a particular species is a highly non-trivial task, and relies upon sophisticated molecular techniques both to build the proteins and to test their pairwise interactions. The most direct approach would be to individually test each of the n^2 possible interactions for n proteins. But, because n is typically on the order of thousands or tens of thousands, and high-throughput methods are not yet available for these tests, this approach is not used. Instead, researchers use techniques that test multiple interactions at once [3, 4], e.g., the yeast two-hybrid (Y2H) screen [5]. To date, there have been a number of high profile efforts to construct the interactome for yeast (S. cerevisiae) [6, 7, 8, 9, 10], and, many would argue, a lot of real progress.

However, these methods also have serious limitations, and have been shown capable of producing high false-positive and high false-negative rates [11, 12]. Some of the techniques also exhibit severe biases, being unable to test for interactions involving entire classes of proteins. The Y2H assay, for instance, is not suitable for transcriptional activators or membrane-bound proteins. As a result of these limitations, some scientists have wondered privately how much real biology our current maps actually capture.

The Yu et al. paper attempts to address some of these shortcomings using a new high-throughput Y2H screen. Combining its results with those of previous studies, they produce a new set of interactions (denoted Y2H-union), which they suggest covers about 20% of the entire network. In the interest of completeness, Yu et al. include in their subsequent network analyses two alternative versions of the yeast interactome: one is based on co-complex information drawn from raw high-throughput coaffinity purification and mass spectrometry data (denoted Combined-AP/MS), and one is based on a smaller set of literature-curated interactions (denoted LC-multiple) that are assumed to be error free.

This Comment is not concerned with the quality of the laboratory procedures or the accuracy of the inferred interaction data. Rather, the focus is relatively narrow, concerning only the analysis of these networks’ degree distributions and the conclusions drawn thereby. In particular, Yu et al. claim that the degree distributions of all three networks follow power-law distributions (with parameters given in Table I):

    As found previously for other macromolecular networks, the connectivity or “degree” distribution of all three data sets is best approximated by a power-law. [1]

The implication is that the yeast proteome can be considered “scale free”, with all that goes along with that label [13]. However, this claim depends on a statistical method (specifically, linear regression) that is known to produce biased and incorrect results in this context. By reanalyzing Yu et al.’s data [14] with appropriate statistical tools [15], we show that (i) the most likely power-law models of these networks’ degree distribution are distinct from and incompatible with those quoted by Yu et al., (ii) at best, only the 10.3% most connected nodes in the Y2H-union network are plausibly power-law distributed, and (iii) there is considerably more structure present in all of these networks than suggested by Yu et al. As a result, these networks should probably not be considered “scale free.”

We begin by reanalyzing the Y2H-union data, argued by Yu et al. to be the most accurate map among the three. Figures 1a–c show the data along with three different power-law models. The first panel shows the model suggested by Yu et al. (with scaling exponent α = 2.4), which was derived using a linear regression approach; this model yields a poor fit to the lower-to-middle range of the data. The second panel shows the most likely power-law model over the entire range of data, which fits the lower range and thus the majority of the data relatively well, but yields a poor fit to the upper range. The third panel shows the most likely power-law model for the upper range alone.

[Figure 1: three log-log panels of Pr(X ≥ x) versus degree, x, each showing the Y2H-union data and a power-law fit. Panel (a): α = 2.4, x_min = 1, p = 0.00. Panel (b): α = 1.96, x_min = 1, p = 0.00. Panel (c): α = 2.89, x_min = 6, p = 0.95.]

FIG. 1: Three approaches to fitting a power-law distribution to the Y2H-union data from Yu et al. [1]. (a) The parameterization given by Yu et al., derived using standard linear regression on the log-transformed histogram of degree frequencies. (b) A fit derived via maximum likelihood over the entire range of the data, i.e., x_min = 1. (c) A fit derived via maximum likelihood for estimating α, and by selecting the x_min that gives the best power-law fit to the upper range of the data. In both (a) and (b), the fitted power law is not statistically significant (p = 0.00 ± 0.01), indicating that large or systematic deviations from the power-law hypothesis exist. In (c), the upper 10.3% are plausibly power-law distributed (p = 0.95 ± 0.03).

Notably, Figures 1a–c plot the data as a complementary cumulative distribution function (CDF). If the data were indeed power-law distributed, this function would be straight on the log-log axes, but the reverse is not true. Being straight on log-log axes is not a sufficient condition for some data to be power-law distributed; many kinds of non-power-law distributed data can look straight on log-log axes. To decide whether some data do or do not follow a power-law distribution, we must use statistical tools that can tell the difference [15]. Linear regression is not one of these.

A straightforward test for whether some data can reasonably be claimed to follow a power-law distribution is a significance test [16, 17]. When we fit the power-law model to the same data that we use to score the power law’s plausibility, we induce a correlation between the data and the model; however, this correlation can be controlled using a Monte Carlo procedure [15]. The result of the test is a single value p that represents the plausibility of the fitted model as an explanation of the data: small p-values, conventionally p < 0.1, indicate that the data cannot be considered to follow the fitted model, while larger values indicate only that the fitted model is plausible (not that it is correct) [18]. Conducting such a test with the Y2H-union data and the power law claimed by Yu et al. yields p = 0.00 ± 0.01, indicating that this power law is a terrible model of the data, and that the data deviate in large or systematic ways from the proposed power law.

The poor fit here is partly due to the way the power law was estimated from the data: linear regression for fitting the power law is known to produce biased results [15, 19], mainly by dramatically overweighting the large but rare events. Further, the canonical r^2 value, used to judge the quality of the regression, can easily be high even when the data are not power-law distributed [15]. In short, regression applied in this way makes assumptions about the model that are incompatible with the hypothesis being tested, and thus fails in uncontrolled ways.

A more reliable method for fitting a power-law distribution to data is the method of maximum likelihood [16, 20]. Figure 1b shows a power-law model fitted to the entire range of data, while Figure 1c shows a power law fitted only to the upper range. The second of these requires choosing a point x_min where the power-law behavior starts, but this can be done using appropriate tools [15]. Applying the significance test to both models, we find that even the most likely power law is still a terrible explanation of the entire range of data (p = 0.00 ± 0.01), but that it is an entirely plausible explanation (p = 0.95 ± 0.03) of the 209 (10.3%) most connected nodes, i.e., those with degree x ≥ 6. A corollary, however, is that the distribution of the degrees 1 ≤ x < 6 is not a power law. (One possibility is that the low-degree data are power-law distributed in a way different from that of the high-degree nodes. Fortunately, the same tools we have already used can be readily adapted to test this hypothesis.)

Figure 2 repeats the analysis of Figure 1c with the data for the alternative interactome maps. For these, we find that the Combined-AP/MS network’s degree distribution cannot be considered to follow a power law (p = 0.01 ± 0.03), while the support for the LC-multiple network is marginally significant (p = 0.15 ± 0.03). In both cases, the best power-law models cover only the upper range of data, and power laws that cover the entire range have zero support (p = 0.00 ± 0.03). Table I summarizes the results of our reanalysis of the three networks.

[Figure 2: two log-log panels of Pr(X ≥ x) versus degree, x, each showing the data and the best power law. Panel (a), Combined-AP/MS: α = 2.91, x_min = 20, p = 0.01. Panel (b), LC-multiple: α = 3.30, x_min = 8, p = 0.15.]

FIG. 2: The Combined-AP/MS and LC-multiple data sets, shown as complementary cumulative distribution functions Pr(X ≥ x) on log-log axes, along with fits of the power-law hypothesis using the same methods as in Fig. 1c.

                  |        regression          |        maximum likelihood          | support for
data set          |  α    x_min  n_tail  r^2   |  α          x_min   n_tail  p (±0.03) | power law
Y2H-union         |  2.4  1      2032    0.96  |  2.9 ± 0.2  6 ± 1   209     0.95   | okay
Combined-AP/MS    |  1.4  1      1621    0.96  |  2.9 ± 0.1  20 ± 1  309     0.01   | none
LC-multiple       |  2.1  1      1552    0.92  |  3.3 ± 0.2  8 ± 2   186     0.15   | marginal

TABLE I: Power-law models for the three versions of the yeast interactome. In the first few columns, we quote the models given by Yu et al., which were derived using regression methods. In the second set, we give the best fits derived by maximum likelihood when the range of fit is allowed to vary (as in [15]), along with the corresponding p-value. In the final column, we state the support for the conjecture that the corresponding data follow a power-law distribution. In every case, the maximum likelihood power law is distinct from and incompatible with that given by Yu et al., and only for the upper 10.3% of the Y2H-union data does the power-law hypothesis have strong statistical support.

As a brief aside, Yu et al. also conducted a model-comparison exercise to test whether a power-law distribution with or without an exponential cutoff was a better explanation of the data. For the same reasons given above, the regressions and r^2 values Yu et al. use cannot reliably determine which of these models is better. Fortunately, reliable tools for answering such a question do exist, e.g., a likelihood ratio test [15, 21], although we do not apply them here.

With these results in hand, we can now make several novel conclusions about the structure of protein interactions in yeast, and generally clarify the results of Yu et al. First, the question of whether the yeast interactome’s degree distribution is well-characterized by a power-law distribution is not yet settled. Yu et al. argue that the Y2H-union network is the most accurate map to date (more accurate than the Combined-AP/MS or LC-multiple versions), but here the statistics only support the notion that a power law is a plausible model of the degrees of a small fraction of the entire network (10.3%, or the 209 most connected nodes). The LC-multiple network was presented as being a smaller, but generally high-quality, data set, and here there is only marginal statistical support for a power-law degree distribution in the upper range. Notably, the two power laws for these networks are largely incompatible, with α_Y2H = 2.9 ± 0.2 versus α_LC = 3.3 ± 0.2. If the Y2H-union network were merely a more complete version of the LC-multiple network, a greater degree of overlap in these estimates would be expected. The implication is that there are significant differences between these networks that cannot be explained away by simple sampling arguments.

Further, the fact that the best power-law model of the Y2H-union data only explains the distribution of the upper 10.3% of the node degrees implies that there is considerable structure in this network that remains to be explained. This structure may have evolutionary or functional significance, especially considering that its behavior is qualitatively different from that of the large-degree nodes and that it accounts for almost 90% of the network. Additionally, the existence of the cross-over point at x_min = 6 from non-power-law to power-law behavior deserves a scientific explanation. In the additional structural analyses performed by Yu et al., the x_min value could serve as a principled threshold by which to quantitatively define “high degree” (see for instance [22]).

In closing, we note that there are many aspects of the Yu et al. study that seem entirely reasonable, and much of the paper concerns the experimental work done to construct the Y2H-union version of the yeast interactome. From this perspective, the paper pushes the field forward in a meaningful way, and the problems discussed here are a small part of a very large project.

On the other hand, the goal of constructing and analyzing the yeast interactome is to ultimately understand the mechanisms that create the observed patterns of interactions, and their implications for higher cellular functions. Scientific progress on these questions certainly depends on high quality experimental work, but it also depends on high quality statistical work: to get the theories right, we must also get the statistics right. Otherwise, we cannot know for sure what the data do and do not say. For testing whether some data do or do not follow a power-law distribution, reliably accurate tools now exist, and their application can shed considerable light on the relevant scientific questions.

Acknowledgments

Thanks go to C. R. Shalizi for encouraging me to post these thoughts on the arXiv.

[1] H. Yu et al., “High Quality Binary Protein Interaction Map of the Yeast Interactome Network.” Science 322, 104 (2008).
[2] Arguably, this level of understanding is still intermediate. A more complete understanding would incorporate knowledge about how genes interact and regulate each other, potentially through intermediate cellular processes.
[3] R. Aebersold and M. Mann, “Mass spectrometry-based proteomics.” Nature 422, 198–207 (2003).
[4] E. Phizicky, P. I. H. Bastiaens, H. Zhu, M. Snyder and S. Fields, “Protein analysis on a proteomic scale.” Nature 422, 208–215 (2003).
[5] L. McAlister-Henn, N. Gibson and E. Panisko, “Applications of the Yeast Two-Hybrid System.” Methods 19, 330–337 (1999).
[6] M. Fromont-Racine, J. C. Rain, P. Legrain, “Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens.” Nature Genetics 16, 277–282 (1997).
[7] P. Uetz et al., “A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.” Nature 403, 623–627 (2000).
[8] T. Ito et al., “A comprehensive two-hybrid analysis to explore the yeast protein interactome.” PNAS 98, 4569–4574 (2001).
[9] A.-C. Gavin et al., “Functional organization of the yeast proteome by systematic analysis of protein complexes.” Nature 415, 141–147 (2002).
[10] Y. Ho et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry.” Nature 415, 180–183 (2002).
[11] G. D. Bader and C. W. V. Hogue, “Analyzing yeast protein-protein interaction data obtained from different sources.” Nature Biotechnology 20, 991–997 (2002).
[12] J. S. Bader, A. Chaudhuri, J. M. Rothberg and J. Chant, “Gaining confidence in high-throughput protein interaction networks.” Nature Biotechnology 22, 78–85 (2004).
[13] L. Li, D. Alderson, J. C. Doyle and W. Willinger, “Towards a Theory of Scale-Free Graphs: Definitions, Properties, and Implications.” Internet Mathematics 2, 431–523 (2006).
[14] The data we analyze was taken directly from the published histograms in the main text and the appendices of [1], using DataThief (http://www.datathief.org/).
[15] A. Clauset, C. R. Shalizi and M. E. J. Newman, “Power-law distributions in empirical data.” arXiv:0706.1062 (2007).
[16] L. Wasserman, All of Statistics. Springer (2004).
[17] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press (1992).
[18] Tests will differ depending on exactly how large a deviation from the power-law model we allow our data to take before we conclude that the data are not power-law distributed. The test used here assumes all data are iid; thus, deviations are due to statistical fluctuations. In principle, this assumption could be relaxed to allow for correlations in the data.
[19] M. L. Goldstein, S. A. Morris and G. G. Yen, “Problems with fitting to the power-law distribution.” European Physical Journal B 41, 255–258 (2004).
[20] J. Aldrich, “R. A. Fisher and the Making of Maximum Likelihood 1912–1922.” Statistical Science 12, 162–176 (1997).
[21] Q. H. Vuong, “Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses.” Econometrica 57, 307–333 (1989).
[22] A. Clauset, M. Young and K. S. Gleditsch, “On the Frequency of Severe Terrorist Attacks.” Journal of Conflict Resolution 51, 58–88 (2007).
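
The fitting procedure described in the main text (a maximum-likelihood estimate of α, a choice of x_min for the upper range, and a Monte Carlo significance test) can be illustrated with a minimal Python sketch. This is not the author's code and it makes several assumptions that should be read as such: it uses the continuous-data estimator α = 1 + n / Σ ln(x_i / x_min) from [15], which is only an approximation for integer-valued degrees; it selects x_min by minimizing a Kolmogorov-Smirnov distance; and its p-value is a simplification that holds x_min fixed and resamples only the tail, which is less complete than the procedure in [15]. All function names are hypothetical, and the data are synthetic rather than the Yu et al. degree sequences.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n continuous power-law values (x >= xmin) by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(tail, xmin):
    """Continuous-data MLE of the exponent: alpha = 1 + n / sum(ln(x / xmin))."""
    log_sum = sum(math.log(x / xmin) for x in tail)
    return 1.0 + len(tail) / log_sum


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted power-law CDF."""
    xs = sorted(tail)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical CDF just below and just at x.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def fit_power_law(data, min_tail=10):
    """Scan candidate xmin values; return (xmin, alpha, ks, n_tail) with the smallest KS distance."""
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:
            break  # remaining candidates leave even fewer points
        if all(x == xmin for x in tail):
            continue  # the estimator is undefined when every log term is zero
        alpha = fit_alpha(tail, xmin)
        dist = ks_distance(tail, xmin, alpha)
        if best is None or dist < best[2]:
            best = (xmin, alpha, dist, len(tail))
    return best


def tail_p_value(tail, xmin, reps=200, seed=0):
    """Simplified Monte Carlo p-value: fraction of synthetic tails that fit worse than the data.

    Assumption: xmin is held fixed and only the tail is resampled.
    """
    rng = random.Random(seed)
    alpha = fit_alpha(tail, xmin)
    observed = ks_distance(tail, xmin, alpha)
    exceed = 0
    for _ in range(reps):
        synth = sample_power_law(len(tail), alpha, xmin, rng)
        if ks_distance(synth, xmin, fit_alpha(synth, xmin)) >= observed:
            exceed += 1
    return exceed / reps


if __name__ == "__main__":
    rng = random.Random(1)
    data = sample_power_law(500, 2.9, 6.0, rng)  # synthetic stand-in for a degree tail
    print(fit_power_law(data))
    print(tail_p_value(data, 6.0))
```

In this sketch a small p-value means the synthetic power-law samples almost always fit better than the data do, which mirrors the reading of p used in the text: small values rule the power law out, while large values indicate only that it is plausible.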
