ebook img

On protein abundance distributions in complex mixtures. PDF

0.35 MB·English
by  KoziolJA
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview On protein abundance distributions in complex mixtures.

Kozioletal.ProteomeScience2013,11:5 http://www.proteomesci.com/content/11/1/5 METHODOLOGY Open Access On protein abundance distributions in complex mixtures JA Koziol1*, NM Griffin2, F Long2, Y Li2, M Latterich2 and JE Schnitzer2 Abstract Mass spectrometry, an analyticaltechnique that measures the mass-to-charge ratio of ionized atomsor molecules, dates back more than 100 years, and has both qualitative and quantitative uses for determining chemical and structural information. Quantitative proteomic mass spectrometry onbiologicalsamplesfocuses onidentifying the proteins presentin thesamples, and establishing the relative abundances ofthose proteins.Such protein inventories create the opportunity to discover novel biomarkers and disease targets. We have previously introduced a normalized, label-freemethod for quantificationof protein abundancesunder a shotgun proteomics platform (Griffin et al.,2010). The introduction of this method for quantifying and comparing protein levels leads naturally to theissue of modeling protein abundances inindividual samples. We here report that protein abundance levels from two recent proteomics experiments conducted by theauthors can be adequately representedby Sichel distributions. Mathematically, Sichel distributionsare mixturesof Poissondistributionswith a rather complex mixing distribution, and have been previously and successfully applied to linguistics and speciesabundance data. The Sichel model can provide a direct measure of the heterogeneity of protein abundances, and can reveal protein abundance differencesthatsimpler models fail to show. Introduction genomic sequencing tools [9] and highly accurate gene ex- Large-scale proteome analysis using mass spectrometry pressionanalyticalmethods[10],reliablemethodsofquan- and subcellular fractionation techniques can provide in- tifyingproteinexpressionandmodificationlevelshavebeen ventories of proteins identified in organelles, cells and challenging[11]. tissues (e.g., [1–3]). Such protein inventories create the This difficulty is in part due to the immense chemical opportunity to discover novel biomarkers and disease complexity of proteins, which are made up from over targets (e.g., [4–7]). But a more detailed description of twentyamino acidmonomerswithdistinct chemicalprop- cells, tissues and organisms in health and disease would erties, as contrasted to biopolymers such as RNA that are benefit greatly from quantitative tools that can carefully constituted from four monomers with similar properties. and comprehensively quantify the individual building Currently there are no feasible direct methods to establish blocks, which comprise the living entity. The ability to proteinsequenceslikethatofnucleotidepolymers;theonly quantify properly identified proteins in biological sam- method to directly determine the identity and the quantity ples in a comprehensive fashion engenders an enhanced ofproteinsinamixtureinlargescaleisthemassspectrom- understanding of cellular behavior during development eter,whichcandeterminepeptidesequencesbasedonfrag- or in response to disease, and can lead to novel bio- mentation pattern analysis and expression levels via direct marker andtargetdiscoveries [4,8]. orindirectmeansofanalysis. Muchefforthasgoneintodevelopingmoreaccurateand Quantitativeproteomic mass spectrometry is indispens- costeffectivetechnologiesthatcancapturethedynamicsof able to providing valuable insights into protein content biomolecular diversity in more quantitative ways. While and activity in various cellular states. There are at present significant advances have been made to develop accurate three principal methods of quantifying proteins via mass spectrometry: labeling approaches such as iTRAQ and *Correspondence:[email protected] SILAC, which aim to reduce experimental variance and 1TheScrippsResearchInstitute,10550NTorreyPinesRoad,LaJolla,CA allow relative comparison of peptides between samples 92037,USA [12,13]; absolute quantitative approaches such as MRM Fulllistofauthorinformationisavailableattheendofthearticle ©2013Kozioletal.;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative CommonsAttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginalworkisproperlycited. Kozioletal.ProteomeScience2013,11:5 Page2of9 http://www.proteomesci.com/content/11/1/5 andSISCAPA[7,14],whicharehighlyaccuratebutthusfar acid solution. The peptides extracted from each gel slice at the expense of completeness; and, label free approaches were first pooled into 7 groups then lyophilized. Each thatrelyoncountingspectraorpeptidenumbersasaproxy sample, either plasma membrane (experiment 1) or forexpressionlevel(reviewedin[15]),oronionintensities caveolae (experiment 2) was separated into five differ- [16], or that jointly consider peptide count, spectral count, ent gel lanes, and each lane was subjected to a and fragment-ion intensity [17]. The latter method is complete 2D-LC-MS/MS analyses resulting in five repli- particularlywellsuitedforcomparingclinicalspecimensfor cate MS analyses of each sample. Proteins were inferred biomarker identification where samples are collected over from each replicate [with the implication, that some pro- long time periods and may have to be compared across teinswere not observed ineveryreplicate].By convention, sites[6,18]. we dropped from consideration any proteins detected in We have previously introduced a normalized, label- onerunonly. free method for quantification of protein abundances under a shotgun proteomics platform [17]. The intro- Massspectrometry duction of this method for quantifying and comparing 2D-LC-MS/MS: Lyophilized peptides were resuspended protein expression leads naturally to the issue of mode- with 15 μl of buffer A (0.1% formic acid, 5% Acetonitrile ling protein abundances. In this note, we examine vari- (ACN)), thenloaded onto atwo-dimensional microcapil- ous models for patterns of relative protein abundance lary column (manually packed C reversed phase and 18 from typical 2 dimensional liquid chromatography mass strong cation exchange column). The loaded samples spectrometry(2D-LC-MS/MS)experiments. were directly introduced into the LTQ mass spectrom- Characterization of the joint distribution of all protein eter equipped with ESI nanospray ion source by eluting abundances in a proteome is complicated by the fact that the bound peptides with a 2D-LC-MS/MS scheme con- protein abundances typically differ over several orders of trolled by Agilent 1100 HPLC quaternary pump [3]. magnitude. As might be expected, this joint distribution Briefly,17saltsteps(ammoniumacetate)wereapplied.Each can be rather complex, and we would not expect a salt step was followed by a 5 to 80% ACN gradient contai- Gaussian distribution would adequately characterize it ning 0.1% formic acid to elute the peptides on the C co- 18 [17,19]. Here, we make no Gaussian assumptions about lumn.Theflowratewasmaintainedat200to250nl/min. any abundances. Rather, from a somewhat historical per- Data acquisition for the LTQ was carried out in data- spective,wehavechosendistributionsthathavebeenpro- dependent mode. Full MS scans were recorded on the posed for modeling word counts and species abundances, eluting peptides over the 400–1400 m/z range with one as we are positing an analogous problem to these prece- MS scan followed by three MS/MS scans of the most dents. We formally compare different families of distribu- abundant ions. The temperature of the ion transfer tube tions for protein abundance, with goodness of fit criteria of both mass spectrometers was set at 180°C and the utilizedtodetermineadequacyofthemodelsforsummar- spray voltage was 2.0 kv. The normalized collision en- izing the underlying data. Our fitting criteria allow us to ergy was set at 35%. A dynamic exclusion was applied determine which models best capture the underlying data for Repeat Count of 2, a Repeat Duration of 0.5 minute, structure, and would be appropriate for characterizing andanExclusionDuration of10min. proteinabundancedistributions. The protein abundance distributions can be utilized to Databasesearchforproteinidentification establish the success rate of the experiments as defined The acquired MS/MS spectra were converted into mass by Eriksson and Fenyo [19], or what we have referred to lists using the Extract_msn program from Xcalibur and as coverage [20]. Our ultimate goal was to identify a dis- searched against a protein database containing rat ™ tribution that would improve the quantitative accuracy sequences using the Sequest program in the Bioworks oflabel-free stochasticmassspectrometry. 3.1 for Linux (Thermo Fisher Scientific, Inc., Waltham, MA, USA). The searches were performed allowing for Methods tryptic peptides only with peptide mass tolerance of 1.5 Samplepreparation Da and a minimum of 21 fragmented ions in one MS/ Luminal vascular endothelial cell plasma membranes and MS scan. Accepted peptide identification was based on a theircaveloaeweredirectlyisolatedfromratlungasprevi- minimum Cn score of 0.1; minimum cross correlation ously described [21,22]. Proteins were pre-fractionated on scoreof1.8(z=1),2.5(z=2),3.5(z=3).Falsepositiveidenti- SDS-PAGE gels prior to 2 dimensional liquid chromatog- fication rate was determined by the ratio of number of raphymassspectrometry(2D-LC-MS/MS).Gellaneswere peptides found only in the reversed database to the total cut into slices, approximately 50 per lane, for in-gel pro- number of peptides found in both forward and reverse teolytic digestions. Digested peptides were extracted from databases. Thefalsepositiveidentificationrateswere≤1%. each gel slice three times with 20% ACN and 10% formic The positive protein identification results were extracted Kozioletal.ProteomeScience2013,11:5 Page3of9 http://www.proteomesci.com/content/11/1/5 fromSequest.outfiles,filteredandgroupedwithDTASelect The Poisson distribution is a standard baseline model softwareusingabovecriteria.Proteinswereidentifiedbased for discretedata, and is often used as a starting point for on2uniquesignificantlyidentifiedpeptides. deriving more realistic models that meet the characteris- tics of an observed set of data. Mathematically, the Pois- Statisticalmethods son is aone-parameter distribution,with the mean equal Weconsiderthefollowingdiscreteprobabilitydistributions: to the variance. If discrete data show overdispersion relative to the Poisson, generalizations might be intro- (1)Thenegativebinomial(NB)distribution,with duced to accommodate this. Greenwood and Yule [23] probabilitymassfunction suggestedamodel inwhichthemeaninthePoissondis- ΓðγþkÞ tribution is itself random, following a gamma distribu- Pnbðk;γ;pÞ¼ k!ΓðγÞ pkð1(cid:2)pÞγ;k tion. This leads to a two-parameter distribution, the negative binomial, for discrete data. In turn, the negative ¼0;1;...;γ >0;0<p<1: binomial has become a standard baseline model for (2)ThediscreteWeibulldistribution,with probability discretedata overdispersed relative tothe Poisson. massfunction Inaseminalarticle,Fisherandcolleagues[24]introduced the notion of mathematically modeling species abundance Pwðk;v;pÞ¼pkv (cid:2)pðkþ1Þv;k ¼0;1;...;v>0;0 data. Their motivation was to model butterfly abundance <p<1: data from Malaya [25], and Fisher explored the truncated negative binomial distribution and extensions to this end. (3)TheZipfdistribution,withprobabilitymass With species abundance data, as with our peptide setting, function one must consider the zero-truncated forms of the under- Pzðk;pÞ¼Zekta(cid:2)ðð11þþρÞρÞ;k ¼1;2;...;ρ>0; lsypiencgiedsismtraibyuntiootnsb,etoobascecrovmedmiondaatefinthitee fsaacmtpthlinatgcferartmaien. Thiscanleadtosomeaddedcomplexitiesrelativetomodel where Zeta(.)isRiemann’szetafunction. fitting, as for example, described by Sampford [26] relative to the truncated negative binomial distribution. As with (4)TheZipf-Mandelbrotdistribution,withprobability GreenwoodandYule,Fisheretal.[24]assumedthatabun- massfunction dances could be modeled by a gamma distribution, which ðkþaÞ(cid:2)ð1þρÞ led to the negative binomial. A special case is Fisher’s log- Pzmðk;ρ;aÞ¼Zetað1þρ;aÞ;k ¼1;2;...;ρ seriesmodel,wheretheshapeparameterofthegammadis- >0;a>0: tribution tends to zero. Engen [27] provides a comprehen- sivereviewofspeciesabundancemodelsinecology. where hereZeta(r,a)denotestheHurwitz zeta Theeponymous Zipf’s law wasintroduced by Zipf [28] function. as a word frequency distribution: if one tabulates from an arbitrary text the number of words arranged in the (5)TheSicheldistribution,withprobabilitymass order of their frequency of usage, the resulting word fre- function quency distribution is generally reverse J-shaped, with a ð1(cid:2)θÞγ=2 ðαθ=2Þk very long upper tail. Zipf’s law is a mathematical power- Psðk;α;θ;γÞ¼Kγ(cid:3)αpffi1ffiffiffi(cid:2)ffiffiffiffiffiθffiffi(cid:4) k! KkþγðαÞ; law representation of this type of distribution. Zipf’s fre- k ¼0;1;...;α>0;0<θ<1;(cid:2)1<γ <1 quency distribution was later generalized by Mandelbrot [29],againinalinguistics context. where Kγ(z)denotesthemodifiedBesselfunctionof The discrete Weibull [30] is another model for skewed, thesecond kindoforderγandargumentz. power-likediscretedata.Theincorporationofanadditional parameter, as with Zipf-Mandelbrot, allows added flexibil- (6)ThePoissoninverseGaussian(PIG)distribution. ity,toaccommodatesituationinwhichthepower-lawrela- Thisisa specialcaseofthe Sicheldistribution, tionshiptendstodecayinthetail.Thisiscloselyrelatedto obtainedbysettingγ=−1/2intheprobability mass the stretched exponential distribution [31]. Newman [32] functionPs.[Numerical evaluation ofKγ(z) is and Clauset et al. [33] give particularly lucid accounts of enormouslysimplifiedifγ=−1/2ordiffersfrom power-lawdistributions. −1/2byaninteger,advantageous inanearliereraof TheSicheldistributionwasintroducedbyHolla[34],and lesspowerfulcomputationalcapabilities]. popularized in a series of papers by Sichel (e.g., [35–38]). Sichel and others have applied it both to linguistics and to Our choice of these distributions is based partly on species abundance data (e.g., [39]). The special case of an historicalconsiderations,aswenow describe. inverseGaussianmixingdistribution,leadingtothePoisson Kozioletal.ProteomeScience2013,11:5 Page4of9 http://www.proteomesci.com/content/11/1/5 inverse Gaussian distribution, enjoys some computational protein abundances, and the P(i) are the probabilities advantages(e.g.,[40]).The Sichel distribution is a mixed determinedfromthemodelsgivenabove.Note,however, Poisson distribution, and can be generalized by using that the minimal observed protein abundance is 1, mixing distributions other than the inverse Gaussian whereas the supports of the negative binomial, discrete (e.g., [41–44]). Weibull, and Sichel distributions begin at 0. Hence for From a theoretical perspective, the negative binomial these distributions, we fit zero-truncated forms of the and Sichel distributions are attractive models for protein distributions: when maximizing the log likelihood for abundance data. The frequencies of the different pro- these distributions, the P(i) are replaced by P(i)/(1-P(0)) teins in the sample can be taken as independent Poisson in the above formula for LL. The supports for the Zipf variables, where the Poisson parameters are heteroge- and Zipf-Mandelbrot distributions begin at 1, obviating neous; a mixing distribution should then be chosen to the need to deal with truncated forms of these accommodate the overdispersion. In this regard, the distributions. Poisson inverse Gaussian distribution seems preferable Becausethemodelsarenotalwaysnested,weadoptthe to the negative binomial, but the Sichel distribution, Akaike information criterion (AIC; [46]) as our with one additional free parameter relative to the Pois- general criterion for comparing models. [In the case of son inverse Gaussian distribution, is correspondingly nested models, as with the Zipf nested within the Zipf- even more flexible. Mandelbrot,onemightusealikelihoodratiotest,toassess We used maximum likelihood techniques for fitting the relative improvement in fit with the more complex observed protein abundance data to all models: this typ- model relative to the simpler one.] The AIC value is ically provides more efficient and robust estimates than definedas−2[loglikelihood-#fittedparameters].Givena other methods, developed prior to the advent of inex- set of potential models for the data, the minimum AIC pensive computing resources. Goldstein et al. [45] have value would be indicative of the preferred model. We cautioned against informal methods of parameter esti- remarkthat,thereisonefittedparameterfortheZipfdis- mation with power-law based discrete distributions, and tribution, two fitted parameters for the negative binomial, Clauset et al. [33] provide theoretical justification for discrete Weibull, Zipf-Mandelbrot, and Poisson inverse maximum likelihood. We utilized Mathematica 8.0 Gaussiandistributions,andthreefittedparametersforthe (Wolfram Research, Inc., 2010) for numerical fitting Sicheldistribution. using its default global optimization algorithm; in We display observed and fitted distributions with addition, the program also provides built-in numerical rank-frequency plots [47]. The rank-frequency plot of a evaluation of the special functions incorporated in the frequency distribution is in log-log coordinates, with x probability mass functions above, which facilitates the denotingthe ranksofthe items inthedistribution, andy optimization. the corresponding relative frequencies. [A Zipf distribu- The method of maximum likelihood in our setting is tion would be a straight line in a rank-frequency plot, straightforward. We describe the method generically, as and the plot can be utilized to estimate the parameter r follows. Let X denote a positive-integer valued random characterizing the Zipf distribution]. Newman [32] variable, with Prob(X=i)=P(i;θ) for some vector of para- describes these plots in greater detail, and astutely notes meters θ. We draw a finite random sample, and observe their equivalence to complementary cumulative distribu- X=i with frequency f for i=1,2,...,m. The method of tion function plots, but with log-log and not linear i, maximum likelihood entails finding the vector θ^ that coordinates. We utilize Newman’s construction in the maximizes thelogofthelikelihood function following. Specifically, we start with a listing of all the proteins, along with their frequency of occurrence Xm (abundance), ranked in order of increasing abundance. LL¼ filog½Pði;θÞ(cid:3): The complementary cumulative distribution P(x) of the i¼1 frequency x is defined as the fraction of proteins with [In practice it is generally more convenient to abundance greater than or equal to x. Our plots depict maximize the log of the likelihood function than the both the observed and the fitted complementary likelihood itself]. With our data, the X are the various cumulativedistributions. i Table1Summarystatisticsforpeptidecounts Min Max Median Mean SD Skewness Kurtosis Var/Mean Expt1 1 525 7 13.13 26.36 10.56 164.7 52.9 Expt2 1 302 6 12.37 20.43 6.06 60.8 33.7 Caption.Experiments1and2refertothemembranereplicatesandcaveolaereplicatesrespectively,asdescribedinthemethods.2075uniqueproteinswere identifiedinexperiment1and1069inexperiment2. Kozioletal.ProteomeScience2013,11:5 Page5of9 http://www.proteomesci.com/content/11/1/5 A. Negative Binomial B. Weibull 1 1 0.1 0.1 y y bilit bilit a 0.01 a 0.01 b b o o Pr Pr 0.001 0.001 0.0001 0.0001 1 10 100 1000 1 10 100 1000 Abundance Abundance C. Zipf D. Zipf-Mandelbrot 1 1 0.1 0.1 y y bilit bilit a 0.01 a 0.01 b b o o Pr Pr 0.001 0.001 0.0001 0.0001 1 10 100 1000 1 10 100 1000 Abundance Abundance E. Poisson inverse Gaussian F. Sichel 1 1 0.1 0.1 y y bilit bilit a 0.01 a 0.01 b b o o Pr Pr 0.001 0.001 0.0001 0.0001 1 10 100 1000 1 10 100 1000 Abundance Abundance Figure1Rank-frequencyplotsofproteinabundancesfromthefirstexperiment,togetherwithfitteddistributions.A.Negativebinomial. B.DiscreteWeibull.C.Zipf.D.Zipf-Mandelbrot.E.PoissoninverseGaussian.F.Sichel.Observeddataaredepictedinblue,andthefitted distributionsaredepictedinred.AsdescribedintheMethods,westartwithalistingofalltheproteins,alongwiththeirfrequencyofoccurrence (abundance).ThecomplementarycumulativedistributionP(x)oftheabundancexisdefinedasthefractionofproteinswithabundancegreater thanorequaltox.Ourplotsdepictboththeobservedandthefittedcomplementarycumulativedistributions(ordinates)vsprotein abundances(abscissas). Kozioletal.ProteomeScience2013,11:5 Page6of9 http://www.proteomesci.com/content/11/1/5 A. Negative Binomial B.Weibull 1 1 0.1 0.1 y y bilit bilit a a b b o o Pr 0.01 Pr 0.01 0.001 0.001 1 10 100 1 10 100 Abundance Abundance C. Zipf D. Zipf-Mandelbrot 1 1 0.1 0.1 y y bilit bilit a a b b o o Pr 0.01 Pr 0.01 0.001 0.001 1 10 100 1 10 100 Abundance Abundance E. Poisson inverse Gaussian F. Sichel 1 1 0.1 0.1 y y bilit bilit a a b b Pro 0.01 Pro 0.01 0.001 0.001 0.1 1 10 100 1000 1 10 100 Abundance Abundance Figure2Rank-frequencyplotsofproteinabundancesfromthesecondexperiment,togetherwithfitteddistributions.A.Negative binomial.B.DiscreteWeibull.C.Zipf.D.Zipf-Mandelbrot.E.PoissoninverseGaussian.F.Sichel.Observeddataaredepictedinblue,andthefitted distributionsaredepictedinred.AsdescribedintheMethods,westartwithalistingofalltheproteins,alongwiththeirfrequencyofoccurrence (abundance).ThecomplementarycumulativedistributionP(x)oftheabundancexisdefinedasthefractionofproteinswithfrequencygreater thanorequaltox.Ourplotsdepictboththeobservedandthefittedcomplementarycumulativedistributions(ordinates)vsprotein abundances(abscissas). Kozioletal.ProteomeScience2013,11:5 Page7of9 http://www.proteomesci.com/content/11/1/5 Table2Comparativestatisticsforsixmodels model more closely resembles the Sichel model; lack of fit Model AIC,Expt1 AIC,Expt2 in the right tail is again noticeable for the other models. The Zipf distribution is particularly noteworthy for its lack negativebinomial 14533.9 7326.9 of fit throughout the range of the empirical distribution of discreteWeibull 14413.6 7280.8 frequencies. Zipf 16146.9 8067.0 Zipf-Mandelbrot 14703.5 7482.7 Discussion PoissoninverseGaussian 14238.4 7203.0 Ithasbecomeapparentthatpeptideandthusproteinabun- Sichel 14167.3 7189.8 dances, as measured by large scale high-throughput shot- Caption.Experiments1and2refertothemembranereplicatesandcaveolae gun proteomics experiments, are not normally distributed replicatesrespectively,asdescribedinthemethods.AICdenotesAkaike’s informationcriterion;smallervaluesconnotebettermodelfits. [17,19].Thismaybereflectiveofthecomplexnatureofthe proteome, especially when post-translational modifications Results aretakenintoaccount,ortheinherentsamplinglimitations We are interested in quantitatively mapping the proteins of the currently available MS technology as mentioned in expressed on the surface of vascular endothelial cells as the introduction. Nonetheless, we sought to characterize they exist natively in tissue, and have developed subcel- the protein abundance distributions in terms of their con- lular tissue fractionation techniques to isolate luminal tributing peptides from two separate large-scale 2D-LC- endothelial cell surface membranes directly from lung MS/MSproteinidentificationexperiments.Ourgoalwasto and other tissues. These endothelial plasma membranes identify a distribution model that best fits or describes the (experiment1)andtheircaveolae(experiment2)wereiso- protein abundance data, which can take into account the lated from rat organs, and were subsequently analyzed by realworldvariationinproteinabundances. SDS-PAGE and mass spectrometry (see Methods). With From the earliest reports of 2D-LC-MS/MS data the first experiment, a total of 27252 peptides were [14,48,49], it has become clear that protein abundance detected in 5 2D-LC-MS/MS cycles; these identified 2075 differs over several orders of magnitude, with many pro- unique proteins, based on our model selection criteria teins having a relatively small abundance, a few with outlined in the methods section. In the second experi- relatively large abundances. This reflects the inherent ment,atotalof13226peptidesweredetectedin52D-LC- dynamic range of any proteome, prior to identification MS/MS cycles; these identified 1069 unique proteins. by mass spectrometry. One must not forget that protein Summary statistics for the relative peptide counts are detection by traditional mass spectrometry methods is given in Table 1. If abundances were Poisson distributed, dependent on the inherent physical properties of the then the ratio of variance to mean would be about 1; the proteins and their resulting peptides. Peptide detection large variance/mean ratios are indicative of extra-Poisson is highly dependent on the ease with which the peptide variability.Withineachexperiment,thedataarequitedis- can beionized. Ionizationefficiencycan bethoughtofas persed, and extremely right-skewed; heavy tails exist be- the tendency of the peptide to ionize and contribute to a causeofseveralextremevaluesofabundancecounts. mass spectrum thus facilitating the identification of the In Figures 1 and 2 we display rank-frequency plots of peptide and thus the protein. This is influenced mainly the observed protein abundance distributions from the by the inherent structural properties of the peptide, such two experiments, along with the individual models fitted as length, mass, amino acid composition, and various by maximum likelihood. The AIC values corresponding biophysical properties, such as hydrophobicity, number to the fits are given in Table 2. With both experiments, of charges and potential modifications. Thus, one must theorderingofthemodels wouldbe be acutely aware that not every peptide in a given complex sample can and will be identified even though Sichel<PIG<discreteWeibull<NB multiple methods have been developed in recent years <Zipf (cid:2)Mandelbrot<Zipf; to enhance peptide and protein coverage of a complex proteinsample[3,50]. with the left to right ordering indicative of best to worst Let us next consider the issue of the external validity fitting. The added flexibility of the general Sichel distribu- (generalizability) of our findings. To address this, we tion with arbitrary parameter γ provides an improvement analyzed a smaller dataset reported by Ishihama et al. in fit over the Poisson-inverse Gaussian distribution with [51],Table 1. The relevant data consist of concentrations γ=−1/2;inturn,bothofthesemodelsarenoticeablybetter of 46 proteins that the authors had identified and than the other models, relative to AIC values. The rank- quantifiedinmouseneuro2acells[withadifferentquan- frequency plots for Set 1 show that only the Sichel model titationmethodthanthatofGriffinetal.].Weproceeded adequately represents the empirical frequency distribution to fit the 6 distributions described previously, and in the right tail. With Set 2, the Poisson-inverse Gaussian obtainedthefollowingorderingofthemodels: Kozioletal.ProteomeScience2013,11:5 Page8of9 http://www.proteomesci.com/content/11/1/5 Sichel < PIG < Zipf-Mandelbrot < discrete Weibull Competinginterests < NB < Zipf. Theauthorsdeclarenocompetingfinancialinterests. Authors’contributions The respective AIC values were: 586.97, 592.45, ThestudywasconceivedbyJAK,NMG,ML,andJES.NMG,FL,YLperformed 599.53, 603.54, 604.59, and 705.35. The pre-eminence thesamplepreparationandmassspectrometryexperiments.JAKandNMG of the Sichel distribution remains, as does the poor undertooksubsequentanalyses.JAK,NMG,ML,andJESpreparedthe manuscript.Allauthorscontributedtomanuscriptediting.Allauthorsread performance of the Zipf distribution. With this smal- andapprovedthefinalmanuscript. ler dataset, Zipf-Mandelbrot outperforms the discrete Weibull and the negative binomial, although differences Authordetails 1TheScrippsResearchInstitute,10550NTorreyPinesRoad,LaJolla,CA are at best modest. Nevertheless, we have insufficient evi- 92037,USA.2ProteogenomicsResearchInstituteforSystemsMedicine,11107 dence that a Sichel distribution would obtain with other RoselleStreet,SanDiego,CA92121,USA. quantification methods (e.g., spectral counting methods Received:4December2011Accepted:15May2012 emPAI or RIBAR / xRIBAR); a cautious interpretation is, Published:29January2013 that we observed a Sichel distribution with the quantifica- tionmethodofGriffinetal.[17],butthattheobserveddis- References tribution may also depend on the mass spectrometer 1. DurrE,YuJ,KrasinskaKM,etal:Directproteomicmappingofthelung microvascularendothelialcellsurfaceinvivoandincellculture. technologyused. NatBiotechnol2004,22:985–992. From the analyses described in this study, one might 2. KislingerT,GramoliniAO,MacLennanDH,EmiliA:Multidimensional infer that simple models of protein distribution do not proteinidentificationtechnology(MudPIT):technicaloverviewofa profilingmethodoptimizedforthecomprehensiveproteomic adequately fit the experimental data, with empirical evi- investigationofnormalanddiseasedhearttissue.JAmSocMass dence pointing toward a more complicated mixing Spectrom2005,16:1207–20. distribution. Indeed, the more complex Poisson inverse 3. LiY,YuJ,WangY,etal:Enhancingidentificationsoflipid-embedded proteinsinmassspectrometryforimprovedmappingofendothelial Gaussian or Sichel distributions work well to accommo- plasmamembranesinvivo.MolCellProteomics2009,8:1219–1235. date the heavy tail that is typically observed in proteo- 4. AddonaTA,ShiX,KeshishianH,ManiDR,BurgessM,GilletteMA,Clauser mics experiments. These models accommodate the fact KR,ShenD,LewisGD,FarrellLA,FiferMA,SabatineMS,GersztenRE,Carr SA:Apipelinethatintegratesthediscoveryandverificationofplasma that protein abundances as reflected in the number of proteinbiomarkersrevealscandidatemarkersforcardiovasculardisease. peptides detected per protein within a given sample and NatBiotechnol2011,29:635–643. between identical samples can be different. This is not 5. AndersenJN,SathyanarayananS,DiBaccoA,ChiA,ZhangT,ChenAH, DolinskiB,KrausM,RobertsB,ArthurW,KlinghofferRA,GarganoD,LiL, surprising giving the complex nature of the sample and FeldmanI,LynchB,RushJ,HendricksonRC,Blume-JensenP,PaweletzCP: the contribution of ion suppression effects which can Pathway-basedidentificationofbiomarkersfortargetedtherapeutics: mean that a peptide detected in one sample may not be personalizedoncologywithPI3Kpathwayinhibitors.SciTranslMed2010, 2:45ra55. detected in a subsequent MS analysis of the same 6. PaweletzCP,WienerMC,BondarenkoAY,YatesNA,SongQ,LiawA,LeeAY, sample. In fact, we previously found that each MS HuntBT,HenleES,MengF,SlephHF,HolahanM,SankaranarayananS, measurement of a shotgun proteomics analysis identifies SimonAJ,SettlageRE,SachsJR,ShearmanM,SachsAB,CookJJ, HendricksonRC:Applicationofanend-to-endbiomarkerdiscoveryplatform only a subset of proteins and that second and third MS toidentifytargetengagementmarkersincerebrospinalfluidbyhigh measurementsofthesamesamplewouldrevealabout33% resolutiondifferentialmassspectrometry.JProteomeRes2010,9:1392–1401. and 16% respectively of new proteins not detected in the 7. WhiteakerJR,ZhaoL,AndersonL,PaulovicAG:Anautomatedand multiplexedmethodforhighthroughputpeptideimmunoaffinity previous analyses [1,20]. This means that multiple MS enrichmentandmultiplereactionmonitoringmassspectrometry-based measurements should be performed to comprehensively quantificationofproteinbiomarkers.MolCellProteomics2010,9:184–196. define the full proteome to the degree possible with the 8. WhiteakerJR,LinC,KennedyJ,HouL,TruteM,SokalI,YanP,Schoenherr RM,ZhaoL,VoytovichUJ,Kelly-SprattKS,KrasnoselskyA,GafkenPR,Hogan technique used, hence why 5 replicate analysis of each JM,JonesLA,WangP,AmonL,ChodoshLA,NelsonPS,McIntoshMW, samplewereperformedintheproteinidentificationexperi- KempCJ,PaulovichAG:Atargetedproteomics-basedpipelinefor ments analyzed in this paper. Furthermore, due to the in- verificationofbiomarkersinplasma.NatBiotechnol2011,29:625–634. 9. LanderES:Initialimpactofthesequencingofthehumangenome. trinsic properties of some proteins, especially their large Nature2011,470:187–197. hydrophobicitypeptides,orlackofaccessibletrypticcleav- 10. RosenbergS,ElashoffMR,BeinekeP,DanielsSE,WingroveJA,TingleyWG, age sites, some peptides may never be detected by the SagerPT,SehnertAJ,YauM,KrausWE,NewbyLK,SchwartzRS,VorosS,Ellis SG,TahirkheliN,WaksmanR,McPhersonJ,LanskyA,WinnME,SchorkNJ, mass spectrometer. This suggests that, rather than total TopolEJ:Multicentervalidationofthediagnosticaccuracyofa proteomic identification, the goal of these experiments blood-basedgeneexpressiontestforassessingobstructivecoronary should be adequate coverage of the entire proteome [20]. arterydiseaseinnondiabeticpatients.AnnInternMed2010,153:425–434. 11. WangK,LeeI,CarlsonG,HoodL,GalasD:Systemsbiologyandthe Thus,theabilitytomodelproteinabundancedistributions discoveryofdiagnosticbiomarkers.DisMarkers2010,28:199–207. from2D-LC-MS/MSexperimentsorevenfitthedistribu- 12. deGodoyLM,OlsenJV,CoxJ,NielsenML,HubnerNC,FrohlichF,Walther tionstoa specificmodelimpliesthatone couldtheoretic- TC,MannM:Comprehensivemass-spectrometry-basedproteome quantificationofhaploidversusdiploidyeast.Nature2008,455:1251–1254. allyexploitthepropertiesofthemodeltoimproveprotein 13. KuzykMA,OhlundLB,ElliottMH,SmithD,QianH,DelaneyA,HunterCL, coveragethroughoptimizingexperimentaldesign[20]. BorchersCH:AcomparisonofMS/MS-based,stable-isotope-labeled, Kozioletal.ProteomeScience2013,11:5 Page9of9 http://www.proteomesci.com/content/11/1/5 quantitationperformanceonESI-quadrupoleTOFandMALDI-TOF/TOF 41. KarlisD,XekalakiE:MixedPoissondistributions.InternationalStatistical massspectrometers.Proteomics2009,9:3328–3340. Review2005,73:35–58. 14. AndersonL,HunterCL:Quantitativemassspectrometricmultiplereaction 42. PuigP,ValeroJ:Countdatadistributions:somecharacterizationswith monitoringassaysformajorplasmaproteins.MolCellProteomics2006, applications.JAmerStatistAssoc2006,101:332–340. 5:573–588. 43. ZhuR,JoeH:Modellingheavy-tailedcountdatausingageneralized 15. YatesJR3rd,GilchristA,HowellKE,BergeronJJ:Proteomicsoforganelles Poisson-inverseGaussianfamily.StatistProbabLetters2009,79:1695–1703. andlargecellularstructures.NatRevMolCellBiol2005,6:702–714. 44. El-ShaarawiAH,ZhuR,JoeH:Modellingspeciesabundanceusingthe 16. WuZ,FellenbergK,LernerS,KusterB:Comparisonoflabel-freeprotein Poisson-Tweediefamily.Environmetrics2011,22:152–164. quantificationapproachesforchemicalproteomics.Utah,USA:58thASMS 45. GoldsteinML,MorrisSA,YenGG:Problemswithfittingtothepower-law ConferenceonMassSpectrometryandAlliedTopics;2010. distribution.EurPhysJB2004,41:255–258. 17. GriffinNM,YuJ,LongF,etal:Label-free,normalizedquantificationof 46. AkaikeH:Anewlookatthestatisticalmodelidentification.IEEETransactions complexmassspectrometrydataforproteomicanalysis.NatBiotechnol2010, onAutomaticControl1974,19:716–723. 28:83–89. 47. ZipfGK:HumanBehaviourandthePrincipleofLeastEffort.Reading,MA: 18. LatterichM,SchnitzerJE:Streamliningbiomarkerdiscovery.NatBiotechnol Addison-Wesley;1949. 2011,29:600–602. 48. WoltersDA,WashburnMP,YatesJRIII:Anautomatedmultidimensional 19. ErikssonJ,FenyoD:Improvingthesuccessrateofproteomeanalysisby proteinidentificationtechnologyforshotgunproteomics.AnalChem modelingprotein-abundancedistributionsandexperimentaldesigns. 2001,73:5683–5690. NatBiotechnol2007,25:651–655. 49. LiuH,SadygovRG,YatesJRIII:Amodelforrandomsamplingand 20. KoziolJA,FengAC,SchnitzerJE:Applicationofcapture-recapturemodels estimationofrelativeproteinabundanceinshotgunproteomics. toestimationofproteincountinMudPITexperiments.AnalChem2006, AnalChem2004,76:4193–4201. 78:3203–3207. 50. FischerF,PoetschA:Proteincleavagestrategiesforanimprovedanalysis 21. SchnitzerJE,McIntoshDP,DvorakAM,etal:Separationofcaveolaefrom ofthemembraneproteome.ProteomeScience2006,4:2. associatedmicrodomainsofGPI-anchoredproteins.Science1995, 51. IshihamaY,OdaY,TabataT,SatoT,NagasuT,RappsilberJ,MannM: 269:1435–1439. Exponentiallymodifiedproteinabundanceindex(emPAI)forestimation 22. OhP,SchnitzerJE:Isolationandsubfractionationofplasmamembranes ofabsoluteproteinamountinproteomicsbythenumberofsequenced topurifycalvaeolaeseparatelyfromglycosylphosphatidylinositol- peptidesperprotein.MolCellProteomics2005,4:1265–72. anchoredproteinmicrodomain.InCellBiology:ALaboratoryHandbook. 2ndedition.EditedbyCelisJ.Orlando:AcademicPress;1998:34–36. doi:10.1186/1477-5956-11-5 23. GreenwoodM,YuleGU:Aninquiryintothenatureoffrequency Citethisarticleas:Kozioletal.:Onproteinabundancedistributionsin distributionsrepresentativeofmultiplehappeningswithparticular complexmixtures.ProteomeScience201311:5. referencetotheoccurrenceofmultipleattacksofdiseaseorofrepeated accidents.JRoyStatistSoc1920,83:255–279. 24. FisherRA,CorbetAS,WilliamsCB:Therelationbetweenthenumberof speciesandthenumberofindividualsinarandomsamplefroman animalpopulation.JAnimalEcology1943,12:42–58. 25. CorbetAS:ThedistributionofbutterfliesintheMalaypeninsula.ProcRoy EntSocLondA1942,16:101–116. 26. SampfordMR:Thetruncatednegativebinomialdistribution. Biometrika1955,42:58–69. 27. EngenS:StochasticAbundanceModels.NewYork:JohnWiley;1978. 28. ZipfGK:SelectedStudiesofthePrincipleofRelativeFrequencyinLanguage. Cambridge,MA:HarvardUniversityPress;1932. 29. MandelbrotB:Informationtheoryandpsycholinguistics.InLanguage. EditedbyOldfieldRC,MarchallJC.London:PenguinBooks;1968. 30. EnglehardtJD,LiR:ThediscreteWeibulldistribution:analternativefor correlatedcountswithconfirmationformicrobialcountsinwater.Risk Anal2011,31:370–381. 31. GuoL,TanE,ChenS,XiaoZ,ZhangX:Thestretchedexponentialdistribution ofinternetmediaaccesspatterns.PODC’08,August18–21.Toronto,Ontario, Canada:;2008. 32. NewmanMEJ:Powerlaws,ParetodistributionsandZipf’slaw. ContempPhys2005,46:323–351. 33. ClausetA,ShaliziCR,NewmanMJA:Power-lawdistributionsinempirical data.SIAMReview2009,51:661–703. 34. HollaM:OnaPoisson-inverseGaussiandistribution.Metrika1966, 11:115–121. 35. SichelHS:Onafamilyofdiscretedistributionsparticularlysuitedto representlong-tailedfrequencydata.InProceedingsoftheThird Submit your next manuscript to BioMed Central SymposiumonMathematicalStatistics.EditedbyLaubscherNF.SouthAfrica: and take full advantage of: CouncilforScientificandIndustrialResearch,Pretoria;1971:51–97. 36. SichelHS:Onadistributionrepresentingsentence-lengthinwritten prose.JRoyStatistSocSerA1974,137:25–34. • Convenient online submission 37. SichelHS:Onadistributionlawforwordfrequencies.JAmerStatistAssoc • Thorough peer review 1975,70:542–547. • No space constraints or color figure charges 38. SichelHS:Modellingspecies-abundancefrequenciesandspecies- individualfunctionswiththegeneralizedinverseGaussian-Poisson • Immediate publication on acceptance distribution.SouthAfricanStatistJ1997,31:13–37. • Inclusion in PubMed, CAS, Scopus and Google Scholar 39. OrdJK,WhitmoreGA:ThePoisson-inverseGaussiandistributionasa • Research which is freely available for redistribution modelforspeciesabundance.CommunStatistTheorMeth1986,15:853–871. 40. AtkinsonAC,YehL:InferenceforSichel’scompoundPoissondistribution. JAmerStatistAssoc1982,77:153–158. Submit your manuscript at www.biomedcentral.com/submit

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.