ebook img

An Efficient Algorithm for Clustering of Large-Scale Mass Spectrometry Data PDF

0.15 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview An Efficient Algorithm for Clustering of Large-Scale Mass Spectrometry Data

An Efficient Algorithm for Clustering of Large-Scale Mass Spectrometry Data Fahad Saeed1,∗, Trairak Pisitkun1, Mark A. Knepper1 and Jason D. Hoffert1 1Epithelial Systems Biology Laboratory National Heart Lung and Blood Institute (NHLBI) National Institutes of Health (NIH) Bethesda, Maryland USA 3 Abstract—High-throughput spectrometers are capable of pro- be used for clustering, and a graph theoretic framework that 1 ducing data sets containing thousands of spectra for a single allows us to use this metric for efficient cluster extraction. 0 biological sample. These data sets contain a substantial amount The novel algorithm introduced using the graph-theoretic 2 ofredundancyfrompeptidesthatmaygetselectedmultipletimes framework has low computational complexity, thus allowing inaLC-MS/MSexperiment.Inthispaper,wepresentanefficient n algorithm, CAMS (Clustering Algorithm for Mass Spectra) for analysis of large datasets. a clustering mass spectrometry data which increases both the Therestofthepaperisorganizedasfollows.Westartwitha J sensitivity and confidenceof spectral assignment. CAMS utilizes brief problem statement and background information relevant 4 a novel metric, called F-set, that allows accurate identification to our discussions in section 2. In section 3, we introduce of the spectra that are similar. A graph theoretic framework is ] definedthatallowstheuseofF-setmetricefficientlyforaccurate the graph theoretic framework and the algorithm for efficient S clusteridentifications.Theaccuracyofthealgorithmistestedon extraction of clusters. Section 4 presents the experimental D real HCD and CID data sets with varying amounts of peptides. resultsandtheperformanceofthealgorithmintermsofcluster Our experiments show that the proposed algorithm is able to accuracy, cluster size. Section 5 concludes the paper with s. cluster spectra with very high accuracy in a reasonable amount discussion and future work. c of time for large spectral data sets. Thus, the algorithm is able [ todecreasethecomputationaltimebycompressingthedatasets while increasing the throughput of the data by interpreting low II. PROBLEM STATEMENTAND BACKGROUND 1 S/N spectra. INFORMATION v IndexTerms—Clustering;Massspectrometry; Graph Theory; 4 Efficient Algorithms; Mass spectrometry data is complex and requires sophisti- 3 cated algorithms to do the data processing once the raw data 8 I. INTRODUCTION from the mass spectrometer is obtained. The raw data from 0 Mass spectrometry based proteomics is an emerging area themassspectrometeristhenfedtovarioussearchalgorithms . 1 and has useful applications in biology such as studying the e.g. Sequest, Inspect. These search algorithms do a thorough 0 regulation of cellular processes [8], cancer molecular ther- job of searching the spectra against a known proteome data 3 apeutics [7] [11] and others [5]. Mass spectrometry often base. After the search is complete, each of the spectra is 1 generates thousand to millions of spectra that needs to be assigned a peptide (or a set of peptides with different sites v: analyzed. The usual computational procedure invoked, after of modifications) to which it corresponds. i the raw data is generated from the mass spectrometers is to There are a number of algorithms that have been intro- X search the spectra against a protein database. The algorithms duced for clustering mass spectrometry data. Tabb et. al [12], r used for searching e.g. Sequest, Inspect, Xtandem etc, are MS2Grouper algorithm [13], Beer et. al. developed the Pep- a essentially brute force methods that try to deduce the peptide Mineralgorithm[1],Ramakrishnanet.al. [9],Duttaet. al.[2] from a given spectra. Even algorithms that use advanced and Frank et. al. [4] are to name a few of these algorithms. techniques to reduce the computational time e.g. tag-based The objective of this work is to formulate an algorithm that for Inspect, two-pass database for X!Tandem etc. are still canaccuratelyandefficientlyclusterlargenumbersofspectra, not computationally efficient enough for analyzing millions suchthatthespectrainagivenclustermustbelongtothesame of spectra in a reasonable amount of time. peptide. More formally we define a cluster as follows: It is common for the same peptides to get selected for Definition 1: Let there be N number of spectra S = fragmentation multiple times in a given MS run, making {s ,s ,···,s } and the peptide corresponding to a spectra 1 2 N fraction of MS/MS data sets redundant. Searching the same represented as P = {p ,p ,···,p }. Now let the peptide 1 2 N spectra repeatedly, even with computationally efficient tools, corresponding to a spectra s represented by p where q = q q wastesalotoftimeandcomputationalresources.Theproblem {1,···,N}. is even more pronounced when data from multiple runs are Definition 2: A distance function δ(p ,p ) where p ∈ r t r merged. The redundancy can reach up to 50% for large data P,p ∈P is defined as the levenstein distance of the peptides t sets [1], [3], [4]. correspondingtothespectras ands .Nowletthenumberof r t The main goal of the work presented in this paper, is to clusters be k and represented as K = {k ,k ,···,k } such 1 2 k formulatean efficientand accurate algorithmfor clusteringof thatsetS isdividedintoksubsets.Then,thespectras ands r t large-scalemassspectrometrydata.Inordertoaccomplishthe where s ,s ∈S should belong to the same cluster k where r t i abovetask,weintroduceanovelmetric(calledF-set)thatcan k ∈K , if and only if, δ(p ,p ) =0 where p ,p ∈P. i r t r t Note, that during clustering of the spectra, the peptides are y not known; since the clustering of the spectra is performed nsit e before the searching. nt I a b c d e f g h i j k l m n m/z III. PROPOSED GRAPH THEORETICFRAMEWORK AND y ALGORITHM nsit e nt In this section we propose the similarity criteria that we I a b c d e f g h i j k l m n m/z useforouralgorithmandtherationalebehindit.We willthen y introduce graph theoretic framework that allows us to use the nsit e similarity metric in an efficient way. This is followed by the nt proposed clustering algorithm. AI a b c d e f g h i j k l m n m/z {(b,d,f)(d,f,h)(f,h,l)(h,l,n)} y A. F-set metric nsit e Although there has been considerable effort in developing Int algorithms for spectral data, all of the approaches have been a b c d e f g h i j k l m n m/z {(b,d,f)(d,f,h)(f,h,j)(h,j,n)} gearedtowardscountingthenumberofspectralpeaksthatare nsity common between two given spectra. This information is then nte usedtocreateasimilarityindexusedbythealgorithms[1],[4]. I a b c d e f g h i j k l m n m/z Itmakessense tocountthenumberofpeaksthatarecommon {(b,d,g)(d,g,l)(g,l,m)(l,m,n)} y between two spectra and use that for similarity indexing. nsit e However,noiseandotherfactorssuchascompoundedspectra BInt a b c d e f g h i j k l m n m/z cancreatefalsepositivesforsimilarity.Asimilarityindexthat can mitigate these false positives is necessary for an efficient Fig. 1. Section A of this figure shows three spectra. The first two and accurate clustering algorithm. spectra map to the same peptides whereas the third spectra maps to We introduce F-set metric in this paper for similarity. The a different peptide. Although, the last spectra is not mapped to the basic idea of the metric is as follows:Itis possible for a peak same peptide as the first two spectra, weobserve significant overlap between the peaks. However, if we make F-sets (of size 3) of the to appear at a certain m/z by a randomchance. However,it is samepeaksandspectra,itisclearthatthesetsformulateddonothave far less likely for peaks to appear in consecutive succession much in common for the non-related spectra, and much in common just by chance. Thus, it makes sense to formulate a similarity for the spectra that are related, as shown in section B of the figure. metricthatcountsthesetsofsimilarpeaksbetweentwogiven spectra. We formally define the F-set metric below: Definition 3: As before let the spectral data set be rep- resented as S = {s ,s ,···,s }. Each spectra has two i.e. there is a high probability that a peak would appear at a 1 2 N attributesi.e. m/z and the intensity of the peak.Let there be a randomplaceinaspectraduetonoise(andhencewouldresult fragmentationspectrums =(m ,i ),(m ,i ),···,(m ,i ) in incorrect clusters if used as a similarity metric), but for j 1 1 2 2 Q Q that is extracted from the mass spectrometry data where m peakstoappearinsuccessiveorder(assets)fortwoun-related d represents the m/z ratio and i represents the intensity of the spectraislessplausible.Figure1showsthreespectra,ofwhich d peptide at position d and 1≤j ≤N. onlytwo arerelated.Itcanbeseen fromthe figurethattheF- Now making sets of peak’s at m/z posi- set metricnotonly allowsdistinctionbetweenthe spectrathat tions of size f. Then creating sets out of the are not similar but also allows us to identify spectra that are spectra can be presented as a vector F(s ) = related i.e. map to the same peptide. Now we formulate the i {(m m ···m ),(m m ···m ),···,(m ,···, graph theoretic framework to take advantage of F-set metric 1 2 f 2 3 f+1 Q−f+1 m ,m )}. Then the F-set metric calculated for spectra s just defined. Q−1 Q x and s can be formulated as y B. Graph Theoretic Framework In this section we present the graph-theoretic framework |F(sx)||F(sy)| thatwouldallowustouseF-setmetricin anefficientmanner. W(s ,s )= φ(F(s )[i],F(s )[j]) (1) x y x y Definition 4: AweightedundirectedgraphG=(V,E)isa Xi=1 Xj=1 graph where V is a set of vertices and E ∈V ×V is a set of edges. Now let a weight w ≥0 associated with edge φ(a,b)= 1 if a[i]=b[j] (2) e=(v ,v ) where e∈E aen=d(vvi,,vjv) ∈V. (cid:26) 0 o.w. i j i j A weighted undirected graph is created with vertices that TheF-set,denotedbyW(s ,s ),canbeusedasasimilarity correspond to each of the spectra. The vertices are connected x y metric for spectra. The F-set makes set of m/z from the by weighted edges and each vertex corresponds to a single spectra of size f and then compares it with the F-set of spectra. The weight on each edge between two given spectra the other spectra. If there is a match of a F-set in the other is assignedusing the weightcalculatedusing the F-seti.e. the spectraascoreof1isadded.Otherwiseazeroscoreisadded. weight assigned to the edge is equal to the F-set calculated Therefore, the final score W represents the number of F-sets between two given spectra. More formally: thatarecommonbetweenthetwogivenspectra.Therationale Definition 5: Given a graph G=(V,E) such that the num- for comparing sets of m/z between two given spectra has to ber of vertices in the graph are equal to the number of do with the probability of peaks appearing at random places spectra being considered i.e. |V| = |S| = N and an edge connecting each vertex. Now vertices can be represented by A V =v1,v2,···,vN. Then, the nodescan be labeled using the 10 22 following mapping function ∀v → s where v ∈ V,s ∈ i i i i B 9 C S,1 ≤ i ≤ N. The weight on each edge is the F-set metric 36 22 9 that is calculated for the spectra i.e. w = W(s ,s ) where 40 e=(v ,v );s ,s ∈S,e∈E,v ,v ∈Ve. i j 23 14 89 10 12 i j i j i j After the above procedure a graph is created that is weighted, and the weight corresponds to the F-set metric D 7 E calculated for a given spectra. The next step is to extract the 21 F 72 clusters using the graph that has been created. In order to extractclusterstwomethodswereinvestigated;oneistrivialin A which a threshold is chosen by the user; the second threshold is chosen using SVM which our experiments suggested was B 36 C more effective in chosing the right threshold. After threshold 40 is chosen, the edges that have weight less than threshold are 89 eliminatedand the connectedcomponentsare reported,which canbecalculatedinO(V +E)time.Thealgorithmisstatedin D E Algorithm1 and graphical representation of clusters is shown F 72 in Fig. 2 (b). Fig. 2. The graph from with weighted edges calculated using F- set metric is shown. The value of ζ is determined using the SVM. Require: MS2 spectra data set: Thereafter, the edges having weight less than ζ are labeled with red boxes(figa).Theseedgesaretheneliminatedandtheverticesthatare Ensure: Clusters of spectra such that the cluster has still connected are determined using DFS. These connected vertices spectra that can be mapped to the same peptide: are reported as potential clusters (fig b). 1) Read the Sequest search results (.dta) files 2) Enumerate the F-set of a given size for each of the spectra independently 3) For each of the pair determine the F-sets that are the cluster be defined as ni where 1 ≤ i ≤ k. Now assume common between them thatthenumberofspectraina clusterthatbelongtothesame 4) Generate the graph using the definition 5 in the peptidebedenotedbyxi.Then,theaccuracyofasinglecluster paper can be defined as : 5) Run SVM on the F-set metrics that gives a ζ x i threshold ai = (3) n 6) Eliminate the edges that are below the ζ threshold i 7) Determine the vertices that are still connected in and the average weighted accuracy (AWA) of the whole the graph after elimination dataset under consideration is defined as: 8) Output the vertices that are still connected after k elimination as clusters a n AWA= i=1 i i (4) Algorithm 1: CAMS P k n i=1 i P AWA takes into account the accuracy of each cluster and IV. PERFORMANCE EVALUATION gives a global view of the accuracy for a given dataset. The performance evaluation can be divided into two parts. A. Quality assessment ThefirstpartdealswithassessinghowgoodtheF-setmetricis at distinguishing between related and unrelated spectra. The 1) Quality with increasing F-set size: The objective of second part of the evaluation relates to the accuracy of the the first part of quality assessment, is to see how does the clusters using the algorithm with different mass spectrometry quality of the clustering behaves using increasing F-set size. data sets. Consideringtheframeworkthatweintroducedinthepaper,the Before we go any further, let us define the quality metric increasing size of F-set must correspond to higher accuracy. that we use in this paper. The quality of the clustering can In orderto confirm this, we choose a CID and HCD data sets be divided in two parts. The first part is the quality of the used in our other studies [10]. individual cluster and the second is the quality of clustering Fig.3showstheaverageweightedaccuracywithincreasing overall. If we just take an average of the individualquality of size of the F-set. In general, the average weighted accuracy theclusteritmaybemisleading,sincethenumberofelements increases with increasing F-set size for both CID as well as in each cluster may be different. Therefore, we defined the HCD data sets. Theaccuracyseemsto be levelingoff atF-set accuracy as a weighted accuracy that allows us to determine size of 7 or more. The increase in accuracycan be seen more the quality of the clustering for each cluster as well as the pronouncedin CID data sets as comparedto HCD. The HCD overallqualityofallclusters.Theweightedaccuracyisdefined data sets have better accuracy with lower F-set size due to as follows: better Signal-to-noiseratio as comparedto CID. The fact that Assumetherearekclusters.Nowlettheaccuracyofasingle accuracyincreasessignificantlywithincreasingF-setsizeeven cluster i be denoted by a and the total number of spectra in for CID data sets shows the effectiveness of F-set metric. We i V. CONCLUSIONS AND FUTUREWORK 100 In this paper, we have presented an efficient clustering algorithm suitable for large scale mass spectrometry data. A %) A(CCURACY 80 sthimeilaalrgiotyrithmmet,ribcas(ecdalloedn tFh-esets)paitsialfolromcuatliaotends,aannddinutseendsiitny WEIGHTED 60 oinftrtohdeucpeedakthsatinalalowspsetchtrea.usAe ogfrathpeh-itnhteroordeuticcedfrFam-seewt morektriics AVERAGE foonrncoluvsetlersiinmgilsaprietcytrma.eAtricde(tFa-isleedt)awlgaosridthesmcricibteedchannidqureigboaroseuds 40 HCD time complexity and quality assessment were presented. The CID graphtheoreticframeworkallowsclusteringofverylargemass spectrometrydatasetsin areasonabletime.We usedCID and 1 2 3 4 5 6 7 8 9 10 11 12 HCD data sets with different conditions to assess the quality SIZEOFF-SET of the produced clusters. Our experiments suggest that the Fig.3. TheaverageweightedaccuracyisshownwithincreasingF-set proposedalgorithmallowsnear-perfectclustersforlarge-scale size for CID as well as HCD data sets mass spectrometry data. The execution time of the algorithm is upper-bounded by O(N2), but observed execution time is close to linear with increasing number of spectra. Thepaperpresentedis partofthe ongoingworkoncluster- |F−set|=7 |F−set|=8 ingofmassspectrometrydataandweplantoexpandthework 400 |F−set|=9 in the future. We would like to investigate both theoretical O(N2) andapplication-orientedaspectsofclusteringlarge-scalemass 300 spectrometry data. MEINMINUTES 200 REFERENCES TI [1] I. Beer, E. Barnea, T. Ziv, and A. Admon. Improving large-scale proteomics by clustering of mass spectrometry data. PROTEOMICS, 100 4(4):950–960, 2004. [2] D.DuttaandT.Chen. Speedinguptandemmassspectrometrydatabase search:metricembeddingsandfastnearneighborsearch. Bioinformat- 0 ics,23(5):612–618, 2007. 0.4 1 2 3 6 [3] A.M.Frank,N.Bandeira,Z.Shen,S.Tanner,S.P.Briggs,R.D.Smith, andP.A.Pevzner.ClusteringMillionsofTandemMassSpectra.Journal NUMBEROFSPECTRA ·104 ofProteomeResearch,7(1):113–122, January2008. [4] A.M.Frank,N.Bandeira,Z.Shen,S.Tanner,S.P.Briggs,R.D.Smith, Fig. 4. The execution time with increasing number of spectra andP.A.Pevzner. Clusteringmillionsoftandemmassspectra. Journal and increasing F-set size are shown. Note that although the CAMS ofProteomeResearch,7(1):113–122, 2008. algorithm has acomplexity of O(N2),practically the running times [5] M. H. Global proteomic profiling of phosphopeptides using electron withincreasing number of spectraaremuchlessthanthetheoretical transfer dissociation tandem massspectrometry. Proc. Natl. Acad. Sci. asymptotic times. U.S.A.,104:2199,2007. [6] J. Hoffert, T. Pisitkun, F. Saeed, J. Song, C. Chou, and M. Knepper. Dynamics of the g protein-coupled vasopressin v2 receptor signaling network revealed by quantitative phosphoproteomics. Mol Cell Pro- teomics, 2011. [7] M.F.Moran,J.Tong,P.Taylor,andR.M.Ewing.Emergingapplications seea similartrendwithCIDandHCD datasetswithdifferent forphospho-proteomicsincancermoleculartherapeutics. Biochimicaet BiophysicaActa(BBA)-ReviewsonCancer,1766(2):230–241,2006. conditions as shown in the section below. Genomics andPredicting DrugSensitivity. 2) Quality with HCD and CID data sets and complexity [8] N. Musbacher, T. B. Schreiber, and H. Daub. Glycoprotein Capture and Quantitative Phosphoproteomics Indicate Coordinated Regulation analysis: The data sets that we chose to test the spectral ofCellMigration uponLysophosphatidic AcidStimulation. Molecular clustering algorithm has been used in other studies [6], [10]. &Cellular Proteomics,9(11):2337–2353, 2010. The data sets consists of CID as well as HCD spectra. The [9] S. R. Ramakrishnan, R. Mao, A. A. Nakorchevskiy, J. T. Prince, W. S. Willard, W. Xu, E. M. Marcotte, and D. P. Miranker. A fast datasetshavebeenproducedwithvaryingamountofsynthetic coarsefilteringmethodforpeptideidentification bymassspectrometry. AQP-2peptides.WealsouseiTRAQlabeleddatasetfromour Bioinformatics, 22(12):1524–1531, 2006. recent paper [6]. The experiments were conducted with size [10] F. Saeed, T. Pisitkun, J. Hoffert, and M. Knepper. Phossa: Phospho- rylation site assignment algorithm for mass spectrometry data. under of F-set equalto 7. The evaluationof the clusteringalgorithm submission,2012. with different data sets with varying conditions allows us to [11] D. B. Solit and I. K. Mellinghoff. Tracing cancer networks with assess the performance of the algorithm with ”real world” phosphoproteomics. NatBiotech,28(10):1028–1029, 102010. [12] D.L.Tabb,M.J.MacCoss,C.C.Wu,S.D.Anderson,andJ.R.Yates. mass spectrometry data sets. Our experiments suggested that Similarity among tandem mass spectra from proteomic experiments: the AWA of the clusters obtained were near 100% accuracy detection, significance, andutility. Analytical Chemistry, 75(10):2470– with the minimum accuracy reported as 97.3% (not shown). 2477,2003. PMID:12918992. [13] D.L.Tabb,M.R.Thompson,G.Khalsa-Moyers, N.C.VerBerkmoes, The time complexity of the algorithm can be shown to be and W. H. McDonald. Ms2grouper: Group assessment and synthetic O(NL2)+O(c)+O(N)+O(V +E) ≈O(N2). As shown replacement of duplicate proteomic tandem mass spectra. Journal of in figure 4, the execution time with increasing number of theAmericanSocietyforMassSpectrometry,16(8):1250–1261,2005. spectra is far less than the theoretical O(N2) execution time and should be expected in practice.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.