
Proteogenomic approach to discover cancer aberrant peptides and antibody peptides using large ... PDF

184 Pages·2017·6.06 MB·English

Preview Proteogenomic approach to discover cancer aberrant peptides and antibody peptides using large ...

UC San Diego
UC San Diego Electronic Theses and Dissertations

Title: Proteogenomic approach to discover cancer aberrant peptides and antibody peptides using large-scale next-generation sequencing data
Permalink: https://escholarship.org/uc/item/5kz1z434
Author: Cha, Seong Won
Publication Date: 2017
Peer reviewed | Thesis/dissertation

eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Proteogenomic approach to discover cancer aberrant peptides and antibody peptides using large-scale next-generation sequencing data

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering (Communication Theory and Systems)

by Seong Won Cha

Committee in charge:
Professor Vineet Bafna, Chair
Professor Drew Hall, Co-Chair
Professor CK Cheng
Professor William Hodgkiss
Professor Pavel A Pevzner

2017

Copyright Seong Won Cha, 2017. All rights reserved.

The dissertation of Seong Won Cha is approved, and it is acceptable in quality and form for publication on microfilm and electronically:
Co-Chair
Chair
University of California, San Diego 2017

DEDICATION

I dedicate this dissertation to my beloved parents, Myungsik Cha and Jinjune Kim, for all their love, patience, kindness, and support.

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1 Proteogenomic database construction driven from large scale RNA-seq data
  1.1 INTRODUCTION
  1.2 METHOD
    1.2.1 Converting splice graph structure to a FASTA file format
    1.2.2 Datasets and experimental procedure
  1.3 RESULTS
  1.4 CONCLUSIONS
  Acknowledgements

Chapter 2 Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data
  2.1 Introduction
  2.2 Method
    2.2.1 Database creation from RNA-seq data
    2.2.2 Database Search Details
    2.2.3 FDR based error control strategies
    2.2.4 Sample preparation and LC-MS/MS analysis
  2.3 Results
  2.4 Discussion
  Acknowledgements

Chapter 3 Integrative proteogenomic pipeline for identification of mutated peptides and immunoglobulin gene rearrangements, and its application to colon cancer
  3.1 Introduction
  3.2 Results
  3.3 Discussion
  Acknowledgements

Chapter 4 The antibody repertoire of colorectal cancer
  4.1 Abbreviations page
  4.2 Introduction
  4.3 Experimental Procedures
  4.4 Results
  4.5 Discussion and future study
  Acknowledgements

Appendix A Appendix: Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data
  A.1 Comparison with other gene prediction methods
  A.2 Calculation of split mapped coordinates from CIGAR string in SAM file format
  A.3 Detailed RNA-seq methods
  A.4 Comparison of spectra dataset used in this study with Merrihew et al. (2008) [73]
  A.5 Proof of correctness and completeness in applying Rule1, Rule2, and Rule3
  A.6 Proof of correctness and completeness in DFS algorithm implementation of Rule1, Rule2, and Rule3
    A.6.1 Rule1:
    A.6.2 Rule2:
    A.6.3 Rule3:

Appendix B Appendix: Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data

Appendix C Appendix: Integrative proteogenomic pipeline for identification of mutated peptides and immunoglobulin gene rearrangements, and its application to colon cancer

Appendix D Appendix: The antibody repertoire of colorectal cancer
  D.1 Supplemental method
    D.1.1 Antibody structure and IMGT reference
    D.1.2 Node grouping method for SdB graph
    D.1.3 Mathematical comparison between the SdB and dB graph

Bibliography

LIST OF FIGURES

Figure 1.1: (a) Given RNA-seq read, find overlapping regions with the existing splice graph. (b) Split and add nodes. (r_1, node s_1 is split into nodes u_1 and u_2, and node u_3 is added.) (c) Assign edges for each spliced-read. (d) Revisit each pair of contiguous nodes. The nodes are merged if there is no edge at the boundaries. (Nodes u_1 and u_2 are merged, while e_5 is added between u_2 and u_3.)

Figure 1.2: (a) By traversing the graph using a depth first search (DFS), we generate a sequence from the first visited start to end node path. (b) While traversing in DFS, when we encounter an outgoing edge that is already visited, only maintain a length L−1 suffix. (c) While traversing in DFS, when we encounter an incoming edge that is already visited, only maintain a length L−1 prefix. (d) For a pair of sequences (paths) with a prefix-suffix match, combine two sequences.

Figure 1.3: (a) Growth of the database file size (Bytes) while incorporating more RNA-seq data. (b) Increase in the percentage of covered splice junctions compared to RefSeq. (c) Increase in the number of splice junctions expressed in splice graph database which does not exist in RefSeq.

Figure 1.4: (a) Shows a novel gene area where two peptides are identified in a non-genomic region. (b) Two peptides with alternative splice junctions. Peptide T.LNVNGQE:IVYSMENEK.L is supported by 13 split mapped RNA-seq reads, and R.EIKK:QHTSFQVSGPKEEIVYSMENEK.L is supported by 40 reads. (c) Peptide ‘TIVFTVPLSQCQMVSPMISK.E’ matches in a different frame compared to the gene eef-2. Two neighboring peptides, ‘R.FIEPIEDIPSGNIAGLVGVDQYL:S.R’, and ‘G.HVFEESQVTGTPMFVV:R.L’ are identified with 1bp deletion, that allow for the frame-shift to occur.

Figure 2.1: Number of peptide identifications in 439,858 spectra collected from a single sample (sample id: TCGA-24-1467) using different FDR based error control strategies.

Figure 2.2: Overlap between novel identifications from unified and single sample database.

Figure 2.3: Alignment of identified spectra of mutated peptides.

Figure 2.4: Alignment of identified spectra of novel junction peptides.

Figure 2.5: Alignment of identified spectra of mutated peptides.

Figure 2.6: Insertions and substitutions are represented as additional node and edges having negative coordinate values. Deletions are represented same as splice junctions with actual DNA coordinates.

Figure 3.1: Illustration of proteogenomic database construction for immunoglobulin peptide identifications.

Figure 3.2: (a) Comparison of aberrant peptide identifications against previous findings using multi-stage FDR (b) Comparison of overlapping aberrant peptide identifications using combined FDR. Our proteogenomic database was created from raw RNA-seq alignments from TCGA repository and database used in Zhang et al. [138] is created from SNV informations reported by dbSNP [105], COSMIC [35], and TCGA somatic mutation calls [75].

Figure 3.3: (a) Genes containing most frequent somatic mutations reported by the TCGA study. (b) RefSeq identified spectra per gene in 10-based log scale. Most frequently mutated genes in DNA level are under expressed in protein level. COL6A3 had 35463 spectra counts, TTN (188), KRAS (71), DMD (76), SYNE1 (43), LRP1B (37), ANK2 (59), and rest of the DNA level highly mutated genes had less than 25 spectra counts. (c) Percentage of samples containing identified protein mutations in TCGA reported most frequently genes. While most of the DNA level top frequently mutated genes were under expressed in protein level, we observed that some genes showed even higher mutation frequencies across samples in protein level.

Figure 3.4: Percentage of IG gene peptide identifications in each sample normalized by the number of known peptide identifications across sample subtypes. This percentile ratio is calculated by dividing the number of known peptide identifications from the total number of IG peptide identifications within each sample. (ratio = (# of IG peptides) / (# of known peptides) * 100) Different kinds of IG gene segments are colored. Subtype C (sample groups showing both hypermutation and CIMP characteristics) showed comparably high number of IG gene peptide identification compared to other sample subtypes. Chi-squared test of this plot showed p-value < 0.0001, χ² = 2927.71.

Figure 4.1: Relative locations of identified antibody peptides. Each horizontal black line represents a distinct peptide sequence. Trypsin was applied for the colorectal tumor MS/MS spectra assessment, and four different enzymes were applied for polyclonal antibody MS/MS spectra assessment. Both spectra sets were searched against the same antibody database constructed using tumor RNA-seq reads driven by TCGA. (a) Antibody PSMs from colorectal tumor MS/MS data. (b) Antibody PSMs from polyclonal antibody MS/MS data.

Figure 4.2: Comparison of identified antibody PSMs per experiment and sample (a) The source of antibody peptides in different samples. PSMs that match non-reference peptides are either mutations or antibody peptides. Antibody peptides should not be observed in cell-lines. However, floating antibodies could be observed in normal colorectal samples. Antibodies from Tumor infiltrating lymphocytes should only be observed in tumor samples. (b) Occurrence of antibody peptides in tumor, normal, and tumor derived cell-lines are significantly different for MS/MS spectra of tumor, normal, and cell-line colorectal samples. Each spectra set were searched against the Ensembl GRCh38 protein database [20] and a custom antibody database. The number of PSMs identified as antibody peptides were 54K (colorectal tumor), 711 (colorectal normal), and 0 (Cell-lines). The PSM counts were normalized against the number of PSMs to known peptides. 5.5M in colorectal tumor, 1.7M in colorectal normal, and 0.1M in Cell-lines. The normalized ratios suggest that a significantly larger fraction of the colorectal tumor PSMs are antibody peptides, compared to the other two data-sets (Pearson’s χ² p-val < 10⁻⁴). (c) The distribution of the number of samples carrying a normalized fraction of antibody peptides. COAD samples carry a higher fraction of antibody peptides.

Figure 4.3: Peptide correlation test. We tested the correlation between the antibody peptides and mutated peptides. For every pair of peptides, we counted the number of samples co-occurring with these peptides and then we applied Fisher exact test to calculate the p-value. For example, the peptide pairs of NTLYLQMDSLR (antibody) and AAQAQGQSCEYSLMVGYQCGQVF (Q→R) (SAAV peptide) co-occurred in 26 samples, and there was a co-absence in 42 samples. It was revealed that 68 of the 90 samples shared the co-occurrence of this pair with a p-value of 2.60 × 10⁻⁶. We drew the histogram of p-values of all pairs in Supplemental Table 4. We also drew the histogram of the p-values from the decoy table generated by the random permutation of values. A 5% FDR threshold was applied to collect the high correlated pairs.

Figure 4.4: Kaplan-Meier survival estimator. For any subset of peptides, we bi-partioned peptides based on co-expression in samples. Next, we scored each sample based on the homogeneity of peptides from a single partition in that sample (Methods). The highest and lowest scoring samples (one-third each) were grouped, and were tested to determine the clinical outcome. The Kaplan-Meier survival estimator and log-rank test were applied to test the difference of the clinical outcome of two groups. When testing with co-occurring mutated peptide/antibody peptide pairs, we observed a significant correlation with survival (Plot (a): p-value = 0.032). In contrast, the correlation was reduced when testing with only antibody peptides (Plot (b): p-value = 0.040), and there was no-correlation when testing with mutated peptides. (Plot (c): p-value = 0.522).

Figure A.1: In filtering stage, RNA-seq reads that have identical splice junctions are merged, and extended in both ends
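
The Figure 4.3 caption describes counting, for each peptide pair, the samples in which both peptides occur and then applying a Fisher exact test. The following is a minimal sketch of that kind of test on a 2x2 co-occurrence table, not the dissertation's implementation. The function name and argument names are hypothetical, and the two-sided rule used here (summing the probabilities of all tables no more likely than the observed one) is an assumption, since the caption does not say which variant was applied.

```python
from math import comb


def fisher_exact_two_sided(both, only_a, only_b, neither):
    """Two-sided Fisher exact p-value for a 2x2 co-occurrence table.

    both    : samples containing peptide A and peptide B
    only_a  : samples containing A but not B
    only_b  : samples containing B but not A
    neither : samples containing neither peptide
    (Hypothetical helper; names are illustrative only.)
    """
    with_a = both + only_a
    with_b = both + only_b
    total = both + only_a + only_b + neither
    denom = comb(total, with_b)

    def prob(k):
        # Hypergeometric probability of k co-occurrences with fixed margins.
        return comb(with_a, k) * comb(total - with_a, with_b - k) / denom

    k_min = max(0, with_a + with_b - total)
    k_max = min(with_a, with_b)
    p_observed = prob(both)
    # Sum every table that is no more likely than the observed one;
    # the small relative tolerance guards against float rounding on ties.
    p_value = sum(
        p
        for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # The caption gives 26 co-occurring and 42 co-absent samples out of 90.
    # The split of the remaining 22 samples is NOT stated in the caption;
    # 11/11 below is an assumed split for illustration only.
    print(fisher_exact_two_sided(26, 11, 11, 42))
```

Because the split of the discordant samples is assumed, the printed value is not expected to reproduce the p-value quoted in the caption.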
