
Automatic Discovery of Hidden Associations Using Vector Similarity

213 Pages·2018·17.49 MB·English

Preview Automatic Discovery of Hidden Associations Using Vector Similarity

Automatic Discovery of Hidden Associations Using Vector Similarity: Application to Biological Annotation Prediction
Seyed Ziaeddin Alborzi

To cite this version: Seyed Ziaeddin Alborzi. Automatic Discovery of Hidden Associations Using Vector Similarity: Application to Biological Annotation Prediction. Bioinformatics [q-bio.QM]. Université de Lorraine, 2018. English. NNT: 2018LORR0035. tel-01792299

HAL Id: tel-01792299
https://theses.hal.science/tel-01792299
Submitted on 15 May 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

NOTICE
This document is the result of extensive work approved by the defense jury and made available to the wider university community. It is subject to the author's intellectual property rights. This implies an obligation to cite and reference the document whenever it is used. Furthermore, any counterfeiting, plagiarism, or unlawful reproduction is liable to criminal prosecution.
Contact: [email protected]

LINKS
Code de la Propriété Intellectuelle, article L 122.4
Code de la Propriété Intellectuelle, articles L 335.2 to L 335.10
http://www.cfcopies.com/V2/leg/leg_droi.php
http://www.culture.gouv.fr/culture/infos-pratiques/droits/protection.htm

École doctorale IAEM Lorraine

THESIS
presented and publicly defended on 23 February 2018 for the degree of Doctorat de l'Université de Lorraine (specialization: computer science) by Seyed Ziaeddin ALBORZI

Jury
Reviewers:
  Wim Vranken, PR, Vrije Universiteit Brussel, Brussels
  Graham Kemp, PR, Chalmers University of Technology, Gothenburg
Examiners:
  Olivier Poch, DR, CNRS, Strasbourg
  Alessandra Carbone, PR, Sorbonne Université, Paris
  Anne Boyer, PR, Université de Lorraine, Nancy
  Malika Smaïl-Tabbone, MC, Université de Lorraine, Nancy
Supervisors:
  Marie-Dominique Devignes, CR, CNRS, Nancy
  David W. Ritchie, DR, INRIA, Nancy

Équipe CAPSID – INRIA Nancy Grand Est
Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), UMR 7503

Typeset with the thesul class.

Acknowledgments

I would like to thank all the people who contributed in some way to the work described in this thesis. First and foremost, I thank my academic advisors, Dr. Dave Ritchie and Dr. Marie-Dominique Devignes, for accepting me into the CAPSID team at the Institut national de recherche en informatique et en automatique (INRIA). During my tenure, they contributed to a rewarding doctoral school experience by giving me intellectual freedom in my work, supporting my attendance at various conferences, engaging me in new ideas, and demanding a high quality of work in all my endeavors. Additionally, I would like to thank my committee members Dr. Carbone, Dr. Poch, Dr. Vranken, Dr. Kemp, Dr. Boyer, and Dr. Smaïl-Tabbone for their interest in my work.

I was fortunate to have the chance to work in the UniProt team at the European Bioinformatics Institute (EBI). I worked (nonstop) as a predoctoral visitor in the team of Dr. Maria Martin of EBI for three months starting from October 2016. I am very grateful to Dr. Rabie Saidi and Mr. Alex Reneaux, who helped me create the rules for protein functional annotation that I present in this thesis.

I am grateful for the funding sources that allowed me to pursue my doctoral studies: ANR Fellowship, Region of Lorraine and INRIA. I would like to acknowledge the Doctoral School at University of Lorraine. My graduate experience benefited greatly from the courses I took.

I would like to acknowledge my friends, Farhad, Saeid, Soheib, Mojdeh, Valia, Daishi, Iordan, Ehsan, Mohanna, Meysam, Younes and family members, Baba, Maman, Emad, Hesam, Carol, Baba Mapar, Maman Mapar, Mahsa, Mahna, Payam, Samira, Parinaz, who supported me during my time here. Finally, I would like to thank my lovely wife, Mahta, for her constant love and support.

This thesis is dedicated to
My lovely wife who always believed in me,
My wonderful parents who have raised me to be the person I am today.

Contents

List of Tables
Introduction in French
Introduction

Chapter 1 Background
  1.1 Data Science Context - Data Preparation, Mining, and Interpretation
    1.1.1 Knowledge Discovery from Data and Data Mining
    1.1.2 Machine Learning and Data Mining
    1.1.3 Information Filtering and Recommendation Systems
    1.1.4 Data Structure and Representation
    1.1.5 Statistical Validation of Extracted Pattern
  1.2 Biological Context - Protein Function, Domain, and Interaction
    1.2.1 Protein Sequence and Structure
    1.2.2 Protein Function
    1.2.3 Protein Domains and Families
    1.2.4 Protein Interaction

Chapter 2 Discovering Hidden Associations between Enzyme Commission Numbers and Pfam Domains
  2.1 Introduction
  2.2 Methods and Materials
    2.2.1 Data Preparation
    2.2.2 Inferring EC-Pfam Domain Associations
    2.2.3 Defining a Confidence Score Threshold
    2.2.4 Exploiting the EC Number Hierarchy
    2.2.5 Hypergeometric Distribution p-Value Analysis
  2.3 Results and Discussion
    2.3.1 Data Source Weights and Score Threshold
    2.3.2 Global Analysis of Inferred EC-Pfam Associations
    2.3.3 Comparison with dcGO
    2.3.4 Selecting plausible associations in multi-domain proteins
    2.3.5 Single and Multiple EC-Pfam Associations
    2.3.6 Annotating PDB Chains with EC Numbers
    2.3.7 The ECDomainMiner web server
  2.4 Conclusion

Chapter 3 Computational Discovery of Direct Associations between Annotations using Common Content - CODAC
  3.1 CODAC
    3.1.1 Tripartite Graph Model
    3.1.2 Biadjacency Representation of bigraphs
    3.1.3 Gold Standard of Positive and Negative Examples
    3.1.4 Determining the Score Threshold
    3.1.5 Combining Multiple Datasets
    3.1.6 Bipartite Graph Extension with Hierarchy of Classes
    3.1.7 Clustering Graph Edges
    3.1.8 Calculating Statistically Significant Edges in E_3^*
    3.1.9 Classification into Gold, Silver, and Bronze Associations
  3.2 GODomainMiner: Computational Discovery of Direct Associations between GO terms and Protein Domains
    3.2.1 GODomainMiner Data Preparation
    3.2.2 Dataset Weights and Threshold Scores
    3.2.3 Analysis of Calculated GO-Pfam Associations
    3.2.4 Distribution of GO-Domain Associations per GO term or per domain
    3.2.5 Comparison with GO-Domain Associations from dcGO
    3.2.6 Biological Assessment of New Discovered GO-Pfam Associations
  3.3 Implementation
  3.4 Conclusion

Chapter 4 Functional Annotation of Protein Sequences and Structures
  4.1 Introduction
  4.2 Methods
    4.2.1 Method Overview
    4.2.2 Using CODAC to Infer Function-Domain Associations
    4.2.3 Combinatorial Generation of Association Rules
    4.2.4 Knowledge-based Filtering of Association Rules
    4.2.5 Aggregating and Applying Function Annotation Models
    4.2.6 Extension to Other Protein Annotations

Description:
This document is the result of extensive work approved by the jury ... thesis presented and publicly defended on 23 February 2018 ... domain interactions is far fewer than the actual number of protein interactions. Domain ... the presence of an association and a 0 represents no association.
