Integrating protein annotations for the in silico prioritization of putative drug target proteins in malaria

by

Phelelani Mpangase

Submitted in partial fulfillment of the degree Magister Scientiae Bioinformatics
In the Faculty of Natural and Agricultural Science
Bioinformatics and Computational Biology Unit
Department of Biochemistry
University of Pretoria
Pretoria
November 2012

© University of Pretoria

Declaration

I, Phelelani Thokozani Mpangase, declare that the thesis/dissertation, which I hereby submit for the degree Magister Scientiae at the University of Pretoria, is my own work and has not previously been submitted by me for a degree at this or any other tertiary institution.

Signature: ....................................
Date: ....................................

Acknowledgements

I would like to thank the following people for their contribution towards the completion of this thesis:

• Professor Fourie Joubert for his professional support and guidance with the project and the writing of this thesis.
• My parents and brother for their love and support in all my studies and the decisions I make.
• Jeanré Smith and Michal Szolkiewicz, with whom I worked closely on my project.
• Oliver Bezuidt and my colleagues at the Bioinformatics and Computational Biology Unit of the University of Pretoria for all their help and advice.
• John Overington and Louisa Bellis for making the internship at the European Bioinformatics Institute (EBI) possible.
• Kazuyoshi Ikeda and the ChEMBL team for their help with the druggability data.
• The Department of Science & Technology (DST) of South Africa, the National Research Foundation (NRF) and the University of Pretoria for the funding which made it possible to complete my studies.

Summary

Current anti-malarial methods have been effective in reducing the number of malarial cases. However, these methods do not completely block the transmission of the parasite.
Research has shown that repeated use of the current anti-malarial drugs, which include artemisinin-based drug combinations, might be toxic to humans. There have also been reports of an emergence of artemisinin-resistant parasites. Finding anti-malarial drugs through the drug discovery process takes a long time and failure results in a great financial loss. The failure of drug discovery projects can be partly attributed to the improper selection of drug targets. There is thus a need for an effective way of identifying and validating new potential malaria drug targets for entry into the drug discovery process. The availability of the genome sequences for the Plasmodium parasite, human host and the Anopheles mosquito vector has facilitated post-genomic studies on malaria. Proper utilization of this data, in combination with computational biology and bioinformatics techniques, could aid in the in silico prioritization of drug targets. This study was aimed at extensively annotating the protein sequences from the Plasmodium parasites, H. sapiens and A. gambiae with data from different online databases in order to create a resource for the prioritization of drug targets in malaria. Essentiality, assay feasibility, resistance, toxicity, structural information and druggability were the main target selection criteria which were used to collect data for protein annotations. The data was used to populate the Discovery resource (http://malport.bi.up.ac.za/) for the in silico prioritization of potential drug targets. A new version of the Discovery system, Discovery 2.0 (http://discovery.bi.up.ac.za/), has been developed using Java. The system contains new and automatically updated data as well as improved functionalities.
The new data in Discovery 2.0 includes UniProt accessions, gene ontology annotations from the UniProt-GOA project, pathways from Reactome and Malaria Parasite Metabolic Pathways databases, protein-protein interactions data from IntAct as well as druggability data from the DrugEBIlity resource hosted by ChEMBL. Users can access the data by searching with a protein identifier, UniProt accession, protein name or through the advanced search which lets users filter protein sequences based on different protein properties. The results are organized in a tabbed environment, with each tab displaying different protein annotation data. A sample investigation using a previously proposed malarial target, S-adenosyl-L-homocysteine hydrolase, was carried out to demonstrate the different categories of data available in Discovery 2.0 as well as to test if the available data is sufficient for assessment and prioritization of drug targets. The study showed that using the annotation data in Discovery 2.0, a protein can be assessed, in a species comparative manner, on the potential of being a drug target based on the selection criteria mentioned here. However, supporting data from literature is also needed to further validate the findings.

Contents

Declaration
Acknowledgements
Summary
Table of Contents
List of Figures
List of Tables
List of Algorithms
Abbreviations
Chapter 1: Introduction
  1.1 Target discovery
    1.1.1 System-based target discovery
    1.1.2 Molecular-based targets discovery
  1.2 Genomic sequencing
  1.3 Target assessment
    1.3.1 Essentiality
    1.3.2 Assay feasibility
    1.3.3 Resistance
    1.3.4 Toxicity
    1.3.5 Structural information
    1.3.6 Druggability
  1.4 Problem statement
  1.5 Aims
Chapter 2: Methods
  2.1 Introduction
  2.2 Protein sequences and function
    PlasmoDB
    VectorBase
    Ensembl
    UniProt
    UniProt-GOA
    InterPro
    2.2.1 Obtaining protein sequences
    2.2.2 Functional annotation
  2.3 Orthology
    OrthoMCL
    2.3.1 Assignment of sequences to orthologous groups using OrthoMCL
    2.3.2 Multiple sequence alignment using T-coffee
  2.4 Structural information
    PDB
    Modbase
    2.4.1 BLAST search against PDB database
    2.4.2 Predicted MODBASE structures
  2.5 Metabolic pathways and enzyme information
    KEGG
    MPMP
    Reactome
    ExPASy
    BRENDA
    2.5.1 Metabolic pathway assignment
    2.5.2 EC number assignment linking to databases
  2.6 Protein-protein interactions
    2.6.1 Assignment of protein-protein interactions
  2.7 Druggability
    DrugEBIlity
    2.7.1 BLAST search against DrugEBIlity database
  2.8 Discussion
Chapter 3: Results and discussion
  3.1 Introduction
  3.2 The Discovery 2.0 web-interface
    3.2.1 Summary
    3.2.2 Function
    3.2.3 Gene Ontology
    3.2.4 Orthology
    3.2.5 Metabolic pathways
    3.2.6 Structure
    3.2.7 Interactions
    3.2.8 Druggability
  3.3 The annotation data in Discovery 2.0
  3.4 Case studies on Discovery 2.0
    3.4.1 Protein kinase
    3.4.2 G protein-coupled receptor
    3.4.3 Peptidase
    3.4.4 Aminopeptidase
    3.4.5 Dehydrogenase
  3.5 Assessment of a protein target using Discovery 2.0
    Summary
    Function
    Gene ontology
    Orthology
    Structures
    Metabolic pathways
    Interactions
    Druggability
    3.5.1 Essentiality
    3.5.2 Assay feasibility
    3.5.3 Resistance
    3.5.4 Toxicity
    3.5.5 Structural information
    3.5.6 Druggability
  3.6 Prioritization of potential drug targets in malaria using Discovery 2.0
  3.7 Discussion
Chapter 4: Concluding discussion
Bibliography

List of Figures

1.1 Summary of the methods used in the two different approaches to target discovery
1.2 Choke-point analysis
1.3 Reaction catalyzed by DHOD
1.4 The 424 amino acid PfCRT transmembrane protein encoded by the 13-exon pfcrt gene
1.5 Regulation and expression of human and Plasmodium DHFR
1.6 Docking of WR99210 analogues to mutant DHFR
2.1 Clustering of orthologs using the OrthoMCL algorithm
2.2 Nitrogen metabolism pathway for the Plasmodium parasite
3.1 Discovery 2.0 home page
3.2 Advanced search
3.3 Reaction catalyzed by dUTPase
3.4 Summary tab
3.5 Predicted functions tab
3.6 Gene Ontology tab
3.7 Orthology tab
3.8 Metabolic pathways tab
3.9 Crystal structures tab
3.10 Interactions tab
3.11 Druggability tab
3.12 Genome annotations
3.13 MODBASE statistics for the modelled genomes
3.14 Search for proteins by EC numbers in PlasmoDB
Description: