SEQUENCE - EVOLUTION - FUNCTION
Computational Approaches in Comparative Genomics

by
Eugene V. Koonin and Michael Y. Galperin
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health

Springer-Science+Business Media, B.V.

Electronic Services <http://www.wkap.nl>

Library of Congress Cataloging-in-Publication Data
Koonin, Eugene V.
Sequence - evolution - function : Computational approaches in comparative genomics / by Eugene V. Koonin and Michael Y. Galperin
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4419-5321-6
ISBN 978-1-4757-3783-7 (eBook)
DOI 10.1007/978-1-4757-3783-7
1. Genomes. 2. Nucleotide sequence-Data processing. 3. Evolutionary genetics. I. Galperin, Michael Y. II. Title.
QH447 .K665 2002
572.8'6--dc21
2002034045

Copyright © 2003 by Springer Science+Business Media Dordrecht. Second Printing 2004.
Originally published by Kluwer Academic Publishers in 2003.
Softcover reprint of the hardcover 1st edition 2003

All rights reserved. No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without the written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Permission for books published in Europe: [email protected]
Permissions for books published in the United States of America: [email protected]

Printed on acid-free paper.

The Publisher offers discounts on this book for course use and bulk purchases. For further information, send email to <[email protected]>.

CONTENTS

Preface
Introduction. Personal Interludes

Chapter 1. Genomes from Phage to Human
  1.1. The humble beginnings
  1.2. ...and the astonishing progress of genome sequencing
  1.3. Basic questions of comparative genomics
  1.4. Further reading

Chapter 2. The Evolutionary Concept in Genetics and Genomics
  2.1. Similarity, homology, divergence and convergence
    2.1.1. The critical definitions
    2.1.2. Conservation of protein sequence and structure in evolution
    2.1.3. Homologs: orthologs and paralogs
  2.2. Patterns and mechanisms in genome evolution
    2.2.1. Evolution of gene order
    2.2.2. Lineage-specific gene loss
    2.2.3. Lineage-specific expansion of gene families
    2.2.4. Horizontal (lateral) gene transfer
    2.2.5. Non-orthologous gene displacement and the minimal gene set concept
    2.2.6. Phyletic patterns (profiles)
  2.3. Conclusions and outlook
  2.4. Further reading

Chapter 3. Information Sources for Genomics
  3.1. General purpose sequence databases
    3.1.1. Nucleotide sequence databases
    3.1.2. Protein sequence databases
    3.1.3. Reliability of the database entries
  3.2. Protein sequence motifs and domain databases
    3.2.1. Motif databases
    3.2.2. Domain databases
    3.2.3. Integrated motif and domain databases
  3.3. Protein structure databases
  3.4. Specialized genomics databases
  3.5. Organism-specific databases
    3.5.1. Prokaryotes
    3.5.2. Unicellular eukaryotes
    3.5.3. Multicellular eukaryotes
  3.6. Taxonomy, protein interactions, and other databases
    3.6.1. Taxonomy databases
    3.6.2. Signal transduction and protein interaction databases
    3.6.3. Biochemical databases
  3.7. PubMed
    3.7.1. Specifying the terms in PubMed search
    3.7.2. Interpretation of the search pattern
    3.7.3. NCBI Bookshelf
  3.8. Conclusions and outlook
  3.9. Further reading

Chapter 4. Principles and Methods of Sequence Analysis
  4.1. Identification of genes in a genomic DNA sequence
    4.1.1. Prediction of protein-coding genes
    4.1.2. Algorithms and software tools for gene identification
  4.2. Principles of sequence similarity searches
    4.2.1. Substitution scores and substitution matrices
    4.2.2. Statistics of protein sequence comparison
    4.2.3. Protein sequence complexity. Compositional bias
  4.3. Algorithms for sequence alignment and similarity search
    4.3.1. The basic alignment concepts and principal algorithms
    4.3.2. Sequence database search algorithms
    4.3.3. Motifs, domains and profiles
  4.4. Practical issues: how to get the most out of BLAST
    4.4.1. Setting up the BLAST search
    4.4.2. Choosing the BLAST parameters
    4.4.3. Running BLAST and formatting the output
    4.4.4. Analysis and interpretation of BLAST results
  4.5. The road to discovery
  4.6. Protein annotation in the absence of detectable homologs
    4.6.1. Prediction of subcellular localization of the protein
    4.6.2. Prediction of structural features of the protein
    4.6.3. Threading
  4.7. Conclusions and outlook
  4.8. Further reading

Chapter 5. Genome Annotation and Analysis
  5.1. Methods, approaches and results in genome annotation
    5.1.1. Genome annotation: data flow and performance
    5.1.2. Automation of genome annotation
    5.1.3. Accuracy of genome annotation
    5.1.4. A case study on genome annotation
  5.2. Genome context analysis and functional prediction
    5.2.1. Phyletic patterns (profiles)
    5.2.2. Gene (domain) fusions: "Rosetta Stone"
    5.2.3. Gene clusters and genomic neighborhoods
  5.3. Conclusions and outlook
  5.4. Further reading

Chapter 6. Comparative Genomics and New Evolutionary Biology
  6.1. The three domains of life
  6.2. Prevalence of lineage-specific gene loss and horizontal gene transfer
  6.3. The Tree of Life: before and after the genomes
    6.3.1. Phylogenetic trees in the pre-genomic era
    6.3.2. Comparative genomics threatens the species tree concept
    6.3.3. Genome trees - can comparative genomics help build a consensus?
    6.3.4. The genomic clock
  6.4. The major transitions in evolution: a comparative-genomic perspective
    6.4.1. Ancestral life form and evolutionary reconstructions
    6.4.2. Beyond LUCA, back to the RNA world
    6.4.3. A brief history of early life
    6.4.4. The prokaryote-eukaryote transition and origin of novelty in eukaryotes
  6.5. Conclusions and outlook: evolution tinkers with fluid genomes
  6.6. Further reading

Chapter 7. Evolution of Central Metabolic Pathways: The Playground of Non-orthologous Gene Displacement
  7.1. Carbohydrate metabolism
    7.1.1. Glycolysis
    7.1.2. Gluconeogenesis
    7.1.3. Entner-Doudoroff pathway and pentose phosphate shunt
    7.1.4. TCA cycle
  7.2. Pyrimidine biosynthesis
  7.3. Purine biosynthesis
  7.4. Amino acid biosynthesis
    7.4.1. Biosynthesis of aromatic amino acids
    7.4.2. Arginine biosynthesis
    7.4.3. Histidine biosynthesis
    7.4.4. Biosynthesis of branched-chain amino acids
    7.4.5. Proline biosynthesis
  7.5. Coenzyme biosynthesis
    7.5.1. Thiamin
    7.5.2. Riboflavin
    7.5.3. NAD
    7.5.4. Biotin
    7.5.5. Heme
    7.5.6. Pyridoxine
  7.6. Microbial enzymes as drug targets
    7.6.1. Potential targets for broad-spectrum drugs
    7.6.2. Potential targets for pathogen-specific drugs
  7.7. Conclusions and outlook
  7.8. Further reading

Chapter 8. Genomes and the Protein Universe
  8.1. The protein universe is highly structured and there are few common folds
  8.2. Counting the beans: structural genomics, distributions of protein folds and superfamilies in genomes and some models of genome evolution
  8.3. Evolutionary dynamics of multidomain proteins
  8.4. Conclusions and outlook
  8.5. Further reading

Chapter 9. Epilogue: Peering through the crystal ball
  9.1. Functional genomics: a programme of prediction-driven research?
  9.2. Digging up genomic junkyards
  9.3. "Dreams of a final theory"

Appendices
  1. Glossary
  2. Useful WWW sites
     Databases
     Major genome sequencing centers
  3. Problems

References
Index

PREFACE

The use of genome sequences to solve biological problems has been afforded its own label; for better or worse, it's called "functional genomics."
David J. Galas. Making Sense of the Sequence. Science, 2001, vol. 291, p. 1257

When the completion of the draft of the human genome sequence was announced on June 26, 2000, all the parties involved agreed that the major task of identifying the functions of all human genes was still many years ahead. In fact, even the much simpler task of mapping all the genes in the final version of the human genome sequence, which should become available within the next few years, remains a major problem. Identification of all protein-coding genes in the genome sequence and determination of the cellular functions of the proteins encoded in these genes can be accomplished only by combining powerful computational tools with a variety of experimental approaches from the arsenals of biochemistry, molecular biology, genetics and cell biology. Linking sequence to function and both to the evolutionary history of life is the fundamental task of new biology.

This book is devoted to the principles, methods and some achievements of computational comparative genomics, which has taken shape as a separate discipline only in the last 5-7 years. Its beginnings were modest, with only the genome sequences of viruses and organelles determined in the 1980s. These sequences were important for their respective disciplines and as a test ground for computational methods of genome analysis, but they were not particularly helpful for understanding how an autonomous cell works. By 1992, the first chromosomes of baker's yeast and large chunks of bacterial genomes started to emerge, and researchers began pondering the question: What's in the genome? The breakthrough came in 1995 with the complete sequencing of the first genome of a cellular life form, the bacterium Haemophilus influenzae. The second bacterial genome, Mycoplasma genitalium, followed within months. The next year, the first complete genomes of an archaeon (Methanococcus jannaschii) and a eukaryote (yeast Saccharomyces cerevisiae) became available. Many more microbial genomes followed, and in 1999, the first genome of a multicellular eukaryote, the nematode Caenorhabditis elegans, was sequenced. The year 2000 brought us the complete genomes of the fruit fly Drosophila melanogaster and the thale cress Arabidopsis thaliana, and two independent drafts of the human genome followed suit in 2001. Thus, we entered the 21st century already having at hand this 3.2 billion-letter text that has been referred to as the Book of Life, as well as a number of accompanying books on other life forms. The challenge is now to read and interpret them.

To extract biological information from enormous strings of As, Cs, Ts, and Gs, functional genomics depends on computational analysis of the sequence data. It is unrealistic to expect that every single gene or even a majority of the genes found in the sequenced genomes would ever be studied experimentally.
However, using relatively cheap and fast computational approaches, it is usually possible to predict the protein-coding regions in the DNA sequence with reasonable (albeit varying) confidence and to get at least some insight into the possible functions of the encoded proteins. Such an analysis proves valuable for many branches of biology, in large part because it assists in classification and prioritization of the targets for future experimental research.

Computations on genomes are inexpensive and fast compared to large-scale experimentation, but it would be a mistake to equate this with 'easy'. The history of annotation and comparative analysis of the first sequenced genomes convincingly (and sometimes painfully) shows that the quality and utility of the final product critically depend on the methods employed and on the depth of interpretation of the computational results. Unfortunately, errors produced in the course of computer analysis are propagated just as easily as real discoveries, which makes it particularly important to develop reliable protocols and to distill the accumulating experience of genome analysis into easily accessible forms.

While functional annotation of genomes may be the most obvious and, in a sense, the most important purpose of computational genomics, it is not just a supporting service for experimental functional genomics, but a discipline in itself, with its own fundamental goals. The main such goal is understanding genome evolution. Ultimately, understanding here means being able to reconstruct the most likely sequence of evolutionary events that produced these genomes. Attaining this goal will require many more genomes, development of new algorithms and years of careful analysis. Nevertheless, even in its infancy, comparative genomics has brought genuine revelations about evolution.
We believe that the principal news that could not be easily foreseen in the pre-genomic era is the extreme diversity of the gene composition in different evolutionary lineages. This strongly suggests that, at least among prokaryotes, horizontal gene transfer and lineage-specific gene loss were major, formative evolutionary forces, rather than rare and relatively inconsequential events as assumed previously. Accordingly, the straightforward image of evolution as the growth of the tree of life is replaced by one of a 'grove', in which vertical, tree-type growth does occur, but multiple horizontal connections are equally prominent - an incomparably more complex, but also more interesting picture of life than ever suspected before.