SEQUENCE - EVOLUTION - FUNCTION
Computational Approaches in Comparative Genomics
by
Eugene V. Koonin
and
Michael Y. Galperin
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Springer Science+Business Media, B.V.
Electronic Services <http://www.wkap.nl>
Library of Congress Cataloging-in-Publication Data
Koonin, Eugene V.
Sequence - evolution - function : Computational approaches in comparative genomics /
by Eugene V. Koonin and Michael Y. Galperin
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4419-5321-6 ISBN 978-1-4757-3783-7 (eBook)
DOI 10.1007/978-1-4757-3783-7
1. Genomes. 2. Nucleotide sequence-Data processing. 3. Evolutionary genetics. I.
Galperin, Michael Y. II. Title.
QH447 .K665 2002
572.8'6--dc21 2002034045
Copyright © 2003 by Springer Science+Business Media Dordrecht. Second Printing 2004.
Originally published by Kluwer Academic Publishers in 2003.
Softcover reprint of the hardcover 1st edition 2003
All rights reserved. No part of this work may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without the written permission
from the Publisher, with the exception of any material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by
the purchaser of the work.
Permission for books published in Europe: permissions@wkap.nl
Permissions for books published in the United States of America: permissions@wkap.com
Printed on acid-free paper.
The Publisher offers discounts on this book for course use and bulk purchases.
For further information, send email to <melissa.ramondetta@wkap.com>.
CONTENTS
Preface
Introduction. Personal Interludes
Chapter 1. Genomes from Phage to Human
1.1. The humble beginnings
1.2. ...and the astonishing progress of genome sequencing
1.3. Basic questions of comparative genomics
1.4. Further reading
Chapter 2. The Evolutionary Concept in Genetics and Genomics
2.1. Similarity, homology, divergence and convergence
2.1.1. The critical definitions
2.1.2. Conservation of protein sequence and structure in evolution
2.1.3. Homologs: orthologs and paralogs
2.2. Patterns and mechanisms in genome evolution
2.2.1. Evolution of gene order
2.2.2. Lineage-specific gene loss
2.2.3. Lineage-specific expansion of gene families
2.2.4. Horizontal (lateral) gene transfer
2.2.5. Non-orthologous gene displacement and the minimal gene set concept
2.2.6. Phyletic patterns (profiles)
2.3. Conclusions and outlook
2.4. Further reading
Chapter 3. Information Sources for Genomics
3.1. General purpose sequence databases
3.1.1. Nucleotide sequence databases
3.1.2. Protein sequence databases
3.1.3. Reliability of the database entries
3.2. Protein sequence motifs and domain databases
3.2.1. Motif databases
3.2.2. Domain databases
3.2.3. Integrated motif and domain databases
3.3. Protein structure databases
3.4. Specialized genomics databases
3.5. Organism-specific databases
3.5.1. Prokaryotes
3.5.2. Unicellular eukaryotes
3.5.3. Multicellular eukaryotes
3.6. Taxonomy, protein interactions, and other databases
3.6.1. Taxonomy databases
3.6.2. Signal transduction and protein interaction databases
3.6.3. Biochemical databases
3.7. PubMed
3.7.1. Specifying the terms in PubMed search
3.7.2. Interpretation of the search pattern
3.7.3. NCBI Bookshelf
3.8. Conclusions and outlook
3.9. Further reading
Chapter 4. Principles and Methods of Sequence Analysis
4.1. Identification of genes in a genomic DNA sequence
4.1.1. Prediction of protein-coding genes
4.1.2. Algorithms and software tools for gene identification
4.2. Principles of sequence similarity searches
4.2.1. Substitution scores and substitution matrices
4.2.2. Statistics of protein sequence comparison
4.2.3. Protein sequence complexity. Compositional bias
4.3. Algorithms for sequence alignment and similarity search
4.3.1. The basic alignment concepts and principal algorithms
4.3.2. Sequence database search algorithms
4.3.3. Motifs, domains and profiles
4.4. Practical issues: how to get the most out of BLAST
4.4.1. Setting up the BLAST search
4.4.2. Choosing the BLAST parameters
4.4.3. Running BLAST and formatting the output
4.4.4. Analysis and interpretation of BLAST results
4.5. The road to discovery
4.6. Protein annotation in the absence of detectable homologs
4.6.1. Prediction of subcellular localization of the protein
4.6.2. Prediction of structural features of the protein
4.6.3. Threading
4.7. Conclusions and outlook
4.8. Further reading
Chapter 5. Genome Annotation and Analysis
5.1. Methods, approaches and results in genome annotation
5.1.1. Genome annotation: data flow and performance
5.1.2. Automation of genome annotation
5.1.3. Accuracy of genome annotation
5.1.4. A case study on genome annotation
5.2. Genome context analysis and functional prediction
5.2.1. Phyletic patterns (profiles)
5.2.2. Gene (domain) fusions: "Rosetta Stone"
5.2.3. Gene clusters and genomic neighborhoods
5.3. Conclusions and outlook
5.4. Further reading
Chapter 6. Comparative Genomics and New Evolutionary Biology
6.1. The three domains of life
6.2. Prevalence of lineage-specific gene loss and horizontal gene transfer
6.3. The Tree of Life: before and after the genomes
6.3.1. Phylogenetic trees in the pre-genomic era
6.3.2. Comparative genomics threatens the species tree concept
6.3.3. Genome trees - can comparative genomics help build a consensus?
6.3.4. The genomic clock
6.4. The major transitions in evolution: a comparative-genomic perspective
6.4.1. Ancestral life form and evolutionary reconstructions
6.4.2. Beyond LUCA, back to the RNA world
6.4.3. A brief history of early life
6.4.4. The prokaryote-eukaryote transition and origin of novelty in eukaryotes
6.5. Conclusions and outlook: evolution tinkers with fluid genomes
6.6. Further reading
Chapter 7. Evolution of Central Metabolic Pathways: The Playground of Non-orthologous Gene Displacement
7.1. Carbohydrate metabolism
7.1.1. Glycolysis
7.1.2. Gluconeogenesis
7.1.3. Entner-Doudoroff pathway and pentose phosphate shunt
7.1.4. TCA cycle
7.2. Pyrimidine biosynthesis
7.3. Purine biosynthesis
7.4. Amino acid biosynthesis
7.4.1. Biosynthesis of aromatic amino acids
7.4.2. Arginine biosynthesis
7.4.3. Histidine biosynthesis
7.4.4. Biosynthesis of branched-chain amino acids
7.4.5. Proline biosynthesis
7.5. Coenzyme biosynthesis
7.5.1. Thiamin
7.5.2. Riboflavin
7.5.3. NAD
7.5.4. Biotin
7.5.5. Heme
7.5.6. Pyridoxine
7.6. Microbial enzymes as drug targets
7.6.1. Potential targets for broad-spectrum drugs
7.6.2. Potential targets for pathogen-specific drugs
7.7. Conclusions and outlook
7.8. Further reading
Chapter 8. Genomes and the Protein Universe
8.1. The protein universe is highly structured and there are few common folds
8.2. Counting the beans: structural genomics, distributions of protein folds and superfamilies in genomes and some models of genome evolution
8.3. Evolutionary dynamics of multidomain proteins
8.4. Conclusions and outlook
8.5. Further reading
Chapter 9. Epilogue: Peering through the crystal ball
9.1. Functional genomics: a programme of prediction-driven research?
9.2. Digging up genomic junkyards
9.3. "Dreams of a final theory"
Appendices
1. Glossary
2. Useful WWW sites
Databases
Major genome sequencing centers
3. Problems
References
Index
PREFACE
The use of genome sequences to solve biological
problems has been afforded its own label; for
better or worse, it's called "functional genomics."
David J. Galas. Making Sense of the Sequence.
Science, 2001, vol. 291, p. 1257
When the completion of the draft of the human genome sequence was
announced on June 26, 2000, all the parties involved agreed that the major
task of identifying the functions of all human genes was still many years
ahead. In fact, even the much simpler task of mapping all the genes in the
final version of the human genome sequence that should become available
within the next few years remains a major problem. Identification of all
protein-coding genes in the genome sequence and determination of the
cellular functions of the proteins encoded in these genes can be accomplished
only by combining powerful computational tools with a variety of
experimental approaches from the arsenals of biochemistry, molecular
biology, genetics and cell biology. Linking sequence to function and both to
the evolutionary history of life is the fundamental task of new biology.
This book is devoted to the principles, methods and some achievements
of computational comparative genomics, which has taken shape as a separate
discipline only in the last 5-7 years. Its beginnings have been modest, with
only the genome sequences of viruses and organelles determined in the
1980s. These sequences were important for their respective disciplines and
as a test ground for computational methods of genome analysis, but they
were not particularly helpful for understanding how an autonomous cell
works. By 1992, the first chromosomes of baker's yeast and large chunks of
bacterial genomes started to emerge, and researchers began pondering the
question: What's in the genome? The breakthrough came in 1995 with the
complete sequencing of the first genome of a cellular life form, the bacterium
Haemophilus influenzae. The second bacterial genome, Mycoplasma
genitalium, followed within months. The next year, the first complete
genomes of an archaeon (Methanococcus jannaschii) and a eukaryote (yeast
Saccharomyces cerevisiae) became available. Many more microbial genomes
followed, and in 1999, the first genome of a multicellular eukaryote, the
nematode Caenorhabditis elegans, was sequenced. The year 2000
brought us the complete genomes of the fruit fly Drosophila melanogaster
and the thale-cress Arabidopsis thaliana, and two independent drafts of the
human genome followed suit in 2001. Thus, we entered the 21st century
already having at hand this 3.2 billion-letter text that has been referred to as
the Book of Life, as well as a number of accompanying books on other life
forms. The challenge is now to read and interpret them.
To extract biological information from enormous strings of As, Cs, Ts,
and Gs, functional genomics depends on computational analysis of the
sequence data. It is unrealistic to expect that every single gene or even a
majority of the genes found in the sequenced genomes would ever be studied
experimentally. However, using the relatively cheap and fast computational
approaches, it is usually possible to predict the protein-coding
regions in the DNA sequence with reasonable (albeit varying) confidence
and to get at least some insight into the possible functions of the encoded
proteins. Such an analysis proves valuable for many branches of biology, in
large part, because it assists in classification and prioritization of the targets
for future experimental research.
Computations on genomes are inexpensive and fast compared to large-scale
experimentation, but it would be a mistake to equate this with 'easy'.
The history of annotation and comparative analysis of the first sequenced
genomes convincingly (and sometimes painfully) shows that the quality and
utility of the final product critically depend on the employed methods and the
depth of interpretation of the results obtained by computer methods.
Unfortunately, errors produced in the course of computer analysis are
propagated just as easily as real discoveries, which makes development of
reliable protocols and crystallization of the accumulating experience of
genome analysis in easily accessible forms particularly important.
While functional annotation of genomes may be the most obvious, and in
a sense, the most important purpose of computational genomics, it is not just
a supporting service for experimental functional genomics, but a discipline in
itself, with its own fundamental goals. The main such goal is understanding
genome evolution. Ultimately, understanding here means being able to
reconstruct the most likely sequence of evolutionary events that produced
these genomes. Attaining this goal will require many more genomes,
development of new algorithms and years of careful analysis. Nevertheless,
even in its infancy, comparative genomics has brought genuine revelations
about evolution. We believe that the principal news that could not be easily
foreseen in the pre-genomic era is the extreme diversity of the gene
composition in different evolutionary lineages. This strongly suggests that, at
least among prokaryotes, horizontal gene transfer and lineage-specific gene
loss were major, formative evolutionary forces, rather than rare and relatively
inconsequential events as assumed previously. Accordingly, the straightforward
image of evolution as the growth of the tree of life is replaced by one
of a 'grove', in which vertical, tree-type growth does occur, but multiple
horizontal connections are equally prominent - an incomparably more
complex, but also more interesting picture of life than ever suspected before.