Preface

Since its initiation, Methods in Enzymology has tried to keep abreast of developments in all areas important to biochemistry. Increasingly, this has meant moving into the realm of molecular biology. The main objective has remained the same, nonetheless: putting ready-to-use methods into the hands of the at-the-bench scientist. This volume is no exception. Still, this may be the first Methods in Enzymology volume in which the expression “pH” is never mentioned and the word buffer, if used at all, has nothing to do with aqueous solutions!

The past decade has witnessed nothing less than a flood of nucleic acid and protein sequence data, the management of which would simply be impossible without computers. The chapters in this volume address a variety of areas in which computers are used to manage and manipulate sequences. The manipulations include searching, aligning, and determining the significance of similarities, as well as the construction of phylogenetic trees that show the evolutionary history of related sequences. This relatively new field of sequence-computing has stimulated interest from a variety of more established disciplines, as is apparent from the diverse backgrounds of the contributing authors. Thus, investigators in mathematics and statistics, taxonomy and cladistics, and protein and nucleic acid chemistry are all represented, each bringing different perspectives and insights.

All the authors were asked, however, to make an effort to write for the biochemist or molecular biologist who wants to analyze, search, or manipulate nucleic acid or protein sequence data. Some have met this charge more aptly than others, and in some instances readers must be prepared to go the extra mile to find genuine satisfaction. Often this will mean mastering new terms and jargon, the troublesome persiflage of all specialized fields.
Bootstraps and jackknifes, informative sites and k-tuples, NP-complete and NP-completeness are all words with simple enough meanings, and it would be a mistake for readers to turn away from the methods described merely because of an alien nomenclature or notation. Thus, bootstraps and jackknifes are computer-intensive statistical procedures for estimating the quality of a data set. Informative site is a term used in strict parsimony analysis for those arrays of characters in a set of sequences that allow a single, simple interpretation. k-Tuples are merely sets of units: duplets, triplets, quartets, and so on. NP-complete is an expression used by computer scientists to indicate a set of problems it may not be possible to solve exactly by dynamic programming. The phylogeny problem may be NP-complete. I must confess that not too long ago I had not the faintest notion of what most of these expressions implied.

It was agreed in advance that the computer programs described in this volume would be made available by the contributing authors, usually only for the cost of materials (tapes or diskettes) and shipping. Readers must understand, nonetheless, that there are often problems in transporting programs from one kind of computer to another and that differences in operating systems often present stumbling blocks. The authors are very interested in the community using their programs, however, and they are prepared to help with any problems that may ensue. Naturally they would like to hear, also, when the programs work well.

The sequence-computer field is a very fast-moving one, and every effort was made to produce this volume as rapidly as possible so the methods would be timely. In this regard, I must thank the authors for adhering to a set of rigid deadlines. Special thanks are due, also, to Karen Anderson, who managed many aspects involving the manuscripts and without whom this volume would likely never have materialized.

RUSSELL F.
DOOLITTLE

Contributors to Volume 183

Article numbers are in parentheses following the names of contributors. Affiliations listed are current.

YVON ABEL (26), Centre de Recherches Mathématiques, Université de Montréal, Montreal, Quebec H3C 3J7, Canada
WAYNE F. ANDERSON (27), Department of Biochemistry, Vanderbilt University, Nashville, Tennessee 37232
PATRICK ARGOS (21), European Molecular Biology Laboratory, D-6900 Heidelberg, Federal Republic of Germany
DAVID J. BACON (27), Department of Biochemistry, University of Alberta, Edmonton, Alberta T6G 2H7, Canada
WINONA C. BARKER (3, 20), Protein Identification Resource, National Biomedical Research Foundation, Georgetown University Medical Center, Washington, D.C. 20007
GEOFFREY J. BARTON (25), Laboratory of Molecular Biophysics, University of Oxford, Oxford OX1 3QU, England
DAVID BENTON (1), IntelliGenetics, Incorporated, Mountain View, California 94040
B. EDWIN BLAISDELL (24), Department of Mathematics, Stanford University, Stanford, California 94305
TOM L. BLUNDELL (42), Laboratory of Molecular Biology, Department of Crystallography, Birkbeck College, University of London, London WC1E 7HX, England
LYDIE BOUGUELERET (15), Computer Science Unit, Institut Pasteur, 75724 Paris Cedex 15, France
VOLKER BRENDEL (24), Department of Mathematics, Stanford University, Stanford, California 94305
JOSEPH P. BROWN (7), Genetic Systems Corporation, Seattle, Washington 98121
CHRISTIAN BURKS (1), Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, New Mexico 87545
GRAHAM CAMERON (2), The European Molecular Biology Laboratory Data Library, D-6900 Heidelberg, Federal Republic of Germany
ROBERT CEDERGREN (19, 26), Département de Biochimie, Université de Montréal, Montreal, Quebec H3C 3J7, Canada
JEAN-MICHEL CLAVERIE (15), Computer Science Unit, Institut Pasteur, 75724 Paris Cedex 15, France
JOHN F. COLLINS (30), Biocomputing Research Unit, Department of Molecular Biology, University of Edinburgh, Edinburgh EH9 3JR, Scotland
ANDREW F. W. COULSON (30), Biocomputing Research Unit, Department of Molecular Biology, University of Edinburgh, Edinburgh EH9 3JR, Scotland
JOHN CZELUSNIAK (37), Department of Anatomy and Cell Biology, Wayne State University School of Medicine, Detroit, Michigan 48201
RUSSELL F. DOOLITTLE (6, 23, 41), Center for Molecular Genetics, University of California, San Diego, La Jolla, California 92093
MANFRED EIGEN (32), Max-Planck-Institut für Biophysikalische Chemie, D-3400 Göttingen, Federal Republic of Germany
DAVID EISENBERG (9), Molecular Biology Institute, University of California, Los Angeles, Los Angeles, California 90024
DA-FEI FENG (23, 41), Center for Molecular Genetics, University of California, San Diego, La Jolla, California 92093
WALTER M. FITCH (38), Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, California 92717
DANIEL GAUTHERET (19), Département de Biochimie, Université de Montréal, Montreal, Quebec H3C 3J7, Canada
DAVID G. GEORGE (3, 20), Protein Identification Resource, National Biomedical Research Foundation, Georgetown University Medical Center, Washington, D.C. 20007
TAKASHI GOJOBORI (33), Department of Evolutionary Genetics, National Institute of Genetics, Mishima 411, Japan
MORRIS GOODMAN (37), Department of Anatomy and Cell Biology, Wayne State University School of Medicine, Detroit, Michigan 48201
MANOLO GOUY (40), Laboratoire de Biométrie, Université Lyon I, 69622 Villeurbanne Cedex, France
MICHAEL GRIBSKOV (9), BRI-Basic Research Program, National Cancer Institute-Frederick Cancer Research Facility, Frederick, Maryland 21701
NOMI L. HARRIS (16), Laboratory of Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
MASAMI HASEGAWA (34), The Institute of Statistical Mathematics, Minato-ku, Tokyo 106, Japan
JOTUN HEIN (39), Centre de Recherche de Mathématiques Appliquées, Université de Montréal, Montreal, Quebec H3C 3J7, Canada
STEVEN HENIKOFF (7), Fred Hutchinson Cancer Research Center, Seattle, Washington 98104
LOIS T. HUNT (3, 20), Protein Identification Resource, National Biomedical Research Foundation, Georgetown University Medical Center, Washington, D.C. 20007
JOHN A. JAEGER (17), Department of Chemistry, University of Rochester, Rochester, New York 14627
MARK S. JOHNSON (42), Laboratory of Molecular Biology, Department of Crystallography, Birkbeck College, University of London, London WC1E 7HX, England
ROBERT JONES (14), Thinking Machines Corporation, Cambridge, Massachusetts 02142
PATRICIA KAHN (2), The European Molecular Biology Laboratory Data Library, D-6900 Heidelberg, Federal Republic of Germany
SAMUEL KARLIN (24), Department of Mathematics, Stanford University, Stanford, California 94305
SUZANNE M. KEHOE (37), Department of Anatomy and Cell Biology, Wayne State University School of Medicine, Detroit, Michigan 48201
BORIVOJ KEIL (4), Institut Pasteur, 75015 Paris, France
MOTOO KIMURA (33), Department of Population Genetics, National Institute of Genetics, Mishima 411, Japan
HIROHISA KISHINO (34), The Institute of Statistical Mathematics, Minato-ku, Tokyo 106, Japan
CECILIA LANAVE (35), Centro Studi sui Mitocondri e Metabolismo Energetico, CNR, Bari, 70126 Bari, Italy
GAD M. LANDAU (31), Division of Computer Sciences, Polytechnic University, Brooklyn, New York 11201
CHARLES B. LAWRENCE (8), Department of Cell Biology, Baylor College of Medicine, Houston, Texas 77030
WEN-HSIUNG LI (40), Center for Demographic and Population Genetics, Health Science Center at Houston, University of Texas, Houston, Texas 77225
ROLAND LÜTHY (9), Molecular Biology Institute, University of California, Los Angeles, Los Angeles, California 90024
FRANÇOIS MAJOR (19), Département d’Informatique et Recherche Opérationnelle, Université de Montréal, Montreal, Quebec H3C 3J7, Canada
HUGO M. MARTINEZ (18), Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, California 94143
NANCY D. MONCRIEF (37), Department of Biology, University of Virginia, Charlottesville, Virginia 22901
ETSUKO N. MORIYAMA (33), Department of Evolutionary Genetics, National Institute of Genetics, Mishima 411, Japan
MITSUO MURATA (22), Department of Biochemistry, University of Georgia, Athens, Georgia 30602
RUTH NUSSINOV (31), Sackler Institute of Molecular Medicine, Tel Aviv University, Tel Aviv 69978, Israel, and Laboratory of Mathematical Biology, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892
WILLIAM R. PEARSON (5), Department of Biochemistry, University of Virginia, Charlottesville, Virginia 22908
GRAZIANO PESOLE (35), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70126 Bari, Italy
GIULIANO PREPARATA (35), Dipartimento di Fisica, Università di Milano, 20100 Milano, Italy
CECILIA SACCONE (35), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70126 Bari, Italy
NARUYA SAITOU (36), Department of Anthropology, Faculty of Science, The University of Tokyo, Hongo, Bunkyo-ku, Tokyo 113, Japan
ANDREJ SALI (42), Laboratory of Molecular Biology, Department of Crystallography, Birkbeck College, University of London, London WC1E 7HX, England
DAVID SANKOFF (26), Centre de Recherches Mathématiques, Université de Montréal, Montreal, Quebec H3C 3J7, Canada
ISABELLE SAUVAGET (15), Computer Science Unit, Institut Pasteur, 75724 Paris Cedex 15, France
PERIANNAN SENAPATHY (16), Biotechnology Center, University of Wisconsin, Madison, Wisconsin 53706
MARVIN B. SHAPIRO (16), Laboratory of Statistical and Mathematical Methodology, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Maryland 20892
JOHN C. W. SHEPHERD (11), Cell and Molecular Biology Laboratory, University of Sussex, Brighton BN1 9QG, England
RODGER STADEN (10, 12), Laboratory of Molecular Biology, Medical Research Council, Cambridge CB2 2QH, England
GARY D. STORMO (13), Department of Molecular, Cellular, and Developmental Biology, University of Colorado, Boulder, Colorado 80309
WILLIAM R. TAYLOR (29), Laboratory of Mathematical Biology, National Institute for Medical Research, Medical Research Council, London NW7 1AA, England
DOUGLAS H. TURNER (17), Department of Chemistry, University of Rochester, Rochester, New York 14627
MAUNO VIHINEN (28), Department of Biochemistry, University of Turku, SF-20500 Turku, Finland
MARTIN VINGRON (21), European Molecular Biology Laboratory, D-6900 Heidelberg, Federal Republic of Germany
UZI VISHKIN (31), Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742, and Department of Computer Science, School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel
JAMES C. WALLACE (7), Fred Hutchinson Cancer Research Center, Seattle, Washington 98104
MICHAEL S. WATERMAN (14), Departments of Mathematics and Molecular Biology, University of Southern California, Los Angeles, California 90089
PATRICK L. WILLIAMS (38), College of Veterinary Medicine, North Carolina State University, Raleigh, North Carolina 27606
RUTHILD WINKLER-OSWATITSCH (32), Max-Planck-Institut für Biophysikalische Chemie, D-3400 Göttingen, Federal Republic of Germany
MICHAEL ZUKER (17), Division of Biological Sciences, National Research Council of Canada, Ottawa, Ontario K1A 0R6, Canada

[1] GenBank: Current Status and Future Directions

By CHRISTIAN BURKS, MICHAEL J. CINKOSKY, PAUL GILNA, JAMIE E.-D. HAYDEN, YUKI ABE, EDWIN J. ATENCIO, STEVE BARNHOUSE, DAVID BENTON, CONNIE A. BUENAFE, KAREN E. CUMELLA, DAN B. DAVISON, DAVID B. EMMERT, MARY JO FAULKNER, JAMES W. FICKETT, WILLIAM M. FISCHER, MARK GOOD, DEBORAH A. HORNE, F. KAY HOUGHTON, PRAFUL M. KELKAR, TOM A. KELLEY, MICHAEL KELLY, MELINDA A. KING, BERNARD J. LANGAN, JEFFREY T. LAUER, NATALIE LOPEZ, CONRAD LYNCH, JANET LYNCH, JANET B. MARCHI, THOMAS G. MARR, FRANCES A. MARTINEZ, MIA J. MCLEOD, PAT A. MEDVICK, SANTOSH K. MISHRA, JOHN MOORE, CHRISTINE A. MUNK, SOCORRO M. MONDRAGON, KEVIN K. NASSERI, DEBRA NELSON, WILL NELSON, TAN NGUYEN, GLORIA REISS, JOHN RICE, JULIE RYALS, MARGARITA D. SALAZAR, STEPHEN R. STELTS, BRIAN L. TRUJILLO, LAURIE J. TOMLINSON, MARK G. WEINER, FRANK J. WELCH, SUSAN E. WIIG, KATHERINE YUDIN, AND LARRY B. ZINS

The GenBank¹ database provides a collection of nucleotide sequences as well as relevant bibliographic and biological annotation. We present an updated view of the size and scope of the database, and we also describe recent developments in the strategies, protocols, and software for collecting, maintaining, and distributing the data.
¹ “GenBank” is a registered trademark of the U.S. Department of Health and Human Services.

METHODS IN ENZYMOLOGY, VOL. 183. Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

Introduction

GenBank, the genetic sequence data bank, is chartered to provide a computer database of all published (and, increasingly, unpublished) DNA and RNA sequences and related bibliographic and biological information. The project is funded through an NIGMS contract with IntelliGenetics, Inc. (IG) which, in turn, contracts with the DOE acting on behalf of Los Alamos National Laboratory (LANL). The project is funded with cosponsorship from other institutes of the National Institutes of Health, National Library of Medicine (NLM), DRR, USDA, NSF, DOE, and DOD. Data collection and distribution are carried out in collaboration with the EMBL Data Library (EMBL) and the DNA Data Bank of Japan (DDBJ). Inquiries regarding data distribution and release should be addressed to IG;² inquiries regarding data submissions and collection should be addressed to LANL.³

Background. Computer databases offer several advantages for collecting, maintaining, analyzing, and distributing data sets: compact storage, low error rates for data archiving or transmission, more precise specification and organization of data, less effort in varying complex queries, and less effort in reorganizing data. We have previously discussed both the scientific community’s move 10 years ago to apply these advantages to nucleotide sequence data and the resulting establishment of the GenBank database.
The database was chartered as described above and intended to provide⁴,⁵ (1) an archive, for long-term storage and preservation of sequence data; (2) retrieval of nucleotide sequences, with delineation by sequence-specific, bibliographic, physical, or functional criteria; and (3) a research platform, for examination and analysis of sequences grouped by sequence-specific or annotation-specific criteria.

Over the past several years there have been a number of developments in the database project. GenBank has passed into a second contract period with considerably expanded resources, allowing us to address many of the issues that arose as challenges in the early phase of the project. The nucleotide sequence database collaboration has recently been extended from GenBank and EMBL⁶,⁷ to include DDBJ,⁸ which began operations in 1987. The unanticipated exponential growth in the amount of nucleotide sequence data being published, as well as the possibility of another order of magnitude increase in the number of sequences being determined, as emphasized in recent discussions of the possibility of a human genome initiative⁹,¹⁰ and other large-scale sequencing projects, has led us (and others) to focus on the need for alternatives for processing the much larger streams of data coming into the database.¹¹,¹² Similarly, the recognition of a need for improved cross-referencing and, eventually, automatic cross-linking of GenBank and related databases (which figures heavily in recent discussions of a matrix of biological knowledge)¹³ has led us to consider alternative data representations that would provide the basis for such cross-linking. Finally, computer science tools (e.g., relational databases)¹⁴ previously in use in other contexts have evolved to the point that it is appropriate to begin exploring their application to GenBank.

² Postal service: GenBank; IntelliGenetics, Inc.; 700 El Camino Real East; Mountain View, CA 94040. Electronic mail: “[email protected]”. Telephone: (415) 962-7364.
³ Postal service: GenBank, T-10, MS K710, Los Alamos National Laboratory; Los Alamos, NM 87545. Electronic mail: “genbank%[email protected]” (general); or “gbsub%[email protected]” (data submissions and questions regarding submissions). Telephone: (505) 665-2177.
⁴ C. Burks, J. W. Fickett, W. B. Goad, M. Kanehisa, F. I. Lewitter, W. P. Rindone, C. D. Swindell, C.-S. Tung, and H. S. Bilofsky, Comput. Appl. Biosci. 1, 225 (1985).
⁵ J. W. Fickett and C. Burks, in “Mathematical Models for DNA Sequences” (M. S. Waterman, ed.), p. 1. CRC Press, Boca Raton, Florida, 1988.
⁶ G. Cameron, Nucleic Acids Res. 16, 1865 (1988).
⁷ P. Kahn and G. Cameron, this volume [2].
⁸ S. Miyazawa, in “Computers and DNA” (G. I. Bell and T. Marr, eds.), p. 47. Addison-Wesley, New York, 1989.

Focus of Chapter. We begin by providing an update on the kinds and amount of data present in the database and the objectives (completeness, timeliness, and depth of annotation) of GenBank with respect to the user community. We describe our two primary thrusts at this time: shifting our data collection effort over to a system that relies on direct, computer-readable submissions from authors, and restructuring the database in the context of a relational database management system (RDBMS). An overview of the current media and formats for data distribution is given. We close with a description of the directions in which the database effort is moving as well as a discussion of the challenges facing us now.

Current Span and Size of Database

What Data Are in the Database? The primary data in the database are nucleotide (DNA and RNA) sequences (i.e., actggcagacagggtcatt . . . ). Each contiguous sequence determined by a single research group and submitted (or published) as a single entity is maintained as such.
Occasionally, when more than one sequence for the same gene in the same organism is available, and the sequences overlap, a merged view of the sequences is presented in the distributed versions of the database.

⁹ B. M. Alberts, D. Botstein, S. Brenner, C. Cantor, R. F. Doolittle, L. Hood, V. A. McKusick, D. Nathans, M. V. Olson, S. Orkin, L. E. Rosenberg, F. H. Ruddle, S. Tilghman, J. Tooze, and J. D. Watson, “Mapping and Sequencing the Human Genome.” National Academy Press, Washington, D.C., 1988.
¹⁰ U.S. Congress, Office of Technology Assessment, “Mapping Our Genes - The Genome Projects: How Big, How Fast?” U.S. Govt. Printing Office, Washington, D.C., 1988.
¹¹ L. Philipson, Nature (London) 332, 676 (1988).
¹² G. I. Bell and T. Marr, eds., “Computers and DNA.” Addison-Wesley, New York, 1989.
¹³ H. J. Morowitz and T. F. Smith, “Report of the Matrix of Biological Knowledge Workshop.” Santa Fe Institute, Santa Fe, New Mexico, 1987.
¹⁴ C. J. Date, “An Introduction to Database Systems,” 4th Ed. Addison-Wesley, Reading, Massachusetts, 1986.

The database provides the bibliographic context of the sequence. This is most often represented by a specific journal citation, although GenBank accepts and cites submissions of original data published in books, theses, and other sources. Increasingly, unpublished nucleotide sequences are appearing in the database,¹⁵ although most often directly tied to a publication that describes the sequence without presenting it explicitly. Unpublished submissions are cited as such.

The physical context of a sequence, i.e., the organism, chromosome, map position, etc., that describes its origin in vivo is included when known and made available to GenBank. Official nomenclature lists and map assignments are beginning to be made available to us in a form that allows GenBank to dynamically maintain proper, uniform values for these data items.
The functional context of a sequence (or parts of a sequence) is also annotated in the database. A good example is protein-coding regions, which are annotated to support automatic extraction¹⁶,¹⁷ of the coding regions from the sequences in which they are embedded (we use such a tool for checking the integrity of coding regions in the database). Most of this information is presented in the GenBank FEATURES table.

Finally, a limited number of data items providing the administrative context of a sequence are provided. These indicate when data were last revised, the degree of review the data have received from GenBank staff, etc.

A sample GenBank entry in the current magnetic tape flat file distribution format is presented in Fig. 1. A more detailed presentation of the various data items has been given previously⁵; up-to-date descriptions and planned changes can be found in the release notes accompanying GenBank releases.²

    LOCUS       HUMCRYGA1     465 bp ds-DNA             PRI       15-MAR-1989
    DEFINITION  Human gamma-A-crystallin gene (gamma-G5), exons 1 and 2.
    ACCESSION   M17315
    KEYWORDS    crystallin; gamma-crystallin.
    SEGMENT     1 of 2
    SOURCE      Human fetal liver DNA, clone lambda-16G3.
      ORGANISM  Homo sapiens
                Eukaryota; Metazoa; Chordata; Vertebrata; Tetrapoda; Mammalia;
                Eutheria; Primates; Anthropoidea; Hominoidea; Hominidae.
    REFERENCE   1  (bases 1 to 465)
      AUTHORS   Meakin,S.O., Du,R.P., Tsui,L.-C. and Breitman,M.L.
      TITLE     Gamma-crystallins of the human eye lens: Expression analysis of
                five members of the gene family
      JOURNAL   Mol. Cell. Biol. 7, 2671-2679 (1987)
      STANDARD  full staff-entry
    COMMENT
    FEATURES       from   to/span   description
        pept        109       117   gamma-A-crystallin, exon 1
                                    /nomgen="CRYG5" /map="2q33-q35"
                    219   +   461   gamma-A-crystallin, exon 2
        IVS         118       218   CRY-g-A intron A
        IVS         462   >   465   CRY-g-A intron B
        refnumbr      2         2   numbered 1 in [1]
    BASE COUNT       91 a    137 c    111 g    126 t
    ORIGIN      Chromosome 2q33-35.
            1 aggtcccttt tgtgttgttt ttgccaacac agcagcttcc ctgctatata taccagttgc
           61 ccctttgtcc ctatcatact agatgctaat caccctctgt caacaaccat ggggaaggtg
          121 agcctgtgga ggtgctgtgc catgtctatt gggggtctgt ggtgtgtggg gatgttcctt
          181 ccaoctaact atctactatc accttatttc tacctcagat caccttctac gaggaccgag
          241 actftcaggg <cgctgc<ac aattgcatca gtgactgccc caacctgcgg gtctacttca
          301 gccgctgcaa ctccatccga gtagacagcg gctgctggat gctctatgag cgtcccaatt
          361 accagggcca ccagtacttc ctgcgccgag gcaagtaccc cgactatcag cactggatgg
          421 gcctcagcga ctcggtccaa tcctgccgta taattcctca tgtga

FIG. 1. Sample GenBank entry. The entry is in the standard line-type format used in the magnetic tape distribution of GenBank.

How Many Data Are in the Database? The current release (Release 61.0) of the database contains about 3.5 × 10⁷ nucleotides (based on over 4.0 × 10⁷ nucleotides, with the reduction being due to the merging described above) in about 2.9 × 10⁴ entries. The exponential growth rate of nucleotide sequence data over the past 10 years has been characterized elsewhere¹⁸ (see also Fig. 3). The database includes 150 entries containing 10⁴ or more contiguous nucleotides; 5 of these entries contain over 10⁵ contiguous nucleotides. There are about 1200 organisms (counting individual animal, plant, and microorganism species as well as viruses) represented in the database. Of the sequences for which feature annotation is available, roughly one-half represent protein-coding regions. However, many other functional categories are also well represented: for example, there are over 400 Alu repeats annotated in the database and over 1600 tRNA gene sequences.

¹⁵ C. Burks, in “Biomolecular Data: A Resource in Transition” (R. R. Colwell, ed.), p. 327. Oxford University Press, Oxford, 1989.
¹⁶ J. W. Fickett, Trends Biochem. Sci. 11, 190 (1986).
¹⁷ J. W. Fickett, Trends Biochem. Sci. 11, 382 (1986).
¹⁸ C. Burks, in “Biomolecular Data: A Resource in Transition” (R. R. Colwell, ed.), p. 17. Oxford University Press, Oxford, 1989.
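The automatic extraction and integrity checking of annotated coding regions mentioned above can be illustrated with a minimal sketch. This is not the GenBank tool itself, which the text does not describe in detail. The function names, the particular checks, and the toy sequence are all assumptions made for illustration; the only idea taken from the text is that feature-table spans (1-based, inclusive, as in Fig. 1) are enough to pull a coding region out of the surrounding sequence.

```python
from typing import List, Tuple

STOP_CODONS = {"taa", "tag", "tga"}


def extract_coding_region(sequence: str, spans: List[Tuple[int, int]]) -> str:
    """Join 1-based, inclusive (start, end) spans, in order, into one string."""
    seq = sequence.lower()
    pieces = []
    for start, end in spans:
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"span {start}..{end} lies outside the sequence")
        pieces.append(seq[start - 1:end])
    return "".join(pieces)


def check_coding_region(cds: str, complete: bool = True) -> List[str]:
    """Return a list of integrity problems; an empty list means none found."""
    problems = []
    if set(cds) - set("acgt"):
        problems.append("characters other than a, c, g, t")
    if not cds.startswith("atg"):
        problems.append("does not begin with atg")
    # Only whole codons are examined; a trailing partial codon is ignored here.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    internal = codons
    if complete:
        if len(cds) % 3 != 0:
            problems.append("length is not a multiple of three")
        if not codons or codons[-1] not in STOP_CODONS:
            problems.append("does not end with a stop codon")
        internal = codons[:-1]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("internal stop codon")
    return problems


# Toy sequence invented for this sketch: exon (3..8), intron, exon (21..26).
toy_sequence = "ccatggcagtaagttttcaggtctaaggg"
toy_spans = [(3, 8), (21, 26)]
toy_cds = extract_coding_region(toy_sequence, toy_spans)
print(toy_cds, check_coding_region(toy_cds))
```

For the entry in Fig. 1, the corresponding call would pass the spans (109, 117) and (219, 461) with complete=False, since the annotation shows intron B following exon 2 and the coding region therefore continues beyond this segment.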
There has been a greatly enhanced interest in “complete” genomes over the past few years. There are enough Escherichia coli DNA sequence data in GenBank to account for roughly 28% of the E. coli genome (though the apparent percentage of the genome covered would be smaller if potentially overlapping segments were merged). On the other hand, there are only enough Homo sapiens DNA sequence data in GenBank to account for, at most, 0.2% of the human genome. If one includes viruses, plasmids, and organellar chromosomes, there are about 150 “complete” genomes represented in the database, but the highly parasitic nature of these latter entities would require that one include a significant amount of host genetic material in the complete specification of the life cycle.
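The size and coverage figures quoted above lend themselves to a quick arithmetic check. The sketch below uses the Release 61.0 numbers and the two coverage percentages from the text. The genome sizes are assumptions of this example and are not given in the chapter (roughly 4.7 × 10⁶ base pairs for E. coli and 3 × 10⁹ for the human haploid genome), and the variable names are likewise illustrative.

```python
# Release 61.0 figures as quoted in the text.
nucleotides_before_merging = 4.0e7   # "over 4.0 x 10^7", so a lower bound
nucleotides_after_merging = 3.5e7
entries = 2.9e4

# Fraction of nucleotides removed by merging overlapping sequences.
merge_reduction = 1 - nucleotides_after_merging / nucleotides_before_merging

# Mean number of nucleotides per entry in the distributed database.
mean_entry_length = nucleotides_after_merging / entries

# ASSUMED genome sizes in base pairs (not given in the text).
assumed_genome_size = {"E. coli": 4.7e6, "H. sapiens": 3.0e9}

# Coverage fractions quoted in the text (the human value is an upper bound).
quoted_coverage = {"E. coli": 0.28, "H. sapiens": 0.002}

# Nucleotides implied by each coverage fraction under the assumed sizes.
implied_nucleotides = {
    organism: quoted_coverage[organism] * assumed_genome_size[organism]
    for organism in quoted_coverage
}

print(f"reduction from merging: {merge_reduction:.1%}")
print(f"mean entry length: {mean_entry_length:.0f} nucleotides")
for organism, count in implied_nucleotides.items():
    print(f"{organism}: about {count:.2e} nucleotides")
```

With these inputs, merging accounts for a reduction of at least 12.5%, and the mean entry is a little over 1200 nucleotides long. The implied per-organism totals depend entirely on the assumed genome sizes and should be read as order-of-magnitude estimates only.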