Introduction to Bioinformatics
A Theoretical and Practical Approach

Edited by
Stephen A. Krawetz, PhD
Wayne State University School of Medicine, Detroit, MI
and
David D. Womble, PhD
Wayne State University School of Medicine, Detroit, MI

Includes CD-ROM

Humana Press
Totowa, New Jersey

© 2003 Humana Press Inc.
999 Riverview Drive, Suite 208
Totowa, New Jersey 07512
humanapress.com

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise without written permission from the Publisher.

All papers, comments, opinions, conclusions, or recommendations are those of the author(s), and do not necessarily reflect the views of the publisher.

This publication is printed on acid-free paper. ∞
ANSI Z39.48-1984 (American National Standards Institute) Permanence of Paper for Printed Library Materials.

Production Editor: Mark J. Breaugh.
Cover design by Patricia F. Cleary and Paul A. Thiessen.
Cover illustration by Paul A. Thiessen, chemicalgraphics.com.
For additional copies, pricing for bulk purchases, and/or information about other Humana titles, contact Humana at the above address or at any of the following numbers: Tel.: 973-256-1699; Fax: 973-256-8341; E-mail: [email protected]; Website: humanapress.com

Photocopy Authorization Policy: Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Humana Press Inc., provided that the base fee of US $20.00 per copy is paid directly to the Copyright Clearance Center at 222 Rosewood Drive, Danvers, MA 01923. For those organizations that have been granted a photocopy license from the CCC, a separate system of payment has been arranged and is acceptable to Humana Press Inc. The fee code for users of the Transactional Reporting Service is: [1-58829-064-6/03 $20.00].

Printed in the United States of America. 10 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging in Publication Data

Introduction to bioinformatics : a theoretical and practical approach / edited by Stephen A. Krawetz and David D. Womble.
p. ; cm.
Includes bibliographical references and index.
ISBN 1-58829-064-6 (alk. paper) (HC); 1-58829-241-X (PB); 1-59259-335-6 (e-book)
1. Bioinformatics. I. Krawetz, Stephen A. II. Womble, David D.
[DNLM: 1. Computational Biology--methods. 2. Computer Systems. 3. Databases, Genetic. 4. Genomics. 5. Sequence Analysis, DNA. 6. Software. QH 506 I646 2002]
QH 507 .I575 2002
570'.285--dc21
2002190207

Preface

As the sequencing phase of the human and other genome projects nears completion, we are faced with the task of understanding how the vast strings of Cs, As, Ts, and Gs encode a being. With the recent advent of microarrays and other high-throughput biological technologies, we have moved from trying to understand single molecules and pathways to trying to understand integrative systems.
We are only beginning to grasp the questions we can ask as we are now challenged to understand these large in silico, in vitro, and in vivo data sets. The new field of Bioinformatics was born of a series of meetings among “wet-bench” scientists in the early 1980s to meet this challenge. With the recruitment of mathematicians, computer scientists, statisticians, and astrophysicists to this field, we have now begun to design and implement some of the basic tools that will enable data integration and multidimensional analyses of these varied but unified data sets. For those new to Bioinformatics, this cross-pollination of the Life, Physical, and Theoretical sciences wants for a common language.

With this in mind, Introduction to Bioinformatics: A Theoretical and Practical Approach was written as an introductory text for the undergraduate, graduate, or professional. At once, this text provides the physical scientist, whether mathematician, computer scientist, statistician, or astrophysicist, with a biological framework to understand the questions a life scientist would pose in the context of the computational issues and currently available tools. At the same time, it provides the life scientist with a source for the various computational tools now available, along with an introduction to their underlying mathematical foundations. As such, this book can be used as a bridge toward homologation of these fields. By bringing these disciplines together we may begin our journey toward understanding the nuances of the genetic code.

Introduction to Bioinformatics: A Theoretical and Practical Approach is divided into four main sections. The first two sections are well suited to the physical scientist who is new to studying biological systems. They provide the biological vocabulary, i.e., an overview of the various biological processes that govern an organism and impact health.
The first section, Biochemistry, Cell, and Molecular Biology, describes basic cellular structure and the biological decoding of the genome. In silico detection of the promoter elements that modulate genome decoding is also explained. The second section, Molecular Genetics, leads the reader through a discussion of the long-range regulation of genomes, the in silico detection of the elements that impact long-range control, and the molecular genetic basis of disease as a consequence of replication. Clinical human genetics and the various clinical databases are reviewed, followed by a discussion of the various issues within population genetics that can be used to address the question: “How do we evolve as we respond to our environment?”

The third section, The UNIX Operating System, was written for the life scientist, to demystify the UNIX operating system that is commonly used to support advanced computational tools. Along with understanding the installation and management of UNIX-based software tools, examples of command line sequence analyses are presented. These chapters should enable the life scientist to become as comfortable in a command line environment as in a graphical user interface environment.

The Computer Applications section provides a common area for the physical and life scientist to meet. The management and analysis of DNA sequencing projects are presented, along with a review of how DNA can be modeled as a statistical series of patterns. The latter forms the basis of most protein and nucleic acid sequence analysis routines. These considerations are followed by a discussion of the various genome databases, the representation of genomes, and methods for their large-scale analyses.
This culminates in addressing the question: “Can I learn about my sequence from what is known about a similar sequence?” To directly answer this question, a discussion of the various methods of pattern discovery follows, including basic multiple sequence alignment to identify both functionally and structurally related components. The accompanying protein visualization chapter outlines how these tools can aid in predicting structures that often represent homologous segments from evolutionarily conserved gene families. This final section concludes with a review of how multiple sequence alignment can be used to infer both functional and structural biological relationships. In closing, the final chapters of the book review the new field of Transcription Profiling, examining the current state of analysis software for systems biology. We conclude our journey with a discussion of the in silico analysis and prediction of patterns of gene expression that will ultimately guide our understanding of living systems.

Though the text provides a detailed description and examples, the CD supplement also contains a complete set of illustrations from each chapter, many of which are present in color. This provides a visual resource for both the student and the teacher that should prove invaluable for those of us preparing our next Bioinformatics lecture or seminar. In addition, several full versions and limited trial versions of the programs that are discussed in the text are included. These encompass a broad spectrum, from DNA sequencing project management to microarray analysis, offering the reader the opportunity to easily access some of the software tools that are discussed. It is our hope that the current and next generation of physical and life scientists will use these resources as a springboard to help us move forward in the important quest for an integrated understanding of our physical being.

Stephen A. Krawetz
David D. Womble

Contents

Preface
Contributors

Part I. Biochemistry, Cell, and Molecular Biology

A. The Cell
1 • Nucleic Acids and Proteins: Modern Linguistics for the Genomics and Bioinformatics Era
    Bradley C. Hyman
2 • Structure and Function of Cell Organelles
    Jon Holy
3 • Cell Signaling
    Daniel A. Rappolee

B. Transcription and Translation
4 • DNA Replication, Repair, and Recombination
    Linda B. Bloom
5 • Transcription, RNA Processing, and Translation
    Thomas P. Yang and Thomas W. O’Brien

Part II. Molecular Genetics

A. Genomics
6 • Epigenetic Mechanisms Regulating Gene Expression
    John R. McCarrey
7 • Gene Families and Evolution
    Ben F. Koop
8 • Repetitive DNA: Detection, Annotation, and Analysis
    Jerzy Jurka
9 • Molecular Genetics of Disease and the Human Genome Project
    Paromita Deb-Rinker and Stephen W. Scherer

B. Clinical Human Genetics
10 • Heredity
    C. A. Rupar
11 • The Clinical Genetics Databases
    Peter J. Bridge
12 • Population Genetics
    Jill S. Barnholtz-Sloan

Part III. The UNIX Operating System

A. Basics and Installation
13 • Introduction to UNIX for Biologists
    David D. Womble
14 • Installation of the Sun Solaris™ Operating Environment
    Bryon Campbell
15 • Sun System Administration
    Bryon Campbell

B. Managing Bioinformatics Tools
16 • Installing Bioinformatics Software in a Server-Based Computing Environment
    Brian Fristensky
17 • Management of a Server-Based Bioinformatics Resource
    Brian Fristensky

C. Command Line Sequence Analysis
18 • GCG File Management
    Sittichoke Saisanit
19 • GCG Sequence Analysis
    Sittichoke Saisanit

Part IV. Computer Applications

A. Management and Analysis of DNA Sequencing Projects and Sequences
20 • Managing Sequencing Projects in the GAP4 Environment
    Rodger Staden, David P. Judge, and James K. Bonfield
21 • OLIGO Primer Analysis Software
    John D. Offerman and Wojciech Rychlik
22 • Statistical Modeling of DNA Sequences and Patterns
    Gautam B. Singh
23 • Statistical Mining of the Matrix Attachment Regions in Genomic Sequences
    Gautam B. Singh
24 • Analyzing Sequences Using the Staden Package and EMBOSS
    Rodger Staden, David P. Judge, and James K. Bonfield

B. The Genome Database: Analysis and Similarity Searching
25 • Ensembl: An Open-Source Software Tool for Large-Scale Genome Analysis
    James W. Stalker and Anthony V. Cox
26 • The PIR for Functional Genomics and Proteomics
    Cathy H. Wu
27 • Sequence Similarity and Database Searching
    David S. Wishart
28 • GCG Database Searching
    David J. Heard

C. Identifying Functional and Structural Sequence Elements
29 • Pattern Discovery: Methods and Software
    Broňa Brejová, Tomáš Vinař, and Ming Li
30 • The Role of Transcription Factor Binding Sites in Promoters and Their In Silico Detection
    Thomas Werner
31 • An Introduction to Multiple Sequence Alignment and Analysis
    Steven M. Thompson
32 • 3D Molecular Visualization with Protein Explorer
    Eric Martz
33 • Multiple Sequence Alignment and Analysis: The SeqLab Interface: A Practical Guide
    Steven M. Thompson

D. Analysis of Gene Expression: Microarrays and Other Tools
34 • Overview of the Tools for Microarray Analysis: Transcription Profiling, DNA Chips, and Differential Display
    Jeffrey A. Kramer
35 • Microarrays: Tools for Gene Expression Analysis
    Sorin Draghici
36 • Knowledge Discovery from the Human Transcriptome
    Kousaku Okubo and Teruyoshi Hishiki

Part V. Appendices

Appendix 1: CD Contents
Appendix 2: A Collection of Useful Bioinformatic Tools and Molecular Tables
Appendix 3: Simple UNIX Commands

Index