2 Introduction to Bioinformatics 3 INTRODUCTION TO BIOINFORMATICS FOURTH EDITION Arthur M. Lesk The Pennsylvania State University In nature’s infinite book of secrecy A little I can read. Antony and Cleopatra 4 Great Clarendon Street, Oxford, OX2 6DP, United Kingdom Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries © Arthur M. Lesk 2014 The moral rights of the author have been asserted First Edition copyright 2002 Second Edition copyright 2005 Third Edition copyright 2008 Impression: 1 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this work in any other form and you must impose this same condition on any acquirer British Library Cataloguing in Publication Data Data available ISBN 978–0–19–965156–6 Printed in Italy by L.E.G.O. S.p.A—Lavis TN Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work. 5 Dedicated to Eda, with whom I have merged my genes. 6 CONTENTS Preface to the first edition Preface to the second edition Preface to the third edition Preface to the fourth edition Plan of the book Introduction to bioinformatics on the web Acknowledgements 1 Introduction Life in space and time Phenotype = genotype + environment + life history + epigenetics Evolution is the change over time in the world of living things Dogmas: central and peripheral Statics and dynamics Networks Observables and data archives A database without effective modes of access is merely a data graveyard Information flow in bioinformatics Curation, annotation, and quality control The world-wide web Electronic publication Computers and computer science Programming Biological classification and nomenclature Use of sequences to determine phylogenetic relationships Use of SINES and LINES to derive phylogenetic relationships Searching for similar sequences in databases: PSI-BLAST Introduction to protein structure The hierarchical nature of protein architecture 7 Classification of protein structures Protein structure prediction and engineering Critical Assessment of Structure Prediction Protein engineering Proteomics and transcriptomics DNA microarrays Transcriptomics and RNA sequencing Mass spectrometry Systems biology Clinical implications The future Recommended reading Exercises and problems 2 Genome organization and evolution Genomes, transcriptomes, and proteomes Genes Proteomics and transcriptomics Eavesdropping on the transmission of genetic information Identification of genes associated with inherited diseases Mappings between the maps High-resolution maps Genome-wide association studies Picking out genes in genomes Genome-sequencing projects Genomes of prokaryotes The genome of the bacterium Escherichia coli The genome of the archaeon Methanococcus jannaschii The genome of one of the simplest organisms: Mycoplasma genitalium Metagenomics: the collection of genomes in a coherent environmental sample The human microbiome Genomes of eukarya Gene families The genome of Saccharomyces cerevisiae (baker's yeast) The genome of Caenorhabditis elegans The genome of Drosophila melanogaster 8 The genome of Arabidopsis thaliana The genome of Homo sapiens (the human genome) Protein-coding genes Repeat sequences RNA Single-nucleotide polymorphisms and haplotypes Systematic measurements and collections of single-nucleotide polymorphisms Ethical, legal, and social issues Genetic diversity in anthropology DNA sequences and languages Genetic diversity and personal identification Evolution of genomes Please pass the genes: horizontal gene transfer Comparative genomics of eukarya Recommended Reading Exercises and problems 3 Scientific publications and archives: media, content, and access The scientific literature Economic factors governing access to scholarly publications Open access The Public Library of Science Traditional and digital libraries How to populate a digital library The information explosion The web: higher dimensions New media: video, sound Searching the literature Bibliography management Databases Database contents The literature as a database Database organization Annotation Database quality control Database access Links 9 Database interoperability Data mining Programming languages and tools Traditional programming languages Scripting languages Program libraries specialized for molecular biology Java: computing over the web Markup languages Natural language processing Natural language processing and mining the biomedical literature Applications of text mining Recommended reading Exercises and problems 4 Archives and information retrieval Database indexing and specification of search terms Follow-up questions Analysis and processing of retrieved data The archives Nucleic acid sequence databases Genome databases and genome browsers Protein sequence databases Databases of protein families Databases of structures Classifications of protein structures Accuracy and precision of protein structure determinations Specialized, or ‘boutique’, databases Expression and proteomics databases Bibliographic databases Surveys of molecular biology databases and servers Gateways to archives Access to databases in molecular biology ENTREZ The Protein Identification Resource ExPASy: Expert Protein Analysis System Where do we go from here? Recommended reading 10