COMPUTATIONAL BIOLOGY
A HYPERTEXTBOOK

SCOTT T. KELLEY
Department of Biology
San Diego State University
San Diego, California

AND

DENNIS DIDULO
Becton, Dickinson and Company
San Diego, California

Washington, DC

Copyright © 2018 American Society for Microbiology. All rights reserved. No part of this publication may be reproduced or transmitted in whole or in part or reused in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Disclaimer: To the best of the publisher’s knowledge, this publication provides information concerning the subject matter covered that is accurate as of the date of publication. The publisher is not providing legal, medical, or other professional services. Any reference herein to any specific commercial products, procedures, or services by trade name, trademark, manufacturer, or otherwise does not constitute or imply endorsement, recommendation, or favored status by the American Society for Microbiology (ASM). The views and opinions of the author(s) expressed in this publication do not necessarily state or reflect those of ASM, and they shall not be used to advertise or endorse any product.

Library of Congress Cataloging-in-Publication Data
Names: Kelley, Scott T. (Scott Theodore), author. | Didulo, Dennis, author.
Title: Computational biology : a hypertextbook / Scott T. Kelley, Department of Biology, San Diego State University, San Diego, California, and Dennis Didulo, Becton, Dickinson and Company, San Diego, California.
Description: Washington, DC : ASM Press, [2018] | Includes index.
Identifiers: LCCN 2017051454 (print) | LCCN 2017052307 (ebook) | ISBN 9781683670032 (ebook) | ISBN 9781683670025 (pbk.)
Subjects: LCSH: Computational biology.
Classification: LCC QH324.2 (ebook) | LCC QH324.2 .K45 2018 (print) | DDC 570.285--dc23
LC record available at https://lccn.loc.gov/2017051454

All Rights Reserved
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Address editorial correspondence to ASM Press, 1752 N St., N.W., Washington, DC 20036-2904, USA
Send orders to ASM Press, P.O. Box 605, Herndon, VA 20172, USA
Phone: 800-546-2416; 703-661-1593
Fax: 703-661-1501
E-mail: [email protected]
Online: http://www.asmscience.org

To Kina and Aidan, my wonderful and supportive family. And to my brother Brian, who selflessly donated his kidney, without which I would not have had the energy to write this book.

CONTENTS

Preface ix
For the Instructor xi
For the Student xiii
Acknowledgments xiv
About the Authors xv

CHAPTER –1 Getting Started 1
CHAPTER 00 Introduction 5
    Activity 0.1: Biological Databases and Data Storage 20
CHAPTER 01 BLAST 31
    Activity 1.1: BLAST Algorithm 36
CHAPTER 02 Protein Analysis 47
    Activity 2.1: Hydrophobicity Plotting 52
    Activity 2.2: Protein Secondary Structure Prediction 58
CHAPTER 03 Sequence Alignment 67
    Activity 3.1: Dynamic Programming 74
CHAPTER 04 Patterns in the Data 91
    Activity 4.1: Protein Sequence Motifs 94
    Activity 4.2: Position-Specific Weight Matrices 102
CHAPTER 05 RNA Structure Prediction 111
    Activity 5.1: RNA Structure Prediction 118
CHAPTER 06 Phylogenetics 133
    Activity 6.1: Phylogenetic Analysis 140
CHAPTER 07 Probability: All Mutations are not Equal (-ly Probable) 157
    Activity 7.1: Generating PAM and BLOSUM Substitution Matrices 163
CHAPTER 08 Bioinformatics Programming: A Primer 179

Index 191

PREFACE

This textbook is a hypertextbook. Half of the textbook material lies between the pages of this book and the other half on the Internet.
It seems natural that a hypertextbook, which combines print and online apps for mobile technology, would be a great way to learn the basics of bioinformatics, which uses informatics (computational) theory to study biological data.

This book was born out of a mix of necessity and inspiration.1 The necessity came from the dearth of bioinformatics instructional materials appropriate for my combination of biology students, with little or no computer background,2 and computer science students, who were interested in the field but had little understanding of biology. The need became acute when I learned that my favorite bioinformatics lab manual, Bioinformatics for Dummies (BFD3), would no longer be updated. BFD was a great lab manual for learning how to perform basic bioinformatics data analysis. This book did not explain the principles behind the algorithms, but I could cover those during lectures. BFD was clear and fun to read and provided practical skills for biologists and others looking to analyze data. Unfortunately, the most recent version was printed in 2007!

I kept using the old edition of BFD for some time, but eventually the tutorials became obsolete and the students took longer and longer to complete the exercises. In fact, several passages of BFD were obsolete a few months after the book was printed. Bioinformatics websites are constantly changing, including their designs and the URL links, and sometimes the pages themselves disappear altogether. Since I began writing this book, two of the websites I teach in the book and online materials changed significantly, and one disappeared altogether. This led to my original inspiration for the hyper- part of this hypertextbook.

What if I made my own bioinformatics tutorials and sample test data for commonly used analysis tools online in easily updated files?
That way, when a link changed or the programmers moved a radio button around, I could easily alter the tutorial to reflect these changes in real time. Students would not have to wait for a new version of a book to have an accurate tutorial.

The next inspiration arose from my use of paper-based puzzles and problems to teach the bioinformatics algorithms. The problems I taught in class, combined with