Bioinformatics for Evolutionary Biologists: A Problems Approach PDF

323 Pages·2018·7.133 MB·English

Preview Bioinformatics for Evolutionary Biologists: A Problems Approach

Bernhard Haubold
Angelika Börsch-Haubold

Bioinformatics for Evolutionary Biologists
A Problems Approach

Bernhard Haubold
Department of Evolutionary Genetics
Max-Planck-Institute for Evolutionary Biology
Plön, Schleswig-Holstein
Germany

Angelika Börsch-Haubold
Plön, Schleswig-Holstein
Germany

ISBN 978-3-319-67394-3
ISBN 978-3-319-67395-0 (eBook)
https://doi.org/10.1007/978-3-319-67395-0
Library of Congress Control Number: 2017955660

© Springer International Publishing AG 2017, corrected publication 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Evolutionary biologists have two types of ancestors: naturalists such as Charles Darwin (1809–1892) and theoreticians such as Ronald A. Fisher (1890–1962). The intellectual descendants of these two scientists have traditionally formed quite separate tribes. However, the distinction between naturalists and theoreticians is rapidly fading these days: Many naturalists spend most of their time in front of computers analyzing their data, and quite a few theoreticians are starting to collect their own data.

The reason for this coalescence between theory and experiment is that two hitherto expensive technologies have become so cheap, they are now essentially free: computing and sequencing. Computing became affordable in the early 1980s with the advent of the PC. More recently, next generation sequencing has allowed everyone to sequence the genomes of their favorite organisms. However, analyzing this data remains difficult.

The difficulties are twofold: conceptual, which method should I use, and practical, how do I carry out a certain computation. The aim of this book is to help the reader overcome both difficulties. We do this by posing a series of problems. These come in two forms, paper and pencil problems, and computer problems. Our choice of concepts is centered on the analysis of sequences in an evolutionary context. The aim here is to give the reader a look under the hood of the programs applied in the computer problems.

The computer problems are solved in the same environment used for decades by scientists, the UNIX command line, also known as the shell. This is available on all three major desktop operating systems, Windows, Linux, and OS-X. Like any skill worth learning, using the shell takes practice. The computer problems are designed to give the reader plenty of opportunity for that.
In Chap. 1, we introduce the command line. After explaining how to get started, we deal with plain text files, which serve as input and output of most UNIX operations. Many of these operations are themselves text files containing commands to be executed on some input. Such command files are called scripts, and their treatment concludes Chap. 1.

In Chap. 2, the newly acquired UNIX skills are used to explore a central concept in Bioinformatics: sequence alignment. A sequence alignment represents an evolutionary hypothesis about which residues have a recent common ancestor. This is determined using optimal alignment methods that extract the best out of a very large number of possible alignments. However, this optimal approach consumes a lot of time and memory.

The computation of exact matches, the topic of Chap. 3, is less resource intensive than the computation of alignments. Taken by themselves, exact matches are also less useful than alignments, because exact matches cannot take into account mutations. Nevertheless, exact matching is central to many of the most popular methods for inexact matching. We begin with methods for exact matching in time proportional to the length of the sequence investigated. Then we concentrate on methods for exact matching in time independent of the text length. This is achieved by indexing the input sequence through the construction of suffix trees and suffix arrays.

In Chap. 4, we show how to combine alignment with exact matching to obtain very fast programs. The most famous example of these is BLAST, which is routinely used to find similarities between sequences. Up to now we have only looked at pairwise alignment. At the end of Chap. 4, we generalize this to multiple sequence alignment.

In Chap. 5, multiple sequence alignments are used to construct phylogenies. These are hypotheses about the evolution of a set of species.

If we zoom in from evolution between species to evolution within a particular species, we enter the field of population genetics, the topic of Chap. 6.
Here, we concentrate on modeling evolution by following the descent of a sample of genes back in time to their most recent common ancestor. These lines of descent form a tree known as the coalescent, the topic of much of modern population genetics.

We conclude in Chap. 7 by introducing two miscellaneous topics: statistics and relational databases. Both would deserve books in their own right, and we restrict ourselves to showing how they fit in with the UNIX command line.

Our course is sequence-centric, because sequence data permeates modern biology. In addition, these data have attracted a rich set of computer methods for data analysis and modeling. The sequences we analyze can be downloaded from the companion website for this book:

http://guanine.evolbio.mpg.de/problemsBook/

To these sequences, we apply generic tools provided by the UNIX environment, published bioinformatics software, and programs written for this course. The latter are designed to allow readers to analyze a particular computational method. The programs are also available from the companion site.

At the back of the book, we give complete solutions to all the problems. The solutions are an integral part of the course. We recommend you attempt each problem in the order in which they are posed. If you find a solution, compare it to ours. If you cannot find a solution, read ours and try again. If our solution is unclear or you have a better one, please drop us a line at

[email protected]

The tongue-in-cheek Algorithm 1 summarizes these recommendations.
Algorithm 1 Using the Solutions

    while problem unsolved do
        solve problem
        read solution
        if solution unclear or your solution is better than ours then
            drop us a line
        end if
    end while

This book has been in the works since 2003 when BH started teaching Bioinformatics at the University of Applied Sciences, Weihenstephan. We thank all the students who gave us feedback on this material as it evolved over the years. We would also like to thank a few individuals who contributed in more specific ways to the gestation of this book: Mike Travisano (University of Minnesota) gave us encouragement at a critical time. Nicola Gaedeke and Peter Pfaffelhuber (University of Freiburg) commented on an early draft, and our students Linda Krause, Xiangyi Li, Katharina Dannenberg, and Lina Urban read large parts of the manuscript in one of the many guises it has taken over the years. We are grateful to all of them.

Plön, Germany, July 2017
Bernhard Haubold
Angelika Börsch-Haubold

The original version of the book backmatter was revised: For detailed information please see Erratum. The erratum to this chapter is available at https://doi.org/10.1007/978-3-319-67395-0_9

Contents

1 The UNIX Command Line
    1.1 Getting Started
    1.2 Files
    1.3 Scripts
        1.3.1 Bash
        1.3.2 Sed
        1.3.3 AWK

2 Constructing and Applying Optimal Alignments
    2.1 Sequence Evolution and Alignment
    2.2 Amino Acid Substitution Matrices
        2.2.1 Genetic Code
        2.2.2 PAM Matrices
    2.3 The Number of Possible Alignments
    2.4 Dot Plots
    2.5 Optimal Alignment
        2.5.1 From Dot Plot to Alignment
        2.5.2 Global Alignment
        2.5.3 Local Alignment
    2.6 Applications of Optimal Alignment
        2.6.1 Homology Detection
        2.6.2 Dating the Duplication of Adh

3 Exact Matching
    3.1 Keyword Trees
    3.2 Suffix Trees
    3.3 Suffix Arrays
    3.4 Text Compression
        3.4.1 Move to Front (MTF)
        3.4.2 Measuring Compressibility: The Lempel–Ziv Decomposition

4 Fast Alignment
    4.1 Alignment with k Errors
    4.2 Fast Local Alignment
        4.2.1 Simple BLAST
        4.2.2 Modern BLAST
    4.3 Shotgun Sequencing
    4.4 Fast Global Alignment
    4.5 Read Mapping
    4.6 Clustering Protein Sequences
    4.7 Position-Specific Iterated BLAST
    4.8 Multiple Sequence Alignment
        4.8.1 Query-Anchored Alignment
        4.8.2 Progressive Alignment

5 Evolution Between Species: Phylogeny
    5.1 Trees of Life
    5.2 Rooted Phylogeny
    5.3 Unrooted Phylogeny

6 Evolution Within Populations
    6.1 Descent from One or Two Parents
        6.1.1 Bi-Parental Genealogy
        6.1.2 Uni-Parental Genealogy
    6.2 The Coalescent

7 Additional Topics
    7.1 Statistics
        7.1.1 The Significance of Single Experiments
        7.1.2 The Significance of Multiple Experiments
        7.1.3 Mouse Transcriptome Data
    7.2 Relational Databases
        7.2.1 Mouse Expression Data
        7.2.2 SQL Queries
        7.2.3 Java
        7.2.4 ENSEMBL

8 Answers and Appendix: Unix Guide
    8.1 Answers
    8.2 Appendix: UNIX Guide
        8.2.1 File Editing
        8.2.2 Working with Files
        8.2.3 Entering Commands Interactively
        8.2.4 Combining Commands: Pipes
        8.2.5 Redirecting Output
        8.2.6 Shell Scripts
