Biological sequence analysis
Probabilistic models of proteins and nucleic acids

The face of biology has been changed by the emergence of modern molecular genetics. Among the most exciting advances are large-scale DNA sequencing efforts such as the Human Genome Project, which are producing an immense amount of data. The need to understand the data is becoming ever more pressing. Demands for sophisticated analyses of biological sequences are driving forward the newly created and explosively expanding research area of computational molecular biology, or bioinformatics.

Many of the most powerful sequence analysis methods are now based on principles of probabilistic modelling. Examples of such methods include the use of probabilistically derived score matrices to determine the significance of sequence alignments, the use of hidden Markov models as the basis for profile searches to identify distant members of sequence families, and the inference of phylogenetic trees using maximum likelihood approaches.

This book provides the first unified, up-to-date, and tutorial-level overview of sequence analysis methods, with particular emphasis on probabilistic modelling. Pairwise alignment, hidden Markov models, multiple alignment, profile searches, RNA secondary structure analysis, and phylogenetic inference are treated at length.

Written by an interdisciplinary team of authors, the book is accessible to molecular biologists, computer scientists and mathematicians with no formal knowledge of each other's fields. It presents the state of the art in this important, new and rapidly developing discipline.

Richard Durbin is Head of the Informatics Division at the Sanger Centre in Cambridge, England.

Sean Eddy is Assistant Professor at Washington University's School of Medicine and also one of the Principal Investigators at the Washington University Genome Sequencing Center.

Anders Krogh is a Research Associate Professor in the Center for Biological Sequence Analysis at the Technical University of Denmark.
Graeme Mitchison is at the Medical Research Council's Laboratory for Molecular Biology in Cambridge, England.

Biological sequence analysis
Probabilistic models of proteins and nucleic acids

Richard Durbin
Sean R. Eddy
Anders Krogh
Graeme Mitchison

CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521629713

© Cambridge University Press 1998

This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published in print format 1998

ISBN-13 978-0-511-33708-6 eBook (EBL)
ISBN-10 0-511-33708-6 eBook (EBL)
ISBN-13 978-0-521-62971-3 paperback
ISBN-10 0-521-62971-3 paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Contents

Preface

1 Introduction
  1.1 Sequence similarity, homology, and alignment
  1.2 Overview of the book
  1.3 Probabilities and probabilistic models
  1.4 Further reading

2 Pairwise alignment
  2.1 Introduction
  2.2 The scoring model
  2.3 Alignment algorithms
  2.4 Dynamic programming with more complex models
  2.5 Heuristic alignment algorithms
  2.6 Linear space alignments
  2.7 Significance of scores
  2.8 Deriving score parameters from alignment data
  2.9 Further reading

3 Markov chains and hidden Markov models
  3.1 Markov chains
  3.2 Hidden Markov models
  3.3 Parameter estimation for HMMs
  3.4 HMM model structure
  3.5 More complex Markov chains
  3.6 Numerical stability of HMM algorithms
  3.7 Further reading

4 Pairwise alignment using HMMs
  4.1 Pair HMMs
  4.2 The full probability of x and y, summing over all paths
  4.3 Suboptimal alignment
  4.4 The posterior probability that x_i is aligned to y_j
  4.5 Pair HMMs versus FSAs for searching
  4.6 Further reading

5 Profile HMMs for sequence families
  5.1 Ungapped score matrices
  5.2 Adding insert and delete states to obtain profile HMMs
  5.3 Deriving profile HMMs from multiple alignments
  5.4 Searching with profile HMMs
  5.5 Profile HMM variants for non-global alignments
  5.6 More on estimation of probabilities
  5.7 Optimal model construction
  5.8 Weighting training sequences
  5.9 Further reading

6 Multiple sequence alignment methods
  6.1 What a multiple alignment means
  6.2 Scoring a multiple alignment
  6.3 Multidimensional dynamic programming
  6.4 Progressive alignment methods
  6.5 Multiple alignment by profile HMM training
  6.6 Further reading

7 Building phylogenetic trees
  7.1 The tree of life
  7.2 Background on trees
  7.3 Making a tree from pairwise distances
  7.4 Parsimony
  7.5 Assessing the trees: the bootstrap
  7.6 Simultaneous alignment and phylogeny
  7.7 Further reading
  7.8 Appendix: proof of neighbour-joining theorem

8 Probabilistic approaches to phylogeny
  8.1 Introduction
  8.2 Probabilistic models of evolution
  8.3 Calculating the likelihood for ungapped alignments
  8.4 Using the likelihood for inference
  8.5 Towards more realistic evolutionary models
  8.6 Comparison of probabilistic and non-probabilistic methods
  8.7 Further reading

9 Transformational grammars
  9.1 Transformational grammars
  9.2 Regular grammars
  9.3 Context-free grammars
  9.4 Context-sensitive grammars
  9.5 Stochastic grammars
  9.6 Stochastic context-free grammars for sequence modelling
  9.7 Further reading

10 RNA structure analysis
  10.1 RNA
  10.2 RNA secondary structure prediction
  10.3 Covariance models: SCFG-based RNA profiles
  10.4 Further reading

11 Background on probability
  11.1 Probability distributions
  11.2 Entropy
  11.3 Inference
  11.4 Sampling
  11.5 Estimation of probabilities from counts
  11.6 The EM algorithm

Bibliography
Author index
Subject index