
Theoretical and Computational Methods in Genome Research

331 Pages·1997·12.735 MB·English


Edited by
Sandor Suhai
German Cancer Research Center
Heidelberg, Germany

Springer Science+Business Media, LLC

Library of Congress Cataloging-in-Publication Data
Theoretical and computational methods in genome research / edited by Sandor Suhai.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4613-7708-5 ISBN 978-1-4615-5903-0 (eBook)
DOI 10.1007/978-1-4615-5903-0
1. Genomes--Data processing--Congresses. 2. Gene libraries--Data processing--Congresses. 3. Nucleotide sequence--Data processing--Congresses. I. Suhai, Sandor.
CH442.T45 1997
572.8'S'0285--dc21 97-1548 CIP

Proceedings of the International Symposium on Theoretical and Computational Genome Research, held March 24-27, 1996, in Heidelberg, Germany

© 1997 Springer Science+Business Media New York
Originally published by Plenum Press, New York in 1997
Softcover reprint of the hardcover 1st edition 1997
http://www.plenum.com
10 9 8 7 6 5 4 3 2 1
All rights reserved
No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the Publisher

PREFACE

The application of computational methods to solve scientific and practical problems in genome research created a new interdisciplinary area that transcends boundaries traditionally separating genetics, biology, mathematics, physics, and computer science. Computers have, of course, been intensively used in the field of life sciences for many years, even before genome research started, to store and analyze DNA or protein sequences; to explore and model the three-dimensional structure, the dynamics, and the function of biopolymers; to compute genetic linkage or evolutionary processes; and more.
The rapid development of new molecular and genetic technologies, combined with ambitious goals to explore the structure and function of genomes of higher organisms, has generated, however, not only a huge and exponentially increasing body of data but also a new class of scientific questions. The nature and complexity of these questions will also require, beyond establishing a new kind of alliance between experimental and theoretical disciplines, the development of new generations both in computer software and hardware technologies. New theoretical procedures, combined with powerful computational facilities, will substantially extend the horizon of problems that genome research can attack with success. Many of us still feel that computational models rationalizing experimental findings in genome research fulfill their promises more slowly than desired. There is also an uncertainty concerning the real position of a "theoretical genome research" in the network of established disciplines integrating their efforts in this field. There seems to be an obvious parallel between the present situation of genome research and that of atomic physics at the end of the first quarter of our century. Advanced experimental techniques made it possible at that time to fill huge data catalogues with the frequencies of spectral lines, yet all attempts at an empirical systematization or classically founded explanation remained unsuccessful until a completely new conceptual framework, quantum mechanics, emerged that made sense of all the data. The present situation of the life sciences is certainly more intricate due to the more amorphous nature of the field. One has to ask whether it is a fair demand at all that genome research should reach the internal coherence of a unifying theoretical framework like theoretical physics or (to a smaller extent) chemistry.
The fear seems to be, however, not unjustified that genomic data accumulation will get so far ahead of its assimilation into an appropriate theoretical framework that the data themselves might eventually prove an encumbrance in developing such new concepts. The aim of most of the computational methods presented in this volume is to improve upon this situation by trying to provide a bridge between experimental databases (information) on the one hand and theoretical concepts (biological and genetic knowledge) on the other.

The content of this volume was presented as plenary lectures at the International Symposium on Theoretical and Computational Genome Research held March 24-27, 1996, at the Deutsches Krebsforschungszentrum (DKFZ) in Heidelberg. It is a great pleasure to thank here Professor Harald zur Hausen and the coworkers of DKFZ for their help and hospitality extended to the lecturers and participants during the meeting and the Commission of the European Communities for the funding of the symposium. The organizers profited much from the help of the scientific committee of the symposium: Martin Bishop, Philippe Dessen, Reinhold Haux, Ralf Hofestädt, Willi Jäger, Jens G. Reich, Otto Ritter, Petre Tautu, and Martin Vingron. Furthermore, the editor is deeply indebted to Michaela Knapp-Mohammady and Anke Retzmann for their help in organizing the meeting and preparing this volume.

Sandor Suhai
Heidelberg, July 1996

CONTENTS

1. Evaluating the Statistical Significance of Multiple Distinct Local Alignments
   Stephen F. Altschul
2. Hidden Markov Models for Human Genes: Periodic Patterns in Exon Sequences
   Pierre Baldi, Søren Brunak, Yves Chauvin, and Anders Krogh
3. Identification of Muscle-Specific Transcriptional Regulatory Regions
   James W. Fickett
4. A Systematic Analysis of Gene Functions by the Metabolic Pathway Database
   Minoru Kanehisa and Susumu Goto
5. Polymer Dynamics of DNA, Chromatin, and Chromosomes
   Jörg Langowski, Lutz Ehrlich, Markus Hammermann, Christian Münkel, and Gero Wedemann
6. Is Whole Human Genome Sequencing Feasible?
   Eugene W. Myers and James L. Weber
7. Sequence Patterns Diagnostic of Structure and Function
   Temple F. Smith, R. Mark Adams, Sudeshna Das, Lihua Yu, Loredana Lo Conte, and James White
8. Recognizing Functional Domains in Biological Sequences
   Gary D. Stormo
9. The Integrated Genomic Database (IGD): Enhancing the Productivity of Gene Mapping Projects
   Stephen P. Bryant, Anastassia Spiridou, and Nigel K. Spurr
10. Error Analysis of Genetic Linkage Data
    Robert Cottingham, Jr., Margaret Gelder Ehm, and Marek Kimmel
11. Managing Accelerating Data Growth in the Genome Database
    Kenneth H. Fasman
12. Advances in Statistical Methods for Linkage Analysis
    Jeffrey R. O'Connell and Daniel E. Weeks
13. Exploring Heterogeneous Molecular Biology Databases in the Context of the Object-Protocol Model
    Victor M. Markowitz, I-Min A. Chen, and Anthony S. Kosky
14. Comprehensive Genome Information Systems
    Otto Ritter
15. Visualizing the Genome
    David B. Searls
16. Data Management for Ligand-Based Drug Design
    Karl Aberer, Klemens Hemm, and Manfred Hendlich
17. Picturing the Working Protein
    Hans Frauenfelder and Peter G. Wolynes
18. HIV-1 Protease and Its Inhibitors
    Maciej Geller, Joanna Trylska, and Jan Antosiewicz
19. Density Functional and Neural Network Analysis: Hydration Effects and Spectroscopic and Structural Correlations in Small Peptides and Amino Acids
    K. J. Jalkanen, S. Suhai, and H. Bohr
20. Computer Simulations of Protein-DNA Interactions
    Mats Eriksson and Lennart Nilsson
21. The Role of Neutral Mutations in the Evolution of RNA Molecules
    Peter Schuster
22. How Three-Fingered Snake Toxins Recognise Their Targets: Acetylcholinesterase-Fasciculin Complex, a Case Study
    Kurt Giles, Mia L. Raves, Israel Silman, and Joel L. Sussman
23. Protein Sequence and Structure Comparison Using Iterative Double Dynamic Programming
    William R. Taylor
Index

1

EVALUATING THE STATISTICAL SIGNIFICANCE OF MULTIPLE DISTINCT LOCAL ALIGNMENTS

Stephen F. Altschul
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, Maryland 20894

ABSTRACT

A comparison of two sequences may uncover multiple regions of local similarity. While the significance of each local alignment may be evaluated independently, sometimes a combined assessment is appropriate. This paper discusses a variety of statistical and algorithmic issues that such an assessment presents.

1. INTRODUCTION

The most widely used techniques for comparing DNA and protein sequences are local alignment algorithms, which seek similar regions of the sequences under consideration [1-3]. Given a particular measure of similarity, an important question is how strong a local alignment must be to be considered statistically significant. For a broad range of measures, including most of those in common use, this question has been addressed both analytically and experimentally [4-17].

Occasionally a sequence comparison program will uncover not one but multiple local alignments, representing several distinct regions of sequence similarity [17-22]. This is particularly common with the BLAST algorithms [3], which disallow gaps within the alignments found, but is true also of local alignment algorithms that permit gaps [17-21]. Can one arrive at a joint statistical assessment of these several alignments? We will discuss below the various issues that arise in reaching such an assessment. We will confine attention first to local alignments lacking gaps, for which analytic results are available, and then discuss the generalization of these results to alignments with gaps.

2. LOCAL ALIGNMENT STATISTICS

The simple measure of local similarity with which we start requires a substitution matrix, or set of scores $s_{ij}$, for aligning the various pairs of amino acids. Assume we are given two protein sequences to compare. Choosing any pair of equal-length segments, one from each sequence, we may construct an ungapped subalignment. The score for such a segment pair may be taken as the aggregate score of its aligned pairs of amino acids. The segment pair with greatest score is called the maximal segment pair (or MSP), and its score the maximal segment pair score (or MSP score). The MSP score may be calculated using a simplification, that disallows gaps, of the Smith-Waterman dynamic programming algorithm [1].

To analyze the statistical behavior of MSP scores, one may model proteins as random sequences of independently selected amino acids; the amino acids, labeled 1 to 20, occur with respective probabilities $p_i$. For our theory to progress, we need to assume that the expected score $\sum_{i,j=1}^{20} p_i p_j s_{ij}$ for aligning two random amino acids is negative. This is, in fact, a desirable condition: were the expected score positive, long segment pairs would have high score independent of any biological relationship. Given this constraint on the scores, as well as the existence of at least one positive score, it is always possible to find a unique positive solution $\lambda$ to the equation

$$\sum_{i,j=1}^{20} p_i p_j e^{\lambda s_{ij}} = 1. \qquad (1)$$

The parameter $\lambda$ may be thought of as a natural scale for the scores. A second positive parameter $K$, important for the statistical theory, depends upon the $p_i$ and $s_{ij}$ and is given by a geometrically convergent infinite series [9,10,14]. A program in C for calculating $\lambda$ and $K$ is available from the author.

When the lengths $m$ and $n$ of the two sequences being compared are sufficiently large, asymptotic results concerning the distribution of the MSP score $S$ come into play [9,10,14]. The probability that $S$ is at least $x$ is then well approximated by the formula

$$\mathrm{Prob}(S \ge x) = 1 - \exp\left(-Kmn\,e^{-\lambda x}\right). \qquad (2)$$

$S$ is said to follow an extreme value distribution [23]. If only the highest scoring segment pair were ever of interest, we would need progress no further.

3. LOCAL SIMILARITY AND MULTIPLE LOCAL ALIGNMENTS

Because two sequences may share more than a single region of similarity, it is desirable to consider not only the maximal segment pair, but other segment pairs that also achieve high scores. The immediate problem is that the second, third and fourth highest scoring segment pairs are likely to be slight variations of the maximal segment pair. We therefore need a definition of when two subalignments are distinct.

The first such definition, in the context of subalignments with gaps, is due to Sellers [18], who defined the "neighborhood" of a subalignment to be the set of all subalignments it touches within a path graph. A subalignment is then considered "locally optimal" if it
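
The statistics of Section 2 can be sketched numerically. The Python below is only an illustration and is not the author's C program: it finds the positive root $\lambda$ of equation (1) by bisection and evaluates the tail probability of equation (2). The function names `solve_lambda` and `msp_pvalue` are hypothetical, the four-letter +1/-1 scoring scheme is a toy stand-in for a 20 x 20 amino acid matrix, and the value of $K$ is an arbitrary placeholder, because the infinite series that defines $K$ is not reproduced here.

```python
import math


def solve_lambda(p, s, tol=1e-12):
    """Return the unique positive root of sum_ij p[i]*p[j]*exp(lam*s[i][j]) = 1.

    p is a list of letter probabilities and s a square score matrix.
    Bisection is an assumed choice of root finder; the text does not
    say how lambda is computed in practice.
    """
    n = len(p)
    pairs = [(p[i] * p[j], s[i][j]) for i in range(n) for j in range(n)]
    expected = sum(w * score for w, score in pairs)
    if expected >= 0 or not any(score > 0 for _, score in pairs):
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(w * math.exp(lam * score) for w, score in pairs) - 1.0

    # f(0) = 0 and f is convex with negative slope at 0, so f is negative
    # just above 0 and crosses zero exactly once more. Bracket that root.
    hi = 1.0
    while f(hi) <= 0.0:
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def msp_pvalue(x, m, n, lam, K):
    """Equation (2): Prob(S >= x) = 1 - exp(-K*m*n*exp(-lam*x))."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-K * m * n * math.exp(-lam * x))


if __name__ == "__main__":
    # Toy alphabet (assumption): 4 equiprobable letters, match +1, mismatch -1.
    p = [0.25] * 4
    s = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
    lam = solve_lambda(p, s)
    K = 0.1  # placeholder value, not derived from the series for K
    print(lam, msp_pvalue(15, 300, 300, lam, K))
```

For this toy scheme equation (1) reduces to $0.25\,e^{\lambda} + 0.75\,e^{-\lambda} = 1$, whose positive root is $\lambda = \ln 3$, which gives a convenient check on the solver.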
