BRICS
Basic Research in Computer Science

Algorithms in Computational Biology

Christian N. S. Pedersen

BRICS Dissertation Series DS-00-4
ISSN 1396-7002
March 2000

Copyright © 2000, Christian N. S. Pedersen.
BRICS, Department of Computer Science, University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Dissertation Series publications. Copies may be obtained by contacting:

BRICS
Department of Computer Science
University of Aarhus
Ny Munkegade, building 540
DK–8000 Aarhus C
Denmark
Telephone: +45 8942 3360
Telefax: +45 8942 3255
Internet: [email protected]

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:

http://www.brics.dk
ftp://ftp.brics.dk
This document in subdirectory DS/00/4/

Algorithms in Computational Biology
Christian Nørgaard Storm Pedersen
Ph.D. Dissertation
Department of Computer Science
University of Aarhus
Denmark

Algorithms in Computational Biology
A Dissertation Presented to the Faculty of Science of the University of Aarhus in Partial Fulfillment of the Requirements for the Ph.D. Degree
by Christian Nørgaard Storm Pedersen
August 31, 1999

Abstract

In this thesis we are concerned with constructing algorithms that address problems of biological relevance. This activity is part of a broader interdisciplinary area called computational biology, or bioinformatics, that focuses on utilizing the capacities of computers to gain knowledge from biological data. The majority of problems in computational biology relate to molecular or evolutionary biology, and focus on analyzing and comparing the genetic material of organisms.

One deciding factor in shaping the area of computational biology is that DNA, RNA and proteins, which are responsible for storing and utilizing the genetic material in an organism, can be described as strings over finite alphabets. The string representation of biomolecules allows for a wide range of algorithmic techniques concerned with strings to be applied for analyzing and comparing biological data. We contribute to the field of computational biology by constructing and analyzing algorithms that address problems of relevance to biological sequence analysis and structure prediction.

The genetic material of organisms evolves by discrete mutations, most prominently substitutions, insertions and deletions of nucleotides. Since the genetic material is stored in DNA sequences and reflected in RNA and protein sequences, it makes sense to compare two or more biological sequences to look for similarities and differences that can be used to infer the relatedness of the sequences. In the thesis we consider the problem of comparing two sequences of coding DNA when the relationship between DNA and proteins is taken into account. We do this by using a model that penalizes an event on the DNA by the change it induces on the encoded protein. We analyze the model in detail, and construct an alignment algorithm that improves on the existing best alignment algorithm in the model by reducing its running time by a quadratic factor. This makes the running time of our alignment algorithm equal to the running time of alignment algorithms based on much simpler models.

If a family of related biological sequences is available, it is natural to derive a compact characterization of the sequence family. Among other things, such a characterization can be used to search for unknown members of the sequence family.
One widely used way to describe the characteristics of a sequence family is to construct a hidden Markov model that generates members of the sequence family with high probability and non-members with low probability. In the thesis we consider the general problem of comparing hidden Markov models. We define novel measures between hidden Markov models, and show how to compute them efficiently using dynamic programming. Since hidden Markov models are widely used to characterize biological sequence families, our measures and methods for comparing hidden Markov models immediately apply to comparison of entire biological sequence families.

Besides comparing sequences and sequence families, we also consider problems of finding regularities in a single sequence. Looking for regularities in a single biological sequence can be used to reconstruct part of the evolutionary history of the sequence or to identify the sequence among other sequences. In the thesis we focus on general string problems motivated by biological applications because biological sequences are strings. We construct an algorithm that finds all maximal pairs of equal substrings in a string, where each pair of equal substrings adheres to restrictions in the number of characters between the occurrences of the two substrings in the string. This is a generalization of finding tandem repeats, and the running time of the algorithm is comparable to the running time of existing algorithms for finding tandem repeats. The algorithm is based on a general technique that combines a traversal of a suffix tree with efficient merging of search trees. We use the same general technique to construct an algorithm that finds all maximal quasiperiodic substrings in a string. A quasiperiodic substring is a substring that can be described as concatenations and superpositions of a shorter substring.
Our algorithm for finding maximal quasiperiodic substrings has a running time that is a logarithmic factor better than the running time of the existing best algorithm for the problem.

Analyzing and comparing the string representations of biomolecules can reveal a lot of useful information about the biomolecules, although the three-dimensional structures of biomolecules often reveal additional information that is not immediately visible from their string representations. Unfortunately, it is difficult and time-consuming to determine the three-dimensional structure of a biomolecule experimentally, so computational methods for structure prediction are in demand. Constructing such methods is also difficult, and often results in the formulation of intractable computational problems. In the thesis we construct an algorithm that improves on the widely used mfold algorithm for RNA secondary structure prediction by allowing a less restrictive model of structure formation without an increase in the running time. We also analyze the protein folding problem in the two-dimensional hydrophobic-hydrophilic lattice model. Our analysis shows that several complicated folding algorithms do not produce better foldings in the worst case, in terms of free energy, than an existing much simpler folding algorithm.

Acknowledgments

The last eight years have passed away quickly. I have learned a lot of interesting things from a lot of interesting people, and I have had many opportunities for traveling to interesting places to attend conferences and workshops. There are many who deserve to be thanked for their part in making my period of study a pleasant time. Here I can only mention a few of them.

Thanks to Rune Bang Lyngsø for being my office mate for many years and for being my co-worker on several projects that range from mandatory assignments in undergraduate courses, to research papers included in this thesis.
I have known Rune since we ended up in the same artillery battery almost ten years ago, and even though he plans to spend his future far away from Denmark, I hope that we can continue to collaborate on projects. Thanks to Gerth Stølting Brodal for always being willing to discuss computer science. Working with Gerth on two of the papers included in this thesis was very inspiring. Besides a lot of good discussions on topics related to our papers, we also spent much time discussing anything from the practical aspects of installing and running Linux on notebooks to computer science in general. Thanks to Ole Caprani for many interesting discussions on conducting and teaching computer science, for letting me be a teaching assistant in his computer architecture course for many years, and for pointing me in the direction of computational biology. Thanks to Jotun Hein for keeping me interested in computational biology by being an always enthusiastic source of inspiration. Also thanks to Lars Michael Kristensen and the other Ph.D. students, and to everybody else who is responsible for the good atmosphere at BRICS and DAIMI.

Thanks to Dan Gusfield for hosting me at UC Davis for seven months and for a lot of inspiring discussions while I was there. Also thanks to Fred Roberts for hosting me at DIMACS for two months. I enjoyed visiting both places and I met a lot of interesting people, most notably Jens Stoye, who was my office mate for seven months at UC Davis. We spent a lot of time discussing many aspects of string matching and computational biology. An outcome of these discussions is included as a paper in this thesis. Jens also had a car which we used for sightseeing trips, among those a nice skiing trip to Lake Tahoe.

Last, but not least, I would like to thank my advisor Sven Skyum for skillfully guiding me through the four years of the Ph.D. program.
His willingness to let me roam through the various areas of computational biology while making sure that I never completely lost track of the objective has certainly been an important part of the freedom I have enjoyed during my Ph.D. program.

Christian Nørgaard Storm Pedersen,
Århus, August 31, 1999.

My thesis defense was held on March 9, 2000. The thesis evaluation committee was Erik Meineche Schmidt (Department of Computer Science, University of Aarhus), Anders Krogh (Center of Biological Sequence Analysis, Technical University of Denmark), and Alberto Apostolico (Department of Computer Science, University of Padova and Purdue University). I would like to thank all three members of the committee for attending the defense and for their nice comments about the work presented in the thesis. This final version of the thesis has been subject to minor typographical changes and corrections. Furthermore, it has been updated with current information about the publication status of the included papers.

Christian Nørgaard Storm Pedersen,
Århus, Summer 2000.