ebook img

Mathematical Models, Algorithms, and Statistics of Sequence Alignment By Tatiana Aleksandrovna ... PDF

107 Pages·2010·5.19 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Mathematical Models, Algorithms, and Statistics of Sequence Alignment By Tatiana Aleksandrovna ...

Mathematical Models, Algorithms, and Statistics of Sequence Alignment By Tatiana Aleksandrovna Orlova Specialist Saratov State University 2002 Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Mathematics College of Arts and Sciences University of South Carolina 2010 Accepted by: Eva Czabarka, Director of Thesis Joshua Cooper, Second Reader James Buggy, Interim Dean of the Graduate School Acknowledgments I would like to thank Dr. Eva Czabarka and Dr. Joshua Cooper for their helpful advice and patience. ii Abstract The problem of biological sequence comparison arises naturally in an attempt to explain many biological phenomena. Due to the combinatorial structure and pat- tern preserving properties of the sequences it has attracted not only biologists, but also mathematicians, statisticians and computer scientists. In this work we study one of the most effective tools widely used for comparison of biological sequences - sequence alignment. We present the basic theory of sequence alignment from com- putational, biological, and statistical perspectives. We will also present and analyze results of computer simulations that effectively illustrate one possible application of this theory. iii Contents Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Chapter 1. Introduction: Where Mathematics Meets Biology . . 1 1.1. The Organized Complexity of Life . . . . . . . . . . . . . . . . . . . 1 1.2. Biological Sequences and Their Patterns . . . . . . . . . . . . . . . . 4 1.3. Biological Sequence Comparison . . . . . . . . . . . . . . . . . . . . 9 Chapter 2. Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . 11 2.1. Pairwise Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . 11 2.2. Scoring Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3. Alignment Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Chapter 3. Statistics of Local Sequence Alignment . . . . . . . . 50 3.1. Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2. Random Variables and Random Processes . . . . . . . . . . . . . . . 53 3.3. Ungapped Local Alignment Scores Statistics . . . . . . . . . . . . . . 62 3.4. Gapped Local Alignment Scores Statistics . . . . . . . . . . . . . . . 78 Chapter 4. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Appendix A. C++ Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 iv List of Tables Table 1.1 Amino acids and their abbreviations. . . . . . . . . . . . . . . 7 Table 3.1 Robinson & Robinson amino acid counts. . . . . . . . . . . . . 80 Table 3.2 Simulations results. . . . . . . . . . . . . . . . . . . . . . . . . 80 Table 3.3 Known estimates for λ and µ. . . . . . . . . . . . . . . . . . . 81 v List of Figures Figure 1.1 Central dogma of molecular biology . . . . . . . . . . . . . . . 4 Figure 1.2 DNA chemical structure . . . . . . . . . . . . . . . . . . . . . 5 Figure 1.3 DNA complementarity . . . . . . . . . . . . . . . . . . . . . . 5 Figure 1.4 Lambda repressor protein bound to a lambda operator DNA sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Figure 2.1 BLOSUM62 substitution matrix . . . . . . . . . . . . . . . . . 22 Figure 2.2 Alignment graph for sequences ATACTGG and GTCCGTG. . 28 Figure 2.3 Another alignment graph for ATACTGG and GTCCGTG. . . 29 Figure 2.4 Extended alignment graph. . . . . . . . . . . . . . . . . . . . 33 Figure 2.5 Extended alignment network . . . . . . . . . . . . . . . . . . . 45 Figure 2.6 The optimal global path. . . . . . . . . . . . . . . . . . . . . . 46 Figure 2.7 Global alignment score matrix (S(i,j)) . . . . . . . . . . . . . 47 Figure 2.8 The optimal local path. . . . . . . . . . . . . . . . . . . . . . 48 ˆ Figure 2.9 Local alignment score matrix (S(i,j)) . . . . . . . . . . . . . 49 Figure 3.1 Ungapped alignment and its accumulative score. . . . . . . . . 63 Figure 3.2 Random walk corresponding to the ungapped local alignment. 66 Figure 3.3 Counts of large alignment scores, L = 400 . . . . . . . . . . . 82 Figure 3.4 Probability density plot of simulated and estimated data for sequences of length L=40. . . . . . . . . . . . . . . . . . . . . 83 vi Figure 3.5 Cumulativeprobabilitydistributionplotofsimulatedandestimated data for sequences of length L = 40. . . . . . . . . . . . . . . 83 Figure 3.6 Probability density plot of simulated and estimated data for sequences of length L = 100. . . . . . . . . . . . . . . . . . . . 84 Figure 3.7 Cumulativeprobabilitydistributionplotofsimulatedandestimated data for sequences of length L = 100. . . . . . . . . . . . . . . 84 Figure 3.8 Probability density plot of simulated and estimated data for sequences of length L = 200. . . . . . . . . . . . . . . . . . . . 85 Figure 3.9 Cumulativeprobabilitydistributionplotofsimulatedandestimated data for sequences of length L = 200. . . . . . . . . . . . . . . 85 Figure 3.10 Probability density plot of simulated and estimated data for sequences of length L = 400. . . . . . . . . . . . . . . . . . . . 86 Figure 3.11 Cumulativeprobabilitydistributionplotofsimulatedandestimated data for sequences of length L = 400. . . . . . . . . . . . . . . 86 vii Chapter 1 Introduction: Where Mathematics Meets Biology “Biology is mathematics’ next physics, only better”. Joel Cohen [12] 1.1. The Organized Complexity of Life In 1953 Francis Crick, Rosalind Franklin, James Watson, and Maurice Wilkins determined the structure of DNA. This important event became a start of theoret- ical study of biological sequences. In his book “Life Itself. Its Origin and Nature” [14] Crick gave a very intuitive explanation of the general nature of life from an information-theoretical perspective, which can serve as a perfect start in establishing a connection between biology and mathematics. He starts with the idea that, the high degree of organized complexity of life implies the need to store and process large amount of information. Thus, Nature has to provide an effective mechanism for this important task. Such a mechanism must 1 perform two functions: information storage and replication. Information storage must be stable, which means that the information must be stored over a sufficiently long period of time, and replication must be accurate, which means that the copying can make only a small number of errors called mutations. Lack of accuracy can lead to accumulated errors and decay, but perfect accuracy is not required. Moreover, we musthaveasmallnumberofmutationsandrequirethattheymustbecopiedfaithfully through generations. Most mutations lead to maladaptive changes but some might lead to improvement. There is no mechanism to produce only favorable mutations, which leaves chance to be the only source of true novelty. It is always possible for any rare chance event to become very common once this change is favored by the environment. This gives the living system its primary mechanism to improve itself. Note that neither replication nor mutation alone can explain both similarity and diversity of life, as we can see it everywhere around. Thus, the presence of both mechanisms is required [14]. Effectively dealing with large amounts of information in this manner puts some requirements on the mechanism. Recall that it needs to provide the system with easy storage, easy replication, easy error correction, and stability of information. The key idea is that an efficient way to do this requires the use of the combinatorial principle. That is, we express the information by using only a small number of types of standard units, but we combine them in many different ways. Similar to how we use words to form sentences in human language, life uses linear strings of standard units. It is required that in any language suitable for communication the number of words must be relatively small compared to the number of sentences they can produce. Nature follows the same principle requiring a small number of basic units, but the number of structures these units can form must be very large. Moreover, in order to carry information these structures must not be completely regular. The natural question is what Nature chose for these basic units and why. 2 Carbon is one the most abundant elements in the universe. It has a perfect balance between valence electrons available and valence electrons needed to fill the outer shell. This allows carbon to excel in bonding with itself and other atoms, and to form long chains, so it can build arbitrarily long and complex structures. Moreover, energy required to make and break carbon bonds is at the appropriate level for building molecules which are stable and reactive [46]. Note, that there are other elements that possess some of these properties but only carbon has all of them [47]. Taking all the above into account Nature’s choice of carbon compounds as units is obvious. Crick concludes that information transmission with the required properties is naturally achieved through chemical reactions between combinatorial structures made of carbon compounds [14]. We will only consider one-dimensional linear structures, known as biopolymers, or biological sequences. 3

Description:
Ungapped alignment and its accumulative score . 63. Figure 3.2. Random walk corresponding to the ungapped local alignment. 66. Figure 3.3.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.