Event-based Phylogeny Inference and Multiple Sequence Alignment

Phong Nguyen Duc
Computer Science Department
Brown University

Submitted in partial fulfillment of the requirements for the Degree of Master of Science in the Department of Computer Science at Brown University

Providence, Rhode Island
May 2012

This thesis by Phong Nguyen Duc is accepted in its present form by the Computer Science Department as satisfying the thesis requirements for the degree of Master of Science.

Date    Franco P. Preparata, Advisor

Approved by the Graduate Council

Date    Peter M. Weber, Dean of the Graduate School

VITA

Phong Nguyen Duc was born in Haiphong city, Vietnam, on 17 August 1989. After completing his high school studies at the High School for the Gifted (Ho Chi Minh City) in 2007, he entered the National University of Singapore, where he studied Computational Biology. In 2011, he entered the Graduate School at Brown University, Computer Science Department, under the concurrent degree agreement between Brown University and the National University of Singapore.

Preface

Since the identification of DNA/RNA as genetic material, deciphering the code of life has been a major goal of biologists. One approach that has been particularly successful in studying DNA sequences is to compare related sequences from different organisms. Sequence alignment, specifically pairwise alignment, is among the earliest tools developed in bioinformatics. However, the generalization of pairwise alignment to multiple sequence alignment is not straightforward. The comparison of multiple sequences is expressed in two different but related problems: multiple sequence alignment, which finds shared homologous regions among the input sequences, and phylogeny inference, which finds the order in which each sequence diverged from a common parent. These two problems have been under intensive research over the last three decades.
However, multiple sequence alignment and phylogeny inference are not completely solved problems, in the sense that no single algorithm stands out, practically and theoretically, as the best for either problem.

My first encounter with the phylogeny inference problem was in 2010, when Prof. Ken Sung at the National University of Singapore gave us an assignment to infer the phylogeny of dengue viruses across the world. At that time I noticed that not all regions in the sequences can be aligned reliably, due to heavy mutations and a high degree of divergence. This problem is more serious with long input sequences.

Prof. Franco P. Preparata introduced the problem to me again in 2011, this time at Brown University. He was looking into how ancestor sequences can be constructed to help build the phylogeny. By the end of 2011, we had some idea of how to generate putative ancestor sequences for the internal nodes of the phylogeny, assuming there are no insertions or deletions. In Spring 2012, I found a way to reliably identify insertion/deletion events. This was then used to extend our previous algorithm to handle insertions and deletions. The final algorithm is a novel tool that suggests a complete evolutionary hypothesis for the input sequences, consisting of a phylogeny and the placement of mutations on the edges of the resulting tree.

As described above, this thesis started with the initial insights from Prof. Franco. The discussions with him provided me with new insights, as well as support for my ideas. I cannot thank him enough for these discussions, for the courses he recommended, and for his time proofreading and editing this thesis. He has been a great mentor to me.

Special thanks to previous teachers who nurtured my interest in genomics and bioinformatics: Prof. Ken Sung (NUS), Dr. José Dinneny (NUS), and Prof. Sorin Istrail (Brown University).

This thesis would not have been possible without the financial support from the Singapore Government and SAS Institute, Singapore.
Last but not least, I would like to thank my beloved family and friends who have been a constant source of love and support. I am forever indebted to them.

Contents

1 Introduction
  1.1 Objectives
  1.2 Organization
  1.3 Definitions
2 Scoring model
  2.1 Scoring of Pairwise Alignment
    2.1.1 Hamming distance
    2.1.2 Levenshtein distance
    2.1.3 General gap penalty
  2.2 Multiple Sequence Alignment
    2.2.1 Star graph approximation and sum-of-pairs
    2.2.2 Affine gap in multiple sequence alignment
3 Datasets
4 Multiple Sequence Alignment approaches
  4.1 Dynamic Programming Approach
  4.2 Progressive Approach
    4.2.1 Profile representation
  4.3 Consistency Approach
  4.4 Iterative refinement
  4.5 Anchor based alignment
    4.5.1 Finding Insertion/Deletion events
    4.5.2 Gap detection algorithm
5 Phylogeny inference methods
  5.1 Maximum Parsimony
  5.2 Maximum likelihood
  5.3 Clustering methods
  5.4 Neighbor Joining and its variants
    5.4.1 Centroid method
    5.4.2 Parsimony method
    5.4.3 Parsimony method on naive NJ tree
    5.4.4 Perfect NJ method
  5.5 Evaluation
6 Combining multiple sequence alignment with phylogeny inference
  6.1 Generalized Fitch algorithm
    6.1.1 Singleton Profile
    6.1.2 Profile alignment
  6.2 Maximum parsimony with insertion/deletion events
    6.2.1 Singleton profile
    6.2.2 Profile alignment
7 Conclusions

List of Tables

4.1 Example of a similarity matrix
4.2 Example of a Dynamic Programming table for pairwise alignment

List of Figures

4.1 Alignment path for 3 sequences [Lee et al., 2002]
4.2 Fractional count’s problem with handling gap
4.3 Example of DAG representation of a profile
4.4 Weighting in T-COFFEE
4.5 Weighting in T-COFFEE (cont)
4.6 Probabilistic consistency transformation in PROBCONS
4.7 LTP subtree consisting of roughly 20 sequences
4.8 LTP restricted to a sample of 20 leaves
5.1 Hamming distance as edge weights
5.2 NJNJ workflow
5.3 Perfect NJ workflow
5.4 Modified RF-measure for NJ variants
5.5 Modified RF-measure for NJ variants (cont)
5.6 Proportional RF-measure over NJ variants
6.1 MUSCLE workflow
6.2 Generalized Fitch’s result
6.3 The profile of sequence S with anchor sequence S_0
6.4 Condition for removing regions
6.5 Gap lengths in case of mismatches
6.6 Example of gap length tree