Graduate Theses and Dissertations
Iowa State University Capstones, Theses and Dissertations

2011

Modular Algorithms for Biomolecular Network Alignment

Fadi George Towfic
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd
Part of the Computer Sciences Commons

Recommended Citation
Towfic, Fadi George, "Modular Algorithms for Biomolecular Network Alignment" (2011). Graduate Theses and Dissertations. 12031.
https://lib.dr.iastate.edu/etd/12031

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Modular algorithms for biomolecular network alignment

by

Fadi George Towfic

A dissertation submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major: Bioinformatics and Computational Biology

Program of Study Committee:
Vasant Honavar, Co-major professor
M. Heather West Greenlee, Co-major professor
Drena Dobbs
Robert Jernigan
Christopher Tuggle

Iowa State University
Ames, Iowa
2011

Copyright © Fadi George Towfic, 2011. All rights reserved.

DEDICATION

This dissertation is dedicated to my parents, for their endless support, constant encouragement and sage advice.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
ABSTRACT
CHAPTER 1. INTRODUCTION
    1.1 Overview of network models and systems biology
    1.2 The network alignment problem
    1.3 Formal mathematical definition of network alignment
    1.4 Brief overview of state-of-the-art methods
        1.4.1 MaWISh
        1.4.2 NetworkBLAST-M
        1.4.3 Graemlin
        1.4.4 GRAAL
    1.5 Limitations of current methods
    1.6 Significant contributions of dissertation
        1.6.1 First highly modular algorithm in the field
        1.6.2 Highly scalable algorithm
        1.6.3 First highly flexible algorithm
        1.6.4 Highly portable
        1.6.5 High accuracy in terms of biological performance
        1.6.6 Applied to important biological problems
CHAPTER 2. BIOMOLECULAR NETWORK ALIGNMENT (BiNA) TOOLKIT
    2.1 Background and Motivation
    2.2 Problem Formulation
    2.3 Algorithm
        2.3.1 Divide: Partitioning methods
        2.3.2 Conquer: Scoring Functions
    2.4 Summary and Discussion
CHAPTER 3. COMPARATIVE ANALYSIS OF TOPOLOGICAL VS. NODE LABEL-BASED NETWORK ALIGNMENT
    3.1 Introduction
    3.2 Materials and methods
        3.2.1 Network Alignment Algorithm
        3.2.2 Scoring Functions
        3.2.3 Datasets
        3.2.4 Evaluation of Alignment
    3.3 Results
        3.3.1 Performance as Measured by GO Enrichment
        3.3.2 Reconstruction of Phylogenetic Relationships
    3.4 Discussion and conclusions
CHAPTER 4. DETECTION OF GENE ORTHOLOGY FROM GENE COEXPRESSION AND PROTEIN INTERACTION NETWORKS
    4.1 Introduction
    4.2 Materials and methods
        4.2.1 Dataset
        4.2.2 Graph representation of BLAST orthologs
        4.2.3 Random walk graph kernel score
        4.2.4 BaryCenter score
        4.2.5 Betweenness score
        4.2.6 Degree distribution score
        4.2.7 HITS score
        4.2.8 Scoring candidate orthologs based on sequence and network similarity
        4.2.9 Ortholog detection
        4.2.10 Performance evaluation
    4.3 Analysis and results
        4.3.1 Reconstructing KEGG orthologs using BLAST
        4.3.2 Reconstructing KEGG orthologs using sequence, protein-protein interaction network, and gene-coexpression data
    4.4 Discussion and future work
CHAPTER 5. B CELL LIGAND GENE COEXPRESSION NETWORKS REVEAL REGULATORY PATHWAYS FOR LIGAND PROCESSING
CHAPTER 6. TOOLS
    6.1 BiNA webserver
    6.2 BiNA program
CHAPTER 7. CONCLUSIONS
    7.1 Significant contributions of dissertation
        7.1.1 First highly modular algorithm in the field
        7.1.2 Highly scalable algorithm
        7.1.3 First highly flexible algorithm
        7.1.4 Highly portable
        7.1.5 High accuracy in terms of biological performance
        7.1.6 Applied to important biological problems
    7.2 Open problems
        7.2.1 Evaluation methods
        7.2.2 Applicability to more network models
        7.2.3 More detailed network models
        7.2.4 Rapid comparisons
        7.2.5 Integrated pipeline for analysis and visualization
APPENDIX. ADDITIONAL MATERIAL FOR CHAPTER 6
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1  Comparison of Graph Kernel Performance using BLAST to match initial node centers in K-Hop alignment between human (Hs), mouse (Mm), yeast (Sc) and fly (Dm). Bold entries are adapted from our previous results on K-hop alignments (152). The methods are denoted as SP (Shortest Path), RW (Random Walk), PR (Page Rank), KL (Kullback–Leibler divergence), Pearson (Pearson correlation), Spearman (Spearman rank correlation) and Chi (Chi-squared test statistic).

Table 3.2  Comparison of Graph Kernel Performance using pure topological alignment.

Table 3.3  Comparison of the bootstrap performance in reconstructing phylogenetic relationships between mouse-human and fly-yeast branches.

Table 4.1  Performance of the Reciprocal BLAST hit method on the fly, yeast, human and mouse protein-protein interaction datasets from DIP as well as the gene coexpression networks for mouse and human from GEO.

Table 4.2  Performance of the Reciprocal BLAST hit score as a feature to the decision tree (j48), Naive Bayes (NB), Support Vector Machine (SVM) and Ensemble classifiers on the fly, yeast, human and mouse protein-protein interaction datasets from DIP as well as the gene coexpression networks for mouse and human from GEO. Values in parentheses are the ranks for the classifiers on the specified dataset.

Table 4.3  Performance of all the combined features (Reciprocal BLAST hit score, 1 and 2 hop shortest path graph kernel score, 1 and 2 hop random walk graph kernel score, BaryCenter, betweenness, degree distribution and HITS) as input to the decision tree (j48), Naive Bayes (NB), Support Vector Machine (SVM) and Ensemble classifiers on the fly, yeast, human and mouse protein-protein interaction datasets from DIP as well as the gene coexpression networks for mouse and human from GEO. Values in parentheses are the ranks for the classifiers on the specified dataset.

Table 4.4  KEGG orthologs detected using the Ensemble classifier utilizing all network features. The orthologs shown in the above table were missed by the BLAST logistic regression classifier.

Table 5.1  Full list of the ligands and their abbreviations used in the experiments analyzed in this paper. This list was adapted from Lee et al. (104).

Table 5.2  List of pathways detected based on high-intensity probes from the microarray data. Please see Table 1 in supplementary material for a more detailed version of this table with pathway names and relative number of genes enriched in the pathway based on the data.

Table 5.3  Top matched ligands based on expression patterns in the consensus tree shown in Figure 5. The KEGG pathway categories correspond to the pathway categories highlighted in Table 2. Please see Table 3 in the supplementary material for an expanded version of this table.

Table A.1  Full list of networks and the number of neighborhoods utilized for comparing the networks for figure 2 in supplementary material.

Table A.2  List of pathways detected based on high-intensity probes from the microarray data. As can be seen from the table, many of the pathways enriched in high-intensity genes are known to be implicated in the development of the immune system and processing of antigens.

Table A.3  Top matched ligands based on expression patterns in the consensus tree shown in figure 5 in the paper. The KEGG pathway categories correspond to the pathway categories highlighted in table 2 in the supplementary material and table 2 in the paper.