STATE-OF-THE-ART PROTEIN SECONDARY-STRUCTURE PREDICTION USING A NOVEL TWO-STAGE ALIGNMENT AND MACHINE-LEARNING METHOD

By
AMI M. GATES

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2008

© 2008 Ami M. Gates

To My Family and Friends

ACKNOWLEDGMENTS

I would like to dedicate this overwhelming moment to my loving and supportive family, and to my wonderful friends. I would like to thank my parents, Eileen and Myke, who always supported my goals and listened endlessly; my brother Josh, who offered continuous encouragement; and my late brother Chad, whose last words to me were “PhD”. I would like to thank my dear friends Amos, Karina, Jesse, Neko, and Nathan for standing by me, and I would like to thank my committee chair, Arunava Banerjee, who always believed in me.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Introduction
    Proteins
    Protein Secondary Structure
    Machine Learning and Protein Secondary Structure Prediction
    Protein Secondary Structure Prediction Methods
    Dynamic Alignment-Based Protein Window-SVM Integrated Prediction for Three State Protein Secondary Structure
    Overview

2 REVIEW OF THE BIOLOGY OF PROTEINS
    Brief Biology of Proteins
    From DNA to Protein
    Protein and Amino Acids
    Protein Folding
    Secondary Structure
    Protein Evolution and Sequence Conservation

3 LITERATURE REVIEW
    Problem of Secondary Structure Prediction
    Literature Review of Secondary Structure Prediction
        Methods Preceding 1993
        Methods Proceeding 1993
            Neural network methods from 1993 – 2007
            Summary of neural network based methods
        Support Vector Machine Methods from 2001 – 2007
        Summary of SVM Based Methods
        Combined or Meta Methods
        Direct Homology Based Methods

4 MATERIALS AND METHODS
    Introduction
    Protein Data and Databanks
    Datasets
    Protein Identity, Similarity, and Homology
    Multiple Sequence Alignment and PSI-BLAST
        Basic Local Alignment Search Tool (BLAST) Algorithm
            BLAST: step 1
            BLAST: step 2
            BLAST: step 3
        Position-Specific Iterative BLAST (PSI-BLAST) Algorithm
            Creating the PSSM
            Summary of PSI-BLAST
    Input Vectors and Sliding Windows
    Accuracy Measures
    Machine Learning Techniques
        Support Vector Machines
        Using SVMs in Secondary Structure Prediction
    Neural Networks
    Information Theory and Prediction

5 NEW SECONDARY STRUCTURE PREDICTION METHOD DARWIN
    Dynamic Alignment-Based Protein Window-SVM Integrated Prediction for Three State Protein Secondary Structure: A New Prediction Server.
    Introduction and Motivation of DARWIN
    Methods and Algorithms used in DARWIN
        Phases of DARWIN: Stage 1
            Phase 1
            Phase 2a: If at least one viable template is found:
            Phase 2b: If no viable template is found:
            Phase 3
        Phases of DARWIN Stage 2: Fixed-Size Fragment Analysis
            Fragment size selection
            Step 1
            Step 2
            Step 3
    Ensemble of Support Vector Machines in DARWIN
        The SVM Kernel and Equation
        Training the SVM and Using PSI-BLAST Profiles
    Datasets and Measures of Accuracy for DARWIN
    Experiments, Measures, and Results
    Conclusions on DARWIN

6 DARWIN WEB SERVER
    Introduction
    Using the Server
    Design of the DARWIN Web Service

7 DISCUSSION AND CONCLUSION
    Introduction
    Protein Secondary Structure Prediction Progress
    Strength of DARWIN
    Future Work and Improvements

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
5-1 Detailed average prediction results for DARWIN.
5-2 Average prediction results for dataset EVA5 for DARWIN compared to top published indirect homology method results.
5-3 Average prediction results for dataset EVA6 for DARWIN compared to top published indirect homology method results.

LIST OF FIGURES

Figure
2-1 Simplification of the processes of transcription and translation.
2-2 Once a polypeptide is created through the process of translation, it is released into the cytosol and is known as the primary or linear sequence.
2-3 The 20 known amino acids. Adapted from Voet and Voet, 2005.
2-4 Torsion angles phi and psi that offer rotational flexibility between amino acid peptide bonds. Adapted from Voet and Voet, 2005.
2-5 Ramachandran Plot for a set of three alanine amino acids joined as a tripeptide.
2-6 Example of a helical protein secondary structure. The hydrogen bonds are denoted with dashed lines.
2-7 Sheet protein secondary structure, with hydrogen bonds noted with dashed lines. Adapted from Voet and Voet, 2005.
3-1 Example of a linear sequence of amino acids, each accompanied by a secondary structure label of H, C, or E.
4-1 Protein Data Bank (PDB) website. This area is a repository for known protein structures and related protein information.
4-2 Matrix known as BLOSUM 62, a similarity matrix derived from small local blocks of aligned sequences that share at least 62% identity.
4-3 Example of a PSI-BLAST generated alignment between a query protein and a subject protein.
4-4 Example of a PSI-BLAST generated position specific scoring matrix (PSSM).
4-5 Example of the BLAST algorithm. A given query protein is analyzed by looking at all three amino acid word sets.
4-6 Visual example of the production of input vectors that can be used to train and test machine learning constructs.
4-7 Visual example of decision boundary between two classes and the margin that is maximized.
5-1 The PSI-BLAST example alignment portion. Several areas in a given alignment can result in missing information.
5-2 Histogram for each dataset, EVA5 and EVA6 displays the percentage of proteins predicted by DARWIN with given accuracy.
6-1 Image of the DARWIN Web page that allows Internet based graphical user interface with the DARWIN service.