THE UNIVERSITY OF CHICAGO

A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

DEPARTMENT OF COMPUTER SCIENCE

BY
CHRISTOPHER BUN

CHICAGO, ILLINOIS
NOVEMBER 8, 2016

Copyright © 2016 by Christopher Bun
All Rights Reserved

To My Family And Friends

Table of Contents

LIST OF FIGURES
LIST OF TABLES
ABSTRACT

1 INTRODUCTION
  1.1 Motivation
  1.2 Research Contributions

2 BACKGROUND
  2.1 Genome Sequencing: Techniques and Data Profiles
    2.1.1 Sequencing By Synthesis
    2.1.2 Oligo-ligation Detection
    2.1.3 Single Molecule and Nanopore Sequencing
    2.1.4 Sequencing Data
  2.2 Genome Assembly
    2.2.1 Challenges
    2.2.2 Algorithms for De Novo Genome Assembly
  2.3 Computing Systems
    2.3.1 Scientific Compute Services

3 ASSEMBLYRAST FRAMEWORK
  3.1 The WASP Language for Computational Workflows
    3.1.1 Specification
  3.2 Implementation
    3.2.1 Interface Generalization
    3.2.2 Rapid Pipeline Design
    3.2.3 Universal Hyperparameter Search Driver
    3.2.4 Logic-Driven Assembly
    3.2.5 Data Types
    3.2.6 Analysis Framework
  3.3 System Design and Infrastructure
    3.3.1 AssemblyRAST Control Plane
    3.3.2 Workers
    3.3.3 Availability

4 ASSEMBLER PROFILING AND OPTIMIZATION
    4.0.1 Data
    4.0.2 Evaluation Metrics
    4.0.3 Programs
    4.0.4 Comparison
    4.0.5 Methods
    4.0.6 Discussion
  4.1 Pipelines
    4.1.1 Preprocessing
    4.1.2 Postprocessing
    4.1.3 Results

5 INTEGRATIVE ASSEMBLY ALGORITHMS
    5.0.1 Integrative Pipelines
  5.1 Hyperparameter Optimization
  5.2 Block Construction and Merging
  5.3 A Self-Tuning Ensemble De Novo Assembly Pipeline
  5.4 Discussion

6 REFERENCE-INDEPENDENT ASSEMBLY ERROR CLASSIFICATION LEARNING
  6.1 Background and Motivation
    6.1.1 Hard and Soft Genomic Variation Types
    6.1.2 Statistical Approach to De Novo Assembly Evaluation
    6.1.3 Supervised Classification
  6.2 A Novel Implementation of Error Classification Using Gradient Boosting Trees
    6.2.1 Dataset Generation
    6.2.2 Assembly Setup
    6.2.3 Preprocessing
    6.2.4 Data Labeling
    6.2.5 Discussion and Feasibility
  6.3 Results
    6.3.1 Training Model
    6.3.2 Feature Engineering and Extraction
    6.3.3 Model Performance and Hyperparameter Tuning
    6.3.4 Detection of Major Errors Produced by Assemblers
  6.4 Discussion

7 APPLICATIONS OF AN ACCURATE DE NOVO ASSEMBLY EVALUATION PROFILE
  7.1 Error Removal and Contiguity Metric Correction
    7.1.1 Contig Splitting
    7.1.2 Corrected Statistical Measures
  7.2 Likelihood and Scoring Framework For Systematic Evaluation
  7.3 Error Prediction Strategy for Assembly Reconciliation Algorithms

8 CONCLUSION
  8.1 Future

References

List of Figures

2.1 NGAx plot of V. cholerae assembly by Velvet parameter sweep
2.2 3-mer De Bruijn Graph
2.3 4-mer De Bruijn Graph
3.1 Flowchart of a read-specific assembly workflow
3.2 Pipeline Branching
3.3 Parameter Sweeps With Pipeline Combinations
3.4 The AssemblyRAST Infrastructure
3.5 Relative ALE Scores of V. cholerae assembly
3.6 AssemblyRAST Web Interface
3.7 The AssemblyRAST Web Interface facilitates user-friendly pipeline design
4.1 NGAx plot of V. cholerae assembly
5.1 NGAx plot of V. cholerae assembly by Velvet parameter sweep
5.2 Velvet Hash Length vs. NGA50 Score on Rsp HiSeq and MiSeq
5.3 NGA50 scores of pairwise mergings of V. cholerae assemblies
5.4 Smart Pipeline
6.1 An example decision tree
6.2 Generation of training data by the AssemblyML workflow
6.3 An example of a non-smooth coverage pattern over a misassembly
6.4 Discrepancies in coverage between flanking regions
6.5 Misassemblies vs. Average Contig Coverage
6.6 Contig end regions predicted as misassemblies in the Spades assembly of Singulisphaera acidiphila, but not classified as a major misassembly by QUAST
6.7 Feature Importances
6.8 Prediction Outcomes for 300 Simulated Genome Readsets
6.9 AssemblyML Prediction Outcomes for the Velvet Assembly of Singulisphaera acidiphila
6.10 REAPR FCD Prediction Outcomes for the Velvet Assembly of Singulisphaera acidiphila
6.11 Correct Misassembly Classification in the Velvet Assembly of Singulisphaera acidiphila
6.12 QUAST defines misassemblies in which inconsistencies are shorter than 1000 base pairs as local misassemblies. The trained model predicts this misassembly (as shown by the black bar).
6.13 Velvet Scaffolding Technique Flagged as Misassembly
6.14 AssemblyML Prediction Outcomes for the Spades Assembly of Singulisphaera acidiphila
6.15 REAPR FCD Prediction Outcomes for the Spades Assembly of Singulisphaera acidiphila
6.16 Mapping Quality Anomalies in the Spades Assembly of Singulisphaera acidiphila
7.1 Regions with misassemblies from Spades and Velvet assemblies (A, C) compared with their contigs broken at predicted loci (B, D). Red represents contigs that contain true misassemblies.
7.2 The Error Response Curve captures the trade-off between contiguity and correctness.
7.3 An improved assembly pipeline using AssemblyML-guided contig breaking and ERC metrics as preceding steps to block merging.

List of Tables

2.1 Profiles of Major Next Generation Sequencing Platforms
2.2 FastA Base Pair Codes
3.1 Common Wasp/Lisp Supported Expressions
3.2 Wasp contains three types of specialized extensions: type conversion, data analysis, and framework-level functions.
4.1 Comparison of NGA50 assembly scores for various genomes. Best scores are bolded.
4.2 Comparison of misassemblies for various genomes.
4.3 Various statistics for the assembly of R. sphaeroides HiSeq data. IDBA-UD generated the most contiguous set while also producing the fewest misassemblies.
4.4 Effects of preprocessing on V. cholerae NGA50. The best scores per assembler are shown in bold. Fields with '-' indicate an error generated by the assembler.
4.5 Effects of preprocessing on V. cholerae misassemblies. The fewest misassemblies per assembler are shown in bold.
6.1 Features generated from assembly and sequence data
6.2 Genomes Used to Generate the Training Set
6.3 Top 10 three-way feature interaction scores for the AssemblyML model. Weighted F-score represents the frequency with which the three features appear within the same tree, weighted by the probabilities that the nodes will be visited.
6.4 Prediction Statistics Across a Subsample of Microbe Assemblies
6.5 Top 20 Features of the XGBoost-Trained Model
7.1 Statistics of Spades and Velvet assemblies and contigs broken at predicted loci

ABSTRACT

High-throughput genetic sequencing technologies have driven a proliferation of new genomic data. From the advent of long-read Sanger sequencing to today's low-cost, short-read generation and the upcoming era of single-molecule techniques, methods to address the complex genome assembly problem have evolved alongside the technology and are introduced at an expeditious pace. These algorithms attempt to produce an accurate representation of a target genome from datasets filled with errors and ambiguities. Many of the challenges introduced, unfortunately, must be addressed through an algorithm's ad-hoc criteria and heuristics, and as a result, the output assembly hypotheses can contain significant errors. Without an inexpensive computational approach to assess the quality of a given assembly hypothesis, researchers must make do with draft-level genome projects for downstream analysis. Solving three fundamental challenges will alleviate this issue: (i) automated incorporation of algorithms from the dynamic landscape of genome assembly tools, (ii) developing optimal assembly algorithms best suited for various types, or mixtures, of sequencing data, and (iii) developing an approach to assess de novo genome assembly quality independently of a reference genome.

We provide several contributions toward this effort. We first introduce AssemblyRAST, a general compute orchestration framework and accompanying domain-specific language that facilitates rapid workflow design for genome assembly, analysis, and method discovery. Next, we demonstrate the improvement of genome assemblies through novel integrative algorithm techniques. Finally, we devise a method for reference-independent assembly evaluation and error identification through supervised learning, along with several applications that further improve existing techniques.