Bioinformatics with Python Cookbook
Third Edition

Use modern Python libraries and applications to solve real-world computational biology problems

Tiago Antao

BIRMINGHAM - MUMBAI

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Devika Battike
Senior Editor: David Sugarman
Content Development Editor: Joseph Sunil
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Shankar Kalbhor
Marketing Coordinator: Priyanka Mhatre

First published: June 2015
Second edition: November 2018
Third edition: September 2022

Production reference: 1090922

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-80323-642-1

www.packt.com

Contributors

About the author

Tiago Antao is a bioinformatician who is currently working in the field of genomics.
A former computer scientist, Tiago moved into computational biology with an MSc in bioinformatics from the Faculty of Sciences at the University of Porto, Portugal, and a PhD on the spread of drug-resistant malaria from the Liverpool School of Tropical Medicine, UK. After his doctorate, Tiago worked with human datasets at the University of Cambridge, UK, and with mosquito whole-genome sequencing data at the University of Oxford, UK, before helping to set up the bioinformatics infrastructure at the University of Montana, USA. He currently works as a data engineer in the biotechnology field in Boston, MA. He is one of the co-authors of Biopython, a major bioinformatics package written in Python.

About the reviewers

Urminder Singh is a bioinformatician, computer scientist, and developer of multiple open source bioinformatics tools. His educational background encompasses physics, computer science, and computational biology degrees, including a Ph.D. in bioinformatics from Iowa State University, USA. His diverse research interests include novel gene evolution, precision medicine, sociogenomics, machine learning in medicine, and developing tools and algorithms for big heterogeneous data. You can visit him online at urmi-21.github.io.

Tiffany Ho works as a bioinformatics associate at Embark Veterinary. She holds a BSc from the University of California, Davis in genetics and genomics, and an MPS from Cornell University in plant breeding and genetics.

Table of Contents

Preface

1 Python and the Surrounding Software Ecology
  Installing the required basic software with Anaconda
    Getting ready
    How to do it...
    There's more...
  Installing the required software with Docker
    Getting ready
    How to do it...
    See also
  Interfacing with R via rpy2
    Getting ready
    How to do it...
    There's more...
    See also
  Performing R magic with Jupyter
    Getting ready
    How to do it...
    There's more...
    See also

2 Getting to Know NumPy, pandas, Arrow, and Matplotlib
  Using pandas to process vaccine-adverse events
    Getting ready
    How to do it...
    There's more...
    See also
  Dealing with the pitfalls of joining pandas DataFrames
    Getting ready
    How to do it...
    There's more...
  Reducing the memory usage of pandas DataFrames
    Getting ready
    How to do it...
    See also
  Accelerating pandas processing with Apache Arrow
    Getting ready
    How to do it...
    There's more...
  Understanding NumPy as the engine behind Python data science and bioinformatics
    Getting ready
    How to do it...
    See also
  Introducing Matplotlib for chart generation
    Getting ready
    How to do it...
    There's more...
    See also

3 Next-Generation Sequencing
  Accessing GenBank and moving around NCBI databases
    Getting ready
    How to do it...
    There's more...
    See also
  Performing basic sequence analysis
    Getting ready
    How to do it...
    There's more...
    See also
  Working with modern sequence formats
    Getting ready
    How to do it...
    There's more...
    See also
  Working with alignment data
    Getting ready
    How to do it...
    There's more...
    See also
  Extracting data from VCF files
    Getting ready
    How to do it...
    There's more...
    See also
  Studying genome accessibility and filtering SNP data
    Getting ready
    How to do it...
    There's more...
    See also
  Processing NGS data with HTSeq
    Getting ready
    How to do it...
    There's more...

4 Advanced NGS Data Processing
  Preparing a dataset for analysis
    Getting ready
    How to do it...
  Using Mendelian error information for quality control
    How to do it...
    There's more...
  Exploring the data with standard statistics
    How to do it...
    There's more...
  Finding genomic features from sequencing annotations
    How to do it...
    There's more...
  Doing metagenomics with QIIME 2 Python API
    Getting ready
    How to do it...
    There's more...

5 Working with Genomes
  Technical requirements
  Working with high-quality reference genomes
    Getting ready
    How to do it...
    There's more...
    See also
  Dealing with low-quality genome references
    Getting ready
    How to do it...
    There's more...
    See also
  Traversing genome annotations
    Getting ready
    How to do it...
    There's more...
    See also
  Extracting genes from a reference using annotations
    Getting ready
    How to do it...
    There's more...
    See also
  Finding orthologues with the Ensembl REST API
    Getting ready
    How to do it...
    There's more...
  Retrieving gene ontology information from Ensembl
    Getting ready
    How to do it...
    There's more...
    See also

6 Population Genetics
  Managing datasets with PLINK
    Getting ready
    How to do it...
    There's more...
    See also
  Using sgkit for population genetics analysis with xarray
    Getting ready
    How to do it...
    There's more...
  Exploring a dataset with sgkit
    Getting ready
    How to do it...
    There's more...
    See also
  Analyzing population structure
    Getting ready
    How to do it...
    See also
  Performing a PCA
    Getting ready
    How to do it...
    There's more...
    See also
  Investigating population structure with admixture
    Getting ready
    How to do it...
    There's more...

7 Phylogenetics
  Preparing a dataset for phylogenetic analysis
    Getting ready
    How to do it...
    There's more...
    See also
  Aligning genetic and genomic data
    Getting ready
    How to do it...
  Comparing sequences
    Getting ready
    How to do it...
    There's more...
  Reconstructing phylogenetic trees
    Getting ready
    How to do it...
    There's more...
  Playing recursively with trees
    Getting ready
    How to do it...
    There's more...
  Visualizing phylogenetic data
    Getting ready
    How to do it...
    There's more...