Bioinformatics with
Python Cookbook
Third Edition
Use modern Python libraries and applications to solve
real-world computational biology problems
Tiago Antao
BIRMINGHAM—MUMBAI
Bioinformatics with Python Cookbook
Third Edition
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
Publishing Product Manager: Devika Battike
Senior Editor: David Sugarman
Content Development Editor: Joseph Sunil
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Shankar Kalbhor
Marketing Coordinator: Priyanka Mhatre
First published: June 2015
Second edition: November 2018
Third edition: September 2022
Production reference: 1090922
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80323-642-1
www.packt.com
Contributors
About the author
Tiago Antao is a bioinformatician who is currently working in the field of genomics. A former computer
scientist, Tiago moved into computational biology with an MSc in bioinformatics from the Faculty
of Sciences at the University of Porto, Portugal, and a PhD on the spread of drug-resistant malaria
from the Liverpool School of Tropical Medicine, UK. After his doctorate, Tiago worked with human
datasets at the University of Cambridge, UK, and with mosquito whole-genome sequencing data at the
University of Oxford, UK, before helping to set up the bioinformatics infrastructure at the University
of Montana, USA. He currently works as a data engineer in the biotechnology field in Boston, MA.
He is one of the co-authors of Biopython, a major bioinformatics package written in Python.
About the reviewers
Urminder Singh is a bioinformatician, computer scientist, and developer of multiple open source
bioinformatics tools. His educational background encompasses physics, computer science, and
computational biology degrees, including a Ph.D. in bioinformatics from Iowa State University, USA.
His diverse research interests include novel gene evolution, precision medicine, sociogenomics, machine
learning in medicine, and developing tools and algorithms for big heterogeneous data. You can visit
him online at urmi-21.github.io.
Tiffany Ho works as a bioinformatics associate at Embark Veterinary. She holds a BSc from the
University of California, Davis in genetics and genomics, and an MPS from Cornell University in
plant breeding and genetics.
Table of Contents
Preface xiii
1
Python and the Surrounding Software Ecology 1
Installing the required basic software with Anaconda 2
  Getting ready 2
  How to do it... 4
  There’s more... 5
Installing the required software with Docker 7
  Getting ready 7
  How to do it... 8
  See also 8
Interfacing with R via rpy2 9
  Getting ready 9
  How to do it... 10
  There’s more... 15
  See also 16
Performing R magic with Jupyter 16
  Getting ready 16
  How to do it... 17
  There’s more... 18
  See also 18
2
Getting to Know NumPy, pandas, Arrow, and Matplotlib 19
Using pandas to process vaccine-adverse events 20
  Getting ready 20
  How to do it... 20
  There’s more... 25
  See also 26
Dealing with the pitfalls of joining pandas DataFrames 26
  Getting ready 26
  How to do it... 27
  There’s more... 29
Reducing the memory usage of pandas DataFrames 29
  Getting ready 29
  How to do it... 29
  See also 32
Accelerating pandas processing with Apache Arrow 32
  Getting ready 33
  How to do it... 33
  There’s more... 35
Understanding NumPy as the engine behind Python data science and bioinformatics 36
  Getting ready 36
  How to do it... 36
  See also 39
Introducing Matplotlib for chart generation 39
  Getting ready 40
  How to do it... 40
  There’s more... 47
  See also 47
3
Next-Generation Sequencing 49
Accessing GenBank and moving around NCBI databases 50
  Getting ready 50
  How to do it... 51
  There’s more... 53
  See also 54
Performing basic sequence analysis 55
  Getting ready 55
  How to do it... 55
  There’s more... 56
  See also 57
Working with modern sequence formats 57
  Getting ready 57
  How to do it... 58
  There’s more... 64
  See also 65
Working with alignment data 66
  Getting ready 66
  How to do it... 66
  There’s more... 72
  See also 72
Extracting data from VCF files 73
  Getting ready 73
  How to do it... 74
  There’s more... 75
  See also 76
Studying genome accessibility and filtering SNP data 76
  Getting ready 76
  How to do it... 78
  There’s more... 88
  See also 88
Processing NGS data with HTSeq 88
  Getting ready 89
  How to do it... 90
  There’s more... 92
4
Advanced NGS Data Processing 93
Preparing a dataset for analysis 93
  Getting ready 94
  How to do it... 94
Using Mendelian error information for quality control 101
  How to do it... 101
  There’s more... 105
Exploring the data with standard statistics 106
  How to do it... 106
  There’s more... 111
Finding genomic features from sequencing annotations 111
  How to do it... 111
  There’s more... 114
Doing metagenomics with QIIME 2 Python API 114
  Getting ready 114
  How to do it... 116
  There’s more... 119
5
Working with Genomes 121
Technical requirements 121
Working with high-quality reference genomes 122
  Getting ready 122
  How to do it... 123
  There’s more... 127
  See also 128
Dealing with low-quality genome references 128
  Getting ready 128
  How to do it... 129
  There’s more... 133
  See also 134
Traversing genome annotations 134
  Getting ready 134
  How to do it... 134
  There’s more... 136
  See also 137
Extracting genes from a reference using annotations 137
  Getting ready 137
  How to do it... 138
  There’s more... 140
  See also 140
Finding orthologues with the Ensembl REST API 141
  Getting ready 141
  How to do it... 141
  There’s more... 144
Retrieving gene ontology information from Ensembl 144
  Getting ready 144
  How to do it... 145
  There’s more... 149
  See also 149
6
Population Genetics 151
Managing datasets with PLINK 152
  Getting ready 152
  How to do it... 154
  There’s more... 158
  See also 158
Using sgkit for population genetics analysis with xarray 158
  Getting ready 159
  How to do it... 159
  There’s more... 163
Exploring a dataset with sgkit 163
  Getting ready 163
  How to do it... 163
  There’s more... 167
  See also 167
Analyzing population structure 167
  Getting ready 168
  How to do it... 168
  See also 174
Performing a PCA 174
  Getting ready 174
  How to do it... 175
  There’s more... 177
  See also 177
Investigating population structure with admixture 177
  Getting ready 177
  How to do it... 178
  There’s more... 183
7
Phylogenetics 185
Preparing a dataset for phylogenetic analysis 185
  Getting ready 186
  How to do it... 186
  There’s more... 192
  See also 192
Aligning genetic and genomic data 192
  Getting ready 192
  How to do it... 193
Comparing sequences 195
  Getting ready 195
  How to do it... 195
  There’s more... 200
Reconstructing phylogenetic trees 200
  Getting ready 200
  How to do it... 201
  There’s more... 204
Playing recursively with trees 205
  Getting ready 205
  How to do it... 205
  There’s more... 209
Visualizing phylogenetic data 210
  Getting ready 210
  How to do it... 210
  There’s more... 215