
Practical Computing and Bioinformatics for Conservation and Evolutionary Genomics PDF

368 Pages·6.658 MB·English


Eric C. Anderson
Practical Computing and Bioinformatics for Conservation and Evolutionary Genomics

Contents

List of Tables
List of Figures
Preface
Introduction
Eric’s Notes of what he might do
  0.1 Table of topics

Part I: Essential Computing Skills

1 Overview of Essential Computing Skills
2 Essential Unix/Linux Terminal Knowledge
  2.1 Getting a bash shell on your system
  2.2 Navigating the Unix filesystem
    2.2.1 Changing the working directory with cd
    2.2.2 Updating your command prompt
    2.2.3 TAB-completion for paths
    2.2.4 Listing the contents of a directory with ls
    2.2.5 Globbing
    2.2.6 What makes a good file-name?
  2.3 The anatomy of a Unix command
    2.3.1 The command
    2.3.2 The options
    2.3.3 Arguments
    2.3.4 Getting information about Unix commands
  2.4 Handling, Manipulating, and Viewing files and streams
    2.4.1 Creating new directories
    2.4.2 Fundamental file-handling commands
    2.4.3 “Viewing” Files
    2.4.4 Redirecting standard output: > and >>
    2.4.5 stdin, < and |
    2.4.6 stderr
    2.4.7 Symbolic links
    2.4.8 File Permissions
    2.4.9 Editing text files at the terminal
  2.5 Customizing your Environment
    2.5.1 Appearances matter
    2.5.2 Where are my programs/commands at?!
  2.6 A Few More Important Keystrokes
  2.7 A short list of additional useful commands
  2.8 Two important computing concepts
    2.8.1 Compression
    2.8.2 Hashing
  2.9 Unix: Quick Study Guide
3 Shell programming
  3.1 An example script
  3.2 The Structure of a Bash Script
    3.2.1 A bit more on ; and &
  3.3 Variables
    3.3.1 Assigning values to variables
    3.3.2 Accessing values from variables
    3.3.3 What does the shell do with the value substituted for a variable?
    3.3.4 Double and Single Quotation Marks and Variable Substitution
    3.3.5 One useful, fancy, variable-substitution method
    3.3.6 Integer Arithmetic with Shell Variables
    3.3.7 Variable arrays
  3.4 Evaluate a command and substitute the result on the command line
  3.5 Grouping/Collecting output from multiple commands: (commands) and { commands; }
  3.6 Exit Status
    3.6.1 Combinations of exit statuses
  3.7 Loops and repetition
  3.8 More Conditional Evaluation: if, then, else, and friends
  3.9 Finally…positional parameters
  3.10 basename and dirname two useful little utilities
  3.11 bash functions
  3.12 reading files line by line
  3.13 Further reading
4 Sed, awk, and regular expressions
  4.1 awk
    4.1.1 Line-cycling, tests and actions
    4.1.2 Column splitting, fields, -F, $, NF, print, OFS and BEGIN
    4.1.3 A brief introduction to regular expressions
    4.1.4 A variety of tests
    4.1.5 Code in the action blocks
    4.1.6 Using awk to assign to shell variables
    4.1.7 Passing Variables into awk with -v
    4.1.8 Writing awk scripts in files
  4.2 sed
5 Working on remote servers
  5.1 Accessing remote computers
    5.1.1 Windows
    5.1.2 Hummingbird
    5.1.3 Summit
    5.1.4 Sedna
  5.2 Transferring files to remote computers
    5.2.1 sftp (via lftp)
    5.2.2 git
    5.2.3 Globus
    5.2.4 Interfacing with “The Cloud”
    5.2.5 Getting files from a sequencing center
  5.3 tmux: the terminal multiplexer
    5.3.1 An analogy for how tmux works
    5.3.2 First steps with tmux
    5.3.3 Further steps with tmux
  5.4 tmux for Mac users
  5.5 Installing Software on an HPCC
    5.5.1 Modules
    5.5.2 Miniconda
    5.5.3 Installing Java Programs
  5.6 vim: it’s time to get serious with text editing
    5.6.1 Using neovim and Nvim-R and tmux to use R well on the cluster
6 High Performance Computing Clusters (HPCC’s)
  6.1 An oversimplified, but useful, view of a computing cluster
  6.2 Cluster computing and the job scheduler
  6.3 Learning about the resources on your HPCC
  6.4 Getting compute resources allocated to your jobs on an HPCC
    6.4.1 Interactive sessions
    6.4.2 Batch jobs
    6.4.3 SLURM Job Arrays
  6.5 PREPARATION INTERLUDE: An in-class exercise to make sure everything is configured correctly
  6.6 More Boneyard…
  6.7 The Queue (SLURM/SGE/UGE)
  6.8 Modules package
  6.9 Compiling programs without admin privileges
  6.10 Job arrays
  6.11 Writing stdout and stderr to files
  6.12 Breaking stuff down

Part II: Reproducible Research Strategies

7 Introduction to Reproducible Research
8 Rstudio and Project-centered Organization
  8.1 Organizing big projects
  8.2 Using RStudio in workflows with remote computers and HPCCs
    8.2.1 Keeping an RStudio project “in sync” with GitHub
    8.2.2 Evaluating scripts line by line on a remote machine from within RStudio
9 Version control
  9.1 Why use version control?
  9.2 How git works
  9.3 git workflow patterns
  9.4 using git with Rstudio
  9.5 git on the command line
10 A fast, furious overview of the tidyverse
11 Authoring reproducibly with Rmarkdown
  11.1 Notebooks
  11.2 References
    11.2.1 Zotero and Rmarkdown
  11.3 Bookdown
  11.4 Google Docs
12 Using python

Part III: Bioinformatic Analyses

13 Overview of Bioinformatic Analyses
14 DNA Sequences and Sequencing
  14.1 DNA Stuff
    14.1.1 DNA Replication with DNA Polymerase
    14.1.2 The importance of the 3’ hydroxyl…
  14.2 Sanger sequencing
  14.3 Illumina Sequencing by Synthesis
  14.4 Library Prep Protocols
    14.4.1 WGS
    14.4.2 RAD-Seq methods
    14.4.3 Amplicon Sequencing
    14.4.4 Capture arrays, RAPTURE, etc.
15 Bioinformatic file formats
  15.1 Sequences
  15.2 FASTQ
    15.2.1 Line 1: Illumina identifier lines
    15.2.2 Line 4: Base quality scores
    15.2.3 A FASTQ ‘tidyverse’ Interlude
    15.2.4 Comparing read 1 to read 2
  15.3 FASTA
    15.3.1 Genomic ranges
    15.3.2 Extracting genomic ranges from a FASTA file
    15.3.3 Downloading reference genomes from NCBI
  15.4 Alignments
    15.4.1 How might I align to thee? Let me count the ways…
    15.4.2 Play with simple alignments
    15.4.3 SAM Flags
    15.4.4 The CIGAR string
    15.4.5 The SEQ and QUAL columns
    15.4.6 SAM File Headers
    15.4.7 The BAM format
    15.4.8 Quick self study
  15.5 Variants
    15.5.1 VCF Format – The Body
    15.5.2 VCF Format – The Header
    15.5.3 Boneyard
  15.6 Segments
  15.7 Conversion/Extractions between different formats
  15.8 Visualization of Genomic Data
    15.8.1 Sample Data
16 Genome Assembly
17 Alignment of sequence data to a reference genome (and associated steps)
  17.1 The Journey of each DNA Fragment from Organism to Sequencing Read
  17.2 Read Groups
  17.3 Aligning reads with bwa
    17.3.1 Indexing the genome for alignment
    17.3.2 Mapping reads with bwa mem
    17.3.3 Hold it Right There, Buddy! What about the Read Groups?
  17.4 Processing alignment output with samtools
    17.4.1 samtools subcommands
  17.5 BONEYARD BELOW HERE
  17.6 Preprocess ?
  17.7 Quick notes to self on chaining things:
  17.8 Merging BAM files
  17.9 Divide and Conquer Strategies
18 Variant calling
  18.1 Genotype Likelihoods
    18.1.1 Basic Sketch of Genotype Likelihood Calculations
    18.1.2 Specifics of different genotype likelihoods
    18.1.3 Computing genotype likelihoods with three different softwares
    18.1.4 A Directed Acyclic Graph For Genotype Likelihoods
19 Boneyard
20 Basic Handling of VCF files
  20.1 bcftools
    20.1.1 Tell me about my VCF file!
    20.1.2 Get fragments/parts of my VCF file
    20.1.3 Combine VCF files in various ways
    20.1.4 Filter out variants for a variety of reasons
21 Bioinformatics for RAD seq data with and without a reference genome
22 Processing amplicon sequencing data
23 Genome Annotation
24 Whole genome alignment strategies
  24.1 Mapping of scaffolds to a closely related genome
  24.2 Obtaining Ancestral States from an Outgroup Genome
    24.2.1 Using LASTZ to align coho to the chinook genome
    24.2.2 Try on the chinook chromosomes
    24.2.3 Explore the other parameters more

Part IV: Analysis of Big Variant Data

25 Bioinformatic analysis on variant data

Part V: Population Genomics

26 Topics in pop gen
  26.1 Coalescent
  26.2 Measures of genetic diversity and such
  26.3 Demographic inference with 𝜕𝑎𝜕𝑖 and moments
  26.4 Balls in Boxes
  26.5 Some landscape genetics
  26.6 Relationship Inference
  26.7 Tests for Selection
  26.8 Multivariate Associations, GEA, etc.
  26.9 Estimating heritability in the wild
