
UNIX and Perl to the Rescue!: A Field Guide for the Life Sciences (and Other Data-rich Pursuits)

375 Pages·2012·1.504 MB·English


Unix and Perl to the Rescue!

Your research has generated gigabytes of data and now you need to analyze it. You hate using spreadsheets but it’s all you know, so what else can you do? This book will transform how you work with large and complex data sets, teaching you powerful programming tools for slicing and dicing data to suit your needs. Written in a fun and accessible style, this step-by-step guide will inspire and inform non-programmers about the essential aspects of Unix and Perl. It shows how, with just a little programming knowledge, you can write programs that could save you hours, or even days. No prior experience is required and new concepts are introduced using numerous code examples that you can try out for yourself. Going beyond the basics, the authors touch upon many broader topics that will help those new to programming, including debugging and how to write in a good programming style.

KEITH BRADNAM is a project scientist in the Genome Center at the University of California, Davis. He has extensive experience working with model organism databases and spent four years as a project leader at WormBase, helping to develop this important bioinformatics resource.

IAN KORF is an Associate Professor in Molecular and Cellular Biology at the University of California, Davis. His research seeks to understand structure and function in genomic DNA. He has developed new tools for gene prediction, co-authored the only book devoted to BLAST and helped in the development of BioPerl.

Unix and Perl to the Rescue!
A field guide for the life sciences (and other data-rich pursuits)

KEITH BRADNAM
University of California, Davis

IAN KORF
University of California, Davis

Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/97801107000681

© Keith Bradnam and Ian Korf 2012

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2012

Printed in the United Kingdom at the University Press, Cambridge

A catalogue record for this publication is available from the British Library

Library of Congress Cataloguing in Publication data

ISBN 978-1-107-00068-1 Hardback
ISBN 978-0-521-16982-0 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Part 1 Introduction and background
1.1 Introduction
1.2 How to use this book

Part 2 Installing Unix and Perl
2.1 What do I need in order to learn Unix and Perl?
2.2 Installing Linux on a PC
2.3 Installing a code editor

Part 3 Essential Unix
3.1 Introduction to Unix
3.2 The Unix terminal
3.3 The Unix command prompt
3.4 Your first Unix command
3.5 The hierarchy of a Unix filesystem
3.6 Finding out where you are in the filesystem
3.7 How to navigate a Unix filesystem
3.8 Absolute and relative paths
3.9 Working with your home directory
3.10 The Unix shell
3.11 Environment variables
3.12 Introduction to command-line options
3.13 Man pages
3.14 Working with directories
3.15 The importance of saving keystrokes
3.16 Moving and renaming files
3.17 Moving and renaming directories
3.18 How to remove files
3.19 How to copy files and directories
3.20 Working with text files
3.21 Introduction to aliases
3.22 Editing text files
3.23 Automating Unix commands
3.24 How to hide files and find hidden files
3.25 Creating a configuration file
3.26 Programming with Unix
3.27 Unix file permissions
3.28 How to specify which directories contain programs
3.29 Creating useful shell scripts
3.30 Unix summary

Part 4 Essential Perl
4.1 Hello world
4.2 Scalar variables
4.3 Use warnings
4.4 Maths and functions
4.5 Perl vs. perl
4.6 Conditional statements
4.7 Use strict
4.8 Stopping programs
4.9 Working with strings
4.10 Dealing with special characters
4.11 Matching operators
4.12 The transliteration operator
4.13 List context
4.14 Introduction to Arrays
4.15 Array manipulation
4.16 The @ARGV array
4.17 Defined and undefined variables
4.18 Sorting
4.19 Introduction to loops
4.20 More loops
4.21 Loop control
4.22 Data input and output
4.23 Reading and writing files
4.24 Introduction to hashes
4.25 Working with hashes
4.26 Introduction to regular expressions
4.27 Regular expression metacharacters
4.28 Working with regular expressions
4.29 Interacting with other programs
4.30 Using functions and subroutines
4.31 Returning data from a subroutine
4.32 Part 4 summary

Part 5 Advanced Unix
5.1 Introduction to advanced Unix
5.2 Introduction to process control
5.3 The grep command
5.4 Viewing and controlling program output
5.5 Redirecting input and output
5.6 Standard error
5.7 Connecting commands with pipelines
5.8 Advanced text manipulation

Part 6 Advanced Perl
6.1 Regular expressions revisited
6.2 Function libraries
6.3 References and two-dimensional arrays
6.4 Records and other hash references
6.5 Using references with subroutines
6.6 Complex data structures
6.7 Adding command-line options
6.8 OOP basics
6.9 CPAN

Part 7 Programming topics
7.1 Debugging strategies
7.2 Common error messages
7.3 Code beautification
7.4 Abstraction
7.5 Data management
7.6 Documentation
7.7 Revision control
7.8 Working with other people’s data
7.9 Getting help

Appendix
Index
Part 1 Introduction and background

1.1 Introduction

Why this book?
If this book had to have a mission statement, we would say that it is designed to help you make the transition from computer user to computer programmer.[1] We wrote this book with life scientists in mind. But it is equally appropriate for anyone who needs to slice and dice large, diverse data sets. A few years ago, biologists did not need to know how to program. With the arrival of the Human Genome Project and other -omic technologies, biology has been transformed into an incredibly data-rich science. While the science is moving ahead at a staggering rate, most people have not changed themselves to match. Not everyone needs to know how to program, but for those that desire it, this book will help them catch up quickly.

We have both watched students struggle with trying to analyze mountains of data, and sometimes the struggle has not been because the students lack the ability to tackle the problem. Rather, it is because they frequently lack the tools to tackle the problem. For many people, data analysis means “using a spreadsheet.” Sometimes this is all you need, but for many problems a programming solution will be faster, easier, and much more powerful.

This is not a book for dummies or idiots. Conversely, it’s also not for super-geniuses. It’s for ordinary educated people who haven’t needed to program until now. Whether the topic is language, mathematics, or programming, some people learn faster than others. But we all learn to read, write, multiply, and divide. And we can all learn to program. Rest assured, you can program. We are happy to be your guides.

Learning to program is a journey. Like other journeys, it takes time and effort. But the rewards are worth every step. Not only will you be learning a new skill that you can apply to your work, you will be seeing the world of data from a completely different perspective. We guarantee you will find this personally enlightening, and we are not exaggerating when we say that your newfound knowledge will empower you more than you can imagine.

Why Unix?
The Unix OS has been around since 1969 and it’s not likely to disappear any time soon. Back then there was no such thing as a graphical user interface (GUI). You typed everything. It may seem archaic to use a keyboard to issue commands today, but it’s much easier to automate keyboard-driven tasks than mouse-driven tasks. There are several variants of Unix (including Linux), though the differences do not matter much. Although you may not have noticed it, Apple has been using Unix as the underlying OS on all of their computers since 2001.[2]

Increasingly, the raw output of biological research exists as in silico data, usually in the form of very large text files that can grow to several gigabytes in size. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. If you can learn just five Unix commands, you will be able to do a lot more than just five things.

[1] Note that this doesn’t mean you need to grow a beard, start reading science-fiction books, or wear T-shirts bearing unfathomable geeky slogans. Indeed, all of these clichés about programmers should be tossed aside. Programmers are real people … well, most of us are anyway.
[2] If you haven’t noticed it, that’s probably because it is “hidden” behind a very slick-looking GUI. But it’s there nonetheless.

Why Perl?
Perl is one of the most popular programming languages, and has a particularly strong following in the bioinformatics community. People sometimes get argumentative about which language is best. There is no single best language for everything. Perl does most things very well, and is a fine programming language to learn. Other equally capable and easy to use languages include Python and Ruby. Once you learn how to program well in one language, adapting to other languages is trivial.

Originally developed in 1987, Perl remains under active development and there is therefore a lot of supporting material available to help you learn it.[3] You are very likely to find Perl pre-installed on just about every type of Unix/Linux-based OS, and it is also available for Windows.

Among programming languages, there is often a distinction between those that are interpreted (e.g., Perl, Python, Ruby) and those that are compiled (e.g., C, C++, Java). People often call interpreted programs scripts. It is generally easier to learn programming with a scripting language because you don’t have to worry as much about variable types and memory allocation. The downside is that the interpreted programs often run much slower than compiled ones. But let’s not get lost in petty details. Scripts are programs, scripting is programming, and computers can solve problems quickly regardless of the language.

About the authors
Keith Bradnam started out his academic career studying ecology. This involved lots of field trips and throwing quadrats around on windy hillsides. He was then lucky enough to be in the right place at the right time to do a Masters degree in bioinformatics (at a time when nobody was very sure what bioinformatics was). From that point onwards he has spent most of his waking life sat at a keyboard (often staring into a Unix terminal). A PhD studying eukaryotic genome evolution followed; this was made easier by the fact that only one genome had been completed at the time he started (this soon changed). After a brief stint working on an Arabidopsis genome database he moved to working on the excellent model organism database, WormBase, at the Wellcome Trust Sanger Institute. It was here that he first met Ian Korf and where they bonded over a shared love of Macs, neatly written code, and English puddings. Ian then tried to run away and hide in California at the UC Davis Genome Center, but Keith tracked him down and joined his lab. Apart from doing research, he also gets to look after all the computers in the lab and teach the occasional class or two. However, he would give it all up for the chance

[3] A good “first port of call” would be www.perl.org, the official web site of the Perl programming language.
