
Computational Biology - Unix/Linux, Data Processing and Programming

290 Pages·2004·4.347 MB·English


Röbbe Wünschiers

Computational Biology - Unix/Linux, Data Processing and Programming

With 19 Figures and 12 Tables

Springer-Verlag Berlin Heidelberg GmbH

Dr. Röbbe Wünschiers
University of Cologne
Institute for Genetics
Weyertal 121
50931 Köln
Germany

ISBN 978-3-540-21142-6
ISBN 978-3-642-18552-6 (eBook)
DOI 10.1007/978-3-642-18552-6
Library of Congress Control Number: 2004102411

This work is subject to copyright. All rights reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

springeronline.com
Springer-Verlag Berlin Heidelberg 2004
Originally published by Springer-Verlag Berlin Heidelberg New York in 2004

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: Design & Production, Heidelberg
Typesetting: Camera ready by the author
31/3150WI - 5 4 3 2 1 0 - Printed on acid-free paper

Dedicated to Károly Nagy and to the Open-Source Community

Foreword

A shift in culture

Only a decade ago, the first thing a molecular biologist would have had to learn when he or she started the lab work was how to handle pipettes, extract DNA, use enzymes and clone a gene. Now, the first thing that he or she should learn is how to handle databases and to extract all the information that is already known about the gene that he or she wants to study. In all likelihood, he or she will find that the gene has already been sequenced from several organisms, that it was recovered in a variety of EST projects, that expression data are available from microarray and SAGE studies, that it was included in linkage studies, that proteomics data are rapidly accumulating, that lists of interacting proteins are being compiled, that domain structure data are available and that it is part of a network of genetic interactions which is intensively modelled. He or she will discover that all this information resides in many different databases with different data formats and with different levels of analyses and linking. Starting to work on this gene will make sense only if all this information is put together in a project-specific manner and set into the context of what is known about related genes and processes. At this point he or she may decide to walk up to the bioinformatics group in house and ask for help with arranging the data in a useful manner. This will then turn into the first major frustration in his or her career, since the last thing a scientific bioinformatics group wants to do is to provide a service for data retrieval and management.

Molecular biology is currently going through a dramatic cultural shift. The daily business of pipetting and gel running has increasingly to be complemented with data compiling and processing. Large lists of data are produced by sequencers, microarray experiments or real-time PCR machines every day. Working with data lists has become as important as extracting DNA. Every bench scientist needs proficiency in computing; discoveries are made both at the bench and on the screen. It would be completely wrong to think that computers in molecular biology are the business of bioinformaticians only. Bioinformatics has become a scientific discipline of its own and should not be considered to be a service provider. The day-to-day computing will always have to be done by the experimentalist himself or herself.

Of course, there are now also a lot of helpful and fancy program packages for the bench scientist; but these will only perform routine tasks and all too often they are only poorly compatible. A scientist needs the freedom to develop his or her own ideas and to link things that have previously not been linked. Being able to go back to the basics of computing and programming is therefore a vital skill for the experimentalist, as important as making buffers and setting up enzyme reactions. It allows him or her to handle and analyze the data in exactly the way it is required for the project and to pursue new avenues of research, rather than trotting old paths.

Unix is the key to basic computing. If one is used to Windows or Mac operating systems, this might at first sound like going back into the stone age; but the dramatic recent shift of at least the Mac operating system to a Unix base should teach us otherwise. Unix is here to stay and it allows the largest flexibility for bioinformatics applications. Those who have learned Unix will soon discover the myriad of little “progies” that are available from colleagues all over the world and that can make life much easier in the lab.

This book gives exactly the sort of introduction into Unix, Unix-based operating systems and programming languages that will be a key competence for experimentally working molecular biologists and that will make all the difference for the successful projects of the future. It has been written by a bench scientist, specifically with the needs of molecular biologists in mind. It can be used either for self-teaching or in practical courses. Every group leader should hand over this book to new students in the lab, together with their first set of pipettes.

Cologne/Germany, January 2004
Prof. Dr. Diethard Tautz (Institut für Genetik der Universität zu Köln)

Preface

Welcome on board!

With this book I would like to invite you, the scientist, to a journey through terminals and program codes. You are welcome to put aside your pipette, culture flask or rubber boots for a while, make yourself comfortable in front of a computer (do not forget your favourite hot alcohol-free drink) and learn some unixing and programming. Why? Because we are living in the information age and there is a huge amount of biological knowledge and databases out there. They contain information on almost everything: genes and genomes, rRNAs, enzymes, protein structures, DNA-microarray experiments, single organisms, ecological data, the tree of life and endless more. Furthermore, nowadays many research apparatuses are connected to computers. Thus, you have electronic access to your data. However, in order to cope with all this information you need some tools. This book will provide you with the skills to use these tools and to develop your own tools, i.e. it will introduce Unix and its derivatives (Linux, Mac OS X, CygWin, etc.) and programming (shell programming, awk, perl). These tools will make you independent of the way in which other people make you process your data – in the form of application software. What you want is open functionality. You want to decide how to process (e.g.
analyze, format, save, correlate) data and you want it now – not waiting for the lab programmer to treat your request; and you know it best – you understand your data and your demands. This is what open functionality stands for, and both Linux and programming languages can provide it to you.

I started programming on a Casio PB-100 hand-held built in 1983. It can store 10 small Basic programs. The accompanying book was entitled “Learn as you go” and, indeed, in my opinion this is the best way to learn programming. My first contact to Unix was triggered by the need to copy data files from a Unix-driven Bruker EPR-Spectrometer onto a floppy disk. The real challenge started when I tried to import the files to a data-plotting program on the PC. While the first problem could be solved by finding the right page in a Unix manual, the latter required programming skills – Q-Basic at that time. This problem was minor compared to the trouble one encounters today. A common problem is to feed one program with the output of another program: you might have to change lines to columns, commas to dots, tabulators to semicolons, uppercase to lowercase, DNA to RNA, FASTA to GenBank format and so forth. Then there is that huge amount of information out there in the web, which you might need to bring into shape for your own analysis.

You and This Book – This book is written for the total beginner. You need not even to know what a computer is, though you should have access to one and find the power switch. The book is the result of a) the way how I learned to work with Unix, its derivatives and its numerous tools and b) a lecture which I started at the Institute for Genetics at the University of Cologne/Germany. Most programming examples are taken from biology; however, you need not be a biologist. Except for two or three examples, no biological knowledge is necessary. I have tried to illustrate almost everything practically with so-called terminals and examples. You should run these examples. Each chapter closes with some exercises.
Brief solutions can be found at the end of the book.

Why Linux? – This book is not limited to Linux! All examples are valid for Unix or any Unix derivative like Mac OS X, Knoppix or the free Windows-based CygWin package, too. I chose Linux because it is open source software: you need not invest money except for the book itself. Furthermore, Linux provides all the great tools Unix provides. With Linux (as with all other Unix derivatives) you are close to your data. Via the command line you have immediate access to your files and can use either publicly available or your own designed tools to process these. With the aid of pipes you can construct your own data-processing pipeline. It is great.

Why awk and perl? – awk is a great language for both learning programming and treating large text-based data files (contrary to binary files). To 99% you will work with text-based files, be it data tables, genomes or species lists. Apart from being simple to learn and having a clear syntax, awk provides you with the possibility to construct your own commands. Thus, the language can grow with you as you grow with the language. I know bioinformatic professionals entirely focusing on awk. perl is much more powerful but also more unclear in its syntax (or flexible, to put it positively), but, since awk was one basis for developing perl, it is only a small step to go once you have learned awk – but a giant leap for your possibilities. You should take this step. By the way, both awk and perl run on all common operating systems.

Acknowledgements – Special thanks to Kristina Auerswald, Till Bayer, Benedikt Bosbach and Chris Voolstra for proofreading, and all the other students for encouraging me to bring these lines together.

Hürth/Germany, January 2004
Röbbe Wünschiers

Contents

Part I Whetting Your Appetite

1 Introduction
  1.1 Information
  1.2 What Is Linux?
  1.3 What Is Shell Programming?
  1.4 What Is sed?
  1.5 What Is awk?
  1.6 What Is perl?
  1.7 Prerequisites
  1.8 Conventions

Part II Computer and Operating Systems

2 Unix/Linux
  2.1 What Is a Computer?
  2.2 Some History
    2.2.1 Versions of Unix
    2.2.2 The Rise of Linux
    2.2.3 Why a Penguin?
    2.2.4 Linux Distributions
  2.3 X-Windows
    2.3.1 How Does It Work?
  2.4 Unix/Linux Architecture
  2.5 What Is the Difference Between Unix and Linux?
  2.6 What Is the Difference Between Unix/Linux and Windows?
  2.7 What Is the Difference Between Unix/Linux and Mac OS X?
  2.8 One Computer, Two Operating Systems
    2.8.1 VMware
    2.8.2 CygWin
