Bioinformatics, Biocomputing and Perl: An Introduction PDF

485 Pages·2004·3.271 MB·English

Preview Bioinformatics, Biocomputing and Perl: An Introduction

Bioinformatics, Biocomputing and Perl
An Introduction to Bioinformatics Computing Skills and Practice

Michael Moorhouse
Post-Doctoral Worker from Erasmus MC, The Netherlands

Paul Barry
Department of Computing and Networking, Institute of Technology, Carlow, Ireland

Copyright 2004 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777
Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0-470-85331-X
Typeset in 9.5/12.5pt Lucida Bright by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire
This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.

For my parents, who taught me the value of knowledge – MJM
For three great kids: Joseph, Aaron and Aideen – PJB

Contents

Preface

1 Setting the Biological Scene
    1.1 Introducing Biological Sequence Analysis
    1.2 Protein and Polypeptides
    1.3 Generalised Models and their Use
    1.4 The Central Dogma of Molecular Biology
        1.4.1 Transcription
        1.4.2 Translation
    1.5 Genome Sequencing
        1.5.1 Sequence assembly
    1.6 The Example DNA-gene-protein system we will use
    Where to from Here

2 Setting the Technological Scene
    2.1 The Layers of Technology
        2.1.1 From passive user to active developer
    2.2 Finding perl
        2.2.1 Checking for perl
    Where to from Here

I Working with Perl

3 The Basics
    3.1 Let’s Get Started!
        3.1.1 Running Perl programs
        3.1.2 Syntax and semantics
        3.1.3 Program: run thyself!
    3.2 Iteration
        3.2.1 Using the Perl while construct
    3.3 More Iterations
        3.3.1 Introducing variable containers
        3.3.2 Variable containers and loops
    3.4 Selection
        3.4.1 Using the Perl if construct
    3.5 There Really is MTOWTDI
    3.6 Processing Data Files
        3.6.1 Asking getlines to do more
    3.7 Introducing Patterns
    Where to from Here
    The Maxims Repeated

4 Places to Put Things
    4.1 Beyond Scalars
    4.2 Arrays: Associating Data with Numbers
        4.2.1 Working with array elements
        4.2.2 How big is the array?
        4.2.3 Adding elements to an array
        4.2.4 Removing elements from an array
        4.2.5 Slicing arrays
        4.2.6 Pushing, popping, shifting and unshifting
        4.2.7 Processing every element in an array
        4.2.8 Making lists easier to work with
    4.3 Hashes: Associating Data with Words
        4.3.1 Working with hash entries
        4.3.2 How big is the hash?
        4.3.3 Adding entries to a hash
        4.3.4 Removing entries from a hash
        4.3.5 Slicing hashes
        4.3.6 Working with hash entries: a complete example
        4.3.7 Processing every entry in a hash
    Where to from Here
    The Maxims Repeated

5 Getting Organised
    5.1 Named Blocks
    5.2 Introducing Subroutines
        5.2.1 Calling subroutines
    5.3 Creating Subroutines
        5.3.1 Processing parameters
        5.3.2 Better processing of parameters
        5.3.3 Even better processing of parameters
        5.3.4 A more flexible drawline subroutine
        5.3.5 Returning results
    5.4 Visibility and Scope
        5.4.1 Using private variables
        5.4.2 Using global variables properly
        5.4.3 The final version of drawline
    5.5 In-built Subroutines
    5.6 Grouping and Reusing Subroutines
        5.6.1 Modules
    5.7 The Standard Modules
    5.8 CPAN: The Module Repository
        5.8.1 Searching CPAN
        5.8.2 Installing a CPAN module manually
        5.8.3 Installing a CPAN module automatically
        5.8.4 A final word on CPAN modules
    Where to from Here
    The Maxims Repeated

6 About Files
    6.1 I/O: Input and Output
        6.1.1 The standard streams: STDIN, STDOUT and STDERR
    6.2 Reading Files
        6.2.1 Determining the disk-file names
        6.2.2 Opening the named disk-files
        6.2.3 Reading a line from each of the disk-files
        6.2.4 Putting it all together
        6.2.5 Slurping
    6.3 Writing Files
        6.3.1 Redirecting output
        6.3.2 Variable interpolation
    6.4 Chopping and Chomping
    Where to from Here
    The Maxims Repeated

7 Patterns, Patterns and More Patterns
    7.1 Pattern Basics
        7.1.1 What is a regular expression?
        7.1.2 What makes regular expressions so special?
    7.2 Introducing the Pattern Metacharacters
        7.2.1 The + repetition metacharacter
        7.2.2 The | alternation metacharacter
        7.2.3 Metacharacter shorthand and character classes
        7.2.4 More metacharacter shorthand
        7.2.5 More repetition
        7.2.6 The ? and * optional metacharacters
        7.2.7 The any character metacharacter
    7.3 Anchors
        7.3.1 The \b word boundary metacharacter
        7.3.2 The ^ start-of-line metacharacter
        7.3.3 The $ end-of-line metacharacter
    7.4 The Binding Operators
    7.5 Remembering What Was Matched
    7.6 Greedy by Default
    7.7 Alternative Pattern Delimiters
    7.8 Another Useful Utility
    7.9 Substitutions: Search and Replace
        7.9.1 Substituting for whitespace
    7.10 Finding a Sequence
    Where to from Here
    The Maxims Repeated

8 Perl Grabbag
    8.1 Introduction
    8.2 Strictness
    8.3 Perl One-liners
    8.4 Running Other Programs from perl
    8.5 Recovering from Errors
    8.6 Sorting
    8.7 HERE Documents
    Where to from Here
    The Maxims Repeated

II Working with Data

9 Downloading Datasets
    9.1 Let’s Get Data
    9.2 Downloading from the Web
        9.2.1 Using wget to download PDB data-files
        9.2.2 Mirroring a dataset
        9.2.3 Smarter mirroring
        9.2.4 Downloading a subset of a dataset
    Where to from Here
    The Maxims Repeated

10 The Protein Databank
    10.1 Introduction
    10.2 Determining Biomolecule Structures
        10.2.1 X-Ray Crystallography
        10.2.2 Nuclear magnetic resonance
        10.2.3 Summary of protein structure methods
    10.3 The Protein Databank
    10.4 The PDB Data-file Formats
        10.4.1 Example structures
        10.4.2 Downloading PDB data-files
    10.5 Accessing Data in PDB Entries
    10.6 Accessing PDB Annotation Data
        10.6.1 Free R and resolution
        10.6.2 Database cross references
        10.6.3 Coordinates section
        10.6.4 Extracting 3D coordinate data
    10.7 Contact Maps
    10.8 STRIDE: Secondary Structure Assignment
        10.8.1 Installation of STRIDE
    10.9 Assigning Secondary Structures
        10.9.1 Using STRIDE and parsing the output
        10.9.2 Extracting amino acid sequences using STRIDE
    10.10 Introducing the mmCIF Protein Format
        10.10.1 Converting mmCIF to PDB
        10.10.2 Converting mmCIFs to PDB with CIFTr
        10.10.3 Problems with the CIFTr conversion
        10.10.4 Some advice on using mmCIF
        10.10.5 Automated conversion of mmCIF to PDB
    Where to from Here
    The Maxims Repeated

11 Non-redundant Datasets
    11.1 Introducing Non-redundant Datasets
        11.1.1 Reasons for redundancy
        11.1.2 Reduction of redundancy
        11.1.3 Non-redundancy and non-representative
    11.2 Non-redundant Protein Structures
    Where to from Here
    The Maxims Repeated

12 Databases
    12.1 Introducing Databases
        12.1.1 Relating tables
        12.1.2 The problem with single-table databases
        12.1.3 Solving the one-table problem
        12.1.4 Database system: a definition
    12.2 Available Database Systems
        12.2.1 Personal database systems
        12.2.2 Enterprise database systems
        12.2.3 Open source database systems
    12.3 SQL: the Language of Databases
        12.3.1 Defining data with SQL
        12.3.2 Manipulating data with SQL
    12.4 A Database Case Study: MER
        12.4.1 The requirement for the MER database
        12.4.2 Installing a database system
        12.4.3 Creating the MER database
        12.4.4 Adding tables to the MER database
        12.4.5 Preparing SWISS-PROT data for importation
        12.4.6 Importing tab-delimited data into proteins
        12.4.7 Working with the data in proteins
        12.4.8 Adding another table to the MER database
        12.4.9 Preparing EMBL data for importation
        12.4.10 Importing tab-delimited data into dnas
        12.4.11 Working with the data in dnas
        12.4.12 Relating data in one table to that in another
        12.4.13 Adding the crossrefs table to the MER database
        12.4.14 Preparing cross references for importation
        12.4.15 Importing tab-delimited data into crossrefs
        12.4.16 Working with the data in crossrefs
        12.4.17 Adding the citations table to the MER database
        12.4.18 Preparing citation information for importation
        12.4.19 Importing tab-delimited data into citations
        12.4.20 Working with the data in citations
    Where to from Here
    The Maxims Repeated

13 Databases and Perl
    13.1 Why Program Databases?
    13.2 Perl Database Technologies
    13.3 Preparing Perl
        13.3.1 Checking the DBI installation
    13.4 Programming Databases with DBI
        13.4.1 Developing a database utility module
        13.4.2 Improving upon dumpresults
    13.5 Customising Output
    13.6 Customising Input
    13.7 Extending SQL
    Where to from Here
    The Maxims Repeated

III Working with the Web

14 The Sequence Retrieval System
    14.1 An Example of What’s Possible
    14.2 Why SRS?
    14.3 Using SRS
    Where to from Here
    The Maxims Repeated

15 Web Technologies
    15.1 The Web Development Infrastructure
    15.2 Creating Content for the WWW
        15.2.1 The static creation of WWW content
        15.2.2 The dynamic creation of WWW content
    15.3 Preparing Apache for Perl
        15.3.1 Testing the execution of server-side programs
    15.4 Sending Data to a Web Server
    15.5 Web Databases
    Where to from Here
    The Maxims Repeated

16 Web Automation
    16.1 Why Automate Surfing?
    16.2 Automated Surfing with Perl
    Where to from Here
    The Maxims Repeated

IV Working with Applications

17 Tools and Datasets
    17.1 Introduction
    17.2 Sequence Databases
        17.2.1 Understanding EMBL entries
        17.2.2 Understanding SWISS-PROT entries
        17.2.3 Summarising sequences databases
    17.3 General Concepts and Methods
        17.3.1 Predictions and validation
        17.3.2 True/False/Negative/Positive
        17.3.3 Balancing the errors
        17.3.4 Using multiple algorithms to improve performance
        17.3.5 tRNA-ScanSE, a case study
    17.4 Introducing Bioinformatics Tools
        17.4.1 ClustalW
        17.4.2 Algorithms and methods
        17.4.3 Installation and use
        17.4.4 Substitution/scoring matrices
    17.5 BLAST
        17.5.1 Installing NCBI-BLAST
        17.5.2 Preparation of database files for faster searching
        17.5.3 The different types of BLAST search
        17.5.4 Final words on BLAST
    Where to from Here
    The Maxims Repeated

18 Applications
    18.1 Introduction
    18.2 Scientific Background to Mer Operon
        18.2.1 Function
        18.2.2 Genetic structure and regulation
        18.2.3 Mobility of the Mer Operon
    18.3 Downloading the Raw DNA Sequence
    18.4 Initial BLAST Sequence Similarity Search
    18.5 GeneMark
        18.5.1 Using BLAST to identify specific sequences
        18.5.2 Dealing with false negatives and missing proteins
        18.5.3 Over-predicted genes and false positives
        18.5.4 Summary of validation of GeneMark prediction
    18.6 Structural Prediction with SWISS-MODEL
        18.6.1 Alternatives to homology modelling
        18.6.2 Modelling with SWISS-MODEL
    18.7 DeepView as a Structural Alignment Tool
    18.8 PROSITE and Sequence Motifs
        18.8.1 Using PROSITE patterns and matrices
        18.8.2 Downloading PROSITE and its search tools
        18.8.3 Final word on PROSITE
    18.9 Phylogenetics
        18.9.1 A look at the HMA domain of MerA and MerP
    Where to from Here?
    The Maxims Repeated

19 Data Visualisation
    19.1 Introducing Visualisation
    19.2 Displaying Tabular Data Using HTML
        19.2.1 Displaying SWISS-PROT identifiers
    19.3 Creating High-quality Graphics with GD
        19.3.1 Using the GD module
        19.3.2 Displaying genes in EMBL entries
        19.3.3 Introducing mogrify
