
An Introduction to Bioinformatics Algorithms

431 Pages·2004·5.872 MB·English


AN INTRODUCTION TO BIOINFORMATICS ALGORITHMS
NEIL C. JONES AND PAVEL A. PEVZNER

Preface

In the early 1990s, when one of us was teaching his first bioinformatics class, he was not sure that there would be enough students to teach. Although the Smith-Waterman and BLAST algorithms had already been developed, they had not become the household names among biologists that they are today. Even the term “bioinformatics” had not yet been coined. DNA arrays were viewed by most as intellectual toys with dubious practical application, except for a handful of enthusiasts who saw a vast potential in the technology. A few bioinformaticians were developing new algorithmic ideas for nonexistent data sets: David Sankoff laid the foundations of genome rearrangement studies at a time when there was practically no gene order data, Michael Waterman and Gary Stormo were developing motif finding algorithms when there were very few promoter samples available, Gene Myers was developing sophisticated fragment assembly tools when no bacterial genome had been assembled yet, and Webb Miller was dreaming about comparing billion-nucleotide-long DNA sequences when the 172,282-nucleotide Epstein-Barr virus was the longest GenBank entry. GenBank itself had just recently made a transition from a series of bound (paper!) volumes to an electronic database on magnetic tape that could be sent to scientists worldwide.

One has to go back to the mid-1980s and early 1990s to fully appreciate the revolution in biology that has taken place in the last decade. However, bioinformatics has affected more than just biology: it has also had a profound impact on the computational sciences. Biology has rapidly become a large source of new algorithmic and statistical problems, and has arguably been the target for more algorithms than any of the other fundamental sciences. This link between computer science and biology has important educational implications that change the way we teach computational ideas to biologists, as well as how applied algorithmics is taught to computer scientists.

1 Introduction

Imagine Alice, Bob, and two piles of ten rocks.
Alice and Bob are bored one Saturday afternoon so they play the following game. In each turn a player may either take one rock from a single pile, or one rock from both piles. Once the rocks are taken, they are removed from play; the player that takes the last rock wins the game. Alice moves first.

It is not immediately clear what the winning strategy is, or even if there is one. Does the first player (or the second) always have an advantage? Bob tries to analyze the game and realizes that there are too many variants in the game with two piles of ten rocks (which we will refer to as the 10+10 game). Using a reductionist approach, he first tries to find a strategy for the simpler 2+2 game. He quickly sees that the second player (himself, in this case) wins any 2+2 game, so he decides to write the “winning recipe”:

If Alice takes one rock from each pile, I will take the remaining rocks and win. If Alice takes one rock, I will take one rock from the same pile. As a result, there will be only one pile and it will have two rocks in it, so Alice’s only choice will be to take one of them. I will take the remaining rock to win the game.

Inspired by this analysis, Bob makes a leap of faith: the second player (i.e., himself) wins in any n+n game, for n ≥ 2. Of course, every hypothesis must be confirmed by experiment, so Bob plays a few rounds with Alice. It turns out that sometimes he wins and sometimes he loses. Bob tries to come up with a simple recipe for the 3+3 game, but there are a large number of different game sequences to consider, and the recipe quickly gets too complicated. There is simply no hope of writing a recipe for the 10+10 game because the number of different strategies that Alice can take is enormous.

Meanwhile, Alice quickly realizes that she will always lose the 2+2 game, but she does not lose hope of finding a winning strategy for the 3+3 game. Moreover, she took Algorithms 101 and she understands that recipes written in the style that Bob uses will not help very much: recipe-style instructions are not a sufficiently expressive language for describing algorithms. Instead, she begins by drawing the following table filled with the symbols ↑, ←, ↖, and ∗.
The entry in position (i, j) (i.e., the ith row and the jth column) describes the moves that Alice will make in the i+j game, with i and j rocks in piles A and B respectively. A ← indicates that she should take one stone from pile B. A ↑ indicates that she should take one stone from pile A. A ↖ indicates that she should take one stone from each pile, and ∗ indicates that she should not bother playing the game because she will definitely lose against an opponent who has a clue.

     0  1  2  3  4  5  6  7  8  9  10
 0   ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 1   ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 2   ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 3   ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 4   ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 5   ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 6   ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 7   ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 8   ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 9   ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
10   ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗

For example, if she is faced with the 3+3 game, she finds a ↖ in the third row and third column, indicating that she should take a rock from each pile. This makes Bob take the first move in a 2+2 game, which is marked with a ∗. No matter what he does, Alice wins. Suppose Bob takes a rock from pile B; this leads to the 2+1 game. Alice again consults the table by reading the entry at (2, 1), seeing that she should also take a rock from pile B, leaving two rocks in A. However, if Bob had instead taken a rock from pile A, Alice would consult entry (1, 2) to find ↑. She again should also take a rock from pile A, leaving two rocks in pile B.

Impressed by the table, Bob learns how to use it to win the 10+10 game. However, Bob does not know how to construct a similar table for the 20+20 game. The problem is not that Bob is stupid, but that he has not studied algorithms. Even if, through sheer luck, Bob figured out how to always win the 20+20 game, he could neither say with confidence that it would work no matter what Alice did, nor would he even be able to write down the recipe for the general n+n game.
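As an illustration of how a table like Alice's might be constructed for any pile size, here is a minimal Python sketch. It is not taken from the book: the function name `build_move_table` and the constants are hypothetical, and the sketch assumes only the rules stated above (take one rock from one pile or one rock from each pile; whoever takes the last rock wins). It marks a position as winning when some legal move leaves the opponent in a losing position.

```python
# Symbols mirror the table in the text; an ASCII "*" stands in for the losing mark.
UP, LEFT, DIAG, LOSE = "↑", "←", "↖", "*"


def build_move_table(n):
    """Return table[i][j]: a winning move for the player about to move in the
    i+j game (i rocks in pile A, j rocks in pile B), or LOSE if none exists."""
    table = [[LOSE] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            # Smaller games were filled in earlier, so each lookup is already final.
            if i > 0 and j > 0 and table[i - 1][j - 1] == LOSE:
                table[i][j] = DIAG   # take one rock from each pile
            elif i > 0 and table[i - 1][j] == LOSE:
                table[i][j] = UP     # take one rock from pile A
            elif j > 0 and table[i][j - 1] == LOSE:
                table[i][j] = LEFT   # take one rock from pile B
            # Otherwise every move hands the opponent a winning position.
    return table


if __name__ == "__main__":
    for row in build_move_table(10):
        print("  ".join(row))
```

Because each entry depends only on entries for smaller games, the same loop fills in the 20+20 table by calling `build_move_table(20)`, which is the step Bob could not carry out by hand.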
More embarrassing to Bob is that a general 10+10+10 game with three piles would turn into an impossible conundrum for him.

There are two things Bob could do to remedy his situation. First, he could take a class in algorithms to learn how to solve problems like the rock puzzle. Second, he could memorize a suitably large table that Alice gives him and use that to play the game. Leading questions notwithstanding, what would you do as a biologist?

Of course, the answer we expect to hear from most rational people is “Why in the world do I care about a game with two nerdy people and a bunch of rocks? I’m interested in biology, and this game has nothing to do with me.” This is not actually true: the rock game is in fact the ubiquitous sequence alignment problem in disguise. Although it is not immediately clear what DNA sequence alignment and the rock game have in common, the computational idea used to solve both problems is the same. The fact that Bob was not able to find the strategy for the game indicates that he does not understand how alignment algorithms work either. He might disagree if he uses alignment algorithms or BLAST¹ on a daily basis, but we argue that since he failed to come up with a strategy for the 10+10 rock game, he will also fail when confronted with a new flavor of alignment problem or a particularly complex similarity analysis. More troubling to Bob, he may find it difficult to compete with the scads of new biologists who think algorithmically about biological problems.²

Many biologists are comfortable using algorithms like BLAST without really understanding how the underlying algorithm works. This is not substantially different from a diligent robot following Alice’s winning strategy table, but it does have an important consequence. BLAST solves a particular problem only approximately and it has certain systematic weaknesses. We’re not picking on BLAST here: the reason that BLAST has these limitations is, in part, because of the particular problem that it solves. Users who do not know how BLAST works might misapply the algorithm or misinterpret the results it returns.
Biologists sometimes use bioinformatics tools simply as computational protocols in quite the same way that an uninformed mathematician might use experimental protocols without any background in biochemistry or molecular biology. In either case, important observations might be missed or incorrect conclusions drawn. Besides, intellectually interesting work can quickly become mere drudgery if one does not really understand it.

1. BLAST is a database search tool (a Google for biological sequences) that will be introduced later in this book.
2. These “new biologists” have probably already found another even more elegant solution of the rocks problem that does not require the construction of a table.

Many recent bioinformatics books cater to this sort of protocol-centric practical approach to bioinformatics. They focus on parameter settings, specific features of application, and other details without revealing the ideas behind the algorithms. This trend often follows the tradition of biology books of presenting material as a collection of facts and discoveries. In contrast, introductory books in algorithms usually focus on ideas rather than on the details of computational recipes.

Since bioinformatics is a computational science, a bioinformatics textbook should strive to present the principles that drive an algorithm’s design, rather than list a stamp collection of the algorithms themselves. We hope that describing the intellectual content of bioinformatics will help retain your interest in the subject. In this book we attempt to show that a handful of algorithmic ideas can be used to solve a large number of bioinformatics problems. We feel that focusing on ideas has more intellectual value and represents a better long-term investment: protocols change quickly, but the computational ideas don’t seem to.

We pursued a goal of presenting both the foundations of algorithms and the important results in bioinformatics under the same cover.
A more thorough approach for a student would be to take an Introduction to Algorithms course followed by a Bioinformatics course, but this is often an unrealistic expectation in view of the heavy course load biologists have to take. To make bioinformatics ideas accessible to biologists we appeal to the innate algorithmic intuition of the student and try to avoid tedious proofs. The technical details are hidden unless they are absolutely necessary.³

3. In some places we hide important computational and biological details and try to simplify the presentation. We will unavoidably be blamed later for “trivializing” bioinformatics.

This book covers both new and old areas in computational biology. Some topics, to our knowledge, have never been discussed in a textbook before, while others are relatively old-fashioned and describe some experimental approaches that are rarely used these days. The reason for including older topics is twofold. First, some of them still remain the best examples for introducing algorithmic ideas. Second, our goal is to show the progression of ideas in the field, with the implicit warning that hot areas in bioinformatics seem to come and go with alarming speed.

One observation gained from teaching bioinformatics classes is that the interest of computer science students, who usually know little of biology, fades quickly when the students are introduced to biology without links to computational issues. The same happens to biologists if they are presented with seemingly unnecessary formalism with no links to real biological problems. To hold a student’s interest, it is necessary to introduce biology and algorithms simultaneously. Our rather eclectic table of contents is a demonstration that attempts to reach this goal result in a somewhat interleaved organization of the material. However, we have tried to maintain a consistent algorithmic theme (e.g., graph algorithms) throughout each chapter.

Molecular biology and computer science are complex fields whose terminology and nomenclature can be formidable to the outsider.
Bioinformatics merges the two fields, and adds a healthy dose of statistics, combinatorics, and other branches of mathematics. Like modern biologists who have to master the dense language of mathematics and computer science, mathematicians and computer scientists working in bioinformatics have to learn the language of biology. Although the question of who faces the bigger challenge is a topic hotly debated over pints of beer, this is not the first “invasion” of foreigners into biology; seventy years ago a horde of physicists infested biology labs, ultimately to revolutionize the field by deciphering the mystery of DNA.

Two influential scientists are credited with crossing the barrier between physics and biology: Max Delbrück and Erwin Schrödinger. Trained as physicists, their entrances into the field of biology were remarkably different. Delbrück, trained as an atomic physicist by Niels Bohr, quickly became an expert in genetics; in 1945 he was already teaching genetics to other biologists.⁴ Schrödinger, on the other hand, never turned into a “certified” geneticist and remained somewhat of a biological dilettante. However, his book What Is Life?, published in 1944, was influential to an entire generation of physicists and biologists. Both James Watson (a biology student who wanted to be a naturalist) and Francis Crick (a physicist who worked on magnetic mines) switched careers to DNA science after reading Schrödinger’s book. Another Nobel laureate, Sydney Brenner, even admitted to stealing a copy from the public library in Johannesburg, South Africa.

4. Delbrück founded the famous phage genetics courses at Cold Spring Harbor Laboratory.

As with Delbrück and Schrödinger, there is great variety in the biological background of today’s computer scientists-turned-bioinformaticians. Some of them have become experts in biology (though very few put on lab coats and perform experiments) while others remain biological dilettantes. Although there exists an opinion that every bioinformatician should be an expert in both biology and computer science, we are not sure that this is feasible.
First, it takes a lot of work just to master one of the two, so perhaps understanding two in equal amounts is a bit much. Second, it is good to recall that the first pioneers of DNA science were, in fact, self-proclaimed dilettantes. James Watson knew almost no organic or physical chemistry before he started working on the double helix; Francis Crick, being a physicist, knew very little biology. Neither saw any need to know about (let alone memorize) the chemical structure of the four nucleotide bases when they discovered the structure of DNA.⁵ When asked by Erwin Chargaff how they could possibly expect to resolve the structure of DNA without knowing the structures of its constituents, they responded that they could always look up the structures in a book if the need arose. Of course, they understood the physical principles behind a compound’s structure.

5. Accordingly, we do not present anywhere in this book the chemical structures of either nucleotides or amino acids. No algorithm in this book requires knowledge of their structure.

The reality is that even the most biologically oriented bioinformaticians are experts only in some specific area of biology. Like Delbrück, who probably would never have passed an exam in biology in the 1930s (when zoology and botany remained the core of the mainstream biological curriculum), a typical modern-day bioinformatician is unlikely to pass the sequence of organic chemistry, biochemistry, and structural biochemistry classes that a “real” biologist has to take. The question of how much biology a good computer scientist-turned-bioinformatician has to know seems to be best answered with “enough to deeply understand the biological problem and to turn it into an adequate computational problem.” This book provides a very brief introduction to biology. We do not claim that this is the best approach. Fortunately, an interested reader can use Watson’s approach and look up the biological details in the books when the need arises, or read pages 1 through 1294 of Alberts and colleagues’ (including Watson) book Molecular Biology of the Cell (3).

This book is what we, as computer scientists, believe that a modern biologist ought to know about computer science if he or she is to be a successful researcher.

2 Algorithms and Complexity

This book is about how to design algorithms that solve biological problems. We will see how popular bioinformatics algorithms work and we will see what principles drove their design. It is important to understand how an algorithm works in order to be confident in its results; it is even more important to understand an algorithm’s design methodology in order to identify its potential weaknesses and fix them.

Before considering any algorithms in detail, we need to define loosely what we mean by the word “algorithm” and what might qualify as one. In many places throughout this text we try to avoid tedious mathematical formalisms, yet leave intact the rigor and intuition behind the important concepts.

2.1 What Is an Algorithm?

Roughly speaking, an algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem. We will specify problems in terms of their inputs and their outputs, and the algorithm will be the method of translating the inputs into the outputs. A well-formulated problem is unambiguous and precise, leaving no room for misinterpretation.

In order to solve a problem, some entity needs to carry out the steps specified by the algorithm. A human with a pen and paper would be able to do this, but humans are generally slow, make mistakes, and prefer not to perform repetitive work. A computer is less intelligent but can perform simple steps quickly and reliably. A computer cannot understand English, so algorithms must be rephrased in a programming language such as C or Java in order to give specific instructions to the processor. Every detail must be specified to the computer in exactly the right format, making it difficult to describe algorithms; trifling details that a person would naturally understand must be specified.
If a computer were to put on shoes, one would need to tell it to find a pair that both matches and fits, to put the left shoe on the left foot, the right shoe on the right, and to tie the laces.¹ In this book, however, we prefer to simply leave it at “Put on a pair of shoes.”

However, to understand how an algorithm works, we need some way of listing the steps that the algorithm takes, while being neither too vague nor too formal. We will use pseudocode, whose elementary operations are summarized below. Pseudocode is a language computer scientists often use to describe algorithms: it ignores many of the details that are required in a programming language, yet it is more precise and less ambiguous than, say, a recipe in a cookbook. Individually, the operations do not solve any particularly difficult problems, but they can be grouped together into minialgorithms called subroutines that do.

In our particular flavor of pseudocode, we use the concepts of variables, arrays, and arguments. A variable, written as x or total, contains some numerical value and can be assigned a new numerical value at different points throughout the course of an algorithm. An array of n elements is an ordered collection of n variables a_1, a_2, ..., a_n. We usually denote arrays by boldface letters like a = (a_1, a_2, ..., a_n) and write the individual elements as a_i where i is between 1 and n. An algorithm in pseudocode is denoted by a name, followed by the list of arguments that it requires, like MAX(a, b) below; this is followed by the statements that describe the algorithm’s actions. One can invoke an algorithm by passing it the appropriate values for its arguments. For example, MAX(1, 99) would return the larger of 1 and 99. The operation return reports the result of the program or simply signals its end. Below are brief descriptions of the elementary commands that we use in the pseudocode throughout this book.²

Assignment
Format: a ← b
Effect: Sets the variable a to the value b.

1. It is surprisingly difficult to write an unambiguous set of instructions on how to tie a shoelace.
2. An experienced computer programmer might be confused by our not using “end if” or “end for”, which is the conventional practice. We rely on indentation to demarcate blocks of pseudocode.
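To make these conventions concrete, the following Python sketch shows one possible counterpart of a variable, an array, the assignment a ← b, and an invocation such as MAX(1, 99). It is an illustration rather than the book's own pseudocode: the name `max_of` and the sample values are assumptions, and Python numbers array elements from 0 where the text numbers them from 1. Like the pseudocode, Python relies on indentation to mark blocks.

```python
def max_of(a, b):
    """Hypothetical counterpart of MAX(a, b): return the larger argument."""
    if a < b:
        return b   # 'return' reports the result and ends the subroutine
    return a


# Variable: holds a numerical value that can be reassigned later.
total = 0

# Assignment (total <- MAX(1, 99)): invoke the algorithm with argument values.
total = max_of(1, 99)

# Array of n = 3 elements; the text's a_1 corresponds to values[0] here.
values = [3, 1, 4]
first = values[0]
```

After these statements run, `total` holds the larger of 1 and 99, and `first` holds the element the text would call a_1.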
