AN INTRODUCTION TO
BIOINFORMATICS ALGORITHMS
NEIL C. JONES AND PAVEL A. PEVZNER
Preface
In the early 1990s when one of us was teaching his first bioinformatics
class, he was not sure that there would be enough students to teach.
Although the Smith-Waterman and BLAST algorithms had already been
developed, they had not become the household names among biologists
that they are today. Even the term “bioinformatics” had not yet been
coined. DNA arrays were viewed by most as intellectual toys with dubious
practical application, except for a handful of enthusiasts who saw a
vast potential in the technology. A few bioinformaticians were
developing new algorithmic ideas for nonexistent data sets: David
Sankoff laid the foundations of genome rearrangement studies at a time
when there was practically no gene order data, Michael Waterman and
Gary Stormo were developing motif finding algorithms when there were
very few promoter samples available, Gene Myers was developing
sophisticated fragment assembly tools when no bacterial genome had been
assembled yet, and Webb Miller was dreaming about comparing
billion-nucleotide-long DNA sequences when the 172,282-nucleotide
Epstein-Barr virus was the longest GenBank entry. GenBank itself had
just recently made a transition from a series of bound (paper!) volumes
to an electronic database on magnetic tape that could be sent to
scientists worldwide.
One has to go back to the mid-1980s and early 1990s to fully appreciate
the revolution in biology that has taken place in the last decade.
However, bioinformatics has affected more than just biology: it has
also had a profound impact on the computational sciences. Biology has
rapidly become a large source of new algorithmic and statistical
problems, and has arguably been the target for more algorithms than any
of the other fundamental sciences. This link between computer science
and biology has important educational implications that change the way
we teach computational ideas to biologists, as well as how applied
algorithmics is taught to computer scientists.
1
Introduction
Imagine Alice, Bob, and two piles of ten rocks. Alice and Bob are bored
one Saturday afternoon so they play the following game. In each turn a
player may either take one rock from a single pile, or one rock from
both piles. Once the rocks are taken, they are removed from play; the
player that takes the last rock wins the game. Alice moves first.
It is not immediately clear what the winning strategy is, or even if
there is one. Does the first player (or the second) always have an
advantage? Bob tries to analyze the game and realizes that there are
too many variants in the game with two piles of ten rocks (which we
will refer to as the 10+10 game). Using a reductionist approach, he
first tries to find a strategy for the simpler 2+2 game. He quickly
sees that the second player (himself, in this case) wins any 2+2 game,
so he decides to write the “winning recipe”:
If Alice takes one rock from each pile, I will take the remaining rocks
and win. If Alice takes one rock, I will take one rock from the same
pile. As a result, there will be only one pile and it will have two
rocks in it, so Alice’s only choice will be to take one of them. I will
take the remaining rock to win the game.
Inspired by this analysis, Bob makes a leap of faith: the second player
(i.e., himself) wins in any n+n game, for n ≥ 2. Of course, every
hypothesis must be confirmed by experiment, so Bob plays a few rounds
with Alice. It turns out that sometimes he wins and sometimes he loses.
Bob tries to come up with a simple recipe for the 3+3 game, but there
are a large number of different game sequences to consider, and the
recipe quickly gets too complicated. There is simply no hope of writing
a recipe for the 10+10 game because the number of different strategies
that Alice can take is enormous.
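To get a rough feel for how quickly the possibilities multiply, one can
count the complete game sequences for a few pile sizes (sequences, not
strategies, which are harder to enumerate). The following is a minimal
Python sketch; the helper name count_game_sequences is hypothetical and
chosen only for this illustration. It counts every distinct order of
moves that takes piles of a and b rocks down to nothing.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_game_sequences(a: int, b: int) -> int:
    """Count the distinct complete move sequences starting from piles of a and b rocks."""
    if a == 0 and b == 0:
        return 1  # no rocks left: exactly one (empty) way to finish
    total = 0
    if a > 0:
        total += count_game_sequences(a - 1, b)      # one rock from pile A
    if b > 0:
        total += count_game_sequences(a, b - 1)      # one rock from pile B
    if a > 0 and b > 0:
        total += count_game_sequences(a - 1, b - 1)  # one rock from each pile
    return total


for n in (2, 3, 10):
    print(f"{n}+{n} game: {count_game_sequences(n, n)} possible game sequences")
```

Under these rules the sketch reports 13 sequences for the 2+2 game, 63
for the 3+3 game, and 8,097,453 for the 10+10 game, which suggests why a
hand-written recipe stops being practical.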
Meanwhile, Alice quickly realizes that she will always lose the 2+2
game, but she does not lose hope of finding a winning strategy for the
3+3 game. Moreover, she took Algorithms 101 and she understands that
recipes written in the style that Bob uses will not help very much:
recipe-style instructions are not a sufficiently expressive language
for describing algorithms. Instead, she begins by drawing the following
table filled with the symbols ↑, ←, ↖, and ∗. The entry in position
(i, j) (i.e., the ith row and the jth column) describes the moves that
Alice will make in the i+j game, with i and j rocks in piles A and B
respectively. A ← indicates that she should take one stone from pile B.
A ↑ indicates that she should take one stone from pile A. A ↖ indicates
that she should take one stone from each pile, and ∗ indicates that she
should not bother playing the game because she will definitely lose
against an opponent who has a clue.
    0  1  2  3  4  5  6  7  8  9 10
 0  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 1  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 2  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 3  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 4  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 5  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 6  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 7  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
 8  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
 9  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑  ↖  ↑
10  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗  ←  ∗
For example, if she is faced with the 3+3 game, she finds a ↖ in the
third row and third column, indicating that she should take a rock from
each pile. This makes Bob take the first move in a 2+2 game, which is
marked with a ∗. No matter what he does, Alice wins. Suppose Bob takes
a rock from pile B; this leads to the 2+1 game. Alice again consults
the table by reading the entry at (2,1), seeing that she should also
take a rock from pile B, leaving two rocks in A. However, if Bob had
instead taken a rock from pile A, Alice would consult entry (1,2) to
find ↑. She again should also take a rock from pile A, leaving two
rocks in pile B.
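A table like Alice’s can be filled in mechanically. The Python sketch
below is one possible way to do it, under the assumption that a position
is winning for the player about to move exactly when some legal move
leaves the opponent in a losing position. The function name
build_move_table, the ASCII * standing in for ∗, and the diagonal arrow
↖ for "one rock from each pile" are choices made for this illustration.
It works upward from the empty game, so every smaller position is
already decided when a larger one is examined.

```python
def build_move_table(n: int) -> list:
    """Return table[i][j]: the move to make with i rocks in pile A and j in pile B."""
    # wins[i][j] is True when the player about to move can force a win.
    wins = [[False] * (n + 1) for _ in range(n + 1)]
    # "*" marks positions that are lost against a careful opponent.
    table = [["*"] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            # Candidate moves: (symbol, rocks left in A, rocks left in B).
            moves = [("↖", i - 1, j - 1), ("↑", i - 1, j), ("←", i, j - 1)]
            for symbol, a, b in moves:
                # A move is good if it is legal and leaves a losing position.
                if a >= 0 and b >= 0 and not wins[a][b]:
                    wins[i][j] = True
                    table[i][j] = symbol
                    break
    return table


table = build_move_table(10)
print("   " + " ".join(f"{j:>2}" for j in range(11)))
for i, row in enumerate(table):
    print(f"{i:>2} " + " ".join(f"{s:>2}" for s in row))
```

Calling build_move_table(20) yields the 20+20 table in the same way,
which is the step Bob could not take on his own.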
Impressed by the table, Bob learns how to use it to win the 10+10 game.
However, Bob does not know how to construct a similar table for the
20+20 game. The problem is not that Bob is stupid, but that he has not
studied algorithms. Even if, through sheer luck, Bob figured out how to
always win the 20+20 game, he could neither say with confidence that it
would work no matter what Alice did, nor would he even be able to write
down the recipe for the general n+n game. More embarrassing to Bob is
that a general 10+10+10 game with three piles would turn into an
impossible conundrum for him.
There are two things Bob could do to remedy his situation. First, he
could take a class in algorithms to learn how to solve problems like
the rock puzzle. Second, he could memorize a suitably large table that
Alice gives him and use that to play the game. Leading questions
notwithstanding, what would you do as a biologist?
Of course, the answer we expect to hear from most rational people is
“Why in the world do I care about a game with two nerdy people and a
bunch of rocks? I’m interested in biology, and this game has nothing to
do with me.” This is not actually true: the rock game is in fact the
ubiquitous sequence alignment problem in disguise. Although it is not
immediately clear what DNA sequence alignment and the rock game have in
common, the computational idea used to solve both problems is the same.
The fact that Bob was not able to find the strategy for the game
indicates that he does not understand how alignment algorithms work
either. He might disagree if he uses alignment algorithms or BLAST¹ on
a daily basis, but we argue that since he failed to come up with a
strategy for the 10+10 rock game, he will also fail when confronted
with a new flavor of alignment problem or a particularly complex
similarity analysis. More troubling to Bob, he may find it difficult to
compete with the scads of new biologists who think algorithmically
about biological problems.²
Many biologists are comfortable using algorithms like BLAST without
really understanding how the underlying algorithm works. This is not
substantially different from a diligent robot following Alice’s winning
strategy table, but it does have an important consequence. BLAST solves
a particular problem only approximately and it has certain systematic
weaknesses. We’re not picking on BLAST here: the reason that BLAST has
these limitations is, in part, because of the particular problem that
it solves. Users who do not know how BLAST works might misapply the
algorithm or misinterpret the results it returns. Biologists sometimes
use bioinformatics tools simply as computational protocols in quite the
same way that an uninformed mathematician might use experimental
protocols without any background in biochemistry or molecular biology.
In either case, important observations might be missed or incorrect
conclusions drawn. Besides, intellectually interesting work can quickly
become mere drudgery if one does not really understand it.
1. BLAST is a database search tool (a Google for biological sequences)
that will be introduced later in this book.
2. These “new biologists” have probably already found another even more
elegant solution of the rocks problem that does not require the
construction of a table.
Many recent bioinformatics books cater to this sort of protocol-centric
practical approach to bioinformatics. They focus on parameter settings,
specific features of application, and other details without revealing
the ideas behind the algorithms. This trend often follows the tradition
of biology books of presenting material as a collection of facts and
discoveries. In contrast, introductory books in algorithms usually
focus on ideas rather than on the details of computational recipes.
Since bioinformatics is a computational science, a bioinformatics
textbook should strive to present the principles that drive an
algorithm’s design, rather than list a stamp collection of the
algorithms themselves. We hope that describing the intellectual content
of bioinformatics will help retain your interest in the subject. In
this book we attempt to show that a handful of algorithmic ideas can be
used to solve a large number of bioinformatics problems. We feel that
focusing on ideas has more intellectual value and represents a better
long-term investment: protocols change quickly, but the computational
ideas don’t seem to.
We pursued a goal of presenting both the foundations of algorithms and
the important results in bioinformatics under the same cover. A more
thorough approach for a student would be to take an Introduction to
Algorithms course followed by a Bioinformatics course, but this is
often an unrealistic expectation in view of the heavy course load
biologists have to take. To make bioinformatics ideas accessible to
biologists we appeal to the innate algorithmic intuition of the student
and try to avoid tedious proofs. The technical details are hidden
unless they are absolutely necessary.³
This book covers both new and old areas in computational biology. Some
topics, to our knowledge, have never been discussed in a textbook
before, while others are relatively old-fashioned and describe some
experimental approaches that are rarely used these days. The reason for
including older topics is twofold. First, some of them still remain the
best examples for introducing algorithmic ideas. Second, our goal is to
show the progression of ideas in the field, with the implicit warning
that hot areas in bioinformatics seem to come and go with alarming
speed.
3. In some places we hide important computational and biological
details and try to simplify the presentation. We will unavoidably be
blamed later for “trivializing” bioinformatics.
One observation gained from teaching bioinformatics classes is that the
interest of computer science students, who usually know little of
biology, fades quickly when the students are introduced to biology
without links to computational issues. The same happens to biologists
if they are presented with seemingly unnecessary formalism with no
links to real biological problems. To hold a student’s interest, it is
necessary to introduce biology and algorithms simultaneously. Our
rather eclectic table of contents is a demonstration that attempts to
reach this goal result in a somewhat interleaved organization of the
material. However, we have tried to maintain a consistent algorithmic
theme (e.g., graph algorithms) throughout each chapter.
Molecular biology and computer science are complex fields whose
terminology and nomenclature can be formidable to the outsider.
Bioinformatics merges the two fields, and adds a healthy dose of
statistics, combinatorics, and other branches of mathematics. Like
modern biologists who have to master the dense language of mathematics
and computer science, mathematicians and computer scientists working in
bioinformatics have to learn the language of biology. Although the
question of who faces the bigger challenge is a topic hotly debated
over pints of beer, this is not the first “invasion” of foreigners into
biology; seventy years ago a horde of physicists infested biology labs,
ultimately to revolutionize the field by deciphering the mystery of
DNA.
Two influential scientists are credited with crossing the barrier
between physics and biology: Max Delbrück and Erwin Schrödinger. Both
were trained as physicists, but their entrances into the field of
biology were remarkably different. Delbrück, trained as an atomic
physicist by Niels Bohr, quickly became an expert in genetics; in 1945
he was already teaching genetics to other biologists.⁴ Schrödinger, on
the other hand, never turned into a “certified” geneticist and remained
somewhat of a biological dilettante. However, his book What Is Life?,
published in 1944, was influential to an entire generation of
physicists and biologists. Both James Watson (a biology student who
wanted to be a naturalist) and Francis Crick (a physicist who worked on
magnetic mines) switched careers to DNA science after reading
Schrödinger’s book. Another Nobel laureate, Sydney Brenner, even
admitted to stealing a copy from the public library in Johannesburg,
South Africa.
As with Delbrück and Schrödinger, there is great variety in the
biological background of today’s computer
scientists-turned-bioinformaticians. Some of them have become experts
in biology (though very few put on lab coats and perform experiments)
while others remain biological dilettantes. Although there exists an
opinion that every bioinformatician should be an expert in both biology
and computer science, we are not sure that this is feasible. First, it
takes a lot of work just to master one of the two, so perhaps
understanding two in equal amounts is a bit much. Second, it is good to
recall that the first pioneers of DNA science were, in fact,
self-proclaimed dilettantes. James Watson knew almost no organic or
physical chemistry before he started working on the double helix;
Francis Crick, being a physicist, knew very little biology. Neither saw
any need to know about (let alone memorize) the chemical structure of
the four nucleotide bases when they discovered the structure of DNA.⁵
When asked by Erwin Chargaff how they could possibly expect to resolve
the structure of DNA without knowing the structures of its
constituents, they responded that they could always look up the
structures in a book if the need arose. Of course, they understood the
physical principles behind a compound’s structure.
4. Delbrück founded the famous phage genetics courses at Cold Spring
Harbor Laboratory.
The reality is that even the most biologically oriented
bioinformaticians are experts only in some specific area of biology.
Like Delbrück, who probably would never have passed an exam in biology
in the 1930s (when zoology and botany remained the core of the
mainstream biological curriculum), a typical modern-day
bioinformatician is unlikely to pass the sequence of organic chemistry,
biochemistry, and structural biochemistry classes that a “real”
biologist has to take. The question of how much biology a good computer
scientist-turned-bioinformatician has to know seems to be best answered
with “enough to deeply understand the biological problem and to turn it
into an adequate computational problem.” This book provides a very
brief introduction to biology. We do not claim that this is the best
approach. Fortunately, an interested reader can use Watson’s approach
and look up the biological details in the books when the need arises,
or read pages 1 through 1294 of Alberts and colleagues’ (including
Watson) book Molecular Biology of the Cell (3).
This book is what we, as computer scientists, believe that a modern
biologist ought to know about computer science if he or she would be a
successful researcher.
5. Accordingly, we do not present anywhere in this book the chemical
structures of either nucleotides or amino acids. No algorithm in this
book requires knowledge of their structure.
2
Algorithms and Complexity
This book is about how to design algorithms that solve biological
problems. We will see how popular bioinformatics algorithms work and we
will see what principles drove their design. It is important to
understand how an algorithm works in order to be confident in its
results; it is even more important to understand an algorithm’s design
methodology in order to identify its potential weaknesses and fix them.
Before considering any algorithms in detail, we need to define loosely
what we mean by the word “algorithm” and what might qualify as one.
In many places throughout this text we try to avoid tedious mathematical
formalisms, yet leave intact the rigor and intuition behind the
important concept.
2.1 What Is an Algorithm?
Roughly speaking, an algorithm is a sequence of instructions that one
must perform in order to solve a well-formulated problem. We will
specify problems in terms of their inputs and their outputs, and the
algorithm will be the method of translating the inputs into the
outputs. A well-formulated problem is unambiguous and precise, leaving
no room for misinterpretation.
In order to solve a problem, some entity needs to carry out the steps
specified by the algorithm. A human with a pen and paper would be able
to do this, but humans are generally slow, make mistakes, and prefer
not to perform repetitive work. A computer is less intelligent but can
perform simple steps quickly and reliably. A computer cannot understand
English, so algorithms must be rephrased in a programming language such
as C or Java in order to give specific instructions to the processor.
Every detail must be specified to the computer in exactly the right
format, making it difficult to describe algorithms; trifling details
that a person would naturally understand must be specified. If a
computer were to put on shoes, one would need to tell it to find a pair
that both matches and fits, to put the left shoe on the left foot, the
right shoe on the right, and to tie the laces.¹ In this book, however,
we prefer to simply leave it at “Put on a pair of shoes.”
However, to understand how an algorithm works, we need some way of
listing the steps that the algorithm takes, while being neither too
vague nor too formal. We will use pseudocode, whose elementary
operations are summarized below. Pseudocode is a language computer
scientists often use to describe algorithms: it ignores many of the
details that are required in a programming language, yet it is more
precise and less ambiguous than, say, a recipe in a cookbook.
Individually, the operations do not solve any particularly difficult
problems, but they can be grouped together into minialgorithms called
subroutines that do.
In our particular flavor of pseudocode, we use the concepts of
variables, arrays, and arguments. A variable, written as x or total,
contains some numerical value and can be assigned a new numerical value
at different points throughout the course of an algorithm. An array of
n elements is an ordered collection of n variables a_1, a_2, ..., a_n.
We usually denote arrays by boldface letters like a = (a_1, a_2, ...,
a_n) and write the individual elements as a_i where i is between 1 and
n. An algorithm in pseudocode is denoted by a name, followed by the
list of arguments that it requires, like MAX(a, b) below; this is
followed by the statements that describe the algorithm’s actions. One
can invoke an algorithm by passing it the appropriate values for its
arguments. For example, MAX(1, 99) would return the larger of 1 and 99.
The operation return reports the result of the program or simply
signals its end.
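As a rough illustration of these notions in a real programming
language, here is how a variable, an array, and an algorithm with
arguments might look in Python. This is only a sketch: the function
name max_of_two and its body are assumptions standing in for one
plausible reading of MAX(a, b), and Python numbers array elements from
0 rather than from 1 as the pseudocode does.

```python
def max_of_two(a, b):
    """Return the larger of the two arguments a and b."""
    if a < b:
        return b
    return a


total = 0            # a variable holding a numerical value
total = total + 5    # the same variable assigned a new value later on

a = [3, 1, 4, 1, 5]  # an array of n = 5 elements
n = len(a)
first, last = a[0], a[n - 1]  # a_1 and a_n in the 1-based notation

# Invoke the algorithm by passing values for its arguments.
print(max_of_two(1, 99))
```

The call max_of_two(1, 99) returns 99, matching the MAX(1, 99) example
in the text. Like the pseudocode, Python marks blocks by indentation
rather than by closing keywords.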
Below are brief descriptions of the elementary commands that we use in
the pseudocode throughout this book.²
Assignment
Format: a ← b
Effect: Sets the variable a to the value b.
1. It is surprisingly difficult to write an unambiguous set of
instructions on how to tie a shoelace.
2. An experienced computer programmer might be confused by our not
using “end if” or “end for”, which is the conventional practice. We
rely on indentation to demarcate blocks of pseudocode.