Bioinformatics Algorithms
Design and Implementation in Python
Miguel Rocha
University of Minho, Braga, Portugal
Pedro G. Ferreira
Ipatimup/i3S, Porto, Portugal
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101-4495, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
Copyright © 2018 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on
how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as
the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted
herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in
research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods,
compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the
safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or
damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods,
products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-812520-5
For information on all Academic Press publications
visit our website at https://www.elsevier.com/books-and-journals
Publisher: Mara Conner
Acquisition Editor: Chris Katsaropoulos
Editorial Project Manager: Serena Castelnovo
Production Project Manager: Vijayaraj Purushothaman
Designer: Miles Hitchen
Typeset by VTeX
CHAPTER 1
Introduction
1.1 Prelude
In the last decades, important advances have been achieved in the biological and biomedical
fields, boosted by major progress in experimental technologies. The
best known, and arguably most relevant, example comes from the impressive evolution of
sequencing technologies over the last 40 years, driven by the large investment in the Human
Genome Project, mainly in the 1990s [92,150].
Additionally, other high-throughput technologies for measuring gene expression, or protein and
compound concentrations in cells, have led to a real revolution in biological and medical
research. All these techniques are currently able to generate massive amounts of so-called
omics data, which can be used to foster scientific research in the life sciences and promote the
development of novel technologies in health care, biotechnology and related areas.
As just two examples of the impact of these novel technologies and the data they produce, we can
point to the impressive development in areas such as personalized (or precision) medicine
and metabolic engineering efforts within industrial biotechnology.
Precision medicine addresses the growing trend of tailoring treatments to the characteristics
of individual patients (or groups of patients). This has been made increasingly possible by the
availability of genomic, epigenomic, gene expression, and other types of data about specific
patients, making it possible to determine distinct risk profiles for certain diseases, or to study
differentiated effects of treatments correlated to patterns in genomic, epigenomic or gene
expression data. These data make it possible to design specific courses of action based on the
patient’s profile, enabling more accurate diagnoses and specific treatment plans. This field is
expected to grow significantly in the coming years, as confirmed by projects such as the
100,000 Genomes Project, launched by the UK Prime Minister David Cameron in 2012
(https://www.genomicsengland.co.uk/the-100000-genomes-project/), or the
Precision Medicine Initiative, announced in January 2015 by President Barack
Obama, which started in February 2016.
Cancer research is an area that has largely benefited from the recent advances in molecular assays.
Projects such as the Genomic Data Commons (https://gdc.cancer.gov) or the International
Cancer Genome Consortium (ICGC, http://icgc.org/) are generating comprehensive
and multi-dimensional maps of the genomic alterations in cancer cells from hundreds of
individuals in dozens of tumor types, with a visible scientific, clinical, and societal impact.
Bioinformatics Algorithms. DOI: 10.1016/B978-0-12-812520-5.00001-8
Other current large-scale efforts boosted by the use of high-throughput technologies and
led by international consortia are generating data at an unprecedented scale and changing
our view of human molecular biology. Of note are projects such as the 1000 Genomes
Project (www.internationalgenome.org/), which provides a catalog of human genetic
variation across worldwide populations; the Encyclopedia of DNA Elements (ENCODE,
https://www.encodeproject.org/), which has built a map of functional elements in the human
genome; the Epigenomics Roadmap (http://www.roadmapepigenomics.org/), which is
characterizing the epigenomic landscapes of primary human tissues and cells; and the
Genotype-Tissue Expression project (GTEx, https://www.gtexportal.org/), which is providing
gene expression and quantitative trait loci from more than 50 human tissues.
On the other hand, metabolic engineering is related to the improvement of specific microbes
used in industrial biotechnological processes to produce important compounds such as bio-fuels,
plastics, pharmaceuticals, foods, food ingredients and other added-value compounds. Strategies
used to improve host microbes include blocking competing pathways through gene deletion
or inactivation, overexpressing relevant genes, introducing heterologous genes, and enzyme
engineering.
In both cases, the impact of data availability has been tremendous, opening new avenues for
scientific advance and technological development. However, this has also raised significant
challenges in the management and analysis of such complex and large volumes of data.
Biological research has become in many aspects very data-oriented, and this has been intricately
connected to the ability to handle these huge amounts of data and generate novel knowledge from them, or,
as Florian Markowetz recently put it, “All biology is computational biology” [108]. Therefore,
the value of the sophisticated computational tools that have been developed to address
these data processing and analysis tasks has been undeniable.
This book is about Bioinformatics, the field that aims to handle these biological data using
computers, seeking to unravel novel knowledge from raw data. In the next section, we
will discuss further what Bioinformatics is, and the different tasks and scientific disciplines
that are involved in the field. To close the chapter, we will give an overview of the content of the
remainder of the book, to help the reader navigate it.
1.2 What is Bioinformatics
Bioinformatics is a multi-disciplinary field at the intersection of Biology, Computer Science,
and Statistics. Naturally, its development has followed the technological advances and
research trends in Biology and Information Technologies. Thus, although it is still a young field,
it is evolving fast and its scope has been successively redefined. For instance, the National
Institutes of Health (NIH) defines Bioinformatics in a broad way, as the “research, development,
or application of computational tools and approaches for expanding the use of biological,
medical, behavioral, or health data” [79]. According to this definition, the tasks
involved include data acquisition, storage, archival, analysis, and visualization.
Some authors have a more focused definition, which relates Bioinformatics mainly to the
study of macromolecules at the cellular level, and emphasizes its capability of handling
large-scale data [105]. Indeed, since its appearance, the main tasks of Bioinformatics have been
related to handling data at a cellular level, and this will also be the focus of this book.
In the same seminal document from the NIH, the related field of Computational Biology
is defined as the “development and application of data-analytical and theoretical methods,
mathematical modeling, and computational simulation techniques to the study of biological,
behavioral, and social systems”. Thus, although the two are deeply related, and the terms are sometimes used
interchangeably by some authors, the first (Bioinformatics) relates to a more technologically
oriented view, while the second is more related to the study of natural systems and their
modeling. This does not prevent a large overlap between the two fields.
Bioinformatics tackles a large number of research problems. For instance, the Bioinformatics
journal (https://academic.oup.com/bioinformatics) publishes research on application
areas that include genome analysis, phylogenetics, genetics and population analysis, gene
expression, structural biology, text mining, image analysis, and ontologies and databases.
The National Center for Biotechnology Information (NCBI,
https://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/bioinformatics.html) divides
Bioinformatics into three main areas:
• developing new algorithms and statistics to assess relationships within large data sets;
• analyzing and interpreting different types of data (e.g. nucleotide and amino acid
sequences, protein domains, and protein structures);
• developing and implementing tools that enable efficient access and management of different
types of information.
This book will focus mainly on the first of these areas, covering the main algorithms that have
been proposed to address Bioinformatics tasks. The emphasis will be put on algorithms for
sequence processing and analysis, considering both nucleotide and amino acid sequences.
1.3 Book’s Organization
This book is organized into four logical parts encompassing the major themes addressed in
this text, each containing chapters dealing with specific topics.
In the first part, where this chapter is included, we introduce the field of Bioinformatics,
providing relevant concepts and definitions. Since this is an interdisciplinary field, we will need
to address some fundamental aspects regarding algorithms and the Python programming
language (Chapter 2), and cover some biological background needed to understand the algorithms put
forward in the following parts of the book (Chapter 3).
The second part of this book addresses a number of problems related to sequence analysis,
introducing algorithms and proposing illustrative Python functions and programs to solve them.
The Bioinformatics tasks addressed will cover topics related to basic sequence processing
and analysis tasks, such as the ones involved in transcription and translation (Chapter 4),
algorithms for finding patterns in sequences (Chapter 5), pairwise and multiple sequence
alignment algorithms (Chapters 6 and 8), searching homologous sequences in databases
(Chapter 7), algorithms for phylogenetic analysis from sequences (Chapter 9), biological
motif discovery with deterministic and stochastic algorithms (Chapters 10, 11), and finally
Hidden Markov Models and their applications in Bioinformatics (Chapter 12).
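To give a flavor of the basic sequence processing tasks listed for this part, the following is a minimal sketch, assuming that DNA sequences are represented as plain Python strings over the alphabet A, C, G, T. The function names transcribe and reverse_complement are illustrative choices made here and are not taken from the book's own code.

```python
# Illustrative sketch only: names and representation are assumptions,
# not the book's implementation.

# Translation table mapping each DNA base to its complementary base.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(dna: str) -> str:
    """Return the RNA version of a DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    seq = "ATGGCT"
    print(transcribe(seq))          # AUGGCU
    print(reverse_complement(seq))  # AGCCAT
```

Representing sequences as strings keeps the sketch short; it performs no validation, so characters outside A, C, G, T are passed through unchanged.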
The third part of the book will focus on more advanced algorithms, based on graphs as data
structures, which make it possible to handle large-scale sequence analysis tasks, such as the ones
typically involved in processing and analyzing next-generation sequencing (NGS) data. This
part starts with an introduction to graph data structures and algorithms (Chapter 13), addresses
the construction and exploration of biological networks using graphs (Chapter 14), and then focuses on
algorithms to handle NGS data, addressing the tasks of assembling reads into full genomes
(Chapter 15) and matching reads to reference genomes (Chapter 16).
The book closes with Part IV, where a number of complementary resources to this book are
identified (Chapter 17), including interesting books and articles, online courses, and
Python-related resources, and where some final words are put forward.
As a complementary source of information, a website has been developed to accompany the
book’s materials, including code examples and proposed solutions for many of the exercises
put forward at the end of each chapter.