Mourad Elloumi, Editor
Algorithms for Next-Generation Sequencing Data
Techniques, Approaches, and Applications
Editor
Mourad Elloumi
LaTICE
Tunis, Tunisia
University of Tunis-El Manar
Tunis, Tunisia
ISBN 978-3-319-59824-6 ISBN 978-3-319-59826-0 (eBook)
DOI 10.1007/978-3-319-59826-0
Library of Congress Control Number: 2017950216
© Springer International Publishing AG 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my parents and my children.
Preface
A deoxyribonucleic acid (DNA) macromolecule can be coded by a sequence over
a four-letter alphabet. These letters are A, C, G, and T, and they code respectively
the bases Adenine, Cytosine, Guanine and Thymine. DNA sequencing consists then
in determining the exact order of these bases in a DNA macromolecule. As a matter
of fact, DNA sequencing technology is playing a key role in the advancement of
molecular biology. Compared to previous sequencing machines, Next-Generation
Sequencing (NGS) machines function much faster, with significantly lower
production costs and much higher throughput in the form of short reads, i.e., short
sequences coding portions of DNA macromolecules.
As a result of the extended spread of NGS machines, we are witnessing an
exponential growth in the number of newly available short reads. Hence, we are facing
the challenge of storing them to analyze huge numbers of reads representing sets of
portions of genomes, or even whole genomes. The analysis of this huge number of
reads will help, among others, to decode life’s mysteries, detect pathogens, make
better crops, and improve quality of life. This is a difficult task, and it is made even
more difficult not only by the short lengths of the reads and the huge number of these
reads but also by the presence of high similarity between the concerned portions of
genomes, or whole genomes, and by the presence of many repetitive structures in
these genomes, or whole genomes. Such a task requires the development of fast
algorithms with low memory requirements and high performance.
This book surveys the most recent developments on algorithms for NGS data,
offering enough fundamental and technical information on these algorithms and the
related problems, without overcrowding the reader’s head. It presents the results
of the latest investigations in the field of NGS data analysis. The algorithms
presented in this book deal with the most important and/or the newest topics
encountered in this field. These algorithms are based on new/improved approaches
and/or techniques. The few published books on algorithms for NGS data either
lack technical depth or focus on specific topics. This book is the first overview
on algorithms for NGS data with both a wide coverage of this field and enough
depth to be of practical use to working professionals. So, this book tries to find
a balance between theoretical and practical coverage of a wide range of issues in
the field of NGS data analysis. The techniques and approaches presented in this
book combine sound theory with practical applications in life sciences. Certainly,
the list of topics covered in this book is not exhaustive, but it is hoped that these
topics will get the reader to think of the implications of the presented algorithms
on other topics. The chapters presented in this book were carefully selected for
quality and relevance. This book also presents experiments that provide qualitative
and quantitative insights into the field of NGS data analysis. It is hoped that this book
will increase the interest of researchers in studying a wider range of combinatorial
problems related to NGS data analysis.
Preferably, the reader of this book should be someone who is familiar with
bioinformatics and would like to learn about algorithms that deal with the most important
and/or the newest topics encountered in the field of NGS data processing. However,
this book could be used by a wider audience such as graduate students, senior
undergraduate students, researchers, instructors, and practitioners in bioinformatics,
computer science, mathematics, statistics, and life sciences. It will be extremely
valuable and fruitful for these people. They will certainly find what they are looking
for or, at least, a clue that will help them to make an advance in their research. This
book is quite timely since NGS technology is evolving at a breathtaking speed and
will certainly point the reader to algorithms for NGS data that may be the key to
new and important discoveries in life sciences.
This book is organized into four parts: Indexing, Compression, and Storage of
NGS Data; Error Correction in NGS Data; Alignment of NGS Data; and Assembly
of NGS Data. The 14 chapters were carefully selected to provide a wide scope with
minimal overlap between the chapters to reduce duplication. Each contributor was
asked to present review material as well as current developments. In addition, the
authors were chosen from among the leaders in their respective fields.
Mourad Elloumi
Tunis, Tunisia
April 2017
Contents
Part I Indexing, Compression, and Storage of NGS Data
1 Algorithms for Indexing Highly Similar DNA Sequences
   Nadia Ben Nsira, Thierry Lecroq, and Mourad Elloumi
2 Full-Text Indexes for High-Throughput Sequencing
   David Weese and Enrico Siragusa
3 Searching and Indexing Circular Patterns
   Costas S. Iliopoulos, Solon P. Pissis, and M. Sohel Rahman
4 De Novo NGS Data Compression
   Gaetan Benoit, Claire Lemaitre, Guillaume Rizk, Erwan Drezen, and Dominique Lavenier
5 Cloud Storage-Management Techniques for NGS Data
   Evangelos Theodoridis
Part II Error Correction in NGS Data
6 Probabilistic Models for Error Correction of Nonuniform Sequencing Data
   Marcel H. Schulz and Ziv Bar-Joseph
7 DNA-Seq Error Correction Based on Substring Indices
   David Weese, Marcel H. Schulz, and Hugues Richard
8 Error Correction in Methylation Profiling From NGS Bisulfite Protocols
   Guillermo Barturen, José L. Oliver, and Michael Hackenberg
Part III Alignment of NGS Data
9 Comparative Assessment of Alignment Algorithms for NGS Data: Features, Considerations, Implementations, and Future
   Carol Shen, Tony Shen, and Jimmy Lin
10 CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment
   Yongchao Liu and Bertil Schmidt
11 String-Matching and Alignment Algorithms for Finding Motifs in NGS Data
   Giulia Fiscon and Emanuel Weitschek
Part IV Assembly of NGS Data
12 The Contig Assembly Problem and Its Algorithmic Solutions
   Géraldine Jean, Andreea Radulescu, and Irena Rusu
13 An Efficient Approach to Merging Paired-End Reads and Incorporation of Uncertainties
   Tomáš Flouri, Jiajie Zhang, Lucas Czech, Kassian Kobert, and Alexandros Stamatakis
14 Assembly-Free Techniques for NGS Data
   Matteo Comin and Michele Schimd
Contributors
Ziv Bar-Joseph Computational Biology Department and Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Guillermo Barturen Centre for Genomics and Oncological Research (GENYO), Granada, Spain
Nadia Ben Nsira Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE), Tunis, Tunisia
   University of Tunis-El Manar, Tunis, Tunisia
   The Computer Science, Information Processing and Systems Laboratory (LITIS), EA 4108, University of Rouen-Normandy, Normandy, France
Gaetan Benoit GenScale, Rennes, France
   INRIA, Rennes, France
Matteo Comin Department of Information Engineering, University of Padova, Padova, Italy
Lucas Czech Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Erwan Drezen GenScale, Rennes, France
   INRIA, Rennes, France
Mourad Elloumi LaTICE, Tunis, Tunisia
   University of Tunis-El Manar, Tunis, Tunisia
Giulia Fiscon Institute for Systems Analysis and Computer Science “Antonio Ruberti” (IASI), National Research Council (CNR), Rome, Italy
Tomáš Flouri Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Michael Hackenberg Department of Genetics, University of Granada, Granada, Spain