Mourad Elloumi
Editor

Algorithms for Next-Generation Sequencing Data
Techniques, Approaches, and Applications

Editor
Mourad Elloumi
LaTICE, Tunis, Tunisia
University of Tunis-El Manar, Tunis, Tunisia

ISBN 978-3-319-59824-6
ISBN 978-3-319-59826-0 (eBook)
DOI 10.1007/978-3-319-59826-0
Library of Congress Control Number: 2017950216

© Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my parents and my children.
Preface

A deoxyribonucleic acid (DNA) macromolecule can be coded by a sequence over a four-letter alphabet. These letters are A, C, G, and T, and they code, respectively, the bases adenine, cytosine, guanine, and thymine. DNA sequencing then consists of determining the exact order of these bases in a DNA macromolecule. DNA sequencing technology is playing a key role in the advancement of molecular biology. Compared to previous sequencing machines, Next-Generation Sequencing (NGS) machines function much faster, with significantly lower production costs and much higher throughput in the form of short reads, i.e., short sequences coding portions of DNA macromolecules.

As a result of the widespread adoption of NGS machines, we are witnessing exponential growth in the number of newly available short reads. Hence, we face the challenge of storing and analyzing huge numbers of reads representing sets of portions of genomes, or even whole genomes. The analysis of these reads will help, among other things, to decode life's mysteries, detect pathogens, make better crops, and improve quality of life. This is a difficult task, and it is made even more difficult not only by the short lengths of the reads and their huge number but also by the high similarity between the portions of genomes, or whole genomes, concerned and by the many repetitive structures in these genomes. Such a task requires the development of fast algorithms with low memory requirements and high performance.

This book surveys the most recent developments in algorithms for NGS data, offering enough fundamental and technical information on these algorithms and the related problems without overwhelming the reader. It presents the results of the latest investigations in the field of NGS data analysis. The algorithms presented in this book deal with the most important and/or the newest topics encountered in this field. These algorithms are based on new or improved approaches and/or techniques.
The few published books on algorithms for NGS data either lack technical depth or focus on specific topics. This book is the first overview of algorithms for NGS data with both a wide coverage of this field and enough depth to be of practical use to working professionals. It tries to find a balance between theoretical and practical coverage of a wide range of issues in the field of NGS data analysis. The techniques and approaches presented in this book combine sound theory with practical applications in life sciences. Certainly, the list of topics covered in this book is not exhaustive, but it is hoped that these topics will get the reader to think of the implications of the presented algorithms on other topics. The chapters presented in this book were carefully selected for quality and relevance. This book also presents experiments that provide qualitative and quantitative insights into the field of NGS data analysis. It is hoped that this book will increase the interest of researchers in studying a wider range of combinatorial problems related to NGS data analysis.

Preferably, the reader of this book should be someone who is familiar with bioinformatics and would like to learn about algorithms that deal with the most important and/or the newest topics encountered in the field of NGS data processing. However, this book could be used by a wider audience such as graduate students, senior undergraduate students, researchers, instructors, and practitioners in bioinformatics, computer science, mathematics, statistics, and life sciences. It will be extremely valuable and fruitful for these people. They will certainly find what they are looking for or, at least, a clue that will help them to make an advance in their research. This book is quite timely, since NGS technology is evolving at a breathtaking speed, and it will certainly point the reader to algorithms for NGS data that may be the key to new and important discoveries in life sciences.
This book is organized into four parts: Indexing, Compression, and Storage of NGS Data; Error Correction in NGS Data; Alignment of NGS Data; and Assembly of NGS Data. The 14 chapters were carefully selected to provide a wide scope with minimal overlap between the chapters to reduce duplication. Each contributor was asked to present review material as well as current developments. In addition, the authors were chosen from among the leaders in their respective fields.

Tunis, Tunisia
Mourad Elloumi
April 2017

Contents

Part I Indexing, Compression, and Storage of NGS Data

1 Algorithms for Indexing Highly Similar DNA Sequences
  Nadia Ben Nsira, Thierry Lecroq, and Mourad Elloumi
2 Full-Text Indexes for High-Throughput Sequencing
  David Weese and Enrico Siragusa
3 Searching and Indexing Circular Patterns
  Costas S. Iliopoulos, Solon P. Pissis, and M. Sohel Rahman
4 De Novo NGS Data Compression
  Gaetan Benoit, Claire Lemaitre, Guillaume Rizk, Erwan Drezen, and Dominique Lavenier
5 Cloud Storage-Management Techniques for NGS Data
  Evangelos Theodoridis

Part II Error Correction in NGS Data

6 Probabilistic Models for Error Correction of Nonuniform Sequencing Data
  Marcel H. Schulz and Ziv Bar-Joseph
7 DNA-Seq Error Correction Based on Substring Indices
  David Weese, Marcel H. Schulz, and Hugues Richard
8 Error Correction in Methylation Profiling From NGS Bisulfite Protocols
  Guillermo Barturen, José L. Oliver, and Michael Hackenberg

Part III Alignment of NGS Data

9 Comparative Assessment of Alignment Algorithms for NGS Data: Features, Considerations, Implementations, and Future
  Carol Shen, Tony Shen, and Jimmy Lin
10 CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment
  Yongchao Liu and Bertil Schmidt
11 String-Matching and Alignment Algorithms for Finding Motifs in NGS Data
  Giulia Fiscon and Emanuel Weitschek

Part IV Assembly of NGS Data

12 The Contig Assembly Problem and Its Algorithmic Solutions
  Géraldine Jean, Andreea Radulescu, and Irena Rusu
13 An Efficient Approach to Merging Paired-End Reads and Incorporation of Uncertainties
  Tomáš Flouri, Jiajie Zhang, Lucas Czech, Kassian Kobert, and Alexandros Stamatakis
14 Assembly-Free Techniques for NGS Data
  Matteo Comin and Michele Schimd

Contributors

Ziv Bar-Joseph
  Computational Biology Department and Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Guillermo Barturen
  Centre for Genomics and Oncological Research (GENYO), Granada, Spain
Nadia Ben Nsira
  Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE), Tunis, Tunisia
  University of Tunis-El Manar, Tunis, Tunisia
  The Computer Science, Information Processing and Systems Laboratory (LITIS), EA 4108, University of Rouen-Normandy, Normandy, France
Gaetan Benoit
  GenScale, Rennes, France
  INRIA, Rennes, France
Matteo Comin
  Department of Information Engineering, University of Padova, Padova, Italy
Lucas Czech
  Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Erwan Drezen
  GenScale, Rennes, France
  INRIA, Rennes, France
Mourad Elloumi
  LaTICE, Tunis, Tunisia
  University of Tunis-El Manar, Tunis, Tunisia
Giulia Fiscon
  Institute for Systems Analysis and Computer Science "Antonio Ruberti" (IASI), National Research Council (CNR), Rome, Italy
Tomáš Flouri
  Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Michael Hackenberg
  Department of Genetics, University of Granada, Granada, Spain