Table of Contents

Ghent University
Faculty of Sciences
Department of Applied Mathematics, Computer Science and Statistics

ALFALFA
Fast and Accurate Mapping of Long Next Generation Sequencing Reads

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Computer Science

Michaël Vyverman
September 2014

Supervisors:
prof. dr. Peter Dawyndt
prof. dr. Bernard De Baets
prof. dr. Veerle Fack

For my parents

Acknowledgments

This thesis is the report of my journey through scientific research, which began a little over four years ago. You set out on a PhD along broad, familiar and well-travelled roads, but you soon discover new places. I entered the territory of bioinformatics and explored exotic locations such as sequencing technology, index structures and read mapping algorithms. Along the way I had to navigate ever narrower roads, helped by earlier reports and local guides. I have also mapped new territory myself, which is now described in this work and may lead other travellers to new places in the future.

The past years were a fascinating journey during which I gained many interesting and captivating experiences. Yet this report would not have come about without the support and help of many fellow travellers, experienced guides, and my family and friends at home.

First and foremost I want to thank my supervisors, Peter, Bernard and Veerle. Thanks to them I was able to begin this journey and did not get lost along the way. I especially want to thank Peter for all his help, both with the research and with the writing, and for making me see that you sometimes have to dare to take on challenges. I especially want to thank Veerle for our weekly meetings, at which I could put everything in order. Although we had less personal contact, Bernard always provided a fresh, clear and critical perspective.

This travel report was thoroughly proofread by the members of my jury. I want to thank them for their suggestions and comments, which helped to further improve the content of this work.

I dedicate this thesis to my parents. They have always supported and motivated me. During this journey, too, they helped me in every possible way. My father, for example, provided graphical support for some of the figures that adorn this work.

I also want to thank my family and friends. They always listened with great interest to my stories and new discoveries, but just as much to my frustrations when obstacles lay in the road.

During my expeditions to conferences and stays abroad I got to know many people with whom I had fascinating conversations. In particular, I would like to thank Veli Mäkinen and the members of his group for the very nice and interesting stay I had at the University of Helsinki.

Since crossing bioinformatics territory requires many diverse kinds of expertise, I am glad that I was part of the multidisciplinary research platform Nucleotides to Networks (N2N). In this context I especially want to thank Yao-Chen Lin and Lieven Sterck for their expertise and their feedback.

Furthermore, I could always count on the experienced travellers in the Department of Applied Mathematics, Computer Science and Statistics (TWIST). In particular I want to thank my office mates Glad and Nico, who not only helped me with their experience but also gladly shared their technical knowledge with me, so that I never stood stranded with a breakdown for long.

Fortunately there was time to stop regularly along the way for relaxation and other activities. There were the lunches with the members of the research group Combinatorial Algorithms and Algorithmic Graph Theory (CAAGT), which I looked forward to every week. I will certainly also miss the regular conversations with Jan. With my fellow travellers in the TWIST department I had many enjoyable moments, including game and movie nights, and even an evening around the campfire at one of the TWI(ST)kends.

Even literally on the road, namely on the train, I met new people. They made the long train ride Gent-Ressegem seem just a little shorter. Ben and Jeroen, I hope we do not lose sight of each other now, and that we enjoy many more fun game nights.

The necessary fuel for my journey was provided by the Agency for Innovation by Science and Technology (IWT). I therefore want to thank them for the opportunity they gave me. Another kind of fuel for my journey came in the form of computational power from the STEVIN supercomputer of Ghent University and the good technical support provided by the HPC team. Many thanks to all!

Michaël Vyverman, December 2014

Contents

Summary
1 Introduction
  1.1 Notations
    1.1.1 Basic notations
    1.1.2 Common substrings
    1.1.3 Biological sequences
  1.2 Sequencing technology
    1.2.1 Historical overview
    1.2.2 Sequencing reads
    1.2.3 Sequencing technology comparison
  1.3 Alignment
    1.3.1 Sequence alignment
    1.3.2 Read mapping
    1.3.3 Read mappers
  1.4 Dynamic programming
    1.4.1 Variants for different alignment methods
    1.4.2 Optimizations
  1.5 Evaluation and Testing
    1.5.1 Theoretical complexity
    1.5.2 Memory model
2 Full-text Index Structures
  2.1 Popular Index Structures
    2.1.1 Suffix trees
    2.1.2 Suffix arrays
    2.1.3 Enhanced suffix arrays
    2.1.4 Compressed suffix arrays
    2.1.5 The Burrows-Wheeler transform
    2.1.6 FM-indexes
  2.2 Time-memory trade-offs
    2.2.1 Uncompressed index structures
    2.2.2 Sparse indexes
    2.2.3 Compressed index structures
  2.3 Index structures in external memory
    2.3.1 Suffix arrays
    2.3.2 Suffix trees
    2.3.3 Compressed index structures
  2.4 Construction
    2.4.1 Suffix trees
    2.4.2 Suffix arrays
    2.4.3 Compressed index structures
    2.4.4 External memory suffix tree construction
  2.5 Summary
    2.5.1 Prospects
    2.5.2 Related work
3 essaMEM
  3.1 Introduction
  3.2 Enhanced sparse suffix arrays
    3.2.1 Data structure
    3.2.2 String matching
    3.2.3 Construction
  3.3 MEM-finding
    3.3.1 Outline
    3.3.2 Sparse suffix arrays
    3.3.3 Sparse child arrays
    3.3.4 Sparse suffix links
    3.3.5 Pattern suffix sampling
    3.3.6 Implementation
  3.4 Results
    3.4.1 Test Setup
    3.4.2 Memory requirements
    3.4.3 Time-memory trade-offs
    3.4.4 Impact of optimizations
  3.5 Related work
4 ALFALFA
  4.1 Introduction
  4.2 Algorithms & heuristics
    4.2.1 Seed-finding
    4.2.2 Candidate regions
    4.2.3 Candidate region extension
    4.2.4 Alignment post-processing
    4.2.5 Paired-end read mapping
  4.3 Results
    4.3.1 Memory footprint
    4.3.2 Performance and accuracy on simulated data
    4.3.3 Mapping quality
    4.3.4 Performance and accuracy on real data
5 Mesalina
  5.1 Introduction
  5.2 Spliced alignment
    5.2.1 Candidate regions
    5.2.2 Candidate region extension
    5.2.3 Implementation
  5.3 Results
    5.3.1 Memory footprint
    5.3.2 Performance and accuracy trade-offs
    5.3.3 Discussion
Concluding remarks
A List of abbreviations
B Sequence alignment and mapping format
  B.1 Example
C Details of the essaMEM experimental results
  C.1 Testing environment and experimental measurements
  C.2 Additional tables
D Details of the ALFALFA experimental results
  D.1 Testing environment and experimental measurements
    D.1.1 Data sets
    D.1.2 Performance and accuracy measurements
    D.1.3 Read mappers
  D.2 Additional tables
E ALFALFA command line structure
  E.1 Indexing a reference genome
    E.1.1 Options
  E.2 Mapping and aligning a read set
    E.2.1 I/O options
    E.2.2 Alignment options
    E.2.3 Seed options
    E.2.4 Extend options
    E.2.5 Paired-end mapping options
    E.2.6 Miscellaneous options
  E.3 Evaluating mapping accuracy
    E.3.1 Options shared by all subcommands
    E.3.2 Summary subcommand options
    E.3.3 Sam subcommand options
    E.3.4 Wgsim subcommand options
Bibliography
Summary in Dutch
List of Figures
List of Tables
Index

Description (excerpts):

[...] sequences and RNA-seq reads to a eukaryotic reference genome poses additional [...] By default, sequencing libraries produce single-end sequencing reads. These sequences [...] ware, such as GPUs and FPGAs [8,262]. As the [...] trace indicate exact matches that guide the alignment and red dots be- [...]

Fast and Accurate Mapping of Long Next Generation Sequencing Reads PDF

300 Pages·2014·2.67 MB·English

by Michaël Vyverman
