
GPU Acceleration of DNA Alignment of Long Reads for DNA Assembly PDF

110 Pages·2017·5.55 MB·English

Preview GPU Acceleration of DNA Alignment of Long Reads for DNA Assembly

GPU Acceleration of DNA Alignment of Long Reads for DNA Assembly
Tong Dong Qiu
CE-MS-2018-32
Delft University of Technology

Abstract

Third generation sequencing machines produce reads with tens of thousands of base pairs. To perform de novo assembly, all reads must be compared with every other read to find overlaps. Finding overlaps with the optimal Smith-Waterman algorithm is not feasible, since the complexity of Smith-Waterman is quadratic in the length of the reads. Heuristics are designed to be faster, but are not guaranteed to give the optimal solution. Two heuristic DNA aligners are Daligner and Darwin. Daligner uses an edit graph based algorithm that has an O(ND) complexity, where N is the read length and D is the number of differences between the two aligned reads. Darwin creates overlapping tiles to search promising areas of the Smith-Waterman matrix, and is empirically shown to be optimal. This work implements these algorithms on a GPU, and compares the two with respect to sensitivity and specificity. Daligner is not suitable for GPU acceleration, but Darwin has shown a speedup of 109x vs 8 CPU threads, using a Tesla K40. The speedup increases to 148x when the Smith-Waterman scores are not calculated. Despite large speedups for Darwin, Daligner is 2-6x faster than Darwin, and slightly more sensitive and specific. An advantage of Darwin is that it generally produces longer overlaps, calculates the Smith-Waterman score, and is able to report the aligned sequences, whereas Daligner only reports the start and end of the overlap.
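The complexity contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis, Daligner, or Darwin. The function names smith_waterman_score and ond_edit_distance are hypothetical, the scoring values (match 2, mismatch -1, linear gap -1) are assumptions, and the second function is the classic greedy O(ND) edit-graph search (insertions and deletions only), used here as a stand-in for the kind of algorithm the abstract attributes to Daligner.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score; fills all len(a) * len(b) cells.

    Scoring values are illustrative assumptions, with a linear gap penalty.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def ond_edit_distance(a, b):
    """Number of insertions plus deletions turning a into b.

    Greedy edit-graph search: for each difference count d, track the
    furthest-reaching point on every diagonal k = x - y. The work grows
    with O(N * D) rather than with the full matrix size.
    """
    n, m = len(a), len(b)
    furthest = {1: 0}  # diagonal k -> furthest x reached so far
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]      # step down: insertion from b
            else:
                x = furthest[k - 1] + 1  # step right: deletion from a
            y = x - k
            # Follow the diagonal for free while the bases match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
    print(ond_edit_distance("ABCABBA", "CBABAC"))
```

For two reads of tens of thousands of base pairs, the first function touches hundreds of millions of cells per pair, while the second does work roughly proportional to the read length times the number of differences, which is why it pays off mainly when the reads are similar.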
GPU Acceleration of DNA Alignment of Long Reads for DNA Assembly

THESIS submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in COMPUTER ENGINEERING by Tong Dong Qiu, born in Ede (Gld), The Netherlands

Quantum & Computer Engineering
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology
Laboratory: Quantum & Computer Engineering
Codenumber: CE-MS-2018-32

Committee Members:
Advisor: Nauman Ahmed, QCE, TU Delft
Chairperson: Zaid Al-Ars, QCE, TU Delft
Member: Arjan van Genderen, QCE, TU Delft
Member: Matthias Möller, AM, TU Delft

Dedicated to my family and friends

Contents

Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Thesis goal
  1.3 Outline
2 Background
  2.1 Biology
  2.2 DNA sequencing
    2.2.1 Sanger sequencing
    2.2.2 Next Generation Sequencing
    2.2.3 Third Generation Sequencing
  2.3 DNA alignment
    2.3.1 Dynamic programming
    2.3.2 Global alignment
    2.3.3 Local alignment
    2.3.4 Semi-global methods
    2.3.5 Heuristic algorithms
  2.4 DNA assembly
    2.4.1 De novo assembly
    2.4.2 Reference based assembly
  2.5 GPU processing
    2.5.1 CUDA
    2.5.2 Memory hierarchy
    2.5.3 Streams
    2.5.4 Bioinformatics on GPU
3 Concept
  3.1 Pacific Biosciences
  3.2 Daligner
    3.2.1 Seeding
    3.2.2 Local alignment
  3.3 Darwin
    3.3.1 D-SOFT
    3.3.2 GACT
4 Specification
  4.1 Profiling
    4.1.1 Daligner
    4.1.2 Darwin
  4.2 Statistics
    4.2.1 Daligner
    4.2.2 Darwin
  4.3 Acceleration
    4.3.1 Daligner
    4.3.2 Darwin
5 Results
  5.1 Hardware
  5.2 Daligner
    5.2.1 Runtimes
    5.2.2 Profiling
  5.3 Darwin
    5.3.1 Runtimes
    5.3.2 Profiling
  5.4 Sensitivity and specificity
6 Conclusion
  6.1 Research questions
  6.2 Future work
Bibliography

Description:
This work implements these algorithms on a GPU, and compares the two with respect to sensitivity and specificity.