
FPGA Acceleration of DNA Sequencing Analysis and Storage PDF

139 Pages·2017·1.16 MB·English
by James Arram

Preview FPGA Acceleration of DNA Sequencing Analysis and Storage

Imperial College London
Department of Computing

FPGA Acceleration of DNA Sequencing Analysis and Storage

James Arram

July 2017

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London and the Diploma of Imperial College London

Abstract

In this work we explore how Field-Programmable Gate Arrays (FPGAs) can be used to alleviate the data processing bottlenecks in DNA sequencing. We focus our efforts on accelerating the FM-index, a data structure used to solve the computationally intensive string matching problems found in DNA sequencing analysis such as short read alignment. The main contributions of this work are:

1. We accelerate the FM-index using FPGAs and develop several novel methods for reducing the memory bottleneck of the search algorithm. These methods include customising the FM-index structure according to the memory architecture of the FPGA platform and minimising the number of memory accesses through both architectural and algorithmic optimisations.

2. We present a new approach for accelerating approximate string matching using the backtracking FM-index. This approach makes use of specialised approximate string matching modules and a run-time reconfigurable architecture in order to achieve both high sensitivity and high performance.

3. We extend the FM-index search algorithm for reference-based compression and accelerate it using FPGAs. This accelerated design is integrated into fastqZip and fastaZip, two new tools that we have developed for the fast and effective compression of sequence data stored in the FASTQ and FASTA formats respectively.

We implement our designs on the Maxeler Max4 Platform and show that they are able to outperform state-of-the-art DNA sequencing analysis software. For instance, our hardware-accelerated compression tool for FASTQ data is able to achieve a higher compression ratio than the best performing tool, fastqz, whilst the average compression and decompression speeds are 25 and 43 times faster respectively.

Declaration of Originality

I hereby declare that I am the sole author of this thesis and that the material within has not been submitted for a degree in any other university or institution. The materials and information used or derived from other published sources has been clearly cited and appropriately acknowledged.

Copyright Declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Acknowledgements

First and foremost I would like to thank my supervisor Wayne Luk for all the advice and encouragement he has given me over the past four years. His forming of the collaboration between myself and the Department of Chemical Pathology in the Chinese University of Hong Kong (CUHK) gave this project real purpose and allowed me to work on truly exciting applications. From CUHK I would like to thank Peiyong Jiang, Rossa Chiu and Dennis Lo for their advice on DNA methylation and for their hospitality when I visited Hong Kong. I would like to thank my colleagues, in particular Paul Grigoras, Xinyu Niu, Pavel Burovskiy and Kuen Hung Tsoi, for the technical support they have given me throughout this project and for the many hours of coffee breaks shared. Last but not least, I would like to thank my friends and family for filling my life with love and support. Thank you all.
Contents

1 Introduction
2 Background and related work
  2.1 Indexed String Matching
    2.1.1 Notation and basic definitions
    2.1.2 Suffix Tries, Trees and Arrays
    2.1.3 Burrows-Wheeler Transform
    2.1.4 FM-index
  2.2 Application Acceleration
    2.2.1 FPGA acceleration
    2.2.2 FPGA accelerator platforms
  2.3 Related Work
  2.4 Summary
3 Exact string matching
  3.1 Base design
    3.1.1 FM-index structure
    3.1.2 Hardware architecture
    3.1.3 Performance model
  3.2 Algorithmic optimisations
    3.2.1 k-step FM-index
    3.2.2 Precomputed intervals
    3.2.3 Oversampled Index
    3.2.4 Optimisation procedure
  3.3 Evaluation
    3.3.1 Experimental setup
    3.3.2 Base design
    3.3.3 k-step FM-index
    3.3.4 Precomputed intervals
    3.3.5 Oversampled index
    3.3.6 Short read alignment
  3.4 Summary
4 Approximate string matching
  4.1 Backtracking design
    4.1.1 Backtracking strategy
    4.1.2 Hardware architecture
    4.1.3 Seed-and-compare
  4.2 Design-space exploration
    4.2.1 Static architecture
    4.2.2 Run-time reconfigurable architecture
  4.3 Evaluation
    4.3.1 Experimental setup
    4.3.2 Backtracking design
    4.3.3 Seed-and-compare
    4.3.4 Short read alignment
  4.4 Summary
5 Sequence data compression
  5.1 Reference-based compression
    5.1.1 Algorithm overview
    5.1.2 Hardware architecture
  5.2 FASTQ and FASTA compression
  5.3 Evaluation
    5.3.1 Experimental setup
    5.3.2 Reference-based compression
    5.3.3 FASTQ compression
    5.3.4 FASTA compression
  5.4 Summary
6 Conclusion
Bibliography

List of Figures

1.1 Sequencing throughput of Illumina NGS platforms
1.2 NGS data analysis pipeline
2.1 Trie of the strings: B, AN, and ANA
2.2 Suffix trie of the text BANANA
2.3 Suffix tree of the text BANANA
2.4 Suffix array of the text BANANA
2.5 Construction of the BWT
2.6 FM-index of the text BANANA
2.7 FM-index search algorithm example
2.8 Maxeler MPC-X2000 architecture
2.9 Benchmarking system
3.1 Modified FM-index structure
3.2 The FM-index size and buckets size for the Human genome
3.3 Hardware architecture for exact string matching
3.4 Count operation steps
3.5 Base design performance estimate for the Maxeler Max4 DFE
3.6 Construction of k-step FM-index with step size of 2
3.7 The k-step FM-index size and buckets size for the Human genome
3.8 Precomputed intervals optimisation
3.9 Oversampled FM-index construction
3.10 Optimisation procedure for the base design
3.11 Performance estimates for algorithmic optimisations
3.12 Memory channel bandwidth for the Max4 DFE
3.13 Performance of D1
3.14 Strong-scaling and weak-scaling performance of D1
3.15 Performance of D2
3.16 Performance of D3
3.17 Performance of D4
4.1 Approximate matching using greedy and exhaustive searching
4.2 Search phases for approximate matching with one edit
4.3 Search phases for approximate matching with two edits
4.4 Hardware architecture for approximate string matching
4.5 Seed-and-compare optimisation
4.6 Frequency distribution of CMLs
4.7 Static architecture
4.8 Alignment breakdown for NGS dataset
4.9 Run-time reconfigurable architecture
4.10 Approximate matching performance
4.11 Pipeline of modules for the seed-and-compare optimisation
4.12 Performance of seed-and-compare optimisation (one mismatch)
4.13 Performance of seed-and-compare optimisation (two mismatches)
4.14 Pipeline of modules for up to two mismatches
5.1 Reference-based compression example
5.2 Hardware architecture for reference-based compression
5.3 FASTQ format
5.4 FASTA format
5.5 Reverse complement optimisation for FASTQ files
5.6 Merge tuples optimisation for FASTA files
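
The abstract centres on the FM-index, and the figure list above uses the text BANANA to illustrate the suffix array, the BWT and the FM-index search. As a rough software illustration of that textbook data structure, the sketch below builds a naive FM-index and runs a backward search over it. It is not the thesis's FPGA design: the function names, the `$` terminator, the uncompressed occurrence table and the sort-based suffix array construction are all assumptions chosen for readability, and they would not scale to a genome-sized reference.

```python
def build_fm_index(text):
    """Build a naive FM-index for `text` (illustrative, not memory-efficient)."""
    text = text + "$"  # assumed terminator that sorts before every other symbol
    # Suffix array by direct sorting of suffixes; fine for short examples only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps round to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, bwt, C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open suffix array interval [low, high) matching `pattern`."""
    low, high = 0, n
    # Process the pattern from its last character to its first.
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        low = C[ch] + occ[ch][low]
        high = C[ch] + occ[ch][high]
        if low >= high:
            return 0, 0  # empty interval: the pattern does not occur
    return low, high


def find(pattern, text):
    """Return the sorted start positions of `pattern` in `text`."""
    sa, bwt, C, occ = build_fm_index(text)
    low, high = backward_search(pattern, C, occ, len(bwt))
    return sorted(sa[low:high])


if __name__ == "__main__":
    print(find("ANA", "BANANA"))  # [1, 3]
```

Each step of the search needs two lookups in the occurrence table, which gives a sense of why the abstract describes memory accesses as the bottleneck of the search algorithm.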

