
FPGA Acceleration of Sequence Analysis Tools in Bioinformatics PDF

180 Pages·2014·1.95 MB·English
by Atabak Mahram

Preview FPGA Acceleration of Sequence Analysis Tools in Bioinformatics

Boston University
College of Engineering

Dissertation

FPGA Acceleration of Sequence Analysis Tools in Bioinformatics

by Atabak Mahram

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2013

© Copyright by Atabak Mahram 2013

Approved by

First Reader: Martin C. Herbordt, PhD, Professor of Electrical and Computer Engineering
Second Reader: Ayse Coskun, PhD, Professor of Electrical and Computer Engineering
Third Reader: Douglas Densmore, PhD, Professor of Electrical and Computer Engineering
Fourth Reader: Allyn Hubbard, PhD, Professor of Electrical and Computer Engineering

Contents

1 Introduction
  1.1 The Problem
  1.2 Sequence Analysis Algorithms
  1.3 High-Performance Computing with Accelerators
  1.4 FPGA-Based Accelerators
    1.4.1 Programmability
    1.4.2 FPGAs for High-Performance Computing
  1.5 High-Performance Reconfigurable Computing for Sequence Analysis
  1.6 Summary of Contributions
    1.6.1 Acceleration of NCBI BLAST
    1.6.2 Acceleration of CLUSTALW
  1.7 Organization of the Rest of the Thesis
2 High-Performance Computing
  2.1 Overview
  2.2 Background
  2.3 Multicore Processors
  2.4 GPU Computing
  2.5 FPGAs
  2.6 FPGA-Based Systems
    2.6.1 Convey System
    2.6.2 Gidel Board
  2.7 Summary
3 Sequence Analysis: Methods and Algorithms
  3.1 Overview
  3.2 The Basic Biology of Cell
  3.3 Fundamentals of Biosequence Analysis
  3.4 Scoring Models
  3.5 Pairwise Sequence Alignment with Dynamic Programming
  3.6 BLAST
    3.6.1 Overview
    3.6.2 Word Matching
    3.6.3 Ungapped Extension
    3.6.4 Gapped Extension
    3.6.5 Statistical Evaluation in BLAST
  3.7 CLUSTAL-W: Multiple-Sequence Alignment
    3.7.1 Dynamic Programming
    3.7.2 Progressive Multiple-Sequence Alignment: ClustalW
4 Previous Attempts to Accelerate Sequence Analysis
  4.1 Overview
  4.2 Software Acceleration of Smith-Waterman
  4.3 Hardware Acceleration of Smith-Waterman
  4.4 Cluster Computing and NCBI BLAST
  4.5 GPU accelerated NCBI BLASTp
  4.6 FPGA Accelerators and NCBI BLASTp
    4.6.1 Tree BLAST
    4.6.2 Mercury BLASTp
  4.7 Acceleration of Multiple Sequence Alignment
5 CAAD BLAST
  5.1 Overview
  5.2 Filter Basics
  5.3 Two-Hit Filter
  5.4 EUA Filter
    5.4.1 Theoretical General Skipping
    5.4.2 Skip-Fold Mechanism
    5.4.3 Seed Lookup mechanism
  5.5 CAAD BLAST Architectures
  5.6 Multiple Phase System on a Gidel Board
    5.6.1 Results
    5.6.2 Scalability Analysis
    5.6.3 Terminology
  5.7 The Pipelined System on a Convey Machine
    5.7.1 System Configuration and Operation
    5.7.2 Pipelined Filters
    5.7.3 Jump FIFO Interface
    5.7.4 Glue Logic Modules
    5.7.5 RTL Optimizations
    5.7.6 Replicating and Balancing the Components
    5.7.7 Floor Planning
    5.7.8 Integration and Results
6 CLUSTALW
  6.1 Overview
  6.2 BACKGROUND
    6.2.1 Basics of MSA for Biological Sequences
    6.2.2 CLUSTALW Overview
  6.3 DESIGN AND IMPLEMENTATION
    6.3.1 Design Overview
    6.3.2 FMSA Scoring
    6.3.3 Filter Details
    6.3.4 RESULTS
7 Conclusion and Future Work
  7.1 Summary
  7.2 Future Directions
8 References

FPGA Acceleration of Sequence Analysis Tools in Bioinformatics

ATABAK MAHRAM
Boston University, College of Engineering, 2013
Major Professor: Martin C. Herbordt, PhD, Professor of Electrical and Computer Engineering

Abstract

With advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerate high-impact production biosequence analysis tools. Compared with other alternatives, FPGAs offer huge compute power, lower power consumption, and reasonable flexibility.

BLAST has become the de facto standard in bioinformatic approximate string matching, and so its acceleration is of fundamental importance. It is a complex, highly optimized system, consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on an FPGA. Using our FPGA engine, we quickly reduce the database to a small fraction of its size, and then use the original code to process the query against that fraction.
Using a standard FPGA-based system, we achieved a 12x speedup over a highly optimized multithreaded reference code.

Multiple Sequence Alignment (MSA), the extension of pairwise sequence alignment to multiple sequences, is critical to solving many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of 80x to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics.

The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately, many software heuristics do not fall into this category, and so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models, we have achieved significant speedup over reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study presents the pros and cons of these acceleration models for biosequence analysis tools.

1 Introduction

1.1 The Problem

Bioinformatics refers to the analysis and management of scientific data and to the development of tools and applications that help us organize, retrieve, and process biological knowledge bases [Dur98][Jon04][Ewe05]. The application of mathematics and computer science to the modeling of biological processes has been essential to the use of biotic information, both for fundamental applications such as understanding life processes and in high-impact applied domains such as drug discovery [Ach07][Jon04].
The key insight in bioinformatics is that biologically significant polymers, such as proteins and DNA, can be abstracted into character strings over a finite alphabet [Dur98]. Another fundamental observation is that all living cells pass a large number of hereditary features on to their offspring through a process of replication and cell division [Alb02]. In other words, nature adapts new sequences from pre-existing sequences. This opens the door to understanding the functionality of newly discovered sequences: by comparing a new sequence with known sequences, we can usually detect similarities that help us learn about the structure and infer the functionality of that sequence. This mechanism allows biologists to use approximate string matching (AM) to determine, for example, how a newly identified protein is related to previously analyzed proteins, and how it has diverged through mutation [Mah10].

While AM is critical in diverse fields, e.g., text analysis, certain properties of biological sequences have required the creation of biology-specific algorithms. Here the canonical AM task is Sequence Alignment (SA). For example, Hamming distance, the number of positions at which two strings differ, is one way to measure the difference between two strings, but it does not tolerate insertions or deletions (indels). As discussed later, more generalized scoring is necessary; it is most often based on the probability of particular character mutations and includes indels, and it can be handled using dynamic programming (DP) techniques. These have complexity O(mn) for two strings of size m and n, respectively. With the exploding size of biological databases, however, DP algorithms have often proven to be impractical. This has spawned heuristic O(n) algorithms, the most famous and widely used of which is BLAST [Alt90].

Since the completion of the human genome project, the scientific community has seen a sharp and rapid growth in the size of publicly available genomic and biotic information.
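
To make the earlier contrast between Hamming distance and dynamic programming concrete, here is a minimal Python sketch. It is an illustration only, not the scoring used by BLAST or by the thesis: the function names are hypothetical, and unit costs for substitutions and indels stand in for the probability-based scoring models discussed later.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed by dynamic programming.

    Assumption for illustration: every substitution, insertion, and
    deletion costs 1. Fills an (m+1) x (n+1) table one row at a time,
    so the running time is O(mn) and the memory use is O(n).
    """
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    s, t = "ACGTACGT", "CGTACGTA"
    print(hamming_distance(s, t))
    print(edit_distance(s, t))
```

For the two strings in the usage example, which differ only by a one-character shift, the Hamming distance is 8 because every position mismatches, while the DP edit distance is 2 (one deletion and one insertion). The nested loops also show where the O(mn) cost comes from.
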
Due to advances in technology and computing power, biological data are being produced at an exponential rate; genomic databases now double in size every 15 months [Ben12a]. The complexity of the bioinformatic tasks to which sequence alignment is being applied is increasing just as dramatically. A typical query, say, of a protein with respect to a database of all other known proteins, requires millions of pairwise SAs. In Multiple Sequence Alignment (MSA), algorithms often begin with all-to-all pairwise sequence alignment, and Phylogenetic Analysis can require millions of MSAs. As a result, the development of faster SA tools and methods continues to be one of the fundamental challenges in Computational Biology.

Since its invention, BLAST has been based on heuristics [Alt90][Tho94], and algorithmic development remains an active area of research [Hen10][Hom09][Ken02]. On the other hand, the acceleration and parallelization of these applications are as important as algorithmic improvements. For example, the National Center for Biotechnology Information (NCBI) maintains a BLAST server that consists of thousands of nodes that
