Practical Methods for Approximate String Matching


Heikki Hyyrö

Practical Methods for Approximate String Matching

ACADEMIC DISSERTATION

To be presented, with the permission of the Faculty of Information Sciences of the University of Tampere, for public discussion in the B1096 Auditorium of the University on December 5th, 2003, at 12 noon.

DEPARTMENT OF COMPUTER SCIENCES
UNIVERSITY OF TAMPERE
A-2003-4
TAMPERE 2003

Opponent:
Prof. Esko Ukkonen, Department of Computer Science, University of Helsinki

Reviewers:
Prof. Jorma Tarhio, Laboratory of Information Processing Science, Helsinki University of Technology
Prof. Jukka Teuhola, Department of Information Technology, University of Turku

Department of Computer Sciences
FIN-33014 UNIVERSITY OF TAMPERE
Finland

Electronic dissertation
Acta Electronica Universitatis Tamperensis 308
ISBN 951-44-5840-0
ISSN 1456-954X
http://acta.uta.fi

ISBN 951-44-5818-4
ISSN 1459-6903
Tampereen yliopistopaino Oy
Tampere 2003

Abstract

Given a pattern string and a text, the task of approximate string matching is to find all locations in the text that are similar to the pattern. This type of search may be done for example in applications of spelling error correction or bioinformatics. Typically edit distance is used as the measure of similarity (or distance) between two strings. In this thesis we concentrate on unit-cost edit distance that defines the distance between two strings as the minimum number of edit operations that are needed in transforming one of the strings into the other. More specifically, we discuss the Levenshtein and the Damerau edit distances.
Approximate string matching algorithms can be divided into off-line and on-line algorithms depending on whether they may or may not, respectively, preprocess the text. In this thesis we propose practical algorithms for both types of approximate string matching as well as for computing edit distance. Our main contributions are a new variant of the bit-parallel approximate string matching algorithm of Myers, a method that makes it easy to modify many existing Levenshtein edit distance algorithms into using the Damerau edit distance, a bit-parallel algorithm for computing edit distance, a more error tolerant version of the ABNDM algorithm, a two-phase filtering scheme, a tuned indexed approximate string matching method for genome searching, and an improved and extended version of the hybrid index of Navarro and Baeza-Yates.

To evaluate their practicality, we compare most of the proposed methods with previously existing algorithms. The test results support the claim of the title of this thesis that our proposed algorithms work well in practice.

Keywords: Edit distance, approximate string matching, approximate pattern matching

Acknowledgements

This thesis is based on the work that was carried out between the years 2000 and 2003 at the Department of Computer Sciences, University of Tampere. Some early groundwork was done already in 1999 during the research done for my master's thesis on string matching. The following people had the most important role in making this work possible.

First of all I would like to thank my primary supervisor, Professor Martti Juhola. Ever since he recruited me in 1998 to do research under his supervision, he has offered me an excellent work environment as well as various kinds of general support and advice. I am thankful to him for trusting me enough to let me work in a very autonomous and free manner.
I am grateful that Martti remained supportive even when some of my decisions on the direction of the work and the contents of the thesis resulted in prolonging the preparation of the thesis.

The second person to thank is Professor Mauno Vihinen. He introduced me in 1998 to an application in bioinformatics that involved exact and approximate string matching, and in that he is mainly responsible for the topic of the thesis. In fact, large parts of the thesis have been done in conjunction with that original application.

Without doubt the most influential person in terms of the eventual contents of the thesis has been Gonzalo Navarro, who at the moment of writing this is an Associate Professor at the University of Chile. We started to work together after meeting in the SPIRE conference in Chile in November 2001, and I now regard him to be not only a work partner but also a long-distance friend. He has been, perhaps unknowingly, an unofficial supervisor of this work. As one of the leading researchers in the field of string matching, Gonzalo has provided me with inspiration as well as excellent advice and insight into the field.

Throughout the work I have also enjoyed the general environment provided by the Department of Computer Sciences at the University of Tampere. I will not mention separately everyone involved in making the department a pleasant place. However, as one exception I would like to thank Professor Erkki Mäkinen. He has always been prepared to help, for example by proof-reading article manuscripts.
Contents

Introduction
Basic Notation

I Edit Distance and Approximate String Matching
1 Levenshtein and Damerau Edit Distance
2 Dynamic Programming
3 Filling Only a Necessary Portion of the Dynamic Programming Matrix
4 Bit-parallel Methods

II Our Contributions
5 Our Variant of the Algorithm of Myers
6 Adding Transposition into the Bit-parallel Methods
7 Using Diagonal Tiling in Computing Edit Distance
8 Using BPM in ABNDM
9 On Using Two-Phase Filtering
10 A Practical Index for Genome Searching
11 An Improvement and an Extension on the Hybrid Index of Navarro & Baeza-Yates

Conclusion
Bibliography

Introduction

Finding the occurrences of a given query string (pattern) in a possibly very large text is an old and fundamental problem in computer science. It emerges in applications ranging from text processing and music retrieval to bioinformatics. This task, collectively known as string matching, has several different variations. The most natural and simple of these is exact string matching, in which, as the name suggests, one wishes to find only occurrences that are exactly identical to the pattern string. This type of search, however, may not be adequate in all applications if for example the pattern string or the text may contain typographical errors. Perhaps the most important applications of this kind arise in the field of bioinformatics, as small variations are fairly common in DNA or protein sequences. The field of approximate string matching, which has been a research subject since the 1960s, answers the problem of small variation by permitting some error between the pattern and its occurrences. Given an error threshold and a metric to measure the distance between two strings, the task of approximate string matching is to find all substrings of the text that are within (a distance of) the error threshold from the pattern.
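As a minimal sketch of the task just described, and not one of the algorithms proposed in this thesis, the following Python code uses the classical dynamic programming formulation and assumes the Levenshtein edit distance (unit-cost insertions, deletions and substitutions) as the metric. The function name `approximate_matches` and its parameters are illustrative assumptions. It reports the 1-based end positions of text substrings whose distance from the pattern is at most the error threshold `k`.

```python
def approximate_matches(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions j such that some substring of `text`
    ending at j is within Levenshtein distance k of `pattern`."""
    m = len(pattern)
    # column[i] holds the smallest distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        diagonal = column[0]  # value of row i-1 in the previous column
        column[0] = 0         # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            value = min(diagonal + cost,    # match or substitution
                        column[i] + 1,      # extra text character
                        column[i - 1] + 1)  # missing pattern character
            diagonal = column[i]
            column[i] = value
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_matches("abc", "xxabdxx", 1))
```

With `k = 0` the same function reduces to exact string matching, which may make the relationship between the two problems easier to see. The running time of this sketch is proportional to the product of the pattern and text lengths.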
In this work we concentrate on approximate string matching that uses so-called unit-cost edit distance as the metric to measure the distance between two strings. We consider two different kinds of edit distances: the Levenshtein edit distance and the Damerau edit distance. These two, and especially the Levenshtein edit distance, are the most commonly used forms of unit-cost edit distance.

Most of the research underlying this thesis has been inclined towards practical results. The primary aim has been to develop methods that work well in practice. Therefore theoretical considerations have been given a slightly secondary role. A major reason for this choice is that much of the work has been done in conjunction with a real-life application: applying string matching in searching for unique oligonucleotides in a large DNA genome [19, 26, 18, 23]. The term oligonucleotide refers to a fairly short sequence of DNA.

In the first part we present a concise overview of edit distance and approximate string matching. This gives the basic background for part two, in which we present our primary contributions.

Basic Notation

We will use the following notation throughout the thesis.

String characters will be indexed with a subscript: $P_i$ refers to the $i$th character of the string $P$, and $P_{i..j}$ to its substring that begins from the $i$th character and ends at the $j$th character. $|P|$ is the length of $P$. The first character has the index 1, and so $P = P_{1..|P|}$. We interpret the non-existing substring $P_{1..0}$ as the empty character $\epsilon$. The superscript $R$ denotes the reverse of the string. For example if $P$ = "abcd", then $P^R$ = "dcba", $P^R_{1..2}$ = "dc" and $(P_{1..2})^R$ = "ba". Note the last two examples that show how we may use parentheses to differentiate between a substring of the reversed string and a reversed substring. The notation $P \circ T$ denotes the concatenation of the strings $P$ and $T$. For example if $P$ = "abc" and $T$ = "def", then $P \circ T$ = "abcdef".
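The string notation above can be mirrored in a few lines of Python, shown here only as an illustrative sketch. The helper name `sub` is an assumption introduced for this example. It translates the 1-based, inclusive subscripts of $P_{i..j}$ into Python's 0-based, half-open slices.

```python
def sub(s: str, i: int, j: int) -> str:
    """Return s_{i..j} using 1-based, inclusive indices.
    When j < i (for example s_{1..0}) the result is the empty string."""
    return s[i - 1:j]


P = "abcd"
reverse_P = P[::-1]                 # P^R

print(reverse_P)                    # the reversed string
print(sub(reverse_P, 1, 2))         # substring of the reversed string
print(sub(P, 1, 2)[::-1])           # reversed substring
print("abc" + "def")                # concatenation of two strings
```

The two middle lines correspond to the parenthesised and unparenthesised forms in the text: the order in which reversal and substring extraction are applied changes the result.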
The string $B$ is a subsequence of the string $A$ if $B_i = A_{x(i)}$ for $i = 1..|B|$, where $x(i)$ is a mapping that fulfills the conditions $1 \le x(i) \le |A|$ for $i = 1..|B|$ and $x(i-1) < x(i)$ for $i = 2..|B|$. Thus $B$ is a subsequence of $A$ if the characters $B_1, B_2, \ldots, B_{|B|}$ appear in the same order, but not necessarily consecutively, in $A$.

For the sake of uniformity, the two compared strings in the context of computing edit distance are denoted by $P$ and $T$. In the context of approximate string matching $P$ is a pattern and $T$ is the text. It is a standard practice in the literature to denote the length $|P|$ of $P$ by $m$ and the length $|T|$ of $T$ by $n$. Throughout the text we assume that $m \le n$.

$\Sigma$ denotes the used alphabet and $\sigma$ the size (number of different characters) of $\Sigma$. In addition $k$ denotes the maximum allowed error in the context of thresholded edit distance or approximate string matching, and $w$ is the size (number of bits) of the computer word. The Levenshtein edit distance between the strings $P$ and $T$ will be denoted by $ed_L(P, T)$ and the Damerau edit distance by $ed_D(P, T)$.

Bit-operations are described as follows: '&' denotes bitwise "AND", '|' denotes bitwise "OR", '^' denotes bitwise "XOR", '~' denotes bit complementation, and '<<' and '>>' denote shifting the bit-vector left and right, respectively, using zero filling in both directions. The $i$th bit of the bit vector $V$ is referred to as $V[i]$ and bit-positions are assumed to grow from right to left. In addition we use a superscript to denote bit-repetition. As an example let $V = 1001110$ be a bit vector. Then $V[1] = V[5] = V[6] = 0$, $V[2] = V[3] = V[4] = V[7] = 1$, and we could also write $V = 10^2 1^3 0$.
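The bit-vector conventions above can be checked with a short Python sketch. The helper names `bit` and `from_runs`, and the fixed width `W = 7`, are assumptions made for this illustration. Because Python integers are unbounded, the sketch applies a mask to emulate a bit vector of fixed length, so that complementation and left shifts behave as they would in a word of `W` bits.

```python
W = 7                    # assumed vector length for this example
MASK = (1 << W) - 1      # keeps results within W bits


def bit(v: int, i: int) -> int:
    """Return V[i], where position 1 is the rightmost bit."""
    return (v >> (i - 1)) & 1


def from_runs(*runs: tuple[int, int]) -> int:
    """Build a bit vector from (bit, repetitions) pairs, listed from left
    to right, mirroring the superscript repetition notation."""
    v = 0
    for b, count in runs:
        for _ in range(count):
            v = (v << 1) | b
    return v


V = 0b1001110

print([bit(V, i) for i in range(1, W + 1)])          # V[1] .. V[7]
print(from_runs((1, 1), (0, 2), (1, 3), (0, 1)) == V)
print(format(~V & MASK, "07b"))                      # complement within W bits
print(format((V << 1) & MASK, "07b"))                # left shift, zero filled
print(format(V >> 1, "07b"))                         # right shift, zero filled
```

Masking after `~` and `<<` is a choice specific to this sketch; in a language with fixed-width unsigned integers the word size itself plays that role.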
