Combinatorial Pattern Matching: 7th Annual Symposium, CPM 96 Laguna Beach, California, June 10–12, 1996 Proceedings PDF

401 Pages·1996·6.271 MB·English

by Ricardo Baeza-Yates, Gonzalo Navarro (auth.), Dan Hirschberg, Gene Myers (eds.)


Lecture Notes in Computer Science 1075
Edited by G. Goos, J. Hartmanis and J. van Leeuwen
Advisory Board: W. Brauer, D. Gries, J. Stoer

Dan Hirschberg, Gene Myers (Eds.)

Combinatorial Pattern Matching
7th Annual Symposium, CPM 96
Laguna Beach, California, June 10-12, 1996
Proceedings

Springer

Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Dan Hirschberg, Information and Computer Science Department, University of California at Irvine, Irvine, CA 92717-3425, USA
Gene Myers, Department of Computer Science, University of Arizona, Tucson, AZ 85721, USA

Cataloging-in-Publication data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Combinatorial pattern matching : 7th annual symposium ; proceedings / CPM 96, Laguna Beach, California, June 10 - 12, 1996. Dan Hirschberg ; Gene Myers (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Santa Clara ; Singapore ; Tokyo : Springer, 1996
(Lecture notes in computer science ; Vol. 1075)
ISBN 3-540-61258-0
NE: Hirschberg, Dan [Hrsg.]; CPM <7, 1996, Laguna Beach, Calif.>; GT

CR Subject Classification (1991): F.2.2, I.5.4, I.5.0, I.7.3, H.3.3, E.4, G.2.1, J.3

ISBN 3-540-61258-0 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1996
Printed in Germany
Typesetting: Camera-ready by author
SPIN 10512928 06/3142 - 5 4 3 2 1 0
Printed on acid-free paper

Foreword

The papers contained in this volume were presented at the Seventh Annual Symposium on Combinatorial Pattern Matching (CPM 96), held June 10-12, 1996, at the Aliso Creek Inn, a resort hotel located in Laguna Beach, California, about 40 miles from Los Angeles. They were selected from 48 papers submitted in response to a call for papers. In addition, invited lectures were given by David Lipman (Computing Discoveries in Molecular Biology) and Richard Arratia (Poisson Process Approximation for Repeats in One Sequence and Its Application to Sequencing by Hybridization).

Combinatorial Pattern Matching addresses issues of searching and matching strings and more complicated patterns such as trees, regular expressions, graphs, point sets, and arrays. The goal is to derive non-trivial combinatorial properties for such structures and then to exploit these properties in order to achieve improved performance for the corresponding computational problems.

In recent years, a steady flow of high-quality research on this subject has changed a sparse set of isolated results into a full-fledged area of algorithmics with important applications. This area is expected to grow even further due to the increasing demand for speed and efficiency that comes especially from molecular biology, but also from areas such as information retrieval, pattern recognition, compiling, data compression, and program analysis.

The objective of annual CPM gatherings is to provide an international forum for research in combinatorial pattern matching. The general organization and orientation of CPM Conferences is coordinated by a Steering Committee composed of A. Apostolico, M. Crochemore, Z. Galil, and U. Manber.

The first six meetings were held in Paris (1990), London (1991), Tucson (1992), Padova (1993), Pacific Grove (1994), and Helsinki (1995). After the first meeting, a selection of the papers appeared as a special issue of Theoretical Computer Science. Since the third meeting, the proceedings have appeared as volumes 644, 684, 807, and 937 of the present series.

CPM 96 was organized by Dan Hirschberg of the Department of Information and Computer Science at the University of California at Irvine. The conference was supported in part by the University of California at Irvine Office of Research and Graduate Studies, the Department of Information and Computer Science, and the Irvine Research Unit on Computer Systems Design.

Irvine and Tucson, March 1996
Dan Hirschberg
Gene Myers

Program Committee

Dan Hirschberg, co-chair
Gene Myers, co-chair
Rob Irving
Pavel Pevzner
Sampath Kannan
Dennis Shasha
Rao Kosaraju
Jim Storer
Gad Landau
Andy Yao
Arthur Lesk
Frances Yao

Additional Referees

David Eppstein
Gesine Reinert
Martin Farach
Martin Vingron

Table of Contents

A Faster Algorithm for Approximate String Matching
    Ricardo Baeza-Yates, Gonzalo Navarro
Boyer-Moore Strategy to Efficient Approximate String Matching
    Nadia El-Mabrouk, Maxime Crochemore
Randomized Efficient Algorithms for Compressed Strings: The Finger-Print Approach
    Leszek Gąsieniec, Marek Karpinski, Wojciech Plandowski, Wojciech Rytter
Filtration with q-Samples in Approximate String Matching
    Erkki Sutinen, Jorma Tarhio
Computing Discoveries in Molecular Biology (Abstract)
    David J. Lipman
Approximate Dictionary Queries
    Gerth Stølting Brodal, Leszek Gąsieniec
Approximate Multiple String Search
    Robert Muth, Udi Manber
A 2 2/3-Approximation Algorithm for the Shortest Superstring Problem
    Chris Armen, Clifford Stein
Suffix Trees on Words
    Arne Andersson, N. Jesper Larsson, Kurt Swanson
The Suffix Tree of a Tree and Minimizing Sequential Transducers
    Dany Breslauer
Perfect Hashing for Strings: Formalization and Algorithms
    Martin Farach, S. Muthukrishnan
Spliced Alignment: A New Approach to Gene Recognition
    Mikhail S. Gelfand, Andrey A. Mironov, Pavel A. Pevzner
Original Synteny
    Vincent Ferretti, Joseph H. Nadeau, David Sankoff
Fast Sorting by Reversal
    Piotr Berman, Sridhar Hannenhalli
A Double Combinatorial Approach to Discovering Patterns in Biological Sequences
    Marie-France Sagot, Alain Viari
Poisson Process Approximation for Repeats in One Sequence and Its Application to Sequencing by Hybridization
    Richard Arratia, Gesine Reinert
Improved Approximation Algorithms for Tree Alignment
    Lusheng Wang, Dan Gusfield
The Asymmetric Median Tree - A New Model for Building Consensus Trees
    Cynthia A. Phillips, Tandy J. Warnow
Constructing Computer Virus Phylogenies
    Leslie Ann Goldberg, Paul W. Goldberg, Cynthia A. Phillips, Gregory B. Sorkin
Docking of Conformationally Flexible Proteins
    Bilha Sandak, Ruth Nussinov, Haim J. Wolfson
Invariant Patterns in Crystal Lattices: Implications for Protein Folding Algorithms
    William E. Hart, Sorin Istrail
Graph Traversals, Genes, and Matroids: An Efficient Case of the Travelling Salesman Problem
    Dan Gusfield, Richard Karp, Lusheng Wang, Paul Stelling
Alphabet Independent and Dictionary Scaled Matching
    Amihood Amir, Gruia Calinescu
Analysis of Two-Dimensional Approximate Pattern Matching Algorithms
    Kunsoo Park
Approximation Algorithms for Maximum Two-Dimensional Pattern Matching
    Srinivasa R. Arikati, Anders Dessmark, Andrzej Lingas, Madhav Marathe
Efficient Parallel Algorithms for Tree Editing Problems
    Kaizhong Zhang
Approximate Pattern Matching in Directed Graphs
    James J. Fu
Finite-State Computability of Annotations of Strings and Trees
    Hans L. Bodlaender, Patricia A. Evans, Michael R. Fellows
Author Index

A Faster Algorithm for Approximate String Matching *

Ricardo Baeza-Yates, Gonzalo Navarro
Department of Computer Science, University of Chile
Blanco Encalada 2120 - Santiago - Chile
{rbaeza,gnavarro}@dcc.uchile.cl

Abstract. We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a non-deterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length O(log n), where n is the maximum size of the text. The running time achieved is O(n) for small patterns (i.e. of length m = O(√(log n))), independently of the maximum number of errors allowed, k. This algorithm is then used to design two general algorithms. One of them partitions the problem into subproblems, while the other partitions the automaton into sub-automata. These algorithms are combined to obtain a hybrid algorithm which on average is O(n) for moderate k/m ratios, O(√(mk/log n) n) for medium ratios, and O((m - k)kn/log n) for large ratios. We show experimentally that this hybrid algorithm is faster than previous ones for moderate pattern sizes and error ratios, which is the case in text searching.
* This work has been supported in part by FONDECYT grant 1950622.

1 Introduction

Approximate string matching is one of the main problems in classical string algorithms, with applications to text searching, computational biology, pattern recognition, etc. Given a text of length n, a pattern of length m, and a maximal number of errors allowed, k, we want to find all text positions where the pattern matches the text with up to k errors. An error is the substitution, deletion or insertion of a character.

The solutions to this problem differ depending on whether the algorithm has to be on-line (that is, the text is not known in advance) or off-line (the text can be preprocessed). In this paper we are interested in the first case, where the classical dynamic programming solution takes O(mn) running time [13, 14].

In recent years several algorithms have been presented that achieve O(kn) comparisons in the worst case [20, 9, 10, 11] or in the average case [21, 9], by taking advantage of the properties of the dynamic programming matrix. In the same trend is [6], with average time complexity O(kn/√σ) (σ is the alphabet size). The algorithms which are O(kn) in the worst case tend to involve too much overhead, and are not competitive in practice.

Other approaches attempt to filter the text, reducing the area in which dynamic programming needs to be used [18, 19, 17, 16, 7, 8]. These algorithms achieve sublinear expected time in many cases (O(kn log_σ(m)/m) is a typical figure) for moderate k/m ratios, but the filtration is not effective for larger ratios. A simple and fast filtering technique is shown in [5], which yields an O(n) algorithm for moderate k/m ratios.

Yet other approaches use bit-parallelism [2, 25] in a RAM machine of word length O(log n) to reduce the number of operations. [24] achieves O(kmn/log n) time, which is competitive for patterns of length O(log n). [22] packs the cells differently to achieve O(mn log σ / log n) time complexity. [26] uses a Four Russians approach and packs the table in machine words, achieving O(kn/log n) time on average.

We present a new algorithm which combines the ideas of taking advantage of the properties of the matrix, filtering the text and using bit-parallelism. It is faster than previous work for moderate pattern sizes and error ratios, which is the case of interest in text searching.

We model the search with a non-deterministic finite automaton (NFA) built from the pattern and using the text as input. This automaton is simulated by an algorithm based on bit operations on a RAM machine of word length O(log n). The algorithm achieves running time O(n), independently of k, for small patterns (i.e. mk = O(log n)).

This restricted algorithm is used to design two general algorithms. The first one partitions the problem into subproblems, and has average time cost O(mn/log n) for small α = k/m (i.e. α < 1/log n); otherwise it is O(√(mk/log n) n) (i.e. O(√k n) for m = O(log n), else O(kn)). It also involves a cost to verify potential matches, which is shown to be not significant for α < α1 ≈ 1 - m^[…]/√σ. This algorithm is a generalization of an earlier heuristic [23, 5] that reduces the problem to subproblems of exact matching and is shown to be O(n) for α < α0 = 1/(3 log_σ m).

The second one partitions the automaton into sub-automata, and is O(k²n/(√σ log n)) on average. For α > 1 - 1/√σ its worst case, O((m - k)kn/log n), dominates. This algorithm is shown to be better than dynamic programming for k < log n/(1 - α).

We analyze the optimal way to combine the algorithms. We show experimentally that this hybrid algorithm is faster than previous ones, for moderate m and α. Table 1 shows the combined complexity.

As a corollary of our analysis, we give tight bounds for the probability of finding an occurrence of a pattern of length m with k errors starting at a fixed position in random text.
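As a rough illustration of two ideas mentioned in this introduction, the following Python sketch pairs the classical O(mn) dynamic programming search with the reduction to exact matching: if a pattern matches with at most k errors, at least one of k + 1 disjoint pattern pieces must appear unaltered, so only text windows around exact piece occurrences need verification. The function names `dp_search` and `partition_search` are hypothetical, the piece boundaries and window sizes are one simple choice among several, and the sketch reports every end position (0-based) rather than only ends of minimal segments. It is not the bit-parallel algorithm of the paper.

```python
def dp_search(pattern: str, text: str, k: int) -> list[int]:
    """Column-wise dynamic programming search.

    Returns every 0-based text position j such that some text segment
    ending at j is within edit distance k of the pattern.
    """
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # suffix of the text read so far (col[0] is always 0).
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # value of col[i-1] in the previous column
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or replacement
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def partition_search(pattern: str, text: str, k: int) -> list[int]:
    """Filter by exact search of k+1 pattern pieces, verify with dp_search."""
    m, n = len(pattern), len(text)
    if not 0 <= k < m:
        raise ValueError("this sketch assumes 0 <= k < len(pattern)")
    pieces = k + 1
    ends = set()
    for p in range(pieces):
        lo = p * m // pieces        # piece start inside the pattern
        hi = (p + 1) * m // pieces  # piece end (exclusive), never empty
        piece = pattern[lo:hi]
        t = text.find(piece)
        while t != -1:
            # Any occurrence containing this piece lies in this window.
            w_lo = max(0, t - lo - k)
            w_hi = min(n, t - lo + m + k)
            for e in dp_search(pattern, text[w_lo:w_hi], k):
                ends.add(w_lo + e)
            t = text.find(piece, t + 1)
    return sorted(ends)
```

For instance, `partition_search("abc", "xxabcxx", 1)` and `dp_search("abc", "xxabcxx", 1)` both return `[3, 4, 5]`. Overlapping windows are verified repeatedly here for simplicity; a careful implementation would merge them.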
We also show that the heuristic of [21] works in O(kn) time on average, with a constant tighter than that of [6].

Condition                        Complexity            Method used
mk = O(log n)                    O(n)                  the simple algorithm
α < α0                           O(n)                  reducing to exact match
α0 < α < α1                      O(√(mk/log n) n)      problem partitioning
α > α1 ∧ k < log n/(1 - α)       O((m - k)kn/log n)    automaton partitioning
α > α1 ∧ k > log n/(1 - α)       O(mn)                 plain dynamic programming

Table 1. Complexity of our hybrid algorithm.

2 Preliminaries

The problem of approximate string matching can be stated as follows: given a (long) Text of length n, and a (short) pattern pat of length m, both being sequences of characters from an alphabet Σ, find all segments (called "occurrences" or "matches") of Text whose edit distance to pat is at most k, the number of allowed errors. We use σ = |Σ|.

The edit distance between two strings a and b is the minimum number of edit operations needed to transform a into b. The allowed edit operations are deleting, inserting and replacing a character. Therefore, the problem is non-trivial for k < m.

Stated this way, the problem requires reporting many segments that contain others. Because of that, it is common to report only minimal or maximal segments. It is also common to report not the matching segments but only their start or end points. In this work we focus on returning end points of minimal segments (i.e. those not containing others).

We use a C-like notation for the operations (e.g. &, |, ==, !=, ^, >>). We use text to denote the current character of Text and, unlike C, str_j to denote the j-th character of str. Except when otherwise indicated, the log function denotes logarithm in base 2.

Consider the NFA for searching text with at most k = 2 errors shown in Figure 1. Every row denotes the number of errors seen: the first one 0, the second one 1, and so on. Every column represents matching the pattern up to a given position. At each iteration, a new text character is considered and the automaton changes its states. Horizontal arrows represent matching a character (they can only be followed if the corresponding match occurs), vertical arrows represent inserting a character in the pattern, solid diagonal arrows represent replacing a character, and dashed diagonal arrows represent deleting a character of the pattern (they are empty transitions, since we delete the character from the pattern without advancing in the text). Finally, the empty transition at the initial state allows any character to be considered as a potential starting point of a match, and the automaton accepts a character (as the end of a match) whenever a rightmost state is active. If we do not care about the number of errors, we can consider
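The transitions of this automaton can be made concrete with a small, deliberately naive simulation that keeps the set of active states explicitly. In the Python sketch below a state is a pair (i, e): i pattern characters matched so far (the column) and e errors seen (the row). The function name `nfa_search` and the set-based representation are assumptions made for illustration; the paper simulates the automaton with bit operations instead, which this sketch does not attempt. Like the acceptance rule described above, it reports every text position at which a rightmost state is active.

```python
def nfa_search(pattern: str, text: str, k: int) -> list[int]:
    """Simulate the k-errors NFA with an explicit set of active states.

    Returns the 0-based text positions at which some state in the
    rightmost column (i == len(pattern)) is active.
    """
    m = len(pattern)

    def closure(states: set) -> set:
        # Dashed diagonal arrows: empty transitions that skip one
        # pattern character at the cost of one error.
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    ends = []
    for j, c in enumerate(text):
        # The initial state stays active: a match may start anywhere.
        nxt = {(0, 0)}
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))          # horizontal: match
            if e < k:
                nxt.add((i, e + 1))          # vertical: insertion
                if i < m:
                    nxt.add((i + 1, e + 1))  # solid diagonal: replacement
        active = closure(nxt)
        if any(i == m for i, _ in active):
            ends.append(j)
    return ends
```

As a quick check, `nfa_search("abc", "xxabcxx", 1)` returns `[3, 4, 5]`: the pattern ends at position 4 exactly, and at positions 3 and 5 with one error each. Each text character costs time proportional to the number of active states, at most (m + 1)(k + 1), which is the work a bit-parallel simulation aims to pack into a few machine words.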
