Approximate Text Searching, by Gonzalo Navarro (PDF)

225 Pages·2008·1.45 MB·English

Preview: Approximate Text Searching, by Gonzalo Navarro

Approximate Text Searching

by Gonzalo Navarro

A thesis presented to the University of Chile in fulfillment of the thesis requirement to obtain the degree of PhD in Computer Science.

Advisor: Ricardo Baeza-Yates
Committee: Jorge Olivos, Patricio Poblete, Esko Ukkonen (External Professor, Univ. of Helsinki, Finland)

This work has been supported in part by Fondecyt (Chile) grants 1-950622 and 1-960881, Fondef (Chile) grant 96-1064 and CYTED VII.13 AMYRI Project.

Dept. of Computer Science - University of Chile
Santiago - Chile
December 1998

Abstract

This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to find a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its applications in many areas, such as information retrieval, computational biology and signal processing, to name a few.

The aim of this work is the development and analysis of novel algorithms to deal with the problem under various conditions, as well as a better understanding of the problem itself and its statistical behavior. Although our results are valid in many different areas, we focus our attention on typical text searching for information retrieval applications. This makes some ranges of values for the parameters of the problem more interesting than others.

We have divided this presentation into two parts. The first one deals with on-line approximate string matching, i.e. when there is no time or space to preprocess the text. These algorithms are the core of off-line algorithms as well. On-line searching is the area of the problem where better algorithms existed. We have obtained new bounds for the probability of an approximate match of a pattern in a random text, and used these results to analyze many old and new algorithms. We have developed new algorithms for this problem which are currently among the fastest known ones, being even the fastest algorithms for almost all the interesting cases of typical text searching. Finally, we extended our results to the simultaneous search of multiple patterns, obtaining the best existing algorithms when a moderate number of them is sought (fewer than 100, approximately).

The second part of this thesis addresses indexed approximate string matching, i.e. when we are able to build an index for the text beforehand, to speed up the search later. The ultimate index for approximate string matching is yet to appear and the current development is rather immature, but we have made progress regarding new algorithms as well as better understanding of the problem. For the restricted case of indices able to retrieve only whole words on natural language text, we have obtained new analytical results on their asymptotic complexity, which allowed us to develop an index that is sublinear in space and query time simultaneously, something that did not exist before. For this kind of index we also presented improved search algorithms. For general indices able to find any occurrence (not only words), we have developed new indexing schemes which are a tradeoff between efficiency and space requirements. Also, inspired by on-line techniques, we have proposed a hybrid between existing indexing schemes and obtained very promising results.

It is worth mentioning that in almost all cases we have complemented the development of the new algorithms with their worst-case and average-case complexity analysis, as well as a thorough experimental validation and comparison against the best previous work we were aware of.

As a whole, we believe that this work constitutes a valuable contribution to the development and understanding of the problem of approximate text searching.
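To make the problem statement in the abstract concrete, the following is a minimal sketch of the classic dynamic-programming search for a pattern allowing at most k errors (insertions, deletions and substitutions of characters). It is only the textbook baseline for this problem, not one of the new algorithms developed in the thesis; the function name approx_search and its signature are illustrative assumptions.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Report the 0-based positions in `text` where an approximate
    occurrence of `pattern` with at most k edit errors ends.

    Hypothetical helper for illustration; classic O(len(pattern) * len(text))
    dynamic programming, keeping a single column of the matrix.
    """
    m = len(pattern)
    # col[i] = fewest errors needed to match pattern[:i] against some
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        # col[0] stays 0 for every column: an occurrence may start anywhere.
        diag = col[0]  # value of cell (i - 1, j - 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(
                diag + cost,     # match or substitution
                col[i] + 1,      # extra text character (insertion)
                col[i - 1] + 1,  # missing pattern character (deletion)
            )
            diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_search("survey", "surgery", 2))
```

For example, approx_search("survey", "surgery", 2) returns [4, 5, 6]: a substring ending at each of the last three text characters is within two errors of the pattern, whereas with k = 1 nothing is reported. Only end positions are returned here; recovering the start of each occurrence would need extra bookkeeping.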
Acknowledgments

Laziness is a widespread sin, and I have been guilty of it when I took the file of the acknowledgments of my MSc. thesis (1995), thinking of modifying it for my PhD. thesis. My surprise was that most of what I was thinking to say was already there, valid in 1995 and valid now, finishing 1998. I am still indebted to the same people I was indebted to at that time, and the debt keeps growing. So if you, reader, have seen the acknowledgments of my MSc. thesis, this is not a careless copy of it, it is a renewed expression of gratitude.

The completion of this thesis does not only mean finishing the most important and ample work I have ever attempted. It also marks the end of a five-year-long stage of my life, an incredibly enjoyable time of endless curiosity, exciting work, and great satisfactions and rewards. It has also been a stage of working on what I like most, with little pressure or interference from other problems or requirements. Fortunately, nothing makes me think that what follows should be very different. What I have been (and hope to keep) enjoying is a researcher's life, no more, no less. This is the only life I believe I am able to live.

Therefore, I cannot be less than indebted to all the people who made this kind of life a possibility for me, and I would like to express my gratitude for their sincere and disinterested help and friendship. I am afraid I will forget some names anyway, and I hope they will forgive such a mistake.

First of all, my wife Betina, who left everything to follow me in this adventure. Only she is responsible for keeping me still living a life outside my office. Without her, I would have finished this thesis in half the time, but my life would be far less interesting to live.

My wife apart, one of the first people I should mention is Jorge Olivos, who met me at ESLAI, and has been pushing me since then to leave my job and come to Chile, to enjoy serious research. He also made the first steps in giving me an opportunity to come, and has always been ready to offer me his friendship and help.

I also want to acknowledge Patricio Poblete for his friendship and unconditional support (financial and academic, among others), for the stimulating lectures and joint work I enjoyed with him, and lastly, for being the Department's head with a style totally opposite to what such an administrative position would lead one to expect. His accessibility, flexibility and willingness prevented more than one headache at the critical moments.

Ricardo Baeza-Yates, my thesis advisor, deserves a special acknowledgment. This is not only for trusting me from the start and giving me access to all that he thought I deserved despite my student status, but also for having done an excellent job as my thesis advisor (being there when I needed advice, not intervening when he thought I could manage it) and in general as my guide from being a disoriented student who came to Chile with little idea of what to do, to the end of my studentship and my birth as a new researcher with some idea of what he wants and how far he can go. Not content with that, he also took care of financially supporting my research, helped me in so many ways that I have forgotten, and has been a permanent, patient and disinterested friend.

Along these years I have met new people from outside who became my friends too, and I also want to thank them: Nivio Ziviani, Edleno de Moura, Mathieu Raffinot, Edgar Chávez, Jesús Vegas, Marcio Drumond Araújo, João Paulo Kitajima, Berthier Ribeiro, Erkki Sutinen, Gene Myers, Amihood Amir, Pablo de la Fuente, Esteban Feuerstein, and others that I surely forgot to mention.
I would also like to thank all the rest of the people of the Department who made my life easier and more enjoyable, as well as my old friends who are still with me by e-mail and with whom I have enjoyed a lot: Sergio Servetto, Guillermo Alvarez, Pablo Martínez-López, ... and of course my family, who is always proud of me, even when I do nothing remarkable.

A special note to Pablo Palma, a very special person of the kind one encounters a few times in life, who has been an excellent friend and to whom I also owe a lot.

My gratitude to my thesis committee, who took on the heavy job of reading the whole thesis, and made a number of useful comments that improved the work in many ways: Jorge Olivos (again), Patricio Poblete (again) and Esko Ukkonen.

There are many other people to whom I am indebted for this thesis. Many of them, for instance, sent me working versions of their algorithms, which made the tests a lot easier and, in some sense, more fair: William Chang, Alden Wright, Gene Myers, Erkki Sutinen, Tadao Takaoka, Jorma Tarhio, Robert Muth, Udi Manber and Archie Cobbs. Others have read and made suggestions to improve papers which later became part of this thesis, such as Gene Myers, Udi Manber, Erkki Sutinen, and of course a lot of anonymous conference and journal referees. Finally, some people have worked with us on papers related to this thesis, and although I have only included here my original work, they have worked very closely with us and are responsible for many improvements: Erkki Sutinen, Jorma Tarhio, Nivio Ziviani, Marcio Drumond Araújo, Edleno de Moura and Mathieu Raffinot. In Chapter 8 I have borrowed some experimental figures from joint papers (thanks to Nivio and Marcio).

Last but not least, FONDECYT (Chile) grants 1-950622 and 1-960881, FONDEF (Chile) grant 96-1014, and CYTED VII.13 AMYRI Project, which partially supported this work, are gratefully acknowledged.

Contents

1 Introduction
  1.1 History and Motivation
  1.2 The Problem
  1.3 Overview of the Thesis
    1.3.1 General Part
    1.3.2 On-line Searching
    1.3.3 Indexed Searching
2 Notation and Basic Concepts
  2.1 Definition of the Problem
  2.2 Dynamic Programming Algorithm
  2.3 A Graph Reformulation
  2.4 A Reformulation Based on Automata
  2.5 Filtering Algorithms
  2.6 Bit-Parallelism
  2.7 Suffix Trees and DAWGs
  2.8 Suffix Automata
  2.9 Natural Language and Its Statistics
  2.10 Inverted Files or Inverted Indices
  2.11 Suffix Arrays
3 Related Work and Our Contributions
  3.1 On-line Searching
    3.1.1 Taking Advantage of the Dynamic Programming Matrix
      3.1.1.1 Improving the Worst Case
      3.1.1.2 Improving the Average Case
    3.1.2 Searching with a Deterministic Automaton
    3.1.3 Filtering
      3.1.3.1 Moderate Patterns
      3.1.3.2 Very Long Patterns
    3.1.4 Bit-Parallel Algorithms
      3.1.4.1 Parallelizing Non-deterministic Automata
      3.1.4.2 Parallelizing the Dynamic Programming Matrix
  3.2 Variants on the On-line Problem
    3.2.1 Extended Patterns and Different Costs
    3.2.2 Multiple Patterns
  3.3 Indexed Searching
    3.3.1 Word-Retrieving Indices
    3.3.2 Simulating Text Traversal
      3.3.2.1 Minimum Redundancy
      3.3.2.2 Depth-First Search
    3.3.3 Filtration Indices
      3.3.3.1 All q-grams on the Text
      3.3.3.2 Sampling the Text
4 Basic Tools
  4.1 Statistics of the Problem
    4.1.1 Probability of Matching
      4.1.1.1 An Upper Bound
      4.1.1.2 A Lower Bound
      4.1.1.3 Experimental Verification
    4.1.2 Active Columns
  4.2 Partitioning Lemmas
  4.3 Hierarchical Verification
    4.3.1 Pattern Splitting
    4.3.2 Superimposed Searching
I On-line Searching
5 A Bit-Parallel Algorithm
  5.1 A New Parallelization Technique
  5.2 A Linear Algorithm for Small Patterns
    5.2.1 A Simple Filter
    5.2.2 The Code
  5.3 Handling Extended Patterns
  5.4 Partitioning Large Automata
  5.5 Partitioning the Pattern
  5.6 Superimposing the Subpatterns
  5.7 Analysis and Optimization
    5.7.1 The Simple Algorithm
    5.7.2 Automaton Partitioning
      5.7.2.1 Search Cost
      5.7.2.2 Practical Tuning
      5.7.2.3 Improving Register Usage
    5.7.3 Pattern Partitioning
      5.7.3.1 Search Cost
      5.7.3.2 Optimal Selection for j
    5.7.4 Superimposition
      5.7.4.1 Optimizing the Amount of Superimposition
      5.7.4.2 Optimal Grouping and Aligning
  5.8 Combining All the Techniques
    5.8.1 A Theoretical Approach
    5.8.2 A Practical Heuristic and a Searching Software
  5.9 Experimental Comparison
6 Filtering and Automata Algorithms
  6.1 Reduction to Exact Search
    6.1.1 The Original Algorithm
    6.1.2 Applying Hierarchical Verification
    6.1.3 Optimizing the Partition
    6.1.4 Experimental Comparison
    6.1.5 Extensions
  6.2 A Counting Filter
    6.2.1 A Simple Counting Filter
    6.2.2 Analysis
      6.2.2.1 Exact Analysis
      6.2.2.2 A Simpler Formula
    6.2.3 A Sampling Technique
    6.2.4 Experiments
      6.2.4.1 Maximum Error Ratio
      6.2.4.2 Comparison among Algorithms
  6.3 A Suffix Automaton Approach
    6.3.1 Adapting the NFA
    6.3.2 The Search Algorithm
    6.3.3 Analysis
    6.3.4 Experimental Results
  6.4 A Partial Deterministic Automaton
    6.4.1 Lazy Automata
    6.4.2 The Algorithm
    6.4.3 Analysis
    6.4.4 Experiments
      6.4.4.1 Automaton Growth
      6.4.4.2 Comparison Against Other Algorithms
    6.4.5 Working with Limited Memory
      6.4.5.1 Victim Selection
      6.4.5.2 Victim Replacement
7 Multiple Patterns
  7.1 Superimposed Automata
    7.1.1 Handling Longer Patterns
  7.2 Partitioning into Exact Searching
  7.3 A Counting Filter
  7.4 Analysis
    7.4.1 Superimposed Automata
    7.4.2 Partitioning into Exact Searching
    7.4.3 Counting
  7.5 Experimental Results
II Indexed Searching
8 Word-Retrieving Indices
  8.1 Vocabulary Statistics
    8.1.1 Combining Heaps' and Zipf's Laws
    8.1.2 Vocabulary Matching
    8.1.3 Experiments
  8.2 Full Inverted Indices
    8.2.1 Retrieval Times
    8.2.2 Experimental Results
    8.2.3 Differential Pointers
  8.3 Block Addressing Inverted Indices
    8.3.1 Average Space-Time Trade-offs
      8.3.1.1 Query Time Complexity
      8.3.1.2 Space Complexity
      8.3.1.3 Combined Sublinearity
    8.3.2 Analyzing the Web
    8.3.3 Experimental Validation
      8.3.3.1 Fixed Block Size
      8.3.3.2 Fixed Number of Blocks
      8.3.3.3 Sublinear Space and Time
  8.4 Improving the Search Algorithms
    8.4.1 Vocabulary Search
      8.4.1.1 Searching in General Metric Spaces
      8.4.1.2 The Vocabulary as a Metric Space
      8.4.1.3 Experimental Results
    8.4.2 Block Search
9 Sequence-Retrieving Indices
  9.1 An Index Based on Sampling
    9.1.1 Indexing Text Substrings
    9.1.2 Analysis
      9.1.2.1 Building the Index
      9.1.2.2 Index Space
      9.1.2.3 Retrieval Time
    9.1.3 Experiments
  9.2 An Index Based on Suffix Trees
    9.2.1 Using the Bit-parallel Automaton
    9.2.2 Analysis
    9.2.3 A New Algorithm Based on Pattern Partitioning
  9.3 Experimental Results
10 Conclusions
  10.1 Results Obtained
  10.2 Future Work
  10.3 Open Questions
Bibliography

List of Figures

2.1 Dynamic programming to compute an edit distance
2.2 Dynamic programming to search a pattern allowing errors
2.3 Filling styles for the dynamic programming matrix
2.4 Graph-based algorithm to compute an edit distance
2.5 An NFA to search allowing errors
2.6 An NFA for exact searching
2.7 A suffix tree
2.8 A DAWG and a suffix automaton
2.9 A non-deterministic suffix automaton
2.10 An inverted file
2.11 A block-addressing index
2.12 The suffixes of a text
2.13 The suffix array for the sample text
3.1 Taxonomy of on-line algorithms
3.2 On-line algorithms based on dynamic programming
3.3 On-line algorithms based on automata
3.4 On-line filtering algorithms
3.5 On-line bit-parallel algorithms
3.6 Algorithms for multipattern search
3.7 Taxonomy of indexed algorithms
3.8 Word retrieving indices
3.9 Indices that simulate text traversal
3.10 Filtration indices
4.1 Upper bound for the probability of matching
4.2 Theoretical and practical bounds for α
4.3 Experiments on matching probability
4.4 Experiments on the last active column
