ebook img

Analytic Pattern Matching from DNA to Twitter PDF

372 Pages·2015·1.878 MB·english
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Analytic Pattern Matching from DNA to Twitter

ANALYTIC PATTERN MATCHING From DNA to Twitter PHILIPPE JACQUET InstitutNationaldeRechercheenInformatique etenAutomatique(INRIA),Rocquencourt WOJCIECH SZPANKOWSKI PurdueUniversity, Indiana www.cambridge.org Informationonthistitle:www.cambridge.org/9780521876087 ©PhilippeJacquetandWojciechSzpankowski2015 Firstpublished2015 PrintedintheUnitedKingdombyClays,Stlvesplc AcatalogrecordforthispublicationisavailablefromtheBritishLibrary LibraryofCongressCataloginginPublicationdata Jacquet,Philippe,1958– AnalyticpatternmatchingfromDNAtoTwitter/PhilippeJacquet,InstitutNationaldeRechercheen InformatiqueetenAutomatique(INRIA),RocquencourtWojciechSzpankowski,PurdueUniversity,Indiana. pages cm Includesbibliographicalreferencesandindex. ISBN978-0-521-87608-7(Hardback) 1. Patternrecognitionsystems. I. Szpankowski,Wojciech,1952– II. Title. TK7882.P3J332015 519.2–dc23 2014017256 ISBN978-0-521-87608-7Hardback Contents Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii I ANALYSIS 1 Chapter 1 Probabilistic Models . . . . . . . . . . . . . . . . . . . . 3 1.1 Probabilistic models on words . . . . . . . . . . . . . . . . . . . . 4 1.2 Probabilistic tools . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Generating functions and analytic tools . . . . . . . . . . . . . . 13 1.4 Special functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4.1 Euler’s gamma function . . . . . . . . . . . . . . . . . . . 16 1.4.2 Riemann’s zeta function . . . . . . . . . . . . . . . . . . . 18 1.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Bibliographicalnotes . . . . . . . . . . . . . . . . . . . . . . . . . 21 Chapter 2 Exact String Matching . . . . . . . . . . . . . . . . . . 23 2.1 Formulation of the problem . . . . . . . . . . . . . . . . . . . . . 23 2.2 Language representation . . . . . . . . . . . . . . . . . . . . . . . 25 2.3 Generating functions . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.5 Limit laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.5.1 Pattern count probability for small values of r . . . . . . 34 2.5.2 Central limit laws . . . . . . . . . . . . . . . . . . . . . . 35 2.5.3 Large deviations . . . . . . . . . . . . . . . . . . . . . . . 37 2.5.4 Poisson laws . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.6 Waiting times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . 45 Chapter 3 Constrained Exact String Matching . . . . . . . . . . 47 3.1 Enumeration of (d,k) sequences . . . . . . . . . . . . . . . . . . . 48 3.1.1 Languages and generating functions . . . . . . . . . . . . 51 3.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3 The probability count . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4 Central limit law . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5 Large deviations . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.6 Some extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.7 Application: significant signals in neural data . . . . . . . . . . . 70 3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . 73 Chapter 4 Generalized String Matching . . . . . . . . . . . . . . 75 4.1 String matching over a reduced set . . . . . . . . . . . . . . . . . 76 4.2 Generalized string matching via automata . . . . . . . . . . . . . 82 4.3 Generalized string matching via a language approach . . . . . . . 93 4.3.1 Symbolic inclusion–exclusion principle . . . . . . . . . . . 94 4.3.2 Multivariate generating function . . . . . . . . . . . . . . 96 4.3.3 Generating function of a cluster. . . . . . . . . . . . . . . 98 4.3.4 Moments and covariance . . . . . . . . . . . . . . . . . . . 103 4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . 107 Chapter 5 Subsequence String Matching . . . . . . . . . . . . . . 109 5.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.2 Mean and variance analysis . . . . . . . . . . . . . . . . . . . . . 112 5.3 Autocorrelation polynomial revisited . . . . . . . . . . . . . . . . 118 5.4 Central limit laws. . . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.5 Limit laws for fully constrained pattern . . . . . . . . . . . . . . 122 5.6 Generalized subsequence problem . . . . . . . . . . . . . . . . . . 123 5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . 132 II APPLICATIONS 133 Chapter 6 Algorithms and Data Structures . . . . . . . . . . . . 135 6.1 Tries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 6.2 Suffix trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.3 Lempel–Ziv’77 scheme . . . . . . . . . . . . . . . . . . . . . . . . 143 6.4 Digital search tree . . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.5 Parsing trees and Lempel–Ziv’78 algorithm . . . . . . . . . . . . 147 Bibliographicalnotes . . . . . . . . . . . . . . . . . . . . . . . . . 153 Chapter 7 Digital Trees . . . . . . . . . . . . . . . . . . . . . . . . . 155 7.1 Digital tree shape parameters . . . . . . . . . . . . . . . . . . . . 156 7.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 7.2.1 Average path length in a trie by Rice’s method . . . . . . 159 7.2.2 Average size of a trie . . . . . . . . . . . . . . . . . . . . . 168 7.2.3 Average depth in a DST by Rice’s method . . . . . . . . 169 7.2.4 Multiplicity parameter by Mellin transform . . . . . . . . 176 7.2.5 Increasing domains . . . . . . . . . . . . . . . . . . . . . . 183 7.3 Limiting distributions . . . . . . . . . . . . . . . . . . . . . . . . 185 7.3.1 Depth in a trie . . . . . . . . . . . . . . . . . . . . . . . . 185 7.3.2 Depth in a digital search tree . . . . . . . . . . . . . . . . 189 7.3.3 Asymptotic distribution of the multiplicity parameter . . 193 7.4 Average profile of tries . . . . . . . . . . . . . . . . . . . . . . . . 197 7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 Bibliographicalnotes . . . . . . . . . . . . . . . . . . . . . . . . . 216 Chapter 8 Suffix Trees and Lempel–Ziv’77 . . . . . . . . . . . . . 219 8.1 Random tries resemble suffix trees . . . . . . . . . . . . . . . . . 222 8.1.1 Proof of Theorem 8.1.2 . . . . . . . . . . . . . . . . . . . 226 8.1.2 Suffix trees and finite suffix trees are equivalent . . . . . . 239 8.2 Size of suffix tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 8.3 Lempel–Ziv’77 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 Bibliographicalnotes . . . . . . . . . . . . . . . . . . . . . . . . . 266 Chapter 9 Lempel–Ziv’78 Compression Algorithm . . . . . . . . 267 9.1 Description of the algorithm . . . . . . . . . . . . . . . . . . . . . 269 9.2 Number of phrases and redundancy of LZ’78 . . . . . . . . . . . 271 9.3 From Lempel–Ziv to digital search tree . . . . . . . . . . . . . . . 279 9.3.1 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 282 9.3.2 Distributional analysis . . . . . . . . . . . . . . . . . . . . 285 9.3.3 Large deviation results . . . . . . . . . . . . . . . . . . . . 287 9.4 Proofs of Theorems 9.2.1 and 9.2.2 . . . . . . . . . . . . . . . . . 290 9.4.1 Large deviations: proof of Theorem 9.2.2 . . . . . . . . . 290 9.4.2 Central limit theorem: Proof of Theorem 9.2.1 . . . . . . 292 9.4.3 Some technical results . . . . . . . . . . . . . . . . . . . . 294 9.4.4 Proof of Theorem 9.4.2 . . . . . . . . . . . . . . . . . . . 300 9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . 309 Chapter 10 String Complexity . . . . . . . . . . . . . . . . . . . . . 311 10.1 Introduction to string complexity . . . . . . . . . . . . . . . . . . 312 10.1.1 String self-complexity . . . . . . . . . . . . . . . . . . . . 313 10.1.2 Joint string complexity . . . . . . . . . . . . . . . . . . . 313 10.2 Analysis of string self-complexity . . . . . . . . . . . . . . . . . . 314 10.3 Analysis of the joint complexity . . . . . . . . . . . . . . . . . . . 315 10.3.1 Independent joint complexity . . . . . . . . . . . . . . . . 315 10.3.2 Key property . . . . . . . . . . . . . . . . . . . . . . . . . 316 10.3.3 Recurrence and generating functions . . . . . . . . . . . . 317 10.3.4 Double depoissonization . . . . . . . . . . . . . . . . . . . 318 10.4 Average joint complexity for identical sources . . . . . . . . . . . 322 10.5 Average joint complexity for nonidentical sources . . . . . . . . . 323 10.5.1 The kernel and its properties . . . . . . . . . . . . . . . . 323 10.5.2 Main results. . . . . . . . . . . . . . . . . . . . . . . . . . 326 10.5.3 Proof of (10.18) for one symmetric source . . . . . . . . . 329 10.5.4 Finishing the proof of Theorem 10.5.6 . . . . . . . . . . . 338 10.6 Joint complexity via suffix trees . . . . . . . . . . . . . . . . . . . 340 10.6.1 Joint complexity of two suffix trees for nonidentical sources341 10.6.2 Joint complexity of two suffix trees for identical sources . 343 10.7 Conclusion and applications . . . . . . . . . . . . . . . . . . . . . 343 10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . 346 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 Foreword Earlycomputersreplacedcalculatorsandtypewriters,andprogrammersfocused on scientific computing (calculations involving numbers) and string processing (manipulating sequences of alphanumeric characters, or strings). Ironically, in modernapplications,stringprocessingisanintegralpartofscientificcomputing, as strings are an appropriate model of the natural world in a wide range of applications, notably computational biology and chemistry. Beyond scientific applications, strings are the lingua franca of modern computing, with billions of computers having immediate access to an almost unimaginable number of strings. Decades of research have met the challenge of developing fundamental al- gorithms for string processing and mathematical models for strings and string processingthataresuitableforscientificstudies. Untilnow,muchofthisknowl- edgehas beenthe provinceofspecialists,requiringintimate familiaritywiththe research literature. The appearance of this new book is therefore a welcome development. It is a unique resource that provides a thorough coverage of the fieldandservesasaguidetotheresearchliterature. Itisworthyofseriousstudy by any scientist facing the daunting prospect of making sense of huge numbers of strings. The development of an understanding of strings and string processing algo- rithmshasparalleledtheemergenceofthefieldofanalyticcombinatorics,under the leadership of the late Philippe Flajolet, to whom this book is dedicated. Analytic combinatoricsprovidespowerfultoolsthatcansynthesizeandsimplify classical derivations and new results in the analysis of strings and string pro- cessing algorithms. As disciples of Flajolet and leaders in the field nearly since its inception, Philippe Jacquet and Wojciech Szpankowski are well positioned to provide a cohesive modern treatment, and they have done a masterful job in this volume. Robert Sedgewick Princeton University Preface Repeated patterns and related phenomena in words are known to play a cen- tral role in many facets of computer science, telecommunications, coding, data compression, data mining, and molecular biology. One of the most fundamen- tal questions arising in such studies is the frequency of pattern occurrencesin a givenstringknownasthetext. Applicationsoftheseresultsincludegenefinding in biology, executing and analyzing tree-like protocols for multiaccess systems, discoveringrepeatedstringsinLempel–Zivschemesandotherdatacompression algorithms, evaluating string complexity and its randomness, synchronization codes, user searching in wireless communications, and detecting the signatures of an attacker in intrusion detection. The basic pattern matching problem is to find for a given (or random) pat- tern w or set of patterns and a text X how many times occurs in the W W text X and how long it takes for to occur in X for the first time. There are W many variations of this basic pattern matching setting which is known as exact string matching. In approximate string matching, better known as generalized string matching, certain words from are expected to occur in the text while W other words are forbidden and cannot appear in the text. In some applications, especially inconstrainedcoding andneuraldata spikes,one puts restrictionson the text (e.g.,onlytextwithout the patterns000and0000is permissible),lead- ing to constrained string matching. Finally, in the most general case, patterns fromtheset donotneedtooccurasstrings(i.e.,consecutively)butratheras W subsequences;that leadsto subsequence pattern matching, alsoknownashidden pattern matching. These various pattern matching problems find a myriad of applications. Molecularbiologyprovidesanimportantsourceofapplicationsofpatternmatch- ing, be it exact or approximate or subsequence pattern matching. There are examplesinabundance: finding signalsinDNA; finding splitgeneswhereexons are interrupted by introns; searching for starting and stopping signals in genes; finding tandem repeats in DNA. In general, for gene searching, hidden pattern matching (perhaps with an exotic constraintset) is the right approachfor find- ing meaningful information. The hidden pattern problem can also be viewed as a close relative of the longest common subsequence (LCS) problem, itself of immediate relevance to computational biology but whose probabilistic aspects are still surrounded by mystery. Exact and approximate pattern matching have been used over the last 30 years in source coding (better known as data compression), notably in the Lempel–Ziv schemes. The idea behind these schemes is quite simple: when an encoder finds two (longest) copies of a substring in a text to be compressed, the secondcopyis notstoredbut, rather,oneretainsa pointerto the copy(and possibly the length of the substring). The holy grail of universal source coding is to show that, without knowing the statistics of the text, such schemes are asymptotically optimal. There are many other applications of pattern matching. Prediction is one of them and is closely related to the Lempel–Ziv schemes (see Jacquet, Sz- pankowski, and Apostol (2002) and Vitter and Krishnan (1996)). Knowledge discovery can be achieved by detecting repeated patterns (e.g., in weather pre- diction, stock market, social sciences). In data mining, pattern matching algo- rithmsareprobablythealgorithmsmostoftenused. Atexteditorequippedwith a pattern matching predictor can guess in advance the words that one wants to type. Messages in phones also use this feature. In this book we study pattern matching problems in a probabilistic frame- work in which the text is generated by a probabilistic source while the pattern is given. In Chapter 1 various probabilistic sources are discussed and our as- sumptions are summarized. In Chapter 6 we briefly discuss the algorithmic as- pects of pattern matching and various efficient algorithms for finding patterns, while in the rest of this book we focus on analysis. We apply analytic tools of combinatorics and the analysis of algorithms to discover general laws of pat- ternoccurrences. Toolsofanalyticcombinatoricsandanalysisofalgorithmsare wellcoveredinrecentbooks byFlajoletandSedgewick(2009)andSzpankowski (2001). The approach advocated in this book is the analysis of pattern matching problems through a formal description by means of regular languages. Basi- cally, such a description of the contexts of one, two, or more occurrences of a pattern gives access to the expectation, the variance, and higher moments, respectively. A systematic translation into the generating functions of a com- plex variable is available by methods of analytic combinatorics deriving from the original Chomsky–Schu¨tzenberger theorem. The structure of the implied generating functions at a pole or algebraic singularity provides the necessary asymptotic information. In fact, there is an important phenomenon, that of asymptotic simplification, in which the essentials of combinatorial-probabilistic featuresarereflectedbythesingularformsofgeneratingfunctions. Forinstance, variancecoefficientscomeoutnaturallyfromthisapproach,togetherwithasuit- ablenotionofcorrelation. Perhapstheoriginalityofthepresentapproachliesin thisjointuseofcombinatorial-enumerativetechniquesandanalytic-probabilistic methods. Weshouldpointoutthatpatternmatching,hiddenwords,andhiddenmean- ingwerestudiedbymanypeopleindifferentcontextsforalongtimebeforecom- puter algorithms were designed. Rabbi Akiva in the first century A.D. wrote a collection of documents called Maaseh Merkava on secret mysticism and medi- tations. In the eleventh century the Spaniard Solomon Ibn Gabirol called these secret teachings Kabbalah. Kabbalists organized themselves as a secret society dedicated to the study of the ancient wisdom of Torah, looking for mysterious connectionsandhiddentruths,meaning,andwordsinKabbalahandelsewhere. Recent versions of this activity are knowledge discovery and data mining, bib- liographic search, lexicographic research, textual data processing, and even web site indexing. Public domain utilities such as agrep, grappe, and webglimpse (developed for example by Wu and Manber (1995), Kucherov and Rusinowitch (1997), and others) depend crucially on approximate pattern matching algo- rithmsforsubsequencedetection. Manyinterestingalgorithmsbasedonregular expressionsandautomata,dynamicprogramming,directedacyclicwordgraphs, and digital tries or suffix trees have been developed. In all the contexts men- tionedaboveitisofobviousinteresttodistinguishpatternoccurrencesfromthe statistically unavoidable phenomenon of noise. The results and techniques of this book may provide some answers to precisely these questions. Contents of the book This book has two parts. In Part I we compile all the results known to us about the various pattern matching problems that have been tackled by ana- lytic methods. Part II is dedicated to the application of pattern matching to various data structures on words, such as digital trees (e.g., tries and digital search trees), suffix trees, string complexity, and string-based data compression andincludesthepopularschemesofLempelandZiv,namelytheLempel–Ziv’77 and the Lempel–Ziv’78 algorithms. When analyzing these data structures and algorithms we use results and techniques from Part I, but we also bring to the tablenewmethodologiessuchastheMellintransformandanalyticdepoissoniza- tion. As already discussed, there are various pattern matching problems. In its simplestform,apattern =wisasinglestringw andonesearchesforsomeor W alloccurrencesofw asablockofconsecutivesymbolsinthe text. This problem is known as exact string matching and its analysis is presented in Chapter 2 whereweadoptasymbolicapproach. Wefirstdescribealanguagethatcontains alloccurrencesofw. Thenwetranslatethis languageinto ageneratingfunction

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.