Table Of Content

Applied Combinatorics on Words Lothaire June 23, 2004 2 Version June 23, 2004 Contents i Contents Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Chapter 1 Algorithms on Words . . . . . . . . . . . . . . . . . . . 1 1.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1 Words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Elementary algorithms . . . . . . . . . . . . . . . . . . . . 7 1.3 Tries and automata . . . . . . . . . . . . . . . . . . . . . . 15 1.4 Pattern matching . . . . . . . . . . . . . . . . . . . . . . . 35 1.5 Transducers . . . . . . . . . . . . . . . . . . . . . . . . . . 39 1.6 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.7 Word enumeration . . . . . . . . . . . . . . . . . . . . . . . 66 1.8 Probability distributions on words . . . . . . . . . . . . . . 71 1.9 Statistics on words. . . . . . . . . . . . . . . . . . . . . . . 86 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Chapter 2 Structures for Indexes . . . . . . . . . . . . . . . . . . . 101 2.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 101 2.1 Suffix trie . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 2.2 Suffix tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 2.3 Contexts of factors . . . . . . . . . . . . . . . . . . . . . . 116 2.4 Suffix automaton . . . . . . . . . . . . . . . . . . . . . . . 121 2.5 Compact suffix automaton . . . . . . . . . . . . . . . . . . 132 2.6 Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 2.7 Finding regularities . . . . . . . . . . . . . . . . . . . . . . 143 2.8 Pattern matching machine . . . . . . . . . . . . . . . . . . 147 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 Chapter 3 Symbolic Natural Language Processing . . . . . . . . 155 3.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 155 3.1 From letters to words . . . . . . . . . . . . . . . . . . . . . 156 3.2 From words to sentences . . . . . . . . . . . . . . . . . . . 187 Version June 23, 2004 ii Contents Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 Chapter 4 Statistical Natural Language Processing . . . . . . . 199 4.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 199 4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 200 4.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 4.3 Application to speech recognition . . . . . . . . . . . . . . 213 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 Chapter 5 Inference of Network Expressions . . . . . . . . . . . 227 5.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 227 5.1 Inferring simple network expressions: models and related problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 5.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 5.3 Inferring network expressions with spacers . . . . . . . . . 240 5.4 Related issues . . . . . . . . . . . . . . . . . . . . . . . . . 244 5.5 Open problems . . . . . . . . . . . . . . . . . . . . . . . . . 247 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249 Chapter 6 Statistics on Words with Applications to Biological Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 6.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 252 6.1 Probabilistic models for biological sequences . . . . . . . . 254 6.2 Overlapping and non-overlappingoccurrences . . . . . . . 260 6.3 Word locations along a sequence . . . . . . . . . . . . . . . 264 6.4 Word count distribution. . . . . . . . . . . . . . . . . . . . 270 6.5 Renewal count distribution . . . . . . . . . . . . . . . . . . 291 6.6 Occurrences and counts of multiple patterns . . . . . . . . 294 6.7 Some applications to DNA sequences . . . . . . . . . . . . 306 6.8 Some probabilistic and statistical tools . . . . . . . . . . . 315 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 Chapter 7 Analytic Approach to Pattern Matching . . . . . . . 329 7.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 329 7.1 Probabilistic models . . . . . . . . . . . . . . . . . . . . . . 332 7.2 Exact string matching . . . . . . . . . . . . . . . . . . . . . 335 7.3 Generalized string matching . . . . . . . . . . . . . . . . . 349 7.4 Subsequence pattern matching . . . . . . . . . . . . . . . . 366 7.5 Generalized subsequence problem . . . . . . . . . . . . . . 377 7.6 Self-repetitive pattern matching . . . . . . . . . . . . . . . 383 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 Chapter 8 Periodic Structures in Words . . . . . . . . . . . . . . 399 Version June 23, 2004 8.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 399 8.1 Definitions and preliminary results . . . . . . . . . . . . . . 400 8.2 Counting maximal repetitions . . . . . . . . . . . . . . . . 402 8.3 Basic algorithmic tools . . . . . . . . . . . . . . . . . . . . 408 8.4 Finding all maximal repetitions in a word . . . . . . . . . . 411 8.5 Finding quasi-squares in two words . . . . . . . . . . . . . 416 8.6 Finding repeats with a fixed gap . . . . . . . . . . . . . . . 418 8.7 Computing local periods of a word. . . . . . . . . . . . . . 421 8.8 Finding approximate repetitions . . . . . . . . . . . . . . . 427 8.9 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439 Chapter 9 Counting, Coding and Sampling with Words . . . . 443 9.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 443 9.1 Counting: walks in sectors of the plane . . . . . . . . . . . 445 9.2 Sampling: polygons, animals and polyominoes . . . . . . . 456 9.3 Coding: trees and maps . . . . . . . . . . . . . . . . . . . . 466 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478 Chapter 10 Words in Number Theory . . . . . . . . . . . . . . . . 481 10.0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 482 10.1 Morphic and automatic sequences: definitions and gener- alities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483 10.2 d-Kernels and properties of automatic sequences . . . . . . 487 10.3 Christol’s algebraic characterizationof automatic sequences 496 10.4 An application to transcendence in positive characteristic . 502 10.5 An application to transcendental power series over the ra- tionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503 10.6 An application to transcendence of real numbers . . . . . . 504 10.7 The Tribonacci word . . . . . . . . . . . . . . . . . . . . . 506 10.8 The Rauzy fractal . . . . . . . . . . . . . . . . . . . . . . . 511 10.9 An application to simultaneous approximation . . . . . . . 522 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531 References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535 General Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556 Presentation v Presentation A series of important applications of combinatorics on words has emergedwith the development of computerized text and string processing, especially in biology and in linguistics. The aim of this volume is to present, in a unified treatment, some of the major fields of applications. The main topics that are coveredin this book are 1. Algorithmsformanipulatingtext,suchasstringsearching,patternmatching, and testing a word for special properties. 2. Efficient data structures for retrieving information on large indexes, in- cluding suffix trees and suffix automata 3. Combinatorial,probabilisticandstatisticalpropertiesofpatternsinfinite words,andmoregeneralpattern,undervariousassumptionsonthesources of the text. 4. Inference of regular expressions. 5. Algorithms for repetitions in strings, such as maximal run or tandem repeats. 6. Linguistictextprocessing,especiallyanalysisofthesyntacticandsemantic structure of natural language. Applications to language processing with large dictionaries. 7. Enumeration, generation and sampling of complex combinatorial structures by their encodings in words. This book is actually the third of a series of books on combinatorics on words. Lothaire’s “Combinatorics on Words” appeared in its first printing in 1984 as Volume 17 of the Encyclopedia of Mathematics. It was based on the impulse ofM. P.Schu¨tzenberger’s scientific work. Since then, the theory devel- oped to a large scientific domain. It was reprinted in 1997 in the Cambridge Mathematical Library. Lothaire is a nom de plume for a group of authors ini- tially constituted of former students of Schu¨tzenberger. Along the years,it has enlargedto a broader community coordinated by the editors. A second volume of Lothaire’s series, entitled “Algebraic Combinatorics on Words” appeared in 2002. Itcontainsbothcomplementsandnewxdevelopmentsthatemergedsince the publication of the first volume. The content of this volume is quite applied, in comparison with the two previous ones. However, we have tried to follow the same spirit, namely to presentintroductoryexpositions,withfulldescriptionsandnumerousexamples. Refinements are frequently deferredto problems,or mentioned inNotes. There Version June 23, 2004 vi Presentation is presently no similar book that covers these topics in this way. Although each chapter has a different author, the book is really a cooper- ative work. A set of common notation has been agreed upon. Algorithms are presented in a consistent way using transparent conventions. There is also a common general index, and a common list of bibliographic references. This book is independent of Lothaire’s other books, in the sense that no knowledge of the other volumes is assumed. The book has been written with the objective of being readable by a large audience. The prerequisits are those of a general scientific education. Some chapters may require a more advanced preparation. A graduate student in science or engineering should have no difficulty in reading all the chapters. A student in linguistics should be able to read part of it with profit and interest. Outline of contents. The general organisations is described below. Natural languages Symbolic language processing Statistical language processing Bioinformatics Inference of network expressions Core algorithms Statistics on words with applications Algorithms on words Algorithmics Structuresfor indexes Analytic approach to pattern matching Periodic structuresin words Mathematics Counting, coding and sampling Words in numbertheory Figure 0.1. Overall structureof “Applied Combinatorics on Words”. Thetwofirstchaptersaredevotedtocorealgorithms. Thefirst,“Algorithms on words”, is is quite general, and is used in all other chapters. The second chapter, “Structures for indexes”, is fundamental for all advanced algorithmic treatment, and more technical. Among the applications, a first domain is linguistics, represented by two chaptersentitled“Symboliclanguageprocessing”and“Staticticallanguagepro- cessing”. Version June 23, 2004 Presentation vii A second application is biology. This is covered by two chapters, entitled “Inferenceofnetworkexpressions”,and“StatisticsonWords withApplications to Biological Sequences”. The next block is composed of two chapters dealing with algorithmics, a subject whichis of interestfor its ownin theoreticalcomputer science, but also related to biology and linguistics One chapter is entitled “Analytic approach to pattern matching” and deals with generalized pattern matching algorithms. A chapter entitled “Periodic structures in words” describes algorithms used for discovering and enumerating repetitions in words. A final block is devoted to applications to mathematics (and theoretical physics). It is represented by two chapters. The first chapter, entitled “Count- ing, coding and sampling with words” deals with the use of words for coding combinatorialstructures. Another chapter, entitled “Words in number theory” deals with transcendence, fractals and dynamical systems. Description of contents. Basicalgorithms,asneededlater,andnotationaregiveninChapter1 “Algo- rithms onwords”,writtenby JeanBerstelandDominique Perrin. This chapter alsocontainsbasic concepts onautomata,grammars,and parsing. It ends with an exposition of probability distribution on words. The concepts and methods introduced are used in all the other chapters. Chapter2,entitled“Structuresforindexes”andwrittenbyMaximeCroche- more, presents data structures for the compact representation of the suffixes of a text. These areused in severalsubsequentchapters. Compactsuffix trees are presented, and construction of these trees in linear time is carefully described. The theory and algorithmics for suffix automata are presented next. The main application, namely the construction of indexes, is described next. Many other applications are given, such as detection of repetitions or forbidden words in a text, use as a pattern matching machine, and search for conjugates. Thefirstdomainofapplications,linguistics,isrepresentedbyChapter3and Chapter 4. Chapter 3, entitled “Symbolic language processing” is written by Eric Laporte. In language processing, a text or a discourse is a sequence of sentences; a sentence is a sequence of words; a word is a sequence of letters. The most universal levels are those of sentence, word and letter (or phoneme), but intermediate levels exist, and can be crucial in some languages, between word and letter: a level of morphological elements (e.g. suffixes), and the level of syllables. The discovery of this piling up of levels, and in particular of word level and phoneme level, delighted structuralist linguists in the 20th century. They termed this inherent, universal feature of human language as “double articulation”. This chapterisorganizedaroundthe mainlevelsofanylanguagemodelling: first, how words are made from letters; second, how sentences are made from words. It surveys the basic operations of interest for language processing, and for each type of operation, it examines the formal notions and tools involved. The main originality of this presentation is the systematic and consistent use Version June 23, 2004 viii Presentation of finite state automata at every level of the description. This point of view is reflected in some practical implementations of natural language processing systems. Chapter 4, entitled “Statistical language processing” is written by Mehryar Mohri. Itpresentstheuseofstatisticalmethodstonaturallanguageprocessing. Themaintooldevelopedisthenotionofweightedtransducers. Theweightsare numbers in some semiring that can represent probabilities. Applications to speech processing are discussed. The block of applications to biology is concerned with analysis of word oc- curences,patternmatching,andconnectionswithgenomeanalysis. Itiscovered by the next two chapters, and to some extent also by the alogrithmics bloc. Chapter 5, “Inference of network expressions”, is written by Nadia Pisanti and Marie-FranceSagot. This chapter introduces various mathematicalmodels andalgorithmsforinferring regularexpressionwithout Kleenestarthatappear repeated in a word or are common to a set of words. Inferring a network expressionmeanstodiscoversuchexpressionswhichareinitiallyunknown,from the word(s) where the repeated (or common) expressions will be sought. This is in contrast with the string searching problem considered in other chapters. This has many applications,notably in molecular biology,system security, text mining etc. Because of the richness of the mathematical and algorithmical problems posed by molecular biology, we concentrate on applications in this area. Applications to biology motivate us also to consider network expressions that appear repeated not exactly but approximately. Chapter 6 is written by Gesine Reinert, Sophie Schbath and Michael Wa- terman, and entitled “Statistics on Words with Applications to Biological Se- quences”. Properties of words in sequences have been of considerable interest in many fields, such as coding theory and reliability theory, and most recently in the analysis of biologicalsequences. The latter will serve as the key example in this chapter. Two main aspects of word occurrences in biological sequences are: where do they occur and how many times do they occur? An important problem, for instance, was to determine the statistical significance of a word frequency in a DNAsequence. Thenaiveideaisthefollowing: awordmaybesignificantlyrare in a DNA sequence because it disrupts replication or gene expression, (perhaps a negative selection factor), whereas a significantly frequent word may have a fundamental activity with regard to genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are certain biological palindromes corresponding to restriction sites avoided for instance in E. coli, and the Cross-overHotspot Instigator sites in several bacteria. Statistical methods to study the distribution of the word locations along a sequence and word frequencies have also been an active field of research; the goal of this chapter is to provide an overview of the state of this research. Because DNA sequences are long, asymptotic distributions were proposed first. Exact distributions exist now, motivated by the analysis of genes and protein sequences. Unfortunately, exact results are not adapted in practice for Version June 23, 2004