Learning Finite-State Machines
Statistical and Algorithmic Aspects

Borja de Balle Pigem

Thesis Supervisors: Jorge Castro and Ricard Gavaldà

Department of Software
PhD in Computing

Thesis submitted to obtain the qualification of Doctor from the Universitat Politècnica de Catalunya

Copyright © 2013 by Borja de Balle Pigem

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.

Abstract

The present thesis addresses several machine learning problems on generative and predictive models on sequential data. All the models considered have in common that they can be defined in terms of finite-state machines. On one line of work we study algorithms for learning the probabilistic analog of Deterministic Finite Automata (DFA). This provides a fairly expressive generative model for sequences with very interesting algorithmic properties. State-merging algorithms for learning these models can be interpreted as a divisive clustering scheme where the “dependency graph” between clusters is not necessarily a tree. We characterize these algorithms in terms of statistical queries and use this characterization for proving a lower bound with an explicit dependency on the distinguishability of the target machine. In a more realistic setting, we give an adaptive state-merging algorithm satisfying the stringent algorithmic constraints of the data streams computing paradigm. Our algorithms come with strict PAC learning guarantees. At the heart of state-merging algorithms lies a statistical test for distribution similarity. In the streaming version this is replaced with a bootstrap-based test which yields faster convergence in many situations. We also studied a wider class of models for which the state-merging paradigm also yields PAC learning algorithms.
Applications of this method are given to continuous-time Markovian models and stochastic transducers on pairs of aligned sequences. The main tools used for obtaining these results include a variety of concentration inequalities and sketching algorithms. In another line of work we contribute to the rapidly growing body of spectral learning algorithms. The main virtues of this type of algorithm include the possibility of proving finite-sample error bounds in the realizable case and enormous savings in computing time over iterative methods like Expectation-Maximization. In this thesis we give the first application of this method for learning conditional distributions over pairs of aligned sequences defined by probabilistic finite-state transducers. We also prove that the method can learn the whole class of probabilistic automata, thus extending the class of models previously known to be learnable with this approach. In the last two chapters we present works combining spectral learning with methods from convex optimization and matrix completion. Respectively, these yield an alternative interpretation of spectral learning and an extension to cases with missing data. In the latter case we used a novel joint stability analysis of matrix completion and spectral learning to prove the first generalization bound for this type of algorithm that holds in the non-realizable case. Work in this area has been motivated by connections between spectral learning, classic automata theory, and statistical learning; tools from these three areas have been used.

Preface

From Heisenberg’s uncertainty principle to the law of diminishing returns, essential trade-offs can be found all around in science and engineering.
The Merriam–Webster on-line dictionary defines trade-off as a noun describing “a balancing of factors all of which are not attainable at the same time.” In many cases where a human actor is involved in the balancing, such balancing is the result of some conscious reasoning and acting plan, pursued with the intent of achieving a particular effect in a system of interest. In general it is difficult to imagine that this reasoning and acting can take place without the existence of some specific items of knowledge. In particular, no reasoning whatsoever is possible without the knowledge of the factors involved in the trade-off and some measurement of the individual effect each of those factors has on the ultimate goal of the balancing.

The knowledge required to reach a trade-off can be conceptually separated into two parts. First there is a qualitative part, which involves the identification of the factors that need to be balanced. Then, there is a quantitative one: the influence exerted by these factors on the system of interest needs to be measured. These two pieces are clearly complementary, and each of them brings with it some actionable information. The qualitative knowledge brings with it intuitions about the underlying workings of the system under study which can be useful in deriving analogies and generalizations. On the other hand, quantitative measurements provide the material needed for principled mathematical reasoning and fine-tuning of the trade-off.

In computer science there are plenty of trade-offs that perfectly exemplify this dichotomy. Take, for example, the field of optimization algorithms. Let k denote the number of iterations executed by an iterative optimization algorithm. Obviously, there exists a relation between k and the accuracy of the solution found by the algorithm. This is a general intuition that holds across all optimization algorithms, and one that can be further extended to other families of approximation algorithms.
On the other hand, it is well known that for some classes of optimization problems the distance to the optimal solution decreases like O(1/k), while for some others the rate O(1/√k) is not improvable. These particular bits of knowledge exemplify two different quantitative realizations of the iterations/accuracy trade-off. But even more importantly, this kind of quantitative information can be used to approach meta-trade-offs like the following. Suppose you are given two models for the same problem, one that is closer to reality but can only be solved at a rate of O(1/√k), and another that is less realistic but can be solved at a rate of O(1/k). Then, given a fixed computational budget, should one compute a more accurate solution to an imprecise model, or a less accurate solution in a more precise model? Being able to answer such questions is what leads to informed decisions in science and engineering, which is, in some sense, the ultimate goal for studying such trade-offs in the first place.

Learning theory is a field of computer science that abounds in trade-offs. Those arise not only from the computational nature of the field, but also from a desire to encompass, in a principled way, many aspects of the statistical process that takes data from some real world problem and produces a meaningful model for it. The many aspects involved in such trade-offs can be broadly classified as either computational (time and space complexity), statistical (sample complexity and identifiability), or of a modeling nature (realizable vs non-realizable, proper vs improper modeling). This diversity induces a rich variety of situations, most of which, unfortunately, turn out to be computationally intractable as soon as the models of interest become moderately complex. The understanding and identification of all these trade-offs has led in recent years to important advances in the theory and practice of machine learning, data mining, information retrieval, and other data-driven computational disciplines.

In this dissertation we address a particular problem in learning theory: the design and analysis of algorithms for learning finite-state machines. In particular, we look into the problem of designing efficient learning algorithms for several classes of finite-state machines under different sets of modelling assumptions. Special attention is devoted to the computational and statistical complexity of the proposed methods. The leitmotif of the thesis is to try answering the question: what is the largest class of finite-state machines that is efficiently learnable from a computational and statistical perspective? Of course, a short answer would be: there is a trade-off. The results presented in this thesis provide several longer answers to this question, all of which illuminate some aspect of the trade-offs implied by the different modelling assumptions.

Acknowledgements

Needless to say, this is the last page one writes in a thesis. It is thus the time to thank everyone whose help, in one way or another, has made it possible to reach this final page. Help has come to me in many forms: mentoring, friendship, love, and mixtures of those. To me, these are the things which make life worth living. And so, it is with great pride and pleasure that I thank the following people for their help.

First and foremost, I am eternally thankful to my advisors Jorge and Ricard for their encouragement, patience, and guidance throughout these years. My long-time co-authors Ariadna and Xavier, for their kindness, and for sharing their ideas and listening to mine, deserve little less than my eternal gratitude. During my stay in New York I had the great pleasure to work with Mehryar Mohri; his ideas and points of view still inspire me to this day.

Besides my closer collaborators, I have interacted with a long list of people during my research. All of them have devoted their time to share and discuss with me their views on many interesting problems.
Hoping I do not forget any unforgettable name, I extend my most sincere gratitude to the following people: Franco M. Luque, Raphaël Bailly, Albert Bifet, Daniel Hsu, Dana Angluin, Alex Clark, Colin de la Higuera, Sham Kakade, Anima Anandkumar, David Sontag, Jeff Sorensen, Marta Àrias, Ramon Ferrer, Doina Precup, Prakash Panangaden, Denis Thérien, Gabor Lugosi, Geoff Holmes, Bernhard Pfahringer, and Osamu Watanabe.

I would never have gotten into machine learning research if it were not for Pep Fuertes and Enric Ventura, who introduced me to academic research, José L. Balcázar, whose lectures on computability sparked my interest in computer science, and Ignasi Belda, who mentored my first steps into the fascinating world of artificial intelligence. I thank all of them for giving some purpose to my sleepless nights.

Sometimes it happens that one mixes duty with pleasure and work with friendship. It is my duty to acknowledge that working alongside my good friends Hugo, Joan, Masha, and Pablo has been a tremendous pleasure. Also, these years at UPC would not have been the same without the fun and tasteful company of the omegacix crew and satellites: Leo, David, Oriol, Dani, Joan, Marc, Cristina, Lídia, Edgar, Pere, Adrià, Miquel, Montse, Sol, and many others. Special thanks go to all the people who work at the LSI secretariat, for their well-above-the-mean efficiency and friendliness.

Now I must confess that, on top of all the work, these have been some fun years. But it is not entirely my fault. Some of these people must take part of the blame: my long-time friends Carles, Laura, Ricard, Esteve, Roger, Gemma, Txema, Agnès, Marcel, and the rest of the Weke crew; my roomies Marc, Neus, Lluís, Milena, Laura, Laura, and Hoyt; my newly found friends Eulàlia, Rosa, Marçal, and Esther; my tatami partners Sergi, Miquel, and Marina, and my aikido sensei Joan.

Never would I have finished this thesis without the constant and loving support from my family, especially my mother Laura and my sister Júlia. To them goes my most heartfelt gratitude.

Contents

Abstract
Preface
Acknowledgements
Introduction
    Overview of Contributions
    Related Work

I State-merging Algorithms

1 State-Merging with Statistical Queries
    1.1 Strings, Free Monoids, and Finite Automata
    1.2 Statistical Query Model for Learning Distributions
    1.3 The l∞ Query for Split/Merge Testing
    1.4 Learning PDFA with Statistical Queries
    1.5 A Lower Bound in Terms of Distinguishability
    1.6 Comments
2 Implementations of Similarity Oracles
    2.1 Testing Similarity with Confidence Intervals
    2.2 Confidence Intervals from Uniform Convergence
    2.3 Bootstrapped Confidence Intervals
    2.4 Confidence Intervals from Sketched Samples
    2.5 Proof of Theorem 2.4.3
    2.6 Experimental Results
3 Learning PDFA from Data Streams Adaptively
    3.1 The Data Stream Framework
    3.2 A System for Continuous Learning
    3.3 Sketching Distributions over Strings
    3.4 State-merging in Data Streams
    3.5 A Strategy for Searching Parameters
    3.6 Detecting Changes in the Target
4 State-Merging on Generalized Alphabets
    4.1 PDFA over Generalized Alphabets
    4.2 Generic Learning Algorithm for GPDFA
    4.3 PAC Analysis
    4.4 Distinguishing Distributions over Generalized Alphabets
    4.5 Learning Local Distributions
    4.6 Applications of Generalized PDFA
    4.7 Proof of Theorem 4.3.2

II Spectral Methods

5 A Spectral Learning Algorithm for Weighted Automata
    5.1 Weighted Automata and Hankel Matrices
    5.2 Duality and Minimal Weighted Automata
    5.3 The Spectral Method
    5.4 Recipes for Error Bounds
6 Sample Bounds for Learning Stochastic Automata
    6.1 Stochastic and Probabilistic Automata
    6.2 Sample Bounds for Hankel Matrix Estimation
    6.3 PAC Learning PNFA
    6.4 Finding a Complete Basis
7 Learning Transductions under Benign Input Distributions
    7.1 Probabilistic Finite-State Transducers
    7.2 A Learning Model for Conditional Distributions
    7.3 A Spectral Learning Algorithm
    7.4 Sample Complexity Bounds
8 The Spectral Method from an Optimization Viewpoint
    8.1 Maximum Likelihood and Moment Matching
    8.2 Consistent Learning via Local Loss Optimization
    8.3 A Convex Relaxation of the Local Loss
    8.4 Practical Considerations
9 Learning Weighted Automata via Matrix Completion
    9.1 Agnostic Learning of Weighted Automata
    9.2 Completing Hankel Matrices via Convex Optimization
    9.3 Generalization Bound

Conclusion and Open Problems
    Generalized and Alternative Algorithms
    Choice or Reduction of Input Parameters
    Learning Distributions in Non-Realizable Settings

Bibliography

A Mathematical Preliminaries
    A.1 Probability
    A.2 Linear Algebra
    A.3 Calculus
    A.4 Convex Analysis
    A.5 Properties of l∞ and l∞^p