Approximation Algorithms for Grammar-Based Data Compression

by

Eric Lehman

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2002

© Eric Lehman, MMII. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis in whole or in part.

Author: Department of Electrical Engineering and Computer Science, February 1, 2002

Certified by: Madhu Sudan, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

This thesis considers the smallest grammar problem: find the smallest context-free grammar that generates exactly one given string. We show that this problem is intractable, and so our objective is to find approximation algorithms.

This simple question is connected to many areas of research. Most importantly, there is a link to data compression; instead of storing a long string, one can store a small grammar that generates it. A small grammar for a string also naturally brings out underlying patterns, a fact that is useful, for example, in DNA analysis. Moreover, the size of the smallest context-free grammar generating a string can be regarded as a computable relaxation of Kolmogorov complexity. Finally, work on the smallest grammar problem qualitatively extends the study of approximation algorithms to hierarchically-structured objects.
In this thesis, we establish hardness results, evaluate several previously proposed algorithms, and then present new procedures with much stronger approximation guarantees.

Thesis Supervisor: Madhu Sudan
Title: Associate Professor of Electrical Engineering and Computer Science

Acknowledgments

This thesis is the product of collaboration with Abhi Shelat and April Rasala at MIT as well as Amit Sahai, Moses Charikar, Manoj Prabhakaran, and Ding Liu at Princeton. The problem addressed here was proposed by Yevgeniy Dodis and Amit Sahai. My readers, Piotr Indyk and Dan Spielman, provided helpful, low-key advice. During the final writeup, April Rasala suggested dozens of fixes and improvements and even agreed to marry me to keep my spirits up. The entire project was deftly overseen by my advisor, Madhu Sudan.

During my long haul through graduate school, Be Blackburn made the lab a welcoming place for me, as she has for countless others. Many professors offered support and advice at key times, including Tom Leighton, Albert Meyer, Charles Leiserson, and Michel Goemans. My mother taught me to count. My warmest thanks to them all.

In memory of Danny Lewin. May 14, 1970 - September 11, 2001

Contents

1 Introduction
  1.1 Motivations
    1.1.1 Data Compression
    1.1.2 Complexity
    1.1.3 Pattern Recognition
    1.1.4 Hierarchical Approximation
  1.2 Previous Work
  1.3 Summary of Our Contributions
  1.4 Organization
2 Preliminaries
  2.1 Definitions and Notation
  2.2 Basic Observations
3 Hardness
  3.1 NP-Hardness
  3.2 Hardness via Addition Chains
  3.3 Background on Addition Chains
  3.4 An Observation on Hardness
4 Analysis of Previous Algorithms
  4.1 Compression Versus Approximation
  4.2 LZ78
  4.3 BISECTION
  4.4 SEQUENTIAL
  4.5 Global Algorithms
    4.5.1 LONGEST MATCH
    4.5.2 GREEDY
    4.5.3 RE-PAIR
5 New Algorithms
  5.1 An O(log^3 n) Approximation Algorithm
    5.1.1 Preliminaries
    5.1.2 The Algorithm
    5.1.3 Aside: The Substring Problem
  5.2 An O(log n/m*)-Approximation Algorithm
    5.2.1 An LZ77 Variant
    5.2.2 Balanced Binary Grammars
    5.2.3 The Algorithm
  5.3 Grammar-Based Compression versus LZ77
6 Future Directions