Springer Monographs in Mathematics

Élisabeth Gassiat

Universal Coding and Order Identification by Model Selection Methods

Springer Monographs in Mathematics

Editors-in-Chief
Isabelle Gallagher, Paris, France
Minhyong Kim, Oxford, UK

Series Editors
Sheldon Axler, San Francisco, USA
Mark Braverman, Princeton, USA
Maria Chudnovsky, Princeton, USA
Sinan C. Güntürk, New York, USA
Tadahisa Funaki, Tokyo, Japan
Claude Le Bris, Marne la Vallée, France
Pascal Massart, Orsay, France
Alberto Pinto, Porto, Portugal
Gabriella Pinzari, Napoli, Italy
Ken Ribet, Berkeley, USA
René Schilling, Dresden, Germany
Panagiotis Souganidis, Chicago, USA
Endre Süli, Oxford, UK
Shmuel Weinberger, Chicago, USA
Boris Zilber, Oxford, UK

This series publishes advanced monographs giving well-written presentations of the "state-of-the-art" in fields of mathematical research that have acquired the maturity needed for such a treatment. They are sufficiently self-contained to be accessible to more than just the intimate specialists of the subject, and sufficiently comprehensive to remain valuable references for many years. Besides the current state of knowledge in its field, an SMM volume should ideally describe its relevance to and interaction with neighbouring fields of mathematics, and give pointers to future directions of research.

More information about this series at http://www.springer.com/series/3733

Élisabeth Gassiat
Laboratoire de Mathématiques
Université Paris-Sud
Orsay Cedex, France

Translated by Anna Ben-Hamou, LPSM, Sorbonne Université, Paris, France

ISSN 1439-7382   ISSN 2196-9922 (electronic)
Springer Monographs in Mathematics
ISBN 978-3-319-96261-0   ISBN 978-3-319-96262-7 (eBook)
https://doi.org/10.1007/978-3-319-96262-7
Library of Congress Control Number: 2018948590
Mathematics Subject Classification (2010): 68P30, 62C10

Translation from the French language edition: Codage universel et identification d'ordre par sélection de modèles by Élisabeth Gassiat, © Société Mathématique de France 2014. All Rights Reserved.

© Springer International Publishing AG, part of Springer Nature 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

TO HOW, WHO LIKE BOOKS IN VARIOUS (CODING) LANGUAGES!
Preface

Quantifying the information contained in a set of messages is the starting point of information theory. Extracting information from a data set is at the heart of statistics. Information theory and statistics are thus naturally linked together, and this book lies at their interface.

The theoretical concept of information was introduced in the context of research on telecommunication systems. The basic objective of information theory is to transmit messages in the most secure and least costly way. Messages are encoded, then transmitted, and finally decoded at reception. These three steps will not all be investigated here: we will essentially be interested in the first one, the coding step, in its multiple links with statistical theory, and in the rich ideas which are exchanged between information theory and statistics. The reader interested in a more complete view of the basic results of information theory may refer, for instance, to the two (very different) books [1, 2].

This book is mostly concerned with lossless coding, where the goal is to encode a sequence of symbols in a deterministic and decodable way, as efficiently as possible in terms of codeword length. The gain of a coding scheme is measured through the compression rate, which is the ratio between the length of the codeword and the length of the word being encoded. If the sequence of symbols to be encoded is generated by a stochastic process, a coding scheme performs better if more frequent symbols are encoded with shorter codewords. This is where statistics comes into play: if one only has incomplete knowledge of the underlying process generating the sequence of symbols to be encoded, then, in order to improve the compression performance, one had better use what can be inferred about the law of the process from the first observed symbols. Shannon entropy is the basic quantity of information allowing one to analyze the compression performance of a given coding method. When possible, one defines the entropy rate of a process as the limit, as n tends to +∞, of the Shannon entropy of the law of the first n symbols, normalized by n.

In the first chapter, we will see that the asymptotic compression rate is lower bounded by the entropy rate of the distribution of the process producing the text to be encoded, provided that this process is stationary and ergodic. We will also see that every coding method can be associated with a probability distribution, in such a way that the compression performance of a code associated with a distribution Q, for a process with distribution P, is given by an information divergence between P and Q. The setting is then laid down: the problem of universal coding is to find a coding method (hence a sequence of distributions Q_n) which asymptotically achieves, as the number n of symbols to be encoded tends to infinity, the optimal compression rate, for the largest possible class of distributions P. While investigating this question, we will be particularly interested in understanding the links between universal coding and statistical estimation, at all levels, from methods and ideas to proofs.

We will see, in Chap. 2, that in the case of a sequence of symbols with values in a finite alphabet, it is possible to find universal coding methods for the class of all distributions of stationary ergodic processes. Before studying statistical methods in the strict sense, we will present Lempel-Ziv coding, which relies on the simple idea that a codeword's length can be shortened by taking advantage of repetitions in the word to be encoded.
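To give a flavour of this idea, here is a minimal sketch, in Python, of one classical variant (an LZ78-style incremental parsing); it is only an illustration, not the construction studied in Chap. 2, and the function name lz78_parse and the example word are arbitrary choices. Each phrase is a previously seen phrase extended by one new symbol, so repetitions are replaced by short references to earlier phrases.

    def lz78_parse(word):
        """Parse a word into phrases, each one a previously seen phrase
        extended by a single new symbol (LZ78-style incremental parsing)."""
        dictionary = {"": 0}   # phrase -> index of that phrase
        phrases = []           # list of (index of longest known prefix, new symbol)
        current = ""
        for symbol in word:
            if current + symbol in dictionary:
                current += symbol                 # keep extending a known phrase
            else:
                phrases.append((dictionary[current], symbol))
                dictionary[current + symbol] = len(dictionary)
                current = ""
        if current:                               # flush a final, already known phrase
            phrases.append((dictionary[current[:-1]], current[-1]))
        return phrases

    # The 12 symbols of "abababababab" are parsed into 6 phrases, each encoded as a
    # reference to an earlier phrase plus one new symbol:
    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]
    print(lz78_parse("abababababab"))

On longer texts with many repetitions, the number of phrases grows much more slowly than the number of symbols, which is precisely what makes this kind of parsing useful for compression.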
We will then present different quantification criteria for compression capacities and will see that these criteria are directly related to well-known statistical methods: maximum likelihood estimation and Bayesian estimation. We will take advantage of the approximation of stationary ergodic processes by Markov chains with arbitrary memory. Such chains are called context tree sources in information theory, and variable length Markov chains in statistical modeling. Little is known for nonparametric classes, even over finite alphabets. We will present the example of renewal processes, for which we will see that the approximation by variable length Markov chains is a good one.

Chapter 3 then tackles the problem of coding over infinite alphabets. When trying to encode sequences of symbols with values in a very large alphabet (which may then be seen as infinite), one encounters various unsolved problems. In particular, there is no universal code. One is then confronted with problems related to model selection and order identification. After having laid down some milestones (coding of integers, necessary and sufficient conditions for the existence of a weakly universal code over a class of process distributions), we study more particularly, as a first attempt toward a better understanding of these questions, classes of process distributions corresponding to sequences of independent and identically distributed variables, characterized by the speed of decrease at infinity of the probability measure. An alternative idea is to encode the sequence of symbols in two steps: first the pattern (how repetitions are arranged), then the dictionary (the letters used, in their order of appearance); a toy illustration of this decomposition is sketched at the end of this preface. We will see that the information contained in the pattern of the message, measured by the entropy rate, is the same as that contained in the whole message. However, although it is not possible to design a universal code for the class of memoryless sources (sequences of independent and identically distributed random variables) with values in an infinite alphabet, it is possible to obtain a universal code for their patterns.

Chapter 4 deals with the question of order identification in statistics. This is a model selection problem which arises in various practical situations and which aims at identifying an integer characterizing the model: the length of dependency for a Markov chain, the number of hidden states for a hidden Markov chain, the number of populations in a population mixture. The coding ideas and techniques presented in the previous chapters have recently led to some new results, in particular concerning latent variable models such as hidden Markov models. We finally show that the question of order identification relies on a delicate understanding of likelihood ratio trajectories. We point out how this can be done in the case of population mixtures.

At the end of each chapter, one may find bibliographical comments, important references, and some open problems.

The original French version of this book [3] resulted from the editing of lecture notes intended for students in the Master 2 of Probability and Statistics and doctoral students at Orsay University (Paris-Sud). It is accessible to anyone with a graduate level in mathematics and basic knowledge of mathematical statistics. The only difference between this translated version and the first edition is in the remark following Theorem 3.6, where we mention some progress that has been made since then. Except in Chap. 4, all the proofs are detailed. I chose to recall all the tools needed to understand the text, usually by giving a detailed explanation, or at least by providing a reference. However, the last part of Chap. 4 contains more difficult results, for which the essential ideas are given but the proofs are only sketched.
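As the toy illustration announced above: a minimal sketch, in Python, of how a message splits into its pattern and its dictionary; the helper name pattern_and_dictionary is an illustrative choice and not taken from Chap. 3. The pattern records, for each position, the rank of first appearance of the symbol occurring there, while the dictionary lists the distinct symbols in their order of appearance.

    def pattern_and_dictionary(word):
        """Split a message into its pattern (rank of first appearance of each
        symbol) and its dictionary (the distinct symbols, in order of appearance)."""
        ranks = {}        # symbol -> rank of first appearance (1, 2, 3, ...)
        dictionary = []
        pattern = []
        for symbol in word:
            if symbol not in ranks:
                ranks[symbol] = len(ranks) + 1
                dictionary.append(symbol)
            pattern.append(ranks[symbol])
        return pattern, dictionary

    # "abracadabra" -> pattern [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1],
    #                  dictionary ['a', 'b', 'r', 'c', 'd']
    print(pattern_and_dictionary("abracadabra"))

The message is recovered exactly from the pair (pattern, dictionary), which is what makes it meaningful to compare the information content of the pattern with that of the whole message.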
I would like to thank everyone who read this book while it was in progress. It has greatly benefited from their attention. I am particularly grateful to Gilles Stoltz for his demanding reading, and to Grégory Miermont for his canonical patience.

Orsay, France
Élisabeth Gassiat

References

1. T.M. Cover, J.A. Thomas, Elements of Information Theory. Wiley Series in Telecommunications (Wiley and Sons, New York, 1991)
2. I. Csiszár, J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 3rd edn. (Akadémiai Kiadó, Budapest, 1981)
3. É. Gassiat, Codage universel et identification d'ordre par sélection de modèles (Société Mathématique de France, 2014)

Contents

1 Lossless Coding  1
  1.1 Kraft-McMillan Inequalities  3
  1.2 Quantifying Information  6
  1.3 Shannon Entropy and Compression  10
    1.3.1 Shannon's Coding  11
    1.3.2 Huffman's Coding  12
  1.4 Shannon-Fano-Elias Coding. Arithmetic Coding  16
  1.5 Entropy Rate and Almost Sure Compression  20
    1.5.1 Almost Sure Convergence of Minus the Log-Likelihood Rate to the Entropy Rate  20
    1.5.2 Almost Sure Compression Rate  24
  1.6 Notes  26
  References  27
2 Universal Coding on Finite Alphabets  29
  2.1 Lempel-Ziv Coding  30
  2.2 Strongly Universal Coding: Regrets and Redundancies  34
    2.2.1 Criteria  34
    2.2.2 NML, Regret and Redundancy  36
    2.2.3 Minimax and Maximin  41
  2.3 Bayesian Redundancy  48
    2.3.1 Rissanen's Theorem  48
    2.3.2 Bayesian Statistics, Jeffreys' Prior  51
  2.4 Dirichlet Mixtures  54
    2.4.1 Mixture Coding of Memoryless Sources  54
    2.4.2 Mixture Coding of Context Tree Sources  57
    2.4.3 Double Mixture and Universal Coding  62
  2.5 Renewal Sources  63
    2.5.1 Redundancy of Renewal Processes  65
    2.5.2 Adaptivity of CTW for Renewal Sources  67