Mathematical Theory of Bayesian Statistics

Sumio Watanabe

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper
Version Date: 20180402

International Standard Book Number-13: 978-1-482-23806-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Preface

1 Definition of Bayesian Statistics
    1.1 Bayesian Statistics
    1.2 Probability Distribution
    1.3 True Distribution
    1.4 Model, Prior, and Posterior
    1.5 Examples of Posterior Distributions
    1.6 Estimation and Generalization
    1.7 Marginal Likelihood or Partition Function
    1.8 Conditional Independent Cases
    1.9 Problems

2 Statistical Models
    2.1 Normal Distribution
    2.2 Multinomial Distribution
    2.3 Linear Regression
    2.4 Neural Network
    2.5 Finite Normal Mixture
    2.6 Nonparametric Mixture
    2.7 Problems

3 Basic Formula of Bayesian Observables
    3.1 Formal Relation between True and Model
    3.2 Normalized Observables
    3.3 Cumulant Generating Functions
    3.4 Basic Bayesian Theory
    3.5 Problems

4 Regular Posterior Distribution
    4.1 Division of Partition Function
    4.2 Asymptotic Free Energy
    4.3 Asymptotic Losses
    4.4 Proof of Asymptotic Expansions
    4.5 Point Estimators
    4.6 Problems

5 Standard Posterior Distribution
    5.1 Standard Form
    5.2 State Density Function
    5.3 Asymptotic Free Energy
    5.4 Renormalized Posterior Distribution
    5.5 Conditionally Independent Case
    5.6 Problems

6 General Posterior Distribution
    6.1 Bayesian Decomposition
    6.2 Resolution of Singularities
    6.3 General Asymptotic Theory
    6.4 Maximum A Posteriori Method
    6.5 Problems

7 Markov Chain Monte Carlo
    7.1 Metropolis Method
        7.1.1 Basic Metropolis Method
        7.1.2 Hamiltonian Monte Carlo
        7.1.3 Parallel Tempering
    7.2 Gibbs Sampler
        7.2.1 Gibbs Sampler for Normal Mixture
        7.2.2 Nonparametric Bayesian Sampler
    7.3 Numerical Approximation of Observables
        7.3.1 Generalization and Cross Validation Losses
        7.3.2 Numerical Free Energy
    7.4 Problems

8 Information Criteria
    8.1 Model Selection
        8.1.1 Criteria for Generalization Loss
        8.1.2 Comparison of ISCV with WAIC
        8.1.3 Criteria for Free Energy
        8.1.4 Discussion for Model Selection
    8.2 Hyperparameter Optimization
        8.2.1 Criteria for Generalization Loss
        8.2.2 Criterion for Free Energy
        8.2.3 Discussion for Hyperparameter Optimization
    8.3 Problems

9 Topics in Bayesian Statistics
    9.1 Formal Optimality
    9.2 Bayesian Hypothesis Test
    9.3 Bayesian Model Comparison
    9.4 Phase Transition
    9.5 Discovery Process
    9.6 Hierarchical Bayes
    9.7 Problems

10 Basic Probability Theory
    10.1 Delta Function
    10.2 Kullback-Leibler Distance
    10.3 Probability Space
    10.4 Empirical Process
    10.5 Convergence of Expected Values
    10.6 Mixture by Dirichlet Process

References

Index

Preface

The purpose of this book is to establish a mathematical theory of Bayesian statistics.

In practical applications of Bayesian statistical inference, we need to prepare a statistical model and a prior for a given sample, then estimate the unknown true distribution. One of the most important problems is devising a method to construct a pair of a statistical model and a prior even though we do not know the true distribution. The answer to this problem based on mathematical theory is given by the following procedure.

(1) First, we construct universal mathematical laws between Bayesian observables which hold for an arbitrary triple of a true distribution, a statistical model, and a prior.
(2) Second, by using such laws, we can evaluate how appropriate a pair of a statistical model and a prior is for the unknown true distribution.
(3) Last, the most suitable pair of a statistical model and a prior is employed.

The conventional approach to this purpose has been based on the assumption that the posterior distribution can be approximated by some normal distribution. However, the new statistical theory introduced by this book holds for an arbitrary posterior distribution, which shows that its field of application will be extended. The author also expects that new statistical methodology, which enables us to manipulate complex and hierarchical statistical models such as normal mixtures or hierarchical neural networks, will be based on the new mathematical theory.

Sumio Watanabe