List of Figures
List of Algorithms
List of Generative Stories
Preface
Acknowledgments

Preliminaries
  Probability Measures
  Random Variables
  Continuous and Discrete Random Variables
  Joint Distribution over Multiple Random Variables
  Conditional Distributions
  Bayes' Rule
  Independent and Conditionally Independent Random Variables
  Exchangeable Random Variables
  Expectations of Random Variables
  Models
  Parametric vs. Nonparametric Models
  Inference with Models
  Generative Models
  Independence Assumptions in Models
  Directed Graphical Models
  Learning from Data Scenarios
  Bayesian and Frequentist Philosophy (Tip of the Iceberg)
  Summary
  Exercises

Introduction
  Overview: Where Bayesian Statistics and NLP Meet
  First Example: The Latent Dirichlet Allocation Model
  The Dirichlet Distribution
  Inference
  Summary
  Second Example: Bayesian Text Regression
  Conclusion and Summary
  Exercises

Priors
  Conjugate Priors
  Conjugate Priors and Normalization Constants
  The Use of Conjugate Priors with Latent Variable Models
  Mixture of Conjugate Priors
  Renormalized Conjugate Distributions
  Discussion: To Be or Not To Be Conjugate?
  Summary
  Priors Over Multinomial and Categorical Distributions
  The Dirichlet Distribution Re-visited
  The Logistic Normal Distribution
  Discussion
  Summary
  Non-informative Priors
  Uniform and Improper Priors
  Jeffreys Prior
  Discussion
  Conjugacy and Exponential Models
  Multiple Parameter Draws in Models
  Structural Priors
  Conclusion and Summary
  Exercises

Bayesian Estimation
  Learning with Latent Variables: Two Views
  Bayesian Point Estimation
  Maximum a Posteriori Estimation
  Posterior Approximations Based on the MAP Solution
  Decision-theoretic Point Estimation
  Discussion and Summary
  Empirical Bayes
  Asymptotic Behavior of the Posterior
  Summary
  Exercises

Sampling Methods
  MCMC Algorithms: Overview
  NLP Model Structure for MCMC Inference
  Partitioning the Latent Variables
  Gibbs Sampling
  Collapsed Gibbs Sampling
  Operator View
  Parallelizing the Gibbs Sampler
  Summary
  The Metropolis-Hastings Algorithm
  Variants of Metropolis-Hastings
  Slice Sampling
  Auxiliary Variable Sampling
  The Use of Slice Sampling and Auxiliary Variable Sampling in NLP
  Simulated Annealing
  Convergence of MCMC Algorithms
  Markov Chain: Basic Theory
  Sampling Algorithms Not in the MCMC Realm
  Monte Carlo Integration
  Discussion
  Computability of Distribution vs. Sampling
  Nested MCMC Sampling
  Runtime of MCMC Samplers
  Particle Filtering
  Conclusion and Summary
  Exercises

Variational Inference
  Variational Bound on Marginal Log-likelihood
  Mean-field Approximation
  Mean-field Variational Inference Algorithm
  Dirichlet-multinomial Variational Inference
  Connection to the Expectation-maximization Algorithm
  Empirical Bayes with Variational Inference
  Discussion
  Initialization of the Inference Algorithms
  Convergence Diagnosis
  The Use of Variational Inference for Decoding
  Variational Inference as KL Divergence Minimization
  Online Variational Inference
  Summary
  Exercises

Nonparametric Priors
  The Dirichlet Process: Three Views
  The Stick-breaking Process
  The Chinese Restaurant Process
  Dirichlet Process Mixtures
  Inference with Dirichlet Process Mixtures
  Dirichlet Process Mixture as a Limit of Mixture Models
  The Hierarchical Dirichlet Process
  The Pitman-Yor Process
  Pitman-Yor Process for Language Modeling
  Power-law Behavior of the Pitman-Yor Process
  Discussion
  Gaussian Processes
  The Indian Buffet Process
  Nested Chinese Restaurant Process
  Distance-dependent Chinese Restaurant Process
  Sequence Memoizers
  Summary
  Exercises

Bayesian Grammar Models
  Bayesian Hidden Markov Models
  Hidden Markov Models with an Infinite State Space
  Probabilistic Context-free Grammars
  PCFGs as a Collection of Multinomials
  Basic Inference Algorithms for PCFGs
  Hidden Markov Models as PCFGs
  Bayesian Probabilistic Context-free Grammars
  Priors on PCFGs
  Monte Carlo Inference with Bayesian PCFGs
  Variational Inference with Bayesian PCFGs
  Adaptor Grammars
  Pitman-Yor Adaptor Grammars
  Stick-breaking View of PYAG
  Inference with PYAG
  Hierarchical Dirichlet Process PCFGs (HDP-PCFGs)
  Extensions to the HDP-PCFG Model
  Dependency Grammars
  State-split Nonparametric Dependency Models
  Synchronous Grammars
  Multilingual Learning
  Part-of-speech Tagging
  Grammar Induction
  Further Reading
  Summary
  Exercises

Closing Remarks

Basic Concepts
  Basic Concepts in Information Theory
  Entropy and Cross Entropy
  Kullback-Leibler Divergence
  Other Basic Concepts
  Jensen's Inequality
  Transformation of Continuous Random Variables
  The Expectation-maximization Algorithm

Distribution Catalog
  The Multinomial Distribution
  The Dirichlet Distribution
  The Poisson Distribution
  The Gamma Distribution
  The Multivariate Normal Distribution
  The Laplace Distribution
  The Logistic Normal Distribution
  The Inverse Wishart Distribution

Bibliography
Author's Biography
Index
Natural language processing (NLP) underwent a profound transformation in the mid-1980s, when it shifted to make heavy use of corpora and data-driven techniques for analyzing language. Since then, the use of statistical techniques in NLP has evolved in several ways. One such development took place in the late 1990s and early 2000s, when full-fledged Bayesian machinery was introduced to NLP. The Bayesian approach compensates for various shortcomings of the frequentist approach and enriches it, especially in the unsupervised setting, where statistical learning is done without target prediction examples.
We cover the methods and algorithms needed to fluently read Bayesian learning papers in NLP and to do research in the area. These methods and algorithms are partly borrowed from machine learning and statistics and partly developed "in-house" in NLP. We cover inference techniques such as Markov chain Monte Carlo sampling and variational inference, Bayesian estimation, and nonparametric modeling, along with fundamental concepts in Bayesian statistics such as prior distributions, conjugacy, and generative modeling. Finally, we present some of the fundamental modeling techniques in NLP, such as grammar modeling, and their use with Bayesian analysis.