Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

Information Theory, Inference, and Learning Algorithms

David J.C. MacKay
[email protected]
© 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003
Version 6.0 (as published) June 26, 2003

Please send feedback on this book via http://www.inference.phy.cam.ac.uk/mackay/itila/

This book will be published by C.U.P. in September 2003. It will remain viewable on-screen on the above website, in postscript, djvu, and pdf formats. (C.U.P. replace this page with their own page ii.)

Contents

Preface  v

1  Introduction to Information Theory  3
2  Probability, Entropy, and Inference  22
3  More about Inference  48

I  Data Compression  65
4  The Source Coding Theorem  67
5  Symbol Codes  91
6  Stream Codes  110
7  Codes for Integers  132

II  Noisy-Channel Coding  137
8  Correlated Random Variables  138
9  Communication over a Noisy Channel  146
10  The Noisy-Channel Coding Theorem  162
11  Error-Correcting Codes and Real Channels  177

III  Further Topics in Information Theory  191
12  Hash Codes: Codes for Efficient Information Retrieval  193
13  Binary Codes  206
14  Very Good Linear Codes Exist  229
15  Further Exercises on Information Theory  233
16  Message Passing  241
17  Communication over Constrained Noiseless Channels  248
18  Crosswords and Codebreaking  260
19  Why have Sex? Information Acquisition and Evolution  269

IV  Probabilities and Inference  281
20  An Example Inference Task: Clustering  284
21  Exact Inference by Complete Enumeration  293
22  Maximum Likelihood and Clustering  300
23  Useful Probability Distributions  311
24  Exact Marginalization  319
25  Exact Marginalization in Trellises  324
26  Exact Marginalization in Graphs  334
27  Laplace's Method  341
28  Model Comparison and Occam's Razor  343
29  Monte Carlo Methods  357
30  Efficient Monte Carlo Methods  387
31  Ising Models  400
32  Exact Monte Carlo Sampling  413
33  Variational Methods  422
34  Independent Component Analysis and Latent Variable Modelling  437
35  Random Inference Topics  445
36  Decision Theory  451
37  Bayesian Inference and Sampling Theory  457

V  Neural networks  467
38  Introduction to Neural Networks  468
39  The Single Neuron as a Classifier  471
40  Capacity of a Single Neuron  483
41  Learning as Inference  492
42  Hopfield Networks  505
43  Boltzmann Machines  522
44  Supervised Learning in Multilayer Networks  527
45  Gaussian Processes  535
46  Deconvolution  549

VI  Sparse Graph Codes  555
47  Low-Density Parity-Check Codes  557
48  Convolutional Codes and Turbo Codes  574
49  Repeat–Accumulate Codes  582
50  Digital Fountain Codes  589

VII  Appendices  597
A  Notation  598
B  Some Physics  601
C  Some Mathematics  605

Bibliography  613
Index  620

Preface

This book is aimed at senior undergraduates and graduate students in Engineering, Science, Mathematics, and Computing. It expects familiarity with calculus, probability theory, and linear algebra as taught in a first- or second-year undergraduate course on mathematics for scientists and engineers.

Conventional courses on information theory cover not only the beautiful theoretical ideas of Shannon, but also practical solutions to communication problems. This book goes further, bringing in Bayesian data modelling, Monte Carlo methods, variational methods, clustering algorithms, and neural networks.

Why unify information theory and machine learning? Because they are two sides of the same coin. In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

How to use this book

The essential dependencies between chapters are indicated in the figure on the next page. An arrow from one chapter to another indicates that the second chapter requires some of the first.

Within Parts I, II, IV, and V of this book, chapters on advanced or optional topics are towards the end.
All chapters of Part III are optional on a first reading, except perhaps for Chapter 16 (Message Passing).

The same system sometimes applies within a chapter: the final sections often deal with advanced topics that can be skipped on a first reading. For example, in two key chapters – Chapter 4 (The Source Coding Theorem) and Chapter 10 (The Noisy-Channel Coding Theorem) – the first-time reader should detour at section 4.5 and section 10.4 respectively.

Pages vii–x show a few ways to use this book. First, I give the roadmap for a course that I teach in Cambridge: 'Information theory, pattern recognition, and neural networks'. The book is also intended as a textbook for traditional courses in information theory. The second roadmap shows the chapters for an introductory information theory course, and the third for a course aimed at an understanding of state-of-the-art error-correcting codes. The fourth roadmap shows how to use the text in a conventional course on machine learning.
[Figure, p. vi: 'Dependencies' – diagram of the essential dependencies between the fifty chapters, grouped into Parts I–VI; an arrow from one chapter to another indicates that the second requires some of the first.]
[Figure, p. vii: roadmap for 'My Cambridge Course on Information Theory, Pattern Recognition, and Neural Networks' – highlighted chapters: 1–6, 8–11, 20–22, 24, 27, 29–33, 38–42, 47.]

[Figure, p. viii: roadmap for a 'Short Course on Information Theory' – highlighted chapters: 1, 2, 4–6, 8–10.]

[Figure, p. ix: roadmap for an 'Advanced Course on Information Theory and Coding' – highlighted chapters: 11–17, 24–26, 47–50.]

[Figure, p. x: roadmap for 'A Course on Bayesian Inference and Machine Learning' – highlighted chapters: 2, 3, 20–22, 24, 27–34, 38–45.]