ebook img

Deep Learning - Bengio PDF

576 Pages·2016·18.731 MB·English
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Deep Learning - Bengio

Deep Learning Yoshua Bengio Ian Goodfellow Aaron Courville Contents Acknowledgments vii Notation ix 1 Introduction 1 1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8 1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 11 I Applied Math and Machine Learning Basics 25 2 Linear Algebra 27 2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . 27 2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 30 2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . 31 2.4 Linear Dependence, Span, and Rank . . . . . . . . . . . . . . . . 32 2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . 35 2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 39 2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . 40 2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.11 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 42 3 Probability and Information Theory 46 3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 49 3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 51 i CONTENTS 3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 52 3.7 Independence and Conditional Independence . . . . . . . . . . . 52 3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . 53 3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . 57 3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . 62 3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . 64 3.14 Structured Probabilistic Models . . . . . . . . . . . . . . . . . . . 65 3.15 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 68 4 Numerical Computation 74 4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 74 4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 76 4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 85 4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 87 5 Machine Learning Basics 89 5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . 97 5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . 99 5.4 The No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . 104 5.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6 Hyperparameters, Validation Sets and Cross-Validation . . . . . 108 5.7 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . 110 5.8 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 118 5.9 Bayesian Statistics and Prior Probability Distributions . . . . . . 121 5.10 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 128 5.11 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 131 5.12 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . 134 5.13 The Curse of Dimensionality and Statistical Limitations of Local Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 II Modern Practical Deep Networks 147 6 Feedforward Deep Networks 149 6.1 From Fixed Features to Learned Features . . . . . . . . . . . . . 149 6.2 Formalizing and Generalizing Neural Networks . . . . . . . . . . 152 6.3 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . 154 ii CONTENTS 6.4 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . 167 6.5 Universal Approximation Properties and Depth . . . . . . . . . . 180 6.6 Feature / Representation Learning . . . . . . . . . . . . . . . . . 184 6.7 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . 186 6.8 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 7 Regularization 190 7.1 Regularization from a Bayesian Perspective . . . . . . . . . . . . 191 7.2 Classical Regularization: Parameter Norm Penalty . . . . . . . . 193 7.3 Classical Regularization as Constrained Optimization . . . . . . . 200 7.4 Regularization and Under-Constrained Problems . . . . . . . . . 201 7.5 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 203 7.6 Classical Regularization as Noise Robustness . . . . . . . . . . . 204 7.7 Early Stopping as a Form of Regularization . . . . . . . . . . . . 208 7.8 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 215 7.9 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 215 7.10 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 215 7.11 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 7.12 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 222 7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 223 8 Optimization for Training Deep Models 226 8.1 Optimization for Model Training . . . . . . . . . . . . . . . . . . 226 8.2 Challenges in Optimization . . . . . . . . . . . . . . . . . . . . . 229 8.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . 236 8.4 Approximate Natural Gradient and Second-Order Methods . . . 241 8.5 Conjugate Gradients . . . . . . . . . . . . . . . . . . . . . . . . . 241 8.6 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 8.7 Hints, Global Optimization and Curriculum Learning . . . . . . . 243 9 Convolutional Networks 248 9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 248 9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256 9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . 261 9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 262 9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 269 9.7 Convolutional Modules . . . . . . . . . . . . . . . . . . . . . . . . 269 9.8 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 9.9 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 271 9.10 Random or Unsupervised Features . . . . . . . . . . . . . . . . . 271 iii CONTENTS 9.11 The Neuroscientific Basis for Convolutional Networks . . . . . . . 273 9.12 Convolutional Networks and the History of Deep Learning . . . . 280 10 Sequence Modeling: Recurrent and Recursive Nets 281 10.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . 282 10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 284 10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 295 10.4 Deep Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . 296 10.5 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . 299 10.6 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . 299 10.7 Facing the Challenge of Long-Term Dependencies . . . . . . . . . 305 10.8 Handling Temporal Dependencies with N-Grams, HMMs, CRFs and Other Graphical Models . . . . . . . . . . . . . . . . . . . . . 317 10.9 Combining Neural Networks and Search . . . . . . . . . . . . . . 328 11 Practical methodology 333 11.1 Basic Machine Learning Methodology . . . . . . . . . . . . . . . 333 11.2 Manual Hyperparameter Tuning . . . . . . . . . . . . . . . . . . 334 11.3 Hyper-parameter Optimization Algorithms . . . . . . . . . . . . . 334 11.4 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 336 12 Applications 339 12.1 Large Scale Deep Learning . . . . . . . . . . . . . . . . . . . . . . 339 12.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 12.3 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 352 12.4 Natural Language Processing and Neural Language Models . . . 353 12.5 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 369 12.6 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 369 III Deep Learning Research 370 13 Structured Probabilistic Models for Deep Learning 372 13.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 373 13.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 377 13.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 391 13.4 Learning About Dependencies . . . . . . . . . . . . . . . . . . . . 392 13.5 Inference and Approximate Inference Over Latent Variables . . . 394 13.6 The Deep Learning Approach to Structured Probabilistic Models 395 14 Monte Carlo Methods 400 14.1 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 400 iv CONTENTS 14.2 The Difficulty of Mixing Between Well-Separated Modes . . . . . 402 15 Linear Factor Models and Auto-Encoders 404 15.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 405 15.2 Denoising Auto-encoders . . . . . . . . . . . . . . . . . . . . . . . 408 15.3 Representational Power, Layer Size and Depth . . . . . . . . . . 410 15.4 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 411 15.5 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . 412 15.6 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 413 15.7 Reconstruction Error as Log-Likelihood . . . . . . . . . . . . . . 417 15.8 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 418 15.9 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . 423 15.10 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 428 16 Representation Learning 431 16.1 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . 432 16.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 439 16.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 446 16.4 Semi-Supervised Learning and Disentangling Underlying Causal Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447 16.5 Assumption of Underlying Factors and Distributed Representation449 16.6 Exponential Gainin RepresentationalEfficiency from Distributed Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453 16.7 Exponential Gain in Representational Efficiency from Depth . . . 455 16.8 Priors Regarding The Underlying Factors . . . . . . . . . . . . . 457 17 The Manifold Perspective on Representation Learning 461 17.1 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 469 17.2 Manifold Interpretation of Sparse Coding . . . . . . . . . . . . . 472 17.3 The Entropy Bias from Maximum Likelihood . . . . . . . . . . . 472 17.4 Manifold Learning via Regularized Auto-Encoders . . . . . . . . 473 17.5 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier 474 18 Confronting the Partition Function 478 18.1 The Log-Likelihood Gradient of Energy-Based Models . . . . . . 479 18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 481 18.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 488 18.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 490 18.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 492 18.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 492 18.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . 494 v CONTENTS 19 Approximate inference 502 19.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 504 19.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 505 19.3 MAP Inference: Sparse Coding as a Probabilistic Model . . . . . 506 19.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 507 19.5 Stochastic Inference . . . . . . . . . . . . . . . . . . . . . . . . . 511 19.6 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 511 20 Deep Generative Models 513 20.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 513 20.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 516 20.3 Training Restricted Boltzmann Machines . . . . . . . . . . . . . . 519 20.4 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 523 20.5 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 526 20.6 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 537 20.7 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 540 20.8 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 541 20.9 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . 541 20.10 A Generative View of Autoencoders . . . . . . . . . . . . . . . . 543 20.11 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 549 20.12 Methodological Notes . . . . . . . . . . . . . . . . . . . . . . . . . 551 Bibliography 555 Index 593 vi Acknowledgments This book would not have been possible without the contributions of many people. We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain, Kyunghyun Cho, C¸a˘glar G¨ulc¸ehre, Razvan Pascanu, David Krueger and Thomas Roh´ee. We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Julian Serban, Laurent Dinh, Guillaume Alain, Kelvin Xu, Ilya Sutskever, Vincent Vanhoucke, David Warde-Farley, Jurgen Van Gael, Dustin Webb, Johannes Roith, Ion Androut- sopoulos, Pawel Chilinski, Halis Sak, Fr´ed´eric Francis, Jonathan Hunt, and Grig- ory Sapunov. We would also like to thank those who provided us with useful feedback on individual chapters: Chapter 1, Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, • Ozan C¸a˘glayan, Mart´ın Abadi, and Sebastien Bratieres. Chapter 2, Linear Algebra: Pierre Luc Carrier, Li Yao, Thomas Roh´ee, • Colby Toland, Amjad Almahairi, Sergey Oreshkov, Istv´an Petr´as, and Den- nis Prangle. Chapter 3, Probability and Information Theory: Rasmus Antti, Stephan • Gouws, Vincent Dumoulin, Artem Oboturov, Li Yao, John Philip Anderson, and Kai Arulkumaran. Chapter 4, Numerical Computation: Meire Fortunato, and Tran Lam An. • Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, and Meire For- • tunato. Chapter 8, Optimization for Training Deep Models: Marcel Ackermann. • Chapter 9, Convolutional Networks: Mehdi Mirza, C¸a˘glar G¨ulc¸ehre, and • Mart´ın Arjovsky. vii CONTENTS Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Dmitriy • Serdyuk, and Dongyu Shi. Chapter 18, Confronting the Partition Function: Sam Bowman, and Ozan • C¸a˘glayan. Bibliography, Leslie N. Smith. • We also want to thank those who allowed us to reproduce images, figures or data from their publications: David Warde-Farley, Matthew D. Zeiler, Rob Fergus, Chris Olah, Jason Yosinski, Nicolas Chapados, and James Bergstra. We indicate their contributions in the figure captions throughout the text. Finally, we would like to thank Google for allowing Ian Goodfellow to work on the book as his 20% project while working at Google. In particular, we would like to thank Ian’s former manager, Greg Corrado, and his subsequent manager, Samy Bengio, for their support of this effort. viii Notation This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of these mathematical concepts, this notation reference may seem intimidating. However, do not despair, we describe most of these ideas in chapters 1-3. Numbers and Arrays a A scalar (integer or real) value with the name “a” a A vector with the name “a” A A matrix with the name “A” A A tensor with the name “A” I Identity matrix with n rows and n columns n I Identity matrix with dimensionality implied by context e Standard basis vector [0,...,0,1,0,...,0] with a 1 at position i. i diag(a) A square, diagonal matrix with entries given by a a A scalar random variable with the name “a” a A vector-valued random variable with the name “a” A A matrix-valued random variable with the name “A” ix

See more

The list of books you might like

book image

The Mountain Is You

Brianna Wiest
·2020
·0.34 MB

book image

Corrupt (Devil's Night #1)

Penelope Douglas
·518 Pages
·2015
·0.74 MB

book image

Believe Me

Tahereh Mafi
·177 Pages
·2021
·2.19 MB

book image

Der langste Tag. Normandie 6. Juni 1944

Cornelius Ryan
·418 Pages
·1959
·4.676 MB

book image

Poetry and Painting in Song China: The Subtle Art of Dissent

Alfreda Murck
·433 Pages
·2000
·234.159 MB

book image

When Asia Was the World

Stewart Gordon
·237 Pages
·2007
·5.7 MB

book image

The Dancing Floor

John Buchan
·209 Pages
·2005
·1.448 MB