MEAP Edition
Manning Early Access Program
Regularization in Deep Learning
Version 3

Copyright 2022 Manning Publications. For more information on this and other Manning titles, go to manning.com. ©Manning Publications Co. To comment, go to liveBook.

welcome

Thank you for purchasing the MEAP edition of Regularization in Deep Learning.

One of the most important goals in building machine learning models, and especially deep learning models, is to achieve good generalization performance on the test dataset. The training task is considered complete when we have obtained a generalizable model, often with the help of proper regularization during training. While the theory of generalization remains a mystery, it is an active research area with new insights constantly being proposed. Currently, there are quite a number of regularization techniques that have proved empirically effective in specific training contexts. However, these resources are often scattered and disconnected. This book intends to bridge the gap by offering a systematic and well-illustrated perspective on different regularization techniques, covering data, model, cost function, and optimization procedure. It goes one step further by combining the most recent research breakthroughs with practical coding examples of regularization in deep learning models.

This book treats this complex and ever-growing topic in a unique way. It introduces minimal mathematics and technical concepts in a well-illustrated manner and provides practical examples and code walkthroughs in a step-by-step fashion. The teaching is designed to be intuitive, natural, and progressive, rather than forcing in a particular concept. We hope that you will enjoy reading the book and add a few useful tools to your belt for building better-performing models. We also encourage you to post any questions or comments you have about the content in the liveBook Discussion forum.
We appreciate knowing where we can make improvements and increase your understanding of the subject.

Dr Peng Liu

brief contents

1 Introducing regularization
2 Generalization: a classical view
3 Generalization: a modern view
4 Fundamentals of training deep neural networks
5 Regularization via data
6 Regularization via model
7 Regularization via objective function
8 Regularization via optimization
9 Recommendations and discussions

1 Introducing regularization

This chapter covers
• The what and why of regularization in deep learning
• Underfitting versus overfitting
• The bias-variance trade-off
• A typical model training process
• Different types of regularization

We all want to do well in exams. From a young age, we have most likely been taught to study hard to achieve high marks, not only in the usual practice exam but also in the final exam. In the world of modeling, the practice exam is analogous to the training data given to us, where both questions and answers are available. We want to train a model based on the training data and use the trained model to make predictions for the test data, which is the final exam. Our goal is to train an excellent mental model that works well for both the practice exam and the final exam. We know that the common problem-solving patterns will largely remain the same between the two exams, so it is essential to understand and learn these patterns during the practice exam in order to do well in the final exam. Depending on our preparation strategy, our trained mental model could perform well or badly in either exam, ending up with one of four possible outcomes: good in both exams, bad in both exams, good in the practice exam but unfortunately bad in the final exam, and bad in the practice exam but surprisingly good in the final exam. These four outcomes are illustrated in figure 1.1.
Figure 1.1 There are four different outcomes for the practice and final exams, where the practice exam contains question-answer pairs and the final exam only contains the questions. This corresponds to the four possible scenarios in model development, where the available training data contain input-output pairs and the test data only contain the input data. The trained model is assessed on both training and test data, although more significant focus is given to the latter.

In figure 1.1, the lower left quadrant represents the worst case of all: bad results in both exams. Although unfortunate, this scenario is somewhat expected; insufficient preparation during the practice period will likely lead to poor performance in the final exam. We need to fully utilize the practice exam, whose solutions are available, and train our problem-solving skills to form a good mental model that can then be applied to the final exam. If the model is not good enough in the first place, we can hardly expect to do well in the final exam.

Note that it is still possible to get high marks in the final exam when we are not well prepared for the practice exam, purely by chance. However, this scenario, represented by the upper left quadrant, is improbable. We may happen to get lucky in our random guessing, but luck does not guarantee a stable final exam performance on average.

The upper right quadrant is the best of all: the mental model is well trained on the practice exam and is suitably applied to the final exam. We manage to grasp the gist by continuously fine-tuning our mental model, which generalizes to future, unforeseen problems in the final exam. Life is good in this region.

Some students may care too much about their performance in the practice exams. Extensive pressure from their parents, teachers, or peers may push them to over-learn from the training materials.
As represented by the lower right quadrant, this excessive stress on training performance leads them to eventually memorize the answers to all of the training questions, obtaining full marks in the practice exam thanks to perfect question-answer memorization. However, we all know that the final exam will most likely differ from the practice exam one way or another. By merely remembering the answers instead of learning the patterns, they suffer from overfitting, a phenomenon where the trained mental model fits the practice exam too well, sometimes perfectly, but fails to generalize to new questions in the final exam. On the contrary, underfitting means we have not tried hard enough, resulting in a model that is too simple and performs well in neither the practice exam nor the final exam. The overfitting problem occurs frequently in many aspects of life; the desire to do exceptionally well on the task at hand blinds us from seeing the big picture. The trick is: do not memorize the answer; learn from it.

In the example above, the overfitting region in the lower right quadrant means that the model has been trained to perform well on the training set but does not generalize well to the test set. Such an overfitting problem can be lessened by proper regularization. Regularization offers a suite of techniques designed as slight modifications to the model development process. With proper regularization, the trained mental model can better learn the underlying patterns between the input data (question) and the output label (answer). Such an adequately regularized mental model will generalize better to the final exam and output more accurate answers to new questions. Regularization comprises a suite of strategies used to reduce the generalization error of machine learning models, in particular deep learning models.
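As a concrete illustration of one such strategy, the sketch below applies L2 regularization (weight decay) to a polynomial regression model fitted in closed form. This is a minimal numpy illustration rather than code from the book; the toy data, polynomial degree, and penalty strength `lam` are arbitrary choices for demonstration. The penalty term shrinks the learned weight vector, which is one simple way regularization controls model complexity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise, observed at a handful of training points.
X = rng.uniform(-1, 1, size=(20, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.3, size=20)

# Expand into polynomial features so the model has capacity to overfit.
degree = 5
Phi = np.hstack([X ** d for d in range(1, degree + 1)])

def ridge_fit(Phi, y, lam):
    """Closed-form ridge regression: w = (Phi^T Phi + lam * I)^-1 Phi^T y."""
    n_features = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)

w_unreg = ridge_fit(Phi, y, lam=0.0)  # no regularization
w_reg = ridge_fit(Phi, y, lam=1.0)    # L2-regularized

# The L2 penalty shrinks the weights toward zero.
print("unregularized norm:", np.linalg.norm(w_unreg))
print("regularized norm:  ", np.linalg.norm(w_reg))
```

Running this shows the regularized weight vector has a strictly smaller norm, restraining the model from fitting the noise too aggressively.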
Adequate regularization can help control the complexity of a neural network, reduce the generalization gap, and address the issue of overfitting. Given finite training data or a limited optimization procedure, proper regularization can help a trained model generalize to a future, unseen test dataset, usually sampled from the same hidden distribution as the training set. It can reduce overfitting while at the same time keeping the training error as low as possible. It typically works by tweaking the data, model architecture, cost function, or optimization procedure to enhance the overall learning algorithm and control the degree of overfitting.

In this book, our primary focus is to bring us from the lower right quadrant (the overfitting region) to the upper right quadrant (the goal region). We assume a good amount of effort has already been made to do well in the training exam, which is relatively easy to achieve given currently available learning resources. The challenge that occurs more frequently in practice is overfitting. How can we detect it? When do we need to worry about it? What techniques are available to remediate this problem? Furthermore, why do some techniques work better than others? These are some of the questions this book aims to address.

1.1 Why do we need regularization?

Model complexity can be represented by the number of model parameters and the model architecture that regulates the interaction between the input features and the model parameters, jointly producing the output label. Deep neural networks are usually complex models due to their large number of parameters and sophisticated architecture, and are thus prone to being over-parameterized and overfitting. An overly complex model may learn the underlying patterns of the training data very well, but at the same time
unconsciously become sensitive to the noise in the training data, resulting in overfitting and failing to generalize to the test data. While some data points may be noise-free thanks to perfect data measurement and collection, the majority of data points will often be corrupted by noise one way or another. The random noise in a data point causes the observed value to deviate from the true value by a (hopefully) small margin. Since this noise, if learned by the model, would obscure the underlying pattern and does not necessarily appear in the test data, the model's performance will likely suffer as a result of overfitting the noise-afflicted data.

Similarly, in our exam example, not all the answers provided in the practice exam are correct. Noise may corrupt the target labels in practice, so the model will surely go wrong if it tries to tune its parameters to match its predictions to a wrong target. If we try to memorize the sample answers to the practice questions without learning the problem-solving skills, the final exam performance will most likely drop when the answers are wrong. It helps to treat these answers with a grain of salt and cultivate critical thinking and reasoning skills.

Note that embedding noise in the data can sometimes be good if done adequately, as a data augmentation approach. A small amount of noise, whether in the input features, the model weights, or the target label, can pressure the model to tune its estimation process so that the resulting weights are robust to different variations. These intentionally injected noises are random perturbations that cause the model to exhibit improved performance by learning to appreciate different variations of the same data point, thus becoming more robust and less sensitive to the noise in the test data.
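The idea of noise injection as data augmentation can be sketched in a few lines. The helper below, `augment_with_noise`, is a hypothetical illustration (not from the book): each input is duplicated several times with small Gaussian perturbations while its label is kept unchanged, enlarging the training set with plausible variations of the same data point. The number of copies and the noise scale are arbitrary demonstration values.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_with_noise(X, y, n_copies=3, scale=0.1, rng=rng):
    """Create noisy copies of each input; labels are kept unchanged.

    Small random perturbations of the inputs encourage the downstream
    model to become robust to variations of the same data point.
    """
    X_aug, y_aug = [X], [y]
    for _ in range(n_copies):
        X_aug.append(X + rng.normal(0.0, scale, size=X.shape))
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)

# Two original samples become eight: the originals plus three noisy copies each.
X = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1])
X_big, y_big = augment_with_noise(X, y)
print(X_big.shape, y_big.shape)  # (8, 2) (8,)
```

A model trained on the augmented set sees several slightly perturbed versions of each example, which tends to discourage it from latching onto any one noisy observation.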
After all, if the training data covers all variations of the test data, then it becomes effortless to achieve perfect generalization performance, even with a simple model like k-nearest neighbors. The reason is simple: if the questions in the final exam are all taken from the practice exam, then it is straightforward to get full marks in the final exam without studying very hard. As we will learn in a later part of the book, noise injection and data augmentation are helpful techniques for creating noisy data points that improve the model's robustness. A model trained with noise perturbation will be less sensitive to the effects of random variability in the test data.

Let us graphically visualize the true pattern (the actual function to be approximated), the noise, and the model. In figure 1.2, the dots represent the actual observations as input-output pairs, and the curve denotes the true function we want the model to learn. Unfortunately, the actual observations often deviate from the true function, possibly due to random noise in the observation model (the underlying model that governs how actual data points are revealed, often assumed to contain an additive noise term) or purely due to collection error, leading to vertical perturbations. Our task is to train a model that approximates the curve as closely as possible. Such approximation can easily go wrong in two ways. On the one hand, using a more complex and flexible model such as a neural network is more likely to result in a better approximation than, say, a simple linear regression model. However, fitting the observed data without proper control of model complexity will likely lead to an overly sensitive model. Model training needs to minimize the impact of unwanted noise, which tends to cause ups and downs in the fitted curve. On the other hand, a simple model may not be flexible enough to learn the true pattern, resulting in the underfitting scenario.
It is easy to fall into either underfitting or overfitting during the model training process. In other words, our model may not be flexible enough, failing to capture the curvature with a simple straight line, or it may be too complex, ending up with a wiggly-looking curve that is overstretched and unnecessarily sensitive.

Figure 1.2 The true relationship we want to learn, as represented by the curve, and the training data available for model development, as represented by the dots. The (vertical) deviations between the curve and the dots are due to random observational noise (assumed to be additive in this case) or collection error, which almost always appear in practice. Our goal is to use these dots to build a model that approximates the true pattern as closely as possible, namely the real relationship between input x and output y. At the same time, the learned model needs to be robust enough to avoid distraction from the noise.

There are two scenarios for underfitting. We could underfit the model by limiting ourselves to a simple linear regression model, which struggles to capture a nonlinear relationship due to insufficient model complexity. As shown in figure 1.3, the model, represented by the straight line, is not flexible enough to approximate the true nonlinear function shown as a curved line.

Figure 1.3 An example of fitting a nonlinear function using a linear function, which underfits the training data. The model needs to be flexible enough to learn the pattern of the true function, which is often highly complex in reality.

Alternatively, the training data could be minimal and thus insufficient to support a decent training process. In such a case, with great uncertainty in the input data, a conservative function such as a simple linear model is more likely to generalize to the test set.
On the other hand, a complex model tends to overfit the limited training data and fails to learn a generalizable mapping function. It is thus better to play it safe in the face of extreme uncertainty. This case is illustrated in figure 1.4.
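The underfitting and overfitting scenarios above can be reproduced numerically. The toy experiment below is an illustrative sketch (not the book's own code) that fits polynomials of increasing degree to a few noisy samples of a sine curve, in the spirit of figures 1.2 and 1.3: a degree-1 line underfits, while a high-degree polynomial drives the training error down yet tends to perform far worse on fresh test points. The sample sizes, degrees, and noise level are arbitrary demonstration choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# True pattern: a smooth nonlinear curve, observed with additive noise.
def true_fn(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 12)
y_train = true_fn(x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0.02, 0.98, 50)
y_test = true_fn(x_test) + rng.normal(0, 0.2, size=x_test.shape)

def fit_and_eval(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

for degree in (1, 3, 9):
    train_mse, test_mse = fit_and_eval(degree)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The training error always shrinks as the degree grows (a more flexible model can only fit the training points better), but the gap between training and test error widens: the degree-9 fit chases the noise in the twelve observations instead of the underlying sine pattern.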