

Preview Ensemble Methods for Machine Learning Version 6

MEAP Edition
Manning Early Access Program
Ensemble Methods for Machine Learning, Version 6
Copyright 2022 Manning Publications
For more information on this and other Manning titles go to manning.com

Welcome

Thank you for purchasing the MEAP for Ensemble Methods for Machine Learning. Modern machine learning has become synonymous with Deep Learning. However, Deep Learning is often too big a hammer for many applications, and it requires large data sets and computational resources that are out of reach for most of us: students, engineers, data scientists, analysts and casual enthusiasts. Ensemble methods are another powerful way to build effective and robust models for real-world applications in many areas, including finance, medicine, recommendation systems, cybersecurity and many more. This book is intended to be a tutorial on the practical aspects of implementing and training deployable ensemble models.

This book is intended for a broad audience, anyone creating ML algorithms: data scientists who are interested in using these algorithms for building models; engineers who are involved in building applications and architectures; and students, Kagglers and casual enthusiasts who want to learn more about this fascinating and exciting area of machine learning.

On this journey, we'll adopt an immersive approach to ensemble methods aimed at fostering intuition and demystifying the technical and algorithmic details. You will learn how to (1) implement a basic version from scratch to gain an under-the-hood understanding, and (2) use sophisticated, off-the-shelf implementations (such as scikit-learn) to ultimately get the best out of your models. Every chapter also comes with its own case study: a practical demonstration of how to use different ensemble methods on real-world tasks.

It is impossible to provide a detailed introduction to the diverse area of machine learning in one book. Instead, this book assumes basic knowledge of machine learning and that you have used or played around with at least one fundamental ML technique, such as decision trees. A basic working knowledge of Python is also assumed. Examples, visualizations and chapter case studies all use Python and Jupyter notebooks. Knowledge of other commonly used Python packages such as NumPy (for mathematical computations), Pandas (for data manipulation) and matplotlib (for visualization) is useful, but not necessary. In fact, you can learn how to use these packages through the examples and case studies.

Finally, this book is dedicated to you, and your feedback will be invaluable in improving it. Please post any questions, comments, corrections and suggestions in the liveBook's Discussion Forum for this book.

—Gautam Kunapuli

Brief contents

PART 1: THE BASICS OF ENSEMBLES
1 Ensemble Learning: Hype or Hallelujah?

PART 2: ESSENTIAL ENSEMBLE METHODS
2 Homogeneous Parallel Ensembles: Bagging and Random Forests
3 Heterogeneous Parallel Ensembles: Combining Strong Learners
4 Sequential Ensembles: Boosting
5 Sequential Ensembles: Gradient Boosting
6 Sequential Ensembles: Newton Boosting

PART 3: ENSEMBLES IN THE WILD: ADAPTING ENSEMBLE METHODS TO YOUR DATA
7 Learning with Continuous and Count Labels
8 Learning with Categorical Features
9 Explaining Your Ensembles
10 Further Reading

1 Ensemble Learning: Hype or Hallelujah?
This chapter covers
• Defining and framing the ensemble learning problem
• Motivating the need for ensembles in different applications
• Understanding how ensembles handle fit vs. complexity
• Implementing our first ensemble with ensemble diversity and model aggregation

In October 2006, Netflix announced a $1 million prize for the team that was able to improve movie recommendations over their own proprietary recommendation system, Cinematch, by 10%. The Netflix Grand Prize was one of the first ever open data science competitions and attracted tens of thousands of teams. The training set consisted of 100 million ratings that 480 thousand users had given to 17 thousand movies. Within three weeks, 40 teams had already beaten Cinematch's results. By September 2007, over 40 thousand teams had entered the contest, and a team from AT&T Labs took the 2007 Progress Prize by improving upon Cinematch by 8.42%.

As the competition progressed and the 10% mark remained elusive, a curious phenomenon emerged among the competitors. Teams began to collaborate and share knowledge about effective feature engineering, algorithms and techniques. Inevitably, they began combining their models, blending individual approaches into powerful and sophisticated ensembles of many models. These ensembles combined the best of various diverse models and features and proved to be far more effective than any individual model.

In June 2009, nearly two years after the contest began, BellKor's Pragmatic Chaos, a merger of three different teams, edged out another merged team, The Ensemble (which was itself a merger of over 30 teams!), to improve upon the baseline by 10% and take the $1 million prize. "Just edged out" is a bit of an understatement, as BellKor's Pragmatic Chaos managed to submit their final models barely 20 minutes before The Ensemble got their models in [1]. In the end, both teams achieved a final performance improvement of 10.06%.

While the Netflix competition captured the imagination of data scientists, machine learners and casual data science enthusiasts worldwide, its lasting legacy has been to establish Ensemble Methods as a powerful way to build practical and robust models for large-scale, real-world applications. Among the individual algorithms used are several that have become staples of collaborative filtering and recommendation systems today: k-nearest neighbors, matrix factorization and restricted Boltzmann machines. However, Andreas Töscher and Michael Jahrer of BigChaos, co-winners of the Netflix Prize, summed up [2] their keys to success:

"During the nearly 3 years of the Netflix competition, there were two main factors which improved the overall accuracy: the quality of the individual algorithms and the ensemble idea. …the ensemble idea was part of the competition from the beginning and evolved over time. In the beginning, we used different models with different parametrization and a linear blending. …[Eventually] the linear blend was replaced by a nonlinear one..."

In the years since, the use of ensemble methods has exploded, and they have emerged as a state-of-the-art technology for machine learning. The next two sections provide a gentle introduction to what ensemble methods are, why they work and where they are applied. Then, we will look at a subtle but important challenge prevalent in all machine-learning algorithms: the fit vs. complexity tradeoff. Finally, we jump into training our very first ensemble method and see in a hands-on manner how ensemble methods overcome this fit vs. complexity tradeoff and improve overall performance. Along the way, we will familiarize ourselves with several key terms that form the lexicon of ensemble methods and will be used throughout the book.

[1] https://www.netflixprize.com/leaderboard.html
[2] The BigChaos Solution to the Netflix Grand Prize, Andreas Töscher, Michael Jahrer and Robert M. Bell.
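To make the "linear blending" mentioned in the quote above concrete, here is a minimal, hypothetical sketch (not a listing from the book): the predictions of two different regression models are combined with a weighted average, and the blending weight is chosen on a held-out validation set. The synthetic data set, base models and weight grid are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# A synthetic regression problem stands in for the recommendation task
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=42)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Two diverse base models, trained independently
m1 = Ridge().fit(X_trn, y_trn)
m2 = DecisionTreeRegressor(max_depth=6, random_state=42).fit(X_trn, y_trn)
p1, p2 = m1.predict(X_val), m2.predict(X_val)

# Linear blending: pick the weight w that minimizes validation error of w*p1 + (1-w)*p2
weights = np.linspace(0.0, 1.0, 21)
errors = [mean_squared_error(y_val, w * p1 + (1 - w) * p2) for w in weights]
best_w = weights[int(np.argmin(errors))]
print(f"best blend weight: {best_w:.2f}, blended MSE: {min(errors):.2f}")

A nonlinear blend, as in the quote, would replace the weighted average with another model trained on the base models' predictions as inputs.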
1.1 Ensemble Methods: The Wisdom of the Crowds

What exactly is an ensemble method? Let's get an intuitive notion of what they are and how they work by considering the allegorical case of Dr. Randy Forrest. We can then go on to frame the ensemble learning problem.

Dr. Randy Forrest is a famed and successful diagnostician, much like his idol Dr. Gregory House of TV fame. His success, however, is due not only to his exceeding politeness (unlike his cynical and curmudgeonly idol), but also to his rather unusual approach to diagnosis. You see, Dr. Forrest works at a teaching hospital and commands the respect of a large number of doctors-in-training. Dr. Forrest has taken care to assemble a team with a diversity of skills. His residents excel at different specializations: one is good at cardiology (heart), another at pulmonology (lungs), and yet another at neurology (nervous system), and so on. All in all, a rather diversely skillful bunch, each with their own strengths.

Every time Dr. Randy Forrest gets a new case, he solicits the opinions of all his residents and collects possible diagnoses from all of them. He then democratically decides the final diagnosis as the most common one from among all those proposed.

Figure 1.1 The diagnostic procedure followed by Dr. Randy Forrest every time he gets a new case is to get opinions from his residents. His residents offer their diagnoses: either that the patient has cancer or has no cancer. Dr. Forrest then selects the majority answer as the final diagnosis put forth by his team.

Dr. Forrest embodies a diagnostic ensemble: he aggregates his residents' diagnoses into a single diagnosis representative of the collective wisdom of his team. As it turns out, Dr. Forrest is right more often than any individual resident is. Why? Because he knows that his residents are pretty smart, and a large number of pretty smart residents are all unlikely to make the same mistake. Here, Dr. Forrest relies on the power of model aggregating, or model averaging: he knows that the average answer is most likely going to be a good one.

Still, how does Dr. Forrest know that all his residents are not wrong? He can't know that for sure, of course. However, he has guarded against this undesirable outcome all the same. Remember that his residents all have diverse specializations. Because of their diverse backgrounds, training, specialization and skills, it is possible, but highly unlikely, that all his residents are wrong. Here, Dr. Forrest relies on the power of ensemble diversity, or the diversity of the individual components of his ensemble.

Dr. Randy Forrest, of course, is an ensemble method, and his residents (who are in training) are the machine-learning algorithms that make up the ensemble.
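As a concrete illustration of Dr. Forrest's procedure, here is a minimal sketch in code (an illustrative example, not a listing from the book): a few diverse classifiers each offer a prediction for every test example, and the ensemble's answer is the majority vote. The data set (scikit-learn's make_moons) and the choice of base learners are assumptions made purely for illustration.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# A small synthetic binary classification problem stands in for the diagnostic task
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=42)

# Three "residents" with diverse specializations: a tree, a nearest-neighbor model and naive Bayes
residents = [DecisionTreeClassifier(max_depth=3, random_state=42),
             KNeighborsClassifier(n_neighbors=5),
             GaussianNB()]
predictions = np.array([clf.fit(X_trn, y_trn).predict(X_tst) for clf in residents])

# With three voters and 0/1 labels, the majority diagnosis is 1 whenever at least two vote 1
majority = (predictions.sum(axis=0) >= 2).astype(int)
print("individual accuracies:", [float(np.mean(p == y_tst)) for p in predictions])
print("ensemble accuracy:", float(np.mean(majority == y_tst)))

scikit-learn packages this same aggregation pattern as sklearn.ensemble.VotingClassifier (with voting='hard' for majority voting), if you prefer an off-the-shelf implementation over the from-scratch version sketched here.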
The secrets to his success, and indeed the success of ensemble methods as well, are:
• ensemble diversity, so that he has a variety of opinions to choose from, and
• model aggregation, so that he can combine them into a single final opinion.

Any collection of machine-learning algorithms can be used to build an ensemble: literally, a group of machine learners. But why do they work? James Surowiecki, in The Wisdom of Crowds, describes human ensembles or wise crowds thus:

"If you ask a large enough group of diverse and independent people to make a prediction or estimate a probability, the average of those answers will cancel out errors in individual estimation. Each person's guess, you might say, has two components: information and errors. Subtract the errors, and you're left with the information."

This is also precisely the intuition behind ensembles of learners: it is possible to build a wise machine-learning ensemble by aggregating individual learners. An ensemble method is a machine-learning algorithm that aims to improve predictive performance on a task by aggregating the predictions of multiple estimators or models. In this manner, an ensemble method learns a meta-estimator.

The key to success with ensemble methods is ensemble diversity. Informally, ensemble diversity refers to the fact that individual ensemble components, or machine-learning models, are different from each other. Training such ensembles of diverse individual models is a key challenge in ensemble learning, and different approaches achieve this in different ways.

1.2 Why You Should Care About Ensemble Learning

What can you do with ensemble methods? Are they really just hype, or are they hallelujah? As we see in this section, they can be used to train and deploy robust and effective predictive models for many different applications.

One palpable success of ensemble methods is their domination of data science competitions (alongside deep learning), where they have been generally successful on different types of machine-learning tasks and application areas. Anthony Goldbloom, CEO of Kaggle.com, revealed in 2015 that the three most successful algorithms for structured problems were XGBoost, Random Forest and Gradient Boosting, all ensemble methods. Indeed, the most popular way to tackle data science competitions these days is to combine feature engineering with ensemble methods. Structured data is generally highly organized in tables, relational databases and other formats most of us are familiar with, and it is the type of data that ensemble methods have proven to be very successful on. Unstructured data, in contrast, does not always have a table structure. Images, audio, video, waveform and text data are typically unstructured, and this is the kind of data on which deep learning approaches, with their automated feature generation, have demonstrated great success. While we focus on structured data for most of this book, ensemble methods can be combined with deep learning for unstructured problems as well.

Beyond competitions, ensemble methods drive data science in several areas including financial and business analytics, medicine and healthcare, cybersecurity, education, manufacturing, recommendation systems, entertainment and many more.

In 2018, Olson et al. [3] conducted a comprehensive analysis of 13 popular machine-learning algorithms and their variants. They ranked each algorithm's performance on 165 benchmark data sets (Figure 1.2). Their goal was to emulate the standard machine-learning pipeline to provide advice on how to select a machine-learning algorithm.

Figure 1.2 Which machine-learning algorithm should I use for my data set? The mean ranking of the performance of several different machine-learning algorithms on 165 different data sets is shown here. Figure reproduced from Olson et al. (2018). SVC = support vector classification, SGD = stochastic gradient descent, KNN = k-nearest neighbor, PAC = passive-aggressive classifier, NB = naïve Bayes classifier. On average, ensemble methods (1: Gradient Tree Boosting, 2: Random Forest, 4: Extra Trees) outperformed individual classifiers and classical ensemble approaches (9: AdaBoost).

[3] Data-driven advice for applying machine learning to bioinformatics problems, Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore, Pacific Symposium on Biocomputing (2018).

These results demonstrate exactly why ensemble methods (specifically, tree-based ensembles) are considered state-of-the-art. If your goal is to develop state-of-the-art analytics from your data, or to eke out better performance from models you already have, this book is for you. If your goal is to start competing more effectively in data science competitions, for fame and fortune, or just to improve your data science skills, this book is also for you. If you're excited about adding powerful ensemble methods to your machine-learning arsenal, this book is definitely for you. To drive home this point, we will build our first ensemble method: a simple model combination ensemble. Before we do, let's dive into the tradeoff between fit and complexity that most machine-learning methods have to grapple with, as it will help us understand why ensemble methods are so effective.

1.3 Fit vs. Complexity in Individual Models

In this section, we look at two popular machine-learning methods: decision trees and support vector machines. As we do so, we'll try to understand how their fitting and predictive behavior changes as they learn increasingly complex models. This section also serves as a refresher of the training and evaluation practices we usually follow during modeling.

Machine-learning tasks are typically:
• supervised learning tasks, with a data set of labeled examples, where the data has been annotated. For example, in cancer diagnosis, each example will be an individual patient, with the label/annotation "has cancer" or "does not have cancer". Labels can be 0-1 (binary classification), categorical (multiclass classification) or continuous (regression).
• unsupervised learning tasks, with a data set of unlabeled examples, where the data lacks annotations. This includes tasks such as grouping examples together by some notion of "similarity" (clustering) or identifying anomalous data that does not fit the expected pattern (anomaly detection).

Let's say that we're looking at the Boston Housing data set, which describes the median value of owner-occupied homes in 506 U.S. census tracts in the Boston area. The machine-learning task is to learn a regression model to predict the median home value in a census tract using different variables.
The Boston Housing data set is available from scikit-learn:

from sklearn.datasets import load_boston   # note: deprecated and removed in recent scikit-learn versions (1.2+)
from sklearn.preprocessing import StandardScaler

X, y = load_boston(return_X_y=True)                    # load features and labels
X = StandardScaler().fit_transform(X)                  # standardize the features
y = StandardScaler().fit_transform(y.reshape(-1, 1))   # standardize the labels

A data set is generally represented as a table, where each row is a data point or an example. Each example is characterized by features (also known as independent variables, or attributes) and a label (also known as a dependent variable, annotation or response).
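To see the fit vs. complexity tradeoff this section explores, here is a minimal sketch (not the book's listing) that continues from the snippet above: it fits decision-tree regressors of increasing depth to the Boston Housing data and compares training and test error. The train/test split and the range of depths are illustrative assumptions.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Hold out a test set so we can see complexity improving the fit but, eventually, hurting generalization
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y.ravel(), test_size=0.25, random_state=13)

for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=13).fit(X_trn, y_trn)
    err_trn = mean_squared_error(y_trn, tree.predict(X_trn))
    err_tst = mean_squared_error(y_tst, tree.predict(X_tst))
    print(f"depth={depth:2d}  train MSE={err_trn:.3f}  test MSE={err_tst:.3f}")

Deeper trees (more complex models) fit the training data better and better, but beyond some depth the test error stops improving and often worsens; this gap between fit and generalization is the tradeoff that ensembles help manage.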
