
Data Cleaning and Exploration with Machine Learning: Get to grips with machine learning techniques to achieve sparkling-clean data quickly PDF

542 Pages·2022·8.705 MB·English


Data Cleaning and Exploration with Machine Learning
Get to grips with machine learning techniques to achieve sparkling-clean data quickly
Michael Walker

BIRMINGHAM—MUMBAI

Data Cleaning and Exploration with Machine Learning

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi
Senior Editor: David Sugarman
Content Development Editor: Manikandan Kurup
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Alishon Mendonca
Marketing Coordinators: Shifa Ansari and Abeer Riyaz Dawe

First published: August 2022
Production reference: 1290722

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-80324-167-8
www.packt.com

Contributors

About the author

Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006.
He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island.

About the reviewers

Kalyana Bedhu is an engineering leader for data science at Microsoft. He has over 20 years of industry experience in data analytics at companies such as Ericsson, Sony, Bosch, Fidelity, and Oracle. Kalyana was an early practitioner of data science at Ericsson, where he set up a data science lab and built up competence in solving practical data science problems. He played a pivotal role in transforming a central IT organization, which handled most of the enterprise business intelligence, data, and analytical systems, into an AI and data science engine. Kalyana holds patents, speaks at industry events, and has authored award-winning papers and data science courses.

Thanks to Packt and the author for the opportunity to review this book.

Divya Sardana is the lead AI/ML engineer at Nike. Previously, she was a senior data scientist at Teradata Corp. She holds a Ph.D. in computer science from the University of Cincinnati, OH. She has experience working on end-to-end machine learning and deep learning problems involving techniques such as regression and classification, as well as moving developed models to production and ensuring scalability. Her interests include solving complex big data and machine learning/deep learning problems in real-world domains. She is actively involved in the peer review of machine learning journals and books, and has served as a session chair at machine learning conferences such as ICMLA 2021 and BDA 2021.
Table of Contents

Preface

Section 1 – Data Cleaning and Machine Learning Algorithms

1. Examining the Distribution of Features and Targets
    Technical requirements
    Subsetting data
    Generating frequencies for categorical features
    Generating summary statistics for continuous and discrete features
    Identifying extreme values and outliers in univariate analysis
    Using histograms, boxplots, and violin plots to examine the distribution of features
        Using histograms
        Using boxplots
        Using violin plots
    Summary

2. Examining Bivariate and Multivariate Relationships between Features and Targets
    Technical requirements
    Identifying outliers and extreme values in bivariate relationships
    Using scatter plots to view bivariate relationships between continuous features
    Using grouped boxplots to view bivariate relationships between continuous and categorical features
    Using linear regression to identify data points with significant influence
    Using K-nearest neighbors to find outliers
    Using Isolation Forest to find outliers
    Summary

3. Identifying and Fixing Missing Values
    Technical requirements
    Identifying missing values
    Cleaning missing values
    Imputing values with regression
    Using KNN imputation
    Using random forest for imputation
    Summary

Section 2 – Preprocessing, Feature Selection, and Sampling

4. Encoding, Transforming, and Scaling Features
    Technical requirements
    Creating training datasets and avoiding data leakage
    Removing redundant or unhelpful features
    Encoding categorical features
        One-hot encoding
        Ordinal encoding
    Encoding categorical features with medium or high cardinality
        Feature hashing
    Using mathematical transformations
    Feature binning
        Equal-width and equal-frequency binning
        K-means binning
    Feature scaling
    Summary

5. Feature Selection
    Technical requirements
    Selecting features for classification models
        Mutual information classification for feature selection with a categorical target
        ANOVA F-value for feature selection with a categorical target
    Selecting features for regression models
        F-tests for feature selection with a continuous target
        Mutual information for feature selection with a continuous target
    Using forward and backward feature selection
        Using forward feature selection
        Using backward feature selection
        Using exhaustive feature selection
    Eliminating features recursively in a regression model
    Eliminating features recursively in a classification model
    Using Boruta for feature selection
    Using regularization and other embedded methods
        Using L1 regularization
        Using a random forest classifier
    Using principal component analysis
    Summary

6. Preparing for Model Evaluation
    Technical requirements
    Measuring accuracy, sensitivity, specificity, and precision for binary classification
    Examining CAP, ROC, and precision-sensitivity curves for binary classification
        Constructing CAP curves
        Plotting a receiver operating characteristic (ROC) curve
        Plotting precision-sensitivity curves
    Evaluating multiclass models
    Evaluating regression models
    Using K-fold cross-validation
    Preprocessing data with pipelines
    Summary

Section 3 – Modeling Continuous Targets with Supervised Learning

7. Linear Regression Models
    Technical requirements
    Key concepts
        Key assumptions of linear regression models
        Linear regression and ordinary least squares
        Linear regression and gradient descent
    Using classical linear regression
        Pre-processing the data for our regression model
        Running and evaluating our linear model
        Improving our model evaluation
    Using lasso regression
    Tuning hyperparameters with grid searches
    Using non-linear regression
    Regression with gradient descent
    Summary

8. Support Vector Regression
    Technical requirements
    Key concepts of SVR
        Nonlinear SVR and the kernel trick
    SVR with a linear model
    Using kernels for nonlinear SVR
    Summary

9. K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
    Technical requirements
    Key concepts for K-nearest neighbors regression
    K-nearest neighbors regression
    Key concepts for decision tree and random forest regression
    Using random forest regression
    Decision tree and random forest regression
        A decision tree example with interpretation
        Building and interpreting our actual model
        Random forest regression
    Using gradient boosted regression
    Summary

Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning

10. Logistic Regression
    Technical requirements
    Key concepts of logistic regression
    Logistic regression extensions
    Binary classification with logistic regression
    Evaluating a logistic regression model
    Regularization with logistic regression
    Multinomial logistic regression
    Summary

11. Decision Trees and Random Forest Classification
    Technical requirements
    Key concepts
        Using random forest for classification
        Using gradient-boosted decision trees
    Decision tree models
    Implementing random forest
    Implementing gradient boosting
    Summary

12. K-Nearest Neighbors for Classification
    Technical requirements
    Key concepts of KNN
    KNN for binary classification
    KNN for multiclass classification
    KNN for letter recognition
    Summary


