Data Cleaning and Exploration with Machine Learning
Get to grips with machine learning techniques to achieve sparkling-clean data quickly
Michael Walker
BIRMINGHAM—MUMBAI
Data Cleaning and Exploration with Machine Learning
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editor: David Sugarman
Content Development Editor: Manikandan Kurup
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Alishon Mendonca
Marketing Coordinators: Shifa Ansari and Abeer Riyaz Dawe
First published: August 2022
Production reference: 1290722
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-167-8
www.packt.com
Contributors
About the author
Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island.
About the reviewers
Kalyana Bedhu is an engineering leader for data science at Microsoft. He has over 20 years of industry experience in data analytics at companies such as Ericsson, Sony, Bosch, Fidelity, and Oracle. Kalyana was an early practitioner of data science at Ericsson, where he set up a data science lab and built competence in solving practical data science problems. He played a pivotal role in transforming a central IT organization, which handled most of the enterprise's business intelligence, data, and analytical systems, into an AI and data science engine. Kalyana is a patent recipient and speaker, and has authored award-winning papers and data science courses.
Thanks to Packt and the author for the opportunity to review this book.
Divya Sardana is a lead AI/ML engineer at Nike. Previously, she was a senior data scientist at Teradata Corp. She holds a Ph.D. in computer science from the University of Cincinnati, OH. She has experience working on end-to-end machine learning and deep learning problems involving techniques such as regression and classification, as well as moving models to production and ensuring their scalability. Her interests include solving complex big data and machine learning/deep learning problems in real-world domains. She is actively involved in the peer review of journals and books in the area of machine learning, and has served as a session chair at machine learning conferences such as ICMLA 2021 and BDA 2021.
Table of Contents
Preface
Section 1 – Data Cleaning and Machine Learning Algorithms
1
Examining the Distribution of Features and Targets
Technical requirements 4
Subsetting data 4
Generating frequencies for categorical features 12
Generating summary statistics for continuous and discrete features 20
Identifying extreme values and outliers in univariate analysis 26
Using histograms, boxplots, and violin plots to examine the distribution of features 35
  Using histograms 35
  Using boxplots 39
  Using violin plots 41
Summary 43
2
Examining Bivariate and Multivariate Relationships between Features and Targets
Technical requirements 46
Identifying outliers and extreme values in bivariate relationships 46
Using scatter plots to view bivariate relationships between continuous features 55
Using grouped boxplots to view bivariate relationships between continuous and categorical features 63
Using linear regression to identify data points with significant influence 67
Using K-nearest neighbors to find outliers 72
Using Isolation Forest to find outliers 76
Summary 80
3
Identifying and Fixing Missing Values
Technical requirements 82
Identifying missing values 82
Cleaning missing values 88
Imputing values with regression 95
Using KNN imputation 102
Using random forest for imputation 105
Summary 107
Section 2 – Preprocessing, Feature Selection, and Sampling
4
Encoding, Transforming, and Scaling Features
Technical requirements 112
Creating training datasets and avoiding data leakage 112
Removing redundant or unhelpful features 116
Encoding categorical features 121
  One-hot encoding 121
  Ordinal encoding 124
  Encoding categorical features with medium or high cardinality 127
  Feature hashing 129
Using mathematical transformations 130
Feature binning 134
  Equal-width and equal-frequency binning 135
  K-means binning 138
Feature scaling 139
Summary 142
5
Feature Selection
Technical requirements 146
Selecting features for classification models 147
  Mutual information classification for feature selection with a categorical target 147
  ANOVA F-value for feature selection with a categorical target 150
Selecting features for regression models 151
  F-tests for feature selection with a continuous target 152
  Mutual information for feature selection with a continuous target 154
Using forward and backward feature selection 156
  Using forward feature selection 156
  Using backward feature selection 158
Using exhaustive feature selection 159
Eliminating features recursively in a regression model 164
Eliminating features recursively in a classification model 168
Using Boruta for feature selection 170
Using regularization and other embedded methods 173
  Using L1 regularization 174
  Using a random forest classifier 176
Using principal component analysis 177
Summary 181
6
Preparing for Model Evaluation
Technical requirements 184
Measuring accuracy, sensitivity, specificity, and precision for binary classification 184
Examining CAP, ROC, and precision-sensitivity curves for binary classification 192
  Constructing CAP curves 192
  Plotting a receiver operating characteristic (ROC) curve 199
  Plotting precision-sensitivity curves 205
Evaluating multiclass models 208
Evaluating regression models 211
Using K-fold cross-validation 219
Preprocessing data with pipelines 221
Summary 226
Section 3 – Modeling Continuous Targets with Supervised Learning
7
Linear Regression Models
Technical requirements 230
Key concepts 230
  Key assumptions of linear regression models 230
  Linear regression and ordinary least squares 232
  Linear regression and gradient descent 233
Using classical linear regression 234
  Pre-processing the data for our regression model 234
  Running and evaluating our linear model 240
  Improving our model evaluation 244
Using lasso regression 245
Tuning hyperparameters with grid searches 251
Using non-linear regression 252
Regression with gradient descent 261
Summary 264
8
Support Vector Regression
Technical requirements 266
Key concepts of SVR 266
  Nonlinear SVR and the kernel trick 270
SVR with a linear model 272
Using kernels for nonlinear SVR 282
Summary 288
9
K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Technical requirements 290
Key concepts for K-nearest neighbors regression 290
K-nearest neighbors regression 293
Key concepts for decision tree and random forest regression 303
Using random forest regression 305
Decision tree and random forest regression 306
  A decision tree example with interpretation 307
  Building and interpreting our actual model 309
  Random forest regression 311
Using gradient boosted regression 314
Summary 323
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
10
Logistic Regression
Technical requirements 328
Key concepts of logistic regression 328
  Logistic regression extensions 330
Binary classification with logistic regression 331
  Evaluating a logistic regression model 344
Regularization with logistic regression 355
Multinomial logistic regression 361
Summary 368
11
Decision Trees and Random Forest Classification
Technical requirements 370
Key concepts 370
  Using random forest for classification 373
  Using gradient-boosted decision trees 374
Decision tree models 375
Implementing random forest 387
Implementing gradient boosting 391
Summary 394
12
K-Nearest Neighbors for Classification
Technical requirements 396
Key concepts of KNN 396
KNN for binary classification 399
KNN for multiclass classification 404
KNN for letter recognition 413
Summary 416