Data Cleaning and Exploration with Machine Learning
Get to grips with machine learning techniques to achieve sparkling-clean data quickly
Michael Walker
BIRMINGHAM—MUMBAI
Data Cleaning and Exploration with Machine Learning
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editor: David Sugarman
Content Development Editor: Manikandan Kurup
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Alishon Mendonca
Marketing Coordinators: Shifa Ansari and Abeer Riyaz Dawe
First published: August 2022
Production reference: 1290722
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-167-8
www.packt.com
Contributors
About the author
Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island.
About the reviewers
Kalyana Bedhu is an engineering leader for data science at Microsoft. He has over 20 years of industry experience in data analytics at companies such as Ericsson, Sony, Bosch, Fidelity, and Oracle. Kalyana was an early practitioner of data science at Ericsson, where he set up a data science lab and built competence in solving practical data science problems. He played a pivotal role in transforming a central IT organization, which handled most of the enterprise's business intelligence, data, and analytical systems, into an AI and data science engine. Kalyana is a patent recipient and speaker, and has authored award-winning papers and data science courses.
Thanks to Packt and the author for the opportunity to review this book.
Divya Sardana is a lead AI/ML engineer at Nike. Previously, she was a senior data scientist at Teradata Corp. She holds a Ph.D. in computer science from the University of Cincinnati, OH. She has experience working on end-to-end machine learning and deep learning problems involving techniques such as regression and classification, as well as moving models to production and ensuring their scalability. Her interests include solving complex big data and machine learning/deep learning problems in real-world domains. She is actively involved in the peer review of journals and books in the area of machine learning, and has served as a session chair at machine learning conferences such as ICMLA 2021 and BDA 2021.
Table of Contents
Preface
Section 1 – Data Cleaning and Machine Learning Algorithms
1
Examining the Distribution of Features and Targets
Technical requirements 4
Subsetting data 4
Generating frequencies for categorical features 12
Generating summary statistics for continuous and discrete features 20
Identifying extreme values and outliers in univariate analysis 26
Using histograms, boxplots, and violin plots to examine the distribution of features 35
  Using histograms 35
  Using boxplots 39
  Using violin plots 41
Summary 43
2
Examining Bivariate and Multivariate Relationships between Features and Targets
Technical requirements 46
Identifying outliers and extreme values in bivariate relationships 46
Using scatter plots to view bivariate relationships between continuous features 55
Using grouped boxplots to view bivariate relationships between continuous and categorical features 63
Using linear regression to identify data points with significant influence 67
Using K-nearest neighbors to find outliers 72
Using Isolation Forest to find outliers 76
Summary 80
3
Identifying and Fixing Missing Values
Technical requirements 82
Identifying missing values 82
Cleaning missing values 88
Imputing values with regression 95
Using KNN imputation 102
Using random forest for imputation 105
Summary 107
Section 2 – Preprocessing, Feature Selection, and Sampling
4
Encoding, Transforming, and Scaling Features
Technical requirements 112
Creating training datasets and avoiding data leakage 112
Removing redundant or unhelpful features 116
Encoding categorical features 121
  One-hot encoding 121
  Ordinal encoding 124
  Encoding categorical features with medium or high cardinality 127
  Feature hashing 129
Using mathematical transformations 130
Feature binning 134
  Equal-width and equal-frequency binning 135
  K-means binning 138
Feature scaling 139
Summary 142
5
Feature Selection
Technical requirements 146
Selecting features for classification models 147
  Mutual information classification for feature selection with a categorical target 147
  ANOVA F-value for feature selection with a categorical target 150
Selecting features for regression models 151
  F-tests for feature selection with a continuous target 152
  Mutual information for feature selection with a continuous target 154
Using forward and backward feature selection 156
  Using forward feature selection 156
  Using backward feature selection 158
Using exhaustive feature selection 159
Eliminating features recursively in a regression model 164
Eliminating features recursively in a classification model 168
Using Boruta for feature selection 170
Using regularization and other embedded methods 173
  Using L1 regularization 174
  Using a random forest classifier 176
Using principal component analysis 177
Summary 181
6
Preparing for Model Evaluation
Technical requirements 184
Measuring accuracy, sensitivity, specificity, and precision for binary classification 184
Examining CAP, ROC, and precision-sensitivity curves for binary classification 192
  Constructing CAP curves 192
  Plotting a receiver operating characteristic (ROC) curve 199
  Plotting precision-sensitivity curves 205
Evaluating multiclass models 208
Evaluating regression models 211
Using K-fold cross-validation 219
Preprocessing data with pipelines 221
Summary 226
Section 3 – Modeling Continuous Targets with Supervised Learning
7
Linear Regression Models
Technical requirements 230
Key concepts 230
  Key assumptions of linear regression models 230
  Linear regression and ordinary least squares 232
  Linear regression and gradient descent 233
Using classical linear regression 234
  Pre-processing the data for our regression model 234
  Running and evaluating our linear model 240
  Improving our model evaluation 244
Using lasso regression 245
Tuning hyperparameters with grid searches 251
Using non-linear regression 252
Regression with gradient descent 261
Summary 264
8
Support Vector Regression
Technical requirements 266
Key concepts of SVR 266
  Nonlinear SVR and the kernel trick 270
SVR with a linear model 272
Using kernels for nonlinear SVR 282
Summary 288
9
K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Technical requirements 290
Key concepts for K-nearest neighbors regression 290
K-nearest neighbors regression 293
Key concepts for decision tree and random forest regression 303
Using random forest regression 305
Decision tree and random forest regression 306
  A decision tree example with interpretation 307
  Building and interpreting our actual model 309
  Random forest regression 311
Using gradient boosted regression 314
Summary 323
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
10
Logistic Regression
Technical requirements 328
Key concepts of logistic regression 328
  Logistic regression extensions 330
Binary classification with logistic regression 331
  Evaluating a logistic regression model 344
Regularization with logistic regression 355
Multinomial logistic regression 361
Summary 368
11
Decision Trees and Random Forest Classification
Technical requirements 370
Key concepts 370
  Using random forest for classification 373
  Using gradient-boosted decision trees 374
Decision tree models 375
Implementing random forest 387
Implementing gradient boosting 391
Summary 394
12
K-Nearest Neighbors for Classification
Technical requirements 396
Key concepts of KNN 396
KNN for binary classification 399
KNN for multiclass classification 404
KNN for letter recognition 413
Summary 416