
Managing Datasets and Models

387 Pages·2023·9.376 MB·English

Preview Managing Datasets and Models

Managing Datasets and Models

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and its companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information, files, or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

Mercury Learning and Information (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to insure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files also available for downloading from the publisher by writing to [email protected].

Managing Datasets and Models

Oswald Campesato

MERCURY LEARNING AND INFORMATION
Dulles, Virginia
Boston, Massachusetts
New Delhi

Copyright ©2023 by Mercury Learning and Information LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai
Mercury Learning and Information
22841 Quicksilver Drive
Dulles, VA 20166
[email protected]
www.merclearning.com
1-800-232-0223

O. Campesato. Managing Datasets and Models.
ISBN: 9781683929529

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products.
All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2022952302
23 24 25 3 2 1
Printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are also available in digital format at numerous digital vendors. Companion files are available for download by writing to the publisher at [email protected]. The sole obligation of Mercury Learning and Information to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives.

Contents

Preface

Chapter 1: Working with Data
    Import Statements for this Chapter
    Exploratory Data Analysis (EDA)
    Dealing with Data: What Can Go Wrong?
    Analyzing Missing Data
    Explanation of Data Types
    Data Preprocessing
    Working with Data Types
    What is Drift?
    What is Data Leakage?
    Model Selection and Preparing Datasets
    Types of Dependencies Among Features
    Data Cleaning and Imputation
    Summary

Chapter 2: Outlier and Anomaly Detection
    Import Statements for this Chapter
    Working with Outliers
    Finding Outliers with NumPy
    Finding Outliers with Pandas
    Finding Outliers with Scikit-Learn (Optional)
    Fraud Detection
    Techniques for Anomaly Detection
    Working with Imbalanced Datasets
    Summary
    Reference

Chapter 3: Cleaning Datasets
    Prerequisites for this Chapter
    Analyzing Missing Data
    Pandas, CSV Files, and Missing Data
    Missing Data and Imputation
    Skewed Datasets
    CSV Files with Multi-Row Records
    Column Subset and Row Subrange of Titanic CSV File
    Data Normalization
    Handling Categorical Data
    Working with Currency
    Working with Dates
    Working with Quoted Fields
    What is SMOTE?
    Data Wrangling
    Summary

Chapter 4: Working with Models
    Import Statements for this Chapter
    Techniques for Scaling Data
    Examples of Splitting and Scaling Data
    The Confusion Matrix
    The ROC Curve and AUC Curve
    Exploring the Titanic Dataset
    Steps for Training Classifiers
    Diagram for Partitioned Datasets
    A KNN-Based Model with the wine.csv Dataset
    Other Models with the wine.csv Dataset
    A KNN-Based Model with the bmi.csv Dataset
    A KNN-Based Model with the Diabetes.csv Dataset
    SMOTE and the Titanic Dataset
    EDA and Data Visualization
    What about Regression and Clustering?
    Feature Importance
    What is Feature Engineering?
    What is Feature Selection?
    What is Feature Extraction?
    Data Cleaning and Machine Learning
    Summary

Chapter 5: Matplotlib and Seaborn
    Import Statements for this Chapter
    What is Data Visualization?
    What is Matplotlib?
    Matplotlib Styles
    Display Attribute Values
    Color Values in Matplotlib
    Cubed Numbers in Matplotlib
    Horizontal Lines in Matplotlib
    Slanted Lines in Matplotlib
    Parallel Slanted Lines in Matplotlib
    Lines and Labeled Vertices in Matplotlib
    A Dotted Grid in Matplotlib
    Lines in a Grid in Matplotlib
    Two Lines and a Legend in Matplotlib
    Loading Images in Matplotlib
    A Checkerboard in Matplotlib
    Randomized Data Points in Matplotlib
