
Feature Engineering and Selection: A Practical Approach for Predictive Models (Chapman & Hall/CRC Data Science Series)

308 Pages·2019·46.111 MB·English


CHAPMAN & HALL/CRC DATA SCIENCE SERIES

Reflecting the interdisciplinary nature of the field, this book series brings together researchers, practitioners, and instructors from statistics, computer science, machine learning, and analytics. The series will publish cutting-edge research, industry applications, and textbooks in data science.

The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes titles in the areas of machine learning, pattern recognition, predictive analytics, business analytics, Big Data, visualization, programming, software, learning analytics, data wrangling, interactive graphics, and reproducible research.

Published Titles

Feature Engineering and Selection: A Practical Approach for Predictive Models
Max Kuhn and Kjell Johnson

Probability and Statistics for Data Science: Math + R + Data
Norman Matloff

Feature Engineering and Selection
A Practical Approach for Predictive Models

Max Kuhn
Kjell Johnson

Map tiles on front cover: © OpenStreetMap © CartoDB.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2020 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-138-07922-9 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

To:
Maryann Kuhn
Valerie, Truman, and Louie

Contents

Preface
Author Bios

1 Introduction
    1.1 A Simple Example
    1.2 Important Concepts
    1.3 A More Complex Example
    1.4 Feature Selection
    1.5 An Outline of the Book
    1.6 Computing

2 Illustrative Example: Predicting Risk of Ischemic Stroke
    2.1 Splitting
    2.2 Preprocessing
    2.3 Exploration
    2.4 Predictive Modeling across Sets
    2.5 Other Considerations
    2.6 Computing

3 A Review of the Predictive Modeling Process
    3.1 Illustrative Example: OkCupid Profile Data
    3.2 Measuring Performance
    3.3 Data Splitting
    3.4 Resampling
    3.5 Tuning Parameters and Overfitting
    3.6 Model Optimization and Tuning
    3.7 Comparing Models Using the Training Set
    3.8 Feature Engineering without Overfitting
    3.9 Summary
    3.10 Computing

4 Exploratory Visualizations
    4.1 Introduction to the Chicago Train Ridership Data
    4.2 Visualizations for Numeric Data: Exploring Train Ridership Data
    4.3 Visualizations for Categorical Data: Exploring the OkCupid Data
    4.4 Postmodeling Exploratory Visualizations
    4.5 Summary
    4.6 Computing

5 Encoding Categorical Predictors
    5.1 Creating Dummy Variables for Unordered Categories
    5.2 Encoding Predictors with Many Categories
    5.3 Approaches for Novel Categories
    5.4 Supervised Encoding Methods
    5.5 Encodings for Ordered Data
    5.6 Creating Features from Text Data
    5.7 Factors versus Dummy Variables in Tree-Based Models
    5.8 Summary
    5.9 Computing

6 Engineering Numeric Predictors
    6.1 1:1 Transformations
    6.2 1:Many Transformations
    6.3 Many:Many Transformations
    6.4 Summary
    6.5 Computing

7 Detecting Interaction Effects
    7.1 Guiding Principles in the Search for Interactions
    7.2 Practical Considerations
    7.3 The Brute-Force Approach to Identifying Predictive Interactions
    7.4 Approaches when Complete Enumeration Is Practically Impossible
    7.5 Other Potentially Useful Tools
    7.6 Summary
    7.7 Computing

8 Handling Missing Data
    8.1 Understanding the Nature and Severity of Missing Information
    8.2 Models that Are Resistant to Missing Values
    8.3 Deletion of Data
    8.4 Encoding Missingness
    8.5 Imputation Methods
    8.6 Special Cases
    8.7 Summary
    8.8 Computing

9 Working with Profile Data
    9.1 Illustrative Data: Pharmaceutical Manufacturing Monitoring
    9.2 What Are the Experimental Unit and the Unit of Prediction?
    9.3 Reducing Background
    9.4 Reducing Other Noise
    9.5 Exploiting Correlation
    9.6 Impacts of Data Processing on Modeling
    9.7 Summary
    9.8 Computing

10 Feature Selection Overview
    10.1 Goals of Feature Selection
    10.2 Classes of Feature Selection Methodologies
    10.3 Effect of Irrelevant Features
    10.4 Overfitting to Predictors and External Validation
    10.5 A Case Study
    10.6 Next Steps
    10.7 Computing

11 Greedy Search Methods
    11.1 Illustrative Data: Predicting Parkinson’s Disease
    11.2 Simple Filters
    11.3 Recursive Feature Elimination
    11.4 Stepwise Selection
    11.5 Summary
    11.6 Computing

12 Global Search Methods
    12.1 Naive Bayes Models
    12.2 Simulated Annealing
    12.3 Genetic Algorithms
    12.4 Test Set Results
    12.5 Summary
    12.6 Computing

Bibliography
Index
