Data Mining and Predictive Analytics for Business Decisions

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and its companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information, files, or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

Mercury Learning and Information (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to insure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files also available for downloading from the publisher by writing to [email protected].

Data Mining and Predictive Analytics for Business Decisions
A Case Study Approach

Andres Fortino, PhD
NYU School of Professional Studies

Mercury Learning and Information
Dulles, Virginia
Boston, Massachusetts
New Delhi

Copyright ©2023 by Mercury Learning and Information LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai
Mercury Learning and Information
22841 Quicksilver Drive
Dulles, VA 20166
[email protected]
www.merclearning.com
1-800-232-0223

A. Fortino. Data Mining and Predictive Analytics for Business Decisions.
ISBN: 978-1-68392675-7

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products.
All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2022950710
23 24 25 3 2 1
Printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free). All of our titles are available for sale in digital format at numerous digital vendors. Companion files for this title can also be downloaded by writing to [email protected]. The sole obligation of Mercury Learning and Information to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

To my wife, Kathleen

Contents

Preface
Acknowledgments

Chapter 1: Data Mining and Business
  Data Mining Algorithms and Activities
  Data is the New Oil
  Data-Driven Decision-Making
  Business Analytics and Business Intelligence
  Algorithmic Technologies Associated with Data Mining
  Data Mining and Data Warehousing
  Case Study 1.1: Business Applications of Data Mining
  Case A – Classification
  Case B – Regression
  Case C – Anomaly Detection
  Case D – Time Series
  Case E – Clustering
  Reference

Chapter 2: The Data Mining Process
  Data Mining as a Process
  Exploration
  Analysis
  Interpretation
  Exploitation
  Selecting a Data Mining Process
  The CRISP-DM Process Model
  Business Understanding
  Data Understanding
  Data Preparation
  Modeling
  Evaluation
  Deployment
  Selecting Data Analytics Languages
  The Choices for Languages
  References

Chapter 3: Framing Analytical Questions
  How Does CRISP-DM Define the Business and Data Understanding Step?
  The World of the Business Data Analyst
  How Does Data Analysis Relate to Business Decision-Making?
  How Do We Frame Analytical Questions?
  What Are the Characteristics of Well-framed Analytical Questions?
  Exercise 3.1 – Framed Questions About the Titanic Disaster
  Case Study 3.1 – The San Francisco Airport Survey
  Case Study 3.2 – Small Business Administration Loans
  References

Chapter 4: Data Preparation
  How Does CRISP-DM Define Data Preparation?
  Steps in Preparing the Data Set for Analysis
  Data Sources and Formats
  What is Data Shaping?
  The Flat-File Format
  Application of Tools for Data Acquisition and Preparation
  Exercise 4.1 – Shaping the Data File
  Exercise 4.2 – Cleaning the Data File
  Ensuring the Right Variables are Included
  Using SQL to Extract the Right Data Set from Data Warehouses
  Case Study 4.1: Cleaning and Shaping the SFO Survey Data Set
  Case Study 4.2: Shaping the SBA Loans Data Set
  Case Study 4.3: Additional SQL Queries
  Reference

Chapter 5: Descriptive Analysis
  Getting a Sense of the Data Set
  Describe the Data Set
  Explore the Data Set
  Verify the Quality of the Data Set
  Analysis Techniques to Describe the Variables
  Exercise 5.1 – Descriptive Statistics
  Distributions of Numeric Variables
  Correlation
  Exercise 5.2 – Descriptive Analysis of the Titanic Disaster Data
  Case Study 5.1: Describing the SFO Survey Data Set
  Solution Using R
  Solution Using Python
  Case Study 5.2: Describing the SBA Loans Data Set
  Solution Using R
  Solution Using Python
  Reference

Chapter 6: Modeling
  What is a Model?
  How Does CRISP-DM Define Modeling?
  Selecting the Modeling Technique
  Modeling Assumptions
  Generate Test Design
  Design of Model Testing
  Build the Model
  Parameter Setting
  Models
  Model Assessment
  Where Do Models Reside in a Computer?
  The Data Mining Engine
  The Model
  Data Sources and Outputs
  Traditional Data Sources
  Static Data Sources
  Real-Time Data Sources
  Analytic Outputs
  Model Building
  Step 1: Framing Questions
  Step 2: Selecting the Machine
  Step 3: Selecting Known Data
  Step 4: Training the Machine
  Step 5: Testing the Model
  Step 6: Deploying the Model
  Step 7: Collecting New Data
  Step 8: Updating the Model
  Step 9: Learning – Repeat Steps 7 and 8
  Step 10: Recommending Answers to the User
  Reference

Chapter 7: Predictive Analytics with Regression Models
  What is Supervised Learning?
  Regression to the Mean
  Linear Regression
  Simple Linear Regression
  The R-squared Coefficient
  The Use of the p-value of the Coefficients