Machine Learning for Decision Sciences with Case Studies in Python

S. Sumathi
Suresh V. Rajappa
L. Ashok Kumar
Surekha Paneerselvam

First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 S. Sumathi, Suresh V. Rajappa, L Ashok Kumar and Surekha Paneerselvam

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
ISBN: 978-1-032-19356-4 (hbk)
ISBN: 978-1-032-19357-1 (pbk)
ISBN: 978-1-003-25880-3 (ebk)

DOI: 10.1201/9781003258803

Typeset in Times by codeMantra

Contents

Preface
Acknowledgment
About the Authors
Introduction
Chapter 1 Introduction
  1.1 Introduction to Data Science
    1.1.1 Mathematics
    1.1.2 Statistics
  1.2 Describing Structural Patterns
    1.2.1 Uses of Structural Patterns
  1.3 Machine Learning and Statistics
  1.4 Relation between Artificial Intelligence, Machine Learning, Neural Networks, and Deep Learning
  1.5 Data Science Life Cycle
  1.6 Key Role of Data Scientist
    1.6.1 Difference between Data Scientist and Machine Learning Engineer
  1.7 Real-World Examples
  1.8 Use Cases
    1.8.1 Financial and Insurance Industries
      1.8.1.1 Fraud Mitigation
      1.8.1.2 Personalized Pricing
      1.8.1.3 AML – Anti-Money Laundering
    1.8.2 Utility Industries
      1.8.2.1 Smart Meter and Smart Grid
      1.8.2.2 Manage disaster and Outages
      1.8.2.3 Compliance
    1.8.3 Oil and Gas Industries
      1.8.3.1 Manage Exponential Growth
      1.8.3.2 3D Seismic Imaging and Kirchhoff
      1.8.3.3 Rapidly Process and Display Seismic Data
    1.8.4 E-Commerce and Hi-Tech Industries
      1.8.4.1 Association and Complementary Products
      1.8.4.2 Cross-Channel Analytics
      1.8.4.3 Event analytics
  Summary
  Review Questions
Chapter 2 Overview of Python for Machine Learning
  2.1 Introduction
    2.1.1 The Flow of Program Execution in Python
  2.2 Python for Machine Learning
    2.2.1 Why Is Python Good for ML?
  2.3 Setting up Python
    2.3.1 Python on Windows
    2.3.2 Python on Linux
      2.3.2.1 Ubuntu
  2.4 Python Basics
    2.4.1 Python Operators
      2.4.1.1 Arithmetic Operators
      2.4.1.2 Comparison Operators
      2.4.1.3 Assignment Operators
      2.4.1.4 Logical Operators
      2.4.1.5 Membership Operators
    2.4.2 Python Code Samples on Basic Operators
      2.4.2.1 Arithmetic Operators
      2.4.2.2 Comparison Operators
      2.4.2.3 Logical Operators
      2.4.2.4 Membership Operators
    2.4.3 Flow Control
      2.4.3.1 If & elif Statement
      2.4.3.2 Loop Statement
      2.4.3.3 Loop Control Statements
    2.4.4 Python Code Samples on Flow Control Statements
      2.4.4.1 Conditional Statements
      2.4.4.2 Python if...else Statement
      2.4.4.3 Python if…elif…else Statement
      2.4.4.4 The For Loop
      2.4.4.5 The range() Function
      2.4.4.6 For Loop with else
      2.4.4.7 While Loop
      2.4.4.8 While Loop with else
      2.4.4.9 Python Break and Continue
      2.4.4.10 Python Break Statement
      2.4.4.11 Python Continue Statement
    2.4.5 Review of Basic Data Structures and Implementation in Python
      2.4.5.1 Array Data Structure
      2.4.5.2 Implementation of Arrays in Python
      2.4.5.3 Linked List
      2.4.5.4 Implementation of Linked List in Python
      2.4.5.5 Stacks and Queues
      2.4.5.6 Queues
      2.4.5.7 Implementation of Queue in Python
      2.4.5.8 Searching
      2.4.5.9 Implementation of Searching in Python
      2.4.5.10 Sorting
      2.4.5.11 Implementation of Bubble Sort in Python
      2.4.5.12 Insertion Sort
      2.4.5.13 Implementation of Insertion Sort in Python
      2.4.5.14 Selection Sort
      2.4.5.15 Implementation of Selection Sort in Python
      2.4.5.16 Merge Sort
      2.4.5.17 Implementation of Merge Sort in Python
      2.4.5.18 Shell Sort
      2.4.5.19 Quicksort
      2.4.5.20 Data Structures in Python with Sample Codes
      2.4.5.21 Python Code Samples for Data Structures in Python
    2.4.6 Functions in Python
      2.4.6.1 Python Code Samples for Functions
      2.4.6.2 Returning Values from Functions
      2.4.6.3 Scope of Variables
      2.4.6.4 Function Arguments
    2.4.7 File Handling
    2.4.8 Exception Handling
    2.4.9 Debugging in Python
      2.4.9.1 Packages
  2.5 Numpy Basics
    2.5.1 Introduction to Numpy
      2.5.1.1 Array Creation
      2.5.1.2 Array Slicing
    2.5.2 Numerical Operations
    2.5.3 Python Code Samples for Numpy Package
      2.5.3.1 Array Creation
      2.5.3.2 Class and Attributes of ndarray—.ndim
      2.5.3.3 Class and Attributes of ndarray—.shape
      2.5.3.4 Class and Attributes of ndarray—ndarray.size, ndarray.Itemsize, ndarray.resize
      2.5.3.5 Class and Attributes of ndarray—.dtype
      2.5.3.6 Basic Operations
      2.5.3.7 Accessing Array Elements: Indexing
      2.5.3.8 Shape Manipulation
      2.5.3.9 Universal Functions (ufunc) in Numpy
      2.5.3.10 Broadcasting
      2.5.3.11 Args and Kwargs
  2.6 Matplotlib Basics
    2.6.1 Creating Graphs with Matplotlib
  2.7 Pandas Basics
    2.7.1 Getting Started with Pandas
    2.7.2 Data Frames
    2.7.3 Key Operations on Data Frames
      2.7.3.1 Data Frame from List
      2.7.3.2 Rows and Columns in Data Frame
  2.8 Computational Complexity
  2.9 Real-world Examples
    2.9.1 Implementation using Pandas
    2.9.2 Implementation using Numpy
    2.9.3 Implementation using Matplotlib
  Summary
  Review Questions
  Exercises for Practice
Chapter 3 Data Analytics Life Cycle for Machine Learning
  3.1 Introduction
  3.2 Data Analytics Life Cycle
    3.2.1 Phase 1 – Data Discovery
    3.2.2 Phase 2 – Data Preparation and Exploratory Data Analysis
      3.2.2.1 Exploratory Data Analysis
    3.2.3 Phase 3 – Model Planning
    3.2.4 Phase 4 – Model Building
    3.2.5 Phase 5 – Communicating Results
    3.2.6 Phase 6 – Optimize and Operationalize the Models
  Summary
  Review Questions
Chapter 4 Unsupervised Learning
  4.1 Introduction
  4.2 Unsupervised Learning
    4.2.1 Clustering
  4.3 Evaluation Metrics for Clustering
    4.3.1 Distance Measures
      4.3.1.1 Minkowski Metric
    4.3.2 Similarity Measures
  4.4 Clustering Algorithms
    4.4.1 Hierarchical and Partitional Clustering Approaches
    4.4.2 Agglomerative and Divisive Clustering Approaches
    4.4.3 Hard and Fuzzy Clustering Approaches
    4.4.4 Monothetic and Polythetic Clustering Approaches
    4.4.5 Deterministic and Probabilistic Clustering Approaches
  4.5 k-Means Clustering
    4.5.1 Geometric Intuition, Centroids
    4.5.2 The Algorithm
    4.5.3 Choosing k
    4.5.4 Space and Time Complexity
    4.5.5 Advantages and Disadvantages of k-Means Clustering
      4.5.5.1 Advantages
      4.5.5.2 Disadvantages
    4.5.6 k-Means Clustering in Practice Using Python
      4.5.6.1 Illustration of the k-Means Algorithm Using Python
    4.5.7 Fuzzy k-Means Clustering Algorithm
      4.5.7.1 The Algorithm
    4.5.8 Advantages and Disadvantages of Fuzzy k-Means Clustering
  4.6 Hierarchical Clustering
    4.6.1 Agglomerative Hierarchical Clustering
    4.6.2 Divisive Hierarchical Clustering
    4.6.3 Techniques to Merge Cluster
    4.6.4 Space and Time Complexity
    4.6.5 Limitations of Hierarchical Clustering
    4.6.6 Hierarchical Clustering in Practice Using Python
      4.6.6.1 DATA_SET
  4.7 Mixture of Gaussian Clustering
    4.7.1 Expectation Maximization
    4.7.2 Mixture of Gaussian Clustering in Practice Using Python
  4.8 Density-Based Clustering Algorithm
    4.8.1 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    4.8.2 Space and Time Complexity
    4.8.3 Advantages and Disadvantages of DBSCAN
      4.8.3.1 Advantages
      4.8.3.2 Disadvantages
    4.8.4 DBSCAN in Practice Using Python
  Summary
  Review Questions
Chapter 5 Supervised Learning: Regression
  5.1 Introduction
  5.2 Supervised Learning – Real-Life Scenario
  5.3 Types of Supervised Learning
    5.3.1 Supervised Learning – Classification
      5.3.1.1 Classification – Predictive Modeling
    5.3.2 Supervised Learning – Regression
      5.3.2.1 Regression Predictive Modeling
    5.3.3 Classification vs. Regression
    5.3.4 Conversion between Classification and Regression Problems
  5.4 Linear Regression
    5.4.1 Types of Linear Regression
      5.4.1.1 Simple Linear Regression
      5.4.1.2 Multiple Linear Regression
    5.4.2 Geometric Intuition
    5.4.3 Mathematical Formulation
    5.4.4 Solving Optimization Problem
      5.4.4.1 Maxima and Minima
      5.4.4.2 Gradient Descent
      5.4.4.3 LMS (Least Mean Square) Update Rule
      5.4.4.4 SGD Algorithm
    5.4.5 Real-World Applications
      5.4.5.1 Predictive Analysis
      5.4.5.2 Medical Outcome Prediction
      5.4.5.3 Wind Speed Prediction
      5.4.5.4 Environmental Effects Monitoring
    5.4.6 Linear Regression in Practice Using Python
      5.4.6.1 Simple Linear Regression Using Python
      5.4.6.2 Multiple Linear Regression Using Python
  Summary
  Review Questions
Chapter 6 Supervised Learning: Classification
  6.1 Introduction
  6.2 Use Cases of Classification
  6.3 Logistic Regression
    6.3.1 Geometric Intuition
    6.3.2 Variants of Logistic Regression
      6.3.2.1 Simple Logistic Regression
      6.3.2.2 Multiple Logistic Regression
      6.3.2.3 Binary Logistic Regression