
Learning Predictive Analytics with Python PDF

337 pages · 2016 · 4.375 MB · English

Preview Learning Predictive Analytics with Python

Learning Predictive Analytics with Python

Gain practical insights into predictive modelling by implementing Predictive Analytics algorithms on public datasets with Python

Ashish Kumar

BIRMINGHAM - MUMBAI

Copyright © 2016 Packt Publishing
First published: February 2016
Production reference: 1050216
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78398-326-1
www.packtpub.com

Foreword

Data science is changing the way we go about our daily lives at an unprecedented pace. The recommendations you see on e-commerce websites, the technologies that prevent credit card fraud, the logic behind airline itinerary and route selections, the products and discounts you see in retail stores, and many more decisions are largely powered by data science. Futuristic sounding applications like self-driving cars, robots to do household chores, smart wearable technologies, and so on are becoming a reality, thanks to innovations in data science.

Predictive analytics is a branch of data science, used to predict unknown future events based on historical data. It uses a number of techniques from data mining, statistical modelling and machine learning to help make forecasts with an acceptable level of reliability.

Python is a high-level, object-oriented programming language. It has gained popularity because of its clear syntax and readability, and beginners can pick up the language easily. It comes with a large library of modules that can be used to do a multitude of tasks ranging from data cleaning to building complex predictive modelling algorithms.

I'm a co-founder at Tiger Analytics, a firm specializing in providing data science and predictive analytics solutions to businesses. Over the last decade, I have worked with clients at numerous Fortune 100 companies and start-ups alike, and architected a variety of data science solution frameworks.

Ashish Kumar, the author of this book, is currently a budding data scientist at our company. He has worked on several predictive analytics engagements, and understands how businesses are using data to bring scientific decision making to their organizations. Being a young practitioner, Ashish relates to someone who wants to learn predictive analytics from scratch. This is clearly reflected in the way he presents several concepts in the book.

Whether you are a beginner in data science looking to build a career in this area, or a weekend enthusiast curious to explore predictive analytics in a hands-on manner, you will need to start from the basics and get a good handle on the building blocks. This book helps you take the first steps in this brave new world; it teaches you how to use and implement predictive modelling algorithms using Python. The book does not assume prior knowledge in analytics or programming. It differentiates itself from other such programming cookbooks as it uses publicly available datasets that closely represent data encountered in business scenarios, and walks you through the analysis steps in a clear manner.

There are nine chapters in the book. The first few chapters focus on data exploration and cleaning. They are written with beginners to programming in mind, first explaining different data structures and then going deeper into various methods of data processing and cleaning. Subsequent chapters cover popular predictive modelling algorithms such as linear regression, logistic regression, clustering, and decision trees. Each chapter broadly covers four aspects of the particular model: the math behind the model, different types of the model, implementing the model in Python, and interpreting the results. The statistics and math involved in each model are clearly explained. Understanding this helps one implement the model in any other programming language.

The book also teaches you how to interpret the results from the predictive model and suggests different techniques to fine-tune the model for better results. Wherever required, the author compares two different models and explains the benefits of each of the models. It will help a data scientist narrow down to the right algorithm that can be used to solve a specific problem. In addition, this book exposes the readers to various Python libraries and guides them with the best practices while handling different datasets in Python.

I am confident that this book will guide you to implement predictive modelling algorithms using Python and prepare you to work on challenging business problems involving data. I wish this book and its author Ashish Kumar every success.

Pradeep Gulipalli
Co-founder and Head of India Operations - Tiger Analytics

About the Author

Ashish Kumar has a B. Tech from IIT Madras and is a Young India Fellow from the batch of 2012-13. He is a data science enthusiast with extensive work experience in the field. As a part of his work experience, he has worked with tools such as Python, R, and SAS. He has also implemented predictive algorithms to glean actionable insights for clients from transport and logistics, online payment, and healthcare industries. Apart from the data sciences, he is enthused by and adept at financial modelling and operational research. He is a prolific writer and has authored several online articles and short stories apart from running his own analytics blog. He also works pro-bono for a couple of social enterprises and freelances his data science skills. He can be contacted on LinkedIn at https://goo.gl/yqrfo4, and on Twitter at https://twitter.com/asis64.

Table of Contents

Preface

Chapter 1: Getting Started with Predictive Modelling
  Introducing predictive modelling
  Scope of predictive modelling
  Ensemble of statistical algorithms
  Statistical tools
  Historical data
  Mathematical function
  Business context
  Knowledge matrix for predictive modelling
  Task matrix for predictive modelling
  Applications and examples of predictive modelling
  LinkedIn's "People also viewed" feature
  What it does?
  How is it done?
  Correct targeting of online ads
  How is it done?
  Santa Cruz predictive policing
  How is it done?
  Determining the activity of a smartphone user using accelerometer data
  How is it done?
  Sport and fantasy leagues
  How was it done?
  Python and its packages – download and installation
  Anaconda
  Standalone Python
  Installing a Python package
  Installing pip
  Installing Python packages with pip
  Python and its packages for predictive modelling
  IDEs for Python
  Summary

Chapter 2: Data Cleaning
  Reading the data – variations and examples
  Data frames
  Delimiters
  Various methods of importing data in Python
  Case 1 – reading a dataset using the read_csv method
  The read_csv method
  Use cases of the read_csv method
  Case 2 – reading a dataset using the open method of Python
  Reading a dataset line by line
  Changing the delimiter of a dataset
  Case 3 – reading data from a URL
  Case 4 – miscellaneous cases
  Reading from an .xls or .xlsx file
  Writing to a CSV or Excel file
  Basics – summary, dimensions, and structure
  Handling missing values
  Checking for missing values
  What constitutes missing data?
  How missing values are generated and propagated
  Treating missing values
  Deletion
  Imputation
  Creating dummy variables
  Visualizing a dataset by basic plotting
  Scatter plots
  Histograms
  Boxplots
  Summary

Chapter 3: Data Wrangling
  Subsetting a dataset
  Selecting columns
  Selecting rows
  Selecting a combination of rows and columns
  Creating new columns
  Generating random numbers and their usage
  Various methods for generating random numbers
  Seeding a random number
  Generating random numbers following probability distributions
  Probability density function
  Cumulative density function
  Uniform distribution
  Normal distribution
  Using the Monte-Carlo simulation to find the value of pi
  Geometry and mathematics behind the calculation of pi
  Generating a dummy data frame
  Grouping the data – aggregation, filtering, and transformation
  Aggregation
  Filtering
  Transformation
  Miscellaneous operations
  Random sampling – splitting a dataset in training and testing datasets
  Method 1 – using the Customer Churn Model
  Method 2 – using sklearn
  Method 3 – using the shuffle function
  Concatenating and appending data
  Merging/joining datasets
  Inner Join
  Left Join
  Right Join
  An example of the Inner Join
  An example of the Left Join
  An example of the Right Join
  Summary of Joins in terms of their length
  Summary

Chapter 4: Statistical Concepts for Predictive Modelling
  Random sampling and the central limit theorem
  Hypothesis testing
  Null versus alternate hypothesis
  Z-statistic and t-statistic
  Confidence intervals, significance levels, and p-values
  Different kinds of hypothesis test
  A step-by-step guide to do a hypothesis test
  An example of a hypothesis test
  Chi-square tests
  Correlation
  Summary

Chapter 5: Linear Regression with Python
  Understanding the maths behind linear regression
  Linear regression using simulated data
  Fitting a linear regression model and checking its efficacy
  Finding the optimum value of variable coefficients
  Making sense of result parameters
  p-values
  F-statistics
  Residual Standard Error
  Implementing linear regression with Python
  Linear regression using the statsmodel library
  Multiple linear regression
  Multi-collinearity
  Variance Inflation Factor
  Model validation
  Training and testing data split
  Summary of models
  Linear regression with scikit-learn
  Feature selection with scikit-learn
  Handling other issues in linear regression
  Handling categorical variables
  Transforming a variable to fit non-linear relations
  Handling outliers
  Other considerations and assumptions for linear regression
  Summary

Chapter 6: Logistic Regression with Python
  Linear regression versus logistic regression
  Understanding the math behind logistic regression
  Contingency tables
  Conditional probability
  Odds ratio
  Moving on to logistic regression from linear regression
  Estimation using the Maximum Likelihood Method
  Building the logistic regression model from scratch
  Making sense of logistic regression parameters
  Wald test
  Likelihood Ratio Test statistic
  Chi-square test
  Implementing logistic regression with Python
  Processing the data
  Data exploration
  Data visualization
  Creating dummy variables for categorical variables
  Feature selection
  Implementing the model
  Model validation and evaluation
  Cross validation
  Model validation
  The ROC curve
  Confusion matrix
  Summary

Chapter 7: Clustering with Python
  Introduction to clustering – what, why, and how?
  What is clustering?
  How is clustering used?
  Why do we do clustering?
  Mathematics behind clustering
  Distances between two observations
  Euclidean distance
  Manhattan distance
  Minkowski distance
  The distance matrix
  Normalizing the distances
  Linkage methods
  Single linkage
  Complete linkage
  Average linkage
  Centroid linkage
  Ward's method
  Hierarchical clustering
  K-means clustering
  Implementing clustering using Python
  Importing and exploring the dataset
  Normalizing the values in the dataset
  Hierarchical clustering using scikit-learn
  K-Means clustering using scikit-learn
  Interpreting the cluster
  Fine-tuning the clustering
  The elbow method
  Silhouette Coefficient
  Summary
