Essential Statistics for Non-STEM Data Analysts

Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

Rongpeng Li

BIRMINGHAM - MUMBAI

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Devika Battike
Senior Editor: Roshan Kumar
Content Development Editor: Sean Lobo
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Roshan Kawale

First published: November 2020
Production reference: 1111120

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-83898-484-7
www.packt.com

Contributors

About the author

Rongpeng Li is a data science instructor and a senior data scientist at Galvanize, Inc. He has previously been a research programmer at Information Sciences Institute, working on knowledge graphs and artificial intelligence. He has also been the host and organizer of the Data Analysis Workshop Designed for Non-STEM Busy Professionals at LA.

Michael Hansen (https://www.linkedin.com/in/michael-n-hansen/), a friend of mine, provided invaluable English language editing suggestions for this book. Michael has great attention to detail, which made him a great language reviewer. Thank you, Michael!

About the reviewers

James Mott, PhD, is a senior education consultant with extensive experience in teaching statistical analysis, modeling, data mining, and predictive analytics. He has over 30 years of experience using SPSS products in his own research, including IBM SPSS Statistics, IBM SPSS Modeler, and IBM SPSS Amos. He has also been actively teaching about these products to IBM/SPSS customers for over 30 years.
In addition, he is an experienced historian with expertise in the research and teaching of 20th century United States political history and quantitative methods. His specialties are data mining, quantitative methods, statistical analysis, teaching, and consulting.

Yidan Pan obtained her PhD in system, synthetic, and physical biology from Rice University. Her research interest is profiling mutagenesis at genomic and transcriptional levels with molecular biology wet labs, bioinformatics, statistical analysis, and machine learning models. She believes that this book will give its readers a lot of practical skills for data analysis.
Table of Contents

Preface

Section 1: Getting Started with Statistics for Data Science

1 Fundamentals of Data Collection, Cleaning, and Preprocessing
  Technical requirements
  Collecting data from various data sources
  Reading data directly from files
  Obtaining data from an API
  Obtaining data from scratch
  Data imputation
  Preparing the dataset for imputation
  Imputation with mean or median values
  Imputation with the mode/most frequent value
  Outlier removal
  Data standardization – when and how
  Examples involving the scikit-learn preprocessing module
  Imputation
  Standardization
  Summary

2 Essential Statistics for Data Assessment
  Classifying numerical and categorical variables
  Distinguishing between numerical and categorical variables
  Understanding mean, median, and mode
  Mean
  Median
  Mode
  Learning about variance, standard deviation, quartiles, percentiles, and skewness
  Variance
  Standard deviation
  Quartiles
  Skewness
  Knowing how to handle categorical variables and mixed data types
  Frequencies and proportions
  Transforming a continuous variable to a categorical one
  Using bivariate and multivariate descriptive statistics
  Covariance
  Cross-tabulation
  Summary

3 Visualization with Statistical Graphs
  Basic examples with the Python Matplotlib package
  Elements of a statistical graph
  Exploring important types of plotting in Matplotlib
  Advanced visualization customization
  Customizing the geometry
  Customizing the aesthetics
  Query-oriented statistical plotting
  Example 1 – preparing data to fit the plotting function API
  Example 2 – combining analysis with plain plotting
  Presentation-ready plotting tips
  Use styling
  Font matters a lot
  Summary

Section 2: Essentials of Statistical Analysis

4 Sampling and Inferential Statistics
  Understanding fundamental concepts in sampling techniques
  Performing proper sampling under different scenarios
  The dangers associated with non-probability sampling
  Probability sampling – the safer approach
  Understanding statistics associated with sampling
  Sampling distribution of the sample mean
  Standard error of the sample mean
  The central limit theorem
  Summary

5 Common Probability Distributions
  Understanding important concepts in probability
  Events and sample space
  The probability mass function and the probability density function
  Subjective probability and empirical probability
  Understanding common discrete probability distributions
  Bernoulli distribution
  Binomial distribution
  Poisson distribution
  Understanding the common continuous probability distribution
  Uniform distribution
  Exponential distribution
  Normal distribution
  Learning about joint and conditional distribution
  Independency and conditional distribution
  Understanding the power law and black swan
  The ubiquitous power law
  Be aware of the black swan
  Summary

6 Parametric Estimation
  Understanding the concepts of parameter estimation and the features of estimators
  Evaluation of estimators
  Using the method of moments to estimate parameters
  Example 1 – the number of 911 phone calls in a day
  Example 2 – the bounds of uniform distribution
  Applying the maximum likelihood approach with Python
  Likelihood function
  MLE for uniform distribution boundaries
  MLE for modeling noise
  MLE and the Bayesian theorem
  Summary