
Data Science and Analytics with Python PDF

397 Pages·2017·20.8 MB·English

Preview Data Science and Analytics with Python

Data Science and Analytics with Python
Jesús Rogel-Salazar

CRC Press, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
Version Date: 20170517
International Standard Book Number-13: 978-1-498-74209-2 (Hardback)
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

1 Trials and Tribulations of a Data Scientist
  1.1 Data? Science? Data Science!
    1.1.1 So, What Is Data Science?
  1.2 The Data Scientist: A Modern Jackalope
    1.2.1 Characteristics of a Data Scientist and a Data Science Team
  1.3 Data Science Tools
    1.3.1 Open Source Tools
  1.4 From Data to Insight: the Data Science Workflow
    1.4.1 Identify the Question
    1.4.2 Acquire Data
    1.4.3 Data Munging
    1.4.4 Modelling and Evaluation
    1.4.5 Representation and Interaction
    1.4.6 Data Science: an Iterative Process
  1.5 Summary

2 Python: For Something Completely Different
  2.1 Why Python? Why not?!
    2.1.1 To Shell or not To Shell
    2.1.2 iPython/Jupyter Notebook
  2.2 Firsts Slithers with Python
    2.2.1 Basic Types
    2.2.2 Numbers
    2.2.3 Strings
    2.2.4 Complex Numbers
    2.2.5 Lists
    2.2.6 Tuples
    2.2.7 Dictionaries
  2.3 Control Flow
    2.3.1 if... elif... else
    2.3.2 while
    2.3.3 for
    2.3.4 try... except
    2.3.5 Functions
    2.3.6 Scripts and Modules
  2.4 Computation and Data Manipulation
    2.4.1 Matrix Manipulations and Linear Algebra
    2.4.2 NumPy Arrays and Matrices
    2.4.3 Indexing and Slicing
  2.5 Pandas to the Rescue
  2.6 Plotting and Visualising: Matplotlib
  2.7 Summary

3 The Machine that Goes “Ping”: Machine Learning and Pattern Recognition
  3.1 Recognising Patterns
  3.2 Artificial Intelligence and Machine Learning
  3.3 Data is Good, but other Things are also Needed
  3.4 Learning, Predicting and Classifying
  3.5 Machine Learning and Data Science
  3.6 Feature Selection
  3.7 Bias, Variance and Regularisation: A Balancing Act
  3.8 Some Useful Measures: Distance and Similarity
  3.9 Beware the Curse of Dimensionality
  3.10 Scikit-Learn is our Friend
  3.11 Training and Testing
  3.12 Cross-Validation
    3.12.1 k-fold Cross-Validation
  3.13 Summary

4 The Relationship Conundrum: Regression
  4.1 Relationships between Variables: Regression
  4.2 Multivariate Linear Regression
  4.3 Ordinary Least Squares
    4.3.1 The Maths Way
  4.4 Brain and Body: Regression with One Variable
    4.4.1 Regression with Scikit-learn
  4.5 Logarithmic Transformation
  4.6 Making the Task Easier: Standardisation and Scaling
    4.6.1 Normalisation or Unit Scaling
    4.6.2 z-Score Scaling
  4.7 Polynomial Regression
    4.7.1 Multivariate Regression
  4.8 Variance-Bias Trade-Off
  4.9 Shrinkage: LASSO and Ridge
  4.10 Summary

5 Jackalopes and Hares: Clustering
  5.1 Clustering
  5.2 Clustering with k-means
    5.2.1 Cluster Validation
    5.2.2 k-means in Action
  5.3 Summary

6 Unicorns and Horses: Classification
  6.1 Classification
    6.1.1 Confusion Matrices
    6.1.2 ROC and AUC
  6.2 Classification with KNN
    6.2.1 KNN in Action
  6.3 Classification with Logistic Regression
    6.3.1 Logistic Regression Interpretation
    6.3.2 Logistic Regression in Action
  6.4 Classification with Naïve Bayes
    6.4.1 Naïve Bayes Classifier
    6.4.2 Naïve Bayes in Action
  6.5 Summary

7 Decisions, Decisions: Hierarchical Clustering, Decision Trees and Ensemble Techniques
  7.1 Hierarchical Clustering
    7.1.1 Hierarchical Clustering in Action
  7.2 Decision Trees
    7.2.1 Decision Trees in Action
  7.3 Ensemble Techniques
    7.3.1 Bagging
    7.3.2 Boosting
    7.3.3 Random Forests
    7.3.4 Stacking and Blending
  7.4 Ensemble Techniques in Action
  7.5 Summary

8 Less is More: Dimensionality Reduction
  8.1 Dimensionality Reduction
  8.2 Principal Component Analysis
    8.2.1 PCA in Action
    8.2.2 PCA in the Iris Dataset
  8.3 Singular Value Decomposition
    8.3.1 SVD in Action
  8.4 Recommendation Systems
    8.4.1 Content-Based Filtering in Action
    8.4.2 Collaborative Filtering in Action
  8.5 Summary

9 Kernel Tricks up the Sleeve: Support Vector Machines
  9.1 Support Vector Machines and Kernel Methods
    9.1.1 Support Vector Machines
    9.1.2 The Kernel Trick
    9.1.3 SVM in Action: Regression
    9.1.4 SVM in Action: Classification
  9.2 Summary

Pipelines in Scikit-Learn
Bibliography
Index

Figures

1.1 A simplified diagram of the skills needed in data science and their relationship.
1.2 Jackalopes are mythical animals resembling a jackrabbit with antlers.
1.3 The various steps involved in the data science workflow.
2.1 A plot generated by matplotlib.
3.1 Measuring the distance between points A and B.
3.2 The curse of dimensionality. Ten data instances placed in spaces of increased dimensionality, from 1 dimension to 3. Sparsity increases with the number of dimensions.
3.3 Volume of a hypersphere as a function of the dimensionality N. As the number of dimensions increases, the volume of the hypersphere tends to zero.
3.4 A dataset is split into training and testing sets. The training set is used in the modelling phase and the testing set is held for validating the model.

Description:
Data Science and Analytics with Python is designed for practitioners in data science and data analytics in both academic and business environments. The aim is to present the reader with the main concepts used in data science using tools developed in Python, such as Scikit-learn, Pandas, NumPy, and o...
