Advanced Data Science and Analytics with Python

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Series Editor: Vipin Kumar

Text Mining and Visualization: Case Studies Using Open-Source Tools
Markus Hofmann and Andrew Chisholm

Graph-Based Social Media Analysis
Ioannis Pitas

Data Mining: A Tutorial-Based Primer, Second Edition
Richard J. Roiger

Data Mining with R: Learning with Case Studies, Second Edition
Luís Torgo

Social Networks with Rich Edge Semantics
Quan Zheng and David Skillicorn

Large-Scale Machine Learning in the Earth Sciences
Ashok N. Srivastava, Ramakrishna Nemani, and Karsten Steinhaeuser

Data Science and Analytics with Python
Jesús Rogel-Salazar

Feature Engineering for Machine Learning and Data Analytics
Guozhu Dong and Huan Liu

Exploratory Data Analysis Using R
Ronald K. Pearson

Human Capital Systems, Analytics, and Data Mining
Robert C. Hughes

Industrial Applications of Machine Learning
Pedro Larrañaga et al.

Automated Data Analysis Using Excel, Second Edition
Brian D. Bissett

Advanced Data Science and Analytics with Python
Jesús Rogel-Salazar

For more information about this series please visit:
https://www.crcpress.com/Chapman--HallCRC-Data-Mining-and-Knowledge-Discovery-Series/book-series/CHDAMINODIS

First edition published 2020
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2020 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

ISBN: 978-0-429-44661-0 (hbk)
ISBN: 978-1-138-31506-8 (pbk)
ISBN: 978-0-429-44664-1 (ebk)

To A. J. Johnson
Then. Now. Always.

Contents

1 No Time to Lose: Time Series Analysis
  1.1 Time Series
  1.2 One at a Time: Some Examples
  1.3 Bearing with Time: Pandas Series
    1.3.1 Pandas Time Series in Action
    1.3.2 Time Series Data Manipulation
  1.4 Modelling Time Series Data
    1.4.1 Regression... (Not) a Good Idea?
    1.4.2 Moving Averages and Exponential Smoothing
    1.4.3 Stationarity and Seasonality
    1.4.4 Determining Stationarity
    1.4.5 Autoregression to the Rescue
  1.5 Autoregressive Models
  1.6 Summary

2 Speaking Naturally: Text and Natural Language Processing
  2.1 Pages and Pages: Accessing Data from the Web
    2.1.1 Beautiful Soup in Action
  2.2 Make Mine a Regular: Regular Expressions
    2.2.1 Regular Expression Patterns
  2.3 Processing Text with Unicode
  2.4 Tokenising Text
  2.5 Word Tagging
  2.6 What Are You Talking About?: Topic Modelling
    2.6.1 Latent Dirichlet Allocation
    2.6.2 LDA in Action
  2.7 Summary

3 Getting Social: Graph Theory and Social Network Analysis
  3.1 Socialising Among Friends and Foes
  3.2 Let’s Make a Connection: Graphs and Networks
    3.2.1 Taking the Measure: Degree, Centrality and More
    3.2.2 Connecting the Dots: Network Properties
  3.3 Social Networks with Python: NetworkX
    3.3.1 NetworkX: A Quick Intro
  3.4 Social Network Analysis in Action
    3.4.1 Karate Kids: Conflict and Fission in a Network
    3.4.2 In a Galaxy Far, Far Away: Central Characters in a Network
  3.5 Summary

4 Thinking Deeply: Neural Networks and Deep Learning
  4.1 A Trip Down Memory Lane
  4.2 No-Brainer: What Are Neural Networks?
    4.2.1 Neural Network Architecture: Layers and Nodes
    4.2.2 Firing Away: Neurons, Activate!
    4.2.3 Going Forwards and Backwards
  4.3 Neural Networks: From the Ground up
    4.3.1 Going Forwards
    4.3.2 Learning the Parameters
    4.3.3 Backpropagation and Gradient Descent
    4.3.4 Neural Network: A First Implementation
  4.4 Neural Networks and Deep Learning
    4.4.1 Convolutional Neural Networks
    4.4.2 Convolutional Neural Networks in Action
    4.4.3 Recurrent Neural Networks
    4.4.4 Long Short-Term Memory
    4.4.5 Long Short-Term Memory Networks in Action
  4.5 Summary