ebook img

Practical Web Scraping for Data Science: Best Practices and Examples with Python PDF

313 Pages·2018·4.79 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Practical Web Scraping for Data Science: Best Practices and Examples with Python

Practical Web Scraping for Data Science Best Practices and Examples with Python — Seppe vanden Broucke Bart Baesens Practical Web Scraping for Data Science Best Practices and Examples with Python Seppe vanden Broucke Bart Baesens Practical Web Scraping for Data Science Seppe vanden Broucke Bart Baesens Leuven, Belgium Leuven, Belgium ISBN-13 (pbk): 978-1-4842-3581-2 ISBN-13 (electronic): 978-1-4842-3582-9 https://doi.org/10.1007/978-1-4842-3582-9 Library of Congress Control Number: 2018940455 Copyright © 2018 by Seppe vanden Broucke and Bart Baesens This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director, Apress Media LLC: Welmoed Spahr Acquisitions Editor: Todd Green Development Editor: James Markham Coordinating Editor: Jill Balzano Cover designed by eStudioCalamar Cover image designed by Freepik (www.freepik.com) Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer- sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please e-mail [email protected], or visit http://www.apress.com/ rights-permissions. Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales. Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book’s product page, located at www.apress.com/9781484235812. For more detailed information, please visit http://www.apress.com/source-code. Printed on acid-free paper Dedicated to our partners, kids and parents. Table of Contents About the Authors ����������������������������������������������������������������������������������������������������ix About the Technical Reviewer ���������������������������������������������������������������������������������xi Introduction �����������������������������������������������������������������������������������������������������������xiii Part I: Web Scraping Basics ���������������������������������������������������������������������������1 Chapter 1: Introduction ���������������������������������������������������������������������������������������������3 1.1 W hat Is Web Scraping? ..........................................................................................................3 1.1.1 Why Web Scraping for Data Science?..........................................................................4 1.1.2 Who Is Using Web Scraping? .......................................................................................5 1.2 Getting Ready .........................................................................................................................8 1.2.1 Setting Up ....................................................................................................................8 1.2.2 A Quick Python Primer .................................................................................................9 Chapter 2: The Web Speaks HTTP ���������������������������������������������������������������������������25 2.1 T he Magic of Networking .....................................................................................................25 2.2 The HyperText Transfer Protocol: HTTP ................................................................................28 2.3 HTTP in Python: The Requests Library .................................................................................34 2.4 Query Strings: URLs with Parameters ..................................................................................39 Chapter 3: Stirring the HTML and CSS Soup �����������������������������������������������������������49 3.1 Hypertext Markup Language: HTML .....................................................................................49 3.2 Using Your Browser as a Development Tool .........................................................................51 3.3 Cascading Style Sheets: CSS ...............................................................................................56 3.4 The Beautiful Soup Library ...................................................................................................61 3.5 More on Beautiful Soup .......................................................................................................72 v Table of ConTenTs Part II: Advanced Web Scraping �������������������������������������������������������������������79 Chapter 4: Delving Deeper in HTTP �������������������������������������������������������������������������81 4.1 W orking with Forms and POST Requests .............................................................................81 4.2 O ther HTTP Request Methods ..............................................................................................97 4.3 M ore on Headers ................................................................................................................100 4.4 D ealing with Cookies .........................................................................................................108 4.5 U sing Sessions with Requests ...........................................................................................119 4.6 B inary, JSON, and Other Forms of Content ........................................................................121 Chapter 5: Dealing with JavaScript ����������������������������������������������������������������������127 5.1 W hat Is JavaScript? ...........................................................................................................127 5.2 S craping JavaScript ...........................................................................................................128 5.3 S craping with Selenium .....................................................................................................134 5.4 M ore on Selenium ..............................................................................................................148 Chapter 6: From Web Scraping to Web Crawling ��������������������������������������������������155 6.1 W hat Is Web Crawling? ......................................................................................................155 6.2 W eb Crawling in Python .....................................................................................................158 6.3 S toring Results in a Database ............................................................................................161 Part III: Managerial Concerns and Best Practices ��������������������������������������173 Chapter 7: Managerial and Legal Concerns ����������������������������������������������������������175 7.1 T he Data Science Process..................................................................................................175 7.2 W here Does Web Scraping Fit In? ......................................................................................179 7.3 L egal Concerns ..................................................................................................................181 Chapter 8: Closing Topics �������������������������������������������������������������������������������������187 8.1 O ther Tools .........................................................................................................................187 8.1.1 Alternative Python Libraries ....................................................................................187 8.1.2 Scrapy .....................................................................................................................188 8.1.3 Caching ....................................................................................................................188 8.1.4 Proxy Servers ..........................................................................................................189 vi Table of ConTenTs 8.1.5 Scraping in Other Programming Languages ............................................................190 8.1.6 Command-Line Tools ...............................................................................................191 8.1.7 Graphical Scraping Tools .........................................................................................191 8.2 Best Practices and Tips .....................................................................................................193 Chapter 9: Examples ���������������������������������������������������������������������������������������������197 9.1 S craping Hacker News.......................................................................................................199 9.2 Using the Hacker News API ................................................................................................201 9.3 Quotes to Scrape ...............................................................................................................202 9.4 Books to Scrape .................................................................................................................206 9.5 Scraping GitHub Stars ........................................................................................................209 9.6 Scraping Mortgage Rates ..................................................................................................214 9.7 Scraping and Visualizing IMDB Ratings .............................................................................220 9.8 Scraping IATA Airline Information .......................................................................................222 9.9 Scraping and Analyzing Web Forum Interactions ...............................................................228 9.10 Collecting and Clustering a Fashion Data Set ..................................................................237 9.11 Sentiment Analysis of Scraped Amazon Reviews ............................................................241 9.12 Scraping and Analyzing News Articles .............................................................................252 9.13 Scraping and Analyzing a Wikipedia Graph ......................................................................271 9.14 Scraping and Visualizing a Board Members Graph ..........................................................278 9.15 Breaking CAPTCHA’s Using Deep Learning ......................................................................281 Index ���������������������������������������������������������������������������������������������������������������������299 vii About the Authors Seppe vanden Broucke is an Assistant Professor of Data and Process Science at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management, and process mining. His work has been published in well-known international journals and presented at top conferences. Seppe’s teaching includes Advanced Analytics, Big Data, and Information Management courses. He also frequently teaches for industry and business audiences. Besides work, Seppe enjoys traveling, reading (Murakami to Bukowski to Asimov), listening to music (Booka Shade to Miles Davis to Claude Debussy), watching movies and series (less so these days due to a lack of time), gaming, and keeping up with the news. Bart Baesens is a Professor of Big Data and Analytics at KU Leuven, Belgium, and a lecturer at the University of Southampton, United Kingdom. He has done extensive research on big data and analytics, credit risk modeling, fraud detection, and marketing analytics. Bart has written more than 200 scientific papers and several books. Besides enjoying time with his family, he is also a diehard Club Brugge soccer fan. Bart is a foodie and amateur cook. He loves drinking a good glass of wine (his favorites are white Viognier or red Cabernet Sauvignon) either in his wine cellar or when overlooking the authentic red English phone booth in his garden. Bart loves traveling and is fascinated by World War I and reads many books on the topic. ix About the Technical Reviewer Mark Furman, MBA, is a Systems Engineer, Author, Teacher, and Entrepreneur. For the last 16 years he has worked in the Information Technology field with a focus on Linux- based systems and programming in Python, working for a range of companies including Host Gator, Interland, Suntrust Bank, AT&T, and Winn-Dixie. Currently he has been focusing his career on the maker movement and has launched Tech Forge (techforge. org), which will focus on helping people start a makerspace and help sustain current spaces. He holds a Master of Business Administration from Ohio University. You can follow him on Twitter @mfurman. xi Introduction Congratulations! By picking up this book, you’ve set the first steps into the exciting world of web scraping. First of all, we want to thank you, the reader, for choosing this guide to accompany you on this journey. Goals For those who are not familiar with programming or the deeper workings of the web, web scraping often looks like a black art: the ability to write a program that sets off on its own to explore the Internet and collect data is seen as a magical, exciting, perhaps even scary power to possess. Indeed, there are not many programming tasks that are able to fascinate both experienced and novice programmers in quite such a way as web scraping. Seeing a program working for the first time as it reaches out on the web and starts gathering data never fails to provide a certain rush, feeling like you’ve circumvented the “normal way” of working and just cracked some sort of enigma. It is perhaps because of this reason that web scraping is also making a lot of headlines these days. In this book, we set out to provide a concise and modern guide to web scraping, using Python as our programming language. We know that there are a lot of other books and online tutorials out there, but we felt that there was room for another entry. In particular, we wanted to provide a guide that is “short and sweet,” without falling into the typical “learn this in X hours” trap where important details or best practices are glossed over just for the sake of speed. In addition, you’ll note that we have titled this book as “Practical Web Scraping for Data Science.” We’re data scientists ourselves, and have very often found web scraping to be a powerful tool to have in your arsenal for the purpose of data gathering. Many data science projects start with the first step of obtaining an appropriate data set. In some cases (the “ideal situation,” if you will), a data set is readily provided by a business partner, your company’s data warehouse, or your academic supervisor, or can xiii

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.