Python Web Scraping Cookbook
Over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS
Michael Heydt
BIRMINGHAM - MUMBAI

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Veena Pagare
Acquisition Editor: Tushar Gupta
Content Development Editor: Tejas Limkar
Technical Editor: Danish Shaikh
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Shraddha Falebhai

First published: February 2018
Production reference: 1070218

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78728-521-7
www.packtpub.com

Contributors

About the author

Michael Heydt is an independent consultant specializing in social, mobile, analytics, and cloud technologies, with an emphasis on cloud native 12-factor applications.
Michael has been a software developer and trainer for over 30 years and is the author of books such as D3.js By Example, Learning Pandas, Mastering Pandas for Finance, and Instant Lucene.NET. You can find more information about him on LinkedIn at michaelheydt.

I would like to greatly thank my family for putting up with me disappearing for months on end and sacrificing my sparse free time to indulge in the creation of content and books like this one. They are my true inspiration and enablers.

About the reviewers

Mei Lu is the founder and CEO of Jobfully, providing career coaching for software developers and engineering leaders. She is also a Career/Executive Coach for the Carnegie Mellon University Alumni Association, specializing in the software and high-tech industry. Previously, Mei was a software engineer and an engineering manager at Qpass, M.I.T., and MicroStrategy. She received her MS in Computer Science from the University of Pennsylvania and her MS in Engineering from Carnegie Mellon University.

Lazar Telebak is a freelance web developer specializing in web scraping, crawling, and indexing web pages using Python libraries and frameworks. He has worked mostly on projects involving automation, website scraping, crawling, and exporting data to various formats (CSV, JSON, XML, and TXT) and to data stores such as MongoDB, SQLAlchemy, and Postgres. Lazar also has experience with front-end technologies and languages such as HTML, CSS, JavaScript, and jQuery.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Table of Contents

Preface

Chapter 1: Getting Started with Scraping
  Introduction
  Setting up a Python development environment
    Getting ready
    How to do it...
  Scraping Python.org with Requests and Beautiful Soup
    Getting ready...
    How to do it...
    How it works...
  Scraping Python.org in urllib3 and Beautiful Soup
    Getting ready...
    How to do it...
    How it works
    There's more...
  Scraping Python.org with Scrapy
    Getting ready...
    How to do it...
    How it works
  Scraping Python.org with Selenium and PhantomJS
    Getting ready
    How to do it...
    How it works
    There's more...

Chapter 2: Data Acquisition and Extraction
  Introduction
  How to parse websites and navigate the DOM using BeautifulSoup
    Getting ready
    How to do it...
    How it works
    There's more...
  Searching the DOM with Beautiful Soup's find methods
    Getting ready
    How to do it...
  Querying the DOM with XPath and lxml
    Getting ready
    How to do it...
    How it works
    There's more...
  Querying data with XPath and CSS selectors
    Getting ready
    How to do it...
    How it works
    There's more...
  Using Scrapy selectors
    Getting ready
    How to do it...
    How it works
    There's more...
  Loading data in unicode / UTF-8
    Getting ready
    How to do it...
    How it works
    There's more...

Chapter 3: Processing Data
  Introduction
  Working with CSV and JSON data
    Getting ready
    How to do it
    How it works
    There's more...
  Storing data using AWS S3
    Getting ready
    How to do it
    How it works
    There's more...
  Storing data using MySQL
    Getting ready
    How to do it
    How it works
    There's more...
  Storing data using PostgreSQL
    Getting ready
    How to do it
    How it works
    There's more...
  Storing data in Elasticsearch
    Getting ready
    How to do it
    How it works
    There's more...
  How to build robust ETL pipelines with AWS SQS
    Getting ready
    How to do it - posting messages to an AWS queue
    How it works
    How to do it - reading and processing messages
    How it works
    There's more...

Chapter 4: Working with Images, Audio, and other Assets
  Introduction
  Downloading media content from the web
    Getting ready
    How to do it
    How it works
    There's more...
  Parsing a URL with urllib to get the filename
    Getting ready
    How to do it
    How it works
    There's more...
  Determining the type of content for a URL
    Getting ready
    How to do it
    How it works
    There's more...
  Determining the file extension from a content type
    Getting ready
    How to do it
    How it works
    There's more...
  Downloading and saving images to the local file system
    How to do it
    How it works
    There's more...
  Downloading and saving images to S3
    Getting ready
    How to do it
    How it works
    There's more...
  Generating thumbnails for images
    Getting ready
    How to do it
    How it works
  Taking a screenshot of a website
    Getting ready
    How to do it
    How it works
  Taking a screenshot of a website with an external service
    Getting ready
    How to do it
    How it works
    There's more...
  Performing OCR on an image with pytesseract
    Getting ready
    How to do it
    How it works
    There's more...
  Creating a Video Thumbnail
    Getting ready
    How to do it
    How it works
    There's more...
  Ripping an MP4 video to an MP3
    Getting ready
    How to do it
    There's more...

Chapter 5: Scraping - Code of Conduct