ebook img

Python Web Scraping: Hands-on data scraping and crawling using PyQT, Selenium, HTML and Python PDF

220 Pages·2017·7.15 MB·english
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Python Web Scraping: Hands-on data scraping and crawling using PyQT, Selenium, HTML and Python

Python Web Scraping Second Edition Fetching data from the web Katharine Jarmul Richard Lawson BIRMINGHAM - MUMBAI Python Web Scraping Second Edition Copyright © 2017 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: October 2015 Second edition: May 2017 Production reference: 1240517 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78646-258-9 www.packtpub.com Credits Authors Copy Editor Katharine Jarmul Manisha Sinha Richard Lawson Reviewers Project Coordinator Dimitrios Kouzis-Loukas Nidhi Joshi Lazar Telebak Commissioning Editor Proofreader Veena Pagare Safis Editing Acquisition Editor Indexer Varsha Shetty Francy Puthiry Content Development Editor Production Coordinator Cheryl Dsa Shantanu Zagade Technical Editor Danish Shaikh About the Authors Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups who use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam) or on her blog: https://blog.kjamistan.com. Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he built a business specializing in web scraping while travelling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational in Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones. You can find him on LinkedIn at https://www.linkedin.com/in/richardpenman. About the Reviewers Dimitrios Kouzis-Loukas has over fifteen years of experience providing software systems to small and big organisations. His most recent projects are typically distributed systems with ultra-low latency and high-availability requirements. He is language agnostic, yet he has a slight preference for C++ and Python. A firm believer in open source, he hopes that his contributions will benefit individual communities as well as all of humanity. Lazar Telebak is a freelance web developer specializing in web scraping, crawling, and indexing web pages using Python libraries/frameworks. He has worked mostly on a projects that deal with automation and website scraping, crawling and exporting data to various formats including: CSV, JSON, XML, TXT and databases such as: MongoDB, SQLAlchemy, Postgres. Lazar also has experience of fronted technologies and languages: HTML, CSS, JavaScript, jQuery. www.PacktPub.com For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.comand as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. https://www.packtpub.com/mapt Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career. Why subscribe? Fully searchable across every book published by Packt Copy and paste, print, and bookmark content On demand and accessible via a web browser Customer Feedback Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/Python-Web-Scraping-Katharine-Jarmul/dp/1786462583. If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products! Table of Contents Preface 1 Chapter 1: Introduction to Web Scraping 7 When is web scraping useful? 7 Is web scraping legal? 8 Python 3 9 Background research 10 Checking robots.txt 10 Examining the Sitemap 11 Estimating the size of a website 11 Identifying the technology used by a website 13 Finding the owner of a website 16 Crawling your first website 17 Scraping versus crawling 17 Downloading a web page 18 Retrying downloads 19 Setting a user agent 20 Sitemap crawler 21 ID iteration crawler 22 Link crawlers 25 Advanced features 28 Parsing robots.txt 28 Supporting proxies 29 Throttling downloads 30 Avoiding spider traps 31 Final version 32 Using the requests library 33 Summary 34 Chapter 2: Scraping the Data 35 Analyzing a web page 36 Three approaches to scrape a web page 39 Regular expressions 39 Beautiful Soup 41 Lxml 44 CSS selectors and your Browser Console 45 XPath Selectors 48 LXML and Family Trees 51 Comparing performance 52 Scraping results 53 Overview of Scraping 55 Adding a scrape callback to the link crawler 56 Summary 59 Chapter 3: Caching Downloads 60 When to use caching? 60 Adding cache support to the link crawler 61 Disk Cache 63 Implementing DiskCache 65 Testing the cache 67 Saving disk space 68 Expiring stale data 69 Drawbacks of DiskCache 70 Key-value storage cache 71 What is key-value storage? 72 Installing Redis 72 Overview of Redis 73 Redis cache implementation 75 Compression 76 Testing the cache 77 Exploring requests-cache 78 Summary 80 Chapter 4: Concurrent Downloading 81 One million web pages 81 Parsing the Alexa list 82 Sequential crawler 83 Threaded crawler 85 How threads and processes work 85 Implementing a multithreaded crawler 86 Multiprocessing crawler 88 Performance 92 Python multiprocessing and the GIL 93 Summary 94 Chapter 5: Dynamic Content 95 An example dynamic web page 95 Reverse engineering a dynamic web page 98 [ ii ]

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.