
Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

396 Pages·2012·15.074 MB·English


2ND EDITION

SCRAPE, AUTOMATE, AND CONTROL THE INTERNET

WEBBOTS, SPIDERS, AND SCREEN SCRAPERS
A GUIDE TO DEVELOPING INTERNET AGENTS WITH PHP/CURL

MICHAEL SCHRENK

To download the scripts and code libraries used in the book, visit http://WebbotsSpidersScreenScrapers.com

There’s a wealth of data online, but sorting and gathering it by hand can be tedious and time consuming. Rather than click through page after endless page, why not let bots do the work for you?

Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:

• Send email or SMS notifications to alert you to new information quickly
• Search different data sources and combine the results on one page, making the data easier to interpret and analyze
• Automate purchases, auction bids, and other online activities to save time

Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.

This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you’ll see how webbots can save you precious time and give you much greater control over the data available on the Web.

ABOUT THE AUTHOR

Michael Schrenk has developed webbots for over 15 years, working just about everywhere from Silicon Valley to Moscow, for clients like the BBC, foreign governments, and many Fortune 500 companies. He’s a frequent Defcon speaker and lives in Las Vegas, Nevada.

TECHNICAL REVIEW BY DANIEL STENBERG, CREATOR OF CURL AND LIBCURL

THE FINEST IN GEEK ENTERTAINMENT™
www.nostarch.com

“I LIE FLAT.” This book uses RepKover, a durable binding that won’t snap shut.

$39.95 ($41.95 CDN)
SHELVE IN: COMPUTERS/PROGRAMMING

webbots2e.book Page i Thursday, February 16, 2012 11:59 AM

WEBBOTS, SPIDERS, AND SCREEN SCRAPERS
2ND EDITION
A Guide to Developing Internet Agents with PHP/CURL
by Michael Schrenk
San Francisco

WEBBOTS, SPIDERS, AND SCREEN SCRAPERS, 2ND EDITION. Copyright © 2012 by Michael Schrenk.

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.
16 15 14 13 12   1 2 3 4 5 6 7 8 9

ISBN-10: 1-59327-397-5
ISBN-13: 978-1-59327-397-2

Publisher: William Pollock
Production Editor: Serena Yang
Cover and Interior Design: Octopod Studios
Developmental Editor: Tyler Ortman
Technical Reviewer: Daniel Stenberg
Copyeditor: Paula L. Fleming
Compositor: Serena Yang
Proofreader: Alison Law

For information on book distributors or translations, please contact No Starch Press, Inc. directly:

No Starch Press, Inc.
38 Ringold Street, San Francisco, CA 94103
phone: 415.863.9900; fax: 415.863.9950; [email protected]; www.nostarch.com

The Library of Congress has catalogued the first edition as follows:

Schrenk, Michael.
Webbots, spiders, and screen scrapers : a guide to developing internet agents with PHP/CURL / Michael Schrenk.
p. cm.
Includes index.
ISBN-13: 978-1-59327-120-6
ISBN-10: 1-59327-120-4
1. Web search engines. 2. Internet programming. 3. Internet searching. 4. Intelligent agents (Computer software) I. Title.
TK5105.884.S37 2007
025.04--dc22
2006026680

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
In loving memory
Charlotte Schrenk
1897–1982

BRIEF CONTENTS

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

PART I: FUNDAMENTAL CONCEPTS AND TECHNIQUES
Chapter 1: What’s in It for You?
Chapter 2: Ideas for Webbot Projects
Chapter 3: Downloading Web Pages
Chapter 4: Basic Parsing Techniques
Chapter 5: Advanced Parsing with Regular Expressions
Chapter 6: Automating Form Submission
Chapter 7: Managing Large Amounts of Data

PART II: PROJECTS
Chapter 8: Price-Monitoring Webbots
Chapter 9: Image-Capturing Webbots
Chapter 10: Link-Verification Webbots
Chapter 11: Search-Ranking Webbots
Chapter 12: Aggregation Webbots
Chapter 13: FTP Webbots
Chapter 14: Webbots That Read Email
Chapter 15: Webbots That Send Email
Chapter 16: Converting a Website into a Function

PART III: ADVANCED TECHNICAL CONSIDERATIONS
Chapter 17: Spiders
Chapter 18: Procurement Webbots and Snipers
Chapter 19: Webbots and Cryptography
Chapter 20: Authentication
Chapter 21: Advanced Cookie Management
Chapter 22: Scheduling Webbots and Spiders
Chapter 23: Scraping Difficult Websites with Browser Macros
Chapter 24: Hacking iMacros
Chapter 25: Deployment and Scaling

PART IV: LARGER CONSIDERATIONS
Chapter 26: Designing Stealthy Webbots and Spiders
Chapter 27: Proxies
Chapter 28: Writing Fault-Tolerant Webbots
