WEBBOTS, SPIDERS, AND SCREEN SCRAPERS
A GUIDE TO DEVELOPING INTERNET AGENTS WITH PHP/CURL
MICHAEL SCHRENK

AUTOMATE ONLINE TASKS WITH PHP/CURL

Visit http://schrenk.com to test your webbots on sample target pages, and to download the scripts and code libraries used in this book.

The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. There’s no reason to let browsers limit your online experience, especially when you can easily automate online tasks to suit your individual needs.

Learn how to write webbots and spiders that do all this and more:

• Programmatically download entire websites
• Effectively parse data from web pages
• Manage cookies
• Decode encrypted files
• Automate form submissions
• Send and receive email
• Send SMS alerts to your cell phone
• Unlock password-protected websites
• Automatically bid in online auctions
• Exchange data with FTP and NNTP servers

Sample projects using standard code libraries reinforce these new skills. You’ll learn how to create your own webbots and spiders that track online prices, aggregate different data sources into a single web page, and archive the online data you just can’t live without.

You’ll learn inside information from an experienced webbot developer on how and when to write stealthy webbots that mimic human behavior, tips for developing fault-tolerant designs, and various methods for launching and scheduling webbots. You’ll also get advice on how to write webbots and spiders that respect website owner property rights, plus techniques for shielding websites from unwanted robots.

Some tasks are just too tedious (or too important!) to leave to humans. Once you’ve automated your online life, you’ll never let a browser limit the way you use the Internet again.

ABOUT THE AUTHOR

Michael Schrenk develops webbots and spiders for clients across North America. He has written for Computerworld and Web Techniques magazines and has taught college courses on web usability and Internet marketing. He’s also an occasional speaker at DEFCON.

THE FINEST IN GEEK ENTERTAINMENT™
www.nostarch.com
$39.95 ($49.95 CDN)
SHELVE IN: INTERNET
“I LAY FLAT.” This book uses RepKover, a durable binding that won’t snap shut.

WEBBOTS, SPIDERS, AND SCREEN SCRAPERS
A Guide to Developing Internet Agents with PHP/CURL
by Michael Schrenk
San Francisco

WEBBOTS, SPIDERS, AND SCREEN SCRAPERS. Copyright © 2007 by Michael Schrenk.

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

Printed on recycled paper in the United States of America

11 10 09 08 07    1 2 3 4 5 6 7 8 9

ISBN-10: 1-59327-120-4
ISBN-13: 978-1-59327-120-6

Publisher: William Pollock
Production Editor: Christina Samuell
Cover and Interior Design: Octopod Studios
Developmental Editors: Tyler Ortman and William Pollock
Technical Reviewer: Peter MacIntyre
Copyeditor: Megan Dunchak
Compositors: Megan Dunchak, Riley Hoffman, and Christina Samuell
Proofreader: Stephanie Provines
Indexer: Nancy Guenther

For information on book distributors or translations, please contact No Starch Press, Inc. directly:

No Starch Press, Inc.
555 De Haro Street, Suite 250, San Francisco, CA 94107
phone: 415.863.9900; fax: 415.863.9950; [email protected]; www.nostarch.com

Library of Congress Cataloging-in-Publication Data

Schrenk, Michael.
Webbots, spiders, and screen scrapers : a guide to developing internet agents with PHP/CURL / Michael Schrenk.
p. cm.
Includes index.
ISBN-13: 978-1-59327-120-6
ISBN-10: 1-59327-120-4
1. Web search engines. 2. Internet programming. 3. Internet searching. 4. Intelligent agents (Computer software) I. Title.
TK5105.884.S37 2007
025.04--dc22
2006026680

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.

In loving memory
Charlotte Schrenk
1897–1982

ACKNOWLEDGMENTS

I needed support and inspiration from family, friends, and colleagues to write this book. Unfortunately, I did not always acknowledge their contributions when they offered them. Here is a delayed thanks to all of those who helped me.

Thanks to Donna, my wife, who convinced me that I could actually do this, and to my kids, Ava and Gordon, who have always supported my crazy schemes, even though they know it means fewer coffees and chess matches together.
Andy King encouraged me to find a publisher for this project, and Daniel Stenberg, founder of the cURL project, helped me organize my thoughts when this book was barely an outline.

No Starch Press exhibited saint-like patience while I split my time between writing webbots and writing about webbots. Special thanks to Bill, who trusted the concept, Tyler, who edited most of the manuscript, and Christina, who kept me on task. Peter MacIntyre was instrumental in checking for technical errors, and Megan’s copyediting improved the book throughout.

Anamika Mishra assisted with the book’s website and consistently covered for me when I was busy writing or too tired to code. Laurie Curtis helped me explore what it might be like to finish a book.

Finally, a tip of the hat goes to Mark, Randy, Megan, Karen, Terri, Susan, Dennis, Dan, and Matt, who were thoughtful enough to ask about my book’s progress before inquiring about the status of their projects.

BRIEF CONTENTS

Introduction

PART I: FUNDAMENTAL CONCEPTS AND TECHNIQUES
Chapter 1: What’s in It for You?
Chapter 2: Ideas for Webbot Projects
Chapter 3: Downloading Web Pages
Chapter 4: Parsing Techniques
Chapter 5: Automating Form Submission
Chapter 6: Managing Large Amounts of Data

PART II: PROJECTS
Chapter 7: Price-Monitoring Webbots
Chapter 8: Image-Capturing Webbots
Chapter 9: Link-Verification Webbots
Chapter 10: Anonymous Browsing Webbots
Chapter 11: Search-Ranking Webbots
Chapter 12: Aggregation Webbots
Chapter 13: FTP Webbots
Chapter 14: NNTP News Webbots
Chapter 15: Webbots That Read Email
Chapter 16: Webbots That Send Email
Chapter 17: Converting a Website into a Function

PART III: ADVANCED TECHNICAL CONSIDERATIONS
Chapter 18: Spiders
Chapter 19: Procurement Webbots and Snipers
Chapter 20: Webbots and Cryptography
Chapter 21: Authentication
Chapter 22: Advanced Cookie Management
Chapter 23: Scheduling Webbots and Spiders

PART IV: LARGER CONSIDERATIONS
Chapter 24: Designing Stealthy Webbots and Spiders
Chapter 25: Writing Fault-Tolerant Webbots
Chapter 26: Designing Webbot-Friendly Websites
Chapter 27: Killing Spiders
Chapter 28: Keeping Webbots out of Trouble

Appendix A: PHP/CURL Reference
Appendix B: Status Codes
Appendix C: SMS Email Addresses
Index