Webbots, Spiders, and Screen Scrapers
by Michael Schrenk
Publisher: No Starch
Pub Date: March 15, 2007
Print ISBN-10: 1-59327-120-4
Print ISBN-13: 978-1-59327-120-6
Pages: 328
Overview
The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and
Screen Scrapers is for programmers and businesspeople who want to take full advantage of
the vast resources available on the Web. There's no reason to let browsers limit your online
experience, especially when you can easily automate online tasks to suit your individual
needs.
Learn how to write webbots and spiders that do all this and more:
● Programmatically download entire websites
● Effectively parse data from web pages
● Manage cookies
● Decode encrypted files
● Automate form submissions
● Send and receive email
● Send SMS alerts to your cell phone
● Unlock password-protected websites
● Automatically bid in online auctions
● Exchange data with FTP and NNTP servers
Sample projects using standard code libraries reinforce these new skills. You'll learn how to
create your own webbots and spiders that track online prices, aggregate different data
sources into a single web page, and archive the online data you just can't live without. You'll
learn inside information from an experienced webbot developer on how and when to write
stealthy webbots that mimic human behavior, tips for developing fault-tolerant designs, and
various methods for launching and scheduling webbots. You'll also get advice on how to write
webbots and spiders that respect website owner property rights, plus techniques for shielding
websites from unwanted robots.
As a bonus, visit the author's website to test your webbots on sample target pages, and to
download the scripts and code libraries used in the book.
Some tasks are just too tedious (or too important!) to leave to humans. Once you've
automated your online life, you'll never let a browser limit the way you use the Internet again.
Table of Contents
Dedication
ACKNOWLEDGMENTS
Introduction
FUNDAMENTAL CONCEPTS AND TECHNIQUES
WHAT'S IN IT FOR YOU?
Uncovering the Internet's True Potential
What's in It for Developers?
What's in It for Business Leaders?
Final Thoughts
IDEAS FOR WEBBOT PROJECTS
Inspiration from Browser Limitations
A Few Crazy Ideas to Get You Started
Final Thoughts
DOWNLOADING WEB PAGES
Think About Files, Not Web Pages
Downloading Files with PHP's Built-in Functions
Introducing PHP/CURL
Installing PHP/CURL
LIB_http
Final Thoughts
PARSING TECHNIQUES
Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Useful PHP Functions
Final Thoughts
AUTOMATING FORM SUBMISSION
Reverse Engineering Form Interfaces
Form Handlers, Data Fields, Methods, and Event Triggers
Unpredictable Forms
Analyzing a Form
Final Thoughts
MANAGING LARGE AMOUNTS OF DATA
Organizing Data
Making Data Smaller
Thumbnailing Images
Final Thoughts
PROJECTS
PRICE-MONITORING WEBBOTS
The Target
Designing the Parsing Script
Initialization and Downloading the Target
Further Exploration
IMAGE-CAPTURING WEBBOTS
Example Image-Capturing Webbot
Creating the Image-Capturing Webbot
Further Exploration
Final Thoughts
LINK-VERIFICATION WEBBOTS
Creating the Link-Verification Webbot
Running the Webbot
Further Exploration
ANONYMOUS BROWSING WEBBOTS
Anonymity with Proxies
The Anonymizer Project
Final Thoughts
SEARCH-RANKING WEBBOTS
Description of a Search Result Page
What the Search-Ranking Webbot Does
Running the Search-Ranking Webbot
How the Search-Ranking Webbot Works
The Search-Ranking Webbot Script
Final Thoughts
Further Exploration
AGGREGATION WEBBOTS
Choosing Data Sources for Webbots
Example Aggregation Webbot
Adding Filtering to Your Aggregation Webbot
Further Exploration
FTP WEBBOTS
Example FTP Webbot
PHP and FTP
Further Exploration
NNTP NEWS WEBBOTS
NNTP Use and History
Webbots and Newsgroups
Further Exploration
WEBBOTS THAT READ EMAIL
The POP3 Protocol
Executing POP3 Commands with a Webbot
Further Exploration
WEBBOTS THAT SEND EMAIL
Email, Webbots, and Spam
Sending Mail with SMTP and PHP
Writing a Webbot That Sends Email Notifications
Further Exploration
CONVERTING A WEBSITE INTO A FUNCTION
Writing a Function Interface
Final Thoughts
ADVANCED TECHNICAL CONSIDERATIONS
SPIDERS
How Spiders Work
Example Spider
LIB_simple_spider
Experimenting with the Spider
Adding the Payload
Further Exploration
PROCUREMENT WEBBOTS AND SNIPERS
Procurement Webbot Theory
Sniper Theory
Testing Your Own Webbots and Snipers
Further Exploration
Final Thoughts
WEBBOTS AND CRYPTOGRAPHY
Designing Webbots That Use Encryption
A Quick Overview of Web Encryption
Local Certificates
Final Thoughts
AUTHENTICATION
What Is Authentication?
Example Scripts and Practice Pages
Basic Authentication
Session Authentication
Final Thoughts
ADVANCED COOKIE MANAGEMENT
How Cookies Work
PHP/CURL and Cookies
How Cookies Challenge Webbot Design
Further Exploration
SCHEDULING WEBBOTS AND SPIDERS
The Windows Task Scheduler
Complex Schedules
Non-Calendar-Based Triggers
Final Thoughts
LARGER CONSIDERATIONS
DESIGNING STEALTHY WEBBOTS AND SPIDERS
Why Design a Stealthy Webbot?
Stealth Means Simulating Human Patterns
Final Thoughts
WRITING FAULT-TOLERANT WEBBOTS
Types of Webbot Fault Tolerance
Error Handlers
DESIGNING WEBBOT-FRIENDLY WEBSITES
Optimizing Web Pages for Search Engine Spiders
Web Design Techniques That Hinder Search Engine Spiders
Designing Data-Only Interfaces
KILLING SPIDERS
Asking Nicely
Building Speed Bumps
Setting Traps
Final Thoughts
KEEPING WEBBOTS OUT OF TROUBLE
It's All About Respect
Copyright
Trespass to Chattels
Internet Law
Final Thoughts
PHP/CURL REFERENCE
Creating a Minimal PHP/CURL Session
Initiating PHP/CURL Sessions
Setting PHP/CURL Options
Executing the PHP/CURL Command
Closing PHP/CURL Sessions
STATUS CODES
HTTP Codes
NNTP Codes
SMS EMAIL ADDRESSES
Colophon
Index
WEBBOTS, SPIDERS, AND SCREEN SCRAPERS. Copyright © 2007 by Michael Schrenk.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by
any means, electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval system, without the prior written permission of the copyright
owner and the publisher.
Printed on recycled paper in the United States of America
11 10 09 08 07 1 2 3 4 5 6 7 8 9
ISBN-10: 1-59327-120-4
ISBN-13: 978-1-59327-120-6
Publisher: William Pollock
Production Editor: Christina Samuell
Cover and Interior Design: Octopod Studios
Developmental Editors: Tyler Ortman and William Pollock
Technical Reviewer: Peter MacIntyre
Copyeditor: Megan Dunchak
Compositors: Megan Dunchak, Riley Hoffman, and Christina Samuell
Proofreader: Stephanie Provines
Indexer: Nancy Guenther
For information on book distributors or translations, please contact No Starch Press, Inc.
directly:
No Starch Press, Inc.
555 De Haro Street, Suite 250, San Francisco, CA 94107
phone: 415.863.9900; fax: 415.863.9950; info@nostarch.com; www.nostarch.com
Library of Congress Cataloging-in-Publication Data
Schrenk, Michael.
Webbots, spiders, and screen scrapers : a guide to developing internet agents
with PHP/CURL / Michael Schrenk.
p. cm.
Includes index.
ISBN-13: 978-1-59327-120-6
ISBN-10: 1-59327-120-4
1. Web search engines. 2. Internet programming. 3. Internet searching. 4.
Intelligent agents (Computer software) I. Title.
TK5105.884.S37 2007
025.04--dc22
2006026680
No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press,
Inc. Other product and company names mentioned herein may be the trademarks of their
respective owners. Rather than use a trademark symbol with every occurrence of a
trademarked name, we are using the names only in an editorial fashion and to the benefit of
the trademark owner, with no intention of infringement of the trademark.
The information in this book is distributed on an "As Is" basis, without warranty. While every
precaution has been taken in the preparation of this work, neither the author nor No Starch
Press, Inc. shall have any liability to any person or entity with respect to any loss or damage
caused or alleged to be caused directly or indirectly by the information contained in it.
Dedication
In loving memory
Charlotte Schrenk
1897–1982