Programming Spiders, Bots, and Aggregators in Java
Jeff Heaton
Publisher: Sybex
February 2002
ISBN: 0782140408, 512 pages
Spiders, bots, and aggregators are all so-called intelligent agents, which execute tasks on
the Web without the intervention of a human being. Spiders go out on the Web and identify
multiple sites with information on a chosen topic and retrieve the information. Bots find
information within one site by cataloging and retrieving it. Aggregators gather data from
multiple sites and consolidate it on one page, such as credit card, bank account, and
investment account data. This book offers a complete toolkit for the Java programmer
who wants to build bots, spiders, and aggregators. It teaches the basic low-level
HTTP/network programming Java programmers need to get going and then dives into how to
create useful intelligent agent applications. It is aimed not just at Java programmers but JSP
programmers as well. The CD-ROM includes all the source code for the author's intelligent
agent platform, which readers can use to build their own spiders, bots, and aggregators.
Associate Publisher: Richard Mills
Acquisitions and Developmental Editor: Diane Lowery
Editor: Rebecca C. Rider
Production Editor: Dennis Fitzgerald
Technical Editor: Marc Goldford
Graphic Illustrator: Tony Jonick
Electronic Publishing Specialists: Jill Niles, Judy Fung
Proofreaders: Emily Hsuan, Laurie O’Connell, Nancy Riddiough
Indexer: Ted Laux
CD Coordinator: Dan Mummert
CD Technician: Kevin Ly
Cover Designer: Carol Gorska, Gorska Design
Cover Illustrator/Photographer: Akira Kaede, PhotoDisc
Copyright © 2002 SYBEX Inc., 1151 Marina Village Parkway, Alameda, CA 94501. World
rights reserved. The author(s) created reusable code in this publication expressly for reuse by
readers. Sybex grants readers limited permission to reuse the code found in this publication or
its accompanying CD-ROM so long as (author(s)) are attributed in any application containing
the reusable code and the code itself is never distributed, posted online by electronic
transmission, sold, or commercially exploited as a stand-alone product. Aside from this
specific exception concerning reusable code, no part of this publication may be stored in a
retrieval system, transmitted, or reproduced in any way, including but not limited to
photocopy, photograph, magnetic, or other record, without the prior agreement and written
permission of the publisher.
Library of Congress Card Number: 2001096980
ISBN: 0-7821-4040-8
SYBEX and the SYBEX logo are either registered trademarks or trademarks of SYBEX Inc.
in the United States and/or other countries.
Screen reproductions produced with FullShot 99. FullShot 99 © 1991-1999 Inbit
Incorporated. All rights reserved. FullShot is a trademark of Inbit Incorporated.
The CD interface was created using Macromedia Director, COPYRIGHT 1994, 1997-1999
Macromedia Inc. For more information on Macromedia and Macromedia Director, visit
http://www.macromedia.com/.
Internet screen shot(s) using Microsoft Internet Explorer reprinted by permission from
Microsoft Corporation.
TRADEMARKS: SYBEX has attempted throughout this book to distinguish proprietary
trademarks from descriptive terms by following the capitalization style used by the
manufacturer.
The author and publisher have made their best efforts to prepare this book, and the content is
based upon final release software whenever possible. Portions of the manuscript may be based
upon pre-release versions supplied by software manufacturer(s). The author and the publisher
make no representation or warranties of any kind with regard to the completeness or accuracy
of the contents herein and accept no liability of any kind including but not limited to
performance, merchantability, fitness for any particular purpose, or any losses or damages of
any kind caused or alleged to be caused directly or indirectly from this book.
Software License Agreement: Terms and Conditions
The media and/or any online materials accompanying this book that are available now or in
the future contain programs and/or text files (the “Software”) to be used in connection with
the book. SYBEX hereby grants to you a license to use the Software, subject to the terms that
follow. Your purchase, acceptance, or use of the Software will constitute your acceptance of
such terms.
The Software compilation is the property of SYBEX unless otherwise indicated and is
protected by copyright to SYBEX or other copyright owner(s) as indicated in the media files
(the “Owner(s)”). You are hereby granted a single-user license to use the Software for your
personal, noncommercial use only. You may not reproduce, sell, distribute, publish, circulate,
or commercially exploit the Software, or any portion thereof, without the written consent of
SYBEX and the specific copyright owner(s) of any component software included on this
media.
In the event that the Software or components include specific license requirements or end-user
agreements, statements of condition, disclaimers, limitations or warranties (“End-User
License”), those End-User Licenses supersede the terms and conditions herein as to that
particular Software component. Your purchase, acceptance, or use of the Software will
constitute your acceptance of such End-User Licenses.
By purchase, use or acceptance of the Software you further agree to comply with all export
laws and regulations of the United States as such laws and regulations may exist from time to
time.
Reusable Code in This Book
The authors created reusable code in this publication expressly for reuse by readers. Sybex
grants readers permission to reuse for any purpose the code found in this publication or its
accompanying CD-ROM so long as all of the authors are attributed in any application
containing the reusable code, and the code itself is never sold or commercially exploited as a
stand-alone product.
Software Support
Components of the supplemental Software and any offers associated with them may be
supported by the specific Owner(s) of that material, but they are not supported by SYBEX.
Information regarding any available support may be obtained from the Owner(s) using the
information provided in the appropriate read.me files or listed elsewhere on the media.
Should the manufacturer(s) or other Owner(s) cease to offer support or decline to honor any
offer, SYBEX bears no responsibility. This notice concerning support for the Software is
provided for your information only. SYBEX is not the agent or principal of the Owner(s), and
SYBEX is in no way responsible for providing any support for the Software, nor is it liable or
responsible for any support provided, or not provided, by the Owner(s).
Warranty
SYBEX warrants the enclosed media to be free of physical defects for a period of ninety (90)
days after purchase. The Software is not available from SYBEX in any other form or media
than that enclosed herein or posted to http://www.sybex.com/. If you discover a defect in the
media during this warranty period, you may obtain a replacement of identical format at no
charge by sending the defective media, postage prepaid, with proof of purchase to:
SYBEX Inc.
Product Support Department
1151 Marina Village Parkway
Alameda, CA 94501
Web: http://www.sybex.com/
After the 90-day period, you can obtain replacement media of identical format by sending us
the defective disk, proof of purchase, and a check or money order for $10, payable to
SYBEX.
Disclaimer
SYBEX makes no warranty or representation, either expressed or implied, with respect to the
Software or its contents, quality, performance, merchantability, or fitness for a particular
purpose. In no event will SYBEX, its distributors, or dealers be liable to you or any other
party for direct, indirect, special, incidental, consequential, or other damages arising out of the
use of or inability to use the Software or its contents even if advised of the possibility of such
damage. In the event that the Software includes an online update feature, SYBEX further
disclaims any obligation to provide this feature for any specific duration other than the initial
posting.
The exclusion of implied warranties is not permitted by some states. Therefore, the above
exclusion may not apply to you. This warranty provides you with specific legal rights; there
may be other rights that you may have that vary from state to state. The pricing of the book
with the Software by SYBEX reflects the allocation of risk and limitations on liability
contained in this agreement of Terms and Conditions.
Shareware Distribution
This Software may contain various programs that are distributed as shareware. Copyright laws
apply to both shareware and ordinary commercial software, and the copyright Owner(s)
retains all rights. If you try a shareware program and continue using it, you are expected to
register it. Individual programs differ on details of trial periods, registration, and payment.
Please observe the requirements stated in appropriate files.
Copy Protection
The Software in whole or in part may or may not be copy-protected or encrypted. However, in
all cases, reselling or redistributing these files without authorization is expressly forbidden
except as specifically provided for by the Owner(s) therein.
This book is dedicated to my grandparents: Agnes Heaton and the memory of Roscoe Heaton,
as well as Emil A. Stricker and the memory of Esther Stricker.
Acknowledgments
There are many people who helped to make this book a reality, both directly and indirectly. It
would not be possible to thank them all, but I would like to acknowledge the primary
contributors.
Working with Sybex on this project was a pleasure. Everyone involved in the production of
this book was both professional and pleasant. First, I would like to acknowledge Marc
Goldford, my technical editor, for his many helpful suggestions, and for testing the final
versions of all examples. Rebecca Rider was my editor, and she did an excellent job of
making sure that everything was clear and understandable. Diane Lowery, my acquisitions
editor, was very helpful during the early stages of this project. I would also like to thank the
production team: Dennis Fitzgerald, production editor; Jill Niles and Judy Fung, electronic
publishing specialists; and Laurie O’Connell, Nancy Riddiough, and Emily Hsuan,
proofreaders.
It has also been a pleasure to work with everyone in the Global Software division of the
Reinsurance Group of America, Inc. (RGA). I work with a group of very talented IT
professionals, and I continue to learn a great deal from them. In particular, I would like to
thank my supervisor Kam Chan, executive director, for the very valuable help he provides
as I learn to design large, complex systems in addition to just programming them.
Additionally, I would like to thank Rick Nolle, vice president of systems, for taking the time
to find the right place for me at RGA. Finally, I would like to thank Jym Barnes, managing
director, for our many discussions about the latest technologies.
In addition, I would like to thank my agent, Neil J. Salkind, Ph.D., for helping me develop
and present the proposal for this book. I would also like to thank my friend Lisa Oliver for
reviewing many chapters and discussing many of the ideas that went into this book. Likewise,
I would like to thank my friend Jeffrey Noedel for the many discussions of real-world
applications of bot technology. I would also like to thank Bill Darte, of Washington
University in St. Louis, for acting as my advisor for some of the research that went into this
book.
Table of Contents
Introduction
    Overview
    What Is a Bot?
    What Is a Spider?
    What Are Agents and Intelligent Agents?
    What Are Aggregators?
    The Java Programming Language
    Wrap Up
Chapter 1: Java Socket Programming
    Overview
    The World of Sockets
    Java I/O Programming
    Proxy Issues
    Socket Programming in Java
    Client Sockets
    Server Sockets
    Summary
Chapter 2: Examining the Hypertext Transfer Protocol
    Overview
    Address Formats
    Using Sockets to Program HTTP
    Bot Package Classes for HTTP
    Under the Hood
    Summary
Chapter 3: Accessing Secure Sites with HTTPS
    Overview
    HTTP versus HTTPS
    Using HTTPS with Java
    HTTP User Authentication
    Securing Access
    Under the Hood
    Summary
Chapter 4: HTML Parsing
    Overview
    Working with HTML
    Tags a Bot Cares About
    HTML That Requires Special Handling
    Using Bot Classes for HTML Parsing
    Using Swing Classes for HTML Parsing
    Bot Package HTML Parsing Examples
    Under the Hood
    Summary
Chapter 5: Posting Forms
    Overview
    Using Forms
    Bot Classes for a Generic Post
    Under the Hood
    Summary
Chapter 6: Interpreting Data
    Overview
    The Structure of the CSV File
    The Structure of a QIF File
    The XML File Format
    Summary
Chapter 7: Exploring Cookies
    Overview
    Examining Cookies
    Bot Classes for Cookie Processing
    Under the Hood
    Summary
Chapter 8: Building a Spider
    Overview
    Structure of Websites
    Structure of a Spider
    Constructing a Spider
    Summary
Chapter 9: Building a High-Volume Spider
    Overview
    What Is Multithreading?
    Multithreading with Java
    Synchronizing Threads
    Using a Database
    The High-Performance Spider
    Under the Hood
    Summary
Chapter 10: Building a Bot
    Overview
    Constructing a Typical Bot
    Using the CatBot
    An Example CatBot
    Under the Hood
    Summary
Chapter 11: Building an Aggregator
    Overview
    Online versus Offline Aggregation
    Building the Underlying Bot
    Building the Weather Aggregator
    Summary
Chapter 12: Using Bots Conscientiously
    Overview
    Dealing with Websites
    Webmaster Actions
    A Conscientious Spider
    Under the Hood
    Summary
Chapter 13: The Future of Bots
    Overview
    Internet Information Transfer
    Understanding XML
    Transferring XML Data
    Bots and SOAP
    Summary
Appendix A: The Bot Package
    Utility Classes
    HTTP Classes
    The Parsing Classes
    Spider Classes
Appendix B: Various HTTP Related Charts
    The ASCII Chart
    HTTP Headers
    HTTP Status Codes
    HTML Character Constants
Appendix C: Troubleshooting
    WIN32 Errors
    UNIX Errors
    Cross-Platform Errors
    How to Use the NOBOT Scripts
Appendix D: Installing Tomcat
    Installing and Starting Tomcat
    A JSP Example
Appendix E: How to Compile Examples Under Windows
    Using the JDK
    Using VisualCafé
Appendix F: How to Compile Examples Under UNIX
    Using the JDK
Appendix G: Recompiling the Bot Package
Glossary
Introduction
Overview
A tremendous amount of information is available through the Internet: today’s news, the
location of an expected package, the score of last night’s game, or the current stock price of
your company. Open your favorite browser, and all of this information is only a mouse click
away. Nearly any piece of current information can be found online; you have only to discover
it.
Most of the information content of the Internet is both produced and consumed by human
users. As a result, web pages are generally structured to be inviting to human visitors. But is
this the only use for the Web? Are human users the only visitors a website is likely to
accommodate?
Actually, a whole new class of web user is developing. These users are computer programs
that have the ability to access the Web in much the same way as a human user with a browser
does. There are many names for these kinds of programs, and these names reflect many of the
specialized tasks assigned to them. Spiders, bots, aggregators, agents, and intelligent agents
are all common terms for web-savvy computer programs. Throughout this book, we will
examine how to create each of these Internet programs, how they differ from one another,
and what benefits each offers. Figure I.1 shows the hierarchy of these programs.
Figure I.1: Bots, spiders, aggregators, and agents
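To make the idea of a program visiting the Web "in much the same way as a human user with a browser" concrete, here is a minimal sketch in Java. It is not taken from the book's Bot package: the class name SimplePageReader and its methods are hypothetical, and the sketch assumes the page is encoded as UTF-8. It relies only on the standard java.net and java.io classes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; it is not part of the book's Bot package.
class SimplePageReader {

    // Reads all text from the reader, ending every line with '\n'.
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Opens the URL and returns its content as text.
    // Assumption: the content is UTF-8; a real bot would check the response headers.
    static String fetch(URL url) throws IOException {
        return readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SimplePageReader <url>");
            return;
        }
        // Print the raw HTML that a browser would normally render for a human.
        System.out.print(fetch(URI.create(args[0]).toURL()));
    }
}
```

Run with a single URL argument, the program prints the raw HTML of that page. A browser would render this text for a person, whereas a bot, spider, or aggregator would parse it for the data it needs.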
What Is a Bot?