
Mining the Web. Discovering Knowledge from Hypertext Data PDF

364 Pages·2002·3.698 MB·English

Preview Mining the Web. Discovering Knowledge from Hypertext Data

The Morgan Kaufmann Series in Data Management Systems
Series Editor: Jim Gray, Microsoft Research

Mining the Web: Discovering Knowledge from Hypertext Data. Soumen Chakrabarti
Database: Principles, Programming, and Performance, Second Edition. Patrick O’Neil and Elizabeth O’Neil
Advanced SQL:1999—Understanding Object-Relational and Other Advanced Features. Jim Melton
The Object Data Standard: ODMG 3.0. Edited by R. G. G. Cattell and Douglas Barry
Database Tuning: Principles, Experiments, and Troubleshooting Techniques. Dennis Shasha and Philippe Bonnet
Data on the Web: From Relations to Semistructured Data and XML. Serge Abiteboul, Peter Buneman, and Dan Suciu
SQL:1999—Understanding Relational Language Components. Jim Melton and Alan R. Simon
Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Ian Witten and Eibe Frank
Information Visualization in Data Mining and Knowledge Discovery. Edited by Usama Fayyad, Georges G. Grinstein, and Andreas Wierse
Joe Celko’s SQL for Smarties: Advanced SQL Programming, Second Edition. Joe Celko
Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Gerhard Weikum and Gottfried Vossen
Joe Celko’s Data and Databases: Concepts in Practice. Joe Celko
Spatial Databases: With Application to GIS. Philippe Rigaux, Michel Scholl, and Agnès Voisard
Developing Time-Oriented Database Applications in SQL. Richard T. Snodgrass
Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. Terry Halpin
Web Farming for the Data Warehouse. Richard D. Hackathorn
Component Database Systems. Edited by Klaus R. Dittrich and Andreas Geppert
Database Modeling & Design, Third Edition. Toby J. Teorey
Managing Reference Data in Enterprise Databases: Binding Corporate Data to the Wider World. Malcolm Chisholm
Management of Heterogeneous and Autonomous Database Systems. Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, and Amit Sheth
Data Mining: Concepts and Techniques. Jiawei Han and Micheline Kamber
Object-Relational DBMSs: Tracking the Next Great Wave, Second Edition. Michael Stonebraker and Paul Brown, with Dorothy Moore
Understanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related Technologies. Jim Melton and Andrew Eisenberg
A Complete Guide to DB2 Universal Database. Don Chamberlin
Migrating Legacy Systems: Gateways, Interfaces, & the Incremental Approach. Michael L. Brodie and Michael Stonebraker
Universal Database Management: A Guide to Object/Relational Technology. Cynthia Maro Saracco
Atomic Transactions. Nancy Lynch, Michael Merritt, William Weihl, and Alan Fekete
Readings in Database Systems, Third Edition. Edited by Michael Stonebraker and Joseph M. Hellerstein
Query Processing for Advanced Database Systems. Edited by Johann Christoph Freytag, David Maier, and Gottfried Vossen
Understanding SQL’s Stored Procedures: A Complete Guide to SQL/PSM. Jim Melton
Transaction Processing: Concepts and Techniques. Jim Gray and Andreas Reuter
Principles of Multimedia Database Systems. V. S. Subrahmanian
Building an Object-Oriented Database System: The Story of O2. Edited by François Bancilhon, Claude Delobel, and Paris Kanellakis
Principles of Database Query Processing for Advanced Applications. Clement T. Yu and Weiyi Meng
Database Transaction Models for Advanced Applications. Edited by Ahmed K. Elmagarmid
Advanced Database Systems. Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari
A Guide to Developing Client/Server SQL Applications. Setrag Khoshafian, Arvola Chan, Anna Wong, and Harry K. T. Wong
Principles of Transaction Processing for the Systems Professional. Philip A. Bernstein and Eric Newcomer
The Benchmark Handbook for Database and Transaction Processing Systems, Second Edition. Edited by Jim Gray
Using the New DB2: IBM’s Object-Relational Database System. Don Chamberlin
Camelot and Avalon: A Distributed Transaction Facility. Edited by Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. Spector
Distributed Algorithms. Nancy A. Lynch
Active Database Systems: Triggers and Rules For Advanced Database Processing. Edited by Jennifer Widom and Stefano Ceri
Readings in Object-Oriented Database Systems. Edited by Stanley B. Zdonik and David Maier

MINING THE WEB
DISCOVERING KNOWLEDGE FROM HYPERTEXT DATA
Soumen Chakrabarti
Indian Institute of Technology, Bombay

Senior Editor: Lothlórien Homet
Publishing Services Manager: Edward Wade
Editorial Assistant: Corina Derman
Cover Design: Ross Carron Design
Text Design: Frances Baca Design
Cover Image: Kimihiro Kuno/Photonica
Composition and Technical Illustration: Windfall Software, using ZzTeX
Copyeditor: Sharilyn Hovind
Proofreader: Jennifer McClain
Indexer: Steve Rath
Printer: The Maple-Vail Book Manufacturing Group

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
Morgan Kaufmann Publishers
An imprint of Elsevier Science
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205
www.mkp.com

© 2003 by Elsevier Science (USA)
All rights reserved
Printed in the United States of America
07 06 05 04 03 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

Library of Congress Control Number: 2002107241
ISBN: 1-55860-754-4
This book is printed on acid-free paper.

FOREWORD
Jiawei Han
University of Illinois, Urbana-Champaign

The World Wide Web overwhelms us with immense amounts of widely distributed, interconnected, rich, and dynamic hypertext information. It has profoundly influenced many aspects of our lives, changing the ways we communicate, conduct business, shop, entertain, and so on. However, the abundant information on the Web is not stored in any systematically structured way, a situation which poses great challenges to those seeking to effectively search for high quality information and to uncover the knowledge buried in billions of Web pages. Web mining, or the automatic discovery of interesting and valuable information from the Web, has therefore become an important theme in data mining.

As a prominent researcher on Web mining, Soumen Chakrabarti has presented tutorials and surveys on this exciting topic at many international conferences. Now, after years of dedication, he presents us with this excellent book. Mining the Web: Discovering Knowledge from Hypertext Data is the first book solely dedicated to the theme of Web mining and it offers comprehensive coverage and a rigorous treatment.

Chakrabarti starts with a thorough introduction to the infrastructure of the Web, including the mechanisms for Web crawling, Web page indexing, and keyword or similarity-based searching of Web contents. He then gives a systematic description of the foundations of Web mining, focusing on hypertext-based machine learning and data mining methods, such as clustering, collaborative filtering, supervised learning, and semi-supervised learning. After that, he presents the application of these fundamental principles to Web mining itself, especially Web linkage analysis, introducing the popular PageRank and HITS algorithms that substantially enhance the quality of keyword-based Web searches.

If you are a researcher, a Web technology developer, or just an interested reader curious about how to explore the endless potential of the Web, you will find this book provides both a solid technical background and state-of-the-art knowledge on this fascinating topic. It is a jewel in the collection of data mining and Web technology books. I hope you enjoy it.

Preface
  Prerequisites and Contents
  Omissions
  Acknowledgments
INTRODUCTION
  Crawling and Indexing
  Topic Directories
  Clustering and Classification
  Hyperlink Analysis
  Resource Discovery and Vertical Portals
  Structured vs. Unstructured Data Mining
  Bibliographic Notes
Part I INFRASTRUCTURE
CRAWLING THE WEB
  HTML and HTTP Basics
  Crawling Basics
  Engineering Large-Scale Crawlers
    DNS Caching, Prefetching, and Resolution
    Multiple Concurrent Fetches
    Link Extraction and Normalization
    Robot Exclusion
    Eliminating Already-Visited URLs
    Spider Traps
    Avoiding Repeated Expansion of Links on Duplicate Pages
    Load Monitor and Manager
    Per-Server Work-Queues
    Text Repository
    Refreshing Crawled Pages
  Putting Together a Crawler
    Design of the Core Components
    Case Study: Using
  Bibliographic Notes
WEB SEARCH AND INFORMATION RETRIEVAL
  Boolean Queries and the Inverted Index
    Stopwords and Stemming
    Batch Indexing and Updates
    Index Compression Techniques
  Relevance Ranking
    Recall and Precision
    The Vector-Space Model
    Relevance Feedback and Rocchio's Method
    Probabilistic Relevance Feedback Models
    Advanced Issues
  Similarity Search
    Handling "Find-Similar" Queries
    Eliminating Near Duplicates via Shingling
    Detecting Locally Similar Subgraphs of the Web
  Bibliographic Notes
Part II LEARNING
SIMILARITY AND CLUSTERING
  Formulations and Approaches
    Partitioning Approaches
    Geometric Embedding Approaches
    Generative Models and Probabilistic Approaches
  Bottom-Up and Top-Down Partitioning Paradigms
    Agglomerative Clustering
    The k-Means Algorithm
  Clustering and Visualization via Embeddings
    Self-Organizing Maps (SOMs)
    Multidimensional Scaling (MDS) and FastMap
    Projections and Subspaces
    Latent Semantic Indexing (LSI)
  Probabilistic Approaches to Clustering
    Generative Distributions for Documents
    Mixture Models and Expectation Maximization (EM)
    Multiple Cause Mixture Model (MCMM)
    Aspect Models and Probabilistic LSI
    Model and Feature Selection
  Collaborative Filtering
    Probabilistic Models
    Combining Content-Based and Collaborative Features
  Bibliographic Notes
SUPERVISED LEARNING
  The Supervised Learning Scenario
  Overview of Classification Strategies
