Table of Contents

Modeling the Internet and the Web: Probabilistic Methods and Algorithms. P. Baldi, P. Frasconi and P. Smyth
Copyright 2003 P. Baldi, P. Frasconi and P. Smyth. Published by John Wiley & Sons, Ltd. ISBN: 0-470-84906-1

Modeling the Internet and the Web
Probabilistic Methods and Algorithms

Pierre Baldi
School of Information and Computer Science, University of California, Irvine, USA

Paolo Frasconi
Department of Systems and Computer Science, University of Florence, Italy

Padhraic Smyth
School of Information and Computer Science, University of California, Irvine, USA

Copyright © 2003 Pierre Baldi, Paolo Frasconi and Padhraic Smyth

Published by John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England
Phone (+44) 1243 779777
Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0-470-84906-1

Typeset in 10/12pt Times by T&T Productions Ltd, London.
Printed and bound in Great Britain by Biddles Ltd, Guildford, Surrey.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.

To Ezio and José (P.B.), to Neda (P.F.) and to Seosamh and Bríd Áine (P.S.)

Contents

Preface

1 Mathematical Background
  1.1 Probability and Learning from a Bayesian Perspective
  1.2 Parameter Estimation from Data
    1.2.1 Basic principles
    1.2.2 A simple die example
  1.3 Mixture Models and the Expectation Maximization Algorithm
  1.4 Graphical Models
    1.4.1 Bayesian networks
    1.4.2 Belief propagation
    1.4.3 Learning directed graphical models from data
  1.5 Classification
  1.6 Clustering
  1.7 Power-Law Distributions
    1.7.1 Definition
    1.7.2 Scale-free properties (80/20 rule)
    1.7.3 Applications to Languages: Zipf’s and Heaps’ Laws
    1.7.4 Origin of power-law distributions and Fermi’s model
  1.8 Exercises

2 Basic WWW Technologies
  2.1 Web Documents
    2.1.1 SGML and HTML
    2.1.2 General structure of an HTML document
    2.1.3 Links
  2.2 Resource Identifiers: URI, URL, and URN
  2.3 Protocols
    2.3.1 Reference models and TCP/IP
    2.3.2 The domain name system
    2.3.3 The Hypertext Transfer Protocol
    2.3.4 Programming examples
  2.4 Log Files
  2.5 Search Engines
    2.5.1 Overview
    2.5.2 Coverage
    2.5.3 Basic crawling
  2.6 Exercises

3 Web Graphs
  3.1 Internet and Web Graphs
    3.1.1 Power-law size
    3.1.2 Power-law connectivity
    3.1.3 Small-world networks
    3.1.4 Power law of PageRank
    3.1.5 The bow-tie structure
  3.2 Generative Models for the Web Graph and Other Networks
    3.2.1 Web page growth
    3.2.2 Lattice perturbation models: between order and disorder
    3.2.3 Preferential attachment models, or the rich get richer
    3.2.4 Copy models
    3.2.5 PageRank models
  3.3 Applications
    3.3.1 Distributed search algorithms
    3.3.2 Subgraph patterns and communities
    3.3.3 Robustness and vulnerability
  3.4 Notes and Additional Technical References
  3.5 Exercises

4 Text Analysis
  4.1 Indexing
    4.1.1 Basic concepts
    4.1.2 Compression techniques
  4.2 Lexical Processing
    4.2.1 Tokenization
    4.2.2 Text conflation and vocabulary reduction
  4.3 Content-Based Ranking
    4.3.1 The vector-space model
    4.3.2 Document similarity
    4.3.3 Retrieval and evaluation measures
  4.4 Probabilistic Retrieval
  4.5 Latent Semantic Analysis
    4.5.1 LSI and text documents
    4.5.2 Probabilistic LSA
  4.6 Text Categorization
    4.6.1 k nearest neighbors
    4.6.2 The Naive Bayes classifier
    4.6.3 Support vector classifiers
    4.6.4 Feature selection
    4.6.5 Measures of performance
    4.6.6 Applications
    4.6.7 Supervised learning with unlabeled data
  4.7 Exploiting Hyperlinks
    4.7.1 Co-training
    4.7.2 Relational learning
  4.8 Document Clustering
    4.8.1 Background and examples
    4.8.2 Clustering algorithms for documents
    4.8.3 Related approaches
  4.9 Information Extraction
  4.10 Exercises

5 Link Analysis
  5.1 Early Approaches to Link Analysis
  5.2 Nonnegative Matrices and Dominant Eigenvectors
  5.3 Hubs and Authorities: HITS
  5.4 PageRank
  5.5 Stability
    5.5.1 Stability of HITS
    5.5.2 Stability of PageRank
  5.6 Probabilistic Link Analysis
    5.6.1 SALSA
    5.6.2 PHITS
  5.7 Limitations of Link Analysis

6 Advanced Crawling Techniques
  6.1 Selective Crawling
  6.2 Focused Crawling
    6.2.1 Focused crawling by relevance prediction
    6.2.2 Context graphs
    6.2.3 Reinforcement learning
    6.2.4 Related intelligent Web agents
  6.3 Distributed Crawling
  6.4 Web Dynamics
    6.4.1 Lifetime and aging of documents
    6.4.2 Other measures of recency
    6.4.3 Recency and synchronization policies

7 Modeling and Understanding Human Behavior on the Web
  7.1 Introduction
  7.2 Web Data and Measurement Issues
    7.2.1 Background
    7.2.2 Server-side data
    7.2.3 Client-side data
  7.3 Empirical Client-Side Studies of Browsing Behavior
    7.3.1 Early studies from 1995 to 1997
    7.3.2 The Cockburn and McKenzie study from 2002
  7.4 Probabilistic Models of Browsing Behavior
    7.4.1 Markov models for page prediction
    7.4.2 Fitting Markov models to observed page-request data
    7.4.3 Bayesian parameter estimation for Markov models
    7.4.4 Predicting page requests with Markov models
    7.4.5 Modeling runlengths within states
    7.4.6 Modeling session lengths
    7.4.7 A decision-theoretic surfing model
    7.4.8 Predicting page requests using additional variables
  7.5 Modeling and Understanding Search Engine Querying
    7.5.1 Empirical studies of search behavior
    7.5.2 Models for search strategies
  7.6 Exercises

8 Commerce on the Web: Models and Applications
  8.1 Introduction
  8.2 Customer Data on the Web
  8.3 Automated Recommender Systems
    8.3.1 Evaluating recommender systems
    8.3.2 Nearest-neighbor collaborative filtering
    8.3.3 Model-based collaborative filtering
    8.3.4 Model-based combining of votes and content
  8.4 Networks and Recommendations
    8.4.1 Email-based product recommendations
    8.4.2 A diffusion model
  8.5 Web Path Analysis for Purchase Prediction
  8.6 Exercises

Appendix A Mathematical Complements
  A.1 Graph Theory
    A.1.1 Basic definitions
    A.1.2 Connectivity
    A.1.3 Random graphs
  A.2 Distributions
    A.2.1 Expectation, variance, and covariance
    A.2.2 Discrete distributions
    A.2.3 Continuous distributions
    A.2.4 Weibull distribution
    A.2.5 Exponential family
    A.2.6 Extreme value distribution
  A.3 Singular Value Decomposition
  A.4 Markov Chains
  A.5 Information Theory
    A.5.1 Mathematical background
    A.5.2 Information, surprise, and relevance

Appendix B List of Main Symbols and Abbreviations

References

Index

Preface

Since its early ARPANET inception during the Cold War, the Internet has grown by a staggering nine orders of magnitude. Today, the Internet and the World Wide Web pervade our lives, having fundamentally altered the way we seek, exchange, distribute, and process information. The Internet has become a powerful social force, transforming communication, entertainment, commerce, politics, medicine, science, and more. It mediates an ever growing fraction of human knowledge, forming both the largest library and the largest marketplace on planet Earth.

Unlike the invention of earlier media such as the press, photography, or even the radio, which created specialized passive media, the Internet and the Web impact all information, converting it to a uniform digital format of bits and packets. In addition, the Internet and the Web form a dynamic medium, allowing software applications to control, search, modify, and filter information without human intervention. For example, email messages can carry programs that affect the behavior of the receiving computer. This active medium also promotes human intervention in sharing, updating, linking, embellishing, critiquing, corrupting, etc., information to a degree that far exceeds what could be achieved with printed documents.

In common usage, the words ‘Internet’ and ‘Web’ (or World Wide Web or WWW) are often used interchangeably.
Although they are intimately related, there are of course some nuances which we have tried to respect. ‘Internet’, in particular, is the more general term and implicitly includes physical aspects of the underlying networks as well as mechanisms such as email and peer-to-peer activities that are not directly associated with the Web. The term ‘Web’, on the other hand, is associated with the information stored and available on the Internet. It is also a term that points to other complex networks of information, such as webs of scientific citations, social relations, or even protein interactions. In this sense, it is fair to say that a predominant fraction of our book is about the Web and the information aspects of the Internet. We use ‘Web’ every time we refer to the World Wide Web and ‘web’ when we refer to a broader class of networks or other kinds of networks, e.g. a web of citations.

As the Internet and the Web continue to expand at an exponential rate, they also evolve in terms of the devices and processors connected to them, e.g. wireless devices and appliances. Ever more human domains and activities are ensnared by the Web, thus creating challenging problems of ownership, security, and privacy. For instance,