Modeling and Data Mining
in Blogosphere
Synthesis Lectures on Data Mining
and Knowledge Discovery
Editor
Robert Grossman, University of Illinois, Chicago
Modeling and Data Mining in Blogosphere
Nitin Agarwal and Huan Liu
2009
Copyright © 2009 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in
printed reviews, without the prior permission of the publisher.
www.morganclaypool.com
ISBN: 9781598299083 paperback
ISBN: 9781598299090 ebook
DOI 10.2200/S00213ED1V01Y200907DMK001
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY
Lecture #1
Series Editor: Robert Grossman, University of Illinois, Chicago
Series ISSN
Synthesis Lectures on Data Mining and Knowledge Discovery
ISSN pending.
Modeling and Data Mining
in Blogosphere
Nitin Agarwal
University of Arkansas at Little Rock
Huan Liu
Arizona State University
Morgan & Claypool Publishers
ABSTRACT
This book offers a comprehensive overview of the various concepts and research issues about blogs or
weblogs. It introduces techniques and approaches, tools and applications, and evaluation methodologies
with examples and case studies. Blogs allow people to express their thoughts, voice their
opinions, and share their experiences and ideas. Blogs also facilitate interactions among individuals,
creating a network with unique characteristics. Through these interactions, individuals experience a
sense of community. We elaborate on approaches that extract communities and cluster blogs based
on information about the bloggers. Open standards and low barriers to publication in Blogosphere
have transformed information consumers into producers, generating an overwhelming amount of
ever-increasing knowledge about the members, their environment and symbiosis. We elaborate on
approaches that sift through humongous blog data sources to identify influential and trustworthy
bloggers, leveraging content and network information. Spam blogs, or splogs, are an increasing concern
in Blogosphere; we discuss them in detail, along with approaches that leverage supervised machine
learning algorithms and interaction patterns. We elaborate on data collection procedures, provide
resources for blog data repositories, mention various visualization and analysis tools in Blogosphere,
and explain conventional and novel evaluation methodologies, to help perform research in the
Blogosphere.
The book is supported by additional material, including lecture slides as well as the complete
set of figures used in the book, and the reader is encouraged to visit the book website for the latest
information:
http://tinyurl.com/mcp-agarwal
KEYWORDS
blogosphere, weblogs, blogs, blog model, power law distribution, scale free networks,
degree distribution, clustering coefficient, centrality measures, clustering, community
discovery, influence, diffusion, trust, propagation, spam blogs, splogs, data collection,
blog crawling, performance evaluation
To my parents, Sushma and Umesh Chand Agarwal…–NA
To my parents, wife, and sons…–HL
…with much love and gratitude for everything.
Contents
Acknowledgments xi
1 Modeling Blogosphere 1
  1.1 Modeling Essentials 2
  1.2 Preferential Attachment Blog Models 8
    1.2.1 Log-normal Distribution Models 12
2 Blog Clustering and Community Discovery 15
  2.1 Graph Based Approach 17
  2.2 Content Based Approach 21
  2.3 Hybrid Approach 24
3 Influence and Trust 27
  3.1 Influence 27
    3.1.1 Graph Based Approach 30
    3.1.2 Content Based Approach 33
    3.1.3 Hybrid Approach 34
    3.1.4 Blog Leaders 40
  3.2 Trust 40
    3.2.1 Trust Computation 41
    3.2.2 Trust Propagation 43
4 Spam Filtering in Blogosphere 45
  4.1 Graph Based Approach 47
  4.2 Content Based Approach 49
  4.3 Hybrid Approach 51
5 Data Collection and Evaluation 53
  5.1 Data Collection 53
    5.1.1 API 53
    5.1.2 Web Crawler 56
    5.1.3 Available Datasets 58
    5.1.4 Data Preprocessing 59
  5.2 Evaluation 60
    5.2.1 Blog Modeling 61
    5.2.2 Blog Clustering and Community Discovery 61
    5.2.3 Influence and Trust 64
    5.2.4 Spam 68
A Tools in Blogosphere 71
B API Examples 79
Bibliography 87
Biography 95
Index 97