Survey on Web Spam Detection: Principles and Algorithms

Nikita Spirin, Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
[email protected], [email protected]

ABSTRACT
Search engines became a de facto place to start information acquisition on the Web. However, due to the web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high quality search even more challenging. Over the last decade research on adversarial information retrieval has gained a lot of interest both from academia and industry. In this paper we present a systematic review of web spam detection techniques with the focus on algorithms and underlying principles. We categorize all existing algorithms into three categories based on the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. In turn, we subcategorize the link-based category into five groups based on the ideas and principles used: labels propagation, link pruning and reweighting, labels refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied for web spam detection.

Keywords
web spam detection, content spam, link spam, cloaking, collusion, link farm, pagerank, random walk, classification, clustering, web search, user behaviour, graph regularization, labels propagation
1. INTRODUCTION
Spam pervades any information system, be it an e-mail, web, social, blog or reviews platform. The concept of web spam, or spamdexing, was first introduced in 1996 [31] and was soon recognized as one of the key challenges for the search engine industry [57]. Recently [50; 97], all major search engine companies have identified adversarial information retrieval [41] as a top priority because of the multiple negative effects caused by spam and the appearance of new challenges in this area of research. First, spam deteriorates the quality of search results and deprives legitimate websites of revenue that they might earn in the absence of spam. Second, it weakens the trust of a user in a search engine provider, which is an especially tangible issue due to the zero cost of switching from one search provider to another. Third, spam websites serve as means of malware and adult content dissemination and phishing attacks. For instance, [39] ranked 100 million pages using the PageRank algorithm [86] and found that 11 out of the top 20 results were pornographic websites that achieved high ranking through content and web link manipulation. Last, it forces a search engine company to waste a significant amount of computational and storage resources. In 2005 the total worldwide financial losses caused by spam were estimated at $50 billion [63]; in 2009 the same value was estimated already at $130 billion [64]. Among the new challenges which are constantly emerging one can highlight the rapid growth of the Web and its heterogeneity, the simplification of content creation tools (e.g., free web wikis, blogging platforms, etc.) and the decrease in website maintenance cost (e.g., domain registration, hosting, etc.), and the evolution of spam itself, and hence the appearance of new web spam strains that cannot be captured by previously successful methods.

The web spam phenomenon mainly takes place due to the following fact. The fraction of web page referrals that come from search engines is significant and, moreover, users tend to examine only top ranked results. Thus, [98] showed that for 85% of the queries only the first result page is requested, and only the first three to five links are clicked [65]. Therefore, inclusion in the first SERP (Search Engine Result Page) has a clear economic incentive due to the increase in website traffic. To achieve this goal website owners attempt to manipulate search engine rankings. This manipulation can take various forms such as the addition of surrogate content to a page, excessive and undeserved link creation, cloaking, click fraud, and tag spam. We define these concepts in Section 2, following the work [54; 82; 13; 57]. Generally speaking, web spam manifests itself as web content generated deliberately for the purpose of triggering unjustifiably favourable relevance or importance of some web page or pages [54]. It is worth mentioning that the necessity of dealing with malicious content in a corpus is a key distinctive feature of adversarial information retrieval [41] in comparison with traditional information retrieval, where algorithms operate on a clean benchmark data set or in an intranet of a corporation.

According to various studies [85; 25; 12] the amount of web spam varies from 6 to 22 percent, which demonstrates the scope of the problem and suggests that solutions requiring manual intervention will not scale. Specifically, [7] shows that 6% of English language web pages were classified as spam, [25] reports 22% of spam on a host level, and [12] estimates it as 16.5%. Another group of researchers studies not only the cumulative amount of spam but also its distribution among countries and top level domains [85]. They report 13.8% of spam in the English speaking internet, 9% in Japanese, 22% in German, and 25% in French. They also show that 70% of pages in the *.biz domain and 20% in the *.com domain are spam.

This survey has two goals: first, it aims to draw a clear roadmap of the algorithms, principles and ideas used for web spam detection; second, it aims to build awareness and stimulate further research in the area of adversarial information retrieval. To the best of our knowledge there is no comprehensive but concise web spam mining survey with the focus on algorithms yet. This work complements existing surveys [54; 13; 82; 57] and a book [23] on the topic of web spam. [54] presents a web spam taxonomy and provides a broad coverage of various web spam forms and definitions. In [13] issues specific to web archives are considered. [82; 57] enumerate spam forms and discuss the challenges caused by the spam phenomenon for web search engines. [23] is a recent book, which provides the broadest coverage of web spam detection research available so far. We think that parallel research on email spam fighting [19; 95; 113] and spam on social websites [58] might also be relevant. [33] presents a general framework for adversarial classification and approaches the problem from a game-theoretical perspective.

The paper is organized as follows. In Section 2 we provide a brief overview of web spam forms following [54; 82; 13; 57]. Then we turn to content-based mining methods in Section 3.1. In Section 3.2 we provide a careful coverage of link-based spam detection algorithms. In Section 3.3 we consider approaches to fight against spam using click-through and user behaviour data, and by performing real-time HTTP session analysis. Finally, we summarize the key principles underlying web spam mining algorithms in Section 4 and conclude in Section 5.
2. WEB SPAM TAXONOMY

2.1 Content Spam
Content spam is probably the first and most widespread form of web spam. It is so widespread because search engines use information retrieval models based on page content to rank web pages, such as the vector space model [96], BM25 [94], or statistical language models [114]. Thus, spammers analyze the weaknesses of these models and exploit them. For instance, if we consider TFIDF scoring

TFIDF(q,p) = \sum_{t \in q \wedge t \in p} TF(t) \cdot IDF(t),   (1)

where q is a query, p is a page, and t is a term, then spammers can try to boost the TF of terms appearing on a page (we assume that the IDF of a term cannot be manipulated by a spammer due to the enormous size of the Web).

There are multiple facets based on which we can categorize content spam. Taking the document structure into account, there are five subtypes of content spamming; a short scoring sketch illustrating why term repetition pays off follows this list.

• Title Spamming. Due to the high importance of the title field for information retrieval [79], spammers have a clear incentive to overstuff it so as to achieve a higher overall ranking.

• Body Spamming. In this case the body of a page is modified. This is the most common form of content spam because it is cheap and simultaneously allows applying various strategies. For instance, if a spammer wants to achieve a high ranking of a page for only a limited predefined set of queries, they can use the repetition strategy by overstuffing the body of a page with terms that appear in that set of queries (there even was a spamming competition to rank highest for the query "nigritude ultramarine" [38]). On the other hand, if the goal is to cover as many queries as possible, the strategy could be to use a lot of random words at once. To hide the part of the content used to boost ranking, site owners colour it in the same way as the background so that only machines can recognize it.

• Meta-Tags Spamming. Because meta-tags play a specific role in a document description, search engines analyze them carefully. Hence, the placement of spam content in this field might be very attractive from the spammer's point of view. Because of the heavy spamming, nowadays search engines give a low priority to this field or even ignore it completely.

• Anchor Text Spamming. The usefulness of anchor text (the visible caption of a hyperlink) for web ranking was first introduced in 1994 [80], and spammers instantly added the corresponding strategy to their "arsenal". Spammers create links with the desired anchor text (often unrelated to the linking page content) in order to get the "right" terms for a target page.

• URL Spamming. Some search engines also consider the tokenized URL of a page as a zone, and hence spammers create a URL for a page from words which should be mentioned in a targeted set of queries. For example, if a spammer wants to be ranked high for the query "cheap acer laptops", they can create a URL "cheap-laptops.com/cheap-laptops/acer-laptops.html" (it is worth noting that the example also demonstrates the idea of keyword repetition, because the word "laptops" appears three times in the tokenized URL representation).
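To make the incentive behind keyword stuffing concrete, the following is a minimal sketch of the TFIDF scoring from Equation (1) over a toy in-memory corpus; the function names, the IDF variant, and the corpus representation are illustrative and not taken from any cited system.

```python
import math
from collections import Counter

def tfidf_score(query_terms, page_terms, corpus_pages):
    """Score a page for a query as in Equation (1): sum of TF(t) * IDF(t)
    over terms shared by the query and the page."""
    n_pages = len(corpus_pages)
    tf = Counter(page_terms)
    score = 0.0
    for t in set(query_terms) & set(page_terms):
        df = sum(1 for p in corpus_pages if t in p)   # document frequency of term t
        idf = math.log(n_pages / (1 + df))            # one common IDF variant; corpus-wide
        score += tf[t] * idf                          # TF is page-local and easy to inflate
    return score

# A spammer who repeats "laptops" many times on a page inflates TF("laptops")
# and hence the TFIDF score for queries containing that term.
```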
With the advent of link-based ranking algorithms [86; 68] the content spam phenomenon was partially overcome. However, spam is constantly evolving, and soon afterwards spammers started constructing link farms [53; 3], groups of highly interconnected spam pages, with the aim to boost the ranking of one or a small number of target pages in a farm. This form of spam is referred to as link spam.

2.2 Link Spam
There are two major categories of link spam: outgoing link spam and incoming link spam.

2.2.1 Outgoing link spam
This is the easiest and cheapest method of link spam because, first, a spammer has direct access to his pages and therefore can add any items to them, and second, they can easily copy an entire web catalogue such as DMOZ (dmoz.org) or Yahoo! Directory (dir.yahoo.com) and thereby quickly create a large set of authoritative links, accumulating relevance score. Outgoing link spamming mostly targets the HITS algorithm [68] (Section 3.2.1) with the aim of getting a high hub score.

2.2.2 Incoming link spam
In this case spammers try to raise the PageRank (Section 3.2.1) score of a page (often referred to as a target page) or simply boost the number of incoming links. One can identify the following strategies depending on the access to pages.

• Own Pages. In this case a spammer has direct control over all the pages and can be very flexible in his strategies. He can create his own link farm (nowadays link farm maintenance is dozens of times cheaper than before and the price is constantly decreasing) and carefully tune its topology to guarantee the desired properties and optimality. One common link farm has the topology depicted in Fig. 1 and is called a honeypot farm. In this case a spammer creates a page which looks absolutely innocent and may even be authoritative (though it is much more expensive), but links to the spammer's target pages. The organically aggregated PageRank (authority) score is then propagated further to the target pages and allows them to be ranked higher. A more aggressive form of the honeypot scheme is hijacking, when spammers first hack a reputable website and then use it as a part of their link farm. Thus, in 2006 a website for prospective CS students was hijacked and spammed with links of pornographic nature.
Figure 1: Scheme of a typical link farm.

Spammers can also collude by participating in link exchange schemes in order to achieve higher scale, higher in-link counts, or other goals. Motivations of spammers to collude are carefully analyzed in [53]. Optimal properties of link farms are analyzed in [53; 3; 117]. To reduce the time spent on link farm promotion, spammers are also eager to buy expired and abandoned domain names. They are guided by the principle that, due to the non-instant update of an index and recrawling, search engines believe that a domain is still under the control of a good website owner, and therefore spammers can benefit from the "resources" and reputation left by the previous website for some time.

We also consider redirection as a form of the honeypot scheme (some researchers consider it a separate spamming technique). Here the spamming scheme works as follows. First, a honeypot page achieves high ranking in a SERP by boosting techniques. But when the page is requested by a user, they don't actually see it; they get redirected to a target page. There are various ways to achieve redirection. The easiest approach is to set the page refresh time to zero and initialize the refresh URL attribute with the URL of a target page. A more sophisticated approach is to use page level scripts that aren't usually executed by crawlers and are hence more effective from the spammer's point of view.

• Accessible Pages. These are pages which spammers can modify but don't own. For instance, these can be Wikipedia pages, a blog with public comments, a public discussion group, or even an open user-maintained web directory. Spammers exploit the opportunity to slightly modify external pages by creating links to their own pages. It is worth noting that these strategies are usually combined. Thus, while spamming comments, adversaries can apply both link and anchor text spamming techniques.

2.3 Cloaking and Redirection
Cloaking is a way to provide different versions of a page to crawlers and users based on information contained in a request. If used with good motivation, it can even help search engine companies, because in this case they don't need to parse a page in order to separate the core content from the noisy one (advertisements, navigational elements, rich GUI elements). However, if exploited by spammers, cloaking takes an abusive form. In this case adversarial site owners serve different copies of a page to a crawler and to a user with the goal of deceiving the former [28; 108; 110; 75]. For example, a surrogate page can be served to the crawler to manipulate ranking, while users are served a user-oriented version of the page. To distinguish users from crawlers, spammers analyze the user-agent field of an HTTP request and keep track of the IP addresses used by search engine crawlers. The other strategy is to redirect users to malicious pages by executing JavaScript activated by a page onLoad() event or a timer. It is worth mentioning that JavaScript redirection spam is the most widespread and the most difficult to detect by crawlers, since crawlers are mostly script-agnostic [29].

2.4 Click Spam
Since search engines use click stream data as implicit feedback to tune ranking functions, spammers are eager to generate fraudulent clicks with the intention to bias those functions towards their websites. To achieve this goal spammers submit queries to a search engine and then click on the links pointing to their target pages [92; 37]. To hide the anomalous behaviour they deploy click scripts on multiple machines or even in large botnets [34; 88]. The other incentive for spammers to generate fraudulent clicks comes from online advertising [60]. In this case, in reverse, spammers click on the ads of competitors in order to decrease their budgets, drive them to zero, and place their own ads on the same spot.

3. ALGORITHMS
All the algorithms proposed to date can be roughly categorized into three groups. The first one consists of techniques which analyze content features, such as word counts or language models [44; 85; 100; 81; 101; 89; 40], and content duplication [42; 43; 103]. Another group of algorithms utilizes link-based information such as neighbour graph connectivity [25; 47; 49; 48; 119; 118; 116], performs link-based trust and distrust propagation [86; 55; 71; 99; 112; 12; 52; 26; 66; 11; 5; 8] or link pruning [16; 84; 73; 93; 74; 35; 109; 32; 111; 115; 6], applies graph-based label smoothing [121; 1; 30], and studies statistical anomalies [4; 7; 38; 9]. Finally, the last group includes algorithms that exploit click stream data [92; 37; 60], user behaviour data [77; 78], query popularity information [28; 10], and HTTP session information [107].
3.1 Content-based Spam Detection
A seminal line of work on content-based anti-spam algorithms has been done by Fetterly et al. [43; 44; 42; 85]. In [44] they propose that web spam pages can be identified through statistical analysis. Since spam pages are usually automatically generated, using phrase stitching and weaving techniques [54], and aren't intended for human visitors, they exhibit anomalous properties. The researchers found that the URLs of spam pages have an exceptional number of dots, dashes and digits, and exceptional length. They report that 80 of the 100 longest discovered host names refer to adult websites, while 11 refer to financial-credit-related websites. They also show that the pages themselves have a duplicating nature: most spam pages that reside on the same host have very low word count variance. Another interesting observation is that spam page content changes very rapidly (it can even change completely on every request). Specifically, they studied the average amount of week-to-week change of all the web pages on a given host and found that the most volatile spam hosts can be detected with 97.2% accuracy based only on this feature. All the proposed features can be found in the paper [44].

In their other work [42; 43] they studied content duplication and found that the largest clusters with duplicate content are spam. To find such clusters and duplicate content they apply the shingling method [22] based on Rabin fingerprints [91; 21]. Specifically, they first fingerprint each of the n words on a page using a primitive polynomial PA; second, they fingerprint each token from the first step with a different primitive polynomial PB using prefix deletion and extension transformations; third, they apply m different fingerprinting functions to each string from the second stage and retain the smallest of the n resulting values for each of the m fingerprinting functions. Finally, the document is represented as a bag of m fingerprints and clustering is performed by taking the transitive closure of the near-duplicate relationship (two documents are near-duplicates if their shingles agree in two out of six of the non-overlapping runs of fourteen shingles [42]). They also mined a list of popular phrases by sorting (i,s,d) triplets (where s is the i-th shingle of document d) lexicographically and taking sufficiently long runs of triples with matching i and s values. Based on this study they conclude that starting from the 36th position one can observe phrases that are evidence of machine-generated content. These phrases can be used as an additional input, parallel to common spam words, for a "bag of words"-based spam classifier.

In [85] they continue their analysis and provide a handful of other content-based features. Finally, all these features are blended into a classification model within C4.5, boosting, and bagging frameworks. They report 86.2% true positive and 97.8% true negative rates for boosting of ten C4.5 trees. Recent work [40] describes a thorough study of how various features and machine learning models contribute to the quality of a web spam detection algorithm. The authors achieved superior classification results using state-of-the-art learning models, LogitBoost and RandomForest, and only cheap-to-compute content features. They also showed that computationally demanding and global features, for instance PageRank, yield only a negligible additional increase in quality. Therefore, the authors claim that a more careful and appropriate choice of the machine learning model is very important.

Another group introduces features based on HTML page structure to detect script-generated spam pages [103]. The underlying idea, that spam pages are machine-generated, is similar to the work discussed above [42; 43]. However, here the authors make a non-traditional preprocessing step by removing all the content and keeping only the layout of a page. Thus, they study page duplication by analyzing its layout and not its content. They apply fingerprinting techniques [91; 21] with subsequent clustering to find groups of structurally near-duplicate spam pages.

There is a line of work dedicated to language modelling for spam detection. [81] presents an approach to spam detection in blogs by comparing the language models [59] for blog comments and for the pages linked from these comments. The underlying idea is that these models are likely to be substantially different for a blog and a spam page due to the random nature of spam comments. They use KL-divergence as a measure of discrepancy between two language models (probability distributions) Θ1 and Θ2:

KL(Θ1 ‖ Θ2) = \sum_{w} p(w|Θ1) \log \frac{p(w|Θ1)}{p(w|Θ2)}.   (2)

The beneficial trait of this method is that it doesn't require any training data.
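As an illustration of this language-model comparison, below is a small sketch of Equation (2) for two additively smoothed unigram models; the smoothing scheme and the tokenization are assumptions made for the example, not details of [81].

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=0.1):
    """Unigram language model with additive smoothing over a shared vocabulary;
    the smoothing choice is just one reasonable option for the sketch."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(theta1, theta2):
    """KL(theta1 || theta2) as in Equation (2); both models share a vocabulary,
    so theta2[w] is always positive thanks to smoothing."""
    return sum(p * math.log(p / theta2[w]) for w, p in theta1.items() if p > 0)

comment_tokens = "buy cheap pills online now".split()
page_tokens = "photos from our hiking trip in the alps".split()
vocab = set(comment_tokens) | set(page_tokens)
divergence = kl_divergence(unigram_lm(comment_tokens, vocab),
                           unigram_lm(page_tokens, vocab))
# A large divergence suggests the comment is unrelated to the page it links to,
# which is typical of comment spam.
```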
[101; 89] extend the analysis of linguistic features for web spam detection by considering lexical validity, lexical and content diversity, syntactical diversity and entropy, emotiveness, usage of passive and active voices, and various other NLP features.

Finally, [10] proposes a number of features based on the occurrence of keywords on a page that are either of high advertising value or highly spammed. The authors investigate the discriminative power of the following features: the Online Commercial Intention (OCI) value assigned to a URL by Microsoft AdCenter (adlab.msn.com/OCI/oci.aspx), the Yahoo! Mindset classification of a page as either commercial or non-commercial (mindset.research.yahoo.com), Google AdWords popular keywords (adwords.google.com/select/keywordtoolexternal), and the number of Google AdSense ads on a page (google.com/adsense). They report an increase in accuracy by 3% over [24], which doesn't use these features.

Similar ideas were applied to cloaking detection. In [28] search engine query logs and online advertising click-through logs are analyzed with respect to query popularity and monetizability. The definition of query popularity is straightforward; query monetizability is defined as the revenue generated by all ads that are served in response to the query (the overlap between these two categories is reported to be 17%). The authors take the top 5000 queries from each query category, request the top 200 links for each query four times, providing different agent fields to imitate requests by a user (u) and a crawler (c), and then apply the cloaking test (3), which is a modified version of a test proposed in earlier work on cloaking detection [108; 110]:

CloakingScore(p) = \frac{\min[D(c_1,u_1), D(c_2,u_2)]}{\max[D(c_1,c_2), D(u_1,u_2)]},   (3)

where

D(a_1,a_2) = 1 - 2\,\frac{|a_1 \cap a_2|}{|a_1 \cup a_2|}   (4)

is a normalized term frequency difference for two copies of a page, each represented as a set of terms, possibly with repetitions. They report 0.75 and 0.985 precision at high recall values for popular and monetizable queries correspondingly, which suggests that the proposed technique is useful for cloaking detection. However, the methods described in [28; 108; 110] might have a relatively high false positive rate, since legitimate dynamically generated pages generally contain different terms and links on each access, too.
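A minimal sketch of the cloaking test of Equations (3)-(4) is shown below, assuming the four page copies are given as term multisets; the handling of a zero denominator is an assumption added for robustness and is not prescribed by the cited work.

```python
from collections import Counter

def term_difference(a1, a2):
    """D(a1, a2) from Equation (4); a1 and a2 are term multisets (Counters),
    with multiset intersection and union taken element-wise."""
    inter = sum((a1 & a2).values())
    union = sum((a1 | a2).values())
    return 1.0 - 2.0 * inter / union

def cloaking_score(c1, c2, u1, u2):
    """CloakingScore from Equation (3): the crawler-vs-user difference normalized
    by the natural variation between two crawler copies and two user copies."""
    num = min(term_difference(c1, u1), term_difference(c2, u2))
    den = max(term_difference(c1, c2), term_difference(u1, u2))
    return num / den if den != 0 else 0.0   # zero-variation fallback is an assumption
```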
To overcome this shortcoming an improved method was proposed [75], which is based on the analysis of the structural (tag) properties of a page. The idea is to compare multisets of tags, rather than the words and links of pages, to compute the cloaking score.

3.2 Link-based Spam Detection
All link-based spam detection algorithms can be subdivided into five groups. The first group exploits the topological relationship (distance, co-citation, similarity) between web pages and a set of pages for which labels are known [86; 55; 71; 99; 112; 12; 52; 26; 66; 11; 5; 8]. The second group of algorithms focuses on the identification of suspicious nodes and links and their subsequent downweighting [16; 84; 73; 93; 74; 35; 109; 32; 111; 115; 6]. The third one works by extracting link-based features for each node and using various machine learning algorithms to perform spam detection [4; 7; 38; 9]. The fourth group of link-based algorithms uses the idea of labels refinement based on web graph topology, when preliminary labels predicted by a base algorithm are modified using propagation through the hyperlink graph or a stacked classifier [25; 47; 49; 48]. Finally, there is a group of algorithms which is based on graph regularization techniques for web spam detection [121; 1; 30].

3.2.1 Preliminaries

• Web Graph Model. We model the Web as a graph G = (V,E) with vertices V, representing web pages (one can also analyze a host-level web graph), and directed weighted edges E, representing hyperlinks between pages. If a web page p_i has multiple hyperlinks to a page p_j, we collapse all these links into one edge (i,j) ∈ E. Self loops aren't allowed. We denote the set of pages linked from a page p_i as Out(p_i) and the set of pages pointing to p_i as In(p_i). Finally, each edge (i,j) ∈ E can have an associated non-negative weight w_ij. A common strategy is to assign weights w_ij = 1/|Out(p_i)|, though other strategies are possible. For instance, in [2] they assign weights proportional to the number of links between pages. In matrix notation a web graph model is represented by a transition matrix M defined as

M_ij = w_ij if (i,j) ∈ E, and 0 otherwise.   (5)

• PageRank [86] uses link information to compute global importance scores for all pages on the web. The key underlying idea is that a link from a page p_i to a page p_j represents an endorsement or trust of page p_i in page p_j, and the algorithm follows the repeated improvement principle, i.e. the true score is computed as the convergence point of an iterative updating process. The most popular and simple way to introduce PageRank is the linear system formulation. In this case the PageRank vector for all pages on the web is defined as the solution of the matrix equation

\vec{\pi} = (1-c) \cdot M^T \vec{\pi} + c \cdot \vec{r},   (6)

where c is a damping factor and \vec{r} is a static PageRank vector. For non-personalized PageRank it is the uniform vector (1/N, ..., 1/N), where N = |V|. It is worth noting that this is needed to meet the conditions of the Perron-Frobenius theorem [15] and to guarantee the existence of a stationary distribution for the corresponding Markov chain. It also has the semantic meaning that at any given moment a "random surfer" can visit any page with non-zero probability. (A short power-iteration sketch for Equation (6) is given right after this list of preliminaries.)

For a non-uniform static vector \vec{r} the solution is called personalized PageRank (PPR) [46; 56] and the vector \vec{r} is called a personalization, random jump or teleportation vector.

There are a few useful properties of PageRank. First, it is linear in \vec{r}, i.e. if \vec{\pi}_1 is the solution of (6) with the personalization vector \vec{r}_1 and \vec{\pi}_2 is the solution of (6) with the personalization vector \vec{r}_2, then the vector (\vec{\pi}_1 + \vec{\pi}_2)/2 is the solution of equation (6) with the personalization vector (\vec{r}_1 + \vec{r}_2)/2. As a consequence we can expand the uniform personalized PageRank in the following way:

PPR(\vec{r}) = \frac{1}{N} \sum_{v \in V} PPR(\chi_v),   (7)

where \chi_v is the teleportation vector consisting of all zeros except for the node v, such that \chi_v(v) = 1 (we use this property in Section 3.2.2). Second, PageRank has an interpretation as the probability that a random walk terminates at a given vertex, where the walk length follows the geometric distribution [62; 45], i.e. the probability of making j steps before termination is equal to c(1-c)^j, and the following representation is valid:

PPR(\vec{r}) = c \sum_{j=0}^{\infty} (1-c)^j (M^T)^j \vec{r}.   (8)

• HITS [68] assigns hub and authority scores to each page on the web and is based on the following observation. A page p_i is a good hub and has a high hub score h_i if it points to many good (authoritative) pages, and a page is a good authority and has a high authority score a_i if it is referenced by many good hubs. As we see, this algorithm also has the repeated improvement principle behind it. In its original form the algorithm considers pages relevant to a query based on a keyword-based ranking (the root set), all the pages that point to them, and all the pages referenced by these pages. For this subset of the Web an adjacency matrix is defined, denoted as A. The corresponding authority and hub scores for all pages from the subset are formalized in the following pair of equations:

\vec{a} = A^T \vec{h},  \vec{h} = A \vec{a}.   (9)

It can be shown that the solution (\vec{h}, \vec{a}) of the system of equations (9) under repetitive updating converges to the principal eigenvectors of A A^T and A^T A correspondingly.
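Below is a minimal power-iteration sketch for Equation (6); it assumes the transition matrix M of Equation (5) is already row-stochastic (e.g., dangling nodes have been handled beforehand), and the tolerance and iteration limit are illustrative.

```python
import numpy as np

def pagerank(M, c=0.15, r=None, tol=1e-10, max_iter=200):
    """Solve Equation (6) by power iteration: pi = (1 - c) * M^T pi + c * r.
    M is the (row-stochastic) transition matrix of Equation (5); r is the
    uniform or personalized teleportation vector."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n) if r is None else r
    pi = r.copy()
    for _ in range(max_iter):
        new_pi = (1 - c) * M.T @ pi + c * r
        if np.abs(new_pi - pi).sum() < tol:   # L1 convergence check
            return new_pi
        pi = new_pi
    return pi
```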
[83] studies the robustness of the PageRank and HITS algorithms with respect to small perturbations. Specifically, they analyzed how severely the ranks change if a small portion of the Web is modified (removed). They report that PageRank is stable to small perturbations of a graph, while HITS is quite sensitive. [18] conducts a comprehensive analysis of PageRank properties and of how link farms can affect the ranking. They prove that for any link farm and any set of target pages the sum of PageRank scores over the target pages is at least a linear function of the number of pages in the link farm.

[53] studies optimality properties of link farms. They derive the boosting factor for a one-target spam farm and prove that this farm is optimal if all the boosting pages in the farm have no links between them, they all point to the target page, and the target page links back to a subset of them. They also discuss motivations of spammers to collude (form alliances), and study the optimality of web spam rings and spam quasi-cliques. There is also relevant work which analyzes properties of personalized PageRank [72] and a survey on efficient PageRank computation [14].

3.2.2 Algorithms based on labels propagation
The key idea behind the algorithms from this group is to consider a subset of pages on the web with known labels and then compute the labels of other nodes by various propagation rules. One of the first algorithms from this category is TrustRank [55], which propagates trust from a small seed set of good pages via a personalized PageRank. The algorithm relies on the principle of approximate isolation of the good set — good pages point mostly to good pages. To select a seed set of reputable pages they suggest using an inverse PageRank, which operates on the graph with all edges reversed. Having computed the inverse PageRank score for all pages on the Web, they take the top-K pages and let a human annotator judge the reputation of these pages. Then they construct a personalization vector in which only the components corresponding to reputable judged pages are non-zero. Finally, personalized PageRank is computed. TrustRank shows better properties than PageRank for web spam demotion.

The follow-up work on trust propagation is Anti-TrustRank [71]. Opposite to TrustRank, it considers distrust propagation from a set of known spam pages on the inverted graph. A seed set is selected among pages with high PageRank values. They found that their approach outperforms TrustRank on the task of finding spam pages with high precision and is able to capture spam pages with higher PageRank values than TrustRank. There is also an algorithm, called BadRank [99], which proposes the idea of computing the badness of a page using an inverse PageRank computation. One can say that the relation between PageRank and TrustRank is the same as between BadRank and Anti-TrustRank.
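The following sketch illustrates the seed-based propagation idea behind TrustRank; it reuses the recursion of Equation (6) with a teleportation vector concentrated on a vetted good seed set. The seed representation and the fixed number of iterations are assumptions of the example.

```python
import numpy as np

def trustrank(M, good_seeds, c=0.15, n_iter=100):
    """TrustRank-style propagation: personalized PageRank whose teleportation
    vector is concentrated on a manually vetted set of good seed pages.
    M is the row-stochastic transition matrix of Equation (5)."""
    n = M.shape[0]
    r = np.zeros(n)
    r[list(good_seeds)] = 1.0 / len(good_seeds)   # trust is injected only at good seeds
    t = r.copy()
    for _ in range(n_iter):
        t = (1 - c) * M.T @ t + c * r             # same recursion as Equation (6)
    return t

# Anti-TrustRank uses the same recursion with known spam pages as seeds and with
# all edges reversed, so that distrust flows back to the pages that link to spam.
```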
[112] further researches the idea of propagation by analyzing how trust and distrust propagation strategies can work together. First, they challenge the way trust is propagated in the TrustRank algorithm — each child (in this paragraph we refer to out-neighbours of a page as children and to in-neighbours as parents) gets an equal part of the parent's trust, c·TR(p)/|Out(p)| — and propose two more strategies:

• constant splitting, when each child gets the same discounted part of the parent's trust score, c·TR(p), without respect to the number of children;

• logarithmic splitting, when each child gets an equal part of the parent's score normalized by the logarithm of the number of children, c·TR(p)/log(1+|Out(p)|).

They also analyze various partial trust aggregation strategies, whereas TrustRank simply sums up the trust values from each parent. Specifically, they consider the maximum share strategy, when the maximum value sent by the parents is used; and the maximum parent strategy, when propagation is performed so as to guarantee that a child's score doesn't exceed the maximum of the parents' scores. Finally, they propose to use a linear combination of trust and distrust values:

TotalScore(p) = η·TR(p) − β·AntiTR(p),   (10)

where η, β ∈ (0,1). According to their experiments, the combination of both propagation strategies results in better spam demotion (80% of spam sites disappear from the top ten buckets in comparison with TrustRank and PageRank), and maximum share with logarithmic splitting is the best way to compute trust and distrust values. The idea of trust and distrust propagation in the context of reputation systems was studied in [51].

Two algorithms [12; 52] utilize the PageRank decomposition property (Section 3.2.1) to estimate the amount of undeserved PageRank coming from suspicious nodes. In [12] the SpamRank algorithm is proposed; it finds supporters of a page using Monte Carlo simulations [46], assigns a penalty score to each page by analyzing whether the personalized PageRank scores PPR(χ_j)_i are distributed with a bias towards suspicious nodes, and finally computes SpamRank for each page as a PPR with the personalization vector initialized with the penalty scores. The essence of the algorithm is in the penalty score assignment. The authors partition all supporters of a page by their PageRank scores using binning with exponentially increasing width, compute the correlation between the index of a bin and the logarithm of the count in the bin, and then assign a penalty to supporters by summing up the correlation scores of the pages which they support. The insight behind the proposed approach is that PageRank follows a power law distribution [87].

The concept of spam mass was introduced in [52]. Spam mass measures the amount of PageRank that comes from spam pages. Similar to TrustRank, it needs a core of known good pages to estimate the amount of PageRank coming from good pages. The algorithm works in two stages. First, it computes the PageRank \vec{\pi} and TrustRank \vec{\pi}' vectors and estimates the amount of spam mass for each page using the formula \vec{m} = (\vec{\pi} − \vec{\pi}')/\vec{\pi} (component-wise). Second, a threshold decision, which depends on the value of the spam mass, is made. It is worth noting that the algorithm can effectively utilize knowledge about bad pages.
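A sketch of the spam mass decision rule is given below; it assumes the PageRank and TrustRank vectors have already been computed (e.g., with the sketches above), and the threshold value is purely illustrative.

```python
import numpy as np

def spam_mass(pi, pi_trust, threshold=0.5):
    """Relative spam mass as in [52]: m = (pi - pi') / pi, computed component-wise,
    where pi is PageRank and pi' is TrustRank from a good-seed core. Pages whose
    estimated spam mass exceeds the threshold are flagged as candidates.
    PageRank with teleportation is strictly positive, so the division is safe."""
    m = (pi - pi_trust) / pi
    return m, m > threshold
```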
Credibility-based link analysis is described in [26]. In this work the authors define the concept of k-Scoped Credibility for each page, propose several methods for its estimation, and show how it can be used for web spam detection. Specifically, they first define the concept of a BadPath, a k-hop random walk starting from the current page and ending at a spam page, and then compute the tuned k-Scoped Credibility score as

C_k(p) = \Big(1 − \sum_{l=1}^{k} \sum_{path_l(p) \in BadPath_l(p)} P[path_l(p)]\Big) \cdot γ(p),   (11)

where k is a parameter specifying the length of the random walk, γ(p) is a credibility penalty factor that is needed to deal with only partial knowledge of all spam pages on the Web (several strategies to define γ are proposed), and P[path_l(p)] = \prod_{i=0}^{l−1} w_{i,i+1}. The credibility score can be used to downweight or prune low-credibility links before link-based ranking, or to change the personalization vector in PPR, TrustRank, or Anti-TrustRank.

In [66] the concept of an anchor is defined as a subset of pages with known labels, and various anchor-based proximity measures on graphs are studied. They discuss personalized PageRank; harmonic rank, which is defined via a random walk on a modified graph with an added source and sink, such that all anchor vertices are connected to the source and all vertices are connected to the sink with probability c; and non-conserving rank, which is a generalization of personalized PageRank satisfying the equation

\vec{\pi} = (I − (1−c) \cdot M^T)^{−1} \vec{r}.   (12)

They report that non-conserving rank is the best for trust propagation, while harmonic rank better suits distrust propagation.

A spam detection algorithm utilizing page similarity is proposed in [11], where similarity-based top-K lists are used to compute a spam score for a new page. The authors consider co-citation, CompanionRank, SimRank [61], and kNN-SVD projections as methods to compute the similarity between pages. First, for the page to be classified a top-K result list is retrieved using some similarity measure. Second, using the retrieved pages the following four values are computed: the fraction of labeled spam pages in the list (SR), the number of labeled spam pages divided by the number of labeled good pages in the list (SON), the sum of the similarity values of labeled spam pages divided by the total similarity value of the pages retrieved (SVR), and the sum of the similarity values of labeled spam pages divided by the sum of the similarity values of labeled good pages (SVONV). Third, a threshold-based rule is used to make a decision. According to their experiments, similarity-based spam detection (SVR, SVONV) performs better at high levels of recall, while Anti-TrustRank [71] and the combined trust-distrust algorithm [112] show higher precision at low recall levels.

A seminal line of work was done by Baeza-Yates et al. [5; 8; 9; 7]. In [5], inspired by the PageRank representation (Equation 8), they propose the concept of functional rank, which is a generalization of PageRank via various damping functions. They consider ranking based on the general formula

\vec{p} = \frac{1}{N} \sum_{j=0}^{\infty} damping(j)\,(M^T)^j \vec{1},   (13)

and prove the theorem that any damping function whose dampings sum to 1 yields a well-defined normalized functional ranking. They study exponential (PageRank), linear, quadratic hyperbolic (TotalRank), and general hyperbolic (HyperRank) damping functions, and propose efficient methods of rank computation. In [8] they research the application of general damping functions to web spam detection and propose the truncated PageRank algorithm, which uses a truncated exponential model. The key observation behind the algorithm is that spam pages have a large number of distinct supporters at short distances, while this number is lower than expected at longer distances. Therefore, they suggest using a damping function that ignores the direct contribution of the first few levels of in-links:

damping(j) = 0 if j ≤ J, and D(1−c)^j otherwise.   (14)

In this work they also propose a probabilistic counting algorithm to efficiently estimate the number of supporters of a page. Link-based feature analysis and classification models using link-based and content-based features are studied in [7; 9] correspondingly.
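To show how a damping function plugs into the functional ranking of Equation (13), here is a small sketch that also implements the truncated damping of Equation (14); truncating the series to a finite number of levels is an approximation made only for the example.

```python
import numpy as np

def truncated_damping(j, J=3, c=0.15, D=1.0):
    """Damping function of Equation (14): the first J levels of in-links are
    ignored; D is a normalization constant (illustrative value here)."""
    return 0.0 if j <= J else D * (1.0 - c) ** j

def functional_rank(M, damping, n_levels=30):
    """Functional ranking of Equation (13) for an arbitrary damping function,
    truncated to a finite number of levels for this sketch."""
    n = M.shape[0]
    contrib = np.full(n, 1.0 / n)      # (1/N) * all-ones vector
    rank = np.zeros(n)
    for j in range(n_levels):
        rank += damping(j) * contrib
        contrib = M.T @ contrib        # contributions arriving over paths of length j+1
    return rank

# functional_rank(M, truncated_damping) gives low scores to pages whose support
# comes mostly from supporters within a few hops, a typical link-farm signature.
```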
3.2.3 Link pruning and reweighting algorithms
Algorithms belonging to this category tend to find unreliable links and demote them. The seminal work [16] raises problems with the HITS [68] algorithm, such as the domination of mutually reinforcing relationships and neighbour graph topic drift, and proposes methods for their solution by augmenting link analysis with content analysis. They propose to assign each edge an authority weight of 1/k if there are k pages from one site linking to a single page on another site, and to assign a hub weight of 1/l if a single page from the first site points to l pages on the other site. To combat topic drift they suggest using query expansion, by taking the top-K frequent words from each initially retrieved page, and candidate page set pruning, by taking page relevance as a factor in the HITS computation. The same problems are studied in [84], where a projection-based method is proposed to compute authority scores. They modify the eigenvector part of the HITS algorithm in the following way. Instead of computing the principal eigenvector of A^T A, they compute all eigenvectors of the matrix and then take the eigenvector with the biggest projection onto the root set (the set of pages originally retrieved by the keyword search engine, as in HITS); finally, they report authority scores as the corresponding components of this eigenvector.

Another group introduces the concept of a tightly-knit community (TKC) and proposes the SALSA algorithm [73], which performs two random walks to estimate authority and hub scores for pages in a subgraph initially retrieved by keyword-based search. It is worth noting that the original and the inverted subgraphs are considered to get the two different scores. An extension of this work [93] considers the clustering structure of pages and their linkage patterns to downweight bad links. The key trick is to count the number of clusters pointing to a page instead of the number of individual nodes. In this case the authority of a page is defined as follows:

a_j = \sum_{k: j \in l(k)} \frac{1}{\sum_{i: j \in l(i)} S_{ik}},   (15)

where S_{ik} = |l(i) ∩ l(k)| / |l(i) ∪ l(k)| and l(i) is the set of pages linked from page p_i. The approach acts like the popularity ranking methods discussed in [20; 27].

[74] studies the "small-in-large-out" problem of HITS and proposes to reweight incoming and outgoing links for pages from the root set in the following way. If there is a page whose in-degree is among the three smallest ones and whose out-degree is among the three largest ones, then set the weight 4 for all in-links of all root pages, otherwise set it to 1. Run one iteration of the HITS algorithm without normalization. Then, if there exists a root page whose authority value is among the three smallest ones and whose hub value is among the three largest ones, set the weight 4 for all in-links of all root pages, and then run the HITS algorithm again.

[35] introduces the concept of "nepotistic" links — links that are present for reasons other than merit, for instance, navigational links on a website or links between pages in a link farm. They apply the C4.5 algorithm to recognize nepotistic links using 75 different binary features such as IsSimilarHeaders, IsSimilarHost, and whether the number of shared in-links is greater than a threshold. Then they suggest pruning or downweighting nepotistic links. In their other work [109] they continue studying links in densely connected link farms. The algorithm operates in three stages. First, it selects a set of bad seed pages, guided by the definition that a page is bad if the intersection of its incoming and outgoing neighbours is greater than a threshold. Second, it expands the set of bad pages following the idea that a page is bad if it points to a lot of bad pages from the seed set. Finally, links between the expanded set of bad pages are removed or downweighted, and any link-based ranking algorithm [86; 68] can be applied. Similar ideas are studied on a host level in [32].

In [111] the concept of a complete hyperlink is proposed — a hyperlink coupled with its associated anchor text — which is used to identify pages with suspiciously similar linkage patterns. The rationale behind their approach is that pages that have a high complete hyperlink overlap are more likely to be machine-generated pages from a link farm or pages with duplicated content. The algorithm works as follows. First, it builds a base set of documents, as in HITS, and generates a page-hyperlink matrix using complete hyperlinks, where A_ij = 1 if a page p_i contains complete hyperlink chl_j. Then it finds bipartite components with size greater than a threshold in the corresponding graph, where the parts are pages and links, and downweights complete hyperlinks from large components. Finally, a HITS-like algorithm is applied on the reweighted adjacency matrix.

[115] notices that the PageRank score of pages that achieved high ranks by link-spamming techniques correlates with the damping factor c. Using this observation the authors identify suspicious nodes, whose correlation is higher than a threshold, and downweight their outgoing links with a function proportional to the correlation. They also prove that spammers can amplify the PageRank score by at most 1/c and experimentally show that even two-node collusion can yield a big PageRank amplification. The follow-up work [6] performs a more general analysis of different collusion topologies, where they show that due to the power law distribution of PageRank [87] the increase in PageRank is negligible for top-ranked pages. The work is similar to [53; 3].
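As a concrete illustration of the downweighting theme running through this subsection, the sketch below implements the host-based authority weights proposed in [16] at the start of the subsection (weight 1/k when k pages of one host point to the same page); the edge-list and host-map representation are assumptions of the example, and the symmetric 1/l hub weight can be computed analogously on out-links.

```python
from collections import defaultdict

def cociting_authority_weights(edges, host_of):
    """Assign each edge (p, q) an authority weight of 1/k, where k is the number
    of pages on host_of[p] that link to the same target page q, as in [16].
    edges is a list of (source_page, target_page) pairs; host_of maps a page to its host."""
    pages_per_host_target = defaultdict(int)
    for p, q in edges:
        pages_per_host_target[(host_of[p], q)] += 1   # k for this (host, target) pair
    return {(p, q): 1.0 / pages_per_host_target[(host_of[p], q)] for p, q in edges}
```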
3.2.4 Algorithms with link-based features
Algorithms from this category represent pages as feature vectors and perform standard classification or clustering analysis. [4] studies link-based features to perform website categorization based on functionality. Their assumption is that sites sharing similar structural patterns, such as average page level or number of outlinks per leaf page, share similar roles on the Web. For example, web directories mostly consist of pages with a high ratio of outlinks to inlinks, form a tree-like structure, and the number of outlinks increases with the depth of a page; while spam sites have specific topologies aimed at optimizing the PageRank boost and demonstrate high content duplication. Overall, each website is represented as a vector of 16 connectivity features and clustering is performed using the cosine as a similarity measure. The authors report that they managed to identify 183 web spam rings forming 31 clusters in a dataset of 1100 sites.

Numerous link-based features, derived using PageRank, TrustRank, and truncated PageRank computation, are studied in [7]. A mixture of content-based and link-based features is used to combat web spam in [38; 9], and spam in blogs in [69; 76].

3.2.5 Algorithms based on labels refinement
The idea of labels refinement has been studied in the machine learning literature for general classification problems for a long time. In this section we present algorithms that apply this idea to web spam detection. In [25] a few web graph-based refinement strategies are proposed. The algorithm in [25] works in two stages. First, labels are assigned using the spam detection algorithm discussed in [7]. Then, at the second stage, labels are refined in one of three ways. One strategy is to perform Web graph clustering [67] and then refine the labels guided by the following rule: if the majority of pages in a cluster is predicted to be spam, all pages in the cluster are labeled as spam. Formally, they assume that the predictions of the base algorithm are in [0,1], compute the average value over the cluster, and compare it against a threshold. The same procedure is done for the non-spam prediction.
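A minimal sketch of this cluster-based relabelling step follows; the thresholds are illustrative, and the clusters are assumed to be given as lists of page indices.

```python
import numpy as np

def refine_by_clusters(predictions, clusters, spam_threshold=0.75, ham_threshold=0.25):
    """Cluster-based refinement in the spirit of [25]: if the average base prediction
    in a cluster is above (below) a threshold, relabel the whole cluster as spam
    (non-spam). predictions is an array of base-classifier scores in [0, 1]."""
    refined = predictions.copy()
    for members in clusters:
        avg = np.mean(predictions[members])
        if avg >= spam_threshold:
            refined[members] = 1.0      # whole cluster relabelled spam
        elif avg <= ham_threshold:
            refined[members] = 0.0      # whole cluster relabelled non-spam
    return refined
```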
The other strategy of label refinement, which is based on propagation with random walks, is proposed in [120]. The key part is to initialize the personalization vector \vec{r} in PPR by normalizing the predictions of the base algorithm: r_p = s(p) / \sum_{p \in V} s(p), where s(p) is the prediction of the base algorithm and r_p is the component of the vector \vec{r} corresponding to page p. Finally, the third strategy is to use stacked graphical learning [70]. The idea is to extend the original feature representation of an object with a new feature, which is the average prediction for related pages in the graph, and to run the machine learning algorithm again. They report a 3% increase over the baseline after two rounds of stacked learning.

A few other relabelling strategies are proposed in [47; 49; 48]. [47] suggests constructing an entirely new feature space by utilizing predictions from the first stage: the label assigned by the base classifier, the percentage of incoming links coming from spam, the percentage of outgoing links pointing to spam, etc. Overall, they consider seven new features and report an increase in performance over the base classifier. [48] proposes to use a feature re-extraction strategy using clustering, propagation, and neighbour-graph analysis. Self-training was applied in [49] so as to reduce the size of the training dataset required for web spam detection.

3.2.6 Graph regularization algorithms
Algorithms in this group perform transductive inference and utilize the Web graph to smooth the predicted labels. According to experimental results, graph regularization algorithms for web spam detection can be considered the state-of-the-art at the time of writing. The work [121] builds a discrete analogue of classification regularization theory [102; 104] by defining discrete operators of gradient, divergence and Laplacian on directed graphs, and proposes the following algorithm. First, they compute an inverse weighted PageRank with transition probabilities defined as a_ij = w_ji / |In(p_i)|. Second, they build the graph Laplacian

L = Π − α \frac{ΠA + A^TΠ}{2},   (16)

where α is a user-specified parameter in [0,1], A is a transition matrix, and Π is a diagonal matrix with the PageRank scores (computed at the first stage) on the diagonal. Then they solve the matrix equation

L\vec{φ} = Π\vec{y},   (17)

where \vec{y} is a vector with components in {−1,0,1} such that y_i = 1 if a page is normal, y_i = −1 if it is spam, and y_i = 0 if the label for the page is unknown. Finally, the classification decision is made based on the sign of the corresponding component of the vector \vec{φ}. It is worth noting that the algorithm requires strongly connected graphs.
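The smoothing step of [121] can be sketched as follows, assuming the transition matrix and the inverse weighted PageRank scores are already available; the small ridge term is an assumption added only for numerical stability and is not part of the original formulation.

```python
import numpy as np

def regularized_labels(A, pi, y, alpha=0.5):
    """Build the Laplacian of Equation (16) from a transition matrix A and
    (inverse weighted) PageRank scores pi, solve L phi = Pi y (Equation (17)),
    and classify pages by the sign of phi. y holds -1 (spam), +1 (normal), 0 (unknown)."""
    Pi = np.diag(pi)
    L = Pi - alpha * (Pi @ A + A.T @ Pi) / 2.0
    L_stab = L + 1e-9 * np.eye(len(pi))        # tiny ridge term for numerical stability
    phi = np.linalg.solve(L_stab, Pi @ y)
    return np.sign(phi)
```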
Another algorithm that follows regularization theory is described in [1]. There are two principles behind it. First, it addresses the fact that hyperlinks are not placed at random, which implies some degree of similarity between the linking pages [36; 27]. This, in turn, motivates adding a regularizer to the objective function to smooth the predictions. Second, it uses the principle of approximate isolation of good pages, which argues for an asymmetric regularizer. The final objective function has the following form:

Ω(\vec{w},\vec{z}) = \frac{1}{l} \sum_{i=1}^{l} L(\vec{w}^T\vec{x}_i + z_i, y_i) + λ_1 \|\vec{w}\|^2 + λ_2 \|\vec{z}\|^2 + γ \sum_{(i,j) \in E} a_{ij} Φ(\vec{w}^T\vec{x}_i + z_i, \vec{w}^T\vec{x}_j + z_j),   (18)

where L(a,b) is a standard loss function, \vec{w} is a vector of coefficients, \vec{x}_i and y_i are the feature representation and the true label correspondingly, z_i is a bias term, a_{ij} is the weight of the link (i,j) ∈ E, and Φ(a,b) = max[0, b−a]^2 is the regularization function. The authors provide two methods to find the solution of the optimization problem, using conjugate gradient and alternating optimization. They also study the issue of weight setting for a host graph and report that the logarithm of the number of links yields the best results. Finally, according to the experimental study, the algorithm has good scalability properties.
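To make the role of the asymmetric regularizer explicit, the following sketch evaluates an objective of the form of Equation (18) with a squared loss; it assumes, for brevity, that all pages in X are labeled and that larger predicted values mean "more spam-like", so that a link from a low-scoring page to a high-scoring page is what gets penalized. The loss choice and hyper-parameters are illustrative.

```python
import numpy as np

def asymmetric_graph_objective(w, z, X, y, edges, edge_weights,
                               lambda1=1.0, lambda2=1.0, gamma=1.0):
    """Objective in the spirit of Equation (18): data loss + L2 regularizers +
    asymmetric graph term Phi(a, b) = max(0, b - a)^2 over hyperlinks (i, j)."""
    f = X @ w + z                                    # per-page predictions w^T x_i + z_i
    loss = np.mean((f - y) ** 2)                     # squared loss stands in for L(a, b)
    reg = lambda1 * np.dot(w, w) + lambda2 * np.dot(z, z)
    graph = sum(a_ij * max(0.0, f[j] - f[i]) ** 2    # penalize "good -> spam"-looking links
                for (i, j), a_ij in zip(edges, edge_weights))
    return loss + reg + gamma * graph
```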
An interesting idea for extracting web spam URLs from SEO forums is proposed in [30]. The key observation is that on SEO forums spammers share links to their websites to find partners for building global link farms. The researchers propose to solve a URL classification problem using features extracted from an SEO forum, from the Web graph, and from the website itself, regularizing it with four terms derived from the link graph and a user-URL graph. The first term is defined as follows. The authors categorize actions on an SEO forum into three groups: posting a URL in the root of a thread (weight 3), posting a URL in a reply (weight 2), and viewing a URL in previous posts (weight 1). Then they build a user-URL bipartite graph where an edge weight is the sum of all weights associated with the corresponding actions. After that they compute SimRank for all pairs of URLs and introduce a regularization term via the Laplacian of the similarity matrix. The second regularization term is defined analogously, but now simply via the Laplacian of a standard transition matrix (see Eq. 5). The third and fourth asymmetric terms, defined via the Web graph transition matrix and its diagonal row- or column-aggregated matrices, are introduced to take into account the principle of approximate isolation of good pages. Finally, the resulting quadratic problem is solved. The authors report that even legitimately looking spam websites can be effectively detected using this method, and hence the approach complements the existing methods.

3.3 Miscellaneous
In this section we discuss algorithms that use non-traditional features and ideas to combat web spam.

3.3.1 Unsupervised spam detection
The problem of unsupervised web spam detection is studied in [119; 118; 116]. The authors propose the concept of spamicity and develop an online client-side algorithm for web spam detection. The key distinctive feature of their solution is that it doesn't need training data. At the core of their approach is the (θ,k)-page farm model introduced in [117], which allows computing the theoretical bound on the PageRank score that can be achieved using the optimal page farm with a given number of pages and links between them. The proposed algorithm works as follows. For a given page it greedily selects pages from the k-neighbourhood that contribute most to the PageRank score, following the definition

PRContrib(v,p) = PR[p,G] − PR[p, G(V−{v})] if v ≠ p, and (1−c)/N otherwise,   (19)

and at each iteration it computes the link-spamicity score as the ratio of the observed PageRank contribution from the selected pages over the optimal PageRank contribution. The algorithm uses a monotonicity property to limit the number of supporters that need to be considered. Finally, it marks the page as suspicious if the entire k-neighbourhood of the page has been processed and the link-spamicity score is still greater than a threshold. Analogous optimality conditions were proposed for page content. In this case the term-spamicity score of a page is defined as the ratio of the TFIDF score achieved by the observed content of the page over the TFIDF score that could be achieved by an "optimal" page that has the same number of words.
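A sketch of the quantities involved in the link-spamicity test is given below; it assumes the PageRank vectors with and without a candidate supporter have been computed elsewhere, and that the optimal contribution comes from the page farm bound of [117].

```python
def pagerank_contribution(pr_full, pr_without_v, v, p, N, c=0.15):
    """PRContrib(v, p) from Equation (19): the drop in p's PageRank when v is
    removed from the graph, with the (1 - c)/N baseline when v == p.
    pr_full and pr_without_v are precomputed PageRank vectors."""
    if v == p:
        return (1.0 - c) / N
    return pr_full[p] - pr_without_v[p]

def link_spamicity(observed_contrib, optimal_contrib):
    """Link-spamicity as the ratio of the observed PageRank contribution of the
    selected supporters to the bound achievable by an optimal page farm of the
    same size; values close to 1 indicate a suspiciously optimized neighbourhood."""
    return observed_contrib / optimal_contrib
```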
3.3.2 Algorithms based on user browsing behaviour
[77] proposes to incorporate user browsing data into web spam detection. The idea is to build a browsing graph G = (V,E,T,σ), where the nodes V are pages and the edges E are transitions between pages, T corresponds to staying time and σ denotes the random jump probability, and then compute an importance score for each page using a PageRank-like algorithm. The distinctive feature of the proposed solution is that it considers a continuous-time Markov process as the underlying model, because user browsing data include time information. Formally, the algorithm works as follows. First, using the staying times Z_1, ..., Z_{m_i} for a page i it finds the diagonal element q_ii of the matrix Q = P'(t) as a solution of the optimization problem

\min_{q_{ii}} \Big[\big(\bar{Z} + \tfrac{1}{q_{ii}}\big) − \tfrac{1}{2}\big(S^2 − \tfrac{1}{q_{ii}^2}\big)\Big]^2, \quad \text{s.t. } q_{ii} < 0,   (20)

where \bar{Z} and S^2 denote the sample mean and variance of the staying times. The non-diagonal elements of the matrix Q are estimated as

−\frac{q_{ij}}{q_{ii}} = c \frac{\tilde{w}_{ij}}{\sum_{k=1}^{N+1} \tilde{w}_{ik}} + (1−c)σ_j for i ∈ V, j ∈ \tilde{V}, and σ_j for i = N+1, j ∈ \tilde{V},   (21)

where an additional (N+1)-th pseudo-vertex is added to the graph to model the teleportation effect, such that all last pages in the corresponding browsing sessions are connected to it via edges with weights equal to the number of clicks on the last page, and the pseudo-vertex is connected to the first page of each session with a weight equal to the normalized frequency of visits for this page; σ_j is a random jump probability. At the second step, the stationary distribution is computed using the matrix Q, which is proven to correspond to the stationary distribution of the continuous-time Markov process.
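For completeness, here is a small sketch of that second step, computing the stationary distribution of a continuous-time Markov process from an estimated generator matrix Q; the least-squares formulation is one standard way to impose the normalization constraint and is an implementation choice of this example.

```python
import numpy as np

def ctmc_stationary_distribution(Q):
    """Stationary distribution of a continuous-time Markov process with generator
    matrix Q (rows summing to zero): solve pi Q = 0 subject to sum(pi) = 1."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])              # stationarity equations + normalization
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)    # least-squares solution of the system
    return pi
```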
The same mainlyincorporatesusersintorankingconstructionandpro- group introduced the way of large dataset creation for web videsaprofitablearbitrageopportunityfortheuserstocor- spam detection by extracting URLs from email spam mes- rect inaccuracies in the system. The key idea is that users sages [106]. Though not absolutely clean, the dataset con- are subject to an explicit information about revenue they tains about 350000 web spam pages. might “earn” within the system if they correct an erro- Similarly,HTTPsessionsareanalyzed,amongothers,inthe neous ranking. It is theoretically proven that the model contextofmaliciousredirectiondetectionproblem. Theau- with the specific incentives (revenue) structure guarantees thorsof[29]provideacomprehensivestudyoftheproblem, merit-based ranking and is resistant to spam. classifyallspamredirectiontechniquesinto3types(HTTP redirection,METARefresh,JavaScriptredirection),analyze 4. KEYPRINCIPLES thedistributionofvariousredirectiontypesontheWeb,and present a lightweight method to detect JavaScript redirec- Havinganalyzedalltherelatedworkdevotedtothetopicof tion, which is the most prevalent and difficult to identify webspammining,weidentifyasetofunderlyingprinciples type. that are frequently used for algorithms construction. SIGKDD Explorations Volume13,Issue 2 Page59