FINDING AND FIGHTING SEARCH ENGINE SPAM

by

Baoning Wu

A Dissertation
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Doctor of Philosophy
in
Computer Science

Lehigh University
2007

This dissertation is accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

(Date)

Brian D. Davison
Donald J. Hillman
Daniel P. Lopresti
Lin Lin (Management, College of Business & Economics)
Marc Najork (Microsoft Research)

Acknowledgments

Great thanks to my advisor, Prof. Brian D. Davison, for his insightful guidance, discussion, patience and suggestions throughout my Ph.D. studies. Without his guidance, I could not have finished this dissertation.

I am also thankful to my committee members, professors Donald J. Hillman, Daniel P. Lopresti and Lin Lin from Lehigh University, and Dr. Marc Najork from Microsoft Research, for their excellent guidance and suggestions.

I greatly appreciate the encouragement from my wife, Jie Song, and my parents for supporting me in pursuing this degree.

Many thanks to my lab mates, Wei Zhang, Lan Nie, Xiaoguang Qi and Vinay Goel. I appreciate their help in providing discussions and doing experiments with me on large real data sets to test the approaches proposed in this dissertation.

Many thanks to Dr. Russell Quong from Google Inc. and Dr. Kumar Chellapilla from Microsoft Live Labs for providing me with the opportunity to intern with these two IT giants. These internships gave me the opportunity to work with experts in my research area, to learn the state-of-the-art spam-fighting techniques and to use large-scale data sets drawn from the entire World Wide Web.

My work has been financially supported by different organizations over the five years during which I prepared this dissertation. The sources include an IT scholarship from Lehigh University, the National Science Foundation under awards ANI-9903052, IIS-0328825 and IIS-0545875, and Microsoft Live Labs ("Accelerating Search").
All the experiments were done with real web data sets. Great thanks to Urban Müller from the search.ch search engine for helpful discussions and for providing access to the search.ch dataset. Great thanks to the Stanford WebBase Project for providing its web crawl. Great thanks to the Laboratory of Web Algorithmics, Università degli Studi di Milano, and Yahoo Research Barcelona for making the UK-2006 dataset and labels available.

Without any of the above support, I would not have had a chance to finish this dissertation.

Contents

Acknowledgments  iii
Abstract  1

1 Introduction  2

2 Background  9
2.1 Traditional vector space model and TF-IDF ranking  10
2.2 Eigenvector and eigenvalue of a matrix  11
2.3 Link based ranking algorithms  12

3 Spamming techniques  14
3.1 Related work  14
3.2 Content spam  16
3.3 Link spam  18
3.4 Page-hiding spam  20
3.5 Other spam  21
3.6 Search engine optimization or marketing  22

4 Survey of existing anti-spam techniques  24
4.1 Introduction  24
4.2 Techniques used against different kinds of spam  26
4.2.1 Techniques to combat content based spam  27
4.2.2 Techniques to combat link based spam  29
4.2.3 Techniques to combat page-hiding based spam  41
4.3 Looking at anti-spam techniques from different angles  42
4.3.1 Common methods  42
4.3.2 Evaluation metrics  44
4.3.3 Automatic vs. semi-automatic  45
4.3.4 Online vs. offline calculation  47
4.4 Summary  49

5 Detecting link farms  52
5.1 Introduction  53
5.2 Background  55
5.3 Related work  56
5.3.1 TKC effect and link farm spam  57
5.3.2 Web communities or clusters  61
5.3.3 Duplicate and near-duplicate web pages  62
5.4 Impact of link farms  63
5.4.1 Link farm effect on HITS  64
5.4.2 Existence of link farm spam  64
5.5 Complete hyperlinks algorithm  67
5.5.1 Complete hyperlinks  68
5.5.2 Finding bipartite components  69
5.5.3 Computational complexity  70
5.5.4 Down-weighting the adjacency matrix  71
5.5.5 Ranking with the revised graph  72
5.6 Expansion algorithm  72
5.6.1 Initial step: IN-OUT common set  74
5.6.2 Expansion step  76
5.6.3 Ranking with the revised graph  78
5.6.4 Comparison of ParentPenalty and BadRank  78
5.6.5 Computational complexity  79
5.7 Experiments and evaluation  81
5.7.1 Data set  81
5.7.2 Anecdotal results  82
5.7.3 Evaluation of rankings  89
5.7.4 Precision vs. recall  91
5.7.5 Global ranking results  93
5.8 Discussion  96
5.8.1 Discussion for the complete hyperlinks algorithm  96
5.8.2 Discussion for the expansion algorithm  99
5.9 Summary  102

6 Detecting cloaking  104
6.1 Introduction  104
6.2 Detecting syntactic cloaking  107
6.2.1 Algorithm details  108
6.2.2 Data set  110
6.2.3 Experimental results for our 3-copy method  111
6.2.4 Experimental results for our 4-copy method  117
6.2.5 Distribution of syntactic cloaking within top rankings  118
6.3 Detecting semantic cloaking  119
6.3.1 Motivation  119
6.3.2 Algorithm  122
6.3.3 Training the classifier  130
6.3.4 Experiments  137
6.4 Summary  141

7 Topical TrustRank  144
7.1 Introduction  146
7.2 Background  148
7.3 Related work  149
7.4 Motivation  151
7.5 Topical TrustRank  153
7.5.1 Seed set partitioning  154
7.5.2 Combination of different topic scores  154
7.5.3 Improvements  156
7.6 Experiments  158
7.6.1 Data set  158
7.6.2 Bias of TrustRank  160
7.6.3 Results for search.ch data  161
7.6.4 Results for WebBase data  171
7.7 Discussion  176
7.8 Summary  178

8 Incorporating trust into authority  179
8.1 Introduction  179
8.2 Background and related work  182
8.3 Trust and distrust propagation  185
8.4 Evaluating trust and distrust propagation  188
8.4.1 Data set  188
8.4.2 Experimental procedure  189
8.4.3 Measurement  191
8.4.4 Different choices of propagation  192
8.4.5 Combination of trust and distrust  194
8.4.6 Size of seed sets  195
8.5 Incorporating trust into web authority calculations  196
8.5.1 The flaw in a trust-based ranking  196
8.5.2 The Cautious Surfer  197
8.6 Experimental framework  201
8.6.1 Data set  201
8.6.2 Selection of queries  202
8.6.3 Performance evaluation  203
8.6.4 Combining relevance and authority  205
8.7 Experimental results  206
8.7.1 Baseline results  206
8.7.2 Cautious surfer configuration  207
8.7.3 Comparing trust sources  208
8.7.4 Experimental results on UK-2006 dataset  209
8.7.5 Experimental results on WebBase  212
8.8 Summary  214

9 Discussion and conclusion  216
9.1 Future spam  217