CATEGORIZATION, ANALYSIS, AND VISUALIZATION OF COMPUTER-MEDIATED COMMUNICATION AND ELECTRONIC MARKETS

by
Ahmed Abbasi

A Dissertation Submitted to the Committee on
BUSINESS ADMINISTRATION
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
WITH A MAJOR IN MANAGEMENT
In the Graduate College
THE UNIVERSITY OF ARIZONA
2008

THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Ahmed Abbasi entitled Categorization, Analysis, and Visualization of Computer-Mediated Communication and Electronic Markets and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Hsinchun Chen (Date: 04/17/2008)
Jay F. Nunamaker, Jr. (Date: 04/17/2008)
Zhu Zhang (Date: 04/17/2008)

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Dissertation Director: Hsinchun Chen (Date: 04/17/2008)

STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made.
Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

SIGNED: Ahmed Abbasi

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Hsinchun Chen, for his encouragement and invaluable feedback every step of the way. There is no doubt that Dr. Chen’s guidance played an integral role in my scholarly development. My four years as a doctoral student have provided me with the enduring skills and fortitude that I believe will enable me to succeed in future endeavors. I am grateful to my committee members, Dr. Jay F. Nunamaker Jr. and Dr. Zhu Zhang, for their abundant wisdom and the kindness to share it with me. I thank the department chair, Dr. J. Leon Zhao, and the rest of the MIS department faculty for their support.

My dissertation has been partially supported by the National Science Foundation grant “Multilingual Online Stylometric Authorship Identification: An Exploratory Study” (NSF #0646942, September 2006 – February 2008).

I would like to thank my good friends and colleagues in the Artificial Intelligence Lab: Siddharth Kaza, Tianjun Fu, Xin Li, Daning Hu, Sven Thoms, Arab Salem, Hsin-min Lu, Chun-Ju Tseng, and David Zimbra for their support. I am especially indebted to Cathy Larson for her kindness, humor, and optimism.

Most of all, I am grateful for the constant love and support of my family, without whom this would not have been possible. To my wife Saba, who stood by me through thick and thin. My parents, who always led by example and taught me the meaning of hard work. And my brother and sister, who have always been wonderful role models.

DEDICATION

This dissertation is dedicated to my parents, for their unconditional love and support.
Their commendable work ethic and unwavering principles have been an inspiration.

TABLE OF CONTENTS

LIST OF ILLUSTRATIONS
LIST OF TABLES
ABSTRACT

CHAPTER 1: INTRODUCTION
  1.1 Motivation
  1.2 Overview
  1.3 Analysis of Textual Information Types
  1.4 Analysis of Ideational Information Types
  1.5 Analysis of Textual, Ideational, and Inter-personal Information

CHAPTER 2: A STYLOMETRIC APPROACH TO IDENTITY-LEVEL IDENTIFICATION AND SIMILARITY DETECTION IN CYBERSPACE
  2.1 Introduction
  2.2 Related Work
    2.2.1 Stylometry
    2.2.2 Online Stylometric Analysis
    2.2.3 Feature Set Types for Online Stylometry
  2.3 Research Gaps and Questions
    2.3.1 Similarity Detection
    2.3.2 Richer Feature Sets
    2.3.3 Individual Author Level Features
    2.3.4 Scalability across Domains
    2.3.5 Research Questions
  2.4 Research Design: An Overview
    2.4.1 Techniques
    2.4.2 Feature Sets and Types
  2.5 System Design
    2.5.1 Feature Extraction
    2.5.2 Classifier Construction
  2.6 Evaluation
    2.6.1 Test Bed
    2.6.2 Experiment 1: Identification Task
    2.6.3 Experiment 2: Similarity Detection Task
  2.7 Conclusions

CHAPTER 3: STYLOMETRIC IDENTIFICATION IN ELECTRONIC MARKETS: SCALABILITY AND ROBUSTNESS
  3.1 Introduction
  3.2 Related Work
    3.2.1 Reputation Systems/Online Feedback Mechanisms
    3.2.2 Stylometric Analysis
  3.3 Research Gaps, Questions, and Design
    3.3.1 Research Gaps
    3.3.2 Research Questions
    3.3.3 Research Design
  3.4 System Design
    3.4.1 Feature Extraction
    3.4.2 Classifier Construction: Writeprints
  3.5 Evaluation
    3.5.1 Test Bed
    3.5.2 Experimental Setup
    3.5.3 Experiment 1: Scalability
    3.5.4 Experiment 2: Robustness
  3.6 Conclusions

CHAPTER 4: WEBSITE SIGNATURES: AN EXPERIMENT ON FAKE ESCROW AND SPOOF WEBSITES
  4.1 Introduction
  4.2 Related Work
    4.2.1 Fake Website Types
    4.2.2 Fake Website Features
    4.2.3 Fake Website Categorization Techniques
  4.3 Research Design
    4.3.1 Research Gaps and Questions
    4.3.2 Research Framework
  4.4 Evaluation
    4.4.1 Experimental Setup
    4.4.2 Experiment 1: Fake Escrow Websites
    4.4.3 Experiment 2: Spoof Sites
    4.4.4 Hypotheses Testing
  4.5 Conclusions

CHAPTER 5: A COMPARISON OF TOOLS FOR DETECTING FAKE WEBSITES
  5.1 Introduction
  5.2 Fake Website Detection Tools
    5.2.1 Lookup Systems
    5.2.2 Classifier Systems
    5.2.3 Hybrid Systems and Dynamic Classifiers
    5.2.4 Summary of Existing Tools
  5.3 Proposed Approach
  5.4 Experiments and Results
    5.4.1 Overall Results
    5.4.2 Impact of Time of Day and Interval
    5.4.3 Hybrid Systems: Combining Classifier and Lookup Methods
  5.5 Conclusions

CHAPTER 6: FEATURE SELECTION FOR OPINION CLASSIFICATION IN ONLINE FORUMS AND REVIEWS
  6.1 Introduction
  6.2 Related Work
    6.2.1 Tasks
    6.2.2 Features
    6.2.3 Classification Techniques
    6.2.4 Sentiment Analysis Domains
  6.3 Research Gaps and Questions
    6.3.1 Web Forums in Multiple Languages
    6.3.2 Stylistic Features
    6.3.3 Feature Reduction for Sentiment Classification
    6.3.4 Research Questions
  6.4 Research Design
  6.5 System Design
    6.5.1 Feature Extraction
    6.5.2 Determining Size of Initial Feature Set
    6.5.3 Feature Selection: Entropy Weighted Genetic Algorithm (EWGA)
    6.5.4 Classification
  6.6 Evaluation
    6.6.1 Experiment 1: Movie Review Test Bed
    6.6.2 Experiment 2: Online Discussion Forum
    6.6.3 Results Discussion
  6.7 Conclusions and Future Directions

CHAPTER 7: MINING ONLINE REVIEW SENTIMENTS USING FEATURE RELATION NETWORKS
  7.1 Introduction
  7.2 Related Work
    7.2.1 Classification Methods for Sentiment Analysis
    7.2.2 N-Gram Features for Sentiment Analysis
    7.2.3 Feature Selection for Sentiment Analysis
  7.3 Research Gaps and Questions
    7.3.1 Research Gaps
    7.3.2 Research Questions
  7.4 Research Design
    7.4.1 Extended N-Gram Feature Set
    7.4.2 Feature Relation Network
  7.5 Experiments
    7.5.1 Experiment 1a: Comparison of Feature Sets using Cross Validation
    7.5.2 Experiment 1b: Comparison of Features on 10,000 Review Test Beds
    7.5.3 Experiment 2a: Comparison of Feature Selection Methods
    7.5.4 Experiment 2b: Comparison of Selection Methods
    7.5.5 Results Discussion
  7.6 Conclusions

CHAPTER 8: AFFECT ANALYSIS OF WEB FORUMS AND BLOGS USING CORRELATION ENSEMBLES
  8.1 Introduction
  8.2 Related Work
    8.2.1 Features for Affect Analysis
    8.2.2 Techniques for Assigning Affect Intensities
  8.3 Research Design
    8.3.1 Gaps and Questions
    8.3.2 Research Framework
    8.3.3 Research Hypotheses
  8.4 Evaluation
    8.4.1 Test Bed
    8.4.2 Experimental Design
    8.4.3 Experiment 1: Comparison of Feature Sets
    8.4.4 Experiment 2: Comparison of Techniques
    8.4.5 Experiment 3: Ablation Testing
    8.4.6 Hypotheses Results
  8.5 Case Study
  8.6 Conclusions

CHAPTER 9: CYBERGATE: A SYSTEM AND DESIGN FRAMEWORK FOR TEXT ANALYSIS OF COMPUTER-MEDIATED COMMUNICATION
  9.1 Introduction
  9.2 Background
    9.2.1 CMC Text
    9.2.2 CMC Text Analysis Features
    9.2.3 CMC Text Analysis Systems
  9.3 A Design Framework for CMC Text Analysis
    9.3.1 Proposed CMC Text Analysis Framework
  9.4 Kernel Theory
  9.5 Meta-Requirements
  9.6 Meta-Design
    9.6.1 Features for CMC Text Analysis
    9.6.2 Feature Selection Techniques for CMC Text Analysis
    9.6.3 Visualization Techniques for CMC Text Analysis
  9.7 Testable Hypotheses
  9.8 System Design: The CyberGate System
    9.8.1 Information Types and Features
    9.8.2 Feature Selection
    9.8.3 Visualization
    9.8.4 Writeprints and Ink Blots
  9.9 A CMC Text Analysis Example Using CyberGate: The Enron Case
  9.10 Experimental Evaluation: Text Categorization using CyberGate
    9.10.1 Research Hypotheses
    9.10.2 Information Types Representing the Ideational Meta-function
    9.10.3 Information Types Representing the Textual Meta-function
    9.10.4 Information Types Representing the Interpersonal Meta-function
    9.10.5 Results Discussion
  9.11 Conclusions

CHAPTER 10: CONCLUSION
  10.1 Contributions
  10.2 Relevance to MIS
  10.3 Future Directions

REFERENCES