UNIVERSITY COTE D’AZUR
DOCTORAL SCHOOL STIC
SCIENCES ET TECHNOLOGIES DE L’INFORMATION ET DE LA COMMUNICATION

PHD THESIS
to obtain the title of
PhD of Science
of the Université Côte d’Azur
Specialty: COMPUTER SCIENCE

Defended by
Zide MENG

Temporal and semantic analysis of richly typed social networks from user-generated content sites on the web

Thesis Advisors: Fabien GANDON and Catherine FARON-ZUCKER
prepared at INRIA Sophia Antipolis, WIMMICS Team
defended on Nov 07, 2016

Jury:
Reviewers: Pr. Frédérique LAFOREST - Télécom Saint-Etienne
Dr. John BRESLIN - National University of Ireland, Galway
Examiner: Pr. Martin ARNAUD - Université de Rennes 1
Advisor: Dr. Fabien GANDON - INRIA Sophia Antipolis
Co-Advisor: Dr. Catherine FARON-ZUCKER - University Nice Sophia Antipolis
President: Pr. Frederic PRECIOSO - University Nice Sophia Antipolis

Acknowledgments

I would like to express my sincere thanks to my supervisors Fabien Gandon and Catherine Faron-Zucker for their great help, support, inspiration and advice. I was very lucky to have them as my supervisors, since they supported me not only in my research but also in many aspects of my life.

I would also like to thank the rest of my thesis committee for their precious time, insightful comments and helpful suggestions. I could not have finished my thesis without all their help.

I would like to thank the Octopus project (ANR-12-CORD-0026) for the financial support of my research. I would also like to thank our project partners for all the collaborations and meetings. It was a great pleasure to work with them.

I would also like to thank the SMILK project for the financial support of my research and the chance to apply my work to an additional real-world dataset.

I would like to thank Stack Overflow, Flickr and Viseo for sharing their data, which gave me the chance to conduct my research project.

I would like to thank the Wimmics team. It is a friendly and international environment where I had a great time with all my colleagues.

I would like to thank Christine Foggia for all her help and support.

I would like to thank my dear friends who always supported me and encouraged me.
I also want to thank Sophie for her help and support. In particular, I would like to thank Fuqi Song for helping me get through many difficult times. I am so lucky to have all of them.

I would like to thank Shanshan for her love and support. I am happy and lucky to have her with me.

I would like to express my deep love and thanks to my family for their support and understanding, whenever and wherever.

Abstract

We propose an approach to detect topics, overlapping communities of interest, expertise, trends and activities in user-generated content sites and in particular in question-answering forums such as StackOverflow. We first describe QASM (Question & Answer Social Media), a system based on social network analysis to manage the two main resources in question-answering sites: users and content. We also introduce the QASM vocabulary used to formalize both the level of interest and the expertise of users on topics. We then propose an efficient approach to detect communities of interest. It relies on another method to enrich questions with a more general tag when needed. We compared three detection methods on a dataset extracted from the popular Q&A site StackOverflow. Our method based on topic modeling and user membership assignment is shown to be much simpler and faster while preserving the quality of detection. We then propose an additional method to automatically generate a label for a detected topic by analyzing the meaning and links of its bag of words. We conduct a user study to compare different algorithms for choosing a label. Finally, we extend our probabilistic graphical model to jointly model topics, expertise, activities and trends. We performed experiments with real-world data to confirm the effectiveness of our joint model, studying user behaviors and topic dynamics.
Keywords: social semantic web, social media mining, probabilistic graphical model, question answer sites, user-generated content, topic modeling, expertise detection, overlapping community detection

Résumé

We propose an approach to detect topics, non-disjoint communities of interest, expertise, trends and activities in sites where the content is generated by users, and in particular in question-answering forums such as StackOverflow. We first describe QASM (Question & Answer Social Media), a system based on social network analysis to manage the two main resources of a question-answering site: users and content. We also present the QASM vocabulary used to formalize both the level of interest and the expertise of users. We then propose an efficient approach to detect communities of interest. It relies on another method to enrich questions with a more general tag when needed. We compare three detection methods on a dataset extracted from the popular site StackOverflow. Our method, based on topic modeling and user membership assignment, proves to be much simpler and faster while preserving the quality of detection. In addition, we propose a method to automatically generate a label for a detected topic by analyzing the meaning and links of its keywords. We then conduct a study to compare different algorithms for generating this label. Finally, we extend our probabilistic graphical model to jointly model topics, expertise, activities and trends. We validate it on real-world data to confirm the effectiveness of our model, which integrates user behaviors and topic dynamics.
Keywords: social semantic web, social media analysis, probabilistic graphical model, question-answering sites, user-generated content, topic modeling, expertise detection, overlapping community detection

You can’t connect the dots looking forward, you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future.
–Jobs