École Polytechnique
Laboratoire d’Informatique de l’X (LIX)

GRAPH-OF-WORDS: MINING AND RETRIEVING TEXT WITH NETWORKS OF FEATURES

DISSERTATION submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

by FRANÇOIS ROUSSEAU

18 September 2015

Copyright © 2015 François Rousseau

Thesis prepared under the supervision of Michalis Vazirgiannis at the Department of Informatics of École Polytechnique

Laboratoire d’Informatique de l’X (LIX) – UMR 7161
1 rue Honoré d’Estienne d’Orves
Bâtiment Alan Turing
Campus de l’École Polytechnique
91120 Palaiseau, France

To my mom and dad who have always wanted a doctor, and to my sister who made me become one.

ABSTRACT

We propose graph-of-words as an alternative document representation to the historical bag-of-words that is extensively used in text mining and retrieval. We represent textual documents as statistical graphs whose vertices correspond to the unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window over the full processed document. The underlying assumption is that all the words present in a document have some relationship with one another, modulo a window size outside of which the relationship is not taken into consideration. This representation takes word dependency better into account than traditional unigrams and even n-grams, and can be interpreted as a network of features that captures the relations between terms at the document level, or even at the collection level when considering a giant graph-of-words made of the aggregation of multiple smaller graphs-of-words, one for each document of the collection.

In our work, we capitalized on the graph-of-words representation to retrieve more relevant information, extract more cohesive keywords and learn more discriminative patterns, with successful applications in ad hoc information retrieval, single-document keyword extraction and single-label multi-class text categorization. Experiments conducted on various text datasets with ground truth data, including a Web-scale collection of 25M documents, and using standard evaluation metrics for each task (e.g., MAP, P@10, accuracy and macro-average F1-score) resulted in statistically significant improvements in effectiveness for little to no additional cost in efficiency.

The main findings of our research are: (1) for ad hoc information retrieval, when assessing a query term’s weight in a document, rather than considering the overall term frequency of a word and then applying a concave transformation to ensure a decreasing marginal gain in relevance, one should instead consider for each word the number of distinct contexts of co-occurrence with other words so as to favor terms that appear with a lot of other terms, i.e. consider the node degree in the corresponding unweighted graph-of-words; (2) for keyword extraction, humans tend to select as keywords not only central but also densely connected nodes in the corresponding weighted graph-of-words, a property captured by reducing the graph to its main core using the concept of graph degeneracy; and (3) for text categorization, long-distance n-grams – defined as subgraphs of unweighted graphs-of-words – are more discriminative features than standard n-grams, partly because they can capture more variants of the same set of terms compared to fixed sequences of terms and therefore appear in more documents.
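To make the representation concrete, below is a minimal Python sketch of the ideas summarized above: building a graph-of-words with a fixed-size sliding window, reading off node degrees as retrieval term weights (finding 1) and retaining the main core as keyword candidates (finding 2). It is an illustration under simplifying assumptions rather than the thesis implementation: the tokens are assumed to be already preprocessed, the window size of 4 is only an example value, networkx is assumed to be available for the k-core computation, and the unweighted main core is shown where the thesis uses a weighted variant for keyword extraction.

```python
import networkx as nx


def graph_of_words(tokens, window=4, weighted=False):
    """Vertices are the unique terms of `tokens`; edges link terms that
    co-occur within a sliding window of `window` consecutive tokens."""
    g = nx.Graph()
    g.add_nodes_from(tokens)  # duplicate tokens collapse into unique vertices
    for i, term in enumerate(tokens):
        for other in tokens[i + 1:i + window]:  # terms sharing a window with `term`
            if other == term:
                continue  # no self-loops
            if weighted and g.has_edge(term, other):
                g[term][other]["weight"] += 1  # count repeated co-occurrences
            else:
                g.add_edge(term, other, weight=1)
    return g


# Toy document, assumed to be tokenized, normalized and stop-word-filtered upstream.
tokens = "information retrieval graph words network features mining information retrieval".split()
g = graph_of_words(tokens, window=4)

# Finding (1): weight each term by its degree in the unweighted graph-of-words
# (the TW component of a TW-IDF score) instead of by its raw term frequency.
tw = dict(g.degree())

# Finding (2): keep as keyword candidates the vertices of the main core,
# i.e. the k-core with the largest k, obtained through graph degeneracy.
keywords = sorted(nx.k_core(g).nodes())
```

Finding (3) similarly treats connected subgraphs of such graphs as long-distance n-gram features; that mining step is not sketched here.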
RÉSUMÉ

We propose the graph-of-words representation of documents as an alternative to the bag-of-words representation that is widely used in text mining and information retrieval. We represent documents using statistical graphs whose nodes correspond to the unique terms of the document and whose edges represent the co-occurrences between the terms within a fixed-size sliding window. The underlying assumption is that all the words of a document are related to one another, modulo the window size outside of which the relationship is not taken into account. This representation takes the dependency between words better into account than more traditional representations based on unigrams and even n-grams, and can be interpreted as a network of variables that stores the relations between terms at the scale of the document, or even at the scale of a collection of documents when one considers a graph-of-words built from several smaller graphs-of-words, one per document of the collection.

In the course of our work, we took advantage of the graph-of-words representation to better retrieve the information most relevant to a query, to extract more cohesive keywords and to learn more discriminative patterns, with applications in ad hoc information retrieval, keyword extraction and text classification. The experiments carried out on numerous datasets with known ground truth, among which a collection of 25M Web pages, and using the standard evaluation measures for each task (MAP, P@10, classification accuracy and macro-average F1-score), led to statistically significant improvements in quality for little to no additional cost in efficiency.

The main results of our research are: (1) in ad hoc information retrieval, when assessing the weight of a query term in a document, instead of considering the overall frequency of the term and then applying a concave transformation to ensure a decreasing marginal gain in relevance, one should rather consider for each word the number of distinct contexts of co-occurrence with the other words, so as to favor the words that appear with the largest number of different words, i.e. consider the degree of the node in the corresponding unweighted graph-of-words; (2) in keyword extraction, humans tend to select nodes that are not only central but also densely connected to the other nodes in the corresponding weighted graph-of-words, a property that is recovered when one reduces a network to its main core using the principle of graph degeneracy; and (3) in text classification, so-called long-distance n-grams – defined as subgraphs of graphs-of-words – are more discriminative than standard n-grams, partly because they cover more variants of the same set of terms compared to fixed sequences of terms and thus appear in more documents.

PUBLICATIONS

The following publications are included in parts or in an extended version in this dissertation:

1) F. Rousseau and M. Vazirgiannis. Composition of TF normalizations: new insights on scoring functions for ad hoc IR. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’13. ACM, 2013, pages 917–920.
2) F. Rousseau and M. Vazirgiannis. Graph-of-word and TW-IDF: new approach to ad hoc IR. In Proceedings of the 22nd ACM international conference on information and knowledge management, CIKM ’13. ACM, 2013, pages 59–68. Best Paper, Honorable Mention.

3) F. Rousseau and M. Vazirgiannis. Main core retention on graph-of-words for single-document keyword extraction. In Proceedings of the 37th European conference on information retrieval, ECIR ’15. Springer-Verlag, 2015, pages 382–393.

4) F. Rousseau, E. Kiagias, and M. Vazirgiannis. Text categorization as a graph classification problem. In Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing, volume 1, ACL-IJCNLP ’15. ACL, 2015, pages 1702–1712.

5) J. Kim, F. Rousseau, and M. Vazirgiannis. Convolutional sentence kernel from word embeddings for short text categorization. In Proceedings of the 2015 conference on empirical methods in natural language processing, EMNLP ’15. ACL, 2015.

Furthermore, the following publications were part of my Ph.D. research but are not covered in this dissertation – the topics of these publications are somewhat outside of the scope of the material covered here:

6) M. Karkali, F. Rousseau, A. Ntoulas, and M. Vazirgiannis. Efficient online novelty detection in news streams. In Proceedings of the 14th international conference on web information systems engineering, WISE ’13. Springer-Verlag Berlin, 2013, pages 57–71.

7) P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and M. Vazirgiannis. Degeneracy-based real-time sub-event detection in twitter stream. In Proceedings of the 9th AAAI international conference on web and social media, ICWSM ’15. AAAI Press, 2015, pages 248–257.

8) F. Rousseau, J. Casas-Roma, and M. Vazirgiannis. Community-preserving anonymization of social networks. ACM transactions on knowledge discovery from data, 2015. Submitted on 24/11/2014.

9) J. Casas-Roma and F. Rousseau. Community-preserving generalization of social networks. In Proceedings of the social media and risk ASONAM 2015 workshop, SoMeRis ’15. IEEE Computer Society, 2015.

10) P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and M. Vazirgiannis. Shortest-path graph kernels for document similarity. In Proceedings of the 15th IEEE international conference on data mining, ICDM ’15. IEEE Computer Society, 2015. Submitted on 03/06/2015.
