
École Polytechnique
Laboratoire d’Informatique de l’X (LIX)

GRAPH-OF-WORDS: MINING AND RETRIEVING TEXT WITH NETWORKS OF FEATURES

DISSERTATION submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science by FRANÇOIS ROUSSEAU

September 18, 2015

Copyright © 2015 François Rousseau

Thesis prepared under the supervision of Michalis Vazirgiannis at the Department of Informatics of École Polytechnique

Laboratoire d’Informatique de l’X (LIX) – UMR 7161
1 rue Honoré d’Estienne d’Orves
Bâtiment Alan Turing
Campus de l’École Polytechnique
91120 Palaiseau, France

To my mom and dad who have always wanted a doctor, and to my sister who made me become one.

ABSTRACT

We propose graph-of-words as an alternative document representation to the historical bag-of-words that is extensively used in text mining and retrieval. We represent textual documents as statistical graphs whose vertices correspond to unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window over the full processed document. The underlying assumption is that all the words present in a document have some relationships with one another, modulo a window size outside of which the relationship is not taken into consideration. This representation accounts for word dependency better than traditional unigrams and even n-grams, and can be interpreted as a network of features that captures the relations between terms at the document level, or even at the collection level when considering a giant graph-of-words made of the aggregation of multiple smaller graphs-of-words, one for each document of the collection.
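The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the dissertation: the function name `build_graph_of_words`, the default window size, the whitespace tokenization and the sample sentence are all assumptions made for the example, and no preprocessing (stop word removal, stemming) is applied.

```python
from collections import defaultdict


def build_graph_of_words(tokens, window=4):
    """Build an undirected, unweighted graph-of-words as an adjacency dict.

    Each unique term is a vertex; two terms are linked when they co-occur
    within a sliding window of `window` consecutive tokens.
    The window size is an illustrative assumption.
    """
    adjacency = defaultdict(set)
    for i, term in enumerate(tokens):
        adjacency[term]  # touching the key registers isolated terms as vertices
        # The window starting at position i covers the next (window - 1) tokens.
        for other in tokens[i + 1 : i + window]:
            if other != term:  # no self-loops
                adjacency[term].add(other)
                adjacency[other].add(term)
    return dict(adjacency)


# Hypothetical sample document, tokenized on whitespace.
tokens = "information retrieval is the activity of obtaining information".split()
graph = build_graph_of_words(tokens, window=3)
```

With this sample and a window of 3, the two occurrences of "information" are merged into a single vertex linked to "retrieval", "is", "of" and "obtaining", which is the sense in which the graph records distinct contexts of co-occurrence rather than raw frequency.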
In our work, we capitalized on the graph-of-words representation to retrieve more relevant information, extract more cohesive keywords and learn more discriminative patterns with successful applications in ad hoc information retrieval, single-document keyword extraction and single-label multi-class text categorization. Experiments conducted on various text datasets with ground truth data, including a Web-scale collection of 25M documents, and using standard evaluation metrics for each task (e.g., MAP, P@10, accuracy and macro-average F1-score) resulted in statistically significant improvements in effectiveness for little to no additional cost in efficiency.

The main findings of our research are: (1) for ad hoc information retrieval, when assessing a query term’s weight in a document, rather than considering the overall term frequency of a word and then applying a concave transformation to ensure a decreasing marginal gain in relevance, one should instead consider for each word the number of distinct contexts of co-occurrence with other words so as to favor terms that appear with a lot of other terms, i.e. consider the node degree in the corresponding unweighted graph-of-words; (2) for keyword extraction, humans tend to select as keywords not only central but also densely connected nodes in the corresponding weighted graph-of-words, a property captured by reducing the graph to its main core using the concept of graph degeneracy; and (3) for text categorization, long-distance n-grams – defined as subgraphs of unweighted graph-of-words – are more discriminative features than standard n-grams, partly because they can capture more variants of the same set of terms compared to fixed sequences of terms and therefore appear in more documents.
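Findings (1) and (2) rely on two graph quantities, node degree and the main core, that can be illustrated on a toy graph. The sketch below is only an assumption-laden example: the adjacency dict `example` is hand-built and hypothetical, the helper names are invented here, and the core decomposition is the plain unweighted k-core peeling procedure, a simplification of the weighted variant the abstract refers to for keyword extraction.

```python
def node_degrees(adjacency):
    """Degree of each term: its number of distinct co-occurring terms."""
    return {term: len(neighbors) for term, neighbors in adjacency.items()}


def main_core(adjacency):
    """Return (k, vertices) for the non-empty k-core with the largest k.

    Unweighted peeling: for increasing k, repeatedly delete vertices whose
    degree is below k; the last non-empty remainder is the main core.
    Assumes a symmetric adjacency dict (undirected graph).
    """
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    best_k, best_core = 0, set(remaining)
    k = 1
    while remaining:
        low = [v for v, nbrs in remaining.items() if len(nbrs) < k]
        while low:
            for v in low:
                # Drop v and remove it from its surviving neighbors.
                for u in remaining.pop(v):
                    remaining[u].discard(v)
            low = [v for v, nbrs in remaining.items() if len(nbrs) < k]
        if remaining:
            best_k, best_core = k, set(remaining)
        k += 1
    return best_k, best_core


# Hypothetical toy graph: a triangle (graph, word, term) with a chain attached.
example = {
    "graph": {"word", "term", "text"},
    "word": {"graph", "term"},
    "term": {"graph", "word"},
    "text": {"graph", "mining"},
    "mining": {"text"},
}
degrees = node_degrees(example)
k, core = main_core(example)
```

In this toy graph, "text" has the same degree as "word" and "term" but is not part of the main core, since it is not densely connected to the triangle; this is the distinction between central and cohesive nodes that finding (2) draws.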
RÉSUMÉ

We propose the graph-of-words document representation as an alternative to the bag-of-words representation that is widely used in text mining and information retrieval. We represent documents as statistical graphs whose nodes correspond to the unique terms of the document and whose edges represent co-occurrences between terms within a fixed-size sliding window. The underlying assumption is that all the words of a document are related, modulo the window size outside of which the relationship is not taken into account. This representation accounts for word dependency better than a more traditional representation based on unigrams and even n-grams, and can be interpreted as a network of variables that stores the relations between terms at the document level, or even at the level of a document collection when considering a graph-of-words built from several smaller graphs-of-words, one per document of the collection.

In the course of our work, we took advantage of the graph-of-words representation to better retrieve the information most relevant to a query, to extract more cohesive keywords and to learn more discriminative patterns, with applications in ad hoc information retrieval, keyword extraction and text categorization. Experiments conducted on numerous datasets with known ground truth, including a collection of 25M Web pages, and using the standard evaluation metrics for each task (MAP, P@10, accuracy and macro-average F1-score) led to statistically significant improvements in effectiveness for little to no additional cost in efficiency.

The main results of our research are: (1) in ad hoc information retrieval, when assessing the weight of a query term in a document, instead of considering the overall frequency of the term and then applying a concave transformation to it to ensure a decreasing marginal gain in relevance, one should rather consider for each word the number of distinct co-occurrence contexts with other words so as to favor words that appear with the largest number of different words, i.e. consider the degree of the node in the corresponding unweighted graph-of-words; (2) in keyword extraction, humans tend to select nodes that are not only central but also densely connected to the other nodes in the corresponding weighted graph-of-words, a property that emerges when reducing a network to its main core using the principle of graph degeneracy; and (3) in text categorization, so-called long-distance n-grams – defined as subgraphs of graph-of-words – are more discriminative than standard n-grams, partly because they cover more variants of the same set of terms compared to fixed sequences of terms and thus appear in more documents.

PUBLICATIONS

The following publications are included in parts or in an extended version in this dissertation:

1) F. Rousseau and M. Vazirgiannis. Composition of TF normalizations: new insights on scoring functions for ad hoc IR. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval. In SIGIR ’13. ACM, 2013, pages 917–920.
2) F. Rousseau and M. Vazirgiannis. Graph-of-word and TW-IDF: new approach to ad hoc IR. In Proceedings of the 22nd ACM international conference on information and knowledge management. In CIKM ’13. ACM, 2013, pages 59–68. Best Paper, Honorable Mention.
3) F. Rousseau and M. Vazirgiannis. Main core retention on graph-of-words for single-document keyword extraction. In Proceedings of the 37th european conference on information retrieval. In ECIR ’15. Springer-Verlag, 2015, pages 382–393.
4) F. Rousseau, E. Kiagias, and M. Vazirgiannis. Text categorization as a graph classification problem. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing. Volume 1. In ACL-IJCNLP ’15. ACL, 2015, pages 1702–1712.
5) J. Kim, F. Rousseau, and M. Vazirgiannis. Convolutional sentence kernel from word embeddings for short text categorization. In Proceedings of the 2015 conference on empirical methods in natural language processing. In EMNLP ’15. ACL, 2015.

Furthermore, the following publications were part of my Ph.D. research but are not covered in this dissertation – the topics of these publications are somewhat outside of the scope of the material covered here:

6) M. Karkali, F. Rousseau, A. Ntoulas, and M. Vazirgiannis. Efficient online novelty detection in news streams. In Proceedings of the 14th international conference on web information systems engineering. In WISE ’13. Springer-Verlag Berlin, 2013, pages 57–71.
7) P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and M. Vazirgiannis. Degeneracy-based real-time sub-event detection in twitter stream. In Proceedings of the 9th AAAI international conference on web and social media. In ICWSM ’15. AAAI Press, 2015, pages 248–257.
8) F. Rousseau, J. Casas-Roma, and M. Vazirgiannis. Community-preserving anonymization of social networks. ACM transactions on knowledge discovery from data, 2015. Submitted on 24/11/2014.
9) J. Casas-Roma and F. Rousseau. Community-preserving generalization of social networks. In Proceedings of the social media and risk ASONAM 2015 workshop. In SoMeRis ’15. IEEE Computer Society, 2015.
10) P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and M. Vazirgiannis. Shortest-path graph kernels for document similarity. In Proceedings of the 15th IEEE international conference on data mining. In ICDM ’15. IEEE Computer Society, 2015. Submitted on 03/06/2015.

