École Polytechnique
Laboratoire d’Informatique de l’X (LIX)
GRAPH-OF-WORDS:
MINING AND RETRIEVING TEXT
WITH NETWORKS OF FEATURES
DISSERTATION
submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
in Computer Science
by
FRANÇOIS ROUSSEAU
September 18, 2015
Copyright © 2015 François Rousseau
Thesis prepared under the supervision of Michalis Vazirgiannis
at the Department of Informatics of École Polytechnique
Laboratoire d’Informatique de l’X (LIX) – UMR 7161
1 rue Honoré d’Estienne d’Orves
Bâtiment Alan Turing
Campus de l’École Polytechnique
91120 Palaiseau, France
To my mom and dad who have always wanted a doctor,
and to my sister who made me become one.
ABSTRACT
We propose graph-of-words as an alternative document representation to the
historical bag-of-words that is extensively used in text mining and retrieval. We
represent textual documents as statistical graphs whose vertices correspond
to unique terms of the document and whose edges represent co-occurrences
between the terms within a fixed-size sliding window over the full processed
document. The underlying assumption is that all the words present in a document
are related to one another, modulo a window size outside of which the
relationship is not taken into consideration. This representation accounts for
word dependency better than traditional unigrams and even n-grams, and can be
interpreted as a network of features that captures the relations between terms
at the document level, or even at the collection level when considering a giant
graph-of-words that aggregates multiple smaller graphs-of-words, one for each
document of the collection.
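As an illustration of the construction described above, here is a minimal sketch in Python. It assumes whitespace tokenization, an undirected and unweighted graph, and a default window of 4 tokens; the function name `graph_of_words` and these choices are illustrative assumptions, not details taken from the dissertation.

```python
from collections import defaultdict


def graph_of_words(tokens, window=4):
    """Build an undirected, unweighted graph-of-words as an adjacency map.

    Each unique term is a vertex; an edge links two distinct terms that
    co-occur within `window` consecutive tokens. (Hypothetical helper:
    the window size and the undirected edges are assumptions.)
    """
    adjacency = defaultdict(set)
    for i, term in enumerate(tokens):
        adjacency[term]  # register the vertex even if it gets no edge
        # Pair the current term with the next (window - 1) tokens.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                adjacency[term].add(other)
                adjacency[other].add(term)
    return dict(adjacency)


tokens = "information retrieval is the activity of obtaining information".split()
graph = graph_of_words(tokens, window=3)
```

In this toy sentence the term "information" occurs twice, and its two occurrences contribute neighbours from both ends of the sentence to a single vertex, which is the kind of document-level dependency a bag-of-words discards.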
In our work, we capitalized on the graph-of-words representation to retrieve
more relevant information, extract more cohesive keywords and learn more
discriminative patterns, with successful applications in ad hoc information
retrieval, single-document keyword extraction and single-label multi-class text
categorization. Experiments conducted on various text datasets with ground truth
data, including a Web-scale collection of 25M documents, and using standard
evaluation metrics for each task (e.g., MAP, P@10, accuracy and macro-average
F1-score) resulted in statistically significant improvements in effectiveness
for little to no additional cost in efficiency.
The main findings of our research are: (1) for ad hoc information retrieval,
when assessing a query term’s weight in a document, rather than considering the
overall term frequency of a word and then applying a concave transformation
to ensure a decreasing marginal gain in relevance, one should instead consider
for each word the number of distinct contexts of co-occurrence with other
words so as to favor terms that appear with many other terms, i.e., consider
the node degree in the corresponding unweighted graph-of-words; (2) for
keyword extraction, humans tend to select as keywords not only central but
also densely connected nodes in the corresponding weighted graph-of-words,
a property captured by reducing the graph to its main core using the concept
of graph degeneracy; and (3) for text categorization, long-distance n-grams –
defined as subgraphs of unweighted graph-of-words – are more discriminative
features than standard n-grams, partly because they can capture more variants
of the same set of terms compared to fixed sequences of terms and therefore
appear in more documents.
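Findings (1) and (2) can be sketched on a toy example. The code below is a hedged illustration, not the dissertation's implementation: it assumes an undirected, unweighted graph built with a window of 2 tokens, and it uses a plain unweighted k-core peeling as a simplification, whereas the abstract refers to a weighted graph-of-words for keyword extraction. The helper names `graph_of_words` and `core_numbers` are hypothetical.

```python
from collections import Counter, defaultdict


def graph_of_words(tokens, window=2):
    """Undirected, unweighted co-occurrence graph (assumed construction)."""
    adjacency = defaultdict(set)
    for i, term in enumerate(tokens):
        adjacency[term]  # register the vertex even if it gets no edge
        for other in tokens[i + 1 : i + window]:
            if other != term:
                adjacency[term].add(other)
                adjacency[other].add(term)
    return dict(adjacency)


def core_numbers(adjacency):
    """Core number of each vertex, by repeatedly peeling a minimum-degree vertex."""
    degrees = {v: len(nbrs) for v, nbrs in adjacency.items()}
    core, k = {}, 0
    while degrees:
        v = min(degrees, key=degrees.get)
        k = max(k, degrees[v])  # core numbers never decrease during peeling
        core[v] = k
        del degrees[v]
        for u in adjacency[v]:
            if u in degrees:  # only neighbours not yet removed lose an edge
                degrees[u] -= 1
    return core


tokens = ["graph", "words", "text", "graph", "mining"]
graph = graph_of_words(tokens, window=2)

# Finding (1): node degree as a term weight, next to raw term frequency.
term_frequency = Counter(tokens)
term_degree = {term: len(nbrs) for term, nbrs in graph.items()}

# Finding (2): keep the main core, i.e. the vertices with the highest core number.
cores = core_numbers(graph)
k_max = max(cores.values())
keywords = {term for term, k in cores.items() if k == k_max}
```

On this input, "graph" occurs twice but has three distinct neighbours, and "mining", which is attached to the rest of the graph by a single edge, falls outside the main core.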
RÉSUMÉ
We propose the graph-of-words document representation as an alternative to
the bag-of-words representation that is widely used in text mining and
information retrieval. We represent documents as statistical graphs whose nodes
correspond to the unique terms of the document and whose edges represent
co-occurrences between the terms within a fixed-size sliding window. The
underlying assumption is that all the words of a document are related to one
another, modulo the window size outside of which the relationship is not taken
into account. This representation captures word dependency better than a more
traditional representation based on unigrams and even n-grams, and can be
interpreted as a network of features that stores the relations between terms
at the document level, or even at the level of a document collection when
considering a graph-of-words built from multiple smaller graphs-of-words, one
per document of the collection.
In our work, we took advantage of the graph-of-words representation to better
retrieve the information most relevant to a query, to extract more cohesive
keywords and to learn more discriminative patterns, with applications in ad hoc
information retrieval, keyword extraction and text classification. Experiments
conducted on numerous datasets with known ground truth, including a collection
of 25M Web pages, and using the standard evaluation metrics for each task (MAP,
P@10, accuracy and macro-average F1-score) led to statistically significant
improvements in quality for little to no additional cost in efficiency.
The main findings of our research are: (1) in ad hoc information retrieval,
when assessing the weight of a query term in a document, rather than
considering the overall frequency of the term and then applying a concave
transformation to ensure a decreasing marginal gain in relevance, one should
instead consider for each word the number of distinct contexts of co-occurrence
with the other words, so as to favor words that appear with the largest number
of different words, i.e., consider the degree of the node in the corresponding
unweighted graph-of-words; (2) in keyword extraction, humans tend to select
nodes that are not only central but also densely connected to the other nodes
in the corresponding weighted graph-of-words, a property that emerges when a
network is reduced to its main core using the principle of graph degeneracy;
and (3) in text classification, so-called long-distance n-grams – defined as
subgraphs of graph-of-words – are more discriminative than standard n-grams,
partly because they cover more variants of the same set of terms compared to
fixed sequences of terms and thus appear in more documents.
PUBLICATIONS
The following publications are included in parts or in an extended version in
this dissertation:
1) F. Rousseau and M. Vazirgiannis. Composition of TF normalizations: new
insights on scoring functions for ad hoc IR. In Proceedings of the 36th annual
international ACM SIGIR conference on research and development in information
retrieval. In SIGIR ’13. ACM, 2013, pages 917–920.
2) F. Rousseau and M. Vazirgiannis. Graph-of-word and TW-IDF: new approach
to ad hoc IR. In Proceedings of the 22nd ACM international conference on
information and knowledge management. In CIKM ’13. ACM, 2013, pages 59–68. Best
Paper, Honorable Mention.
3) F. Rousseau and M. Vazirgiannis. Main core retention on graph-of-words for
single-document keyword extraction. In Proceedings of the 37th European
conference on information retrieval. In ECIR ’15. Springer-Verlag, 2015,
pages 382–393.
4) F. Rousseau, E. Kiagias, and M. Vazirgiannis. Text categorization as a graph
classification problem. In Proceedings of the 53rd annual meeting of the
association for computational linguistics and the 7th international joint
conference on natural language processing. Volume 1. In ACL-IJCNLP ’15. ACL,
2015, pages 1702–1712.
5) J. Kim, F. Rousseau, and M. Vazirgiannis. Convolutional sentence kernel
from word embeddings for short text categorization. In Proceedings of the 2015
conference on empirical methods in natural language processing. In EMNLP ’15.
ACL, 2015.
Furthermore, the following publications were part of my Ph.D. research but are
not covered in this dissertation – the topics of these publications are somewhat
outside of the scope of the material covered here:
6) M. Karkali, F. Rousseau, A. Ntoulas, and M. Vazirgiannis. Efficient online
novelty detection in news streams. In Proceedings of the 14th international
conference on web information systems engineering. In WISE ’13. Springer-Verlag
Berlin, 2013, pages 57–71.
7) P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and
M. Vazirgiannis. Degeneracy-based real-time sub-event detection in Twitter
stream. In Proceedings of the 9th AAAI international conference on web and
social media. In ICWSM ’15. AAAI Press, 2015, pages 248–257.
8) F. Rousseau, J. Casas-Roma, and M. Vazirgiannis. Community-preserving
anonymization of social networks. ACM transactions on knowledge discovery
from data, 2015. Submitted on 24/11/2014.
9) J. Casas-Roma and F. Rousseau. Community-preserving generalization of
social networks. In Proceedings of the social media and risk ASONAM 2015
workshop. In SoMeRis ’15. IEEE Computer Society, 2015.
10) P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and
M. Vazirgiannis. Shortest-path graph kernels for document similarity. In
Proceedings of the 15th IEEE international conference on data mining. In
ICDM ’15. IEEE Computer Society, 2015. Submitted on 03/06/2015.