Textual Data Science
with R
Chapman & Hall/CRC
Computer Science and Data Analysis Series
The interface between the computer and statistical sciences is increasing, as
each discipline seeks to harness the power and resources of the other. This series
aims to foster the integration between the computer sciences and statistical,
numerical, and probabilistic methods by publishing a broad range of reference
works, textbooks, and handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
Proposals for the series should be sent directly to one of the series editors above, or
submitted to:
Chapman & Hall/CRC
Taylor and Francis Group
3 Park Square, Milton Park
Abingdon, OX14 4RN, UK
Semisupervised Learning for Computational Linguistics
Steven Abney
Visualization and Verbalization of Data
Jörg Blasius and Michael Greenacre
Chain Event Graphs
Rodrigo A. Collazo, Christiane Görgen, and Jim Q. Smith
Design and Modeling for Computer Experiments
Kai-Tai Fang, Runze Li, and Agus Sudjianto
Microarray Image Analysis: An Algorithmic Approach
Karl Fraser, Zidong Wang, and Xiaohui Liu
R Programming for Bioinformatics
Robert Gentleman
Exploratory Multivariate Analysis by Example Using R
François Husson, Sébastien Lê, and Jérôme Pagès
Bayesian Artificial Intelligence, Second Edition
Kevin B. Korb and Ann E. Nicholson
Computational Statistics Handbook with MATLAB®, Third Edition
Wendy L. Martinez and Angel R. Martinez
Exploratory Data Analysis with MATLAB®, Third Edition
Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka
Statistics in MATLAB®: A Primer
Wendy L. Martinez and MoonJung Cho
Clustering for Data Mining: A Data Recovery Approach, Second Edition
Boris Mirkin
Introduction to Machine Learning and Bioinformatics
Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis
Introduction to Data Technologies
Paul Murrell
R Graphics
Paul Murrell
Correspondence Analysis and Data Coding with Java and R
Fionn Murtagh
Data Science Foundations: Geometry and Topology of Complex Hierarchic
Systems and Big Data Analytics
Fionn Murtagh
Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra
Statistical Computing with R
Maria L. Rizzo
Statistical Learning and Data Science
Mireille Gettler Summa, Léon Bottou, Bernard Goldfarb, Fionn Murtagh,
Catherine Pardoux, and Myriam Touati
Bayesian Regression Modeling With INLA
Xiaofeng Wang, Yu Ryan Yue, and Julian J. Faraway
Music Data Analysis: Foundations and Applications
Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Günter Rudolph
Foundations of Statistical Algorithms: With References to R Packages
Claus Weihs, Olaf Mersmann, and Uwe Ligges
Textual Data Science with R
Mónica Bécue-Bertaut
Textual Data Science
with R
Mónica Bécue-Bertaut
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20190214
International Standard Book Number-13: 978-1-138-62691-1 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Foreword xiii
Preface xv
1 Encoding: from a corpus to statistical tables 1
1.1 Textual and contextual data . . . . . . . . . . . . . . . . . . 1
1.1.1 Textual data . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Contextual data . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Documents and aggregate documents . . . . . . . . . 2
1.2 Examples and notation . . . . . . . . . . . . . . . . . . . . . 3
1.3 Choosing textual units . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Graphical forms . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Stems . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.4 Repeated segments . . . . . . . . . . . . . . . . . . . . 7
1.3.5 In practice . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 Unique spelling . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Partially automated preprocessing . . . . . . . . . . . 9
1.4.3 Word selection . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Word and segment indexes . . . . . . . . . . . . . . . . . . . 10
1.6 The Life UK corpus: preliminary results . . . . . . . . . . . 10
1.6.1 Verbal content through word and repeated segment indexes . . . . 10
1.6.2 Univariate description of contextual variables . . . . . 13
1.6.3 A note on the frequency range . . . . . . . . . . . . . 13
1.7 Implementation with Xplortext . . . . . . . . . . . . . . . . 14
1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Correspondence analysis of textual data 17
2.1 Data and goals . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Correspondence analysis: a tool for linguistic data analysis . . . . 17
2.1.2 Data: a small example . . . . . . . . . . . . . . . . . . 17
2.1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Associations between documents and words . . . . . . . . . . 19
2.2.1 Profile comparisons . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Independence of documents and words . . . . . . . . . 20
2.2.3 The χ² test . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Association rates between documents and words . . . 23
2.3 Active row and column clouds . . . . . . . . . . . . . . . . . 24
2.3.1 Row and column profile spaces . . . . . . . . . . . . . 24
2.3.2 Distributional equivalence and the χ² distance . . . . 24
2.3.3 Inertia of a cloud . . . . . . . . . . . . . . . . . . . . . 25
2.4 Fitting document and word clouds . . . . . . . . . . . . . . . 26
2.4.1 Factorial axes . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Visualizing rows and columns . . . . . . . . . . . . . . 28
2.4.2.1 Category representation . . . . . . . . . . . . 30
2.4.2.2 Word representation . . . . . . . . . . . . . . 30
2.4.2.3 Transition formulas . . . . . . . . . . . . . . 32
2.4.2.4 Simultaneous representation of rows and columns . . . . 32
2.5 Interpretation aids . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.1 Eigenvalues and representation quality of the clouds . 33
2.5.2 Contribution of documents and words to axis inertia . 34
2.5.3 Representation quality of a point . . . . . . . . . . . . 35
2.6 Supplementary rows and columns . . . . . . . . . . . . . . . 36
2.6.1 Supplementary tables . . . . . . . . . . . . . . . . . . 36
2.6.2 Supplementary frequency rows and columns . . . . . . 36
2.6.3 Supplementary quantitative and qualitative variables . 37
2.7 Validating the visualization . . . . . . . . . . . . . . . . . . . 37
2.8 Interpretation scheme for textual CA results . . . . . . . . . 38
2.9 Implementation with Xplortext . . . . . . . . . . . . . . . . 41
2.10 Summary of the CA approach . . . . . . . . . . . . . . . . . 41
3 Applications of correspondence analysis 43
3.1 Choosing the level of detail for analyses . . . . . . . . . . . . 43
3.2 Correspondence analysis on aggregate free text answers . . . 44
3.2.1 Data and objectives . . . . . . . . . . . . . . . . . . . 44
3.2.2 Word selection . . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 CA on the aggregate table . . . . . . . . . . . . . . . . 44
3.2.3.1 Document representation . . . . . . . . . . . 45
3.2.3.2 Word representation . . . . . . . . . . . . . . 46
3.2.3.3 Simultaneous interpretation of the plots . . . 46
3.2.4 Supplementary elements . . . . . . . . . . . . . . . . . 49
3.2.4.1 Supplementary words . . . . . . . . . . . . . 49
3.2.4.2 Supplementary repeated segments . . . . . . 49
3.2.4.3 Supplementary categories . . . . . . . . . . . 50
3.2.5 Implementation with Xplortext . . . . . . . . . . . . . 51
3.3 Direct analysis . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1 Data and objectives . . . . . . . . . . . . . . . . . . . 52
3.3.2 The main features of direct analysis . . . . . . . . . . 53
3.3.3 Direct analysis of the culture question . . . . . . . . . 53
3.3.4 Implementation with Xplortext . . . . . . . . . . . . . 58
4 Clustering in textual data science 61
4.1 Clustering documents . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Dissimilarity measures between documents . . . . . . . . . . 62
4.3 Measuring partition quality . . . . . . . . . . . . . . . . . . . 63
4.3.1 Document clusters in the factorial space . . . . . . . . 63
4.3.2 Partition quality . . . . . . . . . . . . . . . . . . . . . 63
4.4 Dissimilarity measures between document clusters . . . . . . 64
4.4.1 The single-linkage method . . . . . . . . . . . . . . . . 64
4.4.2 The complete-linkage method . . . . . . . . . . . . . . 64
4.4.3 Ward’s method . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Agglomerative hierarchical clustering . . . . . . . . . . . . . 65
4.5.1 Hierarchical tree construction algorithm . . . . . . . . 65
4.5.2 Selecting the final partition . . . . . . . . . . . . . . . 66
4.5.3 Interpreting clusters . . . . . . . . . . . . . . . . . . . 66
4.6 Direct partitioning . . . . . . . . . . . . . . . . . . . . . . . . 67
4.7 Combining clustering methods . . . . . . . . . . . . . . . . . 68
4.7.1 Consolidating partitions . . . . . . . . . . . . . . . . . 68
4.7.2 Direct partitioning followed by AHC . . . . . . . . . . 68
4.8 A procedure for combining CA and clustering . . . . . . . . 69
4.9 Example: joint use of CA and AHC . . . . . . . . . . . . . . 69
4.9.1 Data and objectives . . . . . . . . . . . . . . . . . . . 69
4.9.1.1 Data preprocessing using CA . . . . . . . . . 70
4.9.1.2 Constructing the hierarchical tree . . . . . . 70
4.9.1.3 Choosing the final partition . . . . . . . . . . 72
4.10 Contiguity-constrained hierarchical clustering . . . . . . . . . 74
4.10.1 Principles and algorithm . . . . . . . . . . . . . . . . . 74
4.10.2 AHC of age groups with a chronological constraint . . 75
4.10.3 Implementation with Xplortext . . . . . . . . . . . . . 76
4.11 Example: clustering free text answers . . . . . . . . . . . . . 76
4.11.1 Data and objectives . . . . . . . . . . . . . . . . . . . 76
4.11.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . 78
4.11.2.1 CA: eigenvalues and total inertia . . . . . . . 78
4.11.2.2 Interpreting the first axes . . . . . . . . . . . 80
4.11.3 AHC: building the tree and choosing the final partition 84
4.12 Describing cluster features . . . . . . . . . . . . . . . . . . . 88
4.12.1 Lexical features of clusters . . . . . . . . . . . . . . . . 89
4.12.1.1 Describing clusters in terms of characteristic words . . . . 89
4.12.1.2 Describing clusters in terms of characteristic documents . . . . 91
4.12.2 Describing clusters using contextual variables . . . . . 91