Textual Data Science with R (PDF)

213 Pages·2018·31.515 MB·English

Preview: Textual Data Science with R

Textual Data Science with R

Chapman & Hall/CRC Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical, and probabilistic methods by publishing a broad range of reference works, textbooks, and handbooks.

SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London

Proposals for the series should be sent directly to one of the series editors above, or submitted to:

Chapman & Hall/CRC
Taylor and Francis Group
3 Park Square, Milton Park
Abingdon, OX14 4RN, UK

Semisupervised Learning for Computational Linguistics
    Steven Abney
Visualization and Verbalization of Data
    Jörg Blasius and Michael Greenacre
Chain Event Graphs
    Rodrigo A. Collazo, Christiane Görgen, and Jim Q. Smith
Design and Modeling for Computer Experiments
    Kai-Tai Fang, Runze Li, and Agus Sudjianto
Microarray Image Analysis: An Algorithmic Approach
    Karl Fraser, Zidong Wang, and Xiaohui Liu
R Programming for Bioinformatics
    Robert Gentleman
Exploratory Multivariate Analysis by Example Using R
    François Husson, Sébastien Lê, and Jérôme Pagès
Bayesian Artificial Intelligence, Second Edition
    Kevin B. Korb and Ann E. Nicholson
Computational Statistics Handbook with MATLAB®, Third Edition
    Wendy L. Martinez and Angel R. Martinez
Exploratory Data Analysis with MATLAB®, Third Edition
    Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka
Statistics in MATLAB®: A Primer
    Wendy L. Martinez and MoonJung Cho
Clustering for Data Mining: A Data Recovery Approach, Second Edition
    Boris Mirkin
Introduction to Machine Learning and Bioinformatics
    Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis
Introduction to Data Technologies
    Paul Murrell
R Graphics
    Paul Murrell
Correspondence Analysis and Data Coding with Java and R
    Fionn Murtagh
Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics
    Fionn Murtagh
Pattern Recognition Algorithms for Data Mining
    Sankar K. Pal and Pabitra Mitra
Statistical Computing with R
    Maria L. Rizzo
Statistical Learning and Data Science
    Mireille Gettler Summa, Léon Bottou, Bernard Goldfarb, Fionn Murtagh, Catherine Pardoux, and Myriam Touati
Bayesian Regression Modeling With INLA
    Xiaofeng Wang, Yu Ryan Yue, and Julian J. Faraway
Music Data Analysis: Foundations and Applications
    Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Günter Rudolph
Foundations of Statistical Algorithms: With References to R Packages
    Claus Weihs, Olaf Mersmann, and Uwe Ligges
Textual Data Science with R
    Mónica Bécue-Bertaut

Textual Data Science with R
Mónica Bécue-Bertaut

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper
Version Date: 20190214

International Standard Book Number-13: 978-1-138-62691-1 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Foreword
Preface

1 Encoding: from a corpus to statistical tables
  1.1 Textual and contextual data
    1.1.1 Textual data
    1.1.2 Contextual data
    1.1.3 Documents and aggregate documents
  1.2 Examples and notation
  1.3 Choosing textual units
    1.3.1 Graphical forms
    1.3.2 Lemmas
    1.3.3 Stems
    1.3.4 Repeated segments
    1.3.5 In practice
  1.4 Preprocessing
    1.4.1 Unique spelling
    1.4.2 Partially automated preprocessing
    1.4.3 Word selection
  1.5 Word and segment indexes
  1.6 The Life UK corpus: preliminary results
    1.6.1 Verbal content through word and repeated segment indexes
    1.6.2 Univariate description of contextual variables
    1.6.3 A note on the frequency range
  1.7 Implementation with Xplortext
  1.8 Summary

2 Correspondence analysis of textual data
  2.1 Data and goals
    2.1.1 Correspondence analysis: a tool for linguistic data analysis
    2.1.2 Data: a small example
    2.1.3 Objectives
  2.2 Associations between documents and words
    2.2.1 Profile comparisons
    2.2.2 Independence of documents and words
    2.2.3 The χ² test
    2.2.4 Association rates between documents and words
  2.3 Active row and column clouds
    2.3.1 Row and column profile spaces
    2.3.2 Distributional equivalence and the χ² distance
    2.3.3 Inertia of a cloud
  2.4 Fitting document and word clouds
    2.4.1 Factorial axes
    2.4.2 Visualizing rows and columns
      2.4.2.1 Category representation
      2.4.2.2 Word representation
      2.4.2.3 Transition formulas
      2.4.2.4 Simultaneous representation of rows and columns
  2.5 Interpretation aids
    2.5.1 Eigenvalues and representation quality of the clouds
    2.5.2 Contribution of documents and words to axis inertia
    2.5.3 Representation quality of a point
  2.6 Supplementary rows and columns
    2.6.1 Supplementary tables
    2.6.2 Supplementary frequency rows and columns
    2.6.3 Supplementary quantitative and qualitative variables
  2.7 Validating the visualization
  2.8 Interpretation scheme for textual CA results
  2.9 Implementation with Xplortext
  2.10 Summary of the CA approach

3 Applications of correspondence analysis
  3.1 Choosing the level of detail for analyses
  3.2 Correspondence analysis on aggregate free text answers
    3.2.1 Data and objectives
    3.2.2 Word selection
    3.2.3 CA on the aggregate table
      3.2.3.1 Document representation
      3.2.3.2 Word representation
      3.2.3.3 Simultaneous interpretation of the plots
    3.2.4 Supplementary elements
      3.2.4.1 Supplementary words
      3.2.4.2 Supplementary repeated segments
      3.2.4.3 Supplementary categories
    3.2.5 Implementation with Xplortext
  3.3 Direct analysis
    3.3.1 Data and objectives
    3.3.2 The main features of direct analysis
    3.3.3 Direct analysis of the culture question
    3.3.4 Implementation with Xplortext

4 Clustering in textual data science
  4.1 Clustering documents
  4.2 Dissimilarity measures between documents
  4.3 Measuring partition quality
    4.3.1 Document clusters in the factorial space
    4.3.2 Partition quality
  4.4 Dissimilarity measures between document clusters
    4.4.1 The single-linkage method
    4.4.2 The complete-linkage method
    4.4.3 Ward’s method
  4.5 Agglomerative hierarchical clustering
    4.5.1 Hierarchical tree construction algorithm
    4.5.2 Selecting the final partition
    4.5.3 Interpreting clusters
  4.6 Direct partitioning
  4.7 Combining clustering methods
    4.7.1 Consolidating partitions
    4.7.2 Direct partitioning followed by AHC
  4.8 A procedure for combining CA and clustering
  4.9 Example: joint use of CA and AHC
    4.9.1 Data and objectives
      4.9.1.1 Data preprocessing using CA
      4.9.1.2 Constructing the hierarchical tree
      4.9.1.3 Choosing the final partition
  4.10 Contiguity-constrained hierarchical clustering
    4.10.1 Principles and algorithm
    4.10.2 AHC of age groups with a chronological constraint
    4.10.3 Implementation with Xplortext
  4.11 Example: clustering free text answers
    4.11.1 Data and objectives
    4.11.2 Data preprocessing
      4.11.2.1 CA: eigenvalues and total inertia
      4.11.2.2 Interpreting the first axes
    4.11.3 AHC: building the tree and choosing the final partition
  4.12 Describing cluster features
    4.12.1 Lexical features of clusters
      4.12.1.1 Describing clusters in terms of characteristic words
      4.12.1.2 Describing clusters in terms of characteristic documents
    4.12.2 Describing clusters using contextual variables
