Table Of ContentChapman & Hall/CRC
Computer Science and Data Analysis Series
Exploratory Multivariate
Analysis by Example Using R
François Husson
Sébastien Lê
Jérôme Pagès
Boca Raton London New York
CRC Press is an imprint of the
Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK
Chapman & Hall/CRC
Computer Science and Data Analysis Series
The interface between the computer and statistical sciences is increasing, as each
discipline seeks to harness the power and resources of the other. This series aims to
foster the integration between the computer sciences and statistical, numerical, and
probabilistic methods by publishing a broad range of reference works, textbooks, and
handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Chapman & Hall/CRC
Taylor and Francis Group
3 Park Square, Milton Park
Abingdon, OX14 4RN, UK
Published Titles
Semisupervised Learning for Computational Linguistics
Steven Abney
Visualization and Verbalization of Data
Jörg Blasius and Michael Greenacre
Design and Modeling for Computer Experiments
Kai-Tai Fang, Runze Li, and Agus Sudjianto
Microarray Image Analysis: An Algorithmic Approach
Karl Fraser, Zidong Wang, and Xiaohui Liu
R Programming for Bioinformatics
Robert Gentleman
Exploratory Multivariate Analysis by Example Using R
François Husson, Sébastien Lê, and Jérôme Pagès
Bayesian Artificial Intelligence, Second Edition
Kevin B. Korb and Ann E. Nicholson
Published Titles cont.
Computational Statistics Handbook with MATLAB®, Third Edition
Wendy L. Martinez and Angel R. Martinez
Exploratory Data Analysis with MATLAB®, Third Edition
Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka
Statistics in MATLAB®: A Primer
Wendy L. Martinez and MoonJung Cho
Clustering for Data Mining: A Data Recovery Approach, Second Edition
Boris Mirkin
Introduction to Machine Learning and Bioinformatics
Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis
Introduction to Data Technologies
Paul Murrell
R Graphics
Paul Murrell
Data Science Foundations: Geometry and Topology of Complex Hierarchic
Systems and Big Data Analytics
Fionn Murtagh
Correspondence Analysis and Data Coding with Java and R
Fionn Murtagh
Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra
Statistical Computing with R
Maria L. Rizzo
Statistical Learning and Data Science
Mireille Gettler Summa, Léon Bottou, Bernard Goldfarb, Fionn Murtagh,
Catherine Pardoux, and Myriam Touati
Music Data Analysis: Foundations and Applications
Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Günter Rudolph
Foundations of Statistical Algorithms: With References to R Packages
Claus Weihs, Olaf Mersmann, and Uwe Ligges
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20170331
International Standard Book Number-13: 978-1-1381-9634-6 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface xi
1 Principal Component Analysis (PCA) 1
1.1 Data — Notation — Examples . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Studying Individuals . . . . . . . . . . . . . . . . . . . 2
1.2.2 Studying Variables . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Relationships between the Two Studies . . . . . . . . 5
1.3 Studying Individuals . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 The Cloud of Individuals . . . . . . . . . . . . . . . . 5
1.3.2 Fitting the Cloud of Individuals . . . . . . . . . . . . 7
1.3.2.1 Best Plane Representation of N . . . . . . . 7
I
1.3.2.2 Sequence of Axes for Representing N . . . . 10
I
1.3.2.3 How Are the Components Obtained? . . . . 10
1.3.2.4 Example . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Representation of the Variables as an Aid for
Interpreting the Cloud of Individuals . . . . . . . . . . 11
1.4 Studying Variables . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1 The Cloud of Variables . . . . . . . . . . . . . . . . . 13
1.4.2 Fitting the Cloud of Variables. . . . . . . . . . . . . . 14
1.5 Relationships between the Two Representations N and N 16
I K
1.6 Interpreting the Data . . . . . . . . . . . . . . . . . . . . . . 17
1.6.1 Numerical Indicators . . . . . . . . . . . . . . . . . . . 17
1.6.1.1 Percentage of Inertia Associated with a
Component . . . . . . . . . . . . . . . . . . . 17
1.6.1.2 Quality of Representation of an Individual or
Variable . . . . . . . . . . . . . . . . . . . . . 18
1.6.1.3 Detecting Outliers . . . . . . . . . . . . . . . 19
1.6.1.4 Contribution of an Individual or Variable to
the Construction of a Component . . . . . . 19
1.6.2 Supplementary Elements. . . . . . . . . . . . . . . . . 20
1.6.2.1 Representing Supplementary Quantitative
Variables . . . . . . . . . . . . . . . . . . . . 21
1.6.2.2 Representing Supplementary Categorical
Variables . . . . . . . . . . . . . . . . . . . . 22
1.6.2.3 Representing Supplementary Individuals . . 24
v
vi Contents
1.6.3 Automatic Description of the Components . . . . . . . 24
1.7 Implementation with FactoMineR . . . . . . . . . . . . . . . 25
1.8 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . 26
1.8.1 Testing the Significance of the Components . . . . . . 26
1.8.2 Variables: Loadings versus Correlations . . . . . . . . 27
1.8.3 Simultaneous Representation: Biplots . . . . . . . . . 27
1.8.4 Missing Values . . . . . . . . . . . . . . . . . . . . . . 28
1.8.5 Large Datasets . . . . . . . . . . . . . . . . . . . . . . 29
1.8.6 Varimax Rotation . . . . . . . . . . . . . . . . . . . . 29
1.9 Example: The Decathlon Dataset . . . . . . . . . . . . . . . 30
1.9.1 Data Description — Issues . . . . . . . . . . . . . . . 30
1.9.2 Analysis Parameters . . . . . . . . . . . . . . . . . . . 30
1.9.2.1 Choice of Active Elements . . . . . . . . . . 30
1.9.2.2 Should the Variables Be Standardised? . . . 32
1.9.3 Implementation of the Analysis . . . . . . . . . . . . . 32
1.9.3.1 Choosing the Number of Dimensions to
Examine . . . . . . . . . . . . . . . . . . . . 34
1.9.3.2 Studying the Cloud of Individuals . . . . . . 35
1.9.3.3 Studying the Cloud of Variables . . . . . . . 38
1.9.3.4 JointAnalysisoftheCloudofIndividualsand
the Cloud of Variables . . . . . . . . . . . . . 40
1.9.3.5 Comments on the Data . . . . . . . . . . . . 43
1.10 Example: The Temperature Dataset . . . . . . . . . . . . . . 45
1.10.1 Data Description — Issues . . . . . . . . . . . . . . . 45
1.10.2 Analysis Parameters . . . . . . . . . . . . . . . . . . . 45
1.10.2.1 Choice of Active Elements . . . . . . . . . . 45
1.10.2.2 Should the Variables Be Standardised? . . . 46
1.10.3 Implementation of the Analysis . . . . . . . . . . . . . 47
1.11 Example of Genomic Data: The Chicken Dataset . . . . . . 53
1.11.1 Data Description — Issues . . . . . . . . . . . . . . . 53
1.11.2 Analysis Parameters . . . . . . . . . . . . . . . . . . . 54
1.11.3 Implementation of the Analysis . . . . . . . . . . . . . 54
2 Correspondence Analysis (CA) 61
2.1 Data — Notation — Examples . . . . . . . . . . . . . . . . . 61
2.2 Objectives and the Independence Model . . . . . . . . . . . . 63
2.2.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2.2 Independence Model and χ2 Test . . . . . . . . . . . . 64
2.2.3 The Independence Model and CA . . . . . . . . . . . 66
2.3 Fitting the Clouds . . . . . . . . . . . . . . . . . . . . . . . . 67
2.3.1 Clouds of Row Profiles . . . . . . . . . . . . . . . . . . 67
2.3.2 Clouds of Column Profiles . . . . . . . . . . . . . . . . 68
2.3.3 Fitting Clouds N and N . . . . . . . . . . . . . . . . 70
I J
2.3.4 Example: Women’sAttitudestoWomen’sWorkinFrance
in 1970 . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Contents vii
2.3.4.1 Column Representation (Mother’s Activity). 72
2.3.4.2 Row Representation (Partner’s Work) . . . . 74
2.3.5 Superimposed Representation of Both Rows and
Columns. . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.4 Interpreting the Data . . . . . . . . . . . . . . . . . . . . . . 79
2.4.1 Inertias Associated with the Dimensions (Eigenvalues) 79
2.4.2 Contribution of Points to a Dimension’s Inertia . . . . 82
2.4.3 Representation Quality of Points on a Dimension or
Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.4.4 Distance and Inertia in the Initial Space . . . . . . . . 84
2.5 Supplementary Elements (= Illustrative) . . . . . . . . . . . 85
2.6 Implementation with FactoMineR . . . . . . . . . . . . . . . 88
2.7 CA and Textual Data Processing . . . . . . . . . . . . . . . . 90
2.8 Example: The Olympic Games Dataset . . . . . . . . . . . . 94
2.8.1 Data Description — Issues . . . . . . . . . . . . . . . 94
2.8.2 Implementation of the Analysis . . . . . . . . . . . . . 96
2.8.2.1 Choosing the Number of Dimensions to
Examine . . . . . . . . . . . . . . . . . . . . 98
2.8.2.2 Studying the Superimposed Representation . 98
2.8.2.3 Interpreting the Results . . . . . . . . . . . . 101
2.8.2.4 Comments on the Data . . . . . . . . . . . . 102
2.9 Example: The White Wines Dataset . . . . . . . . . . . . . . 104
2.9.1 Data Description — Issues . . . . . . . . . . . . . . . 104
2.9.2 Margins . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.9.3 Inertia . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
2.9.4 Representation on the First Plane . . . . . . . . . . . 109
2.10 Example: The Causes of Mortality Dataset . . . . . . . . . . 112
2.10.1 Data Description — Issues . . . . . . . . . . . . . . . 112
2.10.2 Margins . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.10.3 Inertia . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.10.4 First Dimension . . . . . . . . . . . . . . . . . . . . . 118
2.10.5 Plane 2-3 . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.10.6 Projecting the Supplementary Elements . . . . . . . . 124
2.10.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 127
3 Multiple Correspondence Analysis (MCA) 131
3.1 Data — Notation — Examples . . . . . . . . . . . . . . . . . 131
3.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.2.1 Studying Individuals . . . . . . . . . . . . . . . . . . . 132
3.2.2 Studying the Variables and Categories . . . . . . . . . 133
3.3 Defining Distances between Individuals and Distances between
Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.3.1 Distances between the Individuals . . . . . . . . . . . 134
3.3.2 Distances between the Categories . . . . . . . . . . . . 134
3.4 CA on the Indicator Matrix . . . . . . . . . . . . . . . . . . 136
viii Contents
3.4.1 Relationship between MCA and CA . . . . . . . . . . 136
3.4.2 The Cloud of Individuals . . . . . . . . . . . . . . . . 137
3.4.3 The Cloud of Variables . . . . . . . . . . . . . . . . . 138
3.4.4 The Cloud of Categories . . . . . . . . . . . . . . . . . 139
3.4.5 Transition Relations . . . . . . . . . . . . . . . . . . . 142
3.5 Interpreting the Data . . . . . . . . . . . . . . . . . . . . . . 144
3.5.1 Numerical Indicators . . . . . . . . . . . . . . . . . . . 144
3.5.1.1 Percentage of Inertia Associated with a
Component . . . . . . . . . . . . . . . . . . . 144
3.5.1.2 Contribution and Representation Quality of
an Individual or Category . . . . . . . . . . . 145
3.5.2 Supplementary Elements. . . . . . . . . . . . . . . . . 146
3.5.3 Automatic Description of the Components . . . . . . . 147
3.6 Implementation with FactoMineR . . . . . . . . . . . . . . . 149
3.7 Addendum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.7.1 Analysing a Survey . . . . . . . . . . . . . . . . . . . . 152
3.7.1.1 Designing a Questionnaire: Choice of Format 152
3.7.1.2 Accounting for Rare Categories. . . . . . . . 153
3.7.2 Description of a Categorical Variable or a
Subpopulation . . . . . . . . . . . . . . . . . . . . . . 154
3.7.2.1 Description of a Categorical Variable by a
Categorical Variable . . . . . . . . . . . . . . 154
3.7.2.2 Description of a Subpopulation (or a
Category) by a Quantitative Variable . . . . 155
3.7.2.3 Description of a Subpopulation (or a
Category) by the Categories of a Categorical
Variable . . . . . . . . . . . . . . . . . . . . . 156
3.7.3 The Burt Table . . . . . . . . . . . . . . . . . . . . . . 157
3.7.4 Missing Values . . . . . . . . . . . . . . . . . . . . . . 158
3.8 Example: The Survey on the Perception of Genetically
Modified Organisms . . . . . . . . . . . . . . . . . . . . . . . 160
3.8.1 Data Description — Issues . . . . . . . . . . . . . . . 160
3.8.2 Analysis Parameters and Implementation with
FactoMineR . . . . . . . . . . . . . . . . . . . . . . . . 163
3.8.3 Analysing the First Plane . . . . . . . . . . . . . . . . 164
3.8.4 Projection of Supplementary Variables . . . . . . . . . 165
3.8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 167
3.9 Example: The Sorting Task Dataset . . . . . . . . . . . . . . 167
3.9.1 Data Description — Issues . . . . . . . . . . . . . . . 167
3.9.2 Analysis Parameters . . . . . . . . . . . . . . . . . . . 169
3.9.3 Representation of Individuals on the First Plane . . . 169
3.9.4 Representation of Categories . . . . . . . . . . . . . . 170
3.9.5 Representation of the Variables . . . . . . . . . . . . . 171
Contents ix
4 Clustering 173
4.1 Data — Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.2 Formalising the Notion of Similarity . . . . . . . . . . . . . . 177
4.2.1 Similarity between Individuals . . . . . . . . . . . . . 177
4.2.1.1 Distances and Euclidean Distances . . . . . . 177
4.2.1.2 Example of Non-Euclidean Distance . . . . . 178
4.2.1.3 Other Euclidean Distances . . . . . . . . . . 179
4.2.1.4 Similarities and Dissimilarities . . . . . . . . 179
4.2.2 Similarity between Groups of Individuals . . . . . . . 180
4.3 Constructing an Indexed Hierarchy . . . . . . . . . . . . . . 181
4.3.1 Classic Agglomerative Algorithm . . . . . . . . . . . . 181
4.3.2 Hierarchy and Partitions . . . . . . . . . . . . . . . . . 183
4.4 Ward’s Method . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.4.1 Partition Quality . . . . . . . . . . . . . . . . . . . . . 184
4.4.2 Agglomeration According to Inertia . . . . . . . . . . 185
4.4.3 Two Properties of the Agglomeration Criterion . . . . 187
4.4.4 Analysing Hierarchies, Choosing Partitions . . . . . . 188
4.5 Direct Search for Partitions: K-Means Algorithm . . . . . . 189
4.5.1 Data — Issues . . . . . . . . . . . . . . . . . . . . . . 189
4.5.2 Principle . . . . . . . . . . . . . . . . . . . . . . . . . 190
4.5.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . 191
4.6 Partitioning and Hierarchical Clustering . . . . . . . . . . . . 191
4.6.1 Consolidating Partitions . . . . . . . . . . . . . . . . . 192
4.6.2 Mixed Algorithm . . . . . . . . . . . . . . . . . . . . . 192
4.7 Clustering and Principal Component Methods . . . . . . . . 192
4.7.1 Principal Component Methods Prior to AHC . . . . . 193
4.7.2 Simultaneous Analysis of a Principal Component Map
and Hierarchy. . . . . . . . . . . . . . . . . . . . . . . 193
4.8 Clustering and Missing Data . . . . . . . . . . . . . . . . . . 194
4.9 Example: The Temperature Dataset . . . . . . . . . . . . . . 194
4.9.1 Data Description — Issues . . . . . . . . . . . . . . . 194
4.9.2 Analysis Parameters . . . . . . . . . . . . . . . . . . . 195
4.9.3 Implementation of the Analysis . . . . . . . . . . . . . 195
4.10 Example: The Tea Dataset . . . . . . . . . . . . . . . . . . . 199
4.10.1 Data Description — Issues . . . . . . . . . . . . . . . 199
4.10.2 Constructing the AHC . . . . . . . . . . . . . . . . . . 201
4.10.3 Defining the Clusters . . . . . . . . . . . . . . . . . . . 202
4.11 Dividing Quantitative Variables into Classes . . . . . . . . . 204
5 Visualisation 209
5.1 Data — Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.2 Viewing PCA Data . . . . . . . . . . . . . . . . . . . . . . . 209
5.2.1 Selecting a Subset of Objects — Cloud of Individuals 210
5.2.2 Selecting a Subset of Objects — Cloud of Variables. . 211
5.2.3 Adding Supplementary Information . . . . . . . . . . 212
Description:Full of real-world case studies and practical advice, Exploratory Multivariate Analysis by Example Using R, Second Edition focuses on four fundamental methods of multivariate exploratory data analysis that are most suitable for applications. It covers principal component analysis (PCA) when variable