DATA HANDLING IN SCIENCE AND TECHNOLOGY — VOLUME 20B
Handbook of Chemometrics and Qualimetrics: Part B

DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and S.C. Rutan

Other volumes in this series:

Volume 1 Microprocessor Programming and Applications for Scientists and Engineers, by R.R. Smardzewski
Volume 2 Chemometrics: A Textbook, by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3 Experimental Design: A Chemometric Approach, by S.N. Deming and S.L. Morgan
Volume 4 Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology, by P. Valkó and S. Vajda
Volume 5 PCs for Chemists, edited by J. Zupan
Volume 6 Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
Volume 7 Receptor Modeling for Air Quality Management, edited by P.K. Hopke
Volume 8 Design and Optimization in Organic Synthesis, by R. Carlson
Volume 9 Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, edited by R.G. Brereton
Volume 10 Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing, by P.M. Gy
Volume 11 Experimental Design: A Chemometric Approach (Second, Revised and Expanded Edition), by S.N. Deming and S.L. Morgan
Volume 12 Methods for Experimental Design: Principles and Applications for Physicists and Chemists, by J.L. Goupy
Volume 13 Intelligent Software for Chemical Analysis, edited by L.M.C. Buydens and P.J. Schoenmakers
Volume 14 The Data Analysis Handbook, by I.E. Frank and R. Todeschini
Volume 15 Adaption of Simulated Annealing to Chemical Optimization Problems, edited by J. Kalivas
Volume 16 Multivariate Analysis of Data in Sensory Science, edited by T. Naes and E. Risvik
Volume 17 Data Analysis for Hyphenated Techniques, by E.J. Karjalainen and U.P. Karjalainen
Volume 18 Signal Treatment and Signal Analysis in NMR, edited by D.N. Rutledge
Volume 19 Robustness of Analytical Chemical Methods and Pharmaceutical Technological Products, edited by M.W.B. Hendriks, J.H. de Boer and A.K. Smilde
Volume 20A Handbook of Chemometrics and Qualimetrics: Part A, by D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi and J. Smeyers-Verbeke
Volume 20B Handbook of Chemometrics and Qualimetrics: Part B, by B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. De Jong, P.J. Lewi and J. Smeyers-Verbeke

B.G.M. VANDEGINSTE
Unilever Research Laboratorium, Vlaardingen, The Netherlands

D.L. MASSART
Farmaceutisch Instituut, Dienst Farmaceutische en Biomedische Analyse, Vrije Universiteit Brussel, Brussels, Belgium

L.M.C. BUYDENS
Vakgroep Analytische Chemie, Katholieke Universiteit Nijmegen, Faculteit Natuurwetenschappen, Nijmegen, The Netherlands

S. DE JONG
Unilever Research Laboratorium, Vlaardingen, The Netherlands

P.J. LEWI
Janssen Research Foundation, Center for Molecular Design, Vosselaar, Belgium

J. SMEYERS-VERBEKE
Farmaceutisch Instituut, Dienst Farmaceutische en Biomedische Analyse, Vrije Universiteit Brussel, Brussels, Belgium

ELSEVIER
Amsterdam - Boston - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo

ELSEVIER SCIENCE B.V.
Sara Burgerhartstraat 25
P.O. Box 211, 1000 AE Amsterdam, The Netherlands

© 1998 Elsevier Science B.V. All rights reserved.

This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use:

Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws.
Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use.

Permissions may be sought directly from Elsevier Science via their homepage (http://www.elsevier.com) by selecting 'Customer support' and then 'Permissions'. Alternatively you can send an e-mail to: [email protected], or fax to: (+44) 1865 853333.

In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 207 631 5555; fax: (+44) 207 631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter.

Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Global Rights Department, at the fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

First edition 1998
Second impression 2003

Library of Congress Cataloging-in-Publication Data
Handbook of chemometrics and qualimetrics / B.G.M. Vandeginste ... [et al.].
p. cm. — (Data handling in science and technology ; v. 20B)
Includes index.
ISBN 0-444-82853-2 (pt. 20B : acid-free paper)
1. Chemistry, Analytic—Statistical methods. 2. Chemistry, Analytic—Mathematics. 3. Chemistry, Analytic—Data processing. I. Vandeginste, B.G.M. II. Series.
QD75.4.S8H36 1998
543'.001'5195-dc21 98-42544 CIP

British Library Cataloguing in Publication Data
A catalogue record from the British Library has been applied for.

ISBN: 0-444-82853-2 (Vol. 20B)
ISBN: 0-444-82854-0 (set)

The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.

Preface

In 1991 two of us, Luc Massart and Bernard Vandeginste, discussed, during one of our many meetings, the possibility and necessity of updating the book Chemometrics: a textbook. Some of the newer techniques, such as partial least squares and expert systems, were not included in that book, which was written some 15 years ago. Initially, we thought that we could bring it up to date with relatively minor revision. We could not have been more wrong. Even during the planning of the book we witnessed a rapid development in the application of natural computing methods, multivariate calibration, method validation, etc.
When approaching colleagues to join the team of authors, it was clear from the outset that the book would not be an overhaul of the previous one, but an almost completely new book. When forming the team, we were particularly happy to be joined by two industrial chemometricians, Dr. Paul Lewi from Janssen Pharmaceutica and Dr. Sijmen de Jong from Unilever Research Laboratorium Vlaardingen, each having a wealth of practical experience. We are grateful to Janssen Pharmaceutica and Unilever Research Vlaardingen for allowing Paul, Sijmen and Bernard to spend some of their time on this project. The three other authors belong to the Vrije Universiteit Brussel (Prof. An Smeyers-Verbeke and Prof. D. Luc Massart) and the Katholieke Universiteit Nijmegen (Professor Lutgarde Buydens), thus creating a team in which university and industry are equally well represented. We hope that this has led to an equally good mix of theory and application in the new book.

Much of the material presented in this book is based on the direct experience of the authors. This would not have been possible without the hard work and input of our colleagues, students and post-doctoral fellows. We sincerely want to acknowledge each of them for their good research and contributions, without which we would not have been able to treat such a broad range of subjects. Some of them read chapters or helped in other ways.

We also owe thanks to the chemometrics community and at the same time we have to offer apologies. We have had the opportunity of collaborating with many colleagues and we have profited from the research and publications of many others. Their ideas and work have made this book possible and necessary. The size of the book shows that they have been very productive. Even so, we have cited only a fraction of the literature and we have not included the more sophisticated work.
Our wish was to consolidate, and therefore to explain, those methods that have become more or less accepted, in a way that is also accessible to newcomers to chemometrics. Our apologies, therefore, to those we did not cite, or did not cite extensively: it is not a reflection on the quality of their work.

Each chapter saw many versions which needed to be entered and re-entered in the computer. Without the help of our secretaries, we would not have been able to complete this work successfully. All versions were read and commented on by all authors in a long series of team meetings. We will certainly retain special memories of many of our two-day meetings, for instance the one organized by Paul in the famous abbey of the regular canons of Prémontré at Tongerlo, where we could work in peace and quiet as so many before us have done.

Much of this work also had to be done at home, which took away precious time from our families. Their love, understanding, patience and support were indispensable for us to carry on with the seemingly endless series of chapters to be drafted, read or revised.

Contents

Preface

Chapter 28 Introduction to Part B
  References

Chapter 29 Vectors, Matrices and Operations on Matrices
  29.1 Vector space
  29.2 Geometrical properties of vectors
  29.3 Matrices
  29.4 Matrix product
  29.5 Dimension and rank
  29.6 Eigenvectors and eigenvalues
  29.7 Statistical interpretation of matrices
  29.8 Geometrical interpretation of matrix products
  References

Chapter 30 Cluster Analysis
  30.1 Clusters
  30.2 Measures of (dis)similarity
    30.2.1 Similarity and distance
    30.2.2 Measures of (dis)similarity for continuous variables
      30.2.2.1 Distances
      30.2.2.2 Correlation coefficient
      30.2.2.3 Scaling
    30.2.3 Measures of (dis)similarity for other variables
      30.2.3.1 Binary variables
      30.2.3.2 Ordinal variables
      30.2.3.3 Mixed variables
    30.2.4 Similarity matrix
  30.3 Clustering algorithms
    30.3.1 Hierarchical methods
    30.3.2 Non-hierarchical methods
    30.3.3 Other methods
    30.3.4 Selecting clusters
      30.3.4.1 Measures for clustering tendency
      30.3.4.2 How many clusters?
    30.3.5 Conclusion
  References

Chapter 31 Analysis of Measurement Tables
  Introduction
  31.1 Principal components analysis
    31.1.1 Singular vectors and singular values
    31.1.2 Eigenvectors and eigenvalues
    31.1.3 Latent vectors and latent values
    31.1.4 Scores and loadings
    31.1.5 Principal components
    31.1.6 Transition formulae
    31.1.7 Reconstructions
  31.2 Geometrical interpretation
    31.2.1 Line of closest fit
    31.2.2 Distances
    31.2.3 Unipolar axes
    31.2.4 Bipolar axes
  31.3 Preprocessing
    31.3.1 No transformation
    31.3.2 Column-centering
    31.3.3 Column-standardization
    31.3.4 Log column-centering
    31.3.5 Log double-centering
    31.3.6 Double-closure
  31.4 Algorithms
    31.4.1 Singular value decomposition
    31.4.2 Eigenvalue decomposition
  31.5 Validation
    31.5.1 Scree-plot
    31.5.2 Malinowski's F-test
    31.5.3 Cross-validation
  31.6 Principal coordinates analysis
    31.6.1 Distances defined from data
    31.6.2 Distances derived from comparisons of pairs
    31.6.3 Eigenvalue decomposition
  31.7 Non-linear principal components analysis
    31.7.1 Extensions of the data by higher order terms
    31.7.2 Non-linear transformations of the data
    31.7.3 Non-linear PCA biplot
  31.8 Three-way principal components analysis
    31.8.1 Unfolding
    31.8.2 The Tucker3 model
    31.8.3 The PARAFAC model
  31.9 PCA and cluster analysis
  References

Chapter 32 Analysis of Contingency Tables
  32.1 Contingency table
  32.2 Chi-square statistic
  32.3 Closure
    32.3.1 Row-closure
    32.3.2 Column-closure
    32.3.3 Double-closure
  32.4 Weighted metric
  32.5 Distance of chi-square
    32.5.1 Row-closure
    32.5.2 Column-closure
    32.5.3 Double-closure
  32.6 Correspondence factor analysis
    32.6.1 Historical background
    32.6.2 Generalized singular value decomposition
    32.6.3 Biplots
    32.6.4 Application
  32.7 Log-linear model
    32.7.1 Historical introduction
    32.7.2 Algorithm
    32.7.3 Application
  References

Chapter 33 Supervised Pattern Recognition
  33.1 Supervised and unsupervised pattern recognition
  33.2 Derivation of classification rules
    33.2.1 Types of classification rules
    33.2.2 Canonical variates and linear discriminant analysis
    33.2.3 Quadratic discriminant analysis and related methods
    33.2.4 The k-nearest neighbour method
    33.2.5 Density methods
    33.2.6 Classification trees
    33.2.7 UNEQ, SIMCA and related methods
    33.2.8 Partial least squares
    33.2.9 Neural networks
  33.3 Feature selection and reduction
  33.4 Validation of classification rules
  References

Chapter 34 Curve and Mixture Resolution by Factor Analysis and Related Techniques
  34.1 Abstract and true factors
  34.2 Full-rank methods
    34.2.1 A qualitative approach
    34.2.2 Factor rotations
    34.2.3 The Varimax rotation
    34.2.4 Factor rotation by target transformation factor analysis (TTFA)
    34.2.5 Curve resolution based methods
      34.2.5.1 Curve resolution of two-factor systems
      34.2.5.2 Curve resolution of three-factor systems
    34.2.6 Factor rotation by iterative target transformation factor analysis (ITTFA)
  34.3 Evolutionary and local rank methods
    34.3.1 Evolving factor analysis (EFA)
    34.3.2 Fixed-size window evolving factor analysis (FSWEFA)
    34.3.3 Heuristic evolving latent projections (HELP)
  34.4 Pure column (or row) techniques
    34.4.1 The variance diagram (VARDIA) technique
    34.4.2 Simplisma
    34.4.3 Orthogonal projection approach (OPA)
  34.5 Quantitative methods for factor analysis
    34.5.1 Generalized rank annihilation factor analysis (GRAFA)
    34.5.2 Residual bilinearization (RBL)