
The Energy of Data and Distance Correlation PDF

467 Pages·2023·6.453 MB·English

Preview The Energy of Data and Distance Correlation

The Energy of Data and Distance Correlation

Energy distance is a statistical distance between the distributions of random vectors, which characterizes the equality of distributions. The name energy derives from Newton’s gravitational potential energy, and there is an elegant relation to the notion of potential energy between statistical observations. Energy statistics are functions of distances between statistical observations in metric spaces. The authors hope this book will spark the interest of most statisticians who so far have not explored E-statistics and would like to apply these new methods using R.

The Energy of Data and Distance Correlation is intended for teachers and students looking for dedicated material on energy statistics but can serve as a supplement to a wide range of courses and areas, such as Monte Carlo methods, U-statistics or V-statistics, measures of multivariate dependence, goodness-of-fit tests, nonparametric methods, and distance-based methods.

• E-statistics provides powerful methods to deal with problems in multivariate inference and analysis.
• Methods are implemented in R, and readers can immediately apply them using the freely available energy package for R.
• The proposed book will provide an overview of the existing state-of-the-art in development of energy statistics and an overview of applications.
• Background and literature review are valuable for anyone considering further research or application in energy statistics.

MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY

Editors: F. Bunea, R. Henderson, N. Keiding, L. Levina, N. Meinshausen, R. Smith

Recently Published Titles

Multistate Models for the Analysis of Life History Data, Richard J. Cook and Jerald F. Lawless (158)
Nonparametric Models for Longitudinal Data with Implementation in R, Colin O. Wu and Xin Tian (159)
Multivariate Kernel Smoothing and Its Applications, José E. Chacón and Tarn Duong (160)
Sufficient Dimension Reduction Methods and Applications with R, Bing Li (161)
Large Covariance and Autocovariance Matrices, Arup Bose and Monika Bhattacharjee (162)
The Statistical Analysis of Multivariate Failure Time Data: A Marginal Modeling Approach, Ross L. Prentice and Shanshan Zhao (163)
Dynamic Treatment Regimes: Statistical Methods for Precision Medicine, Anastasios A. Tsiatis, Marie Davidian, Shannon T. Holloway, and Eric B. Laber (164)
Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules, Alexander Tartakovsky (165)
Introduction to Time Series Modeling, Genshiro Kitagawa (166)
Replication and Evidence Factors in Observational Studies, Paul R. Rosenbaum (167)
Introduction to High-Dimensional Statistics, Second Edition, Christophe Giraud (168)
Object Oriented Data Analysis, J.S. Marron and Ian L. Dryden (169)
Martingale Methods in Statistics, Yoichi Nishiyama (170)
The Energy of Data and Distance Correlation, Gabor J. Szekely and Maria L. Rizzo

For more information about this series please visit: https://www.crcpress.com/Chapman--HallCRC-Monographs-on-Statistics--Applied-Probability/book-series/CHMONSTAAPP

The Energy of Data and Distance Correlation
Gábor J. Székely
Maria L. Rizzo

First edition published 2023
by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press, 4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-1-482-24274-4 (hbk)
ISBN: 978-1-032-43379-0 (pbk)
ISBN: 978-0-429-15715-8 (ebk)

Typeset in Latin Modern by KnowledgeWorks Global Ltd.

DOI: 10.1201/9780429157158

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

Access the companion website: http://cran.us.r-project.org/web/packages/energy/index.html
Access the Supplementary Material: github.com/mariarizzo/energy

Contents

Preface
Authors
Notation

I The Energy of Data

1 Introduction
  1.1 Distances of Data
  1.2 Energy of Data: Distance Science of Data

2 Preliminaries
  2.1 Notation
  2.2 V-statistics and U-statistics
    2.2.1 Examples
    2.2.2 Representation as a V-statistic
    2.2.3 Asymptotic Distribution
    2.2.4 E-statistics as V-statistics vs U-statistics
  2.3 A Key Lemma
  2.4 Invariance
  2.5 Exercises

3 Energy Distance
  3.1 Introduction: The Energy of Data
  3.2 The Population Value of Statistical Energy
  3.3 A Simple Proof of the Inequality
  3.4 Energy Distance and Cramér’s Distance
  3.5 Multivariate Case
  3.6 Why is Energy Distance Special?
  3.7 Infinite Divisibility and Energy Distance
  3.8 Freeing Energy via Uniting Sets in Partitions
  3.9 Applications of Energy Statistics
  3.10 Exercises

4 Introduction to Energy Inference
  4.1 Introduction
  4.2 Testing for Equal Distributions
  4.3 Permutation Distribution and Test
  4.4 Goodness-of-Fit
  4.5 Energy Test of Univariate Normality
  4.6 Multivariate Normality and other Energy Tests
  4.7 Exercises

5 Goodness-of-Fit
  5.1 Energy Goodness-of-Fit Tests
  5.2 Continuous Uniform Distribution
  5.3 Exponential and Two-Parameter Exponential
  5.4 Energy Test of Normality
  5.5 Bernoulli Distribution
  5.6 Geometric Distribution
  5.7 Beta Distribution
  5.8 Poisson Distribution
    5.8.1 The Poisson E-test
    5.8.2 Probabilities in Terms of Mean Distances
    5.8.3 The Poisson M-test
    5.8.4 Implementation of Poisson Tests
  5.9 Energy Test for Location-Scale Families
  5.10 Asymmetric Laplace Distribution
    5.10.1 Expected Distances
    5.10.2 Test Statistic and Empirical Results
  5.11 The Standard Half-Normal Distribution
  5.12 The Inverse Gaussian Distribution
  5.13 Testing Spherical Symmetry; Stolarsky Invariance
  5.14 Proofs
  5.15 Exercises

6 Testing Multivariate Normality
  6.1 Energy Test of Multivariate Normality
    6.1.1 Simple Hypothesis: Known Parameters
    6.1.2 Composite Hypothesis: Estimated Parameters
    6.1.3 On the Asymptotic Behavior of the Test
    6.1.4 Simulations
  6.2 Energy Projection-Pursuit Test of Fit
    6.2.1 Methodology
    6.2.2 Projection Pursuit Results
  6.3 Proofs
    6.3.1 Hypergeometric Series Formula
    6.3.2 Original Formula
  6.4 Exercises

7 Eigenvalues for One-Sample E-Statistics
  7.1 Introduction
  7.2 Kinetic Energy: The Schrödinger Equation
  7.3 CF Version of the Hilbert-Schmidt Equation
  7.4 Implementation
  7.5 Computation of Eigenvalues
  7.6 Computational and Empirical Results
    7.6.1 Results for Univariate Normality
    7.6.2 Testing Multivariate Normality
    7.6.3 Computational Efficiency
  7.7 Proofs
  7.8 Exercises

8 Generalized Goodness-of-Fit
  8.1 Introduction
  8.2 Pareto Distributions
    8.2.1 Energy Tests for Pareto Distribution
    8.2.2 Test of Transformed Pareto Sample
    8.2.3 Statistics for the Exponential Model
    8.2.4 Pareto Statistics
    8.2.5 Minimum Distance Estimation
  8.3 Cauchy Distribution
  8.4 Stable Family of Distributions
  8.5 Symmetric Stable Family
  8.6 Exercises

9 Multi-sample Energy Statistics
  9.1 Energy Distance of a Set of Random Variables
  9.2 Multi-sample Energy Statistics
  9.3 Distance Components: A Nonparametric Extension of ANOVA
    9.3.1 The DISCO Decomposition
    9.3.2 Application: Decomposition of Residuals
  9.4 Hierarchical Clustering
  9.5 Case Study: Hierarchical Clustering
  9.6 K-groups Clustering
    9.6.1 K-groups Objective Function
    9.6.2 K-groups Clustering Algorithm
    9.6.3 K-means as a Special Case of K-groups
  9.7 Case Study: Hierarchical and K-groups Cluster Analysis
  9.8 Further Reading
    9.8.1 Bayesian Applications
  9.9 Proofs
    9.9.1 Proof of Theorem 9.1
    9.9.2 Proof of Proposition 9.1
  9.10 Exercises

10 Energy in Metric Spaces and Other Distances
  10.1 Metric Spaces
    10.1.1 Review of Metric Spaces
    10.1.2 Examples of Metrics
  10.2 Energy Distance in a Metric Space
  10.3 Banach Spaces
  10.4 Earth Mover’s Distance
    10.4.1 Wasserstein Distance
    10.4.2 Energy vs. Earth Mover’s Distance
  10.5 Minimum Energy Distance (MED) Estimators
  10.6 Energy in Hyperbolic Spaces and in Spheres
  10.7 The Space of Positive Definite Symmetric Matrices
  10.8 Energy and Machine Learning
  10.9 Minkowski Kernel and Gaussian Kernel
  10.10 On Some Non-Energy Distances
  10.11 Topological Data Analysis
  10.12 Exercises

II Distance Correlation and Dependence

11 On Correlation and Other Measures of Association
  11.1 The First Measure of Dependence: Correlation
  11.2 Distance Correlation
  11.3 Other Dependence Measures
  11.4 Representations by Uncorrelated Random Variables

12 Distance Correlation
  12.1 Introduction
  12.2 Characteristic Function Based Covariance
  12.3 Dependence Coefficients
    12.3.1 Definitions
  12.4 Sample Distance Covariance and Correlation
    12.4.1 Derivation of $\mathcal{V}_n^2$
    12.4.2 Equivalent Definitions for $\mathcal{V}_n^2$
    12.4.3 Theorem on dCov Statistic Formula
  12.5 Properties
  12.6 Distance Correlation for Gaussian Variables
  12.7 Proofs
    12.7.1 Finiteness of $\|\hat{f}_{X,Y}(t,s) - \hat{f}_X(t)\hat{f}_Y(s)\|^2$
    12.7.2 Proof of Theorem 12.1
    12.7.3 Proof of Theorem 12.2
    12.7.4 Proof of Theorem 12.4
  12.8 Exercises

13 Testing Independence
  13.1 The Sampling Distribution of $n\mathcal{V}_n^2$
    13.1.1 Expected Value and Bias of Distance Covariance
    13.1.2 Convergence
    13.1.3 Asymptotic Properties of $n\mathcal{V}_n^2$
  13.2 Testing Independence
    13.2.1 Implementation as a Permutation Test
    13.2.2 Rank Test
    13.2.3 Categorical Data
    13.2.4 Examples
    13.2.5 Power Comparisons
  13.3 Mutual Independence
  13.4 Proofs
    13.4.1 Proof of Proposition 13.1
    13.4.2 Proof of Theorem 13.1
    13.4.3 Proof of Corollary 13.3
    13.4.4 Proof of Theorem 13.2
  13.5 Exercises

14 Applications and Extensions
  14.1 Applications
    14.1.1 Nonlinear and Non-monotone Dependence
    14.1.2 Identify and Test for Nonlinearity
    14.1.3 Exploratory Data Analysis
    14.1.4 Identify Influential Observations
  14.2 Some Extensions
    14.2.1 Affine and Monotone Invariant Versions
    14.2.2 Generalization: Powers of Distances
    14.2.3 Distance Correlation for Dissimilarities
    14.2.4 An Unbiased Distance Covariance Statistic
  14.3 Distance Correlation in Metric Spaces
    14.3.1 Hilbert Spaces and General Metric Spaces
    14.3.2 Testing Independence in Separable Metric Spaces
    14.3.3 Measuring Associations in Banach Spaces
  14.4 Distance Correlation with General Kernels
  14.5 Further Reading
    14.5.1 Variable Selection, DCA and ICA
    14.5.2 Nonparametric MANOVA Based on dCor
    14.5.3 Tests of Independence with Ranks
    14.5.4 Projection Correlation
    14.5.5 Detection of Periodicity via Distance Correlation
    14.5.6 dCov Goodness-of-fit Test of Dirichlet Distribution
  14.6 Exercises
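
The book description in this preview characterizes energy statistics as functions of distances between observations, with energy distance measuring how far apart two distributions are. As a rough illustration only (not taken from the book and not the API of the energy package for R), the sketch below computes the commonly stated two-sample form, twice the mean between-sample distance minus the two mean within-sample distances, in plain Python. The function names `mean_distance` and `energy_distance` are hypothetical, and Euclidean distance is an assumption.

```python
import math


def mean_distance(a, b):
    """Average Euclidean distance over all pairs (one point from a, one from b)."""
    total = sum(math.dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))


def energy_distance(x, y):
    """Sample energy distance between two samples of points.

    Assumed form: 2 * mean|X - Y| - mean|X - X'| - mean|Y - Y'|,
    where each mean runs over all pairs, including a point with itself.
    """
    return 2.0 * mean_distance(x, y) - mean_distance(x, x) - mean_distance(y, y)


# Two small illustrative samples in the plane (made-up data).
x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
y = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

same = energy_distance(x, x)     # a sample compared with itself
shifted = energy_distance(x, y)  # a sample compared with a shifted copy

print(same, shifted)
```

Comparing a sample with itself gives zero, while the shifted copy gives a positive value, which is the behavior the description attributes to a distance that characterizes equality of distributions. Readers working in R would presumably use the energy package mentioned above instead of a hand-written version like this one.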
