Post-Shrinkage Strategies in Statistical and Machine Learning for High-Dimensional Data

This book presents post-estimation and prediction strategies for a host of useful statistical models with applications in data science. It combines statistical learning and machine learning techniques in a unique and optimal way. It is well known that machine learning methods are subject to many issues relating to bias, and consequently the mean squared error and prediction error may explode. For this reason, we suggest shrinkage strategies that control the bias by combining a submodel selected by a penalized method with a model with many features. Further, the suggested shrinkage methodology can be successfully implemented for high-dimensional data analysis.

Many researchers in statistics and the medical sciences work with big data and need to analyze it through statistical modeling. Estimating the model parameters accurately is an important part of such analysis. This book serves as a repository of improved estimation strategies for statisticians. It will help researchers and practitioners in their teaching and advanced research, and it is an excellent textbook for advanced undergraduate and graduate courses involving shrinkage, statistical learning, and machine learning.

• The book succinctly reveals the bias inherent in machine learning methods and provides tools, tricks, and tips for dealing with it.
• Expertly sheds light on the fundamental reasoning behind model selection and post-estimation using shrinkage and related strategies.
• This presentation is fundamental because shrinkage and other methods are appropriate for model selection and estimation problems, and there is growing interest in this area in filling the gap between competing strategies.
• Application of these strategies to real-life data sets from many walks of life.
• Analytical results are fully corroborated by numerical work, and numerous worked examples, supported by graphs for data visualization, are included in each chapter.
• The presentation and style of the book make it accessible to a broad audience. It offers a rich, concise exposition of each strategy and clearly describes how to use each estimation strategy for the problem at hand.
• The book emphasizes that statistics and statisticians can play a dominant role in solving big-data problems and will place them at the forefront of scientific discovery.
• The book contributes novel methodologies for high-dimensional data analysis (HDDA) and will open the door to continued research in this active area.
• The practical impact of the proposed work stems from its wide applicability. The developed computational packages will aid in analyzing a broad range of applications in many walks of life.

Post-Shrinkage Strategies in Statistical and Machine Learning for High-Dimensional Data

Syed Ejaz Ahmed
Feryaal Ahmed
Bahadır Yüzbaşı

Designed cover image: © Askhat Gilyakhov

First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Syed Ejaz Ahmed, Feryaal Ahmed and Bahadır Yüzbaşı

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-0-367-76344-2 (hbk)
ISBN: 978-0-367-77205-5 (pbk)
ISBN: 978-1-003-17025-9 (ebk)

DOI: 10.1201/9781003170259

Typeset in CMR10 by KnowledgeWorks Global Ltd.

Publisher's note: This book has been prepared from camera-ready copy provided by the authors.

Dedicated in loving memory to Don Fraser and Kjell Doksum.

Contents

Preface .... xiii
Acknowledgments .... xv
Author/editor biographies .... xvii
List of Figures .... xix
List of Tables .... xxiii
Contributors .... xxvii
Abbreviations .... xxix

1 Introduction .... 1
1.1 Least Absolute Shrinkage and Selection Operator .... 4
1.2 Elastic Net .... 5
1.3 Adaptive LASSO .... 5
1.4 Smoothly Clipped Absolute Deviation .... 6
1.5 Minimax Concave Penalty .... 6
1.6 High-Dimensional Weak-Sparse Regression Model .... 7
1.7 Estimation Strategies .... 8
1.7.1 Pretest Estimation Strategy .... 8
1.7.2 Shrinkage Estimation Strategy .... 8
1.8 Asymptotic Properties of Non-Penalty Estimators .... 9
1.8.1 Bias of Estimators .... 9
1.8.2 Risk of Estimators .... 9
1.9 Organization of the Book .... 10

2 Introduction to Machine Learning .... 13
2.1 What is Learning? .... 13
2.2 Unsupervised Learning: Principal Component Analysis and k-Means Clustering .... 14
2.2.1 Principal Component Analysis (PCA) .... 14
2.2.2 k-Means Clustering .... 16
2.2.3 Extension: Unsupervised Text Analysis .... 17
2.3 Supervised Learning .... 18
2.3.1 Logistic Regression .... 18
2.3.2 Multivariate Adaptive Regression Splines (MARS) .... 19
2.3.3 k Nearest Neighbours (kNN) .... 20
2.3.4 Random Forest .... 22
2.3.5 Support Vector Machine (SVM) .... 23
2.3.6 Linear Discriminant Analysis (LDA) .... 24
2.3.7 Artificial Neural Network (ANN) .... 25
2.3.8 Gradient Boosting Machine (GBM) .... 27
2.4 Implementation in R .... 28
2.5 Case Study: Genomics .... 28
2.5.1 Data Exploration .... 29
2.5.2 Modeling .... 30

3 Post-Shrinkage Strategies in Sparse Regression Models .... 33
3.1 Introduction .... 33
3.2 Estimation Strategies .... 36
3.2.1 Least Squares Estimation Strategies .... 36
3.2.2 Maximum Likelihood Estimator .... 36
3.2.3 Full Model and Submodel Estimators .... 37
3.2.4 Shrinkage Strategies .... 40
3.3 Asymptotic Analysis .... 40
3.3.1 Asymptotic Distributional Bias .... 42
3.3.2 Asymptotic Distributional Risk .... 44
3.4 Relative Risk Assessment .... 46
3.4.1 Risk Comparison of $\hat{\beta}_1^{\text{FM}}$ and $\hat{\beta}_1^{\text{SM}}$ .... 47
3.4.2 Risk Comparison of $\hat{\beta}_1^{\text{FM}}$ and $\hat{\beta}_1^{\text{S}}$ .... 47
3.4.3 Risk Comparison of $\hat{\beta}_1^{\text{S}}$ and $\hat{\beta}_1^{\text{SM}}$ .... 48
3.4.4 Risk Comparison of $\hat{\beta}_1^{\text{PS}}$ and $\hat{\beta}_1^{\text{FM}}$ .... 49
3.4.5 Risk Comparison of $\hat{\beta}_1^{\text{PS}}$ and $\hat{\beta}_1^{\text{S}}$ .... 49
3.4.6 Mean Squared Prediction Error .... 50
3.5 Simulation Experiments .... 50
3.5.1 Strong Signals and Noises .... 51
3.5.2 Signals and Noises .... 52
3.5.3 Comparing Shrinkage Estimators with Penalty Estimators .... 55
3.6 Prostate Cancer Data Example .... 65
3.6.1 Classical Strategy .... 68
3.6.2 Shrinkage and Penalty Strategies .... 71
3.6.3 Prediction Error via Bootstrapping .... 74
3.6.4 Machine Learning Strategies .... 77
3.7 R-Codes .... 81
3.8 Concluding Remarks .... 89

4 Shrinkage Strategies in High-Dimensional Regression Models .... 91
4.1 Introduction .... 91
4.2 Estimation Strategies .... 93
4.3 Integrating Submodels .... 95
4.3.1 Sparse Regression Model .... 95
4.3.2 Overfitted Regression Model .... 95
4.3.3 Underfitted Regression Model .... 96
4.3.4 Non-Linear Shrinkage Estimation Strategies .... 96
4.4 Simulation Experiments .... 96
4.5 Real Data Examples .... 97
4.5.1 Eye Data .... 97
4.5.2 Expression Data .... 103
4.5.3 Riboflavin Data .... 103
4.6 R-Codes .... 104
4.7 Concluding Remarks .... 107

5 Shrinkage Estimation Strategies in Partially Linear Models .... 109
5.1 Introduction .... 109
5.1.1 Statement of the Problem .... 110
5.2 Estimation Strategies .... 110
5.3 Asymptotic Properties .... 112
5.4 Simulation Experiments .... 116
5.4.1 Comparing with Penalty Estimators .... 117
5.5 Real Data Examples .... 126
5.5.1 Housing Prices (HP) Data .... 126
5.5.2 Investment Data of Turkey .... 127
5.6 High-Dimensional Model .... 129
5.6.1 Real Data Example .... 130
5.7 R-Codes .... 133
5.8 Concluding Remarks .... 140

6 Shrinkage Strategies: Generalized Linear Models .... 147
6.1 Introduction .... 147
6.2 Maximum Likelihood Estimation .... 149
6.3 A Gentle Introduction to the Logistic Regression Model .... 150
6.3.1 Statement of the Problem .... 150
6.4 Estimation Strategies .... 153
6.4.1 The Shrinkage Estimation Strategies .... 153
6.5 Asymptotic Properties .... 154
6.6 Simulation Experiment .... 156
6.6.1 Penalized Strategies .... 158
6.7 Real Data Examples .... 173
6.7.1 Pima Indians Diabetes (PID) Data .... 173
6.7.2 South Africa Heart-Attack Data .... 175
6.7.3 Orinda Longitudinal Study of Myopia (OLSM) Data .... 175
6.8 High-Dimensional Data .... 177
6.8.1 Simulation Experiments .... 179
6.8.2 Gene Expression Data .... 181
6.9 A Gentle Introduction to Negative Binomial Models .... 181
6.9.1 Sparse NB Regression Model .... 186
6.10 Shrinkage and Penalized Strategies .... 186
6.11 Asymptotic Analysis .... 187
6.12 Simulation Experiments .... 189
6.13 Real Data Examples .... 200
6.13.1 Resume Data .... 200
6.13.2 Labor Supply Data .... 201
6.14 High-Dimensional Data .... 203
6.15 R-Codes .... 205
6.16 Concluding Remarks .... 213