Table of Contents

Data Science for Mathematicians

CRC Press/Chapman and Hall Handbooks in Mathematics Series
Series Editor: Steven G. Krantz

AIMS AND SCOPE STATEMENT FOR HANDBOOKS IN MATHEMATICS SERIES

The purpose of this series is to provide an entree to active areas of mathematics for graduate students, beginning professionals, and even for seasoned researchers. Each volume should contain lively pieces that introduce the reader to current areas of interest. The writing will be semi-expository, with some proofs included for texture and substance. But it will be lively and inviting. Our aim is to develop future workers in each field of study. These handbooks are a unique device for keeping mathematics up-to-date and vital. And for involving new people in developing fields. We anticipate that they will have a distinct impact on the development of mathematical research.

Handbook of Analytic Operator Theory
Kehe Zhu

Handbook of Homotopy Theory
Haynes Miller

Data Science for Mathematicians
Nathan Carter

https://www.crcpress.com/CRC-PressChapman-and-Hall-Handbooks-in-Mathematics-Series/book-series/CRCCHPHBKMTH

Data Science for Mathematicians
Edited by Nathan Carter

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2021 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

International Standard Book Number-13: 978-0-367-02705-6 (Hardback) 978-0-429-39829-2 (ebook)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has
not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Carter, Nathan C., editor.
Title: Data science for mathematicians / Nathan Carter, ed.
Description: First edition. | Boca Raton, FL : CRC Press, 2020. | Includes bibliographical references and index. |
Contents: Programming with data / Sean Raleigh -- Linear algebra / Jeffery Leader -- Basic statistics / David White -- Clustering / Amy S. Wagaman -- Operations research / Alice Paul and Susan Martonosi -- Dimensionality reduction / Sofya Chepushtanova, Elin Farnell, Eric Kehoe, Michael Kirby, and Henry Kvinge -- Machine learning / Mahesh Agarwal, Nathan Carter, and David Oury -- Deep learning / Samuel S. Watson -- Topological data analysis / Henry Adams, Johnathan Bush, Joshua Mirth. |
Summary: “Mathematicians have skills that, if deepened in the right ways, would enable them to use data to answer questions important to them and others, and report those answers in compelling ways.
Data science combines parts of mathematics, statistics, computer science. Gaining such power and the ability to teach has reinvigorated the careers of mathematicians. This handbook will assist mathematicians to better understand the opportunities presented by data science” -- Provided by publisher.
Identifiers: LCCN 2020011719 | ISBN 9780367027056 (hardback) | ISBN 9780367528492 (paperback) | ISBN 9780429398292 (ebook)
Subjects: LCSH: Mathematical analysis. | Mathematical statistics. | Data mining. | Big data -- Mathematics.
Classification: LCC QA300 .D3416 2020 | DDC 515--dc23
LC record available at https://lccn.loc.gov/2020011719

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Contents

Foreword

1 Introduction
Nathan Carter
  1.1 Who should read this book?
  1.2 What is data science?
  1.3 Is data science new?
  1.4 What can I expect from this book?
  1.5 What will this book expect from me?

2 Programming with Data
Sean Raleigh
  2.1 Introduction
  2.2 The computing environment
    2.2.1 Hardware
    2.2.2 The command line
    2.2.3 Programming languages
    2.2.4 Integrated development environments (IDEs)
    2.2.5 Notebooks
    2.2.6 Version control
  2.3 Best practices
    2.3.1 Write readable code
    2.3.2 Don’t repeat yourself
    2.3.3 Set seeds for random processes
    2.3.4 Profile, benchmark, and optimize judiciously
    2.3.5 Test your code
    2.3.6 Don’t rely on black boxes
  2.4 Data-centric coding
    2.4.1 Obtaining data
      2.4.1.1 Files
      2.4.1.2 The web
      2.4.1.3 Databases
      2.4.1.4 Other sources and concerns
    2.4.2 Data structures
    2.4.3 Cleaning data
      2.4.3.1 Missing data
      2.4.3.2 Data values
      2.4.3.3 Outliers
      2.4.3.4 Other issues
    2.4.4 Exploratory data analysis (EDA)
  2.5 Getting help
  2.6 Conclusion

3 Linear Algebra
Jeffery Leader
  3.1 Data and matrices
    3.1.1 Data, vectors, and matrices
    3.1.2 Term-by-document matrices
    3.1.3 Matrix storage and manipulation issues
  3.2 Matrix decompositions
    3.2.1 Matrix decompositions and data science
    3.2.2 The LU decomposition
      3.2.2.1 Gaussian elimination
      3.2.2.2 The matrices L and U
      3.2.2.3 Permuting rows
      3.2.2.4 Computational notes
    3.2.3 The Cholesky decomposition
    3.2.4 Least-squares curve-fitting
    3.2.5 Recommender systems and the QR decomposition
      3.2.5.1 A motivating example
      3.2.5.2 The QR decomposition
      3.2.5.3 Applications of the QR decomposition
    3.2.6 The singular value decomposition
      3.2.6.1 SVD in our recommender system
      3.2.6.2 Further reading on the SVD
  3.3 Eigenvalues and eigenvectors
    3.3.1 Eigenproblems
    3.3.2 Finding eigenvalues
    3.3.3 The power method
    3.3.4 PageRank
  3.4 Numerical computing
    3.4.1 Floating point computing
    3.4.2 Floating point arithmetic
    3.4.3 Further reading
  3.5 Projects
    3.5.1 Creating a database
    3.5.2 The QR decomposition and query-matching
    3.5.3 The SVD and latent semantic indexing
    3.5.4 Searching a web

4 Basic Statistics
David White
  4.1 Introduction
  4.2 Exploratory data analysis and visualizations
    4.2.1 Descriptive statistics
    4.2.2 Sampling and bias
  4.3 Modeling
    4.3.1 Linear regression
    4.3.2 Polynomial regression
    4.3.3 Group-wise models and clustering
    4.3.4 Probability models
    4.3.5 Maximum likelihood estimation
  4.4 Confidence intervals
    4.4.1 The sampling distribution
    4.4.2 Confidence intervals from the sampling distribution
    4.4.3 Bootstrap resampling
  4.5 Inference
    4.5.1 Hypothesis testing
      4.5.1.1 First example
      4.5.1.2 General strategy for hypothesis testing
      4.5.1.3 Inference to compare two populations
      4.5.1.4 Other types of hypothesis tests
    4.5.2 Randomization-based inference
    4.5.3 Type I and Type II error
    4.5.4 Power and effect size
    4.5.5 The trouble with p-hacking
    4.5.6 Bias and scope of inference
  4.6 Advanced regression
    4.6.1 Transformations
    4.6.2 Outliers and high leverage points
    4.6.3 Multiple regression, interaction
    4.6.4 What to do when the regression assumptions fail
    4.6.5 Indicator variables and ANOVA
  4.7 The linear algebra approach to statistics
    4.7.1 The general linear model
    4.7.2 Ridge regression and penalized regression
    4.7.3 Logistic regression
    4.7.4 The generalized linear model
    4.7.5 Categorical data analysis
  4.8 Causality
    4.8.1 Experimental design
    4.8.2 Quasi-experiments
  4.9 Bayesian statistics
    4.9.1 Bayes’ formula
    4.9.2 Prior and posterior distributions
  4.10 A word on curricula
    4.10.1 Data wrangling
    4.10.2 Cleaning data
  4.11 Conclusion
  4.12 Sample projects

5 Clustering
Amy S. Wagaman
  5.1 Introduction
    5.1.1 What is clustering?
    5.1.2 Example applications
    5.1.3 Clustering observations
  5.2 Visualization
  5.3 Distances
  5.4 Partitioning and the k-means algorithm
    5.4.1 The k-means algorithm
    5.4.2 Issues with k-means
    5.4.3 Example with wine data
    5.4.4 Validation
    5.4.5 Other partitioning algorithms
  5.5 Hierarchical clustering
    5.5.1 Linkages
    5.5.2 Algorithm
    5.5.3 Hierarchical simple example
    5.5.4 Dendrograms and wine example
    5.5.5 Other hierarchical algorithms
  5.6 Case study
    5.6.1 k-means results
    5.6.2 Hierarchical results
    5.6.3 Case study conclusions
  5.7 Model-based methods
    5.7.1 Model development
    5.7.2 Model estimation
    5.7.3 mclust and model selection
    5.7.4 Example with wine data
    5.7.5 Model-based versus k-means
  5.8 Density-based methods
    5.8.1 Example with iris data
  5.9 Dealing with network data
    5.9.1 Network clustering example
  5.10 Challenges
    5.10.1 Feature selection
    5.10.2 Hierarchical clusters
    5.10.3 Overlapping clusters, or fuzzy clustering
  5.11 Exercises

6 Operations Research
Alice Paul and Susan Martonosi
  6.1 History and background
    6.1.1 How does OR connect to data science?
    6.1.2 The OR process
    6.1.3 Balance between efficiency and complexity
  6.2 Optimization
    6.2.1 Complexity-tractability trade-off
    6.2.2 Linear optimization
      6.2.2.1 Duality and optimality conditions
      6.2.2.2 Extension to integer programming
    6.2.3 Convex optimization
      6.2.3.1 Duality and optimality conditions
    6.2.4 Non-convex optimization
  6.3 Simulation
    6.3.1 Probability principles of simulation
    6.3.2 Generating random variables
      6.3.2.1 Simulation from a known distribution
      6.3.2.2 Simulation from an empirical distribution: bootstrapping
      6.3.2.3 Markov Chain Monte Carlo (MCMC) methods
    6.3.3 Simulation techniques for statistical and machine learning model assessment
      6.3.3.1 Bootstrapping confidence intervals
      6.3.3.2 Cross-validation
    6.3.4 Simulation techniques for prescriptive analytics
      6.3.4.1 Discrete-event simulation
      6.3.4.2 Agent-based modeling
      6.3.4.3 Using these tools for prescriptive analytics
  6.4 Stochastic optimization
    6.4.1 Dynamic programming formulation
    6.4.2 Solution techniques
  6.5 Putting the methods to use: prescriptive analytics
    6.5.1 Bike-sharing systems
    6.5.2 A customer choice model for online retail
    6.5.3 HIV treatment and prevention
  6.6 Tools
    6.6.1 Optimization solvers
    6.6.2 Simulation software and packages
    6.6.3 Stochastic optimization software and packages
  6.7 Looking to the future
  6.8 Projects
    6.8.1 The vehicle routing problem
    6.8.2 The unit commitment problem for power systems
    6.8.3 Modeling project
    6.8.4 Data project