Deep Learning and Scientific Computing with R torch

torch is an R port of PyTorch, one of the two most-employed deep learning frameworks in industry and research. It is also an excellent tool to use in scientific computations. It is written entirely in R and C/C++.

Though still “young” as a project, R torch already has a vibrant community of users and developers. Experience shows that torch users come from a broad range of different backgrounds. This book aims to be useful to (almost) everyone. Globally speaking, its purposes are threefold:

- Provide a thorough introduction to torch basics – both by carefully explaining underlying concepts and ideas, and by showing enough examples for the reader to become “fluent” in torch.
- Again with a focus on conceptual explanation, show how to use torch in deep-learning applications, ranging from image recognition over time series prediction to audio classification.
- Provide a concepts-first, reader-friendly introduction to selected scientific-computation topics (namely, matrix computations, the Discrete Fourier Transform, and wavelets), all accompanied by torch code you can play with.

Deep Learning and Scientific Computing with R torch is written with first-hand technical expertise and in an engaging, fun-to-read way.

Chapman & Hall/CRC
The R Series

Series Editors
John M. Chambers, Department of Statistics, Stanford University, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland
Duncan Temple Lang, Department of Statistics, University of California, Davis, USA
Hadley Wickham, RStudio, Boston, Massachusetts, USA

Recently Published Titles

R for Conservation and Development Projects: A Primer for Practitioners
Nathan Whitmore

Using R for Bayesian Spatial and Spatio-Temporal Health Modeling
Andrew B. Lawson

Engineering Production-Grade Shiny Apps
Colin Fay, Sébastien Rochette, Vincent Guyader, and Cervan Girard

Javascript for R
John Coene

Advanced R Solutions
Malte Grosser, Henning Bumann, and Hadley Wickham

Event History Analysis with R, Second Edition
Göran Broström

Behavior Analysis with Machine Learning Using R
Enrique Garcia Ceja

Rasch Measurement Theory Analysis in R: Illustrations and Practical Guidance for Researchers and Practitioners
Stefanie Wind and Cheng Hua

Spatial Sampling with R
Dick R. Brus

Crime by the Numbers: A Criminologist’s Guide to R
Jacob Kaplan

Analyzing US Census Data: Methods, Maps, and Models in R
Kyle Walker

ANOVA and Mixed Models: A Short Introduction Using R
Lukas Meier

Tidy Finance with R
Stefan Voigt, Patrick Weiss, and Christoph Scheuch

Deep Learning and Scientific Computing with R torch
Sigrid Keydana

Model-Based Clustering, Classification, and Density Estimation Using mclust in R
Lucca Scrucca, Chris Fraley, T. Brendan Murphy, and Adrian E. Raftery

Spatial Data Science: With Applications in R
Edzer Pebesma and Roger Bivand

For more information about this series, please visit: https://www.crcpress.com/Chapman--Hall-CRC-The-R-Series/book-series/CRCTHERSER

Deep Learning and Scientific Computing with R torch

Sigrid Keydana

Designed cover image: https://www.shutterstock.com/image-photo/eurasian-red-squirrel-sciurus-vulgaris-looking-2070311126

First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Sigrid Keydana

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-1-032-23138-9 (hbk)
ISBN: 978-1-032-23139-6 (pbk)
ISBN: 978-1-003-27592-3 (ebk)

DOI: 10.1201/9781003275923

Typeset in Latin Modern font by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

Contents

List of Figures
Preface
Author Biography

I Getting Familiar with Torch

1 Overview

2 On torch, and How to Get It
  2.1 In torch World
  2.2 Installing and Running torch

3 Tensors
  3.1 What’s in a Tensor?
  3.2 Creating Tensors
    3.2.1 Tensors from values
    3.2.2 Tensors from specifications
    3.2.3 Tensors from datasets
  3.3 Operations on Tensors
    3.3.1 Summary operations
  3.4 Accessing Parts of a Tensor
    3.4.1 “Think R”
  3.5 Reshaping Tensors
    3.5.1 Zero-copy reshaping vs. reshaping with copy
  3.6 Broadcasting
    3.6.1 Broadcasting rules

4 Autograd
  4.1 Why Compute Derivatives?
  4.2 Automatic Differentiation Example
  4.3 Automatic Differentiation with torch autograd

5 Function Minimization with autograd
  5.1 An Optimization Classic
  5.2 Minimization from Scratch

6 A Neural Network from Scratch
  6.1 Idea
  6.2 Layers
  6.3 Activation Functions
  6.4 Loss Functions
  6.5 Implementation
    6.5.1 Generate random data
    6.5.2 Build the network
    6.5.3 Train the network

7 Modules
  7.1 Built-in nn_module()s
  7.2 Building up a Model
    7.2.1 Models as sequences of layers: nn_sequential()
    7.2.2 Models with custom logic

8 Optimizers
  8.1 Why Optimizers?
  8.2 Using built-in torch Optimizers
  8.3 Parameter Update Strategies
    8.3.1 Gradient descent (a.k.a. steepest descent, a.k.a. stochastic gradient descent (SGD))
    8.3.2 Things that matter
    8.3.3 Staying on track: Gradient descent with momentum
    8.3.4 Adagrad
    8.3.5 RMSProp
    8.3.6 Adam

9 Loss Functions
  9.1 torch Loss Functions
  9.2 What Loss Function Should I Choose?
    9.2.1 Maximum likelihood
    9.2.2 Regression
    9.2.3 Classification

10 Function Minimization with L-BFGS
  10.1 Meet L-BFGS
    10.1.1 Changing slopes
    10.1.2 Exact Newton method
    10.1.3 Approximate Newton: BFGS and L-BFGS
    10.1.4 Line search
  10.2 Minimizing the Rosenbrock Function with optim_lbfgs()
    10.2.1 optim_lbfgs() default behavior
    10.2.2 optim_lbfgs() with line search

11 Modularizing the Neural Network
  11.1 Data
  11.2 Network
  11.3 Training
  11.4 What’s to Come

II Deep Learning with torch

12 Overview

13 Loading Data
  13.1 Data vs. dataset() vs. dataloader() – What’s the Difference?
  13.2 Using dataset()s
    13.2.1 A self-built dataset()
    13.2.2 tensor_dataset()
    13.2.3 torchvision::mnist_dataset()
  13.3 Using dataloader()s

14 Training with luz
  14.1 Que haya luz – Que haja luz – Let there be Light
  14.2 Porting the Toy Example
    14.2.1 Data
    14.2.2 Model
    14.2.3 Training
  14.3 A More Realistic Scenario
    14.3.1 Integrating training, validation, and test
    14.3.2 Using callbacks to “hook” into the training process
    14.3.3 How luz helps with devices
  14.4 Appendix: A Train-Validate-Test Workflow Implemented by Hand

15 A First Go at Image Classification
  15.1 What Does It Take to Classify an Image?
  15.2 Neural Networks for Feature Detection and Feature Emergence
    15.2.1 Detecting low-level features with cross-correlation
    15.2.2 Build up feature hierarchies
  15.3 Classification on Tiny Imagenet
    15.3.1 Data pre-processing
    15.3.2 Image classification from scratch

16 Making Models Generalize
  16.1 The Royal Road: More – and More Representative! – Data
  16.2 Pre-processing Stage: Data Augmentation
    16.2.1 Classic data augmentation
    16.2.2 Mixup
  16.3 Modeling Stage: Dropout and Regularization
    16.3.1 Dropout
    16.3.2 Regularization
  16.4 Training Stage: Early Stopping

17 Speeding up Training
  17.1 Batch Normalization
  17.2 Dynamic Learning Rates
    17.2.1 Learning rate finder
    17.2.2 Learning rate schedulers
  17.3 Transfer Learning

18 Image Classification, Take Two: Improving Performance
  18.1 Data Input (Common for All)
  18.2 Run 1: Dropout
  18.3 Run 2: Batch Normalization
  18.4 Run 3: Transfer Learning

19 Image Segmentation
  19.1 Segmentation vs. Classification
  19.2 U-Net, a “Classic” in Image Segmentation
  19.3 U-Net – a torch Implementation
    19.3.1 Encoder
    19.3.2 Decoder
    19.3.3 The “U”
    19.3.4 Top-level module
  19.4 Dogs and Cats

20 Tabular Data
  20.1 Types of Numerical Data, by Example
  20.2 A torch dataset for Tabular Data
  20.3 Embeddings in Deep Learning: The Idea
  20.4 Embeddings in Deep Learning: Implementation
  20.5 Model and Model Training
  20.6 Embedding-generated Representations by Example

21 Time Series
  21.1 Deep Learning for Sequences: The Idea
  21.2 A Basic Recurrent Neural Network
    21.2.1 Basic rnn_cell()
    21.2.2 Basic rnn_module()
  21.3 Recurrent Neural Networks in torch
  21.4 RNNs in Practice: GRU and LSTM
  21.5 Forecasting Electricity Demand
    21.5.1 Data inspection
    21.5.2 Forecasting the very next value
    21.5.3 Forecasting multiple time steps ahead

22 Audio Classification
  22.1 Classifying Speech Data
  22.2 Two Equivalent Representations
  22.3 Combining Representations: The Spectrogram
  22.4 Training a Model for Audio Classification
    22.4.1 Baseline setup: Training a convnet on spectrograms
    22.4.2 Variation one: Use a Mel-scale spectrogram instead
    22.4.3 Variation two: Complex-valued spectrograms

III Other Things to do with torch: Matrices, Fourier Transform, and Wavelets

23 Overview

24 Matrix Computations: Least-squares Problems
  24.1 Five Ways to do Least Squares
  24.2 Regression for Weather Prediction
    24.2.1 Least squares (I): Setting expectations with lm()
    24.2.2 Least squares (II): Using linalg_lstsq()
    24.2.3 Interlude: What if we hadn’t standardized the data?
    24.2.4 Least squares (III): The normal equations
    24.2.5 Least squares (IV): Cholesky decomposition
    24.2.6 Least squares (V): LU factorization
    24.2.7 Least squares (VI): QR factorization
    24.2.8 Least squares (VII): Singular Value Decomposition (SVD)
    24.2.9 Checking execution times
  24.3 A Quick Look at Stability

25 Matrix Computations: Convolution
  25.1 Why Convolution?
  25.2 Convolution in One Dimension
    25.2.1 Two ways to think about convolution
    25.2.2 Implementation
  25.3 Convolution in Two Dimensions
    25.3.1 How it works (output view)
    25.3.2 Implementation

26 Exploring the Discrete Fourier Transform (DFT)
  26.1 Understanding the Output of torch_fft_fft()
    26.1.1 Starting point: A cosine of frequency 1
    26.1.2 Reconstructing the magic
    26.1.3 Varying frequency
    26.1.4 Varying amplitude
    26.1.5 Adding phase
    26.1.6 Superposition of sinusoids