Advanced Analytics with Spark
Patterns for Learning from Data at Scale
Second Edition

Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Beijing, Boston, Farnham, Sebastopol, Tokyo

Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Copyright © 2017 Sanford Ryza, Uri Laserson, Sean Owen, Joshua Wills. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Marie Beaugureau
Production Editor: Melanie Yarbrough
Copyeditor: Gillian McGarvey
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2017: Second Edition

Revision History for the Second Edition
2017-06-09: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced Analytics with Spark, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97295-3
[LSI]

Table of Contents

Foreword
Preface

1. Analyzing Big Data
    The Challenges of Data Science
    Introducing Apache Spark
    About This Book
    The Second Edition

2. Introduction to Data Analysis with Scala and Spark
    Scala for Data Scientists
    The Spark Programming Model
    Record Linkage
    Getting Started: The Spark Shell and SparkContext
    Bringing Data from the Cluster to the Client
    Shipping Code from the Client to the Cluster
    From RDDs to Data Frames
    Analyzing Data with the DataFrame API
    Fast Summary Statistics for DataFrames
    Pivoting and Reshaping DataFrames
    Joining DataFrames and Selecting Features
    Preparing Models for Production Environments
    Model Evaluation
    Where to Go from Here

3. Recommending Music and the Audioscrobbler Data Set
    Data Set
    The Alternating Least Squares Recommender Algorithm
    Preparing the Data
    Building a First Model
    Spot Checking Recommendations
    Evaluating Recommendation Quality
    Computing AUC
    Hyperparameter Selection
    Making Recommendations
    Where to Go from Here

4. Predicting Forest Cover with Decision Trees
    Fast Forward to Regression
    Vectors and Features
    Training Examples
    Decision Trees and Forests
    Covtype Data Set
    Preparing the Data
    A First Decision Tree
    Decision Tree Hyperparameters
    Tuning Decision Trees
    Categorical Features Revisited
    Random Decision Forests
    Making Predictions
    Where to Go from Here

5. Anomaly Detection in Network Traffic with K-means Clustering
    Anomaly Detection
    K-means Clustering
    Network Intrusion
    KDD Cup 1999 Data Set
    A First Take on Clustering
    Choosing k
    Visualization with SparkR
    Feature Normalization
    Categorical Variables
    Using Labels with Entropy
    Clustering in Action
    Where to Go from Here

6. Understanding Wikipedia with Latent Semantic Analysis
    The Document-Term Matrix
    Getting the Data
    Parsing and Preparing the Data
    Lemmatization
    Computing the TF-IDFs
    Singular Value Decomposition
    Finding Important Concepts
    Querying and Scoring with a Low-Dimensional Representation
    Term-Term Relevance
    Document-Document Relevance
    Document-Term Relevance
    Multiple-Term Queries
    Where to Go from Here

7. Analyzing Co-Occurrence Networks with GraphX
    The MEDLINE Citation Index: A Network Analysis
    Getting the Data
    Parsing XML Documents with Scala’s XML Library
    Analyzing the MeSH Major Topics and Their Co-Occurrences
    Constructing a Co-Occurrence Network with GraphX
    Understanding the Structure of Networks
    Connected Components
    Degree Distribution
    Filtering Out Noisy Edges
    Processing EdgeTriplets
    Analyzing the Filtered Graph
    Small-World Networks
    Cliques and Clustering Coefficients
    Computing Average Path Length with Pregel
    Where to Go from Here

8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
    Getting the Data
    Working with Third-Party Libraries in Spark
    Geospatial Data with the Esri Geometry API and Spray
    Exploring the Esri Geometry API
    Intro to GeoJSON
    Preparing the New York City Taxi Trip Data
    Handling Invalid Records at Scale
    Geospatial Analysis
    Sessionization in Spark
    Building Sessions: Secondary Sorts in Spark
    Where to Go from Here

9. Estimating Financial Risk Through Monte Carlo Simulation
    Terminology
    Methods for Calculating VaR
    Variance-Covariance
    Historical Simulation
    Monte Carlo Simulation
    Our Model
    Getting the Data
    Preprocessing
    Determining the Factor Weights
    Sampling
    The Multivariate Normal Distribution
    Running the Trials
    Visualizing the Distribution of Returns
    Evaluating Our Results
    Where to Go from Here

10. Analyzing Genomics Data and the BDG Project
    Decoupling Storage from Modeling
    Ingesting Genomics Data with the ADAM CLI
    Parquet Format and Columnar Storage
    Predicting Transcription Factor Binding Sites from ENCODE Data
    Querying Genotypes from the 1000 Genomes Project
    Where to Go from Here

11. Analyzing Neuroimaging Data with PySpark and Thunder
    Overview of PySpark
    PySpark Internals
    Overview and Installation of the Thunder Library
    Loading Data with Thunder
    Thunder Core Data Types
    Categorizing Neuron Types with Thunder
    Where to Go from Here

Index

Foreword

Ever since we started the Spark project at Berkeley, I’ve been excited about not just building fast parallel systems, but helping more and more people make use of large-scale computing. This is why I’m very happy to see this book, written by four experts in data science, on advanced analytics with Spark. Sandy, Uri, Sean, and Josh have been working with Spark for a while, and have put together a great collection of content with equal parts explanations and examples.

The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets. It’s hard to find one, let alone 10, examples that cover big data and that you can run on your laptop, but the authors have managed to create such a collection and set everything up so you can run them in Spark. Moreover, the authors cover not just the core algorithms, but the intricacies of data preparation and model tuning that are needed to really get good results. You should be able to take the concepts in these examples and directly apply them to your own problems.

Big data processing is undoubtedly one of the most exciting areas in computing today, and remains an area of fast evolution and introduction of new ideas. I hope that this book helps you get started in this exciting new field.

Matei Zaharia, CTO at Databricks and Vice President, Apache Spark