SECOND EDITION Covers Apache Spark 3 With examples in Java, Python, and Scala Jean-Georges Perrin Foreword by Rob Thomas M A N N I N G Lexicon Summary of the Spark terms involved in the deployment process Term Definition Application Your program that is built on and for Spark. Consists of a driver program and executors on the cluster. Application JAR A Java archive (JAR) file containing your Spark application. It can be an uber JAR including all the dependencies. Cluster manager An external service for acquiring resources on the cluster. It can be the Spark built-in cluster manager. More details in chapter 6. Deploy mode Distinguishes where the driver process runs. In cluster mode, the framework launches the driver inside the cluster. In client mode, the submitter launches the driver outside the cluster. You can find out which mode you are in by calling the deployMode() method. This method returns a read-only property. Driver program The process running the main() function of the application and creating the SparkContext. Everything starts here. Executor A process launched for an application on a worker node. The executor runs tasks and keeps data in memory or in disk storage across them. Each application has its own executors. Job A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (for example, save() or collect()); check out appendix I). Stage Each job gets divided into smaller sets of tasks, called stages, that depend on each other (similar to the map and reduce stages in MapReduce). Task A unit of work that will be sent to one executor. Worker node Any node that can run application code in the cluster. Application pro cesses and resources elements Worker node Job: parallel tasks triggered after an action is called Application JAR or Cache Driver program cut e x Task Task E SparkSession Cluster manager (SparkContext) Worker node Jobs are split into stages. or Cache ut c e The driver can access Ex Task Task its deployment mode. Your code in a Nodes JAR package Apache Spark components Spark in Action S E ECOND DITION JEAN-GEORGES PERRIN FOREWORD BY ROB THOMAS MANNING SHELTER ISLAND For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department Manning Publications Co. 20 Baldwin Road PO Box 761 Shelter Island, NY 11964 Email: [email protected] ©2020 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps. Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine. Manning Publications Co. Development editor: Marina Michaels 20 Baldwin Road Technical development editor: Al Scherer PO Box 761 Review editor: Aleks Dragosavljevic´ Shelter Island, NY 11964 Production editor: Lori Weidert Copy editor: Sharon Wilkey Proofreader: Melody Dolab Technical proofreader: Rambabu Dosa and Thomas Lockney Typesetter: Gordan Salinovic Cover designer: Marija Tudor ISBN 9781617295522 Printed in the United States of America Liz, Thank you for your patience, support, and love during this endeavor. Ruby, Nathaniel, Jack, and Pierre-Nicolas, Thank you for being so understanding about my lack of availability during this venture. I love you all. contents foreword xiii preface xv acknowledgments xvii about this book xix about the author xxv about the cover illustration xxvi P 1 T .............1 ART HE THEORY CRIPPLED BY AWESOME EXAMPLES 1 So, what is Spark, anyway? 3 1.1 The big picture: What Spark is and what it does 4 What is Spark? 4 ■ The four pillars of mana 6 1.2 How can you use Spark? 8 Spark in a data processing/engineering scenario 8 ■ Spark in a data science scenario 9 1.3 What can you do with Spark? 10 Spark predicts restaurant quality at NC eateries 11 ■ Spark allows fast data transfer for Lumeris 11 ■ Spark analyzes equipment logs for CERN 12 ■ Other use cases 12 1.4 Why you will love the dataframe 12 The dataframe from a Java perspective 13 ■ The dataframe from an RDBMS perspective 13 ■ A graphical representation of the dataframe 14 v vi CONTENTS 1.5 Your first example 14 Recommended software 15 ■ Downloading the code 15 Running your first application 15 ■ Your first code 17 2 Architecture and flow 19 2.1 Building your mental model 20 2.2 Using Java code to build your mental model 21 2.3 Walking through your application 23 Connecting to a master 24 ■ Loading, or ingesting, the CSV file 25 ■ Transforming your data 28 ■ Saving the work done in your dataframe to a database 29 3 The majestic role of the dataframe 33 3.1 The essential role of the dataframe in Spark 34 Organization of a dataframe 35 ■ Immutability is not a swear word 36 3.2 Using dataframes through examples 37 A dataframe after a simple CSV ingestion 39 ■ Data is stored in partitions 44 ■ Digging in the schema 45 ■ A dataframe after a JSON ingestion 46 ■ Combining two dataframes 52 3.3 The dataframe is a Dataset<Row> 57 Reusing your POJOs 58 ■ Creating a dataset of strings 59 Converting back and forth 60 3.4 Dataframe’s ancestor: the RDD 66 4 Fundamentally lazy 68 4.1 A real-life example of efficient laziness 69 4.2 A Spark example of efficient laziness 70 Looking at the results of transformations and actions 70 ■ The transformation process, step by step 72 ■ The code behind the transformation/action process 74 ■ The mystery behind the creation of 7 million datapoints in 182 ms 77 ■ The mystery behind the timing of actions 79 4.3 Comparing to RDBMS and traditional applications 83 Working with the teen birth rates dataset 83 ■ Analyzing differences between a traditional app and a Spark app 84 4.4 Spark is amazing for data-focused applications 86 4.5 Catalyst is your app catalyzer 86 CONTENTS vii 5 Building a simple app for deployment 90 5.1 An ingestionless example 91 Calculating p 91 ■ The code to approximate p 93 ■ What are lambda functions in Java? 99 ■ Approximating p by using lambda functions 101 5.2 Interacting with Spark 102 Local mode 103 ■ Cluster mode 104 ■ Interactive mode in Scala and Python 107 6 Deploying your simple app 114 6.1 Beyond the example: The role of the components 116 Quick overview of the components and their interactions 116 Troubleshooting tips for the Spark architecture 120 ■ Going further 121 6.2 Building a cluster 121 Building a cluster that works for you 122 ■ Setting up the environment 123 6.3 Building your application to run on the cluster 126 Building your application’s uber JAR 127 ■ Building your application by using Git and Maven 129 6.4 Running your application on the cluster 132 Submitting the uber JAR 132 ■ Running the application 133 Analyzing the Spark user interface 133 P 2 I .........................................................137 ART NGESTION 7 Ingestion from files 139 7.1 Common behaviors of parsers 141 7.2 Complex ingestion from CSV 141 Desired output 142 ■ Code 143 7.3 Ingesting a CSV with a known schema 144 Desired output 145 ■ Code 145 7.4 Ingesting a JSON file 146 Desired output 148 ■ Code 149 7.5 Ingesting a multiline JSON file 150 Desired output 151 ■ Code 152 7.6 Ingesting an XML file 153 Desired output 155 ■ Code 155 viii CONTENTS 7.7 Ingesting a text file 157 Desired output 158 ■ Code 158 7.8 File formats for big data 159 The problem with traditional file formats 159 ■ Avro is a schema- based serialization format 160 ■ ORC is a columnar storage format 161 ■ Parquet is also a columnar storage format 161 Comparing Avro, ORC, and Parquet 161 7.9 Ingesting Avro, ORC, and Parquet files 162 Ingesting Avro 162 ■ Ingesting ORC 164 ■ Ingesting Parquet 165 ■ Reference table for ingesting Avro, ORC, or Parquet 167 8 Ingestion from databases 168 8.1 Ingestion from relational databases 169 Database connection checklist 170 ■ Understanding the data used in the examples 170 ■ Desired output 172 ■ Code 173 Alternative code 175 8.2 The role of the dialect 176 What is a dialect, anyway? 177 ■ JDBC dialects provided with Spark 177 ■ Building your own dialect 177 8.3 Advanced queries and ingestion 180 Filtering by using a WHERE clause 180 ■ Joining data in the database 183 ■ Performing Ingestion and partitioning 185 Summary of advanced features 188 8.4 Ingestion from Elasticsearch 188 Data flow 189 ■ The New York restaurants dataset digested by Spark 189 ■ Code to ingest the restaurant dataset from Elasticsearch 191 9 Advanced ingestion: finding data sources and building your own 194 9.1 What is a data source? 196 9.2 Benefits of a direct connection to a data source 197 Temporary files 198 ■ Data quality scripts 198 ■ Data on demand 199 9.3 Finding data sources at Spark Packages 199 9.4 Building your own data source 199 Scope of the example project 200 ■ Your data source API and options 202