About This E-Book

EPUB is an open, industry-standard format for e-books. However, support for EPUB and its many features varies across reading devices and applications. Use your device or app settings to customize the presentation to your liking. Settings that you can customize often include font, font size, single or double column, landscape or portrait mode, and figures that you can click or tap to enlarge. For additional information about the settings and features on your reading device or app, visit the device manufacturer’s Web site.

Many titles include programming code or configuration examples. To optimize the presentation of these elements, view the e-book in single-column, landscape mode and adjust the font size to the smallest setting. In addition to presenting code and configurations in the reflowable text format, we have included images of the code that mimic the presentation found in the print book; therefore, where the reflowable format may compromise the presentation of the code listing, you will see a “Click here to view code image” link. Click the link to view the print-fidelity code image. To return to the previous page viewed, click the Back button on your device or app.

Data Analytics with Spark Using Python

Jeffrey Aven

Boston • Columbus • Indianapolis • New York • San Francisco • Amsterdam
Cape Town • Dubai • London • Madrid • Milan • Munich • Paris
Montreal • Toronto • Delhi • Mexico City • São Paulo • Sydney
Hong Kong • Seoul • Singapore • Taipei • Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions.
No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at [email protected] or (800) 382-3419.

For government sales inquiries, please contact [email protected].

For questions about sales outside the U.S., please contact [email protected].

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2018938456

© 2018 Pearson Education, Inc.

All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

Microsoft and/or its respective suppliers make no representations about the suitability of the information contained in the documents and related graphics published as part of the services for any purpose. All such documents and related graphics are provided “as is” without warranty of any kind. Microsoft and/or its respective suppliers hereby disclaim all warranties and conditions with regard to this information, including all warranties and conditions of merchantability, whether express, implied or statutory, fitness for a particular purpose, title and non-infringement.
In no event shall Microsoft and/or its respective suppliers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of information available from the services.

The documents and related graphics contained herein could include technical inaccuracies or typographical errors. Changes are periodically added to the information herein. Microsoft and/or its respective suppliers may make improvements and/or changes in the product(s) and/or the program(s) described herein at any time. Partial screenshots may be viewed in full within the software version specified.

Microsoft® Windows®, and Microsoft Office® are registered trademarks of the Microsoft Corporation in the U.S.A. and other countries. This book is not sponsored or endorsed by or affiliated with the Microsoft Corporation.

ISBN-13: 978-0-13-484601-9
ISBN-10: 0-13-484601-X

1 18

Editor-in-Chief
Greg Wiegand

Executive Editor
Trina MacDonald

Development Editor
Amanda Kaufmann

Managing Editor
Sandra Schroeder

Senior Project Editor
Lori Lyons

Technical Editor
Yaniv Rodenski

Copy Editor
Catherine D. Wilson

Project Manager
Dhayanidhi Karunanidhi

Indexer
Erika Millen

Proofreader
Jeanine Furino

Cover Designer
Chuti Prasertsith

Compositor
codemantra

Contents at a Glance

Preface
Introduction
I: Spark Foundations
  1 Introducing Big Data, Hadoop, and Spark
  2 Deploying Spark
  3 Understanding the Spark Cluster Architecture
  4 Learning Spark Programming Basics
II: Beyond the Basics
  5 Advanced Programming Using the Spark Core API
  6 SQL and NoSQL Programming with Spark
  7 Stream Processing and Messaging Using Spark
  8 Introduction to Data Science and Machine Learning Using Spark
Index

Table of Contents

Preface
Introduction
I: Spark Foundations
  1 Introducing Big Data, Hadoop, and Spark
    Introduction to Big Data, Distributed Computing, and Hadoop
    A Brief History of Big Data and Hadoop
    Hadoop Explained
    Introduction to Apache Spark
    Apache Spark Background
    Uses for Spark
    Programming Interfaces to Spark
    Submission Types for Spark Programs
    Input/Output Types for Spark Applications
    The Spark RDD
    Spark and Hadoop
    Functional Programming Using Python
    Data Structures Used in Functional Python Programming
    Python Object Serialization
    Python Functional Programming Basics
    Summary
  2 Deploying Spark
    Spark Deployment Modes
    Local Mode
    Spark Standalone
    Spark on YARN
    Spark on Mesos
    Preparing to Install Spark
    Getting Spark
    Installing Spark on Linux or Mac OS X
    Installing Spark on Windows
    Exploring the Spark Installation
    Deploying a Multi-Node Spark Standalone Cluster
    Deploying Spark in the Cloud
    Amazon Web Services (AWS)
    Google Cloud Platform (GCP)
    Databricks
    Summary
  3 Understanding the Spark Cluster Architecture
    Anatomy of a Spark Application
    Spark Driver
    Spark Workers and Executors
    The Spark Master and Cluster Manager
    Spark Applications Using the Standalone Scheduler
    Spark Applications Running on YARN
    Deployment Modes for Spark Applications Running on YARN
    Client Mode
    Cluster Mode
    Local Mode Revisited
    Summary
  4 Learning Spark Programming Basics
    Introduction to RDDs
    Loading Data into RDDs
    Creating an RDD from a File or Files
    Methods for Creating RDDs from a Text File or Files
    Creating an RDD from an Object File
    Creating an RDD from a Data Source
    Creating RDDs from JSON Files
    Creating an RDD Programmatically