Simplify Big Data Analytics with Amazon EMR

A beginner's guide to learning and implementing Amazon EMR for building data analytics solutions

Sakti Mishra

BIRMINGHAM - MUMBAI

Simplify Big Data Analytics with Amazon EMR

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Sunith Shetty
Publishing Product Manager: Reshma Raman
Senior Editor: Tazeen Shaikh
Content Development Editor: Shreya Moharir
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Nair
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Nilesh Mohite
Marketing Coordinator: Priyanka Mhatre

First published: March 2022
Production reference: 1170222

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-80107-107-9

www.packt.com

I dedicate this to everyone who doesn't settle down after achieving their goals but instead is encouraged to define the next one by pushing their limits.
Contributors

About the author

Sakti Mishra is an engineer, architect, author, and technology leader with over 16 years of experience in the IT industry. He is currently working as a senior data lab architect at Amazon Web Services (AWS). He is passionate about technologies and has expertise in big data, analytics, machine learning, artificial intelligence, graph networks, web/mobile applications, and cloud technologies such as AWS and Google Cloud Platform. Sakti has a bachelor's degree in engineering and a master's degree in business administration. He holds several certifications in Hadoop, Spark, AWS, and Google Cloud. He has also authored multiple technology blogs, workshops, and white papers, and is a public speaker who represents AWS in various domains and events.

About the reviewers

Suvojit Dasgupta is a senior data architect with AWS, focusing on data engineering and analytics. In his 17 years of experience, he has led multiple strategic initiatives to design, build, migrate, modernize, and operate petabyte-scale data platforms for Fortune 500 companies. He is passionate about data architecture and takes pride in building well-architected solutions. In his free time, he likes to explore new technologies and listen to audio books. You can follow Suvojit on Twitter at @suvojitdasgupta.

Praveen Gupta is currently a data engineering manager with AWS, and has over 17 years of experience in the IT industry. Praveen started his career as an ETL/reporting developer working on traditional RDBMSs and reporting tools. Since 2014, he has been working on the AWS cloud on projects related to data science/machine learning and building complex data engineering pipelines on AWS. He specializes in data ingestion, big data processing, reporting, and building massive data warehouses at the petabyte scale for his customers, helping them make data-driven decisions. Praveen has an undergraduate degree and a master's degree, both in computer science from UIUC, USA.
Praveen lives in Portland, USA with his wife and 8-year-old daughter.

Table of Contents

Preface

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR

1 An Overview of Amazon EMR
  What is Amazon EMR?
  What is big data?
  Hadoop – processing framework to handle big data
  Overview of Amazon EMR – managed and scalable Hadoop cluster in AWS
  A brief history of the major big data releases
  Benefits of Amazon EMR
  Decoupling compute and storage
  Persistent versus transient clusters
  Integration with other AWS services
  Amazon S3 with EMR File System (EMRFS)
  Amazon Kinesis Data Streams (KDS)
  Amazon Managed Streaming for Kafka (MSK)
  AWS Glue Data Catalog
  Amazon Relational Database Service (RDS)
  Amazon DynamoDB
  Amazon Redshift
  AWS Lake Formation
  AWS Identity and Access Management (IAM)
  AWS Key Management Service (KMS)
  Lake House architecture overview
  EMR release history
  Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew
  AWS Glue
  AWS Glue DataBrew
  Choosing the right service for your use case
  Summary
  Test your knowledge
  Further reading

2 Exploring the Architecture and Deployment Options
  EMR architecture deep dive
  Distributed storage layer
  YARN – cluster resource manager
  Distributed processing frameworks
  Hadoop applications
  Understanding clusters and nodes
  Uniform instance groups
  Instance fleet
  Using S3 versus HDFS for cluster storage
  HDFS as cluster-persistent storage
  Amazon S3 as a persistent data store
  Understanding the cluster life cycle
  Options to submit work to the cluster
  Submitting jobs to the cluster as EMR steps
  Building Hadoop jobs with dependencies in a specific EMR release version
  EMR deployment options
  Amazon EMR on Amazon EC2
  Amazon EMR on Amazon EKS
  Amazon EMR on AWS Outposts
  EMR pricing for different deployment options
  Monitoring and controlling your costs with AWS Budgets and Cost Explorer
  Summary
  Test your knowledge
  Further reading

3 Common Use Cases and Architecture Patterns
  Reference architecture for batch ETL workloads
  Use case overview
  Reference architecture walkthrough
  Best practices to follow during implementation
  Reference architecture for clickstream analytics
  Use case overview
  Reference architecture walkthrough
  Best practices to follow during implementation
  Reference architecture for interactive analytics and ML
  Use case overview
  Reference architecture walkthrough
  Best practices to follow during implementation
  Reference architecture for real-time streaming analytics
  Use case overview
  Reference architecture walkthrough
  Best practices to follow during implementation
  Reference architecture for genomics data analytics
  Use case overview
  Reference architecture walkthrough
  Best practices to follow during implementation
  Reference architecture for log analytics
  Use case overview
  Reference architecture walkthrough
  Best practices to follow during implementation
  Summary
  Test your knowledge
  Further reading

4 Big Data Applications and Notebooks Available in Amazon EMR
  Technical requirements
  Understanding popular big data applications in EMR
  Hive
  Presto
  Spark
  HBase
  Hue
  Ganglia
  Machine learning frameworks available in EMR
  TensorFlow
  MXNet
  Notebook options available in EMR
  EMR Notebooks
  JupyterHub
  EMR Studio
  Zeppelin
  Summary
  Test your knowledge
  Further reading

Section 2: Configuration, Scaling, Data Security, and Governance

5 Setting Up and Configuring EMR Clusters
  Technical requirements
  Setting up and configuring clusters with the EMR console's quick create option
  Advanced configuration for cluster hardware and software
  Understanding the Software Configuration section