Data Analytics with Hadoop
An Introduction for Data Scientists

Benjamin Bengfort and Jenny Kim

Beijing  Boston  Farnham  Sebastopol  Tokyo

Copyright © 2016 Jenny Kim and Benjamin Bengfort. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Nicole Tache
Production Editor: Melanie Yarbrough
Copyeditor: Colleen Toporek
Proofreader: Jasmine Kwityn
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

June 2016: First Edition

Revision History for the First Edition
2016-05-25: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491913703 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analytics with Hadoop, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91370-3
[LSI]

Table of Contents

Preface

Part I. Introduction to Distributed Computing

1. The Age of the Data Product
    What Is a Data Product?
    Building Data Products at Scale with Hadoop
    Leveraging Large Datasets
    Hadoop for Data Products
    The Data Science Pipeline and the Hadoop Ecosystem
    Big Data Workflows
    Conclusion

2. An Operating System for Big Data
    Basic Concepts
    Hadoop Architecture
    A Hadoop Cluster
    HDFS
    YARN
    Working with a Distributed File System
    Basic File System Operations
    File Permissions in HDFS
    Other HDFS Interfaces
    Working with Distributed Computation
    MapReduce: A Functional Programming Model
    MapReduce: Implemented on a Cluster
    Beyond a Map and Reduce: Job Chaining
    Submitting a MapReduce Job to YARN
    Conclusion

3. A Framework for Python and Hadoop Streaming
    Hadoop Streaming
    Computing on CSV Data with Streaming
    Executing Streaming Jobs
    A Framework for MapReduce with Python
    Counting Bigrams
    Other Frameworks
    Advanced MapReduce
    Combiners
    Partitioners
    Job Chaining
    Conclusion

4. In-Memory Computing with Spark
    Spark Basics
    The Spark Stack
    Resilient Distributed Datasets
    Programming with RDDs
    Interactive Spark Using PySpark
    Writing Spark Applications
    Visualizing Airline Delays with Spark
    Conclusion

5. Distributed Analysis and Patterns
    Computing with Keys
    Compound Keys
    Keyspace Patterns
    Pairs versus Stripes
    Design Patterns
    Summarization
    Indexing
    Filtering
    Toward Last-Mile Analytics
    Fitting a Model
    Validating Models
    Conclusion

Part II. Workflows and Tools for Big Data Science

6. Data Mining and Warehousing
    Structured Data Queries with Hive
    The Hive Command-Line Interface (CLI)
    Hive Query Language (HQL)
    Data Analysis with Hive
    HBase
    NoSQL and Column-Oriented Databases
    Real-Time Analytics with HBase
    Conclusion

7. Data Ingestion
    Importing Relational Data with Sqoop
    Importing from MySQL to HDFS
    Importing from MySQL to Hive
    Importing from MySQL to HBase
    Ingesting Streaming Data with Flume
    Flume Data Flows
    Ingesting Product Impression Data with Flume
    Conclusion

8. Analytics with Higher-Level APIs
    Pig
    Pig Latin
    Data Types
    Relational Operators
    User-Defined Functions
    Wrapping Up
    Spark’s Higher-Level APIs
    Spark SQL
    DataFrames
    Conclusion

9. Machine Learning
    Scalable Machine Learning with Spark
    Collaborative Filtering
    Classification
    Clustering
    Conclusion

10. Summary: Doing Distributed Data Science
    Data Product Lifecycle
    Data Lakes
    Data Ingestion
    Computational Data Stores
    Machine Learning Lifecycle
    Conclusion

A. Creating a Hadoop Pseudo-Distributed Development Environment

B. Installing Hadoop Ecosystem Products

Glossary

Index

Preface

The term big data has come into vogue for an exciting new set of tools and techniques for modern, data-powered applications that are changing the way the world is computing in novel ways. Much to the statistician’s chagrin, this ubiquitous term seems to be liberally applied to include the application of well-known statistical techniques on large datasets for predictive purposes. Although big data is now officially a buzzword, the fact is that modern, distributed computation techniques are enabling analyses of datasets far larger than those typically examined in the past, with stunning results.

Distributed computing alone, however, does not directly lead to data science. Through the combination of rapidly increasing datasets generated from the Internet and the observation that these data sets are able to power predictive models (“more data is better than better algorithms”1), data products have become a new economic paradigm.
Stunning successes of data modeling across large heterogeneous datasets (for example, Nate Silver’s seemingly magical ability to predict the 2008 election using big data techniques) have led to a general acknowledgment of the value of data science, and have brought a wide variety of practitioners to the field.

Hadoop has evolved from a cluster-computing abstraction to an operating system for big data by providing a framework for distributed data storage and parallel computation. Spark has built upon those ideas and made cluster computing more accessible to data scientists. However, data scientists and analysts new to distributed computing may feel that these tools are programmer oriented rather than analytically oriented. This is because a fundamental shift needs to occur in thinking about how we manage and compute upon data in a parallel fashion instead of a sequential one.

This book is intended to prepare data scientists for that shift in thinking by providing an overview of cluster computing and analytics in a readable, straightforward fashion. We will introduce most of the concepts, tools, and techniques involved with distributed computing for data analysis and provide a path for deeper dives into specific topic areas.

1 Anand Rajaraman, “More data usually beats better algorithms”, Datawocky, March 24, 2008.

What to Expect from This Book

This book is not an exhaustive compendium on Hadoop (see Tom White’s excellent Hadoop: The Definitive Guide for that) or an introduction to Spark (we instead point you to Holden Karau et al.’s Learning Spark), and is certainly not meant to teach the operational aspects of distributed computing. Instead, we offer a survey of the Hadoop ecosystem and distributed computation intended to arm data scientists, statisticians, programmers, and folks who are interested in Hadoop (but whose current knowledge of it is just enough to make them dangerous).
We hope that you will use this book as a guide as you dip your toes into the world of Hadoop and find the tools and techniques that interest you the most, be it Spark, Hive, machine learning, ETL (extract, transform, and load) operations, relational databases, or one of the many other topics related to cluster computing.

Who This Book Is For

Data science is often erroneously conflated with big data, and while many machine learning model families do require large datasets in order to be widely generalizable, even small datasets can provide a pattern recognition punch. For that reason, most of the focus of data science software literature is on corpora or datasets that are easily analyzable on a single machine (especially machines with many gigabytes of memory). Although big data and data science are well suited to work in concert with each other, computing literature has separated them up until now.

This book intends to fill in the gap by writing to an audience of data scientists. It will introduce you to the world of clustered computing and analytics with Hadoop, from a data science perspective. The focus will not be on deployment, operations, or software development, but rather on common analyses, data warehousing techniques, and higher-order data workflows.

So who are data scientists? We expect that a data scientist is a software developer with strong statistical skills or a statistician with strong software development skills. Typically, our data teams are composed of three types of data scientists: data engineers, data analysts, and domain experts.

Data engineers are programmers or computer scientists who can build or utilize advanced computing systems. They typically program in Python, Java, or Scala and are familiar with Linux, servers, networking, databases, and application deployment.
For those data engineers reading this book, we expect that you’re accustomed to the difficulties of programming multi-process code as well as the challenges of data wrangling and numeric computation. We hope that after reading this book you’ll have a