FIRST EDITION

Mastering Azure Analytics
Architecting in the cloud with Azure Data Lake, HDInsight, and Spark

Zoiner Tejada

Boston

Mastering Azure Analytics
by Zoiner Tejada

Copyright © 2016 Zoiner Tejada. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Shannon Cutt
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

May 2016: First Edition

Revision History for the First Edition
2016-04-20: First Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491956588 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Azure Analytics, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95658-8
[FILL IN]

Table of Contents

1. Enterprise Analytics Fundamentals
   The Analytics Data Pipeline
   Data Lakes
   Lambda Architecture
   Kappa Architecture
   Choosing between Lambda and Kappa
   The Azure Analytics Pipeline
   Introducing the Analytics Scenarios
   Sample code and sample datasets
   What you will need
      Broadband Internet Connectivity
      Azure Subscription
      Visual Studio 2015 with Update 1
      Azure SDK 2.8 or later
   Chapter Summary

2. Getting Data into Azure
   Ingest Loading Layer
      Bulk Data Loading
         Disk Shipping
         End User Tools
         Network Oriented Approaches
      Stream Loading
         Stream Loading with Event Hubs
   Chapter Summary

CHAPTER 1
Enterprise Analytics Fundamentals

In this chapter we’ll review the fundamentals of enterprise analytic architectures. We will introduce the analytics data pipeline, a fundamental process that takes data from its source through all the steps until it is available to analytics clients. Then we will introduce the concept of a data lake, as well as two different pipeline architectures: Lambda Architecture and Kappa Architecture.
The particular steps in the typical data processing pipeline (as well as considerations around the handling of “hot” and “cold” data) are detailed and serve as a framework for the rest of the book. We conclude the chapter by introducing our case study scenarios, along with their respective datasets, which provide a more real-world experience of performing big data analytics on Azure.

The Analytics Data Pipeline

Data does not end up nicely formatted for analytics on its own. It takes a series of steps that involve collecting the data from the source, massaging the data to get it into the forms appropriate to the analytics desired (sometimes referred to as data wrangling or data munging), and ultimately pushing the prepared results to the location from which they can be consumed. This series of steps can be thought of as a pipeline.

The analytics data pipeline forms a basis for understanding any analytics solution, and as such is very useful to our purposes in this book as we seek to understand how to accomplish analytics using Microsoft Azure. We define the analytics data pipeline as consisting of five major components that are useful in comprehending and designing any analytics solution (sketched in code following Figure 1-1):

Source: The location from which new raw data is either pulled or which pushes new raw data into the pipeline.

Ingest: The computation that handles receiving the raw data from the source so that it can be processed.

Processing: The computation controlling how the data gets prepared and processed for delivery.

Storage: The various locations where the ingested, intermediate, and final calculations are stored. Storage can be transient (the data lives in memory or only for a finite period of time) or persistent (the data is stored for the long term).

Delivery: How the data is presented to the ultimate consumer, which can run the gamut from dedicated analytics client solutions used by analysts to APIs that enable the results to be integrated into a larger solution or consumed by other processes.

Figure 1-1. The data analytics pipeline is a conceptual framework that is helpful in understanding where various data technologies apply.
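To make the five components concrete, here is a minimal sketch in Python that models each stage as a plain function. It assumes nothing about any particular Azure service; the function names and sample clickstream records are hypothetical illustrations of the flow, not a real pipeline.

    # A toy pipeline: each function stands in for one of the five components.

    def source():
        """Source: raw data is pulled from (or pushed by) an origin system."""
        # Hypothetical clickstream records
        yield from ({"user": f"u{i}", "clicks": i % 7} for i in range(100))

    def ingest(records):
        """Ingest: receive raw records from the source, unchanged."""
        yield from records

    def process(records):
        """Processing: prepare and shape records for delivery (wrangling)."""
        for r in records:
            yield {**r, "is_active": r["clicks"] > 0}

    def store(records):
        """Storage: persist intermediate/final results (a transient list here)."""
        return list(records)

    def deliver(stored):
        """Delivery: expose a result that analytics clients can consume."""
        active = sum(r["is_active"] for r in stored)
        return {"active_users": active, "total_records": len(stored)}

    summary = deliver(store(process(ingest(source()))))
    print(summary)

In a real Azure solution each function would be replaced by a service (an ingest broker, a processing engine, a storage repository, and so on); the point here is only the shape of the flow from source to delivery.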
Data Lakes

The term data lake is becoming the latest buzzword, much as Big Data grew in popularity while its definition became less clear as vendors attached whatever meaning best suited their products. Let us begin by defining the concept.

A data lake consists of two parts: storage and processing. Data lake storage requires an infinitely scalable, fault-tolerant storage repository designed to handle massive volumes of data with varying shapes, sizes, and ingest velocities. Data lake processing requires a processing engine that can successfully operate on the data at this scale.

The term data lake was originally coined by James Dixon, the CTO of Pentaho, who used the term in contrast with the traditional, highly schematized data mart:

“If you think of a datamart as a store of bottled water - cleansed and packaged and structured for easy consumption - the data lake is a large body of water in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
- James Dixon, CTO, Pentaho

With this definition, the goal is to create a repository that intentionally leaves the data in its raw or least-processed form, with the goal of enabling questions to be asked of it in the future that would otherwise not be answerable if the data were packaged into a particular structure or otherwise aggregated.

That definition of data lake should serve as the core, but as you will see in reading this book, the simple definition belies the true extent of a data lake. In reality, a data lake extends to include not just a single processing engine but multiple processing engines, and because it represents the enterprise-wide, centralized repository of source and processed data (after all, a data lake champions a “store-all” approach to data management), it has other requirements, such as metadata management, discovery, and governance.

One final important note: the data lake concept as it is used today is intended for batch processing, where high latency (the time until results are ready) is acceptable. That said, support for lower-latency processing is a natural area of evolution for data lakes, so this definition may evolve with the technology landscape.

With this broad definition of data lake, let us look at two different architectures that can be used to act on the data managed by a data lake: Lambda Architecture and Kappa Architecture.

Lambda Architecture

Lambda Architecture was originally proposed by the creator of Apache Storm, Nathan Marz. In his book, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, he proposed a pipeline architecture that aims to reduce the complexity seen in real-time analytics pipelines by constraining any incremental computation to only a small portion of the architecture.

In Lambda Architecture, there are two paths for data to flow in the pipeline:

1. A “hot” path where latency-sensitive data (e.g., the results need to be ready in seconds or less) flows for rapid consumption by analytics clients.

2. A “cold” path where all data goes and is processed in batches that can tolerate greater latencies (e.g., the results can take minutes or even hours) until results are ready.

Figure 1-2. The Lambda Architecture captures all data entering the pipeline into immutable storage, labeled Master Data in the diagram. This data is processed by the Batch Layer and output to a Serving Layer in the form of Batch Views. Latency-sensitive calculations are applied on the input data by the Speed Layer and exposed as Real-time Views. Analytics clients can consume the data from either the Speed Layer Views or the Serving Layer Views, depending on the timeframe of the data required. In some implementations, the Serving Layer can host both the Real-time Views and the Batch Views.

When data flows into the “cold” path, this data is immutable. Any change to the value of a particular datum is reflected by a new, time-stamped datum being stored in the system alongside any previous values. This approach enables the system to recompute the then-current value of a particular datum for any point in time across the history of the data collected.
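As a small illustration of this append-only approach, here is a minimal sketch (the class and method names are hypothetical, not drawn from any particular Lambda Architecture implementation): every update is stored as a new time-stamped fact, so the value of a datum can be recomputed as of any moment in its history.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class Fact:
        key: str
        value: float
        recorded_at: datetime

    class MasterData:
        """Append-only store: updates add new facts; nothing is overwritten."""

        def __init__(self):
            self._facts = []

        def append(self, key, value, recorded_at=None):
            ts = recorded_at or datetime.now(timezone.utc)
            self._facts.append(Fact(key, value, ts))

        def value_as_of(self, key, as_of):
            """Recompute the then-current value of `key` at time `as_of`."""
            history = [f for f in self._facts
                       if f.key == key and f.recorded_at <= as_of]
            if not history:
                return None
            return max(history, key=lambda f: f.recorded_at).value

Because no fact is ever destroyed, value_as_of can answer questions like “what did we believe this sensor read at 2:00 p.m. yesterday?” even after newer readings have arrived, which is exactly the property the “cold” path relies on.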
Because the “cold” path can tolerate greater latency before results are ready, the computation can afford to run across large datasets, and the types of calculations performed can be time-intensive. The objective of the “cold” path can be summarized as: take the time that you need, but make the results extremely accurate.

When data flows into the “hot” path, the data is mutable and can be updated in place. In addition, the “hot” path places a latency constraint on the data (as the results are typically desired in near real time). The impact of this latency constraint is that the types of calculations that can be performed are limited to those that can happen quickly enough. This might mean switching from an algorithm that provides perfect accuracy to one that provides an approximation. An example of this involves counting the number of distinct items in a dataset (e.g., the number of visitors to your website): you can count exactly, by examining each individual datum (which can be very high latency if the volume is high), or you can approximate the count using algorithms like HyperLogLog. The objective of the “hot” path can be summarized as: trade off some amount of accuracy in the results in order to ensure that the data is ready as quickly as possible.
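To see the exact-versus-approximate trade-off in action, here is a minimal, self-contained sketch of the core HyperLogLog estimator. The register count, bias constant, and small-range correction follow the published algorithm, but treat this as an illustrative toy rather than a production implementation.

    import hashlib
    import math

    class HyperLogLog:
        """Toy HyperLogLog: approximate distinct count in 2**b registers."""

        def __init__(self, b=12):
            self.b = b
            self.m = 1 << b
            self.registers = [0] * self.m
            # Bias-correction constant for large m (Flajolet et al.)
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

        def add(self, item):
            # 64-bit hash; low b bits choose a register, the rest feed the rank
            h = int.from_bytes(
                hashlib.sha1(str(item).encode()).digest()[:8], "big")
            j = h & (self.m - 1)
            w = h >> self.b
            # rank = 1-based position of the leftmost 1-bit in the (64-b)-bit w
            rank = (64 - self.b) - w.bit_length() + 1
            self.registers[j] = max(self.registers[j], rank)

        def count(self):
            est = self.alpha * self.m * self.m / sum(
                2.0 ** -r for r in self.registers)
            zeros = self.registers.count(0)
            if est <= 2.5 * self.m and zeros:      # small-range correction
                est = self.m * math.log(self.m / zeros)
            return round(est)

    # Exact counting must remember every visitor; the sketch keeps 4,096 registers.
    exact, approx = set(), HyperLogLog(b=12)
    for i in range(200_000):
        visitor = f"visitor-{i % 50_000}"
        exact.add(visitor)
        approx.add(visitor)
    print(len(exact), approx.count())   # 50000 vs. an estimate within a few percent

A production system would use a hardened library implementation, but the design point stands: the “hot” path accepts a small, bounded error in exchange for constant memory and answers that are ready immediately.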