Hadoop: Data Processing and Modelling


Hadoop: Data Processing and Modelling

Unlock the power of your data with the Hadoop 2.X ecosystem and its data warehousing techniques across large data sets

LEARNING PATH

A course in three modules

BIRMINGHAM - MUMBAI

Hadoop: Data Processing and Modelling

Copyright © 2016 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: August 2016

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN: 978-1-78712-516-2

www.packtpub.com

Credits

Authors: Garry Turkington, Tanmay Deshpande, Sandeep Karanth

Reviewers: David Gruzman, Muthusamy Manigandan, Vidyasagar N V, Shashwat Shriparv, Shiva Achari, Pavan Kumar Polineni, Uchit Vyas, Yohan Wadia

Content Development Editor: Rashmi Suvarna

Graphics: Kirk D'Phena

Production Coordinator: Shantanu N. Zagade

Preface

A number of organizations are focusing on big data processing, particularly with Hadoop. This course will help you understand how Hadoop, as an ecosystem, helps us store, process, and analyze data.

Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and an ever-growing ecosystem make Hadoop an inclusive platform for programmers with different levels of expertise and breadth of knowledge. A team of machines interconnected via a very fast network provides better scaling and elasticity, but that is not enough. These clusters have to be programmed. A greater number of machines, just like a team of human beings, requires more coordination and synchronization. The higher the number of machines, the greater the possibility of failures in the cluster. How do we handle synchronization and fault tolerance in a simple way, easing the burden on the programmer? The answer is systems such as Hadoop. Today, it is the number one sought-after job skill in the data science space.

To handle and analyze Big Data, Hadoop has become the go-to tool. Hadoop 2.x is spreading its wings to cover a variety of application paradigms and solve a wider range of data problems. It is rapidly becoming a general-purpose cluster platform for all data processing needs, and will soon become a mandatory skill for every engineer across verticals.

Exploring the power of the Hadoop ecosystem will enable you to tackle real-world scenarios and build solutions for them. This course covers optimizations and advanced features of MapReduce, Pig, and Hive, along with Hadoop 2.x, and illustrates how these can be used to extend the capabilities of Hadoop.
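To make the "code once and deploy at any scale" programming model referred to above a little more concrete, the following is a minimal sketch of a word-count mapper and reducer written against the Hadoop MapReduce API. The package and class names are purely illustrative and are not taken from the course's code samples; the point is that the mapper and reducer are written once, and the framework runs as many copies of them as the input size and the cluster require, handling distribution and failures on your behalf.

package example; // illustrative package name, not part of the course samples

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

    // The mapper is written once; the framework runs a copy of it for
    // every input split, on whichever nodes the data happens to live.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reducer receives all counts for a given word, regardless of
    // which machine produced them; shuffling, sorting, and retries on
    // failure are taken care of by the framework.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}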
When you finish this course, you will be able to tackle real-world scenarios and become a big data expert using the tools and the knowledge gained from the various step-by-step tutorials and recipes.

What this learning path covers

Hadoop Beginner's Guide

This module is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets, once the monopoly of large corporations and government agencies, is now possible through free open source software (OSS).

This module removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the module gives the understanding needed to effectively use Hadoop to solve real-world problems.

Starting with the basics of installing and configuring Hadoop, the module explains how to develop applications, maintain the system, and use additional products to integrate with other systems. In addition to examples on Hadoop clusters running on Ubuntu, uses of cloud services such as Amazon EC2 and Elastic MapReduce are covered.

Hadoop Real-World Solutions Cookbook, Second Edition

Big Data is the need of the day. Many organizations are producing huge amounts of data every day. With the advancement of Hadoop-like tools, it has become easier for everyone to solve Big Data problems with great efficiency and at a very low cost. When you are handling such a massive amount of data, even a small mistake can cost you dearly in terms of performance and storage. It's very important to learn the best practices of handling such tools before you start building an enterprise Big Data warehouse; doing so will be greatly advantageous in making your project successful.

This module gives readers insights into learning and mastering big data via recipes. The module not only clarifies most big data tools in the market but also provides best practices for using them. The module provides recipes based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout, and many more such ecosystem tools. This real-world-solution cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily. This module provides detailed practices on the latest technologies such as YARN and Apache Spark. Readers will be able to consider themselves big data experts on completion of this module.

Mastering Hadoop

The era of Big Data has brought changes to businesses as well. Almost everything in a business is logged. Every action taken by a user on an e-commerce page is recorded to improve the quality of service, and every item bought by the user is recorded to cross-sell or up-sell other items. Businesses want to understand the DNA of their customers and try to infer it by gathering every possible piece of data they can get about these customers. Businesses are not worried about the format of the data. They are ready to accept speech, images, natural language text, or structured data. These data points are used to drive business decisions and personalize experiences for the user.
The more the data, the higher the degree of personalization and the better the experience for the user.

Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and an ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise. This module explores the industry guidelines to optimize MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. Then, it dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation.

This module is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The foundations of every concept are explained with code fragments or schematic illustrations, and the data processing flow dictates the order of the concepts in each chapter.

What you need for this learning path

In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this course. We assume you have a recent distribution of Ubuntu, but as long as you have command-line Linux familiarity, any modern distribution will suffice. Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration.

Since we also explore Amazon Web Services in this course, you can run all the examples on EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout the modules. AWS services are usable by anyone, but you will need a credit card to sign up!

To get started with this hands-on, recipe-driven module, you should have a laptop/desktop with any OS, such as Windows, Linux, or Mac. It's good to have an IDE, such as Eclipse or IntelliJ, and of course, you need a lot of enthusiasm to learn.

The following software suites are required to try out the examples in the module:

- Java Development Kit (JDK 1.7 or later): This is free software from Oracle that provides a JRE (Java Runtime Environment) and additional tools for developers. It can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

- The IDE for editing Java code: IntelliJ IDEA is the IDE that has been used to develop the examples. Any other IDE of your choice can also be used. The community edition of the IntelliJ IDE can be downloaded from https://www.jetbrains.com/idea/download/.

- Maven: Maven is a build tool that has been used to build the samples in the course. Maven can be used to automatically pull build dependencies and specify configurations via XML files. The code samples in the chapters can be built into a JAR using two simple Maven commands:

  mvn compile
  mvn assembly:single

  These commands compile the code and create a consolidated JAR containing the program along with all its dependencies. It is important to change the mainClass references in the pom.xml to the driver class name when building the consolidated JAR file.
Hadoop-related consolidated JAR files can be run using the following command:

  hadoop jar <jar file> args

This command directly picks the driver program from the mainClass that was specified in the pom.xml; a sketch of what such a driver class might look like follows the template below.

Maven can be downloaded and installed from http://maven.apache.org/download.cgi. The Maven XML template file used to build the samples in this course is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>MasteringHadoop</groupId>
  <artifactId>MasteringHadoop</artifactId>
  <version>1.0-SNAPSHOT</version>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.0</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <plugin>
        <version>3.1</version>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>MasteringHadoop.MasteringHadoopTest</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>MasteringHadoop.MasteringHadoopTest</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <!-- This plugin's configuration is used to store Eclipse m2e
             settings only. It has no influence on the Maven build itself. -->
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
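The mainClass element in the template points at the job's driver class. As a rough, hypothetical sketch of what such a driver looks like (it is not the MasteringHadoop.MasteringHadoopTest class shipped with the course), a driver typically builds a Job, wires in the mapper and reducer, and sets the input and output paths taken from the command line:

package example; // hypothetical package; the course's own driver is named in the pom.xml above

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // args[0] and args[1] are the HDFS input and output paths passed
        // on the hadoop jar command line.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count example");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the mapper and reducer classes; here we reuse the
        // illustrative WordCountExample classes sketched earlier in this preface.
        job.setMapperClass(WordCountExample.TokenMapper.class);
        job.setCombinerClass(WordCountExample.SumReducer.class);
        job.setReducerClass(WordCountExample.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a class like this named in the mainClass elements, running mvn compile followed by mvn assembly:single produces, under Maven's default naming, target/MasteringHadoop-1.0-SNAPSHOT-jar-with-dependencies.jar, which can then be launched with the hadoop jar command shown earlier, passing the input and output directories as its arguments.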
