
Use Amazon EMR PDF

474 Pages·2013·9.64 MB·English


Amazon Elastic MapReduce Developer Guide
API Version 2009-03-31

Amazon Web Services

Copyright © 2013 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

The following are trademarks or registered trademarks of Amazon: Amazon, Amazon.com, Amazon.com Design, Amazon CloudWatch, Amazon DevPay, Amazon EC2, Amazon Redshift, Amazon Web Services Design, AWS, CloudFront, EC2, Elastic Compute Cloud, Kindle, and Mechanical Turk. In addition, Amazon.com graphics, logos, page headers, button icons, scripts, and service names are trademarks, or trade dress of Amazon in the U.S. and/or other countries. Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon.

All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.
What is Amazon EMR?
What Can You Do with Amazon EMR?
What Tools are Available for Amazon EMR?
Get Started: Count Words with Amazon EMR
Step 1: Sign up for the Service
Step 2: Create the Amazon S3 Bucket
Step 3: Upload Your Data
Step 4: Upload Your Script
Step 5: Launch the Job Flow
Step 6: Monitor the Job Flow
Step 7: View the Results
Step 8: Clean Up
Where Do I Go From Here?
Understand Amazon EMR
Overview of Amazon EMR
Architectural Overview of Amazon EMR
Elastic MapReduce Features
Amazon EMR Concepts
Job Flows and Steps
Hadoop and MapReduce
Associated AWS Product Concepts
Amazon Virtual Private Cloud Concepts
Use Amazon EMR
Setting Up Your Environment to Run a Job Flow
Create a Job Flow
How to Create a Streaming Job Flow
How to Create a Job Flow Using Hive
How to Create a Job Flow Using Pig
How to Create a Job Flow Using a Custom JAR
How to Create a Cascading Job Flow
Launch an HBase Cluster on Amazon EMR
View Job Flow Details
Terminate a Job Flow
Customize a Job Flow
Add Steps to a Job Flow
Wait for Steps to Complete
Add More than 256 Steps to a Job Flow
Bootstrap Actions
Resizing Running Job Flows
Calling Additional Files and Libraries
Using Distributed Cache
Running a Script in a Job Flow
Connect to the Master Node in an Amazon EMR Job Flow
Connect to the Master Node Using SSH
Web Interfaces Hosted on the Master Node
Open an SSH Tunnel to the Master Node
Configure Foxy Proxy to View Websites Hosted on the Master Node
Use Cases
Cascading
Pig
Streaming
Build Binaries Using Amazon EMR
How EC2 Instances are Tagged in Amazon EMR
Protect a Job Flow from Termination
Lower Costs with Spot Instances
Choosing What to Launch as Spot Instances
Spot Instance Pricing in Amazon EMR
Availability Zones and Regions
Launching Spot Instances in Job Flows
Changing the Number of Spot Instances in a Job Flow
Troubleshooting Spot Instances
Store Data with HBase
HBase Job Flow Prerequisites
Launch an HBase Cluster on Amazon EMR
Connect to HBase Using the Command Line
Back Up and Restore HBase
Terminate an HBase Cluster
Configure HBase
Access HBase Data with Hive
View the HBase User Interface
View HBase Log Files
Monitor HBase with CloudWatch
Monitor HBase with Ganglia
Troubleshooting
Things to Check When Your Amazon EMR Job Flow Fails
Amazon EMR Logging
Enable Logging and Debugging
Use Log Files
Monitor Hadoop on the Master Node
View the Hadoop Web Interfaces
Troubleshooting Tips
Monitor Metrics with Amazon CloudWatch
Monitor Performance with Ganglia
Distributed Copy Using S3DistCp
Export, Query, and Join Tables in Amazon DynamoDB
Prerequisites for Integrating Amazon EMR
Step 1: Create a Key Pair
Step 2: Create a Job Flow
Step 3: SSH into the Master Node
Step 4: Set Up a Hive Table to Run Hive Commands
Hive Command Examples for Exporting, Importing, and Querying Data
Optimizing Performance
Use Third Party Applications With Amazon EMR
Parse Data with HParser
Use Karmasphere Analytics
Launch a Job Flow on the MapR Distribution for Hadoop
Configure Amazon EMR
Configure User Permissions with IAM
Set Policy for an IAM User
Configure IAM Roles for Amazon EMR
Set Access Permissions on Files Written to Amazon S3
Using Elastic IP Addresses
Specify the Amazon EMR AMI Version
Hadoop Configuration
Supported Hadoop Versions
Configuration of hadoop-user-env.sh
Upgrading to Hadoop 1.0
Hadoop Version Behavior
Hadoop 0.20 Streaming Configuration
Hadoop Default Configuration (AMI 1.0)
Hadoop Configuration (AMI 1.0)
HDFS Configuration (AMI 1.0)
Task Configuration (AMI 1.0)
Intermediate Compression (AMI 1.0)
Hadoop Memory-Intensive Configuration Settings (AMI 1.0)
Hadoop Default Configuration (AMI 2.0 and 2.1)
Hadoop Configuration (AMI 2.0 and 2.1)
HDFS Configuration (AMI 2.0 and 2.1)
Task Configuration (AMI 2.0 and 2.1)
Intermediate Compression (AMI 2.0 and 2.1)
Hadoop Default Configuration (AMI 2.2)
Hadoop Configuration (AMI 2.2)
HDFS Configuration (AMI 2.2)
Task Configuration (AMI 2.2)
Intermediate Compression (AMI 2.2)
Hadoop Default Configuration (AMI 2.3)
Hadoop Configuration (AMI 2.3)
HDFS Configuration (AMI 2.3)
Task Configuration (AMI 2.3)
Intermediate Compression (AMI 2.3)
File System Configuration
JSON Configuration Files
Multipart Upload
Hadoop Data Compression
Setting Permissions on the System Directory
Hadoop Patches
Hive Configuration
Supported Hive Versions
Share Data Between Hive Versions
Differences from Apache Hive Defaults
Interactive and Batch Modes
Creating a Metastore Outside the Hadoop Cluster
Using the Hive JDBC Driver
Additional Features of Hive in Amazon EMR
Upgrade to Hive 0.8
Upgrade the Configuration Files
Upgrade the Metastore
Upgrade to Hive 0.8 (MySQL on the Master Node)
Upgrade to Hive 0.8 (MySQL on Amazon RDS)
Pig Configuration
Supported Pig Versions
Pig Version Details
Performance Tuning
Running Job Flows on an Amazon VPC
Write Amazon EMR Applications
Common Concepts for API Calls
Use SDKs to Call Amazon EMR APIs
Using the AWS SDK for Java to Create an Amazon EMR Job Flow
Using the AWS SDK for .Net to Create an Amazon EMR Job Flow
Using the Java SDK to Sign a Query Request
Use Query Requests to Call Amazon EMR APIs
Why Query Requests Are Signed
Components of a Query Request in Amazon EMR
How to Generate a Signature for a Query Request in Amazon EMR
Command Line Interface Reference for Amazon EMR
Install the Amazon EMR Command Line Interface
How to Call the Command Line Interface
Command Line Interface Options
Command Line Interface Releases
Compare Job Flow Types
Appendix: Amazon EMR Resources
Document History
Glossary
Index

What is Amazon EMR?

With Amazon Elastic MapReduce (Amazon EMR) you can analyze and process vast amounts of data. It does this by distributing the computational work across a cluster of virtual servers running in the Amazon cloud. The cluster is managed using an open-source framework called Hadoop.

Hadoop uses a distributed processing architecture called MapReduce in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced down to a single output set. One node, designated as the master node, controls the distribution of tasks. The following diagram shows a Hadoop cluster with the master node directing a group of slave nodes which process the data.

Amazon EMR has made enhancements to Hadoop and other open-source applications to work seamlessly with AWS.
For example, Hadoop clusters running on Amazon EMR use Amazon EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and Amazon CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of Amazon DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. This process is called an Amazon EMR job flow.

The following diagram illustrates how Amazon EMR interacts with other AWS services.

Open-source projects that run on top of the Hadoop architecture can also be run on Amazon EMR. The most popular applications, such as Hive, Pig, HBase, DistCp, and Ganglia, are already integrated with Amazon EMR.

By running Hadoop on Amazon EMR you get the benefits of the cloud:
• The ability to provision clusters of virtual servers within minutes.
• The ability to scale the number of virtual servers in your cluster to match your computation needs, paying only for what you use.
• Integration with other AWS services.

What Can You Do with Amazon EMR?

Amazon EMR simplifies running Hadoop and related big-data applications on AWS. You can use it to manage and analyze vast amounts of data. For example, a cluster (also called a job flow in Amazon EMR) can be configured to process petabytes of data.

Topics
• Hadoop Programming on Amazon EMR
• Data Analysis and Processing on Amazon EMR
• Data Storage on Amazon EMR
• Move Data with Amazon EMR

Hadoop Programming on Amazon EMR

In order to develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs.
Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster. You can also test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, paying only for the computational time you used.

Amazon EMR provides several types of clusters that you can launch to run custom Hadoop map-reduce code, depending on the type of program you're developing and the libraries you intend to use.

Custom JAR
Run your custom map-reduce program written in Java. This cluster provides low-level access to the MapReduce API. You are responsible for defining and implementing the map and reduce tasks in your Java application.

Cascading
This type of cluster installs the Cascading Java library, which provides features such as splitting and joining data streams. Using the Cascading Java library can simplify application development. With a Cascading cluster you can still access the low-level MapReduce APIs as you can with the Custom JAR cluster type.

Streaming
Run a single Hadoop job based on map and reduce functions you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.

Data Analysis and Processing on Amazon EMR

You can also use Amazon EMR to analyze and process data without writing a line of code. Several open-source applications run on top of Hadoop and make it possible to run map-reduce jobs and manipulate data using either a SQL-like syntax or a specialized language called Pig Latin. Amazon EMR is integrated with Apache Hive and Apache Pig.

Data Storage on Amazon EMR

Distributed storage is a way to store large amounts of data over a distributed network of computers with redundancy to protect against data loss.
Amazon EMR is integrated with the Hadoop Distributed File System (HDFS) and Apache HBase.

Move Data with Amazon EMR

You can use Amazon EMR to move large amounts of data in and out of databases and data stores. By distributing the work, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Apache HBase.

What Tools are Available for Amazon EMR?

There are several ways you can interact with Amazon EMR:
• Console: a graphical interface that you can use to launch and manage job flows. With it, you fill out web forms to specify the details of job flows to launch, view the details of existing job flows, and debug and terminate job flows. Using the console is the easiest way to get started with Amazon EMR. No programming knowledge is required. The console is available online at https://console.aws.amazon.com/elasticmapreduce/.
• Command Line Interface (CLI): an application you run on your local machine to connect to Amazon EMR and create and manage job flows. With it, you can write scripts that automate the process of launching and managing job flows. Using the CLI is the best option if you prefer working from a command line. For more information, see Command Line Interface Reference for Amazon EMR.
• Software Development Kit (SDK): AWS provides an SDK with functions that call Amazon EMR to create and manage job flows. With it, you can write applications that automate the process of creating and managing job flows. Using the SDK is the best option if you want to extend or customize the functionality of Amazon EMR. You can download the AWS SDK for Java from http://aws.amazon.com/sdkforjava/.
• Web Service API: AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls Amazon EMR. For more information, see the Amazon EMR API Reference.
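
Whichever tool launches the job flow, the work it runs follows the map and reduce model described under "What is Amazon EMR?". As a rough illustration of that model, the sketch below counts words in Python, one of the languages listed for Streaming clusters. It is a local, single-process simulation and not the way Amazon EMR or Hadoop actually executes a job: the names mapper, reducer, and run_word_count are hypothetical, and a real Streaming job would read its input from standard input and write key/value pairs to standard output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map step (hypothetical name): emit a (word, 1) pair for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step (hypothetical name): collapse all counts for one word into a total."""
    return word, sum(counts)


def run_word_count(lines):
    """Simulate map, sort, and reduce in one local process (an assumption for
    illustration; on a cluster these phases run across many nodes)."""
    # Map: each line is handled independently, which is what lets a cluster
    # spread the work over many servers.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Sort: bring identical words next to each other so they can be grouped.
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per distinct word yields the single output set.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    print(run_word_count(["the cat", "The dog the"]))
```

The map step never looks at more than one line at a time, so it can be split across servers without coordination; only the reduce step needs all values for a given key in one place.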
