Pro Hadoop

Jason Venner

Pro Hadoop

Copyright © 2009 by Jason Venner

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

ISBN-13 (pbk): 978-1-4302-1942-2
ISBN-13 (electronic): 978-1-4302-1943-9

Printed and bound in the United States of America 9 8 7 6 5 4 3 2 1

Trademarked names may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, we use the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

Java™ and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc., in the US and other countries. Apress, Inc., is not affiliated with Sun Microsystems, Inc., and this book was written without endorsement from Sun Microsystems, Inc.

Lead Editor: Matthew Moodie
Technical Reviewer: Steve Cyrus
Editorial Board: Clay Andres, Steve Anglin, Mark Beckner, Ewan Buckingham, Tony Campbell, Gary Cornell, Jonathan Gennick, Michelle Lowman, Matthew Moodie, Duncan Parkes, Jeffrey Pepper, Frank Pohlmann, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Matt Wade, Tom Welsh
Project Manager: Richard Dal Porto
Copy Editors: Marilyn Smith, Nancy Sixsmith
Associate Production Director: Kari Brooks-Copony
Production Editor: Laura Cheu
Compositor: Linda Weidemann, Wolf Creek Publishing Services
Proofreader: Linda Seifert
Indexer: Becky Hornyak
Artist: Kinetic Publishing Services
Cover Designer: Kurt Krames
Manufacturing Director: Tom Debolski

Distributed to the book trade worldwide by Springer-Verlag New York, Inc., 233 Spring Street, 6th Floor, New York, NY 10013.
Phone 1-800-SPRINGER, fax 201-348-4505, e-mail orders-ny@springer-sbm.com, or visit http://www.springeronline.com.

For information on translations, please contact Apress directly at 2855 Telegraph Avenue, Suite 600, Berkeley, CA 94705. Phone 510-549-5930, fax 510-549-5939, e-mail info@apress.com, or visit http://www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at http://www.apress.com/info/bulksales.

The information in this book is distributed on an “as is” basis, without warranty. Although every precaution has been taken in the preparation of this work, neither the author(s) nor Apress shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in this work.
The source code for this book is available to readers at http://www.apress.com. You may need to answer questions pertaining to this book in order to successfully download the code.

This book is dedicated to Joohn Choe. He had the idea, walked me through much of the process, trusted me to write the book, and helped me through the rough spots.

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
CHAPTER 1 Getting Started with Hadoop Core
CHAPTER 2 The Basics of a MapReduce Job
CHAPTER 3 The Basics of Multimachine Clusters
CHAPTER 4 HDFS Details for Multimachine Clusters
CHAPTER 5 MapReduce Details for Multimachine Clusters
CHAPTER 6 Tuning Your MapReduce Jobs
CHAPTER 7 Unit Testing and Debugging
CHAPTER 8 Advanced and Alternate MapReduce Techniques
CHAPTER 9 Solving Problems with Hadoop
CHAPTER 10 Projects Based On Hadoop and Future Directions
APPENDIX A The JobConf Object in Detail
Index

Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

CHAPTER 1 Getting Started with Hadoop Core
    Introducing the MapReduce Model
    Introducing Hadoop
    Hadoop Core MapReduce
    The Hadoop Distributed File System
    Installing Hadoop
    The Prerequisites
    Getting Hadoop Running
    Checking Your Environment
    Running Hadoop Examples and Tests
    Hadoop Examples
    Hadoop Tests
    Troubleshooting
    Summary

CHAPTER 2 The Basics of a MapReduce Job
    The Parts of a Hadoop MapReduce Job
    Input Splitting
    A Simple Map Function: IdentityMapper
    A Simple Reduce Function: IdentityReducer
    Configuring a Job
    Specifying Input Formats
    Setting the Output Parameters
    Configuring the Reduce Phase
    Running a Job
    Creating a Custom Mapper and Reducer
    Setting Up a Custom Mapper
    After the Job Finishes
    Creating a Custom Reducer
    Why Do the Mapper and Reducer Extend MapReduceBase?
    Using a Custom Partitioner
    Summary

CHAPTER 3 The Basics of Multimachine Clusters
    The Makeup of a Cluster
    Cluster Administration Tools
    Cluster Configuration
    Hadoop Configuration Files
    Hadoop Core Server Configuration
    A Sample Cluster Configuration
    Configuration Requirements
    Configuration Files for the Sample Cluster
    Distributing the Configuration
    Verifying the Cluster Configuration
    Formatting HDFS
    Starting HDFS
    Correcting Errors
    The Web Interface to HDFS
    Starting MapReduce
    Running a Test Job on the Cluster
    Summary

CHAPTER 4 HDFS Details for Multimachine Clusters
    Configuration Trade-Offs
    HDFS Installation for Multimachine Clusters
    Building the HDFS Configuration
    Distributing Your Installation Data
    Formatting Your HDFS
    Starting Your HDFS Installation
    Verifying HDFS Is Running
    Tuning Factors
    File Descriptors
    Block Service Threads
    NameNode Threads
    Server Pending Connections
    Reserved Disk Space
    Storage Allocations
    Disk I/O
    Network I/O Tuning
    Recovery from Failure
    NameNode Recovery
    DataNode Recovery and Addition
    DataNode Decommissioning
    Deleted File Recovery
    Troubleshooting HDFS Failures
    NameNode Failures
    DataNode or NameNode Pauses
    Summary

CHAPTER 5 MapReduce Details for Multimachine Clusters
    Requirements for Successful MapReduce Jobs
    Launching MapReduce Jobs
    Using Shared Libraries
    MapReduce-Specific Configuration for Each Machine in a Cluster
    Using the Distributed Cache
    Adding Resources to the Task Classpath
    Distributing Archives and Files to Tasks
    Accessing the DistributedCache Data
    Configuring the Hadoop Core Cluster Information
    Setting the Default File System URI
    Setting the JobTracker Location
    The Mapper Dissected
    Mapper Methods
    Mapper Class Declaration and Member Fields
    Initializing the Mapper with Spring
    Partitioners Dissected
    The HashPartitioner Class
    The TotalOrderPartitioner Class
    The KeyFieldBasedPartitioner Class
    The Reducer Dissected
    A Simple Transforming Reducer
    A Reducer That Uses Three Partitions
    Combiners
    File Types for MapReduce Jobs
    Text Files
    Sequence Files
    Map Files
    Compression
    Codec Specification
    Sequence File Compression
    Map Task Output
    JAR, Zip, and Tar Files
    Summary

CHAPTER 6 Tuning Your MapReduce Jobs
    Tunable Items for Cluster and Jobs
    Behind the Scenes: What the Framework Does
    Cluster-Level Tunable Parameters
    Per-Job Tunable Parameters
    Monitoring Hadoop Core Services
    JMX: Hadoop Core Server and Task State Monitor
    Nagios: A Monitoring and Alert Generation Framework
    Ganglia: A Visual Monitoring Tool with History
    Chukwa: A Monitoring Service
    FailMon: A Hardware Diagnostic Tool
    Tuning to Improve Job Performance
    Speeding Up the Job and Task Start
    Optimizing a Job’s Map Phase
    Tuning the Reduce Task Setup
    Addressing Job-Level Issues
    Summary

CHAPTER 7 Unit Testing and Debugging
    Unit Testing MapReduce Jobs
    Requirements for Using ClusterMapReduceTestCase
    Simpler Testing and Debugging with ClusterMapReduceDelegate
    Writing a Test Case: SimpleUnitTest
    Running the Debugger on MapReduce Jobs
    Running an Entire MapReduce Job in a Single JVM
    Debugging a Task Running on a Cluster
    Rerunning a Failed Task
    Summary

CHAPTER 8 Advanced and Alternate MapReduce Techniques
    Streaming: Running Custom MapReduce Jobs from the Command Line
    Streaming Command-Line Arguments
    Using Pipes
    Using Counters in Streaming and Pipes Jobs
    Alternative Methods for Accessing HDFS
    libhdfs
    fuse-dfs
    Mounting an HDFS File System Using fuse_dfs
    Alternate MapReduce Techniques
    Chaining: Efficiently Connecting Multiple Map and/or Reduce Steps
    Map-side Join: Sequentially Reading Data from Multiple Sorted Inputs
    Aggregation: A Framework for MapReduce Jobs that Count or Aggregate Data
    Aggregation Using Streaming
    Aggregation Using Java Classes
    Specifying the ValueAggregatorDescriptor Class via Configuration Parameters
    Side Effect Files: Map and Reduce Tasks Can Write Additional Output Files
    Handling Acceptable Failure Rates
    Dealing with Task Failure
    Skipping Bad Records
    Capacity Scheduler: Execution Queues and Priorities
    Enabling the Capacity Scheduler
    Summary