HBase Administration Cookbook

Master HBase configuration and administration for optimum database performance

Yifeng Jiang

BIRMINGHAM - MUMBAI

HBase Administration Cookbook

Copyright © 2012 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2012

Production Reference: 1080812

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-84951-714-0

www.packtpub.com

Cover Image by Asher Wishkerman ([email protected])

Credits

Author: Yifeng Jiang
Reviewers: Masatake Iwasaki, Tatsuya Kawano, Michael Morello, Shinichi Yamashita
Acquisition Editor: Sarah Cullington
Lead Technical Editor: Pramila Balan
Technical Editors: Merin Jose, Kavita Raghavan, Manmeet Singh Vasir
Copy Editors: Brandt D'Mello, Insiya Morbiwala
Project Coordinator: Yashodhan Dere
Proofreader: Aaron Nash
Indexer: Hemangini Bari
Graphics: Manu Joseph, Valentina D'silva
Production Coordinator: Arvindkumar Gupta
Cover Work: Arvindkumar Gupta

About the Author

Yifeng Jiang is a Hadoop and HBase Administrator and Developer at Rakuten, the largest e-commerce company in Japan.
After graduating from the University of Science and Technology of China with a B.S. in Information Management Systems, he started his career as a professional software engineer, focusing on Java development. In 2008, he began exploring the Hadoop project. In 2009, he led the development of his previous company's display advertisement data infrastructure using Hadoop and Hive. In 2010, he joined his current employer, where he designed and implemented the Hadoop- and HBase-based, large-scale item ranking system. He is also a member of the company's Hadoop team, which operates several Hadoop/HBase clusters.

Acknowledgement

Little did I know, when I was first asked by Packt Publishing in September 2011 whether I would be interested in writing a book about HBase administration, how much work and stress (but also how much fun) it was going to be.

Now that the book is finally complete, I would like to thank those people without whom it would have been impossible to get done.

First, I would like to thank the HBase developers for giving us such a great piece of software. Thanks to all of the people on the mailing list providing good answers to my many questions, and all the people working on tickets and documents.

I would also like to thank the team at Packt Publishing for contacting me to get started with the writing of this book, and providing support, guidance, and feedback.

Many thanks to Rakuten, my employer, who provided me with the environment to work on HBase and the chance to write this book.

Thank you to Michael Stack for helping me with a quick review of the book.

Thank you to the book's reviewers: Michael Morello, Tatsuya Kawano, Kenichiro Hamano, Shinichi Yamashita, and Masatake Iwasaki.

To Yotaro Kagawa: Thank you for supporting me and my family from the very start and ever since.

To Xinping and Lingyin: Thank you for your support and all your patience. I love you!
About the Reviewers

Masatake Iwasaki is a Software Engineer at NTT DATA CORPORATION, providing technical consultation for open source software such as Hadoop, HBase, and PostgreSQL.

Tatsuya Kawano is an HBase contributor and evangelist in Japan. He has been helping the Japanese Hadoop and HBase community to grow since 2010. He is currently working for Gemini Mobile Technologies as a Research & Development software engineer. He is also developing Cloudian, a fully S3 API-compliant cloud storage platform, and Hibari DB, an open source, distributed, key-value store. In 2012, he co-authored a Japanese book named "Basic Knowledge of NOSQL", which introduces 16 NoSQL products, such as HBase, Cassandra, Riak, MongoDB, and Neo4j, to novice readers. He studied graphic design in New York in the late 1990s. He loves playing with 3D computer graphics as much as he loves developing highly available, scalable storage systems.

Michael Morello holds a Master's degree in Distributed Computing and Artificial Intelligence. He is a Senior Java/JEE Developer with a strong Unix and Linux background. His areas of research are mostly related to large-scale systems and emerging technologies dedicated to solving scalability, performance, and high availability issues.

I would like to thank my wife and my little angel for their love and support.

Shinichi Yamashita is a Chief Engineer at the OSS Professional Service unit in NTT DATA Corporation, in Japan. He has more than 7 years of experience in software and middleware (Apache, Tomcat, PostgreSQL, Hadoop ecosystem) engineering. Shinichi has written a few books on Hadoop in Japan.

I would like to thank my colleagues.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available?
You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

- Fully searchable across every book published by Packt
- Copy and paste, print and bookmark content
- On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface

Chapter 1: Setting Up HBase Cluster
  Introduction
  Quick start
  Getting ready on Amazon EC2
  Setting up Hadoop
  Setting up ZooKeeper
  Changing the kernel settings
  Setting up HBase
  Basic Hadoop/ZooKeeper/HBase configurations
  Setting up multiple High Availability (HA) masters

Chapter 2: Data Migration
  Introduction
  Importing data from MySQL via single client
  Importing data from TSV files using the bulk load tool
  Writing your own MapReduce job to import data
  Precreating regions before moving data into HBase

Chapter 3: Using Administration Tools
  Introduction
  HBase Master web UI
  Using HBase Shell to manage tables
  Using HBase Shell to access data in HBase
  Using HBase Shell to manage the cluster
  Executing Java methods from HBase Shell
  Row counter
  WAL tool: manually splitting and dumping WALs