Learning Big Data with Amazon Elastic MapReduce

Easily learn, build, and execute real-world Big Data solutions using Hadoop and AWS EMR

Amarkant Singh
Vijay Rayapati

BIRMINGHAM - MUMBAI

Copyright © 2014 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2014
Production reference: 1241014

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78217-343-4

www.packtpub.com

Cover image by Pratyush Mohanta ([email protected])

Credits

Authors: Amarkant Singh, Vijay Rayapati
Reviewers: Venkat Addala, Vijay Raajaa G.S, Gaurav Kumar
Commissioning Editor: Ashwin Nair
Acquisition Editor: Richard Brookes-Bland
Content Development Editor: Sumeet Sawant
Technical Editors: Mrunal M. Chavan, Gaurav Thingalaya
Copy Editors: Roshni Banerjee, Relin Hedly
Project Coordinator: Judie Jose
Proofreaders: Paul Hindle, Bernadette Watkins
Indexers: Mariammal Chettiyar, Monica Ajmera Mehta, Rekha Nair, Tejal Soni
Graphics: Sheetal Aute, Ronak Dhruv, Disha Haria, Abhinash Sahu
Production Coordinators: Aparna Bhagat, Manu Joseph, Nitesh Thakur
Cover Work: Aparna Bhagat

About the Authors

Amarkant Singh is a Big Data specialist. Being one of the initial users of Amazon Elastic MapReduce, he has used it extensively to build and deploy many Big Data solutions. He has been working with Apache Hadoop and EMR for almost 4 years now. He is also a certified AWS Solutions Architect. As an engineer, he has designed and developed enterprise applications of various scales. He is currently leading the product development team at one of the most happening cloud-based enterprises in the Asia-Pacific region. He is also an all-time top user on Stack Overflow for EMR at the time of writing this book. He blogs at http://www.bigdataspeak.com/ and is active on Twitter as @singh_amarkant.

Vijay Rayapati is the CEO of Minjar Cloud Solutions Pvt. Ltd., one of the leading providers of cloud and Big Data solutions on public cloud platforms. He has over 10 years of experience in building business rule engines, data analytics platforms, and real-time analysis systems used by many leading enterprises across the world, including Fortune 500 businesses. He has worked on various technologies such as LISP, .NET, Java, Python, and many NoSQL databases. He has rearchitected and led the initial development of a large-scale location intelligence and analytics platform using Hadoop and AWS EMR. He has worked with many ad networks, e-commerce, financial, and retail companies to help them design, implement, and scale their data analysis and BI platforms on the AWS Cloud. He is passionate about open source software, large-scale systems, and performance engineering. He is active on Twitter as @amnigos, he blogs at amnigos.com, and his GitHub profile is https://github.com/amnigos.
Acknowledgments

We would like to extend our gratitude to Udit Bhatia and Kartikeya Sinha from Minjar's Big Data team for their valuable feedback and support. We would also like to thank the reviewers and the Packt Publishing team for their guidance in improving our content.

About the Reviewers

Venkat Addala has been involved in research in the area of Computational Biology and Big Data Genomics for the past several years. Currently, he is working as a Computational Biologist at Positive Bioscience, Mumbai, India, the first company to provide clinical DNA sequencing services in India. He understands Biology in terms of computers and solves the complex puzzle of human genome Big Data analysis using Amazon Cloud. He is a certified MongoDB developer and has good knowledge of Shell, Python, and R. His passion lies in decoding the human genome into computer codecs. His areas of focus are cloud computing, HPC, mathematical modeling, machine learning, and natural language processing. His passion for computers and genomics keeps him going.

Vijay Raajaa G.S leads the Big Data / semantic-based knowledge discovery research with Mu Sigma's Innovation & Development group. He previously worked with the BSS R&D division at Nokia Networks and interned with Ericsson Research Labs. He architected and built a feedback-based sentiment engine and a scalable in-memory-based solution for a telecom analytics suite. He is passionate about Big Data, machine learning, Semantic Web, and natural language processing. He has an immense fascination for open source projects. He is currently researching how to build a semantic-based personal assistant system using a multiagent framework. He holds a patent on churn prediction using the graph model and has authored a white paper that was presented at a conference on Advanced Data Mining and Applications. He can be reached at https://www.linkedin.com/in/gsvijayraajaa.
Gaurav Kumar has been working professionally since 2010 to provide solutions for distributed systems by using open source / Big Data technologies. He has hands-on experience in Hadoop, Pig, Hive, Flume, Sqoop, and NoSQLs such as Cassandra and MongoDB. He possesses knowledge of cloud technologies and has production experience of AWS. His area of expertise includes developing large-scale distributed systems to analyze big sets of data. He has also worked on predictive analysis models and machine learning. He architected a solution to perform clickstream analysis for Tradus.com. He also played an instrumental role in providing distributed searching capabilities using Solr for GulfNews.com (one of UAE's most-viewed newspaper websites).

Learning new languages is not a barrier for Gaurav. He is particularly proficient in Java and Python, as well as frameworks such as Struts and Django. He has always been fascinated by the open source world and constantly gives back to the community on GitHub. He can be contacted at https://www.linkedin.com/in/gauravkumar37 or on his blog at http://technoturd.wordpress.com. You can also follow him on Twitter @_gauravkr.

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print and bookmark content
• On demand and accessible via web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Instant updates on new Packt books

Get notified! Find out when new books are published by following @PacktEnterprise on Twitter, or the Packt Enterprise Facebook page.