Understanding Big Data FM.indd 1 07/10/11 6:12 PM

About the Authors

Paul C. Zikopoulos, B.A., M.B.A., is the Director of Technical Professionals for IBM Software Group’s Information Management division and additionally leads the World Wide Database Competitive and Big Data SWAT teams. Paul is an internationally recognized award-winning writer and speaker with more than 18 years of experience in Information Management. Paul has written more than 350 magazine articles and 14 books on database technologies, including DB2 pureScale: Risk Free Agile Scaling (McGraw-Hill, 2010); Break Free with DB2 9.7: A Tour of Cost-Slashing New Features (McGraw-Hill, 2010); Information on Demand: Introduction to DB2 9.5 New Features (McGraw-Hill, 2007); DB2 Fundamentals Certification for Dummies (For Dummies, 2001); DB2 for Windows for Dummies (For Dummies, 2001), and more. Paul is a DB2 Certified Advanced Technical Expert (DRDA and Clusters) and a DB2 Certified Solutions Expert (BI and DBA). In his spare time, he enjoys all sorts of sporting activities, including running with his dog, Chachi; avoiding punches in his MMA training; trying to figure out why his golf handicap has unexplainably decided to go up; and trying to understand the world according to Chloë, his daughter. You can reach him at [email protected]. Also, keep up with Paul’s take on Big Data by following him on Twitter @BigData_paulz.

Chris Eaton, B.Sc., is a worldwide technical specialist for IBM’s Information Management products focused on Database Technology, Big Data, and Workload Optimization. Chris has been working with DB2 on the Linux, UNIX, and Windows platform for more than 19 years in numerous roles, from support, to development, to product management. Chris has spent his career listening to clients and working to make DB2 a better product.
He is the author of several books in the data management space, including The High Availability Guide to DB2 (IBM Press, 2004), IBM DB2 9 New Features (McGraw-Hill, 2007), and Break Free with DB2 9.7: A Tour of Cost-Slashing New Features (McGraw-Hill, 2010). Chris is also an international award-winning speaker, having presented at data management conferences across the globe, and he has one of the most popular DB2 blogs located on IT Toolbox at http://it.toolbox.com/blogs/db2luw.

Dirk deRoos, B.Sc., B.A., is a member of the IBM World-Wide Technical Sales Team, specializing in the IBM Big Data Platform. Dirk joined IBM 11 years ago and previously worked in the Toronto DB2 Development lab as its Information Architect. Dirk has a Bachelor’s degree in Computer Science and a Bachelor of Arts degree (Honors in English) from the University of New Brunswick.

Thomas Deutsch, B.A., M.B.A., serves as a Program Director in IBM’s Big Data business. Tom has spent the last couple of years helping customers with Apache Hadoop, identifying architecture fit, and managing early-stage projects in multiple customer engagements. He played a formative role in the transition of Hadoop-based technologies from IBM Research to IBM Software Group, and he continues to be involved with IBM Research Big Data activities and the transition of research to commercial products. Prior to this role, Tom worked in the CTO office’s Information Management division. In that role, Tom worked with a team focused on emerging technologies and helped customers adopt IBM’s innovative Enterprise Mashups and Cloud offerings. Tom came to IBM through the FileNet acquisition, where he had responsibility for FileNet’s flagship Content Management product and spearheaded FileNet product initiatives with other IBM software segments including the Lotus and InfoSphere segments.
With more than 20 years in the industry and a veteran of two startups, Tom is an expert on the technical, strategic, and business information management issues facing the enterprise today. Tom earned a Bachelor’s degree from Fordham University in New York and an MBA from the Maryland University College.

George Lapis, MS CS, is a Big Data Solutions Architect at IBM’s Silicon Valley Research and Development Lab. He has worked in the database software area for more than 30 years. He was a founding member of R* and Starburst research projects at IBM’s Almaden Research Center in Silicon Valley, as well as a member of the compiler development team for several releases of DB2. His expertise lies mostly in compiler technology and implementation. About ten years ago, George moved from research to development, where he led the compiler development team in his current lab location, specifically working on the development of DB2’s SQL/XML and XQuery capabilities. George also spent several years in a customer enablement role for the Optim Database toolset and more recently in IBM’s Big Data business. In his current role, George is leading the tools development team for IBM’s InfoSphere BigInsights platform. George has co-authored several database patents and has contributed to numerous papers. He’s a certified DB2 DBA and Hadoop Administrator.

About the Technical Editor

Steven Sit, B.Sc., MS, is a Program Director in IBM’s Silicon Valley Research and Development Lab where IBM’s Big Data platform is developed and engineered. Steven and his team help IBM’s customers and partners evaluate, prototype, and implement Big Data solutions as well as build Big Data deployment patterns. For the past 17 years, Steven has held key positions in a number of IBM projects, including business intelligence, database tooling, and text search.
Steven holds a Bachelor’s degree in Computer Science (University of Western Ontario) and a Masters of Computer Science degree (Syracuse University).

Understanding Big Data
Analytics for Enterprise Class Hadoop and Streaming Data

Paul C. Zikopoulos
Chris Eaton
Dirk deRoos
Thomas Deutsch
George Lapis

New York Chicago San Francisco Lisbon London Madrid Mexico City Milan New Delhi San Juan Seoul Singapore Sydney Toronto

McGraw-Hill books are available at special quantity discounts to use as premiums and sales promotions, or for use in corporate training programs. To contact a representative, please e-mail us at [email protected].

Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data

Copyright © 2012 by The McGraw-Hill Companies. All rights reserved. Printed in the United States of America. Except as permitted under the Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written permission of publisher, with the exception that the program listings may be entered, stored, and executed in a computer system, but they may not be reproduced for publication.

All trademarks or copyrights mentioned herein are the possession of their respective owners and McGraw-Hill makes no claim of ownership by the mention of products that contain these marks.

The contents of this book represent those features that may or may not be available in the current release of any products mentioned within this book despite what the book may say. IBM reserves the right to include or exclude any functionality mentioned in this book for the current release of InfoSphere Streams or InfoSphere BigInsights, or a subsequent release. In addition, any performance claims made in this article are not official communications by IBM; rather the results observed by the authors in unaudited testing.
The views expressed in this article are those of the authors and not necessarily those of IBM Corporation.

1 2 3 4 5 6 7 8 9 0 DOC DOC 1 0 9 8 7 6 5 4 3 2 1

ISBN 978-0-07-179053-6
MHID 0-07-179053-5

Sponsoring Editor: Paul Carlstroem
Editorial Supervisor: Patty Mon
Project Manager: Sheena Uprety, Cenveo Publisher Services
Acquisitions Coordinator: Stephanie Evans
Copy Editor: Lisa Theobald
Proofreader: Paul Tyler
Production Supervisor: George Anderson
Composition: Cenveo Publisher Services
Illustration: Cenveo Publisher Services
Art Director, Cover: Jeff Weeks

Information has been obtained by McGraw-Hill from sources believed to be reliable. However, because of the possibility of human or mechanical error by our sources, McGraw-Hill, or others, McGraw-Hill does not guarantee the accuracy, adequacy, or completeness of any information and is not responsible for any errors or omissions or the results obtained from the use of such information.

My fifteenth book in my eighteenth year at IBM—it’s hard to believe so much time has passed and Information Management technology has become not just my career, but somewhat of a hobby (insert image of Chloe reading this in a couple of years once she learns the universal “loser” gesture). I often dedicate my books to people in my life: This book I actually want to dedicate to the company in my life that turned 100 years old on August 12, 2011: IBM. In this day and age of fluid careers, the U.S. Department of Labor has remarked that the average learner will have 10 to 14 jobs by the time they are 38; 1 in 4 workers have been with their employer less than a year; and 1 in 2 workers have been with their employer less than 5 years. Sometimes I get asked about my 18-year tenure at IBM in a tone of disbelief for my generation.
In my 18 years at IBM, I’ve had the honor to learn and participate in the latest technologies, marketing, sales, technical sales, writing, usability design, development, partner programs, channels, education, support, services, public speaking, competitive analysis, and always learning more. IBM has always been a place that nurtures excellence and opportunity for those thirsty to make a difference, and I’ve got a thirst not yet quenched. IBM deeply encourages learning from others—and I often wonder if other people feel like they won the lottery with a mentoring team (Martin Wildberger, Bob Piciano, Dale Rebhorn, and Alyse Passarelli) like the one I have. Thanks to IBM for providing an endless cup of opportunity and learning experiences. Finally, to my two gals, whose spirits always run through my soul: Grace Madeleine Zikopoulos and Chloë Alyse Zikopoulos.
—Paul Zikopoulos

This is the fourth book that I have authored, and every time I dedicate the book to my wife and family. Well this is no exception, because it’s their support that makes this all possible, as anyone who has ever spent hours and hours of their own personal time writing a book can attest to. To my wife, Teresa, who is always supporting me 100 percent in all that I do, including crazy ideas like writing a book. She knows full well how much time it takes to write a book since she is a real author herself and yet she still supported me when I told her I was going to write this book (you are a saint). And to Riley and Sophia, who are now old enough to read one of my books (not that they are really interested in any of this stuff since they are both under ten). Daddy is finished writing his book so let’s go outside and play.
—Chris Eaton

I’d like to thank Sandra, Erik, and Anna for supporting me, and giving me the time to do this. Also, thanks to Paul for making this book happen and asking me to contribute.
—Dirk deRoos

I would like to thank my ridiculously supportive wife and put in writing for Lauren and William that yes, I will finally take them to Disneyland again now that this is published. I’d also like to thank Anant Jhingran for both the coaching and opportunities he has entrusted in me.
—Thomas Deutsch

“If you love what you do, you will never work a day in your life.” I dedicate this book to all my colleagues at IBM that I worked with over the years who helped me learn and grow and have made this saying come true for me.
—George Lapis

Thanks to my IBM colleagues in Big Data Research and Development for the exciting technologies I get to work on every day. I also want to thank Paul for the opportunity to contribute to this book. Last but not least, and most importantly, for my wife, Amy, and my twins, Tiffany and Ronald, thank you for everything you do, the joy you bring, and for supporting the time it took to work on this book.
—Steven Sit

CONTENTS AT A GLANCE

PART I Big Data: From the Business Perspective
1 What Is Big Data? Hint: You’re a Part of It Every Day
2 Why Is Big Data Important?
3 Why IBM for Big Data?

PART II Big Data: From the Technology Perspective
4 All About Hadoop: The Big Data Lingo Chapter
5 InfoSphere BigInsights: Analytics for Big Data At Rest
6 IBM InfoSphere Streams: Analytics for Big Data in Motion