Big Data for Chimps
Philip Kromer and Russell Jurney
Big Data for Chimps
by Philip Kromer and Russell Jurney
Copyright © 2016 Philip Kromer and Russell Jurney. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safaribooksonline.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Acquisitions Editor: Mike Loukides
Editors: Meghan Blanchette and Amy Jollymore
Production Editor: Matthew Hacker
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Monaghan
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest
October 2015: First Edition
Revision History for the First Edition
2015-09-25: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491923948 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Big Data for Chimps,
the cover image of a chimpanzee, and related trade dress are trademarks of O’Reilly
Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and the
authors disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the
information and instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-92394-8
[LSI]
Preface
Big Data for Chimps presents a practical, actionable view of big data. This view is
centered on tested best practices and gives readers street-fighting smarts with
Hadoop.
Readers will come away with a useful, conceptual idea of big data. Insight is data in
context. The key to understanding big data is scalability: arbitrarily large amounts of data
can rest upon distinct pivot points. We will teach you how to manipulate data around these
pivot points.
Finally, the book includes examples with real data and real problems that bring the
concepts and their business applications to life.
What This Book Covers
Big Data for Chimps shows you how to solve important problems in large-scale data
processing using simple, fun, and elegant tools.
Finding patterns in massive event streams is an important, hard problem. Most of the time,
there aren’t earthquakes, but the patterns that will let you predict one in advance lie
within the data from those quiet periods. How do you compare the trillions of
subsequences in billions of events, each against every other, to find the very few that matter?
Once you have those patterns, how do you react to them in real time?
We’ve chosen case studies that anyone can understand and that are general enough to apply
to whatever problems you’re looking to solve. Our goal is to provide you with the following:
- The ability to think at scale, equipping you with a deep understanding of how to break
  a problem into efficient data transformations, and of how data must flow through the
  cluster to effect those transformations
- Detailed example programs applying Hadoop to interesting problems in context
- Advice and best practices for efficient software development
All of the examples use real data and describe patterns found in many problem domains,
as you:
- Create statistical summaries
- Identify patterns and groups in the data
- Search, filter, and herd records in bulk
The emphasis on simplicity and fun should make this book especially appealing to
beginners, but this is not an approach you’ll outgrow. We’ve found it’s the most powerful
and valuable approach for creative analytics. One of our maxims is “robots are cheap,
humans are important”: write readable, scalable code now and find out later whether you
want a smaller cluster. The code you see is adapted from programs we write at Infochimps
and Data Syndrome to solve enterprise-scale business problems, and these simple
high-level transformations meet our needs.
Many of the chapters include exercises. If you’re a beginning user, we highly recommend
you work through at least one exercise from each chapter. Deep understanding will come less
from having the book in front of you as you read it than from having it next to you
while you write code inspired by it. There are sample solutions and result datasets on the
book’s website.
Who This Book Is For
We’d like for you to be familiar with at least one programming language, but it doesn’t
have to be Python or Pig. Familiarity with SQL will help a bit, but isn’t essential. Some
experience working with data in a business intelligence or analysis setting will be
helpful.
Most importantly, you should have an actual project in mind that requires a big-data
toolkit to solve — a problem that requires scaling out across multiple machines. If you
don’t already have a project in mind but really want to learn about the big-data toolkit,
take a look at Chapter 3, which uses baseball data. It makes a great dataset for fun
exploration.