
MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems

251 Pages·2012·8.892 MB·English

Preview MapReduce Design Patterns Building Effective Algorithms and Analytics for Hadoop and Other Systems

MapReduce Design Patterns
by Donald Miner and Adam Shook

Copyright © 2013 Donald Miner and Adam Shook. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Andy Oram and Mike Hendrickson
Production Editor: Christopher Hearse
Proofreader: Dawn Carelli
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest

December 2012: First Edition

Revision History for the First Edition:
2012-11-20 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449327170 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. MapReduce Design Patterns, the image of Père David’s deer, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-32717-0
[LSI]

For William

Table of Contents

Preface

1. Design Patterns and MapReduce
    Design Patterns
    MapReduce History
    MapReduce and Hadoop Refresher
    Hadoop Example: Word Count
    Pig and Hive

2. Summarization Patterns
    Numerical Summarizations
        Pattern Description
        Numerical Summarization Examples
    Inverted Index Summarizations
        Pattern Description
        Inverted Index Example
    Counting with Counters
        Pattern Description
        Counting with Counters Example

3. Filtering Patterns
    Filtering
        Pattern Description
        Filtering Examples
    Bloom Filtering
        Pattern Description
        Bloom Filtering Examples
    Top Ten
        Pattern Description
        Top Ten Examples
    Distinct
        Pattern Description
        Distinct Examples

4. Data Organization Patterns
    Structured to Hierarchical
        Pattern Description
        Structured to Hierarchical Examples
    Partitioning
        Pattern Description
        Partitioning Examples
    Binning
        Pattern Description
        Binning Examples
    Total Order Sorting
        Pattern Description
        Total Order Sorting Examples
    Shuffling
        Pattern Description
        Shuffle Examples

5. Join Patterns
    A Refresher on Joins
    Reduce Side Join
        Pattern Description
        Reduce Side Join Example
        Reduce Side Join with Bloom Filter
    Replicated Join
        Pattern Description
        Replicated Join Examples
    Composite Join
        Pattern Description
        Composite Join Examples
    Cartesian Product
        Pattern Description
        Cartesian Product Examples

6. Metapatterns
    Job Chaining
        With the Driver
        Job Chaining Examples
        With Shell Scripting
        With JobControl
    Chain Folding
        The ChainMapper and ChainReducer Approach
        Chain Folding Example
    Job Merging
        Job Merging Examples

7. Input and Output Patterns
    Customizing Input and Output in Hadoop
        InputFormat
        RecordReader
        OutputFormat
        RecordWriter
    Generating Data
        Pattern Description
        Generating Data Examples
    External Source Output
        Pattern Description
        External Source Output Example
    External Source Input
        Pattern Description
        External Source Input Example
    Partition Pruning
        Pattern Description
        Partition Pruning Examples

8. Final Thoughts and the Future of Design Patterns
    Trends in the Nature of Data
        Images, Audio, and Video
        Streaming Data
    The Effects of YARN
    Patterns as a Library or Component
    How You Can Help

A. Bloom Filters

Index
