Parallel R
Q. Ethan McCallum and Stephen Weston
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Parallel R
by Q. Ethan McCallum and Stephen Weston
Copyright © 2012 Q. Ethan McCallum and Stephen Weston. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.
Editors: Mike Loukides and Meghan Blanchette
Production Editor: Kristen Borg
Proofreader: O’Reilly Production Services
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano
Revision History for the First Edition:
2011-10-21 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449309923 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Parallel R, the image of a rabbit, and related trade dress are trademarks of O’Reilly
Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information
contained herein.
ISBN: 978-1-449-30992-3
[LSI]
1319202138
Table of Contents
Preface ..................................................................... vii
1. Getting Started ......................................................... 1
Why R? 1
Why Not R? 1
The Solution: Parallel Execution 2
A Road Map for This Book 2
What We’ll Cover 3
Looking Forward… 3
What We’ll Assume You Already Know 3
In a Hurry? 4
snow 4
multicore 4
parallel 4
R+Hadoop 4
RHIPE 5
Segue 5
Summary 5
2. snow .................................................................. 7
Quick Look 7
How It Works 7
Setting Up 8
Working with It 9
Creating Clusters with makeCluster 9
Parallel K-Means 10
Initializing Workers 12
Load Balancing with clusterApplyLB 13
Task Chunking with parLapply 15
Vectorizing with clusterSplit 18
Load Balancing Redux 20
Functions and Environments 23
Random Number Generation 25
snow Configuration 26
Installing Rmpi 29
Executing snow Programs on a Cluster with Rmpi 30
Executing snow Programs with a Batch Queueing System 32
Troubleshooting snow Programs 33
When It Works… 35
…And When It Doesn’t 36
The Wrap-up 36
3. multicore ............................................................. 37
Quick Look 37
How It Works 38
Setting Up 38
Working with It 39
The mclapply Function 39
The mc.cores Option 39
The mc.set.seed Option 40
Load Balancing with mclapply 42
The pvec Function 42
The parallel and collect Functions 43
Using collect Options 44
Parallel Random Number Generation 46
The Low-Level API 47
When It Works… 49
…And When It Doesn’t 49
The Wrap-up 49
4. parallel .............................................................. 51
Quick Look 52
How It Works 52
Setting Up 52
Working with It 53
Getting Started 53
Creating Clusters with makeCluster 54
Parallel Random Number Generation 55
Summary of Differences 57
When It Works… 58
…And When It Doesn’t 58
The Wrap-up 58
5. A Primer on MapReduce and Hadoop ...................................... 59
Hadoop at Cruising Altitude 59
A MapReduce Primer 60
Thinking in MapReduce: Some Pseudocode Examples 61
Calculate Average Call Length for Each Date 62
Number of Calls by Each User, on Each Date 62
Run a Special Algorithm on Each Record 63
Binary and Whole-File Data: SequenceFiles 63
No Cluster? No Problem! Look to the Clouds… 64
The Wrap-up 66
6. R+Hadoop ............................................................ 67
Quick Look 67
How It Works 67
Setting Up 68
Working with It 68
Simple Hadoop Streaming (All Text) 69
Streaming, Redux: Indirectly Working with Binary Data 72
The Java API: Binary Input and Output 74
Processing Related Groups (the Full Map and Reduce Phases) 79
When It Works… 83
…And When It Doesn’t 83
The Wrap-up 84
7. RHIPE ................................................................ 85
Quick Look 85
How It Works 85
Setting Up 86
Working with It 87
Phone Call Records, Redux 87
Tweet Brevity 91
More Complex Tweet Analysis 96
When It Works… 98
…And When It Doesn’t 99
The Wrap-up 100
8. Segue ............................................................... 101
Quick Look 101
How It Works 102
Setting Up 102
Working with It 102
Model Testing: Parameter Sweep 102
When It Works… 105
…And When It Doesn’t 105
The Wrap-up 106
9. New and Upcoming ................................................... 107
doRedis 107
RevoScale R and RevoConnectR (RHadoop) 108
cloudNumbers.com 108
Preface
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values
determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Parallel R by Q. Ethan McCallum and
Stephen Weston (O'Reilly). Copyright 2012 Q. Ethan McCallum and Stephen Weston,
978-1-449-30992-3.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Safari Books Online is an on-demand digital library that lets you easily
search over 7,500 technology and creative reference books and videos to
find the answers you need quickly.
With a subscription, you can read any page and watch any video from our library online.
Read books on your cell phone and mobile devices. Access new titles before they are
available for print, and get exclusive access to manuscripts in development and post
feedback for the authors. Copy and paste code samples, organize your favorites,
download chapters, bookmark key sections, create notes, print out pages, and benefit from
tons of other time-saving features.
O’Reilly Media has uploaded this book to the Safari Books Online service. To have full
digital access to this book and others on similar topics from O’Reilly and other
publishers, sign up for free at http://my.safaribooksonline.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at:
http://oreilly.com/catalog/0636920021421
To comment or ask technical questions about this book, send email to:
bookquestions@oreilly.com