Big Data
SMACK
A Guide to Apache Spark, Mesos,
Akka, Cassandra, and Kafka
Raul Estrada
Isaac Ruiz
Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka
Raul Estrada, Mexico City, Mexico
Isaac Ruiz, Mexico City, Mexico
ISBN-13 (pbk): 978-1-4842-2174-7
ISBN-13 (electronic): 978-1-4842-2175-4
DOI 10.1007/978-1-4842-2175-4
Library of Congress Control Number: 2016954634
Copyright © 2016 by Raul Estrada and Isaac Ruiz
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director: Welmoed Spahr
Acquisitions Editor: Susan McDermott
Developmental Editor: Laura Berendson
Technical Reviewer: Rogelio Vizcaino
Editorial Board: Steve Anglin, Pramila Balen, Laura Berendson, Aaron Black, Louise Corrigan,
Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal, James Markham,
Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing
Coordinating Editor: Rita Fernando
Copy Editor: Kim Burton-Weisman
Compositor: SPi Global
Indexer: SPi Global
Cover Image: Designed by Harryarts - Freepik.com
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street,
6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail o
I dedicate this book to my mom and all the masters out there.
— Raúl Estrada
For all Binnizá people.
— Isaac Ruiz
Contents at a Glance
About the Authors ...................................................................................................xix
About the Technical Reviewer ................................................................................xxi
Acknowledgments ................................................................................................xxiii
Introduction ...........................................................................................................xxv
■ Part I: Introduction ................................................................................................ 1
■ Chapter 1: Big Data, Big Challenges ...................................................................... 3
■ Chapter 2: Big Data, Big Solutions ......................................................................... 9
■ Part II: Playing SMACK ........................................................................................ 17
■ Chapter 3: The Language: Scala .......................................................................... 19
■ Chapter 4: The Model: Akka ................................................................................ 41
■ Chapter 5: Storage: Apache Cassandra ............................................................... 67
■ Chapter 6: The Engine: Apache Spark ................................................................. 97
■ Chapter 7: The Manager: Apache Mesos ........................................................... 131
■ Chapter 8: The Broker: Apache Kafka ................................................................ 165
■ Part III: Improving SMACK ................................................................................. 205
■ Chapter 9: Fast Data Patterns ............................................................................ 207
■ Chapter 10: Data Pipelines ................................................................................ 225
■ Chapter 11: Glossary ......................................................................................... 251
Index ..................................................................................................................... 259
Contents
About the Authors ...................................................................................................xix
About the Technical Reviewer ................................................................................xxi
Acknowledgments ................................................................................................xxiii
Introduction ...........................................................................................................xxv
■ Part I: Introduction ................................................................................................ 1
■ Chapter 1: Big Data, Big Challenges ...................................................................... 3
Big Data Problems ............................................................................................................ 3
Infrastructure Needs ........................................................................................................ 3
ETL ................................................................................................................................... 4
Lambda Architecture ........................................................................................................ 5
Hadoop ............................................................................................................................. 5
Data Center Operation ...................................................................................................... 5
The Open Source Reign .......................................................................................................................... 6
The Data Store Diversification ................................................................................................................ 6
Is SMACK the Solution? .................................................................................................... 7
■ Chapter 2: Big Data, Big Solutions ......................................................................... 9
Traditional vs. Modern (Big) Data ..................................................................................... 9
SMACK in a Nutshell ....................................................................................................... 11
Apache Spark vs. MapReduce ........................................................................................ 12
The Engine...................................................................................................................... 14
The Model ....................................................................................................................... 15
The Broker ...................................................................................................................... 15
The Storage .................................................................................................................... 16
The Container ................................................................................................................. 16
Summary ........................................................................................................................ 16
■ Part II: Playing SMACK ........................................................................................ 17
■ Chapter 3: The Language: Scala .......................................................................... 19
Functional Programming ................................................................................................ 19
Predicate .............................................................................................................................................. 19
Literal Functions ................................................................................................................................... 20
Implicit Loops ....................................................................................................................................... 20
Collections Hierarchy ..................................................................................................... 21
Sequences ............................................................................................................................................ 21
Maps ..................................................................................................................................................... 22
Sets....................................................................................................................................................... 23
Choosing Collections ...................................................................................................... 23
Sequences ............................................................................................................................................ 23
Maps ..................................................................................................................................................... 24
Sets....................................................................................................................................................... 25
Traversing ....................................................................................................................... 25
foreach ................................................................................................................................................. 25
for ......................................................................................................................................................... 26
Iterators ................................................................................................................................................ 27
Mapping ......................................................................................................................... 27
Flattening ....................................................................................................................... 28
Filtering .......................................................................................................................... 29
Extracting ....................................................................................................................... 30
Splitting .......................................................................................................................... 31
Unicity ............................................................................................................................ 32
Merging .......................................................................................................................... 32
Lazy Views ...................................................................................................................... 33
Sorting ............................................................................................................................ 34
Streams .......................................................................................................................... 35
Arrays ............................................................................................................................. 35
ArrayBuffers ................................................................................................................... 36
Queues ........................................................................................................................... 37
Stacks ............................................................................................................................ 38
Ranges ........................................................................................................................... 39
Summary ........................................................................................................................ 40
■ Chapter 4: The Model: Akka ................................................................................ 41
The Actor Model ............................................................................................................. 41
Threads and Labyrinths ........................................................................................................................ 42
Actors 101 ............................................................................................................................................ 42
Installing Akka ................................................................................................................ 44
Akka Actors .................................................................................................................... 51
Actors ................................................................................................................................................... 51
Actor System ........................................................................................................................................ 53
Actor Reference .................................................................................................................................... 53
Actor Communication ........................................................................................................................... 54
Actor Lifecycle ...................................................................................................................................... 56
Starting Actors ...................................................................................................................................... 58
Stopping Actors .................................................................................................................................... 60
Killing Actors ......................................................................................................................................... 61
Shutting down the Actor System .......................................................................................................... 62
Actor Monitoring ................................................................................................................................... 62
Looking up Actors ................................................................................................................................. 63
Actor Code of Conduct .......................................................................................................................... 64
Summary ........................................................................................................................ 66
■ Chapter 5: Storage: Apache Cassandra ............................................................... 67
Once Upon a Time... ........................................................................................................ 67
Modern Cassandra................................................................................................................................ 67
NoSQL Everywhere ......................................................................................................... 67
The Memory Value .......................................................................................................... 70
Key-Value and Column ......................................................................................................................... 70
Why Cassandra? ............................................................................................................. 71
The Data Model ..................................................................................................................................... 72
Cassandra 101 ............................................................................................................... 73
Installation ............................................................................................................................................ 73
Beyond the Basics .......................................................................................................... 82
Client-Server ........................................................................................................................................ 82
Other Clients ......................................................................................................................................... 83
Apache Spark-Cassandra Connector .................................................................................................... 87
Installing the Connector ........................................................................................................................ 87
Establishing the Connection ................................................................................................................. 89
More Than One Is Better ................................................................................................. 91
cassandra.yaml .................................................................................................................................... 92
Setting the Cluster ................................................................................................................................ 93
Putting It All Together ..................................................................................................... 95
■ Chapter 6: The Engine: Apache Spark ................................................................. 97
Introducing Spark ........................................................................................................... 97
Apache Spark Download ...................................................................................................................... 98
Let’s Kick the Tires ............................................................................................................................... 99
Loading a Data File ............................................................................................................................. 100
Loading Data from S3 ......................................................................................................................... 100
Spark Architecture........................................................................................................ 101
SparkContext ...................................................................................................................................... 102
Creating a SparkContext ..................................................................................................................... 102
SparkContext Metadata ...................................................................................................................... 103
SparkContext Methods ....................................................................................................................... 103
Working with RDDs....................................................................................................... 104
Standalone Apps ................................................................................................................................. 106
RDD Operations .................................................................................................................................. 108
Spark in Cluster Mode .................................................................................................. 112
Runtime Architecture .......................................................................................................................... 112
Driver .................................................................................................................................................. 113
Executor .............................................................................................................................................. 114
Cluster Manager ................................................................................................................................. 115
Program Execution ............................................................................................................................. 115
Application Deployment ...................................................................................................................... 115
Running in Cluster Mode .................................................................................................................... 117
Spark Standalone Mode ..................................................................................................................... 117
Running Spark on EC2 ........................................................................................................................ 120
Running Spark on Mesos .................................................................................................................... 122
Submitting Our Application ................................................................................................................. 122
Configuring Resources ....................................................................................................................... 123
High Availability .................................................................................................................................. 123
Spark Streaming .......................................................................................................... 123
Spark Streaming Architecture ............................................................................................................ 124
Transformations .................................................................................................................................. 125
24/7 Spark Streaming ........................................................................................................................ 129
Checkpointing ..................................................................................................................................... 129
Spark Streaming Performance ........................................................................................................... 129
Summary ...................................................................................................................... 130
■ Chapter 7: The Manager: Apache Mesos ........................................................... 131
Divide et Impera (Divide and Rule) ............................................................................... 131
Distributed Systems ..................................................................................................... 134
Why Are They Important? .................................................................................................................... 135
It Is Difficult to Have a Distributed System ................................................................... 135
Ta-dah!! Apache Mesos ................................................................................................ 137
Mesos Framework ........................................................................................................ 138
Architecture ........................................................................................................................................ 138