Table Of ContentFront cover
Apache Spark
Implementation on IBM z/OS
Lydia Parziale
Joe Bostian
Ravi Kumar
Ulrich Seelbach
Zhong Yu Ye
Redbooks
International Technical Support Organization
Apache Spark Implementation on IBM z/OS
August 2016
SG24-8325-00
Note: Before using this information and the product it supports, read the information in “Notices” on
pagevii.
First Edition (August 2016)
This edition applies to Version 2, Release 2 of IBM z/OS (product number 5650 ZOS), Apache Spark 1.5.2
© Copyright International Business Machines Corporation 2016. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
Contents
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .vii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
IBM Redbooks promotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Authors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Now you can become a published author, too. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1. Architectural overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Open source analytics on z/OS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Benefits of Spark on z/OS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Drawbacks of implementing off-platform analytics . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 A new chapter in analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Planning your environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Reference architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Spark server architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Spark environment architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Implementation with Jupyter Notebooks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.4 Scala IDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Security. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 2. Components and extensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Apache Spark component overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Resilient Distributed Datasets and caching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Components of a Spark cluster on z/OS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.4 Spark and Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Mainframe Data Services for IBM z/OS Platform for Apache Spark. . . . . . . . . . . . . . . 21
2.2.1 Virtual tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Virtual views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 SQL queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.4 MDSS JDBC driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.5 IBM z/OS Platform for Apache Spark Interface for CICS/TS . . . . . . . . . . . . . . . . 24
2.3 Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Reading from z/OS data source into a DataFrame. . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Writing DataFrame to a DB2 for z/OS table using saveTable method . . . . . . . . . 26
2.4 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 GraphX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.1 System G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 MLlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Spark R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
© Copyright IBM Corp. 2016. All rights reserved. iii
Chapter 3. Installation and configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Installing IBM z/OS Platform for Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 The Mainframe Data Service for Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Installing the MDSS started task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Configuring access to DB2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Configuring access to IMS databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.4 The ISPF Panels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.5 Installing and configuring Bash. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.6 Check for /usr/bin/env. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Installing workstation components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Installing Data Service Studio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Installing the JDBC driver on the workstation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Configuring Apache Spark for z/OS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Create log and worker directories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Apache Spark directory structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.3 Create directories and local configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.4 Installing the Data Server JDBC driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.5 Modifying the log4j configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.6 Adding the Spark binaries to your PATH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Verifying the installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.6 Starting the Spark daemons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 4. Spark application development on z/OS . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Setting up the development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.1 Installing ScalaIDE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.2 Installing Data Server Studio plugins into ScalaIDE . . . . . . . . . . . . . . . . . . . . . . 62
4.1.3 Installing and using sbt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Accessing VSAM data as an RDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.1 Defining the data mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.2 Building and running the application. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Accessing sequential files and PDS members . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Accessing IBM DB2 data as a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Joining DB2 data with VSAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6 IBM IMS data to DataFrames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 System log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 SMF data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.9 JavaScript Object Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.10 Extensible Markup Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.11 Submit Spark jobs from z/OS applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 5. Production integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1 Production deployment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Running Spark applications from z/OS batch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 Starting Spark master and workers from JCL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 System level tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.1 Tuning the MDSS server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.2 Tuning z/OSUNIX settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 6. IBM z/OS Platform for Apache Spark and the ecosystem. . . . . . . . . . . . . . 91
6.1 Tidy data repository. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Jupyter notebooks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.1 The Jupyter notebook overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.2 Docker and the platforms that support it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2.3 The dockeradmin userid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
iv Apache Spark Implementation on IBM z/OS
6.2.4 The Role of SSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.5 Creating the Docker container . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.6 A note about network configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.7 Building the Jupyter scala workbench. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 7. Use case patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1 Banking and finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1.1 Churn prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1.2 Fraud prevention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.3 Upsell opportunity detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Insurance industry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2.1 Claims payment analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Retail industry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3.1 Product recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4 Other use case patterns for IBM z/OS Platform for Apache Spark. . . . . . . . . . . . . . . 114
7.4.1 Analytics across OLTP and warehouse information. . . . . . . . . . . . . . . . . . . . . . 114
7.4.2 Analytics combining business-owned data and external / social data . . . . . . . . 114
7.4.3 Analytics of real-time transactions through streaming, combining with OLTP and
social. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5 Operations analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5.1 SMF data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.5.2 Syslog data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Appendix A. Sample code to run on Apache Spark cluster on z/OS. . . . . . . . . . . . . 117
Appendix B. FAQ: Frequently asked questions, and answers . . . . . . . . . . . . . . . . . . 121
General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Technical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Contents v
vi Apache Spark Implementation on IBM z/OS
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS”
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
© Copyright IBM Corp. 2016. All rights reserved. vii
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
BigInsights® IBM z13™ RMF™
CICS® IMS™ S/390®
Cloudant® Lotus® SPSS®
Cognos® MVS™ System z®
DB2® Parallel Sysplex® WebSphere®
FICON® Print Services Facility™ z Systems™
GPFS™ PrintWay™ z/OS®
IBM® RACF® z13™
IBM z™ Redbooks® zEnterprise®
IBM z Systems™ Redbooks (logo) ®
The following terms are trademarks of other companies:
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its
affiliates.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
viii Apache Spark Implementation on IBM z/OS
Description:This edition applies to Version 2, Release 2 of IBM z/OS (product number 5650 ZOS), Apache Spark 1.5.2. Note: Before using this information and the