ebook img

Apache Spark Implementation on IBM z/OS PDF

144 Pages·2016·4.46 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Apache Spark Implementation on IBM z/OS

Front cover Apache Spark Implementation on IBM z/OS Lydia Parziale Joe Bostian Ravi Kumar Ulrich Seelbach Zhong Yu Ye Redbooks International Technical Support Organization Apache Spark Implementation on IBM z/OS August 2016 SG24-8325-00 Note: Before using this information and the product it supports, read the information in “Notices” on pagevii. First Edition (August 2016) This edition applies to Version 2, Release 2 of IBM z/OS (product number 5650 ZOS), Apache Spark 1.5.2 © Copyright International Business Machines Corporation 2016. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .vii Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii IBM Redbooks promotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Authors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Now you can become a published author, too. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xii Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Chapter 1. Architectural overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Open source analytics on z/OS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Benefits of Spark on z/OS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Drawbacks of implementing off-platform analytics . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.3 A new chapter in analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Planning your environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Reference architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.1 Spark server architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.2 Spark environment architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3.3 Implementation with Jupyter Notebooks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3.4 Scala IDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4 Security. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Chapter 2. Components and extensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1 Apache Spark component overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.1.1 Resilient Distributed Datasets and caching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.1.2 Components of a Spark cluster on z/OS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.3 Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.1.4 Spark and Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 Mainframe Data Services for IBM z/OS Platform for Apache Spark. . . . . . . . . . . . . . . 21 2.2.1 Virtual tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.2 Virtual views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.3 SQL queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.4 MDSS JDBC driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.5 IBM z/OS Platform for Apache Spark Interface for CICS/TS . . . . . . . . . . . . . . . . 24 2.3 Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.1 Reading from z/OS data source into a DataFrame. . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.2 Writing DataFrame to a DB2 for z/OS table using saveTable method . . . . . . . . . 26 2.4 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5 GraphX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5.1 System G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.6 MLlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.7 Spark R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 © Copyright IBM Corp. 2016. All rights reserved. iii Chapter 3. Installation and configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1 Installing IBM z/OS Platform for Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2 The Mainframe Data Service for Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.1 Installing the MDSS started task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.2 Configuring access to DB2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.3 Configuring access to IMS databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.4 The ISPF Panels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2.5 Installing and configuring Bash. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.6 Check for /usr/bin/env. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3 Installing workstation components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.1 Installing Data Service Studio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.2 Installing the JDBC driver on the workstation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Configuring Apache Spark for z/OS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.1 Create log and worker directories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.2 Apache Spark directory structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4.3 Create directories and local configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4.4 Installing the Data Server JDBC driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.5 Modifying the log4j configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.4.6 Adding the Spark binaries to your PATH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.5 Verifying the installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.6 Starting the Spark daemons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Chapter 4. Spark application development on z/OS . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.1 Setting up the development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.1 Installing ScalaIDE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.2 Installing Data Server Studio plugins into ScalaIDE . . . . . . . . . . . . . . . . . . . . . . 62 4.1.3 Installing and using sbt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Accessing VSAM data as an RDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.1 Defining the data mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.2 Building and running the application. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3 Accessing sequential files and PDS members . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.4 Accessing IBM DB2 data as a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.5 Joining DB2 data with VSAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.6 IBM IMS data to DataFrames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.7 System log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.8 SMF data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.9 JavaScript Object Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.10 Extensible Markup Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.11 Submit Spark jobs from z/OS applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Chapter 5. Production integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.1 Production deployment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.2 Running Spark applications from z/OS batch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.3 Starting Spark master and workers from JCL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4 System level tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.1 Tuning the MDSS server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.2 Tuning z/OSUNIX settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Chapter 6. IBM z/OS Platform for Apache Spark and the ecosystem. . . . . . . . . . . . . . 91 6.1 Tidy data repository. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.2 Jupyter notebooks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.2.1 The Jupyter notebook overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.2.2 Docker and the platforms that support it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.2.3 The dockeradmin userid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 iv Apache Spark Implementation on IBM z/OS 6.2.4 The Role of SSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2.5 Creating the Docker container . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.2.6 A note about network configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.2.7 Building the Jupyter scala workbench. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 Chapter 7. Use case patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.1 Banking and finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 7.1.1 Churn prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 7.1.2 Fraud prevention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 7.1.3 Upsell opportunity detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.2 Insurance industry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.2.1 Claims payment analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.3 Retail industry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.3.1 Product recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.4 Other use case patterns for IBM z/OS Platform for Apache Spark. . . . . . . . . . . . . . . 114 7.4.1 Analytics across OLTP and warehouse information. . . . . . . . . . . . . . . . . . . . . . 114 7.4.2 Analytics combining business-owned data and external / social data . . . . . . . . 114 7.4.3 Analytics of real-time transactions through streaming, combining with OLTP and social. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.5 Operations analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.5.1 SMF data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.5.2 Syslog data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Appendix A. Sample code to run on Apache Spark cluster on z/OS. . . . . . . . . . . . . 117 Appendix B. FAQ: Frequently asked questions, and answers . . . . . . . . . . . . . . . . . . 121 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Technical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Contents v vi Apache Spark Implementation on IBM z/OS Notices This information was developed for products and services offered in the US. This material might be available from IBM in other languages. However, you may be required to own a copy of the product or product version in that language in order to access it. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk. IBM may use or distribute any of the information you provide in any way it believes appropriate without incurring any obligation to you. The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to actual people or business enterprises is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs. © Copyright IBM Corp. 2016. All rights reserved. vii Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks or registered trademarks of International Business Machines Corporation, and might also be trademarks or registered trademarks in other countries. BigInsights® IBM z13™ RMF™ CICS® IMS™ S/390® Cloudant® Lotus® SPSS® Cognos® MVS™ System z® DB2® Parallel Sysplex® WebSphere® FICON® Print Services Facility™ z Systems™ GPFS™ PrintWay™ z/OS® IBM® RACF® z13™ IBM z™ Redbooks® zEnterprise® IBM z Systems™ Redbooks (logo) ® The following terms are trademarks of other companies: Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, or service names may be trademarks or service marks of others. viii Apache Spark Implementation on IBM z/OS

Description:
This edition applies to Version 2, Release 2 of IBM z/OS (product number 5650 ZOS), Apache Spark 1.5.2. Note: Before using this information and the
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.