
Apache Oozie PDF

271 Pages·2015·5.85 MB·English
by Mohammad Kamrul Islam and Aravind Srinivasan

Preview Apache Oozie

Apache Oozie
The Workflow Scheduler for Hadoop

Get a solid grounding in Apache Oozie, the workflow scheduler system for managing Hadoop jobs. In this hands-on guide, two experienced Hadoop practitioners walk you through the intricacies of this powerful and flexible platform, with numerous examples and real-world use cases.

Once you set up your Oozie server, you’ll dive into techniques for writing and coordinating workflows, and learn how to write complex data pipelines. Advanced topics show you how to handle shared libraries in Oozie, as well as how to implement and manage Oozie’s security capabilities.

■ Install and configure an Oozie server, and get an overview of basic concepts
■ Journey through the world of writing and configuring workflows
■ Learn how the Oozie coordinator schedules and executes workflows based on triggers
■ Understand how Oozie manages data dependencies
■ Use Oozie bundles to package several coordinator apps into a data pipeline
■ Learn about security features and shared library management
■ Implement custom extensions and write your own EL functions and actions
■ Debug workflows and manage Oozie’s operational details

“In this book, the authors have striven for practicality, focusing on the concepts, principles, tips, and tricks that developers need to get the most out of Oozie. A volume such as this is long overdue. Developers will get a lot more out of the Hadoop ecosystem by reading it.” (Raymie Stata, CEO, Altiscale)

“Oozie simplifies the managing and automating of complex Hadoop workloads. This greatly benefits both developers and operators alike.” (Alejandro Abdelnur, Creator of Apache Oozie)

Mohammad Kamrul Islam works as a Staff Software Engineer in the data engineering team at Uber. He’s been involved with the Hadoop ecosystem since 2009, and is a PMC member and a respected voice in the Oozie community. He has worked in the Hadoop teams at LinkedIn and Yahoo.

Aravind Srinivasan is a Lead Application Architect at Altiscale, a Hadoop-as-a-service company, where he helps customers with Hadoop application design and architecture. He has been involved with Hadoop in general and Oozie in particular since 2008.

DATA
Twitter: @oreillymedia
facebook.com/oreilly
US $39.99 CAN $45.99
ISBN: 978-1-449-36992-7

Apache Oozie
by Mohammad Kamrul Islam and Aravind Srinivasan

Copyright © 2015 Mohammad Islam and Aravindakshan Srinivasan. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Mike Loukides and Marie Beaugureau
Production Editor: Colleen Lobner
Copyeditor: Gillian McGarvey
Proofreader: Jasmine Kwityn
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

May 2015: First Edition

Revision History for the First Edition
2015-05-08: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449369927 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Apache Oozie, the cover image of a binturong, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-36992-7
[LSI]

Table of Contents

Foreword
Preface

1. Introduction to Oozie
    Big Data Processing
    A Recurrent Problem
    A Common Solution: Oozie
    A Simple Oozie Job
    Oozie Releases
    Some Oozie Usage Numbers

2. Oozie Concepts
    Oozie Applications
    Oozie Workflows
    Oozie Coordinators
    Oozie Bundles
    Parameters, Variables, and Functions
    Application Deployment Model
    Oozie Architecture

3. Setting Up Oozie
    Oozie Deployment
    Basic Installations
    Requirements
    Build Oozie
    Install Oozie Server
    Hadoop Cluster
    Start and Verify the Oozie Server
    Advanced Oozie Installations
    Configuring Kerberos Security
    DB Setup
    Shared Library Installation
    Oozie Client Installations

4. Oozie Workflow Actions
    Workflow
    Actions
    Action Execution Model
    Action Definition
    Action Types
    MapReduce Action
    Java Action
    Pig Action
    FS Action
    Sub-Workflow Action
    Hive Action
    DistCp Action
    Email Action
    Shell Action
    SSH Action
    Sqoop Action
    Synchronous Versus Asynchronous Actions

5. Workflow Applications
    Outline of a Basic Workflow
    Control Nodes
    <start> and <end>
    <fork> and <join>
    <decision>
    <kill>
    <OK> and <ERROR>
    Job Configuration
    Global Configuration
    Job XML
    Inline Configuration
    Launcher Configuration
    Parameterization
    EL Variables
    EL Functions
    EL Expressions
    The job.properties File
    Command-Line Option
    The config-default.xml File
    The <parameters> Section
    Configuration and Parameterization Examples
    Lifecycle of a Workflow
    Action States

6. Oozie Coordinator
    Coordinator Concept
    Triggering Mechanism
    Time Trigger
    Data Availability Trigger
    Coordinator Application and Job
    Coordinator Action
    Our First Coordinator Job
    Coordinator Submission
    Oozie Web Interface for Coordinator Jobs
    Coordinator Job Lifecycle
    Coordinator Action Lifecycle
    Parameterization of the Coordinator
    EL Functions for Frequency
    Day-Based Frequency
    Month-Based Frequency
    Execution Controls
    An Improved Coordinator

7. Data Trigger Coordinator
    Expressing Data Dependency
    Dataset
    Example: Rollup
    Parameterization of Dataset Instances
    current(n)
    latest(n)
    Parameter Passing to Workflow
    dataIn(eventName)
    dataOut(eventName)
    nominalTime()
    actualTime()
    dateOffset(baseTimeStamp, skipInstance, timeUnit)
    formatTime(timeStamp, formatString)
    A Complete Coordinator Application

8. Oozie Bundles
    Bundle Basics
    Bundle Definition
    Why Do We Need Bundles?
    Bundle Specification
    Execution Controls
    Bundle State Transitions

9. Advanced Topics
    Managing Libraries in Oozie
    Origin of JARs in Oozie
    Design Challenges
    Managing Action JARs
    Supporting the User’s JAR
    JAR Precedence in classpath
    Oozie Security
    Oozie Security Overview
    Oozie to Hadoop
    Oozie Client to Server
    Supporting Custom Credentials
    Supporting New API in MapReduce Action
    Supporting Uber JAR
    Cron Scheduling
    A Simple Cron-Based Coordinator
    Oozie Cron Specification
    Emulate Asynchronous Data Processing
    HCatalog-Based Data Dependency

10. Developer Topics
    Developing Custom EL Functions
    Requirements for a New EL Function
    Implementing a New EL Function
    Supporting Custom Action Types
    Creating a Custom Synchronous Action
    Overriding an Asynchronous Action Type
    Implementing the New ActionMain Class
    Testing the New Main Class
    Creating a New Asynchronous Action
    Writing an Asynchronous Action Executor
    Writing the ActionMain Class
    Writing Action’s Schema
    Deploying the New Action Type
    Using the New Action Type

11. Oozie Operations
    Oozie CLI Tool
    CLI Subcommands
    Useful CLI Commands
    Oozie REST API
    Oozie Java Client
    The oozie-site.xml File
    The Oozie Purge Service
    Job Monitoring
    JMS-Based Monitoring
    Oozie Instrumentation and Metrics
    Reprocessing
    Workflow Reprocessing
    Coordinator Reprocessing
    Bundle Reprocessing
    Server Tuning
    JVM Tuning
    Service Settings
    Oozie High Availability
    Debugging in Oozie
    Oozie Logs
    Developing and Testing Oozie Applications
    Application Deployment Tips
    Common Errors and Debugging
    MiniOozie and LocalOozie
    The Competition

Index

Description:
... Hadoop workloads. This greatly benefits both developers and operators alike.” (Alejandro Abdelnur, Creator of Apache Oozie)

... flow.xml file. The Map and Reduce classes are already available in Hadoop's classpath and we don't need to include them in the Oozie workflow ...