
Pentaho Data Integration Beginner’s Guide PDF

502 Pages·2013·10.686 MB·English

Preview Pentaho Data Integration Beginner’s Guide

Pentaho Data Integration Beginner's Guide
Second Edition

Get up and running with the Pentaho Data Integration tool using this hands-on, easy-to-read guide

María Carina Roldán

BIRMINGHAM - MUMBAI

Pentaho Data Integration Beginner's Guide
Second Edition

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2010
Second Edition: October 2013
Production Reference: 1171013

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78216-504-0

www.packtpub.com

Cover Image by Suresh Mogre ([email protected])

Credits

Author: María Carina Roldán
Reviewers: Tomoyuki Hayashi, Gretchen Moran
Acquisition Editors: Usha Iyer, Greg Wild
Lead Technical Editor: Azharuddin Sheikh
Technical Editors: Sharvari H. Baet, Aparna K, Kanhucharan Panda, Vivek Pillai
Project Coordinator: Navu Dhillon
Proofreaders: Simran Bhogal, Ameesha Green
Indexer: Mariammal Chettiyar
Graphics: Ronak Dhruv, Yuvraj Mannari
Production Coordinator: Conidon Miranda
Cover Work: Conidon Miranda

About the Author

María Carina Roldán was born in Esquel, Argentina, and earned her Bachelor's degree in Computer Science at the Universidad Nacional de La Plata (UNLP) and then moved to Buenos Aires where she has lived since 1994. She has worked as a BI consultant for almost fifteen years. She started working with Pentaho technology back in 2006. Over the last three and a half years, she has been devoted to working full time for Webdetails, a company acquired by Pentaho in 2013, as an ETL specialist. Carina is the author of Pentaho 3.2 Data Integration Beginner's Book, Packt Publishing, April 2009, and the co-author of Pentaho Data Integration 4 Cookbook, Packt Publishing, June 2011.

I'd like to thank those who have encouraged me to write this book: firstly, the Pentaho community. They have given me such rewarding feedback after my other two books on PDI; it is because of them that I feel compelled to pass my knowledge on to those willing to learn. I also want to thank my friends! Especially Flavia, Jaqui, and Marce for their encouraging words throughout the writing process; Silvina for clearing up my questions about English; Gonçalo for helping with the use of PDI on Mac systems; and Hernán for helping with ideas and examples for this new edition. I would also like to thank the technical reviewers (Gretchen, Tomoyuki, Nelson, and Paula) for the time and dedication that they have put in to reviewing the book.

About the Reviewers

Tomoyuki Hayashi is a system engineer who mainly works for the intersection of open source and enterprise software. He has developed a CMIS-compliant and CouchDB-based ECM software named NemakiWare (http://nemakiware.com/). He is currently working with Aegif, Japan, which provides advisory services for content-oriented applications, collaboration improvement, and ECM in general. It is one of the most experienced companies in Japan that supports the introduction of foreign-made software to the Japanese market.

Gretchen Moran works as an independent Pentaho consultant on a variety of business intelligence and big data projects. She has 15 years of experience in the business intelligence realm, developing software and providing services for a number of companies including Hyperion Solutions and the Pentaho Corporation. Gretchen continues to contribute to Pentaho Corporation's latest and greatest software initiatives while managing the daily adventures of her two children, Isabella and Jack, with her husband, Doug.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

- Fully searchable across every book published by Packt
- Copy and paste, print and bookmark content
- On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Table of Contents

Preface

Chapter 1: Getting Started with Pentaho Data Integration
  Pentaho Data Integration and Pentaho BI Suite
  Exploring the Pentaho Demo
  Pentaho Data Integration
  Using PDI in real-world scenarios
  Loading data warehouses or datamarts
  Integrating data
  Data cleansing
  Migrating information
  Exporting data
  Integrating PDI along with other Pentaho tools
  Installing PDI
  Time for action – installing PDI
  Launching the PDI graphical designer – Spoon
  Time for action – starting and customizing Spoon
  Spoon
  Setting preferences in the Options window
  Storing transformations and jobs in a repository
  Creating your first transformation
  Time for action – creating a hello world transformation
  Directing Kettle engine with transformations
  Exploring the Spoon interface
  Designing a transformation
  Running and previewing the transformation
  Installing MySQL
  Time for action – installing MySQL on Windows
  Time for action – installing MySQL on Ubuntu
  Summary

Chapter 2: Getting Started with Transformations
  Designing and previewing transformations
  Time for action – creating a simple transformation and getting familiar with the design process
  Getting familiar with editing features
  Using the mouseover assistance toolbar
  Working with grids
  Understanding the Kettle rowset
  Looking at the results in the Execution Results pane
  The Logging tab
  The Step Metrics tab
  Running transformations in an interactive fashion
  Time for action – generating a range of dates and inspecting the data as it is being created
  Adding or modifying fields by using different PDI steps
  The Select values step
  Getting fields
  Date fields
  Handling errors
  Time for action – avoiding errors while converting the estimated time from string to integer
  The error handling functionality
  Time for action – configuring the error handling to see the description of the errors
  Personalizing the error handling
  Summary

Chapter 3: Manipulating Real-world Data
  Reading data from files
  Time for action – reading results of football matches from files
  Input files
  Input steps
  Reading several files at once
  Time for action – reading all your files at a time using a single text file input step
  Time for action – reading all your files at a time using a single text file input step and regular expressions
  Regular expressions
  Troubleshooting reading files
  Sending data to files
  Time for action – sending the results of matches to a plain file
  Output files
  Output steps
  Getting system information
  Time for action – reading and writing matches files with flexibility
  The Get System Info step
  Running transformations from a terminal window
  Time for action – running the matches transformation from a terminal window
  XML files
  Time for action – getting data from an XML file with information about countries
  What is XML?
  PDI transformation files
  Getting data from XML files
  XPath
  Configuring the Get data from the XML step
  Kettle variables
  How and when you can use variables
  Summary

Chapter 4: Filtering, Searching, and Performing Other Useful Operations with Data
  Sorting data
  Time for action – sorting information about matches with the Sort rows step
  Calculations on groups of rows
  Time for action – calculating football match statistics by grouping data
  Group by Step
  Numeric fields
  Filtering
  Time for action – counting frequent words by filtering
  Time for action – refining the counting task by filtering even more
  Filtering rows using the Filter rows step
  Looking up data
  Time for action – finding out which language people speak
  The Stream lookup step
  Data cleaning
  Time for action – fixing words before counting them
  Cleansing data with PDI
  Summary

Chapter 5: Controlling the Flow of Data
  Splitting streams
  Time for action – browsing new features of PDI by copying a dataset
  Copying rows
  Distributing rows
  Time for action – assigning tasks by distributing
  Splitting the stream based on conditions
