Getting Started with I mpala INTERACTIVE SQL FOR APACHE HADOOP John Russell Getting Started with Impala John Russell BBeeiijjiinngg BBoossttoonn FFaarrnnhhaamm SSeebbaassttooppooll TTookkyyoo Getting Started with Impala by John Russell Copyright © 2016 Cloudera, Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/ institutional sales department: 800-998-9938 or [email protected]. Editor: Marie Beaugureau Interior Designer: David Futato Production Editor: Kristen Brown Cover Designer: Ellie Volkhausen Copyeditor: Gillian McGarvey Illustrator: Rebecca Demarest Proofreader: Linley Dolby October 2014: First Edition Revision History for the First Edition 2014-09-19: First Release 2016-04-25: Second Release See http://oreilly.com/catalog/errata.csp?isbn=9781491905777 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Getting Started with Impala, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-90577-7 [LSI] Table of Contents Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii 1. Why Impala?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Impala’s Place in the Big Data Ecosystem 1 Flexibility for Your Big Data Workflow 2 High-Performance Analytics 3 Exploratory Business Intelligence 3 2. Getting Up and Running with Impala. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Installation 5 Connecting to Impala 6 Your First Impala Queries 7 3. Impala for the Database Developer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 The SQL Language 11 Standard SQL for Queries 12 Limited DML 12 No Transactions 13 Numbers 13 Recent Additions 14 Big Data Considerations 15 Billions and Billions of Rows 15 HDFS Block Size 17 Parquet Files: The Biggest Blocks of All 17 How Impala Is Like a Data Warehouse 18 Physical and Logical Data Layouts 19 The HDFS Storage Model 19 Distributed Queries 20 iii Normalized and Denormalized Data 22 File Formats 23 Text File Format 23 Parquet File Format 25 Getting File Format Information 26 Switching File Formats 27 Aggregation 27 4. Common Developer Tasks for Impala. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Getting Data into an Impala Table 29 INSERT Statement 30 LOAD DATA Statement 30 External Tables 31 Figuring Out Where Impala Data Resides 31 Manually Loading Data Files into HDFS 32 Hive 32 Sqoop 33 Kite 33 Porting SQL Code to Impala 34 Using Impala from a JDBC or ODBC Application 35 JDBC 35 ODBC 36 Using Impala with a Scripting Language 36 Running Impala SQL Statements from Scripts 36 Variable Substitution 36 Saving Query Results 37 The impyla Package for Python Scripting 37 Optimizing Impala Performance 38 Optimizing Query Performance 39 Optimizing Memory Usage 40 Working with Partitioned Tables 42 Finding the Ideal Granularity 42 Inserting into Partitioned Tables 43 Adding and Loading New Partitions 44 Keeping Statistics Up to Date for Partitioned Tables 45 Writing User-Defined Functions 47 Collaborating with Your Administrators 47 Designing for Security 48 Anticipate Memory Usage 48 Understanding Resource Management 48 Helping to Plan for Performance (Stats, HDFS Caching) 49 Understanding Cluster Topology 50 iv | Table of Contents Always Close Your Queries 51 5. Tutorials and Deep Dives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Tutorial: From Unix Data File to Impala Table 53 Tutorial: Queries Without a Table 55 Tutorial: The Journey of a Billion Rows 57 Generating a Billion Rows of CSV Data 57 Normalizing the Original Data 63 Converting to Parquet Format 67 Making a Partitioned Table 71 Next Steps 75 Deep Dive: Joins and the Role of Statistics 76 Creating a Million-Row Table to Join With 76 Loading Data and Computing Stats 77 Reviewing the EXPLAIN Plan 78 Trying a Real Query 81 The Story So Far 85 Final Join Query with 1B x 1M Rows 86 Anti-Pattern: A Million Little Pieces 86 Tutorial: Across the Fourth Dimension 88 TIMESTAMP Data Type 88 Format Strings for Dates and Times 88 Working with Individual Date and Time Fields 89 Date and Time Arithmetic 90 Let’s Solve the Y2K Problem 91 More Fun with Dates 94 Tutorial: Verbose and Quiet impala-shell Output 95 Tutorial: When Schemas Evolve 96 Numbers Versus Strings 98 Dealing with Out-of-Range Integers 99 Tutorial: Levels of Abstraction 102 String Formatting 102 Temperature Conversion 103 Tutorial: Subqueries 103 Subqueries in the FROM Clause 104 Subqueries in the FROM Clause for Join Queries 104 Subqueries in the WHERE Clause 105 Uncorrelated and Correlated Subqueries 108 Common Table Expressions in the WITH Clause 109 Tutorial: Analytic Functions 111 Analyzing the Numbers 1 Through 10 112 Running Totals and Moving Averages 120 Table of Contents | v Breaking Ties 121 Tutorial: Complex Types 123 ARRAY: A List of Items with Identical Types 125 MAP: A Hash Table or Dictionary with Key-Value Pairs 127 STRUCT: A Row-Like Object for Flexible Typing and Naming 128 Nesting Complex Types to Represent Arbitrary Data Structures 130 Querying Tables with Nested Complex Types 132 Constructing Data for Complex Types 132 vi | Table of Contents Introduction Cloudera Impala is an open source project that opens up the Apache Hadoop soft‐ ware stack to a wide audience of database analysts, users, and developers. The Impala massively parallel processing (MPP) engine makes SQL queries of Hadoop data sim‐ ple enough to be accessible to analysts familiar with SQL and to users of business intelligence tools, and it’s fast enough to be used for interactive exploration and experimentation. From the ground up, the Impala software is written for high performance of SQL queries distributed across clusters of connected machines. Who Is This Book For? This book is intended for a broad audience of users from a variety of database, data warehousing, or Big Data backgrounds. It assumes that you’re experienced enough with SQL not to need explanations for familiar statements such as CREATE TABLE, SELECT, INSERT, and their major clauses. Linux experience is a plus. Experience with the Apache Hadoop software stack is useful but not required. This book points out instances where some aspect of Impala architecture or usage might be new to people who are experienced with databases but not the Apache Hadoop software stack. The SQL examples in this book start from a simple base for easy comprehension, then build toward best practices that demonstrate high performance and scalability. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. vii Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. This style is also used to emphasize the names of SQL statements within paragraphs. Constant width italic Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/oreillymedia/get-started-impala. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a signifi‐ cant amount of example code from this book into your product’s documentation does require permission. viii | Introduction
Description: