Learning Apache Drill QUERY AND ANALYZE DISTRIBUTED DATA SOURCES WITH SQL Charles Givre & Paul Rogers Learning Apache Drill Query and Analyze Distributed Data Sources with SQL Charles Givre and Paul Rogers BBeeiijjiinngg BBoossttoonn FFaarrnnhhaamm SSeebbaassttooppooll TTookkyyoo Learning Apache Drill by Charles Givre and Paul Rogers Copyright © 2019 Charles Givre and Paul Rogers. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐ tutional sales department: 800-998-9938 or [email protected]. Acquisitions Editor: Rachel Roumeliotis Indexer: Ellen Troutman-Zaig Development Editor: Jeff Bleiel Interior Designer: David Futato Production Editor: Melanie Yarbrough Cover Designer: Karen Montgomery Copyeditor: Octal Publishing, LLC Illustrator: Rebecca Demarest Proofreader: Rachel Head October 2018: First Edition Revision History for the First Edition 2018-10-29: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492032793 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Learning Apache Drill, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-492-03279-3 [LSI] Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1. Introduction to Apache Drill. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 What Is Apache Drill? 2 Drill Is Versatile 2 Drill Is Easy to Use 3 A Word About Drill’s Performance 4 A Very Brief History of Big Data 5 Drill in the Big Data Ecosystem 6 Comparing Drill with Similar Tools 7 2. Installing and Running Drill. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Preparing Your Machine for Drill 10 Special Configuration Instructions for Windows Installations 10 Installing Drill on Windows 12 Starting Drill on a Windows Machine 12 Installing Drill in Embedded Mode on macOS or Linux 13 Starting Drill on macOS or Linux in Embedded Mode 13 Installing Drill in Distributed Mode on macOS or Linux 14 Preparing Your Cluster for Drill 15 Starting Drill in Distributed Mode 16 Connecting to the Cluster 16 Conclusion 16 3. Overview of Apache Drill. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 The Apache Hadoop Ecosystem 17 Drill Is a Low-Latency Query Engine 18 Distributed Processing with HDFS 18 iii Elements of a Drill System 19 Drill Operation: The 30,000-Foot View 20 Drill Is a Query Engine, Not a Database 20 Drill Operation Overview 21 Drill Components 21 SQL Session State 22 Statement Preparation 22 Statement Execution 26 Low-Latency Features 28 Conclusion 30 4. Querying Delimited Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Ways of Querying Data with Drill 33 Other Interfaces 34 Drill SQL Query Format 34 Choosing a Data Source 35 Defining a Workspace 36 Specifying a Default Data Source 37 Accessing Columns in a Query 39 Delimited Data with Column Headers 41 Table Functions 42 Querying Directories 43 Understanding Drill Data Types 44 Cleaning and Preparing Data Using String Manipulation Functions 46 Complex Data Conversion Functions 48 Working with Dates and Times in Drill 49 Converting Strings to Dates 50 Reformatting Dates 51 Date Arithmetic and Manipulation 51 Date and Time Functions in Drill 52 Creating Views 53 Data Analysis Using Drill 54 Summarizing Data with Aggregate Functions 55 Common Problems in Querying Delimited Data 62 Spaces in Column Names 62 Illegal Characters in Column Headers 63 Reserved Words in Column Names 63 Conclusion 64 5. Analyzing Complex and Nested Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Arrays and Maps 65 Arrays in Drill 66 iv | Table of Contents Accessing Maps (Key–Value Pairs) in Drill 68 Querying Nested Data 69 Analyzing Log Files with Drill 77 Configuring Drill to Read HTTPD Web Server Logs 77 Querying Web Server Logs 78 Other Log Analysis with Drill 82 Conclusion 85 6. Connecting Drill to Data Sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Querying Multiple Data Sources 88 Configuring a New Storage Plug-in 88 Connecting Drill to a Relational Database 89 Querying Data in Hadoop from Drill 93 Connecting to and Querying HBase from Drill 93 Querying Hive Data from Drill 95 Connecting to and Querying Streaming Data with Drill and Kafka 97 Connecting to and Querying Kudu 99 Connecting to and Querying MongoDB from Drill 100 Connecting Drill to Cloud Storage 100 Querying Time Series Data from Drill and OpenTSDB 104 Conclusion 106 7. Connecting to Drill. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Understanding Drill’s Interfaces 107 JDBC and Drill 108 ODBC and Drill 109 Drill’s REST Interface 111 Connecting to Drill with Python 112 Using drillpy to Query Drill 112 Connecting to Drill Using pydrill 113 Other Ways of Connecting to Drill from Python 114 Connecting to Drill Using R 116 Querying Drill from R Using sergeant 116 Connecting to Drill Using Java 118 Querying Drill with PHP 119 Using the Connector 119 Querying Drill from PHP 119 Interacting with Drill from PHP 120 Querying Drill Using Node.js 121 Using Drill as a Data Source in BI Tools 121 Exploring Data with Apache Zeppelin and Drill 121 Exploring Data with Apache Superset 127 Table of Contents | v Conclusion 132 8. Data Engineering with Drill. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Schema-on-Read 133 The SQL Relational Model 133 Data Life Cycle: Data Exploration to Production 134 Schema Inference 135 Data Source Inference 136 Storage Plug-ins 136 Storage Configurations 136 Workspaces 137 Querying Directories 139 Default Schema 139 File Type Inference 140 Format Plug-ins and Format Configuration 140 Format Inference 141 File Format Variations 141 Schema Inference Overview 142 Distributed File Scans 144 Schema Inference for Delimited Data 146 CSV Summary 151 Schema Inference for JSON 153 Ambiguous Numeric Schemas 155 Aligning Schemas Across Files 160 JSON Objects 162 JSON Lists in Drill 164 JSON Summary 167 Using Drill with the Parquet File Format 168 Schema Evolution in Parquet 169 Partitioning Data Directories 169 Defining a Table Workspace 172 Working with Queries in Production 173 Capturing Schema Mapping in Views 173 Running Challenging Queries in Scripts 173 Conclusion 174 9. Deploying Drill in Production. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Installing Drill 175 Prerequisites 176 Production Installation 176 Configuring ZooKeeper 178 Configuring Memory 179 vi | Table of Contents Configuring Logging 180 Testing the Installation 181 Distributing Drill Binaries and Configuration 182 Starting the Drill Cluster 183 Configuring Storage 184 Working with Apache Hadoop HDFS 184 Working with Amazon S3 185 Admission Control 188 Additional Configuration 189 User-Defined Functions and Custom Plug-ins 189 Security 190 Logging Levels 190 Controlling CPU Usage 191 Monitoring 193 Monitoring the Drill Process 193 Monitoring JMX Metrics 194 Monitoring Queries 194 Other Deployment Options 194 MapR Installer 195 Drill-on-YARN 195 Docker 195 Conclusion 195 10. Setting Up Your Development Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 Installing Maven 197 Creating the Drill Build Environment 198 Setting Up Git and Getting the Source Code 198 Building Drill from Source 199 Installing the IDE 199 Conclusion 200 11. Writing Drill User-Defined Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Use Case: Finding and Filtering Valid Credit Card Numbers 201 How User-Defined Functions Work in Drill 202 Structure of a Simple Drill UDF 203 The pom.xml File 203 The Function File 205 The Simple Function API 209 Putting It All Together 209 Building and Installing Your UDF 211 Statically Installing a UDF 211 Dynamically Installing a UDF 211 Table of Contents | vii Complex Functions: UDFs That Return Maps or Arrays 212 Example: Extracting User Agent Metadata 213 The ComplexWriter 213 Writing Aggregate User-Defined Functions 215 The Aggregate Function API 216 Example Aggregate UDF: Kendall’s Rank Correlation Coefficient 217 Conclusion 219 12. Writing a Format Plug-in. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 The Example Regex Format Plug-in 221 Creating the “Easy” Format Plug-in 222 Creating the Maven pom.xml File 223 Creating the Plug-in Package 225 Drill Module Configuration 225 Format Plug-in Configuration 226 Cautions Before Getting Started 226 Creating the Regex Plug-in Configuration Class 227 Copyright Headers and Code Format 228 Testing the Configuration 228 Fixing Configuration Problems 229 Troubleshooting 230 Creating the Format Plug-in Class 230 Creating a Test File 232 Configuring RAT 233 Efficient Debugging 233 Creating the Unit Test 234 How Drill Finds Your Plug-in 235 The Record Reader 236 Testing the Reader Shell 238 Logging 239 Error Handling 239 Setup 240 Regex Parsing 240 Defining Column Names 241 Projection 241 Column Projection Accounting 242 Project None 243 Project All 243 Project Some 243 Opening the File 245 Record Batches 246 Drill’s Columnar Structure 246 viii | Table of Contents Defining Vectors 247 Reading Data 247 Loading Data into Vectors 248 Releasing Resources 249 Testing the Reader 250 Testing the Wildcard Case 250 Testing Explicit Projection 251 Testing Empty Projection 251 Scaling Up 251 Additional Details 253 File Chunks 253 Default Format Configuration 253 Next Steps 254 Production Build 255 Contributing to Drill: The Pull Request 255 Maintaining Your Branch 255 Create a Plug-In Project 256 Conclusion 257 13. Unique Uses of Drill. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 Finding Photos Taken Within a Geographic Region 259 Drilling Excel Files 260 The pom.xml File 261 The Excel Custom Record Reader 262 Using the Excel Format Plug-in 266 Network Packet Analysis (PCAP) with Drill 266 Examples of Queries Using PCAP Data Files 267 Analyzing Twitter Data with Drill 271 Using Drill in a Machine Learning Pipeline 272 Making Predictions Within Drill 272 Building and Serializing a Model 272 Writing the UDF Wrapper 273 Making Predictions Using the UDF 275 Conclusion 276 A. List of Drill Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 B. Drill Formatting Strings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Table of Contents | ix
Description: