
Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL PDF

199 Pages·2016·5.703 MB·English
by Bhushan Lakhe


Practical Hadoop Security

Practical Hadoop Security is an excellent resource for administrators planning a production Hadoop deployment who want to secure their Hadoop clusters. In this detailed guide to the security options and configuration choices within Hadoop, author Bhushan Lakhe takes you through a comprehensive, hands-on study of how to implement defined security within a Hadoop cluster.

You will start with a detailed overview of all the security options available for Hadoop, including popular extensions like Kerberos and OpenSSH, and then delve into how to implement user security. Code samples and workflow diagrams illustrate the process of using both in-the-box features and security extensions from leading vendors.

No security system is complete without a monitoring and tracing facility, so Practical Hadoop Security next steps you through audit logging and monitoring technologies for Hadoop, providing ready-to-use, code-filled implementation and configuration examples.

The book concludes with the most important aspect of Hadoop security: encryption. You’ll learn about encrypting data in transit and at rest with leading open source projects that integrate directly with Hadoop at no licensing cost.
Practical Hadoop Security:

• Explains the importance of security, auditing, and encryption within a Hadoop installation
• Describes how the leading players have incorporated these features within their Hadoop distributions and provided extensions
• Demonstrates how to set up and use these features to your benefit and make your Hadoop installation secure without affecting performance or ease of use

Shelve in: Databases/Data Warehousing
User level: Intermediate–Advanced
ISBN 978-1-4302-6544-3
Source code online: www.apress.com

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them.

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

Part I: Introducing Hadoop and Its Security
  Chapter 1: Understanding Security Concepts
  Chapter 2: Introducing Hadoop
  Chapter 3: Introducing Hadoop Security

Part II: Authenticating and Authorizing Within Your Hadoop Cluster
  Chapter 4: Open Source Authentication in Hadoop
  Chapter 5: Implementing Granular Authorization

Part III: Audit Logging and Security Monitoring
  Chapter 6: Hadoop Logs: Relating and Interpretation
  Chapter 7: Monitoring in Hadoop

Part IV: Encryption for Hadoop
  Chapter 8: Encryption in Hadoop

Part V: Appendices
  Appendix A: Pageant Use and Implementation
  Appendix B: PuTTY and SSH Implementation for Linux-Based Clients
  Appendix C: Setting Up a KeyStore and TrustStore for HTTP Encryption
  Appendix D: Hadoop Metrics and Their Relevance to Security

Index

Introduction

Last year, I was designing security for a client who was looking for a reference book that talked about security implementations in the Hadoop arena, simply so he could avoid known issues and pitfalls. To my chagrin, I couldn’t locate a single book for him that covered the security aspect of Hadoop in detail or provided options for people who were planning to secure their clusters holding sensitive data! I was disappointed and surprised. Everyone planning to secure their Hadoop cluster must have been going through similar frustration. So I decided to put my security design experience to broader use and write the book myself.

As Hadoop gains more corporate support and usage by the day, we all need to recognize and focus on the security aspects of Hadoop. Corporate implementations also involve following regulations and laws for data protection and confidentiality, and such security issues are a driving force for making Hadoop “corporation ready.”

Open-source software usually lacks organized documentation and consensus on performing a particular functional task uniquely, and Hadoop is no different in that regard.
The various distributions that have mushroomed in the last few years vary in their implementation of various Hadoop functions, and some functions, such as authorization or encryption, are not even provided by all the vendor distributions. So, in this way, Hadoop is like Unix of the ’80s or ’90s: Open source development has led to a large number of variations and in some cases deviations from functionality. Because of these variations, devising a common strategy to secure your Hadoop installation is difficult. In this book, I have tried to provide a strategy and solution (an open source solution when possible) that will apply in most cases, but exceptions may exist, especially if you use a Hadoop distribution that’s not well-known.

It’s been a great and exciting journey developing this book, and I deliberately say “developing,” because I believe that authoring a technical book is very similar to working on a software project. There are challenges, rewards, exciting developments, and of course, unforeseen obstacles, not to mention deadlines!

Who This Book Is For

This book is an excellent resource for IT managers planning a production Hadoop environment or Hadoop administrators who want to secure their environment. This book is also for Hadoop developers who wish to implement security in their environments, as well as students who wish to learn about Hadoop security. This book assumes a basic understanding of Hadoop (although the first chapter revisits many basic concepts), Kerberos, relational databases, and Hive, plus an intermediate-level understanding of Linux.

How This Book Is Structured

The book is divided into five parts: Part I, “Introducing Hadoop and Its Security,” contains Chapters 1, 2, and 3; Part II, “Authenticating and Authorizing Within Your Hadoop Cluster,” spans Chapters 4 and 5; Part III, “Audit Logging and Security Monitoring,” houses Chapters 6 and 7; Part IV, “Encryption for Hadoop,” contains Chapter 8; and Part V holds the four appendices.
Here’s a preview of each chapter in more detail:

• Chapter 1, “Understanding Security Concepts,” offers an overview of security, the security engineering framework, security protocols (including Kerberos), and possible security attacks. This chapter also explains how to secure a distributed system and discusses Microsoft SQL Server as an example of a secure system.
• Chapter 2, “Introducing Hadoop,” introduces the Hadoop architecture and Hadoop Distributed File System (HDFS), and explains the security issues inherent to HDFS and why it’s easy to break into an HDFS installation. It also introduces Hadoop’s MapReduce framework and discusses its security shortcomings. Last, it discusses the Hadoop Stack.
• Chapter 3, “Introducing Hadoop Security,” serves as a roadmap to techniques for designing and implementing security for Hadoop. It introduces authentication (using Kerberos) for providing secure access, authorization to specify the level of access, and monitoring for unauthorized access or unforeseen malicious attacks (using tools like Ganglia or Nagios). You’ll also learn the importance of logging all access to Hadoop daemons (using the Log4j logging system) and the importance of data encryption (both in transit and at rest).
• Chapter 4, “Open Source Authentication in Hadoop,” discusses how to secure your Hadoop cluster using open source solutions. It starts by securing a client using PuTTY, then describes the Kerberos architecture and details a Kerberos implementation for Hadoop step by step. In addition, you’ll learn how to secure interprocess communication that uses the RPC (remote procedure call) protocol, how to encrypt HTTP communication, and how to secure the data communication that uses DTP (data transfer protocol).
• Chapter 5, “Implementing Granular Authorization,” starts with ways to determine security needs (based on application) and then examines methods to design fine-grained authorization for applications. Directory- and file-level permissions are demonstrated using a real-world example, and then the same example is re-implemented using HDFS Access Control Lists and Apache Sentry with Hive.
• Chapter 6, “Hadoop Logs: Relating and Interpretation,” discusses the use of logging for security. After a high-level discussion of the Log4j API and how to use it for audit logging, the chapter examines the Log4j logging levels and their purposes. You’ll learn how to correlate Hadoop logs to implement security effectively, and you’ll get a look at Hadoop analytics and a possible implementation using Splunk.
• Chapter 7, “Monitoring in Hadoop,” discusses monitoring for security. It starts by discussing features that a monitoring system needs, with an emphasis on monitoring distributed clusters. Thereafter, it discusses the Hadoop metrics you can use for security purposes and examines the use of Ganglia and Nagios, the two most popular monitoring applications for Hadoop. It concludes by discussing some helpful plug-ins for Ganglia and Nagios that provide security-related functionality and also discusses Ganglia integration with Nagios.
• Chapter 8, “Encryption in Hadoop,” begins with some data encryption basics, discusses popular encryption algorithms and their applications (certificates, keys, hash functions, digital signatures), defines what can be encrypted for a Hadoop cluster, and lists some of the popular vendor options for encryption. A detailed implementation of encryption for HDFS and Hive data at rest follows, showing Intel’s distribution in action. The chapter concludes with a step-by-step implementation of encryption at rest using Elastic MapReduce VM (EMR) from Amazon Web Services.
Downloading the Code

The source code for this book is available in ZIP file format in the Downloads section of the Apress web site (www.apress.com).

Contacting the Author

You can reach Bhushan Lakhe at [email protected] or [email protected].

PART I
Introducing Hadoop and Its Security

CHAPTER 1
Understanding Security Concepts

In today’s technology-driven world, computers have penetrated all walks of life, and more of our personal and corporate data is available electronically than ever. Unfortunately, the same technology that provides so many benefits can also be used for destructive purposes. In recent years, individual hackers, who previously worked mostly for personal gain, have organized into groups working for financial gain, making the threat of personal or corporate data being stolen for unlawful purposes much more serious and real. Malware infests our computers and redirects our browsers to specific advertising web sites depending on our browsing context. Phishing emails entice us to log into web sites that appear real but are designed to steal our passwords. Viruses or direct attacks breach our networks to steal passwords and data. As Big Data, analytics, and machine learning push into the modern enterprise, the opportunities for critical data to be exposed and harm to be done rise exponentially.

If you want to counter these attacks on your personal property (yes, your data is your personal property) or your corporate property, you have to understand thoroughly the threats as well as your own vulnerabilities. Only then can you work toward devising a strategy to secure your data, be it personal or corporate.

Think about a scenario where your bank’s investment division uses Hadoop for analyzing terabytes of data and your bank’s competitor has access to the results.
Or how about a situation where your insurance company decides to stop offering homeowner’s insurance based on Big Data analysis of millions of claims, and their competitor, who has access (by stealth) to this data, finds out that most of the claims used as a basis for analysis were fraudulent? Can you imagine how much these security breaches would cost the affected companies? Unfortunately, only the breaches highlight the importance of security. To its users, a good security setup, be it personal or corporate, is always transparent.

This chapter lays the foundation on which you can begin to build that security strategy. I first define a security engineering framework. Then I discuss some psychological aspects of security (the human factor) and introduce security protocols. Last, I present common potential threats to a program’s security and explain how to counter those threats, offering a detailed example of a secure distributed system. So, to start with, let me introduce you to the concept of security engineering.

Introducing Security Engineering

Security engineering is about designing and implementing systems that do not leak private information and can reliably withstand malicious attacks, errors, or mishaps. As a science, it focuses on the tools, processes, and methods needed to design and implement complete systems and adapt existing systems.

Security engineering requires expertise that spans such dissimilar disciplines as cryptography, computer security, computer networking, economics, applied psychology, and law. Software engineering skills (ranging from business process analysis to implementation and testing) are also necessary, but are relevant mostly for countering error and “mishaps,” not for malicious attacks. Designing systems to counter malice requires specialized skills and, of course, specialized experience.

Security requirements vary from one system to another.
Usually you need a balanced combination of user authentication, authorization, policy definition, auditing, integral transactions, fault tolerance, encryption, and isolation. A lot of systems fail because their designers focus on the wrong things, omit some of these factors, or focus on the right things but do so inadequately. Securing Big Data systems with many components and interfaces is particularly challenging. A traditional database has one catalog and one interface: SQL connections. A Hadoop system has many “catalogs” and many interfaces (Hadoop Distributed File System or HDFS, Hive, HBase). This increased complexity, along with the varied and voluminous data in such a system, introduces many challenges for security engineers.

Securing a system thus depends on several types of processes. To start with, you need to determine your security requirements and then how to implement them. Also, you have to remember that secure systems have a very important component in addition to their technical components: the human factor! That’s why you have to make sure that people who are in charge of protecting the system and maintaining it are properly motivated. In the next section, I define a framework for considering all these factors.

Security Engineering Framework

Good security engineering relies on the following five factors to be considered while conceptualizing a system:

• Strategy: Your strategy revolves around your objective. A specific objective is a good starting point to define authentication, authorization, integral transactions, fault tolerance, encryption, and isolation for your system. You also need to consider and account for possible error conditions or malicious attack scenarios.
• Implementation: Implementation of your strategy involves procuring the necessary hardware and software components, designing and developing a system that satisfies all your objectives, defining access controls, and thoroughly testing your system to match your strategy.
• Reliability: Reliability is the amount of reliance you can place on each of your system components and on your system as a whole. Reliability is measured against failure as well as malfunction.
• Relevance: Relevance decides the ability of a system to counter the latest threats. For a system to remain relevant, especially a security system, it is also extremely important to update it periodically so that it can counter new threats as they arise.
• Motivation: Motivation relates to the drive or dedication that the people responsible for managing and maintaining your system have for doing their job properly, and also refers to the lure for the attackers to try to defeat your strategy.

Figure 1-1 illustrates how these five factors interact.

[Figure 1-1. Five factors to consider before designing a security framework: Strategy, Implementation, Reliability, Relevance, Motivation]
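
The chapter notes that many systems fail because their designers omit some of the controls a strategy should define. As a rough illustration only, that idea can be sketched as a simple gap check in Python. The class name SecurityDesign, the identifier spellings, and the sample cluster name are assumptions made for this sketch and do not come from the book or from any Hadoop API.

```python
from dataclasses import dataclass, field

# Controls named in the chapter text; the identifier spellings are illustrative.
REQUIRED_CONTROLS = frozenset({
    "authentication",
    "authorization",
    "policy_definition",
    "auditing",
    "integral_transactions",
    "fault_tolerance",
    "encryption",
    "isolation",
})


@dataclass
class SecurityDesign:
    """Hypothetical record of which controls a system design addresses."""
    name: str
    controls: set = field(default_factory=set)

    def missing_controls(self):
        """Return the required controls this design omits, sorted by name."""
        return sorted(REQUIRED_CONTROLS - self.controls)


# Example: a design that covers only three of the eight controls.
design = SecurityDesign(
    name="example-cluster",
    controls={"authentication", "authorization", "encryption"},
)
print(design.missing_controls())
```

A check like this only flags controls that are absent from a design. It cannot tell whether a listed control is implemented adequately, which the text identifies as a separate cause of failure.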
