Contents at a Glance

Preface
About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
■ Chapter 1: “Big Data” in the Enterprise
■ Chapter 2: The New Information Management Paradigm
■ Chapter 3: Big Data Implications for Industry
■ Chapter 4: Emerging Database Landscape
■ Chapter 5: Application Architectures for Big Data and Analytics
■ Chapter 6: Data Modeling Approaches for Big Data and Analytics Solutions
■ Chapter 7: Big Data Analytics Methodology
■ Chapter 8: Extracting Value From Big Data: In-Memory Solutions, Real Time Analytics, And Recommendation Systems
■ Chapter 9: Data Scientist
Index

Introduction

You may be wondering: is this book for me? If you are seeking a textbook on Hadoop, then clearly the answer is no.
This book does not attempt to fully explain the theory and derivation of the various algorithms and techniques behind products such as Hadoop. Some familiarity with Hadoop techniques and related concepts, like NoSQL, is useful in reading this book, but not assumed.

If you are developing, implementing, or managing modern, intelligent applications, then the answer is yes. This book provides a practical rather than a theoretical treatment of big data concepts, along with complete examples and recipes for solutions. It develops some insights gleaned by experienced practitioners in the course of demonstrating how big data analytics can be deployed to solve problems.

If you are a researcher in big data, analytics, and related areas, then the answer is yes. Chances are, your biggest obstacle is translating new concepts into practice. This book provides a few methodologies, frameworks, and collections of patterns from a practical implementation perspective. This book can serve as a reference explaining how you can leverage traditional data warehousing and BI architectures along with big data technologies like Hadoop to develop big data solutions.

If you are client-facing and always in search of bright ideas to help seize business opportunities, then the answer is yes, this book is also for you. Through real-world examples, it will plant ideas about the many ways these techniques can be deployed. It will also help your technical team jump directly to a cost-effective implementation approach that can handle volumes of data previously only realistic for organizations with large technology resources.

Roadmap

This book is broadly divided into three parts, covering concepts and industry-specific use cases, Hadoop and NoSQL technologies, and methodologies and new skills like those of the data scientist.

Part 1 consists of chapters 1 to 3. Chapter 1 introduces big data and its role in the enterprise. This chapter will get you set up for all of the chapters that follow.
Chapter 2 covers the need for a new information management paradigm. It explains why the traditional approaches can’t handle the big data scale and what you need to do about this. Chapter 3 discusses several industry use cases, bringing to life several interesting implementation scenarios.

Part 2 consists of chapters 4 to 6. Chapter 4 presents the technology evolution and explains the reasons for NoSQL databases. Given that background, Chapter 5 presents application architectures for implementing big data and analytics solutions. Chapter 6 then gives you a first look at NoSQL data modeling techniques in a distributed environment.

Part 3 of the book consists of chapters 7 to 9. Chapter 7 presents a methodology for developing and implementing big data and analytics solutions. Chapter 8 discusses several additional technologies like in-memory data grids and in-memory analytics. Chapter 9 presents the need for a new breed of skills (a.k.a. “data scientist”), shows how these skills differ from traditional data warehousing and BI skills, tells you what the key characteristics are, and also covers the importance of data visualization techniques.

Chapter 1: “Big Data” in the Enterprise

Humans have been generating data for thousands of years. More recently we have seen an amazing progression in the amount of data produced, from the advent of mainframes to client-server computing to ERP and now to everything digital. For years the overwhelming amount of data produced was deemed useless. But data has always been an integral part of every enterprise, big or small. As the importance and value of data to an enterprise became evident, so did the proliferation of data silos within an enterprise.
This data was primarily structured, standardized, and heavily governed (either through enterprise-wide programs or through business functions or IT). Typical data volumes were in the range of a few terabytes, and in some cases compliance and regulatory requirements pushed the volumes several notches higher.

Big data is a combination of transactional data and interactive data. While technologies have mastered the art of managing volumes of transaction data, it is the interactive data that is adding variety and velocity characteristics to the ever-growing data reservoir and subsequently poses significant challenges to enterprises.

Irrespective of how data is managed within an enterprise, if it is leveraged properly, it can deliver immense business value. Figure 1-1 illustrates the value cycle of data, from raw data to decision making. In the early 2000s, the acceptance of concepts like Enterprise Data Warehouse (EDW), Business Intelligence (BI), and analytics helped enterprises transform raw data collections into actionable wisdom. Analytics applications such as customer analytics, financial analytics, risk analytics, product analytics, and health-care analytics became an integral part of the business applications architecture of any enterprise. But all of these applications were dealing with only one type of data: structured data.

[Figure 1-1. Transforming raw data into action-guiding wisdom. Labels in the figure: Decision Making, Actionable Insight, Synthesizing, Knowledge, Analyzing, Summarizing, Information, Organizing, Data, Collecting.]

The ubiquity of the Internet has dramatically changed the way enterprises function. Essentially, almost every business became a “digital” business. The result was a data explosion. New application paradigms such as Web 2.0, social media applications, cloud computing, and software-as-a-service applications further contributed to the data explosion.
These new application paradigms added several new dimensions to the very definition of data. Data sources for an enterprise were no longer confined to data stores within the corporate firewalls but extended to what is available outside them. Companies such as LinkedIn, Facebook, Twitter, and Netflix took advantage of these newer data sources to launch innovative product offerings to millions of end users; a new business paradigm of “consumerism” was born.

Data, regardless of type, location, and source, has increasingly become a core business asset for an enterprise and is now categorized into two camps: internal data (enterprise application data) and external data (e.g., web data). With that, a new term has emerged: big data.

So, what is the definition of this all-encompassing arena called “big data”? To start with, the definition of big data usually centers on the 3Vs (exploding data volumes, data generated at high velocity, and data offering more variety); however, if you scan the Internet for a definition of big data, you will find many more interpretations. Another interesting observation is that the 3Vs are not the only consideration: whenever the scale of data poses real challenges to traditional data management principles, it can be considered a big data problem. The heterogeneous nature of big data across multiple platforms and business functions makes it difficult to manage with traditional data management principles, and no single platform or solution has answers to all the questions related to big data. On the other hand, there is still a vast trove of data within the enterprise firewalls that is unused (or underused) because it has historically been too voluminous and/or raw (i.e., minimally structured) to be exploited by conventional information systems, or too costly or complex to integrate and exploit.

Big data is more a concept than a precise term.
Some categorize big data as a volume issue that applies only to petabyte-scale data collections (> one million GB); others associate big data with the variety of data types even if the volume is in terabytes. These interpretations have made big data issues situational.

The pervasiveness of the Internet has pushed generation and usage of data to unprecedented levels. This aspect of digitization has taken on a new meaning. The term “data” is now expanding to cover events captured and stored in the form of text, numbers, graphics, video, images, sound, and signals. Table 1-1 illustrates the measures of scale of data.

Table 1-1. Measuring Big Data

1000 Gigabytes (GB) = 1 Terabyte (TB)
1000 Terabytes = 1 Petabyte (PB)
1000 Petabytes = 1 Exabyte (EB)
1000 Exabytes = 1 Zettabyte (ZB)
1000 Zettabytes = 1 Yottabyte (YB)

Is big data a new problem for enterprises? Not necessarily. Big data has been of concern in a few select industries and scenarios for some time: physical sciences (meteorology, physics), life sciences (genomics, biomedical research), financial institutions (banking, insurance, and capital markets), and government (defense, treasury). For these industries, big data was primarily a data volume problem, and to solve these data-volume-related issues they relied heavily on a mash-up of custom-developed technologies and a set of complex programs to collect and manage the data. But in doing so, these industries and their vendor products generally caused the total cost of ownership (TCO) of the IT infrastructure to rise exponentially every year.

CIOs and CTOs have always grappled with dilemmas like how to lower IT costs to manage the ever-increasing volumes of data, how to build systems that are scalable, how to address performance-related concerns to meet business requirements that are becoming increasingly global in scope and reach, and how to manage data security, privacy, and data-quality-related concerns.
The poly-structured nature of big data has multiplied these concerns: how does an industry effectively utilize poly-structured data (structured data like database content, semi-structured data like log files or XML files, and unstructured content like text documents, web pages, or graphics) in a cost-effective manner?

We have come a long way from the first mainframe era. Over the last few years, technologies have evolved, and now we have solutions that can address some or all of these concerns. Indeed, a second mainframe wave is upon us to capture, analyze, classify, and utilize the massive amount of data that can now be collected. There are many instances where organizations, embracing new methodologies and technologies, effectively leverage these poly-structured data reservoirs to innovate. Some of these innovations are described below:

• Search at scale
• Multimedia content
• Sentiment analysis
• Enriching and contextualizing data
• Data discovery or exploratory analytics
• Operational analytics or embedded analytics

In this chapter, we will briefly discuss these use cases; there are several more such use cases, which will be discussed in later chapters.

Search at Scale

In the early days of the Internet, search was primarily used to page through simple lists of results matching the search objective or keywords. Search as a technology has evolved immensely since then. Concepts like iteratively refining a search request by selecting (or excluding) clusters or categories of results, parametric search and guided navigation, type-ahead query suggestions, automatic spelling correction, and fuzzy matching (matching via synonyms, phonetics, and approximate spelling) have revolutionized the means of searching and navigating large volumes of information.
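As a rough illustration of the approximate-spelling side of fuzzy matching, the following minimal Python sketch uses the standard library's `difflib` to map a misspelled query term to the closest entries in a small vocabulary. The vocabulary, the `suggest` helper, and the cutoff value are assumptions made for this example; real search engines rely on far more elaborate indexes and ranking.

```python
import difflib

# Hypothetical vocabulary of indexed terms (illustrative only).
VOCABULARY = ["hurricane", "flashlight", "battery", "analytics", "warehouse"]


def suggest(term, vocabulary=VOCABULARY, max_results=3, cutoff=0.6):
    """Return vocabulary entries whose spelling approximately matches `term`.

    `cutoff` is a similarity threshold between 0 and 1; candidates scoring
    below it are dropped. The value 0.6 is difflib's default, not a tuned one.
    """
    return difflib.get_close_matches(
        term.lower(), vocabulary, n=max_results, cutoff=cutoff
    )


if __name__ == "__main__":
    print(suggest("huricane"))   # misspelled query term
    print(suggest("analitics"))  # another misspelling
```

Matching via synonyms or phonetics would need additional resources (a thesaurus or a phonetic encoding), which this sketch does not attempt.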
Using natural language processing (NLP) technologies and semantic analysis, it is possible to automatically classify and categorize even big-data-size collections of unstructured content; web search engines like Google, Yahoo!, and Bing are exploiting these advances in technologies today.

Multimedia Content

Multimedia content is fascinating, as it consists of user-generated content like photos, audio files, and videos. From a user perspective this content contains a lot of information: e.g., where the photo was taken, when it was taken, and what the occasion was. But from a technology perspective, all this metadata needs to be manually tagged with the content to make some meaning out of it, which is a daunting task. Analyzing and categorizing images is an area of intense research. Exploiting this type of content at big data scale is a real challenge. Recent technologies like automatic speech-to-text transcription and object-recognition processing (Content-Based Image Retrieval, or CBIR) are enabling us to structure this content in an automated fashion. If these technologies are used in an industrialized fashion, significant impacts could be made in areas like medicine, media, publishing, environmental science, forensics, and digital asset management.

Sentiment Analysis

Sentiment analysis technology is used to automatically discover, extract, and summarize the context behind unstructured content. It helps discover sentiments, opinions, and polarity concerning everything from ideas and issues to people, products, and companies. The most cited use case of sentiment analysis is brand or reputation analysis. The task entails collecting data from select web sources (industry sites, the media, blogs, forums, social networks, etc.), cross-referencing this content with target entities represented in internal systems (services, products, people, programs, etc.), and extracting and summarizing the sentiments expressed in this cross-referenced content.
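The brand-analysis task just described (collect content, cross-reference it with target entities, then extract and summarize sentiment) can be sketched in a few lines of Python. This is a deliberately simplified, lexicon-based sketch and not how commercial sentiment engines work: the word lists, the entity names, and the sample posts are all invented for illustration.

```python
import re
from collections import defaultdict

# Tiny hand-made polarity lexicon (an assumption for this sketch).
POSITIVE = {"great", "love", "fast", "reliable"}
NEGATIVE = {"slow", "broken", "hate", "expensive"}

# Target entities as they might appear in an internal product catalog
# (hypothetical names).
ENTITIES = {"acme phone", "acme tablet"}


def polarity(text):
    """Score text as (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def summarize(posts, entities=ENTITIES):
    """Cross-reference posts with entities and total the polarity per entity."""
    totals = defaultdict(int)
    for post in posts:
        lowered = post.lower()
        for entity in entities:
            if entity in lowered:  # naive substring match stands in for entity linking
                totals[entity] += polarity(post)
    return dict(totals)


if __name__ == "__main__":
    sample_posts = [
        "I love my Acme Phone, it is fast and reliable.",
        "The Acme Tablet is slow and expensive.",
        "Acme Phone screen arrived broken.",
    ]
    print(summarize(sample_posts))
```

A word-count lexicon ignores negation, sarcasm, and context, which is precisely where NLP-based approaches earn their keep; the sketch only shows the shape of the collect, cross-reference, and summarize steps.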
Companies have started leveraging sentiment analysis technology to understand the voice of consumers and take timely actions such as the ones specified below:

• Monitoring and managing public perceptions of an issue, brand, organization, etc. (called reputation monitoring)
• Analyzing reception of a new or revamped service or product
• Anticipating and responding to potential quality, pricing, or compliance issues
• Identifying nascent market growth opportunities and trends in customer demand

Enriching and Contextualizing Data

While it is commonly understood that there is a lot of noise in unstructured data, once you are able to collect, analyze, and organize unstructured data, you can potentially merge and cross-reference it with your enterprise data to further enhance and contextualize your existing structured data. There are already several examples of such initiatives, in which companies have extracted information from high-volume sources like chat, website logs, and social networks to enrich customer profiles in a Customer Relationship Management (CRM) system. Using innovative approaches like Facebook ID and Google ID, several companies have started to capture more details of customers, thereby improving the quality of master data management.

Data Discovery or Exploratory Analytics

Data discovery or exploratory analytics is the process of analyzing data to discover something that had not been previously noticed. It is a type of analytics that requires an open mind and a healthy sense of curiosity to delve deep into data: the paths followed during analysis follow no predetermined pattern, and success is heavily dependent on the analyst’s curiosity as they uncover one intriguing fact and then another, until they arrive at a final conclusion. This process is in stark contrast to conventional analytics and Online Analytical Processing (OLAP) analysis.
In classic OLAP, the questions are pre-defined, with additional options to drill down or drill across to get to the details of the data, but these activities are still confined to finite sets of data and finite sets of questions. Since the activity is primarily to confirm or refute hypotheses, classic OLAP is also sometimes referred to as Confirmatory Data Analysis (CDA).

It is not uncommon for analysts to cross-reference individual and disconnected data sets during exploratory analysis. For example, analysts at Walmart cross-referenced big data collections of weather and sales data and discovered that hurricane warnings trigger sales of not just flashlights and batteries (expected) but also strawberry Pop-Tarts breakfast pastries (not expected). They also found that the top-selling pre-hurricane item is beer (surprise again). It is interesting to note that Walmart chanced upon this discovery not as a result of exploratory analytics (as is often reported) but through conventional analytics.
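The kind of cross-referencing described in the Walmart example can be sketched as a join of two otherwise disconnected data sets on a shared key, followed by a comparison of average sales on days with and without a warning. The figures, product names, and day labels below are made up purely for illustration and are not Walmart data; the `lift_by_product` helper is likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily weather flags, keyed by day (True = hurricane warning).
weather = {"day-1": False, "day-2": False, "day-3": True, "day-4": True}

# Hypothetical daily unit sales as (day, product, units) records.
sales = [
    ("day-1", "flashlight", 10), ("day-1", "toaster pastry", 20),
    ("day-2", "flashlight", 12), ("day-2", "toaster pastry", 22),
    ("day-3", "flashlight", 40), ("day-3", "toaster pastry", 63),
    ("day-4", "flashlight", 48), ("day-4", "toaster pastry", 63),
]


def lift_by_product(sales, weather):
    """Join sales to weather by day and compare warning days with normal days.

    Returns, per product, mean units sold on warning days divided by mean
    units sold on normal days. Products lacking either kind of day are skipped.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for day, product, units in sales:
        if day in weather:  # inner join: ignore sales with no weather record
            buckets[product][weather[day]].append(units)
    return {
        product: mean(days[True]) / mean(days[False])
        for product, days in buckets.items()
        if days[True] and days[False]
    }


if __name__ == "__main__":
    for product, lift in lift_by_product(sales, weather).items():
        print(f"{product}: {lift:.1f}x sales on warning days")
```

A ratio well above 1 for an unexpected product is the sort of signal an analyst would then investigate further; the sketch says nothing about statistical significance.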