
Real-Time Analytics Techniques to Analyze PDF

397 Pages·2016·2.98 MB·English
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Real-Time Analytics Techniques to Analyze

Table of Contents

1. Cover
2. Chapter 1: Introduction to Streaming Data
   a. Sources of Streaming Data
   b. Why Streaming Data Is Different
   c. Infrastructures and Algorithms
   d. Conclusion
3. Part I: Streaming Analytics Architecture
4. Chapter 2: Designing Real-Time Streaming Architectures
   a. Real-Time Architecture Components
   b. Features of a Real-Time Architecture
   c. Languages for Real-Time Programming
   d. A Real-Time Architecture Checklist
   e. Conclusion
5. Chapter 3: Service Configuration and Coordination
   a. Motivation for Configuration and Coordination Systems
   b. Maintaining Distributed State
   c. Apache ZooKeeper
   d. Conclusion
6. Chapter 4: Data-Flow Management in Streaming Analysis
   a. Distributed Data Flows
   b. Apache Kafka: High-Throughput Distributed Messaging
   c. Apache Flume: Distributed Log Collection
   d. Conclusion
7. Chapter 5: Processing Streaming Data
   a. Distributed Streaming Data Processing
   b. Processing Data with Storm
   c. Processing Data with Samza
   d. Conclusion
8. Chapter 6: Storing Streaming Data
   a. Consistent Hashing
   b. “NoSQL” Storage Systems
   c. Other Storage Technologies
   d. Choosing a Technology
   e. Warehousing
   f. Conclusion
9. Part II: Analysis and Visualization
10. Chapter 7: Delivering Streaming Metrics
   a. Streaming Web Applications
   b. Visualizing Data
   c. Mobile Streaming Applications
   d. Conclusion
11. Chapter 8: Exact Aggregation and Delivery
   a. Timed Counting and Summation
   b. Multi-Resolution Time-Series Aggregation
   c. Stochastic Optimization
   d. Delivering Time-Series Data
   e. Conclusion
12. Chapter 9: Statistical Approximation of Streaming Data
   a. Numerical Libraries
   b. Probabilities and Distributions
   c. Working with Distributions
   d. Random Number Generation
   e. Sampling Procedures
   f. Conclusion
13. Chapter 10: Approximating Streaming Data with Sketching
   a. Registers and Hash Functions
   b. Working with Sets
   c. The Bloom Filter
   d. Distinct Value Sketches
   e. The Count-Min Sketch
   f. Other Applications
   g. Conclusion
14. Chapter 11: Beyond Aggregation
   a. Models for Real-Time Data
   b. Forecasting with Models
   c. Monitoring
   d. Real-Time Optimization
   e. Conclusion
15. Introduction
   a. Overview and Organization of This Book
   b. Who Should Read This Book
   c. Tools You Will Need
   d. What's on the Website
   e. Time to Dive In
16. End User License Agreement

List of Illustrations: Figures 4.1 to 4.3, 5.1 to 5.6, 7.1 to 7.13, 8.1 to 8.2, 9.1, and 11.1 to 11.7

List of Tables: Tables 6.1 and 6.2

Chapter 1: Introduction to Streaming Data

It seems like the world moves at a faster pace every day. People and places become more connected, and people and organizations try to react at an ever-increasing pace. As that pace reaches the limits of a human's ability to respond, tools are built to process the vast amounts of data available to decision makers, analyze it, present it, and, in some cases, respond to events as they happen.

The collection and processing of this data has a number of application areas, some of which are discussed in the next section. These applications require an infrastructure and method of analysis specific to streaming data. Fortunately, like batch processing before it, the state of the art of streaming infrastructure is focused on using commodity hardware and software to build its systems rather than the specialized systems required for real-time analysis prior to the Internet era.
This, combined with flexible cloud-based environments, puts the implementation of a real-time system within the reach of nearly any organization. These commodity systems allow organizations to analyze their data in real time and scale that infrastructure to meet future needs as the organization grows and changes over time.

The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. When real-time projects reach a certain point, they should be agile and adaptable systems that can be easily modified, which requires that the users have a fair understanding of the stack as a whole in addition to their own areas of focus. “Real time” applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project have either moved on to other things or simply forgotten why they wanted the data in the first place. Making projects agile and incremental avoids this as much as possible.

This chapter is divided into sections that cover three topics. The first section, “Sources of Streaming Data,” covers some of the common sources and applications of streaming data. They are arranged more or less chronologically and provide some background on the origin of streaming data infrastructures. This is more than historical interest: many of the tools and frameworks presented were developed to solve problems in these spaces, and their design reflects some of the challenges unique to the space in which they were born. Kafka, a data motion tool covered in Chapter 4, “Data-Flow Management in Streaming Analysis,” for example, was developed as a web applications tool, whereas Storm, a processing framework covered in Chapter 5, “Processing Streaming Data,” was developed primarily at Twitter for handling social media data.
The second section, “Why Streaming Data Is Different,” covers three of the important aspects of streaming data: continuous data delivery, loosely structured data, and high-cardinality datasets. The first, of course, is what defines a system as a real-time streaming data environment in the first place. The other two, though not unique to streaming data, present particular challenges to the designer of a streaming data application. All three combine to form the essential streaming data environment. The third section, “Infrastructures and Algorithms,” briefly touches on the significance of how infrastructures and algorithms are used with streaming data.

Sources of Streaming Data

There are a variety of sources of streaming data. This section introduces some of the major categories of data. Although more and more data sources are always being made available, as well as many proprietary data sources, the categories discussed in this section are some of the application areas that have made streaming data interesting. The ordering of the application areas is primarily chronological, and much of the software discussed in this book derives from solving problems in each of these specific application areas.

The data motion systems presented in this book got their start handling data for website analytics and online advertising at places like LinkedIn, Yahoo!, and Facebook. The processing systems were designed to meet the challenges of processing social media data from Twitter and social networks like LinkedIn. Google, whose business is largely related to online advertising, makes heavy use of advanced algorithmic approaches similar to those presented in Chapter 11. Google seems to be especially interested in a technique called deep learning, which makes use of very large-scale neural networks to learn complicated patterns.
These systems are even enabling entirely new areas of data collection and analysis by making the Internet of Things and other highly distributed data collection efforts economically feasible. It is hoped that outlining some of the previous application areas provides some inspiration for as-yet-unforeseen applications of these technologies.

Operational Monitoring

Operational monitoring of physical systems was the original application of streaming data. Originally, this would have been implemented using specialized hardware and software (or even analog and mechanical systems in the pre-computer era). The most common use case of operational monitoring today is tracking the performance of the physical systems that power the Internet. These datacenters house thousands, possibly even tens of thousands, of discrete computer systems. All of these systems continuously record data about their physical state, from the temperature of the processor to the speed of the fan and the voltage draw of their power supplies. They also record information about the state of their disk drives and fundamental metrics of their operation, such as processor load, network activity, and storage access times.

To make the monitoring of all of these systems possible and to identify problems, this data is collected and aggregated in real time through a variety of mechanisms. The first systems tended to be specialized ad hoc mechanisms, but as these techniques spread to other areas, operational monitoring began to use the same collection systems as other kinds of data.

Web Analytics

The introduction of the commercial web, through e-commerce and online advertising, led to the need to track activity on a website. Like the circulation numbers of a newspaper, the number of unique visitors who see a website in a day is important information. For e-commerce sites, the data is less about the number of visitors than about the various products they browse and the correlations between them. To analyze this data, a number of specialized log-processing tools were introduced and marketed.
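
As an illustration of the unique-visitors metric described above, the following is a minimal sketch (not taken from the book) of an exact daily count over a stream of page-view events. The event layout, a `(timestamp, visitor_id)` pair, and the function name `unique_visitors_per_day` are assumptions made for this example only.

```python
from collections import defaultdict
from datetime import datetime, timezone


def unique_visitors_per_day(events):
    """Count distinct visitor IDs per UTC day.

    `events` is any iterable of (unix_timestamp_seconds, visitor_id) pairs;
    this layout is a hypothetical stand-in for a real web log record.
    """
    seen = defaultdict(set)  # ISO date string -> visitor IDs seen that day
    for ts, visitor_id in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        seen[day].add(visitor_id)  # a set ignores repeat visits
    return {day: len(ids) for day, ids in seen.items()}


# Hypothetical events, consumed one at a time as they would arrive.
events = [
    (0, "a"),
    (60, "b"),
    (120, "a"),    # repeat visit on the same day, counted once
    (86400, "a"),  # first event of the next UTC day
]
counts = unique_visitors_per_day(events)
print(counts)
```

Because this approach keeps every visitor ID in memory, its cost grows with the number of distinct visitors. That is one reason the high-cardinality datasets mentioned earlier are challenging, and the table of contents lists distinct value sketches (Chapter 10) as an approximate alternative.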

Description:
Chapter 3: Service Configuration and Coordination ... and people and organizations try to react at an ever-increasing pace ... infrastructure and method of analysis specific to streaming data ... instructions called byte code and then largely disappears from the Java environment, except when it is ...