MEAP Edition
Manning Early Access Program
Elasticsearch in Action, Second Edition
Version 7

Copyright 2022 Manning Publications
For more information on this and other Manning titles go to manning.com

©Manning Publications Co. To comment go to liveBook

welcome

I am absolutely delighted and thankful for your purchase of this MEAP of Elasticsearch in Action, Second Edition. Search in the new normal, the current digital world, is omnipresent. Elasticsearch is a popular, multi-faceted search engine with a ton of advanced features and many bells and whistles. A humongous task was bestowed upon me: to create a new In Action book on a complicated technology that changes at a rapid rate.

As with any new technology, mastery comes with a steep learning curve. Elasticsearch, an enterprise distributed search and analytics engine, is no exception. Its countless advanced features, including multi-language analyzers, geospatial and time-series storage, anomaly detection using machine learning algorithms, graph analytics, auto-complete and fuzzy searching, root cause analysis using log stores, rich visualizations of data, complex aggregations, and many more, make Elasticsearch a must-have tool for most organizations and businesses.

Architecting, designing, and developing an enterprise search engine using Elasticsearch requires a solid understanding of the product. It requires a thorough deep dive into the search engine, its architecture, its APIs, and its moving parts. You need someone to hold your hand, explain the concepts and fundamentals, walk you through the examples, and dissect the complex topics when learning a new technology, especially one like Elasticsearch that seems simple but is a complex beast under its shiny surface. And this is exactly my aim!

This book is for anyone who is interested in learning Elasticsearch in a practical, tried-and-tested manner.
It delves into the high-level architecture, explains the dichotomy of the product, walks through the search and analytics capabilities using various APIs, provides infrastructure patterns and low-level code snippets, and much more. This book is a must-have for developers, DevOps engineers, and architects working in the search space.

My goal is to make this book easy to read, with tons of hands-on examples. To make the most of it, please check out the accompanying working examples on GitHub (https://github.com/madhusudhankonda/elasticsearch-in-action/wiki). You can spin up a server and try out the examples while reading the explanations in detail. Each of the chapters on GitHub will also have some exercises to practice. Beginner and intermediate level engineers should find the book a page-turner and can try out the code snippets to solidify their understanding of the technology with ease and comfort.

Please be sure to post any questions, comments, or suggestions you have in the liveBook discussion forum. I sincerely hope you enjoy reading this book as much as I’m enjoying writing it!

Madhusudhan Konda

brief contents

1 Overview
2 Getting started
3 Architecture
4 Mapping
5 Working with documents
6 Indexing operations
7 Text analysis
8 Introducing search
9 Term-level search
10 Advanced search
11 Aggregations
12 Kibana visualisations
13 Machine learning
14 Time series analytics
15 Performance tuning
16 Administration

APPENDIXES
A Installation
B Native clients
C Hosting
E Elasticsearch vs Solr

1 Overview

This chapter covers
• Setting the scene for modern search engines
• Introducing Elasticsearch, a search and analytics engine
• Looking at its functional and technical features
• Understanding Elasticsearch’s core areas, use cases, and its prominent features
• Getting an overview of the Elastic Stack: Beats, Logstash, Elasticsearch, and Kibana

We expect Google to return our search results in a flash. Further, we want the results to be relevant, meaningful, and exactly what we are looking for. When we start typing in that simple-looking search bar, we expect Google to autocomplete our search, correct our spelling, and suggest some keywords. While we may take it for granted, the mechanics, science, and brains behind that tiny search bar are unfathomable. Most of us are pretty happy navigating to one of the top links on the first page. In fact, research suggests that 80% of people won’t even bother to visit the second page. The results are usually so close to what we want that we don’t think twice about trusting them.

Search engines have benefited from years of talented brains working on sophisticated algorithms and powerful data structures. Advanced processors and other hardware have also made incredible strides. While the results can appear in a fraction of a second, a ton of complex work is performed behind the scenes, including complicated search and matching algorithms, ranking schemes, and abundant computing resources. The good thing is that we, the end users, do not need to be bothered with the mechanics of how these search engines work behind the scenes.

In this chapter, we will look at the search space in general and skim over the evolution of search from a traditional database-backed search to the current modern search engines and their many convenient features. Along the way, we will introduce ourselves to Elasticsearch, the ultra-fast open source search engine.
We will look at its features, use cases, and customer adoption.

1.1 What makes a good search engine?

Recently, my family adopted a puppy, Milly (that’s her, pictured!). As we are first-time owners of a pup, we started searching for dog food online. I browsed through my preferred (popular) grocer, but to my disappointment, the search results (see figure 1.1) were not appealing. You can see “poop bags” in the list of results (ranked at the top) among other non-relevant products. There weren’t any filters, dropdowns, or price range knobs; it was just a plain page with the search results.

Figure 1.1 A search for dog food showing irrelevant results from a supermarket.

Another grocer showed a pet harness and a baby’s lunch box (see figure 1.2):

Figure 1.2 Another search for dog food and another supermarket showing absurd results.

The search engines behind these supermarkets didn’t provide me with suggestions while I typed, nor did they correct my input when I misspelled dog food as dig food. Some of the results weren’t ranked by relevance (though this can be forgiven as not all search engines are expected to produce relevant results). And one grocer brought back 2400 results!

Not pleased with the results from the grocers, I headed over to Amazon’s online shop. The second I started typing dog, I could see a dropdown with all the possible suggestions (figure 1.3).

Figure 1.3 Amazon’s search engine providing suggestions for my search.

The initial search on Amazon brings back relevant results, ordered by default using the Sort by Featured option. Conveniently, we can also change the sort order (price low-to-high or vice versa, and so on). Once the initial search is carried out, we can also drill down into other categories by selecting the department, average customer review, price range, and so on.
I even tried the wrong spelling: dig food. Amazon asked me if I meant dog food, as figure 1.4 shows.

Figure 1.4 Did-you-mean feature from a modern search engine.

In the current digital world, search is undoubtedly a first-class citizen, which organizations are adopting without a second thought. They understand the business value and the varied use cases that a search engine offers. In the next section, we will explore the exponential growth of search engines and how technology has enabled cutting-edge search solutions.

1.2 Search is the new normal

With the exponential growth of data (terabytes to petabytes to exabytes), the need for a tool that enables successful searches in needle-in-a-haystack situations is imperative. What was once touted as a simple search is now becoming a necessary part of most organizations’ survival toolkit. Organizations are expected to provide a search function by default so their customers can click in a search bar or navigate through a multitude of search drill-downs to find what they need in a jiffy.

It is increasingly difficult to find websites and applications without the humble magnifying-glass search bar in one form or another. Providing a full-fledged search with all the advanced functionality is a competitive advantage. Basic searches used to be supported by plain old databases. Today, modern engines provide the advanced functionality required to be competitive.

When dealing with search engines, you will often come across two variants of data, structured and unstructured, and their respective searches. Let’s look at them briefly in the next section.

1.2.1 Structured vs unstructured (full-text) search

Data predominantly comes in two flavours: structured and unstructured. Structured data, such as dates, numbers, and booleans, is highly organized, has a definite shape, fits predefined data types, and hence is easily searchable.
Unstructured data, such as blog posts, social media, audio/video clips, and articles, has no predefined model and is not easily searchable. Unstructured data is the bread and butter of most modern search engines.

Queries against structured data return exact matches. That is, we are concerned with finding the documents that match the search criteria, but not with how well they match. These search results are binary: a document either matches or it doesn’t. As we check only whether the documents match and not how well they match, no relevancy scores are attached to these results. Traditional database search is mostly of this sort. Examples include fetching all the flights that were cancelled in the last month, the weekly bestseller books, or the activity of a logged-in user.

On the other hand, full-text (unstructured) queries try to find results that are relevant to the query. That is, Elasticsearch will find all the documents that are most suited for the query. For example, if a user searches for the keyword “vaccine”, the search engine will not only retrieve documents related to vaccination, it could also throw in relevant documents about inoculations, jabs, and anything else related to “vaccine”. Elasticsearch employs a similarity algorithm to generate a relevance score for full-text queries. The score is a positive floating-point number attached to each result, with the highest-scored document being the most relevant to the query criteria.

1.2.2 Search supported by a database

A “good old” search was mostly based on traditional relational databases. Older search engines might have been based on layered architectures implemented in multi-tiered applications. Queries written in SQL, using clauses such as WHERE and LIKE, probably provided the foundation for search.
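To make this concrete, here is a minimal sketch (not taken from the book’s examples) of what such a database-backed search might look like, using Python’s built-in sqlite3 module. The products table, its rows, and the like_search helper are hypothetical names made up purely for illustration.

```python
import sqlite3


def like_search(conn, term):
    """Return product titles containing `term`, using SQL WHERE ... LIKE."""
    pattern = f"%{term}%"  # substring match anywhere in the title
    rows = conn.execute(
        "SELECT title FROM products WHERE title LIKE ? ORDER BY id",
        (pattern,),
    ).fetchall()
    return [title for (title,) in rows]


# Hypothetical in-memory catalogue, assumed only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO products (title) VALUES (?)",
    [
        ("Dry dog food, chicken, 2kg",),
        ("Dog poop bags, 100 pack",),
        ("Puppy food pouches",),
        ("Baby lunch box",),
    ],
)

print(like_search(conn, "dog food"))  # only the exact substring matches
print(like_search(conn, "dog"))       # poop bags match as well, unranked
print(like_search(conn, "dig food"))  # a typo returns nothing
```

In this sketch, each row either matches the pattern or it doesn’t. There is no score to rank by, the misspelled “dig food” returns an empty list, and “Puppy food pouches” is missed for “dog food” because the exact substring isn’t present.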
Such database-backed solutions are not necessarily performant or efficient at searching through full text to provide modern search functionality.

While databases support full-text searching (queries against free text such as a blog post, movie review, or research article), they may struggle to provide efficient search in near real time under heavy load. The distributed nature of a search engine like Elasticsearch allows it to scale out in a way that most databases may not have been designed for. A search engine developed with a backing database (without the database’s full-text search capabilities enabled) may not be able to provide relevant search results for the queries, let alone cope with growth in data volume and serve results in real time.

FULL-TEXT SEARCHING WITH DATABASES: Relational databases like Oracle and MySQL support full-text search, albeit with fewer features than a modern full-text search engine like Elasticsearch. The two are fundamentally different in how they store and retrieve data, so make sure to prototype your requirements before jumping either way. Usually, if the schemas are not going to change or data loads are low, and you already have a database engine with full-text search capabilities, starting with full-text search on the database might make sense.

1.2.3 Full-text search engines

To satisfy full-text search requirements along with other advanced functions, search engines came into existence with an approach different from that of a relational database query. In a search engine, the data undergoes an analysis phase before it gets stored for later retrieval. This upfront analysis helps answer queries with ease. Elasticsearch is one such modern full-text search engine.

Shay Banon, founder of Elastic, developed a search product called Compass (http://www.compass-project.org) in the early 2000s. It was based on an open source search engine library called Apache Lucene (https://lucene.apache.org).
Lucene is Doug Cutting’s full-text search library written in Java. Because it’s a library, one needs to import it and integrate it with an application using its APIs. Compass and other search engines use Lucene to provide a generalized search engine service, so that you don’t have to integrate Lucene from scratch into your own application. Shay eventually decided to abandon Compass and focus on Elasticsearch, because it had more potential.

APACHE SOLR: Apache Solr is an open source search engine built on Apache Lucene in 2004. A strong competitor to Elasticsearch, Solr has a thriving user community and, I must say, is closer to open source than Elasticsearch (Elastic moved from the Apache licence to the Elastic Licence and Server Side Public Licence (SSPL) in early 2021). Both Solr and Elasticsearch excel in full-text searching; however, Elasticsearch may have an edge when it comes to analytics. While the two products compete in almost all functionality from head to toe, Solr has been a favourite for large static datasets working in a big data ecosystem. Obviously, one has to run through prototypes and analysis to pick a product, but the general trend is that most projects looking to integrate with a search engine for the first time may lean towards Elasticsearch due to its relatively hurdle-free startup. One must carry out a detailed comparison for the intended use cases before adopting and embracing either search engine.

The burst of big data coupled with technological innovations paved the way to modern search engines. Modern search engines are highly effective and better suited to searching and providing relevance. Regardless of whether the data is structured, semi-structured, or unstructured (more than three quarters of the data is assumed to be unstructured), modern search engines can easily query it and return the results with relevance.
As search becomes the new normal, modern search engines are trying hard to meet the ever-growing requirements by donning new hats each day. Cheap hardware combined with the explosion of data is leading to the emergence of these modern search beasts. Let’s consider these present-day search engines in further detail in the next section.

OPENSEARCH: Elastic changed its licensing policy in 2021; the change applies to Elasticsearch release versions 7.11 and above. The licensing moved from open source to a dual licence under the Elastic Licence and the Server Side Public Licence (SSPL). This licence allows the community to use the product for free, as expected. Managed service providers, however, cannot provide the products as services anymore. (There is a spat between Elastic and AWS over this move, with AWS creating a forked version of Elasticsearch, called Open Distro for Elasticsearch, and offering it as a managed service.) As Elastic moved from the open source licensing model to the SSPL model, a new product called OpenSearch (https://opensearch.org) was born to fill the gaping hole left by the new licensing agreement. The base code for OpenSearch was created from the open source Elasticsearch and Kibana version 7.10.2. The product’s first GA version, 1.0, was released in July 2021. Watch out for OpenSearch becoming a competitor to Elasticsearch in the search engine space.