Apache Solr: A Practical Approach to Enterprise Search
Dikshant Shahi

Any source code or other supplementary materials referenced by the author in this text are available to readers at www.apress.com. For detailed information about how to locate your book’s source code, go to www.apress.com/source-code/.

ISBN 978-1-4842-1071-0
e-ISBN 978-1-4842-1070-3
DOI 10.1007/978-1-4842-1070-3
© Apress 2015

Managing Director: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Development Editor: Matthew Moodie
Technical Reviewer: Shweta Gupta
Editorial Board: Steve Anglin, Pramilla Balan, Louise Corrigan, James DeWolf, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing
Coordinating Editor: Rita Fernando
Copy Editor: Sharon Wilkey
Compositor: SPi Global
Indexer: SPi Global

For information on translations, please e-mail [email protected], or visit www.apress.com/.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springer.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc. (SSBM Finance Inc.). SSBM Finance Inc. is a Delaware corporation.

To my foster mother, Mrs. Pratima Singh, for educating me!

Introduction

This book is for developers who are building or planning to build an enterprise search engine using Apache Solr. Chapters 1 and 3 can be read by anyone who wants to learn the basics of information retrieval, search engines, and Apache Solr specifically. Chapter 2 kick-starts development with Solr and will prove to be a great resource for Solr newbies and administrators. All other chapters explore Solr features and approaches for developing a practical and effective search engine.

This book covers use cases and examples from various domains such as e-commerce, legal, medical, and music, which will help you understand the need for certain features and how to approach the solution. While discussing the features, the book generally provides a snapshot of the required configuration, the command (using curl) to execute the feature, and a code snippet as required. The book dives into implementation details and writing plug-ins for integrating custom features.

What this book doesn’t cover is performance improvement in Solr and optimizing it for high-speed indexing. This book covers Solr features through release 5.3.1, which is the latest at the time of this writing.

What This Book Covers

Chapter 1, Apache Solr: An Introduction, as the name states, starts with an introduction to Apache Solr and its ecosystem. It then discusses the features, reasons for Solr’s popularity, its building blocks, and other information that will give you a holistic view of Solr. It also introduces related technologies and compares Solr to the alternatives.

Chapter 2, Solr Setup and Administration, begins with Solr fundamentals and covers Solr setup and the steps for indexing your first set of documents and searching them. It then describes the Solr administrative features and various management options.

Chapter 3, Information Retrieval, is dedicated to the concepts of information retrieval, content extraction, and text processing.
Chapter 4, Schema Design and Text Analysis, covers schema design, text analysis, going schemaless, and managed schemas in Solr. It also describes common text-analysis techniques.

Chapter 5, Indexing Data, concentrates on the Solr indexing process by describing the indexing request flow, various indexing tools, supported document formats, and important update request processors. This is also the first chapter that provides the steps to write a Solr plug-in, a custom UpdateRequestProcessor in this case.

Chapter 6, Searching Data, describes the Solr searching process, various query types, important query parsers, supported request parameters, and steps for writing a custom SearchComponent.

Chapter 7, Searching Data: Part 2, continues the previous chapter and covers local parameters, result grouping, statistics, faceting, reranking queries, and joins. It also dives into the details of function queries for deducing a practical relevance ranking and steps for writing your own named function.

Chapter 8, Solr Scoring, explains the Solr scoring process, supported scoring models, the score computation, and steps for customizing similarity.

Chapter 9, Additional Features, explores Solr features including spell-checking, autosuggestion, document similarity, and sponsored search.

Chapter 10, Traditional Scaling and SolrCloud, covers the distributed architectures supported by Solr and steps for setting up SolrCloud, creating a collection, distributed indexing and searching, shard splitting, and ZooKeeper.

Chapter 11, Semantic Search, introduces the concept of semantic search and covers the tools and techniques for integrating semantic capabilities in Solr.

What You Need for This Book

Apache Solr requires Java Runtime Environment (JRE) 1.7 or newer. The provided custom Java code is tested on Java Development Kit (JDK) 1.8 and requires Apache Maven. The last chapter requires downloading resources for Apache OpenNLP and WordNet.
Who This Book Is For

This book expects you to have a basic understanding of the Java programming language, which is essential if you want to execute the custom components.

Acknowledgments

My first vote of thanks goes to my daily dose of caffeine (without which this book would not have been possible), my sister for preparing it, and my wife for teaching me to prepare it myself. Thanks to my parents for their love!

Thank you, Celestin, for providing me the opportunity to write this book; Rita for coordinating the whole process; and Shweta, Matthew, Sharon, and SPi Global for all their help in getting this book to completion. My sincere thanks to everyone else from Apress for believing in me.

I am deeply indebted to everyone I have worked with in my professional journey and everyone who has motivated me and helped me learn and improve, directly or indirectly. A special thanks to my colleagues at The Digital Group for providing the support, flexibility, and occasional work break to complete the book on time.

I would also like to thank all the open source contributors, especially those of Apache Lucene and Solr; without their great work, there would have been no need for this book.

As someone has rightly said, it takes a village to create a book. In creating this book, there is a small village, Sandha, located in the land of Buddha, which I frequented for the tranquility and serenity that helped me focus on writing. Thank you!