For online information and ordering of this and other Manning books, please visit www.manning.com.
The publisher offers discounts on this book when ordered in quantity. For more information, please
contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com
©2014 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form
or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of
the publisher.
Photographs in this book were created by Martin Evans and Jordan Hochenbaum, unless otherwise
noted. Illustrations were created by Martin Evans, Joshua Noble, and Jordan Hochenbaum. Fritzing
(fritzing.org) was used to create some of the circuit diagrams.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed
as trademarks. Where those designations appear in the book, and Manning Publications was aware of
a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the
books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing
also our responsibility to conserve the resources of our planet, Manning books are printed on paper
that is at least 15 percent recycled and processed without the use of elemental chlorine.
Development editors: Elizabeth Lexleigh, Susan Conant
Copyeditor: Melinda Rankin
Proofreader: Elizabeth Martin
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor
ISBN: 9781617291029
Printed in the United States of America 1 2 3 4 5 6 7 8 9 10 – MAL – 19 18 17 16 15 14
Table of Contents
Foreword
Preface
Acknowledgments
  Trey Grainger
  Timothy Potter
About this Book
  Roadmap
  How to use this book
  Code conventions and downloads
  Author Online
  About the cover illustration
Part 1. Meet Solr
Chapter 1. Introduction to Solr
  1.1. Why do I need a search engine?
  1.2. What is Solr?
  1.3. Why Solr?
  1.4. Features overview
  1.5. Summary
Chapter 2. Getting to know Solr
  2.1. Getting started
  2.2. Searching is what it’s all about
  2.3. Tour of the Solr administration console
  2.4. Adapting the example to your needs
  2.5. Summary
Chapter 3. Key Solr concepts
  3.1. Searching, matching, and finding content
  3.2. Relevancy
  3.3. Precision and Recall
  3.4. Searching at scale
  3.5. Summary
Chapter 4. Configuring Solr
  4.1. Overview of solrconfig.xml
  4.2. Query request handling
  4.3. Managing searchers
  4.4. Cache management
  4.5. Remaining configuration options
  4.6. Summary
Chapter 5. Indexing
  5.1. Example microblog search application
  5.2. Designing your schema
  5.3. Defining fields in schema.xml
  5.4. Field types for structured nontext fields
  5.5. Sending documents to Solr for indexing
  5.6. Update handler
  5.7. Index management
  5.8. Summary
Chapter 6. Text analysis
  6.1. Analyzing microblog text
  6.2. Basic text analysis
  6.3. Defining a custom field type for microblog text
  6.4. Advanced text analysis
  6.5. Summary
Part 2. Core Solr capabilities
Chapter 7. Performing queries and handling results
  7.1. The anatomy of a Solr request
  7.2. Working with query parsers
  7.3. Queries and filters
  7.4. The default query parser (Lucene query parser)
  7.5. Handling user queries (eDisMax query parser)
  7.6. Other useful query parsers
  7.7. Returning results
  7.8. Sorting results
  7.9. Debugging query results
  7.10. Summary
Chapter 8. Faceted search
  8.1. Navigating your content at a glance
  8.2. Setting up test data
  8.3. Field faceting
  8.4. Query faceting
  8.5. Range faceting
  8.6. Filtering upon faceted values
  8.7. Multiselect faceting, keys, and tags
  8.8. Beyond the basics
  8.9. Summary
Chapter 9. Hit highlighting
  9.1. Overview of hit highlighting
  9.2. How highlighting works
  9.3. Improving performance using FastVectorHighlighter
  9.4. PostingsHighlighter
  9.5. Summary
Chapter 10. Query suggestions
  10.1. Spell-check
  10.2. Autosuggesting query terms
  10.3. Suggesting document field values
  10.4. Suggesting queries based on user activity
  10.5. Summary
Chapter 11. Result grouping/field collapsing
  11.1. Result grouping vs. field collapsing
  11.2. Skipping duplicate documents
  11.3. Returning multiple documents per group
  11.4. Grouping by functions and queries
  11.5. Paging and sorting grouped results
  11.6. Grouping gotchas
  11.7. Efficient field collapsing with the Collapsing query parser
  11.8. Summary
Chapter 12. Taking Solr to production
  12.1. Developing a Solr distribution
  12.2. Deploying Solr
  12.3. Hardware and server configuration
  12.4. Data acquisition strategies
  12.5. Sharding and replication
  12.6. Solr core management
  12.7. Managing clusters of servers
  12.8. Querying and interacting with Solr
  12.9. Monitoring Solr’s performance
  12.10. Upgrading between Solr versions
  12.11. Summary
Part 3. Taking Solr to the next level
Chapter 13. SolrCloud
  13.1. Getting started with SolrCloud
  13.2. Core concepts
  13.3. Distributed indexing
  13.4. Distributed search
  13.5. Collections API
  13.6. Basic system-administration tasks
  13.7. Advanced topics
  13.8. Summary
Chapter 14. Multilingual search
  14.1. Why linguistic analysis matters
  14.2. Stemming vs. lemmatization
  14.3. Stemming in action
  14.4. Handling edge cases
  14.5. Available language libraries in Solr
  14.6. Searching content in multiple languages
  14.7. Language identification
  14.8. Summary
Chapter 15. Complex query operations
  15.1. Function queries
  15.2. Geospatial search
  15.3. Pivot faceting
  15.4. Referencing external data
  15.5. Cross-document and cross-index joins
  15.6. Big data analytics with Solr
  15.7. Summary
Chapter 16. Mastering relevancy
  16.1. The impact of relevancy tuning
  16.2. Debugging the relevancy calculation
  16.3. Relevancy boosting
  16.4. Pluggable Similarity class implementations
  16.5. Personalized search and recommendations
  16.6. Creating a personalized search experience
  16.7. Running relevancy experiments
  16.8. Summary
Appendix A. Working with the Solr codebase
  A.1. Pulling the right version of Solr
  A.2. Setting up Solr in your IDE
  A.3. Debugging Solr code
  A.4. Downloading and applying Solr patches
  A.5. Contributing patches
Appendix B. Language-specific field type configurations
Appendix C. Useful data import configurations
  C.1. Indexing Wikipedia
  C.2. Indexing Stack Exchange
Foreword
Solr has had a long and successful history, but a major new chapter began recently with
the advent of Solr 4 and SolrCloud. This is the perfect time for Solr in Action. With clear
examples, enlightening diagrams, and coverage from key concepts through the newest
features, Solr in Action will have you successfully using Solr in no time!
Solr was born out of necessity in 2004, at CNET Networks (now CBS Interactive), to
replace a commercial search engine being discontinued by the vendor. Even though I had
no formal search background when I started writing Solr, it felt like a very natural fit,
because I have always enjoyed making software “go fast.” I viewed Solr more as an
alternate type of datastore designed around an inverted index than as a full-text search
engine, and that has helped Solr extend beyond the legacy enterprise search market.
By the end of 2005, Solr was powering the search and faceted navigation of a number of
CNET sites, and soon it was made open source. Solr was contributed to the Apache
Software Foundation in January 2006 and became a subproject of the Lucene PMC (with
Lucene Java as its sibling). Solr's committers had always overlapped heavily with those of
Lucene (the core full-text search library used by Solr), and in 2010 the projects were
merged. Separate Lucene and Solr downloads would still be available, but they would be
developed by a single unified team. Solr’s version number jumped to match that of
Lucene, and the releases have since been synchronized.
The recent Solr 4 release is a major milestone, adding SolrCloud—the set of highly
scalable features including distributed indexing with no single points of failure. The
NoSQL feature set was also expanded to include transaction logs, update durability,
optimistic concurrency, and atomic updates. Solr in Action, written by longtime Solr
power users and community members, Trey and Timothy, covers these important recent
Solr features and provides an excellent starting point for those new to Solr.
Solr is now used in more places than I could ever have imagined—from integrated library
systems to e-commerce platforms, analytics and business intelligence products,
content-management systems, internet searches, and more. It’s been rewarding to see
Solr grow from a few early adopters to a huge global community of helpful users and
active volunteers cooperatively pushing development forward.
Solr in Action gives you the knowledge and techniques you need to use Solr’s features that
have been under development since 2004. With Solr in Action in hand, you too are now
well equipped to join the global community and help take Solr to new heights!
YONIK SEELEY
CREATOR OF SOLR
Preface
In 2008, I was asked to take over leadership of CareerBuilder’s search technology team.
We were using the Microsoft FAST search platform at the time, but realized that search
was too important to the success of our business for us to continue relying on a
commercial vendor instead of developing the domain expertise internally. I immediately
began investigating open source alternatives such as Solr, which seemed to provide most
of the key features needed for our products. By the summer of 2009, we decided that we
were ready to bring our search expertise in-house and convert our systems to Solr.
The timing was great. Lucene, the open source search library upon which Solr is built, had
become a full top-level Apache project in February 2005, and Solr, which had been
contributed to the Apache Software Foundation in 2006, had become a top-level Apache
project in January of 2007. Both technologies were reaching critical mass and would soon
be merged (in March 2010) into a unified project.
By the summer of 2010, our entire platform was converted to Solr. In the process, we
increased the speed of our searches, significantly reduced the number of servers necessary
to support our search infrastructure, dropped expensive licensing fees, increased platform
stability, and in-sourced much of the search expertise for which we had previously been
dependent on a commercial vendor.
Little did we know at that time how much additional value we would gain by bringing
search in-house. We have been able to build entirely new suites of search-based
products—from traditional keyword and semantic search, to big data analytics products,
to real-time recommendation engines—utilizing Solr as a scalable search architecture to
handle billions of documents and millions of queries an hour across hundreds of servers.
We have entered the era of cloud services, elastic scalability, and an explosion of data that
we strive to make meaningful for society, and with Solr we are able to tackle each of these
challenges head-on.
When Manning approached me about writing Solr in Action, I was hesitant because I
knew it would be a large undertaking. My one requirement was that I needed a strong
coauthor, and that is exactly what I found in Timothy Potter. Tim also has years of
experience developing search-based solutions with Lucene and Solr. He has a wealth of
expertise building text analysis systems for social data and architecting real-time analytics
solutions using Solr and other cutting-edge big data technologies. With both of us having
received so much help from the Solr community over the years and with such a clear need
for an example-driven guide to Solr, Tim and I are excited to be able to provide Solr in
Action to help the next generation of search engineers. It’s the book we wish we’d had five
years ago when we started with Solr, and we hope that you find it to be useful, whether
you are just getting introduced to Solr or are looking to take your knowledge to the next
level.
TREY GRAINGER