Digging in the Solr code: 5 minutes howto

Let’s say you need to write a component, a request handler, or in general some piece of custom code that needs to be plugged into Solr. Or you need a deeper understanding of some Lucene/Solr internals, following what actually happens within the code.

I know: unit tests, integration tests, everything to make sure things behave as you would expect. But here I’m talking about something different: while developing, it is (at least for me) very useful to have a productive debug environment where I can, in short dev iterations, follow step by step what’s happening within the code and take a deep look at how things actually work behind the scenes.

In my experience, I have found this useful in a couple of scenarios:

  • I have to write some Solr add-on: in this case I want a development environment that lets me write and debug code as fast as possible
  • I have to study some Solr internals: let’s say for example I need to check what happens at retrieval time when a field is both docValues=”true” and stored=”true”; where does Solr get the field value from?

Let’s see how both of them can be accomplished in a few minutes.

Step #1: Clone the GitHub Repository

Clone the following repository:

https://github.com/SeaseLtd/solr-addon-project-skeleton

Once imported in your favourite IDE, the project layout will look like this:

As you can see, the template project provides:

  • A custom TokenFilter which simply prints the output tokens to standard output during text analysis. Note this is just an example (useful if you want to debug an analyzer): I could have created a SearchComponent, a Tokenizer or whatever else I needed.
  • A sample Solr configuration, with a minimal set of things configured
  • A test supertype layer (BaseIntegrationTest) and a sample test (Tests) which loads some data, executes a query and then prints out the results.

Surprisingly, that’s all! There’s no second step!

Use Case #1: Implement, Debug And Test An Add-On

As previously said, the example repository already contains a simple add-on: a TokenFilter that prints to standard output each token produced in the analysis chain. The filter has been declared in the Solr configuration as part of the “text” field type analyzer:

The test class triggers that analyzer because it indexes some documents, so if you run it as a plain JUnit test, you will see the following output:

If you put a breakpoint in the token filter and then re-run the Tests class in debug mode, the debugger will stop at that line, as expected.
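To make the idea of a pass-through filter that prints every token a bit more concrete, here is a minimal, dependency-free sketch in plain Java. It is not the class from the repository: the real add-on extends Lucene's TokenFilter, while this model deliberately avoids the Lucene dependency and wraps a plain Iterator instead. The names PrintingTokenIterator and drain are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/**
 * Dependency-free model of a pass-through "printing" filter.
 * Hypothetical sketch: the repository's add-on is a Lucene TokenFilter instead.
 */
final class PrintingTokenIterator implements Iterator<String> {

    private final Iterator<String> input;   // the upstream token source
    private final Consumer<String> sink;    // where each token gets reported

    PrintingTokenIterator(Iterator<String> input, Consumer<String> sink) {
        this.input = input;
        this.sink = sink;
    }

    @Override
    public boolean hasNext() {
        return input.hasNext();
    }

    @Override
    public String next() {
        String token = input.next();
        sink.accept(token);   // a breakpoint on this line is hit once per token
        return token;         // the token itself is passed on unchanged
    }

    /** Pulls every upstream token through the filter and returns what came out. */
    static List<String> drain(Iterator<String> tokens, Consumer<String> sink) {
        List<String> out = new ArrayList<>();
        Iterator<String> filtered = new PrintingTokenIterator(tokens, sink);
        while (filtered.hasNext()) {
            out.add(filtered.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints each token on its own line, one per call to next().
        drain(Arrays.asList("digging", "in", "solr").iterator(), System.out::println);
    }
}
```

The point of the sketch is only the shape: the filter observes each token on its way through and leaves it untouched, which is what makes it a convenient place for a breakpoint.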

Use Case #2: Debugging Solr Internals

In this case there’s no custom code because, remember, the goal is to investigate some Solr internals. Specifically, the question I have to answer in this example is: assuming we have a field (myfield) with both docValues=”true” and stored=”true”, and a request that asks for that field in the results (fl=myfield), where does Solr get the field value from [1]?

First, I have to change a couple of things in the project:

  • schema.xml: add the field definition above
  • Tests class: change the query parameters (adding fl=myfield) and add a value for myfield to the indexed documents.

Now, a premise: since the goal of this blog post is not to actually answer the question above, we will skip the whole investigation phase needed to understand the overall query execution flow and to find the right place for the breakpoint.

After some investigation, we understand that the RetrieveFieldsOptimizer [2] class plays a fundamental role in that process, so let’s open it and set a breakpoint:

As you can see, the name and the intent of that class are quite clear, but I still want to see what happens at runtime: let’s start the Tests class in debug mode and, as expected, the debugger stops at the breakpoint.

I can see the field “myfield” has been collected in the “storedFields” set, while the dvFields (DocValues fields) set is empty, even though the field has the docValues flag enabled. That probably tells me something…

Moving forward, we arrive at the optimize method, where we meet the optimisation described in SOLR-8344:

Again, this is just an example and the goal here is not to describe the findings; however, briefly, it says that if all requested fields

  • have the docValues and stored flags enabled
  • are not multivalued

then Solr retrieves the values only from docValues.
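The rule as summarised above can be written down as a tiny predicate. This is only a restatement of the summary in this post, not Solr's actual implementation: the class FieldSourceSketch, the FieldFlags holder and the method readOnlyFromDocValues are hypothetical names, and the treatment of an empty field list is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the rule summarised in the post; not Solr code. */
final class FieldSourceSketch {

    /** The three schema flags the rule looks at, for one requested field. */
    static final class FieldFlags {
        final String name;
        final boolean stored;
        final boolean docValues;
        final boolean multiValued;

        FieldFlags(String name, boolean stored, boolean docValues, boolean multiValued) {
            this.name = name;
            this.stored = stored;
            this.docValues = docValues;
            this.multiValued = multiValued;
        }
    }

    /**
     * True when every requested field is stored, has docValues and is single-valued,
     * i.e. when (per the summary above) values would come only from docValues.
     */
    static boolean readOnlyFromDocValues(List<FieldFlags> requested) {
        if (requested.isEmpty()) {
            return false; // assumption of this sketch: nothing requested, nothing to optimise
        }
        for (FieldFlags field : requested) {
            boolean eligible = field.stored && field.docValues && !field.multiValued;
            if (!eligible) {
                return false; // one non-eligible field is enough to disable the shortcut
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FieldFlags myfield = new FieldFlags("myfield", true, true, false);
        FieldFlags tags = new FieldFlags("tags", true, true, true); // multivalued
        System.out.println(readOnlyFromDocValues(Collections.singletonList(myfield)));
        System.out.println(readOnlyFromDocValues(Arrays.asList(myfield, tags)));
    }
}
```

With the single field from the example (stored, docValues, single-valued) the predicate holds; adding a multivalued field to the request makes it fail for the whole request, because the rule applies to all requested fields together.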
