
Apache Solr: “Exact Match” BATH compliant field

If you want to expose a Z39.50 interface for an information retrieval system, sooner or later you will meet the BATH profile, which is basically a set of rules that promotes standard behaviour between Z39.50 servers. The aim of the specification is to define a list of searches (fields, attributes) that a Z39.50 service should support.

This article is not about how to set up a Z39.50 endpoint with SOLR behind the scenes, because there are plenty of places where you can find that information. Instead, I will write down a brief note about the so-called “Exact Match” search, which is one of the most interesting parts of the story.
Let’s go straight to the problem. This is the specification of the “Exact Match” search:

  • Position: first in field
  • Truncation: do not truncate
  • Completeness: complete field
  • Structure: phrase

This is a pretty simple scenario with a little issue / question (which could still be open after this reading, because what you are reading is just my interpretation of the story, not the absolute truth). In general the definition is quite clear:

  1. you must search within the given index, and the user-entered terms must match starting from the beginning of the field (first in field)
  2. the user-entered terms are supposed to be “complete words” (do not truncate)
  3. what the user entered is a complete field (e.g. a complete title or author name)

If our index contains a document with Alessandro Manzoni as author, an exact match query will find this document if and only if the user-entered value is one of

  • Alessandro Manzoni
  • alessandro manzoni (assuming a minimal text analysis is done with lowercasing).

For example the following queries won’t match that document:

  • manzoni alessandro
  • Manzoni alessandro

because a “phrase” structure means the terms must appear in the given order, and proximity search is not mentioned in the specification (ok, it’s useful in real life but that’s another story). Another type of text analysis that could be applied without violating the specification is removing intra-word delimiters.

Let’s do an example. If I have

  • Manzoni, Alessandro.

it’s hard to imagine that a user will run an exact match query by typing exactly what is written in the index. It would be better to remove the intra-word and trailing punctuation, both at index and query time, and make life easier. In this way, the following queries:

  • manzoni alessandro
  • Manzoni, Alessandro.
  • manzoni alessandro..
  • manzoni. alessandro,

will match the document.

What manipulations does a SOLR field need in order to accommodate the “Exact Match” requirement? If I had to adhere strictly to the BATH requirements, my field would be a simple string like this:

					<field name="author" type="string" indexed="true"/>

but in this case, only a search for Manzoni, Alessandro. (with exact punctuation) will match the corresponding document, and that’s not exactly what we want.
A more flexible approach would be to assign a solr.TextField type to our field. The TextField type allows you to associate a text analysis chain with a field.

Let’s take another example. This time the author is

					Contessa Serbelloni Mazzanti Viendalmare

Following this approach, the input value is transformed in this way:

					Contessa Serbelloni Mazzanti Viendalmare (original)
					contessa serbelloni mazzanti viendalmare (lowercase)
					contessa, serbelloni, mazzanti, viendalmare (word tokenizer)
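
The transformation above corresponds roughly to a field type like the following. It is only a sketch: the type name text_author is made up for illustration, and the standard tokenizer plus lowercase filter is an assumed (though common) combination, since the exact analyzer is not spelled out here.

```xml
<!-- Sketch only: "text_author" is a hypothetical type name. -->
<fieldType name="text_author" class="solr.TextField" positionIncrementGap="100">
  <!-- One <analyzer> with no type attribute applies at both index and query time. -->
  <analyzer>
    <!-- Splits the value into word tokens and discards punctuation. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercases every token. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="author" type="text_author" indexed="true"/>
```

With a type like this, each word becomes a separate token with its own position, which is why a phrase query can match anywhere inside the field value.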

If a user enters the following query

					q="contessa Serbelloni MAZZANTI viendalmare"

a match is found. Now, the reason why this is not sufficient for BATH profile compliance can be found in the specification of the “Exact Match” search:

  • Position: first in field
  • Truncation: do not truncate
  • Completeness: complete field
  • Structure: phrase

First in field means that, in order to match a document, the indexed field must contain the user-entered terms (in the given order, because it is a “phrase” search) in the first position. In addition, complete field means that what the user entered is supposed to be the complete value of the target field. The indexing approach above violates both preconditions. How? Here it is: if the user enters the following

					q="MAZZANTI viendalmare"

a match with the same document will still be found, because the terms are in the target indexed field in the given order. But they aren’t at the beginning of the field and they don’t represent the whole literal value. So, even if it is a little bit better, this approach doesn’t work.

Here is another approach, which introduces a more articulated text processing chain. Both at index and query time, we start with

					Contessa Serbèlloni, Mazzànti Viendalmarè.

Keyword Tokenizer

The tokenizer does nothing: it considers the input as a whole value (i.e. a single token).

Diacritics Replacement

This is the first normalisation we apply. Together with lowercasing, it results in the following transformation:

					contessa serbelloni, mazzanti viendalmare.

Intra-Word Delimiter Removal

The last filter removes the intra-word delimiters, including whitespace and trailing punctuation. Here’s the output:

					contessaserbellonimazzantiviendalmare


That’s all. As a last note, remember that the described chain should be applied both at index and query time. Let’s run some tests in order to check the implementation.

  • Searching Contessa serbelloni mazzanti viendalmare will produce 1 result;
  • Searching serbelloni mazzanti will produce no result;
  • Searching Contessa serbelloni mazzanti will produce no result;
  • Searching Contessa serbel loni maz zanti vien dal mare will produce 1 result. OK, this could be seen as a violation and I agree with you… I’m still thinking about that… In the meantime, let’s say that this “bug” is very useful, because you (the searcher) might not know exactly how the author name is written.
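
For reference, the chain described above could be declared in the schema roughly as follows. This is a minimal sketch, not necessarily the exact configuration behind these tests: the type name exact_match is made up, the lowercase filter is an assumption inferred from the lowercased values shown earlier, and the regular expression is just one possible way of dropping punctuation and whitespace. The factories themselves are stock Solr classes.

```xml
<!-- Sketch of an "Exact Match" field type; the name "exact_match" is hypothetical. -->
<fieldType name="exact_match" class="solr.TextField">
  <!-- One <analyzer> with no type attribute applies at both index and query time. -->
  <analyzer>
    <!-- Emits the whole input value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Assumption: lowercasing, as in the earlier examples. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Diacritics replacement: folds accented Latin characters to plain ASCII. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Delimiter removal: strips ASCII punctuation and whitespace from the token. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[\p{Punct}\s]+" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<field name="author" type="exact_match" indexed="true"/>
```

Because the whole value becomes a single token, a query only matches when its normalised form equals the normalised form of the entire field, which is what gives the “first in field” and “complete field” behaviour. The query has to reach the analyzer as one string, for example as a quoted phrase like the queries shown earlier; otherwise the query parser may split it on whitespace first.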
