
A GraphQL Mediator for Implementing a Hybrid Search API Layer

This article describes the design of a mediation layer that implements a hybrid search API through GraphQL. Although this is a high-level description, it is based on a concrete implementation we've done for a project (not mentioned because it is still in progress).

First of all, let's start by looking at the meaning of the three concepts in the title.

Mediator

“The Mediator Design Pattern defines an object that encapsulates how a set of objects interact. Mediator promotes loose coupling by keeping objects from referring to each other explicitly.”

(“Design Patterns: Elements of Reusable Object-Oriented Software”, Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, 1994)

I like to imagine a mediator using the following picture:

The gray area represents the mediation layer, while the four components are the mediated subsystems that (indirectly) interact with each other to build a workflow instance. Note that the mediation layer forces the design to have the following features:

  • it bridges all four subsystems
  • it marks firm boundaries between each area; in other words, a given subsystem is not aware of the deployment context it runs in

The second picture represents the same idea, replacing the four anonymous components with real-world examples.

The subsystems provide different services that, once orchestrated by the mediation layer, result in a unified response back to the clients. For example (referring to the example on the left):

  • the API Gateway can contribute with non-functional features like authentication/authorization, routing, balancing
  • Apache Solr contributes with fast and furious search features (e.g. full-text search, faceting, type-ahead suggestions, spellchecking, highlighting)  
  • PostgreSQL provides real-time data 
  • Fuseki offers an RDF perspective bringing additional information, for example, links to external sources (e.g. DBpedia)

Note that the components listed are just examples; they could be replaced with other alternatives able to offer the same type of services.

Hybrid Search

Apart from the API Gateway, the other three components in the examples above provide data storage and retrieval services. Why do we need three subsystems located on the same layer? Is there any difference between them?

The obvious answer is yes. Simplifying:

  • Solr or Elasticsearch is excellent at searching, but data is available only in near real time
  • the RDBMS can provide real-time data, but it cannot do full-text search (unless using vendor-specific extensions) or other complementary features like faceting
  • the RDF Store organizes data around a link-centric approach, which in general is not a good fit for acting as a primary datasource; in addition, the same considerations about faceting and full-text search apply.

How about using a mediation layer to combine the advantages of those three data sources?

The role of a mediator would be:

  • from a client perspective, to offer a unified and simple interface; from that viewpoint, it acts as a Facade
  • from a server perspective, to orchestrate the three decoupled subsystems and manage the request execution workflow.

Retrieval services in such a context would consist of:

  • an input interface that would include parameters for triggering specific behavior (offered by one of the underlying subsystems)
  • a compound response with contributions coming from different retrieval systems. 

The role of the mediation would be, again, to make all of that simple, uniform, and transparent.  
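
As a sketch, the retrieval service shape described above could look like the following. This is illustrative only: the parameter names and the stubbed subsystem contributions are hypothetical, not the actual interface we built.

```python
def mediator_search(q=None, facet=False, spellcheck=False):
    """Unified retrieval entry point (the Facade): input parameters trigger
    behavior in specific subsystems, and the response is compound, with one
    section per contributing subsystem. Subsystem calls are stubbed out."""
    response = {"resources": []}               # main section, always present
    if q is not None:
        response["resources"] = [{"id": "id-1"}]  # stub: matches from the inverted index
    if facet:
        response["facets"] = {}                # stub: contributed by the search engine
    if spellcheck:
        response["didYouMean"] = {}            # stub: contributed by the search engine
    return response
```

The point is the shape, not the implementation: each optional parameter switches on a contribution from one of the mediated subsystems, and each contribution lands in its own response section.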

GraphQL

GraphQL is a query language for APIs. A quick look at the GraphQL specs gives you an idea of the vast set of features captured in the language syntax.

A GraphQL runtime (i.e., an engine which implements the specs above and offers those capabilities) is a perfect fit for the use case described in this post: it provides a front controller on top of heterogeneous data sources, whether the data provider is an API, a database, or an arbitrary application that provides its services through an interoperable protocol.

See this article about our thoughts on GraphQL and how we think it can be used with REST.
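
To make the front-controller idea concrete, here is a toy, pure-Python dispatch that mimics how a GraphQL runtime resolves each requested field through a dedicated resolver. This is not a real GraphQL engine; the field names and the backends behind them are made up for illustration.

```python
# Each field of the (hypothetical) schema is backed by a different subsystem.
RESOLVERS = {
    "resources": lambda args: [{"id": args["id"], "name": "Carroll, Lewis"}],  # RDBMS stub
    "facets":    lambda args: {"Type": {"Person": 1}},                         # Solr stub
    "sameAs":    lambda args: ["https://dbpedia.org/resource/Lewis_Carroll"],  # RDF store stub
}

def execute(selection, args=None):
    """Resolve only the fields the client selected, GraphQL-style: the runtime
    acts as a front controller, delegating each field to its own data source."""
    args = args or {}
    return {field: RESOLVERS[field](args) for field in selection}
```

A real GraphQL engine does much more (typed schema, validation, nested selections), but the delegation pattern is the same: the client asks for fields, and each field can be served by a different heterogeneous source.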

Data Access Patterns

Concretely, what are the scenarios we are interested in? This section addresses that question by providing a list of data access patterns that benefit from such a hybrid architecture.

Known-Item Search

“Known-item search is a specialization of information exploration which represents the activities carried out by searchers who have a particular item in mind” (source: Wikipedia)

The definition above is a bit wider than the scenario we want to capture: “known-item” in this context means something like

I know the identifier of an entity; I want to retrieve the information about that entity 

For example: give me the information about the person identified by https://example.org/people/amadsen.

What could be the expected contribution of the subsystems in the example diagram above? The answer could change depending on the requirements; here’s what we implemented: 

  • The identifier of the entity to be retrieved is known, so it can be used for fetching fresh (i.e., updated in real time) data from the RDBMS
  • Apache Solr is not involved: there's no exploratory search in this scenario. One possible involvement could be related to More Like This (see below); however, in the system we built, that wasn't part of the customer requirements
  • The same known identifier could be used for fetching from the RDF Store an alternative representation of the resource, in case a specific RDF encoding is requested (e.g. N-Triples, RDF/XML)
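
The known-item flow can be sketched as follows, with in-memory dictionaries standing in for PostgreSQL and Fuseki. The identifiers, data, and function names are illustrative only.

```python
# Hypothetical stub backends keyed by entity identifier.
RDBMS = {"https://example.org/people/amadsen": {"name": "Madsen, A.", "type": "Person"}}
RDF_STORE = {"https://example.org/people/amadsen":
             "<https://example.org/people/amadsen> "
             "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
             "<https://schema.org/Person> ."}

def fetch_known_item(identifier, accept="application/json"):
    """Route by the negotiated content type: JSON attributes come from the
    RDBMS, RDF serializations from the RDF store. Solr is not involved."""
    if accept == "application/n-triples":
        return RDF_STORE[identifier]
    return RDBMS[identifier]
```
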

Related Entities

This is a specialization of the scenario above, where the requestor is not directly interested in the known item. Instead, the focus is on returning the entities related to it (e.g., children or other entities connected through a named relationship).

For example: give me the information about cars owned by the person identified by https://example.org/people/amadsen.

In this scenario, assuming the relationship data is properly managed in the inverted index, Apache Solr can start contributing its near-real-time search features.

Here's the contribution list, updated:

  • The RDBMS plays the same role as above: the known-item identifier is used for fetching real-time data. The input is still the same (the identifier); the difference is in the retrieved data because, this time, we are interested in related entities
  • Apache Solr uses the identifier for complementary features like faceting and More Like This
  • The RDF Store involvement follows the same "parallel" path as above, in case an RDF serialization format is requested on the client side
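
On the Solr side, the request could translate into parameters like the following. The field names (owner, type) are hypothetical; a real index schema would define its own relationship and classification fields.

```python
def related_entities_params(identifier):
    """Build (hypothetical) Solr query parameters for the related-entities
    scenario: the known identifier becomes a filter query, and faceting is
    requested on top as a complementary feature."""
    return {
        "q": "*:*",
        "fq": f'owner:"{identifier}"',   # restrict matches to entities related to the known item
        "facet": "true",
        "facet.field": ["type"],
    }
```
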

FullText

In this scenario, the starting point is an exploratory search: the user enters one or more search terms to get back a list of “relevant” matches according to their information needs.

The contribution set in this scenario introduces an interesting interaction schema where:

  • The inverted-index engine (Apache Solr in our example) acts as the primary source determining the relevant matches. The response contains the matching documents plus additional features like faceting, highlighting, spellchecking
  • We don't want to fetch stored data from Solr (which, again, holds only a near-real-time snapshot of the data). Instead, we use the identifiers returned by Solr for retrieving the corresponding fresh data from the RDBMS or the RDF Store, depending on the requested and negotiated content type.

Here is the updated workflow diagram: 

Response Contribution

We looked at three scenarios; for each of them, we described the role each subsystem plays in request execution.

We also mentioned that the contribution has another perspective: the response.

Fields Retrieval

Regardless of the scenario, entity attributes are always retrieved from the RDBMS (JSON representation) or the RDF Store (RDF representation).

We can think of a two-phase retrieval protocol:

  • in the first phase, we collect the identifiers (or the single identifier)
  • in the second phase, each identifier is used for fetching the entity data; in this phase, the response can also be enriched by additional contributions (e.g., facets)

In the known-item and related-entities scenarios, there's only one identifier, and it is known in advance, so the first phase is "virtual": it doesn't involve any subsystem. In the full-text scenario, we first collect the matching identifiers (ranked by relevance), and then we create the entity representations.
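
The two-phase protocol can be sketched as follows; the subsystem calls are stubbed and all names are hypothetical.

```python
def solr_match(terms):
    """Phase 1 for full-text requests: ranked identifiers from the inverted index (stub)."""
    return ["id-2", "id-1"]

def rdbms_fetch(identifier):
    """Phase 2: fresh entity attributes from the primary store (stub)."""
    return {"id": identifier}

def retrieve(terms=None, identifier=None):
    """For known-item requests, phase 1 is 'virtual' (the identifier is already
    known); for full-text requests, it runs the search. Phase 2 then fetches
    fresh data for each collected identifier, preserving the ranking."""
    ids = [identifier] if identifier is not None else solr_match(terms)  # phase 1
    return [rdbms_fetch(i) for i in ids]                                 # phase 2
```
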

Facets, Highlighting, Spellchecking, More Like This

This section is meant to group all response contributions coming from the inverted-index engine (Apache Solr, in our example).

Faceting refers to a complementary search feature where results are classified into categories.

{
  "Type":{
    "Person":4,
    "Book":5
  },
  "Genre":{
    "Traditional Literature":3,
    "Drama":2
  },
  "ContributionType":{
    "Author":2,
    "Illustrator":1,
    "Translator":1
  }
}
Highlighting provides fragments of documents that match the user’s query, surrounded by HTML tags.
{
  "name":"<b>Carroll</b>, Lewis",
  "birthDate":1832,
  ...
},
{
  "title":"Alice in <b>Wonderland</b>",
  ...
}

Spellchecking (aka Did You Mean?) provides inline query and term suggestions based on other, similar terms.

"termsSuggestions": {
  "term": "winderlnd",
  "corrections": [
    "wonderland"
  ]
},
...
"querySuggestions": [
  { "query": "wonderland carroll potter" },
  { "query": "wonderland carroll poter" },
  { "query": "wonderland carroll power" }
]

More Like This provides a list of similar documents according to a given set of criteria.

The idea is to retain their complementary nature and add a section in the response for each of them, regardless of the data provenance of the main section (e.g. RDBMS, Solr itself).

The following matrix combines the features and the scenarios described above:

                  Known-Item Search   Related Entities   FullText
Faceting          No                  Yes                Yes
Highlighting      No                  No                 Yes
Spellchecking     No                  No                 Yes
More Like This    Yes                 Yes                No

The example response below refers to scenario #3 (FullText). Before that, a brief recap of what happens:

  • The client searches for one or more terms; the request contains the search terms
  • The Mediator asks Apache Solr to execute the search. Solr returns the top-k matching identifiers plus facets, spellchecking, and highlighting snippets
  • Each identifier is used for collecting data (attributes) from the RDBMS
  • The resulting response is enriched with faceting and spellchecking contributions
  • The highlighting snippets replace the attributes in the results list
					{
   "resources":[
      {
         "name":"<b>Carroll</b>, Lewis",
         "birthDate":1832,
         ...
      },
      {
         "title":"Alice in <b>Wonderland</b>",
         ...
      },      
      ...
   ],
   "facets":{
      "type":{
         "Person":4,
         "Book":5
      },
      "genre":{
         "Traditional Literature":3,
         "Drama":2
      },
      "role":{
         "Author":2,
         "Illustrator":1,
         "Translator":1
      }
   },
   "didYouMean": {
     "termsSuggestions": {
        "term": "winderlnd",
        "corrections": [
            "wonderland"
        ]
    },
    "querySuggestions": [
      {
        "query": "wonderland carroll potter"
      },
      {
        "query": "wonderland carroll poter"
      },
      {
        "query": "wonderland carroll power"
      }
    ]
   },
   "page":{
      "totalMatches":9,
      "startOffset":0,
      "pageSize":5
   }
}

Conclusions

In this case study, we provided a high-level overview of how we built a mediation layer that coordinates multiple subsystems and produces an added-value response for the high-level service consumers.

I have to underline that the way we implemented the workflow is strongly tied to the customer requirements. All this is to say that it's not the only way to mix and orchestrate the contributions coming from the mediated subsystems.

As usual, any feedback is warmly welcome!
