
URIs 303 Redirection For The Semantic Web

When I think of the Semantic Web and content negotiation, I always remember some episodes from when I was a kid.

Often, during the summer holidays, I accompanied my father to the market to buy groceries. I was happy to help, not least because I knew there would also be a chance to visit the candy shop or the comic shop, or both.

At the candy shop, I was impressed by how well the man at the counter knew every customer and how quickly he handed over each one's usual sweets. Of course, we were no exception, and I always left the shop with my chocolate loot in no time.

The man was pretty efficient in providing everyone with their favorite form of sugar, according to who they were and the format they preferred.

Content negotiation and URI forwarding work similarly on the Semantic Web, and we're going to see how we can get the candies we want using the HTTP protocol.

Overview

The Semantic Web was envisioned by Tim Berners-Lee as a Web made of data that machines could process, which implies that most of that data should be machine-readable.

On the Semantic Web, information is expressed as statements about resources, such as "The members of the band Eurythmics were Annie Lennox and Dave Stewart" and "This song was written by Annie Lennox".

The Resource Description Framework (RDF) made this modelling system its core principle.
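To make the idea of statements about resources more concrete, here is a minimal Python sketch that models the Eurythmics statement as subject-predicate-object triples. The example.org URIs, the Triple type and the objects_of helper are illustrative assumptions; they do not come from any RDF library or from a real vocabulary.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate and an object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers, used only for illustration.
BASE = "https://example.org"
EURYTHMICS = f"{BASE}/bands/Eurythmics"
MEMBER = f"{BASE}/terms/member"

statements: List[Triple] = [
    Triple(EURYTHMICS, MEMBER, f"{BASE}/people/AnnieLennox"),
    Triple(EURYTHMICS, MEMBER, f"{BASE}/people/DaveStewart"),
]


def objects_of(subject: str, predicate: str) -> List[str]:
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in statements
            if t.subject == subject and t.predicate == predicate]


print(objects_of(EURYTHMICS, MEMBER))
```

Every part of a statement is a plain identifier, which is why the way those identifiers are chosen matters so much.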

RDF uses URIs (Uniform Resource Identifiers) to identify resources. However, the same URI, say /people/Federico, could identify both my home page and an RDF description of the person named Federico.

This brings us to the main question: since URIs can identify both web pages and machine-readable resources, how do we use them so that their identification is unambiguous?

Resource Identifiers, Or What’s In A URI?

On the Semantic Web, URIs can identify Web pages and objects. Such objects may belong to the real world (like planes, people, or books) or be fictitious (like fantasy ideas or creatures, such as hobbits or the Millennium Falcon). The Semantic Web calls them real-world objects or things.

According to the W3C document Cool URIs for the Semantic Web (which, at the authors' request, should be regarded as a work in progress), URIs on the Semantic Web must meet two main requirements:

1. Be on the Web.

Given only a URI, machines and people should be able to retrieve a description about the resource identified by the URI from the Web. Such a look-up mechanism is important to establish shared understanding of what a URI identifies. Machines should get RDF data and humans should get a readable representation, such as HTML. The standard Web transfer protocol, HTTP, should be used.

2. Be unambiguous.

There should be no confusion between identifiers for Web documents and identifiers for other resources. URIs are meant to identify only one of them, so one URI can't stand for both a Web document and a real-world object.

On closer inspection, the two requirements seem to conflict: the most convenient way to expose a real-world object (like an author or a work) both as an RDF or JSON resource and as a web page would be to use the same URI for all of them.

How can such resources be told apart, and how can the system tell which specific representation the client is interested in?

This is where content negotiation comes into the picture.

Content Negotiation

As we have seen, the representations offered for the same resource differ substantially: they are not multiple versions of the same document, but different documents altogether.

HTTP has a powerful mechanism for offering different formats and language versions of the same Web document known as content negotiation.

Generally speaking, when a user agent (such as a browser) makes an HTTP request, it sends along some HTTP headers to indicate what data formats it prefers. The server then selects the best match from its file system or generates the desired content on demand, and sends it back to the client. 

For example, a browser could send this HTTP request to indicate that it wants an HTML or XHTML representation of https://my-site.org/people/Alice in English or German:

				
    GET /people/Alice HTTP/1.1
    Host: my-site.org
    Accept: text/html, application/xhtml+xml
    Accept-Language: en, de
				
			

And the server could answer:

				
    HTTP/1.1 303 See Other
    Location: https://my-site.org/en/people/alice.html
				
			

The Web server is configured to respond to such requests with a 303 status code and a Location HTTP header that provides the URL of a document representing the resource in the requested format (and language, in this case).

That's what "content negotiation" means: the Web server evaluates the request by inspecting the headers sent along with it, and consequently "negotiates" the right match for the requested format and language, returning the proper resource or response.
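As a rough illustration of the header inspection step, the following Python sketch parses an Accept header into media types ordered by the client's preference, using the optional q weights defined by HTTP. The function name parse_accept is an assumption made for this example, and the parsing is deliberately simplified: it ignores media type parameters other than q.

```python
from typing import List, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media_type, q) pairs, most preferred first.

    Simplified sketch: only the q parameter is interpreted,
    and a missing q defaults to 1.0.
    """
    entries = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        media_type, *params = [p.strip() for p in part.split(";")]
        q = 1.0
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not wanted"
        entries.append((media_type.lower(), q))
    # The sort is stable, so equally weighted types keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


print(parse_accept("text/html, application/xhtml+xml;q=0.9, */*;q=0.8"))
```

A server would then walk this list from the top and stop at the first media type it is able to serve.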

Since 303 is a redirect status code, the server can give the location of a document that represents the resource. If, on the other hand, a request is answered with one of the usual status codes in the 2XX range, like 200 OK, then the client knows that the URI identifies a Web document or resource.

Content negotiation is thus at the very heart of resource exposure on the Semantic Web.

Later in this article we'll look more closely at how URI resolution works with content negotiation in a few resource exposure cases.

As shown in the diagram above, upon receiving the request for a resource the server inspects the headers and the URI in order to apply content negotiation and determine the proper destination.

The two most interesting headers the server inspects are Accept and Accept-Language. The diagram, however, considers the Accept header only.

Notice the final resolution in case no format is recognized for the request: a 406 (Not Acceptable) status code is returned.

This is by far the best resolution strategy for requests in unknown or unhandled formats, since falling back to a default format (say HTML) would not communicate the proper message to clients, which on the Semantic Web should always be format-aware.
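The resolution logic described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the TARGETS table, the negotiate function and the base URL are hypothetical, the /pages and /data prefixes anticipate the examples that follow, and q weights are ignored for brevity.

```python
from typing import Dict, Tuple

# Hypothetical mapping from supported media types to target URI prefixes.
TARGETS = {
    "text/html": "/pages",
    "application/json": "/data",
    "application/rdf+xml": "/data",
}


def negotiate(path: str, accept: str,
              base: str = "https://my-site.org") -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request to the generic resource URI."""
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        prefix = TARGETS.get(media_type)
        if prefix is not None:
            # Known format: redirect to the document for that representation.
            return 303, {"Location": f"{base}{prefix}{path}"}
    # No recognized format: do not fall back to a default representation.
    return 406, {}


print(negotiate("/books/9780780797086", "text/html"))
print(negotiate("/books/9780780797086", "image/png"))
```

In a real deployment this decision would typically live in the reverse proxy configuration rather than in application code.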

Examples

For our case study, let's suppose a client requests the URI of a book, say the hardcover edition of Harry Potter and the Philosopher's Stone, which in our system has an ID corresponding to the book's ISBN:

				
    https://my-site.org/books/9780780797086
				
			

We will examine what happens with the client requesting the same resource in 3 different formats: HTML, JSON and RDF.

The > character means input to the server (request), the < character means output from the server (response).

Example Case #1: HTML

In this case, our client is a Web browser. Naturally, when a Web browser requests a resource, it sends, among other headers, an Accept header containing the text/html media type. (Note: browsers usually send a comma-separated list of media types in the Accept header, for example "text/html,application/xhtml+xml,...", but we'll keep things simple for our example. Bear with me a little, won't you?)

This is the HTTP request brought to the Web server:

				
    > GET /books/9780780797086 HTTP/1.1
    > Host: my-site.org
    > User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...
    > Accept: text/html
				
			

As described above, the server inspects the Accept header to serve the wanted resource representation, and this is its response:

				
    < HTTP/1.1 303 See Other
    < Server: nginx/1.21.0
    < Date: Wed, 07 Jul 2021 17:29:08 GMT
    < Content-Type: text/html
    < Content-Length: 216
    < Location: https://my-site.org/pages/books/9780780797086
    < Connection: keep-alive
    < Accept: text/html
				
			

Note that the requested URI identifies the book itself, while the target URI identifies a document describing it. This satisfies the requirement that a URI be unambiguous: the /pages/ segment in the target URI tells us that the resource is a Web page.

I know, I know, I can almost hear you complaining: what about SEO? Search engines, and people too, do not care much for numeric codes; they expect URIs to carry intelligible information, expressive enough to be indexed and remembered.

Obviously, you are right.

But how can we convince the Web server, which by nature knows nothing of the database behind all this, to resolve the ID into a URI that is more meaningful to humans and search engines, and to redirect there?

The task is easily addressed by forwarding the initial request to an application layer (Node, PHP, Java, whatever) that can resolve the resource ID into a meaningful URI pattern.

The client can carry out this task as well if it has access, for instance, to the RESTful API for that resource: it can then perform the URI transformation and redirection itself.

In such a case, here’s how our response would change:

				
    < HTTP/1.1 303 See Other
    < Server: nginx/1.21.0
    < Date: Wed, 07 Jul 2021 17:31:08 GMT
    < Content-Type: text/html
    < Content-Length: 261
    < Location: https://my-site.org/pages/books/harry-potter-and-the-philosophers-stone-isbn-9780780797086
    < Connection: keep-alive
    < Accept: text/html
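The ID resolution step performed by the application layer could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the BOOKS dictionary stands in for a real database lookup, the base URL is hypothetical, and slugify and page_location are hypothetical helper names.

```python
import re
from typing import Optional

# Stand-in for a database lookup (hypothetical data source).
BOOKS = {"9780780797086": "Harry Potter and the Philosopher's Stone"}


def slugify(title: str) -> str:
    """Turn a title into a lowercase, hyphen-separated URI segment."""
    title = title.lower().replace("'", "").replace("\u2019", "")
    return re.sub(r"[^a-z0-9]+", "-", title).strip("-")


def page_location(isbn: str, base: str = "https://my-site.org") -> Optional[str]:
    """Build the human-friendly page URI for a book ID, or None if unknown."""
    title = BOOKS.get(isbn)
    if title is None:
        return None
    return f"{base}/pages/books/{slugify(title)}-isbn-{isbn}"


print(page_location("9780780797086"))
```

Keeping the ISBN in the final URI means the page can still be looked up by ID, while the title part is there only for readers and search engines.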
				
			

Example Case #2: JSON

In this case our client is a machine. The Accept header is set to application/json. Here’s the request:

				
    > GET /books/9780780797086 HTTP/1.1
    > Host: my-site.org
    > User-Agent: insomnia/2021.4.0
    > Accept: application/json
				
			

and the response:

				
    < HTTP/1.1 303 See Other
    < Server: nginx/1.21.0
    < Date: Wed, 07 Jul 2021 17:29:54 GMT
    < Content-Type: application/json
    < Content-Length: 229
    < Location: https://my-site.org/data/books/9780780797086
    < Connection: keep-alive
    < Accept: application/json
				
			

As you can see, the /pages/ segment in the URI has been replaced by /data/. We will configure the Web server so that requests to this location are forwarded to our RESTful API application server.

This way, the client will be provided with a JSON description of the resource.

Example Case #3: RDF

This case does not differ much from the previous one. The main difference is that the target URI will be processed by an RDF store instead of our RESTful API application server:

				
    > GET /books/9780780797086 HTTP/1.1
    > Host: my-site.org
    > User-Agent: insomnia/2021.4.0
    > Accept: application/rdf+xml
				
			
				
    < HTTP/1.1 303 See Other
    < Server: nginx/1.21.0
    < Date: Wed, 07 Jul 2021 17:40:21 GMT
    < Content-Type: application/rdf+xml
    < Content-Length: 235
    < Location: https://my-site.org/data/books/9780780797086
    < Connection: keep-alive
    < Accept: application/rdf+xml
				
			

What? The target location is the same as the one from the previous example (JSON)! How is this possible?

Simple: when the client follows the redirect, it repeats its own Accept header in the new request, so the needed format information reaches the target location. (The Accept header echoed in the response is purely informative here.) It is then the target's responsibility to route the request according to the Accept header.
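The routing performed at the shared /data/ location can be pictured with the short Python sketch below. The backend names rest-api and rdf-store, the BACKENDS table and the backend_for function are hypothetical labels for the RESTful API application server and the RDF store mentioned above; a real setup would express the same decision in the reverse proxy configuration.

```python
from typing import Optional

# Hypothetical backend labels for the two services behind /data/.
BACKENDS = {
    "application/json": "rest-api",
    "application/rdf+xml": "rdf-store",
}


def backend_for(accept: str) -> Optional[str]:
    """Pick the backend that should serve a /data/ request, if any."""
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in BACKENDS:
            return BACKENDS[media_type]
    return None  # no backend can produce the requested format


print(backend_for("application/json"))     # rest-api
print(backend_for("application/rdf+xml"))  # rdf-store
```

The same target URI can therefore serve both machine-readable formats, as long as the Accept header arrives intact.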

Do We Have To Do All This Every Single Time?

On the Semantic Web, things are made easy for clients (whether humans or machines): they can get a description of the needed resource in the needed format, and this accessibility is preserved over time. Nevertheless, a client can also decide to point directly to the URI it needs instead of the generic one; it could, for example, point directly to the JSON or RDF representation under the /data/ segment, always providing a consistent Accept header.

That said, this approach is discouraged. The front-facing, generic URI for a resource is there to shield clients from future changes. The /data/ or /pages/ segments might no longer exist at some point, due to refactoring or a rearrangement of resources.

The one thing that should hardly ever change is the front-facing URI for that resource.

This guarantees the best resource retrieval outcome for clients using our services.

What About Performance?

Considering all the request evaluation, redirection steps and so on, one could argue that performance is an issue. Indeed, the whole round trip carries some overhead: this design, which focuses on resource retrieval for both humans and machines, needs an efficient, scalable HTTP reverse proxy.

In my personal experience I had great results with Nginx, an HTTP reverse proxy that achieves outstanding performance, but most of the outcome rests on the shoulders of your configuration.

As a rule of thumb, keep in mind the following three guidelines:

  • Keep things as simple as possible, i.e. keep the number of evaluation, redirection and proxying steps performed by your reverse proxy to a minimum
  • Use the best solution available; Nginx is very good, but evaluate the alternatives yourself
  • Maximize the performance of all the moving parts in your service layout (Web server, database, application server, RDF store), with an eye to network latency (e.g. placing all the machines/appliances within the same private network may help a lot)

Conclusions

We have examined how well the HTTP protocol suits the Semantic Web's objectives, and how 303 redirects work at the protocol level.

In the next parts we will examine how to configure the reverse proxy to perform URI forwarding and how to reduce the overhead of the whole process.

Stay tuned for new articles from SpazioCodice!
