
URIs 303 redirection for the Semantic Web

When I think of the Semantic Web and content negotiation, I always remember some episodes from when I was a kid.
Often, during the summer holidays, I accompanied my father to the market to buy groceries. I was happy to help, not least because I knew there would also be a chance to visit the candy shop or the comic shop, or both.
At the candy shop, I was impressed by how well the man at the counter knew every customer and how quickly he handed over their usual sweets. Of course, we were no exception, and I always left the shop quickly with my chocolate loot.

The man efficiently provided everyone with their favorite form of sugar, according to who they were and which format they preferred.

Content negotiation and URI forwarding work similarly on the Semantic Web, and we will see how to get the candies we want using HTTP.

Overview

Tim Berners-Lee envisioned the Semantic Web as a Web made of data that machines could process, thus setting the axiom that most of that data should be machine-readable.

On the Semantic Web, information is expressed as statements about resources, such as “The members of the band Eurythmics were Annie Lennox and Dave Stewart” and “Annie Lennox has written this song.”

The Resource Description Framework (RDF) made this modeling system its core principle.
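
As a rough illustration, the two statements above can be written as subject-predicate-object triples, which is the shape RDF gives to every statement. The sketch below uses plain Python tuples rather than an RDF library; the example.org URIs and the predicate names are made up for this illustration and do not come from any real vocabulary.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs, chosen only for this illustration.
BASE = "https://example.org"

TRIPLES = [
    Triple(f"{BASE}/bands/Eurythmics", f"{BASE}/vocab/member", f"{BASE}/people/AnnieLennox"),
    Triple(f"{BASE}/bands/Eurythmics", f"{BASE}/vocab/member", f"{BASE}/people/DaveStewart"),
    Triple(f"{BASE}/people/AnnieLennox", f"{BASE}/vocab/wrote", f"{BASE}/songs/this-song"),
]


def objects_of(subject: str, predicate: str) -> List[str]:
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in TRIPLES if t.subject == subject and t.predicate == predicate]
```

Asking for the members of the band, for instance, is a matter of calling objects_of with the band URI and the member predicate. Note that every part of each statement is itself a URI.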

RDF uses URIs (Uniform Resource Identifiers) to identify resources. However, the same URI, say /people/Federico, may identify both my home page and an RDF representation of the resource identified by the string Federico.

This brings us to the main question: since URIs can identify both web pages and machine-readable resources, how do we use them so that their identification is unambiguous?

Resource Identifiers, Or What’s In A URI?

On the Semantic Web, URIs can identify Web pages and objects. Such objects may belong to the real world (like planes, people, or books) or be fictitious (like fantasy ideas or creatures, such as hobbits or the Millennium Falcon). The Semantic Web calls them real-world objects or things.

The W3C document Cool URIs for the Semantic Web (which, at the authors’ request, should be regarded as a work in progress) sets out two main requirements for the way URIs are used on the Semantic Web:

1. Be on the Web.

Given only a URI, machines and people should be able to retrieve a description about the resource identified by the URI from the Web. Such a look-up mechanism is important to establish shared understanding of what a URI identifies. Machines should get RDF data and humans should get a readable representation, such as HTML. The standard Web transfer protocol, HTTP, should be used.

2. Be unambiguous.

There should be no confusion between identifiers for Web documents and identifiers for other resources. URIs are meant to identify only one of them, so one URI can’t stand for both a Web document and a real-world object.

On closer inspection, the two requirements seem to conflict: the most convenient way to expose a real-world object (like an author or a work) both as an RDF or JSON resource and as a web page seems to be to use the same URI for all of them.

How could such resources be differentiated, and how should the system understand which specific resource representation the client is interested in?

This is where content negotiation comes into the picture.

Content Negotiation

As noted, the representations offered for the same resource differ substantially: they are not multiple versions of the same document but different documents altogether.

HTTP has a powerful mechanism for offering different formats and language versions of the same Web document, known as content negotiation.

Generally speaking, when a user agent (such as a browser) makes an HTTP request, it sends along some HTTP headers to indicate what data formats it prefers. The server either selects the best match from its file system or generates the desired content on demand, and sends it back to the client.

For example, a browser could send this HTTP request to indicate that it wants an HTML or XHTML representation of https://my-site.org/people/Alice in English or German:

				
GET /people/Alice HTTP/1.1
Host: my-site.org
Accept: text/html, application/xhtml+xml
Accept-Language: en, de
				
			

And the server could answer:

				
					HTTP/1.1 303 See Other
Location: https://my-site.org/en/people/alice.html
				
			

The Web server is configured to respond to such requests with a 303 status code and a Location HTTP header that provides the URL of a document representing the resource in the requested format (and language, in this case).

That’s what “content negotiation” means: the Web server evaluates the request, inspects the headers sent along with it, and consequently “negotiates” the right match for the requested format and language, returning the proper resource or response.
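
To make the “negotiation” step more concrete, here is a minimal sketch of how a server might read an Accept header and pick one of the formats it can offer. It is a simplification written for this article, not how any particular Web server does it: the function names are my own, and the matching ignores the precedence rules HTTP defines for more specific media ranges.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so equal q-values keep the order the client sent.
    return sorted(entries, key=lambda e: e[1], reverse=True)


def best_match(header: str, offered: List[str]) -> Optional[str]:
    """Return the first offered media type the client accepts, or None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        for candidate in offered:
            if media_type in (candidate, "*/*") or (
                media_type.endswith("/*") and candidate.startswith(media_type[:-1])
            ):
                return candidate
    return None
```

With the request above, best_match("text/html, application/xhtml+xml", ["application/rdf+xml", "text/html"]) picks text/html, while a client that only accepts image/png gets None.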

Since 303 is a redirect status code, the server can use it to give the location of a document that represents the resource. If, on the other hand, a request is answered with one of the usual status codes in the 2XX range, like 200 OK, then the client knows that the URI identifies a Web document.

Content negotiation is thus at the very heart of resource exposure on the Semantic Web.

Later in this article, we’ll look more closely at how URI resolution works using content negotiation in some resource exposure cases.

As shown in the diagram above, upon receiving the request for a resource, the server inspects the headers and the URI to apply the content negotiation and determine the proper destination.

The two most interesting headers the server will inspect are Accept and Accept-Language. The diagram, however, considers the Accept header only.

Notice the final resolution if no format is recognized for the request: a 406 (Not Acceptable) status code is returned.

This is the best resolution strategy for unknown or unhandled format requests since a default format response (say HTML) would not communicate the proper message to clients, which on the Semantic Web should always be format-aware.
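
The resolution flow just described can be sketched in a few lines. This is only an illustration under my own assumptions: the TARGETS table and the resolve function are hypothetical, the /pages and /data prefixes anticipate the examples later in this article, and q-values are ignored for brevity.

```python
from typing import Optional, Tuple

# Hypothetical mapping from media type to target URI prefix,
# modeled on the /pages/ and /data/ segments used in this article.
TARGETS = {
    "text/html": "/pages",
    "application/json": "/data",
    "application/rdf+xml": "/data",
}


def resolve(path: str, accept: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a request to a generic resource URI."""
    for part in accept.split(","):
        # Drop parameters such as ";q=0.9" and normalize the media type.
        media_type = part.split(";")[0].strip().lower()
        if media_type in TARGETS:
            return 303, "https://my-site.org" + TARGETS[media_type] + path
    # No recognized format: tell the client so instead of guessing a default.
    return 406, None
```

A request for /books/9780780797086 with Accept: text/html resolves to a 303 pointing at the /pages/ URI, while an unhandled format such as image/png ends in a 406.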

Examples

For our case study, let’s suppose a client requests the URI of a book, say the hardcover edition of Harry Potter and the Philosopher’s Stone, which in our system has an ID corresponding to the book’s ISBN:

				
					https://my-site.org/books/9780780797086
				
			

Example Case #1: HTML

In this case, our client is a Web browser. Naturally, when a web browser requests a resource, it sends, among other headers, an Accept header containing the text/html media type. (Note: the Accept header from browsers usually specifies a comma-separated list of various media types, for example "text/html,application/xhtml+xml,...", but for our example, we’ll keep things simple. Bear with me a little, won’t you?)

This is the HTTP request sent to the Web server:

				
					> GET /books/9780780797086 HTTP/1.1
> Host: my-site.org
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...
> Accept: text/html
				
			

As described above, the server inspects the Accept header to serve the wanted resource representation. This is its response:

				
					< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:29:08 GMT
< Content-Type: text/html
< Content-Length: 216
< Location: https://my-site.org/pages/books/9780780797086
< Connection: keep-alive
< Accept: text/html
				
			

Please note that the requested URI identifies the book itself, while the target location URI identifies a document about the book. This satisfies the requirement that a URI identify only one resource and be unambiguous. The /pages/ segment in the target URI lets us understand that the target is a Web page.

I know, I know, I can almost hear you complaining: what about SEO? Search engines, and people too, do not care much for numeric codes; they expect URIs to carry intelligible information, expressive enough to be indexed and remembered.

You are right.

But how could we convince the Web server, which by nature ignores the existence of a database behind all this, to resolve the ID into a URI that is more meaningful to humans and search engines and redirect there?

The task can be easily addressed by forwarding the initial request to an application block (Node, PHP, Java, whatever) that can resolve the resource ID into a meaningful URI pattern.
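
As a sketch of what such an application block might do, the snippet below turns a book ID into a human-readable target URI. Everything in it is assumed for illustration: the TITLES dictionary stands in for the database lookup, and the slug pattern simply mirrors the title-plus-ISBN shape used in this article.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical in-memory catalogue standing in for the database lookup.
TITLES = {"9780780797086": "Harry Potter and the Philosopher's Stone"}


def slugify(text: str) -> str:
    """Lowercase, strip accents and apostrophes, and join words with hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    ascii_text = ascii_text.replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def page_location(isbn: str) -> Optional[str]:
    """Build the redirect target for a known ISBN, or None if it is unknown."""
    title = TITLES.get(isbn)
    if title is None:
        return None
    return f"https://my-site.org/pages/books/{slugify(title)}-isbn-{isbn}"
```

The value returned by page_location would then go into the Location header of the 303 response; an unknown ID would more naturally end in a 404.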

The client can carry out this task as well if it has access, for instance, to the RESTful API for that resource, performing the URI transformation and redirection itself.

In such a case, here’s how our response would change:

				
					< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:31:08 GMT
< Content-Type: text/html
< Content-Length: 261
< Location: https://my-site.org/pages/books/harry-potter-and-the-philosophers-stone-isbn-9780780797086
< Connection: keep-alive
< Accept: text/html
				
			

Example Case #2: JSON

In this case our client is a machine. The Accept header is set to application/json. Here’s the request:

				
					> GET /books/9780780797086 HTTP/1.1
> Host: my-site.org
> User-Agent: insomnia/2021.4.0
> Accept: application/json
				
			

and the response:

				
					< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:29:54 GMT
< Content-Type: application/json
< Content-Length: 229
< Location: https://my-site.org/data/books/9780780797086
< Connection: keep-alive
< Accept: application/json
				
			

As you can see, the /pages/ segment in the URI has been replaced by /data/. We will configure the Web server so the request will be forwarded to our RESTful API application server.

This way, the client will receive a JSON description of the resource.

Example Case #3: RDF

This case does not differ much from the previous one. The main difference is that the target URI will be processed by an RDF store instead of our RESTful API application server:

				
					> GET /books/9780780797086 HTTP/1.1
> Host: my-site.org
> User-Agent: insomnia/2021.4.0
> Accept: application/rdf+xml
				
			
				
					< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:40:21 GMT
< Content-Type: application/rdf+xml
< Content-Length: 235
< Location: https://my-site.org/data/books/9780780797086
< Connection: keep-alive
< Accept: application/rdf+xml
				
			

What? The target location is the same as the one from the previous example (JSON)! How is this possible?

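Simple: when the client follows the redirect, it normally sends the same Accept header again, so the needed format information reaches the target location, which is then responsible for routing the request accordingly. The Accept header echoed in the response is only a reminder of the negotiated format; the target never sees response headers, only those of the client’s follow-up request.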

Do We Have To Do All This Every Single Time?

On the Semantic Web, things are made easy for clients (whether humans or machines): they can get a description of the resource they need in the format they need. Nevertheless, a client can also skip the generic URI and point directly to the one it needs; it could, for example, request the JSON or RDF representation under the /data/ URI segment, always providing a consistent Accept header.

However, this approach is discouraged. The front-facing, generic URI for a resource is there to protect clients from being misdirected by future changes. The /data/ or /pages/ segments may no longer exist at some point due to refactoring or resource rearrangement.

The only thing that should rarely, if ever, change is the front-facing URI for that resource.

This will guarantee the best resource retrieval outcome for clients using our services.
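
A client that always starts from the generic URI could look roughly like the sketch below. To keep it runnable without a network, the HTTP call is injected as a function; locate_document, the Transport type, and fake_send are all names I made up for this example, and fake_send only mimics the book example from this article.

```python
from typing import Callable, Dict, Tuple

# (status code, response headers), as returned by the injected transport.
Response = Tuple[int, Dict[str, str]]
Transport = Callable[[str, Dict[str, str]], Response]


def locate_document(uri: str, accept: str, send: Transport, max_hops: int = 5) -> str:
    """Start from the generic URI and follow 303 redirects to the document URI."""
    request_headers = {"Accept": accept}
    for _ in range(max_hops):
        status, response_headers = send(uri, request_headers)
        if status == 303:
            # Same Accept header on the next hop, new target URI.
            uri = response_headers["Location"]
        elif 200 <= status < 300:
            return uri
        else:
            raise RuntimeError(f"unexpected status {status} for {uri}")
    raise RuntimeError("too many redirects")


def fake_send(uri: str, headers: Dict[str, str]) -> Response:
    """Stand-in for a real HTTP call, mimicking the article's book example."""
    if uri == "https://my-site.org/books/9780780797086":
        return 303, {"Location": "https://my-site.org/data/books/9780780797086"}
    if uri.startswith("https://my-site.org/data/") and headers.get("Accept") == "application/json":
        return 200, {"Content-Type": "application/json"}
    return 406, {}
```

Because the client only ever hard-codes the generic URI, a later move of the /data/ segment would change the Location it receives but not the client code.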

What About Performance?

Considering all the request evaluation, redirection steps, and so on, one could argue that performance is an issue. Indeed, the whole round trip carries some overhead, so this design, which focuses on resource retrieval for humans and machines, needs an efficient, scalable HTTP reverse proxy.

In my personal experience, I had great results using Nginx, an HTTP reverse proxy that achieves outstanding performance. Still, most of the work in this direction rests on your configuration.

As a rule of thumb, keep in mind the following three guidelines:

  • Keep things as simple as possible, i.e., keep the number of evaluation/redirection/proxying steps performed by your reverse proxy to a minimum
  • Use the best solution you can; Nginx is very good, but evaluate the alternatives yourself
  • Maximize the performance of all the moving parts in your service layout (Web server, database, application server, RDF store), with an eye to network latency (e.g., placing all the machines/appliances within the same private network may help a lot)

Conclusions

We have examined why HTTP is well suited to the Semantic Web’s objectives and how 303 redirects work at the protocol level.

In the next parts we will examine how to configure the reverse proxy to perform URI forwarding and how to reduce the overhead of the whole process.

Stay tuned for new articles from SpazioCodice!
