The man efficiently provided everyone with their favorite form of sugar, according to who they were and the format they preferred.
The content negotiation and URI forwarding techniques work similarly in the Semantic Web, and we will see how we can get the wanted candies using the HTTP protocol.
Tim Berners-Lee envisioned the Semantic Web as a Web made of data that machines could process, thus setting the axiom that most of that data should be machine-readable.
On the Semantic Web, information is expressed as statements about resources, such as “The members of the band Eurythmics were Annie Lennox and Dave Stewart” and “Annie Lennox has written this song”.
The Resource Description Framework (RDF) made this modeling system its core principle.
Nevertheless, the same URI /people/Federico may identify both my home page and an RDF representation of the resource identified by the string Federico.
RDF, too, uses URIs (Uniform Resource Identifiers) to identify resources, and this brings us to the main question: since URIs can identify both web pages and machine-readable resources, how do we use them so that their identification is unambiguous?
Resource Identifiers, Or What’s In A URI?
On the Semantic Web, URIs can identify Web pages and objects. Such objects may belong to the real world (like planes, people, or books) or be fictitious (like fantasy ideas or creatures, such as hobbits or the Millennium Falcon). The Semantic Web calls them real-world objects or things.
The W3C’s document Cool URIs for the Semantic Web (which, by the authors’ request, must be regarded as a work-in-progress document) sets out two main requirements for the way URIs must be used on the Semantic Web:
1. Be on the Web.
Given only a URI, machines and people should be able to retrieve a description about the resource identified by the URI from the Web. Such a look-up mechanism is important to establish shared understanding of what a URI identifies. Machines should get RDF data and humans should get a readable representation, such as HTML. The standard Web transfer protocol, HTTP, should be used.
2. Be unambiguous.
There should be no confusion between identifiers for Web documents and identifiers for other resources. URIs are meant to identify only one of them, so one URI can’t stand for both a Web document and a real-world object.
On closer inspection, the two requirements seem to conflict: the simplest way to offer a real-world object (like an author or a work) as an RDF or JSON resource and as a web page seems to be to use the same URI for all of them.
How could such resources be differentiated, and how should the system understand which specific resource representation the client is interested in?
This is where content negotiation comes into the picture.
As said, the representations offered for the same resource differ substantially; they are not multiple versions of the same document, though, but different documents altogether.
HTTP has a powerful mechanism for offering different formats and language versions of the same Web document, known as content negotiation.
Generally speaking, when a user agent (such as a browser) makes an HTTP request, it sends along some HTTP headers to indicate what data formats it prefers. The server then selects the best match from its file system, or generates the desired content on demand, and sends it back to the client.
For example, a browser could send this HTTP request to indicate that it wants an HTML or XHTML representation of https://my-site.org/people/Alice in English or German:

GET /people/Alice HTTP/1.1
Host: my-site.org
Accept: text/html, application/xhtml+xml
Accept-Language: en, de
And the server could answer:
HTTP/1.1 303 See Other
Location: https://my-site.org/en/people/alice.html
The Web server is configured to respond to such requests with a 303 status code and a Location HTTP header that provides the URL of a document representing the resource in the requested format (and language, in this case).
That’s what “content negotiation” means: the Web server evaluates the request, inspects the headers sent along with it, and consequently “negotiates” the right match for the requested format and language, returning the proper resource or response.
Since 303 is a redirect status code, the server can use it to give the location of a document that represents the resource. If, on the other hand, a request is answered with one of the usual status codes in the 2XX range, like 200 OK, then the client knows that the URI identifies a Web document or resource.
Thus, content negotiation is, in essence, at the very heart of resource exposure on the Semantic Web.
Later in this article, we’ll see better how URI resolution works using content negotiation in some resource exposure cases.
As shown in the diagram above, upon receiving the request for a resource, the server inspects the headers and the URI to apply the content negotiation and determine the proper destination.
The two most interesting headers the server will inspect are Accept and Accept-Language. The diagram, however, considers the Accept header only.
Notice the final resolution if no format is recognized for the request: a 406 (Not Acceptable) status code is returned.
This is the best resolution strategy for unknown or unhandled format requests since a default format response (say HTML) would not communicate the proper message to clients, which on the Semantic Web should always be format-aware.
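The resolution flow just described can be summarised as a small Python sketch: look at the Accept header, redirect with a 303 when a known format is requested, and answer 406 otherwise. The /pages and /data prefixes are the ones used in the examples later in this article; the TARGET_PREFIX table and the negotiate function are assumptions made for illustration, and q-values are ignored to keep things short.

```python
# Hypothetical mapping from requested media type to target URI prefix.
TARGET_PREFIX = {
    "text/html": "/pages",
    "application/json": "/data",
    "application/rdf+xml": "/data",
}


def negotiate(path: str, accept: str) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a request to a generic resource URI."""
    # Strip parameters such as ";q=0.8" and normalise case.
    requested = [item.split(";")[0].strip().lower() for item in accept.split(",")]
    for media_type in requested:
        prefix = TARGET_PREFIX.get(media_type)
        if prefix is not None:
            return 303, {"Location": prefix + path}
    # No recognised format: do not fall back to a default representation.
    return 406, {}


if __name__ == "__main__":
    print(negotiate("/books/9780780797086", "text/html"))
    print(negotiate("/books/9780780797086", "image/png"))
```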
For our case study, let’s suppose a client requests the URI of a book, say Harry Potter and the Philosopher’s Stone, in its hardcover edition, which in our system has an ID corresponding to the book’s ISBN.
In this case, our client is a Web browser. Naturally, when a web browser requests a resource, among other headers, it sends an Accept header containing the text/html media type (note: the Accept header from browsers usually specifies a comma-separated list of various media types, for example "text/html,application/xhtml+xml,...", but for our example, we’ll keep things simple. Bear with me a little, won’t you?)
This is the HTTP request brought to the Web server:
> GET /books/9780780797086 HTTP/1.1
> Host: my-site.org
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...
> Accept: text/html
As described above, the server inspects the Accept header to serve the wanted resource representation, and this is its response:
< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:29:08 GMT
< Content-Type: text/html
< Content-Length: 216
< Location: https://my-site/pages/books/9780780797086
< Connection: keep-alive
< Accept: text/html
Please note that the requested URI and the target location URI identify the book as a resource. This responds to the requirement that the URI identify the resource and be unambiguous. The target URI has a /pages/ segment within it. This lets us understand that the resource is a Web page.
I know, I know, I can almost hear you complaining: what about SEO? Search engines, and people too, do not care much about numeric codes; they expect URIs to carry intelligible information, expressive enough to be indexed and remembered.
You are right.
But how could we convince the Web server, which by nature ignores the existence of a database behind all this, to resolve the ID into a URI that is more meaningful to humans and search engines and redirect there?
The task can be easily addressed by forwarding the initial request to an application block (Node, PHP, Java, whatever) that can resolve the resource ID into a meaningful URI pattern.
This task can be carried out by the client as well, having access – for instance – to the RESTful API for that resource and performing a subsequent URI transformation and redirection.
In such a case, here’s how our response would change:
< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:31:08 GMT
< Content-Type: text/html
< Content-Length: 261
< Location: https://my-site/pages/books/harry-potter-and-the-philosophers-stone-isbn-9780780797086
< Connection: keep-alive
< Accept: text/html
Example Case #2: JSON
In this case, our client is a machine. The Accept header is set to application/json. Here’s the request:
> GET /books/9780780797086 HTTP/1.1
> Host: my-site.org
> User-Agent: insomnia/2021.4.0
> Accept: application/json
and the response:
< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:29:54 GMT
< Content-Type: application/json
< Content-Length: 229
< Location: https://my-site/data/books/9780780797086
< Connection: keep-alive
< Accept: application/json
As you can see, the /pages/ segment in the URI has been replaced by /data/. We will configure the Web server so the request will be forwarded to our RESTful API application server.
This way, the client will receive a JSON description of the resource.
Example Case #3: RDF
This case does not differ much from the previous one. The main difference is that the target URI will be processed by an RDF store instead of our RESTful API application server:
> GET /books/9780780797086 HTTP/1.1
> Host: my-site.org
> User-Agent: insomnia/2021.4.0
> Accept: application/rdf+xml

< HTTP/1.1 303 See Other
< Server: nginx/1.21.0
< Date: Wed, 07 Jul 2021 17:40:21 GMT
< Content-Type: application/rdf+xml
< Content-Length: 235
< Location: https://my-site/data/books/9780780797086
< Connection: keep-alive
< Accept: application/rdf+xml
What? The target location is the same as the one from the previous example (JSON)! How is this possible?
Simple: notice that the response has an Accept header as well, so we’re explicitly forwarding the needed format information to the target location. It will be the target’s responsibility to route the request according to the Accept header.
Do We Have To Do All This Every Single Time?
On the Semantic Web, things are made easy for clients (whether humans or machines): they can get a description of the resource they need in the format they need, and this accessibility is maintained over time. Nevertheless, a client can also point directly to the URI it needs, bypassing the general one; it could, for example, point directly to the JSON or RDF /data/ URI, always providing a consistent Accept header.
However, this approach is discouraged. The frontal, generic URI for a resource is there to protect clients from being misdirected by future changes. The /data/ or /pages/ segments might no longer be there at some point, due to refactoring or resource rearrangement.
The one thing that should rarely change, as far as the outside world is concerned, is the frontal URI for that resource.
This will guarantee the best resource retrieval outcome for clients using our services.
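For a client written in Python, sticking to the frontal URI simply means building the request against the generic path and stating the desired format in the Accept header. The sketch below uses the standard library's urllib; the frontal_request helper and the default base URL are assumptions for this example, and no network call is made until the request is actually opened.

```python
from urllib.request import Request


def frontal_request(resource_path: str, media_type: str,
                    base: str = "https://my-site.org") -> Request:
    """Build a GET request for the generic URI, asking for one format."""
    return Request(base + resource_path, headers={"Accept": media_type})


if __name__ == "__main__":
    request = frontal_request("/books/9780780797086", "application/json")
    print(request.full_url, request.get_header("Accept"))
    # urllib.request.urlopen(request) would send it; urlopen follows
    # 303 redirects by default, so the client never hard-codes /data/.
```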
What About Performance?
Considering all the request evaluation and assessment, redirection steps, and so on, one could argue that performance is an issue. Indeed, the whole round-trip may carry some overhead, so this design (which focuses on resource retrieval for humans and machines) needs efficient, scalable HTTP reverse proxy solutions.
In my personal experience, I had great results using Nginx, an HTTP reverse proxy that achieves outstanding performance. Still, most of the work in this direction rests on your configuration.
As a rule of thumb, keep in mind the following three guidelines:
- Keep things as simple as possible, i.e., keep the number of evaluation/redirection/proxying steps performed by your reverse proxy to a minimum
- Use the best solution possible; Nginx is very good, but try your own
- Maximize the performance of all the moving parts in your service layout (Web server, database, application server, RDF store), with an eye to network latency (e.g., positioning all the machines/appliances within the same private network may help a lot)
We have been examining how the HTTP protocol is ideal for the Semantic Web’s objectives, and how 303 redirect dynamics work at protocol level.
In the next parts we will examine how to configure the reverse proxy to perform URI forwarding and how to reduce the overhead of the whole process.
Stay tuned for new articles from SpazioCodice!