Uniform Access to Information About

Jonathan Rees
Draft for discussion at TAG F2F (Dec 2008), 25 November 2008.
Latest checked-in version: http://w3.org/2001/tag/doc/more-uniform-access.html
Content is stable, with the exception of the "to be done" list at the end, which I'll grow as the F2F approaches.

Introduction

Objective: Establish a uniform, generally applicable method for a user agent to obtain information about a resource, given a URI that names the resource.

"Information about" a resource may be descriptive, such as physical dimensions, or not, such as license terms.

Terminology

resource: entity or thing ("resource" as used in the RDF Semantics recommendation [RDF])
document: a resource that carries information
representation (of a document): an HTTP/1.1 entity that might be returned in a 200 response to a GET of a URI that names the document (or perhaps some generalization of this idea)
about-information (for a resource): information that is about some resource; similar to description
about-resource (for another resource): a document that carries information about some (other) resource; similar to description resource.

Assumptions

You already care about this problem
We use the HTTP protocol
We use standard HTTP request methods
Rather than specific queries (PROPFIND, SPARQL) we only support the single question "what do you know about it?"
This protocol is only one way to get information about something; there may be others (well-known locations, newspapers, etc.)
The Link: header RFC draft [LINK] will be accepted in something close to its current form

Rationale for some of these assumptions is provided in [UAM].

Protocol

It is proposed to recommend essentially what's already been proposed for POWDER [POWDER-DR], perhaps using a different relation.

Elements of the protocol:

To ask an HTTP server for information about a resource R, do a HEAD (or GET) request giving a URI for R as the request-URI (absolute or relative)
If the response has a Link: header with a "describedby" relation, then obtain a representation for the resource named as the link target
If a representation of the target is obtained, then it should carry information about R

The server is under no obligation to provide a Link: header. If the response has many Link: headers they may be treated as links to many "about-resources."
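
The client side of the second step can be sketched in Python as follows. This is a minimal illustration, not part of the proposal: the function names are hypothetical, the parser handles only the simple `<URI>; name="value"` form of the Link: header rather than the full grammar of [LINK], and accepting the short token "describedby" alongside the full relation URI is an assumption. Issuing the HEAD request itself is left out.

```python
import re

# Relation URI as it appears in the examples of this memo.
DESCRIBEDBY = "http://www.iana.org/assignments/relation/describedby"

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Return a list of (target URI, parameter dict) pairs for one header value."""
    links = []
    for target, params_text in _LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in _PARAM_RE.findall(params_text):
            params[name.lower()] = quoted or bare
        links.append((target, params))
    return links


def describedby_targets(link_header_values):
    """Collect the targets of all describedby links in a response's Link: headers."""
    targets = []
    for value in link_header_values:
        for target, params in parse_link_header(value):
            rels = params.get("rel", "").split()
            # Assumption: both the short token and the full URI are accepted.
            if "describedby" in rels or DESCRIBEDBY in rels:
                targets.append(target)
    return targets
```

An empty result simply means the server offered no describedby link; each returned URI names a candidate "about-resource" to GET.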

Examples:

Obtain metadata about a document.

HEAD /sample/document
Host: example.net

200 OK
Link: <http://example.net/about/sample/document>; rel="http://www.iana.org/assignments/relation/describedby"
Content-type: text/html
Content-length: 31416

followed by GET of http://example.net/about/sample/document

Obtain information about a property (relation).

HEAD /sample/relation
Host: example.net

303 See Other
Location: http://example.net/about/sample/relation
Link: <http://example.net/about/sample/relation>; rel="http://www.iana.org/assignments/relation/describedby"

followed by GET of http://example.net/about/sample/relation

The Link: target URI can be any URI. In these examples, one supposes that this particular server uses a convention of inserting "/about" after the host name in the request URI to obtain the URI of the "about-resource". Other servers may behave differently.

Discussion

URI schemes

Although this protocol is most natural when the original URI and the Link target are http: URIs, in fact they can use any URI scheme. The request-URI in an HTTP request can be an absolute URI of any URI scheme, and the Link target URI, while it must name a document, can also in principle use any URI scheme.

What's at the end of a describedby link?

The target of a describedby (the "about-resource") may be any document that is about the resource. Documents having an RDF/XML representation, or XML representation that can be converted using GRDDL to RDF/XML, probably work best.

A resource can have many about-resources, and a single about-resource can carry information about many resources.

(ISSUE: Because discovery is meant for automated agents, it may be worthwhile for the agent to be assured that the target resource can be converted to RDF. Three ways this might come about: (1) require that the describedby target be convertible-to-RDF, and call it an error if the about-resource does not have such a representation; (2) specify a subproperty (see below) of describedby that is restricted in this way; (3) convey the information that the target document has a representation in some particular format using the "type=" feature of Link: (this is the approach taken by POWDER).)

The use of content negotiation to provide representations that carry fundamentally different messages, such as a document's content and a description of the same document, is discouraged [or incorrect? according to whom? can this restriction be inferred from AWWW?]. Thus if an about-resource yields both text/html and application/rdf+xml representations, say, both representations should communicate more or less the same thing [need normative reference]. Ideally one representation should be a faithful automatic translation of the other. Semantically aligning multiple representations is difficult, and servers not up to this task should not do content negotiation. An alternative to CN is to use transforms (style sheets) to generate one representation (e.g. application/xhtml+xml) from another. This way it is clear, by examining the representation, what information the intended document carries, even if the transform is not semantically neutral.

Interaction with HTTP status codes

A Link: header could be carried by just about any kind of response. For example, a 404, 405, 410 etc. response could carry a link to a document giving Dublin Core metadata and licensing information for the resource, since this information may be available even when the resource itself isn't.

If a 301, 302, or 307 redirect for GET of U names a second URI V (Location:), then U and V may be taken as synonyms, and about-information may be obtained using either URI, or both.

The TAG's httpRange-14 resolution [RANGE] has led to a practice of placing a URI for an about-resource as the target of a 303 redirect. Browsers are automatically directed to the about-resource without the implication (to an agent following-its-nose) that the about-resource's representations are representations of the resource. This use of 303 is compatible with use of the Link: header, as the same about-resource can be named as both the Link: target and the 303 target.

The convention of 303 leading to a description is only a convention - there is no assurance that the document on the far end of a 303 is an about-resource. Link: has the advantages over 303 of clarity (you know its purpose) and attribution (you know who's saying it), and also permits the page seen in the browser to differ from the describedby target. The latter property would be useful, for example, for a journal article, where the URI would denote the article, its about-resource (found via Link:) would provide bibliographic and license metadata, and the 303 target would give an HTML page offering the article's abstract and a choice of delivery methods.
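
How an agent might sort what it learns from a single response, following the distinctions drawn in this section, can be sketched as below. The function and the result keys are hypothetical, and the three-way classification is only one reading of the text: Link: targets are explicit about-resources, 301/302/307 targets are synonyms, and a 303 target is at best a possible about-resource.

```python
# Redirect codes whose Location: may be taken as a synonym for the request URI.
SYNONYM_REDIRECTS = {301, 302, 307}


def interpret_response(status, location=None, describedby=()):
    """Classify the URIs mentioned in one HTTP response.

    status      -- numeric HTTP status code
    location    -- value of the Location: header, if any
    describedby -- targets of describedby links found in Link: headers
    """
    result = {
        "about_resources": list(describedby),  # stated explicitly via Link:
        "synonyms": [],                        # other names for the same resource
        "possible_about_resources": [],        # 303 convention only, no assurance
    }
    if status in SYNONYM_REDIRECTS and location:
        result["synonyms"].append(location)
    elif status == 303 and location and location not in result["about_resources"]:
        result["possible_about_resources"].append(location)
    return result
```

In the journal article example, the 303 target (the abstract page) would land in "possible_about_resources" while the metadata document named in Link: would land in "about_resources".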

Some 200 responses may carry information about the resource as part of their payload (content). If so, and the information is meant to be used (not draft, obsolete, etc.), then a self-referring Link: header might be used to indicate this fact.

Interaction with LINK element

One cannot rely on any particular agent using the Link:/describedby protocol. In order to increase the likelihood that important information such as licensing terms is communicated to all parties that need it, the HTML <link> element and similar mechanisms such as XMP should be used redundantly with Link: when such a mechanism exists for a representation's media type.
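
An agent that wants to honor both mechanisms could also look inside an HTML representation for the redundant link. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and the assumption that the HTML link carries rel="describedby" follows the POWDER usage rather than anything fixed by this memo.

```python
from html.parser import HTMLParser


class DescribedByLinkFinder(HTMLParser):
    """Collects href values of <link> elements whose rel includes describedby."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "describedby" in rels and href:
            self.targets.append(href)


def describedby_links_in_html(html_text):
    """Return the describedby targets named by <link> elements in html_text."""
    finder = DescribedByLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.targets
```

The targets found this way can be merged with those found in Link: headers before any about-resource is fetched.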

Choice and specification of the particular Link: relation

Following [POWDER-DR], I've used "describedby", defined to mean that the target is a resource that carries information about the source. We need to decide whether this will be adequate.

The describedby relation is similar to "meta" as defined by the RDFa recommendation [RDFa], which says that the target of a "meta" link is metadata. "Metadata" is defined in most dictionaries to be data about data. This is not consistent with the relation we need, which needs to be uniformly applicable to data and non-data. Either we have to convince everyone that "metadata" doesn't need to be about data, or we need to take it to be a term of art. Neither of these approaches is appealing.

In the unlikely event of a design rift with POWDER, this protocol would have to invent a new relation. But even if this happens we may be able to arrange for one relation to be a subproperty of the other (see below).

Subproperties of describedby

[To be done... discovery, lifetime, cost, likelihood of being understood, and so on]

Reducing the number of round trips

Getting information about a resource using the above protocol requires at least two HTTP round trips: a HEAD (or GET) request to obtain the HTTP Link: header, and then a GET of the target URI to obtain a representation carrying the information about the resource. Although an application that looks up a lot of these documents should probably seek a different implementation strategy (such as locating a SPARQL endpoint providing the same information), it may be worthwhile to find ways to reduce the number of round trips.

Of course, either HTTP response may be cached according to the rules for the protocol, in which case the number of round trips is reduced. If a single about-resource is referenced in many Link: headers, it need not be fetched every time it is needed. And clearly we can cache describedby links.
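
The effect of caching both steps can be illustrated with a small sketch. Everything here is hypothetical: the class name, the two injected callables (one returning describedby targets for a URI, one returning an about-resource's content, each paired with a lifetime in seconds), and the simple expiry model, which stands in for the real HTTP caching rules rather than implementing them.

```python
import time


class DescribedByCache:
    """Remembers describedby links and about-resource content for a limited time."""

    def __init__(self, lookup_links, fetch_document, clock=time.monotonic):
        self._lookup_links = lookup_links      # uri -> (list of targets, lifetime)
        self._fetch_document = fetch_document  # target -> (content, lifetime)
        self._clock = clock
        self._links = {}       # uri -> (expiry time, targets)
        self._documents = {}   # target -> (expiry time, content)

    def _cached(self, store, key, loader):
        now = self._clock()
        entry = store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                    # still fresh: no round trip
        value, lifetime = loader(key)          # one round trip
        store[key] = (now + lifetime, value)
        return value

    def about(self, uri):
        """Return the content of every about-resource linked from uri."""
        targets = self._cached(self._links, uri, self._lookup_links)
        return [self._cached(self._documents, t, self._fetch_document)
                for t in targets]
```

When many URIs link to the same about-resource, only the first lookup pays for the second round trip; later lookups reuse the stored content until it expires.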

Nothing says that HEAD/Link:/GET is the only way to get information about something, so more efficient alternatives are certainly possible.

One way to eliminate the first round trip is to apply a generic rule for obtaining an about-resource URI given a URI. The ARK protocol [cite] has such a rule: append ? to the end of the resource's URI to get a URI for its about-resource.

Establishing a uniform rule applicable to all URIs is unlikely to work, due to extreme variation in the way that web sites are administered. (It has been suggested to promote a single rule that could work most of the time. The rule could then be applied speculatively, on the theory that it will either succeed or 404 almost all of the time. Sites that can't use the rule would fall back to looking for a Link:.)

Site-specific rules are a much more realistic possibility. The ARK ? convention could be communicated by each site using it, as could other conventions such as /about/ or the ,about suffix. The question is how such rules can be discovered. A solution would have to be immune to phishing and similar attacks. Perhaps proposals for site metadata [SITE] point in the right direction. [work in progress]
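
The three conventions just mentioned are simple enough to write down as URI-to-URI functions. The sketch below is illustrative only: the function names and the per-host table are hypothetical, the exact placement of "/about" and ",about" within the URI is an assumption, and how a client would come to know (and trust) the table is exactly the open discovery question.

```python
from urllib.parse import urlsplit, urlunsplit


def ark_rule(uri):
    """ARK-style convention: append ? to the resource's URI."""
    return uri + "?"


def about_prefix_rule(uri):
    """Insert /about after the host name, before the original path."""
    p = urlsplit(uri)
    return urlunsplit((p.scheme, p.netloc, "/about" + p.path, p.query, p.fragment))


def about_suffix_rule(uri):
    """Append ,about to the path (assumed placement of the suffix)."""
    p = urlsplit(uri)
    return urlunsplit((p.scheme, p.netloc, p.path + ",about", p.query, p.fragment))


def guess_about_uri(uri, site_rules):
    """Apply the rule registered for the URI's host, or return None if there is none."""
    rule = site_rules.get(urlsplit(uri).hostname)
    return rule(uri) if rule else None
```

For example, with a table mapping "example.net" to about_prefix_rule, the URI http://example.net/sample/document yields http://example.net/about/sample/document without a prior HEAD request; a host missing from the table yields None, and the client falls back to looking for a Link: header.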

Regardless of how a rule comes to be known, it would be nice if there were at least a standard way to write such rules, to make it easier to transmit, share, and use them. Such a notation has been proposed at least once [Rules].

Trust issues

There is nothing magical about an about-resource; the buyer must beware. Servers (and entities that they speak for) can make mistakes, or can even intentionally mislead as part of some kind of deception.

Nothing says that the information is true

One risk is inconsistency between about-resource representations. There may be more than one about-resource for a resource R, potentially one for each URI that denotes R, or even multiple about-resources for the same URI (named in multiple Link: headers in one or more responses). In addition, each about-resource may have multiple representations. It is possible for contradictions to arise between these information sources. This is clearly a mistake, one that unfortunately is not always possible to detect.

Another risk is inconsistency between what the information says about a resource's representations and the representations actually delivered. For example, suppose the naming authority (NA) says that R's author (or more pedantically the author of the content of R's representations) is Charles Dickens, but a GET on R's URI U retrieves content that is by George Eliot.

In general, statements made in about-resources need to be treated with the same skepticism as statements coming from any other source. Where there are security risks to believing a statement, belief in statements made by the NA should be subject to authorization barriers in a manner similar to any other kind of verified request.

Attributing information to an agent

This theory is not fully developed. Skip this section and the next if you don't know, or don't care, about attribution and authority.

It may be better to omit this discussion entirely ("authority" is a hot potato), but I don't want to lose the opportunities presented without considering them carefully.

The general question of how to interpret HTTP responses declaratively, and to whom such interpretations should be attributed, is beyond the scope of this memo, as are general questions of authentication and authorization. However, we can say a little bit about the attribution of about-information, and the authority it carries.

Some applications may be able to use attribution to enable the use of about-information for various purposes. In particular, it is important to be able to attribute information to the request URI's naming authority (or URI owner). Information obtained using Link:/describedby/GET can be attributed to a URI's naming authority if the server that was contacted can speak for the naming authority. This would usually be true, for example, if the request-URI's host part specified the host that was contacted, but it may be true in other situations as well. Information carried by a describedby target named in a Link: header could be attributed to the naming authority, as could other information carried in the HTTP response (but not necessarily the content of a 200 response). The information carried by HTTP headers and status codes should be interpreted according to applicable RFCs.

Entities such as proxy servers may also be authorized to speak for an agent such as a naming authority. In general "speaking for" may be highly delegated. Each application needs to establish a trust chain in a manner suited to its tolerance for risk.

[Mention POWDER approach of using end-to-end authentication of the description resource.]

Naming authority

["The authority to name" would be better section name]

One application for which attribution matters is that of obtaining information bearing on what a URI is supposed to "refer to". This is the subject on which the URI's naming authority (as described in [AWWW]) has authority to speak.

Having attributed information to an agent, one must decide whether one should believe it. For example, if the information implies some kind of request ("Joe would like to buy a refrigerator"), we need to know whether to try to satisfy the request. Obviously this is application dependent. The authority conferred by the role of naming authority is limited to the right to speak regarding the binding of the URI. The URI's NA has no particular authority relating to other URIs, or concerning any other matters. Thus the only part of what is said in an "about-resource" that may be considered "authoritative" is the part that bears on the binding of the URI.

"The right to speak" does not constitute a right to be believed. While agents should believe what the NA says when it bears on the URI's binding, an agent receiving information attributed to naming authority is entitled to ignore it if necessary (e.g. the information is false or contradictory, or the agent needs to make sense of documents that use the URI in a different way).

Ack

Thanks for comments on earlier versions from Alan Ruttenberg, Stuart Williams, Ashok Malhotra, and Phil Archer.

References

[LINK] Mark Nottingham. HTTP Header Linking. RFC draft, 2008.
[RDF] Pat Hayes, editor. RDF Semantics. W3C Recommendation, 2004.
[HTTP] Fielding et al. Hypertext Transfer Protocol -- HTTP/1.1. RFC 2616.
[AUTH] Fielding and Jacobs. Authoritative Metadata. W3C TAG finding, 2006.
[POWDER-DR] Phil Archer et al, editors. Protocol for Web Description Resources (POWDER): Description Resources. W3C working draft, November 2008.
[RDFa] Ben Adida et al, editors. RDFa in XHTML: Syntax and Processing, section 9.3. W3C recommendation, 2008.
[AWWW] Jacobs and Walsh, editors. Architecture of the World Wide Web, Volume One. W3C recommendation, 2004.
[RANGE] Roy Fielding. httpRange-14 Resolved. Email, 2005.
[Ambiguity] Pat Hayes and Harry Halpin. In defense of ambiguity. International Journal on Semantic Web and Information Systems 4(2).
[SITE] Mark Nottingham and Eran Hammer-Lahav. Site-Wide Metadata for the Web. RFC draft, October 2008.
[Rules] Jonathan Rees. Documentation source override. Wiki page, Creative Commons, 2008.
[UAM] Jonathan Rees. Uniform Access to Metadata. Memo to TAG, May 2008.
[Discovery] Eran Hammer-Lahav. Discovery and HTTP. Blog post, 2008.
[DECL] David Booth. URI Declaration Versus Use. Memo, 2008.

To be done

Make it clear from the outset that this works equally well with URIs naming all kinds of things, including people (in the FOAF sense) and properties (in the RDF sense), and not just documents or network resources. The URI needn't be "resolvable" and the resource needn't be "accessible" in order for an HTTP server to provide useful information about it.
Make it clearer that an http: URI can similarly name any kind of thing, and a 3xx response doesn't imply that the thing has been "accessed".
Make it clear that 'describedby' is not POWDER-specific.
Link header draft is now at version 3.
Report on what Eran Hammer-Lahav is up to. He and Mark Nottingham seem to be doing the heavy lifting here.
Explain what happens when <link> and Link: disagree. The cause could be either a mistake, which should be fixed, or a principled rejection of the authority of the 200 response, as when an old draft of a document is archived. (The archival copy must preserve link elements even when they're wrong.)