AWWSW Status Report

1 February 2011

Currently active AWWSW members: Michael Hausenblas, David Booth, Nathan Rixham, Jonathan Rees. This report was prepared by JR with help from the others; any shortcomings are JR's fault.

Genesis of AWWSW

The so-called AWWSW 'task force' was formed at a joint TAG/HCLS meeting 2007-11-05. ('AWWSW' facetiously stands for 'architecture of the world wide semantic web.') The SWHCLSIG (Semantic Web Health Care and Life Sciences Interest Group) was investigating the use of RDF and had encountered a few glitches. Some SWHCLSIG participants came to the TAG with these concerns:

Before the httpRange-14 resolution, there appeared to be no prior meaning to HTTP, i.e. what a URI refers to in RDF seemed to have no particular relationship to what you get when you GET it - basically RDF users could do whatever they liked with HTTP. The resolution appears to hold users of RDF accountable for something, but it doesn't make it clear what. The stated requirement that a 2xx response implies something is an information resource only introduces FUD, without really explaining anything.
The apparently normative definition of 'information resource' (AWWW) is at best untestable. Actually it doesn't make any sense at all. It seems to be a product of design by committee.
If I use a URI to refer to a journal article, and the URI yields a 200 response yielding some particular encoding of the article (say HTML or PDF), is that "OK"? Is there an interoperability risk? If the URI instead yields a 'landing page' for the article, is that any different - is there a different risk?
In general, if I use a URI to refer to something, for what aspects of HTTP behavior am I going to be held accountable?
The question is only further muddied by the use of 301/302/307 redirects, which could be interpreted as providing plausible deniability.
GET/303 at the time wasn't documented anywhere (now it is, in the HTTPbis draft) and its intended use was, and remains, unclear.
Another HCLS concern that was raised was persistence, which matters if RDF is to be used in the scholarly record. On the one hand the TAG discouraged use of non-http: URIs, while on the other it gave no explanation of how anyone would make an http: URI be as "persistent" as a URN. HCLS suggested fallback resolution methods as a way for the community to gain confidence in http: URIs and wanted TAG consideration of this approach.

The outcome of the discussion was that an informal group was chartered to discuss "HTTP semantics".

A motivating use case

The following is not the only use case, but it surfaces the most important issue. A similar scenario could be constructed around any kind of confusion over what the URI refers to.

Bob composes a document, which we'll call R, and arranges for HTTP responses to GET requests with target URI 'http://example/e' to yield 200 responses carrying R. Among other things, R (and these responses) contain the following:

  <div about="http://example/e">
    This document, by Edward Example, is licensed under a
    <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
    Creative Commons Attribution License</a>.
  </div>

The bit of RDFa means that a document, referred to in R as http://example/e, is licensed as specified. In N3 or Turtle, this would be written

  <http://example/e> xhtml:license
    <http://creativecommons.org/licenses/by/3.0/> .
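
As a rough illustration of how a consumer might read only the license statement, the following Python sketch pulls (subject, license) pairs out of markup like the above. It is not a conforming RDFa processor: the names LicenseExtractor and extract_license_triples are hypothetical, and the sketch assumes well-nested markup in which the subject is simply the nearest enclosing 'about' attribute.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LicenseExtractor(HTMLParser):
    """Collects (subject, license) pairs from rel="license" links.

    Hypothetical helper for illustration only. It tracks the nearest
    enclosing `about` attribute and ignores the rest of RDFa (prefixes,
    base URI, chaining). It assumes every start tag has a matching end
    tag, so unclosed void elements such as <br> would confuse it.
    """

    def __init__(self) -> None:
        super().__init__()
        self.subjects: List[Optional[str]] = []  # one entry per open element
        self.triples: List[Tuple[Optional[str], str]] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Inherit the subject from the parent unless this element sets one.
        parent = self.subjects[-1] if self.subjects else None
        self.subjects.append(attrs.get("about", parent))
        rel_values = (attrs.get("rel") or "").split()
        href = attrs.get("href")
        if tag == "a" and "license" in rel_values and href:
            self.triples.append((self.subjects[-1], href))

    def handle_endtag(self, tag):
        if self.subjects:
            self.subjects.pop()


def extract_license_triples(markup: str) -> List[Tuple[Optional[str], str]]:
    """Return every (subject URI or None, license URI) pair found in markup."""
    parser = LicenseExtractor()
    parser.feed(markup)
    parser.close()
    return parser.triples
```

Applied to the div above, this yields the single pair ('http://example/e', 'http://creativecommons.org/licenses/by/3.0/'). Nothing in the extraction says whether 'http://example/e' names the document carrying the statement or something else.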

Alice, reading this RDFa and wishing to republish R, understands that she has a license to republish R. Later, Bob discovers that Alice has republished R and confronts her. (For simplicity in the following, let's assume that Bob and Alice have found a mutually understood way to refer to R.)

Bob: Hey, Edward didn't give you a license to republish R.

Alice: He said that the license applied to R!

Bob: No he didn't, he said it applied to S, the one all this information in R is about.

Alice: But 'http://example/e' refers to R, not S! I didn't even know about S.

Bob: No, 'http://example/e' refers to S.

Alice: That's ridiculous - you don't use 'http://example/e' to refer to R? That's the customary practice.

Bob: No, nobody told Edward he had to, and anyhow why would you want to refer to R? The interesting document is S.

Alice: How on earth was I supposed to know you meant S instead of R?

Bob: You were supposed to read R. It's obvious if you do that - it wouldn't make any sense if you took 'http://example/e' to refer to R. See, it says right here that <http://example/e> is about penguins, and R is obviously not about penguins.

Alice: But I'm running a stupid search engine. It only looks at the license statement. You can't expect it to read and understand the document so that it knows the URI is supposed to mean S instead of R!

Bob: Yes, I can, it's my URI so I get to use it however I like.

Alice: If I let you play games like this with this URI, how can I ever hold anyone accountable for anything they say in RDF?

Bob: That's your problem. Maybe you're using the wrong technology.

What do you think?

It's easy to poke holes in this story, but the point is the form of the story, not the details. Meaning is burden of proof. If there's no accountability (or as Alan Ruttenberg says, no way to be wrong) there is no semantics.

(Extra credit: If 'http://example/e' refers to S, how would one refer to R?)

Work areas current and planned

In general the 'task force' has been doing three kinds of analysis: empirical (current and current 'best' practice), specification-based, and speculative (future and future 'best' practice).

In all cases the goal is a story that makes sense, an expression of it in some logic (RDF or OWL), and eventually a published set of axioms (i.e. "ontology"). But it is very hard to say anything that makes any sense in this area.

Current project: Meaning of successful dereference. If I use a URI 'http://example/e' to refer to X, and 'GET http://example/e' dereferences to a 'representation' Z, for what properties of X am I likely to be held to account? For example, if I say 'http://example/e' refers to Moby Dick, and Z is in HTML, probably no one will hold me at fault, even though being in HTML is not a property of Moby Dick. On the other hand, if Z consists of the poem Jabberwocky, I can be found at fault. So what exactly am I responsible for?
Next up: Meaning of redirect. If I use 'http://example/e' to refer to X and there is a redirect, for what might I be held accountable? Consider possible interpretations of the 30x as expressing relationships between the two URIs and the two resources (or maybe one). There may be several candidate interpretations, so the aim is not to decide anything but to offer choices.
Soon: Meaning of various other HTTP exchanges (PUT, 410, and so on).

Nearby work areas

Nose-following

We have postponed consideration of nose-following. The idea is to describe particular methods for using the Web as a 'dictionary' for readers and speakers of RDF (and similar languages) and perhaps to nominate one or more as potential 'best practices'. Such documentation would be an operational follow-on to The Self-Describing Web finding and the unexplained figure at the end, and would serve the semantic web and linked data communities and anyone else who cares about automating nose-following.

The new RDF WG charter raises the question of documenting nose-following. As this is a webarch issue and not specific to RDF, it would be appropriate for the TAG or a task force to address it.

The new Web Linking standard is relevant to nose-following, especially given the issues raised by AWWSW. If a URI owner wants to say something about a document, and the page will be untrusted or can't be edited to add embedded metadata, there ought to be some other way to say it, and .well-known and Link: provide this. [refer to previous TAG work here]

Fragid semantics

Fragids are certainly a concern to those in the group, and as the TAG knows are semantically troublesome, but we have not really investigated them yet.

Persistence

We have not been addressing this issue directly. Jonathan has been pursuing it separately under TAG ISSUE-50. There is some overlap.

Semantic web

Some in the group have expressed interest in documenting how the semantic web works, but others resist this suggestion. We have agreed to postpone consideration until after we've dealt with the metadata (200) issue.

Some analysis

We have been stuck on the question of how dereference bears on reference. There are two reasons for this. One is the number of 'moving parts' in the combination HTTP + webarch, and the other is the complex set of constraints that seem to be imposed on a solution.

Note that metadata is not always being written by the URI owner. The metadata author may be gambling on stability (no conneg or sessions) and consistency (continuity of content over time). This is their business. However, they will be held responsible for the correctness of the metadata, so the risk of getting it wrong needs to be in inverse proportion to the size of the wager.

Some of our test cases for Web URIs used referentially: (by 'Web URI' I mean those that can be dereferenced, in 3986 terminology - usually GET http: yielding 200)

Simple information resources (no conneg, never changes)
Moby Dick (and the other examples in Generic Resources)
Table of contents
Abridged version for mobile device
Moby Dick decorated with advertising
Random page
Someone's report on current weather in Oaxaca
Exactly 17 distinct 'representations' (polymorphic, not generic)
Adversarial URI owner
Archived but repudiated versions
Successful dereference but otherwise uncharacterized referent (who's to say it's not a toucan and why should we believe them?)
Migration and replication (same resource, different URIs)
Resource controller ≠ URI owner

The moving parts - that is, the entities that could cause someone using a URI referentially to be wrong - include:

the document (or whatever) that uses the URI to refer
the URI's referent, whatever that is, according to various parties
the agent or other force controlling a TimBL-information-resource-like referent. For example, the weather controls a weather report - if it doesn't, the report isn't a weather report. On the other hand Moby Dick (the novel) is not controlled by anyone or anything.
the URI owner, because they have authority over retrievals (see HTTPbis part 1 2.6.1)
the infrastructure operators, caring for multiple servers (e.g. multiple A records, load balancers), because they must respect the URI owner's authority, and have the freedom to select from among the authorized responses.

The constraints include:

every Web URI must refer (to something)
"information resources" can be complicated (see Henry's Ariadne article)
retrieved representations should not be in conflict with a URI referring to any kind of "information resource" described in the "Generic Resources" design note
the understanding that HTTP can be succeeded by some other protocol
any server or URI owner can be replaced by another
two URIs can refer to the same thing, while yielding different HTTP responses
PUT, POST, DELETE, etc. are supposed to have something to do with the referent
we can't change the HTTP specification or the way HTTP is used

It is quite possible that the problem is overconstrained and some putative requirements must be ejected. It is almost certain that most of the complexity must be ignored or ruled out in any mainstream application involving metadata. But in order to ignore it properly, we have to acknowledge that it is there and understand it. Once it's understood it will be possible to find a way for URI owners and users to express what they are committing to or assuming.

A further complication is the "representation / presentation" distinction that Henry documents. So far we have pretended that this doesn't exist. At some point, however, it will be important to make our account resilient to transformations from static to dynamic content.

Is it worth it?

One might argue that Web URIs are an exceptionally poor notation for referring to any entity whose properties are somehow inferred from what is retrieved. Perhaps there is some better notation available, or maybe one could be designed. The scholarly community uses metadata-based references, which could be rendered in RDF using blank node notation, and also has things like the handle system that support "identifier"-like references that meet their needs. Perhaps the duri: or urn: URI schemes help make certain aspects of the referent clear.

Embedded metadata (metadata that's about what it's embedded in) seems to have a clear subject. In RDF this is usually designated by the URI from which the document is retrieved (i.e. the base URI). This practice is not reliable (conneg, sessions, revision, etc.), but it probably works because the URI is interpreted as intended. [To think about: thismessage: .]

However, embedded metadata is a special case, and will not usually help us indicate the subject of metadata in general. The metadata itself might do so of course, but the common practice of using a URI also does.

There are both sunk cost and aesthetic arguments against rocking the boat and in favor of plugging its leaks. The investment is not just in deployed metadata and its processors, but in the development of frameworks (such as webarch) and specifications that have encouraged the status quo.

Another reason to forget about it might be if a competing (i.e. non-interoperating) use of these URIs took hold, as has been discussed recently on www-tag (see thread starting here; also this from Harry Halpin). But this makes little difference as some other way would have to be found to refer to these entities, and then that would have to be explained.

The idea that Web URIs are unsuitable for reference and ought to go unstudied is belied to some extent by the fact that the community continues to use them referentially quite happily. The contrast between the lack of specified semantics for Web metadata and its usefulness and importance is a puzzle that needs explaining.

Fortunately, anything we learn about Web URIs can be applied to any notation used for this purpose, since all such notations run the risk of most of the same failure modes as Web URIs. Exactly what handle and DOI 'owners' or users can be held to is not spelled out very well; Crossref's member rules are quite rigorous, but leave vague what the constraints are on what web servers are supposed to do (the landing page issue described above applies, as does conneg confusion). duri: URIs and URNs, even when they're actionable, are just as confusing in the presence of content negotiation, session sensitivity, and so on as are Web URIs. By understanding better how Web URIs are used referentially, and the circumstances in which Web metadata is and isn't successful (not just technically but socially), we ought to be able to design new systems and improve the ones we have.

Miscellany

We usually phrase the problem in terms of HTTP but the general question is not semantics of HTTP but semantics of any URI that is used in a similar way - that is, the URI in the REST/webarch/3986 abstraction, not the one protocol specifically. The AWWSW problem has to explain web architecture first, and its instantiation in HTTP second.

Metadata vocabularies that we've been keeping in mind include Dublin Core, FOAF, the Information Artifact Ontology (IAO), RDFS, and CC REL. Genont is interesting but we're not aware of any deployment. FRBR provides a useful analytical framework and we use it as a reference point from time to time. BIBO is related but does not seem to use Web URIs referentially.

Tactically, we have decided not to worry about time for now, as it is an unsolved problem in the RDF community in general, and we have nothing in particular to contribute. There is enough to worry about with instantaneous confusions.

Members of the group come in with widely varying approaches to semantics and the nature and purpose of RDF, and this has been a source of friction.

Work to date includes

much discussion of different ways to define and approach the problem (e.g. Alan R's idea of setting up a standing committee with authority to "license" referential/operational use combinations)
several OWL vocabularies
analysis of the specified meaning of all HTTP request type/status code combinations
translating HTTP exchanges into RDF (you can do it, but it doesn't really help)
some N3 rules encoding some formal aspects of httpRange-14 and nose-following
exploration of the idea of 'exchange invariant' as a way to explain metadata (if property P is true of all retrieved representations, then it is true of the referent)
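
The 'exchange invariant' idea in the last item can be sketched in Python. This is only an illustration under stated assumptions: the Representation record and the function is_exchange_invariant are hypothetical names, and a finite list of observed responses stands in for "all retrieved representations", which no client can actually enumerate.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Representation:
    """One retrieved representation (hypothetical record, not from any spec)."""
    media_type: str
    body: str


def is_exchange_invariant(prop: Callable[[Representation], bool],
                          retrieved: Iterable[Representation]) -> bool:
    """Return True if `prop` holds of every representation observed so far.

    A finite sample can refute invariance but never prove it: True only
    means that no counterexample has been seen yet.
    """
    reps = list(retrieved)
    # Require at least one observation so an empty sample is not vacuously true.
    return bool(reps) and all(prop(r) for r in reps)


# Two representations that a URI used to refer to Moby Dick might yield.
observed = [
    Representation("text/html", "<p>Call me Ishmael.</p>"),
    Representation("text/plain", "Call me Ishmael."),
]

def mentions_ishmael(r: Representation) -> bool:
    return "Ishmael" in r.body

def is_html(r: Representation) -> bool:
    return r.media_type == "text/html"
```

Here is_exchange_invariant(mentions_ishmael, observed) is True, so on this reading the property could be attributed to the referent, while is_exchange_invariant(is_html, observed) is False, matching the intuition that being in HTML is a property of one representation and not of Moby Dick.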

The TAG has been tracking AWWSW's work under ISSUE-57 (redirections). It might continue this out of inertia, but most of what's covered would make more sense under ISSUE-63 (metadata architecture).

Thanks to Alan Ruttenberg, Stuart Williams, Pat Hayes, Harry Halpin and others for comments as this report was being prepared.