Requirements for Any Theory of "Information Resource"

Jonathan Rees
17 February 2011
This version: http://w3.org/2001/tag/awwsw/2011/axioms-20110217.html
Latest version: http://w3.org/2001/tag/awwsw/ir-axioms/

Abstract

Dereferenceable URIs (loosely speaking, those you can 'GET') are used both imperatively, as protocol elements, and declaratively, in metadata and references. A technical term "information resource" (replacing "resource") was introduced by Architecture of the World Wide Web for the entities that metadata and references are "about". This report describes the author's understanding of what must be true in any adequate theory of "information resources" in order for them to play this explanatory role. By focusing on requirements, expressed as logical axioms, we postpone or avoid the question "what is an information resource" which inevitably leads to philosophical wheel-spinning and vexing paradoxes.

The main challenges are making "information resource" independent of dereference (in particular the HTTP protocol), while saying rigorously what it means for one of them to be "on the web" at a given URI; and understanding the subject of metadata and reference, given the potentially arbitrary variability in dereference behavior (HTTP 200 responses to GETs) for that URI.

It is a purpose of this document to help explain "information resource" and the misunderstood httpRange-14 rule, not to defend either. It is hoped that this treatment will be useful in future discussion of whether any technical specifications require modification, and if so how.

Status of this document

This document presents the author's views, which have been developed with help from the informal AWWSW task force of the W3C TAG. It is presented as a first step toward consensus first within AWWSW, and then in larger forums. Comments for the purpose of repairing deficiencies and building consensus are welcome.

This document is likely to be revised. When citing please look for the latest revision.

Introduction

We take the view that an "information resource" is whatever it has to be in order to be something that can be (a) the subject of metadata and (b) "on the web".

"Metadata" is defined conventionally, not rigorously. The sense is "data about data" or more generally "information about information". Examples of metadata would be statements or records written using FOAF, Dublin Core, or Web Linking vocabularies.

In order to avoid answering the futile question "what is an information resource," we focus on requirements on what any such answer would be like. The requirements take the form of logical axioms. There may be many ways to fulfill the axioms. Each such way would involve an interpretation or translation of the terms of the axioms into some particular informal or formal system, checking to make sure that the axioms, as translated, are true in that system.

Just to help you along: An interpretation of these axioms might be mathematical (as Roy Fielding's formal REST model), or in terms of your favorite ontological framework (Sowa KR, BFO), or in terms of your favorite information ontology or vocabulary (IAO, FRBR, genont).

As is appropriate for requirements, the axioms leave many details unspecified, which carries calculated risks of misinterpretation and interoperability failure. Readers are challenged to find creative ways to misinterpret the axioms so that any needed but missing axioms can be added.

My plan is to transcribe all the axioms into OWL-DL, so that I can check for consistency and adequacy, but first I want to lay them out and see whether they make sense to reviewers.

Each axiom is accompanied by commentary. The commentary is meant to provide motivation and intent, but should be considered informative, not normative.

Web-independent axioms

Our first challenge is to abstract "information resources" (roughly speaking, things that can be put "on the web") away from the Web. To do this we need a set of axioms that have no mention of URIs or protocols.

There is a class of 'information resources'.

This is a troublesome one, so to be careful, please interpret the term as constrained by the following axioms, not according to your impulse. In particular ignore the AWWW definition as, while it may be consistent with what is said here, it is not sensible enough to be clearly consistent with it.

We need this term in order to explain the intent behind AWWW and the httpRange-14 resolution, although I don't promise that an interpretation compatible with these writings exists.

To say 'class' is not to say 'natural category' or 'ontologically coherent class'. The most natural model of the axioms might well make this class the Scylla that Alan Ruttenberg has been menacing us with. [citation needed]

For now just bear in mind: An information resource is whatever it has to be in order to explain how Web metadata expressed in RDF seems to work.

There is a class 'simple information resource' (or 'simple IR') that is a proper subclass of 'information resource'.

This might be what you get when you GET, were the simple IR to be on the Web.

I'm introducing this term, distinct from 'representation', so that I can explain metadata, which is most easily seen as applied to IRs and not representations; see below.

It is supposed to be consistent to assume that this class coincides with TimBL's class 'fixed resource'. Historically, you could take 'simple IR' to be the class of hypertext nodes of the early Web, before conneg and dynamic content came along and anyone worried about the resource/representation distinction.

If you are FRBR-inclined you might be able to interpret 'simple IR' as a subclass of FRBR 'Manifestation'.

'Has content' is a total function mapping a 'simple IR' to a string, perhaps the empty string. 'Has media type' is a partial function mapping a 'simple IR' to a string.

The intent is that 'simple IRs' are "mostly" syntactic; any other properties they have are. This should become clearer further on.

This axiom does not rule out IRs having other syntactic parts as well, such as charset, content-language, or expiration date, but that would depend on how exactly you interpret these axioms.

Even given an enumeration of syntactic parts, a simple IR's identity is not determined - two simple IRs might have all the same parts yet have distinct origins (provenance). Compare FRBR 'Manifestation'. No need to go into detail here as the provenance aspect of the theory is not developed here.

Not to get too technical, but I'd permit 'octet sequence' as a particular kind of 'string'.
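
As an illustration only, one possible interpretation of the 'has content' and 'has media type' axiom can be sketched in Python. The class name `SimpleIR` and the `provenance` field are assumptions made for this sketch; provenance is not axiomatized in this report and appears here only to show that two simple IRs can agree on every syntactic part and still be distinct.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class SimpleIR:
    """Hypothetical model of a simple IR (names are illustrative only)."""
    # 'Has content' is total: every simple IR has a string, possibly empty.
    # An octet sequence (bytes) is permitted as a kind of string.
    content: Union[str, bytes]
    # 'Has media type' is partial: None means "no media type".
    media_type: Optional[str] = None
    # Assumption: a stand-in for origin, which the axioms leave undeveloped.
    provenance: str = ""


a = SimpleIR(b"hello", "text/plain", provenance="origin-1")
b = SimpleIR(b"hello", "text/plain", provenance="origin-2")
empty = SimpleIR("")

# Same syntactic parts, yet not the same simple IR under this interpretation.
same_parts = (a.content, a.media_type) == (b.content, b.media_type)
print(same_parts, a == b)
```

Other interpretations might add further syntactic parts (charset, content-language) or model identity in some other way; nothing in the axiom forces the choices made here.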

There is a relation 'has reading' between 'information resources' and 'simple IRs'.

'Has reading' is not functional, because we want to admit interpretations where readings vary by media type, language, session, time, whim, etc. And it's not inverse functional, as a simple IR can be a reading of multiple IRs (e.g. itself and some non-simple IR).

Example under one interpretation: a serial publication, where the January issue (encoded as a particular octet sequence) is a reading of the serial in January, and the February issue is a reading of the serial in February.

Other example: an "abstract document" (or "generic resource"), with readings ("fixed resources") in particular languages.

Compare HTTP Content-location: header.

TBD: Whether this holds may depend on circumstances (time, conneg, weather, etc.). Until circumstances are axiomatized (in some future version?) you can imagine the relationship holds if any such circumstances exist. If the axioms listed here conflict with circumstance dependence, that's a bug.

'has reading' is reflexive on simple IRs, and total on IRs.

That is, a simple IR is its own and only reading, and every IR has at least one reading.

(candidate requirement) An IR that has only one reading is a simple IR.

This seems plausible but I can't think why it would be needed.

There exists a simple IR that has a media type.

Maybe we don't have to assume this, but the intent is to make it harder to interpret 'information resource' as being exclusive to, say, FTP resources.

There exists an IR that has at least two distinct readings.

This could be two readings at the same time, or at different times. Time and other circumstances are not treated, yet.
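
One way to hunt for misinterpretations is to write down a small finite candidate model and check the axioms of this section against it mechanically. The following is a minimal sketch, not the planned OWL-DL transcription; the function name, the dictionary keys, and the serial-publication model are all assumptions made for illustration, and circumstances (time, conneg) are ignored.

```python
def check_reading_axioms(irs, simple_irs, has_reading, media_type):
    """Check the web-independent axioms against a finite candidate model.

    irs, simple_irs: sets of names; has_reading: set of (IR, simple IR) pairs;
    media_type: dict giving the media type of those simple IRs that have one.
    """
    def readings(r):
        return {s for x, s in has_reading if x == r}

    return {
        "simple IR is a proper subclass of IR": simple_irs < irs,
        "has reading relates IRs to simple IRs": all(
            r in irs and s in simple_irs for r, s in has_reading),
        "reflexive on simple IRs": all(s in readings(s) for s in simple_irs),
        "total on IRs": all(readings(r) for r in irs),
        "some simple IR has a media type": any(
            s in media_type for s in simple_irs),
        "some IR has two distinct readings": any(
            len(readings(r)) >= 2 for r in irs),
    }


# Hypothetical model: a serial whose readings are its monthly issues.
irs = {"serial", "jan_issue", "feb_issue"}
simple_irs = {"jan_issue", "feb_issue"}
has_reading = {
    ("serial", "jan_issue"), ("serial", "feb_issue"),
    ("jan_issue", "jan_issue"), ("feb_issue", "feb_issue"),
}
media_type = {"jan_issue": "text/html"}

results = check_reading_axioms(irs, simple_irs, has_reading, media_type)
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

Removing a reflexive pair such as ("feb_issue", "feb_issue") from `has_reading` makes the model fail both the reflexivity and the totality checks, which is the kind of outcome a reviewer would look for when probing the axioms.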

Connecting to other vocabularies

The goal here is to force consistency with (and explanation of) deployed use of dereferenceable URIs in RDF statements, and to use other vocabularies to help constrain the interpretation of the present one. To do this we need to tie these axioms into other ontologies.

(optional) Information resources are not in the domain of [...].

TimBL at least has argued that information resources, whatever they may be, do not have momentum or position, and are not mathematical. Forgetting ontological concerns, functionally the purpose of saying this is to help rule out misinterpretation. Nonsense invites misinterpretation and threatens interoperability.

Strictly speaking, nonsense detection is not forced by the metadata use case, as far as I know, although it would not be too hard to come up with interoperability scenarios where nonsense would get in the way of proper function. Scenarios in which this axiom would matter are invited.

Even without this axiom, it ought to be pretty hard for a reasonable person to interpret IR-referring terms as anything other than appropriate metadata subjects, because of other axioms (below).

'simple IR' is a subclass of 'foaf:Document'

This seems both desirable and safe, given how foaf:Document is both defined and used.

This does not rule out the possibility that other 'information resources' are also foaf:Documents.

'simple IR' is in the domain of 'foaf:sha1', which is functional on it and factors through 'content'.

This makes sense because 'content' is functional - not varying by time or any other circumstantial variable.

Agnostic on how content-encoding figures into this.
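
To make "factors through 'content'" concrete: the hash is computed from the content alone, so two simple IRs with the same content necessarily have the same foaf:sha1, whatever else distinguishes them. A minimal sketch follows, assuming content is an octet sequence and representing a simple IR as a plain dictionary; the function names and the `provenance` key are hypothetical. Like the axiom, the sketch takes no position on content-encoding.

```python
import hashlib


def has_content(simple_ir: dict) -> bytes:
    """'Has content': total on simple IRs (here, a required dictionary key)."""
    return simple_ir["content"]


def sha1_of_content(content: bytes) -> str:
    """SHA-1 of an octet sequence, as a lowercase hex string."""
    return hashlib.sha1(content).hexdigest()


def foaf_sha1(simple_ir: dict) -> str:
    """foaf:sha1 as the composition sha1_of_content after has_content."""
    return sha1_of_content(has_content(simple_ir))


# Two hypothetical simple IRs: same content, different origins.
x = {"content": b"abc", "media_type": "text/plain", "provenance": "origin-1"}
y = {"content": b"abc", "media_type": "text/plain", "provenance": "origin-2"}

print(foaf_sha1(x) == foaf_sha1(y))
```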

Metadata properties

This set of three axioms creates a pattern that can be repeated for a large set of 'metadata properties' from FOAF, Dublin Core, Web Linking (RFC 5988), XHTML, and elsewhere, substituting each property in turn for 'dc:creator'.

I don't know how to define 'metadata property', nor would it be appropriate to do so here; the best I can say is that they are properties that 'make good sense' as providing information about simple IRs. In a future version of this report I may include a list of properties from the above ontologies that I would consider metadata properties. The more of these there are, the harder it becomes to misinterpret the axioms.

At least one simple IR is in the domain of 'dc:creator'.

Although this is weak, it helps prevent unreasonable interpretations.

Possible generalization: for any metadata property P, there is an object Q and simple IRs A and B such that P(A,Q) but not P(B,Q). That is, these properties are informative.

Let R be an IR, S a reading of R, and W a member of the range of dc:creator. Then {R dc:creator W} implies {S dc:creator W}.

This is to say that a metadata property 'spreads' from an IR to all of its readings.

This is a strong statement, as it precludes using dc:creator with, say, a serial publication where different issues are created by different agents. On the other hand, without this axiom, metadata is not informative of readings and thus is neither falsifiable nor predictive.

Let R be an IR and W a member of the range of dc:creator. If {S dc:creator W} holds for every reading S of R, then {R dc:creator W}.

This is the converse of the previous axiom, and it says that invariant properties of readings must also hold for the IR - the IR must 'fess up' to things that all of its readings do. This rules out pathologies where something is a dc:creator of all of a document's translations, but not of the IR itself. The practical benefit is that it lets you 'gamble' on hypotheses of an IR formed by investigating a number of its readings. You're not guaranteed to be right, but you may be willing to act on the hypothesis.

This gamble happens all the time when putting links on web pages. You look at a document and decide it has the properties of something you'd like to link to. Then you add the link, perhaps reporting some of those properties ("a cool picture of a fish") in the anchor. You can get different readings at different times as long as you still think the picture is suitable for your purposes. The URI owner can change the picture out so as to make it unsuitable, but the risk of that happening is often judged to be low enough to be acceptable. If it's not, you do something else.

Example to think about: "is a serialization of an inconsistent RDF graph (or OWL ontology)".
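
The two directions can be checked over a finite model. This is a minimal sketch under assumptions: the function names and the example data are hypothetical, dc:creator is modelled as a set of (subject, agent) pairs, and the second check follows the commentary's reading, in which a property shared by all readings of an IR must also hold of the IR.

```python
def readings_of(r, has_reading):
    return {s for x, s in has_reading if x == r}


def spreads_to_readings(has_reading, creator):
    """{R dc:creator W} implies {S dc:creator W} for every reading S of R."""
    return all((s, w) in creator
               for r, w in creator
               for s in readings_of(r, has_reading))


def fesses_up(irs, has_reading, creator):
    """If every reading of R has creator W, then R has creator W."""
    agents = {w for _, w in creator}
    for r in irs:
        rs = readings_of(r, has_reading)
        for w in agents:
            if rs and all((s, w) in creator for s in rs) \
                    and (r, w) not in creator:
                return False
    return True


# Hypothetical model: a document with English and French readings.
irs = {"doc", "en", "fr"}
has_reading = {("doc", "en"), ("doc", "fr"), ("en", "en"), ("fr", "fr")}

good = {("doc", "alice"), ("en", "alice"), ("fr", "alice")}
# A serial-like case: one reading has a different creator.
mixed = {("doc", "alice"), ("en", "alice"), ("fr", "bob")}
# The pathology: all readings agree, but the IR does not 'fess up'.
silent = {("en", "alice"), ("fr", "alice")}

print(spreads_to_readings(has_reading, good), fesses_up(irs, has_reading, good))
```

The `mixed` data violates the first direction, and the `silent` data violates the second, matching the two situations the commentary describes.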

Web-relating axioms

Not every IR needs to be 'on the web', but it should be possible to put many of them 'on'.

We don't require that every dereferenceable URI be related to an IR, but we do need to say what the special relationship is, when it exists.

(def) A simple IR is 'authorized for' a URI in certain circumstances iff it would be a correct result when dereferencing (sensu RFC 3986) the URI in those circumstances.

"Correct" means technically correct per consensus specification (e.g. RFCs) and recognized authorities (e.g. DNS). This would be hard to formalize, and I hope we won't have reason to.

'Authorized for' is similar to 3986 'dereferences to' (inverse) but (a) we want to rule out the case of unauthorized dereference (system got hacked, etc.), and (b) no actual dereferencing act has to happen in order for this to hold (e.g. data: URIs?).

We may need to account for the circumstances of authorization in order to admit intended interpretations and rule out incorrect ones. A particular simple IR might be authorized for a URI in a secure session with one user, but not in another. Or if the domain has multiple A records, the simple IR might be authorized when the request is processed by one server, but not by the other. I would prefer to avoid this issue if possible.

I'm pulling a fast one here since 'dereferences to' is defined in 3986 to yield a 'representation' but here we need a 'simple IR'. The parts (content, media type) of the simple IR I have in mind are those of the representation. The provenance (or whatever) should be determined not arbitrarily, but in a way determined by the authorization trail. (And I'm not sure what has to happen or not happen to that trail as the 'representation' passes through proxies and caches.) Perhaps this should be axiomatized.

(def) An 'information resource' is 'bound to' a URI iff every simple IR that is 'authorized for' the URI 'is a reading of' the information resource.

This formalizes "on the web".

TBD: Need to think about 'circumstances'.

Please check; this axiom may be too weak. It could be that an authority who means to bind the IR to the URI just feels insecure and doesn't want to be held responsible for saying that some simple IR is a reading of the IR. Maybe they would be happy to authorize a reading after being convinced that it was indeed a reading.

To make 'bound to' inverse functional (probably impossible and perhaps undesirable), we would have to say which IR is bound, either as a least upper bound of the readings, or as determined by the authorization process. This is tough - this ought to be a matter of fact, independent of anything the URI owner says.

One case where 'bound to' might not be inverse functional is data: URIs, for the same reason that simple IRs are not the same as representations. For example, data:,chat could be independently generated by an English speaker and a French speaker, and intended to refer to different simple IRs depending on context. In this case the provenance of the simple IR would be the provenance of the URI itself. Or would these dereferencings be unauthorized? Hard to say.
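
Read literally, the definition of 'bound to' is a universally quantified condition over the simple IRs authorized for a URI. The following minimal sketch applies it to a finite model; the function name, the relation encodings, and the URIs under http://example/ are assumptions for illustration, and circumstances are ignored.

```python
def is_bound_to(ir, uri, authorized_for, has_reading):
    """True iff every simple IR authorized for `uri` is a reading of `ir`."""
    return all((ir, s) in has_reading
               for s, u in authorized_for if u == uri)


SERIAL_URI = "http://example/serial"   # hypothetical URI
UNUSED_URI = "http://example/unused"   # hypothetical URI with nothing authorized

all_irs = ("serial", "jan_issue", "feb_issue")
has_reading = {
    ("serial", "jan_issue"), ("serial", "feb_issue"),
    ("jan_issue", "jan_issue"), ("feb_issue", "feb_issue"),
}

# Only the January issue is authorized: two IRs qualify as bound.
january_only = {("jan_issue", SERIAL_URI)}
bound_january = {ir for ir in all_irs
                 if is_bound_to(ir, SERIAL_URI, january_only, has_reading)}

# Both issues are authorized: only the serial remains.
both_issues = january_only | {("feb_issue", SERIAL_URI)}
bound_both = {ir for ir in all_irs
              if is_bound_to(ir, SERIAL_URI, both_issues, has_reading)}

print(sorted(bound_january), sorted(bound_both))
```

Two consequences of the literal reading show up in this model. First, several IRs can be bound to one URI, so 'bound to' is not inverse functional here. Second, if no simple IR is authorized for a URI, the condition holds vacuously and every IR counts as bound to it, which may be one more way in which the definition is too weak.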

There exists an 'information resource' that is bound to the URI "http://google.com/".

This is intended to lead you to interpret the axioms to apply to web pages 'in the wild'.

(candidate) For any set {S1, S2, ...} of 'simple IRs' there exists an IR that has S1, S2, ..., as readings, and no others.

That is, a server operator can throw whatever he/she likes at us, and we will still be able to treat their URIs as being bound to something.

This axiom is part of the httpRange-14 rule (if you get a 200 then the resource is an IR), but that's not sufficient reason to require it. The purpose is to allow you to write meaningful metadata even when the 'URI owner' is looking for plausible deniability for it, or when an IR might otherwise be impossible. Possible example: random page generator (suitably axiomatized) may be inconsistent with some interpretations otherwise.

If infinite sets are allowed here we would have to constrain the axiom somehow, e.g. decidable membership.

(def) An interpretation of an RDF graph 'respects IR bindings' if, for each dereferenceable URI occurring in the graph (outside of literals), the URI is interpreted to be an IR bound to that URI.

This is explicitly not an assertion that all interpretations 'respect IR bindings'. However it does let us express the first clause of the httpRange-14 rule (or rather what was intended by it) as "kindly use interpretations that respect IR bindings".

If this definition makes you formally queasy (that's you, Pat) try the following alternative: A satisfying interpretation 'respects IR bindings' if it is also satisfying for the graph formed by merging (a) the given graph, (b) a set of 'binding statements', one per dereferenceable URI occurring in the graph, and (c) an appropriate RDF axiom set derived from the above axioms. The 'binding statement' for a URI uuu is defined here to be the statement <uuu> :boundTo "uuu"^^xsd:anyURI.

('Satisfying' is relative to your choice of logic, of course.)

This would be simpler if we could define which IR is implicated for each URI, but this may be both impossible (the URI owner certainly won't be helping us out) and unnecessary (the IR is already constrained as far as observable behavior goes, what else do you want?).
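
The alternative formulation above, with one 'binding statement' per dereferenceable URI in the graph, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the `:boundTo` and `xsd:` prefixes are assumed to be declared elsewhere, the trailing " ." is a Turtle-like convention added here, and "dereferenceable" is approximated by a scheme test, which is much cruder than anything the axioms intend.

```python
def binding_statement(uri: str) -> str:
    """Binding statement for a URI, following the pattern given in the text."""
    return f'<{uri}> :boundTo "{uri}"^^xsd:anyURI .'


def looks_dereferenceable(uri: str) -> bool:
    """Crude stand-in (an assumption): treat http(s) URIs as dereferenceable."""
    return uri.startswith(("http://", "https://"))


def binding_statements(uris_in_graph):
    """One statement per distinct dereferenceable URI, in sorted order."""
    return [binding_statement(u)
            for u in sorted(set(uris_in_graph))
            if looks_dereferenceable(u)]


# URIs occurring in a hypothetical graph, outside of literals.
uris = ["http://example/z", "urn:example:not-on-the-web", "http://example/z"]
statements = binding_statements(uris)
print("\n".join(statements))
```

Merging these statements with the given graph and an axiom set is left to whatever RDF tooling and logic one prefers; the sketch only shows how the set of binding statements is derived from the URIs in the graph.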

Appendix: Assigning responsibility for metadata errors

(This probably doesn't belong in this document, but I've written it and have no better place to put it.)

Suppose that <http://example/z> is generally interpreted to be some simple IR (whether 'bound' to it or not). We can conclude that <http://example/z> foaf:sha1 12345 (if that's the IR's correct sha1), even if the IR itself says <http://example/z> foaf:sha1 56789. We can just compute the hash and express it in this way, without having to be able to parse or interpret the document first to make sure the URI isn't going to be used in some other way.

Inconsistencies always admit multiple intervention points. Someone could be misinterpreting the URI (either us or the author of the IR), it could be a simple mistake, it could be neglect (being unclear), something might have changed, etc.

A similar situation holds when an edition of The Autobiography of Alice B. Toklas (assume it's an IR), known under the URI http://example/autobiography, says <http://example/autobiography> dc:creator "Alice B. Toklas". One would correctly ignore that statement and write <http://example/autobiography> dc:creator "Gertrude Stein" in this situation. In this case the statement is neither a misinterpretation nor a mistake; it is a joke.

If those writing metadata are going to be willing to be held responsible for what they write, the correct interpretation of metadata must not involve conferring any authority to the IR. One has to be able to express what one believes about an IR without being vulnerable to anything it might say, since there is no option of fixing the IR if it's wrong - even if you wanted to, which in the case of Stein's book you don't.

It is useful to have a way to assign responsibility for inconsistencies like these. One way to do this is to agree that dereferenceable URIs are to be interpreted as the information resources to which they are bound (however you want to interpret that). Anyone who uses the URI otherwise would then be responsible for these errors. Much metadata is expressed this way today. But other designs are possible. For example, one could write the metadata's subject as [:isBoundTo "<http://example/z>"^^xsd:anyURI] instead of <http://example/z>, so that <http://example/z> is freed up for other purposes.

Test

TBD: transcribe to OWL, and derive an inconsistency from an example (toucan, flickr, jamendo, etc)

Change log

Modifications per Alan R review, esp. concerning the way the simple IR parts axiom is phrased.