Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
When a URL appears in data intended for consumption by applications, sometimes the data gives information about the content that can be retrieved from that URL, such as a biography or an image, while in other cases it gives information about the entity described or depicted by what is retrieved, such as a person or a farm. It's always useful to be able to retrieve the content at the URL, since the application can get either the entity or its description, and thus learns more about what is being talked about. While humans can usually discriminate between these different modes of using URLs based on what "makes sense", applications cannot in general do so. Therefore, in standard formats for data, where we want reliable conclusions to be drawn from the data by an application, the context in which the URL occurs must make clear which mode is intended in each case.
This document addresses this problem by describing how to define data formats and publish the information necessary to support an application in determining which mode is intended when it encounters a URL in data.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the Technical Architecture Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to www-tag@w3.org (subscribe, archives). All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Applications operate based on data that they receive or collect. For example, an application that works as an HTTP server might be sent data through an HTTP POST or PUT request. A mobile app might collect data by requesting it through GET requests on a web API.
The data that an application receives is a sequence of bits. The application interprets those bits through a series of processes — decoding, parsing, transforming and so on — to create an internal model based on which it can act. When the data includes URLs, those URLs may be used to inform the processing that builds the internal model, and the internal model may eventually include things that are named using the URLs that appeared in the data. Most importantly, the internal model may include content retrieved by resolving the URLs in the original data, and associate properties with that content based on the information associated with the URL in the original data.
For example, Paul Downey has created an image of his poster The URI Is The Thing and made it available on a photo sharing site. Let us imagine that the photo sharing site exposes information about the poster in a number of ways, including through a JSON API. The JSON might look something like:
{
  "@id": "http://photo.example.com/psd/12345/original.jpeg",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/"
}
In this case, say the URL http://photo.example.com/psd/12345/original.jpeg resolves to a sequence of bits that encodes a JPEG image, and the JSON provided by the photo sharing site is intended to inform applications that that JPEG image was created by Paul Downey and can be reused elsewhere as long as it is attributed (as indicated by the licence). Knowing this, an application that accessed JSON from the site that included the above data could retrieve, store and process the bits retrieved from http://photo.example.com/psd/12345/original.jpeg (for example to extract EXIF data from the JPEG).
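As a minimal sketch of this interpretation step, an application might separate the URL that names the entity from the properties asserted about it. The function name build_entity and the shape of the internal model are illustrative assumptions and are not defined by any format discussed here.

```python
import json

# The JSON from the example above, as an application might receive it.
raw = """
{
  "@id": "http://photo.example.com/psd/12345/original.jpeg",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/"
}
"""


def build_entity(data):
    """Split a parsed record into its naming URL and its other properties.

    Hypothetical internal model: a dict with a "url" and a "properties" map.
    """
    properties = {key: value for key, value in data.items() if key != "@id"}
    return {"url": data["@id"], "properties": properties}


entity = build_entity(json.loads(raw))
```

In this first example the properties can be attached to the content retrieved from the URL; the following sections show why that cannot be assumed in general.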
In other cases, as described in section 3. Landing Pages and Records, URLs used within data might point to landing pages which describe the thing that has the properties specified in the data rather than being the thing that has those properties. To communicate effectively, data providers and applications need to have an agreed understanding about whether a given property provided in some data applies directly to the content at the given URL or to the thing that content describes. This document provides terminology and best practices to facilitate that shared understanding.
This document purposefully does not address the question of what a context-free URL (for example, one on the side of a bus) identifies, or how this might be discovered. It is purely concerned with how an application can work out whether an assertion about a URL within some data is an assertion about the content found at that URL or about the thing described by that content.
There are lots of different ways of expressing data about things, the main standard ones currently in use on the web being JSON, XML and RDF. These are interpreted by applications into internal models. For the purpose of this document, we use the term entity for a thing about which we're passing information, and property as an asserted fact about an entity. An entity commonly has a corresponding data structure within an application, and properties are fields of that data structure.
In this document, we mostly use JSON to express information about entities, following the JSON-LD convention of using @id to provide a URL that names the entity. The same information could equally be expressed in XML in a variety of ways, such as:
<image uri="http://photo.example.com/psd/12345/original.jpeg">
  <creator>Paul Downey</creator>
  <license href="http://creativecommons.org/licenses/by/3.0/" />
</image>
or in Turtle as a serialization of RDF:
PREFIX : <http://example.org/>

<http://photo.example.com/psd/12345/original.jpeg>
  a :Image ;
  :creator "Paul Downey" ;
  :license <http://creativecommons.org/licenses/by/3.0/> ;
  .
The same considerations apply when URLs are used to name entities, regardless of the format that is used to express the data.
A landing page is any page whose primary purpose is to contain a description of something else. Landing pages often provide summaries or additional information about the thing that they describe. Examples are landing pages for images on Flickr or videos on YouTube, which are HTML pages that embed the media that they describe and provide access to comments and other metadata about it. Landing pages for documents are often tables of contents or abstracts.
For example, say that the photo sharing site from the earlier example published an HTML page about The URI Is The Thing at http://photo.example.com/psd/12345 which acts as a landing page for the photo, enabling people to add comments about it and providing links to other pictures by Paul Downey and so on. In this scenario, the site might publish the JSON:
{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "photo": "http://photo.example.com/psd/12345/original.jpeg"
}
Unlike the previous example, here the content an application gets when it resolves the value of the @id property (http://photo.example.com/psd/12345) is not an image, contrary to the assertion of the type property: it is an HTML page. Similarly, the content of the HTML page was not created by Paul Downey; it was created by the photo sharing site. Nor is the HTML page available under the CC-by licence, since the photo sharing site holds the copyright. Thus the properties that are associated with the URL http://photo.example.com/psd/12345 within the data do not apply to the content provided at that URL, but to the image for which the HTML page is the landing page, and which is referenced in the photo property.
This pattern also occurs with URLs that resolve to content that is not HTML: APIs that provide data in JSON, XML or RDF usually use URLs within that data which provide locations from which further information about the entities associated with the URLs can be discovered, again in JSON or XML or RDF. These JSON, XML or RDF records are the machine-readable equivalent of HTML landing pages: they describe the image, video or other thing rather than being a sequence of bits that is that thing.
Thus the same considerations would apply if the photo sharing site published the JSON above at the URL http://photo.example.com/psd/12345. The JSON that's published at that URL is not an image; it is a record. The site could alternatively use content negotiation to determine whether a given application receives the JSON or the HTML or some other format.
If the URL http://photo.example.com/psd/12345 supported content negotiation such that a request with Accept: text/html provided an HTML page but a request with Accept: image/jpeg returned the image, the URL would be used for two distinct resources: the image and the landing page. The two resources have different values for important properties that cannot be content-negotiated on, such as their creator and license. As discussed in The Architecture of the World Wide Web [WEBARCH], content negotiation should not be used between two different resources: instead, different resources should be named with different URLs. It is up to the publisher to determine whether two resources are different.
The photo sharing site may add information that is about the HTML landing page at the URL to the JSON data that it publishes. For example, it might add a last-modified date that indicates the date and time that the landing page was last modified:
{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "photo": "http://photo.example.com/psd/12345/original.jpeg",
  "last-modified": "2012-06-20T08:54:32Z"
}
Doing this is potentially confusing because a developer simply looking at the output of the API and trying to make sense of it might assume that because the rest of the properties associated with http://photo.example.com/psd/12345 (such as creator or license) apply to the image described by the landing page at that URL, the last-modified property must apply to that image as well, when in fact it applies to the HTML landing page. Later sections describe methods for publishers to avoid confusing developers in this way.
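The distinction can be made mechanical once each property is documented. The following is a minimal sketch, assuming a hypothetical lookup table (APPLIES_TO) that records whether a property describes the landing page or the thing the page describes, and assuming that photo gives the URL of the described image. Neither the table nor the function name comes from any published format.

```python
# Hypothetical documentation for the format, expressed as a lookup table:
# "page" means the property is about the landing page itself,
# "described" means it is about the thing the landing page describes.
APPLIES_TO = {
    "type": "described",
    "creator": "described",
    "license": "described",
    "last-modified": "page",
}

record = {
    "@id": "http://photo.example.com/psd/12345",
    "type": "image",
    "creator": "Paul Downey",
    "license": "http://creativecommons.org/licenses/by/3.0/",
    "photo": "http://photo.example.com/psd/12345/original.jpeg",
    "last-modified": "2012-06-20T08:54:32Z",
}


def split_record(record, applies_to=APPLIES_TO):
    """Return (page, described, undocumented) dicts for one record."""
    page = {"url": record["@id"]}
    described = {}
    undocumented = {}
    # Assumption: "photo" holds the URL of the described image.
    if "photo" in record:
        described["url"] = record["photo"]
    for key, value in record.items():
        if key in ("@id", "photo"):
            continue
        target = applies_to.get(key)
        if target == "page":
            page[key] = value
        elif target == "described":
            described[key] = value
        else:
            # No documentation: keep the property aside rather than guess.
            undocumented[key] = value
    return page, described, undocumented


page, described, undocumented = split_record(record)
```

Properties without documentation are set aside rather than assigned to either entity, which reflects the caution this document recommends for consumers.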
While the above example is of a landing page for an image where the image itself is available elsewhere on the web, publishers also provide landing pages for things that aren't available on the web, such as people or pieces of furniture. For example, the photo sharing site might publish a landing page for Paul Downey:
{
  "@id": "http://photo.example.com/psd",
  "type": "person",
  "name": "Paul Downey",
  "nickname": "psd"
}
When data is about something like a person or piece of furniture, it is usually obvious (to developers, who understand the world) that a given property, such as nickname or dimensions, doesn't apply to the landing page but to the person or piece of furniture that it describes. On the other hand, when the data is about something whose content could exist as data on the web, such as a photograph or a book or a film, that thing will often have properties that could equally apply to the landing page itself, such as creator or last-modified.
As we have seen, the properties used within data need to be documented to avoid developer confusion about what entities they apply to. A data format that mixes properties about landing pages or records and properties about the things those landing pages or records describe is not necessarily ambiguous: all that's required for developers to understand what the properties actually apply to is for the meaning of the property to be documented.
We recommend the use of the following terms to describe properties within such documentation:
@id
or url
The following diagram shows how these properties interact:
The term shorthand property can be used in a variety of cases, and documentation about shorthand properties needs to be particularly explicit about how they should be interpreted, as described in the following sub-sections.
For example, in the JSON
{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "last-modified": "2012-06-20T08:54:32Z"
}
the properties might be documented as:
@id
type
creator
license
last-modified
In this case, "has type", "has creator" and "has license" are implied properties which might not be described explicitly in the documentation. Graphically, we have:
Properties may have values that are themselves URLs. In these cases, the property documentation should make clear whether the entity URL (provided by the URL property such as @id) points to a landing page or record, or the value URL (given in the value of the individual property) points to a landing page or record, or both. For example, in a case such as:
{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "creator": "http://photo.example.com/psd",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "modified": "2012-06-20T08:54:32Z"
}
both the creator and the license properties are shorthand properties of the image described by the landing page at the entity URL http://photo.example.com/psd/12345. However, the value of the creator property is also a landing page, this time for Paul Downey, whereas the value of the license property actually points to the content of the licence.
Properties between entities that are implied due to a property asserted between two landing pages or records are called parallel properties because in a diagram that shows the relationships between the landing pages and between the entities, these kinds of implied properties will appear parallel to the shorthand property.
The following diagram shows the creator shorthand property, whose value is a URL that points to a landing page, and how this property implies the existence of two entities (an image and a person) and a "has creator" relationship between those entities.
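The expansion from shorthand properties to implied properties can be sketched in code. This is an illustration only: the classes DescribedBy and Content, the table VALUE_IS_LANDING_PAGE and the "has ..." naming are assumptions made for the example, standing in for whatever the property documentation of a real format would say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DescribedBy:
    """The thing described by the landing page or record at page_url."""
    page_url: str


@dataclass(frozen=True)
class Content:
    """The content retrieved from url."""
    url: str


# Hypothetical documentation: does the value URL of each shorthand
# property point to a landing page (True) or to the content itself (False)?
VALUE_IS_LANDING_PAGE = {"creator": True, "license": False}

record = {
    "@id": "http://photo.example.com/psd/12345",
    "type": "image",
    "creator": "http://photo.example.com/psd",
    "license": "http://creativecommons.org/licenses/by/3.0/",
}


def implied_properties(record):
    """Yield (subject, implied property, object) for each shorthand property."""
    subject = DescribedBy(record["@id"])
    for name, value_is_page in VALUE_IS_LANDING_PAGE.items():
        if name not in record:
            continue
        value = record[name]
        obj = DescribedBy(value) if value_is_page else Content(value)
        yield (subject, "has " + name, obj)


triples = list(implied_properties(record))
```

Here the creator triple links two described things (the image and the person), which is the parallel property, while the license triple links the described image to the content of the licence.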
Sometimes landing pages or records are about more than one thing, or the thing that they describe is functionally related to other things. In the example we've been using, the image http://photo.example.com/psd/12345/original.jpeg is actually a photograph of a poster which is about the web. What if the photograph of this poster had been taken by someone other than Paul Downey, and this was captured within the data? The JSON about its landing page might be:
{
  "@id": "http://photo.example.com/psd/12345",
  "type": "image",
  "photographer": "Nadia",
  "creator": "Paul Downey",
  "license": "http://creativecommons.org/licenses/by/3.0/",
  "last-modified": "2012-06-20T08:54:32Z"
}
In this case, the photographer property relates to the photograph described by the landing page at http://photo.example.com/psd/12345 whereas the creator property relates to the artwork that was photographed.
As this example shows, it is helpful to document the kind of the thing described by a landing page or record that a given property relates to. This enables an application, if it chooses to, to build an internal model of the data that includes separate entities for the landing page, each of the things that are described by the landing page, and the ways in which they are related.
In the example above, the documentation might include:
photographer
creator
One of the benefits of naming an entity with a URL is that it enables multiple sources of information to associate data with that entity by referring to the same URL. For example, a social networking site may provide JSON that states that someone likes the image described by http://photo.example.com/psd/12345:
{
  "@id": "http://social.example.com/dirk",
  "type": "person",
  "name": "Dirk",
  "likes": [ "http://photo.example.com/psd/12345", ... ]
}
Here we assume that the likes property is defined as a shorthand property that implies that the content of the page at http://photo.example.com/psd/12345 describes the thing that is liked. Without such documentation, some applications might adopt an alternative interpretation: that Dirk likes the web page at http://photo.example.com/psd/12345.
A review site might similarly provide JSON that describes a review of the image at http://photo.example.com/psd/12345 (again we assume here that the documentation of the subject property states that the review is about the thing described by the landing page at the given URL):
{
  "@id": "http://review.example.com/jane/12345",
  "type": "review",
  "subject": "http://photo.example.com/psd/12345",
  "rating": 5
}
As discussed in section 4.2 Multi-Faceted Landing Pages and Records, the landing page at http://photo.example.com/psd/12345 may describe many things. If a search engine or other application were to merge the information from the three sites, it would need to associate both the "like" and the review with the same entity: the image.
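A rough sketch of such a merge follows. It covers only the social and review records, assumes that both likes and subject are documented as shorthand properties that refer to the thing described by the landing page, and uses the landing page URL as the merge key. The function name and the output shape are hypothetical, and the likes list contains only the one URL shown in the example.

```python
from collections import defaultdict

social_record = {
    "@id": "http://social.example.com/dirk",
    "type": "person",
    "name": "Dirk",
    "likes": ["http://photo.example.com/psd/12345"],
}

review_record = {
    "@id": "http://review.example.com/jane/12345",
    "type": "review",
    "subject": "http://photo.example.com/psd/12345",
    "rating": 5,
}


def merge_by_described_thing(records):
    """Group likes and ratings by the landing page URL of the described thing."""
    merged = defaultdict(lambda: {"liked_by": [], "ratings": []})
    for record in records:
        # Assumed shorthand: each liked URL names a landing page.
        for page_url in record.get("likes", []):
            merged[page_url]["liked_by"].append(record["@id"])
        # Assumed shorthand: "subject" names a landing page.
        if record.get("type") == "review":
            merged[record["subject"]]["ratings"].append(record["rating"])
    return dict(merged)


merged = merge_by_described_thing([social_record, review_record])
```

The merge is only as accurate as the assumption that both sites mean the same described thing; if the landing page describes several things, a key based on the landing page URL alone cannot tell them apart.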
The publishers of the image could help applications to combine information about the image across the sites accurately by supplying a separate URL for the image itself, linked to from the landing page with a specific relationship (such as describesImage) through a Link HTTP header or a <link> element within the landing page. To be clear about what is being liked or reviewed, the social media site and the review site could either reference that image directly, or describe their shorthand properties in terms of the describesImage property of the landing page.
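A consumer could pick such a link out of a Link header roughly as follows. This is a deliberately simplified sketch: it only recognises links in which rel is the first parameter after the target URL, it does not implement the full Link header grammar, and the header value shown is an invented example for the photo sharing site.

```python
import re

# Matches "<target>; rel=value" or "<target>; rel="value"" only.
# Assumption: rel is the first parameter; other layouts are ignored.
LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel="?([^";,]+)"?')


def link_targets(header_value, relation):
    """Return the target URLs in a Link header value that carry relation."""
    return [
        url
        for url, rel in LINK_PATTERN.findall(header_value)
        if relation in rel.split()
    ]


# Hypothetical header sent with the landing page.
header = '<http://photo.example.com/psd/12345/original.jpeg>; rel="describesImage"'
image_urls = link_targets(header, "describesImage")
```

A production consumer would need a complete Link header parser, since parameters may appear in any order and values may contain quoted commas.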
The previous sections have discussed how important it is to have documentation that includes information about how URLs used within data should be interpreted and specifically whether properties within the data apply to the content found at a URL or to something that content describes. This documentation should be published where developers who work with the data can find it. Possible routes for doing this explicitly include:
application/json), by providing supplementary documentation through a profile link relationship, for example within an HTTP Link header
@xsi:schemaLocation attribute in XML or by using resolvable URLs for classes and properties in RDF
Developers should be able to locate this documentation through a mechanism that isn't a search against the Internet. If the property documentation should be accessed through resolving URLs within the data (the last of the options above), this mechanism should be specified within the media type definition or the documentation provided through the profile link relationship.
What if the data isn't made available by HTTP and you therefore don't have a media type: how does follow-your-nose work in that case? For example, the data might be provided via FTP or embedded within a textual email message.
This section makes concrete recommendations for data consumers, data publishers and the authors of specifications that use URLs, based on the discussion above.
Applications that consume data on the web may need to determine, based on a given set of data, which properties can be associated with the content found on the web at a given URL. Applications that commonly need to do this include crawlers that need to work out the licence that applies to a particular piece of content, or to whom it should be attributed. Applications should work out which properties apply to a piece of content based on the media type of the data that contains the information about the URL. Media types for structured syntaxes such as JSON, XML or Turtle may delegate how to interpret data to a vocabulary, defined in a schema or in separate documentation.
Applications should be wary, in the absence of explicit indications within specifications or vocabularies, about associating properties with the content located at a given URL used within a URL property for an entity. Some publishers may intend the properties to be associated with the content an application gets when it resolves the URL, while others intend them to be associated with an entity described by the content. Applications should be particularly careful in interpreting properties that could be associated with content retrieved from the web, such as "like" or "creator".
The response received when resolving an http: or https: URL does not affect how a given piece of data that refers to that URL is interpreted, but applications may use it to infer additional properties. For example, the HTTP headers that are included in an HTTP response encode properties, such as the last modification date, which are usually associated with the HTTP entity body contained within the response. The Link header in particular provides additional data which may be about the specific HTTP entity body, a more abstract notion of the document located by the URL (which may change over time or be available in multiple content-negotiated variants), or something described by that document. The documentation of the link relation used within the Link header should provide specific information about how the relation should be interpreted in relation to the resolved URL.
The most important property of a URL, whose value can only be discovered through resolution, is its content. The actual content located through resolving a URL may change over time or based on aspects of the request (such as Accept headers). Where data makes assertions about the content of a URL, these assertions are taken to apply to those aspects of the content that remain constant across these variants. Applications can only sample this content at any particular point in time, and some HTTP responses may only provide a portion of the content associated with the URL.
URLs that include fragment identifiers are known as hash URLs. When presented with a hash URL, such as http://photo.example.com/psd/12345#comment-67890 or http://photo.example.com/psd#me, applications can locate its content by resolving the base URL (before the fragment identifier) and interpreting the fragment identifier based on the fragment identifier rules specified for the media type of the response. In some cases this will resolve to some content (such as an XML or HTML element); in other cases it may not. In cases where the fragment identifier does not resolve to any content in a given response, applications can infer that the content at the base URL describes the entity named with that hash URL.
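The rule for hash URLs can be sketched as a small decision function. It assumes the application has already retrieved the base URL and extracted the set of fragment identifiers that resolve within the response (for HTML, for example, the element ids); that extraction step, the function name and the result labels are all assumptions made for the illustration.

```python
from urllib.parse import urldefrag


def interpret_hash_url(url, ids_in_response):
    """Classify a URL given the fragment ids found in the retrieved content.

    Returns a (label, base_url) pair, where label is one of:
      "content"      - no fragment; the URL names the retrieved content
      "fragment"     - the fragment resolves to part of the content
      "described-by" - the fragment does not resolve, so the content at
                       the base URL describes the named entity
    """
    base, fragment = urldefrag(url)
    if not fragment:
        return ("content", base)
    if fragment in ids_in_response:
        return ("fragment", base)
    return ("described-by", base)


# Hypothetical ids found in the two retrieved pages.
comment = interpret_hash_url(
    "http://photo.example.com/psd/12345#comment-67890", {"comment-67890"}
)
person = interpret_hash_url("http://photo.example.com/psd#me", {"header", "footer"})
```

With these invented inputs, the first URL resolves to a fragment of the page, while the second names an entity that the page at http://photo.example.com/psd describes.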
When resolving a URL results in a 303 See Other response, applications can infer that the content found at the URL given in the Location header of that response describes the entity named with the original URL. Other redirections (such as 301 Moved Permanently or 307 Temporary Redirect) imply that applications can get the content of the original URL by looking instead at the content retrieved from the URL given in the Location header. Error status codes such as 404 Not Found do not imply anything about the content associated with a given URL, except that it cannot be provided by the server.
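These inferences can be summarised in a short function. It is a sketch under stated assumptions: it handles only the status codes named above plus successful 2xx responses, it takes the status code and Location value as plain arguments rather than performing a request, and the result labels are invented for the example.

```python
def interpret_response(url, status, location=None):
    """Map an HTTP response to what it implies about the content of url.

    Returns a (label, target_url) pair:
      "described-by" - the content at target_url describes the entity
      "content-at"   - the content of url can be read from target_url
      "unknown"      - nothing can be inferred about the content
    """
    if status == 303:
        return ("described-by", location)
    if status in (301, 307):
        # Other redirections: the content is found at the Location URL.
        return ("content-at", location)
    if 200 <= status < 300:
        return ("content-at", url)
    # Error codes such as 404 imply nothing about the content.
    return ("unknown", None)


see_other = interpret_response(
    "http://photo.example.com/psd", 303, "http://photo.example.com/psd/about"
)
not_found = interpret_response("http://photo.example.com/missing", 404)
```

The Location URL used here is hypothetical; a real consumer would take it from the response headers.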
The ability to have URLs that do not have associated content (hash URLs that do not resolve to a document fragment or URLs that give a 303 See Other response) means that direct properties, which refer to the content retrieved from a given URL, can be used to describe things which are not yet on the web. For example, if the property creator were defined as a direct property that specifies the creator of the content found at a given URL, it could also be used in data that described a book whose content is not currently on the web. In this case, the URL used for the book must be a hash URL that does not resolve to a document fragment, or must give a 303 response.
Publishers can help enable more accurate merging of data from different sites if they support URLs for each entity they or other sites may wish to describe, separate from the landing pages or records that they publish. If these additional URLs are provided, the HTTP response given when resolving a landing page or record should include a Link header indicating the URL of the entity the landing page or record describes using the describes relationship. Similarly, if there are pages that describe the entity associated with a given URL, then:
Link header with the describedby relationship, linking to the landing page or record
303 See Other HTTP status code, redirecting to the landing page or record
Many thanks in particular to Jonathan Rees and Henry Thompson for the technical work behind this draft, and to Robin Berjon for ReSpec.js.
There are many existing data formats, metaformats, vocabularies and schema languages that do not document their use of URLs in the ways described in this document. This section lists them.
@itemid attribute, and as values for properties when the @href or @src attribute is used. The meaning of the identifier and of the values of properties is specified as determined by the vocabulary that's used with microdata. No change is needed here, although the specification could make it clearer that the interpretation of URLs used in these contexts should be specified within the vocabulary.
url property, and when expressed in microdata publishers may use the @itemid attribute or in RDFa the @resource attribute to provide a URL. Most properties appear to be designed to apply to the thing described by the document found at the URL given by the url property, but this is not made explicit in the documentation.
Link headers
Link header expresses a property with a URL value, like those described in section 4.1 URL Values. The documentation for each link relation should describe whether the property relates to the HTTP entity body included in the response, to the more abstract notion of the content retrieved from the URL as described in section 5.2.1 HTTP Responses, or to something described by that content. Similarly, the documentation should describe whether the value of the property is the content of the target URL or the thing described by that content.
This document is one output from the TAG's (re)consideration of
ISSUE-14 was originally closed by the TAG in 2005 with a decision provided by email that stated:
That we provide advice to the community that they may mint "http" URLs for any resource provided that they follow this simple rule for the sake of removing ambiguity: a) If an "http" resource responds to a GET request with a 2xx response, then the resource identified by that URL is an information resource; b) If an "http" resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URL could be any resource; c) If an "http" resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.
Experience since that decision has highlighted problems with this resolution, such as:
303 See Other responses
The various other options and their strengths and weaknesses are explored in Providing and Discovering URL Documentation.
This issue has traditionally been seen as only a problem for philosophers and the Semantic Web / Linked Data community. However, there is growing adoption of RESTful APIs that provide data describing web-based documents and real-world things and that use URLs to refer to the entities that are described by the data. The publishers and consumers of these APIs face the same issues.
The TAG has, over the past several years, put significant effort into exploring both the implications of the 2005 TAG decision and the various alternatives that have been espoused.
In February 2012, the TAG issued a call for change proposals on a formalisation of the TAG decision, Understanding URL Hosting Practice as Support for URL Documentation Discovery. This led to a number of responses which are summarised within the wiki.
The TAG put together a number of use cases and assessed the various proposals against those use cases within a matrix. Based on this analysis, the most promising direction was identified to be the "parallel properties" proposal. At the June 2012 F2F, the TAG discussed this approach and agreed that it was the right direction. Further work was then done prior to the October 2012 F2F, where it was discussed again.
At that point, the TAG resolved to: