Enabling Read Access for Web Resources

W3C Working Draft 15 February 2007

This Version:: http://www.w3.org/TR/2007/WD-access-control-20070215/
Latest Version:: http://www.w3.org/TR/access-control/
Previous Versions:: http://www.w3.org/TR/2006/WD-access-control-20060517/; http://www.w3.org/TR/2005/NOTE-access-control-20050613/
Editors:: Anne van Kesteren (Opera Software ASA) <annevk@opera.com>; Brad Porter, Tellme Networks

Abstract

This document provides a mechanism for a web resource to relax typical browser sandbox restrictions on cross-site access to it. Using either an HTTP header or an XML processing instruction (or both), resources can indicate that they allow read access from specified hosts (optionally using patterns). When a pattern is used, certain hosts can also be excluded. For instance, a resource can allow read access from all direct subdomains of example.org (http://*.example.org) with the exception of public.example.org (http://public.example.org).

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 15 February 2007 Working Draft of the "Enabling Read Access for Web Resources" document. This document is produced by a Task Force of the Web Application Formats (WAF) Working Group. The WAF Working Group is part of the Rich Web Clients Activity in the W3C Interaction Domain.

Please send comments to the WAF Working Group's public mailing list public-appformats@w3.org with either [AC] or [access-control] at the start of the subject line. Archives of this list are available. See also W3C mailing list and archive usage guidelines.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

1. Introduction
2. Access Control Read Policy
- 2.1. Content-Access-Control header
- 2.2. <?access-control?> processing instruction
3. Matching Algorithm
References
Acknowledgements

1. Introduction

The world wide web has a rich set of resources that can be combined to build content and feature-rich web sites. Websites are permitted to include a reference (either a link or an image inclusion) to web resources residing on another site. For security reasons, web browsers typically do not permit a website to read, process, or otherwise interrogate the contents of any web resource residing on a different domain.

The access-control mechanism enables web resources to permit websites to access their content.

1.1. Background

Web browsers strive to make it "safe" to run any application fetched from the Internet. In order to safely run untrusted code, the web browser tightly controls which resources the web page is allowed to access. In this way, the browser creates a safe "sandbox" in which the application can run.

One of the capabilities that web browsers allow is for one site to create a hyperlink to another site. Similarly, a web browser allows a site to display an image from another site. For instance, an HTML page from www.example.com may display an image hosted by www.w3.org. This interaction is considered "safe" because the contents of that image are displayed to the user, but are not exposed to example.com.

In order to make the experience safe for the end user, web browsers must tightly control access to web resources. Web pages or XML documents often contain sensitive information such as account balances, personal correspondence, or corporate financial information. Consequently, the browser must prevent an example.com application from making a request from your browser that would allow it to "read" your sensitive information.

Because the web browser cannot tell which web pages or XML documents contain sensitive information and which do not, the browser sandbox by default restricts all "read" requests. An application in example.com cannot load or inspect the contents of data from any other document. Some browsers make an exception if the "read" request is for data from the same host or domain. For instance, a web page from www.example.com could request to read another XML document hosted on documents.example.com.

In web browsers, the XMLHttpRequest object allows this type of read access to XML and other web resources. VoiceXML 2.1 browsers implement this same functionality with an element named data.

The restriction on "read" access to web resources is very strict. There are cases where an application would like to "read" data from another XML document or web resource on the internet without these restrictions. For instance, a car reservation web site may want to request your trip itinerary data from an affiliated airline reservation website to streamline making your car reservation. An online retail store may want to read information from a shipping company to give you information on when your order will arrive.

The access-control header allows an XML data document to declare that it is safe for the web browser to allow another site to read this data. By specifying an access control header that "allows" example.com to read, that particular XML document is saying "Yes, it is safe to allow an example.com application to read this data."

1.1.1. Definition of Read Access to Web Resources

A request made by an application to load a web resource in a manner that allows the application to inspect the contents of that resource. Upon inspection of the contents, the application can perform any other allowed operation using that data such as presenting it to the user, performing calculations or making decisions based on that data, copying the data into another data object, and submitting it back to its own website.

1.2. Conformance Criteria

User agents can't conform to this specification without also conforming to a specification that uses the access control read policy.

As well as sections marked as non-normative, all diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

In this specification, the words must, must not and may are to be interpreted as described in [RFC2119].

A conformant specification is one that implements all the requirements (the must and must not statements) listed in this specification that are applicable to specifications.

A conformant user agent is one that implements all the requirements (the must) listed in this specification that are applicable to user agents, while also being consistent with the requirements listed in the specifications that use the access control read policy.

User agents may optimize any algorithm given in this specification, so long as the end result is indistinguishable from the result that would be obtained by the specification's algorithms. (The algorithms in this specification are generally written with more concern for clarity than for efficiency.)

1.2.1. Terminology

The term ToASCII algorithm is used as described in RFC 3490. [RFC3490]

A space-separated list is a string of which the items are separated by one or more U+0009, U+000A, U+000D and U+0020 characters (in any order). The string can also be prefixed or suffixed with those characters.

1.3. Security Considerations

The mechanism this specification introduces extends the "default browser security sandbox" to allow for read access on cross-site resources. The extension opens a constrained hole in the "default sandbox".

A user agent running inside a trusted corporate network and executing untrusted content should enforce a sandboxing policy by denying access. In contrast, it may be appropriate to relax this policy when the user agent is executing only trusted applications that require access to arbitrary resources on the local network. User agent vendors that allow this sandboxing policy to be configured are encouraged to provide guidance on the appropriate settings. It is critical that network administrators understand the security issues pertinent to their environment and configure their systems appropriately. In tandem, developers and web server administrators must be aware of the dangers of trusting a user agent that can be configured to disable sandboxing.

User agents which implement this capability should take care not to expose other trusted data (cookies, HTTP header data) inappropriately.

User agents which implement this capability should also take care to properly normalize Unicode and to properly interpret IDNs to prevent URL spoofing attacks.

Application authors should be aware that content retrieved from another site is not itself trustworthy. Authors should take care not to expose themselves to cross-site scripting attacks by failing to validate the returned content or by executing the retrieved content directly.

2. Access Control Read Policy

Specifications using the mechanism defined in this specification need to define when the access control read policy applies to a retrieved resource. For instance, a specification could define that in case of cross-site requests this mechanism is put in place.

The policy described is only safe for HEAD and GET requests. Specifications must not use it for other HTTP methods without specifying extra safety measures. [RFC2616]

When a resource is said to be in error, access to that resource must be denied.

Resources to which the access control read policy applies have an associated unordered list (which can be empty) of access control rules. An access control rule consists of an allow ruleset and optionally an except ruleset. Each of these rulesets is an unordered list of access items. How each access control rule is matched against the request URL to determine whether access to the resource is to be granted is described in the next section.

An access item is either a single * or a domain pattern (a domain that may contain wildcards) prefixed by a scheme and optionally followed by a port. It must match the following EBNF:

access-item    ::= scheme "://" domain-pattern ( ":" port )? | "*"
domain-pattern ::= wildcard-label | subdomain "." wildcard-label
wildcard-label ::= label | "*"

scheme and port are used as defined in RFC 3986. subdomain and label are used as defined in RFC 1034. [RFC3986] [RFC1034]

In addition to matching the above EBNF the ToASCII algorithm must apply successfully (without errors) to each label component from the access item. If the access item doesn't match the EBNF or the ToASCII algorithm fails the resource is in error.

If the port is omitted the default port for the URI scheme will be used by the matching algorithm.

An access item of * matches anything. When * is used elsewhere (within domain-pattern) it can only match the label production as indicated above.

Several examples of conforming access items:

*
http://*.example.org
http://example.org:8443
https://*.*:80

The following access items would put the resource in error:

*://example.org
http://example.org/
http://example.org/example
http://example.org:
http://example.org:*
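
As a rough illustration, the access item syntax can be checked with a short Python sketch. It assumes that every dot-separated component of the domain pattern may be either a label or a single *, which is what the examples above suggest, and that a port, when present, consists of at least one digit. The name is_valid_access_item and the helper patterns are hypothetical and not part of this specification. Python's built-in idna codec stands in for the ToASCII algorithm here.

```python
import re

# scheme "://" host-part ( ":" port )?  -- no path, query or fragment allowed.
ITEM_RE = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*)://([^:/?#]+)(?::([0-9]+))?")
# RFC 1034 style label, checked after ToASCII has been applied.
LABEL_RE = re.compile(r"[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_access_item(item):
    """Return True if item looks like a conforming access item (sketch only)."""
    if item == "*":
        return True
    m = ITEM_RE.fullmatch(item)
    if m is None:
        return False
    for label in m.group(2).split("."):
        if label == "*":
            continue  # assumption: a wildcard may replace any single label
        try:
            # The "idna" codec applies ToASCII (RFC 3490) to the label.
            ascii_label = label.encode("idna").decode("ascii")
        except UnicodeError:
            return False  # ToASCII failed, so the resource would be in error
        if LABEL_RE.fullmatch(ascii_label) is None:
            return False
    return True
```

Under these assumptions the conforming examples listed above are accepted and the items in the error list are rejected.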

The following access items are identical:

http://example.org
http://example.org:80

The following access items are not identical:

http://*.example.org
http://*.*.example.org
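
One possible way to compare access items for identity is to normalize them first, filling in the default port when none is given. The sketch below is only an illustration: normalize_access_item and DEFAULT_PORTS are hypothetical names, the table of default ports is an assumption limited to http and https, and the input is expected to be an already validated access item other than the single *.

```python
# Assumption: only these two schemes and their default ports are known.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_access_item(item):
    """Return (scheme, labels, port) for a validated, non-"*" access item."""
    scheme, _, rest = item.partition("://")
    host, _, port = rest.partition(":")
    scheme = scheme.lower()  # URI schemes are case-insensitive (RFC 3986)
    port_number = int(port) if port else DEFAULT_PORTS.get(scheme)
    return scheme, tuple(host.split(".")), port_number
```

With this normalization, http://example.org and http://example.org:80 produce the same tuple, while http://*.example.org and http://*.*.example.org differ in the number of labels.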

2.1. Content-Access-Control header

Any resource retrieved via HTTP may have access control rules defined in one or more Content-Access-Control headers which must match the following EBNF:

Content-Access-Control ::= "Content-Access-Control" ":" LWS? ruleset
ruleset        ::= rule (LWS? "," LWS? rule)*
rule           ::= "allow" (LWS pattern)+ (LWS "except" (LWS pattern)+)?
pattern        ::= "<" access-item ">"

As stated by RFC 2616, multiple Content-Access-Control headers may be combined.

LWS is used as defined by RFC 2616. [RFC2616]

If the Content-Access-Control header doesn't match the specified syntax the resource is in error.

Otherwise, for each Content-Access-Control header and then for each rule within that header user agents must append a new access control rule where the allow ruleset is constructed of each access-item following "allow" and the (optional) except ruleset of each access-item following "except".
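
To make the header processing more concrete, the following Python sketch turns header values into a list of (allow, except) pairs. It is an approximation under stated assumptions: LWS is approximated by any run of whitespace, the access items themselves are not validated here, and the names parse_header_value and parse_headers are hypothetical. A header value such as `allow <http://*.example.org> except <http://public.example.org>` is used as an illustrative input.

```python
import re

PATTERN_RE = re.compile(r"<([^<>\s]+)>")
RULE_RE = re.compile(
    r"allow((?:\s+<[^<>\s]+>)+)"            # "allow" and one or more patterns
    r"(?:\s+except((?:\s+<[^<>\s]+>)+))?"   # optional "except" and its patterns
)


def parse_header_value(value):
    """Parse one Content-Access-Control field value into (allow, except) pairs."""
    rules = []
    for raw_rule in value.split(","):
        m = RULE_RE.fullmatch(raw_rule.strip())
        if m is None:
            # Syntax mismatch: the resource is in error and access is denied.
            raise ValueError("resource is in error: malformed Content-Access-Control")
        allow = PATTERN_RE.findall(m.group(1))
        except_items = PATTERN_RE.findall(m.group(2) or "")
        rules.append((allow, except_items))
    return rules


def parse_headers(values):
    """Append the rules of each header in turn, as described above."""
    rules = []
    for value in values:
        rules.extend(parse_header_value(value))
    return rules
```

Splitting on commas is safe in this sketch because an access item cannot contain a comma.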

2.2. <?access-control?> processing instruction

XML resources may include an <?access-control?> processing instruction within the XML Prolog to indicate, in cases where the access control read policy applies, from which domains they can be fetched. [XML]

The processing instruction takes two pseudo-attributes which each take a space-separated list of access items. These pseudo-attributes are allow and except. The allow attribute must be specified.

An <?access-control?> processing instruction that is part of the XML Prolog must be parsed using the same syntax rules as described in the XML Stylesheet PI specification. [XMLSSPI] If there are any parse errors the resource is in error. <?access-control?> processing instructions outside the XML Prolog are ignored and thus can never put the resource in error.

If there are any pseudo-attributes besides allow and except or the allow attribute is not specified the resource is in error.
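
The checks above could be sketched in Python roughly as follows. This is a simplified illustration, not a complete implementation of the XML Stylesheet PI syntax: character references inside pseudo-attribute values are not expanded, and the names ATT_RE, parse_pseudo_attributes and access_control_attributes are hypothetical. The standard library's xml.dom.minidom is used to locate processing instructions that precede the root element.

```python
import re
from xml.dom import Node, minidom

# name = "value" or name = 'value' (simplified pseudo-attribute syntax)
ATT_RE = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"<]*)"|'([^'<]*)')""")


def parse_pseudo_attributes(data):
    """Parse PI data into a dict, raising ValueError on a parse error."""
    attrs = {}
    rest = data.strip()
    while rest:
        m = ATT_RE.match(rest)
        if m is None or m.group(1) in attrs:
            raise ValueError("resource is in error: malformed processing instruction")
        attrs[m.group(1)] = m.group(2) if m.group(2) is not None else m.group(3)
        tail = rest[m.end():]
        if tail and not tail[0].isspace():
            raise ValueError("resource is in error: missing whitespace")
        rest = tail.lstrip()
    return attrs


def access_control_attributes(xml_text):
    """Return the pseudo-attributes of each access-control PI in the prolog."""
    found = []
    for node in minidom.parseString(xml_text).childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            break  # the prolog ends where the root element starts
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE and node.target == "access-control":
            attrs = parse_pseudo_attributes(node.data)
            if "allow" not in attrs or set(attrs) - {"allow", "except"}:
                raise ValueError("resource is in error: bad pseudo-attributes")
            found.append(attrs)
    return found
```

Processing instructions nested inside the root element are never visited by this loop, which mirrors the rule that they are ignored.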

For each <?access-control?> processing instruction user agents must append an access control rule where each access item in the allow pseudo-attribute must be appended to the allow ruleset and each access-item in the except pseudo-attribute must be appended to the except ruleset. To obtain access items from the pseudo-attributes user agents must follow the following algorithm:

Let the attribute's value be value.
Replace any sequences of U+0009, U+000A, U+000D and U+0020 characters (in any order) with a single U+0020 SPACE character in value.
Drop any leading or trailing U+0020 SPACE character in value.
Chop value at each occurrence of a U+0020 character and drop that character in the process.
The resulting list of strings are the access items to be appended to the rulesets.
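
The steps above can be sketched in a few lines of Python. The function name split_access_items is hypothetical, and returning an empty list for an empty or whitespace-only value is an assumption, since the algorithm does not spell out that case.

```python
import re


def split_access_items(value):
    """Split a pseudo-attribute value into access items (sketch of the steps above)."""
    # Collapse runs of U+0009, U+000A, U+000D and U+0020 into one U+0020.
    value = re.sub(r"[\t\n\r ]+", " ", value)
    # Drop a leading or trailing U+0020.
    value = value.strip(" ")
    # Chop at each U+0020; assumption: an empty value yields no items.
    return value.split(" ") if value else []
```

For example, a value containing two items separated by a tab and a newline yields a list of exactly those two items.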

3. Matching Algorithm

To see if read access to a resource can be granted user agents must apply the following algorithm:

Let grant access be false.
Then for each access control rule associated with the document run the following sub algorithm:
1. If there's a match for any access item from the allow ruleset against the request URL let grant access be true.
2. If there's an except ruleset and there's a match for any access item from the except ruleset against the request URL let grant access be false.
If at this point grant access is true grant access to the resource and abort this algorithm.
Otherwise, deny access to the resource.
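
A minimal Python sketch of this algorithm follows. It assumes that the check of grant access happens after each access control rule has been processed, which is one reading of the steps above. The name is_access_granted is hypothetical, and the function that matches a request URL against an access item is passed in as the hypothetical parameter matches so that the sketch stays independent of how matching is done.

```python
def is_access_granted(rules, request_url, matches):
    """Decide read access from a list of (allow, except) rulesets.

    rules: iterable of (allow_items, except_items) pairs
    matches: callable(request_url, access_item) -> bool
    """
    grant_access = False
    for allow_items, except_items in rules:
        if any(matches(request_url, item) for item in allow_items):
            grant_access = True
        if any(matches(request_url, item) for item in except_items):
            grant_access = False
        if grant_access:
            return True  # grant access and abort the algorithm
    return False  # no rule granted access, so access is denied
```

Under this reading the rules act as alternatives: a request is granted as soon as one rule allows it without excluding it, and an empty list of rules always denies access.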

The request URL must be the ....

Perhaps let the specification which defines when the access control read policy applies also define which URI to use as origin?

To determine whether a request URL and an access item match user agents must apply the following algorithm:

Let request URL be origin and access item be item.
If item is a single U+002A (*) there's a match. Abort this algorithm.
Drop the path part in origin so that it matches the access item production.
Count the U+002E (.) characters in both origin and item. If the results are not equal abort this algorithm.
Compare the scheme from origin and item. If there's a match drop the scheme from both including the :// sequence following it. Otherwise, abort this algorithm.
Compare the port from origin and item. If either of them doesn't have the port explicitly specified use the default port for the scheme. If there's a match drop the port from both including the U+003A (:) preceding it. Otherwise, abort this algorithm.
Split origin and item on the U+002E (.) character and preserve the order of each new set of items. In case there's no U+002E character each set will have exactly one item. Now for each set of items (one from origin and one from item):
1. Let the item from origin be origin item and the item from item be item item.
2. If item item is a single U+002A (*) character there's a match. Do this sub algorithm again for the next set of items or abort this sub algorithm if there's no next set of items.
3. Apply the ToASCII algorithm to origin item and item item.
4. Compare origin item and item item. If there's a match do this sub algorithm again for the next set of items or abort this sub algorithm if there's no next set of items. Otherwise, abort this algorithm.
There's a match. Abort this algorithm.
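
The matching steps can be approximated in Python as shown below. This is a sketch under several assumptions: the access item has already been validated, the request URL carries no user information and no IPv6 literal, default ports are known only for http and https, and labels are compared case-insensitively after ToASCII. The names matches, to_ascii, split_parts and DEFAULT_PORTS are hypothetical, and Python's idna codec stands in for the ToASCII algorithm.

```python
import re

DEFAULT_PORTS = {"http": 80, "https": 443}  # assumption: only these are known


def to_ascii(label):
    """Apply ToASCII via the idna codec; lowercasing is an assumption."""
    return label.encode("idna").decode("ascii").lower()


def split_parts(value):
    """Return (scheme, labels, port) after dropping any path, query or fragment."""
    scheme, _, rest = value.partition("://")
    authority = re.split(r"[/?#]", rest, maxsplit=1)[0]
    host, _, port = authority.partition(":")
    scheme = scheme.lower()
    port_number = int(port) if port else DEFAULT_PORTS.get(scheme)
    return scheme, host.split("."), port_number


def matches(request_url, item):
    """Return True if the request URL matches the access item."""
    if item == "*":
        return True
    o_scheme, o_labels, o_port = split_parts(request_url)
    i_scheme, i_labels, i_port = split_parts(item)
    if len(o_labels) != len(i_labels):  # same as comparing the dot counts
        return False
    if o_scheme != i_scheme or o_port != i_port:
        return False
    for origin_item, item_item in zip(o_labels, i_labels):
        if item_item == "*":
            continue  # a wildcard matches exactly one label
        if to_ascii(origin_item) != to_ascii(item_item):
            return False
    return True
```

Because the number of labels has to be equal, http://*.example.org matches www.example.org but neither example.org nor a.b.example.org.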

References

[RFC1034]: DOMAIN NAMES - CONCEPTS AND FACILITIES, P. Mockapetris. IETF, November 1987.
[RFC2119]: Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. IETF, March 1997.
[RFC2616]: Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, editors. IETF, June 1999.
[RFC3490]: Internationalizing Domain Names in Applications (IDNA), P. Faltstrom, P. Hoffman, A. Costello. IETF, March 2003.
[RFC3986]: Uniform Resource Identifier (URI): Generic Syntax, T. Berners-Lee, R. Fielding, L. Masinter, editors. IETF, January 2005.
[XML]: Extensible Markup Language (XML) 1.0, T. Bray et al., editors. W3C, August 2006.
[XMLSSPI]: Associating Style Sheets with XML documents, J. Clark, editor. W3C, June 1999.

Acknowledgements

The editors would like to thank the following people for their contributions to this specification (ordered by first name):

Arthur Barstow
Benjamin Hawkes-Lewis
David Håsäther
Dean Jackson
Ian Hickson
Maciej Stachowiak
Mark Nottingham
Thomas Roessler

Special thanks to Matt Oshry and R. Auburn who helped editing an initial version of this document.