Abstract
This document proposes some modifications and enhancements to the current
P3P protocol. These modifications include separating the current
P3P policy into two parts. One part, the protocol policy, would cover
any information disclosed in the protocol transaction with the site.
The protocol policy would be sent out-of-band as part of the protocol being
used. The other part, the content policy, would cover the release
of information in the content itself. The content policy would be
embedded into the content for which it applies.
Status of this document
This document represents the personal work of the author. It incorporates
some early feedback from W3C Staff, but the proposal remains that of the
author. This document is made available for discussion only.
This work does not imply endorsement by, or the consensus of, the W3C membership,
nor that W3C has allocated, is allocating, or will allocate any resources to the issues
addressed by this document. This document is a work in progress and may
be updated, replaced, or rendered obsolete by other documents at any time.
Please send comments to www-p3p-public-comments@w3.org
Background:
The following series of questions helped to focus the scope of this problem.
Question 1: How does a web site currently obtain information about users?
Answer: from a) the user's user agent and b) the users themselves.
Question 2: How is information obtained from the user's user agent?
Answer: from the HTTP headers that the user agent sends, and from TCP/IP properties.
Question 3: How is information obtained from the users?
Answer: from the user interacting with HTML forms and, potentially, Applets/Components.
This may seem an oversimplification, but it really is not. When
a user agent connects to a web site, the only thing the web site knows
is that a connection from a particular IP address made a request for a
resource. The web site cannot be sure whether the IP address belongs to the
actual machine running the user agent or to some sort
of proxy, e.g. an anonymizer or a proxy server. To sum up, the only
way that web sites get information about users is from the users themselves
or from the users' user agents. So, one way to protect user information
is to not send the information at all: for example, never send
anything but the smallest set of HTTP headers required (the request line
and the Host: header [HTTP]) and never submit any HTML forms. This
is obviously one solution; however, it is not practical for most
users and web sites. Users enjoy the personalization that web sites offer,
and web sites need some user information to render content properly.
Basically, web sites offer a service, and the cost of admission to this service
is some private and not-so-private information.
Assumptions
The following assumptions have been made about this problem:
- Many web sites that collect information are hosted by ISPs rather than dedicated hosts.
- The majority (>50%) of resources on the web do not contain forms.
- There are typically few cookies (<10) used on a site.
- There are typically few HTTP headers (<5) needed by a site.
- Cookies are used for two purposes, user tracking and session tracking.
- HTTP headers and cookies are not collected and used for different purposes on pages within a site.
- There are few intervening hosts (<=2) that collect information from the client.
These assumptions, if violated, do not prevent this solution from functioning.
Performance may be affected in some cases, particularly if a large number
of intervening hosts collect information from the client. However,
it is anticipated that a set of de-facto policies will evolve that will
satisfy most, if not all, of the intervening hosts' information requirements.
Clients can then simply agree to send information related to these de-facto
policies on each request. For example, a de-facto policy that asks
for user agent name can be automatically allowed by a client.
Goals
The goals of the modified protocol are to provide the same level of service
as the current draft, reduce the number of round trips required, allow extensibility,
and ease client-side and server-side implementation. These goals are
not contrary to the existing protocol's goals and are simply mentioned
for completeness.
Proposed solution
Policies will no longer be separate URIs but will be combined with the
content collecting the data, e.g. HTML documents. Policies will be
linked to elements in the HTML documents by using the ID attribute of the
HTML element [HTML]. Policies that describe
the information being obtained from the protocol being used will be sent
back to the client as part of the protocol itself. For example, any
information being used from an HTTP transaction will be described in a Protocol
Policy document and sent back to the client as a payload of an HTTP 400
response. Protocol policy documents will exist outside the HTML content
because the protocol used to download the HTML is not related to the content.
This solution does not explicitly address embedding privacy policies into
Applets or other downloadable content. However, it is possible to
require that an Applet that collects information have a content policy
document.
Summary of current problems
The current problems fall into three categories: policy
and content linking is not granular enough; policies must be separated
into protocol and content parts; and references to external documents
within content must be avoided. The following is a catalog of the current
problems to be solved. Each entry gives a description of the problem, a proposed solution,
and the advantages and drawbacks of that solution.
Problem Statement:
Currently, one policy covers both the HTTP transaction and content.
Why is this a problem:
When the virtual hosting assumption is considered,
then there is the possibility that an ISP will collect information from
the HTTP logs for some purpose, while, at the same time, the company using
the virtual hosting services of the ISP will use the same or other HTTP
information for different purposes. If there is only one policy for
the HTTP transaction as well as the content, there is no way to tell the
user that two companies are collecting information. This is true not only
of virtual hosting but also of any intervening network host that collects
information from the HTTP transaction, e.g. proxies.
Proposed Solution:
Separate the protocol policy from the content policy. There
will be an unbounded number of policies for the HTTP transaction as well
as an unbounded number of policies for the content.
Advantage(s):
By having an unbounded number of policies for the HTTP transaction all
intervening hosts can make their policies known to the user.
Drawback(s):
If a large number of intervening hosts collect information there would
be delays getting to the web server due to the number of negotiations with
intervening hosts. However, it is assumed that few intervening
hosts collect private information. If an intervening host does collect
such information then the frequency of privacy policy changes is assumed
to be low.
Problem Statement:
Currently there can only be one policy in effect per resource.
Why is this a problem:
This is a complicated problem dealing with form processing. Let's
consider a typical HTML page that contains two forms. One form logs
a user into a 'Member Only' section of the web site while the second form
logs a user into a 'Guest' section. Assume that the same cgi script
is the target of the two forms. The cgi script queries a database
and either redirects to the user's member page, a temporary guest page
or a login failed page. Further assume that the member only part
of the site uses cookies for user tracking purposes, while the guest-only
section uses cookies for session tracking. Since there can only be
one policy per resource there is no way to accurately represent the distinctions
between these two policies. It is suggested that the more restrictive
of the two policies be in force for the target. However, that solution
is not 100% accurate.
Proposed Solution:
Support an unbounded number of policies for a resource. This will
be accomplished, in the case of HTML, by combining the HTML and the policy
into an HTTP multipart entity and linking, via the ID attribute, the policy
to the elements of the HTML form. The ID attribute was chosen because the
ID attribute is required to be unique in HTML documents. It is worth
noting that the multipart solution was chosen to maintain
backward compatibility: embedding XML into HTML would cause the HTML
to be invalid. To make the policy linkages more explicit,
it is recommended that, for XML based documents, the policy
be embedded into the document using namespaces. The important
point is that the combination of the policy
and the content is considered a discrete entity.
Advantage(s):
By adding the ability to link a policy to an individual element in a document
several advantages are gained. Form fillers can be made more effective
because a data type is linked to each input field. Therefore,
a form filler can populate forms it has not seen before.
Several user interface enhancements are possible as well, such as "mouse
over" to see the policy for a particular field, graying out optional fields,
and showing fields that go against the user's preferences in red. These
types of user interface enhancements are not currently possible.
Additionally, this technique works not only with HTML but also with XML based
languages. By linking a policy to a script or applet tag, it
is possible to assert the privacy policy for the contained script or applet.
Drawback(s):
The response for an HTML document containing a form will grow in size.
This is due to the fact that the HTML and the policy will be combined into
an HTTP multipart response.
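As a rough illustration of the packaging described above, the following Python sketch combines an HTML document and its policy into a single multipart entity using the standard library's email.mime classes. The multipart/related subtype, the text/xml type of the policy part, and the function name build_policy_entity are assumptions made for this example; the proposal does not fix any of them.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_policy_entity(html: str, policy_xml: str) -> MIMEMultipart:
    """Package an HTML document and its content policy as one entity.

    Assumption: multipart/related with the HTML part first and the
    policy part second. The proposal only says "HTTP multipart".
    """
    entity = MIMEMultipart("related")
    # The HTML comes first so that a client can start rendering before
    # the policy part has arrived.
    entity.attach(MIMEText(html, "html", "utf-8"))
    entity.attach(MIMEText(policy_xml, "xml", "utf-8"))
    return entity
```

Comparing len(build_policy_entity(html, policy).as_bytes()) with len(html) gives a quick estimate of how much a given policy grows the response.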
Problem Statement:
Currently there are too many round trips in the protocol and these round
trips must be done sequentially.
Why is this a problem:
Currently, for each request a reference file must be parsed to find the
name of the associated policy file, if one exists. If the user agent
does not have the policy, then the policy must be fetched, which
may require several redirects, and parsed before any more HTML content
is loaded. If there are policies associated
with images then those policies must be downloaded and parsed as well.
This means that the user has to wait for the content to load. Current
browser implementations start multiple threads and pipeline multiple HTTP
requests through a TCP/IP socket. This works very well for pages
that contain images because the browser can download the page and all needed
graphics at the same time. For example, as the HTML page is being
downloaded it can be parsed for presentation and any links encountered
can be loaded using one of the extra threads. Unfortunately, this
model breaks down under a stop-and-wait protocol.
Proposed Solution:
By combining the policy and the content together, the parsing of the reference
file can be avoided since the policy is already available.
Advantage(s):
Downloading of the policy document via multiple redirects is avoided.
Also, this fixes all the policy, reference file, and content caching problems
since the reference file does not exist nor does the policy document.
As an implementation note, if the policy is at the end of the HTML content
then rendering can be performed sooner.
Drawback(s):
The modified solution costs bandwidth for those pages collecting information.
Problem Statement:
Currently, in a best case, on every resource request, at least one XML
document must be parsed, namely the reference document.
Why is this a problem:
It causes the user to wait for parsing of the reference document to be
completed before taking any action. This wait may or may not be trivial
based on the parsing software but the wait still exists.
Proposed Solution:
Remove the reference file and embed the policy into the content.
Advantage(s):
The parsing of the reference file is no longer needed since the reference
file is gone.
Drawback(s):
Each request may result in a 400 response indicating that a particular
resource uses HTTP information differently from the rest of the
resources on the site. For example, if resource 'A' uses the user
agent header for different purposes than other pages the user has visited
at that site, then resource 'A' would return a 400 response indicating
what the user agent header would be used for.
Problem Statement:
The current model of caching reference documents, content, and privacy
policies is problematic.
Why is this a problem:
Caching and synchronizing multiple resources as a unit is not part of the HTTP model.
HTTP caching is not exact, due to time skew and similar effects, so heuristics are
typically employed.
Proposed Solution:
As stated above the solution to this is to remove the reference file and
the policy document and instead combine the policy document and the HTML.
Advantage(s):
The caching and synchronization problems are avoided by making the policy
and the document one package.
Drawback(s):
As stated above, the resource containing the policy will be larger.
Problem Statement:
When using the current version, as written, HTML form processing is not
user friendly.
Why is this a problem:
Consider an HTML page with a form. The target of the form is some
cgi script. When the user submits the form (i.e. after the form has
been filled in) the user agent must parse the reference file, find the
appropriate policy and parse it. Only at this point can the user agent
prompt the user with the purpose for which the data is being used. This model
is counterintuitive: only after filling in the form does the user find out
what the data is to be used for.
Proposed Solution:
Combine the policy and the HTML document and explicitly link the policy
to the form. Note, this problem could also be fixed in the current
draft by requiring user agents to prefetch the privacy policies associated
with all form element targets.
Advantage(s):
By linking the policy to elements a more accurate representation of the
policy for the resource is gained. For example, instead of choosing
a least common denominator policy, the content provider can explicitly
state the purpose for each field in the form.
Drawback(s):
Again, the response, for the HTML resource containing the form, will be
larger.
Solution Details:
This solution makes use of two forms of policies: protocol policies
and content policies. Protocol policies are XML documents that contain
the privacy policy related to the protocol transaction. In HTTP this
information includes all HTTP headers, including cookies. Content
policies are XML documents that contain the privacy policy related to the
information being collected by the content, for example, the purpose
for which the 'SSN' form field will be used.
Both policies use the same mechanism to indicate to the server that
a particular policy has been seen by the user and that this policy is the
policy the user believes to be in effect at the time of the request.
Basically, each policy contains a non-empty, unique ID that the client
mimics back to the web server to indicate what policy the user believes
is in effect. The unique ID must be similar in nature to a UUID in
that it must be unique within and across web sites. Note, third party
recipients of private information are identified, and their privacy policies
are spelled out, in the contents of the P3P Privacy policy. Whether
or not the 3rd party's policy is spelled out in the protocol or content
policy is dependent on the applicability. For example, if the third
party uses protocol information then their policy would be spelled out
in the protocol policy of the target resource.
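One possible way to obtain such an identifier, shown here only as a sketch, is to derive a name-based UUID from the site URL and the policy text, so that any change to the policy yields a new id and two sites never share one. The function name new_policy_id and the choice of uuid5 are assumptions; the proposal only requires an opaque string that is unique within and across web sites.

```python
import uuid


def new_policy_id(site_url: str, policy_xml: str) -> str:
    """Derive a policy id from the site URL and the policy text.

    Assumption: a name-based (version 5) UUID is an acceptable
    "UUID-like" identifier. The same site and policy always map to the
    same id, and editing the policy changes the id.
    """
    name = site_url + "\n" + policy_xml
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))
```

A random uuid.uuid4() would serve equally well if the server stores the id alongside the policy instead of recomputing it.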
HTTP Protocol Policy
The protocol policy is an XML document containing the HTTP information
the host needs in order to process the request. For example in HTTP
this information can include the user agent header, accept header, and
any cookies required. The current P3P data schema suits this purpose.
The only modification needed is the addition of a policy-id attribute to
the POLICY element. The content of this attribute is an opaque string
uniquely identifying this policy universally, e.g. a UUID. The purpose
of this id is to simply act as a token for the user agent to mimic back
to the web site. When the P3P enabled web site, or intervening host,
receives information in the HTTP headers that it intends to use, then the
unique identifiers must be scanned. If the host's id is not in the
list of policies then the host returns a 400 Bad Request, as per HTTP.
If the client agrees to the policy then the value of this unique identifier
is added to the P3P-Protocol-Policy HTTP general header. As intervening
hosts are contacted this header is simply appended to. The transmission
of the policy-id does not imply acceptance of the policy, but rather
indicates which policy the client believes is in effect. Acceptance
of policies is out of scope. The web site must not collect data if
the policy-id is missing or out of date. Any change to the policy
requires that the policy-id be changed. The proposed modifications
make use of the HTTP Extension Framework [HTTP-EXT]. The namespace
to be used is the same one specified in the current P3P draft, namely http://www.w3.org/2000/P3Pv1.
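The appending behaviour described above can be sketched in Python as follows. The proposal does not say how several ids share one header, so the comma-separated list used here is an assumption, as are the helper names; the header name is copied from the example in this document, where the prefix 11 comes from ns=11.

```python
from typing import Dict, List

# Header name as used in this document's example (ns=11).
PROTOCOL_HEADER = "11-P3P-Protocol-Policy"


def parse_policy_ids(value: str) -> List[str]:
    """Split a header value into policy ids.

    Assumption: ids are comma-separated and contain no commas themselves.
    """
    return [token.strip() for token in value.split(",") if token.strip()]


def append_policy_id(headers: Dict[str, str], policy_id: str) -> None:
    """Add policy_id to the header, keeping the ids already present."""
    ids = parse_policy_ids(headers.get(PROTOCOL_HEADER, ""))
    if policy_id not in ids:
        ids.append(policy_id)
    headers[PROTOCOL_HEADER] = ", ".join(ids)
```

A client would call append_policy_id once for the target site and once for each intervening host whose policy it has seen, and each host would then scan parse_policy_ids(...) for its own id.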
HTTP Header Example:
[Client]
GET /foo.html HTTP/1.1
Host: sample.com
Opt: "http://www.w3.org/2000/P3Pv1"; ns=11
11-P3P-Protocol-Policy:
11-P3P-Content-Policy:
[Server]
HTTP/1.1 400 Bad Request
[contents of the protocol policy stating that the user-agent header
is needed for content rendering; policy-id="sample.com54321"]
[Client]
User turns on transmission of user agent header.
GET /foo.html HTTP/1.1
Host: sample.com
Opt: "http://www.w3.org/2000/P3Pv1"; ns=11
11-P3P-Protocol-Policy: sample.com54321
[Server]
HTTP/1.1 200 OK
[contents of foo.html]
Cookie Example:
[Client]
GET /foo.html HTTP/1.1
Host: sample.com
[Server]
HTTP/1.1 200 OK
Set-Cookie: foo="bar"
[contents of foo.html]
[Client]
Rejects all cookies and selects a link from foo.html.
GET /somelink.html HTTP/1.1
Host: sample.com
[Server]
HTTP/1.1 400 Bad Request
[contents of the protocol policy stating that the cookie named "foo"
is used for session tracking only on this realm; policy-id="sample.com12345"]
[Client]
User turns on acceptance of cookie foo.
GET /somelink.html HTTP/1.1
Host: sample.com
Cookie: foo="bar"
Opt: "http://www.w3.org/2000/P3Pv1"; ns=11
11-P3P-Protocol-Policy: sample.com12345
[Server]
HTTP/1.1 200 OK
[contents of somelink.html]
Note: a protocol policy need not cover only cookies or only headers;
it may cover some combination of the two.
Content Policy
The content policy is a P3P policy that is contained in the HTTP multipart
response with the HTML document. The policy details the release of
information. Since the content policy is combined with the HTML as
one entity, when the entity expires so does the policy. User agents
must not repost forms without first validating the expiration time of the
HTML document containing the form. This is done to check for a modified
policy. Consider the case in which the user agent verifies that the document
containing the form is still valid and re-posts the form, but
the policy expires in transit. In that case the host must return a 400 Bad Request
HTTP response with descriptive text identifying the problem.
The current P3P data schema suits the content policy purpose.
There are two attributes that must be added to the current P3P schema in
order to be able to link the content policy to the HTML document.
The first attribute is the policy-id attribute of the POLICY element.
This attribute has the same semantics as in protocol policy except that
the value of this unique identifier is added to the P3P-Content-Policy
header. As intervening hosts are contacted, the P3P-Content-Policy
header is simply appended to (for example, if a transcoding proxy is encountered
and the content is significantly changed). The second attribute to
be added is the target attribute. This attribute would be added to
the DATA element. The target attribute contains a fragment identifier
that is part of the referenced document. In the case of HTML and
XML, the fragment identifier refers to the ID of an element in the
document.
Content Policy Sample for HTML (http://www.catalog.example.com/SampleForm.html)
<html>
<head></head>
<body>
Please submit your name:
<FORM METHOD="POST" ACTION="http://www.w3.org/cgi/login.cgi"
name="loginform">
First:<INPUT ID="FirstName" TYPE="TEXT" NAME="fname"
SIZE="20" /><br/>
Middle:<INPUT ID="MiddleName" TYPE="TEXT" NAME="mname" SIZE="20"
/><br/>
Last:<INPUT ID="LastName" TYPE="TEXT" NAME="lname" SIZE="20"
/><br/>
<INPUT TYPE="SUBMIT" VALUE="Sign On">
<INPUT TYPE="RESET" VALUE=" Clear ">
</FORM>
</body>
</html>
----- Multipart separator -----
<POLICY xmlns="http://www.w3.org/2000/P3Pv1"
disuri="http://www.catalog.example.com/PrivacyPracticeBrowsing.html"
policy-id="http://www.catalog.example.com/blahBlah1"
>
<ENTITY>
... omitted ....
</ENTITY>
<DISPUTES-GROUP>
... omitted ....
</DISPUTES-GROUP>
<STATEMENT>
<PURPOSE><current/></PURPOSE>
<RECIPIENT><ours/></RECIPIENT>
<RETENTION><stated-purpose/></RETENTION>
<DATA-GROUP>
<DATA ref="#user.name.given" target="#FirstName"/>
<DATA ref="#user.name.middle" target="#MiddleName"/>
<DATA ref="#user.name.family" target="#LastName"/>
</DATA-GROUP>
</STATEMENT>
...omitted...
</POLICY>
This sample represents a simple HTML form that prompts for First, Middle
and Last names. The ID attributes of the form's input elements are linked to
the policy by using the target attribute of the DATA element. In effect,
the policy indicates that the FirstName, MiddleName, and LastName
input elements on the form are used to complete the current transaction
and that the data is kept by the site the form was downloaded from. The contents
of the policy identifier attribute of the POLICY element will be used as the contents
of the P3P-Content-Policy HTTP header e.g. P3P-Content-Policy: http://www.catalog.example.com/blahBlah1\r\n.
The first part of the multipart content that contains a matching reference
identifier is considered to be the target of the policy linking.
For example, if the multipart consisted of multiple HTML documents then
the target of the #FirstName link would be the first reference identifier
found with an ID="FirstName".
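The linking rule just described (first part containing a matching ID wins) could be implemented roughly as in the Python sketch below. It assumes the policy part is well-formed XML in the P3P namespace with ref and target attributes on DATA, and that the HTML parts have already been split out of the multipart body; the names IdCollector, element_ids and link_policy are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Dict, List, Set, Tuple

P3P_NS = "{http://www.w3.org/2000/P3Pv1}"


class IdCollector(HTMLParser):
    """Collect the values of all ID attributes in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases attribute names, so ID= and id= both match.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def element_ids(html: str) -> Set[str]:
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


def link_policy(policy_xml: str, html_parts: List[str]) -> Dict[str, Tuple[str, int]]:
    """Map element ID -> (data ref, index of the first part holding that ID)."""
    ids_per_part = [element_ids(part) for part in html_parts]
    links: Dict[str, Tuple[str, int]] = {}
    for data in ET.fromstring(policy_xml).iter(P3P_NS + "DATA"):
        target = data.get("target", "").lstrip("#")
        for index, ids in enumerate(ids_per_part):
            if target in ids:
                links[target] = (data.get("ref"), index)
                break  # the first matching part is the link target
    return links
```

Targets that match no part are simply left out of the result; how a user agent should treat such dangling links is not covered by the proposal.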
Again, the multipart solution was chosen for backward compatibility
with HTML. If an XML document, such as a P3P policy, is inserted
into an HTML document then the HTML document is no longer valid.
In practice, popular browsers such as IE 4/5 and Netscape simply ignore
the embedded XML; however, this does not make the HTML valid. The ability
to physically embed the policy into a document makes the linkages between
the policy and the elements being described more explicit.
By linking the policy to the elements of the forms, several UI benefits
can be realized. Additional support for form fillers is added since
a data type is linked to an input field. The user agent can now populate
forms it has not seen before. In addition, when a user does
a 'mouse over' of a field, a pop up could appear describing the use of
the data being prompted for. Fields that are optional could be grayed out
to tell the user that the field is optional, and fields that go against
the user's preferences could be highlighted in red.
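To make the form-filler idea concrete, here is a minimal Python sketch of how a user agent might decide what to fill in and what to highlight. It assumes the links between element IDs and data references have already been extracted from the policy, and it models the user's preferences as a plain set of data references the user does not want to release; that model and the name plan_form are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple


def plan_form(links: Dict[str, str],
              user_data: Dict[str, str],
              blocked_refs: Set[str]) -> Tuple[Dict[str, str], List[str]]:
    """Decide which fields to pre-fill and which to flag.

    links        element ID -> data reference (from the content policy)
    user_data    data reference -> stored value
    blocked_refs data references the user prefers not to release
    """
    fill: Dict[str, str] = {}
    flagged: List[str] = []
    for element_id, ref in links.items():
        if ref in blocked_refs:
            flagged.append(element_id)  # e.g. highlight this field in red
        elif ref in user_data:
            fill[element_id] = user_data[ref]
    return fill, flagged
```

With the sample form above, blocking #user.name.middle would leave MiddleName empty and flagged while FirstName and LastName are filled from stored data.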
HTTP Support for non P3P clients:
A client not sending a P3P-Protocol-Policy or P3P-Content-Policy header
at all is assumed not to understand P3P. If the client sends data
without a P3P-*-Policy header, then the data should only be used according
to the current privacy policy in effect. When data is to be collected
and the client sends a P3P-*-Policy header that is out of date, or in which the value
for this resource is missing, the web site MUST respond with a 400
Bad Request HTTP response with a P3P protocol policy as the body of the
error. Clients can be configured to respond automatically to the
400 responses. This automated response would include checking the
incoming protocol policy against the user's preferences. Additionally,
the user can be prompted if a discrepancy is discovered. The web
site MUST NOT collect any information until the protocol id and the content
ids the client mimics back match the ids of the resource. Currently
the server has no idea what policy the client is referencing, which makes
it difficult to do any negotiation. As an implementation note, servers
can be configured to roll back to the policy the client is using rather
than rejecting the request. The requirement that the client send
an empty P3P-*-Policy header to indicate that it understands P3P could
be removed by defining a P3P content type and using HTTP content negotiation.
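The three cases above (no header, stale or empty header, matching header) can be summarized in a small decision function. The Python sketch below is one reading of the rules, not a definitive implementation: the Decision names are invented for the example, and a comma-separated id list is assumed because the proposal does not define how several ids share one header.

```python
from enum import Enum
from typing import Mapping


class Decision(Enum):
    NON_P3P_CLIENT = "use data under the current policy"
    SEND_400_WITH_POLICY = "respond 400 with the policy as body"
    POLICY_MATCHES = "process the request"


def classify_request(headers: Mapping[str, str], header_name: str,
                     current_policy_id: str) -> Decision:
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get(header_name.lower())
    if value is None:
        # No P3P-*-Policy header at all: the client does not understand P3P.
        return Decision.NON_P3P_CLIENT
    # Assumption: several ids are carried as a comma-separated list.
    ids = [token.strip() for token in value.split(",")]
    if current_policy_id in ids:
        return Decision.POLICY_MATCHES
    # Empty or out-of-date header: the client must be shown the policy.
    return Decision.SEND_400_WITH_POLICY
```

A server that prefers to roll back to the client's policy, as the implementation note suggests, would branch on SEND_400_WITH_POLICY and look up the older policy instead of rejecting the request.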
Server Side Implementation Impacts
This section will describe the impacts that the proposed modifications
have on the content provider.
In the case that the content prompted the user to enter private information,
the content creator would have to:
- add the P3P policy to the content that collects information. This addition would be done in the form of a multipart response.
- do some checking of the P3P-Content-Policy header coming from the HTTP transaction and match it to the policy id of the target resource.
- if the P3P-Content-Policy header does not match the content policy id, or the client transmitted an empty P3P-Content-Policy header, then the server must return a 400 Bad Request response.
Note, these steps only have to be performed for those resources that collect
information via content, for example HTML forms.
In the case that information from the protocol transaction was collected
by the content creator, the content creator would have to:
- check the protocol policy id against the current protocol policy in effect.
- if the P3P-Protocol-Policy header does not match the protocol policy id, or the client transmitted an empty P3P-Protocol-Policy header, then the server must return a 400 Bad Request response with the protocol policy as a message body.
Note, these steps only have to be performed by intervening and target hosts
that collect information via HTTP. If a site has a consistent policy
regarding the use of HTTP information then the protocol policy need only
be given to the client on the first request. Subsequent requests
and responses would make use of the P3P-Protocol-Policy header value.
It is worth pointing out that, by using the proposed modification, the
user interface experience the content provider can provide is more consistent.
In the current P3P draft, if a page violates any part of a user's preferences
then the client simply does not load the resource. At this point,
since a resource could not be loaded from the site, some browser-supplied
message would have to be displayed to the user to indicate a problem.
Since this message is not controlled by the site, there is a user interface
inconsistency. With the proposed modifications, the user is able
to download the resource, and the browser user interface would be able to
indicate which individual fields violate the user's preferences.
The point is that the site would have control of the content being displayed
to the user.
Client Side Implementation Impacts
This section will describe the impacts that the proposed modifications
have on the browser implementation.
In the case that the content prompted the user to enter private information,
the browser vendor would have to:
- Parse the contained policy.
- Compare the contained policy against the user preferences.
- Optionally, indicate any part of the content that violates the user's preferences. This is out of scope.
- Provide a mechanism to indicate to the user the purposes of the requested information. For example, mouse over with a pop-up containing the localized translation of the information's purpose.
In the case that an intervening host or target host collected information
from the protocol, the browser vendor would have to:
- Process the 400 Bad Request response by parsing the embedded protocol policy.
- Display the contents of the protocol policy to the user, in a localized, descriptive manner.
- Optionally, give the user a mechanism to change his/her preferences or modify the request on a per site basis. This handling is out of scope.
There are some details, such as lists of policy ids and the sites/URIs
they relate to, that must be addressed. However, the majority
of the impacts on the client revolve around presenting the user with information
about the policies.
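The client-side handling of a 400 response might look roughly like the Python sketch below. It assumes the response body is a well-formed POLICY document in the P3P namespace carrying the proposed policy-id attribute, and it models user preferences as a set of data references the user has agreed to release; the function name and the data references used in the usage note are illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Set, Tuple

P3P_NS = "{http://www.w3.org/2000/P3Pv1}"


def review_protocol_policy(body: str,
                           allowed_refs: Set[str]) -> Tuple[Optional[str], List[str]]:
    """Parse the policy from a 400 body and compare it with preferences.

    Returns the policy id (None if the attribute is absent) and the
    sorted list of requested data references the user has not allowed.
    """
    root = ET.fromstring(body)
    policy_id = root.get("policy-id")
    requested = {data.get("ref")
                 for data in root.iter(P3P_NS + "DATA")
                 if data.get("ref")}
    violations = sorted(requested - allowed_refs)
    return policy_id, violations
```

If the violations list is empty, the client could repeat the request with the returned id in the P3P-Protocol-Policy header; otherwise it would present the listed items to the user, for example a policy asking for a hypothetical "#dynamic.cookies" reference when only "#dynamic.http.useragent" is allowed.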
Problem with the proposed solution
The proposed solution places more trust in intervening hosts and web sites
to take only the information they ask for. More importantly, this solution
trusts that web sites will take the information not whenever it is available,
but only when the content and policy ids match the ones they issued.
This reliance on trust can be overcome by the addition of some security
or enforcement mechanism, but that is out of scope.
Conclusion
This document proposes some changes that overcome concerns in the current
P3P draft and offers some enhancements not currently possible. The
solution presented has some drawbacks; however, these drawbacks are outweighed
by the enhancements added.
Appendices
A. Server Side Implementation Samples.
TBD
B. Document Change History
June 7, 2000
Added clarifying note concerning the HTTP extension mechanism to be
used. The RFC2616 extension mechanism will be used.
Added discussion of third party recipients receiving private information.
Cleaned up wording of drawbacks associated with embedding policies
into HTML documents.
Added sections on Server and Client side implementation impacts.
Added a discussion of using the linking mechanism to assert a policy
for a script and applet.
Added discussion of a user agent posting a form that has a policy that
expires in transit. Result, the host must return a 400 bad request
response.
June 9, 2000
Replaced the solution of embedding the policy into the HTML content
with one where a multipart response is returned. This addresses issues
concerning the validity of an HTML document.
Added support for HTTP-Ext.
June 12, 2000
Changed the document status section to remove any ambiguities about
the sponsorship of the proposal.
Acknowledgments
This note was written with input and participation from Daniel Weitzner,
W3C, and Louis Theran, Nokia.
References
[HTTP] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach,
T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC
2616, June 1999.
[HTTP-EXT] H. Nielsen, P. Leach, S. Lawrence,
"An HTTP Extension Framework", RFC
2774, February 2000.
[HTML] D. Raggett, A. Le Hors, I. Jacobs, "HTML
4.01 Specification", http://www.w3.org/TR/html401,
24 December 1999.