
Canonical XML
Version 1.0

W3C Working Draft 1 June 2000

This version:
http://www.w3.org/TR/2000/WD-xml-c14n-20000601
Latest version:
http://www.w3.org/TR/xml-c14n
Previous version:
http://www.w3.org/TR/2000/WD-xml-c14n-20000119
Editor(s)
John Boyer, PureEdge Solutions Inc., jboyer@PureEdge.com

Abstract

This specification describes a method for generating a physical representation, the canonical form, of an input XML document, that does not vary under syntactic variations of the input that are defined to be logically equivalent by the XML 1.0 Recommendation [XML]. If an XML document is changed by an application, but its Canonical-XML form has not changed, then the changed document and the original document are considered equivalent for the purposes of many applications. This document does not establish a method such that two XML documents are equivalent if and only if their canonical forms are identical.

Status of this document

This draft is a proposal that (1) serves as an alternative approach to the Canonical XML specification using the XPath [XPath] data model, and (2) includes a few substantive changes that affect the canonical serialization of an XML document. Prior versions of this document were published by the XML Core Working Group, which delegated the completion of this specification to the IETF/W3C XML Signature Working Group. Any variances that result from the differences between this specification's use of the XPath [XPath] data model and the XML Information Set [Infoset] will be reported to the XML Information Set's comments list.

The XML Signature and XML WGs and other interested parties are invited to comment on this proposed direction, review the specification and report implementation experience. While we welcome implementation experience reports, the XML Signature Working Group will not allow early implementation to constrain its ability to make changes to this specification.

Please send comments to the editors and cc: the list <w3c-ietf-xmldsig@w3.org>. Publication as a Working Draft does not imply endorsement by the W3C membership or IESG. It is inappropriate to cite W3C Drafts as other than "work in progress." A list of current W3C working drafts can be found at http://www.w3.org/TR/. Current IETF drafts can be found at http://www.ietf.org/1id-abstracts.html.

There have been no solicitations nor declarations regarding patents related to this specification within the Signature WG.

Table of contents

1 Introduction
2 Canonical XML Data Model
3 Document Order for Canonical XML
4 Generation of Canonical XML
5 XML Document Subsets

Appendices

A Resolutions
B References
C Acknowledgements


1 Introduction

The XML 1.0 Recommendation [XML] specifies the syntax of a class of resources called XML documents. It is possible for XML documents which are equivalent for the purposes of many applications to differ in physical representation. In particular, they may differ in their entity structure, attribute ordering, and character encoding.

It is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their canonical forms are identical. Such a method is unachievable, in part due to application-specific rules such as those governing unimportant whitespace and equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies established by other W3C Recommendations and Working Drafts. Accounting for these additional equivalence rules is beyond the scope of this work. They can be applied by the application or become the subject of future specifications.

The XPath 1.0 Recommendation [XPath] specifies a data model for representing an input XML document as well as an expression syntax for describing portions of the document (as well as arbitrary strings, booleans and numbers). When an XPath expression is used to describe portions of an XML document, the result is called a document subset.

This specification describes a method for generating a physical representation of an input XML document or document subset that does not vary under syntactic variations of the input XML document that are defined to be logically equivalent by the XML 1.0 Recommendation. The input must be a well-formed XML document with an optional XPath expression and evaluation context. The output physical representation is called a canonical form or simply Canonical XML.

The Canonical XML generated for an entire XML document is well-formed. The canonical form of an XML document subset may not be well-formed XML. However, since the canonical form will often be subjected to further XML processing, most XPath expressions provided for canonicalization will be designed to produce a document subset that is a well-formed XML document or external general parsed entity.

Canonical XML is designed to be used by applications that require the ability to test whether a document or document subset has been changed in a way that is not defined to be logically equivalent by the XML 1.0 Recommendation. For example, a digital signature over the canonical form of an XML document or document subset would allow the signature digest calculations to be oblivious to changes in the document's physical representation provided that the changes are defined to be logically equivalent by the XML 1.0 Recommendation.
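
The signature use case can be illustrated with a minimal sketch. It rests on assumptions that are not part of this specification: it uses xml.etree.ElementTree.canonicalize from the Python standard library (3.8 or later), which implements the later Canonical XML 2.0 rather than the method of this draft, and the helper name canonical_digest is hypothetical. The sketch only shows the general idea that a digest over a canonical form ignores changes in physical representation.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_digest(xml_text: str) -> str:
    """Hash the canonical form of a document (C14N 2.0, not this draft)."""
    canonical = ET.canonicalize(xml_data=xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same logical content, different physical representation:
# attribute order, quote style, and empty-element syntax differ.
original = '<doc b="2" a="1"><e/></doc>'
reordered = "<doc a='1'   b='2'><e></e></doc>"

# A genuine change to an attribute value.
changed = '<doc b="2" a="one"><e/></doc>'

print(canonical_digest(original) == canonical_digest(reordered))
print(canonical_digest(original) == canonical_digest(changed))
```

The first comparison is expected to succeed and the second to fail, because only the second pair differs in content rather than in syntax.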

2 Canonical XML Data Model

The data model used to create Canonical XML is equivalent to the data model defined in the XPath 1.0 Recommendation [XPath]. Although an implementation of this specification need not be based on an XPath implementation, this specification discusses the canonicalization method based on the XPath definition of a node-set.

Under the XPath data model, an XML processor is used to perform the following tasks in order:

  1. normalize linefeeds
  2. normalize attribute values
  3. replace CDATA sections with their character content
  4. resolve entity references

Canonical XML requires that the input document be well-formed XML, but the input need not be validated. However, Canonical XML requires that attribute value normalization and entity reference resolution be performed in accordance with the behaviors of a validating XML processor. Thus, the declarations in the document type declaration are used to help create the canonical form, but the document type declaration is not retained in the canonical form (in part because it is omitted from the XPath data model and in part because it is not needed by the canonical form).
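
The four processing tasks listed above can be observed with an ordinary XML parser. The following is a minimal sketch, assuming Python's xml.etree.ElementTree (which is backed by the non-validating expat parser, so it does not cover the validating behaviors this specification requires). The document content and the entity name greeting are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The input contains an internal entity, a tab inside an attribute value,
# a CDATA section, and a CR LF line ending.
source = (
    '<!DOCTYPE doc [<!ENTITY greeting "hello">]>\n'
    '<doc note="a\tb">&greeting;<![CDATA[ <b> ]]>\r\nend</doc>'
)

doc = ET.fromstring(source)

# The tab in the attribute value has been normalized to a space.
print(repr(doc.get("note")))

# The entity reference is resolved, the CDATA section is replaced by its
# character content, and CR LF has become a single linefeed.
print(repr(doc.text))
```

Note that the document type declaration is used to resolve the entity reference but is not part of the resulting tree, which matches the treatment described above.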

In the XPath data model, there exist the following node types: root, element, comment, processing instruction, text, attribute and namespace. There exists a single root node whose children are text nodes, processing instruction nodes, and comment nodes to represent information outside of the top-level element. The root node also has a single element node representing this top-level element. Each element node can have child nodes of type element, text, processing instruction, and comment. The attributes and namespaces associated with an element are not considered to be child nodes of the element, but they are associated with the element by inclusion in the element's attribute and namespace axes. Note that attribute and namespace axes may not directly correspond to the text appearing in the element's start tag in the original document.

Although the XML 1.0 Recommendation states that an XML processor need not provide the text of comments, the XPath data model, and hence Canonical XML, requires comments.

An element has attribute nodes to represent the non-namespace attribute declarations appearing in its start tag as well as nodes to represent default attributes that were not specified and not declared as #IMPLIED.

By virtue of the XPath data model, Canonical XML is namespace-aware [Namespaces], but it cannot and therefore does not account for namespace equivalencies via namespace rewriting (see below). In the XPath data model, each element and attribute has a name returned by the function name() which can, at the discretion of the application, be the QName appearing in the original document. Canonical XML requires that the XML processor retain sufficient information such that the QName of the element as it appeared in the original document can be provided.

An element E has namespace nodes that represent its namespace declarations, any namespace declarations made by its ancestors that have not been overridden in E's declaration, the default namespace if it is non-empty, and the declaration of the prefix xml.

Character content is represented in the XPath data model with text nodes. All consecutive characters are placed into a single text node. Furthermore, the text node's characters are represented in the UCS character domain. Canonical XML does not perform character model normalization (see below).

The XPath node-set required by the Canonical XML generator is defined to be the result of setting an initial evaluation context of:

then evaluating the expression (//. | //@* | //namespace::*). This expression generates a node-set containing every node of the XML document.

3 Document Order for Canonical XML

Although an XPath node-set is defined to be unordered, the XPath 1.0 Recommendation [XPath] defines the term document order to be the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities, except for namespace and attribute nodes whose document order is application-dependent.

During XPath expression evaluation, Canonical XML imposes no order on the namespace and attribute axes of elements. After evaluating the expression, a node-set is processed by imposing the following additional document order rules on the namespace and attribute nodes of an element:

Lexicographic comparison is based on the UCS codepoint values, which is equivalent to lexical ordering based on UTF-8.
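
The equivalence between codepoint order and UTF-8 byte order can be checked directly. The sketch below is illustrative only; the sample names are arbitrary and chosen to include a character outside the Basic Multilingual Plane. The UTF-16 comparison is added as a contrast and is not discussed by this specification.

```python
# Sample names, including U+FFFF and the supplementary character U+10000.
names = ["b", "a", "\u00e9", "z", "\U00010000", "\uffff", "A"]

# Order by UCS codepoint values.
by_codepoint = sorted(names, key=lambda s: [ord(ch) for ch in s])

# Order by the bytes of the UTF-8 encoding.
by_utf8 = sorted(names, key=lambda s: s.encode("utf-8"))

# Order by UTF-16 code units, which places surrogate pairs before U+FFFF.
by_utf16 = sorted(names, key=lambda s: s.encode("utf-16-be"))

print(by_codepoint == by_utf8)
print(by_codepoint == by_utf16)
```

An implementation that holds names as UTF-16 code units therefore cannot simply compare code units if it is to match the ordering described here.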

4 Generation of Canonical XML

The XPath node-set is converted into a UTF-8 string by generating the representative text for each node in the node-set in ascending document order with a UTF-8 encoding. No node is processed more than once. Note that processing an element node E includes the processing of all members of the node-set for which E is an ancestor. Therefore, directly after the representative text for E is generated, E and all nodes for which E is an ancestor are removed from the node-set (or some logically equivalent operation occurs such that the node-set's next node in document order has not been processed).
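
The removal step described above can be sketched as follows. This is a simplified illustration, not the generation method itself: the Node class and the function names are hypothetical, only element nodes are modeled, and no representative text is produced. The sketch shows only which nodes of the node-set start a round of processing once each processed element and its descendants are removed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an element node."""
    name: str
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node for which `node` is an ancestor, in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def processing_starts(node_set):
    """Return the names of the nodes at which text generation is started.

    `node_set` must already be in ascending document order.
    """
    remaining = list(node_set)
    started = []
    while remaining:
        node = remaining.pop(0)
        started.append(node.name)
        # Processing `node` covers its descendants, so remove them too.
        covered = {id(d) for d in descendants(node)}
        remaining = [n for n in remaining if id(n) not in covered]
    return started


# Tree: a contains b and d; b contains c.
c = Node("c")
b = Node("b", [c])
d = Node("d")
a = Node("a", [b, d])

print(processing_starts([a, b, c, d]))  # ['a']
print(processing_starts([b, c, d]))     # ['b', 'd']
```

In the second call the node-set omits a, so b and d are each processed as the start of their own subtree, while c is handled as part of b.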

The method of text generation is dependent on the node type and given in the following list:

The QName of a node is either the local name if the namespace prefix string is empty or the namespace prefix, a colon, then the local name of the element. The namespace prefix used in the QName MUST be the same one which appeared in the input document.
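
The QName rule above amounts to a small string construction. The function below is a minimal sketch with a hypothetical name; it assumes the caller supplies the prefix exactly as it appeared in the input document.

```python
def qname(prefix: str, local_name: str) -> str:
    """Build a QName from the original prefix and the local name."""
    if prefix == "":
        # No prefix: the QName is just the local name.
        return local_name
    return f"{prefix}:{local_name}"


print(qname("", "doc"))         # doc
print(qname("xsl", "template")) # xsl:template
```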

5 XML Document Subsets

Some applications require the ability to create a physical representation for an XML document subset. Canonical XML implementations based on XPath can provide this functionality with little additional overhead. The following additional steps must be taken:

The node-set passed to the canonical form generator is calculated by setting the initial evaluation context as described in the section Canonical XML Data Model, except replacing the variable bindings and namespace declarations with those provided above, then evaluating the following XPath expression:

(X) | (//. | //@*)[self::xml:lang or self::xml:space or @xml:lang or @xml:space]

The result of the given expression X is combined with the set containing all declarations of xml:lang and xml:space as well as the parent elements containing them. Note that if the result of X is not a node-set, then an XPath error results.

Combined with the propagation of namespace nodes in the XPath data model, this measure to preserve xml:lang and xml:space declarations ensures XML-specific information is not lost when an element's ancestors are removed from the node-set passed to the canonical form generator. XML entities can derive application-specific meaning from anywhere in the XML markup as well as by rules not expressed in XML. Clearly, these rules cannot be specified in this document, so the author of the expression X must be responsible for creating an expression that preserves the information necessary to capture the full semantics of the members of the resulting node-set.


Appendix A: Resolutions

Although this specification now defines Canonical XML in terms of the XPath data model rather than the XML Information Set, the canonical form described in this document is quite similar in most respects to the canonical form described in prior versions of the Canonical XML specification. However, there are some differences. This section discusses the differences and provides a rationale for the changes.

A.1 Comments Included By Default

Canonical XML now includes comments. It is conceivable that comments may carry critical information in certain scenarios. For example, JavaScript is often embedded in HTML using comments, so the canonical form of an HTML document must include the comments. Furthermore, even if comments are solely for the benefit of XML document authors, comments must be preserved if XML document authoring tools are to adopt Canonical XML. Finally, any application that requires the canonical form to omit comments can do so by specifying an appropriate XPath expression to eliminate them. For example, to create the canonical form of an entire input document less the comments, use the following XPath expression:

(//. | //@* | //namespace::*)[not(self::comment())]

The idea of eliminating processing instructions has also been discussed, but it was rejected because in a number of scenarios they carry critical information. However, if the application must eliminate comments and processing instructions as part of its equivalence testing, the following XPath expression can be used:

(//. | //@* | //namespace::*) [not(self::comment() or self::processing-instruction())]
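
The effect of retaining or dropping comments can be seen in a short sketch. It does not evaluate the XPath expressions above, since the Python standard library has no full XPath 1.0 engine. Instead it assumes xml.etree.ElementTree.canonicalize (Python 3.8 or later), which implements the later Canonical XML 2.0 rather than this draft, and uses its with_comments option to stand in for the comment filter. The sample document is made up.

```python
import xml.etree.ElementTree as ET

source = "<doc><!-- reviewer note -->text</doc>"

# Canonical form with comments retained.
with_comments = ET.canonicalize(xml_data=source, with_comments=True)

# Canonical form with comments dropped.
without_comments = ET.canonicalize(xml_data=source, with_comments=False)

print(with_comments)
print(without_comments)
print(with_comments == without_comments)
```

The two canonical forms differ, so an application must decide up front whether comments are significant for its equivalence test.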

A.2 Whitespace Text Children of Root

Prior drafts of the Canonical XML specification eliminated all whitespace outside of the top-most element except for a single linefeed after each processing instruction. This specification is based on the XPath data model, so the whitespace is preserved. Applications that do not want any whitespace outside the topmost element to affect the canonical form can specify an appropriate XPath expression to eliminate the text nodes. For example, to keep all document nodes except whitespace outside of the topmost element, use the following expression:

(//. | //@* | //namespace::*)[parent::* or not(self::text())]

It is not possible in XPath 1.0 to directly detect the root node, but the parent axis has a principal node type of element, so parent::* returns an empty set, which corresponds to a boolean false, for nodes whose parent is not an element. Every node has an element parent except for the children of the root node. The non-text children of the root are accepted by the subexpression not(self::text()).

A.3 Handling of Right Angle Bracket (>)

Prior drafts of the Canonical XML specification replaced all occurrences of > with &gt; when they appeared in character content (text nodes, attribute values, and so forth). There did not appear to be a reason for this.

A.4 No Character Model Normalization

The Unicode standard [Unicode] allows multiple different representations of certain "precomposed characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. The W3C has recommended a normalized representation [CharModel]. Prior drafts of Canonical XML used this normalized form. However, most XML 1.0 processors do not perform this normalization. Furthermore, applications that must solve this problem typically perform the character model normalization as character content is created, which would obviate the need for character model normalization during canonicalization. Therefore, character model normalization has been moved out of scope for Canonical XML.
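
The "ç" example can be made concrete with a minimal sketch. It assumes that the normalized representation in question corresponds to Unicode Normalization Form C, as provided by Python's unicodedata module; that mapping is an assumption of the sketch, not a statement of this specification.

```python
import unicodedata

# Two representations of the same character.
precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

# Without normalization the two strings are different character sequences,
# so documents containing them have different canonical forms.
print(precomposed == decomposed)

# An application that normalizes content as it is created sees them as equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```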

A.5 No Namespace Prefix Rewriting

Prior drafts of the Canonical XML specification described a method for rewriting namespace prefixes such that two documents having logically equivalent namespace declarations would also have identical namespace prefixes. However, the statement in Namespaces in XML that "the prefix functions only as a placeholder for a namespace name" is incorrect. Namespace prefixes can impart information value in an XML document if they are referenced in an attribute value or element content (for example, an element or attribute containing an XPath expression). Thus, rewriting the namespace prefixes would damage such a document by changing its meaning (and it cannot be logically equivalent if its meaning has changed). The theorems below state the results more formally.

Theorem 1: With namespace rewriting, there exist two XML documents D1 and D2 that are logically equivalent yet their canonical forms are not equal.

Proof: Let D1 be a document containing an XPath in an attribute value or element content that refers to namespace prefixes used in D1. Further assume that the namespace prefixes in D1 will all be rewritten by the canonicalization method. Let D2 = D1, then modify the namespace prefixes in D2 and modify the XPath expression's references to namespace prefixes such that D2 and D1 remain logically equivalent. Since namespace rewriting does not include occurrences of namespace references in attribute values and element content, the canonical form of D1 does not equal the canonical form of D2 because the XPath will be different. []

Remark: The same condition exists if we remove namespace rewriting. The purpose of this theorem is simply to show that namespace rewriting does not accomplish the goal for which it is intended.

Theorem 2: With namespace rewriting, there exist two XML documents D1 and D2 that have equivalent canonical forms and yet are not logically equivalent.

Proof: Let D1 be a document containing an XPath in an attribute value or element content that refers to namespace prefixes used in D1. Further assume that the namespace prefixes in D1 will all be rewritten by the canonicalization method. Now let D2 = the canonical form of D1. Clearly, the canonical forms of D1 and D2 are equivalent (since D2 is the canonical form of the canonical form of D1), yet D1 and D2 are not logically equivalent because the aforementioned XPath works in D1 and doesn't work in D2. []

Remark: Since D1 and D2 are not logically equivalent, and D2 is the canonical form of D1, we can conclude that namespace rewriting is harmful rather than simply ineffective.

The conclusion to be drawn from these theorems is that namespace prefixes should not be altered by XML canonicalization. Applications that need to test for logical equivalence will need to perform more sophisticated tests than mere octet stream comparison. However, this is quite likely to be necessary in any case in order to test for logical equivalencies based on application rules as well as rules from other XML-related recommendations, working drafts, and future works.
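
The damage described in Theorem 2 can be reproduced with a serializer that happens to rewrite prefixes. The sketch below is an illustration under stated assumptions: it relies on the behavior of Python's xml.etree.ElementTree, which replaces unregistered prefixes with generated ones (such as ns0) on output, and the namespace name urn:example and the select attribute are made up. It is not the rewriting method of the prior drafts.

```python
import xml.etree.ElementTree as ET

# The select attribute holds an XPath that depends on the prefix p.
source = '<p:doc xmlns:p="urn:example" select="p:item"><p:item/></p:doc>'

# Parsing and re-serializing rewrites the element prefixes.
rewritten = ET.tostring(ET.fromstring(source), encoding="unicode")
print(rewritten)

# The attribute value still refers to the prefix p ...
print('select="p:item"' in rewritten)

# ... but p is no longer declared anywhere in the output.
print("xmlns:p=" in rewritten)
```

The XPath in the attribute value can no longer be resolved in the rewritten document, which is the situation of D2 in the proof of Theorem 2.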

A.6 Handling of Default Namespace

Prior drafts of the Canonical XML specification stated that the default namespace is not used. In the XPath data model, a non-empty default namespace is indicated by a namespace node with an empty local name. An empty namespace is indicated by the absence of such a node. In keeping with the policy of not rewriting namespace prefixes, which includes not adding prefixes that were not in the source document, the default namespace system has been added to Canonical XML. When there is no default namespace node, the canonicalization method indicates this with xmlns="" even if the source document did not contain this declaration explicitly (because there is no way to find out whether it did or not). The result is logically equivalent but, like the addition of default attribute nodes, implies that XPath expression authors should be wary of creating expressions that test for the position of attribute or namespace nodes (they are bound to fail in most cases because the sorting of namespace and attribute axes occurs only on output, not during the XPath expression evaluation).

Appendix B: References

CharModel
Character Model for the World Wide Web, ed. Martin J. Dürst, François Yergeau. Available at http://www.w3.org/TR/charmod/.
Infoset
XML Information Set, ed. John Cowan. Available at http://www.w3.org/TR/xml-infoset.
Namespaces
Namespaces in XML, eds. Tim Bray, Dave Hollander, and Andrew Layman. Available at http://www.w3.org/TR/REC-xml-names/.
Unicode
The Unicode Consortium. The Unicode Standard, version 3.0. ISBN 0-201-61633-5. Described at http://www.unicode.org/unicode/standard/versions/Unicode3.0.html.
XML
Extensible Markup Language (XML) 1.0, eds. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. 10 February 1998. Available at http://www.w3.org/TR/REC-xml.
XPath
XML Path Language (XPath) Version 1.0, eds. James Clark and Steven DeRose. 16 November 1999. Available at http://www.w3.org/TR/1999/REC-xpath-19991116.

Appendix C: Acknowledgements (Non-Normative)

The following people provided valuable feedback that improved the quality of this specification: