12105 – [XDM30] Allow any Unicode character in a string

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Data Model 3.0
Version:	Working drafts
Hardware:	PC Windows NT

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Norman Walsh
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2011-02-17 09:11 UTC by Michael Kay
Modified:	2011-07-27 19:12 UTC
CC List:	1 user

Description Michael Kay 2011-02-17 09:11:57 UTC

This is an enhancement request to extend the data model so that any Unicode character is allowed in a string. It is raised in response to an action from the XSL Working Group.

In practice the proposed change means (a) all XML 1.1 characters are allowed by all processors, and (b) the Unicode NUL character (x0) is allowed by all processors.
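As a rough illustration of the character sets involved, the sketch below encodes the Char productions of XML 1.0 and XML 1.1, plus the set this request proposes (read here as XML 1.1 characters plus NUL). The function names are hypothetical and Python is used purely for illustration; nothing here is taken from an actual processor.

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char: tab, LF, CR, and everything from U+0020 up,
    excluding surrogates, U+FFFE and U+FFFF."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char: like XML 1.0 but also admits U+0001 to U+001F."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_proposed_xdm_char(cp: int) -> bool:
    """Assumed reading of the proposal: XML 1.1 Char plus NUL (U+0000)."""
    return cp == 0x0 or is_xml11_char(cp)
```

Under these definitions U+0001 is rejected by the XML 1.0 test but accepted by the XML 1.1 test, and U+0000 is accepted only by the proposed set. A surrogate code point such as U+D800 is rejected by all three.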

Serialization would fail if a string contained a character not permitted in the version of XML that is the target of serialization. Tree construction, however, would not reject any characters as invalid.
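A minimal sketch of the serialization-time check described above, assuming a hypothetical helper `check_serializable` that is called just before text is written out. The range table follows the Char productions of XML 1.0 and XML 1.1; the error type and message are illustrative only and do not correspond to any error code in the specifications.

```python
# Inclusive code point ranges of the Char production for each XML version.
XML_CHAR_RANGES = {
    "1.0": [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
            (0xE000, 0xFFFD), (0x10000, 0x10FFFF)],
    "1.1": [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)],
}


def check_serializable(text: str, xml_version: str) -> None:
    """Raise ValueError at the first character the target XML version forbids.

    Hypothetical helper: the string itself is accepted in the data model,
    and the check only happens when serialization is attempted.
    """
    ranges = XML_CHAR_RANGES[xml_version]
    for index, ch in enumerate(text):
        cp = ord(ch)
        if not any(low <= cp <= high for low, high in ranges):
            raise ValueError(
                f"U+{cp:04X} at index {index} is not allowed in XML {xml_version}"
            )
```

With this sketch, a string containing U+0001 passes for target version "1.1" but fails for "1.0", and a string containing NUL fails for both. A real XML 1.1 serializer would additionally have to write most control characters as character references rather than literally.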

Parsing of lexical XML is still free to use XML 1.0 or XML 1.1 rules at the implementor's discretion.

Justification: we allow input from sources that are not constrained by the XML rules, notably by using unparsed-text() or codepoints-to-string(), or by calling external functions. Restricting the character set that can be returned by these functions creates work for implementors, imposes a performance penalty, and restricts what users can do with the language, all quite unnecessarily.

We want to allow import of JSON data, with full round-tripping. This is hampered by the fact that JSON strings allow characters that are not legal in XDM. The alternative is to hold such strings in escaped form, which is very inconvenient for users.
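To make the JSON point concrete, here is a small sketch using Python's standard `json` module as a stand-in for a JSON importer. It contrasts the decoded string, which contains a real NUL character, with the escaped form that would have to be kept if NUL were not a legal string character. The variable names are hypothetical.

```python
import json

# A JSON document consisting of one string: "a", an escaped NUL, "b".
json_text = '"a\\u0000b"'

# Decoded form: three characters, the middle one is U+0000.
decoded = json.loads(json_text)

# Escaped form: the raw text between the quotes, kept undecoded.
# It has eight characters: a, backslash, u, four zeros, b.
escaped = json_text[1:-1]

# Round trip: re-encoding the decoded string and parsing it again
# gives back the same three-character string.
round_tripped = json.loads(json.dumps(decoded))
```

Here `decoded` has length 3 while `escaped` has length 8, so string functions applied to the escaped form see the backslash sequence instead of the character it stands for.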

Casting to string will not reject characters disallowed by XML. For validation of XDM nodes (e.g. using [xsl:]validation or XQuery validate{}) it will be implementation-defined whether the character set allowed in xs:string values is XML 1.0, XML 1.1, or the full XDM set. This preserves the freedom of implementations to use an off-the-shelf validation engine.

[For the avoidance of doubt, "any character" does not include unpaired surrogates. It is of course possible that some external data sources will supply pseudo-strings containing unpaired surrogates. This is analogous to supplying a string that is supposed to be encoded in UTF-8 but contains bytes that cannot be decoded: it is not possible to interpret what is returned as a sequence of characters. An interface that wishes to handle octet streams containing such oddities must handle them as a sequence of integers, or as hexBinary.]
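The analogy above can be sketched as follows. The helper names are hypothetical, and Python is used only because its `str` type is a sequence of code points, which makes a lone surrogate easy to detect. Input that cannot be decoded as UTF-8 is handed back as integers or as a hexBinary-style string instead of being forced into a string.

```python
from typing import List, Union


def has_unpaired_surrogate(text: str) -> bool:
    """True if the string holds a surrogate code point (U+D800 to U+DFFF).

    A Python str stores code points, not UTF-16 units, so any surrogate
    found in it is by definition unpaired.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def decode_or_octets(data: bytes) -> Union[str, List[int]]:
    """Decode as UTF-8 if possible, otherwise return the raw octets as integers."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return list(data)


def to_hex_binary(data: bytes) -> str:
    """Render octets as upper-case hex digits, in the style of xs:hexBinary."""
    return data.hex().upper()
```

For example, the octets FF FE are not valid UTF-8, so `decode_or_octets` returns them as a list of two integers and `to_hex_binary` renders them as `FFFE`.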

Comment 1 Michael Kay 2011-07-27 19:12:44 UTC

The WG decided not to change the spec to allow NUL or unpaired surrogates to appear in strings; all XML 1.1 characters can already appear.