This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 9980 - [XSLT] Default value for byte-order-mark in xsl:output
Status: CLOSED INVALID
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: XSLT 3.0
Version: Working drafts
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2010-06-22 17:08 UTC by Oliver Hallam
Modified: 2010-07-15 10:48 UTC

Description Oliver Hallam 2010-06-22 17:08:03 UTC
The specification states (Section 20 in 2.0; Section 23 in 2.1) about the byte-order-mark property:

The default value depends on the encoding used. If the encoding is UTF-16, the default is yes; for UTF-8 it is implementation-defined, and for all other encodings it is no.

Surely if it defaults to yes for "UTF-16", it should also default to yes for "utf-16BE" and "utf-16LE" (since UTF-16 is equivalent to one of these).

The specification also does not state that the comparison should ignore case (although this is reasonably obvious).

One might also argue that the default value should be either yes or implementation-defined for "utf-32" as well.
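
For illustration only, the default rule quoted from the specification can be sketched as a small Python function. The function name `default_byte_order_mark` and its return strings are hypothetical, and the case-insensitive comparison is an assumption, since (as noted above) the specification does not state it.

```python
def default_byte_order_mark(encoding: str) -> str:
    """Sketch of the quoted default for the byte-order-mark property.

    Assumption: encoding names are compared case-insensitively;
    the specification text quoted above does not say so explicitly.
    """
    name = encoding.strip().upper()
    if name == "UTF-16":
        return "yes"
    if name == "UTF-8":
        return "implementation-defined"
    # Every other encoding, including UTF-16BE, UTF-16LE and UTF-32
    return "no"


print(default_byte_order_mark("UTF-16"))    # yes
print(default_byte_order_mark("utf-16BE"))  # no
```

Under the rule as written, "utf-16BE", "utf-16LE" and "utf-32" all fall into the final branch, which is the behaviour this report questions.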
Comment 1 Henry Zongaro 2010-07-12 13:35:57 UTC
I have been reading what Section 3.10 "Unicode Encoding Schemes" of Unicode 5.2 has to say about the UTF-16LE, UTF-16BE and UTF-16 encoding schemes.[1] It turns out that the UTF-16 encoding scheme is not equivalent to simply choosing one of UTF-16LE or UTF-16BE. My understanding is that, according to Unicode 5.2, the byte order mark is only used at the start of the encoded byte sequence in the UTF-16 encoding scheme, not in either UTF-16LE or UTF-16BE. The byte sequence FE FF at the start of a file or what-have-you would be interpreted as a zero-width no-break space in something that was known to be encoded in the UTF-16BE encoding scheme.
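
The distinction can be observed with Python's built-in codecs, which are used here only as a convenient illustration and are not an XSLT serializer. This is a minimal sketch; the variable names are made up for the example.

```python
import codecs

text = "a"

# The "utf-16" codec follows the UTF-16 encoding scheme: it writes a BOM,
# then the code units in the platform's native byte order.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The explicit-endian codecs write no BOM at all.
assert text.encode("utf-16-be") == b"\x00a"
assert text.encode("utf-16-le") == b"a\x00"

# The bytes FE FF followed by "a" in big-endian order.
data = b"\xfe\xff\x00a"

# Decoded as UTF-16, FE FF is consumed as a byte order mark.
as_utf16 = data.decode("utf-16")

# Decoded as UTF-16BE, FE FF is an ordinary character, U+FEFF
# (zero-width no-break space), and stays in the decoded text.
as_utf16be = data.decode("utf-16-be")

print(repr(as_utf16))    # 'a'
print(repr(as_utf16be))  # '\ufeffa'
```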

For UTF-32, Unicode 5.2 says the byte order mark is optional. Changing the default to yes could break existing implementations. Changing the default to implementation-defined wouldn't harm existing implementations, but I think it could have a slight impact on interoperability if some implementations chose a default byte-order-mark value of yes for UTF-32.

As an aside, I know far more about the distinction between Unicode character encoding schemes and Unicode character encoding forms than I did when I woke up this morning.  :)

[1] http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G7404
Comment 2 Michael Kay 2010-07-15 10:41:30 UTC
The WG decided, following the reasoning of Henry's response, to make no change to the specification. Please feel free to reopen if you think we have missed something.
Comment 3 Oliver Hallam 2010-07-15 10:48:28 UTC
This is an interesting subtlety that I had not appreciated.  I completely agree with Henry's reasoning (and hence the Working Group's decision) and am marking the bug closed.