This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The specification of normalize-unicode states:

"Returns the value of $arg normalized according to the normalization criteria for a normalization form identified by the value of $normalizationForm."

It also refers to:

"See [Character Model for the World Wide Web 1.0: Normalization] for a description of the normalization forms."

However, consider the following query:

normalize-unicode('̂', 'FULLY-NORMALIZED')

The string consists of a single combining character, so applying Unicode normalization alone does not produce a fully normalized result. I assume the correct way to fully normalize it is to add a leading space character, but I cannot see where this behaviour is specified.
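To make the problem concrete, here is a small Python illustration (not XQuery). It assumes the character in the query is U+0302 COMBINING CIRCUMFLEX ACCENT, and it uses Python's standard `unicodedata` module only as a convenient way to inspect NFC behaviour; it says nothing about how any XQuery processor implements normalize-unicode.

```python
import unicodedata

# Assumption: the query string is the single character U+0302.
mark = "\u0302"

# NFC has nothing to compose the mark with, so the string is unchanged.
nfc = unicodedata.normalize("NFC", mark)
print(nfc == mark)                    # True

# The result still starts with a character of non-zero combining class,
# so it is in NFC but not "fully normalized".
print(unicodedata.combining(nfc[0]))  # 230
```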
This issue is discussed here:

http://lists.w3.org/Archives/Public/public-qt-comments/2003Oct/0198.html

The discussion started with my observation:

"It's not at all clear to me that supporting 'fully-normalized' form makes any sense at all. Whereas the Unicode normalization forms all describe an algorithm for normalizing data, the 'fully-normalized' form is described only as a property of a string. There is no algorithm provided for making a string fully-normalized, and the only algorithms that one might come up with involve losing information."

The next message in the thread summarizes what we concluded about the algorithm:

"... a check that the first character in the string being normalized is a base character (e.g. has a combining class of 0). If the last test fails, a space is inserted at the start of the data to carry the combining mark."

If my memory serves me right, we were assured that the algorithm would be properly described in a future version of CharMod, and we felt that it needed to be fixed in CharMod rather than in our specs. Perhaps that was wishful thinking (many things related to I18N are). For my own part, if I remember right, I decided not to support this optional feature until it was better specified.
I propose that (a) in the 1.0 spec, we don't fix this; (b) in 1.1, we fix it as follows:

Delete the sentence "See [Character Model for the World Wide Web 1.0: Normalization] for a description of the normalization forms."

Substitute "Normalization forms NFC, NFD, NFKC, and NFKD, and algorithms for converting a string to each of these forms, are defined in [Unicode Normalization]." where this is a new normative reference to http://unicode.org/reports/tr15/. Add the standard wording about which version of Unicode may be used.

Add "The motivation for normalization form FULLY-NORMALIZED is described in [charmod-norm]." {which now becomes a non-normative reference} "However, as that specification did not progress beyond working draft status, the normative specification is as follows.

A string is fully-normalized if (a) it is in normalization form NFC as defined by [Unicode Normalization], and (b) it does not start with a composing character.

A composing character is a character that is one or both of the following: (a) the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15], or (b) of non-zero canonical combining class (as defined in [Unicode]).

A string is converted to FULLY-NORMALIZED form as follows: (a) if the first character in the string is a composing character, prepend a single space; (b) convert the string to normalization form NFC."
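The two conversion steps proposed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not spec text or a reference implementation. The names `is_composing` and `fully_normalize` are hypothetical. The test "NFC leaves the character unchanged" is used as a stand-in for "not listed in the Composition Exclusion Table", because Python's `unicodedata` module does not expose that table directly. The Hangul vowel and trailing jamo are added by hand, on the assumption that they count as second characters of the algorithmic Hangul syllable decompositions, which `unicodedata.decomposition` does not report. Results depend on the Unicode version bundled with the Python interpreter.

```python
import functools
import sys
import unicodedata


@functools.lru_cache(maxsize=None)
def _decomposition_seconds():
    """Characters that are second in a two-character canonical decomposition
    of some character that NFC leaves unchanged (assumed proxy for 'not in
    the Composition Exclusion Table')."""
    seconds = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Skip characters without a mapping and tagged compatibility
        # mappings such as "<compat> 0020 0308".
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            seconds.add(chr(int(parts[1], 16)))
    # Assumption: include Hangul vowel (V) and trailing (T) jamo, since
    # Hangul syllables decompose algorithmically.
    seconds.update(chr(cp) for cp in range(0x1161, 0x1176))
    seconds.update(chr(cp) for cp in range(0x11A8, 0x11C3))
    return frozenset(seconds)


def is_composing(ch):
    """Criterion (b) non-zero combining class, or criterion (a) second
    character of a canonical decomposition."""
    return unicodedata.combining(ch) != 0 or ch in _decomposition_seconds()


def fully_normalize(s):
    """Step (a): prepend a space if the string starts with a composing
    character. Step (b): convert to NFC."""
    if s and is_composing(s[0]):
        s = " " + s
    return unicodedata.normalize("NFC", s)


print(repr(fully_normalize("\u0302")))   # leading combining mark gets a space
print(repr(fully_normalize("e\u0301")))  # ordinary NFC composition
```

The first call to `is_composing` scans every code point once to build the lookup set, which is then cached. A production implementation would more likely precompute that set from the Unicode data files.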
In the Joint teleconference of the XSL WG and the XML Query WG on 2010-01-19, minuted at http://lists.w3.org/Archives/Member/w3c-xsl-query/2010Jan/0081.html (member-only link), the proposal in comment 2 was accepted. As a result, I am marking this bug RESOLVED/FIXED. If you agree with the solution adopted, please mark the bug CLOSED.
These changes have now been applied to the baseline spec.