This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 7630 - [FO] There is no formal definition of the Unicode codepoint collation
Summary: [FO] There is no formal definition of the Unicode codepoint collation
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 1.0
Version: Recommendation
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Depends on:
Reported: 2009-09-15 14:08 UTC by Michael Kay
Modified: 2012-03-27 23:30 UTC
CC List: 0 users

See Also:


Description Michael Kay 2009-09-15 14:08:51 UTC
The specification contains no formal definition of the Unicode codepoint collation.

A suitable definition might be:

declare function compare-seq($x as xs:integer*, $y as xs:integer*) as xs:integer {
   if (count($x) eq 0 or count($y) eq 0) 
   then if (count($x) eq 0 and count($y) eq 0)
        then 0
        else if (count($x) eq 0) then -1 else +1
   else if ($x[1] eq $y[1])
        then compare-seq(remove($x, 1), remove($y, 1))
        else if ($x[1] lt $y[1]) then -1 else +1
};

and then compare($X as xs:string, $Y as xs:string) under the Unicode codepoint collation is defined to have the result compare-seq(string-to-codepoints($X), string-to-codepoints($Y)).
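
For anyone who wants to try the proposed definition outside an XQuery processor, a rough Python transcription might look like the following. It is only a sketch: the names string_to_codepoints, compare_seq and compare are illustrative stand-ins, and it assumes that applying ord() to each character of a Python str yields the same integers as fn:string-to-codepoints, which should hold for well-formed strings.

```python
from typing import List


def string_to_codepoints(s: str) -> List[int]:
    # Assumed stand-in for fn:string-to-codepoints: one integer per character.
    return [ord(ch) for ch in s]


def compare_seq(x: List[int], y: List[int]) -> int:
    # Mirrors the recursive XQuery function from the description.
    if not x or not y:
        if not x and not y:
            return 0
        return -1 if not x else 1
    if x[0] == y[0]:
        # Equal heads: drop the first item of each sequence and recurse.
        return compare_seq(x[1:], y[1:])
    return -1 if x[0] < y[0] else 1


def compare(a: str, b: str) -> int:
    # compare() under the codepoint collation, as proposed in the description.
    return compare_seq(string_to_codepoints(a), string_to_codepoints(b))


print(compare("abc", "abd"))  # -1
print(compare("ab", "a"))     # 1
print(compare("", ""))        # 0
```

The recursion is kept to stay close to the XQuery text; for long strings a loop would avoid Python's recursion limit.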

Problem raised by Patrick Durusau (patrick at durusau dot net) on public-qt-comments, 2 Sept 2009.
Comment 1 Michael Kay 2009-09-29 15:41:00 UTC
The following proposal was accepted by the WG on 2009-09-29

ACTION A-411-02: MK will produce a textual proposal for resolving Bugzilla
#7630 (definition of the Unicode codepoint collation).

For the 1.0/2.0 specification:

Add a new paragraph after the current fourth paragraph of F+O section 7.3.1:

The Unicode codepoint collation does not perform any normalization on the
supplied strings. It is defined as follows. Each of the two strings is
converted to a sequence of integers using the fn:string-to-codepoints
function. These two sequences $A and $B are then compared as follows: 

* If both sequences are empty, the strings are equal

* If one sequence is empty and the other is not, then the string
corresponding to the empty sequence is less than the other string

* If the first integer in $A is less than the first integer in $B, then the
string corresponding to $A is less than the string corresponding to $B.

* If the first integer in $A is greater than the first integer in $B, then
the string corresponding to $A is greater than the string corresponding to
$B.

* Otherwise (the first pair of integers are equal), the result is obtained
by applying the same rules recursively to fn:subsequence($A, 2) and
fn:subsequence($B, 2)
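
The rules above can be sketched iteratively in Python to see what "does not perform any normalization" means in practice. This is an illustration under assumptions, not part of the proposal: codepoint_compare is a hypothetical name, Python characters are taken to correspond to the integers returned by fn:string-to-codepoints, and unicodedata.normalize is used only to show the contrast with a normalizing comparison.

```python
import unicodedata


def codepoint_compare(a: str, b: str) -> int:
    # Walk both strings in step; the first differing pair decides the result.
    for ca, cb in zip(a, b):
        if ca != cb:
            return -1 if ord(ca) < ord(cb) else 1
    # One string is a prefix of the other (or both are exhausted):
    # the shorter one is less, equal lengths mean equal strings.
    return (len(a) > len(b)) - (len(a) < len(b))


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Canonically equivalent, but unequal under the codepoint collation,
# because U+00E9 is greater than U+0065.
print(codepoint_compare(precomposed, decomposed))  # 1

# After NFC normalization (which the collation itself does not apply)
# the two strings become identical.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(codepoint_compare(nfc_a, nfc_b))  # 0
```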

For the 1.1/2.1 specification: Use the same rules, but create a new section
containing the definition of the Unicode codepoint collation and refer to
this section from the appropriate places; and make "Unicode codepoint
collation" a defined term, hyperlinking all references to it.

Comment 2 Michael Kay 2012-03-27 23:30:54 UTC
I note that the agreed change has been made to the 3.0 draft, but the change for the 1.0/2.0 specification does not appear in the published second edition. I have therefore added a reference to this bug to the list of candidate errata (in the xsl-query-specs CVS area), and am herewith closing the bug.