7630 – [FO] There is no formal definition of the Unicode codepoint collation

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0
Version:	Recommendation
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2009-09-15 14:08 UTC by Michael Kay
Modified:	2012-03-27 23:30 UTC
CC List:	0 users

Description Michael Kay 2009-09-15 14:08:51 UTC

The specification contains no formal definition of the Unicode codepoint collation, which is identified by the URI http://www.w3.org/2005/xpath-functions/collation/codepoint

A suitable definition might be:

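(: Compares two sequences of codepoints; returns -1, 0 or +1. :)
(: The local: prefix is used because user functions cannot be declared in the fn namespace. :)
declare function local:compare-seq($x as xs:integer*, $y as xs:integer*) as xs:integer {
   (: At least one sequence is exhausted: an empty sequence sorts first. :)
   if (count($x) eq 0 or count($y) eq 0) 
   then if (count($x) eq 0 and count($y) eq 0)
        then 0
        else if (count($x) eq 0) then -1 else +1
   (: Leading codepoints are equal: recurse on the remainders. :)
   else if ($x[1] eq $y[1])
        then local:compare-seq(remove($x, 1), remove($y, 1))
        else if ($x[1] lt $y[1]) then -1 else +1
};

and then compare($X as xs:string, $Y as xs:string) under the Unicode codepoint collation is defined to have the result local:compare-seq(string-to-codepoints($X), string-to-codepoints($Y)).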

Problem raised by Patrick Durusau (patrick at durusau dot net) on public-qt-comments, 2 Sept 2009.

Comment 1 Michael Kay 2009-09-29 15:41:00 UTC

The following proposal was accepted by the WG on 2009-09-29:

ACTION A-411-02: MK will produce a textual proposal for resolving Bugzilla
#7630 (definition of the Unicode codepoint collation).

For the 1.0/2.0 specification:

Add a new paragraph after the current fourth paragraph of F+O section 7.3.1:

The Unicode codepoint collation does not perform any normalization on the
supplied strings. It is defined as follows. Each of the two strings is
converted to a sequence of integers using the fn:string-to-codepoints
function. These two sequences $A and $B are then compared as follows: 

* If both sequences are empty, the strings are equal.

* If one sequence is empty and the other is not, then the string
corresponding to the empty sequence is less than the other string.

* If the first integer in $A is less than the first integer in $B, then the
string corresponding to $A is less than the string corresponding to $B.

* If the first integer in $A is greater than the first integer in $B, then
the string corresponding to $A is greater than the string corresponding to
$B.

* Otherwise (the first pair of integers are equal), the result is obtained
by applying the same rules recursively to fn:subsequence($A, 2) and
fn:subsequence($B, 2).
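
As an illustration only (not part of the proposal text), the rules above can be sketched in Python rather than XQuery. The names compare_codepoint_seqs and codepoint_compare are hypothetical, ord() is assumed to stand in for fn:string-to-codepoints, and the recursion on the tail of each sequence is rewritten as a loop over an index.

```python
def compare_codepoint_seqs(a, b):
    """Compare two integer sequences; return -1, 0 or +1 (hypothetical helper)."""
    i = 0
    while True:
        a_done = i >= len(a)
        b_done = i >= len(b)
        if a_done and b_done:
            return 0    # both sequences exhausted: equal
        if a_done:
            return -1   # only $A exhausted: $A is less
        if b_done:
            return 1    # only $B exhausted: $A is greater
        if a[i] < b[i]:
            return -1
        if a[i] > b[i]:
            return 1
        i += 1          # equal leading integers: move on to the rest


def codepoint_compare(x, y):
    """Compare two strings by codepoint, with no normalization applied."""
    # Assumption: ord() plays the role of fn:string-to-codepoints here.
    return compare_codepoint_seqs([ord(c) for c in x], [ord(c) for c in y])


if __name__ == "__main__":
    print(codepoint_compare("abc", "abd"))
    print(codepoint_compare("Z", "a"))
    print(codepoint_compare("\u00e9", "e\u0301"))
```

Because no normalization is performed, the precomposed character U+00E9 and the sequence U+0065 U+0301 compare as different strings in this sketch, even though they are canonically equivalent.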

For the 1.1/2.1 specification: Use the same rules, but create a new section
containing the definition of the Unicode codepoint collation and refer to
this section from the appropriate places; and make "Unicode codepoint
collation" a defined term, hyperlinking all references to it.

Comment 2 Michael Kay 2012-03-27 23:30:54 UTC

I note that the agreed change has been made to the 3.0 draft, but the change for the 1.0/2.0 specification does not appear in the published second edition. I have therefore added a reference to this bug to the list of candidate errata (in the xsl-query-specs CVS area), and am herewith closing the bug.