5054 – Unicode character in K2-StringLT-1

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 5054 - Unicode character in K2-StringLT-1

Status:	CLOSED INVALID

Alias:	None

Product:	XML Query Test Suite
Classification:	Unclassified
Component:	XML Query Test Suite
Version:	unspecified
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Frans Englich
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2007-09-17 14:13 UTC by Andrew Eisenberg
Modified:	2007-09-18 17:29 UTC
CC List:	0 users

Description Andrew Eisenberg 2007-09-17 14:13:58 UTC

Test case K2-StringLT-1 contains the comparison of two large codepoints.

I generate the following XQueryX for this test case:

<?xml version="1.0"?>
<xqx:module xmlns:xqx="http://www.w3.org/2005/XQueryX"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.w3.org/2005/XQueryX
                                http://www.w3.org/2005/XQueryX/xqueryx.xsd">
  <xqx:mainModule>
    <xqx:queryBody>
      <xqx:ltOp>
        <xqx:firstOperand>
          <xqx:stringConstantExpr>
            <xqx:value>&#60000;</xqx:value>
          </xqx:stringConstantExpr>
        </xqx:firstOperand>
        <xqx:secondOperand>
          <xqx:stringConstantExpr>
            <xqx:value>&#55300;</xqx:value>
          </xqx:stringConstantExpr>
        </xqx:secondOperand>
      </xqx:ltOp>
    </xqx:queryBody>
  </xqx:mainModule>
</xqx:module>
 

When I attempt to validate this XQueryX, I see this error:

   Character reference "&#55300" is an invalid XML character.


I'm weak on the details of Unicode. I believe that character &#55300; is &#xD804;, which falls in the high-surrogate range. I see the following in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt:

D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;

Perhaps you could change &#xD804; to some other character. I've experimented a bit, and &#xD700; validates just fine.
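
The validator's behaviour here can be checked against the Char production of XML 1.0, which excludes the surrogate block D800-DFFF. The following is a minimal sketch in Python, not taken from the test suite or from any validator; the helper name is_xml10_char is hypothetical and the ranges are those of the XML 1.0 Char production.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production.

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    (hypothetical helper, written for illustration only)
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The three character references discussed in the report.
for cp in (60000, 55300, 0xD700):
    print(f"U+{cp:04X} ({cp}): valid in XML 1.0 = {is_xml10_char(cp)}")
```

Under these assumptions, 60000 (U+EA60) and U+D700 are accepted while 55300 (U+D804) is rejected, which is consistent with the validation error quoted above.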

Comment 1 Michael Kay 2007-09-18 12:57:54 UTC

I think the translation of the query into XQueryX was done incorrectly. From looking at the file at the octet level, the first operand is the octet sequence ee a9 a0, the second is f0 91 85 b0. These are the UTF-8 representations of the characters with codepoints (decimal) 60000 and 70000 respectively. Codepoint 70000 lies outside the Basic Multilingual Plane, so it will be represented in UTF-16 as a surrogate pair, and it looks as if your translation has taken the first 16-bit code unit of that pair (the high surrogate, 0xD804 = 55300) as representing the entire character.
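
The arithmetic in this comment can be reproduced with a short Python sketch. It is an illustration only, not the code used by either party; the function name utf16_surrogates is hypothetical, and the formula is the standard UTF-16 encoding of a supplementary code point.

```python
def utf16_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into its (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; valid only for U+10000..U+10FFFF.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# UTF-8 octets of the two operands, as seen in the test file.
for cp in (60000, 70000):
    print(cp, chr(cp).encode("utf-8").hex(" "))

# UTF-16 surrogate pair for codepoint 70000.
high, low = utf16_surrogates(70000)
print(high, hex(high), hex(low))
```

The high surrogate computed for 70000 is 55300 (0xD804), the value that appeared in the generated XQueryX. A character reference to the full code point, &#70000;, would be the expected second operand.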

Comment 2 Andrew Eisenberg 2007-09-18 17:29:27 UTC

Mike, your comment helped me pinpoint the bug in the XQueryX generation. I agree that the test case is correct as it is.