5818 – Unicode Database: shifting sands

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED FIXED

Alias:	None

Product:	XML Schema
Classification:	Unclassified
Component:	Datatypes: XSD Part 2
Version:	1.1 only
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	C. M. Sperberg-McQueen
QA Contact:	XML Schema comments list

URL:
Whiteboard:
Keywords:	resolved

Depends on:
Blocks:	10008

Reported:	2008-06-27 20:21 UTC by Michael Kay
Modified:	2010-11-10 17:42 UTC
CC List:	3 users

Description Michael Kay 2008-06-27 20:21:23 UTC

There is a Note in G.1.1:

Note: [Unicode Database] is subject to future revision.  For example, the mapping from code points to character properties might be updated. All ·minimally conforming· processors ·must· support the character properties defined in the version of [Unicode Database] cited in the normative references (Normative (§K.1)).  However, implementors are encouraged to support the character properties defined in any future version.

I'm not sure that it is possible to do both. In Unicode 3.1, and therefore in XML Schema 1.0, the Ethiopic digits x1369-x1371 were in group Nd (and therefore matched \d). In Unicode 4.1 they have been moved to group No (so they no longer match \d). A given processor, unless it has configuration options to put this under user control -- which seems unduly onerous -- is either going to support the new version or the old. In one case, x1369 will match \d, in the other case it won't. In practice, it's quite likely to depend on which version of Java or .NET you are using.

So I think we should either pin things down so processors are required to support Unicode version 4.1 and no other, or we should remove the "must" from the above note, and make it implementation-defined which version of Unicode is used.

(In any case, what is a "must" doing in a Note?)

Test case reS17 in the Microsoft regex test suite is relevant: its results depend on which version of Unicode you believe in.
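
The version dependence described above can be observed from a scripting language. The following is a minimal sketch in Python, not a statement about any XSD processor: it assumes a CPython build whose standard unicodedata module exposes both the current database and the frozen ucd_3_2_0 snapshot. Unicode 3.2.0 is used here only as a stand-in for the 3.1 database cited by XML Schema 1.0, on the assumption that the Ethiopic digits were not reclassified before 4.1. The names ETHIOPIC_DIGITS and category_report are made up for this illustration.

```python
import unicodedata

# U+1369..U+1371 inclusive: the Ethiopic digits discussed in this report.
ETHIOPIC_DIGITS = range(0x1369, 0x1372)


def category_report(code_points=ETHIOPIC_DIGITS):
    """Return (code point label, old category, current category) triples.

    "old" is the Unicode 3.2.0 snapshot bundled with CPython (a stand-in
    for Unicode 3.1); "current" is whatever database this Python ships.
    """
    old_db = unicodedata.ucd_3_2_0
    new_db = unicodedata
    return [
        (f"U+{cp:04X}", old_db.category(chr(cp)), new_db.category(chr(cp)))
        for cp in code_points
    ]


if __name__ == "__main__":
    print("old database:", unicodedata.ucd_3_2_0.unidata_version)
    print("current database:", unicodedata.unidata_version)
    for label, old_cat, new_cat in category_report():
        # A character matches \d only under a database that puts it in Nd.
        print(label, old_cat, new_cat)
```

Any row where the two category columns differ is a character for which \d gives different answers depending on the database in use.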

Comment 1 David Ezell 2008-07-18 17:31:57 UTC

During the telcon of 2008-07-18 the WG decided to classify this bug as being editorial, since the issue is with a "Note:", and those are non-normative by definition.

The WG instructed the editors to make a small change to the note to soften its strictness.  One option discussed was to change the last sentence of the note to read as follows:

However, implementors are encouraged to support the character properties defined in any future version, possibly with such support being engaged under user control.

Comment 2 Michael Kay 2008-10-28 12:06:08 UTC

See also

http://lists.w3.org/Archives/Public/www-xml-schema-comments/2008OctDec/0076.html

from James Clark

Comment 3 C. M. Sperberg-McQueen 2009-05-09 14:38:36 UTC

Reviewing this bug report with a view toward proposing a change to resolve it, I have come to believe that the changes made to appendix G.1.1 in connection with bug 5948 may already have addressed the concerns raised here (although not the concerns about 1.0 2E in the email from James Clark cited in comment 2).

The note on which the comment was originally raised has been split into a normative paragraph and a note.  The current text is 

    [Unicode Database] is subject to future revision.  For example,
    the mapping from code points to character properties might be
    updated. All ·minimally conforming· processors ·must· support the
    character properties defined in the version of [Unicode Database]
    cited in the normative references (Normative (§K.1)).  However,
    implementors are encouraged to support the character properties
    defined in any later versions. When the implementation supports
    multiple versions of the Unicode database, and they differ in
    salient respects (e.g. different properties are assigned to the
    same character in different versions of the database), then it is
    ·implementation-defined· which set of property definitions is used
    for any given assessment episode.

        Note: In order to benefit from continuing work on the Unicode
        database, a conforming implementation might by default use the
        latest supported version of the character properties. In order
        to maximize consistency with other implementations of this
        specification, however, an implementation might choose to
        provide user options to specify the use of the version of the
        database cited in the normative references. The
        PropertyAliases.txt and PropertyValueAliases.txt files of the
        Unicode database may be helpful to implementors in this
        connection.
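
As an illustration of the "user options" idea in the note above, here is a minimal sketch in Python of a property lookup whose database is chosen once per assessment episode. Everything in it is hypothetical: AssessmentOptions, has_property and the labels "pinned" and "latest" are invented for this example, and CPython's bundled ucd_3_2_0 snapshot merely stands in for "the version cited in the normative references", which a real processor would have to supply itself.

```python
import unicodedata
from dataclasses import dataclass

# Hypothetical labels for the databases this sketch can choose between.
# ucd_3_2_0 is only a stand-in for the normatively cited version.
_DATABASES = {
    "pinned": unicodedata.ucd_3_2_0,
    "latest": unicodedata,
}


@dataclass(frozen=True)
class AssessmentOptions:
    """Per-episode settings; unicode_db must be a key of _DATABASES."""
    unicode_db: str = "latest"


def has_property(ch, prop, options=AssessmentOptions()):
    """Report whether ch belongs to general category prop.

    prop may be a two-letter category such as "Nd" or a one-letter
    major class such as "N", which covers Nd, Nl and No.
    """
    category = _DATABASES[options.unicode_db].category(ch)
    return category.startswith(prop)
```

With this shape, the default follows the latest supported database, and a user who wants cross-implementation consistency passes AssessmentOptions("pinned") for the whole episode rather than switching per character.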

In addition, there is a later reference to changes in the Unicode database; the current text at that location now reads:

    [Unicode Database] has been revised since XSD 1.0 was published,
    and is subject to future revision. In particular, the grouping of
    code points into blocks has changed, and may change again. All
    ·minimally conforming· processors must support the blocks defined
    in the version of [Unicode Database] cited in the normative
    references (Normative (§K.1)). However, implementors are
    encouraged to support the blocks defined in earlier and/or later
    versions of the Unicode Standard. When the implementation supports
    multiple versions of the Unicode database, and they differ in
    salient respects (e.g. different characters are assigned to a
    given block in different versions of the database), then it is
    ·implementation-defined· which set of block definitions is used
    for any given assessment episode.

    In particular, the version of [Unicode Database] referenced in XSD
    1.0 (namely, Unicode 3.1) contained the following blocks which
    have been renamed in the version cited in this
    specification. Since these block names may appear in regular
    expressions within XSD 1.0 schemas, implementors are encouraged to
    support the superseded block names in XSD 1.1 processors for
    compatibility, either by default or at user option:

        #x0370 - #x03FF: Greek
        #x20D0 - #x20FF: CombiningMarksforSymbols
        #xE000 - #xF8FF: PrivateUse
        #xF0000 - #xFFFFD: PrivateUse
        #x100000 - #x10FFFD: PrivateUse
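
The compatibility behaviour suggested above could be sketched as a fallback table consulted when a block name is not found among the current names. This is a minimal Python sketch under stated assumptions: LEGACY_BLOCKS copies the names and ranges from the list above, while CURRENT_BLOCKS is a one-entry placeholder for the processor's real table of current block names, and block_ranges, in_block and allow_legacy are hypothetical names.

```python
# Superseded Unicode 3.1 block names and ranges, as listed above.
LEGACY_BLOCKS = {
    "Greek": [(0x0370, 0x03FF)],
    "CombiningMarksforSymbols": [(0x20D0, 0x20FF)],
    "PrivateUse": [
        (0xE000, 0xF8FF),
        (0xF0000, 0xFFFFD),
        (0x100000, 0x10FFFD),
    ],
}

# Placeholder for the processor's table of current block names;
# deliberately abbreviated to a single entry for this sketch.
CURRENT_BLOCKS = {
    "BasicLatin": [(0x0000, 0x007F)],
}


def block_ranges(name, allow_legacy=True):
    """Resolve a block name to code point ranges, current names first."""
    if name in CURRENT_BLOCKS:
        return CURRENT_BLOCKS[name]
    if allow_legacy and name in LEGACY_BLOCKS:
        return LEGACY_BLOCKS[name]
    raise KeyError(f"unknown block name: {name}")


def in_block(ch, name, allow_legacy=True):
    """Report whether ch falls in the named block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in block_ranges(name, allow_legacy))
```

Setting allow_legacy=False models a processor that rejects the superseded names, so the "by default or at user option" choice reduces to the default value of that one flag.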

To see the text in context, consult the current CR document at 

    http://www.w3.org/TR/xmlschema11-2/#charcter-classes

or the current status-quo document at 

    http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.html#charcter-classes

I'm marking this issue as needsReview to signal that I think the WG
needs to consider whether this issue has already been resolved (and
should have been so marked when we resolved bug 5948).

Comment 4 David Ezell 2009-06-26 15:27:01 UTC

WG agrees with MSM's assessment. Closing as overtaken.

Comment 5 David Ezell 2010-11-10 17:42:22 UTC

The WG reported this bug as FIXED on 2010-06-24.  We are closing this bug
as requiring no further work.  If there are issues remaining, you can reopen
this bug and enter a comment to indicate the problem.  Thanks very much for the
feedback.