This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The text defining regular expressions in Appendix F Schema Part 2 Second Edition (28 Oct 2004) seems to be inconsistent between the BNF and the accompanying prose. See http://lists.w3.org/Archives/Public/www-xml-schema-comments/2005JulSep/0030.html
It's embarrassing that so many inconsistencies should be present, in such close proximity to material we worked over several times. But Mike Kay's analysis appears to be correct, and I believe the WG should classify this as an error and arrange for a corrigendum.
The WG classified this issue as a requirement at its telcon of 13 January 2006 and instructed the editors to prepare a proposal with the obvious fix.
At the face to face meeting of January 2006 in St. Petersburg, the Working Group decided not to take further action on this issue in XML Schema 1.1. (This issue was not discussed separately; it was one of those which were dispatched by a blanket decision that all other open issues would be closed without action, unless raised again in last-call comments.) Some members of the Working Group expressed regret over not being able to resolve all the issues dealt with in this way, but on the whole the Working Group felt it better not to delay Datatypes 1.1 in order to resolve all of them. This issue should have been marked as RESOLVED /WONTFIX at that time, but apparently was not. I am marking it that way now, to reduce confusion.
I fail to understand how, five-and-a-half years after the XML Schema specification came out, the WG has failed to resolve a simple technical problem that has been known for nearly all that time, and can now deem that the problem will be allowed to remain in the next release of the specification. This isn't something that's difficult to resolve because of environment dependencies or implementation difficulties or political hassles or because it's at the boundaries of computer science. It's a simple, straightforward bug. Schema implementors and schema authors have been tripping over this issue; even W3C working groups have been publishing schemas that work with some processors and not others. Moreover, the QT specifications are impacted because they refer normatively to the regex definitions in Schema Part 2. Closing this as WONTFIX seems to show a wanton disregard for quality. If it's not the purpose of a 1.1 release to fix such problems, what is the purpose?
Since there doesn't seem to be much effort going into resolving this, and since it accounts for a significant proportion of the problems I am having in matching the published test suite results, let me propose a solution.

PROPOSAL

(a) leave the grammar unchanged

(b) in each of the definitions in App. F, where the term being defined is spelt differently from the corresponding metasymbol, add a cross-reference. For example: "Definition: A regular expression (regExp) is composed from zero or more ·branch·es, separated by | characters." This is to remove any ambiguity about whether the term "XML Character" is a reference to the metasymbol XMLChar or to some other concept with a similar name...

(c) expand the definition of Character Range:

[Definition:] A character range (charRange) R identifies a set of characters C(R) containing all XML characters with UCS code points in a specified range.

(d) replace the text below rule 22 as follows:

There are two forms of character range: a ·start-end range·, and a ·single-character range·. A character or ·single character escape· is taken as the start of a ·start-end range· if (a) it is valid as such, and (b) it is immediately followed by a hyphen. Otherwise (if it is valid as such) it is taken as a ·single-character range·.

[Definition:] A ·start-end range· (seRange) s-e identifies the set that contains all XML characters with UCS code points greater than or equal to the code point of s, but not greater than the code point of e. For s-e to be a valid character range, it must satisfy the following rules in addition to those implied by the grammar:
* If s is the first character in a ·character class expression·, then s is not ^
* The code point of e is greater than or equal to the code point of s;

Note: The code point of a ·single character escape· is the code point of the single character in the set of characters that it identifies.

[Definition:] A ·single XML character· (XMLChar) is a ·character range· that identifies the set of characters containing only itself. For a character to be a valid ·character range·, it must satisfy the following rules in addition to those implied by the grammar:
* The ^ character is only valid at the beginning of a ·positive character group· if it is part of a ·negative character group·
* The - character is a valid ·character range· only (a) at the beginning of a ·positive character group·, or (b) if immediately followed by a ']' character

Note: An unescaped - character is handled as follows. If it appears at the start of a ·positive character group· or immediately before a ']' character then it is taken as representing a literal hyphen. If it appears immediately before a '[' character it is taken as representing a subtraction operator (regardless whether what follows is a valid ·character class expression·). If it appears immediately after a character or character escape that is valid as the start of a ·start-end range·, then it causes that character or character escape to be treated as the start of a ·start-end range·. If it appears anywhere else (for example, after another hyphen, or after the end of a ·start-end range· but not followed by '['), then it is an error.

NOTE ON PROPOSAL

Some regex implementations are more permissive than this. For example, they allow - as the start or end of a start-end range, and they allow constructs such as [0-9-A-Z] meaning zero-to-nine, hyphen, or A-Z.
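The hyphen-handling note in the proposal above can be sketched in code. The following Python is only an illustration under stated assumptions, not text from the proposal or from any schema processor: the function name classify_hyphens, the role labels, and the escape table are hypothetical. It handles a single bracketed expression, stops at the first subtraction operator, treats any backslash escape outside the single-character set as a two-character class escape (so category escapes such as \p{L} are not handled), and applies the note's cases in the order they are listed.

```python
# Hypothetical sketch of the proposal's note on unescaped hyphens.
# Assumption: these are the single character escapes of Appendix F.
SINGLE_CHAR_ESCAPES = set("nrt\\|.-^?*+{}()[]")


def classify_hyphens(expr: str) -> list[str]:
    """Return the role of each unescaped '-' in a bracketed expression."""
    if len(expr) < 2 or not (expr.startswith("[") and expr.endswith("]")):
        raise ValueError("expected a bracketed character class expression")
    i = 1
    if expr[i] == "^":          # negative character group marker
        i += 1
    start = i                   # first position of the positive character group
    last = len(expr) - 1        # index of the closing ']'
    prev = "start"              # kind of the previous item
    in_range = False            # True right after a range operator
    roles = []
    while i < last:
        ch = expr[i]
        if ch == "\\":
            # A single character escape may start or end a range;
            # any other escape is treated as a class escape.
            if expr[i + 1] in SINGLE_CHAR_ESCAPES:
                prev = "range_end" if in_range else "atom"
            else:
                prev = "class"
            in_range = False
            i += 2
            continue
        if ch != "-":
            prev = "range_end" if in_range else "atom"
            in_range = False
            i += 1
            continue
        nxt = expr[i + 1]
        if i == start or nxt == "]":
            roles.append("literal")       # literal hyphen
            prev, in_range = "hyphen", False
        elif nxt == "[":
            roles.append("subtraction")   # subtraction operator
            break                         # nested expression not analysed
        elif prev == "atom":
            roles.append("range")         # start-end range operator
            prev, in_range = "range_op", True
        else:
            roles.append("error")         # e.g. after the end of a range
            prev, in_range = "hyphen", False
        i += 1
    return roles
```

Under these assumptions, classify_hyphens("[0-9-A-Z]") reports the second hyphen as an error, which is the case the NOTE ON PROPOSAL says more permissive implementations accept.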
As an implementor and user of Schema, I strongly prefer to see problems like this fixed rather than to have an earlier release. Schema is a central piece of technology, and its quirks and ambiguities create much grief. What makes me happy about the efforts going into 1.1 is that they can now be fixed.
(In reply to comment #0)
> The text defining regular expressions in Appendix F Schema Part 2 Second
> Edition (28 Oct 2004) seems to be inconsistent between the BNF and the
> accompanying prose. See
> http://lists.w3.org/Archives/Public/www-xml-schema-comments/2005JulSep/0030.html
Inasmuch as the comment-list discussion of the referenced comment includes the point that '+' is a metacharacter and hence must always be escaped when a real point is intended, I intend to include that point as part of this bug and fix it as part of this bug's fix.
In reply to comment #7 (a) I haven't been able to locate the comment-list discussion that you refer to (b) I can't see anywhere in the spec that suggests that because a character is a metacharacter, it needs to be escaped when used in a charGroup; so I don't see where the problem with "+" arises.
(In reply to comment #8)
> In reply to comment #7
> (b) I can't see anywhere in the spec that suggests that because a character is
> a metacharacter, it needs to be escaped when used in a charGroup; so I don't
> see where the problem with "+" arises.
@*^$&# convoluted descriptions. I believe you are right. The definition of metacharacter says metacharacters have special meanings in REs, but that is not true when most special characters are used within character groups. Fortunately we haven't acted on this believed-to-be-an-error, so we will ignore it unless someone else convinces the editors that this new (to me, at least) interpretation is wrong.
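The distinction drawn here, that '+' is special outside a character group but not inside one, can be checked quickly. The snippet below is a rough illustration using Python's re module, which is not the Schema regex dialect; the assumption is only that the two dialects agree on this particular point. The helper name matches is hypothetical.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern (Python re, not XSD regex)."""
    return re.fullmatch(pattern, text) is not None


# Inside a character group, an unescaped '+' stands for itself.
inside_group = matches(r"[+]", "+")

# Outside a character group, '+' must be escaped to match a plus sign...
escaped_outside = matches(r"a\+", "a+")

# ...because a bare '+' is read as a quantifier ("one or more a").
bare_outside = matches(r"a+", "a+")
```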
A wording proposal intended to resolve this issue (and some other regex-related issues) is at http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.b1889.html (member-only link).
(In reply to comment #7)
> Inasmuch as the comment-list discussion of the referenced comment includes the
> point that '+' is a metacharacter and hence must always be escaped when a real
> point is intended, I intend to include that point as part of this bug and fix
> it as part of this bug's fix.
It is a fact that *outside a character range expression* an autonymous ("self-naming", not "autonomous") use of a metacharacter must be escaped. But escaping '+' is also covered in 3659, at least for date/time datatypes, so I propose that 3659 be used to track plus signs needing escaping. The rest of this bug is covered by the proposed fix, which was being considered today.
The wording proposal mentioned in comment #10 was adopted by the WG at its call today. We believe it resolves the issue in full, and I am accordingly marking the issue as resolved. Michael, as the originator of the issue, please indicate your acquiescence in the resolution of the issue by changing the issue status to CLOSED, or indicate dissent by reopening it, in the usual way. Since you were on the call and didn't object, I assume you assent, but for form's sake I'll ask for this additional sign. If you don't respond within the next two weeks, we'll assume that silence implies consent.