This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Appendix F in Part 2 of XML Schema 1.0 defines 'metacharacter' thus:

    A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ].

It defines 'normal character' thus:

    [Definition:] A normal character is any XML character that is not a metacharacter. In regular expressions, a normal character is an atom that denotes the singleton set of strings containing only itself.

Production [10], which I take to be defining normal characters, reads:

    Normal Character
    [10] Char ::= [^.\?*+()|#x5B#x5D]

The metacharacters all need escapes, so production 24 is also relevant here:

    Single Character Escape
    [24] SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

I have some questions:

1. Shouldn't { and } (braces) be included in production [10]?

    [10] Char ::= [^.\?*+{}()|#x5B#x5D]

2. Shouldn't | (vertical bar) be among the characters defined as metacharacters?

3. Should ^ (#x5E) be included among the metacharacters?

4. Would it be possible to list the magic characters in the same order in 10 and 24, to make eyeball-based comparisons easier?

I suspect the answer to (2) is 'yes' and the answer to (3) is 'no', on the theory that the term 'metacharacter' is best reserved for characters which have special meaning at the top level of a regular expression and which must therefore have escapes to avoid ambiguity. Hyphen, circumflex, comma, n, r, and t all have special meaning only in special contexts (within character groups, within quantity-range specifications, or after backslash), and so aren't metacharacters in this sense.

See: http://lists.w3.org/Archives/Public/www-xml-schema-comments/2003JulSep/0009.html
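The eyeball comparison asked for in question 4 can also be done mechanically. The following is a minimal sketch, not part of the original report: the three sets are transcribed by hand from the prose definition and from productions 10 and 24 as quoted above, and the names `PROSE_METACHARS`, `PROD10_EXCLUDED`, `PROD24_ESCAPABLE`, and `diff` are purely illustrative. It assumes that the space inside the bracket expression of production 24 is a line-wrap artifact and not a member of the set.

```python
# Character sets transcribed from the text quoted in this report.
# Assumption: the space inside production 24's bracket expression is a
# line-wrap artifact and is not a member of the set.
PROSE_METACHARS = set(".\\?*+{}()[]")         # prose definition of 'metacharacter'
PROD10_EXCLUDED = set(".\\?*+()|[]")          # [10] Char ::= [^.\?*+()|#x5B#x5D]
PROD24_ESCAPABLE = set("nrt\\|.?*+(){}-[]^")  # [24]; #x2D '-', #x5B '[', #x5D ']', #x5E '^'


def diff(a, b):
    """Return the characters in a but not in b, sorted by code point."""
    return sorted(a - b)


print("in prose, missing from [10]:", diff(PROSE_METACHARS, PROD10_EXCLUDED))
print("in [10], missing from prose:", diff(PROD10_EXCLUDED, PROSE_METACHARS))
print("escapable in [24] only:",
      diff(PROD24_ESCAPABLE, PROSE_METACHARS | PROD10_EXCLUDED))
```

Under these assumptions, the first difference is the pair of braces (question 1), the second is the vertical bar (question 2), and the third is the hyphen, the circumflex, and the letters n, r, and t, which have special meaning only in special contexts.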
Note that items 1 and 2 are covered by R-41 (bug 2019)
The WG classified this issue as a requirement at its telcon of 13 January 2006 and instructed the editors to prepare a proposal with the obvious fix.
At the face-to-face meeting of January 2006 in St. Petersburg, the Working Group decided not to take further action on this issue in XML Schema 1.1. (This issue was not discussed separately; it was one of those which were dispatched by a blanket decision that all other open issues would be closed without action, unless raised again in last-call comments.) Some members of the Working Group expressed regret over not being able to resolve all the issues dealt with in this way, but on the whole the Working Group felt it better not to delay Datatypes 1.1 in order to resolve all of them. This issue should have been marked as RESOLVED/WONTFIX at that time, but apparently was not. I am marking it that way now, to reduce confusion.
Since bug 1889 has been reopened, we should probably reopen all of the issues relating to the grammar of regular expressions, including this one.
(In reply to comment #0)
> Appendix F in the Part 2 of XML Schema 1.0 defines 'metacharacter' thus:
>
> A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ].
>
> It defines 'normal character' thus:
>
> [Definition:] A normal character is any XML character that is not a
> metacharacter. In regular expressions, a normal character is an atom that
> denotes the singleton set of strings containing only itself.
>
> Production [10], which I take to be defining normal characters, reads:
>
> Normal Character [10] Char ::= [^.\?*+()|#x5B#x5D]
>
> The metacharacters all need escapes, so production 24 is also relevant here:
>
> Single Character Escape [24] SingleCharEsc ::= '\' [nrt\|.?*+(){}
> #x2D#x5B#x5D#x5E]
>
> I have some questions:

> 3. should ^ (#x5E) be included among the metacharacters?

> I suspect...the answer to (3) is 'no, on the
> theory that the term 'metacharacter' is best reserved for characters which have
> special meaning at the top level of a regular expression and which must
> therefore have escapes to avoid ambiguity. Hyphen, circumflex, comma, n, r, and
> t all have special meaning only in special contexts (within character groups,
> within quantity-range specifications, or after backslash), and so aren't
> metacharacters in this sense.

Let me define characters used autonymously (self-naming) as those which act as single-character classes containing themselves, and metacharacters as those which are not being used autonymously, with the understanding that the same character in different occurrences in an RE can be one or the other.

I'll call the characters selected by the "metacharacter" nonterminal "top-level metacharacters" or "TLMs". "Top-level" refers to "outside of a character class expression".

At top level, many of the TLMs can occur where other characters can occur autonymously; in those locations the TLM would have to be escaped to have autonymous effect.
There are other top-level places where a TLM cannot be a legal metacharacter and could presumably be used autonymously. But the designers of the language apparently didn't want the users to have to wonder, so they made it both possible and required that the TLMs always be escaped. (For that matter, a few TLMs cannot be used as metacharacters in a location where an autonymous character can occur, but that's the language design.)

Within character class expressions, only a few TLMs can be used as metacharacters; '^' (which is not a TLM) can also be so used. The autonymous vs. meta rules are different here: there is no blanket prohibition of potential metacharacters being used autonymously; rather, there are some rules specifying where they can and can't be so used. (A few TLMs still can never be autonymous; those that can't be metacharacters here can always be autonymous; and for '-' and '^' the rules allow each at different places.) But since '^' can't be used as a metacharacter at the top level, it is not in the TLM list.

All the TLMs and '^' are *permitted* to be escaped if their autonymous use is wanted; this is so that a user who is not sure whether a character can be meta at a given location, and who wants autonymous usage, can just escape it and be sure to get the effect they want. That's why '^' is in the single-character-escape list.

Are we having fun yet? ;-)
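The always-escape rule for TLMs at the top level can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `escape_top_level` and the constant `TLMS` are hypothetical, and the TLM set is taken to be the one in the corrected production 10 proposed in comment #0 (braces and vertical bar included). It covers top-level use only; the separate rules for '-' and '^' inside character class expressions are not modelled.

```python
# Assumed TLM set: the characters excluded by the corrected production 10
# proposed in comment #0, i.e. [^.\?*+{}()|#x5B#x5D].
TLMS = set(".\\?*+{}()|[]")


def escape_top_level(literal: str) -> str:
    """Backslash-escape every TLM so the text is matched autonymously.

    Characters that are not TLMs (including '^' and '-') are left alone,
    since at the top level they already stand for themselves.
    """
    return "".join("\\" + ch if ch in TLMS else ch for ch in literal)


print(escape_top_level("a+b"))      # the '+' gains a backslash
print(escape_top_level("x^y-z"))    # unchanged: no TLMs present
print(escape_top_level("{1}|[2]"))  # braces, bar, and brackets are escaped
```

Escaping '^' or '-' as well would also be legal under production 24; the sketch leaves them unescaped only to show that the top-level rule does not require it.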
A wording proposal intended to resolve this issue (and some other regex-related issues) is at http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.b1889.html (member-only link).
The wording proposal mentioned in comment #6 was adopted by the WG at its call today. We believe it resolves the issue in full, and I am accordingly marking the issue as resolved.