This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.
The specification states: <quote> [Definition:] [Unicode Database] specifies a number of possible values for the "General Category" property and provides mappings from code points to specific character properties. The set containing all characters that have property X, can be identified with a category escape \p{X}. The complement of this set is specified with the category escape \P{X}. ([\P{X}] = [^\p{X}]). </quote>

It then gives a table purporting to show the values of "General Category" that occur in Unicode 5.1. This includes single-character categories such as "C", "L", and "M". As far as I can see, however, Unicode only defines the two-character categories such as Ll, Lu, Mc and so on. The single-character categories are an invention of the regex language, and therefore need to be described in our specification, rather than by reference to Unicode.

There are two possible definitions of these categories, which give different results. At least one XML Schema implementation has interpreted the single-character category X to be the union of all two-character categories starting with X; for example, C is the union of Cc, Cf, Co, and Cn. However, another interpretation (the one used by the Java regex library) is that it is the set of all characters listed in the Unicode database as belonging to a category starting with that letter. This gives a different result in the case of category C, since Cn is the set of characters that are not listed in the relevant section of the Unicode database.
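To make the difference concrete, here is a minimal Python sketch of the two readings of category C. It is only an illustration, not taken from any schema processor or from the Java library: the function names in_c_union and in_c_listed_only are hypothetical, and the category sets are the ones named in this report. It relies on the standard unicodedata module, which reports Cn for code points with no entry in the database.

```python
import unicodedata

# Reading 1 (assumed from this report): C is the union of the
# two-character categories Cc, Cf, Co and Cn.
C_UNION = {"Cc", "Cf", "Co", "Cn"}

# Reading 2 (assumed from this report): only characters actually listed
# in the database under a C* category, so Cn is left out.
C_LISTED_ONLY = {"Cc", "Cf", "Co"}


def in_c_union(ch: str) -> bool:
    """True if ch is in C under the 'union of subcategories' reading."""
    return unicodedata.category(ch) in C_UNION


def in_c_listed_only(ch: str) -> bool:
    """True if ch is in C under the 'listed characters only' reading."""
    return unicodedata.category(ch) in C_LISTED_ONLY


if __name__ == "__main__":
    # U+0000 is a control character (Cc); U+FFFF is a noncharacter,
    # which has no entry in the database and is reported as Cn.
    for ch in ("\u0000", "\uffff", "A"):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              in_c_union(ch), in_c_listed_only(ch))
```

The two functions agree on U+0000 and on "A", and disagree only on code points reported as Cn, such as U+FFFF.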
For reference, the Unicode 5.1 definition of character categories is here: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
On the telcon the WG agreed that C should include Cn, but that a note should be added explaining that Java does it differently.
WG accepted http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.b8744.html