Issue regenerate_character_and_entity_tables | wiki (member only) |
---|---|
Character and entity tables need to be updated for MathML3 and ISO 9573-13 | |
Many of the tables in chapter 6 need to be updated and regenerated. In this draft references to tables in chapter 6 link to the published MathML2 Recommendation, and are marked [MathML2] |
|
Resolution | None recorded |
Notation and symbols have proved very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. There have been many new signs developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.
The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and has collaborated with the STIX Fonts Project and Unicode Technical Committee (UTC) in undertaking specification of the rest.
This chapter of the MathML specification contains a listing of character names for use with MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [Unicode] and the Unicode Web site. For simplicity we refer to this character set by the short name Unicode. Unicode changes from time to time, so it is specified exactly only by a version number; we give version numbers only where doing so clarifies a point. (MathML 2.0 (Second Edition) is based on Unicode 4.0, and MathML 3.0 on Unicode 5.0.)
While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete, more characters may be added in the future. To ensure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site, notably Unicode Work in Progress and Unicode Technical Report #25 “Unicode Support for Mathematics”.
A MathML token element (see Section 3.2 Token Elements) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 Accessing glyphs for characters from MathML (mglyph). Because the Unicode UCS provided approximately one thousand special alphabetic characters for the use of mathematics with Unicode 3.1, and over 900 further special symbols in Unicode 3.2, the need for mglyph should be rare.
As always in XML, any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal code numbers 09 (tab = U+0009), 0A (line feed = U+000A), 0D (carriage return = U+000D), 20-D7FF (U+0020..U+D7FF), E000-FFFD (U+E000..U+FFFD), and 10000-10FFFF (U+010000..U+10FFFF). The notation, just introduced in parentheses, beginning with U+ is that recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are of the blocks used in surrogate pairs, and the two characters guaranteed not to be Unicode characters at all. U+FFFE is excluded to allow determination of byte order in certain encodings.
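The ranges listed above can be checked mechanically. The following is a minimal sketch, not part of the specification; the names XML_CHAR_RANGES and is_xml_char are hypothetical and chosen only for illustration.

```python
# Code point ranges legal in XML, as listed in the paragraph above.
# XML_CHAR_RANGES and is_xml_char are illustrative names, not from the spec.
XML_CHAR_RANGES = [
    (0x09, 0x0A),         # tab, line feed
    (0x0D, 0x0D),         # carriage return
    (0x20, 0xD7FF),       # up to, but not including, the surrogate blocks
    (0xE000, 0xFFFD),     # excludes U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # the supplementary planes
]


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point falls in one of the legal ranges."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)
```

Under these ranges is_xml_char(0x1D504) is true (Mathematical Fraktur A), while a surrogate code unit such as 0xD800 and the byte-order sentinel 0xFFFE are rejected.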
There are essentially three different ways of encoding character data.
Using characters directly: For example, an A may be entered as 'A' from a keyboard (character U+0041). This option is only available if the character encoding specified for the XML document includes the character. Most commonly used encodings will have 'A' in the ASCII position. In many encodings, characters may need more than one byte. Note that if the document is, for example, encoded in Latin-1 (ISO-8859-1) then only the characters in that encoding are available directly. Using UTF-8 or UTF-16, the only two encodings that all XML processors are required to accept, mathematical symbols can be encoded as character data.
Using numeric XML character references: Using this notation, 'A' may be represented as &#65; (decimal) or &#x41; (hex). Note that the numbers always refer to the Unicode encoding (and not to the character encoding used in the XML file). By using character references it is always possible to access the entire Unicode range. For a general XML vocabulary, there is a disadvantage to this approach: character references may not be used in XML element or attribute names. However, this is not an issue for MathML, as all element names in MathML are restricted to ASCII characters.
Using entity references: The MathML DTD defines internal entities that expand to character data. Thus for example the entity reference &eacute; may be used rather than the character reference &#xE9; or, if, for example, the document is encoded in ISO-8859-1, the character é. An XML fragment that uses an entity reference which is not defined in a DTD is not well-formed; therefore it will be rejected by an XML parser. For this reason every fragment using entity references must use a DOCTYPE declaration which specifies the MathML DTD, or a DTD that at least declares any entity reference used in the MathML instance. The need to use a DOCTYPE complicates inclusion of MathML in some documents. However, entity references are very useful for small illustrative examples, and are used in most examples in this document.
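The three approaches above can be compared with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module. It assumes a small internal DTD subset that declares only the eacute entity, as a stand-in for the full MathML DTD; the variable names and the parses helper are illustrative only.

```python
import xml.etree.ElementTree as ET

# 1. The character used directly (the document encoding must include it).
direct = ET.fromstring("<mi>é</mi>").text

# 2. Numeric character references, decimal and hexadecimal.
decimal_ref = ET.fromstring("<mi>&#233;</mi>").text
hex_ref = ET.fromstring("<mi>&#xE9;</mi>").text

# 3. An entity reference. The internal DTD subset below is an assumption
#    made for this sketch; a real document would reference the MathML DTD.
with_dtd = ET.fromstring(
    '<!DOCTYPE mi [<!ENTITY eacute "&#xE9;">]><mi>&eacute;</mi>'
).text


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Without a declaration, the same entity reference is rejected.
undeclared_ok = parses("<mi>&eacute;</mi>")
```

All four successful parses yield the same single character, U+00E9, while the fragment that uses the entity without declaring it fails to parse.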
For special purposes, one may need to use a character which is not in Unicode. In these cases one may use the mglyph element for direct access to a glyph from some font and creation of a MathML substitute for the corresponding character. All MathML token elements that accept character data also accept an mglyph in their content. Beware, however, that the font chosen may not be available to all MathML processors.
A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. To this end the STIX Fonts Project defined a set of mathematical characters all of which are included in Unicode 5.0.
The additional Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points U+1D400..U+1D7FF in Plane 1, that is, in the first plane with Unicode values of 2^16 or higher. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which was originally the entire extent of Unicode. Support for Plane 1 characters in currently deployed software is not always reliable, but it should be possible in multilingual operating systems, since Plane 2 has many Chinese characters that must be displayable in East Asian locales.
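One practical consequence of living in Plane 1 is that these characters need a surrogate pair in UTF-16 and four bytes in UTF-8. The following sketch illustrates the arithmetic; the function names plane_of and utf16_surrogates are hypothetical.

```python
from typing import Tuple


def plane_of(code_point: int) -> int:
    """Return the Unicode plane number (each plane holds 2**16 code points)."""
    return code_point >> 16


def utf16_surrogates(code_point: int) -> Tuple[int, int]:
    """Return the (high, low) UTF-16 surrogate pair for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


FRAKTUR_A = 0x1D504  # Mathematical Fraktur A, a Plane 1 character
```

For FRAKTUR_A, plane_of gives 1 and utf16_surrogates gives the pair (0xD835, 0xDD04), which matches what Python's own "utf-16-be" codec produces for chr(0x1D504).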
As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphabetic characters. This alternative spans the gap between the specification of Unicode 3.1 and its associated deployment in software and fonts. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphanumeric Symbol characters.
The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is in Plane 1, and the code point for Mathematical Fraktur A is U+1D504. Thus using these characters, a typical example might be
<mi>𝔄</mi>
However, an alternative, equivalent markup would be to use the standard A and modify the identifier using the mathvariant attribute, as follows:
<mi mathvariant="fraktur">A</mi>
The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that certain characters that were already present in Unicode are not in the 'expected' sequence.
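As an illustration of this complication, consider the Fraktur alphabet. Five Fraktur capitals (C, H, I, R and Z) were already encoded in the Letterlike Symbols block of the BMP, so the corresponding Plane 1 positions are left unassigned. The sketch below shows one way a processor might map an unstyled Latin letter to its Fraktur counterpart. The function name to_fraktur and the table name are hypothetical, and only the Fraktur variant is covered.

```python
# Base code points of the Plane 1 Mathematical Fraktur alphabets.
FRAKTUR_CAPITAL_A = 0x1D504
FRAKTUR_SMALL_A = 0x1D51E

# Fraktur capitals that predate Plane 1 and live in the BMP instead.
FRAKTUR_EXCEPTIONS = {
    "C": 0x212D,
    "H": 0x210C,
    "I": 0x2111,
    "R": 0x211C,
    "Z": 0x2128,
}


def to_fraktur(letter: str) -> str:
    """Map one ASCII letter to the matching Fraktur character."""
    if len(letter) != 1:
        raise ValueError("expected a single character")
    if letter in FRAKTUR_EXCEPTIONS:
        return chr(FRAKTUR_EXCEPTIONS[letter])
    if "A" <= letter <= "Z":
        return chr(FRAKTUR_CAPITAL_A + ord(letter) - ord("A"))
    if "a" <= letter <= "z":
        return chr(FRAKTUR_SMALL_A + ord(letter) - ord("a"))
    raise ValueError("no Fraktur form handled for this character")
```

With this mapping, to_fraktur("A") yields U+1D504 by simple offset, whereas to_fraktur("C") yields U+212D from the exception table rather than the unassigned U+1D506.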
Mathematical Alphanumeric Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.
Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly to them. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.
In MathML 2, control of page composition, such as line-breaking, is effected by the use of the proper attributes on the mspace element.
The characters below are not simple spacers. They are especially important new additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the unique recovery of mathematical semantics from text which is visually ambiguous.
Character name | Unicode | Description |
---|---|---|
&InvisibleTimes; | 02062 | marks multiplication when it is understood without a mark (Section 3.2.5 Operator, Fence, Separator or Accent (mo)) |
&InvisibleComma; | 02063 | used as a separator, e.g., in indices (Section 3.2.5 Operator, Fence, Separator or Accent (mo)) |
&ApplyFunction; | 02061 | character showing function application in presentation tagging (Section 3.2.5 Operator, Fence, Separator or Accent (mo)) |
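Because these characters leave no visible mark, it can help to inspect them programmatically. The sketch below looks up the three code points from the table above in Python's unicodedata module and parses a small presentation fragment for f(x) that uses the function-application character. The markup and the variable names are illustrative assumptions, not taken from the specification.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Entity name -> character, following the table above.
INVISIBLE_OPERATORS = {
    "InvisibleTimes": "\u2062",
    "InvisibleComma": "\u2063",
    "ApplyFunction": "\u2061",
}

# Official Unicode character names for each entry.
unicode_names = {
    entity: unicodedata.name(char)
    for entity, char in INVISIBLE_OPERATORS.items()
}

# f(x), with function application made explicit by a numeric reference.
markup = (
    "<mrow><mi>f</mi><mo>&#x2061;</mo>"
    "<mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow>"
)
row = ET.fromstring(markup)
operator = row[1].text  # content of the second child, the mo element
```

A numeric reference is used here rather than the entity name so that the fragment parses without a DOCTYPE declaration.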