Overview: Mathematical Markup Language (MathML) Version 3.0
Previous: 5 Combining Presentation and Content Markup
Next: 7 The MathML Interface
6 Characters, Entities and Fonts
6.1 Introduction
6.2 MathML Characters
6.2.1 Unicode Character Data
6.2.2 Special Characters Not in Unicode
6.2.3 Mathematical Alphanumeric Symbols Characters
6.2.4 Non-Marking Characters
6.3 Character Symbol Listings
6.3.1 Special Constants
6.3.2 Character Tables (ASCII format)
6.3.3 Tables arranged by Unicode block
6.3.4 Negated Mathematical Characters
6.3.5 Variant Mathematical Characters
6.3.6 Mathematical Alphanumeric Symbols
6.3.7 MathML Character Names
Issue (regenerate_character_and_entity_tables): Character and entity tables need to be updated for MathML3 and ISO 9573-13.
Many of the tables in chapter 6 need to be updated and regenerated. In this draft, references to tables in chapter 6 link to the published MathML2 Recommendation and are marked [MathML2].
Resolution: None recorded.
Notation and symbols have proved to be very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. There have been many new signs developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.
The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and has collaborated with the STIX Fonts Project and Unicode Technical Committee (UTC) in undertaking specification of the rest.
This chapter of the MathML specification contains a listing of character names for use with MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [see [Unicode] and the Unicode Web site]. For simplicity we refer to this character set by the short name Unicode. Unicode changes from time to time, so it is specified exactly only by a version number; we omit version numbers unless they clarify some point. (MathML 2.0 (Second Edition) is based on Unicode 4.0, and MathML 3.0 on Unicode 5.0.)
While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete, more characters may be added in the future. To ensure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site, notably Unicode Work in Progress and Unicode Technical Report #25 “Unicode Support for Mathematics”.
A MathML token element (see Section 3.2 Token Elements and Section 4.4.1 Token Elements) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 Accessing glyphs for characters from MathML (mglyph). Because Unicode 5.0 provides approximately one thousand mathematical alphanumeric characters and as many further special symbols, the need for mglyph should be rare.
In principle any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal codes U+0009 (tab), U+000A (line feed), U+000D (carriage return), U+0020..U+D7FF, U+E000..U+FDCF, U+FDF0..U+FFEF, and U+010000..U+10FFFF minus the last two characters in each 64K plane. The notation beginning with U+ is recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are the blocks used in surrogate pairs, and characters guaranteed not to be Unicode characters, i.e., for internal use only. U+FFFE is excluded to allow determination of byte order in 16-bit encodings. As a practical matter, there are other characters to avoid, as described in Unicode Technical Report #20 “Unicode in XML and other Markup Languages”.
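The list of legal code points above can be turned into a simple membership test. The following Python sketch is illustrative only: the function name `is_listed_legal` and the table `LEGAL_RANGES` are hypothetical, and the ranges are copied from the paragraph above rather than from the XML specification, which remains the authority.

```python
# Ranges transcribed from the paragraph above (an assumption of this sketch;
# consult the XML specification for the normative definition).
LEGAL_RANGES = [
    (0x0009, 0x0009),    # tab
    (0x000A, 0x000A),    # line feed
    (0x000D, 0x000D),    # carriage return
    (0x0020, 0xD7FF),    # everything below the surrogate blocks
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFEF),
    (0x10000, 0x10FFFF), # supplementary planes, filtered further below
]


def is_listed_legal(cp: int) -> bool:
    """Return True if code point cp falls in the ranges listed in the text."""
    # "minus the last two characters in each 64K plane"
    if cp >= 0x10000 and (cp & 0xFFFF) >= 0xFFFE:
        return False
    return any(lo <= cp <= hi for lo, hi in LEGAL_RANGES)
```

For instance, `is_listed_legal(0x1D504)` accepts Mathematical Fraktur A, while `is_listed_legal(0xD800)` rejects a surrogate code point.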
There are essentially three different ways of encoding character data.
For special purposes, one may need to use a character which is not in Unicode. In these cases one may use the mglyph element for direct access to a glyph from some font and creation of a MathML substitute for the corresponding character. All MathML token elements that accept character data also accept an mglyph in their content.
Beware, however, that the font chosen may not be available to all MathML processors. The UTC has a policy that new mathematical characters that appear in technical journals will be added to the Unicode standard.
A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. To this end the STIX Fonts Project defined a set of mathematical characters all of which are included in Unicode 5.0.
The additional Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points U+1D400..U+1D7FF in Plane 1, that is, in the first plane with Unicode values higher than 2^16. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP), which was originally the entire extent of Unicode. Support for Plane 1 characters in currently deployed software is not always reliable, but it should be possible in multilingual operating systems, since Plane 2 has many Chinese characters that must be displayable in East Asian locales.
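To make the plane arithmetic concrete, the sketch below computes the plane of a code point and the UTF-16 code units needed to store it. The helper names `plane` and `utf16_units` are hypothetical; the surrogate-pair formula is the standard UTF-16 encoding rule.

```python
def plane(cp: int) -> int:
    """Plane number of a code point: 0 is the BMP, 1 is the SMP, and so on."""
    return cp >> 16


def utf16_units(cp: int) -> list:
    """UTF-16 code units for cp: one unit in the BMP, a surrogate pair above it."""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]
```

Mathematical Fraktur A (U+1D504) lies in Plane 1 and therefore needs two 16-bit units, which is one reason software written for BMP-only text may mishandle it.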
As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphabetic characters. This alternative spans the gap between the specification of Unicode 3.1 and its associated deployment in software and fonts. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphanumeric Symbol characters.
The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is in Plane 1, and the code point for Mathematical Fraktur A is U+1D504. Thus using these characters, a typical example might be
<mi>𝔄</mi>
However, an alternative, equivalent markup would be to use the ASCII A and modify the identifier using the mathvariant attribute, as follows:
<mi mathvariant="fraktur">A</mi>
The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that mathematical alphabetic characters already present in the Unicode 2.0 Letterlike Symbols block (U+2100..U+214F) remain in that block and hence do not appear in their 'expected' sequences in Plane 1.
The detailed correspondence is shown in the tables given in Section 6.3.6 Mathematical Alphanumeric Symbols.
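As an illustration of how the Letterlike Symbols exceptions complicate the mapping, here is a minimal Python sketch for the single case mathvariant="fraktur". The function name `fraktur` and the constant names are hypothetical, and the sketch covers only ASCII letters; the tables referenced above are the normative correspondence.

```python
FRAKTUR_CAPITAL_A = 0x1D504  # MATHEMATICAL FRAKTUR CAPITAL A
FRAKTUR_SMALL_A = 0x1D51E    # MATHEMATICAL FRAKTUR SMALL A

# Fraktur capitals that already existed in the Letterlike Symbols block
# (U+2100..U+214F) and therefore leave gaps in the Plane 1 run.
FRAKTUR_LETTERLIKE = {
    "C": 0x212D,
    "H": 0x210C,
    "I": 0x2111,
    "R": 0x211C,
    "Z": 0x2128,
}


def fraktur(ch: str) -> str:
    """Map one ASCII letter to its Fraktur counterpart; return others unchanged."""
    if ch in FRAKTUR_LETTERLIKE:
        return chr(FRAKTUR_LETTERLIKE[ch])
    if len(ch) == 1 and "A" <= ch <= "Z":
        return chr(FRAKTUR_CAPITAL_A + ord(ch) - ord("A"))
    if len(ch) == 1 and "a" <= ch <= "z":
        return chr(FRAKTUR_SMALL_A + ord(ch) - ord("a"))
    return ch
```

Under these assumptions `fraktur("A")` yields U+1D504, matching the example above, while `fraktur("C")` yields U+212D from the Letterlike Symbols block rather than a Plane 1 code point.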
Mathematical Alphanumeric Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.
Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly to them. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.
In MathML 2, control of page composition, such as line breaking, is effected by the use of the proper attributes on the mspace element.
The characters below are not simple spacers. They are important additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the recovery of mathematical semantics from text which is visually ambiguous.
Character name | Unicode | Description |
---|---|---|
&ApplyFunction; | 02061 | character showing function application in presentation tagging (Section 3.2.5 Operator, Fence, Separator or Accent (mo)) |
&InvisibleTimes; | 02062 | marks multiplication when it is understood without a mark (Section 3.2.5 Operator, Fence, Separator or Accent (mo)) |
&InvisibleComma; | 02063 | used as a separator, e.g., in indices (Section 3.2.5 Operator, Fence, Separator or Accent (mo)) |
 | 02064 | used as a separator, e.g., between the 1 and the ½ in expressions like 1½ |
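Because these characters leave no visible mark, it can help to inspect markup programmatically. The following Python sketch, using only the standard library, finds non-marking characters inside mo elements and reports their Unicode names. The function name `invisible_operators` and the sample markup are hypothetical; numeric character references are used so that no DTD is needed to parse the fragment.

```python
import unicodedata
import xml.etree.ElementTree as ET


def invisible_operators(markup: str) -> list:
    """List (code point, Unicode name) for format characters found in <mo>."""
    root = ET.fromstring(markup)
    found = []
    for mo in root.iter("mo"):
        for ch in mo.text or "":
            # General category Cf ("format") covers U+2061..U+2064.
            if unicodedata.category(ch) == "Cf":
                found.append(("U+%04X" % ord(ch), unicodedata.name(ch)))
    return found


# f applied to x, plus 2 times y, with both operators invisible.
SAMPLE = (
    "<mrow><mi>f</mi><mo>&#x2061;</mo><mi>x</mi><mo>+</mo>"
    "<mn>2</mn><mo>&#x2062;</mo><mi>y</mi></mrow>"
)
```

Calling `invisible_operators(SAMPLE)` reports the function application and invisible times characters while skipping the visible plus sign.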
The Universal Character Set (UCS) of Unicode and ISO 10646 continues to evolve. At the time of writing the standard is Unicode 4.0. For the latest developments in character standards, as far as they influence mathematical formalism, consult the home page of the W3C Math Activity.
The characters are given with entity names as well as Unicode numbers. To facilitate comprehension of a fairly large list of names, which totals over 2000 in this case, we offer more than one way to find a given character. A corresponding full set of entity declarations is in the DTD in Appendix A Parsing MathML. For discussion of entity declarations see that appendix.
The characters are listed by name, and sample glyphs provided for all of them. Each character name is accompanied by a code for a character grouping chosen from a list given below, a short verbal description, and a Unicode hex code drawn from ISO 10646.
The character listings by alphabetical and Unicode order in Section 6.3.7 MathML Character Names are in harmony with the ISO character sets given, in that if some part of a set is included then the entire set is included.
To begin we list separately a few special characters which MathML introduced. Rather like the non-marking characters above, they provide very useful capabilities in the context of machinable mathematics.
Entity name | Unicode | Description |
---|---|---|
&CapitalDifferentialD; | 02145 | D for use in differentials, e.g. within integrals |
&DifferentialD; | 02146 | d for use in differentials, e.g. within integrals |
&ExponentialE; | 02147 | e for use for the exponential base of the natural logarithms |
&ImaginaryI; | 02148 | i for use as a square root of -1 |
 | 02149 | j for use as a square root of -1 |
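Entity names are only usable when the entity declarations are available to the parser, so a processor without the DTD may prefer numeric character references. The sketch below rewrites a few named entities to that form. The code points come from the table above; the entity names `CapitalDifferentialD`, `DifferentialD`, `ExponentialE` and `ImaginaryI` are assumed here to be the conventional MathML names for those code points, and the helper `expand_entities` is hypothetical.

```python
import re

# Assumed entity names paired with the code points listed in the table above.
SPECIAL_CONSTANTS = {
    "CapitalDifferentialD": 0x2145,
    "DifferentialD": 0x2146,
    "ExponentialE": 0x2147,
    "ImaginaryI": 0x2148,
}


def expand_entities(markup: str) -> str:
    """Replace known named entities with hexadecimal character references."""
    def replace(match):
        name = match.group(1)
        if name in SPECIAL_CONSTANTS:
            return "&#x%04X;" % SPECIAL_CONSTANTS[name]
        return match.group(0)  # leave unknown or predefined entities alone

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)
```

For example, `expand_entities("<mo>&DifferentialD;</mo><mi>x</mi>")` returns markup that a DTD-less XML parser can read, while predefined XML entities such as `&amp;` pass through untouched.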
The first table offered is a very large ASCII listing of characters considered particularly relevant to mathematics. This is given in Unicode order [MathML2]. Most, but not all, of these characters have MathML names defined via entity declarations in the DTD. Those that do not are usually symbols which seem mathematically peripheral, such as dingbats, machine graphics or technical symbols.
A second table lists those characters that do have MathML entity names, ordered alphabetically [MathML2], with a lower-case letter preceding its upper-case counterpart. A third table showing the characters according to Unicode blocks is given in Section 2.4 of Unicode Technical Report #25.
Accented characters should be represented by an appropriate <mover accent="true"> or <munder accent="true"> construct with the over/underscript giving the desired accent. For all but one combining mark, the combining mark itself can be used for the over/underscript, or alternatively the corresponding spacing mark can be used. So for an a tilde, one can use
<mover accent="true">
<mi>a</mi>
<mo>&tilde;</mo>
</mover>
or
<mover accent="true">
<mi>a</mi>
<mo>&#x0303;</mo>
</mover>
However for the combining solidus U+0338 one should use the ASCII solidus (U+002F) as an alias accent, since Unicode Normalization Form C replaces the XML closing angle bracket U+003E followed by U+0338 by the “not greater than” symbol U+226F (≯) thereby destroying the XML structure. Note that fully composed characters like é should not be used to represent mathematical variables. The <mover> construct should be used to obtain the correct mathematical typography.
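The hazard with the combining solidus can be reproduced with Python's standard `unicodedata` module. This is a small demonstration, not a recommended processing step; the variable names are hypothetical.

```python
import unicodedata

# An <mo> whose content is the bare combining solidus U+0338: the combining
# character directly follows the '>' that closes the start tag.
unsafe = "<mo>\u0338</mo>"

# The alias recommended in the text: the ASCII solidus U+002F.
safe = "<mo>/</mo>"

unsafe_nfc = unicodedata.normalize("NFC", unsafe)
safe_nfc = unicodedata.normalize("NFC", safe)

# NFC fuses '>' + U+0338 into U+226F, so the start tag loses its '>'.
tag_destroyed = unsafe_nfc == "<mo\u226F</mo>"
```

After normalization the unsafe string no longer contains a closed start tag, whereas the ASCII solidus version is unchanged.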
In addition to the Unicode Characters so far listed, one may use the combining characters U+0338 (/), U+20D2 (|) and U+20E5 (\) to produce negated or canceled forms of characters. A combining character should be placed immediately after its 'base' character, with no intervening markup or space, just as is the case for combining accents.
In principle, the negation characters may be applied to any Unicode character, although fonts designed for mathematics have some negated glyphs already composed. A MathML renderer should be able to use these pre-composed glyphs in these cases. A compound character code either represents a UCS character that is already available, as in the case of U+003D+0338 which amounts to U+2260, or it does not as is the case for U+2202+0338. The common cases of negations, of the latter type, that have been identified are listed in the table
Note that Unicode Normalization Form C (NFC) is recommended by the W3C and UTC for use on the web. NFC obeys the rule that if a single composed character is already defined for what can be achieved with a combining character, that character must be used instead of the decomposed form. As a practical matter this has little effect on most mathematical expressions, since mathematical alphabetic characters are unaffected by Unicode normalizations. It is also intended that no new single characters representing what can be done with existing compositions will be introduced by the UTC. For further information on these matters see the Unicode Standard Annex 15, Unicode Normalization Forms [UAX15], especially the discussion of Normalization Form C.
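The two kinds of compound character code described above behave differently under NFC, which the following Python sketch shows with the standard `unicodedata` module. The helper name `nfc` is hypothetical.

```python
import unicodedata


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# '=' followed by the combining solidus has a precomposed form, U+2260.
not_equal = nfc("=\u0338")

# The partial differential sign U+2202 has no precomposed negated form,
# so the two-character sequence is left as it is.
negated_partial = nfc("\u2202\u0338")

# A Plane 1 mathematical alphabetic character is left untouched by NFC.
fraktur_a = nfc("\U0001D504")
```

Here `not_equal` is the single character U+2260, while `negated_partial` remains a base character plus combining mark that a renderer has to compose itself.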
Unicode attempts to avoid having several character codes for simple font variants. For a code point to be assigned there should be more than a nuance in glyphs to be recorded. To record variants worth noting there is a special character in Unicode 3.2, U+FE00 (VARIATION SELECTOR-1), which acts as a postfix modifier. However the legally allowed combinations with this variation selector are restricted to a list recorded as part of Unicode. The VARIATION SELECTOR-1 character may only be applied to the characters listed here. The resulting combination is not regarded by Unicode as a separate character, but a variation on the base character. Unicode aware systems may render the combination as the base if the available fonts do not support the variant glyph shape.
Here we list the special mathematical alphabets. Note that the names for these alphabetic runs should be regarded as conventions resulting from recent tradition in the typesetting of mathematical formulas, rather than as fixing exactly and forever the styles which are to be used. Of course, they do correspond to the styles presently most common. But, for instance, there may be font variations in the glyphs from double-struck, open-face or blackboard bold fonts, all of which would naturally be used for the characters in the range here labelled Double-struck. Similar considerations would apply to appellations such as fraktur and gothic, or script and calligraphic.
As discussed above, the use of these characters is formally equivalent to the use of characters in Plane 0, together with a suitable value for the mathvariant attribute. The correspondence is given in the character tables. Most of these characters come from the additions to Plane 1; however, a few characters (such as the double-struck letters N, P, Z, Q, R, C, H representing common number sets) were already present in Unicode 3.0 and retain their original positions. These characters are highlighted in the tables.
This section corresponds closely with the entity definitions in the DTD described in Appendix A Parsing MathML. All of the entity sets except the last correspond to entity sets defined by ISO 8879 or ISO 9573-13.
ISO Handle | Description |
---|---|
isoamsa | Added Mathematical Symbols: Arrows |
isoamsb | Added Mathematical Symbols: Binary Operators |
isoamsc | Added Mathematical Symbols: Delimiters |
isoamsn | Added Mathematical Symbols: Negated Relations |
isoamso | Added Mathematical Symbols: Ordinary |
isoamsr | Added Mathematical Symbols: Relations |
isobox | Box and Line Drawing |
isocyr1 | Cyrillic-1 |
isocyr2 | Cyrillic-2 |
isodia | Diacritical Marks |
isogrk3 | Greek-3 |
isolat1 | Latin-1 |
isolat2 | Latin-2 |
isomfrk | Mathematical Fraktur |
isomopf | Mathematical Openface (Double-struck) |
isomscr | Mathematical Script |
isonum | Numeric and Special Graphic |
isopub | Publishing |
isotech | General Technical |
mmlextra | Extra Names added by MathML |