6 Characters, Entities and Fonts

Overview: Mathematical Markup Language (MathML) Version 3.0
Previous: 5 Combining Presentation and Content Markup
Next: 7 The MathML Interface

6 Characters, Entities and Fonts
    6.1 Introduction
    6.2 MathML Characters
        6.2.1 Unicode Character Data
        6.2.2 Special Characters Not in Unicode
        6.2.3 Mathematical Alphanumeric Symbols Characters
        6.2.4 Non-Marking Characters
    6.3 Character Symbol Listings
        6.3.1 Special Constants
        6.3.2 Character Tables (ASCII format)
        6.3.3 Accented Characters
        6.3.4 Negated Mathematical Characters
        6.3.5 Variant Mathematical Characters
        6.3.6 Mathematical Alphanumeric Symbols
        6.3.7 MathML Character Names

6.1 Introduction

Character and entity tables need to be updated for MathML3 and ISO 9573-13
Issue regenerate_character_and_entity_tables	`wiki (member only)`
Many of the tables in chapter 6 need to be updated and regenerated. In this draft references to tables in chapter 6 link to the published MathML2 Recommendation, and are marked [MathML2]
Resolution	None recorded

Notation and symbols have proved to be very important for mathematics. Mathematics has grown in part because its notation continually changes toward being succinct and suggestive. There have been many new signs developed for use in mathematical notation, and mathematicians have not held back from making use of many symbols originally introduced elsewhere. The result is that mathematics makes use of a very large collection of symbols. It is difficult to write mathematics fluently if these characters are not available for use. It is difficult to read mathematics if corresponding glyphs are not available for presentation on specific display devices.

The W3C Math Working Group therefore took on directly the task of specifying part of the full mechanism needed to proceed from notation to final presentation, and has collaborated with the STIX Fonts Project and Unicode Technical Committee (UTC) in undertaking specification of the rest.

This chapter of the MathML specification contains a listing of character names for use with MathML, recommendations for their use, and warnings to pay attention to the correct form of the corresponding code points given in the UCS (Universal Character Set) as codified in Unicode and ISO 10646 [see [Unicode] and the Unicode Web site]. For simplicity we refer to this character set by the short name Unicode. Unicode changes from time to time, so an exact reference requires a version number; we give version numbers only where they clarify a point. (MathML 2.0 (Second Edition) is based on Unicode 4.0, and MathML 3.0 on Unicode 5.0.)

While a long process of review and adoption by UTC and ISO/IEC of the characters of special interest to mathematics and MathML is now complete, more characters may be added in the future. To ensure any possible corrections to relevant standards are taken into account, and for the latest character tables and font information, see the W3C Math Working Group home page and the Unicode site, notably Unicode Work in Progress and Unicode Technical Report #25 “Unicode Support for Mathematics”.

6.2 MathML Characters

A MathML token element (see Section 3.2 Token Elements and Section 4.4.1 Token Elements) takes as content a sequence of MathML Characters. MathML Characters are defined to be either Unicode characters legal in XML documents or mglyph elements. The latter are used to represent characters that do not have a Unicode encoding, as described in Section 3.2.9 Accessing glyphs for characters from MathML (mglyph). Because Unicode 5.0 provides approximately one thousand mathematical alphanumeric characters and as many further special symbols, the need for mglyph should be rare.

6.2.1 Unicode Character Data

In principle any character allowed by XML may be used in MathML in an XML document. The legal characters have the hexadecimal codes U+0009 (tab), U+000A (line feed), U+000D (carriage return), U+0020..U+D7FF, U+E000..U+FDCF, U+FDF0..U+FFEF, and U+010000..U+10FFFF minus the last two characters in each 64K plane. The notation beginning with U+ is recommended by Unicode for referring to Unicode characters [see [Unicode], page xxviii]. The exclusions above code number D7FF are the blocks used in surrogate pairs, and characters guaranteed not to be Unicode characters, i.e., reserved for internal use only. U+FFFE is excluded to allow determination of byte order in 16-bit encodings. As a practical matter, there are other characters to avoid, as described in Unicode Technical Report #20 “Unicode in XML and other Markup Languages”.
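The ranges listed above can be expressed as a simple membership test. The following Python sketch encodes them exactly as this section gives them; the names `LISTED_RANGES` and `is_listed_character` are illustrative assumptions, not part of any specification, and an XML processor's own rules remain authoritative.

```python
# Code point ranges as listed in the paragraph above (inclusive bounds).
LISTED_RANGES = [
    (0x0009, 0x000A),    # tab, line feed
    (0x000D, 0x000D),    # carriage return
    (0x0020, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFEF),
    (0x10000, 0x10FFFF),
]


def is_listed_character(code_point: int) -> bool:
    """Return True if code_point falls in the ranges listed in this section."""
    # The last two code points of each 64K plane (xxFFFE, xxFFFF) are excluded.
    if (code_point & 0xFFFF) >= 0xFFFE:
        return False
    return any(low <= code_point <= high for low, high in LISTED_RANGES)
```

Under these assumptions, `is_listed_character(0x1D504)` accepts Mathematical Fraktur A, while surrogate code points such as 0xD800 and the byte order sentinel 0xFFFE are rejected.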

There are essentially three different ways of encoding character data.

Using characters directly: For example, an A may be entered as 'A' from a keyboard (character U+0041). This option is only available if the character encoding specified for the XML document includes the character. Most commonly used encodings will have 'A' in the ASCII position. In many encodings, characters may need more than one byte. Note that if the document is, for example, encoded in Latin-1 (ISO-8859-1) then only the characters in that encoding are available directly. Using UTF-8 or UTF-16, the only two encodings that all XML processors are required to accept, mathematical symbols can be encoded directly as character data.
Using numeric XML character references: Using this notation, 'A' may be represented as &#65; (decimal) or &#x41; (hex). Note that the numbers always refer to the Unicode encoding (and not to the character encoding used in the XML file). By using character references it is always possible to access the entire Unicode range. For a general XML vocabulary, there is a disadvantage to this approach: character references may not be used in XML element or attribute names. However, this is not an issue for MathML, as all element names in MathML are restricted to ASCII characters.
Using entity references: The MathML DTD defines internal entities that expand to character data. Thus, for example, the entity reference &eacute; may be used rather than the character reference &#xE9; or, if the document is encoded in ISO-8859-1, for example, the character é. An XML fragment that uses an entity reference which is not defined in a DTD is not well-formed; therefore it will be rejected by an XML parser. For this reason every fragment using entity references must use a DOCTYPE declaration which specifies the MathML DTD, or a DTD that at least declares any entity reference used in the MathML instance. The need to use a DOCTYPE complicates inclusion of MathML in some documents. However, entity references are very useful for small illustrative examples, and are used in most examples in this document.
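The three encodings can be compared with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module as one possible processor; the variable names are illustrative, and the small internal DTD subset stands in for the full MathML DTD.

```python
import xml.etree.ElementTree as ET

# 1. Direct character data, 2. decimal reference, 3. hexadecimal reference.
direct = ET.fromstring("<mi>A</mi>").text
decimal = ET.fromstring("<mi>&#65;</mi>").text
hexadecimal = ET.fromstring("<mi>&#x41;</mi>").text

# An entity reference works only if a DTD declares it. Here the internal
# subset declares just the one entity this fragment needs.
declared = ET.fromstring(
    '<!DOCTYPE mi [<!ENTITY eacute "&#xE9;">]><mi>&eacute;</mi>'
).text

# Without any declaration the fragment is not well-formed and is rejected.
try:
    ET.fromstring("<mi>&eacute;</mi>")
    undeclared_rejected = False
except ET.ParseError:
    undeclared_rejected = True
```

All three forms of 'A' yield the same character data after parsing, which is why the choice between them is a matter of authoring convenience and document encoding rather than of meaning.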

6.2.2 Special Characters Not in Unicode

For special purposes, one may need to use a character which is not in Unicode. In these cases one may use the mglyph element for direct access to a glyph from some font and creation of a MathML substitute for the corresponding character. All MathML token elements that accept character data also accept an mglyph in their content.

Beware, however, that the font chosen may not be available to all MathML processors. The UTC has a policy that new mathematical characters that appear in technical journals will be added to the Unicode standard.

6.2.3 Mathematical Alphanumeric Symbols Characters

A noticeable feature of mathematical and scientific writing is the use of single letters to denote variables and constants in a given context. The increasing complexity of science has led to the use of certain common alphabet and font variations to provide enough special symbols of this letter-like type. These denotations are in fact not letters that may be used to make up words with recognized meanings, but individual carriers of semantics themselves. Writing a string of such symbols is usually interpreted in terms of some composition law, for instance, multiplication. Many letter-like symbols may be quickly interpreted by specialists in a given area as of a certain mathematical type: for instance, bold symbols, whether based on Latin or Greek letters, as vectors in physics or engineering, or fraktur symbols as Lie algebras in part of pure mathematics. To this end the STIX Fonts Project defined a set of mathematical characters all of which are included in Unicode 5.0.

The additional Mathematical Alphanumeric Symbols provided in Unicode 3.1 have code points U+1D400..U+1D7FF in Plane 1, that is, in the first plane with Unicode values at or above 2¹⁶. This plane of characters is also known as the Supplementary Multilingual Plane (SMP), in contrast to the Basic Multilingual Plane (BMP) which was originally the entire extent of Unicode. Support for Plane 1 characters in currently deployed software is not always reliable, but it should be possible in multilingual operating systems, since Plane 2 has many Chinese characters that must be displayable in East Asian locales.
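To make the plane arithmetic concrete, the following Python sketch computes the plane of a code point and shows how a Plane 1 character is stored in UTF-16 and UTF-8. The helper name `plane` is an assumption made for illustration.

```python
def plane(code_point: int) -> int:
    """Return the Unicode plane number (0 = BMP, 1 = SMP, 2 = SIP, ...)."""
    return code_point >> 16


fraktur_a = "\U0001D504"  # MATHEMATICAL FRAKTUR CAPITAL A

# In UTF-16 a Plane 1 character occupies two 16-bit code units
# (a surrogate pair); in UTF-8 it occupies four bytes.
utf16_units = fraktur_a.encode("utf-16-be").hex()
utf8_length = len(fraktur_a.encode("utf-8"))
```

Software that assumes one 16-bit unit per character is a typical source of the unreliable Plane 1 support mentioned above.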

As discussed in Section 3.2.2 Mathematics style attributes common to token elements, MathML offers an alternative mechanism to specify mathematical alphabetic characters. This alternative spans the gap between the specification of Unicode 3.1 and its associated deployment in software and fonts. Namely, one uses the mathvariant attribute on the surrounding token element, which will most commonly be mi. In this section we detail the correspondence that a MathML processor should apply between certain characters in Plane 0 (BMP) of Unicode, modified by the mathvariant attribute, and the Plane 1 Mathematical Alphanumeric Symbol characters.

The basic idea of the correspondence is fairly simple. For example, a Mathematical Fraktur alphabet is in Plane 1, and the code point for Mathematical Fraktur A is U+1D504. Thus using these characters, a typical example might be

<mi>&#x1D504;</mi>

However, an alternative, equivalent markup would be to use the ASCII A and modify the identifier using the mathvariant attribute, as follows:

<mi mathvariant="fraktur">A</mi>

The exact correspondence between a mathematical alphabetic character and an unstyled character is complicated by the fact that mathematical alphabetic characters already present in the Unicode 2.0 Letterlike Symbols block (U+2100..U+214F) remain in that block and hence do not appear in their 'expected' sequences in Plane 1.

The detailed correspondence is shown in the tables given in Section 6.3.6 Mathematical Alphanumeric Symbols.
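As an illustration of the correspondence and of the Letterlike Symbols exceptions, the Python sketch below maps an ASCII capital letter with mathvariant="fraktur" to its Unicode character. It covers capital letters only; the function and constant names are assumptions, and the tables referenced above remain the authoritative source.

```python
import string
import unicodedata

FRAKTUR_CAPITAL_A = 0x1D504

# Fraktur capitals that were already encoded in the Letterlike Symbols block
# and therefore leave gaps in the Plane 1 sequence.
LETTERLIKE_FRAKTUR = {
    "C": 0x212D,
    "H": 0x210C,
    "I": 0x2111,
    "R": 0x211C,
    "Z": 0x2128,
}


def fraktur_capital(letter: str) -> str:
    """Map an ASCII capital letter to its fraktur counterpart."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single ASCII capital letter")
    if letter in LETTERLIKE_FRAKTUR:
        return chr(LETTERLIKE_FRAKTUR[letter])
    # Otherwise the letter sits at its 'expected' offset from Fraktur A.
    return chr(FRAKTUR_CAPITAL_A + ord(letter) - ord("A"))


# Compatibility normalization (NFKC) maps each result back to the plain letter.
round_trip = all(
    unicodedata.normalize("NFKC", fraktur_capital(c)) == c
    for c in string.ascii_uppercase
)
```

A processor applying this mapping treats `<mi mathvariant="fraktur">C</mi>` as U+212D rather than the unassigned Plane 1 position U+1D506.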

Mathematical Alphanumeric Symbol characters should not be used for styled text. For example, Mathematical Fraktur A must not be used to just select a blackletter font for an uppercase A. Doing this sort of thing would create problems for searching, restyling (e.g. for accessibility), and many other kinds of processing.

6.2.4 Non-Marking Characters

Some characters, although important for the quality of print or alternative rendering, do not have glyph marks that correspond directly to them. They are called here non-marking characters. Their roles are discussed in Chapter 3 Presentation Markup and Chapter 4 Content Markup.

In MathML 2 control of page composition, such as line breaking, is effected by the use of the proper attributes on the mspace element.

The characters below are not simple spacers. They are important additions to the UCS because they provide textual clues which can increase the quality of print rendering, permit correct audio rendering, and allow the recovery of mathematical semantics from text which is visually ambiguous.

Character name	Unicode	Description
`&ApplyFunction;`	02061	character showing function application in presentation tagging (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
`&InvisibleTimes;`	02062	marks multiplication when it is understood without a mark (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
`&InvisibleComma;`	02063	used as a separator, e.g., in indices (Section 3.2.5 Operator, Fence, Separator or Accent (mo))
`&#x2064;`	02064	used as a separator, e.g., between the 1 and the ½ in expressions like 1½
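Because these characters have no visible glyph, it can help to inspect them programmatically. The Python sketch below looks up their Unicode names and general categories with the standard `unicodedata` module and parses a small presentation fragment in which function application is written as a numeric character reference; the fragment and the variable names are illustrative assumptions.

```python
import unicodedata
import xml.etree.ElementTree as ET

NON_MARKING = ["\u2061", "\u2062", "\u2063", "\u2064"]

# Unicode name and general category for each non-marking character.
info = {
    f"U+{ord(c):04X}": (unicodedata.name(c), unicodedata.category(c))
    for c in NON_MARKING
}

# 'f applied to x', with the application made explicit in an mo element.
fragment = ET.fromstring(
    "<mrow><mi>f</mi><mo>&#x2061;</mo><mi>x</mi></mrow>"
)
operator = fragment.find("mo").text
```

Each of the four characters is reported with general category Cf (format), consistent with their role as carriers of structure rather than of ink.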

6.3 Character Symbol Listings

The Universal Character Set (UCS) of Unicode and ISO 10646 continues to evolve. At the time of writing the standard is Unicode 4.0. For the latest developments in character standards that affect mathematical formalism, consult the home page of the W3C Math Activity.

The characters are given with entity names as well as Unicode numbers. To facilitate comprehension of a fairly large list of names, which totals over 2000 in this case, we offer more than one way to find a given character. A corresponding full set of entity declarations is in the DTD in Appendix A Parsing MathML. For discussion of entity declarations see that appendix.

The characters are listed by name, and sample glyphs are provided for all of them. Each character name is accompanied by a code for a character grouping chosen from a list given below, a short verbal description, and a Unicode hex code drawn from ISO 10646.

The character listings by alphabetical and Unicode order in Section 6.3.7 MathML Character Names are in harmony with the ISO character sets given, in that if some part of a set is included then the entire set is included.

6.3.1 Special Constants

To begin we list separately a few special characters which MathML introduced. Rather like the non-marking characters above, they provide very useful capabilities in the context of machinable mathematics.

Entity name	Unicode	Description
`&CapitalDifferentialD;`	02145	D for use in differentials, e.g. within integrals
`&DifferentialD;`	02146	d for use in differentials, e.g. within integrals
`&ExponentialE;`	02147	e for use for the exponential base of the natural logarithms
`&ImaginaryI;`	02148	i for use as a square root of -1
`&ImaginaryJ;`	02149	j for use as a square root of -1
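Section 6.2.1 notes that a document may use a DTD that declares only the entities it needs. As a sketch of that approach for the constants above, the following Python code generates an internal DTD subset from the table and parses a fragment against it. The helper name `internal_subset` is an assumption, and the MathML namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Entity names and code points taken from the table above.
SPECIAL_CONSTANTS = {
    "CapitalDifferentialD": 0x2145,
    "DifferentialD": 0x2146,
    "ExponentialE": 0x2147,
    "ImaginaryI": 0x2148,
    "ImaginaryJ": 0x2149,
}


def internal_subset(entities: dict) -> str:
    """Build one ENTITY declaration per name, one per line."""
    return "\n".join(
        f'<!ENTITY {name} "&#x{code_point:04X};">'
        for name, code_point in entities.items()
    )


document = (
    f"<!DOCTYPE math [\n{internal_subset(SPECIAL_CONSTANTS)}\n]>\n"
    "<math><mi>&ExponentialE;</mi></math>"
)
parsed = ET.fromstring(document)
exponential_e = parsed.find("mi").text
```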

6.3.2 Character Tables (ASCII format)

The first table offered is a very large ASCII listing of characters considered particularly relevant to mathematics. This is given in Unicode order [MathML2]. Most, but not all, of these characters have MathML names defined via entity declarations in the DTD. Those that do not are usually symbols which seem mathematically peripheral, such as dingbats, machine graphics or technical symbols.

A second table lists those characters that do have MathML entity names, ordered alphabetically [MathML2], with a lower-case letter preceding its upper-case counterpart. A third table showing the characters according to Unicode blocks is given in Section 2.4 of Unicode Technical Report #25.

6.3.3 Accented Characters

Accented characters should be represented by an appropriate <mover accent="true"> or <munder accent="true"> entry with the over/underscript giving the desired accent. For all but one combining mark, the combining mark itself can be used for the over/underscript, or alternatively the corresponding spacing mark can be used. So for an a tilde, one can use

<mover accent="true">
<mi>a</mi>
<mo>&tilde;</mo>
</mover>

However, for the combining solidus U+0338 one should use the ASCII solidus (U+002F) as an alias accent, since Unicode Normalization Form C replaces the XML closing angle bracket U+003E followed by U+0338 by the “not greater than” symbol U+226F (≯), thereby destroying the XML structure. Note that fully composed characters like &eacute; should not be used to represent mathematical variables. The <mover> construct should be used to obtain the correct mathematical typography.
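The hazard with the combining solidus can be reproduced directly. The Python sketch below normalizes a small fragment to NFC with the standard `unicodedata` module and then attempts to parse the result; the variable names are illustrative.

```python
import unicodedata
import xml.etree.ElementTree as ET

# An mo element whose content is the combining solidus U+0338.
hazardous = "<mo>\u0338</mo>"
before = ET.fromstring(hazardous).text  # parses fine before normalization

# NFC composes the '>' that closes the start tag with the following U+0338.
normalized = unicodedata.normalize("NFC", hazardous)

try:
    ET.fromstring(normalized)
    still_well_formed = True
except ET.ParseError:
    still_well_formed = False
```

After normalization the start tag no longer has its closing angle bracket, so the fragment is rejected by the parser.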

6.3.4 Negated Mathematical Characters

In addition to the Unicode Characters so far listed, one may use the combining characters U+0338 (/), U+20D2 (|) and U+20E5 (\) to produce negated or canceled forms of characters. A combining character should be placed immediately after its 'base' character, with no intervening markup or space, just as is the case for combining accents.

In principle, the negation characters may be applied to any Unicode character, although fonts designed for mathematics have some negated glyphs already composed. A MathML renderer should be able to use these pre-composed glyphs in these cases. A compound character code either represents a UCS character that is already available, as in the case of U+003D+0338 which amounts to U+2260, or it does not, as is the case for U+2202+0338. The common cases of negations, of the latter type, that have been identified are listed in the table

cancellations [MathML2]

Note that Unicode Normalization Form C (NFC) is recommended by the W3C and UTC for use on the web. NFC obeys the rule that if a single composed character is already defined for what can be achieved with a combining character, that character must be used instead of the decomposed form. As a practical matter this has little effect on most mathematical expressions, since mathematical alphabetic characters are unaffected by Unicode normalizations. It is also intended that no new single characters representing what can be done with existing compositions will be introduced by the UTC. For further information on these matters see the Unicode Standard Annex 15, Unicode Normalization Forms [UAX15], especially the discussion of Normalization Form C.
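The two kinds of compound character code discussed in this section behave differently under NFC. The following Python sketch, using the standard `unicodedata` module, shows one example of each together with a Plane 1 mathematical letter; the helper name `nfc` is an assumption.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# '=' followed by the combining solidus composes to U+2260 (not equal to).
equals_negated = nfc("=\u0338")

# The partial differential sign has no precomposed negated form,
# so the sequence stays as two code points.
partial_negated = nfc("\u2202\u0338")

# Mathematical alphanumeric characters are left untouched by NFC.
fraktur_a = nfc("\U0001D504")
```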

6.3.5 Variant Mathematical Characters

Unicode attempts to avoid having several character codes for simple font variants. For a code point to be assigned there should be more than a nuance in glyphs to be recorded. To record variants worth noting there is a special character in Unicode 3.2, U+FE00 (VARIATION SELECTOR-1), which acts as a postfix modifier. However, the legally allowed combinations with this variation selector are restricted to a list recorded as part of Unicode. The VARIATION SELECTOR-1 character may only be applied to the characters listed here. The resulting combination is not regarded by Unicode as a separate character, but as a variation on the base character. Unicode-aware systems may render the combination as the base if the available fonts do not support the variant glyph shape.

variants [MathML2]
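The fallback behaviour described above, rendering the base character when the variant glyph is unavailable, can be sketched in Python as follows. The function name `fallback_to_base` is hypothetical, and the example sequence (INTERSECTION followed by the selector) is assumed here to be among the listed combinations.

```python
import unicodedata

VARIATION_SELECTOR_1 = "\uFE00"


def fallback_to_base(text: str) -> str:
    """Drop VARIATION SELECTOR-1 so that only base characters remain."""
    return text.replace(VARIATION_SELECTOR_1, "")


# A base character followed by the postfix selector: two code points,
# but conceptually one character with a requested glyph variant.
variant = "\u2229" + VARIATION_SELECTOR_1
base_only = fallback_to_base(variant)
selector_name = unicodedata.name(VARIATION_SELECTOR_1)
```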

6.3.6 Mathematical Alphanumeric Symbols

Here we list the special mathematical alphabets. Note that the names for these alphabetic runs should be regarded as conventions resulting from recent tradition in the typesetting of mathematical formulas, rather than as fixing exactly and forever the styles which are to be used. Of course, they do correspond to the styles presently most common. But, for instance, there may be font variations in the glyphs from double-struck, open-face or blackboard bold fonts, all of which would naturally be used for the characters in the range here labelled Double-struck. Similar considerations would apply to appellations such as fraktur and gothic, or script and calligraphic.

As discussed above, the use of these characters is formally equivalent to the use of characters in Plane 0, together with a suitable value for the mathvariant attribute. The correspondence is given in the character tables. Most of these characters come from the additions to Plane 1. However, a few characters (such as the double-struck letters N, P, Z, Q, R, C, H representing common number sets) were already present in Unicode 3.0 and retain their original positions. These characters are highlighted in the tables.
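One way to see which double-struck capitals kept their original positions is to look the characters up by Unicode name. The Python sketch below tries the Plane 1 name first and falls back to the Letterlike Symbols name; the function name `double_struck_capital` is an assumption, and the result depends on the Unicode version shipped with the Python interpreter.

```python
import string
import unicodedata


def double_struck_capital(letter: str) -> str:
    """Return the double-struck form of an ASCII capital letter."""
    try:
        # Most letters live in the Plane 1 Mathematical Alphanumeric block.
        return unicodedata.lookup(f"MATHEMATICAL DOUBLE-STRUCK CAPITAL {letter}")
    except KeyError:
        # The rest were encoded earlier in the BMP Letterlike Symbols block.
        return unicodedata.lookup(f"DOUBLE-STRUCK CAPITAL {letter}")


# Letters whose double-struck form has a code point below U+10000.
bmp_letters = [
    c for c in string.ascii_uppercase
    if ord(double_struck_capital(c)) < 0x10000
]
```

The letters collected in `bmp_letters` are the seven named in the paragraph above.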

6.3.7 MathML Character Names

This section corresponds closely with the entity definitions in the DTD described in Appendix A Parsing MathML. All of the entity sets except the last correspond to entity sets defined by ISO 8879 or ISO 9573-13.

ISO Handle	Description
ISOAMSA [MathML2]	Added Mathematical Symbols: Arrows
ISOAMSB [MathML2]	Added Mathematical Symbols: Binary Operators
ISOAMSC [MathML2]	Added Mathematical Symbols: Delimiters
ISOAMSN [MathML2]	Added Mathematical Symbols: Negated Relations
ISOAMSO [MathML2]	Added Mathematical Symbols: Ordinary
ISOAMSR [MathML2]	Added Mathematical Symbols: Relations
ISOBOX [MathML2]	Box and Line Drawing
ISOCYR1 [MathML2]	Cyrillic-1
ISOCYR2 [MathML2]	Cyrillic-2
ISODIA [MathML2]	Diacritical Marks
ISOGRK3 [MathML2]	Greek-3
ISOLAT1 [MathML2]	Latin-1
ISOLAT2 [MathML2]	Latin-2
ISOMFRK [MathML2]	Mathematical Fraktur
ISOMOPF [MathML2]	Mathematical Openface (Double-struck)
ISOMSCR [MathML2]	Mathematical Script
ISONUM [MathML2]	Numeric and Special Graphic
ISOPUB [MathML2]	Publishing
ISOTECH [MathML2]	General Technical
MMLEXTRA [MathML2]	Extra Names added by MathML
