Web technology is based on the character repertoire of Unicode (see Character Model). Unicode contains a huge number of characters covering a wide range of scripts and languages. However, in some cases, there may be something missing:
噸
).é
in HTML). Other solutions have been proposed, such as xmlchar, which uses an element per character and an XSLT to convert them.
Points 1 and 2 are often subsumed under the term 'gaiji problem'.
East Asian Ideographs (see also gaiji), mathematical symbols, special ligatures,...
Often it is important that a particular glyph is displayed for a certain character. Styling with CSS or XSL can take care of size, font style, and some other properties. But sometimes, there is a need for more specific glyph variants. There are various proposals to do this:
The ch element references an image file containing a glyph image. Attributes are used for exact positioning. The content of the element is the character itself, which can serve as a fallback.
The idea is to define a special element with attributes providing or pointing to the necessary information to process or render the character. This leads to an extremely localized, and therefore extremely flexible and stable, solution. The actual markup may look very similar to the one used for selecting glyph variants, the main distinction being that there is no character content that serves as a fallback (in some cases, the element content may be a primitive fallback such as html:img, or a private use codepoint is used).
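As a minimal sketch of this approach, the Python snippet below builds such an element with the standard library. The element name `gaiji`, its `name` and `src` attributes, and the glyph file path are all hypothetical, chosen only for illustration; they are not taken from any of the proposals discussed here.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def missing_char_element(name: str, glyph_url: str) -> ET.Element:
    """Build a hypothetical <gaiji> element for one unencoded character.

    The attributes carry (or point to) the information needed to render
    the character; the child html:img is only a primitive fallback.
    """
    elem = ET.Element("gaiji", {"name": name, "src": glyph_url})
    ET.SubElement(elem, f"{{{XHTML_NS}}}img", {"src": glyph_url, "alt": name})
    return elem


# Serialize the XHTML namespace with the conventional "html" prefix.
ET.register_namespace("html", XHTML_NS)

para = ET.Element("p")
para.text = "The name is written "
para.append(missing_char_element("example-variant-001", "glyphs/variant-001.png"))

markup = ET.tostring(para, encoding="unicode")
print(markup)
```

Because everything needed to interpret the character travels inside the element itself, a tool that copies or moves the paragraph carries that information along without knowing anything about the convention.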
Examples that define markup for individual characters:
The SVG altGlyph element provides detailed control over the glyphs used to render particular character data.
The MathML mglyph element has an alt attribute for fallback text, a fontfamily attribute to indicate a font, and an index attribute to indicate the position of a glyph in a font.
It is possible to submit a proposal for encoding some characters to the Unicode Technical Committee and ISO/IEC JTC1/SC2/WG2. This requires careful preparation and takes time, but in many cases it is the right thing to do. On the other hand, some things perceived as characters may not be suitable for encoding, or a character may already have been encoded, but you want a particular glyph variant.
Unicode reserves the Private Use Area in the BMP (U+E000-U+F8FF) and planes 15 and 16 for private use. This means that these codepoints are forever left undefined, but can be used between any two parties with a prior agreement.
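The ranges above are easy to test for in code. The following Python sketch flags private use codepoints in a string; the helper names `is_private_use` and `find_private_use` are illustrative, not part of any library.

```python
from typing import List, Tuple

# Private use ranges: the BMP Private Use Area plus planes 15 and 16.
# The last two code points of each plane are noncharacters, so the
# supplementary ranges stop at ...FFFD.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch is a private use codepoint."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def find_private_use(text: str) -> List[Tuple[int, str]]:
    """List (index, 'U+XXXX') for every private use codepoint in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


print(find_private_use("a\ue001b"))
```

Python's `unicodedata.category(ch)` should report the general category `Co` for the same codepoints, which offers an alternative check. Either way, the code can only tell that a codepoint is private, never what it is supposed to mean.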
The main problem with private use codepoints is that there needs to be an understanding of what these codepoints are used for. But private agreements scale very badly on the Web. Various proposals have been made to associate additional information with a document type (DTD/XML Schema), with a document, or with some part of a document.
However, in all cases, editing and otherwise processing documents with such associated information will become very complicated. Also, character information is only preserved if all operations that process it preserve the associated information correctly. Because missing characters are not a very frequent problem, it is quite unreasonable to expect that, for example, every single Perl script dealing with XML will do the right thing. Using markup for individual missing characters is much more stable.
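One way to act on this is to resolve the private agreement once, at the boundary, and pass self-describing markup to downstream tools instead of bare private use codepoints. The Python sketch below assumes a hypothetical agreement table and a hypothetical `gaiji` element; neither comes from an existing specification.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical private agreement: private use codepoint -> glyph name.
AGREEMENT = {0xE000: "variant-001"}


def pua_to_markup(text: str, table: dict) -> str:
    """Replace codepoints listed in table with a <gaiji> element.

    All other characters are XML-escaped and passed through unchanged,
    including private use codepoints the table does not know about.
    """
    out = []
    for ch in text:
        name = table.get(ord(ch))
        if name is not None:
            out.append(f"<gaiji name={quoteattr(name)}/>")
        else:
            out.append(escape(ch))
    return "".join(out)


print(pua_to_markup("a\ue000<b", AGREEMENT))
```

After this step the document no longer depends on the agreement table being available to every later processing stage.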
Gaiji (外字, foreign/outside characters) is a term often used in Japan to refer to both unencoded characters and missing glyph variants.
Talk at the 12th International Unicode Conference in Tokyo, Japan, April 1998: Exploring the Potentials of Web Technologies for the Handling of Rare Ideographs and Ideograph Variants.