Internationalization Best Practices for Spec Developers

Abstract

This document provides a checklist of internationalization-related considerations when developing a specification. Most checklist items point to detailed supporting information in other documents. Where such information does not yet exist, it can be given a temporary home in this document. The information in this document will change regularly as new content is added and existing content is modified in the light of experience and discussion.


Language basics

It should be possible to associate a language with any piece of localizable text or natural language content.
Where possible, there should be a way to label natural language changes in inline text.
Consider whether it is useful to express the intended linguistic audience of a resource, in addition to specifying the language used for text processing.
A language declaration that indicates the text-processing language for a range of text must associate a single language value with a specific range of text.
Use the HTML lang and XML xml:lang language attributes where appropriate to identify the text processing language, rather than creating a new attribute or mechanism.
It should be possible to associate a metadata-type language declaration (which indicates the intended use of the resource rather than the language of a specific range of text) with multiple language values.
Attributes that express the language of external resources should not use the HTML lang and XML xml:lang language attributes, but should use a different attribute when they represent metadata (which indicates the intended use of the resource rather than the language of a specific range of text).

Defining language values

Values for language declarations must use BCP 47.
Refer to BCP 47, not to RFC 5646.
Be specific about what level of conformance you expect for language tags: BCP 47 defines two levels of conformance, "valid" and "well-formed".
Specifications may require implementations to check if language tags are "valid", but in most circumstances should only require that the language tags be "well-formed".
Specifications should require content and content authors to use "valid" language tags.
Reference BCP 47 for language tag matching.

Declaring language at the resource level

The specification should indicate how to define the default text-processing language for the resource as a whole.
Content within the resource should inherit the text-processing language declared at the resource level, unless it is specifically overridden.
Consider whether it is necessary to have separate declarations to indicate the text-processing language versus metadata about the expected use of the resource.
If there is only one language declaration for a resource, and it has more than one language tag as a value, it must be possible to identify the default text-processing language for the resource.

Establishing the language of a content block

By default, blocks of content should inherit any text-processing language set for the resource as a whole.
It should be possible to indicate a change in language for blocks of content where the language changes.

Establishing the language of inline runs

It should be possible to indicate language for spans of inline text where the language changes.

It should be possible to associate a language with any piece of localizable text or natural language content.

Why use the language attribute?

Use cases for bidi and language metadata on the Web

Where possible, there should be a way to label natural language changes in inline text.

Text is rendered or processed differently according to the language it is in. For example, screen readers need to be prompted when a language changes, and spell checkers should be language-sensitive. When rendering text, knowledge of the language is needed in order to apply correct fonts, hyphenation, line-breaking, upper/lower case changes, and other features.

For example, ideographic characters such as 雪, 刃, 直, 令, 垔 have slight but important differences when used with Japanese vs Chinese fonts, and it's important not to apply a Chinese font to the Japanese text, and vice versa when it is presented to a user.

Consider whether it is useful to express the intended linguistic audience of a resource, in addition to specifying the language used for text processing.

Types of language declaration

Language information for a given resource can be used with two main objectives in mind: for text-processing, or as a statement of the intended use of the resource. We will explain the difference below.

A language declaration that indicates the text-processing language for a range of text must associate a single language value with a specific range of text.

When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, style processors, hyphenators, etc., can apply the appropriate rules to the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.

It is normal to express a text-processing language as the default language, for processing the resource as a whole, but it may also be necessary to indicate where the language changes within the resource.

Use the HTML lang and XML xml:lang language attributes where appropriate to identify the text processing language, rather than creating a new attribute or mechanism.

To identify the text-processing language for a range of text, HTML provides the lang attribute, while XML provides xml:lang which can be used in all XML formats. It's useful to continue using those attributes for relevant markup formats, since authors recognize them, as do HTML and XML processors.

It may also be useful to describe the language of a resource as a whole. This type of language declaration typically indicates the intended use of the resource. For example, such metadata may be used for searching, serving the right language version, classification, etc.

This type of language declaration differs from that of the text-processing declaration in that (a) the value for such declarations may be more than one language tag, and (b) the language value declared doesn't indicate which bits of a multilingual resource are in which language.

It should be possible to associate a metadata-type language declaration (which indicates the intended use of the resource rather than the language of a specific range of text) with multiple language values.

The language(s) describing the intended use of a resource do not necessarily include every language used in a document. For example, many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another. In this case, it may make sense to list more than one language tag as the value of the language declaration.

Attributes that express the language of external resources should not use the HTML lang and XML xml:lang language attributes, but should use a different attribute when they represent metadata (which indicates the intended use of the resource rather than the language of a specific range of text).

xml:lang in XML document schemas – When should I use xml:lang and when should I define my own element or attribute for passing language values in an XML document schema (DTD)?

Using a different attribute to indicate the language of an external resource allows the attribute to specify more than one language. It also works better if the resource pointed to is not in a single language.

This distinction can be seen in HTML in the separation of the lang and hreflang attributes. The former indicates the language of the text within the HTML page; the latter is metadata indicating the expected language of a page that is linked to.

For a longer discussion of this see xml:lang in XML document schemas. This article talks specifically about xml:lang, but the concepts are applicable to other situations.

Values for language declarations must use BCP 47.

Language tags in HTML and XML

BCP 47

BCP 47 defines a method to combine subtags in order to create a much more powerful notation for language tags than that provided by the old ISO lists, but it is also backwards compatible with the ISO lists.

For an overview of the key features of BCP 47, see Language tags in HTML and XML.

Refer to BCP 47, not to RFC 5646.

The link to and name of BCP 47 were created specifically so that there is an unchanging reference to the definition of Tags for the Identification of Languages. RFCs 1766, 3066, 4646 were previous (superseded) versions and 5646 is the current version of BCP 47.

Be specific about what level of conformance you expect for language tags: BCP 47 defines two levels of conformance, "valid" and "well-formed".

A well-formed BCP 47 language tag follows the syntax defined for a language tag: implementations check that each language tag consists of hyphen-separated subtags; each subtag has a specific length and specific content (letters, digits or specific combinations) depending on the placement in the tag. A valid BCP 47 language tag is well-formed but additionally ensures that only subtags that are listed in the IANA Subtag Registry are used. Note that the IANA Subtag Registry is frequently updated with new subtags.
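
The difference between the two conformance levels can be sketched in Python. This is a simplified illustration under stated assumptions, not a complete BCP 47 implementation: the pattern covers only the regular and private-use forms of a tag (grandfathered tags are omitted), and REGISTRY_EXCERPT is a tiny hypothetical stand-in for the IANA Subtag Registry that ignores the position of each subtag. The function names are invented for this example.

```python
import re

# Simplified syntax check for a BCP 47 language tag (grandfathered tags omitted).
LANGTAG = re.compile(
    r"""^
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})   # language (+ extlang)
        (?:-[a-z]{4})?                                 # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                    # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*       # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*           # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                     # private use
      |
        x(?:-[a-z0-9]{1,8})+                           # private-use-only tag
    )$""",
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical excerpt of registered subtags, lowercased. A real check would
# consult the full, regularly updated IANA Subtag Registry.
REGISTRY_EXCERPT = {"en", "de", "he", "zh", "latn", "hebr", "hant", "us", "tw"}


def is_well_formed(tag):
    """True if the tag follows the (simplified) language tag syntax."""
    return LANGTAG.match(tag) is not None


def is_valid(tag, registry=REGISTRY_EXCERPT):
    """True if the tag is well-formed and its subtags are all registered.

    Subtags after the first singleton (extensions, private use) are not
    checked in this sketch.
    """
    if not is_well_formed(tag):
        return False
    for subtag in tag.lower().split("-"):
        if len(subtag) == 1:
            break
        if subtag not in registry:
            return False
    return True
```

With this sketch, "xx-Zzzz-QQ" is well-formed but not valid, while "en_US" is not even well-formed because it uses an underscore as the separator.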

Specifications may require implementations to check if language tags are "valid", but in most circumstances should only require that the language tags be "well-formed".

Most specifications are second-order consumers of language metadata – they are using data already provided in the document format (HTML lang, XML xml:lang, or the document format's language fields/attributes).

Generally most specifications are concerned with selecting resources (such as spell checkers, tokenizers, fonts, etc.) or with matching (selecting which string to show, for example) and don't directly care about the content of the language tag. Invalid-but-well-formed tags just don't match anything and usually fallback schemes provide some behavior that is appropriate.
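
As a rough illustration of why well-formed tags are usually sufficient for resource selection, the following Python sketch follows the general shape of the lookup scheme described in BCP 47 (RFC 4647): each requested tag is progressively truncated until something matches, and a default is used otherwise. The function name, parameters and the idea of selecting a spell-check dictionary are assumptions made for this example.

```python
def lookup(priority_list, available, default):
    """Pick the best available tag for a prioritized list of requested tags."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        subtags = language_range.lower().split("-")
        if subtags == ["*"]:
            continue  # the wildcard range is skipped in this sketch
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A single-character subtag cannot end a tag, so drop it as well.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


dictionaries = ["de", "en", "zh-Hant"]  # hypothetical spell-check dictionaries
```

Here lookup(["de-CH-1996"], dictionaries, "en") falls back to "de", and a well-formed but unregistered tag such as "xx-Zzzz" simply matches nothing and yields the default "en".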

There might be cases where a specification really wants implementation-level checking for validity. In those cases, the result of a tag failing to be valid has to be specified (should the application die, warn the user, etc.). It's also a problem that the registry is sizeable and changes over time, so each implementation is registry-version dependent. The changes over time are often minor, but real users will encounter interoperability issues if random (out of date) implementations of the specification reject language tags that have become valid at a later date.

In addition, BCP 47 has an extension mechanism which defines add-on subtag sequences. For example, one extension [RFC6067] (Unicode Locales, which uses the singleton -u), is commonly used for controlling the internationalization features of JavaScript (and has other uses). Validating these additional subtags is likely out of scope for most specifications.

Specifications should require content and content authors to use "valid" language tags.

Normative language regarding language tags might be different between content and implementation requirements. Specification authors need to carefully consider what conformance requirements and tests are needed for their specification and what implementations are required to do. One solution is to normatively require that "valid" language tags be used by content authors but only require implementations to check for "well-formed" language tags.

Reference BCP47 for language tag matching.

BCP 47

BCP 47 contains one RFC dedicated to the syntax and subtags of language tags (RFC 5646), and another dedicated to how to match language tags against language ranges (RFC 4647). (This topic needs more detail, and may merit being a separate section.)
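
To give a flavor of what matching involves, here is a minimal Python sketch in the spirit of the basic filtering scheme from the matching part of BCP 47: a range matches a tag if the two are equal, or if the range is a prefix of the tag that ends at a subtag boundary. The function name is an assumption, and the extended filtering scheme is not covered.

```python
def basic_filter(language_range, tags):
    """Return the tags matched by a basic language range (case-insensitive)."""
    if language_range == "*":
        return list(tags)
    prefix = language_range.lower()
    return [
        tag
        for tag in tags
        if tag.lower() == prefix or tag.lower().startswith(prefix + "-")
    ]
```

Comparing whole subtags matters: the range "de" matches "de-CH" but not the unrelated tag "den".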

Here we are talking about an independent unit of data that contains structured text. Examples may include a whole HTML page, an XML document, a JSON file, a WebVTT script, an annotation, etc.

See also

2.2 Defining language values.

The specification should indicate how to define the default text-processing language for the resource as a whole.

It often saves trouble to identify the language, or at least the default language, of the resource as a whole in one place. For example, in an HTML file, this is done by setting the lang attribute on the html element.

Content within the resource should inherit the text-processing language declared at the resource level, unless it is specifically overridden.
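
A minimal Python sketch of this inheritance rule follows. It assumes a hypothetical tree-shaped resource in which any node may carry an explicit language declaration; the Node class and function name are invented for this example and do not come from any particular format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    text: str = ""
    lang: Optional[str] = None  # explicit declaration on this node, if any
    children: List["Node"] = field(default_factory=list)


def effective_languages(node, inherited=None):
    """Return (text, language) pairs, resolving inherited languages."""
    lang = node.lang if node.lang is not None else inherited
    result = [(node.text, lang)] if node.text else []
    for child in node.children:
        result.extend(effective_languages(child, lang))
    return result


# A German resource containing one Chinese phrase.
guide = Node(lang="de", children=[
    Node(text="Stadtführer"),
    Node(text="你好", lang="zh"),
    Node(children=[Node(text="Hallo")]),
])
```

Calling effective_languages(guide) labels the Chinese phrase as "zh" and everything else, including the nested text, as "de".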

Consider whether it is necessary to have separate declarations to indicate the text-processing language versus metadata about the expected use of the resource.

In many cases a resource contains text in only one language, and in many more cases the language declared as the default language for text-processing is the same as the language that describes the metadata about the resource as a whole. In such cases it makes sense to have a single declaration.

It becomes problematic, however, to use a single declaration when it refers to more than one language unless there is a way to determine which one language should be used as the text-processing default.

If there is only one language declaration for a resource, and it has more than one language tag as a value, it must be possible to identify the default text-processing language for the resource.
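
One possible convention, shown here purely as an assumption and not as a recommendation of this document, is to treat the first tag in a comma-separated list as the default text-processing language while keeping the full list as audience metadata. The function and key names in this Python sketch are hypothetical.

```python
def parse_language_declaration(value):
    """Split a comma-separated declaration into audience tags and a default."""
    tags = [tag.strip() for tag in value.split(",") if tag.strip()]
    return {
        "audience": tags,
        # Assumed convention: the first tag listed is the processing default.
        "text_processing_default": tags[0] if tags else None,
    }
```

Whatever convention a specification chooses, it needs to be stated explicitly so that all implementations pick the same default.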

See also

2.2 Defining language values.

The words block and/or chunk are used here to refer to a structural component within the resource as a whole that groups content together and separates it from adjacent content. Boundaries between one block and another are equivalent to paragraph or section boundaries in text, or discrete data items inside a file.

For example, this could refer to a block or paragraph in XML or HTML, an object declaration in JSON, a cue in WebVTT, a line in a CSV file, etc. Contrast this with inline content, which describes a range within a paragraph, sentence, etc.

The interpretation of which structures defined in a spec are relevant to these requirements may require a little consideration, and will depend on the format of the data involved.

By default, blocks of content should inherit any text-processing language set for the resource as a whole.

See 2.1 Language basics for guidance related to the default text-processing language information.

It should be possible to indicate a change in language for blocks of content where the language changes.

In this section we refer to information that needs to be provided for a range of characters in the middle of a paragraph or string.

See also

2.2 Defining language values.

It should be possible to indicate language for spans of inline text where the language changes.

Where a switch in language can affect operations on the content, such as spell-checking, rendering, styling, voice production, translation, information retrieval, and so forth, it is necessary to indicate the range of text affected and identify the language of that content.


Basic requirements

It must be possible to indicate base direction for each individual paragraph-level item of natural language text that will be read by someone.
It must be possible to indicate base direction changes for embedded runs of inline bidirectional text for all localizable text.
Annotating right-to-left text must require the minimum amount of effort for people who work natively with right-to-left scripts.

Background information

Do not assume that direction can be determined from language information.

Base direction values

Values for the default base direction should include left-to-right, right-to-left, and auto.

Handling direction in markup

The spec should indicate how to define a default base direction for the resource as a whole, i.e. set the overall base direction.
The default base direction, in the absence of other information, should be LTR.
The content author must be able to indicate parts of the text where the base direction changes. At the block level, this should be achieved using attributes or metadata, and should not rely on Unicode control characters.
It must be possible to also set the direction for content fragments to auto. This means that the base direction will be determined by examining the content itself.
If the overall base direction is set to auto for plain text, the direction of content paragraphs should be determined on a paragraph by paragraph basis.
To indicate the sides of a block of text relative to the start and end of its contained lines, use 'block-start' and 'block-end', rather than 'top' and 'bottom'.
To indicate the start/end of a line you should use 'start' and 'end', or 'inline-start' and 'inline-end', rather than 'left' and 'right'.
Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

Handling base direction for strings

Provide metadata constructs that can be used to indicate the base direction of any natural language string.
Specify that consumers of strings should use heuristics, preferably based on the Unicode Standard first-strong algorithm, to detect the base direction of a string except where metadata is provided.
Where possible, define a field to indicate the default direction for all strings in a given resource or document.
Do NOT assume that creating a document-level default without the ability to change direction for any string is sufficient.
If metadata is not available due to legacy implementations and cannot otherwise be provided, specifications MAY allow a base direction to be interpolated from available language metadata.
Specifications MUST NOT require the production or use of paired bidi controls.

Setting base direction for inline or substring text

It must be possible to indicate spans of inline text where the base direction changes. If markup is available, this is the preferred method. Otherwise your specification must require that Unicode control characters are recognized by the receiving application, and correctly implemented.
It must be possible to also set the direction for a span to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup.
If users use Unicode bidirectional control characters, the isolating RLI/LRI/FSI with PDI characters must be supported by the application and recommended (rather than RLE/LRE with PDF) by the spec.
Use of RLM/LRM should be appropriate, and expectations of what those controls can and cannot do should be clear in the spec.
For markup, provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.
For markup, allow bidi attributes on all inline elements in markup that contain text.
For markup, provide attributes that allow the user to (a) create an embedded base direction or (b) override the bidirectional algorithm altogether; the attribute should allow the user to set the direction to LTR or RTL or the aforementioned Auto in either of these two scenarios.

It is important to establish direction for text written or mixed with right-to-left scripts. Characters in these scripts are stored in memory in the order they are typed and pronounced – called the logical order. The Unicode Bidirectional Algorithm (UBA) provides a lot of support for automatically rendering a sequence of characters stored in logical order so that they are visually ordered as expected. Unfortunately, the UBA alone is not sufficient to correctly render bidirectional text, and additional information has to be provided about the default directional context to apply for a given sequence of characters.

The basic requirements are as follows.

It must be possible to indicate base direction for each individual paragraph-level item of natural language text that will be read by someone.

It must be possible to indicate base direction changes for embedded runs of inline bidirectional text for all localizable text.

Annotating right-to-left text must require the minimum amount of effort for people who work natively with right-to-left scripts.

Requiring a speaker of Arabic, Divehi, Hebrew, Persian, Urdu, etc. to add markup or control characters to every paragraph or small data item they write is far too much to be manageable. Typically, the format should establish a default direction and require the user to intervene only when exceptions have to be dealt with.

In this section we try to set out some key concepts associated with text direction, so that it will be easier to understand the recommendations that follow.

In order to correctly display text written in a 'right-to-left' script or left-to-right text containing bidirectional elements, it is important to establish the base direction that will be used to dictate the order in which elements of the text will be displayed.

If you are not familiar with what the Unicode Bidirectional Algorithm (UBA) does and doesn't do, and why the base direction is so important, read Unicode Bidirectional Algorithm basics.

Example 1

For example, the following annotation will not display correctly unless the application doing the display knows that the base direction needs to be right-to-left.

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type":"Annotation",
  "body": {
    "type" : "TextualBody",
    "text" : "פעילות הבינאום, W3C",
    "format" : "text/html",
    "language" : "he"
  },
  "target": "http://example.org/photo1"
}

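You would expect the phrase in the text property value to be displayed with a right-to-left base direction, so that W3C appears to the left of the Hebrew text:

פעילות הבינאום, W3C

However, if there is no indication that the base direction should be right-to-left, an incorrect display will be produced, with W3C appearing to the right of the Hebrew text:

פעילות הבינאום, W3C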

In this section, the word paragraph indicates a run of text followed by a hard line-break in plain text, but may signify different things in other situations. In CSV it equates to 'cell', so a single line of comma-separated items is actually a set of comma-separated paragraphs. In HTML it equates to the lowest level of block element, which is often a p element, but may be things such as div, li, etc., if they only contain text and/or inline elements. In JSON, it often equates to a quoted string value, but if a string value uses markup then paragraphs are associated with block elements, and if the string value is multiple lines of plain text then each line is a paragraph.

The term metadata is used here to mean information which could be an annotation or property associated with the data, or could be markup in scenarios that allow that, or could be a higher-level protocol, etc.

There are a number of possible ways of setting the base direction.

The base direction of a paragraph may be set by an application or a user applying metadata to the paragraph. Typical values for base direction may include ltr, rtl or auto.
- The metadata may specifically indicate that heuristics should be used. Then you would expect to consider the actual characters used in order to determine the base direction. (This is what happens if you set dir=auto on an HTML element.)
- The application may expect metadata, but there may be no such information provided. In this case you would usually expect there to be a default direction specified, and the base direction for the paragraph would be set to that default. The default is usually LTR. (This is what happens if you have no dir attributes in your HTML file.)
- Where a format contains many paragraphs or chunks of information, and the language of text in all those chunks is the same, it is sometimes useful to allow a default base direction to be set for and inherited by all. This is what happens when you set the dir attribute on the html tag in HTML. Another example would be a subtitling file containing many cues, all written in Arabic; it would be best to allow the author to say at the start of the file that the default is RTL for all cue text. There should always be a way to override the direction information for a specific paragraph where needed.
If the application expects no metadata to be available it should use heuristics to determine the base direction for each paragraph/cell. A typical solution, and one described by UAX 9 Unicode Bidirectional Algorithm, is to look for the first-strong character in the paragraph/cell. (This is likely to apply if you are looking at plain text that is not expected to be associated with metadata. It only happens with HTML if the direction is set to auto, since HTML specifies a default direction.)
- Not all paragraphs using the first-strong method will have the correct base direction applied. In some cases, an Arabic or Hebrew, etc, paragraph may start with strong LTR characters. There must be a way to deal with this.
- Where a syntactic unit contains multiple lines of plain text (for example, a multiline cue text in a subtitling file), the first-strong heuristic needs to be applied to each line separately.
- There may be special rules that involve ignoring some sequence of characters or type of markup at the start of the paragraph before identifying the first strong character.
- In some cases there are no strong characters in a paragraph, and the base direction can be critically important for the data to be understood correctly, eg. telephone numbers or MAC addresses. There needs to be a way to resort to an appropriate default for these cases.
Whether or not any metadata is specified, if a paragraph contains a string that starts with one of the Unicode bidi control characters RLI, LRI, FSI, LRE, RLE, LRO, or RLO and ends with PDF/PDI, these characters will determine the base direction for the contained string. These characters, when placed in the content, explicitly override any previously set direction by creating an inline range and assigning a base direction to it.
- The effect of such characters does not extend past paragraph boundaries, but the range ought to be explicitly ended using the PDF/PDI control character, especially if a paragraph end is not easily detectable by the application.
- Because isolation is needed for bidirectional text to work properly, the Unicode Standard says that the isolating control codes RLI, LRI and FSI should be used rather than LRE or RLE. Unfortunately, those characters are still not widely supported.
- For structural components in markup, above the paragraph level, it is not possible to use the Unicode bidi control characters to define direction for paragraphs, since these are inline controls only, and the effect is terminated by a paragraph end.
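
The interplay between metadata, defaults and heuristics described above can be sketched in Python as follows. This is a simplified illustration: first_strong ignores the isolate-skipping rules of the full Unicode Bidirectional Algorithm and any markup, and the function and parameter names are assumptions made for this example.

```python
import unicodedata


def first_strong(text):
    """Return 'rtl' or 'ltr' for the first strong character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return None


def resolve_base_direction(text, declared=None, default="ltr"):
    """Resolve base direction from metadata, falling back to a default."""
    if declared in ("ltr", "rtl"):
        return declared  # explicit metadata wins
    if declared == "auto":
        # Heuristics were requested; use the default if no strong character.
        return first_strong(text) or default
    return default  # metadata expected but absent


def plain_text_directions(text, default="ltr"):
    """Apply the first-strong heuristic to each line of plain text."""
    return [first_strong(line) or default for line in text.splitlines()]
```

A string with no strong characters, such as a telephone number, gets whatever default the caller supplies, which is why a format needs a way to state that default.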

When capturing text input by a user it is usually necessary to understand the context in which the user was inputting the data to determine the base direction of the input. In HTML, for example, this may be set by the direction inherited from the html tag, or by the user pressing keys to set the base direction for a form field. It is then necessary to find some way of storing the information about base direction or associating it with the string when rendered. Typically, in this situation, any direction changes internal to the string being input are handled by the user and will be captured as part of the string.
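
One way to keep the direction associated with a captured string is to store it alongside the text. The Python sketch below assumes a hypothetical record with value, lang and dir fields, serialized as JSON; none of these names is prescribed by this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LocalizableString:
    value: str  # the text exactly as entered, including any inline controls
    lang: str   # language tag for the text
    dir: str    # base direction in effect at input time: "ltr", "rtl" or "auto"


def to_json(string):
    """Serialize the string together with its language and base direction."""
    return json.dumps(asdict(string), ensure_ascii=False)


captured = LocalizableString(value="פעילות הבינאום, W3C", lang="he", dir="rtl")
```

A consumer that reads the serialized record can then apply the stored direction when rendering instead of guessing from the content.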

Embedded ranges of text within a single paragraph may need to have a different base direction. For example,

"The title was '!NOITASILANOITANRETNI'."

where the span within the single quotes is in Hebrew/Arabic/Divehi, etc., and needs to have an RTL base direction, instead of the LTR base direction of the surrounding paragraph, in order to place the exclamation mark correctly.

If markup is available to the content author, it is likely to be easier and safer to use markup to indicate such inline ranges (see below). In HTML you would usually use an inline element with a dir attribute to establish the base direction for such runs of text. If you can't mark up the text, such as in HTML's title element, or any environment that handles only plain text content, you have to resort to Unicode's paired control characters to establish the base direction for such an internal range.

Furthermore, inline ranges where the base direction is changed should be isolated from surrounding text, so that the UBA doesn't produce incorrect results due to interference across boundaries. See an example of how this can produce incorrect ordering of things such as text followed by numbers in HTML, or another example of how it can affect lists.

This means that if a content author is using Unicode control codes they should use RLI/LRI...PDI rather than RLE/LRE...PDF. These isolating codes are fairly new, and applications may not yet support them.
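
For plain-text environments where markup is unavailable, wrapping an inserted string in isolating controls might look like the following Python sketch. The code points are those of the Unicode isolate characters; the function name and its parameters are assumptions made for this example.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction="auto"):
    """Wrap text in a paired isolate so it cannot affect surrounding text."""
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return opener + text + PDI


def title_sentence(title, direction="auto"):
    """Insert a title of known or unknown direction into an LTR sentence."""
    return "The title was '" + isolate(title, direction) + "'."
```

Always emitting the closing PDI keeps the range explicitly terminated, which matters when the receiving application cannot easily detect paragraph ends.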

Reasons to avoid relying on control characters to set direction include the following:

They are invisible in most editors and are therefore difficult to work with, and can easily lead to orphans and overlapping ranges. They can be particularly difficult to manage when editing bidirectional inline text because it's hard to position the cursor in the correct place. If you ask someone who writes in a right-to-left script, you are likely to find that they dislike using control codes.
Users often don't have the necessary characters available on their keyboard, or have difficulty inputting them.
It is sometimes necessary to choose which to use based on context or the type of the data, and this means that a content author typically needs to select the control codes – specifying control codes in this way for all paragraphs is time-consuming and error-prone.
Processors that extract parts of the data, add to it, or reuse in combination with other text may incorrectly handle the control codes.
Search and comparison algorithms should ignore these characters, but typically don't.

The last two items above may also hold for markup, but implementers often support included markup better than included control codes.

Don't expect users to add control codes at the start and end of every paragraph. That's far too much work.

A word about the Unicode characters U+200F RIGHT-TO-LEFT MARK (RLM) and U+200E LEFT-TO-RIGHT MARK (LRM) is warranted at this point.

The first point to be clear about is that neither RLM nor LRM establishes the base direction for a range of text. They are simply invisible characters with strong directional properties.

This means that you cannot use RLM for example, to make the text W3C appear to the left of the Hebrew text in the following example.

The title is "פעילות הבינאום, W3C".

For this you can only use metadata or the paired control characters.

Of course, if you are detecting base direction using first-strong heuristics then RLM and LRM can be useful for setting the base direction where the text in question begins with something that would otherwise give the wrong result, eg.

"نشاط التدويل" is how you say "i18n Activity" in Arabic.

Here an LRM could be placed at the start of the text, before the strong RTL Arabic characters, to prevent the algorithm from assuming that the text should be right-to-left. (Remember that if metadata is used to set the base direction, that character is ignored, unless the metadata specifically says that first-strong heuristics should be used.)
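
The effect of the LRM can be sketched in Python. The function below is a simplified first-strong detector written for illustration (its name is an assumption, and it ignores details of the full Unicode Bidirectional Algorithm, such as skipping over isolated ranges).

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK, bidi class L


def first_strong_direction(text):
    """Return 'ltr' or 'rtl' from the first strong character, or None if there is none."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


sentence = '"نشاط التدويل" is how you say "i18n Activity" in Arabic.'

print(first_strong_direction(sentence))        # rtl
print(first_strong_direction(LRM + sentence))  # ltr
```

The opening quotation mark is directionally neutral, so without the LRM the first strong character is the Arabic letter and the whole sentence is wrongly treated as right-to-left.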

Do not assume that direction can be determined from language information.

Can we derive base direction from language?, W3C article.

The following are all reasons you cannot use language tags to provide information about base direction:

you can't produce the auto value with language tags.
some languages are written with both RTL and LTR scripts.
the only reliable part of the language tag that would indicate the base direction is the script subtag, but BCP 47 recommends that you suppress the use of the script subtag for languages that don't usually need it, such as Hebrew (Suppress-Script: Hebr). Languages, such as Persian, that are usually written in a RTL script may be written in transcribed form, and it's not possible to guarantee that the necessary script subtag would be present to carry the directional information. In summary, you won't be able to rely on people supplying script subtags as part of the language information in order to influence direction.
the use of language tags and the use of base direction markers often don't coincide.
they are not semantically equivalent.
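
The problem with script subtags can be seen in a small sketch. The function name and the short list of RTL scripts below are assumptions for illustration only, and the parsing is deliberately naive; the point is that well-formed, commonly used tags carry no script subtag to inspect.

```python
# Illustrative, not exhaustive, list of right-to-left script subtags.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo"}


def direction_from_script_subtag(tag):
    """Return 'rtl' or 'ltr' if the tag has a script subtag, otherwise None."""
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:
            break  # singleton: extensions or private use follow
        if len(subtag) == 4 and subtag.isalpha():
            return "rtl" if subtag.title() in RTL_SCRIPTS else "ltr"
    return None  # nothing in the tag says anything about direction


for tag in ["az-Arab", "az-Latn", "he", "fa", "fa-IR"]:
    print(tag, direction_from_script_subtag(tag))
```

The tags `he`, `fa` and `fa-IR` all yield `None`, even though the text they label is normally right-to-left, and nothing distinguishes Persian in Arabic script from Persian in a Latin transcription unless the author chooses to add a script subtag.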

Values for the default base direction should include left-to-right, right-to-left, and auto.

The auto value allows automatic detection of the base direction for a piece of text. For example, the auto value of dir in HTML guesses the base direction of the text by looking for the first strong directional character, ignoring certain items of markup. Note that automatic detection algorithms are far from perfect. First-strong detection is unable to correctly identify text that is really right-to-left, but that begins with a strong LTR character. Algorithms that attempt to judge the base direction based on contents of the text are also problematic. The best scenario is one where the base direction is known and declared.

This section is about defining approaches to bidi handling that work with resources that organise content using markup. Some of the recommendations are different from those for handling strings on the Web (see 3.5 Handling base direction for strings).

The spec should indicate how to define a default base direction for the resource as a whole, ie. set the overall base direction.

The default base direction, in the absence of other information, should be LTR.

The content author must be able to indicate parts of the text where the base direction changes. At the block level, this should be achieved using attributes or metadata, and should not rely on Unicode control characters.

Relying on Unicode control characters to establish direction for every block is not feasible because line breaks terminate the effect of such control characters. It also makes the data much less stable, and unnecessarily difficult to manage if control characters have to appear at every point where they would be needed.

It must be possible to also set the direction for content fragments to auto. This means that the base direction will be determined by examining the content itself.

Estimation algorithms, in Additional Requirements for Bidi in HTML & CSS.

A typical approach here would be to set the direction based on the first strong directional character outside of any markup, but this is not the only possible method. The algorithm used to determine directionality when direction is set to auto should match that expected by the receiver.

The first-strong algorithm looks for the first character in the paragraph with a strong directional property according to the Unicode definitions. It then sets the base direction of the paragraph according to the direction of that character.

Note that the first-strong algorithm may incorrectly guess the direction of the paragraph when the first character is not typical of the rest of the paragraph, such as when a RTL paragraph or line starts with a LTR brand name or technical term.

For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.

If the overall base direction is set to auto for plain text, the direction of content paragraphs should be determined on a paragraph by paragraph basis.
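
A paragraph-by-paragraph approach might look like the following sketch. It assumes that each line of the plain text is a paragraph and that LTR is the fallback when a paragraph contains no strong character; the function names are hypothetical.

```python
import unicodedata


def first_strong(paragraph, default="ltr"):
    """Direction of the first strong character, or the default if there is none."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


def paragraph_directions(plain_text):
    """Resolve a base direction separately for every paragraph (here: every line)."""
    return [first_strong(paragraph) for paragraph in plain_text.splitlines()]


text = "Hello world\nשלום עולם\n123"
print(paragraph_directions(text))  # ['ltr', 'rtl', 'ltr']
```

The third paragraph contains only digits, which are not strong characters, so it falls back to the default rather than inheriting the direction of the paragraph before it.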

To indicate the sides of a block of text relative to the start and end of its contained lines, use 'block-start' and 'block-end', rather than 'top' and 'bottom'.

CSS Logical Properties and Values Level 1

To indicate the start/end of a line you should use 'start' and 'end', or 'inline-start' and 'inline-end', rather than 'left' and 'right'.

CSS Logical Properties and Values Level 1

Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

CSS vs. markup for bidi support, W3C article.

For example, HTML has a dir attribute that is capable of managing base direction without assistance from CSS styling. XML formats should define dedicated markup to represent directional information, even if they need CSS to achieve the required display, since the text may be used in other ways.

Style sheets such as CSS may not always be used with the data, or carried with the data when it is syndicated, etc. Directional information is fundamentally important to correct display of the data, and should be associated more closely and more permanently with the markup or data.

Note

The information in this section is pulled from Requirements for Language and Direction Metadata in Data Formats. That document is still being written, so these guidelines are likely to change at any time.

Provide metadata constructs that can be used to indicate the base direction of any natural language string.

Best Practices, Recommendations, and Gaps, in Strings on the Web: Language and Direction Metadata

Specify that consumers of strings should use heuristics, preferably based on the Unicode Standard first-strong algorithm, to detect the base direction of a string except where metadata is provided.

Best Practices, Recommendations, and Gaps, in Strings on the Web: Language and Direction Metadata

Where possible, define a field to indicate the default direction for all strings in a given resource or document.

Best Practices, Recommendations, and Gaps, in Strings on the Web: Language and Direction Metadata

Do NOT assume that creating a document-level default without the ability to change direction for any string is sufficient.

Best Practices, Recommendations, and Gaps, in Strings on the Web: Language and Direction Metadata
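
One way to picture the combination of a resource-level default and a per-string override is the sketch below. The class and field names (`Resource`, `LocalizedString`, `default_dir`, `dir`) are invented for this example and are not taken from any specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LocalizedString:
    value: str
    dir: Optional[str] = None  # per-string override: "ltr", "rtl", "auto" or None


@dataclass
class Resource:
    default_dir: str = "ltr"   # default for every string in the resource
    strings: Dict[str, LocalizedString] = field(default_factory=dict)


def effective_direction(resource: Resource, key: str) -> str:
    """Per-string metadata wins; otherwise fall back to the resource default."""
    item = resource.strings[key]
    return item.dir if item.dir is not None else resource.default_dir


resource = Resource(
    default_dir="ltr",
    strings={
        "publisher": LocalizedString("W3C"),
        "title": LocalizedString("פעילות הבינאום, W3C", dir="rtl"),
    },
)

print(effective_direction(resource, "publisher"))  # ltr
print(effective_direction(resource, "title"))      # rtl
```

Without the `dir` field on individual strings, the Hebrew title would have to be displayed with the left-to-right default of the resource.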

If metadata is not available due to legacy implementations and cannot otherwise be provided, specifications MAY allow a base direction to be interpolated from available language metadata.

Best Practices, Recommendations, and Gaps, in Strings on the Web: Language and Direction Metadata

Specifications MUST NOT require the production or use of paired bidi controls.

Best Practices, Recommendations, and Gaps, in Strings on the Web: Language and Direction Metadata

'Inline text' here has a readily understandable meaning in markup. It also applies to strings (eg. in JSON, CSV, or other plain text formats), meaning runs of characters which don't include all the characters in the string.

It must be possible to indicate spans of inline text where the base direction changes. If markup is available, this is the preferred method. Otherwise your specification must require that Unicode control characters are recognized by the receiving application, and correctly implemented.

It must be possible to also set the direction for a span to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup.

For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.

If users use Unicode bidirectional control characters, the isolating RLI/LRI/FSI with PDI characters must be supported by the application and recommended (rather than RLE/LRE with PDF) by the spec.

Use of RLM/LRM should be appropriate, and expectations of what those controls can and cannot do should be clear in the spec.

The Unicode bidirectional control characters U+200F RIGHT-TO-LEFT MARK and U+200E LEFT-TO-RIGHT MARK are not sufficient on their own to manage bidirectional text. They cannot produce a different base direction for embedded text. For that you need to be able to indicate the start and end of the range of the embedded text. This is best done by markup, if available, or failing that using the other Unicode bidirectional controls mentioned just above.

For markup, provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

For markup, allow bidi attributes on all inline elements in markup that contain text.

For markup, provide attributes that allow the user to (a) create an embedded base direction or (b) override the bidirectional algorithm altogether; the attribute should allow the user to set the direction to LTR or RTL or the aforementioned Auto in either of these two scenarios.

Choosing a definition of 'character'

Specifications SHOULD use specific terms, when available, instead of the general term 'character'. more
When specifications use the term 'character' the specifications MUST define which meaning they intend, and SHOULD explicitly define the term 'character' to mean a Unicode code point. more
Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage. more
Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language. more
Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text. more
Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world. more

Defining a Reference Processing Model

Textual data objects defined by protocol or format specifications MUST be in a single character encoding. more
All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model described by the rest of the recommendations in this list. more
Specifications MUST define text in terms of Unicode characters, not bytes or glyphs. more
For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form. more
Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows: (a) The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form, (b) All processing MUST take place on this sequence of Unicode characters, (c) If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification. more
If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects. more

Including and excluding character ranges

Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive. more
Specifications MUST NOT allow code points above U+10FFFF. more
Specifications SHOULD NOT allow the use of codepoints reserved by Unicode for internal use. more
Specifications MUST NOT allow the use of surrogate code points. more
Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define. more
Specifications SHOULD allow the full range of Unicode for user-defined values. more

Using the Private Use Area

Specifications MUST NOT require the use of private use area characters with particular assignments. more
Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points. more
Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement. more
Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters. more
Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics. more

Choosing character encodings

Identifying character encodings

Specifications MUST NOT propose the use of heuristics to determine the encoding of data. more
Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding. more

Designing character escapes

Specifications should provide a mechanism for escaping characters, particularly those which are invisible or ambiguous. more
Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists. more
The number of different ways to escape a character SHOULD be minimized (ideally to one). more
Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided. more
Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation. more
Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable. more
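
As an illustration of the escape guidelines above, the sketch below uses a single escape form with an explicit end delimiter and a hexadecimal code point. The `\u{...}` syntax and the function names are assumptions chosen for the example; a specification should reuse an existing mechanism where an appropriate one exists.

```python
import re

# One escape form only: backslash, "u", then 1-6 hex digits inside braces.
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def escape(ch: str) -> str:
    """Escape one character as its Unicode code point in hexadecimal."""
    return f"\\u{{{ord(ch):X}}}"


def _replace(match):
    code_point = int(match.group(1), 16)
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a valid code point for an escape: {match.group(0)}")
    return chr(code_point)


def unescape(text: str) -> str:
    """Replace every escape in the text with the character it represents."""
    return _ESCAPE.sub(_replace, text)


print(escape("\u200F"))             # \u{200F}
print(unescape(r"a\u{1F46A}b") == "a\U0001F46Ab")  # True
```

The closing brace means that the end of the escape never depends on what character happens to follow it, so `\u{41}1` is unambiguously "A" followed by "1".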

Storing text

Protocols, data formats and APIs MUST store, interchange or process text data in logical order. more
Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage. more
Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs. more

Defining 'string'

Specifications SHOULD NOT define a string as a 'byte string'. more
The 'character string' definition SHOULD be used by most specifications. more

Referring to Unicode characters

Use U+XXXX syntax to represent Unicode code points in the specification. more

Referencing the Unicode Standard

Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. more
A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time. more
All generic references to the Unicode Standard MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification. more
All generic references to ISO/IEC 10646 MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification. more

The term character is often used to mean different things in different contexts: it can variously refer to the visual, logical, or byte-level representation of a given piece of text. This makes the term too imprecise to use when specifying algorithms, protocols, or document formats. Understanding how characters are defined and encoded in computing systems, along with the associated terminology used to make such specification unambiguous, is thus a necessary prerequisite to discussing the processing of string data.

The visual manifestation of a "character"—the shape most people mean when they say "character"—is what we call a user-perceived character. These visual building blocks are usually perceived to be a single unit of the visible text.

At their simplest, user-perceived characters are a single shape that can be tied one-to-one to the underlying computing representation. But a user-perceived character can be formed, in some scripts, from more than one character. And a given logical character can take many different shapes due to such influences as font selection, style, or the surrounding context (such as adjacent characters). In some cases, a single user-perceived character might be formed from a long sequence of logical characters. And some logical characters (so-called "combining marks") are always used in conjunction with another character.

When user-perceived characters are represented visibly (on screen or in print), they are represented by individual rendering units. This visual unit is called a grapheme (the word glyph is also used). Graphemes are the visual units found in fonts and rendering software.

Graphemes are encoded into computer systems using "logical characters". A character set is a set of logical characters: a specific collection of characters that can be used together to encode text. The most important character set is the Universal Character Set, also known as [Unicode]. This character set includes all of the characters used to encode text, including historical or extinct writing systems as well as modern usage, private use, typesetting symbols, and other things, such as the emoji. Other character sets are defined subsets of Unicode. In Unicode, a 'character' is a single abstract logical unit of text. Each character in Unicode is assigned a unique integer number between 0x0000 and 0x10FFFF, which is called its code point. The term code point therefore unambiguously refers to a single logical character and its integer representation.

Specifications SHOULD explicitly define the term 'character' to mean a Unicode code point.

The relationship between code points and graphemes can be complex. In most cases, a code point sequence that forms a single grapheme should be treated as a single textual unit. For example, when cursoring across text, an entire grapheme should select together. It shouldn't be possible to cursor into the "middle" of a grapheme or delete only a part of a user-perceived character. Because the relationship is not one-to-one between code points and graphemes and because the relationship can be somewhat complex, [Unicode] defines a specific type of grapheme: the extended grapheme cluster which most closely matches the mapping of the underlying logical character sequence to a user-perceived character. When referring to 'graphemes' in this document, we mean extended grapheme clusters (unless otherwise called out).

Example 3

Returning to the example above, the Hindi word for Unicode is made of four graphemes:

यू नि को ड

Several of these graphemes are made up of more than one Unicode character because of the way that the Devanagari script works. In Devanagari, the basic set of "letters" are syllables ending with the short 'a' vowel sound. When you want to use a different vowel, you add a combining vowel character that changes the shape of the grapheme. The second grapheme in the example above (नि) is the syllable "ni" in "Unicode". It is made of two characters: U+0928 (the syllable "na") and U+093F (combining "short i" sound):

य	ू	न	ि	क	ो	ड
`U+092F`	`U+0942`	`U+0928`	`U+093F`	`U+0915`	`U+094B`	`U+0921`
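
The same breakdown can be reproduced with a short Python sketch. The grouping function is a deliberate simplification that attaches each combining mark to the preceding character; it is not the extended grapheme cluster algorithm defined by Unicode, and its name is an assumption.

```python
import unicodedata

# The Hindi word for "Unicode", written as explicit code points.
word = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"


def simple_clusters(text):
    """Group each combining mark (general category M*) with the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


clusters = simple_clusters(word)
print(len(word))      # 7 code points
print(len(clusters))  # 4 groups
print([[f"U+{ord(c):04X}" for c in cluster] for cluster in clusters])
```

For this word the simplified grouping gives the same four units as the example, but real text needs the full Unicode rules, which also cover cases such as conjuncts and emoji sequences.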

Another example of the complex relationship between code points and graphemes is found in certain emoji. The emoji character for "family" has a code point in Unicode: 👪 [U+1F46A FAMILY]. It can also be formed by using a sequence of code points: U+1F468 U+200D U+1F469 U+200D U+1F466. Altering or adding other emoji characters can alter the composition of the family. For example the sequence 👨‍👩‍👧‍👧 U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F467 results in a composed emoji character for a "family: man, woman, girl, girl" on systems that support this kind of composition. Many common emoji can only be formed using sequences of code points, but should be treated as a single user-perceived character when displaying or processing the text. You wouldn't want to put a line-break in the middle of the family!
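
A brief Python sketch shows how many code points sit behind the single composed emoji. Python strings are sequences of code points, so `len` counts code points here; the variable names are illustrative.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# man, woman, girl, girl joined by ZWJ
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F467"])

print(len(family))                                   # 7
print(" ".join(f"U+{ord(c):04X}" for c in family))
# U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F467

# Cutting the string at an arbitrary code point offset breaks up the family.
print(family[:3] == "\U0001F468" + ZWJ + "\U0001F469")  # True
```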

Unicode code points are just abstract integer values: they are not the values actually present in the memory of the computer or serialized on the wire. When processing text, computers use an array of fixed-size integer units. One such common unit is the byte (or octet, since bytes have 8 bits per unit). There are also 16-bit, 32-bit, or other size units. In many programming languages, the unit is called a char, which suggests that strings are made of "characters". We use the term code unit to refer unambiguously to the programming and serialized representation of characters. For example, in C, a char is generally an 8-bit byte: each char is an 8-bit code unit. In Java or JavaScript, a char is a 16-bit value.

A set of rules for converting code points to or from code units is called a character encoding form (or just "character encoding" for short).
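
The difference between code points and code units can be made concrete with a small sketch that counts the code units needed for the same text in three Unicode encoding forms. The function name is an assumption; the encodings are those provided by Python's standard codecs.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and the code units used by three Unicode encoding forms."""
    return {
        "code points": len(text),
        "UTF-8 code units (bytes)": len(text.encode("utf-8")),
        "UTF-16 code units": len(text.encode("utf-16-le")) // 2,
        "UTF-32 code units": len(text.encode("utf-32-le")) // 4,
    }


for sample in ["A", "\u00C0", "\u0928", "\U0001F46A"]:
    print(f"U+{ord(sample):04X}", code_unit_counts(sample))
```

For U+1F46A this reports one code point, four UTF-8 code units, two UTF-16 code units and one UTF-32 code unit, which is why a string "length" is only meaningful once the unit being counted is stated.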

Example 4

The most common character encoding used on the Web is UTF-8. UTF-8 uses 8-bit bytes as its code unit. Each Unicode code point encoded into UTF-8 takes between one and four bytes to encode. ASCII characters take one byte to encode. Code points from 0x80 to 0x7FF take two bytes. Code points from 0x800 to 0xFFFF take three bytes. And code points from 0x10000 to 0x10FFFF (that is, the rest of Unicode) take four bytes each.

Grapheme	A	À
Code Point	`U+0041`	`U+00C0`
Code Units (bytes)	`0x41`	`0xC3 0x80`
Grapheme	न	👪
Code Point	`U+0928`	`U+1F46A`
Code Units (bytes)	`0xE0 0xA4 0xA8`	`0xF0 0x9F 0x91 0xAA`
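
The byte ranges described in this example can be checked with a hand-written encoder. This is a sketch for illustration only (real code should use the platform's own UTF-8 support); the function name is an assumption.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the ranges given above."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:          # ASCII: one byte
        return bytes([code_point])
    if code_point < 0x800:         # two bytes
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # three bytes
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                 # four bytes
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


for cp in (0x41, 0xC0, 0x928, 0x1F46A):
    encoded = utf8_encode(cp)
    # Cross-check against Python's built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", " ".join(f"0x{b:02X}" for b in encoded))
```

The printed bytes match the table above: `0x41`, `0xC3 0x80`, `0xE0 0xA4 0xA8` and `0xF0 0x9F 0x91 0xAA`.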

See also

4.9 Defining 'string'.

Specifications SHOULD use specific terms, when available, instead of the general term 'character'.

explanations & examples

Perceptions of Characters, Summary C067, in Character Model for the World Wide Web 1.0: Fundamentals.

When specifications use the term 'character' the specifications MUST define which meaning they intend, and SHOULD explicitly define the term 'character' to mean a Unicode code point.

explanations & examples

Perceptions of Characters, Summary C010, in Character Model for the World Wide Web 1.0: Fundamentals

Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage.

explanations & examples

Units of storage C009, in Character Model for the World Wide Web 1.0: Fundamentals

Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language.

explanations & examples

Units of aural rendering C001, in Character Model for the World Wide Web 1.0: Fundamentals

Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text.

explanations & examples

Units of visual rendering C002, in Character Model for the World Wide Web 1.0: Fundamentals.

Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world.

explanations & examples

Units of input C005, in Character Model for the World Wide Web 1.0: Fundamentals.

Textual data objects defined by protocol or format specifications MUST be in a single character encoding.

explanations & examples

Reference Processing Model C013, in Character Model for the World Wide Web: Fundamentals

All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model described by the rest of the recommendations in this list.

explanations & examples

Reference Processing Model C014, in Character Model for the World Wide Web: Fundamentals

Specifications MUST define text in terms of Unicode characters, not bytes or glyphs.

explanations & examples

Reference Processing Model C014, in Character Model for the World Wide Web: Fundamentals

For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form.

explanations & examples

Reference Processing Model C014, in Character Model for the World Wide Web: Fundamentals

Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows: (a) The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form, (b) All processing MUST take place on this sequence of Unicode characters, (c) If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification.

explanations & examples

Reference Processing Model C014, in Character Model for the World Wide Web: Fundamentals
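
Steps (a) to (c) of the guideline above can be pictured with the following sketch. The function name, the upper-casing used as a stand-in for "processing", and the choice of UTF-8 for output are all assumptions made for the example.

```python
def process(data: bytes, declared_encoding: str, output_encoding: str = "utf-8") -> bytes:
    # (a) Interpret the received bytes as a sequence of Unicode characters.
    text = data.decode(declared_encoding)
    # (b) All processing happens on Unicode characters, never on the raw bytes.
    processed = text.upper()
    # (c) Encode the result using an encoding allowed by the specification.
    return processed.encode(output_encoding)


as_latin1 = "café".encode("iso-8859-1")
as_utf16 = "café".encode("utf-16")

# The observable behaviour is the same whatever encoding the input arrived in.
print(process(as_latin1, "iso-8859-1") == process(as_utf16, "utf-16"))  # True
```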

If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects.

explanations & examples

Reference Processing Model C014, in Character Model for the World Wide Web: Fundamentals

See also

4.4 Using the Private Use Area.

Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.

explanations & examples

Reference Processing Model C070, in Character Model for the World Wide Web: Fundamentals.

Specifications MUST NOT allow code points above U+10FFFF.

explanations & examples

Reference Processing Model C077, in Character Model for the World Wide Web: Fundamentals.

Specifications SHOULD NOT allow the use of codepoints reserved by Unicode for internal use.

explanations & examples

Reference Processing Model C079, in Character Model for the World Wide Web: Fundamentals.

Specifications MUST NOT allow the use of surrogate code points.

explanations & examples

Reference Processing Model C078, in Character Model for the World Wide Web: Fundamentals.
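
The four guidelines above translate into a simple range check. The sketch below assumes that "code points reserved by Unicode for internal use" refers to the Unicode noncharacters (U+FDD0 to U+FDEF and the last two code points of each plane); the function names are hypothetical.

```python
def is_surrogate(code_point: int) -> bool:
    return 0xD800 <= code_point <= 0xDFFF


def is_noncharacter(code_point: int) -> bool:
    # U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF in every plane.
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def is_permitted(code_point: int) -> bool:
    """True for code points in U+0000..U+10FFFF that are neither surrogates nor noncharacters."""
    return (
        0 <= code_point <= 0x10FFFF
        and not is_surrogate(code_point)
        and not is_noncharacter(code_point)
    )


for cp in (0x0041, 0xD800, 0xFDD0, 0xFFFD, 0x1FFFE, 0x10FFFD, 0x110000):
    print(f"{cp:#08x}", is_permitted(cp))
```

Everything else in the range, including private use code points and code points not yet assigned, remains permitted, in line with the advice not to exclude code points arbitrarily.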

Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define.

explanations & examples

Compatibility and Formatting Characters C050, in Character Model for the World Wide Web: Fundamentals.

Specifications SHOULD allow the full range of Unicode for user-defined values.

explanations & examples

Unicode case-insensitive matching, in Character Model for the World Wide Web: Fundamentals.

Specifications MUST NOT require the use of private use area characters with particular assignments.

explanations & examples

Private use code points, C038, in Character Model for the World Wide Web: Fundamentals

Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points.

explanations & examples

Private use code points, C039, in Character Model for the World Wide Web: Fundamentals

Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement.

explanations & examples

Private use code points, C040, in Character Model for the World Wide Web: Fundamentals

Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters.

explanations & examples

Private use code points, C041, in Character Model for the World Wide Web: Fundamentals

Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics.

explanations & examples

Private use code points, C068, in Character Model for the World Wide Web: Fundamentals

Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified.

explanations & examples

Choice and Identification of Character Encodings, C015, in Character Model for the World Wide Web: Fundamentals

When designing a new protocol, format or API, specifications SHOULD require a unique character encoding.

explanations & examples

Choice and Identification of Character Encodings, C016, in Character Model for the World Wide Web: Fundamentals

When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules.

explanations & examples

Choice and Identification of Character Encodings, C017, in Character Model for the World Wide Web: Fundamentals

When a unique character encoding is required, the character encoding MUST be UTF-8, or UTF-16.

explanations & examples

Mandating a unique character encoding, C018, in Character Model for the World Wide Web: Fundamentals

Note

The above guideline needs further consideration: UTF-16 and UTF-32 are not recommended these days. UTF-8 is the recommended encoding.

Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED.

explanations & examples

Mandating a unique character encoding, C020, in Character Model for the World Wide Web: Fundamentals

If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs.

explanations & examples

Character encoding identification, C021, in Character Model for the World Wide Web: Fundamentals

Note

The above guideline needs further consideration: the character encodings recommended for Web specifications are listed in the Encoding specification.

Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement.

explanations & examples

Character encoding identification, C022, in Character Model for the World Wide Web: Fundamentals

If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed.

explanations & examples

Character encoding identification, C023, in Character Model for the World Wide Web: Fundamentals

If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification).

explanations & examples

Character encoding identification, C026, in Character Model for the World Wide Web: Fundamentals

Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them.

explanations & examples

Character encoding identification, C027, in Character Model for the World Wide Web: Fundamentals

Specifications MUST NOT propose the use of heuristics to determine the encoding of data.

explanations & examples

Character encoding identification, C028, in Character Model for the World Wide Web: Fundamentals

Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding.

explanations & examples

Character encoding identification, C028, in Character Model for the World Wide Web: Fundamentals
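As an illustration of what a conflict-resolution mechanism might look like, the following minimal sketch resolves conflicting encoding information using a fixed priority order. The order used here (byte order mark, then transport label, then in-document label, then a UTF-8 default) and the names resolve_encoding, transport_label and in_document_label are assumptions made for the example; a specification would define its own rules.

```python
import codecs
from typing import Optional

# Byte order marks recognised by this sketch (UTF-8 and both UTF-16 byte orders).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def resolve_encoding(data: bytes,
                     transport_label: Optional[str] = None,
                     in_document_label: Optional[str] = None) -> str:
    """Pick exactly one encoding from possibly conflicting sources.

    The priority order is an assumption for illustration only.
    """
    # 1. A byte order mark at the start of the data takes precedence.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. Next, a label supplied by the transport (e.g. a charset parameter).
    if transport_label:
        return codecs.lookup(transport_label).name
    # 3. Next, a label declared inside the document itself.
    if in_document_label:
        return codecs.lookup(in_document_label).name
    # 4. Otherwise fall back to the default encoding.
    return "utf-8"
```

Because every source of information has a fixed rank, two implementations given the same input reach the same answer without resorting to heuristics.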

Specifications should provide a mechanism for escaping characters, particularly those which are invisible or ambiguous.

Using character escapes in markup and CSS, W3C article.

It is generally recommended that character escapes be provided so that sequences that are difficult to enter or edit can be introduced using a plain text editor. Escape sequences are particularly useful for invisible or ambiguous Unicode characters, including zero-width spaces, soft hyphens, various bidi controls, Mongolian vowel separators, etc.

For advice on the use of escapes in markup, most of which can be generalised to other formats, see Using character escapes in markup and CSS.

Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists.

explanations & examples

Character Escaping, C042, in Character Model for the World Wide Web: Fundamentals

The number of different ways to escape a character SHOULD be minimized (ideally to one).

explanations & examples

Character Escaping, C043, in Character Model for the World Wide Web: Fundamentals

Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided.

explanations & examples

Character Escaping, C044, in Character Model for the World Wide Web: Fundamentals

Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation.

explanations & examples

Character Escaping, C045, in Character Model for the World Wide Web: Fundamentals

Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable.

explanations & examples

Character Escaping, C046, in Character Model for the World Wide Web: Fundamentals
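The escaping guidelines above can be illustrated with a minimal sketch. The escape syntax used here, \u{...} with one to six hexadecimal digits, is hypothetical and chosen only because it has an explicit end delimiter and uses the hexadecimal Unicode code point as its number. The function names are likewise invented for the example.

```python
import re
import unicodedata

# Hypothetical escape syntax: \u{H...} with 1-6 hex digits and an explicit end delimiter.
ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def expand_escapes(text):
    """Replace each \\u{...} escape with the character at that code point."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # Reject values beyond U+10FFFF and surrogate code points.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: " + match.group(0))
        return chr(code_point)
    return ESCAPE.sub(replace, text)


def escape_invisible(text):
    """Escape format characters (general category Cf), which are normally invisible."""
    return "".join(
        f"\\u{{{ord(ch):04X}}}" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )
```

General category Cf covers characters such as U+200B ZERO WIDTH SPACE, U+00AD SOFT HYPHEN and the bidi controls. Other ambiguous characters would need additional rules.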

Protocols, data formats and APIs MUST store, interchange or process text data in logical order.

explanations & examples

Visual Rendering and Logical Order, C003, in Character Model for the World Wide Web: Fundamentals.

Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage.

explanations & examples

Visual Rendering and Logical Order, C075, in Character Model for the World Wide Web: Fundamentals.

Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs.

explanations & examples

Visual Rendering and Logical Order, C004, in Character Model for the World Wide Web: Fundamentals.

4.1 Choosing a definition of 'character'.

Specifications SHOULD NOT define a string as a 'byte string'.

explanations & examples

String concepts, C011, in Character Model for the World Wide Web: Fundamentals.

The 'character string' definition SHOULD be used by most specifications.

explanations & examples

String concepts, C012, in Character Model for the World Wide Web: Fundamentals.

Use U+XXXX syntax to represent Unicode code points in the specification.

The U+XXXX format is well understood when referring to Unicode code points in a specification. Code points that appear in a sequence are separated by spaces. No additional decoration is needed. Note that a code point may contain four, five, or six hexadecimal digits. When fewer than four digits are needed, the code point number is zero filled, e.g. U+0020.
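As a small illustration of this notation, the following sketch converts a string into space-separated U+XXXX form. The function name u_plus is an invention for the example.

```python
def u_plus(text):
    """Return the code points of `text` in space-separated U+XXXX notation."""
    # :04X zero-fills to at least four hex digits and leaves longer values intact.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)
```

For instance, a string made up of a space followed by U+1F600 would be written as U+0020 U+1F600.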

Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646.

explanations & examples

Referencing the Unicode Standard and ISO/IEC 10646, C062, in Character Model for the World Wide Web: Fundamentals.

A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time.

explanations & examples

Referencing the Unicode Standard and ISO/IEC 10646, C063, in Character Model for the World Wide Web: Fundamentals.

All generic references to the Unicode Standard MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification.

explanations & examples

Referencing the Unicode Standard and ISO/IEC 10646, C064, in Character Model for the World Wide Web: Fundamentals.

All generic references to ISO/IEC 10646 MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification.

explanations & examples

Referencing the Unicode Standard and ISO/IEC 10646, C065, in Character Model for the World Wide Web: Fundamentals.

Expand the following to reveal just the guidelines. Add a check mark to items that are relevant to you. Create a checklist that can be transferred to a GitHub wiki.

Choosing text units for segmentation, indexing, etc.

The character string is RECOMMENDED as a basis for string indexing. more
Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern. more
Specifications that define indexing in terms of grapheme clusters MUST either: (a) define grapheme clusters in terms of extended grapheme clusters as defined in Unicode Standard Annex #29, Unicode Text Segmentation (UTR #29), or (b) define specifically how tailoring is applied to the indexing operation. more
The use of byte strings for indexing is NOT RECOMMENDED. more
A UTF-16 code unit string is NOT RECOMMENDED as a basis for string indexing, even if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string. more
Specifications that need a way to identify substrings or point within a string SHOULD consider ways other than string indexing to perform this operation. more
Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units. more
Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types. more
When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string. more

Matching string identity for identifiers and syntactic content

String identity matching for identifiers and syntactic content should involve the following steps: (a) Ensure the strings to be compared constitute a sequence of Unicode code points (b) Expand all character escapes and includes (c) Perform any appropriate case-folding and Unicode normalization step (d) Perform any additional matching tailoring specific to the specification, and (e) Compare the resulting sequences of code points for identity. more
The default recommendation for matching strings in identifiers and syntactic content is to do no normalization (i.e. case folding or Unicode Normalization) of content. more
'ASCII case fold' and 'Unicode canonical case fold' approaches should only be used in special circumstances. more
A 'Unicode compatibility case fold' approach should not be used. more
Specifications of vocabularies MUST define the boundaries between syntactic content and character data as well as entity boundaries (if the language has any include mechanism). more

Working with Unicode Normalization

Specifications SHOULD NOT specify a Unicode normalization form for encoding, storage, or interchange of a given vocabulary. more
Implementations MUST NOT alter the normalization form of textual data being exchanged, read, parsed, or processed except when required to do so as a side-effect of text transformation such as transcoding the content to a Unicode character encoding, case folding, or other user-initiated change, as consumers or the content itself might depend on the de-normalized representation. more
Specifications SHOULD NOT specify compatibility normalization forms (NFKC, NFKD). more
Specifications MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue. more
Where operations can produce denormalized output from normalized text input, specifications MUST define whether the resulting output is required to be normalized or not. Specifications MAY state that performing normalization is optional for some operations; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. more
Specifications that require normalization MUST NOT make the implementation of normalization optional. more
Normalization-sensitive operations MUST NOT be performed unless the implementation has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed. more
A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text. more
Specifications that perform comparison or matching of string values SHOULD specify the appropriate note or warning regarding Unicode normalization. more

Case folding

Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of: (a) case-sensitive (b) Unicode case-insensitive using Unicode full case-folding (c) ASCII case-insensitive. more
Case-sensitive matching is RECOMMENDED for matching syntactic content, including user-defined values. more
Specifications that define case-insensitive matching in vocabularies that include more than the Basic Latin (ASCII) range of Unicode MUST specify Unicode full casefold matching. more
Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching. more
If language-sensitive case-insensitive matching is specified, Unicode case mappings SHOULD be tailored according to language and the source of the language used for each tailoring MUST be specified. more
Specifications that define case-insensitive matching in vocabularies SHOULD NOT specify language-sensitive case-insensitive matching. more

Truncating or limiting the length of strings

Specifications SHOULD NOT limit the size of data fields unless there is a specific practical or technical limitation. more
Specifications that limit the length of a string MUST specify which type of unit (extended grapheme clusters, Unicode code points, or code units) the length limit uses. more
Specifications that limit the length of a string SHOULD specify the length in terms of Unicode code points. more
If a specification sets a length limit in code units (such as bytes), it MUST specify that truncation can only occur on code point boundaries. more
Specifications that limit the length of a string SHOULD require truncation on grapheme boundaries, as truncation in the midst of a combining or joining sequence can alter the meaning of the string. more
If a specification specifies a length limit, it SHOULD specify that any string that is truncated include an indicator, such as ellipses, that the string has been altered. more
When specifying a length limitation in code units (such as bytes), specifications SHOULD set the maximum length in a way that accommodates users whose language requires multibyte code unit sequences. more

Working with file and path names

Specify the UTF-8 [Unicode] encoding for the storage and processing of file names and file paths. more
File names SHOULD be restricted to 255 bytes in length. more
Path names SHOULD be restricted to 65535 bytes in length. more
File name and path name definitions MUST NOT use the following Unicode code points. more

Specifying sort and search functionality

Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application. more
Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user. more
Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering. more
Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode. more

See also

4.9 Defining 'string'.

5.5 Truncating or limiting the length of strings.

There are many situations where a software process needs to access a substring or to point within a string and does so by the use of indices, i.e. numeric "positions" within a string. Where such indices are exchanged between components of the Web, there is a need for an agreed-upon definition of string indexing in order to ensure consistent behavior. The two main questions that arise are: "What is the unit of counting?" and "Do we start counting at 0 or 1?".

The character string is RECOMMENDED as a basis for string indexing.

Character Model for the World Wide Web: Fundamentals, String indexing, C051

Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern.

Character Model for the World Wide Web: Fundamentals, String indexing, C071

Typographic character units in complex scripts: situations where grapheme clusters can be insufficient for segmenting complex scripts.

Character encodings: Essential concepts, Characters & clusters

Specifications that define indexing in terms of grapheme clusters MUST either: (a) define grapheme clusters in terms of extended grapheme clusters as defined in Unicode Standard Annex #29, Unicode Text Segmentation (UTR #29), or (b) define specifically how tailoring is applied to the indexing operation.

Character Model for the World Wide Web: Fundamentals, String indexing, C071

Unicode Standard Annex #29, Unicode Text Segmentation, Grapheme Cluster Boundaries

Typographic character units in complex scripts: situations where grapheme clusters can be insufficient for segmenting complex scripts.

Character encodings: Essential concepts, Characters & clusters

The use of byte strings for indexing is NOT RECOMMENDED.

Character Model for the World Wide Web: Fundamentals, String indexing, C072

A UTF-16 code unit string is NOT RECOMMENDED as a basis for string indexing, even if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string.

Character Model for the World Wide Web: Fundamentals, String indexing, C052

A counter-example is the use of UTF-16 in DOM Level 1. The use of UTF-16 code units is discouraged because it leaves open the possibility of an index occurring between the two code units of a surrogate pair, which would cause significant problems (see 5.5 Truncating or limiting the length of strings).
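To make the choice of counting unit concrete, the following sketch reports the length of a string in three different units and checks whether a UTF-16 code unit index falls inside a surrogate pair. The function names are hypothetical. Python strings are indexed by code point, so the UTF-16 view is obtained here by encoding.

```python
def unit_counts(text):
    """Length of `text` in code points, UTF-16 code units and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


def utf16_index_splits_pair(text, index):
    """True if a UTF-16 code unit index falls between the halves of a surrogate pair."""
    data = text.encode("utf-16-le")
    if index <= 0 or index >= len(data) // 2:
        return False
    previous = int.from_bytes(data[2 * (index - 1):2 * index], "little")
    following = int.from_bytes(data[2 * index:2 * index + 2], "little")
    # A high surrogate followed by a low surrogate forms one code point.
    return 0xD800 <= previous <= 0xDBFF and 0xDC00 <= following <= 0xDFFF
```

For the three-code-point string made of "a", U+1F600 and "b", the UTF-16 length is 4 and index 2 lies in the middle of the emoji.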

Specifications that need a way to identify substrings or point within a string SHOULD consider ways other than string indexing to perform this operation.

Character Model for the World Wide Web: Fundamentals, String indexing, C053

Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units.

Character Model for the World Wide Web: Fundamentals, String indexing, C053

Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types.

Character Model for the World Wide Web: Fundamentals, String indexing, C056

When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string.

Character Model for the World Wide Web: Fundamentals, String indexing, C057
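The last three guidelines can be sketched together: indices denote boundary positions between counting units, counting starts at 0, the last index equals the number of units, and a single character is handled as a substring rather than as a separate type. The sketch counts in code points (the character string). The names boundaries and substring are invented for the example.

```python
def boundaries(text):
    """All valid boundary positions in `text`, counted in code points."""
    # Position 0 is the start of the string; position len(text) is the end.
    return list(range(len(text) + 1))


def substring(text, start, end):
    """Return the text between two boundary positions."""
    if not 0 <= start <= end <= len(text):
        raise IndexError("boundary positions out of range")
    # The result is always a string, even when it holds a single character.
    return text[start:end]
```

A string of three code points therefore has four boundary positions, 0 through 3.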

See also

5.3 Working with Unicode Normalization.

5.4 Case folding.

String identity matching for identifiers and syntactic content should involve the following steps: (a) Ensure the strings to be compared constitute a sequence of Unicode code points (b) Expand all character escapes and includes (c) Perform any appropriate case-folding and Unicode normalization step (d) Perform any additional matching tailoring specific to the specification, and (e) Compare the resulting sequences of code points for identity.

explanations & examples

The Matching Algorithm, in Character Model for the World Wide Web: String Matching
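The steps (a) to (e) above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the \u{...} escape syntax, the assumption that byte input is UTF-8, and the function and parameter names are all inventions for the example. Case folding and normalization are off by default, in line with the default recommendation for identifiers.

```python
import re
import unicodedata

# Hypothetical escape syntax for this sketch: \u{H...} with 1-6 hex digits.
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def _prepare(value, *, case_fold, normalization):
    # (a) Obtain a sequence of Unicode code points; bytes are assumed to be UTF-8.
    text = value.decode("utf-8") if isinstance(value, bytes) else value
    # (b) Expand character escapes (this sketch has no include mechanism).
    text = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # (c) Optional normalization and case folding; both are off by default.
    if normalization:
        text = unicodedata.normalize(normalization, text)
    if case_fold:
        text = text.casefold()
        if normalization:
            # Folding can denormalize text, so normalize again afterwards.
            text = unicodedata.normalize(normalization, text)
    # (d) Specification-specific tailoring would go here; none is applied.
    return text


def identifiers_match(a, b, *, case_fold=False, normalization=None):
    # (e) Compare the resulting code point sequences for identity.
    return (_prepare(a, case_fold=case_fold, normalization=normalization)
            == _prepare(b, case_fold=case_fold, normalization=normalization))
```

With the defaults, U+00E9 and the sequence U+0065 U+0301 do not match. They match only when a normalization form such as "NFC" is requested.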

The default recommendation for matching strings in identifiers and syntactic content is to do no normalization (i.e. case folding or Unicode Normalization) of content.

explanations & examples

Performing the Appropriate Normalization Step, in Character Model for the World Wide Web: String Matching

'ASCII case fold' and 'Unicode canonical case fold' approaches should only be used in special circumstances.

explanations & examples

Performing the Appropriate Normalization Step, in Character Model for the World Wide Web: String Matching

A 'Unicode compatibility case fold' approach should not be used.

explanations & examples

Performing the Appropriate Normalization Step, in Character Model for the World Wide Web: String Matching

Specifications of vocabularies MUST define the boundaries between syntactic content and character data as well as entity boundaries (if the language has any include mechanism).

explanations & examples

Additional Considerations for Normalization, in Character Model for the World Wide Web: String Matching

Specifications SHOULD NOT specify a Unicode normalization form for encoding, storage, or interchange of a given vocabulary.

explanations & examples

Additional Considerations for Normalization, in Character Model for the World Wide Web: String Matching.

Implementations MUST NOT alter the normalization form of textual data being exchanged, read, parsed, or processed except when required to do so as a side-effect of text transformation such as transcoding the content to a Unicode character encoding, case folding, or other user-initiated change, as consumers or the content itself might depend on the de-normalized representation.

explanations & examples

Additional Considerations for Normalization, in Character Model for the World Wide Web: String Matching.

Specifications SHOULD NOT specify compatibility normalization forms (NFKC, NFKD).

explanations & examples

Additional Considerations for Normalization, in Character Model for the World Wide Web: String Matching.

Specifications MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.

explanations & examples

Additional Considerations for Normalization, in Character Model for the World Wide Web: String Matching.

Where operations can produce denormalized output from normalized text input, specifications MUST define whether the resulting output is required to be normalized or not. Specifications MAY state that performing normalization is optional for some operations; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off.

explanations & examples

Requirements When Specifying Normalization in Document Formats, in Character Model for the World Wide Web: String Matching.

Specifications that require normalization MUST NOT make the implementation of normalization optional.

explanations & examples

Requirements When Specifying Normalization in Document Formats, in Character Model for the World Wide Web: String Matching.

Normalization-sensitive operations MUST NOT be performed unless the implementation has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.

explanations & examples

Requirements When Specifying Normalization in Document Formats, in Character Model for the World Wide Web: String Matching.

A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.

explanations & examples

Requirements When Specifying Normalization in Document Formats, in Character Model for the World Wide Web: String Matching.

Specifications that perform comparison or matching of string values SHOULD specify the appropriate note or warning regarding Unicode normalization.

The use or adoption of Unicode Normalization in a specification is usually part of defining how matching takes place in a given format or protocol. To help specification authors and implementers understand some of the complexity involved, the Internationalization Working Group has developed a document describing the considerations for the matching and comparison of strings: Character Model for the World Wide Web: String Matching [CHARMOD-NORM].

One of the choices specifications need to make is whether (or not) to require Unicode Normalization as part of matching various "values" defined as part of the specification's vocabulary. Values are commonly part of a document format or protocol's syntax, and include such things as: attribute names or values, element names or values, IDs, and so forth. Specifications that follow the recommendation to not employ normalization as part of matching should include the following Note as a reminder to content authors.

Example note. Necessarily this version is non-specific about what constitutes "values": specifications may wish to be more specific.

Note

This specification does not permit Unicode normalization of values for the purposes of comparison. Values that are visually and semantically identical but use different Unicode character sequences will not match. Content authors are advised to use the same encoding sequence consistently or to avoid potentially troublesome characters when choosing values. For more information, see [CHARMOD-NORM].

Specifications that choose to require normalization as part of string matching should include the following warning:

Example warning. Necessarily this version is non-specific about what constitutes "values": specifications may wish to be more specific.

Warning

This specification applies Unicode normalization during the matching of values. This can have an effect on the appearance and meaning of the affected text. For more information, see [CHARMOD-NORM].

Contact the I18N WG for alternatives or assistance if the above do not meet your needs or you're not sure about usage.
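The behaviour described by the example note and warning can be illustrated with Python's standard unicodedata module. In this minimal sketch the function name matches and its form parameter are hypothetical. Without normalization, canonically equivalent sequences fail to match, while a compatibility form such as NFKC changes the text itself.

```python
import unicodedata

PRECOMPOSED = "\u00E9"   # U+00E9
DECOMPOSED = "e\u0301"   # U+0065 U+0301, canonically equivalent to U+00E9


def matches(a, b, form=None):
    """Compare two values, optionally normalizing both to `form` first."""
    if form is not None:
        a = unicodedata.normalize(form, a)
        b = unicodedata.normalize(form, b)
    return a == b


# Compatibility normalization replaces U+00B2 SUPERSCRIPT TWO with a plain digit,
# which changes both the appearance and the meaning of the text.
SQUARED = unicodedata.normalize("NFKC", "x\u00B2")
```

Here matches(PRECOMPOSED, DECOMPOSED) is false unless a form such as "NFC" is passed, and SQUARED no longer contains the superscript character.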

Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of: (a) case-sensitive (b) Unicode case-insensitive using Unicode full case-folding (c) ASCII case-insensitive.

Case-sensitive matching is RECOMMENDED for matching syntactic content, including user-defined values.

explanations & examples

Case-sensitive matching, in Character Model for the World Wide Web: String Matching.

Specifications that define case-insensitive matching in vocabularies that include more than the Basic Latin (ASCII) range of Unicode MUST specify Unicode full casefold matching.

explanations & examples

Unicode case-insensitive matching, in Character Model for the World Wide Web: String Matching.

Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.

explanations & examples

ASCII case-insensitive matching, in Character Model for the World Wide Web: String Matching.
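The difference between the Unicode and ASCII case-insensitive options can be sketched as follows. Python's str.casefold() is used for Unicode full case folding. The ASCII variant maps only the letters A to Z. The function names are invented for the example.

```python
# Translation table that lowercases only the 26 ASCII capital letters.
_ASCII_LOWER = {ord(c): ord(c) + 32 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}


def ascii_case_insensitive_match(a, b):
    """Match ignoring case differences in the Basic Latin (ASCII) range only."""
    return a.translate(_ASCII_LOWER) == b.translate(_ASCII_LOWER)


def unicode_case_insensitive_match(a, b):
    """Match using full case folding, without language tailoring."""
    return a.casefold() == b.casefold()
```

Full case folding maps U+00DF to "ss" and U+212A KELVIN SIGN to "k", so the two functions give different answers as soon as the vocabulary leaves the ASCII range.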

If language-sensitive case-insensitive matching is specified, Unicode case mappings SHOULD be tailored according to language and the source of the language used for each tailoring MUST be specified.

explanations & examples

Language-specific tailoring, in Character Model for the World Wide Web: String Matching.

Specifications that define case-insensitive matching in vocabularies SHOULD NOT specify language-sensitive case-insensitive matching.

explanations & examples

Language-specific tailoring, in Character Model for the World Wide Web: String Matching.
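To show why language-sensitive case-insensitive matching is discouraged for vocabularies, the following sketch applies the Turkic tailoring listed in Unicode's CaseFolding.txt (the mappings with status T) on top of the default folding. The function name fold and its language parameter are hypothetical, and the language check is deliberately simplistic.

```python
# Turkic tailoring from CaseFolding.txt (status T):
# U+0049 folds to U+0131 (dotless i) and U+0130 folds to U+0069.
_TURKIC_FOLD = {0x0049: "\u0131", 0x0130: "i"}


def fold(text, language=None):
    """Case-fold `text`, applying Turkic tailoring for Turkish and Azerbaijani."""
    if language in ("tr", "az"):
        text = text.translate(_TURKIC_FOLD)
    return text.casefold()
```

Under the default folding "TITLE" and "title" match. Under the Turkic tailoring they do not, so the result of a comparison would depend on which language is in effect.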

Some specifications, formats, or protocols or their implementations need to specify limits for the size of a given data structure or text field. This could be due to many reasons, such as limits on processing, memory, data structure size, and so forth. When selecting or specifying limits on the length of a given string, specifications or implementations need to ensure that they do not cause corruption in the text.

Specifications SHOULD NOT limit the size of data fields unless there is a specific practical or technical limitation.

There are many reasons why a length limit might be needed in a specification or format. Generally length limits correspond to underlying limits in the implementation, such as the use of fixed-size fields in a database or data store, the desire to fit into practical boundaries such as packet size, or some other implementation detail related to storage allocation or efficiency.

When truncating strings, it's necessary to decide what units to use when counting the size of the string. In many cases this is beyond the control of the specification, since the truncation is occurring for some preordained reason. However, when the choice is available, some general guidelines can be applied.

If the limitation is related to the number of display positions, the grapheme count usually corresponds most closely to the expected limit. Note that proportional width fonts, combining marks, complex scripts, and many other factors complicate counting "screen positions". In Web pages, for example, the CSS text-overflow property provides visual truncation without disturbing the content of the text. Attempts to estimate the size of a given piece of text based on the number of Unicode code points or even the number of grapheme clusters are mostly futile.

Otherwise most limits are expressed in terms of code points in Unicode or code units (such as bytes) in a specific character encoding. Code points provide the best user experience, since all Unicode code points are treated identically: if text is truncated after 40 code points, all languages and scripts get the same number of code points to work with. By contrast, when the size limit is expressed in code units such as bytes in UTF-8, users who write in a language that mostly uses ASCII letters get many more characters (code points) for a given size limit than users whose language is mostly made up of characters that take 2, 3, or 4 bytes per code point.

Example 5

Below you can see the effect of truncating a given string of text encoded in UTF-8 on a 40-byte boundary. There are several things to notice here.

First, the number of characters in the truncated string decreases as the number of bytes required per character goes up. The Cyrillic string has about half as many characters as the ASCII string, the Chinese string has about one third as many, and the emoji string has one quarter.

Second, in two of the four examples (Cyrillic and Han), the text is truncated on a byte boundary in the middle of a character. The resulting "dangling byte" is rendered as U+FFFD and the byte sequence itself is not valid UTF-8. This can interfere with the validity of a given text file. Unlike many legacy character encodings, UTF-8 is highly patterned, so the longest broken character sequence that can result from mid-character truncation is one character. By contrast, in many legacy encodings, a file or document containing a mid-character truncated string can be wholly changed or rendered unintelligible after that point.

Script	Truncated Length (code points)	Avg. Bytes/Code Point	Truncated Text (first row) / Byte Values (second row)
ASCII	40	1	`In the loveliest town of all, where the`
ASCII	40	1	`49 6E 20 74 68 65 20 6C 6F 76 65 6C 69 65 73 74 20 74 6F 77 6E 20 6F 66 20 61 6C 6C 2C 20 77 68 65 72 65 20 74 68 65 20`
Cyrillic	22	2	`В самом прекрасном го�`
Cyrillic	22	2	`D0 92 20 D1 81 D0 B0 D0 BC D0 BE D0 BC 20 D0 BF D1 80 D0 B5 D0 BA D1 80 D0 B0 D1 81 D0 BD D0 BE D0 BC 20 D0 B3 D0 BE D0`
Han	14	3	`在最美丽的城镇，那里的房屋�`
Han	14	3	`E5 9C A8 E6 9C 80 E7 BE 8E E4 B8 BD E7 9A 84 E5 9F 8E E9 95 87 EF BC 8C E9 82 A3 E9 87 8C E7 9A 84 E6 88 BF E5 B1 8B E5`
Emoji	10	4	`🙊🙁😢😠😧😎😽😉😄😮`
Emoji	10	4	`F0 9F 99 8A F0 9F 99 81 F0 9F 98 A2 F0 9F 98 A0 F0 9F 98 A7 F0 9F 98 8E F0 9F 98 BD F0 9F 98 89 F0 9F 98 84 F0 9F 98 AE`
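
The pattern in the table can be reproduced with a short sketch. Python is used here purely for illustration. The helper name `truncate_bytes_naive` and the uniform sample strings are assumptions made for this example and are not taken from any specification. The helper cuts the UTF-8 bytes at a raw offset and decodes with replacement, so an incomplete trailing sequence appears as U+FFFD.

```python
def truncate_bytes_naive(text: str, limit: int = 40) -> str:
    """Cut the UTF-8 form of `text` at a raw byte offset (illustrative only)."""
    raw = text.encode("utf-8")[:limit]
    # errors="replace" turns an incomplete trailing sequence into U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Hypothetical sample strings: one repeated character per byte width.
samples = {
    "1 byte/code point": "a" * 50,
    "2 bytes/code point": "\u044f" * 50,      # Cyrillic small letter ya
    "3 bytes/code point": "\u4e2d" * 50,      # a Han ideograph
    "4 bytes/code point": "\U0001F600" * 50,  # an emoji
}

for label, text in samples.items():
    cut = truncate_bytes_naive(text)
    print(label, len(cut), cut.endswith("\ufffd"))

# Prints:
# 1 byte/code point 40 False
# 2 bytes/code point 20 False
# 3 bytes/code point 14 True
# 4 bytes/code point 10 False
```

With uniform strings, only the 3-byte case ends in a dangling byte, because 40 is not a multiple of 3. In mixed text such as the Cyrillic row above, single-byte spaces shift the alignment, so 2-byte characters can be split as well.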

Specifications that limit the length of a string MUST specify which type of unit (extended grapheme clusters, Unicode code points, or code units) the length limit uses.

Specifications that limit the length of a string SHOULD specify the length in terms of Unicode code points.

If a specification sets a length limit in code units (such as bytes), it MUST specify that truncation can only occur on code point boundaries.
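
One way to satisfy this requirement for UTF-8 is sketched below. The function name `truncate_utf8` is hypothetical, and the sketch assumes well-formed input. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a cut position can be moved back to the nearest code point boundary by inspecting at most a few bytes.

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Return at most `max_bytes` bytes of UTF-8 without splitting a code point."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    end = max(max_bytes, 0)
    # raw[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a code point,
    # so back up until the cut sits just before a lead byte.
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end]


# A 40-byte limit applied to 3-byte characters keeps 13 whole characters
# (39 bytes) instead of leaving a dangling lead byte.
print(len(truncate_utf8("\u4e2d" * 50, 40)))  # 39
```

The result is always valid UTF-8, at the cost of sometimes returning up to three bytes fewer than the limit allows.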

Note that this best practice applies equally to specifications based on UTF-16, which uses 16-bit code units, not just to multibyte encodings such as UTF-8.

Specifications or APIs that interact with the [DOM] need to contend with the fact that character data, including operations such as length, substringData, insertData, deleteData, and so forth, is specified using UTF-16 code units, not Unicode code points. This can lead to inappropriate mid-character (code point) truncation. Specifications that reference the DOM should specify that string operations not split code points and, where appropriate, avoid starting or ending inside grapheme clusters. Specifications should also include a health warning for implementers and users.

Example warning. Modify this health warning as appropriate for your specification:

Warning

Arbitrary index values in the DOM may not fall on character or grapheme boundaries. Implementations and users should avoid incorrectly starting or ending operations in the middle of a user-perceived character sequence.
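
The difference between UTF-16 code units and code points can be sketched outside the DOM as follows. This is an illustration in Python, not DOM code. The helper names `utf16_length` and `utf16_prefix` are hypothetical, and the input is assumed to be well-formed (no lone surrogates). A supplementary character occupies two code units (a surrogate pair), so a cut between them must back off by one unit.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what a DOM-style length counts."""
    return len(text.encode("utf-16-le")) // 2


def utf16_prefix(text: str, count: int) -> str:
    """First `count` UTF-16 code units, never splitting a surrogate pair."""
    units = text.encode("utf-16-le")
    end = max(count, 0) * 2
    if end >= len(units):
        return text
    if end >= 2:
        last = int.from_bytes(units[end - 2:end], "little")
        # A high (lead) surrogate just before the cut means its low
        # surrogate would be dropped, so drop the lead surrogate too.
        if 0xD800 <= last <= 0xDBFF:
            end -= 2
    return units[:end].decode("utf-16-le")


s = "a\U0001F600b"             # 3 code points
print(len(s))                  # 3
print(utf16_length(s))         # 4 (the emoji needs a surrogate pair)
print(utf16_prefix(s, 2))      # a
```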

Specifications that limit the length of a string SHOULD require truncation on grapheme boundaries, as truncation in the midst of a combining or joining sequence can alter the meaning of the string.

If a specification specifies a length limit, it SHOULD specify that any string that is truncated include an indicator, such as ellipses, that the string has been altered.
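
The two recommendations above can be sketched together. Python's standard library has no grapheme cluster segmenter, so the following is only a rough approximation and not an implementation of Unicode extended grapheme clusters: it avoids cutting before a combining mark or next to a ZERO WIDTH JOINER, and ignores other cases such as emoji modifiers, regional indicators, and Hangul jamo. The function names are hypothetical, and the choice of U+2026 as the indicator is an assumption.

```python
import unicodedata

ZWJ = "\u200d"
ELLIPSIS = "\u2026"


def attaches_to_previous(ch: str) -> bool:
    """Rough test: combining marks and ZWJ extend the preceding character."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me") or ch == ZWJ


def truncate_with_indicator(text: str, max_code_points: int) -> str:
    """Truncate to at most `max_code_points`, marking the cut with an ellipsis."""
    if len(text) <= max_code_points:
        return text
    if max_code_points < 1:
        return ""
    end = max_code_points - 1  # reserve one code point for the ellipsis
    # Back up while the cut would separate a base character from a mark
    # that follows it, or would fall immediately after a ZWJ.
    while end > 0 and (attaches_to_previous(text[end]) or text[end - 1] == ZWJ):
        end -= 1
    return text[:end] + ELLIPSIS


# "e" followed by U+0301 COMBINING ACUTE ACCENT forms one user-perceived
# character; a plain cut after 4 code points would keep "e" but lose the accent.
print(truncate_with_indicator("cafe\u0301 au lait", 5))  # caf…
```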

When specifying a length limitation in code units (such as bytes), specifications SHOULD set the maximum length in a way that accommodates users whose language requires multibyte code unit sequences.

Some specifications need to define how file names or file paths are constructed by various implementations. One challenge is building definitions that work consistently across the different file systems used by different operating systems. This section contains general guidance for defining restrictions on file names or file paths. It is based on requirements developed in [EPUB-33], as well as implementation experience.

Specify the UTF-8 [Unicode] encoding for the storage and processing of file names and file paths.

File names SHOULD be restricted to 255 bytes in length.

This restriction is related to limitations found in certain file systems (originally MS-DOS, but also certain Unix file systems) as well as in packaging schemes such as PKZIP that depend on these file systems or subsumed their limitations. In these systems, each "path element" (including directory names) is limited to 255 bytes.

Path names SHOULD be restricted to 65535 bytes in length.

This restriction is related to limitations found in file systems such as FAT32 or NTFS, which restrict the path length to 32,760 (32K) code units in the UTF-16 character encoding. Each UTF-16 code unit takes 16 bits (or 2 bytes), making the limit approximately 65,535 (64K) when measured in bytes. Note that a path name limited to 64K bytes in UTF-8 can still exceed the path length limits on these file systems, since UTF-8 is a variable-width encoding: an ASCII character takes one byte in UTF-8 but a full 16-bit code unit in UTF-16, so a path made of ASCII characters reaches the UTF-16 limit well before it reaches 64K bytes.
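
The two length recommendations, and the UTF-8 versus UTF-16 caveat, can be checked with a sketch like the following. The function name `check_lengths`, the returned field names, and the use of "/" as the path separator are assumptions made for this illustration.

```python
MAX_NAME_BYTES = 255     # per path element, measured in UTF-8 bytes
MAX_PATH_BYTES = 65535   # whole path, measured in UTF-8 bytes


def check_lengths(path: str) -> dict:
    """Measure a '/'-separated path against the limits above (illustrative)."""
    elements = [e for e in path.split("/") if e]
    utf8_bytes = len(path.encode("utf-8"))
    return {
        "names_ok": all(len(e.encode("utf-8")) <= MAX_NAME_BYTES for e in elements),
        "path_ok": utf8_bytes <= MAX_PATH_BYTES,
        "utf8_bytes": utf8_bytes,
        "utf16_units": len(path.encode("utf-16-le")) // 2,
    }


# 128 two-byte characters are only 128 code points but 256 bytes in UTF-8.
print(check_lengths("dir/" + "\u00e9" * 128)["names_ok"])  # False

# An ASCII path of 40,399 characters fits in 64K UTF-8 bytes, yet needs
# 40,399 UTF-16 code units, more than the 32,760 mentioned above.
long_path = "/".join(["a" * 100] * 400)
report = check_lengths(long_path)
print(report["path_ok"], report["utf16_units"])  # True 40399
```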

File name and path name definitions MUST NOT use the following Unicode code points.

These characters are known to cause interoperability problems with various file systems. Specifications and implementations should use an abundance of caution in their file naming when interoperability of content is key. The list of restricted characters is intended to help avoid some known problem areas, but it does not ensure that all other Unicode characters are supported.

" [U+0022 QUOTATION MARK]
* [U+002A ASTERISK]
/ [U+002F SOLIDUS]
: [U+003A COLON]
< [U+003C LESS-THAN SIGN]
> [U+003E GREATER-THAN SIGN]
\ [U+005C REVERSE SOLIDUS]
| [U+007C VERTICAL LINE]
U+007F DEL
U+E0001 LANGUAGE TAG
U+E007F CANCEL TAG
Code points in the following ranges:
- C0 Controls [U+0000...U+001F]
- C1 Controls [U+0080...U+009F]
- Private Use [U+E000...U+F8FF]
- Specials [U+FFF0...U+FFFF]
- Supplementary Private Use [U+F0000...U+FFFFF]
- Supplementary Private Use [U+100000...U+10FFFF]
. [U+002E FULL STOP] as the last character (Note that this includes the file names . and .., which have special meaning to many file systems)
All Unicode non-character code points, specifically:
- The 32 contiguous non-character code points in the Basic Multilingual Plane (U+FDD0 … U+FDEF)
- The last two code points of the Basic Multilingual Plane (U+FFFE and U+FFFF)
- The last two code points of each of the Supplementary Planes (U+1FFFE, U+1FFFF … U+EFFFE, U+EFFFF)
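
The list above can be turned into a simple check for a single file name (one path element). The following is a minimal sketch. The names `FORBIDDEN_SINGLE`, `FORBIDDEN_RANGES`, `is_noncharacter`, and `file_name_problems` are hypothetical, and passing this check does not guarantee that a name is supported by every file system.

```python
# Individually listed code points.
FORBIDDEN_SINGLE = set('"*/:<>\\|') | {"\u007f", "\U000E0001", "\U000E007F"}

# Inclusive ranges from the list above.
FORBIDDEN_RANGES = [
    (0x0000, 0x001F),      # C0 Controls
    (0x0080, 0x009F),      # C1 Controls
    (0xE000, 0xF8FF),      # Private Use
    (0xFFF0, 0xFFFF),      # Specials
    (0xF0000, 0xFFFFF),    # Supplementary Private Use
    (0x100000, 0x10FFFF),  # Supplementary Private Use
]


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def file_name_problems(name: str) -> list:
    """Return the restricted code points found in `name`, as U+XXXX strings."""
    problems = []
    for ch in name:
        cp = ord(ch)
        in_range = any(lo <= cp <= hi for lo, hi in FORBIDDEN_RANGES)
        if ch in FORBIDDEN_SINGLE or in_range or is_noncharacter(cp):
            problems.append(f"U+{cp:04X}")
    if name.endswith("."):
        # Covers "." and ".." as well as names such as "report."
        problems.append("trailing U+002E")
    return problems


print(file_name_problems("report.txt"))  # []
print(file_name_problems("a:b"))         # ['U+003A']
print(file_name_problems(".."))          # ['trailing U+002E']
```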

Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.

explanations & examples

Units of collation, C006, in Character Model for the World Wide Web: Fundamentals

Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user.

explanations & examples

Units of collation, C007, in Character Model for the World Wide Web: Fundamentals
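
Why the relevant language matters can be shown with a deliberately tiny sketch. It is a toy, not a collation implementation: real software would normally rely on a collation library with language-specific tailorings. The key functions below are hypothetical and encode only two well-known conventions, namely that German dictionary order sorts "ä" together with "a", while Swedish treats "å", "ä", and "ö" as separate letters that follow "z".

```python
import unicodedata

# Toy tailoring: map the three extra Swedish letters to ASCII characters
# that sort immediately after "z" ("{", "|", "}").
SWEDISH_AFTER_Z = {"\u00e5": "{", "\u00e4": "|", "\u00f6": "}"}


def german_key(word: str):
    """Primary key ignores diacritics; the word itself breaks ties."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    primary = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (primary, word)


def swedish_key(word: str) -> str:
    """Sort a-ring, a-umlaut, and o-umlaut after z (toy tailoring)."""
    composed = unicodedata.normalize("NFC", word.casefold())
    return "".join(SWEDISH_AFTER_Z.get(c, c) for c in composed)


words = ["zon", "\u00e4rm", "arm"]
print(sorted(words, key=german_key))   # ['arm', 'ärm', 'zon']
print(sorted(words, key=swedish_key))  # ['arm', 'zon', 'ärm']
```

The same list yields a different order depending on which user's conventions are applied, which is why the language should follow the current user rather than being fixed by the content or the server.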

Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering.

explanations & examples

Units of collation, C066, in Character Model for the World Wide Web: Fundamentals

Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode.

explanations & examples

Units of collation, C008, in Character Model for the World Wide Web: Fundamentals
