Which language tag is right for me? How do I choose language and other subtags?
In HTML and XML documents a language tag is used to indicate the language of content.
A language tag is composed of one or more subtags separated by hyphens. Subtags can be of various types.
Language tag syntax is defined by the IETF's BCP 47. BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 5646, Tags for the Identification of Languages, which obsoletes the older RFCs 4646, 3066 and 1766.
In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry, which is described below.
This article provides advice on how to choose the components of a language tag. For an overview of the concepts defined in BCP 47, see Language tags in HTML and XML.
All the subtags you will need to create a language tag are found in one place, the IANA Language Subtag Registry. The registry is a long text file, containing nearly 8,000 entries.
The first (and often only) subtag in a language tag always designates a language. It is referred to in BCP 47 as the primary language subtag. We will use that term in this document to refer to the subtag that represents a language, to more clearly make the distinction from 'language tag', which refers to the whole thing.
To find a primary-language subtag, search the page for the name of that language. For example, if you want to label something as French, searching for 'French' in the registry will bring you to a record that looks like this:
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Your search will have matched against the Description field. Check that the type of this record is language. What you are looking for is the value in the Subtag field, ie. fr.
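The lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete registry parser: the function names are invented for this example, the sample text is just the French record shown above, and folded continuation lines (which the real registry file uses for long values) are not handled.

```python
def parse_records(registry_text):
    """Split registry text on '%%' separators and return one dict per record."""
    records = []
    for chunk in registry_text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            if ":" not in line:
                continue  # assumption: skip anything that is not 'Field: value'
            key, _, value = line.partition(":")
            # Fields such as Description or Prefix can repeat, so keep lists.
            record.setdefault(key.strip(), []).append(value.strip())
        if record:
            records.append(record)
    return records


def find_language_subtag(records, name):
    """Return the Subtag of the first language record described as `name`."""
    for record in records:
        is_language = record.get("Type") == ["language"]
        if is_language and name in record.get("Description", []):
            return record["Subtag"][0]
    return None


SAMPLE = """%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%"""

print(find_language_subtag(parse_records(SAMPLE), "French"))
```

Checking the Type field matters because a description can also match records of other types, such as regions or variants.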
The rest of this article will provide advice for choosing primary language subtags and, where needed, other types of subtag. Note that not all the decisions about how to create a language tag are straightforward. There are circumstances where usage will dictate which of various possibilities you should follow.
There are tools available which provide additional help while searching the registry, such as the Language Subtag Lookup tool.
Think about letter-case. By convention, primary language subtags are lowercase, script subtags begin with an uppercase letter, and continue with lowercase, and region subtags are uppercase. This is only a convention, however, and you are free to use whatever letter-casing you like.
On the other hand, you may be using language tags in a context where letter-case is important, such as file names on some systems. In such cases, you should ensure that you follow a consistent policy for letter-case; for any new system that is not case-insensitive, it is recommended that you follow the BCP 47 conventions.
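If you do need a consistent letter-case policy, the conventions can be applied mechanically. The sketch below is one possible approach, not a reference implementation; the function name is hypothetical. It lowercases everything, then uppercases 2-letter subtags and title-cases 4-letter subtags, except at the start of the tag or after a single-character subtag.

```python
def conventional_case(tag):
    """Return `tag` with the conventional letter-case applied to each subtag."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and not after_singleton and len(subtag) == 2:
            result.append(subtag.upper())        # region, eg. CN
        elif index > 0 and not after_singleton and len(subtag) == 4:
            result.append(subtag.capitalize())   # script, eg. Latn
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            # Everything after a singleton (extension or private use) stays lowercase.
            after_singleton = True
    return "-".join(result)


print(conventional_case("ZH-hant-tw"))
print(conventional_case("EN-us-X-TWAIN"))
```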
You always start by choosing a primary language subtag, and often this is all you'll need for your language tag.
Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used.
When looking for a primary language subtag, there are a number of things to bear in mind.
You could look up language information in the SIL Ethnologue and cross-reference that information with Wikipedia. The Ethnologue uses the same three-letter codes as BCP 47, but you'll need to convert BCP 47 two-letter codes to their ISO 639-3 counterparts to look up a language by code. (The Language Subtag Lookup tool does this for you.)
There are a small number of cases where different language codes are available for what many people would regard as the same language, eg. Filipino and Tagalog, or Twi and Akan. There is no indication in the registry as to which you should use, but you should try to ensure that within a single application or context you are consistent.
If the subtag record contains the field Scope: collection, this subtag represents a group of languages that are descended from a common ancestor, are spoken in the same geographical area, or are otherwise related.
You should look for a more specific subtag for the language you are interested in. Unfortunately, the subtag registry doesn't provide any pointers for this.
You can use these subtags if there is no more specific subtag available, and it is always preferable to use one of these rather than the subtags mul (multiple languages) or und (undetermined).
Some language subtag records have a Scope field set to macrolanguage, ie. this primary language subtag encompasses a number of more specific primary language subtags in the registry.
For example, ku (Kurdish) is a macrolanguage that encompasses ckb (Central Kurdish), kmr (Northern Kurdish), and sdh (Southern Kurdish).
You can find the more specific (ie. the encompassed) subtags by searching the registry for Macrolanguage: <subtag_name>. Alternatively, the Language Subtag Lookup tool will automatically list these for a given macrolanguage.
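The registry search for encompassed subtags might look like the following sketch. It assumes the registry records have already been loaded into dictionaries; the sample list is a hand-written excerpt covering only the Kurdish example above, and the function name is hypothetical.

```python
# Hand-written excerpt for illustration; the real registry has many more records.
SAMPLE_RECORDS = [
    {"Type": "language", "Subtag": "ku", "Description": "Kurdish",
     "Scope": "macrolanguage"},
    {"Type": "language", "Subtag": "ckb", "Description": "Central Kurdish",
     "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "kmr", "Description": "Northern Kurdish",
     "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "sdh", "Description": "Southern Kurdish",
     "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "fr", "Description": "French"},
]


def encompassed_subtags(records, macrolanguage):
    """List the subtags whose Macrolanguage field names `macrolanguage`."""
    return sorted(
        record["Subtag"]
        for record in records
        if record.get("Macrolanguage") == macrolanguage
    )


print(encompassed_subtags(SAMPLE_RECORDS, "ku"))
```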
As we recommended for the collection subtags mentioned above, in most cases you should try to use the more specific subtags, but there are a small number of important exceptions. These are situations where you should continue using a macrolanguage subtag for reasons of backward compatibility.
For example, although BCP 47 explains that zh (the macrolanguage subtag for Chinese) doesn't actually specify which of the many, sometimes mutually unintelligible, dialects of Chinese is actually meant by this subtag, in practice convention overwhelmingly associates the macrolanguage subtag with the predominant language among the encompassed subtags - in this case, cmn (Mandarin Chinese). If your application identified Mandarin Chinese in the past using the language tag zh-CN (Chinese as used in Mainland China), or even just zh, you can continue to use zh in this way. Using cmn or cmn-CN may cause serious compatibility problems if the software or users expect a tag such as zh.
If, on the other hand, you are using zh to refer to another Chinese dialect such as Hakka, you should use the language subtag hak instead.
If the record contains a Deprecated field you shouldn't use this subtag. Usually the registry will indicate which alternative you should use in the Preferred-Value field. For example, the subtag record for iw (Hebrew) contains the two following fields:
Deprecated: 1989-01-01
Preferred-Value: he
This indicates that you should use the subtag he for Hebrew instead.
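As a rough sketch, a deprecation check over a parsed record could look like this. The record dictionaries and the function name are assumptions made for illustration; the field values are the ones quoted in this article.

```python
def preferred_subtag(record):
    """Return the subtag to use for `record`, following Preferred-Value.

    Returns None when the record is deprecated but names no replacement,
    in which case the Comments field may contain advice to read manually.
    """
    if "Deprecated" not in record:
        return record["Subtag"]
    return record.get("Preferred-Value")


hebrew_old = {"Type": "language", "Subtag": "iw", "Description": "Hebrew",
              "Deprecated": "1989-01-01", "Preferred-Value": "he"}
french = {"Type": "language", "Subtag": "fr", "Description": "French"}

print(preferred_subtag(hebrew_old))
print(preferred_subtag(french))
```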
In the past, when dealing with lists of ISO codes, there were sometimes multiple codes for a given language - there could be a 2-letter code and one or two 3-letter codes. This ambiguity is resolved by the IANA Subtag Registry: only one code is listed per language. (If an ISO 2-letter code exists, that will be the code, otherwise it will be a three-letter code.) The registry maintainer also coordinates the ongoing evolution of the registry with developments in the ISO world.
The BCP 47 specification allows for an additional, 3-letter subtag immediately after the initial primary language subtag. This is called an extended language subtag (abbreviated to extlang). Only a relatively small number of extended language subtags are defined, and they each need to be used with a specific primary language subtag (given in the Prefix field of the entry for the extended language subtag in the registry).
Currently only seven primary language subtags can be used with extended language subtags. Six of those have a Scope field set to macrolanguage in the registry (ar, kok, ms, sw, uz, and zh), and the other is sgn.
Consider the following:
Where possible, use a single language subtag, rather than the language+extlang pair.
There is always a 3-letter subtag that is equivalent to any language+extlang pairing, and it is always the same as the extlang subtag. For example, zh-yue (Cantonese Chinese) can also be expressed with the single subtag yue.
The only significant exception is where the language+extlang sequence is established practice for the system you are working with; that is, where zh-yue would be preferred rather than yue to maintain backwards compatibility.
In some cases the macrolanguage subtag itself is the better choice. For example, ar (the Arabic macrolanguage subtag) may be more appropriate for Standard Arabic than arb (the more specific, encompassed subtag that means Standard Arabic).
Similarly, when dealing with the predominant language in the set, it is generally better for backwards compatibility if you replace the language+extlang sequence by just dropping the extlang, rather than using the extlang code as a primary language subtag. For example, reducing ms-zsm to ms (Malay macrolanguage subtag) may sometimes be better than replacing it with zsm (Standard Malay).
As an example of usage, Unicode's CLDR database uses the macrolanguage subtags zh to represent Mandarin Chinese and ku to represent Kurdish. Thus for Mandarin Chinese you would use zh, not cmn, and for Northern Kurdish you would use ku-Latn, not kmr-Latn. The CLDR database, however, does not use extended language subtags, so you would need to use yue for Cantonese, not zh-yue.
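The general rule of preferring a single language subtag over a language+extlang pair can be sketched as follows. The prefix table is a small hand-copied excerpt, and the function name is hypothetical; a real implementation would read the Prefix fields of the extlang records in the registry.

```python
# Excerpt only: extlang subtag -> the primary language subtag in its Prefix field.
EXTLANG_PREFIXES = {"yue": "zh", "cmn": "zh", "arb": "ar", "zsm": "ms"}


def collapse_extlang(tag):
    """Rewrite a language+extlang tag so the extlang becomes the primary subtag."""
    parts = tag.split("-")
    if len(parts) >= 2 and EXTLANG_PREFIXES.get(parts[1].lower()) == parts[0].lower():
        return "-".join(parts[1:])
    return tag  # no recognised extlang, leave the tag untouched


print(collapse_extlang("zh-yue"))
print(collapse_extlang("zh-yue-HK"))
print(collapse_extlang("zh-CN"))
```

This sketch deliberately ignores the backwards-compatibility exceptions discussed above, where keeping the pair or dropping the extlang instead is the better choice; those decisions depend on the system you are working with.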
Script subtags should only be used as part of a language tag when the script adds some useful distinguishing information to the tag. Usually this is because a language is written in more than one script or because the content has been transcribed into a script that is unusual to the language (so one might tag Russian transcribed into the Latin script with a tag such as ru-Latn).
Script subtags are always 4 letters, and must come after any language or extended language subtag, but before any other subtags.
Here are things to look out for when choosing a script subtag.
For example, text in Uzbek written in the Arabic script could be tagged uz-Arab, but the Arab script subtag would not be relevant for an audio track.
The script subtag Zxxx could be used for non-written content, eg. uz-Zxxx, as Zxxx is the 'Code for unwritten documents', but again this is only useful if such a distinction has to be made clear.
Some language subtag records have a Suppress-Script field set to a given script subtag. For example, the entry in the registry for en (English) contains:
Suppress-Script: Latn
meaning that you should not use the Latn (Latin) script subtag with this language.
This is because nearly all English documents are written in the Latin script and it adds no distinguishing information. However, if a document were written in English mixing Latin script with another script such as Braille (Brai), then it might be appropriate to indicate both scripts to aid in content selection (eg. for the application of style rules).
Note, however, that not all language subtags that are strongly associated with a given script have suppress-script fields. You should not assume that you need to use a script if a suppress-script field is absent.
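A tool that builds tags could apply the Suppress-Script rule automatically. The following is a minimal sketch under two stated assumptions: the table holds only the two Suppress-Script values quoted in this article, and the script subtag is expected directly after the primary language subtag (tags with an extended language subtag are not handled). The function name is hypothetical.

```python
# Excerpt only: primary language subtag -> value of its Suppress-Script field.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn"}


def drop_redundant_script(tag):
    """Remove the script subtag when the registry says to suppress it."""
    parts = tag.split("-")
    has_script = len(parts) >= 2 and len(parts[1]) == 4 and parts[1].isalpha()
    if has_script:
        suppressed = SUPPRESS_SCRIPT.get(parts[0].lower(), "")
        if suppressed.lower() == parts[1].lower():
            del parts[1]
    return "-".join(parts)


print(drop_redundant_script("en-Latn-GB"))
print(drop_redundant_script("en-Brai"))
print(drop_redundant_script("ru-Latn"))
```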
Region subtags associate the language subtag you have chosen with a particular region of the world. Region subtags must come after any language or script subtags.
Like script subtags, you should only use a region subtag if it contributes information needed in a particular context to distinguish this language tag from another one; otherwise leave it out.
For example, en-GB might be a useful distinction for spell-checking, but the region subtag in ja-JP is unlikely to be useful unless you are intentionally contrasting it with Japanese spoken in other parts of the world.
There are two types of region subtag: 2-letter codes and 3-digit codes. The latter tend to identify multinational regions, rather than specific countries. For example, es-ES means Spanish as spoken in Spain, whereas es-419 means Spanish as spoken in Latin America.
Avoid deprecated subtags.
Check that the subtag you intend to use isn't deprecated. In the same way as for other types of subtag, the registry will normally tell you what the replacement should be via the Preferred-Value field.
In some cases there is no Preferred-Value field in a deprecated record, but sometimes the Comments field contains advice. For example, under YU (Yugoslavia) you will find:
Deprecated: 2003-07-23
Comments: see BA, HR, ME, MK, RS, or SI
Again, only use variant subtags when there is a need to distinguish this language tag from another similar one in the context in which your content is used.
Variant subtags describe additional distinctions not captured by the other subtags. Typically these are dialects, written variations (such as spelling reforms), transcriptions, and the like. A variant subtag is usually five to eight characters long and can contain letters and/or digits. A few four-digit subtags (usually representing a year) are also registered. Variant subtags must come after any language, script, and region subtags.
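Because each type of subtag has a characteristic length and position, a tag can be split into its components by shape alone. The regular expression below is a simplified sketch based on the descriptions in this article: it allows at most one extended language subtag, and it does not handle extensions, private-use subtags or grandfathered tags, for which it returns None. The names are hypothetical.

```python
import re

TAG_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2,3})"                  # primary language subtag
    r"(?:-(?P<extlang>[a-z]{3}))?"                # optional extended language
    r"(?:-(?P<script>[a-z]{4}))?"                 # optional 4-letter script
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"        # optional 2-letter or 3-digit region
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.IGNORECASE,
)


def split_tag(tag):
    """Return a dict of the tag's components, or None if the shape is not recognised."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts


print(split_tag("zh-Latn-CN-pinyin"))
print(split_tag("es-419"))
```

Matching the shape of a tag says nothing about whether its subtags are registered; that still requires a lookup in the registry.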
The key thing to look out for when using variant subtags is the order in which they are used.
Check the context and ordering for variant subtags.
Most variant subtag records in the registry have one or more Prefix fields. The prefixes indicate with which subtags it is usually appropriate to use this variant. For example, pinyin should generally be used in a language tag that also contains either the subtags zh and Latn or the subtags bo and Latn, since the entry for pinyin contains the following:
Prefix: zh-Latn
Prefix: bo-Latn
If you have a good reason, you could use a variant subtag with different subtags, eg. cmn-Latn-pinyin would be a perfectly legal way to say Mandarin Chinese written with pinyin.
Although zh, bo and Latn are specified, this is a minimum requirement. It is also possible to include other subtags, such as a region subtag, in the language tag (where appropriate), eg. zh-Latn-CN-pinyin.
Amongst other prefix fields, the entry for variant subtag 1994 contains
Prefix: sl-rozaj-biske
which indicates that it should be used in a language tag that already contains two other variant subtags, rozaj and biske. Any variant subtag specified in a prefix field should come before the variant you have just looked up.
There are some variant subtags that have no prefix field, eg. fonipa (International Phonetic Alphabet). Such variants should appear after any other variant subtags with prefix information.
If you plan to use more than one variant without a prefix, order them in terms of decreasing significance. If they are equally significant, order them alphabetically. This will aid interoperability.
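The prefix and ordering checks described above can be sketched as follows. The prefix table is a hand-written excerpt containing only the Prefix values quoted in this article (the real entry for 1994 has further prefixes), and the function names are hypothetical. Since the prefix is a minimum requirement, the check only asks that the prefix subtags appear, in order, somewhere before the variant.

```python
# Excerpt only: variant subtag -> list of its Prefix field values.
VARIANT_PREFIXES = {
    "pinyin": ["zh-Latn", "bo-Latn"],
    "1994": ["sl-rozaj-biske"],
    "fonipa": [],
}


def is_subsequence(needle, haystack):
    """True if the items of `needle` occur in `haystack` in the same order."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


def matches_registered_prefix(tag, variant):
    """Check that `variant` in `tag` is preceded by one of its registered prefixes."""
    subtags = tag.lower().split("-")
    variant = variant.lower()
    if variant not in subtags:
        return False
    before = subtags[: subtags.index(variant)]
    prefixes = VARIANT_PREFIXES.get(variant, [])
    if not prefixes:
        return True  # no Prefix field (or not in this excerpt): nothing to check
    return any(is_subsequence(p.lower().split("-"), before) for p in prefixes)


print(matches_registered_prefix("zh-Latn-CN-pinyin", "pinyin"))
print(matches_registered_prefix("sl-1994-rozaj-biske", "1994"))
```

A False result is a prompt to look again rather than an error: as noted above, a tag such as cmn-Latn-pinyin falls outside the registered prefixes but is still legal.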
Extensions are introduced by single-character subtags and allow additional information to be added to the language tag. To date, two extensions have been registered. The subtag u was registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange. (The second extension, t, is defined in RFC 6497 and identifies transformed content, such as transliterated text.)
For example, the following indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.
de-DE-u-co-phonebk
The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.
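To show how the subtags after u might be read, here is a rough sketch that pairs each 2-character key with the values that follow it. That key/value layout is an assumption drawn from the Unicode locale extension as described in CLDR, not from this article, and the function name is hypothetical. The sketch also assumes the tag has no private-use section containing a u subtag.

```python
def unicode_keywords(tag):
    """Return the key/value pairs found in the u- extension of `tag`."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a different extension or private use
        if len(subtag) == 2:
            key = subtag           # assumption: 2-character subtags are keys
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


print(unicode_keywords("de-DE-u-co-phonebk"))
```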
Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement between the parties that use them. They are introduced by a single-letter subtag, or 'singleton'. The singleton for private use is x. Note that each subtag after the singleton can be at most 8 characters long, though you can use multiple subtags.
Private-use subtags should be used with great care, and avoided whenever possible, since they interfere with the interoperability that BCP 47 exists to promote.
For example, the tag en-US-x-twain uses the private-use subtag twain, which may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.
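The length rule for private-use subtags is easy to check mechanically. The sketch below uses hypothetical function names and checks only the shape of the private-use section (the singleton x followed by one or more subtags of 1 to 8 letters or digits); it cannot say anything about what the subtags mean.

```python
import re

# 'x' followed by one or more subtags of 1 to 8 alphanumeric characters.
PRIVATE_USE = re.compile(r"^x(?:-[a-z0-9]{1,8})+$", re.IGNORECASE)


def private_use_part(tag):
    """Return the private-use section of `tag` (from 'x' onwards), or None."""
    subtags = tag.split("-")
    lowered = [subtag.lower() for subtag in subtags]
    if "x" not in lowered:
        return None
    return "-".join(subtags[lowered.index("x"):])


def is_valid_private_use(part):
    """Check the shape of a private-use section such as 'x-twain'."""
    return PRIVATE_USE.match(part) is not None


print(private_use_part("en-US-x-twain"))
print(is_valid_private_use("x-twain"))
print(is_valid_private_use("x-toolongsubtag"))
```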
Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.
Nearly all grandfathered tags have been superseded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the entry in the registry for the grandfathered tag art-lojban indicates that you should use the jbo language subtag instead.
Note that you should not use additional subtags with a grandfathered tag.