When should I use xml:lang
and when should I define my own element or attribute for passing
language values in an XML document schema (DTD)?
Sometimes documents contain or include different types of natural language content. Other times they need to store a natural
language value as data or metadata about something external to the document. Because these different applications use similar formats, schema
designers are sometimes confused about when they should use xml:lang
and when to define their own language-related element or attribute.
For example, in XHTML 1.0, there is an hreflang
attribute and also an xml:lang
(or lang
attribute, in the case of HTML) for the content of the a
element:
<a xml:lang="en" href="xyz" hreflang="de">Click for German</a>
The xml:lang
attribute describes the language contained by the a
element ("Click for
German"), while the hreflang
attribute is metadata, in this case describing the language of some content external to this
Web page.
xml:lang
Content directly associated with the XML document (either contained within the document directly or considered part of the document
when it is processed or rendered) should use the xml:lang
attribute to indicate the language of that content. xml:lang
should be reserved for content
authors to directly label any natural language content they may have.
xml:lang
is defined by XML 1.0 as a common attribute that can be used to indicate the language of any
element's contents. This includes any human readable text, as well as other content (such as embedded objects like images or sound files) contained
by the element in which it appears. The xml:lang
value applies to any sub-elements contained by the element. It also applies
to attribute values associated with the element and sub-elements (though using natural language in attributes is not best practice). The
value of the xml:lang
attribute is a language tag defined by BCP 47.
For example, here is xml:lang
on an element t
:
<t xml:lang="en">
This is some text contained by the 't' element. The use
of the xml:lang attribute indicates the language so that, for
example, the correct font could be applied when rendered or
the correct spell-checker could be used when proofing the
document. If we didn't have xml:lang, we might have problems
with embedded content, such as the phrase <span xml:lang="fr">
C'est la vie</span>, which is in another language.
</t>
This example shows how xml:lang
applies to an attribute:
<para>Il faut utiliser <abbr title="Simple Object Access Protocol"
xml:lang="en">SOAP</abbr></para>
When the language value is really an attribute of or metadata about some external content, then xml:lang
is
not an appropriate choice. In these cases you want to store language information, but the language doesn't refer to the content of the XML document
(or included content, such as images, which are processed as part of the document) directly. In this case you should define an element or attribute
with a different name and not use the xml:lang
attribute. The value of the element or attribute should use BCP 47, just like xml:lang
.
Some examples of this might include:
a
in XHTML) pointing to a version of this document in another
language
The reason you would choose to create your own element (or attribute) is to convey the language as a value (as part of a data
structure or as metadata about an external document) rather than to indicate the language of a specific piece of content. Avoiding the use of xml:lang
to describe external language values avoids creating problems for content authors who need to label content for
text-processing purposes.
For example, an XML document might look like this:
<item type="DVD">
  <title xml:lang="fr">Cyrano de Bergerac</title>  <!-- indicates the language of the film title -->
  <runningTime value="137" />                      <!-- not language affected -->
  <dialogue>en</dialogue>                          <!-- indicates the language of the dialogue -->
  <subtitles track="1" language="zh-Hant" />       <!-- this track contains Traditional Chinese subtitles -->
  <subtitles track="2" language="zh-Hans" />       <!-- this track contains Simplified Chinese subtitles -->
</item>
In this example, the xml:lang
attribute conveys information about the natural language of text appearing in
this document. The dialogue
element and the language
attribute of the subtitles
element are defined in the XML document schema and convey a natural
language value associated with these items. For example, the language attribute on the first subtitles element conveys that the subtitles on Track #1 are written or displayed in
Traditional Chinese (zh-Hant
).
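The distinction can be sketched from the consumer's side. The following is a minimal illustration, assuming Python's standard xml.etree.ElementTree parser and the sample item document shown above; the variable names are hypothetical and not part of any schema. It reads xml:lang as a label on content, and reads dialogue and language as ordinary data values.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared "xml" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ITEM = """
<item type="DVD">
  <title xml:lang="fr">Cyrano de Bergerac</title>
  <runningTime value="137" />
  <dialogue>en</dialogue>
  <subtitles track="1" language="zh-Hant" />
  <subtitles track="2" language="zh-Hans" />
</item>
"""

item = ET.fromstring(ITEM)

# xml:lang labels the language of text held in this document.
title = item.find("title")
title_language = title.get(XML_LANG)

# The schema-defined element and attribute carry language values as data
# about something outside the document (the film itself).
dialogue_language = item.findtext("dialogue")
subtitle_languages = {
    s.get("track"): s.get("language") for s in item.findall("subtitles")
}

print(title.text, title_language)
print(dialogue_language, subtitle_languages)
```

A text-processing tool such as a spell-checker would consult only title_language here, while an application that lists the available subtitle tracks would consult only subtitle_languages.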
It's important to remember that xml:lang
has scope: lower-level elements inherit the language attribute. This can
be used to identify the language for a lot of content (without having redundant language tags on every element). For example, it is good practice to
put xml:lang
into your html
element at the start of an XHTML document and only reuse it where the
language of the text changes. For more information, see the article Language tags
in HTML and XML.
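To illustrate the scoping rule, here is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser. The helper effective_languages is a hypothetical name introduced for this example. It walks the tree and passes the nearest ancestor's xml:lang value down to elements that do not declare their own.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared "xml" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element, language) pairs, applying xml:lang inheritance."""
    # An element's own xml:lang wins; otherwise it inherits from its parent.
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from effective_languages(child, lang)


doc = ET.fromstring(
    """<t xml:lang="en">Some text with <span xml:lang="fr">C'est la vie</span>
    and <em>emphasis</em>.</t>"""
)

langs = {el.tag: lang for el, lang in effective_languages(doc)}
print(langs)  # {'t': 'en', 'span': 'fr', 'em': 'en'}
```

Only the span element overrides the language; the em element carries no xml:lang of its own and takes "en" from its parent.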
Applying xml:lang
to an attribute is problematic, because there is no way to:
identify more than one language in the title
attribute
<p title="French (français)">Bonjour</p>
separate the language used in the attribute from that used in the element.
<a title="anglais" href="qa-when-xmllang.en.html" lang="en"
xml:lang="en">English</a>
Note that the three schema languages (XML DTD, XML Schema, and RELAX NG) differ in whether a user has to
define xml:lang
before using it as an attribute. Specifically:
XML DTDs require that any element that uses xml:lang
as an attribute must declare it in the DTD
XML Schema requires that the xml namespace be declared and imported before using xml:lang
(and other xml namespace values)
RELAX NG predeclares the xml namespace, as in XML, so no additional declaration is needed.
BCP 47, Tags for Identifying Languages: Specifies how to use language tags in xml:lang
values.
Language tags in HTML and XML: Describes how to use language tags.