Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content 1.0

1 Introduction

1.1 Who should use this document

All HTML content authors working with XHTML 1.0, HTML 4.01, XHTML 1.1, CSS1, CSS2 and CSS3.

The term 'author' is used in the sense described by the HTML 4.01 specification, ie. as a person or program that writes or generates HTML documents.

This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant from the very start of development. Ignoring the advice in this document, or relegating it to a later phase in the development process, will only add unnecessary costs and resource issues at a later date.

It is assumed that readers of this document are proficient in developing HTML and XHTML pages - this document limits itself to providing advice specifically related to internationalization.

1.2 How to use this document

This document is one of several documents relating to the design of XHTML and HTML documents.

If you are new to this topic you may wish to read this document from end to end. It is, however, expected that this document will typically be used for reference purposes - the reader dipping in to a particular section to find out how to perform a specific task with internationalization in mind. In order to support this kind of usage, an effort has been made to make each technique stand alone, or to point to relevant cross-references. In some cases this leads to a small amount of repetitiveness.

Cross-references and additional resources are summarized at the end of each technique.

Editorial notes have been left in this version of the document. [Ed. note: These are marked like this].

Information is also available about the applicability of recommendations to user agents (see Section 1.5: User agent support).

An outline document is available that summarizes all the recommendations of this and its companion documents together. The outline is organized according to tasks that a developer of XHTML/HTML content may want to perform. When this material is used as a reference, it is recommended that the overview document is used as a starting point.

1.4 Technologies addressed

This document provides techniques for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS1, CSS2 and some parts of CSS3.

XHTML 1.0 can be served as XML (using MIME types application/xhtml+xml, application/xml or text/xml) or HTML (using the MIME type text/html).

It is very common for XHTML 1.0 to be served as HTML, following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors to produce valid XML code. HTML represented as valid XML code lends itself to processing with such things as scripting or XSLT, but is also well supported for display by most mainstream browsers. (XHTML served as application/xhtml+xml is not well supported for browser display at the moment.)

In this document we wish to reflect practical reality for content authors, so we cover XHTML served as text/html in the techniques. Indeed we encourage the use of XHTML, and all the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML.

For XHTML served as XML, this document limits its advice to documents served as application/xhtml+xml. Note that user agent support for XHTML served as XML is still patchy.

1.5 User agent support

We try to ground techniques with information about their applicability to particular user agents. User agents, in the current version of this document, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.) [Ed. note: Note that this version of the Working Draft is not yet completely up to date in this area.]

We have chosen a 'base version' for each of the user agents we are tracking. This base version represents a fairly recent, standards-compliant version of the browser, but nonetheless a version that we might expect many people to be using. Where a browser operates in both standards- and quirks-mode, standards-mode is assumed (ie. you should use a DOCTYPE statement).

The base versions considered for this version of the document include:

Internet Explorer 6 (Windows)
Firefox 1.0
Mozilla 1.4
Opera 7.0
Netscape Navigator 7.0
Safari 1.03
Internet Explorer 5.2 (Mac)

We will also assess the applicability of the techniques against the latest version of the user agent available at the time of publication. This will indicate progress made since the base versions. For this version of this document, that means:

Internet Explorer 6 (Windows)
Firefox 1.0
Mozilla 1.7.2
Opera 7.5.4
Netscape Navigator 7.1
Safari 1.2.2
Internet Explorer 5.2 (Mac)

Generally, the techniques described will be applicable for immediate use. However we may also recommend things that are not yet widely supported, but are described by the standards, and hopefully will be supported given a little time. Where issues of this kind exist, or other issues related to user agent support, these will be flagged by small graphics immediately after the technique summary:

A user agent name followed by the 'No issues' icon indicates that there are no issues with support for this technique.
A name followed by a different icon [not shown] indicates that there were issues surrounding implementation on the base version of the user agent, but not the latest version.
A third icon [not shown] indicates that there continue to be issues.

We will then generally describe the issues in the detailed text that follows.

Summaries for such techniques will also be worded more cautiously. For example, "Consider doing X".

If more testing is needed to ascertain whether there are issues with a particular user agent, the user agent name will be followed by a question mark.

Detailed information may also be provided from time to time about behavior of a user agent in another version than the base or current versions.

2 Why specify language?

Applications exist that can use information about the natural language of content to deliver to users the most relevant information, based on their language preferences. The more content is tagged and tagged correctly, the more useful and pervasive such applications will become.

Language information should be specified for the page as a whole, and wherever language changes within the page.

Applications for language information include authoring tools, translation tools, accessibility, font selection, page rendering, search, and scripting.

Some existing applications already require language information, such as voice browsers in the accessibility world. There are other areas where language information could still be better exploited. This may change in the future, particularly as the larger search engines take an increasing interest in language. However, we are currently faced with a circular problem. People who don't see the applications of language information do not provide it for their content, and language-related applications are slow to be deployed until this information is widely available. This cycle can be broken by content authors taking steps to declare language information. This is usually very easy to do right now, and carries no penalties.

3 Important concepts

3.1 Primary language

Primary language is metadata about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. It is not specific enough to indicate the language of a particular run of text in the document for text-processing - for example, in a way that would be needed for the application of text-to-speech, styling, automatic font assignment, etc. It typically describes the language of the intended audience of the document.

The primary language does not describe every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but the primary language of the page is German, ie. it is aimed at a German-speaking audience.

It is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two primary languages. (This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences).

There are also pages where the navigational information, including the page title, is in one language but the content of the page is in another. While this is not necessarily good practice, it does not change the fact that the primary language is usually that of the content (the language of the reader of the page) independent of the language at the top of the document source.

Primary language metadata is usually best declared outside the document in the HTTP Content-Language header, although there may be situations where an internal declaration using the Content-Language meta element may be appropriate (see Section 6: How to specify primary language metadata).

3.2 Text-processing language

When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.

This is much more specific than the primary language of a document.

The text-processing language is usually best declared using attributes on elements. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, eg. a French word in an English paragraph (see Section 5: How to declare the text-processing language).
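As a rough illustration of this inheritance, the sketch below assumes an XHTML 1.0 fragment served as text/html. The sentence itself is invented for illustration and is not taken from any specification.

```html
<!-- Sketch only: the wording of the sentence is illustrative. -->
<div lang="en" xml:lang="en">
  <!-- The p element declares nothing, so it inherits "en" from the div. -->
  <p>She described the decor as
    <!-- The span overrides the inherited value for its own content only. -->
    <span lang="fr" xml:lang="fr">chic</span>
    and left it at that.</p>
</div>
```

After the closing span tag, the rest of the paragraph is once again treated as English.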

3.3 Relationships with character encoding and directionality

'Character encoding' refers to the bytes that are used to represent characters in text. It is important to declare what encoding is being used for your document.

In some scripts, such as Arabic and Hebrew, text runs predominantly from right to left. Within that flow, numbers and text from other scripts run from left to right. It is important to adequately specify the intended 'directionality' of text in a document.

Language declarations in HTML and XHTML have nothing to do with character encoding or the direction of text.

Some people think that information about language can be inferred from the character encoding, but this is not true. There would have to be a one-to-one mapping between encoding and language for this to work, and there isn't. A single character encoding, such as ISO 8859-1 (Latin1), could encode both French and English, as well as a great many other languages. In addition, different character encodings can be used for a single language, eg. Arabic could be encoded with 'Windows-1256' or 'ISO 8859-6' or 'UTF-8'.

There are separate mechanisms for declaring character encoding and directionality in HTML and XHTML, and these ideas should not be confused with mechanisms for declaring language.
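The following fragment is a minimal sketch of how the three declarations sit side by side without depending on one another. It assumes XHTML 1.0 served as text/html and a UTF-8 encoded file; the sample words are chosen purely for illustration.

```html
<!-- 1. Character encoding: declared once for the document, in the head. -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<!-- 2. Language (lang, xml:lang) and 3. direction (dir) are separate attributes. -->
<p lang="ar" xml:lang="ar" dir="rtl">مرحبا</p>

<!-- The same encoding carries text in another language and direction. -->
<p lang="fr" xml:lang="fr" dir="ltr">Bonjour</p>
```

Changing any one of the three values leaves the other two untouched, which is why none of them can be inferred from another.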

Additional techniques documents at the W3C Internationalization site describe how to declare character encoding and text direction.

4 Mechanisms for declaring language in HTML

There are a number of places defined by the HTML and XHTML specifications where language can be declared. In this section we will simply show examples of the alternatives available. The rest of this document will discuss in detail which you should use, and when.

One method is to use the lang and xml:lang attributes on an element. To set the language of a whole document, you can use these attributes on the html tag.

Example 1:

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

Alternatively, you may find documents that provide language information using a Content-Language meta element.

Example 2:

<meta http-equiv="Content-Language" content="en"/>

Language information may also be found in the HTTP header that is sent with a document (see the last line in the following example of an HTTP header).

Example 3:

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=iso-8859-1
Content-Language: en

It is also worth noting that the Content-Language meta element and the HTTP header both support a list of values. The example below declares the primary languages of the document to be (in equal measure) German, French and Italian.

Example 4:

<meta http-equiv="Content-Language" content="de, fr, it"/>

It is not possible to declare the language of text in CSS declarations.
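CSS can, however, select content according to a language that has been declared in the markup. The sketch below assumes a user agent that supports the CSS2 :lang() pseudo-class (support may well vary among the user agents listed in Section 1.5); the style rule itself is an invented example.

```html
<head>
  <title>Language-based styling</title>
  <style type="text/css">
    /* Selects elements whose declared or inherited language is French. */
    /* The rule reads the language from the markup; it cannot set it. */
    :lang(fr) { font-style: italic; }
  </style>
</head>
<body>
  <p>The French for cat is <span lang="fr" xml:lang="fr">chat</span>.</p>
</body>
```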

This document addresses the question of which approach is the best in what situation.

5 How to declare the text-processing language

Technique 1: Always declare language in the html tag

Always declare the default text-processing language of the page, using the html tag, if there is a single primary language.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

How to: Declare the text-processing language of the page using the lang and/or xml:lang attributes on the html tag. Example 5 declares an HTML document to be in Canadian French:

Example 5:

<html lang="fr-CA">

For details of which language attribute to use, see Technique 4: Should I use the lang or xml:lang attribute?.

For details of how to use language codes, see Section 7: How to choose language values.

Discussion: Declaring the text-processing language in the html tag sets the default text-processing language for the whole document. It can be overridden for portions of the document as required. This is already important for applications such as accessibility and searching, but many other possible applications for this information may emerge over time.

For this reason you should try to always declare the text-processing language in the html tag. It is usually very easy to do when creating the content, but more difficult to retrofit when you want to take advantage of language-related features.

Note that language declarations in the HTTP header or the Content-Language meta tag should be used to describe the primary language, ie. metadata about the document as a whole, rather than the default text-processing language.

For a comparison of 'primary language' and 'text-processing language', see Section 3: Important concepts.

Most documents have one primary language, but where there are more it may not be appropriate to declare a single default text-processing language in the html tag. The relevance will depend on the structure used for the document. See Technique 2: html declarations for multilingual docs.
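For a document with a single primary language, a complete page might look roughly like the sketch below. It assumes XHTML 1.0 Strict served as text/html and a UTF-8 encoded file; the French text is invented for illustration.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- Default text-processing language for the whole page: Canadian French. -->
<html lang="fr-CA" xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <!-- The title inherits fr-CA because the declaration is on html. -->
  <title>Bienvenue</title>
</head>
<body>
  <!-- Nothing is declared here; the paragraph inherits fr-CA as well. -->
  <p>Bienvenue sur notre site.</p>
</body>
</html>
```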

Resources:

Background information

Should I declare the language of my XHTML document using a language attribute, the Content-Language HTTP header, or a Content-Language meta element?
W3C I18N FAQ: Using HTTP and meta for language information
Why use the language attribute? A number of useful reasons.
Why use the language attribute?

How to's

Use of language attributes: additional advice.
Using Language Information in XHTML, HTML and CSS, Why and how to declare language

Sources

Web Content Accessibility Guideline: calls for markup to express natural language in a document.
[WCAG 1.0] Guideline 4. Clarify natural language usage
Web Content Accessibility Techniques for HTML: advises use of lang attribute on html tag.
[WCAG-HTML 1.0] 2.2 Identifying the primary language
lang attribute in HTML spec.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute
xml:lang in XML spec.
[XML 1.0] 2.12 Language Identification
xml:lang and lang in XHTML 1.0 spec.
[XHTML 1.0] C.7. The lang and xml:lang Attributes

Technique 2: html declarations for multilingual docs

For documents with multiple primary languages, decide whether you want to declare a single text-processing language in the html tag, or leave it undefined.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

How to: See Technique 1: Always declare language in the html tag.

Discussion: See the definition of primary language. Documents with more than one primary language are rare. A document does not have multiple primary languages if it contains small amounts of text in another language. We are talking here about documents where the basic content is repeated and the document is simultaneously aimed at more than one linguistic audience.

The language attribute on the html tag should be used to declare the default text-processing language for the document. Given that only one language can be defined at a time as the text-processing language, there may appear to be little point in using an attribute on html if all primary languages are used in a completely unbiased way in the document. It may be more appropriate to begin labelling the language on lower level elements.

If, however, the page header information or navigation is in one particular language, or there is a bias of some other kind towards one particular language, you may still want to use a language attribute on the html tag.

NOTE: There is a problem when dealing with multilingual title elements. Only one language can be declared for this element in HTML 4.01, since the only content allowed is character data.

[Ed. note: Should we mention the use of 'mul' as a language code? If we do, should we recommend it or not?]

Resources:

Background information

Should I declare the language of my XHTML document using a language attribute, the Content-Language HTTP header, or a Content-Language meta element?
W3C I18N FAQ: Using HTTP and meta for language information
Why use the language attribute? A number of useful reasons.
Why use the language attribute?

Technique 3: Declare language changes inside the document

Use the lang and/or xml:lang attributes around text to indicate any changes in language.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

How to: Where the language of the text is different from the language declared in the html tag, you should indicate this using the lang or xml:lang attributes. For example, in HTML you would write:

Example 6:

<p>The French for Cat is <em lang="fr">chat</em>.</p>

The lang attribute can be used on all HTML elements except applet, base, basefont, br, frame, frameset, iframe, param and script. (Note, by the way, that this means that you could use language attributes on things like bitmaps and audio files that are language specific. Such information may be particularly useful for script-based processing of documents.)
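As a sketch of that last point, the HTML 4.01 fragment below labels an image and an audio file whose content is in French. The file names and the French text are hypothetical.

```html
<!-- Hypothetical file names. The lang value covers the alt text and the text shown in the image. -->
<p>The sign at the entrance reads:
  <img src="panneau.png" alt="Défense de fumer" lang="fr"></p>

<!-- An audio file with French speech; the fallback text is French too. -->
<object data="annonce.mp3" type="audio/mpeg" lang="fr">Annonce en français</object>
```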

If there is no markup around the text in a different language, use a span element to delimit the boundaries. Here is an example in XHTML 1.0 served as text/html:

Example 7:

<p>The title in Chinese is <span lang="zh" xml:lang="zh">中国科学院文献情报中心</span>.</p>

For details of which language attribute to use, see Technique 4: Should I use the lang or xml:lang attribute?.

For details of how to use language codes, see Section 7: How to choose language values.

Resources:

Background information

Why use the language attribute? A number of useful reasons.
Why use the language attribute?
Language tagging in HTML and XML: a variety of useful information.
Language tagging in HTML and XML

How to's

Use of language attributes: additional advice.
Using Language Information in XHTML, HTML and CSS, Why and how to declare language

Sources

Web Content Accessibility Guideline: calls for markup to express natural language in a document.
[WCAG 1.0] Guideline 4. Clarify natural language usage
Web Content Accessibility Techniques for HTML: advises use of lang attribute when language changes in a document.
[WCAG-HTML 1.0] 2.1 Identifying changes in language
lang attribute in HTML spec.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute
xml:lang in XML spec.
[XML 1.0] 2.12 Language Identification
xml:lang and lang in XHTML 1.0 spec.
[XHTML 1.0] C.7. The lang and xml:lang Attributes

Technique 4: Should I use the lang or xml:lang attribute?

For HTML, use the lang attribute only; for XHTML 1.0 served as text/html, use both the lang and xml:lang attributes; and for XHTML served as XML, use the xml:lang attribute only.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

How to: When serving HTML you should use the lang attribute to declare the language of the document or a range of text. For example, the following declares a document to be in Canadian French:

Example 8:

<html lang="fr-CA">

When serving XHTML as text/html, you should use both the lang attribute and the xml:lang attribute. The xml:lang attribute is the standard way to identify language information in XML. Example 9 shows how you would mark up the previous example for XHTML 1.0 served as text/html.

Example 9:

<html lang="fr-CA" xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you treat the document as XML for, say, scripting or validation.

If you are serving XHTML 1.0 pages as XML (ie. using a MIME type such as application/xhtml+xml), or serving pages as XHTML 1.1, you do not need the lang attribute, since lang is part of the HTML language. The xml:lang attribute alone will suffice (see Example 10).

Example 10:

<html xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

Resources:

Sources

lang attribute in HTML spec.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute
xml:lang in XML spec.
[XML 1.0] 2.12 Language Identification
xml:lang and lang in XHTML 1.0 spec.
[XHTML 1.0] C.7. The lang and xml:lang Attributes

Technique 5: Don't use Content-Language for text-processing

Do not use Content-Language to declare the default text-processing language; use language attributes instead.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

How to: Use HTTP headers and Content-Language meta elements to refer to primary language, but language attributes on the html tag to indicate the default text-processing language.

Example 11:

<html lang="ja" xml:lang="ja" xmlns="http://www.w3.org/1999/xhtml">

Discussion: There is generally a lot of confusion about the difference between declaring language information using Content-Language in HTTP or Content-Language meta elements, and using language attributes. In particular, much of the informal advice on the Web about how to declare the language of a document tells you to use the Content-Language meta tag to declare the language of the document. At least one popular authoring tool automatically inserts into the Content-Language meta element language information that you declare in the page properties dialog box. Unfortunately, we have yet to identify any user agent or application that recognizes information declared in this way for text-processing, whereas language information declared in the html tag is consistently recognized.

Techniques in this document recommend that Content-Language be used for describing primary language metadata, and that attributes be used for describing the default text-processing language of the document. (In fact, this question only arises when describing the language of a document at the highest level, since the language of fragments in a document can only be expressed using attributes.)

It is easy to see the rationale here when dealing with documents with multiple primary languages. The language attribute can only declare a single language at a time. Content-Language declarations, however, can declare a list of languages. Also, Content-Language declarations that declare a list of languages are not specific enough to indicate the default text-processing language.

Furthermore, it is strangely inconsistent not to use attributes to declare the default text-processing language when they have to be used for all fragments of text in a document.

The HTML specification recommends that, in the absence of a language attribute, the HTTP information be used to establish the default text-processing language. (Note that there is no mention of the Content-Language meta element in the HTML specification.)

In practice, the information contained in HTTP Content-Language headers is rarely used by mainstream browsers for language-dependent processing, and what implementation exists is inconsistent. The behaviour of mainstream browsers also varies when multiple languages are declared in the HTTP header. The information in the Content-Language meta tag is typically not recognized at all by current user agents in a processing context.

There are still some unknowns related to the use of language information due to the currently low level of exploitation of this information. This may change in the future, particularly as the larger search engines take an increasing interest in language. For example, we may in the future see systematic use of in-document declarations of primary language using the Content-Language meta element. It may also be acceptable to infer primary language from the language attribute on the html element for documents with a single primary language. Discussion amongst various stakeholders needs to take place, however, before this can be known.

In the meantime, we recommend that you use HTTP headers and Content-Language meta elements to refer to primary language, and language attributes on the html tag to indicate the default text-processing language.
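Put together, the two kinds of declaration might appear in one page as in the sketch below. It assumes XHTML 1.0 served as text/html, a UTF-8 encoded file, and a page whose audience and text are both Japanese; the content is invented for illustration.

```html
<!-- Default text-processing language: attributes on the html tag. -->
<html lang="ja" xml:lang="ja" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <!-- Primary language metadata; an HTTP Content-Language header could say the same. -->
  <meta http-equiv="Content-Language" content="ja" />
  <title>ようこそ</title>
</head>
<body>
  <!-- Inherits lang="ja" from html; the meta element is not relied on for this. -->
  <p>ようこそ</p>
</body>
</html>
```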

Resources:

Background information

Should I declare the language of my XHTML document using a language attribute, the Content-Language HTTP header, or a Content-Language meta element?
W3C I18N FAQ: Using HTTP and meta for language information

Sources

Content-Language in the HTML specification: only says that the html language attribute has a higher precedence.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute
Content-Language in the HTTP1.1 specification.
[RFC2616] 14.12 Content-Language
lang attribute in HTML spec.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute
xml:lang in XML spec.
[XML 1.0] 2.12 Language Identification
xml:lang and lang in XHTML 1.0 spec.
[XHTML 1.0] C.7. The lang and xml:lang Attributes

Technique 6: Don't use the body tag rather than the html tag

Do not declare the language of a document in the body tag.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

How to: See Technique 1: Always declare language in the html tag.

Discussion: The html element is the highest level element in the document, and is therefore most appropriate for declaring the default text-processing language of the document. All elements within the document will inherit that value.

The body tag is usually the wrong place to express this information because it only refers to a portion of the text in the document. For example, the text in the title element is natural language text that should also inherit the language information. If language is declared in the body element, however, this is not the case.

The only time it would make sense is when the content of the head and body elements are in different languages.
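A sketch of that exceptional case follows, assuming XHTML 1.0 served as text/html and a page whose title is in English while its body text is in French. The content is invented for illustration.

```html
<!-- No default on html, because head and body differ in language. -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head lang="en" xml:lang="en">
  <!-- The title inherits "en" from the head element. -->
  <title>Montreal city guide</title>
</head>
<body lang="fr" xml:lang="fr">
  <!-- Everything in the body inherits "fr". -->
  <p>Bienvenue à Montréal.</p>
</body>
</html>
```

In every other case, a single declaration on the html tag covers both the title and the body.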

Technique 7: When attribute and content are in different languages

If the text in attribute values and element content is in different languages, consider using a nested approach.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

Problem: You may come across a situation where the text in an attribute value and the content of the element are in different languages. For example, the top right corner of pages on the W3C Internationalization site shows links to translated alternatives (see Figure 1). The name of the language is given in the language of the target page, but a title attribute contains the name in the language of the current page:

[Figure: screen snap showing a tooltip containing the word 'Swedish' popping up from the document text 'svenska'.]

Figure 1: An example of a scenario where the content and attribute value of an element could be in different languages.

If you create the code as shown in Example 12 below, the language attributes would actually be saying that not only the content but also the title attribute text is in Swedish. This is obviously incorrect.

Example 12: Do not copy!

<a xml:lang="sv" lang="sv" title="Swedish" href="index.sv.html">svenska</a>

How to: A better approach would involve moving the title attribute up a level in the hierarchy, since in this example the p tag inherits the default en setting of the html tag.

Example 13:

<p title="Swedish"><a xml:lang="sv" lang="sv" href="index.sv.html">svenska</a></p>

The markup in Example 13 lends itself easily to this approach. In other cases you may need to add a span element.
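One such case is a paragraph holding several links, where the title cannot simply move up to the p tag. The sketch below assumes that an ancestor such as the html tag declares English as the default; the German link and its file name are hypothetical additions.

```html
<!-- Assumes an ancestor (for example the html tag) declares lang="en". -->
<p>
  <!-- Each span carries an English title and inherits "en". -->
  <span title="Swedish"><a xml:lang="sv" lang="sv" href="index.sv.html">svenska</a></span>
  <!-- Each link declares the language of its own content only. -->
  <span title="German"><a xml:lang="de" lang="de" href="index.de.html">Deutsch</a></span>
</p>
```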

For details of which language attribute to use, see Technique 4: Should I use the lang or xml:lang attribute?.

For details of how to use language codes, see Section 7: How to choose language values.

Resources:

Sources

lang attribute in HTML spec.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute
xml:lang in XML spec.
[XML 1.0] 2.12 Language Identification
xml:lang and lang in XHTML 1.0 spec.
[XHTML 1.0] C.7. The lang and xml:lang Attributes

6 How to specify primary language metadata

Technique 8: Use HTTP or the Content-Language meta tag for metadata

Consider using a Content-Language declaration in the HTTP header or a Content-Language meta tag to declare metadata about the primary language of a document.

UA applicability issues: IE(Win) No issues; Firefox; Mozilla; Opera; NNav; Safari; IE(Mac)

How to: Content-Language information sent in the HTTP header is defined on the server.

Example 14 shows how the language would be declared in a Content-Language meta tag inside a document:

Example 14:

<meta http-equiv="Content-Language" content="en"/>

Discussion: The Content-Language declaration, whether it is used in the HTTP header or a Content-Language meta tag, can be useful for expressing metadata about the primary language(s) of the document being served.

Note that this is different from expressing the default language of content for text-processing, which must be done using a language attribute on the html tag.

The extent to which applications use metadata information in the HTTP header or a Content-Language meta tag, or which of the two is preferred, is not clear at this point.

Using the HTTP Content-Language header entails potential issues related to the maintenance and use of server-side information. Many authors may find it difficult to access server settings, particularly when dealing with an ISP. Also, pages may not always be located on servers. So this approach is not a solution that is always available.

For a comparison of 'primary language' and 'text-processing language', see Section 3: Important concepts.

Resources:

Background information

Should I declare the language of my XHTML document using a language attribute, the Content-Language HTTP header, or a Content-Language meta element?
W3C I18N FAQ: Using HTTP and meta for language information

Sources

Content-Language in the HTTP1.1 specification.
[RFC2616] 14.12 Content-Language
Content-Language in the HTML specification: only says that the html language attribute has a higher precedence.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute

6.1 Documents with multiple primary languages

It is not common to find pages on the Web with more than one primary language. One reason is that it is easy to link to alternative pages instead. Furthermore, there may be differing views on what kind of document structure reflects a document with multiple primary languages.

Technique 9: Provide a comma-separated list of languages

For documents with multiple primary languages, use Content-Language with a comma-separated list of all primary language tags.

UA applicability issues: IE(Win) No issues

Firefox

Mozilla

Opera

NNav

Safari

IE(Mac)

How to: Content-Language information sent in the HTTP header is defined on the server. The HTTP specification provides for more than one language to be expressed as the value of the Content-Language header.

Example 15 shows part of the HTTP header sent from the server and declares a document to have three primary languages: German, French and Italian:

Example 15:

Content-Language: de,fr,it

The in-document Content-Language meta element provides a similar possibility (see Example 16):

Example 16:

<meta http-equiv="Content-Language" content="de,fr,it"/>

Resources:

Sources

Content-Language in the HTTP1.1 specification.
[RFC2616] 14.12 Content-Language

Technique 10: Division of multilingual docs

For documents with multiple primary languages, try to divide the document at the highest possible level, and declare the appropriate text-processing language in those blocks.

UA applicability issues: IE(Win) No issues

Firefox

Mozilla

Opera

NNav

Safari

IE(Mac)

Dividing parallel text at the highest possible level can simplify the process of guiding users to the text via searching, links, etc. It also reduces the work of labeling the language of document fragments.

For details of how to use language attributes, see Section 7: How to choose language values.
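As an illustration, a page carrying German, French and Italian versions of the same text could be divided into three top-level blocks, each with its own language attributes. This is only a sketch: the use of div as the dividing element and the headings are assumptions, and the body text is elided.

```html
<body>
  <!-- One top-level block per language; the attributes apply to everything inside -->
  <div lang="de" xml:lang="de">
    <h1>Willkommen</h1>
    <p>...</p>
  </div>
  <div lang="fr" xml:lang="fr">
    <h1>Bienvenue</h1>
    <p>...</p>
  </div>
  <div lang="it" xml:lang="it">
    <h1>Benvenuti</h1>
    <p>...</p>
  </div>
</body>
```

Declaring the language once per block avoids repeating the attributes on every paragraph or heading inside it.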

Resources:

Background information

Why use the language attribute? A number of useful reasons.
Why use the language attribute?

7 How to choose language values

Technique 11: Use RFC3066

Follow the guidelines in RFC3066 or its successors for language attribute values.

UA applicability issues: IE(Win) No issues

Firefox

Mozilla

Opera

NNav

Safari

IE(Mac)

How to: RFC 3066 is the IETF document that defines how to use language tags to identify languages. It obsoletes RFC 1766, which earlier specifications refer to.

For an introduction to the RFC3066 rules for language codes, see Language tags in HTML and XML.

Discussion: RFC 3066 merely expands and clarifies the possibilities for specifying languages. If you have been using RFC 1766 you typically do not need to make any changes to your code in order to start using RFC 3066.

NOTE: The HTML specification still recommends the use of RFC 1766 for identifying language. There is a planned erratum in place for the HTML specification, so you should use RFC 3066 despite what the HTML specification currently says.

A proposed successor to RFC 3066 is currently being developed, but it aims to retain backwards compatibility with tags created using RFC 3066.

Note also that lang and xml:lang attributes only take a single language value (unlike HTTP Content-Language headers).
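A short sketch of the single-value rule, using placeholder text. The tag fr-CA (French as used in Canada) is included only to show a language code combined with a country code.

```html
<!-- Each language attribute carries exactly one RFC 3066 tag -->
<p lang="en" xml:lang="en">Hello</p>
<p lang="fr-CA" xml:lang="fr-CA">Bonjour</p>

<!-- Avoid: a comma-separated list, as allowed in Content-Language, cannot be used here -->
<!-- <p lang="en,fr">...</p> -->
```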

Resources:

How to's

RFC 3066: the IETF document that defines how to use language tags to identify languages.
[RFC3066] RFC 3066 Tags for the Identification of Languages
ISO 639 language codes
ISO 639: Codes for the Representation of Names of Languages
ISO 3166 country codes
ISO 3166: Codes for Country Names
IANA's language tag registry.
IANA Assigned Language Tags
RFC 3066: a brief summary.
Using Language Information in XHTML, HTML and CSS, Specifying language attribute values
Use of language codes: how to choose the right attribute values.
Language tags in HTML and XML

Sources

lang in HTML spec.
[HTML 4.01] 8.1 Specifying the language of content: the lang attribute
xml:lang in XML spec.
[XML 1.0] 2.12 Language Identification

Technique 12: Use short language codes

Use the two-letter ISO 639 codes for the language code where there are both 2- and 3-letter codes.

UA applicability issues: IE(Win) No issues

Firefox

Mozilla

Opera

NNav

Safari

IE(Mac)

How to: Pick the two-letter codes from ISO 639: Codes for the Representation of Names of Languages where both two- and three-letter codes exist.

Discussion: RFC3066 specifies that the two-letter codes should be used where available, since this aids interoperability by ensuring that a single code is used everywhere to refer to a particular language.

This also avoids the question of which 3-letter code to use for those languages that have two 3-letter codes, since all such languages have a 2-letter code also.
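For instance, French has the two-letter code fr as well as two three-letter codes, fre and fra. A sketch of the recommended choice, with placeholder text:

```html
<!-- French has a two-letter code, so use it -->
<p lang="fr" xml:lang="fr">Bonjour tout le monde.</p>

<!-- Avoid: three-letter codes where a two-letter code exists -->
<!-- <p lang="fra" xml:lang="fra">Bonjour tout le monde.</p> -->
```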

Resources:

How to's

RFC 3066: the IETF document that defines how to use language tags to identify languages.
[RFC3066] RFC 3066 Tags for the Identification of Languages
RFC 3066: a brief summary.
Using Language Information in XHTML, HTML and CSS, Specifying language attribute values

Sources

Should I use two-letter or three-letter language codes?
W3C I18N FAQ: Two-letter or three-letter language codes

Technique 13: Use Hans and Hant codes

Where possible, use the codes zh-Hans and zh-Hant to refer to Simplified and Traditional Chinese, respectively.

UA applicability issues: IE(Win) Issues still to date

Firefox

Issues with base version, but not latest

Mozilla

Opera

NNav

Safari ? IE(Mac) Issues still to date

How to: The IANA registry now makes available the codes zh-Hans and zh-Hant for Simplified and Traditional Chinese, respectively. The following two examples illustrate the use of these tags.

Example 17:

Simplified Chinese:

当世界需要沟通时，请用统一码！

Example 18:

Traditional Chinese:

當世界需要溝通時，請用統一碼!
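Examples 17 and 18 show only the rendered text. The following is a sketch of markup that could produce them, assuming each sentence sits in its own p element and carries both lang and xml:lang:

```html
<!-- Simplified Chinese -->
<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时，请用统一码！</p>

<!-- Traditional Chinese -->
<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時，請用統一碼!</p>
```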

Discussion: RFC3066 specifies how to identify a language. Simplified vs. Traditional Chinese is a distinction based on script. In the past zh-CN (Chinese spoken in Mainland China) was commonly used to label Simplified Chinese, and zh-TW (Chinese spoken in Taiwan) was commonly used for Traditional Chinese. Apart from the fact that such labels describe a region rather than a script, you could not guarantee that others would recognize these conventions, or even follow them. For example, some people used zh-HK to represent Traditional Chinese.

It is expected that these tags will persist for the foreseeable future, so on the one hand it would be good to start using them as soon as possible in order to improve interoperability.

On the other hand, you need to assess the impact of changing the tags. This is not really an issue for self-describing usage, such as using :lang to apply language-based styling. It may be more of an issue where external applications are looking for tags related to Chinese but are unaware of the zh-Hans and zh-Hant variants.
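A sketch of the self-describing case: the style rules and the markup both live in the same document, so both can be switched to the new tags together. The font names are placeholders chosen for illustration, and user agent support for the :lang selector varies.

```html
<style type="text/css">
  /* Font names are placeholders; substitute fonts you know are available */
  :lang(zh-Hans) { font-family: SimSun, serif; }
  :lang(zh-Hant) { font-family: PMingLiU, serif; }
</style>

<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时，请用统一码！</p>
<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時，請用統一碼!</p>
```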

NOTE: There is one particular area where this may be an issue for the display of text on a user agent. Some (but not all) user agents use language information to automatically choose a font for CJK ideographic text. Note that this assumes that (a) you have appropriate fonts set in your preferences, that (b) the document styling does not apply a font, and that (c) the user agent supports this behavior (not all do). The following table summarizes support for this feature in various user agents (see the test results page for more details):

IE 6.0 (Win)	Does not recognize either of these codes.
Firefox 1.0	Handles both codes correctly.
Mozilla 1.7.2	Recognizes the tags but treats them both as Simplified Chinese.
Netscape 7.0	Recognizes the tags but treats them both as Simplified Chinese.
Opera 7.54	Doesn't automatically apply fonts in this fashion, so is irrelevant.
IE 5.2 (Mac)	Recognizes the tags but treats them both as Traditional Chinese.
Safari	Doesn't automatically apply fonts in this fashion, so is irrelevant.

Resources:

How to's

IANA's language tag registry.
IANA Assigned Language Tags
RFC 3066: the IETF document that defines how to use language tags to identify languages.
[RFC3066] RFC 3066 Tags for the Identification of Languages
RFC 3066: a brief summary.
Using Language Information in XHTML, HTML and CSS, Specifying language attribute values

Test data

Automatic font assignment for CJK text
Test results

8 How to indicate the language of a link destination

Technique 14: Pros and cons of identifying the language

When pointing to a resource in another language, consider the pros and cons of indicating the language of the target document.

UA applicability issues: IE(Win) No issues

Firefox

Mozilla

Opera

NNav

Safari ? IE(Mac) No issues

Pros: May help the reader avoid wasted time linking to pages they can't read.

Cons: May become out-of-date and so give incorrect information.

Discussion: If you add some text or graphic to a link indicating that the target document is in another language, it may allow the reader to decide in advance whether or not to follow the link, according to their language skill. If the user has to waste time following the link to find out that they cannot read the target document, this introduces fatigue, and they may lack confidence when faced with links that do go to readable pages.

There are, however, potential problems with this approach.

For example, a newly translated version may become available. Suppose that some time ago a French page used this approach to point to a document that was then available only in English. Later, the document is translated into French and language negotiation is put in place. Unless the French page is updated, it will now incorrectly warn French readers that the document is in English, possibly discouraging them from following a link to what is actually a perfectly readable document.

Technique 15: Using hreflang with CSS

If you want to indicate that the target document of an a element is in another language, consider the pros and cons of using hreflang with CSS.

UA applicability issues: IE(Win) Issues still to date

Firefox

Mozilla

Opera

NNav

Safari ? IE(Mac) Issues still to date

Pros: May help the reader avoid wasted time linking to pages they can't read; saves the author time and effort if hreflang is used consistently.

Cons: May become out-of-date and so give incorrect information; not all user agents support the necessary CSS; problematic when linking to language negotiated sites.

How to: This approach relies on CSS selectors that detect the value of the hreflang attribute and use the CSS content property to display an indicator of the language.

For example, the following link points to a page in Swedish.

Example 19:

There is also a translated page describing why a DOCTYPE is useful [sv].

The code to enable this in CSS may be something like:

Example 20:

a[hreflang]:after { content: " [" attr(hreflang) "] "; }

This says, "For each a element with an hreflang attribute, add the value of that attribute in square brackets after the link". You could just as easily append text or even a graphic after the link.

The markup would read as follows:

Example 21:

There is also a translated page describing <a href="swedish-doc.html" hreflang="sv">why a DOCTYPE is useful</a>.
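As a variation on Example 20, the rule can target one specific language and append descriptive text instead of the raw attribute value. The wording of the label below is an assumption chosen for illustration; the attribute selector, the :after pseudo-element and the content property are the same CSS mechanisms used in Example 20, so the same user agent limitations apply.

```html
<style type="text/css">
  /* Append a readable label to links whose target is declared as Swedish */
  a[hreflang="sv"]:after { content: " (in Swedish)"; }
</style>

<p>There is also a translated page describing
  <a href="swedish-doc.html" hreflang="sv">why a DOCTYPE is useful</a>.</p>
```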

Discussion: In HTML, the hreflang attribute on an a element indicates the language of the document at the other end of the link. In practice, hreflang is typically not used by mainstream browsers. Besides, it is much better to ensure that the target document uses the language attribute on the html tag, so that this information is not needed.

A common alternative use for this attribute is to generate a visible marker attached to link text that indicates the language of the destination page for the reader. The idea is to allow the reader to decide in advance whether or not to follow the link, according to their language skill.

There are some usability-related pros and cons to this approach that are discussed in Technique 14: Pros and cons of identifying the language.

There are also potential technical problems with this approach.

Not all user agents support the CSS required to enable it (see the test results page). Internet Explorer does not support :after.
Note that if a resource is available in multiple languages (say you are linking from an English overview to detailed descriptions that are available in multiple languages) it is not possible to express that, since the hreflang attribute accepts only a single language as its value.

Resources:

How to's

:before and :after in the CSS 2.1 spec.
[CSS2.1] 12.1 The :before and :after pseudo-elements

Sources

hreflang in the HTML spec.
[HTML 4.01] 12.2 The A element

Test data

Hreflang content generation
Test results

Technique 16: Don't use flags to indicate languages

Do not use flag icons to indicate languages.

UA applicability issues: IE(Win) No issues

Firefox

Mozilla

Opera

NNav

Safari

IE(Mac)

How to: A much better approach is to use text. In Technique 15: Using hreflang with CSS, Example 19 uses the actual attribute value, since these two-letter codes are typically recognizable by speakers of the language.
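A sketch of a text-based list of language versions, assuming hypothetical file names. Here each link shows the name of the language written in that language, which is one possible choice of text; the two-letter codes could be shown instead.

```html
<ul>
  <li><a href="index.de.html" hreflang="de" lang="de" xml:lang="de">Deutsch</a></li>
  <li><a href="index.fr.html" hreflang="fr" lang="fr" xml:lang="fr">Français</a></li>
  <li><a href="index.sv.html" hreflang="sv" lang="sv" xml:lang="sv">Svenska</a></li>
</ul>
```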

Discussion: Flags represent countries, not languages. There are many countries that use the same language, and numerous countries that have more than one official language.