Character encodings by format

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, Web project managers, and anyone who needs an introduction to how to declare the character encoding of their (X)HTML or CSS file.

This article brings together information from various specifications related to character encoding, and from them summarises the rules about how encoding declarations should be used for each format.

The summaries are linked to the specification text using ids in square brackets, e.g. [html5:3].

Encoding declaration rules

Here we summarize the rules for each format.

HTML5

The HTTP Content-Type header can be used, and has the highest precedence [html5:22].

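The byte-order mark is optional [html5:21]. If present, it has the next highest precedence after the HTTP Content-Type declaration [html5:24]. A single leading BOM is stripped from the input stream before parsing, whether or not it was used to determine the encoding [html5:30].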
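
The BOM lookup quoted under [html5:24] amounts to a prefix test on the first bytes of the file. The following Python sketch illustrates that reading; the names `BOM_TABLE` and `sniff_bom` are hypothetical and not taken from any specification.

```python
import codecs

# Rows in the same order as the table quoted under [html5:24].
BOM_TABLE = [
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
]

def sniff_bom(data: bytes):
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding
    return None
```

Because the table has no UTF-32 row, a UTF-32LE file (which begins FF FE 00 00) would be reported as little-endian UTF-16 by this sketch, which matches the remark in [html5:18] that the detection algorithms do not distinguish the two.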

XML declarations are not allowed [html5:21].

The meta element can be used, and the charset attribute is preferred [html5:0]. If there is no HTTP declaration or BOM, a meta element must be used [html5:14]. Any meta declaration must use an ASCII-compatible encoding [html5:14] [html5:16]. The implication of this is that UTF-16 encoded pages must not use a meta declaration. Any meta declaration must fit in the first 1024 bytes of the page [html5:12] [html5:23].
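
The 1024-byte rule can be checked mechanically. The following Python sketch is a rough illustration only: it uses a simple regular expression rather than the prescan algorithm of the HTML5 specification, and the names `META_CHARSET` and `meta_fits_in_first_1024_bytes` are hypothetical.

```python
import re

# Simplified pattern for a <meta> tag carrying a charset declaration, either
# as charset="..." or inside a pragma's content="...; charset=...".
# This is an approximation, not the HTML5 prescan algorithm.
META_CHARSET = re.compile(rb"<meta\b[^>]*charset\s*=[^>]*>", re.IGNORECASE)

def meta_fits_in_first_1024_bytes(page: bytes) -> bool:
    """Return True if the first meta charset declaration ends by byte 1024."""
    match = META_CHARSET.search(page)
    return match is not None and match.end() <= 1024
```

A page whose declaration is pushed down by a long comment or a large block of other head content would fail this check, which is the situation the rule is meant to prevent.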

Only one meta declaration is allowed per page [html5:13] [html5:3] [html5:4] [html5:7].

If a pragma is used, the content attribute must refer to text/html [html5:6].
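
The shape required of the content attribute in [html5:6] can be expressed as a pattern. This Python sketch is one possible reading, with the hypothetical names `PRAGMA_CONTENT` and `pragma_charset`; it assumes the HTML5 definition of space characters (space, tab, LF, FF, CR).

```python
import re

# "text/html;" + optional space characters + "charset=" + encoding name.
# re.ASCII stops re.IGNORECASE from matching non-ASCII letters such as
# U+017F (long s), so the comparison stays ASCII case-insensitive.
PRAGMA_CONTENT = re.compile(
    r"text/html;[ \t\n\f\r]*charset=(.+)",
    re.IGNORECASE | re.ASCII,
)

def pragma_charset(content: str):
    """Return the encoding name from a pragma's content value, or None if malformed."""
    match = PRAGMA_CONTENT.fullmatch(content)
    return match.group(1) if match else None
```

The returned name is not validated here; whether it is a valid encoding name is a separate check.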

If the meta declaration labels the document as UTF-16 but the document is not actually being read as UTF-16, the browser will treat the declaration as UTF-8 [html5:31].

Encoding choices. The character encoding name given must be the name of the character encoding used to serialize the file [html5:9]. The value must be a valid character encoding name, and must be an ASCII case-insensitive match for the preferred MIME name for that encoding [html5:10]. The declaration must not use character references or character escapes of any kind [html5:11].
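
An ASCII case-insensitive match folds only the letters A to Z, which is narrower than general Unicode case folding. The Python sketch below shows the difference; `ascii_lower` and `ascii_case_insensitive_match` are hypothetical helper names.

```python
def ascii_lower(s: str) -> str:
    """Lower-case only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def ascii_case_insensitive_match(a: str, b: str) -> bool:
    """Compare two encoding labels, ignoring case for ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)
```

Using `str.lower()` instead would be too permissive: U+212A KELVIN SIGN lower-cases to "k", so a label beginning with that character would wrongly match "KOI8-R".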

Conformance checkers may advise against non-UTF-8 encodings [html5:17], which can also produce unexpected results in form submission and URL encoding [html5:19].

Do not use encodings that are not ASCII-compatible unless there is a BOM [html5:18] [html5:29], i.e. only UTF-16. CESU-8, UTF-7, BOCU-1 and SCSU must not be used, and UTF-32 should not be used [html5:18] [html5:28].

If no declaration is found, the browser may use auto-detection and, failing that, a browser-dependent default (UTF-8 is suggested, but the default may be locale dependent) [html5:25] [html5:26].

UTF-16 with no BOM defaults to UTF-16LE [html5:27].
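
Taken together, the HTML5 rules above describe an order of precedence. The Python sketch below is a simplified illustration of that order, not the algorithm from the specification: the function name `choose_html5_encoding` and its parameters are hypothetical, the three inputs are assumed to have been extracted already, and auto-detection is left out.

```python
from typing import Optional

def choose_html5_encoding(http_charset: Optional[str] = None,
                          bom_encoding: Optional[str] = None,
                          meta_charset: Optional[str] = None,
                          fallback: str = "utf-8") -> str:
    """Pick an encoding label following the precedence summarised above.

    Labels are assumed to be plain ASCII strings.
    """
    if http_charset:
        # HTTP Content-Type has the highest precedence [html5:22].
        label = http_charset.lower()
    elif bom_encoding:
        # A BOM comes next [html5:24].
        return bom_encoding.lower()
    elif meta_charset:
        label = meta_charset.lower()
        # A meta declaration naming UTF-16 is changed to UTF-8 [html5:31].
        if label.startswith("utf-16"):
            label = "utf-8"
    else:
        # Auto-detection [html5:25] is omitted; use a default [html5:26].
        label = fallback.lower()

    if label == "utf-16":
        # Use the byte order given by a BOM, else little-endian [html5:27].
        if bom_encoding and bom_encoding.lower() in ("utf-16be", "utf-16le"):
            return bom_encoding.lower()
        return "utf-16le"
    return label
```

For example, `choose_html5_encoding("ISO-8859-1", "utf-8", "utf-8")` returns `"iso-8859-1"`, because the HTTP header wins over both in-document signals.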

XHTML5

The HTTP Content-Type header can be used, and has the highest precedence [html5:22].

The XML declaration can be used if necessary [html5:20].

The meta declaration is not actually useful for XML documents [html5:2], but one can be used to aid in migration between HTML and XHTML. It must not be a pragma [html5:8], i.e. it must be a charset attribute, and it must only be used for UTF-8 documents [html5:1].

The meta element must fit in the first 1024 bytes of the page [html5:12].

Encoding choices are presumably the same as for HTML5.
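
When XHTML is processed as XML, the encoding is settled by the XML rules quoted under [xml:1] and [xml:5]: a BOM or an encoding declaration is used if present, and otherwise the entity has to be UTF-8. The Python sketch below illustrates this for the case where no external information (such as an HTTP header) is available. The names `XML_DECL_ENCODING` and `xml_entity_encoding` are hypothetical, and the pattern only handles declarations written in an ASCII-compatible encoding.

```python
import codecs
import re

# Encoding pseudo-attribute of an XML declaration at the very start of the
# entity. Only works when the declaration itself is ASCII-compatible.
XML_DECL_ENCODING = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def xml_entity_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity when no external information exists."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    match = XML_DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither BOM nor encoding declaration: the entity must be UTF-8 [xml:5].
    return "utf-8"
```

The sketch does not check that the bytes are actually legal in the encoding it reports, which an XML processor is required to do [xml:7].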

Polyglot pages

Polyglot pages can only use the UTF-8 encoding [poly:1].

The HTTP Content-Type header can be used to set the character encoding. The MIME type should reflect whether the page is being served as text/html or application/xhtml+xml [poly:3].

The UTF-8 signature is a preferred way to signal the encoding of the page [poly:3].

XML declarations must not be used [poly:0].

Meta elements should only use the charset attribute [poly:3]. Its use is optional, but is recommended by the i18n WG [poly:5]. It has no effect when the page is served as XML [poly:4].
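
The polyglot rules above lend themselves to a simple lint. The following Python sketch checks three of them on the raw bytes of a page; it is an illustration under simplifying assumptions (regular-expression matching rather than real parsing), and the names `PRAGMA` and `polyglot_encoding_problems` are hypothetical.

```python
import codecs
import re
from typing import List

# Rough pattern for a meta pragma of the form http-equiv="Content-Type".
PRAGMA = re.compile(rb"<meta\b[^>]*http-equiv\s*=\s*[\"']?content-type", re.IGNORECASE)

def polyglot_encoding_problems(page: bytes) -> List[str]:
    """Return a list of encoding-declaration problems found in a polyglot page."""
    try:
        page.decode("utf-8")
    except UnicodeDecodeError:
        return ["page is not valid UTF-8 [poly:1]"]

    problems = []
    # Skip the optional UTF-8 signature before looking at the markup.
    body = page[len(codecs.BOM_UTF8):] if page.startswith(codecs.BOM_UTF8) else page
    if body.lstrip().startswith(b"<?xml"):
        problems.append("XML declaration must not be used [poly:0]")
    if PRAGMA.search(body):
        problems.append("meta should use the charset attribute, not a pragma [poly:3]")
    return problems
```

An empty list means none of these three problems was found; it does not establish that the page is valid polyglot markup in other respects.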

HTML 4.01

The HTTP Content-Type header is recommended as the most effective way to declare the encoding [html4:6], and has the highest precedence [html4:11].

Byte-order marks should be used for UTF-16 encoded pages [html4:4]. The precedence of the BOM is not mentioned.

Meta elements may be used, with http-equiv and content attributes [html4:8], but only for encodings that are ASCII-compatible [html4:9]. They should appear as soon as possible in the head element [html4:10]. Meta elements have a precedence just below the HTTP header [html4:11].

Encoding choices are not constrained [html4:0]. Encoding names are not case sensitive [html4:2]. UTF-16 encoded pages should be big-endian [html4:3]. UTF-1 should not be used [html4:5]. User agents must not assume any default character encoding [html4:7], but they usually have a user-definable, local default encoding that they apply in the absence of other information [html4:11].
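
Since the charset parameter of the HTTP Content-Type header takes precedence [html4:6] [html4:11], it is often the first thing a script needs to read. The Python sketch below extracts it using the standard library's `email.message.Message`; the wrapper name `charset_from_content_type` is hypothetical.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return message.get_content_charset()
```

When the function returns None, no default should be assumed on the basis of the header alone [html4:7]; the lower-priority sources have to be consulted instead.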

Specification extracts

In this section we list relevant quotations from the specifications, and give each point a unique reference id.

HTML5

4.2.5.5 Specifying the document's character encoding

[html5:0]
"The meta element can represent [...] the file's character encoding declaration when an HTML document is serialized to string form (e.g. for transmission over the network or for disk storage) with the charset attribute."

"The charset attribute specifies the character encoding used by the document. This is a character encoding declaration."

[html5:1]
"If the attribute is present in an XML document, its value must be an ASCII case-insensitive match for the string "UTF-8" (and the document is therefore forced to use UTF-8 as its encoding)."

[html5:2]
"The charset attribute on the meta element has no effect in XML documents, and is only allowed in order to facilitate migration to and from XHTML."

[html5:3]
"There must not be more than one meta element with a charset attribute per document."

[html5:4]
"Exactly one of the name, http-equiv, and charset attributes must be specified."

2.1.6 Character encodings

[html5:5]
"The preferred MIME name of a character encoding is the name or alias labeled as "preferred MIME name" in the IANA Character Sets registry, if there is one, or the encoding's name, if none of the aliases are so labeled. [IANACHARSET]"

4.2.5.3 Pragma directives, Encoding declaration state

[html5:6]
"For meta elements with an http-equiv attribute in the Encoding declaration state, the content attribute must have a value that is an ASCII case-insensitive match for a string that consists of: the literal string "text/html;", optionally followed by any number of space characters, followed by the literal string "charset=", followed by the character encoding name of the character encoding declaration."

[html5:7]
"A document must not contain both a meta element with an http-equiv attribute in the Encoding declaration state and a meta element with the charset attribute present."

[html5:8]
"The Encoding declaration state may be used in HTML documents, but elements with an http-equiv attribute in that state must not be used in XML documents."

4.2.5.5 Specifying the document's character encoding

[html5:9]
"The character encoding name given must be the name of the character encoding used to serialize the file."

[html5:10]
"The value must be a valid character encoding name, and must be an ASCII case-insensitive match for the preferred MIME name for that encoding."

[html5:11]
"The character encoding declaration must be serialized without the use of character references or character escapes of any kind."

[html5:12]
"The element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document."

[html5:13]
"[...] due to a number of restrictions on meta elements, there can only be one meta-based character encoding declaration per document."

[html5:14]
"If an HTML document does not start with a BOM, and if its encoding is not explicitly given by Content-Type metadata, and the document is not an iframe srcdoc document, then the character encoding used must be an ASCII-compatible character encoding, and, in addition, if that encoding isn't US-ASCII itself, then the encoding must be specified using a meta element with a charset attribute or a meta element with an http-equiv attribute in the Encoding declaration state."

[html5:15]
"If the document is an iframe srcdoc document, the document must not have a character encoding declaration. (In this case, the source is already decoded, since it is part of the document that contained the iframe.)"

[html5:16]
"If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the Encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding."

[html5:17]
"Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings."

[html5:18]
"Encodings in which a series of bytes in the range 0x20 to 0x7E can encode characters other than the corresponding characters in the range U+0020 to U+007E represent a potential security vulnerability: a user agent that does not support the encoding (or does not support the label used to declare the encoding, or does not use the same mechanism to detect the encoding of unlabelled content as another user agent) might end up interpreting technically benign plain text content as HTML tags and JavaScript. For example, this applies to encodings in which the bytes corresponding to "<script>" in ASCII can encode a different string. Authors should not use such encodings, which are known to include JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, and encodings based on EBCDIC. Furthermore, authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into this category, because these encodings were never intended for use for Web content.

"Authors should not use UTF-32, as the encoding detection algorithms described in this specification intentionally do not distinguish it from UTF-16."

[html5:19]
"Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, which use the document's character encoding by default."

[html5:20]
"In XHTML, the XML declaration should be used for inline character encoding information, if necessary."

8.1 Writing HTML documents

[html5:21]
"Documents must consist of the following parts, in the given order:

1. Optionally, a single U+FEFF BYTE ORDER MARK (BOM) character.
2. Any number of comments and space characters.
3. A DOCTYPE.
4. Any number of comments and space characters.
5. The root element, in the form of an html element."

8.2.2.1 Determining the character encoding

[html5:22]
"If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps."

[html5:23]
"The authoring conformance requirements for character encoding declarations limit them to only appearing in the first 1024 bytes. User agents are therefore encouraged to use the preparse algorithm below (part of these steps) on the first 1024 bytes, but not to stall beyond that."

[html5:24]
"For each of the rows in the following table, starting with the first one and going down, if there are as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then return the encoding given in the cell in the second column of that row, with the confidence certain, and abort these steps:

Bytes in Hexadecimal    Encoding
FE FF                   Big-endian UTF-16
FF FE                   Little-endian UTF-16
EF BB BF                UTF-8"

[html5:25]
"The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. Such algorithms may use information about the resource other than the resource's contents, including the address of the resource. If autodetection succeeds in determining a character encoding, then return that encoding, with the confidence tentative, and abort these steps. [UNIVCHARDET] The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User-agents are therefore encouraged to search for this common encoding."

[html5:26]
"Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. In controlled environments or in environments where the encoding of documents can be prescribed (for example, for user agents intended for dedicated use in new networks), the comprehensive UTF-8 encoding is suggested. In other environments, the default encoding is typically dependent on the user's locale (an approximation of the languages, and thus often encodings, of the pages that the user is likely to frequent). The following table gives suggested defaults based on the user's locale, for compatibility with legacy content. Locales are identified by BCP 47 language tags."

[html5:27]
"When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE. The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content."

[html5:28]
"User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings."

[html5:29]
"Support for encodings based on EBCDIC is not recommended. This encoding is rarely used for publicly-facing Web content. Support for UTF-32 is not recommended. This encoding is rarely used, and frequently implemented incorrectly. This specification does not make any attempt to support EBCDIC-based encodings and UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior in implementations of this specification."

8.2.2.3 Preprocessing the input stream

[html5:30]
"One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present. The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether that character was used to determine the byte order is a willful violation of Unicode, motivated by a desire to increase the resilience of user agents in the face of naïve transcoders."

8.2.2.4 Changing the encoding while parsing

[html5:31]
"If the encoding that is already being used to interpret the input stream is a UTF-16 encoding, then set the confidence to certain and abort these steps. The new encoding is ignored; if it was anything but the same encoding, then it would be clearly incorrect. If the new encoding is a UTF-16 encoding, change it to UTF-8."

12.1 text/html

[html5:32]
"The charset parameter may be provided to definitively specify the document's character encoding, overriding any character encoding declarations in the document. The parameter's value must be the name of the character encoding used to serialize the file, must be a valid character encoding name, and must be an ASCII case-insensitive match for the preferred MIME name for that encoding. [IANACHARSET]"

Polyglot

2. Processing Instructions and the XML Declaration

[poly:0]
"Processing Instructions and the XML Declaration are both forbidden in polyglot markup."

3. Specifying a Document's Character Encoding

[poly:1]
"Polyglot markup uses the UTF-8 character encoding, the only character encoding for which both HTML and XML require support."

[poly:2]
"HTML requires UTF-8 to be explicitly declared to avoid fallback to a legacy encoding [HTML5]. For XML, UTF-8 is an encoding default. As such, character encoding may be left undeclared in XML with the result that UTF-8 is still supported [XML10]."

[poly:3]
"Polyglot markup declares the UTF-8 character encoding in the following ways, which may be used separately or in combination:

Within the document
- By using the Byte Order Mark (BOM) character (preferred).
- By using <meta charset="UTF-8"/> (the HTML encoding declaration).

Outside the document
- By adding "charset=utf-8" to the MIME/HTTP Content-Type header [HTTP11], as the following examples show in HTML and XML, respectively:
  Content-type: text/html; charset=utf-8
  Content-type: application/xhtml+xml; charset=utf-8"

[poly:4]
"The HTML encoding declaration has no effect in XML. When the HTML encoding declaration is the only encoding declaration, the encoding default from XML makes XML parsers treat content as UTF-8."

[poly:5]
"The W3C Internationalization (i18n) Group recommends to always include a visible encoding declaration in a document, because it helps developers, testers, or translation production managers to check the encoding of a document visually."

HTML 4.01

5.2.1 Choosing an encoding

[html4:0]
"Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled."

[html4:1]
"Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set."

[html4:2]
"Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent."

[html4:3]
"When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1."

[html4:4]
"to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned."

[html4:5]
"The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1), should not be used."

5.2.2 Specifying the character encoding

[html4:6]
"The most straightforward way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2616], sections 3.4 and 14.17)."

[html4:7]
"The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter."

[html4:8]
"To address server or configuration limitations, HTML documents may include explicit information about the document's character encoding; the META element can be used to provide user agents with this information. For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration: <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">"

[html4:9]
"The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed)."

[html4:10]
"META declarations should appear as early as possible in the HEAD element."

[html4:11]
"To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the absence of other indicators."

XHTML 1.0

3.1.1. Strictly Conforming Documents

[xhtml:0]
"An XML declaration is not required in all XML documents; however XHTML document authors are strongly encouraged to use XML declarations in all their documents. Such a declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding was determined by a higher-level protocol. Here is an example of an XHTML document. In this example, the XML declaration is included."

C.1. Processing Instructions and the XML Declaration

[xhtml:1]
"Be aware that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. For compatibility with these types of legacy browsers, you may want to avoid using processing instructions and XML declarations. Remember, however, that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16."

C.9. Character Encoding

[xhtml:2]
"In order to portably present documents with specific character encodings, the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include both the XML declaration an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />)."

[xhtml:3]
"In XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence."

[xhtml:4]
"Note: be aware that if a document must include the character encoding declaration in a meta http-equiv statement, that document may always be interpreted by HTTP servers and/or user agents as being of the internet media type defined in that statement. If a document is to be served as multiple media types, the HTTP server must be used to set the encoding of the document."

XML 1.0

4.3.3 Character Encoding in Entities

[xml:0]
"Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8."

[xml:1]
"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents."

[xml:2]
"If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16."

[xml:3]
"Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration:"

[xml:4]
"In an encoding declaration, the values " UTF-8 ", " UTF-16 ", " ISO-10646-UCS-2 ", and " ISO-10646-UCS-4 " SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values " ISO-8859-1 ", " ISO-8859-2 ", ... " ISO-8859- n " (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values " ISO-2022-JP ", " Shift_JIS ", and " EUC-JP " SHOULD be used for the various encoded forms of JIS X-0208-1997. It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings)."

[xml:5]
"In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration."

[xml:6]
"It is a fatal error for a TextDecl to occur other than at the beginning of an external entity."

[xml:7]
"It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16."

F.2 Priorities in the Presence of External Encoding Information

[xml:8]
"If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding."


By: Richard Ishida, W3C.

Content first published 2010-09-09 11:49. Last substantive update 2010-09-09 11:49 GMT. This version 2010-09-09 11:49 GMT

For the history of document changes, search for article=encoding-summaries in the i18n blog.