Checking the character encoding using the validator

Answer

To make sure all recipients of a document can display and interpret it properly, it is very important to correctly indicate the character encoding ('charset'). One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and information in the document. If the validator fails to detect the encoding, it can be selected on the validator result page via the 'Encoding' pulldown menu (example).

However, the validator often does not complain even when the wrong encoding is detected or selected. This is because many encodings are very similar, and because the validator only checks the markup syntax; it cannot decide whether the decoded text makes sense. To make sure that you have the correct encoding, so that the document is displayed correctly to readers, the following points will help:

If the encoding selected or detected is US-ASCII, UTF-8, UTF-16, or iso-2022-jp (Japanese JIS), and the validator does not complain about encoding problems, there is an extremely high probability that the selected encoding is correct. Note that US-ASCII is a strict subset of UTF-8, and so if US-ASCII works, UTF-8 will work, too.
For any other encoding, visual checking is necessary. Select the Show Source option from the Extended Interface of the validator, and check that the non-ASCII characters in the text are displayed correctly. For pages in languages other than English, this can usually be established quickly. For pages in English with just a few non-ASCII characters, it can be more difficult.

For example, if you tried to interpret the W3C home page as iso-8859-1, you might have to go almost to the end of the source to find text such as 'Â©' and 'Â®' and see that this is the wrong choice. (Of course, that page tells the validator from the beginning that it is encoded in UTF-8, and so you don't actually have to check anything else.)
In some cases, more than one encoding will adequately represent the characters in a document. For example, there is considerable overlap between iso-8859-1 (Latin-1, Western Europe) and iso-8859-2 (Latin-2, Eastern Europe), and between other encodings in this series. If, after careful checking, you cannot find a difference, then either choice is fine. The close similarity of these encodings, both in their byte patterns and in the characters they actually encode, explains why only visual inspection can confirm that the encoding is correct.
If none of the encodings offered by the validator works, then either your page is in an encoding that the validator does not (yet) support, or text in several different encodings somehow got mixed up in the page. In the former case, write to the validator mailing list (public archive) to have your character encoding added. In the latter case, you have to fix your page, because each Web page can only use a single character encoding.
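
The points above can be illustrated with a short Python sketch. It uses Python's standard codecs only as a stand-in and is not the validator's own code; the sample text and the helper name `decodes_cleanly` are assumptions made up for this illustration. It shows that strict encodings such as UTF-8 and US-ASCII reject bytes that come from another encoding, while the iso-8859 encodings accept the same bytes under either label and simply produce different characters.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded in the given encoding without error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: Czech text stored as iso-8859-2 bytes.
sample = "Příliš žluťoučký kůň".encode("iso-8859-2")

# UTF-8 and US-ASCII are strict, so the wrong label is detected.
utf8_ok = decodes_cleanly(sample, "utf-8")
ascii_ok = decodes_cleanly(sample, "us-ascii")

# The iso-8859 series accepts the same bytes under either label...
latin1_ok = decodes_cleanly(sample, "iso-8859-1")
latin2_ok = decodes_cleanly(sample, "iso-8859-2")

# ...but only one of the two results is the intended text.
as_latin1 = sample.decode("iso-8859-1")
as_latin2 = sample.decode("iso-8859-2")

# A UTF-8 page misread as iso-8859-1: no error is raised, but an
# extra character appears before each of the two symbols.
misread = "© ®".encode("utf-8").decode("iso-8859-1")

print(utf8_ok, ascii_ok, latin1_ok, latin2_ok)
print(as_latin1)
print(as_latin2)
print(misread)
```

No decoding error separates the two iso-8859 results, which is why comparing the decoded text by eye is the only way to tell them apart.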

By the way

The validator does not work without information about character encoding because SGML or XML validation is based on checking the sequences of characters in the document, but what the validator receives as input is just a sequence of bytes. Knowing the character encoding allows the validator to convert from bytes to characters. In general, this is the same for all other kinds of receivers, including browsers. If the right characters are not identified, a Web browser may display garbage.

The validator does this by converting from the encoding indicated to UTF-8, and using UTF-8 internally. If the conversion to UTF-8 fails because a particular byte sequence cannot appear in the input encoding, the validator produces an error message. For UTF-8 input, the validator checks to make sure that only valid UTF-8 byte sequences are used.
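
As a rough sketch of this conversion step (an illustration in Python, not the validator's actual implementation; the function name `to_utf8` and the sample bytes are assumptions), the input bytes are decoded using the declared encoding and re-encoded as UTF-8. A byte sequence that cannot occur in the declared encoding is reported as an error.

```python
def to_utf8(data: bytes, declared_encoding: str) -> bytes:
    """Decode data using the declared encoding and re-encode it as UTF-8.

    Raises ValueError if a byte sequence is not valid in the declared encoding.
    """
    try:
        text = data.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"byte at offset {err.start} is not valid in {declared_encoding}"
        ) from err
    return text.encode("utf-8")


# Hypothetical input: "café" stored as iso-8859-1 bytes.
latin1_bytes = "café".encode("iso-8859-1")

# With the correct label, the conversion succeeds.
converted = to_utf8(latin1_bytes, "iso-8859-1")

# With the label utf-8, the same bytes are not a valid sequence.
try:
    to_utf8(latin1_bytes, "utf-8")
    error_message = None
except ValueError as err:
    error_message = str(err)

print(converted)
print(error_message)
```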

Note that visually inspecting a Web page in a browser, without using the validator, may fail to reveal encoding problems, because:

Some browsers use non-standard ways to detect the character encoding.
Each browser has a setting used for unlabeled pages; if that setting by chance is the correct encoding for the page, you will not see that the page does not come with proper encoding information.
Besides the text in the page, there is text in attributes (e.g. alt text in <img>) that should be checked.
