This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Reported by Christopher R. Maden: When unable to detect an encoding, the new validator should use the prescribed defaults, which I believe still means ISO-8859-1 for text/html over HTTP, and UTF-8 or UTF-16 for XHTML documents uploaded directly. With the simple interface, validating <URL: http://crism.maden.org/ > reports that it is unable to detect the encoding, even when using Appendix F of XML 1.0. Using Appendix F is inappropriate for a document delivered over HTTP, since the HTTP headers take precedence (and thus the document should be interpreted as ISO-8859-1), but even so, the Appendix F algorithm should result in a determination of UTF-8. Either way, since this page is 7-bit ASCII, the validation ought to work.
The HTTP specification does indeed specify ISO-8859-1 as the default in the absence of a "charset" parameter in the Content-Type header. However, HTTP and HTML 4.01 are in direct conflict here, as the latter proscribes any assumption about a default character encoding. And since a file upload is still an HTTP transaction, although we do not normally think of it that way, the same applies to any file upload with a text/html media type. The algorithm in Appendix F of the XML Recommendation describes ways to attempt to detect the character encoding automatically in the absence of information from a higher-level protocol. Since the HTTP transaction contained no encoding information, we attempted the Appendix F algorithm. That algorithm, however, is intended for XML, and as such it requires the presence of either a Unicode byte order mark (BOM) or an XML declaration. In particular, if there is no BOM, we look for the byte patterns that represent the characters "<?xml" in various encodings.
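As a rough illustration of the detection step described above, the following Python sketch checks the first bytes of a document for a BOM and then for the start of an XML declaration. It is not the validator's actual code: the function name sniff_xml_encoding and both lookup tables are hypothetical, the labels use UTF-32 where Appendix F speaks of UCS-4, and the unusual UCS-4 byte orders are left out. When neither a BOM nor an XML declaration is found it returns None, mirroring the behaviour described in this comment.

```python
from typing import Optional

# BOM signatures. The 4-byte UTF-32 marks come before the 2-byte UTF-16 marks,
# because FF FE 00 00 would otherwise be mistaken for a UTF-16LE BOM.
_BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# Byte patterns for the start of "<?xml" when no BOM is present.
# Only the encoding family is determined here; for the ASCII-compatible and
# EBCDIC cases the encoding declaration would still have to be read.
_DECL_PATTERNS = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_xml_encoding(data: bytes) -> Optional[str]:
    """Guess an encoding family from the first bytes of a document.

    Returns None when there is neither a BOM nor an XML declaration.
    """
    for signature, name in _BOMS:
        if data.startswith(signature):
            return name
    for signature, name in _DECL_PATTERNS:
        if data.startswith(signature):
            return name
    return None
```

Under these assumptions, a plain text/html page that begins with a DOCTYPE or an <html> tag matches none of the patterns, so the sketch reports no encoding for it, which is consistent with the failure in the original report.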